Skip to content

Data Preparation and Input Requirements

Author: SeekGene
Time: 5 min
Words: 891 words
Updated: 2026-03-05
Reads: 0 times

Before starting the analysis, please select the appropriate workflow based on your current data type.

Scenario 1: Existing CopyKit Results (RDS)

If you already have .rds files generated by CopyKit's runVarbin (as shown below), you can skip this chapter and proceed directly to the subsequent analysis steps.

tree
./
├── copykit_MDA_cs_1Mb.rds      # 1Mb resolution results
├── copykit_MDA_cs_500kb.rds    # 500kb resolution results
└── copykit_MDA_cs_200kb.rds    # 220kb resolution results

Scenario 2: Existing Single-Cell BAM Files

If you have single-cell BAM files generated by the SeekSoulMethyl pipeline (typically located in the xxx/xxx_methy/step3/split_bams/merged directory) with the following structure:

tree
./merged/
├── AAAGAAGAAGAGTTTAG.bam
├── AAAGAAGAAGGAGTGGT.bam
├── ...
└── TTTGGATATTTGTGAGG.bam

Next Step: You need to execute Phase 2: BAM Preprocessing to perform flag correction and deduplication.

Scenario 3: Bulk-Level BAM Files Only

If you only have unsplit "Bulk" level BAM files (i.e., merged files containing reads from all cells) with the following structure:

tree
./bismark/
├── MDA_cs_forward_AAAG_1_bismark_bt2_pe.bam
├── MDA_cs_forward_AAAT_1_bismark_bt2_pe.bam
├── ...
└── MDA_cs_reverse_TTTG_1_bismark_bt2_pe.bam

Next Step: You need to first execute Phase 1: Single-Cell BAM Splitting, followed by Phase 2: BAM Preprocessing.


3. Data Processing Workflow

Phase 1: Single-Cell BAM Splitting (Scenario 3 Only)

We need to split the Bulk BAM files into individual single-cell BAM files based on cell barcodes.

The splitting process involves three steps:

  1. Sort each BAM by read name (name-sort).
  2. Split single-cell BAM files from each sorted BAM.
  3. Merge the forward and reverse BAM files for each single cell.

Tools Required:

Procedure: Use the batch_single_cell_bam.py script for batch splitting.

shell
# Example Command
python batch_single_cell_bam.py \
  --assay_type DD-MET5 \
  --bam_dir /path/to/bismark/ \
  --outdir /path/to/output/ \
  --gex_barcodes /path/to/filtered_feature_bc_matrix/barcodes.tsv.gz \
  --threads 4 \
  --parallel_jobs 8

# Parameter Explanation:
# --assay_type      : Assay type (DD-MET5 or DD-MET3)
# --bam_dir         : Directory containing input Bulk BAM files
# --outdir          : Output directory
# --gex_barcodes    : Corresponding RNA Barcode file
# --parallel_jobs   : Number of parallel jobs (Recommended 8-16, depending on server config)
# --cbcsv           : (DD-MET3 only) Path to the whitelist file

Upon completion, single-cell BAM files will be generated in the /path/to/output/split_bam directory.

Phase 2: BAM Preprocessing (Flag Correction and Deduplication)

CNV analysis is highly sensitive to read counts, so ensuring input data accuracy is crucial. This phase includes two key steps: Flag Correction and UMI Deduplication.

Tools Required:

Step 1: Flag Correction

Background: When processing OT/OB (original strand) and CTOT/CTOB (complementary strand) reads, the Bismark aligner automatically swaps the flag values for CTOT/CTOB reads. specifically, R1 is flagged as "second in pair" and R2 as "first in pair". To ensure downstream deduplication tools (like umi_tools) correctly identify R1 and R2, we need to revert these flags to their standard state.

Command:

shell
# Configuration
outdir="./"
input_dir="/path/to/step3/split_bams/merged/"  # Directory containing single-cell BAMs
workers=16                                      # Number of parallel workers

# 1. Generate file list
find -L "$input_dir" -name "*.bam" | sort > "$outdir/bam_list.txt"

# 2. Create output directory
mkdir -p $outdir/flag_reverse/

# 3. Execute Flag Correction
python step1_swap_flags_on_reverse.py \
  --bam_list $outdir/bam_list.txt \
  --out_dir $outdir/flag_reverse \
  --max_workers $workers

Estimated Time Based on a test with a 24-core server: Processing 3,641 single-cell BAM files takes approximately 30–40 minutes.

Step 2: UMI Deduplication

Background: Random fragmentation and PCR amplification are introduced during single-cell methylation library construction. To eliminate the impact of PCR bias on CNV estimation, deduplication is necessary. Since library fragments are randomly sheared at the ends, we use R1 alignment position + UMI sequence as the unique identifier for deduplication.

Command: Use umi_tools for deduplication. The key parameters are --extract-umi-method tag --umi-tag UR --paired --ignore-tlen.

Installing umi_tools: If umi_tools is not installed in your environment, you can install it via pip, conda, or micromamba:

shell
# Install via micromamba
micromamba install -c bioconda umi_tools
# Install via conda
# conda install -c bioconda umi_tools
# Install via pip
# pip install umi_tools
shell
# 1. Generate list of corrected BAMs
find "$outdir/flag_reverse" -name "*.bam" | sort > "$outdir/swapflags_bam_list.txt"

# 2. Create output directory
mkdir -p $outdir/umi_tools_swapflags_dedup_bam/

# 3. Execute Deduplication
python step2_umi_tools_dedup.py \
  --bam_list $outdir/swapflags_bam_list.txt \
  --out_dir $outdir/umi_tools_swapflags_dedup_bam/ \
  --max_workers $workers \
  --skip_existing

Estimated Time Based on a test with a 24-core server: Processing 3,641 single-cell BAM files takes approximately 4 hours.

Final Output: The deduplicated BAM files will be stored in the $outdir/umi_tools_swapflags_dedup_bam/ directory. These files serve as the final input for CopyKit analysis.

0 comments·0 replies