

Author: SeekGene
Updated: 2026-05-12
scMethyl + RNA-seq Algorithm

Data Structure

SeekOne DD Single Cell Multiome Methylation + RNA libraries are provided in two chemistries.

DD-MET3 Methylation Library Structure

Figure 1: DD-MET3 methylation library structure

Structure notes:

  • SP1/SP2: adapter sequences
  • barcode: 17 bp cell barcode
  • 7F: 7 bp linker sequence
  • 17L: 17 bp fixed sequence CgtCCgtCgttgCtCgt
  • ME: 19 bp fixed sequence AGATGTGTATAAGAGACAG
  • 9 bp: extension sequence from the Tn5 insertion fragment

DD-MET5 Methylation Library Structure

Figure 2: DD-MET5 methylation library structure

Structure notes:

  • SP1/SP2: adapter sequences
  • barcode: 17 bp cell barcode
  • UMI: 12 bp UMI sequence
  • TSO: 13 bp TSO sequence TTTCTTATATGGG
  • 17L: 17 bp fixed sequence CgtCCgtCgttgCtCgt
  • ME: 19 bp fixed sequence AGATGTGTATAAGAGACAG
  • 9 bp: extension sequence from the Tn5 insertion fragment

During enzymatic conversion, unmethylated cytosines are deaminated and are read as thymines after amplification and sequencing. Adapter cytosines are protected from conversion, whereas selected cytosines in the fixed sequences are intentionally left unprotected so that they can later be used to estimate the C-to-T conversion rate.

Transcriptome Workflow

Transcriptome processing is performed with SeekSoulTools. The cell barcodes called from the transcriptome data determine which cells are retained for downstream methylation analysis.

Methylation Workflow

Step 1: Preprocessing and Barcode Parsing

Barcode extraction and correction

The pipeline locates the barcode region using the expected library structure. If the extracted barcode appears in the whitelist, it is kept directly. Otherwise, the pipeline attempts one-mismatch barcode correction:

  • If exactly one whitelist candidate is found, the barcode is corrected to that candidate.
  • If several candidates match, the barcode supported by the highest read count is selected.
  • If no valid correction is found, the read is discarded as an invalid-barcode read.
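
The correction rules above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: the function names are hypothetical, and `read_counts` is assumed to hold the number of reads observed for each exact-match whitelist barcode.

```python
from collections import Counter

BASES = "ACGT"


def one_mismatch_neighbors(barcode):
    """Yield every sequence that differs from `barcode` at exactly one position."""
    for i, original in enumerate(barcode):
        for base in BASES:
            if base != original:
                yield barcode[:i] + base + barcode[i + 1:]


def correct_barcode(barcode, whitelist, read_counts):
    """Return the accepted barcode, or None when the read should be discarded.

    `read_counts` is an assumption of this sketch: a mapping from whitelist
    barcode to the number of reads that carried it without any mismatch.
    """
    if barcode in whitelist:
        return barcode  # exact whitelist hit, kept directly
    candidates = [c for c in one_mismatch_neighbors(barcode) if c in whitelist]
    if not candidates:
        return None  # invalid-barcode read
    if len(candidates) == 1:
        return candidates[0]
    # Several candidates: keep the one with the highest read count.
    # Ties fall back to the first candidate encountered.
    return max(candidates, key=lambda c: read_counts.get(c, 0))


whitelist = {"AAAA", "AAAT", "CCCC"}
read_counts = Counter({"AAAA": 10, "AAAT": 3, "CCCC": 5})
print(correct_barcode("AAAC", whitelist, read_counts))
```

Short 4 bp barcodes are used here only to keep the example readable; the same logic applies to the 17 bp cell barcodes.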

UMI extraction

UMIs are extracted from predefined positions and are not corrected during parsing.
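
As a rough illustration of position-based extraction, the sketch below slices a read into named segments using the lengths given in the structure notes. It assumes that the order of the structure notes reflects the order of segments in the read and that the barcode starts at offset 0; neither is confirmed by the text above, and the `LAYOUTS` table and `split_segments` helper are hypothetical.

```python
# Segment lengths come from the structure notes; the ORDER and the starting
# offset are assumptions made for this sketch.
LAYOUTS = {
    "DD-MET3": [("barcode", 17), ("7F", 7), ("17L", 17), ("ME", 19), ("tn5_9bp", 9)],
    "DD-MET5": [("barcode", 17), ("UMI", 12), ("TSO", 13), ("17L", 17), ("ME", 19), ("tn5_9bp", 9)],
}


def split_segments(read, chemistry):
    """Slice `read` into fixed-length segments; the remainder is the insert."""
    segments = {}
    offset = 0
    for name, length in LAYOUTS[chemistry]:
        segments[name] = read[offset:offset + length]
        offset += length
    segments["insert"] = read[offset:]
    return segments


read = ("A" * 17 + "C" * 12 + "TTTCTTATATGGG" + "TGTTTGTTGTTGTTTGT"
        + "AGATGTGTATAAGAGATAG" + "G" * 9 + "ACGTACGT")
segments = split_segments(read, "DD-MET5")
print(segments["UMI"], segments["insert"])
```

The UMI is returned exactly as read, consistent with the statement that UMIs are not corrected during parsing.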

Forward and reverse read determination

Three of the original cytosine positions in the 17L and ME segments (the first and the last two) are used to distinguish forward from reverse reads. If all three positions are still cytosines, the read is classified as reverse; otherwise it is classified as forward.

  • Forward read pattern: TgtTTgtTgttgTtTgtAGATGTGTATAAGAGAT
  • Reverse read pattern: CgtCCgtCgttgCtCgtAGATGTGTATAAGAGAC

Figure 3: Forward and reverse read determination
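
The classification rule can be sketched as follows. The check positions are derived from the reverse read pattern by taking its first cytosine and its last two cytosines; this reading of "first and last two" is an interpretation, and `classify_read` is a hypothetical helper that expects the 34 bp fixed segment (17L followed by the start of ME) to be already located in the read.

```python
REVERSE_PATTERN = "CgtCCgtCgttgCtCgtAGATGTGTATAAGAGAC"
FORWARD_PATTERN = "TgtTTgtTgttgTtTgtAGATGTGTATAAGAGAT"

# Indices of the original cytosines in the unconverted pattern.
CYTOSINE_SITES = [i for i, base in enumerate(REVERSE_PATTERN) if base == "C"]
# Assumed check positions: the first cytosine and the last two cytosines.
CHECK_SITES = [CYTOSINE_SITES[0]] + CYTOSINE_SITES[-2:]


def classify_read(fixed_segment):
    """Return 'reverse' if all check positions are still C, else 'forward'."""
    segment = fixed_segment.upper()
    if len(segment) < len(REVERSE_PATTERN):
        raise ValueError("fixed segment is shorter than the expected pattern")
    if all(segment[i] == "C" for i in CHECK_SITES):
        return "reverse"
    return "forward"


print(CHECK_SITES)
print(classify_read(FORWARD_PATTERN), classify_read(REVERSE_PATTERN))
```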

C-to-T conversion rate

To estimate conversion efficiency, the workflow inspects fixed-sequence cytosine sites in the 17L and ME segments, but only for reads with verified structures. This calculation is restricted to a high-confidence subset and does not remove reads from the final FASTQ outputs.

Figure 4: C-to-T conversion rate
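
A minimal sketch of the conversion-rate estimate is shown below. It treats a read as having a verified structure when every non-cytosine position of the 34 bp fixed segment matches the expected sequence, and it assumes that the caller passes only the reads that should contribute to the estimate (for example, forward reads). Both choices are assumptions of this sketch rather than documented pipeline behavior.

```python
REFERENCE = "CgtCCgtCgttgCtCgtAGATGTGTATAAGAGAC".upper()
C_SITES = [i for i, base in enumerate(REFERENCE) if base == "C"]
OTHER_SITES = [i for i in range(len(REFERENCE)) if i not in C_SITES]


def structure_ok(segment):
    """Assumed check: all non-cytosine positions match the fixed sequence."""
    return all(segment[i] == REFERENCE[i] for i in OTHER_SITES)


def conversion_rate(fixed_segments):
    """Return converted / (converted + unconverted) over the fixed C sites.

    Segments with an unverified structure are skipped for this calculation
    only; nothing is removed from the read set itself.
    """
    converted = unconverted = 0
    for raw in fixed_segments:
        segment = raw.upper()
        if len(segment) < len(REFERENCE) or not structure_ok(segment):
            continue
        for i in C_SITES:
            if segment[i] == "T":
                converted += 1
            elif segment[i] == "C":
                unconverted += 1
    total = converted + unconverted
    return converted / total if total else None


fully_converted = "TGTTTGTTGTTGTTTGTAGATGTGTATAAGAGAT"
print(conversion_rate([fully_converted]))
```

Returning `None` when no segment passes the structure check avoids reporting a misleading rate of zero.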

Read cleanup and filtering

The pipeline then removes artificial library segments, trims overlapping adapters, trims the 9 bp Tn5 gap regions, optionally filters reads with excessive non-CpG methylation, and removes reads that become too short after trimming.
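
Part of this cleanup can be sketched as below, under several stated assumptions: the artificial segments form a 5' prefix of known length (`prefix_len`), only the 5' copy of the 9 bp gap is handled, the minimum length of 30 and the `max_non_cpg` threshold are hypothetical values, and the non-CpG filter is approximated at read level by counting cytosines that are not followed by G. Overlapping-adapter trimming is not covered.

```python
def count_non_cpg_cytosines(seq):
    """Count cytosines that are not immediately followed by G (CH context)."""
    seq = seq.upper()
    return sum(1 for i in range(len(seq) - 1)
               if seq[i] == "C" and seq[i + 1] != "G")


def clean_read(seq, qual, prefix_len, gap_len=9, min_len=30, max_non_cpg=None):
    """Trim the artificial prefix and the Tn5 gap, then apply optional filters.

    Returns (sequence, quality) or None when the read is removed.
    All default thresholds are illustrative, not pipeline defaults.
    """
    start = prefix_len + gap_len
    seq, qual = seq[start:], qual[start:]
    if len(seq) < min_len:
        return None  # too short after trimming
    if max_non_cpg is not None and count_non_cpg_cytosines(seq) > max_non_cpg:
        return None  # excessive apparent non-CpG methylation
    return seq, qual


raw_seq = "A" * 10 + "G" * 9 + "TTCGTTCATT"
raw_qual = "I" * len(raw_seq)
print(clean_read(raw_seq, raw_qual, prefix_len=10, min_len=5))
```

Sequence and quality strings are trimmed together so that the output record stays a valid FASTQ entry.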

Step 2: Bismark Alignment and BAM Sorting

A customized Bismark build is used for methylation alignment. The workflow adds corrected cell barcodes and raw UMIs to BAM tags, aligns forward and reverse read groups with different options, and sorts BAM files by read name for downstream processing.

Step 3: ALLCools Analysis

The workflow splits name-sorted BAM files by cell barcode, converts each per-cell BAM into ALLC format, performs UR-tag-based UMI handling with the customized ALLCools implementation, and builds multi-scale methylation datasets such as chrom10k, chrom20k, chrom50k, chrom100k, chrom500k, chrom1M, and geneslop2k.

Figure 5: UMI correction and per-cell methylation quantification
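
To make the idea of a fixed-bin dataset such as chrom20k concrete, the sketch below aggregates per-site counts from one cell into 20 kb bins. It assumes tab-separated ALLC rows with the columns chromosome, 1-based position, strand, context, methylated count, and coverage, and it keeps only CG-context sites. The `bin_allc_rows` helper is hypothetical and does not reproduce how ALLCools builds its datasets.

```python
from collections import defaultdict


def bin_allc_rows(rows, bin_size=20_000, context_prefix="CG"):
    """Sum methylated counts and coverage per (chromosome, bin index).

    Returns {(chrom, bin_index): (mc, cov, mc / cov)} for bins with coverage.
    """
    totals = defaultdict(lambda: [0, 0])
    for line in rows:
        chrom, pos, _strand, context, mc, cov = line.rstrip("\n").split("\t")[:6]
        if not context.startswith(context_prefix):
            continue  # skip non-CG sites
        key = (chrom, (int(pos) - 1) // bin_size)  # positions are 1-based
        totals[key][0] += int(mc)
        totals[key][1] += int(cov)
    return {key: (mc, cov, mc / cov)
            for key, (mc, cov) in totals.items() if cov > 0}


rows = [
    "chr1\t1\t+\tCGA\t1\t1\t1",
    "chr1\t20000\t-\tCGT\t0\t2\t1",
    "chr1\t20001\t+\tCGG\t1\t1\t1",
]
for key, value in sorted(bin_allc_rows(rows).items()):
    print(key, value)
```

Changing `bin_size` to 10_000 or 100_000 gives the analogous aggregation for other bin widths.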

Step 4: Dimensionality Reduction and Visualization

By default, dimensionality reduction is performed on chrom20k bins with LSI, followed by UMAP visualization.

Workflow Summary

At the implementation level, the Nextflow pipeline is organized into four major process groups:

  • step1: QC, transcriptome analysis, methylation barcode parsing, and FASTQ sharding
  • step2: Bismark alignment and BAM name sorting
  • step3: per-cell BAM splitting, ALLC generation, ALLC merge, and MCDS generation
  • step4: summary statistics, clustering, visualization, and integrated HTML reporting

Nextflow Step-by-Step Details

This section lists the individual Nextflow processes described in the README, grouped by the four steps of the workflow summary.

Step 1: Preprocessing and Barcode Parsing

  • COMPUTE_CPG_SITES: counts genome-wide CpG sites from genome.fa and chromosome size information
  • FASTP_EXPRESSION_MULTI: performs transcriptome FASTQ trimming and QC
  • FASTP_METHYLATION_MULTI: performs methylation FASTQ QC before barcode parsing
  • SEEKSOULTOOLS_RNA: runs transcriptome alignment, counting, filtering, clustering, and differential expression
  • METHYLATION_BARCODE_EXTRACTION: parses methylation barcodes and UMIs according to chemistry type, performs barcode correction, and can shard FASTQs with --split_fastq
  • PARSE_FASTQ_FILES: pairs forward and reverse FASTQ fragments for downstream mapping
  • FASTP_METHYLATION_BARCODE_EXTRACT: runs post-barcode-extraction QC on the parsed methylation FASTQs

Step 2: Bismark Alignment and BAM Sorting

  • BISMARK_ALIGNMENT_FORWARD: aligns forward methylation reads with barcode and UMI tags
  • BISMARK_ALIGNMENT_REVERSE: aligns reverse methylation reads with PBAT mode enabled
  • SORT_BAM_BY_NAME: sorts BAM files by read name so downstream per-cell splitting can be performed correctly

Step 3: Per-cell Split, ALLC Generation, Merge, and Dataset Building

  • SPLIT_BAM_FILES: splits name-sorted BAMs into per-cell BAM files using transcriptome-derived barcodes
  • MERGE_BISMARK_BAM: merges forward and reverse per-cell BAMs and combines barcode count summaries
  • ALLCOOLS_BAM_TO_ALLC: converts per-cell BAMs to ALLC format using the customized ALLCools implementation
  • MERGE_FILTERED_BARCODE_READS_COUNTS: merges barcode-level metrics and writes consolidated cell summaries
  • ALLCOOLS_GENERATE_DATASETS: builds multi-scale methylation datasets such as chrom10k, chrom20k, chrom50k, chrom100k, chrom500k, chrom1M, and geneslop2k
  • ALLCOOLS_SUBMERGE and ALLCOOLS_MERGE: merge sharded ALLC outputs when FASTQ splitting is enabled
  • ALLCOOLS_EXTRACT: extracts merged CG-context ALLC outputs for downstream analysis

For the methy_only workflow, read-count-based barcode estimation and filtering are additionally handled through helper utilities in utils.nf.

Step 4: Summary, Dimensionality Reduction, and Joint Report

  • METHYLATION_SUMMARY: aggregates Bismark reports, cell metrics, and CpG statistics into JSON and CSV summaries
  • METHYLATION_LSI_PCA_CLUSTERING: performs dimensionality reduction and clustering, using LSI on chrom20k bins by default
  • MULTI_REPORT: combines transcriptome and methylation outputs into the integrated HTML report

