Algorithm

Author: SeekGene

Updated: 2026-05-12

scMethyl + RNA-seq Algorithm

Data Structure

SeekOne DD Single Cell Multiome Methylation + RNA libraries are provided in two chemistries.

DD-MET3 Methylation Library Structure

Figure 1: DD-MET3 methylation library structure

Structure notes:

SP1/SP2: adapter sequences
barcode: 17 bp cell barcode
7F: 7 bp linker sequence
17L: 17 bp fixed sequence CgtCCgtCgttgCtCgt
ME: 19 bp fixed sequence AGATGTGTATAAGAGACAG
9 bp: extension sequence from the Tn5 insertion fragment

DD-MET5 Methylation Library Structure

Figure 2: DD-MET5 methylation library structure

Structure notes:

SP1/SP2: adapter sequences
barcode: 17 bp cell barcode
UMI: 12 bp UMI sequence
TSO: 13 bp TSO sequence TTTCTTATATGGG
17L: 17 bp fixed sequence CgtCCgtCgttgCtCgt
ME: 19 bp fixed sequence AGATGTGTATAAGAGACAG
9 bp: extension sequence from the Tn5 insertion fragment
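
The segment lengths in the structure notes can be turned into read offsets. The sketch below is illustrative only: it assumes the segments occur in the read in the order they are listed above (the figures remain the authoritative source), and the names `LAYOUTS` and `segment_offsets` are hypothetical, not part of the pipeline.

```python
from typing import Dict, List, Tuple

# Segment names and lengths copied from the structure notes.
# ASSUMPTION: segments occur in the read in the listed order.
LAYOUTS: Dict[str, List[Tuple[str, int]]] = {
    "DD-MET3": [("barcode", 17), ("7F", 7), ("17L", 17), ("ME", 19), ("gap", 9)],
    "DD-MET5": [("barcode", 17), ("UMI", 12), ("TSO", 13),
                ("17L", 17), ("ME", 19), ("gap", 9)],
}


def segment_offsets(chemistry: str) -> Dict[str, Tuple[int, int]]:
    """Return 0-based, half-open (start, end) offsets for each segment."""
    offsets: Dict[str, Tuple[int, int]] = {}
    start = 0
    for name, length in LAYOUTS[chemistry]:
        offsets[name] = (start, start + length)
        start += length
    return offsets


met3 = segment_offsets("DD-MET3")
met5 = segment_offsets("DD-MET5")
print(met5["UMI"])  # (17, 29)
print(met3["ME"])   # (41, 60)
```

Under this assumption the fixed 17L and ME segments start 18 bp later in DD-MET5 than in DD-MET3, because the 7 bp linker is replaced by a 12 bp UMI plus a 13 bp TSO.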

During enzymatic conversion, unmethylated cytosines are deaminated to uracils, which are read as thymines after amplification and sequencing. Cytosines in the adapters are protected from conversion, whereas cytosines in specific fixed sequences are intentionally left unprotected and are later used to estimate the C-to-T conversion rate.

Transcriptome Workflow

Transcriptome processing is performed through SeekSoulTools. Cells retained for methylation downstream analysis are determined using transcriptome-derived cell barcodes.

Methylation Workflow

Step 1: Preprocessing and Barcode Parsing

Barcode extraction and correction

The pipeline locates the barcode region using the expected library structure. If the extracted barcode appears in the whitelist, it is kept directly. Otherwise, the pipeline attempts one-mismatch barcode correction:

If exactly one whitelist candidate is found, the barcode is corrected to that candidate.
If several whitelist candidates match, the candidate with the highest read count is selected.
If no valid correction is found, the read is discarded as an invalid-barcode read.
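
The correction rules above can be sketched as follows. This is a minimal illustration, not the pipeline's code: the function name `correct_barcode`, the `read_counts` mapping (assumed here to hold the number of reads observed for each whitelist barcode), and the alphabetical tie-break are all assumptions. Toy 4 bp barcodes are used instead of 17 bp ones to keep the example short.

```python
from typing import Dict, Optional, Set

BASES = "ACGT"


def correct_barcode(raw: str, whitelist: Set[str],
                    read_counts: Dict[str, int]) -> Optional[str]:
    """Return the accepted barcode, or None for an invalid-barcode read."""
    if raw in whitelist:
        return raw  # exact whitelist match is kept directly

    # Collect every whitelist barcode exactly one substitution away.
    candidates = set()
    for i, base in enumerate(raw):
        for alt in BASES:
            if alt == base:
                continue
            candidate = raw[:i] + alt + raw[i + 1:]
            if candidate in whitelist:
                candidates.add(candidate)

    if not candidates:
        return None
    if len(candidates) == 1:
        return next(iter(candidates))
    # Several candidates: prefer the one with the highest read count.
    # ASSUMPTION: ties are broken alphabetically to stay deterministic.
    return max(sorted(candidates), key=lambda bc: read_counts.get(bc, 0))


whitelist = {"AAAA", "AAAT", "CAAA", "TTTT"}
counts = {"AAAA": 10, "AAAT": 50, "CAAA": 5, "TTTT": 7}
print(correct_barcode("TTTT", whitelist, counts))  # TTTT (exact match)
print(correct_barcode("TTTA", whitelist, counts))  # TTTT (single candidate)
print(correct_barcode("AAAC", whitelist, counts))  # AAAT (highest read count)
print(correct_barcode("GGGG", whitelist, counts))  # None (discarded)
```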

UMI extraction

UMIs are extracted from predefined positions and are not corrected during parsing.

Forward and reverse read determination

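Three original cytosine positions in the 17L and ME segments, the first one and the last two, are used to distinguish forward and reverse reads. If all three positions remain cytosines, the read is classified as reverse; otherwise, it is classified as forward. In the patterns below, an uppercase C in the 17L segment marks an original cytosine position, and an uppercase T marks its converted form.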

Forward read pattern: TgtTTgtTgttgTtTgtAGATGTGTATAAGAGAT
Reverse read pattern: CgtCCgtCgttgCtCgtAGATGTGTATAAGAGAC

Figure 3: Forward and reverse reads determination
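
A minimal sketch of this classification is shown below. It assumes the caller has already sliced the read so that it starts at the 17L segment, and it interprets "the first and last two" cytosines as the first, second-to-last, and last cytosine of the reverse read pattern. Both points, and the function name `classify_read`, are assumptions rather than confirmed pipeline behavior.

```python
REVERSE_PATTERN = "CgtCCgtCgttgCtCgtAGATGTGTATAAGAGAC"
FORWARD_PATTERN = "TgtTTgtTgttgTtTgtAGATGTGTATAAGAGAT"

# Original cytosine positions (0-based) in the fixed region.
C_SITES = [i for i, base in enumerate(REVERSE_PATTERN) if base == "C"]
# ASSUMPTION: the three diagnostic sites are the first and the last two.
CHECK_SITES = [C_SITES[0]] + C_SITES[-2:]


def classify_read(fixed_region: str) -> str:
    """Classify a read from the sequence starting at the 17L segment."""
    region = fixed_region.upper()
    if len(region) < len(REVERSE_PATTERN):
        raise ValueError("fixed region is shorter than the expected pattern")
    if all(region[i] == "C" for i in CHECK_SITES):
        return "reverse"
    return "forward"


# Full C-to-T conversion of the reverse pattern yields the forward pattern.
print(REVERSE_PATTERN.replace("C", "T") == FORWARD_PATTERN)  # True
print(CHECK_SITES)                     # [0, 14, 33]
print(classify_read(FORWARD_PATTERN))  # forward
print(classify_read(REVERSE_PATTERN))  # reverse
```

Because only three sites are checked, a forward read with incomplete conversion at other cytosines is still classified as forward as long as at least one diagnostic site reads as T.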

C-to-T conversion rate

To estimate conversion efficiency, the workflow inspects the fixed-sequence cytosine sites in the 17L and ME segments, using only reads whose library structure has been verified. Restricting the calculation to this high-confidence subset affects only the estimate; it does not remove any reads from the final FASTQ outputs.

Figure 4: C-to-T conversion rate
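
The estimate can be illustrated as the fraction of fixed cytosine sites that read as T. The sketch below is an assumption-heavy illustration: it treats a structure as "verified" when every non-cytosine position of the fixed region matches the reference exactly, it uses all seven cytosine sites of the pattern, and it expects only forward-classified reads as input. The pipeline's actual criteria are not specified here, and the function names are hypothetical.

```python
from typing import Iterable

REFERENCE = "CgtCCgtCgttgCtCgtAGATGTGTATAAGAGAC".upper()
C_SITES = [i for i, base in enumerate(REFERENCE) if base == "C"]
OTHER_SITES = [i for i in range(len(REFERENCE)) if i not in C_SITES]


def has_verified_structure(region: str) -> bool:
    """ASSUMPTION: verified means all non-cytosine positions match."""
    region = region.upper()
    if len(region) < len(REFERENCE):
        return False
    return all(region[i] == REFERENCE[i] for i in OTHER_SITES)


def conversion_rate(regions: Iterable[str]) -> float:
    """Fraction of fixed cytosine sites read as T in verified reads."""
    converted = 0
    unconverted = 0
    for region in regions:
        if not has_verified_structure(region):
            continue  # excluded from the estimate, not from the output
        region = region.upper()
        for i in C_SITES:
            if region[i] == "T":
                converted += 1
            elif region[i] == "C":
                unconverted += 1
    total = converted + unconverted
    return converted / total if total else float("nan")


full = REFERENCE.replace("C", "T")     # all 7 sites converted
partial = "C" + full[1:]               # 6 converted, 1 unconverted
broken = full[:32] + "G" + full[33:]   # mismatch at a non-cytosine site
print(round(conversion_rate([full, partial, broken]), 4))  # 0.9286
```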

Read cleanup and filtering

The pipeline then removes artificial library segments, trims overlapping adapters, trims the 9 bp Tn5 gap regions, optionally filters reads with excessive non-CpG methylation, and removes reads that become too short after trimming.
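
A simplified view of the trimming and filtering steps is sketched below. It covers only removal of the artificial prefix, the 9 bp gap, the length filter, and a crude non-CpG filter; adapter-overlap trimming is omitted. The thresholds (`min_len=30`, `max_non_cpg=3`), the function names, and the idea of counting retained cytosines that are not followed by G directly in the read are all assumptions made for illustration.

```python
from typing import Optional


def count_non_cpg_cytosines(seq: str) -> int:
    """Count cytosines followed by a base other than G (CH context)."""
    seq = seq.upper()
    count = 0
    for i in range(len(seq) - 1):  # the last base has no known context
        if seq[i] == "C" and seq[i + 1] != "G":
            count += 1
    return count


def clean_read(seq: str, prefix_len: int, gap_len: int = 9,
               min_len: int = 30,
               max_non_cpg: Optional[int] = 3) -> Optional[str]:
    """Return the trimmed insert, or None if the read is filtered out.

    prefix_len is the total length of the artificial library segments
    that precede the Tn5 gap. Set max_non_cpg to None to disable the
    optional non-CpG methylation filter.
    """
    insert = seq[prefix_len + gap_len:]
    if len(insert) < min_len:
        return None  # too short after trimming
    if max_non_cpg is not None and count_non_cpg_cytosines(insert) > max_non_cpg:
        return None  # excessive non-CpG methylation
    return insert


prefix = "N" * 78 + "N" * 9              # artificial segments plus 9 bp gap
kept = "TTGATCGTTA" * 4                  # every C is in a CpG context
print(clean_read(prefix + kept, prefix_len=78) == kept)   # True
print(clean_read(prefix + "TCA" * 12, prefix_len=78))     # None
print(clean_read(prefix + "TTGA", prefix_len=78))         # None
```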

Step 2: Bismark Alignment and BAM Sorting

A customized Bismark build is used for methylation alignment. The workflow adds corrected cell barcodes and raw UMIs to BAM tags, aligns forward and reverse read groups with different options, and sorts BAM files by read name for downstream processing.

Step 3: ALLCools Analysis

The workflow splits name-sorted BAM files by cell barcode, converts each per-cell BAM into ALLC format, performs UR-tag-based UMI handling with the customized ALLCools implementation, and builds multi-scale methylation datasets such as chrom10k, chrom20k, chrom50k, chrom100k, chrom500k, chrom1M, and geneslop2k.

Figure 5: UMI correction and per-cell methylation quantification
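
To illustrate what a binned dataset such as chrom20k contains, the sketch below aggregates per-site ALLC records into fixed-width bins. It assumes that chrom20k denotes consecutive 20,000 bp bins and that ALLC rows are tab-separated with the columns chromosome, 1-based position, strand, context, methylated count, and coverage as their first six fields. The function `bin_allc` is hypothetical and is not the ALLCools implementation.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def bin_allc(lines: Iterable[str], bin_size: int = 20_000,
             context_prefix: str = "CG") -> Dict[Tuple[str, int], Tuple[int, int]]:
    """Sum methylated counts and coverage per (chromosome, bin index)."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        if not line.strip():
            continue
        chrom, pos, _strand, context, mc, cov = line.rstrip("\n").split("\t")[:6]
        if not context.startswith(context_prefix):
            continue  # keep only the requested cytosine context
        key = (chrom, (int(pos) - 1) // bin_size)  # positions are 1-based
        totals[key][0] += int(mc)
        totals[key][1] += int(cov)
    return {key: (mc, cov) for key, (mc, cov) in totals.items()}


allc = [
    "chr1\t100\t+\tCGA\t1\t1\t1",
    "chr1\t15000\t-\tCGT\t0\t2\t1",
    "chr1\t20001\t+\tCGG\t1\t1\t1",
    "chr1\t20500\t+\tCAT\t0\t1\t1",  # non-CG context, ignored
]
print(bin_allc(allc))
# {('chr1', 0): (1, 3), ('chr1', 1): (1, 1)}
```

The methylation fraction of a bin is then the methylated count divided by the coverage, for example 1/3 for the first bin above.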

Step 4: Dimensionality Reduction and Visualization

By default, dimensionality reduction is performed on chrom20k bins with LSI, followed by UMAP visualization.

Workflow Summary

At the implementation level, the Nextflow pipeline is organized into four major process groups:

step1: QC, transcriptome analysis, methylation barcode parsing, and FASTQ sharding
step2: Bismark alignment and BAM name sorting
step3: per-cell BAM splitting, ALLC generation, ALLC merge, and MCDS generation
step4: summary statistics, clustering, visualization, and integrated HTML reporting

Nextflow Step-by-Step Details

This section lists the individual Nextflow processes described in the README, grouped under the same four steps used above.

Step 1: Preprocessing and Barcode Parsing

COMPUTE_CPG_SITES: counts genome-wide CpG sites from genome.fa and chromosome size information
FASTP_EXPRESSION_MULTI: performs transcriptome FASTQ trimming and QC
FASTP_METHYLATION_MULTI: performs methylation FASTQ QC before barcode parsing
SEEKSOULTOOLS_RNA: runs transcriptome alignment, counting, filtering, clustering, and differential expression
METHYLATION_BARCODE_EXTRACTION: parses methylation barcodes and UMIs according to chemistry type, performs barcode correction, and can shard FASTQs with --split_fastq
PARSE_FASTQ_FILES: pairs forward and reverse FASTQ fragments for downstream mapping
FASTP_METHYLATION_BARCODE_EXTRACT: runs post-barcode-extraction QC on the parsed methylation FASTQs
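
As an illustration of the kind of computation COMPUTE_CPG_SITES performs, the sketch below counts CpG dinucleotides per contig from FASTA text. It is a simplified stand-in, not the pipeline's implementation: it counts each CG once on the forward strand, ignores chromosome size information, and the function name `count_cpg_sites` is hypothetical.

```python
from typing import Dict, Iterable


def count_cpg_sites(fasta_lines: Iterable[str]) -> Dict[str, int]:
    """Count CG dinucleotides per contig, including across line breaks."""
    counts: Dict[str, int] = {}
    name = None
    prev = ""  # last base of the previous sequence line
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            counts[name] = 0
            prev = ""
            continue
        if name is None:
            raise ValueError("sequence found before the first FASTA header")
        seq = line.upper()
        # Prepending the previous base catches a CG split over two lines.
        counts[name] += (prev + seq).count("CG")
        prev = seq[-1]
    return counts


fasta = [">chr1 toy contig", "ACGTC", "GGCG", ">chr2", "cgcg"]
print(count_cpg_sites(fasta))  # {'chr1': 3, 'chr2': 2}
```

For a real genome.fa the same function can be applied to an open file handle, since it consumes lines one at a time and never holds a whole chromosome in memory.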

Step 2: Bismark Alignment and BAM Sorting

BISMARK_ALIGNMENT_FORWARD: aligns forward methylation reads with barcode and UMI tags
BISMARK_ALIGNMENT_REVERSE: aligns reverse methylation reads with PBAT mode enabled
SORT_BAM_BY_NAME: sorts BAM files by read name so downstream per-cell splitting can be performed correctly

Step 3: Per-cell Split, ALLC Generation, Merge, and Dataset Building

SPLIT_BAM_FILES: splits name-sorted BAMs into per-cell BAM files using transcriptome-derived barcodes
MERGE_BISMARK_BAM: merges forward and reverse per-cell BAMs and combines barcode count summaries
ALLCOOLS_BAM_TO_ALLC: converts per-cell BAMs to ALLC format using the customized ALLCools implementation
MERGE_FILTERED_BARCODE_READS_COUNTS: merges barcode-level metrics and writes consolidated cell summaries
ALLCOOLS_GENERATE_DATASETS: builds multi-scale methylation datasets such as chrom10k, chrom20k, chrom50k, chrom100k, chrom500k, chrom1M, and geneslop2k
ALLCOOLS_SUBMERGE and ALLCOOLS_MERGE: merge sharded ALLC outputs when FASTQ splitting is enabled
ALLCOOLS_EXTRACT: extracts merged CG-context ALLC outputs for downstream analysis
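
The per-cell split can be pictured as grouping name-sorted records and keeping only those whose barcode was retained by the transcriptome analysis. The sketch below uses plain tuples in place of BAM records; the record layout `(read_name, barcode, mate)` and the function `split_name_sorted` are hypothetical. It also assumes that name sorting matters because it keeps the mates of a read adjacent, which lets them be routed together in a single pass.

```python
from collections import defaultdict
from itertools import groupby
from typing import Dict, Iterable, List, Set, Tuple

Record = Tuple[str, str, str]  # (read_name, barcode, mate)


def split_name_sorted(records: Iterable[Record],
                      kept_barcodes: Set[str]) -> Dict[str, List[Record]]:
    """Route name-sorted records to per-cell lists, mates kept together."""
    per_cell: Dict[str, List[Record]] = defaultdict(list)
    # groupby only merges adjacent items, so the input must be name-sorted.
    for _name, group in groupby(records, key=lambda rec: rec[0]):
        mates = list(group)
        barcode = mates[0][1]
        if barcode in kept_barcodes:
            per_cell[barcode].extend(mates)
    return dict(per_cell)


records = [
    ("r1", "AAAC", "R1"), ("r1", "AAAC", "R2"),
    ("r2", "GGGT", "R1"), ("r2", "GGGT", "R2"),
    ("r3", "AAAC", "R1"),
]
cells = split_name_sorted(records, kept_barcodes={"AAAC"})
print(sorted(cells))       # ['AAAC']
print(len(cells["AAAC"]))  # 3
```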

For the methy_only workflow, read-count-based barcode estimation and filtering are additionally handled through helper utilities in utils.nf.

Step 4: Summary, Dimensionality Reduction, and Joint Report

METHYLATION_SUMMARY: aggregates Bismark reports, cell metrics, and CpG statistics into JSON and CSV summaries
METHYLATION_LSI_PCA_CLUSTERING: performs dimensionality reduction and clustering, using LSI on chrom20k bins by default
MULTI_REPORT: combines transcriptome and methylation outputs into the integrated HTML report
