Skip to content

scMethyl + RNA Multi-omics: Basic Analysis

Author: SeekGene
Time: 43 min
Words: 8.6k words
Updated: 2026-03-18
Reads: 0 times
scMethyl + RNA-seq Notebooks

Module Introduction

This module is designed for single-cell multi-omics data (containing both transcriptome (RNA) and methylation information) Basic Analysis.

This module implements the following core analysis workflow:

  • Data Integration: Integrate single-cell RNA sequencing and scMethyl sequencing data from multiple samples.
  • Quality Control: Perform independent QC filtering on scRNA and scMethyl data to ensure data quality.
  • Batch Effect Correction: Use Harmony algorithm to correct technical variations between multiple samples while preserving biological signals.
  • Cell Atlas Construction: Construct single-cell atlas, identify cell types and subpopulations through dimensionality reduction, clustering, and visualization.

Input File Preparation

This module requires the following input files:

Required Files

  • scRNA-seq Data: Gene expression matrix (filtered_feature_bc_matrix directory).
  • scMethyl Data: scMethyl dataset in MCDS format (.mcds file).
  • Barcode Mapping File (Optional): If using DD-MET3 Kit, used to associate cell Barcode correspondence between RNA and scMethyl data.

File Structure Example

text
data/
├── AY1768874914782
│   └── methylation
│       └── demoWTJW969-task-1
│           └── WTJW969
│               ├── allcools_generate_datasets
│               │   └── WTJW969.mcds # scMethyl data
│               ├── filtered_feature_bc_matrix # Transcriptome expression matrix
│               └── split_bams
│                   ├── WTJW969_cells.csv
│                   └── filtered_barcode_reads_counts.csv
└── AY1768876253533
    └── methylation
        └── demo4WTJW880-task-1
            └── WTJW880
                ├── allcools_generate_datasets
                │   └── WTJW880.mcds # scMethyl data
                ├── filtered_feature_bc_matrix # Transcriptome expression matrix
                └── split_bams
                    ├── WTJW880_cells.csv
                    └── filtered_barcode_reads_counts.csv
DD-M_bUCB3_whitelist.csv  # Barcode Mapping File (Optional)

⚠️ Important Note
This tutorial applies to the above directory structure. If your file organization is different, please adjust the folder paths or modify the file addresses in the code accordingly.

python
# --- Import Necessary Python Packages ---

# System Operations and File Handling
import os
import re
import glob
import warnings

# ALLCools related packages: for scMethyl data processing
from ALLCools.mcds import MCDS
from ALLCools.clustering import (
    tsne, significant_pc_test, log_scale, lsi, 
    binarize_matrix, filter_regions, cluster_enriched_features, 
    ConsensusClustering, Dendrogram, get_pc_centers, one_vs_rest_dmg
)
from ALLCools.clustering.doublets import MethylScrublet
from ALLCools.plot import *

# Single-cell Analysis Tools
import scanpy as sc
import scanpy.external as sce

# Batch Effect Correction
from harmonypy import run_harmony

# Data Processing and Visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from matplotlib.lines import Line2D

# Other Tools
import xarray as xr
import pybedtools
from scipy import sparse
python
# --- Input Parameter Configuration ---

## samples: List of samples to analyze
samples = ["WTJW969", "WTJW880"]

## File Path Configuration: Configure according to new file hierarchy
## format: {sample_name: {'top_dir': top_level_dir, 'demo_dir': demo_dir}}
## Example: {"WTJW969": {"top_dir": "AY1768874914782", "demo_dir": "demoWTJW969-task-1"}}
sample_path_config = {
    "WTJW969": {"top_dir": "AY1768874914782", "demo_dir": "demoWTJW969-task-1"},
    "WTJW880": {"top_dir": "AY1768876253533", "demo_dir": "demo4WTJW880-task-1"}
}

## Data Root Directory (default is "data")
data_root = "data"

## reagent_type: Kit type ('DD-MET3' or 'DD-MET5'), can be specified here if consistent across all samples
## reagent_type_map: Configure kit type by sample (Optional). If present, takes precedence over reagent_type
## Example   reagent_type_map = {"SampleA": "DD-MET3", "SampleB": "DD-MET5"}
## DD-MET3: RNA and scMethyl use different Barcodes, need to associate via mapping file
## DD-MET5: RNA and scMethyl use same Barcode, no mapping needed
reagent_type = "DD-MET3"
reagent_type_map = {"WTJW969": "DD-MET5", "WTJW880": "DD-MET3"}

## var_dim: genomic window size (resolution) for scMethyl data
var_dim = 'chrom20k'

## obs_dim: Observation dimension (usually 'cell')
obs_dim = 'cell'

## mc_type: scMethyl site type ('CGN' represents CpG sites)
mc_type = 'CGN'

## quant_type: Quantification method for scMethyl score ('hypo-score' represents hypo-methylation score)
quant_type = 'hypo-score'

## Plotting Palette
my_palette = [
    "#66C2A5", "#FC8D62", "#8DA0CB", "#E78AC3", 
    "#A6D854", "#FFD92F", "#E5C494", "#B3B3B3", 
    "#8DD3C7", "#BEBADA", "#FB8072", "#80B1D3", 
    "#FDB462", "#B3DE69", "#FCCDE5", "#BC80BD", 
    "#CCEBC5", "#FFED6F", "#A6CEE3", "#FB9A99"
]

Data QC and Integration

This section first performs QC filtering on scRNA and scMethyl data separately, then calculates barcode intersection of both datasets to ensure cell consistency for downstream analysis.

RNA Data Integration and QC Filtering

This section integrates scRNA data from multiple samples and performs QC filtering to remove low-quality cells and genes, ensuring the reliability of downstream analysis.

QC Filtering Strategy:

  • n_genes_by_counts (Number of Detected Genes): Filter cells with gene counts < 200 or > 10000. Cells with too few genes may have low capture efficiency or poor quality; cells with too many genes may be doublets or technical anomalies.
  • total_counts (Total UMI Counts): Filter cells with total UMI counts < 200 or > 30000. Cells with too few UMIs have insufficient sequencing depth; cells with too many UMIs may be doublets or technical anomalies.
  • Mitochondrial Gene Percentage: Filter cells with mitochondrial gene percentage > 20%. Cells with high mitochondrial gene percentage may be dead or damaged, indicating low quality.
  • Low Expression Gene Filtering: Filter genes expressed in fewer than 3 cells. These genes have extremely low expression and may be noise or sequencing errors, contributing little to analysis.

After QC filtering, statistics on cell and gene counts before and after filtering will be output to evaluate the filtering effect.

Figure Legend (RNA QC Violin Plot):

The figure below shows the distribution of RNA data QC metrics before and after QC filtering, grouped by Sample.

  • Before QC Violin Plot: Displays QC metric distribution before filtering, including n_genes_by_counts (detected gene count), total_counts (total UMI count), and pct_counts_mt (mitochondrial gene percentage). Used to evaluate raw data quality and identify outliers.
  • After QC Violin Plot: Displays QC metric distribution after filtering. Low-quality cells (too few/many genes, abnormal UMI counts, high mitochondrial percentage) have been removed, improving data quality.
  • X-axis: Different Samples, facilitating comparison of QC metrics across samples.
  • Y-axis: Values of each QC metric. The width of the Violin Plot represents cell density in that value range; wider means more cells.
  • Purpose: Comparing distributions before and after filtering helps evaluate QC effectiveness and confirm if thresholds are reasonable.
python
warnings.filterwarnings('ignore')
# --- RNA Multi-sample Data Integration and QC Filtering ---

# Read RNA expression data for each sample
obj_rna_list = {}
for i in samples:
    # Construct path based on new file hierarchy: ../../data/{top_dir}/methylation/{demo_dir}/{sample}/filtered_feature_bc_matrix
    top_dir = sample_path_config[i]['top_dir']
    demo_dir = sample_path_config[i]['demo_dir']
    s_rna_path = os.path.join('../../',data_root, top_dir, 'methylation', demo_dir, i, 'filtered_feature_bc_matrix')
    obj_rna_list[i] = sc.read_10x_mtx(s_rna_path)

# Merge data from all samples
adata_rna = obj_rna_list[samples[0]].concatenate([obj_rna_list[i] for i in samples[1:]])

# Add sample information
adata_rna.obs['Sample'] = adata_rna.obs['batch'].map(
    {f'{index}': value for index, value in enumerate(samples)}
)

# --- QC Filtering ---
# Calculate QC metrics
adata_rna.var['mt'] = adata_rna.var_names.str.startswith('MT-')  # Mitochondrial gene marker
sc.pp.calculate_qc_metrics(adata_rna, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)

# Save pre-QC statistics
n_cells_before = adata_rna.n_obs
n_genes_before = adata_rna.n_vars

print(f"Number of cells before QC: {n_cells_before}")
print(f"Number of genes before QC: {n_genes_before}")

# --- Pre-QC Violin Plot Visualization ---
print("\n=== Pre-QC QC Metrics Distribution ===")
# sc.pl.violin(adata_rna, keys=['n_genes_by_counts', 'total_counts', 'pct_counts_mt'], 
#              groupby='Sample', rotation=45, multi_panel=True)
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(10, 4), dpi=80) 

sc.pl.violin(adata_rna, keys='n_genes_by_counts', groupby='Sample', rotation=45, ax=ax1, show=False, palette=my_palette, stripplot=False, size=1, inner='box', linewidth=1, cut=0)
ax1.set_title("Number of Genes", fontsize=10)
ax1.set_ylabel("Gene_Counts", fontsize=9)
ax1.set_xlabel("")

sc.pl.violin(adata_rna, keys='total_counts', groupby='Sample', rotation=45, ax=ax2, show=False, palette=my_palette, stripplot=False, size=1, inner='box', linewidth=1, cut=0)
ax2.set_title("Total UMI Counts", fontsize=10)
ax2.set_ylabel("UMI_Counts", fontsize=9)
ax2.set_xlabel("")

sc.pl.violin(adata_rna, keys='pct_counts_mt', groupby='Sample', rotation=45, ax=ax3, show=False, palette=my_palette, stripplot=False, size=1, inner='box', linewidth=1, cut=0)
ax3.set_title("Mitochondrial %", fontsize=10)
ax3.set_ylabel("Percentage(%)", fontsize=9)
ax3.set_xlabel("")

plt.tight_layout()
plt.show()

# Filter low-quality cells
# nFeature (n_genes_by_counts): Filter based on detected genes per cell (stored in adata_rna.obs['n_genes_by_counts'])
# nCount (total_counts): Filter based on total UMI counts per cell (stored in adata_rna.obs['total_counts'])
# Mitochondrial Gene Percentage: Filter based on mitochondrial transcript proportion (stored in adata_rna.obs['pct_counts_mt'])

# 1) Filter low gene count cells by nFeature
sc.pp.filter_cells(adata_rna, min_genes=200)  # Equivalent to filtering by n_genes_by_counts >= 200

# 2) Filter low expression genes (genes expressed in fewer than 3 cells)
sc.pp.filter_genes(adata_rna, min_cells=3)

# 3) Further filter by nCount and mitochondrial percentage
adata_rna = adata_rna[(adata_rna.obs['n_genes_by_counts'] >= 200) & (adata_rna.obs['n_genes_by_counts'] <= 10000), :].copy() # nFeature filtering
adata_rna = adata_rna[(adata_rna.obs['total_counts'] >= 200) & (adata_rna.obs['total_counts'] <= 30000), :].copy()  # nCount filtering
adata_rna = adata_rna[adata_rna.obs['pct_counts_mt'] < 20, :].copy()  # Mitochondrial gene percentage filtering

print(f"\nNumber of cells after QC: {adata_rna.n_obs}")
print(f"Number of genes after QC: {adata_rna.n_vars}")
print(f"Number of filtered cells: {n_cells_before - adata_rna.n_obs}")
print(f"Number of filtered genes: {n_genes_before - adata_rna.n_vars}")

# --- Post-QC Violin Plot Visualization ---
print("\n=== Post-QC QC Metrics Distribution ===")
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(10, 4), dpi=80) 

sc.pl.violin(adata_rna, keys='n_genes_by_counts', groupby='Sample', rotation=45, ax=ax1, show=False, palette=my_palette, stripplot=False, size=1, inner='box', linewidth=1, cut=0)
ax1.set_title("Number of Genes", fontsize=10)
ax1.set_ylabel("Gene_Counts", fontsize=9)
ax1.set_xlabel("")

sc.pl.violin(adata_rna, keys='total_counts', groupby='Sample', rotation=45, ax=ax2, show=False, palette=my_palette, stripplot=False, size=1, inner='box', linewidth=1, cut=0)
ax2.set_title("Total UMI Counts", fontsize=10)
ax2.set_ylabel("UMI_Counts", fontsize=9)
ax2.set_xlabel("")

sc.pl.violin(adata_rna, keys='pct_counts_mt', groupby='Sample', rotation=45, ax=ax3, show=False, palette=my_palette, stripplot=False, size=1, inner='box', linewidth=1, cut=0)
ax3.set_title("Mitochondrial %", fontsize=10)
ax3.set_ylabel("Percentage(%)", fontsize=9)
ax3.set_xlabel("")

plt.tight_layout()
plt.show()

scMethyl Data Integration and QC Filtering

This section integrates scMethyl data from multiple samples, merges QC metadata, and then filters based on QC metrics.

Data Integration and Metadata Merging Steps:

  1. Data Integration: Read MCDS format scMethyl data for each sample and merge.
  2. Metadata Reading: Read cell metadata file and Reads count file for each sample.
  3. Methylation Rate Calculation: Calculate CpG and CH methylation rates for each cell.
  4. Metadata Merging: Merge calculated QC metrics into adata_met.obs.

QC Filtering Strategy:

  • total_cpg_number (Total CpG Sites): Filter cells with total CpG sites < 1000. These cells may have low capture efficiency or poor quality.
  • CpG% (CpG Methylation Rate): Filter cells with CpG methylation rate not in 60-100% range. These values may be abnormal.
  • CH% (CH Methylation Rate): Filter cells with CH methylation rate not in 0-5% range (mammalian cell CH% is usually < 5%). Abnormally high values may indicate technical issues.

After QC filtering, statistics on cell counts before and after filtering will be calculated, and QC metrics will be visualized using Violin Plots.

Figure Legend (scMethyl QC Violin Plot):

The figure below shows distribution of QC metrics before and after scMethyl data QC filtering, grouped by Sample.

  • Before QC Violin Plot: Displays QC metric distribution before filtering, including total_cpg_number (total CpG sites), CpG% (CpG methylation rate), and CH% (CH methylation rate). Used to evaluate raw data quality and identify outliers.
  • After QC Violin Plot: Displays QC metric distribution after filtering. Low-quality cells (too few CpG sites, abnormal methylation rates) have been removed, improving data quality.
  • X-axis: Different Samples, facilitating comparison of QC metrics across samples.
  • Y-axis: Values of each QC metric. The width of the Violin Plot represents cell density in that value range; wider means more cells.
  • Purpose: Comparing distributions before and after filtering helps evaluate QC effectiveness and confirm if thresholds are reasonable.
python
warnings.filterwarnings('ignore')
# --- scMethyl data multi-sample integration ---

# Read MCDS format scMethyl data for each sample
obj_met_list = {}
mcds_paths = {}
for i in samples:
    # Construct path based on new file hierarchy: data/{top_dir}/methylation/{demo_dir}/{sample}/allcools_generate_datasets/{sample}.mcds
    top_dir = sample_path_config[i]['top_dir']
    demo_dir = sample_path_config[i]['demo_dir']
    mcds_path = os.path.join('../../',data_root, top_dir, 'methylation', demo_dir, i, 'allcools_generate_datasets', f'{i}.mcds')
    
    mcds_paths[i] = mcds_path
    mcds = MCDS.open(mcds_path, obs_dim='cell', var_dim=var_dim)
    adata = mcds.get_score_adata(mc_type=mc_type, quant_type=quant_type, sparse=True)
    obj_met_list[i] = adata

# Merge data from all samples
adata_met = obj_met_list[samples[0]].concatenate([obj_met_list[i] for i in samples[1:]])

# Save raw data
adata_met.layers['raw'] = adata_met.X.copy()

# Add sample information
adata_met.obs['Sample'] = adata_met.obs['batch'].map(
    {f'{index}': value for index, value in enumerate(samples)}
)

# --- Helper Function to Calculate CpG and CH Methylation Rates ---
def compute_cg_ch_level(df):
    # Calculate CpG and CH methylation levels
    cg_cov_col = [i for i in df.columns[df.columns.str.startswith('CG')].to_list() if i.endswith('_cov')]
    cg_mc_col = [i for i in df.columns[df.columns.str.startswith('CG')].to_list() if i.endswith('_mc')]
    ch_cov_col = [i for i in df.columns[df.columns.str.contains('^(CT|CC|CA)')].to_list() if i.endswith('_cov')]
    ch_mc_col = [i for i in df.columns[df.columns.str.contains('^(CT|CC|CA)')].to_list() if i.endswith('_mc')]
    
    df['CpG_cov'] = df[cg_cov_col].sum(axis=1)
    df['CpG_mc'] = df[cg_mc_col].sum(axis=1)
    df['CpG%'] = round(df['CpG_mc'] * 100 / df['CpG_cov'], 2)
    
    df['CH_cov'] = df[ch_cov_col].sum(axis=1)
    df['CH_mc'] = df[ch_mc_col].sum(axis=1)
    df['CH%'] = round(df['CH_mc'] * 100 / df['CH_cov'], 2)

    return df

# --- Read and Merge QC Metadata ---
suffix_map = {f'{value}': index for index, value in enumerate(samples)}
meta_list = list()
keep_col = ['barcode', 'total_cpg_number', 'genome_cov', 'reads_counts', 'CpG%', 'CH%']

for i in samples:
    # Construct path based on new file hierarchy
    top_dir = sample_path_config[i]['top_dir']
    demo_dir = sample_path_config[i]['demo_dir']
    
    # Read cell metadata file: ../../data/{top_dir}/methylation/{demo_dir}/{sample}/split_bams/{sample}_cells.csv
    s_cells_meta_path = os.path.join('../../',data_root, top_dir, 'methylation', demo_dir, i, 'split_bams', f'{i}_cells.csv')
    s_cells_meta = pd.read_csv(s_cells_meta_path, header=0)
    s_cells_meta['barcode'] = [b.replace('_allc.gz', '') for b in s_cells_meta['cell_barcode']]
    
    # Read Reads count file: ../../data/{top_dir}/methylation/{demo_dir}/{sample}/split_bams/filtered_barcode_reads_counts.csv
    s_reads_counts_path = os.path.join('../../',data_root, top_dir, 'methylation', demo_dir, i, 'split_bams', 'filtered_barcode_reads_counts.csv')
    s_reads_counts = pd.read_csv(s_reads_counts_path, header=0)
    
    # Merge data and calculate methylation rates
    s_merged_df = s_cells_meta.merge(s_reads_counts, how='inner', left_on='barcode', right_on='barcode')
    s_merged_df = compute_cg_ch_level(s_merged_df)
    s_merged_df = s_merged_df[keep_col]
    
    # If multi-sample, add sample suffix to barcode
    if len(samples) > 1:
        s_merged_df['barcode'] = [f'{b}-{suffix_map[i]}' for b in s_merged_df['barcode']]
    
    meta_list.append(s_merged_df)

# Merge metadata of all samples into adata_met.obs
adata_met.obs = adata_met.obs.merge(
    pd.concat(meta_list).set_index('barcode'), 
    how='left', 
    left_index=True, 
    right_index=True
)

# --- scMethyl data QC filtering ---
# Save pre-QC statistics
n_cells_before = adata_met.n_obs

print(f"Number of cells before QC: {n_cells_before}")

# --- Pre-QC Violin Plot Visualization ---
print("\n=== Pre-QC QC Metrics Distribution ===")
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(10, 4), dpi=80)

# 1. Total CpG Number
sc.pl.violin(adata_met, keys='total_cpg_number', groupby='Sample', rotation=45, ax=ax1, show=False, palette=my_palette, stripplot=False, size=1, inner='box', linewidth=1, cut=0)
ax1.set_title("Total CpG Number", fontsize=10)
ax1.set_ylabel("Count", fontsize=9)
ax1.set_xlabel("")

# 2. CpG%
sc.pl.violin(adata_met, keys='CpG%', groupby='Sample', rotation=45, ax=ax2, show=False, palette=my_palette, stripplot=False, size=1, inner='box', linewidth=1, cut=0)
ax2.set_title("CpG Methylation %", fontsize=10)
ax2.set_ylabel("Percentage (%)", fontsize=9)
ax2.set_xlabel("")

# 3. CH%
sc.pl.violin(adata_met, keys='CH%', groupby='Sample', rotation=45, ax=ax3, show=False, palette=my_palette, stripplot=False, size=1, inner='box', linewidth=1, cut=0)
ax3.set_title("CH Methylation %", fontsize=10)
ax3.set_ylabel("Percentage (%)", fontsize=9)
ax3.set_xlabel("")

plt.tight_layout()
plt.show()


# Filter low-quality cells based on QC metrics
# total_cpg_number: Filter cells with too few CpG sites (threshold can be adjusted)
# CpG%: Filter cells with abnormal CpG methylation rates (usually 40-80% in mammalian cells)
# CH%: Filter cells with abnormal CH methylation rates (usually low, < 5% in mammalian cells)

# Check and handle missing values
if 'total_cpg_number' in adata_met.obs.columns:
    adata_met = adata_met[adata_met.obs['total_cpg_number'].notna(), :].copy()
    adata_met = adata_met[(adata_met.obs['total_cpg_number'] >= 100) & (adata_met.obs['total_cpg_number'] <= 10000000), :].copy()

if 'CpG%' in adata_met.obs.columns:
    adata_met = adata_met[adata_met.obs['CpG%'].notna(), :].copy()
    adata_met = adata_met[(adata_met.obs['CpG%'] >= 60) & (adata_met.obs['CpG%'] <= 100), :].copy()

if 'CH%' in adata_met.obs.columns:
    adata_met = adata_met[adata_met.obs['CH%'].notna(), :].copy()
    adata_met = adata_met[(adata_met.obs['CH%'] >= 0) & (adata_met.obs['CH%'] <= 5), :].copy()

print(f"Cell count after QC: {adata_met.n_obs}")
print(f"Filtered out cell count: {n_cells_before - adata_met.n_obs}")

# --- Post-QC Violin Plot Visualization ---
print("\n=== Post-QC QC Metrics Distribution ===")
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(10, 4), dpi=80)

# 1. Total CpG Number
sc.pl.violin(adata_met, keys='total_cpg_number', groupby='Sample', rotation=45, ax=ax1, show=False, palette=my_palette, stripplot=False, size=1, inner='box', linewidth=1, cut=0)
ax1.set_title("Total CpG Number", fontsize=10)
ax1.set_ylabel("Count", fontsize=9)
ax1.set_xlabel("")

# 2. CpG%
sc.pl.violin(adata_met, keys='CpG%', groupby='Sample', rotation=45, ax=ax2, show=False, palette=my_palette, stripplot=False, size=1, inner='box', linewidth=1, cut=0)
ax2.set_title("CpG Methylation %", fontsize=10)
ax2.set_ylabel("Percentage (%)", fontsize=9)
ax2.set_xlabel("")

# 3. CH%
sc.pl.violin(adata_met, keys='CH%', groupby='Sample', rotation=45, ax=ax3, show=False, palette=my_palette, stripplot=False, size=1, inner='box', linewidth=1.5, cut=0)
ax3.set_title("CH Methylation %", fontsize=10)
ax3.set_ylabel("Percentage (%)", fontsize=9)
ax3.set_xlabel("")

plt.tight_layout()
plt.show()

scRNA and scMethyl data barcode association, intersection and doublet filtering

This section first associates scRNA and scMethyl data Barcodes based on kit type, then finds common cells (barcode intersection) in both datasets, and finally removes doublet cells based on doublet table.

Steps:

  1. Barcode Association: Associate scRNA and scMethyl data Barcodes based on kit type (DD-MET3/DD-MET5).
  2. Barcode Intersection: Calculate barcode intersection of scRNA and scMethyl data, ensuring cell consistency in both datasets.
  3. Doublet Filtering: Read doublet table for each sample, remove doublet cells from scRNA and scMethyl data.

Associate scRNA and scMethyl data Barcodes based on kit type, establishing cell correspondence between the two datasets.

  • DD-MET3: scRNA and scMethyl data use different Barcodes, requiring mapping through a mapping file. Please download the Barcode Mapping File DD-M_bUCB3_whitelist.csv locally. Then read the Barcode Mapping File (DD-M_bUCB3_whitelist.csv), establish correspondence between scRNA Barcode (gex_cb) and scMethyl Barcode (m_cb), and create rna_meta mapping table for later use.
  • DD-MET5: scRNA and scMethyl data use the same Barcode, no mapping needed, skip this step directly.

💡 Note
The mapping relationship (rna_meta) established in this step will be used for subsequent doublet filtering and Cell Type Annotation Transfer.

python
# --- scRNA and scMethyl data association (supports sample-specific config: some DD-MET3 need mapping, some DD-MET5 do not) ---

print("=== Data Status Check ===")
print(f"RNA Data Cell Count: {len(adata_rna.obs)}")
print(f"scMethyl data cell count: {len(adata_met.obs)}")
if len(adata_rna.obs) == 0:
    raise ValueError("RNA data is empty, cannot establish mapping. Please check data loading and QC steps.")
if len(adata_met.obs) == 0:
    raise ValueError("scMethyl data is empty, cannot establish mapping relationship. Please check previous QC filtering steps.")

suffix_map = {s: i for i, s in enumerate(samples)}  # sample -> batch index

# If DD-MET3 samples exist, load barcode mapping file (load only once)
if any(reagent_type_map.get(s, reagent_type) == "DD-MET3" for s in samples):
    mapping_file = "DD-M_bUCB3_whitelist.csv"
    if not os.path.exists(mapping_file):
        raise FileNotFoundError("Mapping file not found: DD-M_bUCB3_whitelist.csv, please check file path.")
    gex_mc_bc_map = pd.read_csv(mapping_file, index_col=None)
    if 'gex_cb' not in gex_mc_bc_map.columns or 'm_cb' not in gex_mc_bc_map.columns:
        raise ValueError(f"Mapping file format error: Should contain 'gex_cb' and 'm_cb' columns, actual columns: {list(gex_mc_bc_map.columns)}")
    print(f"Mapping file path: {mapping_file}, record count: {len(gex_mc_bc_map)}")
else:
    gex_mc_bc_map = None

rna_meta_list = []
for sample in samples:
    rt = reagent_type_map.get(sample, reagent_type)
    rna_obs_s = adata_rna.obs[adata_rna.obs['Sample'] == sample]
    met_barcodes_s = set(adata_met.obs.index[adata_met.obs['Sample'] == sample])
    suf = suffix_map[sample]

    if rt == "DD-MET3":
        # This sample requires barcode mapping
        rm = rna_obs_s.copy()
        rm["rna_bc_with_suffix"] = rm.index
        rm["gex_cb"] = [re.sub(r'-.*$', '', b) for b in rm["rna_bc_with_suffix"]]
        rm = rm.merge(gex_mc_bc_map, how='left', left_on='gex_cb', right_on='gex_cb')
        rm = rm.dropna(subset=['m_cb'])
        rm["m_cb"] = rm["m_cb"].astype(str) + "-" + str(suf)
        rm = rm[rm["m_cb"].isin(met_barcodes_s)]
        rm.index = rm["m_cb"]
        rna_meta_list.append(rm)
        print(f"  {sample} (DD-MET3): {len(rm)} common cells retained after mapping")
    else:
        # DD-MET5: Same barcode, retain only cells present in both RNA and scMethyl
        common_s = set(rna_obs_s.index) & met_barcodes_s
        rm = pd.DataFrame({"rna_bc_with_suffix": list(common_s), "m_cb": list(common_s), "gex_cb": [re.sub(r'-.*$', '', b) for b in common_s]})
        rm.index = rm["m_cb"]
        rna_meta_list.append(rm)
        print(f"  {sample} (DD-MET5): No mapping, {len(rm)} common cells")

if rna_meta_list:
    rna_meta = pd.concat(rna_meta_list, axis=0)
    rna_met_mapping = rna_meta[["gex_cb", "m_cb", "rna_bc_with_suffix"]].copy()
    print(f"\n=== Summary ===")
    print(f"Number of successfully associated cells: {len(rna_meta)}")
else:
    rna_meta = None
    rna_met_mapping = None
python
# --- Intersect and filter RNA and scMethyl data barcodes ---
# Since RNA and scMethyl data were QC filtered separately, ensure cells in both datasets are consistent
# If mapping table rna_meta exists (DD-MET3 or mixed), intersect by mapping; otherwise intersect by barcode (all DD-MET5)

print("=== Intersect RNA and scMethyl data barcodes ===")

if rna_meta is not None and len(rna_meta) > 0:
    # DD-MET3: Intersect through mapping relationship
    # rna_meta index is scMethyl barcode (m_cb), containing corresponding RNA barcode info
    # Use existing mapping relationship in rna_meta to intersect
    
    if rna_meta is None or len(rna_meta) == 0:
        raise ValueError("rna_meta is empty, please execute barcode association step (Cell 15) first")
    
    # Get valid scMethyl barcodes from mapping table (those with established mapping)
    mapped_met_barcodes = set(rna_meta.index)
    
    # Get actual barcodes present in scMethyl data
    met_barcodes = set(adata_met.obs.index)
    
    # Get actual barcodes in RNA data
    rna_barcodes_with_suffix = set(adata_rna.obs.index)
    
    print(f"RNA data cell count: {len(rna_barcodes_with_suffix)}")
    print(f"scMethyl data cell count: {len(met_barcodes)}")
    print(f"Number of valid scMethyl barcodes in mapping table: {len(mapped_met_barcodes)}")
    
    # Intersection: barcodes present in mapping table and also in scMethyl data
    common_met_barcodes = mapped_met_barcodes & met_barcodes
    print(f"Intersection of mapping table and scMethyl data (scMethyl barcode): {len(common_met_barcodes)}")
    
    if len(common_met_barcodes) == 0:
        print("⚠️  Warning: No intersection between mapping table and scMethyl data!")
        print("Possible causes: ")
        print("1. Mapping establishment failed")
        print("2. scMethyl data was over-filtered in QC step")
        print("3. Barcode format mismatch")
        raise ValueError("Cannot find intersection, please check mapping relationship and data status.")
    
    # Find corresponding RNA barcode (with suffix) through mapping
    rna_meta_filtered = rna_meta.loc[list(common_met_barcodes), :]
    common_rna_barcodes_with_suffix = set(rna_meta_filtered["rna_bc_with_suffix"].values)
    
    print(f"Number of RNA barcodes found through mapping: {len(common_rna_barcodes_with_suffix)}")
    
    # Ensure RNA barcode exists in RNA data
    common_rna_barcodes_with_suffix = common_rna_barcodes_with_suffix & rna_barcodes_with_suffix
    print(f"Intersection of RNA barcodes and RNA data: {len(common_rna_barcodes_with_suffix)}")
    
    if len(common_rna_barcodes_with_suffix) == 0:
        print("⚠️  Warning: Mapped RNA barcodes do not exist in RNA data!")
        raise ValueError("Cannot find valid RNA barcodes, please check mapping relationship and RNA data.")
    
    # Update corresponding scMethyl barcode (retain only those with existing RNA barcode)
    rna_meta_filtered = rna_meta_filtered[rna_meta_filtered["rna_bc_with_suffix"].isin(common_rna_barcodes_with_suffix)]
    common_met_barcodes = set(rna_meta_filtered.index)
    
    print(f"\nFinal Intersection Statistics: ")
    print(f"  Intersection Cell Count (scMethyl barcode): {len(common_met_barcodes)}")
    print(f"  Intersection Cell Count (RNA barcode): {len(common_rna_barcodes_with_suffix)}")
    
    # Filter both datasets, retaining only intersection cells
    print(f"\nStarting data filtering...")
    adata_rna_before = adata_rna.n_obs
    adata_met_before = adata_met.n_obs
    
    adata_rna = adata_rna[adata_rna.obs.index.isin(common_rna_barcodes_with_suffix), :].copy()
    adata_met = adata_met[adata_met.obs.index.isin(common_met_barcodes), :].copy()
    
    # Update rna_meta, retaining only filtered mapping relationships
    rna_meta = rna_meta.loc[list(common_met_barcodes), :]
    
    print(f"\nFiltering Results: ")
    print(f"  RNA Data: {adata_rna_before}{adata_rna.n_obs} (Removed {adata_rna_before - adata_rna.n_obs})")
    print(f"  scMethyl data: {adata_met_before} -> {adata_met.n_obs} (deleted {adata_met_before - adata_met.n_obs})")
    
    # Verify consistency after filtering
    if adata_rna.n_obs != adata_met.n_obs:
        print(f"⚠️  Warning: Cell counts of RNA and scMethyl data are inconsistent after filtering!")
        print(f"  RNA: {adata_rna.n_obs}, scMethyl: {adata_met.n_obs}")
    else:
        print(f"✅ Cell counts consistent after filtering: {adata_rna.n_obs}")
    
else:
    # No mapping table or empty: RNA and scMethyl use same Barcode, intersect directly (all DD-MET5 or no cells)
    rna_barcodes = set(adata_rna.obs.index)
    met_barcodes = set(adata_met.obs.index)
    
    # Calculate intersection
    common_barcodes = rna_barcodes & met_barcodes
    
    print(f"RNA data cell count: {len(rna_barcodes)}")
    print(f"scMethyl data cell count: {len(met_barcodes)}")
    print(f"Intersection Cell Count: {len(common_barcodes)}")
    print(f"Cells only in RNA data: {len(rna_barcodes - met_barcodes)}")
    print(f"Cell count only in scMethyl data: {len(met_barcodes - rna_barcodes)}")
    
    if len(common_barcodes) == 0:
        print("⚠️  Warning: No intersection between RNA and scMethyl data!")
        print("Possible causes: ")
        print("1. Barcode format mismatch between datasets")
        print("2. No common cells after QC filtering")
        raise ValueError("Cannot find intersection, please check data status and barcode format.")
    
    # Filter both datasets, retaining only intersection cells
    print(f"\nStarting data filtering...")
    adata_rna_before = adata_rna.n_obs
    adata_met_before = adata_met.n_obs
    
    adata_rna = adata_rna[adata_rna.obs.index.isin(common_barcodes), :].copy()
    adata_met = adata_met[adata_met.obs.index.isin(common_barcodes), :].copy()
    
    print(f"\nFiltering Results: ")
    print(f"  RNA Data: {adata_rna_before}{adata_rna.n_obs} (Removed {adata_rna_before - adata_rna.n_obs})")
    print(f"  scMethyl data: {adata_met_before} -> {adata_met.n_obs} (deleted {adata_met_before - adata_met.n_obs})")
    
    # Verify consistency after filtering
    if adata_rna.n_obs != adata_met.n_obs:
        print(f"⚠️  Warning: Cell counts of RNA and scMethyl data are inconsistent after filtering!")
        print(f"  RNA: {adata_rna.n_obs}, scMethyl: {adata_met.n_obs}")
    else:
        print(f"✅ Cell counts consistent after filtering: {adata_rna.n_obs}")

Remove Doublets Based on Doublet Table

This section removes doublets from RNA and scMethyl data based on doublet identification results for each sample.

Data Source: {sample}_doublet.txt file comes from scMethyl+RNA Multi-omics-Doublet Detection.ipynb analysis results, containing doublet information for each sample.

Processing Logic:

  • Read doublet table for each sample ({sample}_doublet.txt), get scMethyl data doublet barcodes (m_cb column).
  • Process differently based on kit type:
    • DD-MET3: Convert scMethyl barcode to RNA barcode via mapping file, then remove from both datasets.
    • DD-MET5: scMethyl barcode is RNA barcode, remove directly from both datasets using same barcode.
  • Output cell count statistics before and after removal.

💡 Note
If doublet removal is not needed, skip this cell.

python
# --- Remove doublets based on doublet table ---
doublet_df_column_name = "met_is_doublet"
# Read and merge doublet tables for each sample
doublet_met_barcodes = set()  # scMethyl data doublet barcodes (used to remove from scMethyl data)
doublet_rna_barcodes = set()  # RNA doublet barcode (for removing from RNA data)
suffix_map = {f'{value}': index for index, value in enumerate(samples)}

for sample in samples:
    doublet_path = f"{sample}_doublet.txt"
    if os.path.exists(doublet_path):
        doublet_df = pd.read_csv(doublet_path, sep='\t')
        # Determine how to identify doublets based on doublet_df column names
        if doublet_df_column_name == 'met_is_doublet':
            # If column name is "met_is_doublet", use met_is_doublet == True to identify doublets
            sample_doublets_met = doublet_df[doublet_df['met_is_doublet'] == True]['m_cb'].tolist()
        elif doublet_df_column_name == 'cell_multi_highlight':
            # If column name is "cell_multi_highlight", 'multi_cells' under this column are doublets
            sample_doublets_met = doublet_df[doublet_df['cell_multi_highlight'] == 'multi_cells']['m_cb'].tolist()
        elif doublet_df_column_name == 'cell_multi_highlight':
            sample_doublets_met = doublet_df[doublet_df['rna_doublet'] == 'doublet']['m_cb'].tolist()
        else:
            raise ValueError(f"Expected column name not found in doublet file {doublet_path}. Expected: 'met_is_doublet' or 'cell_multi_highlight', Actual: {list(doublet_df.columns)}")
        
        # Add sample suffix to scMethyl barcode (used to remove from scMethyl data)
        if len(samples) > 1:
            sample_doublets_met_with_suffix = [f"{b}-{suffix_map[sample]}" for b in sample_doublets_met]
        else:
            sample_doublets_met_with_suffix = sample_doublets_met
        doublet_met_barcodes.update(sample_doublets_met_with_suffix)
        rt = reagent_type_map.get(sample, reagent_type)
        
        # Process differently based on kit type
        if rt == "DD-MET3":
            # DD-MET3: Need to convert scMethyl barcode to RNA barcode
            # Convert scMethyl barcode to RNA barcode via mapping file
            sample_gex_mc_map = gex_mc_bc_map[gex_mc_bc_map['m_cb'].isin(sample_doublets_met)]
            sample_doublets_rna = sample_gex_mc_map['gex_cb'].tolist()
            # Add sample suffix to RNA barcode
            if len(samples) > 1:
                sample_doublets_rna_with_suffix = [f"{b}-{suffix_map[sample]}" for b in sample_doublets_rna]
            else:
                sample_doublets_rna_with_suffix = sample_doublets_rna
            doublet_rna_barcodes.update(sample_doublets_rna_with_suffix)
        elif rt == "DD-MET5":
            # DD-MET5: scMethyl barcode is RNA barcode, use directly
            doublet_rna_barcodes.update(sample_doublets_met_with_suffix)
        else:
            raise ValueError(f"Unsupported kit type: {reagent_type}, please set to 'DD-MET3' or 'DD-MET5'")
        
        print(f"{sample}: Detected {len(sample_doublets_met)} doublet cells")
    else:
        print(f"Warning: Doublet file {doublet_path} for {sample} not found")

if doublet_met_barcodes:
    print(f"\nTotal detected doublet cells: {len(doublet_met_barcodes)}")
    print(f"Kit Type: {reagent_type}")
    
    # Remove doublets from scMethyl data (using scMethyl barcode)
    adata_met = adata_met[~adata_met.obs.index.isin(doublet_met_barcodes), :].copy()
    
    # Remove doublets from RNA data (using RNA barcode)
    adata_rna = adata_rna[~adata_rna.obs.index.isin(doublet_rna_barcodes), :].copy()
    
    print(f"RNA data cell count after doublet removal: {adata_rna.n_obs}")
    print(f"scMethyl data cell count after removing doublets: {adata_met.n_obs}")
else:
    print("Warning: No doublet files found, skipping doublet filtering")

scRNA-seq Data Multi-sample Integration Analysis Workflow

This section performs normalization, feature selection, dimensionality reduction, batch correction, and clustering analysis on the QC-filtered RNA data.

Data Normalization and Dimensionality Reduction Analysis

python
warnings.filterwarnings('ignore')
# --- RNA Data Normalization and Dimensionality Reduction Analysis ---

# Normalization: Total count normalization to eliminate sequencing depth differences between cells
sc.pp.normalize_total(adata_rna, inplace=True)

# Log Transformation: log(x + 1)
sc.pp.log1p(adata_rna)

# Identify highly variable genes (2000), these genes best reflect biological differences between cells
sc.pp.highly_variable_genes(adata_rna, n_bins=100, n_top_genes=2000)

# Save raw data, then keep only highly variable genes
adata_rna.raw = adata_rna
adata_rna = adata_rna[:, adata_rna.var.highly_variable]

# Highly Variable Gene Z-score Normalization
sc.pp.scale(adata_rna, max_value=10)

# Principal Component Analysis (PCA)
sc.tl.pca(adata_rna)

# Harmony Batch Effect Correction
ho = run_harmony(
    adata_rna.obsm['X_pca'],
    meta_data=adata_rna.obs,
    vars_use='Sample',
    random_state=0,
    max_iter_harmony=20
)
adata_rna.obsm['X_pca_harmony'] = ho.Z_corr.T

# Use Harmony corrected results for downstream analysis
reduc_use = 'X_pca_harmony'

# Build KNN Graph
sc.pp.neighbors(adata_rna, n_neighbors=30, use_rep=reduc_use)

# UMAP Dimensionality Reduction Visualization
sc.tl.umap(adata_rna)

# t-SNE Dimensionality Reduction Visualization
sc.tl.tsne(adata_rna, use_rep=reduc_use)

# Leiden Clustering
sc.tl.leiden(adata_rna, resolution=1)

RNA Single-cell Annotation

Figure Legend (RNA UMAP - Leiden Clustering):

The figure below shows the cell distribution of RNA data in UMAP Dimensionality Reduction space, colored by Leiden Clustering results.

  • X-axis and Y-axis: Two main dimensions after UMAP Dimensionality Reduction (UMAP1 and UMAP2), used to display cell similarity relationships in high-dimensional data in 2D space.
  • Color: Different colors represent different Leiden Clustering results; each cluster may correspond to one or more cell types.
  • Points: Each point represents a cell; position reflects similarity to other cells. Similar cells cluster together in UMAP space.
  • Purpose: Used to evaluate clustering effectiveness, identify cell subpopulations, and discover potential cell types.
python
warnings.filterwarnings('ignore')
sc.set_figure_params(scanpy=True, fontsize=12, facecolor='white', figsize=(6, 6))
sc.pl.umap(adata_rna, color='leiden', s=35, palette=my_palette, frameon=True)

Figure Legend (RNA UMAP - Marker Gene Expression):

The figure below shows the expression distribution of marker genes of RNA data in UMAP Dimensionality Reduction space.

  • X-axis and Y-axis: Two main dimensions after UMAP Dimensionality Reduction (UMAP1 and UMAP2).
  • Color: Color intensity represents expression level of each marker gene; darker means higher expression. Marker genes include: CD3D (T cells), NKG7 (NK cells), CD8A (CD8+ T cells), CD4 (CD4+ T cells), CD79A and MS4A1 (B cells), JCHAIN (Plasma cells), LYZ (Monocytes), LILRA4 (pDC cells).
  • Points: Each point represents a cell.
  • Purpose: Marker gene expression patterns help preliminarily identify and verify different cell types, assisting cell type annotation.
python
colors = ["#F5E6CC", "#FDBB84", "#E34A33", "#7F0000"]
my_warm_cmap = mcolors.LinearSegmentedColormap.from_list("white_orange_red", colors, N=256)

marker_names = ["CD3D","NKG7","CD8A","CD4","CCR7","CD79A","MS4A1","LYZ","FCN1","FCGR3A","CST3"]
sc.pl.umap(
    adata_rna, 
    color=marker_names, 
    ncols=4,            
    color_map=my_warm_cmap, 
    vmax='p99',         
    s=10,               
    frameon=False,      
    vmin=0,
    show=False 
)

plt.gcf().set_size_inches(16, 8) 
plt.show()

Figure Legend (RNA UMAP - Cell Type and Sample):

The figure below shows the distribution of RNA data in UMAP Dimensionality Reduction space, colored by cell type (celltype) and Sample.

  • X-axis and Y-axis: Two main dimensions after UMAP Dimensionality Reduction (UMAP1 and UMAP2).
  • Color (celltype plot): Different colors represent different cell types, showing the overall distribution after cell type annotation.
  • Color (Sample plot): Different colors represent different samples, used to evaluate batch effects and differences in cell type composition.
  • Points: Each point represents a cell.
  • Purpose: Evaluate cell type annotation, check for batch effects between samples, and compare cell type composition across samples.
python
celltype = {
    '0': 'CD8+ T',
    '1': 'Navie CD4+ T',
    '2': 'CD14+ Mono',
    '3': 'Memory CD4+ T',
    '4': 'B',
    '5': 'NK',
    '6': 'NK',
    '7': 'FCGR3A+ Mono',
    '8': 'DC',
    '9': 'CD8+ T'
}

adata_rna.obs['celltype'] = adata_rna.obs['leiden'].map(celltype)
python
sc.set_figure_params(scanpy=True, fontsize=12, facecolor='white', figsize=(6, 6))
sc.pl.umap(
    adata_rna, 
    color=['celltype', 'Sample'], 
    ncols=2,
    s=35,
    wspace=0.3,
    palette=my_palette
)
python
adata_rna.write("adata_rna.h5ad")

scMethyl Sequencing Data Multi-sample Integration Analysis Workflow

This section performs preprocessing, dimensionality reduction, batch correction, clustering analysis, and cell type annotation transfer on the QC-filtered scMethyl data.

Data Preprocessing and Dimensionality Reduction Analysis

This section performs preprocessing, dimensionality reduction, clustering, and batch correction on the QC-filtered scMethyl data.

python
# --- scMethyl Data Preprocessing and Dimensionality Reduction Analysis ---

# Matrix Binarization: Set 95% threshold for data binarization to highlight significant epigenetic differences
binarize_matrix(adata_met, cutoff=0.95)

# Region Filtering: Intelligently filter genomic regions
filter_regions(adata_met)

# LSI Dimensionality Reduction: Use ARPACK algorithm for linear dimensionality reduction to extract principal components of genomic scMethyl patterns
lsi(adata_met, algorithm='arpack', obsm='X_lsi')

# Significant PC Test: Retain only significant PCs with p < 0.1
significant_pc_test(adata_met, p_cutoff=0.1, obsm='X_lsi', update=True)

# Harmony Batch Effect Correction
ho = run_harmony(
    adata_met.obsm['X_lsi'],
    meta_data=adata_met.obs,
    vars_use='Sample',
    random_state=0,
    max_iter_harmony=20
)
adata_met.obsm['X_lsi_harmony'] = ho.Z_corr.T

# Use Harmony corrected results for downstream analysis
reduc_use = 'X_lsi_harmony'

# Build KNN Graph
sc.pp.neighbors(adata_met, n_neighbors=30, use_rep=reduc_use)

# UMAP Dimensionality Reduction Visualization
sc.tl.umap(adata_met)

# t-SNE Dimensionality Reduction Visualization
sc.tl.tsne(adata_met, use_rep=reduc_use)

# Leiden Clustering
sc.tl.leiden(adata_met, resolution=0.5)

Cell Type Annotation Transfer

This section transfers cell type annotations from RNA data to scMethyl data, achieving unified cell type annotation for multi-omics data.

Annotation Transfer Strategy:

Adopt different transfer methods based on kit type (reagent_type):

  • DD-MET3 Kit:

    • RNA and scMethyl data use different Barcodes, requiring mapping through the previously established relationship (rna_meta) for transfer.
    • Get celltype from adata_rna.obs via rna_bc_with_suffix (RNA barcode with suffix).
    • Then map celltype to scMethyl data's m_cb (scMethyl barcode).
    • Only transfer annotations for cells present in both RNA and scMethyl datasets.
  • DD-MET5 Kit:

    • RNA and scMethyl data use the same Barcode, can be transferred directly by index alignment.
    • Calculate barcode intersection, transfer annotation only for common cells.

⚠️ Important Note
Before executing this step, ensure:

  1. RNA data cell type annotation is complete (celltype column exists in adata_rna.obs).
  2. For DD-MET3 Kit, barcode association and intersection steps have been executed.
python
# --- Transfer RNA Data Cell Type Annotations to scMethyl Data ---

print("=== Cell Type Annotation Transfer ===")

# Check if RNA data has 'celltype' column
if 'celltype' not in adata_rna.obs.columns:
    raise ValueError("RNA data does not have 'celltype' column, please annotate cell types for RNA data first.")

print(f"Number of cell types in RNA data: {adata_rna.obs['celltype'].nunique()}")
print(f"Cell types in RNA data: {sorted(adata_rna.obs['celltype'].unique())}")

# Process differently based on kit type
if rna_meta is not None and len(rna_meta) > 0:
#if reagent_type == "DD-MET3":
    # DD-MET3: Transfer cell type annotation via previously established mapping
    # rna_meta index is m_cb (scMethyl barcode), including rna_bc_with_suffix (RNA barcode with suffix)
    
    if 'rna_meta' not in locals() or rna_meta is None:
        raise ValueError("rna_meta does not exist, please execute barcode association (Cell 15) and intersection (Cell 16) steps first.")
    
    if len(rna_meta) == 0:
        raise ValueError("rna_meta is empty, cannot transfer annotation. Please check mapping relationship.")
    
    # Get celltype from current adata_rna.obs (rna_meta might have been created before celltype annotation)
    # Get celltype from adata_rna.obs via rna_bc_with_suffix
    print(f"\nTransferring annotation via mapping...")
    print(f"Number of mapping relationships in rna_meta: {len(rna_meta)}")
    print(f"Cell count in scMethyl data: {len(adata_met.obs)}")
    
    # Get corresponding celltype via rna_bc_with_suffix in rna_meta
    # rna_meta index is m_cb (scMethyl barcode), including rna_bc_with_suffix (RNA barcode with suffix)
    # Steps: 
    # 1. Extract rna_bc_with_suffix and corresponding m_cb (index) from rna_meta
    # 2. Get celltype from adata_rna.obs (via rna_bc_with_suffix)
    # 3. Map celltype to scMethyl data's m_cb
    
    # Create mapping from RNA barcode to celltype
    rna_celltype_map = adata_rna.obs['celltype'].to_dict()
    
    # Get rna_bc_with_suffix from rna_meta and map to celltype
    # Retain only barcodes present in adata_met.obs.index
    met_barcodes_in_meta = rna_meta.index.intersection(adata_met.obs.index)
    rna_meta_filtered = rna_meta.loc[list(met_barcodes_in_meta), :]
    
    # Get celltype from adata_rna.obs via rna_bc_with_suffix
    celltype_series = rna_meta_filtered['rna_bc_with_suffix'].map(rna_celltype_map)
    
    # Check for missing values
    missing_in_map = celltype_series.isna().sum()
    if missing_in_map > 0:
        missing_barcodes = celltype_series[celltype_series.isna()].index
        print(f"⚠️  Warning: {missing_in_map} cells' RNA barcodes not found in adata_rna.obs for celltype")
        print(f"  Example barcode: {list(missing_barcodes[:5])}")
    
    # Assign celltype to scMethyl data (aligned by index)
    adata_met.obs["celltype"] = celltype_series
    
    # Check for missing values
    missing_count = adata_met.obs["celltype"].isna().sum()
    if missing_count > 0:
        print(f"⚠️  Warning: {missing_count} cells could not find corresponding celltype annotation")
        # Option to remove these cells or fill with "Unknown"
        # adata_met.obs["celltype"] = adata_met.obs["celltype"].fillna("Unknown")
    else:
        print(f"✅ Successfully transferred celltype annotation for {len(adata_met.obs)} cells")
    
    adata_met.obs["celltype"] = adata_met.obs["celltype"].astype('category')
    
    print(f"\nStatistics after cell type transfer: ")
    print(adata_met.obs['celltype'].value_counts())

else:
#elif reagent_type == "DD-MET5":
    # DD-MET5: RNA and scMethyl data use the same Barcode, transfer cell type annotation directly
    # Since barcodes are same, transfer cell type annotation directly by index alignment
    # Transfer only for cells present in both datasets
    
    print(f"\nTransferring annotation directly by index alignment...")
    print(f"RNA Data Cell Count: {len(adata_rna.obs)}")
    print(f"scMethyl data cell count: {len(adata_met.obs)}")
    
    # Calculate intersection
    common_barcodes_for_annotation = set(adata_rna.obs.index) & set(adata_met.obs.index)
    print(f"Number of common barcodes: {len(common_barcodes_for_annotation)}")
    
    if len(common_barcodes_for_annotation) == 0:
        raise ValueError("RNA and scMethyl data share no common barcode, cannot transfer annotation.")
    
    # Transfer directly by index alignment
    adata_met.obs["celltype"] = adata_rna.obs.loc[list(common_barcodes_for_annotation), "celltype"]
    
    # Check for missing values
    missing_count = adata_met.obs["celltype"].isna().sum()
    if missing_count > 0:
        print(f"⚠️  Warning: {missing_count} cells could not find corresponding celltype annotation")
    else:
        print(f"✅ Successfully transferred celltype annotation for {len(common_barcodes_for_annotation)} cells")
    
    adata_met.obs["celltype"] = adata_met.obs["celltype"].astype('category')
    
    print(f"\nStatistics after cell type transfer: ")
    print(adata_met.obs['celltype'].value_counts())

Figure Legend (scMethyl UMAP - Leiden Clustering and Sample):

The figure below shows the cell distribution of scMethyl data in UMAP Dimensionality Reduction space, colored by Leiden Clustering results and Sample.

  • X-axis and Y-axis: Two main dimensions after UMAP Dimensionality Reduction (UMAP1 and UMAP2), calculated based on LSI Dimensionality Reduction results.
  • Color(leiden): Different Colors represent different Leiden Clustering results, showing cell clustering based on scMethyl patterns.
  • Color(Sample): Different Colors represent different samples, used to evaluate batch effects between samples.
  • Points: Each point represents a cell.
  • Purpose: Evaluate scMethyl data clustering effect, identify cell subpopulations based on scMethyl patterns, and check batch effects between samples.
python
sc.set_figure_params(scanpy=True, fontsize=12, facecolor='white', figsize=(6, 6))
sc.pl.umap(
    adata_met, 
    color=['leiden', 'Sample'], 
    ncols=2,
    s=35,
    palette=my_palette
)

Figure Legend (scMethyl UMAP - Cell Type):

The figure below shows the cell distribution of scMethyl data in UMAP Dimensionality Reduction space, colored by cell types transferred from RNA data.

  • X-axis and Y-axis: Two main dimensions after UMAP Dimensionality Reduction(UMAP_1 and UMAP_2)。
  • Color: Different Colors represent different cell types, these cell type annotations are transferred from RNA data.
  • Points: Each point represents a cell.
  • Purpose: Evaluate Cell Type Annotation Transfer effect, check distribution patterns of different cell types in scMethyl space, and verify annotation accuracy.
python
warnings.filterwarnings('ignore')
sc.set_figure_params(scanpy=True, fontsize=12, facecolor='white', figsize=(6, 6))
sc.pl.umap(adata_met, color = ['celltype'], s=35, palette=my_palette)

Cell Composition Ratio

Figure Legend (Cell Type Composition Stacked Bar Plot):

The figure below shows the proportional distribution of various cell types in different samples.

  • X-axis: Different Samples, facilitating comparison of cell type composition across samples.
  • Y-axis: Cell Type Proportion, ranging from 0 to 1, representing the percentage of each cell type in the sample.
  • Stacked Bar Chart: Each bar represents a sample, different Colors represent different cell types, bar height represents the total proportion of all cell types in that sample (100%).
  • Color: Different Colors represent different cell types, legend shows names of each cell type.
  • Purpose: Used to compare cell type composition differences between samples, evaluate heterogeneity between samples, and identify sample-specific cell types.
python
def plot_bar_fraction_data_optimized(adata, x_key, y_key, custom_palette=None):
    # 1. Data Preparation
    df = adata.obs[[x_key, y_key]].copy()
    df[y_key] = df[y_key].astype('category')
    df[x_key] = df[x_key].astype('category')
    
    counts = df.groupby([x_key, y_key]).size().reset_index(name='count')
    totals = counts.groupby(x_key)['count'].transform('sum')
    counts['prop'] = counts['count'] / totals
    prop_pivot = counts.pivot(index=x_key, columns=y_key, values='prop').fillna(0)
    
    clusters = prop_pivot.columns.tolist()    
    bar_colors = None
    
    if f'{y_key}_colors' in adata.uns:
        categories = adata.obs[y_key].cat.categories
        uns_colors = adata.uns[f'{y_key}_colors']
        
        # Prevent color count mismatch
        if len(categories) == len(uns_colors):
            color_dict = dict(zip(categories, uns_colors))
            bar_colors = [color_dict.get(c, '#333333') for c in clusters]
    
    # If color not found above, or logic fails, use custom palette
    if bar_colors is None:
        if custom_palette is None:
            custom_palette = sc.pl.palettes.default_20
        
        # Loop to assign Color
        bar_colors = [custom_palette[i % len(custom_palette)] for i in range(len(clusters))]

    return prop_pivot, bar_colors

# --- Plotting ---

# Call function, passing your defined my_palette
prop_pivot, bar_colors = plot_bar_fraction_data_optimized(
    adata_met, 
    x_key="Sample", 
    y_key="celltype", 
    custom_palette=my_palette
)

# Set canvas size
plt.rcParams['figure.dpi'] = 100
fig, ax = plt.subplots(figsize=(5, 5)) 
ax.grid(False)
ax.tick_params(axis='both', which='major', labelsize=9)

# Plot
prop_pivot.plot(
    kind='bar', 
    stacked=True, 
    color=bar_colors, 
    ax=ax, 
    width=0.5,       # Bar width
    edgecolor='none',
    grid=False,
    zorder=3
)

# Beautify axes
ax.set_ylabel('Proportion', fontsize=10)
ax.set_xlabel("")
ax.tick_params(axis='x', rotation=45) # Rotate x-axis labels
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

# Beautify legend
ax.legend(
    title='Cell Type', 
    bbox_to_anchor=(1.05, 1), 
    loc='upper left', 
    frameon=False, 
    fontsize=9, 
    title_fontsize=9,
    alignment="left"
)

plt.tight_layout()
plt.show()

QC Metrics Visualization

This section visualizes the distribution of QC metrics in UMAP space to evaluate overall data quality and potential anomalies.

Figure Legend (QC Metrics UMAP Distribution):

The figure below shows the distribution of QC metrics in UMAP Dimensionality Reduction space for evaluating data quality.

  • X-axis and Y-axis: Two main dimensions after UMAP Dimensionality Reduction(UMAP_1 and UMAP_2)。
  • Color: Color intensity represents value of QC metrics, lighter Color (yellow/white) represents higher values, darker Color (black/purple) represents lower values.
  • Visualization Metrics:
    • total_cpg_number (Total CpG sites): Reflects total scMethyl sites captured per cell, higher values represent better sequencing depth.
    • reads_counts (Sequencing Depth): Reflects the sequencing depth per cell; higher values indicate more sequencing data.
    • genome_cov (Genome Coverage): Quantifies the coverage ratio of sequencing data in genomic regions; higher values indicate more comprehensive coverage.
    • CpG% (CpG scMethyl rate): Shows overall CpG site scMethyl level, usually within 0-100% range.
    • CH% (CH non-CpG scMethyl rate): Reveals modification level of non-canonical scMethyl sites, usually < 5% in mammalian cells.
  • Points: Each point represents a cell.
  • Purpose: Used to identify spatial distribution patterns of QC metrics, detect abnormal cells (e.g., cells with abnormally high or low QC metrics), and assess data quality uniformity.
python
# --- QC Metrics UMAP Visualization ---
plt.rcParams['font.size'] = 10         
plt.rcParams['axes.titlesize'] = 10  
# Visualize total_cpg_number, reads_counts, genome_cov, CpG%, CH%
features = ['total_cpg_number', 'reads_counts', 'genome_cov','CpG%', 'CH%']
sc.pl.umap(
    adata_met, 
    color=features, 
    ncols=3,            
    color_map=my_warm_cmap, 
    vmax='p99',         
    s=8,               
    frameon=False,      
    vmin=0,
    show=False 
)

plt.gcf().set_size_inches(10, 5) 
plt.show()

QC Metrics Violin Plot Visualization

This section uses Violin Plots to display the distribution characteristics of QC metrics across different cell groups (Leiden Clustering and cell types) to evaluate data quality consistency across cell populations.

Figure Legend (QC Metrics Violin Plot):

The figure below shows the distribution characteristics of QC metrics across different cell groups (Leiden Clustering and cell types).

  • X-axis: Different cell groups, including Leiden Clustering(leiden) and cell types (celltype),facilitating comparison of QC metrics differences across different cell groups.
  • Y-axis: Values of each QC metric, including total_cpg_number (Total CpG sites), CpG% (CpG scMethyl rate) and CH% (CH scMethyl rate).
  • Violin Plot: Each Violin Plot represents a cell group, width represents cell density in that value range, wider means more cells. Violin Plot also includes boxplot, showing median, quartiles and anomaly values.
  • Grouping Method:
    • Group by Leiden Clustering: Shows QC metric differences between clusters, used to assess cluster quality.
    • Group by Cell Type: Shows QC metric differences between cell types, used to assess the rationality of cell type annotation.
  • Purpose: for evaluating data quality uniformity across cell groups, identifying cell groups with QC metric anomalies, and verifying reliability of clustering and annotation results.
python
warnings.filterwarnings('ignore')
plt.rcParams['figure.figsize'] = (6, 4)
plt.rcParams['axes.grid'] = False

violin_params = {
    'palette': my_palette,   
    'stripplot': False,      
    'inner': 'box',          
    'linewidth': 1,        
    'size': 1,               
    'cut': 0, 
    'multi_panel': True      
}

print("\n=== QC Metrics Distribution (By Cluster) ===")
# Group by Leiden Cluster
sc.pl.violin(
    adata_met, 
    keys=['total_cpg_number', 'CpG%', 'CH%'], 
    groupby='leiden', 
    **violin_params  
)
plt.show()

print("\n=== QC Metrics Distribution (By Cell Type) ===")
# Group by Cell Type (with rotation)
sc.pl.violin(
    adata_met, 
    keys=['total_cpg_number', 'CpG%', 'CH%'], 
    groupby='celltype', 
    rotation=45,     
    **violin_params
)
plt.show()
python
adata_met.write_h5ad('adata_met.h5ad')
python
0 comments·0 replies