Skip to content

scMethyl + RNA Multi-omics: Doublet Detection Workflow

Author: SeekGene
Time: 25 min
Words: 4.8k words
Updated: 2026-03-18
Reads: 0 times
scMethyl + RNA-seq Notebooks

Module Introduction

This module is based on ALLCools and MethylScrublet algorithms, designed to identify and filter doublets from scMethyl sequencing data.

Doublets refer to technical artifacts where two or more cells accidentally stick together and are sequenced as a single "cell". Doublets introduce mixed expression/methylation profiles, seriously interfering with subsequent cell clustering and differential analysis (e.g., doublets might be wrongly identified as new intermediate cell types).

This module provides three complementary doublet identification methods to improve accuracy:

  • Method 1: Transcriptome Cell Type Identification: Based on RNA sequencing data, identifies doublets by detecting abnormal co-expression of marker genes from different cell types.
  • Method 2: scMethyl Reads Count Identification: Identifies potential doublets with abnormally high coverage by analyzing the methylation sequencing depth (Reads count) distribution of each cell.
  • Method 3: scMethyl Rate Identification (MethylScrublet): Uses the MethylScrublet algorithm to calculate a "doublet score" for each cell by simulating artificial doublets and comparing them with observed data.

💡 Note
This analysis applies to single-cell multi-omics data (containing both RNA and methylation information) and also supports doublet identification using only methylation data.

Input File Preparation

This module requires the following input files:

Required Files

  • Single-cell RNA sequencing data: Gene expression matrix (filtered_feature_bc_matrix directory).
  • scMethyl data: Methylation dataset in MCDS format (.mcds file).
  • Barcode mapping file (Optional): Used to associate cell barcodes between RNA and methylation data if the reagent type is DD-MET3.

File Structure Example

text
data/
├── AY1768874914782
│   └── methylation
│       └── demoWTJW969-task-1
│           └── WTJW969
│               ├── allcools_generate_datasets
│               │   └── WTJW969.mcds # scMethyl data
│               ├── filtered_feature_bc_matrix # Transcriptome expression matrix
│               └── split_bams
│                   ├── WTJW969_cells.csv
│                   └── filtered_barcode_reads_counts.csv
└── AY1768876253533
    └── methylation
        └── demo4WTJW880-task-1
            └── WTJW880
                ├── allcools_generate_datasets
                │   └── WTJW880.mcds # scMethyl data
                ├── filtered_feature_bc_matrix # Transcriptome expression matrix
                └── split_bams
                    ├── WTJW880_cells.csv
                    └── filtered_barcode_reads_counts.csv

DD-M_bUCB3_whitelist.csv  # Barcode mapping file (Optional)
python
# --- Import necessary Python packages ---

# System operations and file handling
import os
import re
import glob
import warnings

# ALLCools related packages: for scMethyl data processing and doublet identification
from ALLCools.mcds import MCDS
from ALLCools.clustering import (
    tsne, significant_pc_test, log_scale, lsi, 
    binarize_matrix, filter_regions, cluster_enriched_features, 
    ConsensusClustering, Dendrogram, get_pc_centers
)
from ALLCools.clustering.doublets import MethylScrublet
from ALLCools.plot import *

# Single-cell analysis tools
import scanpy as sc

# Data processing and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from matplotlib.lines import Line2D
import seaborn as sns
from matplotlib.patches import Rectangle

# Other tools
from scipy import sparse
python
# --- File Path Configuration ---
# Configure paths based on the new file hierarchy
# Format: {sample_name: {'top_dir': top_directory, 'demo_dir': demo_directory}}
sample_path_config = {
    "WTJW969": {"top_dir": "AY1768874914782", "demo_dir": "demoWTJW969-task-1"},
    "WTJW880": {"top_dir": "AY1768876253533", "demo_dir": "demo4WTJW880-task-1"}
}

# Data root directory (default is "data")
data_root = "data"

# Current sample name (modify as needed)
current_sample = "WTJW969"

## reagent_type: Reagent kit type ('DD-MET3' or 'DD-MET5')
## DD-MET3: RNA and methylation data use different Barcodes, require mapping file for association
## DD-MET5: RNA and methylation data use the same Barcode, no mapping file needed
reagent_type = "DD-MET5"

Doublet Identification Methods

This module provides three complementary doublet identification methods. It is recommended to use them comprehensively to improve identification accuracy.

Method 1: Doublet Identification Based on Transcriptome Cell Type Identification

Principle:

In single-cell RNA sequencing data analysis, a common strategy for doublet identification is to detect abnormal co-expression of marker genes from different cell types.

Logic:

When marker genes that should theoretically be expressed in different cell types are highly expressed simultaneously in the same cell group, it suggests that the group may not be derived from a single cell type, but rather doublets composed of cells from two different cell types (i.e., two cells were mistakenly identified as one during library preparation or sequencing).

Example:

For example, when T cell marker genes (such as CD3D) and B cell marker genes (such as CD79A) are highly expressed simultaneously in the same cell group, the group is likely to represent doublets composed of T cells and B cells.

RNA Transcriptome Data Preprocessing and Dimensionality Reduction Clustering

python
warnings.filterwarnings('ignore')
# --- RNA Transcriptome Data Preprocessing and Dimensionality Reduction Clustering ---
# Build path based on new file hierarchy: data/{top_dir}/methylation/{demo_dir}/{sample}/filtered_feature_bc_matrix
top_dir = sample_path_config[current_sample]['top_dir']
demo_dir = sample_path_config[current_sample]['demo_dir']
rna_path = os.path.join('../../',data_root, top_dir, 'methylation', demo_dir, current_sample, 'filtered_feature_bc_matrix')
adata_rna = sc.read_10x_mtx(rna_path)

# Calculate mitochondrial gene metrics (for QC)
adata_rna.var["mt"] = adata_rna.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata_rna, qc_vars=["mt"], percent_top=None, log1p=False, inplace=True)

# Data Normalization: Normalize total counts of each cell to 10,000
sc.pp.normalize_total(adata_rna, target_sum=1e4)

# Log Transformation: log(x + 1) transformation to make data closer to normal distribution
sc.pp.log1p(adata_rna)

# Identify Highly Variable Genes: Select top 2000 genes with largest expression variation for subsequent analysis
sc.pp.highly_variable_genes(adata_rna, n_bins=100, n_top_genes=2000)

# Save raw data to adata.raw, then keep only highly variable genes
adata_rna.raw = adata_rna
adata_rna = adata_rna[:, adata_rna.var.highly_variable]

# Scaling: Z-score scaling, clip max value to 10
sc.pp.scale(adata_rna, max_value=10)

# PCA Dimensionality Reduction
sc.tl.pca(adata_rna)

# Build KNN Graph (for UMAP and clustering)
sc.pp.neighbors(adata_rna)

# UMAP Visualization
sc.tl.umap(adata_rna)

# Leiden Clustering (resolution 0.8)
sc.tl.leiden(adata_rna, resolution=0.8)

# Visualize UMAP results, colored by cluster
sc.set_figure_params(scanpy=True, fontsize=12, facecolor='white', figsize=(6, 6))
sc.pl.umap(adata_rna, color='leiden', s=35, palette='Set2', frameon=True)

Methylome Data Preprocessing and Dimensionality Reduction Clustering

python
# --- Methylome Data Preprocessing and Dimensionality Reduction Clustering ---
# Build path based on new file hierarchy: data/{top_dir}/methylation/{demo_dir}/{sample}/allcools_generate_datasets/{sample}.mcds
top_dir = sample_path_config[current_sample]['top_dir']
demo_dir = sample_path_config[current_sample]['demo_dir']
mcds_path = os.path.join('../../',data_root, top_dir, 'methylation', demo_dir, current_sample, 'allcools_generate_datasets', f'{current_sample}.mcds')
mcds = MCDS.open(
    mcds_path,
    obs_dim="cell",
    var_dim="chrom20k"
)

# Extract methylation score matrix (using hypo-score, i.e., hypomethylation score)
adata_methy = mcds.get_score_adata(mc_type="CGN", quant_type="hypo-score")

# Binarize Matrix: Mark regions with methylation rate > 0.95 as hypermethylated
binarize_matrix(adata_methy, cutoff=0.95)

# Filter Low Quality Regions: Remove regions with low variability or insufficient coverage
filter_regions(adata_methy)

# LSI (Latent Semantic Indexing) Dimensionality Reduction: Similar to PCA, suitable for sparse matrices
lsi(adata_methy, algorithm="arpack", obsm="X_lsi")

# Significant PC Test: Keep only significant PCs with p < 0.1
significant_pc_test(adata_methy, p_cutoff=0.1, obsm="X_lsi", update=True)

# Build KNN Graph (using LSI results, 30 neighbors)
sc.pp.neighbors(adata_methy, use_rep='X_lsi', n_neighbors=30)

# UMAP Visualization
sc.tl.umap(adata_methy)

# Leiden Clustering (resolution 0.5)
sc.tl.leiden(adata_methy, resolution=0.5)

Cell Type Marker Gene Expression Analysis

Identify cell groups (potential doublets) with abnormal co-expression by visualizing the expression patterns of marker genes for different cell types.

python
# --- Visualize expression patterns of marker genes for different cell types ---
# Marker gene list:
# - T cells: CD3D, CD8A, CD4
# - NK cells: NKG7
# - B cells: CD79A, MS4A1, JCHAIN
# - Monocytes/Macrophages: LYZ
# - Dendritic cells: LILRA4

dp = sc.pl.dotplot(
    adata_rna,
    ["CD3D", "CD8A", "CD4", "NKG7", "CD79A", "MS4A1", "JCHAIN", "LYZ", "LILRA4"],
    groupby='leiden',
    standard_scale='var',
    cmap='GnBu',
    show=False
)

dp['mainplot_ax'].figure.set_size_inches(10, 6)

for label in dp['mainplot_ax'].get_xticklabels():
    label.set_rotation(45)
    label.set_ha('right') 
    label.set_rotation_mode('anchor')
    
dp['mainplot_ax'].set_title('Marker Genes Expression', pad=10, fontsize=12)

plt.show()
  • According to the dot plot above, cluster 9 may be doublets. This cluster simultaneously expresses marker genes of multiple different cell types (such as T cell and B cell marker genes), presenting an abnormal co-expression pattern, suggesting that cells in this cluster may be doublets of two different cell types.

Cell UMI Count Distribution Visualization

Figure Legend: This figure shows the UMI count distribution of all cells, used to identify potential doublets with abnormally high expression.

  • X-axis: Order of cells sorted by total UMI counts (nCount_RNA) from high to low.
  • Y-axis: Log2 transformed value of the total transcripts (nCount_RNA, total UMI count) for the corresponding cell. Higher values indicate higher RNA abundance detected in that cell.
  • Scatter Plot (Middle): Each point represents a cell, with different colors indicating different cell clusters (Leiden clustering). Cells located at the front of the rank with significantly high UMI counts may represent abnormally high-expressing cells, suggesting potential doublet or multiplet events.
  • Density Curve (Top): Shows the overall nCount_RNA distribution characteristics of different cell groups. Different colors correspond to different Leiden clusters, and their peaks reflect the typical RNA abundance levels of each cell group.
  • Density Curve (Right): Shows the overall distribution of log2(nCount_RNA) by cell group, used to compare differences in RNA abundance across cell groups.
python
warnings.filterwarnings('ignore')
MYCOLOR = None
plot_data = adata_rna.obs[["total_counts", "leiden"]]
plot_data = plot_data.sort_values('total_counts', ascending=False).reset_index(drop=True)
plot_data['order'] = plot_data.index + 1
plot_data['log2_nCount_RNA'] = np.log2(plot_data['total_counts'] + 1)

# Create figure - reduce size for better viewing in Jupyter
fig = plt.figure(figsize=(10, 6), dpi=80) 

# Use GridSpec to place legend on the right, avoiding obstruction of x-axis labels
gs = fig.add_gridspec(2, 3, 
                      width_ratios=[7, 2, 2.5],  # Main plot, right density plot, legend (larger legend area)
                      height_ratios=[2, 6],  # Top density plot, main plot
                      hspace=0.1, wspace=0.1)  # Increase spacing

ax_main = fig.add_subplot(gs[1, 0])
ax_top = fig.add_subplot(gs[0, 0], sharex=ax_main)
ax_right = fig.add_subplot(gs[1, 1], sharey=ax_main)
ax_legend = fig.add_subplot(gs[:, 2])  # Legend in the right column

# Hide axis labels
ax_top.tick_params(axis='x', labelbottom=False)
ax_right.tick_params(axis='y', labelleft=False)
ax_top.grid(False)
ax_right.grid(False)

# Get color palette
palette = sns.color_palette('tab20', len(plot_data['leiden'].unique()))
color_map = dict(zip(plot_data['leiden'].unique(), palette))

# 1. Main Scatter Plot
scatter = sns.scatterplot(
    data=plot_data,
    x='order',
    y='log2_nCount_RNA',
    hue='leiden',
    palette=MYCOLOR if MYCOLOR else color_map,
    s=15,
    alpha=0.5,
    edgecolor=None,
    ax=ax_main,
    linewidth=0.5,
    legend=False  # Do not show legend on main plot
)

# Set Grid
ax_main.grid(True, linestyle='--', alpha=0.3)
# Set X-axis Label
ax_main.set_xlabel('Order(Sorted_by_Total_Counts)', fontsize=12)
ax_main.set_ylabel('log2(Total_Counts)', fontsize=12)

# 2. Top Density Plot (truncate values < 0)
for celltype, color in color_map.items():
    celltype_data = plot_data[plot_data['leiden'] == celltype]
    
    # Calculate Density
    if len(celltype_data) > 1:
        from scipy.stats import gaussian_kde
        # Use KDE
        kde = gaussian_kde(celltype_data['order'])
        x_density = np.linspace(celltype_data['order'].min(), celltype_data['order'].max(), 200)
        y_density = kde(x_density)
        
        # Normalize and truncate negative values
        y_density = np.maximum(y_density, 0)
        
        # Plot fill (start from 0)
        ax_top.fill_between(x_density, 0, y_density, 
                           color=color, alpha=0.5)
        
        # Plot density line
        ax_top.plot(x_density, y_density, 
                   color=color, alpha=0.7, linewidth=1)

# Set top density plot y-axis lower limit to 0
ax_top.set_ylim(bottom=0)
ax_top.set_ylabel('Density', fontsize=12)

# 3. Right Density Plot (truncate values < 0)
for celltype, color in color_map.items():
    celltype_data = plot_data[plot_data['leiden'] == celltype]
    
    # Calculate Density
    if len(celltype_data) > 1:
        from scipy.stats import gaussian_kde
        # Use KDE
        kde = gaussian_kde(celltype_data['log2_nCount_RNA'])
        y_density = np.linspace(celltype_data['log2_nCount_RNA'].min(), 
                               celltype_data['log2_nCount_RNA'].max(), 200)
        x_density = kde(y_density)
        
        # Normalize and truncate negative values
        x_density = np.maximum(x_density, 0)
        
        # Plot fill (start from 0)
        ax_right.fill_betweenx(y_density, 0, x_density, 
                              color=color, alpha=0.5)
        
        # Plot density line
        ax_right.plot(x_density, y_density, 
                     color=color, alpha=0.7, linewidth=1)

# Set right density plot x-axis lower limit to 0
ax_right.set_xlim(left=0)
ax_right.set_xlabel('Density', fontsize=12)

# 4. Create Legend
# Get unique leiden values and corresponding colors
unique_labels = plot_data['leiden'].unique()
leiden_labels = sorted(unique_labels, key=lambda x: int(x))
colors = [color_map[label] for label in leiden_labels]

# Create legend handles
handles = [plt.Line2D([0], [0], marker='o', color='w', 
                      markerfacecolor=color, markersize=4, 
                      label=label, alpha=0.7) 
           for label, color in zip(leiden_labels, colors)]

# Create legend on right axis
ax_legend.axis('off')  # Turn off axis
# Place legend
legend = ax_legend.legend(handles=handles, 
                         loc='center left',  # Center left
                         bbox_to_anchor=(0.1, 0.5),  # Position
                         ncol=1,  # Single column
                         fontsize=12,
                         title='Clusters',
                         title_fontsize=12,
                         frameon=False,
                         fancybox=True,
                         framealpha=0.8,
                         alignment='left')

# Adjust layout
plt.tight_layout(rect=[0, 0.05, 1, 0.95])
plt.show()
# Limit figure display size
plt.rcParams['figure.max_open_warning'] = 0

Method 2: Doublet Identification Based on Methylation Reads Count

Principle:

Doublets usually contain genetic material from two or more cells, so their library size (total Reads count) is often significantly higher than that of normal single cells.

Analysis Logic:

  1. Distribution Visualization: Visualize the distribution of methylation Reads counts for all cells.
  2. Relative Change Analysis: Calculate the relative change rate of Reads counts between adjacent ranked cells to identify "elbow points" (sudden changes) in the distribution.
  3. Threshold Determination: Determine the threshold for high Reads counts based on the elbow points to filter out potential doublets.
python
# --- Data Preparation: Load Methylation Reads Count Data ---
# Build path based on new file hierarchy: data/{top_dir}/methylation/{demo_dir}/{sample}/split_bams/filtered_barcode_reads_counts.csv
top_dir = sample_path_config[current_sample]['top_dir']
demo_dir = sample_path_config[current_sample]['demo_dir']
read_count_path = os.path.join('../../',data_root, top_dir, 'methylation', demo_dir, current_sample, 'split_bams', 'filtered_barcode_reads_counts.csv')
read_count = pd.read_csv(read_count_path)

# Add Leiden cluster information from RNA analysis
rna_meta = adata_rna.obs[['leiden']].reset_index()
rna_meta = rna_meta.rename(columns = {'index': 'barcode'})

# Handle Barcode matching based on reagent type
if reagent_type == "DD-MET3":
    # If DD-MET3, use mapping file to convert RNA Barcode to Methylation Barcode
    whitelist_path = os.path.join('../../',data_root, 'DD-M_bUCB3_whitelist.csv')
    whitelist = pd.read_csv(whitelist_path, header=None)
    whitelist.columns = ['rna_bc', 'met_bc']
    
    # Create mapping dictionary
    rna_to_met = dict(zip(whitelist['rna_bc'], whitelist['met_bc']))
    
    # Convert RNA Barcodes to Methylation Barcodes
    rna_meta['gex_cb'] = rna_meta['barcode'].map(lambda x: x.split('-')[0])
    rna_meta['barcode'] = rna_meta['gex_cb'].map(rna_to_met) + "-1"
    rna_meta = rna_meta.dropna(subset=['barcode'])
python
if reagent_type == "DD-MET3":
    read_count = read_count.merge(rna_meta)
elif reagent_type == "DD-MET5":
    rna_meta = rna_meta.rename(columns = {'gex_cb': 'barcode'})
    read_count = read_count.merge(rna_meta)

Visualization of Methylation Reads Count Distribution per Cell

Figure Legend: This figure shows the distribution of methylation Reads counts for all cells, used to identify potential doublets with abnormally high sequencing coverage.

  • X-axis: Order of cells sorted by total Reads counts from high to low.
  • Y-axis: Log2 transformed value of the total Reads counts for the corresponding cell. Higher values indicate higher methylation sequencing coverage.
  • Scatter Plot (Middle): Each point represents a cell, with different colors indicating different cell types. Cells located at the front of the rank with significantly high Reads counts suggest abnormally high coverage, potentially corresponding to doublet or multiplet events.
  • Density Curve (Top): Shows the overall Reads count distribution characteristics of different cell types. Different colors correspond to different cell types, and their peaks reflect the typical sequencing coverage levels of each cell type.
  • Density Curve (Right): Shows the overall distribution of log2(Reads Count) by cell type, used to compare differences in sequencing coverage across cell types.
python
warnings.filterwarnings('ignore')
MYCOLOR = None
read_count = read_count.sort_values('reads_counts', ascending=False).reset_index(drop=True)
read_count['order'] = read_count.index + 1
read_count['log2_reads_counts'] = np.log2(read_count['reads_counts'] + 1)

# Create figure - reduce size for better viewing in Jupyter
fig = plt.figure(figsize=(10, 6), dpi=80)

# Use GridSpec to place legend on the right
gs = fig.add_gridspec(2, 3, 
                      width_ratios=[7, 2, 2.5],  # Main plot, right density plot, legend
                      height_ratios=[2, 6],  # Top density plot, main plot
                      hspace=0.1, wspace=0.1)  # Spacing

ax_main = fig.add_subplot(gs[1, 0])
ax_top = fig.add_subplot(gs[0, 0], sharex=ax_main)
ax_right = fig.add_subplot(gs[1, 1], sharey=ax_main)
ax_legend = fig.add_subplot(gs[:, 2])  # Legend
ax_top.grid(False)
ax_right.grid(False)

# Hide axis labels
ax_top.tick_params(axis='x', labelbottom=False)
ax_right.tick_params(axis='y', labelleft=False)

# Get color palette
palette = sns.color_palette('tab20', len(read_count['leiden'].unique()))
color_map = dict(zip(read_count['leiden'].unique(), palette))

# 1. Main Scatter Plot
scatter = sns.scatterplot(
    data=read_count,
    x='order',
    y='log2_reads_counts',
    hue='leiden',
    palette=MYCOLOR if MYCOLOR else color_map,
    s=8,
    alpha=0.7,
    ax=ax_main,
    edgecolor=None,
    linewidth=0.5,
    legend=False
)

# Set Grid
ax_main.grid(True, linestyle='--', alpha=0.3)
# Set X-axis Label
ax_main.set_xlabel('Order(Sorted_by_Reads_Counts)', fontsize=12)
ax_main.set_ylabel('log2(Reads_Counts)', fontsize=12)

# 2. Top Density Plot (truncate values < 0)
for celltype, color in color_map.items():
    celltype_data = read_count[read_count['leiden'] == celltype]
    
    # Calculate Density
    if len(celltype_data) > 1:
        from scipy.stats import gaussian_kde
        # Use KDE
        kde = gaussian_kde(celltype_data['order'])
        x_density = np.linspace(celltype_data['order'].min(), celltype_data['order'].max(), 200)
        y_density = kde(x_density)
        
        # Normalize and truncate negative values
        y_density = np.maximum(y_density, 0)
        
        # Plot fill
        ax_top.fill_between(x_density, 0, y_density, 
                           color=color, alpha=0.5)
        
        # Plot density line
        ax_top.plot(x_density, y_density, 
                   color=color, alpha=0.7, linewidth=1)

# Set top density plot y-axis lower limit to 0
ax_top.set_ylim(bottom=0)
ax_top.set_ylabel('Density', fontsize=12)

# 3. Right Density Plot (truncate values < 0)
for celltype, color in color_map.items():
    celltype_data = read_count[read_count['leiden'] == celltype]
    
    # Calculate Density
    if len(celltype_data) > 1:
        from scipy.stats import gaussian_kde
        # Use KDE
        kde = gaussian_kde(celltype_data['log2_reads_counts'])
        y_density = np.linspace(celltype_data['log2_reads_counts'].min(), 
                               celltype_data['log2_reads_counts'].max(), 200)
        x_density = kde(y_density)
        
        # Normalize and truncate negative values
        x_density = np.maximum(x_density, 0)
        
        # Plot fill
        ax_right.fill_betweenx(y_density, 0, x_density, 
                              color=color, alpha=0.5)
        
        # Plot density line
        ax_right.plot(x_density, y_density, 
                     color=color, alpha=0.7, linewidth=1)

# Set right density plot x-axis lower limit to 0
ax_right.set_xlim(left=0)
ax_right.set_xlabel('Density', fontsize=12)

# 4. Create Legend
# Get unique leiden values and corresponding colors
unique_labels = plot_data['leiden'].unique()
leiden_labels = sorted(unique_labels, key=lambda x: int(x))
colors = [color_map[label] for label in leiden_labels]

# Create legend handles
handles = [plt.Line2D([0], [0], marker='o', color='w', 
                      markerfacecolor=color, markersize=4, 
                      label=label, alpha=0.7) 
           for label, color in zip(leiden_labels, colors)]

# Create legend on right axis
ax_legend.axis('off')
# Place legend
legend = ax_legend.legend(handles=handles, 
                         loc='center left', 
                         bbox_to_anchor=(0.1, 0.5), 
                         ncol=1, 
                         fontsize=12,
                         title='Clusters',
                         title_fontsize=12,
                         frameon=False,
                         fancybox=True,
                         framealpha=0.8,
                         alignment='left')

# Adjust layout
plt.tight_layout(rect=[0, 0.05, 1, 0.95])
plt.show()
# Limit figure display size
plt.rcParams['figure.max_open_warning'] = 0

Cell Reads Count Relative Change Analysis

Figure Legend: This figure shows the relative change analysis of cell Reads counts, used to identify "elbow points" in the Reads count distribution, thereby determining thresholds for screening doublets and low-quality cells.

  • X-axis: Order of cells sorted by total Reads counts from high to low. The left side represents cells with abnormally high Reads counts, the right side represents cells with low Reads counts, and the middle area corresponds to cells with moderate sequencing coverage.
  • Y-axis: Relative change rate between adjacent cells in the Reads count ranking, defined as (pre - post) / pre. Values close to 0 indicate small changes in Reads counts between adjacent cells, while larger values indicate significant changes.
  • Red Dashed Lines: Indicate the selected Reads count threshold range. The left threshold is used to exclude cells with abnormally high Reads counts (potential doublets), and the right threshold is used to exclude low-quality cells with excessively low Reads counts. The Reads count changes for cells between the two dashed lines are relatively stable, suitable for subsequent analysis.
python
# --- Prepare Data: Sort by Reads Count Descending ---
p4_data = read_count.copy()

# Sort by reads_counts descending
p4_data = p4_data.sort_values('reads_counts', ascending=False).reset_index(drop=True)
p4_data['order'] = p4_data.index + 1

# --- Calculate relative change between adjacent cells: value = (pre - post) / pre ---
plot_df = pd.DataFrame({
    'order': range(1, len(p4_data)),
    'value': (p4_data['reads_counts'].iloc[:-1].values - p4_data['reads_counts'].iloc[1:].values) / 
             p4_data['reads_counts'].iloc[:-1].values
})
python
# --- Visualize Reads Count Relative Change Distribution (with LOESS Smoothing) ---
plt.figure(figsize=(8, 6), dpi=70) 
# Raw data line plot
plt.plot(plot_df['order'], plot_df['value'], alpha=0.7, linewidth=1, label='Raw data')

# LOESS Smoothing (using moving average simulation)
span = 0.1
window_size = max(1, int(len(plot_df) * span))
smoothed = plot_df['value'].rolling(window=window_size, center=True, min_periods=1).mean()

# Plot smoothed curve
plt.plot(plot_df['order'], smoothed, color='red', linewidth=1.5, label='LOESS Smoothed')

# Add threshold lines (adjust according to actual situation)
plt.axvline(x=200, linestyle='--', color='red', alpha=0.7, label='High Reads Threshold')
plt.axvline(x=2000, linestyle='--', color='red', alpha=0.7, label='Low Reads Threshold')

plt.xlabel('Order(Sorted_by_Reads_Count)', fontsize=12)
plt.ylabel('Relative_Difference\n(pre-post)/pre', fontsize=12)
plt.title('Reads Distribution with LOESS Smoothing', fontsize=12)
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

UMAP Highlighting of High/Low Reads Cells

Figure Legend: This figure highlights potential doublets and empty cells in the UMAP dimensionality reduction space, used to evaluate the spatial distribution characteristics of these abnormal cells in the cell atlas.

  • Left Plot (Potential Multi-Cells): Highlights potential doublets in the UMAP space. Red points indicate the top 200 cells with the highest Reads counts (candidate doublets), while light gray points indicate other normal cells. By observing the distribution of red points in the UMAP space, one can assess whether potential doublets are concentrated in specific regions or scattered throughout the cell atlas.
  • Right Plot (Potential Empty Cells): Highlights potential empty cells in the UMAP space. Red points indicate the bottom 200 cells with the lowest Reads counts (candidate empty cells), while light gray points indicate other normal cells. This plot is used to evaluate the spatial distribution pattern of potential empty cells and determine if they correspond to specific low-quality cell populations.
python
# --- Select Potential Doublets and Empty Cells ---
# Select top 200 cells with highest Reads counts as potential doublets
multi_cells = p4_data.head(200)['barcode'].tolist()
# Select bottom 200 cells with lowest Reads counts as potential empty cells
empty_cells = p4_data.tail(200)['barcode'].tolist()

# --- Mark in AnnData object ---
# Initialize highlight column, default to 'Normal'
adata_methy.obs['cell_multi_highlight'] = 'Normal'
# Mark potential doublets
adata_methy.obs.loc[adata_methy.obs.index.isin(multi_cells), 'cell_multi_highlight'] = 'Potential Multi-Cells'
# Mark potential empty cells
adata_methy.obs.loc[adata_methy.obs.index.isin(empty_cells), 'cell_multi_highlight'] = 'Potential Empty Cells'

# --- Visualization ---
sc.set_figure_params(figsize=(5, 5))
fig, axs = plt.subplots(1, 2, figsize=(12, 5))

# 1. Highlight Potential Multi-Cells
sc.pl.umap(adata_methy, color='cell_multi_highlight', groups=['Potential Multi-Cells'], 
           palette={'Potential Multi-Cells': 'red', 'Normal': 'lightgrey', 'Potential Empty Cells': 'lightgrey'}, 
           ax=axs[0], show=False, title='Potential Multi-Cells', s=20)

# 2. Highlight Potential Empty Cells
sc.pl.umap(adata_methy, color='cell_multi_highlight', groups=['Potential Empty Cells'], 
           palette={'Potential Empty Cells': 'red', 'Normal': 'lightgrey', 'Potential Multi-Cells': 'lightgrey'}, 
           ax=axs[1], show=False, title='Potential Empty Cells', s=20)

plt.tight_layout()
plt.show()

Method 3: Doublet Identification Based on Methylation Rate (MethylScrublet)

Principle:

MethylScrublet is a doublet identification algorithm specifically designed for single-cell methylation data. It simulates artificial doublets by randomly combining existing single-cell data and calculating a "doublet score" for each observed cell.

Workflow:

  1. Simulate Doublets: Randomly sample pairs of cells from the observed data and average their methylation profiles to generate simulated doublets.
  2. Feature Extraction: Perform dimensionality reduction (e.g., PCA) on the mixed data (observed cells + simulated doublets).
  3. Calculate Score: In the low-dimensional feature space, calculate the K-nearest neighbor density of each cell relative to the simulated doublets. Cells closer to simulated doublets receive higher doublet scores.
  4. Automatic Thresholding: Automatically determine the score threshold based on the bimodal distribution of scores to distinguish doublets from singlets.

Run MethylScrublet Analysis

python
# --- Open Methylation Data in MCDS format (1Mb resolution, for MethylScrublet) ---
# Build path based on new file hierarchy: data/{top_dir}/methylation/{demo_dir}/{sample}/allcools_generate_datasets/{sample}.mcds
top_dir = sample_path_config[current_sample]['top_dir']
demo_dir = sample_path_config[current_sample]['demo_dir']
mcds_path = os.path.join('../../',data_root, top_dir, 'methylation', demo_dir, current_sample, 'allcools_generate_datasets', f'{current_sample}.mcds')
mcds = MCDS.open(
    mcds_path,
    obs_dim="cell",
    var_dim="chrom1M"
)

# --- Extract Methylation Count (mc) and Coverage (cov) Matrices ---
mc = mcds[f'chrom1M_da'].sel({'count_type': 'mc'})
cov = mcds[f'chrom1M_da'].sel({'count_type': 'cov'})

# --- If data size is small, load into memory to improve calculation speed ---
load = True
if load and (mcds.get_index('cell').size <= 20000):
    mc.load()
    cov.load()
python
# --- Initialize MethylScrublet Object ---
scrublet = MethylScrublet(
    sim_doublet_ratio=2.0,  # Number of simulated doublets = real cells * 2.0
    n_neighbors=10,  # KNN neighbors
    expected_doublet_rate=0.06,  # Expected doublet rate
    stdev_doublet_rate=0.02,  # Standard deviation of doublet rate
    metric='euclidean',  # Distance metric
    random_state=0,  # Random seed
    n_jobs=-1  # Use all available CPU cores
)

# --- Run Doublet Identification Algorithm ---
# clusters=None means no prior cell type information is used
score, judge = scrublet.fit(mc, cov, clusters=None)

# --- Save Results to AnnData Object ---
adata_methy.obs['met_doublet_score'] = score  # Doublet score (0-1, higher means more likely to be a doublet)
adata_methy.obs['met_is_doublet'] = judge  # Doublet judgment result (True/False)

# --- Convert judgment result to categorical type ---
adata_methy.obs['met_is_doublet'] = adata_methy.obs['met_is_doublet'].astype('category')

Visualization of Doublet Scores and Judgment Results

Figure Legend: This figure shows the doublet identification results of the MethylScrublet algorithm, including the doublet score distribution and the final judgment results.

  • Left Plot (met_doublet_score): Doublet score distribution. Brighter colors (yellow) indicate that the cell is more similar to simulated doublet features and has a higher probability of being a doublet.
  • Right Plot (met_is_doublet): Doublet judgment results.
    • Orange Points (True): Judged as Doublets.
    • Blue Points (False): Judged as normal Singlets.
python
warnings.filterwarnings('ignore')
# --- Set Figure Parameters ---
sc.set_figure_params(scanpy=True, fontsize=12, facecolor='white', figsize=(6, 6))
# --- Visualize Doublet Score and Judgment Result in UMAP Space ---
sc.pl.umap(
    adata_methy, 
    color=['met_doublet_score', 'met_is_doublet'], 
    ncols=2,
    s=30
)
python
adata_methy.obs["m_cb"] = adata_methy.obs.index
# Output doublet judgment result file: {sample}_doublet.txt
output_file = f"{current_sample}_doublet.txt"
# Save columns: m_cb, met_is_doublet, cell_multi_highlight, rna_doublet
# Note: 'rna_doublet' column might not exist if not calculated in previous steps, ensure it exists or remove from list if not needed.
# Assuming 'rna_doublet' logic is similar to 'cell_multi_highlight' or derived from RNA analysis.
# If 'rna_doublet' is not defined, we can create a placeholder or remove it.
if 'rna_doublet' not in adata_methy.obs.columns:
    adata_methy.obs['rna_doublet'] = 'Unknown' # Placeholder

adata_methy.obs[["m_cb","met_is_doublet","cell_multi_highlight","rna_doublet"]].to_csv(output_file, sep="\t", index=None)
0 comments·0 replies