scMethyl + RNA Multi-omics: MethSCAn Differential Methylation Analysis

Author: SeekGene

Time: 23 min

Words: 4.7k words

Updated: 2026-02-28

Reads: 0 times

scMethyl + RNA-seq Analysis Guide Notebooks

Module Introduction

This module is based on MethSCAn, used for systematic analysis of single-cell methylation data.

MethSCAn is a command-line toolkit supporting preprocessing, quality control, variation region identification, and downstream analysis of single-cell methylation data. By detecting genome-wide methylation variations, it achieves the following core analysis goals:

VMR Detection: Automatically identifies Variably Methylated Regions (VMRs) between cells, which are often associated with cell type, developmental stage, or disease state.
Cell Clustering: Clusters cells based on the methylation feature matrix to effectively distinguish different cell types or states, with a methodological approach similar to single-cell RNA-seq clustering.
Differential Methylation Analysis (DMR Analysis): Identifies Differentially Methylated Regions (DMRs) between different cell populations to characterize cell heterogeneity at the epigenetic level.

Input File Preparation

This module supports starting analysis from the compact_data directory. The compact_data directory is typically generated by the following data preprocessing workflows:

Data Preprocessing Pipeline

Methylation-aware Alignment and Extraction: Use allcools analysis pipeline for alignment and methylation site extraction, generating allc files containing methylation and unmethylation site information.
Format Conversion: Use allc_to_bismarkCov.py to convert allc files to bismark coverage format (.cov files).
Data Preparation: Use methscan prepare to convert .cov files to efficient compact_data format for subsequent analysis.

Note:

Although MethSCAn supports allc format input, its processing efficiency is low. It is generally not recommended for direct use in formal analysis; converting to bismark coverage format first is suggested.
In general, the SeekGene Cloud Platform has completed allc_to_bismarkCov.py and methscan prepare steps by default, and you can directly use the existing compact_data directory for analysis.

File Structure Example

compact_data directory structure is as follows:

text

compact_data/
├── chr1.npz
├── chr2.npz
├── chr3.npz
├── ...
├── chrX.npz
├── column_header.txt
└── cell_stats.csv

MethSCAn Analysis Pipeline

Script Description: The following code is an initialization example for the MethSCAn analysis workflow, including environment configuration, path settings, and helper function definitions for subsequent steps.

options(warn = -1)
# Load required R packages
suppressPackageStartupMessages({
  library(ggplot2)      # Used for data visualization
  library(dplyr)        # Used for data processing
  library(readr)        # Used for reading CSV files
  library(tidyverse)    # Data Organization and Visualization Toolkit
  library(data.table)   # Efficient data reading and processing
  library(irlba)        # Used for iterative PCA (imputing missing values)
  library(Seurat)       # Single-cell Analysis Tools (Version v5)
  library(Matrix)       # Sparse matrix processing
})

# --- Input parameter configuration ---

## methscan_path: MethSCAn command path
# If methscan is in system PATH, set to NULL
# If methscan is in a different environment, specify full path
methscan_path <- "/jp_envs/envs/methscan/bin/methscan"

## outdir: Result output directory
outdir <- "./DMRs"
# Create output directory (if not exists)
dir.create(outdir, recursive = TRUE, showWarnings = FALSE)

## compact_data_dir: compact_data directory (generated by methscan prepare)
compact_data_dir <- file.path("../../data/AY1768874914782/methylation/demoWTJW969-task-1/WTJW969/methscan/compact_data")

## n_threads: Parallel calculation thread count
# Note: MethSCAn automatically detects the system's maximum thread count; setting n_threads higher than the system maximum will cause an error
# Recommended to set less than or equal to the number of available CPU cores
n_threads <- 8

# Define function: Call MethSCAn command in R
run_methscan <- function(command, args = character(), methscan_path = NULL) {
  # Determine command path: Prioritize parameter, then global variable, finally system PATH
  if (is.null(methscan_path) && exists("methscan_path", envir = .GlobalEnv)) {
    methscan_path <- get("methscan_path", envir = .GlobalEnv)
  }
  cmd <- if (is.null(methscan_path)) "methscan" else methscan_path
  
  cat("执行：", cmd, command, paste(args, collapse = " "), "\n")
  
  # Create temporary files to capture stdout and stderr
  stdout_file <- tempfile()
  stderr_file <- tempfile()
  
  # Execute command, capturing output
  result <- system2(
    cmd, 
    args = c(command, args), 
    wait = TRUE,
    stdout = stdout_file,
    stderr = stderr_file
  )
  
  return(invisible(result))
}

Key Parameter Settings

This section summarizes key parameters involved in the MethSCAn analysis workflow. These parameters will be called in subsequent methscan filter, methscan scan, methscan matrix, and methscan diff steps to control key behaviors like cell quality filtering, variable region detection, and differential analysis.

`methscan filter` Parameters

For specific meanings and recommended values of key parameters for low-quality cell filtering (e.g., min_sites, min_meth, max_meth), please refer to the parameter description table before the filtering execution code in Section 3.2.2.

`methscan diff` Parameters

Key parameters involved in Differentially Methylated Region (DMR) analysis for comparing differences between clusters are as follows:

Parameter Name	Chinese Meaning	Default Value	Detailed Description
`min_cells`	Min Cells Requirement	6	QC Parameter. Requires region to have coverage in at least this many cells per group for differential analysis. E.g., 6 means only regions covered in >=6 cells per group are tested. Note: Filters low-coverage regions, improving DMR stability. Can lower for small groups but may increase false positives.

`threads` Parameter (General Parameter)

Used to control parallel calculation threads for compute-intensive steps (e.g., scan, matrix, diff).

Parameter Name	Chinese Meaning	Default/Suggested Value	Detailed Description
`threads`	Parallel Calculation Threads	8	Performance Optimization Parameter. Accelerates compute-intensive steps (e.g., `scan`, `matrix`, `diff`). Recommended to set based on available CPU cores. Important: MethSCAn automatically detects system max threads. Program errors if `threads` > system max. Recommended to set <= available CPU cores.

Best Practice Recommendations

Choice of methscan filter parameters:

**Before parameter setting, it is strongly recommended to run the quality assessment step (see Section 3.2.1)** to identify potential outlier cells by visualizing the distribution of methylation sites and global methylation percentage, and determine the range of `min_sites`, `min_meth`, and `max_meth` accordingly.

`min_sites` parameter strategy: When the number of cells retained after filtering is small, this threshold can be appropriately lowered (e.g., to 20000), but may introduce more noise; in cases of high data quality, the threshold can be appropriately raised to enhance result reliability.

Global methylation percentage thresholds (`min_meth` and `max_meth`): Global methylation levels usually differ between species, for example:
*   Mouse: Usually 70-80%.
*   Human: Usually 60-70%.

About Computational Resources:

`methscan scan`, `matrix`, `diff`: Default use of 8c64g resources. **Under common data scales, the entire run usually takes 1-2 hours**. If running in the SeekGene Cloud Platform environment, it is recommended to reserve a longer task time window appropriately.

About min_cells parameter for DMR analysis:

Default value: 6 cells per group

Parameter function: Requires a sufficient number of cells in each comparison group to have sequencing coverage in a certain region, thereby improving the statistical stability of differential methylation analysis.

Experience reference: In practice, when the number of cells in both comparison clusters is greater than about 200, it is usually easier to obtain DMR results meeting significance thresholds; in cases of fewer cells, it may be difficult to detect significant differences, and one can try appropriately lowering the `min_cells` parameter for exploratory analysis and make a comprehensive judgment based on biological background.

# --- Platform description ---
# In the SeekGene Cloud Platform environment, the methscan prepare step is usually completed in the upstream workflow,
# Therefore, this workflow defaults to directly using the existing compact_data directory.
# The commented code below serves as a reference implementation for local environments or when re-execution of the prepare step is needed;
# If you need to re-run methscan prepare, please uncomment accordingly.


# # Check if input directory exists
# if (!dir.exists(input_cov_dir)) {
#   stop("Input directory does not exist: ", input_cov_dir,
#        "\nPlease run allc_to_bismarkCov.py to convert data first")
# }
# 
# # Get all coverage files
# cov_files <- list.files(input_cov_dir, full.names = TRUE, pattern = "\\.cov$")
# if (length(cov_files) == 0) {
#   stop("No .cov files found in ", input_cov_dir)
# }
# 
# cat("Found ", length(cov_files), " coverage files\n")
# 
# # Prepare output directory
# compact_data_dir <- file.path(outdir, "compact_data")
# if (!dir.exists(compact_data_dir)) {
#   dir.create(compact_data_dir, recursive = TRUE)
# }
# 
# # Execute methscan prepare
# run_methscan(
#   command = "prepare",
#   args = c(cov_files, compact_data_dir)
# )

# Directly use existing compact_data directory
if (!dir.exists(compact_data_dir)) {
  stop("compact_data 目录不存在：", compact_data_dir, 
       "\n请确保已在云平台完成 methscan prepare 步骤")
}
cat("✓ 使用已有的 compact_data 目录：", compact_data_dir, "\n")

output

✓ 使用已有的 compact_data 目录： ../../data/AY1768874914782/methylation/demoWTJW969-task-1/WTJW969/methscan/compact_data

Quality Assessment and Low-quality Cell Filtering

Before performing cell filtering, it is recommended to systematically assess cell quality and determine appropriate filtering parameters combined with visualization results.

Cell Statistics Reading and Quality Metric Visualization

# Read cell_stats.csv file
cell_stats_path <- file.path(compact_data_dir,  "cell_stats.csv")

cell_stats <- read_csv(cell_stats_path)

#if (file.exists(cell_stats_path)) {
#  cell_stats <- read_csv(cell_stats_path)
#  cat("Successfully read cell statistics, total", nrow(cell_stats), "cells\n")
#  head(cell_stats)
#} else {
#  stop("cell_stats.csv file not found, please check if the path is correct")
#}

# Visualize number of methylation sites and global methylation percentage
# Determine filtering parameters based on visualization results

options(repr.plot.width = 8, repr.plot.height = 7, repr.plot.res = 150)
p <- cell_stats %>% 
  ggplot(aes(x = global_meth_frac * 100, y = n_obs)) +
  geom_point(alpha = 0.6) +
  # Add red vertical lines at x-coordinates 65 and 86 (example thresholds)
  geom_vline(xintercept = c(65, 86), color = "red", linetype = "solid", linewidth = 1) +
  # Add red horizontal line at y-coordinate 60000 (example threshold)
  geom_hline(yintercept = 60000, color = "red", linetype = "solid", linewidth = 1) +
  labs(
    x = "Global DNA methylation %", 
    y = "# of observed CpG sites",
    title = "Cell Quality Assessment"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 14, face = "bold")
  )

print(p)

output

Rows: 2196 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): cell_name
dbl (3): n_obs, n_meth, global_meth_frac

ℹ Use \`spec()\` to retrieve the full column specification for this data.
ℹ Specify the column types or set \`show_col_types = FALSE\` to quiet this message.

Perform Filtering Based on Visualization Results

Based on the above visualization results, determine filtering parameters and execute cell filtering:

methscan filter Key Parameters:

Parameter Name	Chinese Meaning	Default/Suggested Value	Detailed Description
`min_sites`	Min Observed Methylation Sites	25000-60000	QC Parameter. Requires cell to have > this number of observed methylation sites. Removes low-quality cells with insufficient depth for accurate methylation status inference. Note: Threshold depends on sequencing depth; higher depth allows higher threshold.
`min_meth`	Min Global Methylation %	60-70	QC Parameter. Requires cell global methylation % > this value. Removes abnormal cells due to incomplete bisulfite conversion or technical issues. Note: Normal levels vary (e.g., Mouse ~70-80%, Human ~60-70%).
`max_meth`	Max Global Methylation %	70-85	QC Parameter. Requires cell global methylation % < this value. Removes abnormally hypermethylated cells due to technical issues. Note: Normal levels vary by species/cell type; set reasonable thresholds based on data.
`threads`	Parallel Calculation Threads	8	Accelerates compute-intensive steps (e.g., `scan`, `matrix`). Recommended to set based on system resources, usually available CPU cores.

# Filtering parameters (please adjust based on QC results)
min_sites <- 60000    # Minimum observed methylation sites
min_meth <- 65       # Minimum global methylation percentage
max_meth <- 86        # Maximum global methylation percentage

# Prepare output directory
filtered_data_dir <- file.path(outdir, "filtered_data")
if (!dir.exists(filtered_data_dir)) {
  dir.create(filtered_data_dir, recursive = TRUE)
}

# Execute methscan filter
run_methscan(
  command = "filter",
  args = c(
    "--min-sites", min_sites,
    "--min-meth", min_meth,
    "--max-meth", max_meth,
    compact_data_dir,
    filtered_data_dir
  )
)

output

执行： /jp_envs/envs/methscan/bin/methscan filter --min-sites 60000 --min-meth 65 --max-meth 86 ../../data/AY1768874914782/methylation/demoWTJW969-task-1/WTJW969/methscan/compact_data ./DMRs/filtered_data

Data Smoothing: methscan smooth

Before VMR/DMR detection, methylation data smoothing is required. methscan smooth treats all single cells as pseudo-bulk samples to calculate genome-wide smoothed average methylation levels. This result is a necessary input for subsequent VMR detection and methylation matrix construction.

Output Results: Smoothed result files are output to the ${outdir}/filtered_data/smoothed directory.

# Execute methscan smooth
run_methscan(
  command = "smooth",
  args = filtered_data_dir
)

output

执行： /jp_envs/envs/methscan/bin/methscan smooth ./DMRs/filtered_data

VMR Detection: methscan scan

This step detects Variable Methylation Regions (VMR). methscan scan automatically identifies regions with significant intercellular methylation variation genome-wide.

Output Results: Generates result file in BED format, containing genomic coordinates and methylation variance of VMRs.

Alternative: If the analysis target is limited to specific genomic regions (such as promoters or gene body regions), you can directly provide the corresponding BED file, skip the scan step, and directly use methscan matrix to analyze these regions.

# VMR Output File
vmr_bed_file <- file.path(outdir, "VMRs.bed")

# Execute methscan scan
result <- run_methscan(
  command = "scan",
  args = c(
    "--threads", n_threads,
    filtered_data_dir,
    vmr_bed_file
  )
)

output

执行： /jp_envs/envs/methscan/bin/methscan scan --threads 8 ./DMRs/filtered_data ./DMRs/VMRs.bed

VMR Matrix Generation: methscan matrix

This step constructs a methylation matrix based on identified VMR regions for subsequent cell clustering and related downstream analysis.

Output File Description:

The output directory VMR_matrix contains the following files:

Filename	Description
`mean_shrunken_residuals.csv.gz`	Shrunken Residuals Matrix (Recommended for downstream). Less affected by random coverage changes, better for reduction/clustering.
`methylation_fractions.csv.gz`	Methylation Fraction Matrix. Methylation fraction (0-1) for each region, representing average methylation level.
`methylated_sites.csv.gz`	Methylated Sites Matrix (Numerator). Number of methylated CpG sites in each region.
`total_sites.csv.gz`	Total Sites Matrix (Denominator). Total number of CpG sites with sequencing coverage in each region.

Matrix Features:

Many missing values: Due to low coverage in single-cell sequencing, some genomic regions may lack valid sequencing reads in some cells, resulting in many missing values.
Methylation values are in small-denominator fraction form: Methylation levels are usually represented as small-denominator fractions (e.g., 1/1, 2/2, 1/3), a typical feature of single-cell methylation data.
Additional site statistics: The output directory also contains the number of methylated CpG sites and CpG sites with sequencing coverage for each region, corresponding to the numerator and denominator in methylation ratio calculation respectively.

# VMR Matrix Output Directory
vmr_matrix_dir <- file.path(outdir, "VMR_matrix")

# Execute methscan matrix
result <- run_methscan(
  command = "matrix",
  args = c(
    "--threads", n_threads,
    vmr_bed_file,
    filtered_data_dir,
    vmr_matrix_dir
  )
)

# Check if output files are generated
if (result == 0 && dir.exists(vmr_matrix_dir)) {
  expected_files <- c(
    "mean_shrunken_residuals.csv.gz",
    "methylation_fractions.csv.gz",
    "methylated_sites.csv.gz",
    "total_sites.csv.gz"
  )
  existing_files <- list.files(vmr_matrix_dir)
  found_files <- expected_files[expected_files %in% existing_files]
  
  if (length(found_files) > 0) {
    cat("✓ VMR 矩阵文件已生成：\n")
    for (f in found_files) {
      file_path <- file.path(vmr_matrix_dir, f)
      file_size <- file.info(file_path)$size / 1024 / 1024  # MB
      cat("  -", f, sprintf("(%.2f MB)\n", file_size))
    }
  } else {
    warning("警告：未找到预期的输出文件，请检查输出目录：", vmr_matrix_dir)
  }
} else if (result != 0) {
  stop("methscan matrix 执行失败，请检查上面的错误信息")
}

output

执行： /jp_envs/envs/methscan/bin/methscan matrix --threads 8 ./DMRs/VMRs.bed ./DMRs/filtered_data ./DMRs/VMR_matrix
✓ VMR 矩阵文件已生成：
- mean_shrunken_residuals.csv.gz (291.34 MB)
- methylation_fractions.csv.gz (60.59 MB)
- methylated_sites.csv.gz (43.48 MB)
- total_sites.csv.gz (56.35 MB)

VMR-based Clustering Analysis

This section uses Seurat for dimensionality reduction and clustering analysis based on the VMR methylation matrix to identify cell populations with similar methylation characteristics.

Read VMR Matrix

# Use fread to read compressed files
# Use configured outdir variable
vmr_matrix_path <- file.path(outdir, "VMR_matrix", "mean_shrunken_residuals.csv.gz")

if (file.exists(vmr_matrix_path)) {
  cat("正在读取 VMR 矩阵...\n")
  meth_mtx <- fread(vmr_matrix_path)
  cat("矩阵维度：", nrow(meth_mtx), "行（细胞）x", ncol(meth_mtx) - 1, "列（VMR 区域）\n")
  
  # Set first column as row names and convert to matrix
  row_names <- meth_mtx[[1]]
  meth_mtx <- as.matrix(meth_mtx[, -1])
  rownames(meth_mtx) <- row_names
  
  cat("矩阵转换完成！\n")
  cat("缺失值比例：", round(mean(is.na(meth_mtx)) * 100, 2), "%\n")
} else {
  stop("找不到 VMR 矩阵文件，请检查路径是否正确")
}

output

正在读取 VMR 矩阵...n 矩阵维度： 2170 行（细胞）x 106967 列（VMR 区域）
矩阵转换完成！
缺失值比例： 75.31 %

Iterative PCA Imputation

The VMR methylation matrix is structurally similar to the single-cell RNA-seq count matrix, but its distinguishing feature is containing a large number of missing values. This is mainly due to low coverage in single-cell methylation sequencing, resulting in many genomic regions lacking effective sequencing reads in some cells.

To address the missing value issue, this analysis uses an iterative PCA method for imputation. Specifically: missing values are first initialized with 0; then PCA is performed on the filled matrix, and predicted values reconstructed from PCA replace the original 0s; this process repeats iteratively until imputed values stabilize between adjacent iterations.

This method provides more reasonable estimates for missing methylation values by iteratively optimizing low-dimensional structure, providing a more complete and stable data foundation for subsequent dimensionality reduction and clustering analysis.

# Define function for iterative PCA imputation of missing values
prcomp_iterative <- function(x, n = 15, n_iter = 50, min_gain = 0.001, ...) {
  mse <- rep(NA, n_iter)
  na_loc <- is.na(x)
  x[na_loc] <- 0  # zero is our first guess

  for (i in 1:n_iter) {
    prev_imp <- x[na_loc]  # what we imputed in the previous round
    # PCA on the imputed matrix
    pr <- prcomp_irlba(x, center = F, scale. = F, n = n, ...)
    # impute missing values with PCA
    new_imp <- (pr$x %*% t(pr$rotation))[na_loc]
    x[na_loc] <- new_imp
    # compare our new imputed values to the ones from the previous round
    mse[i] <- mean((prev_imp - new_imp) ^ 2)
    # if the values didn't change a lot, terminate the iteration
    gain <- mse[i] / max(mse, na.rm = T)
    if (gain < min_gain) {
      message(paste(c("\n\nTerminated after ", i, " iterations.")))
      break
    }
  }
  message(paste(c("\n\nTerminated after ", i, " iterations. Gain: ", round(gain, 6))))
  pr$mse_iter <- mse[1:i]
  list(pr = pr, imputed_matrix = x)
}

cat("开始执行迭代 PCA 填补缺失值...\n")
result_residual <- meth_mtx %>% 
  scale(center = T, scale = F) %>% 
  prcomp_iterative(n = 15)

# Extract imputed residual matrix
residual_mtx_imputed <- result_residual$imputed_matrix
cat("缺失值填补完成！\n")

output

开始执行迭代 PCA 填补缺失值...n

Terminated after 32 iterations.

Terminated after 32 iterations. Gain: 0.000974

缺失值填补完成！

Create Seurat Object

Use the imputed residual matrix (residual_mtx_imputed) as input for the Seurat object. When constructing the Seurat object, counts, data, and scale.data matrices all use the same imputed residual matrix.

Explanation: Subsequent PCA, dimensionality reduction, and clustering analysis are actually based on the scale.data matrix.

# Create Seurat object using imputed residual matrix
# counts, data, and scale.data matrices use the same imputed residual matrix
seurat_obj <- CreateSeuratObject(
    counts = as(t(residual_mtx_imputed), "sparseMatrix"),
    project = "MethSCAn",
    assay = "VMR"
)
# Set data and scale.data to the same matrix as counts
# Note: Subsequent PCA, dimensionality reduction, and clustering analysis actually use the scale.data matrix
seurat_obj@assays$VMR@data <- seurat_obj@assays$VMR@counts
seurat_obj@assays$VMR@scale.data <- as.matrix(seurat_obj@assays$VMR@counts)

PCA, UMAP Dimensionality Reduction and Clustering

# 1. PCA Dimensionality Reduction
# Use Seurat's RunPCA function
# This automatically reads data from scale.data layer and calculates PCA
n_pcs = 30
seurat_obj <- RunPCA(
    seurat_obj,
    features = rownames(seurat_obj),  # Use all VMRs
    npcs = n_pcs,  # Calculate top 30 principal components
    assay = "VMR",
    verbose = FALSE
)

options(repr.plot.width = 8, repr.plot.height = 7, repr.plot.res = 150)

# Visualize Elbow Plot to determine number of PCs
ElbowPlot(seurat_obj, ndims = 30)

# 2. Determine number of PCs based on Elbow Plot
# You can adjust this parameter based on the Elbow Plot results
n_pcs_use <- 10

# 3. UMAP Dimensionality Reduction
seurat_obj <- RunUMAP(seurat_obj, dims = 1:n_pcs_use, reduction = "pca", verbose = FALSE)
cat("UMAP 降维完成！\n")

# 4. Build KNN Graph
seurat_obj <- FindNeighbors(seurat_obj, dims = 1:n_pcs_use, reduction = "pca", verbose = FALSE)
cat("KNN 图构建完成！\n")

# 5. Clustering (resolution parameter can be adjusted; larger value means more clusters)
seurat_obj <- FindClusters(seurat_obj, resolution = 0.5, verbose = FALSE)
cat("聚类完成！\n")

# 6. Visualize Clustering Results
DimPlot(seurat_obj, group.by = 'VMR_snn_res.0.5', label = TRUE) + 
  ggtitle("UMAP (resolution = 0.5)")

output

UMAP 降维完成！
KNN 图构建完成！
聚类完成！

Save Results

# Save Seurat object
output_file <- "MethSCAn.rds"
saveRDS(seurat_obj, file = output_file)

clust_tbl <- seurat_obj@meta.data
clust_tbl$cell <- row.names(clust_tbl)
clust_tbl %>%
  mutate(cell_group = case_when(
    seurat_clusters == "0" ~ "-",
    seurat_clusters == "1" ~ "1",
    seurat_clusters == "2" ~ "2",
    seurat_clusters == "3" ~ "-",
    seurat_clusters == "4" ~ "-",
    seurat_clusters == "5" ~ "-",
    seurat_clusters == "6" ~ "-",
    seurat_clusters == "7" ~ "-",
    seurat_clusters == "8" ~ "-"    
  )) %>% 
  dplyr::select(cell, cell_group) %>%
  write_csv("cell_groups.csv", col_names=F)

Differential Methylated Region Analysis: methscan diff

Differentially Methylated Region (DMR) analysis compares methylation differences between cells under different conditions, such as different clusters, cell types, or custom groups. It is generally recommended to perform DMR analysis based on clustering results or predefined groups after completing clustering analysis.

Input File Preparation

Input Requirements:

Data Directory: filtered_data directory (must have completed methscan smooth step).
Cell Group File (cell_groups_file): CSV format file, where the first column is cell name (cell barcode) and the second is group label. Label rules: group_A and group_B represent the two cell groups involved in comparison; - represents other cells not participating in this comparison.

File Format Example:

csv

cell_01,-
cell_02,-
cell_03,-
cell_04,group_A
cell_05,group_A
cell_06,group_A
cell_07,group_B
cell_08,group_B
cell_09,group_B

# Cell grouping file (for DMR analysis)
cell_groups_file <- "cell_groups.csv"  # Please modify according to actual situation

# DMR Output File
dmr_bed_file <- file.path(outdir, "DMRs.bed")

# Check if cell grouping file exists
if (file.exists(cell_groups_file)) {
  # Execute methscan diff
  run_methscan(
    command = "diff",
    args = c(
      "--threads", n_threads,
      filtered_data_dir,
      cell_groups_file,
      dmr_bed_file
    )
  )
} else {
  warning("细胞分组文件不存在：", cell_groups_file, 
          "\n跳过 DMR 分析。如需进行 DMR 分析，请先准备细胞分组文件。")
}

output

执行： /jp_envs/envs/methscan/bin/methscan diff --threads 8 ./DMRs/filtered_data cell_groups.csv ./DMRs/DMRs.bed

Output File Format Description

Output File: DMRs.bed

Results are output in BED format, containing the following fields:

Column No.	Column Name	Description	Example Value
1	`chromosome`	Chromosome name	17
2	`DMR_start`	DMR start position (genomic coordinate)	58270529
3	`DMR_end`	DMR end position (genomic coordinate)	58285529
4	`t_statistic`	Welch's t-test statistic	-12.85232790631932
5	`n_sites`	Total number of methylation sites in this region (CpG count)	561
6	`n_cells_group1`	Number of cells with coverage in Group 1	150
7	`n_cells_group2`	Number of cells with coverage in Group 2	66
8	`meth_frac_group1`	Average methylation level of Group 1 (0–1)	0.3396857458943857
9	`meth_frac_group2`	Average methylation level of Group 2 (0–1)	0.8429240842342447
10	`low_group_label`	Label of the group with lower methylation level	group_1
11	`p`	Raw p-value (unadjusted)	0.0
12	`adjusted_p`	Adjusted p-value (FDR, Benjamini-Hochberg method)	0.0

Result Interpretation: The low_group_label column indicates the group label with lower methylation levels; the adjusted_p column is the significance level after multiple testing correction, usually recommended to use adjusted_p < 0.05 as the screening threshold; the sign of t_statistic indicates the direction of difference (negative value indicates lower methylation in group_1, positive value indicates higher methylation in group_1).

Generate Significant Differential Methylated Region File

Based on the DMRs.bed file, screen for significantly differentially methylated regions (adjusted_p < 0.05), and group results based on low_group_label to generate two BED files.

Input: DMRs.bed: DMR detection result file

Output: Generate the following two significant DMR BED files based on the value of low_group_label:

DMRs_significant_group_A.bed: DMR regions with significantly lower methylation levels in group_A.
DMRs_significant_group_B.bed: DMR regions with significantly lower methylation levels in group_B.

The above output files are all in BED format, containing three columns: chromosome, start, end. If chromosome names are numbers or sex chromosomes (X/Y), "chr" prefix is automatically added; if "chr" prefix already exists, it is kept unchanged.

Explanation: Only keep significant DMRs with adjusted_p < 0.05, automatically add "chr" prefix to numeric and sex chromosomes (keep unchanged if already present), and skip rows where low_group_label is empty or "-".

# Generate significant DMR files (grouped by low_group_label, adjusted_p < 0.05)
# Add "chr" prefix to autosomes (numbers) and sex chromosomes (XY)
# Check if DMRs.bed file exists
if (file.exists(dmr_bed_file)) {
  cat("开始生成显著的 DMR 文件...\n")
  
  # Build awk command
  awk_script <- sprintf('
BEGIN {
    outdir = "%s"
}
# Skip header or empty lines
NR == 1 && $1 ~ /^chromosome|^#/ { next }
NF < 12 { next }
# Filter rows with adjusted_p < 0.05
$12 < 0.05 {
    # Get low_group_label (10th column) as filename
    group_label = $10
    if (group_label == "" || group_label == "-") next
    
    # Process chromosome names: If numeric or X/Y, add "chr" prefix
    chrom = $1
    if (chrom ~ /^[0-9]+$/ || chrom ~ /^[XY]$/) {
        chrom = "chr" chrom
    }
    
    # Build output filename
    output_file = outdir "/DMRs_significant_" group_label ".bed"
    
    # Output first 3 columns (chromosome, start, end) to corresponding file
    print chrom "\\t" $2 "\\t" $3 > output_file
}
END {
    print "处理完成！已生成显著的 DMR 文件。"
}
', outdir)
  
  # Execute awk command in R
  awk_cmd <- paste("awk -F'\\t'", shQuote(awk_script), shQuote(dmr_bed_file))
  result <- system(awk_cmd, intern = TRUE)
  
  # Display output
  cat(paste(result, collapse = "\n"), "\n")
  
  # Check generated files
  significant_files <- list.files(
    path = outdir,
    pattern = "^DMRs_significant_.*\\.bed$",
    full.names = TRUE
  )
  
  if (length(significant_files) > 0) {
    cat("\n生成的显著 DMR 文件：\n")
    for (f in significant_files) {
      n_lines <- length(readLines(f, warn = FALSE))
      cat("  -", basename(f), "（", n_lines, " 个区域）\n", sep = "")
    }
  } else {
    warning("未找到生成的显著 DMR 文件")
  }
} else {
  warning("DMRs.bed 文件不存在：", dmr_bed_file, 
          "\n请先执行 methscan diff 命令")
}

output

开始生成显著的 DMR 文件...n 处理完成！已生成显著的 DMR 文件。

生成的显著 DMR 文件：
-DMRs_significant_1.bed（23951 个区域）
-DMRs_significant_2.bed（30189 个区域）

scMethyl + RNA Multi-omics: MethSCAn Differential Methylation Analysis ​

Module Introduction ​

Input File Preparation ​

Data Preprocessing Pipeline ​

File Structure Example ​

MethSCAn Analysis Pipeline ​

Key Parameter Settings ​

methscan filter Parameters ​

methscan diff Parameters ​

threads Parameter (General Parameter) ​

Best Practice Recommendations ​

Quality Assessment and Low-quality Cell Filtering ​

Cell Statistics Reading and Quality Metric Visualization ​

Perform Filtering Based on Visualization Results ​

Data Smoothing: methscan smooth ​

VMR Detection: methscan scan ​

VMR Matrix Generation: methscan matrix ​

VMR-based Clustering Analysis ​

Read VMR Matrix ​

Iterative PCA Imputation ​

Create Seurat Object ​

PCA, UMAP Dimensionality Reduction and Clustering ​

Save Results ​

Differential Methylated Region Analysis: methscan diff ​

Input File Preparation ​

Output File Format Description ​

Generate Significant Differential Methylated Region File ​

scMethyl + RNA Multi-omics: MethSCAn Differential Methylation Analysis

Module Introduction

Input File Preparation

Data Preprocessing Pipeline

File Structure Example

MethSCAn Analysis Pipeline

Key Parameter Settings

`methscan filter` Parameters

`methscan diff` Parameters

`threads` Parameter (General Parameter)

Best Practice Recommendations

Quality Assessment and Low-quality Cell Filtering

Cell Statistics Reading and Quality Metric Visualization

Perform Filtering Based on Visualization Results

Data Smoothing: methscan smooth

VMR Detection: methscan scan

VMR Matrix Generation: methscan matrix

VMR-based Clustering Analysis

Read VMR Matrix

Iterative PCA Imputation

Create Seurat Object

PCA, UMAP Dimensionality Reduction and Clustering

Save Results

Differential Methylated Region Analysis: methscan diff

Input File Preparation

Output File Format Description

Generate Significant Differential Methylated Region File