scATAC-seq Differential Accessibility and Functional Enrichment Analysis Tutorial
Background Introduction
ATAC_DiffEnrich is a workflow for differential accessibility and enrichment analysis of scATAC-seq in single-cell multi-omics data. It aims to systematically identify Differentially Accessible Regions (DARs) between different cell populations or conditions, and resolve their potential transcription factors (TFs) and affected biological processes/signaling pathways. This workflow is based on the Signac/Seurat ATAC workflow, combining presto's fast differential testing, JASPAR/TFBSTools motif resources, chromVAR motif activity assessment, and clusterProfiler GO/KEGG enrichment analysis, providing an integrated interpretation from "differential peaks → regulatory factors → functional pathways".
Core Principle
Differential Accessibility Detection (DARs)
Evaluate accessibility differences between cell groups (or groupings) using Wilcoxon rank-sum test (presto::wilcoxauc), perform multiple testing correction, and obtain significant DAR sets.
Peak-to-Gene Linking
Associate significant differential peaks with nearby genes (e.g., near TSS) using ClosestFeature for subsequent functional enrichment and pathway interpretation.
Transcription Factor Inference (Motif/ChromVAR)
Perform enrichment analysis on differential peaks based on JASPAR motif set (FindMotifs), and quantify global motif preference and cell-level motif activity changes using chromVAR.
Functional and Pathway Enrichment (GO/KEGG)
Map peak-proximal genes to human/mouse annotation libraries, use clusterProfiler for GO/KEGG enrichment, characterizing affected biological processes and signaling pathways.
Purpose and Significance
- Cell type/state-specific regulatory landscape: Identify open regulatory elements and core TFs in specific cell populations.
- Mechanistic resolution: Support regulatory network hypotheses with multi-layer evidence from differential peaks—motif—chromVAR activity—GO/KEGG.
- Condition comparison and intervention clues: Locate perturbed regulatory pathways and candidate targets in case-control or treated-untreated groups.
- Enhanced interpretability: Translate unannotated peak signals into readable gene and pathway-level conclusions.
Tutorial Contents
- Differential Peaks Analysis: Comparison by cell group or condition, outputting significant DAR tables and visualization.
- Top peaks display: Draw CoveragePlot for representative peaks with high LogFC.
- Motif Enrichment and Activity: Motif enrichment of differential peaks, motif sequence plots, and chromVAR UMAP display.
- Peak-proximal genes: Output table of genes near significant differential peaks.
- GO/KEGG Enrichment: Enrich peak-adjacent genes, generating bar/bubble plots and summary tables.
- Interactive result browsing: Significant DARs, motifs, enrichment results support filtering and export.
Expected Analysis Results
- List of Significant Differential Peaks: Significant DARs and their statistical metrics summarized by cell group.
- Representative Signal Plot: Coverage plot of Top differential peaks for intuitive assessment of peak-level differences.
- Motif and TF Clues: Enriched TF motifs, corresponding sequence plots, and cell-level motif activity changes.
- Functional pathway overview: GO/KEGG enrichment results and visualization, pointing to affected key biological processes.
- One-stop output directory: Contains differential tables, peak-gene tables, motif/enrichment plots, and comprehensive summary files for downstream reporting and reuse.
Parameter Settings
Please mount the cloud platform workflow data to be analyzed first. Refer to the Jupyter usage tutorial for mounting methods.
Specific parameter entry:
- rds: Mounted workflow-related rds data, located in /home/{UserName}/workspace/project/{UserName}/, e.g., project data /home/shumeng/workspace/data/AY1752167131383/
- meta: metadata file in the same directory as rds
- outdir: Analysis result save path
- clusters_col: Cell type column name, select the label corresponding to the cell type or cluster to be analyzed. If you want to analyze annotated cell types, select their corresponding label, e.g., CellAnnotation, used with cell types.
- celltypes: Cell types, multiple selection, select cell types or clustering results to analyze, such as Monocyte and Macrophage.
- species: Species information
- group_comparison: Group comparison, select group names for comparative analysis. For example, to compare gene activity between Ctrl and STIM groups, select the label names corresponding to Ctrl and STIM groups, used with case_group: and control_group: below
- case_group: Case group, such as the STIM group mentioned above.
- control_group: Control group, such as the Ctrl group mentioned above.
# Parameters
outdir = "/home/shumeng/workspace/project/shumeng/"
rds = "/home/shumeng/workspace/data/AY1752167131383/input.rds"
meta = "/home/shumeng/workspace/data/AY1752167131383/meta.tsv"
species = "mouse"
clusters_col = "wknn_res.0.5_d30_l2_50"
celltypes = "0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18"
group_comparison = ""
case_group = ""
control_group = ""Environment Preparation
Load R Packages
Please select the 'common_r' environment for this integration tutorial
suppressPackageStartupMessages(suppressWarnings({
library(Signac)
library(Seurat)
library(JASPAR2020)
library(TFBSTools)
library(BSgenome.Hsapiens.UCSC.hg38)
library(BSgenome.Mmusculus.UCSC.mm10)
library(patchwork)
library(ggplot2)
library(presto)
library(clusterProfiler)
library(stringr)
library(tibble)
library(dplyr)
library(DOSE)
library(org.Hs.eg.db)
library(org.Mm.eg.db)
library(enrichplot)
library(foreach)
library(doParallel)
library(base64enc)
library(DT)
library(KEGG.db)
}))# Create necessary output directories
output <- paste0(outdir,'/output')
celltypes <- strsplit(celltypes,",")[[1]]
# pct <- as.numeric(pct)
# logfc <- as.numeric(logfc)
# pval_adj <- as.numeric(pval_adj)
# top_n <- as.numeric(top_n)
cl <- makeCluster(min(detectCores(), 25))
registerDoParallel(cl)
options(future.globals.maxSize = 10000 * 1024^2)
dir.create(paste0(output,'/DIFF/01_diffPeak'), recursive = TRUE)
dir.create(paste0(output,'/DIFF/02_topPeak'), recursive = TRUE)
dir.create(paste0(output,'/DIFF/03_motif'), recursive = TRUE)
dir.create(paste0(output,'/DIFF/04_closestFeature'), recursive = TRUE)
dir.create(paste0(output,'/clusterProfiler/GO'), recursive = TRUE)
dir.create(paste0(output,'/clusterProfiler/KEGG'), recursive = TRUE)# Data loading
obj <- readRDS(rds)
if (meta != ""){
meta <- read.table(meta,header=T,sep='\t',check.names=F)
rownames(meta) <- meta$barcodes
obj <- AddMetaData(obj, meta)
}
obj <- subset(obj, subset = !!sym(clusters_col) == celltypes)
n_fragments <- length(obj@assays$ATAC@fragments)
for (i in 1:n_fragments) {
original_path <- obj@assays$ATAC@fragments[[i]]@path
filename <- basename(original_path)
obj@assays$ATAC@fragments[[i]]@path <- paste0(dirname(dirname(outdir)), '/', filename)
}DefaultAssay(obj) <- "ATAC"
Idents(obj) <- clusters_colMotif Resources and chromVAR Initialization
Motif analysis infers potential transcription factors (TFs) and their activity changes from Differentially Accessible Regions (DARs). This workflow performs motif enrichment (FindMotifs) on significant differential peaks and combines it with chromVAR computed motif deviation scores for cell-level activity visualization, forming an evidence loop of "peak-level difference → TF clues → cell activity".
Load JASPAR2020 PFM matrices by species, map motifs to peak sets, and run chromVAR using the corresponding reference genome; set species and database identifiers required for subsequent enrichment and annotation.
Functional Overview
Get human/mouse JASPAR PFM set:
getMatrixSet(JASPAR2020, opts = list(species = 9606/10090))Write motifs to object:
AddMotifs(object = obj, genome = BSgenome.*, pfm = pfm)Calculate motif deviation scores:
RunChromVAR(object = obj, genome = BSgenome.*)Unify downstream species identifiers:
- human →
species = "hsa",db = "org.Hs.eg.db",spe = "human", genomehg38 - mouse →
species = "mmu",db = "org.Mm.eg.db",spe = "mouse", genomemm10
- human →
Usage Points
Reference genome must match species (human→
BSgenome.Hsapiens.UCSC.hg38, mouse→BSgenome.Mmusculus.UCSC.mm10)
if (species == "human"){
pfm <- getMatrixSet(
x = JASPAR2020,
opts = list(species = 9606, all_versions = FALSE)
)
obj <- AddMotifs(
object = obj,
genome = BSgenome.Hsapiens.UCSC.hg38,
# genome = genome,
pfm = pfm
)
obj <- RunChromVAR(
object = obj,
genome = BSgenome.Hsapiens.UCSC.hg38
# genome = genome,
)
species = "hsa"
db = 'org.Hs.eg.db'
spe = "human"
}else if (species == "mouse"){
pfm <- getMatrixSet(
x = JASPAR2020,
opts = list(species = 10090, all_versions = FALSE)
)
obj <- AddMotifs(
object = obj,
genome = BSgenome.Mmusculus.UCSC.mm10,
pfm = pfm
)
obj <- RunChromVAR(
object = obj,
genome = BSgenome.Mmusculus.UCSC.mm10
)
species = "mmu"
db = 'org.Mm.eg.db'
spe = "mouse"
}Differential Peaks Analysis Between Cell Groups
Differential Peaks Analysis
To identify Differentially Accessible Regions (DARs) between different cell populations (or conditions), this workflow uses the Wilcoxon rank-sum test for Differential Accessibility (DA) assessment and Benjamini–Hochberg (BH) for multiple testing correction. presto::wilcoxauc is an efficient implementation for Seurat objects, significantly speeding up large-scale scATAC data analysis while maintaining statistical robustness.
if (group_comparison == "") {
da_peaks <- wilcoxauc(obj, clusters_col, assay='data', seurat_assay='ATAC')
colnames(da_peaks)[2] <- 'cluster'
# write.table(da_peaks,
# file=paste0(output,'/DIFF/all_diffPeak.xls'),
# quote=F, row.names=F, col.names=T, sep='\t')
} else {
all_da_peaks <- list()
obj <- subset(obj, subset = !!sym(group_comparison) == c(case_group, control_group))
for (cell_type in unique(obj@meta.data[[clusters_col]])) {
tryCatch({
sub_obj <- subset(obj, subset = !!sym(clusters_col) == cell_type)
da_peaks <- wilcoxauc(sub_obj,
group_by = group_comparison,
assay = 'data',
seurat_assay = 'ATAC',
groups_use = c(case_group, control_group))
da_peaks$cluster <- as.character(cell_type)
all_da_peaks[[as.character(cell_type)]] <- da_peaks
# write.table(da_peaks,
# file = paste0(output, '/DIFF/by_celltype/', cell_type, '_diffPeak.xls'),
# quote = F, row.names = F, col.names = T, sep = '\t')
}, error = function(e) {
message(sprintf("Error in cluster %s: %s", cell_type, e$message))
})
}
da_peaks <- do.call(rbind, all_da_peaks)
# write.table(da_peaks,
# file = paste0(output, '/DIFF/all_diffPeak.xls'),
# quote = F, row.names = F, col.names = T, sep = '\t')
}Save and Visualize Differential Peaks Results
Based on differential accessibility results, perform significant peak filtering, peak-to-gene mapping, motif enrichment and activity visualization for each cell cluster, and plot coverage maps for representative Top peaks to visually present peak-level differences and potential transcriptional regulatory clues.
Core Workflow
Cluster sorting: Use generic sorting function
sort_clusters, prioritize numerical order (compatible with mixed character numbering).Significant peak selection: Select differential peaks (
featurecolumn) withpadj < 0.05for each cluster and export result table:output/DIFF/01_diffPeak/c_cluster_diffPeak_significant.xls
Peak-Gene Mapping:
ClosestFeature(obj, regions = peaks)maps significant peaks to nearby genes:output/DIFF/04_closestFeature/c_cluster_diffPeak_sig_genes.xls
Motif Enrichment and Visualization:
FindMotifs(object = obj, features = peaks)output enrichment table (including p.adjust and fold.enrichment):output/DIFF/03_motif/c_cluster_diffPeak_motifs.xls
Select Top 6 motifs by
fold.enrichment, draw sequence logo withMotifPlot: -..._diffPeak_motifs_top.pdf/pngTop Peaks Coverage Plot (CoveragePlot):
Select Top 6 differential peaks by descending
logFC, extending 5 kb upstream and downstream:No group comparison: Plot overall;
With group comparison (
group_comparisonnot empty): Facet plot bycase_groupvscontrol_group;Output:
output/DIFF/02_topPeak/c_cluster_diffPeak_top.pdf/png- chromVAR Activity UMAP:
Switch to
chromvarassay, useFeaturePlotto display Top motifs deviation scores (ncol=6/3): -output/DIFF/03_motif/c_cluster_diffPeak_motifs_top_umap.pdf/png
cluster <- unique(da_peaks$cluster)
# Common sorting function
sort_clusters <- function(x) {
if(all(!is.na(suppressWarnings(as.numeric(x))))) {
return(x[order(as.numeric(x))])
} else {
nums <- suppressWarnings(as.numeric(gsub("[^0-9]", "", x)))
if(all(!is.na(nums))) {
return(x[order(nums)])
} else {
return(sort(x))
}
}
}
cluster <- sort_clusters(cluster)
for (i in cluster) {
tryCatch({
DefaultAssay(obj) <- "ATAC"
data <- da_peaks[da_peaks$cluster == i, ]
# write.table(data,paste0(output,'/DIFF/c_',gsub(" ","",i),'_diffPeak.xls'),quote=F,row.names=F,col.names=T,sep='\t')
# data.sig <- data[data$p_val_adj < 0.05, ]
data.sig <- data[data$padj < 0.05, ]
write.table(data.sig,paste0(output,'/DIFF/01_diffPeak/c_',gsub(" ","",i),'_diffPeak_significant.xls'),quote=F,row.names=F,col.names=T,sep='\t')
# peaks <- rownames(data.sig)
peaks <- data.sig$feature
genes <- ClosestFeature(obj, regions = peaks)
genes$cluster <- as.character(i)
write.table(genes,paste0(output,'/DIFF/04_closestFeature/c_',gsub(" ","",i),'_diffPeak_sig_genes.xls'),quote=F,row.names=F,col.names=T,sep='\t')
enriched.motifs <- FindMotifs(object = obj, features = peaks)
enriched.motifs$cluster <- as.character(i)
write.table(enriched.motifs,paste0(output,'/DIFF/03_motif/c_',gsub(" ","",i),'_diffPeak_motifs.xls'),quote=F,row.names=F,col.names=T,sep='\t')
motif.top <- head(enriched.motifs[order(enriched.motifs$fold.enrichment, decreasing = TRUE), ])
p1 <- MotifPlot(object = obj, motifs = rownames(motif.top), nrow = 1)
p11 <- MotifPlot(object = obj, motifs = rownames(motif.top), nrow = 2)
ggsave(paste0(output,'/DIFF/03_motif/c_',gsub(" ","",i),'_diffPeak_motifs_top.pdf'), p11, width = 12, height = 6)
# ggsave(paste0(output,'/DIFF/c_',gsub(" ","",i),'_diffPeak_motifs_top.png'), p1, width = 8, height = 4, dpi = 100, bg = "white")
ggsave(paste0(output,'/DIFF/03_motif/c_',gsub(" ","",i),'_diffPeak_motifs_top.png'), p1, width = 24, height = 3, dpi = 100, bg = "white")
# peaks.top <- rownames(head(data.sig[order(data.sig$avg_log2FC, decreasing = TRUE), ]))
peaks.top <- head(data.sig[order(data.sig$logFC, decreasing = TRUE), ])[['feature']]
if (group_comparison == "") {
p2 <- CoveragePlot(object = obj, region = peaks.top, extend.upstream = 5000, extend.downstream = 5000, nrow = 1)
# p2 <- CoveragePlot(object = obj, region = peaks.top, extend.upstream = 5000, extend.downstream = 5000, nrow = 3)
p22 <- CoveragePlot(object = obj, region = peaks.top, extend.upstream = 5000, extend.downstream = 5000, nrow = 2)
} else {
p2 <- CoveragePlot(object = obj, region = peaks.top, extend.upstream = 5000, extend.downstream = 5000, nrow = 1, group.by = group_comparison, idents = c(case_group, control_group))
p22 <- CoveragePlot(object = obj, region = peaks.top, extend.upstream = 5000, extend.downstream = 5000, nrow = 2, group.by = group_comparison, idents = c(case_group, control_group))
}
ggsave(paste0(output,'/DIFF/02_topPeak/c_',gsub(" ","",i),'_diffPeak_top.pdf'), p22, width = 12, height = 6)
# ggsave(paste0(output,'/DIFF/c_',gsub(" ","",i),'_diffPeak_top.png'), p2, width = 8, height = 4, dpi = 100, bg = "white")
ggsave(paste0(output,'/DIFF/02_topPeak/c_',gsub(" ","",i),'_diffPeak_top.png'), p2, width = 24, height = 3, dpi = 100, bg = "white")
DefaultAssay(obj) <- 'chromvar'
p3 <- FeaturePlot(object = obj, features = rownames(motif.top), ncol = 6)
# p3 <- FeaturePlot(object = obj, features = rownames(motif.top), ncol = 2)
p33 <- FeaturePlot(object = obj, features = rownames(motif.top), ncol = 3)
ggsave(paste0(output,'/DIFF/03_motif/c_',gsub(" ","",i),'_diffPeak_motifs_top_umap.pdf'), p33, width = 12, height = 6)
# ggsave(paste0(output,'/DIFF/c_',gsub(" ","",i),'_diffPeak_motifs_top.png'), p1, width = 8, height = 4, dpi = 100, bg = "white")
ggsave(paste0(output,'/DIFF/03_motif/c_',gsub(" ","",i),'_diffPeak_motifs_top_umap.png'), p3, width = 24, height = 3, dpi = 100, bg = "white")
}, error = function(e) {
message(sprintf("Error in cluster %s: %s", i, e$message))
})
}
saveRDS(obj,paste0(output, '/output.rds'))
write.table(da_peaks[da_peaks$padj < 0.05, ],paste0(output,'/DIFF/01_diffPeak/all_diffPeak_significant.xls'),quote=F,row.names=F,col.names=T,sep='\t')List of Significant Differential Peaks Between Cell Groups
Table of differential peaks with adjusted p-value < 0.05 between cell groups.
Metric Interpretation (Differential Peaks Table)
- feature: Peak coordinates, format chr:start-end, uniquely identifying each ATAC peak.
- cluster: Target cell group (or label) name, indicating that the statistics for this row compare this group as 'in-group' vs. other cells as 'out-group'.
- avgExpr: Average accessibility signal of the peak in the target cell group (from normalized values of ATAC data matrix, e.g., TF-IDF/normalized counts), higher values indicate greater openness in the group.
- logFC: Log fold change of target group vs. other cells, >0 indicates more open in this cluster, larger values indicate more significant difference.
- statistic: Wilcoxon rank-sum test statistic (rank sum), used to calculate p-value.
- auc: AUC (Area Under Curve) measures separability, 0.5 indicates no difference; usually 0.6≈weak, 0.7≈moderate, ≥0.8≈strong separation.
- pval: Uncorrected p-value, obtained from Wilcoxon test.
- padj: Multiple testing corrected p-value (Benjamini–Hochberg), used for significance judgment (common threshold < 0.05).
- pct_in: Percentage of cells in the target group where the peak is non-zero/accessible (%), reflecting "prevalence" within the group.
- pct_out: Percentage of cells in other groups where the peak is non-zero/accessible (%), used to evaluate specificity compared to pct_in.
options(warn = -1)
options(message = FALSE)
# da_peaks <- da_peaks %>% mutate(across(where(is.numeric), ~round(., 2)))
datatable(da_peaks[da_peaks$padj < 0.05, ], rownames = FALSE, filter = 'top', options = list(pageLength = 5, dom = 'Blfrtip', buttons = c('csv', 'excel', 'pdf')), extensions = 'Buttons') %>%
formatSignif(
columns = names(da_peaks)[sapply(da_peaks, is.numeric)],
digit = 3, zero.print = NA
)Top Peaks Display
Visualize Top 6 significant differential peaks (padj < 0.05) sorted by logFC using CoveragePlot. Each panel displays the normalized coverage curve (Normalized signal) within ±5 kb of the peak, with gene structure and predicted peaks shown below for joint interpretation.
Image Interpretation
The upper colored tracks represent coverage of different clusters (or cell types); higher tracks indicate greater openness.
The x-axis is genomic coordinates; blue arrows below are gene coordinates (e.g., Ppp1r16a, Cpt, Mfsd3), gray bars are peak positions.
Height differences of the same peak across tracks reveal accessibility differences across clusters; higher curves in the target group/cluster suggest stronger chromatin openness and potential regulatory activity.
# 2. Peaks Charts
peaks_links <- data.frame(
Cluster = cluster,
Preview = sapply(cluster, function(i) {
file_path_ab <- paste0(output,'/DIFF/02_topPeak/c_',gsub(" ","",i),'_diffPeak_top.png')
file_path <- paste0('../output/DIFF/02_topPeak/c_', gsub(" ","", i), '_diffPeak_top.png')
if(file.exists(file_path_ab)) {
sprintf('<a href="%s"><img src="/placeholder.svg" height="200px"></a>', file_path, file_path)
} else {
"Image not found"
}
})
)options(warn = -1)
options(message = FALSE)
# Peaks Table
datatable(peaks_links, filter = 'top',
escape = FALSE,
options = list(
pageLength = 1,
dom = 'tp'
),
caption = "Peak Coverage Plots",
rownames = FALSE)Top Peaks Motif Display
Previously, we performed FindMotifs on significant differential peak sets (padj < 0.05) for each cluster to assess JASPAR Motif enrichment and identify TFs driving peak opening. Below we visualize Top Motifs and their activities.
Motif List
Metric Interpretation (Differential Peak Enriched Motif List)
- motif: Unique identifier of Motif (e.g., JASPAR ID).
- observed: Actual count of motif detected in differential peak regions.
- background: Count of motif detected in background regions (non-differential peaks).
- percent.observed: Occurrence proportion of motif in differential peak regions (%).
- percent.background: Occurrence proportion of motif in background regions (%).
- fold.enrichment: Fold enrichment, indicating motif enrichment in differential peak regions (percent.observed / percent.background), higher values mean more significant enrichment.
- pvalue: Original p-value of enrichment test, measuring statistical significance of motif enrichment.
- motif.name: Transcription factor name or annotation corresponding to Motif.
- p.adjust: P-value after multiple testing correction (e.g., FDR), common threshold < 0.05 for significant enrichment.
- cluster: Corresponding cell group/cluster, indicating which cluster's differential peaks the motif is enriched in.
data_dir <- paste0(output,"/DIFF/03_motif")
xls_files <- list.files(path = data_dir,
pattern = "motifs\\.xls$",
full.names = TRUE)
merged_data <- data.frame()
for (file in xls_files) {
df <- read.table(file, sep="\t", header=TRUE)
df$cluster <- as.character(df$cluster)
merged_data <- bind_rows(merged_data, df)
}
merged_data$cluster <- as.character(merged_data$cluster)options(warn = -1)
options(message = FALSE)
# merged_data_motif <- merged_data %>% mutate(across(where(is.numeric), ~round(., 2)))
merged_data_motif <- merged_data
datatable(merged_data_motif[merged_data_motif$p.adjust < 0.05, ], rownames = FALSE, filter = 'top', options = list(pageLength = 5, dom = 'Blfrtip', buttons = c('csv', 'excel', 'pdf')), extensions = 'Buttons') %>%
formatSignif(
columns = names(merged_data_motif)[sapply(merged_data_motif, is.numeric)],
digit = 3, zero.print = NA
)Motif Visualization (chromVAR UMAP)
- Display Top Motif sequence logo by
cluster, intuitively reflecting base composition and conservation. - Preview Top Motif cell-level activity distribution by
cluster. Each subplot is a projection of a Motif's chromVAR deviation z-score in UMAP space.
# 1. Motifs Charts
motifs_links1 <- data.frame(
Cluster = cluster,
Preview = sapply(cluster, function(i) {
file_path_ab <- paste0(output,'/DIFF/03_motif/c_',gsub(" ","",i),'_diffPeak_motifs_top.png')
file_path <- paste0('../output/DIFF/03_motif/c_', gsub(" ","", i), '_diffPeak_motifs_top.png')
if(file.exists(file_path_ab)) {
sprintf('<a href="%s"><img src="/placeholder.svg" height="200px"></a>', file_path, file_path)
} else {
"Image not found"
}
})
)
motifs_links2 <- data.frame(
Cluster = cluster,
Preview = sapply(cluster, function(i) {
file_path_ab <- paste0(output,'/DIFF/03_motif/c_',gsub(" ","",i),'_diffPeak_motifs_top_umap.png')
file_path <- paste0('../output/DIFF/03_motif/c_', gsub(" ","", i), '_diffPeak_motifs_top_umap.png')
if(file.exists(file_path_ab)) {
sprintf('<a href="%s"><img src="/placeholder.svg" height="200px"></a>', file_path, file_path)
} else {
"Image not found"
}
})
)options(warn = -1)
options(message = FALSE)
# Motifs Table
datatable(motifs_links1, filter = 'top',
escape = FALSE,
options = list(
pageLength = 1,
dom = 'tp'
),
caption = "Motifs Top Plots",
rownames = FALSE)
datatable(motifs_links2, filter = 'top',
escape = FALSE,
options = list(
pageLength = 1,
dom = 'tp'
),
rownames = FALSE)Genes Near Differential Peaks
Individual peak coordinates are hard to interpret directly. We use ClosestFeature to map significant differential peaks (padj < 0.05) to the nearest genes, inferring candidate genes potentially affected by these regulatory elements and providing input for subsequent GO/KEGG enrichment analysis.
Metric Interpretation (ClosestFeature Result Table)
- tx_id: Ensembl transcript ID of the nearest transcript.
- gene_name: Gene symbol of the nearest gene (SYMBOL).
- gene_id: Ensembl gene ID of the nearest gene.
- gene_biotype: Gene type (e.g., protein_coding, lincRNA).
- type: Nearest functional annotation type (e.g., exon, UTR, intron, promoter).
- closest_region: Genomic coordinates of the nearest functional annotation fragment.
- query_region: Input differential peak coordinates.
- distance: Distance between them (bp). 0 indicates overlap; smaller values suggest higher regulatory association probability; ≤2000 bp is often considered a proximal promoter candidate.
- cluster: Cell group label to which the differential peak belongs.
data_dir <- paste0(output,"/DIFF/04_closestFeature")
xls_files <- list.files(path = data_dir,
pattern = "diffPeak_sig_genes\\.xls$",
full.names = TRUE)
merged_data <- data.frame()
for (file in xls_files) {
df <- read.table(file, sep="\t", header=TRUE)
df$cluster <- as.character(df$cluster)
merged_data <- bind_rows(merged_data, df)
}
merged_data$cluster <- as.character(merged_data$cluster)options(warn = -1)
options(message = FALSE)
# merged_data_genes <- merged_data %>% mutate(across(where(is.numeric), ~round(., 2)))
merged_data_genes <- merged_data
datatable(merged_data_genes, rownames = FALSE, filter = 'top', options = list(pageLength = 5, dom = 'Blfrtip', buttons = c('csv', 'excel', 'pdf')), extensions = 'Buttons') %>%
formatSignif(
columns = names(merged_data_genes)[sapply(merged_data_genes, is.numeric)],
digit = 3, zero.print = NA
)Functional Enrichment of Genes Near Differential Peaks
Using nearby genes from ClosestFeature as input, use clusterProfiler to perform GO and KEGG enrichment for each cluster, systematically characterizing biological processes and signaling pathways potentially affected by differential regulatory regions.
Analysis Pipeline
Gene preparation: Extract
gene_namebycluster, map toENSEMBL/ENTREZIDusingbitr.GO Enrichment:
enrichGO(ont = "ALL", pAdjustMethod = "BH"), output significant terms and visualization.KEGG Enrichment:
enrichKEGG(organism = hsa/mmu), and visualization.
go_enrich <- function(eg,db,outdir,prefix) {
genelist <- eg$SYMBOL
go <- enrichGO(genelist, OrgDb=db, ont='ALL',pAdjustMethod = 'BH',qvalueCutoff = 1,pvalueCutoff = 1,keyType = 'SYMBOL')
go1 <- data.frame(cluster=prefix, go)
write.table(go1,file=paste(outdir,'/',prefix,'_GOenrich.xls',sep=''),sep='\t',quote=F,row.names=F)
pdf(paste0(outdir,'/',prefix,'_GO_bar.pdf',sep=''),width = 10, height = 9)
print(barplot(go,showCategory=20,drop=T)+ scale_y_discrete(labels = function(x) str_wrap(x, width = 50)))
dev.off()
pdf(paste0(outdir,'/',prefix,'_GO_dot.pdf',sep=''),width = 10, height = 8)
print(dotplot(go,showCategory=20)+ scale_y_discrete(labels = function(x) str_wrap(x, width = 50)))
dev.off()
png(paste0(outdir,'/',prefix,'_GO_bar.png',sep=''),width = 780, height = 580)
print(barplot(go,showCategory=20,drop=T)+ scale_y_discrete(labels = function(x) str_wrap(x, width = 50)))
dev.off()
png(paste0(outdir,'/',prefix,'_GO_dot.png',sep=''),width = 780, height = 580)
print(dotplot(go,showCategory=20)+ scale_y_discrete(labels = function(x) str_wrap(x, width = 50)))
dev.off()
}
kegg_enrich <- function(eg,species,outdir,prefix) {
genenames<-eg$SYMBOL
names(genenames)<-eg$ENTREZID
genelist <- eg$ENTREZID
download.KEGG.Path <- function (species) {
keggpathid2extid.df <- clusterProfiler:::kegg_link(species, "pathway")
if (is.null(keggpathid2extid.df)) {
message <- paste("Failed to download KEGG data.", "Wrong 'species' or the network is unreachable.",
"The 'species' should be one of organisms listed in",
"'https://www.genome.jp/kegg/catalog/org_list.html'")
stop(message)
}
keggpathid2extid.df[, 1] %<>% gsub("[^:]+:", "", .)
keggpathid2extid.df[, 2] %<>% gsub("[^:]+:", "", .)
keggpathid2name.df <- clusterProfiler:::kegg_list("pathway")
keggpathid2name.df[, 1] %<>% gsub("map", species, .)
keggpathid2extid.df <- keggpathid2extid.df[keggpathid2extid.df[, 1] %in% keggpathid2name.df[, 1], ]
return(list(KEGGPATHID2EXTID = keggpathid2extid.df, KEGGPATHID2NAME = keggpathid2name.df))
}
assignInNamespace('download.KEGG.Path',download.KEGG.Path,ns = 'clusterProfiler')
head(genelist)
kegg <- enrichKEGG(genelist, organism = species, keyType = 'kegg', pAdjustMethod = 'BH',pvalueCutoff = 1,qvalueCutoff = 1,use_internal_data = TRUE)
gene_list <- strsplit(kegg$geneID,split='/')
geneName<-unlist(lapply(gene_list,FUN=function(x){paste(genenames[x],collapse='/')}))
KEGGenrich <- data.frame(cluster=prefix, kegg[,1:8],geneName,kegg[,9,drop=F])
write.table(KEGGenrich,file=paste(outdir,'/',prefix,'_KEGGenrich.xls',sep=''),sep='\t',quote=F,row.names=F)
pdf(paste0(outdir,'/',prefix,'_KEGG_dot.pdf',sep=''),width = 8, height = 8)
print(dotplot(kegg, showCategory=20)+ scale_y_discrete(labels = function(x) str_wrap(x, width = 50)))
dev.off()
pdf(paste0(outdir,'/',prefix,'_KEGG_bar.pdf',sep=''),width = 8, height = 8)
print(barplot(kegg, showCategory=20)+ scale_y_discrete(labels = function(x) str_wrap(x, width = 50)))
dev.off()
png(paste0(outdir,'/',prefix,'_KEGG_dot.png',sep=''),width = 780, height = 580)
print(dotplot(kegg, showCategory=20)+ scale_y_discrete(labels = function(x) str_wrap(x, width = 50)))
dev.off()
png(paste0(outdir,'/',prefix,'_KEGG_bar.png',sep=''),width = 780, height = 580)
print(barplot(kegg, showCategory=20)+ scale_y_discrete(labels = function(x) str_wrap(x, width = 50)))
dev.off()
}files <- list.files(paste0(output,"/DIFF/04_closestFeature"),pattern="diffPeak_sig_genes.xls")
# cores <- detectCores() - 1
# cl <- makeCluster(cores)
# registerDoParallel(cl)
# foreach(f=files,.packages = c("clusterProfiler", "reshape2", "ggplot2", "stringr", "tibble", "dplyr", "DOSE", "enrichplot")) %dopar% {
for (f in files){
prefix <- gsub('_diffPeak_sig_genes.xls','',f)
f <- paste0(outdir,"/output/DIFF/04_closestFeature/",f)
print(f)
print(prefix)
clus_1<-read.table(f,header=T,sep='\t')
clus<-clus_1$gene_name
tryCatch({
eg <- bitr(clus, fromType="SYMBOL", toType=c("ENSEMBL","ENTREZID"), OrgDb=db)
print(head(eg))
}, error = function(e) {
message(paste("Error processing file", f, ":", e$message))
})
tryCatch({
go_enrich(eg,db,paste0(outdir,"/output/clusterProfiler/GO"),prefix)
}, error = function(e) {
message(paste0("Error in GO enrichment analysis: ", prefix, " - ", e$message))
})
tryCatch({
kegg_enrich(eg, species, paste0(outdir,"/output/clusterProfiler/KEGG"), prefix)
}, error = function(e) {
message(paste0("Error in KEGG enrichment analysis: ", prefix, " - ", e$message))
})
}
# stopCluster(cl)# Merge GO enrichment analysis results
go_dir <- paste0(output,"/clusterProfiler/GO")
if(dir.exists(go_dir)) { # Check if directory exists
xls_files <- list.files(path = go_dir,
pattern = "\\.xls$",
full.names = TRUE)
merged_data <- data.frame()
for (file in xls_files) {
df <- read.table(file, sep="\t", header=TRUE, fill=TRUE, quote="", check.names=FALSE)
merged_data <- bind_rows(merged_data, df)
}
# Write to file using full path
write.table(merged_data,
file = file.path(go_dir, "all_GOenrich.xls"),
row.names = FALSE,
sep="\t",
quote=FALSE)
}GO Enrichment Results
Table summarizing GO enrichment terms and significance statistics for each cluster, used to characterize functional directions of genes near differential peaks.
- Metric Interpretation
- cluster: Cell group label to which the result belongs.
- ONTOLOGY: GO domain, BP (Biological Process) / CC (Cellular Component) / MF (Molecular Function).
- ID: GO term ID (e.g.,
GO:0007264). - Description: GO term name.
- GeneRatio: Number of input genes hitting the term / Total number of input genes (higher means higher proportion in this input).
- BgRatio: Hits in background gene set / Total background (used to evaluate enrichment strength vs GeneRatio).
- pvalue: Original p-value of enrichment test.
- p.adjust: Multiple testing corrected p-value (BH/FDR), common threshold < 0.05 for significance.
- qvalue: q-value estimate (another representation of FDR), threshold < 0.2 can be used for screening.
- geneID: List of genes hitting the term, usually separated by "/".
options(warn = -1)
options(message = FALSE)
# merged_data_go <- merged_data %>% mutate(across(where(is.numeric), ~round(., 2)))
merged_data_go <- merged_data
if(nrow(merged_data_go[merged_data_go$p.adjust < 0.05, ]) > 0) {
datatable(merged_data_go[merged_data_go$p.adjust < 0.05, ], rownames = FALSE, filter = 'top', caption = "GO Enrichment",
options = list(pageLength = 5, dom = 'Blfrtip', buttons = c('csv', 'excel', 'pdf')),
extensions = 'Buttons') %>%
formatSignif(
columns = names(merged_data_go)[sapply(merged_data_go, is.numeric)],
digit = 3, zero.print = NA
)
}GO Enrichment Visualization
Dot (Left Plot) X-axis GeneRatio: Number of input genes hitting the term / Total input genes; larger means relatively more concentrated. y-axis GO terms (BP/CC/MF). Dot size Count: Absolute number of hitting genes, larger means more. Color p.adjust: Redder means higher corrected significance (smaller p.adjust). Interpretation: Prioritize "red, large, right" points (high significance, many hits, large proportion). Bar (Right Plot) x-axis Count: Number of hitting genes; longer bar means more hits. y-axis GO terms. Color p.adjust: Redder color indicates higher significance. Interpretation: Compare bar length and color from top to bottom, locking onto "long and red" items.
# 3. Enrichment Charts
go_links <- data.frame(
Cluster = cluster,
GO_dot = sapply(cluster, function(i) {
file_path_ab <- paste0(output,'/clusterProfiler/GO/c_', gsub(" ","", i), '_GO_dot.png')
file_path <- paste0('../output/clusterProfiler/GO/c_', gsub(" ","", i), '_GO_dot.png')
if(file.exists(file_path_ab)) {
sprintf('<a href="%s"><img src="/placeholder.svg" height="200px"></a>', file_path, file_path)
} else {
"Image not found"
}
}),
GO_bar = sapply(cluster, function(i) {
file_path_ab <- paste0(output,'/clusterProfiler/GO/c_', gsub(" ","", i), '_GO_bar.png')
file_path <- paste0('../output/clusterProfiler/GO/c_', gsub(" ","", i), '_GO_bar.png')
if(file.exists(file_path_ab)) {
sprintf('<a href="%s"><img src="/placeholder.svg" height="200px"></a>', file_path, file_path)
} else {
"Image not found"
}
})
)options(warn = -1)
options(message = FALSE)
# Enrich Table
datatable(go_links, filter = 'top',
escape = FALSE,
options = list(
pageLength = 1,
dom = 'tp'
),
caption = "Enrich Plots",
rownames = FALSE)- Bar plot: x-axis is gene count annotated to GO ID, y-axis is GO term description, color represents adjusted p-value (redder = smaller p-value, bluer = larger p-value).
- Dot plot: x-axis is GeneRatio, y-axis is GO term description, color represents adjusted p-value, dot size represents number of differential genes enriched in the pathway.# Merge KEGG enrichment analysis results
kegg_dir <- paste0(output,"/clusterProfiler/KEGG")
if(dir.exists(kegg_dir)) { # Check if directory exists
xls_files <- list.files(path = kegg_dir,
pattern = "\\.xls$",
full.names = TRUE)
merged_data <- data.frame()
for (file in xls_files) {
df <- read.table(file, sep="\t", header=TRUE, fill=TRUE, quote="", check.names=FALSE)
merged_data <- bind_rows(merged_data, df)
}
# Write to file using full path
write.table(merged_data,
file = file.path(kegg_dir, "all_KEGGenrich.xls"),
row.names = FALSE,
sep="\t",
quote=FALSE)
}KEGG Enrichment Results
Summary of KEGG pathway enrichment terms for each cluster, characterizing signaling and metabolic pathways potentially involving genes near differential peaks (based on enrichKEGG, species code human→hsa, mouse→mmu).
- Metric Interpretation
- cluster: Cluster label.
- category: KEGG Level 1 Category (e.g., Metabolism, Cellular Processes).
- subcategory: KEGG Level 2 Category (e.g., Carbohydrate metabolism, Signal transduction).
- ID: Pathway ID (e.g.,
mmu04015). - Description: Pathway name.
- GeneRatio: Proportion of input genes hitting the pathway (Hits/Total input), higher means more concentrated in current input.
- BgRatio: Proportion of background genes hitting the pathway (Hits/Total background), used to evaluate enrichment strength compared to GeneRatio.
- pvalue: Original p-value of enrichment test.
- p.adjust: P-value after multiple testing correction (FDR/BH), common threshold < 0.05.
- geneID/geneName: List of genes hitting the pathway, separated by "/".
options(warn = -1)
options(message = FALSE)
# merged_data_kegg <- merged_data %>% mutate(across(where(is.numeric), ~round(., 2)))
merged_data_kegg <- merged_data
if(nrow(merged_data_kegg[merged_data_kegg$p.adjust < 0.05, ]) > 0) {
datatable(merged_data_kegg[merged_data_kegg$p.adjust < 0.05, ], rownames = FALSE, filter = 'top', caption = "KEGG Enrichment",
options = list(pageLength = 5, dom = 'Blfrtip', buttons = c('csv', 'excel', 'pdf')),
extensions = 'Buttons') %>%
formatSignif(
columns = names(merged_data_kegg)[sapply(merged_data_kegg, is.numeric)],
digit = 3, zero.print = NA
)
}KEGG Enrichment Visualization
Dot (Left)
X-axis GeneRatio: Input genes hitting pathway / Total input genes; larger means more concentrated.
- y-axis Pathway name (including species code, e.g., mmu04015).
Dot size Count: Number of hit genes; larger means more hits.
Color p.adjust: Redder indicates higher significance (smaller adjusted p-value).
Interpretation: Prioritize "red, large, and right" points (significant, high hits, high proportion).
Bar (Right)
X-axis Count: Number of hit genes; longer bar means more hits.
- y-axis Pathway name.
Color p.adjust: Redder indicates higher significance.
Interpretation: Compare bar length and color from top to bottom, locking onto "long and red" pathways.
# 3. Enrichment Charts
kegg_links <- data.frame(
Cluster = cluster,
KEGG_dot = sapply(cluster, function(i) {
file_path_ab <- paste0(output,'/clusterProfiler/KEGG/c_', gsub(" ","", i), '_KEGG_dot.png')
file_path <- paste0('../output/clusterProfiler/KEGG/c_', gsub(" ","", i), '_KEGG_dot.png')
if(file.exists(file_path_ab)) {
sprintf('<a href="%s"><img src="/placeholder.svg" height="200px"></a>', file_path, file_path)
} else {
"Image not found"
}
}),
KEGG_bar = sapply(cluster, function(i) {
file_path_ab <- paste0(output,'/clusterProfiler/KEGG/c_', gsub(" ","", i), '_KEGG_bar.png')
file_path <- paste0('../output/clusterProfiler/KEGG/c_', gsub(" ","", i), '_KEGG_bar.png')
if(file.exists(file_path_ab)) {
sprintf('<a href="%s"><img src="/placeholder.svg" height="200px"></a>', file_path, file_path)
} else {
"Image not found"
}
})
)options(warn = -1)
options(message = FALSE)
# Enrich Table
datatable(kegg_links, filter = 'top',
escape = FALSE,
options = list(
pageLength = 1,
dom = 'tp'
),
caption = "Enrich Plots",
rownames = FALSE)- Bar plot: x-axis is gene count annotated to KEGG pathway, y-axis is KEGG pathway description, color represents adjusted p-value.
- Dot plot: x-axis is GeneRatio, y-axis is KEGG pathway description, color represents adjusted p-value, dot size represents number of differential genes enriched in the pathway.Result Files
DIFF/*diffPeak.xls Master table of differential peaks
DIFF/*diffPeak_significant.xls Table of significant differential peaks
DIFF/*diffPeak_top.pdf/png Chromatin accessibility signal plots for top 6 significant differential peaks
DIFF/*diffPeak_motifs.xls Table of motifs in significant differential peaks
DIFF/*diffPeak_motifs_top.pdf/png Motif plots for top 6 significant differential peaks
DIFF/*diffPeak_sig_genes.xls Genes near significant differential peaks
clusterProfiler/*/all_*enrich.xls Master table of enrichment results
clusterProfiler/*/*enrich.xls Enrichment results for each cell group
clusterProfiler/*/*bar.pdf/png Enrichment bar plots for each cell group
clusterProfiler/*/*dot.pdf/png Enrichment dot plots for each cell group
Result Directory: Includes all images and table result files involved in this analysis.
Literature Case Analysis
- Reference 1:
The paper "Single-cell multiomics analysis reveals regulatory programs in clear cell renal cell carcinoma" performed differential accessibility region (DAR) analysis and found that DARs in tumor cells were significantly enriched in metabolism-related biological processes.
References
[1] Stuart, Tim, Avi Srivastava, Caleb Lareau, and Rahul Satija. "Multimodal single-cell chromatin analysis with Signac." BioRxiv (2020): 2020-11.
[2] Korsunsky I, A. Nathan N Millard, and S. Raychaudhuri. "presto: Fast Functions for Differential Expression using Wilcox and AUC." R package. https://rdrr.io/github/immunogenomics/presto (2019).
Appendix
.csv/.xls/.txt: Result data table files, separated by commas or tabs. Linux/Mac users view with less or more commands; Windows users view with advanced text editors like Notepad++, or open with Microsoft Excel.
.png: Result image files, bitmap, lossless compression.
.pdf: Result image files, vector graphics, can be zoomed without distortion, convenient for viewing and editing, can be edited with Adobe Illustrator for publication.
