Skip to content

scATAC-seq Gene Activity Analysis: Chromatin Accessibility and Transcriptional Potential Assessment

Author: SeekGene
Time: 27 min
Words: 5.4k words
Updated: 2026-03-02
Reads: 0 times
scATAC + RNA-seq Analysis Guide Notebooks Activity

Background

Gene Activity Analysis is a core technique in single-cell ATAC-seq data analysis. It indirectly assesses the transcriptional potential of genes by quantifying chromatin accessibility. This method is based on a key biological hypothesis: there is a close relationship between the open state of chromatin and gene transcriptional activity.

Core Principle

This method statistics the ATAC-seq signal intensity within each gene region (including the gene body and its upstream 2kb regulatory region). These signals directly reflect the degree of DNA openness, thereby predicting the potential transcriptional activity of the gene.

Purpose and Significance

  • Data Integration Bridge: Construct 'transcriptional approximation' phenotype for scATAC-seq, used as anchors with scRNA-seq, supporting cross-omics integration (CCA/Anchors) and label transfer.
  • Annotation and Interpretation: Visualize classic marker gene activity on ATAC clusters to assist in identifying cell types and states.
  • Mechanism and Discovery: Perform correlation/differential analysis combining expression and activity to mine specifically open genes and potential regulatory logic.

Content Covered in This Tutorial

  • Gene Activity Analysis: Assess the relationship between chromatin accessibility and gene transcriptional potential
  • Gene Activity Differential Analysis: Identify gene activity differences between different cell types.
  • Differential Analysis Visualization: Combine gene expression data to compare differential gene activity with corresponding gene expression patterns
  • Correlation Analysis:
    • Gene activity correlation analysis between different clusters
    • Gene expression correlation analysis between different clusters

Expected Analysis Results

Through the analysis in this section, you will obtain:

  • Cluster-specific Gene Sets: Identify genes specifically open in specific cell types
  • Expression-Activity Correlation: Understand the performance of cluster-specific genes in both gene activity and gene expression levels
  • Heterogeneity Assessment: Assess heterogeneity characteristics of different clusters in gene expression and gene activity levels

Parameter Settings

Please mount the relevant cloud platform workflow data to be analyzed first. For mounting methods, please refer to the Jupyter usage tutorial.

Specific Parameter Entry:

  • rds: Mounted workflow-related RDS data, in the /home/{UserName}/workspace/project/{UserName}/ directory, such as the following project data /home/shumeng/workspace/data/AY1752167131383/
  • meta: Metadata file in the same directory as rds
  • outdir: Path to save analysis results
  • celltype_colname: Cell type column name. Select the label corresponding to the cell type or cluster to be analyzed. If you want to analyze annotated cell types, select their corresponding labels, such as CellAnnotation, used in conjunction with Cell Types.
  • celltypes: Cell types, multiple selection allowed. Select the cell types or clustering results to analyze, such as Monocyte and Macrophage.
  • logFC_cut: Log fold change threshold, default 0.25
  • padj_cut: padj threshold, default < 0.05
  • pval_cut: pval threshold, default < 0.05
  • reduction: Dimensionality reduction method for FeaturePlot visualization of gene activity. Options: umap/tsne/lsi/atacumap/atactsne/wnnumap/wnntsne. Default is wnnumap.
R
# Parameters: Fill in the data path under your own path as needed
rds = "/home/shumeng/workspace/data/AY1752167131383/input.rds"
meta = "/home/shumeng/workspace/data/AY1752167131383/meta.tsv"
outdir = "/home/shumeng/workspace/project/shumeng/"
celltype_colname = "wknn_res.0.5_d30_l2_50"
celltypes = "0,1,2,3,4,5,6,7,8,9"
logFC_cut = 0.25
padj_cut = 0.05
pval_cut = 0.05
reduction = "wnnumap"

Environment Preparation

Load R Packages

Please select the common_r environment for this tutorial

R
# Save current library path
original_lib_paths <- .libPaths()
# 1. First load basic dependency packages
suppressMessages({
    library(Rcpp, quietly = TRUE)
    library(Matrix, quietly = TRUE)
})
# 2. Set temporary library path and load presto
temp_lib_path <- "/PROJ/development/liuxin/Software/miniconda3/envs/Celltype/lib/R/library"
if(dir.exists(temp_lib_path)) {
    .libPaths(c(temp_lib_path, original_lib_paths))
}
suppressMessages({
    library(presto, quietly = TRUE)
    source("/PROJ2/FLOAT/shumeng/script/wilcoxauc.R")
})
# 3. Restore original library path
.libPaths(original_lib_paths)
# 4. Load other analysis packages, adjusting load order
suppressMessages({
    # First load basic dependencies
    #library(parallelly, quietly = TRUE)
    #library(future, quietly = TRUE)
    # Then load main analysis packages
    library(ggrepel,quietly = TRUE)
    library(Signac,quietly = TRUE)
    library(Seurat,quietly = TRUE)
    # Finally load visualization and tool packages
    library(ggplot2, quietly = TRUE)
    library(dplyr, quietly = TRUE)
    #library(ggrepel, quietly = TRUE)
    library(DT, quietly = TRUE)
    library(hexbin, quietly = TRUE)
    library(viridis, quietly = TRUE)
    library(gridExtra, quietly = TRUE)
    library(pheatmap, quietly = TRUE)
    library(base64enc, quietly = TRUE)
    library(viridis, quietly = TRUE)
    library(RColorBrewer,quietly = TRUE)
    library(scales,quietly = TRUE)


})

Gene Activity Calculation

Calculation Process

  1. Target Region Determination: Define the gene body and upstream ~2 kb promoter region for each gene (adjustable as needed).
  2. Fragment Counting: Count the number of ATAC fragments falling within the above regions for each cell (can be counted by fragment or insertion site).
  3. Matrix Construction: Integrate gene × cell counts into a gene activity matrix (FeatureMatrix; encapsulated by GeneActivity() in the actual workflow).
  4. Normalization: Log-normalize the activity matrix so that its magnitude is comparable to scRNA-seq expression, facilitating joint analysis and visualization.

Implementation in This Project (Signac)

  • Read single sample or integrated Seurat object (optionally merge meta and AddMetaData).
  • Set the default assay to ATAC and run GeneActivity(sc) to calculate gene activity.
  • Write the result to a new assay: ATAC_activity (CreateAssayObject(counts = gene.activity)).
  • Use NormalizeData to LogNormalize ATAC_activity, with scale.factor taken as median(sc$nCount_RNA), so that it is in a similar magnitude to RNA expression levels, supporting subsequent integration (such as CCA/Anchors) and comparative display.

Practical Points and Notes

  • Indirectness and Noise: Activity scores come from sparse accessibility data, noisier than expression; not one-to-one with real expression.
  • Region Definition Sensitivity: Promoter window (e.g., 2 kb) and inclusion of gene body affect results; can be fine-tuned by species and annotation version.
  • Distal Regulation Impact: For genes dominated by distal enhancers, consistency between activity and expression may be weak; use Cicero/ArchR to build connections and verify.
R
suppressWarnings({
  suppressMessages({
      sc=readRDS(rds)
      if(!is.null(meta)){
          meta <- read.table(meta, header = T, sep = "\t")
          rownames(meta) <- meta$barcodes
          sc <- AddMetaData(sc, meta)
      }
    
      DefaultAssay(sc)="ATAC"
      gene.activity <- GeneActivity(sc)
      sc[['ATAC_activity']] <- CreateAssayObject(counts = gene.activity) 
      sc <- NormalizeData(
          object = sc, 
          assay = 'ATAC_activity', 
          normalization.method = 'LogNormalize',
          scale.factor=median(as.numeric(sc$nCount_RNA))
      )
     })
})
R
sc@assays$ATAC_activity$counts[1:6,1:6]
The gene activity matrix is similar to the gene expression matrix, where each row represents a gene and each column represents a cell. The value represents the gene activity value of the corresponding gene in each cell. A larger value indicates that the regulatory region of the gene in the corresponding cell is more open, and the potential transcriptional activity of the corresponding gene is higher. This conclusion is based on the assumption that there is a positive correlation between gene activity and gene expression. In fact, due to the complexity of gene expression regulation, there may be cases where gene activity is high but gene expression is low. The specific regulatory relationship can be further explored according to actual needs.

Gene Activity Differential Analysis

Gene Activity Differential Analysis Between Cell Clusters

Analysis Purpose

Identify genes with significantly increased or decreased gene activity in different cell clusters, identify genes specifically open in specific cell types, and characterize the epigenetic transcriptional features of each cluster.

Comparison Strategy and Methods

  • Comparison method: Use "one-vs-rest" strategy to independently compare each cluster with all other clusters.
  • Statistical test: Use Wilcoxon rank-sum test to evaluate the difference in gene activity between the target cluster and other clusters.

Output Metrics Description

  • feature: Gene name (activity feature at the gene level)
  • group: Target cell cluster
  • avgExpr: Average activity level of the gene within the target cluster
  • logFC: log2 fold change (difference of target cluster relative to other clusters; positive value indicates higher in target cluster)
  • statistic: Wilcoxon test statistic
  • auc: Binary classification AUC, indicating the ability of the gene activity to distinguish the target cluster
  • pval: Original P-value
  • padj: P-value adjusted for multiple testing
  • pct_in / pct_out: Proportion of cells in the target cluster/other clusters where the gene is detected (activity > 0)

Screening and Thresholds

Thresholds can be set by combining effect size and significance (e.g., constraining both logFC and padj), and referencing detection rate (pct_in/pct_out) to improve result robustness. After screening, candidate high-activity (or low-activity) gene sets for each cluster can be obtained.

R
dir.create("../result", recursive = TRUE, showWarnings = FALSE)
dir.create("../result/cluster_Diff", recursive = TRUE, showWarnings = FALSE)
dir.create("../result/cluster_Diff/FeaturePlot", recursive = TRUE, showWarnings = FALSE)
#dir.create("../result/group_Diff", recursive = TRUE, showWarnings = FALSE)
R
# Gene activity differential analysis between cell types or clusters
DefaultAssay(sc) <- "ATAC_activity"

suppressWarnings({
    suppressMessages({
            Idents(sc)=sc@meta.data[[celltype_colname]]
            Diff_genes_activity=wilcoxauc(sc,celltype_colname,assay = 'data',seurat_assay = 'ATAC_activity')
            if(exists("padj_cut")){
                Diff_genes_activity=Diff_genes_activity[which(abs(Diff_genes_activity$logFC) > logFC_cut & Diff_genes_activity$padj < padj_cut),]
            }else if(exists("pval_cut")){
                Diff_genes_activity=Diff_genes_activity[which(abs(Diff_genes_activity$logFC) > logFC_cut & Diff_genes_activity$padj < pval_cut),]
            }
            
            write.table(Diff_genes_activity,file = "../result/cluster_Diff/cluster_Diff_genes_activity.txt",quote = F,sep = "\t",row.names = F)    
    })
})

Result Presentation and Usage

Results are displayed in an interactive table, supporting filtering, sorting, and exporting, facilitating the extraction of cluster-specific high-activity gene sets.

R
# Remove outer suppression function to ensure output visibility
if(nrow(Diff_genes_activity) < 1) {
  print("No genes with significant difference in gene activity, please lower significance threshold")
} else {
  # Return DT object directly (no print needed)
  datatable(
    Diff_genes_activity,
    rownames = FALSE,
    filter = "top",
    extensions = 'Buttons',
    options = list(
      dom = 'Bfrtip',
      buttons = c('csv', 'excel'),
      columnDefs = list(
        list(
          targets = 2:9,  # More concise index notation
          render = DT::JS("function(data) {
            return typeof data === 'number' 
                   ? data.toExponential(2) 
                   : data;
          }")
        )
      )
    )
  )
}
Usually, when both logFC > 0.25 and padj < 0.05 are met, the gene can be considered to have significantly higher regulatory activity in this specific cell cluster compared to other cell clusters.

Dimensionality Reduction Plot Visualization of Differential Results

Feature Gene Visualization (FeaturePlot)

To visually display the distribution of marker genes for different cell types, we select the top 5 most representative marker genes for each cell type and visualize them in parallel in the reduced dimensional space.

  • Display Dimensions
    • ATAC_activity: Reflects chromatin openness in gene promoter/body regions (regulatory potential)
    • RNA: Reflects transcriptional expression level of corresponding gene (actual transcript)
  • Figure Description
    • Each dot represents a single cell; color intensity indicates activity or expression intensity of the gene in that cell
    • Display the same gene side-by-side/faceted in two modalities to facilitate comparison of specific patterns in different cell types.
  • Interpretation Points
    • Cell Type Specificity: Marker genes should show spatial enrichment of high activity/expression in expected cell types
    • Cross-modal Consistency: Consistency between ATAC activity and RNA expression supports the active transcriptional state of the gene in that type
    • Inconsistency Clues: High activity but low expression may suggest regulatory pre-activation, temporal difference, or distal enhancer impact
  • Notes
    • Gene activity is an indirect reading, noisy; not necessarily one-to-one with RNA expression
    • Activity window definition (e.g., whether upstream 2 kb is included, whether gene body is counted) affects visualization results
R
# Get top 10 differential genes and heatmap genes in each cluster, preparing for subsequent visualization
 if(nrow(Diff_genes_activity) < 1){
            print("No genes with significant difference in gene activity, please lower significance threshold")
        }else{
            cluster_num <- length(table(sc@meta.data[[celltype_colname]]))
            if (cluster_num <= 8) {
                n <- 10
                height <- 3
            } else if (cluster_num > 8 & cluster_num <= 10) {
                n <- 8
                height <- 4
            } else if (cluster_num > 10 & cluster_num <= 15) {
                n <- 5
                height <- 5
            } else {
                n <- 4
                height <- 7
            }
            Diff_genes_activity %>%
                group_by(group) %>%
                arrange(desc(logFC)) %>%
                slice_head(n = n) %>%      # Change to 10 genes
                ungroup() -> top10

            if(nrow(top10) <= 10){
                width=5
            }else if(nrow(top10) > 10 & nrow(top10) <= 15){
                width=7
            }else if(nrow(top10) > 15 & nrow(top10) <= 20){
                width=9
            }else if(nrow(top10) > 20 & nrow(top10) <= 25){
                width=11
            }else if(nrow(top10) > 25 & nrow(top10) <= 30){
                width=13
            }else if(nrow(top10) > 30 & nrow(top10) <= 35){
                width=15
            }else if(nrow(top10) > 35 & nrow(top10) <= 40){
                width=17
            }else if(nrow(top10) > 40 & nrow(top10) <= 45){
                width=19
            }else if(nrow(top10) > 45 & nrow(top10) <= 55){
                width=20
            }else if(nrow(top10) > 55 & nrow(top10) <= 65){
                width=21
            }else {
                width=22
            }
                # Get top 1 differential gene in each cluster
            Diff_genes_activity %>%
                group_by(group) %>%
                arrange(desc(logFC)) %>%    # Add this line, sort by logFC descending
                slice_head(n = 5) %>%      # Change to 1 gene
                ungroup() -> top5
                # Get gene set for heatmap
            Diff_genes_activity %>%
                group_by(group) %>%
                arrange(desc(logFC)) -> heatmap_genes
        }
R
# Add necessary packages
# Check and create save directory
# dir.create("_build/html/result/cluster_Diff/FeaturePlot/", recursive = TRUE, showWarnings = FALSE)
options(warn = -1)
save_and_encode_plot <- function(plot, filename, width, height) {
    # Save image file
    full_path_pdf <- paste0(filename, ".pdf")
    temp_png <- tempfile(fileext = ".png")
    
    # Save PDF
    ggsave(full_path_pdf, plot, width = width, height = height)
    # Temporarily save PNG for base64 encoding
    ggsave(temp_png, plot, width = width, height = height, dpi = 300)
    
    # Convert image to base64
    png_data <- readBin(temp_png, "raw", file.info(temp_png)$size)
    base64_img <- base64encode(png_data)
    
    # Delete temporary PNG file
    #unlink(temp_png)
    
    return(list(
        base64 = base64_img,
        pdf_path = full_path_pdf
    ))
}

suppressWarnings({
    suppressMessages({
        if(exists("top5")){
            cell_types <- unique(top5$group)    
            # Create dataframe for display
            image_df <- data.frame(
                CellType = cell_types,
                stringsAsFactors = FALSE
            )
            # Generate a column of images for each assay
            for(assay in c("ATAC_activity", "RNA")) {
                DefaultAssay(sc) <- assay
                col_name <- paste0(assay, "_Plot")
            
                image_df[[col_name]] <- sapply(cell_types, function(i) {
                    # 1. Generate image
                    p1 <- FeaturePlot(
                        object = sc,
                        features = top5$feature[top5$group == i],
                        reduction = reduction,
                        pt.size = 0.1,
                        max.cutoff = 'q95',
                        ncol = 5,
                        cols=c("#E5E3E3","#EF6067"),
                        order = TRUE
                    ) +
                    theme_minimal() +
                    theme(
                        panel.grid = element_blank(),
                        axis.text = element_text(size = 8),
                        axis.title = element_text(size = 10)
                    )
                
                    # 2. Save image and get base64 encoding
                    result <- save_and_encode_plot(
                        plot = p1,
                        filename = paste0("../result/cluster_Diff/FeaturePlot/feature_plot_cluster_", 
                                       assay, "_", i),
                        width = 20,
                        height = 3
                    )
                
                    # 3. Create HTML, including base64 image
                    sprintf(
                        '<div class="plot-container">
                            <img src="/placeholder.svg" height="400px" onclick="window.open(\'%s\')"/>
                        </div>',
                        result$base64,
                        result$pdf_path
                    )
                })
            }
        
            # 4. Create interactive table to display results
            output_table <- suppressWarnings(DT::datatable(
                image_df,
                escape = FALSE,
                options = list(
                    pageLength = 1,
                    scrollX = TRUE,
                    dom = 'Bfrtip',
                    server = TRUE,       # Force server-side mode
                    pageLength = 1,
                    deferRender = TRUE,  # Defer render
                    scroller = TRUE,     # Scroll loading
                    columnDefs = list(
                        list(
                            targets = 1:2,
                            width = "45%"
                        )
                    )
                ),
                rownames = FALSE,
                filter = 'top',
                caption = htmltools::tags$caption(
                    style = 'caption-side: top; text-align: center; font-size: 1.2em; font-weight: bold;',
                    'Feature plots for top 5 genes in each celltype'
                )
            )
                                            )
            output_table
        }else{
            print("No genes with significant difference in gene activity, please lower significance threshold")
        }

    })
})
options(warn = 0)
The first column of the table is cell cluster information, the second column is the chromatin activity distribution map of the top five representative marker genes in the cell cluster, and the third column is the gene expression distribution map of the top five marker genes.

Heatmap Visualization of Differential Results

Feature Gene Heatmap Comparison (ATAC_activity vs RNA)

  • Visualization Purpose

    • Display distribution and consistency of marker genes across two modalities (chromatin activity and transcriptional expression) to assist cell type annotation and mechanism interpretation.
  • Image Composition

    • Left: ATAC activity heatmap (ATAC_activity, reflecting promoter/gene body accessibility)
    • Right: RNA expression heatmap (reflecting actual transcription levels)
    • Rows: Feature genes (select top 5 or custom list per cell type)
    • Columns: Single cells; grouped by cell type/cluster and labeled with color bar
    • Color: Light to dark indicates low to high activity/expression
    • Standardization suggestion: To highlight relative patterns, it is recommended to perform row-wise standardization (e.g., z-score) for each gene across cells, and maintain the same gene and column order in both heatmaps.
  • Interpretation Points

    • Cell Type Specificity: Marker genes should show aggregation bands of high activity/expression in corresponding cell types
    • Cross-modal Consistency: Coordinated upregulation of ATAC activity and RNA expression suggests consistency between active transcriptional state and open chromatin
    • Inconsistency Clues: High activity but low expression may reflect transcriptional pre-activation, time lag, or distal enhancer drive; high expression but low activity may be due to technical noise or region definition
    • Heterogeneity: Gradients/multimodality within the same cluster suggest subpopulations or continuous spectrum states.
R
# Check and process color_gradient
# Split comma-separated string into color value vector
if(!exists("color_gradient")){
    color_gradient <- c("#F9F9F9","#F0484A")
} else {
    # If color_gradient is a string, split it into a vector
    if(length(color_gradient) == 1) {
        color_gradient. <- unlist(strsplit(color_gradient, ","))
    }
}
colors_cancer <- colorRampPalette(color_gradient)(100)



base_colors=c("#EDD496","#65D1BF","#7ADAEA","#F2AFAF","#EA8F8F",
                 "#C1A9D3","#F2DA9E","#D9EF78","#EF9ED9","#C9BCC6","#8BD198")
library(RColorBrewer)
library(scales)  
# Get actual cluster count
cell_types <- if(exists("celltypes")) unlist(strsplit(celltypes, ",")) else unique(sc@meta.data[[celltype_colname]])
    unique_clusters <- length(cell_types)
cell_types_all=cell_types
generate_cols <- function(n) {
    if(n <= 0) return(character(0))  # Handle 0 length case
    # Loop to extend base colors
    base_colors[((1:n)) %% length(base_colors) + 1] 
}
my_cols <- generate_cols(unique_clusters)
identity_colors <- setNames(my_cols, cell_types)
R
suppressWarnings({
  suppressMessages({      
    # Load necessary packages
    
    # Set graphics parameters
    options(repr.plot.width = 16, repr.plot.height = 10)
    if(exists("top10")){
        # ATAC Activity Heatmap
        DefaultAssay(sc) <- 'ATAC_activity'
        p1 = DoHeatmap(sc,
            features = top10$feature,
            slot = "data",
            draw.lines = TRUE,
            raster = TRUE,
            group.colors = identity_colors,
            label = FALSE
        ) +
            scale_fill_gradientn(colors = colors_cancer) +
            guides(fill = guide_colorbar(title = "Activity")) +
            ggtitle("ATAC Activity")  # Add title
    
        # RNA Expression Heatmap
        DefaultAssay(sc) <- 'RNA'
        p2 = DoHeatmap(sc,
            features = top10$feature,
            slot = "data",
            draw.lines = TRUE,
            raster = TRUE,
            group.colors = identity_colors,
            label = FALSE
        ) +
            scale_fill_gradientn(colors = colors_cancer) +
            guides(fill = guide_colorbar(title = "Expression")) +
            ggtitle("RNA Expression")  # Add title
        # Combine and display graphics
        combined_plot <- p1 + p2  # Use / for vertical arrangement
        print(combined_plot)
        #dir.create("_build/html/result/cluster_Diff", recursive = TRUE, showWarnings = FALSE)
        # Save image
        invisible(pdf("../result/cluster_Diff/Top10.GeneActivityandExpression.heatmap.pdf", width = 8, height = 10))
        invisible(print(combined_plot))
        invisible(dev.off())
    }else{
       print("No genes with significant difference in gene activity, please lower significance threshold") 
    }
  })
})

Violin Plot Visualization of Differential Results

Feature Gene Dual-modal Violin Plot (ATAC_activity vs RNA)

  • Visualization Purpose

    • Compare distribution and consistency of high activity feature genes in chromatin accessibility (ATAC_activity) and transcriptional expression (RNA) across cell subgroups to assist annotation and interpretation.
  • Image Composition

    • Upper part: ATAC activity violin plot, representing the accessibility distribution (activity score) of gene promoters/gene bodies.
    • Lower part: RNA expression violin plot, representing the transcription level distribution of the same gene.
    • Selection and Sorting: Select top 10 genes with strongest activity per subgroup; use same gene set and order for both modalities; cells grouped by subgroup and colored.
    • Key points for reading: The shape of the violin reflects distribution density and heterogeneity, color represents subpopulations, and different modalities allow direct comparison of the activity/expression strength and distribution differences of the same gene in different subpopulations.
  • Interpretation Points

    • Subgroup Specificity: Target subgroup should show activity/expression peaks and density concentration bands for corresponding marker genes.
    • Cross-modal Consistency: Synchronous rise of high ATAC activity and high RNA expression supports active transcription driven by open chromatin.
    • Inconsistency Clues: High activity but low expression suggests pre-activation, time lag, or distal enhancer dominance; high expression but low activity suggests noise or region definition issues.
    • Heterogeneity Identification: Multi-peaks/long tails within same subgroup suggest subgroup structure or continuum state.
  • Notes

    • Indirectness and Noise: Gene activity is indirect reading of accessibility, sparser and noisier than RNA; not always one-to-one.
R
# Set plot size first
suppressWarnings({
  suppressMessages({
      if(exists("top10")){
          assays=c("ATAC_activity","RNA")
          p=list()
          for(assay in assays){
              DefaultAssay(sc)=assay
              if(length(top10$feature) > 1){
                  options(repr.plot.width = width, repr.plot.height = height)
                  stack=TRUE
                  p[[as.character(assay)]] = VlnPlot(sc,
                                                     features = top10$feature,
                                                     stack = stack,
                                                     pt.size = 0,
                                                     fill.by = "ident",
                                                     cols = identity_colors    # Add color parameter
                                                    ) + 
                  xlab(if(assay == "ATAC_activity"){
                  "Gene Activity"
                  }else{
                      "Gene Expression"
                  }) +
                  theme_minimal() +
                  theme(
                      # Remove top text
                      axis.text.x.top = element_blank(),
                      strip.text.x = element_text(angle = 45, hjust = 0),  # Facet label angle # Set gene names below
                      axis.text.x.bottom = element_text(
                          angle = 45,
                          hjust = 1,          # Adjust to 1 for right alignment
                          vjust = 1,          # Adjust to 1 to align with axis
                          face = "plain",
                          size = 8,           # Add font size setting
                          margin = margin(t = 0, r = 0, b = 0, l = 0)),
                      panel.grid.major = element_blank(),
                      panel.grid.minor = element_blank(),
                      legend.position = "none",
                      axis.ticks.x = element_blank(),    # Remove tick marks
                      axis.line.x = element_blank(),     # Remove axis lines
                      plot.margin = margin(t = 5, r = 5, b = 5, l = 5)  # Adjust margins
                  ) +NoLegend()+scale_x_discrete(labels = top10$feature)
              }else if(length(top10$feature) ==1){
                  options(repr.plot.width = width, repr.plot.height = height)
                  stack=FALSE
                  p[[as.character(assay)]] = VlnPlot(sc,
                                                     features = top10$feature,
                                                     stack = stack,
                                                     pt.size = 0,
                                                     fill.by = "ident",
                                                     cols = identity_colors    # Add color parameter
                                                    ) + 
                  xlab(if(assay == "ATAC_activity"){"Gene Activity"
                                                   }else{"Gene Expression"}) +
                  theme_minimal()                  
              }

              }
          print(p[[1]])
          print(p[[2]])
          invisible(pdf("../result/cluster_Diff/Top10.GeneActivityandExpression.vlnplot.pdf",width=24,height=6))
          invisible(print(p[[1]]))
          invisible(print(p[[2]]))
          invisible(dev.off())
      }else{
          print("No genes with significant difference in gene activity, please lower significance threshold") 
      }
  })
})

Dot Plot Visualization of Differential Results

Dual-Modality Dot Plot (ATAC_activity vs RNA)

  • Visualization Purpose

    • Parallel display of marker genes' overall level and detection in chromatin accessibility and transcriptional expression across cell types, facilitating quick comparison of consistency and specificity.
  • Image Composition and Encoding

    • Dimensions: Top is ATAC activity dot plot (ATAC_activity), bottom is RNA expression dot plot (RNA).
    • Rows/Columns: Rows represent genes, columns represent cell types/clusters (both plots use the same row and column order for alignment comparison).
    • Dot size: Proportion of cells in the corresponding cell type where the gene is detected (detection rate; e.g., proportion with activity/expression > 0).
    • Color intensity: Average activity/average expression level of the gene in the corresponding cell type (light to dark indicates low to high).
    • Normalization Suggestions: Normalize/scale within genes for both modalities separately, and unify color scale and tick labels to avoid misinterpretation.
  • Interpretation Points

    • Specificity: Target cell types should show 'large and dark' bubbles (high detection rate and high mean), reflecting specific activity/expression of marker genes.
    • Cross-modality consistency: "Large and dark" dots appearing in both ATAC and RNA for the same cell type suggest consistency between open chromatin and active transcription.
    • Inconsistency clues: Large/dark ATAC but small/light RNA may indicate pre-activation/time lag or distal enhancer dominance; conversely, it may be due to technical noise or region definition.
    • Sparsity and Heterogeneity: Small but dark bubbles may come from few high-value cells; large but light bubbles suggest widespread low-level activity/expression.
R
# 4. Dot Plot
suppressWarnings({
  suppressMessages({
      if(exists("top10")){
          options(repr.plot.width = width, repr.plot.height = height)
          assays=c("ATAC_activity","RNA")
          p=list()
          for(assay in assays){
              DefaultAssay(sc)=assay
              p[[as.character(assay)]]=DotPlot(sc, 
                                           features = unique(top10$feature),
                                           cols=c("#E5E3E3","#F0484A")  # Gradient from light gray to dark red
                                          ) + 
              #scale_size(range = c(0.5, 6)) +   # Adjust point size range
              # Modify color legend title
              guides(color = guide_colorbar(if(assay == "ATAC_activity"){title = "Average Activity"}else{title = "Average Expression"})) +
              # Or use this method
              # labs(color = "Average Activity") +
              theme(
                  axis.text.x.bottom = element_text(
                      angle = 45, 
                      hjust = 1, 
                      vjust = 1),
                  panel.grid.major = element_blank(),  # Remove grid lines
                  panel.grid.minor = element_blank(),
                  axis.line = element_line(colour = "black"),  # Add axis lines
                  legend.position = "right"  # Keep legend to view effect
              )
          }
          print(p[[1]])
          print(p[[2]])
          invisible(pdf("../result/cluster_Diff/Top10.GeneActivityandExpression.dotplot.pdf", width=24, height=6))
          invisible(print(p[[1]]))
          invisible(print(p[[2]]))
          invisible(dev.off()) 
          }else{
              print("No genes with significant difference in gene activity, please lower significance threshold") 
          }
  })
})
By comparing the gene activity calculated from scATAC-seq data with the gene expression levels measured by scRNA-seq data, we can observe the consistency between the two, which helps us understand the potential regulatory relationship between chromatin accessibility and gene expression. At the same time, we can also verify the rationality of cell clustering by observing the activity of classic marker genes.
However, the stability of these gene activity values is not as good as the expression results directly measured by scRNA-seq, mainly for two reasons:
Chromatin data itself has high sparsity;
The degree of openness of gene regulatory regions is not simply linearly related to the actual expression level of the gene.
Nevertheless, these gene activity data can still help us better identify major cell types. However, if we want to further divide cells into finer subpopulations, it is difficult to rely solely on this supervised analysis method.

Correlation Analysis Between Cell Clusters

Cross-Cell Type Correlation Analysis (RNA vs ATAC Activity)

  • Analysis Purpose

    • Quantify the similarity of different cell types in transcriptional expression and chromatin accessibility, verify the rationality of clustering/annotation, and compare cross-modality consistency and differences.
  • Implementation (Consistent with this code)

    • Input: sc (Seurat object), cell_types_all (cell type list), celltype_colname (type column name).
    • Data Source: sc@assays$RNA@data and sc@assays$ATAC_activity@data (both log-normalized data).
    • Type Summary (pseudobulk):
      • For each cell type, take the row mean at the cell level: Matrix::rowMeans(...), generating a gene × type matrix.
      • Obtain rna_avg (size approx. 25055 × n_types) and atac_avg (approx. 19620 × n_types).
    • Correlation Calculation:
      • Calculate Spearman correlation for rna_avg and atac_avg along the type dimension respectively: cor(..., method = "spearman").
      • Output two type × type correlation matrices: RNA and ATAC, with row/column names as cell_types_all.
    • Return Result: list(RNA = rna_mat, ATAC = atac_mat), for heatmap visualization (with hierarchical clustering and type color bar).
  • Interpretation Points

    • Consistency: Types with the same lineage or similar functions should show high correlation in both correlation heatmaps.
    • Cross-modality alignment: Similar adjacent type relationships in both RNA and ATAC matrices support reliable annotation.
    • Difference clues: High RNA correlation but low ATAC correlation suggests more obvious differences at the epigenetic level or sparse ATAC; conversely, it may reflect pre-activation/timing differences.
    • Heterogeneity: Correlation gradients or branch structures may suggest subgroup or continuum states.
R
# Calculate RNA and ATAC correlation between different cell types
calculate_modality_correlations <- function(sc,cell_types,celltype_colname) {
  
  # Initialize storage matrix
  n_types <- length(cell_types)
  rna_mat <- matrix(0, n_types, n_types)
  atac_mat <- matrix(0, n_types, n_types)
  
  # Get correct gene count
  n_rna_genes <- nrow(sc@assays$RNA@data)    # Should be 25055
  n_atac_genes <- nrow(sc@assays$ATAC_activity@data)  # Should be 19620
  
  # Initialize mean matrix with correct dimensions
  rna_avg <- matrix(0, n_rna_genes, n_types)
  atac_avg <- matrix(0, n_atac_genes, n_types)
  
  # Calculate average expression/accessibility for each cell type
  for (idx in seq_along(cell_types)) {
    current_type <- cell_types[idx]
    
    # Get cell index for current type
    cell_idx <- which(sc@meta.data[[celltype_colname]] == current_type) 
    # Calculate mean
    rna_avg[,idx] <- Matrix::rowMeans(sc@assays$RNA@data[, cell_idx])
    atac_avg[,idx] <- Matrix::rowMeans(sc@assays$ATAC_activity@data[, cell_idx])
  }
  
  # Calculate correlation
  rna_mat <- cor(rna_avg, method = "spearman")
  atac_mat <- cor(atac_avg, method = "spearman")
  
  rownames(rna_mat) <- colnames(rna_mat) <- cell_types
  rownames(atac_mat) <- colnames(atac_mat) <- cell_types
  # Set row and column names
  rownames(rna_mat) <- colnames(rna_mat) <- cell_types
  rownames(atac_mat) <- colnames(atac_mat) <- cell_types
  
  # Return correlation matrix
  return(list(RNA = rna_mat, ATAC = atac_mat))
}

# Use function
results <- calculate_modality_correlations(sc,cell_types_all,celltype_colname)
R
options(repr.plot.width = 18, repr.plot.height = 4)
# Define common visualization parameters
get_breaks <- function(mat) {
    min_break <- floor(min(mat) * 10) / 10  # Floor min value to 0.1
    seq(min_break, 1, length.out = 101)     # Fix max value 1
}

colors <- colorRampPalette(c("#4B2991", "#56B1F7", "#FCFFA4", "#F97B57", "#AB3282"))(100)
breaks <- get_breaks(c(results$RNA, results$ATAC))  # Unified color scale calculation

# Common heatmap parameters
heatmap_config <- list(
    color = colors,
    breaks = breaks,
    cluster_rows = FALSE,
    cluster_cols = FALSE,
    fontsize = 8,
    border_color = NA,
    silent = TRUE
)

# Generate heatmap
suppressWarnings({
    rna_plot <- do.call(pheatmap, c(list(mat = results$RNA, main = "RNA Expression Correlation"), heatmap_config))
    atac_plot <- do.call(pheatmap, c(list(mat = results$ATAC, main = "ATAC Accessibility Correlation"), heatmap_config))
    
    # Display and save
    grid.arrange(rna_plot$gtable, atac_plot$gtable, ncol = 2)
    pdf("../result/modality_correlations.pdf", width = 12, height = 6)
    grid.arrange(rna_plot$gtable, atac_plot$gtable, ncol = 2)
    dev.off()
})
* **Left Figure:** RNA expression correlation heatmap, reflecting the similarity and heterogeneity of different clusters at the transcriptome level.
* **Right Figure:** ATAC accessibility correlation heatmap, showing the similarity distribution and heterogeneity of chromatin open states.
R
saveRDS(sc, file = "../result/result.rds")

Result Files

├── cluster_Diff Inter-cluster gene activity differential analysis result folder
│ ├── cluster_Diff_genes_activity.txt Summary table of inter-cluster gene activity differential analysis results
│ ├── FeaturePlot Dimensionality reduction plot visualizing gene activity of top 5 fold change (logFC) in each cell cluster
│ │ ├── feature_plot_cluster_ATAC_activity_.pdf
│ │ ├── feature_plot_cluster_ATAC_activity_
.png
│ │ ├── feature_plot_cluster_RNA_.pdf
│ │ └── feature_plot_cluster_RNA_
.png
│ ├── Top10.GeneActivityandExpression.dotplot.pdf Dot plot visualizing gene activity and expression of top 10 fold change (logFC) in each cell cluster
│ ├── Top10.GeneActivityandExpression.heatmap.pdf Heatmap visualizing gene activity and expression of top 10 fold change (logFC) in each cell cluster
│ └── Top10.GeneActivityandExpression.vlnplot.pdf Violin plot visualizing gene activity and expression of top 10 fold change (logFC) in each cell cluster
├── group_Diff Inter-group gene activity differential analysis result folder
│ ├── *.vs.heatmap.Activity.pdf Heatmap showing inter-group differential gene activity
│ ├── *.vs.heatmap.Activity.png Heatmap showing inter-group differential gene activity
│ ├── *.vs.heatmap.Expression.pdf Heatmap of gene expression corresponding to the above genes
│ ├── *.vs.heatmap.Expression.png Heatmap of gene expression corresponding to the above genes
│ ├── *.vs.Volcano.pdf Volcano plot visualizing inter-group gene activity differences
│ ├── *.vs.Volcano.png Volcano plot visualizing inter-group gene activity differences
│ └── group_Diff_genes_activity.txt Summary of inter-group differential analysis results in all cell clusters
└── modality_correlations.pdf

Result Directory: The directory includes all images and tables involved in this analysis.

Literature Case Analysis

  • 《Single-cell systems pharmacology identifies development-driven drug response and combination therapy in B cell acute lymphoblastic leukemia》

  • "Single-cell analyses reveal key immune cell subsets associated with response to PD-L1 blockade in triple-negative breast cancer" Cell annotation of scATAC-seq data based on gene activity matrix

  • 《A cis-regulatory atlas in maize at single-cell resolution》

    In this study, researchers investigated the relationship between gene accessibility and gene expression in the maize genome by integrating ATAC-seq and RNA-seq data at the single-cell level. The study found that among the 36,322 analyzed genes, most showed a positive correlation between gene accessibility and transcriptional activity (Figures E, F, G, H). However, the study also found that about 6,063 genes had high chromatin accessibility but no significant transcriptional activity (Figure H), suggesting that gene accessibility does not necessarily lead to gene expression. To explain this phenomenon, researchers performed epigenetic analysis on the genes. The results showed that these accessible but non-expressed genes were enriched with H3K27me3 modification (Figure J); in contrast, those genes that were neither accessible nor expressed (about 4,315) maintained their silenced state mainly through DNA methylation. This finding reveals two main mechanisms of maize gene silencing regulation: (1) silencing genes through H3K27me3 modification; (2) relying on DNA methylation to maintain the silenced state.

References

[1] Stuart T, Srivastava A, Madad S, et al.Single-cell chromatin state analysis with Signac (vol 18, pg 1333, 2021)[J].Nature methods, 2022(2):19. [2] Hao Y, Stuart T, Kowalski M, et al.Dictionary learning for integrative, multimodal, and scalable single-cell analysis[J].bioRxiv, 2022.DOI: 10.1101/2022.02.24.481684.

Appendix

.txt: Result data table file, tab-separated. Unix/Linux/Mac users use less or more commands to view; Windows users use advanced text editors like Notepad++ to view, or open with Microsoft Excel.

.pdf: Result image file, vector graphic, can be zoomed in and out without distortion, convenient for users to view and edit. Use Adobe Illustrator for image editing, suitable for publication, etc.

.rds: Seurat object containing gene activity assay, needs to be opened in R environment for viewing and further analysis.

0 comments·0 replies