Skip to content

scMethyl + RNA Multi-omics: DMR Functional Enrichment Analysis (Based on rGREAT)

Author: SeekGene
Time: 16 min
Words: 3.0k words
Updated: 2026-02-28
Reads: 0 times
scMethyl + RNA-seq Analysis Guide Notebooks

Module Introduction

This module is based on rGREAT, aiming to associate biological functional annotations with genomic regions. In this analysis, the module is used to perform functional enrichment analysis on differentially methylated regions (DMRs).

rGREAT is the local R implementation of the GREAT (Genomic Regions Enrichment of Annotations Tool) algorithm. This method achieves the following core analysis goals by associating genomic regions with nearby genes and evaluating the enrichment of these genes in different functional annotation categories:

  • Functional Enrichment Analysis: Identifies significant enrichment of DMR-associated genes in Gene Ontology (GO) functional categories, revealing potential biological functions and processes involved with DMRs.
  • Region-Gene Association: Associates DMRs with their neighboring genes to infer potential regulatory target genes for each methylated region.
  • Multiple Gene Sets Support: In addition to GO gene sets, supports various gene set databases like MSigDB to provide a more comprehensive functional annotation perspective.

This analysis is mainly applicable to Human and Mouse samples, but also supports other species with Bioconductor annotation packages.

BioMart Support (Non-model Organisms): For non-model organisms, BioMart can be used to retrieve corresponding gene set information (covering over 600 species), extending the applicability of DMR functional enrichment analysis:

r
# Use BioMart to Retrieve Gene Sets
res <- great(gr, "GO:BP", biomart_dataset = "amelanoleuca_gene_ensembl")

Input File Preparation

This analysis requires a BED format input file describing the genomic coordinate information of Differentially Methylated Regions (DMRs).

Input File Format

The BED file should be a tab-separated text file containing at least the following three columns:

Column No.Column NameDescriptionExample
1Chromosome NameChromosome name (e.g., chr1, chr2, chr11, etc.)chr2
2Start PositionStart position (0-based, counting starts from 0)196161361
3End PositionEnd position (1-based, exclusive)196164361

File Description:

  • Coordinate system uses 0-based indexing (start position begins at 0).
  • Must explicitly specify the genome version used (e.g., hg38, hg19, mm10, mm9, etc.).
  • Each region represents a Differentially Methylated Region (DMR).
  • Note: If the BED file contains more than 500 regions, the analysis pipeline will default to using only the first 500 regions for calculation; this threshold can be adjusted based on analysis needs.

File Structure Example

bed
chr2	196161361	196164361
chr2	203704361	203712361
chr11	128486808	128522808
...

rGREAT Analysis Pipeline

Key Parameter Settings

In rGREAT analysis, the following parameters have a significant impact on result stability and biological interpretation:

Parameter NameChinese MeaningDefault/Suggested ValueDetailed Description
annotation_databaseTSS Source"GREAT:hg38"Annotation DB Source.
Specifies TSS data source. Supports:
"GREAT:hg38": Official GREAT hg38 TSS (Recommended)
"hg38", "hg19", "mm10": Auto-use TxDb packages
"RefSeq:hg38": RefSeqSelect genes
"Gencode_v19": Gencode annotation (Human)
Note: "GREAT:hg38" only supports hg38, hg19, mm10, mm9.
gene_setsGene Set Type"GO:BP", "GO:CC", "GO:MF"Functional Annotation Sets.
Specifies gene sets for enrichment:
"GO:BP": Biological Process
"GO:CC": Cellular Component
"GO:MF": Molecular Function
"msigdb:H": MSigDB Hallmark (Human only)
"msigdb:C2": MSigDB Curated (Human only)
Note: Prefixes GO: and msigdb: can be omitted.
top_nVisualization Count20Visualization Parameter.
Specifies top N most enriched GO terms to show in bubble plot. Recommended 20-30 for quick identification of key biological functions.
backgroundBackground RegionsNULL (Whole Genome)Background Parameter (Optional).
Specifies background regions for enrichment. If NULL, uses whole genome (excluding gaps).
Usage: To exclude specific regions (repeats, unsequenced) or limit analysis scope.
excludeExcluded RegionsNULLExclusion Parameter (Optional).
Specifies regions to exclude (e.g. gaps, repeats).
Note: rGREAT excludes gap regions by default.

Supported TSS Sources:

  1. Genome Version: E.g., "hg38", "hg19", "mm10", etc., will automatically call the corresponding TxDb annotation package.
  2. TxDb Package Name: E.g., "TxDb.Hsapiens.UCSC.hg38.knownGene".
  3. RefSeq: E.g., "RefSeq:hg38", based on RefSeq gene annotation.
  4. RefSeqCurated: e.g., "RefSeqCurated:hg38", RefSeqCurated subset.
  5. RefSeqSelect: e.g., "RefSeqSelect:hg38", RefSeqCurated subset.
  6. Gencode: e.g., "Gencode_v19" (Human) or "Gencode_M21" (Mouse).
  7. GREAT: e.g., "GREAT:hg38" (Only supports hg38, hg19, mm10, mm9). .

Region-Gene Association Analysis

The first step of GREAT analysis is to associate input genomic regions with extended TSS regulatory domains of genes.

Analysis Principle

The core idea of the GREAT algorithm is to associate non-coding genomic regions with their neighboring genes and infer potential biological functions of the regions based on known functional annotations of these genes.

  1. Association Rules: The GREAT algorithm first defines a regulatory domain for each gene. When an input region overlaps with a gene's regulatory domain, that region is associated with that gene (Note: a single region can be associated with multiple genes simultaneously).

  2. Construction of Regulatory Domains (taking basalPlusExt mode as example): First define Basal Domain (covering 5000 bp upstream to 1000 bp downstream of TSS); then perform Extension to both sides until encountering adjacent gene's Basal Domain boundary, or reaching the set maximum extension distance (default 1,000,000 bp).

Analysis Steps

  1. Build Extended TSS Regulatory Domains: Adopt specified extension mode (default basalPlusExt) to build extended regulatory domains for each gene.

  2. Region-Gene Matching: Determine gene association by overlapping the midpoint of input regions with extended TSS regulatory domains.

  3. Result Output: Generate region-gene association diagram (GREAT_region_gene_associations.png) to visually display correspondence between input regions and their associated genes.

R
# Load required R packages
library(rGREAT)
library(GenomicRanges)
library(rtracklayer)
library(ggplot2)
library(dplyr)

## bed_file: Input BED file path
bed_file <- "./DMRs/DMRs_significant_2.bed"

## annotation_database: TSS source
annotation_database <- "GREAT:hg38"  # Modify based on species and reference genome version used, e.g., GREAT:mm9

## go_types: GO types to analyze
go_types <- c("GO:BP" = "Biological Process",
              "GO:CC" = "Cellular Component", 
              "GO:MF" = "Molecular Function")

## outdir: Result output directory, all result files will be saved here
outdir <- "DMRs_rGREAT"
# Create output directory (if not exists)
dir.create(outdir, recursive = TRUE, showWarnings = FALSE)

# Define function: Call rGREAT analysis in R
run_rgreat_analysis <- function(gr, go_type, annotation_database) {
  res <- great(gr, go_type, annotation_database)
  return(res)
}

# Read BED file
gr <- import(bed_file, format = "BED")

# If number of regions exceeds 500, only use the first 500 regions for analysis
if (length(gr) > 500) {
  gr <- gr[1:500]
}



# Region-Gene Association Analysis
res_temp <- run_rgreat_analysis(gr, names(go_types)[1], annotation_database)
assoc_file <- file.path(outdir, "GREAT_region_gene_associations.png")
png(assoc_file, width = 12, height = 8, units = "in", res = 300)
p2 <- plotRegionGeneAssociations(res_temp)
if (inherits(p2, "ggplot")) {
  print(p2)
}
dev.off()
output
Loading required package: GenomicRanges

Loading required package: stats4

Loading required package: BiocGenerics


Attaching package: ‘BiocGenerics’


The following objects are masked from ‘package:stats’:

IQR, mad, sd, var, xtabs


The following objects are masked from ‘package:base’:

anyDuplicated, aperm, append, as.data.frame, basename, cbind,
colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
table, tapply, union, unique, unsplit, which.max, which.min


Loading required package: S4Vectors


Attaching package: ‘S4Vectors’


The following object is masked from ‘package:utils’:

findMatches


The following objects are masked from ‘package:base’:

expand.grid, I, unname


Loading required package: IRanges

Loading required package: GenomeInfoDb






------------------
Note: On Aug 19 2019 GREAT released version 4 where it supports \`hg38\`
genome and removes some ontologies such pathways. \`submitGreatJob()\`
still takes \`hg19\` as default. \`hg38\` can be specified by the \`species
= 'hg38'\` argument. To use the older versions such as 3.0.0, specify as
\`submitGreatJob(..., version = '3.0.0')\`.

From rGREAT version 1.99.0, it also implements the GREAT algorithm and
it allows to integrate GREAT analysis with the Bioconductor annotation
ecosystem. By default it supports more than 500 organisms. Check the
new function \`great()\` and the new vignette.
------------------


Attaching package: ‘dplyr’


The following objects are masked from ‘package:GenomicRanges’:

intersect, setdiff, union


The following object is masked from ‘package:GenomeInfoDb’:

intersect


The following objects are masked from ‘package:IRanges’:

collapse, desc, intersect, setdiff, slice, union


The following objects are masked from ‘package:S4Vectors’:

first, intersect, rename, setdiff, setequal, union


The following objects are masked from ‘package:BiocGenerics’:

combine, intersect, setdiff, union


The following objects are masked from ‘package:stats’:

filter, lag


The following objects are masked from ‘package:base’:

intersect, setdiff, setequal, union


* TSS source: GREAT.

* use genome 'hg38'.

* TSS extension mode is 'basalPlusExt'.

* construct the basal domains by extending 5000bp to upstream and 1000bp to downsteram of TSS.

* calculate distances to neighbour regions.

* extend to both sides until reaching the neighbour genes or to the maximal extension.

* use GO:BP ontology with 15709 gene sets (source: org.Hs.eg.db).

* check gene ID type in \`gene_sets\` and in \`extended_tss\`.

* use whole genome as background.

* remove excluded regions from background.

* overlap \`gr\` to background regions (based on midpoint).

* in total 499 \`gr\`.

* overlap extended TSS to background regions.

* check which genes are in the gene sets.

* only take gene sets with size >= 5.

* in total 9168 gene sets.

* overlap \`gr\` to every extended TSS.

* perform binomial test for each biological term.

pdf: 2

GO Enrichment Analysis

The second step of GREAT analysis is to perform functional enrichment analysis on region-gene association results to evaluate the significance of associated genes in biological functional categories.

Analysis Principle

  1. Binomial Test: Evaluates functional enrichment at the region level. This method assumes input regions are uniformly distributed across the genome and assesses whether the proportion of genomic regions associated with a specific GO term is significantly higher than random expectation.

  2. Hypergeometric Test: Used to assess functional enrichment at the gene level. This method assumes associated genes are randomly drawn from the background gene set and evaluates enrichment significance at the gene level by calculating if the number of associated genes in a GO term is significantly higher than random expectation.

  3. Result Interpretation: fold_enrichment_hyper is the enrichment fold from hypergeometric test (> 1 means enrichment), p_value_hyper is the raw P-value from hypergeometric test, p_adjust_hyper is the adjusted P-value after multiple testing correction (FDR, usually < 0.05 as significance threshold), mean_tss_dist is the average distance from input regions to TSS (larger value implies relatively lower reliability of region-gene association).

Analysis Steps

  1. Statistical Test: Simultaneously perform Binomial Test (assessing region-level enrichment) and Hypergeometric Test (calculating gene-level enrichment) to comprehensively assess GO term significance from different levels.

  2. Result Output: Obtain complete functional enrichment result table via getEnrichmentTable(res); simultaneously generate Volcano Plot (X-axis is enrichment fold, Y-axis is -log10 P-value) and Bubble Plot (showing top N most enriched GO terms) for result visualization.

R
## top_n: Visualize top N enriched GO terms
top_n <- 20

## desc_max_length: Maximum length of GO description in image
desc_max_length <- 50

# GO Enrichment Analysis
## Loop through each GO type
for (go_type in names(go_types)) {
  go_name <- go_types[go_type]
 
  # Run GREAT analysis
  res <- run_rgreat_analysis(gr, go_type, annotation_database)
  
  # Get enrichment analysis result table
  tb <- getEnrichmentTable(res)
  
  # Sort by fold_enrichment_hyper (descending) to align result table with chart
  tb <- tb %>% arrange(desc(fold_enrichment_hyper))
  
  # Save full result table (sorted)
  result_file <- file.path(outdir, paste0("GREAT_enrichment_", go_type, "_results.txt"))
  write.table(tb, file = result_file, sep = "\t", quote = FALSE, row.names = FALSE)
  
  # Visualization 1: Volcano Plot
  volcano_file <- file.path(outdir, paste0("GREAT_volcano_plot_", go_type, ".png"))
  png(volcano_file, width = 10, height = 8, units = "in", res = 300)
  p1 <- plotVolcano(res)
  # If it is a ggplot object, need to print
  if (inherits(p1, "ggplot")) {
    print(p1)
  }
  dev.off()
  
  # Visualization 2: Bubble Plot
  if (nrow(tb) > 0) {
    # Sort by fold_enrichment_hyper and take top_n
    tb_top <- tb %>%
      head(top_n) %>%
      mutate(
        neg_log10_padj = -log10(p_adjust_hyper),
        neg_log10_padj = ifelse(is.infinite(neg_log10_padj), -log10(1e-300), neg_log10_padj),
        description_short = ifelse(nchar(description) > desc_max_length,
                                 paste0(substr(description, 1, desc_max_length-3), "..."),
                                 description)
      ) %>%
      mutate(description_short = factor(description_short, levels = rev(description_short)))
    
    if (nrow(tb_top) > 0) {
      p3 <- ggplot(tb_top, aes(x = fold_enrichment_hyper, y = description_short,
                                size = observed_gene_hits, color = neg_log10_padj)) +
        geom_point(alpha = 0.7) +
        scale_size_continuous(name = "Number of\nGenes", range = c(3, 12)) +
        scale_color_gradient(low = "#4A90E2", high = "#E74C3C", name = "-log10(FDR)") +
        labs(title = paste("Top", min(top_n, nrow(tb_top)), "Enriched", go_name),
             x = "Fold Enrichment", y = "GO Term") +
        theme_minimal() +
        theme(plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
              axis.text.y = element_text(size = 9),
              legend.position = "right") +
        geom_vline(xintercept = 1, linetype = "dashed", color = "gray60", alpha = 0.5)
      
      bubble_file <- file.path(outdir, paste0("GO_bubbleplot_", go_type, ".png"))
      ggsave(bubble_file, plot = p3, width = 12, height = 10, dpi = 300)
      
      # Save top_n results (corresponding to the chart, sorted by fold_enrichment_hyper)
      top20_file <- file.path(outdir, paste0("GREAT_enrichment_", go_type, "_top20.txt"))
      tb_top20 <- tb %>% head(top_n)
      write.table(tb_top20, file = top20_file, sep = "\t", quote = FALSE, row.names = FALSE)
    } else {
      cat("警告: 没有足够的数据生成气泡图\n")
    }
  } else {
    cat("警告: 没有富集结果\n")
  }
}
output
* TSS source: GREAT.

* use genome 'hg38'.

* TSS extension mode is 'basalPlusExt'.

* construct the basal domains by extending 5000bp to upstream and 1000bp to downsteram of TSS.

* calculate distances to neighbour regions.

* extend to both sides until reaching the neighbour genes or to the maximal extension.

* use GO:BP ontology with 15709 gene sets (source: org.Hs.eg.db).

* check gene ID type in \`gene_sets\` and in \`extended_tss\`.

* use whole genome as background.

* remove excluded regions from background.

* overlap \`gr\` to background regions (based on midpoint).

* in total 499 \`gr\`.

* overlap extended TSS to background regions.

* check which genes are in the gene sets.

* only take gene sets with size >= 5.

* in total 9168 gene sets.

* overlap \`gr\` to every extended TSS.

* perform binomial test for each biological term.

* TSS source: GREAT.

* use genome 'hg38'.

* TSS extension mode is 'basalPlusExt'.

* construct the basal domains by extending 5000bp to upstream and 1000bp to downsteram of TSS.

* calculate distances to neighbour regions.

* extend to both sides until reaching the neighbour genes or to the maximal extension.

* use GO:CC ontology with 1977 gene sets (source: org.Hs.eg.db).

* check gene ID type in \`gene_sets\` and in \`extended_tss\`.

* use whole genome as background.

* remove excluded regions from background.

* overlap \`gr\` to background regions (based on midpoint).

* in total 499 \`gr\`.

* overlap extended TSS to background regions.

* check which genes are in the gene sets.

* only take gene sets with size >= 5.

* in total 1184 gene sets.

* overlap \`gr\` to every extended TSS.

* perform binomial test for each biological term.

* TSS source: GREAT.

* use genome 'hg38'.

* TSS extension mode is 'basalPlusExt'.

* construct the basal domains by extending 5000bp to upstream and 1000bp to downsteram of TSS.

* calculate distances to neighbour regions.

* extend to both sides until reaching the neighbour genes or to the maximal extension.

* use GO:MF ontology with 5055 gene sets (source: org.Hs.eg.db).

* check gene ID type in \`gene_sets\` and in \`extended_tss\`.

* use whole genome as background.

* remove excluded regions from background.

* overlap \`gr\` to background regions (based on midpoint).

* in total 499 \`gr\`.

* overlap extended TSS to background regions.

* check which genes are in the gene sets.

* only take gene sets with size >= 5.

* in total 2001 gene sets.

* overlap \`gr\` to every extended TSS.

* perform binomial test for each biological term.

Result Interpretation

The enrichment result table returned by getEnrichmentTable() contains the following field descriptions:

Field NameDescription
idGO term ID (e.g., GO:0007155)
descriptionGO term description
genome_fractionProportion of genome covered by regions associated with this GO term
observed_region_hitsObserved Number of Associated Regions (Binomial Test)
fold_enrichmentFold enrichment from Binomial test
p_valueRaw P-value for Binomial Test
p_adjustAdjusted P-value for Binomial Test (FDR)
mean_tss_distMean distance from input regions to TSS (bp)
observed_gene_hitsObserved Number of Associated Genes (Hypergeometric Test)
gene_set_sizeTotal genes in GO term
fold_enrichment_hyperFold enrichment from Hypergeometric test (Recommended)
p_value_hyperRaw P-value for Hypergeometric Test
p_adjust_hyperAdjusted P-value for Hypergeometric Test (FDR, Recommended)
0 comments·0 replies