scMethyl + RNA Multi-omics: DMR Functional Enrichment Analysis (Based on rGREAT)

Author: SeekGene

Time: 15 min

Words: 3.0k words

Updated: 2026-03-18

Reads: 0 times

scMethyl + RNA-seq Notebooks

Module Introduction

This module is based on rGREAT and aims to associate biological functions with genomic regions. Here, we use it to perform functional enrichment analysis of Differentially Methylated Regions (DMRs).

rGREAT is an R package that implements a local version of the GREAT (Genomic Regions Enrichment of Annotations Tool) algorithm. By associating genomic regions with nearby genes and evaluating the enrichment of these genes in functional categories, it achieves the following core analytical goals:

Functional Enrichment: Identifies significant enrichment of DMR-associated genes in Gene Ontology (GO) functional categories, revealing potential biological functions and processes involved in DMRs.
Region-Gene Association: Associates DMRs with nearby genes to determine the potential regulatory targets of each region.
Multiple Gene Sets Support: Supports not only GO gene sets but also various gene set databases like MSigDB, providing comprehensive functional annotations.

This analysis is mainly applicable to Human and Mouse samples, but also supports other species with Bioconductor annotation packages.
BioMart Support (Non-model organisms): For non-model organisms, BioMart can be used to obtain gene sets (supporting over 600 species):
r
# Use BioMart to get gene sets
res <- great(gr, "GO:BP", biomart_dataset = "amelanoleuca_gene_ensembl")

Input File Preparation

This analysis requires a BED format file containing the genomic coordinates of DMRs (Differentially Methylated Regions).

Input File Format

The BED file should be a tab-separated text file containing at least 3 columns:

Col No.	Col Name	Meaning	Example
1	Chromosome Name	Chromosome name (e.g., chr1, chr2, chr11)	chr2
2	Start Position	Start position (0-based, inclusive)	196161361
3	End Position	End position (1-based, exclusive)	196164361

File Description:

Coordinate system is 0-based (start position starts from 0).
Must explicitly specify the genome version used (e.g., hg38, hg19, mm10, mm9, etc.).
Each region represents a Differentially Methylated Region (DMR).
Note: If the BED file contains more than 500 regions, the analysis will automatically use only the first 500 regions. You can adjust this number according to your needs.

File Structure Example

bed

chr2	196161361	196164361
chr2	203704361	203712361
chr11	128486808	128522808
...

rGREAT Analysis Workflow

Key Parameter Settings

In the analysis process, the following parameters are crucial for result quality and biological interpretation:

Parameter	Meaning	Default/Suggested	Description
`annotation_database`	TSS Source	`"GREAT:hg38"`	Annotation Database Source. Specifies the source of Transcription Start Sites (TSS). Supports multiple formats: • `"GREAT:hg38"`: Official GREAT hg38 TSS data (recommended) • `"hg38"`, `"hg19"`, `"mm10"`, etc.: Automatically uses corresponding TxDb packages • `"RefSeq:hg38"`: RefSeqSelect genes • `"Gencode_v19"`: Gencode annotation (Human) Note: `"GREAT:hg38"` only supports hg38, hg19, mm10, mm9 genome versions.
`gene_sets`	Gene Set Type	`"GO:BP"`, `"GO:CC"`, `"GO:MF"`	Functional Annotation Gene Sets. Specifies the gene set types for enrichment analysis: • `"GO:BP"`: Biological Process • `"GO:CC"`: Cellular Component • `"GO:MF"`: Molecular Function • `"msigdb:H"`: MSigDB Hallmark gene sets (Human only) • `"msigdb:C2"`: MSigDB Curated gene sets (Human only) Note: Prefixes `GO:` and `msigdb:` can be omitted.
`top_n`	Visualization Count	20	Visualization Parameter. Specifies the top N most enriched GO terms to display in the bubble plot. Suggested setting is 20-30 for quick identification of key biological functions.
`background`	Background Regions	`NULL` (Whole Genome)	Background Setting Parameter (Optional). Specifies the background regions for enrichment analysis. If set to `NULL`, the whole genome (excluding gap regions) is used as background. Application: Used when excluding specific regions (e.g., repetitive sequences, unsequenced regions) or restricting analysis scope.
`exclude`	Excluded Regions	`NULL`	Excluded Regions Parameter (Optional). Specifies genomic regions to exclude from analysis (e.g., gap regions, repetitive sequences). Note: rGREAT excludes gap regions by default.

Supported TSS Sources:

Genome Version: e.g., "hg38", "hg19", "mm10", automatically uses corresponding TxDb packages.
TxDb Package Name: e.g., "TxDb.Hsapiens.UCSC.hg38.knownGene".
RefSeq: "RefSeq:hg38" (Based on RefSeq gene annotation).
RefSeqCurated: "RefSeqCurated:hg38" (RefSeqCurated subset).
RefSeqSelect: "RefSeqSelect:hg38" (RefSeqSelect subset).
Gencode: "Gencode_v19" (Human) or "Gencode_M21" (Mouse).
GREAT: "GREAT:hg38" (Only supports hg38, hg19, mm10, mm9).

Region-Gene Association Analysis

The first step of GREAT analysis is to associate input genomic regions with the extended TSS regulatory domains of genes.

Analysis Principle

The core idea of the GREAT algorithm is to associate non-coding genomic regions with their nearby genes and infer the potential biological functions of these regions based on the known functional annotations of these genes.

Association Rule: The GREAT algorithm first defines a regulatory domain for each gene. When an input region overlaps with the regulatory domain of a gene, the region is associated with that gene (Note: A single region can be associated with multiple genes simultaneously).
Construction of Regulatory Domains (taking basalPlusExt mode as an example): First, define a Basal Domain (covering 5000 bp upstream to 1000 bp downstream of the TSS); then perform Extension to both sides until reaching the basal domain boundary of adjacent genes, or reaching the specified maximum extension distance (default 1,000,000 bp).

Analysis Steps

Construct Extended TSS Regulatory Domains: Use the specified extension mode (default basalPlusExt) to construct the extended regulatory domain for each gene.
Region-Gene Matching: Determine the genes associated with each region by checking for overlap between the midpoint of the input region and the extended TSS regulatory domains.
Result Output: Generate a region-gene association plot (GREAT_region_gene_associations.png) to visualize the correspondence between input regions and their associated genes.

# Load required R packages
library(rGREAT)
library(GenomicRanges)
library(rtracklayer)
library(ggplot2)
library(dplyr)

## bed_file: Input BED file path
bed_file <- "./DMRs/DMRs_significant_2.bed"

## annotation_database: TSS source
annotation_database <- "GREAT:hg38"  # Modify according to species and reference genome version, e.g., GREAT:mm9

## go_types: GO types to analyze
go_types <- c("GO:BP" = "Biological Process",
              "GO:CC" = "Cellular Component", 
              "GO:MF" = "Molecular Function")

## outdir: Output directory, all result files will be saved here
outdir <- "DMRs_rGREAT"
# Create output directory (if it doesn't exist)
dir.create(outdir, recursive = TRUE, showWarnings = FALSE)

# Define function: Call rGREAT analysis in R
run_rgreat_analysis <- function(gr, go_type, annotation_database) {
  res <- great(gr, go_type, annotation_database)
  return(res)
}

# Read BED file
gr <- import(bed_file, format = "BED")

# If number of regions exceeds 500, only use the first 500 regions
if (length(gr) > 500) {
  gr <- gr[1:500]
}



# Region-gene association analysis
res_temp <- run_rgreat_analysis(gr, names(go_types)[1], annotation_database)
assoc_file <- file.path(outdir, "GREAT_region_gene_associations.png")
png(assoc_file, width = 12, height = 8, units = "in", res = 300)
p2 <- plotRegionGeneAssociations(res_temp)
if (inherits(p2, "ggplot")) {
  print(p2)
}
dev.off()

output

Loading required package: GenomicRanges

Loading required package: stats4

Loading required package: BiocGenerics

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:stats’:

IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

anyDuplicated, aperm, append, as.data.frame, basename, cbind,
colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
table, tapply, union, unique, unsplit, which.max, which.min

Loading required package: S4Vectors

Attaching package: ‘S4Vectors’

The following object is masked from ‘package:utils’:

findMatches

The following objects are masked from ‘package:base’:

expand.grid, I, unname

Loading required package: IRanges

Loading required package: GenomeInfoDb

------------------
Note: On Aug 19 2019 GREAT released version 4 where it supports \`hg38\`
genome and removes some ontologies such pathways. \`submitGreatJob()\`
still takes \`hg19\` as default. \`hg38\` can be specified by the \`species
= 'hg38'\` argument. To use the older versions such as 3.0.0, specify as
\`submitGreatJob(..., version = '3.0.0')\`.

From rGREAT version 1.99.0, it also implements the GREAT algorithm and
it allows to integrate GREAT analysis with the Bioconductor annotation
ecosystem. By default it supports more than 500 organisms. Check the
new function \`great()\` and the new vignette.
------------------

Attaching package: ‘dplyr’

The following objects are masked from ‘package:GenomicRanges’:

intersect, setdiff, union

The following object is masked from ‘package:GenomeInfoDb’:

intersect

The following objects are masked from ‘package:IRanges’:

collapse, desc, intersect, setdiff, slice, union

The following objects are masked from ‘package:S4Vectors’:

first, intersect, rename, setdiff, setequal, union

The following objects are masked from ‘package:BiocGenerics’:

combine, intersect, setdiff, union

The following objects are masked from ‘package:stats’:

filter, lag

The following objects are masked from ‘package:base’:

intersect, setdiff, setequal, union

* TSS source: GREAT.

* use genome 'hg38'.

* TSS extension mode is 'basalPlusExt'.

* construct the basal domains by extending 5000bp to upstream and 1000bp to downsteram of TSS.

* calculate distances to neighbour regions.

* extend to both sides until reaching the neighbour genes or to the maximal extension.

* use GO:BP ontology with 15709 gene sets (source: org.Hs.eg.db).

* check gene ID type in \`gene_sets\` and in \`extended_tss\`.

* use whole genome as background.

* remove excluded regions from background.

* overlap \`gr\` to background regions (based on midpoint).

* in total 499 \`gr\`.

* overlap extended TSS to background regions.

* check which genes are in the gene sets.

* only take gene sets with size >= 5.

* in total 9168 gene sets.

* overlap \`gr\` to every extended TSS.

* perform binomial test for each biological term.

pdf: 2

GO Enrichment Analysis

The second step of GREAT analysis is to perform functional enrichment analysis on the associated genes.

Analysis Principle

Binomial Test: Used to evaluate enrichment at the region level. Assuming input regions are uniformly distributed across the genome, it calculates whether the proportion of genomic regions associated with each GO term is significantly higher than random expectation.
Hypergeometric Test: Used to evaluate enrichment at the gene level. Assuming associated genes are randomly drawn from all genes, it calculates whether the number of associated genes in each GO term is significantly higher than random expectation.
Result Interpretation: fold_enrichment_hyper is the enrichment fold (> 1 indicates enrichment), p_value_hyper is the raw P-value, p_adjust_hyper is the adjusted P-value (FDR, < 0.05 is usually considered significant), mean_tss_dist is the average distance from input regions to TSS (larger distance implies lower association reliability).

Analysis Steps

Statistical Testing: Perform both Binomial Test (evaluate region-level enrichment) and Hypergeometric Test (calculate gene-level enrichment) to comprehensively assess the significance of GO terms.
Result Output: Retrieve detailed enrichment result table via getEnrichmentTable(res); generate Volcano Plot (X-axis: enrichment fold, Y-axis: -log10 P-value) and Bubble Plot (displaying top N most enriched GO terms) for visualization.

# top_n: Visualize top N enriched GO terms
top_n <- 20

# desc_max_length: Maximum length of GO description in plots
desc_max_length <- 50

# GO Enrichment Analysis
## Loop through each GO type
for (go_type in names(go_types)) {
  go_name <- go_types[go_type]
 
  # Run GREAT analysis
  res <- run_rgreat_analysis(gr, go_type, annotation_database)
  
  # Get enrichment result table
  tb <- getEnrichmentTable(res)
  
  # Sort by fold_enrichment_hyper (descending) to match table with plots
  tb <- tb %>% arrange(desc(fold_enrichment_hyper))
  
  # Save complete result table (sorted)
  result_file <- file.path(outdir, paste0("GREAT_enrichment_", go_type, "_results.txt"))
  write.table(tb, file = result_file, sep = "\t", quote = FALSE, row.names = FALSE)
  
  # Visualization 1: Volcano Plot
  volcano_file <- file.path(outdir, paste0("GREAT_volcano_plot_", go_type, ".png"))
  png(volcano_file, width = 10, height = 8, units = "in", res = 300)
  p1 <- plotVolcano(res)
  # Print if it's a ggplot object
  if (inherits(p1, "ggplot")) {
    print(p1)
  }
  dev.off()
  
  # Visualization 2: Bubble Plot
  if (nrow(tb) > 0) {
    # Take top_n after sorting by fold_enrichment_hyper
    tb_top <- tb %>%
      head(top_n) %>%
      mutate(
        neg_log10_padj = -log10(p_adjust_hyper),
        neg_log10_padj = ifelse(is.infinite(neg_log10_padj), -log10(1e-300), neg_log10_padj),
        description_short = ifelse(nchar(description) > desc_max_length,
                                 paste0(substr(description, 1, desc_max_length-3), "..."),
                                 description)
      ) %>%
      mutate(description_short = factor(description_short, levels = rev(description_short)))
    
    if (nrow(tb_top) > 0) {
      p3 <- ggplot(tb_top, aes(x = fold_enrichment_hyper, y = description_short,
                                size = observed_gene_hits, color = neg_log10_padj)) +
        geom_point(alpha = 0.7) +
        scale_size_continuous(name = "Number of\nGenes", range = c(3, 12)) +
        scale_color_gradient(low = "#4A90E2", high = "#E74C3C", name = "-log10(FDR)") +
        labs(title = paste("Top", min(top_n, nrow(tb_top)), "Enriched", go_name),
             x = "Fold Enrichment", y = "GO Term") +
        theme_minimal() +
        theme(plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
              axis.text.y = element_text(size = 9),
              legend.position = "right") +
        geom_vline(xintercept = 1, linetype = "dashed", color = "gray60", alpha = 0.5)
      
      bubble_file <- file.path(outdir, paste0("GO_bubbleplot_", go_type, ".png"))
      ggsave(bubble_file, plot = p3, width = 12, height = 10, dpi = 300)
      
      # Save top_n results (corresponding to plots, sorted by fold_enrichment_hyper)
      top20_file <- file.path(outdir, paste0("GREAT_enrichment_", go_type, "_top20.txt"))
      tb_top20 <- tb %>% head(top_n)
      write.table(tb_top20, file = top20_file, sep = "\t", quote = FALSE, row.names = FALSE)
    } else {
      cat("Warning: Not enough data to generate bubble plot\n")
    }
  } else {
    cat("Warning: No enrichment results\n")
  }
}

output

* TSS source: GREAT.

* use genome 'hg38'.

* TSS extension mode is 'basalPlusExt'.

* construct the basal domains by extending 5000bp to upstream and 1000bp to downsteram of TSS.

* calculate distances to neighbour regions.

* extend to both sides until reaching the neighbour genes or to the maximal extension.

* use GO:BP ontology with 15709 gene sets (source: org.Hs.eg.db).

* check gene ID type in \`gene_sets\` and in \`extended_tss\`.

* use whole genome as background.

* remove excluded regions from background.

* overlap \`gr\` to background regions (based on midpoint).

* in total 499 \`gr\`.

* overlap extended TSS to background regions.

* check which genes are in the gene sets.

* only take gene sets with size >= 5.

* in total 9168 gene sets.

* overlap \`gr\` to every extended TSS.

* perform binomial test for each biological term.

* TSS source: GREAT.

* use genome 'hg38'.

* TSS extension mode is 'basalPlusExt'.

* construct the basal domains by extending 5000bp to upstream and 1000bp to downsteram of TSS.

* calculate distances to neighbour regions.

* extend to both sides until reaching the neighbour genes or to the maximal extension.

* use GO:CC ontology with 1977 gene sets (source: org.Hs.eg.db).

* check gene ID type in \`gene_sets\` and in \`extended_tss\`.

* use whole genome as background.

* remove excluded regions from background.

* overlap \`gr\` to background regions (based on midpoint).

* in total 499 \`gr\`.

* overlap extended TSS to background regions.

* check which genes are in the gene sets.

* only take gene sets with size >= 5.

* in total 1184 gene sets.

* overlap \`gr\` to every extended TSS.

* perform binomial test for each biological term.

* TSS source: GREAT.

* use genome 'hg38'.

* TSS extension mode is 'basalPlusExt'.

* construct the basal domains by extending 5000bp to upstream and 1000bp to downsteram of TSS.

* calculate distances to neighbour regions.

* extend to both sides until reaching the neighbour genes or to the maximal extension.

* use GO:MF ontology with 5055 gene sets (source: org.Hs.eg.db).

* check gene ID type in \`gene_sets\` and in \`extended_tss\`.

* use whole genome as background.

* remove excluded regions from background.

* overlap \`gr\` to background regions (based on midpoint).

* in total 499 \`gr\`.

* overlap extended TSS to background regions.

* check which genes are in the gene sets.

* only take gene sets with size >= 5.

* in total 2001 gene sets.

* overlap \`gr\` to every extended TSS.

* perform binomial test for each biological term.

Result Interpretation

The enrichment result table returned by getEnrichmentTable() contains the following fields:

Field Name	Description
`id`	GO term ID (e.g., GO:0007155)
`description`	GO term description
`genome_fraction`	Fraction of genome associated with this GO term
`observed_region_hits`	Observed number of associated regions (Binomial test)
`fold_enrichment`	Fold enrichment from Binomial test
`p_value`	Raw P-value from Binomial test
`p_adjust`	Adjusted P-value from Binomial test (FDR)
`mean_tss_dist`	Mean distance from input regions to TSS (bp)
`observed_gene_hits`	Observed number of associated genes (Hypergeometric test)
`gene_set_size`	Total number of genes in the GO term
`fold_enrichment_hyper`	Fold enrichment from Hypergeometric test (Recommended)
`p_value_hyper`	Raw P-value from Hypergeometric test
`p_adjust_hyper`	Adjusted P-value from Hypergeometric test (FDR, Recommended)

scMethyl + RNA Multi-omics: DMR Functional Enrichment Analysis (Based on rGREAT) ​

Module Introduction ​

Input File Preparation ​

Input File Format ​

File Structure Example ​

rGREAT Analysis Workflow ​

Key Parameter Settings ​

Region-Gene Association Analysis ​

Analysis Principle ​

Analysis Steps ​

GO Enrichment Analysis ​

Analysis Principle ​

Analysis Steps ​

Result Interpretation ​

scMethyl + RNA Multi-omics: DMR Functional Enrichment Analysis (Based on rGREAT)

Module Introduction

Input File Preparation

Input File Format

File Structure Example

rGREAT Analysis Workflow

Key Parameter Settings

Region-Gene Association Analysis

Analysis Principle

Analysis Steps

GO Enrichment Analysis

Analysis Principle

Analysis Steps

Result Interpretation