Skip to content

scMethyl + RNA Multi-omics: DMR Functional Enrichment Analysis (Based on rGREAT)

Author: SeekGene
Time: 15 min
Words: 3.0k words
Updated: 2026-03-18
Reads: 0 times
scMethyl + RNA-seq Notebooks

Module Introduction

This module is based on rGREAT and aims to associate biological functions with genomic regions. Here, we use it to perform functional enrichment analysis of Differentially Methylated Regions (DMRs).

rGREAT is an R package that implements a local version of the GREAT (Genomic Regions Enrichment of Annotations Tool) algorithm. By associating genomic regions with nearby genes and evaluating the enrichment of these genes in functional categories, it achieves the following core analytical goals:

  • Functional Enrichment: Identifies significant enrichment of DMR-associated genes in Gene Ontology (GO) functional categories, revealing potential biological functions and processes involved in DMRs.
  • Region-Gene Association: Associates DMRs with nearby genes to determine the potential regulatory targets of each region.
  • Multiple Gene Sets Support: Supports not only GO gene sets but also various gene set databases like MSigDB, providing comprehensive functional annotations.

This analysis is mainly applicable to Human and Mouse samples, but also supports other species with Bioconductor annotation packages.

BioMart Support (Non-model organisms): For non-model organisms, BioMart can be used to obtain gene sets (supporting over 600 species):

r
# Use BioMart to get gene sets
res <- great(gr, "GO:BP", biomart_dataset = "amelanoleuca_gene_ensembl")

Input File Preparation

This analysis requires a BED format file containing the genomic coordinates of DMRs (Differentially Methylated Regions).

Input File Format

The BED file should be a tab-separated text file containing at least 3 columns:

Col No.Col NameMeaningExample
1Chromosome NameChromosome name (e.g., chr1, chr2, chr11)chr2
2Start PositionStart position (0-based, inclusive)196161361
3End PositionEnd position (1-based, exclusive)196164361

File Description:

  • Coordinate system is 0-based (start position starts from 0).
  • Must explicitly specify the genome version used (e.g., hg38, hg19, mm10, mm9, etc.).
  • Each region represents a Differentially Methylated Region (DMR).
  • Note: If the BED file contains more than 500 regions, the analysis will automatically use only the first 500 regions. You can adjust this number according to your needs.

File Structure Example

bed
chr2	196161361	196164361
chr2	203704361	203712361
chr11	128486808	128522808
...

rGREAT Analysis Workflow

Key Parameter Settings

In the analysis process, the following parameters are crucial for result quality and biological interpretation:

ParameterMeaningDefault/SuggestedDescription
annotation_databaseTSS Source"GREAT:hg38"Annotation Database Source.
Specifies the source of Transcription Start Sites (TSS). Supports multiple formats:
"GREAT:hg38": Official GREAT hg38 TSS data (recommended)
"hg38", "hg19", "mm10", etc.: Automatically uses corresponding TxDb packages
"RefSeq:hg38": RefSeqSelect genes
"Gencode_v19": Gencode annotation (Human)
Note: "GREAT:hg38" only supports hg38, hg19, mm10, mm9 genome versions.
gene_setsGene Set Type"GO:BP", "GO:CC", "GO:MF"Functional Annotation Gene Sets.
Specifies the gene set types for enrichment analysis:
"GO:BP": Biological Process
"GO:CC": Cellular Component
"GO:MF": Molecular Function
"msigdb:H": MSigDB Hallmark gene sets (Human only)
"msigdb:C2": MSigDB Curated gene sets (Human only)
Note: Prefixes GO: and msigdb: can be omitted.
top_nVisualization Count20Visualization Parameter.
Specifies the top N most enriched GO terms to display in the bubble plot. Suggested setting is 20-30 for quick identification of key biological functions.
backgroundBackground RegionsNULL (Whole Genome)Background Setting Parameter (Optional).
Specifies the background regions for enrichment analysis. If set to NULL, the whole genome (excluding gap regions) is used as background.
Application: Used when excluding specific regions (e.g., repetitive sequences, unsequenced regions) or restricting analysis scope.
excludeExcluded RegionsNULLExcluded Regions Parameter (Optional).
Specifies genomic regions to exclude from analysis (e.g., gap regions, repetitive sequences).
Note: rGREAT excludes gap regions by default.

Supported TSS Sources:

  1. Genome Version: e.g., "hg38", "hg19", "mm10", automatically uses corresponding TxDb packages.
  2. TxDb Package Name: e.g., "TxDb.Hsapiens.UCSC.hg38.knownGene".
  3. RefSeq: "RefSeq:hg38" (Based on RefSeq gene annotation).
  4. RefSeqCurated: "RefSeqCurated:hg38" (RefSeqCurated subset).
  5. RefSeqSelect: "RefSeqSelect:hg38" (RefSeqSelect subset).
  6. Gencode: "Gencode_v19" (Human) or "Gencode_M21" (Mouse).
  7. GREAT: "GREAT:hg38" (Only supports hg38, hg19, mm10, mm9).

Region-Gene Association Analysis

The first step of GREAT analysis is to associate input genomic regions with the extended TSS regulatory domains of genes.

Analysis Principle

The core idea of the GREAT algorithm is to associate non-coding genomic regions with their nearby genes and infer the potential biological functions of these regions based on the known functional annotations of these genes.

  1. Association Rule: The GREAT algorithm first defines a regulatory domain for each gene. When an input region overlaps with the regulatory domain of a gene, the region is associated with that gene (Note: A single region can be associated with multiple genes simultaneously).

  2. Construction of Regulatory Domains (taking basalPlusExt mode as an example): First, define a Basal Domain (covering 5000 bp upstream to 1000 bp downstream of the TSS); then perform Extension to both sides until reaching the basal domain boundary of adjacent genes, or reaching the specified maximum extension distance (default 1,000,000 bp).

Analysis Steps

  1. Construct Extended TSS Regulatory Domains: Use the specified extension mode (default basalPlusExt) to construct the extended regulatory domain for each gene.

  2. Region-Gene Matching: Determine the genes associated with each region by checking for overlap between the midpoint of the input region and the extended TSS regulatory domains.

  3. Result Output: Generate a region-gene association plot (GREAT_region_gene_associations.png) to visualize the correspondence between input regions and their associated genes.

R
# Load required R packages
library(rGREAT)
library(GenomicRanges)
library(rtracklayer)
library(ggplot2)
library(dplyr)

## bed_file: Input BED file path
bed_file <- "./DMRs/DMRs_significant_2.bed"

## annotation_database: TSS source
annotation_database <- "GREAT:hg38"  # Modify according to species and reference genome version, e.g., GREAT:mm9

## go_types: GO types to analyze
go_types <- c("GO:BP" = "Biological Process",
              "GO:CC" = "Cellular Component", 
              "GO:MF" = "Molecular Function")

## outdir: Output directory, all result files will be saved here
outdir <- "DMRs_rGREAT"
# Create output directory (if it doesn't exist)
dir.create(outdir, recursive = TRUE, showWarnings = FALSE)

# Define function: Call rGREAT analysis in R
run_rgreat_analysis <- function(gr, go_type, annotation_database) {
  res <- great(gr, go_type, annotation_database)
  return(res)
}

# Read BED file
gr <- import(bed_file, format = "BED")

# If number of regions exceeds 500, only use the first 500 regions
if (length(gr) > 500) {
  gr <- gr[1:500]
}



# Region-gene association analysis
res_temp <- run_rgreat_analysis(gr, names(go_types)[1], annotation_database)
assoc_file <- file.path(outdir, "GREAT_region_gene_associations.png")
png(assoc_file, width = 12, height = 8, units = "in", res = 300)
p2 <- plotRegionGeneAssociations(res_temp)
if (inherits(p2, "ggplot")) {
  print(p2)
}
dev.off()
output
Loading required package: GenomicRanges

Loading required package: stats4

Loading required package: BiocGenerics


Attaching package: ‘BiocGenerics’


The following objects are masked from ‘package:stats’:

IQR, mad, sd, var, xtabs


The following objects are masked from ‘package:base’:

anyDuplicated, aperm, append, as.data.frame, basename, cbind,
colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
table, tapply, union, unique, unsplit, which.max, which.min


Loading required package: S4Vectors


Attaching package: ‘S4Vectors’


The following object is masked from ‘package:utils’:

findMatches


The following objects are masked from ‘package:base’:

expand.grid, I, unname


Loading required package: IRanges

Loading required package: GenomeInfoDb






------------------
Note: On Aug 19 2019 GREAT released version 4 where it supports \`hg38\`
genome and removes some ontologies such pathways. \`submitGreatJob()\`
still takes \`hg19\` as default. \`hg38\` can be specified by the \`species
= 'hg38'\` argument. To use the older versions such as 3.0.0, specify as
\`submitGreatJob(..., version = '3.0.0')\`.

From rGREAT version 1.99.0, it also implements the GREAT algorithm and
it allows to integrate GREAT analysis with the Bioconductor annotation
ecosystem. By default it supports more than 500 organisms. Check the
new function \`great()\` and the new vignette.
------------------


Attaching package: ‘dplyr’


The following objects are masked from ‘package:GenomicRanges’:

intersect, setdiff, union


The following object is masked from ‘package:GenomeInfoDb’:

intersect


The following objects are masked from ‘package:IRanges’:

collapse, desc, intersect, setdiff, slice, union


The following objects are masked from ‘package:S4Vectors’:

first, intersect, rename, setdiff, setequal, union


The following objects are masked from ‘package:BiocGenerics’:

combine, intersect, setdiff, union


The following objects are masked from ‘package:stats’:

filter, lag


The following objects are masked from ‘package:base’:

intersect, setdiff, setequal, union


* TSS source: GREAT.

* use genome 'hg38'.

* TSS extension mode is 'basalPlusExt'.

* construct the basal domains by extending 5000bp to upstream and 1000bp to downsteram of TSS.

* calculate distances to neighbour regions.

* extend to both sides until reaching the neighbour genes or to the maximal extension.

* use GO:BP ontology with 15709 gene sets (source: org.Hs.eg.db).

* check gene ID type in \`gene_sets\` and in \`extended_tss\`.

* use whole genome as background.

* remove excluded regions from background.

* overlap \`gr\` to background regions (based on midpoint).

* in total 499 \`gr\`.

* overlap extended TSS to background regions.

* check which genes are in the gene sets.

* only take gene sets with size >= 5.

* in total 9168 gene sets.

* overlap \`gr\` to every extended TSS.

* perform binomial test for each biological term.

pdf: 2

GO Enrichment Analysis

The second step of GREAT analysis is to perform functional enrichment analysis on the associated genes.

Analysis Principle

  1. Binomial Test: Used to evaluate enrichment at the region level. Assuming input regions are uniformly distributed across the genome, it calculates whether the proportion of genomic regions associated with each GO term is significantly higher than random expectation.

  2. Hypergeometric Test: Used to evaluate enrichment at the gene level. Assuming associated genes are randomly drawn from all genes, it calculates whether the number of associated genes in each GO term is significantly higher than random expectation.

  3. Result Interpretation: fold_enrichment_hyper is the enrichment fold (> 1 indicates enrichment), p_value_hyper is the raw P-value, p_adjust_hyper is the adjusted P-value (FDR, < 0.05 is usually considered significant), mean_tss_dist is the average distance from input regions to TSS (larger distance implies lower association reliability).

Analysis Steps

  1. Statistical Testing: Perform both Binomial Test (evaluate region-level enrichment) and Hypergeometric Test (calculate gene-level enrichment) to comprehensively assess the significance of GO terms.

  2. Result Output: Retrieve detailed enrichment result table via getEnrichmentTable(res); generate Volcano Plot (X-axis: enrichment fold, Y-axis: -log10 P-value) and Bubble Plot (displaying top N most enriched GO terms) for visualization.

R
# top_n: Visualize top N enriched GO terms
top_n <- 20

# desc_max_length: Maximum length of GO description in plots
desc_max_length <- 50

# GO Enrichment Analysis
## Loop through each GO type
for (go_type in names(go_types)) {
  go_name <- go_types[go_type]
 
  # Run GREAT analysis
  res <- run_rgreat_analysis(gr, go_type, annotation_database)
  
  # Get enrichment result table
  tb <- getEnrichmentTable(res)
  
  # Sort by fold_enrichment_hyper (descending) to match table with plots
  tb <- tb %>% arrange(desc(fold_enrichment_hyper))
  
  # Save complete result table (sorted)
  result_file <- file.path(outdir, paste0("GREAT_enrichment_", go_type, "_results.txt"))
  write.table(tb, file = result_file, sep = "\t", quote = FALSE, row.names = FALSE)
  
  # Visualization 1: Volcano Plot
  volcano_file <- file.path(outdir, paste0("GREAT_volcano_plot_", go_type, ".png"))
  png(volcano_file, width = 10, height = 8, units = "in", res = 300)
  p1 <- plotVolcano(res)
  # Print if it's a ggplot object
  if (inherits(p1, "ggplot")) {
    print(p1)
  }
  dev.off()
  
  # Visualization 2: Bubble Plot
  if (nrow(tb) > 0) {
    # Take top_n after sorting by fold_enrichment_hyper
    tb_top <- tb %>%
      head(top_n) %>%
      mutate(
        neg_log10_padj = -log10(p_adjust_hyper),
        neg_log10_padj = ifelse(is.infinite(neg_log10_padj), -log10(1e-300), neg_log10_padj),
        description_short = ifelse(nchar(description) > desc_max_length,
                                 paste0(substr(description, 1, desc_max_length-3), "..."),
                                 description)
      ) %>%
      mutate(description_short = factor(description_short, levels = rev(description_short)))
    
    if (nrow(tb_top) > 0) {
      p3 <- ggplot(tb_top, aes(x = fold_enrichment_hyper, y = description_short,
                                size = observed_gene_hits, color = neg_log10_padj)) +
        geom_point(alpha = 0.7) +
        scale_size_continuous(name = "Number of\nGenes", range = c(3, 12)) +
        scale_color_gradient(low = "#4A90E2", high = "#E74C3C", name = "-log10(FDR)") +
        labs(title = paste("Top", min(top_n, nrow(tb_top)), "Enriched", go_name),
             x = "Fold Enrichment", y = "GO Term") +
        theme_minimal() +
        theme(plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
              axis.text.y = element_text(size = 9),
              legend.position = "right") +
        geom_vline(xintercept = 1, linetype = "dashed", color = "gray60", alpha = 0.5)
      
      bubble_file <- file.path(outdir, paste0("GO_bubbleplot_", go_type, ".png"))
      ggsave(bubble_file, plot = p3, width = 12, height = 10, dpi = 300)
      
      # Save top_n results (corresponding to plots, sorted by fold_enrichment_hyper)
      top20_file <- file.path(outdir, paste0("GREAT_enrichment_", go_type, "_top20.txt"))
      tb_top20 <- tb %>% head(top_n)
      write.table(tb_top20, file = top20_file, sep = "\t", quote = FALSE, row.names = FALSE)
    } else {
      cat("Warning: Not enough data to generate bubble plot\n")
    }
  } else {
    cat("Warning: No enrichment results\n")
  }
}
output
* TSS source: GREAT.

* use genome 'hg38'.

* TSS extension mode is 'basalPlusExt'.

* construct the basal domains by extending 5000bp to upstream and 1000bp to downsteram of TSS.

* calculate distances to neighbour regions.

* extend to both sides until reaching the neighbour genes or to the maximal extension.

* use GO:BP ontology with 15709 gene sets (source: org.Hs.eg.db).

* check gene ID type in \`gene_sets\` and in \`extended_tss\`.

* use whole genome as background.

* remove excluded regions from background.

* overlap \`gr\` to background regions (based on midpoint).

* in total 499 \`gr\`.

* overlap extended TSS to background regions.

* check which genes are in the gene sets.

* only take gene sets with size >= 5.

* in total 9168 gene sets.

* overlap \`gr\` to every extended TSS.

* perform binomial test for each biological term.

* TSS source: GREAT.

* use genome 'hg38'.

* TSS extension mode is 'basalPlusExt'.

* construct the basal domains by extending 5000bp to upstream and 1000bp to downsteram of TSS.

* calculate distances to neighbour regions.

* extend to both sides until reaching the neighbour genes or to the maximal extension.

* use GO:CC ontology with 1977 gene sets (source: org.Hs.eg.db).

* check gene ID type in \`gene_sets\` and in \`extended_tss\`.

* use whole genome as background.

* remove excluded regions from background.

* overlap \`gr\` to background regions (based on midpoint).

* in total 499 \`gr\`.

* overlap extended TSS to background regions.

* check which genes are in the gene sets.

* only take gene sets with size >= 5.

* in total 1184 gene sets.

* overlap \`gr\` to every extended TSS.

* perform binomial test for each biological term.

* TSS source: GREAT.

* use genome 'hg38'.

* TSS extension mode is 'basalPlusExt'.

* construct the basal domains by extending 5000bp to upstream and 1000bp to downsteram of TSS.

* calculate distances to neighbour regions.

* extend to both sides until reaching the neighbour genes or to the maximal extension.

* use GO:MF ontology with 5055 gene sets (source: org.Hs.eg.db).

* check gene ID type in \`gene_sets\` and in \`extended_tss\`.

* use whole genome as background.

* remove excluded regions from background.

* overlap \`gr\` to background regions (based on midpoint).

* in total 499 \`gr\`.

* overlap extended TSS to background regions.

* check which genes are in the gene sets.

* only take gene sets with size >= 5.

* in total 2001 gene sets.

* overlap \`gr\` to every extended TSS.

* perform binomial test for each biological term.

Result Interpretation

The enrichment result table returned by getEnrichmentTable() contains the following fields:

Field NameDescription
idGO term ID (e.g., GO:0007155)
descriptionGO term description
genome_fractionFraction of genome associated with this GO term
observed_region_hitsObserved number of associated regions (Binomial test)
fold_enrichmentFold enrichment from Binomial test
p_valueRaw P-value from Binomial test
p_adjustAdjusted P-value from Binomial test (FDR)
mean_tss_distMean distance from input regions to TSS (bp)
observed_gene_hitsObserved number of associated genes (Hypergeometric test)
gene_set_sizeTotal number of genes in the GO term
fold_enrichment_hyperFold enrichment from Hypergeometric test (Recommended)
p_value_hyperRaw P-value from Hypergeometric test
p_adjust_hyperAdjusted P-value from Hypergeometric test (FDR, Recommended)
0 comments·0 replies