scMethyl + RNA Multi-omics: DMR Functional Enrichment Analysis (Based on rGREAT)
Module Introduction
This module is based on rGREAT and aims to associate biological functions with genomic regions. Here, we use it to perform functional enrichment analysis of Differentially Methylated Regions (DMRs).
rGREAT is an R package that implements a local version of the GREAT (Genomic Regions Enrichment of Annotations Tool) algorithm. By associating genomic regions with nearby genes and evaluating the enrichment of these genes in functional categories, it achieves the following core analytical goals:
- Functional Enrichment: Identifies significant enrichment of DMR-associated genes in Gene Ontology (GO) functional categories, revealing potential biological functions and processes involved in DMRs.
- Region-Gene Association: Associates DMRs with nearby genes to determine the potential regulatory targets of each region.
- Multiple Gene Sets Support: Supports not only GO gene sets but also various gene set databases like MSigDB, providing comprehensive functional annotations.
This analysis is mainly applicable to Human and Mouse samples, but also supports other species with Bioconductor annotation packages.
BioMart Support (Non-model organisms): For non-model organisms, BioMart can be used to obtain gene sets (supporting over 600 species):
r# Use BioMart to get gene sets res <- great(gr, "GO:BP", biomart_dataset = "amelanoleuca_gene_ensembl")
Input File Preparation
This analysis requires a BED format file containing the genomic coordinates of DMRs (Differentially Methylated Regions).
Input File Format
The BED file should be a tab-separated text file containing at least 3 columns:
| Col No. | Col Name | Meaning | Example |
|---|---|---|---|
| 1 | Chromosome Name | Chromosome name (e.g., chr1, chr2, chr11) | chr2 |
| 2 | Start Position | Start position (0-based, inclusive) | 196161361 |
| 3 | End Position | End position (1-based, exclusive) | 196164361 |
File Description:
- Coordinate system is 0-based (start position starts from 0).
- Must explicitly specify the genome version used (e.g., hg38, hg19, mm10, mm9, etc.).
- Each region represents a Differentially Methylated Region (DMR).
- Note: If the
BEDfile contains more than 500 regions, the analysis will automatically use only the first 500 regions. You can adjust this number according to your needs.
File Structure Example
chr2 196161361 196164361
chr2 203704361 203712361
chr11 128486808 128522808
...rGREAT Analysis Workflow
Key Parameter Settings
In the analysis process, the following parameters are crucial for result quality and biological interpretation:
| Parameter | Meaning | Default/Suggested | Description |
|---|---|---|---|
annotation_database | TSS Source | "GREAT:hg38" | Annotation Database Source. Specifies the source of Transcription Start Sites (TSS). Supports multiple formats: • "GREAT:hg38": Official GREAT hg38 TSS data (recommended)• "hg38", "hg19", "mm10", etc.: Automatically uses corresponding TxDb packages• "RefSeq:hg38": RefSeqSelect genes• "Gencode_v19": Gencode annotation (Human)Note: "GREAT:hg38" only supports hg38, hg19, mm10, mm9 genome versions. |
gene_sets | Gene Set Type | "GO:BP", "GO:CC", "GO:MF" | Functional Annotation Gene Sets. Specifies the gene set types for enrichment analysis: • "GO:BP": Biological Process• "GO:CC": Cellular Component• "GO:MF": Molecular Function• "msigdb:H": MSigDB Hallmark gene sets (Human only)• "msigdb:C2": MSigDB Curated gene sets (Human only)Note: Prefixes GO: and msigdb: can be omitted. |
top_n | Visualization Count | 20 | Visualization Parameter. Specifies the top N most enriched GO terms to display in the bubble plot. Suggested setting is 20-30 for quick identification of key biological functions. |
background | Background Regions | NULL (Whole Genome) | Background Setting Parameter (Optional). Specifies the background regions for enrichment analysis. If set to NULL, the whole genome (excluding gap regions) is used as background.Application: Used when excluding specific regions (e.g., repetitive sequences, unsequenced regions) or restricting analysis scope. |
exclude | Excluded Regions | NULL | Excluded Regions Parameter (Optional). Specifies genomic regions to exclude from analysis (e.g., gap regions, repetitive sequences). Note: rGREAT excludes gap regions by default. |
Supported TSS Sources:
- Genome Version: e.g.,
"hg38","hg19","mm10", automatically uses corresponding TxDb packages. - TxDb Package Name: e.g.,
"TxDb.Hsapiens.UCSC.hg38.knownGene". - RefSeq:
"RefSeq:hg38"(Based on RefSeq gene annotation). - RefSeqCurated:
"RefSeqCurated:hg38"(RefSeqCurated subset). - RefSeqSelect:
"RefSeqSelect:hg38"(RefSeqSelect subset). - Gencode:
"Gencode_v19"(Human) or"Gencode_M21"(Mouse). - GREAT:
"GREAT:hg38"(Only supports hg38, hg19, mm10, mm9).
Region-Gene Association Analysis
The first step of GREAT analysis is to associate input genomic regions with the extended TSS regulatory domains of genes.
Analysis Principle
The core idea of the GREAT algorithm is to associate non-coding genomic regions with their nearby genes and infer the potential biological functions of these regions based on the known functional annotations of these genes.
Association Rule: The GREAT algorithm first defines a regulatory domain for each gene. When an input region overlaps with the regulatory domain of a gene, the region is associated with that gene (Note: A single region can be associated with multiple genes simultaneously).
Construction of Regulatory Domains (taking
basalPlusExtmode as an example): First, define a Basal Domain (covering 5000 bp upstream to 1000 bp downstream of the TSS); then perform Extension to both sides until reaching the basal domain boundary of adjacent genes, or reaching the specified maximum extension distance (default 1,000,000 bp).
Analysis Steps
Construct Extended TSS Regulatory Domains: Use the specified extension mode (default
basalPlusExt) to construct the extended regulatory domain for each gene.Region-Gene Matching: Determine the genes associated with each region by checking for overlap between the midpoint of the input region and the extended TSS regulatory domains.
Result Output: Generate a region-gene association plot (
GREAT_region_gene_associations.png) to visualize the correspondence between input regions and their associated genes.
# Load required R packages
library(rGREAT)
library(GenomicRanges)
library(rtracklayer)
library(ggplot2)
library(dplyr)
## bed_file: Input BED file path
bed_file <- "./DMRs/DMRs_significant_2.bed"
## annotation_database: TSS source
annotation_database <- "GREAT:hg38" # Modify according to species and reference genome version, e.g., GREAT:mm9
## go_types: GO types to analyze
go_types <- c("GO:BP" = "Biological Process",
"GO:CC" = "Cellular Component",
"GO:MF" = "Molecular Function")
## outdir: Output directory, all result files will be saved here
outdir <- "DMRs_rGREAT"
# Create output directory (if it doesn't exist)
dir.create(outdir, recursive = TRUE, showWarnings = FALSE)
# Define function: Call rGREAT analysis in R
run_rgreat_analysis <- function(gr, go_type, annotation_database) {
res <- great(gr, go_type, annotation_database)
return(res)
}
# Read BED file
gr <- import(bed_file, format = "BED")
# If number of regions exceeds 500, only use the first 500 regions
if (length(gr) > 500) {
gr <- gr[1:500]
}
# Region-gene association analysis
res_temp <- run_rgreat_analysis(gr, names(go_types)[1], annotation_database)
assoc_file <- file.path(outdir, "GREAT_region_gene_associations.png")
png(assoc_file, width = 12, height = 8, units = "in", res = 300)
p2 <- plotRegionGeneAssociations(res_temp)
if (inherits(p2, "ggplot")) {
print(p2)
}
dev.off()Loading required package: stats4
Loading required package: BiocGenerics
Attaching package: ‘BiocGenerics’
The following objects are masked from ‘package:stats’:
IQR, mad, sd, var, xtabs
The following objects are masked from ‘package:base’:
anyDuplicated, aperm, append, as.data.frame, basename, cbind,
colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
table, tapply, union, unique, unsplit, which.max, which.min
Loading required package: S4Vectors
Attaching package: ‘S4Vectors’
The following object is masked from ‘package:utils’:
findMatches
The following objects are masked from ‘package:base’:
expand.grid, I, unname
Loading required package: IRanges
Loading required package: GenomeInfoDb
------------------
Note: On Aug 19 2019 GREAT released version 4 where it supports \`hg38\`
genome and removes some ontologies such pathways. \`submitGreatJob()\`
still takes \`hg19\` as default. \`hg38\` can be specified by the \`species
= 'hg38'\` argument. To use the older versions such as 3.0.0, specify as
\`submitGreatJob(..., version = '3.0.0')\`.
From rGREAT version 1.99.0, it also implements the GREAT algorithm and
it allows to integrate GREAT analysis with the Bioconductor annotation
ecosystem. By default it supports more than 500 organisms. Check the
new function \`great()\` and the new vignette.
------------------
Attaching package: ‘dplyr’
The following objects are masked from ‘package:GenomicRanges’:
intersect, setdiff, union
The following object is masked from ‘package:GenomeInfoDb’:
intersect
The following objects are masked from ‘package:IRanges’:
collapse, desc, intersect, setdiff, slice, union
The following objects are masked from ‘package:S4Vectors’:
first, intersect, rename, setdiff, setequal, union
The following objects are masked from ‘package:BiocGenerics’:
combine, intersect, setdiff, union
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
* TSS source: GREAT.
* use genome 'hg38'.
* TSS extension mode is 'basalPlusExt'.
* construct the basal domains by extending 5000bp to upstream and 1000bp to downsteram of TSS.
* calculate distances to neighbour regions.
* extend to both sides until reaching the neighbour genes or to the maximal extension.
* use GO:BP ontology with 15709 gene sets (source: org.Hs.eg.db).
* check gene ID type in \`gene_sets\` and in \`extended_tss\`.
* use whole genome as background.
* remove excluded regions from background.
* overlap \`gr\` to background regions (based on midpoint).
* in total 499 \`gr\`.
* overlap extended TSS to background regions.
* check which genes are in the gene sets.
* only take gene sets with size >= 5.
* in total 9168 gene sets.
* overlap \`gr\` to every extended TSS.
* perform binomial test for each biological term.
pdf: 2
GO Enrichment Analysis
The second step of GREAT analysis is to perform functional enrichment analysis on the associated genes.
Analysis Principle
Binomial Test: Used to evaluate enrichment at the region level. Assuming input regions are uniformly distributed across the genome, it calculates whether the proportion of genomic regions associated with each GO term is significantly higher than random expectation.
Hypergeometric Test: Used to evaluate enrichment at the gene level. Assuming associated genes are randomly drawn from all genes, it calculates whether the number of associated genes in each GO term is significantly higher than random expectation.
Result Interpretation:
fold_enrichment_hyperis the enrichment fold (> 1 indicates enrichment),p_value_hyperis the raw P-value,p_adjust_hyperis the adjusted P-value (FDR, < 0.05 is usually considered significant),mean_tss_distis the average distance from input regions to TSS (larger distance implies lower association reliability).
Analysis Steps
Statistical Testing: Perform both Binomial Test (evaluate region-level enrichment) and Hypergeometric Test (calculate gene-level enrichment) to comprehensively assess the significance of GO terms.
Result Output: Retrieve detailed enrichment result table via
getEnrichmentTable(res); generate Volcano Plot (X-axis: enrichment fold, Y-axis: -log10 P-value) and Bubble Plot (displaying top N most enriched GO terms) for visualization.
# top_n: Visualize top N enriched GO terms
top_n <- 20
# desc_max_length: Maximum length of GO description in plots
desc_max_length <- 50
# GO Enrichment Analysis
## Loop through each GO type
for (go_type in names(go_types)) {
go_name <- go_types[go_type]
# Run GREAT analysis
res <- run_rgreat_analysis(gr, go_type, annotation_database)
# Get enrichment result table
tb <- getEnrichmentTable(res)
# Sort by fold_enrichment_hyper (descending) to match table with plots
tb <- tb %>% arrange(desc(fold_enrichment_hyper))
# Save complete result table (sorted)
result_file <- file.path(outdir, paste0("GREAT_enrichment_", go_type, "_results.txt"))
write.table(tb, file = result_file, sep = "\t", quote = FALSE, row.names = FALSE)
# Visualization 1: Volcano Plot
volcano_file <- file.path(outdir, paste0("GREAT_volcano_plot_", go_type, ".png"))
png(volcano_file, width = 10, height = 8, units = "in", res = 300)
p1 <- plotVolcano(res)
# Print if it's a ggplot object
if (inherits(p1, "ggplot")) {
print(p1)
}
dev.off()
# Visualization 2: Bubble Plot
if (nrow(tb) > 0) {
# Take top_n after sorting by fold_enrichment_hyper
tb_top <- tb %>%
head(top_n) %>%
mutate(
neg_log10_padj = -log10(p_adjust_hyper),
neg_log10_padj = ifelse(is.infinite(neg_log10_padj), -log10(1e-300), neg_log10_padj),
description_short = ifelse(nchar(description) > desc_max_length,
paste0(substr(description, 1, desc_max_length-3), "..."),
description)
) %>%
mutate(description_short = factor(description_short, levels = rev(description_short)))
if (nrow(tb_top) > 0) {
p3 <- ggplot(tb_top, aes(x = fold_enrichment_hyper, y = description_short,
size = observed_gene_hits, color = neg_log10_padj)) +
geom_point(alpha = 0.7) +
scale_size_continuous(name = "Number of\nGenes", range = c(3, 12)) +
scale_color_gradient(low = "#4A90E2", high = "#E74C3C", name = "-log10(FDR)") +
labs(title = paste("Top", min(top_n, nrow(tb_top)), "Enriched", go_name),
x = "Fold Enrichment", y = "GO Term") +
theme_minimal() +
theme(plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
axis.text.y = element_text(size = 9),
legend.position = "right") +
geom_vline(xintercept = 1, linetype = "dashed", color = "gray60", alpha = 0.5)
bubble_file <- file.path(outdir, paste0("GO_bubbleplot_", go_type, ".png"))
ggsave(bubble_file, plot = p3, width = 12, height = 10, dpi = 300)
# Save top_n results (corresponding to plots, sorted by fold_enrichment_hyper)
top20_file <- file.path(outdir, paste0("GREAT_enrichment_", go_type, "_top20.txt"))
tb_top20 <- tb %>% head(top_n)
write.table(tb_top20, file = top20_file, sep = "\t", quote = FALSE, row.names = FALSE)
} else {
cat("Warning: Not enough data to generate bubble plot\n")
}
} else {
cat("Warning: No enrichment results\n")
}
}* use genome 'hg38'.
* TSS extension mode is 'basalPlusExt'.
* construct the basal domains by extending 5000bp to upstream and 1000bp to downsteram of TSS.
* calculate distances to neighbour regions.
* extend to both sides until reaching the neighbour genes or to the maximal extension.
* use GO:BP ontology with 15709 gene sets (source: org.Hs.eg.db).
* check gene ID type in \`gene_sets\` and in \`extended_tss\`.
* use whole genome as background.
* remove excluded regions from background.
* overlap \`gr\` to background regions (based on midpoint).
* in total 499 \`gr\`.
* overlap extended TSS to background regions.
* check which genes are in the gene sets.
* only take gene sets with size >= 5.
* in total 9168 gene sets.
* overlap \`gr\` to every extended TSS.
* perform binomial test for each biological term.
* TSS source: GREAT.
* use genome 'hg38'.
* TSS extension mode is 'basalPlusExt'.
* construct the basal domains by extending 5000bp to upstream and 1000bp to downsteram of TSS.
* calculate distances to neighbour regions.
* extend to both sides until reaching the neighbour genes or to the maximal extension.
* use GO:CC ontology with 1977 gene sets (source: org.Hs.eg.db).
* check gene ID type in \`gene_sets\` and in \`extended_tss\`.
* use whole genome as background.
* remove excluded regions from background.
* overlap \`gr\` to background regions (based on midpoint).
* in total 499 \`gr\`.
* overlap extended TSS to background regions.
* check which genes are in the gene sets.
* only take gene sets with size >= 5.
* in total 1184 gene sets.
* overlap \`gr\` to every extended TSS.
* perform binomial test for each biological term.
* TSS source: GREAT.
* use genome 'hg38'.
* TSS extension mode is 'basalPlusExt'.
* construct the basal domains by extending 5000bp to upstream and 1000bp to downsteram of TSS.
* calculate distances to neighbour regions.
* extend to both sides until reaching the neighbour genes or to the maximal extension.
* use GO:MF ontology with 5055 gene sets (source: org.Hs.eg.db).
* check gene ID type in \`gene_sets\` and in \`extended_tss\`.
* use whole genome as background.
* remove excluded regions from background.
* overlap \`gr\` to background regions (based on midpoint).
* in total 499 \`gr\`.
* overlap extended TSS to background regions.
* check which genes are in the gene sets.
* only take gene sets with size >= 5.
* in total 2001 gene sets.
* overlap \`gr\` to every extended TSS.
* perform binomial test for each biological term.
Result Interpretation
The enrichment result table returned by getEnrichmentTable() contains the following fields:
| Field Name | Description |
|---|---|
id | GO term ID (e.g., GO:0007155) |
description | GO term description |
genome_fraction | Fraction of genome associated with this GO term |
observed_region_hits | Observed number of associated regions (Binomial test) |
fold_enrichment | Fold enrichment from Binomial test |
p_value | Raw P-value from Binomial test |
p_adjust | Adjusted P-value from Binomial test (FDR) |
mean_tss_dist | Mean distance from input regions to TSS (bp) |
observed_gene_hits | Observed number of associated genes (Hypergeometric test) |
gene_set_size | Total number of genes in the GO term |
fold_enrichment_hyper | Fold enrichment from Hypergeometric test (Recommended) |
p_value_hyper | Raw P-value from Hypergeometric test |
p_adjust_hyper | Adjusted P-value from Hypergeometric test (FDR, Recommended) |
