scMethyl + RNA Multi-omics: DMR Functional Enrichment Analysis (Based on rGREAT)
Module Introduction
This module is based on rGREAT, aiming to associate biological functional annotations with genomic regions. In this analysis, the module is used to perform functional enrichment analysis on differentially methylated regions (DMRs).
rGREAT is the local R implementation of the GREAT (Genomic Regions Enrichment of Annotations Tool) algorithm. This method achieves the following core analysis goals by associating genomic regions with nearby genes and evaluating the enrichment of these genes in different functional annotation categories:
- Functional Enrichment Analysis: Identifies significant enrichment of DMR-associated genes in Gene Ontology (GO) functional categories, revealing potential biological functions and processes involved with DMRs.
- Region-Gene Association: Associates DMRs with their neighboring genes to infer potential regulatory target genes for each methylated region.
- Multiple Gene Sets Support: In addition to GO gene sets, supports various gene set databases like MSigDB to provide a more comprehensive functional annotation perspective.
This analysis is mainly applicable to Human and Mouse samples, but also supports other species with Bioconductor annotation packages.
BioMart Support (Non-model Organisms): For non-model organisms, BioMart can be used to retrieve corresponding gene set information (covering over 600 species), extending the applicability of DMR functional enrichment analysis:
r# Use BioMart to Retrieve Gene Sets res <- great(gr, "GO:BP", biomart_dataset = "amelanoleuca_gene_ensembl")
Input File Preparation
This analysis requires a BED format input file describing the genomic coordinate information of Differentially Methylated Regions (DMRs).
Input File Format
The BED file should be a tab-separated text file containing at least the following three columns:
| Column No. | Column Name | Description | Example |
|---|---|---|---|
| 1 | Chromosome Name | Chromosome name (e.g., chr1, chr2, chr11, etc.) | chr2 |
| 2 | Start Position | Start position (0-based, counting starts from 0) | 196161361 |
| 3 | End Position | End position (1-based, exclusive) | 196164361 |
File Description:
- Coordinate system uses 0-based indexing (start position begins at 0).
- Must explicitly specify the genome version used (e.g., hg38, hg19, mm10, mm9, etc.).
- Each region represents a Differentially Methylated Region (DMR).
- Note: If the
BEDfile contains more than 500 regions, the analysis pipeline will default to using only the first 500 regions for calculation; this threshold can be adjusted based on analysis needs.
File Structure Example
chr2 196161361 196164361
chr2 203704361 203712361
chr11 128486808 128522808
...rGREAT Analysis Pipeline
Key Parameter Settings
In rGREAT analysis, the following parameters have a significant impact on result stability and biological interpretation:
| Parameter Name | Chinese Meaning | Default/Suggested Value | Detailed Description |
|---|---|---|---|
annotation_database | TSS Source | "GREAT:hg38" | Annotation DB Source. Specifies TSS data source. Supports: • "GREAT:hg38": Official GREAT hg38 TSS (Recommended)• "hg38", "hg19", "mm10": Auto-use TxDb packages• "RefSeq:hg38": RefSeqSelect genes• "Gencode_v19": Gencode annotation (Human)Note: "GREAT:hg38" only supports hg38, hg19, mm10, mm9. |
gene_sets | Gene Set Type | "GO:BP", "GO:CC", "GO:MF" | Functional Annotation Sets. Specifies gene sets for enrichment: • "GO:BP": Biological Process• "GO:CC": Cellular Component• "GO:MF": Molecular Function• "msigdb:H": MSigDB Hallmark (Human only)• "msigdb:C2": MSigDB Curated (Human only)Note: Prefixes GO: and msigdb: can be omitted. |
top_n | Visualization Count | 20 | Visualization Parameter. Specifies top N most enriched GO terms to show in bubble plot. Recommended 20-30 for quick identification of key biological functions. |
background | Background Regions | NULL (Whole Genome) | Background Parameter (Optional). Specifies background regions for enrichment. If NULL, uses whole genome (excluding gaps).Usage: To exclude specific regions (repeats, unsequenced) or limit analysis scope. |
exclude | Excluded Regions | NULL | Exclusion Parameter (Optional). Specifies regions to exclude (e.g. gaps, repeats). Note: rGREAT excludes gap regions by default. |
Supported TSS Sources:
- Genome Version: E.g.,
"hg38","hg19","mm10", etc., will automatically call the corresponding TxDb annotation package. - TxDb Package Name: E.g.,
"TxDb.Hsapiens.UCSC.hg38.knownGene". - RefSeq: E.g.,
"RefSeq:hg38", based on RefSeq gene annotation. - RefSeqCurated: e.g.,
"RefSeqCurated:hg38", RefSeqCurated subset. - RefSeqSelect: e.g.,
"RefSeqSelect:hg38", RefSeqCurated subset. - Gencode: e.g.,
"Gencode_v19"(Human) or"Gencode_M21"(Mouse). - GREAT: e.g.,
"GREAT:hg38"(Only supports hg38, hg19, mm10, mm9). .
Region-Gene Association Analysis
The first step of GREAT analysis is to associate input genomic regions with extended TSS regulatory domains of genes.
Analysis Principle
The core idea of the GREAT algorithm is to associate non-coding genomic regions with their neighboring genes and infer potential biological functions of the regions based on known functional annotations of these genes.
Association Rules: The GREAT algorithm first defines a regulatory domain for each gene. When an input region overlaps with a gene's regulatory domain, that region is associated with that gene (Note: a single region can be associated with multiple genes simultaneously).
Construction of Regulatory Domains (taking
basalPlusExtmode as example): First define Basal Domain (covering 5000 bp upstream to 1000 bp downstream of TSS); then perform Extension to both sides until encountering adjacent gene's Basal Domain boundary, or reaching the set maximum extension distance (default 1,000,000 bp).
Analysis Steps
Build Extended TSS Regulatory Domains: Adopt specified extension mode (default
basalPlusExt) to build extended regulatory domains for each gene.Region-Gene Matching: Determine gene association by overlapping the midpoint of input regions with extended TSS regulatory domains.
Result Output: Generate region-gene association diagram (
GREAT_region_gene_associations.png) to visually display correspondence between input regions and their associated genes.
# Load required R packages
library(rGREAT)
library(GenomicRanges)
library(rtracklayer)
library(ggplot2)
library(dplyr)
## bed_file: Input BED file path
bed_file <- "./DMRs/DMRs_significant_2.bed"
## annotation_database: TSS source
annotation_database <- "GREAT:hg38" # Modify based on species and reference genome version used, e.g., GREAT:mm9
## go_types: GO types to analyze
go_types <- c("GO:BP" = "Biological Process",
"GO:CC" = "Cellular Component",
"GO:MF" = "Molecular Function")
## outdir: Result output directory, all result files will be saved here
outdir <- "DMRs_rGREAT"
# Create output directory (if not exists)
dir.create(outdir, recursive = TRUE, showWarnings = FALSE)
# Define function: Call rGREAT analysis in R
run_rgreat_analysis <- function(gr, go_type, annotation_database) {
res <- great(gr, go_type, annotation_database)
return(res)
}
# Read BED file
gr <- import(bed_file, format = "BED")
# If number of regions exceeds 500, only use the first 500 regions for analysis
if (length(gr) > 500) {
gr <- gr[1:500]
}
# Region-Gene Association Analysis
res_temp <- run_rgreat_analysis(gr, names(go_types)[1], annotation_database)
assoc_file <- file.path(outdir, "GREAT_region_gene_associations.png")
png(assoc_file, width = 12, height = 8, units = "in", res = 300)
p2 <- plotRegionGeneAssociations(res_temp)
if (inherits(p2, "ggplot")) {
print(p2)
}
dev.off()Loading required package: stats4
Loading required package: BiocGenerics
Attaching package: ‘BiocGenerics’
The following objects are masked from ‘package:stats’:
IQR, mad, sd, var, xtabs
The following objects are masked from ‘package:base’:
anyDuplicated, aperm, append, as.data.frame, basename, cbind,
colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
table, tapply, union, unique, unsplit, which.max, which.min
Loading required package: S4Vectors
Attaching package: ‘S4Vectors’
The following object is masked from ‘package:utils’:
findMatches
The following objects are masked from ‘package:base’:
expand.grid, I, unname
Loading required package: IRanges
Loading required package: GenomeInfoDb
------------------
Note: On Aug 19 2019 GREAT released version 4 where it supports \`hg38\`
genome and removes some ontologies such pathways. \`submitGreatJob()\`
still takes \`hg19\` as default. \`hg38\` can be specified by the \`species
= 'hg38'\` argument. To use the older versions such as 3.0.0, specify as
\`submitGreatJob(..., version = '3.0.0')\`.
From rGREAT version 1.99.0, it also implements the GREAT algorithm and
it allows to integrate GREAT analysis with the Bioconductor annotation
ecosystem. By default it supports more than 500 organisms. Check the
new function \`great()\` and the new vignette.
------------------
Attaching package: ‘dplyr’
The following objects are masked from ‘package:GenomicRanges’:
intersect, setdiff, union
The following object is masked from ‘package:GenomeInfoDb’:
intersect
The following objects are masked from ‘package:IRanges’:
collapse, desc, intersect, setdiff, slice, union
The following objects are masked from ‘package:S4Vectors’:
first, intersect, rename, setdiff, setequal, union
The following objects are masked from ‘package:BiocGenerics’:
combine, intersect, setdiff, union
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
* TSS source: GREAT.
* use genome 'hg38'.
* TSS extension mode is 'basalPlusExt'.
* construct the basal domains by extending 5000bp to upstream and 1000bp to downsteram of TSS.
* calculate distances to neighbour regions.
* extend to both sides until reaching the neighbour genes or to the maximal extension.
* use GO:BP ontology with 15709 gene sets (source: org.Hs.eg.db).
* check gene ID type in \`gene_sets\` and in \`extended_tss\`.
* use whole genome as background.
* remove excluded regions from background.
* overlap \`gr\` to background regions (based on midpoint).
* in total 499 \`gr\`.
* overlap extended TSS to background regions.
* check which genes are in the gene sets.
* only take gene sets with size >= 5.
* in total 9168 gene sets.
* overlap \`gr\` to every extended TSS.
* perform binomial test for each biological term.
pdf: 2
GO Enrichment Analysis
The second step of GREAT analysis is to perform functional enrichment analysis on region-gene association results to evaluate the significance of associated genes in biological functional categories.
Analysis Principle
Binomial Test: Evaluates functional enrichment at the region level. This method assumes input regions are uniformly distributed across the genome and assesses whether the proportion of genomic regions associated with a specific GO term is significantly higher than random expectation.
Hypergeometric Test: Used to assess functional enrichment at the gene level. This method assumes associated genes are randomly drawn from the background gene set and evaluates enrichment significance at the gene level by calculating if the number of associated genes in a GO term is significantly higher than random expectation.
Result Interpretation:
fold_enrichment_hyperis the enrichment fold from hypergeometric test (> 1 means enrichment),p_value_hyperis the raw P-value from hypergeometric test,p_adjust_hyperis the adjusted P-value after multiple testing correction (FDR, usually < 0.05 as significance threshold),mean_tss_distis the average distance from input regions to TSS (larger value implies relatively lower reliability of region-gene association).
Analysis Steps
Statistical Test: Simultaneously perform Binomial Test (assessing region-level enrichment) and Hypergeometric Test (calculating gene-level enrichment) to comprehensively assess GO term significance from different levels.
Result Output: Obtain complete functional enrichment result table via
getEnrichmentTable(res); simultaneously generate Volcano Plot (X-axis is enrichment fold, Y-axis is -log10 P-value) and Bubble Plot (showing top N most enriched GO terms) for result visualization.
## top_n: Visualize top N enriched GO terms
top_n <- 20
## desc_max_length: Maximum length of GO description in image
desc_max_length <- 50
# GO Enrichment Analysis
## Loop through each GO type
for (go_type in names(go_types)) {
go_name <- go_types[go_type]
# Run GREAT analysis
res <- run_rgreat_analysis(gr, go_type, annotation_database)
# Get enrichment analysis result table
tb <- getEnrichmentTable(res)
# Sort by fold_enrichment_hyper (descending) to align result table with chart
tb <- tb %>% arrange(desc(fold_enrichment_hyper))
# Save full result table (sorted)
result_file <- file.path(outdir, paste0("GREAT_enrichment_", go_type, "_results.txt"))
write.table(tb, file = result_file, sep = "\t", quote = FALSE, row.names = FALSE)
# Visualization 1: Volcano Plot
volcano_file <- file.path(outdir, paste0("GREAT_volcano_plot_", go_type, ".png"))
png(volcano_file, width = 10, height = 8, units = "in", res = 300)
p1 <- plotVolcano(res)
# If it is a ggplot object, need to print
if (inherits(p1, "ggplot")) {
print(p1)
}
dev.off()
# Visualization 2: Bubble Plot
if (nrow(tb) > 0) {
# Sort by fold_enrichment_hyper and take top_n
tb_top <- tb %>%
head(top_n) %>%
mutate(
neg_log10_padj = -log10(p_adjust_hyper),
neg_log10_padj = ifelse(is.infinite(neg_log10_padj), -log10(1e-300), neg_log10_padj),
description_short = ifelse(nchar(description) > desc_max_length,
paste0(substr(description, 1, desc_max_length-3), "..."),
description)
) %>%
mutate(description_short = factor(description_short, levels = rev(description_short)))
if (nrow(tb_top) > 0) {
p3 <- ggplot(tb_top, aes(x = fold_enrichment_hyper, y = description_short,
size = observed_gene_hits, color = neg_log10_padj)) +
geom_point(alpha = 0.7) +
scale_size_continuous(name = "Number of\nGenes", range = c(3, 12)) +
scale_color_gradient(low = "#4A90E2", high = "#E74C3C", name = "-log10(FDR)") +
labs(title = paste("Top", min(top_n, nrow(tb_top)), "Enriched", go_name),
x = "Fold Enrichment", y = "GO Term") +
theme_minimal() +
theme(plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
axis.text.y = element_text(size = 9),
legend.position = "right") +
geom_vline(xintercept = 1, linetype = "dashed", color = "gray60", alpha = 0.5)
bubble_file <- file.path(outdir, paste0("GO_bubbleplot_", go_type, ".png"))
ggsave(bubble_file, plot = p3, width = 12, height = 10, dpi = 300)
# Save top_n results (corresponding to the chart, sorted by fold_enrichment_hyper)
top20_file <- file.path(outdir, paste0("GREAT_enrichment_", go_type, "_top20.txt"))
tb_top20 <- tb %>% head(top_n)
write.table(tb_top20, file = top20_file, sep = "\t", quote = FALSE, row.names = FALSE)
} else {
cat("警告: 没有足够的数据生成气泡图\n")
}
} else {
cat("警告: 没有富集结果\n")
}
}* use genome 'hg38'.
* TSS extension mode is 'basalPlusExt'.
* construct the basal domains by extending 5000bp to upstream and 1000bp to downsteram of TSS.
* calculate distances to neighbour regions.
* extend to both sides until reaching the neighbour genes or to the maximal extension.
* use GO:BP ontology with 15709 gene sets (source: org.Hs.eg.db).
* check gene ID type in \`gene_sets\` and in \`extended_tss\`.
* use whole genome as background.
* remove excluded regions from background.
* overlap \`gr\` to background regions (based on midpoint).
* in total 499 \`gr\`.
* overlap extended TSS to background regions.
* check which genes are in the gene sets.
* only take gene sets with size >= 5.
* in total 9168 gene sets.
* overlap \`gr\` to every extended TSS.
* perform binomial test for each biological term.
* TSS source: GREAT.
* use genome 'hg38'.
* TSS extension mode is 'basalPlusExt'.
* construct the basal domains by extending 5000bp to upstream and 1000bp to downsteram of TSS.
* calculate distances to neighbour regions.
* extend to both sides until reaching the neighbour genes or to the maximal extension.
* use GO:CC ontology with 1977 gene sets (source: org.Hs.eg.db).
* check gene ID type in \`gene_sets\` and in \`extended_tss\`.
* use whole genome as background.
* remove excluded regions from background.
* overlap \`gr\` to background regions (based on midpoint).
* in total 499 \`gr\`.
* overlap extended TSS to background regions.
* check which genes are in the gene sets.
* only take gene sets with size >= 5.
* in total 1184 gene sets.
* overlap \`gr\` to every extended TSS.
* perform binomial test for each biological term.
* TSS source: GREAT.
* use genome 'hg38'.
* TSS extension mode is 'basalPlusExt'.
* construct the basal domains by extending 5000bp to upstream and 1000bp to downsteram of TSS.
* calculate distances to neighbour regions.
* extend to both sides until reaching the neighbour genes or to the maximal extension.
* use GO:MF ontology with 5055 gene sets (source: org.Hs.eg.db).
* check gene ID type in \`gene_sets\` and in \`extended_tss\`.
* use whole genome as background.
* remove excluded regions from background.
* overlap \`gr\` to background regions (based on midpoint).
* in total 499 \`gr\`.
* overlap extended TSS to background regions.
* check which genes are in the gene sets.
* only take gene sets with size >= 5.
* in total 2001 gene sets.
* overlap \`gr\` to every extended TSS.
* perform binomial test for each biological term.
Result Interpretation
The enrichment result table returned by getEnrichmentTable() contains the following field descriptions:
| Field Name | Description |
|---|---|
id | GO term ID (e.g., GO:0007155) |
description | GO term description |
genome_fraction | Proportion of genome covered by regions associated with this GO term |
observed_region_hits | Observed Number of Associated Regions (Binomial Test) |
fold_enrichment | Fold enrichment from Binomial test |
p_value | Raw P-value for Binomial Test |
p_adjust | Adjusted P-value for Binomial Test (FDR) |
mean_tss_dist | Mean distance from input regions to TSS (bp) |
observed_gene_hits | Observed Number of Associated Genes (Hypergeometric Test) |
gene_set_size | Total genes in GO term |
fold_enrichment_hyper | Fold enrichment from Hypergeometric test (Recommended) |
p_value_hyper | Raw P-value for Hypergeometric Test |
p_adjust_hyper | Adjusted P-value for Hypergeometric Test (FDR, Recommended) |
