Skip to content

甲基化 + RNA 双组学: 差异甲基化区域功能富集分析(基于 rGREAT)

作者: SeekGene
时长: 17 分钟
字数: 4.1k 字
更新: 2026-02-28
阅读: 0 次
甲基化 + RNA 双组学 Notebooks

模块简介

本模块基于 rGREAT 开发,旨在将生物学功能注释与基因组区域进行关联。本分析中利用该模块对差异甲基化区域(differentially methylated regions,DMRs)开展功能富集分析。

rGREAT是 GREAT(Genomic Regions Enrichment of Annotations Tool) 算法的 R 语言本地实现。该方法通过将基因组区域关联至邻近基因,并评估这些基因在不同功能注释类别中的富集程度,实现以下核心分析目标:

  • 功能富集分析 (Functional Enrichment):识别 DMRs 关联基因在 Gene Ontology(GO) 功能类别中的显著富集情况,从而揭示 DMRs 潜在的生物学功能和参与的生物学过程。
  • 区域-基因关联 (Region-Gene Association):将 DMRs 与其邻近基因进行关联,用于推断每个甲基化区域可能的调控靶基因。
  • 多基因集支持 (Multiple Gene Sets):除 GO 基因集外,还支持 MSigDB 等多种基因集数据库,以提供更加全面的功能注释视角。

本分析主要适用于人源和小鼠样本,也支持其他有 Bioconductor 注释包的物种。

BioMart 支持(非模式生物):对于非模式生物,可通过 BioMart 获取相应的基因集信息(覆盖 600 余种物种),从而扩展 DMR 功能富集分析的适用范围:

r
# 使用 BioMart 获取基因集
res <- great(gr, "GO:BP", biomart_dataset = "amelanoleuca_gene_ensembl")

输入文件准备

本分析需提供一个 BED 格式的输入文件,用于描述差异甲基化区域 (DMRs) 的基因组坐标信息。

输入文件格式

BED 文件应为制表符分隔的文本文件,且至少包含以下三列信息:

列号列名说明示例
1染色体名称染色体名称(如 chr1, chr2, chr11 等)chr2
2起始位置起始位置(0-based,即从 0 开始计数)196161361
3结束位置结束位置(1-based,即不包含该位置)196164361

文件说明

  • 坐标系统采用 0-based 编码方式(起始位置从 0 开始)。
  • 需明确指定所使用的基因组版本(如 hg38、hg19、mm10、mm9 等)。
  • 每个区域代表一个差异甲基化区域 (DMR)。
  • 注意:如果 BED 文件包含的区域数量超过 500 个,分析流程将默认仅使用前 500 个区域进行计算;该数量阈值可根据分析需求进行调整。

文件结构示例

bed
chr2	196161361	196164361
chr2	203704361	203712361
chr11	128486808	128522808
...

rGREAT 分析流程

关键参数设置

在 rGREAT 分析过程中,以下参数对分析结果的稳定性及其生物学解释具有重要影响:

参数名称中文释义默认值/建议值详细说明
annotation_databaseTSS 来源"GREAT:hg38"注释数据库来源
指定转录起始位点 (TSS) 的数据来源。支持多种格式:
"GREAT:hg38":GREAT 官方提供的 hg38 TSS 数据(推荐)
"hg38", "hg19", "mm10" 等:自动使用对应的 TxDb 包
"RefSeq:hg38":RefSeqSelect 基因
"Gencode_v19":Gencode 注释(人类)
注意"GREAT:hg38" 仅支持 hg38, hg19, mm10, mm9 四种基因组版本。
gene_sets基因集类型"GO:BP", "GO:CC", "GO:MF"功能注释基因集
指定用于富集分析的基因集类型:
"GO:BP":生物过程 (Biological Process)
"GO:CC":细胞组分 (Cellular Component)
"GO:MF":分子功能 (Molecular Function)
"msigdb:H":MSigDB Hallmark 基因集(仅人类)
"msigdb:C2":MSigDB 精选基因集(仅人类)
注意:前缀 GO:msigdb: 可以省略。
top_n可视化数量20可视化参数
指定在气泡图中展示的前 N 个最富集的 GO terms。建议设置为 20-30,便于快速识别最重要的生物学功能。
background背景区域NULL(全基因组)背景设置参数(可选)。
指定用于富集分析的背景区域。如果设置为 NULL,则使用全基因组(排除 gap 区域)作为背景。
应用场景:当需要排除特定区域(如重复序列、未测序区域)或限制分析范围时使用。
exclude排除区域NULL排除区域参数(可选)。
指定要从分析中排除的基因组区域(如 gap 区域、重复序列等)。
注意:rGREAT 默认会排除 gap 区域。

其中 TSS 支持的来源

  1. 基因组版本:如 "hg38", "hg19", "mm10" 等,将自动调用对应的 TxDb 注释包。
  2. TxDb 包名:如 "TxDb.Hsapiens.UCSC.hg38.knownGene"
  3. RefSeq:如"RefSeq:hg38",基于 RefSeq 基因注释。
  4. RefSeqCurated:如"RefSeqCurated:hg38",RefSeqCurated 子集。
  5. RefSeqSelect:如"RefSeqSelect:hg38",RefSeqCurated 子集。
  6. Gencode:如"Gencode_v19"(人类)或 "Gencode_M21"(小鼠)。
  7. GREAT:如"GREAT:hg38"(仅支持 hg38, hg19, mm10, mm9)。 。

区域-基因关联分析

GREAT 分析的第一步是将输入的基因组区域与基因的扩展 TSS 调控域进行关联。

分析原理

GREAT 算法的核心思想是将非编码基因组区域与其邻近基因进行关联,并基于这些基因的已知功能注释推断区域的潜在生物学功能。

  1. 关联规则:GREAT 算法首先为每个基因定义一个调控域 (regulatory domain)。当输入区域与某一基因的调控域发生重叠时,该区域即被关联至该基因(注:单个区域可同时关联至多个基因)。

  2. 调控域的构建(以 basalPlusExt 模式为例):首先定义 Basal 域(覆盖 TSS 上游 5000 bp 至下游 1000 bp);随后向两侧进行 Extension(延伸),直至遇到相邻基因的 Basal 域边界,或达到设定的最大扩展距离(默认 1,000,000 bp)。

分析步骤

  1. 构建扩展 TSS 调控域:采用指定的扩展模式(默认 basalPlusExt),为每个基因构建其扩展的调控域。

  2. 区域-基因匹配:将输入区域的中点与扩展 TSS 调控域进行重叠判断,从而确定每个区域所关联的基因。

  3. 结果输出:生成区域-基因关联示意图(GREAT_region_gene_associations.png),用于直观展示输入区域与其关联基因之间的对应关系。

R
# 加载所需的 R 包
library(rGREAT)
library(GenomicRanges)
library(rtracklayer)
library(ggplot2)
library(dplyr)

## bed_file:输入 BED 文件路径
bed_file <- "./DMRs/DMRs_significant_2.bed"

## annotation_database:TSS 来源
annotation_database <- "GREAT:hg38"  # 根据物种和使用的参考基因组版本修改,例如:GREAT:mm9

## go_types:要分析的 GO 类型
go_types <- c("GO:BP" = "Biological Process",
              "GO:CC" = "Cellular Component", 
              "GO:MF" = "Molecular Function")

## outdir:结果输出目录,所有结果文件将保存到此目录下
outdir <- "DMRs_rGREAT"
# 创建输出目录(如不存在)
dir.create(outdir, recursive = TRUE, showWarnings = FALSE)

# 定义函数:在 R 中调用 rGREAT 分析
run_rgreat_analysis <- function(gr, go_type, annotation_database) {
  res <- great(gr, go_type, annotation_database)
  return(res)
}

# 读取 BED 文件
gr <- import(bed_file, format = "BED")

# 如果区域数量超过 500 个,则只使用前 500 个区域进行分析
if (length(gr) > 500) {
  gr <- gr[1:500]
}



# 区域-基因关联分析
res_temp <- run_rgreat_analysis(gr, names(go_types)[1], annotation_database)
assoc_file <- file.path(outdir, "GREAT_region_gene_associations.png")
png(assoc_file, width = 12, height = 8, units = "in", res = 300)
p2 <- plotRegionGeneAssociations(res_temp)
if (inherits(p2, "ggplot")) {
  print(p2)
}
dev.off()
output
Loading required package: GenomicRanges

Loading required package: stats4

Loading required package: BiocGenerics


Attaching package: ‘BiocGenerics’


The following objects are masked from ‘package:stats’:

IQR, mad, sd, var, xtabs


The following objects are masked from ‘package:base’:

anyDuplicated, aperm, append, as.data.frame, basename, cbind,
colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
table, tapply, union, unique, unsplit, which.max, which.min


Loading required package: S4Vectors


Attaching package: ‘S4Vectors’


The following object is masked from ‘package:utils’:

findMatches


The following objects are masked from ‘package:base’:

expand.grid, I, unname


Loading required package: IRanges

Loading required package: GenomeInfoDb






------------------
Note: On Aug 19 2019 GREAT released version 4 where it supports \`hg38\`
genome and removes some ontologies such pathways. \`submitGreatJob()\`
still takes \`hg19\` as default. \`hg38\` can be specified by the \`species
= 'hg38'\` argument. To use the older versions such as 3.0.0, specify as
\`submitGreatJob(..., version = '3.0.0')\`.

From rGREAT version 1.99.0, it also implements the GREAT algorithm and
it allows to integrate GREAT analysis with the Bioconductor annotation
ecosystem. By default it supports more than 500 organisms. Check the
new function \`great()\` and the new vignette.
------------------


Attaching package: ‘dplyr’


The following objects are masked from ‘package:GenomicRanges’:

intersect, setdiff, union


The following object is masked from ‘package:GenomeInfoDb’:

intersect


The following objects are masked from ‘package:IRanges’:

collapse, desc, intersect, setdiff, slice, union


The following objects are masked from ‘package:S4Vectors’:

first, intersect, rename, setdiff, setequal, union


The following objects are masked from ‘package:BiocGenerics’:

combine, intersect, setdiff, union


The following objects are masked from ‘package:stats’:

filter, lag


The following objects are masked from ‘package:base’:

intersect, setdiff, setequal, union


* TSS source: GREAT.

* use genome 'hg38'.

* TSS extension mode is 'basalPlusExt'.

* construct the basal domains by extending 5000bp to upstream and 1000bp to downsteram of TSS.

* calculate distances to neighbour regions.

* extend to both sides until reaching the neighbour genes or to the maximal extension.

* use GO:BP ontology with 15709 gene sets (source: org.Hs.eg.db).

* check gene ID type in \`gene_sets\` and in \`extended_tss\`.

* use whole genome as background.

* remove excluded regions from background.

* overlap \`gr\` to background regions (based on midpoint).

* in total 499 \`gr\`.

* overlap extended TSS to background regions.

* check which genes are in the gene sets.

* only take gene sets with size >= 5.

* in total 9168 gene sets.

* overlap \`gr\` to every extended TSS.

* perform binomial test for each biological term.

pdf: 2

GO 富集分析

GREAT 分析的第二步是对区域–基因关联结果进行功能富集分析,以评估相关基因在生物学功能类别中的显著性。

分析原理

  1. 二项分布检验(Binomial Test):用于评估区域水平的功能富集。该方法假设输入区域在全基因组范围内均匀分布,通过计算与某一 GO term 相关联的基因组区域比例是否显著高于随机期望,从而评估该 GO term 在区域层面的富集程度。

  2. 超几何检验(Hypergeometric Test):用于评估基因水平的功能富集。该方法假设关联基因是从背景基因集中随机抽取的,通过计算某一 GO term 中被关联的基因数量是否显著高于随机期望,评估其在基因层面的富集显著性。

  3. 结果指标解读fold_enrichment_hyper 为超几何检验得到的富集倍数(> 1 表示富集),p_value_hyper 为超几何检验的原始 P 值,p_adjust_hyper 为多重检验校正后的 P 值(FDR,通常以 < 0.05 作为显著性阈值),mean_tss_dist 为输入区域到 TSS 的平均距离(该值越大,区域与基因的关联可靠性相对越低)。

分析步骤

  1. 统计检验:同时执行二项分布检验(评估区域水平富集)和超几何检验(计算基因水平富集),以从不同层面综合评估 GO term 的显著性。

  2. 结果输出:通过 getEnrichmentTable(res) 获取完整的功能富集结果表;同时生成火山图(X 轴为富集倍数,Y 轴为 -log10 P 值)和气泡图(展示前 N 个最富集的 GO terms),用于结果的可视化展示。

R
## top_n:可视化富集到的前 N 个 GO
top_n <- 20

## desc_max_length:图片中 GO description 的最大长度
desc_max_length <- 50

# GO 富集分析
## 循环处理每种 GO 类型
for (go_type in names(go_types)) {
  go_name <- go_types[go_type]
 
  # 运行 GREAT 分析
  res <- run_rgreat_analysis(gr, go_type, annotation_database)
  
  # 获取富集分析结果表
  tb <- getEnrichmentTable(res)
  
  # 按 fold_enrichment_hyper 排序(降序),使结果表与图表一致
  tb <- tb %>% arrange(desc(fold_enrichment_hyper))
  
  # 保存完整结果表(已排序)
  result_file <- file.path(outdir, paste0("GREAT_enrichment_", go_type, "_results.txt"))
  write.table(tb, file = result_file, sep = "\t", quote = FALSE, row.names = FALSE)
  
  # 可视化1: 火山图
  volcano_file <- file.path(outdir, paste0("GREAT_volcano_plot_", go_type, ".png"))
  png(volcano_file, width = 10, height = 8, units = "in", res = 300)
  p1 <- plotVolcano(res)
  # 如果是 ggplot 对象,需要打印
  if (inherits(p1, "ggplot")) {
    print(p1)
  }
  dev.off()
  
  # 可视化2: 气泡图
  if (nrow(tb) > 0) {
    # 按 fold_enrichment_hyper 排序后取前 top_n 个
    tb_top <- tb %>%
      head(top_n) %>%
      mutate(
        neg_log10_padj = -log10(p_adjust_hyper),
        neg_log10_padj = ifelse(is.infinite(neg_log10_padj), -log10(1e-300), neg_log10_padj),
        description_short = ifelse(nchar(description) > desc_max_length,
                                 paste0(substr(description, 1, desc_max_length-3), "..."),
                                 description)
      ) %>%
      mutate(description_short = factor(description_short, levels = rev(description_short)))
    
    if (nrow(tb_top) > 0) {
      p3 <- ggplot(tb_top, aes(x = fold_enrichment_hyper, y = description_short,
                                size = observed_gene_hits, color = neg_log10_padj)) +
        geom_point(alpha = 0.7) +
        scale_size_continuous(name = "Number of\nGenes", range = c(3, 12)) +
        scale_color_gradient(low = "#4A90E2", high = "#E74C3C", name = "-log10(FDR)") +
        labs(title = paste("Top", min(top_n, nrow(tb_top)), "Enriched", go_name),
             x = "Fold Enrichment", y = "GO Term") +
        theme_minimal() +
        theme(plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
              axis.text.y = element_text(size = 9),
              legend.position = "right") +
        geom_vline(xintercept = 1, linetype = "dashed", color = "gray60", alpha = 0.5)
      
      bubble_file <- file.path(outdir, paste0("GO_bubbleplot_", go_type, ".png"))
      ggsave(bubble_file, plot = p3, width = 12, height = 10, dpi = 300)
      
      # 保存前 top_n 个结果(与图表对应,按fold_enrichment_hyper排序)
      top20_file <- file.path(outdir, paste0("GREAT_enrichment_", go_type, "_top20.txt"))
      tb_top20 <- tb %>% head(top_n)
      write.table(tb_top20, file = top20_file, sep = "\t", quote = FALSE, row.names = FALSE)
    } else {
      cat("警告: 没有足够的数据生成气泡图\n")
    }
  } else {
    cat("警告: 没有富集结果\n")
  }
}
output
* TSS source: GREAT.

* use genome 'hg38'.

* TSS extension mode is 'basalPlusExt'.

* construct the basal domains by extending 5000bp to upstream and 1000bp to downsteram of TSS.

* calculate distances to neighbour regions.

* extend to both sides until reaching the neighbour genes or to the maximal extension.

* use GO:BP ontology with 15709 gene sets (source: org.Hs.eg.db).

* check gene ID type in \`gene_sets\` and in \`extended_tss\`.

* use whole genome as background.

* remove excluded regions from background.

* overlap \`gr\` to background regions (based on midpoint).

* in total 499 \`gr\`.

* overlap extended TSS to background regions.

* check which genes are in the gene sets.

* only take gene sets with size >= 5.

* in total 9168 gene sets.

* overlap \`gr\` to every extended TSS.

* perform binomial test for each biological term.

* TSS source: GREAT.

* use genome 'hg38'.

* TSS extension mode is 'basalPlusExt'.

* construct the basal domains by extending 5000bp to upstream and 1000bp to downsteram of TSS.

* calculate distances to neighbour regions.

* extend to both sides until reaching the neighbour genes or to the maximal extension.

* use GO:CC ontology with 1977 gene sets (source: org.Hs.eg.db).

* check gene ID type in \`gene_sets\` and in \`extended_tss\`.

* use whole genome as background.

* remove excluded regions from background.

* overlap \`gr\` to background regions (based on midpoint).

* in total 499 \`gr\`.

* overlap extended TSS to background regions.

* check which genes are in the gene sets.

* only take gene sets with size >= 5.

* in total 1184 gene sets.

* overlap \`gr\` to every extended TSS.

* perform binomial test for each biological term.

* TSS source: GREAT.

* use genome 'hg38'.

* TSS extension mode is 'basalPlusExt'.

* construct the basal domains by extending 5000bp to upstream and 1000bp to downsteram of TSS.

* calculate distances to neighbour regions.

* extend to both sides until reaching the neighbour genes or to the maximal extension.

* use GO:MF ontology with 5055 gene sets (source: org.Hs.eg.db).

* check gene ID type in \`gene_sets\` and in \`extended_tss\`.

* use whole genome as background.

* remove excluded regions from background.

* overlap \`gr\` to background regions (based on midpoint).

* in total 499 \`gr\`.

* overlap extended TSS to background regions.

* check which genes are in the gene sets.

* only take gene sets with size >= 5.

* in total 2001 gene sets.

* overlap \`gr\` to every extended TSS.

* perform binomial test for each biological term.

结果解读

getEnrichmentTable() 返回的富集结果表包含以下字段说明:

字段名称说明
idGO term ID(如 GO:0007155)
descriptionGO term 描述
genome_fraction该 GO term 关联区域占全基因组的比例
observed_region_hits观察到的关联区域数量(二项分布检验)
fold_enrichment二项分布检验的富集倍数
p_value二项分布检验的原始 P 值
p_adjust二项分布检验的调整后 P 值 (FDR)
mean_tss_dist输入区域到 TSS 的平均距离 (bp)
observed_gene_hits观察到的关联基因数量(超几何检验)
gene_set_sizeGO term 中的总基因数
fold_enrichment_hyper超几何检验的富集倍数(推荐使用)
p_value_hyper超几何检验的原始 P 值
p_adjust_hyper超几何检验的调整后 P 值(FDR,推荐使用)
0 条评论·0 条回复