scMethyl + RNA Multi-omics: Motif Enrichment Analysis (Based on HOMER)
Module Introduction
This module is based on HOMER (Hypergeometric Optimization of Motif EnRichment), used for DNA sequence motif analysis in genomic regions (e.g., Differentially Methylated Regions DMRs, ChIP-seq peaks).
HOMER is a mature and widely used motif discovery tool capable of systematic sequence analysis on given genomic regions, primarily achieving the following analysis goals:
- De novo motif finding: Searches for new, unknown DNA sequence patterns in target sequences, independent of known motif databases; suitable for discovering new transcription factor binding sites or regulatory sequences.
- Known motif enrichment: Evaluates enrichment of known transcription factor motifs in target sequences and compares with HOMER's built-in motif database to identify potential regulatory transcription factors.
This analysis is mainly applicable to Human and Mouse samples, but also supports other species with reference genome FASTA files.
Input File Preparation
This analysis requires a BED format input file describing the coordinate information of genomic regions to be analyzed (e.g., Differentially Methylated Regions DMRs, ChIP-seq peaks).
Input File Format
The BED file should be a tab-separated text file containing at least the following three columns:
| Column No. | Column Name | Description | Example |
|---|---|---|---|
| 1 | Chromosome Name | Chromosome name (e.g., chr1, chr2, chr11, etc.) | chr2 |
| 2 | Start Position | Start position (0-based, counting starts from 0) | 196161361 |
| 3 | End Position | End position (1-based, exclusive) | 196164361 |
File Description:
- Coordinate system uses 0-based indexing (start position begins at 0).
- Must explicitly specify the genome version used (e.g., hg38, hg19, mm10, mm9, etc.).
- Each region represents a Differentially Methylated Region (DMR).
File Structure Example
chr1 116753470 116770470
chr2 196161361 196164361
chr2 203704361 203712361
...HOMER Analysis Pipeline
Key Parameter Settings
During analysis, the following parameters have a significant impact on result reliability and computational efficiency:
| Parameter Name | Chinese Meaning | Default/Suggested Value | Detailed Description |
|---|---|---|---|
-len | Motif Length | 6,7,8,9,10,11,12 | Motif Discovery Parameter. Specifies motif lengths to analyze, can specify multiple (comma-separated). • Small (6-8bp): For short TF binding sites • Medium (9-12bp): For medium regulatory sequences • Recommended: 6,7,8,9,10,11,12 (Ref Nature 2024)Note: Default is 8,10,12. |
-size | Extraction Size | 200 | Sequence Extraction Parameter. Length of sequence extracted from DMR center (bp). • Small (100-150bp): Focused on center, for tight TF binding sites • Large (200-300bp): More context, for dispersed elements Note: Default is 200bp. |
-mask | Mask Repeats | Enabled | QC Parameter. Masks repetitive sequence regions to avoid interference. Suggestion: Always use unless analyzing repeat regions. |
-p | Parallel Threads | 8 | Performance Parameter. Number of threads for parallel computing. • Suggested: 4-16 depending on server config • Note: Too many threads may cause high memory usage |
Best Practice Recommendations:
Choice of
-lenparameter: It is recommended to use a combination of6,7,8,9,10,11,12to cover motif patterns of different lengths; if analysis time is limited, only the default recommended values8,10,12can be used.Choice of
-sizeparameter: Default value of 200bp is sufficient in most cases. If DMR region length is small (< 500bp), this parameter can be appropriately reduced; if capturing a larger range of potential regulatory sequences is desired, this value can be increased to 300bp.About Environment Configuration: Ensure HOMER's
bindirectory is added toPATHenvironment variable (or use full path in command), and ensure reference genomeFASTAfile is consistent with the genome version used by inputBEDfile.
HOMER Analysis Execution
The basic flow of HOMER analysis includes: parameter configuration, environment setup, command construction, and execution.
## homer_bin: Full path to HOMER findMotifsGenome.pl script
homer_bin <- "/jp_envs/envs/common/bin/findMotifsGenome.pl"
## dmr_bed_file: Input DMR BED file path
dmr_bed_file <- "./DMRs/DMRs_significant_2.bed"
## genome_fa: Reference genome fasta file path
# Note: Must match the reference genome version used in the input BED file (e.g., hg38)
genome_fa <- "./genome.fa"
## output_dir: Result output directory
# HOMER will automatically create this directory
output_dir <- "DMRs_HOMER/"
# Check the number of regions in the BED file
# If more than 500 regions, only use the first 500
bed_lines <- readLines(dmr_bed_file, warn = FALSE)
# Filter empty lines and comment lines (lines starting with #)
bed_lines <- bed_lines[bed_lines != "" & !grepl("^#", bed_lines)]
num_regions <- length(bed_lines)
# Used to store the actual BED file path used (may be original file or temporary file)
actual_bed_file <- dmr_bed_file
temp_bed_file <- NULL
if (num_regions > 500) {
cat("Detected BED file contains ", num_regions, " regions, exceeding 500.\n")
cat("Only the first 500 regions will be used for analysis.\n")
# Create temporary file containing only the first 500 lines
temp_bed_file <- tempfile(fileext = ".bed")
writeLines(bed_lines[1:500], temp_bed_file)
actual_bed_file <- temp_bed_file
cat("Temporary BED file created:", temp_bed_file, "\n")
} else {
cat("BED file contains ", num_regions, " regions, all will be used.\n")
}
# Set PATH environment variable to ensure HOMER dependency scripts can be found
# HOMER bin directory contains all perl scripts (bed2pos.pl, homerTools, etc.)
homer_bin_dir <- dirname(homer_bin)
current_path <- Sys.getenv("PATH")
Sys.setenv(PATH = paste(homer_bin_dir, current_path, sep = ":"))
# Set LD_LIBRARY_PATH environment variable to ensure dynamic libraries (e.g., libcrypt.so.2) can be found
# This is key to solving "error while loading shared libraries" issues
homer_lib_dir <- file.path(dirname(homer_bin_dir), "lib")
current_ld_path <- Sys.getenv("LD_LIBRARY_PATH")
if (current_ld_path == "") {
Sys.setenv(LD_LIBRARY_PATH = homer_lib_dir)
} else {
# Ensure homer_lib_dir is at the beginning of LD_LIBRARY_PATH for priority lookup
Sys.setenv(LD_LIBRARY_PATH = paste(homer_lib_dir, current_ld_path, sep = ":"))
}
cat("LD_LIBRARY_PATH:", Sys.getenv("LD_LIBRARY_PATH"), "\n")
# Basic command format for findMotifsGenome.pl:
# findMotifsGenome.pl <DMR bed file> <Reference Genome> <Output Directory> [Options]
args <- c(
actual_bed_file, # DMR BED file (may be original or temporary file, containing only first 500 regions)
genome_fa, # Reference Genome FASTA File
output_dir, # Output directory
"-len", "6,7,8,9,10,11,12", # Motif Length: Analyze 6-12bp motifs
"-mask", # Mask repetitive sequences
"-p", "8" # Use 8 threads for parallel computation
)
# Use system2 in Jupyter to capture output for debugging
cat("Starting HOMER analysis...\n")
cat("HOMER bin directory:", homer_bin_dir, "\n")
cat("Command:", homer_bin, paste(args, collapse = " "), "\n\n")
flush.console() # Force flush output buffer
# Use stdout=TRUE and stderr=TRUE to capture output
result <- system2(
command = homer_bin,
args = args,
wait = TRUE,
stdout = TRUE,
stderr = TRUE
)
# If temporary BED file was used, delete it
if (!is.null(temp_bed_file) && file.exists(temp_bed_file)) {
unlink(temp_bed_file)
cat("Temporary BED file cleaned up.\n")
}Only the first 500 regions will be used for analysis
Temporary BED file created: /tmp/RtmpKd4Kmx/file4274c462b79.bed
LD_LIBRARY_PATH: /jp_envs/envs/common/lib
Starting HOMER analysis...n HOMER bin directory: /jp_envs/envs/common/bin
Command: /jp_envs/envs/common/bin/findMotifsGenome.pl /tmp/RtmpKd4Kmx/file4274c462b79.bed ./genome.fa DMRs_HOMER/ -len 6,7,8,9,10,11,12 -mask -p 8
Result Interpretation
HOMER generates multiple types of result files in the output directory, mainly divided into De novo motif finding results and Known motif enrichment results.
De novo Motif Finding Results
homerResults.htmlandhomerResults/directory: Formatted web report for De novo motif discovery, which is the main file to view. It contains visualization logo plots, statistical information (p-value, enrichment fold, etc.), sequence patterns for each motif, and detailedmotif<length>.motiffiles, facilitating quick browsing and screening of candidate motifs in the browser.homerMotifs.motifs<length>: Result files categorized by motif length (e.g.,homerMotifs.motifs6corresponds to 6bp length). Contains sequence patterns and statistical information for all discovered motifs of that length, used for viewing results of specific lengths.homerMotifs.all.motifs: Merged file of motifs of all lengths. It summarizes contents of allhomerMotifs.motifs<length>, used to view the complete set of motifs discovered in this analysis.
Known Motif Enrichment Results
knownResults.txt: Text result of known motif enrichment analysis (TSV format, openable with Excel), which is the main file to view. It lists in detailMotif Name,Consensus, enrichment statistics (P-value,Log P-value,q-value), and Motif distribution in target and background sequences (Target/Background Sequences with Motifcount#and percentage%), used to identify enriched transcription factors.knownResults.htmlandknownResults/directory: Formatted web report (HTML) and accompanying directory for known motif enrichment analysis. Contents include visualization logos of enriched motifs, statistical information tables, enrichment folds, and detailedknown<length>.motiffiles, providing an intuitive and interactive way to browse results.
