scMethyl + RNA Multi-omics: Motif Enrichment Analysis (Based on HOMER)
Module Introduction
This module is developed based on HOMER (Hypergeometric Optimization of Motif EnRichment) and is designed for DNA sequence motif analysis in genomic regions (e.g., Differentially Methylated Regions DMRs, ChIP-seq peaks, etc.).
HOMER is a mature and widely used motif discovery tool capable of performing systematic sequence analysis on given genomic regions, achieving the following core analysis goals:
- De novo motif finding: Identifies new, unknown DNA sequence patterns in target sequences without relying on known motif databases, suitable for discovering new transcription factor binding sites or regulatory sequences.
- Known motif enrichment: Evaluates the enrichment of known transcription factor motifs in target sequences and compares them with HOMER's built-in motif database to identify transcription factors that may be involved in regulation.
This analysis is mainly applicable to Human and Mouse samples, but also supports other species with a reference genome FASTA file.
Input File Preparation
This analysis requires a BED format input file to describe the coordinate information of the genomic regions to be analyzed (e.g., Differentially Methylated Regions DMRs, ChIP-seq peaks, etc.).
Input File Format
The BED file should be a tab-separated text file containing at least the following three columns:
| Column | Name | Description | Example |
|---|---|---|---|
| 1 | Chromosome | Chromosome name (e.g., chr1, chr2, chr11, etc.) | chr2 |
| 2 | Start Position | Start position (0-based, counting starts from 0) | 196161361 |
| 3 | End Position | End position (1-based, exclusive of this position) | 196164361 |
File Notes:
- Coordinate system uses 0-based encoding (start position begins at 0).
- Must explicitly specify the genome version used (e.g., hg38, hg19, mm10, mm9, etc.).
- Each region represents a Differentially Methylated Region (DMR).
File Structure Example
chr1 116753470 116770470
chr2 196161361 196164361
chr2 203704361 203712361
...HOMER Analysis Workflow
Key Parameter Settings
During the analysis, the following parameters have a significant impact on the reliability of results and computational efficiency:
| Parameter | Meaning | Default/Suggested | Detailed Description |
|---|---|---|---|
-len | Motif Length | 6,7,8,9,10,11,12 | Motif Discovery Parameter. Specifies the motif lengths to analyze, multiple lengths can be specified (comma-separated). • Small length (6-8bp): Suitable for finding shorter transcription factor binding sites • Medium length (9-12bp): Suitable for finding medium-length regulatory sequences • Recommended: 6,7,8,9,10,11,12 (Reference Nature 2024)Note: Default is 8,10,12. |
-size | Sequence Extraction Size | 200 | Sequence Extraction Parameter. Length of sequence extracted from the center of each DMR (bp). • Smaller value (100-150bp): Focuses more on the central region, suitable for tight transcription factor binding sites • Larger value (200-300bp): Includes more context sequence, suitable for more dispersed regulatory elements Note: Default is 200bp. |
-mask | Mask Repeats | Enabled | Quality Control Parameter. Masks repeat sequences to prevent them from interfering with motif discovery. Suggestion: Always use this option unless analyzing repeat regions specifically. |
-p | Parallel Threads | 8 | Performance Parameter. Number of threads used for parallel computing. • Suggested: Set according to server configuration, typically 4-16 • Note: Too many threads may lead to excessive memory usage |
Best Practice Suggestions:
Selection of
-lenparameter: It is recommended to use the combination6,7,8,9,10,11,12to cover motif patterns of different lengths; if analysis time is limited, the default recommendation8,10,12can be used.Selection of
-sizeparameter: In most cases, the default value of 200bp is sufficient. If the DMR region length is small (< 500bp), this parameter can be appropriately reduced; if it is desired to capture a larger range of potential regulatory sequences, this value can be increased to 300bp.Environment Configuration: Ensure that HOMER's
bindirectory is added to thePATHenvironment variable (or use the full path in the command), and ensure that the reference genomeFASTAfile is consistent with the genome version used in the inputBEDfile.
HOMER Analysis Execution
The basic workflow of HOMER analysis includes: parameter configuration, environment setup, command construction, and execution.
## homer_bin: Full path to HOMER findMotifsGenome.pl script
homer_bin <- "/jp_envs/envs/common/bin/findMotifsGenome.pl"
## dmr_bed_file: Input DMR BED file path
dmr_bed_file <- "./DMRs/DMRs_significant_2.bed"
## genome_fa: Reference genome fasta file path
# Note: Must be consistent with the reference genome version used in input BED file (e.g., hg38)
genome_fa <- "./genome.fa"
## output_dir: Result output directory
# HOMER will automatically create this directory
output_dir <- "DMRs_HOMER/"
# Check number of regions in BED file
# If > 500 regions, use only the first 500
bed_lines <- readLines(dmr_bed_file, warn = FALSE)
# Filter empty lines and comment lines (starting with #)
bed_lines <- bed_lines[bed_lines != "" & !grepl("^#", bed_lines)]
num_regions <- length(bed_lines)
# Used to store the actual BED file path (original or temporary)
actual_bed_file <- dmr_bed_file
temp_bed_file <- NULL
if (num_regions > 500) {
cat("Detected", num_regions, "regions in BED file, exceeding 500\n")
cat("Only the first 500 regions will be used for analysis\n")
# Create temporary file containing only the first 500 lines
temp_bed_file <- tempfile(fileext = ".bed")
writeLines(bed_lines[1:500], temp_bed_file)
actual_bed_file <- temp_bed_file
cat("Created temporary BED file:", temp_bed_file, "\n")
} else {
cat("BED file contains", num_regions, "regions, all will be used\n")
}
# Set PATH environment variable to ensure HOMER dependent scripts can be found
# HOMER bin directory contains all perl scripts (bed2pos.pl, homerTools, etc.)
homer_bin_dir <- dirname(homer_bin)
current_path <- Sys.getenv("PATH")
Sys.setenv(PATH = paste(homer_bin_dir, current_path, sep = ":"))
# Set LD_LIBRARY_PATH environment variable to ensure dynamic libraries (e.g., libcrypt.so.2) can be found
# This is key to solving "error while loading shared libraries"
homer_lib_dir <- file.path(dirname(homer_bin_dir), "lib")
current_ld_path <- Sys.getenv("LD_LIBRARY_PATH")
if (current_ld_path == "") {
Sys.setenv(LD_LIBRARY_PATH = homer_lib_dir)
} else {
# Ensure homer_lib_dir is at the beginning of LD_LIBRARY_PATH for priority lookup
Sys.setenv(LD_LIBRARY_PATH = paste(homer_lib_dir, current_ld_path, sep = ":"))
}
cat("LD_LIBRARY_PATH:", Sys.getenv("LD_LIBRARY_PATH"), "\n")
# Basic command format for findMotifsGenome.pl:
# findMotifsGenome.pl <DMR bed file> <genome> <output dir> [options]
args <- c(
actual_bed_file, # DMR BED file (original or temporary, only first 500 regions)
genome_fa, # Reference genome fasta file
output_dir, # Output directory
"-len", "6,7,8,9,10,11,12", # Motif Length: Analyze 6-12bp motifs
"-mask", # Mask repeat sequences
"-p", "8" # Use 8 threads for parallel calculation
)
# Use system2 in Jupyter to capture output for debugging
cat("Starting HOMER analysis...\n")
cat("HOMER bin directory:", homer_bin_dir, "\n")
cat("Command:", homer_bin, paste(args, collapse = " "), "\n\n")
flush.console() # Force flush output buffer
# Use stdout=TRUE and stderr=TRUE to capture output
result <- system2(
command = homer_bin,
args = args,
wait = TRUE,
stdout = TRUE,
stderr = TRUE
)
# If temporary BED file was used, delete it
if (!is.null(temp_bed_file) && file.exists(temp_bed_file)) {
unlink(temp_bed_file)
cat("Cleaned up temporary BED file\n")
}Only the first 500 regions will be used for analysis
Created temporary BED file: /tmp/Rtmp8aN0VY/file42f59fbc202.bed
LD_LIBRARY_PATH: /jp_envs/envs/common/lib
Starting HOMER analysis...n HOMER bin directory: /jp_envs/envs/common/bin
Command: /jp_envs/envs/common/bin/findMotifsGenome.pl /tmp/Rtmp8aN0VY/file42f59fbc202.bed ./genome.fa DMRs_HOMER/ -len 6,7,8,9,10,11,12 -mask -p 8
Cleaned up temporary BED file
Result Interpretation
HOMER generates multiple files in the output directory, mainly divided into De novo motif finding results and Known motif enrichment results.
De novo Motif Finding Results
homerResults.htmlandhomerResults/directory: Formatted web report for De novo motif discovery, which is the main file to view. It contains visualization logos, statistics (p-value, enrichment fold, etc.), sequence patterns for each motif, and detailedmotif<length>.motiffiles, facilitating quick browsing and filtering in a browser.homerMotifs.motifs<length>: Result files classified by motif length (e.g.,homerMotifs.motifs6for 6bp). Contains sequence patterns and statistics for all motifs found at that length, used to view analysis results for specific lengths.homerMotifs.all.motifs: Merged file for all motif lengths. Summarizes content of allhomerMotifs.motifs<length>files, used to view the collection of all motifs found in this analysis.
Known Motif Enrichment Results
knownResults.txt: Text result for known motif enrichment analysis (TSV format, open with Excel), which is the main file to view. It listsMotif Name,Consensus, enrichment statistics (P-value,Log P-value,q-value), and distribution in target/background sequences (Target/Background Sequences with Motifcount#and percentage%), used to identify enriched transcription factors.knownResults.htmlandknownResults/directory: Formatted web report (HTML) and accompanying directory for known motif enrichment. Contains visualization logos, statistics tables, enrichment folds, and detailedknown<length>.motiffiles, providing an intuitive interactive result viewing method.
