Skip to content

scMethyl + RNA Multi-omics: Motif Enrichment Analysis (Based on HOMER)

Author: SeekGene
Time: 8 min
Words: 1.5k words
Updated: 2026-02-28
Reads: 0 times
scMethyl + RNA-seq Analysis Guide Notebooks

Module Introduction

This module is based on HOMER (Hypergeometric Optimization of Motif EnRichment), used for DNA sequence motif analysis in genomic regions (e.g., Differentially Methylated Regions DMRs, ChIP-seq peaks).

HOMER is a mature and widely used motif discovery tool capable of systematic sequence analysis on given genomic regions, primarily achieving the following analysis goals:

  • De novo motif finding: Searches for new, unknown DNA sequence patterns in target sequences, independent of known motif databases; suitable for discovering new transcription factor binding sites or regulatory sequences.
  • Known motif enrichment: Evaluates enrichment of known transcription factor motifs in target sequences and compares with HOMER's built-in motif database to identify potential regulatory transcription factors.

This analysis is mainly applicable to Human and Mouse samples, but also supports other species with reference genome FASTA files.

Input File Preparation

This analysis requires a BED format input file describing the coordinate information of genomic regions to be analyzed (e.g., Differentially Methylated Regions DMRs, ChIP-seq peaks).

Input File Format

The BED file should be a tab-separated text file containing at least the following three columns:

Column No.Column NameDescriptionExample
1Chromosome NameChromosome name (e.g., chr1, chr2, chr11, etc.)chr2
2Start PositionStart position (0-based, counting starts from 0)196161361
3End PositionEnd position (1-based, exclusive)196164361

File Description:

  • Coordinate system uses 0-based indexing (start position begins at 0).
  • Must explicitly specify the genome version used (e.g., hg38, hg19, mm10, mm9, etc.).
  • Each region represents a Differentially Methylated Region (DMR).

File Structure Example

bed
chr1	116753470	116770470
chr2	196161361	196164361
chr2	203704361	203712361
...

HOMER Analysis Pipeline

Key Parameter Settings

During analysis, the following parameters have a significant impact on result reliability and computational efficiency:

Parameter NameChinese MeaningDefault/Suggested ValueDetailed Description
-lenMotif Length6,7,8,9,10,11,12Motif Discovery Parameter.
Specifies motif lengths to analyze, can specify multiple (comma-separated).
Small (6-8bp): For short TF binding sites
Medium (9-12bp): For medium regulatory sequences
Recommended: 6,7,8,9,10,11,12 (Ref Nature 2024)
Note: Default is 8,10,12.
-sizeExtraction Size200Sequence Extraction Parameter.
Length of sequence extracted from DMR center (bp).
Small (100-150bp): Focused on center, for tight TF binding sites
Large (200-300bp): More context, for dispersed elements
Note: Default is 200bp.
-maskMask RepeatsEnabledQC Parameter.
Masks repetitive sequence regions to avoid interference.
Suggestion: Always use unless analyzing repeat regions.
-pParallel Threads8Performance Parameter.
Number of threads for parallel computing.
Suggested: 4-16 depending on server config
Note: Too many threads may cause high memory usage

Best Practice Recommendations:

  1. Choice of -len parameter: It is recommended to use a combination of 6,7,8,9,10,11,12 to cover motif patterns of different lengths; if analysis time is limited, only the default recommended values 8,10,12 can be used.

  2. Choice of -size parameter: Default value of 200bp is sufficient in most cases. If DMR region length is small (< 500bp), this parameter can be appropriately reduced; if capturing a larger range of potential regulatory sequences is desired, this value can be increased to 300bp.

  3. About Environment Configuration: Ensure HOMER's bin directory is added to PATH environment variable (or use full path in command), and ensure reference genome FASTA file is consistent with the genome version used by input BED file.

HOMER Analysis Execution

The basic flow of HOMER analysis includes: parameter configuration, environment setup, command construction, and execution.

R
## homer_bin: Full path to HOMER findMotifsGenome.pl script
homer_bin <- "/jp_envs/envs/common/bin/findMotifsGenome.pl"

## dmr_bed_file: Input DMR BED file path
dmr_bed_file <- "./DMRs/DMRs_significant_2.bed"

## genome_fa: Reference genome fasta file path
# Note: Must match the reference genome version used in the input BED file (e.g., hg38)
genome_fa <- "./genome.fa"

## output_dir: Result output directory
# HOMER will automatically create this directory
output_dir <- "DMRs_HOMER/"

# Check the number of regions in the BED file
# If more than 500 regions, only use the first 500
bed_lines <- readLines(dmr_bed_file, warn = FALSE)
# Filter empty lines and comment lines (lines starting with #)
bed_lines <- bed_lines[bed_lines != "" & !grepl("^#", bed_lines)]
num_regions <- length(bed_lines)

# Used to store the actual BED file path used (may be original file or temporary file)
actual_bed_file <- dmr_bed_file
temp_bed_file <- NULL

if (num_regions > 500) {
  cat("Detected BED file contains ", num_regions, " regions, exceeding 500.\n")
  cat("Only the first 500 regions will be used for analysis.\n")
  
  # Create temporary file containing only the first 500 lines
  temp_bed_file <- tempfile(fileext = ".bed")
  writeLines(bed_lines[1:500], temp_bed_file)
  actual_bed_file <- temp_bed_file
  cat("Temporary BED file created:", temp_bed_file, "\n")
} else {
  cat("BED file contains ", num_regions, " regions, all will be used.\n")
}

# Set PATH environment variable to ensure HOMER dependency scripts can be found
# HOMER bin directory contains all perl scripts (bed2pos.pl, homerTools, etc.)
homer_bin_dir <- dirname(homer_bin)
current_path <- Sys.getenv("PATH")
Sys.setenv(PATH = paste(homer_bin_dir, current_path, sep = ":"))

# Set LD_LIBRARY_PATH environment variable to ensure dynamic libraries (e.g., libcrypt.so.2) can be found
# This is key to solving "error while loading shared libraries" issues
homer_lib_dir <- file.path(dirname(homer_bin_dir), "lib")
current_ld_path <- Sys.getenv("LD_LIBRARY_PATH")
if (current_ld_path == "") {
  Sys.setenv(LD_LIBRARY_PATH = homer_lib_dir)
} else {
  # Ensure homer_lib_dir is at the beginning of LD_LIBRARY_PATH for priority lookup
  Sys.setenv(LD_LIBRARY_PATH = paste(homer_lib_dir, current_ld_path, sep = ":"))
}
cat("LD_LIBRARY_PATH:", Sys.getenv("LD_LIBRARY_PATH"), "\n")


# Basic command format for findMotifsGenome.pl:
# findMotifsGenome.pl <DMR bed file> <Reference Genome> <Output Directory> [Options]
args <- c(
  actual_bed_file,     # DMR BED file (may be original or temporary file, containing only first 500 regions)
  genome_fa,           # Reference Genome FASTA File
  output_dir,          # Output directory
  "-len", "6,7,8,9,10,11,12",  # Motif Length: Analyze 6-12bp motifs
  "-mask",             # Mask repetitive sequences
  "-p", "8"            # Use 8 threads for parallel computation
)

# Use system2 in Jupyter to capture output for debugging
cat("Starting HOMER analysis...\n")
cat("HOMER bin directory:", homer_bin_dir, "\n")
cat("Command:", homer_bin, paste(args, collapse = " "), "\n\n")
flush.console()  # Force flush output buffer

# Use stdout=TRUE and stderr=TRUE to capture output
result <- system2(
  command = homer_bin,
  args = args,
  wait = TRUE,
  stdout = TRUE,
  stderr = TRUE
)

# If temporary BED file was used, delete it
if (!is.null(temp_bed_file) && file.exists(temp_bed_file)) {
  unlink(temp_bed_file)
  cat("Temporary BED file cleaned up.\n")
}
output
Detected BED file contains 30201 regions, exceeding 500
Only the first 500 regions will be used for analysis
Temporary BED file created: /tmp/RtmpKd4Kmx/file4274c462b79.bed
LD_LIBRARY_PATH: /jp_envs/envs/common/lib
Starting HOMER analysis...n HOMER bin directory: /jp_envs/envs/common/bin
Command: /jp_envs/envs/common/bin/findMotifsGenome.pl /tmp/RtmpKd4Kmx/file4274c462b79.bed ./genome.fa DMRs_HOMER/ -len 6,7,8,9,10,11,12 -mask -p 8

Result Interpretation

HOMER generates multiple types of result files in the output directory, mainly divided into De novo motif finding results and Known motif enrichment results.

De novo Motif Finding Results

  1. homerResults.html and homerResults/ directory: Formatted web report for De novo motif discovery, which is the main file to view. It contains visualization logo plots, statistical information (p-value, enrichment fold, etc.), sequence patterns for each motif, and detailed motif<length>.motif files, facilitating quick browsing and screening of candidate motifs in the browser.

  2. homerMotifs.motifs<length>: Result files categorized by motif length (e.g., homerMotifs.motifs6 corresponds to 6bp length). Contains sequence patterns and statistical information for all discovered motifs of that length, used for viewing results of specific lengths.

  3. homerMotifs.all.motifs: Merged file of motifs of all lengths. It summarizes contents of all homerMotifs.motifs<length>, used to view the complete set of motifs discovered in this analysis.

Known Motif Enrichment Results

  1. knownResults.txt: Text result of known motif enrichment analysis (TSV format, openable with Excel), which is the main file to view. It lists in detail Motif Name, Consensus, enrichment statistics (P-value, Log P-value, q-value), and Motif distribution in target and background sequences (Target/Background Sequences with Motif count # and percentage %), used to identify enriched transcription factors.

  2. knownResults.html and knownResults/ directory: Formatted web report (HTML) and accompanying directory for known motif enrichment analysis. Contents include visualization logos of enriched motifs, statistical information tables, enrichment folds, and detailed known<length>.motif files, providing an intuitive and interactive way to browse results.

References

0 comments·0 replies