scMethyl + RNA Multi-omics: Motif Enrichment Analysis (Based on HOMER)

Author: SeekGene

Time: 8 min

Words: 1.5k words

Updated: 2026-03-18

Reads: 0 times

scMethyl + RNA-seq Notebooks

Module Introduction

This module is developed based on HOMER (Hypergeometric Optimization of Motif EnRichment) and is designed for DNA sequence motif analysis in genomic regions (e.g., Differentially Methylated Regions DMRs, ChIP-seq peaks, etc.).

HOMER is a mature and widely used motif discovery tool capable of performing systematic sequence analysis on given genomic regions, achieving the following core analysis goals:

De novo motif finding: Identifies new, unknown DNA sequence patterns in target sequences without relying on known motif databases, suitable for discovering new transcription factor binding sites or regulatory sequences.
Known motif enrichment: Evaluates the enrichment of known transcription factor motifs in target sequences and compares them with HOMER's built-in motif database to identify transcription factors that may be involved in regulation.

This analysis is mainly applicable to Human and Mouse samples, but also supports other species with a reference genome FASTA file.

Input File Preparation

This analysis requires a BED format input file to describe the coordinate information of the genomic regions to be analyzed (e.g., Differentially Methylated Regions DMRs, ChIP-seq peaks, etc.).

Input File Format

The BED file should be a tab-separated text file containing at least the following three columns:

Column	Name	Description	Example
1	Chromosome	Chromosome name (e.g., chr1, chr2, chr11, etc.)	chr2
2	Start Position	Start position (0-based, counting starts from 0)	196161361
3	End Position	End position (1-based, exclusive of this position)	196164361

File Notes:

Coordinate system uses 0-based encoding (start position begins at 0).
Must explicitly specify the genome version used (e.g., hg38, hg19, mm10, mm9, etc.).
Each region represents a Differentially Methylated Region (DMR).

File Structure Example

bed

chr1	116753470	116770470
chr2	196161361	196164361
chr2	203704361	203712361
...

HOMER Analysis Workflow

Key Parameter Settings

During the analysis, the following parameters have a significant impact on the reliability of results and computational efficiency:

Parameter	Meaning	Default/Suggested	Detailed Description
`-len`	Motif Length	`6,7,8,9,10,11,12`	Motif Discovery Parameter. Specifies the motif lengths to analyze, multiple lengths can be specified (comma-separated). • Small length (6-8bp): Suitable for finding shorter transcription factor binding sites • Medium length (9-12bp): Suitable for finding medium-length regulatory sequences • Recommended: `6,7,8,9,10,11,12` (Reference Nature 2024) Note: Default is 8,10,12.
`-size`	Sequence Extraction Size	200	Sequence Extraction Parameter. Length of sequence extracted from the center of each DMR (bp). • Smaller value (100-150bp): Focuses more on the central region, suitable for tight transcription factor binding sites • Larger value (200-300bp): Includes more context sequence, suitable for more dispersed regulatory elements Note: Default is 200bp.
`-mask`	Mask Repeats	Enabled	Quality Control Parameter. Masks repeat sequences to prevent them from interfering with motif discovery. Suggestion: Always use this option unless analyzing repeat regions specifically.
`-p`	Parallel Threads	8	Performance Parameter. Number of threads used for parallel computing. • Suggested: Set according to server configuration, typically 4-16 • Note: Too many threads may lead to excessive memory usage

Best Practice Suggestions:

Selection of -len parameter: It is recommended to use the combination 6,7,8,9,10,11,12 to cover motif patterns of different lengths; if analysis time is limited, the default recommendation 8,10,12 can be used.
Selection of -size parameter: In most cases, the default value of 200bp is sufficient. If the DMR region length is small (< 500bp), this parameter can be appropriately reduced; if it is desired to capture a larger range of potential regulatory sequences, this value can be increased to 300bp.
Environment Configuration: Ensure that HOMER's bin directory is added to the PATH environment variable (or use the full path in the command), and ensure that the reference genome FASTA file is consistent with the genome version used in the input BED file.

HOMER Analysis Execution

The basic workflow of HOMER analysis includes: parameter configuration, environment setup, command construction, and execution.

## homer_bin: Full path to HOMER findMotifsGenome.pl script
homer_bin <- "/jp_envs/envs/common/bin/findMotifsGenome.pl"

## dmr_bed_file: Input DMR BED file path
dmr_bed_file <- "./DMRs/DMRs_significant_2.bed"

## genome_fa: Reference genome fasta file path
# Note: Must be consistent with the reference genome version used in input BED file (e.g., hg38)
genome_fa <- "./genome.fa"

## output_dir: Result output directory
# HOMER will automatically create this directory
output_dir <- "DMRs_HOMER/"

# Check number of regions in BED file
# If > 500 regions, use only the first 500
bed_lines <- readLines(dmr_bed_file, warn = FALSE)
# Filter empty lines and comment lines (starting with #)
bed_lines <- bed_lines[bed_lines != "" & !grepl("^#", bed_lines)]
num_regions <- length(bed_lines)

# Used to store the actual BED file path (original or temporary)
actual_bed_file <- dmr_bed_file
temp_bed_file <- NULL

if (num_regions > 500) {
  cat("Detected", num_regions, "regions in BED file, exceeding 500\n")
  cat("Only the first 500 regions will be used for analysis\n")
  
  # Create temporary file containing only the first 500 lines
  temp_bed_file <- tempfile(fileext = ".bed")
  writeLines(bed_lines[1:500], temp_bed_file)
  actual_bed_file <- temp_bed_file
  cat("Created temporary BED file:", temp_bed_file, "\n")
} else {
  cat("BED file contains", num_regions, "regions, all will be used\n")
}

# Set PATH environment variable to ensure HOMER dependent scripts can be found
# HOMER bin directory contains all perl scripts (bed2pos.pl, homerTools, etc.)
homer_bin_dir <- dirname(homer_bin)
current_path <- Sys.getenv("PATH")
Sys.setenv(PATH = paste(homer_bin_dir, current_path, sep = ":"))

# Set LD_LIBRARY_PATH environment variable to ensure dynamic libraries (e.g., libcrypt.so.2) can be found
# This is key to solving "error while loading shared libraries"
homer_lib_dir <- file.path(dirname(homer_bin_dir), "lib")
current_ld_path <- Sys.getenv("LD_LIBRARY_PATH")
if (current_ld_path == "") {
  Sys.setenv(LD_LIBRARY_PATH = homer_lib_dir)
} else {
  # Ensure homer_lib_dir is at the beginning of LD_LIBRARY_PATH for priority lookup
  Sys.setenv(LD_LIBRARY_PATH = paste(homer_lib_dir, current_ld_path, sep = ":"))
}
cat("LD_LIBRARY_PATH:", Sys.getenv("LD_LIBRARY_PATH"), "\n")


# Basic command format for findMotifsGenome.pl:
# findMotifsGenome.pl <DMR bed file> <genome> <output dir> [options]
args <- c(
  actual_bed_file,     # DMR BED file (original or temporary, only first 500 regions)
  genome_fa,           # Reference genome fasta file
  output_dir,          # Output directory
  "-len", "6,7,8,9,10,11,12",  # Motif Length: Analyze 6-12bp motifs
  "-mask",             # Mask repeat sequences
  "-p", "8"            # Use 8 threads for parallel calculation
)

# Use system2 in Jupyter to capture output for debugging
cat("Starting HOMER analysis...\n")
cat("HOMER bin directory:", homer_bin_dir, "\n")
cat("Command:", homer_bin, paste(args, collapse = " "), "\n\n")
flush.console()  # Force flush output buffer

# Use stdout=TRUE and stderr=TRUE to capture output
result <- system2(
  command = homer_bin,
  args = args,
  wait = TRUE,
  stdout = TRUE,
  stderr = TRUE
)

# If temporary BED file was used, delete it
if (!is.null(temp_bed_file) && file.exists(temp_bed_file)) {
  unlink(temp_bed_file)
  cat("Cleaned up temporary BED file\n")
}

output

Detected 30152 regions in BED file, exceeding 500
Only the first 500 regions will be used for analysis
Created temporary BED file: /tmp/Rtmp8aN0VY/file42f59fbc202.bed
LD_LIBRARY_PATH: /jp_envs/envs/common/lib
Starting HOMER analysis...n HOMER bin directory: /jp_envs/envs/common/bin
Command: /jp_envs/envs/common/bin/findMotifsGenome.pl /tmp/Rtmp8aN0VY/file42f59fbc202.bed ./genome.fa DMRs_HOMER/ -len 6,7,8,9,10,11,12 -mask -p 8

Cleaned up temporary BED file

Result Interpretation

HOMER generates multiple files in the output directory, mainly divided into De novo motif finding results and Known motif enrichment results.

De novo Motif Finding Results

homerResults.html and homerResults/ directory: Formatted web report for De novo motif discovery, which is the main file to view. It contains visualization logos, statistics (p-value, enrichment fold, etc.), sequence patterns for each motif, and detailed motif<length>.motif files, facilitating quick browsing and filtering in a browser.
homerMotifs.motifs<length>: Result files classified by motif length (e.g., homerMotifs.motifs6 for 6bp). Contains sequence patterns and statistics for all motifs found at that length, used to view analysis results for specific lengths.
homerMotifs.all.motifs: Merged file for all motif lengths. Summarizes content of all homerMotifs.motifs<length> files, used to view the collection of all motifs found in this analysis.

Known Motif Enrichment Results

knownResults.txt: Text result for known motif enrichment analysis (TSV format, open with Excel), which is the main file to view. It lists Motif Name, Consensus, enrichment statistics (P-value, Log P-value, q-value), and distribution in target/background sequences (Target/Background Sequences with Motif count # and percentage %), used to identify enriched transcription factors.
knownResults.html and knownResults/ directory: Formatted web report (HTML) and accompanying directory for known motif enrichment. Contains visualization logos, statistics tables, enrichment folds, and detailed known<length>.motif files, providing an intuitive interactive result viewing method.

scMethyl + RNA Multi-omics: Motif Enrichment Analysis (Based on HOMER) ​

Module Introduction ​

Input File Preparation ​

Input File Format ​

File Structure Example ​

HOMER Analysis Workflow ​

Key Parameter Settings ​

HOMER Analysis Execution ​

Result Interpretation ​

De novo Motif Finding Results ​

Known Motif Enrichment Results ​

References ​