Usage Instructions for Large Chromosome Splitting Script split_ref.py
Function Overview
split_ref.py is a tool for processing genome data. It splits overly long chromosome sequences into smaller fragments and adjusts the coordinates in the GTF annotation file to match. The tool automatically detects overly long chromosomes in the FASTA file, chooses splitting points from the gene positions in the GTF file so that no gene is split, and then generates new FASTA and GTF files.
NOTE
This tool is mainly intended for genomes of species other than human and mouse that contain very large chromosomes, where overly long chromosomes can cause problems in downstream analysis.
Dependencies
- Python 3.x
- Required Python libraries:
  - click
  - intervaltree
  - loguru
Dependencies can be installed with the following command:
pip install click intervaltree loguru
Usage
Basic Command Format
python split_ref.py -f <input_fasta> -g <input_gtf> -o <output_directory>
Parameter Description
Parameter | Full Name | Type | Required | Description |
---|---|---|---|---|
-f | --fa | File Path | Yes | Input FASTA format file path |
-g | --gtf | File Path | Yes | Input GTF format annotation file path |
-o | --outdir | Directory Path | Yes | Output directory path for saving processed files |
Workflow
- Detect Overly Long Chromosomes: Scan the FASTA file to identify chromosomes longer than the threshold (default 2^29 bp, i.e. 536,870,912 bp)
- Find Splitting Points: Determine suitable splitting points based on gene position information in GTF files, avoiding splitting genes
- Generate New FASTA: Split overly long chromosomes into multiple sub-fragments and generate new FASTA files
- Generate New GTF: Adjust chromosome names and coordinates in annotation information to generate GTF files matching the new FASTA
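The splitting-point step can be illustrated with a minimal sketch. This is not the script's actual implementation: the function names `merge_genes` and `find_split_points` are hypothetical, and the sketch uses a plain sorted list where the script presumably relies on intervaltree. It assumes gene spans are 1-based and inclusive, as in GTF, and that a cut at offset `c` ends one fragment at base `c` and starts the next at base `c + 1`.

```python
THRESHOLD = 2 ** 29  # 536,870,912 bp, the default threshold quoted above


def merge_genes(genes):
    """Merge overlapping (start, end) gene spans into disjoint spans."""
    merged = []
    for start, end in sorted(genes):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def find_split_points(chrom_len, genes, max_len=THRESHOLD):
    """Return cut offsets so that no fragment exceeds max_len and no gene is cut.

    Hypothetical helper: the real script may search for cut points differently.
    """
    merged = merge_genes(genes)
    cuts = []
    frag_start = 0
    while chrom_len - frag_start > max_len:
        # Start from the longest allowed fragment.
        cut = frag_start + max_len
        for start, end in merged:
            # A gene is split when the cut falls between its first and last base.
            if start <= cut < end:
                cut = start - 1  # move the cut to just before the gene
                break
        if cut <= frag_start:
            raise ValueError("no gene-free splitting point within max_len")
        cuts.append(cut)
        frag_start = cut
    return cuts


# Toy example with a small max_len so the numbers are easy to follow.
print(find_split_points(100, [(35, 45), (70, 85)], max_len=40))  # → [34, 69]
```

In the toy example the first candidate cut (40) lands inside the gene at 35-45, so it is moved back to 34; the second candidate (74) lands inside the gene at 70-85 and is moved back to 69.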
Usage Example
python split_ref.py -f genome.fa -g annotation.gtf -o split_output
The above command will:
- Process the genome.fa and annotation.gtf files
- Create the split_output directory in the current directory
- Write the split FASTA and GTF files to the split_output directory
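Before running the command, it can be useful to check whether any chromosome actually exceeds the threshold. The following is a minimal standalone sketch, independent of split_ref.py; the function names `chromosome_lengths` and `overly_long` are hypothetical, and it assumes an uncompressed FASTA file in which the sequence name is the first word of each header line.

```python
def chromosome_lengths(fasta_path):
    """Return {sequence_name: length} for an uncompressed FASTA file."""
    lengths = {}
    name = None
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Use the first word of the header as the sequence name.
                header = line[1:].split()
                name = header[0] if header else ""
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths


def overly_long(lengths, threshold=2 ** 29):
    """Keep only the sequences longer than the threshold (default 2^29 bp)."""
    return {name: n for name, n in lengths.items() if n > threshold}
```

For example, `overly_long(chromosome_lengths("genome.fa"))` returns an empty dictionary when no chromosome is longer than 2^29 bp, in which case splitting should not be needed.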
Output File Description
After processing, the output directory will contain:
- New FASTA file with the same name as the input FASTA: containing split chromosome sequences
- New GTF file with the same name as the input GTF: containing adjusted annotation information
Notes
- The output directory must not be the directory that contains the input files, because the output files use the same names as the inputs
- Ensure the input FASTA and GTF files are correctly formatted
- Split chromosomes will be named with the original chromosome name plus "_sN" suffix (N is the split sequence number)
- The program will automatically skip comment lines in GTF files (starting with #)
- If no suitable splitting point can be found (after 10000 consecutive failed attempts), the program raises an exception
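The renaming and coordinate adjustment described in these notes can be sketched as follows. This is an illustration under stated assumptions, not the script's code: `remap_gtf_line` is a hypothetical helper, the fragment number N is assumed to start at 1, and a cut at offset `c` is assumed to end one fragment at base `c` (1-based) and start the next at base `c + 1`.

```python
import bisect


def remap_gtf_line(line, cuts_by_chrom):
    """Rewrite one GTF line for a split chromosome.

    cuts_by_chrom maps a chromosome name to its sorted list of cut offsets.
    Comment lines and lines on unsplit chromosomes are returned unchanged.
    """
    if line.startswith("#"):
        return line
    fields = line.rstrip("\n").split("\t")
    chrom, start, end = fields[0], int(fields[3]), int(fields[4])
    cuts = cuts_by_chrom.get(chrom)
    if not cuts:
        return line
    # Number of cuts lying strictly before the feature start = fragment index.
    idx = bisect.bisect_left(cuts, start)
    offset = cuts[idx - 1] if idx > 0 else 0
    fields[0] = f"{chrom}_s{idx + 1}"  # assumption: N is counted from 1
    fields[3] = str(start - offset)
    fields[4] = str(end - offset)
    return "\t".join(fields) + "\n"


cuts = {"chr1": [34, 69]}
gtf_line = 'chr1\tsrc\tgene\t70\t85\t.\t+\t.\tgene_id "g2";\n'
print(remap_gtf_line(gtf_line, cuts), end="")
# chr1_s3	src	gene	1	16	.	+	.	gene_id "g2";
```

Because splitting points are chosen outside genes, a gene's start and end fall in the same fragment, so subtracting a single offset from both coordinates is sufficient in this sketch.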
WARNING
Back up the original data before running the script, as splitting operations are irreversible.
Log Information
The program uses the loguru library to log its progress, including:
- Information about detected overly long chromosomes
- Determined splitting point positions
- Output file save paths
TIP
Make sure sufficient disk space and memory are available when processing large genome files.