Usage Instructions for Large Chromosome Splitting Script split_ref.py

Author: SeekGene

Updated: 2025-08-08

Function Overview

split_ref.py is a genome-processing tool that splits overly long chromosome sequences into smaller fragments and adjusts the coordinates in the GTF annotation file to match. It automatically detects overly long chromosomes in the FASTA file, chooses splitting points from the gene positions in the GTF file so that no gene is split, and then writes new FASTA and GTF files.

NOTE

This tool is intended mainly for species other than human and mouse whose genomes contain very large chromosomes. Splitting them avoids downstream analysis problems caused by overly long chromosomes.

Dependencies

Python 3.x
Required Python libraries:
- click
- intervaltree
- loguru

Dependencies can be installed with the following command:

bash

pip install click intervaltree loguru
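
To confirm that the three libraries are importable before running the script, a small check along the following lines can help. This is only a sketch using the Python standard library; the helper name missing_modules is hypothetical and is not part of split_ref.py.

```python
from importlib.util import find_spec

# Libraries listed in the Dependencies section above.
REQUIRED_MODULES = ["click", "intervaltree", "loguru"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing libraries:", ", ".join(missing))
    else:
        print("All required libraries are installed.")
```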

Usage

Basic Command Format

bash

python split_ref.py -f <input_fasta> -g <input_gtf> -o <output_directory>

Parameter Description

Parameter	Full Name	Type	Required	Description
-f	--fa	File Path	Yes	Input FASTA format file path
-g	--gtf	File Path	Yes	Input GTF format annotation file path
-o	--outdir	Directory Path	Yes	Output directory path for saving processed files

Workflow

Detect Overly Long Chromosomes: Scan the FASTA file to identify chromosomes longer than the threshold (default 2^29 bp, i.e. 536,870,912 bp)
Find Splitting Points: Determine suitable splitting points based on gene position information in GTF files, avoiding splitting genes
Generate New FASTA: Split overly long chromosomes into multiple sub-fragments and generate new FASTA files
Generate New GTF: Adjust chromosome names and coordinates in annotation information to generate GTF files matching the new FASTA
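
The first two workflow steps can be illustrated with a minimal sketch. It is not the code of split_ref.py: the function names are hypothetical, genes are assumed to be given as 1-based inclusive (start, end) pairs, a plain list scan stands in for the intervaltree library, and the search strategy (start at the threshold position and step back to just before any gene that would be cut) is an assumption. The limit of 10000 attempts mirrors the figure mentioned in the Notes, but how the script counts attempts is not documented here.

```python
MAX_LEN = 2 ** 29  # default threshold stated in the text (536,870,912 bp)


def read_lengths(fasta_lines):
    """Return {chromosome_name: sequence_length} from an iterable of FASTA lines."""
    lengths = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # header up to the first whitespace
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def long_chromosomes(lengths, max_len=MAX_LEN):
    """Names of chromosomes longer than the threshold."""
    return [name for name, n in lengths.items() if n > max_len]


def find_split_points(length, genes, max_len=MAX_LEN, max_attempts=10000):
    """Choose cut positions so that no piece exceeds max_len and no gene is cut.

    A cut at position p means one piece ends at base p and the next piece
    starts at base p + 1 (assumed convention).
    """
    cuts = []
    offset = 0  # last base of the previous piece
    while length - offset > max_len:
        pos = offset + max_len
        for _ in range(max_attempts):
            # Genes that a cut after base `pos` would split in two.
            overlapping = [start for start, end in genes if start <= pos < end]
            if not overlapping:
                break
            pos = min(overlapping) - 1  # step back to just before the gene
        else:
            raise RuntimeError("no suitable splitting point found")
        if pos <= offset:
            raise RuntimeError("no suitable splitting point found")
        cuts.append(pos)
        offset = pos
    return cuts
```

With a toy threshold of 40 bp, find_split_points(100, [(35, 45), (70, 90)], max_len=40) returns [34, 69]: the pieces cover bases 1-34, 35-69, and 70-100, and both genes stay intact.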

Usage Example

bash

python split_ref.py -f genome.fa -g annotation.gtf -o split_output

The above command will:

Process genome.fa and annotation.gtf files
Create split_output directory in the current directory
Generate split FASTA and GTF files in the split_output directory

Output File Description

After processing, the output directory will contain:

New FASTA file with the same name as the input FASTA: containing split chromosome sequences
New GTF file with the same name as the input GTF: containing adjusted annotation information
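
Because the output files keep the input file names, writing them into the directory that holds the inputs would overwrite the originals, which is presumably why the Notes forbid that. The sketch below shows one way such output paths and the directory check could be derived. The function plan_outputs is hypothetical and only illustrates the rule; it is not taken from split_ref.py.

```python
from pathlib import Path


def plan_outputs(fasta, gtf, outdir):
    """Return (output_fasta, output_gtf), refusing an outdir that holds an input file."""
    fasta, gtf, outdir = Path(fasta), Path(gtf), Path(outdir)
    for src in (fasta, gtf):
        # Compare absolute paths so "." and relative paths are handled as well.
        if src.resolve().parent == outdir.resolve():
            raise ValueError(f"{outdir} contains the input file {src.name}")
    # Output files reuse the input file names inside the output directory.
    return outdir / fasta.name, outdir / gtf.name
```

For the usage example above, plan_outputs("genome.fa", "annotation.gtf", "split_output") yields split_output/genome.fa and split_output/annotation.gtf, while passing "." as the output directory raises ValueError.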

Notes

The output directory cannot be the same as the directory containing the input files
Ensure the input FASTA and GTF files are in correct format
Split chromosomes will be named with the original chromosome name plus "_sN" suffix (N is the split sequence number)
The program will automatically skip comment lines in GTF files (starting with #)
If no suitable splitting point can be found (the search fails after 10000 consecutive attempts), the program will raise an exception
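
The "_sN" naming rule and the matching coordinate adjustment can be sketched as follows. Several details are assumptions made for illustration only: N is taken to start at 1, coordinates in each sub-fragment are taken to restart at 1, chromosomes that were not split keep their original name, and comment lines are simply dropped (the script may handle them differently). The helper names remap and remap_gtf_line are hypothetical.

```python
from bisect import bisect_left


def remap(chrom, start, end, cuts):
    """Map a feature to (new_name, new_start, new_end).

    `cuts` is a sorted list of cut positions on `chrom`; a cut at p means one
    piece ends at base p and the next piece starts at base p + 1. The feature
    is assumed to lie entirely inside one piece.
    """
    if not cuts:
        return chrom, start, end  # chromosome was not split
    idx = bisect_left(cuts, start)  # number of cuts lying before `start`
    offset = cuts[idx - 1] if idx > 0 else 0
    return f"{chrom}_s{idx + 1}", start - offset, end - offset


def remap_gtf_line(line, cuts_by_chrom):
    """Return the adjusted GTF line, or None for comment lines starting with #."""
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    # GTF columns: 0 = sequence name, 3 = start, 4 = end (1-based, inclusive).
    cuts = cuts_by_chrom.get(fields[0], [])
    name, start, end = remap(fields[0], int(fields[3]), int(fields[4]), cuts)
    fields[0], fields[3], fields[4] = name, str(start), str(end)
    return "\t".join(fields)
```

Under these assumptions, with cuts at positions 34 and 69 on chr1, a gene at 35-45 becomes chr1_s2 at 1-11.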

WARNING

Back up the original data before running the script, as the splitting operation is irreversible.

Log Information

The program uses the loguru library to log its progress, including:

Information about detected overly long chromosomes
Determined splitting point positions
Output file save paths

TIP

Make sure sufficient disk space and memory are available when processing large genome files.
