VDJ Reference Preparation (SeekSoul Tools v1.3.0)
IMPORTANT
SeekSoul Tools v1.3.0 uses TRUST4 for analysis and requires preparing the corresponding reference files.
The reference sequences are downloaded from IMGT. Because IMGT aggregates sequences from published literature and other databases and does not record genome version information, the exact genome build behind a species' sequences cannot be determined directly.
Step 1: Download the pan-species immune database and extract the target species FASTA
wget -c https://www.imgt.org/download/GENE-DB/IMGTGENEDB-ReferenceSequences.fasta-nt-WithGaps-F+ORF+inframeP
Extract the reference sequences for the target species by matching its Latin species name in the FASTA headers.
NOTE
File format description (FASTA header fields, separated by "|", in this order):
- IMGT/LIGM-DB accession number(s)
- IMGT gene and allele name
- species (may be followed by an "_" and the name of the strain, breed or isolate, if defined)
- IMGT gene and allele functionality
- exon(s), region name(s), or extracted label(s)
- start and end positions in the IMGT/LIGM-DB accession number(s)
- number of nucleotides in the IMGT/LIGM-DB accession number(s)
- codon start, or 'NR' (not relevant) for non coding labels
- +n: number of nucleotides (nt) added in 5' compared to the corresponding label extracted from IMGT/LIGM-DB
- +n or -n: number of nucleotides (nt) added or removed in 3' compared to the corresponding label extracted from IMGT/LIGM-DB
- +n, -n, and/or nS: number of added, deleted, and/or substituted nucleotides to correct sequencing errors, or 'not corrected' if non corrected sequencing errors
- number of amino acids (AA): this field indicates that the sequence is in amino acids
- number of characters in the sequence: nt (or AA)+IMGT gaps=total
- partial (if it is)
- reverse complementary (if it is)
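As an illustration of the header layout listed above, here is a minimal Python sketch that splits one header into named fields. The field names are labels chosen for this sketch (they are not IMGT terms), and the example header uses a made-up accession number.

```python
# Labels for the 15 header fields, in the order listed above.
# These names are chosen for this sketch only.
IMGT_HEADER_FIELDS = [
    "accession", "gene_allele", "species", "functionality", "region",
    "positions", "nt_length", "codon_start", "nt_added_5prime",
    "nt_changed_3prime", "corrections", "aa_length", "char_count",
    "partial", "reverse_complement",
]


def parse_imgt_header(header: str) -> dict:
    """Split one IMGT FASTA header line into a dict of named fields."""
    parts = header.strip().lstrip(">").split("|")
    # A trailing "|" leaves one empty extra element; zip() ignores
    # anything beyond the 15 named fields.
    return {name: value.strip() for name, value in zip(IMGT_HEADER_FIELDS, parts)}


# Illustrative header only; "X00000" is not a real accession number.
example = (
    ">X00000|IGHV1-1*01|Oryctolagus cuniculus|F|V-REGION|1..300|300 nt|1"
    "| | | | |300+24=324| | |"
)
fields = parse_imgt_header(example)
print(fields["gene_allele"], fields["species"], fields["functionality"])
```

Empty fields (for example "partial" when the sequence is complete) come back as empty strings.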
Reference script example:
Extract FASTA sequences for rabbit:
python prepare.py IMGTGENEDB-ReferenceSequences.fasta-nt-WithGaps-F+ORF+inframeP "Oryctolagus cuniculus" -o Oryctolagus_cuniculus.raw.fasta
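The contents of prepare.py are not shown here. As a rough idea of what species extraction involves, the following is a hypothetical stand-in (not the actual script) that keeps records whose third header field matches the requested species. The function names and the demo records are assumptions made for this sketch.

```python
import io
from typing import Iterable, Iterator, Tuple


def read_fasta(handle: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; the header is returned without '>'."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def extract_species(handle: Iterable[str], species: str) -> Iterator[Tuple[str, str]]:
    """Keep records whose species field (third '|' field) matches `species`."""
    wanted = species.lower()
    for header, seq in read_fasta(handle):
        fields = header.split("|")
        if len(fields) < 3:
            continue
        # The species may be followed by "_" and a strain, breed, or isolate.
        name = fields[2].split("_", 1)[0].strip().lower()
        if name == wanted:
            yield header, seq


# Made-up demo records; accessions and sequences are illustrative only.
demo = io.StringIO(
    ">X00000|IGHV1-1*01|Oryctolagus cuniculus|F|V-REGION|\n"
    "cagtcg...gtg\n"
    ">X00001|IGHV1-2*01|Homo sapiens|F|V-REGION|\n"
    "caggtg\n"
)
records = list(extract_species(demo, "Oryctolagus cuniculus"))
print(len(records))
```

Matching on the species field rather than searching the whole header avoids accidental hits in other fields.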
Normalize the extracted FASTA file:
python imgt_other_mkref.py --rawfa Oryctolagus_cuniculus.raw.fasta --outfa Oryctolagus_cuniculus.fasta
Step 2: Format reference sequences according to TRUST4 requirements
TRUST4 repository: https://github.com/liulab-dfci/TRUST4
IMPORTANT
Basic requirements for the reference:
- The ref file (e.g., human_IMGT+C.fa) must contain sequences for V, D, J, and C genes.
- For FASTQ inputs, the ref file only needs IMGT-format V, D, J, and C gene sequences; genomic coordinates are not required.
Each sequence header takes the form:
>V_gene_name|other_info
Ensure each sequence ID is unique. It is recommended to retain the original IMGT ID information.
Common practice: use a script to standardize IDs to >TRAV1-1*01, >IGHV1-1*01, etc. (TR/IG + gene name + allele). Original annotations can be kept after the ID for traceability.
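The ID standardization described above could be sketched as follows. This is a minimal illustration, not the script used in this guide: the function names are hypothetical, and separating the ID from the original annotation with a space is an assumption of this sketch.

```python
from collections import Counter
from typing import Iterable, List


def standardize_header(imgt_header: str) -> str:
    """Return 'GENE*ALLELE original_header' (without the leading '>')."""
    original = imgt_header.strip().lstrip(">")
    fields = original.split("|")
    # The gene and allele name is the second field of an IMGT header.
    if len(fields) < 2 or not fields[1].strip():
        raise ValueError(f"no gene/allele field in header: {imgt_header!r}")
    # Assumption: a space separates the new ID from the kept annotation.
    return f"{fields[1].strip()} {original}"


def find_duplicate_ids(headers: Iterable[str]) -> List[str]:
    """List standardized IDs that occur more than once."""
    counts = Counter(standardize_header(h).split(" ", 1)[0] for h in headers)
    return sorted(gene_id for gene_id, n in counts.items() if n > 1)


# Made-up headers for illustration.
headers = [
    ">X00000|IGHV1-1*01|Oryctolagus cuniculus|F|V-REGION|",
    ">X00002|IGHM*01|Oryctolagus cuniculus|F|CH1|",
    ">X00002|IGHM*01|Oryctolagus cuniculus|F|CH2|",
]
print(standardize_header(headers[0]))
print(find_duplicate_ids(headers))
```

Duplicates can arise when one gene is split across several records (for example, one record per exon of a C gene), so they need to be resolved before the file is used as a reference.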
Further process the FASTA to meet TRUST4 requirements:
python process_regions.py --input_file Oryctolagus_cuniculus.fasta --output_file Oryctolagus_cuniculus_input.fasta
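What process_regions.py does internally is not shown here. One cleanup that is typically needed at some point, because the "WithGaps" download pads sequences with "." characters, is removing those gap characters before the sequences are used for alignment. The sketch below shows that single step together with a simple character check. The function names and the allowed alphabet are assumptions of this sketch, and the step may already be covered by the scripts above.

```python
from typing import List


def strip_imgt_gaps(seq: str) -> str:
    """Remove IMGT gap characters ('.') and convert the sequence to uppercase."""
    return seq.replace(".", "").upper()


def unexpected_characters(seq: str, allowed: str = "ACGTN") -> List[str]:
    """Return the characters in `seq` that are not in the allowed alphabet."""
    return sorted(set(seq) - set(allowed))


# Made-up gapped sequence for illustration.
gapped = "cagtcg...gtggag......tcc"
clean = strip_imgt_gaps(gapped)
print(clean)
print(unexpected_characters(clean))
```

Running the character check after cleanup is a quick way to spot leftover gap symbols or other residue before handing the file to TRUST4.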
Step 3: Prepare leader sequences
Go to the IMGT leader sets page (https://www.imgt.org/vquest/refseqh.html#LEADER-sets) and download the "L-PART1+L-PART2" set for the target species.
WARNING
Notes:
- IG and TR are provided as two separate files; download each one
- For IG: download IGH, IGK, and IGL separately, then merge them into one file
- For TR: download TRA and TRB separately, then merge them into one file
Convert bases to uppercase:
awk '/^>/ {print; next} {printf "%s", toupper($0)}' IG_L-PART1+L-PART2.fa | awk 'BEGIN{RS=">"; FS="\n"} NR>1 {print ">" $1 "\n" $2}' > IG_L-PART1_L-PART2.fa
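For readers who prefer Python over awk, the same conversion (uppercase bases, each sequence joined onto a single line) can be sketched as below. The function name is hypothetical; the file names in the usage note are the ones from the command above.

```python
from typing import Iterable, Iterator


def uppercase_single_line(lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines unchanged and each sequence as one uppercase line."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header
                yield "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield header
        yield "".join(chunks)


# Made-up records for illustration.
demo = [">leader1|IGHV1-1*01", "atggag", "actggg", ">leader2|IGHV1-2*01", "atggcc"]
for out_line in uppercase_single_line(demo):
    print(out_line)
```

To apply it to a file, open IG_L-PART1+L-PART2.fa for reading, pass the file handle to uppercase_single_line, and write each yielded line followed by a newline to IG_L-PART1_L-PART2.fa.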
Step 4: Configure the cfg file for special species
fa:/1.3/ref/Oryctolagus_cuniculus_input.fasta
ref:/1.3/ref/Oryctolagus_cuniculus_input.fasta
leader:/IG_L-PART1_L-PART2.fa
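Since a wrong path in the cfg file is an easy mistake to make, a small check can help. The sketch below assumes only what the example above shows, namely one "key:path" entry per line. The function names are hypothetical and this is not part of SeekSoul Tools.

```python
import os
from typing import Dict, Iterable, List, Sequence


def parse_cfg(lines: Iterable[str]) -> Dict[str, str]:
    """Parse 'key:path' lines into a dict, splitting on the first ':' only."""
    entries = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"expected 'key:path', got: {line!r}")
        entries[key.strip()] = value.strip()
    return entries


def missing_paths(entries: Dict[str, str],
                  keys: Sequence[str] = ("fa", "ref", "leader")) -> List[str]:
    """Return the keys that are absent or whose path is not an existing file."""
    return [k for k in keys if k not in entries or not os.path.isfile(entries[k])]


cfg_lines = [
    "fa:/1.3/ref/Oryctolagus_cuniculus_input.fasta",
    "ref:/1.3/ref/Oryctolagus_cuniculus_input.fasta",
    "leader:/IG_L-PART1_L-PART2.fa",
]
entries = parse_cfg(cfg_lines)
print(entries)
print("missing or unreadable:", missing_paths(entries))
```

An empty list from missing_paths means every configured file exists on the current machine.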
Step 5: Run VDJ analysis
TIP
Before running VDJ analysis, ensure all reference file paths are correct and file formats meet TRUST4 requirements.
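One way to check that the reference contains V, D, J, and C genes is to count sequences per segment type. The sketch below classifies standardized IDs such as IGHV1-1*01 by the fourth character of the gene name. This is a naming heuristic assumed for this sketch, not a TRUST4 rule, and unusual gene names may be misclassified.

```python
from collections import Counter
from typing import Iterable


def segment_type(gene_id: str) -> str:
    """Guess the segment type (V, D, J, or C) from a name like 'IGHV1-1*01'."""
    gene = gene_id.split("*", 1)[0]
    if len(gene) < 4:
        return "unknown"
    kind = gene[3]
    if kind in ("V", "J"):
        return kind
    # D genes carry a number (e.g. IGHD1-1); a bare 'IGHD' is the constant gene.
    if kind == "D" and len(gene) > 4 and gene[4].isdigit():
        return "D"
    return "C"


def count_segment_types(gene_ids: Iterable[str]) -> Counter:
    """Count how many IDs fall into each segment type."""
    return Counter(segment_type(g) for g in gene_ids)


# Made-up ID list for illustration.
ids = ["IGHV1-1*01", "IGHD1-1*01", "IGHJ1*01", "IGHM*01", "IGKC1*01", "IGHD*01"]
counts = count_segment_types(ids)
print(dict(counts))
```

A type with a count of zero is worth investigating before the run. Note that light chains (IGK, IGL) have no D genes, so a missing D count is only a concern for chains that use D segments.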
#export PATH=seeksoultools1.3.0/seeksoultools.130/external/conda:$PATH
seeksoultools1.3.0/seeksoultools.130/seeksoultools vdj run \
--samplename ZYB_Y2_BCR \
--fq1 demo_BCR_S14_L004_R1_001.fastq.gz \
--fq2 demo_BCR_S14_L004_R2_001.fastq.gz \
--chain IG \
--keep_tmp \
--outdir Output/ \
--core 16 \
--cfg cfg.txt