VDJ Reference Preparation (SeekSoul Tools v1.3.0)

Author: Xueling Liu, Ruifeng Gao
Updated: 2025-08-13

IMPORTANT

SeekSoul Tools v1.3.0 uses TRUST4 for analysis and requires preparing the corresponding reference files.

Reference sequences are downloaded from IMGT. The IMGT database aggregates sequences from published literature and other databases and does not record genome version information, so the exact genome build behind a species' sequences cannot be determined directly.

Step 1: Download the pan-species immune database and extract the target species FASTA

text
wget -c https://www.imgt.org/download/GENE-DB/IMGTGENEDB-ReferenceSequences.fasta-nt-WithGaps-F+ORF+inframeP

Extract the reference sequences for the target species by matching its Latin name in the FASTA headers.

NOTE

FASTA header format (15 fields separated by "|"):

  1. IMGT/LIGM-DB accession number(s)
  2. IMGT gene and allele name
  3. species (may be followed by an "_" and the name of the strain, breed or isolate, if defined)
  4. IMGT gene and allele functionality
  5. exon(s), region name(s), or extracted label(s)
  6. start and end positions in the IMGT/LIGM-DB accession number(s)
  7. number of nucleotides in the IMGT/LIGM-DB accession number(s)
  8. codon start, or 'NR' (not relevant) for non coding labels
  9. +n: number of nucleotides (nt) added in 5' compared to the corresponding label extracted from IMGT/LIGM-DB
  10. +n or -n: number of nucleotides (nt) added or removed in 3' compared to the corresponding label extracted from IMGT/LIGM-DB
  11. +n, -n, and/or nS: number of added, deleted, and/or substituted nucleotides to correct sequencing errors, or 'not corrected' if non corrected sequencing errors
  12. number of amino acids (AA): this field indicates that the sequence is in amino acids
  13. number of characters in the sequence: nt (or AA)+IMGT gaps=total
  14. partial (if it is)
  15. reverse complementary (if it is)
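
To make the field list above concrete, here is a minimal sketch of splitting one header into named fields. The key names in `IMGT_FIELDS` and the function name `parse_imgt_header` are labels chosen for this sketch, not part of IMGT or SeekSoul Tools, and the sample header in the usage note is illustrative.

```python
from typing import Dict

# One key per item in the 15-field list above.
# The key names are this sketch's own labels (an assumption), not IMGT terms.
IMGT_FIELDS = [
    "accession", "gene_allele", "species", "functionality", "label",
    "positions", "nt_count", "codon_start", "nt_added_5prime",
    "nt_changed_3prime", "corrections", "aa_count", "char_count",
    "partial", "reverse_complement",
]


def parse_imgt_header(header: str) -> Dict[str, str]:
    """Split a '|'-separated IMGT FASTA header into named fields.

    Empty or blank fields come back as empty strings. Anything after
    the 15th field (such as the trailing '|') is ignored.
    """
    parts = [p.strip() for p in header.strip().lstrip(">").split("|")]
    return dict(zip(IMGT_FIELDS, parts))
```

For example, `parse_imgt_header(">M99641|IGHV1-18*01|Homo sapiens|F|V-REGION|188..483|296 nt|1| | | | |296+24=320| | |")["gene_allele"]` gives the gene and allele name, and the `species` key is the field used for species extraction in the next step.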

Reference script example:

Extract FASTA sequences for rabbit:

text
python prepare.py IMGTGENEDB-ReferenceSequences.fasta-nt-WithGaps-F+ORF+inframeP "Oryctolagus cuniculus" -o Oryctolagus_cuniculus.raw.fasta
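
The contents of prepare.py are not reproduced here. As a rough idea of what species extraction involves, the following sketch filters records by the third header field. The function names `read_fasta` and `filter_species` are hypothetical, and the sketch assumes the "|"-separated header layout described in the NOTE above; it is not the actual script.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(handle: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header_without_'>', sequence) pairs from an open FASTA handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_species(handle: Iterable[str], species: str) -> Iterator[Tuple[str, str]]:
    """Keep records whose species field (field 3) matches the Latin name.

    The species field may carry a strain after '_' (field 3 in the list
    above), so only the part before the first '_' is compared.
    """
    wanted = species.strip().lower()
    for header, seq in read_fasta(handle):
        fields = header.split("|")
        if len(fields) < 3:
            continue  # not an IMGT-style header; skip it
        name = fields[2].split("_")[0].strip().lower()
        if name == wanted:
            yield header, seq
```

A caller would open the downloaded file, pass the handle and "Oryctolagus cuniculus" to `filter_species`, and write each returned pair back out as `>header` followed by the sequence.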

Normalize the extracted FASTA file:

text
python imgt_other_mkref.py --rawfa Oryctolagus_cuniculus.raw.fasta --outfa Oryctolagus_cuniculus.fasta

Step 2: Format reference sequences according to TRUST4 requirements

https://github.com/liulab-dfci/TRUST4

IMPORTANT

Basic requirements for the reference:

  • The ref file (e.g., human_IMGT+C.fa) must contain sequences for V, D, J, and C genes.
  • For FASTQ inputs, the ref file only needs IMGT-format V, D, J, and C gene sequences; genomic coordinates are not required.
Header format:

text
>V_gene_name|other_info

  • Ensure each sequence ID is unique. It is recommended to retain the original IMGT ID information.
  • Common practice: use a script to standardize IDs to >TRAV1-1*01, >IGHV1-1*01, etc. (TR/IG + gene name + allele). The original annotations can be kept after the ID for traceability.
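
The ID convention described above can be sketched as follows. This is only an illustration under stated assumptions: it expects raw IMGT headers in which the second "|"-separated field is the gene and allele name, and the function names `standardize_header` and `find_duplicate_ids` are hypothetical. It does not reproduce what imgt_other_mkref.py or process_regions.py actually do.

```python
from collections import Counter
from typing import Iterable, List


def standardize_header(raw_header: str) -> str:
    """Turn a raw IMGT header into '>GENE*ALLELE original_header'.

    The gene and allele name becomes the ID; the untouched original
    header is kept after a space for traceability.
    """
    original = raw_header.strip().lstrip(">")
    fields = original.split("|")
    if len(fields) < 2 or not fields[1].strip():
        raise ValueError(f"no gene/allele field in header: {raw_header!r}")
    return f">{fields[1].strip()} {original}"


def find_duplicate_ids(headers: Iterable[str]) -> List[str]:
    """Return the IDs (first whitespace-delimited token) seen more than once."""
    ids = [h.lstrip(">").split(None, 1)[0] for h in headers]
    counts = Counter(ids)
    return sorted(seq_id for seq_id, n in counts.items() if n > 1)
```

Running `find_duplicate_ids` over the standardized headers before building the reference is a cheap way to catch ID collisions, for instance when one gene appears under several region labels.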

Further process the FASTA to meet TRUST4 requirements:

text
python process_regions.py --input_file Oryctolagus_cuniculus.fasta --output_file Oryctolagus_cuniculus_input.fasta

Step 3: Prepare leader sequences

Download the "L-PART1+L-PART2" leader set for the target species from IMGT: https://www.imgt.org/vquest/refseqh.html#LEADER-sets

WARNING

Notes:

  • Handle IG and TR separately, so that you end up with two files: one for IG and one for TR.
  • For IG, download IGH, IGK, and IGL separately, then merge them into one file.
  • For TR, download TRA and TRB separately, then merge them into one file.

Convert bases to uppercase and join each sequence onto a single line:

text
awk '/^>/ {if (seq != "") print seq; print; seq = ""; next} {seq = seq toupper($0)} END {if (seq != "") print seq}' IG_L-PART1+L-PART2.fa > IG_L-PART1_L-PART2.fa
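
The merge mentioned in the notes above is not shown as a command. A minimal Python sketch that merges several per-locus leader files and uppercases them in one pass might look like this. The function names and the per-locus file names in the usage note are hypothetical.

```python
from pathlib import Path
from typing import Iterable


def normalize_fasta_text(text: str) -> str:
    """Uppercase sequences and join wrapped lines; headers are left as-is."""
    out, seq = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq:  # flush the previous record's sequence
                out.append("".join(seq))
                seq = []
            out.append(line)
        else:
            seq.append(line.upper())
    if seq:
        out.append("".join(seq))
    return "\n".join(out) + "\n" if out else ""


def merge_leader_fastas(inputs: Iterable[str], output: str) -> None:
    """Concatenate the normalized contents of several FASTA files."""
    merged = "".join(normalize_fasta_text(Path(p).read_text()) for p in inputs)
    Path(output).write_text(merged)
```

For IG, a call such as `merge_leader_fastas(["IGH_leader.fa", "IGK_leader.fa", "IGL_leader.fa"], "IG_L-PART1_L-PART2.fa")` would produce the merged, uppercased file in one step, assuming the three downloads were saved under those names.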

Step 4: Configure the cfg file for the custom species

text
fa:/1.3/ref/Oryctolagus_cuniculus_input.fasta
ref:/1.3/ref/Oryctolagus_cuniculus_input.fasta
leader:/IG_L-PART1_L-PART2.fa
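
Since a wrong path in the cfg file is an easy mistake to make, a small pre-flight check can help. The sketch below assumes the cfg format is one `key:path` pair per line, as in the example above, and that `fa`, `ref`, and `leader` are the keys to check; both points are inferred from that example rather than from SeekSoul Tools documentation. The function names are hypothetical.

```python
import os
from typing import Dict, List

# Keys taken from the example cfg above (an assumption of this sketch).
REQUIRED_KEYS = ("fa", "ref", "leader")


def parse_cfg(text: str) -> Dict[str, str]:
    """Parse 'key:path' lines into a dict, splitting on the first ':' only."""
    cfg = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"expected 'key:path', got: {line!r}")
        cfg[key.strip()] = value.strip()
    return cfg


def check_cfg(cfg: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means every file was found."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in cfg:
            problems.append(f"missing key: {key}")
        elif not os.path.isfile(cfg[key]):
            problems.append(f"{key}: file not found: {cfg[key]}")
    return problems
```

Calling `check_cfg(parse_cfg(open("cfg.txt").read()))` before starting the analysis lists any missing keys or files up front.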

Step 5: Run VDJ analysis

TIP

Before running VDJ analysis, ensure all reference file paths are correct and file formats meet TRUST4 requirements.

text
#export PATH=seeksoultools1.3.0/seeksoultools.130/external/conda:$PATH
seeksoultools1.3.0/seeksoultools.130/seeksoultools vdj run \
--samplename ZYB_Y2_BCR \
--fq1 demo_BCR_S14_L004_R1_001.fastq.gz \
--fq2 demo_BCR_S14_L004_R2_001.fastq.gz \
--chain IG \
--keep_tmp \
--outdir Output/ \
--core 16 \
--cfg cfg.txt