SeekSoul Tools v1.3.0: Preparing a VDJ Reference for Non-Model Species (IMGT/TRUST4)

Authors: 刘雪岭, 高瑞峰

Updated: 2026-01-21

IMPORTANT

SeekSoul Tools v1.3.0 uses TRUST4 for VDJ analysis, so the corresponding reference files must be prepared first.

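Downloading from IMGT: the database is compiled from published articles and other databases, so it carries no genome assembly version for a given species, and the matching version cannot be obtained from it directly.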

Step 1: Download the all-species immune reference database and extract the FASTA for the target species

text

wget -c https://www.imgt.org/download/GENE-DB/IMGTGENEDB-ReferenceSequences.fasta-nt-WithGaps-F+ORF+inframeP

Extract the reference sequences for the target species by matching the Latin species name in each FASTA ID.

NOTE

Header format: each FASTA header has 15 fields separated by "|", in this order:

IMGT/LIGM-DB accession number(s)
IMGT gene and allele name
species (may be followed by an "_" and the name of the strain, breed or isolate, if defined)
IMGT gene and allele functionality
exon(s), region name(s), or extracted label(s)
start and end positions in the IMGT/LIGM-DB accession number(s)
number of nucleotides in the IMGT/LIGM-DB accession number(s)
codon start, or 'NR' (not relevant) for non coding labels
+n: number of nucleotides (nt) added in 5' compared to the corresponding label extracted from IMGT/LIGM-DB
+n or -n: number of nucleotides (nt) added or removed in 3' compared to the corresponding label extracted from IMGT/LIGM-DB
+n, -n, and/or nS: number of added, deleted, and/or substituted nucleotides to correct sequencing errors, or 'not corrected' if non corrected sequencing errors
number of amino acids (AA): this field indicates that the sequence is in amino acids
number of characters in the sequence: nt (or AA)+IMGT gaps=total
partial (if it is)
reverse complementary (if it is)
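
The field list above is enough to filter the download by species. The following is a minimal sketch of that idea, not the actual `prepare.py` used in this guide; the function names `parse_imgt_header`, `read_fasta`, and `filter_species` are hypothetical, and it assumes the species name sits in the third "|"-separated field, optionally followed by "_" and a strain name.

```python
def parse_imgt_header(header):
    """Split an IMGT/GENE-DB FASTA header into its first five named fields."""
    fields = header.lstrip(">").split("|")
    return {
        "accession": fields[0],
        "allele": fields[1],
        "species": fields[2],
        "functionality": fields[3],
        "region": fields[4],
    }


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_species(lines, species):
    """Keep records whose species field equals `species`,
    with or without a trailing "_<strain>" suffix."""
    for header, seq in read_fasta(lines):
        name = parse_imgt_header(header)["species"]
        if name == species or name.startswith(species + "_"):
            yield header, seq
```

Passing an open file handle of the downloaded database as `lines` and `"Oryctolagus cuniculus"` as `species` would yield the rabbit records, which can then be written back out as FASTA.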

Example scripts:

Extract the FASTA sequences for rabbit (Oryctolagus cuniculus):

text

python prepare.py  IMGTGENEDB-ReferenceSequences.fasta-nt-WithGaps-F+ORF+inframeP  "Oryctolagus cuniculus" -o Oryctolagus_cuniculus.raw.fasta

Normalize the extracted FASTA file:

text

python imgt_other_mkref.py --rawfa Oryctolagus_cuniculus.raw.fasta  --outfa Oryctolagus_cuniculus.fasta

Step 2: Format the reference sequences as TRUST4 requires

https://github.com/liulab-dfci/TRUST4

IMPORTANT

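Basic requirements for the reference sequences:

The reference file (e.g. human_IMGT+C.fa) must contain the V, D, J, and C gene sequences.
For FASTQ input, the reference file only needs the V, D, J, and C gene sequences in IMGT format; coordinate information is not required.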

text

 >V_gene_name|other_info
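The only hard requirement is that every sequence ID is unique; keeping the original IMGT ID information is recommended.
A common approach is to use a script to normalize the IDs to the form >TRAV1-1*01 or >IGHV1-1*01 (a TR or IG prefix followed by the gene name and allele number). The original annotation can be kept after the ID for traceability.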
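
The ID convention described above can be sketched in a few lines. This is an illustration only, not the `imgt_other_mkref.py` or `process_regions.py` scripts used in this guide; the function names are hypothetical. It assumes the input headers are still in the original IMGT "|"-separated layout, that the allele name is the second field, and that the "." gap characters of the WithGaps download should be removed. `duplicated_ids` is meant to be run on the original records before renaming.

```python
from collections import Counter


def short_id(header):
    """Return the IMGT allele name (second '|'-separated field of the header)."""
    return header.lstrip(">").split("|")[1].strip()


def clean_sequence(seq):
    """Drop IMGT gap characters ('.') and uppercase the bases.
    Removing gaps is an assumption of this sketch."""
    return seq.replace(".", "").upper()


def rename_records(records):
    """Yield (new_header, sequence) pairs.
    The new header is '>ALLELE|<original header>' so the source stays traceable."""
    for header, seq in records:
        yield ">" + short_id(header) + "|" + header.lstrip(">"), clean_sequence(seq)


def duplicated_ids(records):
    """List allele names that occur more than once in the original records,
    for instance when a gene is listed once per exon or region."""
    counts = Counter(short_id(header) for header, _ in records)
    return sorted(name for name, n in counts.items() if n > 1)
```

Any name reported by `duplicated_ids` needs a decision (merge, keep one entry, or add a suffix) before the file satisfies the uniqueness requirement; this sketch deliberately leaves that choice open.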

Process the FASTA file further so that it meets the TRUST4 requirements:

text

python  process_regions.py  --input_file Oryctolagus_cuniculus.fasta --output_file Oryctolagus_cuniculus_input.fasta

Step 3: Prepare the leader file

On the IMGT reference page (https://www.imgt.org/vquest/refseqh.html#LEADER-sets), download the "L-PART1+L-PART2" sequences for the target species.

WARNING

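Notes:

Download IG and TR separately and keep them as two files, one for IG and one for TR.
Download IGH, IGK, and IGL separately, then merge them into one file.
Download TRA and TRB separately, then merge them into one file.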

Convert the bases to uppercase and put each sequence on a single line:

text

cat IG_L-PART1+L-PART2.fa |awk '/^>/ {print; next} {printf "%s", toupper($0)}' | awk 'BEGIN{RS=">"; FS="\n"} NR>1 {print ">" $1 "\n" $2}' >IG_L-PART1_L-PART2.fa
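
The merge and uppercase steps can also be done in one pass in Python. The sketch below is an alternative to the awk pipeline above, not part of SeekSoul Tools; `merge_leader_fasta` and `write_merged` are hypothetical helper names, and the per-locus input file names in the usage note are placeholders.

```python
from pathlib import Path


def merge_leader_fasta(chunks):
    """Merge several FASTA texts into one.
    Headers are kept as-is; bases are uppercased and each sequence
    is joined onto a single line."""
    out, seq = [], []
    for text in chunks:
        for line in text.splitlines():
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if seq:
                    # Flush the sequence of the previous record.
                    out.append("".join(seq))
                    seq = []
                out.append(line)
            else:
                seq.append(line.upper())
    if seq:
        out.append("".join(seq))
    return "\n".join(out) + "\n"


def write_merged(paths, out_path):
    """Read the given FASTA files and write the merged result to out_path."""
    texts = [Path(p).read_text() for p in paths]
    Path(out_path).write_text(merge_leader_fasta(texts))
```

For example, `write_merged(["IGH.fa", "IGK.fa", "IGL.fa"], "IG_L-PART1_L-PART2.fa")` would produce the IG leader file, where the three input names stand for whatever the per-locus downloads were saved as.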

Step 4: Write the cfg file for the species

text

fa:/1.3/ref/Oryctolagus_cuniculus_input.fasta
ref:/1.3/ref/Oryctolagus_cuniculus_input.fasta
leader:/IG_L-PART1_L-PART2.fa

Step 5: Run the VDJ analysis

TIP

Before running the VDJ analysis, make sure every reference file path is correct and that the files meet the TRUST4 format requirements.
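
The path check can be automated before launching a run. This is a small sketch, not a SeekSoul Tools feature; `read_cfg` and `missing_entries` are hypothetical names, and it assumes the cfg file consists of `key:path` lines with the keys `fa`, `ref`, and `leader` shown in Step 4.

```python
from pathlib import Path


def read_cfg(text):
    """Parse 'key:path' lines into a dict, splitting on the first ':' only."""
    cfg = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(":")
        cfg[key.strip()] = value.strip()
    return cfg


def missing_entries(cfg, required=("fa", "ref", "leader")):
    """Return the required keys that are absent or do not point to an existing file."""
    return [k for k in required if k not in cfg or not Path(cfg[k]).is_file()]
```

Calling `missing_entries(read_cfg(Path("cfg.txt").read_text()))` returns an empty list when all three files exist; any key it returns should be fixed before the run.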

text

#export PATH=seeksoultools1.3.0/seeksoultools.130/external/conda:$PATH
seeksoultools1.3.0/seeksoultools.130/seeksoultools vdj run \
--samplename ZYB_Y2_BCR \
--fq1 demo_BCR_S14_L004_R1_001.fastq.gz \
--fq2 demo_BCR_S14_L004_R2_001.fastq.gz \
--chain IG \
--keep_tmp \
--outdir Output/ \
--core 16 \
--cfg cfg.txt
