SeekSoulTools-1.3.0_VDJ: Reference Preparation for Special Species

Authors: 刘雪岭, 高瑞峰

Updated: 2025-09-02

SeekSoulTools VDJ

IMPORTANT

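SeekSoulTools v1.3.0 runs TRUST4 for the VDJ analysis, so the matching reference files must be prepared in advance.

Source: IMGT. The IMGT database is compiled from published articles, other databases, and similar sources. It carries no genome assembly version for individual species, so the genome version that corresponds to a species cannot be obtained from it directly.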

Step 1: Download the full multi-species immune gene database and extract the FASTA for the target species

text

wget -c https://www.imgt.org/download/GENE-DB/IMGTGENEDB-ReferenceSequences.fasta-nt-WithGaps-F+ORF+inframeP

Extract the reference sequences for the target species by matching the Latin species name in each FASTA ID.

NOTE

File format: each FASTA header consists of the following fields, in this order, separated by "|":

IMGT/LIGM-DB accession number(s)
IMGT gene and allele name
species (may be followed by an "_" and the name of the strain, breed or isolate, if defined)
IMGT gene and allele functionality
exon(s), region name(s), or extracted label(s)
start and end positions in the IMGT/LIGM-DB accession number(s)
number of nucleotides in the IMGT/LIGM-DB accession number(s)
codon start, or 'NR' (not relevant) for non coding labels
+n: number of nucleotides (nt) added in 5' compared to the corresponding label extracted from IMGT/LIGM-DB
+n or -n: number of nucleotides (nt) added or removed in 3' compared to the corresponding label extracted from IMGT/LIGM-DB
+n, -n, and/or nS: number of added, deleted, and/or substituted nucleotides to correct sequencing errors, or 'not corrected' if non corrected sequencing errors
number of amino acids (AA): this field indicates that the sequence is in amino acids
number of characters in the sequence: nt (or AA)+IMGT gaps=total
partial (if it is)
reverse complementary (if it is)
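
The field list above maps directly onto a "|"-separated header line. The following is a minimal Python sketch of splitting such a header, assuming the field order given above. The sample header is only illustrative, and `parse_imgt_header` is a hypothetical helper name, not part of SeekSoulTools or of any IMGT tooling.

```python
def parse_imgt_header(header):
    """Split an IMGT FASTA header and name its first five fields."""
    # Drop the leading '>' and split on the '|' separator.
    fields = [f.strip() for f in header.lstrip(">").split("|")]
    names = ["accession", "allele", "species", "functionality", "region"]
    # zip() stops after the five named fields; the rest are ignored here.
    return dict(zip(names, fields))


# Illustrative header that follows the field order described above.
example = (">M99641|IGHV1-18*01|Homo sapiens|F|V-REGION|188..483|296 nt|1"
           "| | | | |296+24=320| | |")
record = parse_imgt_header(example)
print(record["allele"], "|", record["species"])
```

The third field (species) is the one used for species extraction, and the second field (gene and allele name) is the natural candidate for a short sequence ID.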

Example reference scripts:

Extract the FASTA sequences for rabbit (Oryctolagus cuniculus) with the reference script prepare.py:

text

python prepare.py  IMGTGENEDB-ReferenceSequences.fasta-nt-WithGaps-F+ORF+inframeP  "Oryctolagus cuniculus" -o Oryctolagus_cuniculus.raw.fasta
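
The contents of prepare.py are not reproduced on this page. As a rough illustration of the extraction step only, the sketch below keeps the records whose species field matches the requested Latin name. It assumes the "|"-separated header layout described in the note above; `extract_species` and the demo records are hypothetical and are not taken from the actual script or from IMGT.

```python
def extract_species(lines, species):
    """Yield the FASTA lines of records whose species field matches `species`."""
    keep = False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            fields = line[1:].split("|")
            name = fields[2].strip() if len(fields) > 2 else ""
            # The species may be followed by "_" plus a strain, breed or isolate.
            keep = name == species or name.startswith(species + "_")
        if keep and line:
            yield line


# Made-up records, used only to show the matching behavior.
demo = [
    ">X1|IGHV1S1*01|Oryctolagus cuniculus|F|V-REGION|",
    "cagtcg",
    ">X2|IGHV1-2*01|Homo sapiens|F|V-REGION|",
    "caggtg",
    ">X3|IGKV1S2*01|Oryctolagus cuniculus_New Zealand|F|V-REGION|",
    "gcc",
]
rabbit = list(extract_species(demo, "Oryctolagus cuniculus"))
```

For a real file, the same generator can be fed an open file handle and its output written line by line to the raw FASTA.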

Tidy the extracted FASTA file with the reference script imgt_other_mkref.py:

text

python imgt_other_mkref.py --rawfa Oryctolagus_cuniculus.raw.fasta  --outfa Oryctolagus_cuniculus.fasta
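
The contents of imgt_other_mkref.py are not shown here, so its exact behavior is not known. Because the downloaded file is the "WithGaps" variant, the sequences contain IMGT gap characters ("."), and removing them is one clean-up step that is plausibly needed. The sketch below covers that single step and is an assumption, not a description of the script; `strip_gaps` is a hypothetical name.

```python
def strip_gaps(lines):
    """Return (header, sequence) pairs with IMGT gap dots removed."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line, []
        else:
            # Remove gap dots and join wrapped sequence lines into one string.
            chunks.append(line.replace(".", "").upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


cleaned = strip_gaps([">a|X*01|sp", "cag...gtg", "..acg", ">b", "tt.."])
```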

Step 2: Format the reference sequences as TRUST4 requires

https://github.com/liulab-dfci/TRUST4

IMPORTANT

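Basic requirements for the reference sequences:

The ref file (e.g. human_IMGT+C.fa) must contain the V, D, J, and C gene sequences.
For FASTQ input, the ref file only needs the V, D, J, and C gene sequences in IMGT format; coordinate information is not required.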

text

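>V_gene_name|other_info
Every sequence ID only needs to be unique, and keeping the original IMGT ID information is recommended.
Common practice: use a script to normalize the IDs to forms such as >TRAV1-1*01 or >IGHV1-1*01 (a TR or IG prefix followed by the gene name and allele number). The original annotation can be kept after the ID for traceability.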
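
The ID convention above can be sketched in a few lines of Python. The example assumes the input headers still use the IMGT "|"-separated layout (accession first, gene and allele name second) and writes them in the `>V_gene_name|other_info` form shown above. `normalize_header` and `duplicate_ids` are hypothetical helper names.

```python
from collections import Counter


def normalize_header(header):
    """Move the gene/allele name to the front and keep the accession after it."""
    fields = [f.strip() for f in header.lstrip(">").split("|")]
    accession, allele = fields[0], fields[1]
    return ">" + allele + "|" + accession


def duplicate_ids(headers):
    """Return the IDs (text before the first '|') that occur more than once."""
    ids = [h.lstrip(">").split("|")[0] for h in headers]
    return sorted(i for i, n in Counter(ids).items() if n > 1)


new_header = normalize_header(">M99641|IGHV1-18*01|Homo sapiens|F|V-REGION|")
dups = duplicate_ids([">A*01|x", ">A*01|y", ">B*01|z"])
```

Running the duplicate check on the final reference file is a cheap way to confirm the uniqueness requirement before the file is handed to TRUST4.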

Process the FASTA file further with the reference script process_regions.py so that it meets the TRUST4 requirements:

text

python  process_regions.py  --input_file Oryctolagus_cuniculus.fasta --output_file Oryctolagus_cuniculus_input.fasta
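
Since the ref file must contain V, D, J, and C genes, a quick sanity check on the formatted output can catch a missing segment class early. The sketch below is not part of process_regions.py. It classifies gene names with a simple heuristic based on standard IG/TR gene naming (the fourth character, plus a digit test to separate D genes such as IGHD3-3 from the constant gene IGHD), and it assumes each ID starts with the gene name. Both function names are hypothetical.

```python
def segment_type(gene_name):
    """Classify an IG/TR gene name as 'V', 'D', 'J' or 'C' (heuristic)."""
    letter = gene_name[3:4]
    if letter in ("V", "J"):
        return letter
    # D genes carry a number (IGHD3-3, TRBD1); the constant gene IGHD does not.
    if letter == "D" and gene_name[4:5].isdigit():
        return "D"
    return "C"


def missing_segments(headers):
    """Return the segment classes that have no sequence in `headers`."""
    present = {segment_type(h.lstrip(">").split("|")[0]) for h in headers}
    return sorted({"V", "D", "J", "C"} - present)


missing = missing_segments([">IGHV1-18*01", ">IGHJ4*02", ">IGHD3-3*01"])
```

An empty result means all four classes are present. Light-chain loci have no D genes, so the check is meaningful on the merged IG or TR reference rather than on a single locus.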

Step 3: Prepare the leader file

From IMGT (https://www.imgt.org/vquest/refseqh.html#LEADER-sets), download the "L-PART1+L-PART2" sequences for the target species.

WARNING

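Notes:

Download IG and TR separately and keep them as two files, one for IG and one for TR.
Download IGH, IGK, and IGL separately, then merge them into a single IG file.
Download TRA and TRB separately, then merge them into a single TR file.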

Convert the bases to uppercase:

text

awk '/^>/ {if (seq != "") print seq; seq = ""; print; next} {seq = seq toupper($0)} END {if (seq != "") print seq}' IG_L-PART1+L-PART2.fa > IG_L-PART1_L-PART2.fa
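
The merge and the uppercase conversion can also be done in one pass. The Python sketch below takes the lines of several downloaded leader FASTA files (for example IGH, IGK, and IGL), concatenates them, and writes each sequence as a single uppercase line. `merge_leader_fastas` is a hypothetical helper, and the demo records are made up.

```python
def merge_leader_fastas(*fastas):
    """Merge FASTA line lists; emit one uppercase sequence line per record."""
    out, seq = [], []
    for lines in fastas:
        for line in lines:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Flush the sequence collected for the previous record.
                if seq:
                    out.append("".join(seq))
                    seq = []
                out.append(line)
            else:
                seq.append(line.upper())
    if seq:
        out.append("".join(seq))
    return out


merged = merge_leader_fastas([">h1", "atg", "gcc"], [">k1", "atgg"])
```

With real files, each argument can be an open file handle, and the returned lines can be joined with newlines and written to the merged leader file.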

Step 4: Write the cfg file for the special species

text

fa:/1.3/ref/Oryctolagus_cuniculus_input.fasta
ref:/1.3/ref/Oryctolagus_cuniculus_input.fasta
leader:/IG_L-PART1_L-PART2.fa
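
Before running the pipeline it can help to check the cfg file mechanically. The sketch below assumes the format is exactly what the example above shows, one `key:path` pair per line with the keys `fa`, `ref`, and `leader`; any other syntax SeekSoulTools may accept is not covered. The function names are hypothetical.

```python
import os


def parse_cfg(text):
    """Parse 'key:path' lines into a dict, splitting on the first ':' only."""
    cfg = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(":")
        cfg[key.strip()] = value.strip()
    return cfg


def missing_keys(cfg, required=("fa", "ref", "leader")):
    """Return required keys that are absent or empty."""
    return [k for k in required if not cfg.get(k)]


def missing_files(cfg):
    """Return configured paths that do not exist on this machine."""
    return [p for p in cfg.values() if not os.path.isfile(p)]


sample = """fa:/1.3/ref/Oryctolagus_cuniculus_input.fasta
ref:/1.3/ref/Oryctolagus_cuniculus_input.fasta
leader:/IG_L-PART1_L-PART2.fa
"""
cfg = parse_cfg(sample)
```

`missing_keys(cfg)` should return an empty list, and `missing_files(cfg)` lists any path that still needs to be corrected for the local installation.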

Step 5: Run the VDJ analysis

TIP

Before running the VDJ analysis, make sure that every reference file path is correct and that the file formats meet the TRUST4 requirements.

text

#export PATH=seeksoultools1.3.0/seeksoultools.130/external/conda:$PATH
seeksoultools1.3.0/seeksoultools.130/seeksoultools vdj run \
--samplename ZYB_Y2_BCR \
--fq1 demo_BCR_S14_L004_R1_001.fastq.gz \
--fq2 demo_BCR_S14_L004_R2_001.fastq.gz \
--chain IG \
--keep_tmp \
--outdir Output/ \
--core 16 \
--cfg cfg.txt
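
When the same run has to be repeated, for example once for IG and once for TR, the command line can be assembled programmatically. The sketch below builds the argument list using only the flags that appear in the command above; `build_vdj_command` is a hypothetical wrapper, not part of SeekSoulTools.

```python
def build_vdj_command(exe, samplename, fq1, fq2, chain, outdir, cfg,
                      core=16, keep_tmp=True):
    """Assemble the 'seeksoultools vdj run' argument list shown above."""
    cmd = [exe, "vdj", "run",
           "--samplename", samplename,
           "--fq1", fq1,
           "--fq2", fq2,
           "--chain", chain]
    if keep_tmp:
        cmd.append("--keep_tmp")
    cmd += ["--outdir", outdir, "--core", str(core), "--cfg", cfg]
    return cmd


cmd = build_vdj_command(
    "seeksoultools1.3.0/seeksoultools.130/seeksoultools",
    "ZYB_Y2_BCR",
    "demo_BCR_S14_L004_R1_001.fastq.gz",
    "demo_BCR_S14_L004_R2_001.fastq.gz",
    "IG", "Output/", "cfg.txt",
)
# To launch the run: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems and dangling line continuations.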
