How to Build a Reference Genome

Author: SeekGene

Updated: 2026-01-21

SeekArc Tools

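Preparing the Files Needed to Build the Genome

To run multi-omics analysis with SeekArc Tools v1.0.0, you need the species' reference genome sequence and the matching GTF annotation file. Use reference genomes from Ensembl or UCSC where possible; with an NCBI reference genome, the links between genes and peaks cannot be computed. The file formats are described below.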

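Genome Sequence

The genome sequence must be a FASTA file. Its chromosome IDs must match the seqname values in the first column of the GTF file, and the GTF seqnames must be a subset of the FASTA chromosome IDs. The file must not contain blank lines.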
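The two FASTA requirements above (GTF seqnames are a subset of the FASTA IDs, and no blank lines) can be checked before building anything. The following is a minimal sketch, not part of SeekArc Tools: the function names `missing_seqnames` and `count_blank_lines` are hypothetical, and the sketch assumes the FASTA ID is the text after `>` up to the first whitespace.

```shell
# Sketch only: these helpers are hypothetical and not part of SeekArc Tools.

# Print every GTF seqname (column 1) that has no matching FASTA ID.
# Usage: missing_seqnames genome.fa genome.gtf
missing_seqnames() {
    awk -F '\t' '
        # First file (FASTA): remember each ID, i.e. the header text up to the first space or tab.
        NR == FNR {
            if ($0 ~ /^>/) {
                id = substr($0, 2)
                sub(/[ \t].*/, "", id)
                fasta[id] = 1
            }
            next
        }
        # Second file (GTF): report each unknown seqname once, skipping comments and blank lines.
        !/^#/ && NF && !($1 in fasta) && !reported[$1]++ { print $1 }
    ' "$1" "$2"
}

# Count empty or whitespace-only lines in a file.
# Usage: count_blank_lines genome.fa
count_blank_lines() {
    awk '/^[ \t\r]*$/ { n++ } END { print n + 0 }' "$1"
}
```

If `missing_seqnames genome.fa genome.gtf` prints nothing and `count_blank_lines` prints 0 for both files, the pair satisfies these two rules.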

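GTF File

The GTF fields are as follows:

seqname: name of the sequence, usually a chromosome or contig ID.
source: origin of the annotation. This can be a database name (for example RefSeq) or the name of the software that produced the prediction (for example GeneScan); it may also be left empty and filled with a "." character.
feature: type of the feature that the interval represents. Common types in GTF are gene, transcript, CDS, exon, start_codon, and stop_codon.
start: start position of the feature.
end: end position of the feature.
score: confidence in the feature's existence and coordinates, given as a floating-point number or an integer; "." means empty (not needed).
strand: whether the feature lies on the forward (+) or reverse (-) strand of the reference genome.
frame: 0 means the first complete codon of the reading frame starts at the 5'-most base; 1 means there is one extra base before the first complete codon; 2 means there are two extra bases before it. Note that frame is not the CDS length modulo 3. If the strand is '-', the first base of the region is at 'end', because the coding region runs from end to start on the reverse strand.
attribute: each entry has the form attribute_name "attribute_value"; that is, it ends with a semicolon, is separated from the next entry by a space, and has its value enclosed in double quotes. The following three attributes are used: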

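attribute	Meaning
gene_id "value";	Unique ID of the genomic locus (gene) that the transcript belongs to. gene_id and the value are separated by a space; an empty value means there is no corresponding gene.
transcript_id "value";	Unique ID of the predicted transcript. transcript_id and the value are separated by a space; an empty value means there is no transcript.
gene_type "value";	Biological type of the gene, for example protein_coding or lncRNA.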

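Notes on preparing the GTF file:

For every gene, the feature column of the GTF file must include gene, transcript, and exon records.
When the feature is 'gene', the attribute column must contain gene_id and gene_type; if gene_name is absent, the gene_id value is used as gene_name. When the feature is 'transcript', the attribute column must contain transcript_id. When the feature is 'exon', the attribute column must contain exon_id; otherwise the handling of reads annotated to multiple genes is affected.
Like the FASTA file, the GTF file must not contain blank lines.
Mitochondrial gene names (gene_name) in the GTF file must start with "Mt-" or "mt-"; otherwise every value in the mito section of the report will be 0.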
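The notes above lend themselves to a quick pre-flight check. The sketch below uses three hypothetical helpers that are not SeekArc Tools commands. It relies only on the attribute names stated in this document (gene_id, gene_type, transcript_id, exon_id, gene_name) and assumes a tab-separated, nine-column GTF.

```shell
# Sketch only: these helpers are hypothetical and not part of SeekArc Tools.

# Print the gene_id of every gene that lacks a gene, transcript, or exon record.
check_gtf_features() {
    awk -F '\t' '
        /^#/ || NF < 9 { next }
        $3 == "gene" || $3 == "transcript" || $3 == "exon" {
            # Extract the quoted value that follows gene_id in the attribute column.
            if (match($9, /gene_id "[^"]*"/)) {
                id = substr($9, RSTART + 9, RLENGTH - 10)
                seen[id, $3] = 1
                genes[id] = 1
            }
        }
        END {
            for (id in genes) {
                if (!((id, "gene") in seen) || !((id, "transcript") in seen) || !((id, "exon") in seen)) {
                    print id
                }
            }
        }
    ' "$1" | sort
}

# Print the line number of every record that lacks a required attribute.
check_gtf_attributes() {
    awk -F '\t' '
        /^#/ || NF < 9 { next }
        $3 == "gene" && ($9 !~ /gene_id "/ || $9 !~ /gene_type "/) { print NR ": gene record lacks gene_id or gene_type" }
        $3 == "transcript" && $9 !~ /transcript_id "/ { print NR ": transcript record lacks transcript_id" }
        $3 == "exon" && $9 !~ /exon_id "/ { print NR ": exon record lacks exon_id" }
    ' "$1"
}

# Count gene records whose gene_name starts with "Mt-" or "mt-".
count_mito_genes() {
    awk -F '\t' '$3 == "gene" && $9 ~ /gene_name "[Mm]t-/ { n++ } END { print n + 0 }' "$1"
}
```

For a GTF that follows the notes, the first two functions print nothing, and `count_mito_genes genome.gtf` prints a non-zero count when the annotation contains mitochondrial genes.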

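Scenario 1: Building a Reference Genome Compatible with Single-Cell Data from Different Platforms

If you have both 10X Genomics single-cell data and SeekArc single-cell data, we recommend building the reference genome with 10X Cell Ranger ARC. SeekArc Tools v1.0.0 is compatible with reference genomes built by Cell Ranger ARC.

Configure the config.json file in the following format: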

{
    "organism": "GRCh38",
    "genome": ["GRCh38"],
    "input_fasta": ["/path/to/GRCh38.fa"],
    "input_gtf": ["/path/to/Homo_sapiens.GRCh38.ensembl.filtered.gtf"]
}

The following commands filter the gene annotation (GTF) file, build the reference, and decompress the resulting GTF:

shell

# Keep only records whose gene_biotype is one of the values listed below
/path/to/cellranger mkgtf Homo_sapiens.GRCh38.ensembl.gtf Homo_sapiens.GRCh38.ensembl.filtered.gtf \
    --attribute=gene_biotype:protein_coding \
    --attribute=gene_biotype:lncRNA \
    --attribute=gene_biotype:antisense \
    --attribute=gene_biotype:IG_LV_gene \
    --attribute=gene_biotype:IG_V_gene \
    --attribute=gene_biotype:IG_V_pseudogene \
    --attribute=gene_biotype:IG_D_gene \
    --attribute=gene_biotype:IG_J_gene \
    --attribute=gene_biotype:IG_J_pseudogene \
    --attribute=gene_biotype:IG_C_gene \
    --attribute=gene_biotype:IG_C_pseudogene \
    --attribute=gene_biotype:TR_V_gene \
    --attribute=gene_biotype:TR_V_pseudogene \
    --attribute=gene_biotype:TR_D_gene \
    --attribute=gene_biotype:TR_J_gene \
    --attribute=gene_biotype:TR_J_pseudogene \
    --attribute=gene_biotype:TR_C_gene
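
# Build the reference with Cell Ranger ARC from config.json
cellranger-arc mkref --config=config.json

# Decompress the GTF inside the built reference (the .gz file is kept)
cd GRCh38/genes
gunzip -c genes.gtf.gz > genes.gtf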

IMPORTANT

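If the reference genome built by Cellranger-arc is incompatible with the STAR version in SeekArc Tools v1.0.0, pass the path of Cellranger-arc's STAR binary to SeekArc Tools v1.0.0, for example: --star_path /path/to/cellranger-arc-2.0.2/lib/bin/STAR.
Chromosome names in the FASTA file must match those in the GTF file. For example, if chromosome 1 is named chr1 in the FASTA file, it must also be named chr1 in the GTF file.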

Scenario 2: SeekArc Data Only, No Cross-Platform Compatibility Needed

The following commands build the STAR index and the BWA index:

shell

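# Build the STAR index.
# --sjdbOverhang 149 is read length minus 1, i.e. it suits 150 bp reads.
# --limitGenomeGenerateRAM 17179869184 allows 16 GiB of RAM for index generation.
# No characters may follow a trailing backslash, or the line continuation breaks.
/demo/seekarctools/bin/STAR \
  --runMode genomeGenerate \
  --runThreadN 16 \
  --genomeDir /path/to/star \
  --genomeFastaFiles /path/to/genome.fa \
  --sjdbGTFfile /path/to/genome.gtf \
  --sjdbOverhang 149 \
  --limitGenomeGenerateRAM 17179869184

# Build the BWA index next to the FASTA file
cd /path/to/fasta
bwa index genome.fa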
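The two numeric STAR options above are derived values rather than arbitrary constants. The STAR manual suggests setting --sjdbOverhang to read length minus 1, and --limitGenomeGenerateRAM is given in bytes. The sketch below shows the arithmetic; the variable names are hypothetical, and the 150 bp read length and 16 GiB budget are assumptions inferred from the values 149 and 17179869184, so adjust them to your own data and machine.

```shell
# Sketch only: variable names are hypothetical; values are assumptions inferred from the command above.
READ_LENGTH=150   # sequencing read length in bp
RAM_GIB=16        # memory budget for index generation, in GiB

SJDB_OVERHANG=$((READ_LENGTH - 1))
LIMIT_RAM_BYTES=$((RAM_GIB * 1024 * 1024 * 1024))

echo "--sjdbOverhang ${SJDB_OVERHANG} --limitGenomeGenerateRAM ${LIMIT_RAM_BYTES}"
```

With the values shown, the computed options match those used in the STAR command.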

IMPORTANT

Chromosome names in the FASTA file must match those in the GTF file. For example, if chromosome 1 is named chr1 in the FASTA file, it must also be named chr1 in the GTF file.
