How to Build a Reference Genome

Author: SeekGene

Updated: 2026-05-27

SeekArc™ Tools Analysis Guide

Prepare Files Required for Genome Construction

When using SeekArc™ Tools for multi-omics analysis, you need to prepare the reference genome sequence and the corresponding GTF annotation file for the species. Prefer reference genomes from Ensembl or UCSC. The file format specifications are as follows:

Genome Sequence

The genome sequence must be a FASTA file. Each chromosome ID in the FASTA must match the seqname in the first column of the GTF file exactly, and every seqname used in the GTF must be present among the chromosome IDs in the FASTA file. The file must not contain empty lines.

GTF File

The GTF file format specifications are as follows:

seqname: Sequence name, usually chromosome or Contig ID.
source: Annotation source, which can be a database name (e.g., RefSeq), a software name (e.g., a prediction by GENSCAN), or empty (written as a dot .).
feature: Feature type corresponding to the interval. Common types in GTF include: gene, transcript, CDS, exon, start_codon, stop_codon, etc.
start: Start position of the feature (1-based, inclusive).
end: End position of the feature (1-based, inclusive).
score: Confidence score for the existence and coordinates of the feature. It can be a floating-point number or an integer; a dot (.) means no score.
strand: The feature is located on the forward strand (+) or reverse strand (-) of the reference genome.
frame: 0 indicates that the first complete codon of the reading frame starts at the 5'-most base; 1 indicates one extra base before the first complete codon; 2 indicates two extra bases before it. Note that frame is not the remainder of the CDS length divided by 3. On the - strand, the first base of the region is end, because the coding region runs from end to start on the antisense strand.
attribute: Written as attribute_name "attribute_value";. Each attribute must end with a semicolon and be separated from the next attribute by a space, and attribute values must be enclosed in double quotes. The attribute column includes the following three attributes:

Attribute	Description
`gene_id "value";`	Unique ID of the gene locus of the transcript on the genome. `gene_id` and value are separated by a space. If the value is empty, it indicates no corresponding gene.
`transcript_id "value";`	Unique ID of the predicted transcript. `transcript_id` and value are separated by a space. Empty indicates no transcript.
`gene_type "value";`	Biological type of the gene, e.g., `protein_coding`, `lncRNA`...
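
As a rough illustration of the attribute and frame rules above, the sketch below defines two shell helpers. The names gtf_attr and next_frame are hypothetical and not part of SeekArc™ Tools. gtf_attr assumes attribute values never contain a semicolon, and next_frame applies the usual phase arithmetic for consecutive CDS segments of one transcript, taken in 5' to 3' order.

```shell
# Hypothetical helpers for illustration only.

# Print the value of one attribute (e.g. gene_id) for every non-comment GTF row.
# Rows that lack the attribute produce an empty line.
# usage: gtf_attr <attribute_name> <file.gtf>
gtf_attr() {
  awk -F '\t' -v key="$1" '
    /^#/ { next }
    {
      val = ""
      n = split($9, parts, ";")          # assumes no ";" inside quoted values
      for (i = 1; i <= n; i++) {
        item = parts[i]
        sub(/^[ ]+/, "", item)           # drop the space that separates attributes
        if (index(item, key " \"") == 1) {
          val = substr(item, length(key) + 3)
          sub(/"[ ]*$/, "", val)         # drop the closing double quote
        }
      }
      print val
    }
  ' "$2"
}

# Frame of the next CDS segment, given the current segment.
# usage: next_frame <start> <end> <frame>
next_frame() {
  echo $(( (3 - (($2 - $1 + 1 - $3) % 3)) % 3 ))
}
```

For example, `gtf_attr gene_id genome.gtf | sort -u | wc -l` counts distinct gene IDs. `next_frame 1 10 0` prints 2: a 10 bp segment in frame 0 ends with one base of an unfinished codon, so the next segment begins with the two bases that complete it, which is why frame cannot be read off as the CDS length modulo 3.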

TIP

When preparing a reference genome yourself, the genome files (FASTA sequence file and GTF annotation file) must strictly follow the specifications below so that the software can parse them correctly and run downstream analysis:

1. Common Specifications for GTF and FASTA

No empty lines: Neither GTF nor FASTA files may contain any empty lines.
Consistent chromosome names: The chromosome name in the FASTA file (the first word after >) must be exactly the same string as the first column (seqname) of the GTF file; partial matches are not supported (for example, 1 does not match chr1). Every chromosome name in the GTF file must exist in the FASTA file. If the GTF contains annotations for chromosomes that are not in the FASTA, remove them.
Chromosome length limit: A single chromosome cannot exceed 536,870,912 bp (2^29). Longer chromosomes must be split in advance.
Coordinate boundary limit: The start and end coordinates of every feature in the GTF file must not exceed the total length of that chromosome in the corresponding FASTA file.
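
The four common checks above can be sketched with awk. This is a minimal sketch, not a SeekArc™ Tools command: the function names and the TOO_LONG, MISSING_IN_FASTA, and OUT_OF_BOUNDS labels are invented for illustration, and the FASTA is assumed to be non-empty.

```shell
# Hypothetical helpers for illustration only.

# Count lines that are empty or contain only whitespace.
count_empty_lines() {
  awk '/^[[:space:]]*$/ { n++ } END { print n + 0 }' "$1"
}

# Print "<id><TAB><length>" for every FASTA record (id = first word after ">").
fasta_lengths() {
  awk '
    /^>/ { if (id != "") print id "\t" len; id = substr($1, 2); len = 0; next }
         { len += length($0) }
    END  { if (id != "") print id "\t" len }
  ' "$1"
}

# Report chromosomes over the length limit, GTF seqnames missing from the
# FASTA, and features whose coordinates fall outside the chromosome.
# usage: check_gtf_against_fasta <genome.fa> <genome.gtf>
check_gtf_against_fasta() {
  fasta_lengths "$1" | awk -F '\t' -v limit=536870912 '
    NR == FNR {                          # first input: lengths from the FASTA
      len[$1] = $2 + 0
      if (len[$1] > limit) print "TOO_LONG\t" $1 "\t" $2
      next
    }
    /^#/ { next }                        # skip GTF header comments
    !($1 in len) { if (!seen[$1]++) print "MISSING_IN_FASTA\t" $1; next }
    $4 + 0 < 1 || $5 + 0 > len[$1] { print "OUT_OF_BOUNDS\t" $1 "\t" $4 "\t" $5 }
  ' - "$2"
}
```

A file pair that passes prints nothing from `check_gtf_against_fasta /path/to/genome.fa /path/to/genome.gtf` and prints 0 from `count_empty_lines` for both files.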

2. Specific Requirements for GTF File Columns

The GTF file must be strictly sorted: by chromosome, then strand, then gene start/end coordinates, then transcript start/end coordinates, following the hierarchical order gene -> transcript -> exon. In addition, each column must meet the following requirements:

Column 1: seqname (Chromosome Name)

Character limits: Chromosome names cannot contain hyphens (-) or underscores (_); it is recommended to replace them with dots (.). Apply the same replacement to the FASTA headers so that the names in both files still match.
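
One possible way to apply this replacement to both files is sketched below. The function names are hypothetical. Only the FASTA header ID and the first GTF column are rewritten; sequence lines, header descriptions, and attributes are left alone.

```shell
# Hypothetical helpers for illustration only.

# Replace "-" and "_" with "." in the ID part of each FASTA header line.
sanitize_fasta_names() {
  awk '
    /^>/ {
      n = index($0, " ")                           # ID ends at the first space
      head = (n ? substr($0, 1, n - 1) : $0)
      rest = (n ? substr($0, n) : "")
      gsub(/[-_]/, ".", head)
      print head rest
      next
    }
    { print }                                      # sequence lines unchanged
  ' "$1"
}

# Replace "-" and "_" with "." in column 1 (seqname) of a GTF file.
sanitize_gtf_seqnames() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^#/ { print; next }                        # keep header comments as-is
       { gsub(/[-_]/, ".", $1); print }' "$1"
}
```

Usage would look like `sanitize_fasta_names genome.fa > genome.clean.fa` and `sanitize_gtf_seqnames genome.gtf > genome.clean.gtf`. Two names that differ only in these characters (for example a_1 and a-1) would collide after replacement, so it is worth checking the renamed IDs for duplicates.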

Column 3: feature (Feature Hierarchy)

Hierarchy completeness: Each gene in the GTF file must have a complete hierarchical relationship, meaning it must simultaneously contain gene, transcript, and exon feature rows. If the original file only has CDS but no exon, you must manually complete the exon rows.

Column 7: strand (Strand Direction)

Strand direction character limit: The strand direction can only be the forward strand + or reverse strand -. No other symbols (such as . or ?) may appear. If abnormal symbols exist, it is recommended to uniformly replace them with +.
Strand direction consistency: All annotation rows of the same gene (including all its transcripts and exons) must be on the same strand.
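
Both strand rules can be checked in a single pass. The sketch below is illustrative only: check_strands and its BAD_STRAND and MIXED_STRAND labels are made-up names, and gene_id is assumed to be written in the quoted form described earlier.

```shell
# Hypothetical helper for illustration only.
# Reports rows whose strand is not + or -, and genes whose rows disagree on strand.
# usage: check_strands <genome.gtf>
check_strands() {
  awk -F '\t' '
    /^#/ { next }
    $7 != "+" && $7 != "-" { print "BAD_STRAND\tline " NR "\t" $7; next }
    match($9, /gene_id "[^"]*"/) {
      # skip the 9 characters of: gene_id "   and drop the closing quote
      gid = substr($9, RSTART + 9, RLENGTH - 10)
      if (!(gid in strand)) strand[gid] = $7
      else if (strand[gid] != $7 && !reported[gid]++) print "MIXED_STRAND\t" gid
    }
  ' "$1"
}
```

An annotation that satisfies both rules produces no output.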

Column 9: attributes

Hierarchy inheritance and essential attributes:
- All Feature rows (whether gene, transcript, or exon) must contain gene_id to establish the correct gene attribution relationship.
- When the feature is gene: It must contain gene_id and gene_type (or gene_biotype; if this attribute is not provided, you need to add gene_biotype "protein_coding" by default). If there is no gene_name, the value of gene_id must be assigned to gene_name.
- When the feature is transcript: It must contain gene_id, transcript_id, and gene_type (or gene_biotype; if not provided, add gene_biotype "protein_coding" by default).
- When the feature is exon: It must contain gene_id, transcript_id, exon_id, and gene_type (or gene_biotype; if not provided, add gene_biotype "protein_coding" by default). If the original file is missing exon_id, it is recommended to extract the corresponding transcript_id as a base and append a sequence suffix (e.g., exon_id "transcriptA.1", exon_id "transcriptA.2", etc.) according to the order of appearance of the exon in the transcript. Missing exon_id will affect the correct processing when a read is annotated to multiple genes.
- Regarding exon_number: If the exon_number attribute is provided in the GTF, you must ensure that all exon rows contain this attribute. If some exon rows have exon_number while others do not, it will cause failures downstream when generating the loom file. If unsure, it is recommended to uniformly remove all exon_number attributes.
ID character limit: The values of gene_id, transcript_id, and gene_name can only contain numbers, letters, and dots (.). They cannot contain spaces, hyphens, or any other characters (it is recommended to uniformly replace other illegal characters with dots .). The use of "NA" or empty strings "" as ID values is strictly prohibited to avoid downstream errors.
Mitochondrial gene identification: The gene_name of mitochondrial genes must start with "Mt-" or "mt-", otherwise the proportion of the mitochondrial (mito) part in the analysis report will always be calculated as 0.
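
A few of the attribute rules above lend themselves to quick checks. The following is a sketch under stated assumptions, not part of SeekArc™ Tools: the function names are hypothetical, attribute values are assumed not to contain semicolons, and the mitochondrial check matches only the two literal prefixes named above.

```shell
# Hypothetical helpers for illustration only.

# Count exon rows with and without exon_number; both counts being non-zero
# is the mixed situation that the exon_number rule warns about.
exon_number_summary() {
  awk -F '\t' '
    $3 == "exon" { if ($9 ~ /(^|[ ;])exon_number /) have++; else lack++ }
    END { print "with_exon_number=" (have + 0), "without_exon_number=" (lack + 0) }
  ' "$1"
}

# Remove every exon_number attribute (quoted or unquoted value).
strip_exon_number() {
  awk 'BEGIN { FS = OFS = "\t" }
       !/^#/ { gsub(/ ?exon_number [^;]*;/, "", $9); sub(/^ /, "", $9) }
       { print }' "$1"
}

# Count gene rows whose gene_name starts with "Mt-" or "mt-".
count_mito_genes() {
  awk -F '\t' '$3 == "gene" && $9 ~ /gene_name "[Mm]t-/ { n++ } END { print n + 0 }' "$1"
}
```

If `count_mito_genes genome.gtf` prints 0 for a species that has an annotated mitochondrial genome, the gene names probably need the prefix added before the mitochondrial proportion can be reported.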

Scenario 1: Building a Reference Genome Compatible with Different Platform Single-cell Data

If you have both 10X Genomics single-cell data and SeekArc product single-cell data, it is recommended to use 10X Cell Ranger ARC to build the reference genome. SeekArc™ Tools is compatible with reference genomes built by Cell Ranger ARC.

Please configure the config.json file in the following format:

json

{
    "organism": "GRCh38",
    "genome": ["GRCh38"],
    "input_fasta": ["/path/to/GRCh38.fa"],
    "input_gtf": ["/path/to/Homo_sapiens.GRCh38.ensembl.filtered.gtf"]
}

The commands for filtering the gene annotation file (GTF file), building the reference, and decompressing the resulting genes.gtf.gz are as follows:

shell

# Keep only GTF records whose gene_biotype is one of the values listed below
/path/to/cellranger mkgtf Homo_sapiens.GRCh38.ensembl.gtf Homo_sapiens.GRCh38.ensembl.filtered.gtf \
    --attribute=gene_biotype:protein_coding \
    --attribute=gene_biotype:lncRNA \
    --attribute=gene_biotype:antisense \
    --attribute=gene_biotype:IG_LV_gene \
    --attribute=gene_biotype:IG_V_gene \
    --attribute=gene_biotype:IG_V_pseudogene \
    --attribute=gene_biotype:IG_D_gene \
    --attribute=gene_biotype:IG_J_gene \
    --attribute=gene_biotype:IG_J_pseudogene \
    --attribute=gene_biotype:IG_C_gene \
    --attribute=gene_biotype:IG_C_pseudogene \
    --attribute=gene_biotype:TR_V_gene \
    --attribute=gene_biotype:TR_V_pseudogene \
    --attribute=gene_biotype:TR_D_gene \
    --attribute=gene_biotype:TR_J_gene \
    --attribute=gene_biotype:TR_J_pseudogene \
    --attribute=gene_biotype:TR_C_gene
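# Build the reference from config.json; the output directory here is GRCh38
cellranger-arc mkref --config=config.json
# Decompress the annotation so that a plain-text genes.gtf sits next to genes.gtf.gz
cd GRCh38/genes
gunzip -dc genes.gtf.gz > genes.gtf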

TIP

If the reference genome built by Cell Ranger ARC is incompatible with the STAR version used by SeekArc™ Tools, you can point SeekArc™ Tools to the STAR binary bundled with Cell Ranger ARC, for example: --star_path /path/to/cellranger-arc-2.0.2/lib/bin/STAR.
The chromosome names in the FASTA file must match the chromosome names in the GTF file. For example, if the name of chromosome 1 in FASTA is chr1, then the name of chromosome 1 in the GTF file must also be chr1.

Scenario 2: Only SeekArc Products, No Need to Consider Platform Compatibility

The commands for building the STAR and BWA indexes are as follows:

shell

# Build the STAR index from the genome FASTA and the GTF annotation
/demo/seekarctools/bin/STAR \
  --runMode genomeGenerate \
  --runThreadN 16 \
  --genomeDir /path/to/star \
  --genomeFastaFiles /path/to/genome.fa \
  --sjdbGTFfile /path/to/genome.gtf \
  --sjdbOverhang 149 \
  --limitGenomeGenerateRAM 17179869184
# Build the BWA index in the directory that holds the FASTA file
cd /path/to/fasta
bwa index genome.fa
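
The two numeric values in the STAR command can be derived rather than copied. The STAR manual recommends setting --sjdbOverhang to the maximum read length minus 1, so 149 corresponds to 150 bp reads (the read length is an assumption here; the guide does not state it), and 17179869184 is 16 GiB expressed in bytes. The helper names below are hypothetical.

```shell
# Hypothetical helpers for illustration only.

# --sjdbOverhang for a given maximum read length (read length - 1).
star_sjdb_overhang() {
  echo $(( $1 - 1 ))
}

# Convert GiB to bytes for --limitGenomeGenerateRAM.
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}
```

`star_sjdb_overhang 150` prints 149 and `gib_to_bytes 16` prints 17179869184, matching the command above; other read lengths or memory budgets can be substituted in the same way.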

TIP

The chromosome names in the FASTA file must match the chromosome names in the GTF file. For example, if the name of chromosome 1 in FASTA is chr1, then the name of chromosome 1 in the GTF file must also be chr1.
