
How to Build a Reference Genome

Author: SeekGene
Updated: 2026-02-26
SeekArc™ Tools Analysis Guide

Prepare Files Required for Genome Construction

When using SeekArc™ Tools for multi-omics analysis, you need the reference genome sequence and the corresponding GTF annotation file for the species. Use reference genomes from Ensembl or UCSC where possible; with NCBI reference genomes, the linkage between genes and peaks cannot be calculated. The file format specifications are as follows:

Genome Sequence

The genome sequence must be a FASTA file. Every seqname in the first column of the GTF file must match a chromosome ID in the FASTA file; that is, the GTF seqnames must be a subset of the FASTA chromosome IDs. The file must not contain empty lines.
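
The two FASTA requirements above can be checked before any index is built. The following is a minimal sketch, assuming uncompressed input files; the helper names `count_empty_lines` and `gtf_seqnames_missing_from_fasta` and the toy files are illustrative and are not part of SeekArc™ Tools.

```shell
#!/usr/bin/env bash
# Illustrative helpers (not part of SeekArc Tools); assume uncompressed files.

# Print the number of empty lines in a file (should be 0 for FASTA and GTF).
count_empty_lines() {
    grep -c '^$' "$1" || true
}

# Print every GTF seqname (column 1) that has no matching FASTA header ID.
# The FASTA ID is taken as the text between '>' and the first whitespace.
gtf_seqnames_missing_from_fasta() {
    awk -F'\t' '
        NR == FNR {
            if (/^>/) { split(substr($0, 2), a, /[ \t]/); ids[a[1]] = 1 }
            next
        }
        /^#/ || NF == 0 { next }
        !($1 in ids) && !seen[$1]++ { print $1 }
    ' "$1" "$2"
}

# Toy data for demonstration only.
demo_dir=$(mktemp -d)
printf '>chr1 toy\nACGT\n>chr2\nGGCC\n' > "$demo_dir/genome.fa"
printf 'chr1\t.\tgene\t1\t4\t.\t+\t.\tgene_id "g1";\nchrM\t.\tgene\t1\t4\t.\t+\t.\tgene_id "g2";\n' > "$demo_dir/genes.gtf"

count_empty_lines "$demo_dir/genome.fa"
gtf_seqnames_missing_from_fasta "$demo_dir/genome.fa" "$demo_dir/genes.gtf"
```

With the toy files, the second call reports chrM, because that seqname appears in the GTF but has no FASTA record. An empty result means the GTF seqnames are a subset of the FASTA IDs.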

GTF File

The GTF file format specifications are as follows:

  1. seqname: Sequence name, usually chromosome or Contig ID.
  2. source: Annotation source, which can be a database name (e.g., from RefSeq database), a software name (e.g., predicted by GeneScan), or empty (filled with a dot .).
  3. feature: Feature type corresponding to the interval. Common types in GTF include: gene, transcript, CDS, exon, start_codon, stop_codon, etc.
  4. start: Start position of the feature.
  5. end: End position of the feature.
  6. score: Confidence score for the existence and coordinates of the feature. It can be a floating-point number or an integer. A dot (.) indicates no score.
  7. strand: Whether the feature is located on the forward strand (+) or reverse strand (-) of the reference genome.
  8. frame: 0 indicates that the first complete codon of the reading frame starts at the 5'-most base; 1 indicates one extra base before the first complete codon; 2 indicates two extra bases before the first complete codon. Note that frame is not the remainder of the CDS length divided by 3. On the - strand, the first base of the region is the end coordinate, because the coding region runs from end to start on the antisense strand.
  9. attribute: Should have the format attributes_name "attributes_values";. Each attribute must end with a semicolon and be separated from the next attribute by a space, and attribute values must be enclosed in double quotes. It contains the following three attributes:
| Attribute | Description |
| --- | --- |
| gene_id "value"; | Unique ID of the gene locus of the transcript on the genome. gene_id and value are separated by a space. If the value is empty, it indicates no corresponding gene. |
| transcript_id "value"; | Unique ID of the predicted transcript. transcript_id and value are separated by a space. Empty indicates no transcript. |
| gene_type "value"; | Biological type of the gene, e.g., protein_coding, lncRNA... |
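
To make the nine-column layout and the attribute syntax concrete, the sketch below splits one record and pulls individual values out of column 9. It assumes tab-separated columns and attribute values that contain no semicolon; the helper name `gtf_attr` and the record itself (GENE0001, TX0001) are made up for illustration.

```shell
#!/usr/bin/env bash
# Print the value of one attribute from a GTF attribute string (column 9).
# Assumes the form: name "value"; name "value"; ... with no ';' inside values.
gtf_attr() {
    awk -v attrs="$1" -v key="$2" 'BEGIN {
        n = split(attrs, parts, /; */)
        for (i = 1; i <= n; i++) {
            # An attribute matches when it starts with: key, space, opening quote.
            if (index(parts[i], key " \"") == 1) {
                value = substr(parts[i], length(key) + 3)
                sub(/"$/, "", value)   # drop the closing quote
                print value
                exit
            }
        }
    }'
}

# Toy record for demonstration only (IDs are invented).
record=$(printf 'chr1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "GENE0001"; transcript_id "TX0001"; gene_type "protein_coding";')

# Column count (expected to be 9) and the raw attribute column.
printf '%s\n' "$record" | awk -F'\t' '{ print NF }'
attrs=$(printf '%s\n' "$record" | cut -f9)

gtf_attr "$attrs" gene_id
gtf_attr "$attrs" gene_type
```

An attribute that is absent from the record, such as gene_name here, produces no output, which makes the helper usable in simple presence checks.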

TIP

Notes for GTF file preparation:

  • Each gene in the GTF file must have gene, transcript, and exon records in the feature column.
  • When the feature is gene, the attribute column must contain gene_id and gene_type; if gene_name is absent, the gene_id value is used as the gene_name. When the feature is transcript, the attribute column must contain transcript_id. When the feature is exon, the attribute column must contain exon_id. Missing attributes affect how a read annotated to multiple genes is handled.
  • Like the FASTA file, the GTF file must not contain empty lines.
  • The gene_name of mitochondrial genes in the GTF file must start with Mt- or mt-; otherwise the Mito value in the report will be 0.
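
Some of the notes above lend themselves to a quick automated check. The sketch below reports genes that lack a gene, transcript, or exon record and counts gene records whose gene_name starts with Mt- or mt-. The function names and the toy annotation are hypothetical, and the check is only an approximation of whatever validation SeekArc™ Tools performs internally.

```shell
#!/usr/bin/env bash
# Print "<gene_id> <missing feature>" for every gene_id that lacks a gene,
# transcript, or exon record. Assumes gene_id "..."; is present in column 9.
genes_missing_features() {
    awk -F'\t' '
        /^#/ || NF < 9 { next }
        {
            if (match($9, /gene_id "[^"]*"/)) {
                # Skip the 9 characters of: gene_id, space, opening quote.
                id = substr($9, RSTART + 9, RLENGTH - 10)
                ids[id] = 1
                have[id, $3] = 1
            }
        }
        END {
            n = split("gene transcript exon", required, " ")
            for (id in ids)
                for (i = 1; i <= n; i++)
                    if (!((id, required[i]) in have))
                        print id, required[i]
        }
    ' "$1" | sort
}

# Count gene records whose gene_name starts with Mt- or mt-.
count_mito_genes() {
    awk -F'\t' '$3 == "gene" && $9 ~ /gene_name "[Mm]t-/ { n++ } END { print n + 0 }' "$1"
}

# Toy annotation for demonstration only: G2 has no transcript record.
toy_gtf=$(mktemp)
{
    printf 'chr1\ttoy\tgene\t1\t90\t.\t+\t.\tgene_id "G1"; gene_type "protein_coding";\n'
    printf 'chr1\ttoy\ttranscript\t1\t90\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
    printf 'chr1\ttoy\texon\t1\t90\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; exon_id "E1";\n'
    printf 'chrM\ttoy\tgene\t100\t190\t.\t-\t.\tgene_id "G2"; gene_type "protein_coding"; gene_name "mt-Nd1";\n'
    printf 'chrM\ttoy\texon\t100\t190\t.\t-\t.\tgene_id "G2"; transcript_id "T2"; exon_id "E2";\n'
} > "$toy_gtf"

genes_missing_features "$toy_gtf"
count_mito_genes "$toy_gtf"
```

On a real annotation, an empty result from `genes_missing_features` and a non-zero mitochondrial count are the outcomes to look for before building the reference.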

Scenario 1: Building a Reference Genome Compatible with Different Platform Single-cell Data

If you have both 10X Genomics single-cell data and SeekArc product single-cell data, it is recommended to use 10X Cell Ranger ARC to build the reference genome. SeekArc™ Tools is compatible with reference genomes built by Cell Ranger ARC.

Please configure the config.json file in the following format:

json
{
    "organism": "GRCh38",
    "genome": ["GRCh38"],
    "input_fasta": ["/path/to/GRCh38.fa"],
    "input_gtf": ["/path/to/Homo_sapiens.GRCh38.ensembl.filtered.gtf"]
}

The commands for filtering the gene annotation file (GTF file) and building the reference are as follows:

shell
/path/to/cellranger mkgtf Homo_sapiens.GRCh38.ensembl.gtf Homo_sapiens.GRCh38.ensembl.filtered.gtf \
    --attribute=gene_biotype:protein_coding \
    --attribute=gene_biotype:lncRNA \
    --attribute=gene_biotype:antisense \
    --attribute=gene_biotype:IG_LV_gene \
    --attribute=gene_biotype:IG_V_gene \
    --attribute=gene_biotype:IG_V_pseudogene \
    --attribute=gene_biotype:IG_D_gene \
    --attribute=gene_biotype:IG_J_gene \
    --attribute=gene_biotype:IG_J_pseudogene \
    --attribute=gene_biotype:IG_C_gene \
    --attribute=gene_biotype:IG_C_pseudogene \
    --attribute=gene_biotype:TR_V_gene \
    --attribute=gene_biotype:TR_V_pseudogene \
    --attribute=gene_biotype:TR_D_gene \
    --attribute=gene_biotype:TR_J_gene \
    --attribute=gene_biotype:TR_J_pseudogene \
    --attribute=gene_biotype:TR_C_gene
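# Build the reference; the output directory takes the "genome" name from config.json.
cellranger-arc mkref --config=config.json
# Decompress the annotation so that genes.gtf sits alongside genes.gtf.gz.
cd GRCh38/genes
gunzip -c genes.gtf.gz > genes.gtf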

TIP

  • If the reference genome built by Cell Ranger ARC is incompatible with the STAR version used by SeekArc™ Tools, point SeekArc™ Tools to the STAR binary bundled with Cell Ranger ARC, for example: --star_path /path/to/cellranger-arc-2.0.2/lib/bin/STAR.
  • The chromosome names in the FASTA file must match the chromosome names in the GTF file. For example, if the name of chromosome 1 in FASTA is chr1, then the name of chromosome 1 in the GTF file must also be chr1.
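
To see what the mkgtf biotype filter in this scenario keeps, it can help to tally gene_biotype values in the GTF before and after filtering. This is a minimal sketch, assuming Ensembl-style gene_biotype "..."; attributes on gene records; the function name `biotype_counts` and the toy records are illustrative.

```shell
#!/usr/bin/env bash
# Print "<count> <gene_biotype>" for gene records, most frequent first.
# Assumes Ensembl-style gene_biotype "..."; attributes in column 9.
biotype_counts() {
    awk -F'\t' '
        $3 == "gene" && match($9, /gene_biotype "[^"]*"/) {
            # Skip the 14 characters of: gene_biotype, space, opening quote.
            counts[substr($9, RSTART + 14, RLENGTH - 15)]++
        }
        END { for (b in counts) print counts[b], b }
    ' "$1" | sort -k1,1nr -k2,2
}

# Toy annotation for demonstration only.
toy_biotype_gtf=$(mktemp)
{
    printf '1\ttoy\tgene\t1\t90\t.\t+\t.\tgene_id "G1"; gene_biotype "protein_coding";\n'
    printf '1\ttoy\tgene\t100\t190\t.\t+\t.\tgene_id "G2"; gene_biotype "protein_coding";\n'
    printf '1\ttoy\tgene\t200\t290\t.\t-\t.\tgene_id "G3"; gene_biotype "lncRNA";\n'
    printf '1\ttoy\tgene\t300\t320\t.\t-\t.\tgene_id "G4"; gene_biotype "miRNA";\n'
} > "$toy_biotype_gtf"

biotype_counts "$toy_biotype_gtf"
```

Running the same tally on the filtered GTF should list only biotypes that were passed to mkgtf with --attribute; in the toy data, miRNA is not among them.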

Scenario 2: Only SeekArc Products, No Need to Consider Platform Compatibility

The commands for building the STAR index and the BWA index are as follows:

shell
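# Build the STAR index. --sjdbOverhang 149 corresponds to 150-base reads
# (read length minus 1); --limitGenomeGenerateRAM is given in bytes
# (17179869184 bytes = 16 GiB).
/demo/seekarctools/bin/STAR \
  --runMode genomeGenerate \
  --runThreadN 16 \
  --genomeDir /path/to/star \
  --genomeFastaFiles /path/to/genome.fa \
  --sjdbGTFfile /path/to/genome.gtf \
  --sjdbOverhang 149 \
  --limitGenomeGenerateRAM 17179869184
# Build the BWA index in the directory that contains genome.fa.
cd /path/to/fasta
bwa index genome.fa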

TIP

  • The chromosome names in the FASTA file must match the chromosome names in the GTF file. For example, if the name of chromosome 1 in FASTA is chr1, then the name of chromosome 1 in the GTF file must also be chr1.
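
The two numeric STAR options in the command above can be derived rather than copied. The sketch below assumes 150-base reads and a 16 GiB memory budget; both values are assumptions chosen to reproduce the numbers in the command, so adjust them to your own data and machine.

```shell
#!/usr/bin/env bash
# Illustrative values; adjust to your own data and machine.
read_length=150   # assumed read length in bases
ram_gib=16        # memory budget for index generation, in GiB

# STAR's guidance for --sjdbOverhang is read length minus 1.
sjdb_overhang=$((read_length - 1))
# --limitGenomeGenerateRAM expects bytes.
limit_ram_bytes=$((ram_gib * 1024 * 1024 * 1024))

echo "--sjdbOverhang $sjdb_overhang"
echo "--limitGenomeGenerateRAM $limit_ram_bytes"
```

With these inputs the script prints 149 and 17179869184, matching the command above. Larger genomes may need a higher memory budget than 16 GiB.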