How to Build a Reference Genome
Prepare Files Required for Genome Construction
When using SeekSpace™ Tools software for spatial transcriptomics analysis, you need to prepare the reference genome sequence and the corresponding GTF annotation file for the species. The related file format specifications are as follows:
Genome Sequence
The genome sequence must be provided as a FASTA file. Each chromosome ID must match the seqname in the first column of the GTF file, and the seqnames in the GTF must be a subset of the chromosome IDs in the FASTA file. Note that the file must not contain empty lines.
GTF File
The GTF file format specifications are as follows:
- seqname: Sequence name, usually chromosome or Contig ID.
- source: Annotation source, which can be a database name (e.g., from the RefSeq database), a software name (e.g., predicted by GeneScan), or empty (filled with a dot `.`).
- feature: Feature type corresponding to the interval. Common types in GTF include `gene`, `transcript`, `CDS`, `exon`, `start_codon`, `stop_codon`, etc.
- start: Start position of the feature.
- end: End position of the feature.
- score: Confidence score for the existence and coordinates of the feature. It can be a floating-point number or an integer. `.` indicates empty (i.e., not required).
- strand: The feature is located on the forward strand (`+`) or reverse strand (`-`) of the reference genome.
- frame: `0` indicates the first complete codon of the reading frame is at the 5'-most end; `1` indicates there is one extra base before the first complete codon; `2` indicates there are two extra bases before the first complete codon. Note that frame is not the remainder of CDS length divided by 3. If the strand is `-`, the first base of the region is at `end`, because the corresponding coding region runs from `end` to `start` on the antisense strand.
- attribute: Should have the format `attributes_name "attributes_values";`. Each attribute must end with a semicolon and be separated from the next attribute by a space, and attribute values must be enclosed in double quotes. It contains the following three attributes:
| Attribute | Description |
|---|---|
| `gene_id "value";` | Unique ID of the gene locus of the transcript on the genome. `gene_id` and the value are separated by a space. If the value is empty, it indicates no corresponding gene. |
| `transcript_id "value";` | Unique ID of the predicted transcript. `transcript_id` and the value are separated by a space. Empty indicates no transcript. |
| `gene_type "value";` | Biological type of the gene, e.g., `protein_coding`, `lncRNA`... |
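The attribute format described above can be checked with a small parser. The following is a minimal sketch and not part of SeekSpace™ Tools: the function name `parse_attributes` and the sample IDs are hypothetical, and it assumes attribute values contain no embedded double quotes.

```python
import re

# One `name "value";` pair. Assumption: values contain no embedded double quotes.
_ATTR_RE = re.compile(r'\s*(\S+) "([^"]*)";')

def parse_attributes(column9):
    """Parse GTF column 9 into a {name: value} dict that preserves attribute order."""
    attrs = {}
    text = column9.strip()
    pos = 0
    while pos < len(text):
        m = _ATTR_RE.match(text, pos)
        if m is None:
            raise ValueError(f"malformed attribute near: {text[pos:pos + 30]!r}")
        attrs.setdefault(m.group(1), m.group(2))  # keep the first value if a name repeats
        pos = m.end()
    return attrs

# Made-up IDs, for illustration only.
example = 'gene_id "geneA"; transcript_id "transcriptA"; gene_type "protein_coding";'
print(parse_attributes(example))
```

A missing semicolon or unquoted value raises `ValueError`, which makes the sketch usable as a quick format check before building an index.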
TIP
When you prepare a reference genome yourself, the genome files (FASTA sequence file and GTF annotation file) must strictly follow the specifications below so that the software can parse them and run downstream analysis correctly:
1. Common Specifications for GTF and FASTA
- No empty lines: Neither GTF nor FASTA files may contain any empty lines.
- Consistent chromosome names: The chromosome name in the FASTA file (extracted from the first word after `>`) must be completely identical to the first column (seqname) of the GTF file; partial matches or subsets are not supported. Chromosome names in the GTF file must exist in the FASTA file. If there are extra chromosome annotations in the GTF that are not in the FASTA, they must be removed.
- Chromosome length limit: The length of a single chromosome cannot exceed 536,870,912 bp. If a chromosome exceeds this limit, it must be split in advance.
- Coordinate boundary limit: The start (`start`) and end (`end`) coordinates of all features annotated in the GTF file must not exceed the total length of that chromosome in the corresponding FASTA file.
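The four common checks can be sketched in a few lines of Python. This is an illustrative helper, not a SeekSpace™ Tools utility: the name `check_common` is hypothetical, the inputs are any iterables of lines (for example open file handles), and skipping GTF lines that start with `#` is an assumption about header comments.

```python
MAX_CHROM_LEN = 536_870_912  # per-chromosome limit quoted above

def check_common(fasta_lines, gtf_lines):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    lengths = {}  # chromosome name -> sequence length
    name = None
    for n, raw in enumerate(fasta_lines, 1):
        line = raw.strip()
        if not line:
            problems.append(f"FASTA line {n}: empty line")
        elif line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""  # first word after '>'
            lengths[name] = 0
        elif name is None:
            problems.append(f"FASTA line {n}: sequence before first header")
        else:
            lengths[name] += len(line)
    for chrom, length in lengths.items():
        if length > MAX_CHROM_LEN:
            problems.append(f"{chrom}: {length} bp exceeds {MAX_CHROM_LEN} bp")

    for n, raw in enumerate(gtf_lines, 1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            problems.append(f"GTF line {n}: empty line")
            continue
        if line.startswith("#"):
            continue  # assumption: '#' header/comment lines are tolerated
        cols = line.split("\t")
        if len(cols) != 9:
            problems.append(f"GTF line {n}: expected 9 columns, found {len(cols)}")
            continue
        seqname, start, end = cols[0], int(cols[3]), int(cols[4])
        if seqname not in lengths:
            problems.append(f"GTF line {n}: {seqname} is not a FASTA sequence name")
        elif not 1 <= start <= end <= lengths[seqname]:
            problems.append(f"GTF line {n}: {start}-{end} is outside 1-{lengths[seqname]}")
    return problems

# Toy input: the gene end (12) lies beyond the 10 bp chromosome.
fasta = [">chr1 example", "ACGTACGTAC"]
gtf = ['chr1\tdemo\tgene\t1\t12\t.\t+\t.\tgene_id "geneA";']
print(check_common(fasta, gtf))
```

For real files the same function could be called as `check_common(open("genome.fa"), open("genome.gtf"))`, where both paths are placeholders.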
2. Specific Requirements for GTF File Columns
The GTF file requires strict logical sorting (sequentially by chromosome, strand direction, gene start/end coordinates, transcript start/end coordinates, and following the hierarchical order of gene -> transcript -> exon). In addition, each column must meet the following requirements:
Column 1: seqname (Chromosome Name)
- Prefix and character limits: Chromosome names must start with `chr`. The name cannot contain hyphens (`-`) or underscores (`_`); it is recommended to uniformly replace such symbols with dots (`.`).
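A renaming rule that satisfies both constraints might look like the sketch below. The function name `normalize_chrom_name` is hypothetical, and prepending `chr` to names that lack it is an assumption about how to repair such names; the same renaming has to be applied to the FASTA headers so that both files stay identical.

```python
import re

def normalize_chrom_name(name):
    """Replace '-' and '_' with '.' and make sure the name starts with 'chr'."""
    cleaned = re.sub(r"[-_]", ".", name)
    if not cleaned.startswith("chr"):
        cleaned = "chr" + cleaned  # assumption: a missing prefix is simply prepended
    return cleaned

for original in ("1", "chr1", "chrUn_KI270302v1"):
    print(original, "->", normalize_chrom_name(original))
```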
Column 3: feature (Feature Hierarchy)
- Hierarchy completeness: Each gene in the GTF file must have a complete hierarchical relationship, meaning it must simultaneously contain `gene`, `transcript`, and `exon` feature rows. If the original file only has `CDS` but no `exon`, you must manually complete the `exon` rows.
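To find genes with an incomplete hierarchy, one option is to collect the feature types seen for each `gene_id`. The sketch below only reports gaps and does not create the missing rows; the function name `genes_missing_features` is hypothetical, and rows without a `gene_id` are ignored.

```python
import re
from collections import defaultdict

_GENE_ID_RE = re.compile(r'(?:^|\s)gene_id "([^"]*)";')
REQUIRED = {"gene", "transcript", "exon"}

def genes_missing_features(gtf_lines):
    """Return {gene_id: sorted list of required feature types that never appear}."""
    seen = defaultdict(set)
    for raw in gtf_lines:
        line = raw.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # malformed rows are out of scope for this sketch
        m = _GENE_ID_RE.search(cols[8])
        if m:
            seen[m.group(1)].add(cols[2])
    return {g: sorted(REQUIRED - feats) for g, feats in seen.items() if not REQUIRED <= feats}

# Toy rows: geneB has CDS rows but no exon rows.
rows = [
    'chr1\tdemo\tgene\t1\t100\t.\t+\t.\tgene_id "geneB";',
    'chr1\tdemo\ttranscript\t1\t100\t.\t+\t.\tgene_id "geneB"; transcript_id "transcriptB";',
    'chr1\tdemo\tCDS\t1\t100\t.\t+\t0\tgene_id "geneB"; transcript_id "transcriptB";',
]
print(genes_missing_features(rows))
```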
Column 7: strand (Strand Direction)
- Strand direction character limit: The strand direction can only be the forward strand `+` or reverse strand `-`. No other symbols (such as `.` or `?`) may appear. If abnormal symbols exist, it is recommended to uniformly replace them with `+`.
- Strand direction consistency: All annotation rows within the same gene (including all its transcripts and exons) must maintain absolute consistency in their positive/negative strand direction.
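Both strand rules can be checked in a single pass over the GTF. The following sketch only reports offending rows and genes and leaves the repair to the user; the function name `strand_problems` is hypothetical.

```python
import re
from collections import defaultdict

_GENE_ID_RE = re.compile(r'(?:^|\s)gene_id "([^"]*)";')

def strand_problems(gtf_lines):
    """Return (line numbers with an invalid strand, gene_ids seen on more than one strand)."""
    invalid_lines = []
    strands_by_gene = defaultdict(set)
    for n, raw in enumerate(gtf_lines, 1):
        line = raw.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # malformed rows are out of scope for this sketch
        strand = cols[6]
        if strand not in ("+", "-"):
            invalid_lines.append(n)
        m = _GENE_ID_RE.search(cols[8])
        if m:
            strands_by_gene[m.group(1)].add(strand)
    mixed = sorted(g for g, s in strands_by_gene.items() if len(s) > 1)
    return invalid_lines, mixed

# Toy rows: geneA switches strand between its gene and exon rows; geneB uses '.'.
rows = [
    'chr1\tdemo\tgene\t1\t100\t.\t+\t.\tgene_id "geneA";',
    'chr1\tdemo\texon\t1\t100\t.\t-\t.\tgene_id "geneA"; transcript_id "transcriptA";',
    'chr1\tdemo\tgene\t200\t300\t.\t.\t.\tgene_id "geneB";',
]
print(strand_problems(rows))
```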
Column 9: attributes
- Attribute order requirement: In all attributes, `gene_name` must be placed in the first position (i.e., at the very front of the attribute string), immediately followed by `gene_id`, `transcript_id`, and other attributes.
- Hierarchy inheritance and essential attributes:
  - All feature rows (whether `gene`, `transcript`, or `exon`) must contain `gene_id` to establish the correct gene attribution relationship.
  - When the feature is `gene`: It must contain `gene_id` and `gene_type` (or `gene_biotype`; if this attribute is not provided, you need to add `gene_biotype "protein_coding"` by default). If there is no `gene_name`, the value of `gene_id` must be assigned to `gene_name`.
  - When the feature is `transcript`: It must contain `gene_id`, `transcript_id`, and `gene_type` (or `gene_biotype`; if not provided, add `gene_biotype "protein_coding"` by default).
  - When the feature is `exon`: It must contain `gene_id`, `transcript_id`, `exon_id`, and `gene_type` (or `gene_biotype`; if not provided, add `gene_biotype "protein_coding"` by default). If the original file is missing `exon_id`, it is recommended to take the corresponding `transcript_id` as a base and append a sequence suffix (e.g., `exon_id "transcriptA.1"`, `exon_id "transcriptA.2"`, etc.) according to the order in which the exon appears in the transcript. A missing `exon_id` will affect correct processing when a read is annotated to multiple genes.
  - Regarding `exon_number`: If the `exon_number` attribute is provided in the GTF, you must ensure that all `exon` rows contain this attribute. If some `exon` rows have `exon_number` while others do not, generating the loom file downstream will fail. If unsure, it is recommended to uniformly remove all `exon_number` attributes.
- ID character limit: The values of `gene_id`, `transcript_id`, and `gene_name` can only contain numbers, letters, and dots (`.`). They cannot contain spaces, hyphens, or any other characters (it is recommended to uniformly replace other illegal characters with dots `.`). The use of `"NA"` or empty strings `""` as ID values is strictly prohibited to avoid downstream errors.
- Mitochondrial gene identification: The `gene_name` of mitochondrial genes must start with `"Mt-"` or `"mt-"`, otherwise the proportion of the mitochondrial (mito) part in the analysis report will always be calculated as 0.
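Several of the attribute rules lend themselves to small helper functions. The sketch below is illustrative only and uses hypothetical names (`sanitize_id`, `make_exon_ids`, `format_attributes`). How the character limit interacts with the `mt-` prefix of mitochondrial gene names is not specified above, so the sketch applies `sanitize_id` only to `gene_id` and `transcript_id` values.

```python
import re

PRIORITY = ("gene_name", "gene_id", "transcript_id")  # required leading order

def sanitize_id(value):
    """Replace every character other than letters, digits, and dots with a dot."""
    if value in ("", "NA"):
        raise ValueError("empty strings and 'NA' are not allowed as IDs")
    return re.sub(r"[^A-Za-z0-9.]", ".", value)

def make_exon_ids(transcript_id, n_exons):
    """Build exon IDs as <transcript_id>.<order of appearance>, starting at 1."""
    return [f"{transcript_id}.{i}" for i in range(1, n_exons + 1)]

def format_attributes(attrs):
    """Serialize attributes with gene_name first, then gene_id, transcript_id, then the rest."""
    attrs = dict(attrs)
    attrs.setdefault("gene_name", attrs["gene_id"])     # fall back to gene_id
    if "gene_type" not in attrs:
        attrs.setdefault("gene_biotype", "protein_coding")  # default named above
    ordered = [k for k in PRIORITY if k in attrs]
    ordered += [k for k in attrs if k not in PRIORITY]
    return " ".join(f'{k} "{attrs[k]}";' for k in ordered)

print(sanitize_id("gene A-1"))
print(make_exon_ids("transcriptA", 2))
print(format_attributes({"gene_id": "geneA", "transcript_id": "transcriptA"}))
```

Keeping serialization in one function means the required attribute order is enforced in a single place, whatever order the attributes had in the source file.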
Scenario 1: Building a Reference Genome Compatible with Single-cell Data from Different Platforms
NOTE
If you have both 10X Genomics single-cell data and SeekSpace™ product single-cell data, it is recommended to use 10X CellRanger to build the reference genome. SeekSpace™ Tools is compatible with reference genomes built by CellRanger.
The commands for filtering the gene annotation file (GTF file) and building the reference are as follows:
```bash
# Keep only the listed gene biotypes in the annotation.
/path/to/cellranger mkgtf Homo_sapiens.GRCh38.ensembl.gtf Homo_sapiens.GRCh38.ensembl.filtered.gtf \
    --attribute=gene_biotype:protein_coding \
    --attribute=gene_biotype:lncRNA \
    --attribute=gene_biotype:antisense \
    --attribute=gene_biotype:IG_LV_gene \
    --attribute=gene_biotype:IG_V_gene \
    --attribute=gene_biotype:IG_V_pseudogene \
    --attribute=gene_biotype:IG_D_gene \
    --attribute=gene_biotype:IG_J_gene \
    --attribute=gene_biotype:IG_J_pseudogene \
    --attribute=gene_biotype:IG_C_gene \
    --attribute=gene_biotype:IG_C_pseudogene \
    --attribute=gene_biotype:TR_V_gene \
    --attribute=gene_biotype:TR_V_pseudogene \
    --attribute=gene_biotype:TR_D_gene \
    --attribute=gene_biotype:TR_J_gene \
    --attribute=gene_biotype:TR_J_pseudogene \
    --attribute=gene_biotype:TR_C_gene

# Build the reference from the filtered GTF produced by mkgtf above.
/path/to/cellranger mkref --genome=GRCh38 --fasta=GRCh38.fa --genes=Homo_sapiens.GRCh38.ensembl.filtered.gtf

# Decompress genes.gtf.gz to genes.gtf, keeping the compressed copy.
cd GRCh38/genes
gunzip -dc genes.gtf.gz > genes.gtf
```
TIP
- When the reference genome built by Cell Ranger is incompatible with the STAR version of SeekSpace™ Tools, you can pass the Cell Ranger STAR path to SeekSpace™ Tools, for example: `--star_path /path/to/cellranger-5.0.1/lib/bin/STAR`.
- Chromosome names in the FASTA file must match those in the GTF file. For example, if chromosome 1 in FASTA is named `chr1`, then chromosome 1 in the GTF file must also be `chr1`.
Scenario 2: Only SeekSpace™ Products, No Need to Consider Platform Compatibility
The code for building the STAR index is as follows:
```bash
# --limitGenomeGenerateRAM is given in bytes; 17179869184 bytes = 16 GiB.
/demo/seekspacetools_v1.0.2/bin/STAR \
    --runMode genomeGenerate \
    --runThreadN 16 \
    --genomeDir /path/to/star \
    --genomeFastaFiles /path/to/genome.fa \
    --sjdbGTFfile /path/to/genome.gtf \
    --sjdbOverhang 149 \
    --limitGenomeGenerateRAM 17179869184
```
TIP
- Chromosome names in the FASTA file must match those in the GTF file. For example, if chromosome 1 in FASTA is named `chr1`, then chromosome 1 in the GTF file must also be `chr1`.
