How to Build a Reference Genome
Prepare Files Required for Genome Construction
When using SeekArc™ Tools for multi-omics analysis, you need to prepare the reference genome sequence and the corresponding GTF annotation file for the species. Prefer reference genomes from Ensembl or UCSC; with NCBI reference genomes, the linkage between genes and peaks cannot be calculated. The file format specifications are as follows:
Genome Sequence
The genome sequence must be provided as a FASTA file. Each chromosome ID must match the seqname in the first column of the GTF file; that is, the seqnames in the GTF must be a subset of the chromosome IDs in the FASTA file. The file must not contain empty lines.
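The two FASTA requirements above can be checked before building anything. The following is a minimal sketch, not part of SeekArc™ Tools: the function names `missing_seqnames` and `count_empty_lines` are hypothetical, and it assumes the FASTA record ID is the header text up to the first whitespace and that the GTF is tab-separated.

```shell
# Hypothetical helpers (not part of SeekArc Tools).

# Print every GTF seqname (column 1) that has no matching FASTA record ID.
# Usage: missing_seqnames genome.fa genome.gtf
missing_seqnames() {
    awk -F'\t' '
        # First file (FASTA): remember each record ID, taken as the header
        # text after ">" up to the first space or tab.
        NR == FNR {
            if (/^>/) {
                split(substr($0, 2), parts, /[ \t]/)
                ids[parts[1]] = 1
            }
            next
        }
        # Second file (GTF): skip comment and empty lines.
        /^#/ || /^$/ { next }
        # Report each unknown seqname once.
        !($1 in ids) && !seen[$1]++ { print $1 }
    ' "$1" "$2"
}

# Print the number of empty (or whitespace-only) lines in a file.
# Usage: count_empty_lines genome.fa
count_empty_lines() {
    awk 'NF == 0 { n++ } END { print n + 0 }' "$1"
}
```

If `missing_seqnames` prints nothing and `count_empty_lines` prints 0 for both files, the naming and empty-line requirements described above are met.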
GTF File
The GTF file format specifications are as follows:
- seqname: Sequence name, usually chromosome or Contig ID.
- source: Annotation source, which can be a database name (e.g., from the RefSeq database), a software name (e.g., predicted by GeneScan), or empty (filled with a dot `.`).
- feature: Feature type corresponding to the interval. Common types in GTF include `gene`, `transcript`, `CDS`, `exon`, `start_codon`, `stop_codon`, etc.
- start: Start position of the feature.
- end: End position of the feature.
- score: Confidence score for the existence and coordinates of the feature. It can be a floating-point number or an integer; `.` indicates empty.
- strand: Whether the feature is located on the forward strand (`+`) or reverse strand (`-`) of the reference genome.
- frame: `0` indicates the first complete codon of the reading frame is at the 5'-most end; `1` indicates there is one extra base before the first complete codon; `2` indicates there are two extra bases before the first complete codon. Note that frame is not the remainder of CDS length divided by 3. If the strand is `-`, the first base of the region is `end`, because the corresponding coding region runs from `end` to `start` on the antisense strand.
- attribute: Should have the format `attributes_name "attributes_values";`. Each attribute must end with a semicolon and be separated from the next attribute by a space, and attribute values must be enclosed in double quotes. It contains the following three attributes:
| Attribute | Description |
|---|---|
| `gene_id "value";` | Unique ID of the gene locus of the transcript on the genome. `gene_id` and the value are separated by a space. An empty value indicates no corresponding gene. |
| `transcript_id "value";` | Unique ID of the predicted transcript. `transcript_id` and the value are separated by a space. An empty value indicates no transcript. |
| `gene_type "value";` | Biological type of the gene, e.g., `protein_coding`, `lncRNA`... |
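To make the attribute format concrete, here is a small sketch that pulls one attribute value out of the ninth column. The function name `gtf_attr` is hypothetical, and the record in the usage comment is made up for illustration. It assumes tab-separated columns and attribute values that do not themselves contain a semicolon.

```shell
# Hypothetical helper: read GTF records on stdin and print the value of
# the attribute named in $1 for every record that carries it.
# Usage: gtf_attr gene_id < genome.gtf
gtf_attr() {
    awk -F'\t' -v key="$1" '
        # Skip comment lines and records without an attribute column.
        /^#/ || NF < 9 { next }
        {
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++) {
                a = attrs[i]
                sub(/^[ \t]+/, "", a)            # drop the space that follows the previous semicolon
                if (index(a, key " \"") == 1) {  # attribute name, one space, opening quote
                    v = substr(a, length(key) + 3)
                    sub(/"[ \t]*$/, "", v)       # drop the closing quote
                    print v
                }
            }
        }
    '
}

# Example with a made-up record:
#   printf 'chr1\tsrc\tgene\t100\t900\t.\t+\t.\tgene_id "GENE0001"; gene_type "protein_coding";\n' \
#       | gtf_attr gene_type
```

Matching on the name followed by a space and an opening quote mirrors the rule that the attribute name and its quoted value are separated by a single space.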
TIP
Notes for GTF file preparation:
- The feature column for each gene in the GTF file must contain `gene`, `transcript`, and `exon` information.
- When the feature column is `gene`, the attributes column needs to contain `gene_id` and `gene_type`. If there is no `gene_name`, the value of `gene_id` is treated as `gene_name`. When the feature column is `transcript`, the attributes column needs to contain `transcript_id`. When the feature column is `exon`, the attributes column needs to contain `exon_id`. Otherwise, the handling of Reads annotated to multiple genes will be affected.
- The GTF file must also not contain empty lines.
- The `gene_name` of mitochondrial genes in the GTF file must start with `Mt-` or `mt-`; otherwise the Mito part in the report will be 0.
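The notes above lend themselves to a quick pre-flight check. The sketch below is an illustration only: the function names are hypothetical, it assumes a tab-separated GTF, and `feature_types` reports which feature types occur in the file as a whole rather than verifying them gene by gene.

```shell
# Hypothetical helpers (not part of SeekArc Tools).

# Print the line number of every gene record lacking gene_id or gene_type.
genes_missing_attrs() {
    awk -F'\t' '
        $3 == "gene" && ($9 !~ /gene_id "/ || $9 !~ /gene_type "/) { print FNR }
    ' "$1"
}

# Print how many gene records have a gene_name starting with Mt- or mt-.
# A result of 0 suggests the Mito part of the report would be 0.
count_mito_genes() {
    awk -F'\t' '
        $3 == "gene" && $9 ~ /gene_name "[Mm]t-/ { n++ }
        END { print n + 0 }
    ' "$1"
}

# List the distinct feature types (column 3) found in the file, so you can
# confirm that gene, transcript, and exon all appear.
feature_types() {
    awk -F'\t' '!/^#/ && NF >= 3 { print $3 }' "$1" | sort -u
}

# Example:
#   genes_missing_attrs genome.gtf
#   count_mito_genes genome.gtf
#   feature_types genome.gtf
```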
Scenario 1: Building a Reference Genome Compatible with Different Platform Single-cell Data
If you have both 10X Genomics single-cell data and SeekArc product single-cell data, it is recommended to use 10X Cell Ranger ARC to build the reference genome. SeekArc™ Tools is compatible with reference genomes built by Cell Ranger ARC.
Please configure the config.json file in the following format:
{
"organism": "GRCh38",
"genome": ["GRCh38"],
"input_fasta": ["/path/to/GRCh38.fa"],
"input_gtf": ["/path/to/Homo_sapiens.GRCh38.ensembl.filtered.gtf"]
}
The code for processing the gene annotation file (GTF file) is as follows:
/path/to/cellranger mkgtf Homo_sapiens.GRCh38.ensembl.gtf Homo_sapiens.GRCh38.ensembl.filtered.gtf \
--attribute=gene_biotype:protein_coding \
--attribute=gene_biotype:lncRNA \
--attribute=gene_biotype:antisense \
--attribute=gene_biotype:IG_LV_gene \
--attribute=gene_biotype:IG_V_gene \
--attribute=gene_biotype:IG_V_pseudogene \
--attribute=gene_biotype:IG_D_gene \
--attribute=gene_biotype:IG_J_gene \
--attribute=gene_biotype:IG_J_pseudogene \
--attribute=gene_biotype:IG_C_gene \
--attribute=gene_biotype:IG_C_pseudogene \
--attribute=gene_biotype:TR_V_gene \
--attribute=gene_biotype:TR_V_pseudogene \
--attribute=gene_biotype:TR_D_gene \
--attribute=gene_biotype:TR_J_gene \
--attribute=gene_biotype:TR_J_pseudogene \
--attribute=gene_biotype:TR_C_gene
cellranger-arc mkref --config=config.json
cd GRCh38/genes
gunzip -dc genes.gtf.gz > genes.gtf
TIP
- When the reference genome built by Cell Ranger ARC is incompatible with the STAR version of SeekArc™ Tools, you can point SeekArc™ Tools to the STAR binary shipped with Cell Ranger ARC, for example: `--star_path /path/to/cellranger-arc-2.0.2/lib/bin/STAR`.
- The chromosome names in the FASTA file must match the chromosome names in the GTF file. For example, if the name of chromosome 1 in FASTA is `chr1`, then the name of chromosome 1 in the GTF file must also be `chr1`.
Scenario 2: Only SeekArc Products, No Need to Consider Platform Compatibility
The code for building the STAR index is as follows:
/demo/seekarctools/bin/STAR \
--runMode genomeGenerate \
--runThreadN 16 \
--genomeDir /path/to/star \
--genomeFastaFiles /path/to/genome.fa \
--sjdbGTFfile /path/to/genome.gtf \
--sjdbOverhang 149 \
--limitGenomeGenerateRAM 17179869184
cd /path/to/fasta
bwa index genome.fa
TIP
- The chromosome names in the FASTA file must match the chromosome names in the GTF file. For example, if the name of chromosome 1 in FASTA is `chr1`, then the name of chromosome 1 in the GTF file must also be `chr1`.
