Troubleshooting Guide for Undetected Genes in Single-Cell Sequencing (Endogenous/Exogenous Inserts)
This guide provides a systematic checklist to troubleshoot cases where a specific gene (e.g., eGFP, mCherry, or an endogenous gene) shows low expression or is not detected in single-cell data analysis.
Check the Gene FASTA and Annotation GTF Files
When building an exogenous gene reference or preparing a custom reference genome, the FASTA file and the GTF annotation must strictly follow formatting requirements. Otherwise, aligners and quantification tools may fail to recognize the target region or produce incorrect quantification.
Requirements for Insert Sequences and Annotations
Insert Sequence File (FASTA) Requirements
Make sure the exogenous gene sequence has been added to the reference genome FASTA. The chromosome/contig ID (FASTA header) must be concise and unique, and must not contain spaces.
>GFP
ATGGTGAGCAAGGGCGAGGAGCTGTTCACCGGGGTGGTGCCCATCCTGGTCGAGCTGGACGGCGACGTAAACGGCCACAAGTTCAGCGTGTCCGGCGAGGGCGAGGGCGATGCCACCTACGGCAAGCTGACCCTGAAGTTCATCTGCACCACCGGCAAGCTGCCCGTGCCCTGGCCCACCCTCGTGACCACCCTGACCTACGGCGTGCAGTGCTTCAGCCGCTACCCCGACCACATGAAGCAGCACGACTTCTTCAAGTCCGCCATGCCCGAAGGCTACGTCCAGGAGCGCACCATCTTCTTCAAGGACGACGGCAACTACAAGACCCGCGCCGAGGTGAAGTTCGAGGGCGACACCCTGGTGAACCGCATCGAGCTGAAGGGCATCGACTTCAAGGAGGACGGCAACATCCTGGGGCACAAGCTGGAGTACAACTACAACAGCCACAACGTCTATATCATGGCCGACAAGCAGAAGAACGGCATCAAGGTGAACTTCAAGATCCGCCACAACATCGAGGACGGCAGCGTGCAGCTCGCCGACCACTACCAGCAGAACACCCCCATCGGCGACGGCCCCGTGCTGCTGCCCGACAACCACTACCTGAGCACCCAGTCCGCCCTGAGCAAAGACCCCAACGAGAAGCGCGATCACATGGTCCTGCTGGAGTTCGTGACCGCCGCCGGGATCACTCTCGGCATGGACGAGCTGTACAAGTAA
Insert Annotation File (GTF) Requirements
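Single-cell quantification tools (e.g., SeekSoul™ Tools) commonly require each gene to have three feature records: gene, transcript, and exon. Coordinates must match the insert length exactly (1 to 720 in the GFP example), and the columns must be tab-separated in the actual file. The CDS record in the example is not one of the three required records.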
GFP insert gene 1 720 . + . gene_id "GFP"; gene_name "GFP"; gene_biotype "protein_coding";
GFP insert transcript 1 720 . + . gene_id "GFP"; gene_name "GFP"; transcript_id "GFP"; gene_biotype "protein_coding";
GFP insert exon 1 720 . + . gene_id "GFP"; gene_name "GFP"; transcript_id "GFP"; gene_biotype "protein_coding";
GFP insert CDS 1 720 . + 0 gene_id "GFP"; gene_name "GFP"; transcript_id "GFP"; gene_biotype "protein_coding";
Example script (GFP):
# Define variables
GENE="GFP"
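CHR="GFP"    # must match the FASTA header (>GFP) exactly
LENGTH=720   # insert length in bp; used as the end coordinate of every record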
# Create the GTF file
echo -e "${CHR}\tUser\tgene\t1\t${LENGTH}\t.\t+\t.\tgene_id \"${GENE}\"; gene_name \"${GENE}\"; gene_biotype \"protein_coding\";" > ${GENE}.gtf
echo -e "${CHR}\tUser\ttranscript\t1\t${LENGTH}\t.\t+\t.\tgene_id \"${GENE}\"; transcript_id \"${GENE}_T1\"; gene_name \"${GENE}\"; gene_biotype \"protein_coding\";" >> ${GENE}.gtf
echo -e "${CHR}\tUser\texon\t1\t${LENGTH}\t.\t+\t.\tgene_id \"${GENE}\"; transcript_id \"${GENE}_T1\"; gene_name \"${GENE}\"; exon_number 1; exon_id \"${GENE}_E1\";" >> ${GENE}.gtf
Standardized Preparation of Reference Genome and Annotation Files
Requirements for the Genome Sequence File (FASTA)
- ID consistency: Chromosome/contig IDs in the FASTA headers must exactly match the first column (seqname) of the GTF.
- Subset relationship: All seqname values referenced in the GTF must exist in the FASTA header ID set.
- No spaces or blank lines: Chromosome/contig IDs must not contain spaces (content after a space may be truncated by tools), and the file should not contain blank lines.
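The three requirements above can be checked mechanically before building the index. The following is a minimal sketch using awk; the function names check_fasta_headers and check_seqnames are hypothetical helpers introduced here for illustration, not part of any tool mentioned in this guide.

```shell
# Hypothetical helper: flag blank lines, headers containing whitespace,
# and duplicate IDs in a FASTA file. Returns non-zero if a problem is found.
check_fasta_headers() {
  awk '
    /^$/ { print "blank line at line " NR; bad = 1 }
    /^>/ {
      if ($0 ~ /[ \t]/) { print "header contains whitespace: " $0; bad = 1 }
      id = substr($0, 2)
      sub(/[ \t].*/, "", id)          # keep only the text before the first whitespace
      if (seen[id]++) { print "duplicate ID: " id; bad = 1 }
    }
    END { exit bad }
  ' "$1"
}

# Hypothetical helper: report every GTF seqname that has no matching FASTA header.
# Usage: check_seqnames genome.fa genes.gtf
check_seqnames() {
  awk -F'\t' '
    NR == FNR {                        # first file: collect FASTA IDs
      if ($0 ~ /^>/) {
        id = substr($0, 2)
        sub(/[ \t].*/, "", id)
        ids[id] = 1
      }
      next
    }
    /^#/ || NF == 0 { next }           # second file: skip GTF comments and blank lines
    !($1 in ids) && !seen[$1]++ { print "missing in FASTA: " $1; bad = 1 }
    END { exit bad }
  ' "$1" "$2"
}
```

Both functions print one line per problem and return a non-zero status when anything is found, so they can be chained before the indexing step, for example `check_fasta_headers genome.fa && check_seqnames genome.fa genes.gtf`.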
Detailed Requirements for the Gene Annotation File (GTF)
A GTF file is a 9-column, tab-delimited format. The requirements for each column are listed below.
| Column | Field | Description and requirements |
|---|---|---|
| 1 | seqname | Chromosome/contig ID (must match FASTA). |
| 2 | source | Annotation source (e.g., RefSeq, GeneScan); use . if not available. |
| 3 | feature | Feature type; must include gene, transcript, and exon. |
| 4 | start | Feature start position (1-based coordinate). |
| 5 | end | Feature end position (inclusive); start must be less than or equal to end. |
| 6 | score | Confidence score; usually a single dot (.) when no score is available. |
| 7 | strand | Strand: + or -. For genes on the negative strand, start is still the smaller coordinate, so the biological transcription start corresponds to the end coordinate. |
| 8 | frame | Coding frame (0, 1, or 2) for CDS features; use . for non-CDS features. |
| 9 | attribute | Attribute list in which each entry has the form key "value"; (terminated by a semicolon), with entries separated by spaces. |
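Several of the column rules in the table lend themselves to an automated check. The sketch below is one possible way to do it with awk; check_gtf_columns is a hypothetical helper name, and the checks cover only the rules stated in the table (column count, coordinates, strand, trailing semicolon).

```shell
# Hypothetical helper: basic structural checks on a GTF file.
# Prints one message per violation and returns non-zero if any were found.
check_gtf_columns() {
  awk -F'\t' '
    /^#/ { next }                                   # skip comment lines
    NF != 9 {
      print "line " NR ": expected 9 tab-separated columns, found " NF
      bad = 1; next
    }
    $4 !~ /^[0-9]+$/ || $5 !~ /^[0-9]+$/ {
      print "line " NR ": start and end must be positive integers"
      bad = 1; next
    }
    $4 + 0 < 1 || $4 + 0 > $5 + 0 {
      print "line " NR ": need 1 <= start <= end"; bad = 1
    }
    $7 != "+" && $7 != "-" {
      print "line " NR ": strand must be + or -"; bad = 1
    }
    $9 !~ /;[ \t]*$/ {
      print "line " NR ": attribute column must end with a semicolon"; bad = 1
    }
    END { exit bad }
  ' "$1"
}
```

A common failure this catches is a GTF whose tabs were converted to spaces by an editor: every such line is then reported as having a single column.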
Core Attributes in Column 9 (Attributes)
The attribute column defines gene–transcript relationships and is critical for quantification. It must include the following.
- gene_id "value";: Unique identifier of the gene locus.
- transcript_id "value";: Unique identifier of the transcript.
- gene_name "value";: Display name of the gene; if missing, many tools fall back to gene_id.
- gene_biotype "value";: Biological type (e.g., protein_coding, lncRNA); some tools may use gene_type.
- Three-level structure: Each gene must be represented by gene, transcript, and exon feature entries.
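The three-level structure and the required IDs can be verified per gene. This is a rough sketch under the assumption that attributes follow the key "value"; form described above; check_gene_levels is a hypothetical helper name.

```shell
# Hypothetical helper: confirm that every gene_id has gene, transcript, and exon
# records, and that transcript/exon records carry a transcript_id.
check_gene_levels() {
  awk -F'\t' '
    /^#/ || NF < 9 { next }
    {
      if (match($9, /gene_id "[^"]+"/)) {
        # strip the leading gene_id and the surrounding quotes from the match
        gid = substr($9, RSTART + 9, RLENGTH - 10)
        genes[gid] = 1
        have[gid, $3] = 1
      } else {
        print "line " NR ": no gene_id attribute"; bad = 1
      }
      if (($3 == "transcript" || $3 == "exon") && $9 !~ /transcript_id "[^"]+"/) {
        print "line " NR ": " $3 " record without transcript_id"; bad = 1
      }
    }
    END {
      n = split("gene transcript exon", need, " ")
      for (g in genes)
        for (i = 1; i <= n; i++)
          if (!((g, need[i]) in have)) {
            print g ": missing " need[i] " record"; bad = 1
          }
      exit bad
    }
  ' "$1"
}
```

Running it on the custom insert GTF alone is usually enough, since the records for the exogenous gene are the ones written by hand.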
Visually Validate Read Alignments
If the FASTA and GTF are correct, the next step is to confirm read coverage in the target region.
How to check: Load the final BAM file and the corresponding genome file into IGV (Integrative Genomics Viewer).
TIP: See igv-reports.
- Case A: No read coverage. This suggests the library did not capture the gene's mRNA. Possible reasons include:
  - Experimental: the gene is not expressed, capture efficiency is low, or the mRNA is degraded.
  - Sequence: the insert sequence in the reference differs from the true sequence in the sample, causing alignment failures.
- Case B: High read coverage but quantification is 0 or very low. Reads are aligned but were filtered out during assignment (annotation/quantification). Proceed to the next section.
Deep Dive into BAM Alignment Tags (BAM Tags)
Single-cell quantification tools determine whether a read contributes to UMI counting based on tags in the BAM file. Use samtools view to inspect reads in the target region.
Key Tag Diagnostics
- Check the XS tag:
  - If XS:Z:Unassigned_NoFeatures is present and there is no XT:Z: tag, the read aligns to an intergenic region or has low alignment quality (MAPQ). Such reads are not counted.
- Check multi-mapping (NH tag > 1):
  - If the tag shows XT:Z:gene1,gene2, the read overlaps multiple genes. The tool tries to assign it based on exon/intron priority. If it cannot resolve the assignment, the read is discarded.
- Check sequence homology:
  - If XT:Z:gene_id is a single gene but NH:i:N (N > 1), the read has equally scoring alignments to multiple loci (e.g., homologous genes or vector backbone). In strict modes, multi-mapping reads are typically not counted.
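To see which of these situations dominates in the target region, the reads can be tallied by tag pattern. The sketch below reads SAM text on standard input; summarize_tags and its four category labels are hypothetical names chosen for this example, and a missing NH tag is assumed to mean a single hit.

```shell
# Hypothetical helper: tally reads by the tag patterns described above.
# Usage (assumes samtools is installed and the BAM is indexed):
#   samtools view sample.bam GFP | summarize_tags
summarize_tags() {
  awk -F'\t' '
    {
      xs = ""; xt = ""; nh = 1         # assumption: no NH tag means one hit
      for (i = 12; i <= NF; i++) {     # optional tags start at column 12
        if ($i ~ /^XS:Z:/)      xs = substr($i, 6)
        else if ($i ~ /^XT:Z:/) xt = substr($i, 6)
        else if ($i ~ /^NH:i:/) nh = substr($i, 6) + 0
      }
      if (xs ~ /^Unassigned/ || xt == "") cls = "no_feature"
      else if (xt ~ /,/)                  cls = "multi_gene"
      else if (nh > 1)                    cls = "multi_mapped"
      else                                cls = "countable"
      n[cls]++
    }
    END {
      split("countable multi_gene multi_mapped no_feature", order, " ")
      for (i = 1; i <= 4; i++) printf "%s\t%d\n", order[i], n[order[i]]
    }
  '
}
```

A region where most reads fall under multi_gene or multi_mapped points to an annotation or homology problem rather than a capture problem. The exact rules a given quantification tool applies may differ from this simplified classification.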
Case Study: Fluorescent Proteins (eGFP/mCherry) Not Detected
Observed issue: eGFP and mCherry sequences were added to the reference in this project. IGV shows very high read coverage, but the final expression matrix reports low counts for both genes.
Root cause analysis: Inspection of the BAM file shows many reads with NH > 1 that map to both eGFP and mCherry. Sequence review indicates that the reference was built by treating the full exogenous sequences (including identical promoters or shared vector backbone) as transcripts. Because these backbone regions are identical and annotated as exon in the GTF for both genes, the aligner cannot uniquely determine the gene of origin, so the reads are marked as multi-mapping and filtered.
Recommended fix: When building the exogenous gene index, annotate only gene-specific regions (e.g., CDS or 3' UTR) in the GTF. Alternatively, remove duplicated backbone sequences from the FASTA and keep only gene-specific fragments to improve unique mapping.
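As an illustration of the first option, the sketch below writes the three required records for a chosen sub-interval of a contig instead of its full length. The function name write_insert_gtf is hypothetical, and the coordinates in the usage note are placeholders, not values from this project.

```shell
# Hypothetical helper: emit gene, transcript, and exon records that cover only
# the gene-specific span START..END on CONTIG (plus strand, tab-separated).
# Usage: write_insert_gtf CONTIG GENE START END > GENE.gtf
write_insert_gtf() {
  gtf_contig=$1; gtf_gene=$2; gtf_start=$3; gtf_end=$4
  gtf_gene_attrs="gene_id \"${gtf_gene}\"; gene_name \"${gtf_gene}\"; gene_biotype \"protein_coding\";"
  gtf_tx_attrs="gene_id \"${gtf_gene}\"; transcript_id \"${gtf_gene}_T1\"; gene_name \"${gtf_gene}\"; gene_biotype \"protein_coding\";"
  for gtf_feature in gene transcript exon; do
    # gene records carry no transcript_id; transcript and exon records do
    if [ "$gtf_feature" = "gene" ]; then
      gtf_attrs=$gtf_gene_attrs
    else
      gtf_attrs=$gtf_tx_attrs
    fi
    printf '%s\tUser\t%s\t%s\t%s\t.\t+\t.\t%s\n' \
      "$gtf_contig" "$gtf_feature" "$gtf_start" "$gtf_end" "$gtf_attrs"
  done
}
```

For example, if the coding sequence occupied positions 601 to 1320 of a construct contig named eGFP (placeholder values), `write_insert_gtf eGFP eGFP 601 1320 > eGFP.gtf` would annotate only that span and leave the shared backbone unannotated.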
BAM Field Definitions
BAM is the standard binary format for storing alignment results. When viewed as text (e.g., with samtools view), each record contains 11 mandatory fields followed by optional custom tags.
Standard Mandatory Fields
| Column | Field | Meaning | Common values and interpretation |
|---|---|---|---|
| 1 | QNAME | Read ID | Sequence name from the original FASTQ file. |
| 2 | FLAG | Alignment flag | Bitwise encoding of the read status (values can be combined); 0: aligned to + strand; 16: aligned to - strand; 4: unmapped. |
| 3 | RNAME | Reference name | Chromosome/contig name (e.g., chr1, chrM, eGFP). |
| 4 | POS | Alignment start | Leftmost 1-based alignment coordinate on the reference. |
| 5 | MAPQ | Mapping quality | Measures alignment reliability; in STAR, 255 indicates a unique alignment, while lower values indicate multi-mapping. |
| 6 | CIGAR | Alignment structure | Describes how the read aligns to the reference; M: match/mismatch; N: skipped region (often intron); S: soft-clipped bases; D/I: deletion/insertion. |
| 7 | RNEXT | Mate reference name | In paired-end data, the reference name of the mate; * for single-end or unmapped mate. |
| 8 | PNEXT | Mate start | Alignment start position of the mate in paired-end data. |
| 9 | TLEN | Template length | Observed insert size (template length). |
| 10 | SEQ | Read sequence | Base sequence of the read. |
| 11 | QUAL | Base qualities | ASCII-encoded base quality scores. |
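Because FLAG is a bit field, individual properties are tested with a bitwise AND rather than by comparing whole values. The following minimal sketch decodes the bits listed in the table, plus bit 256 (secondary alignment in the SAM specification), which often appears on multi-mapped reads; decode_flag is a hypothetical helper name.

```shell
# Hypothetical helper: describe the FLAG bits most relevant to this guide.
# Bits used: 4 = unmapped, 16 = reverse strand, 256 = secondary alignment.
decode_flag() {
  flag=$1
  if [ $(( flag & 4 )) -ne 0 ]; then
    desc="unmapped"
  elif [ $(( flag & 16 )) -ne 0 ]; then
    desc="reverse strand"
  else
    desc="forward strand"
  fi
  if [ $(( flag & 256 )) -ne 0 ]; then
    desc="$desc, secondary alignment"
  fi
  echo "$desc"
}
```

For instance, `decode_flag 272` prints `reverse strand, secondary alignment`, since 272 = 16 + 256.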
Common Custom Tags
After the 11 mandatory fields, tools (e.g., STAR, SeekSoul™ Tools) may add tags in TAG:TYPE:VALUE format to carry additional alignment and quantification information.
Standard Alignment Tags
- NH (Number of Hits): Number of genomic loci the read aligns to; NH:i:1 indicates a unique alignment.
- HI (Hit Index): Index of the alignment record among multiple hits.
- AS (Alignment Score): Alignment score; higher scores indicate better matches.
- nM (Number of Mismatches): Number of mismatched bases in the alignment.
Feature Assignment Tags (XS/XN/XT)
These tags describe how a read is assigned to genomic features.
- XS (Assignment Status):
  - XS:Z:Assigned: Successfully assigned; the read overlaps an annotated feature and matches strand requirements.
  - XS:Z:Unassigned_NoFeatures: Unassigned; the read aligns to the genome but overlaps no annotated feature (e.g., intergenic region).
- XN (Number of Genes): Number of genes overlapping the alignment position; XN:i:1 indicates one unambiguous gene.
- XT (Gene ID): The final gene name/ID used for quantification; XT:Z:GeneName is the key field that determines which gene receives the count.
Single-Cell-Specific Tags
- CB (Cell Barcode): Corrected cell barcode.
- UB (UMI Barcode): Corrected UMI barcode used for deduplication.
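Taken together, the XT, CB, and UB tags allow a rough cross-check of the expression matrix. The sketch below counts distinct UB values per gene and cell barcode from SAM text on standard input; count_umis is a hypothetical helper, and the result is only an approximation because it ignores NH and any tool-specific filtering.

```shell
# Hypothetical helper: approximate UMI counts per gene and cell from BAM tags.
# Output columns: gene, cell barcode, number of distinct UMIs.
# Usage (assumes samtools is installed and the BAM is indexed):
#   samtools view sample.bam GFP | count_umis
count_umis() {
  awk -F'\t' '
    {
      cb = ""; ub = ""; xt = ""
      for (i = 12; i <= NF; i++) {     # optional tags start at column 12
        if ($i ~ /^CB:Z:/)      cb = substr($i, 6)
        else if ($i ~ /^UB:Z:/) ub = substr($i, 6)
        else if ($i ~ /^XT:Z:/) xt = substr($i, 6)
      }
      # skip reads without a barcode, UMI, or a single assigned gene
      if (cb == "" || ub == "" || xt == "" || xt ~ /,/) next
      key = xt "\t" cb
      if (!seen[key, ub]++) umis[key]++   # count each UMI once per gene and cell
    }
    END { for (k in umis) print k "\t" umis[k] }
  ' | sort
}
```

If this tally is clearly above zero for the target gene while the matrix reports zero, the discrepancy most likely comes from filtering rules applied after tagging.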
