
Troubleshooting Guide for Undetected Genes in Single-Cell Sequencing (Endogenous/Exogenous Inserts)

Author: Shirley
Updated: 2026-03-20
Analysis Guide FAQ

This guide provides a systematic checklist to troubleshoot cases where a specific gene (e.g., eGFP, mCherry, or an endogenous gene) shows low expression or is not detected in single-cell data analysis.

Check the Gene FASTA and Annotation GTF Files

When building an exogenous gene reference or preparing a custom reference genome, the FASTA file and the GTF annotation must strictly follow formatting requirements. Otherwise, aligners and quantification tools may fail to recognize the target region or produce incorrect quantification.

Requirements for Insert Sequences and Annotations

Insert Sequence File (FASTA) Requirements

Make sure the exogenous gene sequence has been added to the reference genome FASTA. The chromosome/contig ID (FASTA header) must be concise and unique, and must not contain spaces.

bash
>GFP
ATGGTGAGCAAGGGCGAGGAGCTGTTCACCGGGGTGGTGCCCATCCTGGTCGAGCTGGACGGCGACGTAAACGGCCACAAGTTCAGCGTGTCCGGCGAGGGCGAGGGCGATGCCACCTACGGCAAGCTGACCCTGAAGTTCATCTGCACCACCGGCAAGCTGCCCGTGCCCTGGCCCACCCTCGTGACCACCCTGACCTACGGCGTGCAGTGCTTCAGCCGCTACCCCGACCACATGAAGCAGCACGACTTCTTCAAGTCCGCCATGCCCGAAGGCTACGTCCAGGAGCGCACCATCTTCTTCAAGGACGACGGCAACTACAAGACCCGCGCCGAGGTGAAGTTCGAGGGCGACACCCTGGTGAACCGCATCGAGCTGAAGGGCATCGACTTCAAGGAGGACGGCAACATCCTGGGGCACAAGCTGGAGTACAACTACAACAGCCACAACGTCTATATCATGGCCGACAAGCAGAAGAACGGCATCAAGGTGAACTTCAAGATCCGCCACAACATCGAGGACGGCAGCGTGCAGCTCGCCGACCACTACCAGCAGAACACCCCCATCGGCGACGGCCCCGTGCTGCTGCCCGACAACCACTACCTGAGCACCCAGTCCGCCCTGAGCAAAGACCCCAACGAGAAGCGCGATCACATGGTCCTGCTGGAGTTCGTGACCGCCGCCGGGATCACTCTCGGCATGGACGAGCTGTACAAGTAA
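
Before writing the GTF, it can help to confirm the header and the sequence length from the FASTA itself. The following is a minimal sketch; the helper name `fasta_lengths` and the file name `GFP.fa` are illustrative assumptions, not part of any tool mentioned in this guide.

```shell
# fasta_lengths FASTA
# Print "<contig_id><TAB><length>" for every record in a FASTA file and
# warn on stderr when a header contains whitespace.
# (fasta_lengths is a hypothetical helper name used only in this sketch.)
fasta_lengths() {
  awk '
    /^>/ {
      if (id != "") print id "\t" len
      header = substr($0, 2)
      if (header ~ /[ \t]/)
        print "WARNING: whitespace in header: " header > "/dev/stderr"
      split(header, parts, /[ \t]/)   # keep only the text before the first space
      id = parts[1]
      len = 0
      next
    }
    { gsub(/[ \t\r]/, ""); len += length($0) }   # sequence line: add its bases
    END { if (id != "") print id "\t" len }
  ' "$1"
}

# Example call (GFP.fa is an assumed file name):
#   fasta_lengths GFP.fa
```

The second output column is the value to use as the end coordinate in the insert GTF.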

Insert Annotation File (GTF) Requirements

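Single-cell quantification tools (e.g., SeekSoul™ Tools) commonly require each gene to contain three feature records: gene, transcript, and exon. The first column must equal the FASTA header ID, and the coordinates must span the insert exactly (start 1, end equal to the sequence length; 720 in this example).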

bash
GFP	insert	gene	1	720	.	+	.	gene_id "GFP"; gene_name "GFP"; gene_biotype "protein_coding";
GFP	insert	transcript	1	720	.	+	.	gene_id "GFP"; gene_name "GFP"; transcript_id "GFP"; gene_biotype "protein_coding";
GFP	insert	exon	1	720	.	+	.	gene_id "GFP"; gene_name "GFP"; transcript_id "GFP"; gene_biotype "protein_coding";
GFP	insert	CDS	1	720	.	+	.	gene_id "GFP"; gene_name "GFP"; transcript_id "GFP"; gene_biotype "protein_coding";

Example script (GFP):

shell
# Define variables
GENE="GFP"
CHR="GFP"      # Must match the FASTA header ID (">GFP") exactly
LENGTH=720     # Must equal the insert sequence length in bp

# Create the GTF file (printf handles \t consistently across shells, unlike echo -e)
printf '%s\tUser\tgene\t1\t%s\t.\t+\t.\tgene_id "%s"; gene_name "%s"; gene_biotype "protein_coding";\n' "$CHR" "$LENGTH" "$GENE" "$GENE" > "${GENE}.gtf"
printf '%s\tUser\ttranscript\t1\t%s\t.\t+\t.\tgene_id "%s"; transcript_id "%s_T1"; gene_name "%s"; gene_biotype "protein_coding";\n' "$CHR" "$LENGTH" "$GENE" "$GENE" "$GENE" >> "${GENE}.gtf"
printf '%s\tUser\texon\t1\t%s\t.\t+\t.\tgene_id "%s"; transcript_id "%s_T1"; gene_name "%s"; exon_number 1; exon_id "%s_E1";\n' "$CHR" "$LENGTH" "$GENE" "$GENE" "$GENE" "$GENE" >> "${GENE}.gtf"
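
Once the insert FASTA and GTF exist, they are typically appended to the host reference before the index is rebuilt. The sketch below shows one way to do this; the function name `append_insert` and all file names are hypothetical, and the index build step of your pipeline still has to be rerun afterwards.

```shell
# append_insert GENOME_FA GENOME_GTF INSERT_FA INSERT_GTF OUT_PREFIX
# Writes OUT_PREFIX.fa and OUT_PREFIX.gtf; refuses to run when an insert
# contig ID already exists in the genome FASTA.
# (append_insert and its arguments are assumptions made for this sketch.)
append_insert() {
  genome_fa=$1; genome_gtf=$2; insert_fa=$3; insert_gtf=$4; out_prefix=$5

  # Collect insert contig IDs that are already present in the genome FASTA.
  clash=$(awk '
    /^>/ {
      split(substr($0, 2), parts, /[ \t]/)
      if (FNR == NR) genome_ids[parts[1]] = 1
      else if (parts[1] in genome_ids) print parts[1]
    }
  ' "$genome_fa" "$insert_fa")
  if [ -n "$clash" ]; then
    echo "contig ID already present in genome: $clash" >&2
    return 1
  fi

  # "awk NF" drops blank lines and ends every line with a newline, so a file
  # without a trailing newline cannot fuse with the first line of the next file.
  awk 'NF' "$genome_fa" "$insert_fa" > "$out_prefix.fa"
  awk 'NF' "$genome_gtf" "$insert_gtf" > "$out_prefix.gtf"
}

# Example call (all file names are assumed):
#   append_insert genome.fa genes.gtf GFP.fa GFP.gtf genome_with_GFP
```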

Standardized Preparation of Reference Genome and Annotation Files

Requirements for the Genome Sequence File (FASTA)

  • ID consistency: Chromosome/contig IDs in the FASTA headers must exactly match the first column (seqname) of the GTF.
  • Subset relationship: All seqname values referenced in the GTF must exist in the FASTA header ID set.
  • No spaces or blank lines: Chromosome/contig IDs must not contain spaces (content after a space may be truncated by tools), and the file should not contain blank lines.
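
The ID consistency and subset rules above can be checked mechanically. This is a small sketch under the assumption that the FASTA ID is the header text up to the first space or tab; `missing_seqnames` is a hypothetical helper name.

```shell
# missing_seqnames FASTA GTF
# Print every GTF seqname (column 1) that has no matching FASTA header ID.
# (missing_seqnames is a hypothetical helper name used only in this sketch.)
missing_seqnames() {
  awk -F'\t' '
    FNR == NR {                        # first file: the FASTA
      if (/^>/) {
        split(substr($0, 2), a, /[ \t]/)
        ids[a[1]] = 1
      }
      next
    }
    /^#/ || NF == 0 { next }           # second file: skip GTF comments and blank lines
    !($1 in ids) && !($1 in reported) {
      reported[$1] = 1                 # report each missing seqname once
      print $1
    }
  ' "$1" "$2"
}

# Example call (file names are assumed):
#   missing_seqnames genome_with_GFP.fa genome_with_GFP.gtf
```

Empty output means every seqname in the GTF exists in the FASTA.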

Detailed Requirements for the Gene Annotation File (GTF)

A GTF file is a 9-column, tab-delimited format. The requirements for each column are listed below.

| Column | Field | Description and requirements |
| --- | --- | --- |
| 1 | seqname | Chromosome/contig ID (must match FASTA). |
| 2 | source | Annotation source (e.g., RefSeq, GeneScan); use . if not available. |
| 3 | feature | Feature type; must include gene, transcript, and exon. |
| 4 | start | Feature start position (1-based coordinate). |
| 5 | end | Feature end position (inclusive); start must be less than or equal to end. |
| 6 | score | Confidence score; usually . (not available). |
| 7 | strand | Strand: + or -. For genes on the negative strand, start is still the smaller coordinate, so the biological transcription start corresponds to the end coordinate. |
| 8 | frame | Coding frame (0, 1, 2); use . for non-CDS features. |
| 9 | attribute | Attribute list in key "value"; format; each attribute must end with ; and attributes are separated by spaces. |
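
The column rules above translate into a short line-by-line check. The sketch below follows the table as written in this guide (for example, it accepts only + or - as strand); `check_gtf_columns` is a hypothetical helper name.

```shell
# check_gtf_columns GTF
# Report lines that break the basic column rules listed in this guide.
# (check_gtf_columns is a hypothetical helper name used only in this sketch.)
check_gtf_columns() {
  awk -F'\t' '
    /^#/ || /^[ \t]*$/ { next }        # skip comments and blank lines
    NF != 9 {
      print "line " NR ": expected 9 tab-separated columns, found " NF
      next
    }
    $4 !~ /^[0-9]+$/ || $5 !~ /^[0-9]+$/ {
      print "line " NR ": start/end must be positive integers"
      next
    }
    $4 + 0 < 1 || $4 + 0 > $5 + 0 { print "line " NR ": need 1 <= start <= end" }
    $7 != "+" && $7 != "-"        { print "line " NR ": strand must be + or -" }
    $8 !~ /^[012.]$/              { print "line " NR ": frame must be 0, 1, 2 or ." }
    $9 !~ /;[ \t]*$/              { print "line " NR ": attribute column must end with ;" }
  ' "$1"
}

# Example call (GFP.gtf is an assumed file name):
#   check_gtf_columns GFP.gtf
```

No output means that no rule was violated.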

Core Attributes in Column 9 (Attributes)

The attribute column defines gene–transcript relationships and is critical for quantification. It must include the following.

  • gene_id "value";: Unique identifier of the gene locus; required on every record.
  • transcript_id "value";: Unique identifier of the transcript; required on transcript and exon records.
  • gene_name "value";: Display name of the gene; if missing, many tools fall back to gene_id.
  • gene_biotype "value";: Biological type (e.g., protein_coding, lncRNA); some tools may use gene_type.
  • Three-level structure: Every gene must be represented by gene, transcript, and exon records that share the same gene_id.
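
The three-level rule can be verified per gene_id. The following sketch assumes that attribute values do not themselves contain semicolons; `genes_missing_levels` is a hypothetical helper name.

```shell
# genes_missing_levels GTF
# Print each gene_id that lacks a gene, transcript, or exon record.
# Assumes attribute values contain no ";" characters.
# (genes_missing_levels is a hypothetical helper name used only in this sketch.)
genes_missing_levels() {
  awk -F'\t' '
    /^#/ || NF < 9 { next }
    {
      gid = ""
      n = split($9, attrs, ";")
      for (i = 1; i <= n; i++) {
        a = attrs[i]
        sub(/^[ \t]+/, "", a)
        if (a ~ /^gene_id[ \t]/) {
          sub(/^gene_id[ \t]+/, "", a)
          gsub(/"/, "", a)
          gid = a
        }
      }
      if (gid == "") next
      if (!(gid in order)) { order[gid] = ++count; name[count] = gid }
      seen[gid, $3] = 1                 # remember which feature types exist
    }
    END {
      for (i = 1; i <= count; i++) {
        g = name[i]
        missing = ""
        if (!((g, "gene") in seen))       missing = missing " gene"
        if (!((g, "transcript") in seen)) missing = missing " transcript"
        if (!((g, "exon") in seen))       missing = missing " exon"
        if (missing != "") print g "\tmissing:" missing
      }
    }
  ' "$1"
}

# Example call (file name is assumed):
#   genes_missing_levels genome_with_GFP.gtf
```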

Visually Validate Read Alignments

If the FASTA and GTF are correct, the next step is to confirm read coverage in the target region.

How to check: Load the final BAM file and the corresponding genome file into IGV (Integrative Genomics Viewer).

TIP

See igv-reports.

  • Case A: No read coverage: This suggests the library did not capture the gene’s mRNA. Possible reasons include:
    1. Experimental: the gene is not expressed, capture efficiency is low, or the mRNA is degraded.
    2. Sequence: the insert sequence differs from the true sequence in the sample, causing alignment failures.
  • Case B: High read coverage but quantification is 0 or very low: Reads aligned to the target but were filtered out during feature assignment (annotation/quantification). Inspect the BAM tags as described in the next section.
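
The two cases can also be told apart without a genome browser by counting mapped records on the target contig. The sketch below reads SAM text on standard input so that it works on any `samtools view` output; `count_target_reads`, `final.bam`, and the contig name are assumptions for illustration.

```shell
# count_target_reads CONTIG
# Read SAM records on stdin and print the number of mapped records on CONTIG.
# (count_target_reads is a hypothetical helper name used only in this sketch.)
count_target_reads() {
  awk -F'\t' -v contig="$1" '
    /^@/ { next }                                   # skip SAM header lines
    $3 == contig && int($2 / 4) % 2 == 0 { n++ }    # FLAG bit 0x4 unset = mapped
    END { print n + 0 }
  '
}

# Example call, assuming samtools is installed and final.bam is indexed
# (the file name and contig name are hypothetical):
#   samtools view final.bam GFP | count_target_reads GFP
```

A count of 0 corresponds to Case A; a large count combined with a near-zero expression value corresponds to Case B.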

Deep Dive into BAM Alignment Tags (BAM Tags)

Single-cell quantification tools determine whether a read contributes to UMI counting based on tags in the BAM file. Use samtools view to inspect reads in the target region.

Key Tag Diagnostics

  • Check the XS tag:
    • If XS:Z:Unassigned_NoFeatures is present and there is no XT:Z: tag, the read overlaps no annotated feature (e.g., it aligns to an intergenic region) or has low mapping quality (MAPQ). Such reads are not counted.
  • Check multi-gene overlap (XT tag lists several genes):
    • If the tag shows XT:Z:gene1,gene2, the read overlaps multiple genes. The tool tries to assign it based on exon/intron priority. If it cannot resolve the assignment, the read is discarded.
  • Check multi-mapping caused by sequence homology (NH tag > 1):
    • If XT:Z:gene_id names a single gene but NH:i:N has N > 1, the read has equally scoring alignments to multiple loci (e.g., homologous genes or vector backbone). In strict modes, multi-mapping reads are typically not counted.
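
The diagnostics above can be tallied over all reads in the target region instead of being read off one record at a time. This is a sketch only: the category names and the helper name `classify_reads` are invented here, and the tag semantics are taken from this guide rather than from any tool's documentation.

```shell
# classify_reads
# Read SAM records on stdin and count reads per diagnostic category.
# (classify_reads and the category labels are assumptions made for this sketch.)
classify_reads() {
  awk -F'\t' '
    /^@/ { next }
    {
      nh = ""; xs = ""; xt = ""
      for (i = 12; i <= NF; i++) {               # optional tags start at field 12
        if ($i ~ /^NH:i:/)      nh = substr($i, 6)
        else if ($i ~ /^XS:Z:/) xs = substr($i, 6)
        else if ($i ~ /^XT:Z:/) xt = substr($i, 6)
      }
      if (xt == "")           category = (xs == "" ? "no_assignment_tags" : xs)
      else if (xt ~ /,/)      category = "multi_gene_overlap"
      else if (nh + 0 > 1)    category = "assigned_but_multi_mapped"
      else                    category = "assigned_unique"
      tally[category]++
    }
    END { for (c in tally) print c "\t" tally[c] }
  ' | LC_ALL=C sort
}

# Example call, assuming samtools is installed (file and contig names are hypothetical):
#   samtools view final.bam GFP | classify_reads
```

A tally dominated by assigned_but_multi_mapped or multi_gene_overlap points to a reference or annotation problem rather than an experimental one.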

Case Study: Fluorescent Proteins (eGFP/mCherry) Not Detected

Observed issue: eGFP and mCherry sequences were added in the project. IGV shows very high read coverage, but the final expression matrix reports low counts for both genes.

Root cause analysis: Inspection of the BAM file shows many reads with NH > 1 that map to both eGFP and mCherry. Sequence review indicates that the reference was built by treating the full exogenous sequences (including identical promoters or shared vector backbone) as transcripts. Because these backbone regions are identical and are annotated as exons of both genes in the GTF, the aligner cannot place reads from them uniquely, so the reads are marked as multi-mapping and filtered out.

Recommended fix: When building the exogenous gene index, annotate only gene-specific regions (e.g., CDS or 3' UTR) in the GTF. Alternatively, remove duplicated backbone sequences from the FASTA and keep only gene-specific fragments to improve unique mapping.
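
Before trimming the constructs, it may be useful to confirm that the insert sequences really share identical stretches. The sketch below counts exact k-mers that occur in more than one record of a multi-FASTA file. It only compares forward strands and is intended for short insert sequences, not whole genomes; `shared_kmers`, the default k of 25, and the file name are assumptions.

```shell
# shared_kmers FASTA [K]
# Print the number of distinct k-mers found in more than one FASTA record.
# Forward strand only; intended for a handful of short insert sequences.
# (shared_kmers and the default K=25 are assumptions made for this sketch.)
shared_kmers() {
  awk -v k="${2:-25}" '
    /^>/ { split(substr($0, 2), a, /[ \t]/); id = a[1]; next }
    { gsub(/[ \t\r]/, ""); seq[id] = seq[id] toupper($0) }
    END {
      for (id in seq) {
        s = seq[id]
        for (i = 1; i + k - 1 <= length(s); i++) {
          kmer = substr(s, i, k)
          if (!((kmer, id) in seen)) {   # count each record once per k-mer
            seen[kmer, id] = 1
            owners[kmer]++
          }
        }
      }
      for (kmer in owners) if (owners[kmer] > 1) shared++
      print shared + 0
    }
  ' "$1"
}

# Example call (inserts.fa is an assumed file holding the eGFP and mCherry constructs):
#   shared_kmers inserts.fa 25
```

A result well above 0 suggests shared promoter or backbone sequence that should be excluded from the annotated exons.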


BAM Field Definitions

A BAM file is the standard format for storing alignment results. Each record contains 11 mandatory fields followed by optional custom tags.

Standard Mandatory Fields

| Column | Field | Meaning | Common values and interpretation |
| --- | --- | --- | --- |
| 1 | QNAME | Read ID | Sequence name from the original FASTQ file. |
| 2 | FLAG | Alignment flag | Numeric encoding of the read status; 0: aligned to + strand; 16: aligned to - strand; 4: unmapped. |
| 3 | RNAME | Reference name | Chromosome/contig name (e.g., chr1, chrM, eGFP). |
| 4 | POS | Alignment start | Leftmost 1-based alignment coordinate on the reference. |
| 5 | MAPQ | Mapping quality | Measures alignment reliability; in STAR, 255 indicates a unique alignment, while lower values indicate multi-mapping. |
| 6 | CIGAR | Alignment structure | Describes how the read aligns to the reference; M: match/mismatch; N: skipped region (often intron); S: soft-clipped bases; D/I: deletion/insertion. |
| 7 | RNEXT | Mate reference name | In paired-end data, the reference name of the mate; * for single-end or unmapped mate. |
| 8 | PNEXT | Mate start | Alignment start position of the mate in paired-end data. |
| 9 | TLEN | Template length | Observed insert size (template length). |
| 10 | SEQ | Read sequence | Base sequence of the read. |
| 11 | QUAL | Base qualities | ASCII-encoded base quality scores. |
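
FLAG values other than 0, 4, and 16 are sums of individual bits. The sketch below decodes a FLAG into the bit names defined by the SAM specification; the helper name `decode_flag` and the short labels are chosen here for illustration.

```shell
# decode_flag FLAG
# Print a comma-separated list of the SAM FLAG bits that are set.
# (decode_flag and the label spellings are assumptions made for this sketch.)
decode_flag() {
  flag=$1
  # Bit order follows the SAM specification: 0x1, 0x2, 0x4, ... 0x800.
  names="paired proper_pair unmapped mate_unmapped reverse_strand mate_reverse first_in_pair second_in_pair secondary qc_fail duplicate supplementary"
  bit=1
  out=""
  for name in $names; do
    if [ $((flag & bit)) -ne 0 ]; then
      out="${out:+$out,}$name"
    fi
    bit=$((bit * 2))
  done
  # With none of the bits set, the read is mapped to the forward strand.
  echo "${out:-mapped_forward}"
}

# Example calls:
#   decode_flag 16
#   decode_flag 272
```

If samtools is installed, its `flags` subcommand offers a comparable decoding.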

Common Custom Tags

After the 11 mandatory fields, tools (e.g., STAR, SeekSoul™ Tools) may add tags in TAG:TYPE:VALUE format to carry additional alignment and quantification information.

Standard Alignment Tags

  • NH (Number of Hits): Number of genomic loci the read aligns to; NH:i:1 indicates a unique alignment.
  • HI (Hit Index): Index of the alignment record among multiple hits.
  • AS (Alignment Score): Alignment score; higher scores indicate better matches.
  • nM (Number of Mismatches): Number of mismatched bases in the alignment.

Feature Assignment Tags (XS/XN/XT)

These tags describe how a read is assigned to genomic features.

  • XS (Assignment Status):
    • XS:Z:Assigned: Successfully assigned; the read overlaps an annotated feature and matches strand requirements.
    • XS:Z:Unassigned_NoFeatures: Unassigned; the read aligns to the genome but overlaps no annotated feature (e.g., intergenic region).
  • XN (Number of Genes): Number of genes overlapping the alignment position; XN:i:1 indicates one unambiguous gene.
  • XT (Gene ID): The final gene name/ID used for quantification; XT:Z:GeneName is the key field that determines which gene receives the count.

Single-Cell-Specific Tags

  • CB (Cell Barcode): Corrected cell barcode.
  • UB (UMI Barcode): Corrected UMI barcode used for deduplication.
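
Taken together, the XT, CB, and UB tags allow a rough recount of UMIs for a single gene directly from the BAM file. This sketch counts distinct (CB, UB) pairs among reads whose XT tag equals the gene; real quantification tools may apply further filters, so the number is an approximation for troubleshooting only. `count_umis` is a hypothetical helper name.

```shell
# count_umis GENE
# Read SAM records on stdin and print the number of distinct (CB, UB) pairs
# among reads assigned to GENE via the XT tag.
# (count_umis is a hypothetical helper name used only in this sketch.)
count_umis() {
  awk -F'\t' -v gene="$1" '
    /^@/ { next }
    {
      cb = ""; ub = ""; xt = ""
      for (i = 12; i <= NF; i++) {               # optional tags start at field 12
        if ($i ~ /^CB:Z:/)      cb = substr($i, 6)
        else if ($i ~ /^UB:Z:/) ub = substr($i, 6)
        else if ($i ~ /^XT:Z:/) xt = substr($i, 6)
      }
      if (xt == gene && cb != "" && ub != "" && !((cb, ub) in seen)) {
        seen[cb, ub] = 1
        n++
      }
    }
    END { print n + 0 }
  '
}

# Example call, assuming samtools is installed (file and contig names are hypothetical):
#   samtools view final.bam GFP | count_umis GFP
```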