Troubleshooting Guide for Undetected Genes in Single-Cell Sequencing (Endogenous/Exogenous Inserts)

Author: Shirley

Updated: 2026-03-20

This guide provides a systematic checklist to troubleshoot cases where a specific gene (e.g., eGFP, mCherry, or an endogenous gene) shows low expression or is not detected in single-cell data analysis.

Check the Gene FASTA and Annotation GTF Files

When building an exogenous gene reference or preparing a custom reference genome, the FASTA file and the GTF annotation must strictly follow formatting requirements. Otherwise, aligners and quantification tools may fail to recognize the target region or produce incorrect quantification.

Requirements for Insert Sequences and Annotations

Insert Sequence File (FASTA) Requirements

Make sure the exogenous gene sequence has been added to the reference genome FASTA. The chromosome/contig ID (FASTA header) must be concise and unique, and must not contain spaces.

bash

>GFP
ATGGTGAGCAAGGGCGAGGAGCTGTTCACCGGGGTGGTGCCCATCCTGGTCGAGCTGGACGGCGACGTAAACGGCCACAAGTTCAGCGTGTCCGGCGAGGGCGAGGGCGATGCCACCTACGGCAAGCTGACCCTGAAGTTCATCTGCACCACCGGCAAGCTGCCCGTGCCCTGGCCCACCCTCGTGACCACCCTGACCTACGGCGTGCAGTGCTTCAGCCGCTACCCCGACCACATGAAGCAGCACGACTTCTTCAAGTCCGCCATGCCCGAAGGCTACGTCCAGGAGCGCACCATCTTCTTCAAGGACGACGGCAACTACAAGACCCGCGCCGAGGTGAAGTTCGAGGGCGACACCCTGGTGAACCGCATCGAGCTGAAGGGCATCGACTTCAAGGAGGACGGCAACATCCTGGGGCACAAGCTGGAGTACAACTACAACAGCCACAACGTCTATATCATGGCCGACAAGCAGAAGAACGGCATCAAGGTGAACTTCAAGATCCGCCACAACATCGAGGACGGCAGCGTGCAGCTCGCCGACCACTACCAGCAGAACACCCCCATCGGCGACGGCCCCGTGCTGCTGCCCGACAACCACTACCTGAGCACCCAGTCCGCCCTGAGCAAAGACCCCAACGAGAAGCGCGATCACATGGTCCTGCTGGAGTTCGTGACCGCCGCCGGGATCACTCTCGGCATGGACGAGCTGTACAAGTAA
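
Both points (a header without spaces, and a known sequence length for the GTF coordinates) can be checked before building the index. The following is a minimal sketch using standard awk and grep; the function names `fasta_lengths` and `fasta_bad_headers` and the toy file are illustrative assumptions, not part of any tool.

```shell
# Toy FASTA for demonstration only; point FASTA at your own insert file instead.
FASTA=$(mktemp)
printf '>GFP\nATGGTGAGCAAG\nGGCGAGTAA\n' > "$FASTA"

# Print "ID<TAB>length" for every record, summing wrapped sequence lines.
fasta_lengths() {
    awk '/^>/ { if (id != "") print id "\t" len; id = substr($1, 2); len = 0; next }
         { len += length($0) }
         END { if (id != "") print id "\t" len }' "$1"
}

# Print header lines that contain whitespace (prints nothing if all are clean).
fasta_bad_headers() {
    grep -n '^>.*[[:space:]]' "$1" || true
}

fasta_lengths "$FASTA"
fasta_bad_headers "$FASTA"
```

For the GFP record shown above, the reported length should equal the end coordinate used in the GTF (720 in this guide).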

Insert Annotation File (GTF) Requirements

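Single-cell quantification tools (e.g., SeekSoul™ Tools) commonly require each gene to contain three feature records: gene, transcript, and exon. The seqname in column 1 must match the FASTA header (`GFP` here), and the coordinates must match the insert length precisely (1 to 720 for the 720 bp sequence above). The CDS record in the example below is optional.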

bash

GFP	insert	gene	1	720	.	+	.	gene_id "GFP"; gene_name "GFP"; gene_biotype "protein_coding";
GFP	insert	transcript	1	720	.	+	.	gene_id "GFP"; gene_name "GFP"; transcript_id "GFP"; gene_biotype "protein_coding";
GFP	insert	exon	1	720	.	+	.	gene_id "GFP"; gene_name "GFP"; transcript_id "GFP"; gene_biotype "protein_coding";
GFP	insert	CDS	1	720	.	+	.	gene_id "GFP"; gene_name "GFP"; transcript_id "GFP"; gene_biotype "protein_coding";

Example script (GFP):

shell

# Define variables
GENE="GFP"
CHR="GFP"      # must exactly match the FASTA header (>GFP)
LENGTH=720     # insert length in bp; used as the end coordinate

# Create the GTF file (tab-delimited; printf is more portable than echo -e)
printf '%s\tUser\tgene\t1\t%s\t.\t+\t.\tgene_id "%s"; gene_name "%s"; gene_biotype "protein_coding";\n' "$CHR" "$LENGTH" "$GENE" "$GENE" > "${GENE}.gtf"
printf '%s\tUser\ttranscript\t1\t%s\t.\t+\t.\tgene_id "%s"; transcript_id "%s_T1"; gene_name "%s"; gene_biotype "protein_coding";\n' "$CHR" "$LENGTH" "$GENE" "$GENE" "$GENE" >> "${GENE}.gtf"
printf '%s\tUser\texon\t1\t%s\t.\t+\t.\tgene_id "%s"; transcript_id "%s_T1"; gene_name "%s"; gene_biotype "protein_coding"; exon_number 1; exon_id "%s_E1";\n' "$CHR" "$LENGTH" "$GENE" "$GENE" "$GENE" "$GENE" >> "${GENE}.gtf"

Standardized Preparation of Reference Genome and Annotation Files

Requirements for the Genome Sequence File (FASTA)

ID consistency: Chromosome/contig IDs in the FASTA headers must exactly match the first column (seqname) of the GTF.
Subset relationship: All seqname values referenced in the GTF must exist in the FASTA header ID set.
No spaces or blank lines: Chromosome/contig IDs must not contain spaces (content after a space may be truncated by tools), and the file should not contain blank lines.
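
The ID consistency and subset rules above can be sketched as a single awk pass over both files. This is an illustrative check, not a feature of any particular tool; the function name `missing_seqnames` and the toy files are assumptions. FASTA IDs are taken as the header text up to the first whitespace, which mirrors how many tools truncate headers.

```shell
# Toy inputs for demonstration only.
FASTA=$(mktemp); GTF=$(mktemp)
printf '>chr1 primary assembly\nACGT\n>GFP\nACGT\n' > "$FASTA"
printf 'chr1\t.\tgene\t1\t4\t.\t+\t.\tgene_id "g1";\n' > "$GTF"
printf 'mCherry\t.\tgene\t1\t4\t.\t+\t.\tgene_id "mCherry";\n' >> "$GTF"

# List each GTF seqname (column 1) that has no matching FASTA header ID.
missing_seqnames() {
    awk 'NR == FNR { if (/^>/) ids[substr($1, 2)] = 1; next }
         /^#/ || NF == 0 { next }
         !($1 in ids) && !seen[$1]++ { print $1 }' "$1" "$2"
}

missing_seqnames "$FASTA" "$GTF"
```

An empty result means every seqname in the GTF exists in the FASTA. In the toy data, `mCherry` is reported because its sequence was never added to the FASTA.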

Detailed Requirements for the Gene Annotation File (GTF)

A GTF file is a 9-column, tab-delimited format. The requirements for each column are listed below.

Column	Field	Description and requirements
1	seqname	Chromosome/contig ID (must match FASTA).
2	source	Annotation source (e.g., RefSeq, GeneScan); use `.` if not available.
3	feature	Feature type; must include `gene`, `transcript`, and `exon`.
4	start	Feature start position (1-based coordinate).
5	end	Feature end position (inclusive); start must be less than or equal to end.
6	score	Confidence score; usually `.`.
7	strand	Strand: `+` or `-`; for features on the negative strand, start is still the smaller coordinate, so the biological transcription start corresponds to the `end` coordinate.
8	frame	Coding frame (0, 1, or 2) for CDS features; use `.` for non-CDS features.
9	attribute	Attribute list in `key "value";` format; each attribute must end with `;` and attributes are separated by spaces.
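
The column rules in this table lend themselves to an automated check. The sketch below is one possible validator written with awk; the function name `check_gtf_columns` and the toy records are assumptions, and the strand rule follows this guide (`+` or `-` only).

```shell
# Toy GTF: line 1 is valid, line 2 has start > end, line 3 uses spaces instead of tabs.
GTF=$(mktemp)
printf 'GFP\tinsert\texon\t1\t720\t.\t+\t.\tgene_id "GFP"; transcript_id "GFP";\n' > "$GTF"
printf 'GFP\tinsert\texon\t720\t1\t.\t+\t.\tgene_id "GFP"; transcript_id "GFP";\n' >> "$GTF"
printf 'GFP insert exon 1 720 . + . gene_id "GFP";\n' >> "$GTF"

# Report the first problem found on each non-comment line.
check_gtf_columns() {
    awk -F '\t' '
        /^#/ || $0 == "" { next }
        NF != 9 { print "line " NR ": expected 9 tab-separated columns, found " NF; next }
        $4 !~ /^[0-9]+$/ || $5 !~ /^[0-9]+$/ { print "line " NR ": start/end must be integers"; next }
        $4 + 0 < 1 || $4 + 0 > $5 + 0 { print "line " NR ": need 1 <= start <= end"; next }
        $7 != "+" && $7 != "-" { print "line " NR ": strand must be + or -"; next }
        $9 !~ /;[ ]*$/ { print "line " NR ": attribute column should end with ;" }
    ' "$1"
}

check_gtf_columns "$GTF"
```

No output means all records passed these basic checks. It does not confirm that the coordinates are biologically correct.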

Core Attributes in Column 9 (Attributes)

The attribute column defines gene–transcript relationships and is critical for quantification. It should include the following.

gene_id "value";: Unique identifier of the gene locus; required on every record.
transcript_id "value";: Unique identifier of the transcript; required on transcript and exon records.
gene_name "value";: Display name of the gene; if missing, many tools fall back to gene_id.
gene_biotype "value";: Biological type (e.g., protein_coding, lncRNA); some tools may use gene_type.
Three-level structure: In addition to these attributes, each gene must have gene, transcript, and exon feature entries that share the same gene_id.
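
To confirm the three-level structure for every gene, the feature types can be grouped by gene_id. The following sketch assumes a well-formed 9-column GTF; the function name `missing_levels` and the toy records are illustrative.

```shell
# Toy GTF: GFP is complete, mCherry has only a gene record.
GTF=$(mktemp)
printf 'GFP\tinsert\tgene\t1\t720\t.\t+\t.\tgene_id "GFP"; gene_name "GFP";\n' > "$GTF"
printf 'GFP\tinsert\ttranscript\t1\t720\t.\t+\t.\tgene_id "GFP"; transcript_id "GFP";\n' >> "$GTF"
printf 'GFP\tinsert\texon\t1\t720\t.\t+\t.\tgene_id "GFP"; transcript_id "GFP";\n' >> "$GTF"
printf 'mCherry\tinsert\tgene\t1\t711\t.\t+\t.\tgene_id "mCherry";\n' >> "$GTF"

# For each gene_id, report which of gene/transcript/exon is absent.
missing_levels() {
    awk -F '\t' '
        /^#/ || NF != 9 { next }
        {
            id = ""
            # Extract the quoted value that follows gene_id in column 9.
            if (match($9, /gene_id "[^"]+"/)) id = substr($9, RSTART + 9, RLENGTH - 10)
            if (id == "") { print "line " NR ": no gene_id"; next }
            if (!(id in order)) { order[id] = ++n; name[n] = id }
            have[id, $3] = 1
        }
        END {
            split("gene transcript exon", want, " ")
            for (i = 1; i <= n; i++)
                for (j = 1; j <= 3; j++)
                    if (!((name[i], want[j]) in have)) print name[i] ": missing " want[j]
        }' "$1"
}

missing_levels "$GTF"
```

Genes are reported in the order they first appear in the file, so the output can be read alongside the GTF.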

Visually Validate Read Alignments

If the FASTA and GTF are correct, the next step is to confirm read coverage in the target region.

How to check: Load the final BAM file and the corresponding genome file into IGV (Integrative Genomics Viewer).

TIP

See igv-reports.

Case A: No read coverage: This suggests the library did not capture the gene’s mRNA. Possible reasons include:
1. Experimental: the gene is not expressed, capture efficiency is low, or the mRNA is degraded.
2. Sequence: the insert sequence differs from the true sequence in the sample, causing alignment failures.
Case B: High read coverage but quantification is 0 or very low: Reads aligned to the region but were filtered out during feature assignment (annotation/quantification). Proceed to the next section.

Deep Dive into BAM Alignment Tags (BAM Tags)

Single-cell quantification tools determine whether a read contributes to UMI counting based on tags in the BAM file. Use samtools view to inspect reads in the target region.

Key Tag Diagnostics

Check the XS tag:
- If XS:Z:Unassigned_NoFeatures is present and there is no XT:Z: tag, the read aligns to an intergenic region or has low alignment quality (MAPQ). Such reads are not counted.
Check multi-gene overlap (XT tag lists more than one gene):
- If the tag shows XT:Z:gene1,gene2, the read overlaps multiple genes. The tool tries to assign it based on exon/intron priority. If it cannot resolve the assignment, the read is discarded.
Check multi-mapping caused by sequence homology (NH tag > 1):
- If XT:Z:gene_id names a single gene but NH:i:N has N > 1, the read has equally scoring alignments at multiple loci (e.g., homologous genes or a shared vector backbone). In strict modes, multi-mapping reads are typically not counted.
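
The diagnostics above can be summarized per region instead of read by read. The sketch below classifies SAM-format records by their NH, XS, and XT tags; the function name `classify_reads`, the class labels, and the toy reads are assumptions, and the logic only mirrors the rules described in this guide, not the internals of any quantifier.

```shell
# Toy SAM records (hypothetical reads) covering the cases described above.
SAM=$(mktemp)
printf 'r1\t0\tGFP\t10\t255\t50M\t*\t0\t0\t*\t*\tNH:i:1\tXS:Z:Assigned\tXT:Z:GFP\n' > "$SAM"
printf 'r2\t0\tGFP\t20\t3\t50M\t*\t0\t0\t*\t*\tNH:i:2\tXS:Z:Assigned\tXT:Z:GFP\n' >> "$SAM"
printf 'r3\t0\tGFP\t30\t255\t50M\t*\t0\t0\t*\t*\tNH:i:1\tXS:Z:Unassigned_NoFeatures\n' >> "$SAM"
printf 'r4\t0\tGFP\t40\t255\t50M\t*\t0\t0\t*\t*\tNH:i:1\tXS:Z:Assigned\tXT:Z:GFP,mCherry\n' >> "$SAM"

# Count reads per diagnostic class from the optional tags (columns 12 onward).
classify_reads() {
    awk -F '\t' '
        /^@/ { next }
        {
            nh = 1; xs = "none"; xt = ""
            for (i = 12; i <= NF; i++) {
                if ($i ~ /^NH:i:/) nh = substr($i, 6) + 0
                else if ($i ~ /^XS:Z:/) xs = substr($i, 6)
                else if ($i ~ /^XT:Z:/) xt = substr($i, 6)
            }
            if (xt == "") cls = "no_gene(" xs ")"
            else if (index(xt, ",") > 0) cls = "multi_gene"
            else if (nh > 1) cls = "multi_mapped"
            else cls = "unique_assigned"
            count[cls]++
        }
        END { for (c in count) print c "\t" count[c] }' "$1" | sort
}

classify_reads "$SAM"
```

With real data, the input could be produced by exporting the target region first, for example `samtools view sample.bam GFP > gfp_reads.sam` on an indexed BAM (the file names here are placeholders), and then running `classify_reads gfp_reads.sam`.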

Case Study: Fluorescent Proteins (eGFP/mCherry) Not Detected

Observed issue: eGFP and mCherry sequences were added in the project. IGV shows very high read coverage, but the final expression matrix reports low counts for both genes.

Root cause analysis: Inspection of BAM shows many reads with NH > 1 that simultaneously map to eGFP and mCherry. Sequence review indicates that the reference was built by treating full exogenous sequences (including identical promoters or shared vector backbone) as transcripts. Because these backbone regions are identical and annotated as exon in the GTF for both genes, the aligner cannot uniquely determine the gene of origin, so the reads are marked as multi-mapping and filtered.

Recommended fix: When building the exogenous gene index, annotate only gene-specific regions (e.g., CDS or 3' UTR) in the GTF. Alternatively, remove duplicated backbone sequences from the FASTA and keep only gene-specific fragments to improve unique mapping.
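
Before rebuilding the index, it can help to confirm that the insert sequences really share identical stretches. The sketch below counts k-mers that occur in more than one FASTA record; the function name `shared_kmers`, the choice of k, and the toy sequences are assumptions, and it compares the forward strand only.

```shell
# Toy FASTA with two records that share the 8-mer ACGTACGT.
FA=$(mktemp)
printf '>insertA\nACGTACGTTTGACCA\n>insertB\nGGGACGTACGTCCC\n' > "$FA"

# Usage: shared_kmers FASTA K
# Prints the number of distinct k-mers present in two or more records.
shared_kmers() {
    awk -v k="$2" '
        /^>/ { id = substr($1, 2); next }
        { seq[id] = seq[id] toupper($0) }
        END {
            for (id in seq)
                for (i = 1; i + k - 1 <= length(seq[id]); i++) {
                    km = substr(seq[id], i, k)
                    # Count each k-mer at most once per record.
                    if (!((km, id) in hit)) { hit[km, id] = 1; owners[km]++ }
                }
            for (km in owners) if (owners[km] > 1) shared++
            print shared + 0
        }' "$1"
}

shared_kmers "$FA" 8
```

A k close to the read length is a reasonable starting point: a large count suggests that many reads could align equally well to both inserts, which matches the multi-mapping pattern described in this case study.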

BAM Field Definitions

A BAM file is the standard format for storing alignment results. Each record contains 11 mandatory fields followed by optional custom tags.

Standard Mandatory Fields

Column	Field	Meaning	Common values and interpretation
1	QNAME	Read ID	Sequence name from the original FASTQ file.
2	FLAG	Alignment flag	Bitwise encoding of the read status; common values for single-end reads are 0 (aligned to the `+` strand), 16 (aligned to the `-` strand), and 4 (unmapped).
3	RNAME	Reference name	Chromosome/contig name (e.g., `chr1`, `chrM`, `eGFP`).
4	POS	Alignment start	Leftmost 1-based alignment coordinate on the reference.
5	MAPQ	Mapping quality	Measures alignment reliability; in STAR, 255 indicates a unique alignment, while lower values indicate multi-mapping (3: two loci; 1: three or four loci; 0: five or more loci).
6	CIGAR	Alignment structure	Describes how the read aligns to the reference; M: match/mismatch; N: skipped region (often intron); S: soft-clipped bases; D/I: deletion/insertion.
7	RNEXT	Mate reference name	In paired-end data, the reference name of the mate; `*` for single-end or unmapped mate.
8	PNEXT	Mate start	Alignment start position of the mate in paired-end data.
9	TLEN	Template length	Observed insert size (template length).
10	SEQ	Read sequence	Base sequence of the read.
11	QUAL	Base qualities	ASCII-encoded base quality scores.
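
Because FLAG is a sum of bit values, values other than 0, 4, and 16 appear in real data (for example, secondary alignments of multi-mapped reads). The sketch below decodes a few commonly seen bits using shell arithmetic; the function name `decode_flag` and the output labels are illustrative, and only a subset of the SAM flag bits is covered.

```shell
# Decode selected SAM FLAG bits: 1 paired, 4 unmapped, 16 reverse strand,
# 256 secondary alignment, 1024 duplicate, 2048 supplementary alignment.
decode_flag() {
    flag=$1
    if [ $((flag & 4)) -ne 0 ]; then
        desc="unmapped"
    elif [ $((flag & 16)) -ne 0 ]; then
        desc="reverse_strand"
    else
        desc="forward_strand"
    fi
    if [ $((flag & 1)) -ne 0 ]; then desc="$desc paired"; fi
    if [ $((flag & 256)) -ne 0 ]; then desc="$desc secondary"; fi
    if [ $((flag & 1024)) -ne 0 ]; then desc="$desc duplicate"; fi
    if [ $((flag & 2048)) -ne 0 ]; then desc="$desc supplementary"; fi
    echo "$desc"
}

for f in 0 16 4 272; do
    printf '%s\t%s\n' "$f" "$(decode_flag "$f")"
done
```

For instance, 272 is 256 + 16, a secondary alignment on the reverse strand.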

Common Custom Tags

After the 11 mandatory fields, tools (e.g., STAR, SeekSoul™ Tools) may add tags in TAG:TYPE:VALUE format to carry additional alignment and quantification information.

Standard Alignment Tags

NH (Number of Hits): Number of genomic loci the read aligns to; NH:i:1 indicates a unique alignment.
HI (Hit Index): Index of the alignment record among multiple hits.
AS (Alignment Score): Alignment score; higher scores indicate better matches.
nM (Number of Mismatches): Number of mismatched bases in the alignment.

Feature Assignment Tags (XS/XN/XT)

These tags describe how a read is assigned to genomic features.

XS (Assignment Status):
- XS:Z:Assigned: Successfully assigned; the read overlaps an annotated feature and matches strand requirements.
- XS:Z:Unassigned_NoFeatures: Unassigned; the read aligns to the genome but overlaps no annotated feature (e.g., intergenic region).
XN (Number of Genes): Number of genes overlapping the alignment position; XN:i:1 indicates one unambiguous gene.
XT (Gene ID): The final gene name/ID used for quantification; XT:Z:GeneName is the key field that determines which gene receives the count.

Single-Cell-Specific Tags

CB (Cell Barcode): Corrected cell barcode.
UB (UMI Barcode): Corrected UMI barcode used for deduplication.
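
Taken together, the CB, UB, XT, and NH tags allow a rough manual recount for a single gene, which is useful for comparing against the expression matrix. The sketch below counts distinct UMIs per cell and gene from SAM-format records, keeping only uniquely mapped reads assigned to one gene. It is a simplified approximation under those assumptions, not the actual counting logic of SeekSoul™ Tools or any other quantifier; the function name `count_umis` and the toy reads are illustrative.

```shell
# Toy SAM records: r2 duplicates r1 (same cell and UMI); r4 is multi-mapped.
SAM=$(mktemp)
rec() { printf '%s\t0\tGFP\t10\t255\t50M\t*\t0\t0\t*\t*\tNH:i:%s\tXT:Z:GFP\tCB:Z:%s\tUB:Z:%s\n' "$1" "$2" "$3" "$4"; }
{
    rec r1 1 AAAC TTTT
    rec r2 1 AAAC TTTT
    rec r3 1 AAAC GGGG
    rec r4 2 CCCG TTTT
    rec r5 1 CCCG AAAA
} > "$SAM"

# Print "cell<TAB>gene<TAB>distinct UMI count".
count_umis() {
    awk -F '\t' '
        /^@/ { next }
        {
            cb = ub = xt = ""; nh = 1
            for (i = 12; i <= NF; i++) {
                if ($i ~ /^CB:Z:/) cb = substr($i, 6)
                else if ($i ~ /^UB:Z:/) ub = substr($i, 6)
                else if ($i ~ /^XT:Z:/) xt = substr($i, 6)
                else if ($i ~ /^NH:i:/) nh = substr($i, 6) + 0
            }
            # Skip reads lacking a tag, multi-mapped reads, and multi-gene reads.
            if (cb == "" || ub == "" || xt == "" || nh != 1 || index(xt, ",") > 0) next
            key = cb SUBSEP xt SUBSEP ub
            if (!(key in seen)) { seen[key] = 1; n[cb "\t" xt]++ }
        }
        END { for (k in n) print k "\t" n[k] }' "$1" | sort
}

count_umis "$SAM"
```

If this recount is far higher than the matrix value, the difference most likely lies in filtering rules not modeled here; if both are low despite high coverage, the tag diagnostics above are the place to look.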
