Skip to content

SeekSoul Tools v1.3.0

Author: SeekGene
Time: 27 min
Words: 5.3k words
Updated: 2025-07-25
Reads: 0 times

SeekSoulTools is a software developed by SEEKGENE for processing single-cell transcriptome data. Currently, the software contains five modules.

1.rna module;This module is used for identifying cell barcode, genome alignment and gene quantification to obtain a feature-barcode matrix that can be used for downstream analsis, followed by cell clustering and differential analysis. This module not only support data from SeekOne™ transcriptome kits, it also supports a variety of customized structure designs.

2.fast module:This module is specifically designed for data produced by the SeekOne™ DD Single Cell Full-length RNA Sequence Transcriptome-seq Kit, which is used for barcode extraction, paired-end read alignment, quantification, and unique metrics analysis for full-length transcriptome data.

3.vdj module: This module is specifically designed for data produced by the SeekOne™ DD Single Cell Immune Profiling Kit, which is used for assembling, filtering, and annotating immune receptors.

4.multivdj module:This module is used for joint analysis of 5' RNA and VDJ data.

5.utils module: The module contains other small tools.

Download

SeekSoul Tools v1.3.0

Download-SeekSoulTools - md5: 21b19d5f10022f0d8caf1a1a9c3f66c6

wget

shell
wget -c -O seeksoultools.1.3.0.tar.gz "https://seekgene-public.oss-cn-beijing.aliyuncs.com/software/seeksoultools/seeksoultools.1.3.0.tar.gz"

curl

shell
curl -C - -o seeksoultools.1.3.0.tar.gz "https://seekgene-public.oss-cn-beijing.aliyuncs.com/software/seeksoultools/seeksoultools.1.3.0.tar.gz"

Installation Guide

IMPORTANT

Before installation, please make sure that the conda environment is properly configured to avoid conflicts with other bioinformatics software environments.

Installation:

shell
 
tar zxf seeksoultools.1.3.0.tar.gz
cd seeksoultools.1.3.0
 
export PATH=`pwd`:$PATH
echo "export PATH=$(pwd):\$PATH" >> ~/.bashrc

# Initialization and installation verification
`pwd`/seeksoultools --version

Installation Guide

Installation:

shell
 
tar zxf seeksoultools.1.3.0.tar.gz
cd seeksoultools.1.3.0
 
export PATH=`pwd`:$PATH
echo "export PATH=$(pwd):\$PATH" >> ~/.bashrc

# Initialization and installation verification
`pwd`/seeksoultools --version

Tutorials

rna module

Data preparation

Download sample datasets

sample datasets - md5: 3d15fcfdefc0722735d726f40ec4e324(Species: Homo sapiens.)

wget

shell
wget -c -O demo_dd.tar "https://seekgene-public.oss-cn-beijing.aliyuncs.com/software/data/demodata/demo_dd.tar"
# decompress
tar xf demo_dd.tar

curl

shell
curl -C - -o demo_dd.tar "https://seekgene-public.oss-cn-beijing.aliyuncs.com/software/data/demodata/demo_dd.tar"
# decompress
tar xf demo_dd.tar

Download and build reference genome

Download-human-reference-GRCh38 - md5: 5473213ae62ebf35326a85c8fba6cc42

Download-mouse-reference-mm10 - md5: 5c7c63701ffd7bb5e6b2b9c2b650e3c2

wget

shell
wget -c -O GRCh38.tar.gz "https://seekgene-public.oss-cn-beijing.aliyuncs.com/software/data/reference/GRCh38.tar.gz"
# decompress
tar -zxvf GRCh38.tar.gz
wget -c -O mm10.tar.gz "https://seekgene-public.oss-cn-beijing.aliyuncs.com/software/data/reference/mm10.tar.gz"

tar -zxvf mm10.tar.gz

curl

shell
curl -C - -o GRCh38.tar.gz "https://seekgene-public.oss-cn-beijing.aliyuncs.com/software/data/reference/GRCh38.tar.gz"
# decompress
tar -zxvf GRCh38.tar.gz

curl -C - -o mm10.tar.gz "https://seekgene-public.oss-cn-beijing.aliyuncs.com/software/data/reference/mm10.tar.gz"

tar -zxvf mm10.tar.gz

The assembly of the reference genome refers to How to build reference genome?

CAUTION

When downloading and extracting the reference genome, please ensure sufficient disk space to avoid file corruption and subsequent analysis failure.

Run SeekSoulTools

Run tests

Example 1: Basic usage

Set up the necessary configuration files for the analysis, including the paths to the sample data, the chemistry versions, the genome index, the gene annotation file, etc. Run the SeekSoulTools using the following command:

shell
seeksoultools rna run \
--fq1 /path/to/demo_dd/demo_dd_S39_L001_R1_001.fastq.gz \
--fq2 /path/to/demo_dd/demo_dd_S39_L001_R2_001.fastq.gz \
--samplename demo_dd \
--genomeDir /path/to/GRCh38/star \
--gtf /path/to/GRCh38/genes/genes.gtf \
--chemistry DDV2 \
--core 4 \
--include-introns

Example 2: Specify a different version of STAR for analysis.

To use a specific version of STAR for analysis while ensuring compatibility with the –genomeDir generated by that version, you can run the SeekSoulTools using the following command, specifying the path to the desired version of STAR:

shell
seeksoultools rna run \
--fq1 /path/to/demo_dd/demo_dd_S39_L001_R1_001.fastq.gz \
--fq2 /path/to/demo_dd/demo_dd_S39_L001_R2_001.fastq.gz \
--samplename demo_dd \
--genomeDir /path/to/GRCh38/star \
--gtf /path/to/GRCh38/genes/genes.gtf \
--chemistry DDV2 \
--core 4 \
--include-introns \
--star_path /path/to/cellranger-5.0.0/lib/bin/STAR

NOTE

The --star_path parameter must strictly match the version of the reference genome index, otherwise the alignment will fail.

Example 3: A sample has multiple sets of fastq files

If a sample has multiple sets of FASTQ data, you can provide the paths to all the FASTQ files associated with that sample when running the SeekSoulTools. Here's an example command:

shell
seeksoultools rna run \
--fq1 /path/to/demo_dd_S39_L001_R1_001.fastq.gz \
--fq1 /path/to/demo_dd_S39_L002_R1_001.fastq.gz \
--fq2 /path/to/demo_dd_S39_L001_R2_001.fastq.gz \
--fq2 /path/to/demo_dd_S39_L002_R2_001.fastq.gz \
--samplename demo \
--genomeDir /path/to/GRCh38/star \
--gtf /path/to/GRCh38/genes/genes.gtf \
--chemistry DDV2 \
--core 4 \
--include-introns

Example 4: Customize the structure of R1

To customize the structure of the Read 1 (R1) FASTQ files, here's an example command:

shell
seeksoultools rna run \
--fq1 /path/to/demo_dd_S39_L001_R1_001.fastq.gz \
--fq2 /path/to/demo_dd_S39_L001_R2_001.fastq.gz \
--samplename demo \
--genomeDir /path/to/GRCh38/star \
--gtf /path/to/GRCh38/genes/genes.gtf \
--barcode /path/to/utils/CLS1.txt \
--barcode /path/to/utils/CLS2.txt \
--barcode /path/to/utils/CLS3.txt \
--linker /path/to/utils/Linker1.txt \
--linker /path/to/utils/Linker2.txt \
--structure B9L12B9L13B9U8 \
--core 4 \
--include-introns
  • The structure of read1 is represented by B9L12B9L13B9U8, which means it consists of three sections of cell barcode, each with 9 bases, and a UMI section with 8 bases. The linker section between the cell barcode and UMI consists of two parts, with the first part being 12 bases and the second part being 13 bases
  • Use --barcode to specify the three sections of barcodes sequentially, and use --linker to specify the two sections of linkers sequentially.

TIP

When customizing the structure, it is recommended to test the structure parameters with a small sample first to ensure correct extraction of barcode/linker/UMI.

Parameter descriptions
ParametersDescriptions
--fq1Paths to R1 fastq files.
--fq2Paths to R2 fastq files.
--samplenameSample name. A directory will be created named after the sample name in the outdir directory. Only digits, letters, and underscores are supported.
--outdirOutput directory. Default: ./
--genomeDirThe path of the reference genome generated by STAR. The version needs to be consistent with the STAR used by SeekSoulTools.
--gtfPath to the GTF file for the corresponding species.
--coreNumber of threads used for the analysis.
--chemistryReagent type, with each type corresponding to a combination of --shift,--pattern, --structure, --barcode, and--sc5p. Available options: DDV2, DD5V1, MM, MM-D.
DDV2 corresponds to the SeekOne™ DD Single Cell 3' Transcriptome-seq Kit.
DD5V1 corresponds to the SeekOne™ DD Single Cell 5' Transcriptome-seq Kit.
MM corresponds to the SeekOne™ MM Single Cell Transcriptome Kit.
MM-D corresponds to the SeekOne™ MM Large-well Single Cell Transcriptome-seq Kit.
--skip_misBIf enabled, no base mismatch is allowed for barcode. Default is 1.
--skip_misLIf enabled, no base mismatch is allowed for linker. Default is 1.
--skip_multiIf enabled, discard reads that can be corrected to multiple white-listed barcodes. Barcodes are corrected to the barcode with the highest frequency by default.
--expectNumEstimated number of captured cells.
--forceCellWhen number of cells obtained from analysis is abnormal, add this parameter with expected value N. SeekSoulTools will select the top N cells based on UMI from high to low.
--include-intronsWhen disabled, only exon reads are used for quantification. When enabled, intron reads are also used for quantification.
--star_pathPath to another version of STAR for alignment. The version must be compatible with the --genomeDir version. The default --star_path is the STAR in the environment.

WARNING

Improper parameter settings (such as chemistry, genomeDir, gtf, etc.) will directly lead to analysis failure or abnormal results. Please double-check the parameter meanings and input paths.

Output descriptions

Here's the output directory structure: each line represents a file or folder, indicated by "├──", and the numbers indicate three important output files.

shell
./
├── demo_report.html                        1 
├── demo_summary.csv                        2
├── demo_summary.json                    
├── step1                                
   └──demo_2.fq.gz                      
├── step2                                
   ├── featureCounts                    
   └── demo_SortedByName.bam        
   └── STAR                             
       ├── demo_Log.final.out           
       └── demo_SortedByCoordinate.bam  
├── step3
   ├── filtered_feature_bc_matrix          3
   └── raw_feature_bc_matrix            
└──step4                                 
    ├── FeatureScatter.png               
    ├── FindAllMarkers.xls               
    ├── mito_quantile.xls                
    ├── nCount_quantile.xls              
    ├── nFeature_quantile.xls            
    ├── resolution.xls                   
    ├── top10_heatmap.png                
    ├── tsne.png                         
    ├── tsne_umi.png                     
    ├── tsne_umi.xls                     
    ├── umap.png                         
    └── VlnPlot.png
  1. Final report in html
  2. Quality control information in csv
  3. Filtered feature-barcode matrix

NOTE

If the output directory is missing key files, please first check whether there are errors in the upstream steps or incorrect parameter settings.

Algorithms Overview

step1: barcode/UMI extraction

SeekSoulTools is able to extract the barcode and UMI sequences based on different Read1 structures. The pipeline processes the barcodes and filters Read1 and its corresponding Read2, and an updated fastq file is generated at the end of step 1.

Structure design and description
  • Barcode and UMI descriptions: Describing the basic structure of Read1 using letters and numbers, where the letters represent the meaning of the nucleotides and the numbers represent the length of the nucleotides.

    B: barcode bases L: linker bases U: UMI bases X: other arbitrary bases used as placeholders

    Taking the following two Read1 structures as examples: B8L8B8L10B8U8: MM0

    B17U12: DD

  • Anchor-based misalignment design: In MM design, to increase the base balance of the linker portion during sequencing,1-4 bp shifted bases, called anchors, were added. The anchor determines the starting position of the barcode. MM

Workflow

Anchor determination:

For data with misaligned Read1 (produced by MM reagents), SeekSoulTools attempts to find the anchor sequence within the first 7 bases of the Read1 sequence to determine the start position of subsequent barcodes. If the anchor sequence is not found, the corresponding Read1 and R2 are considered invalid reads.

Barcode and linker correction:

After determining the starting position of the barcode, the corresponding sequence is extracted based on the structural design. When the extracted barcode sequence is found in the whitelist, it is considered a valid barcode and the read count with valid barcodes is recorded. When the barcode is not found in the whitelist, it is considered an invalid barcode.

During sequencing process, there is a certain probability that sequencing error occurs. With the presense of a whitelist, SeekSoulTools are able to do barcode correction. When correction is enabled, if sequences that are one base differ (one humming distance apart) from the invalid barcode appear in the whitelist, we will consider the following circumstances:

  • If there is one and only sequence exists in the whitelist, we will correct the invalid barcode to the barcode in the whitelist.
  • If multiple sequences exist in the whitelist, we will correct the invalid barcode to the sequence with the highest read support.

The logic of linker correction is the same as that of barcode correction.

Adaptors and polyA sequence trimming

In transcriptome, polyA tail and Adapter sequences introduced during library preparation, may appear at the end of Read2.We remove these contaminating sequences and make sure the trimmed Read2 length is greater than the minimum length for accurate alignment afterward. If the trimmed Read2 length is less than the minimum length, we consider the read to be invalid.

After the processing procedure described above, the data composition is shown in the following figure:

step1

  • total: Total reads
  • valid: Number of reads without correction and number of successfully corrected reads
  • B_corrected: Number of successfully corrected reads
  • B_no_correction: Number of reads with incorrect barcodes
  • L_no_correction: Number of reads with incorrect linkers
  • no_anchor:Number of reads without anchors
  • trimmed: Number of reads that have been trimmed
  • too_short: Number of reads that are shorter than 60bps after trimming

The relationships between the metrics are as follows:

total = valid + no_anchor + B_no_correction + L_no_correction

Number of reads in the updated FASTQ file: total_output = valid - too_short

step2: Alignment
Sequence Alignment
  • SeekSoulTools uses an aligner software called STAR to perform splicing-aware alignment of reads to the genome.
  • After the alignment procedure has mapped the reads to the genome, SeekSoulTools uses another software called QualiMap along with a gene annotation transcript file GTF to bucket the reads into exonic, intronic, and intergenic regions.
  • Using featureCounts, the reads aligned to the genome were annotated to genes, with different rules such as strand specificity and features used for annotation. When using exon for annotation, a read is annotated to the corresponding gene if over 50% of its bases are aligned to the exon. When using transcripts for annotation, a read is annotated to the corresponding gene if over 50% of its bases are aligned to the transcript.

After the processing procedure described above, the metrics is shown below:

  • Reads Mapped to Genome: Fraction of reads that aligned to the genome (including both uniquely mapped reads and multi-mapped reads)
  • Reads Mapped Confidently to Genome: Fraction of reads that uniquely mapped to the genome
  • Reads Mapped to Intergenic Regions:Fraction of reads that mapped to intergenic regions
  • Reads Mapped to Intronic Regions:Fraction of reads that mapped to intronic regions
  • Reads Mapped to Exonic Regions:Fraction of reads that mapped to exonic regions
step3: Counting
UMI counting

SeekSoulTools extracts group of reads that shared the same barcode from output BAM file and counts the number of UMIs annotated to genes and the number of reads corresponding to each UMI.

  • Filter out reads whose UMIs consist of repetitive bases. For example, TTTTTTTT.
  • If a read has multiple annotations and only one gene annotation is from exon, it is considered a valid read, and all others are filtered.
UMI correction

UMIs may also have sequencing errors during sequencing process. By default, SeekSoulTools uses the adjacency from UMI-tools to correct UMIs.

Schematic from UMI-toolsImage source: https://umi-tools.readthedocs.io/en/latest/the_methods.html

Cell calling

In a cell population, we assume that the mRNA content of cells does not differ significantly. If the mRNA content of a barcode is very low, we consider that barcode might be 'empty', and the mRNA comes from the background. SeekSoulTools uses the following steps to determine whether a barcode has a cell:

  • Sort all barcodes in descending order based on their corresponding UMI counts.
  • The threshold is calculated by dividing the number of UMIs for the barcode at the 1% position of the estimated cells by 10.
  • Barcodes with UMIs greater than the threshold are considered to be cells.
  • If the UMI of a barcode is less than the threshold but greater than 300, DropletUtils is used for further analysis. The DropletUtils method first assumes that barcodes with UMI counts below 100 are empty droplets. It then uses the total UMI count for each gene in each droplet as the expected UMI count for that gene in the background RNA expression profile. Next, it performs a statistical test on the UMI counts of each barcode, identifying those with significant differences as cells.
  • Those do not meet the above criteria are considered as background.

After the processing procedure described above, the metrics is shown below:

  • Estimated Number of Cells: Total number of cells by cell calling
  • Fraction Reads in Cells: Fraction of reads after cell calling among all reads used for counting
  • Mean Reads per Cell: Average number of reads per cell, Total number of reads/Number of cells after cell calling
  • Median Genes per Cell: Median gene count in barcodes after cell calling
  • Median UMI Counts per Cell: Median UMI count in barcodes after cell calling
  • Total Genes Detected: Number of genes detected in all cells
  • Sequencing Saturation: 1 - Total UMI count/Total read count
step4: Downstream analysis

SeekSoulTools perform downstream analysis when we have gene expression matrix from step3.

Seurat analysis

SeekSoulTools use Seurat to calculate the mitochondrial content, number of genes, and UMIs of each cell. After that, the gene expression matrix is normalized, and a subset of features that exhibit high cell-to-cell variation in the dataset is identified. Linear dimensional reduction using PCA is then performed, and the result is passed to t-SNE and UMAP for visualization. A graph-based clustering procedure is then followed, and cells are partitioned into different clusters. Finally, SeekSoulTools finds markers that define clusters via differential expression.

fast module reference fast module

vdj module

Data preparation

Download sample datasets

sample datasets - md5:afbd2deac59b581a4d44a0f73655d71d(Species: Homo sapiens.)

shell
wget -c -O PBMC_xin.tar "https://seekgene-public.oss-cn-beijing.aliyuncs.com/software/data/demodata/PBMC_xin.tar"
# decompress
tar xf PBMC.tar
shell
curl -C - -o PBMC_xin.tar "https://seekgene-public.oss-cn-beijing.aliyuncs.com/software/data/demodata/PBMC_xin.tar"
# decompress
tar xf PBMC.tar

Run SeekSoulTools

Run tests

Example 1: T cell receptor analysis example

shell
seeksoultools vdj run \
--fq1 /path/to/demo_tcr/demo_tcr_R1.fq.gz \
--fq2 /path/to/demo_tcr/demo_tcr_R2.fq.gz \
--samplename demo_tcr \
--chain TR \
--core 16  \
--outdir /path/to/ouput/demo_tcr \
--organism human

Example 2: B cell receptor analysis example

shell
seeksoultools vdj run \
--fq1 /path/to/demo_bcr/demo_bcr_R1.fq.gz \
--fq2 /path/to/demo_bcr/demo_bcr_R2.fq.gz \
--samplename demo_bcr \
--chain IG \
--core 16  \
--outdir /path/to/ouput/demo_bcr \
--organism human
Parameter descriptions
ParametersDescriptions
--fq1Paths to R1 fastq files.
--fq2Paths to R2 fastq files.
--samplenameSample name. Only digits, letters, and underscores are supported
--organismorganism, Available options: human, mouse, monkey, rabbit or rat.
--chainchain type, Available options: IG and TR. IG for B clel receptor and TR for T cell receptor.
--coreNumber of threads used for the analysis
--outdirOutput directory. Absolute path, and ensure that the outdir for each task is unique.
--cfgWhen the species is not human or mouse, the ref file needs to be configured by itself, and the file that records three REFs is the value of the parameter.
--read_pairWhether or not to use paired reads to do the assembly, defaults to False.
--keep_tmpWhether to keep the intermediate files of trust4, the default is not to keep them, when this parameter is specified, these intermediate files are compressed.
--matrixrna matrix data. If this parameter is specified, the intersection of the barcode of vdj and matrix is taken before find clone, and the path to the filter matrix of the rna data is specified if required
--rna_wdWhen analyzing RNA data, if this parameter is specified along with the –matrix parameter, a combined websummary for RNA and VDJ will be generated. When specifying this parameter, please ensure that the sample names for RNA and VDJ are consistent.

Example of building a ref file

Approach One: Utilize the trust4 software to retrieve data from the IMGT database and construct the required reference sequences.
shell
BuildImgtAnnot.pl Homo_sapien > IMGT+C.fa
grep ">" IMGT+C.fa | cut -f2 -d'>' | cut -f1 -d'*' | sort | uniq > bcr_tcr_gene_name.txt
BuildDatabaseFa.pl refdata-GRCh38/fasta/genome.fa refdata-GRCh38/genes/genes.gtf bcr_tcr_gene_name.txt > bcrtcr.fa
Approach Two: If the gene names in the genes.gtf file do not encompass the required BCR/TCR gene names, configure them independently.
shell
### Retrieve all species sequences from the IMGT database
wget -c https://www.imgt.org/download/GENE-DB/IMGTGENEDB-ReferenceSequences.fasta-nt-WithGaps-F+ORF+inframeP

### Extract the required sequences based on the species information in the ID and format them according to the requirements of TRUST4.
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
with open("IMGTGENEDB-ReferenceSequences.fasta-nt-WithGaps-F+ORF+inframeP", "r") as file, open('IMGT+C.fa', 'w') as out:
    for record in SeqIO.parse(file, "fasta"):
        #if "IG" not in record.description and "TR" not in record.description:
        if 'Bos taurus_Holstein' in record.description:
            new_id = record.description.strip().split('|')[1]
            new_seq = record.seq.upper()
            new_record = SeqRecord(Seq(new_seq), id=new_id, description="")
            SeqIO.write(new_record, out, "fasta")
Prepare the leader file.

Download "L-PART1+L-PART2" from IMGT -Separate downloads for IG and TR are required, with IG and TR sequences being saved as two different files -Download IGH, IGK, and IGL separately, and then merge them into a single file. -Download TRA and TRB separately, and then merge them into a single file.

shell
cat IG_L-PART1+L-PART2.fa |awk '/^>/ {print; next} {printf "%s", toupper($0)}' | awk 'BEGIN{RS=">"; FS="\n"} NR>1 {print ">" $1 "\n" $2}'
Write into the cfg file.

The format of the cfg file is as follows:
fa:"/your/path/bcrtcr.fa"
ref:"/your/path/IMGT+C.fa"
leader:"/your/path/IG_L-PART1+L-PART2.fa"

  • If the reference (ref) information is configured through Approach Two, both the fa and ref can be specified as IMGT+C.fa.

Output descriptions

Here's the output directory structure: each line represents a file or folder, indicated by "├──".

shell
.
├── Analysis
   ├── dandelion
   ├── qc
   ├── summary.json
   └── trust4
└── Outs
    ├── demo_TCR_airr_rearrangement.tsv               AIRR format result file
    ├── demo_TCR_filtered_contig_annotations.tsv      Annotation results for filtered contigs
    ├── demo_TCR_cell_barcodes.json                   All barcode for filtered contigs
    ├── demo_TCR_clonotypes.tsv                       Clonotype clustering information
    ├── demo_TCR_consensus.fasta                      Consensus sequences in FASTA format per clonotype chain
    ├── demo_TCR_filtered_contig.fasta                FASTA sequences of filtered assembled contigs
    ├── demo_TCR_report.html                          Analysis report
    └── metrics_summary.csv                           Quality control metrics summary file

Algorithms Overview

barcode/UMI extraction

SeekSoulTools is able to extract the barcode and UMI sequences based on different Read1 structures. The pipeline processes the barcodes and filters Read1 and its corresponding Read2, and an updated fastq file is generated at the end of step 1.

Structure design and description
  • Barcode and UMI descriptions: Describing the basic structure of Read1 using letters and numbers, where the letters represent the meaning of the nucleotides and the numbers represent the length of the nucleotides.

    B: barcode bases
    L: linker bases
    U: UMI bases
    X: other arbitrary bases used as placeholders

The Read1 structure of the SeekOne™ DD Single Cell Immune Profiling Kit is B17U12:
DD

Workflow

Barcode and linker correction:

After determining the starting position of the barcode, the corresponding sequence is extracted based on the structural design. When the extracted barcode sequence is found in the whitelist, it is considered a valid barcodeand the read count with valid barcodes is recorded. When the barcode is not found in the whitelist, it is considered an invalid barcode.

During sequencing process, there is a certain probability that sequencing error occurs. With the presense of a whitelist, SeekSoulTools are able to do barcode correction. When correction is enabled, if sequences that are one base differ (one humming distance apart) from the invalid barcode appear in the whitelist, we will consider the following circumstances:

  • If there is one and only sequence exists in the whitelist, we will correct the invalid barcode to the barcode in the whitelist.
  • If multiple sequences exist in the whitelist, we will correct the invalid barcode to the sequence with the highest read support.

Adaptors and polyA sequence trimming:

In transcriptome, polyA tail and Adapter sequences introduced during library preparation, may appear at the end of Read2.We remove these contaminating sequences and make sure the trimmed Read2 length is greater than the minimum length for accurate alignment afterward. If the trimmed Read2 length is less than the minimum length, we consider the read to be invalid.

After the processing procedure described above, the metrics is shown below:

  • total: Total reads
  • valid: Number of reads without correction and number of successfully corrected reads
  • B_corrected: Number of successfully corrected reads
  • B_no_correction: Number of reads with incorrect barcodes
  • trimmed: Number of reads that have been trimmed
  • too_short: Number of reads that are shorter than 60bps after trimming
Statistics on the proportion of VDJ gene

Construct STAR index based on ref and compare reads, count the number of TRA, TRB, TRD or IGK, IGL, IGH and the corresponding ratio based on bam files

Sequence Assembly

1、Use the fastq-extractor module of trust4 to extract the reads that may contain TCR/BCR sequences, and then do downsample for each barcode, and only take the first 80,000 reads for cells with more than 80,000 reads.
2、According to the barcode, filter out the umi that support reads less than n50, and then use umitools to correct the umi.
3、Based on the above reads were assembled with trust4.
Please refer to trust4 for detailed description.

VDJ annotation

The annotator of trust4 was utilized for the annotation of VDJC genes and CDR3 positions.For the case of multiple annotations for per chain in per cell, the contig with the highest umi value was selected as the primary_chain.The UMI count is calculated based on the reads information of each contig, with a filtering step performed before counting: when the same UMI is used for multiple contig assemblies, these contigs are sorted by reads in descending order. If the ratio of reads between the second highest contig and the highest contig is less than 0.15, that UMI is only counted for the highest contig, and no UMI counting is performed for the others.

Calculation Clonotype

1、Select contigs that meet chain requirements; filter out contigs with 1-UMI/reads ratio less than 0.94; remove cells where the sum of UMIs supporting the contigs is less than 3.
2、If 5' matrix information is provided, take the intersection of cell barcodes.
3、Utilize dandelion to determine clones based on CDR3 amino acid sequence similarity, with similarity thresholds set at 85% for BCR and 100% for TCR. For detailed criteria, refer to:sc-dandelion
4、For single-chain clonotypes, identify all other contigs sharing the same VJ genes and CDR3 amino acid sequence. Calculate a threshold as 1% of the n50 UMI count of these contigs. Discard the single-chain if its UMI count falls below this threshold.
5、For the retained single-chain clonotypes, identify all dual-chain clonotypes sharing the same VJ genes and CDR3 amino acid sequence. Sort by cell count. If the ratio of cell counts between the second largest and the largest dual-chain clonotype is less than 0.15, merge the single-chain clonotype into the largest dual-chain clonotype; otherwise, maintain them separately.
6、Sort by new clonotypes based on cell counts, generate new clone IDs, and provide exact IDs.

multivdj module

This module is used for joint analysis of 5' RNA and VDJ data.

Run SeekSoulTools

shell
seeksoultools multivdj run \
        --rnafq1 /path/to/demo/rna.demo.R1.fq.gz \
        --rnafq2 /path/to/demo/rna.demo.R2.fq.gz \
        --tcrfq1 /path/to/demo/tcr.demo.R1.fq.gz \
        --tcrfq2 /path/to/demo/tcr.demo.R2.fq.gz \
        --samplename demo_multi \
        --organism human \
        --core 8  \
	--genomeDir /path/to/GRCh38/star  \
        --gtf /path/to/GRCh38/genes/genes.gtf \
        --include-introns \
        --outdir ./

Parameter descriptions

ParametersDescriptions
--rnafq1The input FASTQ R1 file for RNA sequencing data. Required.
--rnafq2The input FASTQ R2 file for RNA sequencing data. Required.
--tcrfq1The input FASTQ R1 file for TCR sequencing data.
--tcrfq2The input FASTQ R2 file for TCR sequencing data.
--bcrfq1The input FASTQ R1 file for BCR sequencing data.
--bcrfq2The input FASTQ R2 file for BCR sequencing data. Optional. Note: At least one set of parameters (–tcrfq or –bcrfq) must be specified. This can be either tcrfq1 and tcrfq2, or bcrfq1 and bcrfq2, or all four parameters.
--samplenameSample name. A directory will be created named after the sample name in the outdir directory. Only digits, letters, and underscores are supported.
--organismorganism, Available options: human,mouse
--cfgWhen the species is not human or mouse, the ref file needs to be configured by itself, and the file that records three REFs is the value of the parameter.
--coreNumber of threads used for the analysis
--genomeDirPath to another version of STAR for alignment. The version must be compatible with the –genomeDir version. The default –star_path is the STAR in the environment.
--gtfPath to the GTF file for the corresponding species.
--include-intronsWhen disabled, only exon reads are used for quantification. When enabled, intron reads are also used for quantification.
--start_pathPath to another version of STAR for alignment. The version must be compatible with the –genomeDir version. The default –star_path is the STAR in the environment.

utils module

addtag

Modify BAM files to include barcode and UMI tags.

Run tests

Set up necessary configuration files for analysis, including the sample's BAM file and the umi.xls file in the step3 directory. Run the SeekSoulTools using the following

shell
seeksoultools utils addtag \
    --inbam step2/featureCounts/Samplename_SortedByName.bam \
    --umifile step3/umi.xls \
    --outbam Samplename_addtag.bam
Parameter descriptions
ParametersDescriptions
--inbamsample's BAM file in the step2/featureCount directory.
--outbamThe path of the BAM file with added tags.
--umifilehe path of the umi.xls file in the step3 directory.

FAQ

How to build reference genome?

The assembly of the reference genome refers to How to build reference genome?

Release Notes

v1.3.0

  • Update the algorithm for the V(D)J module.
  • The output path names have been standardized.
  • In the FAST and FFPE modules, the 17 base with low-quality bases are removed following the extraction of Reads1 UMI.

v1.2.2

  • Add vdj analysis module
  • Support for FFPE samples in the fast module
  • Fix the statistical error in gene median calculation caused by missing lncRNA type in the GTF file.
  • Resolve the issue of missing Ensembl IDs for differentially expressed genes due to gene name duplication.
  • Other improvements to enhance usability.

v1.2.1

  • Update the style of the report
  • Add trimming for SP1, SP2, TSO, and adapters
  • Add tool for adding tags to BAM files
  • Enhance support for non-standard GTF files
  • Enhance the FAST module to include support for species other than human and mouse

v1.2.0

  • Add output of read1 fastq file after removing barcode and UMI sequences
  • Add fast analysis module
  • Apply UMI-tools correction method
  • Update rules for handling annotations of multiple genes in reads: When there is a unique exon annotation, consider the read as valid

v1.0.0

  • Initial release
0 comments·0 replies