SeekArc Tools v1.0.0

Author: SeekGene

Time: 12 min

Words: 2.4k words

Updated: 2025-07-25

Reads: 0 times

SeekArc Tools v1.0.0 is a software developed by SEEKGENE for processing single-cell ATAC multi-omics data. The software independently processes transcriptome libraries and ATAC libraries for tasks, such as: Barcode extraction and correction, alignment. Quantification of each barcode, joint cell identification, and generation of cell expression matrices for downstream analysis. Enables subsequent clustering and differential analysis, as well as calculation of associations between genes and peaks for downstream interpretation.

Download

SeekArc Tools v1.0.0

Download-SeekArc Tools - md5: af6e054480ca082a554b499fe79c3a48

wget

shell

mkdir seekarctools_v1.0.0
cd seekarctools_v1.0.0
wget -c -O seekarctools_v1.0.0.tar.gz "https://seekgene-public.oss-cn-beijing.aliyuncs.com/software/seekarctools/seekarctools_v1.0.0.tar.gz"

curl

shell

mkdir seekarctools_v1.0.0
cd seekarctools_v1.0.0
curl -C - -o seekarctools_v1.0.0.tar.gz "https://seekgene-public.oss-cn-beijing.aliyuncs.com/software/seekarctools/seekarctools_v1.0.0.tar.gz"

Installation Guide

Installation:

shell

# decompress
tar zxf seekarctools_v1.0.0.tar.gz

# install
source ./bin/activate
./bin/conda-unpack

# export path in bashrc
export PATH=`pwd`/bin:$PATH
echo "export PATH=$(pwd)/bin:\$PATH" >> ~/.bashrc

Confirm installation:

shell

seekarctools_py --version

Tutorials

Data preparation

Download sample datasets

sample datasets - md5: 45a48572f5ffd00dd6e6d79e3ad495d4（Species：mouse）

wget

shell

wget -c -O demo_data.tar.gz "https://seekgene-public.oss-cn-beijing.aliyuncs.com/sgarc_demo/rawdata/demo_data.tar.gz"
# decompress
tar -zxf demo_data.tar.gz

curl

shell

curl -C - -o demo_data.tar.gz "https://seekgene-public.oss-cn-beijing.aliyuncs.com/sgarc_demo/rawdata/demo_data.tar.gz"
# decompress
tar -zxf demo_data.tar.gz

Download reference genome

Download-human-reference-GRCh38 - md5: a9ce266ead3d89bc0aa77cfa646ca309
Download-mouse-reference-GRCm38 - md5: c4f2244f1d0560d5d3b4c71da06fd08f

wget

shell

# human
wget -c -O refdata-arc-GRCh38.tar.gz "https://seekgene-public.oss-cn-beijing.aliyuncs.com/sgarc_demo/reference/refdata-arc-GRCh38.tar.gz"
# mouse
wget -c -O refdata-arc-GRCm38.tar.gz "https://seekgene-public.oss-cn-beijing.aliyuncs.com/sgarc_demo/reference/refdata-arc-GRCm38.tar.gz"
# decompress
tar -zxvf refdata-arc-GRCh38.tar.gz

curl

shell

# human
curl -C - -o refdata-arc-GRCh38.tar.gz "https://seekgene-public.oss-cn-beijing.aliyuncs.com/sgarc_demo/reference/refdata-arc-GRCh38.tar.gz"
# mouse
curl -C - -o refdata-arc-GRCm38.tar.gz "https://seekgene-public.oss-cn-beijing.aliyuncs.com/sgarc_demo/reference/refdata-arc-GRCm38.tar.gz"
# decompress
tar -zxvf refdata-arc-GRCh38.tar.gz

The assembly of the reference genome refers to How to build reference genome?

Run SeekArc Tools v1.0.0

Run tests

Example 1: Basic usage

Set up the necessary configuration files for the analysis, including sample data paths, reference genome paths, etc. Run the SeekArc Tools v1.0.0 using the following command:

shell

seekarctools_py arc run \
    --rnafq1 /path/to/demo/demo_GE_S1_L001_R1_001.fastq.gz \
    --rnafq2 /path/to/demo/demo_GE_S1_L001_R2_001.fastq.gz \
    --atacfq1 /path/to/demo/demo_arc_S2_L002_R1_001.fastq.gz \
    --atacfq2 /path/to/demo/demo_arc_S2_L002_R2_001.fastq.gz \
    --samplename demo \
    --outdir /path/to/outdir \
    --refpath /path/to/reference/GRCh38 \
    --include-introns \
    --core 16

Example 2: Adjusting thresholds for re-running peak calling or cell calling

If the identified peaks or cells under default parameters do not meet requirements, users can adjust parameters to skip preceding steps (e.g., alignment) and re-run the workflow to optimize results while saving time.

shell

seekarctools_py arc retry \
    --samplename demo \
    --outdir /path/to/outdir \
    --refpath /path/to/reference/GRCh38 \
    --core 16 \
    --qvalue 0.01 \
    --snapshift 0 \
    --extsize 200 \
    --min_len 200 \
    --min_atac_count 1000 \
    --min_gex_count 500

NOTE

Ensure all files from previous runs remain intact and are not deleted or relocated. The --outdir specifies the directory path from the initial SeekArc Tools v1.0.0 analysis. The --min_atac_count must be used together with --min_gex_count; it will not take effect if used alone.

Parameter descriptions

Parameters	Parameter descriptions
--rnafq1	Paths to R1 fastq files of RNA library
--rnafq2	Paths to R2 fastq files of RNA library
--atacfq1	Paths to R1 fastq files of ATAC library
--atacfq2	Paths to R2 fastq files of ATAC library
--samplename	Sample name
--outdir	Output directory. Default: ./
--skip_misB	If enabled, no base mismatch is allowed for barcode. Default is 1.
--skip_misL	If enabled, no base mismatch is allowed for linker. Default is 1.
--skip_multi	If enabled, discard reads that can be corrected to multiple white-listed barcodes. Barcodes are corrected to the barcode with the highest frequency by default.
--skip_len	Skip filtering short reads after adapter filter, short reads will be used.
--core	Number of threads used for the analysis.
--include-introns	When disabled, only exon reads are used for quantification. When enabled, intron reads are also used for quantification.
--refpath	The path of reference genome.
--star_path	External STAR software path. If the index in the reference genome is built by another STAR, please specify its path.
--qvalue	Minimum FDR (q-value) cutoff for peak detection. Default: 0.05.
--nolambda	If True, MACS3 will use the background lambda as local lambda. This means MACS3 will not consider the local bias at peak candidate regions.
--snapshift	MACS3 peak detection shift size. Default: 0.
--extsize	MACS3 peak detection extension size. Default: 400.
--min_len	Minimum length of peaks. If not set, it will be set to "extsize". Default: 400.
--broad	If enabled, perform broad peak calling and generate results in UCSC gappedPeak format, which encapsulates the nested structure of peaks.
--broad_cutoff	Threshold for broad peak calling. Default: 0.1.
--min_atac_count	Cell caller override: define the minimum number of ATAC transposition events in peaks (ATAC counts) for a cell barcode.
--min_gex_count	Cell caller override: define the minimum number of GEX UMI counts for a cell barcode.
-h, --help	Show this parameter descriptions.

Output descriptions

"Here's the output directory structure: each line represents a file or folder, indicated by "├──", and the numbers indicate three important output files.

shell

./
├── outs
│   ├── filtered_feature_bc_matrix              1
│   │   ├── barcodes.tsv.gz
│   │   ├── features.tsv.gz
│   │   └── matrix.mtx.gz
│   ├── filtered_peaks_bc_matrix                2
│   │   ├── barcodes.tsv.gz
│   │   ├── features.tsv.gz
│   │   └── matrix.mtx.gz
│   ├── raw_feature_bc_matrix                   3
│   │   ├── barcodes.tsv.gz
│   │   ├── features.tsv.gz
│   │   └── matrix.mtx.gz
│   ├── raw_peaks_bc_matrix                     4
│   │   ├── barcodes.tsv.gz
│   │   ├── features.tsv.gz
│   │   └── matrix.mtx.gz
│   ├── demo_A_fragments.tsv.gz                 5
│   ├── demo_A_fragments.tsv.gz.tbi             6
│   ├── demo_A_mem_pe_Sort.bam                  7
│   ├── demo_A_peaks.bed                        8
│   ├── demo_E_SortedByName.bam                 9
│   ├── demo.rds                                10
│   ├── demo_report.html                        11
│   └── demo_summary.csv                        12
└── analysis

Transcriptome cell sparse matrix
ATAC cell sparse matrix
Transcriptome raw sparse matrix
ATAC raw sparse matrix
ATAC per fragment information file
Fragments file index
The bam file of ATAC libiary
ATAC peaks
The bam file of RNA libiary
Matrix in rds format
Final report in html
Quality control information in csv

Algorithms Overview

RNA library analysis

estep1: barcode/UMI extraction

RNA library structure: DDAG_gex

Barcode processing:

The corresponding sequence is extracted based on the structural design. When the extracted barcode sequence is found in the whitelist, it is considered a valid barcode and the read count with valid barcodes is recorded. When the barcode is not found in the whitelist, it is considered an invalid barcode.

TIP

During sequencing process, there is a certain probability that sequencing error occurs. With the presence of a whitelist, SeekArc Tools v1.0.0 are able to do barcode correction. When correction is enabled, if sequences that are one base differ (one humming distance apart) from the invalid barcode appear in the whitelist, we will consider the following circumstances:

If there is one and only sequence exists in the whitelist, we will correct the invalid barcode to the barcode in the whitelist.
If multiple sequences exist in the whitelist, we will correct the invalid barcode to the sequence with the highest read support.

Adapter processing:

In the RNA library, the end of Read2 may contain the reverse complement sequence of 7F, which could be an adapter sequence introduced during library preparation. We remove these contaminating sequences, and the length of the trimmed Read2 needs to be longer than the set minimum length to ensure sufficient information for accurate alignment to the genome. If the length of Read2 after trimming is shorter than the minimum length, we consider this read as an invalid read.

estep2: Alignment

Sequence Alignment

SeekArc Tools v1.0.0 uses an aligner software called STAR to perform splicing-aware alignment of reads to the genome.
After the alignment procedure has mapped the reads to the genome, SeekArc Tools v1.0.0 uses another software called QualiMap along with a gene annotation transcript file GTF to bucket the reads into exonic, intronic, and intergenic regions.
Using featureCounts, the reads aligned to the genome were annotated to genes, with different rules such as strand specificity and features used for annotation. When using exon for annotation, a read is annotated to the corresponding gene if over 50% of its bases are aligned to the exon. When using transcripts for annotation, a read is annotated to the corresponding gene if over 50% of its bases are aligned to the transcript.

After the processing procedure described above, the metrics is shown below:

Reads Mapped to Genome: Fraction of reads that aligned to the genome (including both uniquely mapped reads and multi-mapped reads)
Reads Mapped Confidently to Genome: Fraction of reads that uniquely mapped to the genome
Reads Mapped to Intergenic Regions：Fraction of reads that mapped to intergenic regions
Reads Mapped to Intronic Regions：Fraction of reads that mapped to intronic regions
Reads Mapped to Exonic Regions：Fraction of reads that mapped to exonic regions

estep3: Counting

UMI counting

SeekArc Tools v1.0.0 extracts group of reads that shared the same barcode from output BAM file and counts the number of UMIs annotated to genes and the number of reads corresponding to each UMI.

Filter out reads whose UMIs consist of repetitive bases. For example, TTTTTTTT.
If a read has multiple annotations and only one gene annotation is from exon, it is considered a valid read, and all others are filtered.

UMI correction：

NOTE

UMIs may also have sequencing errors during sequencing process. By default, SeekArc Tools v1.0.0 uses the adjacency from UMI-tools to correct UMIs.

umi-correction

Image source: https://umi-tools.readthedocs.io/en/latest/the_methods.html

ATAC library analysis

astep1: Barcode extraction

Atac library structure: DDAG_atac

Barcode processing:

Same processing method as estep1.

Adapter and reads processing:

In the Atac library, Read1 may contain a fixed sequence at the end, which could include adapter sequences; the end of Read2 may contain the reverse complement of the fixed sequence and adapter sequences. We remove these contaminating sequences, and the length of the trimmed Read1 or Read2 needs to be longer than the set minimum length to ensure sufficient information for accurate alignment to the genome. If the length of Read1 or Read2 after trimming is shorter than the minimum length, we consider this read as an invalid read; if the length of Read1 or Read2 after trimming is greater than 50, we truncate the first 50bp of the Read1 or Read2 sequence; if the length of Read1 or Read2 after trimming is less than 50 but greater than the minimum length, we retain the entire sequence of Read1 or Read2.

estep2: Alignment

Sequence Alignment：

Use bwa mem to align the processed Read1 and Read2 to the reference genome.

astep3: Counting

Use SnapATAC2 v2.8.0 to process the BAM file output from astep2.

Read the BAM file to obtain ATAC sequencing and mapping metrics.
Generate the raw fragments file from the BAM file. Filter out mitochondria fragments to obtain the filtered fragments file.
Generate an Anndata object from the filtered fragments file.
Plot the distribution of TSS score.
Use MACS3 to call peaks, filter duplicate peaks, and output to peaks.bed.
Generate raw_peaks_bc_matrix dir based on peaks.
Calculate metrics for each barcode and output to per_barcode_metrics.csv.
Plot the distribution of insert sizes.
Cell calling
Generate filtered_feature_bc_matrix and filtered_peaks_bc_matrix.
Calculate ATAC library metrics for cells and targeting.

Cell calling

IMPORTANT

If --min_atac_count and --min_gex_count are specified:

Then determine the barcode as a cell if the UMI count is greater than the min_gex_count value and the events count is greater than the min_atac_count value.

If not specified:

First, retain barcodes where the UMI count is greater than 1 or the events in peaks count is greater than 1.
Then, filter barcodes where "the fraction of fragments overlapping called peaks" is less than "the fraction of genome in peaks".
Deduplicates. Filter barcodes with identical UMI values and events in peaks values.
Perform log10 transformation on the UMI values and events in peaks values of the filtered barcodes.
Based on the transformed UMI values and events in peaks values, use the KMeans algorithm to divide the barcodes into two clusters. The cluster with the higher mean is identified as cells, while the other is not.
Map the barcodes filtered out during deduplication to cells or non-cells based on the determination of the remaining barcodes.

astep4: Subsequent analysis

After the above steps, with the obtained cell expression matrix and ATAC matrix, we can proceed to the next stage of analysis.

Use Signac v1.14.0 to merge the cell expression matrix and ATAC matrix into a single Seurat object, Filter out features not in the database, then normalize the expression matrix and ATAC matrix separately, identify highly variable genes, perform dimensionality reduction and clustering, find differential genes, conduct WNN clustering, and finally calculate feature links.

Release notes

v1.0.0

Initial release

Release of a stable version supporting standardized analysis workflow for ATAC multi-omics data.

SeekArcTools

Dataset

SeekArc Tools v1.0.0

Download

Installation Guide

Installation:

Confirm installation:

Tutorials

Data preparation

Download sample datasets

Download reference genome

Run SeekArc Tools v1.0.0

Run tests

Parameter descriptions

Output descriptions

Algorithms Overview

RNA library analysis

estep1: barcode/UMI extraction

estep2: Alignment

estep3: Counting

ATAC library analysis

astep1: Barcode extraction

estep2: Alignment

astep3: Counting

Cell calling

astep4: Subsequent analysis

Release notes

v1.0.0

Initial release

SeekArc Tools v1.0.0 ​

Download ​

Installation Guide ​

Installation: ​

Confirm installation: ​

Tutorials ​

Data preparation ​

Download sample datasets ​

Download reference genome ​

Run SeekArc Tools v1.0.0 ​

Run tests ​

Parameter descriptions ​

Output descriptions ​

Algorithms Overview ​

RNA library analysis ​

estep1: barcode/UMI extraction ​

estep2: Alignment ​

estep3: Counting ​

ATAC library analysis ​

astep1: Barcode extraction ​

estep2: Alignment ​

astep3: Counting ​

Cell calling ​

astep4: Subsequent analysis ​

Release notes ​

v1.0.0 ​

Initial release ​

SeekArc Tools v1.0.0

Download

Installation Guide

Installation:

Confirm installation:

Tutorials

Data preparation

Download sample datasets

Download reference genome

Run SeekArc Tools v1.0.0

Run tests

Parameter descriptions

Output descriptions

Algorithms Overview

RNA library analysis

estep1: barcode/UMI extraction

estep2: Alignment

estep3: Counting

ATAC library analysis

astep1: Barcode extraction

estep2: Alignment

astep3: Counting

Cell calling

astep4: Subsequent analysis

Release notes

v1.0.0

Initial release