Barcode-Converter
This tool converts CellBarcodes in single-cell sequencing data.
Conversion workflow
First, detect CellBarcodes in R1 of the input FASTQ files. One mismatch is allowed for CellBarcode matching.
Then, complete the conversion using the whitelist mapping.
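The two steps above can be sketched in Python as follows. This is a minimal illustration, not the tool's actual implementation: the function names `build_mismatch_index` and `convert_barcode` are hypothetical, the mismatch is assumed to be a single substitution (no insertions or deletions), and an exact whitelist hit is assumed to take priority over one-mismatch candidates.

```python
def build_mismatch_index(whitelist):
    """Map every sequence within one substitution of a whitelist barcode
    to the set of whitelist barcodes it could have come from."""
    index = {}
    for barcode in whitelist:
        index.setdefault(barcode, set()).add(barcode)
        for i, original in enumerate(barcode):
            for base in "ACGTN":
                if base != original:
                    variant = barcode[:i] + base + barcode[i + 1:]
                    index.setdefault(variant, set()).add(barcode)
    return index


def convert_barcode(observed, index, mapping):
    """Return the converted barcode, or None when there is no unique match."""
    if observed in mapping:
        # Assumption: an exact whitelist hit wins over any one-mismatch candidate.
        return mapping[observed]
    candidates = index.get(observed, set())
    if len(candidates) == 1:
        return mapping[next(iter(candidates))]
    # Unmatched, or several whitelist barcodes are one mismatch away.
    return None


# Toy 4-base barcodes for illustration only; real whitelists are much longer.
mapping = {"AAAA": "CCCC", "GGGG": "TTTT"}
index = build_mismatch_index(mapping)
print(convert_barcode("AAAT", index, mapping))
```

In this sketch, reads with several candidates simply return `None`; the tool itself keeps such reads in the `multi_*fastq.gz` files described under Output files.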
Download
NOTE
Download and extract the package.
Quick start
TIP
Before converting, make sure conv.0.1.2 is properly installed and the corresponding whitelist files are prepared.
Example 1: Convert DD CellBarcodes to 10x
--wl1 specifies the input whitelist file, --wl2 specifies the output whitelist file, --rs specifies the R1 structure pattern that locates the CellBarcode, -t specifies the number of threads, and -o specifies the output directory.
/path/to/conv.0.1.2 --fq1 ./demo_dd_S39_L001_R1_001.fastq.gz --fq2 ./demo_dd_S39_L001_R2_001.fastq.gz --wl1 ./P3CB.barcode.txt.gz --wl2 3M-february-2018.txt.gz --rs 17C+T -t 12 -o output/
Example 2: Convert multiple files for the same sample
--fq1 and --fq2 accept multiple files separated by spaces; make sure the R1 and R2 files are listed in the same order.
/path/to/conv.0.1.2 --fq1 ./demo_dd_S39_L001_R1_001.fastq.gz ./demo_dd_S39_L001_R1_002.fastq.gz --fq2 ./demo_dd_S39_L001_R2_001.fastq.gz ./demo_dd_S39_L001_R2_002.fastq.gz --wl1 ./P3CB.barcode.txt.gz --wl2 3M-february-2018.txt.gz --rs 17C+T -t 12 -o output/
Example 3: Use an existing CellBarcode mapping
Use this when multiple omics of the same sample need a consistent mapping.
/path/to/conv.0.1.2 --fq1 ./demo_dd_S39_L001_R1_001.fastq.gz --fq2 ./demo_dd_S39_L001_R2_001.fastq.gz --map ../map.tsv --rs 17C+T -t 12 -o output/
Options
Option | Description |
---|---|
--fq1 ... | Input R1 FASTQ files; you can specify multiple runs for one sample, space-separated |
--fq2 ... | Input R2 FASTQ files; you can specify multiple runs for one sample, space-separated |
--rs | R1 structure pattern using digits/+ and letters: digits = base count, + = remaining bases, C = CellBarcode bases, T = other bases [default: 17C+T]. DD series use 17C+T |
--wl1 | Whitelist file for the input FASTQ chemistry; DD series use barcode/P3CB.barcode.txt |
--wl2 | Whitelist file for the target (output) chemistry; e.g., 3' libraries use 3M-february-2018.txt.gz, 5' libraries use 737K-august-2016.txt.gz |
--map | Barcode mapping file with two tab-separated columns: first = input whitelist barcode, second = output whitelist barcode. --map takes precedence over --wl1 and --wl2. Either --wl1/--wl2 or --map must be provided |
--no-multi | Redistribute reads with multiple matching barcodes; enabled by default |
-t, --threads | Number of threads [default: 10] |
-o, --out | Output directory [default: ./] |
-h, --help | Print help |
-V, --version | Print version |
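The --rs pattern described in the table can be read as a sequence of (length, letter) segments. The sketch below is one possible parser under that reading; `parse_read_structure` and `extract_cell_barcode` are hypothetical helper names, and only the two letters documented above (C and T) are accepted. How the tool itself parses the pattern is not documented here.

```python
import re


def parse_read_structure(pattern):
    """Split a pattern such as '17C+T' into (length, kind) segments.
    A length of None stands for '+', i.e. all remaining bases."""
    segments = []
    pos = 0
    for match in re.finditer(r"(\d+|\+)([CT])", pattern):
        if match.start() != pos:
            raise ValueError(f"unexpected text in pattern: {pattern!r}")
        length, kind = match.groups()
        segments.append((None if length == "+" else int(length), kind))
        pos = match.end()
    if not segments or pos != len(pattern):
        raise ValueError(f"malformed pattern: {pattern!r}")
    return segments


def extract_cell_barcode(read, segments):
    """Concatenate the bases covered by 'C' segments."""
    parts = []
    offset = 0
    for length, kind in segments:
        end = len(read) if length is None else offset + length
        if kind == "C":
            parts.append(read[offset:end])
        offset = end
    return "".join(parts)


segments = parse_read_structure("17C+T")
print(segments)  # [(17, 'C'), (None, 'T')]
```

With the default `17C+T`, the first 17 bases of R1 are treated as the CellBarcode and everything after them as other bases.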
Output files
The output directory contains:
<OUT>/*fastq.gz: Converted FASTQ files
<OUT>/multi_*fastq.gz: Intermediate FASTQ files for reads with multiple matching barcodes; candidate barcodes joined by "_"
<OUT>/map.txt: Barcode mapping (two columns, tab-separated): first = input whitelist, second = output whitelist
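To inspect or sanity-check a mapping file before reusing it, something like the following may help. It is a small sketch that assumes the file has no header row and exactly the two tab-separated columns described above; `load_barcode_map` and `is_one_to_one` are hypothetical names, not part of the tool.

```python
import csv
import io


def load_barcode_map(handle):
    """Read a two-column, tab-separated mapping into a dict
    (input barcode -> output barcode). Assumes no header row."""
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        mapping[row[0]] = row[1]
    return mapping


def is_one_to_one(mapping):
    """True when no output barcode is shared by two input barcodes."""
    return len(set(mapping.values())) == len(mapping)


# In practice, open "<OUT>/map.txt" instead of this in-memory example.
example = io.StringIO("AAAA\tCCCC\nGGGG\tTTTT\n")
barcode_map = load_barcode_map(example)
print(barcode_map, is_one_to_one(barcode_map))
```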
Notes
IMPORTANT
Vendors may provide multiple whitelists, and different products may use different ones. Set --wl1 and --wl2 correctly. 10x Genomics barcodes are defined in cellranger-*/lib/python/cellranger/chemistry_defs.json or cellranger-5.0.1/lib/python/cellranger/chemistry.py; barcodes are under cellranger-*/lib/python/cellranger/barcodes/.
NOTE
The tool compares the number of CellBarcodes in --wl1 and --wl2. If --wl1 has more barcodes than --wl2, 10M reads are sampled to count the CellBarcodes actually present in the input. If the input FASTQ contains more unique CellBarcodes than the --wl2 whitelist holds, the most frequent input CellBarcodes are mapped to the --wl2 barcodes. Otherwise, all observed input CellBarcodes are mapped to --wl2 barcodes and the remaining --wl2 barcodes are assigned randomly.
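The mapping rule in this note can be illustrated with a short sketch. It is an interpretation rather than the tool's code: `build_mapping` is a hypothetical function, observed barcodes are paired with --wl2 barcodes in file order, ties in read count are broken alphabetically, and "randomly assign the remaining --wl2 barcodes" is assumed to mean pairing them with randomly chosen unobserved --wl1 barcodes.

```python
import random


def build_mapping(observed_counts, wl1, wl2, seed=0):
    """Pair input barcodes with output barcodes.
    observed_counts: dict of input barcode -> read count from the sample."""
    rng = random.Random(seed)
    wl1_set = set(wl1)
    targets = list(wl2)
    # Most frequent observed barcodes first; alphabetical order breaks ties.
    ranked = sorted(
        (bc for bc in observed_counts if bc in wl1_set),
        key=lambda bc: (-observed_counts[bc], bc),
    )
    # zip stops at the shorter list, so surplus observed barcodes are dropped.
    mapping = dict(zip(ranked, targets))
    if len(ranked) < len(targets):
        # Assumption: leftover output barcodes go to random unobserved inputs.
        leftovers = [bc for bc in wl1 if bc not in mapping]
        rng.shuffle(leftovers)
        mapping.update(zip(leftovers, targets[len(ranked):]))
    return mapping


wl1 = ["AAAA", "CCCC", "GGGG", "TTTT"]  # toy input whitelist
wl2 = ["ACGT", "TGCA"]                  # toy output whitelist
counts = {"AAAA": 5, "CCCC": 10, "GGGG": 1}
print(build_mapping(counts, wl1, wl2))
```

Here three barcodes are observed but only two output barcodes exist, so the least frequent one ("GGGG") receives no mapping.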
WARNING
When --no-multi is set, after counting reads per CellBarcode, reads with multiple matching barcodes are redistributed. Candidate barcodes are sorted by read count, and the read is assigned to the barcode with the most reads. If the top two barcodes have equal read counts, the read is not assigned.
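The redistribution rule can be sketched as below. `resolve_multi` is a hypothetical helper; the candidates are given as the "_"-joined string used in the multi_*fastq.gz files, though where exactly that string is stored in each record is not specified here.

```python
def resolve_multi(joined_candidates, read_counts):
    """Pick one barcode for a read with several candidates, or None on a tie.
    joined_candidates: candidate barcodes joined by '_'.
    read_counts: dict of barcode -> number of reads already assigned to it."""
    candidates = [bc for bc in joined_candidates.split("_") if bc]
    if not candidates:
        return None
    ranked = sorted(candidates, key=lambda bc: read_counts.get(bc, 0), reverse=True)
    if len(ranked) >= 2 and read_counts.get(ranked[0], 0) == read_counts.get(ranked[1], 0):
        return None  # top two are tied, so the read stays unassigned
    return ranked[0]


counts = {"AAAA": 10, "AATT": 3}
print(resolve_multi("AAAA_AATT", counts))  # AAAA
```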
TIP
For multiple omics of the same sample (e.g., 5'/TCR/BCR), use a consistent mapping. Convert one data type first (RNA library is recommended), then reuse its output map.txt to convert other data types.
NOTE
conv.0.1.2 is an upgraded version of conv: it fixes high memory usage when the number of worker threads is small by setting read_ahead's chunk_size and chunk_queue_size to the square of the thread count instead of the default 100.