Barcode-Converter
This tool converts CellBarcodes in single-cell sequencing data.
Conversion workflow
First, detect CellBarcodes in R1 of the input FASTQ files. One mismatch is allowed for CellBarcode matching.
Then, complete the conversion using the whitelist mapping.
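The two workflow steps above can be illustrated with a minimal Python sketch. This is not the tool's actual implementation: the function names, the `ACGTN` alphabet, and the rule that an exact hit takes priority over one-mismatch hits are all assumptions made for illustration.

```python
def one_mismatch_variants(barcode, alphabet="ACGTN"):
    """Yield every sequence that differs from `barcode` at exactly one position."""
    for i, original in enumerate(barcode):
        for base in alphabet:
            if base != original:
                yield barcode[:i] + base + barcode[i + 1:]


def match_barcode(observed, whitelist):
    """Return the whitelist barcodes within one mismatch of `observed`.

    Assumption: an exact hit wins outright. Otherwise every one-mismatch hit
    is returned, so the result may hold zero, one, or several candidates.
    """
    if observed in whitelist:
        return [observed]
    return sorted(v for v in one_mismatch_variants(observed) if v in whitelist)


def convert_barcode(observed, whitelist, mapping):
    """Convert `observed` through `mapping` when it matches exactly one barcode."""
    candidates = match_barcode(observed, whitelist)
    if len(candidates) != 1:
        return None  # unmatched or ambiguous read, not converted here
    return mapping[candidates[0]]


# Toy 4-base barcodes, hypothetical data for illustration only.
whitelist = {"AAAA", "CCCC", "AATT"}
mapping = {"AAAA": "GGGG", "CCCC": "TTTT", "AATT": "ACGT"}
converted = convert_barcode("AAAC", whitelist, mapping)   # one mismatch from AAAA
ambiguous = match_barcode("AAAT", whitelist)              # near AAAA and AATT
```

In this toy data, `AAAC` has a single candidate and is converted, while `AAAT` is one mismatch away from two whitelist entries and is therefore a read with multiple matching barcodes.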
Download
NOTE
Download and extract the package.
Quick start
TIP
Before converting, make sure conv.0.1.2 is properly installed and the corresponding whitelist files are prepared.
Example 1: Convert DD CellBarcodes to 10x
--wl1 specifies the input whitelist file, --wl2 specifies the output whitelist file, --rs specifies the start position pattern of the converted CellBarcode, -t specifies the number of threads, and -o specifies the output directory.
/path/to/conv.0.1.2 --fq1 ./demo_dd_S39_L001_R1_001.fastq.gz --fq2 ./demo_dd_S39_L001_R2_001.fastq.gz --wl1 ./P3CB.barcode.txt.gz --wl2 3M-february-2018.txt.gz --rs 17C+T -t 12 -o output/
Example 2: Convert multiple files for the same sample
--fq1 and --fq2 can accept multiple files separated by spaces; ensure the order of R1 and R2 files matches.
/path/to/conv.0.1.2 --fq1 ./demo_dd_S39_L001_R1_001.fastq.gz ./demo_dd_S39_L001_R1_002.fastq.gz --fq2 ./demo_dd_S39_L001_R2_001.fastq.gz ./demo_dd_S39_L001_R2_002.fastq.gz --wl1 ./P3CB.barcode.txt.gz --wl2 3M-february-2018.txt.gz --rs 17C+T -t 12 -o output/
Example 3: Use an existing CellBarcode mapping
Use this when multiple omics of the same sample need a consistent mapping.
/path/to/conv.0.1.2 --fq1 ./demo_dd_S39_L001_R1_001.fastq.gz --fq2 ./demo_dd_S39_L001_R2_001.fastq.gz --map ../map.tsv --rs 17C+T -t 12 -o output/
Options
| Option | Description |
|---|---|
| --fq1 ... | Input R1 FASTQ files; you can specify multiple runs for one sample, space-separated |
| --fq2 ... | Input R2 FASTQ files; you can specify multiple runs for one sample, space-separated |
| --rs | R1 structure pattern using digits/+ and letters: digits = base count, + = remaining bases, C = CellBarcode bases, T = other bases [default: 17C+T]. DD series use 17C+T |
| --wl1 | Whitelist file for the input FASTQ chemistry; DD series use barcode/P3CB.barcode.txt |
| --wl2 | Whitelist file for the target (output) chemistry; e.g., 3' libraries use 3M-february-2018.txt.gz, 5' libraries use 737K-august-2016.txt.gz |
| --map | Barcode mapping file with two tab-separated columns: first = input whitelist barcode, second = output whitelist barcode. --map takes precedence over --wl1 and --wl2. Either --wl1 and --wl2, or --map, must be provided |
| --no-multi | Redistribute reads with multiple matching barcodes; enabled by default |
| -t, --threads | Number of threads [default: 10] |
| -o, --out | Output directory [default: ./] |
| -h, --help | Print help |
| -V, --version | Print version |
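
To make the --rs pattern syntax concrete, here is a rough Python sketch of how a pattern such as 17C+T could be parsed and applied to an R1 sequence. The function names are hypothetical, and the rule that `+` may only appear in the last segment is an assumption; the tool's real parser may behave differently.

```python
import re


def parse_read_structure(pattern):
    """Split a pattern such as '17C+T' into (length, kind) segments.

    `length` is an int, or None for '+' (all remaining bases).
    `kind` is 'C' (CellBarcode bases) or 'T' (other bases).
    """
    segments = []
    pos = 0
    for m in re.finditer(r"(\d+|\+)([CT])", pattern):
        if m.start() != pos:
            raise ValueError(f"unexpected text in pattern: {pattern!r}")
        length = None if m.group(1) == "+" else int(m.group(1))
        segments.append((length, m.group(2)))
        pos = m.end()
    if not segments or pos != len(pattern):
        raise ValueError(f"invalid pattern: {pattern!r}")
    # Assumption: '+' consumes the rest of the read, so it must come last.
    if any(length is None for length, _ in segments[:-1]):
        raise ValueError("'+' may only appear in the last segment")
    return segments


def extract_cell_barcode(read, pattern):
    """Concatenate the bases covered by every 'C' segment of `read`."""
    parts = []
    offset = 0
    for length, kind in parse_read_structure(pattern):
        end = len(read) if length is None else offset + length
        if kind == "C":
            parts.append(read[offset:end])
        offset = end
    return "".join(parts)
```

With the default pattern, `extract_cell_barcode(r1_sequence, "17C+T")` returns the first 17 bases of R1 and treats everything after them as non-barcode bases.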
Output files
The output directory contains:
<OUT>/*fastq.gz: Converted FASTQ files
<OUT>/multi_*fastq.gz: Intermediate FASTQ files for reads with multiple matching barcodes; candidate barcodes are joined by "_"
<OUT>/map.txt: Barcode mapping (two columns, tab-separated): first = input whitelist, second = output whitelist
Notes
IMPORTANT
Vendors may provide multiple whitelists, and different products may use different ones, so set --wl1 and --wl2 to match your chemistry. For 10x Genomics, the whitelist used by each chemistry is defined in cellranger-*/lib/python/cellranger/chemistry_defs.json or cellranger-5.0.1/lib/python/cellranger/chemistry.py; the whitelist files themselves are under cellranger-*/lib/python/cellranger/barcodes/.
NOTE
The tool compares the number of CellBarcodes in --wl1 and --wl2. If --wl1 has more barcodes than --wl2, 10M reads are sampled to count the CellBarcodes actually present in the input. If the input FASTQ contains more unique CellBarcodes than --wl2 has entries, only the most frequent input CellBarcodes are mapped to --wl2 barcodes. Otherwise, all observed input CellBarcodes are mapped to --wl2 barcodes, and the remaining --wl2 barcodes are assigned randomly.
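The mapping rule in this note can be sketched in Python as follows. This is only an interpretation: `build_mapping` and its parameters are hypothetical names, the pairing order between input and output barcodes is assumed to be random, and "randomly assign the remaining --wl2 barcodes" is read here as pairing them with unobserved --wl1 barcodes.

```python
import random
from collections import Counter


def build_mapping(observed_counts, wl1, wl2, seed=0):
    """Map input barcodes to output barcodes when wl1 is larger than wl2.

    observed_counts: Counter of reads per input barcode (from sampled reads).
    wl1, wl2: lists of input and output whitelist barcodes.
    """
    rng = random.Random(seed)
    targets = list(wl2)
    rng.shuffle(targets)  # assumption: output barcodes are handed out in random order

    # Observed input barcodes, most frequent first.
    ranked = [bc for bc, _ in observed_counts.most_common()]

    if len(ranked) > len(targets):
        # More observed barcodes than output slots: keep the most frequent ones.
        sources = ranked[:len(targets)]
    else:
        # Every observed barcode gets a slot; the leftover output barcodes go
        # to randomly chosen unobserved input barcodes (assumption).
        seen = set(ranked)
        unseen = [bc for bc in wl1 if bc not in seen]
        rng.shuffle(unseen)
        sources = ranked + unseen[:len(targets) - len(ranked)]

    return dict(zip(sources, targets))
```

Fixing the seed keeps the sketch reproducible; whether the tool itself seeds its random assignment is not stated in this document.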
WARNING
When --no-multi is set, reads with multiple matching barcodes are redistributed after the reads per CellBarcode have been counted. The candidate barcodes are sorted by read count, and the read is assigned to the candidate with the highest count. If the top two candidates have equal read counts, the read is left unassigned.
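The redistribution rule can be expressed as a short Python sketch. `resolve_multi` is a hypothetical name, and treating a barcode with no counted reads as having a count of 0 is an assumption.

```python
def resolve_multi(candidates, read_counts):
    """Pick the candidate barcode with the most reads, or None on a tie.

    candidates: barcodes that one read matched.
    read_counts: dict of reads per barcode, counted beforehand.
    """
    # Sort candidates by read count, highest first (missing barcodes count as 0).
    ranked = sorted(candidates, key=lambda bc: read_counts.get(bc, 0), reverse=True)
    if not ranked:
        return None
    # If the top two candidates are tied, the read is skipped.
    if len(ranked) > 1 and read_counts.get(ranked[0], 0) == read_counts.get(ranked[1], 0):
        return None
    return ranked[0]
```

Since the candidate barcodes in the multi_*fastq.gz files are joined by "_", a call might look like `resolve_multi(joined_candidates.split("_"), read_counts)`, where `joined_candidates` is the joined string taken from a read in those files.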
TIP
For multiple omics of the same sample (e.g., 5'/TCR/BCR), use a consistent mapping. Convert one data type first (RNA library is recommended), then reuse its output map.txt to convert other data types.
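Before reusing a mapping for another data type, it can help to sanity-check the file. The following Python sketch reads a two-column, tab-separated mapping file into a dictionary. The function name is hypothetical, the file is assumed to be uncompressed plain text, and the one-to-one check is an extra precaution rather than a documented requirement of the tool.

```python
import csv


def load_barcode_map(path):
    """Read a two-column TSV (input barcode, output barcode) into a dict."""
    mapping = {}
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        for line_no, row in enumerate(reader, start=1):
            if not row:
                continue  # tolerate blank lines
            if len(row) != 2:
                raise ValueError(
                    f"{path}:{line_no}: expected 2 columns, got {len(row)}"
                )
            source, target = row
            if source in mapping:
                raise ValueError(f"{path}:{line_no}: duplicate input barcode {source}")
            mapping[source] = target
    # Extra precaution (assumption): no two inputs share an output barcode.
    if len(set(mapping.values())) != len(mapping):
        raise ValueError(f"{path}: two input barcodes map to the same output barcode")
    return mapping
```

If the file loads without errors, its path can be passed to --map for the remaining libraries of the same sample.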
NOTE
conv.0.1.2 is an upgraded version of conv. It fixes high memory usage when the number of worker threads is small: read_ahead's chunk_size and chunk_queue_size are now set to the square of the thread count instead of the default of 100.
