scMethyl + RNA Multi-omics: CNV Copy Number Variation Analysis (Based on CopyKit)
Module Introduction
This module is developed based on CopyKit (v0.1.3), designed for inferring genome-wide copy number variations (CNV) from single-cell sequencing data.
CopyKit is a powerful R toolkit capable of systematic preprocessing, normalization, segmentation, and clustering analysis of single-cell data. By detecting genome-wide coverage changes, it achieves the following core analysis goals:
- Variation Detection: Accurately locate chromosomal amplification and deletion events in tumor cells and map genome-wide CNV profiles.
- Cell Classification: Effectively distinguish normal cells (diploid) from tumor cells (non-diploid/aneuploid) based on CNV features.
- Clonal Evolution: Explore intra-tumor heterogeneity and clonal evolution trajectories through subclone structure reconstruction.
Note: This analysis is mainly applicable to human tumor samples, relying on human reference genomes (hg19/hg38) for alignment and annotation.
Input File Preparation
This module supports two types of input files. Please choose according to your data processing stage:
Option A: Directly Import Analyzed Results (Recommended)
If you have already run the runVarbin pipeline of CopyKit, you can directly use the generated .rds file. This is the fastest way.
File Structure Example:
workspace/data/AY1768876207519/methylation/demo3MDA-task-2/MDA_cs/cnvPreflight/
├── copykit_MDA_cs_1Mb.rds # 1Mb resolution result
├── copykit_MDA_cs_500kb.rds # 500kb resolution result
└── copykit_MDA_cs_220kb.rds # 220kb resolution resultOption B: Start from Single-Cell BAM Files
If you have split single-cell BAM files, you can use them directly as input.
Note: Input BAM files must be Flag corrected and Deduplicated to ensure quantification accuracy.
File Structure Example:
./
├── AAAGAAGAAGAGTTTAG.dedup.bam
├── AAAGAAGAAGGAGTGGT.dedup.bam
├── ...
└── TTTGGATATTTGTGAGG.dedup.bamDetailed Tutorial: For how to generate the above standard single-cell BAM files (including splitting, Flag correction, deduplication), please refer to the complete guide: Auide
CopyKit Analysis Pipeline
Key Parameter Settings
During the analysis process, the following three parameters are crucial for data quality control and result resolution:
| Parameter Name | Description | Default Value | Detailed Explanation |
|---|---|---|---|
bin_sizes | Genome window size | 220kb, 500kb, 1Mb | Determines the resolution of CNV analysis. • Smaller (e.g., 220kb): Higher resolution, capable of detecting smaller variations, but requires higher sequencing depth and may increase noise. • Larger (e.g., 1Mb): Smoother data, lower noise, but only capable of detecting large fragment variations. Note: Modifying this parameter requires re-running the runVarbin step, which takes BAM files as input and is time-consuming. |
min_bincount | Minimum Reads abundance | 10 | Quality Control (QC) parameter. Requires the average Reads count of each cell in each Bin to be greater than this value. Used to exclude low-quality cells with insufficient sequencing depth that cannot accurately infer CNV. |
pearson_corr_cutoff | Pearson correlation filtering threshold | 0.8 (Software default 0.9) | Quality Control (QC) parameter (do not confuse with bin_size).CopyKit calculates the Pearson correlation between each cell and its nearest neighbors (k=5). If the correlation is lower than this threshold, the cell is considered an "outlier" or low-quality cell and is filtered out. • Increasing this value yields purer clones but may lose some heterogeneous cells. |
Best Practice Suggestions
About
bin_sizesSelection:- This module defaults to calculating 220kb, 500kb, and 1Mb resolutions simultaneously.
- Resolutions supported by CopyKit: 55kb, 110kb, 195kb, 220kb, 280kb, 500kb, 1Mb, 2.8Mb.
- Strategy: If the sample sequencing depth is low, it is recommended to prioritize 1Mb or 500kb results; if the depth is high, 220kb can be used for finer results.
- CopyKit has preset and optimized these Bin regions, filtering out repetitive sequences and difficult-to-align regions.
About Filtering Parameters (
min_bincount&resolution):- If too few cells are retained, you can appropriately lower
min_bincount(e.g., to 5), but this will increase noise. - If the clustering results contain a large number of messy single cells, you can appropriately increase the
pearson_corr_cutoffthreshold (e.g., to 0.9).
- If too few cells are retained, you can appropriately lower
library(copykit)
library(BiocParallel)
library(ggplot2)
# --- Input Parameter Configuration ---
## genome_version: Reference genome version for data alignment
genome_version <- 'hg38'
## outdir: Output directory for results
outdir <- '/PROJ2/FLOAT/weiqiuxia/project/20241227_methy/script/standard_reports/modules/copykit/test/results/MDA_cs'
## bam_dir: Directory containing single-cell BAM files (Option B)
bam_dir <- NULL
## rds_path: Directory containing generated RDS files (Option A)
rds_path <- '/PROJ2/FLOAT/weiqiuxia/project/20241227_methy/script/standard_reports/modules/copykit/test/input/MDA_cs/'
## samplename: Sample name identifier
samplename <- 'MDA_cs'
## min_bincount: Minimum average Reads count filtering threshold
# Used to filter out low-quality cells with low Read Counts, suggested to keep default
min_bincount <- 10
## bin_sizes: Analysis resolution list
bin_sizes <- c('220kb', '500kb', '1Mb')
## pearson_corr_cutoff: Correlation filtering threshold (for findOutliers)
pearson_corr_cutoff <- 0.8
## is_paired_end: Whether it is paired-end sequencing (TRUE/FALSE)
is_paired_end <- TRUE
## remove_Y: Whether to remove Y chromosome (TRUE/FALSE)
remove_Y <- TRUE
## threads: Number of parallel computing threads
threads <- 8
## skipKNNSmooth: Whether to skip KNNSmooth (TRUE/FALSE)
skipKNNSmooth <- FALSE
dir.create(outdir, recursive = TRUE)plotConsensusLine_static <- function(scCNA, sample) {
# bindings for NSE
start <- xstart <- xend <- abspos <- NULL
####################
## checks
####################
if (nrow(consensus(scCNA)) == 0) {
stop("Slot consensus is empty. Run calcConsensus()")
}
####################
## aesthetic setup
####################
# obtaining chromosome lengths
chr_ranges <-
as.data.frame(SummarizedExperiment::rowRanges(scCNA))
chr_lengths <- rle(as.numeric(chr_ranges$seqnames))$lengths
# obtaining first and last row of each chr
chr_ranges_start <- chr_ranges %>%
dplyr::group_by(seqnames) %>%
dplyr::arrange(seqnames, start) %>%
dplyr::filter(dplyr::row_number() == 1) %>%
dplyr::ungroup()
chr_ranges_end <- chr_ranges %>%
dplyr::group_by(seqnames) %>%
dplyr::arrange(seqnames, start) %>%
dplyr::filter(dplyr::row_number() == dplyr::n()) %>%
dplyr::ungroup()
# Creating data frame object for chromosome rectangles shadows
chrom_rects <- data.frame(
chr = chr_ranges_start$seqnames,
xstart = as.numeric(chr_ranges_start$abspos),
xend = as.numeric(chr_ranges_end$abspos)
)
xbreaks <- rowMeans(chrom_rects %>%
dplyr::select(
xstart,
xend
))
if (nrow(chrom_rects) == 24) {
chrom_rects$colors <- rep(
c("white", "gray"),
length(chr_lengths) / 2
)
} else {
chrom_rects$colors <- c(rep(
c("white", "gray"),
(length(chr_lengths) / 2)
), "white")
}
# Creating the geom_rect object
ggchr_back <-
list(geom_rect(
data = chrom_rects,
aes(
xmin = xstart,
xmax = xend,
ymin = -Inf,
ymax = Inf,
fill = colors
),
alpha = .2
), scale_fill_identity())
sec_breaks <- c(0, 0.5e9, 1e9, 1.5e9, 2e9, 2.5e9, 3e9)
sec_labels <- c(0, 0.5, 1, 1.5, 2, 2.5, 3)
# theme
ggaes <- list(
scale_x_continuous(
breaks = xbreaks,
labels = gsub("chr", "", chrom_rects$chr),
position = "top",
expand = c(0, 0),
sec.axis = sec_axis(
~.,
breaks = sec_breaks,
labels = sec_labels,
name = "genome position (Gb)"
)
),
theme_classic(),
theme(
axis.text.x = element_text(
angle = 0,
vjust = .5,
size = 15
),
axis.text.y = element_text(size = 15),
panel.border = element_rect(
colour = "black",
fill = NA,
size = 1.3
),
legend.position = "top",
axis.ticks.x = element_blank(),
axis.title = element_text(size = 15),
plot.title = element_text(size = 15)
)
)
####################
# obtaining and wrangling data
####################
con <- consensus(scCNA)
con_l <- con %>%
dplyr::mutate(abspos = chr_ranges$abspos) %>%
tidyr::gather(
key = "group",
value = "segment_ratio",
-abspos
)
choice <- gtools::mixedsort(unique(con_l$group))
df_plot <- con_l %>% dplyr::filter(group %in% sample)
p <- ggplot(df_plot) +
ggchr_back +
ggaes +
geom_line(aes(abspos, segment_ratio, color = group),size = 1.2, alpha = .7) +
labs(
x = "",
y = "consensus segment ratio"
)
return(p)
}# Set parallel parameters
register(MulticoreParam(progressbar = T, workers = threads), default = T)
# --- Input Path Logic Judgment ---
# Prioritize reading data under rds_path, if not exists then try reading bam_dir
bam_flag <- FALSE
if(!is.null(bam_dir) & !is.null(rds_path)){
if(!dir.exists(rds_path)){
if(!dir.exists(bam_dir)){
stop("Error: Both bam_dir and rds_path do not exist!")
}else{
input_dir <- bam_dir
bam_flag <- TRUE
}
}else{
input_dir <- rds_path
}
} else if(is.null(bam_dir) & !is.null(rds_path)){
input_dir <- rds_path
} else if(!is.null(bam_dir) & is.null(rds_path)){
input_dir <- bam_dir
bam_flag <- TRUE
} else{
stop("Error: Please set either bam_dir or rds_path!")
}
# --- Genome Version Identification ---
if(length(grep('(GRCh38)|(hg38)', genome_version, ignore.case = TRUE )) > 0){
genome <- 'hg38'
}else if(length(grep('(GRCh37)|(hg19)', genome_version, ignore.case = TRUE )) > 0){
genome <- 'hg19'
}else{
stop('Error: Cannot recognize genome version. Please use hg38, GRCh38, hg19, or GRCh37.')
}
# --- Run runVarbin (If input is BAM) ---
if(bam_flag){
for(i in bin_sizes){
message(paste("Processing bin size:", i))
tumor <- runVarbin(input_dir,
min_bincount = min_bincount,
remove_Y = remove_Y,
is_paired_end = is_paired_end,
resolution = i,
genome = genome)
saveRDS(tumor, paste0(outdir, '/copykit_', samplename, '_', i, '.rds'))
}
input_dir <- outdir
}Data Preprocessing Principles
CopyKit uses the core function runVarbin to preprocess the input BAM files. The process includes the following key modules:
runCountReads(Bin Counting):- Bin Counting: Statistics of Reads falling into each Bin based on preset genome windows (Bins) using featureCounts.
- Deduplication: Reads marked as PCR duplicates (PCR Duplicates) are automatically ignored during counting.
- Paired-end Processing: If
is_paired_end = TRUE, paired-end Reads are counted as one Fragment. - Filtering: By default removes Y chromosome regions (optional), and filters out low-quality cells with too low average Read Counts based on counting results.
- GC Correction: Theoretically, sequencing depth should be evenly distributed, but in reality, high GC content regions often exhibit PCR amplification bias, causing a positive correlation between Fragment counts and GC content. Before CNV analysis, the algorithm performs GC bias correction for each cell, pulling counts affected by GC back to the baseline to eliminate systematic errors.
runVst(Variance Stabilization Transformation):- Principle: Bin Counts in single-cell DNA sequencing usually follow a Poisson Distribution, characterized by variance equal to the mean. This causes noise in high-abundance regions to be much larger than in low-abundance regions.
- Function: Uses Freeman-Tukey (FT) transformation to convert data to an approximately normal distribution. The transformed data variance is approximately constant at 1 and no longer depends on the mean. This step is crucial for accurate breakpoint detection (Segmentation).
runSegmentation(CBS Segmentation):- Smoothing: First uses a smoothing algorithm to trim extremely few "spike" values that are far higher or lower than neighboring Bins, reducing noise interference.
- CBS Algorithm: Runs the Circular Binary Segmentation (CBS) algorithm for each cell to find breakpoints where copy number changes occur on the genome.
- Merging: Merges consecutive Bins with similar counts on the genome into Segments. All Bins within the same Segment have the same Segment Ratio.
- Segment Ratio Meaning (Relative to average ploidy):
1.0: Diploid state.1.5: 3 Copies (Gain).0.5: Single copy loss (Loss).2.0: 4 Copies amplification (Amplification).
logNorm(Log Normalization):- Log transformation of Segment Ratio to generate
logrmatrix, facilitating subsequent UMAP dimensionality reduction and clustering analysis.
- Log transformation of Segment Ratio to generate
Matrices generated after running runVarbin (Bin x Cells):
bincounts: Raw Read Counts matrix after GC correction.ft: Numerical matrix after Variance Stabilization Transformation (VST), used as input for the segmentation algorithm.smoothed_bincounts: Smoothed counts after removing outliers, used for absolute copy number inference.segment_ratios: Ratio matrix after segmentation, which is the core data for cell filtering, plotting, clustering, and CNV analysis.ratios: Ratio matrix normalized based on smoothed_bincounts.logr: Log transformed Segment Ratio.
The above matrices can be accessed via
assay(tumor_list[['220kb']], 'bincounts').
varbin_rds <- list.files(rds_path,pattern = '.rds', full.names = T)Quality Control
runMetrics function calculates the Overdispersion for each cell based on bincounts. At the same time, we visualize the total Reads count, Reads assigned to Bin, and Breakpoints count for each cell.
Key QC Metrics
- Overdispersion:
- Reflects the uniformity of sequencing coverage. Theoretically, Read Counts of adjacent Bins should be similar.
- If the Overdispersion value is too high, it indicates large differences between adjacent Bins and significant technical noise, which may affect the accuracy of CNV results.
- Breakpoint Count:
- Normal diploid cells, or cells without variations, usually have a Breakpoint Count close to 0.
- Tumor cells usually have a higher number of breakpoints due to genomic instability.
- Total Reads:
- The number of Reads per cell, reflecting whether the data amount for each cell is sufficient.
- Reads Assigned bins:
- Effective Reads count. If this value is too low, it indicates that a large number of Reads are aligned to repetitive sequences or low-quality regions (CopyKit's preset Bins have already filtered these regions), suggesting poor cell sequencing quality.
CopyKit Suggestion: Usually, a sequencing depth of 1M reads/cell and a Duplication ratio of about 10% can obtain high-quality CNV results at 220kb resolution.
The following shows the distribution of QC metrics under different resolutions:
Figure Legend: The figure below shows the distribution of four QC metrics.
- X-axis: Sample.
- Y-axis: Specific values of each metric.
- Color: Represents the magnitude of
overdispersion. Brighter colors indicate greater data noise.- Interpretation: High-quality data usually shows higher
reads_assigned_binsand loweroverdispersion.
tumor_list <- list()
for(i in varbin_rds){
bin_size_name <- gsub('.*_(\\d+[Mk]b).rds','\\1',i)
tumor_list[[bin_size_name]] <- readRDS(i)
tumor_list[[bin_size_name]] <- runMetrics(tumor_list[[bin_size_name]])
}options(repr.plot.width = 8, repr.plot.height = 7, repr.plot.res = 150)
pic <- plotMetrics(
tumor_list[['220kb']],
metric = c(
"reads_total",
"reads_assigned_bins",
"overdispersion",
"breakpoint_count"),
label = 'overdispersion') +
ggtitle('220kb Metrics') +
theme(plot.title = element_text(hjust = .5, size = 20))
pic
ggsave(filename = file.path(outdir, paste0('plotMetrics_',samplename, '_220kb.pdf')), pic, width = 8, height = 7)options(repr.plot.width = 8, repr.plot.height = 7, repr.plot.res = 150)
pic <- plotMetrics(
tumor_list[['500kb']],
metric = c(
"reads_total",
"reads_assigned_bins",
"overdispersion",
"breakpoint_count"),
label = 'overdispersion') +
ggtitle('500kb Metrics') +
theme(plot.title = element_text(hjust = .5, size = 20))
pic
ggsave(filename = file.path(outdir, paste0('plotMetrics_',samplename, '_500kb.pdf')), pic, width = 8, height = 7)options(repr.plot.width = 8, repr.plot.height = 7, repr.plot.res = 150)
pic <- plotMetrics(
tumor_list[['1Mb']],
metric = c(
"reads_total",
"reads_assigned_bins",
"overdispersion",
"breakpoint_count"),
label = 'overdispersion') +
ggtitle('1Mb Metrics') +
theme(plot.title = ggplot2::element_text(hjust = .5, size = 20))
pic
ggsave(filename = file.path(outdir, paste0('plotMetrics_',samplename, '_1Mb.pdf')), pic, width = 8, height = 7)Normal Diploid Cell Filtering
Before performing clonal evolution analysis, clustering, and lineage tree construction, normal diploid cells must be removed. If these cells are retained, they will cluster into a huge "fake clone", seriously interfering with the resolution of tumor heterogeneity.
findAneuploidCells function is used to distinguish normal cells from aneuploid (tumor) cells:
- Criteria: Segment Ratios of normal diploid cells should be a smooth straight line, with a coefficient of variation (CV = SD / Mean) close to 0.
- Tumor Cells: Due to a large number of amplifications and deletions, their Segment Ratios have a larger standard deviation and higher CV values.
- Bimodal Distribution: The CV distribution of normal cells and tumor cells usually shows a bimodal pattern. If the CV fluctuation of a cell exceeds 5 times the diploid background noise, it is judged as Aneuploid.
Diploid and Aneuploid CNV Results for 220kb Bin
Figure Legend: The heatmap below shows the following. Each column represents a Bin on the genome (sorted by position), and each row represents a cell.
- Left Color Bar (
is_aneuploid):TRUEindicates it is judged as an aneuploid (tumor) cell;FALSEindicates a normal diploid cell.- Heatmap Color: Represents Log2 transformed Segment Ratio. Red indicates copy number amplification (Gain/Amp), blue indicates copy number loss (Loss/Del), and white indicates diploid state (Neutral).
options(repr.plot.width = 12, repr.plot.height = 6, repr.plot.res = 150)
suppressMessages({
tumor_list[['220kb']] <- findAneuploidCells(tumor_list[['220kb']])
ht <- plotHeatmap(
tumor_list[['220kb']],
label = 'is_aneuploid',
row_split = 'is_aneuploid',
n_threads = threads)
ht
})
pdf(file.path(outdir, paste0('plotHeatmap_', samplename, '_220k_aneuploid.pdf')), width = 12, height = 6)
ht
dev.off()Diploid and Aneuploid CNV Results for 500kb Bin
options(repr.plot.width = 12, repr.plot.height = 6, repr.plot.res = 150)
suppressMessages({
tumor_list[['500kb']] <- findAneuploidCells(tumor_list[['500kb']])
ht <- plotHeatmap(tumor_list[['500kb']], label = 'is_aneuploid', row_split = 'is_aneuploid', n_threads = threads)
})
pdf(file.path(outdir, paste0('plotHeatmap_', samplename, '_500k_aneuploid.pdf')), width = 12, height = 6)
ht
dev.off()Diploid and Aneuploid CNV Results for 1Mb Bin
options(repr.plot.width = 12, repr.plot.height = 6, repr.plot.res = 150)
suppressMessages({
tumor_list[['1Mb']] <- findAneuploidCells(tumor_list[['1Mb']])
ht <- plotHeatmap(tumor_list[['1Mb']], label = 'is_aneuploid', row_split = 'is_aneuploid', n_threads = threads)
})
pdf(file.path(outdir, paste0('plotHeatmap_', samplename, '_1Mb_aneuploid.pdf')), width = 12, height = 6)
ht
dev.off()# Filter out normal diploid cells
for(i in names(tumor_list)){
tumor_list[[i]] <- tumor_list[[i]][,colData(tumor_list[[i]])$is_aneuploid == TRUE]
}Outlier Cell Filtering
To construct accurate tumor subclones and evolutionary relationships, we need to further use findOutliers to remove cells that have extremely low correlation with other cells, which may be technical noise or of poor quality.
This step uses Pearson correlation as the metric. If the average correlation of a cell with its K nearest neighbors is lower than pearson_corr_cutoff (default 0.9, set to 0.8 in this tutorial), it is marked as an outlier.
for(i in names(tumor_list)){
tumor_list[[i]] <- findOutliers(tumor_list[[i]], resolution = pearson_corr_cutoff)
}CNV Heatmap Comparison of Outlier vs Retained Cells at 220kb Resolution
Figure Legend: Outlier identification heatmap.
- Left Color Bar (
outlier):TRUE(Red) represents low-quality cells judged as outliers;FALSE(Blue) represents retained high-quality tumor cells.- Features: Outlier cells usually show messy signals or copy number patterns distinct from other tumor cells.
options(repr.plot.width = 12, repr.plot.height = 6, repr.plot.res = 150)
suppressMessages({
ht <- plotHeatmap(tumor_list[['220kb']], label = 'outlier', row_split = 'outlier', n_threads = threads)
ht
})
pdf(file.path(outdir, paste0('plotHeatmap_', samplename, '_220kb_outlier.pdf')), width = 12, height = 6)
ht
dev.off()
tumor_list[['220kb']] <- tumor_list[['220kb']][,colData(tumor_list[['220kb']])$outlier == FALSE]CNV Heatmap Comparison of Outlier vs Retained Cells at 500kb Resolution
options(repr.plot.width = 12, repr.plot.height = 6, repr.plot.res = 150)
suppressMessages({
ht <- plotHeatmap(tumor_list[['500kb']], label = 'outlier', row_split = 'outlier', n_threads = threads)
ht
})
pdf(file.path(outdir, paste0('plotHeatmap_', samplename, '_500kb_outlier.pdf')), width = 12, height = 6)
ht
dev.off()
tumor_list[['500kb']] <- tumor_list[['500kb']][,colData(tumor_list[['500kb']])$outlier == FALSE]CNV Heatmap Comparison of Outlier vs Retained Cells at 1Mb Resolution
options(repr.plot.width = 12, repr.plot.height = 6, repr.plot.res = 150)
suppressMessages({
ht <- plotHeatmap(tumor_list[['1Mb']], label = 'outlier', row_split = 'outlier', n_threads = threads)
ht
})
pdf(file.path(outdir, paste0('plotHeatmap_', samplename, '_1Mbkb_outlier.pdf')), width = 12, height = 6)
ht
dev.off()
tumor_list[['1Mb']] <- tumor_list[['1Mb']][,colData(tumor_list[['1Mb']])$outlier == FALSE]KNN Smoothing
Due to extremely low sequencing depth (high sparsity) in single-cell sequencing, raw Bin Counts data is often full of noise. knnSmooth function uses K-Nearest Neighbors (KNN) algorithm to enhance signals by integrating information from each cell and its k nearest neighbors (default k=4), significantly reducing random noise.
- Parameter Adjustment: If data coverage is extremely low, you can appropriately increase the
kvalue, but too large akvalue may blur small differences between subclones. If you find too many clusters in subsequent steps, you can try skipping theknnSmoothstep.
if(skipKNNSmooth){
for(i in names(tumor_list)){
suppressMessages({tumor_list[[i]] <- knnSmooth(tumor_list[[i]], k = 4)})
}
saveRDS(tumor_list, file.path(outdir, paste0(samplename, '_copykit_analysis.rds')))
}Clustering and Clonal Evolution Analysis
Dimensionality Reduction
To visualize the clonal structure of tumor cells in a 2D plane, we use UMAP (Uniform Manifold Approximation and Projection) algorithm to reduce the dimensionality of high-dimensional copy number data (logr matrix).
- Meaning: In the UMAP space, each point represents a cell. The distance between points reflects the similarity of their genome-wide copy number profiles. Cells clustered together usually represent Subclones with the same CNV features.
for(i in names(tumor_list)){
tumor_list[[i]] <- runUmap(tumor_list[[i]])
}Absolute Ploidy Inference
calcInteger function uses scquantum algorithm to convert Relative Copy Number Ratios to Absolute Integer Copy Numbers. This is because biological copy numbers must be integers (e.g., 2, 3, 4).
- Ploidy Score: This metric measures the quality of integer fitting.
- Score close to 0: Good fitting, clear CNV state of the cell, easy to judge whether it is 2 copies or 3 copies.
- Higher Score: Indicates noisy data, difficult to accurately judge its absolute copy number.
- Suggestion: In subsequent analysis, ploidy score can be referred to for further filtering out cells with extremely poor fitting quality.
for(i in names(tumor_list)){
tumor_list[[i]] <- calcInteger(tumor_list[[i]], method = 'scquantum', assay = 'smoothed_bincounts')
}Figure Legend: Ploidy Score distribution plot.
- X-axis: Cell ordering.
- Y-axis: Ploidy Score.
- Color: Darker color (smaller value) indicates more reliable ploidy inference. If most cells have large ploidy scores, it may indicate poor sample quality or overly complex heterogeneity.
options(repr.plot.width = 5, repr.plot.height = 5)
for(i in names(tumor_list)){
suppressMessages({
pic <- plotMetrics(
tumor_list[[i]],
metric = 'ploidy',
label = 'ploidy_score') +
ggtitle(i) +
theme(plot.title = element_text(hjust = .5, size = 20))
})
print(pic)
ggsave(file.path(outdir, paste0('ploidy_score_',samplename,'_',i,'.pdf')), pic, width = 5, height = 5)
}Tumor Subclone Clustering and Clonal Evolution Analysis
- Find Best K (
findSuggestedK): Automatically recommend the most suitable number of nearest neighbors for clustering by calculating Jaccard similarity under different K values. - Perform Clustering (
findClusters): Perform unsupervised clustering based on UMAP coordinates to identify different tumor Subclones. Default uses HDBSCAN algorithm. Compared to traditional Louvain/Leiden, HDBSCAN is more sensitive in identifying noise points (labeled asc0). - Construct Consensus Profiles (
calcConsensus): Single-cell data inevitably contains random noise. To accurately infer the evolutionary history of tumors, we no longer focus on minor fluctuations of individual cells, but calculate the Consensus Copy Number Profile of each subclone throughcalcConsensus. This function takes the median of all cells in each Bin within the subclone to build a "virtual cell" representing the clone's characteristics. This greatly improves the signal-to-noise ratio. - Construct Subclone Phylogenetic Tree:
runConsensusPhylofunction calculates the Manhattan Distance between consensus profiles of various subclones and constructs a phylogenetic tree.
About Outlier Cluster 'c0'
HDBSCAN algorithm may classify a portion of cells into the
c0cluster. These cells are usually unable to be assigned to any major clone due to the following reasons:
- Data Noise: Low sequencing quality leads to blurred copy number features.
- Rare Clones: Too few cells to meet the minimum threshold for independent clustering.
- Doublets/Degradation: Abnormal signals caused by technical errors.
Processing Suggestion: Before analyzing the main evolutionary structure of the tumor, it is usually recommended to remove the
c0cluster to avoid interfering with the construction of the phylogenetic tree.
Figure Legend: Jaccard similarity index curve with varying K values.
- X-axis: K value (number of nearest neighbors).
- Y-axis: Jaccard Index (clustering stability metric).
CopyKit automatically selects the K value with the maximum Jaccard similarity as the parameter for subsequent
findClusters.
options(repr.plot.width = 8, repr.plot.height = 5)
for(i in names(tumor_list)){
suppressMessages({tumor_list[[i]] <- findSuggestedK(tumor_list[[i]])})
pic <- plotSuggestedK(tumor_list[[i]]) +
ggtitle(i) +
theme(plot.title = element_text(hjust = .5, size = 20))
print(pic)
ggsave(file.path(outdir, paste0('SuggestedK_',samplename,'_',i,'.pdf')), pic, width = 8, height = 5)
}for(i in names(tumor_list)){
tumor_list[[i]] <- findClusters(tumor_list[[i]])
}Tumor Subclone Analysis Results at 220kb Resolution
The following shows UMAP clustering plots before and after removing the c0 group.
Figure Legend: UMAP clustering plot after removing
c0. Different colors clearly show the major subclone structure within the tumor.
options(repr.plot.width = 7, repr.plot.height = 7)
pic <- plotUmap(tumor_list[['220kb']], label = 'subclones') +
ggtitle('220kb keep c0') +
theme(plot.title = element_text(hjust = .5, size = 20))
pic
ggsave(file.path(outdir, paste0('plotUmap',samplename,'_',i,'_keep_c0.pdf')), pic, width = 7, height = 7)tumor_list[['220kb']] <- tumor_list[['220kb']][,colData(tumor_list[['220kb']])$subclones != 'c0']
pic <- plotUmap(tumor_list[['220kb']], label = 'subclones') +
ggtitle('220kb') +
theme(plot.title = element_text(hjust = .5, size = 20))
pic
ggsave(file.path(outdir, paste0('plotUmap',samplename,'_',i,'_remove_c0.pdf')), pic, width = 7, height = 5)Tumor Subclone CNV Heatmap and Phylogenetic Tree at 220kb Resolution
Figure Legend: Subclone Consensus Heatmap
- Heatmap: Each row represents a cell, ordered by the phylogenetic tree. Color represents Log2 Segment Ratio.
- Left Annotation: Labels the subclone ID the cell belongs to.
options(repr.plot.width = 12, repr.plot.height = 6, repr.plot.res = 150)
tumor_list[['220kb']] <- calcConsensus(tumor_list[['220kb']])
tumor_list[['220kb']] <- runConsensusPhylo(tumor_list[['220kb']])
suppressMessages({
ht <- plotHeatmap(
tumor_list[['220kb']],
label = 'subclones',
order_cells = 'consensus_tree'
)
})
ht
pdf(
file.path(outdir, paste0('plotHeatmap_consensus_tree_',samplename,'_',i,'.pdf')),
width = 16,
height = 6)
ht
dev.off()Figure Legend: Subclone Phylogenetic Tree.
- Branch Length: Represents Manhattan distance, i.e., the magnitude of difference in genome CNV profiles between subclones.
- Structure: Shows how the tumor evolved from the root into different branches (subclones).
suppressWarnings({
pic <- plotPhylo(tumor_list[['220kb']], label = 'subclones', consensus = TRUE)
})
pic
ggsave(file.path(outdir, paste0('plotPhylo_consensus_',samplename,'_220kb','.pdf')), pic, width = 7, height = 5)The following heatmap and line plot observe CNV differences between different subclones at the "Bulk" level (i.e., subclone consensus level).
Figure Legend: Subclone Consensus Heatmap.
Unlike the single-cell heatmap, here each row represents only one subclone (instead of a single cell). This eliminates single-cell noise and clearly shows the core CNV events of each subclone (such as overall amplification or deletion of a chromosome arm).
tumor_list[['220kb']] <- calcConsensus(tumor_list[['220kb']] )
suppressMessages({
ht <- plotHeatmap(tumor_list[['220kb']],consensus = TRUE,label = 'subclones')
})
ht
pdf(
file.path(outdir, paste0('plotHeatmap_consensus_tree_',samplename,'_220kb','.pdf')),
width = 16,
height = 6)
ht
dev.off()Figure Legend: Subclone copy number distribution line plot.
- X-axis: Genome chromosome position (chr1 - chr22).
- Y-axis: Consensus Segment Ratio (Consensus segment ratio, taking the median value of each segment within the subclone).
- Meaning: Different colored lines represent different subclones. Elevations above the baseline (1.0) represent amplification, and depressions represent deletion. This plot allows precise comparison of differences between subclones at specific loci.
pic <- plotConsensusLine_static(tumor_list[['220kb']],levels(colData(tumor_list[['220kb']])$subclones) ) +
scale_color_manual(values = subclones_pal())
pic
ggsave(file.path(outdir, paste0('plot_ConsensusLine_',samplename,'_220kb','.pdf')), pic, width = 7, height = 5)Tumor Subclone Analysis Results at 500kb Resolution
The following shows UMAP clustering plots before and after removing the c0 group.
options(repr.plot.width = 7, repr.plot.height = 7)
pic <- plotUmap(tumor_list[['500kb']], label = 'subclones') +
ggtitle('500kb keep c0') +
theme(plot.title = ggplot2::element_text(hjust = .5, size = 20))
pic
ggsave(file.path(outdir, paste0('plotUmap',samplename,'_1Mb','_keep_c0.pdf')), pic, width = 7, height = 5)tumor_list[['500kb']] <- tumor_list[['500kb']][,colData(tumor_list[['500kb']])$subclones != 'c0']
pic <- plotUmap(tumor_list[['500kb']], label = 'subclones') +
ggtitle('500kb') +
theme(plot.title = element_text(hjust = .5, size = 20))
pic
ggsave(file.path(outdir, paste0('plotUmap',samplename,'_500kb','_remove_c0.pdf')), pic, width = 7, height = 5)Tumor Subclone CNV Heatmap and Phylogenetic Tree at 500kb Resolution
options(repr.plot.width = 12, repr.plot.height = 7, repr.plot.res = 150)
tumor_list[['500kb']] <- calcConsensus(tumor_list[['500kb']])
tumor_list[['500kb']] <- runConsensusPhylo(tumor_list[['500kb']])
suppressMessages({
ht <- plotHeatmap(
tumor_list[['500kb']],
label = 'subclones',
order_cells = 'consensus_tree'
)})
ht
pdf(
file.path(outdir, paste0('plotHeatmap_consensus_tree_',samplename,'_550kb','.pdf')),
width = 16,
height = 6)
ht
dev.off()suppressWarnings({
pic <- plotPhylo(tumor_list[['500kb']], label = 'subclones', consensus = TRUE)
})
pic
ggsave(file.path(outdir, paste0('plotPhylo_consensus_',samplename,'_500kb','.pdf')), pic, width = 7, height = 7)The following heatmap and line plot observe CNV differences between different subclones at the "Bulk" level.
tumor_list[['500kb']] <- calcConsensus(tumor_list[['500kb']] )
suppressMessages({
ht <- plotHeatmap(tumor_list[['500kb']],consensus = TRUE,label = 'subclones')
})
ht
pdf(
file.path(outdir, paste0('plotHeatmap_consensus_',samplename,'_550kb','.pdf')),
width = 16,
height = 6)
ht
dev.off()pic <- plotConsensusLine_static(tumor_list[['500kb']],levels(colData(tumor_list[['500kb']])$subclones) ) +
scale_color_manual(values = subclones_pal())
pic
ggsave(file.path(outdir, paste0('plot_ConsensusLine_',samplename,'_500kb','.pdf')), pic, width = 7, height = 5)Tumor Subclone Analysis Results at 1Mb Resolution
The following shows UMAP clustering plots before and after removing the c0 group.
options(repr.plot.width = 7, repr.plot.height = 7)
pic <- plotUmap(tumor_list[['1Mb']], label = 'subclones') +
ggtitle('1Mb keep c0') +
theme(plot.title = element_text(hjust = .5, size = 20))
pic
ggsave(file.path(outdir, paste0('plotUmap_',samplename,'1Mb','_keep_c0.pdf')), pic, width = 7, height = 5)options(repr.plot.width = 7, repr.plot.height = 7)
tumor_list[['1Mb']] <- tumor_list[['1Mb']][,colData(tumor_list[['1Mb']])$subclones != 'c0']
pic <- plotUmap(tumor_list[['1Mb']], label = 'subclones') +
ggtitle('1Mb') +
theme(plot.title = element_text(hjust = .5, size = 20))
pic
ggsave(file.path(outdir, paste0('plotUmap_',samplename,'_1Mb','_remove_c0.pdf')), pic, width = 7, height = 5)Tumor Subclone CNV Heatmap and Phylogenetic Tree at 1Mb Resolution
options(repr.plot.width = 12, repr.plot.height = 6, repr.plot.res = 150)
tumor_list[['1Mb']] <- calcConsensus(tumor_list[['1Mb']])
tumor_list[['1Mb']] <- runConsensusPhylo(tumor_list[['1Mb']])
suppressMessages({
ht <- plotHeatmap(
tumor_list[['1Mb']],
label = 'subclones',
order_cells = 'consensus_tree'
)
})
ht
pdf(
file.path(outdir, paste0('plotHeatmap_consensus_tree_',samplename,'_1Mb','.pdf')),
width = 16,
height = 6)
ht
dev.off()suppressWarnings({
pic <- plotPhylo(tumor_list[['1Mb']], label = 'subclones', consensus = TRUE)
})
pic
ggsave(file.path(outdir, paste0('plotPhylo_consensus_',samplename,'_1Mb','.pdf')), pic, width = 7, height = 7)The following heatmap and line plot observe CNV differences between different subclones at the "Bulk" level.
suppressMessages({
ht <- plotHeatmap(tumor_list[['1Mb']],consensus = TRUE,label = 'subclones')
})
ht
pdf(
file.path(outdir, paste0('plotHeatmap_consensus_',samplename,'_550kb','.pdf')),
width = 16,
height = 6)
ht
dev.off()pic <- plotConsensusLine_static(tumor_list[['1Mb']],levels(colData(tumor_list[['1Mb']])$subclones) ) +
scale_color_manual(values = subclones_pal())
pic
ggsave(file.path(outdir, paste0('plot_ConsensusLine_',samplename,'_1Mb','.pdf')), pic, width = 7, height = 5)Single-Cell Level Phylogenetic Tree
In addition to subclone-level phylogenetic trees, CopyKit can also construct Single-Cell Resolution phylogenetic trees using runPhylo. This helps to observe micro-evolution processes within clones.
- Principle: Calculates distances between cells based on the similarity of copy number profiles. Closer copy number profiles indicate closer kinship.
- Note: Due to the large volume of single-cell data and significant noise, the computational load is high.
Single-Cell Phylogenetic Tree at 220kb Resolution
Figure Legend: Single-cell resolution phylogenetic tree.
- Terminal Nodes: Each point in the figure represents an independent cell.
- Color: Identifies the subclone to which the cell belongs.
- Significance: Compared to the subclone tree, this figure shows the internal structure of subclones and can be used to observe subtle heterogeneity between cells.
suppressMessages({
tumor_list[['220kb']] <- runPhylo(tumor_list[['220kb']], metric = 'manhattan')
pic <- plotPhylo(tumor_list[['220kb']], label = 'subclones')
})pic
ggsave(file.path(outdir, paste0('plotPhylo_cells_',samplename,'_220kb.pdf')), width = 8, height = 6)Single-Cell Phylogenetic Tree at 500kb Resolution
suppressMessages({
tumor_list[['500kb']] <- runPhylo(tumor_list[['500kb']], metric = 'manhattan')
pic <- plotPhylo(tumor_list[['500kb']], label = 'subclones')
})pic
ggsave(file.path(outdir, paste0('plotPhylo_cells_',samplename,'_500kb.pdf')), width = 8, height = 6)Single-Cell Phylogenetic Tree at 1Mb Resolution
suppressMessages({
tumor_list[['1Mb']] <- runPhylo(tumor_list[['1Mb']], metric = 'manhattan')
pic <- plotPhylo(tumor_list[['1Mb']], label = 'subclones')
})pic
ggsave(file.path(outdir, paste0('plotPhylo_cells_',samplename,'_1Mb.pdf')), width = 8, height = 6)Gene-Level CNV Visualization
In addition to genome-wide profiles, we usually focus on the copy number status of specific driver genes (such as MYC, TP53, EGFR, etc.). plotGeneCopy function can display the copy number distribution of specific genes in different subclones. The figure below only shows the dot plot of driver gene copy numbers for subclones identified at 220kb resolution.
Tip: You can replace the genes in the list with your target genes of interest. Note: If certain genes are located in excluded regions of the VarBin algorithm (such as centromeres, telomeres, or high-repeat regions), they cannot be detected (a warning message will be displayed).
Figure Legend: Driver gene copy number dot plot (Jitter Plot).
- X-axis: Different tumor subclones.
- Y-axis: Inferred Absolute Integer Copy Number.
- Meaning: Each dot represents a cell. This plot visually shows the amplification or deletion status of key oncogenes (such as MYC, EGFR) or tumor suppressor genes (such as TP53, PTEN) in different subclones, helping to identify key variation events driving tumor evolution.
options(repr.plot.height = 6, repr.plot.width = 15)
suppressMessages({
plotGeneCopy(
tumor_list[['220kb']],
genes = c("TP53","PTEN","MYC","CCND1","KRAS","BRAF"),
label = 'subclones',
dodge.width = 0.8)
})saveRDS(tumor_list, file.path(outdir, paste0(samplename, '_copykit_analysis.rds')))More Features: For more advanced visualization features of CopyKit, please refer to the CopyKit Official Documentation.
