SeekArc 单细胞多组学 (RNA+ATAC) 多样本整合分析教程
环境准备
R 包加载
请选择 common_r 这个环境进行该整合教程的学习
#加载必要的R包
suppressPackageStartupMessages({
library(Seurat)
library(Signac)
library(EnsDb.Hsapiens.v86, lib.loc = "/PROJ2/FLOAT/shumeng/apps/miniconda3/envs/python3.10/lib/R/library")
library(BSgenome.Hsapiens.UCSC.hg38,lib = "/PROJ2/FLOAT/shumeng/apps/miniconda3/envs/python3.10/lib/R/library")
library(biovizBase, lib = "/PROJ2/FLOAT/shumeng/apps/miniconda3/envs/python3.10/lib/R/library")
#library(BSgenome.Mmusculus.UCSC.mm10)
#library(EnsDb.Mmusculus.v79)
library(dplyr)
library(ggplot2)
library(patchwork)
library(harmony)
})
# 设置随机种子
set.seed(1234)
# 设置Seurat选项(注意:8000 * 1024^2 实际上是8GB)
options(future.globals.maxSize = 8000 * 1024^2) # 8GB# 定义颜色方案
my36colors <-c( '#E5D2DD', '#53A85F', '#F1BB72', '#F3B1A0', '#D6E7A3', '#57C3F3', '#476D87',
'#E95C59', '#E59CC4', '#AB3282', '#23452F', '#BD956A', '#8C549C', '#585658',
'#9FA3A8', '#E0D4CA', '#5F3D69', '#C5DEBA', '#58A4C3', '#E4C755', '#F7F398',
'#AA9A59', '#E63863', '#E39A35', '#C1E6F3', '#6778AE', '#91D0BE', '#B53E2B',
'#712820', '#DCC1DD', '#CCE0F5', '#CCC9E6', '#625D9E', '#68A180', '#3A6963',
'#968175', "#6495ED", "#FFC1C1",'#f1ac9d','#f06966','#dee2d1','#6abe83','#39BAE8','#B9EDF8','#221a12',
'#b8d00a','#74828F','#96C0CE','#E95D22','#017890')获取基因注释信息
我们将从 EnsDb 数据库获取相应物种的基因组注释信息(基因位置、转录本、外显子、TSS 等),具体物种需根据数据做更换。这些信息用于:
- 计算 ATAC 的 TSS 富集(判断开放染色质是否在转录起始位点附近更集中)
- 构建基因活性矩阵(把 peaks 信号映射到基因上)
- Peak 注释和功能分析
注意事项:
- 物种与参考基因组版本匹配(如
EnsDb.Hsapiens.v86对应 hg38,EnsDb.Mmusculus.v75对应 mm10) - 染色体命名风格一致(例如都用
chr1、chr2这样的前缀)
# 获取基因注释信息(静默处理警告和消息)
suppressWarnings({
suppressMessages({
annotation <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v86)
seqlevels(annotation) <- paste0('chr', seqlevels(annotation))
genome(annotation) <- 'hg38'
})
})数据读取
本教程提供两种不同形式的数据输入方式,以满足不同用户的数据获取需求,选择适合自己的一种方式即可:
云平台 RDS 文件读取
数据特点:
- RDS 文件是标准的 Seurat 对象文件
- 数据已经过预处理和多个样本合并
- 可直接用于后续的下游分析,也可以提取表达矩阵,重新整合
适用场景:
- 当您无法获得标准的
filtered_feature_bc_matrix表达矩阵时 - 希望整合云平台现有数据进行学习时
- 需要快速重新进行 scRNA-seq 数据整合分析时
注意事项:
- 具体挂载数据和 rds 文件的读取,请参照 jupyter 使用教程
例如下列项目数据/home/demo-SeekGene-com/workspace/data/AY1752565399550/
# 读取数据
input <- readRDS("/home/demo-seekgene-com/workspace/data/AY1752565399550/input.rds")
meta <- read.table("/home/demo-seekgene-com/workspace/data/AY1752565399550/meta.tsv",
header = TRUE,
sep = "\t",
row.names = 1)
# 提取基因组注释(Signac 没有 Annotation() 函数,直接从 ChromatinAssay 获取)
annotations <- input@assays$ATAC@annotation # 确保 ATAC 是 ChromatinAssay
# 创建 Seurat 对象(RNA 数据)
data <- CreateSeuratObject(
counts = input@assays$RNA@counts,
meta.data = meta
)
# 创建 ChromatinAssay(ATAC 数据)
data[["ATAC"]] <- CreateChromatinAssay(
counts = input@assays$ATAC@counts,
fragments = input@assays$ATAC@fragments,
annotation = annotations # 添加基因组注释
)
# 按样本拆分数据
seurat_list <- SplitObject(data, split.by = "Sample")
# 清理内存
rm(data, input)
gc()标准 filtered_XXXX_bc_matrix 文件读取
适用场景:
- 当您拥有标准的基因表达矩阵和 epaks 开放矩阵文件时
- 希望自主完成单细胞多组学(SeekArc)数据的多样本整合和批次矫正时
- 需要进行完整的从原始数据到整合分析的工作流程时
注意:
- 请保证样本与样本之间的文件结构如下:
数据目录结构需满足如下要求:
- 每个样本的文件夹名称为样本 ID,如 S127、S44R 等。
- 每个样本文件夹下包含以下文件:
filtered_feature_bc_matrix:scRNA-seq 表达矩阵文件夹,包含barcodes.tsv.gz、features.tsv.gz和matrix.mtx.gz文件。filtered_peaks_bc_matrix:ATAC 的 peak 开放矩阵文件夹,包含barcodes.tsv.gz、features.tsv.gz和matrix.mtx.gz文件。{样本 ID}_A_fragments.tsv.gz:ATAC 片段文件,如 S127_A_fragments.tsv.gz。{样本 ID}_A_fragments.tsv.gz.tbi:ATAC 片段索引文件,如 S127_A_fragments.tsv.gz.tbi。
具体文件夹结构如下:
├── S127/
│ ├── filtered_feature_bc_matrix/ (scRNA-seq 表达矩阵)
│ │ ├── barcodes.tsv.gz
│ │ ├── features.tsv.gz
│ │ └── matrix.mtx.gz
│ ├── filtered_peaks_bc_matrix/ (ATAC 的 peak 开放矩阵)
│ │ ├── barcodes.tsv.gz
│ │ ├── features.tsv.gz
│ │ └── matrix.mtx.gz
│ ├── S127_A_fragments.tsv.gz (ATAC 片段文件)
│ └── S127_A_fragments.tsv.gz.tbi
├── S44R/
│ ├── filtered_feature_bc_matrix/
│ │ ├── barcodes.tsv.gz
│ │ ├── features.tsv.gz
│ │ └── matrix.mtx.gz
│ ├── filtered_peaks_bc_matrix/
│ │ ├── barcodes.tsv.gz
│ │ ├── features.tsv.gz
│ │ └── matrix.mtx.gz
│ ├── S44R_A_fragments.tsv.gz
│ └── S44R_A_fragments.tsv.gz.tbi
sample_names <- c('S127', 'S44R')
# 创建空列表存储Seurat对象
seurat_list <- list()
# 循环读取每个样本的数据
for(sample in sample_names){
cat('正在处理样本:', sample, '\n')
# 构建文件路径
atac_path <- file.path(sample, 'filtered_peaks_bc_matrix')
rna_path <- file.path(sample, 'filtered_feature_bc_matrix')
frag_path <- file.path(sample, paste0(sample, '_A_fragments.tsv.gz'))
# 读取RNA数据
rna_counts <- Read10X(data.dir = rna_path)
# 创建Seurat对象
obj <- CreateSeuratObject(
counts = rna_counts,
assay = "RNA")
obj$Sample <- sample
rm(rna_counts)
gc()
# 读取ATAC数据
atac_counts <- Read10X(data.dir = atac_path)
atac_counts <- atac_counts[Matrix::rowSums(atac_counts > 0) >= 3, ]
ChromatinAssay <- CreateChromatinAssay(
counts = atac_counts,
sep = c(":", "-"),
fragments = frag_path,
annotation = annotation
)
obj[["ATAC"]] <- ChromatinAssay
rm(atac_counts, ChromatinAssay)
gc()
# 将对象添加到列表
seurat_list[[sample]] <- obj
cat('样本', sample, '处理完成\n')
}Computing hash
样本 S127 处理完成
正在处理样本: S44R
Computing hash
样本 S44R 处理完成
数据质量控制
质控指标计算
RNA 质控指标:
percent.mt: 线粒体基因比例(通常<20%)
ATAC 质控指标:
TSS.enrichment: TSS 富集分数(通常>2)nucleosome_signal: 核小体信号(越低越好)nCount_ATAC: 总 ATAC 计数(通常在 1000-10000 之间)nfeature_RNA: 总 ATAC 特征数(通常在 200-10000 之间)
# 对每个样本进行质量控制
suppressWarnings({
suppressMessages({
for(i in names(seurat_list)){
# RNA质控指标
seurat_list[[i]][["percent.mt"]] <- PercentageFeatureSet(seurat_list[[i]], pattern = "^MT-")
# ATAC质控指标
DefaultAssay(seurat_list[[i]]) <- "ATAC"
# 计算TSS富集分数
seurat_list[[i]] <- TSSEnrichment(object = seurat_list[[i]], fast = FALSE)
# 计算核小体信号
seurat_list[[i]] <- NucleosomeSignal(object = seurat_list[[i]])
}
})
})质控指标可视化
用小提琴图查看各 QC 指标的分布/异常点,确定合适的阈值:
建议:
- 观察是否存在明显的长尾或双峰分布
- 尝试多组阈值并比较下游聚类/UMAP 是否更清晰
# 可视化各质控指标,该cell内容选择性执行,非必要
options(repr.plot.width = 17, repr.plot.height = 10)
suppressWarnings({
for (sample_name in names(seurat_list)) {
seurat_obj <- seurat_list[[sample_name]]
p1=DensityScatter(seurat_obj, x = 'nCount_ATAC', y = 'TSS.enrichment', log_x = TRUE, quantiles = TRUE)
p2=VlnPlot(
object = seurat_obj,
features = c('nCount_ATAC', 'TSS.enrichment', 'nucleosome_signal',"nFeature_RNA", "nCount_RNA", "percent.mt"),
pt.size = 0.1,
ncol = 6
)
print(p1 / p2 + plot_annotation(title = sample_name))
}
})

低质量细胞过滤
根据质控指标过滤低质量细胞。具体阈值应根据数据特征进行调整。具体过滤阈值参考上面的小提琴图分布情况。
# 质控过滤
for(i in names(seurat_list)){
cat('正在对样本', i, '进行质控过滤...\n')
cells_before <- ncol(seurat_list[[i]])
seurat_list[[i]] <- subset(
seurat_list[[i]],
subset = nFeature_RNA > 200 &
nFeature_RNA < 6000 &
nCount_RNA > 500 &
nCount_RNA < 30000 &
percent.mt < 20 &
nCount_ATAC > 500 &
nCount_ATAC < 100000 &
TSS.enrichment > 1 &
nucleosome_signal < 2
)
cells_after <- ncol(seurat_list[[i]])
cat('样本', i, '过滤完成: 过滤前', cells_before, '个细胞,过滤后', cells_after, '个细胞\n')
}正在对样本 S44R 进行质控过滤...n 样本 S44R 过滤完成: 过滤前 12072 个细胞,过滤后 10900 个细胞
计算样本间的共有 peaks
由于各样本独立进行 peak calling,不同样本中的 peaks 不一样。为了整合多样本,需要先统一 peaks 信息,创建所有样本的共有 peak 集合。
处理流程:
- 合并所有样本的 peaks
- 过滤 peak 长度(20bp-10kb)
- 为每个样本重新量化共有 peaks
注意事项: 如果前面读取数据用的第一种读取流程 rds 的方式,此部分不需要进行分析
注意事项: 如果前面读取数据用的第一种读取流程 rds 的方式,此部分不需要进行分析
注意事项: 如果前面读取数据用的第一种读取流程 rds 的方式,此部分不需要进行分析
# 提取所有样本的peaks
all_peaks <- lapply(seurat_list, function(x) {
DefaultAssay(x) <- "ATAC"
granges(x@assays$ATAC)
})
# 合并所有peaks
gr_list <- GRangesList(all_peaks)
all_granges <- unlist(gr_list, use.names = FALSE)
combined.peaks <- reduce(x = all_granges)
# 过滤peak长度
peakwidths <- width(combined.peaks)
common_peaks <- combined.peaks[peakwidths < 10000 & peakwidths > 20]
cat('合并后的peak数量:', length(common_peaks), '\n')
# 为每个样本创建统一的peak矩阵
seurat_list <- lapply(seurat_list, function(x) {
combined_counts <- FeatureMatrix(
fragments = Fragments(x@assays$ATAC),
features = common_peaks,
cells = colnames(x)
)
combined_peaks_assay <- CreateChromatinAssay(
counts = combined_counts,
fragments = Fragments(x@assays$ATAC),
annotation = Annotation(x@assays$ATAC)
)
# 添加到Seurat对象中
x[["combinedpeaks"]] <- combined_peaks_assay
DefaultAssay(x) <- "combinedpeaks"
x[["ATAC"]] <- NULL
return(x)
})Extracting reads overlapping genomic regions
Extracting reads overlapping genomic regions
多样本合并
在多组学数据整合分析流程中,多样本合并是整合分析的重要前置步骤。
合并操作的目的:
- 将多个样本数据整合到同一个 Seurat 对象中
- 为后续的批次矫正和整合分析做准备
- 便于与批次矫正后的结果进行对比分析
重要说明:
- 此步骤仅进行简单的数据合并,尚未进行批次效应矫正
- 合并后的对象包含所有样本的原始数据
- 将合并的数据进行标准化、特征选择、PCA 降维和 UMAP 可视化等预处理步骤
# 合并所有样本
suppressWarnings({
suppressMessages({
obj_merge <- merge(seurat_list[[1]], seurat_list[-1], merge.data = FALSE)
DefaultAssay(obj_merge) <- "RNA"
obj_merge <- NormalizeData(obj_merge)
obj_merge <- FindVariableFeatures(obj_merge, nfeatures = 2000)
obj_merge <- ScaleData(obj_merge)
obj_merge <- RunPCA(obj_merge)
DefaultAssay(obj_merge) <- "combinedpeaks"
obj_merge <- FindTopFeatures(obj_merge, min.cutoff = 10)
obj_merge <- RunTFIDF(obj_merge)
obj_merge <- RunSVD(obj_merge)
obj_merge <- RunUMAP(obj_merge, reduction = "lsi", dims = 2:30,reduction.name="atacumap")
})
})数据整合
现在我们将多个样本的单细胞多组学(SeekArc 数据)进行整合,并对整合的数据进行降维聚类。多组学数据整合主要有两种常用方法:
整合方法选择
CCA(Canonical Correlation Analysis)整合:
- 基于典型相关分析的整合方法
- 通过寻找样本间的共同变异模式进行整合
- 适用于样本间差异较大的情况
- Seurat 包的经典整合方法
Harmony 整合:
- 基于迭代聚类的快速整合方法
- 直接在降维空间中校正批次效应
- 计算效率高,适用于大规模数据
- 保留更多生物学变异信息
方法选择建议
- Harmony:同平台不同样本 scATAC-seq 数据;样本数量较多或数据量大,Harmony 方法更快。
- CCA:运行时间较久,通常批次效应严重的情况下推荐选 CCA,如跨平台的 scATAC-seq 数据
RNA 数据整合
使用 CCA 方法整合多样本的 RNA 数据
# 切换到RNA assay
suppressWarnings({
suppressMessages({
objs <- lapply(seurat_list, function(x) {
DefaultAssay(x) <- "RNA"
return(x)
})
# 定义CCA整合函数
integrate_cca <- function(objs) {
# 使用lapply替代for循环
objs <- lapply(objs, function(x) {
x <- NormalizeData(x, verbose = FALSE)
x <- FindVariableFeatures(x,nfeatures = 2000,selection.method = "vst")
return(x)
})
features <- Seurat::SelectIntegrationFeatures(object.list = objs)
anchors <- Seurat::FindIntegrationAnchors(
object.list = objs,
anchor.features = features
)
obj <- Seurat::IntegrateData(anchorset = anchors)
DefaultAssay(obj) <- "integrated"
obj <- ScaleData(obj, verbose = FALSE)
obj <- RunPCA(obj, verbose = FALSE)
obj <- RunUMAP(obj, reduction = "pca",reduction.name="rnaintegratedumap", dims = 1:30)
#obj <- RunTSNE(obj, reduction = "pca", reduction.name="rnaintegratedtsne",dims = 1:30, check_duplicates = FALSE)
return(obj)
}
# 执行CCA整合
obj_integrated <- integrate_cca(seurat_list)
})
})
##可视化RNA数据去批次效果
options(repr.plot.width=10, repr.plot.height=8)
DimPlot(obj_integrated, reduction = "rnaintegratedumap", group.by = "Sample")使用 harmony 整合多样本的 RNA 数据
# RNA数据整合函数
integrate_harmony <- function(obj) {
DefaultAssay(obj) <- "RNA"
obj <- RunHarmony(obj, "Sample")
obj <- RunUMAP(obj, reduction = "harmony",reduction.name="rnaharmonyumap",dims = 1:30)
#obj <- RunTSNE(obj, reduction = "harmony",reduction.name="rnaharmonytsne",dims = 1:30,check_duplicates = FALSE)
return(obj)
}
suppressWarnings({
suppressMessages({
obj_integrated = integrate_harmony(obj_merge)
})
})
##可视化RNA数据去批次效果
options(repr.plot.width=10, repr.plot.height=8)
DimPlot(obj_integrated, reduction = "rnaharmonyumap", group.by = "Sample")ATAC 数据整合
使用 CCA 方法整合多样本的 ATAC 数据
# 定义ATAC整合函数
options(future.globals.maxSize = 20 * 1024^3)
integrate_atac_cca <- function(object.list){
object.list <- lapply(object.list, function(x) {
DefaultAssay(x)="combinedpeaks"
x=RunTFIDF(x)
x=FindTopFeatures(x, min.cutoff = 5)
x=RunSVD(x)
return(x)
})
anchors <- FindIntegrationAnchors(
object.list = object.list,
anchor.features = rownames(obj_merge),
reduction = "rlsi",
dims = 2:30
)
integrated <- IntegrateEmbeddings(
anchorset = anchors,
reductions = obj_merge[["lsi"]],
new.reduction.name = "integrated_lsi",
dims.to.integrate = 1:30)
integrated<- RunUMAP(integrated, reduction = "integrated_lsi", dims = 2:30,reduction.name="atacintegratedumap")
#integrated<- RunTSNE(integrated, reduction = "integrated_lsi", dims = 2:30,reduction.name="atacintegratedtsne")
return(integrated)
}
# 执行ATAC整合
suppressWarnings({
suppressMessages({
integrated_atac <- integrate_atac_cca(seurat_list)
})
})
# 将整合后的降维结果添加到主对象
obj_integrated[['integrated_lsi']] <- integrated_atac[['integrated_lsi']]
obj_integrated[['atacintegratedumap']] <- integrated_atac[['atacintegratedumap']]
#obj_integrated[['atacintegratedtsne']] <- integrated_atac[['atacintegratedtsne']]
DefaultAssay(obj_integrated) <- "integrated"
#可视化ATAC去批次效果
options(repr.plot.width=10, repr.plot.height=8)
DimPlot(obj_integrated, reduction = "atacintegratedumap", group.by = "Sample")使用 harmony 方法整合多样本的 ATAC 数据
atac_integrate_harmony <- function(obj) {
library(harmony)
DefaultAssay(obj) <- "combinedpeaks"
obj<- RunHarmony(
object = obj,
group.by.vars = 'Sample',
reduction = 'lsi',
assay.use = 'combinepeaks',
reduction.save = "harmonylsi",
project.dim = FALSE
)
obj <- RunUMAP(obj,reduction = 'harmonylsi',reduction.name="atacharmonyumap", dims = 2:30)
#obj <- RunTSNE(obj, reduction = "harmonylsi",reduction.name="atacharmonytsne",dims = 2:50,check_duplicates = FALSE)
return(obj)
}
suppressWarnings({
suppressMessages({
obj_integrated=atac_integrate_harmony(obj_integrated)
})
})
#可视化ATAC去批次效果
options(repr.plot.width=10, repr.plot.height=8)
DimPlot(obj_integrated, reduction = "atacharmonyumap", group.by = "Sample")加权最近邻(WNN)分析
WNN 原理
WNN 方法通过学习每种数据类型(RNA 和 ATAC)的相对重要性,自动为每个细胞分配模态权重,从而:
- 同时利用 RNA 和 ATAC 信息
- 避免单一模态的偏差
- 提供更准确的细胞聚类和降维结果
参数说明
reduction.list:指定用于 WNN 的降维结果- CCA 整合:
list("pca", "integrated_lsi") - Harmony 整合:
list("harmony", "harmonylsi")
- CCA 整合:
dims.list:各模态使用的维度范围
# 寻找多模态邻居
suppressWarnings({
suppressMessages({
obj_integrated <- FindMultiModalNeighbors(
object = obj_integrated,
reduction.list = list("pca", "integrated_lsi"),##CCA方法选择此
#reduction.list = list("harmony", "harmonylsi"),##harmony方法选此
dims.list = list(1:30, 2:50),
modality.weight.name = "RNA.weight",
verbose = TRUE)
# 基于WNN进行UMAP降维
obj_integrated <- RunUMAP(
object = obj_integrated,
nn.name = "weighted.nn",
assay = "RNA",
verbose = TRUE,
reduction.name = "wnnintergratedumap"
)
})
})聚类分析
三种聚类策略
- RNA 聚类:基于基因表达相似性
- ATAC 聚类:基于染色质可及性相似性
- WNN 聚类:整合两种模态信息(推荐)
结果解读
- 不同方法可能产生不同的聚类结果
- WNN 聚类通常能发现更细致的细胞亚群
- 建议优先使用 WNN 结果进行下游分析
下面参数说明
下面在进行聚类分析时,请根据前期使用的批次校正方法选择对应的降维参数:
CCA 去批次:
• RNA 聚类:FindNeighbors 中使用 reduction="pca"
• ATAC 聚类:FindNeighbors 中使用 reduction="integrated_lsi"
Harmony 去批次:
• RNA 聚类:FindNeighbors 中使用 reduction="harmony"
• ATAC 聚类:FindNeighbors 中使用 reduction="harmonylsi"
suppressWarnings({
suppressMessages({
# RNA聚类
obj_integrated <- FindNeighbors(object = obj_integrated, reduction = 'pca',graph.name = "rnaneigobr", dims = 1:30)
#obj_integrated <- FindNeighbors(object = obj_integrated, reduction = 'harmony', dims = 1:30)
obj_integrated <- FindClusters(object = obj_integrated, verbose = FALSE,graph.name = "rnaneigobr", algorithm = 3, resolution = 0.5)
# ATAC聚类
obj_integrated <- FindNeighbors(object = obj_integrated, reduction = 'integrated_lsi', graph.name = "atacneigobr", dims = 2:30)
#obj_integrated <- FindNeighbors(object = obj_integrated, reduction = 'harmonylsi', graph.name = "atacneigobr", dims = 2:30)
obj_integrated <- FindClusters(object = obj_integrated, verbose = FALSE, graph.name = "atacneigobr",resolution = 0.5, algorithm = 3)
# WNN聚类
obj_integrated <- FindClusters(obj_integrated, graph.name = "wknn", algorithm = 3, resolution = 0.5, verbose = TRUE)
})
})Computing SNN
Computing nearest neighbor graph
Computing SNN
Only one graph name supplied, storing nearest-neighbor graph only
"The following arguments are not used: cluster.name"
"The following arguments are not used: cluster.name"
可视化参数说明
降维可视化一致性要求 在进行聚类结果可视化时,请确保使用的 UMAP 降维方法与前面聚类分析保持一致:
CCA 去批次:
• RNA 聚类结果:使用"rnaintegratedumap"降维空间
• ATAC 聚类结果:使用"atacintegratedumap"降维空间
• WNN 聚类结果:使用"wnnintergratedumap"降维空间
Harmony 去批次:
• RNA 聚类结果:使用"rnaharmonyumap"降维空间
• ATAC 聚类结果:使用"atacharmonyumap"降维空间
• WNN 聚类结果:使用"wnnintergratedumap"降维空间
# 比较不同方法的聚类结果
p1 <- DimPlot(obj_integrated, reduction = "rnaintegratedumap", group.by = "rnaneigobr_res.0.5",label=T, cols = my36colors) + ggtitle("RNA")
p2 <- DimPlot(obj_integrated, reduction = "atacintegratedumap", group.by = "atacneigobr_res.0.5",label=T, cols = my36colors) + ggtitle("ATAC")
p3 <- DimPlot(obj_integrated, reduction = "wnnintergratedumap", group.by = "wknn_res.0.5",label=T, cols = my36colors) + ggtitle("WNN")
#p1 <- DimPlot(obj_integrated, reduction = "rnaharmonyumap", group.by = "rnaneigobr_res.0.5",label=T, cols = my36colors) + ggtitle("RNA")
#p2 <- DimPlot(obj_integrated, reduction = "atacharmonyumap", group.by = "atacneigobr_res.0.5",label=T, cols = my36colors) + ggtitle("ATAC")
#p3 <- DimPlot(obj_integrated, reduction = "wnnintergratedumap", group.by = "wknn_res.0.5",label=T, cols = my36colors) + ggtitle("WNN")
options(repr.plot.width=23, repr.plot.height=6)
print(p1 + p2 + p3)
细胞类型注释
marker 基因
根据组织类型,收集不同细胞类型 marker 基因集,本示例数据是眼睛,下面是眼睛中细胞类型及其对应的 marker 基因,用气泡图可视化不同 cluster 高表达哪些细胞类型的 marker 基因
eye_marker_integrated <- list(
# ========== 光感受器细胞 ==========
"Rod_Photoreceptors" = c("RHO", "PDE6B", "CNGB1", "PDE6A", "NR2E3", "REEP6",
"RCVRN", "SAG", "NEB", "SLC24A3", "TRPM1"),
"Cone_Photoreceptors" = c("ARR3", "GNAT2", "OPN1SW", "TRPM3"),
# ========== 视网膜神经元 ==========
"Bipolar_Cells" = c("VSX1", "OTX2", "GRM6", "PRKCA",
"DLG2", "NRXN3", "RBFOX3", "FSTL4"),
"Amacrine_Cells" = c("GAD1", "GAD2", "C1QL2", "TFAP2B", "SLC32A1"),
"Horizontal_Cells" = c("ONECUT1", "LHX1", "CALB1", "NFIA"),
"Retinal_Ganglion_Cells" = c("RBPMS", "THY1", "NEFL", "POU4F2"),
# ========== 胶质细胞 ==========
"Muller_Glia" = c("SLC1A3", "RLBP1", "SOX9", "CRYAB"),
"Astrocytes" = c("GFAP", "AQP4", "S100B"),
"Microglia" = c("TMEM119", "C1QA", "AIF1", "CX3CR1", "CD74",
"PTPRC", "CSF1R", "CCL3L1", "SPP1", "P2RY12"),
# ========== 血管与支持细胞 ==========
"Endothelial" = c("FLT1", "PODXL", "PLVAP", "KDR", "EGFL7",
"CLDN5", "PECAM1", "CDH5"),
"Pericytes" = c("PDGFRB", "RGS5", "CSPG4", "STAB1"),
# ========== 上皮细胞 ==========
"RPE" = c("RPE65", "BEST1", "PMEL", "TYRP1", "DCT", "SLC38A11"),
#"Corneal_Epithelial" = c("KRT12", "KRT3"),
"Corneal_Endothelial" = c("SLC4A11", "COL8A2", "ATP1A1"),
"Conjunctival_Epithelial" = c("KRT13", "KRT19", "MUC5AC"),
"Retinal_Progenitors" = c("VSX2", "PAX6","SOX2", "HES1", "NOTCH1"), # 神经视网膜增殖期
# ========== 晶状体细胞 ==========
"Lens_Epithelial" = c("CRYAA", "BFSP1", "MIP"),
"Lens_Fiber" = c("CRYGS", "LIM2", "FN1"),
# ========== 其他细胞 ==========
"Melanocytes" = c("TYR", "MLANA", "MITF"),
#"Erythrocytes" = c("HBB", "HBA1", "HBA2"),
"ECM" = c("COL1A1", "COL3A1", "COL12A1", "COL1A2", "COL6A3")#,
#"Others" = c("TTN", "CLCN5", "DCC", "MIAT")
)# 设置图形大小
options(repr.plot.width=25, repr.plot.height=7)
# 绘制DotPlot
DefaultAssay(obj_integrated)="RNA"
DotPlot(obj_integrated,
group.by = "wknn_res.0.5",
features = eye_marker_integrated,
cols = c("#f8f8f8","#ff3472"),
#dot.min = 0.05,
dot.scale = 8)+ # 应用自定义配色
RotatedAxis() +
scale_x_discrete("") +
scale_y_discrete("") +
theme(
axis.text.x = element_text(size = 12, face = "bold",
angle = 45, hjust = 1, vjust = 1),
axis.text.y = element_text(size = 12, face = "bold"),
plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
legend.title = element_text(size = 10, face = "bold")
) +
ggtitle("Marker Genes Expression") +
labs(color = "Expression\nLevel") # 修改图例标题
** 注意事项 ** SeekArc 单细胞双组学数据细胞注释一般基于 wnn 多模态降维,依据 marker 基因在 wknn 聚类结果中表达情况进行注释
options(repr.plot.width=10, repr.plot.height=8)
DimPlot(obj_integrated, reduction = "wnnintergratedumap", group.by = "wknn_res.0.5", cols = my36colors,label = T)
细胞类型标注
cat('开始细胞类型注释...', Sys.time(), '\n')
# 基于聚类结果进行细胞类型注释(需要根据实际的marker基因表达情况调整)
# 这里提供一个示例,实际使用时需要根据DotPlot结果进行调整
celltype_mapping <- c(
"0" = "ECM",
"1" = "Retinal_Progenitors",
"2" = "ECM",
"3" = "PRE",
"4" = "ECM",
"5" = "Bipolar_Cells",
"6" = "Bipolar_Cells",
"7" = "ECM",
"8" = "Retinal_Progenitors",
"9" = "ECM",
"10" = "Retinal_Progenitors",
"11" = "Conjunctival_Epithelial",
"12" = "ECM",
"13" = "Bipolar_Cells",
"14" = "Retinal_Progenitors",
"15" = "Doublets",
"16" = "Endothelial",
"17" = "Rod_Photoreceptors",
"18" = "Astrocytes",
"19" = "Muller_Glia",
"20" = "Microglia",
"21" = "Lens"
)
# 应用细胞类型注释
obj_integrated$celltype <- recode(
obj_integrated$wknn_res.0.5,
!!!celltype_mapping
)注释结果可视化
cat('细胞类型注释可视化...', '\n')
# 细胞类型UMAP可视化
p1 <- DimPlot(
obj_integrated,
reduction = "wnnintergratedumap",
group.by = "celltype",
label = TRUE,
label.size = 3,
cols = my36colors
) +
ggtitle("celltype") +
theme(legend.position = "bottom")
# 按样本分组展示细胞类型分布
p2 <- DimPlot(
obj_integrated,
reduction = "wnnintergratedumap",
group.by = "celltype",
split.by = "Sample",
cols = my36colors,
ncol = 2,label=T
)
# 保存细胞类型注释图
pdf("celltype_annotation.pdf", width = 16, height = 12)
print(p1)
print(p2)
dev.off()
options(repr.plot.width=10, repr.plot.height=8)
print(p1)
options(repr.plot.width=20, repr.plot.height=8)
print(p2)
cat('细胞类型注释结果分样本展开!', '\n')pdf: 2

细胞类型注释结果分样本展开!

保存结果
# 保存整合后的Seurat对象
saveRDS(obj_integrated, file = "scMultiomics_integrated.rds")总结
本教程演示了单细胞多组学数据的完整整合分析流程。
关键要点
- 数据质量:多组学数据需要同时考虑 RNA 和 ATAC 质控指标
- 特征统一:Peak 合并确保样本间特征一致性
- 方法选择:根据数据特点选择 CCA 或 Harmony 方法
- 模态整合:WNN 方法能充分利用多模态信息
常见问题
- 内存不足:可以减少使用的维度数或降低分辨率
- 整合效果差:尝试调整整合参数或更换整合方法
- 聚类结果不理想:调整分辨率参数或使用不同的聚类算法
后续分析方向
- 细胞类型注释和标记基因识别
- 差异表达基因和差异可及性分析
- 基因调控网络推断
- 发育轨迹和伪时间分析
sessionInfo()Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Debian GNU/Linux 12 (bookworm)
Matrix products: default
BLAS/LAPACK: /jp_envs/envs/common/lib/libopenblasp-r0.3.29.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
time zone: Asia/Shanghai
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] harmony_1.2.3 Rcpp_1.0.14
[3] patchwork_1.3.0 ggplot2_3.5.2
[5] dplyr_1.1.4 biovizBase_1.50.0
[7] BSgenome.Hsapiens.UCSC.hg38_1.4.5 BSgenome_1.70.1
[9] rtracklayer_1.62.0 BiocIO_1.12.0
[11] Biostrings_2.70.1 XVector_0.42.0
[13] EnsDb.Hsapiens.v86_2.99.0 ensembldb_2.26.0
[15] AnnotationFilter_1.26.0 GenomicFeatures_1.54.1
[17] AnnotationDbi_1.64.1 Biobase_2.62.0
[19] GenomicRanges_1.54.1 GenomeInfoDb_1.38.1
[21] IRanges_2.36.0 S4Vectors_0.40.2
[23] BiocGenerics_0.48.1 Signac_1.10.0
[25] SeuratObject_4.1.4 Seurat_4.4.0
[27] repr_1.1.7
loaded via a namespace (and not attached):
[1] RcppAnnoy_0.0.22 splines_4.3.3
[3] later_1.4.2 pbdZMQ_0.3-13
[5] bitops_1.0-9 filelock_1.0.3
[7] tibble_3.2.1 polyclip_1.10-7
[9] rpart_4.1.23 XML_3.99-0.17
[11] lifecycle_1.0.4 globals_0.16.3
[13] lattice_0.22-7 MASS_7.3-60.0.1
[15] backports_1.5.0 magrittr_2.0.3
[17] rmarkdown_2.29 Hmisc_5.2-1
[19] plotly_4.10.4 yaml_2.3.10
[21] httpuv_1.6.15 sctransform_0.4.1
[23] sp_2.2-0 spatstat.sparse_3.1-0
[25] reticulate_1.42.0 cowplot_1.1.3
[27] pbapply_1.7-2 DBI_1.2.3
[29] RColorBrewer_1.1-3 abind_1.4-5
[31] zlibbioc_1.48.0 Rtsne_0.17
[33] purrr_1.0.4 RCurl_1.98-1.16
[35] nnet_7.3-19 VariantAnnotation_1.48.1
[37] rappdirs_0.3.3 GenomeInfoDbData_1.2.11
[39] ggrepel_0.9.6 irlba_2.3.5.1
[41] listenv_0.9.1 spatstat.utils_3.1-3
[43] goftest_1.2-3 spatstat.random_3.3-3
[45] fitdistrplus_1.2-2 parallelly_1.43.0
[47] DelayedArray_0.28.0 leiden_0.4.3.1
[49] codetools_0.2-20 RcppRoll_0.3.1
[51] xml2_1.3.6 tidyselect_1.2.1
[53] farver_2.1.2 matrixStats_1.5.0
[55] BiocFileCache_2.10.1 base64enc_0.1-3
[57] spatstat.explore_3.4-2 GenomicAlignments_1.38.0
[59] jsonlite_2.0.0 Formula_1.2-5
[61] progressr_0.15.1 ggridges_0.5.6
[63] survival_3.8-3 tools_4.3.3
[65] progress_1.2.3 ica_1.0-3
[67] glue_1.8.0 SparseArray_1.2.2
[69] gridExtra_2.3 xfun_0.50
[71] MatrixGenerics_1.14.0 IRdisplay_1.1
[73] withr_3.0.2 fastmap_1.2.0
[75] digest_0.6.37 R6_2.6.1
[77] mime_0.13 colorspace_2.1-1
[79] scattermore_1.2 tensor_1.5
[81] dichromat_2.0-0.1 spatstat.data_3.1-6
[83] biomaRt_2.58.0 RSQLite_2.3.9
[85] tidyr_1.3.1 generics_0.1.3
[87] data.table_1.17.0 S4Arrays_1.2.0
[89] prettyunits_1.2.0 httr_1.4.7
[91] htmlwidgets_1.6.4 uwot_0.2.3
[93] pkgconfig_2.0.3 gtable_0.3.6
[95] blob_1.2.4 lmtest_0.9-40
[97] htmltools_0.5.8.1 ProtGenerics_1.34.0
[99] scales_1.3.0 png_0.1-8
[101] spatstat.univar_3.1-2 rstudioapi_0.15.0
[103] knitr_1.49 reshape2_1.4.4
[105] rjson_0.2.23 uuid_1.2-1
[107] checkmate_2.3.2 nlme_3.1-168
[109] curl_6.0.1 zoo_1.8-14
[111] cachem_1.1.0 stringr_1.5.1
[113] KernSmooth_2.23-26 parallel_4.3.3
[115] miniUI_0.1.1.1 foreign_0.8-87
[117] restfulr_0.0.15 pillar_1.10.2
[119] grid_4.3.3 vctrs_0.6.5
[121] RANN_2.6.2 promises_1.3.2
[123] dbplyr_2.5.0 xtable_1.8-4
[125] cluster_2.1.8.1 htmlTable_2.4.3
[127] evaluate_1.0.3 cli_3.6.4
[129] compiler_4.3.3 Rsamtools_2.18.0
[131] rlang_1.1.5 crayon_1.5.3
[133] future.apply_1.11.3 plyr_1.8.9
[135] stringi_1.8.7 viridisLite_0.4.2
[137] deldir_2.0-4 BiocParallel_1.36.0
[139] munsell_0.5.1 lazyeval_0.2.2
[141] spatstat.geom_3.3-6 Matrix_1.6-5
[143] IRkernel_1.3.2 hms_1.1.3
[145] bit64_4.5.2 future_1.40.0
[147] KEGGREST_1.42.0 shiny_1.10.0
[149] SummarizedExperiment_1.32.0 ROCR_1.0-11
[151] igraph_2.0.3 memoise_2.0.1
[153] fastmatch_1.1-6 bit_4.5.0.1
