Skip to content

SeekArc 单细胞多组学 (RNA+ATAC) 多样本整合分析教程

作者: SeekGene
时长: 27 分钟
字数: 6.3k 字
更新: 2026-02-28
阅读: 0 次
ATAC + RNA 双组学 分析指南 Notebooks 基础分析

环境准备

R 包加载

请选择 common_r 这个环境进行该整合教程的学习

R
#加载必要的R包
suppressPackageStartupMessages({
  library(Seurat)
  library(Signac)
  library(EnsDb.Hsapiens.v86, lib.loc = "/PROJ2/FLOAT/shumeng/apps/miniconda3/envs/python3.10/lib/R/library")
  library(BSgenome.Hsapiens.UCSC.hg38,lib = "/PROJ2/FLOAT/shumeng/apps/miniconda3/envs/python3.10/lib/R/library")
  library(biovizBase, lib = "/PROJ2/FLOAT/shumeng/apps/miniconda3/envs/python3.10/lib/R/library")
  #library(BSgenome.Mmusculus.UCSC.mm10)
  #library(EnsDb.Mmusculus.v79)
  library(dplyr)
  library(ggplot2)
  library(patchwork)
  library(harmony)
})

# 设置随机种子
set.seed(1234)

# 设置Seurat选项(注意:8000 * 1024^2 实际上是8GB)
options(future.globals.maxSize = 8000 * 1024^2)  # 8GB
R
# 定义颜色方案
my36colors <-c( '#E5D2DD', '#53A85F', '#F1BB72', '#F3B1A0', '#D6E7A3', '#57C3F3', '#476D87',
                '#E95C59', '#E59CC4', '#AB3282', '#23452F', '#BD956A', '#8C549C', '#585658',
                '#9FA3A8', '#E0D4CA', '#5F3D69', '#C5DEBA', '#58A4C3', '#E4C755', '#F7F398',
                '#AA9A59', '#E63863', '#E39A35', '#C1E6F3', '#6778AE', '#91D0BE', '#B53E2B',
                '#712820', '#DCC1DD', '#CCE0F5', '#CCC9E6', '#625D9E', '#68A180', '#3A6963',
                '#968175', "#6495ED", "#FFC1C1",'#f1ac9d','#f06966','#dee2d1','#6abe83','#39BAE8','#B9EDF8','#221a12',
                '#b8d00a','#74828F','#96C0CE','#E95D22','#017890')

获取基因注释信息

我们将从 EnsDb 数据库获取相应物种的基因组注释信息(基因位置、转录本、外显子、TSS 等),具体物种需根据数据做更换。这些信息用于:

  • 计算 ATAC 的 TSS 富集(判断开放染色质是否在转录起始位点附近更集中)
  • 构建基因活性矩阵(把 peaks 信号映射到基因上)
  • Peak 注释和功能分析

注意事项

  • 物种与参考基因组版本匹配(如EnsDb.Hsapiens.v86对应 hg38,EnsDb.Mmusculus.v75对应 mm10)
  • 染色体命名风格一致(例如都用chr1chr2这样的前缀)
R
# 获取基因注释信息(静默处理警告和消息)
suppressWarnings({
  suppressMessages({
    annotation <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v86)
    seqlevels(annotation) <- paste0('chr', seqlevels(annotation))
    genome(annotation) <- 'hg38'
  })
})

数据读取

本教程提供两种不同形式的数据输入方式,以满足不同用户的数据获取需求,选择适合自己的一种方式即可:

云平台 RDS 文件读取

数据特点

  • RDS 文件是标准的 Seurat 对象文件
  • 数据已经过预处理和多个样本合并
  • 可直接用于后续的下游分析,也可以提取表达矩阵,重新整合

适用场景

  • 当您无法获得标准的filtered_feature_bc_matrix表达矩阵时
  • 希望整合云平台现有数据进行学习时
  • 需要快速重新进行 scRNA-seq 数据整合分析时

注意事项

  • 具体挂载数据和 rds 文件的读取,请参照 jupyter 使用教程

例如下列项目数据/home/demo-SeekGene-com/workspace/data/AY1752565399550/

R
# 读取数据
input <- readRDS("/home/demo-seekgene-com/workspace/data/AY1752565399550/input.rds")
meta <- read.table("/home/demo-seekgene-com/workspace/data/AY1752565399550/meta.tsv", 
                  header = TRUE, 
                  sep = "\t", 
                  row.names = 1)

# 提取基因组注释(Signac 没有 Annotation() 函数,直接从 ChromatinAssay 获取)
annotations <- input@assays$ATAC@annotation  # 确保 ATAC 是 ChromatinAssay

# 创建 Seurat 对象(RNA 数据)
data <- CreateSeuratObject(
    counts = input@assays$RNA@counts,
    meta.data = meta
)

# 创建 ChromatinAssay(ATAC 数据)
data[["ATAC"]] <- CreateChromatinAssay(
    counts = input@assays$ATAC@counts,
    fragments = input@assays$ATAC@fragments,
    annotation = annotations  # 添加基因组注释
)

# 按样本拆分数据
seurat_list <- SplitObject(data, split.by = "Sample")

# 清理内存
rm(data, input)
gc()

标准 filtered_XXXX_bc_matrix 文件读取

适用场景

  • 当您拥有标准的基因表达矩阵和 epaks 开放矩阵文件时
  • 希望自主完成单细胞多组学(SeekArc)数据的多样本整合和批次矫正时
  • 需要进行完整的从原始数据到整合分析的工作流程时

注意

  • 请保证样本与样本之间的文件结构如下:

数据目录结构需满足如下要求:

  1. 每个样本的文件夹名称为样本 ID,如 S127、S44R 等。
  2. 每个样本文件夹下包含以下文件:
    • filtered_feature_bc_matrix:scRNA-seq 表达矩阵文件夹,包含barcodes.tsv.gzfeatures.tsv.gzmatrix.mtx.gz文件。
    • filtered_peaks_bc_matrix:ATAC 的 peak 开放矩阵文件夹,包含barcodes.tsv.gzfeatures.tsv.gzmatrix.mtx.gz文件。
    • {样本 ID}_A_fragments.tsv.gz:ATAC 片段文件,如 S127_A_fragments.tsv.gz。
    • {样本 ID}_A_fragments.tsv.gz.tbi:ATAC 片段索引文件,如 S127_A_fragments.tsv.gz.tbi。

具体文件夹结构如下:

├── S127/
│ ├── filtered_feature_bc_matrix/ (scRNA-seq 表达矩阵)
│ │ ├── barcodes.tsv.gz
│ │ ├── features.tsv.gz
│ │ └── matrix.mtx.gz
│ ├── filtered_peaks_bc_matrix/ (ATAC 的 peak 开放矩阵)
│ │ ├── barcodes.tsv.gz
│ │ ├── features.tsv.gz
│ │ └── matrix.mtx.gz
│ ├── S127_A_fragments.tsv.gz (ATAC 片段文件)
│ └── S127_A_fragments.tsv.gz.tbi
├── S44R/
│ ├── filtered_feature_bc_matrix/
│ │ ├── barcodes.tsv.gz
│ │ ├── features.tsv.gz
│ │ └── matrix.mtx.gz
│ ├── filtered_peaks_bc_matrix/
│ │ ├── barcodes.tsv.gz
│ │ ├── features.tsv.gz
│ │ └── matrix.mtx.gz
│ ├── S44R_A_fragments.tsv.gz
│ └── S44R_A_fragments.tsv.gz.tbi

R
sample_names <- c('S127', 'S44R')
# 创建空列表存储Seurat对象
seurat_list <- list()

# 循环读取每个样本的数据
for(sample in sample_names){
  cat('正在处理样本:', sample, '\n')
  
  # 构建文件路径
  atac_path <- file.path(sample, 'filtered_peaks_bc_matrix')
  rna_path <- file.path(sample, 'filtered_feature_bc_matrix')
  frag_path <- file.path(sample, paste0(sample, '_A_fragments.tsv.gz'))

  # 读取RNA数据
  rna_counts <- Read10X(data.dir = rna_path)

  # 创建Seurat对象
  obj <- CreateSeuratObject(
    counts = rna_counts,
    assay = "RNA")
  
  obj$Sample <- sample
  rm(rna_counts)
  gc()
  
  # 读取ATAC数据
  atac_counts <- Read10X(data.dir = atac_path)
  atac_counts <- atac_counts[Matrix::rowSums(atac_counts > 0) >= 3, ]   
  
  ChromatinAssay <- CreateChromatinAssay(
    counts = atac_counts,
    sep = c(":", "-"),
    fragments = frag_path,
    annotation = annotation
  )
  
  obj[["ATAC"]] <- ChromatinAssay
  rm(atac_counts, ChromatinAssay)
  gc()
  
  # 将对象添加到列表
  seurat_list[[sample]] <- obj
  cat('样本', sample, '处理完成\n')
}
output
正在处理样本: S127


Computing hash



样本 S127 处理完成
正在处理样本: S44R


Computing hash



样本 S44R 处理完成

数据质量控制

质控指标计算

RNA 质控指标

  • percent.mt: 线粒体基因比例(通常<20%)

ATAC 质控指标

  • TSS.enrichment: TSS 富集分数(通常>2)
  • nucleosome_signal: 核小体信号(越低越好)
  • nCount_ATAC: 总 ATAC 计数(通常在 1000-10000 之间)
  • nfeature_RNA: 总 ATAC 特征数(通常在 200-10000 之间)
R
# 对每个样本进行质量控制
suppressWarnings({
  suppressMessages({
      for(i in names(seurat_list)){
          # RNA质控指标
          seurat_list[[i]][["percent.mt"]] <- PercentageFeatureSet(seurat_list[[i]], pattern = "^MT-")
          # ATAC质控指标
          DefaultAssay(seurat_list[[i]]) <- "ATAC"
          # 计算TSS富集分数
          seurat_list[[i]] <- TSSEnrichment(object = seurat_list[[i]], fast = FALSE) 
          # 计算核小体信号
          seurat_list[[i]] <- NucleosomeSignal(object = seurat_list[[i]])
      }
  })
})

质控指标可视化

用小提琴图查看各 QC 指标的分布/异常点,确定合适的阈值:

建议:

  • 观察是否存在明显的长尾或双峰分布
  • 尝试多组阈值并比较下游聚类/UMAP 是否更清晰
R
# 可视化各质控指标,该cell内容选择性执行,非必要
options(repr.plot.width = 17, repr.plot.height = 10)
suppressWarnings({
for (sample_name in names(seurat_list)) {
  seurat_obj <- seurat_list[[sample_name]]
  p1=DensityScatter(seurat_obj, x = 'nCount_ATAC', y = 'TSS.enrichment', log_x = TRUE, quantiles = TRUE)
  p2=VlnPlot(
      object = seurat_obj,
      features = c('nCount_ATAC', 'TSS.enrichment',  'nucleosome_signal',"nFeature_RNA", "nCount_RNA", "percent.mt"),
      pt.size = 0.1,
      ncol = 6
  )
 print(p1 / p2 + plot_annotation(title = sample_name))
}
})

低质量细胞过滤

根据质控指标过滤低质量细胞。具体阈值应根据数据特征进行调整。具体过滤阈值参考上面的小提琴图分布情况。

R
# 质控过滤
for(i in names(seurat_list)){
  cat('正在对样本', i, '进行质控过滤...\n')
  
  cells_before <- ncol(seurat_list[[i]])
  
  seurat_list[[i]] <- subset(
    seurat_list[[i]],
    subset = nFeature_RNA > 200 &
             nFeature_RNA < 6000 &
             nCount_RNA > 500 &
             nCount_RNA < 30000 &
             percent.mt < 20 &
             nCount_ATAC > 500 &
             nCount_ATAC < 100000 &
             TSS.enrichment > 1 &
             nucleosome_signal < 2
  )
  
  cells_after <- ncol(seurat_list[[i]])
  cat('样本', i, '过滤完成: 过滤前', cells_before, '个细胞,过滤后', cells_after, '个细胞\n')
}
output
正在对样本 S127 进行质控过滤...n 样本 S127 过滤完成: 过滤前 13957 个细胞,过滤后 12483 个细胞
正在对样本 S44R 进行质控过滤...n 样本 S44R 过滤完成: 过滤前 12072 个细胞,过滤后 10900 个细胞

计算样本间的共有 peaks

由于各样本独立进行 peak calling,不同样本中的 peaks 不一样。为了整合多样本,需要先统一 peaks 信息,创建所有样本的共有 peak 集合。

处理流程

  1. 合并所有样本的 peaks
  2. 过滤 peak 长度(20bp-10kb)
  3. 为每个样本重新量化共有 peaks

注意事项: 如果前面读取数据用的第一种读取流程 rds 的方式,此部分不需要进行分析

注意事项: 如果前面读取数据用的第一种读取流程 rds 的方式,此部分不需要进行分析

注意事项: 如果前面读取数据用的第一种读取流程 rds 的方式,此部分不需要进行分析

R
# 提取所有样本的peaks
all_peaks <- lapply(seurat_list, function(x) {
  DefaultAssay(x) <- "ATAC"
  granges(x@assays$ATAC)
})

# 合并所有peaks
gr_list <- GRangesList(all_peaks)
all_granges <- unlist(gr_list, use.names = FALSE)
combined.peaks <- reduce(x = all_granges)

# 过滤peak长度
peakwidths <- width(combined.peaks)
common_peaks <- combined.peaks[peakwidths < 10000 & peakwidths > 20]

cat('合并后的peak数量:', length(common_peaks), '\n')

# 为每个样本创建统一的peak矩阵
seurat_list <- lapply(seurat_list, function(x) {
  combined_counts <- FeatureMatrix(
    fragments = Fragments(x@assays$ATAC),
    features = common_peaks,
    cells = colnames(x)
  )
  
  combined_peaks_assay <- CreateChromatinAssay(
    counts = combined_counts,
    fragments = Fragments(x@assays$ATAC),
    annotation = Annotation(x@assays$ATAC)
  )
  
  # 添加到Seurat对象中
  x[["combinedpeaks"]] <- combined_peaks_assay
  DefaultAssay(x) <- "combinedpeaks"
  x[["ATAC"]] <- NULL
  return(x)
})
output
合并后的peak数量: 211924


Extracting reads overlapping genomic regions

Extracting reads overlapping genomic regions

多样本合并

在多组学数据整合分析流程中,多样本合并是整合分析的重要前置步骤。

合并操作的目的

  • 将多个样本数据整合到同一个 Seurat 对象中
  • 为后续的批次矫正和整合分析做准备
  • 便于与批次矫正后的结果进行对比分析

重要说明

  • 此步骤仅进行简单的数据合并,尚未进行批次效应矫正
  • 合并后的对象包含所有样本的原始数据
  • 将合并的数据进行标准化、特征选择、PCA 降维和 UMAP 可视化等预处理步骤
R
# 合并所有样本
suppressWarnings({
  suppressMessages({
      obj_merge <- merge(seurat_list[[1]], seurat_list[-1], merge.data = FALSE)
      DefaultAssay(obj_merge) <- "RNA" 
      obj_merge <- NormalizeData(obj_merge)
      obj_merge <- FindVariableFeatures(obj_merge, nfeatures = 2000)
      obj_merge <- ScaleData(obj_merge)
      obj_merge <- RunPCA(obj_merge)
      DefaultAssay(obj_merge) <- "combinedpeaks"
      obj_merge <- FindTopFeatures(obj_merge, min.cutoff = 10)
      obj_merge <- RunTFIDF(obj_merge)
      obj_merge <- RunSVD(obj_merge)
      obj_merge <- RunUMAP(obj_merge, reduction = "lsi", dims = 2:30,reduction.name="atacumap")
        })
})

数据整合

现在我们将多个样本的单细胞多组学(SeekArc 数据)进行整合,并对整合的数据进行降维聚类。多组学数据整合主要有两种常用方法:

整合方法选择

CCA(Canonical Correlation Analysis)整合

  • 基于典型相关分析的整合方法
  • 通过寻找样本间的共同变异模式进行整合
  • 适用于样本间差异较大的情况
  • Seurat 包的经典整合方法

Harmony 整合

  • 基于迭代聚类的快速整合方法
  • 直接在降维空间中校正批次效应
  • 计算效率高,适用于大规模数据
  • 保留更多生物学变异信息

方法选择建议

  • Harmony:同平台不同样本 scATAC-seq 数据;样本数量较多或数据量大,Harmony 方法更快。
  • CCA:运行时间较久,通常批次效应严重的情况下推荐选 CCA,如跨平台的 scATAC-seq 数据

RNA 数据整合

使用 CCA 方法整合多样本的 RNA 数据

R
# 切换到RNA assay
suppressWarnings({
  suppressMessages({
      objs <- lapply(seurat_list, function(x) {
          DefaultAssay(x) <- "RNA"
          return(x)
      })
      # 定义CCA整合函数
      integrate_cca <- function(objs) {
          # 使用lapply替代for循环
          objs <- lapply(objs, function(x) {
              x <- NormalizeData(x, verbose = FALSE)
              x <- FindVariableFeatures(x,nfeatures = 2000,selection.method = "vst")
              return(x)
          })
          features <- Seurat::SelectIntegrationFeatures(object.list = objs)
          anchors <- Seurat::FindIntegrationAnchors(
              object.list = objs,
              anchor.features = features
          ) 
          obj <- Seurat::IntegrateData(anchorset = anchors)
          DefaultAssay(obj) <- "integrated"
          obj <- ScaleData(obj, verbose = FALSE)
          obj <- RunPCA(obj, verbose = FALSE)
          obj <- RunUMAP(obj, reduction = "pca",reduction.name="rnaintegratedumap", dims = 1:30)
          #obj <- RunTSNE(obj, reduction = "pca", reduction.name="rnaintegratedtsne",dims = 1:30, check_duplicates = FALSE)
          return(obj)
      }
      # 执行CCA整合
      obj_integrated <- integrate_cca(seurat_list)
  })
})    
##可视化RNA数据去批次效果
options(repr.plot.width=10, repr.plot.height=8)
DimPlot(obj_integrated, reduction = "rnaintegratedumap", group.by = "Sample")

使用 harmony 整合多样本的 RNA 数据

R
# RNA数据整合函数
integrate_harmony <- function(obj) {
  DefaultAssay(obj) <- "RNA"
  obj <- RunHarmony(obj, "Sample")
  obj <- RunUMAP(obj, reduction = "harmony",reduction.name="rnaharmonyumap",dims = 1:30)
  #obj <- RunTSNE(obj, reduction = "harmony",reduction.name="rnaharmonytsne",dims = 1:30,check_duplicates = FALSE)
  return(obj)
}
suppressWarnings({
  suppressMessages({
      obj_integrated = integrate_harmony(obj_merge)
              })
}) 
##可视化RNA数据去批次效果
options(repr.plot.width=10, repr.plot.height=8)
DimPlot(obj_integrated, reduction = "rnaharmonyumap", group.by = "Sample")

ATAC 数据整合

使用 CCA 方法整合多样本的 ATAC 数据

R
# 定义ATAC整合函数
options(future.globals.maxSize = 20 * 1024^3)
integrate_atac_cca <- function(object.list){
    object.list <- lapply(object.list, function(x) {
        DefaultAssay(x)="combinedpeaks"
        x=RunTFIDF(x)
        x=FindTopFeatures(x, min.cutoff = 5)
        x=RunSVD(x)
        return(x)
      })
    
    anchors <- FindIntegrationAnchors(
        object.list = object.list,
        anchor.features = rownames(obj_merge),
        reduction = "rlsi",
        dims = 2:30
    )
    
    integrated <- IntegrateEmbeddings(
            anchorset = anchors,
            reductions = obj_merge[["lsi"]],
            new.reduction.name = "integrated_lsi",
            dims.to.integrate = 1:30)
   integrated<- RunUMAP(integrated, reduction = "integrated_lsi", dims = 2:30,reduction.name="atacintegratedumap")
   #integrated<- RunTSNE(integrated, reduction = "integrated_lsi", dims = 2:30,reduction.name="atacintegratedtsne")
   return(integrated)
}

# 执行ATAC整合
suppressWarnings({
  suppressMessages({
      integrated_atac <- integrate_atac_cca(seurat_list)
  })
})  
# 将整合后的降维结果添加到主对象
obj_integrated[['integrated_lsi']] <- integrated_atac[['integrated_lsi']]
obj_integrated[['atacintegratedumap']] <- integrated_atac[['atacintegratedumap']]
#obj_integrated[['atacintegratedtsne']] <- integrated_atac[['atacintegratedtsne']]
DefaultAssay(obj_integrated) <- "integrated"
#可视化ATAC去批次效果
options(repr.plot.width=10, repr.plot.height=8)
DimPlot(obj_integrated, reduction = "atacintegratedumap", group.by = "Sample")

使用 harmony 方法整合多样本的 ATAC 数据

R
atac_integrate_harmony <- function(obj) {
  library(harmony)
  DefaultAssay(obj) <- "combinedpeaks"
  obj<- RunHarmony(
        object = obj,
        group.by.vars = 'Sample',
        reduction = 'lsi',
        assay.use = 'combinepeaks',
        reduction.save = "harmonylsi",
        project.dim = FALSE
        )
  obj <- RunUMAP(obj,reduction = 'harmonylsi',reduction.name="atacharmonyumap", dims = 2:30)
  #obj <- RunTSNE(obj, reduction = "harmonylsi",reduction.name="atacharmonytsne",dims = 2:50,check_duplicates = FALSE)
  return(obj)
}
suppressWarnings({
  suppressMessages({
      obj_integrated=atac_integrate_harmony(obj_integrated)
  })
})
#可视化ATAC去批次效果
options(repr.plot.width=10, repr.plot.height=8)
DimPlot(obj_integrated, reduction = "atacharmonyumap", group.by = "Sample")

加权最近邻(WNN)分析

WNN 原理

WNN 方法通过学习每种数据类型(RNA 和 ATAC)的相对重要性,自动为每个细胞分配模态权重,从而:

  • 同时利用 RNA 和 ATAC 信息
  • 避免单一模态的偏差
  • 提供更准确的细胞聚类和降维结果

参数说明

  • reduction.list:指定用于 WNN 的降维结果
    • CCA 整合:list("pca", "integrated_lsi")
    • Harmony 整合:list("harmony", "harmonylsi")
  • dims.list:各模态使用的维度范围
R
# 寻找多模态邻居
suppressWarnings({
  suppressMessages({
      obj_integrated <- FindMultiModalNeighbors(
          object = obj_integrated,
          reduction.list = list("pca", "integrated_lsi"),##CCA方法选择此
          #reduction.list = list("harmony", "harmonylsi"),##harmony方法选此
          dims.list = list(1:30, 2:50),
          modality.weight.name = "RNA.weight",
          verbose = TRUE)
      # 基于WNN进行UMAP降维
      obj_integrated <- RunUMAP(
          object = obj_integrated,
          nn.name = "weighted.nn",
          assay = "RNA",
          verbose = TRUE,
          reduction.name = "wnnintergratedumap"
      )
  })
})

聚类分析

三种聚类策略

  1. RNA 聚类:基于基因表达相似性
  2. ATAC 聚类:基于染色质可及性相似性
  3. WNN 聚类:整合两种模态信息(推荐)

结果解读

  • 不同方法可能产生不同的聚类结果
  • WNN 聚类通常能发现更细致的细胞亚群
  • 建议优先使用 WNN 结果进行下游分析

下面参数说明

下面在进行聚类分析时,请根据前期使用的批次校正方法选择对应的降维参数:

​​CCA 去批次​​:

• RNA 聚类:FindNeighbors 中使用 reduction="pca"

• ATAC 聚类:FindNeighbors 中使用 reduction="integrated_lsi"

​​Harmony 去批次​​:

• RNA 聚类:FindNeighbors 中使用 reduction="harmony"

• ATAC 聚类:FindNeighbors 中使用 reduction="harmonylsi"

R
suppressWarnings({
  suppressMessages({
      # RNA聚类
      obj_integrated <- FindNeighbors(object = obj_integrated, reduction = 'pca',graph.name = "rnaneigobr", dims = 1:30)
      #obj_integrated <- FindNeighbors(object = obj_integrated, reduction = 'harmony', dims = 1:30)
      obj_integrated <- FindClusters(object = obj_integrated, verbose = FALSE,graph.name = "rnaneigobr", algorithm = 3, resolution = 0.5)
      # ATAC聚类
      obj_integrated <- FindNeighbors(object = obj_integrated, reduction = 'integrated_lsi', graph.name = "atacneigobr", dims = 2:30)
      #obj_integrated <- FindNeighbors(object = obj_integrated, reduction = 'harmonylsi', graph.name = "atacneigobr", dims = 2:30)
      obj_integrated <- FindClusters(object = obj_integrated, verbose = FALSE, graph.name = "atacneigobr",resolution = 0.5, algorithm = 3)
      # WNN聚类
      obj_integrated <- FindClusters(obj_integrated, graph.name = "wknn", algorithm = 3, resolution = 0.5, verbose = TRUE)
        })
})
output
Computing nearest neighbor graph

Computing SNN

Computing nearest neighbor graph

Computing SNN

Only one graph name supplied, storing nearest-neighbor graph only

Warning message:
"The following arguments are not used: cluster.name"

Warning message:
"The following arguments are not used: cluster.name"

可视化参数说明

降维可视化一致性要求 在进行聚类结果可视化时,请确保使用的 UMAP 降维方法与前面聚类分析保持一致:

​​CCA 去批次​​:

• RNA 聚类结果:使用"rnaintegratedumap"降维空间

• ATAC 聚类结果:使用"atacintegratedumap"降维空间

• WNN 聚类结果:使用"wnnintergratedumap"降维空间

​​Harmony 去批次​​:

• RNA 聚类结果:使用"rnaharmonyumap"降维空间

• ATAC 聚类结果:使用"atacharmonyumap"降维空间

• WNN 聚类结果:使用"wnnintergratedumap"降维空间

R
# 比较不同方法的聚类结果
p1 <- DimPlot(obj_integrated, reduction = "rnaintegratedumap", group.by = "rnaneigobr_res.0.5",label=T, cols = my36colors) + ggtitle("RNA")
p2 <- DimPlot(obj_integrated, reduction = "atacintegratedumap", group.by = "atacneigobr_res.0.5",label=T, cols = my36colors) + ggtitle("ATAC")
p3 <- DimPlot(obj_integrated, reduction = "wnnintergratedumap", group.by = "wknn_res.0.5",label=T, cols = my36colors) + ggtitle("WNN")


#p1 <- DimPlot(obj_integrated, reduction = "rnaharmonyumap", group.by = "rnaneigobr_res.0.5",label=T, cols = my36colors) + ggtitle("RNA")
#p2 <- DimPlot(obj_integrated, reduction = "atacharmonyumap", group.by = "atacneigobr_res.0.5",label=T, cols = my36colors) + ggtitle("ATAC")
#p3 <- DimPlot(obj_integrated, reduction = "wnnintergratedumap", group.by = "wknn_res.0.5",label=T, cols = my36colors) + ggtitle("WNN")
options(repr.plot.width=23, repr.plot.height=6)
print(p1 + p2 + p3)

细胞类型注释

marker 基因

根据组织类型,收集不同细胞类型 marker 基因集,本示例数据是眼睛,下面是眼睛中细胞类型及其对应的 marker 基因,用气泡图可视化不同 cluster 高表达哪些细胞类型的 marker 基因

R
eye_marker_integrated <- list(
  # ========== 光感受器细胞 ==========
  "Rod_Photoreceptors" = c("RHO", "PDE6B", "CNGB1", "PDE6A", "NR2E3", "REEP6",
                          "RCVRN", "SAG", "NEB", "SLC24A3", "TRPM1"),
  "Cone_Photoreceptors" = c("ARR3", "GNAT2", "OPN1SW", "TRPM3"),

  # ========== 视网膜神经元 ==========
  "Bipolar_Cells" = c("VSX1", "OTX2", "GRM6", "PRKCA",
                     "DLG2", "NRXN3", "RBFOX3", "FSTL4"),
  "Amacrine_Cells" = c("GAD1", "GAD2", "C1QL2", "TFAP2B", "SLC32A1"),
  "Horizontal_Cells" = c("ONECUT1", "LHX1", "CALB1", "NFIA"),
  "Retinal_Ganglion_Cells" = c("RBPMS", "THY1", "NEFL", "POU4F2"),

  # ========== 胶质细胞 ==========
  "Muller_Glia" = c("SLC1A3", "RLBP1", "SOX9", "CRYAB"),
  "Astrocytes" = c("GFAP", "AQP4", "S100B"),
  "Microglia" = c("TMEM119", "C1QA", "AIF1", "CX3CR1", "CD74",
                 "PTPRC", "CSF1R", "CCL3L1", "SPP1", "P2RY12"),

  # ========== 血管与支持细胞 ==========
  "Endothelial" = c("FLT1", "PODXL", "PLVAP", "KDR", "EGFL7",
                   "CLDN5", "PECAM1", "CDH5"),
  "Pericytes" = c("PDGFRB", "RGS5", "CSPG4", "STAB1"),

  # ========== 上皮细胞 ==========
  "RPE" = c("RPE65", "BEST1", "PMEL", "TYRP1", "DCT", "SLC38A11"),
  #"Corneal_Epithelial" = c("KRT12", "KRT3"),
  "Corneal_Endothelial" = c("SLC4A11", "COL8A2", "ATP1A1"),
  "Conjunctival_Epithelial" = c("KRT13", "KRT19", "MUC5AC"),
  "Retinal_Progenitors" = c("VSX2", "PAX6","SOX2", "HES1", "NOTCH1"),  # 神经视网膜增殖期

  # ========== 晶状体细胞 ==========
  "Lens_Epithelial" = c("CRYAA", "BFSP1", "MIP"),
  "Lens_Fiber" = c("CRYGS", "LIM2", "FN1"),

  # ========== 其他细胞 ==========
  "Melanocytes" = c("TYR", "MLANA", "MITF"),
  #"Erythrocytes" = c("HBB", "HBA1", "HBA2"),
  "ECM" = c("COL1A1", "COL3A1", "COL12A1", "COL1A2", "COL6A3")#,
  #"Others" = c("TTN", "CLCN5", "DCC", "MIAT")
)
R
# 设置图形大小
options(repr.plot.width=25, repr.plot.height=7)

# 绘制DotPlot
DefaultAssay(obj_integrated)="RNA"
DotPlot(obj_integrated, 
        group.by = "wknn_res.0.5", 
        features = eye_marker_integrated,
        cols = c("#f8f8f8","#ff3472"),
       #dot.min = 0.05,
       dot.scale = 8)+  # 应用自定义配色
  RotatedAxis() + 
  scale_x_discrete("") + 
  scale_y_discrete("") +
  theme(
    axis.text.x = element_text(size = 12, face = "bold", 
                              angle = 45, hjust = 1, vjust = 1),
    axis.text.y = element_text(size = 12, face = "bold"),
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    legend.title = element_text(size = 10, face = "bold")
  ) +
  ggtitle("Marker Genes Expression") +
  labs(color = "Expression\nLevel")  # 修改图例标题

** 注意事项 ** SeekArc 单细胞双组学数据细胞注释一般基于 wnn 多模态降维,依据 marker 基因在 wknn 聚类结果中表达情况进行注释

R
options(repr.plot.width=10, repr.plot.height=8)
DimPlot(obj_integrated, reduction = "wnnintergratedumap", group.by = "wknn_res.0.5", cols = my36colors,label = T)

细胞类型标注

R
cat('开始细胞类型注释...', Sys.time(), '\n')

# 基于聚类结果进行细胞类型注释(需要根据实际的marker基因表达情况调整)
# 这里提供一个示例,实际使用时需要根据DotPlot结果进行调整
celltype_mapping <- c(
  "0" = "ECM",
  "1" = "Retinal_Progenitors",
  "2" = "ECM",
  "3" = "PRE",
  "4" = "ECM",
  "5" = "Bipolar_Cells",
  "6" = "Bipolar_Cells",
  "7" = "ECM",
  "8" = "Retinal_Progenitors",
  "9" = "ECM",
  "10" = "Retinal_Progenitors",
  "11" = "Conjunctival_Epithelial",
  "12" = "ECM",
  "13" = "Bipolar_Cells",
  "14" = "Retinal_Progenitors",
  "15" = "Doublets",
  "16" = "Endothelial",
  "17" = "Rod_Photoreceptors",
  "18" = "Astrocytes",
  "19" = "Muller_Glia",
  "20" = "Microglia",
  "21" = "Lens"
)

# 应用细胞类型注释
obj_integrated$celltype <- recode(
  obj_integrated$wknn_res.0.5,
  !!!celltype_mapping
)
output
开始细胞类型注释... 1754902885

注释结果可视化

R
cat('细胞类型注释可视化...', '\n')

# 细胞类型UMAP可视化
p1 <- DimPlot(
  obj_integrated,
  reduction = "wnnintergratedumap",
  group.by = "celltype",
  label = TRUE,
  label.size = 3,
  cols = my36colors
) +
  ggtitle("celltype") +
  theme(legend.position = "bottom")

# 按样本分组展示细胞类型分布
p2 <- DimPlot(
  obj_integrated,
  reduction = "wnnintergratedumap",
  group.by = "celltype",
  split.by = "Sample",
  cols = my36colors,
  ncol = 2,label=T
) 

# 保存细胞类型注释图
pdf("celltype_annotation.pdf", width = 16, height = 12)
print(p1)
print(p2)
dev.off()
options(repr.plot.width=10, repr.plot.height=8)
print(p1)
options(repr.plot.width=20, repr.plot.height=8)
print(p2)

cat('细胞类型注释结果分样本展开!', '\n')
output
细胞类型注释可视化...

pdf: 2

细胞类型注释结果分样本展开! 

保存结果

R
# 保存整合后的Seurat对象
saveRDS(obj_integrated, file = "scMultiomics_integrated.rds")

总结

本教程演示了单细胞多组学数据的完整整合分析流程。

关键要点

  • 数据质量:多组学数据需要同时考虑 RNA 和 ATAC 质控指标
  • 特征统一:Peak 合并确保样本间特征一致性
  • 方法选择:根据数据特点选择 CCA 或 Harmony 方法
  • 模态整合:WNN 方法能充分利用多模态信息

常见问题

  • 内存不足:可以减少使用的维度数或降低分辨率
  • 整合效果差:尝试调整整合参数或更换整合方法
  • 聚类结果不理想:调整分辨率参数或使用不同的聚类算法

后续分析方向

  • 细胞类型注释和标记基因识别
  • 差异表达基因和差异可及性分析
  • 基因调控网络推断
  • 发育轨迹和伪时间分析
R
sessionInfo()
output
R version 4.3.3 (2024-02-29)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Debian GNU/Linux 12 (bookworm)

Matrix products: default
BLAS/LAPACK: /jp_envs/envs/common/lib/libopenblasp-r0.3.29.so; LAPACK version 3.12.0

locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

time zone: Asia/Shanghai
tzcode source: system (glibc)

attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base

other attached packages:
[1] harmony_1.2.3 Rcpp_1.0.14
[3] patchwork_1.3.0 ggplot2_3.5.2
[5] dplyr_1.1.4 biovizBase_1.50.0
[7] BSgenome.Hsapiens.UCSC.hg38_1.4.5 BSgenome_1.70.1
[9] rtracklayer_1.62.0 BiocIO_1.12.0
[11] Biostrings_2.70.1 XVector_0.42.0
[13] EnsDb.Hsapiens.v86_2.99.0 ensembldb_2.26.0
[15] AnnotationFilter_1.26.0 GenomicFeatures_1.54.1
[17] AnnotationDbi_1.64.1 Biobase_2.62.0
[19] GenomicRanges_1.54.1 GenomeInfoDb_1.38.1
[21] IRanges_2.36.0 S4Vectors_0.40.2
[23] BiocGenerics_0.48.1 Signac_1.10.0
[25] SeuratObject_4.1.4 Seurat_4.4.0
[27] repr_1.1.7

loaded via a namespace (and not attached):
[1] RcppAnnoy_0.0.22 splines_4.3.3
[3] later_1.4.2 pbdZMQ_0.3-13
[5] bitops_1.0-9 filelock_1.0.3
[7] tibble_3.2.1 polyclip_1.10-7
[9] rpart_4.1.23 XML_3.99-0.17
[11] lifecycle_1.0.4 globals_0.16.3
[13] lattice_0.22-7 MASS_7.3-60.0.1
[15] backports_1.5.0 magrittr_2.0.3
[17] rmarkdown_2.29 Hmisc_5.2-1
[19] plotly_4.10.4 yaml_2.3.10
[21] httpuv_1.6.15 sctransform_0.4.1
[23] sp_2.2-0 spatstat.sparse_3.1-0
[25] reticulate_1.42.0 cowplot_1.1.3
[27] pbapply_1.7-2 DBI_1.2.3
[29] RColorBrewer_1.1-3 abind_1.4-5
[31] zlibbioc_1.48.0 Rtsne_0.17
[33] purrr_1.0.4 RCurl_1.98-1.16
[35] nnet_7.3-19 VariantAnnotation_1.48.1
[37] rappdirs_0.3.3 GenomeInfoDbData_1.2.11
[39] ggrepel_0.9.6 irlba_2.3.5.1
[41] listenv_0.9.1 spatstat.utils_3.1-3
[43] goftest_1.2-3 spatstat.random_3.3-3
[45] fitdistrplus_1.2-2 parallelly_1.43.0
[47] DelayedArray_0.28.0 leiden_0.4.3.1
[49] codetools_0.2-20 RcppRoll_0.3.1
[51] xml2_1.3.6 tidyselect_1.2.1
[53] farver_2.1.2 matrixStats_1.5.0
[55] BiocFileCache_2.10.1 base64enc_0.1-3
[57] spatstat.explore_3.4-2 GenomicAlignments_1.38.0
[59] jsonlite_2.0.0 Formula_1.2-5
[61] progressr_0.15.1 ggridges_0.5.6
[63] survival_3.8-3 tools_4.3.3
[65] progress_1.2.3 ica_1.0-3
[67] glue_1.8.0 SparseArray_1.2.2
[69] gridExtra_2.3 xfun_0.50
[71] MatrixGenerics_1.14.0 IRdisplay_1.1
[73] withr_3.0.2 fastmap_1.2.0
[75] digest_0.6.37 R6_2.6.1
[77] mime_0.13 colorspace_2.1-1
[79] scattermore_1.2 tensor_1.5
[81] dichromat_2.0-0.1 spatstat.data_3.1-6
[83] biomaRt_2.58.0 RSQLite_2.3.9
[85] tidyr_1.3.1 generics_0.1.3
[87] data.table_1.17.0 S4Arrays_1.2.0
[89] prettyunits_1.2.0 httr_1.4.7
[91] htmlwidgets_1.6.4 uwot_0.2.3
[93] pkgconfig_2.0.3 gtable_0.3.6
[95] blob_1.2.4 lmtest_0.9-40
[97] htmltools_0.5.8.1 ProtGenerics_1.34.0
[99] scales_1.3.0 png_0.1-8
[101] spatstat.univar_3.1-2 rstudioapi_0.15.0
[103] knitr_1.49 reshape2_1.4.4
[105] rjson_0.2.23 uuid_1.2-1
[107] checkmate_2.3.2 nlme_3.1-168
[109] curl_6.0.1 zoo_1.8-14
[111] cachem_1.1.0 stringr_1.5.1
[113] KernSmooth_2.23-26 parallel_4.3.3
[115] miniUI_0.1.1.1 foreign_0.8-87
[117] restfulr_0.0.15 pillar_1.10.2
[119] grid_4.3.3 vctrs_0.6.5
[121] RANN_2.6.2 promises_1.3.2
[123] dbplyr_2.5.0 xtable_1.8-4
[125] cluster_2.1.8.1 htmlTable_2.4.3
[127] evaluate_1.0.3 cli_3.6.4
[129] compiler_4.3.3 Rsamtools_2.18.0
[131] rlang_1.1.5 crayon_1.5.3
[133] future.apply_1.11.3 plyr_1.8.9
[135] stringi_1.8.7 viridisLite_0.4.2
[137] deldir_2.0-4 BiocParallel_1.36.0
[139] munsell_0.5.1 lazyeval_0.2.2
[141] spatstat.geom_3.3-6 Matrix_1.6-5
[143] IRkernel_1.3.2 hms_1.1.3
[145] bit64_4.5.2 future_1.40.0
[147] KEGGREST_1.42.0 shiny_1.10.0
[149] SummarizedExperiment_1.32.0 ROCR_1.0-11
[151] igraph_2.0.3 memoise_2.0.1
[153] fastmatch_1.1-6 bit_4.5.0.1
R
0 条评论·0 条回复