Skip to content

单细胞精细解析:基于 LncRNA 表达特征的细胞分群策略

作者: SeekGene
时长: 16 分钟
字数: 3.2k 字
更新: 2026-02-28
阅读: 0 次
全序列转录组 Notebooks 基础分析

本文探讨在单细胞 RNA 测序分析中,除传统高变基因外,利用长链非编码 RNA 进行细胞分群的补充策略。LncRNA 表达具有更高的细胞状态特异性,有助于识别传统方法难以分辨的精细亚群;同时,其表达能直接关联细胞的表观遗传调控活性,为分群提供功能解释。当传统分群分辨率不足时,此方法有助于发现新型细胞状态或生物标志物。

例如,有研究 [1] 就用这个方法找到了新的细胞亚群和关键的 LncRNA 标志物;另一项研究 [2] 也通过 LncRNA 区分出了不同特性的细胞亚群。

这里推荐两步分群法:先以经典高变基因确定主要细胞类型,再在目标亚群内使用 LncRNA 进行二次分群。该策略可排除主要类型间的背景干扰,聚焦于同一谱系内部的异质性,从而更精准地解析细胞功能状态、发育轨迹或病理转变。

这里使用寻因公共数据来作为示例,LncRNA 分群适用于 DD3 转录组 / DD 全序列转录组 / DD5 转录组等多种数据类型,很多 LncRNA 基因都是有 poly-A 尾的,常规单细胞 RNA 捕获手段得到的数据可以满足分群分析。

R
options(repr.plot.width = 6, repr.plot.height = 6)

代码示例

加载所需 R 包

R
library(Seurat)
library(harmony)
library(plotly)
library(ggplot2)
library(dplyr)
library(tidyverse)
library(RColorBrewer)
library(ggalluvial)

读取 RDS 数据

R
obj=readRDS("../../../data/流程编号/input.rds")
meta=read.delim("../../../data/流程编号/meta.tsv")

rownames(meta)=meta$barcode
obj=AddMetaData(obj,metadata = meta)

对所有基因计算高变特征 (Variable Features)

R
obj <- FindVariableFeatures(obj, nfeatures = dim(obj)[1], verbose = FALSE)
hvg_all_features <- VariableFeatures(obj)
head(hvg_all_features)
output
Warning message:
The \`slot\` argument of \`GetAssayData()\` is deprecated as of SeuratObject 5.0.0.
ℹ Please use the \`layer\` argument instead.
ℹ The deprecated feature was likely used in the Seurat package.
Please report the issue at .”
  1. 'IGHV4-59'
  2. 'IGKV1D-39'
  3. 'IGHV4-61'
  4. 'IGHV3-30'
  5. 'PTGDS'
  6. 'IGKV4-1'

读取基因类型

R
gene_biotype = read.csv("../GRCh38_gene_biotype.csv")
##人类GRCh38基因组可以直接读取下列数据
#gene_biotype = read.csv("https://seekgene-public.oss-cn-beijing.aliyuncs.com/software/FAST/GRCh38_gene_biotype.csv")
##小鼠mm10基因组可以直接读取下列数据
#gene_biotype = read.csv("https://seekgene-public.oss-cn-beijing.aliyuncs.com/software/FAST/mm10_gene_biotype.csv")
head(gene_biotype)
dim(gene_biotype)
A data.frame: 6 × 3
gene_idgene_typegene_name
<chr><chr><chr>
1ENSG00000243485lncRNA MIR1302-2HG
2ENSG00000237613lncRNA FAM138A
3ENSG00000186092protein_codingOR4F5
4ENSG00000238009lncRNA AL627309.1
5ENSG00000239945lncRNA AL627309.3
6ENSG00000239906lncRNA AL627309.2
  1. 36601
  2. 3
R
##筛选仅在rds中存在的基因,并按照hvg_all_features排序
all_genes=rownames(obj)
gene_biotype <- gene_biotype %>% filter(gene_name %in% hvg_all_features)
gene_biotype <- gene_biotype[match(hvg_all_features, gene_biotype$gene_name, nomatch = 0), ]
head(gene_biotype)
A data.frame: 6 × 3
gene_idgene_typegene_name
<chr><chr><chr>
13345ENSG00000224373IG_V_gene IGHV4-59
2547ENSG00000251546IG_V_gene IGKV1D-39
13346ENSG00000211970IG_V_gene IGHV4-61
13331ENSG00000270550IG_V_gene IGHV3-30
9429ENSG00000107317protein_codingPTGDS
2522ENSG00000211598IG_V_gene IGKV4-1

分别选择 Top 2000 的 LncRNA 基因、Protein-coding 基因、所有基因

R
top_2000_coding_RNA <- gene_biotype %>%
    filter(gene_type == "protein_coding")  %>%
    select(gene_name) %>%
    top_n(2000) %>%
    pull()
top_2000_lnc_RNA <- gene_biotype %>%
    filter(gene_type == "lncRNA")  %>%
    select(gene_name) %>%
    top_n(2000) %>%
    pull()
top_2000_all <- gene_biotype %>%
    select(gene_name) %>%
    top_n(2000) %>%
    pull()
output
Selecting by gene_name
Selecting by gene_name
Selecting by gene_name
R
redu="harmony" #如果rds是没有去除批次的,这里指定pca
resolution_=0.5 #统一设置0.5的分辨率

基于 Top 2000 Lnc 基因做降维聚类

R
VariableFeatures(obj) <- top_2000_lnc_RNA

obj <- RunUMAP(object = obj,
               assay = "RNA",
               reduction = redu,
               features  = top_2000_lnc_RNA,
               verbose = FALSE)
obj <- RunTSNE(object = obj,
               assay = "RNA",
               reduction = redu,
               features  = top_2000_lnc_RNA,
               verbose = FALSE)
obj <- FindNeighbors(object = obj,
                     assay = "RNA",
                     reduction = redu,
                     dims = NULL,
                     features  = top_2000_lnc_RNA,
                     verbose = FALSE)
obj <- FindClusters(object = obj,
                    resolution = 0.5,
                    cluster.name = "resolution_0.5_lnc",
                    verbose = FALSE) 


embed <- Embeddings(obj, reduction = "umap")
obj[["UMAP_lnc"]] <- CreateDimReducObject(embeddings = embed, key = "UMAPlnc_", assay = DefaultAssay(obj))

embed <- Embeddings(obj, reduction = "tsne")
obj[["TSNE_lnc"]] <- CreateDimReducObject(embeddings = embed, key = "TSNElnc_", assay = DefaultAssay(obj))

DimPlot(obj, group.by = "resolution_0.5_lnc", label = T, reduction = "UMAP_lnc")
output
Warning message:
“The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session”

基于 top 2000 coding 基因做降维聚类

R
VariableFeatures(obj) <- top_2000_coding_RNA

obj <- RunUMAP(object = obj,
               assay = "RNA",
               reduction = redu,
               features  = top_2000_coding_RNA,
               verbose = FALSE)
obj <- RunTSNE(object = obj,
               assay = "RNA",
               reduction = redu,
               features  = top_2000_coding_RNA,
               verbose = FALSE)
obj <- FindNeighbors(object = obj,
                     assay = "RNA",
                     reduction = redu,
                     dims = NULL,
                     features  = top_2000_coding_RNA,
                     verbose = FALSE)
obj <- FindClusters(object = obj,
                    resolution = 0.5,
                    cluster.name = "resolution_0.5_coding",
                    verbose = FALSE)


embed <- Embeddings(obj, reduction = "umap")
obj[["UMAP_coding"]] <- CreateDimReducObject(embeddings = embed, key = "UMAPcoding_", assay = DefaultAssay(obj))

embed <- Embeddings(obj, reduction = "tsne")
obj[["TSNE_coding"]] <- CreateDimReducObject(embeddings = embed, key = "TSNEcoding_", assay = DefaultAssay(obj))

DimPlot(obj, group.by = "resolution_0.5_coding", label = T, reduction = "UMAP_coding")

基于 top 2000 所有基因做降维聚类

R
VariableFeatures(obj) <- top_2000_all

obj <- RunUMAP(object = obj,
               assay = "RNA",
               reduction = redu,
               features  = top_2000_all,
               verbose = FALSE)
obj <- RunTSNE(object = obj,
               assay = "RNA",
               reduction = redu,
               features  = top_2000_all,
               verbose = FALSE)
obj <- FindNeighbors(object = obj,
                     assay = "RNA",
                     reduction = redu,
                     dims = NULL,
                     features  = top_2000_all,
                     verbose = FALSE)
obj <- FindClusters(object = obj,
                    resolution = 0.5,
                    cluster.name = "resolution_0.5_all",
                    verbose = FALSE) 

embed <- Embeddings(obj, reduction = "umap")
obj[["UMAP_all"]] <- CreateDimReducObject(embeddings = embed, key = "UMAPall_", assay = DefaultAssay(obj))

embed <- Embeddings(obj, reduction = "tsne")
obj[["TSNE_all"]] <- CreateDimReducObject(embeddings = embed, key = "TSNEall_", assay = DefaultAssay(obj))

DimPlot(obj, group.by = "resolution_0.5_all", label = T, reduction = "UMAP_all")

通过桑基图展示不同分群方式的区别

R
options(repr.plot.width = 10, repr.plot.height = 8)

plot_data <- data.frame(resolution_0.5_coding=obj$resolution_0.5_coding,
                        resolution_0.5_all=obj$resolution_0.5_all,
                        resolution_0.5_lnc=obj$resolution_0.5_lnc)

p <- ggplot(plot_data,aes(axis1 = resolution_0.5_coding, axis2 =resolution_0.5_all, axis3 = resolution_0.5_lnc)) +
  geom_alluvium(aes(fill = resolution_0.5_all))+
  geom_stratum(width = 1/6) +
  geom_label(stat = "stratum", aes(label = after_stat(stratum)))+
  scale_x_discrete(limits = c("resolution_0.5_coding","resolution_0.5_all","resolution_0.5_lnc"),expand = c(0, 0))+
theme_bw()+
theme(axis.text.x = element_text(size=15))
p
output
Warning message in to_lodes_form(data = data, axes = axis_ind, discern = params$discern):
“Some strata appear at multiple axes.”
Warning message in to_lodes_form(data = data, axes = axis_ind, discern = params$discern):
“Some strata appear at multiple axes.”
Warning message in to_lodes_form(data = data, axes = axis_ind, discern = params$discern):
“Some strata appear at multiple axes.”

通过 umap/tsne 图对比不同降维分群结果

umap 图

R
options(repr.plot.width = 20, repr.plot.height = 6)
p=DimPlot(obj, group.by = "resolution_0.5_lnc", label = T, reduction = "UMAP_lnc",ncol=3)+
DimPlot(obj, group.by = "resolution_0.5_coding", label = T, reduction = "UMAP_coding")+
DimPlot(obj, group.by = "resolution_0.5_all", label = T, reduction = "UMAP_all")
p

tsne 图

R
options(repr.plot.width = 20, repr.plot.height = 6)
p=DimPlot(obj, group.by = "resolution_0.5_lnc", label = T, reduction = "TSNE_lnc",ncol=3)+
DimPlot(obj, group.by = "resolution_0.5_coding", label = T, reduction = "TSNE_coding")+
DimPlot(obj, group.by = "resolution_0.5_all", label = T, reduction = "TSNE_all")
p

保存分群结果

新的分群结果可以上传到云平台,添加作为新的 meta 信息。 之后通过差异富集分析了解不同 cluster 的基因表达、功能特征,通过其他高级分析获取更深层次的特征信息。

R
new_meta=obj@meta.data %>%
    as.data.frame() %>%
    select(barcode, resolution_0.5_lnc,resolution_0.5_coding,resolution_0.5_all)
head(new_meta)
write.csv(new_meta,file="clustering.csv",row.names=F,quote=F)
A data.frame: 6 × 4
barcoderesolution_0.5_lncresolution_0.5_codingresolution_0.5_all
<chr><fct><fct><fct>
AAACGAAGTTTCGCTC-1_1AAACGAAGTTTCGCTC-1_1011
AAAGAACAGAGCAGCT-1_1AAAGAACAGAGCAGCT-1_1011
AAAGGATGTTAATCGC-1_1AAAGGATGTTAATCGC-1_1011
AAAGGGCAGCCTTGAT-1_1AAAGGGCAGCCTTGAT-1_1011
AAAGGTACAGCACACC-1_1AAAGGTACAGCACACC-1_1011
AAATGGACACAGTGAG-1_1AAATGGACACAGTGAG-1_1011
R
sessionInfo()
output
R version 4.3.3 (2024-02-29)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Debian GNU/Linux 12 (bookworm)

Matrix products: default
BLAS/LAPACK: /PROJ2/FLOAT/guopingfan/micromamba/env_files/envs/R_4.4.1/lib/libopenblasp-r0.3.30.so; LAPACK version 3.12.0

locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

time zone: Asia/Shanghai
tzcode source: system (glibc)

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] future_1.67.0 ggalluvial_0.12.5 RColorBrewer_1.1-3 lubridate_1.9.4
[5] forcats_1.0.0 stringr_1.6.0 purrr_1.1.0 readr_2.1.5
[9] tidyr_1.3.1 tibble_3.3.0 tidyverse_2.0.0 dplyr_1.1.4
[13] plotly_4.11.0 ggplot2_3.5.2 harmony_1.2.1 Rcpp_1.1.0
[17] Seurat_5.3.0 SeuratObject_5.1.0 sp_2.2-0 repr_1.1.7

loaded via a namespace (and not attached):
[1] jsonlite_2.0.0 magrittr_2.0.3 spatstat.utils_3.1-5
[4] farver_2.1.2 ragg_1.3.3 vctrs_0.6.5
[7] ROCR_1.0-11 Cairo_1.6-2 spatstat.explore_3.5-2
[10] base64enc_0.1-3 htmltools_0.5.8.1 sctransform_0.4.2
[13] parallelly_1.45.1 KernSmooth_2.23-26 htmlwidgets_1.6.4
[16] ica_1.0-3 plyr_1.8.9 zoo_1.8-14
[19] uuid_1.2-1 igraph_2.0.3 mime_0.13
[22] lifecycle_1.0.4 pkgconfig_2.0.3 Matrix_1.6-5
[25] R6_2.6.1 fastmap_1.2.0 fitdistrplus_1.2-4
[28] shiny_1.11.1 digest_0.6.37 colorspace_2.1-1
[31] patchwork_1.3.2 tensor_1.5.1 RSpectra_0.16-2
[34] irlba_2.3.5.1 textshaping_0.4.0 labeling_0.4.3
[37] progressr_0.18.0 spatstat.sparse_3.1-0 timechange_0.3.0
[40] httr_1.4.7 polyclip_1.10-7 abind_1.4-8
[43] compiler_4.3.3 withr_3.0.2 fastDummies_1.7.5
[46] MASS_7.3-60.0.1 tools_4.3.3 lmtest_0.9-40
[49] otel_0.2.0 httpuv_1.6.15 future.apply_1.20.0
[52] goftest_1.2-3 glue_1.8.0 nlme_3.1-168
[55] promises_1.5.0 grid_4.3.3 pbdZMQ_0.3-13
[58] Rtsne_0.17 cluster_2.1.8.1 reshape2_1.4.4
[61] generics_0.1.4 gtable_0.3.6 spatstat.data_3.1-9
[64] tzdb_0.5.0 data.table_1.17.8 hms_1.1.3
[67] spatstat.geom_3.5-0 RcppAnnoy_0.0.22 ggrepel_0.9.6
[70] RANN_2.6.2 pillar_1.11.1 spam_2.11-1
[73] IRdisplay_1.1 RcppHNSW_0.6.0 later_1.4.2
[76] splines_4.3.3 lattice_0.22-7 survival_3.8-3
[79] deldir_2.0-4 tidyselect_1.2.1 miniUI_0.1.2
[82] pbapply_1.7-4 gridExtra_2.3 scattermore_1.2
[85] matrixStats_1.5.0 stringi_1.8.7 lazyeval_0.2.2
[88] evaluate_1.0.5 codetools_0.2-20 cli_3.6.5
[91] uwot_0.2.3 IRkernel_1.3.2 xtable_1.8-4
[94] reticulate_1.43.0 systemfonts_1.3.1 dichromat_2.0-0.1
[97] globals_0.18.0 spatstat.random_3.4-1 png_0.1-8
[100] spatstat.univar_3.1-4 parallel_4.3.3 dotCall64_1.2
[103] listenv_0.10.0 viridisLite_0.4.2 scales_1.4.0
[106] ggridges_0.5.7 crayon_1.5.3 rlang_1.1.6
[109] cowplot_1.2.0

参考资料

[1] https://mp.weixin.qq.com/s/-a9kNKdsnKp6RH_rVpVsdA

[2] Bitar M, Rivera IS, Almeida I, Shi W, Ferguson K, Beesley J, Lakhani SR, Edwards SL, French JD. Redefining normal breast cell populations using long noncoding RNAs. Nucleic Acids Res. 2023 Jul 7;51(12):6389-6410. doi: 10.1093/nar/gkad339. PMID: 37144467; PMCID: PMC10325898.

[3] Aghagolzadeh P, Plaisance I, Bernasconi R, Treibel TA, Pulido Quetglas C, Wyss T, Wigger L, Nemir M, Sarre A, Chouvardas P, Johnson R, González A, Pedrazzini T. Assessment of the Cardiac Noncoding Transcriptome by Single-Cell RNA Sequencing Identifies FIXER, a Conserved Profibrogenic Long Noncoding RNA. Circulation. 2023 Aug 29;148(9):778-797. doi: 10.1161/CIRCULATIONAHA.122.062601. Epub 2023 Jul 10. PMID: 37427428.

0 条评论·0 条回复