{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0bdae9cc",
   "metadata": {},
   "source": [
    "---\n",
    "title: scATAC-seq 高级分析：Peak-Gene 调控关联分析\n",
    "author: SeekGene\n",
    "date: 2026-01-29\n",
    "tags:\n",
    "  - ATAC + RNA 双组学\n",
    "  - 分析指南\n",
    "  - Notebooks\n",
    "  - Peak-Gene Links分析\n",
    "---\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aca702d2-45f2-420e-83d1-ac0361d75f6a",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "# scATAC-seq 高级分析：Peak-Gene 调控关联分析"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "45e7ddae",
   "metadata": {},
   "source": [
    "## 背景介绍\n",
    "\n",
    "ATAC_Peak2Gene 是面向单细胞多组学数据中 scATAC-seq 与 scRNA-seq 整合分析的 Peak-to-Gene 关联分析流程，旨在系统性地识别染色质开放区域（peaks）与基因表达之间的调控关系，揭示顺式调控元件（CREs）对基因转录的潜在影响。该流程基于 Signac/Seurat 的多组学整合框架，结合 LinkPeaks 算法、相关性分析、聚类可视化以及统计检验，提供从\"染色质可及性 → 基因表达 → 调控关联\"的全面解读。\n",
    "\n",
    "### 核心原理\n",
    "\n",
    "- Peak-to-Gene 关联检测\n",
    "    - 采用 Signac 的 LinkPeaks() 函数，基于距离约束和统计相关性，将染色质开放区域与邻近基因建立关联关系。\n",
    "    - 计算每个 peak 与基因表达之间的相关系数，并通过背景模型校正获得显著性评估。\n",
    "\n",
    "- 组学数据整合\n",
    "    - 同时分析 ATAC-seq（染色质可及性）和 RNA-seq（基因表达）数据，确保关联分析的生物学合理性。\n",
    "    - 通过方差过滤、相关性阈值和显著性检验，筛选高质量的 peak-gene 关联对。\n",
    "\n",
    "- 聚类与可视化\n",
    "    - 基于 ATAC 和 RNA 数据的 Z-score 标准化值进行 K-means 聚类，识别具有相似调控模式的 peak-gene 组合。\n",
    "    - 通过热图展示不同细胞类型中 peak-gene 关联的表达模式，揭示细胞类型特异的调控关系。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "992898d9",
   "metadata": {},
   "source": [
    "## 参数设置\n",
    "\n",
    "请先挂载将要分析的云平台相关流程数据，挂载方式请参照 jupyter 使用教程\n",
    "\n",
    "具体参数填写：\n",
    "\n",
    "- rds：挂载的流程相关的 rds 数据，在/home/{UserName}/workspace/project/{UserName}/目录下，如下列项目数据/home/shumeng/workspace/data/AY1752167131383/\n",
    "- meta：与 rds 同目录下的 metadata 文件  \n",
    "- outdir: 分析结果保存路径\n",
    "- clusters_col：细胞类型列名，选择要分析的细胞类型或者聚类对应的标签。假如想对注释好的细胞类型进行分析，则选择其对应的标签，例如 CellAnnotation，与<细胞类型>配合使用。\n",
    "- celltypes：细胞类型，多选，选择要分析的细胞类型或者聚类结果，如 Monocyte 和 Macrophage 等。\n",
    "- SAMple_col: 样本列，meta.data 的列名。\n",
    "- SAMples：样本，和 SAMple_col 搭配使用，是 meta.data 的 SAMple_col 列具体的内容。\n",
    "- species：物种信息\n",
    "- filter：是否过滤 Links,选项为 TRUE or FALSE，若为 TRUE，则填写以下四个具体过滤参数，corCutOff（相关性阈值）、pCutOff（显著性阈值）、varCutOffATAC（ATAC 方差阈值）、varCutOffRNA（RNA 方差阈值）。\n",
    "- k: K-means 聚类的簇数，用于 link 分组，默认为 10。\n",
    "- nPlot：默认 1000，限制分析的链接数量。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cf3d8acc",
   "metadata": {
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "# Parameters\n",
    "outdir = \"/PROJ2/FLOAT/advanced-analysis/prod/18676-49338-86595-580897984602987130/_build/html/\"\n",
    "rds = \"/PROJ2/FLOAT/advanced-analysis/prod/18676-49338-86595-580897984602987130/input.rds\"\n",
    "meta = \"/PROJ2/FLOAT/advanced-analysis/prod/18676-49338-86595-580897984602987130/meta.txt\"\n",
    "species = \"/jp_envs/reference/human/refdata-arc-GRCh38-2020-A/fasta/genome.fa\"\n",
    "clusters_col = \"wknn_res.0.5_d30_l2_50\"\n",
    "celltypes = \"0,1,10,11,12,13,14,15,16,17,2,3,4,5,6,7,8,9\"\n",
    "sample_col = \"Sample\"\n",
    "samples = \"LXHBCC_p_arc,LXHBCC_z_arc,SGKMM_p_arc,SGKMM_z_arc,SQKBCC_p_arc,SQKBCC_z_arc,XALSCC_p_arc,XALSCC_z_arc\"\n",
    "filter = \"FALSE\"\n",
    "corCutOff = \"\"\n",
    "pCutOff = \"\"\n",
    "varCutOffATAC = \"\"\n",
    "varCutOffRNA = \"\"\n",
    "k = \"10\"\n",
    "nPlot = \"1000\"\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "88af30d0",
   "metadata": {},
   "source": [
    "## 环境准备\n",
    "\n",
    "**R 包加载**\n",
    "\n",
    "请选择 common_r 这个环境进行该整合教程的学习"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2dd37f51-7539-4df9-8968-7051a2961983",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": [
     "remove-cell"
    ],
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "suppressPackageStartupMessages({\n",
    "  library(Seurat)\n",
    "  library(Signac)\n",
    "  library(ArchR)\n",
    "  library(BSgenome.Hsapiens.UCSC.hg38)\n",
    "  library(BSgenome.Mmusculus.UCSC.mm10)\n",
    "  library(dplyr)\n",
    "  library(MatrixGenerics)\n",
    "  library(future)\n",
    "  library(parallel)\n",
    "  library(ComplexHeatmap)\n",
    "  library(magick)\n",
    "})\n",
    "plan(multisession, workers = min(detectCores()-1, 25))\n",
    "options(future.globals.maxSize = 30 * 1024^3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bda7628a-008b-47ee-90b0-6a77f0fa46fe",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": [
     "remove-cell"
    ],
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "outdir <- paste0(outdir,'/output')\n",
    "dir.create(outdir, recursive = TRUE)\n",
    "\n",
    "celltypes <- strsplit(celltypes,\",\")[[1]]\n",
    "samples <- strsplit(samples,\",\")[[1]]\n",
    "\n",
    "filter <- as.logical(filter)\n",
    "corCutOff <- as.numeric(corCutOff)\n",
    "pCutOff <- as.numeric(pCutOff)\n",
    "varCutOffATAC <- as.numeric(varCutOffATAC)\n",
    "varCutOffRNA <- as.numeric(varCutOffRNA)\n",
    "\n",
    "k <- as.numeric(k)\n",
    "nPlot <- as.numeric(nPlot)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "029a8766",
   "metadata": {},
   "source": [
    "**数据读取**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "52cede64-3a7d-4c2c-be04-b87b43ae7898",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": [
     "remove-cell"
    ],
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "obj <- readRDS(rds)\n",
    "if (meta != \"\"){\n",
    "  meta <- read.table(meta,header=T,sep='\\t',check.names=F)\n",
    "  rownames(meta) <- meta$barcodes\n",
    "  obj <- AddMetaData(obj, meta)\n",
    "}\n",
    "obj <- subset(obj, subset = !!sym(clusters_col) == celltypes)\n",
    "obj <- subset(obj, subset = !!sym(sample_col) == samples)\n",
    "\n",
    "n_fragments <- length(obj@assays$ATAC@fragments)\n",
    "for (i in 1:n_fragments) {\n",
    "  original_path <- obj@assays$ATAC@fragments[[i]]@path\n",
    "  filename <- basename(original_path)\n",
    "  obj@assays$ATAC@fragments[[i]]@path <- paste0(dirname(dirname(outdir)), '/', filename)\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "70c049b3",
   "metadata": {},
   "source": [
    "**定义热图函数**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bed23eaa-c80e-4f31-aaf5-98c92b5359af",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": [
     "remove-cell"
    ],
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "peak2GeneHeatmap <- function(\n",
    "  seurat_obj,\n",
    "  filter = FALSE,\n",
    "  corCutOff = 0.45,\n",
    "  pCutOff = 0.0001,\n",
    "  varCutOffATAC = 0.25,\n",
    "  varCutOffRNA = 0.25,\n",
    "  k = 10,\n",
    "  nPlot = 5000,\n",
    "  groupBy = \"seurat_clusters\",\n",
    "  palGroup = NULL,\n",
    "  palATAC = paletteContinuous(\"solarExtra\"),\n",
    "  palRNA = paletteContinuous(\"blueYellow\"),\n",
    "  seed = 1\n",
    ") {\n",
    "    \n",
    "  # 设置随机种子\n",
    "  set.seed(seed)\n",
    "  \n",
    "  # 提取ATAC数据\n",
    "  peak_matrix <- GetAssayData(seurat_obj, assay = \"ATAC\", slot = \"data\")\n",
    "  peak_vars <- rowVars(peak_matrix)\n",
    "  peak_var_quantiles <- rank(peak_vars) / length(peak_vars)\n",
    "  names(peak_var_quantiles) <- rownames(peak_matrix)\n",
    "  \n",
    "  # 提取RNA数据\n",
    "  rna_matrix <- GetAssayData(seurat_obj, assay = \"RNA\", slot = \"data\")\n",
    "  rna_vars <- rowVars(rna_matrix)\n",
    "  rna_var_quantiles <- rank(rna_vars) / length(rna_vars)\n",
    "  names(rna_var_quantiles) <- rownames(rna_matrix)\n",
    "\n",
    "  links_data <- data.frame(\n",
    "    peak = links$peak,\n",
    "    gene = links$gene,\n",
    "    score = links$score,\n",
    "    pvalue = links$pvalue\n",
    "  )\n",
    "  \n",
    "  if (filter){\n",
    "    links_filtered <- links_data[\n",
    "      abs(links_data$score) >= corCutOff & \n",
    "      links_data$pvalue <= pCutOff,\n",
    "    ]\n",
    "\n",
    "    links_filtered$peak_id <- match(links_filtered$peak, rownames(peak_matrix))\n",
    "    links_filtered$gene_id <- match(links_filtered$gene, rownames(rna_matrix))\n",
    "    # 添加方差quantiles\n",
    "    links_filtered$VarQATAC <- peak_var_quantiles[rownames(peak_matrix)[links_filtered$peak_id]]\n",
    "    links_filtered$VarQRNA <- rna_var_quantiles[rownames(rna_matrix)[links_filtered$gene_id]]\n",
    "    # 进一步过滤链接（基于方差quantiles）\n",
    "    links_filtered <- links_filtered[links_filtered$VarQATAC > varCutOffATAC, ]\n",
    "    links_filtered <- links_filtered[links_filtered$VarQRNA > varCutOffRNA, ]\n",
    "\n",
    "  }else{\n",
    "    links_filtered <- links_data\n",
    "  }\n",
    "\n",
    "  if(nrow(links_filtered) < 20) { \n",
    "    stop(\"link过少，建议放宽过滤阈值\")\n",
    "  }\n",
    "\n",
    "  # 限制分析的链接数量\n",
    "  if (nrow(links_filtered) > nPlot) {\n",
    "    links_filtered <- links_filtered[sample(1:nrow(links_filtered), nPlot), ]\n",
    "  }\n",
    "\n",
    "  # 提取需要的峰区域和基因数据\n",
    "  peak_ids <- match(links_filtered$peak, rownames(peak_matrix))\n",
    "  gene_ids <- match(links_filtered$gene, rownames(rna_matrix))\n",
    "  \n",
    "  atac_data <- peak_matrix[peak_ids, ]\n",
    "  rna_data <- rna_matrix[gene_ids, ]\n",
    "  \n",
    "  # 计算Z-scores\n",
    "  rowZscores <- function(matrix1, matrix2, min = NULL, max = NULL, limit = TRUE) {\n",
    "    # 对行进行z-score标准化\n",
    "    z1 <- t(scale(t(matrix1)))\n",
    "    z2 <- t(scale(t(matrix2)))\n",
    "    # 如果需要限制范围\n",
    "    if(limit) {\n",
    "        # 如果没有指定min和max，计算分位数来确定合适的范围\n",
    "        if(is.null(min) || is.null(max)) {\n",
    "          middle_range1 <- quantile(abs(z1), probs = 0.875, na.rm = TRUE)\n",
    "          # 确定第一个矩阵的范围\n",
    "          if (middle_range1 < 1) {\n",
    "            min1 <- -1\n",
    "            max1 <- 1\n",
    "          } else if (middle_range1 < 1.5) {\n",
    "            min1 <- -1.5\n",
    "            max1 <- 1.5\n",
    "          } else {\n",
    "            min1 <- -2\n",
    "            max1 <- 2\n",
    "          }\n",
    "          middle_range2 <- quantile(abs(z2), probs = 0.875, na.rm = TRUE)\n",
    "        \n",
    "          if (middle_range2 < 1) {\n",
    "            min2 <- -1\n",
    "            max2 <- 1\n",
    "          } else if (middle_range2 < 1.5) {\n",
    "            min2 <- -1.5\n",
    "            max2 <- 1.5\n",
    "          } else {\n",
    "            min2 <- -2\n",
    "            max2 <- 2\n",
    "          }\n",
    "        min <- max(min1, min2)  # 取较大的最小值\n",
    "        max <- min(max1, max2)  # 取较小的最大值\n",
    "        }\n",
    "\n",
    "        z1[z1 > max] <- max\n",
    "        z1[z1 < min] <- min\n",
    "        z2[z2 > max] <- max\n",
    "        z2[z2 < min] <- min\n",
    "    }\n",
    "    return(list(matrix1 = z1, matrix2 = z2))\n",
    "  }\n",
    "  # atac_zscores <- rowZscores(as.matrix(atac_data))\n",
    "  # rna_zscores <- rowZscores(as.matrix(rna_data))\n",
    "  zs <- rowZscores(as.matrix(atac_data), as.matrix(rna_data))\n",
    "  atac_zscores <- zs$matrix1\n",
    "  rna_zscores <- zs$matrix2\n",
    "  # 获取细胞分组信息\n",
    "  cell_groups <- seurat_obj@meta.data[, groupBy]\n",
    "  names(cell_groups) <- colnames(seurat_obj)\n",
    "  \n",
    "  # K-means聚类\n",
    "  m <- nrow(atac_zscores)\n",
    "  if(k > m) {\n",
    "    message(sprintf(\"要求的聚类数(k=%d)大于数据点数量(m=%d)\", k, m))\n",
    "    # k <- min(k, m)\n",
    "    k <- 1\n",
    "    message(sprintf(\"自动调整聚类数为: k=%d\", k))\n",
    "  }\n",
    "\n",
    "  k_clusters <- kmeans(atac_zscores, centers = k)\n",
    "  cluster_ids <- k_clusters$cluster\n",
    "\n",
    "  while(1 %in% table(cluster_ids)){\n",
    "    k <- ceiling(k/2)\n",
    "    message(sprintf(\"聚类数过多，自动调整聚类数为: k=%d\", k))\n",
    "    k_clusters <- kmeans(atac_zscores, centers = k)\n",
    "    cluster_ids <- k_clusters$cluster\n",
    "  }\n",
    "\n",
    "  # 计算每个聚类中每个组的平均值\n",
    "  group_means_atac <- matrix(0, nrow = k, ncol = length(levels(as.factor(cell_groups))))\n",
    "  group_means_rna <- matrix(0, nrow = k, ncol = length(levels(as.factor(cell_groups))))\n",
    "  colnames(group_means_atac) <- levels(as.factor(cell_groups))\n",
    "  colnames(group_means_rna) <- levels(as.factor(cell_groups))\n",
    "  \n",
    "  for (i in 1:k) {\n",
    "    cluster_rows <- which(cluster_ids == i)\n",
    "    if (length(cluster_rows) > 0) {\n",
    "      cluster_atac <- atac_zscores[cluster_rows, , drop = FALSE]\n",
    "      cluster_rna <- rna_zscores[cluster_rows, , drop = FALSE]\n",
    "      \n",
    "      # 如果只有一行，需要保持为矩阵形式\n",
    "      # if (length(cluster_rows) == 1) {\n",
    "      #   cluster_atac <- matrix(cluster_atac, nrow = 1)\n",
    "      #   cluster_rna <- matrix(cluster_rna, nrow = 1)\n",
    "      #   colnames(cluster_atac) <- colnames(atac_zscores)\n",
    "      #   colnames(cluster_rna) <- colnames(rna_zscores)\n",
    "      # }\n",
    "      \n",
    "      for (g in levels(as.factor(cell_groups))) {\n",
    "        group_cells <- names(cell_groups)[cell_groups == g]\n",
    "        if (length(group_cells) > 0) {\n",
    "          group_means_atac[i, g] <- mean(rowMeans(cluster_atac[, group_cells, drop = FALSE]), na.rm = TRUE)\n",
    "          group_means_rna[i, g] <- mean(rowMeans(cluster_rna[, group_cells, drop = FALSE]), na.rm = TRUE)\n",
    "        }\n",
    "      }\n",
    "    }\n",
    "  }\n",
    "\n",
    "  # 对聚类重新排序\n",
    "  hc <- hclust(dist(group_means_atac))\n",
    "  cluster_order <- hc$order\n",
    "  \n",
    "  # 排序后的数据\n",
    "  ordered_links <- list()\n",
    "  ordered_atac <- list()\n",
    "  ordered_rna <- list()\n",
    "  \n",
    "  for (i in cluster_order) {\n",
    "    cluster_rows <- which(cluster_ids == i)\n",
    "    if (length(cluster_rows) > 0) {\n",
    "      ordered_links <- c(ordered_links, list(links_filtered[cluster_rows, ]))\n",
    "      ordered_atac <- c(ordered_atac, list(atac_zscores[cluster_rows, ]))\n",
    "      ordered_rna <- c(ordered_rna, list(rna_zscores[cluster_rows, ]))\n",
    "    }\n",
    "  }\n",
    "  \n",
    "  # 合并数据\n",
    "  ordered_links_df <- do.call(rbind, ordered_links)\n",
    "  ordered_atac_mat <- do.call(rbind, ordered_atac)\n",
    "  ordered_rna_mat <- do.call(rbind, ordered_rna)\n",
    "\n",
    "  # 准备分组颜色\n",
    "  if (is.null(palGroup)) {\n",
    "    group_levels <- levels(as.factor(cell_groups))\n",
    "    palGroup <- setNames(\n",
    "      scales::hue_pal()(length(group_levels)),\n",
    "      group_levels\n",
    "    )\n",
    "  }\n",
    "  \n",
    "  # 创建分组注释\n",
    "  column_atac <- ComplexHeatmap::HeatmapAnnotation(\n",
    "    Group = cell_groups,\n",
    "    col = list(Group = palGroup),\n",
    "    show_legend = FALSE,\n",
    "    show_annotation_name = FALSE\n",
    "  )\n",
    "\n",
    "  column_rna <- ComplexHeatmap::HeatmapAnnotation(\n",
    "    Cell = cell_groups,\n",
    "    col = list(Cell = palGroup),\n",
    "    show_annotation_name = FALSE\n",
    "  )\n",
    "\n",
    "  # 创建热图\n",
    "  atac_heatmap <- ComplexHeatmap::Heatmap(\n",
    "    ordered_atac_mat,\n",
    "    name = \"ATAC Z-score\",\n",
    "    col = palATAC,\n",
    "    cluster_rows = FALSE,\n",
    "    cluster_columns = FALSE,\n",
    "    column_order = order(colMeans(ordered_rna_mat), decreasing = TRUE),\n",
    "    show_row_names = FALSE,\n",
    "    show_column_names = FALSE,\n",
    "    top_annotation = column_atac,\n",
    "    column_split = cell_groups,\n",
    "    column_gap = unit(0, \"mm\"),\n",
    "    column_title = NULL,\n",
    "    row_split = rep(1:k, sapply(ordered_atac, nrow)),\n",
    "    use_raster = TRUE,\n",
    "    raster_quality = 1,\n",
    "    heatmap_legend_param = list(\n",
    "      direction = \"horizontal\",\n",
    "      title_position = \"topcenter\"\n",
    "    )\n",
    "  )\n",
    "  \n",
    "  rna_heatmap <- ComplexHeatmap::Heatmap(\n",
    "    ordered_rna_mat,\n",
    "    name = \"RNA Z-score\",\n",
    "    col = palRNA,\n",
    "    cluster_rows = FALSE,\n",
    "    cluster_columns = FALSE,\n",
    "    column_order = order(colMeans(ordered_rna_mat), decreasing = TRUE),\n",
    "    show_row_names = FALSE,\n",
    "    show_column_names = FALSE,\n",
    "    top_annotation = column_rna,\n",
    "    column_split = cell_groups,\n",
    "    column_gap = unit(0, \"mm\"),\n",
    "    column_title = NULL,\n",
    "    row_split = rep(1:k, sapply(ordered_rna, nrow)),\n",
    "    use_raster = TRUE,\n",
    "    raster_quality = 1,\n",
    "    heatmap_legend_param = list(\n",
    "      direction = \"horizontal\",\n",
    "      title_position = \"topcenter\"\n",
    "    )\n",
    "  )\n",
    "\n",
    "  draw((atac_heatmap + rna_heatmap), heatmap_legend_side = \"bottom\")\n",
    "  return(links_filtered)\n",
    "}\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "57c1cc45",
   "metadata": {},
   "source": [
    "**定义直方图函数**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a9bb14a4-4be7-4aff-b8e1-9c91add10ac3",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": [
     "remove-cell"
    ],
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "plotPeaksPerGene <- function(links) {\n",
    "  # 统计每个基因连接的peak数量\n",
    "  gene_counts <- table(links$gene)\n",
    "  gene_counts_df <- data.frame(\n",
    "    gene = names(gene_counts),\n",
    "    count = as.numeric(gene_counts)\n",
    "  )\n",
    "    \n",
    "  # 计算中位数\n",
    "  median_linked_Peaks <- median(gene_counts_df$count)\n",
    "  \n",
    "  # 绘制直方图\n",
    "  p <- ggplot(gene_counts_df, aes(x = count)) +\n",
    "    geom_histogram(binwidth = 1, fill = \"black\", color = \"white\") +\n",
    "    geom_vline(xintercept = median_linked_Peaks, linetype = \"dashed\", color = \"gray\") +\n",
    "    annotate(\"text\", x = 10, y = 1000, \n",
    "             label = paste0(\"Median Linked Peaks = \", median_linked_Peaks)) +\n",
    "    labs(x = \"Number of linked peaks\", y = \"Number of genes\") +\n",
    "    theme_classic() +\n",
    "    xlim(0, 25) #+ \n",
    "    # theme(text = element_text(family = \"sans\"))\n",
    "    \n",
    "  return(p)\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bb6efe42",
   "metadata": {},
   "source": [
    "**定义火山图函数**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cc2e4872-4138-4263-96e9-49c78bc884a9",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": [
     "remove-cell"
    ],
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "# 添加火山图：展示峰区域与基因连接的相关性和显著性\n",
    "plotPeak2GeneVolcano <- function(links, topN = 10, p = 0.05, score = 0.05) {\n",
    "  # 准备数据\n",
    "  volcano_data <- data.frame(\n",
    "    peak = links$peak,\n",
    "    gene = links$gene,\n",
    "    correlation = links$score,\n",
    "    pvalue = links$pvalue,\n",
    "    logPvalue = -log10(links$pvalue + 1e-300) # 避免log(0)\n",
    "  )\n",
    "  \n",
    "  # 直接找出最显著的topN个基因\n",
    "  top_genes_indices <- order(volcano_data$logPvalue, decreasing = TRUE)[1:min(topN, nrow(volcano_data))]\n",
    "  top_genes <- volcano_data[top_genes_indices, ]\n",
    "  \n",
    "  # 标记哪些点需要高亮显示（被标注的点）\n",
    "  volcano_data$highlighted <- \"No\"\n",
    "  volcano_data$highlighted[top_genes_indices] <- \"Yes\"\n",
    "  top_genes$highlighted <- \"Yes\"\n",
    "  \n",
    "  # 绘制火山图\n",
    "  p <- ggplot(volcano_data, aes(x = correlation, y = logPvalue, color = highlighted)) +\n",
    "    geom_point(alpha = 0.6, size = 1) +\n",
    "    scale_color_manual(values = c(\"No\" = \"gray\", \"Yes\" = \"red\")) +\n",
    "    labs(\n",
    "      # title = \"Peak-to-Gene Linkage Volcano Plot\",\n",
    "      x = \"Correlation Score\",\n",
    "      y = \"-log10(p-value)\",\n",
    "      color = \"Top Genes\"\n",
    "    ) +\n",
    "    theme_classic() +\n",
    "    theme(\n",
    "      legend.position = \"none\",\n",
    "      # text = element_text(family = \"sans\")\n",
    "    ) +\n",
    "    # 添加虚线\n",
    "    geom_vline(xintercept = -score, linetype = \"dashed\", color = \"black\", alpha = 0.7) +\n",
    "    geom_vline(xintercept = score, linetype = \"dashed\", color = \"black\", alpha = 0.7) +\n",
    "    geom_hline(yintercept = -log10(p), linetype = \"dashed\", color = \"black\", alpha = 0.7) \n",
    "  \n",
    "  # 标记顶部基因\n",
    "  if(nrow(top_genes) > 0) {\n",
    "    p <- p + ggrepel::geom_text_repel(\n",
    "      data = top_genes,\n",
    "      # family = \"sans\",\n",
    "      aes(label = gene),\n",
    "      size = 3,\n",
    "      box.padding = 0.5,\n",
    "      point.padding = 0.2,\n",
    "      force = 2,\n",
    "      max.overlaps = 20\n",
    "    )\n",
    "  }\n",
    "  \n",
    "  return(p)\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "537f7eea",
   "metadata": {},
   "source": [
    "**定义相关性排序图**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "be5b72fe-001d-4a06-9a31-07ed3cf96c4a",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": [
     "remove-cell"
    ],
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "# 使用links中的所有score值绘制相关性排序图\n",
    "plotRankCorrelation <- function(links) {\n",
    "  # 提取所有相关性值\n",
    "  all_correlations <- links$score\n",
    "  \n",
    "  # 按相关性从高到低排序\n",
    "  sorted_correlations <- sort(all_correlations, decreasing = TRUE)\n",
    "  correlation_df <- data.frame(\n",
    "    rank = 1:length(sorted_correlations),\n",
    "    correlation = sorted_correlations\n",
    "  )\n",
    "  \n",
    "  # 绘制排序图\n",
    "  p <- ggplot(correlation_df, aes(x = rank, y = correlation)) +\n",
    "    geom_line(size = 1) +\n",
    "    geom_hline(yintercept = 0, linetype = \"solid\", color = \"black\") +\n",
    "    labs(\n",
    "      # title = \"Peak-Gene correlation\",\n",
    "      x = \"Peak-Gene Linkage\",\n",
    "      y = \"Correlation Score\"\n",
    "    ) +\n",
    "    theme_classic() +\n",
    "    theme(\n",
    "      # text = element_text(family = \"sans\"),\n",
    "      axis.title = element_text(size = 14),\n",
    "      axis.text = element_text(size = 12),\n",
    "      plot.title = element_text(hjust = 0.5, size = 14),\n",
    "      axis.text.x = element_blank(),  # 隐藏x轴刻度标签\n",
    "      axis.ticks.x = element_blank()  # 隐藏x轴刻度线\n",
    "    )\n",
    "  \n",
    "  return(p)\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "35f271f8-2f87-4532-ab89-97ed5261a7d1",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "## 开始细胞群间 Peak-to-Gene 关联分析\n",
    "\n",
    "我们开始实施 Peak-to-Gene 关联分析。该分析的核心在于建立染色质开放区域（peaks）与基因表达之间的统计关联，从而识别潜在的顺式调控关系。\n",
    "\n",
    "### 分析流程概述\n",
    "\n",
    "**我们的分析采用以下步骤：**\n",
    "- 检查现有关联：首先检查 Seurat 对象中是否已存在 peak-to-gene 链接\n",
    "- 建立关联关系：如果不存在链接，则使用 LinkPeaks()函数建立新的关联\n",
    "- 数据预处理：对基因组数据进行染色体筛选和区域统计计算\n",
    "- 结果输出：将关联结果保存为表格文件，便于后续分析\n",
    "\n",
    "**关键技术参数**\n",
    "- LinkPeaks 函数：整合 ATAC-seq 和 RNA-seq 数据，计算 peak 与基因表达的相关性\n",
    "- 距离约束：默认在基因 TSS 附近 500kb 范围内寻找关联的 peak 区域\n",
    "- 统计评估：通过背景模型校正，计算 Z 得分和 P 值，评估关联的显著性\n",
    "- 基因组兼容性：支持人类和小鼠基因组，自动适配染色体信息\n",
    "\n",
    "**输出结果说明**\n",
    "- 分析完成后，我们将获得：\n",
    "    - Gene2PeakLinks.xls：包含 peak-gene 关联对的详细表格，包括 peak 位置、基因名称、相关性得分和显著性指标\n",
    "    - 关联统计信息：peak 数量、基因数量、显著关联对数量等汇总信息\n",
    "    - 质量控制指标：关联强度分布、距离分布等统计特征\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "554224f2-e6e2-4973-a7d6-d0c4d5bbd934",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": [
     "remove-cell"
    ],
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "# 检查peak-to-gene链接\n",
    "links <- Signac::Links(obj[[\"ATAC\"]])\n",
    "if (length(links) == 0) {\n",
    "  # if (species == \"human\"){\n",
    "  #     genome = BSgenome.Hsapiens.UCSC.hg38\n",
    "  # }else if (species == \"mouse\"){\n",
    "  #     genome = BSgenome.Mmusculus.UCSC.mm10\n",
    "  # }\n",
    "  library(Biostrings)\n",
    "  genome <- readDNAStringSet(species)\n",
    "  names(genome) <- sub(\" .*\", \"\", names(genome)) \n",
    "  keep_chromosomes <- unique(as.character(seqnames(granges(obj[[\"ATAC\"]]))))\n",
    "  genome <- genome[which(names(genome) %in% keep_chromosomes)]\n",
    "\n",
    "  obj <- RegionStats(\n",
    "    object = obj,\n",
    "    genome = genome,\n",
    "    assay = \"ATAC\"\n",
    "  )\n",
    "  obj <- LinkPeaks(\n",
    "      object = obj,\n",
    "      peak.assay = \"ATAC\",\n",
    "      expression.assay = \"RNA\"\n",
    "  )\n",
    "  links <- Signac::Links(obj[[\"ATAC\"]])\n",
    "  \n",
    "  if (length(links) == 0) {\n",
    "    stop(\"未找到peak-to-gene链接\")\n",
    "  }\n",
    "}\n",
    "write.table(as.data.frame(links), file = file.path(outdir, \"Gene2PeakLinks.xls\"), sep = \"\\t\", row.names = FALSE, col.names = TRUE, quote = FALSE)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "424b5a38",
   "metadata": {},
   "source": [
    "## 结果可视化\n",
    "\n",
    "**热图**\n",
    "- 热图展示了染色质开放区域(peaks)与基因表达之间的相关性模式。热图分为两部分：\n",
    "- 热图中的每一行代表一个 peak-gene 对，每一列代表一个细胞。\n",
    "- ATAC 热图（左）：显示 ATAC-seq 信号强度（染色质可及性），蓝色表示低可及性，红色表示高可及性。\n",
    "- RNA 热图（右）：显示相应基因的表达水平，蓝色表示低表达，黄色表示高表达。\n",
    "- 顶部颜色条标识了不同的细胞类型或群体，帮助观察特定细胞类型中 peak-gene 关联的模式。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "239ce1f6-9fdb-445c-a5a7-4de212379604",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": [
     "remove-cell"
    ],
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "pdf(file.path(outdir, \"Peak2GeneHeatmap.pdf\"), width = 12, height = 10)\n",
    "links <- peak2GeneHeatmap(\n",
    "    obj,\n",
    "    filter = filter,\n",
    "    corCutOff = corCutOff,\n",
    "    pCutOff = pCutOff,\n",
    "    varCutOffATAC = varCutOffATAC,\n",
    "    varCutOffRNA = varCutOffRNA,\n",
    "    k = k,\n",
    "    nPlot = nPlot,\n",
    "    groupBy = clusters_col\n",
    "  )\n",
    "dev.off()\n",
    "\n",
    "png(file.path(outdir, \"Peak2GeneHeatmap.png\"), width = 12, height = 10, units = \"in\", res = 300)\n",
    "links <- peak2GeneHeatmap(\n",
    "    obj,\n",
    "    filter = filter,\n",
    "    corCutOff = corCutOff,\n",
    "    pCutOff = pCutOff,\n",
    "    varCutOffATAC = varCutOffATAC,\n",
    "    varCutOffRNA = varCutOffRNA,\n",
    "    k = k,\n",
    "    nPlot = nPlot,\n",
    "    groupBy = clusters_col\n",
    "  )\n",
    "dev.off()\n",
    "# img <- image_read_pdf(file.path(outdir, \"Peak2GeneHeatmap.pdf\"), density = 300)\n",
    "# image_write(img, path = file.path(outdir, \"Peak2GeneHeatmap.png\"), format = \"png\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c22c58fc",
   "metadata": {},
   "source": [
    "**直方图**\n",
    "\n",
    "- 直方图显示 1 ~ 25 个关联 peak 的基因数量。\n",
    "- 横轴代表关联 peak 数，纵轴代表关联到横轴所示数量的 peak 的基因数\n",
    "- 虚线为中位数所在位置，文本标注中位数具体数值。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "57d12645-6ed4-42c9-91fc-cbb66857674d",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": [
     "remove-cell"
    ],
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "pdf(file.path(outdir, \"Peak2GeneHistogram.pdf\"), width = 8, height = 8)\n",
    "print(plotPeaksPerGene(links))\n",
    "dev.off()\n",
    "\n",
    "png(file.path(outdir, \"Peak2GeneHistogram.png\"), width = 8, height = 8, units = \"in\", res = 300)\n",
    "print(plotPeaksPerGene(links))\n",
    "dev.off()\n",
    "# img <- image_read_pdf(file.path(outdir, \"Peak2GeneHistogram.pdf\"), density = 300)\n",
    "# image_write(img, path = file.path(outdir, \"Peak2GeneHistogram.png\"), format = \"png\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eecde403",
   "metadata": {},
   "source": [
    "**火山图**\n",
    "\n",
    "- 火山图标注了 Top 10 最显著的 peak-gene 关联的基因。\n",
    "- 横轴为相关性 score，纵轴为-log10(p)。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2488cf9a-02cc-4217-9d80-4159dcd50452",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": [
     "remove-cell"
    ],
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "pdf(file.path(outdir, \"Peak2GeneVolcano.pdf\"), width = 8, height = 8)\n",
    "if(filter == TRUE) {\n",
    "  print(plotPeak2GeneVolcano(links, p = pCutOff, score = corCutOff))\n",
    "} else {\n",
    "  print(plotPeak2GeneVolcano(links))\n",
    "}\n",
    "dev.off()\n",
    "\n",
    "png(file.path(outdir, \"Peak2GeneVolcano.png\"), width = 8, height = 8, units = \"in\", res = 300)\n",
    "if(filter == TRUE) {\n",
    "  print(plotPeak2GeneVolcano(links, p = pCutOff, score = corCutOff))\n",
    "} else {\n",
    "  print(plotPeak2GeneVolcano(links))\n",
    "}\n",
    "dev.off()\n",
    "# img <- image_read_pdf(file.path(outdir, \"Peak2GeneVolcano.pdf\"), density = 300)\n",
    "# image_write(img, path = file.path(outdir, \"Peak2GeneVolcano.png\"), format = \"png\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c4e9c499",
   "metadata": {},
   "source": [
    "**相关性排序图**\n",
    "\n",
    "- 曲线图展示了所有 peak-gene 对的相关系数，按照相关性从高到低排序。\n",
    "- 横轴代表 peak-gene 对，纵轴代表其相关系数。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "070526d3-bb08-4238-80b4-cea2dff7f295",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": [
     "remove-cell"
    ],
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "pdf(file.path(outdir, \"Peak2GeneRank.pdf\"), width = 8, height = 6)\n",
    "print(plotRankCorrelation(links))\n",
    "dev.off()\n",
    "\n",
    "png(file.path(outdir, \"Peak2GeneRank.png\"), width = 8, height = 6, units = \"in\", res = 300)\n",
    "print(plotRankCorrelation(links))\n",
    "dev.off()\n",
    "# img <- image_read_pdf(file.path(outdir, \"Peak2GeneRank.pdf\"), density = 300)\n",
    "# image_write(img, path = file.path(outdir, \"Peak2GeneRank.png\"), format = \"png\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "60694d7d-d5d1-49b5-946a-e0571778f28a",
   "metadata": {
    "editable": true,
    "jp-MarkdownHeadingCollapsed": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "## 结果文件\n",
    "\n",
    "- Peak2GeneHeatmap.pdf/png - Peak-to-Gene 关联热图\n",
    "- Peak2GeneHistogram.pdf/png - Peak-to-Gene 直方图\n",
    "- Peak2GeneVolcano.pdf/png - Peak-to-Gene 小提琴图\n",
    "- Peak2GeneRank.pdf/png - Peak-to-Gene 排序曲线图\n",
    "\n",
    "结果目录：目录下包括此项分析所涉及的所有图片以及表格等结果文件\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5a536d97-1ca8-4d46-bf92-75526e415226",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "## 文献案例解析\n",
    "\n",
    "* **文献一：**  \n",
    "文献《Spatial multiomic landscape of the human placenta at molecular resolution》为了揭示基因和远端 CREs 之间的关系，作者以顺式方式将远端峰与基因相关联，并确定了 43,622 个显著的峰-基因对（代表潜在的增强子-基因关系）。\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0e622a5c-6196-4f95-bb82-f298661eb656",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "## 参考资料\n",
    "[1] Stuart, Tim, Avi Srivastava, Caleb Lareau, and Rahul Satija. \"Multimodal single-cell chromatin analysis with Signac.\" BioRxiv (2020): 2020-11.\n",
    "\n",
    "[2] Ma S, Zhang B, LaFave LM, Earl AS, Chiang Z, Hu Y, Ding J, Brack A, Kartha VK, Tay T, Law T, Lareau C, Hsu YC, Regev A, Buenrostro JD. Chromatin Potential Identified by Shared Single-Cell Profiling of RNA and Chromatin. Cell. 2020 Nov 12;183(4):1103-1116.e20. doi: 10.1016/j.cell.2020.09.056. Epub 2020 Oct 23. PMID: 33098772; PMCID: PMC7669735.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "059455e6-b826-4582-b83d-aa42d6a4594c",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "## 附录\n",
    "\n",
    "- .csv/.xls/.txt ：结果数据表格文件，文件以逗号或制表符（Tab）分隔。Linux/Mac 用户使用 less 或 more 命令查看；Windows 用户使用高级文本编辑器 Notepad++ 等查看，也可以用 Microsoft Excel 打开。\n",
    "\n",
    "- .png：结果图像文件，位图，无损压缩。\n",
    "\n",
    "- .pdf：结果图像文件，矢量图，可以放大和缩小而不失真，方便用户查看和编辑处理，可使用 Adobe Illustrator 进行图片编辑，用于文章发表等。"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "ATAC",
   "language": "R",
   "name": "atac"
  },
  "language_info": {
   "codemirror_mode": "r",
   "file_extension": ".r",
   "mimetype": "text/x-r-source",
   "name": "R",
   "pygments_lexer": "r",
   "version": "4.3.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}