
scATAC-seq variant analysis: enrichment of SNPs in open chromatin regions

Author: SeekGene
Reading time: 49 minutes
Length: 10.3k words
Updated: 2026-02-28
Tags: ATAC + RNA multi-omics · Analysis guides · Notebooks · SNP enrichment analysis

Background

Genome-wide association studies (GWAS) are a key tool for dissecting the genetic basis of complex traits. Many of the significant single-nucleotide polymorphisms (SNPs) they identify fall in non-coding regulatory regions of the genome, suggesting that these SNPs influence traits by perturbing gene-expression regulation rather than by altering protein-coding sequences, and that their effects are often cell-type- or cell-state-specific. Understanding how these non-coding variants act in a particular cellular context is a central challenge in resolving the molecular mechanisms of complex traits. scATAC-seq profiles chromatin accessibility genome-wide at single-cell resolution, making it possible to precisely define regulatory elements (e.g. enhancers and promoters) and their activity across cell types and activation states. This analysis aims to establish a functional link between a given trait and key cell types/states. The core strategy is to quantify how strongly the trait-associated SNPs identified by GWAS (or tag SNPs within their linked regions) are enriched in the open chromatin regions specific to each cell state.
First, cell-state-specific open regions are defined from the scATAC-seq data. Next, for each GWAS trait of interest (e.g. disease risk, blood biochemistry, neuropsychiatric phenotypes), we count how often its genome-wide-significant SNPs, together with SNPs in tight linkage with them (e.g. within the same LD block), fall inside the open regions of each cell state. A statistical test (e.g. a hypergeometric or permutation test) then assesses whether this frequency is significantly higher than expected by chance, revealing which cell types or activation states are likely the main "target cells" through which the trait's genetic variants act.
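The permutation logic can be illustrated with a minimal, self-contained toy example (random synthetic data with an artificially injected enrichment effect; none of it comes from the analysis below):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 10,000 peaks, of which the first 800 are "open" in some cell type,
# and a boolean vector marking peaks that contain a trait-associated SNP.
n_peaks = 10_000
open_mask = np.zeros(n_peaks, dtype=bool)
open_mask[:800] = True
snp_hits = rng.random(n_peaks) < 0.02          # ~2% of peaks contain a SNP
snp_hits[:800] |= rng.random(800) < 0.05       # inject extra SNPs into the open peaks

# Observed statistic: fraction of SNP-containing peaks that are open
observed = snp_hits[open_mask].sum() / snp_hits.sum()

# Null distribution: shuffle the open/closed labels and recompute
n_perm = 1000
null = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(open_mask)
    null[i] = snp_hits[shuffled].sum() / snp_hits.sum()

# Empirical p-value: fraction of permutations >= the observed value
p_value = (null >= observed).mean()
print(f"observed={observed:.3f}, null mean={null.mean():.3f}, p={p_value:.3f}")
```

Because the observed statistic here far exceeds the null distribution, the empirical p-value is bounded below by 1/n_perm; reporting p < 1/n_perm rather than p = 0 is common practice.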

Preparing the input data

Before running the SNP enrichment analysis, prepare the following two kinds of input files.

(1) Trait-associated SNP information

Taking the file azoospermia_with_SNP.csv as an example,

the SNP information for the azoospermia trait must contain the following columns:
trait_name index_snps index_chrom index_pos SNP_id tagged_r2 POS
azoospermia rs7192 6 32443869 rs9268686 0.929192 32447745
azoospermia rs7192 6 32443869 rs9268670 0.937897 32446510
azoospermia rs7192 6 32443869 rs7195 1.000000 32444762
azoospermia rs7192 6 32443869 rs2213585 1.000000 32445373
azoospermia rs7192 6 32443869 rs567082101 0.937897 32453552
azoospermia rs7192 6 32443869 rs7775108 0.937897 32452461
azoospermia rs7192 6 32443869 rs9268783 0.937897 32452938
azoospermia rs7192 6 32443869 rs4935354 1.000000 32444621
azoospermia rs7192 6 32443869 rs78456540 0.914852 32586949

The trait_name column is the name of the trait being analysed. The index_snps column lists the SNPs associated with this trait (from a handful up to hundreds or thousands); this information comes from GWAS results and can also be downloaded from the EBI GWAS Catalog. index_chrom is the chromosome of the index SNP and index_pos its position on that chromosome. SNP_id is a SNP in linkage disequilibrium with the index SNP, tagged_r2 is the LD r² coefficient, and POS is the position of that linked SNP.
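Before looping over dozens of trait files it can pay to validate this schema up front. A minimal sketch (the helper name load_trait_snps and the in-memory demo row are illustrative, not part of the pipeline):

```python
import io
import pandas as pd

REQUIRED_COLUMNS = ["trait_name", "index_snps", "index_chrom",
                    "index_pos", "SNP_id", "tagged_r2", "POS"]

def load_trait_snps(csv_source) -> pd.DataFrame:
    """Load one trait's SNP table and check the expected schema."""
    df = pd.read_csv(csv_source)
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    # Positions must be integers; LD r2 must lie in [0, 1]
    df["POS"] = df["POS"].astype(int)
    assert df["tagged_r2"].between(0, 1).all(), "tagged_r2 out of range"
    return df.set_index("SNP_id")

# Usage with an in-memory example row
# (normally: load_trait_snps("azoospermia_with_SNP.csv"))
demo = io.StringIO(
    "trait_name,index_snps,index_chrom,index_pos,SNP_id,tagged_r2,POS\n"
    "azoospermia,rs7192,6,32443869,rs9268686,0.929192,32447745\n"
)
print(load_trait_snps(demo).shape)  # → (1, 6)
```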

├── index_snps_with_LD_with_pos   # SNP information for every trait to be analysed, one csv file per trait
│ ├── azoospermia_with_SNP.csv
│ ├── cardiomyopathy_with_SNP.csv
│ ├── chronic.lymphocytic.leukemia_with_SNP.csv
│ ├── chronic.myelogenous.leukemia_with_SNP.csv
│ ├── chronic.pancreatitis_with_SNP.csv
│ ├── diffuse.gastric.adenocarcinoma_with_SNP.csv
│ ├── endocarditis_with_SNP.csv
│ ├── ewing.sarcoma_with_SNP.csv
│ ├── gastric.adenocarcinoma_with_SNP.csv
│ ├── gastritis_with_SNP.csv
│ ├── head.and.neck.squamous.cell.carcinoma_with_SNP.csv
│ ├── hyperplasia_with_SNP.csv
│ ├── infectious.meningitis_with_SNP.csv
│ ├── invasive.lobular.carcinoma_with_SNP.csv
│ ├── leukemia_with_SNP.csv
│ ├── myelodysplastic.syndrome_with_SNP.csv
│ ├── neoplasm.of.mature.bcells_with_SNP.csv
│ ├── neoplasm_with_SNP.csv
│ ├── nervous.system.disease_with_SNP.csv
│ ├── pancreatitis_with_SNP.csv
│ ├── papillary.renal.cell.carcinoma_with_SNP.csv
│ ├── polyp_with_SNP.csv
│ └── portal.hypertension_with_SNP.csv

(2) celltype-by-peak matrix

In the celltype-by-peak matrix, the row names are cell types and the column names are peaks; each value is the fraction of cells of that cell type in which the peak is open.

└── peaks   # directory containing the celltype-by-peaks matrix file
    └── celltype_by_peaks.csv

SNP enrichment analysis

python
import numpy as np
import pandas as pd
import scanpy as sc
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.pyplot import rc_context
import bbknn
import re
import json
import os
import rpy2
import anndata
from datetime import date
import scipy.stats
from datetime import datetime
import requests
import logging

# YYYY-MM-DD
today = date.today()
today = today.strftime("%Y-%m-%d")

import warnings
warnings.filterwarnings('ignore')
output
/PROJ2/FLOAT/shumeng/apps/miniconda3/envs/ATAC_SNPEnrich/lib/python3.11/site-packages/requests/__init__.py:86: RequestsDependencyWarning: Unable to find acceptable character detection dependency (chardet or charset_normalizer).
warnings.warn(

Parameter description

peaks_path: directory containing the celltype_by_peaks.csv file;
SNP_path: directory containing the trait-associated SNP files;
n_permutations: number of permutations for the permutation test;
output_path: output directory for the result files;
threshold: threshold for calling a peak open in a given cell type;
pval_sig_threshold: significance threshold for the enrichment p-values.

python
peaks_path='/PROJ2/FLOAT/shumeng/project/scATAC_learn/Signac/example_data/joint.14/joint14.22/'
SNP_path='/PROJ2/FLOAT/advanced-analysis/Seek_module_dev/ATAC_SNPenrichment/demo/index_snps_with_LD_with_pos/'
n_permutations=1000
output_path="../result/"
threshold=0.05
pval_sig_threshold=0.05

Reading the celltype-by-peaks file

python
os.makedirs("../result", exist_ok=True)
peakfile=os.path.join(peaks_path, "celltype_by_peaks.csv")
peaks=pd.read_csv(peakfile)  # reading the peak file takes ~5 min

peaks.columns = peaks.columns.str.replace('Unnamed: 0', 'fine_grain')
peaks=peaks.set_index(peaks['fine_grain'])
peaks=peaks.drop(columns=['fine_grain'])

# replace characters which can break downstream code
# (regex=False so '+' is treated literally rather than as a regex quantifier)
peaks.index=peaks.index.str.replace('+', 'pos', regex=False)
peaks.index=peaks.index.str.replace('/', '_or_', regex=False)
peaks.head()
                     chr1-9815-10616  chr1-180437-181865  chr1-190528-191930  ...  chrM-11-4079  chrM-6124-6729  chrM-9885-16556
fine_grain
Neurons                     0.011323            0.022375            0.039922  ...      0.996391        0.850956         0.999865
Macrophage                  0.009381            0.023139            0.013133  ...      0.997498        0.846154         1.000000
Endothelial_cells           0.016760            0.050279            0.053073  ...      1.000000        0.977654         1.000000
T_cells                     0.009019            0.022548            0.012401  ...      0.996618        0.906426         1.000000
Smooth_muscle_cells         0.010081            0.029234            0.025202  ...      0.998992        0.873992         1.000000

5 rows × 243391 columns

Each row of the celltype-by-peaks matrix is a cell type and each column a peak; the value is the fraction of that cell type's cells in which the peak is open. This table has to be prepared in advance. Concretely: binarise the peak-by-cell matrix (0 = closed, 1 = open), then for each cell type divide the number of open cells by its total number of cells.
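The preparation step above can be sketched as follows, assuming the binarised cell-by-peak matrix is available as a pandas DataFrame (rows = cells, columns = peaks, values 0/1) together with a per-cell cell-type label; all names and the toy data here are illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical binarised cell-by-peak matrix: 6 cells x 3 peaks (0 = closed, 1 = open)
bin_mat = pd.DataFrame(
    rng.integers(0, 2, size=(6, 3)),
    index=[f"cell{i}" for i in range(6)],
    columns=["chr1-100-600", "chr1-1000-1500", "chr2-50-550"],
)
celltype = pd.Series(
    ["T_cells", "T_cells", "T_cells", "B_cell", "B_cell", "B_cell"],
    index=bin_mat.index, name="fine_grain",
)

# Fraction of open cells per cell type = group-wise mean of the 0/1 matrix
celltype_by_peaks = bin_mat.groupby(celltype).mean()
celltype_by_peaks.to_csv("celltype_by_peaks.csv")  # write it where peaks_path expects it
print(celltype_by_peaks)
```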

Next, collect the peak_window, chrom, start and end coordinates of every peak in the data, in preparation for checking whether trait-associated SNPs are covered by peaks.

python
%%time
# build the initial DataFrame of peak coordinates
all_peaks = pd.DataFrame(peaks.columns, columns=['peak'])

split_result = all_peaks['peak'].str.split('-', expand=True)
all_peaks['peak_window'] = split_result[1] + "_" + split_result[2]  # plain string concatenation
all_peaks['chrom'] = split_result[0]
all_peaks['start'] = split_result[1].astype(int)  # convert to integers
all_peaks['end'] = split_result[2].astype(int)

# set the index
all_peaks = all_peaks.set_index('peak')
all_peaks.head(4)
output
CPU times: user 878 ms, sys: 47.5 ms, total: 925 ms
Wall time: 948 ms
                      peak_window chrom   start     end
peak
chr1-9815-10616        9815_10616  chr1    9815   10616
chr1-180437-181865  180437_181865  chr1  180437  181865
chr1-190528-191930  190528_191930  chr1  190528  191930
chr1-267655-268361  267655_268361  chr1  267655  268361
python
# select a window size (bp) of interest (by default it is 500bp)
window_size=500
window_adjustment=(window_size-500)/2
#print('adjustment '+str(window_adjustment))
window_size_for_filename='peak_width_'+str(window_size)
#print(window_size_for_filename)

all_peaks['start']=all_peaks['start'].apply(lambda x: x-window_adjustment)
all_peaks['end']=all_peaks['end'].apply(lambda x: x+window_adjustment)

all_peaks
                              peak_window chrom       start         end
peak
chr1-9815-10616                9815_10616  chr1      9815.0     10616.0
chr1-180437-181865          180437_181865  chr1    180437.0    181865.0
chr1-190528-191930          190528_191930  chr1    190528.0    191930.0
chr1-267655-268361          267655_268361  chr1    267655.0    268361.0
chr1-629577-630313          629577_630313  chr1    629577.0    630313.0
...                                   ...   ...         ...         ...
chrY-56873556-56874208  56873556_56874208  chrY  56873556.0  56874208.0
chrY-56879530-56880478  56879530_56880478  chrY  56879530.0  56880478.0
chrM-11-4079                      11_4079  chrM        11.0      4079.0
chrM-6124-6729                  6124_6729  chrM      6124.0      6729.0
chrM-9885-16556                9885_16556  chrM      9885.0     16556.0

243391 rows × 4 columns

python
# get list of celltypes for subsequent analyses
cell_types=peaks.index.unique().tolist()
cell_types
output
['Neurons',
'Macrophage',
'Endothelial_cells',
'T_cells',
'Smooth_muscle_cells',
'Keratinocytes',
'DC',
'B_cell']
python
# Set threshold for binarisation
threshold_for_filename=str(threshold).replace('.','p')
print(threshold_for_filename)
output
0p05

Based on all the trait files under SNP_path, count the total number of SNPs per trait included in this enrichment analysis (index SNPs together with the SNPs in LD with them).

python
%%time

# make a table of metadata about the SNP files
md_efo_terms=[]
md_n_SNPs=[]


files=os.listdir(SNP_path)

for file in files:
    
    efo_term=str(file.split('_')[0])
    md_efo_terms.append(efo_term)
                      
    snps_df=pd.read_csv(f'{SNP_path}{file}')
    snps_df=snps_df.set_index('SNP_id')
    
    snps_df["Chrom"] = snps_df['Chrom'].apply(lambda x: 'chr'+str(x)).str.split('.',expand=True)[0] # creates a new 'chrom' column and should work even for X or M chromosomes

    md_n_SNPs.append(len(snps_df))

md_dict={
    'efo_term':md_efo_terms,
    'n_SNPs':md_n_SNPs
}


SNP_md=pd.DataFrame(md_dict)
SNP_md=SNP_md.set_index('efo_term')

SNP_md.sort_values('n_SNPs',ascending=False)
output
CPU times: user 76.1 ms, sys: 2.6 ms, total: 78.7 ms
Wall time: 802 ms
                                       n_SNPs
efo_term
chronic.lymphocytic.leukemia              483
neoplasm.of.mature.bcells                 285
cardiomyopathy                            192
ewing.sarcoma                             157
chronic.myelogenous.leukemia              139
gastritis                                 121
azoospermia                               118
pancreatitis                              108
myelodysplastic.syndrome                  102
leukemia                                   90
papillary.renal.cell.carcinoma             83
head.and.neck.squamous.cell.carcinoma      63
hyperplasia                                61
gastric.adenocarcinoma                     60
chronic.pancreatitis                       57
diffuse.gastric.adenocarcinoma             56
neoplasm                                   51
polyp                                      27
infectious.meningitis                      24
portal.hypertension                        18
nervous.system.disease                     11
invasive.lobular.carcinoma                  6
endocarditis                                4
python
efo_terms=SNP_md.index.unique().tolist()

Count how many trait-associated SNPs fall inside the known peak regions, producing a peak-by-trait matrix.

python
%%time
# Add on a column for each trait, indicating whether a SNP from that trait falls within a peak. This file varies with which traits are assessed

all_peaks_path=f'{output_path}SNP_mapped_to_peaks/'
os.makedirs(all_peaks_path,
           exist_ok=True) # makes the directory
files=os.listdir(SNP_path)
for efo_term in efo_terms:
    file = [f for f in files if f"{efo_term}" in f][0]
    file_path = os.path.join(SNP_path, file)
    snps_df=pd.read_csv(file_path)
    snps_df=snps_df.set_index('SNP_id')
    snps_df["chrom"]=snps_df["Chrom"].apply(lambda x: 'chr' + str(x)).str.split('.', n=1).str[0]
    all_peaks[efo_term]=0
    # incrementally add 1 for every SNP of this trait that falls inside a peak (open or closed)
    for chrom, pos in zip(snps_df["chrom"], snps_df["POS"]):
        in_peak=(all_peaks['chrom'] == chrom)&(all_peaks['start'] <= pos)&(all_peaks['end'] >= pos)
        all_peaks.loc[in_peak, efo_term]+=1  # .loc avoids chained-assignment pitfalls
all_peaks.to_csv(f'{all_peaks_path}all_peaks_with_SNPs_for_traits_incremental.csv')
output
CPU times: user 1min 32s, sys: 99.4 ms, total: 1min 32s
Wall time: 1min 33s
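The incremental loop above scales as O(n_SNPs × n_peaks), which is why it takes about a minute and a half here. As a usage note, the same per-peak counts can be obtained far faster with np.searchsorted, under the assumption that peaks within each chromosome are sorted by start and non-overlapping; the function below is an illustrative sketch on toy data, not part of the original notebook:

```python
import numpy as np
import pandas as pd

def count_snps_per_peak(peaks: pd.DataFrame, snps: pd.DataFrame) -> np.ndarray:
    """Count SNPs falling in each peak (inclusive bounds).

    Assumes peaks are sorted by start and non-overlapping within a chromosome.
    """
    counts = np.zeros(len(peaks), dtype=int)
    for chrom, peak_idx in peaks.groupby("chrom").groups.items():
        starts = peaks.loc[peak_idx, "start"].to_numpy()
        ends = peaks.loc[peak_idx, "end"].to_numpy()
        pos = snps.loc[snps["chrom"] == chrom, "POS"].to_numpy()
        # For each SNP, index of the last peak whose start is <= pos
        hit = np.searchsorted(starts, pos, side="right") - 1
        valid = (hit >= 0) & (pos <= ends[np.clip(hit, 0, None)])
        hits, n = np.unique(hit[valid], return_counts=True)
        counts_for_chrom = np.zeros(len(peak_idx), dtype=int)
        counts_for_chrom[hits] = n
        counts[peaks.index.get_indexer(peak_idx)] += counts_for_chrom
    return counts

# Toy check: two peaks on chr1, three SNPs, two of which fall inside peaks
peaks = pd.DataFrame({"chrom": ["chr1", "chr1"], "start": [100, 500], "end": [200, 600]})
snps = pd.DataFrame({"chrom": ["chr1", "chr1", "chr1"], "POS": [150, 550, 900]})
print(count_snps_per_peak(peaks, snps))  # → [1 1]
```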
python
# Show number of SNPs falling in peaks, for each trait
efo_terms_to_drop=[]

for efo_term in efo_terms:
#    print(efo_term+': '+str(sum(all_peaks[efo_term])))
    if sum(all_peaks[efo_term]) == 0:
        efo_terms_to_drop.append(efo_term)
python
efo_terms=[efo_term for efo_term in efo_terms if efo_term not in efo_terms_to_drop]

For each cell type, create a random background by shuffling that cell type's open/closed peak labels (a binarised 1/0 matrix), so that a random set of peaks, equal in number to the peaks actually open in that cell state, is labelled open. Repeat this to create 1000 random permutations.

python
%%time

permutations = range(n_permutations)
bin_mat_path = f'{output_path}binarised_permutation_matrices/'
os.makedirs(bin_mat_path, exist_ok=True)

for cell_type in cell_types:
    output_file = f'{bin_mat_path}{cell_type}_{n_permutations}_permutations_bin_threshold_{threshold_for_filename}_matrix.csv'
    
    if not os.path.isfile(output_file):
        print(f'{datetime.now().strftime("%H:%M:%S")}...generating binarised matrix for {cell_type}...')
        
        # initialise the per-celltype DataFrame
        cell_df = pd.DataFrame(peaks.loc[cell_type])
        cell_df['peak'] = cell_df.index
        split_result = cell_df['peak'].str.split('-', expand=True)
        cell_df['chrom'] = split_result[0]
        cell_df['peak_window'] = split_result[1] + "_" + split_result[2]
        cell_df['start'] = split_result[1].astype(int)
        cell_df['end'] = split_result[2].astype(int)
        cell_df[f'{cell_type}_binarised_real'] = cell_df[cell_type].ge(threshold).astype(int)
        
        # generate all permutation columns
        for perm in permutations:
            cell_df[f'{cell_type}_binarised_permutation_{perm}'] = np.random.permutation(cell_df[f'{cell_type}_binarised_real'])
        
        # save once at the end
        cell_df.to_csv(output_file, index=True)
    else:
        print(f'NOT generating binarised matrix for {cell_type} since it already exists')

print('finished')
output
NOT generating binarised matrix for Neurons since it already exists
NOT generating binarised matrix for Macrophage since it already exists
NOT generating binarised matrix for Endothelial_cells since it already exists
NOT generating binarised matrix for T_cells since it already exists
NOT generating binarised matrix for Smooth_muscle_cells since it already exists
NOT generating binarised matrix for Keratinocytes since it already exists
NOT generating binarised matrix for DC since it already exists
NOT generating binarised matrix for B_cell since it already exists
finished
CPU times: user 1.58 ms, sys: 976 μs, total: 2.56 ms
Wall time: 26.5 ms
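Note that np.random.permutation above draws from NumPy's global random state, so the null matrices differ between runs. For reproducible null distributions, a seeded Generator can be used instead (an illustrative sketch with toy labels, not the notebook's own variables):

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed -> the same permutations every run

real = np.array([1, 0, 0, 1, 0, 1, 0, 0])  # toy binarised open/closed labels

# Each permutation keeps the number of open peaks fixed and only shuffles which ones
perms = [rng.permutation(real) for _ in range(3)]
assert all(p.sum() == real.sum() for p in perms)
```

In the loop above this would mean creating rng once at the top and replacing np.random.permutation(...) with rng.permutation(...).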

Join each cell type's permuted binarised matrix with the actual peak-by-trait matrix.

python
%%time

joined_bin_mat_path = f'{output_path}binarised_permutation_matrices_joined_incremental/'
os.makedirs(joined_bin_mat_path, exist_ok=True)

for cell_type in cell_types:
    output_file = f'{joined_bin_mat_path}{cell_type}_{n_permutations}_permutations_for_traits_bin_threshold_{threshold_for_filename}_matrix_joined.csv'
    
    if not os.path.isfile(output_file):
        # read the binarised matrix generated above
        binarised_matrix = pd.read_csv(
            f'{bin_mat_path}{cell_type}_{n_permutations}_permutations_bin_threshold_{threshold_for_filename}_matrix.csv'
        )
        
        # set the unique 'peak' column as the index
        binarised_matrix.set_index('peak', inplace=True)
        
        # join on the columns of all_peaks that flag SNPs falling in each peak,
        # suffixing any overlapping column names on the left-hand side
        binarised_matrix = binarised_matrix.join(
            all_peaks, 
            lsuffix="_right"
        )
        
        # 保存到新目录
        binarised_matrix.to_csv(output_file)
    else:
        print(f'NOT generating joined binarised matrix for {cell_type} since it already exists')

print('finished')
output
NOT generating joined binarised matrix for Neurons since it already exists
NOT generating joined binarised matrix for Macrophage since it already exists
NOT generating joined binarised matrix for Endothelial_cells since it already exists
NOT generating joined binarised matrix for T_cells since it already exists
NOT generating joined binarised matrix for Smooth_muscle_cells since it already exists
NOT generating joined binarised matrix for Keratinocytes since it already exists
NOT generating joined binarised matrix for DC since it already exists
NOT generating joined binarised matrix for B_cell since it already exists
finished
CPU times: user 2.2 ms, sys: 21 μs, total: 2.22 ms
Wall time: 28 ms

For each trait and cell state, compute the proportion of that trait's SNPs falling inside the cell state's open peaks (the SNP proportion). Compute the same SNP proportion for each random permutation. A p-value can then be calculated as the fraction of permutations whose SNP proportion is greater than or equal to the observed one. Finally, correct these p-values for multiple testing with the Benjamini-Hochberg method.

python
%%time
import logging

# Set up logging
logging.basicConfig(level=logging.ERROR)



# Find the proportion of all peaks which are open in this celltype
enrichment_output_path=f'{output_path}enrichment_output/'
os.makedirs(enrichment_output_path,
           exist_ok=True) # makes the directory

list_of_output_dfs=[]

#now = datetime.now()
#current_time = now.strftime("%H:%M:%S")
#print(f'...================================================================...') 
#print(f'...{current_time}:STARTING')
#print(f'...================================================================...') 

# make some empty lists
proportion_of_SNPs_found_in_celltype_specific_open_peaks=[]
proportion_of_all_open_peaks_found_in_this_celltype=[]

n_times_proportions_of_SNPs_in_permuted_open_peaks_greater_than_observed_proportion=[]
mean_proportions_of_SNPs_in_open_peaks=[]
p_values=[]
efo_term_list=[]
cell_type_list=[]
n_SNPs_list=[]

cell_types_done=[]
n_cell_types_total=len(cell_types)

permutations=range(n_permutations)

print('...evaluating '+str(len(efo_terms))+' traits, across '+str(len(cell_types))+' cell types...')

for cell_type in cell_types:
    
    try:
    
        cell_types_done.append(cell_type)
        n_cell_types_done=len(cell_types_done)
        n_cell_types_remaining=n_cell_types_total-n_cell_types_done

        now = datetime.now()
        current_time = now.strftime("%H:%M:%S")
        print(f'...================================================================...')    
        print(f'...{current_time}: {n_cell_types_remaining+1} of {n_cell_types_total} cell types remaining. Reading binarised matrix for {cell_type}...')
        print(f'...================================================================...') 

        cell_bin_mat=pd.read_csv(f'{joined_bin_mat_path}{cell_type}_{n_permutations}_permutations_for_traits_bin_threshold_{threshold_for_filename}_matrix_joined.csv',index_col='peak')

        prop_bins_in_this_cell_type=(len(cell_bin_mat[cell_bin_mat[f'{cell_type}_binarised_real']==1]))/len(cell_bin_mat)


        for efo_term in efo_terms:

            # grab some metadata
            n_SNPs=SNP_md.loc[efo_term]['n_SNPs']

            # add columns which won't change until we run a new cell_type
            proportion_of_all_open_peaks_found_in_this_celltype.append(prop_bins_in_this_cell_type)
            cell_type_list.append(cell_type)

            # add columns which won't change until we run a new efo_id
            n_SNPs_list.append(n_SNPs)
            efo_term_list.append(efo_term)

            # subset to just open regions for this cell type
            # find the proportion of SNPs for this trait that lie within this cell types open peaks
            observed_proportion=(cell_bin_mat[efo_term][cell_bin_mat[f'{cell_type}_binarised_real']==1].sum())/(cell_bin_mat[efo_term].sum())

            proportion_of_SNPs_found_in_celltype_specific_open_peaks.append(observed_proportion)

            proportion_of_SNPs_found_in_permutations_of_celltype_specific_open_peaks=[]

            for permutation in permutations:
                proportion_of_SNPs_found_in_permutations_of_celltype_specific_open_peaks.append(cell_bin_mat[efo_term][cell_bin_mat[f'{cell_type}_binarised_permutation_{permutation}']==1].sum()/cell_bin_mat[efo_term].sum())

            proportions_of_SNPs_in_permuted_open_peaks_greater_than_observed_proportion = [i for i in proportion_of_SNPs_found_in_permutations_of_celltype_specific_open_peaks if i >= observed_proportion]

            n_times_proportions_of_SNPs_in_permuted_open_peaks_greater_than_observed_proportion.append(len(proportions_of_SNPs_in_permuted_open_peaks_greater_than_observed_proportion))

            p_values.append(len(proportions_of_SNPs_in_permuted_open_peaks_greater_than_observed_proportion)/len(permutations)) # p val is simply the proportion of null hypotheses 'observations' greater than the actual observed proportion

            mean_proportions_of_SNPs_in_open_peaks.append(sum(proportion_of_SNPs_found_in_permutations_of_celltype_specific_open_peaks)/len(proportion_of_SNPs_found_in_permutations_of_celltype_specific_open_peaks))

            # Plot histograms for each cell type
    #        plt.rcParams["figure.figsize"] = (20,10)
    #        plt.rcParams["figure.dpi"] = 300

    #        plt.hist(proportion_of_SNPs_found_in_permutations_of_celltype_specific_open_peaks,
    #                 bins=100,color='red',
    #                 range=(0,1),
    #                 histtype='stepfilled',edgecolor='none')
    #        plt.axvline(x=observed_proportion, color='blue', linestyle='--')
    #        plt.legend(['null: proportion of SNPs falling in randomly shuffled OC regions','observed: proportion of SNPs falling cell-type specific OC regions'])
    #        plt.title('cell type: '+cell_type+', trait: '+efo_id+', term: '+efo_term+', threshold for binarisation: '+threshold_for_filename)
    #        plt.savefig(f'{output_path}{efo_id}_{efo_term}_{cell_type}_{threshold_for_filename}_SNP_enrichment.png')
    #        plt.clf() #clears the current plot

    
    except KeyError as e:
        # Log the error
        logging.error(e)
        print(cell_type)

    # edited so that the file is written incrementally
        
    output_dict={
        'cell_type':cell_type_list,
        'proportion_of_all_open_peaks_found_in_this_celltype':proportion_of_all_open_peaks_found_in_this_celltype,
        'proportion_of_SNPs_found_in_celltype_specific_open_peaks':proportion_of_SNPs_found_in_celltype_specific_open_peaks,
        'mean_proportions_of_SNPs_in_open_peaks':mean_proportions_of_SNPs_in_open_peaks,
        'n_times_proportions_of_SNPs_in_permuted_open_peaks_greater_than_observed_proportion':n_times_proportions_of_SNPs_in_permuted_open_peaks_greater_than_observed_proportion,
        'p_value':p_values,
        'n_SNPs':n_SNPs_list,
        'efo_term':efo_term_list}

    output_df=pd.DataFrame(output_dict)

    list_of_output_dfs.append(output_df)
    combined_output_df=pd.concat(list_of_output_dfs)
    combined_output_df=combined_output_df.sort_values(by=['efo_term'])
    combined_output_df=combined_output_df.set_index('cell_type')
    combined_output_df.to_csv(f'{enrichment_output_path}{threshold_for_filename}_{window_size_for_filename}_SNPs_in_LD_all_traits_summary.csv')

now = datetime.now()
current_time = now.strftime("%H:%M:%S")
print(f'...================================================================...')    
print(f'...{current_time}:FINISHED')
print(f'...================================================================...')
output
...evaluating 19 traits, across 8 cell types...
...================================================================...
...09:46:59: 8 of 8 cell types remaining. Reading binarised matrix for Neurons...
...================================================================...
...================================================================...
...09:49:31: 7 of 8 cell types remaining. Reading binarised matrix for Macrophage...
...================================================================...
...================================================================...
...09:51:32: 6 of 8 cell types remaining. Reading binarised matrix for Endothelial_cells...
...================================================================...
...================================================================...
...09:54:00: 5 of 8 cell types remaining. Reading binarised matrix for T_cells...
...================================================================...
...================================================================...
...09:56:07: 4 of 8 cell types remaining. Reading binarised matrix for Smooth_muscle_cells...
...================================================================...
...================================================================...
...09:58:15: 3 of 8 cell types remaining. Reading binarised matrix for Keratinocytes...
...================================================================...
...================================================================...
...10:00:35: 2 of 8 cell types remaining. Reading binarised matrix for DC...
...================================================================...
...================================================================...
...10:02:40: 1 of 8 cell types remaining. Reading binarised matrix for B_cell...
...================================================================...
...================================================================...
...10:04:45:FINISHED
...================================================================...
CPU times: user 9min 9s, sys: 26.8 s, total: 9min 36s
Wall time: 17min 46s
python
neglog10_pval_sig_threshold=np.negative(np.log10(pval_sig_threshold))
neglog10_pval_sig_threshold
output
1.3010299956639813

Visualising the enrichment results

python
combined_output_df=pd.read_csv(f'{enrichment_output_path}{threshold_for_filename}_{window_size_for_filename}_SNPs_in_LD_all_traits_summary.csv')
combined_output_df.head()
             cell_type  proportion_of_all_open_peaks_found_in_this_celltype  ...  n_SNPs     efo_term
0              Neurons                                             0.075689  ...     118  azoospermia
1              Neurons                                             0.075689  ...     118  azoospermia
2        Keratinocytes                                             0.095275  ...     118  azoospermia
3  Smooth_muscle_cells                                             0.061617  ...     118  azoospermia
4              T_cells                                             0.038198  ...     118  azoospermia

[5 rows x 8 columns]
python
combined_output_df['efo_term']=combined_output_df['efo_term'].astype(str)#+"_"+combined_output_df['n_SNPs'].astype(str) # Add the n_SNPs - useful for later inspection
combined_output_df.head()
             cell_type  proportion_of_all_open_peaks_found_in_this_celltype  ...  n_SNPs     efo_term
0              Neurons                                             0.075689  ...     118  azoospermia
1              Neurons                                             0.075689  ...     118  azoospermia
2        Keratinocytes                                             0.095275  ...     118  azoospermia
3  Smooth_muscle_cells                                             0.061617  ...     118  azoospermia
4              T_cells                                             0.038198  ...     118  azoospermia

[5 rows x 8 columns]
python
combined_output_df.proportion_of_SNPs_found_in_celltype_specific_open_peaks.hist(bins=100)
output
<Axes: >
python
#pivot the table to prepare for heatmap
df=combined_output_df[['p_value',"cell_type",'efo_term']]
df.head()
   p_value            cell_type     efo_term
0      1.0              Neurons  azoospermia
1      1.0              Neurons  azoospermia
2      1.0        Keratinocytes  azoospermia
3      1.0  Smooth_muscle_cells  azoospermia
4      1.0              T_cells  azoospermia
python
df=df.pivot_table(values='p_value',index='cell_type',columns='efo_term')
df
efo_term             azoospermia  cardiomyopathy  chronic.lymphocytic.leukemia  ...  papillary.renal.cell.carcinoma  polyp  portal.hypertension
cell_type
B_cell                       1.0           0.156                         0.000  ...                           0.001  1.000                0.249
DC                           1.0           0.011                         0.003  ...                           0.002  1.000                0.247
Endothelial_cells            1.0           0.008                         0.000  ...                           0.004  1.000                0.112
Keratinocytes                1.0           0.025                         0.002  ...                           0.000  0.102                0.103
Macrophage                   1.0           0.190                         0.000  ...                           0.000  1.000                0.277
Neurons                      1.0           0.144                         0.004  ...                           0.000  0.087                0.068
Smooth_muscle_cells          1.0           0.105                         0.002  ...                           0.000  1.000                0.054
T_cells                      1.0           0.102                         0.005  ...                           0.000  1.000                0.184

[8 rows x 19 columns]
python
# drop columns where p value is all 1 (i.e. nothing at all significant)
print(df.shape)
df=df.loc[:, (df != 1).any(axis=0)]
print(df.shape)
output
(8, 19)
(8, 15)
python
# drop columns where p value is all 0 (i.e. no open peaks)
print(df.shape)
df=df.loc[:, (df != 0).any(axis=0)]
print(df.shape)
output
(8, 15)
(8, 15)
python
plt.rcParams['figure.dpi'] = 100

tmp=df.loc[:, (df > 0.05).any(axis=0)]
tmp=tmp.loc[(tmp > 0.05).any(axis=1),:]

g=sns.clustermap(tmp,
               xticklabels=True,
               yticklabels=True,
               cmap='OrRd',
               figsize=(tmp.shape[1]*0.3,tmp.shape[0]*0.3),
#               annot=anno_df,
               fmt = '',
               dendrogram_ratio=0.05,
               cbar_pos=(0.98,0.9,0.02,0.1)
              )

for a in g.ax_row_dendrogram.collections:
    a.set_linewidth(1)

for a in g.ax_col_dendrogram.collections:
    a.set_linewidth(1)
    
ax = g.ax_heatmap
ax.set_xlabel('GWAS traits')
ax.set_ylabel('Cell types')
plt.show()
python
print("Summary of the raw p-value matrix:")
print(df.describe())
plt.figure()
sns.histplot(df.values.flatten())
plt.title('Value Distribution')
plt.show()
output
Summary of the raw p-value matrix:
efo_term  cardiomyopathy  chronic.lymphocytic.leukemia  \
count           8.000000                      8.000000
mean            0.092625                      0.002000
std             0.070484                      0.001927
min             0.008000                      0.000000
25%             0.021500                      0.000000
50%             0.103500                      0.002000
75%             0.147000                      0.003250
max             0.190000                      0.005000

efo_term  chronic.myelogenous.leukemia  diffuse.gastric.adenocarcinoma  \
count                         8.000000                        8.000000
mean                          0.898750                        0.544375
std                           0.286378                        0.381603
min                           0.190000                        0.205000
25%                           1.000000                        0.242750
50%                           1.000000                        0.341500
75%                           1.000000                        1.000000
max                           1.000000                        1.000000

efo_term  ewing.sarcoma  gastric.adenocarcinoma  gastritis  \
count          8.000000                8.000000   8.000000
mean           0.331000                0.544375   0.363125
std            0.419288                0.381603   0.102227
min            0.028000                0.205000   0.229000
25%            0.064000                0.242750   0.293000
50%            0.121000                0.341500   0.343500
75%            0.440500                1.000000   0.447250
max            1.000000                1.000000   0.508000

efo_term  head.and.neck.squamous.cell.carcinoma  leukemia  \
count                                  8.000000  8.000000
mean                                   0.176125  0.002000
std                                    0.124161  0.001927
min                                    0.003000  0.000000
25%                                    0.107500  0.000000
50%                                    0.187500  0.002000
75%                                    0.247500  0.003250
max                                    0.363000  0.005000

efo_term  myelodysplastic.syndrome  neoplasm  neoplasm.of.mature.bcells  \
count                     8.000000  8.000000                   8.000000
mean                      0.503250  0.198375                   0.198375
std                       0.413574  0.107493                   0.107493
min                       0.123000  0.072000                   0.072000
25%                       0.202250  0.104250                   0.104250
50%                       0.257000  0.182500                   0.182500
75%                       1.000000  0.306000                   0.306000
max                       1.000000  0.323000                   0.323000

efo_term  papillary.renal.cell.carcinoma     polyp  portal.hypertension
count                           8.000000  8.000000             8.000000
mean                            0.000875  0.773625             0.161750
std                             0.001458  0.419184             0.088627
min                             0.000000  0.087000             0.054000
25%                             0.000000  0.775500             0.094250
50%                             0.000000  1.000000             0.148000
75%                             0.001250  1.000000             0.247500
max                             0.004000  1.000000             0.277000
python
# multiple testing correction
import statsmodels.stats.multitest
from statsmodels.stats.multitest import multipletests

method='fdr_bh'
columns=df.columns.tolist()

# correct for multiple testing (each test being one cell type). Replacing the column with a corrected value
for column in columns:
    df[column]=statsmodels.stats.multitest.multipletests(df[column],method=method)[1]
df
output
efo_term             cardiomyopathy  chronic.lymphocytic.leukemia  chronic.myelogenous.leukemia  diffuse.gastric.adenocarcinoma  ewing.sarcoma
cell_type
B_cell                     0.178286                      0.000000                           1.0                          0.5820       0.144000
DC                         0.044000                      0.004000                           1.0                          0.5820       1.000000
Endothelial_cells          0.044000                      0.000000                           1.0                          0.6272       0.144000
Keratinocytes              0.066667                      0.003200                           1.0                          1.0000       0.144000
Macrophage                 0.190000                      0.000000                           1.0                          0.5820       0.144000
Neurons                    0.178286                      0.004571                           1.0                          0.5820       0.272000
Smooth_muscle_cells        0.168000                      0.003200                           1.0                          1.0000       0.338667
T_cells                    0.168000                      0.005000                           1.0                          1.0000       1.000000

efo_term             gastric.adenocarcinoma  gastritis  head.and.neck.squamous.cell.carcinoma  leukemia  myelodysplastic.syndrome
cell_type
B_cell                               0.5820      0.508                               0.318857  0.000000                     1.000
DC                                   0.5820      0.508                               0.052000  0.004000                     0.416
Endothelial_cells                    0.6272      0.508                               0.316000  0.000000                     0.416
Keratinocytes                        1.0000      0.508                               0.316000  0.003200                     0.416
Macrophage                           0.5820      0.508                               0.024000  0.000000                     1.000
Neurons                              0.5820      0.508                               0.316000  0.004571                     0.416
Smooth_muscle_cells                  1.0000      0.508                               0.363000  0.003200                     0.416
T_cells                              1.0000      0.508                               0.316000  0.005000                     1.000

efo_term             neoplasm  neoplasm.of.mature.bcells  papillary.renal.cell.carcinoma  polyp  portal.hypertension
cell_type
B_cell                  0.266                      0.266                        0.001333  1.000                0.277
DC                      0.266                      0.266                        0.002286  1.000                0.277
Endothelial_cells       0.323                      0.323                        0.004000  1.000                0.224
Keratinocytes           0.323                      0.323                        0.000000  0.408                0.224
Macrophage              0.266                      0.266                        0.000000  1.000                0.277
Neurons                 0.323                      0.323                        0.000000  0.408                0.224
Smooth_muscle_cells     0.323                      0.323                        0.000000  1.000                0.224
T_cells                 0.266                      0.266                        0.000000  1.000                0.277
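The per-column BH correction can be sanity-checked on a toy vector (the p-values below are hypothetical): `multipletests` returns a tuple whose second element is the corrected p-values, which is why the loop above keeps only index `[1]`.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multitest import multipletests

# hypothetical raw p-values for one trait across four cell types
raw = pd.Series([0.01, 0.02, 0.03, 0.50],
                index=["B_cell", "DC", "Macrophage", "T_cells"])

# Benjamini-Hochberg: sort, scale each p by n/rank, then enforce monotonicity
corrected = multipletests(raw, method="fdr_bh")[1]
print(corrected)  # adjusted values: 0.04, 0.04, 0.04, 0.5
```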
python
# apply the -log10 transform
df = -np.log10(df)

# handle special values
df = df.replace([-0.0, 0.0], 0.0)           # normalize the signed zero (p == 1 gives -0.0)
df = df.replace([np.inf, -np.inf], np.nan)  # p == 0 gives inf; convert it to NaN

# fill NaN with the largest finite value (max() skips NaN)
if not df.isnull().all().all():
    max_val = df.max().max()
    df = df.fillna(max_val)
else:
    df = df.fillna(300)  # fallback when every value is infinite

df
output
efo_term             cardiomyopathy  chronic.lymphocytic.leukemia  chronic.myelogenous.leukemia  diffuse.gastric.adenocarcinoma  ewing.sarcoma
cell_type
B_cell                     0.748883                      2.875061                           0.0                        0.235077       0.841638
DC                         1.356547                      2.397940                           0.0                        0.235077       0.000000
Endothelial_cells          1.356547                      2.875061                           0.0                        0.202594       0.841638
Keratinocytes              1.176091                      2.494850                           0.0                        0.000000       0.841638
Macrophage                 0.721246                      2.875061                           0.0                        0.235077       0.841638
Neurons                    0.748883                      2.339948                           0.0                        0.235077       0.565431
Smooth_muscle_cells        0.774691                      2.494850                           0.0                        0.000000       0.470228
T_cells                    0.774691                      2.301030                           0.0                        0.000000       0.000000

efo_term             gastric.adenocarcinoma  gastritis  head.and.neck.squamous.cell.carcinoma  leukemia  myelodysplastic.syndrome
cell_type
B_cell                             0.235077   0.294136                               0.496404  2.875061                  0.000000
DC                                 0.235077   0.294136                               1.283997  2.397940                  0.380907
Endothelial_cells                  0.202594   0.294136                               0.500313  2.875061                  0.380907
Keratinocytes                      0.000000   0.294136                               0.500313  2.494850                  0.380907
Macrophage                         0.235077   0.294136                               1.619789  2.875061                  0.000000
Neurons                            0.235077   0.294136                               0.500313  2.339948                  0.380907
Smooth_muscle_cells                0.000000   0.294136                               0.440093  2.494850                  0.380907
T_cells                            0.000000   0.294136                               0.500313  2.301030                  0.000000

efo_term             neoplasm  neoplasm.of.mature.bcells  papillary.renal.cell.carcinoma  polyp     portal.hypertension
cell_type
B_cell               0.575118                   0.575118                        2.875061  0.000000             0.557520
DC                   0.575118                   0.575118                        2.640978  0.000000             0.557520
Endothelial_cells    0.490797                   0.490797                        2.397940  0.000000             0.649752
Keratinocytes        0.490797                   0.490797                        2.875061  0.389340             0.649752
Macrophage           0.575118                   0.575118                        2.875061  0.000000             0.557520
Neurons              0.490797                   0.490797                        2.875061  0.389340             0.649752
Smooth_muscle_cells  0.490797                   0.490797                        2.875061  0.000000             0.649752
T_cells              0.575118                   0.575118                        2.875061  0.000000             0.557520
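The special-value handling matters because of two edge cases: p = 1 gives -log10(p) = -0.0, and p = 0 gives inf. A minimal sketch with a hypothetical two-trait frame shows the full chain:

```python
import numpy as np
import pandas as pd

# hypothetical p-values containing both edge cases: p == 1 and p == 0
p = pd.DataFrame({"traitA": [1.0, 0.05], "traitB": [0.0, 0.5]})

with np.errstate(divide="ignore"):               # silence the log10(0) warning
    logp = -np.log10(p)
logp = logp.replace([-0.0, 0.0], 0.0)            # normalize the signed zero
logp = logp.replace([np.inf, -np.inf], np.nan)   # inf (from p == 0) -> NaN
logp = logp.fillna(logp.max().max())             # cap at the largest finite value
print(logp)
```

The p = 0 entry ends up capped at the largest finite -log10(p) in the frame (here -log10(0.05) ≈ 1.301) rather than staying infinite, which keeps the later heatmap color scale usable.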
python
plt.rcParams['figure.dpi'] = 150

tmp=df.loc[:, (df > 1).any(axis=0)]
tmp=tmp.loc[(tmp > 1).any(axis=1),:]



g=sns.clustermap(tmp,
               xticklabels=True,
               yticklabels=True,
               cmap='OrRd',
               figsize=(5,5),
#               annot=anno_df,
               fmt = '',
               dendrogram_ratio=0.05,
               cbar_pos=(0.98,0.9,0.02,0.1),
                 vmax=4
              )


for a in g.ax_row_dendrogram.collections:
    a.set_linewidth(1)

for a in g.ax_col_dendrogram.collections:
    a.set_linewidth(1)
    
ax = g.ax_heatmap
ax.set_xlabel('GWAS traits')
ax.set_ylabel('Cell types')
plt.savefig(f'{output_path}overview_heatmap_SNP_enrichment.png')
plt.show()

Drop the traits that show no significant enrichment in any cell type

python
df=df.loc[:, (df > neglog10_pval_sig_threshold).any(axis=0)]
df=df.loc[(df > neglog10_pval_sig_threshold).any(axis=1),:]
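The two `.loc` filters keep only traits (columns) and cell types (rows) with at least one value above the significance threshold. On a toy frame (all values hypothetical):

```python
import pandas as pd

t = 1.3  # hypothetical threshold, roughly -log10(0.05)
toy = pd.DataFrame({"traitA": [0.2, 2.0], "traitB": [0.1, 0.3]},
                   index=["DC", "B_cell"])

# keep only traits with at least one cell type above the threshold,
# then only cell types with at least one remaining trait above it
toy = toy.loc[:, (toy > t).any(axis=0)]
toy = toy.loc[(toy > t).any(axis=1), :]
print(toy)  # traitB and the DC row are dropped; only B_cell / traitA remains
```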
python
threshold=0.05
python
plt.rcParams['figure.dpi'] = 100

# Make annotation df
anno_df=df.where(df<neglog10_pval_sig_threshold, other="*")
anno_df=anno_df.where(anno_df=='*', other=" ")
anno_df=anno_df.astype(str)


g=sns.clustermap(df,
               xticklabels=True,
               yticklabels=True,
               cmap='OrRd',
               figsize=(6,6),
               annot=anno_df,
               fmt = '',
               dendrogram_ratio=0.05,
               cbar_pos=(0.98,0.9,0.02,0.1),
                 vmax=2
              )

g.ax_col_dendrogram.set_title(f'Enrichment of cell type-specific open chromatin for GWAS trait SNPs.\n\nScale = -log(bh-corrected p-value)\n"*" = p<{pval_sig_threshold} (-log(p)>={neglog10_pval_sig_threshold}), "open threshold" = {threshold}')

for a in g.ax_row_dendrogram.collections:
    a.set_linewidth(1)

for a in g.ax_col_dendrogram.collections:
    a.set_linewidth(1)
    
ax = g.ax_heatmap
ax.set_xlabel('GWAS traits')
ax.set_ylabel('Cell types')
plt.savefig(f'{output_path}overview_heatmap_SNP_enrichment.png')
plt.show()
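The two-step `where()` used above to build `anno_df` can look opaque; on a toy frame (hypothetical values) it reduces to: significant cells become `"*"`, everything else becomes a blank string.

```python
import pandas as pd

t = 1.3  # hypothetical -log10(p) significance threshold
toy = pd.DataFrame({"traitA": [0.2, 2.0]}, index=["DC", "B_cell"])

# step 1: values at or above the threshold become "*", others keep their number
anno = toy.where(toy < t, other="*")
# step 2: anything that is not "*" becomes a blank cell
anno = anno.where(anno == "*", other=" ").astype(str)
print(anno)
```

The resulting frame of `"*"` and `" "` strings is what `annot=` expects when paired with `fmt=''` in `sns.clustermap`.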

Drop the traits that are not of interest

python

df.columns
output
Index(['cardiomyopathy', 'chronic.lymphocytic.leukemia',
'head.and.neck.squamous.cell.carcinoma', 'leukemia',
'papillary.renal.cell.carcinoma'],
dtype='object', name='efo_term')
python
df.columns = map(str.upper, df.columns)
df = df.reindex(sorted(df.columns), axis=1)
df.columns
output
Index(['CARDIOMYOPATHY', 'CHRONIC.LYMPHOCYTIC.LEUKEMIA',
'HEAD.AND.NECK.SQUAMOUS.CELL.CARCINOMA', 'LEUKEMIA',
'PAPILLARY.RENAL.CELL.CARCINOMA'],
dtype='object')
python
plt.rcParams['figure.dpi'] = 60
plt.rcParams['font.size'] = 15     # set the global base font size
from matplotlib.colors import LinearSegmentedColormap
# Make annotation df
anno_df=df.where(df<neglog10_pval_sig_threshold, other="*")
anno_df=anno_df.where(anno_df=='*', other=" ")
anno_df=anno_df.astype(str)

#cmap_celadon = LinearSegmentedColormap.from_list('celadon_gradient',
#                colors = ["#DFEDEA","#B9EAE1", "#79D8C5", "#49C4AC","#17BCA0","#0A9E81"],
#                N=256)
# alternative colormap; pass cmap=cmap_celadon to the clustermap call below to use it
g=sns.clustermap(df,
               xticklabels=True,
               yticklabels=True,
               cmap='OrRd',
               figsize=(10,10),
               annot=anno_df,
               fmt = '',
               dendrogram_ratio=0.05,
               cbar_pos=(0.98,0.9,0.02,0.1),
                 vmax=2,
                 row_cluster=False, col_cluster=False
              )

g.ax_col_dendrogram.set_title(f'Enrichment of cell-type-specific open chromatin for GWAS trait SNPs.\n\nScale = -log(BH-corrected p-value)\n\n"*" = p<{pval_sig_threshold} (-log(p)>={round(neglog10_pval_sig_threshold,2)})')

for a in g.ax_row_dendrogram.collections:
    a.set_linewidth(1)

for a in g.ax_col_dendrogram.collections:
    a.set_linewidth(1)

ax = g.ax_heatmap
ax.set_xlabel('GWAS traits')
ax.set_ylabel('Cell types')
plt.savefig(f'{output_path}overview_heatmap_SNP_enrichment_physiological.pdf',bbox_inches='tight')
plt.savefig(f'{output_path}overview_heatmap_SNP_enrichment_physiological.png')
The heatmap shows the enrichment of SNPs associated with each GWAS trait in cell-type-specific open chromatin regions. Color encodes the BH-corrected -log10(p) value: the darker the red, the more significant the enrichment. '*' marks a corrected p-value < 0.05, i.e. the GWAS trait is significantly enriched in the corresponding cell type.

Result files

├── enrichment_output
│ ├── 0p05_peak_width_500_SNPs_in_LD_all_traits_summary.csv
│ └── overview_heatmap_SNP_enrichment_physiological.pdf
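The per-cell-type p-values summarized in the table above come from an enrichment test such as the hypergeometric test mentioned in the background section. As a hedged illustration of that variant (all counts below are hypothetical, not taken from this dataset): given a genome-wide background of tested SNPs, the number of trait SNPs landing in one cell type's open regions is compared against random placement.

```python
from scipy.stats import hypergeom

# All counts are hypothetical, chosen only to illustrate the test:
M = 100_000   # background SNPs considered genome-wide
n = 5_000     # background SNPs inside this cell type's open regions
N = 200       # trait-associated SNPs (index SNPs plus LD proxies)
k = 25        # trait SNPs that land inside those open regions

# one-sided p-value: P(X >= k) if the N trait SNPs were placed at random
p = hypergeom.sf(k - 1, M, n, N)
print(p)
```

Here the expected overlap under randomness is N * n / M = 10 SNPs, so observing 25 yields a small p-value; one such test per (trait, cell type) pair produces the matrix that is then BH-corrected.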

Literature case study

  • "Spatially resolved multiomics of human cardiac niches"

    In this study, the authors used scATAC-seq data to assess, across cell types, the enrichment of SNPs for GWAS traits related to cardiac physiology and pathology.

References

[1] Kanemaru K, Cranley J, Muraro D, et al. Spatially resolved multiomics of human cardiac niches[J]. Nature, 2023. DOI: 10.1038/s41586-023-06311-1.

Appendix

.txt: tab-separated result tables. Unix/Linux/Mac users can view them with less or more; Windows users can open them with an advanced text editor such as Notepad++, or with Microsoft Excel.

.pdf: result figures in vector format; they can be zoomed in and out without loss of quality and edited with Adobe Illustrator, e.g. for publication figures.

.rds: a Seurat object containing the gene activity assay; open it in an R environment to inspect it or use it for downstream analysis.
