Skip to content

scATAC-seq Variation Analysis: SNP Enrichment in Chromatin Open Regions

Author: SeekGene
Time: 49 min
Words: 9.6k words
Updated: 2026-02-28
Reads: 0 times
scATAC + RNA-seq Analysis Guide Notebooks SNP Enrichment Analysis

Background

Genome-Wide Association Studies (GWAS) are key tools for resolving the genetic basis of complex traits. Many significant Single Nucleotide Polymorphisms (SNPs) identified often reside in non-coding regulatory regions of the genome. This suggests these SNPs likely influence traits by affecting gene expression regulation rather than directly altering protein coding sequences, often with cell type or state specificity. Understanding how these non-coding variants function in specific cellular contexts is a core challenge in revealing molecular mechanisms of complex traits. scATAC-seq can characterize chromatin accessibility genome-wide at single-cell resolution, precisely defining specific regulatory elements (e.g., enhancers, promoters) and their activity profiles across different cell types and activation states. This analysis aims to establish functional links between specific traits and key cell types/states. The core strategy is to calculate the enrichment of GWAS-identified trait-associated SNPs (or tag SNPs in their linkage regions) within cell state-specific open chromatin regions.
First, define cell state-specific open regions based on scATAC-seq; second, for each GWAS trait of interest (e.g., disease risk, blood biochemical indices, neuropsychiatric phenotypes), calculate the frequency of its GWAS significant SNP set and closely linked regions (e.g., LD blocks) appearing within each specific cell state's open regions. Assess whether this frequency is significantly higher than random expectation via statistical tests (e.g., hypergeometric test, permutation test). This reveals which cell types or activation states might be the primary "target cells" for the genetic variants influencing that trait.

Pre-analysis Data Preparation

Before SNP enrichment analysis, prepare the following two types of files

(1) Trait-associated SNP Information

Taking azoospermia_with_SNP.csv as an example,

Information for SNPs associated with the trait 'azoospermia' must include the following:
trait_name index_snps index_chrom index_pos SNP_id tagged_r2 POS
azoospermia rs7192 6 32443869 rs9268686 0.929192 32447745
azoospermia rs7192 6 32443869 rs9268670 0.937897 32446510
azoospermia rs7192 6 32443869 rs7195 1.000000 32444762
azoospermia rs7192 6 32443869 rs2213585 1.000000 32445373
azoospermia rs7192 6 32443869 rs567082101 0.937897 32453552
azoospermia rs7192 6 32443869 rs7775108 0.937897 32452461
azoospermia rs7192 6 32443869 rs9268783 0.937897 32452938
azoospermia rs7192 6 32443869 rs4935354 1.000000 32444621
azoospermia rs7192 6 32443869 rs78456540 0.914852 32586949

Where the trait_name column is the name of the trait to be analyzed; the index_snps column represents all SNP information associated with this trait, which could be a few or hundreds/thousands, derived from GWAS analysis results or downloaded from the EBI website; the index_chrom column represents the chromosome number of the SNP; the index_pos column represents the specific position of the SNP on the chromosome; SNP_id represents SNPs in linkage disequilibrium with the index_snps; tagged_r2 represents the r2 coefficient of linkage disequilibrium; POS represents the position of the SNP in linkage disequilibrium.

├── index_snps_with_LD_with_pos This path stores SNP information for all traits to be analyzed, with each trait in a separate csv file
│ ├── azoospermia_with_SNP.csv
│ ├── cardiomyopathy_with_SNP.csv
│ ├── chronic.lymphocytic.leukemia_with_SNP.csv
│ ├── chronic.myelogenous.leukemia_with_SNP.csv
│ ├── chronic.pancreatitis_with_SNP.csv
│ ├── diffuse.gastric.adenocarcinoma_with_SNP.csv
│ ├── endocarditis_with_SNP.csv
│ ├── ewing.sarcoma_with_SNP.csv
│ ├── gastric.adenocarcinoma_with_SNP.csv
│ ├── gastritis_with_SNP.csv
│ ├── head.and.neck.squamous.cell.carcinoma_with_SNP.csv
│ ├── hyperplasia_with_SNP.csv
│ ├── infectious.meningitis_with_SNP.csv
│ ├── invasive.lobular.carcinoma_with_SNP.csv
│ ├── leukemia_with_SNP.csv
│ ├── myelodysplastic.syndrome_with_SNP.csv
│ ├── neoplasm.of.mature.bcells_with_SNP.csv
│ ├── neoplasm_with_SNP.csv
│ ├── nervous.system.disease_with_SNP.csv
│ ├── pancreatitis_with_SNP.csv
│ ├── papillary.renal.cell.carcinoma_with_SNP.csv
│ ├── polyp_with_SNP.csv
│ └── portal.hypertension_with_SNP.csv

(2) celltype-by-peak Matrix

The row names of the celltype-by-peak matrix are celltypes, column names are peaks information, and values represent the proportion of cells in a given celltype where that peak position is open relative to the total number of cells.

└── peaks This path stores the celltype-by-peaks matrix file
│ └── celltype_by_peaks.csv

SNPenrichment Analysis

python
import numpy as np
import pandas as pd
import scanpy as sc
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.pyplot import rc_context
import bbknn
import re
import json
import os
import rpy2
import anndata
from datetime import date
import scipy.stats
from datetime import datetime
import requests
import logging

# YYYY-MM-DD
today = date.today()
today = today.strftime("%Y-%m-%d")

import warnings
warnings.filterwarnings('ignore')
output
/PROJ2/FLOAT/shumeng/apps/miniconda3/envs/ATAC_SNPEnrich/lib/python3.11/site-packages/requests/__init__.py:86: RequestsDependencyWarning: Unable to find acceptable character detection dependency (chardet or charset_normalizer).
warnings.warn(

Parameter Description

peaks_path: Path storing the celltype_by_peaks.csv file;
SNP_path: Path storing trait-associated SNP information files;
n_permutations: Number of permutation tests;
output_path: Result file output path;
threshold: Threshold to determine if a peak is open in a celltype;
pval_sig_threshold: Enrichment significance threshold.

python
peaks_path='/PROJ2/FLOAT/shumeng/project/scATAC_learn/Signac/example_data/joint.14/joint14.22/'
SNP_path='/PROJ2/FLOAT/advanced-analysis/Seek_module_dev/ATAC_SNPenrichment/demo/index_snps_with_LD_with_pos/'
n_permutations=1000
output_path="../result/"
threshold=0.05
pval_sig_threshold=0.05

Read celltype-by-peaks File

python
os.makedirs("../result", exist_ok=True)
peakfile=os.path.join(peaks_path, "celltype_by_peaks.csv")
#print(f'{current_time}:...reading peak file...takes 5 mins...')
peaks=pd.read_csv(peakfile)
    
peaks.columns = peaks.columns.str.replace('Unnamed: 0', 'fine_grain')
peaks=peaks.set_index(peaks['fine_grain'])
peaks=peaks.drop(columns=['fine_grain'])

# replace characters which can break the code
peaks.index=peaks.index.str.replace('+', 'pos')
peaks.index=peaks.index.str.replace('/', '_or_')

now = datetime.now()
current_time = now.strftime("%H:%M:%S")
#print(f'{current_time}:...done.')
peaks.head()
chr1-9815-10616chr1-180437-181865chr1-190528-191930chr1-267655-268361chr1-629577-630313chr1-633636-634418chr1-778209-779399chr1-817017-817549chr1-819870-822573chr1-822744-823396...chrY-56861377-56861817chrY-56864364-56864795chrY-56868776-56869176chrY-56869677-56870379chrY-56870544-56871524chrY-56873556-56874208chrY-56879530-56880478chrM-11-4079chrM-6124-6729chrM-9885-16556
fine_grain
Neurons0.0113230.0223750.0399220.0041500.0139390.0709130.0919790.0048270.0251710.009473...0.0034730.0025710.0033380.0063610.0080300.0047820.0061350.9963910.8509560.999865
Macrophage0.0093810.0231390.0131330.0018760.0100060.0587870.0769230.0125080.0256410.008755...0.0025020.0000000.0043780.0068790.0075050.0062540.0056290.9974980.8461541.000000
Endothelial_cells0.0167600.0502790.0530730.0055870.0502790.2122910.1145250.0083800.0446930.002793...0.0111730.0055870.0055870.0083800.0167600.0111730.0055871.0000000.9776541.000000
T_cells0.0090190.0225480.0124010.0000000.0135290.0721530.0653890.0045100.0112740.005637...0.0033820.0000000.0011270.0056370.0112740.0067640.0067640.9966180.9064261.000000
Smooth_muscle_cells0.0100810.0292340.0252020.0040320.0151210.1028230.0725810.0090730.0272180.012097...0.0050400.0010080.0030240.0090730.0080650.0050400.0080650.9989920.8739921.000000

5 rows × 243391 columns

Each row of the celltype-by-peaks matrix represents a cell type, each column represents a peak, and the corresponding value represents the proportion of cells in that celltype where the peak is open. This table needs to be prepared in advance. Specific method: First binarize the peak-cell matrix, where 0 represents closed and 1 represents open, then calculate the proportion of open to total.

Organize peak_window, chrom, start, end information for all peaks involved in the data, preparing for subsequent statistics on whether trait-associated SNPs cover peak information

python
%%time
# Generate initial DataFrame
all_peaks = pd.DataFrame(peaks.columns, columns=['peak'])

split_result = all_peaks['peak'].str.split('-', expand=True)
all_peaks['peak_window'] = split_result[1] + "_" + split_result[2]  # Direct string concatenation
all_peaks['chrom'] = split_result[0]
all_peaks['start'] = split_result[1].astype(int)  # Convert to integer
all_peaks['end'] = split_result[2].astype(int)

# Set index
all_peaks = all_peaks.set_index('peak')
all_peaks.head(4)
output
CPU times: user 878 ms, sys: 47.5 ms, total: 925 ms
Wall time: 948 ms
peak_windowchromstartend
peak
chr1-9815-106169815_10616chr1981510616
chr1-180437-181865180437_181865chr1180437181865
chr1-190528-191930190528_191930chr1190528191930
chr1-267655-268361267655_268361chr1267655268361
python
# select a window size (bp) of interest (by default it is 500bp)
window_size=500
window_adjustment=(window_size-500)/2
#print('adjustment '+str(window_adjustment))
window_size_for_filename='peak_width_'+str(window_size)
#print(window_size_for_filename)

all_peaks['start']=all_peaks['start'].apply(lambda x: x-window_adjustment)
all_peaks['end']=all_peaks['end'].apply(lambda x: x+window_adjustment)

all_peaks
peak_windowchromstartend
peak
chr1-9815-106169815_10616chr19815.010616.0
chr1-180437-181865180437_181865chr1180437.0181865.0
chr1-190528-191930190528_191930chr1190528.0191930.0
chr1-267655-268361267655_268361chr1267655.0268361.0
chr1-629577-630313629577_630313chr1629577.0630313.0
...............
chrY-56873556-5687420856873556_56874208chrY56873556.056874208.0
chrY-56879530-5688047856879530_56880478chrY56879530.056880478.0
chrM-11-407911_4079chrM11.04079.0
chrM-6124-67296124_6729chrM6124.06729.0
chrM-9885-165569885_16556chrM9885.016556.0

243391 rows × 4 columns

python
# get list of celltypes for subsequent analyses
cell_types=peaks.index.unique().tolist()
cell_types
output
['Neurons',
'Macrophage',
'Endothelial_cells',
'T_cells',
'Smooth_muscle_cells',
'Keratinocytes',
'DC',
'B_cell']
python
# Set threshold for binarisation
threshold_for_filename=str(threshold).replace('.','p')
print(threshold_for_filename)
output
0p05

Based on all trait information in the SNP_path, count the total number of SNPs for each trait in this enrichment analysis, including SNPs and those in linkage disequilibrium

python
%%time

# make a table of metadata about the SNP files
md_efo_terms=[]
md_n_SNPs=[]


files=os.listdir(SNP_path)

for file in files:
    
    efo_term=str(file.split('_')[0])
    md_efo_terms.append(efo_term)
                      
    snps_df=pd.read_csv(f'{SNP_path}{file}')
    snps_df=snps_df.set_index('SNP_id')
    
    snps_df["Chrom"] = snps_df['Chrom'].apply(lambda x: 'chr'+str(x)).str.split('.',expand=True)[0] # creates a new 'chrom' column and should work even for X or M chromosomes

    md_n_SNPs.append(len(snps_df))

md_dict={
    'efo_term':md_efo_terms,
    'n_SNPs':md_n_SNPs
}


SNP_md=pd.DataFrame(md_dict)
SNP_md=SNP_md.set_index('efo_term')

SNP_md.sort_values('n_SNPs',ascending=False)
output
CPU times: user 76.1 ms, sys: 2.6 ms, total: 78.7 ms
Wall time: 802 ms
n_SNPs
efo_term
chronic.lymphocytic.leukemia483
neoplasm.of.mature.bcells285
cardiomyopathy192
ewing.sarcoma157
chronic.myelogenous.leukemia139
gastritis121
azoospermia118
pancreatitis108
myelodysplastic.syndrome102
leukemia90
papillary.renal.cell.carcinoma83
head.and.neck.squamous.cell.carcinoma63
hyperplasia61
gastric.adenocarcinoma60
chronic.pancreatitis57
diffuse.gastric.adenocarcinoma56
neoplasm51
polyp27
infectious.meningitis24
portal.hypertension18
nervous.system.disease11
invasive.lobular.carcinoma6
endocarditis4
python
efo_terms=SNP_md.index.unique().tolist()

Count the number of trait-associated SNPs falling within known peak regions to form a peak-traits matrix

python
%%time
# Add on a column for each trait, indicating whether a SNP from that trait falls within a peak. This file varies with the which traits are assessed

all_peaks_path=f'{output_path}SNP_mapped_to_peaks/'
os.makedirs(all_peaks_path,
           exist_ok=True) # makes the directory
files=os.listdir(SNP_path)
for efo_term in efo_terms:
    file = [f for f in files if f"{efo_term}" in f]
    file = file[0]
    file_path = os.path.join(SNP_path, file)
    snps_df=pd.read_csv(file_path)
    snps_df=snps_df.set_index('SNP_id')
    snps_df["chrom"]=snps_df["Chrom"].apply(lambda x: 'chr' + str(x)).str.split('.', n=1).str[0]
    all_peaks[efo_term]=0
    for snp in range(len(snps_df["chrom"])): # This loop incrementally adds 1 if there is a SNP within a peak (open or closed)
        all_peaks[efo_term][(all_peaks['chrom'] == snps_df["chrom"][snp])&(all_peaks['start'] <= snps_df['POS'][snp])&(all_peaks['end'] >= snps_df['POS'][snp])]+=1 # Adds one for each SNP which falls inside a peak
all_peaks.to_csv(f'{all_peaks_path}all_peaks_with_SNPs_for_traits_incremental.csv')
output
CPU times: user 1min 32s, sys: 99.4 ms, total: 1min 32s
Wall time: 1min 33s
python
# Show number of SNPs falling in peaks, for each trait
efo_terms_to_drop=[]

for efo_term in efo_terms:
#    print(efo_id+': '+str(sum(all_peaks[efo_id])))
    if sum(all_peaks[efo_term]) == 0:
        efo_terms_to_drop.append(efo_term)
python
efo_terms=[efo_term for efo_term in efo_terms if efo_term not in efo_terms_to_drop]

For each cell type, create a random background (1/0 binary matrix) by shuffling the open/closed labels of peaks for that cell type, such that a random set of peaks is labeled open, equal in number to the actual open peaks for that cell state. Repeat this operation to create 1000 random permutations

python
%%time
import numpy as np
import pandas as pd
import os
from datetime import datetime

permutations = range(n_permutations)
bin_mat_path = f'{output_path}binarised_permutation_matrices/'
os.makedirs(bin_mat_path, exist_ok=True)

for cell_type in cell_types:
    output_file = f'{bin_mat_path}{cell_type}_{n_permutations}_permutations_bin_threshold_{threshold_for_filename}_matrix.csv'
    
    if not os.path.isfile(output_file):
        print(f'{datetime.now().strftime("%H:%M:%S")}...generating binarised matrix for {cell_type}...')
        
        # Initialize DataFrame
        cell_df = pd.DataFrame(peaks.loc[cell_type])
        cell_df['peak'] = cell_df.index
        split_result = cell_df['peak'].str.split('-', expand=True)
        cell_df['chrom'] = split_result[0]
        cell_df['peak_window'] = split_result[1] + "_" + split_result[2]
        cell_df['start'] = split_result[1].astype(int)
        cell_df['end'] = split_result[2].astype(int)
        cell_df[f'{cell_type}_binarised_real'] = cell_df[cell_type].ge(threshold).astype(int)
        
        # Generate all permutation columns
        for perm in permutations:
            cell_df[f'{cell_type}_binarised_permutation_{perm}'] = np.random.permutation(cell_df[f'{cell_type}_binarised_real'])
        
        # Save at once
        cell_df.to_csv(output_file, index=True)
    else:
        print(f'NOT generating binarised matrix for {cell_type} since it already exists')

print('finished')
output
NOT generating binarised matrix for Neurons since it already exists
NOT generating binarised matrix for Macrophage since it already exists
NOT generating binarised matrix for Endothelial_cells since it already exists
NOT generating binarised matrix for T_cells since it already exists
NOT generating binarised matrix for Smooth_muscle_cells since it already exists
NOT generating binarised matrix for Keratinocytes since it already exists
NOT generating binarised matrix for DC since it already exists
NOT generating binarised matrix for B_cell since it already exists
finished
CPU times: user 1.58 ms, sys: 976 μs, total: 2.56 ms
Wall time: 26.5 ms

Merge the randomized binary matrix for each cell type with the actual peak-traits matrix

python
%%time

joined_bin_mat_path = f'{output_path}binarised_permutation_matrices_joined_incremental/'
os.makedirs(joined_bin_mat_path, exist_ok=True)

for cell_type in cell_types:
    output_file = f'{joined_bin_mat_path}{cell_type}_{n_permutations}_permutations_for_traits_bin_threshold_{threshold_for_filename}_matrix_joined.csv'
    
    if not os.path.isfile(output_file):
        # Read generated binarized matrix
        binarised_matrix = pd.read_csv(
            f'{bin_mat_path}{cell_type}_{n_permutations}_permutations_bin_threshold_{threshold_for_filename}_matrix.csv'
        )
        
        # Clean and set index
        binarised_matrix.set_index('peak', inplace=True)  # Ensure index is unique peak column
        
        # Add column for SNP presence in peaks (assuming all_peaks is another DataFrame)
        # Method 1: Specify suffix
        binarised_matrix = binarised_matrix.join(
            all_peaks, 
            lsuffix="_right"
        )
        
        # Save to new directory
        binarised_matrix.to_csv(output_file)
    else:
        print(f'NOT generating joined binarised matrix for {cell_type} since it already exists')

print('finished')
output
NOT generating joined binarised matrix for Neurons since it already exists
NOT generating joined binarised matrix for Macrophage since it already exists
NOT generating joined binarised matrix for Endothelial_cells since it already exists
NOT generating joined binarised matrix for T_cells since it already exists
NOT generating joined binarised matrix for Smooth_muscle_cells since it already exists
NOT generating joined binarised matrix for Keratinocytes since it already exists
NOT generating joined binarised matrix for DC since it already exists
NOT generating joined binarised matrix for B_cell since it already exists
finished
CPU times: user 2.2 ms, sys: 21 μs, total: 2.22 ms
Wall time: 28 ms

For each trait and cell state, calculate the proportion of trait-associated SNPs falling within the open peaks of that cell state (SNP proportion). Calculate this SNP proportion for each random permutation as well. Then calculate the P-value, which is the proportion of times the random SNP proportion exceeds or equals the actual SNP proportion. Finally, use the Benjamini-Hochberg method to correct these P-values for multiple testing

python
%%time
import logging

# Set up logging
logging.basicConfig(level=logging.ERROR)



# Find the proportion of all peaks which are open in this celltype
enrichment_output_path=f'{output_path}enrichment_output/'
os.makedirs(enrichment_output_path,
           exist_ok=True) # makes the directory

list_of_output_dfs=[]

#now = datetime.now()
#current_time = now.strftime("%H:%M:%S")
#print(f'...================================================================...') 
#print(f'...{current_time}:STARTING')
#print(f'...================================================================...') 

# make some empty lists
proportion_of_SNPs_found_in_celltype_specific_open_peaks=[]
proportion_of_all_open_peaks_found_in_this_celltype=[]

n_times_proportions_of_SNPs_in_permuted_open_peaks_greater_than_observed_proportion=[]
mean_proportions_of_SNPs_in_open_peaks=[]
p_values=[]
efo_term_list=[]
cell_type_list=[]
n_SNPs_list=[]

cell_types_done=[]
n_cell_types_total=len(cell_types)

permutations=range(n_permutations)

print('...evaluating '+str(len(efo_terms))+' traits, across '+str(len(cell_types))+' cell types...')

for cell_type in cell_types:
    
    try:
    
        cell_types_done.append(cell_type)
        n_cell_types_done=len(cell_types_done)
        n_cell_types_remaining=n_cell_types_total-n_cell_types_done

        now = datetime.now()
        current_time = now.strftime("%H:%M:%S")
        print(f'...================================================================...')    
        print(f'...{current_time}: {n_cell_types_remaining+1} of {n_cell_types_total} cell types remaining. Reading binarised matrix for {cell_type}...')
        print(f'...================================================================...') 

        cell_bin_mat=pd.read_csv(f'{joined_bin_mat_path}{cell_type}_{n_permutations}_permutations_for_traits_bin_threshold_{threshold_for_filename}_matrix_joined.csv',index_col='peak')

        prop_bins_in_this_cell_type=(len(cell_bin_mat[cell_bin_mat[f'{cell_type}_binarised_real']==1]))/len(cell_bin_mat)


        for efo_term in efo_terms:

            # grab some metadata
            n_SNPs=SNP_md.loc[efo_term]['n_SNPs']

            # add columns which won't change until we run a new cell_type
            proportion_of_all_open_peaks_found_in_this_celltype.append(prop_bins_in_this_cell_type)
            cell_type_list.append(cell_type)

            # add columns which won't change until we run a new efo_id
            n_SNPs_list.append(n_SNPs)
            efo_term_list.append(efo_term)

            # subset to just open regions for this cell type
            # find the proportion of SNPs for this trait that lie within this cell types open peaks
            observed_proportion=(cell_bin_mat[efo_term][cell_bin_mat[f'{cell_type}_binarised_real']==1].sum())/(cell_bin_mat[efo_term].sum())

            proportion_of_SNPs_found_in_celltype_specific_open_peaks.append(observed_proportion)

            proportion_of_SNPs_found_in_permutations_of_celltype_specific_open_peaks=[]

            for permutation in permutations:
                proportion_of_SNPs_found_in_permutations_of_celltype_specific_open_peaks.append(cell_bin_mat[efo_term][cell_bin_mat[f'{cell_type}_binarised_permutation_{permutation}']==1].sum()/cell_bin_mat[efo_term].sum())

            proportions_of_SNPs_in_permuted_open_peaks_greater_than_observed_proportion = [i for i in proportion_of_SNPs_found_in_permutations_of_celltype_specific_open_peaks if i >= observed_proportion]

            n_times_proportions_of_SNPs_in_permuted_open_peaks_greater_than_observed_proportion.append(len(proportions_of_SNPs_in_permuted_open_peaks_greater_than_observed_proportion))

            p_values.append(len(proportions_of_SNPs_in_permuted_open_peaks_greater_than_observed_proportion)/len(permutations)) # p val is simply the proportion of null hypotheses 'observations' greater than the actual observed proportion

            mean_proportions_of_SNPs_in_open_peaks.append(sum(proportion_of_SNPs_found_in_permutations_of_celltype_specific_open_peaks)/len(proportion_of_SNPs_found_in_permutations_of_celltype_specific_open_peaks))

            # Plot histograms for each cell type
    #        plt.rcParams["figure.figsize"] = (20,10)
    #        plt.rcParams["figure.dpi"] = 300

    #        plt.hist(proportion_of_SNPs_found_in_permutations_of_celltype_specific_open_peaks,
    #                 bins=100,color='red',
    #                 range=(0,1),
    #                 histtype='stepfilled',edgecolor='none')
    #        plt.axvline(x=observed_proportion, color='blue', linestyle='--')
    #        plt.legend(['null: proportion of SNPs falling in randomly shuffled OC regions','observed: proportion of SNPs falling cell-type specific OC regions'])
    #        plt.title('cell type: '+cell_type+', trait: '+efo_id+', term: '+efo_term+', threshold for binarisation: '+threshold_for_filename)
    #        plt.savefig(f'{output_path}{efo_id}_{efo_term}_{cell_type}_{threshold_for_filename}_SNP_enrichment.png')
    #        plt.clf() #clears the current plot

    
    except KeyError as e:
        # Log the error
        logging.error(e)
        print(cell_type)

    # edited so that the file is written incrementally
        
    output_dict={
        'cell_type':cell_type_list,
        'proportion_of_all_open_peaks_found_in_this_celltype':proportion_of_all_open_peaks_found_in_this_celltype,
        'proportion_of_SNPs_found_in_celltype_specific_open_peaks':proportion_of_SNPs_found_in_celltype_specific_open_peaks,
        'mean_proportions_of_SNPs_in_open_peaks':mean_proportions_of_SNPs_in_open_peaks,
        'n_times_proportions_of_SNPs_in_permuted_open_peaks_greater_than_observed_proportion':n_times_proportions_of_SNPs_in_permuted_open_peaks_greater_than_observed_proportion,
        'p_value':p_values,
        'n_SNPs':n_SNPs_list,
        'efo_term':efo_term_list}

    output_df=pd.DataFrame(output_dict)

    list_of_output_dfs.append(output_df)
    combined_output_df=pd.concat(list_of_output_dfs)
    combined_output_df=combined_output_df.sort_values(by=['efo_term'])
    combined_output_df=combined_output_df.set_index('cell_type')
    combined_output_df.to_csv(f'{enrichment_output_path}{threshold_for_filename}_{window_size_for_filename}_SNPs_in_LD_all_traits_summary.csv')

now = datetime.now()
current_time = now.strftime("%H:%M:%S")
print(f'...================================================================...')    
print(f'...{current_time}:FINSIHED')
print(f'...================================================================...')
output
...evaluating 19 traits, across 8 cell types...n ...================================================================...n ...09:46:59: 8 of 8 cell types remaining. Reading binarised matrix for Neurons...n ...================================================================...n ...================================================================...n ...09:49:31: 7 of 8 cell types remaining. Reading binarised matrix for Macrophage...n ...================================================================...n ...================================================================...n ...09:51:32: 6 of 8 cell types remaining. Reading binarised matrix for Endothelial_cells...n ...================================================================...n ...================================================================...n ...09:54:00: 5 of 8 cell types remaining. Reading binarised matrix for T_cells...n ...================================================================...n ...================================================================...n ...09:56:07: 4 of 8 cell types remaining. Reading binarised matrix for Smooth_muscle_cells...n ...================================================================...n ...================================================================...n ...09:58:15: 3 of 8 cell types remaining. Reading binarised matrix for Keratinocytes...n ...================================================================...n ...================================================================...n ...10:00:35: 2 of 8 cell types remaining. Reading binarised matrix for DC...n ...================================================================...n ...================================================================...n ...10:02:40: 1 of 8 cell types remaining. Reading binarised matrix for B_cell...n ...================================================================...n ...================================================================...n ...10:04:45:FINSIHED
...================================================================...n CPU times: user 9min 9s, sys: 26.8 s, total: 9min 36s
Wall time: 17min 46s
python
neglog10_pval_sig_threshold=np.negative(np.log10(pval_sig_threshold))
neglog10_pval_sig_threshold
output
1.3010299956639813

Enrichment Result Visualization

python
#combined_output_df=pd.read_csv(f'{output_path}{threshold_for_filename}_{window_size_for_filename}_SNPs_in_LD{LD_threshold_for_filename}_all_traits_summary.csv',index_col='cell_type')
combined_output_file=os.path.join(peaks_path, "celltype_by_peaks.csv")
combined_output_df=pd.read_csv("/PROJ2/FLOAT/advanced-analysis/Seek_module_dev/ATAC_SNPenrichment/demo/new_output/enrichment_output/0p05_peak_width_500_SNPs_in_LD_all_traits_summary.csv")
combined_output_df.head()
cell_typeproportion_of_all_open_peaks_found_in_this_celltypeproportion_of_SNPs_found_in_celltype_specific_open_peaksmean_proportions_of_SNPs_in_open_peaksn_times_proportions_of_SNPs_in_permuted_open_peaks_greater_than_observed_proportionp_valuen_SNPsefo_term
0Neurons0.0756890.00.06842910001.0118azoospermia
1Neurons0.0756890.00.06842910001.0118azoospermia
2Keratinocytes0.0952750.00.09200010001.0118azoospermia
3Smooth_muscle_cells0.0616170.00.06042910001.0118azoospermia
4T_cells0.0381980.00.03664310001.0118azoospermia
python
combined_output_df['efo_term']=combined_output_df['efo_term'].astype(str)#+"_"+combined_output_df['n_SNPs'].astype(str) # Add the n_SNPs - useful for later inspection
combined_output_df.head()
cell_typeproportion_of_all_open_peaks_found_in_this_celltypeproportion_of_SNPs_found_in_celltype_specific_open_peaksmean_proportions_of_SNPs_in_open_peaksn_times_proportions_of_SNPs_in_permuted_open_peaks_greater_than_observed_proportionp_valuen_SNPsefo_term
0Neurons0.0756890.00.06842910001.0118azoospermia
1Neurons0.0756890.00.06842910001.0118azoospermia
2Keratinocytes0.0952750.00.09200010001.0118azoospermia
3Smooth_muscle_cells0.0616170.00.06042910001.0118azoospermia
4T_cells0.0381980.00.03664310001.0118azoospermia
python
combined_output_df.proportion_of_SNPs_found_in_celltype_specific_open_peaks.hist(bins=100)
output
Axes:
python
#pivot the table to prepare for heatmap
df=combined_output_df[['p_value',"cell_type",'efo_term']]
df.head()
p_valuecell_typeefo_term
01.0Neuronsazoospermia
11.0Neuronsazoospermia
21.0Keratinocytesazoospermia
31.0Smooth_muscle_cellsazoospermia
41.0T_cellsazoospermia
python
df=df.pivot_table(values='p_value',index='cell_type',columns='efo_term')
df
efo_termazoospermiacardiomyopathychronic.lymphocytic.leukemiachronic.myelogenous.leukemiachronic.pancreatitisdiffuse.gastric.adenocarcinomaewing.sarcomagastric.adenocarcinomagastritishead.and.neck.squamous.cell.carcinomahyperplasialeukemiamyelodysplastic.syndromeneoplasmneoplasm.of.mature.bcellspancreatitispapillary.renal.cell.carcinomapolypportal.hypertension
cell_type
B_cell1.00.1560.0001.001.00.2150.0580.2150.3020.2791.00.0001.0000.0990.0991.00.0011.0000.249
DC1.00.0110.0031.001.00.2051.0000.2050.2660.0131.00.0030.1230.0720.0721.00.0021.0000.247
Endothelial_cells1.00.0080.0001.001.00.3920.0660.3920.4750.2031.00.0000.2540.3230.3231.00.0041.0000.112
Keratinocytes1.00.0250.0021.001.01.0000.0281.0000.5080.1721.00.0020.2600.3010.3011.00.0000.1020.103
Macrophage1.00.1900.0001.001.00.2520.0720.2520.3110.0031.00.0001.0000.1060.1061.00.0001.0000.277
Neurons1.00.1440.0040.191.00.2910.1700.2910.4380.1391.00.0040.2100.3210.3211.00.0000.0870.068
Smooth_muscle_cells1.00.1050.0021.001.01.0000.2541.0000.3760.3631.00.0020.1790.2320.2321.00.0001.0000.054
T_cells1.00.1020.0051.001.01.0001.0001.0000.2290.2371.00.0051.0000.1330.1331.00.0001.0000.184
python
# drop columns where p value is all 1 (i.e. nothing at all significant)
print(df.shape)
df=df.loc[:, (df != 1).any(axis=0)]
print(df.shape)
output
(8, 19)
(8, 15)
python
# drop columns where p value is all 0 (i.e. no open peaks)
print(df.shape)
df=df.loc[:, (df != 0).any(axis=0)]
print(df.shape)
output
(8, 15)
(8, 15)
python
plt.rcParams['figure.dpi'] = 100

tmp=df.loc[:, (df > 0.05).any(axis=0)]
tmp=tmp.loc[(tmp > 0.05).any(axis=1),:]

g=sns.clustermap(tmp,
               xticklabels=True,
               yticklabels=True,
               cmap='OrRd',
               figsize=(tmp.shape[1]*0.3,tmp.shape[0]*0.3),
#               annot=anno_df,
               fmt = '',
               dendrogram_ratio=0.05,
               cbar_pos=(0.98,0.9,0.02,0.1)
              )

for a in g.ax_row_dendrogram.collections:
    a.set_linewidth(1)

for a in g.ax_col_dendrogram.collections:
    a.set_linewidth(1)
    
ax = g.ax_heatmap
ax.set_xlabel('GWAS traits')
ax.set_ylabel('Cell types')
plt.show()
python
print("Original data description:")
print(df.describe())
plt.figure()
sns.histplot(df.values.flatten())
plt.title('Value Distribution')
plt.show()
output
原始数据描述:
efo_term cardiomyopathy chronic.lymphocytic.leukemia \\
count 8.000000 8.000000
mean 0.092625 0.002000
std 0.070484 0.001927
min 0.008000 0.000000
25% 0.021500 0.000000
50% 0.103500 0.002000
75% 0.147000 0.003250
max 0.190000 0.005000

efo_term chronic.myelogenous.leukemia diffuse.gastric.adenocarcinoma \\
count 8.000000 8.000000
mean 0.898750 0.544375
std 0.286378 0.381603
min 0.190000 0.205000
25% 1.000000 0.242750
50% 1.000000 0.341500
75% 1.000000 1.000000
max 1.000000 1.000000

efo_term ewing.sarcoma gastric.adenocarcinoma gastritis \\
count 8.000000 8.000000 8.000000
mean 0.331000 0.544375 0.363125
std 0.419288 0.381603 0.102227
min 0.028000 0.205000 0.229000
25% 0.064000 0.242750 0.293000
50% 0.121000 0.341500 0.343500
75% 0.440500 1.000000 0.447250
max 1.000000 1.000000 0.508000

efo_term head.and.neck.squamous.cell.carcinoma leukemia \\
count 8.000000 8.000000
mean 0.176125 0.002000
std 0.124161 0.001927
min 0.003000 0.000000
25% 0.107500 0.000000
50% 0.187500 0.002000
75% 0.247500 0.003250
max 0.363000 0.005000

efo_term myelodysplastic.syndrome neoplasm neoplasm.of.mature.bcells \\
count 8.000000 8.000000 8.000000
mean 0.503250 0.198375 0.198375
std 0.413574 0.107493 0.107493
min 0.123000 0.072000 0.072000
25% 0.202250 0.104250 0.104250
50% 0.257000 0.182500 0.182500
75% 1.000000 0.306000 0.306000
max 1.000000 0.323000 0.323000

efo_term papillary.renal.cell.carcinoma polyp portal.hypertension
count 8.000000 8.000000 8.000000
mean 0.000875 0.773625 0.161750
std 0.001458 0.419184 0.088627
min 0.000000 0.087000 0.054000
25% 0.000000 0.775500 0.094250
50% 0.000000 1.000000 0.148000
75% 0.001250 1.000000 0.247500
max 0.004000 1.000000 0.277000
python
# multiple testing correction
import statsmodels.stats.multitest
from statsmodels.stats.multitest import multipletests

method='fdr_bh'
columns=df.columns.tolist()

# correct for multiple testing (each test being one cell type). Replacing the column with a corrected value
for column in columns:
    df[column]=statsmodels.stats.multitest.multipletests(df[column],method=method)[1]
df
efo_termcardiomyopathychronic.lymphocytic.leukemiachronic.myelogenous.leukemiadiffuse.gastric.adenocarcinomaewing.sarcomagastric.adenocarcinomagastritishead.and.neck.squamous.cell.carcinomaleukemiamyelodysplastic.syndromeneoplasmneoplasm.of.mature.bcellspapillary.renal.cell.carcinomapolypportal.hypertension
cell_type
B_cell0.1782860.0000001.00.58200.1440000.58200.5080.3188570.0000001.0000.2660.2660.0013331.0000.277
DC0.0440000.0040001.00.58201.0000000.58200.5080.0520000.0040000.4160.2660.2660.0022861.0000.277
Endothelial_cells0.0440000.0000001.00.62720.1440000.62720.5080.3160000.0000000.4160.3230.3230.0040001.0000.224
Keratinocytes0.0666670.0032001.01.00000.1440001.00000.5080.3160000.0032000.4160.3230.3230.0000000.4080.224
Macrophage0.1900000.0000001.00.58200.1440000.58200.5080.0240000.0000001.0000.2660.2660.0000001.0000.277
Neurons0.1782860.0045711.00.58200.2720000.58200.5080.3160000.0045710.4160.3230.3230.0000000.4080.224
Smooth_muscle_cells0.1680000.0032001.01.00000.3386671.00000.5080.3630000.0032000.4160.3230.3230.0000001.0000.224
T_cells0.1680000.0050001.01.00001.0000001.00000.5080.3160000.0050001.0000.2660.2660.0000001.0000.277
python
# Apply transformation
df = -np.log10(df)

# Handle all special values
df = df.replace([-0.0, 0.0], 0.0)  # Unify zero values
df = df.replace([np.inf, -np.inf], np.nan)  # Infinity to NaN

# Fill NaN with finite max value (skip NaN when calculating max)
if not df.isnull().all().all():
    max_val = df.max().max()  # Auto skip NaN
    df = df.fillna(max_val)
else:
    df = df.fillna(300)  # Case where all are infinity

df
efo_termcardiomyopathychronic.lymphocytic.leukemiachronic.myelogenous.leukemiadiffuse.gastric.adenocarcinomaewing.sarcomagastric.adenocarcinomagastritishead.and.neck.squamous.cell.carcinomaleukemiamyelodysplastic.syndromeneoplasmneoplasm.of.mature.bcellspapillary.renal.cell.carcinomapolypportal.hypertension
cell_type
B_cell0.7488832.8750610.00.2350770.8416380.2350770.2941360.4964042.8750610.0000000.5751180.5751182.8750610.000000.557520
DC1.3565472.3979400.00.2350770.0000000.2350770.2941361.2839972.3979400.3809070.5751180.5751182.6409780.000000.557520
Endothelial_cells1.3565472.8750610.00.2025940.8416380.2025940.2941360.5003132.8750610.3809070.4907970.4907972.3979400.000000.649752
Keratinocytes1.1760912.4948500.00.0000000.8416380.0000000.2941360.5003132.4948500.3809070.4907970.4907972.8750610.389340.649752
Macrophage0.7212462.8750610.00.2350770.8416380.2350770.2941361.6197892.8750610.0000000.5751180.5751182.8750610.000000.557520
Neurons0.7488832.3399480.00.2350770.5654310.2350770.2941360.5003132.3399480.3809070.4907970.4907972.8750610.389340.649752
Smooth_muscle_cells0.7746912.4948500.00.0000000.4702280.0000000.2941360.4400932.4948500.3809070.4907970.4907972.8750610.000000.649752
T_cells0.7746912.3010300.00.0000000.0000000.0000000.2941360.5003132.3010300.0000000.5751180.5751182.8750610.000000.557520
python
plt.rcParams['figure.dpi'] = 150

tmp=df.loc[:, (df > 1).any(axis=0)]
tmp=tmp.loc[(tmp > 1).any(axis=1),:]



g=sns.clustermap(tmp,
               xticklabels=True,
               yticklabels=True,
               cmap='OrRd',
               figsize=(5,5),
#               annot=anno_df,
               fmt = '',
               dendrogram_ratio=0.05,
               cbar_pos=(0.98,0.9,0.02,0.1),
                 vmax=4
              )


for a in g.ax_row_dendrogram.collections:
    a.set_linewidth(1)

for a in g.ax_col_dendrogram.collections:
    a.set_linewidth(1)
    
ax = g.ax_heatmap
ax.set_xlabel('GWAS traits')
ax.set_ylabel('Cell types')
plt.savefig(f'{output_path}overview_heatmap_SNP_enrichment.png')
plt.show()

Remove traits not significantly enriched in any cell type

python
df=df.loc[:, (df > neglog10_pval_sig_threshold).any(axis=0)]
df=df.loc[(df > neglog10_pval_sig_threshold).any(axis=1),:]
python
threshold=0.05
python
plt.rcParams['figure.dpi'] = 100

# Make annotation df
anno_df=df.where(df<1, other="*")
anno_df=anno_df.where(anno_df=='*', other=" ")
anno_df=anno_df.astype(str)


g=sns.clustermap(df,
               xticklabels=True,
               yticklabels=True,
               cmap='OrRd',
               figsize=(6,6),
               annot=anno_df,
               fmt = '',
               dendrogram_ratio=0.05,
               cbar_pos=(0.98,0.9,0.02,0.1),
                 vmax=2
              )

g.ax_col_dendrogram.set_title(f'Enrichment of cell type-specific open chromatin for GWAS trait SNPs.\n\nScale = -log(bh-corrected p-value)\n"*" = p<{pval_sig_threshold} (-log(p)>={neglog10_pval_sig_threshold}), "open threshold" = {threshold}')

for a in g.ax_row_dendrogram.collections:
    a.set_linewidth(1)

for a in g.ax_col_dendrogram.collections:
    a.set_linewidth(1)
    
ax = g.ax_heatmap
ax.set_xlabel('GWAS traits')
ax.set_ylabel('Cell types')
plt.savefig(f'{output_path}overview_heatmap_SNP_enrichment.png')
plt.show()

Remove uninteresting traits

python

df.columns
output
Index(['cardiomyopathy', 'chronic.lymphocytic.leukemia',
'head.and.neck.squamous.cell.carcinoma', 'leukemia',
'papillary.renal.cell.carcinoma'],
dtype='object', name='efo_term')
python
df.columns = map(str.upper, df.columns)
df = df.reindex(sorted(df.columns), axis=1)
df.columns
output
Index(['CARDIOMYOPATHY', 'CHRONIC.LYMPHOCYTIC.LEUKEMIA',
'HEAD.AND.NECK.SQUAMOUS.CELL.CARCINOMA', 'LEUKEMIA',
'PAPILLARY.RENAL.CELL.CARCINOMA'],
dtype='object')
python
plt.rcParams['figure.dpi'] = 60
plt.rcParams['font.size'] = 15     # Set global base font size
from matplotlib.colors import LinearSegmentedColormap
# Make annotation df
anno_df=df.where(df<neglog10_pval_sig_threshold, other="*")
anno_df=anno_df.where(anno_df=='*', other=" ")
anno_df=anno_df.astype(str)

#cmap_celadon = LinearSegmentedColormap.from_list('celadon_gradient', 
#                colors = ["#DFEDEA","#B9EAE1", "#79D8C5", "#49C4AC","#17BCA0","#0A9E81"],
#                N=256)
#g=sns.clustermap(df,
#               xticklabels=True,
#               yticklabels=True,
               #cmap = cmap_celadon,
#               camp = "OrRd",
#               figsize=(8,12),
#               annot=anno_df,
#               fmt = '',
#               dendrogram_ratio=0.05,
#               cbar_pos=(0.98,0.9,0.02,0.1),
#                 vmax=5,
#                 row_cluster=False, col_cluster=False
#              )
g=sns.clustermap(df,
               xticklabels=True,
               yticklabels=True,
               cmap='OrRd',
               figsize=(10,10),
               annot=anno_df,
               fmt = '',
               dendrogram_ratio=0.05,
               cbar_pos=(0.98,0.9,0.02,0.1),
                 vmax=2,
                 row_cluster=False, col_cluster=False
              )

g.ax_col_dendrogram.set_title(f'Enrichment of cell-type-specific open chromatin for GWAS trait SNPs.\n\nScale = -log(BH-corrected p-value)\n\n"*" = p<{pval_sig_threshold} (-log(p)>={round(neglog10_pval_sig_threshold,2)})')

for a in g.ax_row_dendrogram.collections:
    a.set_linewidth(1)

for a in g.ax_col_dendrogram.collections:
    a.set_linewidth(1)

ax = g.ax_heatmap
ax.set_xlabel('GWAS traits')
ax.set_ylabel('Cell types')
plt.savefig(f'{output_path}overview_heatmap_SNP_enrichment_physiological.pdf',bbox_inches='tight')
plt.savefig(f'{output_path}overview_heatmap_SNP_enrichment_physiological.png')
The heatmap shows the enrichment of GWAS trait-associated SNPs in cell-specific open chromatin regions. Colors represent -log(p) values corrected by the BH method; greener colors indicate greater significance. '*' indicates corrected p-value < 0.05, meaning GWAS traits are significantly enriched in the corresponding cell type.

Result Files

├── enrichment_output
│ ├── 0p05_peak_width_500_SNPs_in_LD_all_traits_summary.csv
│ └── overview_heatmap_SNP_enrichment_physiological.pdf

Literature Case Analysis

  • 《Spatially resolved multiomics of human cardiac niches》

    In this study, the authors used scATAC-seq data to analyze the enrichment of GWAS trait SNPs associated with cardiac physiology and pathology across various cell types.

References

[1] Kanemaru K, Cranley J, Muraro D, et al.Spatially resolved multiomics of human cardiac niches[J].Nature, 2023.DOI: 10.1038/s41586-023-06311-1.

Appendix

.txt: Result data table files, separated by tabs. Unix/Linux/Mac users use less or more commands to view; Windows users use advanced text editors like Notepad++, or open with Microsoft Excel.

.pdf: Result image files, vector graphics, can be zoomed in/out without distortion, convenient for viewing and editing, can be edited using Adobe Illustrator for publication.

.rds: Seurat object containing gene activity assay, needs to be opened in R environment for viewing and further analysis.

0 comments·0 replies