Identification of Novel Genetic Markers for Complex Diseases by Integrating Blood Samples from Both Patient and Healthy Donors (Tenk10k cohort-specific eQTL)
This repository contains the bioinformatics pipeline used to investigate the regulatory architecture of the immune system under the physiological stress of Coronary Artery Disease (CAD). The analysis utilizes SAIGE-QTL for genetic association testing across two cohorts (BioHeart and TOB) and then Cochran's Q-test and SMR (Summary-data-based Mendelian Randomization) for integrative analysis.
We ran eQTL mapping with SAIGE-QTL separately in the TOB and BioHeart cohorts across 28 blood cell types, then performed heterogeneity and SMR tests on the results.
The pipeline is divided into four main stages: Preprocessing, Processing (SAIGE-QTL), Postprocessing, and Downstream Analysis (SMR & Heterogeneity).
Scripts to prepare gene coordinates, filter genes based on expression thresholds, and format phenotypes/genotypes.
- generate_coords.py: Generates hg38 coordinates for the analysis.
- generate_filtered_gene_list.py: Filters genes within input_{cell_type} that meet the minimum expression criteria (> 0.05% of cells).
- prepare_phenotypes.py: Merges gene expression from .h5ad files with covariates (PCs) into SAIGE-ready formats.
- prepare_genotypes.sh: Shell script to generate chromosome-wise PLINK binary files.
- submit_phenotypes_array.sh: HPC array job script to parallelize the creation of phenotype files across 22 chromosomes.
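As a rough illustration of the expression filter, the sketch below keeps genes with a non-zero count in more than 0.05% of cells. This reading of the threshold is an assumption, and the function name `filter_genes`, the dict-of-lists input, and the toy gene names are hypothetical; generate_filtered_gene_list.py may work differently (for example, directly on .h5ad matrices).

```python
def filter_genes(counts, min_fraction=0.0005):
    """Return genes expressed in more than `min_fraction` of cells.

    counts: dict mapping gene name -> list of per-cell counts
            (hypothetical input layout, for illustration only).
    min_fraction: 0.0005 corresponds to 0.05% of cells.
    """
    kept = []
    for gene, per_cell in counts.items():
        if not per_cell:
            continue  # no cells recorded for this gene, nothing to evaluate
        n_expressing = sum(1 for c in per_cell if c > 0)
        if n_expressing / len(per_cell) > min_fraction:
            kept.append(gene)
    return kept


# Toy data: 4000 cells per gene.
toy_counts = {
    "GENE_A": [1] * 3 + [0] * 3997,  # expressed in 0.075% of cells
    "GENE_B": [1] * 1 + [0] * 3999,  # expressed in 0.025% of cells
}
filtered = filter_genes(toy_counts)
```

With the toy data, only `GENE_A` passes the threshold.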
The core association testing phase, optimized for running Steps 1 & 2 of SAIGE-QTL.
- run_SAIGE.sh: Primary implementation of SAIGE-QTL (Steps 1 & 2 combined) for each specific cohort.
- rerun_SAIGE.sh: Troubleshooting script for failed genes; uses the filtered gene list and provides higher RAM allocation.
- missing_genes.sh: Diagnostic script to identify genes that failed initial runs due to memory constraints.
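The idea behind the missing-gene check can be sketched as a comparison between the filtered gene list and the result files present on disk. The output naming scheme (`<gene>.txt`) and the function name `find_missing_genes` are assumptions made for illustration; missing_genes.sh is a shell script and may use a different convention.

```python
def find_missing_genes(expected_genes, output_filenames, suffix=".txt"):
    """Return expected genes that have no matching result file.

    expected_genes: gene IDs from the filtered gene list.
    output_filenames: file names found in the results directory.
    suffix: assumed result-file extension (hypothetical).
    """
    finished = {
        name[: -len(suffix)]
        for name in output_filenames
        if name.endswith(suffix)
    }
    # Preserve the order of the input gene list.
    return [gene for gene in expected_genes if gene not in finished]


missing = find_missing_genes(
    ["ENSG1", "ENSG2", "ENSG3"],
    ["ENSG1.txt", "ENSG3.txt", "run.log"],
)
```

Here `missing` contains only `ENSG2`, which could then be passed to a rerun with a higher memory allocation. In practice the file names might come from `os.listdir()` on the cohort's output directory.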
Consolidating raw outputs from the cluster into unified datasets.
- merge_results.py: Merges raw Step 2 results into Full-Genome files for both cohorts.
- het_test.R: Performs Cochran’s Q test on cell-type-specific results to identify heterogeneous effects. Outputs are stored in output_heterogeneity/.
- run_smr_gene_wise.py & run_SMR.sh: Integration of CAD GWAS data, gene lists, and hg38 coordinates to perform Summary-data-based Mendelian Randomization.
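For orientation, the two downstream statistics can be sketched for a single variant-gene pair. This is a minimal illustration under stated assumptions, not the code in het_test.R or the SMR software: Cochran's Q is shown for exactly two cohorts (1 degree of freedom) with inverse-variance weights, and the SMR statistic uses the standard approximate chi-square form based on the eQTL and GWAS z-scores. All function names and example values are hypothetical.

```python
import math


def chi2_sf_1df(x):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def cochran_q_two_cohorts(beta1, se1, beta2, se2):
    """Cochran's Q for one effect estimated in two cohorts (df = 1)."""
    w1, w2 = 1.0 / se1**2, 1.0 / se2**2          # inverse-variance weights
    pooled = (w1 * beta1 + w2 * beta2) / (w1 + w2)
    q = w1 * (beta1 - pooled) ** 2 + w2 * (beta2 - pooled) ** 2
    return q, chi2_sf_1df(q)


def smr_test(beta_eqtl, se_eqtl, beta_gwas, se_gwas):
    """SMR effect estimate and approximate 1-df chi-square test."""
    z_eqtl = beta_eqtl / se_eqtl
    z_gwas = beta_gwas / se_gwas
    b_smr = beta_gwas / beta_eqtl                 # effect of expression on trait
    t_smr = (z_eqtl**2 * z_gwas**2) / (z_eqtl**2 + z_gwas**2)
    return b_smr, t_smr, chi2_sf_1df(t_smr)


# Made-up effect sizes, for illustration only.
q, q_p = cochran_q_two_cohorts(0.5, 0.1, 0.1, 0.1)
b_smr, t_smr, smr_p = smr_test(0.4, 0.1, 0.06, 0.02)
```

A large Q (small p-value) suggests the eQTL effect differs between the two cohorts. A large SMR statistic suggests that the top eQTL variant is also associated with the trait, consistent with (but not proof of) an effect mediated through expression.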
Input data (genotypes, .h5ad objects, and raw GWAS summaries) are not included in this repository due to size and privacy constraints. The scripts assume a directory structure containing /inputs_genotype and /input_{cell_type}.
Angli Xue (a.xue@garvan.org.au)
Jonathan Johnson (jonathanvergis@hotmail.com)