powellgenomicslab/tenk10k_cohort_specific_eQTL

Identification of Novel Genetic Markers for Complex Diseases by Integrating Blood Samples from Both Patient and Healthy Donors (Tenk10k cohort-specific eQTL)

This repository contains the bioinformatics pipeline used to investigate the regulatory architecture of the immune system under the physiological stress of coronary artery disease (CAD). The analysis uses SAIGE-QTL for genetic association testing in two cohorts (BioHeart and TOB), followed by Cochran's Q test and SMR (Summary-data-based Mendelian Randomization) for integrative analysis.

Main Analyses

We ran eQTL mapping with SAIGE-QTL separately in the TOB and BioHeart cohorts across 28 blood cell types, then performed heterogeneity and SMR tests on the results.

Project Structure

The pipeline is divided into four main stages: Preprocessing, Processing (SAIGE-QTL), Postprocessing, and Downstream Analysis (SMR & Heterogeneity).

1. Preprocessing

Scripts to prepare gene coordinates, filter genes based on expression thresholds, and format phenotypes/genotypes.

  • generate_coords.py: Generates hg38 coordinates for the analysis.

  • generate_filtered_gene_list.py: Retains the genes within input_{cell_type} that meet the minimum expression criterion (expressed in > 0.05% of cells).

  • prepare_phenotypes.py: Merges gene expression from .h5ad files with covariates (PCs) into SAIGE-ready formats.

  • prepare_genotypes.sh: Shell script to generate chromosome-wise PLINK binary files.

  • submit_phenotypes_array.sh: HPC array job script to parallelize the creation of phenotype files across 22 chromosomes.
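
The expression filter above can be sketched as follows. This is an illustration only, not the code in generate_filtered_gene_list.py: it assumes the criterion means "non-zero counts in more than 0.05% of cells", and the function name `filter_genes` and the dict-of-lists input are hypothetical (the real script works from .h5ad input).

```python
def filter_genes(counts_by_gene, min_fraction=0.0005):
    """Return genes expressed (count > 0) in more than `min_fraction` of cells.

    counts_by_gene: dict mapping gene name -> list of per-cell counts
                    (hypothetical layout, chosen to keep the sketch dependency-free).
    min_fraction:   0.0005 corresponds to 0.05% of cells.
    """
    kept = []
    for gene, counts in counts_by_gene.items():
        if not counts:
            continue  # no cells recorded for this gene
        expressing = sum(1 for c in counts if c > 0)
        if expressing / len(counts) > min_fraction:
            kept.append(gene)
    return kept
```

With 10,000 cells, a gene detected in 10 cells (0.1%) is kept, while a gene detected in a single cell (0.01%) is dropped.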

2. Processing (SAIGE-QTL)

The core association testing phase, optimized for running Steps 1 & 2 of SAIGE-QTL.

  • run_SAIGE.sh: Primary implementation of SAIGE-QTL (Steps 1 & 2 combined) for each specific cohort.

  • rerun_SAIGE.sh: Troubleshooting script for failed genes; uses the filtered gene list and provides higher RAM allocation.

  • missing_genes.sh: Diagnostic script to identify genes that failed initial runs due to memory constraints.
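
The diagnostic step (find genes that have no usable output, then rerun them) can be sketched in Python. This is a minimal sketch under stated assumptions: it supposes one result file per gene named `<gene>.txt`, which is a hypothetical naming scheme, and `find_missing_genes` is an illustrative helper rather than part of missing_genes.sh.

```python
from pathlib import Path


def find_missing_genes(gene_list, output_dir, suffix=".txt"):
    """Return genes from `gene_list` that lack a non-empty result file.

    Assumes one output file per gene named <gene><suffix> in `output_dir`
    (hypothetical layout). A missing or zero-byte file counts as a failure.
    """
    out = Path(output_dir)
    missing = []
    for gene in gene_list:
        result = out / f"{gene}{suffix}"
        if not result.is_file() or result.stat().st_size == 0:
            missing.append(gene)
    return missing
```

The returned list could then be written to a file and passed to a rerun job with a larger memory request.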

3. Postprocessing

Consolidating raw outputs from the cluster into unified datasets.

  • merge_results.py: Merges raw Step 2 results into Full-Genome files for both cohorts.
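
Consolidation of this kind usually amounts to concatenating many tab-delimited files while writing the header once. The sketch below shows that pattern using only the standard library; the function name `merge_results`, the `*.txt` pattern, and the tab-delimited layout are assumptions, not a description of merge_results.py.

```python
import csv
from pathlib import Path


def merge_results(input_dir, output_file, pattern="*.txt"):
    """Concatenate tab-delimited result files into one file.

    The header is written once; files whose header differs raise an error.
    Returns the number of data rows written.
    """
    files = sorted(Path(input_dir).glob(pattern))
    header = None
    n_rows = 0
    with open(output_file, "w", newline="") as out_handle:
        writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
        for path in files:
            with open(path, newline="") as in_handle:
                reader = csv.reader(in_handle, delimiter="\t")
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip empty files
                if header is None:
                    header = file_header
                    writer.writerow(header)
                elif file_header != header:
                    raise ValueError(f"Header mismatch in {path}")
                for row in reader:
                    writer.writerow(row)
                    n_rows += 1
    return n_rows
```

Failing loudly on a header mismatch is a deliberate choice here: silently stacking files with different columns would corrupt the merged table.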

4. Downstream Analysis

Heterogeneity Testing:

  • het_test.R: Performs Cochran's Q test on cell-type-specific results to identify eQTL effects that are heterogeneous between cohorts. Outputs are stored in output_heterogeneity/.
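
For reference, Cochran's Q for two effect estimates can be written out in a few lines. The sketch below is in Python rather than R and is not taken from het_test.R; the function name and argument names are illustrative. It uses inverse-variance weights and, because two cohorts give one degree of freedom, computes the chi-square tail probability with `math.erfc`.

```python
import math


def cochran_q_two_cohorts(beta1, se1, beta2, se2):
    """Cochran's Q test for heterogeneity between two effect estimates.

    Returns (Q, p), where Q is compared to a chi-square distribution with
    1 degree of freedom (k - 1 with k = 2 cohorts).
    """
    w1, w2 = 1.0 / se1**2, 1.0 / se2**2                 # inverse-variance weights
    pooled = (w1 * beta1 + w2 * beta2) / (w1 + w2)      # fixed-effect pooled estimate
    q = w1 * (beta1 - pooled) ** 2 + w2 * (beta2 - pooled) ** 2
    p = math.erfc(math.sqrt(q / 2.0))                   # chi-square (df = 1) upper tail
    return q, p
```

For example, effects of 0.5 and 0.1 with standard errors of 0.1 each give Q = 8, which is equivalent to (0.5 - 0.1)^2 / (0.1^2 + 0.1^2).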

SMR Analysis:

  • run_smr_gene_wise.py & run_SMR.sh: Integration of CAD GWAS data, gene lists, and hg38 coordinates to perform Summary-data-based Mendelian Randomization.
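
As background, the core SMR test for a single gene and its top eQTL SNP can be sketched from summary statistics alone. The code below is an illustration of the commonly cited approximate SMR statistic, not a reproduction of run_smr_gene_wise.py or run_SMR.sh; the function name and argument names are hypothetical.

```python
import math


def smr_test(beta_eqtl, se_eqtl, beta_gwas, se_gwas):
    """Approximate SMR test for one SNP-gene-trait triple.

    Returns (b_xy, t_smr, p):
      b_xy  - estimated effect of expression on the trait (beta_gwas / beta_eqtl)
      t_smr - approximate chi-square statistic with 1 degree of freedom
      p     - upper-tail p-value for t_smr
    """
    if beta_eqtl == 0:
        raise ValueError("beta_eqtl must be non-zero")
    z_eqtl = beta_eqtl / se_eqtl
    z_gwas = beta_gwas / se_gwas
    b_xy = beta_gwas / beta_eqtl
    t_smr = (z_eqtl**2 * z_gwas**2) / (z_eqtl**2 + z_gwas**2)
    p = math.erfc(math.sqrt(t_smr / 2.0))  # chi-square (df = 1) upper tail
    return b_xy, t_smr, p
```

With an eQTL z-score of 10 and a GWAS z-score of 5, the statistic is 100 × 25 / 125 = 20, so the test is limited by the weaker of the two associations.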

Data Note

Input data (genotypes, .h5ad objects, and raw GWAS summary statistics) are not included in this repository due to size and privacy constraints. The scripts assume a directory structure containing /inputs_genotype and /input_{cell_type}.

Contact

Angli Xue (a.xue@garvan.org.au)

Jonathan Johnson (jonathanvergis@hotmail.com)

About

This repo contains the scripts for a summer project led by Jonathan Johnson, which aims to compare eQTLs between the TOB and BioHeart cohorts.
