This repository contains a pipeline for performing genome-wide association studies using genetic data from the European Prospective Investigation into Cancer and Nutrition (EPIC) cohort.
A Genome-Wide Association Study (GWAS) is a systematic approach to identifying genetic variants associated with a particular trait or disease. A GWAS scans the genomes of many individuals to find single nucleotide polymorphisms (SNPs) that occur more frequently in people with a particular trait than in those without it. The typical GWAS workflow involves:
- Quality control of genetic data and phenotypes
- Association testing between genetic variants and the trait of interest
- Visualization of results through Manhattan and QQ plots
- Meta-analysis of results across multiple studies (when applicable)
GWAS results help researchers understand the genetic architecture of complex traits and diseases, identify biological pathways involved in disease aetiology, and potentially discover new therapeutic targets.
EPIC is a large prospective cohort study. Since its inception, extensive follow-up data have been collected, and many nested case-control and case-cohort studies have been established, including molecular data collection. Among these are multiple nested case-control studies in which blood samples were used for genome-wide genotyping. These data represent ~10% of the total EPIC population and ~16,000 incident cancers as of 2025.
Imputation was performed using the Michigan Imputation Server with the 1000 Genomes Project Phase 3 (version 5) reference panel (GRCh37/hg19). Post-imputation quality control of the genotype data was performed sequentially:
1. HapMap3 SNPs were extracted, filtered for missingness (`--geno 0.02`) and Hardy-Weinberg equilibrium (`--hwe 1e-7`), and pruned to remove high linkage disequilibrium (LD) regions.
2. Identity-by-state (IBS) metrics were computed to identify related individuals; samples with IBS > 0.25 were removed to reduce cryptic relatedness.
3. The study dataset was aligned with the 1000 Genomes reference by extracting overlapping SNPs, merging the datasets, and resolving merge errors where necessary.
4. Principal component analysis (PCA) was run on the combined dataset (study + reference) to capture population structure, assign super-populations based on the reference super-populations (EUR, AFR, EAS, SAS, AMR), and identify and exclude samples falling outside the expected clusters in PCA space.

At this stage, 10,791 samples were present in ≥2 studies; to allow meta-analysis or pooling, each overlapping sample was retained in the smallest study in which it appeared and excluded from the larger one(s). Here we use body mass index (BMI) as an example trait for performing a GWAS.
- Format: PLINK (`.bed`, `.bim`, `.fam`)
- Location: `../genetics/combined.bed` (restricted access)
- Build: GRCh37 (hg19)
We use Hail in this pipeline because it supports sample weighting while maintaining flexibility and efficiency for large datasets. Hail is an open-source Python library designed specifically for analyzing large-scale genomic data. Developed at the Broad Institute, Hail provides:
- Scalable computing: Built on Apache Spark, Hail can process datasets with millions of samples and billions of genetic variants
- Efficient storage: Uses optimized data structures (MatrixTables and Tables) that are much faster than traditional text files
- Built-in GWAS tools: Includes pre-implemented statistical methods for association testing, quality control, and visualization
- Memory efficiency: Processes data in chunks rather than loading everything into memory at once
In this pipeline, Hail handles all genotype data operations, performs linear regression for association testing, and generates quality control metrics and plots. This pipeline uses linear regression to test for associations between genetic variants and quantitative traits. For each variant, the model is:
Y = β₀ + β₁X + β₂C₁ + β₃C₂ + ... + ε
Where Y = phenotype (e.g., BMI), X = genotype dosage (0, 1, or 2 copies of the effect allele), C₁, C₂, ... = covariates (age, sex, principal components), β₁ = effect size (the main parameter of interest), ε = error term.
The pipeline uses Hail's linear_regression_rows() function, which:
- Tests each variant independently
- Adjusts for specified covariates and principal components
- Estimates the effect size (beta), standard error, and p-value for each variant
- Handles missing genotypes appropriately
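To make the per-variant model concrete, here is a pure-NumPy sketch of the ordinary least squares fit that an association test performs for a single variant. The data are simulated and the variable names are illustrative; in the pipeline itself this is all handled internally by `linear_regression_rows()`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Simulated data: genotype dosage X (0/1/2 copies of the effect allele),
# covariates (age, sex), and phenotype Y with a true effect of 0.5
x = rng.binomial(2, 0.3, n).astype(float)
age = rng.normal(55, 8, n)
sex = rng.integers(0, 2, n).astype(float)
y = 25 + 0.5 * x + 0.02 * age + rng.normal(0, 2, n)

# Design matrix: intercept, genotype, covariates
X = np.column_stack([np.ones(n), x, age, sex])

# Ordinary least squares: beta_hat = (X'X)^-1 X'y
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# Residual variance and standard errors of the coefficients
resid = y - X @ beta_hat
sigma2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(sigma2 * np.diag(XtX_inv))

beta1, se1 = beta_hat[1], se[1]  # effect size and SE for the variant
t_stat = beta1 / se1             # basis of the reported p-value
print(beta1, se1, t_stat)
```

A GWAS simply repeats this fit independently for every variant in the dataset, which is why the computation parallelizes so well.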
The pipeline automatically includes 10 principal components (PCs) from the ancestry analysis as covariates. PCs control for population stratification: systematic differences in allele frequencies between subgroups that could cause spurious associations. When combining results from multiple studies (Scenario B), the pipeline uses METAL, a widely used tool for meta-analysis of genome-wide association studies, which implements inverse-variance weighted meta-analysis:
β_meta = Σ(w_i × β_i) / Σ(w_i)
Where β_i is the effect estimate from study i and w_i = 1/SE_i² is its inverse-variance weight (the inverse of the squared standard error). This method weights each study by the precision of its effect estimate, assumes a fixed-effect model (the same true effect in all studies), and provides heterogeneity statistics (I² and a heterogeneity p-value) to assess consistency across studies. The standard error of the pooled estimate is SE_meta = 1/√(Σw_i).
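The weighting formula above can be sketched in a few lines of Python. The per-study estimates here are made-up numbers, not real results:

```python
import math

# Hypothetical per-study estimates: (beta_i, SE_i)
studies = [(0.12, 0.05), (0.08, 0.04), (0.15, 0.09)]

# Inverse-variance weights: w_i = 1 / SE_i^2
weights = [1.0 / se**2 for _, se in studies]

# Fixed-effect pooled estimate and its standard error
beta_meta = sum(w * b for (b, _), w in zip(studies, weights)) / sum(weights)
se_meta = math.sqrt(1.0 / sum(weights))

print(beta_meta, se_meta)
```

Note that the pooled standard error is smaller than any single study's, which is the point of meta-analysis: precision accumulates across studies.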
You'll need to set up a Hail environment. Run the following commands:
# Create a new environment
conda create -n hail_env python=3.10 -y
# Activate it
conda activate hail_env
# Install system BLAS/LAPACK and Python dependencies
conda install -c conda-forge openblas numpy scipy matplotlib phantomjs selenium -y
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
export JAVA_OPTS="-Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.NativeRefBLAS"
# Install Hail
pip install hail
# Verify installation
python -c "import hail; print(hail.__version__)"

Before running a GWAS, you need:
- Phenotype file: A tab-separated file containing:
- Sample IDs (column name: Idepic)
- Your trait of interest (e.g., Bmi_C)
- Covariates (e.g., Age_Recr, Sex)
- Study IDs: A list of study names to include in your analysis
- Genotype data and PCs are already available and built into the pipeline
EPIC GWAS analyses fall into three scenarios:
- Scenario A: Binary traits used to construct case-control studies (e.g., colorectal cancer, breast cancer)
- Studies will be pooled into a single GWAS
- Must adjust for matching factors from the trait-specific study
- Scenario B: Quantitative traits used in sampling or matching (e.g., BMI if used in case-control selection)
- Study-specific GWAS will be performed with inverse-probability weighting (IPW)
- An unweighted sensitivity analysis will also be run
- Results will be meta-analyzed across studies
- Scenario C: Quantitative traits not used in sampling (e.g., height, serum biomarkers)
- Studies will be pooled into a single GWAS
- No study-specific covariates or weighting required
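For Scenario B, the intuition behind inverse-probability weighting is that each participant is up-weighted by the inverse of their probability of selection into the nested study, so the weighted sample again resembles the full cohort. A minimal sketch with hypothetical sampling fractions (not the pipeline's actual weighting scheme):

```python
# Hypothetical sampling fractions within a nested case-control study:
# all cases are genotyped, controls are a 5% sample of the cohort
sampling_prob = {"case": 1.0, "control": 0.05}

def ipw_weight(status: str) -> float:
    """Inverse-probability weight: 1 / P(selected into the nested study)."""
    return 1.0 / sampling_prob[status]

weights = [ipw_weight(s) for s in ["case", "control", "control"]]
print(weights)
```

Under these assumed fractions, each sampled control stands in for 20 cohort members, while each case represents only itself.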
Use the R function process_covars() from functions.R to create covariate files:
# Load required packages
library(data.table)
library(dplyr)
# Source the functions
source("functions.R")
# Define studies to include
keep_studies <- data.frame(id = c("Clrt_01_Gecco", "Brea_02_Onco"))
# Load phenotype file
phenofile <- fread("../BMI.txt")
# Process covariates
process_covars(
study_ids = keep_studies$id,
phenofile = phenofile,
out_dir = "../",
id_col = "Idepic",
trait = "Bmi_C",
covariates = c("Age_Recr", "Sex")
)

This will create covariate files for each study in /scratch/leem/GWAS_Bmi_C/covars/ that include:
- Sample IDs matched to genotype data
- Specified covariates
- 10 principal components from PCA
- A `.nosex.covar` version without the Sex column (if applicable)
For a single study or pooled analysis:
import sys
sys.path.append("../GWAS-EPIC/src")
from functions import run_gwas_pipeline
gwas_ht = run_gwas_pipeline(
study="Clrt_01_Gecco", # Study name (matches PLINK file prefix)
trait="Bmi_C", # Trait column name in phenotype file
phenotype_file="../BMI.txt",
dir_out="../gwas_results/", # Output directory; results saved to {dir_out}/{trait}/
covariates=['Sex', 'Age_Recr'], # Covariates (PCs added automatically)
qc_plots=True, # Generate QC plots
chr="all" # Chromosomes to analyze ("all" or list/int)
)

The pipeline creates the following in {dir_out}/{trait}/:
- `GWAS.ht`: Hail table with full results
- `GWAS.tsv`: Results in GWAS Catalog format
- `plots/manhattan.html`: Manhattan plot
- `plots/qq.html`: QQ plot
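The exported TSV can be post-processed with standard tools. As a sketch, this is how genome-wide significant hits could be filtered from it; the inline rows are made-up stand-ins for the real file, which you would open from {dir_out}/{trait}/GWAS.tsv instead.

```python
import csv
import io

# Tiny illustrative stand-in for GWAS.tsv, using a subset of its columns
tsv = (
    "chromosome\tbase_pair_location\teffect_allele\tother_allele\tbeta\tstandard_error\tp_value\n"
    "1\t123456\tA\tG\t0.10\t0.02\t4.2e-09\n"
    "2\t654321\tC\tT\t0.01\t0.02\t6.1e-01\n"
)
rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))

# Keep hits passing the genome-wide significance threshold (p < 5e-8)
hits = [r for r in rows if float(r["p_value"]) < 5e-8]
print([(r["chromosome"], r["base_pair_location"]) for r in hits])
```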
For study-specific analyses (Scenario B) or to run multiple studies efficiently:
import sys
sys.path.append("../GWAS-EPIC/src")
from functions import run_gwas_pipeline
import glob
import os
# Get study names from FAM files
fam_files = glob.glob("../genetics/*extracted.fam")
studies = [os.path.basename(f).replace("_extracted.fam", "") for f in fam_files]
# Loop over studies and run GWAS
for study in studies:
print(f"Running GWAS for {study}...")
gwas_ht = run_gwas_pipeline(
study=study,
trait="Bmi_C",
phenotype_file="../BMI.txt",
dir_out="../gwas_results/",
covariates=['Sex', 'Age_Recr'],
qc_plots=True,
chr="all"
)

This script automatically detects all available studies and runs a GWAS for each one. For parallel execution, submit this as an array job with each job processing one study.
After running study-specific GWAS (Scenario B), perform meta-analysis using METAL:
import sys
sys.path.append("../GWAS-EPIC/src")
from functions import run_metal_meta_analysis
import os
# Define output directory and studies
dir_out = "../gwas_results/"
studies = [d for d in os.listdir(dir_out)
if os.path.isdir(os.path.join(dir_out, d)) and d != "meta_analysis"]
# Run meta-analysis
meta_results = run_metal_meta_analysis(
studies=studies,
trait="Bmi_C",
dir_out=dir_out,
manhattan=True, # Create Manhattan plot
metal_path="../tools/METAL/build/metal/metal"
)

What the meta-analysis does:
- Converts GWAS Catalog format files to METAL input format
- Runs METAL using inverse-variance weighted meta-analysis
- Annotates results with dbSNP rsIDs
- Exports results in GWAS Catalog format
- Creates Manhattan and QQ plots
Meta-analysis results are saved to {dir_out}/meta_analysis/{trait}/:
- `META.tsv`: Meta-analysis results in GWAS Catalog format
- `METAL.ht`: Hail table with annotated results
- `manhattan.html`: Manhattan plot of the meta-analysis
- `METAL-script.txt`: METAL configuration script
- `input/*.tsv`: Formatted input files for each study
All GWAS results are exported in GWAS Catalog format with the following columns:
| Column | Description |
|---|---|
| `chromosome` | Chromosome identifier |
| `base_pair_location` | Position on chromosome (GRCh37) |
| `effect_allele` | Effect allele (alternative allele) |
| `other_allele` | Other allele (reference allele) |
| `beta` | Effect size estimate |
| `standard_error` | Standard error of beta |
| `effect_allele_frequency` | Frequency of effect allele |
| `p_value` | Association p-value |
| `n` | Sample size |
| `variant_id` | Variant identifier (chr:pos:ref:alt) |
| `rsid` | dbSNP rsID (if available) |
| `ref_allele` | Reference allele |
| `trait` | Trait name |
Meta-analysis results include additional columns:
| Column | Description |
|---|---|
| `direction` | Direction of effect in each study |
| `het_i_squared` | I² heterogeneity statistic |
| `het_p_value` | P-value for heterogeneity test |
- Memory management: The pipeline is optimized for memory efficiency. For very large datasets, consider running chromosomes sequentially rather than all at once.
- Quality control: Always review QC plots before interpreting GWAS results. Check for:
- Appropriate allele frequency distributions
- High variant and sample call rates
- Hardy-Weinberg equilibrium
- No outliers in sample heterozygosity
- Significance threshold: The standard genome-wide significance threshold is p < 5×10⁻⁸.
- Storage: Hail tables (`.ht` files) are the native format and load faster than TSV files for subsequent analyses.
- Matching factors: For Scenario A, ensure you adjust for all matching factors used in the study design.
  - Study-specific matching factors: IN PROGRESS
- Chromosome specification: For testing or quick runs, start with chr=[21, 22] before running all chromosomes.