This repository contains a pipeline for performing genome-wide association studies using genetic data from the European Prospective Investigation into Cancer and Nutrition (EPIC) cohort.
A Genome-Wide Association Study (GWAS) is a systematic approach to identifying genetic variants associated with a particular trait or disease. A GWAS scans the genomes of many individuals to find single nucleotide polymorphisms (SNPs) that occur more frequently in people with a particular trait than in those without it. The typical GWAS workflow involves:
- Quality control of genetic data and phenotypes
- Association testing between genetic variants and the trait of interest
- Visualization of results through Manhattan and QQ plots
- Meta-analysis of results across multiple studies (when applicable)
GWAS results help researchers understand the genetic architecture of complex traits and diseases, identify biological pathways involved in disease aetiology, and potentially discover new therapeutic targets.
EPIC is a large prospective cohort study. Since its inception, extensive follow-up data have been collected, and many nested case-control and case-cohort studies have been established, including molecular data collection. Among these are multiple nested case-control studies in which blood samples were used for genome-wide genotyping. These data represent ~10% of the total EPIC population and ~16,000 incident cancers as of 2025.
Imputation was performed using the Michigan Imputation Server with the 1000 Genomes Project Phase 3 (version 5) reference panel (GRCh37/hg19). Post-imputation quality control of the genotype data was performed sequentially:
1. HapMap3 SNPs were extracted, filtered for missingness (`--geno 0.02`) and Hardy-Weinberg equilibrium (`--hwe 1e-7`), and pruned to remove high linkage disequilibrium (LD) regions.
2. Identity-by-state (IBS) metrics were computed to identify related individuals; samples with IBS > 0.25 were removed to reduce cryptic relatedness.
3. The study dataset was aligned with the 1000 Genomes reference by extracting overlapping SNPs, merging the datasets, and resolving merge errors where necessary.
4. Principal component analysis (PCA) was run on the combined dataset (study + reference) to capture population structure, assign super-populations based on the reference super-populations (EUR, AFR, EAS, SAS, AMR), and identify and exclude samples falling outside the expected clusters in PCA space.

At this stage, 10,791 samples were present in ≥2 studies; to allow meta-analysis or pooling, each overlapping sample was retained in the smallest study in which it appeared and excluded from the larger one(s). Here we use body mass index (BMI) as an example trait for performing a GWAS.
- Format: PLINK (`.bed`, `.bim`, `.fam`)
- Location: `../genetics/combined.bed` (restricted access)
- Build: GRCh37 (hg19)
We use Hail in this pipeline because it supports sample weighting while maintaining flexibility and efficiency for large datasets. Hail is an open-source Python library designed specifically for analyzing large-scale genomic data. Developed at the Broad Institute, Hail provides:
- Scalable computing: Built on Apache Spark, Hail can process datasets with millions of samples and billions of genetic variants
- Efficient storage: Uses optimized data structures (MatrixTables and Tables) that are much faster than traditional text files
- Built-in GWAS tools: Includes pre-implemented statistical methods for association testing, quality control, and visualization
- Memory efficiency: Processes data in chunks rather than loading everything into memory at once
In this pipeline, Hail handles all genotype data operations, performs linear regression for association testing, and generates quality control metrics and plots. This pipeline uses linear regression to test for associations between genetic variants and quantitative traits. For each variant, the model is:
Y = β₀ + β₁X + β₂C₁ + β₃C₂ + ... + ε
Where Y = phenotype (e.g., BMI), X = genotype dosage (0, 1, or 2 copies of the effect allele), C₁, C₂, ... = covariates (age, sex, principal components), β₁ = effect size (the main parameter of interest), ε = error term.
The pipeline uses Hail's linear_regression_rows() function, which:
- Tests each variant independently
- Adjusts for specified covariates and principal components
- Estimates the effect size (beta), standard error, and p-value for each variant
- Handles missing genotypes appropriately
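To make the per-variant model concrete, here is a pure-NumPy sketch of the ordinary least squares fit that an association test performs for a single variant. The data are simulated and the variable names are illustrative; in the pipeline itself this is all handled internally by `linear_regression_rows()`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Simulated data: genotype dosage X (0/1/2 copies of the effect allele),
# covariates (age, sex), and phenotype Y with a true effect of 0.5
x = rng.binomial(2, 0.3, n).astype(float)
age = rng.normal(55, 8, n)
sex = rng.integers(0, 2, n).astype(float)
y = 25 + 0.5 * x + 0.02 * age + rng.normal(0, 2, n)

# Design matrix: intercept, genotype, covariates
X = np.column_stack([np.ones(n), x, age, sex])

# Ordinary least squares: beta_hat = (X'X)^-1 X'y
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# Residual variance and standard errors of the coefficients
resid = y - X @ beta_hat
sigma2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(sigma2 * np.diag(XtX_inv))

beta1, se1 = beta_hat[1], se[1]  # effect size and SE for the variant
t_stat = beta1 / se1             # basis of the reported p-value
print(beta1, se1, t_stat)
```

A GWAS simply repeats this fit independently for every variant in the dataset, which is why the computation parallelizes so well.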
The pipeline automatically includes 10 principal components (PCs) from the ancestry analysis as covariates. PCs control for population stratification: systematic differences in allele frequencies between subgroups that could cause spurious associations. When combining results from multiple studies (Scenario B), the pipeline uses METAL, a widely used tool for meta-analysis of genome-wide association studies, which implements inverse-variance weighted meta-analysis:
β_meta = Σ(w_i × β_i) / Σ(w_i)
Where β_i is the effect estimate from study i and w_i = 1/SE_i² is its inverse-variance weight (the inverse of the squared standard error). This method weights each study by the precision of its effect estimate, assumes a fixed-effect model (the same true effect in all studies), and provides heterogeneity statistics (I² and a heterogeneity p-value) to assess consistency across studies. The standard error of the pooled estimate is SE_meta = 1/√(Σw_i).
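The weighting formula above can be sketched in a few lines of Python. The per-study estimates here are made-up numbers, not real results:

```python
import math

# Hypothetical per-study estimates: (beta_i, SE_i)
studies = [(0.12, 0.05), (0.08, 0.04), (0.15, 0.09)]

# Inverse-variance weights: w_i = 1 / SE_i^2
weights = [1.0 / se**2 for _, se in studies]

# Fixed-effect pooled estimate and its standard error
beta_meta = sum(w * b for (b, _), w in zip(studies, weights)) / sum(weights)
se_meta = math.sqrt(1.0 / sum(weights))

print(beta_meta, se_meta)
```

Note that the pooled standard error is smaller than any single study's, which is the point of meta-analysis: precision accumulates across studies.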
You'll need to set up a Hail environment. Run the following commands:
# Create a new environment
conda create -n hail_env python=3.10 -y
# Activate it
conda activate hail_env
# Install system BLAS/LAPACK and Python dependencies
conda install -c conda-forge openblas numpy scipy matplotlib phantomjs selenium -y
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
export JAVA_OPTS="-Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.NativeRefBLAS"
# Install Hail
pip install hail
# Verify installation
python -c "import hail; print(hail.__version__)"

Before running a GWAS, you need:
- Phenotype file: A tab-separated file containing:
- Sample IDs (column name: Idepic)
- Your trait of interest (e.g., Bmi_C)
- Covariates (e.g., Age_Recr, Sex)
- Study IDs: A list of study names to include in your analysis
- Genotype data and PCs are already available and built into the pipeline
EPIC GWAS analyses fall into three scenarios:
- Scenario A: Binary traits used to construct case-control studies (e.g., colorectal cancer, breast cancer)
- Studies will be pooled into a single GWAS
- Must adjust for matching factors from the trait-specific study
- Scenario B: Quantitative traits used in sampling or matching (e.g., BMI if used in case-control selection)
- Study-specific GWAS will be performed with inverse-probability weighting (IPW)
- An unweighted sensitivity analysis will also be run
- Results will be meta-analyzed across studies
- Scenario C: Quantitative traits not used in sampling (e.g., height, serum biomarkers)
- Studies will be pooled into a single GWAS
- No study-specific covariates or weighting required
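For Scenario B, the intuition behind inverse-probability weighting is that each participant is up-weighted by the inverse of their probability of selection into the nested study, so the weighted sample again resembles the full cohort. A minimal sketch with hypothetical sampling fractions (not the pipeline's actual weighting scheme):

```python
# Hypothetical sampling fractions within a nested case-control study:
# all cases are genotyped, controls are a 5% sample of the cohort
sampling_prob = {"case": 1.0, "control": 0.05}

def ipw_weight(status: str) -> float:
    """Inverse-probability weight: 1 / P(selected into the nested study)."""
    return 1.0 / sampling_prob[status]

weights = [ipw_weight(s) for s in ["case", "control", "control"]]
print(weights)
```

Under these assumed fractions, each sampled control stands in for 20 cohort members, while each case represents only itself.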
Use the R function process_covars() from functions.R to create covariate files:
# Load required packages
library(data.table)
library(dplyr)
# Source the functions
source("functions.R")
# Define studies to include
keep_studies <- data.frame(id = c("Clrt_01_Gecco", "Brea_02_Onco"))
# Load phenotype file
phenofile <- fread("../BMI.txt")
# Process covariates
process_covars(
study_ids = keep_studies$id,
phenofile = phenofile,
out_dir = "../",
id_col = "Idepic",
trait = "Bmi_C",
covariates = c("Age_Recr", "Sex")
)

This will create covariate files for each study in /scratch/leem/GWAS_Bmi_C/covars/ that include:
- Sample IDs matched to genotype data
- Specified covariates
- 10 principal components from PCA
- A `.nosex.covar` version without the Sex column (if applicable)
For a single study or pooled analysis:
import sys
sys.path.append("../GWAS-EPIC/src")
from functions import run_gwas_pipeline
gwas_ht = run_gwas_pipeline(
study="Clrt_01_Gecco", # Study name (matches PLINK file prefix)
trait="Bmi_C", # Trait column name in phenotype file
phenotype_file="../BMI.txt",
dir_out="../gwas_results/", # Output directory; results saved to {dir_out}/{trait}/
covariates=['Sex', 'Age_Recr'], # Covariates (PCs added automatically)
qc_plots=True, # Generate QC plots
chr="all" # Chromosomes to analyze ("all" or list/int)
)

The pipeline creates the following in {dir_out}/{trait}/:
- `GWAS.ht`: Hail table with full results
- `GWAS.tsv`: Results in GWAS Catalog format
- `plots/manhattan.html`: Manhattan plot
- `plots/qq.html`: QQ plot
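The exported TSV can be post-processed with standard tools. As a sketch, this is how genome-wide significant hits could be filtered from it; the inline rows are made-up stand-ins for the real file, which you would open from {dir_out}/{trait}/GWAS.tsv instead.

```python
import csv
import io

# Tiny illustrative stand-in for GWAS.tsv, using a subset of its columns
tsv = (
    "chromosome\tbase_pair_location\teffect_allele\tother_allele\tbeta\tstandard_error\tp_value\n"
    "1\t123456\tA\tG\t0.10\t0.02\t4.2e-09\n"
    "2\t654321\tC\tT\t0.01\t0.02\t6.1e-01\n"
)
rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))

# Keep hits passing the genome-wide significance threshold (p < 5e-8)
hits = [r for r in rows if float(r["p_value"]) < 5e-8]
print([(r["chromosome"], r["base_pair_location"]) for r in hits])
```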
For study-specific analyses (Scenario B) or to run multiple studies efficiently:
import sys
sys.path.append("../GWAS-EPIC/src")
from functions import run_gwas_pipeline
import glob
import os
# Get study names from FAM files
fam_files = glob.glob("../genetics/*extracted.fam")
studies = [os.path.basename(f).replace("_extracted.fam", "") for f in fam_files]
# Loop over studies and run GWAS
for study in studies:
print(f"Running GWAS for {study}...")
gwas_ht = run_gwas_pipeline(
study=study,
trait="Bmi_C",
phenotype_file="../BMI.txt",
dir_out="../gwas_results/",
covariates=['Sex', 'Age_Recr'],
qc_plots=True,
chr="all"
)

This script automatically detects all available studies and runs a GWAS for each one. For parallel execution, submit this as an array job with each job processing one study.
After running study-specific GWAS (Scenario B), perform meta-analysis using METAL:
import sys
sys.path.append("../GWAS-EPIC/src")
from functions import run_metal_meta_analysis
import os
# Define output directory and studies
dir_out = "../gwas_results/"
studies = [d for d in os.listdir(dir_out)
if os.path.isdir(os.path.join(dir_out, d)) and d != "meta_analysis"]
# Run meta-analysis
meta_results = run_metal_meta_analysis(
studies=studies,
trait="Bmi_C",
dir_out=dir_out,
manhattan=True, # Create Manhattan plot
metal_path="../tools/METAL/build/metal/metal"
)

What the meta-analysis does:
- Converts GWAS Catalog format files to METAL input format
- Runs METAL using inverse-variance weighted meta-analysis
- Annotates results with dbSNP rsIDs
- Exports results in GWAS Catalog format
- Creates Manhattan and QQ plots
Meta-analysis results are saved to {dir_out}/meta_analysis/{trait}/:
- `META.tsv`: Meta-analysis results in GWAS Catalog format
- `METAL.ht`: Hail table with annotated results
- `manhattan.html`: Manhattan plot of the meta-analysis
- `METAL-script.txt`: METAL configuration script
- `input/*.tsv`: Formatted input files for each study
All GWAS results are exported in GWAS Catalog format with the following columns:
| Column | Description |
|---|---|
| `chromosome` | Chromosome identifier |
| `base_pair_location` | Position on chromosome (GRCh37) |
| `effect_allele` | Effect allele (alternative allele) |
| `other_allele` | Other allele (reference allele) |
| `beta` | Effect size estimate |
| `standard_error` | Standard error of beta |
| `effect_allele_frequency` | Frequency of effect allele |
| `p_value` | Association p-value |
| `n` | Sample size |
| `variant_id` | Variant identifier (chr:pos:ref:alt) |
| `rsid` | dbSNP rsID (if available) |
| `ref_allele` | Reference allele |
| `trait` | Trait name |
Meta-analysis results include additional columns:
| Column | Description |
|---|---|
| `direction` | Direction of effect in each study |
| `het_i_squared` | I² heterogeneity statistic |
| `het_p_value` | P-value for heterogeneity test |
- Memory management: The pipeline is optimized for memory efficiency. For very large datasets, consider running chromosomes sequentially rather than all at once.
- Quality control: Always review QC plots before interpreting GWAS results. Check for:
- Appropriate allele frequency distributions
- High variant and sample call rates
- Hardy-Weinberg equilibrium
- No outliers in sample heterozygosity
- Significance threshold: The standard genome-wide significance threshold is p < 5×10⁻⁸.
- Storage: Hail tables (`.ht` files) are the native format and load faster than TSV files for subsequent analyses.
- Matching factors: For Scenario A, ensure you adjust for all matching factors used in the study design.
  - Study-specific matching factors: IN PROGRESS
- Chromosome specification: For testing or quick runs, start with chr=[21, 22] before running all chromosomes.