IARCBiostat/GWAS-EPIC
GWAS Pipeline for EPIC

This repository contains a pipeline for performing genome-wide association studies using genetic data from the European Prospective Investigation into Cancer and Nutrition (EPIC) cohort.

What is a GWAS?

A Genome-Wide Association Study (GWAS) is a systematic approach to identify genetic variants associated with a particular trait or disease. GWAS scan the entire genome of many individuals to find single nucleotide polymorphisms (SNPs) that occur more frequently in people with a particular trait compared to those without it. The typical GWAS workflow involves:

  1. Quality control of genetic data and phenotypes
  2. Association testing between genetic variants and the trait of interest
  3. Visualization of results through Manhattan and QQ plots
  4. Meta-analysis of results across multiple studies (when applicable)

GWAS results help researchers understand the genetic architecture of complex traits and diseases, identify biological pathways involved in disease aetiology, and potentially discover new therapeutic targets.

Genetic data

EPIC is a large prospective cohort study. Since its inception, extensive follow-up data collection has supported many nested case-control and case-cohort studies, including molecular data collection. Among these are multiple nested case-control studies in which blood samples were used for genome-wide genotyping. These data represent ~10% of the total EPIC population and ~16,000 incident cancers as of 2025.

Imputation was performed using the Michigan Imputation Server with the 1000 Genomes Project Phase 3 (version 5) reference panel (build GRCh37/hg19). Post-imputation quality control of the genotype data was performed sequentially:

  1. HapMap3 SNPs were extracted, filtered for missingness (--geno 0.02) and Hardy-Weinberg equilibrium (--hwe 1e-7), and pruned to remove high linkage disequilibrium (LD) regions.
  2. Identity-by-state (IBS) metrics were computed to identify related individuals; samples with IBS > 0.25 were removed to reduce cryptic relatedness.
  3. The study dataset was aligned with the 1000 Genomes reference by extracting overlapping SNPs, merging the datasets, and resolving merge errors where necessary.
  4. Principal component analysis (PCA) was run on the combined dataset (study + reference) to capture population structure, assign super-populations based on the reference labels (EUR, AFR, EAS, SAS, AMR), and identify and exclude samples that fell outside the expected clusters in PCA space.

At this stage, 10,791 samples were present in ≥2 studies; to allow meta-analysis or pooling, each overlapping sample was retained in the smallest study in which it appeared and excluded from the larger ones. Throughout this README we use body mass index (BMI) as an example trait.

  • Format: PLINK (.bed, .bim, .fam)
  • Location: ../genetics/combined.bed (restricted access)
  • Build: GRCh37 (hg19)

How to run a GWAS in EPIC

We use Hail in this pipeline because it supports sample weighting while remaining flexible and efficient for large datasets. Hail is an open-source Python library designed specifically for analyzing large-scale genomic data. Developed at the Broad Institute, Hail provides:

  • Scalable computing: Built on Apache Spark, Hail can process datasets with millions of samples and billions of genetic variants
  • Efficient storage: Uses optimized data structures (MatrixTables and Tables) that are much faster than traditional text files
  • Built-in GWAS tools: Includes pre-implemented statistical methods for association testing, quality control, and visualization
  • Memory efficiency: Processes data in chunks rather than loading everything into memory at once

In this pipeline, Hail handles all genotype data operations, performs linear regression for association testing, and generates quality control metrics and plots. This pipeline uses linear regression to test for associations between genetic variants and quantitative traits. For each variant, the model is:

Y = β₀ + β₁X + β₂C₁ + β₃C₂ + ... + ε

Where Y = phenotype (e.g., BMI), X = genotype dosage (0, 1, or 2 copies of the effect allele), C₁, C₂, ... = covariates (age, sex, principal components), β₁ = effect size (the main parameter of interest), ε = error term.
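As a minimal illustration of this per-variant model, here is a pure-Python sketch for a single variant with no covariates (the function name and toy data are invented for illustration; in practice Hail fits the full model, covariates included, for every variant at once):

```python
import math

def single_variant_ols(dosages, phenotypes):
    """Simple linear regression of phenotype on genotype dosage.

    Returns (beta, standard error) for the dosage term -- a toy
    stand-in for the per-variant estimates Hail computes.
    """
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, phenotypes))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_x
    # Residual variance with n - 2 degrees of freedom (intercept + slope)
    rss = sum((y - intercept - beta * x) ** 2 for x, y in zip(dosages, phenotypes))
    se = math.sqrt(rss / (n - 2) / sxx)
    return beta, se

# Dosages are 0/1/2 copies of the effect allele; phenotypes are e.g. BMI values
beta, se = single_variant_ols([0, 1, 2, 0, 1, 2],
                              [23.0, 24.1, 24.9, 22.9, 24.0, 25.1])
```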

The pipeline uses Hail's linear_regression_rows() function, which:

  1. Tests each variant independently
  2. Adjusts for specified covariates and principal components
  3. Estimates the effect size (beta), standard error, and p-value for each variant
  4. Handles missing genotypes appropriately

The pipeline automatically includes 10 principal components (PCs) from the ancestry analysis as covariates. PCs control for population stratification: systematic differences in allele frequencies between subgroups that could cause spurious associations. When combining results from multiple studies (Scenario B), the pipeline uses METAL, a widely used tool for meta-analysis of genome-wide association scans, which implements inverse-variance weighted meta-analysis:

β_meta = Σ(w_i × β_i) / Σ(w_i)

Where β_i is the effect estimate from study i and w_i = 1/SE_i² is the inverse of its squared standard error; the pooled standard error is SE_meta = √(1/Σw_i). This method weights each study by the precision of its effect estimate, assumes a fixed-effect model (the same true effect across studies), and provides heterogeneity statistics (I² and a heterogeneity p-value) to assess consistency across studies.
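The inverse-variance weighting above can be sketched in a few lines of plain Python (an illustrative stand-in only; the pipeline itself delegates this step to METAL):

```python
import math

def inverse_variance_meta(betas, ses):
    """Fixed-effect inverse-variance weighted meta-analysis.

    betas: per-study effect estimates; ses: their standard errors.
    Returns the pooled effect and its standard error.
    """
    weights = [1.0 / se ** 2 for se in ses]
    beta_meta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    se_meta = math.sqrt(1.0 / sum(weights))
    return beta_meta, se_meta

# Two studies with equal precision: the pooled estimate is their mean,
# and the pooled SE shrinks relative to either study alone
beta_meta, se_meta = inverse_variance_meta([0.10, 0.20], [0.05, 0.05])
```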

Requirements

You'll need to set up a Hail environment. Run the following commands:

# Create a new environment
conda create -n hail_env python=3.10 -y

# Activate it
conda activate hail_env

# Install system BLAS/LAPACK and plotting dependencies
conda install -c conda-forge openblas numpy scipy matplotlib phantomjs selenium -y
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
export JAVA_OPTS="-Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.NativeRefBLAS"

# Install Hail
pip install hail

# Verify installation
python -c "import hail; print(hail.__version__)"

Before running a GWAS, you need:

  • Phenotype file: A tab-separated file containing:
    • Sample IDs (column name: Idepic)
    • Your trait of interest (e.g., Bmi_C)
    • Covariates (e.g., Age_Recr, Sex)
  • Study IDs: A list of study names to include in your analysis
  • Genotype data and PCs are already available and built into the pipeline

Determining Your Analysis Scenario

EPIC GWAS analyses fall into three scenarios:

  • Scenario A: Binary traits used to construct case-control studies (e.g., colorectal cancer, breast cancer)
    • Studies will be pooled into a single GWAS
    • Must adjust for matching factors from the trait-specific study
  • Scenario B: Quantitative traits used in sampling or matching (e.g., BMI if used in case-control selection)
    • Study-specific GWAS will be performed with inverse-probability weighting (IPW)
    • An unweighted sensitivity analysis will also be run
    • Results will be meta-analyzed across studies
  • Scenario C: Quantitative traits not used in sampling (e.g., height, serum biomarkers)
    • Studies will be pooled into a single GWAS
    • No study-specific covariates or weighting required

Preparing Covariate Files

Use the R function process_covars() from functions.R to create covariate files:

# Load required packages
library(data.table)
library(dplyr)

# Source the functions
source("functions.R")

# Define studies to include
keep_studies <- data.frame(id = c("Clrt_01_Gecco", "Brea_02_Onco"))

# Load phenotype file
phenofile <- fread("../BMI.txt")

# Process covariates
process_covars(
  study_ids = keep_studies$id,
  phenofile = phenofile,
  out_dir = "../",
  id_col = "Idepic",
  trait = "Bmi_C",
  covariates = c("Age_Recr", "Sex")
)

This will create covariate files for each study in {out_dir}/covars/ that include:

  • Sample IDs matched to genotype data
  • Specified covariates
  • 10 principal components from PCA
  • A .nosex.covar version without the Sex column (if applicable)

Running a Single GWAS

For a single study or pooled analysis:

import sys
sys.path.append("../GWAS-EPIC/src")
from functions import run_gwas_pipeline

gwas_ht = run_gwas_pipeline(
    study="Clrt_01_Gecco", # Study name (matches PLINK file prefix)
    trait="Bmi_C", # Trait column name in phenotype file
    phenotype_file="../BMI.txt",
    dir_out="../gwas_results/", # Output directory; results saved to {dir_out}/{trait}/
    covariates=['Sex', 'Age_Recr'], # Covariates (PCs added automatically)
    qc_plots=True, # Generate QC plots
    chr="all" # Chromosomes to analyze ("all" or list/int)
)

The pipeline creates the following in {dir_out}/{trait}/:

  • GWAS.ht: Hail table with full results
  • GWAS.tsv: Results in GWAS Catalog format
  • plots/manhattan.html: Manhattan plot
  • plots/qq.html: QQ plot

Running Multiple GWAS in Parallel

For study-specific analyses (Scenario B) or to run multiple studies efficiently:

import sys
sys.path.append("../GWAS-EPIC/src")
from functions import run_gwas_pipeline
import glob
import os

# Get study names from FAM files
fam_files = glob.glob("../genetics/*extracted.fam")
studies = [os.path.basename(f).replace("_extracted.fam", "") for f in fam_files]

# Loop over studies and run GWAS
for study in studies:
    print(f"Running GWAS for {study}...")
    gwas_ht = run_gwas_pipeline(
        study=study,
        trait="Bmi_C",
        phenotype_file="../BMI.txt",
        dir_out="../gwas_results/",
        covariates=['Sex', 'Age_Recr'],
        qc_plots=True,
        chr="all"
    )

This script automatically detects all available studies and runs GWAS for each one. For parallel execution, submit this as an array job with each job processing one study.
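One common array-job pattern (assuming a SLURM-style scheduler; the environment variable name is scheduler-specific) is to index the study list by the task ID so each job runs a single GWAS:

```python
import os

def pick_study(studies, env_var="SLURM_ARRAY_TASK_ID"):
    """Select one study per array task so each job runs a single GWAS."""
    idx = int(os.environ.get(env_var, 0))
    return studies[idx]

# In a real submission the scheduler sets this variable; the explicit
# assignment here is only for illustration.
os.environ["SLURM_ARRAY_TASK_ID"] = "1"
study = pick_study(["Clrt_01_Gecco", "Brea_02_Onco"])
```

Each array task then calls `run_gwas_pipeline(study=study, ...)` exactly as in the loop above.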

Meta-Analysis

After running study-specific GWAS (Scenario B), perform meta-analysis using METAL:

import sys
sys.path.append("../GWAS-EPIC/src")
from functions import run_metal_meta_analysis
import os

# Define output directory and studies
dir_out = "../gwas_results/"
studies = [d for d in os.listdir(dir_out) 
           if os.path.isdir(os.path.join(dir_out, d)) and d != "meta_analysis"]

# Run meta-analysis
meta_results = run_metal_meta_analysis(
    studies=studies,
    trait="Bmi_C",
    dir_out=dir_out,
    manhattan=True,  # Create Manhattan plot
    metal_path="../tools/METAL/build/metal/metal"
)

What the meta-analysis does:

  • Converts GWAS Catalog format files to METAL input format
  • Runs METAL using inverse-variance weighted meta-analysis
  • Annotates results with dbSNP rsIDs
  • Exports results in GWAS Catalog format
  • Creates Manhattan and QQ plots

Meta-analysis results are saved to {dir_out}/meta_analysis/{trait}/:

  • META.tsv: Meta-analysis results in GWAS Catalog format
  • METAL.ht: Hail table with annotated results
  • manhattan.html: Manhattan plot of meta-analysis
  • METAL-script.txt: METAL configuration script
  • input/*.tsv: Formatted input files for each study

Output

All GWAS results are exported in GWAS Catalog format with the following columns:

Column                   Description
chromosome               Chromosome identifier
base_pair_location       Position on chromosome (GRCh37)
effect_allele            Effect allele (alternative allele)
other_allele             Other allele (reference allele)
beta                     Effect size estimate
standard_error           Standard error of beta
effect_allele_frequency  Frequency of the effect allele
p_value                  Association p-value
n                        Sample size
variant_id               Variant identifier (chr:pos:ref:alt)
rsid                     dbSNP rsID (if available)
ref_allele               Reference allele
trait                    Trait name

Meta-analysis results include additional columns:

Column         Description
direction      Direction of effect in each study
het_i_squared  I² heterogeneity statistic
het_p_value    P-value for the heterogeneity test

Tips and Best Practices

  1. Memory management: The pipeline is optimized for memory efficiency.
    • For very large datasets, consider running chromosomes sequentially rather than all at once.
  2. Quality control: Always review QC plots before interpreting GWAS results. Check for:
    • Appropriate allele frequency distributions
    • High variant and sample call rates
    • Hardy-Weinberg equilibrium
    • No outliers in sample heterozygosity
  3. Significance threshold: The standard genome-wide significance threshold is p < 5×10⁻⁸.
  4. Storage: Hail tables (.ht files) are the native format and load faster than TSV files for subsequent analyses.
  5. Matching factors: For Scenario A, ensure you adjust for all matching factors used in the study design.
    • Study-specific matching factors - IN PROGRESS
  6. Chromosome specification: For testing or quick runs, start with chr=[21, 22] before running all chromosomes.
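The QQ-plot check in tip 2 can be summarized numerically by the genomic inflation factor λGC. A standard-library-only sketch (the function name is our own; values near 1.0 indicate little residual stratification):

```python
import statistics
from statistics import NormalDist

def genomic_inflation(p_values):
    """Genomic inflation factor lambda_GC from GWAS p-values.

    Converts each p-value to a 1-df chi-square statistic via the normal
    quantile, then divides the observed median by the expected median of
    a 1-df chi-square distribution under the null (~0.4549).
    """
    nd = NormalDist()
    chi2 = [nd.inv_cdf(1 - p / 2) ** 2 for p in p_values]
    return statistics.median(chi2) / 0.4549364231195724

# Under the null, p-values are uniform and lambda_GC is close to 1
lam = genomic_inflation([(i + 0.5) / 1000 for i in range(1000)])
```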
