
KatherineWasmer/SpectrumPRS


Polygenic Risk Scoring for ASD

My independent study examined the predictive power of polygenic risk scoring for autism, using 119 cases and 201 controls.

⚠️ Project Status: This repository is under active development to ensure proper documentation and reproducibility of the entire machine learning pipeline. The analyses and results for the project can be found in the Final Report.

Repository Structure

SpectrumPRS/
│
├── data/
│   ├── hgdp_ref
│   └── study_samples.md
├── notebooks/
│   └── exploratory_PRS.ipynb
├── scripts/
│   ├── getHG19info.sh
│   ├── installDependencies.sh
│   ├── liftover.sh
│   ├── sortByBP.sh
│   └── getAutosomes.sh
├── src/
│   ├── preprocessing
│   ├── features
│   ├── models
│   └── evaluation
├── results/     # model outputs
└── README.md

data/hgdp_ref: Contains a comparative reference panel from the Human Genome Diversity Project, which was used to help estimate the admixture for the American samples.

data/study_samples.md: Contains information on the samples in the study.

notebooks/exploratory_PRS.ipynb: Before exploring the different methods for polygenic risk scoring in autism, I ran a sample experiment on the Michigan Imputation Server (v2) with a random set of samples from the HGDP. This shows the expected distribution of scores for a random sample, but the samples did not include phenotypes, so predictive power could not be evaluated.

scripts/: Contains bash scripts for quality control. Detailed descriptions of each QC step can be found in the directory README.

scripts/getHG19Info.sh: Transfers your downloaded HG19.fa file and Hg38ToHg19.over.chain file to the scripts directory so that the liftover script can run.

scripts/installDependencies.sh: Installs CrossMap, bcftools, and htslib on your local machine. If you are using the Great Lakes HPC, bcftools and htslib are already installed, so you do not need to run this script.
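Before running the QC scripts, it may help to confirm that the tools are actually on your PATH. The sketch below is not part of the repository: the function name `check_tools` is hypothetical, and the executable names are assumptions about how the tools were installed (some CrossMap releases install the command as `CrossMap.py`; `bgzip` and `tabix` are the command-line programs that ship with htslib).

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which of the named commands are on PATH.
# Returns the number of missing commands (0 means everything was found).
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "missing: $tool"
      missing=$((missing + 1))
    fi
  done
  return "$missing"
}

# Assumed executable names; adjust them to match your installation.
check_tools CrossMap bcftools bgzip tabix \
  || echo "Some tools are missing; consider running scripts/installDependencies.sh"
```

If every tool is reported as found, installDependencies.sh can likely be skipped.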

scripts/liftover.sh: Lifts a VCF file over from the HG38 build to HG19 (more commonly used in genomic studies).

scripts/sortByBP.sh: Sorts a VCF file by ascending base-pair position within each chromosome.
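To illustrate what sorting by base-pair position means, here is a minimal sketch on a toy VCF using only standard shell tools. It is not the contents of sortByBP.sh. The file names, the variant records, and the choice of `grep` plus `sort` are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Build a tiny, made-up VCF with records deliberately out of order.
workdir="$(mktemp -d)"
vcf="$workdir/toy.vcf"
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '2\t500\trs3\tA\tG\t.\t.\t.\n'
  printf '1\t900\trs2\tC\tT\t.\t.\t.\n'
  printf '1\t100\trs1\tG\tA\t.\t.\t.\n'
} > "$vcf"

# Keep header lines first, then sort records by chromosome (column 1)
# and numerically by position (column 2).
sorted="$workdir/toy.sorted.vcf"
{
  grep '^#' "$vcf"
  grep -v '^#' "$vcf" | sort -t "$(printf '\t')" -k1,1 -k2,2n
} > "$sorted"

cat "$sorted"
```

Note that the chromosome column is compared as text here, so chromosome 10 would sort before chromosome 2. On real data, `bcftools sort` is the more usual choice.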

scripts/getAutosomes.sh: Reduces the VCF file to chromosomes 1-22, excluding the sex chromosomes. This may be useful if you only want to study autosomal DNA for your project.
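The filtering idea can be sketched with `awk` on a toy VCF. This is an illustrative example under stated assumptions, not the contents of getAutosomes.sh: the file names and records are made up, and the sketch assumes chromosome names are either bare numbers or carry a `chr` prefix.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Build a tiny, made-up VCF containing autosomal and non-autosomal records.
workdir="$(mktemp -d)"
vcf="$workdir/toy.vcf"
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '1\t100\trs1\tG\tA\t.\t.\t.\n'
  printf 'chr22\t200\trs2\tC\tT\t.\t.\t.\n'
  printf 'X\t300\trs3\tA\tG\t.\t.\t.\n'
  printf 'chrY\t400\trs4\tT\tC\t.\t.\t.\n'
  printf 'MT\t500\trs5\tG\tC\t.\t.\t.\n'
} > "$vcf"

# Pass header lines through; keep a record only if its chromosome,
# after stripping an optional "chr" prefix, is an integer from 1 to 22.
autosomes="$workdir/toy.autosomes.vcf"
awk -F'\t' '
  /^#/ { print; next }
  {
    chrom = $1
    sub(/^chr/, "", chrom)
    if (chrom ~ /^[0-9]+$/ && chrom + 0 >= 1 && chrom + 0 <= 22) print
  }
' "$vcf" > "$autosomes"

cat "$autosomes"
```

Adding `0` to `chrom` forces a numeric comparison; without it, awk would compare the values as strings and drop chromosomes 3 through 9.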

Installation/Usage

1. Clone the GitHub repository in your terminal and navigate to the SpectrumPRS directory.

git clone https://github.com/KatherineWasmer/SpectrumPRS
cd SpectrumPRS 

2. To download one or more genotype files, navigate to data -> study_samples.md and click the hyperlink of your choice. Unless you are working on an HPC cluster, downloading an entire TAR file (typically > 10 GB) is highly discouraged.

Example for downloading a single file (A102092, from LaSalle):

i. Go to data -> study_samples.md -> click the hyperlinked GEO accession ID for the American data set

ii. On the NCBI webpage, click on the hyperlinked GSM ID next to your sample. This will direct you to a page for downloading SNP data.

iii. Underneath the supplementary files section, navigate to the row with the file "GSM5381820_A102092_SNP.vcf.gz". Click on the http hyperlink to download the VCF file.

3. Add the downloaded files to the data folder of your cloned repository. Since even a single genotype file exceeds GitHub's file upload limit, the files are not included in the repository and you will need to add them manually. Run these commands to move a downloaded file into the data directory.

# using the example from step 2 
cd data
mv ~/Downloads/GSM5381820_A102092_SNP.vcf.gz .
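After moving a file, a quick integrity check can catch a truncated download. The sketch below is an optional addition and not part of the repository. The function name `looks_like_vcf_gz` is hypothetical, and the demonstration runs on a small stand-in file rather than real genotype data. It relies on two facts: `gzip -t` tests whether a file is intact gzip data, and a VCF file begins with a `##fileformat=VCF...` line.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper: succeeds only if the file is intact gzip data
# whose first decompressed line is a VCF fileformat header.
looks_like_vcf_gz() {
  gzip -t "$1" 2>/dev/null || return 1
  first_line="$(gzip -dc "$1" | head -n 1)"
  case "$first_line" in
    '##fileformat=VCF'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Demonstration on a made-up stand-in file (not real study data).
demo_dir="$(mktemp -d)"
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\n' | gzip -c > "$demo_dir/toy.vcf.gz"

if looks_like_vcf_gz "$demo_dir/toy.vcf.gz"; then
  echo "toy.vcf.gz looks like a gzipped VCF"
else
  echo "toy.vcf.gz failed the check"
fi
```

For the example from step 2, the call would be `looks_like_vcf_gz GSM5381820_A102092_SNP.vcf.gz`, run from inside the data directory.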

About

For my MDS capstone, I conducted an independent study on the polygenic risk scoring for autism.
