My independent study examined the predictive power of polygenic risk scoring for autism, using 119 cases and 201 controls.
SpectrumPRS/
│
├── data/
│   ├── hgdp_ref
│   └── study_samples.md
├── notebooks/
│   └── exploratory_PRS.ipynb
├── scripts/
│   ├── getHG19info.sh
│   ├── installDependencies.sh
│   ├── liftover.sh
│   ├── sortByBP.sh
│   └── getAutosomes.sh
├── src/
│   ├── preprocessing
│   ├── features
│   ├── models
│   └── evaluation
├── results/ # model outputs
└── README.md
data/hgdp_ref: Contains a comparative reference panel from the Human Genome Diversity Project, which was used to help estimate the admixture for the American samples.
data/study_samples.md: Contains information on the samples in the study.
notebooks/exploratory_PRS.ipynb: Before exploring the different methods for polygenic risk scoring in autism, I ran a sample experiment on the Michigan Imputation Server (v2) with a random set of HGDP samples. This shows the expected distribution of scores for a random sample, but the samples did not include phenotypes, so predictive power could not be evaluated.
scripts/: Contains bash scripts for quality control. Detailed descriptions of each QC step can be found in the directory README.
scripts/getHG19Info.sh: Moves your downloaded HG19.fa and Hg38ToHg19.over.chain files into the scripts directory so that the liftover script can find them.
scripts/installDependencies.sh: Installs CrossMap, bcftools, and htslib on your local machine. If you are using the Great Lakes HPC, bcftools and htslib are already installed, so you do not need to run this script.
scripts/liftover.sh: Lifts a VCF file over from the HG38 build to HG19 (more commonly used in genomic studies).
scripts/sortByBP.sh: Sorts a VCF file by ascending base-pair position within each chromosome.
scripts/getAutosomes.sh: Reduces the VCF file to chromosomes 1-22, excluding the sex chromosomes. This may be useful if you only want to study autosomal DNA for your project.
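To make the last two steps concrete, here is a minimal sketch of autosome filtering and position sorting using only awk, grep, and sort. It is not the code in sortByBP.sh or getAutosomes.sh; the function names `keep_autosomes` and `sort_by_bp` are hypothetical, and the sketch assumes an uncompressed, tab-delimited VCF and a `sort` that supports version ordering (`-V`, as in GNU coreutils).

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not the repository's scripts.
# Both expect an uncompressed, tab-delimited VCF file as their only argument.

# Keep header lines plus records on chromosomes 1-22 (a "chr" prefix is allowed).
keep_autosomes() {
  awk -F'\t' '
    /^#/ { print; next }
    {
      chrom = $1
      sub(/^chr/, "", chrom)
      if (chrom ~ /^[0-9]+$/ && chrom + 0 >= 1 && chrom + 0 <= 22) print
    }
  ' "$1"
}

# Print the header unchanged, then the records ordered by chromosome
# (version sort, so 2 comes before 10) and by numeric position.
sort_by_bp() {
  grep '^#' "$1"
  grep -v '^#' "$1" | sort -t "$(printf '\t')" -k1,1V -k2,2n
}

# Example usage (file names are placeholders):
# keep_autosomes sample.vcf > sample.autosomes.vcf
# sort_by_bp sample.autosomes.vcf > sample.sorted.vcf
```

For real data, a dedicated tool such as bcftools is usually the safer choice, since it also handles compressed input.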
1. Clone the repository and move into it.
git clone https://github.com/KatherineWasmer/SpectrumPRS
cd SpectrumPRS
2. To download one or more genotype files, navigate to data -> study_samples.md and click the hyperlink of your choice. Unless you are working on an HPC cluster, downloading an entire TAR file (typically > 10 GB) is highly discouraged.
Example for downloading a single file (A102092, from LaSalle):
i. Go to data -> study_samples.md -> click the hyperlinked GEO accession ID for the American data set
ii. On the NCBI webpage, click on the hyperlinked GSM ID next to your sample. This will direct you to a page for downloading SNP data.
iii. In the supplementary files section, find the row with the file "GSM5381820_A102092_SNP.vcf.gz". Click on the http hyperlink to download the VCF file.
3. Add downloaded files to the data folder of your cloned repository. Since even a single genotype file exceeds GitHub's file upload limit, the files are not included in the repository and you will need to add them manually. Run these commands to move a downloaded file into the data directory.
# using the example from step 2
cd data
mv ~/Downloads/GSM5381820_A102092_SNP.vcf.gz .
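Large downloads are sometimes truncated, so it can help to check the file before moving it. The following is a minimal sketch, not part of the repository: `stage_vcf` is a hypothetical helper that tests the gzip stream with `gzip -t` and only then moves the file into the target directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration; not one of the repository's scripts.
# Usage: stage_vcf <downloaded .vcf.gz> <destination directory>
stage_vcf() {
  local src="$1" dest_dir="$2"
  if [ ! -f "$src" ]; then
    echo "not found: $src" >&2
    return 1
  fi
  # gzip -t checks the integrity of the compressed stream without extracting it.
  if ! gzip -t "$src" 2>/dev/null; then
    echo "corrupt or incomplete download: $src" >&2
    return 1
  fi
  mkdir -p "$dest_dir"
  mv "$src" "$dest_dir/"
}

# From the repository root, using the example from step 2:
# stage_vcf ~/Downloads/GSM5381820_A102092_SNP.vcf.gz data
```

If the check fails, the file stays where it is and can be downloaded again.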