This R package provides functions to interact with the GWAS Catalog's FTP
server that are (as far as I know) not available in the GWAS Catalog API and,
by extension, the R package {gwasrapidd}. It provides wrapper functions that
call lftp, a command-line file transfer program.
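To make the idea of wrapping lftp more concrete, here is a minimal sketch of how an R function might assemble and run an lftp call. It is not the package's actual implementation: the helper names `build_lftp_script()` and `run_lftp()` and their arguments are hypothetical. The sketch only relies on lftp's `-c` option (run the given commands, then exit), its `open` and `cls -1` commands, and the `ftp:proxy` setting.

```r
# Hypothetical helper (not part of {gwascatftp}): assemble the command
# string that would be passed to `lftp -c`.
build_lftp_script <- function(ftp_url, ftp_proxy = NULL) {
  commands <- c(
    # Optionally route FTP traffic through a proxy
    if (!is.null(ftp_proxy)) paste("set ftp:proxy", ftp_proxy),
    # Connect to the server
    paste("open", ftp_url),
    # List the directory content, one entry per line
    "cls -1"
  )
  paste(commands, collapse = "; ")
}

# Hypothetical runner: calls the lftp binary and returns its output lines.
run_lftp <- function(lftp_bin, script) {
  system2(lftp_bin, args = c("-c", shQuote(script)), stdout = TRUE)
}

script <- build_lftp_script(
  "ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/"
)
print(script)
```

Building the command string separately from running it keeps the string easy to inspect before any network access happens.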
The main goal of {gwascatftp} is to query the GWAS Catalog FTP server with a
user-provided study accession (e.g. GCST009541) and:
- Find all associated files (documentation, meta-data, summary statistics files)
- Identify whether harmonised summary statistics are available
- Download & parse the yaml meta-data file
- Download all available files from the FTP server
- Work from behind an institution's HTTP proxy server (e.g. from a university HPC cluster)
This package aims to cover this basic use case: You read a GWAS paper about
heart failure that makes summary statistics available on the GWAS Catalog
under the study accession GCST90162626. You want to use the full summary
statistics for your own research. In R, {gwasrapidd} can be used to query the
Catalog's API to download association summary statistics with genome-wide
significance, but not the full summary statistics. These are available from the
FTP server, but the links to those files are not available in the data the API
provides. Instead, you have to go to the study website at
https://www.ebi.ac.uk/gwas/studies/GCST90162626, click on the FTP Download
link and then manually copy the links to each file and download them using
wget, curl, or a similar program. Wouldn't it be much easier to do this from
within R, without having to manually search the FTP server in your web browser
for the links to the file? Especially when trying to download data for several
studies, this can be tedious. That is where {gwascatftp} comes in. It provides
a simple interface to access the GWAS Catalog FTP server using only the study
accession from within R to find and download all files associated with a
specific study accession.
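As an illustration of why a study accession alone can be enough to locate files, the sketch below guesses the FTP directory of an accession. It assumes that the server groups accessions into range directories of 1000 (for example `GCST009001-GCST010000`), which is an observation from browsing the server and not something guaranteed by the GWAS Catalog or used in this way by the package. The helper `accession_directory()` is hypothetical.

```r
# Hypothetical helper (not part of {gwascatftp}): guess the FTP directory
# of a study accession, ASSUMING accessions are grouped in ranges of 1000.
accession_directory <- function(
    study_accession,
    ftp_root = "ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/") {
  stopifnot(
    is.character(study_accession),
    length(study_accession) == 1L,
    grepl("^GCST[0-9]+$", study_accession)
  )
  digits <- sub("^GCST", "", study_accession)
  width <- nchar(digits)
  number <- as.integer(digits)
  # Range boundaries, e.g. 9541 falls into 9001-10000
  lower <- ((number - 1L) %/% 1000L) * 1000L + 1L
  upper <- lower + 999L
  # Keep the zero padding of the original accession
  pad <- function(x) formatC(x, width = width, flag = "0", format = "d")
  range_dir <- paste0("GCST", pad(lower), "-GCST", pad(upper))
  paste0(ftp_root, range_dir, "/", study_accession, "/")
}

accession_directory("GCST009541")
accession_directory("GCST90162626")
```

A guessed path like this still needs to be checked against the server, since not every accession has summary statistics available.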
You can install the development version of gwascatftp from GitHub with:

# install.packages("remotes")
remotes::install_github("cfbeuchel/gwascatftp")

This package requires an installation of lftp and the
path of the program's executable file. To get basic installation information
from within the package, call the install_lftp() function. The function will also
check whether lftp can be found in $PATH and will return the binary if found.
library(gwascatftp)
install_lftp()

An easy way to install lftp without root access is to use conda.
See below for the commands to install lftp in a fresh conda environment and
identify the lftp binary's path with whereis. If lftp is not on your system's
$PATH (which it will not be unless you activate the conda environment from
within your R session), you need to provide the full path to the binary when
using {gwascatftp}.
conda create -n lftp -c conda-forge lftp
conda activate lftp
whereis lftp
#> e.g. `~/miniconda3/envs/lftp/bin/lftp`

Before using the package, make sure you have lftp installed and know where the
executable is located. When using conda, this should hopefully work on most
systems.
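If you want to check the location of the binary from within R before touching any package settings, a small base R sketch like the following may help. The helper `find_lftp_bin()` is hypothetical and not part of {gwascatftp}; it only uses `Sys.which()`, `path.expand()` and `file.exists()`. The conda path in the usage lines is an example and will differ between systems.

```r
# Hypothetical helper (not part of {gwascatftp}): find a usable lftp binary.
find_lftp_bin <- function(lftp_bin = NULL) {
  if (is.null(lftp_bin)) {
    # Sys.which() returns "" when the program is not on the PATH
    lftp_bin <- unname(Sys.which("lftp"))
  } else {
    # Expand a leading "~" so that file.exists() sees the real path
    lftp_bin <- path.expand(lftp_bin)
  }
  if (!nzchar(lftp_bin) || !file.exists(lftp_bin)) {
    return(NA_character_)
  }
  lftp_bin
}

# Use the PATH lookup ...
find_lftp_bin()
# ... or point to a conda environment (example path, adjust to your system)
find_lftp_bin("~/miniconda3/envs/lftp/bin/lftp")
```

A result of `NA` means that no binary was found at the given location, in which case the full path has to be worked out manually.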
While the package tries to detect and set the necessary settings at start-up, manual intervention might be necessary. For this, several functions are provided. See below for the recommended workflow for preparing your session.

Check whether automatic setting recognition was successful.

check_lftp_settings(verbose = TRUE)

Settings that were created automatically at startup are returned when calling
create_lftp_settings() without any arguments.

create_lftp_settings()

If check_lftp_settings() returns an error, you will have to manually create a
settings object and then parse it. To do this, supply the necessary arguments to create_lftp_settings(),
save the list in a variable and then use parse_lftp_settings() to complete the setup.
# Create the settings object. Note that `ftp_root` already defaults to the value
# shown here, so supplying it will most likely not be necessary
my_settings <- create_lftp_settings(
  lftp_bin = "/FULL/PATH/TO/lftp", # replace with the full path to your lftp binary
  use_proxy = FALSE,
  ftp_proxy = NULL,
  ftp_root = "ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/"
)
# Transform the settings list to the options
parse_lftp_settings(my_settings, check_settings = TRUE)

You can also automatically parse the settings when creating them.

create_lftp_settings(
  lftp_bin = "/FULL/PATH/TO/lftp", # replace with the full path to your lftp binary
  use_proxy = FALSE,
  ftp_proxy = NULL,
  ftp_root = "ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/",
  parse_settings = TRUE
)

When querying the GWAS Catalog, two large lists are used often. To avoid repeatedly downloading these files, download them once, save them in variables and supply those to the functions that use them.

my_directory_list <- get_directory_list()
my_harmonised_list <- get_harmonised_list()

Now we can query the GWAS Catalog FTP server using study accessions. The most
high-level function is download_all_accession_data(), which, given a GWAS
Catalog study accession, tries to download all available files for that
accession and returns the parsed meta data when available.

# Supply a single accession you want to download the data from
my_study_accession <- "GCST009541"
# Call the function with all the necessary data, including the pre-downloaded lists
download_all_accession_data(
  study_accession = my_study_accession,
  harmonised_list = my_harmonised_list,
  directory_list = my_directory_list,
  download_directory = tempdir(), # Set your target directory here!
  create_accession_directory = TRUE,
  overwrite_existing_files = FALSE,
  return_meta_data = TRUE
)

The second high-level function, download_multiple_accession_meta_data(),
accepts a vector of study accessions and returns a data.table containing all
meta data found in each accession's [...]-meta.yaml file, if available.

# Prepare a vector of study accessions
my_multiple_study_accessions <- c("GCST009541", "GCST90204201")
# Call the function including the necessary lists to download the meta data
my_meta_data <- download_multiple_accession_meta_data(
  study_accessions = my_multiple_study_accessions,
  harmonised_list = my_harmonised_list,
  directory_list = my_directory_list
)

- The far more extensive package to query the GWAS Catalog API is {gwasrapidd}: https://github.com/ramiromagno/gwasrapidd/
- The package {MungeSumstats} provides an interface to the MRC IEU openGWAS Project: https://github.com/neurogenomics/MungeSumstats
- The GWAS Catalog has extensive documentation at https://www.ebi.ac.uk/gwas/docs
- Browse the GWAS Catalog FTP Server at https://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/

- Tests!
- Actions for check()
- Cache harmonised_list and directory_list, maybe using R.cache?
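
The caching idea in the last item could look roughly like the sketch below. It uses base R's `saveRDS()` and `readRDS()` instead of R.cache, and the helper `cached_value()` as well as the stand-in function `slow_listing()` are hypothetical; nothing here is implemented in the package.

```r
# Hypothetical helper (not part of {gwascatftp}): compute a value once and
# reuse the copy stored on disk in later calls.
cached_value <- function(cache_file, compute, refresh = FALSE) {
  if (!refresh && file.exists(cache_file)) {
    return(readRDS(cache_file))
  }
  value <- compute()
  saveRDS(value, cache_file)
  value
}

# Stand-in for an expensive download such as get_directory_list()
slow_listing <- function() {
  c("GCST009001-GCST010000", "GCST90162001-GCST90163000")
}

cache_file <- file.path(tempdir(), "directory_list.rds")
first <- cached_value(cache_file, slow_listing)   # computes and stores
second <- cached_value(cache_file, slow_listing)  # read from the cache
identical(first, second)
```

Because the server content changes over time, a `refresh` switch (or an expiry date on the cache file) would be needed to avoid working with an outdated list.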