`gwascatftp` - Access the GWAS Catalog FTP server from R

This R package provides functions to interact with the GWAS Catalog's FTP server that are (as far as I know) not available in the GWAS Catalog API and, by extension, the R package {gwasrapidd}. It provides wrapper functions that call lftp, a command-line file transfer program.

The main goal of {gwascatftp} is to query the GWAS Catalog FTP server with a user-provided study accession (e.g. GCST009541) and:

Find all associated files (documentation, meta-data, summary statistics files)
Identify whether harmonised summary statistics are available
Download and parse the YAML meta-data file
Download all available files from the FTP server
Work from behind an institution's HTTP proxy server (e.g. from a university HPC cluster)
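As a rough illustration of what wrapping lftp from R can look like, the sketch below builds the arguments for a non-interactive lftp call that lists one directory, optionally through a proxy. The helper name `build_lftp_list_args()` and the proxy address are hypothetical and not part of {gwascatftp}; the package's actual implementation may differ.

```r
# Hypothetical helper (not part of {gwascatftp}): build the arguments for a
# non-interactive lftp call that lists the contents of one FTP directory.
build_lftp_list_args <- function(url, ftp_proxy = NULL) {
  commands <- character(0)
  if (!is.null(ftp_proxy)) {
    # `ftp:proxy` is the lftp setting that routes FTP traffic through a proxy
    commands <- c(commands, paste("set ftp:proxy", ftp_proxy))
  }
  # Open the directory, then print one entry per line
  commands <- c(commands, paste("open", url), "cls -1")
  # `lftp -c` runs the given commands and exits
  c("-c", paste(commands, collapse = "; "))
}

direct_args <- build_lftp_list_args(
  url = "ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/"
)

# The proxy address below is a made-up placeholder
proxy_args <- build_lftp_list_args(
  url = "ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/",
  ftp_proxy = "http://proxy.example.org:3128"
)

# To actually run the command (requires lftp and network access):
# listing <- system2("lftp", args = shQuote(direct_args), stdout = TRUE)
```

Building the argument vector separately from running it makes the command easy to inspect before anything touches the network.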

Use Case

This package aims to cover a basic use case: You read a GWAS paper about heart failure that makes summary statistics available on the GWAS Catalog under the study accession GCST90162626, and you want to use the full summary statistics for your own research. In R, {gwasrapidd} can query the Catalog's API to download the associations that reach genome-wide significance, but not the full summary statistics. These are available from the FTP server, but the API does not return links to those files. Instead, you have to go to the study page at https://www.ebi.ac.uk/gwas/studies/GCST90162626, click on the FTP Download link, copy the link to each file manually, and download the files using wget, curl, or a similar program.

It would be much easier to do this from within R, without searching the FTP server in a web browser for file links, especially when downloading data for several studies. That is where {gwascatftp} comes in: it provides a simple interface to the GWAS Catalog FTP server that finds and downloads all files associated with a study, using only the study accession.
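To give an idea of why finding the files by hand is tedious, the sketch below derives the FTP directory for an accession. It assumes that the server groups accessions into range directories of 1000 studies each (for example `GCST009001-GCST010000`); that layout is an assumption here and should be checked against the server. The helpers `accession_bucket()` and `accession_url()` are hypothetical and not part of {gwascatftp}.

```r
# Hypothetical helpers (not part of {gwascatftp}).
# Assumption: accessions are grouped in range directories of 1000 studies.
accession_bucket <- function(study_accession) {
  stopifnot(grepl("^GCST[0-9]+$", study_accession))
  digits <- sub("^GCST", "", study_accession)
  width <- nchar(digits)                  # keep the zero padding of the input
  n <- as.integer(digits)
  lower <- ((n - 1L) %/% 1000L) * 1000L + 1L
  upper <- lower + 999L
  pad <- function(x) formatC(x, width = width, flag = "0")
  paste0("GCST", pad(lower), "-GCST", pad(upper))
}

accession_url <- function(
    study_accession,
    ftp_root = "ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/") {
  paste0(ftp_root, accession_bucket(study_accession), "/", study_accession, "/")
}

accession_bucket("GCST009541")
#> [1] "GCST009001-GCST010000"
accession_bucket("GCST90162626")
#> [1] "GCST90162001-GCST90163000"
```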

Installation

You can install the development version of gwascatftp from GitHub with:

# install.packages("remotes")
remotes::install_github("cfbeuchel/gwascatftp")

Prerequisites

This package requires an installation of lftp and the path of the program's executable file. To get basic installation information from within the package, call the install_lftp() function. The function will also check whether lftp can be found in $PATH and will return the binary if found.

library(gwascatftp)
install_lftp()

An easy way to install lftp without root access is to use conda. See below for the commands to install lftp in a fresh conda environment and identify the lftp binary's path with whereis. If you do not have lftp installed in your system's $PATH (which it is not unless you activate the environment from within your R session), you need to provide the full path when using {gwascatftp}.

conda create -n lftp -c conda-forge lftp
conda activate lftp
whereis lftp
#> e.g. `~/miniconda3/envs/lftp/bin/lftp`
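The lookup described above can also be done from within R. The following is a minimal sketch with a hypothetical helper, `locate_lftp()`, that prefers an explicitly supplied path and otherwise falls back to `$PATH`; it is not how {gwascatftp} itself locates the binary.

```r
# Hypothetical helper (not part of {gwascatftp}): find the lftp executable.
# An explicitly supplied path wins; otherwise $PATH is searched.
locate_lftp <- function(explicit_path = NULL) {
  if (!is.null(explicit_path) && file.exists(path.expand(explicit_path))) {
    return(normalizePath(path.expand(explicit_path)))
  }
  # Sys.which() returns "" when the program is not found in $PATH
  from_path <- unname(Sys.which("lftp"))
  if (nzchar(from_path)) from_path else NA_character_
}

# Example: try a conda environment first, then fall back to $PATH
lftp_bin <- locate_lftp("~/miniconda3/envs/lftp/bin/lftp")
```

If the result is `NA`, lftp was found neither at the given location nor in `$PATH`, and the full path has to be supplied manually.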

Getting Started

Before using the package, make sure you have lftp installed and know where the executable is located. When using conda, this should hopefully work on most systems.

While the package tries to detect and set necessary settings at start-up, manual intervention might be necessary. For this, several functions are provided. See below for the recommended workflow for preparing your session.

Check whether automatic setting recognition was successful.

check_lftp_settings(verbose = TRUE)

Settings that were created automatically at startup are returned when calling create_lftp_settings() without any arguments.

create_lftp_settings()

If check_lftp_settings() returns an error, you have to create a settings object manually and then parse it. To do this, supply the necessary arguments to create_lftp_settings(), save the resulting list in a variable, and then use parse_lftp_settings() to complete the setup.

# Create the settings object. `ftp_root` already defaults to the value shown here,
# so supplying it will most likely not be necessary
my_settings <- create_lftp_settings(
  lftp_bin = "/FULL/PATH/TO/lftp", # replace with the path to your lftp executable
  use_proxy = FALSE,
  ftp_proxy = NULL,
  ftp_root = "ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/"
)

# Transform the settings list to the options
parse_lftp_settings(my_settings, check_settings = TRUE)

You can also automatically parse the settings when creating them.

create_lftp_settings(
  lftp_bin = "/FULL/PATH/TO/lftp", # replace with the path to your lftp executable
  use_proxy = FALSE,
  ftp_proxy = NULL,
  ftp_root = "ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/",
  parse_settings = TRUE
)
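Before handing a hand-written settings list to the package, a few plausibility checks can catch common mistakes. The sketch below uses the field names from the examples above (`lftp_bin`, `use_proxy`, `ftp_proxy`, `ftp_root`), but the helper `settings_problems()` is hypothetical and its checks are assumptions; it is not what check_lftp_settings() does internally.

```r
# Hypothetical helper (not part of {gwascatftp}): collect obvious problems
# in a settings list before using it.
settings_problems <- function(settings) {
  problems <- character(0)
  bin <- settings$lftp_bin
  if (!is.character(bin) || length(bin) != 1 || !file.exists(bin)) {
    problems <- c(problems, "lftp_bin does not point to an existing file")
  }
  if (isTRUE(settings$use_proxy) && is.null(settings$ftp_proxy)) {
    problems <- c(problems, "use_proxy is TRUE but ftp_proxy is NULL")
  }
  root <- settings$ftp_root
  if (!is.character(root) || length(root) != 1 || !grepl("^ftp://", root)) {
    problems <- c(problems, "ftp_root is not a single ftp:// URL")
  }
  problems
}

# A stand-in file plays the role of the lftp executable in this example
fake_bin <- tempfile()
invisible(file.create(fake_bin))

bad_settings <- list(
  lftp_bin = fake_bin,
  use_proxy = TRUE,
  ftp_proxy = NULL,
  ftp_root = "ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/"
)
settings_problems(bad_settings)
#> [1] "use_proxy is TRUE but ftp_proxy is NULL"
```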

When querying the GWAS Catalog, two large lists are used often. To avoid repeatedly downloading these files, we will download them once, save them in variables and supply those to the functions using them.

my_directory_list <- get_directory_list()
my_harmonised_list <- get_harmonised_list()
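To reuse these lists across R sessions as well, they can be stored on disk. The sketch below is one simple way to do that with base R; the helper `cache_rds()`, the file name, and the seven-day expiry are all hypothetical choices and not a feature of {gwascatftp}.

```r
# Hypothetical helper (not part of {gwascatftp}): return a cached result
# from disk if it is fresh enough, otherwise fetch it and store it.
cache_rds <- function(cache_file, fetch_fun, max_age_days = 7) {
  if (file.exists(cache_file)) {
    age_days <- as.numeric(
      difftime(Sys.time(), file.mtime(cache_file), units = "days")
    )
    if (age_days <= max_age_days) {
      return(readRDS(cache_file))
    }
  }
  result <- fetch_fun()
  saveRDS(result, cache_file)
  result
}

# Possible usage with the package (requires lftp and network access):
# my_directory_list <- cache_rds("directory_list.rds", get_directory_list)

# Self-contained demonstration with a stand-in for the download
n_fetches <- 0
slow_fetch <- function() {
  n_fetches <<- n_fetches + 1
  c("GCST009541", "GCST90204201")
}
demo_file <- tempfile(fileext = ".rds")
first <- cache_rds(demo_file, slow_fetch)   # fetches and writes the file
second <- cache_rds(demo_file, slow_fetch)  # reads the file, no second fetch
```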

Now we can query the GWAS Catalog FTP server using study accessions. The most high-level function is download_all_accession_data(), which, given a GWAS Catalog study accession, tries to download all available files for that accession and returns the parsed meta data when available.

# Supply a single accession you want to download the data from
my_study_accession <- "GCST009541"

# Call the function with all the necessary data, including the settings and pre-downloaded lists
download_all_accession_data(
  study_accession = my_study_accession,
  harmonised_list = my_harmonised_list,
  directory_list = my_directory_list,
  download_directory = tempdir(), # Set your target directory here!
  create_accession_directory = TRUE,
  overwrite_existing_files = FALSE,
  return_meta_data = TRUE
)
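When downloading files for several studies, one failed download should not stop the rest. The sketch below shows one way to loop over accessions and record errors instead of aborting; `download_each()` is a hypothetical helper and the stand-in download function only simulates a failure.

```r
# Hypothetical helper (not part of {gwascatftp}): apply a download function
# to each accession and record failures instead of stopping at the first one.
download_each <- function(study_accessions, download_fun) {
  results <- lapply(study_accessions, function(acc) {
    tryCatch(
      list(ok = TRUE, value = download_fun(acc), error = NA_character_),
      error = function(e) {
        list(ok = FALSE, value = NULL, error = conditionMessage(e))
      }
    )
  })
  names(results) <- study_accessions
  results
}

# Stand-in for a real download that fails for one accession
fake_download <- function(acc) {
  if (acc == "GCST000000") stop("no files found")
  paste("downloaded", acc)
}

res <- download_each(c("GCST009541", "GCST000000"), fake_download)
failed <- names(res)[!vapply(res, function(r) r$ok, logical(1))]
failed
#> [1] "GCST000000"
```

With the package, `download_fun` could be a small function that calls download_all_accession_data() with the pre-downloaded lists for one accession.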

The second high-level function, download_multiple_accession_meta_data(), accepts a vector of study accessions and returns a data.table containing all meta data from each accession's [...]-meta.yaml file, if available.

# Prepare a vector of study accessions
my_multiple_study_accessions <- c("GCST009541", "GCST90204201")

# Call the function with the pre-downloaded lists to download the meta data
my_meta_data <- download_multiple_accession_meta_data(
  study_accessions = my_multiple_study_accessions,
  harmonised_list = my_harmonised_list,
  directory_list = my_directory_list
)
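When a vector of accessions comes from a spreadsheet or a paper's supplement, it can help to filter out malformed entries before querying the server. The following is a minimal sketch, assuming accessions consist of the prefix `GCST` followed by digits; `is_study_accession()` is a hypothetical helper, not a package function.

```r
# Hypothetical helper (not part of {gwascatftp}).
# Assumption: a study accession is "GCST" followed only by digits.
is_study_accession <- function(x) {
  grepl("^GCST[0-9]+$", x)
}

candidates <- c("GCST009541", "GCST90204201", "gcst009541", "GCST 009541")
valid <- candidates[is_study_accession(candidates)]
valid
#> [1] "GCST009541"   "GCST90204201"
```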

TODO

Tests!
Actions for check()
cache harmonised_list and directory_list, maybe using R.cache?
