Skip to content

Commit 4561bfb

Browse files
authored
Merge pull request StatFunGen#435 from xueweic/main
fixed format issue for colocboost pipeline
2 parents 28fbdb4 + 32eb25f commit 4561bfb

1 file changed

Lines changed: 105 additions & 107 deletions

File tree

vignettes/colocboost-pipeline.Rmd

Lines changed: 105 additions & 107 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,7 @@
11
---
22
title: "Bioinformatics Pipeline for ColocBoost"
3+
author: "Xuewei Cao"
4+
date: "`r Sys.Date()`"
35
output: rmarkdown::html_vignette
46
vignette: >
57
%\VignetteIndexEntry{Bioinformatics Pipeline for ColocBoost}
@@ -21,6 +23,7 @@ This vignette demonstrates how to use the bioinformatics pipeline for ColocBoost
2123
`colocboost_pipeline` with [link](https://github.com/StatFunGen/pecotmr/blob/main/R/colocboost_pipeline.R).
2224
- See more details about input data preparation in `xqtl_protocol` with [link](https://statfungen.github.io/xqtl-protocol/code/mnm_analysis/mnm_methods/colocboost.html).
2325

26+
Acknowledgment: Thanks to Kate (Kathryn) Lawrence (GitHub:@kal26) for her contributions to this vignette.
2427

2528
# 1. Loading Data using `colocboost_analysis_pipeline` function
2629

@@ -38,15 +41,16 @@ Below are the input parameters for this function for loading individual-level da
3841

3942
## 1.1. Loading individual-level data from multiple cohorts
4043

41-
inputs:
44+
Inputs:
45+
4246
- **`region`**: String ; Genomic region of interest in the format of `chr:start-end` for the phenotype region you want to analyze.
4347
- **`genotype_list`**: Character vector; Paths for PLINK bed files containing genotype data (do NOT include .bed suffix).
4448
- **`phenotype_list`**: Character vector; Paths for phenotype file names.
4549
- **`covariate_list`**: Character vector; Paths for covariate file names for each phenotype. Must have the same length as the phenotype file vector.
4650
- **`conditions_list_individual`**: Character vector; Strings representing different conditions or groups used for naming. Must have the same length as the phenotype file vector.
4751
- **`match_geno_pheno`**: Integer vector; Indices of phenotypes matched to genotype if multiple genotype PLINK files are used. For each phenotype file in `phenotype_list`, the index of the genotype file in `genotype_list` it matches with.
4852
- **`association_window`**: String; Genomic region of interest in the format of `chr:start-end` for the association analysis window of variants to test (cis or trans). If not provided, all genotype data will be loaded.
49-
- **`extract_region_name`**: List of character vectors; Phenotype names (e.g., gene ID `ENSG00000269699`) to subset the phenotype data when there are multiple phenotypes availible in the region. Must have the same length as the phenotype file vector. Default is `NULL`, which will use all phenotypes in the region.
53+
- **`extract_region_name`**: List of character vectors; Phenotype names (e.g., gene ID `ENSG00000269699`) to subset the phenotype data when there are multiple phenotypes available in the region. Must have the same length as the phenotype file vector. Default is `NULL`, which will use all phenotypes in the region.
5054
- **`region_name_col`**: Integer; 1-based index of the column containing the region name (i.e. 4 for gene ID in a bed file). Required if `extract_region_name` is not `NULL`, or if multiple phenotypes fall into the same region in one phenotype file
5155
- **`keep_indel`**: Logical; indicating whether to keep insertions/deletions (INDELs). Default is `TRUE`.
5256
- **`keep_samples`**: Character vector; Sample names to keep. Default is `NULL`. Currently only supports keeping the same samples from all genotype and phenotype files.
@@ -55,15 +59,16 @@ inputs:
5559
- **`xvar_cutoff`**: Numeric; Minimum genotype variance cutoff. Default is 0.
5660
- **`imiss_cutoff`**: Numeric; Maximum individual missingness cutoff. Default is 0.
5761

58-
outputs:
62+
Outputs:
63+
5964
- **`region_data`**: List (with `individual_data`, `sumstat_data`); Output of the `load_multitask_regional_data` function. If only individual-level data is loaded, `sumstat_data` will be `NULL`.
6065

6166

62-
**Indivudual-level data loading example**
67+
**Individual-level data loading example**
6368

6469
The following example demonstrates how to set up input data with 3 phenotypes and 2 cohorts. The first cohort has 2 phenotypes and the second cohort has 1 phenotype. The first phenotype has 2 genes and the second phenotype has 1 gene.
6570

66-
```{r, data-loader-individual}
71+
```{r, data-loader-individual, eval = FALSE}
6772
# Example of loading individual-level data
6873
region = "chr1:1000000-2000000"
6974
genotype_list = c("plink_cohort1.1", "plink_cohort1.2")
@@ -84,34 +89,32 @@ xvar_cutoff = 0
8489
imiss_cutoff = 0.9
8590
8691
# More advanced parameters see pecotmr::load_multitask_regional_data()
87-
88-
#### Comment out to avoid running this code here, as we do not have real data files in this example ####
89-
# region_data_individual <- load_multitask_regional_data(
90-
# region = region,
91-
# genotype_list = genotype_list,
92-
# phenotype_list = phenotype_list,
93-
# covariate_list = covariate_list,
94-
# conditions_list_individual = conditions_list_individual,
95-
# match_geno_pheno = match_geno_pheno,
96-
# association_window = association_window,
97-
# region_name_col = region_name_col,
98-
# extract_region_name = extract_region_name,
99-
# keep_indel = keep_indel,
100-
# keep_samples = keep_samples,
101-
# maf_cutoff = maf_cutoff,
102-
# mac_cutoff = mac_cutoff,
103-
# xvar_cutoff = xvar_cutoff,
104-
# imiss_cutoff = imiss_cutoff
105-
# )
106-
#### End of comment out
92+
region_data_individual <- load_multitask_regional_data(
93+
region = region,
94+
genotype_list = genotype_list,
95+
phenotype_list = phenotype_list,
96+
covariate_list = covariate_list,
97+
conditions_list_individual = conditions_list_individual,
98+
match_geno_pheno = match_geno_pheno,
99+
association_window = association_window,
100+
region_name_col = region_name_col,
101+
extract_region_name = extract_region_name,
102+
keep_indel = keep_indel,
103+
keep_samples = keep_samples,
104+
maf_cutoff = maf_cutoff,
105+
mac_cutoff = mac_cutoff,
106+
xvar_cutoff = xvar_cutoff,
107+
imiss_cutoff = imiss_cutoff
108+
)
107109
108110
```
109111

110112

111113

112114
## 1.2. Loading summary statistics from multiple cohorts or datasets
113115

114-
inputs:
116+
Inputs:
117+
115118
- **`sumstat_path_list`**: Character vector; Paths to the summary statistics.
116119
- **`column_file_path_list`**: Character vector; Paths to the column mapping files. See below for expected format.
117120
- **`LD_meta_file_path_list`**: Character vector; Paths to LD metadata files. See below for expected format.
@@ -122,14 +125,15 @@ inputs:
122125
- **`n_cases`**: Integer vector; Number of cases. Set a 0 if `n_samples` is passed explicitly. If unknown, set as 0 and include `n_cases` column in the column mapping file to retrieve from the sumstat file.
123126
- **`n_controls`**: Integer vector; Number of controls. Set a 0 if `n_samples` is passed explicitly. If unknown, set as 0 and include `n_controls` column in the column mapping file to retrieve from the sumstat file.
124127

125-
outputs:
128+
Outputs:
129+
126130
- **`region_data`**: List (with `individual_data`, `sumstat_data`); Output of the `load_multitask_regional_data` function. If only summary statistics data is loaded, `individual_data` will be `NULL`.
127131

128132
**Summary statistics loading example**
129133

130134
The following example demonstrates how to set up input data with 2 summary statistics and one LD reference.
131135

132-
```{r, data-loader-sumstat}
136+
```{r, data-loader-sumstat, eval = FALSE}
133137
# Example of loading summary statistics
134138
sumstat_path_list = c("sumstat1.tsv.gz", "sumstat2.tsv.gz")
135139
column_file_path_list = c("column_mapping_sumstat1.yml", "column_mapping_sumstat2.yml")
@@ -145,28 +149,26 @@ n_controls = c(0, 40000)
145149
146150
147151
# More advanced parameters see pecotmr::load_multitask_regional_data()
148-
149-
#### Comment out to avoid running this code here, as we do not have real data files in this example ####
150-
# region_data_sumstat <- load_multitask_regional_data(
151-
# sumstat_path_list = sumstat_path_list,
152-
# column_file_path_list = column_file_path_list,
153-
# LD_meta_file_path_list = LD_meta_file_path_list,
154-
# conditions_list_sumstat = conditions_list_sumstat,
155-
# match_LD_sumstat = match_LD_sumstat,
156-
# association_window = association_window,
157-
# n_samples = n_samples,
158-
# n_cases = n_cases,
159-
# n_controls = n_controls
160-
# )
161-
#### End of comment out
152+
region_data_sumstat <- load_multitask_regional_data(
153+
sumstat_path_list = sumstat_path_list,
154+
column_file_path_list = column_file_path_list,
155+
LD_meta_file_path_list = LD_meta_file_path_list,
156+
conditions_list_sumstat = conditions_list_sumstat,
157+
match_LD_sumstat = match_LD_sumstat,
158+
association_window = association_window,
159+
n_samples = n_samples,
160+
n_cases = n_cases,
161+
n_controls = n_controls
162+
)
162163
```
163164

164165

165166

166167
**Expected format for column mapping file**
168+
167169
The column mapping file is YAML (`.yml`) with key: value pairs mapping your input column names to the standardized names expected by the loader.
168170
Required columns are `chrom`, `pos`, `A1`, and `A2`, and either `z` or `beta` and `sebeta`.
169-
Either 'n_case' and 'n_control' or 'n_samples' can be passed as part of the column mapping, but will be overwritten by the n_cases and n_controls or n_samples parameterspassed explicitly.
171+
Either 'n_case' and 'n_control' or 'n_samples' can be passed as part of the column mapping, but will be overwritten by the n_cases and n_controls or n_samples parameters passed explicitly.
170172
```yaml
171173
# required
172174
chrom: chromosome
@@ -188,7 +190,7 @@ n_sample: N
188190
189191
190192
**Expected format for LD metadata file**
191-
LD files sould be in the format generated by for instance `plink --r squared`, then xz compressed.
193+
LD files should be in the format generated by for instance `plink --r squared`, then xz compressed.
192194
The LD metadata file is a tab-separated file with the following columns:
193195
- `chrom`: chromosome
194196
- `start`: start position
@@ -208,84 +210,80 @@ The colocalization analysis can be run in any one of three modes, or in a combin
208210
- **`joint GWAS mode`**: Perform colocalization analysis in disease-agnostic mode on the individual-level and summary statistics data together.
209211
- **`separate GWAS mode`**: Perform colocalization analysis in disease-prioritized mode on the the individual-level data and each summary statistics dataset separately, treating each summary statistics dataset as the focal trait.
210212

211-
inputs:
213+
Inputs:
214+
212215
- **`region_data`**: List (with `individual_data`, `sumstat_data`); Output of the `load_multitask_regional_data` function.
213216
- **`focal_trait`**: String; For xQTL-only mode, the name of the trait to perform disease-prioritized ColocBoost, from `conditions_list_individual`. If not provided, xQTL-only mode will be run without disease-prioritized mode.
214217
- **`event_filters`**: List of character vectors; Patterns for filtering events based on context names.
215218
Example: for sQTL, `list(type_pattern = ".*clu_(\\d+_[+-?]).*", valid_pattern = "clu_(\\d+_[+-?]):PR:", exclude_pattern = "clu_(\\d+_[+-?]):IN:")`.
216219
- **`maf_cutoff`**: Numeric; Minor allele frequency cutoff. Default is 0.005.
217220
- **`pip_cutoff_to_skip_ind`**: Integer vector; Cutoff values for skipping analysis based on pre-screening with single-effect SuSiE (L=1). Context is skipped if none of the variants in the context have PIP values greater than the cutoff. Default is 0 (does not run single-effect SuSiE). Passing a negative value sets the cutoff to 3/number of variants.
218221
- **`pip_cutoff_to_skip_sumstat`**: Integer vector; Cutoff values for skipping analysis based on pre-screening with single-effect SuSiE (L=1). Sumstat is skipped if none of the variants in the sumstat have PIP values greater than the cutoff. Default is 0 (does not run single-effect SuSiE). Passing a negative value sets the cutoff to 3/number of variants.
219-
- **`qc_method`**: String; Quality control method to use. Options are "dentist" or "slalom". Default is `dentist`.
222+
- **`qc_method`**: String; Quality control method to use. Options are "rss_qc", "dentist", or "slalom". Default is `rss_qc`.
220223
- **`impute`**: Logical; if TRUE, performs imputation for outliers identified in the analysis. Default is `TRUE`.
221224
- **`impute_opts`**: List of lists; Imputation options including rcond, R2_threshold, and minimum_ld. Default is `list(rcond = 0.01, R2_threshold = 0.6, minimum_ld = 5)`.
222225
- **`xqtl_coloc`**: Logical; if TRUE, performs xQTL-only mode. Default is `TRUE`.
223226
- **`joint_gwas`**: Logical; if TRUE, performs joint GWAS mode, mapping all individual-level and sumstat data together.Default is `FALSE`.
224227
- **`separate_gwas`**: Logical; if TRUE, runs separate GWAS mode, where each sumstat dataset is analyzed separately with all individual-level data, treating each sumstat as the focal trait in disease-prioritized mode. Default is `FALSE`.
225228

226-
outputs:
227-
- **`colocboost_results`**: List of colocboost objects (with `xqtl_coloc`, `joint_gwas`, `separate_gwas`); Output of the `colocboost_analysis_pipeline` function. If the mode is not run, the corresponding element will be `NULL`.
229+
Outputs:
228230

229-
```{r, colocboost-analysis}
231+
- **`colocboost_results`**: List of colocboost objects (with `xqtl_coloc`, `joint_gwas`, `separate_gwas`); Output of the `colocboost_analysis_pipeline` function. If the mode is not run, the corresponding element will be `NULL`.
230232

231-
#### Comment out to avoid running this code here, as we do not have real data files in this example ####
233+
```{r, colocboost-analysis, eval = FALSE}
232234
#### Please check the example code below ####
233-
234235
# # load in individual-level and sumstat data
235-
# region_data_combined <- load_multitask_regional_data(
236-
# region = region,
237-
# genotype_list = genotype_list,
238-
# phenotype_list = phenotype_list,
239-
# covariate_list = covariate_list,
240-
# conditions_list_individual = conditions_list_individual,
241-
# match_geno_pheno = match_geno_pheno,
242-
# association_window = association_window,
243-
# region_name_col = region_name_col,
244-
# extract_region_name = extract_region_name,
245-
# keep_indel = keep_indel,
246-
# keep_samples = keep_samples,
247-
# maf_cutoff = maf_cutoff,
248-
# mac_cutoff = mac_cutoff,
249-
# xvar_cutoff = xvar_cutoff,
250-
# imiss_cutoff = imiss_cutoff,
251-
# sumstat_path_list = sumstat_path_list,
252-
# column_file_path_list = column_file_path_list,
253-
# LD_meta_file_path_list = LD_meta_file_path_list,
254-
# conditions_list_sumstat = conditions_list_sumstat,
255-
# match_LD_sumstat = match_LD_sumstat,
256-
# n_samples = n_samples,
257-
# n_cases = n_cases,
258-
# n_controls = n_controls
259-
# )
260-
261-
# maf_cutoff = 0.01
262-
# pip_cutoff_to_skip_ind = rep(0, length(phenotype_list))
263-
# pip_cutoff_to_skip_sumstat = rep(0, length(sumstat_path_list))
264-
# qc_method = "dentist"
265-
266-
# # run colocboost analysis
267-
# colocboost_results <- colocboost_analysis_pipeline(
268-
# region_data_combined,
269-
# maf_cutoff = maf_cutoff,
270-
# pip_cutoff_to_skip_ind = pip_cutoff_to_skip_ind,
271-
# pip_cutoff_to_skip_sumstat = pip_cutoff_to_skip_sumstat,
272-
# qc_method = qc_method,
273-
# xqtl_coloc = TRUE,
274-
# joint_gwas = TRUE,
275-
# separate_gwas = TRUE
276-
# )
277-
278-
# # visualize results for xQTL-only mode
279-
# colocboost_plot(colocboost_results$xqtl_coloc)
280-
281-
# # visualize results for joint GWAS mode
282-
# colocboost_plot(colocboost_results$joint_gwas)
283-
284-
# # visualize results for separate GWAS mode
285-
# for (i in 1:length(colocboost_results$separate_gwas)) {
286-
# colocboost_plot(colocboost_results$separate_gwas[[i]])
287-
# }
288-
289-
290-
#### End of comment out
236+
region_data_combined <- load_multitask_regional_data(
237+
region = region,
238+
genotype_list = genotype_list,
239+
phenotype_list = phenotype_list,
240+
covariate_list = covariate_list,
241+
conditions_list_individual = conditions_list_individual,
242+
match_geno_pheno = match_geno_pheno,
243+
association_window = association_window,
244+
region_name_col = region_name_col,
245+
extract_region_name = extract_region_name,
246+
keep_indel = keep_indel,
247+
keep_samples = keep_samples,
248+
maf_cutoff = maf_cutoff,
249+
mac_cutoff = mac_cutoff,
250+
xvar_cutoff = xvar_cutoff,
251+
imiss_cutoff = imiss_cutoff,
252+
sumstat_path_list = sumstat_path_list,
253+
column_file_path_list = column_file_path_list,
254+
LD_meta_file_path_list = LD_meta_file_path_list,
255+
conditions_list_sumstat = conditions_list_sumstat,
256+
match_LD_sumstat = match_LD_sumstat,
257+
n_samples = n_samples,
258+
n_cases = n_cases,
259+
n_controls = n_controls
260+
)
261+
262+
maf_cutoff = 0.01
263+
pip_cutoff_to_skip_ind = rep(0, length(phenotype_list))
264+
pip_cutoff_to_skip_sumstat = rep(0, length(sumstat_path_list))
265+
qc_method = "rss_qc"
266+
267+
# run colocboost analysis
268+
colocboost_results <- colocboost_analysis_pipeline(
269+
region_data_combined,
270+
maf_cutoff = maf_cutoff,
271+
pip_cutoff_to_skip_ind = pip_cutoff_to_skip_ind,
272+
pip_cutoff_to_skip_sumstat = pip_cutoff_to_skip_sumstat,
273+
qc_method = qc_method,
274+
xqtl_coloc = TRUE,
275+
joint_gwas = TRUE,
276+
separate_gwas = TRUE
277+
)
278+
279+
# visualize results for xQTL-only mode
280+
colocboost_plot(colocboost_results$xqtl_coloc)
281+
282+
# visualize results for joint GWAS mode
283+
colocboost_plot(colocboost_results$joint_gwas)
284+
285+
# visualize results for separate GWAS mode
286+
for (i in 1:length(colocboost_results$separate_gwas)) {
287+
colocboost_plot(colocboost_results$separate_gwas[[i]])
288+
}
291289
```

0 commit comments

Comments
 (0)