%\VignetteIndexEntry{Bioinformatics Pipeline for ColocBoost}
@@ -21,6 +23,7 @@ This vignette demonstrates how to use the bioinformatics pipeline for ColocBoost
`colocboost_pipeline` with [link](https://github.com/StatFunGen/pecotmr/blob/main/R/colocboost_pipeline.R).
- See more details about input data preparation in `xqtl_protocol` with [link](https://statfungen.github.io/xqtl-protocol/code/mnm_analysis/mnm_methods/colocboost.html).
+
Acknowledgment: Thanks to Kate (Kathryn) Lawrence (GitHub:@kal26) for her contributions to this vignette.
# 1. Loading Data using `colocboost_analysis_pipeline` function
@@ -38,15 +41,16 @@ Below are the input parameters for this function for loading individual-level da
## 1.1. Loading individual-level data from multiple cohorts
-
inputs:
+
Inputs:
+

-**`region`**: String; Genomic region of interest in the format of `chr:start-end` for the phenotype region you want to analyze.
-**`genotype_list`**: Character vector; Paths for PLINK bed files containing genotype data (do NOT include .bed suffix).
-**`phenotype_list`**: Character vector; Paths for phenotype file names.
-**`covariate_list`**: Character vector; Paths for covariate file names for each phenotype. Must have the same length as the phenotype file vector.
-**`conditions_list_individual`**: Character vector; Strings representing different conditions or groups used for naming. Must have the same length as the phenotype file vector.
-**`match_geno_pheno`**: Integer vector; Indices of phenotypes matched to genotype if multiple genotype PLINK files are used. For each phenotype file in `phenotype_list`, the index of the genotype file in `genotype_list` it matches with.
-**`association_window`**: String; Genomic region of interest in the format of `chr:start-end` for the association analysis window of variants to test (cis or trans). If not provided, all genotype data will be loaded.
-
-**`extract_region_name`**: List of character vectors; Phenotype names (e.g., gene ID `ENSG00000269699`) to subset the phenotype data when there are multiple phenotypes availible in the region. Must have the same length as the phenotype file vector. Default is `NULL`, which will use all phenotypes in the region.
+
-**`extract_region_name`**: List of character vectors; Phenotype names (e.g., gene ID `ENSG00000269699`) to subset the phenotype data when there are multiple phenotypes available in the region. Must have the same length as the phenotype file vector. Default is `NULL`, which will use all phenotypes in the region.
-**`region_name_col`**: Integer; 1-based index of the column containing the region name (i.e. 4 for gene ID in a bed file). Required if `extract_region_name` is not `NULL`, or if multiple phenotypes fall into the same region in one phenotype file
-**`keep_indel`**: Logical; indicating whether to keep insertions/deletions (INDELs). Default is `TRUE`.
-**`keep_samples`**: Character vector; Sample names to keep. Default is `NULL`. Currently only supports keeping the same samples from all genotype and phenotype files.
@@ -55,15 +59,16 @@ inputs:
-**`xvar_cutoff`**: Numeric; Minimum genotype variance cutoff. Default is 0.
-**`imiss_cutoff`**: Numeric; Maximum individual missingness cutoff. Default is 0.
-
outputs:
+
Outputs:
+

-**`region_data`**: List (with `individual_data`, `sumstat_data`); Output of the `load_multitask_regional_data` function. If only individual-level data is loaded, `sumstat_data` will be `NULL`.
-
**Indivudual-level data loading example**
+
**Individual-level data loading example**
The following example demonstrates how to set up input data with 3 phenotypes and 2 cohorts. The first cohort has 2 phenotypes and the second cohort has 1 phenotype. The first phenotype has 2 genes and the second phenotype has 1 gene.
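The layout described above (3 phenotype files across 2 cohorts) could be encoded roughly as follows. This is only a sketch: every file path and condition name is a hypothetical placeholder, the check merely restates the length requirements from the parameter list, and nothing here calls the loader itself.

```r
# Hypothetical paths and names, for illustration only.
genotype_list <- c("cohort1_geno", "cohort2_geno")  # PLINK prefixes, no .bed suffix
phenotype_list <- c("cohort1_pheno1.bed.gz", "cohort1_pheno2.bed.gz", "cohort2_pheno1.bed.gz")
covariate_list <- c("cohort1_pheno1.cov.gz", "cohort1_pheno2.cov.gz", "cohort2_pheno1.cov.gz")
conditions_list_individual <- c("cohort1_pheno1", "cohort1_pheno2", "cohort2_pheno1")

# For each phenotype file, the index of its genotype file in genotype_list:
# the first two phenotypes belong to cohort 1, the third to cohort 2.
match_geno_pheno <- c(1, 1, 2)

# Per-phenotype vectors must all have the same length as phenotype_list,
# and every index must point at an existing genotype file.
n_pheno <- length(phenotype_list)
stopifnot(
  length(covariate_list) == n_pheno,
  length(conditions_list_individual) == n_pheno,
  length(match_geno_pheno) == n_pheno,
  all(match_geno_pheno %in% seq_along(genotype_list))
)

# Genotype prefix used for each phenotype file.
genotype_list[match_geno_pheno]
```

Running a check like this before loading tends to surface mismatched vector lengths earlier than a failure inside the loader would.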
## 1.2. Loading summary statistics from multiple cohorts or datasets
-
inputs:
+
Inputs:
+

-**`sumstat_path_list`**: Character vector; Paths to the summary statistics.
-**`column_file_path_list`**: Character vector; Paths to the column mapping files. See below for expected format.
-**`LD_meta_file_path_list`**: Character vector; Paths to LD metadata files. See below for expected format.
@@ -122,14 +125,15 @@ inputs:
-**`n_cases`**: Integer vector; Number of cases. Set to 0 if `n_samples` is passed explicitly. If unknown, set to 0 and include an `n_cases` column in the column mapping file to retrieve it from the sumstat file.
-**`n_controls`**: Integer vector; Number of controls. Set to 0 if `n_samples` is passed explicitly. If unknown, set to 0 and include an `n_controls` column in the column mapping file to retrieve it from the sumstat file.
-
outputs:
+
Outputs:
+

-**`region_data`**: List (with `individual_data`, `sumstat_data`); Output of the `load_multitask_regional_data` function. If only summary statistics data is loaded, `individual_data` will be `NULL`.
**Summary statistics loading example**
The following example demonstrates how to set up input data with 2 summary statistics and one LD reference.
The column mapping file is YAML (`.yml`) with key: value pairs mapping your input column names to the standardized names expected by the loader.
Required columns are `chrom`, `pos`, `A1`, and `A2`, and either `z` or `beta` and `sebeta`.
-
Either 'n_case' and 'n_control' or 'n_samples' can be passed as part of the column mapping, but will be overwritten by the n_cases and n_controls or n_samples parameterspassed explicitly.
+
Either 'n_case' and 'n_control' or 'n_samples' can be passed as part of the column mapping, but will be overwritten by the n_cases and n_controls or n_samples parameters passed explicitly.
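The required-column rule stated above can be checked before any data is loaded. The helper below is a hypothetical sketch in base R, not part of the pipeline: it assumes the mapping has already been read into a named list whose names are the standardized keys, and all input column names other than `chromosome` are made up for illustration.

```r
# Hypothetical helper (not provided by any package): validates a column
# mapping given as a named list of standardized name -> input column name.
check_column_mapping <- function(mapping) {
  keys <- names(mapping)
  missing_core <- setdiff(c("chrom", "pos", "A1", "A2"), keys)
  if (length(missing_core) > 0) {
    stop("missing required keys: ", paste(missing_core, collapse = ", "))
  }
  has_z <- "z" %in% keys
  has_beta_se <- all(c("beta", "sebeta") %in% keys)
  if (!has_z && !has_beta_se) {
    stop("need either 'z' or both 'beta' and 'sebeta'")
  }
  invisible(TRUE)
}

# Example mapping; the right-hand column names are placeholders.
mapping <- list(
  chrom = "chromosome", pos = "position",
  A1 = "alt", A2 = "ref", z = "zscore"
)
check_column_mapping(mapping)
```

A mapping that supplies only `beta` without `sebeta` (and no `z`) would be rejected by this check, which mirrors the "either `z` or `beta` and `sebeta`" requirement.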
```yaml
# required
chrom: chromosome
@@ -188,7 +190,7 @@ n_sample: N
**Expected format for LD metadata file**
-
LD files sould be in the format generated by for instance `plink --r squared`, then xz compressed.
+
LD files should be in the format generated by for instance `plink --r squared`, then xz compressed.
The LD metadata file is a tab-separated file with the following columns:
- `chrom`: chromosome
- `start`: start position
@@ -208,84 +210,80 @@ The colocalization analysis can be run in any one of three modes, or in a combin
- **`joint GWAS mode`**: Perform colocalization analysis in disease-agnostic mode on the individual-level and summary statistics data together.
- **`separate GWAS mode`**: Perform colocalization analysis in disease-prioritized mode on the individual-level data and each summary statistics dataset separately, treating each summary statistics dataset as the focal trait.
-
inputs:
+
Inputs:
+

- **`region_data`**: List (with `individual_data`, `sumstat_data`); Output of the `load_multitask_regional_data` function.
- **`focal_trait`**: String; For xQTL-only mode, the name of the trait to perform disease-prioritized ColocBoost, from `conditions_list_individual`. If not provided, xQTL-only mode will be run without disease-prioritized mode.
- **`event_filters`**: List of character vectors; Patterns for filtering events based on context names.
- **`maf_cutoff`**: Numeric; Minor allele frequency cutoff. Default is 0.005.
- **`pip_cutoff_to_skip_ind`**: Integer vector; Cutoff values for skipping analysis based on pre-screening with single-effect SuSiE (L=1). Context is skipped if none of the variants in the context have PIP values greater than the cutoff. Default is 0 (does not run single-effect SuSiE). Passing a negative value sets the cutoff to 3/number of variants.
- **`pip_cutoff_to_skip_sumstat`**: Integer vector; Cutoff values for skipping analysis based on pre-screening with single-effect SuSiE (L=1). Sumstat is skipped if none of the variants in the sumstat have PIP values greater than the cutoff. Default is 0 (does not run single-effect SuSiE). Passing a negative value sets the cutoff to 3/number of variants.
-
- **`qc_method`**: String; Quality control method to use. Options are "dentist" or "slalom". Default is `dentist`.
+
- **`qc_method`**: String; Quality control method to use. Options are "rss_qc", "dentist", or "slalom". Default is `rss_qc`.
- **`impute`**: Logical; if TRUE, performs imputation for outliers identified in the analysis. Default is `TRUE`.
- **`impute_opts`**: List of lists; Imputation options including rcond, R2_threshold, and minimum_ld. Default is `list(rcond = 0.01, R2_threshold = 0.6, minimum_ld = 5)`.
- **`xqtl_coloc`**: Logical; if TRUE, performs xQTL-only mode. Default is `TRUE`.
- **`joint_gwas`**: Logical; if TRUE, performs joint GWAS mode, mapping all individual-level and sumstat data together. Default is `FALSE`.
- **`separate_gwas`**: Logical; if TRUE, runs separate GWAS mode, where each sumstat dataset is analyzed separately with all individual-level data, treating each sumstat as the focal trait in disease-prioritized mode. Default is `FALSE`.
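The pre-screening rule described for `pip_cutoff_to_skip_ind` and `pip_cutoff_to_skip_sumstat` can be restated as a small sketch. Both helpers below are hypothetical and only mirror the wording of the parameter descriptions (0 disables screening, a negative value resolves to 3/number of variants); the pipeline's actual implementation may differ in its details.

```r
# Hypothetical helpers restating the documented rule; not the pipeline's code.

# Resolve the user-supplied cutoff: negative values become 3 / number of variants.
resolve_pip_cutoff <- function(cutoff, n_variants) {
  if (cutoff < 0) 3 / n_variants else cutoff
}

# Decide whether a context (or sumstat) would be skipped, given the PIPs from
# a single-effect (L = 1) fit. A cutoff of 0 means no screening is done.
skip_after_prescreen <- function(pip, cutoff) {
  cutoff <- resolve_pip_cutoff(cutoff, length(pip))
  if (cutoff == 0) {
    return(FALSE)
  }
  !any(pip > cutoff)
}

pip <- c(0.0005, 0.001, 0.0015)  # toy PIP values
resolve_pip_cutoff(-1, 1500)     # 3 / 1500
skip_after_prescreen(pip, 0)     # screening disabled, so never skipped
skip_after_prescreen(pip, 0.5)   # skipped: no PIP exceeds 0.5
```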
-
outputs:
-
- **`colocboost_results`**: List of colocboost objects (with `xqtl_coloc`, `joint_gwas`, `separate_gwas`); Output of the `colocboost_analysis_pipeline` function. If the mode is not run, the corresponding element will be `NULL`.
+
Outputs:
-
```{r, colocboost-analysis}
+
- **`colocboost_results`**: List of colocboost objects (with `xqtl_coloc`, `joint_gwas`, `separate_gwas`); Output of the `colocboost_analysis_pipeline` function. If the mode is not run, the corresponding element will be `NULL`.
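Because modes that were not run come back as `NULL`, it can help to check which elements are populated before using them. The sketch below uses a mock list with placeholder contents in place of real colocboost objects; only the three element names are taken from the description above.

```r
# Mock result list; the contents are placeholders, not real colocboost objects.
colocboost_results <- list(
  xqtl_coloc = list(note = "placeholder for a colocboost object"),
  joint_gwas = NULL,
  separate_gwas = NULL
)

# Names of the modes that actually produced a result.
was_run <- !vapply(colocboost_results, is.null, logical(1))
modes_run <- names(colocboost_results)[was_run]
modes_run
```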
-
#### Comment out to avoid running this code here, as we do not have real data files in this example ####