From b69eb489821be3fd0efd633f0c5a973793af1d58 Mon Sep 17 00:00:00 2001
From: katehoffshutta <43797774+katehoffshutta@users.noreply.github.com>
Date: Thu, 24 Apr 2025 15:21:20 -0400
Subject: [PATCH 1/3] add expression functions
---
quickstart.Rmd | 131 ++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 102 insertions(+), 29 deletions(-)
diff --git a/quickstart.Rmd b/quickstart.Rmd
index 7791dfc..c0cc48f 100644
--- a/quickstart.Rmd
+++ b/quickstart.Rmd
@@ -106,46 +106,119 @@ meth_datapath = paste0(datapath, "data_download/methylation/tcga_coad_cms1_methy
## Functions for working with expression
-
-
-
-
+The functions for working with gene expression in NetworkDataCompanion are generally wrappers around the functionality of the `RangedSummarizedExperiment` objects produced by `recount3`. For documentation on this type of object, please see ***.
+
+First, we read in the expression data, which is downloaded as both an .rds file and as a .csv file in the Download phase of the `tcga-data-nf` pipeline. The .rds file contains the `RangedSummarizedExperiment` object, so this is the downloaded file that we work with here. The .csv file is provided for convenience if you have other applications that do not require the `RangedSummarizedExperiment`.
+
+```{R}
+expr_data = readRDS(exp_datapath)
+expr_data
+```
+
+To extract metadata from the `RangedSummarizedExperiment`, we use the `extractSampleAndGeneInfo` method. This produces two dataframes: `rds_gene_info`, which stores gene identifiers and other gene-level metadata, and `rds_sample_info`, which stores sample identifiers such as TCGA barcodes, patient characteristics, and sample characteristics.
-
-
-
-
+```{R}
+expr_metadata = my_friend$extractSampleAndGeneInfo(expr_data)
+ls(expr_metadata)
+```
-
-
-
-
+For example, if we wish to get the gene names and Ensembl IDs for the first five genes in the dataset, we can use:
+
+```{R}
+expr_metadata$rds_gene_info %>%
+ dplyr::select(gene_id,gene_name) %>%
+ slice_head(n=5)
+```
-
+Similarly, let's get the TCGA barcode and vital status of the patients who provided the first 5 samples:
-
-
-
-
-
+```{R}
+expr_metadata$rds_sample_info %>%
+ dplyr::select(tcga.tcga_barcode,tcga.gdc_cases.diagnoses.vital_status) %>%
+ slice_head(n=5)
+```
-
+To get information about a set of genes, `NetworkDataCompanion` provides a wrapper function `getGeneInfo`.
-
-
-
-
-
+```{R}
+my_friend$getGeneInfo(c("BRCA1","BRCA2"))
+```
-
+These results can also be searched by Ensemble or Entrez gene IDs. The Ensemble IDs for BRCA1 are ENSG00000012048 and ENSG00000139618, respectively:
+
+```{R}
+my_friend$getGeneInfo(c("ENSG00000012048","ENSG00000139618"))
+```
+
+And the Entrez IDs for BRCA1 and BRCA2 are 672 and 675.
+
+```{R}
+my_friend$getGeneInfo(c(672, 675))
+```
+
+### Converting between gene identifiers
+
+There are many, many tools for converting between gene identifiers, and ours has neither bells nor whistles. Our tool simply uses the hardcode Gencode v26 mapping found in the `inst/extdata/gen_v26_mapping_w_entrez.csv` file. A more sophisticated tool may better fit your use case; however, we provide this functionality as we used it in the `tcga-data-nf` pipeline.
+
+```{R}
+my_friend$geneNameToENSG(c("BRCA1","BRCA2"))
+my_friend$geneNameToEntrez(c("BRCA1","BRCA2"))
+```
+
+Sometimes, the gene conversion will yield a one-to-many mapping. For this reason, it is important to use proper merging operations (`merge` in base R; the `left_join`,`right_join`,`inner_join`,`full_join` functions from `dplyr`) when annotating these conversions back to a dataset.
+
+```{R}
+my_friend$geneNameToENSG("NOTCH4")
+```
+
+
+### Normalizing RNAseq data
+
+`NetworkDataCompanion` provides a wrapper for logTPM RNAseq normalization functions from `recount3`. A pseudocount is added in the calculation. The input to this function is the `RangedSummarizedExperiment` object:
+
+
+```{R}
+logTPM = my_friend$logTPMNormalization(expr_data)
+ls(logTPM)
+```
+
+The three items returned include the transformed counts, which have been processed by `recount3::transform_counts`, the logTPM (corresponding to log(TPM+1) with the pseudocount), the TPM (which is provided for use cases related to filtering by TPM, for example, if one would like to do something like remove genes with TPM < 1 in more than 20% of the samples). Rows are indexed by Ensembl gene ID. Let's look at the values for BRCA1 (ENSG00000012048.20):
+
+```{R}
+brca1_counts=logTPM$counts[which(rownames(logTPM$counts)=="ENSG00000012048.20"),1:3]
+brca1_TPM=logTPM$TPM[which(rownames(logTPM$TPM)=="ENSG00000012048.20"),1:3]
+brca1_logTPM=logTPM$logTPM[which(rownames(logTPM$logTPM)=="ENSG00000012048.20"),1:3]
+
+brca1_counts
+brca1_TPM
+brca1_logTPM
+```
+
+There is also a wrapper for logCPM normalization. This is based on the `edgeR` recipe for TMM normalization; see the package documentation for more details. Again, a pseudocount is added in the calculation. Unlike `logTPMNormalization`, which uses the `RangedSummarizedExperiment` as input, `logCPMNormalization` uses a matrix or data frame of raw expression counts as input. This can be accessed using the `assay` function of the `RangedSummarizedExperiment`.
+
+```{R}
+logCPM = my_friend$logCPMNormalization(assay(expr_data))
+ls(logCPM)
+```
+
+Here, counts refer to the raw counts in the `RangedSummarizedExperiment` object. Note that, because these counts have not been transformed with `recount3::transform_counts`, they will not be coherent with the `counts` attribute of `logTPMNormalization`. This is expected behavior.
+
+We again investigate the values for BRCA1:
+
+
+```{R}
+brca1_counts=logCPM$counts[which(rownames(logCPM$counts)=="ENSG00000012048.20"),1:3]
+brca1_CPM=logCPM$CPM[which(rownames(logCPM$CPM)=="ENSG00000012048.20"),1:3]
+brca1_logCPM=logCPM$logCPM[which(rownames(logCPM$logCPM)=="ENSG00000012048.20"),1:3]
+
+brca1_counts
+brca1_CPM
+brca1_logCPM
+```
-
-
-
-
From 1e5c325ebc39fee35aaca8e19b81b5585d111b49 Mon Sep 17 00:00:00 2001
From: katehoffshutta <43797774+katehoffshutta@users.noreply.github.com>
Date: Fri, 25 Apr 2025 14:38:27 -0400
Subject: [PATCH 2/3] add methylation to quickstart
---
quickstart.Rmd | 125 ++++++++++++++++++++++++++++++++++++++++---------
quickstart.bib | 20 ++++++++
2 files changed, 122 insertions(+), 23 deletions(-)
diff --git a/quickstart.Rmd b/quickstart.Rmd
index c0cc48f..6f1a861 100644
--- a/quickstart.Rmd
+++ b/quickstart.Rmd
@@ -9,6 +9,11 @@ bibliography: quickstart.bib
link-citations: true
---
+```{R echo=F}
+library(ggplot2)
+library(knitr)
+```
+
# About NetworkDataCompanion
The `NetworkDataCompanion` R package is a collection of utilities for wrangling TCGA data. While it was developed to serve as the engine behind `tcga-data-nf`, the Nextflow workflow presented in @fanfani2024reproducible, `NetworkDataCompanion` is a stand-alone package that can be used independently. The primary goal of the package is to provide version-controlled and unit-tested functions for common tasks involved in network analyses of TCGA data, such as calculating gene-level summary measures of methylation data, normalizing RNA-seq data, and integrating multiple omics by mapping various file identifiers back to TCGA barcodes.
@@ -23,7 +28,7 @@ While none of these tasks are particularly complex, `NetworkDataCompanion` is mo
The first step is to install `NetworkDataCompanion` from GitHub using the `devtools` library and the `install_github` function. Here, we have suppressed messages as there are quite a few.
-```{R, message=F}
+```{R, message=F, eval=F}
library(devtools)
devtools::install_github("QuackenbushLab/NetworkDataCompanion")
```
@@ -48,16 +53,16 @@ Data acknowledgement: These data were generated by the TCGA Research Network: ht
# Working with the NetworkDataCompanion Object
-The most simple way to get started with `NetworkDataCompanion` is to create an oject using the `CreateNetworkDataCompanionObject` function with an empty argument. Because it is a Companion, we like to call it `my_friend`:
+The most simple way to get started with `NetworkDataCompanion` is to create an oject using the `CreateNetworkDataCompanionObject` function with an empty argument. Because it is a Companion, we like to call it `my_coad_friend`:
```{R}
-my_friend = NetworkDataCompanion::CreateNetworkDataCompanionObject()
+my_coad_friend = NetworkDataCompanion::CreateNetworkDataCompanionObject()
```
There are a few built-in attributes that will be accessible to the functions of the Companion:
```{R}
-ls(my_friend)
+ls(my_coad_friend)
```
Of these, `gene_mapping` and `sample_type_mapping` are loaded with the empty constructor. For the other attributes, we must supply a patient data file and TCGA cancer type. For the COAD example, we can use the following:
@@ -120,7 +125,7 @@ To extract metadata from the `RangedSummarizedExperiment`, we use the `extractSa
```{R}
-expr_metadata = my_friend$extractSampleAndGeneInfo(expr_data)
+expr_metadata = my_coad_friend$extractSampleAndGeneInfo(expr_data)
ls(expr_metadata)
```
@@ -144,19 +149,19 @@ expr_metadata$rds_sample_info %>%
To get information about a set of genes, `NetworkDataCompanion` provides a wrapper function `getGeneInfo`.
```{R}
-my_friend$getGeneInfo(c("BRCA1","BRCA2"))
+my_coad_friend$getGeneInfo(c("BRCA1","BRCA2"))
```
These results can also be searched by Ensemble or Entrez gene IDs. The Ensemble IDs for BRCA1 are ENSG00000012048 and ENSG00000139618, respectively:
```{R}
-my_friend$getGeneInfo(c("ENSG00000012048","ENSG00000139618"))
+my_coad_friend$getGeneInfo(c("ENSG00000012048","ENSG00000139618"))
```
And the Entrez IDs for BRCA1 and BRCA2 are 672 and 675.
```{R}
-my_friend$getGeneInfo(c(672, 675))
+my_coad_friend$getGeneInfo(c(672, 675))
```
### Converting between gene identifiers
@@ -164,14 +169,14 @@ my_friend$getGeneInfo(c(672, 675))
There are many, many tools for converting between gene identifiers, and ours has neither bells nor whistles. Our tool simply uses the hardcode Gencode v26 mapping found in the `inst/extdata/gen_v26_mapping_w_entrez.csv` file. A more sophisticated tool may better fit your use case; however, we provide this functionality as we used it in the `tcga-data-nf` pipeline.
```{R}
-my_friend$geneNameToENSG(c("BRCA1","BRCA2"))
-my_friend$geneNameToEntrez(c("BRCA1","BRCA2"))
+my_coad_friend$geneNameToENSG(c("BRCA1","BRCA2"))
+my_coad_friend$geneNameToEntrez(c("BRCA1","BRCA2"))
```
Sometimes, the gene conversion will yield a one-to-many mapping. For this reason, it is important to use proper merging operations (`merge` in base R; the `left_join`,`right_join`,`inner_join`,`full_join` functions from `dplyr`) when annotating these conversions back to a dataset.
```{R}
-my_friend$geneNameToENSG("NOTCH4")
+my_coad_friend$geneNameToENSG("NOTCH4")
```
@@ -181,7 +186,7 @@ my_friend$geneNameToENSG("NOTCH4")
```{R}
-logTPM = my_friend$logTPMNormalization(expr_data)
+logTPM = my_coad_friend$logTPMNormalization(expr_data)
ls(logTPM)
```
@@ -200,7 +205,7 @@ brca1_logTPM
There is also a wrapper for logCPM normalization. This is based on the `edgeR` recipe for TMM normalization; see the package documentation for more details. Again, a pseudocount is added in the calculation. Unlike `logTPMNormalization`, which uses the `RangedSummarizedExperiment` as input, `logCPMNormalization` uses a matrix or data frame of raw expression counts as input. This can be accessed using the `assay` function of the `RangedSummarizedExperiment`.
```{R}
-logCPM = my_friend$logCPMNormalization(assay(expr_data))
+logCPM = my_coad_friend$logCPMNormalization(assay(expr_data))
ls(logCPM)
```
@@ -220,18 +225,92 @@ brca1_logCPM
```
-
+## Functions for working with methylation
-
-
-
-
-
-
-
+The methylation data in the TCGA COAD analysis are stored in a which rows correspond to CpG probes and columns correspond to samples. We load this matrix into a data frame for easy manipulation, using the `fread` function from `data.table` to speed the file input.
-## Functions for working with mutations
+```{R}
+meth_data = fread(meth_datapath, data.table = F)
+meth_data[1:5,1:5]
+```
-## Functions for working with methylation
+The data are provided as methylation beta values, ranging from 0 to 1. It is also common to work with "m values", which are the logit base 2 transformed beta values @du2010comparison. `NetworkDataCompanion` has the function `convertBetaToM` for converting such values.
+
+```{R}
+m_vals = my_coad_friend$convertBetaToM(meth_data[1:5,2])
+```
+
+When working with methylation data, there is often a need to identify probes within a certain range of a gene's transcription start site (TSS), in order to assess promoter methylation. The `mapProbesToGenes` function uses the public Illumina array manifests available at \url{https://zwdzwd.github.io/InfiniumAnnotation} to do this. Here, we perform this mapping for only the first 10,000 probes in the manifest for brevity, and define the promoter region as 1500 basepairs upstream to 200 basepairs downstream of the TSS.
+
+```{R}
+probe_map = my_coad_friend$mapProbesToGenes(probelist = meth_data$probeID[1:10000],
+ rangeUp = 1500,
+ rangeDown = 200)
+dim(probe_map)
+names(probe_map)
+```
+
+Any probe that is not within the specified `rangeUp` and `rangeDown` bounds of the TSS for an annotated gene in the manifest will have `NA` values for the columns `geneNames`,`ensemblID`, and `distToTSS`. We can look at just the probes mapping to genes by filtering out `NA` values. Here, only the first 10 genes are shown for brevity. Use the scroll bar at the bottom of the table to see all four columns.
+
+```{R}
+probe_map %>%
+ dplyr::filter(!is.na(geneNames)) %>%
+ slice_head(n=10) %>%
+ knitr::kable()
+```
+
+
+In many applications, it is necessary to combine a set of probe methylation values to create a summary measure (e.g., to assess overall promoter methylation). `NetworkDataCompanion` currently provides functionality for calculating the mean methylation within a promoter region defined based on the output of the `mapProbesToGenes` function. For example, our probe map above has two probes within 1500 bp upstream and 200bp downstream of the TSS for CTBP1:
+
+```{R}
+probe_map %>% dplyr::filter(grepl("CTBP1",geneNames))
+```
+
+We can look at the individual methylation at these sites for a few people:
+
+```{R}
+meth_data %>% dplyr::filter(probeID %in% c("cg00148286","cg00070460")) %>%
+ dplyr::select(1:5) %>%
+ kable()
+```
+
+Next, we average these values using the `probeToMeanPromoterMethylation` function:
+
+```{R}
+ctbp1_meth=my_coad_friend$probeToMeanPromoterMethylation(methylation_betas = meth_data,
+ probe_gene_map = probe_map,
+ genesOfInterest = c("CTBP1")) %>%
+ data.frame() %>%
+ dplyr::filter(X1 != "CTBP1") %>%
+ dplyr::rename("CTBP1"=X1)
+head(ctbp1_meth,n=5) %>%
+ kable()
+```
+We see, for example, that the first value is 0.8531883 = (0.7906715 + 0.9157052)/2, the average of sites cg00148286 and cg00070460 for sample 7318cda6-0b88-416a-afdc-7c36628e7c82.
+
+`estimateCellCountsEpiSCORE`
+
+In the bulk methylation data we study in TCGA, cell type composition may be confounded with observed methylation values. We provide a wrapper around the EpiSCORE method (@teschendorff2020episcore), which performs an *in silico* deconvolution of bulk methylation data for certain tissue types.
+
+```{R}
+estimated_cell_counts = my_coad_friend$estimateCellCountsEpiSCORE(methylation_betas = meth_data, tissue = "Colon", array = "450k")
+head(estimated_cell_counts) %>% kable()
+```
+
+From a boxplot, we can assess the overall distributions of the estimated cell type proportions:
+
+```{R boxplot_cell_counts}
+estimated_cell_counts %>% pivot_longer(cols=c("EC", "Epi","Lym", "Mye", "Stromal"),
+ names_to="CellType",
+ values_to="Proportion") %>%
+ ggplot(aes(x = CellType, y = Proportion)) +
+ geom_boxplot()
+```
+
+# Conclusions
+
+Thank you for using `NetworkDataCompanion`!
+
+If you have any issues, please raise them as GitHub issues and/or contact the maintainer directly. This package is currently maintained by Kate Shutta (kshutta@hsph.harvard.edu).
# References
diff --git a/quickstart.bib b/quickstart.bib
index 35fb924..a864888 100644
--- a/quickstart.bib
+++ b/quickstart.bib
@@ -14,4 +14,24 @@ @data{fanfaniDataverse
version = {V1},
doi = {10.7910/DVN/MCSSYJ},
url = {https://doi.org/10.7910/DVN/MCSSYJ}
+}
+
+@article{du2010comparison,
+ title={Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis},
+ author={Du, Pan and Zhang, Xiao and Huang, Chiang-Ching and Jafari, Nadereh and Kibbe, Warren A and Hou, Lifang and Lin, Simon M},
+ journal={BMC bioinformatics},
+ volume={11},
+ pages={1--9},
+ year={2010},
+ publisher={Springer}
+}
+
+@article{teschendorff2020episcore,
+ title={EPISCORE: cell type deconvolution of bulk tissue DNA methylomes from single-cell RNA-Seq data},
+ author={Teschendorff, Andrew E and Zhu, Tianyu and Breeze, Charles E and Beck, Stephan},
+ journal={Genome biology},
+ volume={21},
+ pages={1--33},
+ year={2020},
+ publisher={Springer}
}
\ No newline at end of file
From 519ce3df3fa75a1105d7d5804bbb08c543be2017 Mon Sep 17 00:00:00 2001
From: katehoffshutta <43797774+katehoffshutta@users.noreply.github.com>
Date: Tue, 29 Apr 2025 05:55:13 -0400
Subject: [PATCH 3/3] update quickstart html
---
quickstart.Rmd | 46 +-
quickstart.html | 2384 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 2408 insertions(+), 22 deletions(-)
create mode 100644 quickstart.html
diff --git a/quickstart.Rmd b/quickstart.Rmd
index 6f1a861..f149931 100644
--- a/quickstart.Rmd
+++ b/quickstart.Rmd
@@ -53,16 +53,16 @@ Data acknowledgement: These data were generated by the TCGA Research Network: ht
# Working with the NetworkDataCompanion Object
-The most simple way to get started with `NetworkDataCompanion` is to create an oject using the `CreateNetworkDataCompanionObject` function with an empty argument. Because it is a Companion, we like to call it `my_coad_friend`:
+The most simple way to get started with `NetworkDataCompanion` is to create an object using the `CreateNetworkDataCompanionObject` function with an empty argument. Here, we call the object `ndc`:
```{R}
-my_coad_friend = NetworkDataCompanion::CreateNetworkDataCompanionObject()
+ndc = NetworkDataCompanion::CreateNetworkDataCompanionObject()
```
There are a few built-in attributes that will be accessible to the functions of the Companion:
```{R}
-ls(my_coad_friend)
+ls(ndc)
```
Of these, `gene_mapping` and `sample_type_mapping` are loaded with the empty constructor. For the other attributes, we must supply a patient data file and TCGA cancer type. For the COAD example, we can use the following:
@@ -75,24 +75,25 @@ patient_data = paste0(datapath,"data_download/clinical/clinical_patient_coad.csv
We input these to the constructor to load the relevant data into the Companion object:
```{R}
-my_coad_friend = NetworkDataCompanion::CreateNetworkDataCompanionObject(clinical_patient_file = patient_data,project_name = "COAD")
+ndc = NetworkDataCompanion::CreateNetworkDataCompanionObject(clinical_patient_file = patient_data,project_name = "COAD")
```
Now, we can see the other attributes. The project name helps us keep track of which cancer we are working with:
```{R}
-my_coad_friend$project_name
+ndc$project_name
```
The clinical patient data stores TCGA barcodes and a range of other patient-level information:
```{R}
-my_coad_friend$clinical_patient_data[1:5,1:5]
+ndc$clinical_patient_data[1:5,1:5]
```
+
The `TCGA_purities` attribute contains the information about tumor purity for each sample. Because these TCGA data are from bulk tissue, there may be some non-tumor cells in the each sample as well. Higher purity samples will give more accurate depiction of the tumor microenvironment and the user may wish to filter based on purity.
```{R}
-head(my_coad_friend$TCGA_purities)
+head(ndc$TCGA_purities)
```
# NetworkDataCompanion Functions
@@ -125,7 +126,7 @@ To extract metadata from the `RangedSummarizedExperiment`, we use the `extractSa
```{R}
-expr_metadata = my_coad_friend$extractSampleAndGeneInfo(expr_data)
+expr_metadata = ndc$extractSampleAndGeneInfo(expr_data)
ls(expr_metadata)
```
@@ -149,19 +150,19 @@ expr_metadata$rds_sample_info %>%
To get information about a set of genes, `NetworkDataCompanion` provides a wrapper function `getGeneInfo`.
```{R}
-my_coad_friend$getGeneInfo(c("BRCA1","BRCA2"))
+ndc$getGeneInfo(c("BRCA1","BRCA2"))
```
-These results can also be searched by Ensemble or Entrez gene IDs. The Ensemble IDs for BRCA1 are ENSG00000012048 and ENSG00000139618, respectively:
+These results can also be searched by Ensembl or Entrez gene IDs. The Ensembl IDs for BRCA1 are ENSG00000012048 and ENSG00000139618, respectively:
```{R}
-my_coad_friend$getGeneInfo(c("ENSG00000012048","ENSG00000139618"))
+ndc$getGeneInfo(c("ENSG00000012048","ENSG00000139618"))
```
And the Entrez IDs for BRCA1 and BRCA2 are 672 and 675.
```{R}
-my_coad_friend$getGeneInfo(c(672, 675))
+ndc$getGeneInfo(c(672, 675))
```
### Converting between gene identifiers
@@ -169,14 +170,14 @@ my_coad_friend$getGeneInfo(c(672, 675))
There are many, many tools for converting between gene identifiers, and ours has neither bells nor whistles. Our tool simply uses the hardcode Gencode v26 mapping found in the `inst/extdata/gen_v26_mapping_w_entrez.csv` file. A more sophisticated tool may better fit your use case; however, we provide this functionality as we used it in the `tcga-data-nf` pipeline.
```{R}
-my_coad_friend$geneNameToENSG(c("BRCA1","BRCA2"))
-my_coad_friend$geneNameToEntrez(c("BRCA1","BRCA2"))
+ndc$geneNameToENSG(c("BRCA1","BRCA2"))
+ndc$geneNameToEntrez(c("BRCA1","BRCA2"))
```
Sometimes, the gene conversion will yield a one-to-many mapping. For this reason, it is important to use proper merging operations (`merge` in base R; the `left_join`,`right_join`,`inner_join`,`full_join` functions from `dplyr`) when annotating these conversions back to a dataset.
```{R}
-my_coad_friend$geneNameToENSG("NOTCH4")
+ndc$geneNameToENSG("NOTCH4")
```
@@ -186,7 +187,7 @@ my_coad_friend$geneNameToENSG("NOTCH4")
```{R}
-logTPM = my_coad_friend$logTPMNormalization(expr_data)
+logTPM = ndc$logTPMNormalization(expr_data)
ls(logTPM)
```
@@ -205,7 +206,7 @@ brca1_logTPM
There is also a wrapper for logCPM normalization. This is based on the `edgeR` recipe for TMM normalization; see the package documentation for more details. Again, a pseudocount is added in the calculation. Unlike `logTPMNormalization`, which uses the `RangedSummarizedExperiment` as input, `logCPMNormalization` uses a matrix or data frame of raw expression counts as input. This can be accessed using the `assay` function of the `RangedSummarizedExperiment`.
```{R}
-logCPM = my_coad_friend$logCPMNormalization(assay(expr_data))
+logCPM = ndc$logCPMNormalization(assay(expr_data))
ls(logCPM)
```
@@ -237,13 +238,13 @@ meth_data[1:5,1:5]
The data are provided as methylation beta values, ranging from 0 to 1. It is also common to work with "m values", which are the logit base 2 transformed beta values @du2010comparison. `NetworkDataCompanion` has the function `convertBetaToM` for converting such values.
```{R}
-m_vals = my_coad_friend$convertBetaToM(meth_data[1:5,2])
+m_vals = ndc$convertBetaToM(meth_data[1:5,2])
```
When working with methylation data, there is often a need to identify probes within a certain range of a gene's transcription start site (TSS), in order to assess promoter methylation. The `mapProbesToGenes` function uses the public Illumina array manifests available at \url{https://zwdzwd.github.io/InfiniumAnnotation} to do this. Here, we perform this mapping for only the first 10,000 probes in the manifest for brevity, and define the promoter region as 1500 basepairs upstream to 200 basepairs downstream of the TSS.
```{R}
-probe_map = my_coad_friend$mapProbesToGenes(probelist = meth_data$probeID[1:10000],
+probe_map = ndc$mapProbesToGenes(probelist = meth_data$probeID[1:10000],
rangeUp = 1500,
rangeDown = 200)
dim(probe_map)
@@ -277,7 +278,7 @@ meth_data %>% dplyr::filter(probeID %in% c("cg00148286","cg00070460")) %>%
Next, we average these values using the `probeToMeanPromoterMethylation` function:
```{R}
-ctbp1_meth=my_coad_friend$probeToMeanPromoterMethylation(methylation_betas = meth_data,
+ctbp1_meth=ndc$probeToMeanPromoterMethylation(methylation_betas = meth_data,
probe_gene_map = probe_map,
genesOfInterest = c("CTBP1")) %>%
data.frame() %>%
@@ -286,6 +287,7 @@ ctbp1_meth=my_coad_friend$probeToMeanPromoterMethylation(methylation_betas = met
head(ctbp1_meth,n=5) %>%
kable()
```
+
We see, for example, that the first value is 0.8531883 = (0.7906715 + 0.9157052)/2, the average of sites cg00148286 and cg00070460 for sample 7318cda6-0b88-416a-afdc-7c36628e7c82.
`estimateCellCountsEpiSCORE`
@@ -293,7 +295,7 @@ We see, for example, that the first value is 0.8531883 = (0.7906715 + 0.9157052)
In the bulk methylation data we study in TCGA, cell type composition may be confounded with observed methylation values. We provide a wrapper around the EpiSCORE method (@teschendorff2020episcore), which performs an *in silico* deconvolution of bulk methylation data for certain tissue types.
```{R}
-estimated_cell_counts = my_coad_friend$estimateCellCountsEpiSCORE(methylation_betas = meth_data, tissue = "Colon", array = "450k")
+estimated_cell_counts = ndc$estimateCellCountsEpiSCORE(methylation_betas = meth_data, tissue = "Colon", array = "450k")
head(estimated_cell_counts) %>% kable()
```
@@ -311,6 +313,6 @@ estimated_cell_counts %>% pivot_longer(cols=c("EC", "Epi","Lym", "Mye", "Stromal
Thank you for using `NetworkDataCompanion`!
-If you have any issues, please raise them as GitHub issues and/or contact the maintainer directly. This package is currently maintained by Kate Shutta (kshutta@hsph.harvard.edu).
+If you have any issues, please raise them as GitHub issues and/or contact the maintainer directly. This package is currently maintained by Kate Hoff Shutta (kshutta@hsph.harvard.edu).
# References
diff --git a/quickstart.html b/quickstart.html
new file mode 100644
index 0000000..f18eb4f
--- /dev/null
+++ b/quickstart.html
@@ -0,0 +1,2384 @@
+
+
+
The NetworkDataCompanion R package is a collection of
+utilities for wrangling TCGA data. While it was developed to serve as
+the engine behind tcga-data-nf, the Nextflow workflow
+presented in Fanfani et al. (2024),
+NetworkDataCompanion is a stand-alone package that can be
+used independently. The primary goal of the package is to provide
+version-controlled and unit-tested functions for common tasks involved
+in network analyses of TCGA data, such as calculating gene-level summary
+measures of methylation data, normalizing RNA-seq data, and integrating
+multiple omics by mapping various file identifiers back to TCGA
+barcodes.
+
While none of these tasks are particularly complex,
+NetworkDataCompanion is motivated by their ubiquity. Beyond
+its role in tcga-data-nf, this package aims to:
+
+
Reduce “re-work” by eliminating the need for everyone to write a
+few lines of code to, e.g., map TCGA identifiers in their own
+scripts
+
Reduce the error-prone nature of the type of coding described in
+(1) by offering a centralized, maintained, version-controlled, and
+unit-tested resource for general use.
+
+
+
+
Preparing the workspace
+
The first step is to install NetworkDataCompanion from
+GitHub using the devtools library and the
+install_github function. Here, we have suppressed messages
+as there are quite a few.
Once you have installed the package, load the library into your
+workspace.
+
library(NetworkDataCompanion)
+
+
+
Download demonstration data
+
Once you have installed the package, you next need to download the
+data used in this quickstart.
+
These data are from the TCGA COAD analysis presented in Fanfani et al. (2024) and can be
+downloaded from the Harvard Dataverse here: https://doi.org/10.7910/DVN/MCSSYJ
+
For the script below to work, please place the
+tcga-data-supplement directory in the same working
+directory as this RMarkdown. Alternatively, you can modify the
+datapath variable below.
The most simple way to get started with
+NetworkDataCompanion is to create an object using the
+CreateNetworkDataCompanionObject function with an empty
+argument. Here, we call the object ndc:
Of these, gene_mapping and
+sample_type_mapping are loaded with the empty constructor.
+For the other attributes, we must supply a patient data file and TCGA
+cancer type. For the COAD example, we can use the following:
The TCGA_purities attribute contains the information
+about tumor purity for each sample. Because these TCGA data are from
+bulk tissue, there may be some non-tumor cells in the each sample as
+well. Higher purity samples will give more accurate depiction of the
+tumor microenvironment and the user may wish to filter based on
+purity.
+
head(ndc$TCGA_purities)
+
## Cancer_type ESTIMATE ABSOLUTE LUMP IHC CPE
+## TCGA-3L-AA1B-01A COAD 0.9025 NA 0.6238 0.70 0.7913
+## TCGA-4N-A93T-01A COAD 0.9752 NA 0.7915 0.75 0.7898
+## TCGA-4T-AA8H-01A COAD 0.9865 NA NA 0.60 0.8917
+## TCGA-5M-AAT4-01A COAD 0.9720 NA NA 0.70 0.8372
+## TCGA-5M-AAT5-01A COAD 0.9752 NA NA 0.75 0.8170
+## TCGA-5M-AAT6-01A COAD 0.7275 NA NA 0.80 0.6712
+
+
+
NetworkDataCompanion Functions
+
The tcga-data-nf workflow retrieves five different data
+types: clinical data, expression, mutations, copy number variation (CNV)
+data, and methylation. The clinical data is used in the Companion object
+constructor. The other data types can be wrangled with the Companion
+member functions.
The functions for working with gene expression in
+NetworkDataCompanion are generally wrappers around the functionality of
+the RangedSummarizedExperiment objects produced by
+recount3. For documentation on this type of object, please
+see ***.
+
First, we read in the expression data, which is downloaded as both an
+.rds file and as a .csv file in the Download phase of the
+tcga-data-nf pipeline. The .rds file contains the
+RangedSummarizedExperiment object, so this is the
+downloaded file that we work with here. The .csv file is provided for
+convenience if you have other applications that do not require the
+RangedSummarizedExperiment.
To extract metadata from the RangedSummarizedExperiment,
+we use the extractSampleAndGeneInfo method. This produces
+two dataframes: rds_gene_info, which stores gene
+identifiers and other gene-level metadata, and
+rds_sample_info, which stores sample identifiers such as
+TCGA barcodes, patient characteristics, and sample characteristics.
## gene_id_no_ver seqid source type start end score strand phase
+## 1 ENSG00000012048 chr17 HAVANA gene 43044295 43170245 NA - NA
+## 2 ENSG00000139618 chr13 HAVANA gene 32315474 32400266 NA + NA
+## gene_id gene_name gene_entrez gene_type
+## 1 ENSG00000012048.20 BRCA1 672 protein_coding
+## 2 ENSG00000139618.14 BRCA2 675 protein_coding
+
And the Entrez IDs for BRCA1 and BRCA2 are 672 and 675.
+
ndc$getGeneInfo(c(672, 675))
+
## gene_entrez seqid source type start end score strand phase
+## 1 672 chr17 HAVANA gene 43044295 43170245 NA - NA
+## 2 675 chr13 HAVANA gene 32315474 32400266 NA + NA
+## gene_id gene_name gene_type gene_id_no_ver
+## 1 ENSG00000012048.20 BRCA1 protein_coding ENSG00000012048
+## 2 ENSG00000139618.14 BRCA2 protein_coding ENSG00000139618
+
+
Converting between gene identifiers
+
There are many, many tools for converting between gene identifiers,
+and ours has neither bells nor whistles. Our tool simply uses the
+hardcode Gencode v26 mapping found in the
+inst/extdata/gen_v26_mapping_w_entrez.csv file. A more
+sophisticated tool may better fit your use case; however, we provide
+this functionality as we used it in the tcga-data-nf
+pipeline.
Sometimes, the gene conversion will yield a one-to-many mapping. For
+this reason, it is important to use proper merging operations
+(merge in base R; the
+left_join,right_join,inner_join,full_join
+functions from dplyr) when annotating these conversions
+back to a dataset.
+
ndc$geneNameToENSG("NOTCH4")
+
## [1] "[NetworkDataCompanion::getGeneInfo] In gene conversion, there was at least one one-to-many mapping (most probably from multiple ensembl IDs for the input)"
NetworkDataCompanion provides a wrapper for logTPM
+RNAseq normalization functions from recount3. A pseudocount
+is added in the calculation. The input to this function is the
+RangedSummarizedExperiment object:
The three items returned include the transformed counts, which have
+been processed by recount3::transform_counts, the logTPM
+(corresponding to log(TPM+1) with the pseudocount), the TPM (which is
+provided for use cases related to filtering by TPM, for example, if one
+would like to do something like remove genes with TPM < 1 in more
+than 20% of the samples). Rows are indexed by Ensembl gene ID. Let’s
+look at the values for BRCA1 (ENSG00000012048.20):
There is also a wrapper for logCPM normalization. This is based on
+the edgeR recipe for TMM normalization; see the package
+documentation for more details. Again, a pseudocount is added in the
+calculation. Unlike logTPMNormalization, which uses the
+RangedSummarizedExperiment as input,
+logCPMNormalization uses a matrix or data frame of raw
+expression counts as input. This can be accessed using the
+assay function of the
+RangedSummarizedExperiment.
Here, counts refer to the raw counts in the
+RangedSummarizedExperiment object. Note that, because these
+counts have not been transformed with
+recount3::transform_counts, they will not be coherent with
+the counts attribute of logTPMNormalization.
+This is expected behavior.
The methylation data in the TCGA COAD analysis are stored in a which
+rows correspond to CpG probes and columns correspond to samples. We load
+this matrix into a data frame for easy manipulation, using the
+fread function from data.table to speed the
+file input.
The data are provided as methylation beta values, ranging from 0 to
+1. It is also common to work with “m values”, which are the logit base 2
+transformed beta values Du et al. (2010).
+NetworkDataCompanion has the function
+convertBetaToM for converting such values.
+
m_vals = ndc$convertBetaToM(meth_data[1:5,2])
+
When working with methylation data, there is often a need to identify
+probes within a certain range of a gene’s transcription start site
+(TSS), in order to assess promoter methylation. The
+mapProbesToGenes function uses the public Illumina array
+manifests available at to do this. Here, we perform this mapping for
+only the first 10,000 probes in the manifest for brevity, and define the
+promoter region as 1500 basepairs upstream to 200 basepairs downstream
+of the TSS.
## [1] "[NetworkDataCompanion::mapProbesToGenes] Sourcing 450k probe annotation from https://zwdzwd.github.io/InfiniumAnnotation ..."
+## [1] "[NetworkDataCompanion::mapProbesToGenes] https://zhouserver.research.chop.edu/InfiniumAnnotation/20210615/HM450/HM450.hg38.manifest.gencode.v36.tsv.gz"
+## [1] "[NetworkDataCompanion::mapProbesToGenes] HG38, Gencode v36"
+## probeID geneNames ensemblID distToTSS
+## 1 NA NA NA NA
+## 2 NA NA NA NA
+## 3 NA NA NA NA
+## 4 NA NA NA NA
+## 5 NA NA NA NA
+## 6 NA NA NA NA
+## [1] "[NetworkDataCompanion::mapProbesToGenes] Processing probe number: 10000"
Any probe that is not within the specified rangeUp and
+rangeDown bounds of the TSS for an annotated gene in the
+manifest will have NA values for the columns
+geneNames,ensemblID, and
+distToTSS. We can look at just the probes mapping to genes
+by filtering out NA values. Here, only the first 10 genes
+are shown for brevity. Use the scroll bar at the bottom of the table to
+see all four columns.
In many applications, it is necessary to combine a set of probe
+methylation values to create a summary measure (e.g., to assess overall
+promoter methylation). NetworkDataCompanion currently
+provides functionality for calculating the mean methylation within a
+promoter region defined based on the output of the
+mapProbesToGenes function. For example, our probe map above
+has two probes within 1500 bp upstream and 200bp downstream of the TSS
+for CTBP1:
We see, for example, that the first value is 0.8531883 = (0.7906715 +
+0.9157052)/2, the average of sites cg00148286 and cg00070460 for sample
+7318cda6-0b88-416a-afdc-7c36628e7c82.
+
estimateCellCountsEpiSCORE
+
In the bulk methylation data we study in TCGA, cell type composition
+may be confounded with observed methylation values. We provide a wrapper
+around the EpiSCORE method (Teschendorff et al.
+(2020)), which
+performs an in silico deconvolution of bulk methylation data
+for certain tissue types.
## [1] "Done for regional gene group 1"
+## [1] "Done for regional gene group 2"
+## [1] "Done for regional gene group 3"
+## [1] "Done for regional gene group 4"
+## [1] "Done for regional gene group 5"
+## [1] "Done for regional gene group 6"
+
head(estimated_cell_counts) %>%kable()
+
+
+
+
+
+
+
+
+
+
+
+
+
UUID
+
TCGAbarcode
+
EC
+
Epi
+
Lym
+
Mye
+
Stromal
+
+
+
+
+
7318cda6-0b88-416a-afdc-7c36628e7c82
+
TCGA-AD-6895-01A
+
0.1233243
+
0.3151952
+
0.2559282
+
0.1675718
+
0.1379805
+
+
+
fa6b8806-1b2f-428c-906e-e910ec0af610
+
TCGA-AZ-4616-01A
+
0.1560313
+
0.3311448
+
0.2608134
+
0.1275436
+
0.1244670
+
+
+
cf3443f5-f79d-4a01-9963-f3e3ced15786
+
TCGA-A6-6653-01A
+
0.0480103
+
0.6213130
+
0.1820945
+
0.1485822
+
0.0000000
+
+
+
9a645f01-1ca4-49c0-bd29-fc3b0c6570c9
+
TCGA-F4-6570-01A
+
0.1323797
+
0.3881631
+
0.1994900
+
0.1946285
+
0.0853388
+
+
+
d39669b0-14df-480d-8134-1f262cad68de
+
TCGA-AZ-6601-01A
+
0.1532917
+
0.1826823
+
0.3001661
+
0.2878829
+
0.0759769
+
+
+
8fba567b-80f4-42f0-a012-ca6b2daa14d3
+
TCGA-G4-6628-01A
+
0.1397655
+
0.2135989
+
0.4203463
+
0.2014077
+
0.0248815
+
+
+
+
From a boxplot, we can assess the overall distributions of the
+estimated cell type proportions:
If you have any issues, please raise them as GitHub issues and/or
+contact the maintainer directly. This package is currently maintained by
+Kate Hoff Shutta (kshutta@hsph.harvard.edu).
+
+
+
References
+
+
+Du, Pan, Xiao Zhang, Chiang-Ching Huang, Nadereh Jafari, Warren A Kibbe,
+Lifang Hou, and Simon M Lin. 2010. “Comparison of Beta-Value and
+m-Value Methods for Quantifying Methylation Levels by Microarray
+Analysis.”BMC Bioinformatics 11: 1–9.
+
+
+Fanfani, Viola, Katherine H Shutta, Panagiotis Mandros, Jonas Fischer,
+Enakshi Saha, Soel Micheletti, Chen Chen, Marouen Ben Guebila, Camila M
+Lopes-Ramos, and John Quackenbush. 2024. “Reproducible Processing
+of TCGA Regulatory Networks.”bioRxiv.
+
+
+Teschendorff, Andrew E, Tianyu Zhu, Charles E Breeze, and Stephan Beck.
+2020. “EPISCORE: Cell Type Deconvolution of Bulk Tissue DNA
+Methylomes from Single-Cell RNA-Seq Data.”Genome
+Biology 21: 1–33.
+