QuackenbushLab · katehoffshutta · Apr 29, 2025 · Apr 24, 2025 · Apr 25, 2025 · Apr 29, 2025
diff --git a/quickstart.Rmd b/quickstart.Rmd
@@ -9,6 +9,11 @@ bibliography: quickstart.bib
 link-citations: true
 ---
 
+```{R echo=F}
+library(ggplot2)
+library(knitr)
+```
+
 # About NetworkDataCompanion
 
 The `NetworkDataCompanion` R package is a collection of utilities for wrangling TCGA data. While it was developed to serve as the engine behind `tcga-data-nf`, the Nextflow workflow presented in @fanfani2024reproducible, `NetworkDataCompanion` is a stand-alone package that can be used independently. The primary goal of the package is to provide version-controlled and unit-tested functions for common tasks involved in network analyses of TCGA data, such as calculating gene-level summary measures of methylation data, normalizing RNA-seq data, and integrating multiple omics by mapping various file identifiers back to TCGA barcodes. 
@@ -23,7 +28,7 @@ While none of these tasks are particularly complex, `NetworkDataCompanion` is mo
 
 The first step is to install `NetworkDataCompanion` from GitHub using the `devtools` library and the `install_github` function. Here, we have suppressed messages as there are quite a few. 
 
-```{R, message=F}
+```{R, message=F, eval=F}
 library(devtools)
 devtools::install_github("QuackenbushLab/NetworkDataCompanion")
 ```
@@ -48,16 +53,16 @@ Data acknowledgement: These data were generated by the TCGA Research Network: ht
 
 # Working with the NetworkDataCompanion Object
 
-The most simple way to get started with `NetworkDataCompanion` is to create an oject using the `CreateNetworkDataCompanionObject` function with an empty argument. Because it is a Companion, we like to call it `my_friend`: 
+The most simple way to get started with `NetworkDataCompanion` is to create an object using the `CreateNetworkDataCompanionObject` function with an empty argument. Here, we call the object `ndc`: 
 
 ```{R}
-my_friend = NetworkDataCompanion::CreateNetworkDataCompanionObject()
+ndc = NetworkDataCompanion::CreateNetworkDataCompanionObject()
 ```
 
 There are a few built-in attributes that will be accessible to the functions of the Companion:
 
 ```{R}
-ls(my_friend)
+ls(ndc)
 ```
 
 Of these, `gene_mapping` and `sample_type_mapping` are loaded with the empty constructor. For the other attributes, we must supply a patient data file and TCGA cancer type. For the COAD example, we can use the following:
@@ -70,24 +75,25 @@ patient_data = paste0(datapath,"data_download/clinical/clinical_patient_coad.csv
 We input these to the constructor to load the relevant data into the Companion object: 
 
 ```{R}
-my_coad_friend = NetworkDataCompanion::CreateNetworkDataCompanionObject(clinical_patient_file = patient_data,project_name = "COAD")
+ndc = NetworkDataCompanion::CreateNetworkDataCompanionObject(clinical_patient_file = patient_data,project_name = "COAD")
 ```
 
 Now, we can see the other attributes. The project name helps us keep track of which cancer we are working with:
 
 ```{R}
-my_coad_friend$project_name
+ndc$project_name
 ```
 
 The clinical patient data stores TCGA barcodes and a range of other patient-level information:
 
 ```{R}
-my_coad_friend$clinical_patient_data[1:5,1:5]
+ndc$clinical_patient_data[1:5,1:5]
 ```
+
 The `TCGA_purities` attribute contains the information about tumor purity for each sample. Because these TCGA data are from bulk tissue, there may be some non-tumor cells in the each sample as well. Higher purity samples will give more accurate depiction of the tumor microenvironment and the user may wish to filter based on purity.
 
 ```{R}
-head(my_coad_friend$TCGA_purities)
+head(ndc$TCGA_purities)
 ```
 
 # NetworkDataCompanion Functions
@@ -106,59 +112,207 @@ meth_datapath = paste0(datapath, "data_download/methylation/tcga_coad_cms1_methy
 
 ## Functions for working with expression
 
-<!-- ```{R} -->
-<!-- # Read RDS expression data -->
-<!-- test_exp_rds = readRDS(exp_data) -->
-<!-- ``` -->
+The functions for working with gene expression in NetworkDataCompanion are generally wrappers around the functionality of the `RangedSummarizedExperiment` objects produced by `recount3`. For documentation on this type of object, please see ***.
+
+First, we read in the expression data, which is downloaded as both an .rds file and as a .csv file in the Download phase of the `tcga-data-nf` pipeline. The .rds file contains the `RangedSummarizedExperiment` object, so this is the downloaded file that we work with here. The .csv file is provided for convenience if you have other applications that do not require the `RangedSummarizedExperiment`.
+
+```{R}
+expr_data = readRDS(exp_datapath)
+expr_data
+```
+
+To extract metadata from the `RangedSummarizedExperiment`, we use the `extractSampleAndGeneInfo` method. This produces two dataframes: `rds_gene_info`, which stores gene identifiers and other gene-level metadata, and `rds_sample_info`, which stores sample identifiers such as TCGA barcodes, patient characteristics, and sample characteristics.
 
 <!-- The function `extractSampleAndGeneInfo` provides a way to get metadata from the RangedSummarizedExperiment object that stores the expression data download. -->
 
-<!-- ```{R} -->
-<!-- rds_info = my_friend$extractSampleAndGeneInfo(test_exp_rds) -->
-<!-- rds_sample_info = rds_info$rds_sample_info -->
-<!-- rds_sample_info[1:3,1:3] -->
+```{R}
+expr_metadata = ndc$extractSampleAndGeneInfo(expr_data)
+ls(expr_metadata)
+```
+
+
+For example, if we wish to get the gene names and Ensembl IDs for the first five genes in the dataset, we can use:
+
+```{R}
+expr_metadata$rds_gene_info %>%
+  dplyr::select(gene_id,gene_name) %>%
+  slice_head(n=5)
+```
+
+Similarly, let's get the TCGA barcode and vital status of the patients who provided the first 5 samples:
+
+```{R}
+expr_metadata$rds_sample_info %>%
+  dplyr::select(tcga.tcga_barcode,tcga.gdc_cases.diagnoses.vital_status) %>%
+  slice_head(n=5)
+```
+
+To get information about a set of genes, `NetworkDataCompanion` provides a wrapper function `getGeneInfo`. 
+
+```{R}
+ndc$getGeneInfo(c("BRCA1","BRCA2"))
+```
+
+These results can also be searched by Ensembl or Entrez gene IDs. The Ensembl IDs for BRCA1 are ENSG00000012048 and ENSG00000139618, respectively:
+
+```{R}
+ndc$getGeneInfo(c("ENSG00000012048","ENSG00000139618"))
+```
+
+And the Entrez IDs for BRCA1 and BRCA2 are 672 and 675.
+
+```{R}
+ndc$getGeneInfo(c(672, 675))
+```
+
+### Converting between gene identifiers
+
+There are many, many tools for converting between gene identifiers, and ours has neither bells nor whistles. Our tool simply uses the hardcode Gencode v26 mapping found in the `inst/extdata/gen_v26_mapping_w_entrez.csv` file. A more sophisticated tool may better fit your use case; however, we provide this functionality as we used it in the `tcga-data-nf` pipeline.
+
+```{R}
+ndc$geneNameToENSG(c("BRCA1","BRCA2"))
+ndc$geneNameToEntrez(c("BRCA1","BRCA2"))
+```
+
+Sometimes, the gene conversion will yield a one-to-many mapping. For this reason, it is important to use proper merging operations (`merge` in base R; the `left_join`,`right_join`,`inner_join`,`full_join` functions from `dplyr`) when annotating these conversions back to a dataset.
+
+```{R}
+ndc$geneNameToENSG("NOTCH4")
+```
+
 
-<!-- rds_gene_info = rds_info$rds_gene_info -->
-<!-- rds_gene_info[1:3,1:3] -->
-<!-- ``` -->
+### Normalizing RNAseq data
 
-<!-- We can get information about a particular set of genes using either the names or the Ensembl IDs.  -->
+`NetworkDataCompanion` provides a wrapper for logTPM RNAseq normalization functions from `recount3`. A pseudocount is added in the calculation. The input to this function is the `RangedSummarizedExperiment` object:
 
-<!-- From gene names: -->
 
-<!-- ```{R} -->
-<!-- rds_gene_info$gene_name[1:3] -->
-<!-- gene_info_from_names = my_friend$getGeneInfo(rds_gene_info$gene_name) -->
-<!-- gene_info_from_names[1:3,] -->
-<!-- ``` -->
+```{R}
+logTPM = ndc$logTPMNormalization(expr_data)
+ls(logTPM)
+```
 
-<!-- From Ensembl IDs:  -->
+The three items returned include the transformed counts, which have been processed by `recount3::transform_counts`, the logTPM (corresponding to log(TPM+1) with the pseudocount), the TPM (which is provided for use cases related to filtering by TPM, for example, if one would like to do something like remove genes with TPM < 1 in more than 20% of the samples). Rows are indexed by Ensembl gene ID. Let's look at the values for BRCA1 (ENSG00000012048.20):
 
-<!-- ```{R} -->
-<!-- rds_gene_info$gene_id[1:3] -->
-<!-- genes_info_from_ids = my_friend$getGeneInfo(rds_gene_info$gene_id) -->
-<!-- genes_info_from_ids[1:3,] -->
-<!-- ``` -->
+```{R}
+brca1_counts=logTPM$counts[which(rownames(logTPM$counts)=="ENSG00000012048.20"),1:3]
+brca1_TPM=logTPM$TPM[which(rownames(logTPM$TPM)=="ENSG00000012048.20"),1:3]
+brca1_logTPM=logTPM$logTPM[which(rownames(logTPM$logTPM)=="ENSG00000012048.20"),1:3]
 
-<!-- We can also convert between gene name and Ensembl ID and vice versa: -->
+brca1_counts
+brca1_TPM
+brca1_logTPM
+```
 
-<!-- ```{R} -->
-<!-- genes_from_names_to_ids = data.frame(my_friend$geneNameToENSG(rds_gene_info$gene_name)) -->
-<!-- genes_from_ids_to_names = data.frame(my_friend$geneENSGToName(rds_gene_info$gene_id)) -->
-<!-- ``` -->
+There is also a wrapper for logCPM normalization. This is based on the `edgeR` recipe for TMM normalization; see the package documentation for more details. Again, a pseudocount is added in the calculation. Unlike `logTPMNormalization`, which uses the `RangedSummarizedExperiment` as input, `logCPMNormalization` uses a matrix or data frame of raw expression counts as input. This can be accessed using the `assay` function of the `RangedSummarizedExperiment`.
 
-<!-- A function for normalization takes a RangedSummarizedExperiment as input and returns three matrices: raw counts, TPM. and log(TPM + 1). -->
+```{R}
+logCPM = ndc$logCPMNormalization(assay(expr_data))
+ls(logCPM)
+```
 
-<!-- ```{R} -->
-<!-- # Normalize data -->
-<!-- test_exp_all = my_friend$logTPMNormalization(test_exp_rds) -->
-<!-- test_exp_count = test_exp_all$counts -->
-<!-- test_exp_tpm = test_exp_all$TPM -->
-<!-- test_exp_logtpm = test_exp_all$logTPM -->
-<!-- ``` -->
+Here, counts refer to the raw counts in the `RangedSummarizedExperiment` object. Note that, because these counts have not been transformed with `recount3::transform_counts`, they will not be coherent with the `counts` attribute of `logTPMNormalization`. This is expected behavior.
+
+We again investigate the values for BRCA1:
+
+
+```{R}
+brca1_counts=logCPM$counts[which(rownames(logCPM$counts)=="ENSG00000012048.20"),1:3]
+brca1_CPM=logCPM$CPM[which(rownames(logCPM$CPM)=="ENSG00000012048.20"),1:3]
+brca1_logCPM=logCPM$logCPM[which(rownames(logCPM$logCPM)=="ENSG00000012048.20"),1:3]
+
+brca1_counts
+brca1_CPM
+brca1_logCPM
+```
 
-## Functions for working with mutations
 
 ## Functions for working with methylation
 
+The methylation data in the TCGA COAD analysis are stored in a  which rows correspond to CpG probes and columns correspond to samples. We load this matrix into a data frame for easy manipulation, using the `fread` function from `data.table` to speed the file input.
+
+```{R}
+meth_data = fread(meth_datapath, data.table = F)
+meth_data[1:5,1:5]
+```
+
+The data are provided as methylation beta values, ranging from 0 to 1. It is also common to work with "m values", which are the logit base 2 transformed beta values @du2010comparison. `NetworkDataCompanion` has the function `convertBetaToM` for converting such values.
+
+```{R}
+m_vals = ndc$convertBetaToM(meth_data[1:5,2])
+```
+
+When working with methylation data, there is often a need to identify probes within a certain range of a gene's transcription start site (TSS), in order to assess promoter methylation. The `mapProbesToGenes` function uses the public Illumina array manifests available at \url{https://zwdzwd.github.io/InfiniumAnnotation} to do this. Here, we perform this mapping for only the first 10,000 probes in the manifest for brevity, and define the promoter region as 1500 basepairs upstream to 200 basepairs downstream of the TSS.
+
+```{R}
+probe_map = ndc$mapProbesToGenes(probelist = meth_data$probeID[1:10000],
+                                rangeUp = 1500,
+                                rangeDown = 200)
+dim(probe_map)
+names(probe_map)
+```
+
+Any probe that is not within the specified `rangeUp` and `rangeDown` bounds of the TSS for an annotated gene in the manifest will have `NA` values for the columns `geneNames`,`ensemblID`, and `distToTSS`. We can look at just the probes mapping to genes by filtering out `NA` values. Here, only the first 10 genes are shown for brevity. Use the scroll bar at the bottom of the table to see all four columns. 
+
+```{R}
+probe_map %>% 
+  dplyr::filter(!is.na(geneNames)) %>%
+  slice_head(n=10) %>%
+  knitr::kable() 
+```
+
+
+In many applications, it is necessary to combine a set of probe methylation values to create a summary measure (e.g., to assess overall promoter methylation). `NetworkDataCompanion` currently provides functionality for calculating the mean methylation within a promoter region defined based on the output of the `mapProbesToGenes` function. For example, our probe map above has two probes within 1500 bp upstream and 200bp downstream of the TSS for CTBP1:
+
+```{R}
+probe_map %>% dplyr::filter(grepl("CTBP1",geneNames))
+```
+
+We can look at the individual methylation at these sites for a few people:
+
+```{R}
+meth_data %>% dplyr::filter(probeID %in% c("cg00148286","cg00070460")) %>%
+  dplyr::select(1:5) %>%
+  kable()
+```
+
+Next, we average these values using the `probeToMeanPromoterMethylation` function:
+
+```{R}
+ctbp1_meth=ndc$probeToMeanPromoterMethylation(methylation_betas = meth_data,
+                                                         probe_gene_map = probe_map,
+                                                    genesOfInterest = c("CTBP1")) %>%
+  data.frame() %>%
+  dplyr::filter(X1 != "CTBP1") %>%
+  dplyr::rename("CTBP1"=X1)
+head(ctbp1_meth,n=5) %>%
+  kable()
+```
+
+We see, for example, that the first value is 0.8531883 = (0.7906715	+ 0.9157052)/2, the average of sites cg00148286 and cg00070460 for sample 7318cda6-0b88-416a-afdc-7c36628e7c82.
+
+`estimateCellCountsEpiSCORE`
+
+In the bulk methylation data we study in TCGA, cell type composition may be confounded with observed methylation values. We provide a wrapper around the EpiSCORE method (@teschendorff2020episcore), which performs an *in silico* deconvolution of bulk methylation data for certain tissue types.
+
+```{R}
+estimated_cell_counts = ndc$estimateCellCountsEpiSCORE(methylation_betas = meth_data, tissue = "Colon", array = "450k")
+head(estimated_cell_counts) %>% kable()
+```
+
+From a boxplot, we can assess the overall distributions of the estimated cell type proportions:
+
+```{R boxplot_cell_counts}
+estimated_cell_counts %>% pivot_longer(cols=c("EC", "Epi","Lym", "Mye", "Stromal"),
+                                       names_to="CellType",
+                                       values_to="Proportion") %>% 
+  ggplot(aes(x = CellType, y = Proportion)) + 
+  geom_boxplot()
+```
+
+# Conclusions
+
+Thank you for using `NetworkDataCompanion`!
+
+If you have any issues, please raise them as GitHub issues and/or contact the maintainer directly. This package is currently maintained by Kate Hoff Shutta (kshutta@hsph.harvard.edu).
+
 # References
diff --git a/quickstart.bib b/quickstart.bib
@@ -14,4 +14,24 @@ @data{fanfaniDataverse
 version = {V1},
 doi = {10.7910/DVN/MCSSYJ},
 url = {https://doi.org/10.7910/DVN/MCSSYJ}
+}
+
+@article{du2010comparison,
+  title={Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis},
+  author={Du, Pan and Zhang, Xiao and Huang, Chiang-Ching and Jafari, Nadereh and Kibbe, Warren A and Hou, Lifang and Lin, Simon M},
+  journal={BMC bioinformatics},
+  volume={11},
+  pages={1--9},
+  year={2010},
+  publisher={Springer}
+}
+
+@article{teschendorff2020episcore,
+  title={EPISCORE: cell type deconvolution of bulk tissue DNA methylomes from single-cell RNA-Seq data},
+  author={Teschendorff, Andrew E and Zhu, Tianyu and Breeze, Charles E and Beck, Stephan},
+  journal={Genome biology},
+  volume={21},
+  pages={1--33},
+  year={2020},
+  publisher={Springer}
 }
diff --git a/quickstart.html b/quickstart.html