diff --git a/quickstart.Rmd b/quickstart.Rmd index 7791dfc..f149931 100644 --- a/quickstart.Rmd +++ b/quickstart.Rmd @@ -9,6 +9,11 @@ bibliography: quickstart.bib link-citations: true --- +```{R echo=F} +library(ggplot2) +library(knitr) +``` + # About NetworkDataCompanion The `NetworkDataCompanion` R package is a collection of utilities for wrangling TCGA data. While it was developed to serve as the engine behind `tcga-data-nf`, the Nextflow workflow presented in @fanfani2024reproducible, `NetworkDataCompanion` is a stand-alone package that can be used independently. The primary goal of the package is to provide version-controlled and unit-tested functions for common tasks involved in network analyses of TCGA data, such as calculating gene-level summary measures of methylation data, normalizing RNA-seq data, and integrating multiple omics by mapping various file identifiers back to TCGA barcodes. @@ -23,7 +28,7 @@ While none of these tasks are particularly complex, `NetworkDataCompanion` is mo The first step is to install `NetworkDataCompanion` from GitHub using the `devtools` library and the `install_github` function. Here, we have suppressed messages as there are quite a few. -```{R, message=F} +```{R, message=F, eval=F} library(devtools) devtools::install_github("QuackenbushLab/NetworkDataCompanion") ``` @@ -48,16 +53,16 @@ Data acknowledgement: These data were generated by the TCGA Research Network: ht # Working with the NetworkDataCompanion Object -The most simple way to get started with `NetworkDataCompanion` is to create an oject using the `CreateNetworkDataCompanionObject` function with an empty argument. Because it is a Companion, we like to call it `my_friend`: +The most simple way to get started with `NetworkDataCompanion` is to create an object using the `CreateNetworkDataCompanionObject` function with an empty argument. Here, we call the object `ndc`: ```{R} -my_friend = NetworkDataCompanion::CreateNetworkDataCompanionObject() +ndc = NetworkDataCompanion::CreateNetworkDataCompanionObject() ``` There are a few built-in attributes that will be accessible to the functions of the Companion: ```{R} -ls(my_friend) +ls(ndc) ``` Of these, `gene_mapping` and `sample_type_mapping` are loaded with the empty constructor. For the other attributes, we must supply a patient data file and TCGA cancer type. For the COAD example, we can use the following: @@ -70,24 +75,25 @@ patient_data = paste0(datapath,"data_download/clinical/clinical_patient_coad.csv We input these to the constructor to load the relevant data into the Companion object: ```{R} -my_coad_friend = NetworkDataCompanion::CreateNetworkDataCompanionObject(clinical_patient_file = patient_data,project_name = "COAD") +ndc = NetworkDataCompanion::CreateNetworkDataCompanionObject(clinical_patient_file = patient_data,project_name = "COAD") ``` Now, we can see the other attributes. The project name helps us keep track of which cancer we are working with: ```{R} -my_coad_friend$project_name +ndc$project_name ``` The clinical patient data stores TCGA barcodes and a range of other patient-level information: ```{R} -my_coad_friend$clinical_patient_data[1:5,1:5] +ndc$clinical_patient_data[1:5,1:5] ``` + The `TCGA_purities` attribute contains the information about tumor purity for each sample. Because these TCGA data are from bulk tissue, there may be some non-tumor cells in the each sample as well. Higher purity samples will give more accurate depiction of the tumor microenvironment and the user may wish to filter based on purity. ```{R} -head(my_coad_friend$TCGA_purities) +head(ndc$TCGA_purities) ``` # NetworkDataCompanion Functions @@ -106,59 +112,207 @@ meth_datapath = paste0(datapath, "data_download/methylation/tcga_coad_cms1_methy ## Functions for working with expression - - - - +The functions for working with gene expression in NetworkDataCompanion are generally wrappers around the functionality of the `RangedSummarizedExperiment` objects produced by `recount3`. For documentation on this type of object, please see ***. + +First, we read in the expression data, which is downloaded as both an .rds file and as a .csv file in the Download phase of the `tcga-data-nf` pipeline. The .rds file contains the `RangedSummarizedExperiment` object, so this is the downloaded file that we work with here. The .csv file is provided for convenience if you have other applications that do not require the `RangedSummarizedExperiment`. + +```{R} +expr_data = readRDS(exp_datapath) +expr_data +``` + +To extract metadata from the `RangedSummarizedExperiment`, we use the `extractSampleAndGeneInfo` method. This produces two dataframes: `rds_gene_info`, which stores gene identifiers and other gene-level metadata, and `rds_sample_info`, which stores sample identifiers such as TCGA barcodes, patient characteristics, and sample characteristics. - - - - +```{R} +expr_metadata = ndc$extractSampleAndGeneInfo(expr_data) +ls(expr_metadata) +``` + + +For example, if we wish to get the gene names and Ensembl IDs for the first five genes in the dataset, we can use: + +```{R} +expr_metadata$rds_gene_info %>% + dplyr::select(gene_id,gene_name) %>% + slice_head(n=5) +``` + +Similarly, let's get the TCGA barcode and vital status of the patients who provided the first 5 samples: + +```{R} +expr_metadata$rds_sample_info %>% + dplyr::select(tcga.tcga_barcode,tcga.gdc_cases.diagnoses.vital_status) %>% + slice_head(n=5) +``` + +To get information about a set of genes, `NetworkDataCompanion` provides a wrapper function `getGeneInfo`. + +```{R} +ndc$getGeneInfo(c("BRCA1","BRCA2")) +``` + +These results can also be searched by Ensembl or Entrez gene IDs. The Ensembl IDs for BRCA1 are ENSG00000012048 and ENSG00000139618, respectively: + +```{R} +ndc$getGeneInfo(c("ENSG00000012048","ENSG00000139618")) +``` + +And the Entrez IDs for BRCA1 and BRCA2 are 672 and 675. + +```{R} +ndc$getGeneInfo(c(672, 675)) +``` + +### Converting between gene identifiers + +There are many, many tools for converting between gene identifiers, and ours has neither bells nor whistles. Our tool simply uses the hardcode Gencode v26 mapping found in the `inst/extdata/gen_v26_mapping_w_entrez.csv` file. A more sophisticated tool may better fit your use case; however, we provide this functionality as we used it in the `tcga-data-nf` pipeline. + +```{R} +ndc$geneNameToENSG(c("BRCA1","BRCA2")) +ndc$geneNameToEntrez(c("BRCA1","BRCA2")) +``` + +Sometimes, the gene conversion will yield a one-to-many mapping. For this reason, it is important to use proper merging operations (`merge` in base R; the `left_join`,`right_join`,`inner_join`,`full_join` functions from `dplyr`) when annotating these conversions back to a dataset. + +```{R} +ndc$geneNameToENSG("NOTCH4") +``` + - - - +### Normalizing RNAseq data - +`NetworkDataCompanion` provides a wrapper for logTPM RNAseq normalization functions from `recount3`. A pseudocount is added in the calculation. The input to this function is the `RangedSummarizedExperiment` object: - - - - - - +```{R} +logTPM = ndc$logTPMNormalization(expr_data) +ls(logTPM) +``` - +The three items returned include the transformed counts, which have been processed by `recount3::transform_counts`, the logTPM (corresponding to log(TPM+1) with the pseudocount), the TPM (which is provided for use cases related to filtering by TPM, for example, if one would like to do something like remove genes with TPM < 1 in more than 20% of the samples). Rows are indexed by Ensembl gene ID. Let's look at the values for BRCA1 (ENSG00000012048.20): - - - - - +```{R} +brca1_counts=logTPM$counts[which(rownames(logTPM$counts)=="ENSG00000012048.20"),1:3] +brca1_TPM=logTPM$TPM[which(rownames(logTPM$TPM)=="ENSG00000012048.20"),1:3] +brca1_logTPM=logTPM$logTPM[which(rownames(logTPM$logTPM)=="ENSG00000012048.20"),1:3] - +brca1_counts +brca1_TPM +brca1_logTPM +``` - - - - +There is also a wrapper for logCPM normalization. This is based on the `edgeR` recipe for TMM normalization; see the package documentation for more details. Again, a pseudocount is added in the calculation. Unlike `logTPMNormalization`, which uses the `RangedSummarizedExperiment` as input, `logCPMNormalization` uses a matrix or data frame of raw expression counts as input. This can be accessed using the `assay` function of the `RangedSummarizedExperiment`. - +```{R} +logCPM = ndc$logCPMNormalization(assay(expr_data)) +ls(logCPM) +``` - - - - - - - +Here, counts refer to the raw counts in the `RangedSummarizedExperiment` object. Note that, because these counts have not been transformed with `recount3::transform_counts`, they will not be coherent with the `counts` attribute of `logTPMNormalization`. This is expected behavior. + +We again investigate the values for BRCA1: + + +```{R} +brca1_counts=logCPM$counts[which(rownames(logCPM$counts)=="ENSG00000012048.20"),1:3] +brca1_CPM=logCPM$CPM[which(rownames(logCPM$CPM)=="ENSG00000012048.20"),1:3] +brca1_logCPM=logCPM$logCPM[which(rownames(logCPM$logCPM)=="ENSG00000012048.20"),1:3] + +brca1_counts +brca1_CPM +brca1_logCPM +``` -## Functions for working with mutations ## Functions for working with methylation +The methylation data in the TCGA COAD analysis are stored in a which rows correspond to CpG probes and columns correspond to samples. We load this matrix into a data frame for easy manipulation, using the `fread` function from `data.table` to speed the file input. + +```{R} +meth_data = fread(meth_datapath, data.table = F) +meth_data[1:5,1:5] +``` + +The data are provided as methylation beta values, ranging from 0 to 1. It is also common to work with "m values", which are the logit base 2 transformed beta values @du2010comparison. `NetworkDataCompanion` has the function `convertBetaToM` for converting such values. + +```{R} +m_vals = ndc$convertBetaToM(meth_data[1:5,2]) +``` + +When working with methylation data, there is often a need to identify probes within a certain range of a gene's transcription start site (TSS), in order to assess promoter methylation. The `mapProbesToGenes` function uses the public Illumina array manifests available at \url{https://zwdzwd.github.io/InfiniumAnnotation} to do this. Here, we perform this mapping for only the first 10,000 probes in the manifest for brevity, and define the promoter region as 1500 basepairs upstream to 200 basepairs downstream of the TSS. + +```{R} +probe_map = ndc$mapProbesToGenes(probelist = meth_data$probeID[1:10000], + rangeUp = 1500, + rangeDown = 200) +dim(probe_map) +names(probe_map) +``` + +Any probe that is not within the specified `rangeUp` and `rangeDown` bounds of the TSS for an annotated gene in the manifest will have `NA` values for the columns `geneNames`,`ensemblID`, and `distToTSS`. We can look at just the probes mapping to genes by filtering out `NA` values. Here, only the first 10 genes are shown for brevity. Use the scroll bar at the bottom of the table to see all four columns. + +```{R} +probe_map %>% + dplyr::filter(!is.na(geneNames)) %>% + slice_head(n=10) %>% + knitr::kable() +``` + + +In many applications, it is necessary to combine a set of probe methylation values to create a summary measure (e.g., to assess overall promoter methylation). `NetworkDataCompanion` currently provides functionality for calculating the mean methylation within a promoter region defined based on the output of the `mapProbesToGenes` function. For example, our probe map above has two probes within 1500 bp upstream and 200bp downstream of the TSS for CTBP1: + +```{R} +probe_map %>% dplyr::filter(grepl("CTBP1",geneNames)) +``` + +We can look at the individual methylation at these sites for a few people: + +```{R} +meth_data %>% dplyr::filter(probeID %in% c("cg00148286","cg00070460")) %>% + dplyr::select(1:5) %>% + kable() +``` + +Next, we average these values using the `probeToMeanPromoterMethylation` function: + +```{R} +ctbp1_meth=ndc$probeToMeanPromoterMethylation(methylation_betas = meth_data, + probe_gene_map = probe_map, + genesOfInterest = c("CTBP1")) %>% + data.frame() %>% + dplyr::filter(X1 != "CTBP1") %>% + dplyr::rename("CTBP1"=X1) +head(ctbp1_meth,n=5) %>% + kable() +``` + +We see, for example, that the first value is 0.8531883 = (0.7906715 + 0.9157052)/2, the average of sites cg00148286 and cg00070460 for sample 7318cda6-0b88-416a-afdc-7c36628e7c82. + +`estimateCellCountsEpiSCORE` + +In the bulk methylation data we study in TCGA, cell type composition may be confounded with observed methylation values. We provide a wrapper around the EpiSCORE method (@teschendorff2020episcore), which performs an *in silico* deconvolution of bulk methylation data for certain tissue types. + +```{R} +estimated_cell_counts = ndc$estimateCellCountsEpiSCORE(methylation_betas = meth_data, tissue = "Colon", array = "450k") +head(estimated_cell_counts) %>% kable() +``` + +From a boxplot, we can assess the overall distributions of the estimated cell type proportions: + +```{R boxplot_cell_counts} +estimated_cell_counts %>% pivot_longer(cols=c("EC", "Epi","Lym", "Mye", "Stromal"), + names_to="CellType", + values_to="Proportion") %>% + ggplot(aes(x = CellType, y = Proportion)) + + geom_boxplot() +``` + +# Conclusions + +Thank you for using `NetworkDataCompanion`! + +If you have any issues, please raise them as GitHub issues and/or contact the maintainer directly. This package is currently maintained by Kate Hoff Shutta (kshutta@hsph.harvard.edu). + # References diff --git a/quickstart.bib b/quickstart.bib index 35fb924..a864888 100644 --- a/quickstart.bib +++ b/quickstart.bib @@ -14,4 +14,24 @@ @data{fanfaniDataverse version = {V1}, doi = {10.7910/DVN/MCSSYJ}, url = {https://doi.org/10.7910/DVN/MCSSYJ} +} + +@article{du2010comparison, + title={Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis}, + author={Du, Pan and Zhang, Xiao and Huang, Chiang-Ching and Jafari, Nadereh and Kibbe, Warren A and Hou, Lifang and Lin, Simon M}, + journal={BMC bioinformatics}, + volume={11}, + pages={1--9}, + year={2010}, + publisher={Springer} +} + +@article{teschendorff2020episcore, + title={EPISCORE: cell type deconvolution of bulk tissue DNA methylomes from single-cell RNA-Seq data}, + author={Teschendorff, Andrew E and Zhu, Tianyu and Breeze, Charles E and Beck, Stephan}, + journal={Genome biology}, + volume={21}, + pages={1--33}, + year={2020}, + publisher={Springer} } \ No newline at end of file diff --git a/quickstart.html b/quickstart.html new file mode 100644 index 0000000..f18eb4f --- /dev/null +++ b/quickstart.html @@ -0,0 +1,2384 @@ + + + + + + + + + + + NetworkDataCompanion Quickstart Guide + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+
+ + + +
+
+
+ + + +
+
+ +
+ + +
+ + +

NetworkDataCompanion Quickstart Guide

+ + + + + + + +
+

About NetworkDataCompanion

+

The NetworkDataCompanion R package is a collection of +utilities for wrangling TCGA data. While it was developed to serve as +the engine behind tcga-data-nf, the Nextflow workflow +presented in Fanfani et al. (2024), +NetworkDataCompanion is a stand-alone package that can be +used independently. The primary goal of the package is to provide +version-controlled and unit-tested functions for common tasks involved +in network analyses of TCGA data, such as calculating gene-level summary +measures of methylation data, normalizing RNA-seq data, and integrating +multiple omics by mapping various file identifiers back to TCGA +barcodes.

+

While none of these tasks are particularly complex, +NetworkDataCompanion is motivated by their ubiquity. Beyond +its role in tcga-data-nf, this package aims to:

+
    +
  1. Reduce “re-work” by eliminating the need for everyone to write a +few lines of code to, e.g., map TCGA identifiers in their own +scripts

  2. +
  3. Reduce the error-prone nature of the type of coding described in +(1) by offering a centralized, maintained, version-controlled, and +unit-tested resource for general use.

  4. +
+
+
+

Preparing the workspace

+

The first step is to install NetworkDataCompanion from +GitHub using the devtools library and the +install_github function. Here, we have suppressed messages +as there are quite a few.

+
library(devtools)
+devtools::install_github("QuackenbushLab/NetworkDataCompanion")
+

Once you have installed the package, load the library into your +workspace.

+
library(NetworkDataCompanion)
+
+
+

Download demonstration data

+

Once you have installed the package, you next need to download the +data used in this quickstart.

+

These data are from the TCGA COAD analysis presented in Fanfani et al. (2024) and can be +downloaded from the Harvard Dataverse here: https://doi.org/10.7910/DVN/MCSSYJ

+

For the script below to work, please place the +tcga-data-supplement directory in the same working +directory as this RMarkdown. Alternatively, you can modify the +datapath variable below.

+

For reference on the source of the data, the full workflow of the +COAD analysis is available at https://github.com/QuackenbushLab/tcga-data-supplement/.

+

Data acknowledgement: These data were generated by the TCGA Research +Network: https://www.cancer.gov/tcga.

+
+
+

Working with the NetworkDataCompanion Object

+

The most simple way to get started with +NetworkDataCompanion is to create an object using the +CreateNetworkDataCompanionObject function with an empty +argument. Here, we call the object ndc:

+
ndc = NetworkDataCompanion::CreateNetworkDataCompanionObject()
+

There are a few built-in attributes that will be accessible to the +functions of the Companion:

+
ls(ndc)
+
## [1] "clinical_patient_data" "gene_mapping"          "project_name"         
+## [4] "sample_type_mapping"   "TCGA_purities"
+

Of these, gene_mapping and +sample_type_mapping are loaded with the empty constructor. +For the other attributes, we must supply a patient data file and TCGA +cancer type. For the COAD example, we can use the following:

+
datapath = "tcga-data-supplement/data/processed/batch-coad-subtype-20240510/tcga_coad_cms1/"
+patient_data = paste0(datapath,"data_download/clinical/clinical_patient_coad.csv")
+

We input these to the constructor to load the relevant data into the +Companion object:

+
ndc = NetworkDataCompanion::CreateNetworkDataCompanionObject(clinical_patient_file = patient_data,project_name = "COAD")
+

Now, we can see the other attributes. The project name helps us keep +track of which cancer we are working with:

+
ndc$project_name
+
## [1] "COAD"
+

The clinical patient data stores TCGA barcodes and a range of other +patient-level information:

+
ndc$clinical_patient_data[1:5,1:5]
+
##                       bcr_patient_uuid bcr_patient_barcode form_completion_date
+## 3 A94E1279-A975-480A-93E9-7B1FF05CBCBF        TCGA-3L-AA1B            2014-4-22
+## 4 92554413-9EBC-4354-8E1B-9682F3A031D9        TCGA-4N-A93T            2014-10-1
+## 5 A5E14ADD-1552-4606-9FFE-3A03BCF76640        TCGA-4T-AA8H             2014-6-5
+## 6 1136DD50-242A-4659-AAD4-C53F9E759BB3        TCGA-5M-AAT4            2015-1-27
+## 7 CE00896A-F7D2-4123-BB95-24CB6E53FC32        TCGA-5M-AAT6            2015-1-27
+##            histologic_diagnosis prospective_collection
+## 3          Colon Adenocarcinoma                    YES
+## 4          Colon Adenocarcinoma                    YES
+## 5 Colon Mucinous Adenocarcinoma                     NO
+## 6          Colon Adenocarcinoma                     NO
+## 7          Colon Adenocarcinoma                     NO
+

The TCGA_purities attribute contains the information +about tumor purity for each sample. Because these TCGA data are from +bulk tissue, there may be some non-tumor cells in the each sample as +well. Higher purity samples will give more accurate depiction of the +tumor microenvironment and the user may wish to filter based on +purity.

+
head(ndc$TCGA_purities)
+
##                  Cancer_type ESTIMATE ABSOLUTE   LUMP  IHC    CPE
+## TCGA-3L-AA1B-01A        COAD   0.9025       NA 0.6238 0.70 0.7913
+## TCGA-4N-A93T-01A        COAD   0.9752       NA 0.7915 0.75 0.7898
+## TCGA-4T-AA8H-01A        COAD   0.9865       NA     NA 0.60 0.8917
+## TCGA-5M-AAT4-01A        COAD   0.9720       NA     NA 0.70 0.8372
+## TCGA-5M-AAT5-01A        COAD   0.9752       NA     NA 0.75 0.8170
+## TCGA-5M-AAT6-01A        COAD   0.7275       NA     NA 0.80 0.6712
+
+
+

NetworkDataCompanion Functions

+

The tcga-data-nf workflow retrieves five different data +types: clinical data, expression, mutations, copy number variation (CNV) +data, and methylation. The clinical data is used in the Companion object +constructor. The other data types can be wrangled with the Companion +member functions.

+

First, we define paths to the data:

+
exp_datapath = paste0(datapath,"data_download/recount3/tcga_coad_cms1.rds") 
+mut_datapath = paste0(datapath,"data_download/mutations/tcga_coad_cms1_mutations.txt") 
+mut_pivot_datapath = paste0(datapath, "data_download/mutations/tcga_coad_cms1_mutations_pivot.csv")
+cnv_datapath = paste0(datapath, "data_download/cnv/tcga_coad_cms1.csv")
+meth_datapath = paste0(datapath, "data_download/methylation/tcga_coad_cms1_methylations.txt")
+
+

Functions for working with expression

+

The functions for working with gene expression in +NetworkDataCompanion are generally wrappers around the functionality of +the RangedSummarizedExperiment objects produced by +recount3. For documentation on this type of object, please +see ***.

+

First, we read in the expression data, which is downloaded as both an +.rds file and as a .csv file in the Download phase of the +tcga-data-nf pipeline. The .rds file contains the +RangedSummarizedExperiment object, so this is the +downloaded file that we work with here. The .csv file is provided for +convenience if you have other applications that do not require the +RangedSummarizedExperiment.

+
expr_data = readRDS(exp_datapath)
+expr_data
+
## class: RangedSummarizedExperiment 
+## dim: 63856 80 
+## metadata(8): time_created recount3_version ... annotation recount3_url
+## assays(1): raw_counts
+## rownames(63856): ENSG00000278704.1 ENSG00000277400.1 ...
+##   ENSG00000182484.15_PAR_Y ENSG00000227159.8_PAR_Y
+## rowData names(10): source type ... havana_gene tag
+## colnames(80): 196281bd-2c05-46d3-a53c-ead8639ef02d
+##   45cedf59-bc7a-4d86-8c2d-9a91ac141d30 ...
+##   bcd07f56-fa83-473f-b5d1-9c637299c7e2
+##   b444dc9f-6f2c-4cbc-9e1c-7b9e0fc790f2
+## colData names(937): rail_id external_id ... recount_seq_qc.errq
+##   BigWigURL
+

To extract metadata from the RangedSummarizedExperiment, +we use the extractSampleAndGeneInfo method. This produces +two dataframes: rds_gene_info, which stores gene +identifiers and other gene-level metadata, and +rds_sample_info, which stores sample identifiers such as +TCGA barcodes, patient characteristics, and sample characteristics.

+ +
expr_metadata = ndc$extractSampleAndGeneInfo(expr_data)
+ls(expr_metadata)
+
## [1] "rds_gene_info"   "rds_sample_info"
+

For example, if we wish to get the gene names and Ensembl IDs for the +first five genes in the dataset, we can use:

+
expr_metadata$rds_gene_info %>%
+  dplyr::select(gene_id,gene_name) %>%
+  slice_head(n=5)
+
##             gene_id  gene_name
+## 1 ENSG00000278704.1 BX004987.1
+## 2 ENSG00000277400.1 AC145212.2
+## 3 ENSG00000274847.1 AC145212.1
+## 4 ENSG00000277428.1      Y_RNA
+## 5 ENSG00000276256.1 AC011043.1
+

Similarly, let’s get the TCGA barcode and vital status of the +patients who provided the first 5 samples:

+
expr_metadata$rds_sample_info %>%
+  dplyr::select(tcga.tcga_barcode,tcga.gdc_cases.diagnoses.vital_status) %>%
+  slice_head(n=5)
+
##                                                 tcga.tcga_barcode
+## 196281bd-2c05-46d3-a53c-ead8639ef02d TCGA-AA-3845-01A-01R-1022-07
+## 45cedf59-bc7a-4d86-8c2d-9a91ac141d30 TCGA-F4-6570-01A-11R-1774-07
+## 771833ef-f9f3-43e9-ac9e-bf249333b257 TCGA-G4-6588-01A-11R-1774-07
+## 343f031b-0095-429d-8212-ad0581dbb05b TCGA-A6-3809-01A-01R-1022-07
+## aa49ac5d-9dc7-48f9-8498-deb9a3f0e0d0 TCGA-CM-5861-01A-01R-1653-07
+##                                      tcga.gdc_cases.diagnoses.vital_status
+## 196281bd-2c05-46d3-a53c-ead8639ef02d                                  dead
+## 45cedf59-bc7a-4d86-8c2d-9a91ac141d30                                  dead
+## 771833ef-f9f3-43e9-ac9e-bf249333b257                                 alive
+## 343f031b-0095-429d-8212-ad0581dbb05b                                 alive
+## aa49ac5d-9dc7-48f9-8498-deb9a3f0e0d0                                 alive
+

To get information about a set of genes, +NetworkDataCompanion provides a wrapper function +getGeneInfo.

+
ndc$getGeneInfo(c("BRCA1","BRCA2"))
+
##   gene_name seqid source type    start      end score strand phase
+## 1     BRCA1 chr17 HAVANA gene 43044295 43170245    NA      -    NA
+## 2     BRCA2 chr13 HAVANA gene 32315474 32400266    NA      +    NA
+##              gene_id gene_entrez      gene_type  gene_id_no_ver
+## 1 ENSG00000012048.20         672 protein_coding ENSG00000012048
+## 2 ENSG00000139618.14         675 protein_coding ENSG00000139618
+

These results can also be searched by Ensembl or Entrez gene IDs. The +Ensembl IDs for BRCA1 are ENSG00000012048 and ENSG00000139618, +respectively:

+
ndc$getGeneInfo(c("ENSG00000012048","ENSG00000139618"))
+
##    gene_id_no_ver seqid source type    start      end score strand phase
+## 1 ENSG00000012048 chr17 HAVANA gene 43044295 43170245    NA      -    NA
+## 2 ENSG00000139618 chr13 HAVANA gene 32315474 32400266    NA      +    NA
+##              gene_id gene_name gene_entrez      gene_type
+## 1 ENSG00000012048.20     BRCA1         672 protein_coding
+## 2 ENSG00000139618.14     BRCA2         675 protein_coding
+

And the Entrez IDs for BRCA1 and BRCA2 are 672 and 675.

+
ndc$getGeneInfo(c(672, 675))
+
##   gene_entrez seqid source type    start      end score strand phase
+## 1         672 chr17 HAVANA gene 43044295 43170245    NA      -    NA
+## 2         675 chr13 HAVANA gene 32315474 32400266    NA      +    NA
+##              gene_id gene_name      gene_type  gene_id_no_ver
+## 1 ENSG00000012048.20     BRCA1 protein_coding ENSG00000012048
+## 2 ENSG00000139618.14     BRCA2 protein_coding ENSG00000139618
+
+

Converting between gene identifiers

+

There are many, many tools for converting between gene identifiers, +and ours has neither bells nor whistles. Our tool simply uses the +hardcode Gencode v26 mapping found in the +inst/extdata/gen_v26_mapping_w_entrez.csv file. A more +sophisticated tool may better fit your use case; however, we provide +this functionality as we used it in the tcga-data-nf +pipeline.

+
ndc$geneNameToENSG(c("BRCA1","BRCA2"))
+
##   gene_name            gene_id  gene_id_no_ver
+## 1     BRCA1 ENSG00000012048.20 ENSG00000012048
+## 2     BRCA2 ENSG00000139618.14 ENSG00000139618
+
ndc$geneNameToEntrez(c("BRCA1","BRCA2"))
+
##   gene_name gene_entrez
+## 1     BRCA1         672
+## 2     BRCA2         675
+

Sometimes, the gene conversion will yield a one-to-many mapping. For +this reason, it is important to use proper merging operations +(merge in base R; the +left_join,right_join,inner_join,full_join +functions from dplyr) when annotating these conversions +back to a dataset.

+
ndc$geneNameToENSG("NOTCH4")
+
## [1] "[NetworkDataCompanion::getGeneInfo] In gene conversion, there was at least one one-to-many mapping (most probably from multiple ensembl IDs for the input)"
+
##   gene_name           gene_id  gene_id_no_ver
+## 1    NOTCH4 ENSG00000206312.7 ENSG00000206312
+## 2    NOTCH4 ENSG00000234876.5 ENSG00000234876
+## 3    NOTCH4 ENSG00000235396.5 ENSG00000235396
+## 4    NOTCH4 ENSG00000223355.5 ENSG00000223355
+## 5    NOTCH4 ENSG00000238196.5 ENSG00000238196
+## 6    NOTCH4 ENSG00000232339.5 ENSG00000232339
+## 7    NOTCH4 ENSG00000204301.6 ENSG00000204301
+
+
+

Normalizing RNAseq data

+

NetworkDataCompanion provides a wrapper for logTPM +RNAseq normalization functions from recount3. A pseudocount +is added in the calculation. The input to this function is the +RangedSummarizedExperiment object:

+
logTPM = ndc$logTPMNormalization(expr_data)
+ls(logTPM)
+
## [1] "counts" "logTPM" "TPM"
+

The three items returned include the transformed counts, which have +been processed by recount3::transform_counts, the logTPM +(corresponding to log(TPM+1) with the pseudocount), the TPM (which is +provided for use cases related to filtering by TPM, for example, if one +would like to do something like remove genes with TPM < 1 in more +than 20% of the samples). Rows are indexed by Ensembl gene ID. Let’s +look at the values for BRCA1 (ENSG00000012048.20):

+
brca1_counts=logTPM$counts[which(rownames(logTPM$counts)=="ENSG00000012048.20"),1:3]
+brca1_TPM=logTPM$TPM[which(rownames(logTPM$TPM)=="ENSG00000012048.20"),1:3]
+brca1_logTPM=logTPM$logTPM[which(rownames(logTPM$logTPM)=="ENSG00000012048.20"),1:3]
+
+brca1_counts
+
## 196281bd-2c05-46d3-a53c-ead8639ef02d 45cedf59-bc7a-4d86-8c2d-9a91ac141d30 
+##                                  561                                  570 
+## 771833ef-f9f3-43e9-ac9e-bf249333b257 
+##                                 1348
+
brca1_TPM
+
## 196281bd-2c05-46d3-a53c-ead8639ef02d 45cedf59-bc7a-4d86-8c2d-9a91ac141d30 
+##                             4.036036                             3.923258 
+## 771833ef-f9f3-43e9-ac9e-bf249333b257 
+##                             9.182649
+
brca1_logTPM
+
## 196281bd-2c05-46d3-a53c-ead8639ef02d 45cedf59-bc7a-4d86-8c2d-9a91ac141d30 
+##                             1.616619                             1.593970 
+## 771833ef-f9f3-43e9-ac9e-bf249333b257 
+##                             2.320685
+

There is also a wrapper for logCPM normalization. This is based on +the edgeR recipe for TMM normalization; see the package +documentation for more details. Again, a pseudocount is added in the +calculation. Unlike logTPMNormalization, which uses the +RangedSummarizedExperiment as input, +logCPMNormalization uses a matrix or data frame of raw +expression counts as input. This can be accessed using the +assay function of the +RangedSummarizedExperiment.

+
logCPM = ndc$logCPMNormalization(assay(expr_data))
+ls(logCPM)
+
## [1] "counts" "CPM"    "logCPM"
+

Here, counts refer to the raw counts in the +RangedSummarizedExperiment object. Note that, because these +counts have not been transformed with +recount3::transform_counts, they will not be coherent with +the counts attribute of logTPMNormalization. +This is expected behavior.

+

We again investigate the values for BRCA1:

+
brca1_counts=logCPM$counts[which(rownames(logCPM$counts)=="ENSG00000012048.20"),1:3]
+brca1_CPM=logCPM$CPM[which(rownames(logCPM$CPM)=="ENSG00000012048.20"),1:3]
+brca1_logCPM=logCPM$logCPM[which(rownames(logCPM$logCPM)=="ENSG00000012048.20"),1:3]
+
+brca1_counts
+
## 196281bd-2c05-46d3-a53c-ead8639ef02d 45cedf59-bc7a-4d86-8c2d-9a91ac141d30 
+##                                24368                                75884 
+## 771833ef-f9f3-43e9-ac9e-bf249333b257 
+##                               154683
+
brca1_CPM
+
## 196281bd-2c05-46d3-a53c-ead8639ef02d 45cedf59-bc7a-4d86-8c2d-9a91ac141d30 
+##                             13.09724                             14.56629 
+## 771833ef-f9f3-43e9-ac9e-bf249333b257 
+##                             34.11405
+
brca1_logCPM
+
## 196281bd-2c05-46d3-a53c-ead8639ef02d 45cedf59-bc7a-4d86-8c2d-9a91ac141d30 
+##                             2.645979                             2.745107 
+## 771833ef-f9f3-43e9-ac9e-bf249333b257 
+##                             3.558601
+
+
+
+

Functions for working with methylation

+

The methylation data in the TCGA COAD analysis are stored in a which +rows correspond to CpG probes and columns correspond to samples. We load +this matrix into a data frame for easy manipulation, using the +fread function from data.table to speed the +file input.

+
meth_data = fread(meth_datapath, data.table = F)
+meth_data[1:5,1:5]
+
##      probeID 7318cda6-0b88-416a-afdc-7c36628e7c82
+## 1 cg00000029                            0.2602616
+## 2 cg00000108                            0.9603123
+## 3 cg00000109                            0.9148485
+## 4 cg00000165                            0.5946983
+## 5 cg00000236                            0.9103589
+##   fa6b8806-1b2f-428c-906e-e910ec0af610 cf3443f5-f79d-4a01-9963-f3e3ced15786
+## 1                            0.2421992                            0.1009500
+## 2                            0.9643367                            0.9278799
+## 3                            0.9097119                            0.8865743
+## 4                            0.5318436                            0.8743420
+## 5                            0.9085887                            0.9469717
+##   9a645f01-1ca4-49c0-bd29-fc3b0c6570c9
+## 1                            0.2936026
+## 2                            0.9738917
+## 3                            0.9294175
+## 4                            0.5188614
+## 5                            0.9405403
+

The data are provided as methylation beta values, ranging from 0 to +1. It is also common to work with “m values”, which are the logit base 2 +transformed beta values Du et al. (2010). +NetworkDataCompanion has the function +convertBetaToM for converting such values.

+
m_vals = ndc$convertBetaToM(meth_data[1:5,2])
+

When working with methylation data, there is often a need to identify +probes within a certain range of a gene’s transcription start site +(TSS), in order to assess promoter methylation. The +mapProbesToGenes function uses the public Illumina array +manifests available at to do this. Here, we perform this mapping for +only the first 10,000 probes in the manifest for brevity, and define the +promoter region as 1500 basepairs upstream to 200 basepairs downstream +of the TSS.

+
probe_map = ndc$mapProbesToGenes(probelist = meth_data$probeID[1:10000],
+                                rangeUp = 1500,
+                                rangeDown = 200)
+
## [1] "[NetworkDataCompanion::mapProbesToGenes] Sourcing 450k probe annotation from https://zwdzwd.github.io/InfiniumAnnotation ..."
+## [1] "[NetworkDataCompanion::mapProbesToGenes] https://zhouserver.research.chop.edu/InfiniumAnnotation/20210615/HM450/HM450.hg38.manifest.gencode.v36.tsv.gz"
+## [1] "[NetworkDataCompanion::mapProbesToGenes] HG38, Gencode v36"
+##   probeID geneNames ensemblID distToTSS
+## 1      NA        NA        NA        NA
+## 2      NA        NA        NA        NA
+## 3      NA        NA        NA        NA
+## 4      NA        NA        NA        NA
+## 5      NA        NA        NA        NA
+## 6      NA        NA        NA        NA
+## [1] "[NetworkDataCompanion::mapProbesToGenes] Processing probe number: 10000"
+
dim(probe_map)
+
## [1] 10000     4
+
names(probe_map)
+
## [1] "probeID"   "geneNames" "ensemblID" "distToTSS"
+

Any probe that is not within the specified rangeUp and +rangeDown bounds of the TSS for an annotated gene in the +manifest will have NA values for the columns +geneNames,ensemblID, and +distToTSS. We can look at just the probes mapping to genes +by filtering out NA values. Here, only the first 10 genes +are shown for brevity. Use the scroll bar at the bottom of the table to +see all four columns.

+
probe_map %>% 
+  dplyr::filter(!is.na(geneNames)) %>%
+  slice_head(n=10) %>%
+  knitr::kable() 
+ ++++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
probeIDgeneNamesensemblIDdistToTSS
cg00381604MIR1302-2;MIR1302-2HG;MIR1302-2HG;WASH7PENST00000607096.1;ENST00000469289.1;ENST00000473358.1;ENST00000488147.1-931;-832;-119;136
cg00111697AL669831.5;FAM41C;FAM41CENST00000635557.1;ENST00000427857.1;ENST00000446136.1-145;-577;-476
cg00328058PLEKHN1;PLEKHN1;PLEKHN1ENST00000379407.7;ENST00000379409.6;ENST00000379410.8-820;-820;-800
cg00447632HES4;HES4;HES4;HES4;ISG15;ISG15ENST00000304952.11;ENST00000428771.6;ENST00000481869.1;ENST00000484667.2;ENST00000624652.1;ENST00000624697.4-383;-308;-384;-499;-664;-657
cg00152708AL390719.2;MIR429ENST00000606993.1;ENST00000362106.1-421;-69
cg00086243TNFRSF18;TNFRSF18;TNFRSF18;TNFRSF18ENST00000328596.10;ENST00000379265.5;ENST00000379268.7;ENST00000486728.5-453;-453;-432;-1344
cg00062072SDF4ENST00000494748.1-952
cg00262415ACAP3;PUSL1;PUSL1;PUSL1;PUSL1ENST00000492936.5;ENST00000379031.10;ENST00000463758.1;ENST00000467712.1;ENST00000470520.5-1422;-1245;-1436;-1469;-1259
cg00086266PUSL1;PUSL1;PUSL1;PUSL1ENST00000379031.10;ENST00000463758.1;ENST00000467712.1;ENST00000470520.5-1082;-1273;-1306;-1096
cg00040588MXRA8;MXRA8;MXRA8;MXRA8;MXRA8;MXRA8ENST00000460473.1;ENST00000464351.1;ENST00000473097.5;ENST00000474033.5;ENST00000476718.1;ENST00000478517.5181;181;-857;-1067;-315;-1084
+

In many applications, it is necessary to combine a set of probe +methylation values to create a summary measure (e.g., to assess overall +promoter methylation). NetworkDataCompanion currently +provides functionality for calculating the mean methylation within a +promoter region defined based on the output of the +mapProbesToGenes function. For example, our probe map above +has two probes within 1500 bp upstream and 200bp downstream of the TSS +for CTBP1:

+
probe_map %>% dplyr::filter(grepl("CTBP1",geneNames))
+
##      probeID geneNames         ensemblID distToTSS
+## 1 cg00148286     CTBP1 ENST00000511907.1       130
+## 2 cg00070460     CTBP1 ENST00000514669.5      -663
+

We can look at the individual methylation at these sites for a few +people:

+
meth_data %>% dplyr::filter(probeID %in% c("cg00148286","cg00070460")) %>%
+  dplyr::select(1:5) %>%
+  kable()
+ +++++++ + + + + + + + + + + + + + + + + + + + + + + + + + +
probeID7318cda6-0b88-416a-afdc-7c36628e7c82fa6b8806-1b2f-428c-906e-e910ec0af610cf3443f5-f79d-4a01-9963-f3e3ced157869a645f01-1ca4-49c0-bd29-fc3b0c6570c9
cg000704600.79067150.75063740.84850950.7844562
cg001482860.91570520.90635790.96335610.8701251
+

Next, we average these values using the +probeToMeanPromoterMethylation function:

+
ctbp1_meth=ndc$probeToMeanPromoterMethylation(methylation_betas = meth_data,
+                                                         probe_gene_map = probe_map,
+                                                    genesOfInterest = c("CTBP1")) %>%
+  data.frame() %>%
+  dplyr::filter(X1 != "CTBP1") %>%
+  dplyr::rename("CTBP1"=X1)
+head(ctbp1_meth,n=5) %>%
+  kable()
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
CTBP1
7318cda6-0b88-416a-afdc-7c36628e7c820.8531883
fa6b8806-1b2f-428c-906e-e910ec0af6100.8284977
cf3443f5-f79d-4a01-9963-f3e3ced157860.9059328
9a645f01-1ca4-49c0-bd29-fc3b0c6570c90.8272906
d39669b0-14df-480d-8134-1f262cad68de0.8814039
+

We see, for example, that the first value is 0.8531883 = (0.7906715 + +0.9157052)/2, the average of sites cg00148286 and cg00070460 for sample +7318cda6-0b88-416a-afdc-7c36628e7c82.

+

estimateCellCountsEpiSCORE

+

In the bulk methylation data we study in TCGA, cell type composition +may be confounded with observed methylation values. We provide a wrapper +around the EpiSCORE method (Teschendorff et al. +(2020)), which +performs an in silico deconvolution of bulk methylation data +for certain tissue types.

+
estimated_cell_counts = ndc$estimateCellCountsEpiSCORE(methylation_betas = meth_data, tissue = "Colon", array = "450k")
+
## [1] "Done for regional gene group 1"
+## [1] "Done for regional gene group 2"
+## [1] "Done for regional gene group 3"
+## [1] "Done for regional gene group 4"
+## [1] "Done for regional gene group 5"
+## [1] "Done for regional gene group 6"
+
head(estimated_cell_counts) %>% kable()
+ +++++++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
UUIDTCGAbarcodeECEpiLymMyeStromal
7318cda6-0b88-416a-afdc-7c36628e7c82TCGA-AD-6895-01A0.12332430.31519520.25592820.16757180.1379805
fa6b8806-1b2f-428c-906e-e910ec0af610TCGA-AZ-4616-01A0.15603130.33114480.26081340.12754360.1244670
cf3443f5-f79d-4a01-9963-f3e3ced15786TCGA-A6-6653-01A0.04801030.62131300.18209450.14858220.0000000
9a645f01-1ca4-49c0-bd29-fc3b0c6570c9TCGA-F4-6570-01A0.13237970.38816310.19949000.19462850.0853388
d39669b0-14df-480d-8134-1f262cad68deTCGA-AZ-6601-01A0.15329170.18268230.30016610.28788290.0759769
8fba567b-80f4-42f0-a012-ca6b2daa14d3TCGA-G4-6628-01A0.13976550.21359890.42034630.20140770.0248815
+

From a boxplot, we can assess the overall distributions of the +estimated cell type proportions:

+
estimated_cell_counts %>% pivot_longer(cols=c("EC", "Epi","Lym", "Mye", "Stromal"),
+                                       names_to="CellType",
+                                       values_to="Proportion") %>% 
+  ggplot(aes(x = CellType, y = Proportion)) + 
+  geom_boxplot()
+

+
+
+
+

Conclusions

+

Thank you for using NetworkDataCompanion!

+

If you have any issues, please raise them as GitHub issues and/or +contact the maintainer directly. This package is currently maintained by +Kate Hoff Shutta ().

+
+
+

References

+
+
+Du, Pan, Xiao Zhang, Chiang-Ching Huang, Nadereh Jafari, Warren A Kibbe, +Lifang Hou, and Simon M Lin. 2010. “Comparison of Beta-Value and +m-Value Methods for Quantifying Methylation Levels by Microarray +Analysis.” BMC Bioinformatics 11: 1–9. +
+
+Fanfani, Viola, Katherine H Shutta, Panagiotis Mandros, Jonas Fischer, +Enakshi Saha, Soel Micheletti, Chen Chen, Marouen Ben Guebila, Camila M +Lopes-Ramos, and John Quackenbush. 2024. “Reproducible Processing +of TCGA Regulatory Networks.” bioRxiv. +
+
+Teschendorff, Andrew E, Tianyu Zhu, Charles E Breeze, and Stephan Beck. +2020. “EPISCORE: Cell Type Deconvolution of Bulk Tissue DNA +Methylomes from Single-Cell RNA-Seq Data.” Genome +Biology 21: 1–33. +
+
+
+ + + +
+
+
+
+
+ + + + + + + + + + + +