Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
246 changes: 200 additions & 46 deletions quickstart.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,11 @@ bibliography: quickstart.bib
link-citations: true
---

```{R echo=F}
library(ggplot2)
library(knitr)
```

# About NetworkDataCompanion

The `NetworkDataCompanion` R package is a collection of utilities for wrangling TCGA data. While it was developed to serve as the engine behind `tcga-data-nf`, the Nextflow workflow presented in @fanfani2024reproducible, `NetworkDataCompanion` is a stand-alone package that can be used independently. The primary goal of the package is to provide version-controlled and unit-tested functions for common tasks involved in network analyses of TCGA data, such as calculating gene-level summary measures of methylation data, normalizing RNA-seq data, and integrating multiple omics by mapping various file identifiers back to TCGA barcodes.
Expand All @@ -23,7 +28,7 @@ While none of these tasks are particularly complex, `NetworkDataCompanion` is mo

The first step is to install `NetworkDataCompanion` from GitHub using the `devtools` library and the `install_github` function. Here, we have suppressed messages as there are quite a few.

```{R, message=F}
```{R, message=F, eval=F}
library(devtools)
devtools::install_github("QuackenbushLab/NetworkDataCompanion")
```
Expand All @@ -48,16 +53,16 @@ Data acknowledgement: These data were generated by the TCGA Research Network: ht

# Working with the NetworkDataCompanion Object

The most simple way to get started with `NetworkDataCompanion` is to create an oject using the `CreateNetworkDataCompanionObject` function with an empty argument. Because it is a Companion, we like to call it `my_friend`:
The most simple way to get started with `NetworkDataCompanion` is to create an object using the `CreateNetworkDataCompanionObject` function with an empty argument. Here, we call the object `ndc`:

```{R}
my_friend = NetworkDataCompanion::CreateNetworkDataCompanionObject()
ndc = NetworkDataCompanion::CreateNetworkDataCompanionObject()
```

There are a few built-in attributes that will be accessible to the functions of the Companion:

```{R}
ls(my_friend)
ls(ndc)
```

Of these, `gene_mapping` and `sample_type_mapping` are loaded with the empty constructor. For the other attributes, we must supply a patient data file and TCGA cancer type. For the COAD example, we can use the following:
Expand All @@ -70,24 +75,25 @@ patient_data = paste0(datapath,"data_download/clinical/clinical_patient_coad.csv
We input these to the constructor to load the relevant data into the Companion object:

```{R}
my_coad_friend = NetworkDataCompanion::CreateNetworkDataCompanionObject(clinical_patient_file = patient_data,project_name = "COAD")
ndc = NetworkDataCompanion::CreateNetworkDataCompanionObject(clinical_patient_file = patient_data,project_name = "COAD")
```

Now, we can see the other attributes. The project name helps us keep track of which cancer we are working with:

```{R}
my_coad_friend$project_name
ndc$project_name
```

The clinical patient data stores TCGA barcodes and a range of other patient-level information:

```{R}
my_coad_friend$clinical_patient_data[1:5,1:5]
ndc$clinical_patient_data[1:5,1:5]
```

The `TCGA_purities` attribute contains the information about tumor purity for each sample. Because these TCGA data are from bulk tissue, there may be some non-tumor cells in the each sample as well. Higher purity samples will give more accurate depiction of the tumor microenvironment and the user may wish to filter based on purity.

```{R}
head(my_coad_friend$TCGA_purities)
head(ndc$TCGA_purities)
```

# NetworkDataCompanion Functions
Expand All @@ -106,59 +112,207 @@ meth_datapath = paste0(datapath, "data_download/methylation/tcga_coad_cms1_methy

## Functions for working with expression

<!-- ```{R} -->
<!-- # Read RDS expression data -->
<!-- test_exp_rds = readRDS(exp_data) -->
<!-- ``` -->
The functions for working with gene expression in NetworkDataCompanion are generally wrappers around the functionality of the `RangedSummarizedExperiment` objects produced by `recount3`. For documentation on this type of object, please see ***.

First, we read in the expression data, which is downloaded as both an .rds file and as a .csv file in the Download phase of the `tcga-data-nf` pipeline. The .rds file contains the `RangedSummarizedExperiment` object, so this is the downloaded file that we work with here. The .csv file is provided for convenience if you have other applications that do not require the `RangedSummarizedExperiment`.

```{R}
expr_data = readRDS(exp_datapath)
expr_data
```

To extract metadata from the `RangedSummarizedExperiment`, we use the `extractSampleAndGeneInfo` method. This produces two dataframes: `rds_gene_info`, which stores gene identifiers and other gene-level metadata, and `rds_sample_info`, which stores sample identifiers such as TCGA barcodes, patient characteristics, and sample characteristics.

<!-- The function `extractSampleAndGeneInfo` provides a way to get metadata from the RangedSummarizedExperiment object that stores the expression data download. -->

<!-- ```{R} -->
<!-- rds_info = my_friend$extractSampleAndGeneInfo(test_exp_rds) -->
<!-- rds_sample_info = rds_info$rds_sample_info -->
<!-- rds_sample_info[1:3,1:3] -->
```{R}
expr_metadata = ndc$extractSampleAndGeneInfo(expr_data)
ls(expr_metadata)
```


For example, if we wish to get the gene names and Ensembl IDs for the first five genes in the dataset, we can use:

```{R}
expr_metadata$rds_gene_info %>%
dplyr::select(gene_id,gene_name) %>%
slice_head(n=5)
```

Similarly, let's get the TCGA barcode and vital status of the patients who provided the first 5 samples:

```{R}
expr_metadata$rds_sample_info %>%
dplyr::select(tcga.tcga_barcode,tcga.gdc_cases.diagnoses.vital_status) %>%
slice_head(n=5)
```

To get information about a set of genes, `NetworkDataCompanion` provides a wrapper function `getGeneInfo`.

```{R}
ndc$getGeneInfo(c("BRCA1","BRCA2"))
```

These results can also be searched by Ensembl or Entrez gene IDs. The Ensembl IDs for BRCA1 are ENSG00000012048 and ENSG00000139618, respectively:

```{R}
ndc$getGeneInfo(c("ENSG00000012048","ENSG00000139618"))
```

And the Entrez IDs for BRCA1 and BRCA2 are 672 and 675.

```{R}
ndc$getGeneInfo(c(672, 675))
```

### Converting between gene identifiers

There are many, many tools for converting between gene identifiers, and ours has neither bells nor whistles. Our tool simply uses the hardcode Gencode v26 mapping found in the `inst/extdata/gen_v26_mapping_w_entrez.csv` file. A more sophisticated tool may better fit your use case; however, we provide this functionality as we used it in the `tcga-data-nf` pipeline.

```{R}
ndc$geneNameToENSG(c("BRCA1","BRCA2"))
ndc$geneNameToEntrez(c("BRCA1","BRCA2"))
```

Sometimes, the gene conversion will yield a one-to-many mapping. For this reason, it is important to use proper merging operations (`merge` in base R; the `left_join`,`right_join`,`inner_join`,`full_join` functions from `dplyr`) when annotating these conversions back to a dataset.

```{R}
ndc$geneNameToENSG("NOTCH4")
```


<!-- rds_gene_info = rds_info$rds_gene_info -->
<!-- rds_gene_info[1:3,1:3] -->
<!-- ``` -->
### Normalizing RNAseq data

<!-- We can get information about a particular set of genes using either the names or the Ensembl IDs. -->
`NetworkDataCompanion` provides a wrapper for logTPM RNAseq normalization functions from `recount3`. A pseudocount is added in the calculation. The input to this function is the `RangedSummarizedExperiment` object:

<!-- From gene names: -->

<!-- ```{R} -->
<!-- rds_gene_info$gene_name[1:3] -->
<!-- gene_info_from_names = my_friend$getGeneInfo(rds_gene_info$gene_name) -->
<!-- gene_info_from_names[1:3,] -->
<!-- ``` -->
```{R}
logTPM = ndc$logTPMNormalization(expr_data)
ls(logTPM)
```

<!-- From Ensembl IDs: -->
The three items returned include the transformed counts, which have been processed by `recount3::transform_counts`, the logTPM (corresponding to log(TPM+1) with the pseudocount), the TPM (which is provided for use cases related to filtering by TPM, for example, if one would like to do something like remove genes with TPM < 1 in more than 20% of the samples). Rows are indexed by Ensembl gene ID. Let's look at the values for BRCA1 (ENSG00000012048.20):

<!-- ```{R} -->
<!-- rds_gene_info$gene_id[1:3] -->
<!-- genes_info_from_ids = my_friend$getGeneInfo(rds_gene_info$gene_id) -->
<!-- genes_info_from_ids[1:3,] -->
<!-- ``` -->
```{R}
brca1_counts=logTPM$counts[which(rownames(logTPM$counts)=="ENSG00000012048.20"),1:3]
brca1_TPM=logTPM$TPM[which(rownames(logTPM$TPM)=="ENSG00000012048.20"),1:3]
brca1_logTPM=logTPM$logTPM[which(rownames(logTPM$logTPM)=="ENSG00000012048.20"),1:3]

<!-- We can also convert between gene name and Ensembl ID and vice versa: -->
brca1_counts
brca1_TPM
brca1_logTPM
```

<!-- ```{R} -->
<!-- genes_from_names_to_ids = data.frame(my_friend$geneNameToENSG(rds_gene_info$gene_name)) -->
<!-- genes_from_ids_to_names = data.frame(my_friend$geneENSGToName(rds_gene_info$gene_id)) -->
<!-- ``` -->
There is also a wrapper for logCPM normalization. This is based on the `edgeR` recipe for TMM normalization; see the package documentation for more details. Again, a pseudocount is added in the calculation. Unlike `logTPMNormalization`, which uses the `RangedSummarizedExperiment` as input, `logCPMNormalization` uses a matrix or data frame of raw expression counts as input. This can be accessed using the `assay` function of the `RangedSummarizedExperiment`.

<!-- A function for normalization takes a RangedSummarizedExperiment as input and returns three matrices: raw counts, TPM. and log(TPM + 1). -->
```{R}
logCPM = ndc$logCPMNormalization(assay(expr_data))
ls(logCPM)
```

<!-- ```{R} -->
<!-- # Normalize data -->
<!-- test_exp_all = my_friend$logTPMNormalization(test_exp_rds) -->
<!-- test_exp_count = test_exp_all$counts -->
<!-- test_exp_tpm = test_exp_all$TPM -->
<!-- test_exp_logtpm = test_exp_all$logTPM -->
<!-- ``` -->
Here, counts refer to the raw counts in the `RangedSummarizedExperiment` object. Note that, because these counts have not been transformed with `recount3::transform_counts`, they will not be coherent with the `counts` attribute of `logTPMNormalization`. This is expected behavior.

We again investigate the values for BRCA1:


```{R}
brca1_counts=logCPM$counts[which(rownames(logCPM$counts)=="ENSG00000012048.20"),1:3]
brca1_CPM=logCPM$CPM[which(rownames(logCPM$CPM)=="ENSG00000012048.20"),1:3]
brca1_logCPM=logCPM$logCPM[which(rownames(logCPM$logCPM)=="ENSG00000012048.20"),1:3]

brca1_counts
brca1_CPM
brca1_logCPM
```

## Functions for working with mutations

## Functions for working with methylation

The methylation data in the TCGA COAD analysis are stored in a which rows correspond to CpG probes and columns correspond to samples. We load this matrix into a data frame for easy manipulation, using the `fread` function from `data.table` to speed the file input.

```{R}
meth_data = fread(meth_datapath, data.table = F)
meth_data[1:5,1:5]
```

The data are provided as methylation beta values, ranging from 0 to 1. It is also common to work with "m values", which are the logit base 2 transformed beta values @du2010comparison. `NetworkDataCompanion` has the function `convertBetaToM` for converting such values.

```{R}
m_vals = ndc$convertBetaToM(meth_data[1:5,2])
```

When working with methylation data, there is often a need to identify probes within a certain range of a gene's transcription start site (TSS), in order to assess promoter methylation. The `mapProbesToGenes` function uses the public Illumina array manifests available at \url{https://zwdzwd.github.io/InfiniumAnnotation} to do this. Here, we perform this mapping for only the first 10,000 probes in the manifest for brevity, and define the promoter region as 1500 basepairs upstream to 200 basepairs downstream of the TSS.

```{R}
probe_map = ndc$mapProbesToGenes(probelist = meth_data$probeID[1:10000],
rangeUp = 1500,
rangeDown = 200)
dim(probe_map)
names(probe_map)
```

Any probe that is not within the specified `rangeUp` and `rangeDown` bounds of the TSS for an annotated gene in the manifest will have `NA` values for the columns `geneNames`,`ensemblID`, and `distToTSS`. We can look at just the probes mapping to genes by filtering out `NA` values. Here, only the first 10 genes are shown for brevity. Use the scroll bar at the bottom of the table to see all four columns.

```{R}
probe_map %>%
dplyr::filter(!is.na(geneNames)) %>%
slice_head(n=10) %>%
knitr::kable()
```


In many applications, it is necessary to combine a set of probe methylation values to create a summary measure (e.g., to assess overall promoter methylation). `NetworkDataCompanion` currently provides functionality for calculating the mean methylation within a promoter region defined based on the output of the `mapProbesToGenes` function. For example, our probe map above has two probes within 1500 bp upstream and 200bp downstream of the TSS for CTBP1:

```{R}
probe_map %>% dplyr::filter(grepl("CTBP1",geneNames))
```

We can look at the individual methylation at these sites for a few people:

```{R}
meth_data %>% dplyr::filter(probeID %in% c("cg00148286","cg00070460")) %>%
dplyr::select(1:5) %>%
kable()
```

Next, we average these values using the `probeToMeanPromoterMethylation` function:

```{R}
ctbp1_meth=ndc$probeToMeanPromoterMethylation(methylation_betas = meth_data,
probe_gene_map = probe_map,
genesOfInterest = c("CTBP1")) %>%
data.frame() %>%
dplyr::filter(X1 != "CTBP1") %>%
dplyr::rename("CTBP1"=X1)
head(ctbp1_meth,n=5) %>%
kable()
```

We see, for example, that the first value is 0.8531883 = (0.7906715 + 0.9157052)/2, the average of sites cg00148286 and cg00070460 for sample 7318cda6-0b88-416a-afdc-7c36628e7c82.

`estimateCellCountsEpiSCORE`

In the bulk methylation data we study in TCGA, cell type composition may be confounded with observed methylation values. We provide a wrapper around the EpiSCORE method (@teschendorff2020episcore), which performs an *in silico* deconvolution of bulk methylation data for certain tissue types.

```{R}
estimated_cell_counts = ndc$estimateCellCountsEpiSCORE(methylation_betas = meth_data, tissue = "Colon", array = "450k")
head(estimated_cell_counts) %>% kable()
```

From a boxplot, we can assess the overall distributions of the estimated cell type proportions:

```{R boxplot_cell_counts}
estimated_cell_counts %>% pivot_longer(cols=c("EC", "Epi","Lym", "Mye", "Stromal"),
names_to="CellType",
values_to="Proportion") %>%
ggplot(aes(x = CellType, y = Proportion)) +
geom_boxplot()
```

# Conclusions

Thank you for using `NetworkDataCompanion`!

If you have any issues, please raise them as GitHub issues and/or contact the maintainer directly. This package is currently maintained by Kate Hoff Shutta (kshutta@hsph.harvard.edu).

# References
20 changes: 20 additions & 0 deletions quickstart.bib
Original file line number Diff line number Diff line change
Expand Up @@ -14,4 +14,24 @@ @data{fanfaniDataverse
version = {V1},
doi = {10.7910/DVN/MCSSYJ},
url = {https://doi.org/10.7910/DVN/MCSSYJ}
}

@article{du2010comparison,
title={Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis},
author={Du, Pan and Zhang, Xiao and Huang, Chiang-Ching and Jafari, Nadereh and Kibbe, Warren A and Hou, Lifang and Lin, Simon M},
journal={BMC bioinformatics},
volume={11},
pages={1--9},
year={2010},
publisher={Springer}
}

@article{teschendorff2020episcore,
title={EPISCORE: cell type deconvolution of bulk tissue DNA methylomes from single-cell RNA-Seq data},
author={Teschendorff, Andrew E and Zhu, Tianyu and Breeze, Charles E and Beck, Stephan},
journal={Genome biology},
volume={21},
pages={1--33},
year={2020},
publisher={Springer}
}
2,384 changes: 2,384 additions & 0 deletions quickstart.html

Large diffs are not rendered by default.