NetworkDataCompanion Quickstart Guide
+ + + + + + + +About NetworkDataCompanion
+The NetworkDataCompanion R package is a collection of
+utilities for wrangling TCGA data. While it was developed to serve as
+the engine behind tcga-data-nf, the Nextflow workflow
+presented in Fanfani et al. (2024),
+NetworkDataCompanion is a stand-alone package that can be
+used independently. The primary goal of the package is to provide
+version-controlled and unit-tested functions for common tasks involved
+in network analyses of TCGA data, such as calculating gene-level summary
+measures of methylation data, normalizing RNA-seq data, and integrating
+multiple omics by mapping various file identifiers back to TCGA
+barcodes.
While none of these tasks are particularly complex,
+NetworkDataCompanion is motivated by their ubiquity. Beyond
+its role in tcga-data-nf, this package aims to:
-
+
Reduce “re-work” by eliminating the need for everyone to write a +few lines of code to, e.g., map TCGA identifiers in their own +scripts
+Reduce the error-prone nature of the type of coding described in +(1) by offering a centralized, maintained, version-controlled, and +unit-tested resource for general use.
+
Preparing the workspace
+The first step is to install NetworkDataCompanion from
+GitHub using the devtools library and the
+install_github function. Here, we have suppressed messages
+as there are quite a few.
Once you have installed the package, load the library into your +workspace.
+ +Download demonstration data
+Once you have installed the package, you next need to download the +data used in this quickstart.
+These data are from the TCGA COAD analysis presented in Fanfani et al. (2024) and can be +downloaded from the Harvard Dataverse here: https://doi.org/10.7910/DVN/MCSSYJ
+For the script below to work, please place the
+tcga-data-supplement directory in the same working
+directory as this RMarkdown. Alternatively, you can modify the
+datapath variable below.
For reference on the source of the data, the full workflow of the +COAD analysis is available at https://github.com/QuackenbushLab/tcga-data-supplement/.
+Data acknowledgement: These data were generated by the TCGA Research +Network: https://www.cancer.gov/tcga.
+Working with the NetworkDataCompanion Object
+The most simple way to get started with
+NetworkDataCompanion is to create an object using the
+CreateNetworkDataCompanionObject function with an empty
+argument. Here, we call the object ndc:
There are a few built-in attributes that will be accessible to the +functions of the Companion:
+ +## [1] "clinical_patient_data" "gene_mapping" "project_name"
+## [4] "sample_type_mapping" "TCGA_purities"
+Of these, gene_mapping and
+sample_type_mapping are loaded with the empty constructor.
+For the other attributes, we must supply a patient data file and TCGA
+cancer type. For the COAD example, we can use the following:
datapath = "tcga-data-supplement/data/processed/batch-coad-subtype-20240510/tcga_coad_cms1/"
+patient_data = paste0(datapath,"data_download/clinical/clinical_patient_coad.csv")We input these to the constructor to load the relevant data into the +Companion object:
+ndc = NetworkDataCompanion::CreateNetworkDataCompanionObject(clinical_patient_file = patient_data,project_name = "COAD")Now, we can see the other attributes. The project name helps us keep +track of which cancer we are working with:
+ +## [1] "COAD"
+The clinical patient data stores TCGA barcodes and a range of other +patient-level information:
+ +## bcr_patient_uuid bcr_patient_barcode form_completion_date
+## 3 A94E1279-A975-480A-93E9-7B1FF05CBCBF TCGA-3L-AA1B 2014-4-22
+## 4 92554413-9EBC-4354-8E1B-9682F3A031D9 TCGA-4N-A93T 2014-10-1
+## 5 A5E14ADD-1552-4606-9FFE-3A03BCF76640 TCGA-4T-AA8H 2014-6-5
+## 6 1136DD50-242A-4659-AAD4-C53F9E759BB3 TCGA-5M-AAT4 2015-1-27
+## 7 CE00896A-F7D2-4123-BB95-24CB6E53FC32 TCGA-5M-AAT6 2015-1-27
+## histologic_diagnosis prospective_collection
+## 3 Colon Adenocarcinoma YES
+## 4 Colon Adenocarcinoma YES
+## 5 Colon Mucinous Adenocarcinoma NO
+## 6 Colon Adenocarcinoma NO
+## 7 Colon Adenocarcinoma NO
+The TCGA_purities attribute contains the information
+about tumor purity for each sample. Because these TCGA data are from
+bulk tissue, there may be some non-tumor cells in the each sample as
+well. Higher purity samples will give more accurate depiction of the
+tumor microenvironment and the user may wish to filter based on
+purity.
## Cancer_type ESTIMATE ABSOLUTE LUMP IHC CPE
+## TCGA-3L-AA1B-01A COAD 0.9025 NA 0.6238 0.70 0.7913
+## TCGA-4N-A93T-01A COAD 0.9752 NA 0.7915 0.75 0.7898
+## TCGA-4T-AA8H-01A COAD 0.9865 NA NA 0.60 0.8917
+## TCGA-5M-AAT4-01A COAD 0.9720 NA NA 0.70 0.8372
+## TCGA-5M-AAT5-01A COAD 0.9752 NA NA 0.75 0.8170
+## TCGA-5M-AAT6-01A COAD 0.7275 NA NA 0.80 0.6712
+NetworkDataCompanion Functions
+The tcga-data-nf workflow retrieves five different data
+types: clinical data, expression, mutations, copy number variation (CNV)
+data, and methylation. The clinical data is used in the Companion object
+constructor. The other data types can be wrangled with the Companion
+member functions.
First, we define paths to the data:
+exp_datapath = paste0(datapath,"data_download/recount3/tcga_coad_cms1.rds")
+mut_datapath = paste0(datapath,"data_download/mutations/tcga_coad_cms1_mutations.txt")
+mut_pivot_datapath = paste0(datapath, "data_download/mutations/tcga_coad_cms1_mutations_pivot.csv")
+cnv_datapath = paste0(datapath, "data_download/cnv/tcga_coad_cms1.csv")
+meth_datapath = paste0(datapath, "data_download/methylation/tcga_coad_cms1_methylations.txt")Functions for working with expression
+The functions for working with gene expression in
+NetworkDataCompanion are generally wrappers around the functionality of
+the RangedSummarizedExperiment objects produced by
+recount3. For documentation on this type of object, please
+see ***.
First, we read in the expression data, which is downloaded as both an
+.rds file and as a .csv file in the Download phase of the
+tcga-data-nf pipeline. The .rds file contains the
+RangedSummarizedExperiment object, so this is the
+downloaded file that we work with here. The .csv file is provided for
+convenience if you have other applications that do not require the
+RangedSummarizedExperiment.
## class: RangedSummarizedExperiment
+## dim: 63856 80
+## metadata(8): time_created recount3_version ... annotation recount3_url
+## assays(1): raw_counts
+## rownames(63856): ENSG00000278704.1 ENSG00000277400.1 ...
+## ENSG00000182484.15_PAR_Y ENSG00000227159.8_PAR_Y
+## rowData names(10): source type ... havana_gene tag
+## colnames(80): 196281bd-2c05-46d3-a53c-ead8639ef02d
+## 45cedf59-bc7a-4d86-8c2d-9a91ac141d30 ...
+## bcd07f56-fa83-473f-b5d1-9c637299c7e2
+## b444dc9f-6f2c-4cbc-9e1c-7b9e0fc790f2
+## colData names(937): rail_id external_id ... recount_seq_qc.errq
+## BigWigURL
+To extract metadata from the RangedSummarizedExperiment,
+we use the extractSampleAndGeneInfo method. This produces
+two dataframes: rds_gene_info, which stores gene
+identifiers and other gene-level metadata, and
+rds_sample_info, which stores sample identifiers such as
+TCGA barcodes, patient characteristics, and sample characteristics.
## [1] "rds_gene_info" "rds_sample_info"
+For example, if we wish to get the gene names and Ensembl IDs for the +first five genes in the dataset, we can use:
+ +## gene_id gene_name
+## 1 ENSG00000278704.1 BX004987.1
+## 2 ENSG00000277400.1 AC145212.2
+## 3 ENSG00000274847.1 AC145212.1
+## 4 ENSG00000277428.1 Y_RNA
+## 5 ENSG00000276256.1 AC011043.1
+Similarly, let’s get the TCGA barcode and vital status of the +patients who provided the first 5 samples:
+expr_metadata$rds_sample_info %>%
+ dplyr::select(tcga.tcga_barcode,tcga.gdc_cases.diagnoses.vital_status) %>%
+ slice_head(n=5)## tcga.tcga_barcode
+## 196281bd-2c05-46d3-a53c-ead8639ef02d TCGA-AA-3845-01A-01R-1022-07
+## 45cedf59-bc7a-4d86-8c2d-9a91ac141d30 TCGA-F4-6570-01A-11R-1774-07
+## 771833ef-f9f3-43e9-ac9e-bf249333b257 TCGA-G4-6588-01A-11R-1774-07
+## 343f031b-0095-429d-8212-ad0581dbb05b TCGA-A6-3809-01A-01R-1022-07
+## aa49ac5d-9dc7-48f9-8498-deb9a3f0e0d0 TCGA-CM-5861-01A-01R-1653-07
+## tcga.gdc_cases.diagnoses.vital_status
+## 196281bd-2c05-46d3-a53c-ead8639ef02d dead
+## 45cedf59-bc7a-4d86-8c2d-9a91ac141d30 dead
+## 771833ef-f9f3-43e9-ac9e-bf249333b257 alive
+## 343f031b-0095-429d-8212-ad0581dbb05b alive
+## aa49ac5d-9dc7-48f9-8498-deb9a3f0e0d0 alive
+To get information about a set of genes,
+NetworkDataCompanion provides a wrapper function
+getGeneInfo.
## gene_name seqid source type start end score strand phase
+## 1 BRCA1 chr17 HAVANA gene 43044295 43170245 NA - NA
+## 2 BRCA2 chr13 HAVANA gene 32315474 32400266 NA + NA
+## gene_id gene_entrez gene_type gene_id_no_ver
+## 1 ENSG00000012048.20 672 protein_coding ENSG00000012048
+## 2 ENSG00000139618.14 675 protein_coding ENSG00000139618
+These results can also be searched by Ensembl or Entrez gene IDs. The +Ensembl IDs for BRCA1 are ENSG00000012048 and ENSG00000139618, +respectively:
+ +## gene_id_no_ver seqid source type start end score strand phase
+## 1 ENSG00000012048 chr17 HAVANA gene 43044295 43170245 NA - NA
+## 2 ENSG00000139618 chr13 HAVANA gene 32315474 32400266 NA + NA
+## gene_id gene_name gene_entrez gene_type
+## 1 ENSG00000012048.20 BRCA1 672 protein_coding
+## 2 ENSG00000139618.14 BRCA2 675 protein_coding
+And the Entrez IDs for BRCA1 and BRCA2 are 672 and 675.
+ +## gene_entrez seqid source type start end score strand phase
+## 1 672 chr17 HAVANA gene 43044295 43170245 NA - NA
+## 2 675 chr13 HAVANA gene 32315474 32400266 NA + NA
+## gene_id gene_name gene_type gene_id_no_ver
+## 1 ENSG00000012048.20 BRCA1 protein_coding ENSG00000012048
+## 2 ENSG00000139618.14 BRCA2 protein_coding ENSG00000139618
+Converting between gene identifiers
+There are many, many tools for converting between gene identifiers,
+and ours has neither bells nor whistles. Our tool simply uses the
+hardcode Gencode v26 mapping found in the
+inst/extdata/gen_v26_mapping_w_entrez.csv file. A more
+sophisticated tool may better fit your use case; however, we provide
+this functionality as we used it in the tcga-data-nf
+pipeline.
## gene_name gene_id gene_id_no_ver
+## 1 BRCA1 ENSG00000012048.20 ENSG00000012048
+## 2 BRCA2 ENSG00000139618.14 ENSG00000139618
+
+## gene_name gene_entrez
+## 1 BRCA1 672
+## 2 BRCA2 675
+Sometimes, the gene conversion will yield a one-to-many mapping. For
+this reason, it is important to use proper merging operations
+(merge in base R; the
+left_join,right_join,inner_join,full_join
+functions from dplyr) when annotating these conversions
+back to a dataset.
## [1] "[NetworkDataCompanion::getGeneInfo] In gene conversion, there was at least one one-to-many mapping (most probably from multiple ensembl IDs for the input)"
+## gene_name gene_id gene_id_no_ver
+## 1 NOTCH4 ENSG00000206312.7 ENSG00000206312
+## 2 NOTCH4 ENSG00000234876.5 ENSG00000234876
+## 3 NOTCH4 ENSG00000235396.5 ENSG00000235396
+## 4 NOTCH4 ENSG00000223355.5 ENSG00000223355
+## 5 NOTCH4 ENSG00000238196.5 ENSG00000238196
+## 6 NOTCH4 ENSG00000232339.5 ENSG00000232339
+## 7 NOTCH4 ENSG00000204301.6 ENSG00000204301
+Normalizing RNAseq data
+NetworkDataCompanion provides a wrapper for logTPM
+RNAseq normalization functions from recount3. A pseudocount
+is added in the calculation. The input to this function is the
+RangedSummarizedExperiment object:
## [1] "counts" "logTPM" "TPM"
+The three items returned include the transformed counts, which have
+been processed by recount3::transform_counts, the logTPM
+(corresponding to log(TPM+1) with the pseudocount), the TPM (which is
+provided for use cases related to filtering by TPM, for example, if one
+would like to do something like remove genes with TPM < 1 in more
+than 20% of the samples). Rows are indexed by Ensembl gene ID. Let’s
+look at the values for BRCA1 (ENSG00000012048.20):
brca1_counts=logTPM$counts[which(rownames(logTPM$counts)=="ENSG00000012048.20"),1:3]
+brca1_TPM=logTPM$TPM[which(rownames(logTPM$TPM)=="ENSG00000012048.20"),1:3]
+brca1_logTPM=logTPM$logTPM[which(rownames(logTPM$logTPM)=="ENSG00000012048.20"),1:3]
+
+brca1_counts## 196281bd-2c05-46d3-a53c-ead8639ef02d 45cedf59-bc7a-4d86-8c2d-9a91ac141d30
+## 561 570
+## 771833ef-f9f3-43e9-ac9e-bf249333b257
+## 1348
+
+## 196281bd-2c05-46d3-a53c-ead8639ef02d 45cedf59-bc7a-4d86-8c2d-9a91ac141d30
+## 4.036036 3.923258
+## 771833ef-f9f3-43e9-ac9e-bf249333b257
+## 9.182649
+
+## 196281bd-2c05-46d3-a53c-ead8639ef02d 45cedf59-bc7a-4d86-8c2d-9a91ac141d30
+## 1.616619 1.593970
+## 771833ef-f9f3-43e9-ac9e-bf249333b257
+## 2.320685
+There is also a wrapper for logCPM normalization. This is based on
+the edgeR recipe for TMM normalization; see the package
+documentation for more details. Again, a pseudocount is added in the
+calculation. Unlike logTPMNormalization, which uses the
+RangedSummarizedExperiment as input,
+logCPMNormalization uses a matrix or data frame of raw
+expression counts as input. This can be accessed using the
+assay function of the
+RangedSummarizedExperiment.
## [1] "counts" "CPM" "logCPM"
+Here, counts refer to the raw counts in the
+RangedSummarizedExperiment object. Note that, because these
+counts have not been transformed with
+recount3::transform_counts, they will not be coherent with
+the counts attribute of logTPMNormalization.
+This is expected behavior.
We again investigate the values for BRCA1:
+brca1_counts=logCPM$counts[which(rownames(logCPM$counts)=="ENSG00000012048.20"),1:3]
+brca1_CPM=logCPM$CPM[which(rownames(logCPM$CPM)=="ENSG00000012048.20"),1:3]
+brca1_logCPM=logCPM$logCPM[which(rownames(logCPM$logCPM)=="ENSG00000012048.20"),1:3]
+
+brca1_counts## 196281bd-2c05-46d3-a53c-ead8639ef02d 45cedf59-bc7a-4d86-8c2d-9a91ac141d30
+## 24368 75884
+## 771833ef-f9f3-43e9-ac9e-bf249333b257
+## 154683
+
+## 196281bd-2c05-46d3-a53c-ead8639ef02d 45cedf59-bc7a-4d86-8c2d-9a91ac141d30
+## 13.09724 14.56629
+## 771833ef-f9f3-43e9-ac9e-bf249333b257
+## 34.11405
+
+## 196281bd-2c05-46d3-a53c-ead8639ef02d 45cedf59-bc7a-4d86-8c2d-9a91ac141d30
+## 2.645979 2.745107
+## 771833ef-f9f3-43e9-ac9e-bf249333b257
+## 3.558601
+Functions for working with methylation
+The methylation data in the TCGA COAD analysis are stored in a which
+rows correspond to CpG probes and columns correspond to samples. We load
+this matrix into a data frame for easy manipulation, using the
+fread function from data.table to speed the
+file input.
## probeID 7318cda6-0b88-416a-afdc-7c36628e7c82
+## 1 cg00000029 0.2602616
+## 2 cg00000108 0.9603123
+## 3 cg00000109 0.9148485
+## 4 cg00000165 0.5946983
+## 5 cg00000236 0.9103589
+## fa6b8806-1b2f-428c-906e-e910ec0af610 cf3443f5-f79d-4a01-9963-f3e3ced15786
+## 1 0.2421992 0.1009500
+## 2 0.9643367 0.9278799
+## 3 0.9097119 0.8865743
+## 4 0.5318436 0.8743420
+## 5 0.9085887 0.9469717
+## 9a645f01-1ca4-49c0-bd29-fc3b0c6570c9
+## 1 0.2936026
+## 2 0.9738917
+## 3 0.9294175
+## 4 0.5188614
+## 5 0.9405403
+The data are provided as methylation beta values, ranging from 0 to
+1. It is also common to work with “m values”, which are the logit base 2
+transformed beta values Du et al. (2010).
+NetworkDataCompanion has the function
+convertBetaToM for converting such values.
When working with methylation data, there is often a need to identify
+probes within a certain range of a gene’s transcription start site
+(TSS), in order to assess promoter methylation. The
+mapProbesToGenes function uses the public Illumina array
+manifests available at to do this. Here, we perform this mapping for
+only the first 10,000 probes in the manifest for brevity, and define the
+promoter region as 1500 basepairs upstream to 200 basepairs downstream
+of the TSS.
probe_map = ndc$mapProbesToGenes(probelist = meth_data$probeID[1:10000],
+ rangeUp = 1500,
+ rangeDown = 200)## [1] "[NetworkDataCompanion::mapProbesToGenes] Sourcing 450k probe annotation from https://zwdzwd.github.io/InfiniumAnnotation ..."
+## [1] "[NetworkDataCompanion::mapProbesToGenes] https://zhouserver.research.chop.edu/InfiniumAnnotation/20210615/HM450/HM450.hg38.manifest.gencode.v36.tsv.gz"
+## [1] "[NetworkDataCompanion::mapProbesToGenes] HG38, Gencode v36"
+## probeID geneNames ensemblID distToTSS
+## 1 NA NA NA NA
+## 2 NA NA NA NA
+## 3 NA NA NA NA
+## 4 NA NA NA NA
+## 5 NA NA NA NA
+## 6 NA NA NA NA
+## [1] "[NetworkDataCompanion::mapProbesToGenes] Processing probe number: 10000"
+
+## [1] 10000 4
+
+## [1] "probeID" "geneNames" "ensemblID" "distToTSS"
+Any probe that is not within the specified rangeUp and
+rangeDown bounds of the TSS for an annotated gene in the
+manifest will have NA values for the columns
+geneNames,ensemblID, and
+distToTSS. We can look at just the probes mapping to genes
+by filtering out NA values. Here, only the first 10 genes
+are shown for brevity. Use the scroll bar at the bottom of the table to
+see all four columns.
| probeID | +geneNames | +ensemblID | +distToTSS | +
|---|---|---|---|
| cg00381604 | +MIR1302-2;MIR1302-2HG;MIR1302-2HG;WASH7P | +ENST00000607096.1;ENST00000469289.1;ENST00000473358.1;ENST00000488147.1 | +-931;-832;-119;136 | +
| cg00111697 | +AL669831.5;FAM41C;FAM41C | +ENST00000635557.1;ENST00000427857.1;ENST00000446136.1 | +-145;-577;-476 | +
| cg00328058 | +PLEKHN1;PLEKHN1;PLEKHN1 | +ENST00000379407.7;ENST00000379409.6;ENST00000379410.8 | +-820;-820;-800 | +
| cg00447632 | +HES4;HES4;HES4;HES4;ISG15;ISG15 | +ENST00000304952.11;ENST00000428771.6;ENST00000481869.1;ENST00000484667.2;ENST00000624652.1;ENST00000624697.4 | +-383;-308;-384;-499;-664;-657 | +
| cg00152708 | +AL390719.2;MIR429 | +ENST00000606993.1;ENST00000362106.1 | +-421;-69 | +
| cg00086243 | +TNFRSF18;TNFRSF18;TNFRSF18;TNFRSF18 | +ENST00000328596.10;ENST00000379265.5;ENST00000379268.7;ENST00000486728.5 | +-453;-453;-432;-1344 | +
| cg00062072 | +SDF4 | +ENST00000494748.1 | +-952 | +
| cg00262415 | +ACAP3;PUSL1;PUSL1;PUSL1;PUSL1 | +ENST00000492936.5;ENST00000379031.10;ENST00000463758.1;ENST00000467712.1;ENST00000470520.5 | +-1422;-1245;-1436;-1469;-1259 | +
| cg00086266 | +PUSL1;PUSL1;PUSL1;PUSL1 | +ENST00000379031.10;ENST00000463758.1;ENST00000467712.1;ENST00000470520.5 | +-1082;-1273;-1306;-1096 | +
| cg00040588 | +MXRA8;MXRA8;MXRA8;MXRA8;MXRA8;MXRA8 | +ENST00000460473.1;ENST00000464351.1;ENST00000473097.5;ENST00000474033.5;ENST00000476718.1;ENST00000478517.5 | +181;181;-857;-1067;-315;-1084 | +
In many applications, it is necessary to combine a set of probe
+methylation values to create a summary measure (e.g., to assess overall
+promoter methylation). NetworkDataCompanion currently
+provides functionality for calculating the mean methylation within a
+promoter region defined based on the output of the
+mapProbesToGenes function. For example, our probe map above
+has two probes within 1500 bp upstream and 200bp downstream of the TSS
+for CTBP1:
## probeID geneNames ensemblID distToTSS
+## 1 cg00148286 CTBP1 ENST00000511907.1 130
+## 2 cg00070460 CTBP1 ENST00000514669.5 -663
+We can look at the individual methylation at these sites for a few +people:
+meth_data %>% dplyr::filter(probeID %in% c("cg00148286","cg00070460")) %>%
+ dplyr::select(1:5) %>%
+ kable()| probeID | +7318cda6-0b88-416a-afdc-7c36628e7c82 | +fa6b8806-1b2f-428c-906e-e910ec0af610 | +cf3443f5-f79d-4a01-9963-f3e3ced15786 | +9a645f01-1ca4-49c0-bd29-fc3b0c6570c9 | +
|---|---|---|---|---|
| cg00070460 | +0.7906715 | +0.7506374 | +0.8485095 | +0.7844562 | +
| cg00148286 | +0.9157052 | +0.9063579 | +0.9633561 | +0.8701251 | +
Next, we average these values using the
+probeToMeanPromoterMethylation function:
ctbp1_meth=ndc$probeToMeanPromoterMethylation(methylation_betas = meth_data,
+ probe_gene_map = probe_map,
+ genesOfInterest = c("CTBP1")) %>%
+ data.frame() %>%
+ dplyr::filter(X1 != "CTBP1") %>%
+ dplyr::rename("CTBP1"=X1)
+head(ctbp1_meth,n=5) %>%
+ kable()| + | CTBP1 | +
|---|---|
| 7318cda6-0b88-416a-afdc-7c36628e7c82 | +0.8531883 | +
| fa6b8806-1b2f-428c-906e-e910ec0af610 | +0.8284977 | +
| cf3443f5-f79d-4a01-9963-f3e3ced15786 | +0.9059328 | +
| 9a645f01-1ca4-49c0-bd29-fc3b0c6570c9 | +0.8272906 | +
| d39669b0-14df-480d-8134-1f262cad68de | +0.8814039 | +
We see, for example, that the first value is 0.8531883 = (0.7906715 + +0.9157052)/2, the average of sites cg00148286 and cg00070460 for sample +7318cda6-0b88-416a-afdc-7c36628e7c82.
+estimateCellCountsEpiSCORE
In the bulk methylation data we study in TCGA, cell type composition +may be confounded with observed methylation values. We provide a wrapper +around the EpiSCORE method (Teschendorff et al. +(2020)), which +performs an in silico deconvolution of bulk methylation data +for certain tissue types.
+estimated_cell_counts = ndc$estimateCellCountsEpiSCORE(methylation_betas = meth_data, tissue = "Colon", array = "450k")## [1] "Done for regional gene group 1"
+## [1] "Done for regional gene group 2"
+## [1] "Done for regional gene group 3"
+## [1] "Done for regional gene group 4"
+## [1] "Done for regional gene group 5"
+## [1] "Done for regional gene group 6"
+
+| UUID | +TCGAbarcode | +EC | +Epi | +Lym | +Mye | +Stromal | +
|---|---|---|---|---|---|---|
| 7318cda6-0b88-416a-afdc-7c36628e7c82 | +TCGA-AD-6895-01A | +0.1233243 | +0.3151952 | +0.2559282 | +0.1675718 | +0.1379805 | +
| fa6b8806-1b2f-428c-906e-e910ec0af610 | +TCGA-AZ-4616-01A | +0.1560313 | +0.3311448 | +0.2608134 | +0.1275436 | +0.1244670 | +
| cf3443f5-f79d-4a01-9963-f3e3ced15786 | +TCGA-A6-6653-01A | +0.0480103 | +0.6213130 | +0.1820945 | +0.1485822 | +0.0000000 | +
| 9a645f01-1ca4-49c0-bd29-fc3b0c6570c9 | +TCGA-F4-6570-01A | +0.1323797 | +0.3881631 | +0.1994900 | +0.1946285 | +0.0853388 | +
| d39669b0-14df-480d-8134-1f262cad68de | +TCGA-AZ-6601-01A | +0.1532917 | +0.1826823 | +0.3001661 | +0.2878829 | +0.0759769 | +
| 8fba567b-80f4-42f0-a012-ca6b2daa14d3 | +TCGA-G4-6628-01A | +0.1397655 | +0.2135989 | +0.4203463 | +0.2014077 | +0.0248815 | +
From a boxplot, we can assess the overall distributions of the +estimated cell type proportions:
+estimated_cell_counts %>% pivot_longer(cols=c("EC", "Epi","Lym", "Mye", "Stromal"),
+ names_to="CellType",
+ values_to="Proportion") %>%
+ ggplot(aes(x = CellType, y = Proportion)) +
+ geom_boxplot()Conclusions
+Thank you for using NetworkDataCompanion!
If you have any issues, please raise them as GitHub issues and/or +contact the maintainer directly. This package is currently maintained by +Kate Hoff Shutta (kshutta@hsph.harvard.edu).
+