SGCCA/prepro.Rmd at main · CSB-IG/SGCCA · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
---
title: "Data pre-procesing"
output:
  html_document:
    df_print: paged
---

## 1. Get the barcode of the samples **with co-occurring**  measurements

[getData.R](https://github.com/CSB-IG/SGCCA/blob/main/getData.R) → subtype.tsv

DB has been updated since the download, i.e.:

-   workflow.type changed from HTSeq - Counts to STAR - Counts

-   3 extra luminal A samples

```{r}
suppressPackageStartupMessages({
library(SummarizedExperiment)#1.22.0
library(TCGAbiolinks)#2.20.1
#library(VennDiagram)#1.6.20
})
##########SAMPLE IDs PER DATA TYPE#####################
mthyltn <-  GDCquery(project = "TCGA-BRCA",
	data.category = "DNA Methylation",
	platform="Illumina Human Methylation 450")
mthyltn=getResults(mthyltn)
head(mthyltn,3)
```

```{r}
i=substr(mthyltn$cases,1,19)
xprssn <- GDCquery(project = "TCGA-BRCA",
  data.category = "Transcriptome Profiling",
  data.type = "Gene Expression Quantification",
  workflow.type="STAR - Counts")#11/2022 option
#  workflow.type = "HTSeq - Counts")
xprssn=getResults(xprssn)
j=substr(xprssn$cases,1,19)
mirnas <- GDCquery(project = "TCGA-BRCA",
	data.category = "Transcriptome Profiling",
	data.type = "miRNA Expression Quantification")
mirnas=getResults(mirnas)
k=substr(mirnas$cases,1,19)

##############CONCOURRENT MEASURES########################
samples=intersect(intersect(i,j),k)
samples=data.frame(cbind(sample=samples,patient=substr(samples,1,12)))
head(samples)
```

would have been easier with tidy

```{r}
suppressPackageStartupMessages({library(tidyverse)})
subtypes=TCGAquery_subtype(tumor="brca")#subtype per patient
samples=merge(samples,subtypes,by="patient",all.x=T)
samples%>%distinct(patient,BRCA_Subtype_PAM50)%>%count(BRCA_Subtype_PAM50)
```

```{r}
colnames(mthyltn)[30]="patient"
temp=mthyltn%>%select(cases,data_type,sample_type,patient)
temp$sample=substr(temp$cases,1,19)
samples=merge(samples,temp,by="sample",all.x=T)
samples%>%distinct(patient.x,BRCA_Subtype_PAM50,sample_type)%>%count(sample_type)
```

## 2. RNAseq pre-processing

[prepro-mRNA.R](https://github.com/CSB-IG/SGCCA/blob/main/prepro-mRNA.R) → RNAseqnormalized.tsv

## 3. miRNAseq pre-processing

[prepro-miRNA.R](https://github.com/CSB-IG/SGCCA/blob/main/prepro-miRNA.R) → miRNAseqNormi.tsv

```{r}
suppressPackageStartupMessages({library(data.table)})
exampl=fread("miRNAseqNormi.tsv")
dim(exampl)
exampl[1:5,1:5]
```

## 4. HM450 pre-processing

[prepro-methy.R](https://github.com/CSB-IG/SGCCA/blob/main/prepro-methy.R) → methyNormi.tsv

## 5. Paste together the 3 matrixes of every subtype

Rscript [concatena.R](https://github.com/CSB-IG/SGCCA/blob/main/concatena.R) → Basal.mtrx, Her2.mtrx, ...

## 6. Eigenvalue normalization

Rscript [mfa_normi.R](https://github.com/CSB-IG/SGCCA/blob/main/mfa_normi.R) → Basal.eigeNormi, Her2.eigeNormi...

```{r}
exampl=fread("Her2.eigeNormi")
exampl[1:5,1:5]
```