MultiCCA clustering and tumor samples

Hello,

I'm trying to run some parts of your benchmark and I have some questions about your code and some of the choices you made.

First, I have a question about the MultiCCA run:

```
cca.ret = PMA::MultiCCA(omics.transposed, ncomponents = MAX.NUM.CLUSTERS)
sample.rep = omics.transposed[[1]] %*% cca.ret$ws[[1]]
```

It seems that only the first omic dataset is used to generate `sample.rep`, by projecting it onto the canonical weights (`cca.ret$ws[[1]]`) found for this dataset. `sample.rep` is then used for the clustering. Why did you choose the first omic? Could another dataset be used instead? For example:

```
sample.rep = omics.transposed[[2]] %*% cca.ret$ws[[2]]
```

How would this affect the results?


Second, in the same MultiCCA run, silhouette values are computed to choose a coherent clustering:

```
sils = c()
clustering.per.num.clusters = list()
for (num.clusters in 2:MAX.NUM.CLUSTERS) {
  cur.clustering = kmeans(sample.rep, num.clusters, iter.max=100, nstart=30)$cluster
  sil = get.clustering.silhouette(list(t(sample.rep)), cur.clustering)
  sils = c(sils, sil)
  clustering.per.num.clusters[[num.clusters - 1]] = cur.clustering
}
cca.clustering = clustering.per.num.clusters[[which.min(sils)]]
```

I don't understand the last line of this code: why did you choose the minimum average silhouette width? I thought that the higher the silhouette value, the better the clustering. Shouldn't it be `which.max(sils)` instead?
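
To illustrate what I mean, here is a minimal sketch on toy data. It uses `cluster::silhouette` rather than your `get.clustering.silhouette` helper (I have not checked exactly what that helper returns, so this is an assumption on my side), and `toy.rep` is a made-up stand-in for `sample.rep` with three well-separated groups:

```r
library(cluster)  # provides silhouette()

set.seed(1)
# Toy stand-in for sample.rep: three well-separated groups of 30 points in 2D
toy.rep = rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 5, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 10, sd = 0.3), ncol = 2)
)
toy.dist = dist(toy.rep)

# Average silhouette width for each candidate number of clusters
candidate.ks = 2:6
avg.sils = sapply(candidate.ks, function(k) {
  clustering = kmeans(toy.rep, k, iter.max = 100, nstart = 30)$cluster
  mean(silhouette(clustering, toy.dist)[, "sil_width"])
})

# which.max picks the k with the most coherent clusters,
# which.min picks the k with the least coherent ones
best.k = candidate.ks[which.max(avg.sils)]
worst.k = candidate.ks[which.min(avg.sils)]
print(data.frame(k = candidate.ks, avg.sil = round(avg.sils, 3)))
```

With the standard silhouette definition, `which.max` should recover the three groups here, while `which.min` selects a different number of clusters. If `get.clustering.silhouette` returns something where lower is better, then `which.min` would of course be correct.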

Finally, my last question is about the choice of removing some tissues from the datasets:


```
filter.non.tumor.samples <- function(raw.datum, only.primary=only.primary) {
  # 01 is primary, 06 is metastatic, 03 is blood derived cancer
  if (!only.primary)
    return(raw.datum[,substring(colnames(raw.datum), 14, 15) %in% c('01', '03', '06')])
  else
    return(raw.datum[,substring(colnames(raw.datum), 14, 15) %in% c('01')])
}
```
Why did you choose to keep only primary tumors for some cancers and discard other sample types, such as metastatic or recurrent tumors? Would it make sense to discard only the "normal" samples and keep the sample type information (i.e. not run the `fix.patient.names` function), so that the clustering can also take this information into account?
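
To make the alternative concrete, here is a rough sketch of what I have in mind. It assumes the column names are TCGA barcodes, where characters 14-15 hold the sample type code and codes below 10 denote tumor samples (10 and above are normals or controls). The barcodes and the `toy.datum` matrix are invented for the example:

```r
# Hypothetical data: 2 features x 5 samples with made-up TCGA-style barcodes
toy.datum = matrix(1:10, nrow = 2)
colnames(toy.datum) = c(
  "TCGA-AA-0001-01A",  # 01: primary solid tumor
  "TCGA-AA-0002-02A",  # 02: recurrent solid tumor
  "TCGA-AA-0003-06A",  # 06: metastatic
  "TCGA-AA-0004-11A",  # 11: solid tissue normal
  "TCGA-AA-0005-10A"   # 10: blood derived normal
)

# Characters 14-15 of the barcode are the sample type code
sample.type = as.integer(substring(colnames(toy.datum), 14, 15))

# Keep every tumor sample type and drop only the normals;
# drop = FALSE keeps a matrix even if a single column remains
tumor.only = toy.datum[, sample.type < 10, drop = FALSE]

# The sample type stays available as a per-sample annotation
kept.types = substring(colnames(tumor.only), 14, 15)
print(table(kept.types))
```

This keeps recurrent samples (code `02`) as well, which the current `c('01', '03', '06')` filter drops.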

I hope my questions are clear.
Thank you in advance!
Galadriel
