Building a classification model of NGS RNA-seq data with MLSeq.
Author: Bo Cheng
Overview
1. In this project, I use MLSeq, a machine learning package from Bioconductor, to find the model that best predicts breast cancer subtype.
2. Each model is trained with 5-fold cross-validation repeated 10 times. I use 28 datasets from TCGA to train the models and 12 datasets to test them.
3. The main goal of this project is to get a taste of machine learning in R, so I may copy and paste a good deal of code, but I give my own understanding and opinions in the R notebook.
1. The training and testing data are from TCGA, 40 datasets in total, split 7:3 (28 training datasets, 12 testing datasets).
2. To download the datasets from TCGA, add the HTSeq count files to the cart and download them together as a single compressed file.
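The 7:3 split described above can be sketched in base R. This is a minimal illustration, not the notebook's code: the sample sheet, the two subtype labels "A" and "B", the seed value, and the choice to stratify by subtype are all assumptions made for the example.

```r
# Hypothetical sample sheet: 40 samples with two made-up subtype labels.
samples <- data.frame(
  id = sprintf("sample_%02d", 1:40),
  subtype = rep(c("A", "B"), each = 20),
  stringsAsFactors = FALSE
)

set.seed(2128)  # arbitrary seed, chosen only for this example

# Draw 70% of the rows within each subtype so that both classes
# appear in the training set and in the testing set.
train_idx <- unlist(lapply(
  split(seq_len(nrow(samples)), samples$subtype),
  function(rows) sample(rows, size = round(0.7 * length(rows)))
))

train <- samples[train_idx, ]
test  <- samples[-train_idx, ]

nrow(train)  # 28
nrow(test)   # 12
```

Stratifying keeps the class balance the same in both sets, which matters with only 40 samples.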
Milestone1
Progress
Following the vignette, I have read in the data and converted it into data frames in the form MLSeq expects. The next steps are to choose a model, normalize and transform the data, and train the model on the normalized data.
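A minimal sketch of the input step, assuming each HTSeq file is a two-column, tab-separated table (gene ID, count) with summary rows whose names start with "__". The file names, the helper `read_htseq`, and the counts are hypothetical; the demo writes two tiny files so that it runs on its own.

```r
# Write two tiny HTSeq-style files to a temporary directory (made-up counts).
tmp <- tempfile("htseq_demo")
dir.create(tmp)
writeLines(c("geneA\t10", "geneB\t0", "geneC\t5", "__no_feature\t100"),
           file.path(tmp, "sample1.counts"))
writeLines(c("geneA\t7", "geneB\t3", "geneC\t9", "__no_feature\t80"),
           file.path(tmp, "sample2.counts"))

# Hypothetical helper: read one file and return a named vector of counts.
read_htseq <- function(path) {
  tab <- read.delim(path, header = FALSE, col.names = c("gene", "count"),
                    stringsAsFactors = FALSE)
  tab <- tab[!startsWith(tab$gene, "__"), ]  # drop HTSeq summary rows
  setNames(tab$count, tab$gene)
}

files <- list.files(tmp, pattern = "\\.counts$", full.names = TRUE)
per_sample <- lapply(files, read_htseq)

# All files must list the same genes in the same order before binding columns.
stopifnot(all(vapply(per_sample,
                     function(x) identical(names(x), names(per_sample[[1]])),
                     logical(1))))

# Genes in rows, samples in columns.
counts <- do.call(cbind, per_sample)
colnames(counts) <- sub("\\.counts$", "", basename(files))
counts
```

A genes-by-samples count matrix like this, together with a table of class labels, is the kind of input that `DESeqDataSetFromMatrix` from DESeq2 accepts; how the notebook builds its own object may differ.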
Milestone2
1. What I've done
1. I used 28 datasets to train all the models offered by MLSeq and 12 datasets to test them.
2. The best model so far is "voomNSC", with about 75% accuracy on the test datasets.
The comparison between the predicted and actual classes:
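A comparison of this kind can be tabulated as a confusion matrix in base R. The labels below are made up for illustration (12 hypothetical test samples with classes "A" and "B"); they are not the project's results.

```r
# Hypothetical labels for 12 test samples; NOT results from this project.
actual <- factor(c("A", "A", "A", "A", "A", "A",
                   "B", "B", "B", "B", "B", "B"))
predicted <- factor(c("A", "A", "A", "A", "B", "A",
                      "B", "B", "A", "B", "B", "A"),
                    levels = levels(actual))

# Rows are predictions, columns are the true classes.
confusion <- table(Predicted = predicted, Actual = actual)
print(confusion)

# Accuracy is the share of samples on the diagonal.
accuracy <- sum(diag(confusion)) / sum(confusion)
accuracy  # 0.75 for these made-up labels (9 of 12 correct)
```

With only 12 test samples, each misclassified sample changes the accuracy by about 8 percentage points, so small differences between models should be read with care.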
*The possible biomarkers shown here are the genes selected in common (inner_join) by the voomNSC, PLDA, and PLDA2 models. The NSC model selected no possible biomarkers; if it had, its selection should be included in the inner_join as well.
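The idea of keeping only the genes that several models agree on can be sketched as follows. The notebook uses inner_join; this example swaps in base R's `intersect` to stay dependency-free, and the gene names are hypothetical.

```r
# Hypothetical gene lists; real ones would come from each fitted model.
selected <- list(
  voomNSC = c("GENE1", "GENE2", "GENE3", "GENE5"),
  PLDA    = c("GENE2", "GENE3", "GENE4", "GENE5"),
  PLDA2   = c("GENE3", "GENE5", "GENE6"),
  NSC     = character(0)  # this model selected nothing
)

# Intersecting with an empty list would always give an empty result,
# so models that selected no genes are dropped first.
non_empty <- Filter(length, selected)

# Keep only the genes present in every remaining list.
common <- Reduce(intersect, non_empty)
common  # "GENE3" "GENE5"
```

Dropping the empty list mirrors the note above: a model that selects nothing is left out, while a model that does select genes takes part in the intersection.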
2. Issues I found
1. The "voomDLDA" method from the vignette produces errors that I cannot solve for now, so I skipped it.
2. The model accuracy is low; I may need more datasets.
3. I still cannot produce a usable HTML report. The problem comes from my data-input step: I read the data one file at a time and print each table, so the rendered HTML contains a very large number of tables. The resulting file is over 100 MB, so I did not upload it.
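One possible way around the oversized report is to store each input table instead of printing it, and print a single compact overview. This is a sketch under assumptions: the three tables and their column names are made up, and the notebook's input function may be organised differently.

```r
# Three made-up tables standing in for the per-sample inputs.
set.seed(1)
tables <- lapply(1:3, function(i) {
  data.frame(gene = sprintf("gene_%04d", 1:1000),
             count = rpois(1000, lambda = 20))
})
names(tables) <- sprintf("sample_%02d", 1:3)

# One summary row per table instead of 1000 printed rows each.
overview <- data.frame(
  sample = names(tables),
  genes = vapply(tables, nrow, integer(1)),
  total_count = vapply(tables, function(t) as.numeric(sum(t$count)), numeric(1)),
  row.names = NULL
)
print(overview)
```

In R Markdown, the knitr chunk option `results = "hide"` on the input chunk is another way to keep printed tables out of the rendered HTML.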
Repeatability
1. I call set.seed before the random steps, so you can generate the same results as mine.
2. I uploaded all the data so that you can reproduce my project.
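To illustrate why a fixed seed makes the results repeatable, here is a minimal base R sketch of a 5-fold, 10-repeat fold assignment for 28 training samples. The helper `make_repeated_folds` and the seed value are hypothetical; MLSeq handles resampling internally, so this only shows the principle.

```r
# Hypothetical helper: assign each of n samples to one of k folds,
# repeated `repeats` times with a fresh shuffle each time.
make_repeated_folds <- function(n, k = 5, repeats = 10, seed = 1) {
  set.seed(seed)
  lapply(seq_len(repeats), function(r) {
    # Shuffle fold labels 1..k so each fold gets about n / k samples.
    sample(rep(seq_len(k), length.out = n))
  })
}

folds_a <- make_repeated_folds(n = 28, seed = 42)
folds_b <- make_repeated_folds(n = 28, seed = 42)

# Same seed, same partitions: the two runs are identical.
identical(folds_a, folds_b)

# Fold sizes in the first repeat (28 samples over 5 folds).
table(folds_a[[1]])
```

Running the function with a different seed gives different partitions, which is why the seed has to be fixed before every random step for the results to match.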
Deliverable
1. I use R Markdown to present the code in an understandable way.
2. I tried to render and upload the HTML file, but ran into the problem described in the issues section above.