The goal of this project was to increase familiarity with the data clustering packages available in R.
It uses the following clustering techniques:
- K-means
- Hierarchical
- DBSCAN
- Shared Nearest Neighbor
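
As a rough illustration of how these four techniques are called in R, here is a minimal sketch. It is not the project's script: the built-in `iris` data stands in for the Excel data, and every parameter value (`centers`, `eps`, `MinPts`, `k`, `minPts`) is an arbitrary assumption chosen only to make the example run. The DBSCAN and Shared Nearest Neighbor calls are skipped if `fpc` or `dbscan` is not installed.

```r
# Illustrative only: iris stands in for the project's own dataset.
x <- scale(iris[, 1:4])

# K-means (stats); nstart restarts reduce sensitivity to the random start.
set.seed(42)
km <- kmeans(x, centers = 3, nstart = 10)
km_labels <- km$cluster

# Hierarchical clustering (stats), cut into 3 clusters.
hc <- hclust(dist(x), method = "ward.D2")
hc_labels <- cutree(hc, k = 3)

# DBSCAN via fpc; label 0 marks noise points. Parameters are assumed values.
if (requireNamespace("fpc", quietly = TRUE)) {
  db_labels <- fpc::dbscan(x, eps = 0.6, MinPts = 5)$cluster
}

# Shared Nearest Neighbor clustering via dbscan; parameters are assumed values.
if (requireNamespace("dbscan", quietly = TRUE)) {
  snn_labels <- dbscan::sNNclust(x, k = 20, eps = 7, minPts = 16)$cluster
}

# Cross-tabulate two of the labelings to see how they agree.
print(table(kmeans = km_labels, hclust = hc_labels))
```

Each method returns one integer cluster label per row, so the labelings can be compared with each other or with the original labels.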
Instructions to run the files:
- Set the working directory to Raheen_Mazgaonkar_47144316. This is required because the script references the Excel sheet in this folder. To load a different Excel file, change the path in readData().
- Install the following packages using install.packages(). The code for this is present in the script but commented out to avoid reinstallation.
  - stats (for k-means and hclust)
  - fpc (for dbscan)
  - dbscan (for sNNclust)
  - ClusterR (for the accuracy measure)
  - mclust (for the adjusted Rand index)
  - rgl (for scatterplots)
  - car (for scatterplots)
  - factoextra (for dendrograms and scatterplots)
  - zeallot (for returning multiple outputs from a function)
- For dataset1, run proj2p1_final.R from source. Notes:
  - Each plot overwrites the previous one. To view the dendrogram or kNNdist plots, run each clustering separately.
  - RGL is used for the scatterplots. It does not display a title, only a number indicating the order of display. The scatterplots are displayed in this order: Original labels, K-means, Hierarchical, Density-based and Graph-based.
  - Plotting the dendrogram takes considerable time.
- For dataset2, run proj2p2_final.R from source. Note: plotting the final clusters takes a considerable amount of time, even after processing has stopped.
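
Two of the steps above can be sketched in a few lines: installing only the packages that are missing, and writing a plot to its own file so that the next plot does not overwrite it. This is a sketch under assumptions, not code from the project scripts: the helper names `missing_packages` and `save_plot_pdf` are hypothetical, `iris` stands in for the project data, and the output path is a temporary file.

```r
# Hypothetical helper: names of packages that are not yet installed.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# stats is omitted here because it ships with R.
required <- c("fpc", "dbscan", "ClusterR", "mclust", "rgl",
              "car", "factoextra", "zeallot")
to_install <- missing_packages(required)
# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)

# Hypothetical helper: draw a plot into its own PDF file.
save_plot_pdf <- function(path, plot_fn) {
  pdf(path)
  on.exit(dev.off())  # close the device even if plotting fails
  plot_fn()
  invisible(path)
}

# Example: save a dendrogram (iris is a stand-in for the project data).
dend_file <- save_plot_pdf(
  file.path(tempdir(), "dendrogram.pdf"),
  function() plot(hclust(dist(scale(iris[, 1:4]))))
)
```

This applies to plots drawn on the standard graphics device, such as the dendrogram and kNNdist plots. RGL scatterplots open in their own window and are not captured by `pdf()`.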