- The goal of this project is to find the model that performs best at diagnosing breast cancer, that is, at determining whether a tumor is benign (non-cancerous) or malignant (cancerous)
- Although the data comes as a multivariate .csv file, the features were computed from a digitized image of a fine needle aspirate (FNA) of a breast mass; the dataset's authors then used linear programming techniques to build their original classifier
- The dataset contains 569 instances (patients) and 30 features (predictors)
- The data originates from Wisconsin, USA
- Source: UCI Machine Learning Repository
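
The dimensions listed above can be checked against the copy of this dataset that ships with scikit-learn. This is only a sketch: it assumes scikit-learn is installed and that the bundled copy matches the .csv file used in this project, which may have been loaded differently. It also tallies the diagnosis labels.

```python
from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's bundled copy stands in for the project's .csv file.
data = load_breast_cancer()
X, y = data.data, data.target  # X: predictors, y: labels

n_instances, n_features = X.shape
print(f"instances: {n_instances}, features: {n_features}")

# scikit-learn encodes the labels as integers that index into target_names.
counts = {str(name): int((y == i).sum()) for i, name in enumerate(data.target_names)}
print(counts)
```

Note that scikit-learn stores the diagnosis as integers, whereas the UCI file uses the letters B and M, so labels may need remapping when moving between the two.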
How were 30 features extracted from one image?
- For each nucleus, ten real-valued features were computed:
  - radius (mean of distances from the center to points on the perimeter)
  - texture (standard deviation of grayscale values)
  - perimeter
  - area
  - smoothness (local variation in radius lengths)
  - compactness (perimeter^2 / area - 1.0)
  - concavity (severity of concave portions of the contour)
  - concave points (number of concave portions of the contour)
  - symmetry
  - fractal dimension ("coastline approximation" - 1)
- (a) For each image, the mean, standard error, and "worst" or largest value (mean of the three largest values) of each of the ten features above were recorded to 4 decimal places, giving 10 × 3 = 30 features
- (b) For example, field 3 = mean radius, field 13 = radius SE, field 23 = worst radius
- The response variable's class distribution is 357 benign (B) and 212 malignant (M)
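
To make the mean, standard error, and "worst" summaries concrete, here is a minimal sketch that collapses the per-nucleus measurements of one feature into the three values stored per image. The radii are made-up numbers and `summarize` is a hypothetical helper. The standard error is assumed to be the sample standard deviation divided by the square root of the number of nuclei; the original feature extraction may differ in such details.

```python
import math
import statistics


def summarize(values):
    """Collapse per-nucleus measurements into (mean, standard error, worst)."""
    n = len(values)
    mean = statistics.fmean(values)
    # Assumption: SE = sample standard deviation / sqrt(n).
    se = statistics.stdev(values) / math.sqrt(n)
    # "Worst" is the mean of the three largest values.
    worst = statistics.fmean(sorted(values, reverse=True)[:3])
    # Round to 4 decimal places, as the dataset description states.
    return round(mean, 4), round(se, 4), round(worst, 4)


# Hypothetical radii for the nuclei found in one image (illustrative only).
radii = [11.2, 12.8, 13.1, 14.6, 15.0, 17.3]
mean_radius, radius_se, worst_radius = summarize(radii)
print(mean_radius, radius_se, worst_radius)
```

Applying the same helper to each of the ten features would yield the 30 values recorded for a single image.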
- K Nearest Neighbor (KNN)
- Random Forest
- Logistic Regression
- Support Vector Machines
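
One way the four candidate models listed above could be compared is with stratified cross-validation. The sketch below is illustrative rather than this project's actual procedure: the hyperparameters, the feature scaling, the 5-fold split, the accuracy metric, and the use of scikit-learn's bundled copy of the data are all assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: scikit-learn's bundled copy stands in for the project's .csv file.
X, y = load_breast_cancer(return_X_y=True)

# Distance- and margin-based models get a scaler so that no feature dominates;
# all hyperparameters here are illustrative defaults, not tuned values.
models = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

# Stratified folds keep the benign/malignant ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    for name, model in models.items()
}

# Print the models from highest to lowest mean accuracy.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Accuracy is used here only for simplicity. In a diagnostic setting, recall on the malignant class may be the more relevant metric, since a missed malignant tumor is typically costlier than a false alarm.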