Skip to content

khushi4124/Breast-Cancer-Detection

Repository files navigation

Problem Statement

In clinical diagnostics, the manual classification of tumors as Malignant (cancerous) or Benign (non-cancerous) is a time-consuming process prone to human error. The goal of this project is to develop a robust machine learning pipeline that can automate this classification using the Breast Cancer Wisconsin (Diagnostic) Dataset.

Dataset Used: https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data

Solution

Technical Approach Used: Linear Support Vector Machine (SVM) with automated preprocessing.

To solve the classification problem, I implemented a Support Vector Machine (SVM) using a Linear Kernel.

SVM was chosen because of its effectiveness in finding the "Optimal Hyperplane" that maximizes the margin (the safety buffer) between classes.

Key Implementation Details:

Pipeline Architecture: Utilized a Scikit-Learn Pipeline to bundle StandardScaler and the SVC model. This ensures that the data scaling is performed correctly within each fold of cross-validation, preventing data leakage.

Hyperparameter Tuning: Optimized the C hyperparameter 1. This "Soft Margin" approach improved the model's ability to generalize to new, unseen patient data by prioritizing a wider decision boundary over perfect memorization of training noise.

Evaluation Strategy: Beyond simple accuracy, the model was evaluated using:

5-Fold Cross-Validation: To ensure stability across different data distributions.

Confusion Matrix Analysis: Specifically monitoring False Negatives to evaluate clinical safety.

Results and Impact:

The optimized SVM model achieved a Mean Cross-Validation Accuracy of 97.19%, with a specific test set performance of 99.12%.

Final Performance Metrics:

True Negatives: 71 (Benign correctly identified)

True Positives: 42 (Malignant correctly identified)

False Positives: 0 (Zero healthy patients were misdiagnosed)

False Negatives: 1 (Only one case of cancer was missed)

image

About

Built a high-precision Breast Cancer prediction system using a Linear Support Vector Machine (SVM) on the Wisconsin Diagnostic dataset. I implemented a professional Scikit-Learn Pipeline. The model achieved a 99% accuracy rate.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors