🚀 Overview
This project explores 5-year survival prediction using the Haberman dataset through a systematic machine learning workflow, focusing on:
⚙️ Logistic regression implemented from scratch
📊 Validation against scikit-learn
🧪 Feature engineering
📈 ROC/AUC evaluation
🎯 Threshold optimisation
🌲 Benchmarking with a tree-based model
👉 The emphasis is on understanding model behaviour, not just applying models.
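The scratch-vs-library validation step can be sketched roughly as follows. This is a minimal gradient-descent logistic regression on synthetic data, not the project's actual code; the learning rate, iteration count, and the weak-regularisation trick (`C=1e6`) used to make the two fits comparable are all illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=5000):
    """Plain batch gradient descent on the log-loss."""
    Xb = np.hstack([np.ones((len(X), 1)), X])  # prepend a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        grad = Xb.T @ (sigmoid(Xb @ w) - y) / len(y)
        w -= lr * grad
    return w

# Synthetic one-feature dataset standing in for the real data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

w = fit_logistic(X, y)
scratch_pred = (sigmoid(np.hstack([np.ones((200, 1)), X]) @ w) >= 0.5).astype(int)

# Validate against scikit-learn (large C = near-unregularised, so comparable)
sk_pred = LogisticRegression(C=1e6).fit(X, y).predict(X)
agreement = (scratch_pred == sk_pred).mean()
print(f"prediction agreement: {agreement:.2%}")
```

High agreement between the two sets of predictions is the sanity check: it confirms the scratch implementation optimises the same objective the library does.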
Haberman’s Survival Dataset

Features:
- `age` — age at operation
- `year` — year of operation
- `nodes` — number of positive lymph nodes

Original target encoding:
- `1` → Survived ≥ 5 years
- `2` → Died < 5 years

Converted to:
- `1` → survivor
- `0` → non-survivor
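Loading and re-encoding could look like this. A sketch only: the raw UCI file has no header row, so column names are assigned manually, and the sample rows below are illustrative stand-ins for the real data.

```python
import pandas as pd

# The raw file (haberman.data from the UCI repository) has no header;
# a few illustrative rows stand in for it here.
raw = pd.DataFrame(
    [[30, 64, 1, 1], [34, 59, 0, 2], [38, 69, 21, 2], [42, 61, 4, 1]],
    columns=["age", "year", "nodes", "status"],
)

# Original encoding: 1 = survived >= 5 years, 2 = died < 5 years.
# Re-encode to 1 = survivor, 0 = non-survivor.
raw["survived"] = (raw["status"] == 1).astype(int)
print(raw[["age", "year", "nodes", "survived"]])
```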
What actually improves performance in a small clinical dataset?
- More features?
- Class imbalance handling (SMOTE / weighting)?
- Threshold tuning?
- Feature transformation?
- More complex models?
1️⃣ Feature representation dominates performance
- A single feature — lymph node count — carried most predictive signal.
2️⃣ The relationship is non-linear
- Applying a log transformation, log(nodes), significantly improved model performance.
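Because `nodes` can be zero, a shifted log such as `log1p` avoids evaluating `log(0)`. A sketch; `np.log1p(nodes)` is one common way to write it, equivalent to `log(nodes + 1)`.

```python
import numpy as np

nodes = np.array([0, 1, 3, 8, 23, 52])  # heavily right-skewed counts
log_nodes = np.log1p(nodes)             # log(1 + nodes), defined at 0

print(log_nodes.round(2))
# The transform compresses the long right tail, which is what lets a
# linear decision boundary track the non-linear effect of node count.
```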
3️⃣ Threshold tuning matters
- Adjusting the decision threshold improved the balance between classes more effectively than resampling or weighting.
4️⃣ Class imbalance was NOT the main problem
- SMOTE and class weighting did not improve AUC
- Improvements came from better feature representation
5️⃣ Simple models were sufficient
- A more complex model (Gradient Boosting) did not improve performance
🔹 Final Model

- Feature: log(nodes)
- Model: Logistic Regression
- Threshold: 0.65
- Class weighting: None

Accuracy: 0.79
AUC: 0.735
```
              precision    recall  f1-score   support

           0       0.60      0.56      0.58        16
           1       0.85      0.87      0.86        46

    accuracy                           0.79        62
```
🔹 Threshold Optimisation (Final Model)
Accuracy: 0.79
Recall (non-survivor): 0.56
Recall (survivor): 0.87
🔹 Class Weighting (balanced, threshold = 0.5)
Accuracy: 0.76
Recall (non-survivor): 0.75
Recall (survivor): 0.76
AUC: 0.735
🧠 Interpretation
📈 AUC remained unchanged (0.735) → underlying model discrimination did not improve
⚖️ Class weighting:
increases detection of non-survivors
reduces overall performance
🎯 Threshold tuning:
provides a better overall trade-off
improves practical decision performance
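Threshold tuning itself is a one-line change at prediction time; a sweep over candidate thresholds might look like the sketch below. The probabilities here are synthetic stand-ins for the model's validation-set output, and `balanced_accuracy_score` is one reasonable selection criterion among several.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

# Synthetic predicted probabilities and labels standing in for model output
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=300)
proba = np.clip(y_true * 0.35 + rng.normal(0.35, 0.2, size=300), 0, 1)

best_t, best_score = 0.5, -1.0
for t in np.arange(0.30, 0.75, 0.05):
    # Moving the cut-off re-labels cases without retraining the model
    score = balanced_accuracy_score(y_true, (proba >= t).astype(int))
    if score > best_score:
        best_t, best_score = t, score

print(f"best threshold {best_t:.2f}, balanced accuracy {best_score:.3f}")
```

Note that the sweep changes only where the probability cut-off sits, never the probabilities themselves, which is why AUC is unaffected by it.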
To test whether a more complex model could improve performance, a Gradient Boosting classifier was applied.
📊 Results
Accuracy: 0.79
AUC: 0.715
Confusion matrix:

```
[[ 9  7]
 [ 6 40]]
```
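The headline numbers follow directly from that matrix (rows = true class, columns = predicted class, in the usual scikit-learn convention):

```python
import numpy as np

cm = np.array([[9, 7],     # true non-survivors: 9 detected, 7 missed
               [6, 40]])   # true survivors: 6 missed, 40 detected

accuracy = np.trace(cm) / cm.sum()            # (9 + 40) / 62
recall_non_survivor = cm[0, 0] / cm[0].sum()  # 9 / 16
recall_survivor = cm[1, 1] / cm[1].sum()      # 40 / 46

print(f"accuracy {accuracy:.2f}, "
      f"recall(non-survivor) {recall_non_survivor:.2f}, "
      f"recall(survivor) {recall_survivor:.2f}")
# → accuracy 0.79, recall(non-survivor) 0.56, recall(survivor) 0.87
```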
🧠 Interpretation
No improvement in accuracy
Slightly worse AUC than logistic regression
Same effective decision boundary
A more complex model did not improve performance, indicating that the dataset is limited by signal rather than model choice.
The simpler logistic regression model is therefore preferred due to interpretability.
Non-survivors represent the high-risk group. The final model detects ~56% of these cases.
👉 This highlights an important limitation:
Even with optimal modelling, detection of high-risk patients remains limited, reflecting weak signal in the dataset rather than model choice.
| Stage | Result | Insight |
|---|---|---|
| 1️⃣ Baseline (All Features) | Misleading accuracy | Poor minority detection |
| 2️⃣ Threshold Tuning | Improved decision boundary | No change in AUC |
| 3️⃣ Nodes Only | Improved discrimination | Identified dominant feature |
| 4️⃣ SMOTE & Class Weighting | Improved class balance | No improvement in AUC |
| 5️⃣ Log Transformation | 🔥 Best performance | Revealed non-linear relationship |
| 6️⃣ Gradient Boosting Benchmark | No improvement over logistic regression | Confirmed dataset limitation |
- 🧩 End-to-end ML workflow
- 🔍 Debugging and validation
- ⚖️ Accuracy vs AUC understanding
- 🎯 Threshold optimisation
- 🧪 Feature engineering impact
- 🔁 Scratch vs library validation
- 🌲 Model selection judgement
```
haberman-logistic-regression/
│
├── README.md
├── requirements.txt
├── .gitignore
├── data/
├── notebooks/
├── src/
└── results/
```
```
git clone <repository-url>
cd haberman-logistic-regression
pip install -r requirements.txt
```

Then open `notebooks/haberman_analysis.ipynb`.
MIT License
David Power
Simulation Specialist | MSc Artificial Intelligence
- 💼 LinkedIn: https://www.linkedin.com/in/dave-power-47280a44/
- 💻 GitHub: https://github.com/DavePower-cloud

