🚀 Overview
This project explores 5-year survival prediction using the Haberman dataset through a systematic machine learning workflow, focusing on:
⚙️ Logistic regression implemented from scratch
📊 Validation against scikit-learn
🧪 Feature engineering
📈 ROC/AUC evaluation
🎯 Threshold optimisation
🌲 Benchmarking with a tree-based model
👉 The emphasis is on understanding model behaviour, not just applying models.
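The scratch-vs-library validation step can be sketched roughly as follows. This is a minimal gradient-descent logistic regression on synthetic data, not the project's actual code; the learning rate, iteration count, and the weak-regularisation trick (`C=1e6`) used to make the two fits comparable are all illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=5000):
    """Plain batch gradient descent on the log-loss."""
    Xb = np.hstack([np.ones((len(X), 1)), X])  # prepend a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        grad = Xb.T @ (sigmoid(Xb @ w) - y) / len(y)
        w -= lr * grad
    return w

# Synthetic one-feature dataset standing in for the real data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

w = fit_logistic(X, y)
scratch_pred = (sigmoid(np.hstack([np.ones((200, 1)), X]) @ w) >= 0.5).astype(int)

# Validate against scikit-learn (large C = near-unregularised, so comparable)
sk_pred = LogisticRegression(C=1e6).fit(X, y).predict(X)
agreement = (scratch_pred == sk_pred).mean()
print(f"prediction agreement: {agreement:.2%}")
```

High agreement between the two sets of predictions is the sanity check: it confirms the scratch implementation optimises the same objective the library does.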
Haberman’s Survival Dataset

Features:
- `age` — age at operation
- `year` — year of operation
- `nodes` — number of positive lymph nodes

Original target encoding:
- `1` → Survived ≥ 5 years
- `2` → Died < 5 years

Converted to:
- `1` → survivor
- `0` → non-survivor
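Loading and re-encoding could look like this. A sketch only: the raw UCI file has no header row, so column names are assigned manually, and the sample rows below are illustrative stand-ins for the real data.

```python
import pandas as pd

# The raw file (haberman.data from the UCI repository) has no header;
# a few illustrative rows stand in for it here.
raw = pd.DataFrame(
    [[30, 64, 1, 1], [34, 59, 0, 2], [38, 69, 21, 2], [42, 61, 4, 1]],
    columns=["age", "year", "nodes", "status"],
)

# Original encoding: 1 = survived >= 5 years, 2 = died < 5 years.
# Re-encode to 1 = survivor, 0 = non-survivor.
raw["survived"] = (raw["status"] == 1).astype(int)
print(raw[["age", "year", "nodes", "survived"]])
```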
What actually improves performance in a small clinical dataset?
- More features?
- Class imbalance handling (SMOTE / weighting)?
- Threshold tuning?
- Feature transformation?
- More complex models?
1️⃣ Feature representation dominates performance
- A single feature — lymph node count — carried most predictive signal.
2️⃣ The relationship is non-linear
- Applying a log transformation, log(nodes), significantly improved model performance.
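Because `nodes` can be zero, a shifted log such as `log1p` avoids evaluating `log(0)`. A sketch; `np.log1p(nodes)` is one common way to write it, equivalent to `log(nodes + 1)`.

```python
import numpy as np

nodes = np.array([0, 1, 3, 8, 23, 52])  # heavily right-skewed counts
log_nodes = np.log1p(nodes)             # log(1 + nodes), defined at 0

print(log_nodes.round(2))
# The transform compresses the long right tail, which is what lets a
# linear decision boundary track the non-linear effect of node count.
```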
3️⃣ Threshold tuning matters
- Adjusting the decision threshold improved the balance between classes more effectively than resampling or weighting.
4️⃣ Class imbalance was NOT the main problem
- SMOTE and class weighting did not improve AUC
- Improvements came from better feature representation
5️⃣ Simple models were sufficient
- A more complex model (Gradient Boosting) did not improve performance
🔹 Final Model

- Feature: log(nodes)
- Model: Logistic Regression
- Threshold: 0.65
- Class weighting: None

Accuracy: 0.79
AUC: 0.735
```
              precision    recall  f1-score   support

           0       0.60      0.56      0.58        16
           1       0.85      0.87      0.86        46

    accuracy                           0.79        62
```
🔹 Threshold Optimisation (Final Model)
Accuracy: 0.79
Recall (non-survivor): 0.56
Recall (survivor): 0.87
🔹 Class Weighting (balanced, threshold = 0.5)
Accuracy: 0.76
Recall (non-survivor): 0.75
Recall (survivor): 0.76
AUC: 0.735
🧠 Interpretation
📈 AUC remained unchanged (0.735) → underlying model discrimination did not improve
⚖️ Class weighting:
increases detection of non-survivors
reduces overall performance
🎯 Threshold tuning:
provides a better overall trade-off
improves practical decision performance
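Threshold tuning itself is a one-line change at prediction time; a sweep over candidate thresholds might look like the sketch below. The probabilities here are synthetic stand-ins for the model's validation-set output, and `balanced_accuracy_score` is one reasonable selection criterion among several.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

# Synthetic predicted probabilities and labels standing in for model output
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=300)
proba = np.clip(y_true * 0.35 + rng.normal(0.35, 0.2, size=300), 0, 1)

best_t, best_score = 0.5, -1.0
for t in np.arange(0.30, 0.75, 0.05):
    # Moving the cut-off re-labels cases without retraining the model
    score = balanced_accuracy_score(y_true, (proba >= t).astype(int))
    if score > best_score:
        best_t, best_score = t, score

print(f"best threshold {best_t:.2f}, balanced accuracy {best_score:.3f}")
```

Note that the sweep changes only where the probability cut-off sits, never the probabilities themselves, which is why AUC is unaffected by it.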
To test whether a more complex model could improve performance, a Gradient Boosting classifier was applied.
📊 Results
Accuracy: 0.79
AUC: 0.715
Confusion matrix:

```
[[ 9  7]
 [ 6 40]]
```
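The headline numbers follow directly from that matrix (rows = true class, columns = predicted class, in the usual scikit-learn convention):

```python
import numpy as np

cm = np.array([[9, 7],     # true non-survivors: 9 detected, 7 missed
               [6, 40]])   # true survivors: 6 missed, 40 detected

accuracy = np.trace(cm) / cm.sum()            # (9 + 40) / 62
recall_non_survivor = cm[0, 0] / cm[0].sum()  # 9 / 16
recall_survivor = cm[1, 1] / cm[1].sum()      # 40 / 46

print(f"accuracy {accuracy:.2f}, "
      f"recall(non-survivor) {recall_non_survivor:.2f}, "
      f"recall(survivor) {recall_survivor:.2f}")
# → accuracy 0.79, recall(non-survivor) 0.56, recall(survivor) 0.87
```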
🧠 Interpretation
No improvement in accuracy
Slightly worse AUC than logistic regression
Same effective decision boundary
A more complex model did not improve performance, indicating that the dataset is limited by signal rather than model choice.
The simpler logistic regression model is therefore preferred due to interpretability.
Non-survivors represent the high-risk group. The final model detects ~56% of these cases.
👉 This highlights an important limitation:
Even with optimal modelling, detection of high-risk patients remains limited, reflecting weak signal in the dataset rather than model choice.
| Stage | Result | Insight |
|---|---|---|
| 1️⃣ Baseline (All Features) | Misleading accuracy | Poor minority detection |
| 2️⃣ Threshold Tuning | Improved decision boundary | No change in AUC |
| 3️⃣ Nodes Only | Improved discrimination | Identified dominant feature |
| 4️⃣ SMOTE & Class Weighting | Improved class balance | No improvement in AUC |
| 5️⃣ Log Transformation | 🔥 Best performance | Revealed non-linear relationship |
| 6️⃣ Gradient Boosting Benchmark | No improvement over logistic regression | Confirmed dataset limitation |
- 🧩 End-to-end ML workflow
- 🔍 Debugging and validation
- ⚖️ Accuracy vs AUC understanding
- 🎯 Threshold optimisation
- 🧪 Feature engineering impact
- 🔁 Scratch vs library validation
- 🌲 Model selection judgement
```
haberman-logistic-regression/
│
├── README.md
├── requirements.txt
├── .gitignore
├── data/
├── notebooks/
├── src/
└── results/
```
```
git clone <repository-url>
cd haberman-logistic-regression
pip install -r requirements.txt
```

Then open `notebooks/haberman_analysis.ipynb`.
MIT License
David Power
Simulation Specialist | MSc Artificial Intelligence
- 💼 LinkedIn: https://www.linkedin.com/in/dave-power-47280a44/
- 💻 GitHub: https://github.com/DavePower-cloud

