
🧠 Haberman Survival Prediction — Logistic Regression from Scratch



🚀 Overview

This project explores 5-year survival prediction using the Haberman dataset through a systematic machine learning workflow, focusing on:

⚙️ Logistic regression implemented from scratch
📊 Validation against scikit-learn
🧪 Feature engineering
📈 ROC/AUC evaluation
🎯 Threshold optimisation
🌲 Benchmarking with a tree-based model

👉 The emphasis is on understanding model behaviour, not just applying models.
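For reference, the core of a from-scratch logistic regression is a sigmoid plus gradient descent on the log-loss. The sketch below is illustrative synthetic-data code, not the repository's exact implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=5000):
    """Batch gradient descent on the log-loss; returns weights and bias."""
    w = np.zeros(X.shape[1])
    b = 0.0
    n = len(y)
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)          # predicted probabilities
        w -= lr * (X.T @ (p - y)) / n   # gradient of log-loss w.r.t. w
        b -= lr * np.mean(p - y)        # gradient w.r.t. bias
    return w, b

# toy 1-D example: class 1 whenever x > 0
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] > 0).astype(float)

w, b = fit_logistic(X, y)
preds = (sigmoid(X @ w + b) >= 0.5).astype(float)
accuracy = (preds == y).mean()
```

Validating an implementation like this against scikit-learn's `LogisticRegression` (comparing coefficients and predictions) is the check performed in the project.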

[Figure: ROC curve]


📊 Dataset

Haberman’s Survival Dataset

🔹 Features

  • age — age at operation
  • year — year of operation
  • nodes — number of positive lymph nodes

🔹 Target

  • 1 → Survived ≥ 5 years
  • 2 → Died < 5 years

Converted to:

  • 1 → survived
  • 0 → non-survivor
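The relabelling from the dataset's 1/2 coding to a 1/0 target is a one-liner, for example:

```python
import numpy as np

raw = np.array([1, 2, 1, 1, 2])       # original Haberman coding
y = np.where(raw == 1, 1, 0)          # 1 = survived >= 5 years, 0 = non-survivor
```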

❓ Project Question

What actually improves performance in a small clinical dataset?

  • More features?
  • Class imbalance handling (SMOTE / weighting)?
  • Threshold tuning?
  • Feature transformation?
  • More complex models?


🔥 Key Finding

1️⃣ Feature representation dominates performance

  • A single feature — lymph node count — carried most predictive signal.

2️⃣ The relationship is non-linear

  • Applying a transformation:

    • log(nodes) significantly improved model performance.

3️⃣ Threshold tuning matters

  • Adjusting the decision threshold improved the balance between classes more effectively than resampling or weighting.

4️⃣ Class imbalance was NOT the main problem

  • SMOTE and class weighting did not improve AUC
  • Improvements came from better feature representation

5️⃣ Simple models were sufficient

  • A more complex model (Gradient Boosting) did not improve performance
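The log transform from finding 2 can be sketched as follows. Since node counts include zeros, `log1p` (i.e. log(1 + n)) is a natural choice; the exact transform used in the notebook may differ:

```python
import numpy as np

nodes = np.array([0, 1, 3, 8, 25, 52])   # positive lymph node counts
log_nodes = np.log1p(nodes)              # log(1 + n) is defined at n = 0
```

The transform compresses the long right tail of the count distribution, which is what lets a linear decision boundary fit the non-linear relationship.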

[Figure: feature engineering comparison]


🏁 Final Model

Feature: log(nodes)
Model: Logistic Regression
Threshold: 0.65
Class weighting: None
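Applying a custom threshold means classifying on `predict_proba` rather than `predict`. A minimal sketch with synthetic stand-in data (in the project, the feature is the log-transformed node count):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# synthetic stand-in data, not the Haberman features
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)[:, 1]      # P(class 1)
y_pred = (proba >= 0.65).astype(int)    # decision threshold 0.65 instead of 0.5
```

Raising the threshold above 0.5 makes the model more conservative about predicting the majority (survivor) class, which is how the class balance was improved without reweighting.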


📈 Final Performance

Accuracy: 0.79
AUC: 0.735
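Accuracy and AUC measure different things: accuracy depends on the chosen threshold, while AUC scores how well the model ranks positives above negatives, independent of any threshold. A toy illustration (made-up scores, not the project's data):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score

y_true = np.array([0, 0, 1, 1, 1])
scores = np.array([0.2, 0.6, 0.55, 0.8, 0.9])

auc = roc_auc_score(y_true, scores)            # threshold-free ranking quality
acc = accuracy_score(y_true, scores >= 0.65)   # depends on the chosen threshold
```

This is why threshold tuning can change accuracy and recall while leaving AUC untouched, exactly the pattern observed below.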


🔢 Confusion Matrix

[Figure: confusion matrix]


📋 Classification Report

          precision    recall  f1-score   support

       0       0.60      0.56      0.58        16
       1       0.85      0.87      0.86        46

accuracy                           0.79        62

⚖️ Model Comparison: Threshold vs Class Weighting

🔹 Threshold Optimisation (Final Model)
Accuracy: 0.79
Recall (non-survivor): 0.56
Recall (survivor): 0.87


🔹 Class Weighting (balanced, threshold = 0.5)
Accuracy: 0.76
Recall (non-survivor): 0.75
Recall (survivor): 0.76
AUC: 0.735
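The two strategies differ only in where the cost is paid: class weighting reweights the training loss, threshold tuning moves the decision cut-off after training. A sketch on synthetic imbalanced data (illustrative, not the repository's code) shows the characteristic effect of `class_weight="balanced"`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# imbalanced, overlapping classes: ~80% class 1, ~20% class 0
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-0.5, size=(40, 1)),
               rng.normal(loc=0.5, size=(160, 1))])
y = np.array([0] * 40 + [1] * 160)

plain = LogisticRegression().fit(X, y)
weighted = LogisticRegression(class_weight="balanced").fit(X, y)

n_plain = (plain.predict(X) == 0).sum()       # minority predictions, unweighted
n_weighted = (weighted.predict(X) == 0).sum() # weighting flags far more of them
```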


🧠 Interpretation

📈 AUC remained unchanged (0.735) → underlying model discrimination did not improve

⚖️ Class weighting:
  • increases detection of non-survivors (recall 0.56 → 0.75)
  • lowers overall accuracy (0.79 → 0.76)

🎯 Threshold tuning:
  • provides a better overall trade-off
  • improves practical decision performance


🌲 Tree-Based Benchmark (Gradient Boosting)

To test whether a more complex model could improve performance, a Gradient Boosting classifier was applied.
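A minimal version of this benchmark, on synthetic stand-in data with one skewed, node-like feature (the real comparison uses the Haberman split from the notebook):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# synthetic stand-in: survival odds decrease with a skewed count feature
rng = np.random.default_rng(0)
X = rng.exponential(scale=4.0, size=(300, 1))
y = (rng.random(300) < 1.0 / (1.0 + 0.3 * X[:, 0])).astype(int)

gb = GradientBoostingClassifier(random_state=0).fit(X, y)
lr = LogisticRegression().fit(np.log1p(X), y)

auc_gb = roc_auc_score(y, gb.predict_proba(X)[:, 1])
auc_lr = roc_auc_score(y, lr.predict_proba(np.log1p(X))[:, 1])
```

With a single monotone feature, both models learn essentially the same decision boundary, which is why the tree ensemble has little room to improve.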

📊 Results
Accuracy: 0.79
AUC: 0.715

Confusion matrix:

[[ 9  7]
 [ 6 40]]


🧠 Interpretation

  • No improvement in accuracy
  • Slightly worse AUC than logistic regression (0.715 vs 0.735)
  • Same effective decision boundary


🔥 Key Insight

A more complex model did not improve performance, indicating that the dataset is limited by signal rather than model choice.

The simpler logistic regression model is therefore preferred due to interpretability.


🏥 Clinical Interpretation

Non-survivors represent the high-risk group. The final model detects ~56% of these cases (recall = 0.56).

👉 This highlights an important limitation:

Even with optimal modelling, detection of high-risk patients remains limited, reflecting weak signal in the dataset rather than model choice.


🧪 Experiment Journey

1️⃣ Baseline (All Features)
  • Misleading accuracy
  • Poor minority detection

2️⃣ Threshold Tuning
  • Improved decision boundary
  • No change in AUC

3️⃣ Nodes Only
  • Improved discrimination
  • Identified dominant feature

4️⃣ SMOTE & Class Weighting
  • Improved class balance
  • No improvement in AUC

5️⃣ Log Transformation 🔥
  • Best performance
  • Revealed non-linear relationship

6️⃣ Gradient Boosting Benchmark
  • No improvement over logistic regression
  • Confirmed dataset limitation


🧠 What This Project Demonstrates

🧩 End-to-end ML workflow
🔍 Debugging and validation
⚖️ Accuracy vs AUC understanding
🎯 Threshold optimisation
🧪 Feature engineering impact
🔁 Scratch vs library validation
🌲 Model selection judgement


📁 Repository Structure

haberman-logistic-regression/
├── README.md
├── requirements.txt
├── .gitignore
├── data/
├── notebooks/
├── src/
└── results/


⚙️ Setup

git clone <repository-url>
cd haberman-logistic-regression
pip install -r requirements.txt


▶️ Run

Open notebooks/haberman_analysis.ipynb in Jupyter.


📜 License

MIT License


👤 Author

David Power
Simulation Specialist | MSc Artificial Intelligence

About

Implementation of logistic regression from scratch, validated against scikit-learn: a systematic ML workflow covering feature engineering, evaluation, and threshold optimisation, benchmarked against Gradient Boosting.
