This project focuses on predicting a patient’s health outcome using medical diagnostic data. The goal is to explore how machine learning can assist in identifying patterns related to disease risk, while remaining transparent, interpretable, and aligned with academic best practices.
Rather than building a complex or black-box system, this project intentionally uses a simple and explainable classification model to understand how medical features relate to health outcomes.
- Apply machine learning to a real healthcare-related dataset
- Practice structured data cleaning and preprocessing
- Use logistic regression for binary classification
- Evaluate model performance using clear and interpretable metrics
- Understand the limitations of predictive models in healthcare contexts
The dataset contains medical and demographic information collected from patients. Each row represents one patient, and the target variable indicates whether the patient tested positive or negative for a specific medical condition.
Typical features include:
- Glucose levels
- Blood pressure
- Body mass index (BMI)
- Age
- Other physiological measurements
The target variable is binary:
- Positive diagnosis
- Negative diagnosis
This dataset is well suited to introductory healthcare classification tasks because of its structured format and clear outcome variable.
Before modeling, the dataset was carefully prepared:
- Invalid zero values in medical measurements were identified and treated as missing data
- Missing values were replaced with the mean of their column (mean imputation)
- Features were verified for consistency and correctness
- The dataset was split into training and testing sets
These steps help ensure that the model is trained on plausible medical values rather than on placeholder zeros.
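The preparation steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes pandas, NumPy, and scikit-learn, and the column names (`Glucose`, `BloodPressure`, `BMI`, `Age`, `Outcome`), the 75/25 split, and the toy values are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical column names; the real dataset's headers may differ.
MEASUREMENT_COLS = ["Glucose", "BloodPressure", "BMI"]
TARGET_COL = "Outcome"


def prepare(df, test_size=0.25, seed=42):
    """Treat zeros as missing, impute column means, then split."""
    cleaned = df.copy()
    # A zero in these measurements is not physiologically plausible,
    # so it is treated as a missing value.
    cleaned[MEASUREMENT_COLS] = cleaned[MEASUREMENT_COLS].replace(0, np.nan)
    # Replace each missing value with the mean of its column
    # (the mean is computed over the non-missing entries).
    col_means = cleaned[MEASUREMENT_COLS].mean()
    cleaned[MEASUREMENT_COLS] = cleaned[MEASUREMENT_COLS].fillna(col_means)
    X = cleaned.drop(columns=[TARGET_COL])
    y = cleaned[TARGET_COL]
    # stratify=y keeps the positive/negative ratio similar in both splits.
    return train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )


# Made-up values, used only to exercise the function.
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89, 137, 116, 78, 0],
    "BloodPressure": [72, 66, 0, 66, 40, 74, 50, 70],
    "BMI": [33.6, 26.6, 23.3, 0.0, 43.1, 25.6, 31.0, 35.3],
    "Age": [50, 31, 32, 21, 33, 30, 26, 29],
    "Outcome": [1, 0, 1, 0, 1, 0, 1, 0],
})

X_train, X_test, y_train, y_test = prepare(toy)
print(X_train.shape, X_test.shape)
```

The sketch follows the order listed above (impute, then split). A common variation is to split first and compute the means from the training rows only, so that no information from the test set influences the imputed values.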
Logistic Regression was selected because:
- It is widely used in healthcare analytics
- It produces interpretable probability-based outputs
- It aligns with coursework concepts
- It avoids unnecessary model complexity
The focus of this project is understanding how the model works, not maximizing performance at all costs.
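To make the "interpretable probability-based outputs" concrete, here is a small sketch assuming scikit-learn's `LogisticRegression`. The data is synthetic and the two features are placeholders, not columns from the project's dataset. It shows that the predicted probability is simply the logistic (sigmoid) function applied to a weighted sum of the features, and that exponentiating a coefficient gives an odds ratio.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: two placeholder features and a made-up rule in which
# feature 0 raises the chance of a positive outcome.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
true_logits = 2.0 * X[:, 0] - 0.5 * X[:, 1]
y = (rng.random(n) < 1 / (1 + np.exp(-true_logits))).astype(int)

model = LogisticRegression()
model.fit(X, y)

# Probability of the positive class for the first five rows.
proba = model.predict_proba(X[:5])[:, 1]

# The same probabilities computed by hand: sigmoid(w . x + b).
linear_score = X[:5] @ model.coef_[0] + model.intercept_[0]
manual_proba = 1 / (1 + np.exp(-linear_score))

# exp(coefficient) is the multiplicative change in the odds of a positive
# outcome for a one-unit increase in that feature, others held fixed.
odds_ratios = np.exp(model.coef_[0])

print(np.round(proba, 3))
print(np.round(odds_ratios, 2))
```

Because the coefficients map directly to odds ratios, the direction and rough size of each feature's effect can be read off the fitted model, which is the main reason for preferring it here over a more complex classifier.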
The model was evaluated using:
- Accuracy – to measure overall prediction correctness
- Confusion Matrix – to examine true positives, false positives, true negatives, and false negatives
The confusion matrix is especially important in healthcare, where different types of errors have different real-world implications: a false negative is a missed diagnosis, while a false positive leads to unnecessary follow-up.
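As an illustration of the two metrics, the sketch below assumes scikit-learn's `accuracy_score` and `confusion_matrix`. The labels are invented for the example (1 = positive diagnosis, 0 = negative) and are not results from this project.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented labels for illustration only: 1 = positive, 0 = negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)

# With labels=[0, 1], scikit-learn lays the matrix out as
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(f"accuracy = {accuracy:.2f}")            # accuracy = 0.75
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")      # TP=3 FP=1 TN=3 FN=1
```

Here six of eight predictions are correct, but the matrix shows that the two errors are of different kinds: one patient with the condition was missed and one without it was flagged.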
- Certain medical features show strong relationships with patient outcomes
- Logistic regression provides clear insight into classification behavior
- The confusion matrix highlights where the model succeeds and fails
- Even simple models can provide meaningful signals when data is cleaned properly
- This model is trained on historical data and does not replace medical diagnosis
- The dataset lacks contextual factors such as lifestyle or genetics
- Accuracy alone is insufficient for real healthcare deployment
- Ethical and clinical validation would be required in real-world use
This project is strictly educational and exploratory.
Possible extensions include:
- Adding precision, recall, and F1-score analysis
- Comparing logistic regression with other classifiers
- Exploring feature importance more deeply
- Using cross-validation for more robust evaluation
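Two of these extensions could look roughly like the sketch below, which assumes scikit-learn. The labels in the first part are invented, and the second part uses `make_classification` to generate synthetic data in place of the project's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import cross_val_score

# Part 1: precision, recall, and F1 on invented labels (1 = positive).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 1 / 2
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 1 / 4
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.50 recall=0.25 f1=0.33

# Part 2: 5-fold cross-validated recall on synthetic stand-in data.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
scores = cross_val_score(
    LogisticRegression(), X, y, cv=5, scoring="recall"
)
print(scores.round(2), scores.mean().round(2))
```

In the first part, accuracy would be 0.60 even though three of the four positive cases are missed, which is why recall is worth reporting alongside it. Cross-validation then gives one score per fold rather than a single number from one train/test split.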
Isaac Wanlemvo
Software Engineering & AI Student