This project predicts whether a customer will default on a loan using machine learning classification models. The dataset is simple and structured, which keeps the focus on model evaluation, class imbalance, and metric interpretation.
- Source: Kaggle
- Rows: 10,000
- Features:
  - Employed (0/1)
  - Bank Balance
  - Annual Salary
  - Defaulted? (Target)
- Minimal preprocessing required due to clean dataset
- Train-test split applied
- Focus on handling class imbalance
- No missing values
- Logistic Regression (Standard)
- Logistic Regression (Class Balanced)
- Random Forest
- XGBoost
- Confusion Matrix
- Precision
- Recall
- F1-Score
Special focus was given to recall, as detecting defaulters is more critical than minimizing false alarms.
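The workflow described above (train-test split, standard versus class-balanced Logistic Regression, recall-focused metrics) could look roughly like the following sketch. It is not the project's actual code: the data is a synthetic stand-in generated on the spot, and the 80/20 stratified split, the `StandardScaler` step, the `random_state` values, and the `evaluate` helper are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Kaggle data (assumption: NOT the real dataset).
rng = np.random.default_rng(0)
n_rows = 10_000
employed = rng.integers(0, 2, size=n_rows)
bank_balance = rng.gamma(shape=2.0, scale=5_000.0, size=n_rows)
annual_salary = rng.normal(40_000.0, 12_000.0, size=n_rows)
# Assumed relationship: default probability rises with balance, positives stay rare.
default_prob = 1.0 / (1.0 + np.exp(-(-6.0 + bank_balance / 4_000.0)))
y = rng.binomial(1, default_prob)
X = np.column_stack([employed, bank_balance, annual_salary])

# Stratify so that train and test keep the same share of defaulters.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)


def evaluate(class_weight):
    """Fit a scaled logistic regression and return test-set metrics."""
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(class_weight=class_weight, max_iter=1000),
    )
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "confusion_matrix": confusion_matrix(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    }


results = {
    "Logistic (Standard)": evaluate(None),
    "Logistic (Balanced)": evaluate("balanced"),
}
for name, metrics in results.items():
    print(name, {k: round(v, 2) for k, v in metrics.items() if k != "confusion_matrix"})
```

Because the data here is synthetic, the printed numbers will not match the results table; the sketch only shows where `class_weight="balanced"` and the recall-oriented metrics enter the pipeline.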
| Model | Precision | Recall | F1 Score |
|---|---|---|---|
| Logistic (Standard) | 0.63 | 0.28 | 0.38 |
| Logistic (Balanced) | 0.19 | 0.87 | 0.31 |
| Random Forest | 0.22 | 0.75 | 0.34 |
| XGBoost | 0.23 | 0.81 | 0.36 |
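As a sanity check on the table, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short sketch below recomputes it from the rounded precision and recall values in each row; small differences from the reported F1 are expected because the inputs are already rounded to two decimals. The `f1_from` helper and the `ROWS` name are illustrative only.

```python
# (precision, recall, reported F1) copied from the results table.
ROWS = {
    "Logistic (Standard)": (0.63, 0.28, 0.38),
    "Logistic (Balanced)": (0.19, 0.87, 0.31),
    "Random Forest": (0.22, 0.75, 0.34),
    "XGBoost": (0.23, 0.81, 0.36),
}


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


for name, (precision, recall, reported) in ROWS.items():
    recomputed = f1_from(precision, recall)
    print(f"{name}: recomputed F1 = {recomputed:.3f}, reported = {reported:.2f}")
```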
- The dataset is highly imbalanced, making accuracy misleading
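- Logistic Regression (standard) misses most defaulters (recall 0.28), although it has the highest precision (0.63)
- Balanced Logistic Regression raises recall from 0.28 to 0.87 but increases false positives sharply (precision falls to 0.19)
- Random Forest reaches 0.75 recall with slightly higher precision (0.22) than the balanced logistic model
- Among the three high-recall models, XGBoost provides the best balance between precision and recall (F1 0.36); standard Logistic Regression has the highest F1 overall (0.38) but at much lower recall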
- Higher recall → better detection of defaulters
- Higher precision → fewer false alarms
The choice of model depends on business priorities.
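One way to make "business priorities" concrete is to attach a cost to each missed defaulter and each false alarm, then compare the models. The sketch below does this using only the precision and recall values from the results table. The cost figures, the per-100-defaulters normalization, and the function names are hypothetical and chosen purely for illustration.

```python
# (precision, recall) copied from the results table.
MODELS = {
    "Logistic (Standard)": (0.63, 0.28),
    "Logistic (Balanced)": (0.19, 0.87),
    "Random Forest": (0.22, 0.75),
    "XGBoost": (0.23, 0.81),
}


def expected_cost(precision, recall, cost_fn, cost_fp, n_defaulters=100):
    """Cost per n_defaulters actual defaulters, derived from precision and recall."""
    tp = recall * n_defaulters                # defaulters caught
    fn = n_defaulters - tp                    # defaulters missed
    fp = tp * (1.0 - precision) / precision   # false alarms implied by precision
    return cost_fn * fn + cost_fp * fp


def cheapest_model(cost_fn, cost_fp):
    """Name of the model with the lowest expected cost under the given costs."""
    return min(MODELS, key=lambda name: expected_cost(*MODELS[name], cost_fn, cost_fp))


# Hypothetical cost ratios: a missed defaulter costs 1x, 10x, or 50x a false alarm.
for ratio in (1, 10, 50):
    print(f"missed defaulter costs {ratio}x a false alarm ->", cheapest_model(ratio, 1))
```

Under these assumed costs, the preferred model changes with the ratio: the standard logistic model wins when both errors cost the same, while the recall-oriented models win as missed defaulters become more expensive.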
- Confusion Matrix comparison
- Precision-Recall trade-off
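Compared with balanced Logistic Regression, the tree-based models (Random Forest, XGBoost) give up some recall for slightly higher precision and F1, plausibly because they capture non-linear relationships. Among the recall-oriented models, XGBoost achieved the best balance between detecting defaulters and controlling false positives (recall 0.81, precision 0.23, F1 0.36).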
- Hyperparameter tuning
- Threshold optimization
- ROC / PR curve analysis
- Deployment using Streamlit
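As a possible starting point for the threshold optimization and PR curve items, the sketch below picks the decision threshold with the highest precision among those that still meet a minimum recall. It uses scikit-learn's `precision_recall_curve`; the synthetic scores, the 0.8 recall target, and the `pick_threshold` helper are assumptions for illustration and not part of the project.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def pick_threshold(y_true, scores, min_recall=0.8):
    """Return (threshold, precision, recall) maximizing precision at recall >= min_recall."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # The last precision/recall pair has no matching threshold, so drop it.
    precision, recall = precision[:-1], recall[:-1]
    meets_target = recall >= min_recall
    best = int(np.argmax(np.where(meets_target, precision, -1.0)))
    return float(thresholds[best]), float(precision[best]), float(recall[best])


# Synthetic scores standing in for a model's predicted default probabilities.
rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.05, size=5_000)
scores = np.where(y_true == 1, rng.beta(4, 2, size=5_000), rng.beta(2, 4, size=5_000))

threshold, precision_at, recall_at = pick_threshold(y_true, scores, min_recall=0.8)
print(f"threshold={threshold:.3f} precision={precision_at:.2f} recall={recall_at:.2f}")
```

In practice the threshold would be chosen on a validation split rather than the test set, so that the reported metrics stay unbiased.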
This project is licensed under the GNU General Public License.