A Machine Learning project to predict credit default risk based on applicant financial profiles.
Financial institutions risk significant losses when customers fail to repay their loans (default). This project aims to build a classification model that predicts whether a customer is eligible for a loan (Non-Default/Safe) or poses a risk (Default) based on their demographic and financial history.
The dataset used in this project is sourced from Kaggle: Dataset Klasifikasi Status Pinjaman.
- Total Records: ~50,000 rows.
- Target Variable: `status_pinjaman` (0: Non-Default, 1: Default).
- Key Features: Age, Annual Income, Credit Score, Total Debt, Credit History Length, etc.
This project implements an end-to-end data science pipeline:
- Data Cleaning:
  - Removed the `id_pelanggan` column (irrelevant unique identifier).
  - Handling Missing Values: Dropped rows containing nulls (percentage was < 5%).
  - Label Encoding: Applied to categorical features such as `status_pekerjaan` (Job Status), `tipe_produk` (Product Type), and `tujuan_pinjaman` (Loan Purpose).
- Handling Skewed Data (Crucial Step):
  - Identified highly skewed distributions in numerical features (e.g., `pendapatan_tahunan`, `aset_tabungan`).
  - Applied Log Transformation (`np.log1p`) to reduce the skew and bring these distributions closer to normal, which makes the features better suited to linear and distance-based algorithms.
- Data Splitting:
  - Ratio: 80% Train : 10% Validation : 10% Test.
  - Stratification: Used `stratify=y` to maintain the ratio of the target classes across all splits.
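The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: it runs on a small synthetic DataFrame whose values (including the job-status categories `tetap`, `kontrak`, `lepas`) are made up, it reuses only a few of the column names mentioned above, and it assumes the 80/10/10 split is done with two chained `train_test_split` calls and an arbitrary `random_state`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Synthetic stand-in for the Kaggle data; all values are made up for illustration.
rng = np.random.default_rng(42)
n = 1000
df = pd.DataFrame({
    "id_pelanggan": np.arange(n),
    "status_pekerjaan": rng.choice(["tetap", "kontrak", "lepas"], size=n),
    "pendapatan_tahunan": rng.lognormal(mean=18.0, sigma=1.0, size=n),
    "aset_tabungan": rng.lognormal(mean=16.0, sigma=1.2, size=n),
    "status_pinjaman": rng.integers(0, 2, size=n),
})
df.loc[::50, "aset_tabungan"] = np.nan  # inject nulls into 2% of the rows

# 1. Data cleaning: drop the identifier column and any row containing nulls.
df = df.drop(columns=["id_pelanggan"]).dropna().reset_index(drop=True)

# 2. Label-encode the categorical column(s).
for col in ["status_pekerjaan"]:
    df[col] = LabelEncoder().fit_transform(df[col])

# 3. Log-transform the right-skewed numerical columns.
skewed_cols = ["pendapatan_tahunan", "aset_tabungan"]
skew_before = df[skewed_cols].skew()
df[skewed_cols] = np.log1p(df[skewed_cols])
skew_after = df[skewed_cols].skew()

# 4. Stratified 80/10/10 split: first hold out 20%, then halve the holdout.
X = df.drop(columns=["status_pinjaman"])
y = df["status_pinjaman"]
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42
)

print(skew_before.round(2).to_dict(), skew_after.round(2).to_dict())
print(len(X_train), len(X_val), len(X_test))
```

Comparing `skew_before` with `skew_after` is a quick way to check that the transformation helped, and comparing `y_train.mean()` with `y_test.mean()` confirms that stratification kept the class ratio stable.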
Three Machine Learning algorithms were trained and evaluated:
- Naive Bayes (GaussianNB) - Used as the baseline model.
- K-Nearest Neighbors (KNN) - Optimized using `GridSearchCV` to find the best k and distance metric.
- Decision Tree - Optimized using `GridSearchCV` (tuning `max_depth`, `criterion`, etc.).
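A hedged sketch of how the three models might be set up with scikit-learn is shown below. The parameter grids, the `accuracy` scoring, the 5-fold cross-validation, and the synthetic data from `make_classification` are all assumptions chosen for illustration; the notebook's actual search space may differ.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the loan dataset.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Baseline: Gaussian Naive Bayes with default settings, no tuning.
baseline = GaussianNB().fit(X_train, y_train)

# KNN: search over k and the distance metric (grid values are assumptions).
knn_search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={
        "n_neighbors": [3, 5, 7, 9],
        "metric": ["euclidean", "manhattan"],
    },
    scoring="accuracy",
    cv=5,
)
knn_search.fit(X_train, y_train)

# Decision Tree: search over depth and split criterion (grid values are assumptions).
tree_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={
        "max_depth": [3, 5, 10, None],
        "criterion": ["gini", "entropy"],
    },
    scoring="accuracy",
    cv=5,
)
tree_search.fit(X_train, y_train)

print("Naive Bayes test accuracy:", baseline.score(X_test, y_test))
print("KNN best params:", knn_search.best_params_)
print("Tree best params:", tree_search.best_params_)
```

After fitting, `knn_search.best_estimator_` and `tree_search.best_estimator_` hold the models refit on the full training data with the winning parameters, ready to be scored on the held-out sets.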
The models were evaluated using the Test Set (unseen data). The performance comparison is as follows:
| Model | Accuracy | Precision | Recall | Observation |
|---|---|---|---|---|
| Naive Bayes | 72.07% | 66.65% | 98.61% | Very high Recall, but suffers from high False Positives. |
| KNN (Tuned) | 78.69% | 76.61% | 88.23% | Moderate and balanced performance. |
| Decision Tree (Tuned) | 88.29% | 87.76% | 91.48% | 🏆 Best Model |
The Decision Tree was selected as the final model because it offered the best balance between Accuracy and Recall: the highest Accuracy of the three models (88.29%) while keeping Recall at 91.48%.
- True Positive: Successfully detected the majority of default cases.
- False Negative (Bank Risk): Low (~233 cases in the test set), which significantly reduces the risk of granting loans to defaulters.
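For reference, the metrics in the table follow directly from confusion-matrix counts, with Default (1) as the positive class. The small helper below is a sketch; the function name `classification_metrics` and the counts passed to it are hypothetical toy values, not the project's actual confusion matrix.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, and recall from confusion-matrix counts.

    tp: defaults correctly flagged       fp: safe customers flagged as default
    fn: defaults missed (the bank risk)  tn: safe customers correctly cleared
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Toy counts for illustration only.
metrics = classification_metrics(tp=90, fp=10, fn=5, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Because Recall is `tp / (tp + fn)`, every missed default lowers it directly, which is why the false-negative count is the figure to watch from the bank's perspective.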
- Clone the repository:

      git clone https://github.com/faisalsuryasaputra/klasifikasi-status-pinjaman.git
      cd klasifikasi-status-pinjaman

- Install dependencies:

      pip install pandas numpy scikit-learn matplotlib seaborn kagglehub

- Run the Notebook: Open `notebook.ipynb` (or your specific filename) using Jupyter Notebook, VS Code, or Google Colab.
- Faisal Surya Saputra - End-to-End Analysis (Preprocessing, Modeling, & Evaluation)
Created as a Final Project for the Artificial Intelligence / Machine Learning Course.