Capstone Project for Berkeley Haas (ML & AI)
- Employ machine learning to predict which individuals are at the highest risk of defaulting on their loans.
- Binary Classification
- The dataset contains 255,347 rows and 18 columns in total.
Features
| # | Column Name | Data Type | Description |
|---|---|---|---|
| 1 | LoanID | string | A unique identifier for each loan. |
| 2 | Age | integer | The age of the borrower. |
| 3 | Income | integer | The annual income of the borrower. |
| 4 | LoanAmount | integer | The amount of money being borrowed. |
| 5 | CreditScore | integer | The credit score of the borrower indicating their creditworthiness. |
| 6 | MonthsEmployed | integer | The number of months the borrower has been employed. |
| 7 | NumCreditLines | integer | The number of credit lines the borrower has open. |
| 8 | InterestRate | float | The interest rate for the loan. |
| 9 | LoanTerm | integer | The term length of the loan in months. |
| 10 | DTIRatio | float | The Debt-to-Income ratio indicating the borrower's debt compared to their income. |
| 11 | Education | string | The highest level of education attained by the borrower (PhD, Master's, Bachelor's, High School). |
| 12 | EmploymentType | string | The employment status of the borrower (Full-time, Part-time, Self-employed, Unemployed). |
| 13 | MaritalStatus | string | The marital status of the borrower (Single, Married, Divorced). |
| 14 | HasMortgage | string | Whether the borrower has a mortgage (Yes or No). |
| 15 | HasDependents | string | Whether the borrower has dependents (Yes or No). |
| 16 | LoanPurpose | string | The purpose of the loan (Home, Auto, Education, Business, Other). |
| 17 | HasCoSigner | string | Whether the loan has a co-signer (Yes or No). |
| 18 | Default | integer | The binary target variable indicating whether the loan defaulted (1) or not (0). |
File: 01_EDA.ipynb
`LoanID` has all distinct values, so it will not be useful in our model and is dropped.
- There are 7 categorical features: 1. Education, 2. EmploymentType, 3. MaritalStatus, 4. HasMortgage, 5. HasDependents, 6. LoanPurpose, 7. HasCoSigner, with possible values as
{ "Education": [ "Bachelor's", "Master's", "High School","PhD"], "EmploymentType":[ "Full-time", "Unemployed", "Self-employed", "Part-time" ], "MaritalStatus": [ "Divorced", "Married", "Single" ], "HasMortgage": [ "Yes", "No" ], "HasDependents": [ "Yes", "No" ], "LoanPurpose": [ "Other", "Auto", "Business", "Home", "Education" ], "HasCoSigner": [ "Yes", "No" ] }
- Null Check
  - There were no null values in either the numerical or the categorical data.
- Imbalance Check
  - The dataset is imbalanced: 11.6% of rows have target `1` and the remaining 88.4% have target `0`.
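As a rough illustration of the null and imbalance checks, the sketch below runs both with pandas on a tiny made-up frame. The rows are hypothetical stand-ins, not data from the actual dataset; only the column names come from the feature table.

```python
import pandas as pd

# Tiny hypothetical stand-in for the loan dataset (not real rows).
df = pd.DataFrame({
    "Age": [25, 40, 31, 58, 46, 37, 29, 62, 51, 44],
    "HasMortgage": ["Yes", "No", "No", "Yes", "No", "Yes", "No", "No", "Yes", "No"],
    "Default": [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
})

# Null check: number of missing values per column.
null_counts = df.isnull().sum()

# Imbalance check: share of each target class.
class_share = df["Default"].value_counts(normalize=True)

print(null_counts)
print(class_share)
```

On the real data, the same `value_counts(normalize=True)` call is one way to obtain the 11.6% / 88.4% split reported here.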
- Plotting the univariate features
For bivariate analysis, a random sample was picked from the loan defaulters to look for trends. Two continuous numerical features were plotted against a category:
- Scatter plots also show the variation of CreditScore
- The scatter plot and KDE plot include only loan defaulters
1. Impact of Income on LoanAmount, with Purpose of Loan
   1. The lower income group are the ones who have taken the highest loans.
   2. Most loan defaulters defaulted on Education, Auto and Business loans.
   3. The number of loans taken in each income group is almost the same for each loan purpose.
2. Impact of InterestRate on LoanAmount, with Purpose of Loan
   1. As expected, there are more defaulters among high-interest, higher-amount loans.
   2. Most loan defaulters defaulted on Education, Auto and Business loans.
3. Impact of InterestRate on LoanAmount, with AgeGroup. For the purpose of bucketing age, ages between multiples of 5 were grouped together as 5-10, 11-15 and so on.
   1. As a variation of the above, younger people defaulted the most, in all categories.
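The age bucketing described above could be done with `pandas.cut`. This is a minimal sketch under assumptions: the bin edges (15 to 70 in steps of 5), the label format and the sample ages are illustrative choices, not taken from the notebook.

```python
import pandas as pd

# Hypothetical sample ages; the real column spans 18 to 69.
ages = pd.Series([18, 20, 21, 43, 69], name="Age")

# Assumed bin edges at multiples of 5; each bin is right-inclusive, e.g. (15, 20].
bin_edges = list(range(15, 75, 5))
labels = [f"{lo + 1}-{hi}" for lo, hi in zip(bin_edges[:-1], bin_edges[1:])]

age_group = pd.cut(ages, bins=bin_edges, labels=labels)
print(age_group.tolist())
```

The resulting `age_group` column can then be used as the category (for example as the hue) in the scatter plots.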
For multivariate analysis, a random sample was picked to look for trends. One continuous numerical feature was plotted against two categorical features:
- The violin plot shows the distribution against loan defaulters.
- The count plot shows the count in each category.
- The histogram on the right considers only the defaulters and plots them for each category.
1. Analysis of LoanAmount for each Education category

2. Analysis of LoanAmount for each Employment category

3. Analysis of LoanAmount for each LoanTerm category

4. Analysis of LoanAmount for each LoanPurpose category

5. Analysis of LoanAmount for each MaritalStatus category

6. Analysis of LoanAmount for each HasMortgage category

- Conclusion
- Loan defaulters are generally the ones who had no previous mortgage, and this trend is seen across almost the whole LoanAmount range.
- A higher number of loan defaulters are single, whereas `Divorced` borrowers are more likely to default on high-amount loans.
- The highest loans were taken for Business; defaulters were evenly spread across all LoanPurpose values.
- The highest loans were taken and defaulted on for the 24-month term. In addition, most defaulters were in the higher range across all terms.
- An outlier check was done using Z-scores on 2 fields (1. Income and 2. LoanAmount); neither had outliers.
- Income ranges from 15000.00 to 149999.00
- Loan amount ranges from 5000.00 to 249999.00
- CreditScore, MonthsEmployed and LoanTerm all seem to be in a valid range
| | Age | Income | LoanAmount | CreditScore | MonthsEmployed | NumCreditLines | InterestRate | LoanTerm | DTIRatio | Default |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 255347.000 | 255347.000 | 255347.000 | 255347.000 | 255347.000 | 255347.000 | 255347.000 | 255347.000 | 255347.000 | 255347.000 |
| mean | 43.498 | 82499.305 | 127578.866 | 574.264 | 59.542 | 2.501 | 13.493 | 36.026 | 0.500 | 0.116 |
| std | 14.990 | 38963.014 | 70840.706 | 158.904 | 34.643 | 1.117 | 6.636 | 16.969 | 0.231 | 0.320 |
| min | 18.000 | 15000.000 | 5000.000 | 300.000 | 0.000 | 1.000 | 2.000 | 12.000 | 0.100 | 0.000 |
| 25% | 31.000 | 48825.500 | 66156.000 | 437.000 | 30.000 | 2.000 | 7.770 | 24.000 | 0.300 | 0.000 |
| 50% | 43.000 | 82466.000 | 127556.000 | 574.000 | 60.000 | 2.000 | 13.460 | 36.000 | 0.500 | 0.000 |
| 75% | 56.000 | 116219.000 | 188985.000 | 712.000 | 90.000 | 3.000 | 19.250 | 48.000 | 0.700 | 0.000 |
| max | 69.000 | 149999.000 | 249999.000 | 849.000 | 119.000 | 4.000 | 25.000 | 60.000 | 0.900 | 1.000 |
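The Z-score outlier check mentioned in the conclusions can be sketched as below. The cut-off of 3 standard deviations is an assumption (the notebook's threshold is not stated here), and `zscore_outliers` is a hypothetical helper name. For orientation, the summary table gives a largest Income z-score of (149999 - 82499.305) / 38963.014 ≈ 1.73 and a largest LoanAmount z-score of (249999 - 127578.866) / 70840.706 ≈ 1.73, so neither column comes close to a cut-off of 3.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return a boolean mask marking values whose |z-score| exceeds threshold."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return np.abs(z) > threshold

# Evenly spread values over the reported Income range: no point is far from the mean.
income_like = np.linspace(15000, 149999, 1000)
print(zscore_outliers(income_like).any())

# A contrived sample with one extreme value, to show what a flagged outlier looks like.
contrived = np.array([10.0] * 99 + [1000.0])
print(np.flatnonzero(zscore_outliers(contrived)))
```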
Notebook: 02_FeatureSelection.ipynb
These are bottom 30% in both filters
- LoanPurpose (Auto, Education, Other)
- Education (High School, Master's)
- EmploymentType (Full-time)
- HasDependents (Yes, No)
- HasCoSigner (Yes, No)
- HasMortgage (Yes, No)
- MaritalStatus (Married, Divorced)
Using Mutual Information with mutual_info_classif()
- Age, Income, NumCreditLines, InterestRate, LoanTerm are top 5 picks
- DTIRatio and CreditScore are least contributing features
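A minimal sketch of ranking features with `mutual_info_classif()` follows. The data is synthetic and only meant to show the call: one column is built to depend on the target and one is pure noise, so the first should score higher. The variable names are illustrative, not the notebook's.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)

# Synthetic binary target and two features (assumed data, not the loan dataset).
y = rng.integers(0, 2, size=500)
informative = y + rng.normal(scale=0.3, size=500)  # depends on the target
noise = rng.normal(size=500)                       # independent of the target
X = np.column_stack([informative, noise])

# Estimate mutual information between each feature and the target.
mi_scores = mutual_info_classif(X, y, random_state=0)

# Feature indices ordered from most to least informative.
ranking = np.argsort(mi_scores)[::-1]
print(mi_scores, ranking)
```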
Modeling:
The choice of metrics depends on what exactly we are trying to answer. As per the problem statement,
One of the primary objectives of companies with financial loan services is to decrease payment defaults and ensure that individuals are paying back their loans as expected.
The question we want to answer is
How do we predict which individuals are at the highest risk of defaulting on their loans, so that proper interventions can be effectively deployed to the right audience?
In technical terms, we would like to identify the majority of our True Positives and reduce False Negatives.
- Recall (Sensitivity) measures the proportion of actual positives that the model predicts correctly. It answers the question: “Out of all actual positives, how many did the model capture?”

  $\large Recall\;(Sensitivity) = \Large \frac{TPs}{(TPs + FNs)}$

  Thus, to achieve a high `Recall Score` we would like to increase True Positives (TP) and reduce False Negatives (FN).
Additionally, we would also like to reduce False Positives: too many of them make the model overly pessimistic, so opportunities are lost when more applications are rejected, or resources are wasted when more applications are scrutinised.
- Precision focuses on the question: “Out of the positive predictions made by the model, what percentage is correct?”

  $\large Precision = \Large \frac{TPs}{(TPs + FPs)}$

  Thus, the model should be able to capture the majority of `True Positives` and also reduce `False Positives`.
Unbalanced datasets in particular need additional consideration.
- F1 score is essential because it balances precision and recall, providing a single metric that considers both FPs and FNs.

  $\large F1 = 2 \times \Large \frac{Recall \times Precision}{(Recall + Precision)}$
Thus to conclude, the 3 Metrics for evaluation will be
- Recall (Sensitivity) Score
- Precision Score
- F1 Score
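To make the three metrics concrete, here is a small worked example using scikit-learn's metric functions. The labels and predictions are made up for illustration: they give TP = 2, FN = 2 and FP = 1, so recall is 2/4, precision is 2/3 and F1 is 4/7.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 4 actual defaulters (1) and 6 non-defaulters (0).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# Hypothetical predictions: 2 defaulters caught, 2 missed, 1 false alarm.
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 4
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
f1 = f1_score(y_true, y_pred)                # 2 * R * P / (R + P) = 4 / 7

print(recall, precision, f1)
```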
| Column | Transformation | Notes |
|---|---|---|
| Education | OneHotEncoding | |
| EmploymentType | OneHotEncoding | |
| MaritalStatus | OneHotEncoding | |
| LoanPurpose | OneHotEncoding | |
| HasMortgage | OneHotEncoding | Option: `drop='if_binary'` |
| HasDependents | OneHotEncoding | Option: `drop='if_binary'` |
| HasCoSigner | OneHotEncoding | Option: `drop='if_binary'` |
| LoanTerm | OrdinalEncoder | |
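The transformations in the table could be wired together with a scikit-learn `ColumnTransformer`, roughly as sketched below. This is an assumed arrangement, not the notebook's exact pipeline: the four-row frame is made up, only a subset of the columns is shown, and passing the remaining numeric columns through unchanged is an illustrative choice.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical rows using a subset of the dataset's columns.
df = pd.DataFrame({
    "Education": ["PhD", "Master's", "Bachelor's", "High School"],
    "HasMortgage": ["Yes", "No", "Yes", "No"],
    "LoanTerm": [12, 24, 36, 60],
    "Income": [50000, 82000, 61000, 120000],
})

preprocessor = ColumnTransformer(
    transformers=[
        # Multi-valued categories: one indicator column per category.
        ("multi", OneHotEncoder(), ["Education"]),
        # Yes/No columns: drop='if_binary' keeps a single indicator column.
        ("binary", OneHotEncoder(drop="if_binary"), ["HasMortgage"]),
        # Loan terms are ordered, so map them to 0, 1, 2, ...
        ("term", OrdinalEncoder(), ["LoanTerm"]),
    ],
    remainder="passthrough",  # assumed: numeric columns pass through unchanged
    sparse_threshold=0,       # always return a dense array
)

X = preprocessor.fit_transform(df)
print(X.shape)
```

With `drop='if_binary'`, each Yes/No feature contributes one column instead of two redundant ones.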
Based on the observations above, the data is non-linear. We therefore first evaluate two non-linear algorithms:
- Non-Linear Algorithm
  - K-Nearest Neighbours
  - Decision Tree (with/without class weight)

We would also like to give a linear algorithm a shot, thus
- Linear Algorithm
  - LogisticRegression with Polynomial Features (with/without class weight)
If the models are determined to be weak, we will use the following ensemble algorithms
- Ensemble Algorithm
  - Boosting (CatBoostClassifier, XGBClassifier)
  - Bagging (e.g. RandomForestClassifier, BalancedBaggingClassifier)
  - StackingClassifier and VotingClassifier
- Prior to modeling, use data sampling algorithms to balance the dataset
  - Random under-sampling
  - Random over-sampling
  - SMOTE, Tomek, SMOTETomek
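To show what random under-sampling does, here is a NumPy-only sketch. It is a simplified stand-in for imbalanced-learn's `RandomUnderSampler`, not that library's implementation; `random_undersample` is a hypothetical helper and the 88/12 split only mimics the dataset's imbalance.

```python
import numpy as np

def random_undersample(X, y, random_state=0):
    """Randomly drop rows so every class keeps as many rows as the smallest class."""
    rng = np.random.default_rng(random_state)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    # Sample n_min row indices per class without replacement.
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original row order
    return X[keep], y[keep]

# Hypothetical imbalanced data: 88 non-defaulters and 12 defaulters.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 88 + [1] * 12)

X_res, y_res = random_undersample(X, y)
print(np.bincount(y_res))
```

Resampling of this kind is normally applied to the training split only, so that the test set keeps the original class distribution.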
- PolynomialFeatures + PCA
- Column Transformation
- Undersampling the majority class using `RandomUnderSampler`
- GridSearchCV + K-Nearest Neighbor
| Evaluator | Score |
|---|---|
| Training Accuracy | 100.00% |
| Test Accuracy | 59.89% |
| Recall Score | 15.25% |
| Precision Score | 25.00% |
| Accuracy Score | 84.80% |
| F1 Score | 18.95% |
| ROC AUC Score | 60.92% |
| threshold | recall | precision | f1-score | accuracy |
|---|---|---|---|---|
| 0.300 | 0.950 | 0.127 | 0.224 | 0.233 |
| 0.400 | 0.825 | 0.142 | 0.242 | 0.399 |
| 0.500 | 0.623 | 0.169 | 0.266 | 0.599 |
| 0.600 | 0.365 | 0.204 | 0.262 | 0.760 |
| 0.700 | 0.153 | 0.25 | 0.189 | 0.848 |
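A threshold table like the one above can be produced by converting predicted probabilities into labels at each cut-off and scoring the result. The sketch below assumes that approach; `threshold_report` is a hypothetical helper, and the labels and probabilities are made up rather than taken from any model in this project.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def threshold_report(y_true, proba, thresholds=(0.3, 0.4, 0.5, 0.6, 0.7)):
    """Score predictions obtained by cutting positive-class probabilities at each threshold."""
    proba = np.asarray(proba)
    rows = []
    for t in thresholds:
        y_pred = (proba >= t).astype(int)  # predict default when probability >= t
        rows.append({
            "threshold": t,
            "recall": recall_score(y_true, y_pred, zero_division=0),
            "precision": precision_score(y_true, y_pred, zero_division=0),
            "f1-score": f1_score(y_true, y_pred, zero_division=0),
            "accuracy": accuracy_score(y_true, y_pred),
        })
    return rows

# Hypothetical labels and predicted probabilities of default.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
proba = [0.10, 0.20, 0.35, 0.45, 0.55, 0.65, 0.75, 0.90]

rows = threshold_report(y_true, proba)
for row in rows:
    print(row)
```

Raising the threshold can only remove positive predictions, so recall never increases as the threshold goes up, which matches the downward recall trend in every table in this section.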
- Column Transformation
- Using `TomekLinks` on the unbalanced dataset
- GridSearchCV + K-Nearest Neighbor
| Evaluator | Score |
|---|---|
| Training Accuracy | 100.00% |
| Test Accuracy | 87.92% |
| Recall Score | 0.29% |
| Precision Score | 46.43% |
| Accuracy Score | 88.35% |
| F1 Score | 0.58% |
| ROC AUC Score | 51.14% |
| threshold | recall | precision | f1-score | accuracy |
|---|---|---|---|---|
| 0.300 | 0.194 | 0.241 | 0.215 | 0.835 |
| 0.400 | 0.089 | 0.292 | 0.137 | 0.869 |
| 0.500 | 0.032 | 0.316 | 0.058 | 0.879 |
| 0.600 | 0.011 | 0.388 | 0.021 | 0.883 |
| 0.700 | 0.003 | 0.464 | 0.006 | 0.883 |
- Column Transformation
- Using `SMOTE` on the unbalanced dataset
- GridSearchCV + K-Nearest Neighbor
| Evaluator | Score |
|---|---|
| Training Accuracy | 100.00% |
| Test Accuracy | 87.92% |
| Recall Score | 0.29% |
| Precision Score | 46.43% |
| Accuracy Score | 88.35% |
| F1 Score | 0.58% |
| ROC AUC Score | 51.14% |
| threshold | recall | precision | f1-score | accuracy |
|---|---|---|---|---|
| 0.300 | 0.604 | 0.157 | 0.249 | 0.575 |
| 0.400 | 0.495 | 0.165 | 0.248 | 0.649 |
| 0.500 | 0.472 | 0.167 | 0.247 | 0.664 |
| 0.600 | 0.364 | 0.176 | 0.238 | 0.728 |
| 0.700 | 0.330 | 0.182 | 0.234 | 0.749 |
- Column Transformation
- Determine the approximate range of all parameters for Decision Tree
- GridSearchCV + Decision Tree
| Evaluators | Score |
|---|---|
| Training Accuracy | 54.25% |
| Test Accuracy | 54.14% |
| Recall Score | 41.54% |
| Precision Score | 25.32% |
| Accuracy Score | 79.09% |
| F1 Score | 31.47% |
| ROC AUC Score | 64.40% |
| threshold | recall | precision | f1-score | accuracy |
|---|---|---|---|---|
| 0.300 | 0.962 | 0.129 | 0.227 | 0.243 |
| 0.400 | 0.838 | 0.155 | 0.261 | 0.452 |
| 0.500 | 0.777 | 0.172 | 0.281 | 0.541 |
| 0.600 | 0.511 | 0.236 | 0.323 | 0.753 |
| 0.700 | 0.415 | 0.253 | 0.315 | 0.791 |
- Column Transformation
- Determine the approximate range of all parameters for Decision Tree
- Pruning and `BayesSearchCV` with different values of `ccp_alphas`
| Evaluators | Score |
|---|---|
| Training Accuracy | 67.69% |
| Test Accuracy | 67.01% |
| Recall Score | 29.17% |
| Precision Score | 31.49% |
| Accuracy Score | 84.49% |
| F1 Score | 30.29% |
| ROC AUC Score | 66.16% |
| threshold | recall | precision | f1-score | accuracy |
|---|---|---|---|---|
| 0.300 | 0.894 | 0.148 | 0.255 | 0.395 |
| 0.400 | 0.794 | 0.174 | 0.286 | 0.541 |
| 0.500 | 0.651 | 0.206 | 0.313 | 0.670 |
| 0.600 | 0.516 | 0.251 | 0.337 | 0.766 |
| 0.700 | 0.292 | 0.315 | 0.303 | 0.845 |
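Candidate `ccp_alphas` values for pruning can be obtained from scikit-learn's cost-complexity pruning path. The sketch below shows that step only, on synthetic data, and picks one alpha by hand; it is an assumed illustration and does not reproduce the `BayesSearchCV` search used in this run.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the loan dataset.
X, y = make_classification(
    n_samples=300, n_features=6, weights=[0.88, 0.12], random_state=0
)

# Unpruned tree and its cost-complexity pruning path.
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
path = full_tree.cost_complexity_pruning_path(X, y)

# Guard against tiny negative values caused by floating-point error.
ccp_alphas = np.clip(path.ccp_alphas, 0.0, None)

# Refit with an alpha from the middle of the path (arbitrary illustrative choice).
mid_alpha = ccp_alphas[len(ccp_alphas) // 2]
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=mid_alpha).fit(X, y)

print(full_tree.tree_.node_count, pruned_tree.tree_.node_count)
```

Larger `ccp_alpha` values prune more aggressively, so the candidate alphas from the path give a natural search range for a tuner.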
- Column Transformation
- Determine the approximate range of all parameters for Decision Tree
- GridSearchCV + BalancedRandomForestClassifier
| Evaluators | Score |
|---|---|
| Training Accuracy | 64.17% |
| Test Accuracy | 55.48% |
| Recall Score | 44.95% |
| Precision Score | 23.99% |
| Accuracy Score | 77.19% |
| F1 Score | 31.28% |
| ROC AUC Score | 64.71% |
| threshold | recall | precision | f1-score | accuracy |
|---|---|---|---|---|
| 0.300 | 0.939 | 0.137 | 0.239 | 0.311 |
| 0.400 | 0.851 | 0.155 | 0.262 | 0.446 |
| 0.500 | 0.767 | 0.175 | 0.285 | 0.555 |
| 0.600 | 0.605 | 0.206 | 0.308 | 0.685 |
| 0.700 | 0.449 | 0.240 | 0.313 | 0.772 |
- Column Transformation
- CatBoostClassifier
| Evaluators | Score |
|---|---|
| Training Accuracy | 65.98% |
| Test Accuracy | 65.24% |
| Recall Score | 41.36% |
| Precision Score | 32.73% |
| Accuracy Score | 83.40% |
| F1 Score | 36.54% |
| ROC AUC Score | 68.51% |
| threshold | recall | precision | f1-score | accuracy |
|---|---|---|---|---|
| 0.300 | 0.927 | 0.149 | 0.257 | 0.380 |
| 0.400 | 0.842 | 0.175 | 0.290 | 0.524 |
| 0.500 | 0.728 | 0.210 | 0.326 | 0.652 |
| 0.600 | 0.585 | 0.257 | 0.357 | 0.757 |
| 0.700 | 0.414 | 0.327 | 0.365 | 0.834 |
- Column Transformation
- SMOTETomek to balance the dataset
- XGBClassifier
| Evaluators | Score |
|---|---|
| Training Accuracy | 74.10% |
| Test Accuracy | 53.73% |
| Recall Score | 56.73% |
| Precision Score | 25.73% |
| Accuracy Score | 76.09% |
| F1 Score | 35.41% |
| ROC AUC Score | 66.23% |
| threshold | recall | precision | f1-score | accuracy |
|---|---|---|---|---|
| 0.300 | 0.956 | 0.138 | 0.241 | 0.305 |
| 0.400 | 0.905 | 0.155 | 0.265 | 0.419 |
| 0.500 | 0.825 | 0.177 | 0.292 | 0.537 |
| 0.600 | 0.719 | 0.210 | 0.325 | 0.654 |
| 0.700 | 0.567 | 0.257 | 0.354 | 0.761 |
- Column Transformation
- XGBClassifier
| Evaluators | Score |
|---|---|
| Training Accuracy | 73.94% |
| Test Accuracy | 70.96% |
| Recall Score | 33.07% |
| Precision Score | 33.92% |
| Accuracy Score | 84.83% |
| F1 Score | 33.49% |
| ROC AUC Score | 67.52% |
| threshold | recall | precision | f1-score | accuracy |
|---|---|---|---|---|
| 0.300 | 0.855 | 0.164 | 0.275 | 0.479 |
| 0.400 | 0.748 | 0.191 | 0.304 | 0.605 |
| 0.500 | 0.631 | 0.227 | 0.334 | 0.710 |
| 0.600 | 0.489 | 0.276 | 0.353 | 0.793 |
| 0.700 | 0.331 | 0.339 | 0.335 | 0.848 |
- Column Transformation
- RandomForestClassifier
| Evaluator | Score |
|---|---|
| Training Accuracy | 66.20% |
| Test Accuracy | 66.21% |
| Recall Score | 2.10% |
| Precision Score | 63.92% |
| Accuracy Score | 88.55% |
| F1 Score | 4.07% |
| ROC AUC Score | 66.89% |
| threshold | recall | precision | f1-score | accuracy |
|---|---|---|---|---|
| 0.300 | 0.993 | 0.120 | 0.215 | 0.162 |
| 0.400 | 0.898 | 0.151 | 0.258 | 0.403 |
| 0.500 | 0.678 | 0.207 | 0.317 | 0.662 |
| 0.600 | 0.345 | 0.316 | 0.330 | 0.838 |
| 0.700 | 0.021 | 0.639 | 0.041 | 0.886 |
- Column Transformation
- Estimators: BalancedRandomForest, CatBoostClassifier, RandomForestClassifier
- Final Estimator: DecisionTree
| Evaluator | Score |
|---|---|
| Training Accuracy | 73.90% |
| Test Accuracy | 71.98% |
| Recall Score | 34.14% |
| Precision Score | 18.85% |
| Accuracy Score | 75.42% |
| F1 Score | 24.29% |
| ROC AUC Score | 58.22% |
| threshold | recall | precision | f1-score | accuracy |
|---|---|---|---|---|
| 0.300 | 0.438 | 0.175 | 0.250 | 0.697 |
| 0.400 | 0.421 | 0.178 | 0.250 | 0.708 |
| 0.500 | 0.403 | 0.181 | 0.250 | 0.720 |
| 0.600 | 0.376 | 0.182 | 0.246 | 0.733 |
| 0.700 | 0.341 | 0.189 | 0.243 | 0.754 |
- Column Transformation
- Estimators: BalancedRandomForest, CatBoostClassifier, RandomForestClassifier
| Evaluator | Score |
|---|---|
| Training Accuracy | 74.83% |
| Test Accuracy | 68.13% |
| Recall Score | 67.69% |
| Precision Score | 21.75% |
| Accuracy Score | 68.13% |
| F1 Score | 32.92% |
| ROC AUC Score | 67.94% |
| threshold | recall | precision | f1-score | accuracy |
|---|---|---|---|---|
| 0.300 | 0.926 | 0.146 | 0.252 | 0.364 |
| 0.400 | 0.816 | 0.176 | 0.289 | 0.536 |
| 0.500 | 0.677 | 0.217 | 0.329 | 0.681 |
| 0.600 | 0.496 | 0.273 | 0.352 | 0.789 |
| 0.700 | 0.298 | 0.361 | 0.327 | 0.858 |
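For readers unfamiliar with voting ensembles, here is a minimal sketch of a soft-voting `VotingClassifier`. Several things are assumptions: the run above uses BalancedRandomForest, CatBoostClassifier and RandomForestClassifier, whereas this sketch substitutes scikit-learn-only estimators to stay self-contained; soft voting is assumed because the threshold table implies probability outputs; and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the transformed loan features.
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.88, 0.12], random_state=0
)

# Stand-in base estimators (not the ones used in the project).
voter = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=25, class_weight="balanced", random_state=0)),
        ("lr", LogisticRegression(max_iter=1000, class_weight="balanced")),
        ("dt", DecisionTreeClassifier(max_depth=4, class_weight="balanced", random_state=0)),
    ],
    voting="soft",  # average the predicted class probabilities
)
voter.fit(X, y)

# Probability of default for each row, then a 0.5 decision threshold.
default_proba = voter.predict_proba(X)[:, 1]
flagged = default_proba >= 0.5
print(flagged.sum())
```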
- Column Transformation
- PCA
- LogisticRegression
| Evaluator | Score |
|---|---|
| Training Accuracy | 88.44% |
| Test Accuracy | 88.53% |
| Recall Score | 0.03% |
| Precision Score | 100.00% |
| Accuracy Score | 88.45% |
| F1 Score | 0.07% |
| ROC AUC Score | 51.14% |
| threshold | recall | precision | f1-score | accuracy |
|---|---|---|---|---|
| 0.300 | 0.209 | 0.404 | 0.275 | 0.873 |
| 0.400 | 0.087 | 0.507 | 0.148 | 0.885 |
| 0.500 | 0.025 | 0.583 | 0.048 | 0.885 |
| 0.600 | 0.004 | 0.667 | 0.009 | 0.885 |
| 0.700 | 0.000 | 1.000 | 0.001 | 0.885 |
- Column Transformation
- SMOTE
- FeatureSelection
- LogisticRegression
| Evaluator | Score |
|---|---|
| Training Accuracy | 70.31% |
| Test Accuracy | 68.95% |
| Recall Score | 37.03% |
| Precision Score | 33.80% |
| Accuracy Score | 84.35% |
| F1 Score | 35.34% |
| ROC AUC Score | 68.74% |
| threshold | recall | precision | f1-score | accuracy |
|---|---|---|---|---|
| 0.300 | 0.887 | 0.160 | 0.271 | 0.449 |
| 0.400 | 0.797 | 0.188 | 0.304 | 0.578 |
| 0.500 | 0.685 | 0.224 | 0.338 | 0.690 |
| 0.600 | 0.539 | 0.269 | 0.359 | 0.778 |
| 0.700 | 0.370 | 0.338 | 0.353 | 0.843 |
Using Mutual Information with mutual_info_classif()
- Age, Income, NumCreditLines, InterestRate, LoanTerm are top 5 picks
- DTIRatio and CreditScore are least contributing features
These are bottom 30% in both filters
- LoanPurpose (Auto, Education, Other)
- Education (High School, Master's)
- EmploymentType (Full-time)
- HasDependents (Yes, No)
- HasCoSigner (Yes, No)
- HasMortgage (Yes, No)
- MaritalStatus (Married, Divorced)
- (Algorithm 12) VotingClassifier
  - Estimators: BalancedRandomForest, CatBoostClassifier, RandomForestClassifier

| Evaluator | Score |
|---|---|
| Recall Score | 67.69% |
| Precision Score | 21.75% |
| Accuracy Score | 68.13% |
| F1 Score | 32.92% |
| ROC AUC Score | 67.94% |

This algorithm lets us catch defaulters about 68% of the time (recall of 67.69%).













