Synapse³ is a collaborative learning collective focused on building production-oriented data and machine learning systems.
This is Synapse³’s first end-to-end flagship project, emphasizing:
- Real-world data challenges
- Strong data foundations
- Reproducible machine learning pipelines
- Clear business alignment
End-to-end data reconciliation and machine learning pipeline for predicting loan default risk.
Financial institutions face challenges in accurately assessing credit risk due to fragmented data sources and inconsistent customer records.
This project focuses on building a loan-level default prediction system with strong emphasis on data reconciliation, feature integrity, and production-aligned modeling practices.
Predict whether a loan will default using customer demographics, loan attributes, and historical repayment behavior.
- Predict whether a loan will default
- Preserve loan-level granularity
- Prevent data leakage
- Align with real-world credit scoring workflows
- Loan application data containing 4,368 loans with target labels
- Customer banking profiles with demographic and account attributes
- Historical repayment records used to derive behavioral features
- One row represents one loan
- All features are aligned to the loan level
- Historical data is strictly backward-looking
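As a rough illustration of the backward-looking rule, the sketch below keeps only repayment records dated before the loan they describe. The column names `loan_id`, `approved_date`, `closed_date`, and `amount_paid`, as well as the helper `backward_looking_history`, are assumptions made for this example and may not match the actual schema.

```python
import pandas as pd

# Toy data with hypothetical column names; the real schema may differ.
loans = pd.DataFrame({
    "customerid": ["c1", "c2"],
    "loan_id": ["l1", "l2"],
    "approved_date": pd.to_datetime(["2017-07-01", "2017-07-15"]),
})
repayments = pd.DataFrame({
    "customerid": ["c1", "c1", "c2"],
    "closed_date": pd.to_datetime(["2017-05-10", "2017-08-01", "2017-06-20"]),
    "amount_paid": [100.0, 250.0, 80.0],
})


def backward_looking_history(loans: pd.DataFrame, repayments: pd.DataFrame) -> pd.DataFrame:
    """Keep only repayment rows dated strictly before the loan's approval date."""
    # Pair every loan with all repayment rows of the same customer.
    paired = loans[["loan_id", "customerid", "approved_date"]].merge(
        repayments, on="customerid", how="inner"
    )
    # Drop anything that happened on or after approval, since it would leak the future.
    return paired[paired["closed_date"] < paired["approved_date"]]


history = backward_looking_history(loans, repayments)
```

In this toy data, the repayment closed on 2017-08-01 is excluded for loan `l1` because it falls after that loan's approval date.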
- Loan application table is used as the master dataset
- Customer profiles are left-joined using customerid
- Repayment history is aggregated per customer before merging
- No labeled loans are dropped during reconciliation
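The reconciliation steps above can be sketched with pandas roughly as follows. This is a minimal illustration on toy data, not the project's actual code: only `customerid` and `good_bad_flag` come from this README, while `bank_account_type`, `late_days`, and the aggregate names are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the three source tables.
loans = pd.DataFrame({
    "customerid": ["c1", "c2", "c3"],
    "good_bad_flag": ["Good", "Bad", "Good"],
})
profiles = pd.DataFrame({
    "customerid": ["c1", "c3"],
    "bank_account_type": ["Savings", "Current"],  # hypothetical column
})
repayments = pd.DataFrame({
    "customerid": ["c1", "c1", "c2"],
    "late_days": [0, 5, 2],  # hypothetical column
})

# Aggregate repayment history to one row per customer BEFORE merging,
# so the join cannot multiply loan rows.
history = repayments.groupby("customerid", as_index=False).agg(
    n_prior_repayments=("late_days", "count"),
    mean_late_days=("late_days", "mean"),
)

# The loan table is the master: left joins keep every labeled loan,
# and validate= raises if the right-hand side has duplicate keys.
dataset = (
    loans
    .merge(profiles, on="customerid", how="left", validate="many_to_one")
    .merge(history, on="customerid", how="left", validate="many_to_one")
)

# No labeled loan was dropped or duplicated.
assert len(dataset) == len(loans)
```

Loans without a profile or history keep their row and receive missing values in the joined columns, which is what allows missingness to be handled explicitly later.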
- Total loans: 4,368
- Loans with customer profiles: 3,269
- Loans with repayment history: 4,359
- Loans missing both profile and history: 4
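For reference, the coverage figures implied by these counts can be worked out with simple arithmetic. The sketch below assumes the four counts above are exact; the derived values follow from inclusion-exclusion and are not separately reported results.

```python
# Counts taken from the reconciliation summary above.
total = 4368
with_profile = 3269
with_history = 4359
missing_both = 4

missing_profile = total - with_profile  # 1099 loans without a customer profile
missing_history = total - with_history  # 9 loans without repayment history

# Inclusion-exclusion: loans missing at least one of the two sources.
missing_either = missing_profile + missing_history - missing_both  # 1104
with_both = total - missing_either  # 3264 loans with profile and history

profile_coverage = with_profile / total
history_coverage = with_history / total

print(f"Profile coverage: {profile_coverage:.1%}")  # 74.8%
print(f"History coverage: {history_coverage:.1%}")  # 99.8%
```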
- Dataset shape: 4,368 rows by 20 features
- Target variable: good_bad_flag
- Class distribution reflects real-world imbalance
- Rows are preserved to avoid selection bias
- Missing categorical values are retained as explicit categories
- Missing numeric values are imputed
- Missingness is treated as potentially informative
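One possible way to apply these rules in pandas is sketched below. The column names `employment_status` and `prior_loan_count` are hypothetical, and median imputation is an assumption chosen for illustration; this README does not state which imputation method the project uses.

```python
import pandas as pd

# Toy frame with hypothetical columns.
df = pd.DataFrame({
    "employment_status": ["Permanent", None, "Self-Employed"],
    "prior_loan_count": [3.0, None, 1.0],
})

# Record missingness first, so the signal survives imputation.
df["prior_loan_count_missing"] = df["prior_loan_count"].isna().astype(int)

# Categorical: keep missing values as their own explicit category.
df["employment_status"] = df["employment_status"].fillna("Missing")

# Numeric: impute (median is an assumption made for this sketch).
median_value = df["prior_loan_count"].median()
df["prior_loan_count"] = df["prior_loan_count"].fillna(median_value)
```

In a real pipeline the imputation statistic would normally be computed on the training split only and reused on validation and test data, to stay consistent with the no-leakage goal.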
Feature engineering is performed after all data is aligned to the loan level.
- Demographic features
- Financial features
- Behavioral aggregates
- Temporal features
- Missingness indicators
- Data cleaning completed
- Data reconciliation completed
- Modeling dataset finalized
- Feature engineering in progress
loan-default-risk/
│
├── data/
│   ├── raw/
│   │   ├── loan_application.csv
│   │   ├── customer_banking_profile.csv
│   │   └── repayment_history.csv
│   │
│   ├── interim/
│   │   ├── loan_application_cleaned.csv
│   │   ├── customer_banking_profile_cleaned.csv
│   │   └── repayment_history_cleaned.csv
│   │
│   └── processed/
│       └── loan_level_model_dataset.csv
│
├── notebooks/
│   ├── 01_data_cleaning/
│   ├── 02_data_reconciliation/
│   ├── 03_feature_engineering/
│   └── 04_modeling/
│
├── src/
│   ├── data/
│   ├── features/
│   ├── models/
│   └── utils/
│
├── models/
│   ├── trained/
│   └── metrics/
│
├── reports/
│   ├── eda/
│   └── model_performance/
│
├── docker/
│   ├── Dockerfile
│   └── docker-compose.yml
│
├── tests/
│
├── requirements.txt
├── Makefile
└── README.md
This project prioritizes data integrity, reproducibility, and real-world modeling practices over shortcut performance gains.
The goal is to build a production-aligned loan default risk system.
- The loan application table is the single source of truth for labels
- All feature engineering is performed after data reconciliation
- No labeled loan is dropped during preprocessing
- Missing data is treated as informative where applicable
- Modeling decisions prioritize interpretability alongside performance
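The "no labeled loan is dropped" principle lends itself to an automated check. The following is a hedged sketch of what such a check might look like; the helper `check_labels_preserved` and the `some_feature` column are hypothetical and not part of the project's code.

```python
import pandas as pd


def check_labels_preserved(
    master: pd.DataFrame, dataset: pd.DataFrame, target: str = "good_bad_flag"
) -> None:
    """Raise AssertionError if preprocessing dropped, duplicated, or unlabeled a loan."""
    assert len(dataset) == len(master), "row count changed during preprocessing"
    assert dataset[target].notna().all(), "some loans lost their label"
    # Per-class counts must match the master table exactly.
    assert (
        dataset[target].value_counts().to_dict()
        == master[target].value_counts().to_dict()
    ), "label counts changed during preprocessing"


# Toy example: adding a feature column must not change rows or labels.
master = pd.DataFrame({
    "customerid": ["c1", "c2", "c3"],
    "good_bad_flag": ["Good", "Bad", "Good"],
})
processed = master.assign(some_feature=[1.0, 0.0, 2.0])
check_labels_preserved(master, processed)  # passes silently
```

A check like this could be run after each preprocessing stage, comparing the stage's output against the loan application table.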