A responsible machine learning pipeline for binary income classification using the UCI Adult Census dataset. The project predicts whether an individual's annual income exceeds USD 50,000, with an explicit focus on predictive accuracy, fairness auditing, and model explainability.
- Multi-model comparison: Logistic Regression, Random Forest, and XGBoost trained and benchmarked with 5-fold stratified cross-validation
- Threshold optimisation: F1-maximising decision threshold sweep to improve minority-class recall beyond the default 0.5 cutoff
- Fairness auditing: Demographic Parity Difference (DPD) and Equalized Odds Difference (EOD) measured across sex and race using `fairlearn`
- Bias mitigation: Post-processing via `ThresholdOptimizer` (equalized odds constraint), achieving a 43% reduction in DPD and 71% in EOD for sex
- SHAP explainability: Global beeswarm/bar plots and local waterfall/dependence plots for per-prediction audit trails
- Interactive dashboard: 8-page Streamlit application that loads data directly from disk; no file upload required
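For readers unfamiliar with the two fairness metrics named above, the following is a minimal, dependency-free sketch of what they measure. It is not the project's code (the project relies on `fairlearn`); the helper names and the toy data are hypothetical. It assumes the common definitions: Demographic Parity Difference is the largest gap in selection rate between groups, and Equalized Odds Difference is the larger of the true-positive-rate gap and the false-positive-rate gap.

```python
from collections import defaultdict


def _rate(values):
    """Mean of a list of 0/1 predictions; None when the list is empty."""
    return sum(values) / len(values) if values else None


def group_rates(y_true, y_pred, groups):
    """Per-group selection rate, true-positive rate, and false-positive rate."""
    buckets = defaultdict(lambda: {"all": [], "pos": [], "neg": []})
    for truth, pred, group in zip(y_true, y_pred, groups):
        buckets[group]["all"].append(pred)
        buckets[group]["pos" if truth == 1 else "neg"].append(pred)
    return {
        g: {
            "selection_rate": _rate(b["all"]),
            "tpr": _rate(b["pos"]),
            "fpr": _rate(b["neg"]),
        }
        for g, b in buckets.items()
    }


def _gap(rates, key):
    """Largest minus smallest value of one rate across groups."""
    values = [r[key] for r in rates.values() if r[key] is not None]
    return max(values) - min(values)


def demographic_parity_difference(y_true, y_pred, groups):
    """Gap in the share of positive predictions between groups."""
    return _gap(group_rates(y_true, y_pred, groups), "selection_rate")


def equalized_odds_difference(y_true, y_pred, groups):
    """Larger of the TPR gap and the FPR gap between groups."""
    rates = group_rates(y_true, y_pred, groups)
    return max(_gap(rates, "tpr"), _gap(rates, "fpr"))


# Hypothetical toy data: 1 means income above USD 50,000.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 1, 0]
sex = ["Male"] * 4 + ["Female"] * 4

print(demographic_parity_difference(y_true, y_pred, sex))
print(equalized_odds_difference(y_true, y_pred, sex))
```

A value of 0 on either metric means the groups are treated identically by that criterion; larger values indicate a wider gap.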
Prerequisites: Python 3.9+
- Clone the repository:

      git clone https://github.com/akritisshetty/census-income-analysis.git
      cd census-income-analysis

- Install dependencies:

      pip install -r requirements.txt

- Download the dataset: place `adult.csv` (UCI Adult Census dataset) in the project root directory.

- Run the Jupyter notebook (training & evaluation):

      jupyter notebook census_income_analysis.ipynb

- Launch the Streamlit dashboard:

      streamlit run app.py
Requirements:
`pandas==3.0.2`, `numpy==2.4.4`, `scikit-learn==1.8.0`, `xgboost`, `fairlearn==0.13.0`, `shap==0.51.0`, `matplotlib==3.10.8`, `seaborn==0.13.2`, `streamlit`
census-income-analysis/
│
├── adult.csv # UCI Adult Census dataset (32,561 records)
├── census_income_analysis.ipynb # End-to-end ML pipeline notebook (34 cells)
├── app.py # Streamlit 8-page dashboard
├── requirements.txt
└── README.md
The pipeline runs end-to-end through the following stages:
- Load & Sanitise: Replace `"?"` entries with `NaN`; drop the `fnlwgt` sampling weight column
- EDA: Explore feature distributions, missing values, and fairness baselines
- Preprocessing: Engineer `capital.net` and `work_intensity` features; impute missing values; apply One-Hot Encoding and StandardScaler (for Logistic Regression only); perform an 80/20 stratified train-test split
- Model Training: Train Logistic Regression, Random Forest, and XGBoost with 5-fold cross-validation; XGBoost achieves the best CV ROC-AUC of 0.9275 and test ROC-AUC of 0.9231
- Threshold Optimisation: Sweep thresholds from 0.1 to 0.9; optimal threshold of 0.43 improves minority-class F1 from 0.7124 → 0.7282
- Fairness Audit: Compute Demographic Parity Difference (DPD) and Equalized Odds Difference (EOD) via `fairlearn.metrics.MetricFrame` across sex and race groups
- Bias Mitigation: Apply `ThresholdOptimizer` with equalized odds constraint; reduces sex-based DPD by 43% and EOD by 71%
- SHAP Analysis: Identify top predictive features globally (age, marital status, capital gain, education, hours/week) and generate local per-prediction explanations
- Dashboard: Serve all results through an interactive Streamlit application
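The threshold optimisation stage can be illustrated with a short, dependency-free sketch. This is not the notebook's code: the function names, the 0.01 step size, and the toy probabilities are assumptions made for illustration. The sweep tries each candidate cutoff between 0.1 and 0.9, scores the resulting predictions with F1 for the positive class, and keeps the first cutoff that gives the highest score.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1); 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(y_true, y_prob, start=0.10, stop=0.90, step=0.01):
    """Return (threshold, f1) for the first cutoff that maximises F1.

    The step size is an assumption; the README only states the 0.1 to 0.9 range.
    """
    n_steps = int(round((stop - start) / step))
    best_t, best_f1 = None, -1.0
    for i in range(n_steps + 1):
        # Rounding avoids float drift such as 0.30000000000000004.
        t = round(start + i * step, 4)
        preds = [1 if p >= t else 0 for p in y_prob]
        score = f1_score(y_true, preds)
        if score > best_f1:  # strict '>' keeps the lowest tied threshold
            best_t, best_f1 = t, score
    return best_t, best_f1


# Hypothetical labels and predicted probabilities for the positive class.
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
y_prob = [0.05, 0.20, 0.35, 0.40, 0.45, 0.55, 0.70, 0.90]

default_f1 = f1_score(y_true, [1 if p >= 0.5 else 0 for p in y_prob])
threshold, tuned_f1 = best_f1_threshold(y_true, y_prob)
print(f"default 0.50 -> F1 {default_f1:.4f}")
print(f"tuned   {threshold:.2f} -> F1 {tuned_f1:.4f}")
```

On this toy data the tuned cutoff falls below 0.5 because one true positive has a predicted probability of 0.40, which the default cutoff misses. With a real model, `y_prob` would be the positive-class probabilities from the trained classifier.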
| Purpose | Library |
|---|---|
| Data manipulation | pandas 3.0.2, numpy 2.4.4 |
| Visualisation | matplotlib 3.10.8, seaborn 0.13.2 |
| Machine learning | scikit-learn 1.8.0 |
| Gradient boosting | XGBoost |
| Fairness auditing & mitigation | fairlearn 0.13.0 |
| Explainability | SHAP 0.51.0 |
| Interactive dashboard | Streamlit |