# Adult Census Income Prediction

A reproducible data science project that analyzes the UCI "Adult" (Census Income) dataset and builds models to predict whether an individual's income exceeds $50K/yr. The project is notebook-centric (Jupyter Notebooks) and covers the end-to-end workflow: data ingestion, EDA, preprocessing, feature engineering, modeling, tuning, evaluation, and interpretability.
## Contents

- About the dataset
- Repository structure
- Quick start
- Environment & dependencies
- Notebooks & workflow
- Modeling & evaluation
- Reproducibility & results
- Tips for extension
- Contributing
- Acknowledgements
## About the dataset

- Source: UCI Machine Learning Repository — "Adult" / "Census Income" dataset.
- Goal: Predict whether an individual's annual income is >50K.
- Typical features: age, workclass, education, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country, etc.
- Target: income (<=50K, >50K).
- Notes: The dataset contains missing values encoded as '?', categorical attributes with varying cardinalities, and class imbalance that should be considered during modeling.
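The notes above can be handled directly at load time. Below is a minimal sketch of loading Adult-style data with pandas while treating `'?'` as missing; the column subset and inline sample rows are illustrative only — for the real files, use the full column list from the UCI `adult.names` description and your local path to `adult.data` / `adult.test`.

```python
# Sketch: load Adult-style data with pandas while treating '?' as missing.
# Column subset and sample rows are illustrative, not the full UCI schema.
import io

import pandas as pd

COLUMNS = ["age", "workclass", "hours-per-week", "income"]

sample = io.StringIO(
    "39, State-gov, 40, <=50K\n"
    "50, ?, 13, >50K\n"        # '?' marks a missing workclass
    "38, Private, 40, <=50K\n"
)

df = pd.read_csv(
    sample,
    names=COLUMNS,
    na_values="?",          # UCI encodes missing values as '?'
    skipinitialspace=True,  # strip the space that follows each comma
)

print(df["income"].value_counts(normalize=True))  # check class balance
print(df["workclass"].isna().sum())               # count missing entries
```

The same `na_values` / `skipinitialspace` arguments work unchanged when pointing `read_csv` at the downloaded UCI file instead of the in-memory sample.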
## Repository structure

- `notebooks/` (or root `.ipynb` files) — main analysis and modeling notebooks (EDA, preprocessing, modeling, explainability)
- `data/` (optional) — raw and processed datasets (if included)
- `reports/` (optional) — model summaries, charts, or exported results
- `requirements.txt` — Python dependencies (if present)
- `README.md` — this file
(Exact file names may vary; open the repo to see the notebook filenames and adjust references if needed.)
## Quick start

**Option 1: Google Colab**

- Open the notebook in GitHub and click "Open in Colab" (or use the Colab link if provided).
- If necessary, mount Google Drive to persist outputs:

  ```python
  from google.colab import drive
  drive.mount('/content/drive')
  ```

- Install required packages (if `requirements.txt` is present):

  ```bash
  !pip install -r requirements.txt
  ```

**Option 2: Local setup**

- Clone the repository:

  ```bash
  git clone https://github.com/Hemanth7723/Adult-Census-Income-Prediction.git
  cd Adult-Census-Income-Prediction
  ```

- Create a Python environment (recommended):

  ```bash
  # using venv
  python -m venv venv
  source venv/bin/activate   # macOS / Linux
  venv\Scripts\activate      # Windows

  # or using conda
  conda create -n acip python=3.9
  conda activate acip
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Start Jupyter Lab / Notebook:

  ```bash
  jupyter lab
  # or
  jupyter notebook
  ```

- Open the notebooks and run the cells in order (typically start with the EDA notebook, then preprocessing, then modeling).
## Environment & dependencies

- Recommended Python: 3.8+ (3.9 or 3.10 are common choices).
- Core libraries commonly used: pandas, numpy, scikit-learn, matplotlib, seaborn, xgboost (optional), shap/eli5 (for explainability), joblib (for saving models).
- If a `requirements.txt` exists, install it with `pip install -r requirements.txt`.
- For GPU-enabled training (if using large models), ensure appropriate CUDA and XGBoost/LightGBM builds; however, the Adult dataset is small and CPU is typically sufficient.
- After changes, you can regenerate the dependency list with `pip freeze > requirements.txt`.
## Notebooks & workflow

Typical notebook order (may vary by repo contents):
- 00-data-exploration.ipynb — Data loading, missing values, distribution of features, class balance, initial visualizations.
- 01-preprocessing-feature-engineering.ipynb — Encoding categorical features, scaling numeric features, handling missing values, feature creation.
- 02-modeling.ipynb — Train test split, baseline models (Logistic Regression, Decision Trees), cross-validation, hyperparameter tuning.
- 03-evaluation-and-interpretation.ipynb — Final evaluation metrics (confusion matrix, ROC AUC), feature importance and SHAP/interpretability.
- 04-deployment-or-notes.ipynb — (Optional) model export, inference examples, or notes on deployment.
Run notebooks sequentially to reproduce results. Use kernel restart and run-all to ensure clean reproducibility.
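The interpretability step in notebook 03 can be sketched without SHAP using scikit-learn's built-in permutation importance. The synthetic data and model choice below are stand-ins for the repo's actual features and tuned model:

```python
# Sketch: permutation importance as a model-agnostic interpretability step.
# Synthetic data stands in for the encoded Adult features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature in turn and measure the drop in test accuracy.
result = permutation_importance(
    model, X_test, y_test, n_repeats=5, random_state=0
)

for i in result.importances_mean.argsort()[::-1][:3]:
    print(f"feature {i}: importance {result.importances_mean[i]:.3f}")
```

Unlike tree-specific `feature_importances_`, this works for any fitted estimator, which makes it a useful complement (or fallback) to SHAP.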
## Modeling & evaluation

- Suggested models: Logistic Regression (baseline), Random Forest, Gradient Boosted Trees (XGBoost/LightGBM), and simple ensemble approaches.
- Important steps:
- Use stratified train/test split to preserve class balance.
- Apply cross-validation for robust performance estimates.
- Address class imbalance (if present) via class weights, resampling (SMOTE), or threshold tuning.
- Common evaluation metrics:
- Accuracy, Precision, Recall, F1-score
- ROC AUC and precision-recall curves (especially useful when classes are imbalanced)
- Confusion matrix to inspect error types
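The steps above — stratified splitting, class weighting, and cross-validated ROC AUC — can be sketched as follows. The synthetic imbalanced data stands in for the encoded Adult features; hyperparameters are illustrative, not tuned:

```python
# Sketch: stratified split + class-weighted baseline + cross-validated ROC AUC.
# Synthetic imbalanced data stands in for the encoded Adult features.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.75, 0.25], random_state=42
)

# Stratified split preserves the <=50K / >50K ratio in both halves.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# class_weight='balanced' counteracts the imbalance without resampling.
clf = LogisticRegression(max_iter=1000, class_weight="balanced")

# 5-fold cross-validated ROC AUC is more robust than a single split.
scores = cross_val_score(clf, X_train, y_train, cv=5, scoring="roc_auc")
print(f"CV ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")

clf.fit(X_train, y_train)
print(f"Held-out accuracy: {clf.score(X_test, y_test):.3f}")
```

Swapping in `RandomForestClassifier` or an XGBoost model only changes the `clf` line; the split and scoring scaffolding stays the same.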
- Save best models with joblib or pickle for later inference.
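Saving the best model with joblib, as suggested above, looks roughly like this. `DummyClassifier` and the temporary file path are stand-ins for the repo's actual best model and output location:

```python
# Sketch: persist a fitted model with joblib and reload it for inference.
# DummyClassifier and the temp path are placeholders for the real model.
import os
import tempfile

import joblib
from sklearn.dummy import DummyClassifier

X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]

model = DummyClassifier(strategy="most_frequent").fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "best_model.joblib")
joblib.dump(model, path)        # save the fitted estimator to disk

restored = joblib.load(path)    # later (or in another process): reload
print(restored.predict([[5]]))
```

joblib ships with scikit-learn installs and handles large numpy arrays inside estimators more efficiently than plain pickle.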
## Reproducibility & results

- Set random seeds in notebooks (numpy, scikit-learn, any model-specific RNG) to help reproduce experiments.
- If results/figures are important, export them to the `reports/` or `outputs/` folder in a notebook cell so they are preserved.
- Document chosen hyperparameters, model versions, and experiment IDs in a short log or table inside the notebooks.
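The seed-setting advice can be condensed into a first notebook cell like the sketch below; `SEED = 42` is an arbitrary choice, and the commented calls show where the same value should be passed explicitly:

```python
# Sketch: a first-cell seed setup for reproducible notebook runs.
# SEED is an arbitrary value; use the same one everywhere.
import random

import numpy as np

SEED = 42

random.seed(SEED)     # Python's built-in RNG
np.random.seed(SEED)  # NumPy's global RNG (used by much of scikit-learn)

# Also pass the seed explicitly wherever an API accepts it, e.g.:
#   train_test_split(X, y, random_state=SEED)
#   RandomForestClassifier(random_state=SEED)

draw = np.random.randint(0, 100)
print(draw)  # identical on every fresh run with the same SEED
```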
## Tips for extension

- Try advanced feature engineering: interaction terms, binning continuous features, or target encoding for high-cardinality categories.
- Experiment with pipeline automation: use scikit-learn Pipelines to chain preprocessing and modeling steps.
- Add CI for notebooks: use nbval or papermill to validate notebooks in continuous integration.
- Build a lightweight REST API for inference with FastAPI or Flask and containerize with Docker for deployment.
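The Pipeline suggestion above can be sketched with a `ColumnTransformer` that scales numeric columns and one-hot encodes categoricals before the classifier. The tiny DataFrame and column names are illustrative Adult-style features, not the repo's actual variables:

```python
# Sketch: a scikit-learn Pipeline chaining preprocessing and a model.
# The tiny DataFrame and column names are illustrative only.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age": [39, 50, 38, 28],
    "hours-per-week": [40, 13, 40, 40],
    "workclass": ["State-gov", "Self-emp", "Private", "Private"],
    "income": [0, 0, 0, 1],  # 0: <=50K, 1: >50K
})

numeric = ["age", "hours-per-week"]
categorical = ["workclass"]

# Scale numerics and one-hot encode categoricals in one transformer.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

# The pipeline applies preprocessing, then fits the classifier; the whole
# object can be cross-validated, grid-searched, or saved as a single unit.
pipe = Pipeline([
    ("prep", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

pipe.fit(df[numeric + categorical], df["income"])
preds = pipe.predict(df[numeric + categorical])
print(preds)
```

Keeping preprocessing inside the pipeline prevents train/test leakage during cross-validation, since scalers and encoders are refit on each fold's training portion only.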
## Contributing

Contributions are welcome. Suggested workflow:

- Fork the repo
- Create a new branch: `git checkout -b feature/your-change`
- Make changes and add or update notebooks/documentation
- Commit and push, then open a pull request with a clear description and any reproducible steps

Please keep notebooks runnable (consider clearing large outputs before committing) and add a short description of changes to the notebook's top cell.
## Acknowledgements

- The UCI Machine Learning Repository for the Adult Census Income dataset.
- Data science and ML community resources for modeling and interpretability best practices.