🧠 Census Income Classification

Naive Bayes & Decision Tree — Built from Scratch (No ML Libraries)

📌 Project Overview

A desktop application that predicts whether a person's income exceeds $50K/year based on census data.
Two classification algorithms were implemented from scratch without using any ML libraries like scikit-learn:

Naive Bayes — using Gaussian distribution for age + Laplace smoothing for categorical features
Decision Tree — using Gini Index with binary splits and recursive tree building

Model	Accuracy
Naive Bayes	~82.47%
Decision Tree	~83%

📁 Dataset

Census Income Dataset (Adult Dataset) — UCI Machine Learning Repository

Features Used:

Feature	Type	Preprocessing
Age	Continuous	None (modeled with a Gaussian in Naive Bayes, used as a numeric threshold in the Decision Tree)
Workclass	Categorical	Label Encoding
Marital Status	Categorical	Binary Encoding (married=1, else=0)
Education	Categorical	Label Encoding
Occupation	Categorical	Label Encoding

Target: income → <=50K or >50K

⚙️ How It Works

1. Data Preprocessing

Raw CSV
   ↓ Remove missing values ('?')
   ↓ Strip whitespace
   ↓ Marital Status → binary (married=1, not married=0)
   ↓ Label encode remaining categorical features
   ↓ Shuffle + Train/Test split
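
The pipeline above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the code in main.py (which uses pandas). The column names `age` and `marital-status`, the function name `preprocess`, and the rule that any status starting with "Married" counts as married are all assumptions made for the example.

```python
import random

def preprocess(rows, categorical, train_ratio=0.75, seed=0):
    """rows: list of dicts with string values (hypothetical column names).
    Returns (train_rows, test_rows, label_encoders)."""
    # Strip whitespace, then drop any row that contains the '?' missing marker.
    cleaned = []
    for row in rows:
        row = {key: value.strip() for key, value in row.items()}
        if "?" not in row.values():
            cleaned.append(row)

    for row in cleaned:
        row["age"] = int(row["age"])
        # Assumption: every status beginning with "Married" maps to 1, all others to 0.
        row["marital-status"] = int(row["marital-status"].startswith("Married"))

    # Label encode the remaining categorical columns and keep the mappings,
    # so the same codes can be reused at prediction time.
    label_encoders = {}
    for col in categorical:
        values = sorted({row[col] for row in cleaned})
        label_encoders[col] = {value: code for code, value in enumerate(values)}
        for row in cleaned:
            row[col] = label_encoders[col][row[col]]

    # Shuffle with a fixed seed, then cut into train and test parts.
    random.Random(seed).shuffle(cleaned)
    cut = int(len(cleaned) * train_ratio)
    return cleaned[:cut], cleaned[cut:], label_encoders
```

Returning the encoder mappings alongside the split keeps training and prediction on the same integer codes.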

2. Naive Bayes Algorithm

Prior Probability:

P(class) = count(class) / total_samples

Age — Gaussian Distribution:

P(age | class) = (1 / √(2π × var)) × exp(-(age - mean)² / (2 × var))
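
As a rough sketch, the density above translates directly into Python. The function names `fit_gaussian` and `gaussian_pdf` are illustrative, and the use of the population variance (dividing by n rather than n - 1) is an assumption; the original code may differ.

```python
import math

def fit_gaussian(values):
    """Return (mean, variance) of a list of ages for one class.
    Assumption: population variance (divide by n)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var

def gaussian_pdf(x, mean, var):
    """P(x | class) under a normal distribution; var must be greater than 0."""
    coeff = 1.0 / math.sqrt(2 * math.pi * var)
    return coeff * math.exp(-((x - mean) ** 2) / (2 * var))
```

One mean and one variance are fitted per class, so a two-class problem stores four numbers for age.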

Categorical Features — Laplace Smoothing:

P(feature=val | class) = (count + 1) / (class_size + V)

where V = number of unique values for that feature
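
A small sketch of the smoothed estimate, assuming the values of one feature for one class are available as a plain list. The name `smoothed_likelihood` is hypothetical.

```python
from collections import Counter

def smoothed_likelihood(values_in_class, value, n_unique):
    """(count + 1) / (class_size + V) for one feature value within one class.
    n_unique is V, the number of distinct values the feature can take."""
    counts = Counter(values_in_class)
    # Counter returns 0 for values it has never seen, so unseen values get 1 / (class_size + V).
    return (counts[value] + 1) / (len(values_in_class) + n_unique)
```

For a class containing two "Private" rows and one "State-gov" row with V = 3, "Private" gets 3/6 and a value never seen in that class gets 1/6 instead of 0.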

Final Prediction:

P(class | data) ∝ P(class) × P(age|class) × P(workclass|class) × ...
→ pick class with highest probability
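
The final step might look like the sketch below, which assumes the prior and the per-feature likelihoods have already been computed for each class. The names `posterior_scores` and `predict`, and the numbers in the usage lines, are made up for illustration.

```python
import math

def posterior_scores(priors, likelihoods):
    """priors: {class: P(class)}
    likelihoods: {class: [P(feature_1 | class), P(feature_2 | class), ...]}
    Returns the unnormalized score P(class) x product of likelihoods per class."""
    return {cls: priors[cls] * math.prod(likelihoods[cls]) for cls in priors}

def predict(priors, likelihoods):
    """Pick the class with the highest unnormalized posterior score."""
    scores = posterior_scores(priors, likelihoods)
    return max(scores, key=scores.get)

# Illustrative numbers only, not taken from the dataset.
priors = {"<=50K": 0.75, ">50K": 0.25}
likelihoods = {"<=50K": [0.02, 0.5, 0.3], ">50K": [0.03, 0.6, 0.9]}
label = predict(priors, likelihoods)
```

The scores are not normalized because only their ranking matters. With many more features, summing logarithms instead of multiplying would avoid floating-point underflow.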

3. Decision Tree Algorithm

Gini Index:

Gini(D) = 1 - Σ(prob²)   for each class

Gini(split) = (|D1|/|D|) × Gini(D1) + (|D2|/|D|) × Gini(D2)
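
Both formulas fit in a few lines of Python. This is a sketch with illustrative function names, operating on plain lists of class labels.

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(left, right):
    """Size-weighted Gini of the two partitions produced by a binary split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)
```

A pure partition scores 0, and an even two-class mix scores 0.5, the maximum for two classes.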

Building the Tree:

Try every feature and every unique value as a threshold
Pick the split with minimum Gini (purest split)
Recursively build left and right subtrees
Stop when max_depth=8 is reached or all samples are one class
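
The four steps above can be sketched as a recursive function. This is an illustration under assumptions, not the code in main.py: rows are lists of numbers, nodes are plain dicts, ties keep the first split found, and a node with no valid split becomes a majority-class leaf.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Try every feature and every unique value as a <= threshold.
    Returns (weighted_gini, feature_index, threshold) or None."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [y[i] for i, row in enumerate(X) if row[f] <= t]
            right = [y[i] for i, row in enumerate(X) if row[f] > t]
            if not left or not right:
                continue  # a split must send samples to both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def build_tree(X, y, depth=0, max_depth=8):
    majority = Counter(y).most_common(1)[0][0]
    # Stop at the depth limit or when the node is already pure.
    if depth >= max_depth or len(set(y)) == 1:
        return {"leaf": majority}
    split = best_split(X, y)
    if split is None:
        return {"leaf": majority}
    _, f, t = split
    left_idx = [i for i, row in enumerate(X) if row[f] <= t]
    right_idx = [i for i, row in enumerate(X) if row[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([X[i] for i in left_idx], [y[i] for i in left_idx], depth + 1, max_depth),
        "right": build_tree([X[i] for i in right_idx], [y[i] for i in right_idx], depth + 1, max_depth),
    }
```

Scanning every unique value of every feature at every node is simple but slow on large data, which is one reason training on a reduced sample of the data is faster.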

Prediction:

Start at root → answer yes/no questions → reach a leaf → return class
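
Traversal could look like the sketch below. The nested-dict node layout and the toy tree are assumptions made for the example. The tree is hand-written, not a trained model, with feature 0 standing for age and feature 1 for the binary marital status.

```python
def predict_one(node, row):
    """Walk from the root to a leaf, going left when the feature value is <= threshold."""
    while "leaf" not in node:
        branch = "left" if row[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]

# Hand-built toy tree for illustration only.
toy_tree = {
    "feature": 1, "threshold": 0,          # married?
    "left": {"leaf": "<=50K"},             # not married
    "right": {
        "feature": 0, "threshold": 37,     # age <= 37?
        "left": {"leaf": "<=50K"},
        "right": {"leaf": ">50K"},
    },
}

result = predict_one(toy_tree, [45, 1])    # age 45, married
```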

🖥️ Application Features

Section 1 — Data Configuration

Browse and load any CSV dataset
Set % of data to use (e.g. 50% for faster training)
Set Train/Test split ratio (default: 75/25)
Train both models and see accuracy results

Section 2 — Model Accuracy Results

Displays Naive Bayes accuracy
Displays Decision Tree accuracy

Section 3 — Predict Custom Record

Select values from dropdowns (Workclass, Education, etc.)
Enter age manually
Get prediction from both models instantly

📸 Application Screenshot

GUI_Screenshot.png (in the repository root)

🚀 How to Run

Requirements

pip install pandas

tkinter and math are part of the Python standard library, so no extra pip install is needed (some Linux distributions package tkinter separately, e.g. python3-tk)

Run

python main.py

📂 Project Structure

📦 census-income-classification
 ┣ 📜 main.py
 ┣ 📜 README.md
 ┗ 🖼️ GUI_Screenshot.png

🔑 Key Design Decisions

Why binary encoding for Marital Status?
The 7 marital categories reduce to one meaningful signal — married or not. This simplified encoding improved Decision Tree accuracy by giving it a clean single-question split.

Why Gaussian for Age?
Age is a continuous variable. Modeling it with one Gaussian per class uses every age value directly, instead of discarding information by forcing ages into a handful of categories.

Why no age binning?
Both approaches were tested. Without binning, the Decision Tree finds more precise split thresholds (e.g. age <= 37) rather than being forced into broad groups. This improved accuracy from ~75% to ~83%.

Why Laplace Smoothing?
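Without it, a feature value that never appears with a given class in the training data would get a probability of 0, and because the probabilities are multiplied together, the whole Naive Bayes score for that class would collapse to 0.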

Why max_depth=8?
Prevents overfitting — a very deep tree memorizes training data and performs poorly on new data.

👩‍💻 Implementation Notes

No scikit-learn, no ML libraries — pure Python logic
Gini Index uses the same impurity formula as sklearn's DecisionTreeClassifier(criterion='gini')
Binary splits implemented via <= threshold comparison
All encoding stored in label_encoders dictionary for consistent prediction
