IsraaXx/Income-Classification-NB-DT

🧠 Census Income Classification

Naive Bayes & Decision Tree — Built from Scratch (No ML Libraries)


📌 Project Overview

A desktop application that predicts whether a person's income exceeds $50K/year based on census data.
Two classification algorithms were implemented from scratch without using any ML libraries like scikit-learn:

  • Naive Bayes — using Gaussian distribution for age + Laplace smoothing for categorical features
  • Decision Tree — using Gini Index with binary splits and recursive tree building

| Model | Accuracy |
|---|---|
| Naive Bayes | ~82.47% |
| Decision Tree | ~83% |

📁 Dataset

Census Income Dataset (Adult Dataset) — UCI Machine Learning Repository

Features Used:

| Feature | Type | Preprocessing |
|---|---|---|
| Age | Continuous | Gaussian distribution |
| Workclass | Categorical | Label Encoding |
| Marital Status | Categorical | Binary Encoding (married=1, else=0) |
| Education | Categorical | Label Encoding |
| Occupation | Categorical | Label Encoding |

Target: income (<=50K or >50K)


⚙️ How It Works

1. Data Preprocessing

Raw CSV
   ↓ Remove missing values ('?')
   ↓ Strip whitespace
   ↓ Marital Status → binary (married=1, not married=0)
   ↓ Label encode remaining categorical features
   ↓ Shuffle + Train/Test split
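
The pipeline above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the column names, the `preprocess` function, the fixed `seed`, and the rule that any status starting with "Married" counts as married are all assumptions made for the example.

```python
import random

# Hypothetical column names; the real CSV headers may differ.
CATEGORICAL = ["workclass", "education", "occupation"]

def preprocess(rows, train_ratio=0.75, seed=0):
    """Clean, encode, shuffle, and split a list of dict rows."""
    # Strip whitespace, then drop rows that contain the missing marker '?'.
    cleaned = []
    for row in rows:
        row = {key: str(value).strip() for key, value in row.items()}
        if "?" not in row.values():
            cleaned.append(row)

    # One encoder per categorical column: value -> integer code.
    label_encoders = {
        col: {v: i for i, v in enumerate(sorted({r[col] for r in cleaned}))}
        for col in CATEGORICAL
    }

    encoded = []
    for r in cleaned:
        record = {"age": int(r["age"]), "income": r["income"]}
        # Assumption: statuses starting with "Married" map to 1, all others to 0.
        record["marital_status"] = 1 if r["marital_status"].startswith("Married") else 0
        for col in CATEGORICAL:
            record[col] = label_encoders[col][r[col]]
        encoded.append(record)

    # Shuffle reproducibly, then cut into train and test parts.
    random.Random(seed).shuffle(encoded)
    cut = int(len(encoded) * train_ratio)
    return encoded[:cut], encoded[cut:], label_encoders

rows = [
    {"age": " 39", "workclass": " Private", "education": " Bachelors",
     "marital_status": " Never-married", "occupation": " Sales", "income": " <=50K"},
    {"age": "50", "workclass": "Self-emp", "education": "HS-grad",
     "marital_status": "Married-civ-spouse", "occupation": "Exec", "income": ">50K"},
    {"age": "28", "workclass": "?", "education": "HS-grad",
     "marital_status": "Divorced", "occupation": "Sales", "income": "<=50K"},
    {"age": "45", "workclass": "Private", "education": "Masters",
     "marital_status": "Married-civ-spouse", "occupation": "Exec", "income": ">50K"},
    {"age": "33", "workclass": "Private", "education": "HS-grad",
     "marital_status": "Divorced", "occupation": "Sales", "income": "<=50K"},
]
train, test, label_encoders = preprocess(rows)
```

Returning the encoders alongside the split lets the same value-to-code mapping be reused when a custom record is predicted later.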

2. Naive Bayes Algorithm

Prior Probability:

P(class) = count(class) / total_samples
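
As a small sketch of the prior formula, the helper below counts labels and divides by the total. The function name `class_priors` and the sample labels are illustrative assumptions.

```python
from collections import Counter

def class_priors(labels):
    """Return P(class) = count(class) / total_samples for each class."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}

# Three of four samples are '<=50K', so its prior is 0.75.
priors = class_priors(["<=50K", "<=50K", "<=50K", ">50K"])
```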

Age — Gaussian Distribution:

P(age | class) = (1 / √(2π × var)) × exp(-(age - mean)² / (2 × var))
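
The density above needs a mean and variance per class. The sketch below estimates both from the ages of one class and then evaluates the formula. The helper names are hypothetical, and the use of the population variance (dividing by n) is an assumption, since the README does not say which estimator is used.

```python
import math

def fit_gaussian(values):
    """Estimate mean and population variance from one class's ages."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var

def gaussian_pdf(x, mean, var):
    """Evaluate the Gaussian density; assumes var > 0."""
    return (1 / math.sqrt(2 * math.pi * var)) * math.exp(-((x - mean) ** 2) / (2 * var))

mean, var = fit_gaussian([30, 40, 50])
density_at_mean = gaussian_pdf(40, mean, var)
```

If all ages in a class were identical, `var` would be 0 and the density undefined, so a real implementation would likely need a small guard for that case.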

Categorical Features — Laplace Smoothing:

P(feature=val | class) = (count + 1) / (class_size + V)

where V = number of unique values for that feature
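
A minimal sketch of the smoothed estimate, assuming the feature values of one class are held in a plain list. The function name and sample data are illustrative only.

```python
def laplace_likelihood(value, class_values, vocabulary_size):
    """Return (count + 1) / (class_size + V) for one feature value."""
    count = class_values.count(value)
    return (count + 1) / (len(class_values) + vocabulary_size)

# One class has three samples with encoded values 0, 0, 1; the feature has V = 3 values.
class_values = [0, 0, 1]
p_seen = laplace_likelihood(0, class_values, 3)    # (2 + 1) / (3 + 3)
p_unseen = laplace_likelihood(2, class_values, 3)  # (0 + 1) / (3 + 3)
```

The value `2` never occurs in this class, yet it still receives a non-zero probability, and the three probabilities continue to sum to 1.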

Final Prediction:

P(class | data) ∝ P(class) × P(age|class) × P(workclass|class) × ...
→ pick class with highest probability
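
The product above can be sketched as follows. Note one deliberate substitution: the sketch sums logarithms instead of multiplying raw probabilities, a common way to avoid floating-point underflow. It ranks the classes identically, but it is not necessarily how the project computes the score. The function name and the example probabilities are made up for illustration.

```python
import math

def predict(priors, likelihoods):
    """Pick the class with the highest log-posterior score.

    priors: {class: P(class)}
    likelihoods: {class: [P(feature_i | class), ...]} for one record
    """
    scores = {}
    for cls, prior in priors.items():
        # log(P(class)) + sum of log(P(feature | class)) equals the log of the product.
        scores[cls] = math.log(prior) + sum(math.log(p) for p in likelihoods[cls])
    return max(scores, key=scores.get), scores

priors = {"<=50K": 0.75, ">50K": 0.25}
likelihoods = {
    "<=50K": [0.02, 0.3, 0.2],  # e.g. age density, workclass, education
    ">50K": [0.03, 0.5, 0.6],
}
label, scores = predict(priors, likelihoods)
```

Here the raw products are 0.0009 for `<=50K` and 0.00225 for `>50K`, so the second class wins despite its smaller prior.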

3. Decision Tree Algorithm

Gini Index:

Gini(D) = 1 - Σ(prob²)   for each class

Gini(split) = (|D1|/|D|) × Gini(D1) + (|D2|/|D|) × Gini(D2)
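
Both formulas translate almost directly into code. The sketch below uses hypothetical function names and treats an empty partition as having impurity 0, which is an assumption of this example.

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0  # assumption: an empty partition counts as pure
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(left, right):
    """Size-weighted average of the two partitions' Gini values."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

mixed = gini([0, 0, 1, 1])             # evenly mixed node
perfect = gini_split([0, 0], [1, 1])   # each side is a single class
useless = gini_split([0, 1], [0, 1])   # each side is still evenly mixed
```

An evenly mixed two-class node scores 0.5, a split that separates the classes completely scores 0, and a split that leaves both sides evenly mixed stays at 0.5.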

Building the Tree:

  1. Try every feature and every unique value as a threshold
  2. Pick the split with minimum Gini (purest split)
  3. Recursively build left and right subtrees
  4. Stop when max_depth=8 is reached or all samples are one class
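
The four steps above can be sketched as a short recursive builder. This is an illustrative version under stated assumptions, not the project's code: the nested-dict tree layout, the function names, the majority-vote leaves, and the rule of skipping splits that leave one side empty are all choices made for this example.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Try every feature and every unique value; return (score, feature, threshold)."""
    best = None
    n = len(y)
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [label for row, label in zip(X, y) if row[f] <= t]
            right = [label for row, label in zip(X, y) if row[f] > t]
            if not left or not right:
                continue  # a split must send samples to both sides
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def build_tree(X, y, depth=0, max_depth=8):
    majority = Counter(y).most_common(1)[0][0]
    # Stop when the node is pure or the depth limit is reached.
    if len(set(y)) == 1 or depth >= max_depth:
        return {"leaf": majority}
    split = best_split(X, y)
    if split is None:  # no threshold separates the samples
        return {"leaf": majority}
    _, f, t = split
    left = [i for i, row in enumerate(X) if row[f] <= t]
    right = [i for i, row in enumerate(X) if row[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([X[i] for i in left], [y[i] for i in left], depth + 1, max_depth),
        "right": build_tree([X[i] for i in right], [y[i] for i in right], depth + 1, max_depth),
    }

# Toy data: columns are [age, marital_status]; labels are 0 (<=50K) and 1 (>50K).
X = [[25, 0], [30, 0], [45, 1], [50, 1]]
y = [0, 0, 1, 1]
tree = build_tree(X, y)
```

On this toy data a single split on age separates the two classes, so the tree has one decision node and two leaves.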

Prediction:

Start at root → answer yes/no questions → reach a leaf → return class
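
A minimal sketch of that traversal, assuming the tree is stored as nested dicts with `feature`, `threshold`, `left`, `right`, and `leaf` keys. That layout and the hand-written tree below are hypothetical and exist only to make the example runnable.

```python
def predict_one(node, row):
    """Walk from the root to a leaf, going left when value <= threshold."""
    while "leaf" not in node:
        if row[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["leaf"]

# Hand-written example tree; row layout is [age, marital_status].
tree = {
    "feature": 0, "threshold": 37,          # is age <= 37?
    "left": {"leaf": "<=50K"},
    "right": {
        "feature": 1, "threshold": 0,       # is marital_status <= 0 (not married)?
        "left": {"leaf": "<=50K"},
        "right": {"leaf": ">50K"},
    },
}

young = predict_one(tree, [30, 1])
older_married = predict_one(tree, [45, 1])
older_single = predict_one(tree, [45, 0])
```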

🖥️ Application Features

Section 1 — Data Configuration

  • Browse and load any CSV dataset
  • Set % of data to use (e.g. 50% for faster training)
  • Set Train/Test split ratio (default: 75/25)
  • Train both models and see accuracy results

Section 2 — Model Accuracy Results

  • Displays Naive Bayes accuracy
  • Displays Decision Tree accuracy
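
The accuracy figures shown here are presumably the share of test records classified correctly. A small sketch of that calculation, with a hypothetical function name and made-up labels:

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

predicted = [">50K", "<=50K", "<=50K", ">50K"]
actual = [">50K", "<=50K", ">50K", ">50K"]
score = accuracy(predicted, actual)  # 3 of 4 predictions match
```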

Section 3 — Predict Custom Record

  • Select values from dropdowns (Workclass, Education, etc.)
  • Enter age manually
  • Get prediction from both models instantly

📸 Application Screenshot

Application GUI screenshot (GUI_Screenshot.png)

🚀 How to Run

Requirements

pip install pandas

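tkinter and math are part of Python's standard library, so they usually need no extra install. Some Linux distributions package tkinter separately (for example as python3-tk).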

Run

python main.py

📂 Project Structure

📦 census-income-classification
 ┣ 📜 main.py
 ┣ 📜 README.md
 ┗ 🖼️ GUI_Screenshot.png 

🔑 Key Design Decisions

Why binary encoding for Marital Status?
The 7 marital categories reduce to one meaningful signal — married or not. This simplified encoding improved Decision Tree accuracy by giving it a clean single-question split.

Why Gaussian for Age?
Age is a continuous variable. Modeling it with a per-class Gaussian uses every age value directly, through the class mean and variance, instead of discarding detail by treating age as a category.

Why no age binning?
Both approaches were tested. Without binning, the Decision Tree finds more precise split thresholds (e.g. age <= 37) rather than being forced into broad groups. This improved accuracy from ~75% to ~83%.

Why Laplace Smoothing?
Without it, any unseen value in test data would produce a probability of 0, making the entire Naive Bayes calculation collapse to 0.

Why max_depth=8?
Prevents overfitting — a very deep tree memorizes training data and performs poorly on new data.


👩‍💻 Implementation Notes

  • No scikit-learn, no ML libraries — pure Python logic
  • Gini Index uses the same impurity formula as sklearn's DecisionTreeClassifier(criterion='gini')
  • Binary splits implemented via <= threshold comparison
  • All encoding stored in label_encoders dictionary for consistent prediction

About

A Machine Learning desktop application predicting Census Income. Features Naive Bayes and Decision Tree classifiers built entirely from scratch in Python, integrated with a Tkinter GUI.
