A desktop application that predicts whether a person's income exceeds $50K/year based on census data.
Two classification algorithms were implemented from scratch without using any ML libraries like scikit-learn:
- Naive Bayes — using Gaussian distribution for age + Laplace smoothing for categorical features
- Decision Tree — using Gini Index with binary splits and recursive tree building
| Model | Accuracy |
|---|---|
| Naive Bayes | ~82.47% |
| Decision Tree | ~83% |
Census Income Dataset (Adult Dataset) — UCI Machine Learning Repository
| Feature | Type | Preprocessing |
|---|---|---|
| Age | Continuous | None (kept numeric; modeled with a Gaussian distribution in Naive Bayes) |
| Workclass | Categorical | Label Encoding |
| Marital Status | Categorical | Binary Encoding (married=1, else=0) |
| Education | Categorical | Label Encoding |
| Occupation | Categorical | Label Encoding |
Target: income → <=50K or >50K
Raw CSV
↓ Remove missing values ('?')
↓ Strip whitespace
↓ Marital Status → binary (married=1, not married=0)
↓ Label encode remaining categorical features
↓ Shuffle + Train/Test split
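
The pipeline above can be sketched in plain Python as follows. This is a minimal illustration, not the code in `main.py`: the function name `preprocess`, the column names, and the rule "label starts with `Married`" for the binary encoding are all assumptions. Whitespace is stripped before the `'?'` check here so that padded values such as `" ?"` are caught.

```python
import random

def preprocess(rows, categorical, train_ratio=0.75, seed=0):
    """rows: list of dicts of raw strings (hypothetical layout)."""
    # Strip whitespace, then drop rows that contain the '?' missing marker
    cleaned = []
    for row in rows:
        row = {key: value.strip() for key, value in row.items()}
        if "?" not in row.values():
            cleaned.append(row)

    # Marital status -> 1 if married, else 0 (assumed rule: label starts with "Married")
    for row in cleaned:
        row["marital-status"] = 1 if row["marital-status"].startswith("Married") else 0
        row["age"] = int(row["age"])

    # Label-encode the remaining categorical columns and remember the mapping
    label_encoders = {}
    for col in categorical:
        values = sorted({row[col] for row in cleaned})
        label_encoders[col] = {value: index for index, value in enumerate(values)}
        for row in cleaned:
            row[col] = label_encoders[col][row[col]]

    # Shuffle, then split into train and test sets
    random.Random(seed).shuffle(cleaned)
    cut = int(len(cleaned) * train_ratio)
    return cleaned[:cut], cleaned[cut:], label_encoders
```

Returning the encoder mapping alongside the two splits lets the same mapping be reused when a single sample is encoded for prediction later.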
Prior Probability:
P(class) = count(class) / total_samples
Age — Gaussian Distribution:
P(age | class) = (1 / √(2π × var)) × exp(-(age - mean)² / (2 × var))
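
As a rough sketch, the Gaussian likelihood above translates directly into Python using only `math`. The helper names `mean_and_variance` and `gaussian_pdf` are illustrative, and the use of the population variance (dividing by n) is an assumption; the project may compute it differently.

```python
import math

def mean_and_variance(values):
    """Mean and population variance of a list of numbers."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance at x."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * var)
    return coeff * math.exp(-((x - mean) ** 2) / (2.0 * var))
```

In use, `mean_and_variance` would be called once per class on the training ages, and `gaussian_pdf(age, mean, var)` would be evaluated for each class at prediction time.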
Categorical Features — Laplace Smoothing:
P(feature=val | class) = (count + 1) / (class_size + V)
where V = number of unique values for that feature
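
A small sketch of the smoothed estimate, assuming the values of one feature are available as plain lists. The function name `smoothed_likelihoods` and its arguments are hypothetical.

```python
from collections import Counter

def smoothed_likelihoods(values_in_class, all_values):
    """P(feature=val | class) with add-one (Laplace) smoothing.

    values_in_class: feature values of the training rows in one class
    all_values:      feature values of all training rows (defines V)
    """
    counts = Counter(values_in_class)
    vocab = set(all_values)
    class_size = len(values_in_class)
    V = len(vocab)
    return {val: (counts[val] + 1) / (class_size + V) for val in vocab}
```

With three rows in the class (`Private`, `Private`, `State-gov`) and three unique values overall, a value never seen in that class still gets (0 + 1) / (3 + 3) = 1/6 rather than 0, and the smoothed probabilities still sum to 1.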
Final Prediction:
P(class | data) ∝ P(class) × P(age|class) × P(workclass|class) × ...
→ pick class with highest probability
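
Putting the pieces together, a compact sketch of training and prediction might look like the following. The `fit`/`predict` names and the dictionary layout are assumptions, not the structure used in `main.py`. One deliberate substitution: the sketch sums log-probabilities instead of multiplying raw probabilities, a common way to avoid numerical underflow; the class with the highest score is the same either way.

```python
import math
from collections import Counter

def fit(rows, labels, cat_features):
    """Collect priors, age statistics and value counts per class."""
    model = {
        "vocab": {f: {r[f] for r in rows} for f in cat_features},
        "classes": {},
    }
    for c in set(labels):
        subset = [r for r, y in zip(rows, labels) if y == c]
        ages = [r["age"] for r in subset]
        mean = sum(ages) / len(ages)
        # Guard against zero variance (assumed safeguard)
        var = sum((a - mean) ** 2 for a in ages) / len(ages) or 1e-9
        model["classes"][c] = {
            "prior": len(subset) / len(rows),
            "mean": mean,
            "var": var,
            "size": len(subset),
            "counts": {f: Counter(r[f] for r in subset) for f in cat_features},
        }
    return model

def predict(model, sample):
    """Return the class with the highest log-posterior score."""
    best_class, best_score = None, -math.inf
    for c, s in model["classes"].items():
        score = math.log(s["prior"])
        # Log of the Gaussian likelihood for age
        score += -0.5 * math.log(2 * math.pi * s["var"])
        score -= (sample["age"] - s["mean"]) ** 2 / (2 * s["var"])
        # Log of the Laplace-smoothed likelihood for each categorical feature
        for f, vocab in model["vocab"].items():
            count = s["counts"][f][sample[f]]
            score += math.log((count + 1) / (s["size"] + len(vocab)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```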
Gini Index:
Gini(D) = 1 - Σ(prob²) for each class
Gini(split) = (|D1|/|D|) × Gini(D1) + (|D2|/|D|) × Gini(D2)
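
The two formulas can be written as a pair of small functions. This is a sketch with illustrative names (`gini`, `gini_split`), operating on plain lists of class labels.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(left_labels, right_labels):
    """Size-weighted Gini impurity of a binary split."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) + \
           (len(right_labels) / n) * gini(right_labels)
```

A perfectly mixed two-class node has impurity 0.5, a pure node has 0, and a split that separates the classes completely scores 0.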
Building the Tree:
- Try every feature and every unique value as a threshold
- Pick the split with minimum Gini (purest split)
- Recursively build left and right subtrees
- Stop when `max_depth=8` is reached or all samples are one class
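
The steps above can be sketched as an exhaustive split search plus a recursive builder. The nested-dictionary node layout, the function names, and the majority-class leaf are assumptions made for illustration; the project's own tree representation may differ.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Try every feature and every unique value; return (score, feature, threshold)."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [label for row, label in zip(X, y) if row[f] <= t]
            right = [label for row, label in zip(X, y) if row[f] > t]
            if not left or not right:
                continue  # a split must put samples on both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def build_tree(X, y, depth=0, max_depth=8):
    """Recursively build a tree of nested dicts; leaves hold the majority class."""
    majority = Counter(y).most_common(1)[0][0]
    if depth >= max_depth or len(set(y)) == 1:
        return {"leaf": majority}
    split = best_split(X, y)
    if split is None:  # no value separates the remaining samples
        return {"leaf": majority}
    _, f, t = split
    left_idx = [i for i, row in enumerate(X) if row[f] <= t]
    right_idx = [i for i, row in enumerate(X) if row[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([X[i] for i in left_idx], [y[i] for i in left_idx],
                           depth + 1, max_depth),
        "right": build_tree([X[i] for i in right_idx], [y[i] for i in right_idx],
                            depth + 1, max_depth),
    }
```

Scanning every unique value keeps the code short but costs roughly O(features × unique values × samples) per node, which is one reason a depth limit and a reduced data percentage help training time.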
Prediction:
Start at root → answer yes/no questions → reach a leaf → return class
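
Traversal is a short loop. The sketch below assumes nodes are nested dictionaries with `feature`, `threshold`, `left`, and `right` keys, and leaves carry a `leaf` key. The hand-written `example_tree` is purely illustrative and is not the tree the application learns.

```python
def predict(node, sample):
    """Walk from the root to a leaf, going left when value <= threshold."""
    while "leaf" not in node:
        if sample[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["leaf"]

# Hypothetical two-level tree, for illustration only
example_tree = {
    "feature": "age", "threshold": 37,
    "left": {"leaf": "<=50K"},
    "right": {
        "feature": "married", "threshold": 0,
        "left": {"leaf": "<=50K"},
        "right": {"leaf": ">50K"},
    },
}

print(predict(example_tree, {"age": 45, "married": 1}))
```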
- Browse and load any CSV dataset
- Set % of data to use (e.g. 50% for faster training)
- Set Train/Test split ratio (default: 75/25)
- Train both models and see accuracy results
- Displays Naive Bayes accuracy
- Displays Decision Tree accuracy
- Select values from dropdowns (Workclass, Education, etc.)
- Enter age manually
- Get prediction from both models instantly
`pip install pandas`

`tkinter` and `math` are built-in Python libraries, so no extra install is needed.

`python main.py`

📦 census-income-classification
┣ 📜 main.py
┣ 📜 README.md
┗ 🖼️ GUI_Screenshot.png
Why binary encoding for Marital Status?
The 7 marital categories reduce to one meaningful signal — married or not. This simplified encoding improved Decision Tree accuracy by giving it a clean single-question split.
Why Gaussian for Age?
Age is a continuous variable. Using Gaussian distribution captures the full range of age values and their relationship to income far better than treating age as a category.
Why no age binning?
Tested both approaches. Without binning, the Decision Tree finds more precise split thresholds (e.g. age <= 37) rather than being forced into broad groups. This improved accuracy from ~75% to ~83%.
Why Laplace Smoothing?
Without it, any unseen value in test data would produce a probability of 0, making the entire Naive Bayes calculation collapse to 0.
Why max_depth=8?
Prevents overfitting — a very deep tree memorizes training data and performs poorly on new data.
- No scikit-learn, no ML libraries — pure Python logic
- Gini Index matches sklearn's `DecisionTreeClassifier(criterion='gini')` exactly
- Binary splits implemented via `<=` threshold comparison
- All encoding stored in `label_encoders` dictionary for consistent prediction
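
To illustrate why the stored mapping matters, here is a minimal sketch of fitting encoders once and reusing them at prediction time. Only the name `label_encoders` comes from this project; the helpers `fit_encoders` and `encode_sample` and the dictionary layout are assumptions.

```python
def fit_encoders(rows, columns):
    """Build one value -> integer mapping per categorical column."""
    label_encoders = {}
    for col in columns:
        values = sorted({row[col] for row in rows})
        label_encoders[col] = {value: index for index, value in enumerate(values)}
    return label_encoders

def encode_sample(sample, label_encoders):
    """Encode a single sample with the mappings learned during training."""
    return {
        key: label_encoders[key][value] if key in label_encoders else value
        for key, value in sample.items()
    }

train_rows = [
    {"workclass": "Private", "education": "Bachelors"},
    {"workclass": "State-gov", "education": "HS-grad"},
]
label_encoders = fit_encoders(train_rows, ["workclass", "education"])
encoded = encode_sample(
    {"age": 40, "workclass": "State-gov", "education": "Bachelors"},
    label_encoders,
)
```

Because the sample is encoded with the training-time mapping, a value chosen in a dropdown maps to the same integer the models saw during training.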
