This project aims to predict whether an individual’s annual income exceeds $50,000 using demographic and employment-related features from the Adult (Census Income) dataset. The task is formulated as a binary classification problem and serves as a benchmark for evaluating classical machine learning algorithms on structured tabular data.
It was completed collaboratively as a team project for a machine learning course.
To address this problem, three supervised learning models were implemented and compared: Support Vector Machine (SVM), Decision Tree, and K-Nearest Neighbors (KNN). The goal is to assess how different modeling approaches perform in terms of accuracy and class-level prediction quality.
- Supports socioeconomic and workforce analysis
- Helps identify factors associated with income levels
- Provides a standard benchmark for evaluating ML classifiers
The Adult dataset was introduced by Becker and Kohavi (1996) and is hosted by the UCI Machine Learning Repository. It contains demographic and employment-related attributes such as age, education, occupation, and hours worked per week. The target variable is Income, formulated as a binary classification task with the following two classes:
1. <=50K: individuals earning $50,000 or less per year.
2. >50K: individuals earning more than $50,000 per year.
Dataset Source: https://doi.org/10.24432/C5XW20.
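
The two classes can be mapped to integers before training. The sketch below is illustrative rather than the project's actual code: the function name `encode_income` and the 0/1 mapping are assumptions. It also strips a trailing period, since the test split of the raw UCI files writes the labels as `<=50K.` and `>50K.`.

```python
# Hypothetical helper: map raw Adult income labels to 0/1.
# Assumption: 0 = "<=50K", 1 = ">50K" (the mapping is a free choice).
LABEL_TO_INT = {"<=50K": 0, ">50K": 1}


def encode_income(raw_label: str) -> int:
    """Normalize whitespace and a trailing period, then map to 0 or 1."""
    cleaned = raw_label.strip().rstrip(".")
    if cleaned not in LABEL_TO_INT:
        raise ValueError(f"unexpected income label: {raw_label!r}")
    return LABEL_TO_INT[cleaned]
```

Raising on unknown labels, rather than silently defaulting, makes formatting differences between the two splits visible early.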
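To keep category encodings and feature scaling consistent across both splits, the training and test datasets were temporarily combined during preprocessing and later separated. Because statistics computed on the combined data (the most frequent category, feature minima and maxima) include test rows, this approach trades strict train/test isolation for consistency. The main preprocessing steps included: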
- Handling missing values using the most frequent category
- Encoding categorical variables into numeric form
- Applying Min–Max normalization to scale features

These steps were particularly important for distance-based models such as SVM and KNN.
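
The three preprocessing steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the function names are hypothetical, missing values are assumed to be marked with `?`, and integer (ordinal) codes are assumed for the categorical encoding, which the description above does not specify.

```python
from collections import Counter

MISSING = "?"  # assumption: missing categorical values are marked with "?"


def impute_most_frequent(column):
    """Replace missing markers with the most frequent observed category."""
    observed = [v for v in column if v != MISSING]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v == MISSING else v for v in column]


def ordinal_encode(column):
    """Map each category to an integer code (sorted for reproducibility)."""
    codes = {cat: i for i, cat in enumerate(sorted(set(column)))}
    return [codes[v] for v in column]


def min_max_scale(column):
    """Rescale numeric values to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


# Toy columns standing in for a categorical and a numeric feature.
workclass = ["Private", "?", "State-gov", "Private"]
age = [39, 50, 38, 53]

workclass_scaled = min_max_scale(ordinal_encode(impute_most_frequent(workclass)))
age_scaled = min_max_scale(age)
```

After scaling, every feature lies in [0, 1], so a feature with a wide raw range (such as age) no longer dominates distance computations over a feature with a narrow one.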
Three machine learning models were trained and evaluated:
- Support Vector Machine (RBF kernel) for margin-based classification
- Decision Tree Classifier for interpretable rule-based learning
- K-Nearest Neighbors (KNN) for distance-based classification
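
To make the distance-based idea concrete, here is a minimal KNN sketch in plain Python. It is an illustration only: the function name `knn_predict`, the Euclidean metric, `k=3`, and the toy points are all assumptions, and the project's own models and hyperparameters may differ.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict the majority label among the k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy, already-scaled 2D points: class 0 near the origin, class 1 near (1, 1).
train_X = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9), (0.85, 0.75)]
train_y = [0, 0, 1, 1, 1]

prediction = knn_predict(train_X, train_y, (0.7, 0.8), k=3)
```

Because the prediction depends entirely on distances, features on different scales would distort the neighbor ranking, which is why normalization matters for this model.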
Model performance was evaluated using accuracy, precision, recall, F1-score, and confusion matrices.
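
The listed metrics all derive from the four cells of a binary confusion matrix. The sketch below shows the arithmetic in plain Python, treating `>50K` (encoded as 1) as the positive class. The function name `binary_metrics` and the toy labels are assumptions for illustration, not project results.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute the confusion matrix and derived metrics for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Rows are true classes, columns are predictions: [[TN, FP], [FN, TP]].
        "confusion_matrix": [[tn, fp], [fn, tp]],
    }


# Toy labels only; these are not outputs of the project's models.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
```

Reporting precision and recall alongside accuracy is useful here because a model can score well on accuracy while still missing many `>50K` cases.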
The table below reports each model's accuracy on the test dataset.
| Model | Accuracy |
|---|---|
| Support Vector Machine (SVM) | 0.8483 |
| Decision Tree | 0.8448 |
| K-Nearest Neighbors (KNN) | 0.8284 |
The SVM model achieved the highest accuracy (0.8483), followed closely by the Decision Tree (0.8448), while KNN was about two percentage points lower (0.8284).
- Becker, B., & Kohavi, R. (1996). Adult Dataset. UCI Machine Learning Repository.
https://doi.org/10.24432/C5XW20