GitHub - mkmkkkkk/MATH5472-final-project

MATH5472 Final Project — What If Without Random Forest

This repository contains the implementation and experiments for the MATH5472 course project.
The objective is to evaluate the contribution of Random Forest by comparing it against baseline models on the Wine Quality dataset.

Project Overview

Course: MATH5472 – Statistical Machine Learning
Topic: What If Without the Random Forest Method
Dataset: Wine Quality (red and white variants)
Task: Binary classification — predict whether a wine is “good” (quality ≥ 7)
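
The labeling rule in the Task line can be sketched as follows. This is a minimal illustration, not the repository's data_utils.py. The `quality` column name and the semicolon separator follow the UCI distribution of the Wine Quality CSVs and are assumptions here, as are the `add_good_label` helper and the made-up sample rows.

```python
import io

import pandas as pd

# Hypothetical sample in the UCI Wine Quality layout (semicolon-separated).
# Only two feature columns are shown, and the values are invented for illustration.
SAMPLE_CSV = """fixed acidity;alcohol;quality
7.4;9.4;5
7.8;12.8;7
6.0;11.2;8
7.9;9.8;6
"""


def add_good_label(df: pd.DataFrame, threshold: int = 7) -> pd.DataFrame:
    """Return a copy with a binary 'good' column: 1 if quality >= threshold, else 0."""
    out = df.copy()
    out["good"] = (out["quality"] >= threshold).astype(int)
    return out


wines = add_good_label(pd.read_csv(io.StringIO(SAMPLE_CSV), sep=";"))
print(wines[["quality", "good"]])
```

With a threshold of 7, "good" wines are usually the minority class, which is one reason to report F1-score and ROC-AUC alongside accuracy.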

NABC Summary:

Need: Linear models underfit nonlinear relationships; single decision trees have high variance.

Approach: Random Forest using bootstrap sampling and random feature subsets, compared with Logistic Regression, Decision Tree, and Gradient Boosting.

Benefit: Random Forest achieves the highest accuracy, F1-score, and ROC-AUC of the four models compared.

Competitors: Logistic Regression, Decision Tree, Gradient Boosting.
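
The two sources of randomness named in the Approach (bootstrap sampling and random feature subsets) can be illustrated with the standard library alone. This is a sketch of the idea, not the project's code. The helper names and the short feature list are hypothetical, and the square-root rule for the subset size is a common default for classification rather than something this README specifies.

```python
import math
import random


def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def random_feature_subset(features, rng):
    """Pick about sqrt(p) distinct features to consider at one split."""
    k = max(1, round(math.sqrt(len(features))))
    return rng.sample(features, k)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
features = [
    "alcohol", "pH", "sulphates", "density", "chlorides",
    "citric acid", "residual sugar", "volatile acidity", "fixed acidity",
]

rows = bootstrap_indices(10, rng)
subset = random_feature_subset(features, rng)
# Rows never drawn into the bootstrap sample are "out-of-bag" for this tree.
out_of_bag = sorted(set(range(10)) - set(rows))

print("bootstrap rows:", rows)
print("features tried at one split:", subset)
print("out-of-bag rows:", out_of_bag)
```

Each tree sees a different resampled training set and a different feature subset at each split, so the trees are less correlated and their averaged vote has lower variance than a single decision tree.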

Repository Structure

MATH5472-final-project/
│
├── data/
│   ├── winequality-red.csv
│   └── winequality-white.csv
│
├── src/
│   ├── __init__.py
│   ├── config.py
│   ├── data_utils.py
│   ├── train_and_evaluate.py
│   └── plot_results.py
│
├── results/
│   ├── metrics.csv
│   ├── confusion_matrix_logistic_regression.png
│   ├── confusion_matrix_decision_tree.png
│   ├── confusion_matrix_random_forest.png
│   ├── confusion_matrix_gradient_boosting.png
│   ├── feature_importance_random_forest.png
│   ├── feature_importance_gradient_boosting.png
│   └── model_comparison_*.png
│
├── .gitignore
└── paper.tex

Environment Setup

cd MATH5472-final-project
python3 -m venv .venv
source .venv/bin/activate

pip install pandas scikit-learn matplotlib

Data Placement

Place these files into the data/ directory:

winequality-red.csv
winequality-white.csv
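
Before running the experiments, a quick check that both files are in place can save a confusing traceback. The snippet below is an optional sketch and not part of the repository. The `missing_data_files` helper is hypothetical, and it assumes the check is run from the project root so that `data` resolves to the data/ directory.

```python
from pathlib import Path

EXPECTED_FILES = ("winequality-red.csv", "winequality-white.csv")


def missing_data_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_data_files("data")
    if missing:
        print("Missing from data/:", ", ".join(missing))
    else:
        print("Both wine quality CSVs are in place.")
```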

Running the Experiments

Train and evaluate all models:

python -m src.train_and_evaluate

The script loads the data, trains the Logistic Regression, Decision Tree, Random Forest, and Gradient Boosting models, evaluates each one, and saves the results.
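
The workflow described above might look roughly like the sketch below. It is not the contents of src/train_and_evaluate.py. It uses synthetic data from scikit-learn's `make_classification` in place of the wine CSVs, and the hyperparameters, split ratio, random seed, and the `evaluate_models` helper are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def evaluate_models(X, y, seed=42):
    """Fit four classifiers and return {name: {metric: value}} on a held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    models = {
        # Scaling helps the linear model converge; tree models do not need it.
        "logistic_regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)
        ),
        "decision_tree": DecisionTreeClassifier(random_state=seed),
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=seed),
        "gradient_boosting": GradientBoostingClassifier(random_state=seed),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class
        results[name] = {
            "accuracy": accuracy_score(y_test, pred),
            "f1": f1_score(y_test, pred),
            "roc_auc": roc_auc_score(y_test, proba),
        }
    return results


# Synthetic, imbalanced stand-in for the wine data (11 features, roughly 20% positives).
X, y = make_classification(
    n_samples=500, n_features=11, n_informative=5, weights=[0.8, 0.2], random_state=0
)
results = evaluate_models(X, y)
for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

The numbers printed by this sketch come from synthetic data and are not comparable to the results reported for the wine dataset.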

Generate comparison plots:

python -m src.plot_results

Preliminary Results

Logistic Regression: accuracy ≈ 0.83, ROC-AUC ≈ 0.81
Decision Tree: accuracy ≈ 0.83, ROC-AUC ≈ 0.75
Random Forest: accuracy ≈ 0.89, ROC-AUC ≈ 0.92
Gradient Boosting: accuracy ≈ 0.85, ROC-AUC ≈ 0.87

Reproducibility

git clone https://github.com/MichaelYangzk/MATH5472-final-project.git
cd MATH5472-final-project
python3 -m venv .venv
source .venv/bin/activate
pip install pandas scikit-learn matplotlib
python -m src.train_and_evaluate
python -m src.plot_results
