🎓 Student Performance Prediction System

An end-to-end machine learning project that predicts student academic performance from the UCI Student Performance Dataset. It covers regression (grade prediction), classification (pass/fail), hyperparameter tuning, sklearn Pipelines, and a Streamlit web app for live predictions.

🔍 Project Overview

This project builds a complete ML pipeline to predict:

Task	Target	Type
Grade Prediction	Final grade G3 (0–20)	Regression
Pass/Fail Prediction	Pass if G3 ≥ 10	Classification
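
Both targets come from the same G3 column. A minimal sketch of how the classification label could be derived from the regression target, assuming the threshold stated in the table (the name `to_pass_label` is illustrative and not taken from the project's code):

```python
PASS_THRESHOLD = 10  # pass mark on the 0-20 scale, as stated in the table above


def to_pass_label(g3: int, threshold: int = PASS_THRESHOLD) -> int:
    """Return 1 (pass) if the final grade meets the threshold, else 0 (fail)."""
    if not 0 <= g3 <= 20:
        raise ValueError(f"G3 must be in 0..20, got {g3}")
    return int(g3 >= threshold)


grades = [0, 9, 10, 15, 20]
labels = [to_pass_label(g) for g in grades]
print(labels)
```

Keeping the threshold in one named constant makes it harder for the training code and the web app to disagree about what counts as a pass.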

Why this matters in the real world:

🏫 Schools identify at-risk students early
📚 EdTech companies personalize learning paths
🎯 Academic counselors intervene before dropout

🛠️ Tech Stack

Component	Technology
Language	Python 3.10+
ML Framework	scikit-learn 1.4, XGBoost 2.0
Data Processing	pandas, numpy
Visualization	matplotlib, seaborn, plotly
Web App	Streamlit
Pipelines	sklearn ColumnTransformer + GridSearchCV
CI/CD	GitHub Actions
Notebooks	Jupyter

⚠️ GPU Note: This project is 100% CPU-based. No PyTorch, TensorFlow, or CUDA packages are used. Your GPU setup will not be affected.

🏗️ Project Architecture

Student Data (UCI)
        │
        ▼
  data_loader.py ──── Auto-download UCI Dataset
        │
        ▼
 preprocessing.py ─── Cleaning + Feature Engineering
        │              (avg_grade, grade_trend, absence_bucket, ...)
        ▼
   ColumnTransformer ── Numeric scaling + Categorical encoding
        │
        ├──────────────────────────────────┐
        ▼                                  ▼
  REGRESSION                         CLASSIFICATION
  (predict G3)                       (predict pass/fail)
        │                                  │
  ┌─────────────┐                   ┌─────────────┐
  │ Lin.Reg     │                   │ Log.Reg     │
  │ RF Regr     │  GridSearchCV     │ RF Classif  │  GridSearchCV
  │ XGB Regr    │  (5-fold CV)      │ XGB Classif │  (5-fold CV)
  │ SVR         │                   │ SVC         │
  │ KNN Regr    │                   │ KNN Classif │
  └─────────────┘                   └─────────────┘
        │                                  │
        ▼                                  ▼
  RMSE / MAE / R²              Acc / F1 / AUC / CM
        │                                  │
        └──────────┬───────────────────────┘
                   ▼
           Streamlit Web App
           (live predictions)
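
The diagram names three engineered features (avg_grade, grade_trend, absence_bucket) without defining them. The sketch below shows one plausible reading. The formulas and bucket cut-offs are assumptions made for illustration, and the real definitions in preprocessing.py may differ. Only the raw column names G1, G2, and absences come from the dataset.

```python
def engineer_features(row: dict) -> dict:
    """Return a copy of one student record with three derived columns added."""
    g1, g2, absences = row["G1"], row["G2"], row["absences"]
    out = dict(row)
    out["avg_grade"] = (g1 + g2) / 2   # assumed: mean of the two period grades
    out["grade_trend"] = g2 - g1       # assumed: change from period 1 to period 2
    # Assumed cut-offs; the project's actual buckets may differ.
    if absences == 0:
        out["absence_bucket"] = "none"
    elif absences <= 5:
        out["absence_bucket"] = "low"
    elif absences <= 15:
        out["absence_bucket"] = "medium"
    else:
        out["absence_bucket"] = "high"
    return out


student = {"G1": 12, "G2": 14, "absences": 4}
features = engineer_features(student)
print(features)
```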

📁 Folder Structure

Student-Performance-Prediction/
│
├── data/
│   ├── raw/                    ← Auto-downloaded UCI dataset
│   └── processed/              ← Cleaned + feature-engineered CSV
│
├── notebooks/
│   ├── 01_EDA.ipynb            ← Exploratory Data Analysis
│   └── 02_Model_Training.ipynb ← Model training + evaluation
│
├── src/
│   ├── __init__.py
│   ├── data_loader.py          ← Download + load UCI dataset
│   ├── preprocessing.py        ← Cleaning + feature engineering
│   ├── train.py                ← All models + GridSearchCV pipelines
│   ├── evaluate.py             ← Metrics + plots
│   ├── predict.py              ← Single student prediction
│   └── visualize.py            ← EDA visualizations
│
├── models/                     ← Saved .pkl model files
├── outputs/
│   ├── plots/                  ← All generated charts/graphs
│   └── reports/                ← Metrics CSV files
│
├── app/
│   └── streamlit_app.py        ← Streamlit web application
│
├── .github/
│   └── workflows/ci.yml        ← GitHub Actions CI pipeline
│
├── main.py                     ← Single entry point
├── requirements.txt
├── .gitignore
└── README.md

⚙️ Installation

Prerequisites

Python 3.10 or later
pip
Git

Step 1: Clone the repository

git clone https://github.com/YOUR_USERNAME/Student-Performance-Prediction.git
cd Student-Performance-Prediction

Step 2: Create a virtual environment (recommended)

# Windows
python -m venv venv
venv\Scripts\activate

# macOS/Linux
python3 -m venv venv
source venv/bin/activate

Step 3: Install dependencies

pip install -r requirements.txt

✅ requirements.txt lists only: pandas, numpy, scikit-learn, xgboost, matplotlib, seaborn, plotly, streamlit, joblib, jupyter (pip also pulls in their own dependencies). No GPU-conflicting packages.

🚀 How to Run

Option A: Full Pipeline (recommended first run)

python main.py

This will:

1. Download the UCI dataset automatically
2. Preprocess the data and engineer features
3. Generate all EDA plots
4. Train all 10 models (5 regression + 5 classification)
5. Evaluate and compare models
6. Save models to models/
7. Save plots to outputs/plots/
8. Run a sample prediction

Expected time: 5–15 minutes (depends on system)

Option B: EDA only

python main.py --eda-only

Option C: Predict only (after training)

python main.py --predict

Option D: Portuguese dataset

python main.py --subject por
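
Options B to D are all flags on main.py. A sketch of how such flags could be declared with argparse is shown below. The flag names come from the commands above, while the "mat" default and the help texts are assumptions, since only "por" appears in this README.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags used in this README."""
    parser = argparse.ArgumentParser(description="Student performance pipeline")
    parser.add_argument("--eda-only", action="store_true",
                        help="generate EDA plots and stop")
    parser.add_argument("--predict", action="store_true",
                        help="run a prediction with previously saved models")
    # "mat" as a choice and default is an assumption; only "por" is documented.
    parser.add_argument("--subject", choices=["mat", "por"], default="mat",
                        help="which subject dataset to use")
    return parser


args = build_parser().parse_args(["--subject", "por"])
print(args.eda_only, args.predict, args.subject)
```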

Option E: Streamlit web app

streamlit run app/streamlit_app.py

Then open: http://localhost:8501

Option F: Jupyter Notebooks

jupyter notebook notebooks/01_EDA.ipynb
jupyter notebook notebooks/02_Model_Training.ipynb

🤖 Models

Regression (predict final grade G3)

Model	Description
Linear Regression	Baseline
Random Forest Regressor	Ensemble, handles non-linearity
XGBoost Regressor	Gradient boosting, high performance
SVR	Kernel-based support vector regression
KNN Regressor	Instance-based

Classification (predict Pass/Fail)

Model	Description
Logistic Regression	Probabilistic, interpretable
Random Forest Classifier	Ensemble
XGBoost Classifier	Gradient boosting
SVC	Kernel SVM with probability estimates enabled
KNN Classifier	Instance-based

All models use:

✅ sklearn Pipeline (no data leakage)
✅ GridSearchCV (5-fold cross validation)
✅ ColumnTransformer (numeric scaling + categorical encoding)
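
The three pieces listed above fit together roughly as follows. This is a sketch on synthetic stand-in data, not the project's train.py: the column subset, the parameter grid, and the random seeds are all illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data, NOT the UCI dataset.
rng = np.random.default_rng(0)
n = 120
X = pd.DataFrame({
    "studytime": rng.integers(1, 5, n),
    "absences": rng.integers(0, 30, n),
    "school": rng.choice(["GP", "MS"], n),
})
noise = rng.normal(0, 1, n)
y = (8 + 2 * X["studytime"] - 0.2 * X["absences"] + noise).clip(0, 20)

# Scale numeric columns, one-hot encode categorical ones.
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["studytime", "absences"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["school"]),
])
pipeline = Pipeline([
    ("preprocess", preprocessor),
    ("model", RandomForestRegressor(random_state=0)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The whole pipeline is refit on each training fold, preprocessing included.
search = GridSearchCV(
    pipeline,
    param_grid={"model__n_estimators": [25, 50], "model__max_depth": [3, None]},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)

print(search.best_params_)
print("test R^2:", search.best_estimator_.score(X_test, y_test))
```

Because the preprocessor sits inside the pipeline, the `model__` prefix is how the grid reaches the estimator's parameters, and the saved best estimator can accept raw student records at prediction time.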

📊 Results

Results will appear in outputs/reports/ after running python main.py.

Example benchmark (Math dataset, 80/20 split):

Model	RMSE ↓	R² ↑
XGBoost Regressor	~1.8	~0.88
Random Forest	~2.0	~0.85

Model	Accuracy ↑	F1 ↑	AUC ↑
XGBoost Classifier	~0.91	~0.91	~0.96
Random Forest	~0.89	~0.89	~0.95

Actual numbers depend on random state and hyperparameter tuning results.
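
For reference, the regression columns above (plus MAE, which the architecture diagram also lists) can be computed as in the sketch below. The grades are made up for illustration and are not project results.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE, and R^2 for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
    }


# Made-up grades for illustration only.
actual = [10, 12, 14, 8]
predicted = [11, 12, 13, 10]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

RMSE and MAE are in grade points on the 0–20 scale, so lower is better. R² is unitless, and higher is better.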

🌐 Streamlit App

The Streamlit app has 4 pages:

Page	Description
🎯 Predict Performance	Input student data, get grade + pass/fail + gauge
📊 EDA Visualizations	View all generated charts
🏆 Model Comparison	Compare all model metrics interactively
ℹ️ About	Project info

streamlit run app/streamlit_app.py

📂 Dataset

UCI Student Performance Dataset

Source: UCI ML Repository
Students: 395 (Math), 649 (Portuguese)
Features: 30+ (demographics, academic, social, family background)
Target: G3 (final grade, 0–20)
License: CC BY 4.0, as listed on the UCI ML Repository (free to use with attribution)

The dataset is auto-downloaded when you run python main.py.
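
One detail that is easy to miss when loading the files by hand: the UCI CSV files are semicolon-separated rather than comma-separated. The sketch below parses a tiny made-up sample in that layout with the standard library. The rows are not real data, and `load_students` is a hypothetical helper, not the project's data_loader.py.

```python
import csv
import io

# Tiny made-up sample in the UCI layout (semicolon-separated, quoted strings).
SAMPLE = '''school;sex;age;absences;G1;G2;G3
"GP";"F";18;6;"5";"6";6
"MS";"M";17;0;"14";"15";15
'''

NUMERIC = ("age", "absences", "G1", "G2", "G3")


def load_students(fileobj):
    """Read semicolon-separated student records and cast numeric columns to int."""
    reader = csv.DictReader(fileobj, delimiter=";")
    rows = []
    for row in reader:
        for col in NUMERIC:
            row[col] = int(row[col])
        rows.append(row)
    return rows


students = load_students(io.StringIO(SAMPLE))
print(len(students), students[0]["G3"])
```

With pandas, the equivalent is passing `sep=";"` to `read_csv`.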

📸 Screenshots & Insights

🌐 Web Interface

The Streamlit dashboard provides an interactive way to predict student success and explore model metrics in real-time.

📊 Exploratory Data Analysis (EDA)

Key insights derived from the UCI dataset regarding student demographics and academic trends.

[Figures: Grade Distribution, Correlation Heatmap, Study Time vs Grade, Absences vs Grade]

🔍 Additional Academic Factors

[Figures: Demographic Analysis, Failures vs Grade, Parental Education Impact, Pass/Fail Ratio]

🤖 Model Performance & Evaluation

Comparing 10 models (5 regression, 5 classification) on RMSE, MAE, and R² for regression and on accuracy, F1, and AUC for classification, all tuned with 5-fold cross-validation.

📈 Regression Metrics (G3 Prediction)

[Figures: Model Comparison, Actual vs Predicted, Residual Analysis, Feature Importance]

🎯 Classification Metrics (Pass/Fail)

[Figures: Accuracy & F1 Comparison, Confusion Matrices, ROC Curves, Feature Importance]

🎤 Interview Prep

Q: Why did you use sklearn Pipelines?
A: Pipelines prevent data leakage. Inside GridSearchCV the preprocessor is refit on the training folds of each split, so neither the validation fold nor the held-out test set influences the fitted scaling or encoding.
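
To make the leakage point concrete, here is a small standard-library sketch of what a scaler learns when it is fit on training data only versus on training and test data together. The numbers are made up.

```python
import statistics


def fit_scaler(values):
    """Learn the mean and population standard deviation used for scaling."""
    return statistics.fmean(values), statistics.pstdev(values)


def transform(values, mean, std):
    """Standardize values with previously learned statistics."""
    return [(v - mean) / std for v in values]


train = [8.0, 10.0, 12.0]
test = [20.0]

# Correct: statistics come from the training data only.
mean, std = fit_scaler(train)
test_scaled = transform(test, mean, std)

# Leaky: the test value shifts the statistics the model is trained with.
leaky_mean, leaky_std = fit_scaler(train + test)

print(mean, leaky_mean)
```

A Pipeline performs the "correct" variant automatically on every cross-validation split, which is the practical reason for wrapping preprocessing and model together.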

Q: How did you handle categorical variables?
A: Used OneHotEncoder inside a ColumnTransformer pipeline, with handle_unknown='ignore' to handle unseen categories during inference.
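
A short demonstration of what handle_unknown='ignore' does in practice. The school codes GP and MS appear in the dataset, while "XX" is a made-up unseen value.

```python
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit([["GP"], ["MS"]])  # categories seen during training

known = encoder.transform([["MS"]]).toarray()
unseen = encoder.transform([["XX"]]).toarray()  # "XX" never appeared in training

print(known)   # one column set to 1
print(unseen)  # all zeros instead of an error
```

With the default handle_unknown='error', the second transform call would raise a ValueError instead.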

Q: What is your target variable and why G3 ≥ 10 for pass?
A: G3 is the final period grade (0–20). In the Portuguese grading system, 10 out of 20 is the passing mark, so G3 ≥ 10 is labelled a pass.

Q: Why both regression and classification?
A: Regression gives the exact predicted grade (useful for counselors), classification gives a binary risk flag (useful for early intervention systems).

Q: How did you tune hyperparameters?
A: GridSearchCV with 5-fold cross-validation. Scoring: RMSE for regression, weighted F1 for classification.
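
In scikit-learn terms, weighted F1 corresponds to the "f1_weighted" scorer, and RMSE is exposed as "neg_root_mean_squared_error" because GridSearchCV always maximizes its score. A sketch of the classification case on synthetic data follows. The grid values are illustrative assumptions, not the project's actual search space.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary data standing in for pass/fail labels.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

search = GridSearchCV(
    pipeline,
    param_grid={"model__C": [0.1, 1.0, 10.0]},  # illustrative grid
    cv=5,                                       # 5-fold cross-validation
    scoring="f1_weighted",
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```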

📄 License

MIT License — free to use for learning, portfolio, and research.

Built for Data Science / ML portfolio — UCI Student Performance Dataset
