A complete, industry-grade Machine Learning system that predicts student academic performance using the UCI Student Performance Dataset. Includes regression (grade prediction), classification (pass/fail), hyperparameter tuning, sklearn Pipelines, and a live Streamlit web app.
- Project Overview
- Tech Stack
- Project Architecture
- Folder Structure
- Installation
- How to Run
- Models
- Results
- Streamlit App
- Dataset
- Screenshots
- Interview Prep
This project builds a complete ML pipeline to predict:
| Task | Target | Type |
|---|---|---|
| Grade Prediction | Final grade G3 (0–20) | Regression |
| Pass/Fail Prediction | Pass if G3 ≥ 10 | Classification |
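
As a minimal sketch of how the two targets in the table relate, both can be derived from the same `G3` column. The toy rows and the names `PASS_THRESHOLD`, `y_reg`, and `y_clf` are illustrative assumptions, not the project's actual code.

```python
import pandas as pd

# Toy rows standing in for the real dataset; only the G3 column matters here.
df = pd.DataFrame({"G3": [0, 9, 10, 15, 20]})

PASS_THRESHOLD = 10  # "Pass if G3 >= 10", as stated in the table

y_reg = df["G3"]                                  # regression target: the grade itself
y_clf = (df["G3"] >= PASS_THRESHOLD).astype(int)  # classification target: 1 = pass, 0 = fail

print(y_clf.tolist())
```

Deriving the binary label from the grade keeps the two tasks consistent: a student with a true grade of 10 or more is always labeled as passing.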
Why this matters in the real world:
- Schools identify at-risk students early
- EdTech companies personalize learning paths
- Academic counselors intervene before dropout
| Component | Technology |
|---|---|
| Language | Python 3.10+ |
| ML Framework | scikit-learn 1.4, XGBoost 2.0 |
| Data Processing | pandas, numpy |
| Visualization | matplotlib, seaborn, plotly |
| Web App | Streamlit |
| Pipelines | sklearn ColumnTransformer + GridSearchCV |
| CI/CD | GitHub Actions |
| Notebooks | Jupyter |

⚠️ GPU Note: This project is 100% CPU-based. No PyTorch, TensorFlow, or CUDA packages are used. Your GPU setup will not be affected.

    Student Data (UCI)
            │
            ▼
    data_loader.py ───── Auto-download UCI Dataset
            │
            ▼
    preprocessing.py ─── Cleaning + Feature Engineering
            │            (avg_grade, grade_trend, absence_bucket, ...)
            ▼
    ColumnTransformer ── Numeric scaling + Categorical encoding
            │
            ├──────────────────────────────────┐
            ▼                                  ▼
       REGRESSION                       CLASSIFICATION
       (predict G3)                     (predict pass/fail)
            │                                  │
     ┌─────────────┐                    ┌─────────────┐
     │ Lin.Reg     │                    │ Log.Reg     │
     │ RF Regr     │ GridSearchCV       │ RF Classif  │ GridSearchCV
     │ XGB Regr    │ (5-fold CV)        │ XGB Classif │ (5-fold CV)
     │ SVR         │                    │ SVC         │
     │ KNN Regr    │                    │ KNN Classif │
     └─────────────┘                    └─────────────┘
            │                                  │
            ▼                                  ▼
     RMSE / MAE / R²                  Acc / F1 / AUC / CM
            │                                  │
            └──────────┬───────────────────────┘
                       ▼
              Streamlit Web App
              (live predictions)
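
The feature names in the diagram (`avg_grade`, `grade_trend`, `absence_bucket`) are not defined in this README. The sketch below shows one plausible way to compute them. The formulas, bin edges, bucket labels, and the function name `add_features` are assumptions for illustration; `G1`, `G2`, and `absences` are columns of the UCI dataset.

```python
import pandas as pd


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with three engineered columns (assumed definitions)."""
    out = df.copy()
    # Assumed: mean of the two period grades.
    out["avg_grade"] = out[["G1", "G2"]].mean(axis=1)
    # Assumed: change from the first to the second period.
    out["grade_trend"] = out["G2"] - out["G1"]
    # Assumed bin edges and labels; intervals are closed on the right.
    out["absence_bucket"] = pd.cut(
        out["absences"],
        bins=[-1, 0, 5, 15, float("inf")],
        labels=["none", "low", "medium", "high"],
    )
    return out


sample = pd.DataFrame({"G1": [10, 14], "G2": [12, 11], "absences": [0, 7]})
features = add_features(sample)
print(features)
```

Note that features built from `G1` and `G2` make the prediction much easier, since period grades correlate strongly with the final grade.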

    Student-Performance-Prediction/
    │
    ├── data/
    │   ├── raw/                      ← Auto-downloaded UCI dataset
    │   └── processed/                ← Cleaned + feature-engineered CSV
    │
    ├── notebooks/
    │   ├── 01_EDA.ipynb              ← Exploratory Data Analysis
    │   └── 02_Model_Training.ipynb   ← Model training + evaluation
    │
    ├── src/
    │   ├── __init__.py
    │   ├── data_loader.py            ← Download + load UCI dataset
    │   ├── preprocessing.py          ← Cleaning + feature engineering
    │   ├── train.py                  ← All models + GridSearchCV pipelines
    │   ├── evaluate.py               ← Metrics + plots
    │   ├── predict.py                ← Single student prediction
    │   └── visualize.py              ← EDA visualizations
    │
    ├── models/                       ← Saved .pkl model files
    ├── outputs/
    │   ├── plots/                    ← All generated charts/graphs
    │   └── reports/                  ← Metrics CSV files
    │
    ├── app/
    │   └── streamlit_app.py          ← Streamlit web application
    │
    ├── .github/
    │   └── workflows/ci.yml          ← GitHub Actions CI pipeline
    │
    ├── main.py                       ← Single entry point
    ├── requirements.txt
    ├── .gitignore
    └── README.md

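
The tree mentions saved `.pkl` model files and a single-student prediction module. A rough sketch of that save, load, and predict round trip with `joblib` might look like the following. The toy training data, the file name `regressor.pkl`, and the two-feature model are hypothetical stand-ins; the real `predict.py` may differ.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny made-up training set; the real project trains on the UCI data.
train = pd.DataFrame({"G1": [8, 10, 12, 14, 16], "G2": [9, 10, 13, 14, 17]})
target = pd.Series([9, 10, 13, 15, 17], name="G3")

pipe = Pipeline([("scale", StandardScaler()), ("model", LinearRegression())])
pipe.fit(train, target)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "regressor.pkl"  # hypothetical file name
    joblib.dump(pipe, model_path)             # persist the whole fitted pipeline
    loaded = joblib.load(model_path)          # reload it as a fresh object

# A single student is passed as a one-row DataFrame with the training columns.
student = pd.DataFrame([{"G1": 11, "G2": 12}])
predicted_grade = float(loaded.predict(student)[0])
print(round(predicted_grade, 2))
```

Saving the full pipeline rather than only the estimator means the same scaling and encoding are applied at prediction time without extra code.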
- Python 3.10 or 3.12
- pip
- Git

Clone the repository and set up a virtual environment:

    git clone https://github.com/YOUR_USERNAME/Student-Performance-Prediction.git
    cd Student-Performance-Prediction

    # Windows
    python -m venv venv
    venv\Scripts\activate

    # macOS/Linux
    python3 -m venv venv
    source venv/bin/activate

    pip install -r requirements.txt

✅ This installs only: pandas, numpy, scikit-learn, xgboost, matplotlib, seaborn, plotly, streamlit, joblib, jupyter. No GPU-conflicting packages.

Run the full pipeline:

    python main.py

This will:
- Download the UCI dataset automatically
- Preprocess + feature engineer
- Generate all EDA plots
- Train all 10 models (5 regression + 5 classification)
- Evaluate and compare models
- Save models to `models/`
- Save plots to `outputs/plots/`
- Run a sample prediction

Expected time: 5–15 minutes (depends on system)

Other run modes:

    python main.py --eda-only
    python main.py --predict
    python main.py --subject por

Launch the web app:

    streamlit run app/streamlit_app.py

Then open: http://localhost:8501

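
The command-line flags `--eda-only`, `--predict`, and `--subject` could be wired up with `argparse` roughly as sketched here. This is a hypothetical layout, not the actual `main.py`: the help texts, the `mat` value, and the default subject are assumptions (only `por` appears in this README).

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI mirroring the flags used in the commands in this README."""
    parser = argparse.ArgumentParser(description="Student performance pipeline")
    parser.add_argument("--eda-only", action="store_true",
                        help="only generate the EDA plots (assumed meaning)")
    parser.add_argument("--predict", action="store_true",
                        help="only run a sample prediction (assumed meaning)")
    parser.add_argument("--subject", choices=["mat", "por"], default="mat",
                        help="which dataset to use; 'mat' and the default are assumptions")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["--subject", "por"])
print(args.subject, args.eda_only, args.predict)
```

`argparse` converts `--eda-only` into the attribute `eda_only`, so the entry point can branch on `args.eda_only` and `args.predict`.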
Open the notebooks:

    jupyter notebook notebooks/01_EDA.ipynb
    jupyter notebook notebooks/02_Model_Training.ipynb

Regression models:

| Model | Description |
|---|---|
| Linear Regression | Baseline |
| Random Forest Regressor | Ensemble, handles non-linearity |
| XGBoost Regressor | Gradient boosting, high performance |
| SVR | Kernel-based, robust |
| KNN Regressor | Instance-based |

Classification models:

| Model | Description |
|---|---|
| Logistic Regression | Probabilistic, interpretable |
| Random Forest Classifier | Ensemble |
| XGBoost Classifier | Gradient boosting, typically strong on tabular data |
| SVC | Kernel SVM with probability |
| KNN Classifier | Instance-based |

All models use:
- ✅ sklearn `Pipeline` (no data leakage)
- ✅ `GridSearchCV` (5-fold cross-validation)
- ✅ `ColumnTransformer` (numeric scaling + categorical encoding)

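
The three pieces listed above fit together as in the following minimal sketch. It uses synthetic data and Logistic Regression with a small, made-up parameter grid; the column subset, the grid values, and the synthetic target are assumptions, and the project's `train.py` may be organized differently.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the student table (column names follow the UCI dataset).
rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    "G1": rng.integers(0, 21, n),
    "G2": rng.integers(0, 21, n),
    "absences": rng.integers(0, 30, n),
    "school": rng.choice(["GP", "MS"], n),
    "sex": rng.choice(["F", "M"], n),
})
# Synthetic pass/fail label, only so the example has something to learn.
y = ((X["G1"] + X["G2"]) / 2 >= 10).astype(int)

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["G1", "G2", "absences"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["school", "sex"]),
])
pipe = Pipeline([("prep", preprocessor), ("clf", LogisticRegression(max_iter=1000))])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The whole pipeline is cross-validated, so scaler and encoder are refit per fold.
search = GridSearchCV(pipe, param_grid={"clf__C": [0.1, 1.0, 10.0]},
                      cv=5, scoring="f1_weighted")
search.fit(X_train, y_train)

test_f1 = search.score(X_test, y_test)  # uses the same scorer (weighted F1)
print(search.best_params_, round(test_f1, 3))
```

Because the preprocessor lives inside the pipeline, each cross-validation fold fits it on that fold's training portion only, which is what keeps validation data out of the scaling statistics.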

Results will appear in `outputs/reports/` after running `python main.py`.

Example benchmark (Math dataset, 80/20 split):

Regression:

| Model | RMSE ↓ | R² ↑ |
|---|---|---|
| XGBoost Regressor | ~1.8 | ~0.88 |
| Random Forest | ~2.0 | ~0.85 |

Classification:

| Model | Accuracy ↑ | F1 ↑ | AUC ↑ |
|---|---|---|---|
| XGBoost Classifier | ~0.91 | ~0.91 | ~0.96 |
| Random Forest | ~0.89 | ~0.89 | ~0.95 |

Actual numbers depend on random state and hyperparameter tuning results.

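
For reference, the metrics in these tables can be computed with `sklearn.metrics` as in the sketch below. The arrays are made-up toy values, not project results, and the 0.5 decision threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    mean_absolute_error,
    mean_squared_error,
    r2_score,
    roc_auc_score,
)

# Toy regression outputs (true vs. predicted final grades).
y_true_grade = np.array([10, 12, 8, 15, 11])
y_pred_grade = np.array([11, 12, 9, 13, 11])

rmse = float(np.sqrt(mean_squared_error(y_true_grade, y_pred_grade)))
mae = float(mean_absolute_error(y_true_grade, y_pred_grade))
r2 = float(r2_score(y_true_grade, y_pred_grade))

# Toy classification outputs: true pass/fail labels and predicted pass probabilities.
y_true_pass = (y_true_grade >= 10).astype(int)
y_prob_pass = np.array([0.45, 0.9, 0.3, 0.8, 0.6])
y_pred_pass = (y_prob_pass >= 0.5).astype(int)  # assumed 0.5 threshold

accuracy = float(accuracy_score(y_true_pass, y_pred_pass))
f1 = float(f1_score(y_true_pass, y_pred_pass, average="weighted"))
auc = float(roc_auc_score(y_true_pass, y_prob_pass))  # AUC needs scores, not hard labels

print(rmse, mae, r2, accuracy, f1, auc)
```

RMSE and MAE are in grade points (lower is better), while R², accuracy, F1, and AUC are better when higher.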
The Streamlit app has 4 pages:
| Page | Description |
|---|---|
| Predict Performance | Input student data, get grade + pass/fail + gauge |
| EDA Visualizations | View all generated charts |
| Model Comparison | Compare all model metrics interactively |
| About | Project info |

    streamlit run app/streamlit_app.py

UCI Student Performance Dataset
- Source: UCI ML Repository
- Students: 395 (Math), 649 (Portuguese)
- Features: 30+ (demographics, academic, social, family background)
- Target: G3 (final grade, 0–20)
- License: CC BY 4.0 according to the UCI ML Repository listing (free to use for research and education with attribution)
The dataset is auto-downloaded when you run python main.py.
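
If you want to inspect the raw files yourself, note that the UCI CSV files (`student-mat.csv`, `student-por.csv`) are semicolon-separated. The sketch below parses an in-memory sample in that format so it runs without a download; the three rows and the reduced column set are made up for illustration.

```python
import io

import pandas as pd

# Made-up sample in the same semicolon-separated layout as the UCI files.
raw = io.StringIO(
    "school;sex;age;absences;G1;G2;G3\n"
    "GP;F;18;6;5;6;6\n"
    "GP;M;17;4;5;5;6\n"
    "MS;F;15;10;7;8;10\n"
)

df = pd.read_csv(raw, sep=";")  # sep=";" is required; the default comma would yield one column
print(df.shape)
print(df.dtypes)
```

With a real file, the `io.StringIO` object would be replaced by the path to the downloaded CSV.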
The Streamlit dashboard provides an interactive way to predict student success and explore model metrics in real time.
Key insights derived from the UCI dataset regarding student demographics and academic trends.

EDA plots:
- Grade Distribution
- Correlation Heatmap
- Study Time vs Grade
- Absences vs Grade

Comparison of 10 models (5 regression, 5 classification) using RMSE, MAE, and R² for regression, accuracy, F1, and AUC for classification, and 5-fold cross-validation.

Regression plots:
- Model Comparison
- Actual vs Predicted
- Residual Analysis
- Feature Importance

Classification plots:
- Accuracy & F1 Comparison
- Confusion Matrices
- ROC Curves
- Feature Importance

Q: Why did you use sklearn Pipelines?
A: Pipelines prevent data leakage: inside GridSearchCV the preprocessor is refit on the training folds of each split, so neither the validation folds nor the held-out test set influence the fitted scaler or encoder.
Q: How did you handle categorical variables?
A: Used OneHotEncoder inside a ColumnTransformer pipeline, with handle_unknown='ignore' to handle unseen categories during inference.
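
A small sketch of what `handle_unknown='ignore'` does in practice: a category that was not present at fit time is encoded as an all-zero row instead of raising an error. `Mjob` is a column of the UCI dataset; the value `"astronaut"` is a made-up unseen category.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"Mjob": ["teacher", "health", "services"]})
new = pd.DataFrame({"Mjob": ["teacher", "astronaut"]})  # "astronaut" was never seen in training

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# Learned categories are sorted alphabetically: health, services, teacher.
encoded = encoder.transform(new).toarray()
print(encoder.categories_[0].tolist())
print(encoded.tolist())
```

With the default `handle_unknown='error'`, the same `transform` call would raise a `ValueError`, which would break inference on new student records.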
Q: What is your target variable and why G3 ≥ 10 for pass?
A: G3 is the final grade (0–20). A grade of 10 out of 20 is the passing threshold in the Portuguese grading system, which the dataset follows.
Q: Why both regression and classification?
A: Regression gives the exact predicted grade (useful for counselors), while classification gives a binary risk flag (useful for early intervention systems).
Q: How did you tune hyperparameters?
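A: GridSearchCV with 5-fold cross-validation. Scoring: RMSE for regression (scikit-learn exposes it as the negated `neg_root_mean_squared_error` scorer, because GridSearchCV maximizes its score) and weighted F1 (`f1_weighted`) for classification.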
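
To illustrate the regression scoring, here is a minimal sketch of a grid search that optimizes RMSE through scikit-learn's `neg_root_mean_squared_error` scorer. `Ridge`, the synthetic data, and the `alpha` grid are stand-ins chosen to keep the example fast; they are not the project's models or grids.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem with noise of standard deviation 0.5.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 10 + 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=100)

search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.1, 1.0, 10.0]},
    cv=5,
    scoring="neg_root_mean_squared_error",  # negated so that higher is better
)
search.fit(X, y)

best_rmse = -search.best_score_  # flip the sign to report a conventional RMSE
print(search.best_params_, round(best_rmse, 3))
```

The sign flip matters when writing metrics to a report: `best_score_` is negative, and forgetting to negate it yields a negative "RMSE".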
MIT License: free to use for learning, portfolio, and research.
Built for a Data Science / ML portfolio using the UCI Student Performance Dataset.
















