
Anupam-Santra/Student-Performance-Prediction


🎓 Student Performance Prediction System


An end-to-end machine learning system that predicts student academic performance from the UCI Student Performance Dataset. It covers regression (grade prediction), classification (pass/fail), hyperparameter tuning, sklearn Pipelines, and a live Streamlit web app.




🔍 Project Overview

This project builds a complete ML pipeline to predict:

| Task | Target | Type |
|---|---|---|
| Grade Prediction | Final grade G3 (0–20) | Regression |
| Pass/Fail Prediction | Pass if G3 ≥ 10 | Classification |
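
As a minimal sketch of how the two targets in the table relate, both can be derived from the single G3 column. The helper name `make_targets` and the toy frame are illustrative assumptions, not code from this repository.

```python
import pandas as pd

PASS_MARK = 10  # threshold stated in the task table: pass if G3 >= 10


def make_targets(df: pd.DataFrame) -> tuple[pd.Series, pd.Series]:
    """Return (regression target, classification target) from a frame with a G3 column."""
    y_reg = df["G3"].astype(float)               # final grade on the 0-20 scale
    y_clf = (df["G3"] >= PASS_MARK).astype(int)  # 1 = pass, 0 = fail
    return y_reg, y_clf


# Made-up grades for illustration only (not rows from the real dataset).
toy = pd.DataFrame({"G3": [0, 9, 10, 15, 20]})
y_reg, y_clf = make_targets(toy)
print(y_clf.tolist())
```

Keeping the threshold in one named constant makes it easy to keep the regression and classification tasks consistent.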

Why this matters in the real world:

  • 🏫 Schools identify at-risk students early
  • 📚 EdTech companies personalize learning paths
  • 🎯 Academic counselors intervene before dropout

🛠️ Tech Stack

| Component | Technology |
|---|---|
| Language | Python 3.10+ |
| ML Framework | scikit-learn 1.4, XGBoost 2.0 |
| Data Processing | pandas, numpy |
| Visualization | matplotlib, seaborn, plotly |
| Web App | Streamlit |
| Pipelines | sklearn ColumnTransformer + GridSearchCV |
| CI/CD | GitHub Actions |
| Notebooks | Jupyter |

⚠️ GPU Note: This project is 100% CPU-based. No PyTorch, TensorFlow, or CUDA packages are used. Your GPU setup will not be affected.


🏗️ Project Architecture

Student Data (UCI)
        │
        ▼
  data_loader.py ──── Auto-download UCI Dataset
        │
        ▼
 preprocessing.py ─── Cleaning + Feature Engineering
        │              (avg_grade, grade_trend, absence_bucket, ...)
        ▼
   ColumnTransformer ── Numeric scaling + Categorical encoding
        │
        ├──────────────────────────────────┐
        ▼                                  ▼
  REGRESSION                         CLASSIFICATION
  (predict G3)                       (predict pass/fail)
        │                                  │
  ┌─────────────┐                   ┌─────────────┐
  │ Lin.Reg     │                   │ Log.Reg     │
  │ RF Regr     │  GridSearchCV     │ RF Classif  │  GridSearchCV
  │ XGB Regr    │  (5-fold CV)      │ XGB Classif │  (5-fold CV)
  │ SVR         │                   │ SVC         │
  │ KNN Regr    │                   │ KNN Classif │
  └─────────────┘                   └─────────────┘
        │                                  │
        ▼                                  ▼
  RMSE / MAE / R²              Acc / F1 / AUC / CM
        │                                  │
        └──────────┬───────────────────────┘
                   ▼
           Streamlit Web App
           (live predictions)
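
The feature-engineering step in the diagram names `avg_grade`, `grade_trend`, and `absence_bucket` but does not define them. The sketch below shows one plausible reading. The formulas, bucket boundaries, and the function name `add_features` are assumptions for illustration, and the actual `preprocessing.py` may compute them differently. The column names `G1`, `G2`, and `absences` come from the UCI dataset.

```python
import pandas as pd


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add three engineered columns; every definition here is an assumption."""
    out = df.copy()
    # Assumed: mean of the first- and second-period grades.
    out["avg_grade"] = out[["G1", "G2"]].mean(axis=1)
    # Assumed: change from the first to the second period.
    out["grade_trend"] = out["G2"] - out["G1"]
    # Assumed cut points; intervals are right-closed, so 0 absences falls in "none".
    out["absence_bucket"] = pd.cut(
        out["absences"],
        bins=[-1, 0, 5, 15, float("inf")],
        labels=["none", "low", "medium", "high"],
    )
    return out


# Made-up rows for illustration only.
toy = pd.DataFrame({"G1": [8, 14], "G2": [10, 12], "absences": [0, 20]})
feats = add_features(toy)
print(feats[["avg_grade", "grade_trend", "absence_bucket"]])
```

Note that features built from G1 and G2 carry most of the signal for G3, so any reported score depends heavily on whether they are included.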

📁 Folder Structure

Student-Performance-Prediction/
│
├── data/
│   ├── raw/                    ← Auto-downloaded UCI dataset
│   └── processed/              ← Cleaned + feature-engineered CSV
│
├── notebooks/
│   ├── 01_EDA.ipynb            ← Exploratory Data Analysis
│   └── 02_Model_Training.ipynb ← Model training + evaluation
│
├── src/
│   ├── __init__.py
│   ├── data_loader.py          ← Download + load UCI dataset
│   ├── preprocessing.py        ← Cleaning + feature engineering
│   ├── train.py                ← All models + GridSearchCV pipelines
│   ├── evaluate.py             ← Metrics + plots
│   ├── predict.py              ← Single student prediction
│   └── visualize.py            ← EDA visualizations
│
├── models/                     ← Saved .pkl model files
├── outputs/
│   ├── plots/                  ← All generated charts/graphs
│   └── reports/                ← Metrics CSV files
│
├── app/
│   └── streamlit_app.py        ← Streamlit web application
│
├── .github/
│   └── workflows/ci.yml        ← GitHub Actions CI pipeline
│
├── main.py                     ← Single entry point
├── requirements.txt
├── .gitignore
└── README.md

⚙️ Installation

Prerequisites

  • Python 3.10+ (for example 3.10 or 3.12)
  • pip
  • Git

Step 1: Clone the repository

git clone https://github.com/Anupam-Santra/Student-Performance-Prediction.git
cd Student-Performance-Prediction

Step 2: Create a virtual environment (recommended)

# Windows
python -m venv venv
venv\Scripts\activate

# macOS/Linux
python3 -m venv venv
source venv/bin/activate

Step 3: Install dependencies

pip install -r requirements.txt

✅ This installs ONLY: pandas, numpy, scikit-learn, xgboost, matplotlib, seaborn, plotly, streamlit, joblib, jupyter. No GPU-conflicting packages.


🚀 How to Run

Option A: Full Pipeline (recommended first run)

python main.py

This will:

  1. Download the UCI dataset automatically
  2. Preprocess + feature engineer
  3. Generate all EDA plots
  4. Train all 10 models (5 regression + 5 classification)
  5. Evaluate and compare models
  6. Save models to models/
  7. Save plots to outputs/plots/
  8. Run a sample prediction

Expected time: 5–15 minutes (depends on system)
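
Steps 6 and 8 above (saving models, then running a prediction) amount to a save/load round trip. A minimal sketch with joblib follows. The file name `linear_regression.pkl`, the temporary directory, and the toy training data are assumptions for illustration; the project itself writes to `models/`.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data following y = 2x + 1, used only to have something to fit.
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([1.0, 3.0, 5.0])
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "linear_regression.pkl"  # hypothetical file name
    joblib.dump(model, path)                    # persist the fitted estimator
    restored = joblib.load(path)                # reload it, as a predict step would

prediction = float(restored.predict(np.array([[3.0]]))[0])
print(round(prediction, 3))
```

Saving the whole fitted Pipeline rather than only the final estimator keeps preprocessing and model in a single artifact, so inference applies the same transformations as training.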


Option B: EDA only

python main.py --eda-only

Option C: Predict only (after training)

python main.py --predict

Option D: Portuguese dataset

python main.py --subject por
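
The flags used in Options B to D could be wired up with `argparse` roughly as follows. This is a hypothetical sketch, not the repository's `main.py`. The flag names are taken from the commands above, while the default subject value `"mat"` and the help texts are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags shown in this README (hypothetical layout)."""
    parser = argparse.ArgumentParser(description="Student performance pipeline")
    parser.add_argument("--eda-only", action="store_true",
                        help="generate EDA plots and stop")
    parser.add_argument("--predict", action="store_true",
                        help="run a sample prediction with saved models")
    # "por" appears in this README; "mat" as the default is an assumption.
    parser.add_argument("--subject", choices=["mat", "por"], default="mat",
                        help="which dataset variant to use")
    return parser


args = build_parser().parse_args(["--subject", "por"])
print(args.subject, args.eda_only, args.predict)
```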

Option E: Streamlit web app

streamlit run app/streamlit_app.py

Then open: http://localhost:8501

Option F: Jupyter Notebooks

jupyter notebook notebooks/01_EDA.ipynb
jupyter notebook notebooks/02_Model_Training.ipynb

🤖 Models

Regression (predict final grade G3)

| Model | Description |
|---|---|
| Linear Regression | Baseline |
| Random Forest Regressor | Ensemble, handles non-linearity |
| XGBoost Regressor | Gradient boosting, high performance |
| SVR | Kernel-based, robust |
| KNN Regressor | Instance-based |

Classification (predict Pass/Fail)

| Model | Description |
|---|---|
| Logistic Regression | Probabilistic, interpretable |
| Random Forest Classifier | Ensemble |
| XGBoost Classifier | State-of-the-art |
| SVC | Kernel SVM with probability |
| KNN Classifier | Instance-based |

All models use:

  • ✅ sklearn Pipeline (preprocessing is refit on the training folds of each CV split, which guards against data leakage)
  • ✅ GridSearchCV (5-fold cross-validation)
  • ✅ ColumnTransformer (numeric scaling + categorical encoding)
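
The three ingredients listed above fit together as sketched below. This is a minimal example on synthetic data, not the project's `train.py`. The column subset, the parameter grid, and the choice of LogisticRegression are assumptions made to keep it short.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data; column names mirror the UCI dataset, values are made up.
rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    "studytime": rng.integers(1, 5, n),
    "absences": rng.integers(0, 30, n),
    "school": rng.choice(["GP", "MS"], n),
})
y = (X["studytime"] * 3 - X["absences"] * 0.3 + rng.normal(0, 1, n) > 3).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale numeric columns and one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["studytime", "absences"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["school"]),
])
pipe = Pipeline([("prep", preprocess), ("model", LogisticRegression(max_iter=1000))])

# The whole pipeline is refit per fold, so the scaler never sees validation rows.
search = GridSearchCV(pipe, {"model__C": [0.1, 1.0, 10.0]}, cv=5, scoring="f1_weighted")
search.fit(X_train, y_train)
test_score = search.score(X_test, y_test)  # weighted F1 on the held-out split
print(search.best_params_, round(test_score, 3))
```

Because the preprocessor lives inside the pipeline, the held-out test split is only ever transformed, never used for fitting.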

📊 Results

Results will appear in outputs/reports/ after running python main.py.

Example benchmark (Math dataset, 80/20 split):

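Regression:

| Model | RMSE ↓ | R² ↑ |
|---|---|---|
| XGBoost Regressor | ~1.8 | ~0.88 |
| Random Forest | ~2.0 | ~0.85 |

Classification:

| Model | Accuracy ↑ | F1 ↑ | AUC ↑ |
|---|---|---|---|
| XGBoost Classifier | ~0.91 | ~0.91 | ~0.96 |
| Random Forest | ~0.89 | ~0.89 | ~0.95 |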

Actual numbers depend on random state and hyperparameter tuning results.
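
For reference, the metrics in the tables above can be computed with scikit-learn as sketched here. The helper names and toy inputs are illustrative assumptions and are unrelated to the benchmark numbers; the project's `evaluate.py` may be organized differently.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    mean_absolute_error,
    mean_squared_error,
    r2_score,
    roc_auc_score,
)


def regression_metrics(y_true, y_pred) -> dict:
    """RMSE, MAE and R^2 for predicted grades."""
    return {
        "rmse": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "mae": float(mean_absolute_error(y_true, y_pred)),
        "r2": float(r2_score(y_true, y_pred)),
    }


def classification_metrics(y_true, y_pred, y_score) -> dict:
    """Accuracy, weighted F1 and ROC AUC; y_score is the predicted pass probability."""
    return {
        "accuracy": float(accuracy_score(y_true, y_pred)),
        "f1": float(f1_score(y_true, y_pred, average="weighted")),
        "auc": float(roc_auc_score(y_true, y_score)),
    }


# Made-up values for illustration only.
reg = regression_metrics([10, 12, 8, 15], [11, 12, 7, 13])
clf = classification_metrics([0, 1, 1, 0], [0, 1, 0, 0], [0.2, 0.9, 0.4, 0.3])
print(reg)
print(clf)
```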


🌐 Streamlit App

The Streamlit app has 4 pages:

| Page | Description |
|---|---|
| 🎯 Predict Performance | Input student data, get grade + pass/fail + gauge |
| 📊 EDA Visualizations | View all generated charts |
| 🏆 Model Comparison | Compare all model metrics interactively |
| ℹ️ About | Project info |

streamlit run app/streamlit_app.py

📂 Dataset

UCI Student Performance Dataset

  • Source: UCI ML Repository
  • Students: 395 (Math), 649 (Portuguese)
  • Features: 30+ (demographics, academic, social, family background)
  • Target: G3 (final grade, 0–20)
  • License: CC BY 4.0 as listed on the UCI ML Repository (free to use for research and education with attribution)

The dataset is auto-downloaded when you run python main.py.
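
If you load the raw files yourself, note that the UCI CSVs (`student-mat.csv`, `student-por.csv`) are semicolon-separated. The sketch below is a hedged example of reading one. The helper name `load_student_csv` and the in-memory sample rows are assumptions for illustration, and the repository's `data_loader.py` may work differently.

```python
import io

import pandas as pd

REQUIRED = {"G1", "G2", "G3"}  # period grades used throughout this README


def load_student_csv(source) -> pd.DataFrame:
    """Read a semicolon-separated student file and check the grade columns exist."""
    df = pd.read_csv(source, sep=";")
    missing = REQUIRED - set(df.columns)
    if missing:
        raise ValueError(f"missing expected columns: {sorted(missing)}")
    return df


# Two made-up rows in the same layout, standing in for a file under data/raw/.
sample = io.StringIO(
    'school;sex;age;G1;G2;G3\n'
    '"GP";"F";18;5;6;6\n'
    '"MS";"M";17;14;15;15\n'
)
df = load_student_csv(sample)
print(df.shape)
```

Reading with the default comma separator would collapse every row into a single column, which is a common first stumbling block with this dataset.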


📸 Screenshots & Insights

🌐 Web Interface

The Streamlit dashboard provides an interactive way to predict student success and explore model metrics in real-time.

[Screenshot: Web Dashboard]


📊 Exploratory Data Analysis (EDA)

Key insights derived from the UCI dataset regarding student demographics and academic trends.

[Figures: Grade Distribution, Correlation Heatmap, Study Time vs Grade, Absences vs Grade]

🔍 Additional Academic Factors

  • Demographic Analysis
  • Failures vs Grade
  • Parental Education Impact
  • Pass/Fail Ratio

🤖 Model Performance & Evaluation

Comparing 10 different models (5 Regression, 5 Classification) using advanced metrics and cross-validation.

📈 Regression Metrics (G3 Prediction)

[Figures: Model Comparison, Actual vs Predicted (scatter plot), Residual Analysis, Feature Importance (XGB Regressor)]

🎯 Classification Metrics (Pass/Fail)

[Figures: Accuracy & F1 Comparison, Confusion Matrices, ROC Curves, Feature Importance (XGB Classifier)]

🎀 Interview Prep

Q: Why did you use sklearn Pipelines?
A: Pipelines prevent data leakage: inside GridSearchCV the preprocessor is refit on the training folds of each split, so statistics from the validation fold (and from the held-out test set) never influence training.

Q: How did you handle categorical variables?
A: Used OneHotEncoder inside a ColumnTransformer pipeline, with handle_unknown='ignore' to handle unseen categories during inference.
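
A small illustration of what `handle_unknown='ignore'` does at inference time: a category not seen during fitting is encoded as an all-zero row instead of raising an error. The `Mjob` column is from the UCI dataset, while the unseen value `"astronaut"` is made up for the demonstration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"Mjob": ["teacher", "health", "other"]})
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# Learned categories are stored in sorted order: health, other, teacher.
seen = encoder.transform(pd.DataFrame({"Mjob": ["health"]})).toarray()
# "astronaut" was never seen during fit, so every indicator column is 0.
unseen = encoder.transform(pd.DataFrame({"Mjob": ["astronaut"]})).toarray()

print(seen.tolist(), unseen.tolist())
```

With the default `handle_unknown='error'`, the second `transform` call would raise a `ValueError` instead.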

Q: What is your target variable and why G3 ≥ 10 for pass?
A: G3 is the final period grade (0–20). In the Portuguese grading system, 10 out of 20 is the passing mark, so G3 ≥ 10 is labeled as a pass.

Q: Why both regression and classification?
A: Regression gives the exact predicted grade (useful for counselors), classification gives a binary risk flag (useful for early intervention systems).

Q: How did you tune hyperparameters?
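A: GridSearchCV with 5-fold cross-validation. Scoring: RMSE for regression (expressed in scikit-learn as neg_root_mean_squared_error, because GridSearchCV maximizes its score) and weighted F1 for classification (f1_weighted).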
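
For the regression side, a hedged sketch of tuning with an RMSE-based scorer is shown below. The data is synthetic, and the KNN model and its grid are assumptions chosen for brevity. scikit-learn reports the score as negative RMSE, so the sign is flipped to read it as an error.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 3))
y = 10 + 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=150)

pipe = Pipeline([("scale", StandardScaler()), ("model", KNeighborsRegressor())])
search = GridSearchCV(
    pipe,
    {"model__n_neighbors": [3, 5, 9]},
    cv=5,
    scoring="neg_root_mean_squared_error",  # higher (closer to 0) is better
)
search.fit(X, y)

best_rmse = -search.best_score_  # flip the sign to get a positive RMSE
print(search.best_params_, round(best_rmse, 3))
```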


📄 License

MIT License: free to use for learning, portfolio, and research.


Built for Data Science / ML portfolio, using the UCI Student Performance Dataset
