Predict student Math Scores from PISA assessment data. The dataset contains behavioral, demographic, and academic features for 1.17M students.
38% of students have MathScore = 0. The next lowest scores are around 2.2, with very few observations in between. This bimodal distribution suggests the zeros represent a distinct category: most likely students with missing math assessments rather than students who scored poorly.
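A quick check of the target distribution makes the split visible; a minimal sketch, assuming the training set loads as a pandas DataFrame with a `MathScore` column (the file name is an assumption):

```python
import pandas as pd

train = pd.read_csv("train.csv")  # path is an assumption

zero_frac = (train["MathScore"] == 0).mean()                        # ~0.38
min_nonzero = train.loc[train["MathScore"] > 0, "MathScore"].min()  # ~2.2

print(f"zeros: {zero_frac:.1%}, smallest non-zero score: {min_nonzero:.2f}")
```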
Solution: Two-stage model:
- Classifier predicts P(MathScore = 0)
- Regressor predicts score for non-zero cases
The final prediction combines the two: Final = Regressor × (1 - P(zero))
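A minimal sketch of this two-stage combination using LightGBM on synthetic stand-in data; every name and hyperparameter here is a placeholder, not the tuned pipeline:

```python
import numpy as np
from lightgbm import LGBMClassifier, LGBMRegressor

# Synthetic stand-in data: ~38% zeros, remaining scores above 2.2.
rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(1000, 5)), rng.normal(size=(200, 5))
y_train = np.where(rng.random(1000) < 0.38, 0.0, rng.uniform(2.2, 800, size=1000))

# Stage 1: classifier estimates P(MathScore = 0).
clf = LGBMClassifier(n_estimators=200)
clf.fit(X_train, (y_train == 0).astype(int))
p_zero = clf.predict_proba(X_test)[:, 1]

# Stage 2: regressor trained only on students who have a score.
mask = y_train > 0
reg = LGBMRegressor(n_estimators=200)
reg.fit(X_train[mask], y_train[mask])

# Final = Regressor * (1 - P(zero))
final = reg.predict(X_test) * (1 - p_zero)
```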
We identified 40 columns that are 100% NaN in test data:
- `math_q*_average_score`: per-question math averages
- `math_q*_total_timing`: per-question timing
These features exist in training but are completely unavailable at inference. Removing them dropped R² from 0.99 to 0.77; the 0.99 was leakage, and 0.77 reflects the true predictive signal.
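Detecting and dropping these columns is mechanical; a sketch, assuming raw train/test DataFrames (file names are assumptions):

```python
import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Features that are 100% NaN in test carry no signal at inference time;
# in this dataset they are the math_q*_average_score / math_q*_total_timing columns.
leaky_cols = [c for c in test.columns if test[c].isna().all()]
train = train.drop(columns=leaky_cols)
test = test.drop(columns=leaky_cols)
```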
| Feature Type | Correlation with MathScore |
|---|---|
| Science scores | 0.47 |
| Reading scores | 0.37 |
| Demographics (OECD, Year) | 0.15-0.25 |
| Questionnaire (ST*) | 0.05-0.15 |
Key insight: Science and Reading scores are the strongest predictors, but their correlation with MathScore is only moderate (~0.4), which caps regression performance on non-zero samples at roughly R² ≈ 0.65.
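The table above can be reproduced with a simple correlation scan; a sketch continuing from the loading snippet earlier, restricted to students with non-zero scores (column names assumed):

```python
# Correlation of every numeric feature with MathScore among non-zero scores.
nonzero = train[train["MathScore"] > 0]
corr = (
    nonzero.select_dtypes(include="number")
    .corrwith(nonzero["MathScore"])
    .drop("MathScore")
    .sort_values(key=abs, ascending=False)
)
print(corr.head(10))  # science/reading scores land near the top (~0.4)
```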
Gradient-boosted trees are a natural fit for this problem (illustrative config after the list):
- Handle missing values natively
- Capture non-linear interactions automatically
- GPU acceleration for 1M+ samples
- Manual feature engineering adds marginal value, since trees learn interactions implicitly
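An illustrative LightGBM configuration showing the points above; the parameters and variable names are assumptions, not the tuned values:

```python
import lightgbm as lgb

params = {
    "objective": "regression",
    "num_leaves": 255,       # illustrative, not the tuned value
    "learning_rate": 0.05,
    "device": "gpu",         # needs a GPU-enabled LightGBM build
}

# NaNs can stay in the feature matrix: LightGBM routes missing values
# down learned default branches instead of requiring imputation.
dtrain = lgb.Dataset(X_nonzero, label=y_nonzero)  # placeholders for the non-zero split
model = lgb.train(params, dtrain, num_boost_round=1000)
```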
- Classifier achieves AUC 0.997: separating zero from non-zero is easy, because behavioral patterns (timing, question attempts) strongly signal whether a student has a math score.
- Regressor captures the conditional distribution: for students with scores, science/reading performance drives predictions.
- Target normalization: scaling targets to mean = 0, std = 1 improves gradient-based optimization (see the sketch after this list).
- Ensemble diversity: CatBoost (ordered boosting), LightGBM (leaf-wise), XGBoost (regularized); blending reduces variance.
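A sketch of the normalize-then-blend step; the equal weights, default hyperparameters, and variable names (`X_nz`, `y_nz` for the non-zero training subset) are illustrative assumptions:

```python
import numpy as np
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

# Scale targets to mean 0, std 1; undo the scaling after prediction.
mu, sigma = y_nz.mean(), y_nz.std()
y_scaled = (y_nz - mu) / sigma

models = [
    CatBoostRegressor(verbose=0),  # ordered boosting
    LGBMRegressor(),               # leaf-wise growth
    XGBRegressor(),                # regularized boosting
]
preds = [m.fit(X_nz, y_scaled).predict(X_test) for m in models]

# Equal-weight blend, then invert the target scaling.
blend = np.mean(preds, axis=0) * sigma + mu
```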
| Approach | Why It Failed |
|---|---|
| Linear models | R² = 0.55, can't capture interactions |
| Polynomial features | Overfits, hurts generalization |
| Neural networks | Same R² as trees, slower to train |
| Feature engineering | Trees already learn aggregations |
| Deeper trees | Diminishing returns past depth=8 |
- CV R²: 0.7937
- Hackathon leaderboard: 0.78
- `final_submission.ipynb`: Complete pipeline
- `requirements.txt`: Dependencies
