Predict student Math Scores from PISA assessment data. The dataset contains behavioral, demographic, and academic features for 1.17M students.
38% of students have MathScore = 0. The next lowest scores are around 2.2, with very few observations in between. This bimodal distribution suggests the zeros represent a distinct category: most likely students with missing math assessments rather than students who scored poorly.
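A quick check of the target distribution makes the split visible; a minimal sketch, assuming the training set loads as a pandas DataFrame with a `MathScore` column (the file name is an assumption):

```python
import pandas as pd

train = pd.read_csv("train.csv")  # path is an assumption

zero_frac = (train["MathScore"] == 0).mean()                        # ~0.38
min_nonzero = train.loc[train["MathScore"] > 0, "MathScore"].min()  # ~2.2

print(f"zeros: {zero_frac:.1%}, smallest non-zero score: {min_nonzero:.2f}")
```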
Solution: Two-stage model:
- Classifier predicts P(MathScore = 0)
- Regressor predicts score for non-zero cases
The final prediction combines the two: Final = Regressor × (1 - P(zero))
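A minimal sketch of this two-stage combination using LightGBM on synthetic stand-in data; every name and hyperparameter here is a placeholder, not the tuned pipeline:

```python
import numpy as np
from lightgbm import LGBMClassifier, LGBMRegressor

# Synthetic stand-in data: ~38% zeros, remaining scores above 2.2.
rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(1000, 5)), rng.normal(size=(200, 5))
y_train = np.where(rng.random(1000) < 0.38, 0.0, rng.uniform(2.2, 800, size=1000))

# Stage 1: classifier estimates P(MathScore = 0).
clf = LGBMClassifier(n_estimators=200)
clf.fit(X_train, (y_train == 0).astype(int))
p_zero = clf.predict_proba(X_test)[:, 1]

# Stage 2: regressor trained only on students who have a score.
mask = y_train > 0
reg = LGBMRegressor(n_estimators=200)
reg.fit(X_train[mask], y_train[mask])

# Final = Regressor * (1 - P(zero))
final = reg.predict(X_test) * (1 - p_zero)
```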
We identified 40 columns that are 100% NaN in test data:
- `math_q*_average_score`: per-question math averages
- `math_q*_total_timing`: per-question timing
These features exist in training but are completely unavailable at inference. Removing them dropped R² from 0.99 to 0.77; the 0.99 was leakage, and 0.77 reflects the true predictive signal.
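Detecting and dropping these columns is mechanical; a sketch, assuming raw train/test DataFrames (file names are assumptions):

```python
import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Features that are 100% NaN in test carry no signal at inference time;
# in this dataset they are the math_q*_average_score / math_q*_total_timing columns.
leaky_cols = [c for c in test.columns if test[c].isna().all()]
train = train.drop(columns=leaky_cols)
test = test.drop(columns=leaky_cols)
```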
| Feature Type | Correlation with MathScore |
|---|---|
| Science scores | 0.47 |
| Reading scores | 0.37 |
| Demographics (OECD, Year) | 0.15-0.25 |
| Questionnaire (ST*) | 0.05-0.15 |
Key insight: Science and Reading scores are the strongest predictors, but their correlation with MathScore is only moderate (~0.4), which caps regression performance on non-zero samples at roughly R² ≈ 0.65.
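The table above can be reproduced with a simple correlation scan; a sketch continuing from the loading snippet earlier, restricted to students with non-zero scores (column names assumed):

```python
# Correlation of every numeric feature with MathScore among non-zero scores.
nonzero = train[train["MathScore"] > 0]
corr = (
    nonzero.select_dtypes(include="number")
    .corrwith(nonzero["MathScore"])
    .drop("MathScore")
    .sort_values(key=abs, ascending=False)
)
print(corr.head(10))  # science/reading scores land near the top (~0.4)
```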
Gradient-boosted trees are a natural fit for this problem (illustrative config after the list):
- Handle missing values natively
- Capture non-linear interactions automatically
- GPU acceleration for 1M+ samples
- Manual feature engineering adds marginal value, since trees learn interactions implicitly
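An illustrative LightGBM configuration showing the points above; the parameters and variable names are assumptions, not the tuned values:

```python
import lightgbm as lgb

params = {
    "objective": "regression",
    "num_leaves": 255,       # illustrative, not the tuned value
    "learning_rate": 0.05,
    "device": "gpu",         # needs a GPU-enabled LightGBM build
}

# NaNs can stay in the feature matrix: LightGBM routes missing values
# down learned default branches instead of requiring imputation.
dtrain = lgb.Dataset(X_nonzero, label=y_nonzero)  # placeholders for the non-zero split
model = lgb.train(params, dtrain, num_boost_round=1000)
```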
- Classifier achieves AUC 0.997: separating zero from non-zero is easy, because behavioral patterns (timing, question attempts) strongly signal whether a student has a math score.
- Regressor captures the conditional distribution: for students with scores, science/reading performance drives predictions.
- Target normalization: scaling targets to mean = 0, std = 1 improves gradient-based optimization (see the sketch after this list).
- Ensemble diversity: CatBoost (ordered boosting), LightGBM (leaf-wise), XGBoost (regularized); blending reduces variance.
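A sketch of the normalize-then-blend step; the equal weights, default hyperparameters, and variable names (`X_nz`, `y_nz` for the non-zero training subset) are illustrative assumptions:

```python
import numpy as np
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

# Scale targets to mean 0, std 1; undo the scaling after prediction.
mu, sigma = y_nz.mean(), y_nz.std()
y_scaled = (y_nz - mu) / sigma

models = [
    CatBoostRegressor(verbose=0),  # ordered boosting
    LGBMRegressor(),               # leaf-wise growth
    XGBRegressor(),                # regularized boosting
]
preds = [m.fit(X_nz, y_scaled).predict(X_test) for m in models]

# Equal-weight blend, then invert the target scaling.
blend = np.mean(preds, axis=0) * sigma + mu
```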
| Approach | Why It Failed |
|---|---|
| Linear models | R² = 0.55, can't capture interactions |
| Polynomial features | Overfits, hurts generalization |
| Neural networks | Same R² as trees, slower to train |
| Feature engineering | Trees already learn aggregations |
| Deeper trees | Diminishing returns past depth=8 |
- CV R²: 0.7937
- Hackathon leaderboard: 0.78
- `final_submission.ipynb`: Complete pipeline
- `requirements.txt`: Dependencies
