
HiParis 2025 - PISA Math Score Prediction

Problem Statement

Predict student Math Scores from PISA assessment data. The dataset contains behavioral, demographic, and academic features for 1.17M students.

Core Analytical Insight

The Zero-Score Problem

38% of students have MathScore = 0. The next-lowest scores cluster around 2.2, with very few observations in between. This bimodal distribution suggests the zeros represent a distinct category, most likely students with missing math assessments rather than students who genuinely scored poorly.
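The check behind this insight is a quick look at the zero fraction and the smallest non-zero score. A minimal sketch on synthetic data (the real column is assumed to be named `MathScore`; the distribution here is invented to mimic the shape described above):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the PISA training set: ~38% exact zeros,
# remaining scores spread out starting near 2.2.
rng = np.random.default_rng(0)
scores = np.where(rng.random(10_000) < 0.38, 0.0, rng.uniform(2.2, 800.0, 10_000))
df = pd.DataFrame({"MathScore": scores})

zero_frac = (df["MathScore"] == 0).mean()
min_nonzero = df.loc[df["MathScore"] > 0, "MathScore"].min()
print(f"zero fraction: {zero_frac:.2f}, lowest non-zero score: {min_nonzero:.2f}")
```

A gap like this between 0 and the next observed value is what motivates treating the zeros as a separate class rather than as ordinary low scores.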

Solution: Two-stage model:

  1. Classifier predicts P(MathScore = 0)
  2. Regressor predicts score for non-zero cases

The combination: Final = Regressor × (1 - P(zero))
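The combination step is a simple expectation: the regressor's prediction weighted by the probability that the score is non-zero. A minimal sketch (array values are illustrative):

```python
import numpy as np

def combine(p_zero: np.ndarray, reg_pred: np.ndarray) -> np.ndarray:
    """Final = Regressor x (1 - P(zero)): expected score given both stages."""
    return reg_pred * (1.0 - p_zero)

# Example: a near-certain zero, a near-certain non-zero, and an uncertain case.
p_zero = np.array([0.99, 0.05, 0.50])
reg_pred = np.array([480.0, 520.0, 400.0])
final = combine(p_zero, reg_pred)
# final[0] is pulled close to 0, final[1] stays near the regressor's output.
```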

Data Leakage Discovery

We identified 40 columns that are 100% NaN in test data:

  • math_q*_average_score — per-question math averages
  • math_q*_total_timing — per-question timing

These features exist in training but are completely unavailable at inference. Removing them dropped R² from 0.99 to 0.77 — revealing the true predictive signal.
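Detecting and dropping such columns is mechanical: any column that is entirely NaN in the test split cannot carry signal at inference. A sketch on toy frames (the column name `math_q1_average_score` follows the pattern above; the frames themselves are invented):

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"math_q1_average_score": [0.5, 0.7],
                      "ScienceScore": [510.0, 470.0]})
test = pd.DataFrame({"math_q1_average_score": [np.nan, np.nan],
                     "ScienceScore": [495.0, 530.0]})

# Columns that are 100% NaN in test are unusable (and leaky if kept in training).
leaky = [c for c in test.columns if test[c].isna().all()]
train = train.drop(columns=leaky)
test = test.drop(columns=leaky)
```

Dropping them from *both* splits is what forces cross-validation scores to reflect the signal actually available at inference time.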

Feature Correlation Analysis

Feature Type                  Correlation with MathScore
Science scores                0.47
Reading scores                0.37
Demographics (OECD, Year)     0.15-0.25
Questionnaire (ST*)           0.05-0.15

Key insight: Science/Reading are the strongest predictors, but correlation is moderate (~0.4). This limits regression performance on non-zero samples to R² ≈ 0.65.
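Correlations like those in the table come straight from Pearson correlation against the target. A sketch on synthetic data constructed to have roughly the science-score correlation reported above (column names are assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
science = rng.normal(500.0, 100.0, 5_000)
# Build a synthetic math score with ~0.47 correlation to science by construction.
r = 0.47
math = r * science + rng.normal(0.0, 100.0 * np.sqrt(1 - r**2), 5_000)

df = pd.DataFrame({"ScienceScore": science, "MathScore": math})
corr = df["ScienceScore"].corr(df["MathScore"])
```

A correlation of ~0.4-0.5 explains only ~20% of target variance on its own, which is why even the best single features leave the non-zero regressor capped around R² ≈ 0.65.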

Why Tree Ensembles?

  • Handle missing values natively
  • Capture non-linear interactions automatically
  • GPU acceleration for 1M+ samples
  • Manual feature engineering adds marginal value — trees learn interactions implicitly

Model Architecture

Two-stage pipeline: a zero/non-zero classifier feeding a CatBoost/LightGBM/XGBoost regressor blend, combined as described above.

Why This Works

  1. Classifier achieves AUC 0.997: Separating zero from non-zero is easy — behavioral patterns (timing, question attempts) strongly signal whether a student has a math score.

  2. Regressor captures the conditional distribution: For students with scores, science/reading performance drives predictions.

  3. Target normalization: Scaling targets to mean=0, std=1 improves gradient-based optimization.

  4. Ensemble diversity: CatBoost (ordered boosting), LightGBM (leaf-wise), XGBoost (regularized) — blending reduces variance.
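Points 3 and 4 together can be sketched in a few lines: blend the three regressors' predictions in normalized target space, then undo the normalization. The fitted boosters are omitted here; the arrays and the training-target statistics are hypothetical placeholders:

```python
import numpy as np

# Hypothetical training-target statistics used for normalization (point 3).
y_mean, y_std = 480.0, 95.0

# Stand-ins for predictions from the three fitted regressors (point 4),
# all in normalized space: mean 0, std 1.
preds_norm = {
    "catboost": np.array([0.10, -0.50, 1.20]),
    "lightgbm": np.array([0.15, -0.45, 1.10]),
    "xgboost":  np.array([0.05, -0.55, 1.15]),
}

# Simple mean blend across models, then invert the target scaling.
blend_norm = np.mean(list(preds_norm.values()), axis=0)
blend = blend_norm * y_std + y_mean
```

An unweighted mean is the simplest blend; since the three boosters make different kinds of errors (ordered vs. leaf-wise vs. regularized growth), averaging them reduces variance without extra tuning.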

What Doesn't Work

Approach               Why It Failed
Linear models          R² = 0.55; can't capture interactions
Polynomial features    Overfits, hurts generalization
Neural networks        Same R² as trees, slower to train
Feature engineering    Trees already learn aggregations
Deeper trees           Diminishing returns past depth=8

Results

  • CV R²: 0.7937
  • Hackathon LB: 0.78

Files

  • final_submission.ipynb — Complete pipeline
  • requirements.txt — Dependencies

About

HiParis Hackathon 2025 Edition Project
