MusaIslamFahad/GoalIQ-2026

        ██████╗  ██████╗  █████╗ ██╗     ██╗ ██████╗     ██████╗  ██████╗ ██████╗  ██████╗
       ██╔════╝ ██╔═══██╗██╔══██╗██║     ██║██╔═══██╗    ╚════██╗██╔═══██╗╚════██╗██╔════╝
       ██║  ███╗██║   ██║███████║██║     ██║██║   ██║     █████╔╝██║   ██║ █████╔╝███████╗
       ██║   ██║██║   ██║██╔══██║██║     ██║██║▄▄ ██║    ██╔═══╝ ██║   ██║██╔═══╝ ██╔══██╗
       ╚██████╔╝╚██████╔╝██║  ██║███████╗██║╚██████╔╝    ███████╗╚██████╔╝███████╗╚██████╔╝
        ╚═════╝  ╚═════╝ ╚═╝  ╚═╝╚══════╝╚═╝ ╚══▀▀═╝    ╚══════╝ ╚═════╝ ╚══════╝ ╚═════╝

🏆 GoalIQ 2026 - FIFA World Cup ML Prediction System

Harness machine learning to predict FIFA World Cup 2026 match outcomes,
team performance, and tournament progression with full statistical transparency.


v2 · Upgraded - 4 bugs fixed · 44 features (21 engineered) · Stacking Ensemble · Threshold optimization · 5,000-run Monte Carlo simulation



📌 Project Overview

GoalIQ 2026 is a full-pipeline machine learning system for predicting outcomes in the FIFA World Cup 2026 - the first edition with 48 teams. It covers the complete data science lifecycle from raw CSV to Monte Carlo tournament simulation.

| Stage | Description |
|---|---|
| 🔍 EDA | Feature correlation analysis, class distributions, confederation breakdowns |
| ⚙️ Feature Engineering | 21 domain-informed composite features derived from raw football statistics |
| 🤖 Model Training | 6 base models: Random Forest, Extra Trees, HistGradientBoosting, MLP, SVM, Logistic Regression |
| 🔗 Ensemble | StackingClassifier with 5-fold OOF meta-learning via Logistic Regression |
| 🎯 Threshold Tuning | Optimal decision threshold found by scanning the validation set |
| 🏟️ Simulation | 5,000-run Monte Carlo tournament bracket (48 teams, group + knockout stages) |
| 💾 Submission | Calibrated win probabilities for all test teams |

⚽ The FIFA World Cup 2026 is the largest in history - 48 teams, 3 host nations (USA · Canada · Mexico), and 104 matches. Predicting outcomes at this scale demands rigorous, bias-free ML engineering.


📓 Open the Notebook

| Viewer | Link |
|---|---|
| NBViewer (static render) | View on NBViewer |
| Google Colab (run in browser) | Open In Colab |

🖼️ Results & Visualizations

This section lists the key figures generated by the GoalIQ 2026 pipeline.


📊 Exploratory Data Analysis

  • Feature Correlations with Target
  • Feature Distributions - Winners vs Non-Winners
  • Win Rate by Confederation
  • Engineered Feature Profiles - Elite Teams

🤖 Model Performance & Evaluation

  • Model Performance Comparison (Accuracy & AUC)
  • ROC Curves - All Models
  • Confusion Matrix - Stacking Ensemble
  • Calibration Curves
  • Threshold Optimization Curve

🔍 Feature Importance & Explainability

  • MDI Feature Importance - Random Forest
  • Permutation Importance (Model-Agnostic)

🏟️ Tournament Simulation

  • Predicted Champions - 5,000 Monte Carlo Simulations
  • Top-20 Contenders by Confederation

📊 Dataset

Source: FIFA World Cup 2026 Prediction System - Kaggle

| File | Rows | Columns | Description |
|---|---|---|---|
| train.csv | 1,000 | 25 | Historical team-match data with winner label |
| test.csv | 250 | 24 | Teams to predict win probabilities for |
| submission.csv | 250 | 2 | Template: id, winner_probability |

| Feature | Type | Description |
|---|---|---|
| fifa_rank | Integer | FIFA world ranking (lower = better) |
| fifa_points | Float | FIFA ranking points |
| goals_scored_avg | Float | Average goals scored per match |
| goals_conceded_avg | Float | Average goals conceded per match |
| win_rate_last_year | Float | Win percentage over last 12 months |
| avg_player_rating | Float | Average squad player rating |
| market_value_million_eur | Float | Total squad market value (€M) |
| recent_form_score | Float | Form score over last 10 matches |
| possession_avg | Float | Average ball possession (%) |
| passing_accuracy | Float | Average passing accuracy (%) |
| shots_per_game | Float | Average shots per match |
| shots_on_target_ratio | Float | Ratio of shots on target |
| clean_sheets_last_10 | Integer | Clean sheets in last 10 games |
| star_players_count | Integer | Number of elite-tier players |
| host_advantage | Binary | 1 if playing in home region |
| confederation | Categorical | UEFA / CONMEBOL / CAF / CONCACAF / AFC / OFC |
| winner | Binary | Target - 1 = match winner, 0 = not |

Class balance: 527 non-winners (52.7%) · 473 winners (47.3%) - near-balanced binary classification.


🐛 Bug Audit (v1 → v2)

Four critical bugs were found in the original notebook that suppressed accuracy and corrupted training:

╔═══════════════════════════════════════════════════════════════════════════╗
║                          BUG AUDIT REPORT                                 ║
╠════╦══════════════════════════════╦═══════════════════╦═══════════════════╣
║  # ║ Bug                          ║ Impact            ║ Fix               ║
╠════╬══════════════════════════════╬═══════════════════╬═══════════════════╣
║  1 ║ strength_index / squad_quality║ DATA LEAKAGE -   ║ Compute .max()    ║
║    ║ normalised with .max() on    ║ test stats bleed  ║ on train only,    ║
║    ║ combined train+test pool     ║ into training     ║ apply to test     ║
╠════╬══════════════════════════════╬═══════════════════╬═══════════════════╣
║  2 ║ confederation dropped via    ║ Loses a useful    ║ Label-encode      ║
║    ║ cat_drop instead of encoded  ║ categorical       ║ (UEFA=0 … OFC=5)  ║
║    ║                              ║ signal            ║                   ║
╠════╬══════════════════════════════╬═══════════════════╬═══════════════════╣
║  3 ║ fifa_rank used as raw integer║ Inverted signal - ║ Add rank_inv =    ║
║    ║ (lower rank = better team)   ║ model sees        ║ 1 / fifa_rank     ║
║    ║                              ║ worse = higher    ║                   ║
╠════╬══════════════════════════════╬═══════════════════╬═══════════════════╣
║  4 ║ '\n' in set_xticklabels was  ║ SyntaxError at    ║ Use escape        ║
║    ║ a literal line break         ║ parse time        ║ sequence '\\n'    ║
╚════╩══════════════════════════════╩═══════════════════╩═══════════════════╝

⚙️ Feature Engineering

v2 expands from 31 → 44 features with 21 engineered features across 7 football-domain categories:

📋 All 21 engineered features

| Feature | Formula | Domain |
|---|---|---|
| conf_enc | Label-encoded confederation | Context |
| rank_inv | 1 / fifa_rank | Ranking |
| win_ratio_10 | wins / (wins+losses+draws) last 10 | Form |
| loss_ratio_10 | losses / total last 10 | Form |
| goal_diff | goals_scored_avg − goals_conceded_avg | Attack/Defence |
| goal_ratio | goals_scored / goals_conceded | Attack/Defence |
| shots_on_target_abs | shots_per_game × shots_on_target_ratio | Attack |
| goals_per_sot | goals_scored / shots_on_target_abs | Conversion |
| star_density | star_players_count / 11 | Squad |
| value_per_cap | market_value / experience_avg_caps | Squad |
| form_x_winrate | recent_form_score × win_rate_last_year | Form |
| form_x_rating | recent_form_score × avg_player_rating | Form |
| possession_x_passing | possession_avg × passing_accuracy | Style |
| attack_index | goals × shots_on_target_ratio × win_rate | Attack |
| defence_index | clean_sheets / goals_conceded | Defence |
| rank_x_form | rank_inv × recent_form_score | Interaction |
| rank_x_rating | rank_inv × avg_player_rating | Interaction |
| value_x_rating | log1p(market_value) × avg_player_rating | Interaction |
| rating_z | Z-score of avg_player_rating (train stats only) | Normalised |
| value_z | Z-score of market_value (train stats only) | Normalised |
| strength_index | Weighted composite of points + form + rating | Overall |
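A few of the row-wise formulas above can be sketched with pandas. This is a minimal illustration, not the notebook's code: the helper name `add_basic_features` is hypothetical, the encoding order between UEFA=0 and OFC=5 is assumed to follow the order listed in the Dataset section, and the zero-division guard is an assumption.

```python
import numpy as np
import pandas as pd

# Assumed encoding order: the README only states UEFA=0 ... OFC=5.
CONF_ORDER = ["UEFA", "CONMEBOL", "CAF", "CONCACAF", "AFC", "OFC"]


def add_basic_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add some row-wise engineered features (no fitted statistics needed)."""
    out = df.copy()
    out["conf_enc"] = out["confederation"].map({c: i for i, c in enumerate(CONF_ORDER)})
    out["rank_inv"] = 1.0 / out["fifa_rank"]
    out["goal_diff"] = out["goals_scored_avg"] - out["goals_conceded_avg"]
    out["shots_on_target_abs"] = out["shots_per_game"] * out["shots_on_target_ratio"]
    # Assumption: treat zero shots on target as missing to avoid division by zero.
    out["goals_per_sot"] = out["goals_scored_avg"] / out["shots_on_target_abs"].replace(0, np.nan)
    out["star_density"] = out["star_players_count"] / 11
    return out


# Two made-up rows, for illustration only.
sample = pd.DataFrame({
    "confederation": ["UEFA", "OFC"],
    "fifa_rank": [2, 100],
    "goals_scored_avg": [2.0, 1.0],
    "goals_conceded_avg": [0.5, 1.5],
    "shots_per_game": [15.0, 8.0],
    "shots_on_target_ratio": [0.4, 0.25],
    "star_players_count": [5, 0],
})
features = add_basic_features(sample)
print(features[["conf_enc", "rank_inv", "goal_diff", "shots_on_target_abs"]])
```

Because these features depend only on values within a single row, they can be computed on train and test independently without any leakage risk.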

Key design principle - zero leakage:

# ✅ CORRECT (v2): stats fitted on train only, applied to test
if stats is None:                           # called on train
    stats = {
        'max_pts' : df['fifa_points'].max(),
        'max_val' : df['market_value_million_eur'].max(),
        'max_exp' : df['experience_avg_caps'].max(),
    }
d['strength_index'] = df['fifa_points'] / stats['max_pts'] * 40 + ...

# ❌ WRONG (v1): combined train+test before engineering
all_data = pd.concat([train, test])        # leakage!
all_eng  = engineer_features(all_data)     # test .max() pollutes train
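The excerpt above elides most of the function, so here is a small self-contained sketch of the same fit-on-train, apply-to-both pattern. The names `fit_stats`, `apply_stats`, and `points_norm` are hypothetical and the numbers are made up; only the pattern is taken from the README.

```python
import pandas as pd


def fit_stats(train: pd.DataFrame) -> dict:
    """Compute normalisation statistics from the training frame only."""
    return {
        "max_pts": train["fifa_points"].max(),
        "rating_mean": train["avg_player_rating"].mean(),
        "rating_std": train["avg_player_rating"].std(),
    }


def apply_stats(df: pd.DataFrame, stats: dict) -> pd.DataFrame:
    """Apply previously fitted statistics to any frame (train or test)."""
    out = df.copy()
    out["points_norm"] = out["fifa_points"] / stats["max_pts"]
    out["rating_z"] = (out["avg_player_rating"] - stats["rating_mean"]) / stats["rating_std"]
    return out


train = pd.DataFrame({"fifa_points": [1000.0, 1500.0, 2000.0],
                      "avg_player_rating": [70.0, 75.0, 80.0]})
test = pd.DataFrame({"fifa_points": [2500.0], "avg_player_rating": [85.0]})

stats = fit_stats(train)               # statistics never see the test rows
train_eng = apply_stats(train, stats)
test_eng = apply_stats(test, stats)
print(test_eng)
```

Note that a test row may fall outside the training range (here `points_norm` exceeds 1); that is expected when the statistics are fitted on train only.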

🔬 ML Pipeline

Raw CSVs  (train.csv · test.csv · submission.csv)
        │
        ▼
① Load & Inspect
        │  → Shape, dtypes, null counts, class distribution
        │  → 49 unique teams · 6 confederations · 1,000 rows
        ▼
② Bug Fixes Applied
        │  → Fix leakage: train-only normalization stats
        │  → Encode confederation (not drop)
        │  → Invert fifa_rank → rank_inv = 1/fifa_rank
        ▼
③ Feature Engineering (31 → 44 features)
        │  → 21 composite features across 7 domains
        │  → All stats computed on train, applied to test
        ▼
④ Exploratory Data Analysis (EDA)
        │  → Feature correlations with target (max |r| = 0.376)
        │  → Univariate distributions - winners vs non-winners
        │  → Win rate by confederation
        │  → Elite team feature profiles (parallel coordinates)
        ▼
⑤ Train / Validation Split
        │  → 80/20 stratified split · random_state=42
        │  → StandardScaler for scale-sensitive models (LR, MLP, SVM)
        ▼
⑥ Multi-Model Training (6 base classifiers)
        │  → Random Forest · Extra Trees · HistGradientBoosting
        │  → MLP Neural Net · SVM (RBF) · Logistic Regression
        ▼
⑦ Stacking Ensemble
        │  → StackingClassifier: 5-fold OOF predict_proba
        │  → Meta-learner: Logistic Regression
        ▼
⑧ Threshold Optimization
        │  → Scan thresholds 0.35 → 0.70 on validation set
        │  → Select threshold maximising accuracy
        ▼
⑨ Full Evaluation
        │  → Confusion Matrix · Classification Report
        │  → ROC Curves (all models) · Calibration Curves
        │  → 5-Fold Stratified Cross-Validation
        ▼
⑩ Feature Importance & Explainability
        │  → MDI Importance (Random Forest)
        │  → Permutation Importance (model-agnostic, 20 repeats)
        ▼
⑪ Monte Carlo Tournament Simulation (5,000 runs)
        │  → 48 teams · 12 groups of 4 → Round of 32 → Final
        │  → Head-to-head win probability from stacking model
        ▼
⑫ Final Submission
           → Retrain stacking ensemble on full training set
           → Output: winner_probability for all 250 test teams

📈 Results

Cross-Validation Ranking (5-Fold Stratified)

| Rank | Model | CV Mean Accuracy | CV Std |
|---|---|---|---|
| 🥇 1 | Stacking Ensemble | 66.00% | ±2.35% |
| 2 | Extra Trees | 65.00% | ±3.35% |
| 3 | Random Forest | 64.80% | ±2.77% |
| 4 | HistGradientBoosting | 64.40% | ±2.98% |
| 5 | MLP Neural Net | 64.00% | ±3.10% |
| 6 | Logistic Regression | 63.50% | ±2.90% |
| 7 | SVM (RBF) | 63.00% | ±3.20% |
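For readers who want to reproduce this kind of table, the sketch below shows how a mean ± std figure can be obtained with 5-fold stratified cross-validation in scikit-learn. It runs on synthetic data from `make_classification` as a stand-in for the engineered training matrix, and the model settings are illustrative, so its numbers will not match the table.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the 1,000-row, 44-feature training matrix (assumption).
X, y = make_classification(n_samples=1000, n_features=44, n_informative=8,
                           random_state=42)

# Stratified folds keep the winner / non-winner ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = ExtraTreesClassifier(n_estimators=100, random_state=42)

scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.4f} ± {scores.std():.4f}")
```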

Validation Set Performance (Optimised Threshold)

| Rank | Model | Accuracy | AUC-ROC |
|---|---|---|---|
| 🥇 1 | Stacking Ensemble | 67.50% | 0.6846 |
| 2 | Extra Trees | 66.50% | 0.6861 |
| 3 | MLP Neural Net | 65.00% | 0.6863 |
| 4 | Random Forest | 65.00% | 0.6699 |
| 5 | HistGradientBoosting | 64.50% | 0.6720 |
| 6 | SVM (RBF) | 64.50% | 0.6521 |
| 7 | Logistic Regression | 63.50% | 0.6762 |

Classification Report - Stacking Ensemble

              precision    recall  f1-score   support

  Not Winner       0.71      0.64      0.67       105
      Winner       0.64      0.72      0.68        95

    accuracy                           0.68       200
   macro avg       0.68      0.68      0.67       200
weighted avg       0.68      0.68      0.67       200

Key Takeaways

  • Stacking ensemble leads on both validation accuracy (67.5%) and CV stability (lowest fold-to-fold std, ±2.35%) - combining 5 diverse base learners extracts signal that no single model captures alone
  • Threshold tuning matters - scanning 0.35 → 0.70 selects the cutoff that maximises validation accuracy instead of assuming 0.50
  • Extra Trees and MLP achieve the highest AUC (0.6861, 0.6863), slightly above the stacking ensemble (0.6846) - their probability rankings are well-ordered even if raw accuracy lags
  • Data leakage fix (Bug #1) was the single most impactful correction - improper normalization using test-set statistics gave false confidence in v1
  • Confederation encoding (Bug #2) adds a meaningful signal: CONMEBOL and UEFA teams have historically dominated
  • The stacking ensemble also ranks first in 5-fold CV (66.00% ± 2.35%), which suggests its lead reflects generalisation rather than overfitting to the single validation split

🤖 Models Trained

Six base classifiers with optimised hyperparameters:

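from sklearn.ensemble import (
    ExtraTreesClassifier,
    HistGradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Base classifiers keyed by display name; random_state is fixed for reproducibility.
base_models = {
    'Random Forest': RandomForestClassifier(
        n_estimators=800, max_depth=None,
        min_samples_leaf=1, max_features='sqrt', random_state=42),

    'Extra Trees': ExtraTreesClassifier(
        n_estimators=800, max_depth=None,
        min_samples_leaf=1, max_features='sqrt', random_state=42),

    'HistGradientBoosting': HistGradientBoostingClassifier(
        max_iter=500, learning_rate=0.03,
        max_depth=6, min_samples_leaf=10, random_state=42),

    'MLP Neural Net': MLPClassifier(
        hidden_layer_sizes=(256, 128, 64), activation='relu',
        max_iter=600, early_stopping=True, alpha=0.001, random_state=42),

    # probability=True enables predict_proba, which the stacking layer needs
    'SVM (RBF)': SVC(
        C=10, kernel='rbf', probability=True,
        gamma='scale', random_state=42),

    'Logistic Regression': LogisticRegression(
        C=1.0, max_iter=2000, random_state=42),
}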

Models in {MLP, SVM, Logistic Regression} receive StandardScaler-transformed input.
Tree-based models use raw feature values.


🔗 Stacking Ensemble & Threshold Optimization

Stacking Architecture

┌─────────────────────────────────────────────────────────────────┐
│                   LEVEL 0 - BASE LEARNERS                       │
│                                                                 │
│  ┌──────────────┐   ┌─────────────┐   ┌──────────────────────┐  │
│  │ Random Forest│   │ Extra Trees │   │ HistGradientBoosting │  │
│  └──────┬───────┘   └──────┬──────┘   └──────────┬───────────┘  │
│         │                  │                     │              │
│  ┌──────────────┐   ┌─────────────┐              │              │
│  │  MLP Neural  │   │  SVM (RBF)  │              │              │
│  │  Net+Scaler  │   │   +Scaler   │              │              │
│  └──────┬───────┘   └──────┬──────┘              │              │
│         └──────────────────┴─────────────────────┘              │
│                            │                                    │
│               5-fold OOF predict_proba                          │
└────────────────────────────┼────────────────────────────────────┘
                             │ meta-features
┌────────────────────────────▼────────────────────────────────────┐
│                  LEVEL 1 - META LEARNER                         │
│               Logistic Regression  (C=1.0)                      │
│          Learns optimal blending weights from OOF preds         │
└─────────────────────────────────────────────────────────────────┘

Threshold Optimization

Instead of defaulting to 0.50, the decision threshold is scanned from 0.35 → 0.70 in steps of 0.01. The threshold maximising validation accuracy is selected and applied at inference time.
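The scan described above can be sketched in a few lines of NumPy. The function name `best_threshold` and the toy labels and probabilities are hypothetical; ties are resolved in favour of the lowest threshold, which is an assumption the README does not specify.

```python
import numpy as np


def best_threshold(y_true, proba, lo=0.35, hi=0.70, step=0.01):
    """Return the (threshold, accuracy) pair that maximises accuracy on the grid."""
    thresholds = np.arange(lo, hi + step / 2, step)  # include the upper bound
    accuracies = np.array(
        [np.mean((proba >= t).astype(int) == y_true) for t in thresholds]
    )
    best = int(np.argmax(accuracies))  # first maximum wins on ties
    return float(thresholds[best]), float(accuracies[best])


# Toy validation labels and predicted winner probabilities (made up).
y_val = np.array([0, 0, 0, 1, 1, 1])
proba = np.array([0.20, 0.405, 0.555, 0.455, 0.605, 0.90])

threshold, accuracy = best_threshold(y_val, proba)
print(f"best threshold = {threshold:.2f}, accuracy = {accuracy:.3f}")
```

At inference time the selected value replaces the default cutoff: `(test_proba >= threshold).astype(int)`.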

Threshold Curve


🔍 Feature Importance & Explainability

MDI Feature Importance (Random Forest)

Top 20 features ranked by Mean Decrease in Impurity.
🟢 Green = new v2 engineered feature · 🔵 Blue = original raw feature

Feature Importance

New engineered features (form_x_rating, rank_x_rating, value_x_rating, attack_index) appear in the top 10 - validating the feature engineering effort.


Permutation Importance (Model-Agnostic)

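Features ranked by mean AUC decrease when randomly shuffled (20 repeats).
Error bars show the spread across repeats. Unlike MDI, permutation importance is not biased toward high-cardinality features, although strongly correlated features can still share, and so dilute, each other's importance.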
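A minimal sketch of this measurement using `sklearn.inspection.permutation_importance` with AUC scoring and 20 repeats follows. The data is synthetic and the feature names are placeholders, so the ranking it prints says nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           n_redundant=0, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature 20 times on held-out data and record the AUC drop.
result = permutation_importance(
    model, X_val, y_val, scoring="roc_auc", n_repeats=20, random_state=42)

ranking = result.importances_mean.argsort()[::-1]  # most important first
for idx in ranking:
    print(f"feature_{idx}: {result.importances_mean[idx]:.4f} "
          f"± {result.importances_std[idx]:.4f}")
```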

Permutation Importance


🏟️ Monte Carlo Tournament Simulation

The full FIFA World Cup 2026 bracket is simulated 5,000 times using model-derived win probabilities:

Phase 1 - Group Stage
  12 groups × 4 teams → full round-robin within each group
  Top 2 teams per group advance → 24 qualifiers

Phase 2 - Knockout Rounds
  Round of 32 → Round of 16 → Quarter-Finals → Semi-Finals → Final

Match win probability:
  P(team_A wins) = win_prob_A / (win_prob_A + win_prob_B)
  Head-to-head normalisation from stacking ensemble output
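The head-to-head formula and the repeated-simulation idea can be sketched with the standard library alone. This simplified example covers only a single-elimination bracket with four hypothetical teams and made-up win probabilities; the group stage and the real 48-team draw are omitted, and all function names are illustrative.

```python
import random
from collections import Counter


def p_a_beats_b(win_prob_a: float, win_prob_b: float) -> float:
    """Head-to-head normalisation of two model win probabilities."""
    return win_prob_a / (win_prob_a + win_prob_b)


def play_knockout(teams, win_prob, rng):
    """Play one single-elimination bracket; len(teams) must be a power of two."""
    alive = list(teams)
    while len(alive) > 1:
        next_round = []
        # Pair neighbours: (0 vs 1), (2 vs 3), ...
        for a, b in zip(alive[0::2], alive[1::2]):
            a_wins = rng.random() < p_a_beats_b(win_prob[a], win_prob[b])
            next_round.append(a if a_wins else b)
        alive = next_round
    return alive[0]


def simulate(teams, win_prob, n_runs=5000, seed=42):
    """Return each team's share of titles over n_runs simulated brackets."""
    rng = random.Random(seed)
    titles = Counter(play_knockout(teams, win_prob, rng) for _ in range(n_runs))
    return {team: titles[team] / n_runs for team in teams}


# Hypothetical model outputs for four placeholder teams.
win_prob = {"A": 0.80, "B": 0.60, "C": 0.40, "D": 0.20}
champion_share = simulate(list(win_prob), win_prob)
for team, share in sorted(champion_share.items(), key=lambda kv: -kv[1]):
    print(f"{team}: {share:.1%}")
```

Fixing the seed makes the champion shares reproducible; increasing `n_runs` narrows the sampling noise around each share.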

Predicted Champions - 5,000 Simulations

Tournament Simulation

Top-20 Contenders by Confederation

Confederation Breakdown


📈 v1 vs v2 - Upgrade Summary

╔════════════════════════════════════════════════════════════════╗
║              GoalIQ 2026 - v1 vs v2 Comparison                 ║
╠══════════════════════════════╦══════════╦═════════╦════════════╣
║ Metric                       ║    v1    ║   v2    ║    Delta   ║
╠══════════════════════════════╬══════════╬═════════╬════════════╣
║ Validation Accuracy          ║  0.6400  ║ 0.6750  ║   +0.035   ║
║ Validation AUC-ROC           ║  0.6699  ║ 0.6846  ║   +0.015   ║
║ Feature Count                ║    31    ║   44    ║    +13     ║
║ Data Leakage                 ║   YES    ║   NO    ║   Fixed    ║
║ Confederation encoded        ║    NO    ║   YES   ║   Fixed    ║
║ FIFA Rank direction correct  ║    NO    ║   YES   ║   Fixed    ║
║ Ensemble Type                ║  Voting  ║Stacking ║  Upgraded  ║
║ Threshold Optimised          ║    NO    ║   YES   ║   Fixed    ║
║ Models                       ║    4     ║    6    ║    +2      ║
║ Syntax errors                ║    1     ║    0    ║   Fixed    ║
╚══════════════════════════════╩══════════╩═════════╩════════════╝

⚠️ Honest Accuracy Note

This section is intentionally included to explain the model's accuracy ceiling - a standard that separates honest ML work from inflated benchmarks.

Dataset Properties
──────────────────────────────────────────────────────────────────
  Max individual feature correlation with target  :  0.376
  Class balance                                   :  47.3% positive
  Training samples                                :  1,000
──────────────────────────────────────────────────────────────────
  → Inherent noise ceiling ≈ 0.68 – 0.72 accuracy

  Claiming 85–95% accuracy on this data would require at least ONE of:
    (a) Severe overfitting to the validation set
    (b) Target leakage (using future info at train time)
    (c) Evaluating on training data instead of held-out data
    (d) A fundamentally richer / larger dataset
──────────────────────────────────────────────────────────────────

GoalIQ 2026 achieves the best honest accuracy possible on this dataset and documents the ceiling transparently - the correct scientific approach.


✨ Highlights

  • Comprehensive Exploratory Data Analysis with correlation ceiling analysis
  • 4 real bugs identified and fixed from the original notebook including data leakage
  • 21 domain-informed engineered features across attack, defence, form, squad, and interaction categories
  • 6 base model benchmarking with a full stacking ensemble
  • Optimal decision threshold scanning rather than naive 0.50 cutoff
  • 5-fold stratified cross-validation on every model for unbiased generalization estimates
  • Permutation importance as a model-agnostic alternative to MDI
  • 5,000-run Monte Carlo bracket simulation of the full 48-team tournament
  • Honest accuracy reporting - dataset ceiling documented and explained

🛠️ Tech Stack

| Library | Purpose |
|---|---|
| pandas | Data loading, cleaning, manipulation |
| numpy | Numerical operations and array math |
| matplotlib / seaborn | All EDA and results visualisations |
| scikit-learn | Preprocessing, all models, StackingClassifier, GridSearchCV, evaluation |
| jupyter | Interactive development environment |

🚀 Getting Started

1. Clone the repository

git clone https://github.com/<your-username>/goaliq-2026.git
cd goaliq-2026

2. (Optional) Create a virtual environment

python -m venv venv
source venv/bin/activate        # macOS / Linux
venv\Scripts\activate           # Windows

3. Install dependencies

pip install -r requirements.txt

4. Download the dataset

kaggle datasets download -d rauffauzanrambe/fifa-world-cup-2026-prediction-system
unzip fifa-world-cup-2026-prediction-system.zip -d data/raw/

Or place files manually in data/raw/:

data/raw/
├── train (1).csv
├── test (2).csv
└── submission (17).csv

5. Launch the notebook

jupyter notebook GoalIQ_2026_v2.ipynb

If running on Kaggle, all dataset paths are pre-configured - no changes needed.


📂 Project Structure

goaliq-2026/
│
├── GoalIQ_2026_v2.ipynb              # Main notebook - full 12-step pipeline
├── README.md                          # This file
├── requirements.txt                   # Python dependencies
│
├── data/
│   └── raw/                           # Original CSVs (add via Kaggle API)
│       ├── train (1).csv
│       ├── test (2).csv
│       └── submission (17).csv
│
├── assets/                            # All figures referenced in this README
│   ├── banner.png                     # Header banner image
│   ├── fig1_correlations.png          # Feature correlation bar chart
│   ├── fig2_distributions.png         # Feature distribution grid (winners vs non)
│   ├── fig3_confederation.png         # Win rate by confederation
│   ├── fig4_profiles.png              # Elite team feature profiles
│   ├── fig4_threshold.png             # Threshold optimization curve
│   ├── fig5_model_comparison.png      # Model accuracy/AUC side-by-side
│   ├── fig6_roc.png                   # ROC curves - all 7 models
│   ├── fig7_confusion.png             # Confusion matrix - stacking ensemble
│   ├── fig8_calibration.png           # Calibration curves
│   ├── fig9_importance.png            # MDI feature importance (top 20)
│   ├── fig10_perm_importance.png      # Permutation importance ± std
│   ├── fig11_simulation.png           # Monte Carlo champion probability chart
│   └── fig12_confederation_pie.png    # Top-20 confederation breakdown
│
└── outputs/
    └── submission_goaliq_v2.csv       # Final predicted win probabilities

To populate assets/: Run the notebook end-to-end. All figures are saved automatically
to /tmp/fig*.png during execution. Copy them to assets/ before pushing to GitHub.


⚙️ Requirements

pandas>=1.5.0
numpy>=1.23.0
scikit-learn>=1.3.0
matplotlib>=3.6.0
seaborn>=0.12.0
jupyter>=1.0.0

📚 What You'll Learn

This notebook is a strong portfolio reference for:

  • Data leakage detection - identifying and fixing train/test contamination in normalization
  • Categorical encoding strategy - when to encode vs drop features
  • Domain-driven feature engineering - building football-specific metrics from raw stats
  • Multi-model benchmarking - comparing 6 classifiers on the same train/val split fairly
  • Stacking ensembles - OOF meta-feature generation with StackingClassifier
  • Threshold optimization - scanning decision boundaries instead of defaulting to 0.50
  • Calibration curves - understanding whether predicted probabilities are trustworthy
  • Permutation importance - model-agnostic feature ranking as an alternative to MDI
  • Monte Carlo simulation - probabilistic bracket simulation with 5,000 iterations
  • Honest benchmark reporting - documenting accuracy ceilings and avoiding inflated claims

🔮 Future Enhancements

  • 🌐 Streamlit web app - let users simulate their own WC 2026 bracket with live predictions
  • 📊 SHAP explainability - per-prediction feature attribution for individual match outcomes
  • 🔁 Repeated stratified K-Fold - tighter confidence intervals on CV estimates
  • ⚽ Head-to-head historical data - enrich features with direct matchup records
  • 📈 Larger dataset - incorporate more historical World Cup and qualifying match data
  • XGBoost / LightGBM - add gradient boosting libraries once available in environment
  • 📱 REST API - Flask/FastAPI endpoint for real-time match prediction integration
  • 🗓️ Live updates - re-train as WC 2026 qualifying results come in

🤝 Contributing

Contributions are welcome!

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/your-feature)
  3. Commit your changes (git commit -m 'Add your feature')
  4. Push to the branch (git push origin feature/your-feature)
  5. Open a Pull Request

🙏 Acknowledgements


👨‍💻 Author

Your Name


Made with ⚽ + 🤖 for the beautiful game

GoalIQ 2026 - Where football passion meets data science


⭐ If this project helped your learning or research, a star would mean a lot. Thank you!
