GoalIQ 2026
Harness machine learning to predict FIFA World Cup 2026 match outcomes,
team performance, and tournament progression with full statistical transparency.
v2 · Upgraded - 4 bugs fixed · 44 features (21 engineered) · Stacking Ensemble · Threshold optimization · 5,000-run Monte Carlo simulation
- Project Overview
- Open the Notebook
- Results & Visualizations
- Dataset
- Bug Audit (v1 → v2)
- Feature Engineering
- ML Pipeline
- Results
- Models Trained
- Stacking Ensemble & Threshold Optimization
- Feature Importance & Explainability
- Monte Carlo Tournament Simulation
- v1 vs v2 - Upgrade Summary
- Honest Accuracy Note
- Highlights
- Tech Stack
- Getting Started
- Project Structure
- What You'll Learn
- Future Enhancements
- Contributing
- Acknowledgements
- Author
GoalIQ 2026 is a full-pipeline machine learning system for predicting outcomes in the FIFA World Cup 2026 - the first edition with 48 teams. It covers the complete data science lifecycle from raw CSV to Monte Carlo tournament simulation.
| Stage | Description |
|---|---|
| EDA | Feature correlation analysis, class distributions, confederation breakdowns |
| Feature Engineering | 21 domain-informed composite features derived from raw football statistics |
| Model Training | 6 base models: Random Forest, Extra Trees, HistGradientBoosting, MLP, SVM, Logistic Regression |
| Ensemble | StackingClassifier with 5-fold OOF meta-learning via Logistic Regression |
| Threshold Tuning | Optimal decision threshold found by scanning the validation set |
| Simulation | 5,000-run Monte Carlo tournament bracket (48 teams, group + knockout stages) |
| Submission | Calibrated win probabilities for all test teams |
⚽ The FIFA World Cup 2026 is the largest in history - 48 teams, 3 host nations (USA · Canada · Mexico), and 104 matches. Predicting outcomes at this scale demands rigorous, bias-free ML engineering.
| Viewer | Link |
|---|---|
| NBViewer (static render) | View on NBViewer |
| Google Colab (run in browser) |
This section showcases the key visual outputs generated during the GoalIQ 2026 pipeline.
- Figure: Feature Correlations with Target
- Figure: Feature Distributions - Winners vs Non-Winners
- Figure: Win Rate by Confederation
- Figure: Engineered Feature Profiles - Elite Teams
- Figure: Model Performance Comparison (Accuracy & AUC)
- Figure: ROC Curves - All Models
- Figure: Confusion Matrix - Stacking Ensemble
- Figure: Calibration Curves
- Figure: Threshold Optimization Curve
- Figure: MDI Feature Importance - Random Forest
- Figure: Permutation Importance (Model-Agnostic)
- Figure: Predicted Champions - 5,000 Monte Carlo Simulations
- Figure: Top-20 Contenders by Confederation
Source: FIFA World Cup 2026 Prediction System - Kaggle
| File | Rows | Columns | Description |
|---|---|---|---|
| train.csv | 1,000 | 25 | Historical team-match data with winner label |
| test.csv | 250 | 24 | Teams to predict win probabilities for |
| submission.csv | 250 | 2 | Template: id, winner_probability |
| Feature | Type | Description |
|---|---|---|
| fifa_rank | Integer | FIFA world ranking (lower = better) |
| fifa_points | Float | FIFA ranking points |
| goals_scored_avg | Float | Average goals scored per match |
| goals_conceded_avg | Float | Average goals conceded per match |
| win_rate_last_year | Float | Win percentage over last 12 months |
| avg_player_rating | Float | Average squad player rating |
| market_value_million_eur | Float | Total squad market value (€M) |
| recent_form_score | Float | Form score over last 10 matches |
| possession_avg | Float | Average ball possession (%) |
| passing_accuracy | Float | Average passing accuracy (%) |
| shots_per_game | Float | Average shots per match |
| shots_on_target_ratio | Float | Ratio of shots on target |
| clean_sheets_last_10 | Integer | Clean sheets in last 10 games |
| star_players_count | Integer | Number of elite-tier players |
| host_advantage | Binary | 1 if playing in home region |
| confederation | Categorical | UEFA / CONMEBOL / CAF / CONCACAF / AFC / OFC |
| winner | Binary | Target: 1 = match winner, 0 = not |
Class balance: 527 non-winners (52.7%) · 473 winners (47.3%) - near-balanced binary classification.
Four critical bugs were found in the original notebook that suppressed accuracy and corrupted training:
| # | Bug | Impact | Fix |
|---|---|---|---|
| 1 | strength_index / squad_quality normalised with .max() on combined train+test pool | Data leakage: test stats bleed into training | Compute .max() on train only, apply to test |
| 2 | confederation dropped via cat_drop instead of encoded | Loses a useful categorical signal | Label-encode (UEFA=0 … OFC=5) |
| 3 | fifa_rank used as raw integer (lower rank = better team) | Inverted signal: model sees worse = higher | Add rank_inv = 1 / fifa_rank |
| 4 | '\n' in set_xticklabels was a literal line break | SyntaxError at runtime | Use escape sequence '\\n' |
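The first three fixes can be illustrated on toy data. This is a minimal sketch, not the notebook's code: the two small frames, the `apply_fixes` helper, the `points_norm` column, and the intermediate order of the label map (taken from the order in which the dataset table lists the confederations) are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy frames; column names follow the dataset description.
train = pd.DataFrame({
    "fifa_rank": [1, 10, 50],
    "fifa_points": [1800.0, 1600.0, 1300.0],
    "confederation": ["UEFA", "CONMEBOL", "AFC"],
})
test = pd.DataFrame({
    "fifa_rank": [2, 80],
    "fifa_points": [1900.0, 1200.0],
    "confederation": ["CAF", "OFC"],
})

# Bug 2: encode instead of dropping. Only UEFA=0 and OFC=5 are stated
# in the audit; the order in between is an assumption.
CONF_MAP = {"UEFA": 0, "CONMEBOL": 1, "CAF": 2, "CONCACAF": 3, "AFC": 4, "OFC": 5}


def apply_fixes(df, max_pts):
    """Apply the three data fixes using a normaliser fitted elsewhere."""
    out = df.copy()
    out["conf_enc"] = out["confederation"].map(CONF_MAP)   # Bug 2
    out["rank_inv"] = 1.0 / out["fifa_rank"]               # Bug 3: higher = better
    out["points_norm"] = out["fifa_points"] / max_pts      # Bug 1: train-only max
    return out


max_pts = train["fifa_points"].max()   # fitted on train only
train_fx = apply_fixes(train, max_pts)
test_fx = apply_fixes(test, max_pts)
print(test_fx[["conf_enc", "rank_inv", "points_norm"]])
```

With train-only statistics a test value can exceed 1.0 (here the first test row does). That is expected and harmless; forcing it back into [0, 1] with test-set maxima is exactly the leak described in bug 1.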
v2 expands from 31 → 44 features, with 21 engineered features across 7 football-domain categories:
| Feature | Formula | Domain |
|---|---|---|
| conf_enc | Label-encoded confederation | Context |
| rank_inv | 1 / fifa_rank | Ranking |
| win_ratio_10 | wins / (wins+losses+draws) last 10 | Form |
| loss_ratio_10 | losses / total last 10 | Form |
| goal_diff | goals_scored_avg - goals_conceded_avg | Attack/Defence |
| goal_ratio | goals_scored / goals_conceded | Attack/Defence |
| shots_on_target_abs | shots_per_game × shots_on_target_ratio | Attack |
| goals_per_sot | goals_scored / shots_on_target_abs | Conversion |
| star_density | star_players_count / 11 | Squad |
| value_per_cap | market_value / experience_avg_caps | Squad |
| form_x_winrate | recent_form_score × win_rate_last_year | Form |
| form_x_rating | recent_form_score × avg_player_rating | Form |
| possession_x_passing | possession_avg × passing_accuracy | Style |
| attack_index | goals × shots_on_target_ratio × win_rate | Attack |
| defence_index | clean_sheets / goals_conceded | Defence |
| rank_x_form | rank_inv × recent_form_score | Interaction |
| rank_x_rating | rank_inv × avg_player_rating | Interaction |
| value_x_rating | log1p(market_value) × avg_player_rating | Interaction |
| rating_z | Z-score of avg_player_rating (train stats only) | Normalised |
| value_z | Z-score of market_value (train stats only) | Normalised |
| strength_index | Weighted composite of points + form + rating | Overall |
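A few rows of the table translate directly into pandas. The sketch below is illustrative only: the `add_features` helper, the toy values, and the small epsilon guarding the division are assumptions, and only five of the 21 features are shown.

```python
import numpy as np
import pandas as pd


def add_features(df, stats=None):
    """Return (frame with engineered columns, stats fitted on first call)."""
    if stats is None:  # first call: fit statistics on the frame passed in (train)
        stats = {
            "rating_mean": df["avg_player_rating"].mean(),
            "rating_std": df["avg_player_rating"].std(),
        }
    d = df.copy()
    d["goal_diff"] = d["goals_scored_avg"] - d["goals_conceded_avg"]
    # Epsilon avoids division by zero; its value is an assumption.
    d["goal_ratio"] = d["goals_scored_avg"] / (d["goals_conceded_avg"] + 1e-6)
    d["shots_on_target_abs"] = d["shots_per_game"] * d["shots_on_target_ratio"]
    d["value_x_rating"] = np.log1p(d["market_value_million_eur"]) * d["avg_player_rating"]
    d["rating_z"] = (d["avg_player_rating"] - stats["rating_mean"]) / stats["rating_std"]
    return d, stats


# Hypothetical toy rows, not taken from the Kaggle files.
train = pd.DataFrame({
    "goals_scored_avg": [2.0, 1.0, 1.5],
    "goals_conceded_avg": [1.0, 1.0, 0.5],
    "shots_per_game": [12.0, 8.0, 10.0],
    "shots_on_target_ratio": [0.5, 0.25, 0.4],
    "market_value_million_eur": [900.0, 100.0, 300.0],
    "avg_player_rating": [8.0, 6.0, 7.0],
})
test = pd.DataFrame({
    "goals_scored_avg": [1.8],
    "goals_conceded_avg": [0.9],
    "shots_per_game": [11.0],
    "shots_on_target_ratio": [0.4],
    "market_value_million_eur": [500.0],
    "avg_player_rating": [7.5],
})

train_fe, stats = add_features(train)        # fit on train
test_fe, _ = add_features(test, stats)       # reuse train statistics
print(test_fe[["goal_diff", "shots_on_target_abs", "rating_z"]])
```

Returning the fitted `stats` alongside the frame makes it hard to recompute them on test data by accident.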
Key design principle - zero leakage:

    # CORRECT (v2): stats fitted on train only, applied to test
    if stats is None:  # called on train
        stats = {
            'max_pts': df['fifa_points'].max(),
            'max_val': df['market_value_million_eur'].max(),
            'max_exp': df['experience_avg_caps'].max(),
        }
    d['strength_index'] = df['fifa_points'] / stats['max_pts'] * 40 + ...

    # WRONG (v1): combined train+test before engineering
    all_data = pd.concat([train, test])     # leakage!
    all_eng = engineer_features(all_data)   # test .max() pollutes train

Raw CSVs (train.csv · test.csv · submission.csv)
        │
        ▼
① Load & Inspect
        │   - Shape, dtypes, null counts, class distribution
        │   - 49 unique teams · 6 confederations · 1,000 rows
        ▼
② Bug Fixes Applied
        │   - Fix leakage: train-only normalization stats
        │   - Encode confederation (not drop)
        │   - Invert fifa_rank → rank_inv = 1/fifa_rank
        ▼
③ Feature Engineering (31 → 44 features)
        │   - 21 composite features across 7 domains
        │   - All stats computed on train, applied to test
        ▼
④ Exploratory Data Analysis (EDA)
        │   - Feature correlations with target (max |r| = 0.376)
        │   - Univariate distributions - winners vs non-winners
        │   - Win rate by confederation
        │   - Elite team feature profiles (parallel coordinates)
        ▼
⑤ Train / Validation Split
        │   - 80/20 stratified split · random_state=42
        │   - StandardScaler for distance-based models (LR, MLP, SVM)
        ▼
⑥ Multi-Model Training (6 base classifiers)
        │   - Random Forest · Extra Trees · HistGradientBoosting
        │   - MLP Neural Net · SVM (RBF) · Logistic Regression
        ▼
⑦ Stacking Ensemble
        │   - StackingClassifier: 5-fold OOF predict_proba
        │   - Meta-learner: Logistic Regression
        ▼
⑧ Threshold Optimization
        │   - Scan thresholds 0.35 → 0.70 on validation set
        │   - Select threshold maximising accuracy
        ▼
⑨ Full Evaluation
        │   - Confusion Matrix · Classification Report
        │   - ROC Curves (all models) · Calibration Curves
        │   - 5-Fold Stratified Cross-Validation
        ▼
⑩ Feature Importance & Explainability
        │   - MDI Importance (Random Forest)
        │   - Permutation Importance (model-agnostic, 20 repeats)
        ▼
⑪ Monte Carlo Tournament Simulation (5,000 runs)
        │   - 48 teams · 12 groups of 4 → Round of 32 → Final
        │   - Head-to-head win probability from stacking model
        ▼
⑫ Final Submission
            - Retrain stacking ensemble on full training set
            - Output: winner_probability for all 250 test teams
| Rank | Model | CV Mean Accuracy | CV Std |
|---|---|---|---|
| 1 | Stacking Ensemble | 66.00% | ±2.35% |
| 2 | Extra Trees | 65.00% | ±3.35% |
| 3 | Random Forest | 64.80% | ±2.77% |
| 4 | HistGradientBoosting | 64.40% | ±2.98% |
| 5 | MLP Neural Net | 64.00% | ±3.10% |
| 6 | Logistic Regression | 63.50% | ±2.90% |
| 7 | SVM (RBF) | 63.00% | ±3.20% |
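A mean-and-spread pair like those in the cross-validation table can be produced with scikit-learn's `cross_val_score`. The sketch below is a stand-in, not the notebook's code: it uses synthetic data from `make_classification`, a smaller forest, and shuffled folds, all of which are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the engineered training matrix.
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=42)

# 5 stratified folds keep the class ratio roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = ExtraTreesClassifier(n_estimators=100, random_state=42)

scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.2%} ± {scores.std():.2%}")
```

The `±` figure here is the standard deviation across the five fold scores, which is one common reading of a "CV Std" column.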
| Rank | Model | Accuracy | AUC-ROC |
|---|---|---|---|
| 1 | Stacking Ensemble | 67.50% | 0.6846 |
| 2 | Extra Trees | 66.50% | 0.6861 |
| 3 | MLP Neural Net | 65.00% | 0.6863 |
| 4 | Random Forest | 65.00% | 0.6699 |
| 5 | HistGradientBoosting | 64.50% | 0.6720 |
| 6 | SVM (RBF) | 64.50% | 0.6521 |
| 7 | Logistic Regression | 63.50% | 0.6762 |
                  precision    recall  f1-score   support

      Not Winner       0.71      0.64      0.67       105
          Winner       0.64      0.72      0.68        95

        accuracy                           0.68       200
       macro avg       0.68      0.68      0.67       200
    weighted avg       0.68      0.68      0.67       200
- Stacking ensemble leads on both accuracy (67.5%) and CV stability (lowest std, ±2.35%) - combining 5 diverse base learners extracts signal that no single model captures alone
- Threshold tuning matters - scanning 0.35 → 0.70 instead of defaulting to 0.50 yields a consistent accuracy gain on the validation set
- Extra Trees and MLP achieve the highest AUC (0.6861, 0.6863) - meaning their probability rankings are well-ordered even if raw accuracy lags
- Data leakage fix (Bug #1) was the single most impactful correction - improper normalization using test-set statistics gave false confidence in v1
- Confederation encoding (Bug #2) adds a meaningful signal: UEFA and CONMEBOL teams have historically dominated the World Cup
- The stacking ensemble's consistency across all 5 CV folds suggests that it generalises rather than merely overfitting the validation split
Six base classifiers with optimised hyperparameters:
    from sklearn.ensemble import (
        ExtraTreesClassifier,
        HistGradientBoostingClassifier,
        RandomForestClassifier,
    )
    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import SVC

    base_models = {
        'Random Forest': RandomForestClassifier(
            n_estimators=800, max_depth=None,
            min_samples_leaf=1, max_features='sqrt', random_state=42),
        'Extra Trees': ExtraTreesClassifier(
            n_estimators=800, max_depth=None,
            min_samples_leaf=1, max_features='sqrt', random_state=42),
        'HistGradientBoosting': HistGradientBoostingClassifier(
            max_iter=500, learning_rate=0.03,
            max_depth=6, min_samples_leaf=10, random_state=42),
        'MLP Neural Net': MLPClassifier(
            hidden_layer_sizes=(256, 128, 64), activation='relu',
            max_iter=600, early_stopping=True, alpha=0.001, random_state=42),
        'SVM (RBF)': SVC(
            C=10, kernel='rbf', probability=True,
            gamma='scale', random_state=42),
        'Logistic Regression': LogisticRegression(
            C=1.0, max_iter=2000, random_state=42),
    }

Models in {MLP, SVM, Logistic Regression} receive StandardScaler-transformed input.
Tree-based models use raw feature values.
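One common way to give only the scale-sensitive models standardised input is to wrap each of them in a `Pipeline`, so the scaler is fitted on the training fold alone. Whether the notebook does it this way is not stated; the sketch below uses synthetic data and is an assumption-laden illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; the real features come from the engineered frame.
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# The scaler lives inside the pipeline, so fit() only ever sees X_tr.
svm = make_pipeline(
    StandardScaler(),
    SVC(C=10, kernel="rbf", gamma="scale", probability=True, random_state=42),
)
svm.fit(X_tr, y_tr)

val_proba = svm.predict_proba(X_val)[:, 1]   # P(class 1) on held-out rows
print(val_proba[:5])
```

Tree-based models can be passed the same `X_tr` without the scaler step, since their splits are invariant to monotonic rescaling of a feature.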
    LEVEL 0 - BASE LEARNERS
        Random Forest · Extra Trees · HistGradientBoosting
        MLP Neural Net + Scaler · SVM (RBF) + Scaler
                │
                │  5-fold OOF predict_proba (meta-features)
                ▼
    LEVEL 1 - META LEARNER
        Logistic Regression (C=1.0)
        Learns optimal blending weights from OOF preds
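The two-level layout maps onto scikit-learn's `StackingClassifier`. The following is a reduced sketch under stated assumptions: synthetic data, three base learners instead of five, and small forests to keep it quick. It is not the notebook's exact configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    ExtraTreesClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=42)

# Level 0: base learners; the SVM carries its own scaler.
estimators = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
    ("et", ExtraTreesClassifier(n_estimators=50, random_state=42)),
    ("svm", make_pipeline(StandardScaler(),
                          SVC(probability=True, random_state=42))),
]

# Level 1: logistic regression trained on 5-fold out-of-fold probabilities.
stack = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(C=1.0, max_iter=2000),
    cv=5,
    stack_method="predict_proba",
)
stack.fit(X, y)

proba = stack.predict_proba(X)[:, 1]
print(proba[:5])
```

Because the meta-learner only sees predictions made on folds a base model was not trained on, it learns blending weights without being rewarded for a base model's memorisation of the training rows.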
Instead of defaulting to 0.50, the decision boundary is scanned from 0.35 → 0.70 in steps of 0.01. The threshold maximising validation accuracy is selected and applied at inference time.
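The scan itself is only a few lines of NumPy. In this sketch the function name `best_threshold`, the tie-breaking rule (lowest threshold wins), and the six toy predictions are assumptions for illustration.

```python
import numpy as np


def best_threshold(y_true, proba, lo=0.35, hi=0.70, step=0.01):
    """Return (threshold, accuracy) maximising accuracy on the given set."""
    # Round to 2 decimals so the grid is exactly 0.35, 0.36, ..., 0.70.
    thresholds = np.round(np.arange(lo, hi + step / 2, step), 2)
    accs = np.array([
        np.mean((proba >= t).astype(int) == y_true) for t in thresholds
    ])
    best = int(np.argmax(accs))   # argmax returns the first maximum on ties
    return float(thresholds[best]), float(accs[best])


# Hypothetical validation labels and predicted win probabilities.
y_val = np.array([0, 0, 0, 1, 1, 1])
p_val = np.array([0.30, 0.455, 0.555, 0.525, 0.605, 0.80])

t, acc = best_threshold(y_val, p_val)
print(f"best threshold = {t:.2f}, accuracy = {acc:.3f}")
```

Since the threshold is chosen on the validation set, accuracy measured on that same set is slightly optimistic; cross-validation or a separate hold-out gives a cleaner estimate.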
Top 20 features ranked by Mean Decrease in Impurity.
Green = new v2 engineered feature · Blue = original raw feature
New engineered features (form_x_rating, rank_x_rating, value_x_rating, attack_index) appear in the top 10 - validating the feature engineering effort.
Features ranked by mean AUC decrease when randomly shuffled (20 repeats).
Error bars show variance - more reliable than MDI for correlated features.
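Permutation importance with an AUC score and 20 repeats is available as `sklearn.inspection.permutation_importance`. The sketch below runs it on synthetic data with a small forest; the data, model size, and split are assumptions, not the notebook's setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_redundant=0, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each column 20 times on held-out data and record the AUC drop.
result = permutation_importance(
    rf, X_val, y_val, scoring="roc_auc", n_repeats=20, random_state=42)

# Rank features from largest to smallest mean AUC decrease.
order = result.importances_mean.argsort()[::-1]
for i in order:
    print(f"feature {i}: {result.importances_mean[i]:.4f} "
          f"± {result.importances_std[i]:.4f}")
```

Scoring on the validation rows rather than the training rows keeps the ranking tied to generalisation instead of to what the forest memorised.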
The full FIFA World Cup 2026 bracket is simulated 5,000 times using model-derived win probabilities:
Phase 1 - Group Stage
    12 groups × 4 teams → full round-robin within each group
    Top 2 teams per group advance → 24 qualifiers
Phase 2 - Knockout Rounds
    Round of 32 → Round of 16 → Quarter-Finals → Semi-Finals → Final
Match win probability:
    P(team_A wins) = win_prob_A / (win_prob_A + win_prob_B)
    Head-to-head normalisation from stacking ensemble output
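The head-to-head formula and the repeated-simulation idea can be sketched for a small knockout bracket. This is a simplified illustration: the four teams and their win probabilities are hypothetical, the group stage is omitted, and the bracket size is assumed to be a power of two.

```python
import numpy as np


def p_win(prob_a, prob_b):
    """Head-to-head normalisation: P(A beats B) from two model outputs."""
    return prob_a / (prob_a + prob_b)


def simulate_knockout(teams, win_prob, rng):
    """Play single-elimination rounds until one team remains."""
    alive = list(teams)
    while len(alive) > 1:
        next_round = []
        # Pair neighbours in bracket order: (0, 1), (2, 3), ...
        for a, b in zip(alive[0::2], alive[1::2]):
            a_wins = rng.random() < p_win(win_prob[a], win_prob[b])
            next_round.append(a if a_wins else b)
        alive = next_round
    return alive[0]


# Hypothetical model outputs for four teams.
win_prob = {"A": 0.8, "B": 0.6, "C": 0.4, "D": 0.2}
rng = np.random.default_rng(42)
n_runs = 5000

titles = {team: 0 for team in win_prob}
for _ in range(n_runs):
    titles[simulate_knockout(list(win_prob), win_prob, rng)] += 1

champion_share = {team: count / n_runs for team, count in titles.items()}
print(champion_share)
```

Each run is one possible tournament; the share of runs a team wins is its estimated title probability, and the estimate tightens as `n_runs` grows.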
GoalIQ 2026 - v1 vs v2 Comparison
| Metric | v1 | v2 | Delta |
|---|---|---|---|
| Validation Accuracy | 0.6400 | 0.6750 | +0.035 |
| Validation AUC-ROC | 0.6699 | 0.6846 | +0.015 |
| Feature Count | 31 | 44 | +13 |
| Data Leakage | YES | NO | Fixed |
| Confederation encoded | NO | YES | Fixed |
| FIFA Rank direction correct | NO | YES | Fixed |
| Ensemble Type | Voting | Stacking | Upgraded |
| Threshold Optimised | NO | YES | Fixed |
| Models | 4 | 6 | +2 |
| Syntax errors | 1 | 0 | Fixed |
This section is intentionally included to explain the model's accuracy ceiling - a standard that separates honest ML work from inflated benchmarks.

    Dataset Properties
      Max individual feature correlation with target : 0.376
      Class balance                                  : 47.3% positive
      Training samples                               : 1,000

      → Inherent noise ceiling ≈ 0.68 - 0.72 accuracy

    Claiming 85-95% accuracy on this data would require ONE of:
      (a) Severe overfitting to the validation set
      (b) Target leakage (using future info at train time)
      (c) Evaluating on training data instead of held-out data
      (d) A fundamentally richer / larger dataset
GoalIQ 2026 achieves the best honest accuracy possible on this dataset and documents the ceiling transparently - the correct scientific approach.
- Comprehensive Exploratory Data Analysis with correlation ceiling analysis
- 4 real bugs identified and fixed from the original notebook including data leakage
- 21 domain-informed engineered features across attack, defence, form, squad, and interaction categories
- 6 base model benchmarking with a full stacking ensemble
- Optimal decision threshold scanning rather than naive 0.50 cutoff
- 5-fold stratified cross-validation on every model for unbiased generalization estimates
- Permutation importance as a model-agnostic alternative to MDI
- 5,000-run Monte Carlo bracket simulation of the full 48-team tournament
- Honest accuracy reporting - dataset ceiling documented and explained
| Library | Purpose |
|---|---|
| pandas | Data loading, cleaning, manipulation |
| numpy | Numerical operations and array math |
| matplotlib / seaborn | All EDA and results visualisations |
| scikit-learn | Preprocessing, all models, StackingClassifier, GridSearchCV, evaluation |
| jupyter | Interactive development environment |
    git clone https://github.com/<your-username>/goaliq-2026.git
    cd goaliq-2026

    python -m venv venv
    source venv/bin/activate   # macOS / Linux
    venv\Scripts\activate      # Windows

    pip install -r requirements.txt

    kaggle datasets download -d rauffauzanrambe/fifa-world-cup-2026-prediction-system
    unzip fifa-world-cup-2026-prediction-system.zip -d data/raw/

Or place files manually in data/raw/:

    data/raw/
    ├── train (1).csv
    ├── test (2).csv
    └── submission (17).csv

    jupyter notebook GoalIQ_2026_v2.ipynb

If running on Kaggle, all dataset paths are pre-configured - no changes needed.
    goaliq-2026/
    │
    ├── GoalIQ_2026_v2.ipynb            # Main notebook - full 12-step pipeline
    ├── README.md                       # This file
    ├── requirements.txt                # Python dependencies
    │
    ├── data/
    │   └── raw/                        # Original CSVs (add via Kaggle API)
    │       ├── train (1).csv
    │       ├── test (2).csv
    │       └── submission (17).csv
    │
    ├── assets/                         # All figures referenced in this README
    │   ├── banner.png                  # Header banner image
    │   ├── fig1_correlations.png       # Feature correlation bar chart
    │   ├── fig2_distributions.png      # Feature distribution grid (winners vs non)
    │   ├── fig3_confederation.png      # Win rate by confederation
    │   ├── fig4_profiles.png           # Elite team feature profiles
    │   ├── fig4_threshold.png          # Threshold optimization curve
    │   ├── fig5_model_comparison.png   # Model accuracy/AUC side-by-side
    │   ├── fig6_roc.png                # ROC curves - all 7 models
    │   ├── fig7_confusion.png          # Confusion matrix - stacking ensemble
    │   ├── fig8_calibration.png        # Calibration curves
    │   ├── fig9_importance.png         # MDI feature importance (top 20)
    │   ├── fig10_perm_importance.png   # Permutation importance ± std
    │   ├── fig11_simulation.png        # Monte Carlo champion probability chart
    │   └── fig12_confederation_pie.png # Top-20 confederation breakdown
    │
    └── outputs/
        └── submission_goaliq_v2.csv    # Final predicted win probabilities
To populate assets/: run the notebook end-to-end. All figures are saved automatically to /tmp/fig*.png during execution. Copy them to assets/ before pushing to GitHub.
pandas>=1.5.0
numpy>=1.23.0
scikit-learn>=1.3.0
matplotlib>=3.6.0
seaborn>=0.12.0
jupyter>=1.0.0
This notebook is a strong portfolio reference for:
- Data leakage detection - identifying and fixing train/test contamination in normalization
- Categorical encoding strategy - when to encode vs drop features
- Domain-driven feature engineering - building football-specific metrics from raw stats
- Multi-model benchmarking - comparing 6 classifiers on the same train/val split fairly
- Stacking ensembles - OOF meta-feature generation with StackingClassifier
- Threshold optimization - scanning decision boundaries instead of defaulting to 0.50
- Calibration curves - understanding whether predicted probabilities are trustworthy
- Permutation importance - model-agnostic feature ranking as an alternative to MDI
- Monte Carlo simulation - probabilistic bracket simulation with 5,000 iterations
- Honest benchmark reporting - documenting accuracy ceilings and avoiding inflated claims
- Streamlit web app - let users simulate their own WC 2026 bracket with live predictions
- SHAP explainability - per-prediction feature attribution for individual match outcomes
- Repeated stratified K-Fold - tighter confidence intervals on CV estimates
- Head-to-head historical data - enrich features with direct matchup records
- Larger dataset - incorporate more historical World Cup and qualifying match data
- XGBoost / LightGBM - add gradient boosting libraries once available in environment
- REST API - Flask/FastAPI endpoint for real-time match prediction integration
- Live updates - re-train as WC 2026 qualifying results come in
Contributions are welcome!
- Fork the repository
- Create a feature branch (git checkout -b feature/your-feature)
- Commit your changes (git commit -m 'Add your feature')
- Push to the branch (git push origin feature/your-feature)
- Open a Pull Request
- Rauf Fauzan Rambe for the dataset on Kaggle
- FIFA World Cup 2026, the first 48-team edition, hosted by USA · Canada · Mexico
- Built with scikit-learn, pandas, matplotlib, and seaborn
Your Name
- GitHub: @MusaIslamFahade
- Kaggle: @mdmusaislamfahad01
Made with ⚽ for the beautiful game
GoalIQ 2026 - Where football passion meets data science
⭐ If this project helped your learning or research, a star would mean a lot. Thank you!













