Soccer ML Analytics - Player Valuation & Breakout Prediction

A comprehensive machine learning project for soccer player analytics, featuring transfer fee prediction, market value modeling, transfer outcome classification, breakout candidate identification, and aging curve analysis.

Project Overview

This project uses the Kaggle player-scores dataset by davidcariboo to build multiple ML models for soccer analytics:

M1: Transfer Fee Regression - Predicts transfer fees based on player stats
M2: Market Value Model - Explains/predicts player market values
M3: Transfer Outcome Classifier - Classifies transfer success
M4: Breakout Candidate Classifier - Identifies young players with high growth potential

Plus interactive analytics:

Player aging curves by position
Undervalued player identification
Interactive Streamlit dashboard

Requirements

Python 3.10+
Windows 10/11, macOS 13+ (Ventura/Sonoma), or a modern Linux distro with Python 3.10 available. Commands below include PowerShell/CMD and bash/zsh equivalents.
Kaggle dataset: https://www.kaggle.com/datasets/davidcariboo/player-scores/data

Setup Instructions

1. Clone/Download this Project

Clone or download this repository to your local machine.

2. Get the Data

Go to https://www.kaggle.com/datasets/davidcariboo/player-scores/data
Download the dataset (you will need a Kaggle account)
Extract all CSV files to the data/raw/ directory

Your data/raw/ folder should contain:

players.csv
player_valuations.csv
transfers.csv
appearances.csv
games.csv
clubs.csv
competitions.csv
game_events.csv (optional)
club_games.csv (optional)
game_lineups.csv (optional)
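
Before running the pipeline, it can help to confirm that the expected files are in place. The following is a minimal sketch, not part of the project itself; the helper name `missing_raw_files` is hypothetical, and the required/optional split simply follows the "(optional)" notes in the list above.

```python
from pathlib import Path

# File names copied from the list above.
REQUIRED = [
    "players.csv", "player_valuations.csv", "transfers.csv",
    "appearances.csv", "games.csv", "clubs.csv", "competitions.csv",
]
OPTIONAL = ["game_events.csv", "club_games.csv", "game_lineups.csv"]


def missing_raw_files(raw_dir="data/raw"):
    """Return (missing_required, missing_optional) file names for raw_dir."""
    raw = Path(raw_dir)
    missing_required = [name for name in REQUIRED if not (raw / name).is_file()]
    missing_optional = [name for name in OPTIONAL if not (raw / name).is_file()]
    return missing_required, missing_optional
```

Calling `missing_raw_files()` from the project root returns two lists; an empty first list means every required CSV was found.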

3. Create Virtual Environment

Open a terminal, navigate to the project directory, and create the virtual environment.

Windows (PowerShell/CMD):

cd "C:\path\to\MLDS Hackathon\hackathon-2025-three-muscketeers"
python -m venv .venv

macOS/Linux (bash/zsh):

cd "/path/to/MLDS Hackathon/hackathon-2025-three-muscketeers"
python3 -m venv .venv

4. Activate Virtual Environment

Windows PowerShell:

.\.venv\Scripts\Activate.ps1

Windows CMD:

.venv\Scripts\activate.bat

macOS/Linux (bash/zsh):

source .venv/bin/activate

If you get an execution policy error in PowerShell, run:

Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

5. Install Dependencies

pip install -r requirements.txt

macOS/Linux users may need to use pip3 depending on their Python installation.

Usage

Step 1: Run Feature Engineering

This creates all processed datasets needed for modeling:

python -m src.feature_engineering

This will create 4 processed datasets in data/processed/:

player_snapshot.csv - Latest player info with aggregated stats
transfer_model_dataset.csv - Transfer records with features
breakout_model_dataset.csv - Young player candidates with growth labels
aging_curves_by_pos.csv - Average value by position and age

Step 2: Run Optuna

This searches for optimized hyperparameters for model training:

python -m src.hyperparameter_tuning

Step 3: Train Valuation Models (M1, M2, M3)

Train the transfer fee, market value, and outcome models:

python -m src.modeling_valuations

This trains and saves:

models/transfer_fee_regressor.joblib (M1)
models/value_growth_regressor.joblib (M2)
models/transfer_outcome_classifier.joblib (M3)

Step 4: Train Breakout Model (M4)

Train the breakout candidate classifier:

python -m src.modeling_breakouts

This saves:

models/breakout_classifier.joblib (M4)

Step 5: Launch Dashboard

Run the Streamlit dashboard:

streamlit run app.py

This will open your browser to http://localhost:8501 with 6 tabs:

Undervalued Players - Players whose predicted value > current market value
Breakout Candidates - Young players with high growth probability
Player Development - Aging curves showing value vs age by position
Player Lookup - Search for specific players
Transfer Insights - Historical transfer analysis
Guide & Explainability - Model documentation

Data Processing Procedure

Our feature engineering pipeline transforms the raw Kaggle data into ML-ready datasets in four stages:

Pipeline Overview

The data processing runs via python -m src.feature_engineering and creates 4 processed datasets:

player_snapshot.csv - Latest player profiles with aggregated career metrics
transfer_model_dataset.csv - Transfer records with contextual features
breakout_model_dataset.csv - Young player candidates with growth labels
aging_curves_by_pos.csv - Position-specific value trajectories by age

Processing Stages

Stage 1: Data Loading & Standardization

Load raw CSVs from data/raw/: players, valuations, transfers, appearances, games, clubs, competitions
Parse dates and handle missing values
Standardize position labels (GK, DEF, MID, FWD)
Assign league tiers (1=Big 5, 2=Mid-tier European, 3=Others)
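
The position and league-tier steps above could look roughly like the sketch below. Everything specific in it is an assumption: the raw position strings, the competition codes used for the Big 5, and especially the tier-2 set, which is purely illustrative. Check the actual values in players.csv and competitions.csv before reusing it.

```python
# Assumed raw labels; verify against the position column in players.csv.
POSITION_MAP = {
    "Goalkeeper": "GK",
    "Defender": "DEF",
    "Midfield": "MID",
    "Attack": "FWD",
}

# Assumed competition codes for the Big 5 leagues.
BIG_5 = {"GB1", "ES1", "IT1", "L1", "FR1"}
# Hypothetical tier-2 membership, for illustration only.
MID_TIER = {"NL1", "PO1", "BE1"}


def standardize_position(raw_position):
    """Map a raw position string to GK/DEF/MID/FWD, or None if unknown."""
    if raw_position is None:
        return None
    return POSITION_MAP.get(raw_position.strip().title())


def league_tier(competition_id):
    """Return 1 for Big 5, 2 for mid-tier European, 3 for everything else."""
    if competition_id in BIG_5:
        return 1
    if competition_id in MID_TIER:
        return 2
    return 3
```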

Stage 2: Feature Engineering Components

Player Profile Features:

Age, age², position, height, foot preference, nationality
Contract expiration tracking (months remaining)
Current club and league context

Performance Metrics:

Season-by-season aggregation from appearances data
Career totals: goals, assists, minutes, appearances
Per-90 statistics: goals_per_90, assists_per_90
Last season form indicators
Minimum 300 minutes threshold to filter noise
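
As a rough illustration of the per-90 metrics and the 300-minute threshold, here is a small sketch. The function name is hypothetical, and returning None below the threshold is an assumption; the project may instead drop such rows or fill them with 0.

```python
def per_90(total, minutes, min_minutes=300):
    """Return a stat scaled to 90 minutes, or None below the minutes threshold."""
    # `not minutes` covers both missing and zero minutes, avoiding division by zero.
    if not minutes or minutes < min_minutes:
        return None
    return total * 90.0 / minutes
```

For example, 10 goals in 900 minutes gives `per_90(10, 900) == 1.0`, while a player with only 270 minutes gets no per-90 figure, since a single goal in a short spell would otherwise look like elite output.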

Market Value Temporal Features:

1-year growth rate (percentage change)
Peak value ever attained
Volatility (coefficient of variation)
Trend classification (rising/flat/declining based on last 3 valuations)
6-month momentum for recent trajectory
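
The temporal features above can be sketched in plain Python as follows. The function names are hypothetical, and the 5% tolerance in the trend rule is an assumption; this README does not state how rising/flat/declining is decided beyond using the last 3 valuations.

```python
from statistics import mean, pstdev


def growth_rate_pct(old_value, new_value):
    """Percentage change from old_value to new_value; None if old_value is missing or zero."""
    if not old_value:
        return None
    return (new_value - old_value) / old_value * 100.0


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean; None if undefined."""
    if len(values) < 2:
        return None
    avg = mean(values)
    if avg == 0:
        return None
    return pstdev(values) / avg


def trend_label(values, tolerance=0.05):
    """Classify the last three valuations as rising, flat, or declining.

    The tolerance is a hypothetical choice, not the project's documented rule.
    """
    last = values[-3:]
    if len(last) < 3 or last[0] == 0:
        return "flat"
    change = (last[-1] - last[0]) / last[0]
    if change > tolerance:
        return "rising"
    if change < -tolerance:
        return "declining"
    return "flat"
```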

Transfer History:

Total transfer count and frequency
Average transfer fee
Years since last transfer
Career span calculation

Advanced Context Features:

Discipline: Yellow/red cards per 90, composite discipline score
International Experience: Caps, minutes, goals at international level
Value Trajectory: Current vs. peak ratio, momentum indicators
Agent Quality: Top agent identification (top 20 by client value)
League Transitions: Step-up/down detection, league strength comparison

Aging Curve Baselines:

Position-age value distributions (mean, median, 25th/75th percentiles)
Player percentile ranks within position-age cohorts

Stage 3: Dataset-Specific Processing

Player Snapshot:

One row per player with latest valuation
Merges all feature categories above
Fills missing values with domain-appropriate defaults
Enriches with club/league metadata

Transfer Model Dataset:

One row per paid transfer (fee > 0)
Features captured at transfer date (historical point-in-time)
Recent season stats before transfer
League transition indicators (from/to league strength)
Post-transfer outcomes: future value (12-24 months), playing time
Transfer outcome label: Success if value growth ≥15% OR minutes ≥1200 OR (minutes ≥900 AND value growth ≥0)
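
The transfer outcome rule above translates directly into a boolean expression. This sketch assumes value growth is expressed in percent and that minutes are counted over whatever post-transfer window the pipeline uses; the function and variable names are hypothetical.

```python
def transfer_success(value_growth_pct, minutes_played):
    """Return 1 if the transfer counts as a success under the stated rule, else 0."""
    strong_growth = value_growth_pct >= 15
    regular_starter = minutes_played >= 1200
    steady_contributor = minutes_played >= 900 and value_growth_pct >= 0
    return int(strong_growth or regular_starter or steady_contributor)
```

Note that a player whose value dropped can still be labeled a success through playing time alone (1,200 minutes or more), which is why the rule combines both signals.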

Breakout Model Dataset:

Filters players aged 16-23 with ≥300 career minutes
Calculates 2-year future value (8 quarters ahead)
Breakout label: 1 if value growth >50% in 2 years, else 0
Includes percentile ranks and momentum features
Focus on growth potential indicators
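
A minimal sketch of the candidate filter and breakout label described above. The function names are hypothetical, and returning None when the future value is unavailable is an assumption about how unlabeled rows are handled.

```python
def is_breakout_candidate(age, career_minutes):
    """Players aged 16-23 with at least 300 career minutes."""
    return 16 <= age <= 23 and career_minutes >= 300


def breakout_label(current_value, future_value):
    """1 if value grows by more than 50% over the 2-year window, else 0.

    Returns None when the label cannot be computed (missing or zero values).
    """
    if not current_value or future_value is None:
        return None
    growth = (future_value - current_value) / current_value
    return int(growth > 0.5)
```

Because the rule is strictly "greater than 50%", a player going from 1,000,000 to exactly 1,500,000 is labeled 0.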

Aging Curves Dataset:

Aggregates all valuations by position and age (16-40)
Calculates mean, median, and percentile bands
Drops position-age groups with fewer than 10 observations
Used for visualizing typical career trajectories
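
The aggregation above can be sketched with the standard library alone. This is an illustration under assumptions, not the project's code: the input is taken to be (position, age, value) tuples, the function name is hypothetical, and the "inclusive" quantile method is one of several reasonable percentile definitions.

```python
from collections import defaultdict
from statistics import mean, median, quantiles


def aging_curves(records, min_obs=10, min_age=16, max_age=40):
    """Aggregate (position, age, value) records into per-group summary rows."""
    groups = defaultdict(list)
    for position, age, value in records:
        if min_age <= age <= max_age:
            groups[(position, age)].append(value)

    rows = []
    for (position, age), values in sorted(groups.items()):
        # Skip sparse groups; quantiles() also needs at least two points.
        if len(values) < max(min_obs, 2):
            continue
        p25, _, p75 = quantiles(values, n=4, method="inclusive")
        rows.append({
            "position": position, "age": age, "n": len(values),
            "mean": mean(values), "median": median(values),
            "p25": p25, "p75": p75,
        })
    return rows
```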

Stage 4: Data Quality & Validation

Replace infinities with 0 or appropriate defaults
Enforce minimum playing time thresholds (300+ minutes)
Fill missing numeric values with 0
Clip outlier ratios to reasonable bounds
Sort temporal data for consistency
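
The cleanup rules above (replace infinities, fill missing values, clip ratios) can be combined in one small helper. This is a sketch; the function name is hypothetical, and the default and bounds are placeholders, since the README does not list the actual clipping limits.

```python
import math


def clean_numeric(value, default=0.0, lower=None, upper=None):
    """Replace None/NaN/inf with default, then clip to [lower, upper] if given."""
    if value is None or math.isnan(value) or math.isinf(value):
        return default
    if lower is not None:
        value = max(value, lower)
    if upper is not None:
        value = min(value, upper)
    return value
```

A ratio such as current value over peak value could then be cleaned with `clean_numeric(ratio, lower=0.0, upper=1.0)`, where the bounds are again illustrative.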

Key Design Decisions

Point-in-time integrity: Transfer features use only data available before the transfer date
Minimum thresholds: 300 minutes minimum to avoid noisy per-90 metrics
Position simplification: 4 main categories for statistical power
League tiering: Big 5 leagues get special treatment in features
Temporal windows: 1-year for growth, 2-year for breakout labels
Outcome definitions: Combine value growth + playing time for transfer success
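
Point-in-time integrity is the decision most likely to cause silent leakage if done wrong, so a short sketch may help. It assumes each appearance is a dict with `date`, `goals`, `assists`, and `minutes_played` keys; these field names and the function name are assumptions for illustration.

```python
from datetime import date


def stats_before(appearances, transfer_date):
    """Sum stats from appearances dated strictly before transfer_date."""
    totals = {"goals": 0, "assists": 0, "minutes_played": 0}
    for row in appearances:
        # Strict comparison: a match on the transfer date itself is excluded.
        if row["date"] < transfer_date:
            for key in totals:
                totals[key] += row[key]
    return totals
```

Anything dated on or after the transfer is left out of the features, so the model only sees what a club could have known when the deal was made.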

Processing Performance

Player Snapshot: 30,000+ rows (all players with valuations)
Transfer Dataset: 25,000+ rows (paid transfers only)
Breakout Dataset: 8,000+ rows (young players with future data)
Aging Curves: 400+ rows (position-age combinations)
Runtime: 5-15 minutes depending on hardware

Project Structure

hackathon-2025-three-muscketeers/
├── app.py                    # Main Streamlit dashboard
├── requirements.txt          # Python dependencies
├── README.md                 # This file
├── data/
│   ├── raw/                  # Raw CSV files (download from Kaggle)
│   └── processed/            # Processed datasets (generated)
├── models/                   # Trained models (generated)
├── reports/                  # SHAP summaries and metrics (generated)
├── src/
│   ├── config.py             # Configuration and paths
│   ├── ui_config.py          # UI constants and styling
│   ├── app_utils.py          # Utility functions
│   ├── data_loading.py       # Data loading functions
│   ├── feature_engineering.py # Feature engineering pipeline
│   ├── modeling_valuations.py # M1, M2, M3 training
│   ├── modeling_breakouts.py  # M4 training
│   ├── analytics_aging.py    # Aging curve analysis
│   └── utils.py              # General utilities
└── tests/
    └── test_feature_engineering_utils.py  # Unit tests

Models Overview

M1: Transfer Fee Regression

Algorithm: XGBoost Regressor
Features: age, recent performance stats, market value, league strength, position
Target: Transfer fee in EUR (log-transformed)
Evaluation: RMSE, R²
Use Case: Benchmarking negotiations, identifying bargains/overpays
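
Training on a log-transformed fee means predictions must be transformed back before they are read as EUR. The sketch below uses `log1p`/`expm1`, which is an assumption; the project may use a plain logarithm or another variant, and the function names are hypothetical.

```python
import math


def to_log_target(fee_eur):
    """Compress a fee in EUR so very large transfers do not dominate the loss."""
    return math.log1p(fee_eur)


def from_log_prediction(log_fee):
    """Invert log1p to express a model prediction in EUR again."""
    return math.expm1(log_fee)
```

Metrics such as RMSE differ depending on whether they are computed on the log scale or after converting back to EUR, so it is worth stating which scale a reported number refers to.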

M2: Market Value Model

Algorithm: XGBoost Regressor
Features: age, career stats, goals/assists per 90, league tier, position
Target: Market value (log-transformed)
Use Case: Identify undervalued players, what-if simulations

M3: Transfer Outcome Classifier

Algorithm: XGBoost Classifier
Features: age, stats, market value at transfer, league transition, position
Target: Transfer success (binary/categorical)
Evaluation: Accuracy, Precision, Recall
Use Case: Risk screening for potential transfers

M4: Breakout Candidate Classifier

Algorithm: LightGBM Classifier
Features: age, current value, career stats, growth momentum, position
Target: Breakout label (50%+ value growth in 2 years)
Use Case: Scout young talent, guide youth recruitment

Dashboard Features

Tab 1: Undervalued Players

Filters by position, league, minimum value, and age
Shows top 50 players with highest undervaluation
Interactive scatter plot with confidence bands
SHAP feature importance visualization
What-if scenario simulator
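
The ranking behind this tab can be pictured as follows. This is a sketch under assumptions: undervaluation is taken here as the absolute gap between predicted and current market value (the dashboard may use a ratio instead), and the dict keys and function name are hypothetical.

```python
def top_undervalued(players, top_n=50):
    """Return up to top_n players whose predicted value exceeds market value, largest gap first."""
    scored = []
    for player in players:
        gap = player["predicted_value"] - player["market_value"]
        if gap > 0:
            scored.append({**player, "undervaluation": gap})
    scored.sort(key=lambda p: p["undervaluation"], reverse=True)
    return scored[:top_n]
```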

Tab 2: Breakout Candidates

Filter by minimum breakout probability
Shows young players (age ≤ 22) ranked by probability
Performance metrics and predicted growth
Interactive visualizations

Tab 3: Player Development

Interactive aging curves for each position
Plots mean/median market value vs age
Percentile bands (25th-75th)
Player comparison to positional benchmarks

Tab 4: Player Lookup

Search players by partial name match
View detailed stats and current market value
Per-90 contribution visualization

Tab 5: Transfer Insights

Historical transfer analysis
Fee predictions vs actual fees
Success probability estimation
Bargain identification
What-if transfer scenario tool

Tab 6: Guide & Explainability

Data foundation documentation
Model architecture details
SHAP explanations
Validation methodology

Troubleshooting

Data Loading Errors

Ensure all CSV files are in data/raw/
Check that column names match (the code handles minor variations)
Verify files are not corrupted

Model Training Errors

Ensure feature engineering ran successfully first
Check that processed CSVs exist in data/processed/
Review error logs in soccer_ml.log

Dashboard Won't Start

Ensure all models are trained (check models/ directory)
Make sure virtual environment is activated
Check that Streamlit is installed: pip show streamlit

Memory Issues

The dataset is large; if you run out of memory, try:
- Closing other applications
- Using a machine with more RAM (8GB+ recommended)
- Processing data in smaller chunks

Performance Notes

Feature engineering: ~5-15 minutes depending on dataset size and CPU
Model training: ~2-10 minutes per model
Dashboard: Loads in ~5-10 seconds after models are cached
Recommended: 8GB+ RAM for smooth performance

Testing

Run the test suite:

python -m unittest discover -s tests

Data Sources

Dataset: player-scores by davidcariboo

URL: https://www.kaggle.com/datasets/davidcariboo/player-scores/data
License: As specified on Kaggle
Data dictionary: See Kaggle page for column descriptions

Contributing

This is a hackathon project. For improvements:

Fork the repository
Create a feature branch
Add tests for new functionality
Submit a pull request

License

This project is for educational/hackathon purposes.

Authors

Team: Three Musketeers MLDS Hackathon 2025

Acknowledgments

Kaggle and davidcariboo for the player-scores dataset
Streamlit for the dashboard framework
XGBoost and LightGBM teams for the ML libraries
