jixinyan/Scout4One

Soccer ML Analytics - Player Valuation & Breakout Prediction

A comprehensive machine learning project for soccer player analytics, featuring transfer fee prediction, market value modeling, transfer outcome classification, breakout candidate identification, and aging curve analysis.

Project Overview

This project uses the Kaggle player-scores dataset by davidcariboo to build multiple ML models for soccer analytics:

  • M1: Transfer Fee Regression - Predicts transfer fees based on player stats
  • M2: Market Value Model - Explains/predicts player market values
  • M3: Transfer Outcome Classifier - Classifies whether a transfer succeeded, based on post-transfer value growth and playing time
  • M4: Breakout Candidate Classifier - Identifies young players with high growth potential

Plus interactive analytics:

  • Player aging curves by position
  • Undervalued player identification
  • Interactive Streamlit dashboard

Requirements

Setup Instructions

1. Clone/Download this Project

Clone or download this repository to your local machine.

2. Get the Data

  1. Go to https://www.kaggle.com/datasets/davidcariboo/player-scores/data
  2. Download the dataset (you will need a Kaggle account)
  3. Extract all CSV files to the data/raw/ directory

Your data/raw/ folder should contain:

  • players.csv
  • player_valuations.csv
  • transfers.csv
  • appearances.csv
  • games.csv
  • clubs.csv
  • competitions.csv
  • game_events.csv (optional)
  • club_games.csv (optional)
  • game_lineups.csv (optional)
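
Before running the pipeline, it can help to confirm the files are in place. The following is a minimal sketch, not part of the project code: the helper name `missing_files` is hypothetical, and the file names are taken from the list above.

```python
from pathlib import Path

# File names taken from the list above; optional files are reported but not required.
REQUIRED = [
    "players.csv", "player_valuations.csv", "transfers.csv", "appearances.csv",
    "games.csv", "clubs.csv", "competitions.csv",
]
OPTIONAL = ["game_events.csv", "club_games.csv", "game_lineups.csv"]


def missing_files(raw_dir, names):
    """Return the entries of `names` that are not present as files in `raw_dir`."""
    raw = Path(raw_dir)
    return [name for name in names if not (raw / name).is_file()]


if __name__ == "__main__":
    missing = missing_files("data/raw", REQUIRED)
    if missing:
        print("Missing required files:", ", ".join(missing))
    else:
        print("All required files found.")
    skipped = missing_files("data/raw", OPTIONAL)
    if skipped:
        print("Optional files not found:", ", ".join(skipped))
```

Run it from the project root so that the relative path data/raw resolves correctly.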

3. Create Virtual Environment

Open a terminal, navigate to the project directory, and create the virtual environment.

Windows (PowerShell/CMD):

cd "D:\Coding Projetcs\MLDS Hackathon\hackathon-2025-three-muscketeers"
python -m venv .venv

macOS/Linux (bash/zsh):

cd "/path/to/MLDS Hackathon/hackathon-2025-three-muscketeers"
python3 -m venv .venv

4. Activate Virtual Environment

Windows PowerShell:

.\.venv\Scripts\Activate.ps1

Windows CMD:

.venv\Scripts\activate.bat

macOS/Linux (bash/zsh):

source .venv/bin/activate

If you get an execution policy error in PowerShell, run:

Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

5. Install Dependencies

pip install -r requirements.txt

macOS/Linux users may need to use pip3 depending on their Python installation.

Usage

Step 1: Run Feature Engineering

This creates all processed datasets needed for modeling:

python -m src.feature_engineering

This will create 4 processed datasets in data/processed/:

  • player_snapshot.csv - Latest player info with aggregated stats
  • transfer_model_dataset.csv - Transfer records with features
  • breakout_model_dataset.csv - Young player candidates with growth labels
  • aging_curves_by_pos.csv - Average value by position and age

Step 2: Run Optuna

This uses Optuna to search for optimized hyperparameters for model training:

python -m src.hyperparameter_tuning

Step 3: Train Valuation Models (M1, M2, M3)

Train the transfer fee, market value, and outcome models:

python -m src.modeling_valuations

This trains and saves:

  • models/transfer_fee_regressor.joblib (M1)
  • models/value_growth_regressor.joblib (M2)
  • models/transfer_outcome_classifier.joblib (M3)

Step 4: Train Breakout Model (M4)

Train the breakout candidate classifier:

python -m src.modeling_breakouts

This saves:

  • models/breakout_classifier.joblib (M4)

Step 5: Launch Dashboard

Run the Streamlit dashboard:

streamlit run app.py

This will open your browser to http://localhost:8501 with 6 tabs:

  1. Undervalued Players - Players whose predicted value exceeds their current market value
  2. Breakout Candidates - Young players with high growth probability
  3. Player Development - Aging curves showing value vs age by position
  4. Player Lookup - Search for specific players
  5. Transfer Insights - Historical transfer analysis
  6. Guide & Explainability - Model documentation

Data Processing Procedure

Our feature engineering pipeline transforms the raw Kaggle CSVs into ML-ready datasets in four stages:

Pipeline Overview

The data processing runs via python -m src.feature_engineering and creates 4 processed datasets:

  1. player_snapshot.csv - Latest player profiles with aggregated career metrics
  2. transfer_model_dataset.csv - Transfer records with contextual features
  3. breakout_model_dataset.csv - Young player candidates with growth labels
  4. aging_curves_by_pos.csv - Position-specific value trajectories by age

Processing Stages

Stage 1: Data Loading & Standardization

  • Load raw CSVs from data/raw/: players, valuations, transfers, appearances, games, clubs, competitions
  • Parse dates and handle missing values
  • Standardize position labels (GK, DEF, MID, FWD)
  • Assign league tiers (1=Big 5, 2=Mid-tier European, 3=Others)
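
The standardization step can be sketched as two lookups. This is an illustration only: the raw position labels and the competition codes below are assumptions, and the project's real mappings live in its own source.

```python
# Assumed raw labels; the actual values in players.csv may differ.
POSITION_MAP = {
    "goalkeeper": "GK",
    "defender": "DEF",
    "midfield": "MID",
    "attack": "FWD",
}

# Hypothetical competition codes, used here only to illustrate the three tiers.
BIG_5 = {"GB1", "ES1", "IT1", "L1", "FR1"}
MID_TIER_EUROPEAN = {"NL1", "PO1", "BE1", "TR1"}


def standardize_position(raw_label):
    """Map a raw position label to GK/DEF/MID/FWD, or None if it is unknown."""
    if raw_label is None:
        return None
    return POSITION_MAP.get(str(raw_label).strip().lower())


def league_tier(competition_id):
    """Return 1 for Big 5, 2 for mid-tier European, 3 for everything else."""
    if competition_id in BIG_5:
        return 1
    if competition_id in MID_TIER_EUROPEAN:
        return 2
    return 3
```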

Stage 2: Feature Engineering Components

Player Profile Features:

  • Age, age², position, height, foot preference, nationality
  • Contract expiration tracking (months remaining)
  • Current club and league context
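
As a rough sketch of the age and contract features, assuming whole-year ages and whole-month contract durations (the function names are hypothetical, and the project may compute these differently):

```python
from datetime import date


def age_in_years(date_of_birth, as_of):
    """Completed years between date_of_birth and as_of."""
    years = as_of.year - date_of_birth.year
    # Subtract one year if the birthday has not yet occurred in the as_of year.
    if (as_of.month, as_of.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years


def contract_months_remaining(expiration, as_of):
    """Calendar months until contract expiration, floored at 0; None if unknown."""
    if expiration is None:
        return None
    months = (expiration.year - as_of.year) * 12 + (expiration.month - as_of.month)
    return max(months, 0)


def profile_features(date_of_birth, contract_expiration, as_of):
    """Bundle age, age squared, and months remaining into one dict."""
    age = age_in_years(date_of_birth, as_of)
    return {
        "age": age,
        "age_squared": age ** 2,
        "contract_months_remaining": contract_months_remaining(contract_expiration, as_of),
    }
```

The squared term lets a model capture the rise-then-decline shape of a career, which a linear age term alone cannot.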

Performance Metrics:

  • Season-by-season aggregation from appearances data
  • Career totals: goals, assists, minutes, appearances
  • Per-90 statistics: goals_per_90, assists_per_90
  • Last season form indicators
  • Minimum 300 minutes threshold to filter noise
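
A per-90 rate with the 300-minute threshold might look like the sketch below. Returning None for small samples is an assumption; the pipeline may instead drop those rows or fill a default.

```python
MIN_MINUTES = 300  # threshold stated above


def per_90(total, minutes, min_minutes=MIN_MINUTES):
    """Rate per 90 minutes, or None when the sample is below the threshold."""
    if minutes is None or minutes < min_minutes:
        return None
    return total * 90.0 / minutes
```

For example, 10 goals in 900 minutes gives 1.0 goals per 90, while 1 goal in 90 minutes is rejected because a single match would otherwise also score 1.0.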

Market Value Temporal Features:

  • 1-year growth rate (percentage change)
  • Peak value ever attained
  • Volatility (coefficient of variation)
  • Trend classification (rising/flat/declining based on last 3 valuations)
  • 6-month momentum for recent trajectory
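
The value-history features above can be sketched with the standard library. The 5% tolerance for "flat" and the use of the population standard deviation are assumptions, not values confirmed by the project.

```python
from statistics import mean, pstdev


def growth_rate(old_value, new_value):
    """Percentage change from old_value to new_value; None if old_value is missing or 0."""
    if not old_value:
        return None
    return (new_value - old_value) / old_value * 100.0


def volatility(values):
    """Coefficient of variation: standard deviation divided by the mean."""
    if len(values) < 2:
        return 0.0
    avg = mean(values)
    if avg == 0:
        return 0.0
    return pstdev(values) / avg


def classify_trend(values, tolerance=0.05):
    """Label the last three valuations as rising, flat, or declining."""
    last = values[-3:]
    if len(last) < 2 or last[0] == 0:
        return "flat"
    change = (last[-1] - last[0]) / last[0]
    if change > tolerance:
        return "rising"
    if change < -tolerance:
        return "declining"
    return "flat"
```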

Transfer History:

  • Total transfer count and frequency
  • Average transfer fee
  • Years since last transfer
  • Career span calculation

Advanced Context Features:

  • Discipline: Yellow/red cards per 90, composite discipline score
  • International Experience: Caps, minutes, goals at international level
  • Value Trajectory: Current vs. peak ratio, momentum indicators
  • Agent Quality: Top agent identification (top 20 by client value)
  • League Transitions: Step-up/down detection, league strength comparison
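
Two of these context features are simple enough to sketch directly. The labels and function names are hypothetical, and the sketch assumes tiers are numbered so that 1 is the strongest.

```python
def league_transition(from_tier, to_tier):
    """Classify a move between league tiers, where tier 1 is the strongest."""
    if to_tier < from_tier:
        return "step_up"
    if to_tier > from_tier:
        return "step_down"
    return "lateral"


def value_vs_peak(current_value, peak_value):
    """Current value as a fraction of peak value, or None if no peak is known."""
    if not peak_value:
        return None
    return current_value / peak_value
```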

Aging Curve Baselines:

  • Position-age value distributions (mean, median, 25th/75th percentiles)
  • Player percentile ranks within position-age cohorts
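
A percentile rank within a position-age cohort can be sketched as the share of cohort members valued at or below the player. The record keys (`name`, `position`, `age`, `value`) are assumptions for illustration, and the project may define the rank differently.

```python
from collections import defaultdict


def cohort_percentiles(players):
    """Map each player name to a percentile rank within their (position, age) cohort."""
    cohorts = defaultdict(list)
    for p in players:
        cohorts[(p["position"], p["age"])].append(p["value"])

    ranks = {}
    for p in players:
        values = cohorts[(p["position"], p["age"])]
        at_or_below = sum(1 for v in values if v <= p["value"])
        ranks[p["name"]] = 100.0 * at_or_below / len(values)
    return ranks
```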

Stage 3: Dataset-Specific Processing

Player Snapshot:

  • One row per player with latest valuation
  • Merges all feature categories above
  • Fills missing values with domain-appropriate defaults
  • Enriches with club/league metadata

Transfer Model Dataset:

  • One row per paid transfer (fee > 0)
  • Features captured at transfer date (historical point-in-time)
  • Recent season stats before transfer
  • League transition indicators (from/to league strength)
  • Post-transfer outcomes: future value (12-24 months), playing time
  • Transfer outcome label: Success if value growth ≥15% OR minutes ≥1200 OR (minutes ≥900 AND value growth ≥0)
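
The success rule above translates directly into a boolean expression. This sketch assumes value growth is expressed in percent and minutes are totals over the post-transfer window; the function name is hypothetical.

```python
def transfer_success(value_growth_pct, minutes):
    """Apply the success rule: growth >= 15%, or 1200+ minutes,
    or 900+ minutes with non-negative growth."""
    return bool(
        value_growth_pct >= 15
        or minutes >= 1200
        or (minutes >= 900 and value_growth_pct >= 0)
    )
```

Under this rule a player who loses value but plays 1200 or more minutes still counts as a success, which reflects the choice to combine value growth with playing time.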

Breakout Model Dataset:

  • Filters players aged 16-23 with ≥300 career minutes
  • Calculates 2-year future value (8 quarters ahead)
  • Breakout label: 1 if value growth >50% in 2 years, else 0
  • Includes percentile ranks and momentum features
  • Focus on growth potential indicators
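
The eligibility filter and label can be sketched as follows. Function names are hypothetical, and returning None when a value is missing is an assumption about how unlabeled rows are handled.

```python
def is_breakout_candidate(age, career_minutes):
    """Eligibility filter: aged 16-23 with at least 300 career minutes."""
    return 16 <= age <= 23 and career_minutes >= 300


def breakout_label(current_value, future_value):
    """1 if value grows by more than 50% over the 2-year horizon, else 0.

    Returns None when either value is missing or the current value is not positive.
    """
    if current_value is None or future_value is None or current_value <= 0:
        return None
    growth = (future_value - current_value) / current_value
    return 1 if growth > 0.5 else 0
```

Note that growth of exactly 50% is labeled 0, because the rule is a strict inequality.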

Aging Curves Dataset:

  • Aggregates all valuations by position and age (16-40)
  • Calculates mean, median, and percentile bands
  • Drops position-age groups with fewer than 10 observations
  • Used for visualizing typical career trajectories
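
A pandas sketch of this aggregation is shown below. The column names (`position`, `age`, `market_value`) and output names are assumptions; the project's actual schema may differ.

```python
import pandas as pd


def q25(series):
    """25th percentile of a group's values."""
    return series.quantile(0.25)


def q75(series):
    """75th percentile of a group's values."""
    return series.quantile(0.75)


def aging_curves(valuations, min_obs=10):
    """Aggregate market value by position and age (16-40), dropping sparse groups."""
    in_range = valuations[valuations["age"].between(16, 40)]
    curves = (
        in_range.groupby(["position", "age"])
        .agg(
            n_obs=("market_value", "count"),
            mean_value=("market_value", "mean"),
            median_value=("market_value", "median"),
            p25=("market_value", q25),
            p75=("market_value", q75),
        )
        .reset_index()
    )
    # Keep only groups with enough observations to give a stable estimate.
    return curves[curves["n_obs"] >= min_obs].reset_index(drop=True)
```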

Stage 4: Data Quality & Validation

  • Replace infinities with 0 or appropriate defaults
  • Enforce minimum playing time thresholds (300+ minutes)
  • Fill missing numeric values with 0
  • Clip outlier ratios to reasonable bounds
  • Sort temporal data for consistency
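
The cleaning steps could be combined roughly as below. This is a sketch under assumptions: the function name, the `ratio_cols` parameter, and the clipping bounds of 0 to 10 are all illustrative rather than the project's actual values.

```python
import numpy as np
import pandas as pd


def clean_numeric(df, ratio_cols=(), ratio_bounds=(0.0, 10.0)):
    """Replace infinities and missing numeric values with 0, then clip ratio columns."""
    out = df.copy()
    numeric = out.select_dtypes(include="number").columns
    # Infinities become NaN first so that a single fillna handles both cases.
    out[numeric] = out[numeric].replace([np.inf, -np.inf], np.nan).fillna(0)
    for col in ratio_cols:
        out[col] = out[col].clip(*ratio_bounds)
    return out
```

Filling with 0 is convenient for tree models but conflates "missing" with "zero", so it is worth checking that 0 is a sensible default for each column.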

Key Design Decisions

  1. Point-in-time integrity: Transfer features use only data available before the transfer date
  2. Minimum thresholds: 300 minutes minimum to avoid noisy per-90 metrics
  3. Position simplification: 4 main categories for statistical power
  4. League tiering: Big 5 leagues get special treatment in features
  5. Temporal windows: 1-year for growth, 2-year for breakout labels
  6. Outcome definitions: Combine value growth + playing time for transfer success
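
Point-in-time integrity (decision 1) can be illustrated with pandas' `merge_asof`, which attaches to each transfer the most recent valuation strictly before the transfer date. The column names (`player_id`, `transfer_date`, `date`, `market_value_in_eur`) are assumed for this sketch, and the project may implement the lookup differently.

```python
import pandas as pd


def value_before_transfer(transfers, valuations):
    """Attach the latest valuation dated strictly before each transfer."""
    # merge_asof requires both frames to be sorted by their time keys.
    transfers = transfers.sort_values("transfer_date")
    valuations = valuations.sort_values("date")
    return pd.merge_asof(
        transfers,
        valuations,
        left_on="transfer_date",
        right_on="date",
        by="player_id",
        direction="backward",        # look only into the past
        allow_exact_matches=False,   # exclude valuations dated on the transfer day
    )
```

Excluding same-day valuations is a conservative choice: a valuation published on the transfer date may already reflect the fee.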

Processing Performance

  • Player Snapshot: ~30,000+ rows (all players with valuations)
  • Transfer Dataset: ~25,000+ rows (paid transfers only)
  • Breakout Dataset: ~8,000+ rows (young players with future data)
  • Aging Curves: ~400+ rows (position-age combinations)
  • Runtime: 5-15 minutes depending on hardware

Project Structure

hackathon-2025-three-muscketeers/
├── app.py                    # Main Streamlit dashboard
├── requirements.txt          # Python dependencies
├── README.md                 # This file
├── data/
│   ├── raw/                  # Raw CSV files (download from Kaggle)
│   └── processed/            # Processed datasets (generated)
├── models/                   # Trained models (generated)
├── reports/                  # SHAP summaries and metrics (generated)
├── src/
│   ├── config.py             # Configuration and paths
│   ├── ui_config.py          # UI constants and styling
│   ├── app_utils.py          # Utility functions
│   ├── data_loading.py       # Data loading functions
│   ├── feature_engineering.py # Feature engineering pipeline
│   ├── modeling_valuations.py # M1, M2, M3 training
│   ├── modeling_breakouts.py  # M4 training
│   ├── analytics_aging.py    # Aging curve analysis
│   └── utils.py              # General utilities
└── tests/
    └── test_feature_engineering_utils.py  # Unit tests

Models Overview

M1: Transfer Fee Regression

  • Algorithm: XGBoost Regressor
  • Features: age, recent performance stats, market value, league strength, position
  • Target: Transfer fee in EUR (log-transformed)
  • Evaluation: RMSE, R²
  • Use Case: Benchmarking negotiations, identifying bargains/overpays
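
Because the target is log-transformed, predictions must be mapped back to EUR before they are reported. The sketch below assumes a `log1p`/`expm1` pair, which is one common choice; the project may use a different transform.

```python
import math


def to_log_target(fee_eur):
    """Transform a fee in EUR to log scale; log1p keeps a fee of 0 finite."""
    return math.log1p(fee_eur)


def from_log_target(prediction):
    """Invert to_log_target, returning a fee in EUR."""
    return math.expm1(prediction)


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))
```

RMSE computed on the log scale measures relative error, while RMSE on the back-transformed values is in EUR and is dominated by the largest fees, so it helps to state which one is being reported.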

M2: Market Value Model

  • Algorithm: XGBoost Regressor
  • Features: age, career stats, goals/assists per 90, league tier, position
  • Target: Market value (log-transformed)
  • Use Case: Identify undervalued players, what-if simulations

M3: Transfer Outcome Classifier

  • Algorithm: XGBoost Classifier
  • Features: age, stats, market value at transfer, league transition, position
  • Target: Transfer success (binary/categorical)
  • Evaluation: Accuracy, Precision, Recall
  • Use Case: Risk screening for potential transfers

M4: Breakout Candidate Classifier

  • Algorithm: LightGBM Classifier
  • Features: age, current value, career stats, growth momentum, position
  • Target: Breakout label (50%+ value growth in 2 years)
  • Use Case: Scout young talent, guide youth recruitment

Dashboard Features

Tab 1: Undervalued Players

  • Filters by position, league, minimum value, and age
  • Shows top 50 players with highest undervaluation
  • Interactive scatter plot with confidence bands
  • SHAP feature importance visualization
  • What-if scenario simulator
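
The top-50 ranking can be sketched as a sort on the gap between predicted and current value. Ranking by the absolute gap (rather than a relative one) and the record keys `predicted_value` and `market_value` are assumptions for illustration.

```python
def top_undervalued(players, limit=50):
    """Return up to `limit` players with the largest positive gap
    between predicted value and current market value."""
    candidates = []
    for p in players:
        gap = p["predicted_value"] - p["market_value"]
        if gap > 0:
            candidates.append({**p, "undervaluation": gap})
    return sorted(candidates, key=lambda r: r["undervaluation"], reverse=True)[:limit]
```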

Tab 2: Breakout Candidates

  • Filter by minimum breakout probability
  • Shows young players (age ≤ 22) ranked by probability
  • Performance metrics and predicted growth
  • Interactive visualizations

Tab 3: Player Development

  • Interactive aging curves for each position
  • Plots mean/median market value vs age
  • Percentile bands (25th-75th)
  • Player comparison to positional benchmarks

Tab 4: Player Lookup

  • Search players by partial name match
  • View detailed stats and current market value
  • Per-90 contribution visualization
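
Partial name matching can be as simple as a case-insensitive substring test. This is a minimal sketch with a hypothetical function name; the dashboard may use a different matching strategy.

```python
def search_players(names, query):
    """Return the names that contain `query`, ignoring case and surrounding spaces."""
    needle = query.strip().casefold()
    if not needle:
        return []
    return [name for name in names if needle in name.casefold()]
```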

Tab 5: Transfer Insights

  • Historical transfer analysis
  • Fee predictions vs actual fees
  • Success probability estimation
  • Bargain identification
  • What-if transfer scenario tool

Tab 6: Guide & Explainability

  • Data foundation documentation
  • Model architecture details
  • SHAP explanations
  • Validation methodology

Troubleshooting

Data Loading Errors

  • Ensure all CSV files are in data/raw/
  • Check that column names match (the code handles minor variations)
  • Verify files are not corrupted

Model Training Errors

  • Ensure feature engineering ran successfully first
  • Check that processed CSVs exist in data/processed/
  • Review error logs in soccer_ml.log

Dashboard Won't Start

  • Ensure all models are trained (check models/ directory)
  • Make sure virtual environment is activated
  • Check that Streamlit is installed: pip show streamlit

Memory Issues

  • The dataset is large; if you run out of memory, try:
    • Closing other applications
    • Using a machine with more RAM (8GB+ recommended)
    • Processing data in smaller chunks

Performance Notes

  • Feature engineering: ~5-15 minutes depending on dataset size and CPU
  • Model training: ~2-10 minutes per model
  • Dashboard: Loads in ~5-10 seconds after models are cached
  • Recommended: 8GB+ RAM for smooth performance

Testing

Run the test suite:

python -m unittest discover -s tests

Data Sources

Dataset: player-scores by davidcariboo

Contributing

This is a hackathon project. For improvements:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Submit a pull request

License

This project is for educational/hackathon purposes.

Authors

Team: Three Musketeers MLDS Hackathon 2025

Acknowledgments

  • Kaggle and davidcariboo for the player-scores dataset
  • Streamlit for the dashboard framework
  • XGBoost and LightGBM teams for the ML libraries

About

Intelligent ML Soccer Player Valuation System
