A comprehensive machine learning project for soccer player analytics, featuring transfer fee prediction, market value modeling, transfer outcome classification, breakout candidate identification, and aging curve analysis.
This project uses the Kaggle player-scores dataset by davidcariboo to build multiple ML models for soccer analytics:
- M1: Transfer Fee Regression - Predicts transfer fees based on player stats
- M2: Market Value Model - Explains/predicts player market values
- M3: Transfer Outcome Classifier - Classifies transfer success
- M4: Breakout Candidate Classifier - Identifies young players with high growth potential
Plus interactive analytics:
- Player aging curves by position
- Undervalued player identification
- Interactive Streamlit dashboard
- Python 3.10+
- Windows 10/11, macOS 13+ (Ventura/Sonoma), or a modern Linux distro with Python 3.10 available. Commands below include PowerShell/CMD and bash/zsh equivalents.
- Kaggle dataset: https://www.kaggle.com/datasets/davidcariboo/player-scores/data
Clone or download this repository to your local machine.
- Go to https://www.kaggle.com/datasets/davidcariboo/player-scores/data
- Download the dataset (you will need a Kaggle account)
- Extract all CSV files to the data/raw/ directory
Your data/raw/ folder should contain:
- players.csv
- player_valuations.csv
- transfers.csv
- appearances.csv
- games.csv
- clubs.csv
- competitions.csv
- game_events.csv (optional)
- club_games.csv (optional)
- game_lineups.csv (optional)
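Before running the pipeline, it can help to confirm that the required files are in place. The following is a minimal sketch, not part of the repository: the helper names `missing_files` and `check_raw_dir` are hypothetical, and the file list is copied from the list above (optional files are not checked).

```python
from pathlib import Path

# Required file names copied from the list above; optional files are skipped.
REQUIRED_FILES = [
    "players.csv",
    "player_valuations.csv",
    "transfers.csv",
    "appearances.csv",
    "games.csv",
    "clubs.csv",
    "competitions.csv",
]


def missing_files(present_names):
    """Return the required file names absent from an iterable of names."""
    present = set(present_names)
    return [name for name in REQUIRED_FILES if name not in present]


def check_raw_dir(raw_dir="data/raw"):
    """List required CSVs missing from raw_dir (empty list = all present)."""
    folder = Path(raw_dir)
    names = [p.name for p in folder.glob("*.csv")] if folder.is_dir() else []
    return missing_files(names)


if __name__ == "__main__":
    missing = check_raw_dir()
    print("All required files found." if not missing else f"Missing: {missing}")
```

Run it from the project root so that the relative data/raw path resolves correctly.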
Open a terminal, navigate to the project directory, and create the virtual environment.
Windows (PowerShell/CMD):
cd "D:\Coding Projetcs\MLDS Hackathon\hackathon-2025-three-muscketeers"
python -m venv .venv
macOS/Linux (bash/zsh):
cd "/path/to/MLDS Hackathon/hackathon-2025-three-muscketeers"
python3 -m venv .venv
Activate the virtual environment.
Windows PowerShell:
.\.venv\Scripts\Activate.ps1
Windows CMD:
.venv\Scripts\activate.bat
macOS/Linux (bash/zsh):
source .venv/bin/activate
If you get an execution policy error in PowerShell, run:
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
Install the dependencies:
pip install -r requirements.txt
macOS/Linux users may need to use pip3 depending on their Python installation.
This creates all processed datasets needed for modeling:
python -m src.feature_engineering
This will create 4 processed datasets in data/processed/:
- player_snapshot.csv - Latest player info with aggregated stats
- transfer_model_dataset.csv - Transfer records with features
- breakout_model_dataset.csv - Young player candidates with growth labels
- aging_curves_by_pos.csv - Average value by position and age
This searches for optimized hyperparameters for model training:
python -m src.hyperparameter_tuning
Train the transfer fee, market value, and outcome models:
python -m src.modeling_valuations
This trains and saves:
- models/transfer_fee_regressor.joblib (M1)
- models/value_growth_regressor.joblib (M2)
- models/transfer_outcome_classifier.joblib (M3)
Train the breakout candidate classifier:
python -m src.modeling_breakouts
This saves:
- models/breakout_classifier.joblib (M4)
Run the Streamlit dashboard:
streamlit run app.py
This will open your browser to http://localhost:8501 with 6 tabs:
- Undervalued Players - Players whose predicted value > current market value
- Breakout Candidates - Young players with high growth probability
- Player Development - Aging curves showing value vs age by position
- Player Lookup - Search for specific players
- Transfer Insights - Historical transfer analysis
- Guide & Explainability - Model documentation
Our feature engineering pipeline transforms raw Kaggle data into ML-ready datasets through a multi-stage process.
The data processing runs via python -m src.feature_engineering and creates 4 processed datasets:
- player_snapshot.csv - Latest player profiles with aggregated career metrics
- transfer_model_dataset.csv - Transfer records with contextual features
- breakout_model_dataset.csv - Young player candidates with growth labels
- aging_curves_by_pos.csv - Position-specific value trajectories by age
- Load raw CSVs from data/raw/: players, valuations, transfers, appearances, games, clubs, competitions
- Parse dates and handle missing values
- Standardize position labels (GK, DEF, MID, FWD)
- Assign league tiers (1=Big 5, 2=Mid-tier European, 3=Others)
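The label standardization and tier assignment could look roughly like the sketch below. Everything specific here is an assumption for illustration: the raw position spellings, the competition codes, and the tier-2 membership are hypothetical and should be checked against the actual CSVs and src/feature_engineering.py.

```python
# Assumed raw labels mapped to the four standardized codes. The raw
# spellings are an assumption; inspect players.csv for the real values.
POSITION_MAP = {
    "goalkeeper": "GK",
    "defender": "DEF",
    "midfield": "MID",
    "attack": "FWD",
}

# Hypothetical competition codes, used only to illustrate the tiering.
BIG_5 = {"GB1", "ES1", "IT1", "L1", "FR1"}
MID_TIER = {"NL1", "PO1", "BE1"}


def standardize_position(raw):
    """Map a raw position label to GK/DEF/MID/FWD, or None if unrecognized."""
    if raw is None:
        return None
    return POSITION_MAP.get(raw.strip().lower())


def league_tier(competition_id):
    """Return 1 for Big 5, 2 for mid-tier European, 3 for everything else."""
    if competition_id in BIG_5:
        return 1
    if competition_id in MID_TIER:
        return 2
    return 3
```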
Player Profile Features:
- Age, age², position, height, foot preference, nationality
- Contract expiration tracking (months remaining)
- Current club and league context
Performance Metrics:
- Season-by-season aggregation from appearances data
- Career totals: goals, assists, minutes, appearances
- Per-90 statistics: goals_per_90, assists_per_90
- Last season form indicators
- Minimum 300 minutes threshold to filter noise
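As an illustration of the per-90 statistics and the 300-minute threshold, here is a minimal sketch. The function name `per_90` is hypothetical, and returning None for players below the threshold is an assumption; the actual pipeline may drop those rows or fill them differently.

```python
MIN_MINUTES = 300  # threshold stated in the list above


def per_90(total, minutes, min_minutes=MIN_MINUTES):
    """Scale a career or season total to a per-90-minutes rate.

    Returns None when the player has too few minutes for the rate to be
    meaningful (assumed behavior, not confirmed by the source).
    """
    if minutes is None or minutes < min_minutes:
        return None
    return total * 90.0 / minutes


goals_per_90 = per_90(total=10, minutes=900)
assists_per_90 = per_90(total=3, minutes=250)  # below threshold
```

The threshold matters because a player with one goal in 45 minutes would otherwise show two goals per 90, which says little about real scoring ability.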
Market Value Temporal Features:
- 1-year growth rate (percentage change)
- Peak value ever attained
- Volatility (coefficient of variation)
- Trend classification (rising/flat/declining based on last 3 valuations)
- 6-month momentum for recent trajectory
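The temporal features above can be sketched with the standard library as follows. This is an illustration under assumptions, not the project's implementation: the function names are hypothetical, the 5% tolerance used to call a trend "flat" is made up, and the trend rule (compare the first and last of the most recent three valuations) is one plausible reading of the description.

```python
from statistics import mean, pstdev


def growth_rate(old, new):
    """Percentage change from old to new; None if old is missing or zero."""
    if not old:
        return None
    return (new - old) / old * 100.0


def peak_value(values):
    """Highest valuation ever recorded, or None for an empty history."""
    return max(values) if values else None


def volatility(values):
    """Coefficient of variation: population std dev divided by the mean."""
    if len(values) < 2:
        return 0.0
    avg = mean(values)
    return pstdev(values) / avg if avg else 0.0


def trend(values, tolerance=0.05):
    """Classify the last three valuations as rising, flat, or declining."""
    last = values[-3:]
    if len(last) < 2 or not last[0]:
        return "flat"
    change = (last[-1] - last[0]) / last[0]
    if change > tolerance:
        return "rising"
    if change < -tolerance:
        return "declining"
    return "flat"
```

Each function expects valuations in chronological order, oldest first.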
Transfer History:
- Total transfer count and frequency
- Average transfer fee
- Years since last transfer
- Career span calculation
Advanced Context Features:
- Discipline: Yellow/red cards per 90, composite discipline score
- International Experience: Caps, minutes, goals at international level
- Value Trajectory: Current vs. peak ratio, momentum indicators
- Agent Quality: Top agent identification (top 20 by client value)
- League Transitions: Step-up/down detection, league strength comparison
Aging Curve Baselines:
- Position-age value distributions (mean, median, 25th/75th percentiles)
- Player percentile ranks within position-age cohorts
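A percentile rank within a position-age cohort can be computed as in the sketch below. The name `percentile_rank` and the "share of values less than or equal to" definition are assumptions; the pipeline may use a different ranking convention.

```python
from bisect import bisect_right


def percentile_rank(value, cohort_values):
    """Percent of cohort values that are less than or equal to value."""
    if not cohort_values:
        return None
    ordered = sorted(cohort_values)
    return 100.0 * bisect_right(ordered, value) / len(ordered)
```

Here `cohort_values` would hold the market values of all players sharing the same position and age.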
Player Snapshot:
- One row per player with latest valuation
- Merges all feature categories above
- Fills missing values with domain-appropriate defaults
- Enriches with club/league metadata
Transfer Model Dataset:
- One row per paid transfer (fee > 0)
- Features captured at transfer date (historical point-in-time)
- Recent season stats before transfer
- League transition indicators (from/to league strength)
- Post-transfer outcomes: future value (12-24 months), playing time
- Transfer outcome label: Success if value growth ≥15% OR minutes ≥1200 OR (minutes ≥900 AND value growth ≥0)
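The transfer outcome rule translates directly into a boolean expression. In this sketch the function name is hypothetical, value growth is assumed to be expressed in percent, and missing values are assumed to have been handled before the call.

```python
def transfer_success(value_growth_pct, minutes):
    """Apply the success rule stated above.

    Success if value growth >= 15%, OR minutes >= 1200,
    OR (minutes >= 900 AND value growth >= 0).
    """
    return (
        value_growth_pct >= 15
        or minutes >= 1200
        or (minutes >= 900 and value_growth_pct >= 0)
    )
```

The third clause lets a regular squad player whose value held steady count as a success even without reaching either of the stronger thresholds.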
Breakout Model Dataset:
- Filters players aged 16-23 with ≥300 career minutes
- Calculates 2-year future value (8 quarters ahead)
- Breakout label: 1 if value growth >50% in 2 years, else 0
- Includes percentile ranks and momentum features
- Focus on growth potential indicators
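The candidate filter and breakout label can be sketched as below. Function names are hypothetical, and returning None when the current or future value is unavailable is an assumption about how unlabeled rows are treated.

```python
def is_candidate(age, career_minutes):
    """Eligibility filter: aged 16-23 with at least 300 career minutes."""
    return 16 <= age <= 23 and career_minutes >= 300


def breakout_label(current_value, future_value):
    """Return 1 if value grows by more than 50% over two years, else 0.

    Returns None when either value is missing or the current value is zero,
    since growth cannot be computed (assumed handling).
    """
    if not current_value or future_value is None:
        return None
    growth = (future_value - current_value) / current_value
    return 1 if growth > 0.5 else 0
```

Note that growth of exactly 50% yields label 0 under the strict ">50%" rule.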
Aging Curves Dataset:
- Aggregates all valuations by position and age (16-40)
- Calculates mean, median, and percentile bands
- Drops position-age groups with fewer than 10 observations
- Used for visualizing typical career trajectories
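The aggregation behind the aging curves could be sketched as follows, using only the standard library. The function name, the record layout, and the "inclusive" quantile method are assumptions; the project most likely does this with a DataFrame group-by, and its percentile interpolation may differ.

```python
from collections import defaultdict
from statistics import mean, median, quantiles

MIN_OBSERVATIONS = 10  # groups smaller than this are dropped


def aging_curves(records, min_obs=MIN_OBSERVATIONS):
    """Aggregate (position, age, value) tuples into one row per group."""
    groups = defaultdict(list)
    for position, age, value in records:
        if 16 <= age <= 40:  # age range stated above
            groups[(position, age)].append(value)

    rows = []
    for (position, age), values in sorted(groups.items()):
        if len(values) < min_obs:
            continue  # too few observations for a stable estimate
        p25, _, p75 = quantiles(values, n=4, method="inclusive")
        rows.append({
            "position": position,
            "age": age,
            "n": len(values),
            "mean": mean(values),
            "median": median(values),
            "p25": p25,
            "p75": p75,
        })
    return rows
```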
- Replace infinities with 0 or appropriate defaults
- Enforce minimum playing time thresholds (300+ minutes)
- Fill missing numeric values with 0
- Clip outlier ratios to reasonable bounds
- Sort temporal data for consistency
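Two of the cleaning steps above, replacing non-finite values and clipping ratios, are shown in this minimal sketch. The helper names and the example bounds are hypothetical; the actual defaults and clip limits live in the pipeline code.

```python
import math


def clean_numeric(value, default=0.0):
    """Replace None, NaN, and +/-infinity with a default value."""
    if value is None or math.isnan(value) or math.isinf(value):
        return default
    return value


def clip(value, lower, upper):
    """Constrain value to the closed interval [lower, upper]."""
    return max(lower, min(upper, value))


# Example: a ratio computed with a zero denominator becomes inf upstream.
raw_ratio = float("inf")
safe_ratio = clip(clean_numeric(raw_ratio), 0.0, 5.0)  # bounds are illustrative
```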
- Point-in-time integrity: Transfer features use only data available before the transfer date
- Minimum thresholds: 300 minutes minimum to avoid noisy per-90 metrics
- Position simplification: 4 main categories for statistical power
- League tiering: Big 5 leagues are assigned their own tier (tier 1) in league features
- Temporal windows: 1-year for growth, 2-year for breakout labels
- Outcome definitions: Combine value growth + playing time for transfer success
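Point-in-time integrity is the decision most prone to subtle leakage, so a small illustration may help. The sketch below picks the latest valuation dated strictly before a transfer; the function name and the (date, value) tuple layout are assumptions made for the example.

```python
from datetime import date


def latest_before(history, cutoff):
    """Return the most recent (date, value) entry dated strictly before cutoff.

    Entries on or after the cutoff are ignored so that no information from
    the transfer date onward leaks into the features.
    """
    earlier = [entry for entry in history if entry[0] < cutoff]
    return max(earlier, key=lambda entry: entry[0]) if earlier else None


valuations = [
    (date(2021, 1, 1), 5_000_000),
    (date(2022, 1, 1), 8_000_000),
    (date(2023, 1, 1), 12_000_000),
]
value_at_transfer = latest_before(valuations, cutoff=date(2022, 6, 1))
```

Using the 2023 valuation for a mid-2022 transfer would let the model see the outcome it is supposed to predict.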
- Player Snapshot: ~30,000+ rows (all players with valuations)
- Transfer Dataset: ~25,000+ rows (paid transfers only)
- Breakout Dataset: ~8,000+ rows (young players with future data)
- Aging Curves: ~400+ rows (position-age combinations)
- Runtime: 5-15 minutes depending on hardware
hackathon-2025-three-muscketeers/
├── app.py # Main Streamlit dashboard
├── requirements.txt # Python dependencies
├── README.md # This file
├── data/
│ ├── raw/ # Raw CSV files (download from Kaggle)
│ └── processed/ # Processed datasets (generated)
├── models/ # Trained models (generated)
├── reports/ # SHAP summaries and metrics (generated)
├── src/
│ ├── config.py # Configuration and paths
│ ├── ui_config.py # UI constants and styling
│ ├── app_utils.py # Utility functions
│ ├── data_loading.py # Data loading functions
│ ├── feature_engineering.py # Feature engineering pipeline
│ ├── modeling_valuations.py # M1, M2, M3 training
│ ├── modeling_breakouts.py # M4 training
│ ├── analytics_aging.py # Aging curve analysis
│ └── utils.py # General utilities
└── tests/
    └── test_feature_engineering_utils.py # Unit tests
M1: Transfer Fee Regression
- Algorithm: XGBoost Regressor
- Features: age, recent performance stats, market value, league strength, position
- Target: Transfer fee in EUR (log-transformed)
- Evaluation: RMSE, R²
- Use Case: Benchmarking negotiations, identifying bargains/overpays
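Because the regression target is log-transformed, predictions have to be mapped back to EUR before they are read as fees. The sketch below shows one way to do this together with RMSE. It assumes a log1p/expm1 pair, which is a common choice but not confirmed by the source; the function names are hypothetical.

```python
import math


def to_log_target(fee_eur):
    """Transform a fee in EUR to the log scale (assumes log1p)."""
    return math.log1p(fee_eur)


def from_log_prediction(prediction):
    """Invert the transform to get back a fee in EUR."""
    return math.expm1(prediction)


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))
```

An RMSE computed on the log scale measures relative error, so it is not directly comparable to an RMSE computed in EUR.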
M2: Market Value Model
- Algorithm: XGBoost Regressor
- Features: age, career stats, goals/assists per 90, league tier, position
- Target: Market value (log-transformed)
- Use Case: Identify undervalued players, what-if simulations
M3: Transfer Outcome Classifier
- Algorithm: XGBoost Classifier
- Features: age, stats, market value at transfer, league transition, position
- Target: Transfer success (binary/categorical)
- Evaluation: Accuracy, Precision, Recall
- Use Case: Risk screening for potential transfers
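For reference, the three evaluation metrics listed for the classifier can be computed by hand as in this sketch. It assumes binary labels (1 = success, 0 = failure); the function name is hypothetical and the project may rely on a library implementation instead.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, and recall for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        # Guard against division by zero when a class is never predicted.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }
```

For risk screening, recall on failed transfers may matter more than overall accuracy, so it is worth reporting metrics per class.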
M4: Breakout Candidate Classifier
- Algorithm: LightGBM Classifier
- Features: age, current value, career stats, growth momentum, position
- Target: Breakout label (50%+ value growth in 2 years)
- Use Case: Scout young talent, guide youth recruitment
Undervalued Players:
- Filters by position, league, minimum value, and age
- Shows top 50 players with highest undervaluation
- Interactive scatter plot with confidence bands
- SHAP feature importance visualization
- What-if scenario simulator
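The undervalued-player ranking can be illustrated with a short sketch. It assumes undervaluation is measured as the absolute gap between predicted and current market value; the dashboard may rank by a ratio instead. The function name and dictionary keys are hypothetical.

```python
def top_undervalued(players, limit=50):
    """Rank players whose predicted value exceeds their market value.

    players: list of dicts with 'name', 'predicted_value', 'market_value'.
    Returns up to `limit` dicts, each with an added 'gap' key, largest first.
    """
    undervalued = [
        dict(player, gap=player["predicted_value"] - player["market_value"])
        for player in players
        if player["predicted_value"] > player["market_value"]
    ]
    return sorted(undervalued, key=lambda p: p["gap"], reverse=True)[:limit]
```

An absolute gap favors expensive players, while a ratio favors cheap ones, so the choice changes which names appear at the top.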
Breakout Candidates:
- Filter by minimum breakout probability
- Shows young players (age ≤ 22) ranked by probability
- Performance metrics and predicted growth
- Interactive visualizations
Player Development:
- Interactive aging curves for each position
- Plots mean/median market value vs age
- Percentile bands (25th-75th)
- Player comparison to positional benchmarks
Player Lookup:
- Search players by partial name match
- View detailed stats and current market value
- Per-90 contribution visualization
Transfer Insights:
- Historical transfer analysis
- Fee predictions vs actual fees
- Success probability estimation
- Bargain identification
- What-if transfer scenario tool
Guide & Explainability:
- Data foundation documentation
- Model architecture details
- SHAP explanations
- Validation methodology
- Ensure all CSV files are in data/raw/
- Check that column names match (the code handles minor variations)
- Verify files are not corrupted
- Ensure feature engineering ran successfully first
- Check that processed CSVs exist in data/processed/
- Review error logs in soccer_ml.log
- Ensure all models are trained (check models/ directory)
- Make sure virtual environment is activated
- Check that Streamlit is installed: pip show streamlit
- The dataset is large; if you run out of memory, try:
  - Closing other applications
  - Using a machine with more RAM (8GB+ recommended)
  - Processing data in smaller chunks
- Feature engineering: ~5-15 minutes depending on dataset size and CPU
- Model training: ~2-10 minutes per model
- Dashboard: Loads in ~5-10 seconds after models are cached
- Recommended: 8GB+ RAM for smooth performance
Run the test suite:
python -m unittest discover -s tests
Dataset: player-scores by davidcariboo
- URL: https://www.kaggle.com/datasets/davidcariboo/player-scores/data
- License: As specified on Kaggle
- Data dictionary: See Kaggle page for column descriptions
This is a hackathon project. For improvements:
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Submit a pull request
This project is for educational/hackathon purposes.
Team: Three Musketeers MLDS Hackathon 2025
- Kaggle and davidcariboo for the player-scores dataset
- Streamlit for the dashboard framework
- XGBoost and LightGBM teams for the ML libraries