Soccer ML Analytics - Player Valuation & Breakout Prediction

Overview

A comprehensive machine learning project for soccer player analytics featuring transfer fee prediction, market value modeling, transfer outcome classification, breakout candidate identification, and aging curve analysis. Built with Python, Streamlit, XGBoost, and LightGBM.

Current State: Fully functional, with all ML models trained and the Kaggle dataset loaded. The interactive Streamlit dashboard runs on port 5000.

Project Architecture

Technology Stack

  • Frontend: Streamlit (running on port 5000)
  • ML Models: XGBoost, LightGBM, scikit-learn
  • Data Processing: pandas, numpy
  • Visualization: plotly, matplotlib, seaborn
  • Interpretability: SHAP

Directory Structure

├── app.py                      # Main Streamlit dashboard application
├── src/
│   ├── config.py              # Configuration and paths
│   ├── ui_config.py           # UI constants and styling
│   ├── app_utils.py           # Dashboard utility functions
│   ├── data_loading.py        # Data loading functions
│   ├── feature_engineering.py # Feature engineering pipeline
│   ├── modeling_valuations.py # M1, M2, M3 training
│   ├── modeling_breakouts.py  # M4 training
│   ├── analytics_aging.py     # Aging curve analysis
│   └── utils.py               # General utilities
├── data/
│   ├── raw/                   # Raw CSV files from Kaggle (user must download)
│   └── processed/             # Processed datasets (generated by scripts)
├── models/                    # Trained ML models (generated)
├── reports/                   # SHAP summaries and metrics (generated)
└── tests/                     # Unit tests

Machine Learning Models

  1. M1: Transfer Fee Regression - Predicts transfer fees based on player stats
  2. M2: Market Value Model - Explains/predicts player market values
  3. M3: Transfer Outcome Classifier - Classifies transfer success
  4. M4: Breakout Candidate Classifier - Identifies young players with high growth potential

Dashboard Features

The Streamlit dashboard provides 6 interactive tabs:

  1. Undervalued Players - Find players whose predicted value exceeds market value
  2. Breakout Candidates - Young players with high growth probability
  3. Player Development - Aging curves showing value vs age by position
  4. Player Lookup - Search for specific players
  5. Transfer Insights - Historical transfer analysis
  6. Guide & Explainability - Model documentation
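The "Undervalued Players" idea (predicted value exceeds market value) can be sketched as follows. This is an illustration only, not the dashboard's code: the `PlayerValue` record, its field names, the `min_gap_eur` threshold, and the sample figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PlayerValue:
    # Hypothetical record; field names are assumptions, not the project's schema.
    name: str
    market_value_eur: float
    predicted_value_eur: float


def rank_undervalued(players, min_gap_eur=0.0):
    """Return (player, gap) pairs where predicted value exceeds market value
    by more than min_gap_eur, largest gap first."""
    gaps = [(p, p.predicted_value_eur - p.market_value_eur) for p in players]
    undervalued = [(p, gap) for p, gap in gaps if gap > min_gap_eur]
    return sorted(undervalued, key=lambda pair: pair[1], reverse=True)


# Made-up sample data for illustration only.
sample = [
    PlayerValue("Player A", 10_000_000, 14_000_000),
    PlayerValue("Player B", 8_000_000, 7_500_000),
    PlayerValue("Player C", 2_000_000, 3_000_000),
]

for player, gap in rank_undervalued(sample):
    print(f"{player.name}: undervalued by €{gap / 1e6:.1f}M")
```

Ranking by absolute gap favors expensive players; ranking by the ratio of predicted to market value is an equally plausible choice for surfacing cheaper ones.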

Recent Changes

2025-11-21: Complete setup with trained models

  • Installed Python 3.11 and all dependencies from requirements.txt
  • Created necessary directories (data/raw, data/processed, models, reports)
  • Configured Streamlit to run on 0.0.0.0:5000 with CORS disabled for Replit proxy
  • Set up workflow for Streamlit dashboard
  • Downloaded Kaggle player-scores dataset (165MB, 10 CSV files)
  • Ran feature engineering pipeline - created all processed datasets
  • Trained all ML models, with the following reported performance:
    • M1 Transfer Fee: R² = 0.96, RMSE = €2.27M
    • M2 Market Value: R² = 0.82, RMSE = €2.57M
    • M3 Transfer Outcome: Accuracy = 61%
    • M4 Breakout Classifier: Accuracy = 80%, F1 = 0.79
  • Installed gcc-unwrapped system dependency for LightGBM support
  • Application fully functional with all 6 dashboard tabs operational
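For reference, the metrics reported above have standard definitions. The sketch below computes R², RMSE, accuracy, and binary F1 in plain Python on made-up numbers. How the project evaluates its models (which data split, which F1 averaging) is not stated here, so this only illustrates the formulas.

```python
import math


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def accuracy(labels, preds):
    """Fraction of predictions that match the true labels."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)


def f1_score(labels, preds, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    tp = sum(l == positive and p == positive for l, p in zip(labels, preds))
    fp = sum(l != positive and p == positive for l, p in zip(labels, preds))
    fn = sum(l == positive and p != positive for l, p in zip(labels, preds))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up regression targets (in € millions) and class labels.
fees_true = [1.0, 2.0, 3.0, 4.0]
fees_pred = [1.5, 2.0, 2.5, 4.0]
labels = [1, 0, 1, 1, 0]
preds = [1, 0, 0, 1, 1]

print(f"R2={r2_score(fees_true, fees_pred):.2f}  RMSE={rmse(fees_true, fees_pred):.3f}")
print(f"accuracy={accuracy(labels, preds):.2f}  F1={f1_score(labels, preds):.2f}")
```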

User Preferences

None recorded yet.

Setup Instructions

Running the Dashboard

The Streamlit dashboard is configured to start automatically: the workflow launches it on port 5000. You can also start it manually by clicking the "Run" button.

Adding Data (Required for Full Functionality)

The application requires the Kaggle player-scores dataset:

  1. Go to https://www.kaggle.com/datasets/davidcariboo/player-scores/data
  2. Download the dataset (requires Kaggle account)
  3. Extract all CSV files to the data/raw/ directory

Required CSV files:

  • players.csv
  • player_valuations.csv
  • transfers.csv
  • appearances.csv
  • games.csv
  • clubs.csv
  • competitions.csv
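Before running the pipeline, it can help to confirm that the files listed above are in place. The following is a minimal sketch, not part of the project: the helper name `missing_raw_files` is hypothetical, and the default path assumes the script runs from the project root.

```python
from pathlib import Path

# File names copied from the "Required CSV files" list.
REQUIRED_CSVS = [
    "players.csv",
    "player_valuations.csv",
    "transfers.csv",
    "appearances.csv",
    "games.csv",
    "clubs.csv",
    "competitions.csv",
]


def missing_raw_files(raw_dir="data/raw"):
    """Return the required CSV names that are not present in raw_dir."""
    raw_path = Path(raw_dir)
    return [name for name in REQUIRED_CSVS if not (raw_path / name).is_file()]


if __name__ == "__main__":
    missing = missing_raw_files()
    if missing:
        print("Missing from data/raw/:", ", ".join(missing))
    else:
        print("All required CSV files are present.")
```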

Processing Data & Training Models

Once data is added, run these commands in order:

  1. Feature Engineering (creates processed datasets):

    python -m src.feature_engineering

  2. Hyperparameter Tuning (optional, finds optimal parameters):

    python -m src.hyperparameter_tuning

  3. Train Valuation Models (M1, M2, M3):

    python -m src.modeling_valuations

  4. Train Breakout Model (M4):

    python -m src.modeling_breakouts

After these steps, the dashboard will have full functionality with all models loaded.
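The steps above can also be chained from a small driver script so that a failure in one step stops the rest. This is a sketch under assumptions: the module names come from the commands listed above, while `pipeline_commands` and `run_pipeline` are hypothetical helpers that do not exist in the project.

```python
import subprocess
import sys

# (module, is_optional) pairs in the order the steps should run.
PIPELINE_STEPS = [
    ("src.feature_engineering", False),
    ("src.hyperparameter_tuning", True),  # optional tuning step
    ("src.modeling_valuations", False),
    ("src.modeling_breakouts", False),
]


def pipeline_commands(include_tuning=False):
    """Build the `python -m <module>` commands in execution order."""
    return [
        [sys.executable, "-m", module]
        for module, optional in PIPELINE_STEPS
        if include_tuning or not optional
    ]


def run_pipeline(include_tuning=False):
    """Run each step in order; check=True raises on the first failing step."""
    for command in pipeline_commands(include_tuning):
        print("Running:", " ".join(command))
        subprocess.run(command, check=True)
```

Calling `run_pipeline()` from the project root would run the three required steps; `run_pipeline(include_tuning=True)` adds the optional tuning step between feature engineering and training.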

Configuration

Streamlit Configuration

  • Port: 5000 (required for Replit webview)
  • Address: 0.0.0.0 (binds to all network interfaces so the Replit proxy can reach the app)
  • CORS: Disabled (for iframe access)
  • Config file: .streamlit/config.toml

Environment

  • Python Version: 3.11
  • Package Management: pip
  • Virtual Environment: Not used (Replit manages environment)

Deployment

The application is configured for deployment via Replit's deployment system:

  • Type: Autoscale (stateless web application)
  • Port: 5000
  • Run Command: streamlit run app.py

Known Issues

  • Application requires Kaggle dataset to be manually downloaded and placed in data/raw/
  • Models need to be trained before full dashboard functionality is available
  • Some LSP diagnostics present in app.py (non-critical, related to optional features)

Development Notes

  • No virtual environment needed - Replit manages the Python environment
  • Dependencies are installed globally in the Replit workspace
  • Models are saved as .joblib files (large, excluded from git)
  • Processed data is excluded from git (regenerate via feature engineering)

Testing

Run unit tests with:

python -m unittest discover -s tests

Resources