A comprehensive machine learning project for soccer player analytics featuring transfer fee prediction, market value modeling, transfer outcome classification, breakout candidate identification, and aging curve analysis. Built with Python, Streamlit, XGBoost, and LightGBM.
Current State: Fully functional with all ML models trained and Kaggle dataset loaded. Interactive Streamlit dashboard running on port 5000.
- Frontend: Streamlit (running on port 5000)
- ML Models: XGBoost, LightGBM, scikit-learn
- Data Processing: pandas, numpy
- Visualization: plotly, matplotlib, seaborn
- Interpretability: SHAP
├── app.py # Main Streamlit dashboard application
├── src/
│ ├── config.py # Configuration and paths
│ ├── ui_config.py # UI constants and styling
│ ├── app_utils.py # Dashboard utility functions
│ ├── data_loading.py # Data loading functions
│ ├── feature_engineering.py # Feature engineering pipeline
│ ├── modeling_valuations.py # M1, M2, M3 training
│ ├── modeling_breakouts.py # M4 training
│ ├── analytics_aging.py # Aging curve analysis
│ └── utils.py # General utilities
├── data/
│ ├── raw/ # Raw CSV files from Kaggle (user must download)
│ └── processed/ # Processed datasets (generated by scripts)
├── models/ # Trained ML models (generated)
├── reports/ # SHAP summaries and metrics (generated)
└── tests/ # Unit tests
- M1: Transfer Fee Regression - Predicts transfer fees based on player stats
- M2: Market Value Model - Explains/predicts player market values
- M3: Transfer Outcome Classifier - Classifies transfer success
- M4: Breakout Candidate Classifier - Identifies young players with high growth potential
The Streamlit dashboard provides 6 interactive tabs:
- Undervalued Players - Find players whose predicted value exceeds market value
- Breakout Candidates - Young players with high growth probability
- Player Development - Aging curves showing value vs age by position
- Player Lookup - Search for specific players
- Transfer Insights - Historical transfer analysis
- Guide & Explainability - Model documentation
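The "Undervalued Players" idea above (predicted value exceeding market value) can be illustrated with a small sketch. This is not the dashboard's actual code: the `PlayerValuation` class, its field names, and the `undervalued` helper are hypothetical. It uses plain Python so that it runs without the project's dependencies.

```python
from dataclasses import dataclass


@dataclass
class PlayerValuation:
    # Hypothetical field names, chosen for illustration only.
    name: str
    market_value_eur: float     # current market value
    predicted_value_eur: float  # model-predicted value


def undervalued(players, min_gap_eur=0.0):
    """Return (player, gap) pairs where predicted value exceeds market value.

    The gap is predicted minus market value; results are sorted by gap,
    largest first. Players whose gap is not above min_gap_eur are dropped.
    """
    ranked = [
        (p, p.predicted_value_eur - p.market_value_eur)
        for p in players
        if p.predicted_value_eur - p.market_value_eur > min_gap_eur
    ]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Made-up example values, not taken from the dataset.
players = [
    PlayerValuation("Player A", 10_000_000, 14_000_000),
    PlayerValuation("Player B", 25_000_000, 22_000_000),
    PlayerValuation("Player C", 5_000_000, 5_500_000),
]

for player, gap in undervalued(players):
    print(f"{player.name}: predicted value exceeds market value by €{gap / 1e6:.1f}M")
```

Raising `min_gap_eur` narrows the list to players with a larger gap, which may be useful because small gaps can fall within the model's error (the reported RMSE is in the low millions of euros).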
2025-11-21: Complete setup with trained models
- Installed Python 3.11 and all dependencies from requirements.txt
- Created necessary directories (data/raw, data/processed, models, reports)
- Configured Streamlit to run on 0.0.0.0:5000 with CORS disabled for Replit proxy
- Set up workflow for Streamlit dashboard
- Downloaded Kaggle player-scores dataset (165MB, 10 CSV files)
- Ran feature engineering pipeline - created all processed datasets
- Trained all ML models, with the following reported metrics:
- M1 Transfer Fee: R² = 0.96, RMSE = €2.27M
- M2 Market Value: R² = 0.82, RMSE = €2.57M
- M3 Transfer Outcome: Accuracy = 61%
- M4 Breakout Classifier: Accuracy = 80%, F1 = 0.79
- Installed gcc-unwrapped system dependency for LightGBM support
- Application fully functional with all 6 dashboard tabs operational
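For readers unfamiliar with the metrics reported in the changelog above (R², RMSE, accuracy, F1), the following sketch shows how they are defined. It is a plain-Python illustration on made-up numbers, not the project's evaluation code. The binary form of F1 is an assumption, since the averaging used for M4 is not stated here.

```python
import math


def r2_and_rmse(y_true, y_pred):
    """Return (R², RMSE) for a regression model's predictions."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1.0 - ss_res / ss_tot, math.sqrt(ss_res / n)


def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, binary F1) for a classifier's predictions."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return correct / len(pairs), f1


# Made-up values purely for illustration.
r2, rmse = r2_and_rmse([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
acc, f1 = accuracy_and_f1([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
print(f"R² = {r2:.2f}, RMSE = {rmse:.3f}")
print(f"Accuracy = {acc:.2f}, F1 = {f1:.3f}")
```

RMSE is in the same unit as the target, which is why the M1 and M2 figures are quoted in euros.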
None recorded yet.
The Streamlit dashboard is configured to start automatically on port 5000. Click the "Run" button to launch it manually, or let the configured workflow start it.
The application requires the Kaggle player-scores dataset:
- Go to https://www.kaggle.com/datasets/davidcariboo/player-scores/data
- Download the dataset (requires Kaggle account)
- Extract all CSV files to the data/raw/ directory
Required CSV files:
- players.csv
- player_valuations.csv
- transfers.csv
- appearances.csv
- games.csv
- clubs.csv
- competitions.csv
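Before running the pipeline, it may help to confirm that the required files listed above are in place. The sketch below is a hypothetical helper (`missing_raw_files` is not part of the repository) that reports which of the seven files are absent from a directory, assuming the default `data/raw` location relative to the project root.

```python
from pathlib import Path

# File names taken from the "Required CSV files" list.
REQUIRED_CSVS = (
    "players.csv",
    "player_valuations.csv",
    "transfers.csv",
    "appearances.csv",
    "games.csv",
    "clubs.csv",
    "competitions.csv",
)


def missing_raw_files(raw_dir="data/raw"):
    """Return the required CSV names that are not present in raw_dir."""
    raw_path = Path(raw_dir)
    return [name for name in REQUIRED_CSVS if not (raw_path / name).is_file()]


missing = missing_raw_files()
if missing:
    print("Missing from data/raw/: " + ", ".join(missing))
else:
    print("All required CSV files are present.")
```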
Once data is added, run these commands in order:
1. Feature Engineering (creates processed datasets):
   python -m src.feature_engineering
2. Hyperparameter Tuning (optional, finds optimal parameters):
   python -m src.hyperparameter_tuning
3. Train Valuation Models (M1, M2, M3):
   python -m src.modeling_valuations
4. Train Breakout Model (M4):
   python -m src.modeling_breakouts
After these steps, the dashboard will have full functionality with all models loaded.
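The commands above could also be chained from a single script. The following is a minimal sketch, assuming it is run from the project root. The `STEPS` table and the `pipeline_commands` and `run_pipeline` helpers are hypothetical and not part of the repository; only the module names come from the steps above.

```python
import subprocess
import sys

# (description, module, optional) in the order given in the steps above.
STEPS = [
    ("Feature engineering", "src.feature_engineering", False),
    ("Hyperparameter tuning", "src.hyperparameter_tuning", True),
    ("Train valuation models (M1, M2, M3)", "src.modeling_valuations", False),
    ("Train breakout model (M4)", "src.modeling_breakouts", False),
]


def pipeline_commands(include_tuning=False):
    """Build the `python -m <module>` command for each step, in order."""
    return [
        [sys.executable, "-m", module]
        for _, module, optional in STEPS
        if include_tuning or not optional
    ]


def run_pipeline(include_tuning=False):
    """Run each step in turn; check=True stops at the first failing step."""
    for command in pipeline_commands(include_tuning):
        print("Running:", " ".join(command))
        subprocess.run(command, check=True)


# Show what would be executed without running anything.
for command in pipeline_commands():
    print(" ".join(command))
```

Calling `run_pipeline()` executes the steps; `run_pipeline(include_tuning=True)` also runs the optional tuning step. Stopping on the first failure avoids training models on incomplete processed data.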
- Port: 5000 (required for Replit webview)
- Address: 0.0.0.0 (binds to all network interfaces so the Replit proxy can reach the app)
- CORS: Disabled (for iframe access)
- Config file: .streamlit/config.toml
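The settings above map onto Streamlit's `[server]` configuration section. The sketch below shows what a matching `.streamlit/config.toml` might contain and writes it with a hypothetical helper, `write_streamlit_config`. The actual file in the repository is not reproduced here and may contain additional options.

```python
from pathlib import Path

# Assumed file contents, reconstructed from the three settings listed above.
CONFIG_TOML = """\
[server]
port = 5000
address = "0.0.0.0"
enableCORS = false
"""


def write_streamlit_config(project_root="."):
    """Write the config to <project_root>/.streamlit/config.toml and return its path."""
    config_path = Path(project_root) / ".streamlit" / "config.toml"
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(CONFIG_TOML, encoding="utf-8")
    return config_path


print(CONFIG_TOML)
```

Streamlit's documentation notes that XSRF protection can take precedence over a disabled CORS setting, so a working setup may also need `enableXsrfProtection = false`. Whether this project sets it is not stated here.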
- Python Version: 3.11
- Package Management: pip
- Virtual Environment: Not used (Replit manages environment)
The application is configured for deployment via Replit's deployment system:
- Type: Autoscale (stateless web application)
- Port: 5000
- Run Command: streamlit run app.py
- Application requires Kaggle dataset to be manually downloaded and placed in data/raw/
- Models need to be trained before full dashboard functionality is available
- Some LSP diagnostics present in app.py (non-critical, related to optional features)
- No virtual environment needed - Replit manages the Python environment
- Dependencies are installed globally in the Replit workspace
- Models are saved as .joblib files (large, excluded from git)
- Processed data is excluded from git (regenerate via feature engineering)
Run unit tests with:
python -m unittest discover -s tests
- Original GitHub repository: Three Musketeers - MLDS Hackathon 2025
- Kaggle Dataset: https://www.kaggle.com/datasets/davidcariboo/player-scores/data
- Streamlit Documentation: https://docs.streamlit.io