This project analyzes how player skill and course characteristics interact to shape tournament outcomes on the PGA Tour. Using strokes-gained statistics and unsupervised course clustering, we identify measurable patterns in player performance and course design to better understand the factors that influence success at the tournament level.
Golf is a sport defined by volatility: small differences in course setup or execution can drastically alter outcomes. This project investigates whether that apparent randomness can be explained by patterns in the data.
- Unsupervised Learning: Groups PGA Tour courses into clusters based on both physical and performance-based characteristics, revealing structural similarities among courses.
- Unsupervised Model Evaluation: Evaluates the sensitivity of our best-performing K-Means clustering model to a range of UMAP hyperparameters.
- Supervised Learning: Aims to determine whether measurable patterns in player performance help explain what separates top finishers from the rest of the field. We use tournament-level strokes-gained statistics and course information to explore how skill areas such as Off the Tee, Approach, Around the Green, and Putting relate to the likelihood of a Top-10 finish among players who make the cut. This identifies consistent, data-driven relationships between a player's skill profile and their performance, complementing the unsupervised analysis.
- Supervised Model Evaluation: Evaluates our best-performing model, XGBoost, through calibration, feature, sensitivity, learning-curve, and error analyses.
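To make the Top-10 target described under Supervised Learning concrete, here is a minimal sketch of how such a label could be built. It is not the project's actual feature-engineering code: the `position` field, its string format (`"1"`, `"T5"`, `"CUT"`), and both function names are assumptions chosen for illustration.

```python
def parse_position(position):
    """Convert strings such as '1', 'T5' or 'CUT' to an int, or None if unranked.

    The string format is an assumption; adjust it to the real data.
    """
    text = str(position).strip().upper().lstrip("T")
    return int(text) if text.isdigit() else None


def label_top10(rows, cutoff=10):
    """Keep players who made the cut and attach a binary `top10` target."""
    labeled = []
    for row in rows:
        rank = parse_position(row["position"])  # hypothetical column name
        if rank is None:
            continue  # no numeric finish (e.g. missed cut or withdrew)
        labeled.append({**row, "top10": int(rank <= cutoff)})
    return labeled
```

Dropping unranked players before labeling mirrors the "among players who make the cut" framing: the classifier then compares top finishers only against others who completed the tournament.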
A detailed explanation of our methods, data preparation, modeling process, and results can be found in the accompanying report: Understanding PGA Tour Performance Through Data-Driven Player and Course Modeling.
This project relies on three primary data sources:
- DataGolf Course Statistics – Publicly available course-level data containing scoring, strokes-gained, and performance metrics for all ShotLink-equipped PGA Tour venues.
  Website: https://datagolf.com
- GCSAA Tournament Fact Sheets – PDF documents published by the Golf Course Superintendents Association of America providing physical course characteristics (e.g., grass types, green sizes, hazards).
  Website: https://www.gcsaa.org
- PGA Tour Stats Scraper – Open-source project that extracts tournament-level strokes-gained statistics directly from the official PGA Tour website.
  Repository: https://github.com/shaunyap01/PGA-Tour-Stats-Scraper
  - Due to the size of the original files obtained from this project, the raw data files are not included in this repository. However, the source data can be gathered directly from that project, and the code used to collect and compile the strokes-gained statistics is provided here in strokes_gained_data_loader.py. This script aggregates all tournaments from 2012–2024 into a single consolidated dataframe that serves as the foundation for the supervised analysis. The resulting dataset has been saved under data/pga_tournament_stats_2012-2024.csv.
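As a rough illustration of the consolidation step described for strokes_gained_data_loader.py, the sketch below stacks per-tournament CSV files into a single file using only the standard library. It is not the project's script: the `<raw_dir>/<season>/<tournament>.csv` layout, the added `season` and `tournament` columns, and the function name are all assumptions.

```python
import csv
from pathlib import Path


def consolidate_tournaments(raw_dir, out_path, seasons=range(2012, 2025)):
    """Stack every CSV under raw_dir/<season>/ into one file (assumed layout)."""
    rows, columns = [], ["season", "tournament"]
    for season in seasons:
        season_dir = Path(raw_dir) / str(season)
        if not season_dir.is_dir():
            continue  # season folder not present; skip it
        for csv_path in sorted(season_dir.glob("*.csv")):
            with csv_path.open(newline="", encoding="utf-8") as handle:
                for row in csv.DictReader(handle):
                    # Tag each row with where it came from
                    row = dict(row, season=season, tournament=csv_path.stem)
                    # Keep first-seen column order across files
                    for name in row:
                        if name not in columns:
                            columns.append(name)
                    rows.append(row)
    with Path(out_path).open("w", newline="", encoding="utf-8") as handle:
        # restval fills columns that a given tournament file did not have
        writer = csv.DictWriter(handle, fieldnames=columns, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

A call such as `consolidate_tournaments("raw", "data/pga_tournament_stats_2012-2024.csv")` would write the combined file and return the number of rows written, where `raw` is a hypothetical download folder.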
data/
│
├── pdf_courses/
│ ├── 2016_pga/ ... 2024_pga/ # GCSAA tournament PDFs grouped by season
│
├── best_kmeans_model_clusters.csv # Saved K-Means course clustering results
├── datagolf.csv # Raw Datagolf course-level statistics
├── pga_tournament_stats_2012-2024.csv # Strokes-gained tournament data
├── unique_course_tournament_list.csv # Unique course names for fuzzy matching
Understanding PGA Tour Performance... # Full project report
pdf_cleaning.py # Script for parsing and cleaning GCSAA PDFs
scrape_courses.py # Scrapes course names from ESPN tournament pages
datagolf_scraper.ipynb # Scrapes and formats Datagolf course statistics
sg_feature_engineering_for_modeling.py # Aggregates player and field-level features for modeling
strokes_gained_data_loader.py # Loads and merges strokes-gained datasets
unsupervised_model_evaluation_notebook.ipynb # Cluster evaluation
supervised_model_evaluation_notebook.ipynb # Model training, evaluation, and analysis
README.md # Project overview and documentation
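The fuzzy matching mentioned for unique_course_tournament_list.csv reconciles course names that are spelled differently across sources. The sketch below shows one way this could work using the standard library's `difflib`; the project may use a different library or threshold, and the function names and the 0.8 cutoff are assumptions.

```python
from difflib import get_close_matches


def normalize(name):
    """Lowercase, drop periods, and collapse whitespace before comparing."""
    return " ".join(name.lower().replace(".", "").split())


def match_course(name, candidates, cutoff=0.8):
    """Return the candidate closest to `name`, or None if none clears `cutoff`."""
    lookup = {normalize(c): c for c in candidates}
    hits = get_close_matches(normalize(name), list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None
```

Returning `None` below the cutoff, instead of always taking the best guess, leaves ambiguous names for manual review rather than silently merging two different courses.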
- Jonathan Kelly
- Michael Michelini
- Alex Thoreux