A reproducible pipeline for constructing population-scale triathlon datasets from public race result APIs
Companion repository for a methods paper submitted to Scientific Reports.
Aldo Seffrin¹, Pantelis Theodoros Nikolaidis², Marilia Santos Andrade³, Elias Villiger⁴, Daniel Ferreira¹, Beat Knechtle⁴ˌ⁵*
¹ Nova O2 Sports Science, São Paulo, Brazil ² School of Health and Caring Sciences, University of West Attica, Athens, Greece ³ Department of Physiology, Federal University of São Paulo (UNIFESP), São Paulo, Brazil ⁴ Institute of Primary Care, University of Zurich, Zurich, Switzerland ⁵ Medbase St. Gallen Am Vadianplatz, St. Gallen, Switzerland
* Corresponding author: beat.knechtle@hispeed.ch
| Metric | Value |
|---|---|
| Records | 2,706,922 |
| Races | >1,500 events |
| Years | 2002–2026 |
| Full-distance | 1,340,799 (49.5%) |
| Half-distance (70.3) | 1,366,123 (50.5%) |
| Source: official | 2,041,743 (75.4%) |
| Source: supplement | 665,179 (24.6%) |
| T1/T2 coverage | 84.2% |
| Cross-source agreement | 99.7–99.9% |
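
The counts in the table can be cross-checked against each other. The sketch below is not part of the repository: it copies the figures from the table and confirms that each pair of categories sums to the total record count and that the rounded percentages match. The names `TOTAL`, `COUNTS`, and `share` are illustrative.

```python
# Counts copied from the summary table; names are illustrative only.
TOTAL = 2_706_922
COUNTS = {
    "full_distance": 1_340_799,
    "half_distance": 1_366_123,
    "source_official": 2_041_743,
    "source_supplement": 665_179,
}


def share(count, total=TOTAL):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Each pair of categories should partition the full record set.
assert COUNTS["full_distance"] + COUNTS["half_distance"] == TOTAL
assert COUNTS["source_official"] + COUNTS["source_supplement"] == TOTAL

for name, count in COUNTS.items():
    print(f"{name}: {count:,} ({share(count)}%)")
```

The same check could be repeated on a locally reproduced dataset, substituting the counts obtained from the merged output.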
The individual athlete-level dataset is not redistributed in this repository because the source data are proprietary. Researchers can reproduce the full dataset by running the collection scripts below. The processed dataset is available from the corresponding author upon reasonable request.
All collection scripts are provided in data/collection/. See data/README.md for step-by-step instructions.
pip install -r requirements.txt
cd data/collection
python scrape_official.py # ~2–4h, official IRONMAN platform
python combine_official.py # JSON → CSV
python scrape_coachcox.py # ~30min, supplementary source
python combine_coachcox.py # JSON → CSV
python merge_sources.py # Merge into unified dataset
data/
├── collection/
│ ├── scrape_official.py # Official platform scraper
│ ├── scrape_coachcox.py # Supplementary source scraper
│ ├── combine_official.py # Official JSON → CSV consolidation
│ ├── combine_coachcox.py # Supplementary JSON → CSV consolidation
│ ├── merge_sources.py # Deterministic merge procedure
│ ├── event_uuids_full.csv # 124 event UUIDs (official platform)
│ ├── all_subevents.csv # 1,192 subevent index
│ └── race_metadata.csv # Supplementary race metadata
└── README.md # Reproduction instructions
notebooks/
├── 01_DESCRIPTIVES.ipynb # Dataset statistics and validation
└── 02_FIGURES.ipynb # Publication figures
figures/
├── Figure1.tiff # Dataset composition by source and type
├── Figure2.tiff # Split time distributions
└── Figure3.tiff # T1/T2 coverage analysis
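
The five collection steps depend on each other's output, so they need to run in the order listed in the reproduction commands. A minimal sketch of a wrapper that enforces that order and stops at the first failure is shown here. It is hypothetical and not a file in this repository: `build_commands`, `run_pipeline`, and the `dry_run` flag are assumptions, and it assumes each script runs without arguments from inside `data/collection`, as in the commands above.

```python
import subprocess
import sys
from pathlib import Path

# Script names and order taken from the reproduction commands in this README.
STEPS = [
    "scrape_official.py",
    "combine_official.py",
    "scrape_coachcox.py",
    "combine_coachcox.py",
    "merge_sources.py",
]


def build_commands(collection_dir="data/collection", python=sys.executable):
    """Return (command, working_directory) pairs in pipeline order."""
    workdir = Path(collection_dir)
    return [([python, step], workdir) for step in STEPS]


def run_pipeline(collection_dir="data/collection", dry_run=False):
    """Run each step in order; with dry_run=True, only print the plan."""
    for command, workdir in build_commands(collection_dir):
        print("step:", " ".join(command), "(cwd:", str(workdir) + ")")
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so later steps never run on incomplete output.
            subprocess.run(command, cwd=workdir, check=True)


# Print the plan without executing anything.
run_pipeline(dry_run=True)
```

Calling `run_pipeline()` without `dry_run=True` from the repository root would execute the scripts for real, which includes the multi-hour official scrape.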
MIT — Copyright (c) 2026 Nova O2 Sports Science