Nova-O2/ironman-data

A reproducible pipeline for constructing population-scale triathlon datasets from public race result APIs

Companion repository for a methods paper submitted to Scientific Reports.

Authors

Aldo Seffrin¹, Pantelis Theodoros Nikolaidis², Marilia Santos Andrade³, Elias Villiger⁴, Daniel Ferreira¹, Beat Knechtle⁴ˌ⁵*

¹ Nova O2 Sports Science, São Paulo, Brazil
² School of Health and Caring Sciences, University of West Attica, Athens, Greece
³ Department of Physiology, Federal University of São Paulo (UNIFESP), São Paulo, Brazil
⁴ Institute of Primary Care, University of Zurich, Zurich, Switzerland
⁵ Medbase St. Gallen Am Vadianplatz, St. Gallen, Switzerland

* Corresponding author: beat.knechtle@hispeed.ch

Dataset

| Metric | Value |
|---|---|
| Records | 2,706,922 |
| Races | >1,500 events |
| Years | 2002–2026 |
| Full-distance | 1,340,799 (49.5%) |
| Half-distance (70.3) | 1,366,123 (50.5%) |
| Source: official | 2,041,743 (75.4%) |
| Source: supplement | 665,179 (24.6%) |
| T1/T2 coverage | 84.2% |
| Cross-source agreement | 99.7–99.9% |
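
The counts in the table can be checked for internal consistency. The sketch below is not part of the repository; it copies the figures from the table and confirms that the distance and source breakdowns each sum to the record total and reproduce the stated percentages. The names `TOTAL`, `BY_DISTANCE`, `BY_SOURCE`, and `shares` are illustrative.

```python
# Counts copied from the Dataset table.
TOTAL = 2_706_922
BY_DISTANCE = {"full": 1_340_799, "half": 1_366_123}
BY_SOURCE = {"official": 2_041_743, "supplement": 665_179}


def shares(parts, total):
    """Return each part as a percentage of total, rounded to one decimal."""
    return {name: round(100 * count / total, 1) for name, count in parts.items()}


# Each breakdown should account for every record exactly once.
assert sum(BY_DISTANCE.values()) == TOTAL
assert sum(BY_SOURCE.values()) == TOTAL

distance_shares = shares(BY_DISTANCE, TOTAL)
source_shares = shares(BY_SOURCE, TOTAL)
print(distance_shares)
print(source_shares)
```

The same check can be rerun against a reproduced dataset by replacing the hard-coded counts with counts computed from the merged output.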

Data availability

The individual athlete-level dataset is not publicly redistributed because the source data are proprietary. Researchers can reproduce the full dataset by running the collection scripts described under Reproduction. The processed dataset is available from the corresponding author upon reasonable request.

Reproduction

All collection scripts are provided in data/collection/. See data/README.md for step-by-step instructions.

pip install -r requirements.txt

cd data/collection
python scrape_official.py      # ~2–4h, official IRONMAN platform
python combine_official.py     # JSON → CSV

python scrape_coachcox.py      # ~30min, supplementary source
python combine_coachcox.py     # JSON → CSV

python merge_sources.py        # Merge into unified dataset
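
The commands above must run in the order shown, since each step consumes the output of an earlier one. As a convenience, they could be wrapped in a small driver that stops at the first failure. This is a minimal sketch and not a file in the repository; `run_pipeline` and its `runner` parameter are hypothetical names, and the sketch assumes the scripts take no command-line arguments, as in the commands above.

```python
import subprocess
import sys
from pathlib import Path

# Script order copied from the commands above.
PIPELINE = [
    "scrape_official.py",
    "combine_official.py",
    "scrape_coachcox.py",
    "combine_coachcox.py",
    "merge_sources.py",
]


def run_pipeline(workdir="data/collection", runner=subprocess.run):
    """Run each script in order and return the list of completed scripts."""
    completed = []
    for script in PIPELINE:
        # check=True raises CalledProcessError on a non-zero exit code,
        # so a failed scrape never feeds a later step.
        runner([sys.executable, script], cwd=Path(workdir), check=True)
        completed.append(script)
    return completed
```

Calling `run_pipeline()` from the repository root would execute all five steps. The `runner` argument exists only so the ordering can be tested without contacting the source APIs.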

Structure

data/
├── collection/
│   ├── scrape_official.py       # Official platform scraper
│   ├── scrape_coachcox.py       # Supplementary source scraper
│   ├── combine_official.py      # Official JSON → CSV consolidation
│   ├── combine_coachcox.py      # Supplementary JSON → CSV consolidation
│   ├── merge_sources.py         # Deterministic merge procedure
│   ├── event_uuids_full.csv     # 124 event UUIDs (official platform)
│   ├── all_subevents.csv        # 1,192 subevent index
│   └── race_metadata.csv        # Supplementary race metadata
└── README.md                    # Reproduction instructions

notebooks/
├── 01_DESCRIPTIVES.ipynb        # Dataset statistics and validation
└── 02_FIGURES.ipynb             # Publication figures

figures/
├── Figure1.tiff                 # Dataset composition by source and type
├── Figure2.tiff                 # Split time distributions
└── Figure3.tiff                 # T1/T2 coverage analysis
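
The listing describes merge_sources.py as a deterministic merge procedure but does not spell out its rules here. The sketch below shows one way such a merge could work and is not taken from the script. It assumes, purely for illustration, that records are dictionaries keyed by hypothetical `race`, `year`, and `athlete` fields, that official records take precedence, and that supplementary records are kept only when no official record shares their key. The example records are made up.

```python
def merge_records(official, supplement, key_fields=("race", "year", "athlete")):
    """Merge two lists of record dicts, preferring official records.

    Assumed rule (not taken from merge_sources.py): a supplementary record
    is kept only when no official record shares its key.
    """
    merged = {}
    for source, records in (("supplement", supplement), ("official", official)):
        for record in records:
            key = tuple(record[field] for field in key_fields)
            # Official rows are processed last, so they overwrite supplement rows.
            merged[key] = {**record, "source": source}
    # Sorting by key gives a stable output order across runs.
    return [merged[key] for key in sorted(merged)]


# Made-up records for illustration only.
official = [
    {"race": "Example Race", "year": 2019, "athlete": "A", "finish_time": "09:58:12"},
]
supplement = [
    {"race": "Example Race", "year": 2019, "athlete": "A", "finish_time": "09:58:12"},
    {"race": "Example Race", "year": 2005, "athlete": "B", "finish_time": "11:20:45"},
]

result = merge_records(official, supplement)
for row in result:
    print(row["year"], row["athlete"], row["source"])
```

Tagging each row with its source, as assumed here, would make breakdowns like the official and supplement counts in the Dataset section straightforward to recompute.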

License

MIT — Copyright (c) 2026 Nova O2 Sports Science
