CrossFit South Brooklyn Workout Scraper

Python helper that pulls the full history of CrossFit South Brooklyn "Workout of the Day" posts via the site's WordPress API and extracts the workout components (Strength, Assistance, Conditioning, etc.), the date, and any cycle markers like (WK4/8).

Plan

Use the WordPress REST API (wp-json/wp/v2/posts) scoped to the Workout of the Day category (ID 1) to avoid brittle HTML pagination.
Fetch posts in pages of up to 100 items, respecting the X-WP-TotalPages header and pausing slightly between requests.
Parse each post's rendered HTML with BeautifulSoup; find the Workout of the Day heading and capture subsequent component headings (h3–h6) and their text until the next heading.
Detect cycle information with regex patterns such as (WK4/8) or (Week 6/8) and include it on every record.
Emit newline-delimited JSON so the data can be transformed later into a richer model.
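
The cycle detection and newline-delimited output steps above could be sketched roughly as follows. The regex, the `find_cycle` and `to_jsonl` helpers, and the record field names are illustrative assumptions, not the code in scrape_cfsbk.py.

```python
import json
import re

# Assumed pattern: matches "(WK4/8)" or "(Week 6/8)", case-insensitively.
CYCLE_RE = re.compile(r"\(\s*(?:WK|Week)\s*(\d+)\s*/\s*(\d+)\s*\)", re.IGNORECASE)


def find_cycle(text):
    """Return {"week": n, "of": m} for the first cycle marker, else None."""
    match = CYCLE_RE.search(text)
    if match is None:
        return None
    return {"week": int(match.group(1)), "of": int(match.group(2))}


def to_jsonl(records):
    """Serialize records as newline-delimited JSON, one object per line."""
    return "".join(json.dumps(record, ensure_ascii=False) + "\n" for record in records)


# Hypothetical record shape, for illustration only.
records = [
    {"date": "2016-01-04", "component": "Strength", "text": "Back Squat (WK4/8)"},
    {"date": "2016-01-05", "component": "Conditioning", "text": "Row 2k"},
]
for record in records:
    record["cycle"] = find_cycle(record["text"])

print(to_jsonl(records), end="")
```

Storing `None` when no marker is found keeps the cycle key present on every record, which matches the plan to include cycle information on each one.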

Usage (uv)

# Install deps into .venv
uv sync

# Run the scraper
uv run python scrape_cfsbk.py --output workouts.jsonl

Flags:

--per-page (default 100): API page size; 100 is the maximum the WordPress REST API allows.
--max-pages: limit pages for quick tests.
--pause (default 0.2): seconds to sleep between requests.
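
As a rough illustration of how these three flags could drive the fetch loop, here is a minimal sketch using only the standard library. `fetch_page`, `iter_posts`, and `api_root` are hypothetical names, and `api_root` stands in for the site's REST root; the actual scraper may be organized differently. The `categories`, `per_page`, and `page` query parameters and the `X-WP-TotalPages` header are standard WordPress REST API features.

```python
import json
import time
import urllib.parse
import urllib.request


def fetch_page(api_root, page, per_page=100, category=1):
    """Fetch one page of posts; return (posts, total_pages).

    api_root is a placeholder for the site's REST root ending in /wp-json/wp/v2.
    """
    query = urllib.parse.urlencode(
        {"categories": category, "per_page": per_page, "page": page}
    )
    with urllib.request.urlopen(f"{api_root}/posts?{query}") as response:
        total_pages = int(response.headers.get("X-WP-TotalPages", "1"))
        return json.load(response), total_pages


def iter_posts(fetch, max_pages=None, pause=0.2):
    """Yield posts page by page until X-WP-TotalPages (or max_pages) is reached."""
    page = 1
    while True:
        posts, total_pages = fetch(page)
        yield from posts
        last_page = total_pages if max_pages is None else min(total_pages, max_pages)
        if page >= last_page:
            break
        page += 1
        time.sleep(pause)  # pause between requests, as --pause does
```

Passing the fetcher in as a callable, for example `iter_posts(lambda page: fetch_page(api_root, page), max_pages=2)`, keeps the paging logic testable without network access.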

Movement frequency visualization

After generating movement_counts.csv (from the analysis snippet), plot the top movements:

uv run python visualize_movements.py --input movement_counts.csv --top 20 --output movement_counts.png

Named workouts (Heroes & Girls)

The ETL emits data/derived/named_workouts.json capturing Hero WODs and Girl benchmarks (occurrences, counts, latest date/link, summaries).
Matching checks workout titles and component headings, skipping “tomorrow” and promo components, and uses word-boundary regexes to avoid false positives.
In the frontend, these appear as expandable cards with the workout text and clickable dates to the source blog posts.
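
A minimal sketch of word-boundary matching of this kind is shown below. The three-name list, the `find_named_workouts` helper, the case-insensitive matching, and the "tomorrow" skip rule are assumptions for illustration; the ETL's actual name list and filters are not reproduced here.

```python
import re

# Small illustrative subset; the full list of Hero and Girl names is not shown.
NAMED_WORKOUTS = ["Fran", "Grace", "Murph"]

# \b anchors each name to word boundaries, so "Fran" does not match "Francisco".
NAME_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, NAMED_WORKOUTS)) + r")\b", re.IGNORECASE
)

# Assumed rule: skip components whose heading previews a future workout.
SKIP_HEADING_RE = re.compile(r"\btomorrow\b", re.IGNORECASE)


def find_named_workouts(title, components):
    """Return sorted canonical names found in the title or component headings.

    components is a list of (heading, text) pairs; only headings are searched.
    """
    canonical = {name.lower(): name for name in NAMED_WORKOUTS}
    candidates = [title] + [h for h, _ in components if not SKIP_HEADING_RE.search(h)]
    found = set()
    for candidate in candidates:
        for match in NAME_RE.finditer(candidate):
            found.add(canonical[match.group(1).lower()])
    return sorted(found)


print(find_named_workouts("Workout of the Day", [("Fran", "21-15-9 reps")]))
print(find_named_workouts("San Francisco Open", []))
```

Escaping each name with `re.escape` before joining keeps names containing punctuation from being read as regex syntax.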

Tests

# Backend tests
uv run pytest

# Frontend tests
cd frontend
npm test -- --run

ETL pipeline

# Fetch latest posts and build canonical + aggregates
uv run python etl.py all

# (Optional) Fetch comment metadata and write comment analytics
uv run python etl.py build --with-comment-analysis

Artifacts land in data/derived/:

workouts.jsonl (canonical with movements/format/component tags/seq_no)
top_movements.json, top_pairs.json, yearly_counts.json, weekday_counts.json
movement_yearly.json, movement_weekday.json, movement_monthly.json, movement_calendar.json
search_index.json, data_version.json
comment_count is included on each workout when running with --with-comments or --with-comment-analysis (hits the WP comments API)
comments_analysis.json is written when running with --with-comment-analysis (monthly totals, most-commented posts, top commenters)
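
To give a sense of how aggregates such as yearly_counts.json and weekday_counts.json could be derived from the canonical file, here is a small sketch. It assumes each record carries an ISO `date` field ("YYYY-MM-DD"); the field name, the `count_by_year_and_weekday` helper, and the output shapes are assumptions, not the pipeline's actual schema.

```python
import json
from collections import Counter
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def count_by_year_and_weekday(jsonl_lines):
    """Tally workouts per year and per weekday from JSONL lines."""
    yearly, weekday = Counter(), Counter()
    for line in jsonl_lines:
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        day = date.fromisoformat(record["date"])  # assumed "YYYY-MM-DD" field
        yearly[str(day.year)] += 1
        weekday[WEEKDAYS[day.weekday()]] += 1
    return dict(yearly), dict(weekday)


sample = ['{"date": "2016-01-04"}', '{"date": "2016-01-05"}', '{"date": "2017-01-02"}']
yearly, weekday = count_by_year_and_weekday(sample)
print(yearly, weekday)
```

In practice the lines would come from iterating over the open data/derived/workouts.jsonl file rather than an inline sample.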

LLM tagging (audit mode)

The regex-based tagger is the source of truth for the site. For auditing, you can generate a second set of tags using an LLM and compare them in the frontend.

Provide an OpenAI API key (either method works):

export OPENAI_API_KEY="..."

or add it to .env (gitignored) as OPENAI_API_KEY=....

Generate LLM tags for a date range (start small to control cost):

uv run python scripts/llm_tag_workouts.py --start-date 2016-01-01 --end-date 2016-01-31 --max-posts 50 --workers 4

This writes gitignored artifacts:

data/derived/llm_tags.jsonl (append-only)
data/derived/llm_tags.json (JSON array for the frontend)
data/llm_cache/ (per-post cached responses)
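
One plausible way to turn the append-only .jsonl log into the JSON array the frontend reads is to keep only the most recent record per post. The sketch below assumes a `post_id` key on each record and a hypothetical `latest_per_post` helper; whether scripts/llm_tag_workouts.py deduplicates this way is an assumption.

```python
import json


def latest_per_post(jsonl_lines, key="post_id"):
    """Collapse an append-only log to one record per post; later lines win."""
    latest = {}
    for line in jsonl_lines:
        if line.strip():
            record = json.loads(line)
            latest[record[key]] = record  # overwrite any earlier record
    return list(latest.values())


log = [
    '{"post_id": 1, "tags": ["squat"]}',
    '{"post_id": 2, "tags": ["row"]}',
    '{"post_id": 1, "tags": ["squat", "press"]}',
]
print(json.dumps(latest_per_post(log), indent=2))
```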

Judge pass (second opinion)

You can optionally run a second-pass "judge" LLM that sees:

The full blog post text
The regex result (data/derived/search_index.json)
The first-pass LLM result

By default it only judges posts where regex and first-pass LLM disagree (to control cost):

uv run python etl.py build
uv run python scripts/llm_tag_workouts.py --start-date 2016-01-01 --end-date 2016-01-31 --judge
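
The default "only judge disagreements" selection can be pictured as a set comparison per post. In this sketch the inputs are dictionaries mapping a post ID to a list of tags; that shape and the `posts_needing_judge` helper are assumptions for illustration, not the script's actual data model.

```python
def posts_needing_judge(regex_tags, llm_tags):
    """Return post IDs present in both inputs whose tag sets differ."""
    return sorted(
        post_id
        for post_id in regex_tags.keys() & llm_tags.keys()
        if set(regex_tags[post_id]) != set(llm_tags[post_id])
    )


regex_tags = {101: ["squat", "row"], 102: ["deadlift"], 103: ["run"]}
llm_tags = {101: ["row", "squat"], 102: ["deadlift", "burpee"]}
print(posts_needing_judge(regex_tags, llm_tags))
```

Comparing sets rather than lists means tag order alone never triggers a judge call, which helps keep the number of LLM requests down.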

Outputs (gitignored):

data/derived/llm_judged_tags.jsonl
data/derived/llm_judged_tags.json

Deploying LLM results

If you don’t want to re-run LLM tagging in production, commit these files to the repo after generating them locally:

data/derived/llm_tags.json
data/derived/llm_judged_tags.json

(The per-post cache and .jsonl files remain gitignored.)

Sync derived data into the dev server and reload:

cd frontend
npm run sync-data
npm run dev

Then open the LLM Tagging Audit section to review differences between regex tags and LLM tags.

Frontend (React/Vite)

Scaffold lives in frontend/.

cd frontend
npm install
npm run sync-data   # copies data/derived into public/data for local dev
npm run dev
