Python helper that pulls the full history of CrossFit South Brooklyn "Workout of the Day" posts via the site's WordPress API and extracts the workout components (Strength, Assistance, Conditioning, etc.), the date, and any cycle markers like (WK4/8).
- Use the WordPress REST API (wp-json/wp/v2/posts) scoped to the "Workout of the Day" category (ID 1) to avoid brittle HTML pagination.
- Fetch posts in pages of up to 100 items, respecting the X-WP-TotalPages header and pausing slightly between requests.
- Parse each post's rendered HTML with BeautifulSoup; find the "Workout of the Day" heading and capture subsequent component headings (h3–h6) and their text until the next heading.
- Detect cycle information with regex patterns such as (WK4/8) or (Week 6/8) and include it on every record.
- Emit newline-delimited JSON so the data can be transformed later into a richer model.

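
The cycle-marker detection described in the list above could look roughly like the following. This is a minimal sketch, not the scraper's actual code: the pattern, the `extract_cycle` name, and the `week`/`of` keys are assumptions chosen for illustration.

```python
import re

# Assumed pattern: matches "(WK4/8)" or "(Week 6/8)", case-insensitively,
# tolerating optional spaces around the numbers and the slash.
CYCLE_RE = re.compile(r"\(\s*(?:WK|Week)\s*(\d+)\s*/\s*(\d+)\s*\)", re.IGNORECASE)


def extract_cycle(text):
    """Return the first cycle marker as a dict, or None if none is present."""
    match = CYCLE_RE.search(text)
    if match is None:
        return None
    # "week" and "of" are hypothetical field names.
    return {"week": int(match.group(1)), "of": int(match.group(2))}
```

Returning `None` when no marker is found lets every record carry a cycle field, even when the post does not mention one.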
# Install deps into .venv
uv sync
# Run the scraper
uv run python scrape_cfsbk.py --output workouts.jsonl

Flags:
- --per-page (default 100): API page size (max WordPress allows).
- --max-pages: limit pages for quick tests.
- --pause (default 0.2): seconds to sleep between requests.

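
To show how the three flags above interact with the X-WP-TotalPages header, here is a hedged sketch of the paging loop. The function names, the `base_url` parameter, and the injectable `get` callable are assumptions for illustration; the real scrape_cfsbk.py may be structured differently. The query parameters `categories`, `per_page`, and `page` are standard WordPress REST API parameters.

```python
import json
import time
import urllib.parse
import urllib.request


def http_get_json(url):
    """Fetch a URL and return (parsed JSON body, response headers)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp), resp.headers


def fetch_all_posts(base_url, category_id=1, per_page=100,
                    max_pages=None, pause=0.2, get=http_get_json):
    """Collect posts page by page until X-WP-TotalPages (or max_pages) is reached."""
    posts = []
    page = 1
    total_pages = None
    while True:
        query = urllib.parse.urlencode(
            {"categories": category_id, "per_page": per_page, "page": page}
        )
        body, headers = get(f"{base_url}/wp-json/wp/v2/posts?{query}")
        posts.extend(body)
        if total_pages is None:
            # WordPress reports the page count in this response header.
            total_pages = int(headers.get("X-WP-TotalPages", 1))
        last_page = total_pages if max_pages is None else min(total_pages, max_pages)
        if page >= last_page:
            break
        page += 1
        time.sleep(pause)  # be polite between requests
    return posts
```

Passing the fetcher in as `get` keeps the loop testable without network access.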
After generating movement_counts.csv (from the analysis snippet), plot the top movements:
uv run python visualize_movements.py --input movement_counts.csv --top 20 --output movement_counts.png

- The ETL emits data/derived/named_workouts.json, capturing Hero WODs and Girl benchmarks (occurrences, counts, latest date/link, summaries).
- Matching uses workout titles and component headings (ignoring “tomorrow”/promo components) with word-boundary regexes to avoid false positives.
- In the frontend, these appear as expandable cards with the workout text and clickable dates to the source blog posts.
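
The word-boundary matching mentioned above can be sketched as follows. The workout names, the `SKIP_HEADINGS` tuple, the case-insensitive flag, and the function names are illustrative assumptions, not the ETL's actual implementation.

```python
import re

# Illustrative subset of named workouts; the real list is likely longer.
NAMED_WORKOUTS = ["Fran", "Grace", "Murph"]

# Hypothetical markers for components that should be ignored.
SKIP_HEADINGS = ("tomorrow",)


def build_patterns(names):
    """Compile one word-boundary regex per workout name."""
    return {
        name: re.compile(rf"\b{re.escape(name)}\b", re.IGNORECASE)
        for name in names
    }


def match_named_workouts(title, components, patterns):
    """Return names found in the title or in non-skipped component headings.

    components is a list of (heading, text) pairs.
    """
    searchable = [title]
    for heading, _text in components:
        if any(word in heading.lower() for word in SKIP_HEADINGS):
            continue  # e.g. a preview of tomorrow's workout
        searchable.append(heading)
    return [
        name for name, pattern in patterns.items()
        if any(pattern.search(s) for s in searchable)
    ]
```

The `\b` anchors are what stop a title such as "Frantic Friday" from counting as an occurrence of Fran.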
# Backend/tests
uv run pytest
# Frontend tests
cd frontend
npm test -- --run

# Fetch latest posts and build canonical + aggregates
uv run python etl.py all
# (Optional) Fetch comment metadata and write comment analytics
uv run python etl.py build --with-comment-analysis

Artifacts land in data/derived/:
- workouts.jsonl (canonical with movements/format/component tags/seq_no)
- top_movements.json, top_pairs.json, yearly_counts.json, weekday_counts.json
- movement_yearly.json, movement_weekday.json, movement_monthly.json, movement_calendar.json
- search_index.json, data_version.json
- comment_count is included on each workout when running with --with-comments or --with-comment-analysis (hits the WP comments API)
- comments_analysis.json is written when running with --with-comment-analysis (monthly totals, most-commented posts, top commenters)

The regex-based tagger is the source of truth for the site. For auditing, you can generate a second set of tags using an LLM and compare them in the frontend.
- Provide an OpenAI key (either works):
export OPENAI_API_KEY="..."
or add it to .env (gitignored) as OPENAI_API_KEY=...
- Generate LLM tags for a date range (start small to control cost):
uv run python scripts/llm_tag_workouts.py --start-date 2016-01-01 --end-date 2016-01-31 --max-posts 50 --workers 4

This writes gitignored artifacts:
- data/derived/llm_tags.jsonl (append-only)
- data/derived/llm_tags.json (JSON array for the frontend)
- data/llm_cache/ (per-post cached responses)

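
One plausible way to turn an append-only .jsonl log into the JSON array used by the frontend is to keep only the latest record per post. The sketch below assumes each record carries a `post_id` key and that later lines should win; neither detail is confirmed by the script itself.

```python
import json


def jsonl_to_latest(lines, key="post_id"):
    """Collapse append-only JSONL lines into one record per key (last one wins)."""
    latest = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        latest[record[key]] = record  # later lines overwrite earlier ones
    return list(latest.values())
```

The result can be written out with `json.dump` to produce the array file.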
You can optionally run a second-pass "judge" LLM that sees:
- The full blog post text
- The regex result (data/derived/search_index.json)
- The first-pass LLM result

By default it only judges posts where regex and first-pass LLM disagree (to control cost):
uv run python etl.py build
uv run python scripts/llm_tag_workouts.py --start-date 2016-01-01 --end-date 2016-01-31 --judge

Outputs (gitignored):
- data/derived/llm_judged_tags.jsonl
- data/derived/llm_judged_tags.json

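
The default rule of judging only posts where the regex and first-pass LLM tags disagree could be expressed as below. This is a sketch under assumptions: tags are compared as unordered sets, and both inputs are dicts keyed by a hypothetical post ID. The actual script may compare tags differently.

```python
def select_for_judge(regex_by_post, llm_by_post):
    """Return sorted post IDs whose regex tags differ from the first-pass LLM tags."""
    selected = []
    for post_id, llm_tags in llm_by_post.items():
        regex_tags = regex_by_post.get(post_id, [])
        # Order and duplicates are ignored; only the set of tags matters here.
        if set(regex_tags) != set(llm_tags):
            selected.append(post_id)
    return sorted(selected)
```

Skipping posts where both taggers agree is what keeps the second pass cheap.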
If you don’t want to re-run LLM tagging in prod, commit these to the repo after generating them locally:
- data/derived/llm_tags.json
- data/derived/llm_judged_tags.json

(The per-post cache and .jsonl files remain gitignored.)
- Sync derived data into the dev server and reload:
cd frontend
npm run sync-data
npm run dev

Then open the LLM Tagging Audit section to review differences between regex tags and LLM tags.

Scaffold lives in frontend/.
cd frontend
npm install
npm run sync-data # copies data/derived into public/data for local dev
npm run dev