
Transcript pipeline improvements: SST context, live captions, editorial rules #54

Merged: mriechers merged 6 commits into main from dev on Apr 2, 2026

@mriechers
Owner

Summary

  • SST context expansion: Pipeline now fetches Social Media Description and follows Project linked records for series-level cast info. Auto-parses speaker names from SST text fields when Host/Presenter aren't set explicitly.
  • Live caption handling: Analyst, formatter, and manager agents all detect live captioning input (>> markers, garbled text) and avoid fabricating names from phonetic fragments. Speaker names are verified against SST data instead.
  • PBS Wisconsin house style: Formatter enforces Capitol/capital, OK, lowercase liberals/conservatives, Legislature caps, no Oxford commas, abbreviated honorifics, sentence-boundary filler removal, content preservation (no suppression of mild language).
  • Chunked formatter fix: Strips Model/Creator/Agent attribution lines that were appearing mid-transcript after chunk merge.
  • Timestamp improvements: Lowered auto-trigger from 30min to 10min. Added chaptering best practices from /timestamps skill — duration-based caps, "Episode intro" first chapter, sentence case, topic-not-format naming.
  • Wisconsin reference: New knowledge/wisconsin_reference.md with commonly misspelled place names, political figures, legal cases, and program hosts.
  • Airtable API key: Configured in Docker .env (not committed) so all jobs now get SST context.
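
The live caption detection mentioned in the summary could be approximated with a simple marker count. This is a minimal sketch, not the pipeline's actual detector: the function name `looks_like_live_captions` and the `min_markers` threshold are assumptions for illustration, and it only checks for `>>` speaker-change markers, not garbled text.

```python
import re

# Assumption: live captions mark a speaker change with ">>" at the start of a line.
SPEAKER_MARKER = re.compile(r"^\s*>>", re.MULTILINE)


def looks_like_live_captions(text: str, min_markers: int = 3) -> bool:
    """Return True when the text has at least `min_markers` '>>' line markers."""
    return len(SPEAKER_MARKER.findall(text)) >= min_markers


sample = ">> WELCOME TO THE SHOW.\n>> THANKS FOR HAVING ME.\n>> LET'S BEGIN.\n"
print(looks_like_live_captions(sample))                # enough markers
print(looks_like_live_captions("Welcome to the show."))  # no markers
```

A flag like this would let the agent prompts switch into a "do not infer names from the text" mode before any formatting happens.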

Context

Job 44 (6POL0102, Inside Wisconsin Politics) exposed that the pipeline fabricated "Guy Wagtendonk" as the host name from garbled live caption text. The actual host is Shawn Johnson. Root cause: no Airtable SST context was available (API key missing from Docker), and no agent was trained to handle live captioning artifacts. Comparison with the human editor's final versions confirmed additional gaps in content preservation and speaker attribution accuracy.

Test plan

  • Core tests pass (334 passed; the queue router rate-limit flakes are pre-existing)
  • Job 45 completed successfully with SST context populated (airtable_record_id, media_id, duration all set)
  • Timestamp phase triggered for 17-min content (was skipped before at 30-min threshold)
  • Speaker pre-parsing extracts "Host: Shawn Johnson" and "Presenter: Zac Schultz, Anya van Wagtendonk, Rich Kremer" from IWP SST data
  • Full pipeline re-run after rebuild to verify formatting rules and speaker attribution
  • Human editorial review of next IWP episode against editor baseline
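
The speaker pre-parsing step in the test plan could look roughly like the sketch below. It assumes the SST text carries label-prefixed lines such as "Host: ..." and "Presenter: ..." with comma-separated names; the real field layout and the function name `parse_speakers` are not taken from the PR.

```python
import re

# Assumption: speaker roles appear as "Host: ..." or "Presenter: ..." lines.
LABEL = re.compile(r"\s*(Host|Presenter)s?\s*:\s*(.+)", re.IGNORECASE)


def parse_speakers(text: str) -> dict[str, list[str]]:
    """Collect comma-separated names under each role label, without duplicates."""
    speakers: dict[str, list[str]] = {}
    for line in text.splitlines():
        match = LABEL.match(line)
        if not match:
            continue
        role = match.group(1).capitalize()
        names = speakers.setdefault(role, [])
        for name in match.group(2).split(","):
            name = name.strip()
            if name and name not in names:
                names.append(name)
    return speakers


sst_text = (
    "Host: Shawn Johnson\n"
    "Presenter: Zac Schultz, Anya van Wagtendonk, Rich Kremer\n"
)
print(parse_speakers(sst_text))
```

Per the summary, parsed names would only be used when Host/Presenter are not already set explicitly on the record.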

🤖 Generated with Claude Code

mriechers and others added 2 commits March 27, 2026 15:08
…ling, and editorial rules

Job 44 (6POL0102) exposed systemic issues: the formatter fabricated a host name
from garbled live captions because Airtable SST context was unavailable. This
overhaul addresses speaker attribution accuracy, content preservation, and PBS
Wisconsin house style across the entire agent pipeline.

SST context expansion:
- Add get_project_record() to AirtableClient for linked Project lookups
- Extract Social Media Description and follow Project→Notes for series cast info
- Auto-parse speaker names from SST text fields into Host/Presenter when not set
- Add Social Media Description and Project Notes to all phase prompts
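
As a rough illustration of the linked-record lookup above: Airtable returns linked-record fields as a list of record IDs, so following Project→Notes means one extra fetch per ID. In this sketch, `collect_sst_context` and the `fetch_project` callable are hypothetical stand-ins (the real `get_project_record()` signature is not shown in the PR), and the sample records are made up.

```python
from typing import Callable, Optional


def collect_sst_context(
    sst_fields: dict,
    fetch_project: Callable[[str], Optional[dict]],
) -> dict:
    """Gather extra prompt context from an SST record and its linked Projects."""
    context = {
        "social_media_description": sst_fields.get("Social Media Description", ""),
        "project_notes": [],
    }
    # Linked-record fields hold a list of record IDs, not the records themselves.
    for project_id in sst_fields.get("Project", []):
        project = fetch_project(project_id)
        if project is None:
            continue  # missing or inaccessible record: skip rather than fail
        notes = project.get("fields", {}).get("Notes")
        if notes:
            context["project_notes"].append(notes)
    return context


# In-memory stand-in for an Airtable lookup (illustrative data only).
fake_projects = {"recPROJ1": {"id": "recPROJ1", "fields": {"Notes": "Series cast notes"}}}
context = collect_sst_context(
    {"Social Media Description": "Episode blurb", "Project": ["recPROJ1", "recMISSING"]},
    fake_projects.get,
)
print(context)
```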

Agent prompt improvements:
- Analyst: detect live caption sources, never fabricate names from garbled text
- Formatter: PBS house style (Capitol, OK, liberals lowercase, Legislature caps,
  no Oxford commas, abbreviated honorifics, em dash discipline, Marquette Poll),
  two-trailing-space line breaks, content preservation rules
- Manager: CRITICAL speaker name verification against SST data

Code fixes:
- Strip Model/Creator/Agent attribution lines in chunked merge (mid-doc bug)
- Lower timestamp auto-trigger threshold from 30min to 10min
- Strengthen verbatim instructions to reconstruct garbled captions, never omit
- Load Wisconsin proper-noun reference for analyst and formatter phases
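
The attribution-line fix could be sketched as a filter applied while merging chunks. The exact line format is not shown in the PR; this sketch assumes lines that begin with "Model:", "Creator:", or "Agent:" (optionally wrapped in Markdown bold), and `merge_chunks` is a hypothetical name.

```python
import re

# Assumption: attribution lines start with one of these labels followed by a colon,
# optionally wrapped in Markdown bold markers, e.g. "**Agent:** formatter".
ATTRIBUTION = re.compile(r"^\s*\**\s*(Model|Creator|Agent)\s*\**\s*:")


def merge_chunks(chunks: list[str]) -> str:
    """Join formatted chunks, dropping attribution lines wherever they appear."""
    kept = []
    for chunk in chunks:
        for line in chunk.splitlines():
            if not ATTRIBUTION.match(line):
                kept.append(line)
    return "\n".join(kept)


chunks = [
    "**Model:** example-model\nSpeaker A: Hello.",
    "Agent: formatter\nSpeaker B: Hi.",
]
print(merge_chunks(chunks))
```

A pattern this broad would also drop a transcript line spoken by someone labeled "Agent:", so a real implementation would likely need a narrower match.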

New files:
- knowledge/wisconsin_reference.md: place names, political figures, legal cases

[Agent: Claude Code]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Rolls in guidelines from the /timestamps skill (the-lodge) into
Cardigan's timestamp agent prompt. Key improvements:

- Duration-based chapter count caps (3-10 depending on length)
- Mandatory "Episode intro" as first chapter name
- Sentence case naming convention for PBS descriptions
- "Capture the topic, not the format" naming rule
- Neutral/professional tone guardrail for public media
- Updated quality checklist
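
To make the trigger threshold and the duration-based caps concrete, here is a small sketch. The 10-minute trigger and the 3-10 chapter range come from this PR; the inclusive comparison and the `MINUTES_PER_CHAPTER` scaling factor are assumptions, since the actual breakpoints live in the agent prompt and are not shown here.

```python
TIMESTAMP_TRIGGER_MINUTES = 10      # lowered from 30 in this PR
MIN_CHAPTERS, MAX_CHAPTERS = 3, 10  # range stated in the commit message
MINUTES_PER_CHAPTER = 6             # hypothetical scaling factor, not from the PR


def should_generate_timestamps(duration_minutes: float) -> bool:
    """Run the timestamp phase for content at or above the trigger length."""
    return duration_minutes >= TIMESTAMP_TRIGGER_MINUTES


def chapter_cap(duration_minutes: float) -> int:
    """Scale the chapter cap with duration, clamped to the 3-10 range."""
    estimate = int(duration_minutes // MINUTES_PER_CHAPTER)
    return max(MIN_CHAPTERS, min(MAX_CHAPTERS, estimate))


for minutes in (9, 17, 30, 60):
    print(minutes, should_generate_timestamps(minutes), chapter_cap(minutes))
```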

[Agent: Claude Code]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mriechers
Owner Author

Code review

Found 1 issue:

  1. knowledge/ directory not copied in Dockerfiles -- the PR adds KNOWLEDGE_DIR = Path("knowledge") and loads knowledge/wisconsin_reference.md at runtime for analyst and formatter phases. Neither Dockerfile.api nor Dockerfile.worker includes a COPY knowledge/ knowledge/ directive. The code uses a safe if wi_ref_path.exists(): guard so it won't crash, but the Wisconsin proper-noun reference will silently never load in Docker deployments. Since the pipeline runs in Docker, this feature is effectively dead in production.

    if phase_name in ("analyst", "formatter"):
        wi_ref_path = KNOWLEDGE_DIR / "wisconsin_reference.md"
        if wi_ref_path.exists():
            wi_reference = f"\n## Wisconsin Proper-Noun Reference\n\n{wi_ref_path.read_text()}\n\n"

Relevant Dockerfiles to update:

  • Dockerfile.api
  • Dockerfile.worker
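
Separately from the Dockerfile change, the silent failure described above could be made visible with a logged warning. This is only a sketch of that idea, not code from the PR: `load_wisconsin_reference` is a hypothetical helper, while `KNOWLEDGE_DIR` and the heading text mirror the excerpt above.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)
KNOWLEDGE_DIR = Path("knowledge")


def load_wisconsin_reference(knowledge_dir: Path = KNOWLEDGE_DIR) -> str:
    """Return the reference as a prompt section, or '' (with a warning) if missing."""
    wi_ref_path = knowledge_dir / "wisconsin_reference.md"
    if not wi_ref_path.exists():
        # Keep the non-crashing behavior, but leave a trace in the logs.
        logger.warning("Wisconsin reference not found at %s", wi_ref_path)
        return ""
    return f"\n## Wisconsin Proper-Noun Reference\n\n{wi_ref_path.read_text()}\n\n"
```

With a warning in place, a missing `COPY` directive would show up in container logs on the first analyst or formatter run instead of going unnoticed.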

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

Addresses code review finding on PR #54: the knowledge/ directory
(containing wisconsin_reference.md) was not being copied into Docker
containers, so the Wisconsin proper-noun reference would silently
never load in production. Also fixes stale "30+" comment to match
the new 10-minute timestamp threshold.

[Agent: Claude Code]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Two additional chaptering rules from the /timestamps skill:
- Align chapter timestamps to speaker transitions (>> markers) rather
  than mid-speech topic keywords
- Use parallel framing for political content to avoid editorial bias
  in candidate/party chapter names
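
The speaker-transition alignment rule could be sketched as a snap-to-nearest step. The cue structure (a list of `(start_seconds, text)` pairs), the "nearest" choice, and the name `snap_to_speaker_change` are assumptions; in the PR this rule is expressed as prompt guidance rather than code.

```python
def snap_to_speaker_change(target_seconds: float, cues: list[tuple[float, str]]) -> float:
    """Move a proposed chapter time to the nearest cue that starts with '>>'."""
    transitions = [start for start, text in cues if text.lstrip().startswith(">>")]
    if not transitions:
        return target_seconds  # nothing to align to; keep the original time
    return min(transitions, key=lambda start: abs(start - target_seconds))


cues = [
    (0.0, ">> WELCOME TO THE SHOW."),
    (42.0, "TODAY WE LOOK AT THE BUDGET."),
    (95.0, ">> THANKS FOR HAVING ME."),
    (180.0, ">> TURNING TO THE COURTS."),
]
print(snap_to_speaker_change(100.0, cues))
```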

[Agent: Claude Code]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

mriechers and others added 2 commits March 31, 2026 17:21
Dockerfile.api was missing COPY alembic/ — migrations couldn't run
in the container. Also updated alembic/env.py to read DATABASE_PATH
env var so it connects to the correct SQLite path in Docker
(/data/db/dashboard.db) instead of the hardcoded local ./dashboard.db.
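
A minimal sketch of the `DATABASE_PATH` lookup described above, assuming a plain SQLite URL is what `alembic/env.py` needs. The helper name `database_url` is hypothetical; the env var name and both paths come from the commit message.

```python
import os
from typing import Mapping

DEFAULT_DB_PATH = "./dashboard.db"  # local default named in the commit message


def database_url(environ: Mapping[str, str] = os.environ) -> str:
    """Build a SQLite URL from DATABASE_PATH, falling back to the local file."""
    db_path = environ.get("DATABASE_PATH", DEFAULT_DB_PATH)
    # An absolute path yields four slashes: sqlite:////data/db/dashboard.db
    return f"sqlite:///{db_path}"


print(database_url({"DATABASE_PATH": "/data/db/dashboard.db"}))
print(database_url({}))
```

In an Alembic `env.py`, a value like this would typically be applied with `config.set_main_option("sqlalchemy.url", ...)` before the engine is created.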

[Agent: Main Assistant]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mriechers mriechers merged commit 0298dd2 into main Apr 2, 2026
12 checks passed
@mriechers mriechers deleted the dev branch April 2, 2026 21:33