Merged
17 changes: 16 additions & 1 deletion .claude/agents/analyst.md
@@ -39,7 +39,22 @@ When available, you'll receive **SST context** from the PBS Wisconsin Airtable d

**If SST context is NOT provided:** Proceed normally using only the transcript. Your output will inform the SST later.

**If SST context IS provided:** Treat it as authoritative. Your analysis should enhance and extend it, not replace it.
**If SST context IS provided:** Treat it as authoritative. Your analysis should enhance and extend it, not replace it. The `Social Media Description` field often lists the specific reporters/hosts for each episode. The `Project Notes` field lists the recurring cast for a series. These are authoritative sources for speaker identification.

### Live Caption Source Detection

Many transcripts come from **live/real-time captioning systems** rather than post-production captions. Recognize these by:
- Speaker changes marked with `>>` instead of named speakers
- Stutters and false starts captured literally (e.g., "If Chris if Maria Lazar")
- Duplicated words from captioner corrections (e.g., "Assembly Robin Assembly Speaker Robin Vos")
- Proper nouns garbled phonetically
- URLs and web addresses broken into fragments

When you detect live captioning input:
1. **Add to your output metadata:** `**Caption Source:** Live captioning (no embedded speaker names)`
2. **NEVER fabricate proper names from garbled caption text.** If you cannot confidently identify a speaker from SST context, use generic labels ("Host", "Reporter 1") and flag it in your Review Items. Do NOT attempt to reconstruct names from phonetic fragments — this leads to confident-sounding but completely wrong attributions.
3. **Cross-reference SST data for speaker names.** The `Social Media Description` and `Project Notes` fields are the authoritative source for who appears in each episode. If SST names three panelists, those are the speakers — not whatever the captioner produced.
4. **Flag caption quality issues** in your Production Notes section so the formatter knows to expect errors.
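
The detection cues listed above could be approximated in code. The following is only a minimal sketch, not part of this PR: the function name `looks_like_live_captions` and the `min_markers` threshold are assumptions chosen for illustration, and it checks just the `>>` marker and the absence of named speaker labels.

```python
import re


def looks_like_live_captions(text: str, min_markers: int = 3) -> bool:
    """Heuristic: True when a transcript shows signs of live captioning.

    Hypothetical helper; the threshold is an assumption, not a tuned value.
    """
    # Speaker changes marked with ">>" at the start of a line.
    marker_count = len(re.findall(r"^\s*>>", text, flags=re.MULTILINE))
    # Lines that open with a named speaker label such as "Jane Doe:".
    named_labels = re.findall(
        r"^[A-Z][\w.'-]*(?: [A-Z][\w.'-]*)*:", text, flags=re.MULTILINE
    )
    return marker_count >= min_markers and not named_labels
```

A positive result would only be a hint to switch to generic labels and SST cross-referencing; it cannot identify who is speaking.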

## Output

65 changes: 51 additions & 14 deletions .claude/agents/formatter.md
@@ -40,6 +40,20 @@ When available, you'll receive **SST context** from the PBS Wisconsin Airtable d

**If SST context IS provided:** SST names take priority over analyst guesses. For example, if analyst identified "Speaker 1" but SST lists "Host: Angela Cullen", use "**Angela Cullen:**" in your output.

### Live Caption Source Detection

Many transcripts come from **live/real-time captioning systems** rather than post-production captions. Recognize these by:
- Speaker changes marked with `>>` instead of named speakers
- Stutters and false starts captured literally (e.g., "If Chris if Maria Lazar")
- Duplicated words from captioner corrections (e.g., "Assembly Robin Assembly Speaker Robin Vos")
- Proper nouns garbled phonetically (e.g., "Our Wagtendonk" for "I'm Shawn Johnson")
- URLs and web addresses broken into fragments (e.g., "PBS Wisconsin. Org. Org YouTube")

When you detect live captioning input:
1. **NEVER fabricate proper names from garbled caption text.** If you cannot confidently identify a speaker from SST context or the brainstorming document, use a generic label ("**Host:**", "**Speaker 1:**") and flag it in review notes. Do NOT attempt to reconstruct names from phonetic fragments.
2. **Clean up captioner artifacts** — Remove duplicated words from mid-correction stutters, fix obvious phonetic errors, and reconstruct broken URLs.
3. **Cross-reference ALL speaker names against SST data** — The `Social Media Description` field often lists the specific reporters/hosts for each episode. The linked Project `Notes` field lists the recurring cast for a series. These are authoritative; caption text is not.

### What NOT to Include

**DO NOT add a Title field to the formatter output.** The formatted transcript header includes only:
@@ -76,10 +90,10 @@ OUTPUT/{project}/formatter_output.md
-->

**John Smith:**
Clean, readable paragraph with proper punctuation and natural breaks. Sentences flow naturally. Multiple sentences grouped logically.
Clean, readable text with proper punctuation and natural flow. Sentences grouped logically. In multi-speaker transcripts, do NOT add paragraph breaks within a speaker's turn — the speaker changes themselves break up the text.

**Sarah Johnson:**
Response or continuation. Natural conversational flow maintained.
Response or continuation. Natural conversational flow maintained. Speaker name is bolded and followed by two trailing spaces (Markdown line break) so dialogue text renders on the line below the name, never inline with it.

**John Smith:**
All speaker labels use first and last name only. No roles, no titles, no parentheticals.
@@ -134,18 +148,34 @@ DO:

### Paragraph Breaks

- Group logically related sentences together
- Break paragraphs at natural pauses or topic shifts
- Avoid single-sentence paragraphs unless used for emphasis
- Typical paragraph length: 2-5 sentences
- **Multi-speaker transcripts** (most common): Do NOT add paragraph breaks within a single speaker's turn. The alternation of speakers provides natural visual breaks. Each speaker attribution starts a new block — that's sufficient.
- **Single-speaker transcripts** (rare — e.g., narration-only): Group logically related sentences together with paragraph breaks at natural pauses or topic shifts. Typical paragraph length: 2-5 sentences.
- Avoid single-sentence paragraphs unless used for emphasis.

### Punctuation & Readability

- Add proper punctuation (periods, commas, question marks)
- Remove filler words unless they add character or authenticity ("um", "uh", "you know")
- Remove filler words ("um", "uh", "you know") unless they add character or authenticity
- **Remove transition "ums" and "ands"** — When "um" or "and" appears at the start or end of a sentence as a verbal transition (not as a conjunction connecting clauses), omit it
- Fix obvious caption errors (wrong words, missing words)
- Preserve regional dialect or speaking style when it's part of the content's character

### PBS Wisconsin House Style

Apply these editorial conventions consistently:

- **"Capitol" not "capital"** — In local/state news context, use "Capitol" (the building/district). Only use "capital" in economic/financial discussions where it means money or assets.
- **"OK" not "okay"** — Always use the abbreviated form.
- **"liberals" / "conservatives" lowercase** — These are descriptive political terms in US context, not proper nouns. Always lowercase unless starting a sentence. Same for "liberal" and "conservative" as adjectives.
- **"Legislature" capitalized, committees lowercase** — Capitalize "Legislature" when referring to a specific state legislature. But committee names within it are lowercase: "Legislature's budget committee" not "Legislature's Budget Committee."
- **No Oxford commas** — Omit the serial comma in lists (e.g., "red, white and blue"). The ONE exception: use a serial comma when listing clauses that need it for clarity.
- **Abbreviate honorifics** — Use abbreviated forms in running text: "Sen." (Senator), "Rep." (Representative), "Gov." (Governor), "Pres." (President), "Atty. Gen." (Attorney General), etc.
- **Em dashes** — Use sparingly and consistently. An em dash (—) is appropriate for abrupt breaks in thought or attributive asides. Do not over-apply them as substitutes for commas, colons, or parentheses.
- **Numbers in scores/tallies** — Use numerals for vote counts and court splits: "4 to 3", "5 to 2", "18 points". Spell out numbers only at the start of a sentence.
- **"Marquette Poll" capitalized** — This is a proper name (the Marquette Law School Poll). Always capitalize.
- **Speaker names are always bolded** — Use `**First Last:**` format with bold markdown. Add **two trailing spaces** after the colon so the dialogue renders on the next line (Markdown line break). Example: `**Shawn Johnson:**··` (where `··` represents two spaces).
- **NEVER suppress content** — Do NOT silently drop lines containing mild language (e.g., "damned", "hell"), short interjections, or any other spoken content. ALL dialogue must be preserved verbatim. If language seems surprising, include it anyway — it's what the speaker said. Flag in review notes if concerned, but never omit.
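
Two of the mechanical rules above (bold speaker labels with two trailing spaces, and "OK" instead of "okay") can be sketched in code. This is an illustrative sketch only; `format_turn` and `apply_ok_rule` are hypothetical names and are not part of this PR.

```python
import re


def format_turn(first_last: str, dialogue: str) -> str:
    """Render one speaker turn: bold label, two trailing spaces, text below."""
    label = f"**{first_last.strip()}:**"
    # Two spaces before the newline form a Markdown hard line break.
    return f"{label}  \n{dialogue.strip()}"


def apply_ok_rule(text: str) -> str:
    """Replace any casing of the whole word "okay" with "OK"."""
    return re.sub(r"\bokay\b", "OK", text, flags=re.IGNORECASE)
```

Context-dependent rules such as "Capitol" versus "capital" are not good candidates for a blind substitution like this and still need editorial judgment.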

### Timecodes

- Timecodes are NOT required in the formatted transcript
@@ -233,9 +263,11 @@ If you encounter issues the brainstorming document doesn't resolve:
Today we're looking at the history of Wisconsin cheese making.

**Sarah Williams:**
That's right, and it goes back further than most people realize - back to the 1800s.
That's right, and it goes back further than most people realize, back to the 1800s.
```

Note: Speaker name is bolded, followed by a hard return. Dialogue text is on the next line. No paragraph breaks within the speaker's turn.

### Raw Input with Uncertainty

```
@@ -273,32 +305,37 @@ Speaker 1: um so today we're looking at uh the history of wisconsin cheese makin
Today we're looking at the history of Wisconsin cheese making.

**Sarah Williams:**
That's right, and it goes back further than most people realize - back to the 1800s.
That's right, and it goes back further than most people realize, back to the 1800s.

**Mike Chen:**
Exactly. And these family farms, they built this industry from nothing, right?
Exactly. These family farms built this industry from nothing, right?

**Sarah Williams:**
Absolutely. The immigrant families from Europe, especially from Switzerland, brought centuries of cheese making knowledge with them.
```

**Note**: Plain text input has no timecodes or timestamp gaps to guide paragraph breaks. Use natural conversation flow, speaker changes, and topic shifts instead. Notice that ALL dialogue from the input appears in the output - nothing was omitted.
**Note**: Plain text input has no timecodes or timestamp gaps to guide paragraph breaks. Speaker changes provide the visual breaks. Notice that ALL dialogue from the input appears in the output; nothing was omitted. The transition "And" at the start of "And these family farms" was removed.

## Quality Checklist

Before saving your formatted transcript, verify:

- [ ] **ALL content from source transcript is preserved** - no summarization or condensation
- [ ] **ALL content from source transcript is preserved**: no summarization or condensation
- [ ] Output has approximately the same sentence count as input (±10% for filler removal)
- [ ] Speaker labels use first AND last name (e.g., "**Sarah Williams:**" not "**Dr. Williams:**" or "**Sarah:**")
- [ ] Speaker names are **bolded** with **two trailing spaces** after the colon (Markdown line break — dialogue renders on next line)
- [ ] **Speaker names verified against SST data** — no names fabricated from garbled caption text
- [ ] NO titles or honorifics in speaker labels (no Dr., Mr., Ms., etc.)
- [ ] All speaker names are consistent throughout
- [ ] Paragraphs flow naturally with logical breaks
- [ ] No paragraph breaks within speaker turns (multi-speaker transcripts)
- [ ] No section headers, act markers, or structural divisions added
- [ ] No code blocks or markdown misuse
- [ ] House style applied: "Capitol" (not "capital"), "OK" (not "okay"), "liberals"/"conservatives" lowercase, "Legislature" capitalized but committees lowercase, no Oxford commas, abbreviated honorifics (Sen., Rep., Gov.)
- [ ] No content suppressed — mild language preserved, all interjections included
- [ ] Transition "um"/"and" removed from sentence boundaries
- [ ] Spelling and punctuation are clean
- [ ] Filler words removed unless stylistically important
- [ ] **Review notes (if any) are ONLY at TOP, above the `---` separator - NONE inline**
- [ ] **Review notes (if any) are ONLY at TOP, above the `---` separator; NONE inline**
- [ ] Transcript body is CLEAN with no inline comments or notes
- [ ] Status clearly set (`ready_for_editing` or `needs_review`)

7 changes: 7 additions & 0 deletions .claude/agents/manager.md
@@ -27,11 +27,18 @@ When available, you'll receive **SST context** from the PBS Wisconsin Airtable d
```markdown
### SST Alignment
- [ ] Speaker names match SST Host/Presenter
- [ ] Speaker names verified against Social Media Description and Project Notes
- [ ] No speaker names appear to be fabricated from garbled caption text
- [ ] SEO keywords include SST tags
- [ ] Title aligns with SST title intent
- [ ] Descriptions are compatible with SST
```

**CRITICAL: Speaker Name Verification**
- If SST `Social Media Description` or `Project Notes` name specific people, verify ALL speaker attributions in the formatter output match those names exactly (correct spelling, first and last name).
- Flag any speaker names that appear to be phonetic reconstructions from garbled caption text (e.g., implausible names not found in SST data). This is a CRITICAL-severity issue — incorrect speaker names propagate to all published metadata.
- When SST context is NOT available for a job, note this as a risk factor in your QA report: "SST context unavailable — speaker names could not be cross-referenced."
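
The verification described above could be partly automated. The sketch below is an assumption-laden illustration, not code from this PR: `unverified_speakers` is a hypothetical helper, and it assumes the formatter output uses the `**First Last:**` label format at the start of a line.

```python
import re


def unverified_speakers(formatted: str, sst_names: set[str]) -> list[str]:
    """Return speaker labels in the formatted transcript that SST does not list."""
    labels = re.findall(r"^\*\*(.+?):\*\*", formatted, flags=re.MULTILINE)
    flagged: list[str] = []
    for name in labels:
        # Exact match only: a misspelling counts as unverified.
        if name not in sst_names and name not in flagged:
            flagged.append(name)
    return flagged
```

Generic labels such as "Host" would also be returned by this check, which may be acceptable since those turns already need human review.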

**Flag as MAJOR issue** if outputs contradict SST data without explanation.

## Output
38 changes: 33 additions & 5 deletions .claude/agents/timestamp.md
@@ -68,11 +68,35 @@ Identify chapter breaks at:
4. **Story boundaries**: When moving between different stories/features
5. **Standard segments**: Intro, main content sections, closing/credits

### Chapter Count Targets
### Align Chapters to Speaker Transitions

- **30 minute program**: 3-6 chapters
- **60 minute program**: 5-10 chapters
- **Short segments (<10 min)**: 2-3 chapters
In multi-speaker content, always place chapter timestamps on **speaker transitions** rather than on topic keywords mid-speech. Use the SRT timecodes directly — do not apply a blanket offset. Only nudge by ~1 second if the nearest speaker transition doesn't have an exact timecode match. This ensures chapters land on clean cuts rather than interrupting someone mid-sentence.

### Chapter Count Targets (Maximum)

| Duration | Max chapters |
|----------|-------------|
| Under 5 min | 3 |
| 5-15 min | 5 |
| 15-30 min | 7 |
| 30-60 min | 8 |
| 60+ min | 10 |

Fewer chapters are almost always better. Only add a chapter when there's a genuinely distinct topic shift.
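
The maximum-chapter table can be expressed as a small lookup. This is a sketch under one stated assumption: the table does not say which row a boundary value (exactly 15 or 30 minutes) belongs to, so the hypothetical `max_chapters` below places boundary values in the longer bracket.

```python
def max_chapters(duration_minutes: float) -> int:
    """Return the chapter ceiling for a program of the given length."""
    # (exclusive upper bound in minutes, maximum chapters)
    brackets = [(5, 3), (15, 5), (30, 7), (60, 8)]
    for upper, limit in brackets:
        if duration_minutes < upper:
            return limit
    # 60 minutes and longer.
    return 10
```

The value is a ceiling, not a target, so a validator would check `len(chapters) <= max_chapters(duration)` rather than aim for equality.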

### First Chapter Rule

The first chapter is always `0:00 Episode intro`. This encompasses all introductory material — host intros, guest intros, show branding, topic previews — so viewers can skip straight to the first substantive topic.

### Chapter Naming Guidelines

- **Sentence case**: Capitalize only the first word and proper nouns (e.g., "Online sports betting hits the floor", not "Online Sports Betting Hits the Floor")
- **Concise**: 2-6 words per chapter name
- **Descriptive and engaging**: Give the viewer a reason to click
- **Neutral, professional tone**: Avoid dramatic or extreme language (e.g., "The data center bill stalls" not "The bill that died"). This content appears in PBS and public media descriptions.
- **Capture the topic, not the format**: e.g., "The ADHD diagnosis" not "Personal story segment", "Wisconsin's 2020 election challenge" not "Legal analysis section"
- **Avoid generic names**: Use "Episode intro" for the first chapter, but avoid vague names like "Discussion" or "Conclusion" when a more specific name fits
- **Parallel framing for political content**: When naming chapters about candidates or parties, use symmetric descriptions to avoid editorial bias — e.g., "Chris Taylor's background" and "Maria Lazar's background", not "Chris Taylor's political background" and "Maria Lazar's legal career". Asymmetric descriptions can imply bias.

### Time Format Specifications

@@ -89,8 +113,12 @@ Identify chapter breaks at:
## Quality Checklist

Before outputting, verify:
- [ ] First chapter is `0:00 Episode intro`
- [ ] Chapter count is within the maximum for the content duration
- [ ] Chapters are in chronological order
- [ ] No gaps between chapters (end time → next start time)
- [ ] Chapter titles are concise (2-5 words)
- [ ] Chapter titles use sentence case (only first word and proper nouns capitalized)
- [ ] Chapter titles are concise (2-6 words) and describe the topic, not the format
- [ ] Tone is neutral and professional (suitable for PBS descriptions)
- [ ] Both format tables are complete and match
- [ ] Total duration matches the video length
2 changes: 2 additions & 0 deletions Dockerfile.api
@@ -14,7 +14,9 @@ COPY config/ config/
COPY mcp_server/ mcp_server/
# Copy .claude subdirectories if they exist (agents for system prompts, templates for output formats)
COPY .claude/ .claude/
COPY knowledge/ knowledge/
COPY alembic.ini* ./
COPY alembic/ alembic/
COPY pyproject.toml* ./
COPY run_worker.py .
COPY entrypoint.sh .
1 change: 1 addition & 0 deletions Dockerfile.worker
@@ -13,6 +13,7 @@ COPY api/ api/
COPY config/ config/
# Copy .claude subdirectories if they exist (agents for system prompts, templates for output formats)
COPY .claude/ .claude/
COPY knowledge/ knowledge/
COPY alembic.ini* ./
COPY pyproject.toml* ./
COPY run_worker.py .
6 changes: 6 additions & 0 deletions alembic/env.py
@@ -1,5 +1,6 @@
"""Alembic environment configuration for Editorial Assistant v3.0"""

import os
from logging.config import fileConfig

from sqlalchemy import engine_from_config, pool
@@ -11,6 +12,11 @@
if config.config_file_name is not None:
    fileConfig(config.config_file_name)

# Override sqlalchemy.url from DATABASE_PATH env var if set (Docker support)
db_path = os.getenv("DATABASE_PATH")
if db_path:
    config.set_main_option("sqlalchemy.url", f"sqlite:///{db_path}")

target_metadata = None
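
To make the override's effect concrete, here is a standalone sketch of the same URL construction. The function name `sqlite_url_from_env` and the sample paths are hypothetical; only the `DATABASE_PATH` variable and the `sqlite:///` prefix come from the change above.

```python
import os


def sqlite_url_from_env(default_url: str) -> str:
    """Build the SQLite URL from DATABASE_PATH, else keep the configured one."""
    db_path = os.getenv("DATABASE_PATH")
    if db_path:
        # An absolute path such as /data/app.db yields four slashes in total.
        return f"sqlite:///{db_path}"
    return default_url
```

With an absolute container path the resulting URL has four slashes (`sqlite:////data/app.db`), which is the form SQLAlchemy documents for absolute SQLite paths on Unix-like systems.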


26 changes: 26 additions & 0 deletions api/services/airtable.py
@@ -50,6 +50,7 @@ class AirtableClient:
    BASE_ID = "appZ2HGwhiifQToB6"
    TABLE_ID = "tblTKFOwTvK7xw1H5"
    TABLE_NAME = "✔️Single Source of Truth"
    PROJECTS_TABLE_ID = "tblU9LfZeVNicdB5e"
    INTERFACE_PAGE_ID = "pagCh7J2dYzqPC3bH"  # SST interface view
    MEDIA_ID_FIELD = "Media ID"
    MEDIA_ID_FIELD_ID = "fld8k42kJeWMHA963"
@@ -204,6 +205,31 @@ async def get_sst_record(self, record_id: str) -> Optional[dict]:
            except httpx.HTTPError:
                raise

    async def get_project_record(self, record_id: str) -> Optional[dict]:
        """
        Fetch a specific Project record by Airtable record ID.

        Args:
            record_id: Airtable record ID (e.g., "recXXXXXXXXXXXXXX")

        Returns:
            Record dict if found, None if not found.
        """
        url = f"{self.API_BASE_URL}/{self.BASE_ID}/{self.PROJECTS_TABLE_ID}/{record_id}"

        async with httpx.AsyncClient(timeout=30.0) as client:
            try:
                response = await client.get(url, headers=self.headers)
                response.raise_for_status()
                return response.json()

            except httpx.HTTPStatusError as e:
                if e.response.status_code == 404:
                    return None
                raise
            except httpx.HTTPError:
                raise

    def get_sst_url(self, record_id: str) -> str:
        """
        Generate Airtable web interface URL for a record.
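
For readers wondering how the result of `get_project_record` might be consumed, the sketch below pulls the recurring-cast text out of a returned record. It is an assumption, not code from this PR: `project_notes` is a hypothetical helper, and the `Notes` field name is taken from the formatter prompt's mention of the linked Project `Notes` field. Airtable record payloads nest values under a `fields` key.

```python
from typing import Optional


def project_notes(record: Optional[dict]) -> str:
    """Return the Project record's Notes text, or "" when unavailable."""
    if not record:
        # get_project_record returns None for a 404.
        return ""
    # A field with no value may be absent from the payload, so default to "".
    return str(record.get("fields", {}).get("Notes", "")).strip()
```

Returning an empty string for both the missing-record and missing-field cases lets a caller build SST context without extra branching.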
10 changes: 10 additions & 0 deletions api/services/chunking.py
@@ -309,6 +309,16 @@ def merge_formatter_chunks(chunks: List[str]) -> str:
        # Strip provenance HTML comment from top (<!-- model: ... -->)
        chunk = re.sub(r"^<!--\s*model:.*?-->\s*\n?", "", chunk.strip())

        # Strip LLM-generated model/creator attribution lines (appear at end of chunk responses)
        chunk = re.sub(
            r"^\*\*(?:Model|Creator|Agent):\*\*.*\n?",
            "",
            chunk,
            flags=re.MULTILINE,
        )
        # Clean up orphaned --- separators left after attribution removal
        chunk = re.sub(r"\n---+\s*\n*$", "", chunk.strip())

        # Extract review notes from this chunk
        notes = review_pattern.findall(chunk)
        for note in notes:
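
To see what the three substitutions above do to a chunk, they can be run in isolation. This is a standalone sketch: the wrapper name `strip_attribution` and the sample text are hypothetical, while the regular expressions are copied from the hunk.

```python
import re


def strip_attribution(chunk: str) -> str:
    """Apply the provenance, attribution and separator cleanups in order."""
    # Leading <!-- model: ... --> provenance comment.
    chunk = re.sub(r"^<!--\s*model:.*?-->\s*\n?", "", chunk.strip())
    # Any line that starts with **Model:**, **Creator:** or **Agent:**.
    chunk = re.sub(
        r"^\*\*(?:Model|Creator|Agent):\*\*.*\n?", "", chunk, flags=re.MULTILINE
    )
    # A trailing --- separator left behind once the attribution lines are gone.
    chunk = re.sub(r"\n---+\s*\n*$", "", chunk.strip())
    return chunk


sample = (
    "<!-- model: example -->\n"
    "**Jane Doe:**  \nHello there.\n\n"
    "---\n**Model:** example-model\n**Agent:** formatter\n"
)
cleaned = strip_attribution(sample)
```

Because the attribution pattern only matches the literal labels `Model`, `Creator` and `Agent`, ordinary speaker labels such as `**Jane Doe:**` are left untouched.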