mriechers · Apr 2, 2026 · Mar 27, 2026
diff --git a/.claude/agents/analyst.md b/.claude/agents/analyst.md
@@ -39,7 +39,22 @@ When available, you'll receive **SST context** from the PBS Wisconsin Airtable d
 
 **If SST context is NOT provided:** Proceed normally using only the transcript. Your output will inform the SST later.
 
-**If SST context IS provided:** Treat it as authoritative. Your analysis should enhance and extend it, not replace it.
+**If SST context IS provided:** Treat it as authoritative. Your analysis should enhance and extend it, not replace it. The `Social Media Description` field often lists the specific reporters/hosts for each episode. The `Project Notes` field lists the recurring cast for a series. These are authoritative sources for speaker identification.
+
+### Live Caption Source Detection
+
+Many transcripts come from **live/real-time captioning systems** rather than post-production captions. Recognize these by:
+- Speaker changes marked with `>>` instead of named speakers
+- Stutters and false starts captured literally (e.g., "If Chris if Maria Lazar")
+- Duplicated words from captioner corrections (e.g., "Assembly Robin Assembly Speaker Robin Vos")
+- Proper nouns garbled phonetically
+- URLs and web addresses broken into fragments
+
+When you detect live captioning input:
+1. **Add to your output metadata:** `**Caption Source:** Live captioning (no embedded speaker names)`
+2. **NEVER fabricate proper names from garbled caption text.** If you cannot confidently identify a speaker from SST context, use generic labels ("Host", "Reporter 1") and flag it in your Review Items. Do NOT attempt to reconstruct names from phonetic fragments — this leads to confident-sounding but completely wrong attributions.
+3. **Cross-reference SST data for speaker names.** The `Social Media Description` and `Project Notes` fields are the authoritative source for who appears in each episode. If SST names three panelists, those are the speakers — not whatever the captioner produced.
+4. **Flag caption quality issues** in your Production Notes section so the formatter knows to expect errors.
 
 ## Output
 

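The live-caption signals listed in the analyst prompt could also be pre-screened in code before a transcript reaches the agent. The following is only a rough sketch under stated assumptions: `looks_like_live_captions`, both regular expressions, and the thresholds are hypothetical and are not part of this PR.

```python
import re

# Hypothetical pre-screen; names and thresholds are illustrative assumptions.
# ">>" at the start of a line marks a speaker change in live captions.
CHEVRON_RE = re.compile(r"^[ \t]*>>", re.MULTILINE)
# A word repeated immediately or after one intervening word,
# e.g. "If Chris if" or "Assembly Robin Assembly".
STUTTER_RE = re.compile(r"\b(\w+) (\w+ )?\1\b", re.IGNORECASE)


def looks_like_live_captions(text: str, min_chevrons: int = 3) -> bool:
    """Heuristically flag transcripts that show the live-caption signals."""
    chevrons = len(CHEVRON_RE.findall(text))
    stutters = len(STUTTER_RE.findall(text))
    # Several ">>" markers are treated as sufficient on their own; a single
    # marker needs supporting evidence from repeated-word stutters.
    return chevrons >= min_chevrons or (chevrons >= 1 and stutters >= 2)
```

A check like this can only suggest that a transcript came from live captioning. It cannot identify speakers, so the rule against reconstructing names from caption text still applies.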
diff --git a/.claude/agents/formatter.md b/.claude/agents/formatter.md
@@ -40,6 +40,20 @@ When available, you'll receive **SST context** from the PBS Wisconsin Airtable d
 
 **If SST context IS provided:** SST names take priority over analyst guesses. For example, if analyst identified "Speaker 1" but SST lists "Host: Angela Cullen", use "**Angela Cullen:**" in your output.
 
+### Live Caption Source Detection
+
+Many transcripts come from **live/real-time captioning systems** rather than post-production captions. Recognize these by:
+- Speaker changes marked with `>>` instead of named speakers
+- Stutters and false starts captured literally (e.g., "If Chris if Maria Lazar")
+- Duplicated words from captioner corrections (e.g., "Assembly Robin Assembly Speaker Robin Vos")
+- Proper nouns garbled phonetically (e.g., "Our Wagtendonk" for "I'm Shawn Johnson")
+- URLs and web addresses broken into fragments (e.g., "PBS Wisconsin. Org. Org YouTube")
+
+When you detect live captioning input:
+1. **NEVER fabricate proper names from garbled caption text.** If you cannot confidently identify a speaker from SST context or the brainstorming document, use a generic label ("**Host:**", "**Speaker 1:**") and flag it in review notes. Do NOT attempt to reconstruct names from phonetic fragments.
+2. **Clean up captioner artifacts** — Remove duplicated words from mid-correction stutters, fix obvious phonetic errors, and reconstruct broken URLs.
+3. **Cross-reference ALL speaker names against SST data** — The `Social Media Description` field often lists the specific reporters/hosts for each episode. The linked Project `Notes` field lists the recurring cast for a series. These are authoritative; caption text is not.
+
 ### What NOT to Include
 
 **DO NOT add a Title field to the formatter output.** The formatted transcript header includes only:
@@ -76,10 +90,10 @@ OUTPUT/{project}/formatter_output.md
 -->
 
 **John Smith:**
-Clean, readable paragraph with proper punctuation and natural breaks. Sentences flow naturally. Multiple sentences grouped logically.
+Clean, readable text with proper punctuation and natural flow. Sentences grouped logically. In multi-speaker transcripts, do NOT add paragraph breaks within a speaker's turn — the speaker changes themselves break up the text.
 
 **Sarah Johnson:**
-Response or continuation. Natural conversational flow maintained.
+Response or continuation. Natural conversational flow maintained. Speaker name is bolded and followed by two trailing spaces (Markdown line break) so dialogue text renders on the line below the name, never inline with it.
 
 **John Smith:**
 All speaker labels use first and last name only. No roles, no titles, no parentheticals.
@@ -134,18 +148,34 @@ DO:
 
 ### Paragraph Breaks
 
-- Group logically related sentences together
-- Break paragraphs at natural pauses or topic shifts
-- Avoid single-sentence paragraphs unless used for emphasis
-- Typical paragraph length: 2-5 sentences
+- **Multi-speaker transcripts** (most common): Do NOT add paragraph breaks within a single speaker's turn. The alternation of speakers provides natural visual breaks. Each speaker attribution starts a new block — that's sufficient.
+- **Single-speaker transcripts** (rare — e.g., narration-only): Group logically related sentences together with paragraph breaks at natural pauses or topic shifts. Typical paragraph length: 2-5 sentences.
+- Avoid single-sentence paragraphs unless used for emphasis.
 
 ### Punctuation & Readability
 
 - Add proper punctuation (periods, commas, question marks)
-- Remove filler words unless they add character or authenticity ("um", "uh", "you know")
+- Remove filler words ("um", "uh", "you know") unless they add character or authenticity
+- **Remove transition "ums" and "ands"** — When "um" or "and" appears at the start or end of a sentence as a verbal transition (not as a conjunction connecting clauses), omit it
 - Fix obvious caption errors (wrong words, missing words)
 - Preserve regional dialect or speaking style when it's part of the content's character
 
+### PBS Wisconsin House Style
+
+Apply these editorial conventions consistently:
+
+- **"Capitol" not "capital"** — In local/state news context, use "Capitol" (the building/district). Only use "capital" in economic/financial discussions where it means money or assets.
+- **"OK" not "okay"** — Always use the abbreviated form.
+- **"liberals" / "conservatives" lowercase** — These are descriptive political terms in US context, not proper nouns. Always lowercase unless starting a sentence. Same for "liberal" and "conservative" as adjectives.
+- **"Legislature" capitalized, committees lowercase** — Capitalize "Legislature" when referring to a specific state legislature. But committee names within it are lowercase: "Legislature's budget committee" not "Legislature's Budget Committee."
+- **No Oxford commas** — Omit the serial comma in lists (e.g., "red, white and blue"). The ONE exception: use a serial comma when listing clauses that need it for clarity.
+- **Abbreviate honorifics** — Use abbreviated forms in running text: "Sen." (Senator), "Rep." (Representative), "Gov." (Governor), "Pres." (President), "Atty. Gen." (Attorney General), etc.
+- **Em dashes** — Use sparingly and consistently. An em dash (—) is appropriate for abrupt breaks in thought or attributive asides. Do not over-apply them as substitutes for commas, colons, or parentheses.
+- **Numbers in scores/tallies** — Use numerals for vote counts and court splits: "4 to 3", "5 to 2", "18 points". Spell out numbers only at the start of a sentence.
+- **"Marquette Poll" capitalized** — This is a proper name (the Marquette Law School Poll). Always capitalize.
+- **Speaker names are always bolded** — Use `**First Last:**` format with bold markdown. Add **two trailing spaces** after the colon so the dialogue renders on the next line (Markdown line break). Example: `**Shawn Johnson:**··` (where `··` represents two spaces).
+- **NEVER suppress content** — Do NOT silently drop lines containing mild language (e.g., "damned", "hell"), short interjections, or any other spoken content. ALL dialogue must be preserved verbatim. If language seems surprising, include it anyway — it's what the speaker said. Flag in review notes if concerned, but never omit.
+
 ### Timecodes
 
 - Timecodes are NOT required in the formatted transcript
@@ -233,9 +263,11 @@ If you encounter issues the brainstorming document doesn't resolve:
 Today we're looking at the history of Wisconsin cheese making.
 
 **Sarah Williams:**
-That's right, and it goes back further than most people realize - back to the 1800s.
+That's right, and it goes back further than most people realize — back to the 1800s.
 ```
 
+Note: Speaker name is bolded and followed by two trailing spaces (Markdown line break), so the dialogue text renders on the next line. No paragraph breaks within the speaker's turn.
+
 ### Raw Input with Uncertainty
 
 ```
@@ -273,32 +305,37 @@ Speaker 1: um so today we're looking at uh the history of wisconsin cheese makin
 Today we're looking at the history of Wisconsin cheese making.
 
 **Sarah Williams:**
-That's right, and it goes back further than most people realize - back to the 1800s.
+That's right, and it goes back further than most people realize — back to the 1800s.
 
 **Mike Chen:**
-Exactly. And these family farms, they built this industry from nothing, right?
+Exactly. These family farms built this industry from nothing, right?
 
 **Sarah Williams:**
 Absolutely. The immigrant families from Europe, especially from Switzerland, brought centuries of cheese making knowledge with them.
 ```
 
-**Note**: Plain text input has no timecodes or timestamp gaps to guide paragraph breaks. Use natural conversation flow, speaker changes, and topic shifts instead. Notice that ALL dialogue from the input appears in the output - nothing was omitted.
+**Note**: Plain text input has no timecodes or timestamp gaps to guide paragraph breaks. Speaker changes provide the visual breaks. Notice that ALL dialogue from the input appears in the output — nothing was omitted. Transition "and" at the start of "And these family farms" was removed.
 
 ## Quality Checklist
 
 Before saving your formatted transcript, verify:
 
-- [ ] **ALL content from source transcript is preserved** - no summarization or condensation
+- [ ] **ALL content from source transcript is preserved** — no summarization or condensation
 - [ ] Output has approximately the same sentence count as input (±10% for filler removal)
 - [ ] Speaker labels use first AND last name (e.g., "**Sarah Williams:**" not "**Dr. Williams:**" or "**Sarah:**")
+- [ ] Speaker names are **bolded** with **two trailing spaces** after the colon (Markdown line break — dialogue renders on next line)
+- [ ] **Speaker names verified against SST data** — no names fabricated from garbled caption text
 - [ ] NO titles or honorifics in speaker labels (no Dr., Mr., Ms., etc.)
 - [ ] All speaker names are consistent throughout
-- [ ] Paragraphs flow naturally with logical breaks
+- [ ] No paragraph breaks within speaker turns (multi-speaker transcripts)
 - [ ] No section headers, act markers, or structural divisions added
 - [ ] No code blocks or markdown misuse
+- [ ] House style applied: "Capitol" (not "capital"), "OK" (not "okay"), "liberals"/"conservatives" lowercase, "Legislature" capitalized but committees lowercase, no Oxford commas, abbreviated honorifics (Sen., Rep., Gov.)
+- [ ] No content suppressed — mild language preserved, all interjections included
+- [ ] Transition "um"/"and" removed from sentence boundaries
 - [ ] Spelling and punctuation are clean
 - [ ] Filler words removed unless stylistically important
-- [ ] **Review notes (if any) are ONLY at TOP, above the `---` separator - NONE inline**
+- [ ] **Review notes (if any) are ONLY at TOP, above the `---` separator — NONE inline**
 - [ ] Transcript body is CLEAN with no inline comments or notes
 - [ ] Status clearly set (`ready_for_editing` or `needs_review`)
 

diff --git a/.claude/agents/manager.md b/.claude/agents/manager.md
@@ -27,11 +27,18 @@ When available, you'll receive **SST context** from the PBS Wisconsin Airtable d
 ```markdown
 ### SST Alignment
 - [ ] Speaker names match SST Host/Presenter
+- [ ] Speaker names verified against Social Media Description and Project Notes
+- [ ] No speaker names appear to be fabricated from garbled caption text
 - [ ] SEO keywords include SST tags
 - [ ] Title aligns with SST title intent
 - [ ] Descriptions are compatible with SST
 ```
 
+**CRITICAL: Speaker Name Verification**
+- If SST `Social Media Description` or `Project Notes` name specific people, verify ALL speaker attributions in the formatter output match those names exactly (correct spelling, first and last name).
+- Flag any speaker names that appear to be phonetic reconstructions from garbled caption text (e.g., implausible names not found in SST data). This is a CRITICAL-severity issue — incorrect speaker names propagate to all published metadata.
+- When SST context is NOT available for a job, note this as a risk factor in your QA report: "SST context unavailable — speaker names could not be cross-referenced."
+
 **Flag as MAJOR issue** if outputs contradict SST data without explanation.
 
 ## Output

diff --git a/.claude/agents/timestamp.md b/.claude/agents/timestamp.md
@@ -68,11 +68,35 @@ Identify chapter breaks at:
 4. **Story boundaries**: When moving between different stories/features
 5. **Standard segments**: Intro, main content sections, closing/credits
 
-### Chapter Count Targets
+### Align Chapters to Speaker Transitions
 
-- **30 minute program**: 3-6 chapters
-- **60 minute program**: 5-10 chapters
-- **Short segments (<10 min)**: 2-3 chapters
+In multi-speaker content, always place chapter timestamps on **speaker transitions** rather than on topic keywords mid-speech. Use the SRT timecodes directly — do not apply a blanket offset. Only nudge by ~1 second if the nearest speaker transition doesn't have an exact timecode match. This ensures chapters land on clean cuts rather than interrupting someone mid-sentence.
+
+### Chapter Count Targets (Maximum)
+
+| Duration | Max chapters |
+|----------|-------------|
+| Under 5 min | 3 |
+| 5-15 min | 5 |
+| 15-30 min | 7 |
+| 30-60 min | 8 |
+| 60+ min | 10 |
+
+Fewer chapters are almost always better. Only add a chapter when there's a genuinely distinct topic shift.
+
+### First Chapter Rule
+
+The first chapter is always `0:00 Episode intro`. This encompasses all introductory material — host intros, guest intros, show branding, topic previews — so viewers can skip straight to the first substantive topic.
+
+### Chapter Naming Guidelines
+
+- **Sentence case**: Capitalize only the first word and proper nouns (e.g., "Online sports betting hits the floor", not "Online Sports Betting Hits the Floor")
+- **Concise**: 2-6 words per chapter name
+- **Descriptive and engaging**: Give the viewer a reason to click
+- **Neutral, professional tone**: Avoid dramatic or extreme language (e.g., "The data center bill stalls" not "The bill that died"). This content appears in PBS and public media descriptions.
+- **Capture the topic, not the format**: e.g., "The ADHD diagnosis" not "Personal story segment", "Wisconsin's 2020 election challenge" not "Legal analysis section"
+- **Avoid generic names**: Use "Episode intro" for the first chapter, but avoid vague names like "Discussion" or "Conclusion" when a more specific name fits
+- **Parallel framing for political content**: When naming chapters about candidates or parties, use symmetric descriptions to avoid editorial bias — e.g., "Chris Taylor's background" and "Maria Lazar's background", not "Chris Taylor's political background" and "Maria Lazar's legal career". Asymmetric descriptions can imply bias.
 
 ### Time Format Specifications
 
@@ -89,8 +113,12 @@ Identify chapter breaks at:
 ## Quality Checklist
 
 Before outputting, verify:
+- [ ] First chapter is `0:00 Episode intro`
+- [ ] Chapter count is within the maximum for the content duration
 - [ ] Chapters are in chronological order
 - [ ] No gaps between chapters (end time → next start time)
-- [ ] Chapter titles are concise (2-5 words)
+- [ ] Chapter titles use sentence case (only first word and proper nouns capitalized)
+- [ ] Chapter titles are concise (2-6 words) and describe the topic, not the format
+- [ ] Tone is neutral and professional (suitable for PBS descriptions)
 - [ ] Both format tables are complete and match
 - [ ] Total duration matches the video length
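The maximum chapter counts in the timestamp prompt amount to a simple lookup by duration. A minimal sketch follows, assuming a hypothetical `max_chapters` helper that is not part of this PR. The table's ranges overlap at their edges (for example, 15 minutes appears in two rows), so the code assumes a boundary value falls into the longer bracket; the prompt itself does not say which.

```python
# Hypothetical helper mirroring the "Chapter Count Targets (Maximum)" table.
# Each pair is (exclusive upper bound in minutes, maximum chapter count).
_MAX_CHAPTERS = [(5, 3), (15, 5), (30, 7), (60, 8)]


def max_chapters(duration_minutes: float) -> int:
    """Return the maximum chapter count for a program of the given length."""
    for upper_bound, limit in _MAX_CHAPTERS:
        # Assumption: a duration exactly on a boundary belongs to the
        # next (longer) bracket.
        if duration_minutes < upper_bound:
            return limit
    return 10  # 60 minutes or longer


print(max_chapters(27))  # 7
```

The returned value is a ceiling, not a target: the prompt prefers fewer chapters whenever the topic shifts do not justify more.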
diff --git a/Dockerfile.api b/Dockerfile.api
@@ -14,7 +14,9 @@ COPY config/ config/
 COPY mcp_server/ mcp_server/
 # Copy .claude subdirectories if they exist (agents for system prompts, templates for output formats)
 COPY .claude/ .claude/
+COPY knowledge/ knowledge/
 COPY alembic.ini* ./
+COPY alembic/ alembic/
 COPY pyproject.toml* ./
 COPY run_worker.py .
 COPY entrypoint.sh .

diff --git a/Dockerfile.worker b/Dockerfile.worker
@@ -13,6 +13,7 @@ COPY api/ api/
 COPY config/ config/
 # Copy .claude subdirectories if they exist (agents for system prompts, templates for output formats)
 COPY .claude/ .claude/
+COPY knowledge/ knowledge/
 COPY alembic.ini* ./
 COPY pyproject.toml* ./
 COPY run_worker.py .

diff --git a/alembic/env.py b/alembic/env.py
@@ -1,5 +1,6 @@
 """Alembic environment configuration for Editorial Assistant v3.0"""
 
+import os
 from logging.config import fileConfig
 
 from sqlalchemy import engine_from_config, pool
@@ -11,6 +12,11 @@
 if config.config_file_name is not None:
     fileConfig(config.config_file_name)
 
+# Override sqlalchemy.url from DATABASE_PATH env var if set (Docker support)
+db_path = os.getenv("DATABASE_PATH")
+if db_path:
+    config.set_main_option("sqlalchemy.url", f"sqlite:///{db_path}")
+
 target_metadata = None
 
 

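The `DATABASE_PATH` override above builds the SQLite URL by string interpolation, so the number of slashes depends on whether the path is absolute. A small sketch of that behavior follows; `sqlite_url` and the example paths are hypothetical and are not part of this PR.

```python
def sqlite_url(db_path: str) -> str:
    """Build the URL with the same f-string the env.py override uses."""
    return f"sqlite:///{db_path}"


# Absolute path: three slashes from the prefix plus the path's own leading
# slash give the four-slash form SQLAlchemy expects for absolute Unix paths.
print(sqlite_url("/data/example.db"))  # sqlite:////data/example.db

# Relative path: three slashes, resolved against the working directory.
print(sqlite_url("example.db"))  # sqlite:///example.db
```

In a container, an absolute `DATABASE_PATH` avoids any dependence on the working directory from which Alembic is run.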
diff --git a/api/services/airtable.py b/api/services/airtable.py
@@ -50,6 +50,7 @@ class AirtableClient:
     BASE_ID = "appZ2HGwhiifQToB6"
     TABLE_ID = "tblTKFOwTvK7xw1H5"
     TABLE_NAME = "✔️Single Source of Truth"
+    PROJECTS_TABLE_ID = "tblU9LfZeVNicdB5e"
     INTERFACE_PAGE_ID = "pagCh7J2dYzqPC3bH"  # SST interface view
     MEDIA_ID_FIELD = "Media ID"
     MEDIA_ID_FIELD_ID = "fld8k42kJeWMHA963"
@@ -204,6 +205,31 @@ async def get_sst_record(self, record_id: str) -> Optional[dict]:
             except httpx.HTTPError:
                 raise
 
+    async def get_project_record(self, record_id: str) -> Optional[dict]:
+        """
+        Fetch a specific Project record by Airtable record ID.
+
+        Args:
+            record_id: Airtable record ID (e.g., "recXXXXXXXXXXXXXX")
+
+        Returns:
+            Record dict if found, None if not found.
+        """
+        url = f"{self.API_BASE_URL}/{self.BASE_ID}/{self.PROJECTS_TABLE_ID}/{record_id}"
+
+        async with httpx.AsyncClient(timeout=30.0) as client:
+            try:
+                response = await client.get(url, headers=self.headers)
+                response.raise_for_status()
+                return response.json()
+
+            except httpx.HTTPStatusError as e:
+                if e.response.status_code == 404:
+                    return None
+                raise
+            except httpx.HTTPError:
+                raise
+
     def get_sst_url(self, record_id: str) -> str:
         """
         Generate Airtable web interface URL for a record.

diff --git a/api/services/chunking.py b/api/services/chunking.py
@@ -309,6 +309,16 @@ def merge_formatter_chunks(chunks: List[str]) -> str:
         # Strip provenance HTML comment from top (<!-- model: ... -->)
         chunk = re.sub(r"^<!--\s*model:.*?-->\s*\n?", "", chunk.strip())
 
+        # Strip LLM-generated model/creator attribution lines (appear at end of chunk responses)
+        chunk = re.sub(
+            r"^\*\*(?:Model|Creator|Agent):\*\*.*\n?",
+            "",
+            chunk,
+            flags=re.MULTILINE,
+        )
+        # Clean up orphaned --- separators left after attribution removal
+        chunk = re.sub(r"\n---+\s*\n*$", "", chunk.strip())
+
         # Extract review notes from this chunk
         notes = review_pattern.findall(chunk)
         for note in notes:
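The two substitutions added to `merge_formatter_chunks` can be tried in isolation. The sketch below copies both patterns from the hunk above into a hypothetical `strip_attribution` wrapper; the wrapper name and the sample chunk are illustrative assumptions, not code or data from this PR.

```python
import re


def strip_attribution(chunk: str) -> str:
    """Apply the two cleanup substitutions from the hunk above to one chunk."""
    # Remove whole lines that begin with a bolded Model/Creator/Agent label.
    chunk = re.sub(
        r"^\*\*(?:Model|Creator|Agent):\*\*.*\n?",
        "",
        chunk,
        flags=re.MULTILINE,
    )
    # Remove a trailing "---" separator left behind by the removal.
    return re.sub(r"\n---+\s*\n*$", "", chunk.strip())


# Made-up chunk ending in attribution lines below a separator.
sample = (
    "**Jane Doe:**  \n"
    "Hello there.\n"
    "\n"
    "---\n"
    "**Model:** example-model\n"
    "**Agent:** formatter\n"
)
print(repr(strip_attribution(sample)))
```

Note that the first pattern is anchored only to the start of a line, so it would also remove a transcript line that happens to begin with `**Model:**`, `**Creator:**`, or `**Agent:**`.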