yoaquim · jedibrillo · May 3, 2026
diff --git a/.gitignore b/.gitignore
@@ -36,4 +36,11 @@ npm-debug.log*
 public/manifest.json
 
 # Claude Code
-.claude/*.local.json
+.claude/*.local.json
+.claude/scheduled_tasks.lock
+.claude/worktrees/
+
+# GEPA / prompt optimization
+.venv-gepa/
+prompts/analyze.optimized*.md
+gepa-runs/
diff --git a/prompt_optim/COMMANDS.md b/prompt_optim/COMMANDS.md
@@ -0,0 +1,196 @@
+# prompt_optim — sample commands
+
+Ordered shortest-to-longest. All paths are relative — **run from the worktree root**.
+
+## Setup notes
+
+- Worktree's `.seam/` is a copy of the main repo's `.seam/` (gitignored).
+  State changes (telemetry, generic-speakers tweaks) live in this copy only.
+- `.venv-gepa/` holds the GEPA install. Created via:
+  ```bash
+  python3 -m venv .venv-gepa
+  .venv-gepa/bin/pip install -r prompt_optim/requirements.txt
+  ```
+- The metric and adapter pick up `.seam/` automatically. Override with
+  `SEAM_DATA_DIR=/path/to/.seam` if you want to point at a different copy.
+
+---
+
+## 1) Pure-Python tests (no LLM calls; instant)
+
+### Score one analysis
+
+```bash
+.venv-gepa/bin/python3 -m prompt_optim.metric \
+  .seam/recordings/2026-02-24_cancun-and-project-management \
+  .seam/analysis/2026-02-24_cancun-and-project-management/analysis.json
+```
+
+### Distribution of metric scores across the whole dataset
+
+Useful for spotting the worst-scoring analyses (where the prompt is failing
+hardest) before you invest LLM calls.
+
+```bash
+.venv-gepa/bin/python3 -c '
+import json
+from pathlib import Path
+from prompt_optim.metric import score_analysis, _load_people_names
+people = _load_people_names()
+scores = []
+for rd in sorted(Path(".seam/recordings").iterdir()):
+    aj = Path(".seam/analysis") / rd.name / "analysis.json"
+    if not (rd / "recording.json").exists() or not aj.exists(): continue
+    try:
+        rec = json.loads((rd / "recording.json").read_text())
+        an = json.loads(aj.read_text())
+    except (OSError, ValueError): continue  # skip unreadable or malformed JSON
+    scores.append((score_analysis(an, rec, people).total, rd.name))
+scores.sort()
+import statistics
+just = [s for s,_ in scores]
+print(f"n={len(scores)} min={min(just):.3f} p25={statistics.quantiles(just,n=4)[0]:.3f} median={statistics.median(just):.3f} p75={statistics.quantiles(just,n=4)[2]:.3f} max={max(just):.3f}")
+print("\nworst 5:");  [print(f"  {s:.3f}  {n}") for s,n in scores[:5]]
+print("\nbest 5:");   [print(f"  {s:.3f}  {n}") for s,n in scores[-5:]]
+'
+```
+
+---
+
+## 2) Wrapper smoke (1 LM call; ~20s, uses haiku quota)
+
+Confirms `claude` CLI subprocess invocation, JSON envelope parsing, and
+Max-subscription auth.
+
+```bash
+.venv-gepa/bin/python3 -m prompt_optim.claude_cli_lm
+```
+
+Expect: a response containing the literal word `OK`.
+
+---
+
+## 3) Adapter smoke (1 sonnet call; ~1-2 min)
+
+Single-example end-to-end run of the adapter without the GEPA loop.
+
+```bash
+.venv-gepa/bin/python3 -m prompt_optim._smoke_adapter
+```
+
+Expect: a score in the 0.7–0.9 range with subscores per component.
+
+---
+
+## 4) Tiny GEPA loop (haiku, ~3-5 min, ~10-15 calls)
+
+Confirms the whole pipeline including reflection + mutation + acceptance
+test. Won't produce a meaningfully better prompt at this budget — it's
+wiring confirmation.
+
+```bash
+.venv-gepa/bin/python3 -m prompt_optim.optimize \
+  --budget 6 --task-model haiku --reflection-model haiku
+```
+
+Output: `prompts/analyze.optimized.md` (likely identical to seed at this
+budget — meaning no mutation passed the acceptance test, which is expected).
+
+---
+
+## 5) Production-strength run (sonnet+opus, hours-to-days)
+
+Plan for 1–3 days wall clock against Max caps. The state file at
+`.seam/prompt-optim-state.json` persists telemetry across the run, and the
+rate-limit retry loop in `claude_cli_lm.py` makes long sleeps recoverable —
+but expect the run to spread across multiple sessions if you have a tight
+weekly cap.
+
+```bash
+.venv-gepa/bin/python3 -m prompt_optim.optimize \
+  --budget 150 \
+  --task-model sonnet \
+  --reflection-model opus \
+  --consistency \
+  --train 2026-02-23_team-sync-update \
+          2026-03-05_thyroid-medication-adjustment \
+          2026-03-17_vision-and-execution-strategy \
+          2026-02-24_cancun-and-project-management
+```
+
+Watch telemetry while the run progresses (separate terminal):
+
+```bash
+watch -n 30 'cat .seam/prompt-optim-state.json'
+```
+
+After the run finishes, **do not** auto-promote
+`prompts/analyze.optimized.md` over `prompts/analyze.md`. Re-test in
+production tool-writing mode first — see the "Reviewing & promoting" section
+in [README.md](README.md).
+
+---
+
+## Helpers
+
+### Inspect telemetry
+
+```bash
+python3 -m json.tool .seam/prompt-optim-state.json
+```
+
+### Reset telemetry
+
+Zeroes the counters but leaves recordings/analyses untouched.
+
+```bash
+rm .seam/prompt-optim-state.json
+```
+
+### Diff seed vs. optimized prompt
+
+```bash
+diff prompts/analyze.md prompts/analyze.optimized.md
+```
+
+```bash
+code --diff prompts/analyze.md prompts/analyze.optimized.md
+```
+
+### Run against a different `.seam/`
+
+```bash
+SEAM_DATA_DIR=/Users/cwoodson/src/personal/seam/.seam \
+  .venv-gepa/bin/python3 -m prompt_optim.optimize --budget 6 ...
+```
+
+### Force a specific recording set as trainset
+
+The `--train` flag accepts directory names from `.seam/recordings/`:
+
+```bash
+.venv-gepa/bin/python3 -m prompt_optim.optimize \
+  --budget 20 \
+  --train 2026-02-23_team-sync-update 2026-03-17_vision-and-execution-strategy
+```
+
+---
+
+## Reading the output
+
+`optimize.py` prints at the end:
+
+```
+[optimize] seed score: 0.7302 -> best score: 0.8154 (12 candidates evaluated)
+[optimize] telemetry: 87 calls, rate-limit hits: 2, input tok: 124, output tok: 91,234, cost proxy: $4.73
+[optimize]   sonnet: 64 calls, in=82 out=78,109 cost_proxy=$2.91
+[optimize]   opus:   23 calls, in=42 out=13,125 cost_proxy=$1.82
+```
+
+- **Seed → best score** tells you whether GEPA actually improved the prompt.
+- **`rate_limit_hits`** is how many times the wrapper saw a usage-cap error
+  and slept; > 0 just means you hit the cap during the run, not that
+  anything failed.
+- **`cost_proxy`** is **not a real charge** under Max-subscription auth —
+  the CLI surfaces the API-equivalent cost number, but no billing event
+  occurred. Treat it as a relative indicator of subscription burn.
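The top-level counters can also be read straight from the state file. The following is a minimal sketch, not the code in `optimize.py`: `summarize_state` is a hypothetical helper, and the key names are assumed to match the telemetry example shown in the README (`calls`, `rate_limit_hits`, `total_input_tokens`, `total_output_tokens`, `total_cache_read_input_tokens`, `total_cost_usd_proxy`). The per-model entries under `by_model` are elided in that example, so the sketch leaves them alone.

```python
import json
from pathlib import Path


def summarize_state(path):
    """Return a one-line summary of the top-level telemetry counters.

    Hypothetical helper: key names are assumed from the README's telemetry
    example; missing keys fall back to zero instead of raising.
    """
    state = json.loads(Path(path).read_text())
    return (
        f"calls={state.get('calls', 0)} "
        f"rate_limit_hits={state.get('rate_limit_hits', 0)} "
        f"in={state.get('total_input_tokens', 0):,} "
        f"out={state.get('total_output_tokens', 0):,} "
        f"cache_read={state.get('total_cache_read_input_tokens', 0):,} "
        f"cost_proxy=${state.get('total_cost_usd_proxy', 0.0):.2f}"
    )


if __name__ == "__main__":
    # Default location used throughout this document; skipped if absent.
    default = Path(".seam/prompt-optim-state.json")
    if default.exists():
        print(summarize_state(default))
```

As with the summary printed by `optimize.py`, the `cost_proxy` figure here is only a relative indicator of subscription burn.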
diff --git a/prompt_optim/README.md b/prompt_optim/README.md
@@ -0,0 +1,175 @@
+# prompt_optim — GEPA-based optimization for `prompts/analyze.md`
+
+Optimizes the recording-analysis prompt using [GEPA](https://github.com/gepa-ai/gepa)
+without paying Anthropic API tokens. All model calls go through the `claude`
+CLI, which uses your Max subscription's OAuth keychain auth.
+
+## Why this isn't billed against the API
+
+The `claude` CLI authenticates via the same OAuth/keychain login you use for
+interactive sessions. Calls made via `claude -p` consume your Max
+subscription's quota — not API credits. The wrapper at
+[`claude_cli_lm.py`](claude_cli_lm.py) shells out to `claude -p
+--output-format json` and unsets `ANTHROPIC_API_KEY` defensively to keep
+billing on the subscription.
+
+## Files
+
+| File                | Purpose                                                                                                                                                                                                                                                                                                                                                            |
+| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| `claude_cli_lm.py`  | Subprocess wrapper around `claude -p`. Serial semaphore (Max is one auth bucket), exponential backoff on usage-limit errors, persistent telemetry at `.seam/prompt-optim-state.json` (calls, tokens, cost-proxy, by-model).                                                                                                                                        |
+| `metric.py`         | Reference-free deterministic metric. See [Metric components](#metric-components) below.                                                                                                                                                                                                                                                                            |
+| `adapter.py`        | Custom `GEPAAdapter`. Wraps the candidate prompt in the same delimiters used by `scripts/pocket-run.sh`, asks Claude to emit raw JSON to stdout (vs. the production prompt which writes via the Write tool), parses the response, scores it, and assembles reflective trajectories. Optionally re-runs the first batch example a second time to score consistency. |
+| `optimize.py`       | CLI driver. `python -m prompt_optim.optimize --budget 20`.                                                                                                                                                                                                                                                                                                         |
+| `_smoke_adapter.py` | Single-example end-to-end smoke test of the adapter without the GEPA loop.                                                                                                                                                                                                                                                                                         |
+
+## Metric components
+
+All components are deterministic, derived from `recording.json`,
+`people.json`, and `.seam/generic-speakers.txt` (if present). No
+gold-standard analyses, no LLM-as-judge.
+
+### Hard gate
+
+| Component | Behavior                                                                                                                                                              |
+| --------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `schema`  | Output JSON must conform to the schema in `prompts/analyze.md`. If it fails, the metric short-circuits to 0 — none of the others matter when the structure is broken. |
+
+### Quality components (weighted sum if the gate passes)
+
+| Component                 | Weight | What it measures                                                                                                                                                                                                                                   |
+| ------------------------- | ------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `speaker_grounding`       | 0.15   | Every name in `speaker_map.values()` is in `people.json` and not generic (per `scripts/seed_people.py:GENERIC_LABELS` + `.seam/generic-speakers.txt`).                                                                                             |
+| `participant_consistency` | 0.05   | `participants[]` ⊆ `people.json` ∪ `speaker_map.values()`, no generics. Catches invented participants.                                                                                                                                             |
+| `attribution_grounding`   | 0.15   | Aggregate grounding across `speaker_map`, `decisions[].by`, `key_quotes[].speaker`, `action_items[].owner`.                                                                                                                                        |
+| `quote_grounding`         | 0.15   | Each `key_quotes[].text` must appear as a substring (whitespace-fuzzed) in the transcript. The single biggest failure mode of the current prompt — average across the dataset is ~0.29.                                                            |
+| `coverage_spread`         | 0.05   | Quote timestamps should hit all 3 thirds of the recording.                                                                                                                                                                                         |
+| `takeaway_quality`        | 0.20   | Average of two sub-checks: pairwise Jaccard < 0.6 between takeaway content-word sets (non-redundancy) + each takeaway shares ≥ 2 content words with the transcript (grounding). Mostly a regression guard today — the current prompt is good here. |
+| `mind_map_quality`        | 0.10   | Average of: edge-endpoint integrity, single connected component, root branching factor in 3..7, ≥ 2 distinct node types.                                                                                                                           |
+| `output_economy`          | 0.10   | Sigmoid penalty on `len(json.dumps(analysis))`. Calibrated against the existing 178-analysis distribution: 14 kB (current median) → 0.5, 11 kB → 0.73, 16 kB → 0.31.                                                                               |
+| `consistency`             | 0.05   | **Gated on `--consistency`.** Re-runs the first batch example a second time and scores Jaccard over takeaway content-words and `speaker_map` value sets. When disabled, its 0.05 is redistributed pro-rata across the others.                      |
+
+### Why this set
+
+- **No LLM-as-judge** in v1: doubles subscription burn; judge and task share blind spots.
+- **No reference-similarity to existing analyses**: those were produced by the prompt we're optimizing, so they're not gold standard. Optimizing toward them caps the result at "looks like the current output."
+- **Reuses `is_generic` from `scripts/seed_people.py`** so the metric and the production speaker-staging pipeline agree on what counts as a generic role label. Tweak `.seam/generic-speakers.txt` once and both pick it up.
+
+## Quick start
+
+One-time setup (creates a gitignored venv at `.venv-gepa/`, so nothing pollutes the project tree):
+
+```bash
+python3 -m venv .venv-gepa
+.venv-gepa/bin/pip install -r prompt_optim/requirements.txt
+```
+
+Tiny smoke test of the CLI wrapper:
+
+```bash
+.venv-gepa/bin/python3 -m prompt_optim.claude_cli_lm
+# expect: response containing "OK"
+```
+
+Smoke test the adapter (one Sonnet call, ~1–2 min):
+
+```bash
+SEAM_DATA_DIR=/path/to/.seam .venv-gepa/bin/python3 -m prompt_optim._smoke_adapter
+# expect: a score in the 0.7–0.9 range with subscores per component
+```
+
+Run a tiny optimization (budget=20, ~30–90 min on Max sub):
+
+```bash
+SEAM_DATA_DIR=/path/to/.seam .venv-gepa/bin/python3 -m prompt_optim.optimize --budget 20
+# writes prompts/analyze.optimized.md
+```
+
+## Production-quality run
+
+Once you've confirmed the wiring works:
+
+```bash
+SEAM_DATA_DIR=/path/to/.seam .venv-gepa/bin/python3 -m prompt_optim.optimize \
+    --budget 150 \
+    --task-model sonnet \
+    --reflection-model opus \
+    --consistency \
+    --train 2026-02-23_team-sync-update \
+            2026-03-05_thyroid-medication-adjustment \
+            2026-03-17_vision-and-execution-strategy \
+            2026-02-24_cancun-and-project-management \
+            ...
+```
+
+`--consistency` is **off by default** because it adds one extra LM call per
+`evaluate()`. Enable it on production-strength runs (`--budget` of 100 or more),
+where the extra cost is justified.
+
+### Telemetry
+
+Every successful CLI call updates the counters in `.seam/prompt-optim-state.json`:
+
+```json
+{
+  "calls": 13,
+  "rate_limit_hits": 0,
+  "total_cost_usd_proxy": 0.27,
+  "total_input_tokens": 9,
+  "total_output_tokens": 7788,
+  "total_cache_read_input_tokens": 96221,
+  "by_model": { "sonnet": { ... }, "opus": { ... } }
+}
+```
+
+`total_cost_usd_proxy` is **not a real charge** under Max-subscription auth — there's
+no API billing event. Treat it as a relative indicator of subscription burn.
+`optimize.py` prints a summary at end of run.
+
+Plan for **1–3 days wall clock** against Max caps. The wrapper persists state
+at `.seam/prompt-optim-state.json`, so an interrupted multi-day run can be
+restarted by re-invoking the same command. GEPA itself starts fresh on restart;
+only the rate-limit backoff state and call counters are preserved.
+
+## Reviewing & promoting an optimized prompt
+
+`prompts/analyze.optimized.md` is **never auto-promoted**. After a run:
+
+1. Eyeball the diff: `diff prompts/analyze.md prompts/analyze.optimized.md`.
+2. **Re-test in production tool-writing mode**, not just optimization JSON
+   mode. The optimized prompt was scored against direct-JSON output; the
+   production path uses `--allowedTools "Write"`. Manually edit
+   [`scripts/pocket-run.sh`](../scripts/pocket-run.sh) to reference
+   `analyze.optimized.md` for one run, regenerate analyses for 3 recordings,
+   and diff the resulting `analysis.json` files against current ones. Look
+   for: more grounded `speaker_map` entries, no hallucinated names,
+   `key_quotes` that match transcript text.
+3. Only after that passes manual review, copy `analyze.optimized.md` over
+   `analyze.md` and open a PR.
+
+## Design notes
+
+- **Metric is reference-free**: the 178 existing analyses in `.seam/analysis/`
+  were produced by the prompt we're optimizing, so they're not a clean gold
+  standard. Instead the metric scores against the source recording
+  (transcript substrings, speaker names in `people.json`) and the schema.
+- **Serial execution**: Max is one auth bucket. Parallel calls don't increase
+  throughput, just burn the bucket faster. The wrapper enforces serial calls
+  via a global lock.
+- **Rate-limit retry**: when the CLI returns a usage-limit error, the wrapper
+  sleeps with exponential backoff (cap 1h) and retries up to `max_retries`
+  times. Counters persist to the state file.
+- **Optimization-time prompt drift**: at optimization time the wrapped prompt
+  emits raw JSON; at production time it uses the Write tool. Functionally
+  equivalent for what the metric scores, but the optimized prompt **must** be
+  re-tested in production mode before promotion (see step 2 above).
+- **No LLM-as-judge**: would double Max-quota spend and create a
+  feedback loop where judge and task share blind spots.
+- **`is_generic` is sourced from `scripts/seed_people.py`**: the metric and
+  the production speaker-staging pipeline share one definition. To extend
+  the rejection list, edit `.seam/generic-speakers.txt` (one label per line)
+  and both pick it up automatically.
+- **Output-economy sigmoid is calibrated to the existing dataset**: median
+  current output is ~14 kB, so the curve is centered at 14 kB. Optimization
+  has real gradient toward smaller outputs without immediately killing the
+  seed prompt's score.
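Two of the rules described in the README can be illustrated in a few lines. This is a rough sketch under stated assumptions, not the code in `metric.py`: `normalize_ws`, `quote_grounding`, and `active_weights` are hypothetical names; "whitespace-fuzzed" is assumed to mean collapsing runs of whitespace before a case-sensitive substring test; and an empty quote list is assumed to score 1.0. The weights are copied from the quality-components table.

```python
import re

# Weights copied from the "Quality components" table; they sum to 1.0.
WEIGHTS = {
    "speaker_grounding": 0.15,
    "participant_consistency": 0.05,
    "attribution_grounding": 0.15,
    "quote_grounding": 0.15,
    "coverage_spread": 0.05,
    "takeaway_quality": 0.20,
    "mind_map_quality": 0.10,
    "output_economy": 0.10,
    "consistency": 0.05,
}


def normalize_ws(text):
    """Collapse every run of whitespace to one space (assumed fuzzing rule)."""
    return re.sub(r"\s+", " ", text).strip()


def quote_grounding(quotes, transcript):
    """Fraction of quotes found in the transcript, ignoring whitespace layout."""
    if not quotes:
        return 1.0  # assumption: no quotes means nothing is ungrounded
    haystack = normalize_ws(transcript)
    hits = sum(1 for quote in quotes if normalize_ws(quote) in haystack)
    return hits / len(quotes)


def active_weights(consistency_enabled):
    """Return the weights in effect for a run.

    Without --consistency, the 0.05 consistency weight is spread pro-rata
    over the remaining components so the total stays at 1.0.
    """
    if consistency_enabled:
        return dict(WEIGHTS)
    kept = {name: w for name, w in WEIGHTS.items() if name != "consistency"}
    total = sum(kept.values())
    return {name: w / total for name, w in kept.items()}
```

Under this reading, a quote that differs from the transcript only in line breaks or doubled spaces still counts as grounded, while any paraphrase scores as a miss. With `--consistency` disabled, each remaining weight is divided by 0.95, so `takeaway_quality` rises from 0.20 to roughly 0.21.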