9 changes: 8 additions & 1 deletion .gitignore
@@ -36,4 +36,11 @@ npm-debug.log*
public/manifest.json

# Claude Code
.claude/*.local.json
.claude/scheduled_tasks.lock
.claude/worktrees/

# GEPA / prompt optimization
.venv-gepa/
prompts/analyze.optimized*.md
gepa-runs/
196 changes: 196 additions & 0 deletions prompt_optim/COMMANDS.md
@@ -0,0 +1,196 @@
# prompt_optim — sample commands

Ordered shortest-to-longest. All paths are relative — **run from the worktree root**.

## Setup notes

- The worktree's `.seam/` is a copy of the main repo's `.seam/` (gitignored).
  State changes (telemetry, generic-speakers tweaks) live in this copy only.
- `.venv-gepa/` holds the GEPA install. Created via:
  ```bash
  python3 -m venv .venv-gepa
  .venv-gepa/bin/pip install -r prompt_optim/requirements.txt
  ```
- The metric and adapter pick up `.seam/` automatically. Override with
  `SEAM_DATA_DIR=/path/to/.seam` if you want to point at a different copy.

---

## 1) Pure-Python tests (no LLM calls; instant)

### Score one analysis

```bash
.venv-gepa/bin/python3 -m prompt_optim.metric \
  .seam/recordings/2026-02-24_cancun-and-project-management \
  .seam/analysis/2026-02-24_cancun-and-project-management/analysis.json
```

### Distribution of metric scores across the whole dataset

Useful for spotting the worst-scoring analyses (where the prompt is failing
hardest) before you invest LLM calls.

```bash
.venv-gepa/bin/python3 -c '
import json
import statistics
from pathlib import Path
from prompt_optim.metric import score_analysis, _load_people_names

people = _load_people_names()
scores = []
for rd in sorted(Path(".seam/recordings").iterdir()):
    aj = Path(".seam/analysis") / rd.name / "analysis.json"
    if not (rd / "recording.json").exists() or not aj.exists():
        continue
    try:
        rec = json.loads((rd / "recording.json").read_text())
        an = json.loads(aj.read_text())
    except (OSError, ValueError):
        continue  # skip unreadable or malformed files
    scores.append((score_analysis(an, rec, people).total, rd.name))
scores.sort()
just = [s for s, _ in scores]
q = statistics.quantiles(just, n=4)  # quartile cut points: [p25, p50, p75]
print(f"n={len(scores)} min={min(just):.3f} p25={q[0]:.3f} median={statistics.median(just):.3f} p75={q[2]:.3f} max={max(just):.3f}")
print("\nworst 5:")
for s, n in scores[:5]:
    print(f"  {s:.3f}  {n}")
print("\nbest 5:")
for s, n in scores[-5:]:
    print(f"  {s:.3f}  {n}")
'
```

---

## 2) Wrapper smoke (1 LM call; ~20s, uses haiku quota)

Confirms `claude` CLI subprocess invocation, JSON envelope parsing, and
Max-subscription auth.

```bash
.venv-gepa/bin/python3 -m prompt_optim.claude_cli_lm
```

Expect: a response containing the literal word `OK`.

---

## 3) Adapter smoke (1 sonnet call; ~1-2 min)

Single-example end-to-end run of the adapter without the GEPA loop.

```bash
.venv-gepa/bin/python3 -m prompt_optim._smoke_adapter
```

Expect: a score in the 0.7–0.9 range with subscores per component.

---

## 4) Tiny GEPA loop (haiku, ~3-5 min, ~10-15 calls)

Confirms the whole pipeline including reflection + mutation + acceptance
test. Won't produce a meaningfully better prompt at this budget — it's
wiring confirmation.

```bash
.venv-gepa/bin/python3 -m prompt_optim.optimize \
  --budget 6 --task-model haiku --reflection-model haiku
```

Output: `prompts/analyze.optimized.md` (likely identical to seed at this
budget — meaning no mutation passed the acceptance test, which is expected).

---

## 5) Production-strength run (sonnet+opus, hours-to-days)

Plan for 1–3 days wall clock against Max caps. The state file at
`.seam/prompt-optim-state.json` persists telemetry across the run, and the
rate-limit retry loop in `claude_cli_lm.py` makes long sleeps recoverable —
but expect the run to spread across multiple sessions if you have a tight
weekly cap.

```bash
.venv-gepa/bin/python3 -m prompt_optim.optimize \
  --budget 150 \
  --task-model sonnet \
  --reflection-model opus \
  --consistency \
  --train 2026-02-23_team-sync-update \
          2026-03-05_thyroid-medication-adjustment \
          2026-03-17_vision-and-execution-strategy \
          2026-02-24_cancun-and-project-management
```

Watch telemetry while the run progresses (separate terminal):

```bash
watch -n 30 'cat .seam/prompt-optim-state.json'
```

After the run finishes, **do not** auto-promote
`prompts/analyze.optimized.md` over `prompts/analyze.md`. Re-test in
production tool-writing mode first — see the "Reviewing & promoting" section
in [README.md](README.md).

---

## Helpers

### Inspect telemetry

```bash
python3 -m json.tool .seam/prompt-optim-state.json
```

### Reset telemetry

Zeroes the counters but leaves recordings/analyses untouched.

```bash
rm .seam/prompt-optim-state.json
```

### Diff seed vs. optimized prompt

```bash
diff prompts/analyze.md prompts/analyze.optimized.md
```

```bash
code --diff prompts/analyze.md prompts/analyze.optimized.md
```

### Run against a different `.seam/`

```bash
SEAM_DATA_DIR=/Users/cwoodson/src/personal/seam/.seam \
.venv-gepa/bin/python3 -m prompt_optim.optimize --budget 6 ...
```

### Force a specific recording set as trainset

The `--train` flag accepts directory names from `.seam/recordings/`:

```bash
.venv-gepa/bin/python3 -m prompt_optim.optimize \
  --budget 20 \
  --train 2026-02-23_team-sync-update 2026-03-17_vision-and-execution-strategy
```

---

## Reading the output

`optimize.py` prints at the end:

```
[optimize] seed score: 0.7302 -> best score: 0.8154 (12 candidates evaluated)
[optimize] telemetry: 87 calls, rate-limit hits: 2, input tok: 124, output tok: 91,234, cost proxy: $4.73
[optimize] sonnet: 64 calls, in=82 out=78,109 cost_proxy=$2.91
[optimize] opus: 23 calls, in=42 out=13,125 cost_proxy=$1.82
```

- **Seed → best score** tells you whether GEPA actually improved the prompt.
- **`rate_limit_hits`** is how many times the wrapper saw a usage-cap error
and slept; > 0 just means you hit the cap during the run, not that
anything failed.
- **`cost_proxy`** is **not a real charge** under Max-subscription auth —
the CLI surfaces the API-equivalent cost number, but no billing event
occurred. Treat it as a relative indicator of subscription burn.
175 changes: 175 additions & 0 deletions prompt_optim/README.md
@@ -0,0 +1,175 @@
# prompt_optim — GEPA-based optimization for `prompts/analyze.md`

Optimizes the recording-analysis prompt using [GEPA](https://github.com/gepa-ai/gepa)
without paying Anthropic API tokens. All model calls go through the `claude`
CLI, which uses your Max subscription's OAuth keychain auth.

## Why this isn't billed against the API

The `claude` CLI authenticates via the same OAuth/keychain login you use for
interactive sessions. Calls made via `claude -p` consume your Max
subscription's quota — not API credits. The wrapper at
[`claude_cli_lm.py`](claude_cli_lm.py) shells out to `claude -p
--output-format json` and unsets `ANTHROPIC_API_KEY` defensively to keep
billing on the subscription.
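
As a rough illustration of the invocation described above, the sketch below builds the argument list and a scrubbed environment for one `claude -p --output-format json` call. The helper name `build_claude_invocation` is hypothetical and is not taken from `claude_cli_lm.py`; passing the prompt as a positional argument is also an assumption about how the wrapper feeds it in.

```python
import os
from typing import Dict, List, Optional, Tuple


def build_claude_invocation(
    prompt: str, base_env: Optional[Dict[str, str]] = None
) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) for a non-interactive `claude -p` call.

    Hypothetical helper: the real wrapper may be structured differently.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Drop the API key so the CLI uses the OAuth/keychain login instead.
    env.pop("ANTHROPIC_API_KEY", None)
    argv = ["claude", "-p", prompt, "--output-format", "json"]
    return argv, env


argv, env = build_claude_invocation(
    "Reply with OK", {"ANTHROPIC_API_KEY": "placeholder", "PATH": "/usr/bin"}
)
print(argv)
print("ANTHROPIC_API_KEY" in env)
```

The pair can then be handed to `subprocess.run(argv, env=env, capture_output=True, text=True)`, with the JSON envelope parsed from standard output.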

## Files

| File | Purpose |
| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `claude_cli_lm.py` | Subprocess wrapper around `claude -p`. Serial semaphore (Max is one auth bucket), exponential backoff on usage-limit errors, persistent telemetry at `.seam/prompt-optim-state.json` (calls, tokens, cost-proxy, by-model). |
| `metric.py` | Reference-free deterministic metric. See [Metric components](#metric-components) below. |
| `adapter.py` | Custom `GEPAAdapter`. Wraps the candidate prompt in the same delimiters used by `scripts/pocket-run.sh`, asks Claude to emit raw JSON to stdout (vs. the production prompt which writes via the Write tool), parses the response, scores it, and assembles reflective trajectories. Optionally re-runs the first batch example a second time to score consistency. |
| `optimize.py` | CLI driver. `python -m prompt_optim.optimize --budget 20`. |
| `_smoke_adapter.py` | Single-example end-to-end smoke test of the adapter without the GEPA loop. |
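
The serial-plus-backoff behavior attributed to `claude_cli_lm.py` in the table can be sketched roughly as follows. This is not the wrapper's actual code: `UsageLimitError`, the 60-second base delay, and the default of 8 retries are assumptions made for illustration. Only the 1-hour cap and the `max_retries` name come from this README.

```python
import threading
import time

# One call at a time: Max is a single auth bucket.
_CALL_LOCK = threading.Lock()


class UsageLimitError(RuntimeError):
    """Hypothetical error raised when the CLI reports a usage cap."""


def backoff_delays(max_retries, base_s=60.0, cap_s=3600.0):
    """Yield sleep durations base, 2*base, 4*base, ... capped at cap_s."""
    for attempt in range(max_retries):
        yield min(base_s * (2 ** attempt), cap_s)


def call_with_retry(fn, max_retries=8, base_s=60.0, cap_s=3600.0, sleep=time.sleep):
    """Run fn() under the global lock, sleeping and retrying on usage caps."""
    with _CALL_LOCK:
        for delay in backoff_delays(max_retries, base_s, cap_s):
            try:
                return fn()
            except UsageLimitError:
                sleep(delay)
        # Final attempt after max_retries sleeps; any error propagates.
        return fn()
```

The lock is held while sleeping on purpose: another thread calling in during a cap would only hit the same limit.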

## Metric components

All components are deterministic, derived from `recording.json`,
`people.json`, and `.seam/generic-speakers.txt` (if present). No
gold-standard analyses, no LLM-as-judge.

### Hard gate

| Component | Behavior |
| --------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `schema` | Output JSON must conform to the schema in `prompts/analyze.md`. If it fails, the metric short-circuits to 0 — none of the others matter when the structure is broken. |

### Quality components (weighted sum if the gate passes)

| Component | Weight | What it measures |
| ------------------------- | ------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `speaker_grounding` | 0.15 | Every name in `speaker_map.values()` is in `people.json` and not generic (per `scripts/seed_people.py:GENERIC_LABELS` + `.seam/generic-speakers.txt`). |
| `participant_consistency` | 0.05 | `participants[]` ⊆ `people.json` ∪ `speaker_map.values()`, no generics. Catches invented participants. |
| `attribution_grounding` | 0.15 | Aggregate grounding across `speaker_map`, `decisions[].by`, `key_quotes[].speaker`, `action_items[].owner`. |
| `quote_grounding` | 0.15 | Each `key_quotes[].text` must appear as a substring (whitespace-fuzzed) in the transcript. The single biggest failure mode of the current prompt — average across the dataset is ~0.29. |
| `coverage_spread` | 0.05 | Quote timestamps should hit all 3 thirds of the recording. |
| `takeaway_quality` | 0.20 | Average of two sub-checks: pairwise Jaccard < 0.6 between takeaway content-word sets (non-redundancy) + each takeaway shares ≥ 2 content words with the transcript (grounding). Mostly a regression guard today — the current prompt is good here. |
| `mind_map_quality` | 0.10 | Average of: edge-endpoint integrity, single connected component, root branching factor in 3..7, ≥ 2 distinct node types. |
| `output_economy` | 0.10 | Sigmoid penalty on `len(json.dumps(analysis))`. Calibrated against the existing 178-analysis distribution: 14 kB (current median) → 0.5, 11 kB → 0.73, 16 kB → 0.31. |
| `consistency` | 0.05 | **Gated on `--consistency`.** Re-runs the first batch example a second time and scores Jaccard over takeaway content-words and `speaker_map` value sets. When disabled, its 0.05 is redistributed pro-rata across the others. |
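
To make the gate and the weighting concrete, here is a minimal sketch of how the total could be combined from the tables above. The function name `total_score` and its signature are hypothetical; the real `metric.py` may organize this differently. "Pro-rata redistribution" is interpreted here as renormalizing the remaining weights so they sum to 1.

```python
WEIGHTS = {
    "speaker_grounding": 0.15,
    "participant_consistency": 0.05,
    "attribution_grounding": 0.15,
    "quote_grounding": 0.15,
    "coverage_spread": 0.05,
    "takeaway_quality": 0.20,
    "mind_map_quality": 0.10,
    "output_economy": 0.10,
    "consistency": 0.05,
}


def total_score(subscores, schema_ok, use_consistency=False):
    """Weighted sum of quality subscores, gated on schema validity."""
    if not schema_ok:
        return 0.0  # hard gate: broken structure scores zero
    weights = dict(WEIGHTS)
    if not use_consistency:
        # Drop the component and let the others absorb its share pro rata.
        weights.pop("consistency")
    norm = sum(weights.values())
    return sum(w * subscores[name] for name, w in weights.items()) / norm


perfect = {name: 1.0 for name in WEIGHTS}
weak_quotes = dict(perfect, quote_grounding=0.29)
print(round(total_score(weak_quotes, schema_ok=True), 3))
```

Under this reading, a prompt that is perfect everywhere except `quote_grounding` at the dataset average of 0.29 still loses roughly a tenth of its score, which is why that component dominates the optimization signal.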

### Why this set

- **No LLM-as-judge** in v1: doubles subscription burn; judge and task share blind spots.
- **No reference-similarity to existing analyses**: those were produced by the prompt we're optimizing, so they're not gold standard. Optimizing toward them caps the result at "looks like the current output."
- **Reuses `is_generic` from `scripts/seed_people.py`** so the metric and the production speaker-staging pipeline agree on what counts as a generic role label. Tweak `.seam/generic-speakers.txt` once and both pick it up.

## Quick start

One-time setup (creates a gitignored venv at `.venv-gepa/`, so nothing in the project tree is polluted):

```bash
python3 -m venv .venv-gepa
.venv-gepa/bin/pip install -r prompt_optim/requirements.txt
```

Tiny smoke test of the CLI wrapper:

```bash
.venv-gepa/bin/python3 -m prompt_optim.claude_cli_lm
# expect: response containing "OK"
```

Smoke test the adapter (one Sonnet call, ~1–2 min):

```bash
SEAM_DATA_DIR=/path/to/.seam .venv-gepa/bin/python3 -m prompt_optim._smoke_adapter
# expect: score ~0.9 with subscores per component
```

Run a tiny optimization (budget=20, ~30–90 min on Max sub):

```bash
SEAM_DATA_DIR=/path/to/.seam .venv-gepa/bin/python3 -m prompt_optim.optimize --budget 20
# writes prompts/analyze.optimized.md
```

## Production-quality run

Once you've confirmed the wiring works:

```bash
SEAM_DATA_DIR=/path/to/.seam .venv-gepa/bin/python3 -m prompt_optim.optimize \
  --budget 150 \
  --task-model sonnet \
  --reflection-model opus \
  --consistency \
  --train 2026-02-23_team-sync-update \
          2026-03-05_thyroid-medication-adjustment \
          2026-03-17_vision-and-execution-strategy \
          2026-02-24_cancun-and-project-management \
          ...
```

`--consistency` is **off by default** because it adds one extra LM call per
`evaluate()`. Enable it on production-strength runs (`--budget >= 100`) where
the extra cost is justified.

### Telemetry

Every successful CLI call updates the counters in `.seam/prompt-optim-state.json`:

```json
{
"calls": 13,
"rate_limit_hits": 0,
"total_cost_usd_proxy": 0.27,
"total_input_tokens": 9,
"total_output_tokens": 7788,
"total_cache_read_input_tokens": 96221,
"by_model": { "sonnet": { ... }, "opus": { ... } }
}
```

`total_cost_usd_proxy` is **not a real charge** under Max-subscription auth — there's
no API billing event. Treat it as a relative indicator of subscription burn.
`optimize.py` prints a summary at end of run.
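
For a quick look at the state file mid-run, a small helper along these lines may be handy. It is only a sketch: `summarize_state` is a hypothetical name, and it reads just the top-level keys shown in the sample above, since the per-model entries are elided there.

```python
def summarize_state(state):
    """One-line summary built from the top-level telemetry counters."""
    calls = state.get("calls", 0)
    out_tok = state.get("total_output_tokens", 0)
    per_call = out_tok / calls if calls else 0.0
    return (
        f"{calls} calls, {state.get('rate_limit_hits', 0)} rate-limit hits, "
        f"{per_call:.0f} output tok/call, "
        f"cost proxy ${state.get('total_cost_usd_proxy', 0.0):.2f}"
    )


sample = {
    "calls": 13,
    "rate_limit_hits": 0,
    "total_cost_usd_proxy": 0.27,
    "total_input_tokens": 9,
    "total_output_tokens": 7788,
    "total_cache_read_input_tokens": 96221,
}
print(summarize_state(sample))
# prints: 13 calls, 0 rate-limit hits, 599 output tok/call, cost proxy $0.27
```

In practice the dictionary would come from `json.loads(Path(".seam/prompt-optim-state.json").read_text())`.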

Plan for **1–3 days wall clock** against Max caps. The wrapper persists state
at `.seam/prompt-optim-state.json`, so after an interruption you can re-invoke
the same command. Note that GEPA itself starts fresh on each invocation; only
the rate-limit backoff state and call counters carry over.

## Reviewing & promoting an optimized prompt

`prompts/analyze.optimized.md` is **never auto-promoted**. After a run:

1. Eyeball the diff: `diff prompts/analyze.md prompts/analyze.optimized.md`.
2. **Re-test in production tool-writing mode**, not just optimization JSON
mode. The optimized prompt was scored against direct-JSON output; the
production path uses `--allowedTools "Write"`. Manually edit
[`scripts/pocket-run.sh`](../scripts/pocket-run.sh) to reference
`analyze.optimized.md` for one run, regenerate analyses for 3 recordings,
and diff the resulting `analysis.json` files against current ones. Look
for: more grounded `speaker_map` entries, no hallucinated names,
`key_quotes` that match transcript text.
3. Only after that passes manual review, copy `analyze.optimized.md` over
`analyze.md` and open a PR.
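
When doing the manual review in step 2, a small script can flag quotes that do not appear in the transcript. The sketch below assumes the transcript has already been extracted as one plain string (how it is stored inside `recording.json` is not shown here), and it treats "whitespace-fuzzed" as collapsing runs of whitespace; case is left untouched, which is also an assumption. `ungrounded_quotes` is a hypothetical helper, not a function from `metric.py`.

```python
import re


def _normalize(text):
    """Collapse every run of whitespace to a single space."""
    return re.sub(r"\s+", " ", text).strip()


def ungrounded_quotes(analysis, transcript):
    """Return key_quotes texts that are not substrings of the transcript."""
    haystack = _normalize(transcript)
    return [
        q["text"]
        for q in analysis.get("key_quotes", [])
        if _normalize(q["text"]) not in haystack
    ]


transcript = "We should  ship\nit on Friday. Budget is tight."
analysis = {
    "key_quotes": [
        {"text": "ship it on Friday"},
        {"text": "ship it on Monday"},
    ]
}
print(ungrounded_quotes(analysis, transcript))
# prints: ['ship it on Monday']
```

An empty result for each regenerated `analysis.json` is the outcome to look for before promoting.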

## Design notes

- **Metric is reference-free**: the 178 existing analyses in `.seam/analysis/`
were produced by the prompt we're optimizing, so they're not a clean gold
standard. Instead the metric scores against the source recording
(transcript substrings, speaker names in `people.json`) and the schema.
- **Serial execution**: Max is one auth bucket. Parallel calls don't increase
throughput, just burn the bucket faster. The wrapper enforces serial calls
via a global lock.
- **Rate-limit retry**: when the CLI returns a usage-limit error, the wrapper
sleeps with exponential backoff (cap 1h) and retries up to `max_retries`
times. Counters persist to the state file.
- **Optimization-time prompt drift**: at optimization time the wrapped prompt
emits raw JSON; at production time it uses the Write tool. Functionally
equivalent for what the metric scores, but the optimized prompt **must** be
re-tested in production mode before promotion (see step 2 above).
- **No LLM-as-judge**: would double Max-quota spend and create a
feedback loop where judge and task share blind spots.
- **`is_generic` is sourced from `scripts/seed_people.py`**: the metric and
the production speaker-staging pipeline share one definition. To extend
the rejection list, edit `.seam/generic-speakers.txt` (one label per line)
and both pick it up automatically.
- **Output-economy sigmoid is calibrated to the existing dataset**: median
current output is ~14 kB, so the curve is centered at 14 kB. Optimization
has real gradient toward smaller outputs without immediately killing the
seed prompt's score.
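
The shape of the output-economy curve can be sketched as a logistic centered at 14 kB. The 3 kB scale below is a guess, not the calibrated value: it reproduces the 14 kB → 0.5 and 11 kB → 0.73 points from the metric table but gives about 0.34 at 16 kB instead of the quoted 0.31, so the real curve is presumably parameterized somewhat differently. Treating 1 kB as 1000 bytes is likewise an assumption.

```python
import math


def output_economy(size_bytes, center_kb=14.0, scale_kb=3.0):
    """Logistic penalty on serialized size: 0.5 at center, lower when larger.

    center_kb follows the README; scale_kb is an illustrative guess.
    """
    size_kb = size_bytes / 1000
    return 1.0 / (1.0 + math.exp((size_kb - center_kb) / scale_kb))


for size in (11_000, 14_000, 16_000):
    print(size, round(output_economy(size), 2))
```

Because the curve is centered on the current median, the seed prompt sits at 0.5 and every kilobyte trimmed moves the score noticeably, which is the gradient the design note describes.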