Add prompt_optim: GEPA-based optimization for analyze.md (Max-subscription auth) #18
jedibrillo wants to merge 4 commits into
Optimizes the recording-analysis prompt without paying for API tokens by routing
every model call through the `claude` CLI (Max subscription OAuth auth) instead
of the Anthropic SDK. Custom GEPAAdapter, reference-free deterministic metric
(9 components covering grounding, coverage, takeaway quality, mind-map
integrity, and output economy), serial CLI wrapper with rate-limit backoff,
and per-call telemetry persisted to .seam/prompt-optim-state.json.
Includes:
- prompt_optim/{claude_cli_lm,metric,adapter,optimize}.py — the pipeline
- prompt_optim/_pick_trainset.py — picker for diverse trainset recordings
- prompt_optim/_smoke_adapter.py — single-example wiring smoke test
- prompt_optim/_seed_people_bridge.py — shares is_generic() with seed-people.py
- prompt_optim/{README,COMMANDS,TRAINSET}.md — usage and design notes
- .gitignore: ignore .venv-gepa/, prompts/analyze.optimized*.md, gepa-runs/,
.claude/worktrees/, .claude/scheduled_tasks.lock
The optimized prompt is written to prompts/analyze.optimized.md (gitignored)
and is never auto-promoted; production analyze.md is unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A few items from review:

1. Generalize the docs

The examples in The

2. `seed-people.py` rename

The `_seed_people_bridge.py` import-by-path workaround is no longer needed: #19 renames `seed-people.py` to `seed_people.py`, making it a standard importable module. Once that merges, this PR can drop the bridge file and do a normal `from scripts.seed_people import is_generic, load_generic_labels` (or adjust the import path as needed).

3. Tests

The PR adds no automated tests for the `prompt_optim` module, only manual smoke scripts. The project's CLAUDE.md mandates TDD with 80%+ coverage on new code. At minimum, `metric.py` is pure functions and very testable:

These don't require any LLM calls; they're deterministic scoring functions. They should have pytest coverage.
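As a rough illustration of what such coverage could look like, here is a minimal pytest-style sketch. The `quote_grounding` function below is a hypothetical stand-in written for this example; its name matches one of the metric components, but its signature, normalization rule, and empty-list behavior are assumptions, not the code in `metric.py`.

```python
"""Hypothetical sketch: a pure scoring function plus pytest-style tests."""


def _normalize(text: str) -> str:
    # Collapse whitespace and lowercase so trivial formatting differences
    # do not count as grounding failures (an assumption, not the PR's rule).
    return " ".join(text.lower().split())


def quote_grounding(quotes: list[str], transcript: str) -> float:
    """Fraction of quotes found verbatim (after normalization) in the transcript."""
    if not quotes:
        return 1.0  # assumption: nothing to ground means nothing failed
    haystack = _normalize(transcript)
    hits = sum(1 for q in quotes if _normalize(q) and _normalize(q) in haystack)
    return hits / len(quotes)


TRANSCRIPT = "Alice: We should ship on Friday.\nBob: Only if the tests pass."


def test_all_quotes_grounded():
    assert quote_grounding(["ship on Friday", "if the tests pass"], TRANSCRIPT) == 1.0


def test_paraphrase_is_not_grounded():
    assert quote_grounding(["ship on Friday", "ship by the weekend"], TRANSCRIPT) == 0.5


def test_empty_quote_list_scores_full():
    assert quote_grounding([], TRANSCRIPT) == 1.0
```

Tests of this shape need only string fixtures, so they run in milliseconds and never touch the CLI or the subscription quota.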
Two more items:

4. Separate valset from trainset
    ap.add_argument(
        "--val",
        nargs="*",
        default=None,
        help="validation recording dir names (defaults to --train if not set)",
    )

Then in the body:

    val_names = args.val if args.val else args.train
    valset = _build_trainset(seam_dir, val_names)

Usage becomes:

    python -m prompt_optim.optimize \
        --budget 150 \
        --train recording-a recording-b recording-c recording-d \
        --val recording-x recording-y

When
5. Auto-calculate output economy center

The sigmoid center is hardcoded at 14,000 bytes:

    x = (size - 14000) / 3000

This should be computed at runtime from the user's actual analysis distribution. Something like:

    from pathlib import Path

    def _compute_economy_center(seam_dir: Path) -> int:
        """Median analysis.json size across the dataset."""
        sizes = []
        analysis_dir = seam_dir / "analysis"
        if analysis_dir.exists():
            for d in analysis_dir.iterdir():
                f = d / "analysis.json"
                if f.exists():
                    sizes.append(f.stat().st_size)
        if not sizes:
            return 14000  # fallback default
        return int(sorted(sizes)[len(sizes) // 2])

Then

This way any user's dataset self-calibrates: someone with longer recordings and naturally larger analyses won't get unfairly penalized.
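To make the effect of the center concrete, here is a small sketch of a sigmoid score that takes the center as a parameter. The function name `output_economy`, the logistic form, and the clamping are assumptions for illustration; only the `(size - center) / 3000` term and the "median scores 0.5" behavior come from the discussion above.

```python
import math


def output_economy(size_bytes: int, center: int, scale: float = 3000.0) -> float:
    """Score in (0, 1): 0.5 at the center, lower for larger outputs."""
    x = (size_bytes - center) / scale
    # Clamp so very large files cannot overflow math.exp (assumed safeguard).
    x = max(-60.0, min(60.0, x))
    return 1.0 / (1.0 + math.exp(x))


# The same 20 kB analysis scores differently depending on the dataset's center.
for center in (14000, 20000):
    print(center, round(output_economy(20000, center), 3))
```

With a runtime-computed center, a dataset whose median analysis is 20 kB would score that size at 0.5 rather than treating it as oversized.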
## Summary

- Renames `scripts/seed-people.py` → `scripts/seed_people.py` so it's importable as a standard Python module
- Eliminates the need for `importlib` path hacks (like `_seed_people_bridge.py` in #18)
- Updates references in CLAUDE.md and README.md

## Test plan

- [x] All tests pass (94 vitest + 51 pytest)
- [x] No scripts reference `seed-people.py` by name (it's only invoked manually)
seed-people.py was renamed to seed_people.py on main (#19), so the importlib bridge hack is no longer needed. Import directly via sys.path like the rest of the project's cross-directory imports.
Update: Item #2 from my earlier review (the
Commit: b0cb9bd

Remaining open items from the review:
## Summary

Adds `prompt_optim/`, a GEPA-based optimizer for `prompts/analyze.md` that runs entirely on Max-subscription auth (no API tokens billed). A custom `GEPAAdapter` routes every model call through the `claude` CLI; a reference-free deterministic metric scores outputs against the source recording, `.seam/people.json`, and the schema, so no gold-standard analyses are needed.

The production prompt at `prompts/analyze.md` is unchanged. Optimized variants are written to `prompts/analyze.optimized*.md` (gitignored) and require manual review before promotion.

## What's in here

- `claude -p --output-format json`. Serial semaphore (Max is one auth bucket, so parallel calls don't add throughput, they just burn it faster), exponential backoff on usage-limit errors, persistent telemetry at `.seam/prompt-optim-state.json` (calls, tokens, cost-proxy, by-model breakdown).
- `schema` (hard gate), `speaker_grounding`, `participant_consistency`, `attribution_grounding`, `quote_grounding`, `coverage_spread`, `takeaway_quality`, `mind_map_quality`, `output_economy`, optional `consistency` (gated on the `--consistency` flag, doubles eval cost on the first batch example only). The `output_economy` sigmoid is calibrated against the existing 178-analysis dataset (median ~14 kB → score 0.5). The `is_generic` rule for filtering role labels (`Speaker 01`, `host`, etc.) is shared with `scripts/seed-people.py` via `_seed_people_bridge.py`: a single source of truth that picks up `.seam/generic-speakers.txt` automatically.
- `GEPAAdapter`. Wraps the candidate prompt in the same delimiters as `scripts/pocket-run.sh`, asks Claude to emit raw JSON to stdout (vs. production's Write-tool path), parses, scores, and builds reflective trajectories.
- `python -m prompt_optim.optimize --budget N`.
- `.venv-gepa/`, `prompts/analyze.optimized*.md`, `gepa-runs/`, `.claude/worktrees/`, `.claude/scheduled_tasks.lock`.

## Why route through the CLI
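The wrapper behavior described above (one call at a time, exponential backoff on usage-limit errors, `ANTHROPIC_API_KEY` removed from the environment) can be sketched roughly as follows. This is an illustrative sketch, not the PR's `claude_cli_lm.py`: the names `call_claude` and `_is_usage_limit`, the retry parameters, the substring used to detect a usage-limit error, and passing the prompt as a command-line argument are all assumptions, and telemetry is left out.

```python
import json
import os
import subprocess
import threading
import time

# One permit: calls run strictly one at a time.
_CLI_SLOT = threading.Semaphore(1)


def _is_usage_limit(stderr: str) -> bool:
    # Hypothetical check: the real CLI's error text may differ.
    return "usage limit" in stderr.lower()


def call_claude(prompt, *, max_retries=5, base_delay=2.0,
                runner=subprocess.run, sleep=time.sleep):
    """Run one `claude -p` call serially, retrying with exponential backoff."""
    # Drop the API key so the CLI falls back to its own subscription auth.
    env = {k: v for k, v in os.environ.items() if k != "ANTHROPIC_API_KEY"}
    cmd = ["claude", "-p", prompt, "--output-format", "json"]
    for attempt in range(max_retries + 1):
        with _CLI_SLOT:
            proc = runner(cmd, capture_output=True, text=True, env=env)
        if proc.returncode == 0:
            return json.loads(proc.stdout)
        if attempt == max_retries or not _is_usage_limit(proc.stderr):
            raise RuntimeError(f"claude CLI failed: {proc.stderr.strip()}")
        # Wait base_delay, then 2x, 4x, ... before the next attempt.
        sleep(base_delay * 2 ** attempt)
```

The `runner` and `sleep` parameters exist only so the retry logic can be exercised in tests with fakes, without spending any quota.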
The `claude` CLI authenticates via OAuth/keychain, so calls consume Max-subscription quota, not API credits. The wrapper unsets `ANTHROPIC_API_KEY` defensively to keep billing on the subscription regardless of environment.

## Why reference-free metric

The 178 existing analyses in `.seam/analysis/` were produced by the very prompt we're optimizing, so they're not a clean gold standard. Optimizing toward similarity would cap us at "looks like the current output." Instead the metric scores against the source recording (transcript substrings, schema, `people.json` grounding) and structural properties.

## Validation

End-to-end smoke run completed successfully on haiku with `--budget 6`:

Distribution check across 30 sample analyses scored by the new metric: range 0.57–0.92, median 0.76. `quote_grounding` mean 0.29 confirms it's the dominant failure mode (the verbatim-quote substring problem), which is exactly what GEPA can target.

## Test plan

- `npm test` passes (94 vitest + 51 pytest, no regressions in existing suite)
- `npm run lint` passes (0 errors; 3 pre-existing warnings unrelated to this PR)
- `npm run format:check` passes
- `prompt_optim` smoke tests run cleanly (CLI wrapper, metric, adapter, full GEPA loop on haiku)
- `_BASE_WEIGHTS` against the design notes in README.md
- `_seed_people_bridge.py` import-by-path approach is acceptable (alternative: rename `seed-people.py` to `seed_people.py`)

## Known follow-ups (not in this PR)

- `optimize.py` currently passes `valset=trainset`. A real production-strength run should hold out 2–3 recordings as a true valset.

🤖 Generated with Claude Code