Add prompt_optim: GEPA-based optimization for analyze.md (Max-subscription auth) by jedibrillo · Pull Request #18 · yoaquim/seam

jedibrillo · 2026-05-03T03:55:31Z

Summary

Adds prompt_optim/, a GEPA-based optimizer for prompts/analyze.md that runs entirely on Max-subscription auth (no API tokens billed). A custom GEPAAdapter routes every model call through the claude CLI, and a reference-free deterministic metric scores outputs against the source recording, .seam/people.json, and the schema, so no gold-standard analyses are needed.

The production prompt at prompts/analyze.md is unchanged. Optimized variants are written to prompts/analyze.optimized*.md (gitignored) and require manual review before promotion.

What's in here

claude_cli_lm.py: subprocess wrapper around claude -p --output-format json. Serial semaphore (Max is one auth bucket, so parallel calls don't add throughput, they just burn quota faster), exponential backoff on usage-limit errors, persistent telemetry at .seam/prompt-optim-state.json (calls, tokens, cost-proxy, by-model breakdown).
metric.py — 9-component reference-free metric:
- schema (hard gate), speaker_grounding, participant_consistency, attribution_grounding, quote_grounding, coverage_spread, takeaway_quality, mind_map_quality, output_economy, plus an optional consistency component (gated on the --consistency flag; it doubles eval cost, but only on the first batch example).
- The output_economy sigmoid is calibrated against the existing 178-analysis dataset (median ~14 kB → score 0.5).
- The is_generic rule for filtering role labels (Speaker 01, host, etc.) is shared with scripts/seed-people.py via _seed_people_bridge.py — single source of truth, picks up .seam/generic-speakers.txt automatically.
adapter.py — custom GEPAAdapter. Wraps the candidate prompt in the same delimiters as scripts/pocket-run.sh, asks Claude to emit raw JSON to stdout (vs. production's Write-tool path), parses, scores, builds reflective trajectories.
optimize.py — CLI driver. python -m prompt_optim.optimize --budget N.
_pick_trainset.py — picker that ranks recordings by current metric score and prints a table annotated with type, duration, speaker setup, and dominant failure mode for hand-picking a diverse trainset.
_smoke_adapter.py — single-example end-to-end wiring smoke test.
Docs: README.md (design + usage), COMMANDS.md (sample commands), TRAINSET.md (trainset-picking criteria + script).
.gitignore: ignore .venv-gepa/, prompts/analyze.optimized*.md, gepa-runs/, .claude/worktrees/, .claude/scheduled_tasks.lock.
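
To make the serial-call and backoff behavior described for claude_cli_lm.py concrete, here is a minimal sketch. It is not the code in this PR: `UsageLimitError`, `call_serially`, and the injectable `runner` and `sleep` parameters are hypothetical names chosen for illustration, and how the real wrapper detects a usage-limit error is not shown in the description.

```python
import subprocess
import threading
import time
from typing import Callable, Sequence

# A single-permit semaphore: only one CLI call runs at a time.
_CALL_LOCK = threading.Semaphore(1)


class UsageLimitError(RuntimeError):
    """Hypothetical error a runner raises when the CLI reports a usage limit."""


def run_cli(cmd: Sequence[str]) -> str:
    """Run one command and return its stdout; raises on a non-zero exit."""
    proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return proc.stdout


def call_serially(
    cmd: Sequence[str],
    runner: Callable[[Sequence[str]], str] = run_cli,
    max_retries: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call runner(cmd) one at a time, doubling the wait after each usage-limit error."""
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    with _CALL_LOCK:
        for attempt in range(max_retries + 1):
            try:
                return runner(cmd)
            except UsageLimitError:
                if attempt == max_retries:
                    raise  # retries exhausted, surface the error
                sleep(base_delay * 2**attempt)
    raise AssertionError("unreachable")
```

Injecting `runner` and `sleep` keeps the retry logic testable without spawning a process or actually waiting.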

Why route through the CLI

The claude CLI authenticates via OAuth/keychain — calls consume Max-subscription quota, not API credits. The wrapper unsets ANTHROPIC_API_KEY defensively to keep billing on the subscription regardless of environment.
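
As an illustration of the defensive unset, one way to do it is to hand the subprocess a copy of the environment with the key removed. This is a sketch under assumptions; `subscription_env` is a hypothetical helper name and the PR's wrapper may do this differently.

```python
import os
from typing import Dict, Mapping, Optional


def subscription_env(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with ANTHROPIC_API_KEY removed."""
    env = dict(os.environ if base is None else base)
    # pop() with a default is a no-op when the key is absent.
    env.pop("ANTHROPIC_API_KEY", None)
    return env
```

The result can be passed as `env=subscription_env()` to `subprocess.run`, which leaves the parent process's own environment untouched.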

Why reference-free metric

The 178 existing analyses in .seam/analysis/ were produced by the very prompt we're optimizing, so they're not a clean gold standard. Optimizing toward similarity would cap us at "looks like the current output." Instead the metric scores against the source recording (transcript substrings, schema, people.json grounding) and structural properties.
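
A reference-free check of this kind can be sketched with a transcript-substring test for quotes. The function below is illustrative only: the name `quote_grounding`, the whitespace and case normalization, and the score for an empty quote list are assumptions, not the behavior of metric.py.

```python
import re
from typing import Iterable


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line wraps don't break matching."""
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_grounding(quotes: Iterable[str], transcript: str) -> float:
    """Fraction of quotes found verbatim (after normalization) in the transcript."""
    haystack = _normalize(transcript)
    needles = [n for n in (_normalize(q) for q in quotes) if n]
    if not needles:
        return 1.0  # assumption: no quotes means nothing is ungrounded
    hits = sum(1 for n in needles if n in haystack)
    return hits / len(needles)
```

Because the score depends only on the transcript and the output, it needs no gold-standard analysis to compare against.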

Validation

End-to-end smoke run completed successfully on haiku with --budget 6:

Seed prompt scored 0.7302 averaged over 2 trainset examples.
GEPA proposed a "Validation Checklist" mutation addressing the metric's flagged failure modes (speaker grounding + verbatim quotes), evaluated it, and correctly rejected it on the minibatch acceptance test.
0 rate-limit hits, telemetry accumulated correctly.

Distribution check across 30 sample analyses scored by the new metric: range 0.57–0.92, median 0.76. quote_grounding mean 0.29 confirms it's the dominant failure mode (the verbatim-quote substring problem) — exactly what GEPA can target.

Test plan

npm test passes (94 vitest + 51 pytest, no regressions in existing suite)
npm run lint passes (0 errors; 3 pre-existing warnings unrelated to this PR)
npm run format:check passes
prompt_optim smoke tests run cleanly (CLI wrapper, metric, adapter, full GEPA loop on haiku)
Reviewer: sanity-check the metric weighting in metric.py _BASE_WEIGHTS against the design notes in README.md
Reviewer: confirm _seed_people_bridge.py import-by-path approach is acceptable (alternative: rename seed-people.py to seed_people.py)

Known follow-ups (not in this PR)

optimize.py currently passes valset=trainset. A real production-strength run should hold out 2–3 recordings as a true valset.
LLM-as-judge metric component (deliberately deferred: it would double subscription burn, and the judge and the task model share blind spots).

🤖 Generated with Claude Code

Optimizes the recording-analysis prompt without paying API tokens by routing every model call through the `claude` CLI (Max subscription OAuth auth) instead of the Anthropic SDK. Custom GEPAAdapter, reference-free deterministic metric (9 components covering grounding, coverage, takeaway quality, mind-map integrity, and output economy), serial CLI wrapper with rate-limit backoff, and per-call telemetry persisted to .seam/prompt-optim-state.json.

Includes:
- prompt_optim/{claude_cli_lm,metric,adapter,optimize}.py: the pipeline
- prompt_optim/_pick_trainset.py: picker for diverse trainset recordings
- prompt_optim/_smoke_adapter.py: single-example wiring smoke test
- prompt_optim/_seed_people_bridge.py: shares is_generic() with seed-people.py
- prompt_optim/{README,COMMANDS,TRAINSET}.md: usage and design notes
- .gitignore: ignore .venv-gepa/, prompts/analyze.optimized*.md, gepa-runs/, .claude/worktrees/, .claude/scheduled_tasks.lock

The optimized prompt is written to prompts/analyze.optimized.md (gitignored) and is never auto-promoted; production analyze.md is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

yoaquim · 2026-05-03T12:10:29Z

A few items from review:

1. Generalize the docs

The examples in README.md, COMMANDS.md, and TRAINSET.md reference specific recording names (2026-02-24_cancun-and-project-management, 2026-03-06_calebs-pediatrician-appointment, etc.). These should be replaced with generic placeholders like <recording-dir-name> or your-recording-name so the docs make sense for any user.

The _pick_trainset.py script already generates user-specific recommendations, so hardcoding specific recordings in the docs is redundant. The sample output table in TRAINSET.md is fine to keep as an illustration, but the --train examples in README.md and COMMANDS.md should use placeholders.

2. `seed-people.py` rename

The `_seed_people_bridge.py` import-by-path workaround is no longer needed — #19 renames `seed-people.py` to `seed_people.py`, making it a standard importable module. Once that merges, this PR can drop the bridge file and do a normal `from scripts.seed_people import is_generic, load_generic_labels` (or adjust the import path as needed).

3. Tests

The PR adds no automated tests for the `prompt_optim` module — only manual smoke scripts. The project's CLAUDE.md mandates TDD with 80%+ coverage on new code. At minimum, `metric.py` is pure functions and very testable:

`schema_score` with valid/invalid JSON structures
`speaker_grounding_score` with known/unknown/generic names
`quote_grounding_score` with matching/non-matching transcript text
`takeaway_quality_score` with redundant vs distinct takeaways
`mind_map_quality_score` with connected/disconnected graphs
`output_economy_score` at various sizes around the 14 kB center

These don't require any LLM calls — they're deterministic scoring functions. Should have pytest coverage.
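
As a rough picture of what such tests could look like, here is a self-contained pytest-style sketch for the connected/disconnected mind-map case. `connected_fraction` is a hypothetical stand-in, not the real `mind_map_quality_score`, whose signature and scoring rule are not shown in this PR.

```python
from collections import defaultdict
from typing import Hashable, Iterable, Tuple


def connected_fraction(
    nodes: Iterable[Hashable],
    edges: Iterable[Tuple[Hashable, Hashable]],
) -> float:
    """Share of nodes in the largest connected component (edges are undirected)."""
    node_set = set(nodes)
    if not node_set:
        return 0.0
    adjacency = defaultdict(set)
    for a, b in edges:
        if a in node_set and b in node_set:  # ignore edges to unknown nodes
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen, largest = set(), 0
    for start in node_set:
        if start in seen:
            continue
        # Depth-first walk of one component, counting its nodes.
        stack, size = [start], 0
        seen.add(start)
        while stack:
            cur = stack.pop()
            size += 1
            for nxt in adjacency[cur] - seen:
                seen.add(nxt)
                stack.append(nxt)
        largest = max(largest, size)
    return largest / len(node_set)


def test_connected_graph_scores_one():
    assert connected_fraction(["a", "b", "c"], [("a", "b"), ("b", "c")]) == 1.0


def test_disconnected_graph_scores_lower():
    assert connected_fraction(["a", "b", "c", "d"], [("a", "b")]) == 0.5


def test_empty_graph_scores_zero():
    assert connected_fraction([], []) == 0.0
```

Tests in this shape run under plain `pytest` with no model calls, fixtures, or network access.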

yoaquim · 2026-05-03T12:16:12Z

Two more items:

4. Separate valset from trainset

optimize.py line 153 has valset=trainset with a comment noting it's an MVP shortcut. The fix is straightforward — add a --val flag mirroring --train:

ap.add_argument(
    "--val",
    nargs="*",
    default=None,
    help="validation recording dir names (defaults to --train if not set)",
)

Then in the body:

val_names = args.val if args.val else args.train
valset = _build_trainset(seam_dir, val_names)

Usage becomes:

python -m prompt_optim.optimize \
    --budget 150 \
    --train recording-a recording-b recording-c recording-d \
    --val recording-x recording-y

When --val is omitted, it falls back to --train (current behavior). When provided, GEPA trains on one set and validates on the other — catches overfitting.
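
The fallback can be exercised in isolation with a small sketch like the one below. `parse_sets` is a hypothetical helper, and the `--train` default used here is an assumption; the real optimize.py has other arguments and its own defaults.

```python
import argparse
from typing import List, Optional, Sequence, Tuple


def parse_sets(argv: Optional[Sequence[str]] = None) -> Tuple[List[str], List[str]]:
    """Parse --train/--val and return (train_names, val_names)."""
    ap = argparse.ArgumentParser()
    # Assumption: an empty default for --train, purely for this illustration.
    ap.add_argument("--train", nargs="*", default=[])
    ap.add_argument(
        "--val",
        nargs="*",
        default=None,
        help="validation recording dir names (defaults to --train if not set)",
    )
    args = ap.parse_args(argv)
    # None (flag omitted) and [] (flag given with no names) both fall back.
    val_names = args.val if args.val else args.train
    return args.train, val_names
```

For example, `parse_sets(["--train", "a", "b"])` yields the same list for both sets, while adding `--val x` yields a separate validation list.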

_pick_trainset.py could also be updated to suggest a val split (e.g. "pick 4 for train, hold out these 2 for val").

5. Auto-calculate output economy center

The sigmoid center is hardcoded at 14,000 bytes (metric.py line 588):

x = (size - 14000) / 3000

This should be computed at runtime from the user's actual analysis distribution. Something like:

from pathlib import Path

def _compute_economy_center(seam_dir: Path) -> int:
    """Median analysis.json size in bytes across the dataset."""
    sizes = []
    analysis_dir = seam_dir / "analysis"
    if analysis_dir.is_dir():
        for d in analysis_dir.iterdir():
            f = d / "analysis.json"
            if f.is_file():
                sizes.append(f.stat().st_size)
    if not sizes:
        return 14000  # fallback default when no analyses exist yet
    # Upper median for even counts; precise enough for calibrating the sigmoid.
    return sorted(sizes)[len(sizes) // 2]

Then _output_economy_score takes the center as a parameter instead of hardcoding it. The center gets computed once when the metric module is initialized or when score_analysis is first called.

This way any user's dataset self-calibrates — someone with longer recordings and naturally larger analyses won't get unfairly penalized.
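
A parameterized version of the score might look like the sketch below. The 3000-byte scale comes from the line quoted above; the direction of the sigmoid (larger outputs score lower) and the clamp are assumptions, since the body of `_output_economy_score` is not shown here.

```python
import math


def output_economy_score(size: int, center: float = 14000.0, scale: float = 3000.0) -> float:
    """Sigmoid score that equals 0.5 when size == center.

    Assumption: outputs larger than the center score below 0.5.
    """
    x = (size - center) / scale
    # Clamp so very large files cannot overflow math.exp.
    x = max(-60.0, min(60.0, x))
    return 1.0 / (1.0 + math.exp(x))
```

With the center passed in, a dataset whose median analysis is 20 kB scores a 20 kB output at 0.5 instead of penalizing it against the 14 kB default.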

## Summary
- Renames `scripts/seed-people.py` → `scripts/seed_people.py` so it's importable as a standard Python module
- Eliminates the need for `importlib` path hacks (like `_seed_people_bridge.py` in #18)
- Updates references in CLAUDE.md and README.md

## Test plan
- [x] All tests pass (94 vitest + 51 pytest)
- [x] No scripts reference `seed-people.py` by name (it's only invoked manually)

yoaquim · 2026-05-03T12:22:15Z

Update: Item #2 from my earlier review (the _seed_people_bridge.py hack) is now addressed:

seed-people.py was renamed to seed_people.py on main via Rename seed-people.py to seed_people.py #19
Merged main into this branch
Deleted _seed_people_bridge.py
metric.py now imports directly from scripts/seed_people via sys.path (same pattern used by the project's test files)
Updated all seed-people.py references in prompt_optim/README.md

Commit: b0cb9bd

Remaining open items from the review:

Generalize docs — replace specific recording names with placeholders
Tests — add pytest coverage for the metric's pure functions
Separate --val flag — so valset can differ from trainset
Auto-calculate output economy center — compute median from user's dataset at runtime

jedibrillo requested a review from yoaquim May 3, 2026 11:17

yoaquim mentioned this pull request May 3, 2026

Rename seed-people.py to seed_people.py #19

Merged

2 tasks

yoaquim added 2 commits May 3, 2026 08:20

Merge remote-tracking branch 'origin/main' into claude/zen-payne-df3772

f30fd7b

Remove _seed_people_bridge.py, import seed_people directly

b0cb9bd

seed-people.py was renamed to seed_people.py on main (#19), so the importlib bridge hack is no longer needed. Import directly via sys.path like the rest of the project's cross-directory imports.

yoaquim requested changes May 3, 2026

Fix formatting in prompt_optim README

9b1c300

jedibrillo wants to merge 4 commits into main from claude/zen-payne-df3772
