Add prompt_optim: GEPA-based optimization for analyze.md (Max-subscription auth)#18

Open
jedibrillo wants to merge 4 commits into main from claude/zen-payne-df3772

@jedibrillo
Collaborator

Summary

Adds prompt_optim/ — a GEPA-based optimizer for prompts/analyze.md that runs entirely on Max-subscription auth (no API tokens billed). Custom GEPAAdapter routes every model call through the claude CLI; reference-free deterministic metric scores outputs against the source recording + .seam/people.json + the schema, so no gold-standard analyses are needed.

The production prompt at prompts/analyze.md is unchanged. Optimized variants are written to prompts/analyze.optimized*.md (gitignored) and require manual review before promotion.

What's in here

  • claude_cli_lm.py — subprocess wrapper around claude -p --output-format json. Serial semaphore (Max is one auth bucket, so parallel calls add no throughput and only burn quota faster), exponential backoff on usage-limit errors, persistent telemetry at .seam/prompt-optim-state.json (calls, tokens, cost-proxy, by-model breakdown).
  • metric.py — 9-component reference-free metric:
    • schema (hard gate), speaker_grounding, participant_consistency, attribution_grounding, quote_grounding, coverage_spread, takeaway_quality, mind_map_quality, output_economy, plus an optional consistency component (gated on the --consistency flag; it doubles eval cost on the first batch example only).
    • The output_economy sigmoid is calibrated against the existing 178-analysis dataset (median ~14 kB → score 0.5).
    • The is_generic rule for filtering role labels (Speaker 01, host, etc.) is shared with scripts/seed-people.py via _seed_people_bridge.py — single source of truth, picks up .seam/generic-speakers.txt automatically.
  • adapter.py — custom GEPAAdapter. Wraps the candidate prompt in the same delimiters as scripts/pocket-run.sh, asks Claude to emit raw JSON to stdout (vs. production's Write-tool path), parses, scores, builds reflective trajectories.
  • optimize.py — CLI driver. python -m prompt_optim.optimize --budget N.
  • _pick_trainset.py — picker that ranks recordings by current metric score and prints a table annotated with type, duration, speaker setup, and dominant failure mode for hand-picking a diverse trainset.
  • _smoke_adapter.py — single-example end-to-end wiring smoke test.
  • Docs: README.md (design + usage), COMMANDS.md (sample commands), TRAINSET.md (trainset-picking criteria + script).
  • .gitignore: ignore .venv-gepa/, prompts/analyze.optimized*.md, gepa-runs/, .claude/worktrees/, .claude/scheduled_tasks.lock.

Why route through the CLI

The claude CLI authenticates via OAuth/keychain — calls consume Max-subscription quota, not API credits. The wrapper unsets ANTHROPIC_API_KEY defensively to keep billing on the subscription regardless of environment.
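
For reviewers who want the shape of this without opening claude_cli_lm.py, here is a minimal sketch. It assumes only the `claude -p --output-format json` invocation named above. The helper names (`subscription_env`, `backoff_delay`, `call_claude`), the retry count, the delays, and the choice to treat every non-zero exit as retryable are illustrative assumptions, not the code in this PR.

```python
import json
import os
import subprocess
import threading
import time

# One permit: calls are serialized because the subscription is a single quota bucket.
_CLI_LOCK = threading.Semaphore(1)


def subscription_env(base=None):
    """Copy the environment without ANTHROPIC_API_KEY so the CLI uses its own login."""
    env = dict(os.environ if base is None else base)
    env.pop("ANTHROPIC_API_KEY", None)
    return env


def backoff_delay(attempt, base=2.0, cap=300.0):
    """Exponential backoff in seconds: base, 2*base, 4*base, ... capped at `cap`."""
    return min(cap, base * (2 ** attempt))


def call_claude(prompt, max_attempts=5, runner=subprocess.run, sleep=time.sleep):
    """Run the CLI serially and parse its JSON output, retrying with backoff."""
    # The prompt is passed as an argument here; very long prompts may need stdin instead.
    cmd = ["claude", "-p", prompt, "--output-format", "json"]
    last_err = ""
    for attempt in range(max_attempts):
        with _CLI_LOCK:
            proc = runner(cmd, capture_output=True, text=True, env=subscription_env())
        if proc.returncode == 0:
            return json.loads(proc.stdout)
        # Assumption: any failure is retried; a real wrapper would inspect the
        # error and only back off on usage-limit responses.
        last_err = proc.stderr
        sleep(backoff_delay(attempt))
    raise RuntimeError(f"claude CLI failed after {max_attempts} attempts: {last_err}")
```

The `runner` and `sleep` parameters exist only so the retry path can be exercised in a test without touching the real CLI or waiting on real delays.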

Why reference-free metric

The 178 existing analyses in .seam/analysis/ were produced by the very prompt we're optimizing, so they're not a clean gold standard. Optimizing toward similarity would cap us at "looks like the current output." Instead the metric scores against the source recording (transcript substrings, schema, people.json grounding) and structural properties.
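
To make "transcript substrings" concrete, the sketch below shows one way a quote-grounding component could be computed. It is an assumption-laden illustration: the function name `quote_grounding`, the whitespace and case normalization, and the score of 1.0 for an empty quote list are hypothetical and may differ from what metric.py does.

```python
import re


def _normalize(text):
    """Lowercase and collapse whitespace so line wrapping does not break matches."""
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_grounding(quotes, transcript):
    """Fraction of quotes found verbatim (after normalization) in the transcript."""
    if not quotes:
        return 1.0  # assumption: nothing to ground counts as fully grounded
    haystack = _normalize(transcript)
    hits = 0
    for quote in quotes:
        needle = _normalize(quote)
        # Empty quotes never count as grounded.
        if needle and needle in haystack:
            hits += 1
    return hits / len(quotes)
```

Because the check needs only the source transcript, it works without any gold-standard analysis, which is the property this section argues for.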

Validation

End-to-end smoke run completed successfully on haiku with --budget 6:

  • Seed prompt scored 0.7302 averaged over 2 trainset examples.
  • GEPA proposed a "Validation Checklist" mutation addressing the metric's flagged failure modes (speaker grounding + verbatim quotes), evaluated it, and correctly rejected it on the minibatch acceptance test.
  • 0 rate-limit hits, telemetry accumulated correctly.

Distribution check across 30 sample analyses scored by the new metric: range 0.57–0.92, median 0.76. quote_grounding mean 0.29 confirms it's the dominant failure mode (the verbatim-quote substring problem) — exactly what GEPA can target.

Test plan

  • npm test passes (94 vitest + 51 pytest, no regressions in existing suite)
  • npm run lint passes (0 errors; 3 pre-existing warnings unrelated to this PR)
  • npm run format:check passes
  • prompt_optim smoke tests run cleanly (CLI wrapper, metric, adapter, full GEPA loop on haiku)
  • Reviewer: sanity-check the metric weighting in metric.py _BASE_WEIGHTS against the design notes in README.md
  • Reviewer: confirm _seed_people_bridge.py import-by-path approach is acceptable (alternative: rename seed-people.py to seed_people.py)

Known follow-ups (not in this PR)

  • optimize.py currently passes valset=trainset. A real production-strength run should hold out 2–3 recordings as a true valset.
  • LLM-as-judge metric component (deliberately deferred — would double subscription burn, judge and task share blind spots).

🤖 Generated with Claude Code

Optimizes the recording-analysis prompt without paying API tokens by routing
every model call through the `claude` CLI (Max subscription OAuth auth) instead
of the Anthropic SDK. Custom GEPAAdapter, reference-free deterministic metric
(9 components covering grounding, coverage, takeaway quality, mind-map
integrity, and output economy), serial CLI wrapper with rate-limit backoff,
and per-call telemetry persisted to .seam/prompt-optim-state.json.

Includes:
- prompt_optim/{claude_cli_lm,metric,adapter,optimize}.py — the pipeline
- prompt_optim/_pick_trainset.py — picker for diverse trainset recordings
- prompt_optim/_smoke_adapter.py — single-example wiring smoke test
- prompt_optim/_seed_people_bridge.py — shares is_generic() with seed-people.py
- prompt_optim/{README,COMMANDS,TRAINSET}.md — usage and design notes
- .gitignore: ignore .venv-gepa/, prompts/analyze.optimized*.md, gepa-runs/,
  .claude/worktrees/, .claude/scheduled_tasks.lock

The optimized prompt is written to prompts/analyze.optimized.md (gitignored)
and is never auto-promoted; production analyze.md is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jedibrillo jedibrillo requested a review from yoaquim May 3, 2026 11:17
@yoaquim
Owner

yoaquim commented May 3, 2026

A few items from review:

1. Generalize the docs

The examples in README.md, COMMANDS.md, and TRAINSET.md reference specific recording names (2026-02-24_cancun-and-project-management, 2026-03-06_calebs-pediatrician-appointment, etc.). These should be replaced with generic placeholders like <recording-dir-name> or your-recording-name so the docs make sense for any user.

The _pick_trainset.py script already generates user-specific recommendations, so hardcoding specific recordings in the docs is redundant. The sample output table in TRAINSET.md is fine to keep as an illustration, but the --train examples in README.md and COMMANDS.md should use placeholders.

2. `seed-people.py` rename

The `_seed_people_bridge.py` import-by-path workaround is no longer needed — #19 renames `seed-people.py` to `seed_people.py`, making it a standard importable module. Once that merges, this PR can drop the bridge file and do a normal `from scripts.seed_people import is_generic, load_generic_labels` (or adjust the import path as needed).

3. Tests

The PR adds no automated tests for the `prompt_optim` module — only manual smoke scripts. The project's CLAUDE.md mandates TDD with 80%+ coverage on new code. At minimum, `metric.py` is pure functions and very testable:

  • `schema_score` with valid/invalid JSON structures
  • `speaker_grounding_score` with known/unknown/generic names
  • `quote_grounding_score` with matching/non-matching transcript text
  • `takeaway_quality_score` with redundant vs distinct takeaways
  • `mind_map_quality_score` with connected/disconnected graphs
  • `output_economy_score` at various sizes around the 14 kB center

These don't require any LLM calls — they're deterministic scoring functions. Should have pytest coverage.

@yoaquim
Owner

yoaquim commented May 3, 2026

Two more items:

4. Separate valset from trainset

optimize.py line 153 has valset=trainset with a comment noting it's an MVP shortcut. The fix is straightforward — add a --val flag mirroring --train:

ap.add_argument(
    "--val",
    nargs="*",
    default=None,
    help="validation recording dir names (defaults to --train if not set)",
)

Then in the body:

val_names = args.val if args.val else args.train
valset = _build_trainset(seam_dir, val_names)

Usage becomes:

python -m prompt_optim.optimize \
    --budget 150 \
    --train recording-a recording-b recording-c recording-d \
    --val recording-x recording-y

When --val is omitted, it falls back to --train (current behavior). When provided, GEPA trains on one set and validates on the other — catches overfitting.

_pick_trainset.py could also be updated to suggest a val split (e.g. "pick 4 for train, hold out these 2 for val").
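
One possible shape for that suggestion, as a rough sketch: given the picker's ranked list of candidate recordings, hold out evenly spaced entries so the val set spans the score range instead of clustering at one end. The function name `split_train_val` and the spacing rule are hypothetical, not something _pick_trainset.py does today.

```python
def split_train_val(candidates, n_val=2):
    """Split a ranked list into (train, val), holding out evenly spaced entries."""
    if not 0 < n_val < len(candidates):
        raise ValueError("n_val must be between 1 and len(candidates) - 1")
    stride = len(candidates) // n_val
    # Take the middle entry of each stride-sized slice as a held-out example.
    val_idx = {i * stride + stride // 2 for i in range(n_val)}
    train = [c for i, c in enumerate(candidates) if i not in val_idx]
    val = [c for i, c in enumerate(candidates) if i in val_idx]
    return train, val
```

With six ranked candidates and `n_val=2`, this holds out the second and fifth entries and leaves four for `--train`, matching the "pick 4, hold out 2" example.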

5. Auto-calculate output economy center

The sigmoid center is hardcoded at 14,000 bytes (metric.py line 588):

x = (size - 14000) / 3000

This should be computed at runtime from the user's actual analysis distribution. Something like:

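from pathlib import Path
from statistics import median

def _compute_economy_center(seam_dir: Path) -> int:
    """Median analysis.json size in bytes across the dataset."""
    sizes = []
    analysis_dir = seam_dir / "analysis"
    if analysis_dir.is_dir():
        # One analysis.json per recording directory; skip entries without one.
        for d in analysis_dir.iterdir():
            f = d / "analysis.json"
            if f.is_file():
                sizes.append(f.stat().st_size)
    if not sizes:
        return 14000  # fallback default when no analyses exist yet
    # statistics.median averages the two middle values for even-sized datasets.
    return int(median(sizes))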

Then _output_economy_score takes the center as a parameter instead of hardcoding it. The center gets computed once when the metric module is initialized or when score_analysis is first called.

This way any user's dataset self-calibrates — someone with longer recordings and naturally larger analyses won't get unfairly penalized.
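
A sketch of the parameterized scorer, to go with the center computation above. It reuses the `(size - center) / 3000` term quoted from metric.py. The logistic form, the direction (larger outputs score lower), the clamp, and the public name `output_economy_score` are assumptions for illustration and should be checked against the real function.

```python
import math


def output_economy_score(size, center=14000, scale=3000):
    """Logistic score: 0.5 at `center`, higher for smaller outputs, lower for larger."""
    x = (size - center) / scale
    # Clamp so very large files cannot overflow math.exp.
    x = max(-60.0, min(60.0, x))
    return 1.0 / (1.0 + math.exp(x))
```

Passing `center=_compute_economy_center(seam_dir)` once at startup would keep the score at 0.5 for a median-sized analysis regardless of how large that median is for a given user.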

yoaquim added a commit that referenced this pull request May 3, 2026
## Summary

- Renames `scripts/seed-people.py` → `scripts/seed_people.py` so it's
importable as a standard Python module
- Eliminates the need for `importlib` path hacks (like
`_seed_people_bridge.py` in #18)
- Updates references in CLAUDE.md and README.md

## Test plan

- [x] All tests pass (94 vitest + 51 pytest)
- [x] No scripts reference `seed-people.py` by name (it's only invoked
manually)
yoaquim added 2 commits May 3, 2026 08:20
seed-people.py was renamed to seed_people.py on main (#19), so the
importlib bridge hack is no longer needed. Import directly via
sys.path like the rest of the project's cross-directory imports.
@yoaquim
Owner

yoaquim commented May 3, 2026

Update: Item #2 from my earlier review (the _seed_people_bridge.py hack) is now addressed:

  • seed-people.py was renamed to seed_people.py on main via #19 (Rename seed-people.py to seed_people.py)
  • Merged main into this branch
  • Deleted _seed_people_bridge.py
  • metric.py now imports directly from scripts/seed_people via sys.path (same pattern used by the project's test files)
  • Updated all seed-people.py references in prompt_optim/README.md

Commit: b0cb9bd

Remaining open items from the review:

  1. Generalize docs — replace specific recording names with placeholders
  2. Tests — add pytest coverage for the metric's pure functions
  3. Separate --val flag — so valset can differ from trainset
  4. Auto-calculate output economy center — compute median from user's dataset at runtime

Owner

@yoaquim yoaquim left a comment

in comments


2 participants