
GEPA optimization program: plugin v0.12.3 (over-explain + viz), viewer fixes, full harness + skill #218

Draft
ivanmkc wants to merge 3 commits into master from worktree-gepa-flowchart

@ivanmkc ivanmkc commented Jul 3, 2026

Owner

What this ships

This branch is the full GEPA prompt-optimization program for termchart diagram
skills, plus the validated wins ported into the shipped plugin and viewer. It is
~82 commits on top of master (clean fast-forward, no divergence).

Plugin (user-facing) — v0.11.4 → v0.12.3

  • Over-explain-for-non-experts default (v0.12.2/0.12.3): define jargon + WHY + a
    space-budget "don't crowd the visuals" guard. GEPA-validated to lift junior
    comprehension by roughly +0.19 to +0.21 on unseen (OOD) journeys, with 8 of 10
    improving.
  • Visualization-usage guidance (show-don't-tell: prose/diagram mix, products get
    images, links checked).
  • New UX/app-screen mockup recipe family (v0.12.1); diagram-recipes reconciled
    with the real API surface (dedup + self-consistency audit).

Viewer

  • Experimental features (chat console, board history) hidden behind an
    ?experimental=1 / localStorage flag.
  • Fixes: inbox nudge fires once per new message (not a heartbeat); don't snap-to-top
    when the viewed board re-renders.

GEPA harness (scripts/experiments/gepa-flowchart/)

  • Topology-skill hierarchy (journey → shared topology skills → schema atoms) + joint
    multi-journey GEPA over shared skills.
  • Unified scorer: comprehension (text+vision VQA) · geometry (heuristic + rendered
    DOM) · visual-quality · junior rubric · viz-usage, harmonic-mean weighted.
  • Anti-Goodhart machinery: PoLL multi-judge panel (median + disagreement + abstain),
    K-sample generation, optimize-judge ≠ validation-judge, and an OOD holdout kept in
    a separate file so optimization can't train on it.
  • Resilient LLM calls (429/transient backoff + graceful degrade); regression + judge-
    agreement + cross-eval harnesses; two-gate validated promotion.
  • Methodology codified as a skill (SKILL.md, symlinked into
    .claude/skills/gepa-optimization/): runbook + metric design + gotchas.
  • Curated run artifacts: best_* skills, report.md, SUMMARY*.md kept; regenerable
    scratch git-ignored.
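
To make the multi-judge bullet above concrete, here is a minimal sketch of one way a panel could reduce several judge scores to a median, a disagreement measure, and an abstain decision. It is not taken from the harness: `PanelVerdict`, `aggregate_panel`, the quorum `min_votes`, and the spread threshold `max_disagreement` are all hypothetical names and values, and treating "too much spread" as a reason to abstain is an assumption about what abstain means here.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional, Sequence


@dataclass(frozen=True)
class PanelVerdict:
    """Hypothetical result type; not the harness's own structure."""
    score: Optional[float]  # None means the panel as a whole abstained
    disagreement: float     # max - min over the judges that voted
    n_votes: int            # number of judges that returned a score


def aggregate_panel(
    votes: Sequence[Optional[float]],
    min_votes: int = 2,             # assumed quorum
    max_disagreement: float = 0.5,  # assumed spread threshold
) -> PanelVerdict:
    """Combine per-judge scores in [0, 1]; a None vote is one judge abstaining."""
    cast = [v for v in votes if v is not None]
    if len(cast) < min_votes:
        # Too few usable judgments: abstain rather than trust a single judge.
        return PanelVerdict(None, 0.0, len(cast))
    spread = max(cast) - min(cast)
    if spread > max_disagreement:
        # Judges disagree too much for the median to be meaningful.
        return PanelVerdict(None, spread, len(cast))
    # The median is robust to a single outlier judge.
    return PanelVerdict(median(cast), spread, len(cast))
```

Under these assumptions, a caller would skip or re-sample any item whose `score` is `None` instead of feeding a noisy value into the optimizer.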

Key finding (documented, not over-applied)

The junior-comprehension gain generalizes OOD, but it mildly regresses viz on
data-dense boards. A viz-protective re-weight proved the tension is fundamental for
one universal board_layout, so the shipped default keeps the space-budget guard, and
the clean follow-up (split board_layout per artifact class) is left staged, not forced.

Notes for review

  • Draft: the plugin bump (0.12.3) only reaches users once this is on master — today
    master is at 0.11.4 with no experiments dir, so none of this has shipped via git yet.
  • No scratch/binaries in the diff (gepa_state.bin, run logs, generated boards ignored).

Runbook (entry points, env vars, auth) + metric design + OOD-holdout
discipline + the hard-won gotchas, authored in the experiment dir and
symlinked into .claude/skills/ so any session in the repo finds it.
…e scratch

Curates ~130 overnight run dirs down to their durable outputs
(best_topology_skills/best_prompts JSON, report.md, SUMMARY*.md) and
adds a .gitignore for regenerable scratch (gepa_state.bin, run_log*,
candidate_tree.html, candidates.json, generated_best_outputs_valset/,
frozen_*.json, logs, generated corpus).
…ntal flag)

Reverts the gating from 29877fd — the agent console (chat) and board-history
toggles are wired up for all non-readonly views again, and the EXPERIMENTAL
flag is removed. Restores the pre-hiding behavior byte-for-byte.