Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
24 commits
Select commit Hold shift + click to select a range
3b4c5b8
docs: spec for GEPA flowchart optimization (comprehension + geometry)
ivanmkc-google Jun 18, 2026
08a3a2a
docs: implementation plan for GEPA flowchart optimization
ivanmkc-google Jun 18, 2026
e6d8c16
feat(gepa): scaffold gepa-flowchart project + config
ivanmkc-google Jun 18, 2026
abb5f96
feat(gepa): geometry bridge reusing TS validateContent + geometryReport
ivanmkc-google Jun 18, 2026
8a508bb
feat(gepa): Anthropic SDK wrappers + reflection callable
ivanmkc-google Jun 18, 2026
1b16860
feat(gepa): board-to-text serializer + JSON extractor
ivanmkc-google Jun 18, 2026
30bb7e9
feat(gepa): topic dataset + reader-question generation/freezing
ivanmkc-google Jun 18, 2026
6233d5a
feat(gepa): seed brainstorm/generate prompts + pipeline
ivanmkc-google Jun 18, 2026
82d48ac
fix(gepa): load real gallery example in SKILL_CONTEXT (parents[4])
ivanmkc-google Jun 18, 2026
c69dc8b
feat(gepa): three-part metric (validity gate, geometry, comprehension)
ivanmkc-google Jun 18, 2026
34cf94c
fix(gepa): judge parse degrades safely on malformed JSON
ivanmkc-google Jun 18, 2026
28fdcc0
feat(gepa): GEPAAdapter (evaluate + reflective dataset)
ivanmkc-google Jun 18, 2026
cce5d9d
feat(gepa): run entrypoint, report, and smoke test
ivanmkc-google Jun 18, 2026
03fae2b
docs(gepa): README with setup, run, and cost notes
ivanmkc-google Jun 18, 2026
d39a67c
feat(gepa): run via Claude on Vertex AI (ADC auth)
ivanmkc-google Jun 18, 2026
a5ad73c
fix(gepa): brace-safe prompt fill so GEPA-mutated prompts don't KeyError
ivanmkc-google Jun 18, 2026
1bc010b
feat(gepa): de-saturate the metric (strict judge + harder questions +…
ivanmkc-google Jun 18, 2026
75aa915
feat(gepa): multimodal + Playwright validators on the rendered board
ivanmkc-google Jun 18, 2026
2e411ea
fix(gepa): freeze reader-questions only for used topics (not all 20)
ivanmkc-google Jun 18, 2026
8b1f738
fix(gepa): tolerant reader-question parsing (raw_decode + retry + fal…
ivanmkc-google Jun 18, 2026
830e68d
chore(gepa): log dataset module + parse_str_array at startup
ivanmkc-google Jun 18, 2026
619d630
feat(gepa): weighted-harmonic-mean metric + multi-objective Pareto
ivanmkc-google Jun 19, 2026
75fc6dc
feat(gepa): overnight autonomous sweep — huge corpus + experiments + …
ivanmkc-google Jun 19, 2026
36afc77
fix(gepa): overnight cleanup reaps chromium + runs before first exper…
ivanmkc-google Jun 19, 2026
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1,535 changes: 1,535 additions & 0 deletions docs/superpowers/plans/2026-06-18-gepa-flowchart-optimization.md

Large diffs are not rendered by default.

Original file line number Diff line number Diff line change
@@ -0,0 +1,226 @@
# GEPA optimization for readable, comprehensible flowcharts

**Date:** 2026-06-18
**Status:** Approved design — ready for implementation plan

## Problem

termchart can generate flowcharts, but the *quality* of a generated board is
uneven. Two failure modes recur:

1. **Unreadable geometry** — edges run over nodes, nodes overlap, the graph is
so large it renders tiny. (The push-time geometry lint already detects this.)
2. **Insufficient detail / context** — the more common problem. A board renders
cleanly but a first-time reader can't actually learn what they need from it:
labels are terse, the triggers/conditions/outcomes aren't shown, there's no
orienting context. The reader is left with unanswered questions.

We want to **systematically improve the prompts** that produce flowcharts so the
output is both readable *and* comprehensible to someone seeing it for the first
time. GEPA (reflective prompt evolution) is a good fit: it mutates text prompts
using an LLM that reflects on execution feedback and a metric.

## Goal

Stand up a GEPA optimization loop that evolves the prompts used to **brainstorm**
and **generate** a termchart flowchart, optimizing primarily for **first-user
comprehension** (a fresh reader can answer the questions they'd naturally have),
with **geometric readability** as a secondary guardrail. Produce the evolved
prompts plus a before/after report.

## Non-goals (YAGNI)

- **No** changes to the termchart viewer/CLI public surface (the geometry
validator is reused via a thin bridge, not re-exported).
- **No** human-in-the-loop labeling — the eval is fully automated (LLM reader +
LLM judge).
- **No** visual/image rendering of the board for the reader — comprehension is
scored on the board's **structured content** (see Decisions). True pixel
readability is the geometry lint's job, not the comprehension test's.
- **No** integration into the live push path — this is an offline experiment that
produces better prompts; wiring them into a recipe/skill is a separate follow-up.

## Decisions (from brainstorming)

| Question | Decision |
|---|---|
| LLM access | Claude direct via the Anthropic Python SDK (not the LiteLLM proxy) |
| GEPA implementation | Standalone `gepa` package (bring-your-own adapter) |
| What GEPA optimizes | **Two** text components: a `brainstorm` prompt and a `generate` prompt |
| Dataset | ~12 diverse flowchart topics, ~8 train / ~4 val, authored in-repo |
| Reader-questions (FUX backbone) | **Auto-generated per topic at run start**, then frozen for the run so every candidate is scored against the same questions |
| Reader input surface | **Structured flow content** (node/edge/group labels + annotations), not an image |
| Run scope | **Smoke-first**: a cheap `--smoke` default; a modest full run (~150 metric calls) via flag |

## Architecture

New Python harness in a new git worktree:

```
scripts/experiments/gepa-flowchart/
README.md # how to run, env vars, cost notes
pyproject.toml # deps: gepa, anthropic
gepa_flowchart/
__init__.py
config.py # models, weights, budget, paths (env-overridable)
llm.py # thin Anthropic SDK wrappers: generate / read / judge / reflect
dataset.py # ~12 topics; loads + freezes auto-generated reader-questions
pipeline.py # brainstorm -> generate (with termchart skill context)
geometry_bridge.py # shells out to the tsx Node validator, parses findings JSON
validate_flow.ts # Node bridge: imports geometryReport, reads stdin, prints JSON
metric.py # structural gate + geometry score + comprehension score + feedback
adapter.py # gepa.GEPAAdapter: evaluate() + make_reflective_dataset()
seed_prompts.py # the seed brainstorm/generate prompts (the starting candidate)
run.py # CLI entrypoint: gepa.optimize(...), writes results + report
tests/ # pytest: pure-logic units with a fake LLM + one smoke
topics/ # the authored topic dataset (json/yaml)
```

### The pipeline GEPA optimizes (per task)

```
topic ─┬▶ [brainstorm prompt]* ─▶ plan: what to show + what context a reader needs
└▶ [generate prompt]* + termchart skill context ─▶ flow JSON
┌─────────────────────────────────────────┘
validators → combined score + textual feedback ─▶ GEPA reflection ─▶ better prompts*
```

`*` marks the two text components GEPA mutates. The seed candidate is
`{ "brainstorm": <seed>, "generate": <seed> }`. The generate prompt's static
context includes the termchart `flow` JSON schema and 1–2 shipped gallery example
specs (`plugin/skills/diagram-recipes/examples/*.flow.json`) — this is the
"generate using termchart skills" step.

### The flow spec format (target output)

```jsonc
{
"direction": "TB" | "LR" | "BT" | "RL", // default TB
"nodes": [{ "id", "data": { "label", "status?" }, "group?" }],
"edges": [{ "source", "target", "data?": { "label?" } }],
"groups?": [{ "id", "label", "color?" }], // tiers/lanes/zones
"tiers?": bool, "lanes?": bool
}
```
Detail and context live in `data.label` (rich node text), edge `data.label`
(what the transition is/when it happens), and group labels — the levers the
generate prompt learns to use.

## The metric (the heart of this)

`evaluate(task, flow_json)` returns a score in `[0,1]` plus a textual feedback
string for GEPA's reflection.

1. **Structural validity (hard gate).** Run the shared `validateContent` logic.
Invalid/unparseable → score `0.0`, feedback = the precise path-pointed error.
Everything below only runs on a valid spec.

2. **Geometry / readability** → `geom_score ∈ [0,1]`. From the geometry bridge's
findings: `error`-severity (edge-over-node, node-overlap, missing-ref) drive a
large penalty; `warning`-severity (crossings, edge-near-node, low-readability,
and the over-stuffed-density codes) a small one. Feedback = the findings'
actionable messages.

3. **Comprehension / first-user-experience (PRIMARY)** → `comp_score ∈ [0,1]`.
- A **fresh reader-LLM with no prior context** (system: "you're seeing this
board for the first time; answer ONLY from what's shown; say 'not shown' if
the board doesn't tell you") is given the board's **structured content** and
the task's frozen reader-questions.
- A **judge-LLM** scores each answer 0–1 on two axes: *correct* AND
*actually supported by the board*. An answer the reader had to invent, or
marked "not shown", counts as a comprehension miss attributable to **missing
detail/context** — flagged explicitly.
- `comp_score = mean(per-question scores)`.
- Feedback = the specific questions that scored low, each with the judge's
reason (e.g. "board never indicates what happens when validation fails"),
plus a short "missing context" list. This is what pushes reflection toward
adding detail and orienting context, not just cleaner layout.

**Combined:** `total = w_comp * comp_score + w_geom * geom_score`, gated by
validity. Defaults `w_comp = 0.6`, `w_geom = 0.4` (env-configurable). Comprehension
leads because under-detailing is the main problem we're fixing; geometry remains a
real guardrail so the optimizer can't win by dumping unreadable walls of text.

## GEPA wiring

- `gepa.optimize(seed_candidate, trainset, valset, adapter, reflection_lm=<callable>, max_metric_calls=<budget>)`.
- `reflection_lm` is a **callable** wrapping the Anthropic SDK (Claude direct) —
not a LiteLLM model string — honoring the "Claude direct" decision.
- The `GEPAAdapter` implements `evaluate(batch, candidate, capture_traces)` (runs
the pipeline + metric per task) and `make_reflective_dataset(...)` (turns the
captured feedback into the per-component reflective examples GEPA mutates on).

## Models (Anthropic SDK direct)

| Role | Default model | Notes |
|---|---|---|
| Generation (high volume) | `claude-opus-4-8` | env knob to `claude-sonnet-4-6` to cut cost |
| Reader (FUX) | `claude-sonnet-4-6` | simulates an average reader; cheap; no thinking needed |
| Judge | `claude-opus-4-8` | scores reader answers; distinct role from the generator |
| GEPA reflection | `claude-opus-4-8` | strongest model proposes prompt mutations |

Adaptive thinking on the reasoning-heavy roles (reflection, judge). Auth via
`ANTHROPIC_API_KEY` or an `ant auth login` profile.

## Geometry bridge (TS ↔ Python)

The geometry validator (`packages/viewer/src/flow-geometry.ts` → `geometryReport`)
is TypeScript and intentionally not on the CLI/public surface. Rather than re-export
it or stand up a server, a small `validate_flow.ts` Node script imports
`geometryReport`, reads a flow-JSON spec on stdin, and prints
`{ findings: [...], warnings: [...] }` on stdout. It runs via `npx tsx` with the
viewer package as the resolution root (so `dagre` etc. resolve). `geometry_bridge.py`
shells out to it and parses the JSON. No changes to shipped packages.

## Dataset

`topics/` holds ~12 tasks, each:

```jsonc
{
"id": "ci-cd-pipeline",
"topic": "CI/CD pipeline for a web app",
"audience": "an engineer new to the team",
"purpose": "understand how code reaches production and what can go wrong"
}
```
Reader-questions are **not** stored per topic — at run start the harness
auto-generates a fixed set per topic (via an LLM, from topic+audience+purpose) and
**freezes** them to the run directory, so every candidate in that run is judged
against identical questions (fair comparison; stable signal within a run). Split
~8 train / ~4 val.

## Outputs

Written to a timestamped run directory:
- `best_prompts.json` — the evolved `brainstorm` + `generate` prompts.
- `report.md` — seed vs. best on the val set: comprehension score, geometry score,
combined, and per-question deltas (which previously-unanswerable questions the
improved board now answers).
- `frozen_questions.json` — the reader-questions used for the run.

## Testing

- **Unit (pytest, fake LLM):** findings→`geom_score` mapping; comprehension
scoring + feedback assembly; structural-gate behavior; dataset load + question
freeze; the adapter's `evaluate`/`make_reflective_dataset` shapes.
- **Geometry bridge:** a known-bad spec (edge over node) yields the expected
finding; a clean spec yields none.
- **Smoke (live, tiny):** `run.py --smoke` runs 1 topic at a tiny budget end to
end and asserts a report is produced. Gated behind `ANTHROPIC_API_KEY`.

## Cost & scope control

- `--smoke` is the default-safe entry: 1 topic, minimal budget, a handful of LLM
calls — validates the whole loop for cents.
- A full run defaults to ~150 metric calls (`--max-metric-calls`), train/val as
above. All models, weights, and budget are env/flag-overridable.
- Every run prints an up-front estimate (rollouts × calls/rollout × models) before
spending.

## Files touched / created

All new, under `scripts/experiments/gepa-flowchart/` (+ this spec). No existing
package code is modified.
5 changes: 5 additions & 0 deletions scripts/experiments/gepa-flowchart/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
__pycache__/
*.pyc
.venv/
node_modules/
runs/
76 changes: 76 additions & 0 deletions scripts/experiments/gepa-flowchart/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,76 @@
# gepa-flowchart

GEPA prompt optimization for readable, comprehensible termchart flowcharts.
Optimizes the `brainstorm` + `generate` prompts for first-user comprehension
(primary) and geometric readability (secondary). See
`docs/superpowers/specs/2026-06-18-gepa-flowchart-optimization-design.md`.

## Setup

```bash
# Node deps for the geometry bridge (run once)
npm install # repo root — gives the viewer its deps + playwright
npm run build --workspace @ivanmkc/termchart-viewer # build the viewer (needed to RENDER boards)
cd scripts/experiments/gepa-flowchart
npm install # local tsx

# Python
python -m venv .venv && . .venv/bin/activate
pip install -e ".[dev]"
export ANTHROPIC_API_KEY=sk-ant-... # or: ant auth login
```

### Auth: direct API or Vertex AI

By default the harness uses the direct Anthropic API (`ANTHROPIC_API_KEY`).
To run via **Claude on Vertex AI** (ADC, no API key), install the Vertex extra
and set the Vertex env vars — `get_client()` auto-selects the Vertex client when
`CLAUDE_CODE_USE_VERTEX` (or `GEPA_USE_VERTEX`) is set:

```bash
pip install "anthropic[vertex]"
export CLAUDE_CODE_USE_VERTEX=1
export ANTHROPIC_VERTEX_PROJECT_ID=<gcp-project>
export CLOUD_ML_REGION=global # or a Claude-on-Vertex region
gcloud auth application-default login # ADC
```

## Run

```bash
python -m gepa_flowchart.run --smoke # cheap end-to-end check (1 topic)
python -m gepa_flowchart.run # full run (~150 metric calls)
python -m gepa_flowchart.run --max-metric-calls 80 --train 8
```

Outputs land in `runs/<timestamp>/`: `best_prompts.json`, `report.md`,
`frozen_questions.json`.

## Validators (the metric)

Each board is scored on five signals, gated by structural validity:

- **text comprehension** — a fresh reader-LLM answers the run-frozen reader-questions from the board's structured content; a strict, board-grounded judge scores them.
- **visual comprehension** — the board is **rendered in a real browser** (viewer + Chromium via a persistent render service) and a **multimodal LLM reads the screenshot**, answering the same questions from the pixels.
- **visual quality** — the same vision call rates legibility / crowding / overlaps / clipping.
- **heuristic geometry** — the fast TS `geometryReport` (edges-over-nodes, crossings, density).
- **rendered geometry (Playwright)** — real-DOM measurements: node-pair overlaps, off-canvas nodes, smallest on-screen font.

`comp = mean(text, visual)`, `geom = mean(heuristic, rendered)`,
`total = w_comp·comp + w_geom·geom + w_vq·visual_quality`.

## Config (env)

`GEPA_GEN_MODEL` (default `claude-opus-4-8`; set `claude-sonnet-4-6` to cut cost),
`GEPA_READER_MODEL`, `GEPA_JUDGE_MODEL`, `GEPA_VISION_MODEL`, `GEPA_REFLECTION_MODEL`,
`GEPA_W_COMP` (0.5), `GEPA_W_GEOM` (0.3), `GEPA_W_VQ` (0.2), `GEPA_N_QUESTIONS` (7),
`GEPA_MAX_METRIC_CALLS` (150). Rendered-geometry penalties: `GEPA_RG_OVERLAP_PENALTY`,
`GEPA_RG_OFFSCREEN_PENALTY`, `GEPA_RG_TINYFONT_PENALTY`, `GEPA_RG_MIN_FONT_PX`.

## Cost note

Each rollout now does generation + text reader/judge + **a browser render + a
multimodal vision call**, plus reflection — heavier than a text-only metric.
Start with `--smoke`. The render service (viewer + one Chromium) starts once per
run and is reused. Generation defaults to Opus; `GEPA_GEN_MODEL=claude-sonnet-4-6`
cuts the high-volume role.
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
"""GEPA optimization for readable, comprehensible termchart flowcharts."""
74 changes: 74 additions & 0 deletions scripts/experiments/gepa-flowchart/gepa_flowchart/adapter.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,74 @@
from __future__ import annotations

from gepa.core.adapter import EvaluationBatch, GEPAAdapter

from .config import Config
from .geometry_bridge import validate_flow
from .llm import complete, complete_vision
from .metric import score_board
from .pipeline import run_pipeline


def _no_render(_content):
return {"ok": False, "error": "no render_fn supplied"}


class FlowchartAdapter(GEPAAdapter):
def __init__(self, cfg: Config, *, validate_fn=validate_flow, complete_fn=complete,
render_fn=_no_render, vision_fn=complete_vision):
self.cfg = cfg
self.validate_fn = validate_fn
self.complete_fn = complete_fn
self.render_fn = render_fn
self.vision_fn = vision_fn

def evaluate(self, batch, candidate, capture_traces=False):
outputs, scores, trajectories = [], [], [] if capture_traces else None
objective_scores = []
for task in batch:
content, trace = run_pipeline(
candidate, task, model=self.cfg.gen_model,
complete_fn=self.complete_fn, max_tokens=self.cfg.gen_max_tokens,
)
result = score_board(
content, task, self.cfg, validate_fn=self.validate_fn, complete_fn=self.complete_fn,
render_fn=self.render_fn, vision_fn=self.vision_fn,
)
outputs.append(content)
scores.append(result.score)
# Per-objective scores so GEPA can track/Pareto the distinct axes instead of only the
# collapsed scalar (keeps candidates that are best on comprehension vs. layout separate).
objective_scores.append({
"comprehension": result.comp,
"geometry": result.geom,
"visual_quality": result.visual_quality,
})
if capture_traces:
trajectories.append({"task": task, "trace": trace, "result": result, "output": content})
return EvaluationBatch(
outputs=outputs, scores=scores, trajectories=trajectories, objective_scores=objective_scores
)

def make_reflective_dataset(self, candidate, eval_batch, components_to_update):
out: dict[str, list[dict]] = {c: [] for c in components_to_update}
for traj in eval_batch.trajectories or []:
task = traj["task"]
result = traj["result"]
trace = traj["trace"]
shared_feedback = result.feedback
if "brainstorm" in out:
out["brainstorm"].append({
"Inputs": f"Topic: {task.topic}\nAudience: {task.audience}\nPurpose: {task.purpose}",
"Generated Outputs": trace.get("plan", ""),
"Feedback": (
"The plan feeds a downstream generator. The resulting board scored "
f"{result.score:.2f}. " + shared_feedback
),
})
if "generate" in out:
out["generate"].append({
"Inputs": f"Plan:\n{trace.get('plan','')}",
"Generated Outputs": traj.get("output") or trace.get("raw_generation", ""),
"Feedback": shared_feedback,
})
return out
Loading
Loading