ivanmkc · ivanmkc · Jun 18, 2026 · Jun 18, 2026 · Jun 18, 2026 · Jun 18, 2026
diff --git a/docs/superpowers/plans/2026-06-18-gepa-flowchart-optimization.md b/docs/superpowers/plans/2026-06-18-gepa-flowchart-optimization.md
diff --git a/docs/superpowers/specs/2026-06-18-gepa-flowchart-optimization-design.md b/docs/superpowers/specs/2026-06-18-gepa-flowchart-optimization-design.md
@@ -0,0 +1,226 @@
+# GEPA optimization for readable, comprehensible flowcharts
+
+**Date:** 2026-06-18
+**Status:** Approved design — ready for implementation plan
+
+## Problem
+
+termchart can generate flowcharts, but the *quality* of a generated board is
+uneven. Two failure modes recur:
+
+1. **Unreadable geometry** — edges run over nodes, nodes overlap, the graph is
+   so large it renders tiny. (The push-time geometry lint already detects this.)
+2. **Insufficient detail / context** — the more common problem. A board renders
+   cleanly but a first-time reader can't actually learn what they need from it:
+   labels are terse, the triggers/conditions/outcomes aren't shown, there's no
+   orienting context. The reader is left with unanswered questions.
+
+We want to **systematically improve the prompts** that produce flowcharts so the
+output is both readable *and* comprehensible to someone seeing it for the first
+time. GEPA (reflective prompt evolution) is a good fit: it mutates text prompts
+using an LLM that reflects on execution feedback and a metric.
+
+## Goal
+
+Stand up a GEPA optimization loop that evolves the prompts used to **brainstorm**
+and **generate** a termchart flowchart, optimizing primarily for **first-user
+comprehension** (a fresh reader can answer the questions they'd naturally have),
+with **geometric readability** as a secondary guardrail. Produce the evolved
+prompts plus a before/after report.
+
+## Non-goals (YAGNI)
+
+- **No** changes to the termchart viewer/CLI public surface (the geometry
+  validator is reused via a thin bridge, not re-exported).
+- **No** human-in-the-loop labeling — the eval is fully automated (LLM reader +
+  LLM judge).
+- **No** visual/image rendering of the board for the reader — comprehension is
+  scored on the board's **structured content** (see Decisions). True pixel
+  readability is the geometry lint's job, not the comprehension test's.
+- **No** integration into the live push path — this is an offline experiment that
+  produces better prompts; wiring them into a recipe/skill is a separate follow-up.
+
+## Decisions (from brainstorming)
+
+| Question | Decision |
+|---|---|
+| LLM access | Claude direct via the Anthropic Python SDK (not the LiteLLM proxy) |
+| GEPA implementation | Standalone `gepa` package (bring-your-own adapter) |
+| What GEPA optimizes | **Two** text components: a `brainstorm` prompt and a `generate` prompt |
+| Dataset | ~12 diverse flowchart topics, ~8 train / ~4 val, authored in-repo |
+| Reader-questions (FUX backbone) | **Auto-generated per topic at run start**, then frozen for the run so every candidate is scored against the same questions |
+| Reader input surface | **Structured flow content** (node/edge/group labels + annotations), not an image |
+| Run scope | **Smoke-first**: a cheap `--smoke` default; a modest full run (~150 metric calls) via flag |
+
+## Architecture
+
+New Python harness in a new git worktree:
+
+```
+scripts/experiments/gepa-flowchart/
+  README.md                # how to run, env vars, cost notes
+  pyproject.toml           # deps: gepa, anthropic
+  gepa_flowchart/
+    __init__.py
+    config.py              # models, weights, budget, paths (env-overridable)
+    llm.py                 # thin Anthropic SDK wrappers: generate / read / judge / reflect
+    dataset.py             # ~12 topics; loads + freezes auto-generated reader-questions
+    pipeline.py            # brainstorm -> generate (with termchart skill context)
+    geometry_bridge.py     # shells out to the tsx Node validator, parses findings JSON
+    validate_flow.ts       # Node bridge: imports geometryReport, reads stdin, prints JSON
+    metric.py              # structural gate + geometry score + comprehension score + feedback
+    adapter.py             # gepa.GEPAAdapter: evaluate() + make_reflective_dataset()
+    seed_prompts.py        # the seed brainstorm/generate prompts (the starting candidate)
+    run.py                 # CLI entrypoint: gepa.optimize(...), writes results + report
+  tests/                   # pytest: pure-logic units with a fake LLM + one smoke
+  topics/                  # the authored topic dataset (json/yaml)
+```
+
+### The pipeline GEPA optimizes (per task)
+
+```
+topic ─┬▶ [brainstorm prompt]*  ─▶ plan: what to show + what context a reader needs
+       └▶ [generate prompt]* + termchart skill context ─▶ flow JSON
+                                                            │
+                  ┌─────────────────────────────────────────┘
+                  ▼
+   validators → combined score + textual feedback ─▶ GEPA reflection ─▶ better prompts*
+```
+
+`*` marks the two text components GEPA mutates. The seed candidate is
+`{ "brainstorm": <seed>, "generate": <seed> }`. The generate prompt's static
+context includes the termchart `flow` JSON schema and 1–2 shipped gallery example
+specs (`plugin/skills/diagram-recipes/examples/*.flow.json`) — this is the
+"generate using termchart skills" step.
+
+### The flow spec format (target output)
+
+```jsonc
+{
+  "direction": "TB" | "LR" | "BT" | "RL",        // default TB
+  "nodes": [{ "id", "data": { "label", "status?" }, "group?" }],
+  "edges": [{ "source", "target", "data?": { "label?" } }],
+  "groups?": [{ "id", "label", "color?" }],        // tiers/lanes/zones
+  "tiers?": bool, "lanes?": bool
+}
+```
+Detail and context live in `data.label` (rich node text), edge `data.label`
+(what the transition is/when it happens), and group labels — the levers the
+generate prompt learns to use.
+
+## The metric (the heart of this)
+
+`evaluate(task, flow_json)` returns a score in `[0,1]` plus a textual feedback
+string for GEPA's reflection.
+
+1. **Structural validity (hard gate).** Run the shared `validateContent` logic.
+   Invalid/unparseable → score `0.0`, feedback = the precise path-pointed error.
+   Everything below only runs on a valid spec.
+
+2. **Geometry / readability** → `geom_score ∈ [0,1]`. From the geometry bridge's
+   findings: `error`-severity (edge-over-node, node-overlap, missing-ref) drive a
+   large penalty; `warning`-severity (crossings, edge-near-node, low-readability,
+   and the over-stuffed-density codes) a small one. Feedback = the findings'
+   actionable messages.
+
+3. **Comprehension / first-user-experience (PRIMARY)** → `comp_score ∈ [0,1]`.
+   - A **fresh reader-LLM with no prior context** (system: "you're seeing this
+     board for the first time; answer ONLY from what's shown; say 'not shown' if
+     the board doesn't tell you") is given the board's **structured content** and
+     the task's frozen reader-questions.
+   - A **judge-LLM** scores each answer 0–1 on two axes: *correct* AND
+     *actually supported by the board*. An answer the reader had to invent, or
+     marked "not shown", counts as a comprehension miss attributable to **missing
+     detail/context** — flagged explicitly.
+   - `comp_score = mean(per-question scores)`.
+   - Feedback = the specific questions that scored low, each with the judge's
+     reason (e.g. "board never indicates what happens when validation fails"),
+     plus a short "missing context" list. This is what pushes reflection toward
+     adding detail and orienting context, not just cleaner layout.
+
+**Combined:** `total = w_comp * comp_score + w_geom * geom_score`, gated by
+validity. Defaults `w_comp = 0.6`, `w_geom = 0.4` (env-configurable). Comprehension
+leads because under-detailing is the main problem we're fixing; geometry remains a
+real guardrail so the optimizer can't win by dumping unreadable walls of text.
+
+## GEPA wiring
+
+- `gepa.optimize(seed_candidate, trainset, valset, adapter, reflection_lm=<callable>, max_metric_calls=<budget>)`.
+- `reflection_lm` is a **callable** wrapping the Anthropic SDK (Claude direct) —
+  not a LiteLLM model string — honoring the "Claude direct" decision.
+- The `GEPAAdapter` implements `evaluate(batch, candidate, capture_traces)` (runs
+  the pipeline + metric per task) and `make_reflective_dataset(...)` (turns the
+  captured feedback into the per-component reflective examples GEPA mutates on).
+
+## Models (Anthropic SDK direct)
+
+| Role | Default model | Notes |
+|---|---|---|
+| Generation (high volume) | `claude-opus-4-8` | env knob to `claude-sonnet-4-6` to cut cost |
+| Reader (FUX) | `claude-sonnet-4-6` | simulates an average reader; cheap; no thinking needed |
+| Judge | `claude-opus-4-8` | scores reader answers; distinct role from the generator |
+| GEPA reflection | `claude-opus-4-8` | strongest model proposes prompt mutations |
+
+Adaptive thinking on the reasoning-heavy roles (reflection, judge). Auth via
+`ANTHROPIC_API_KEY` or an `ant auth login` profile.
+
+## Geometry bridge (TS ↔ Python)
+
+The geometry validator (`packages/viewer/src/flow-geometry.ts` → `geometryReport`)
+is TypeScript and intentionally not on the CLI/public surface. Rather than re-export
+it or stand up a server, a small `validate_flow.ts` Node script imports
+`geometryReport`, reads a flow-JSON spec on stdin, and prints
+`{ findings: [...], warnings: [...] }` on stdout. It runs via `npx tsx` with the
+viewer package as the resolution root (so `dagre` etc. resolve). `geometry_bridge.py`
+shells out to it and parses the JSON. No changes to shipped packages.
+
+## Dataset
+
+`topics/` holds ~12 tasks, each:
+
+```jsonc
+{
+  "id": "ci-cd-pipeline",
+  "topic": "CI/CD pipeline for a web app",
+  "audience": "an engineer new to the team",
+  "purpose": "understand how code reaches production and what can go wrong"
+}
+```
+Reader-questions are **not** stored per topic — at run start the harness
+auto-generates a fixed set per topic (via an LLM, from topic+audience+purpose) and
+**freezes** them to the run directory, so every candidate in that run is judged
+against identical questions (fair comparison; stable signal within a run). Split
+~8 train / ~4 val.
+
+## Outputs
+
+Written to a timestamped run directory:
+- `best_prompts.json` — the evolved `brainstorm` + `generate` prompts.
+- `report.md` — seed vs. best on the val set: comprehension score, geometry score,
+  combined, and per-question deltas (which previously-unanswerable questions the
+  improved board now answers).
+- `frozen_questions.json` — the reader-questions used for the run.
+
+## Testing
+
+- **Unit (pytest, fake LLM):** findings→`geom_score` mapping; comprehension
+  scoring + feedback assembly; structural-gate behavior; dataset load + question
+  freeze; the adapter's `evaluate`/`make_reflective_dataset` shapes.
+- **Geometry bridge:** a known-bad spec (edge over node) yields the expected
+  finding; a clean spec yields none.
+- **Smoke (live, tiny):** `run.py --smoke` runs 1 topic at a tiny budget end to
+  end and asserts a report is produced. Gated behind `ANTHROPIC_API_KEY`.
+
+## Cost & scope control
+
+- `--smoke` is the default-safe entry: 1 topic, minimal budget, a handful of LLM
+  calls — validates the whole loop for cents.
+- A full run defaults to ~150 metric calls (`--max-metric-calls`), train/val as
+  above. All models, weights, and budget are env/flag-overridable.
+- Every run prints an up-front estimate (rollouts × calls/rollout × models) before
+  spending.
+
+## Files touched / created
+
+All new, under `scripts/experiments/gepa-flowchart/` (+ this spec). No existing
+package code is modified.
diff --git a/scripts/experiments/gepa-flowchart/.gitignore b/scripts/experiments/gepa-flowchart/.gitignore
@@ -0,0 +1,5 @@
+__pycache__/
+*.pyc
+.venv/
+node_modules/
+runs/
diff --git a/scripts/experiments/gepa-flowchart/README.md b/scripts/experiments/gepa-flowchart/README.md
@@ -0,0 +1,76 @@
+# gepa-flowchart
+
+GEPA prompt optimization for readable, comprehensible termchart flowcharts.
+Optimizes the `brainstorm` + `generate` prompts for first-user comprehension
+(primary) and geometric readability (secondary). See
+`docs/superpowers/specs/2026-06-18-gepa-flowchart-optimization-design.md`.
+
+## Setup
+
+```bash
+# Node deps for the geometry bridge (run once)
+npm install                         # repo root — gives the viewer its deps + playwright
+npm run build --workspace @ivanmkc/termchart-viewer   # build the viewer (needed to RENDER boards)
+cd scripts/experiments/gepa-flowchart
+npm install                         # local tsx
+
+# Python
+python -m venv .venv && . .venv/bin/activate
+pip install -e ".[dev]"
+export ANTHROPIC_API_KEY=sk-ant-...   # or: ant auth login
+```
+
+### Auth: direct API or Vertex AI
+
+By default the harness uses the direct Anthropic API (`ANTHROPIC_API_KEY`).
+To run via **Claude on Vertex AI** (ADC, no API key), install the Vertex extra
+and set the Vertex env vars — `get_client()` auto-selects the Vertex client when
+`CLAUDE_CODE_USE_VERTEX` (or `GEPA_USE_VERTEX`) is set:
+
+```bash
+pip install "anthropic[vertex]"
+export CLAUDE_CODE_USE_VERTEX=1
+export ANTHROPIC_VERTEX_PROJECT_ID=<gcp-project>
+export CLOUD_ML_REGION=global              # or a Claude-on-Vertex region
+gcloud auth application-default login       # ADC
+```
+
+## Run
+
+```bash
+python -m gepa_flowchart.run --smoke            # cheap end-to-end check (1 topic)
+python -m gepa_flowchart.run                    # full run (~150 metric calls)
+python -m gepa_flowchart.run --max-metric-calls 80 --train 8
+```
+
+Outputs land in `runs/<timestamp>/`: `best_prompts.json`, `report.md`,
+`frozen_questions.json`.
+
+## Validators (the metric)
+
+Each board is scored on five signals, gated by structural validity:
+
+- **text comprehension** — a fresh reader-LLM answers the run-frozen reader-questions from the board's structured content; a strict, board-grounded judge scores them.
+- **visual comprehension** — the board is **rendered in a real browser** (viewer + Chromium via a persistent render service) and a **multimodal LLM reads the screenshot**, answering the same questions from the pixels.
+- **visual quality** — the same vision call rates legibility / crowding / overlaps / clipping.
+- **heuristic geometry** — the fast TS `geometryReport` (edges-over-nodes, crossings, density).
+- **rendered geometry (Playwright)** — real-DOM measurements: node-pair overlaps, off-canvas nodes, smallest on-screen font.
+
+`comp = mean(text, visual)`, `geom = mean(heuristic, rendered)`,
+`total = w_comp·comp + w_geom·geom + w_vq·visual_quality`.
+
+## Config (env)
+
+`GEPA_GEN_MODEL` (default `claude-opus-4-8`; set `claude-sonnet-4-6` to cut cost),
+`GEPA_READER_MODEL`, `GEPA_JUDGE_MODEL`, `GEPA_VISION_MODEL`, `GEPA_REFLECTION_MODEL`,
+`GEPA_W_COMP` (0.5), `GEPA_W_GEOM` (0.3), `GEPA_W_VQ` (0.2), `GEPA_N_QUESTIONS` (7),
+`GEPA_MAX_METRIC_CALLS` (150). Rendered-geometry penalties: `GEPA_RG_OVERLAP_PENALTY`,
+`GEPA_RG_OFFSCREEN_PENALTY`, `GEPA_RG_TINYFONT_PENALTY`, `GEPA_RG_MIN_FONT_PX`.
+
+## Cost note
+
+Each rollout now does generation + text reader/judge + **a browser render + a
+multimodal vision call**, plus reflection — heavier than a text-only metric.
+Start with `--smoke`. The render service (viewer + one Chromium) starts once per
+run and is reused. Generation defaults to Opus; `GEPA_GEN_MODEL=claude-sonnet-4-6`
+cuts the high-volume role.
diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/__init__.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/__init__.py
@@ -0,0 +1 @@
+"""GEPA optimization for readable, comprehensible termchart flowcharts."""
diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/adapter.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/adapter.py
@@ -0,0 +1,74 @@
+from __future__ import annotations
+
+from gepa.core.adapter import EvaluationBatch, GEPAAdapter
+
+from .config import Config
+from .geometry_bridge import validate_flow
+from .llm import complete, complete_vision
+from .metric import score_board
+from .pipeline import run_pipeline
+
+
+def _no_render(_content):
+    return {"ok": False, "error": "no render_fn supplied"}
+
+
+class FlowchartAdapter(GEPAAdapter):
+    def __init__(self, cfg: Config, *, validate_fn=validate_flow, complete_fn=complete,
+                 render_fn=_no_render, vision_fn=complete_vision):
+        self.cfg = cfg
+        self.validate_fn = validate_fn
+        self.complete_fn = complete_fn
+        self.render_fn = render_fn
+        self.vision_fn = vision_fn
+
+    def evaluate(self, batch, candidate, capture_traces=False):
+        outputs, scores, trajectories = [], [], [] if capture_traces else None
+        objective_scores = []
+        for task in batch:
+            content, trace = run_pipeline(
+                candidate, task, model=self.cfg.gen_model,
+                complete_fn=self.complete_fn, max_tokens=self.cfg.gen_max_tokens,
+            )
+            result = score_board(
+                content, task, self.cfg, validate_fn=self.validate_fn, complete_fn=self.complete_fn,
+                render_fn=self.render_fn, vision_fn=self.vision_fn,
+            )
+            outputs.append(content)
+            scores.append(result.score)
+            # Per-objective scores so GEPA can track/Pareto the distinct axes instead of only the
+            # collapsed scalar (keeps candidates that are best on comprehension vs. layout separate).
+            objective_scores.append({
+                "comprehension": result.comp,
+                "geometry": result.geom,
+                "visual_quality": result.visual_quality,
+            })
+            if capture_traces:
+                trajectories.append({"task": task, "trace": trace, "result": result, "output": content})
+        return EvaluationBatch(
+            outputs=outputs, scores=scores, trajectories=trajectories, objective_scores=objective_scores
+        )
+
+    def make_reflective_dataset(self, candidate, eval_batch, components_to_update):
+        out: dict[str, list[dict]] = {c: [] for c in components_to_update}
+        for traj in eval_batch.trajectories or []:
+            task = traj["task"]
+            result = traj["result"]
+            trace = traj["trace"]
+            shared_feedback = result.feedback
+            if "brainstorm" in out:
+                out["brainstorm"].append({
+                    "Inputs": f"Topic: {task.topic}\nAudience: {task.audience}\nPurpose: {task.purpose}",
+                    "Generated Outputs": trace.get("plan", ""),
+                    "Feedback": (
+                        "The plan feeds a downstream generator. The resulting board scored "
+                        f"{result.score:.2f}. " + shared_feedback
+                    ),
+                })
+            if "generate" in out:
+                out["generate"].append({
+                    "Inputs": f"Plan:\n{trace.get('plan','')}",
+                    "Generated Outputs": traj.get("output") or trace.get("raw_generation", ""),
+                    "Feedback": shared_feedback,
+                })
+        return out
Original file line number	Diff line number	Diff line change
		@@ -0,0 +1 @@
		"""GEPA optimization for readable, comprehensible termchart flowcharts."""