From 3b4c5b8aa8a2c6215490255039ab9b5624dd8b63 Mon Sep 17 00:00:00 2001
From: Ivan Cheung <ivanmkc@google.com>
Date: Thu, 18 Jun 2026 04:26:40 +0000
Subject: [PATCH 01/24] docs: spec for GEPA flowchart optimization
 (comprehension + geometry)

Offline GEPA loop that evolves the brainstorm + generate prompts for termchart
flowcharts, optimizing primarily for first-user comprehension (a fresh reader-LLM
answers auto-generated reader-questions, judged for board-supported correctness)
with geometry/readability as a secondary guardrail. Claude direct via the
Anthropic SDK; standalone gepa; structured-content reader surface; smoke-first.
---
 ...6-18-gepa-flowchart-optimization-design.md | 226 ++++++++++++++++++
 1 file changed, 226 insertions(+)
 create mode 100644 docs/superpowers/specs/2026-06-18-gepa-flowchart-optimization-design.md

diff --git a/docs/superpowers/specs/2026-06-18-gepa-flowchart-optimization-design.md b/docs/superpowers/specs/2026-06-18-gepa-flowchart-optimization-design.md
new file mode 100644
index 0000000..b53ee1f
--- /dev/null
+++ b/docs/superpowers/specs/2026-06-18-gepa-flowchart-optimization-design.md
@@ -0,0 +1,226 @@
+# GEPA optimization for readable, comprehensible flowcharts
+
+**Date:** 2026-06-18
+**Status:** Approved design — ready for implementation plan
+
+## Problem
+
+termchart can generate flowcharts, but the *quality* of a generated board is
+uneven. Two failure modes recur:
+
+1. **Unreadable geometry** — edges run over nodes, nodes overlap, the graph is
+   so large it renders tiny. (The push-time geometry lint already detects this.)
+2. **Insufficient detail / context** — the more common problem. A board renders
+   cleanly but a first-time reader can't actually learn what they need from it:
+   labels are terse, the triggers/conditions/outcomes aren't shown, there's no
+   orienting context. The reader is left with unanswered questions.
+
+We want to **systematically improve the prompts** that produce flowcharts so the
+output is both readable *and* comprehensible to someone seeing it for the first
+time. GEPA (reflective prompt evolution) is a good fit: it mutates text prompts
+using an LLM that reflects on execution feedback and a metric.
+
+## Goal
+
+Stand up a GEPA optimization loop that evolves the prompts used to **brainstorm**
+and **generate** a termchart flowchart, optimizing primarily for **first-user
+comprehension** (a fresh reader can answer the questions they'd naturally have),
+with **geometric readability** as a secondary guardrail. Produce the evolved
+prompts plus a before/after report.
+
+## Non-goals (YAGNI)
+
+- **No** changes to the termchart viewer/CLI public surface (the geometry
+  validator is reused via a thin bridge, not re-exported).
+- **No** human-in-the-loop labeling — the eval is fully automated (LLM reader +
+  LLM judge).
+- **No** visual/image rendering of the board for the reader — comprehension is
+  scored on the board's **structured content** (see Decisions). True pixel
+  readability is the geometry lint's job, not the comprehension test's.
+- **No** integration into the live push path — this is an offline experiment that
+  produces better prompts; wiring them into a recipe/skill is a separate follow-up.
+
+## Decisions (from brainstorming)
+
+| Question | Decision |
+|---|---|
+| LLM access | Claude direct via the Anthropic Python SDK (not the LiteLLM proxy) |
+| GEPA implementation | Standalone `gepa` package (bring-your-own adapter) |
+| What GEPA optimizes | **Two** text components: a `brainstorm` prompt and a `generate` prompt |
+| Dataset | ~12 diverse flowchart topics, ~8 train / ~4 val, authored in-repo |
+| Reader-questions (first-user-experience / FUX backbone) | **Auto-generated per topic at run start**, then frozen for the run so every candidate is scored against the same questions |
+| Reader input surface | **Structured flow content** (node/edge/group labels + annotations), not an image |
+| Run scope | **Smoke-first**: a cheap `--smoke` default; a modest full run (~150 metric calls) via flag |
+
+## Architecture
+
+New Python harness in a new git worktree:
+
+```
+scripts/experiments/gepa-flowchart/
+  README.md                # how to run, env vars, cost notes
+  pyproject.toml           # deps: gepa, anthropic
+  gepa_flowchart/
+    __init__.py
+    config.py              # models, weights, budget, paths (env-overridable)
+    llm.py                 # thin Anthropic SDK wrappers: generate / read / judge / reflect
+    dataset.py             # ~12 topics; loads + freezes auto-generated reader-questions
+    pipeline.py            # brainstorm -> generate (with termchart skill context)
+    geometry_bridge.py     # shells out to the tsx Node validator, parses findings JSON
+    validate_flow.ts       # Node bridge: imports geometryReport, reads stdin, prints JSON
+    metric.py              # structural gate + geometry score + comprehension score + feedback
+    adapter.py             # gepa.GEPAAdapter: evaluate() + make_reflective_dataset()
+    seed_prompts.py        # the seed brainstorm/generate prompts (the starting candidate)
+    run.py                 # CLI entrypoint: gepa.optimize(...), writes results + report
+  tests/                   # pytest: pure-logic units with a fake LLM + one smoke
+  topics/                  # the authored topic dataset (json/yaml)
+```
+
+### The pipeline GEPA optimizes (per task)
+
+```
+topic ─┬▶ [brainstorm prompt]*  ─▶ plan: what to show + what context a reader needs
+       └▶ [generate prompt]* + termchart skill context ─▶ flow JSON
+                                                            │
+                  ┌─────────────────────────────────────────┘
+                  ▼
+   validators → combined score + textual feedback ─▶ GEPA reflection ─▶ better prompts*
+```
+
+`*` marks the two text components GEPA mutates. The seed candidate is
+`{ "brainstorm": <seed>, "generate": <seed> }`. The generate prompt's static
+context includes the termchart `flow` JSON schema and 1–2 shipped gallery example
+specs (`plugin/skills/diagram-recipes/examples/*.flow.json`) — this is the
+"generate using termchart skills" step.
+
+### The flow spec format (target output)
+
+```jsonc
+{
+  "direction": "TB" | "LR" | "BT" | "RL",        // default TB
+  "nodes": [{ "id", "data": { "label", "status?" }, "group?" }],
+  "edges": [{ "source", "target", "data?": { "label?" } }],
+  "groups?": [{ "id", "label", "color?" }],        // tiers/lanes/zones
+  "tiers?": bool, "lanes?": bool
+}
+```
+Detail and context live in `data.label` (rich node text), edge `data.label`
+(what the transition is/when it happens), and group labels — the levers the
+generate prompt learns to use.
+
+## The metric (the heart of this)
+
+`evaluate(task, flow_json)` returns a score in `[0,1]` plus a textual feedback
+string for GEPA's reflection.
+
+1. **Structural validity (hard gate).** Run the shared `validateContent` logic.
+   Invalid/unparseable → score `0.0`, feedback = the precise path-pointed error.
+   Everything below only runs on a valid spec.
+
+2. **Geometry / readability** → `geom_score ∈ [0,1]`. From the geometry bridge's
+   findings: `error`-severity findings (edge-over-node, node-overlap, missing-ref)
+   incur a large penalty; `warning`-severity findings (crossings, edge-near-node,
+   low-readability, and the over-stuffed-density codes) incur a small one.
+   Feedback = the findings' actionable messages.
+
+3. **Comprehension / first-user-experience (PRIMARY)** → `comp_score ∈ [0,1]`.
+   - A **fresh reader-LLM with no prior context** (system: "you're seeing this
+     board for the first time; answer ONLY from what's shown; say 'not shown' if
+     the board doesn't tell you") is given the board's **structured content** and
+     the task's frozen reader-questions.
+   - A **judge-LLM** scores each answer 0–1 on two axes: *correct* AND
+     *actually supported by the board*. An answer the reader had to invent, or
+     marked "not shown", counts as a comprehension miss attributable to **missing
+     detail/context** — flagged explicitly.
+   - `comp_score = mean(per-question scores)`.
+   - Feedback = the specific questions that scored low, each with the judge's
+     reason (e.g. "board never indicates what happens when validation fails"),
+     plus a short "missing context" list. This is what pushes reflection toward
+     adding detail and orienting context, not just cleaner layout.
+
+**Combined:** `total = w_comp * comp_score + w_geom * geom_score`, gated by
+validity. Defaults `w_comp = 0.6`, `w_geom = 0.4` (env-configurable). Comprehension
+leads because under-detailing is the main problem we're fixing; geometry remains a
+real guardrail so the optimizer can't win by dumping unreadable walls of text.
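+
+A minimal sketch of how the three parts could combine. The spec fixes the
+weights but not the penalty curve, so the linear per-finding penalty, the two
+penalty constants, and the function names `geom_score`, `comp_score`, and
+`total_score` are all assumptions made for illustration:
+
+```python
+from statistics import mean
+
+# Hypothetical penalty sizes: the spec only says "large" for errors and
+# "small" for warnings.
+ERROR_PENALTY = 0.34
+WARNING_PENALTY = 0.08
+
+
+def geom_score(findings: list[dict]) -> float:
+    """Assumed linear penalty: subtract per finding, then clamp to [0, 1]."""
+    penalty = 0.0
+    for finding in findings:
+        if finding["severity"] == "error":
+            penalty += ERROR_PENALTY
+        elif finding["severity"] == "warning":
+            penalty += WARNING_PENALTY
+    return max(0.0, 1.0 - penalty)
+
+
+def comp_score(per_question: list[float]) -> float:
+    """Mean of the judge's per-question scores; an empty list scores 0."""
+    return mean(per_question) if per_question else 0.0
+
+
+def total_score(valid, findings, per_question, w_comp=0.6, w_geom=0.4) -> float:
+    """Validity is a hard gate: an invalid spec scores 0.0 regardless of the rest."""
+    if not valid:
+        return 0.0
+    return w_comp * comp_score(per_question) + w_geom * geom_score(findings)
+```
+
+Under these assumptions a clean board that answers half the questions scores
+`0.6 * 0.5 + 0.4 * 1.0 = 0.7`, while three `error` findings alone push
+`geom_score` to 0.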
+
+## GEPA wiring
+
+- `gepa.optimize(seed_candidate, trainset, valset, adapter, reflection_lm=<callable>, max_metric_calls=<budget>)`.
+- `reflection_lm` is a **callable** wrapping the Anthropic SDK (Claude direct) —
+  not a LiteLLM model string — honoring the "Claude direct" decision.
+- The `GEPAAdapter` implements `evaluate(batch, candidate, capture_traces)` (runs
+  the pipeline + metric per task) and `make_reflective_dataset(...)` (turns the
+  captured feedback into the per-component reflective examples GEPA mutates on).
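+
+The adapter contract above can be sketched offline as follows. `EvalBatch` is a
+local stand-in for gepa's evaluation-batch type (its three fields are assumed
+here), `FlowchartAdapter` and the injected `run_pipeline` / `score` callables
+are hypothetical names, and the reflective-example keys are an assumed schema,
+not something this spec fixes:
+
+```python
+from dataclasses import dataclass
+from typing import Any, Callable
+
+
+@dataclass
+class EvalBatch:
+    """Stand-in for gepa's batch result (assumed fields)."""
+    outputs: list[Any]
+    scores: list[float]
+    trajectories: list[dict] | None = None
+
+
+class FlowchartAdapter:
+    """Hypothetical adapter shape; the pipeline and metric are injected."""
+
+    def __init__(
+        self,
+        run_pipeline: Callable[[dict, dict], str],
+        score: Callable[[dict, str], tuple[float, str]],
+    ):
+        self.run_pipeline = run_pipeline
+        self.score = score
+
+    def evaluate(self, batch, candidate, capture_traces=False) -> EvalBatch:
+        outputs, scores, traces = [], [], []
+        for task in batch:
+            flow_json = self.run_pipeline(task, candidate)
+            value, feedback = self.score(task, flow_json)
+            outputs.append(flow_json)
+            scores.append(value)
+            traces.append({"task": task, "output": flow_json, "feedback": feedback})
+        return EvalBatch(outputs, scores, traces if capture_traces else None)
+
+    def make_reflective_dataset(self, candidate, eval_batch, components_to_update):
+        # Both prompts get the same end-to-end feedback, because the score is
+        # only observed after the whole pipeline has run.
+        examples = [
+            {
+                "Inputs": t["task"]["topic"],
+                "Generated Outputs": t["output"],
+                "Feedback": t["feedback"],
+            }
+            for t in (eval_batch.trajectories or [])
+        ]
+        return {name: list(examples) for name in components_to_update}
+```
+
+Injecting the pipeline and metric keeps the adapter testable with a fake LLM,
+which is what the unit tests in this spec rely on.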
+
+## Models (Anthropic SDK direct)
+
+| Role | Default model | Notes |
+|---|---|---|
+| Generation (high volume) | `claude-opus-4-8` | env knob to `claude-sonnet-4-6` to cut cost |
+| Reader (FUX) | `claude-sonnet-4-6` | simulates an average reader; cheap; no thinking needed |
+| Judge | `claude-opus-4-8` | scores reader answers; distinct role from the generator |
+| GEPA reflection | `claude-opus-4-8` | strongest model proposes prompt mutations |
+
+Adaptive thinking is enabled on the reasoning-heavy roles (reflection, judge).
+Auth is via `ANTHROPIC_API_KEY` or an `ant auth login` profile.
+
+## Geometry bridge (TS ↔ Python)
+
+The geometry validator (`packages/viewer/src/flow-geometry.ts` → `geometryReport`)
+is TypeScript and intentionally not on the CLI/public surface. Rather than re-export
+it or stand up a server, a small `validate_flow.ts` Node script imports
+`geometryReport`, reads a flow-JSON spec on stdin, and prints
+`{ findings: [...], warnings: [...] }` on stdout. It runs via `npx tsx` with the
+viewer package as the resolution root (so `dagre` etc. resolve). `geometry_bridge.py`
+shells out to it and parses the JSON. No changes to shipped packages.
+
+## Dataset
+
+`topics/` holds ~12 tasks, each:
+
+```jsonc
+{
+  "id": "ci-cd-pipeline",
+  "topic": "CI/CD pipeline for a web app",
+  "audience": "an engineer new to the team",
+  "purpose": "understand how code reaches production and what can go wrong"
+}
+```
+Reader-questions are **not** stored per topic — at run start the harness
+auto-generates a fixed set per topic (via an LLM, from topic+audience+purpose) and
+**freezes** them to the run directory, so every candidate in that run is judged
+against identical questions (fair comparison; stable signal within a run). Split
+~8 train / ~4 val.
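+
+The freeze step can be sketched as "generate once, then always reuse the file".
+The helper name `load_or_freeze_questions` and the injected `generate` callable
+(standing in for the LLM question generator) are hypothetical; only the
+`frozen_questions.json` filename comes from this spec:
+
+```python
+import json
+from pathlib import Path
+from typing import Callable
+
+
+def load_or_freeze_questions(
+    run_dir: Path,
+    topics: list[dict],
+    generate: Callable[[dict], list[str]],
+) -> dict[str, list[str]]:
+    """Return the run's questions, generating them only if none are frozen yet."""
+    path = run_dir / "frozen_questions.json"
+    if path.exists():
+        # Already frozen: every later candidate sees exactly these questions.
+        return json.loads(path.read_text(encoding="utf-8"))
+    frozen = {topic["id"]: generate(topic) for topic in topics}
+    run_dir.mkdir(parents=True, exist_ok=True)
+    path.write_text(json.dumps(frozen, indent=2), encoding="utf-8")
+    return frozen
+```
+
+Keying the file by topic `id` lets the metric look up a task's questions
+without calling the generator again mid-run.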
+
+## Outputs
+
+Written to a timestamped run directory:
+- `best_prompts.json` — the evolved `brainstorm` + `generate` prompts.
+- `report.md` — seed vs. best on the val set: comprehension score, geometry score,
+  combined, and per-question deltas (which previously-unanswerable questions the
+  improved board now answers).
+- `frozen_questions.json` — the reader-questions used for the run.
+
+## Testing
+
+- **Unit (pytest, fake LLM):** findings→`geom_score` mapping; comprehension
+  scoring + feedback assembly; structural-gate behavior; dataset load + question
+  freeze; the adapter's `evaluate`/`make_reflective_dataset` shapes.
+- **Geometry bridge:** a known-bad spec (edge over node) yields the expected
+  finding; a clean spec yields none.
+- **Smoke (live, tiny):** `run.py --smoke` runs 1 topic at a tiny budget end to
+  end and asserts a report is produced. Gated behind `ANTHROPIC_API_KEY`.
+
+## Cost & scope control
+
+- `--smoke` is the default-safe entry: 1 topic, minimal budget, a handful of LLM
+  calls — validates the whole loop for cents.
+- A full run defaults to ~150 metric calls (`--max-metric-calls`), train/val as
+  above. All models, weights, and budget are env/flag-overridable.
+- Every run prints an up-front estimate (rollouts × calls/rollout × models) before
+  spending.
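+
+A rough sketch of the up-front estimate, counting calls only (no prices). It
+assumes one metric call equals one rollout and that each rollout makes one call
+per pipeline or metric step named in this spec; `CALLS_PER_ROLLOUT` and
+`estimate_calls` are hypothetical names, and reflection and question-generation
+calls are left out:
+
+```python
+from collections import Counter
+
+# Assumed calls per rollout: brainstorm + generate, one reader pass, one judge pass.
+CALLS_PER_ROLLOUT = {"generation": 2, "reader": 1, "judge": 1}
+
+
+def estimate_calls(max_metric_calls: int, models: dict[str, str]) -> Counter:
+    """Upper bound on LLM calls per model for a run with the given budget."""
+    calls: Counter = Counter()
+    for role, per_rollout in CALLS_PER_ROLLOUT.items():
+        calls[models[role]] += max_metric_calls * per_rollout
+    return calls
+```
+
+This is an upper bound because a spec that fails the structural gate never
+reaches the reader or judge.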
+
+## Files touched / created
+
+All new, under `scripts/experiments/gepa-flowchart/` (+ this spec). No existing
+package code is modified.

From 08a3a2a23adb81bd04901a209b1c81995ac53b75 Mon Sep 17 00:00:00 2001
From: Ivan Cheung <ivanmkc@google.com>
Date: Thu, 18 Jun 2026 04:34:17 +0000
Subject: [PATCH 02/24] docs: implementation plan for GEPA flowchart
 optimization

10 TDD tasks: scaffold+config, geometry bridge (tsx), Anthropic LLM wrappers,
board serializer, dataset+question freeze, seed prompts+pipeline, three-part
metric, GEPAAdapter, run+report, README. Verified gepa API (adapter protocol,
EvaluationBatch, optimize signature, reflection callable) and Anthropic SDK
patterns against source.
---
 .../2026-06-18-gepa-flowchart-optimization.md | 1535 +++++++++++++++++
 1 file changed, 1535 insertions(+)
 create mode 100644 docs/superpowers/plans/2026-06-18-gepa-flowchart-optimization.md

diff --git a/docs/superpowers/plans/2026-06-18-gepa-flowchart-optimization.md b/docs/superpowers/plans/2026-06-18-gepa-flowchart-optimization.md
new file mode 100644
index 0000000..302cddf
--- /dev/null
+++ b/docs/superpowers/plans/2026-06-18-gepa-flowchart-optimization.md
@@ -0,0 +1,1535 @@
+# GEPA Flowchart Optimization Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Build an offline GEPA loop that evolves the `brainstorm` + `generate` prompts for termchart flowcharts, optimizing primarily for first-user comprehension and secondarily for geometric readability.
+
+**Architecture:** A Python harness (`scripts/experiments/gepa-flowchart/`) drives the standalone `gepa` package. Each rollout runs a two-prompt pipeline (brainstorm → generate flow JSON) through a three-part metric: structural-validity gate, geometry score (reused from the TS `geometryReport` via a `tsx` bridge), and comprehension score (a fresh reader-LLM answers auto-generated, run-frozen questions; a judge scores board-supported correctness). GEPA reflects on the combined feedback to mutate the two prompts. All LLM access is Claude direct via the Anthropic Python SDK.
+
+**Tech Stack:** Python 3.11+, `gepa`, `anthropic` (Python SDK); Node + `tsx` for the geometry bridge that imports the existing TypeScript validators.
+
+## Global Constraints
+
+- Python **3.11+** (uses `X | Y` unions, `list[...]` generics).
+- LLM access is **Claude direct via the `anthropic` Python SDK** — never LiteLLM, never a non-Anthropic shim.
+- **No existing package code is modified.** Everything new lives under `scripts/experiments/gepa-flowchart/`. The TS validators are *imported* by a bridge, not edited or re-exported.
+- Models come from config; defaults: generation `claude-opus-4-8`, reader `claude-sonnet-4-6`, judge `claude-opus-4-8`, reflection `claude-opus-4-8`. All env-overridable.
+- **Never pass `temperature`/`top_p`/`top_k`** to the Anthropic API (these 400 on Opus 4.8 / Sonnet 4.6). Steer with prompts; control depth with `output_config={"effort": ...}`.
+- Combined score `total = w_comp*comp_score + w_geom*geom_score`, gated by validity. Defaults `w_comp=0.6`, `w_geom=0.4`.
+- TDD: write the failing test, watch it fail, minimal code, watch it pass, commit. Frequent commits.
+- Run all `pytest` from the project dir: `cd scripts/experiments/gepa-flowchart`.
+
+---
+
+### Task 1: Project scaffold + config
+
+**Files:**
+- Create: `scripts/experiments/gepa-flowchart/pyproject.toml`
+- Create: `scripts/experiments/gepa-flowchart/package.json`
+- Create: `scripts/experiments/gepa-flowchart/.gitignore`
+- Create: `scripts/experiments/gepa-flowchart/gepa_flowchart/__init__.py`
+- Create: `scripts/experiments/gepa-flowchart/gepa_flowchart/config.py`
+- Test: `scripts/experiments/gepa-flowchart/tests/test_config.py`
+
+**Interfaces:**
+- Produces: `Config` dataclass with fields `gen_model: str`, `reader_model: str`, `judge_model: str`, `reflection_model: str`, `w_comp: float`, `w_geom: float`, `geom_error_penalty: float`, `geom_warning_penalty: float`, `n_questions: int`, `max_metric_calls: int`, `gen_max_tokens: int`. Classmethod `Config.from_env(overrides: dict | None = None) -> Config`.
+
+- [ ] **Step 1: Write the config files (non-test scaffolding)**
+
+`pyproject.toml`:
+```toml
+[project]
+name = "gepa-flowchart"
+version = "0.1.0"
+requires-python = ">=3.11"
+dependencies = ["gepa", "anthropic>=0.69"]
+
+[project.optional-dependencies]
+dev = ["pytest>=8"]
+
+[tool.pytest.ini_options]
+testpaths = ["tests"]
+```
+
+`package.json`:
+```json
+{
+  "name": "gepa-flowchart-bridge",
+  "private": true,
+  "devDependencies": { "tsx": "^4.19.0" }
+}
+```
+
+`.gitignore`:
+```
+__pycache__/
+*.pyc
+.venv/
+node_modules/
+runs/
+```
+
+`gepa_flowchart/__init__.py`:
+```python
+"""GEPA optimization for readable, comprehensible termchart flowcharts."""
+```
+
+- [ ] **Step 2: Write the failing test**
+
+`tests/test_config.py`:
+```python
+from gepa_flowchart.config import Config
+
+
+def test_defaults():
+    cfg = Config.from_env({})
+    assert cfg.gen_model == "claude-opus-4-8"
+    assert cfg.reader_model == "claude-sonnet-4-6"
+    assert cfg.judge_model == "claude-opus-4-8"
+    assert cfg.reflection_model == "claude-opus-4-8"
+    assert cfg.w_comp == 0.6 and cfg.w_geom == 0.4
+
+
+def test_env_override():
+    cfg = Config.from_env({"GEPA_GEN_MODEL": "claude-sonnet-4-6", "GEPA_W_COMP": "0.7"})
+    assert cfg.gen_model == "claude-sonnet-4-6"
+    assert cfg.w_comp == 0.7
+```
+
+- [ ] **Step 3: Run test to verify it fails**
+
+Run: `cd scripts/experiments/gepa-flowchart && python -m pytest tests/test_config.py -v`
+Expected: FAIL — `ModuleNotFoundError: No module named 'gepa_flowchart.config'`
+
+- [ ] **Step 4: Write minimal implementation**
+
+`gepa_flowchart/config.py`:
+```python
+from __future__ import annotations
+
+import os
+from dataclasses import dataclass
+
+
+@dataclass
+class Config:
+    gen_model: str = "claude-opus-4-8"
+    reader_model: str = "claude-sonnet-4-6"
+    judge_model: str = "claude-opus-4-8"
+    reflection_model: str = "claude-opus-4-8"
+    w_comp: float = 0.6
+    w_geom: float = 0.4
+    geom_error_penalty: float = 0.34
+    geom_warning_penalty: float = 0.08
+    n_questions: int = 5
+    max_metric_calls: int = 150
+    gen_max_tokens: int = 8000
+
+    @classmethod
+    def from_env(cls, overrides: dict | None = None) -> "Config":
+        env = {**os.environ, **(overrides or {})}
+
+        def s(key: str, default: str) -> str:
+            return env.get(key, default)
+
+        def f(key: str, default: float) -> float:
+            return float(env.get(key, default))
+
+        def i(key: str, default: int) -> int:
+            return int(env.get(key, default))
+
+        return cls(
+            gen_model=s("GEPA_GEN_MODEL", cls.gen_model),
+            reader_model=s("GEPA_READER_MODEL", cls.reader_model),
+            judge_model=s("GEPA_JUDGE_MODEL", cls.judge_model),
+            reflection_model=s("GEPA_REFLECTION_MODEL", cls.reflection_model),
+            w_comp=f("GEPA_W_COMP", cls.w_comp),
+            w_geom=f("GEPA_W_GEOM", cls.w_geom),
+            geom_error_penalty=f("GEPA_GEOM_ERR_PENALTY", cls.geom_error_penalty),
+            geom_warning_penalty=f("GEPA_GEOM_WARN_PENALTY", cls.geom_warning_penalty),
+            n_questions=i("GEPA_N_QUESTIONS", cls.n_questions),
+            max_metric_calls=i("GEPA_MAX_METRIC_CALLS", cls.max_metric_calls),
+            gen_max_tokens=i("GEPA_GEN_MAX_TOKENS", cls.gen_max_tokens),
+        )
+```
+
+- [ ] **Step 5: Run test to verify it passes**
+
+Run: `cd scripts/experiments/gepa-flowchart && python -m pytest tests/test_config.py -v`
+Expected: PASS (2 passed)
+
+- [ ] **Step 6: Commit**
+
+```bash
+git add scripts/experiments/gepa-flowchart/
+git commit -m "feat(gepa): scaffold gepa-flowchart project + config"
+```
+
+---
+
+### Task 2: Geometry bridge (TS validator + Python wrapper)
+
+**Files:**
+- Create: `scripts/experiments/gepa-flowchart/gepa_flowchart/validate_flow.ts`
+- Create: `scripts/experiments/gepa-flowchart/gepa_flowchart/geometry_bridge.py`
+- Test: `scripts/experiments/gepa-flowchart/tests/test_geometry_bridge.py`
+
+**Interfaces:**
+- Produces: `validate_flow(content: str, *, cwd: str | None = None) -> dict` returning `{"valid": bool, "error": str | None, "findings": list[dict], "warnings": list[str]}`. Each finding dict has `severity`, `code`, `message`, `count`.
+
+- [ ] **Step 1: Set up Node deps (one-time, not a code step)**
+
+Run (from repo root, then the project dir):
+```bash
+npm install                                   # populates packages/viewer/node_modules (dagre, etc.)
+cd scripts/experiments/gepa-flowchart && npm install   # installs tsx locally
+```
+
+- [ ] **Step 2: Write the TS bridge**
+
+`gepa_flowchart/validate_flow.ts` (relative imports reach the existing validators; tsx resolves `.js`→`.ts`):
+```typescript
+// Reads a flow-JSON spec on stdin, prints { valid, error, findings, warnings } on stdout.
+// Imports the existing validators directly — no changes to shipped packages.
+import { validateContent } from "../../../../packages/core/src/validate.js";
+import { geometryReport } from "../../../../packages/viewer/src/flow-geometry.js";
+
+function readStdin(): Promise<string> {
+  return new Promise((resolve) => {
+    let data = "";
+    process.stdin.setEncoding("utf8");
+    process.stdin.on("data", (c) => (data += c));
+    process.stdin.on("end", () => resolve(data));
+  });
+}
+
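+// Wrapped in main() rather than using top-level await, which fails if tsx
+// compiles this file as CommonJS (package.json has no "type": "module").
+async function main(): Promise<void> {
+  const content = await readStdin();
+  const error = validateContent("flow", content);
+  if (error) {
+    process.stdout.write(JSON.stringify({ valid: false, error, findings: [], warnings: [] }));
+    return;
+  }
+  const { warnings, findings } = geometryReport("flow", content);
+  process.stdout.write(JSON.stringify({ valid: true, error: null, findings, warnings }));
+}
+
+// A non-zero exit lets geometry_bridge.py raise RuntimeError with stderr attached.
+main().catch((err) => {
+  console.error(err);
+  process.exit(1);
+});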
+```
+
+- [ ] **Step 3: Write the failing test**
+
+`tests/test_geometry_bridge.py`:
+```python
+import json
+
+from gepa_flowchart.geometry_bridge import validate_flow
+
+CLEAN = json.dumps({
+    "direction": "TB",
+    "nodes": [{"id": "a", "data": {"label": "A"}}, {"id": "b", "data": {"label": "B"}}],
+    "edges": [{"source": "a", "target": "b"}],
+})
+
+# a -> c with b sitting on the a-c line (explicit positions force the overlap)
+EDGE_OVER_NODE = json.dumps({
+    "layout": "manual", "direction": "LR",
+    "nodes": [
+        {"id": "a", "position": {"x": 0, "y": 0}},
+        {"id": "b", "position": {"x": 200, "y": 0}},
+        {"id": "c", "position": {"x": 440, "y": 0}},
+    ],
+    "edges": [{"source": "a", "target": "c"}],
+})
+
+
+def test_clean_spec_is_valid_no_errors():
+    r = validate_flow(CLEAN)
+    assert r["valid"] is True
+    assert [f for f in r["findings"] if f["severity"] == "error"] == []
+
+
+def test_edge_over_node_flagged():
+    r = validate_flow(EDGE_OVER_NODE)
+    assert r["valid"] is True
+    assert any(f["code"] == "edge-over-node" for f in r["findings"])
+
+
+def test_invalid_json_reported():
+    r = validate_flow("{ not json")
+    assert r["valid"] is False
+    assert r["error"]
+```
+
+- [ ] **Step 4: Run test to verify it fails**
+
+Run: `cd scripts/experiments/gepa-flowchart && python -m pytest tests/test_geometry_bridge.py -v`
+Expected: FAIL — `ModuleNotFoundError: No module named 'gepa_flowchart.geometry_bridge'`
+
+- [ ] **Step 5: Write the Python wrapper**
+
+`gepa_flowchart/geometry_bridge.py`:
+```python
+from __future__ import annotations
+
+import json
+import subprocess
+from pathlib import Path
+
+_HERE = Path(__file__).resolve().parent
+_PROJECT = _HERE.parent  # scripts/experiments/gepa-flowchart
+_SCRIPT = _HERE / "validate_flow.ts"
+
+
+def validate_flow(content: str, *, cwd: str | None = None) -> dict:
+    """Run the TS validator on a flow-JSON string. Returns
+    {valid, error, findings, warnings}. Raises RuntimeError if the bridge fails."""
+    proc = subprocess.run(
+        ["npx", "tsx", str(_SCRIPT)],
+        input=content,
+        capture_output=True,
+        text=True,
+        cwd=cwd or str(_PROJECT),
+    )
+    if proc.returncode != 0:
+        raise RuntimeError(f"geometry bridge failed: {proc.stderr.strip()}")
+    return json.loads(proc.stdout)
+```
+
+- [ ] **Step 6: Run test to verify it passes**
+
+Run: `cd scripts/experiments/gepa-flowchart && python -m pytest tests/test_geometry_bridge.py -v`
+Expected: PASS (3 passed). If `npx tsx` errors on resolution, confirm Step 1 ran (`packages/viewer/node_modules/dagre` exists and `scripts/experiments/gepa-flowchart/node_modules/.bin/tsx` exists).
+
+- [ ] **Step 7: Commit**
+
+```bash
+git add scripts/experiments/gepa-flowchart/
+git commit -m "feat(gepa): geometry bridge reusing TS validateContent + geometryReport"
+```
+
+---
+
+### Task 3: Anthropic LLM wrappers + reflection callable
+
+**Files:**
+- Create: `scripts/experiments/gepa-flowchart/gepa_flowchart/llm.py`
+- Test: `scripts/experiments/gepa-flowchart/tests/test_llm.py`
+
+**Interfaces:**
+- Produces:
+  - `complete(system: str, user: str, *, model: str, effort: str = "medium", max_tokens: int = 4096, client=None) -> str` — single-turn Claude call; returns concatenated text blocks.
+  - `make_reflection_callable(model: str, *, client=None) -> Callable[[str], str]` — wraps `complete` as the `(prompt) -> str` callable GEPA's `reflection_lm` expects.
+  - `get_client()` — lazily constructs and caches an `anthropic.Anthropic()`.
+
+- [ ] **Step 1: Write the failing test**
+
+`tests/test_llm.py`:
+```python
+from gepa_flowchart import llm
+
+
+class FakeMessages:
+    def __init__(self, recorder):
+        self.recorder = recorder
+
+    def create(self, **kwargs):
+        self.recorder.update(kwargs)
+        block = type("Block", (), {"type": "text", "text": "hello world"})()
+        return type("Resp", (), {"content": [block]})()
+
+
+class FakeClient:
+    def __init__(self):
+        self.calls = {}
+        self.messages = FakeMessages(self.calls)
+
+
+def test_complete_extracts_text_and_passes_params():
+    fake = FakeClient()
+    out = llm.complete("sys", "usr", model="claude-opus-4-8", effort="high", client=fake)
+    assert out == "hello world"
+    assert fake.calls["model"] == "claude-opus-4-8"
+    assert fake.calls["output_config"] == {"effort": "high"}
+    assert "temperature" not in fake.calls  # must never be sent
+    assert fake.calls["system"] == "sys"
+    assert fake.calls["messages"] == [{"role": "user", "content": "usr"}]
+
+
+def test_reflection_callable_returns_str():
+    fake = FakeClient()
+    fn = llm.make_reflection_callable("claude-opus-4-8", client=fake)
+    assert fn("reflect on this") == "hello world"
+```
+
+- [ ] **Step 2: Run test to verify it fails**
+
+Run: `cd scripts/experiments/gepa-flowchart && python -m pytest tests/test_llm.py -v`
+Expected: FAIL with `ImportError: cannot import name 'llm' from 'gepa_flowchart'` (the module does not exist yet)
+
+- [ ] **Step 3: Write minimal implementation**
+
+`gepa_flowchart/llm.py`:
+```python
+from __future__ import annotations
+
+from typing import Callable
+
+_client = None
+
+
+def get_client():
+    global _client
+    if _client is None:
+        import anthropic
+        _client = anthropic.Anthropic()
+    return _client
+
+
+def complete(
+    system: str,
+    user: str,
+    *,
+    model: str,
+    effort: str = "medium",
+    max_tokens: int = 4096,
+    client=None,
+) -> str:
+    client = client or get_client()
+    resp = client.messages.create(
+        model=model,
+        max_tokens=max_tokens,
+        system=system,
+        output_config={"effort": effort},
+        messages=[{"role": "user", "content": user}],
+    )
+    return "".join(b.text for b in resp.content if getattr(b, "type", None) == "text")
+
+
+def make_reflection_callable(model: str, *, client=None) -> Callable[[str], str]:
+    def reflect(prompt: str) -> str:
+        return complete(
+            "You are an expert prompt engineer improving prompts based on feedback.",
+            prompt,
+            model=model,
+            effort="high",
+            max_tokens=8000,
+            client=client,
+        )
+
+    return reflect
+```
+
+- [ ] **Step 4: Run test to verify it passes**
+
+Run: `cd scripts/experiments/gepa-flowchart && python -m pytest tests/test_llm.py -v`
+Expected: PASS (2 passed)
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add scripts/experiments/gepa-flowchart/
+git commit -m "feat(gepa): Anthropic SDK wrappers + reflection callable"
+```
+
+---
+
+### Task 4: Board serializer + JSON extractor
+
+**Files:**
+- Create: `scripts/experiments/gepa-flowchart/gepa_flowchart/render.py`
+- Test: `scripts/experiments/gepa-flowchart/tests/test_render.py`
+
+**Interfaces:**
+- Produces:
+  - `board_to_text(flow: dict) -> str` — renders a flow spec as structured, reader-facing text (direction, groups, nodes-by-group with labels/status, edges with labels).
+  - `extract_json(text: str) -> str | None` — pulls the first JSON object out of an LLM response (strips ```json fences; falls back to first balanced `{...}`).
+
+- [ ] **Step 1: Write the failing test**
+
+`tests/test_render.py`:
+```python
+import json
+
+from gepa_flowchart.render import board_to_text, extract_json
+
+
+def test_board_to_text_includes_labels_and_edges():
+    flow = {
+        "direction": "TB",
+        "groups": [{"id": "g1", "label": "Frontend"}],
+        "nodes": [
+            {"id": "a", "group": "g1", "data": {"label": "Login form"}},
+            {"id": "b", "data": {"label": "Auth service", "status": "active"}},
+        ],
+        "edges": [{"source": "a", "target": "b", "data": {"label": "submit"}}],
+    }
+    text = board_to_text(flow)
+    assert "Login form" in text
+    assert "Auth service" in text
+    assert "Frontend" in text
+    assert "submit" in text
+
+
+def test_extract_json_from_fenced():
+    raw = 'Here:\n```json\n{"nodes": [], "edges": []}\n```\nDone.'
+    out = extract_json(raw)
+    assert json.loads(out) == {"nodes": [], "edges": []}
+
+
+def test_extract_json_bare():
+    out = extract_json('prefix {"x": 1} suffix')
+    assert json.loads(out) == {"x": 1}
+
+
+def test_extract_json_none_when_absent():
+    assert extract_json("no json here") is None
+```
+
+- [ ] **Step 2: Run test to verify it fails**
+
+Run: `cd scripts/experiments/gepa-flowchart && python -m pytest tests/test_render.py -v`
+Expected: FAIL — `ModuleNotFoundError: No module named 'gepa_flowchart.render'`
+
+- [ ] **Step 3: Write minimal implementation**
+
+`gepa_flowchart/render.py`:
+```python
+from __future__ import annotations
+
+
+def board_to_text(flow: dict) -> str:
+    lines: list[str] = []
+    direction = flow.get("direction", "TB")
+    lines.append(f"Flowchart (direction: {direction})")
+    groups = {g["id"]: g for g in flow.get("groups", []) if isinstance(g, dict) and "id" in g}
+    nodes = [n for n in flow.get("nodes", []) if isinstance(n, dict)]
+
+    lines.append("\nNodes:")
+    for n in nodes:
+        nid = n.get("id", "?")
+        data = n.get("data", {}) if isinstance(n.get("data"), dict) else {}
+        label = data.get("label", "(no label)")
+        status = data.get("status")
+        grp = groups.get(n.get("group", ""), {}).get("label")
+        suffix = []
+        if grp:
+            suffix.append(f"group: {grp}")
+        if status:
+            suffix.append(f"status: {status}")
+        extra = f" [{', '.join(suffix)}]" if suffix else ""
+        lines.append(f"  - {nid}: {label}{extra}")
+
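+    # Guard against non-dict "data" the same way the node loop above does.
+    label_by_id = {
+        n.get("id"): (n["data"] if isinstance(n.get("data"), dict) else {}).get("label", n.get("id"))
+        for n in nodes
+    }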
+    lines.append("\nConnections:")
+    for e in flow.get("edges", []):
+        if not isinstance(e, dict):
+            continue
+        src = label_by_id.get(e.get("source"), e.get("source"))
+        tgt = label_by_id.get(e.get("target"), e.get("target"))
+        elabel = (e.get("data", {}) or {}).get("label") if isinstance(e.get("data"), dict) else None
+        arrow = f' --[{elabel}]-->' if elabel else " -->"
+        lines.append(f"  - {src}{arrow} {tgt}")
+    return "\n".join(lines)
+
+
+def extract_json(text: str) -> str | None:
+    if "```" in text:
+        # odd-indexed chunks of the split are the fenced bodies;
+        # return the first one that holds a balanced JSON object
+        parts = text.split("```")
+        for chunk in parts[1::2]:
+            body = chunk
+            if body.lstrip().lower().startswith("json"):
+                body = body.lstrip()[4:]
+            body = body.strip()
+            if body.startswith("{"):
+                bal = _balanced(body)
+                if bal:
+                    return bal
+    start = text.find("{")
+    if start == -1:
+        return None
+    return _balanced(text[start:])
+
+
+def _balanced(s: str) -> str | None:
+    depth = 0
+    in_str = False
+    esc = False
+    for i, ch in enumerate(s):
+        if in_str:
+            if esc:
+                esc = False
+            elif ch == "\\":
+                esc = True
+            elif ch == '"':
+                in_str = False
+            continue
+        if ch == '"':
+            in_str = True
+        elif ch == "{":
+            depth += 1
+        elif ch == "}":
+            depth -= 1
+            if depth == 0:
+                return s[: i + 1]
+    return None
+```
+
+- [ ] **Step 4: Run test to verify it passes**
+
+Run: `cd scripts/experiments/gepa-flowchart && python -m pytest tests/test_render.py -v`
+Expected: PASS (4 passed)
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add scripts/experiments/gepa-flowchart/
+git commit -m "feat(gepa): board-to-text serializer + JSON extractor"
+```
+
+---
+
+### Task 5: Dataset + reader-question freezing
+
+**Files:**
+- Create: `scripts/experiments/gepa-flowchart/topics/topics.json`
+- Create: `scripts/experiments/gepa-flowchart/gepa_flowchart/dataset.py`
+- Test: `scripts/experiments/gepa-flowchart/tests/test_dataset.py`
+
+**Interfaces:**
+- Consumes: `llm.complete` signature (a callable matching `complete(system, user, *, model, ...) -> str` is injected as `complete_fn` for testability).
+- Produces:
+  - `Task` dataclass: `id: str`, `topic: str`, `audience: str`, `purpose: str`, `questions: list[str]` (empty until frozen).
+  - `load_topics(path: str) -> list[Task]`.
+  - `generate_questions(task: Task, *, model: str, n: int, complete_fn) -> list[str]`.
+  - `freeze_questions(tasks: list[Task], run_dir: str, *, model: str, n: int, complete_fn) -> list[Task]` — writes `<run_dir>/frozen_questions.json` and returns tasks with `questions` populated.
+  - `load_frozen(tasks: list[Task], run_dir: str) -> list[Task]`.
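+
+A minimal sketch of the injected-callable shape, for orientation only. The `CompleteFn` protocol name and the `canned` helper are hypothetical and not part of this plan; the only thing taken from the interface above is the `complete(system, user, *, model, ...) -> str` calling convention.
+
+```python
+from typing import Protocol
+
+
+class CompleteFn(Protocol):
+    """Hypothetical name for the shape of the injected completion callable."""
+
+    def __call__(self, system: str, user: str, *, model: str, **kwargs: object) -> str: ...
+
+
+def canned(reply: str) -> CompleteFn:
+    """Build a test double that ignores its inputs and always returns `reply`."""
+
+    def _complete(system: str, user: str, *, model: str, **kwargs: object) -> str:
+        return reply
+
+    return _complete
+
+
+fake = canned('["Q1?", "Q2?"]')
+reply = fake("system prompt", "user prompt", model="m", effort="medium", max_tokens=1000)
+```
+
+Because the functions below depend only on this calling convention, any callable of this shape can stand in for the real client in tests.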
+
+- [ ] **Step 1: Write the topics dataset (data, not code)**
+
+`topics/topics.json` (12 entries):
+```json
+[
+  {"id": "ci-cd", "topic": "CI/CD pipeline for a web app", "audience": "an engineer new to the team", "purpose": "understand how code reaches production and what can go wrong"},
+  {"id": "user-auth", "topic": "User authentication and session flow", "audience": "a backend developer", "purpose": "understand login, token issuance, and refresh"},
+  {"id": "order-fulfillment", "topic": "E-commerce order fulfillment", "audience": "an operations analyst", "purpose": "trace an order from checkout to delivery"},
+  {"id": "incident-response", "topic": "On-call incident response process", "audience": "a new on-call engineer", "purpose": "know what to do when paged"},
+  {"id": "data-pipeline", "topic": "Batch ETL data pipeline", "audience": "a data engineer", "purpose": "understand ingestion, transform, and load stages"},
+  {"id": "pr-review", "topic": "Pull request review and merge process", "audience": "a contributor", "purpose": "know the path from PR open to merge"},
+  {"id": "support-triage", "topic": "Customer support ticket triage", "audience": "a support agent", "purpose": "route and escalate tickets correctly"},
+  {"id": "payment-processing", "topic": "Online payment processing and retries", "audience": "a fintech engineer", "purpose": "understand auth, capture, and failure handling"},
+  {"id": "onboarding", "topic": "New employee onboarding workflow", "audience": "an HR coordinator", "purpose": "track steps from offer to first day"},
+  {"id": "state-machine", "topic": "Document approval state machine", "audience": "a product manager", "purpose": "understand statuses and transitions"},
+  {"id": "k8s-deploy", "topic": "Kubernetes rolling deployment", "audience": "a platform engineer", "purpose": "understand how a new version rolls out safely"},
+  {"id": "signup-funnel", "topic": "SaaS signup and activation funnel", "audience": "a growth analyst", "purpose": "see where users drop off between signup and activation"}
+]
+```
+
+- [ ] **Step 2: Write the failing test**
+
+`tests/test_dataset.py`:
+```python
+import json
+from pathlib import Path
+
+from gepa_flowchart.dataset import (
+    Task,
+    load_topics,
+    generate_questions,
+    freeze_questions,
+    load_frozen,
+)
+
+TOPICS = str(Path(__file__).resolve().parent.parent / "topics" / "topics.json")
+
+
+def fake_complete(system, user, *, model, **kw):
+    # Return a JSON array of questions, regardless of input.
+    return '["Q1?", "Q2?", "Q3?"]'
+
+
+def test_load_topics():
+    tasks = load_topics(TOPICS)
+    assert len(tasks) == 12
+    assert all(isinstance(t, Task) and t.id and t.topic for t in tasks)
+
+
+def test_generate_questions_parses_list():
+    t = Task(id="x", topic="T", audience="A", purpose="P")
+    qs = generate_questions(t, model="m", n=3, complete_fn=fake_complete)
+    assert qs == ["Q1?", "Q2?", "Q3?"]
+
+
+def test_freeze_and_reload(tmp_path):
+    tasks = load_topics(TOPICS)[:2]
+    frozen = freeze_questions(tasks, str(tmp_path), model="m", n=3, complete_fn=fake_complete)
+    assert all(t.questions for t in frozen)
+    assert (tmp_path / "frozen_questions.json").exists()
+    reloaded = load_frozen(load_topics(TOPICS)[:2], str(tmp_path))
+    assert reloaded[0].questions == frozen[0].questions
+```
+
+- [ ] **Step 3: Run test to verify it fails**
+
+Run: `cd scripts/experiments/gepa-flowchart && python -m pytest tests/test_dataset.py -v`
+Expected: FAIL — `ModuleNotFoundError: No module named 'gepa_flowchart.dataset'`
+
+- [ ] **Step 4: Write minimal implementation**
+
+`gepa_flowchart/dataset.py`:
+```python
+from __future__ import annotations
+
+import json
+from dataclasses import dataclass, field
+from pathlib import Path
+
+from .render import extract_json
+
+
+@dataclass
+class Task:
+    id: str
+    topic: str
+    audience: str
+    purpose: str
+    questions: list[str] = field(default_factory=list)
+
+
+def load_topics(path: str) -> list[Task]:
+    raw = json.loads(Path(path).read_text())
+    return [Task(id=r["id"], topic=r["topic"], audience=r["audience"], purpose=r["purpose"]) for r in raw]
+
+
+_QGEN_SYSTEM = (
+    "You write the questions a first-time reader of a diagram would need answered "
+    "to actually understand the subject. Output ONLY a JSON array of concise question strings."
+)
+
+
+def generate_questions(task: Task, *, model: str, n: int, complete_fn) -> list[str]:
+    user = (
+        f"Topic: {task.topic}\nAudience: {task.audience}\nPurpose: {task.purpose}\n\n"
+        f"List exactly {n} distinct questions this reader should be able to answer "
+        f"from a good flowchart on this topic. JSON array of strings only."
+    )
+    out = complete_fn(_QGEN_SYSTEM, user, model=model, effort="medium", max_tokens=1000)
+    blob = extract_array(out)
+    qs = json.loads(blob) if blob else []
+    return [str(q) for q in qs][:n]
+
+
+def extract_array(text: str) -> str | None:
+    start = text.find("[")
+    end = text.rfind("]")
+    if start == -1 or end == -1 or end < start:
+        return None
+    return text[start : end + 1]
+
+
+def freeze_questions(tasks: list[Task], run_dir: str, *, model: str, n: int, complete_fn) -> list[Task]:
+    Path(run_dir).mkdir(parents=True, exist_ok=True)
+    frozen: dict[str, list[str]] = {}
+    for t in tasks:
+        t.questions = generate_questions(t, model=model, n=n, complete_fn=complete_fn)
+        frozen[t.id] = t.questions
+    (Path(run_dir) / "frozen_questions.json").write_text(json.dumps(frozen, indent=2))
+    return tasks
+
+
+def load_frozen(tasks: list[Task], run_dir: str) -> list[Task]:
+    frozen = json.loads((Path(run_dir) / "frozen_questions.json").read_text())
+    for t in tasks:
+        t.questions = frozen.get(t.id, [])
+    return tasks
+```
+
+- [ ] **Step 5: Run test to verify it passes**
+
+Run: `cd scripts/experiments/gepa-flowchart && python -m pytest tests/test_dataset.py -v`
+Expected: PASS (3 passed)
+
+- [ ] **Step 6: Commit**
+
+```bash
+git add scripts/experiments/gepa-flowchart/
+git commit -m "feat(gepa): topic dataset + reader-question generation/freezing"
+```
+
+---
+
+### Task 6: Seed prompts + pipeline
+
+**Files:**
+- Create: `scripts/experiments/gepa-flowchart/gepa_flowchart/seed_prompts.py`
+- Create: `scripts/experiments/gepa-flowchart/gepa_flowchart/pipeline.py`
+- Test: `scripts/experiments/gepa-flowchart/tests/test_pipeline.py`
+
+**Interfaces:**
+- Consumes: `Task` (Task 5), `extract_json` (Task 4), a `complete_fn` callable.
+- Produces:
+  - `SEED_CANDIDATE: dict[str, str]` with keys `"brainstorm"` and `"generate"`.
+  - `SKILL_CONTEXT: str` (flow schema + an example spec).
+  - `run_pipeline(candidate: dict, task: Task, *, model: str, complete_fn, max_tokens: int) -> tuple[str | None, dict]` returning `(flow_json_str_or_None, trace)`. `trace` has keys `brainstorm_input`, `plan`, `generate_input`, `raw_generation`.
+
+- [ ] **Step 1: Write the failing test**
+
+`tests/test_pipeline.py`:
+```python
+import json
+
+from gepa_flowchart.dataset import Task
+from gepa_flowchart.pipeline import run_pipeline, SEED_CANDIDATE
+
+
+def test_seed_candidate_has_two_components():
+    assert set(SEED_CANDIDATE.keys()) == {"brainstorm", "generate"}
+    assert "{topic}" in SEED_CANDIDATE["brainstorm"]
+    assert "{plan}" in SEED_CANDIDATE["generate"]
+
+
+def test_run_pipeline_threads_plan_into_generation():
+    calls = []
+    flow = {"direction": "TB", "nodes": [{"id": "a", "data": {"label": "A"}}], "edges": []}
+
+    def fake_complete(system, user, *, model, **kw):
+        calls.append(user)
+        if len(calls) == 1:
+            return "PLAN: show A then B"
+        return f"```json\n{json.dumps(flow)}\n```"
+
+    task = Task(id="x", topic="Topic T", audience="Aud", purpose="Pur")
+    out, trace = run_pipeline(SEED_CANDIDATE, task, model="m", complete_fn=fake_complete, max_tokens=4000)
+    assert json.loads(out) == flow
+    assert trace["plan"] == "PLAN: show A then B"
+    # the plan must be threaded into the generate step's prompt
+    assert "PLAN: show A then B" in calls[1]
+    assert "Topic T" in calls[0]
+```
+
+- [ ] **Step 2: Run test to verify it fails**
+
+Run: `cd scripts/experiments/gepa-flowchart && python -m pytest tests/test_pipeline.py -v`
+Expected: FAIL — `ModuleNotFoundError: No module named 'gepa_flowchart.pipeline'`
+
+- [ ] **Step 3: Write the seed prompts**
+
+`gepa_flowchart/seed_prompts.py`:
+```python
+from __future__ import annotations
+
+from pathlib import Path
+
+_FLOW_SCHEMA = """A termchart `flow` spec is JSON:
+{
+  "direction": "TB" | "LR" | "BT" | "RL",   // default TB; prefer TB for processes
+  "nodes": [{ "id": "x", "data": { "label": "...", "status": "active|info|warn|success|neutral" }, "group": "g1?" }],
+  "edges": [{ "source": "x", "target": "y", "data": { "label": "when/condition?" } }],
+  "groups": [{ "id": "g1", "label": "Lane/Zone name", "color": "#hex?" }]
+}
+Rules: every edge source/target must be an existing node id; keep it under ~24 nodes; labels carry the meaning."""
+
+
+def _example() -> str:
+    candidates = [
+        "call-hierarchy.flow.json",
+        "okr-tree.flow.json",
+        "binary-search.flow.json",
+    ]
+    # parents[3] is scripts/; parents[4] is the repo root (assuming plugin/ sits at the top level).
+    base = Path(__file__).resolve().parents[4] / "plugin" / "skills" / "diagram-recipes" / "examples"
+    for name in candidates:
+        p = base / name
+        if p.exists():
+            return p.read_text()
+    return '{"direction":"TB","nodes":[{"id":"a","data":{"label":"Start"}}],"edges":[]}'
+
+
+SKILL_CONTEXT = f"{_FLOW_SCHEMA}\n\nExample of a well-formed flow spec:\n{_example()}"
+
+SEED_CANDIDATE = {
+    "brainstorm": (
+        "You are planning a flowchart.\n"
+        "Topic: {topic}\nAudience: {audience}\nPurpose: {purpose}\n\n"
+        "Decide what the flowchart must show so this reader can understand the subject. "
+        "List the key steps/states, the decisions and what triggers each branch, and the "
+        "context a newcomer needs. Output a concise plan in plain text."
+    ),
+    "generate": (
+        "You generate a termchart `flow` diagram as JSON.\n\n"
+        "{skill_context}\n\n"
+        "Topic: {topic}\nAudience: {audience}\nPurpose: {purpose}\n\n"
+        "Plan to follow:\n{plan}\n\n"
+        "Produce a single flow JSON object. Use clear, specific labels; label edges with the "
+        "condition/trigger; group related nodes. Output ONLY the JSON."
+    ),
+}
+```
+
+- [ ] **Step 4: Write the pipeline**
+
+`gepa_flowchart/pipeline.py`:
+```python
+from __future__ import annotations
+
+from .dataset import Task
+from .render import extract_json
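+# SEED_CANDIDATE is re-exported here: the tests, adapter tests, and run.py import it from .pipeline.
+from .seed_prompts import SEED_CANDIDATE, SKILL_CONTEXT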
+
+
+def run_pipeline(candidate: dict, task: Task, *, model: str, complete_fn, max_tokens: int):
+    trace: dict = {}
+    brainstorm_user = candidate["brainstorm"].format(
+        topic=task.topic, audience=task.audience, purpose=task.purpose
+    )
+    trace["brainstorm_input"] = brainstorm_user
+    plan = complete_fn(
+        "You are a diagram planner.", brainstorm_user, model=model, effort="medium", max_tokens=2000
+    )
+    trace["plan"] = plan
+
+    generate_user = candidate["generate"].format(
+        skill_context=SKILL_CONTEXT,
+        topic=task.topic,
+        audience=task.audience,
+        purpose=task.purpose,
+        plan=plan,
+    )
+    trace["generate_input"] = generate_user
+    raw = complete_fn(
+        "You output only valid JSON.", generate_user, model=model, effort="medium", max_tokens=max_tokens
+    )
+    trace["raw_generation"] = raw
+    return extract_json(raw), trace
+```
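+
+`run_pipeline` fills both templates with `str.format`, so a GEPA-evolved prompt that contains literal braces (for example an inline JSON snippet) would raise at format time. One possible mitigation, sketched here under the assumption that only named placeholders such as `{topic}` and `{plan}` need substituting, is a regex-based fill. `fill_template` is a hypothetical helper, not part of this plan.
+
+```python
+import re
+
+
+def fill_template(template: str, **values: str) -> str:
+    """Substitute only the known {name} placeholders; leave every other brace alone."""
+    if not values:
+        return template
+    # Match exactly the placeholder names we were given, e.g. \{(topic|plan)\}.
+    pattern = re.compile(r"\{(" + "|".join(map(re.escape, values)) + r")\}")
+    # re.sub makes a single pass, so braces inside substituted values are not re-scanned.
+    return pattern.sub(lambda m: values[m.group(1)], template)
+
+
+# An evolved prompt with a literal JSON snippet; str.format would raise on it.
+evolved = 'Topic: {topic}\nReturn JSON shaped like {"nodes": [], "edges": []}.'
+filled = fill_template(evolved, topic="CI/CD pipeline")
+```
+
+The single-pass substitution also means a plan that happens to contain the text `{topic}` is inserted verbatim rather than expanded a second time.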
+
+- [ ] **Step 5: Run test to verify it passes**
+
+Run: `cd scripts/experiments/gepa-flowchart && python -m pytest tests/test_pipeline.py -v`
+Expected: PASS (2 passed)
+
+- [ ] **Step 6: Commit**
+
+```bash
+git add scripts/experiments/gepa-flowchart/
+git commit -m "feat(gepa): seed brainstorm/generate prompts + pipeline"
+```
+
+---
+
+### Task 7: Metric (gate + geometry + comprehension + feedback)
+
+**Files:**
+- Create: `scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py`
+- Test: `scripts/experiments/gepa-flowchart/tests/test_metric.py`
+
+**Interfaces:**
+- Consumes: `Config` (Task 1), `Task` and `extract_array` (Task 5), `board_to_text` (Task 4), a `validate_fn` (matches `validate_flow`), a `complete_fn`.
+- Produces:
+  - `geometry_score(findings: list[dict], cfg: Config) -> tuple[float, str]`.
+  - `comprehension_score(board_text: str, questions: list[str], *, reader_model: str, judge_model: str, complete_fn) -> tuple[float, str]`.
+  - `score_board(content: str | None, task: Task, cfg: Config, *, validate_fn, complete_fn) -> ScoreResult` where `ScoreResult` is a dataclass `score: float`, `feedback: str`, `comp: float`, `geom: float`, `valid: bool`.
+
+- [ ] **Step 1: Write the failing test**
+
+`tests/test_metric.py`:
+```python
+import json
+
+from gepa_flowchart.config import Config
+from gepa_flowchart.dataset import Task
+from gepa_flowchart.metric import geometry_score, score_board
+
+CFG = Config.from_env({})
+TASK = Task(id="x", topic="T", audience="A", purpose="P", questions=["Q1?", "Q2?"])
+GOOD_FLOW = json.dumps({"direction": "TB", "nodes": [{"id": "a", "data": {"label": "Start"}}], "edges": []})
+
+
+def test_geometry_score_penalizes_errors():
+    clean, _ = geometry_score([], CFG)
+    assert clean == 1.0
+    err, msg = geometry_score([{"severity": "error", "code": "edge-over-node", "message": "x", "count": 1}], CFG)
+    assert err < 1.0 and "edge-over-node" in msg
+
+
+def test_invalid_board_scores_zero():
+    def validate_fn(content):
+        return {"valid": False, "error": "bad json", "findings": [], "warnings": []}
+
+    res = score_board("{bad", TASK, CFG, validate_fn=validate_fn, complete_fn=lambda *a, **k: "")
+    assert res.score == 0.0 and res.valid is False
+    assert "bad json" in res.feedback
+
+
+def test_missing_context_lowers_comp_and_is_fed_back():
+    def validate_fn(content):
+        return {"valid": True, "error": None, "findings": [], "warnings": []}
+
+    # reader answers; judge marks Q2 unsupported
+    def complete_fn(system, user, *, model, **kw):
+        if "first time" in system.lower():  # reader
+            return json.dumps([{"q": "Q1?", "a": "Start the process"}, {"q": "Q2?", "a": "not shown"}])
+        # judge
+        return json.dumps([
+            {"q": "Q1?", "score": 1.0, "supported": True, "reason": "shown"},
+            {"q": "Q2?", "score": 0.0, "supported": False, "reason": "board never shows the failure path"},
+        ])
+
+    res = score_board(GOOD_FLOW, TASK, CFG, validate_fn=validate_fn, complete_fn=complete_fn)
+    assert 0.0 < res.comp < 1.0
+    assert res.geom == 1.0
+    assert "failure path" in res.feedback
+```
+
+- [ ] **Step 2: Run test to verify it fails**
+
+Run: `cd scripts/experiments/gepa-flowchart && python -m pytest tests/test_metric.py -v`
+Expected: FAIL — `ModuleNotFoundError: No module named 'gepa_flowchart.metric'`
+
+- [ ] **Step 3: Write minimal implementation**
+
+`gepa_flowchart/metric.py`:
+```python
+from __future__ import annotations
+
+import json
+from dataclasses import dataclass
+
+from .config import Config
+from .dataset import Task, extract_array
+from .render import board_to_text
+
+
+@dataclass
+class ScoreResult:
+    score: float
+    feedback: str
+    comp: float
+    geom: float
+    valid: bool
+
+
+def geometry_score(findings: list[dict], cfg: Config) -> tuple[float, str]:
+    errs = [f for f in findings if f.get("severity") == "error"]
+    warns = [f for f in findings if f.get("severity") == "warning"]
+    score = 1.0 - cfg.geom_error_penalty * len(errs) - cfg.geom_warning_penalty * len(warns)
+    score = max(0.0, min(1.0, score))
+    if not findings:
+        return score, "Geometry: clean (no findings)."
+    msgs = "; ".join(f"[{f.get('severity')}] {f.get('code')}: {f.get('message')}" for f in findings)
+    return score, f"Geometry findings: {msgs}"
+
+
+_READER_SYSTEM = (
+    "You are seeing this flowchart for the FIRST TIME and know nothing else about it. "
+    "Answer each question using ONLY what the board shows. If the board does not tell you, "
+    "answer exactly 'not shown'. Output ONLY a JSON array of {\"q\": question, \"a\": answer}."
+)
+
+_JUDGE_SYSTEM = (
+    "You grade how well a first-time reader's answers are supported by a flowchart. For each "
+    "question score 0..1 for correctness AND whether the board actually supports the answer. "
+    "An answer of 'not shown' or one not supported by the board scores low and is a context gap. "
+    "Output ONLY a JSON array of {\"q\", \"score\", \"supported\" (bool), \"reason\"}."
+)
+
+
+def comprehension_score(board_text: str, questions: list[str], *, reader_model: str, judge_model: str, complete_fn) -> tuple[float, str]:
+    qlist = "\n".join(f"- {q}" for q in questions)
+    reader_out = complete_fn(
+        _READER_SYSTEM,
+        f"BOARD:\n{board_text}\n\nQUESTIONS:\n{qlist}",
+        model=reader_model,
+        effort="low",
+        max_tokens=1500,
+    )
+    judge_out = complete_fn(
+        _JUDGE_SYSTEM,
+        f"BOARD:\n{board_text}\n\nQUESTIONS:\n{qlist}\n\nREADER ANSWERS:\n{reader_out}",
+        model=judge_model,
+        effort="high",
+        max_tokens=1500,
+    )
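+    blob = extract_array(judge_out)
+    try:
+        rows = json.loads(blob) if blob else []
+    except json.JSONDecodeError:
+        rows = []  # malformed judge output falls through to the branch below
+    rows = [r for r in rows if isinstance(r, dict)]
+    if not rows:
+        return 0.0, "Comprehension: judge produced no parseable scores."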
+    scores = [float(r.get("score", 0.0)) for r in rows]
+    comp = sum(scores) / len(scores)
+    gaps = [r for r in rows if not r.get("supported", False) or float(r.get("score", 0.0)) < 0.5]
+    if gaps:
+        gap_txt = "; ".join(f"'{r.get('q')}' — {r.get('reason')}" for r in gaps)
+        fb = f"Comprehension {comp:.2f}. Reader could not answer (add detail/context): {gap_txt}"
+    else:
+        fb = f"Comprehension {comp:.2f}. Reader answered all questions from the board."
+    return comp, fb
+
+
+def score_board(content, task: Task, cfg: Config, *, validate_fn, complete_fn) -> ScoreResult:
+    if not content:
+        return ScoreResult(0.0, "Generation produced no JSON object.", 0.0, 0.0, False)
+    report = validate_fn(content)
+    if not report.get("valid"):
+        return ScoreResult(0.0, f"Invalid flow spec: {report.get('error')}", 0.0, 0.0, False)
+
+    geom, geom_fb = geometry_score(report.get("findings", []), cfg)
+    board_text = board_to_text(json.loads(content))
+    comp, comp_fb = comprehension_score(
+        board_text, task.questions, reader_model=cfg.reader_model, judge_model=cfg.judge_model, complete_fn=complete_fn
+    )
+    total = cfg.w_comp * comp + cfg.w_geom * geom
+    feedback = f"{comp_fb}\n{geom_fb}"
+    return ScoreResult(total, feedback, comp, geom, True)
+```
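+
+A worked sketch of how the three parts combine, mirroring `score_board` and `geometry_score` above. The numeric weights and penalties are hypothetical placeholders chosen for the arithmetic; the real values come from `Config` (`w_comp`, `w_geom`, `geom_error_penalty`, `geom_warning_penalty`).
+
+```python
+def combine(comp: float, n_errors: int, n_warnings: int, valid: bool = True) -> float:
+    """Validity gate first, then a weighted sum of comprehension and geometry."""
+    # Hypothetical values for illustration only; the plan reads these from Config.
+    w_comp, w_geom = 0.7, 0.3
+    error_penalty, warning_penalty = 0.25, 0.05
+    if not valid:
+        return 0.0  # an invalid board scores zero regardless of anything else
+    geom = 1.0 - error_penalty * n_errors - warning_penalty * n_warnings
+    geom = max(0.0, min(1.0, geom))  # clamp, as geometry_score does
+    return w_comp * comp + w_geom * geom
+
+
+half_understood_one_error = combine(0.5, n_errors=1, n_warnings=0)
+perfect_but_invalid = combine(1.0, n_errors=0, n_warnings=0, valid=False)
+understood_but_messy = combine(1.0, n_errors=10, n_warnings=0)
+```
+
+With these placeholder numbers, one geometry error costs 0.25 of the geometry term, and the clamp means a badly tangled board can lose at most the whole geometry weight while keeping its comprehension credit.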
+
+- [ ] **Step 4: Run test to verify it passes**
+
+Run: `cd scripts/experiments/gepa-flowchart && python -m pytest tests/test_metric.py -v`
+Expected: PASS (3 passed)
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add scripts/experiments/gepa-flowchart/
+git commit -m "feat(gepa): three-part metric (validity gate, geometry, comprehension)"
+```
+
+---
+
+### Task 8: GEPA adapter
+
+**Files:**
+- Create: `scripts/experiments/gepa-flowchart/gepa_flowchart/adapter.py`
+- Test: `scripts/experiments/gepa-flowchart/tests/test_adapter.py`
+
+**Interfaces:**
+- Consumes: `Config`, `Task`, `run_pipeline` (Task 6), `score_board`/`ScoreResult` (Task 7), `validate_flow` (Task 2), `complete` (Task 3), and `EvaluationBatch`/`GEPAAdapter` from `gepa.core.adapter`.
+- Produces: `FlowchartAdapter` implementing `evaluate(batch, candidate, capture_traces=False) -> EvaluationBatch` and `make_reflective_dataset(candidate, eval_batch, components_to_update) -> dict[str, list[dict]]`. Constructor: `FlowchartAdapter(cfg: Config, *, validate_fn=validate_flow, complete_fn=complete)`.
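+
+To make the `dict[str, list[dict]]` return shape concrete, here is a small sketch of the structure only. `make_record` and `build_reflective` are hypothetical helpers and the example values are illustrative; in the adapter each component gets its own inputs, drawn from captured trajectories.
+
+```python
+def make_record(inputs: str, generated: str, feedback: str) -> dict[str, str]:
+    """Build one reflective example with the three keys the adapter emits."""
+    return {"Inputs": inputs, "Generated Outputs": generated, "Feedback": feedback}
+
+
+def build_reflective(
+    components: list[str], examples: list[tuple[str, str, str]]
+) -> dict[str, list[dict[str, str]]]:
+    """Give every requested component its own list of records."""
+    return {c: [make_record(*ex) for ex in examples] for c in components}
+
+
+# Illustrative values only; real records come from evaluate(..., capture_traces=True).
+dataset = build_reflective(
+    ["generate"],
+    [("Plan:\nshow A then B", '{"nodes": [], "edges": []}', "Geometry: clean (no findings).")],
+)
+```
+
+Only the components named in `components_to_update` appear as keys, which is what the second test below checks.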
+
+- [ ] **Step 1: Write the failing test**
+
+`tests/test_adapter.py`:
+```python
+import json
+
+from gepa_flowchart.config import Config
+from gepa_flowchart.dataset import Task
+from gepa_flowchart.adapter import FlowchartAdapter
+from gepa_flowchart.pipeline import SEED_CANDIDATE
+
+CFG = Config.from_env({})
+FLOW = {"direction": "TB", "nodes": [{"id": "a", "data": {"label": "A"}}], "edges": []}
+
+
+def validate_fn(content):
+    return {"valid": True, "error": None, "findings": [], "warnings": []}
+
+
+def complete_fn(system, user, *, model, **kw):
+    if "planner" in system.lower():
+        return "plan"
+    if "only valid json" in system.lower():
+        return json.dumps(FLOW)
+    if "first time" in system.lower():
+        return json.dumps([{"q": "Q1?", "a": "A"}])
+    # judge
+    return json.dumps([{"q": "Q1?", "score": 1.0, "supported": True, "reason": "ok"}])
+
+
+def test_evaluate_returns_scores_and_traces():
+    adapter = FlowchartAdapter(CFG, validate_fn=validate_fn, complete_fn=complete_fn)
+    batch = [Task(id="x", topic="T", audience="A", purpose="P", questions=["Q1?"])]
+    out = adapter.evaluate(batch, SEED_CANDIDATE, capture_traces=True)
+    assert len(out.scores) == 1
+    assert out.scores[0] > 0.0
+    assert out.trajectories is not None and len(out.trajectories) == 1
+
+
+def test_make_reflective_dataset_has_requested_components():
+    adapter = FlowchartAdapter(CFG, validate_fn=validate_fn, complete_fn=complete_fn)
+    batch = [Task(id="x", topic="T", audience="A", purpose="P", questions=["Q1?"])]
+    ev = adapter.evaluate(batch, SEED_CANDIDATE, capture_traces=True)
+    refl = adapter.make_reflective_dataset(SEED_CANDIDATE, ev, ["brainstorm", "generate"])
+    assert set(refl.keys()) == {"brainstorm", "generate"}
+    assert refl["generate"] and "Feedback" in refl["generate"][0]
+```
+
+- [ ] **Step 2: Run test to verify it fails**
+
+Run: `cd scripts/experiments/gepa-flowchart && python -m pytest tests/test_adapter.py -v`
+Expected: FAIL — `ModuleNotFoundError: No module named 'gepa_flowchart.adapter'`
+
+- [ ] **Step 3: Write minimal implementation**
+
+`gepa_flowchart/adapter.py`:
+```python
+from __future__ import annotations
+
+from gepa.core.adapter import EvaluationBatch, GEPAAdapter
+
+from .config import Config
+from .geometry_bridge import validate_flow
+from .llm import complete
+from .metric import score_board
+from .pipeline import run_pipeline
+
+
+class FlowchartAdapter(GEPAAdapter):
+    def __init__(self, cfg: Config, *, validate_fn=validate_flow, complete_fn=complete):
+        self.cfg = cfg
+        self.validate_fn = validate_fn
+        self.complete_fn = complete_fn
+
+    def evaluate(self, batch, candidate, capture_traces=False):
+        outputs, scores = [], []
+        trajectories = [] if capture_traces else None
+        for task in batch:
+            content, trace = run_pipeline(
+                candidate, task, model=self.cfg.gen_model,
+                complete_fn=self.complete_fn, max_tokens=self.cfg.gen_max_tokens,
+            )
+            result = score_board(
+                content, task, self.cfg, validate_fn=self.validate_fn, complete_fn=self.complete_fn
+            )
+            outputs.append(content)
+            scores.append(result.score)
+            if capture_traces:
+                trajectories.append({"task": task, "trace": trace, "result": result, "output": content})
+        return EvaluationBatch(outputs=outputs, scores=scores, trajectories=trajectories)
+
+    def make_reflective_dataset(self, candidate, eval_batch, components_to_update):
+        out: dict[str, list[dict]] = {c: [] for c in components_to_update}
+        for traj in eval_batch.trajectories or []:
+            task = traj["task"]
+            result = traj["result"]
+            trace = traj["trace"]
+            shared_feedback = result.feedback
+            if "brainstorm" in out:
+                out["brainstorm"].append({
+                    "Inputs": f"Topic: {task.topic}\nAudience: {task.audience}\nPurpose: {task.purpose}",
+                    "Generated Outputs": trace.get("plan", ""),
+                    "Feedback": (
+                        "The plan feeds a downstream generator. The resulting board scored "
+                        f"{result.score:.2f}. " + shared_feedback
+                    ),
+                })
+            if "generate" in out:
+                out["generate"].append({
+                    "Inputs": f"Plan:\n{trace.get('plan','')}",
+                    "Generated Outputs": traj.get("output") or trace.get("raw_generation", ""),
+                    "Feedback": shared_feedback,
+                })
+        return out
+```
+
+- [ ] **Step 4: Run test to verify it passes**
+
+Run: `cd scripts/experiments/gepa-flowchart && python -m pytest tests/test_adapter.py -v`
+Expected: PASS (2 passed)
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add scripts/experiments/gepa-flowchart/
+git commit -m "feat(gepa): GEPAAdapter (evaluate + reflective dataset)"
+```
+
+---
+
+### Task 9: Run entrypoint + report
+
+**Files:**
+- Create: `scripts/experiments/gepa-flowchart/gepa_flowchart/report.py`
+- Create: `scripts/experiments/gepa-flowchart/gepa_flowchart/run.py`
+- Test: `scripts/experiments/gepa-flowchart/tests/test_report.py`
+- Test: `scripts/experiments/gepa-flowchart/tests/test_smoke.py`
+
+**Interfaces:**
+- Consumes: everything above; `gepa.optimize`, `GEPAResult`.
+- Produces:
+  - `render_report(seed_scores: list[float], best_scores: list[float], best_candidate: dict, val_ids: list[str]) -> str` (Markdown).
+  - `run.py` CLI: `--smoke`, `--max-metric-calls N`, `--run-dir PATH`, `--train N`. `main(argv=None) -> int`.
+
+- [ ] **Step 1: Write the failing test (report)**
+
+`tests/test_report.py`:
+```python
+from gepa_flowchart.report import render_report
+
+
+def test_report_shows_before_after():
+    md = render_report([0.4, 0.5], [0.7, 0.8], {"brainstorm": "b", "generate": "g"}, ["t1", "t2"])
+    assert "0.45" in md  # seed mean
+    assert "0.75" in md  # best mean
+    assert "brainstorm" in md and "generate" in md
+```
+
+- [ ] **Step 2: Run test to verify it fails**
+
+Run: `cd scripts/experiments/gepa-flowchart && python -m pytest tests/test_report.py -v`
+Expected: FAIL — `ModuleNotFoundError: No module named 'gepa_flowchart.report'`
+
+- [ ] **Step 3: Write report.py**
+
+`gepa_flowchart/report.py`:
+```python
+from __future__ import annotations
+
+import json
+
+
+def _mean(xs: list[float]) -> float:
+    return sum(xs) / len(xs) if xs else 0.0
+
+
+def render_report(seed_scores, best_scores, best_candidate, val_ids) -> str:
+    seed_m, best_m = _mean(seed_scores), _mean(best_scores)
+    lines = [
+        "# GEPA Flowchart Optimization — Report",
+        "",
+        f"- Val tasks: {len(val_ids)} ({', '.join(val_ids)})",
+        f"- Seed mean score: **{seed_m:.2f}**",
+        f"- Best mean score: **{best_m:.2f}**",
+        f"- Delta: **{best_m - seed_m:+.2f}**",
+        "",
+        "## Per-task (seed → best)",
+        "",
+        "| task | seed | best |",
+        "|---|---|---|",
+    ]
+    for tid, s, b in zip(val_ids, seed_scores, best_scores):
+        lines.append(f"| {tid} | {s:.2f} | {b:.2f} |")
+    lines += ["", "## Best prompts", "", "```json", json.dumps(best_candidate, indent=2), "```"]
+    return "\n".join(lines)
+```
+
+- [ ] **Step 4: Run report test to verify it passes**
+
+Run: `cd scripts/experiments/gepa-flowchart && python -m pytest tests/test_report.py -v`
+Expected: PASS (1 passed)
+
+- [ ] **Step 5: Write run.py**
+
+`gepa_flowchart/run.py`:
+```python
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from datetime import datetime, timezone
+from pathlib import Path
+
+import gepa
+
+from .adapter import FlowchartAdapter
+from .config import Config
+from .dataset import freeze_questions, load_topics
+from .llm import complete, make_reflection_callable
+from .metric import score_board
+from .geometry_bridge import validate_flow
+from .pipeline import SEED_CANDIDATE
+from .report import render_report
+
+_TOPICS = str(Path(__file__).resolve().parent.parent / "topics" / "topics.json")
+
+
+def _eval_scores(tasks, candidate, cfg) -> list[float]:
+    return [
+        score_board(
+            run_one(candidate, t, cfg), t, cfg, validate_fn=validate_flow, complete_fn=complete
+        ).score
+        for t in tasks
+    ]
+
+
+def run_one(candidate, task, cfg):
+    from .pipeline import run_pipeline
+    content, _ = run_pipeline(candidate, task, model=cfg.gen_model, complete_fn=complete, max_tokens=cfg.gen_max_tokens)
+    return content
+
+
+def main(argv=None) -> int:
+    p = argparse.ArgumentParser(description="GEPA flowchart prompt optimization")
+    p.add_argument("--smoke", action="store_true", help="1 topic, tiny budget")
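+    p.add_argument("--max-metric-calls", type=int, default=None, help="override the metric-call budget")
+    p.add_argument("--train", type=int, default=8, help="topics used for training; the rest are validation")
+    p.add_argument("--run-dir", default=None, help="output directory (default: runs/<UTC timestamp>)")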
+    args = p.parse_args(argv)
+
+    cfg = Config.from_env({})
+    budget = args.max_metric_calls or (6 if args.smoke else cfg.max_metric_calls)
+    run_dir = args.run_dir or str(Path("runs") / datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ"))
+    Path(run_dir).mkdir(parents=True, exist_ok=True)
+
+    topics = load_topics(_TOPICS)
+    if args.smoke:
+        train, val = topics[:1], topics[:1]
+    else:
+        train, val = topics[: args.train], topics[args.train :]
+
+    print(f"[gepa] run_dir={run_dir} budget={budget} train={len(train)} val={len(val)}", file=sys.stderr)
+    print(f"[gepa] models gen={cfg.gen_model} reader={cfg.reader_model} judge={cfg.judge_model} reflect={cfg.reflection_model}", file=sys.stderr)
+
+    # Freeze the reader-questions once per run (fair comparison), only for the topics in use.
+    all_tasks = freeze_questions(list({t.id: t for t in [*train, *val]}.values()), run_dir, model=cfg.judge_model, n=cfg.n_questions, complete_fn=complete)
+    by_id = {t.id: t for t in all_tasks}
+    train = [by_id[t.id] for t in train]
+    val = [by_id[t.id] for t in val]
+
+    seed_scores = _eval_scores(val, SEED_CANDIDATE, cfg)
+
+    adapter = FlowchartAdapter(cfg)
+    result = gepa.optimize(
+        seed_candidate=SEED_CANDIDATE,
+        trainset=train,
+        valset=val,
+        adapter=adapter,
+        reflection_lm=make_reflection_callable(cfg.reflection_model),
+        max_metric_calls=budget,
+        run_dir=run_dir,
+        display_progress_bar=True,
+    )
+    best = result.best_candidate
+    best_scores = _eval_scores(val, best, cfg)
+
+    (Path(run_dir) / "best_prompts.json").write_text(json.dumps(best, indent=2))
+    report = render_report(seed_scores, best_scores, best, [t.id for t in val])
+    (Path(run_dir) / "report.md").write_text(report)
+    print(report)
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
+```
+
+- [ ] **Step 6: Write the smoke test (gated on a real API key)**
+
+`tests/test_smoke.py`:
+```python
+import os
+
+import pytest
+
+from gepa_flowchart.run import main
+
+
+@pytest.mark.skipif(not os.environ.get("ANTHROPIC_API_KEY"), reason="needs ANTHROPIC_API_KEY")
+def test_smoke_end_to_end(tmp_path):
+    rc = main(["--smoke", "--run-dir", str(tmp_path)])
+    assert rc == 0
+    assert (tmp_path / "report.md").exists()
+    assert (tmp_path / "best_prompts.json").exists()
+```
+
+- [ ] **Step 7: Run the non-smoke tests; confirm smoke skips without a key**
+
+Run: `cd scripts/experiments/gepa-flowchart && python -m pytest tests/ -v`
+Expected: all unit tests PASS; `test_smoke_end_to_end` is SKIPPED when `ANTHROPIC_API_KEY` is unset. If the key is set, the smoke test runs as part of this command (costs a few cents) and should PASS; unset the key to avoid the spend.
+
+- [ ] **Step 8: Commit**
+
+```bash
+git add scripts/experiments/gepa-flowchart/
+git commit -m "feat(gepa): run entrypoint, report, and smoke test"
+```
+
+---
+
+### Task 10: README + full verification
+
+**Files:**
+- Create: `scripts/experiments/gepa-flowchart/README.md`
+
+**Interfaces:** none (docs + verification).
+
+- [ ] **Step 1: Write the README**
+
+`README.md`:
+````markdown
+# gepa-flowchart
+
+GEPA prompt optimization for readable, comprehensible termchart flowcharts.
+Optimizes the `brainstorm` + `generate` prompts for first-user comprehension
+(primary) and geometric readability (secondary). See
+`docs/superpowers/specs/2026-06-18-gepa-flowchart-optimization-design.md`.
+
+## Setup
+
+```bash
+# Node deps for the geometry bridge (run once)
+npm install                         # repo root; installs the viewer's deps
+cd scripts/experiments/gepa-flowchart
+npm install                         # local tsx
+
+# Python
+python -m venv .venv && . .venv/bin/activate
+pip install -e ".[dev]"
+export ANTHROPIC_API_KEY=sk-ant-...   # or: ant auth login
+```
+
+## Run
+
+```bash
+python -m gepa_flowchart.run --smoke            # cheap end-to-end check (1 topic)
+python -m gepa_flowchart.run                    # full run (~150 metric calls)
+python -m gepa_flowchart.run --max-metric-calls 80 --train 8
+```
+
+Outputs land in `runs/<timestamp>/`: `best_prompts.json`, `report.md`,
+`frozen_questions.json`.
+
+## Config (env)
+
+`GEPA_GEN_MODEL` (default `claude-opus-4-8`; set `claude-sonnet-4-6` to cut cost),
+`GEPA_READER_MODEL`, `GEPA_JUDGE_MODEL`, `GEPA_REFLECTION_MODEL`,
+`GEPA_W_COMP` (0.6), `GEPA_W_GEOM` (0.4), `GEPA_N_QUESTIONS` (5), `GEPA_MAX_METRIC_CALLS` (150),
+`GEPA_GEN_MAX_TOKENS` (8000), `GEPA_GEOM_ERR_PENALTY` (0.34), `GEPA_GEOM_WARN_PENALTY` (0.08).
+
+## Cost note
+
+A full run does many generation + reader + judge calls per rollout plus
+reflection. Start with `--smoke`. Generation defaults to Opus; switch
+`GEPA_GEN_MODEL=claude-sonnet-4-6` for a cheaper high-volume role.
+````
+
+- [ ] **Step 2: Run the full unit suite**
+
+Run: `cd scripts/experiments/gepa-flowchart && python -m pytest tests/ -v`
+Expected: all unit tests PASS; smoke SKIPPED without a key.
+
+- [ ] **Step 3: (Optional, costs cents) Live smoke**
+
+Run: `cd scripts/experiments/gepa-flowchart && ANTHROPIC_API_KEY=... python -m gepa_flowchart.run --smoke`
+Expected: prints a report; `runs/<ts>/report.md` and `best_prompts.json` exist.
+
+- [ ] **Step 4: Commit**
+
+```bash
+git add scripts/experiments/gepa-flowchart/README.md
+git commit -m "docs(gepa): README with setup, run, and cost notes"
+```
+
+---
+
+## Self-Review
+
+**Spec coverage:**
+- Two optimizable prompts (brainstorm/generate) → Task 6 (`SEED_CANDIDATE`), Task 8 (adapter mutates both).
+- Claude direct via Anthropic SDK → Task 3 (`llm.py`); reflection callable → Task 3 + Task 9.
+- Standalone `gepa` → Task 8 (`GEPAAdapter`), Task 9 (`gepa.optimize`).
+- Structural validity gate → Task 7 (`score_board` early return) via Task 2 bridge (`validateContent`).
+- Geometry/readability score → Task 2 (bridge `geometryReport`), Task 7 (`geometry_score`).
+- Comprehension/FUX primary (fresh reader + judge, board-supported correctness, missing-context flagged) → Task 7 (`comprehension_score`).
+- Auto-generated, run-frozen reader-questions → Task 5 (`freeze_questions`), Task 9 (frozen once per run).
+- Structured-content reader surface → Task 4 (`board_to_text`).
+- ~12 topics, train/val split → Task 5 (`topics.json`), Task 9 (split).
+- Geometry bridge, no package changes → Task 2.
+- Models (gen opus + sonnet knob, reader sonnet, judge/reflection opus) → Task 1 config defaults.
+- Combined weights 0.6/0.4 → Task 1 + Task 7.
+- Smoke-first, cost estimate → Task 9 (`--smoke`, stderr prints), Task 10 README.
+- Outputs (best_prompts.json, report.md, frozen_questions.json) → Task 9.
+- Testing (units with fakes + gated smoke) → every task + Task 9 smoke.
+
+**Placeholder scan:** No TBD/TODO; every code step has real code; commands have expected output. `topics.json` shows all 12 entries.
+
+**Type consistency:** `complete(system, user, *, model, effort, max_tokens, client)` used consistently (Tasks 3, 5, 6, 7, adapter). `validate_flow(content) -> {valid,error,findings,warnings}` consistent (Tasks 2, 7, 8). `Task(id,topic,audience,purpose,questions)` consistent (Tasks 5–9). `ScoreResult(score,feedback,comp,geom,valid)` consistent (Tasks 7, 8). `SEED_CANDIDATE` keys `brainstorm`/`generate` consistent (Tasks 6, 8). `EvaluationBatch(outputs,scores,trajectories)` matches the gepa source.

From e6d8c168403470233ee6712d8c7fc3c815593bbd Mon Sep 17 00:00:00 2001
From: Ivan Cheung <ivanmkc@google.com>
Date: Thu, 18 Jun 2026 04:38:30 +0000
Subject: [PATCH 03/24] feat(gepa): scaffold gepa-flowchart project + config

---
 scripts/experiments/gepa-flowchart/.gitignore |  5 ++
 .../gepa-flowchart/gepa_flowchart/__init__.py |  1 +
 .../gepa-flowchart/gepa_flowchart/config.py   | 46 +++++++++++++++++++
 .../experiments/gepa-flowchart/package.json   |  5 ++
 .../experiments/gepa-flowchart/pyproject.toml | 11 +++++
 .../gepa-flowchart/tests/test_config.py       | 16 +++++++
 6 files changed, 84 insertions(+)
 create mode 100644 scripts/experiments/gepa-flowchart/.gitignore
 create mode 100644 scripts/experiments/gepa-flowchart/gepa_flowchart/__init__.py
 create mode 100644 scripts/experiments/gepa-flowchart/gepa_flowchart/config.py
 create mode 100644 scripts/experiments/gepa-flowchart/package.json
 create mode 100644 scripts/experiments/gepa-flowchart/pyproject.toml
 create mode 100644 scripts/experiments/gepa-flowchart/tests/test_config.py
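
Reviewer note (below the `---`, so not part of the commit): `Config.from_env`
merges `overrides` on top of `os.environ`, so passing `{}` does not shield the
caller from `GEPA_*` variables exported in the shell, and `test_defaults` would
pick them up. The sketch below illustrates that lookup order only. `resolve` is
a hypothetical stand-in written for this note, not a function in the patch.

```python
from __future__ import annotations

import os


def resolve(key: str, default, overrides: dict | None = None):
    """Hypothetical helper mirroring the lookup order used by Config.from_env:
    explicit overrides first, then os.environ, then the default."""
    env = {**os.environ, **(overrides or {})}
    # Coerce to the default's type, as the s/f/i helpers in config.py do.
    return type(default)(env.get(key, default))


if __name__ == "__main__":
    os.environ["GEPA_W_COMP"] = "0.9"  # simulate a value exported in the shell
    # An empty overrides dict still sees the exported value, not the default.
    print(resolve("GEPA_W_COMP", 0.6, {}))
    # An explicit override takes precedence over the environment.
    print(resolve("GEPA_W_COMP", 0.6, {"GEPA_W_COMP": "0.7"}))
```

If hermetic defaults matter for the unit test, one option is to clear the
`GEPA_*` variables in the test (for example with pytest's `monkeypatch.delenv`)
before calling `Config.from_env({})`.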

diff --git a/scripts/experiments/gepa-flowchart/.gitignore b/scripts/experiments/gepa-flowchart/.gitignore
new file mode 100644
index 0000000..b875e21
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/.gitignore
@@ -0,0 +1,5 @@
+__pycache__/
+*.pyc
+.venv/
+node_modules/
+runs/
diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/__init__.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/__init__.py
new file mode 100644
index 0000000..2e891bd
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/__init__.py
@@ -0,0 +1 @@
+"""GEPA optimization for readable, comprehensible termchart flowcharts."""
diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/config.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/config.py
new file mode 100644
index 0000000..1e5ae10
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/config.py
@@ -0,0 +1,46 @@
+from __future__ import annotations
+
+import os
+from dataclasses import dataclass
+
+
+@dataclass
+class Config:
+    gen_model: str = "claude-opus-4-8"
+    reader_model: str = "claude-sonnet-4-6"
+    judge_model: str = "claude-opus-4-8"
+    reflection_model: str = "claude-opus-4-8"
+    w_comp: float = 0.6
+    w_geom: float = 0.4
+    geom_error_penalty: float = 0.34
+    geom_warning_penalty: float = 0.08
+    n_questions: int = 5
+    max_metric_calls: int = 150
+    gen_max_tokens: int = 8000
+
+    @classmethod
+    def from_env(cls, overrides: dict | None = None) -> "Config":
+        env = {**os.environ, **(overrides or {})}
+
+        def s(key: str, default: str) -> str:
+            return env.get(key, default)
+
+        def f(key: str, default: float) -> float:
+            return float(env.get(key, default))
+
+        def i(key: str, default: int) -> int:
+            return int(env.get(key, default))
+
+        return cls(
+            gen_model=s("GEPA_GEN_MODEL", cls.gen_model),
+            reader_model=s("GEPA_READER_MODEL", cls.reader_model),
+            judge_model=s("GEPA_JUDGE_MODEL", cls.judge_model),
+            reflection_model=s("GEPA_REFLECTION_MODEL", cls.reflection_model),
+            w_comp=f("GEPA_W_COMP", cls.w_comp),
+            w_geom=f("GEPA_W_GEOM", cls.w_geom),
+            geom_error_penalty=f("GEPA_GEOM_ERR_PENALTY", cls.geom_error_penalty),
+            geom_warning_penalty=f("GEPA_GEOM_WARN_PENALTY", cls.geom_warning_penalty),
+            n_questions=i("GEPA_N_QUESTIONS", cls.n_questions),
+            max_metric_calls=i("GEPA_MAX_METRIC_CALLS", cls.max_metric_calls),
+            gen_max_tokens=i("GEPA_GEN_MAX_TOKENS", cls.gen_max_tokens),
+        )
diff --git a/scripts/experiments/gepa-flowchart/package.json b/scripts/experiments/gepa-flowchart/package.json
new file mode 100644
index 0000000..6bf8617
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/package.json
@@ -0,0 +1,5 @@
+{
+  "name": "gepa-flowchart-bridge",
+  "private": true,
+  "devDependencies": { "tsx": "^4.19.0" }
+}
diff --git a/scripts/experiments/gepa-flowchart/pyproject.toml b/scripts/experiments/gepa-flowchart/pyproject.toml
new file mode 100644
index 0000000..3d56452
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/pyproject.toml
@@ -0,0 +1,11 @@
+[project]
+name = "gepa-flowchart"
+version = "0.1.0"
+requires-python = ">=3.11"
+dependencies = ["gepa", "anthropic>=0.69"]
+
+[project.optional-dependencies]
+dev = ["pytest>=8"]
+
+[tool.pytest.ini_options]
+testpaths = ["tests"]
diff --git a/scripts/experiments/gepa-flowchart/tests/test_config.py b/scripts/experiments/gepa-flowchart/tests/test_config.py
new file mode 100644
index 0000000..691aaa9
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/tests/test_config.py
@@ -0,0 +1,16 @@
+from gepa_flowchart.config import Config
+
+
+def test_defaults():
+    cfg = Config.from_env({})
+    assert cfg.gen_model == "claude-opus-4-8"
+    assert cfg.reader_model == "claude-sonnet-4-6"
+    assert cfg.judge_model == "claude-opus-4-8"
+    assert cfg.reflection_model == "claude-opus-4-8"
+    assert cfg.w_comp == 0.6 and cfg.w_geom == 0.4
+
+
+def test_env_override():
+    cfg = Config.from_env({"GEPA_GEN_MODEL": "claude-sonnet-4-6", "GEPA_W_COMP": "0.7"})
+    assert cfg.gen_model == "claude-sonnet-4-6"
+    assert cfg.w_comp == 0.7

From abb5f9601501035573ce4ab40975a763aa676c7b Mon Sep 17 00:00:00 2001
From: Ivan Cheung <ivanmkc@google.com>
Date: Thu, 18 Jun 2026 04:42:33 +0000
Subject: [PATCH 04/24] feat(gepa): geometry bridge reusing TS validateContent
 + geometryReport

---
 .../gepa_flowchart/geometry_bridge.py         |  24 +
 .../gepa_flowchart/validate_flow.ts           |  22 +
 .../gepa-flowchart/package-lock.json          | 531 ++++++++++++++++++
 .../experiments/gepa-flowchart/package.json   |   1 +
 .../tests/test_geometry_bridge.py             |  38 ++
 5 files changed, 616 insertions(+)
 create mode 100644 scripts/experiments/gepa-flowchart/gepa_flowchart/geometry_bridge.py
 create mode 100644 scripts/experiments/gepa-flowchart/gepa_flowchart/validate_flow.ts
 create mode 100644 scripts/experiments/gepa-flowchart/package-lock.json
 create mode 100644 scripts/experiments/gepa-flowchart/tests/test_geometry_bridge.py
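
Reviewer note (below the `---`, so not part of the commit): the bridge pipes a
JSON string to a child process and parses the JSON it prints. The sketch below
shows the same pattern with two hardening steps that the patch does not include:
a timeout, and a clear error when the child prints non-JSON. These are
suggestions, not the committed behavior. `run_bridge`, `_CHILD`, and `timeout_s`
are hypothetical names, and a small Python child stands in for
`npx tsx validate_flow.ts` so the sketch runs without Node.

```python
from __future__ import annotations

import json
import subprocess
import sys

# Stand-in child: reports whether stdin parses as JSON, in the bridge's output shape.
_CHILD = (
    "import json, sys\n"
    "raw = sys.stdin.read()\n"
    "try:\n"
    "    json.loads(raw)\n"
    "    out = {'valid': True, 'error': None, 'findings': [], 'warnings': []}\n"
    "except ValueError as exc:\n"
    "    out = {'valid': False, 'error': str(exc), 'findings': [], 'warnings': []}\n"
    "print(json.dumps(out))\n"
)


def run_bridge(content: str, cmd: list[str] | None = None, timeout_s: float = 30.0) -> dict:
    """Pipe `content` to a child process and return the JSON object it prints."""
    cmd = cmd or [sys.executable, "-c", _CHILD]
    try:
        proc = subprocess.run(
            cmd, input=content, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired as exc:
        raise RuntimeError(f"bridge timed out after {timeout_s}s") from exc
    if proc.returncode != 0:
        raise RuntimeError(f"bridge failed: {proc.stderr.strip()}")
    try:
        return json.loads(proc.stdout)
    except json.JSONDecodeError as exc:
        raise RuntimeError(f"bridge printed non-JSON output: {proc.stdout[:200]!r}") from exc
```

Without a timeout, a stalled `npx` call (for example a first-run package prompt)
could block a whole optimization run, which is the main reason to consider one.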

diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/geometry_bridge.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/geometry_bridge.py
new file mode 100644
index 0000000..1cc4d28
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/geometry_bridge.py
@@ -0,0 +1,24 @@
+from __future__ import annotations
+
+import json
+import subprocess
+from pathlib import Path
+
+_HERE = Path(__file__).resolve().parent
+_PROJECT = _HERE.parent  # scripts/experiments/gepa-flowchart
+_SCRIPT = _HERE / "validate_flow.ts"
+
+
+def validate_flow(content: str, *, cwd: str | None = None) -> dict:
+    """Run the TS validator on a flow-JSON string. Returns
+    {valid, error, findings, warnings}. Raises RuntimeError if the bridge exits non-zero."""
+    proc = subprocess.run(
+        ["npx", "tsx", str(_SCRIPT)],
+        input=content,
+        capture_output=True,
+        text=True,
+        cwd=cwd or str(_PROJECT),
+    )
+    if proc.returncode != 0:
+        raise RuntimeError(f"geometry bridge failed: {proc.stderr.strip()}")
+    return json.loads(proc.stdout)
diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/validate_flow.ts b/scripts/experiments/gepa-flowchart/gepa_flowchart/validate_flow.ts
new file mode 100644
index 0000000..7a4a1c1
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/validate_flow.ts
@@ -0,0 +1,22 @@
+// Reads a flow-JSON spec on stdin, prints { valid, error, findings, warnings } on stdout.
+// Imports the existing validators directly — no changes to shipped packages.
+import { validateContent } from "../../../../packages/core/src/validate.js";
+import { geometryReport } from "../../../../packages/viewer/src/flow-geometry.js";
+
+function readStdin(): Promise<string> {
+  return new Promise((resolve) => {
+    let data = "";
+    process.stdin.setEncoding("utf8");
+    process.stdin.on("data", (c) => (data += c));
+    process.stdin.on("end", () => resolve(data));
+  });
+}
+
+const content = await readStdin();
+const error = validateContent("flow", content);
+if (error) {
+  process.stdout.write(JSON.stringify({ valid: false, error, findings: [], warnings: [] }));
+} else {
+  const { warnings, findings } = geometryReport("flow", content);
+  process.stdout.write(JSON.stringify({ valid: true, error: null, findings, warnings }));
+}
diff --git a/scripts/experiments/gepa-flowchart/package-lock.json b/scripts/experiments/gepa-flowchart/package-lock.json
new file mode 100644
index 0000000..2e54688
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/package-lock.json
@@ -0,0 +1,531 @@
+{
+  "name": "gepa-flowchart-bridge",
+  "lockfileVersion": 3,
+  "requires": true,
+  "packages": {
+    "": {
+      "name": "gepa-flowchart-bridge",
+      "devDependencies": {
+        "tsx": "^4.19.0"
+      }
+    },
+    "node_modules/@esbuild/aix-ppc64": {
+      "version": "0.28.1",
+      "resolved": "https://registry.npmjs.org/@esbuild/aix-ppc64/-/aix-ppc64-0.28.1.tgz",
+      "integrity": "sha512-Svl7tq8k/08+p6CXPpRjQ1fKX+1odH/BQbb48fV6fj3CWHhsoIOoY87w1oHXm0qEpkIK3ZfVgp0hed3XBXzXMQ==",
+      "cpu": [
+        "ppc64"
+      ],
+      "dev": true,
+      "license": "MIT",
+      "optional": true,
+      "os": [
+        "aix"
+      ],
+      "engines": {
+        "node": ">=18"
+      }
+    },
+    "node_modules/@esbuild/android-arm": {
+      "version": "0.28.1",
+      "resolved": "https://registry.npmjs.org/@esbuild/android-arm/-/android-arm-0.28.1.tgz",
+      "integrity": "sha512-0k2F129Xdio1TdJfzJ8sy1Q47vUD2NnwdhiAf7drUN1EBTfPf4hsFCtmMgu/6m8JSzsBrlmVjudMBQqOfG8usQ==",
+      "cpu": [
+        "arm"
+      ],
+      "dev": true,
+      "license": "MIT",
+      "optional": true,
+      "os": [
+        "android"
+      ],
+      "engines": {
+        "node": ">=18"
+      }
+    },
+    "node_modules/@esbuild/android-arm64": {
+      "version": "0.28.1",
+      "resolved": "https://registry.npmjs.org/@esbuild/android-arm64/-/android-arm64-0.28.1.tgz",
+      "integrity": "sha512-34EGEbCIAgosYz6goLcopX6Mo7NyGv9tfwEM2/7Ce2VcVRk568iSvniGWcUXIy7wEDR1wzolcxcriFVrWYcwBg==",
+      "cpu": [
+        "arm64"
+      ],
+      "dev": true,
+      "license": "MIT",
+      "optional": true,
+      "os": [
+        "android"
+      ],
+      "engines": {
+        "node": ">=18"
+      }
+    },
+    "node_modules/@esbuild/android-x64": {
+      "version": "0.28.1",
+      "resolved": "https://registry.npmjs.org/@esbuild/android-x64/-/android-x64-0.28.1.tgz",
+      "integrity": "sha512-dbwY7ltSMDWsRatcRpCnES4F+im88OCUgGZjy52shC7GqHRE/cYlxNbB4Z4UpJswpcc4Qxd2oE/ufM0p61IKng==",
+      "cpu": [
+        "x64"
+      ],
+      "dev": true,
+      "license": "MIT",
+      "optional": true,
+      "os": [
+        "android"
+      ],
+      "engines": {
+        "node": ">=18"
+      }
+    },
+    "node_modules/@esbuild/darwin-arm64": {
+      "version": "0.28.1",
+      "resolved": "https://registry.npmjs.org/@esbuild/darwin-arm64/-/darwin-arm64-0.28.1.tgz",
+      "integrity": "sha512-TZbWkQY7kvTAXbXUT7uVACR5cMHsDiSz9z7ZKAX/RTq/WJEk3QyRr0wZpNhBDX+/0CtdqUIJlOiodQcta6tY3Q==",
+      "cpu": [
+        "arm64"
+      ],
+      "dev": true,
+      "license": "MIT",
+      "optional": true,
+      "os": [
+        "darwin"
+      ],
+      "engines": {
+        "node": ">=18"
+      }
+    },
+    "node_modules/@esbuild/darwin-x64": {
+      "version": "0.28.1",
+      "resolved": "https://registry.npmjs.org/@esbuild/darwin-x64/-/darwin-x64-0.28.1.tgz",
+      "integrity": "sha512-zfdzgK9ACBNZLI/CyHTOx81SyNbM6YXn7rxSgX97VjyiPl9W1i4Ka4fgKECEoFCKGpvBj5qArWIGgQjOwkgskQ==",
+      "cpu": [
+        "x64"
+      ],
+      "dev": true,
+      "license": "MIT",
+      "optional": true,
+      "os": [
+        "darwin"
+      ],
+      "engines": {
+        "node": ">=18"
+      }
+    },
+    "node_modules/@esbuild/freebsd-arm64": {
+      "version": "0.28.1",
+      "resolved": "https://registry.npmjs.org/@esbuild/freebsd-arm64/-/freebsd-arm64-0.28.1.tgz",
+      "integrity": "sha512-wG2EA8ENdEI0qhkSZMjfqrdY+ziCYCPMmtZjjIwOmXFjmyzEHn+UUxk5of+SYsjtfs3VpnlC7QLzSI5hY/rOAw==",
+      "cpu": [
+        "arm64"
+      ],
+      "dev": true,
+      "license": "MIT",
+      "optional": true,
+      "os": [
+        "freebsd"
+      ],
+      "engines": {
+        "node": ">=18"
+      }
+    },
+    "node_modules/@esbuild/freebsd-x64": {
+      "version": "0.28.1",
+      "resolved": "https://registry.npmjs.org/@esbuild/freebsd-x64/-/freebsd-x64-0.28.1.tgz",
+      "integrity": "sha512-i7dZ9vQgnvSCzi/rYCXNgtF/U+eKZNJBzu3eTQbRgHnM7tNSizLOkRFAl3qzVc/Op/u5YkHHa4pf/3DOYHthLQ==",
+      "cpu": [
+        "x64"
+      ],
+      "dev": true,
+      "license": "MIT",
+      "optional": true,
+      "os": [
+        "freebsd"
+      ],
+      "engines": {
+        "node": ">=18"
+      }
+    },
+    "node_modules/@esbuild/linux-arm": {
+      "version": "0.28.1",
+      "resolved": "https://registry.npmjs.org/@esbuild/linux-arm/-/linux-arm-0.28.1.tgz",
+      "integrity": "sha512-qVXBOHQS+d5Y722GwJzJUtOLlX7km3CraOaGormF1pDtPd2C/l1SHRPgjLunLGe51Sh5YYWKMFDyV4SxgMQYTQ==",
+      "cpu": [
+        "arm"
+      ],
+      "dev": true,
+      "license": "MIT",
+      "optional": true,
+      "os": [
+        "linux"
+      ],
+      "engines": {
+        "node": ">=18"
+      }
+    },
+    "node_modules/@esbuild/linux-arm64": {
+      "version": "0.28.1",
+      "resolved": "https://registry.npmjs.org/@esbuild/linux-arm64/-/linux-arm64-0.28.1.tgz",
+      "integrity": "sha512-yHs+0uc8+nvEAfAfxrWQKK5peSNzBc4PegcMO0EJ2hT71uA7vB8Ihg2e77R2P7SG5uYjPbHlLLmve4LLLRCf0g==",
+      "cpu": [
+        "arm64"
+      ],
+      "dev": true,
+      "license": "MIT",
+      "optional": true,
+      "os": [
+        "linux"
+      ],
+      "engines": {
+        "node": ">=18"
+      }
+    },
+    "node_modules/@esbuild/linux-ia32": {
+      "version": "0.28.1",
+      "resolved": "https://registry.npmjs.org/@esbuild/linux-ia32/-/linux-ia32-0.28.1.tgz",
+      "integrity": "sha512-d1z4ZuP0ajrfz/FhGT4vv278rX8KnPPJx8i5+AtK7TYbx9Le9F1hyzurZpkEyjkGa9dUGhQow4C1NmeGvqxN2w==",
+      "cpu": [
+        "ia32"
+      ],
+      "dev": true,
+      "license": "MIT",
+      "optional": true,
+      "os": [
+        "linux"
+      ],
+      "engines": {
+        "node": ">=18"
+      }
+    },
+    "node_modules/@esbuild/linux-loong64": {
+      "version": "0.28.1",
+      "resolved": "https://registry.npmjs.org/@esbuild/linux-loong64/-/linux-loong64-0.28.1.tgz",
+      "integrity": "sha512-M5sRjUVZrkm1OAPR3dlOYzNmN+loZKGVi1VUQGrwuqLcbR6qeAz+famMhjASeH3YVKvZz+zT1jlh/keC3Rj/lg==",
+      "cpu": [
+        "loong64"
+      ],
+      "dev": true,
+      "license": "MIT",
+      "optional": true,
+      "os": [
+        "linux"
+      ],
+      "engines": {
+        "node": ">=18"
+      }
+    },
+    "node_modules/@esbuild/linux-mips64el": {
+      "version": "0.28.1",
+      "resolved": "https://registry.npmjs.org/@esbuild/linux-mips64el/-/linux-mips64el-0.28.1.tgz",
+      "integrity": "sha512-mRObBZeHh2OxcBFPWE/FjylkRgZdYuiTR3vaTozquCGOH14iP9oN4x4Ge81CoIDYQrXmIxpFumJBu5MtZpnQJQ==",
+      "cpu": [
+        "mips64el"
+      ],
+      "dev": true,
+      "license": "MIT",
+      "optional": true,
+      "os": [
+        "linux"
+      ],
+      "engines": {
+        "node": ">=18"
+      }
+    },
+    "node_modules/@esbuild/linux-ppc64": {
+      "version": "0.28.1",
+      "resolved": "https://registry.npmjs.org/@esbuild/linux-ppc64/-/linux-ppc64-0.28.1.tgz",
+      "integrity": "sha512-slScBsMAb3GFDcdrCgLwZtPYRoH2H/youv10QiZyRjmsP48fznoveWytSgCI/R0ZcUgpc0ZhIUEx6LHts8yrfQ==",
+      "cpu": [
+        "ppc64"
+      ],
+      "dev": true,
+      "license": "MIT",
+      "optional": true,
+      "os": [
+        "linux"
+      ],
+      "engines": {
+        "node": ">=18"
+      }
+    },
+    "node_modules/@esbuild/linux-riscv64": {
+      "version": "0.28.1",
+      "resolved": "https://registry.npmjs.org/@esbuild/linux-riscv64/-/linux-riscv64-0.28.1.tgz",
+      "integrity": "sha512-kw0owk1o0GFETUJyW0jc0G4Yzs0BHZn0JDZ8JRT088vjJYX777BAs1fDGxAC+q831qOs2DTC96mNsG2opdfyyQ==",
+      "cpu": [
+        "riscv64"
+      ],
+      "dev": true,
+      "license": "MIT",
+      "optional": true,
+      "os": [
+        "linux"
+      ],
+      "engines": {
+        "node": ">=18"
+      }
+    },
+    "node_modules/@esbuild/linux-s390x": {
+      "version": "0.28.1",
+      "resolved": "https://registry.npmjs.org/@esbuild/linux-s390x/-/linux-s390x-0.28.1.tgz",
+      "integrity": "sha512-/lAIjX8aYFRByhh6L5rYtPEDRqa9de/4V/juOXcta5frjvzXO4/sqEtyytse0g3zZFuWu5cDN0MkLz2qRDD2Ag==",
+      "cpu": [
+        "s390x"
+      ],
+      "dev": true,
+      "license": "MIT",
+      "optional": true,
+      "os": [
+        "linux"
+      ],
+      "engines": {
+        "node": ">=18"
+      }
+    },
+    "node_modules/@esbuild/linux-x64": {
+      "version": "0.28.1",
+      "resolved": "https://registry.npmjs.org/@esbuild/linux-x64/-/linux-x64-0.28.1.tgz",
+      "integrity": "sha512-u/anNYF2mmVOEDwLtnQ1wOr3EZ9sTNGLWrsYGYwHWzGA3Si84IOkHXlbWTD1NB+9/1lcnweYKO54uhxZydNzfA==",
+      "cpu": [
+        "x64"
+      ],
+      "dev": true,
+      "license": "MIT",
+      "optional": true,
+      "os": [
+        "linux"
+      ],
+      "engines": {
+        "node": ">=18"
+      }
+    },
+    "node_modules/@esbuild/netbsd-arm64": {
+      "version": "0.28.1",
+      "resolved": "https://registry.npmjs.org/@esbuild/netbsd-arm64/-/netbsd-arm64-0.28.1.tgz",
+      "integrity": "sha512-oks0DYbLwWMmaakTsCb+zL4E+aHRVLom9IJZOAthMQEPiQmydXHkziYEsGYRx0uNV/IjEKGAV941JzH02pflqw==",
+      "cpu": [
+        "arm64"
+      ],
+      "dev": true,
+      "license": "MIT",
+      "optional": true,
+      "os": [
+        "netbsd"
+      ],
+      "engines": {
+        "node": ">=18"
+      }
+    },
+    "node_modules/@esbuild/netbsd-x64": {
+      "version": "0.28.1",
+      "resolved": "https://registry.npmjs.org/@esbuild/netbsd-x64/-/netbsd-x64-0.28.1.tgz",
+      "integrity": "sha512-aeL6lAnN89Hz43Mlh1G8ARasbuoYvSITDEx0tHh5b7jJnHcssqgjy9Yx430GDpmCa6OyrKoS0aNRjKundRizGg==",
+      "cpu": [
+        "x64"
+      ],
+      "dev": true,
+      "license": "MIT",
+      "optional": true,
+      "os": [
+        "netbsd"
+      ],
+      "engines": {
+        "node": ">=18"
+      }
+    },
+    "node_modules/@esbuild/openbsd-arm64": {
+      "version": "0.28.1",
+      "resolved": "https://registry.npmjs.org/@esbuild/openbsd-arm64/-/openbsd-arm64-0.28.1.tgz",
+      "integrity": "sha512-MEFJe5C3R8pwXdZ5Y21oo6m7ePiS0d9pWucn99O/wvyJZChoIQKrQDxKrGeW8F5+T0okTHesAmDeiHDTIq0V/Q==",
+      "cpu": [
+        "arm64"
+      ],
+      "dev": true,
+      "license": "MIT",
+      "optional": true,
+      "os": [
+        "openbsd"
+      ],
+      "engines": {
+        "node": ">=18"
+      }
+    },
+    "node_modules/@esbuild/openbsd-x64": {
+      "version": "0.28.1",
+      "resolved": "https://registry.npmjs.org/@esbuild/openbsd-x64/-/openbsd-x64-0.28.1.tgz",
+      "integrity": "sha512-i/ZLIOafE0Z8cI/XANJAixoJL/uRAoS2xOA3rb0xN+KK0K177cMAsQYkzHtBrtMXAKuAc7HGgcWiZ/sRC1Nxgw==",
+      "cpu": [
+        "x64"
+      ],
+      "dev": true,
+      "license": "MIT",
+      "optional": true,
+      "os": [
+        "openbsd"
+      ],
+      "engines": {
+        "node": ">=18"
+      }
+    },
+    "node_modules/@esbuild/openharmony-arm64": {
+      "version": "0.28.1",
+      "resolved": "https://registry.npmjs.org/@esbuild/openharmony-arm64/-/openharmony-arm64-0.28.1.tgz",
+      "integrity": "sha512-ge+Z7EXFNt2BO1oAMsVpiQ8EwndV9i1xXerAeTIK7AtPs3bKFXQM7nlRxDSIUIMeueR1CNXxqztLzdNeReKBJg==",
+      "cpu": [
+        "arm64"
+      ],
+      "dev": true,
+      "license": "MIT",
+      "optional": true,
+      "os": [
+        "openharmony"
+      ],
+      "engines": {
+        "node": ">=18"
+      }
+    },
+    "node_modules/@esbuild/sunos-x64": {
+      "version": "0.28.1",
+      "resolved": "https://registry.npmjs.org/@esbuild/sunos-x64/-/sunos-x64-0.28.1.tgz",
+      "integrity": "sha512-BEjgtECkL3vY+SaSQ6nzVfiALUeFxpawyp8Jmf5PtYhf1Ug40N1h/hxlhts+f1FvSvarEigdxS3BlSMI2PJLcQ==",
+      "cpu": [
+        "x64"
+      ],
+      "dev": true,
+      "license": "MIT",
+      "optional": true,
+      "os": [
+        "sunos"
+      ],
+      "engines": {
+        "node": ">=18"
+      }
+    },
+    "node_modules/@esbuild/win32-arm64": {
+      "version": "0.28.1",
+      "resolved": "https://registry.npmjs.org/@esbuild/win32-arm64/-/win32-arm64-0.28.1.tgz",
+      "integrity": "sha512-lCv9eK/H6ZJWbE7bh2nw54CZ9M2nupBxJcTsdk/QQnWkdSjKGuxmmH8/GWrlT1eMmZfn4dGcCjRte397WqfQXA==",
+      "cpu": [
+        "arm64"
+      ],
+      "dev": true,
+      "license": "MIT",
+      "optional": true,
+      "os": [
+        "win32"
+      ],
+      "engines": {
+        "node": ">=18"
+      }
+    },
+    "node_modules/@esbuild/win32-ia32": {
+      "version": "0.28.1",
+      "resolved": "https://registry.npmjs.org/@esbuild/win32-ia32/-/win32-ia32-0.28.1.tgz",
+      "integrity": "sha512-zvb/mB2bSCoJOpoCBgYKKpX6YM6mJBlBUVUtVj41DlZJVEB6/0CKlRYxP5wWl1C1ILiCoAU5wZZ4q1P3qeS6Eg==",
+      "cpu": [
+        "ia32"
+      ],
+      "dev": true,
+      "license": "MIT",
+      "optional": true,
+      "os": [
+        "win32"
+      ],
+      "engines": {
+        "node": ">=18"
+      }
+    },
+    "node_modules/@esbuild/win32-x64": {
+      "version": "0.28.1",
+      "resolved": "https://registry.npmjs.org/@esbuild/win32-x64/-/win32-x64-0.28.1.tgz",
+      "integrity": "sha512-bm4Mowrv+GXMlpWX++EcXw/iLyd1o3+bJkC2DkWXYVvgZCqD/bSj9ctZeAMC3cIxgjRVR2Dufaiu4YPxr5gW1A==",
+      "cpu": [
+        "x64"
+      ],
+      "dev": true,
+      "license": "MIT",
+      "optional": true,
+      "os": [
+        "win32"
+      ],
+      "engines": {
+        "node": ">=18"
+      }
+    },
+    "node_modules/esbuild": {
+      "version": "0.28.1",
+      "resolved": "https://registry.npmjs.org/esbuild/-/esbuild-0.28.1.tgz",
+      "integrity": "sha512-HrJrvZv5ayxBzPfwphOoNzkzOIIlifzk0KJrGK2c8R4+LKpMtpYLQeUdjnwjWv/LZlkH2laZk+4w78pi99D4Vw==",
+      "dev": true,
+      "hasInstallScript": true,
+      "license": "MIT",
+      "bin": {
+        "esbuild": "bin/esbuild"
+      },
+      "engines": {
+        "node": ">=18"
+      },
+      "optionalDependencies": {
+        "@esbuild/aix-ppc64": "0.28.1",
+        "@esbuild/android-arm": "0.28.1",
+        "@esbuild/android-arm64": "0.28.1",
+        "@esbuild/android-x64": "0.28.1",
+        "@esbuild/darwin-arm64": "0.28.1",
+        "@esbuild/darwin-x64": "0.28.1",
+        "@esbuild/freebsd-arm64": "0.28.1",
+        "@esbuild/freebsd-x64": "0.28.1",
+        "@esbuild/linux-arm": "0.28.1",
+        "@esbuild/linux-arm64": "0.28.1",
+        "@esbuild/linux-ia32": "0.28.1",
+        "@esbuild/linux-loong64": "0.28.1",
+        "@esbuild/linux-mips64el": "0.28.1",
+        "@esbuild/linux-ppc64": "0.28.1",
+        "@esbuild/linux-riscv64": "0.28.1",
+        "@esbuild/linux-s390x": "0.28.1",
+        "@esbuild/linux-x64": "0.28.1",
+        "@esbuild/netbsd-arm64": "0.28.1",
+        "@esbuild/netbsd-x64": "0.28.1",
+        "@esbuild/openbsd-arm64": "0.28.1",
+        "@esbuild/openbsd-x64": "0.28.1",
+        "@esbuild/openharmony-arm64": "0.28.1",
+        "@esbuild/sunos-x64": "0.28.1",
+        "@esbuild/win32-arm64": "0.28.1",
+        "@esbuild/win32-ia32": "0.28.1",
+        "@esbuild/win32-x64": "0.28.1"
+      }
+    },
+    "node_modules/fsevents": {
+      "version": "2.3.3",
+      "resolved": "https://registry.npmjs.org/fsevents/-/fsevents-2.3.3.tgz",
+      "integrity": "sha512-5xoDfX+fL7faATnagmWPpbFtwh/R77WmMMqqHGS65C3vvB0YHrgF+B1YmZ3441tMj5n63k0212XNoJwzlhffQw==",
+      "dev": true,
+      "hasInstallScript": true,
+      "license": "MIT",
+      "optional": true,
+      "os": [
+        "darwin"
+      ],
+      "engines": {
+        "node": "^8.16.0 || ^10.6.0 || >=11.0.0"
+      }
+    },
+    "node_modules/tsx": {
+      "version": "4.22.4",
+      "resolved": "https://registry.npmjs.org/tsx/-/tsx-4.22.4.tgz",
+      "integrity": "sha512-X8EX+XV4QR5xCsrgxaED954zTDfY8KqlDtskKEL0cHhyS/P8b4IFOvGDQpsC9Q1XnLq915wEfwwY/zzskCtmhg==",
+      "dev": true,
+      "license": "MIT",
+      "dependencies": {
+        "esbuild": "~0.28.0"
+      },
+      "bin": {
+        "tsx": "dist/cli.mjs"
+      },
+      "engines": {
+        "node": ">=18.0.0"
+      },
+      "optionalDependencies": {
+        "fsevents": "~2.3.3"
+      }
+    }
+  }
+}
diff --git a/scripts/experiments/gepa-flowchart/package.json b/scripts/experiments/gepa-flowchart/package.json
index 6bf8617..6d32a40 100644
--- a/scripts/experiments/gepa-flowchart/package.json
+++ b/scripts/experiments/gepa-flowchart/package.json
@@ -1,5 +1,6 @@
 {
   "name": "gepa-flowchart-bridge",
   "private": true,
+  "type": "module",
   "devDependencies": { "tsx": "^4.19.0" }
 }
diff --git a/scripts/experiments/gepa-flowchart/tests/test_geometry_bridge.py b/scripts/experiments/gepa-flowchart/tests/test_geometry_bridge.py
new file mode 100644
index 0000000..60b403c
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/tests/test_geometry_bridge.py
@@ -0,0 +1,38 @@
+import json
+
+from gepa_flowchart.geometry_bridge import validate_flow
+
+CLEAN = json.dumps({
+    "direction": "TB",
+    "nodes": [{"id": "a", "data": {"label": "A"}}, {"id": "b", "data": {"label": "B"}}],
+    "edges": [{"source": "a", "target": "b"}],
+})
+
+# a -> c with b sitting on the a-c line (explicit positions force the overlap)
+EDGE_OVER_NODE = json.dumps({
+    "layout": "manual", "direction": "LR",
+    "nodes": [
+        {"id": "a", "position": {"x": 0, "y": 0}},
+        {"id": "b", "position": {"x": 200, "y": 0}},
+        {"id": "c", "position": {"x": 440, "y": 0}},
+    ],
+    "edges": [{"source": "a", "target": "c"}],
+})
+
+
+def test_clean_spec_is_valid_no_errors():
+    r = validate_flow(CLEAN)
+    assert r["valid"] is True
+    assert [f for f in r["findings"] if f["severity"] == "error"] == []
+
+
+def test_edge_over_node_flagged():
+    r = validate_flow(EDGE_OVER_NODE)
+    assert r["valid"] is True
+    assert any(f["code"] == "edge-over-node" for f in r["findings"])
+
+
+def test_invalid_json_reported():
+    r = validate_flow("{ not json")
+    assert r["valid"] is False
+    assert r["error"]

From 8a508bb1af16743b53aa5a6186842acfe86a3215 Mon Sep 17 00:00:00 2001
From: Ivan Cheung <ivanmkc@google.com>
Date: Thu, 18 Jun 2026 04:45:57 +0000
Subject: [PATCH 05/24] feat(gepa): Anthropic SDK wrappers + reflection
 callable

---
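Note (illustrative, not part of the commit): `complete` returns only the text blocks of a response and drops every other block type. The sketch below mirrors that filtering with `SimpleNamespace` stand-ins instead of real SDK objects. `join_text_blocks` and the sample blocks are hypothetical names used only for this illustration.

```python
from types import SimpleNamespace


def join_text_blocks(content) -> str:
    """Concatenate the text of blocks whose type is "text"; skip everything else."""
    return "".join(b.text for b in content if getattr(b, "type", None) == "text")


# Hypothetical response content: one non-text block followed by two text blocks.
content = [
    SimpleNamespace(type="thinking", thinking="(internal reasoning)"),
    SimpleNamespace(type="text", text="hello "),
    SimpleNamespace(type="text", text="world"),
]

joined = join_text_blocks(content)
print(joined)
```

Using `getattr(b, "type", None)` means a block without a `type` attribute is skipped rather than raising.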
 .../gepa-flowchart/gepa_flowchart/llm.py      | 47 +++++++++++++++++++
 .../gepa-flowchart/tests/test_llm.py          | 34 ++++++++++++++
 2 files changed, 81 insertions(+)
 create mode 100644 scripts/experiments/gepa-flowchart/gepa_flowchart/llm.py
 create mode 100644 scripts/experiments/gepa-flowchart/tests/test_llm.py

diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/llm.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/llm.py
new file mode 100644
index 0000000..cd01808
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/llm.py
@@ -0,0 +1,47 @@
+from __future__ import annotations
+
+from typing import Callable
+
+_client = None
+
+
+def get_client():
+    global _client
+    if _client is None:
+        import anthropic
+        _client = anthropic.Anthropic()
+    return _client
+
+
+def complete(
+    system: str,
+    user: str,
+    *,
+    model: str,
+    effort: str = "medium",
+    max_tokens: int = 4096,
+    client=None,
+) -> str:
+    client = client or get_client()
+    resp = client.messages.create(
+        model=model,
+        max_tokens=max_tokens,
+        system=system,
+        output_config={"effort": effort},
+        messages=[{"role": "user", "content": user}],
+    )
+    return "".join(b.text for b in resp.content if getattr(b, "type", None) == "text")
+
+
+def make_reflection_callable(model: str, *, client=None) -> Callable[[str], str]:
+    def reflect(prompt: str) -> str:
+        return complete(
+            "You are an expert prompt engineer improving prompts based on feedback.",
+            prompt,
+            model=model,
+            effort="high",
+            max_tokens=8000,
+            client=client,
+        )
+
+    return reflect
diff --git a/scripts/experiments/gepa-flowchart/tests/test_llm.py b/scripts/experiments/gepa-flowchart/tests/test_llm.py
new file mode 100644
index 0000000..b876aee
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/tests/test_llm.py
@@ -0,0 +1,34 @@
+from gepa_flowchart import llm
+
+
+class FakeMessages:
+    def __init__(self, recorder):
+        self.recorder = recorder
+
+    def create(self, **kwargs):
+        self.recorder.update(kwargs)
+        block = type("Block", (), {"type": "text", "text": "hello world"})()
+        return type("Resp", (), {"content": [block]})()
+
+
+class FakeClient:
+    def __init__(self):
+        self.calls = {}
+        self.messages = FakeMessages(self.calls)
+
+
+def test_complete_extracts_text_and_passes_params():
+    fake = FakeClient()
+    out = llm.complete("sys", "usr", model="claude-opus-4-8", effort="high", client=fake)
+    assert out == "hello world"
+    assert fake.calls["model"] == "claude-opus-4-8"
+    assert fake.calls["output_config"] == {"effort": "high"}
+    assert "temperature" not in fake.calls  # must never be sent
+    assert fake.calls["system"] == "sys"
+    assert fake.calls["messages"] == [{"role": "user", "content": "usr"}]
+
+
+def test_reflection_callable_returns_str():
+    fake = FakeClient()
+    fn = llm.make_reflection_callable("claude-opus-4-8", client=fake)
+    assert fn("reflect on this") == "hello world"

From 1b168602142516427af7cf647febf3194969a278 Mon Sep 17 00:00:00 2001
From: Ivan Cheung <ivanmkc@google.com>
Date: Thu, 18 Jun 2026 04:48:00 +0000
Subject: [PATCH 06/24] feat(gepa): board-to-text serializer + JSON extractor

---
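Note (illustrative, not part of the commit): the JSON extractor relies on brace balancing that ignores braces inside string literals. The sketch below is a standalone version of that idea, assuming the input holds at most one top-level object of interest. `first_balanced_object` is a hypothetical name, not the function added in this patch.

```python
import json


def first_balanced_object(s: str):
    """Return the first brace-balanced {...} substring, or None if there is none."""
    start = s.find("{")
    if start == -1:
        return None
    depth, in_str, esc = 0, False, False
    for i in range(start, len(s)):
        ch = s[i]
        if in_str:
            # Inside a string literal: only track escapes and the closing quote.
            if esc:
                esc = False
            elif ch == "\\":
                esc = True
            elif ch == '"':
                in_str = False
            continue
        if ch == '"':
            in_str = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return s[start : i + 1]
    return None  # ran out of input before the object closed


raw = 'Sure: {"label": "use } carefully", "n": {"k": 1}} trailing } text'
obj = first_balanced_object(raw)
print(json.loads(obj))
```

A plain `raw[raw.find("{") : raw.rfind("}") + 1]` would include the stray trailing brace here and fail to parse, which is why the scan tracks nesting depth and string state.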
 .../gepa-flowchart/gepa_flowchart/render.py   | 82 +++++++++++++++++++
 .../gepa-flowchart/tests/test_render.py       | 35 ++++++++
 2 files changed, 117 insertions(+)
 create mode 100644 scripts/experiments/gepa-flowchart/gepa_flowchart/render.py
 create mode 100644 scripts/experiments/gepa-flowchart/tests/test_render.py

diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/render.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/render.py
new file mode 100644
index 0000000..c913efe
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/render.py
@@ -0,0 +1,82 @@
+from __future__ import annotations
+
+
+def board_to_text(flow: dict) -> str:
+    lines: list[str] = []
+    direction = flow.get("direction", "TB")
+    lines.append(f"Flowchart (direction: {direction})")
+    groups = {g["id"]: g for g in flow.get("groups", []) if isinstance(g, dict) and "id" in g}
+    nodes = [n for n in flow.get("nodes", []) if isinstance(n, dict)]
+
+    lines.append("\nNodes:")
+    for n in nodes:
+        nid = n.get("id", "?")
+        data = n.get("data", {}) if isinstance(n.get("data"), dict) else {}
+        label = data.get("label", "(no label)")
+        status = data.get("status")
+        grp = groups.get(n.get("group", ""), {}).get("label")
+        suffix = []
+        if grp:
+            suffix.append(f"group: {grp}")
+        if status:
+            suffix.append(f"status: {status}")
+        extra = f" [{', '.join(suffix)}]" if suffix else ""
+        lines.append(f"  - {nid}: {label}{extra}")
+
+    label_by_id = {
+        n.get("id"): (n["data"] if isinstance(n.get("data"), dict) else {}).get("label", n.get("id"))
+        for n in nodes
+    }
+    lines.append("\nConnections:")
+    for e in flow.get("edges", []):
+        if not isinstance(e, dict):
+            continue
+        src = label_by_id.get(e.get("source"), e.get("source"))
+        tgt = label_by_id.get(e.get("target"), e.get("target"))
+        elabel = (e.get("data", {}) or {}).get("label") if isinstance(e.get("data"), dict) else None
+        arrow = f' --[{elabel}]-->' if elabel else " -->"
+        lines.append(f"  - {src}{arrow} {tgt}")
+    return "\n".join(lines)
+
+
+def extract_json(text: str) -> str | None:
+    if "```" in text:
+        # scan the chunks after each fence marker; return the first balanced JSON object
+        parts = text.split("```")
+        for chunk in parts[1:]:
+            body = chunk
+            if body.lstrip().lower().startswith("json"):
+                body = body.lstrip()[4:]
+            body = body.strip()
+            if body.startswith("{"):
+                bal = _balanced(body)
+                if bal:
+                    return bal
+    start = text.find("{")
+    if start == -1:
+        return None
+    return _balanced(text[start:])
+
+
+def _balanced(s: str) -> str | None:
+    depth = 0
+    in_str = False
+    esc = False
+    for i, ch in enumerate(s):
+        if in_str:
+            if esc:
+                esc = False
+            elif ch == "\\":
+                esc = True
+            elif ch == '"':
+                in_str = False
+            continue
+        if ch == '"':
+            in_str = True
+        elif ch == "{":
+            depth += 1
+        elif ch == "}":
+            depth -= 1
+            if depth == 0:
+                return s[: i + 1]
+    return None
diff --git a/scripts/experiments/gepa-flowchart/tests/test_render.py b/scripts/experiments/gepa-flowchart/tests/test_render.py
new file mode 100644
index 0000000..83b68be
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/tests/test_render.py
@@ -0,0 +1,35 @@
+import json
+
+from gepa_flowchart.render import board_to_text, extract_json
+
+
+def test_board_to_text_includes_labels_and_edges():
+    flow = {
+        "direction": "TB",
+        "groups": [{"id": "g1", "label": "Frontend"}],
+        "nodes": [
+            {"id": "a", "group": "g1", "data": {"label": "Login form"}},
+            {"id": "b", "data": {"label": "Auth service", "status": "active"}},
+        ],
+        "edges": [{"source": "a", "target": "b", "data": {"label": "submit"}}],
+    }
+    text = board_to_text(flow)
+    assert "Login form" in text
+    assert "Auth service" in text
+    assert "Frontend" in text
+    assert "submit" in text
+
+
+def test_extract_json_from_fenced():
+    raw = 'Here:\n```json\n{"nodes": [], "edges": []}\n```\nDone.'
+    out = extract_json(raw)
+    assert json.loads(out) == {"nodes": [], "edges": []}
+
+
+def test_extract_json_bare():
+    out = extract_json('prefix {"x": 1} suffix')
+    assert json.loads(out) == {"x": 1}
+
+
+def test_extract_json_none_when_absent():
+    assert extract_json("no json here") is None

From 30bb7e98927635b82c024267d0a097c633d5c5b2 Mon Sep 17 00:00:00 2001
From: Ivan Cheung <ivanmkc@google.com>
Date: Thu, 18 Jun 2026 04:50:16 +0000
Subject: [PATCH 07/24] feat(gepa): topic dataset + reader-question
 generation/freezing

---
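Note (illustrative, not part of the commit): question generation slices from the first `[` to the last `]` and then parses the slice. The sketch below shows one possible hardening under which a malformed slice yields an empty list instead of raising. `parse_question_array` is a hypothetical helper, not code from this patch.

```python
import json


def parse_question_array(text: str, n: int) -> list[str]:
    """Parse the outermost [...] span as a JSON array of questions, capped at n items."""
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end < start:
        return []  # no bracketed span at all
    try:
        qs = json.loads(text[start : end + 1])
    except json.JSONDecodeError:
        return []  # bracketed span present but not valid JSON
    return [str(q) for q in qs][:n]


good = parse_question_array('Here you go: ["Q1?", "Q2?", "Q3?"] Done.', 2)
bad = parse_question_array("[{ truncated ]", 3)
print(good, bad)
```

Returning an empty list keeps a single bad model response from aborting the whole freeze step; whether an empty question set should instead fail loudly is a design choice left to the caller.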
 .../gepa-flowchart/gepa_flowchart/dataset.py  | 62 +++++++++++++++++++
 .../gepa-flowchart/tests/test_dataset.py      | 38 ++++++++++++
 .../gepa-flowchart/topics/topics.json         | 14 +++++
 3 files changed, 114 insertions(+)
 create mode 100644 scripts/experiments/gepa-flowchart/gepa_flowchart/dataset.py
 create mode 100644 scripts/experiments/gepa-flowchart/tests/test_dataset.py
 create mode 100644 scripts/experiments/gepa-flowchart/topics/topics.json

diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/dataset.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/dataset.py
new file mode 100644
index 0000000..4313004
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/dataset.py
@@ -0,0 +1,62 @@
+from __future__ import annotations
+
+import json
+from dataclasses import dataclass, field
+from pathlib import Path
+
+
+@dataclass
+class Task:
+    id: str
+    topic: str
+    audience: str
+    purpose: str
+    questions: list[str] = field(default_factory=list)
+
+
+def load_topics(path: str) -> list[Task]:
+    raw = json.loads(Path(path).read_text())
+    return [Task(id=r["id"], topic=r["topic"], audience=r["audience"], purpose=r["purpose"]) for r in raw]
+
+
+_QGEN_SYSTEM = (
+    "You write the questions a first-time reader of a diagram would need answered "
+    "to actually understand the subject. Output ONLY a JSON array of concise question strings."
+)
+
+
+def generate_questions(task: Task, *, model: str, n: int, complete_fn) -> list[str]:
+    user = (
+        f"Topic: {task.topic}\nAudience: {task.audience}\nPurpose: {task.purpose}\n\n"
+        f"List exactly {n} distinct questions this reader should be able to answer "
+        f"from a good flowchart on this topic. JSON array of strings only."
+    )
+    out = complete_fn(_QGEN_SYSTEM, user, model=model, effort="medium", max_tokens=1000)
+    blob = extract_array(out)
+    qs = json.loads(blob) if blob else []
+    return [str(q) for q in qs][:n]
+
+
+def extract_array(text: str) -> str | None:
+    start = text.find("[")
+    end = text.rfind("]")
+    if start == -1 or end == -1 or end < start:
+        return None
+    return text[start : end + 1]
+
+
+def freeze_questions(tasks: list[Task], run_dir: str, *, model: str, n: int, complete_fn) -> list[Task]:
+    Path(run_dir).mkdir(parents=True, exist_ok=True)
+    frozen: dict[str, list[str]] = {}
+    for t in tasks:
+        t.questions = generate_questions(t, model=model, n=n, complete_fn=complete_fn)
+        frozen[t.id] = t.questions
+    (Path(run_dir) / "frozen_questions.json").write_text(json.dumps(frozen, indent=2))
+    return tasks
+
+
+def load_frozen(tasks: list[Task], run_dir: str) -> list[Task]:
+    frozen = json.loads((Path(run_dir) / "frozen_questions.json").read_text())
+    for t in tasks:
+        t.questions = frozen.get(t.id, [])
+    return tasks
diff --git a/scripts/experiments/gepa-flowchart/tests/test_dataset.py b/scripts/experiments/gepa-flowchart/tests/test_dataset.py
new file mode 100644
index 0000000..062388b
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/tests/test_dataset.py
@@ -0,0 +1,38 @@
+import json
+from pathlib import Path
+
+from gepa_flowchart.dataset import (
+    Task,
+    load_topics,
+    generate_questions,
+    freeze_questions,
+    load_frozen,
+)
+
+TOPICS = str(Path(__file__).resolve().parent.parent / "topics" / "topics.json")
+
+
+def fake_complete(system, user, *, model, **kw):
+    # Return a JSON array of questions, regardless of input.
+    return '["Q1?", "Q2?", "Q3?"]'
+
+
+def test_load_topics():
+    tasks = load_topics(TOPICS)
+    assert len(tasks) == 12
+    assert all(isinstance(t, Task) and t.id and t.topic for t in tasks)
+
+
+def test_generate_questions_parses_list():
+    t = Task(id="x", topic="T", audience="A", purpose="P")
+    qs = generate_questions(t, model="m", n=3, complete_fn=fake_complete)
+    assert qs == ["Q1?", "Q2?", "Q3?"]
+
+
+def test_freeze_and_reload(tmp_path):
+    tasks = load_topics(TOPICS)[:2]
+    frozen = freeze_questions(tasks, str(tmp_path), model="m", n=3, complete_fn=fake_complete)
+    assert all(t.questions for t in frozen)
+    assert (tmp_path / "frozen_questions.json").exists()
+    reloaded = load_frozen(load_topics(TOPICS)[:2], str(tmp_path))
+    assert reloaded[0].questions == frozen[0].questions
diff --git a/scripts/experiments/gepa-flowchart/topics/topics.json b/scripts/experiments/gepa-flowchart/topics/topics.json
new file mode 100644
index 0000000..110e513
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/topics/topics.json
@@ -0,0 +1,14 @@
+[
+  {"id": "ci-cd", "topic": "CI/CD pipeline for a web app", "audience": "an engineer new to the team", "purpose": "understand how code reaches production and what can go wrong"},
+  {"id": "user-auth", "topic": "User authentication and session flow", "audience": "a backend developer", "purpose": "understand login, token issuance, and refresh"},
+  {"id": "order-fulfillment", "topic": "E-commerce order fulfillment", "audience": "an operations analyst", "purpose": "trace an order from checkout to delivery"},
+  {"id": "incident-response", "topic": "On-call incident response process", "audience": "a new on-call engineer", "purpose": "know what to do when paged"},
+  {"id": "data-pipeline", "topic": "Batch ETL data pipeline", "audience": "a data engineer", "purpose": "understand ingestion, transform, and load stages"},
+  {"id": "pr-review", "topic": "Pull request review and merge process", "audience": "a contributor", "purpose": "know the path from PR open to merge"},
+  {"id": "support-triage", "topic": "Customer support ticket triage", "audience": "a support agent", "purpose": "route and escalate tickets correctly"},
+  {"id": "payment-processing", "topic": "Online payment processing and retries", "audience": "a fintech engineer", "purpose": "understand auth, capture, and failure handling"},
+  {"id": "onboarding", "topic": "New employee onboarding workflow", "audience": "an HR coordinator", "purpose": "track steps from offer to first day"},
+  {"id": "state-machine", "topic": "Document approval state machine", "audience": "a product manager", "purpose": "understand statuses and transitions"},
+  {"id": "k8s-deploy", "topic": "Kubernetes rolling deployment", "audience": "a platform engineer", "purpose": "understand how a new version rolls out safely"},
+  {"id": "signup-funnel", "topic": "SaaS signup and activation funnel", "audience": "a growth analyst", "purpose": "see where users drop off between signup and activation"}
+]

From 6233d5a6b57be5a542df01c532bbf9973a46e70e Mon Sep 17 00:00:00 2001
From: Ivan Cheung <ivanmkc@google.com>
Date: Thu, 18 Jun 2026 04:52:44 +0000
Subject: [PATCH 08/24] feat(gepa): seed brainstorm/generate prompts + pipeline

---
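Note (illustrative, not part of the commit): the pipeline fills candidate prompts with `str.format`, so a candidate that contains literal braces (for example an inline JSON snippet) would raise at format time. The sketch below shows one possible alternative that substitutes only known placeholders in a single pass. `fill` and the sample template are hypothetical and are not used by this patch.

```python
import re


def fill(template: str, **values) -> str:
    """Replace only {name} placeholders whose name is in values; leave other braces alone."""
    pattern = re.compile(r"\{(" + "|".join(map(re.escape, values)) + r")\}")
    # Single pass: substituted text is never rescanned for placeholders.
    return pattern.sub(lambda m: str(values[m.group(1)]), template)


template = 'Topic: {topic}\nReturn JSON like {"nodes": []}.\nPlan: {plan}'

# str.format treats the literal JSON braces as a replacement field and fails.
try:
    template.format(topic="CI/CD", plan="use {topic}")
    format_failed = False
except (KeyError, ValueError, IndexError):
    format_failed = True

filled = fill(template, topic="CI/CD", plan="use {topic}")
print(format_failed)
print(filled)
```

This matters mainly if an optimizer rewrites the prompt text freely, since it cannot be relied on to double every literal brace.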
 .../gepa-flowchart/gepa_flowchart/pipeline.py | 31 ++++++++++++
 .../gepa_flowchart/seed_prompts.py            | 47 +++++++++++++++++++
 .../gepa-flowchart/tests/test_pipeline.py     | 29 ++++++++++++
 3 files changed, 107 insertions(+)
 create mode 100644 scripts/experiments/gepa-flowchart/gepa_flowchart/pipeline.py
 create mode 100644 scripts/experiments/gepa-flowchart/gepa_flowchart/seed_prompts.py
 create mode 100644 scripts/experiments/gepa-flowchart/tests/test_pipeline.py

diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/pipeline.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/pipeline.py
new file mode 100644
index 0000000..fee812d
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/pipeline.py
@@ -0,0 +1,31 @@
+from __future__ import annotations
+
+from .dataset import Task
+from .render import extract_json
+from .seed_prompts import SKILL_CONTEXT, SEED_CANDIDATE
+
+
+def run_pipeline(candidate: dict, task: Task, *, model: str, complete_fn, max_tokens: int):
+    trace: dict = {}
+    brainstorm_user = candidate["brainstorm"].format(
+        topic=task.topic, audience=task.audience, purpose=task.purpose
+    )
+    trace["brainstorm_input"] = brainstorm_user
+    plan = complete_fn(
+        "You are a diagram planner.", brainstorm_user, model=model, effort="medium", max_tokens=2000
+    )
+    trace["plan"] = plan
+
+    generate_user = candidate["generate"].format(
+        skill_context=SKILL_CONTEXT,
+        topic=task.topic,
+        audience=task.audience,
+        purpose=task.purpose,
+        plan=plan,
+    )
+    trace["generate_input"] = generate_user
+    raw = complete_fn(
+        "You output only valid JSON.", generate_user, model=model, effort="medium", max_tokens=max_tokens
+    )
+    trace["raw_generation"] = raw
+    return extract_json(raw), trace
diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/seed_prompts.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/seed_prompts.py
new file mode 100644
index 0000000..fa515df
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/seed_prompts.py
@@ -0,0 +1,47 @@
+from __future__ import annotations
+
+from pathlib import Path
+
+_FLOW_SCHEMA = """A termchart `flow` spec is JSON:
+{
+  "direction": "TB" | "LR" | "BT" | "RL",   // default TB; prefer TB for processes
+  "nodes": [{ "id": "x", "data": { "label": "...", "status": "active|info|warn|success|neutral" }, "group": "g1?" }],
+  "edges": [{ "source": "x", "target": "y", "data": { "label": "when/condition?" } }],
+  "groups": [{ "id": "g1", "label": "Lane/Zone name", "color": "#hex?" }]
+}
+Rules: every edge source/target must be an existing node id; keep it under ~24 nodes; labels carry the meaning."""
+
+
+def _example() -> str:
+    candidates = [
+        "call-hierarchy.flow.json",
+        "okr-tree.flow.json",
+        "binary-search.flow.json",
+    ]
+    base = Path(__file__).resolve().parents[3] / "plugin" / "skills" / "diagram-recipes" / "examples"
+    for name in candidates:
+        p = base / name
+        if p.exists():
+            return p.read_text()
+    return '{"direction":"TB","nodes":[{"id":"a","data":{"label":"Start"}}],"edges":[]}'
+
+
+SKILL_CONTEXT = f"{_FLOW_SCHEMA}\n\nExample of a well-formed flow spec:\n{_example()}"
+
+SEED_CANDIDATE = {
+    "brainstorm": (
+        "You are planning a flowchart.\n"
+        "Topic: {topic}\nAudience: {audience}\nPurpose: {purpose}\n\n"
+        "Decide what the flowchart must show so this reader can understand the subject. "
+        "List the key steps/states, the decisions and what triggers each branch, and the "
+        "context a newcomer needs. Output a concise plan in plain text."
+    ),
+    "generate": (
+        "You generate a termchart `flow` diagram as JSON.\n\n"
+        "{skill_context}\n\n"
+        "Topic: {topic}\nAudience: {audience}\nPurpose: {purpose}\n\n"
+        "Plan to follow:\n{plan}\n\n"
+        "Produce a single flow JSON object. Use clear, specific labels; label edges with the "
+        "condition/trigger; group related nodes. Output ONLY the JSON."
+    ),
+}
diff --git a/scripts/experiments/gepa-flowchart/tests/test_pipeline.py b/scripts/experiments/gepa-flowchart/tests/test_pipeline.py
new file mode 100644
index 0000000..25aa44b
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/tests/test_pipeline.py
@@ -0,0 +1,29 @@
+import json
+
+from gepa_flowchart.dataset import Task
+from gepa_flowchart.pipeline import run_pipeline, SEED_CANDIDATE
+
+
+def test_seed_candidate_has_two_components():
+    assert set(SEED_CANDIDATE.keys()) == {"brainstorm", "generate"}
+    assert "{topic}" in SEED_CANDIDATE["brainstorm"]
+    assert "{plan}" in SEED_CANDIDATE["generate"]
+
+
+def test_run_pipeline_threads_plan_into_generation():
+    calls = []
+    flow = {"direction": "TB", "nodes": [{"id": "a", "data": {"label": "A"}}], "edges": []}
+
+    def fake_complete(system, user, *, model, **kw):
+        calls.append(user)
+        if len(calls) == 1:
+            return "PLAN: show A then B"
+        return f"```json\n{json.dumps(flow)}\n```"
+
+    task = Task(id="x", topic="Topic T", audience="Aud", purpose="Pur")
+    out, trace = run_pipeline(SEED_CANDIDATE, task, model="m", complete_fn=fake_complete, max_tokens=4000)
+    assert json.loads(out) == flow
+    assert trace["plan"] == "PLAN: show A then B"
+    # the plan must be threaded into the generate step's prompt
+    assert "PLAN: show A then B" in calls[1]
+    assert "Topic T" in calls[0]

From 82d48acabbd36f5513464924aecc0202de982f70 Mon Sep 17 00:00:00 2001
From: Ivan Cheung <ivanmkc@google.com>
Date: Thu, 18 Jun 2026 04:55:50 +0000
Subject: [PATCH 09/24] fix(gepa): load real gallery example in SKILL_CONTEXT
 (parents[4])

---
 .../gepa-flowchart/gepa_flowchart/seed_prompts.py          | 2 +-
 scripts/experiments/gepa-flowchart/tests/test_pipeline.py  | 7 +++++++
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/seed_prompts.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/seed_prompts.py
index fa515df..6cb7ab5 100644
--- a/scripts/experiments/gepa-flowchart/gepa_flowchart/seed_prompts.py
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/seed_prompts.py
@@ -18,7 +18,7 @@ def _example() -> str:
         "okr-tree.flow.json",
         "binary-search.flow.json",
     ]
-    base = Path(__file__).resolve().parents[3] / "plugin" / "skills" / "diagram-recipes" / "examples"
+    base = Path(__file__).resolve().parents[4] / "plugin" / "skills" / "diagram-recipes" / "examples"
     for name in candidates:
         p = base / name
         if p.exists():
diff --git a/scripts/experiments/gepa-flowchart/tests/test_pipeline.py b/scripts/experiments/gepa-flowchart/tests/test_pipeline.py
index 25aa44b..6ce1954 100644
--- a/scripts/experiments/gepa-flowchart/tests/test_pipeline.py
+++ b/scripts/experiments/gepa-flowchart/tests/test_pipeline.py
@@ -2,6 +2,7 @@
 
 from gepa_flowchart.dataset import Task
 from gepa_flowchart.pipeline import run_pipeline, SEED_CANDIDATE
+from gepa_flowchart.seed_prompts import SKILL_CONTEXT
 
 
 def test_seed_candidate_has_two_components():
@@ -27,3 +28,9 @@ def fake_complete(system, user, *, model, **kw):
     # the plan must be threaded into the generate step's prompt
     assert "PLAN: show A then B" in calls[1]
     assert "Topic T" in calls[0]
+
+
+def test_skill_context_loads_real_example():
+    # The real gallery example is far larger than the 75-char fallback stub.
+    assert len(SKILL_CONTEXT) > 400
+    assert "Example of a well-formed flow spec:" in SKILL_CONTEXT

From c69dc8be74d05c3e2722576ed5bd138606a398d6 Mon Sep 17 00:00:00 2001
From: Ivan Cheung <ivanmkc@google.com>
Date: Thu, 18 Jun 2026 04:57:06 +0000
Subject: [PATCH 10/24] feat(gepa): three-part metric (validity gate, geometry,
 comprehension)

---
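Note (illustrative, not part of the commit): the total score is a weighted sum of comprehension and geometry, and geometry is 1.0 minus per-finding penalties, clamped to [0, 1]. The sketch below works one example. The weights and penalties are hypothetical values chosen for easy arithmetic; the real ones come from `Config`, which is not shown in this patch.

```python
def geometry_score(n_errors: int, n_warnings: int, error_penalty: float, warning_penalty: float) -> float:
    """1.0 minus a penalty per finding, clamped to the range [0, 1]."""
    raw = 1.0 - error_penalty * n_errors - warning_penalty * n_warnings
    return max(0.0, min(1.0, raw))


def total_score(comp: float, geom: float, w_comp: float, w_geom: float) -> float:
    """Weighted sum of the comprehension and geometry components."""
    return w_comp * comp + w_geom * geom


# Hypothetical settings: one error and two warnings on a half-understood board.
geom = geometry_score(1, 2, error_penalty=0.25, warning_penalty=0.125)  # 1 - 0.25 - 0.25
total = total_score(0.5, geom, w_comp=0.75, w_geom=0.25)                # 0.375 + 0.125
print(geom, total)

# Many findings cannot push geometry below zero.
floor = geometry_score(10, 0, error_penalty=0.25, warning_penalty=0.125)
print(floor)
```

An invalid board skips this arithmetic and scores 0.0 outright, so the weights only matter once the validity gate has passed.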
 .../gepa-flowchart/gepa_flowchart/metric.py   | 90 +++++++++++++++++++
 .../gepa-flowchart/tests/test_metric.py       | 45 ++++++++++
 2 files changed, 135 insertions(+)
 create mode 100644 scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py
 create mode 100644 scripts/experiments/gepa-flowchart/tests/test_metric.py

diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py
new file mode 100644
index 0000000..fbfc8d0
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py
@@ -0,0 +1,90 @@
+from __future__ import annotations
+
+import json
+from dataclasses import dataclass
+
+from .config import Config
+from .dataset import Task, extract_array
+from .render import board_to_text
+
+
+@dataclass
+class ScoreResult:
+    score: float
+    feedback: str
+    comp: float
+    geom: float
+    valid: bool
+
+
+def geometry_score(findings: list[dict], cfg: Config) -> tuple[float, str]:
+    errs = [f for f in findings if f.get("severity") == "error"]
+    warns = [f for f in findings if f.get("severity") == "warning"]
+    score = 1.0 - cfg.geom_error_penalty * len(errs) - cfg.geom_warning_penalty * len(warns)
+    score = max(0.0, min(1.0, score))
+    if not findings:
+        return score, "Geometry: clean (no findings)."
+    msgs = "; ".join(f"[{f.get('severity')}] {f.get('code')}: {f.get('message')}" for f in findings)
+    return score, f"Geometry findings: {msgs}"
+
+
+_READER_SYSTEM = (
+    "You are seeing this flowchart for the FIRST TIME and know nothing else about it. "
+    "Answer each question using ONLY what the board shows. If the board does not tell you, "
+    "answer exactly 'not shown'. Output ONLY a JSON array of {\"q\": question, \"a\": answer}."
+)
+
+_JUDGE_SYSTEM = (
+    "You grade how well a first-time reader's answers are supported by a flowchart. For each "
+    "question score 0..1 for correctness AND whether the board actually supports the answer. "
+    "An answer of 'not shown' or one not supported by the board scores low and is a context gap. "
+    "Output ONLY a JSON array of {\"q\", \"score\", \"supported\" (bool), \"reason\"}."
+)
+
+
+def comprehension_score(board_text: str, questions: list[str], *, reader_model: str, judge_model: str, complete_fn) -> tuple[float, str]:
+    qlist = "\n".join(f"- {q}" for q in questions)
+    reader_out = complete_fn(
+        _READER_SYSTEM,
+        f"BOARD:\n{board_text}\n\nQUESTIONS:\n{qlist}",
+        model=reader_model,
+        effort="low",
+        max_tokens=1500,
+    )
+    judge_out = complete_fn(
+        _JUDGE_SYSTEM,
+        f"BOARD:\n{board_text}\n\nQUESTIONS:\n{qlist}\n\nREADER ANSWERS:\n{reader_out}",
+        model=judge_model,
+        effort="high",
+        max_tokens=1500,
+    )
+    blob = extract_array(judge_out)
+    rows = json.loads(blob) if blob else []
+    if not rows:
+        return 0.0, "Comprehension: judge produced no parseable scores."
+    scores = [float(r.get("score", 0.0)) for r in rows]
+    comp = sum(scores) / len(scores)
+    gaps = [r for r in rows if not r.get("supported", False) or float(r.get("score", 0.0)) < 0.5]
+    if gaps:
+        gap_txt = "; ".join(f"'{r.get('q')}' — {r.get('reason')}" for r in gaps)
+        fb = f"Comprehension {comp:.2f}. Reader could not answer (add detail/context): {gap_txt}"
+    else:
+        fb = f"Comprehension {comp:.2f}. Reader answered all questions from the board."
+    return comp, fb
+
+
+def score_board(content, task: Task, cfg: Config, *, validate_fn, complete_fn) -> ScoreResult:
+    if not content:
+        return ScoreResult(0.0, "Generation produced no JSON object.", 0.0, 0.0, False)
+    report = validate_fn(content)
+    if not report.get("valid"):
+        return ScoreResult(0.0, f"Invalid flow spec: {report.get('error')}", 0.0, 0.0, False)
+
+    geom, geom_fb = geometry_score(report.get("findings", []), cfg)
+    board_text = board_to_text(json.loads(content))
+    comp, comp_fb = comprehension_score(
+        board_text, task.questions, reader_model=cfg.reader_model, judge_model=cfg.judge_model, complete_fn=complete_fn
+    )
+    total = cfg.w_comp * comp + cfg.w_geom * geom
+    feedback = f"{comp_fb}\n{geom_fb}"
+    return ScoreResult(total, feedback, comp, geom, True)
diff --git a/scripts/experiments/gepa-flowchart/tests/test_metric.py b/scripts/experiments/gepa-flowchart/tests/test_metric.py
new file mode 100644
index 0000000..9dfc50c
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/tests/test_metric.py
@@ -0,0 +1,45 @@
+import json
+
+from gepa_flowchart.config import Config
+from gepa_flowchart.dataset import Task
+from gepa_flowchart.metric import geometry_score, score_board
+
+CFG = Config.from_env({})
+TASK = Task(id="x", topic="T", audience="A", purpose="P", questions=["Q1?", "Q2?"])
+GOOD_FLOW = json.dumps({"direction": "TB", "nodes": [{"id": "a", "data": {"label": "Start"}}], "edges": []})
+
+
+def test_geometry_score_penalizes_errors():
+    clean, _ = geometry_score([], CFG)
+    assert clean == 1.0
+    err, msg = geometry_score([{"severity": "error", "code": "edge-over-node", "message": "x", "count": 1}], CFG)
+    assert err < 1.0 and "edge-over-node" in msg
+
+
+def test_invalid_board_scores_zero():
+    def validate_fn(content):
+        return {"valid": False, "error": "bad json", "findings": [], "warnings": []}
+
+    res = score_board("{bad", TASK, CFG, validate_fn=validate_fn, complete_fn=lambda *a, **k: "")
+    assert res.score == 0.0 and res.valid is False
+    assert "bad json" in res.feedback
+
+
+def test_missing_context_lowers_comp_and_is_fed_back():
+    def validate_fn(content):
+        return {"valid": True, "error": None, "findings": [], "warnings": []}
+
+    # reader answers; judge marks Q2 unsupported
+    def complete_fn(system, user, *, model, **kw):
+        if "first time" in system.lower():  # reader
+            return json.dumps([{"q": "Q1?", "a": "Start the process"}, {"q": "Q2?", "a": "not shown"}])
+        # judge
+        return json.dumps([
+            {"q": "Q1?", "score": 1.0, "supported": True, "reason": "shown"},
+            {"q": "Q2?", "score": 0.0, "supported": False, "reason": "board never shows the failure path"},
+        ])
+
+    res = score_board(GOOD_FLOW, TASK, CFG, validate_fn=validate_fn, complete_fn=complete_fn)
+    assert 0.0 < res.comp < 1.0
+    assert res.geom == 1.0
+    assert "failure path" in res.feedback

From 34cf94ca17c5d8bd2efec62fab7821ac12ec3af5 Mon Sep 17 00:00:00 2001
From: Ivan Cheung <ivanmkc@google.com>
Date: Thu, 18 Jun 2026 04:59:05 +0000
Subject: [PATCH 11/24] fix(gepa): judge parse degrades safely on malformed
 JSON

---
 .../gepa-flowchart/gepa_flowchart/metric.py       |  5 ++++-
 .../gepa-flowchart/tests/test_metric.py           | 15 +++++++++++++++
 2 files changed, 19 insertions(+), 1 deletion(-)
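
Note: the guard in this patch can be read as the sketch below. It assumes the judge output has already been reduced to a JSON-array string (what `extract_array` returns in the diff). The names `parse_judge_rows` and `mean_score` are hypothetical, and the extra checks for non-list payloads and non-numeric scores go beyond what the patch does.

```python
import json


def parse_judge_rows(blob):
    """Return a list of dict rows, or [] when the blob is missing or malformed."""
    if not blob:
        return []
    try:
        rows = json.loads(blob)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return []
    if not isinstance(rows, list):
        return []
    # Assumption: rows that are not JSON objects are dropped rather than scored.
    return [r for r in rows if isinstance(r, dict)]


def mean_score(rows):
    """Average the per-question scores; an unparseable score counts as 0.0."""
    scores = []
    for r in rows:
        try:
            scores.append(float(r.get("score", 0.0)))
        except (TypeError, ValueError):
            scores.append(0.0)
    return sum(scores) / len(scores) if scores else 0.0
```

Catching `ValueError` alone would be enough for the parse step; listing `json.JSONDecodeError` as well, as the patch does, is harmless and arguably more explicit.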

diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py
index fbfc8d0..aab40f0 100644
--- a/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py
@@ -59,7 +59,10 @@ def comprehension_score(board_text: str, questions: list[str], *, reader_model:
         max_tokens=1500,
     )
     blob = extract_array(judge_out)
-    rows = json.loads(blob) if blob else []
+    try:
+        rows = json.loads(blob) if blob else []
+    except (json.JSONDecodeError, ValueError):
+        rows = []
     if not rows:
         return 0.0, "Comprehension: judge produced no parseable scores."
     scores = [float(r.get("score", 0.0)) for r in rows]
diff --git a/scripts/experiments/gepa-flowchart/tests/test_metric.py b/scripts/experiments/gepa-flowchart/tests/test_metric.py
index 9dfc50c..25a6d68 100644
--- a/scripts/experiments/gepa-flowchart/tests/test_metric.py
+++ b/scripts/experiments/gepa-flowchart/tests/test_metric.py
@@ -43,3 +43,18 @@ def complete_fn(system, user, *, model, **kw):
     assert 0.0 < res.comp < 1.0
     assert res.geom == 1.0
     assert "failure path" in res.feedback
+
+
+def test_malformed_judge_degrades_safely():
+    from gepa_flowchart.metric import comprehension_score
+
+    def complete_fn(system, user, *, model, **kw):
+        if "first time" in system.lower():
+            return '[{"q": "Q1?", "a": "x"}]'
+        return "[{ truncated malformed"  # non-empty bracketed but invalid JSON
+
+    comp, fb = comprehension_score(
+        "board text", ["Q1?"], reader_model="r", judge_model="j", complete_fn=complete_fn
+    )
+    assert comp == 0.0
+    assert "no parseable scores" in fb

From 28fdcc035b4121e5ef8224d3139997f23e742111 Mon Sep 17 00:00:00 2001
From: Ivan Cheung <ivanmkc@google.com>
Date: Thu, 18 Jun 2026 05:01:35 +0000
Subject: [PATCH 12/24] feat(gepa): GEPAAdapter (evaluate + reflective dataset)

---
 .../gepa-flowchart/gepa_flowchart/adapter.py  | 56 +++++++++++++++++++
 .../gepa-flowchart/tests/test_adapter.py      | 42 ++++++++++++++
 2 files changed, 98 insertions(+)
 create mode 100644 scripts/experiments/gepa-flowchart/gepa_flowchart/adapter.py
 create mode 100644 scripts/experiments/gepa-flowchart/tests/test_adapter.py

diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/adapter.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/adapter.py
new file mode 100644
index 0000000..ece9108
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/adapter.py
@@ -0,0 +1,56 @@
+from __future__ import annotations
+
+from gepa.core.adapter import EvaluationBatch, GEPAAdapter
+
+from .config import Config
+from .geometry_bridge import validate_flow
+from .llm import complete
+from .metric import score_board
+from .pipeline import run_pipeline
+
+
+class FlowchartAdapter(GEPAAdapter):
+    def __init__(self, cfg: Config, *, validate_fn=validate_flow, complete_fn=complete):
+        self.cfg = cfg
+        self.validate_fn = validate_fn
+        self.complete_fn = complete_fn
+
+    def evaluate(self, batch, candidate, capture_traces=False):
+        outputs, scores, trajectories = [], [], [] if capture_traces else None
+        for task in batch:
+            content, trace = run_pipeline(
+                candidate, task, model=self.cfg.gen_model,
+                complete_fn=self.complete_fn, max_tokens=self.cfg.gen_max_tokens,
+            )
+            result = score_board(
+                content, task, self.cfg, validate_fn=self.validate_fn, complete_fn=self.complete_fn
+            )
+            outputs.append(content)
+            scores.append(result.score)
+            if capture_traces:
+                trajectories.append({"task": task, "trace": trace, "result": result, "output": content})
+        return EvaluationBatch(outputs=outputs, scores=scores, trajectories=trajectories)
+
+    def make_reflective_dataset(self, candidate, eval_batch, components_to_update):
+        out: dict[str, list[dict]] = {c: [] for c in components_to_update}
+        for traj in eval_batch.trajectories or []:
+            task = traj["task"]
+            result = traj["result"]
+            trace = traj["trace"]
+            shared_feedback = result.feedback
+            if "brainstorm" in out:
+                out["brainstorm"].append({
+                    "Inputs": f"Topic: {task.topic}\nAudience: {task.audience}\nPurpose: {task.purpose}",
+                    "Generated Outputs": trace.get("plan", ""),
+                    "Feedback": (
+                        "The plan feeds a downstream generator. The resulting board scored "
+                        f"{result.score:.2f}. " + shared_feedback
+                    ),
+                })
+            if "generate" in out:
+                out["generate"].append({
+                    "Inputs": f"Plan:\n{trace.get('plan','')}",
+                    "Generated Outputs": traj.get("output") or trace.get("raw_generation", ""),
+                    "Feedback": shared_feedback,
+                })
+        return out
diff --git a/scripts/experiments/gepa-flowchart/tests/test_adapter.py b/scripts/experiments/gepa-flowchart/tests/test_adapter.py
new file mode 100644
index 0000000..4caf04c
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/tests/test_adapter.py
@@ -0,0 +1,42 @@
+import json
+
+from gepa_flowchart.config import Config
+from gepa_flowchart.dataset import Task
+from gepa_flowchart.adapter import FlowchartAdapter
+from gepa_flowchart.pipeline import SEED_CANDIDATE
+
+CFG = Config.from_env({})
+FLOW = {"direction": "TB", "nodes": [{"id": "a", "data": {"label": "A"}}], "edges": []}
+
+
+def validate_fn(content):
+    return {"valid": True, "error": None, "findings": [], "warnings": []}
+
+
+def complete_fn(system, user, *, model, **kw):
+    if "planner" in system.lower():
+        return "plan"
+    if "only valid json" in system.lower():
+        return json.dumps(FLOW)
+    if "first time" in system.lower():
+        return json.dumps([{"q": "Q1?", "a": "A"}])
+    # judge
+    return json.dumps([{"q": "Q1?", "score": 1.0, "supported": True, "reason": "ok"}])
+
+
+def test_evaluate_returns_scores_and_traces():
+    adapter = FlowchartAdapter(CFG, validate_fn=validate_fn, complete_fn=complete_fn)
+    batch = [Task(id="x", topic="T", audience="A", purpose="P", questions=["Q1?"])]
+    out = adapter.evaluate(batch, SEED_CANDIDATE, capture_traces=True)
+    assert len(out.scores) == 1
+    assert out.scores[0] > 0.0
+    assert out.trajectories is not None and len(out.trajectories) == 1
+
+
+def test_make_reflective_dataset_has_requested_components():
+    adapter = FlowchartAdapter(CFG, validate_fn=validate_fn, complete_fn=complete_fn)
+    batch = [Task(id="x", topic="T", audience="A", purpose="P", questions=["Q1?"])]
+    ev = adapter.evaluate(batch, SEED_CANDIDATE, capture_traces=True)
+    refl = adapter.make_reflective_dataset(SEED_CANDIDATE, ev, ["brainstorm", "generate"])
+    assert set(refl.keys()) == {"brainstorm", "generate"}
+    assert refl["generate"] and "Feedback" in refl["generate"][0]

From cce5d9d1e67fe9ce82a0b91d9535a9f89c5eaf56 Mon Sep 17 00:00:00 2001
From: Ivan Cheung <ivanmkc@google.com>
Date: Thu, 18 Jun 2026 05:05:04 +0000
Subject: [PATCH 13/24] feat(gepa): run entrypoint, report, and smoke test

---
 .../gepa-flowchart/gepa_flowchart/report.py   | 28 ++++++
 .../gepa-flowchart/gepa_flowchart/run.py      | 90 +++++++++++++++++++
 .../gepa-flowchart/tests/test_report.py       |  8 ++
 .../gepa-flowchart/tests/test_smoke.py        | 13 +++
 4 files changed, 139 insertions(+)
 create mode 100644 scripts/experiments/gepa-flowchart/gepa_flowchart/report.py
 create mode 100644 scripts/experiments/gepa-flowchart/gepa_flowchart/run.py
 create mode 100644 scripts/experiments/gepa-flowchart/tests/test_report.py
 create mode 100644 scripts/experiments/gepa-flowchart/tests/test_smoke.py
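
Note: `run.py` splits the topic list with plain slices, so a `--train` value at or above the number of topics leaves the validation set empty and the report's means fall back to 0.0. A guarded variant might look like the sketch below; `split_topics` is a hypothetical helper, not part of this patch.

```python
def split_topics(topics, n_train, smoke=False):
    """Split topics into (train, val) the way run.py does, but refuse empty splits."""
    if not topics:
        raise ValueError("no topics loaded")
    if smoke:
        # Smoke mode reuses the first topic for both train and val.
        return topics[:1], topics[:1]
    if not 0 < n_train < len(topics):
        raise ValueError(
            f"--train must be between 1 and {len(topics) - 1}, got {n_train}"
        )
    return topics[:n_train], topics[n_train:]
```

With the 12 topics present at this point in the series and the default `--train 8`, this yields 8 training and 4 validation tasks.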

diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/report.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/report.py
new file mode 100644
index 0000000..2c6eb09
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/report.py
@@ -0,0 +1,28 @@
+from __future__ import annotations
+
+import json
+
+
+def _mean(xs: list[float]) -> float:
+    return sum(xs) / len(xs) if xs else 0.0
+
+
+def render_report(seed_scores, best_scores, best_candidate, val_ids) -> str:
+    seed_m, best_m = _mean(seed_scores), _mean(best_scores)
+    lines = [
+        "# GEPA Flowchart Optimization — Report",
+        "",
+        f"- Val tasks: {len(val_ids)} ({', '.join(val_ids)})",
+        f"- Seed mean score: **{seed_m:.2f}**",
+        f"- Best mean score: **{best_m:.2f}**",
+        f"- Delta: **{best_m - seed_m:+.2f}**",
+        "",
+        "## Per-task (seed → best)",
+        "",
+        "| task | seed | best |",
+        "|---|---|---|",
+    ]
+    for tid, s, b in zip(val_ids, seed_scores, best_scores):
+        lines.append(f"| {tid} | {s:.2f} | {b:.2f} |")
+    lines += ["", "## Best prompts", "", "```json", json.dumps(best_candidate, indent=2), "```"]
+    return "\n".join(lines)
diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py
new file mode 100644
index 0000000..8d685cf
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py
@@ -0,0 +1,90 @@
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from datetime import datetime, timezone
+from pathlib import Path
+
+import gepa
+
+from .adapter import FlowchartAdapter
+from .config import Config
+from .dataset import freeze_questions, load_topics
+from .llm import complete, make_reflection_callable
+from .metric import score_board
+from .geometry_bridge import validate_flow
+from .pipeline import SEED_CANDIDATE
+from .report import render_report
+
+_TOPICS = str(Path(__file__).resolve().parent.parent / "topics" / "topics.json")
+
+
+def _eval_scores(tasks, candidate, cfg) -> list[float]:
+    return [
+        score_board(
+            run_one(candidate, t, cfg), t, cfg, validate_fn=validate_flow, complete_fn=complete
+        ).score
+        for t in tasks
+    ]
+
+
+def run_one(candidate, task, cfg):
+    from .pipeline import run_pipeline
+    content, _ = run_pipeline(candidate, task, model=cfg.gen_model, complete_fn=complete, max_tokens=cfg.gen_max_tokens)
+    return content
+
+
+def main(argv=None) -> int:
+    p = argparse.ArgumentParser(description="GEPA flowchart prompt optimization")
+    p.add_argument("--smoke", action="store_true", help="1 topic, tiny budget")
+    p.add_argument("--max-metric-calls", type=int, default=None)
+    p.add_argument("--train", type=int, default=8)
+    p.add_argument("--run-dir", default=None)
+    args = p.parse_args(argv)
+
+    cfg = Config.from_env({})
+    budget = args.max_metric_calls or (6 if args.smoke else cfg.max_metric_calls)
+    run_dir = args.run_dir or str(Path("runs") / datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ"))
+    Path(run_dir).mkdir(parents=True, exist_ok=True)
+
+    topics = load_topics(_TOPICS)
+    if args.smoke:
+        train, val = topics[:1], topics[:1]
+    else:
+        train, val = topics[: args.train], topics[args.train :]
+
+    print(f"[gepa] run_dir={run_dir} budget={budget} train={len(train)} val={len(val)}", file=sys.stderr)
+    print(f"[gepa] models gen={cfg.gen_model} reader={cfg.reader_model} judge={cfg.judge_model} reflect={cfg.reflection_model}", file=sys.stderr)
+
+    # Freeze the reader-questions once for the whole run (fair comparison).
+    all_tasks = freeze_questions(topics, run_dir, model=cfg.judge_model, n=cfg.n_questions, complete_fn=complete)
+    by_id = {t.id: t for t in all_tasks}
+    train = [by_id[t.id] for t in train]
+    val = [by_id[t.id] for t in val]
+
+    seed_scores = _eval_scores(val, SEED_CANDIDATE, cfg)
+
+    adapter = FlowchartAdapter(cfg)
+    result = gepa.optimize(
+        seed_candidate=SEED_CANDIDATE,
+        trainset=train,
+        valset=val,
+        adapter=adapter,
+        reflection_lm=make_reflection_callable(cfg.reflection_model),
+        max_metric_calls=budget,
+        run_dir=run_dir,
+        display_progress_bar=True,
+    )
+    best = result.best_candidate
+    best_scores = _eval_scores(val, best, cfg)
+
+    (Path(run_dir) / "best_prompts.json").write_text(json.dumps(best, indent=2))
+    report = render_report(seed_scores, best_scores, best, [t.id for t in val])
+    (Path(run_dir) / "report.md").write_text(report)
+    print(report)
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
diff --git a/scripts/experiments/gepa-flowchart/tests/test_report.py b/scripts/experiments/gepa-flowchart/tests/test_report.py
new file mode 100644
index 0000000..165984d
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/tests/test_report.py
@@ -0,0 +1,8 @@
+from gepa_flowchart.report import render_report
+
+
+def test_report_shows_before_after():
+    md = render_report([0.4, 0.5], [0.7, 0.8], {"brainstorm": "b", "generate": "g"}, ["t1", "t2"])
+    assert "0.45" in md  # seed mean
+    assert "0.75" in md  # best mean
+    assert "brainstorm" in md and "generate" in md
diff --git a/scripts/experiments/gepa-flowchart/tests/test_smoke.py b/scripts/experiments/gepa-flowchart/tests/test_smoke.py
new file mode 100644
index 0000000..bdf4355
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/tests/test_smoke.py
@@ -0,0 +1,13 @@
+import os
+
+import pytest
+
+from gepa_flowchart.run import main
+
+
+@pytest.mark.skipif(not os.environ.get("ANTHROPIC_API_KEY"), reason="needs ANTHROPIC_API_KEY")
+def test_smoke_end_to_end(tmp_path):
+    rc = main(["--smoke", "--run-dir", str(tmp_path)])
+    assert rc == 0
+    assert (tmp_path / "report.md").exists()
+    assert (tmp_path / "best_prompts.json").exists()

From 03fae2baa327f968ba681f5beb8297e394c783b9 Mon Sep 17 00:00:00 2001
From: Ivan Cheung <ivanmkc@google.com>
Date: Thu, 18 Jun 2026 05:07:57 +0000
Subject: [PATCH 14/24] docs(gepa): README with setup, run, and cost notes

---
 scripts/experiments/gepa-flowchart/README.md | 44 ++++++++++++++++++++
 1 file changed, 44 insertions(+)
 create mode 100644 scripts/experiments/gepa-flowchart/README.md

diff --git a/scripts/experiments/gepa-flowchart/README.md b/scripts/experiments/gepa-flowchart/README.md
new file mode 100644
index 0000000..4757c6b
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/README.md
@@ -0,0 +1,44 @@
+# gepa-flowchart
+
+GEPA prompt optimization for readable, comprehensible termchart flowcharts.
+Optimizes the `brainstorm` + `generate` prompts for first-user comprehension
+(primary) and geometric readability (secondary). See
+`docs/superpowers/specs/2026-06-18-gepa-flowchart-optimization-design.md`.
+
+## Setup
+
+```bash
+# Node deps for the geometry bridge (run once)
+npm install                         # repo root — gives the viewer its deps
+cd scripts/experiments/gepa-flowchart
+npm install                         # local tsx
+
+# Python
+python -m venv .venv && . .venv/bin/activate
+pip install -e ".[dev]"
+export ANTHROPIC_API_KEY=sk-ant-...   # or: ant auth login
+```
+
+## Run
+
+```bash
+python -m gepa_flowchart.run --smoke            # cheap end-to-end check (1 topic)
+python -m gepa_flowchart.run                    # full run (~150 metric calls)
+python -m gepa_flowchart.run --max-metric-calls 80 --train 8
+```
+
+Outputs land in `runs/<timestamp>/`: `best_prompts.json`, `report.md`,
+`frozen_questions.json`.
+
+## Config (env)
+
+`GEPA_GEN_MODEL` (default `claude-opus-4-8`; set `claude-sonnet-4-6` to cut cost),
+`GEPA_READER_MODEL`, `GEPA_JUDGE_MODEL`, `GEPA_REFLECTION_MODEL`,
+`GEPA_W_COMP` (0.6), `GEPA_W_GEOM` (0.4), `GEPA_N_QUESTIONS` (5),
+`GEPA_MAX_METRIC_CALLS` (150).
+
+## Cost note
+
+A full run makes many generation, reader, and judge calls per rollout, plus
+reflection calls. Start with `--smoke`. Generation defaults to Opus; set
+`GEPA_GEN_MODEL=claude-sonnet-4-6` to cut the cost of this high-volume role.

From d39a67cf37ed0dc292ecf007f2dc2f6480eecc9c Mon Sep 17 00:00:00 2001
From: Ivan Cheung <ivanmkc@google.com>
Date: Thu, 18 Jun 2026 15:43:36 +0000
Subject: [PATCH 15/24] feat(gepa): run via Claude on Vertex AI (ADC auth)

get_client() auto-selects AnthropicVertex when CLAUDE_CODE_USE_VERTEX or
GEPA_USE_VERTEX is set (project/region from env, ADC); falls back to the direct
API otherwise. Verified end-to-end: smoke run on Vertex (project adk-coding-agents,
region global) completed, seed 0.83 -> best 0.85 on the ci-cd val task.
---
 scripts/experiments/gepa-flowchart/README.md  | 15 +++++++++++++++
 .../gepa-flowchart/gepa_flowchart/llm.py      | 19 +++++++++++++++++--
 2 files changed, 32 insertions(+), 2 deletions(-)
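
Note: the check in `get_client()` tests `os.environ.get(...)` for truthiness, so any non-empty value selects Vertex, including `GEPA_USE_VERTEX=0`. If that matters, a stricter flag parse could look like the sketch below. `env_flag` and `pick_backend` are hypothetical names, and the set of "false" spellings is an assumption.

```python
import os

# Assumption: these spellings should count as "flag not set".
_FALSEY = {"", "0", "false", "no", "off"}


def env_flag(name, env=None):
    """True when the variable is set to something other than a false-like value."""
    env = os.environ if env is None else env
    return env.get(name, "").strip().lower() not in _FALSEY


def pick_backend(env=None):
    """Return "vertex" when either Vertex flag is on, else "direct"."""
    if env_flag("GEPA_USE_VERTEX", env) or env_flag("CLAUDE_CODE_USE_VERTEX", env):
        return "vertex"
    return "direct"
```

Passing a mapping as `env` keeps the selection logic testable without touching the process environment or importing the SDK.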

diff --git a/scripts/experiments/gepa-flowchart/README.md b/scripts/experiments/gepa-flowchart/README.md
index 4757c6b..47f7eb9 100644
--- a/scripts/experiments/gepa-flowchart/README.md
+++ b/scripts/experiments/gepa-flowchart/README.md
@@ -19,6 +19,21 @@ pip install -e ".[dev]"
 export ANTHROPIC_API_KEY=sk-ant-...   # or: ant auth login
 ```
 
+### Auth: direct API or Vertex AI
+
+By default the harness uses the direct Anthropic API (`ANTHROPIC_API_KEY`).
+To run via **Claude on Vertex AI** (ADC, no API key), install the Vertex extra
+and set the Vertex env vars — `get_client()` auto-selects the Vertex client when
+`CLAUDE_CODE_USE_VERTEX` (or `GEPA_USE_VERTEX`) is set:
+
+```bash
+pip install "anthropic[vertex]"
+export CLAUDE_CODE_USE_VERTEX=1
+export ANTHROPIC_VERTEX_PROJECT_ID=<gcp-project>
+export CLOUD_ML_REGION=global              # or a Claude-on-Vertex region
+gcloud auth application-default login       # ADC
+```
+
 ## Run
 
 ```bash
diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/llm.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/llm.py
index cd01808..7b021ae 100644
--- a/scripts/experiments/gepa-flowchart/gepa_flowchart/llm.py
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/llm.py
@@ -6,10 +6,25 @@
 
 
 def get_client():
+    """Lazily build the Anthropic client. Uses Vertex (ADC) when CLAUDE_CODE_USE_VERTEX/
+    GEPA_USE_VERTEX is set, else the direct API (ANTHROPIC_API_KEY)."""
     global _client
     if _client is None:
-        import anthropic
-        _client = anthropic.Anthropic()
+        import os
+
+        if os.environ.get("GEPA_USE_VERTEX") or os.environ.get("CLAUDE_CODE_USE_VERTEX"):
+            from anthropic import AnthropicVertex
+
+            _client = AnthropicVertex(
+                project_id=os.environ.get("ANTHROPIC_VERTEX_PROJECT_ID")
+                or os.environ.get("GOOGLE_CLOUD_PROJECT"),
+                region=os.environ.get("CLOUD_ML_REGION")
+                or os.environ.get("GOOGLE_CLOUD_LOCATION", "global"),
+            )
+        else:
+            import anthropic
+
+            _client = anthropic.Anthropic()
     return _client
 
 

From a5ad73cf81c9ef329e0d9b4154f70213b5fe7dff Mon Sep 17 00:00:00 2001
From: Ivan Cheung <ivanmkc@google.com>
Date: Thu, 18 Jun 2026 19:35:11 +0000
Subject: [PATCH 16/24] fix(gepa): brace-safe prompt fill so GEPA-mutated
 prompts don't KeyError

GEPA mutates the brainstorm/generate prompts and routinely inserts literal JSON
braces (e.g. {"direction","nodes",...}). pipeline.py used str.format(), which
treats those as fields and raises KeyError mid-run (crashed the full run at ~10/150
rollouts). Replace with a placeholder-only replace() that leaves other braces alone.
Regression test added.
---
 .../gepa-flowchart/gepa_flowchart/pipeline.py | 17 ++++++++++---
 .../gepa-flowchart/tests/test_pipeline.py     | 25 +++++++++++++++++++
 2 files changed, 39 insertions(+), 3 deletions(-)
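
Note: a standalone reproduction of the failure and the fix, as a sketch. `fill` mirrors the `_fill` helper added in this patch; the template text is made up for illustration.

```python
def fill(template, **fields):
    """Replace only the named {placeholders}; every other brace is left alone."""
    out = template
    for key, value in fields.items():
        out = out.replace("{" + key + "}", value)
    return out


# A prompt of the kind GEPA produces: literal JSON-style braces plus one real placeholder.
template = 'Schema is { "direction", "nodes" }.\nFollow this plan: {plan}'

# str.format() parses the literal braces as a replacement field and fails.
try:
    template.format(plan="PLAN X")
    format_error = None
except (KeyError, ValueError, IndexError) as exc:
    format_error = type(exc).__name__

# The placeholder-only replace fills {plan} and keeps the literal braces verbatim.
filled = fill(template, plan="PLAN X")
```

One caveat of the sequential replace: if a substituted value itself contains the text of a later placeholder (for example a topic string containing `{plan}`), the later pass will expand it too. That seems unlikely for the task fields used here, but it is worth knowing.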

diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/pipeline.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/pipeline.py
index fee812d..dbadf3b 100644
--- a/scripts/experiments/gepa-flowchart/gepa_flowchart/pipeline.py
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/pipeline.py
@@ -5,10 +5,20 @@
 from .seed_prompts import SKILL_CONTEXT, SEED_CANDIDATE
 
 
+def _fill(template: str, **fields: str) -> str:
+    """Substitute only the named {placeholders}. Brace-safe: GEPA mutates these prompts and
+    routinely inserts literal JSON braces (e.g. {"nodes": ...}); str.format() would treat those
+    as fields and KeyError. A plain replace of each known placeholder leaves all other braces alone."""
+    out = template
+    for key, value in fields.items():
+        out = out.replace("{" + key + "}", value)
+    return out
+
+
 def run_pipeline(candidate: dict, task: Task, *, model: str, complete_fn, max_tokens: int):
     trace: dict = {}
-    brainstorm_user = candidate["brainstorm"].format(
-        topic=task.topic, audience=task.audience, purpose=task.purpose
+    brainstorm_user = _fill(
+        candidate["brainstorm"], topic=task.topic, audience=task.audience, purpose=task.purpose
     )
     trace["brainstorm_input"] = brainstorm_user
     plan = complete_fn(
@@ -16,7 +26,8 @@ def run_pipeline(candidate: dict, task: Task, *, model: str, complete_fn, max_to
     )
     trace["plan"] = plan
 
-    generate_user = candidate["generate"].format(
+    generate_user = _fill(
+        candidate["generate"],
         skill_context=SKILL_CONTEXT,
         topic=task.topic,
         audience=task.audience,
diff --git a/scripts/experiments/gepa-flowchart/tests/test_pipeline.py b/scripts/experiments/gepa-flowchart/tests/test_pipeline.py
index 6ce1954..88a3e6a 100644
--- a/scripts/experiments/gepa-flowchart/tests/test_pipeline.py
+++ b/scripts/experiments/gepa-flowchart/tests/test_pipeline.py
@@ -34,3 +34,28 @@ def test_skill_context_loads_real_example():
     # The real gallery example is far larger than the 75-char fallback stub.
     assert len(SKILL_CONTEXT) > 400
     assert "Example of a well-formed flow spec:" in SKILL_CONTEXT
+
+
+def test_generate_prompt_with_literal_json_braces_does_not_crash():
+    """Regression: GEPA mutates the generate prompt to include literal JSON braces
+    (e.g. {"direction","nodes","edges"}). The pipeline must still substitute the real
+    placeholders and must NOT raise KeyError from str.format()."""
+    flow = {"direction": "TB", "nodes": [{"id": "a", "data": {"label": "A"}}], "edges": []}
+
+    def fake_complete(system, user, *, model, **kw):
+        if "planner" in system.lower():
+            return "PLAN X"
+        return json.dumps(flow)
+
+    candidate = {
+        "brainstorm": "Plan for {topic}.",
+        # literal braces from a GEPA mutation, plus the real {plan} placeholder:
+        "generate": 'Schema is { "direction", "nodes", "edges", "groups" }.\n'
+                    'Follow this plan: {plan}\nOutput only JSON.',
+    }
+    task = Task(id="x", topic="T", audience="A", purpose="P")
+    out, trace = run_pipeline(candidate, task, model="m", complete_fn=fake_complete, max_tokens=4000)
+    assert json.loads(out) == flow
+    # the real placeholder was filled; the literal braces survived verbatim
+    assert "PLAN X" in trace["generate_input"]
+    assert '{ "direction", "nodes", "edges", "groups" }' in trace["generate_input"]

From 1bc010b1f9acb1776a3c4576c3ea27a8f8c0121c Mon Sep 17 00:00:00 2001
From: Ivan Cheung <ivanmkc@google.com>
Date: Thu, 18 Jun 2026 20:38:32 +0000
Subject: [PATCH 17/24] feat(gepa): de-saturate the metric (strict judge +
 harder questions + harder topics)

Diagnosis: comprehension was pinned at ~1.0 (lenient judge), leaving GEPA no
gradient. Fixes:
- judge scores STRICTLY and board-grounded (0/0.5/1; credit only specifics shown
  on the board, never world knowledge)
- question generator demands specific, detail-probing questions a sketchy board fails
- n_questions 5 -> 7
- +8 harder multi-branch topics (saga, oauth-pkce, raft, k8s-sched, tcp, 3ds,
  blue-green, rate-limiter) -> 20 total
Result: comprehension now spreads 0.71-1.00 (was 0.88-1.00); totals 0.69-0.86.
---
 .../gepa-flowchart/gepa_flowchart/config.py     |  2 +-
 .../gepa-flowchart/gepa_flowchart/dataset.py    | 14 ++++++++++----
 .../gepa-flowchart/gepa_flowchart/metric.py     | 17 +++++++++++++----
 .../gepa-flowchart/tests/test_dataset.py        |  3 ++-
 .../gepa-flowchart/topics/topics.json           | 10 +++++++++-
 5 files changed, 35 insertions(+), 11 deletions(-)
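
Note: the new judge prompt asks for scores of 0, 0.5, or 1 only. The patch relies on the prompt to enforce that; if the judge drifts (say, returns 0.8), one option is to snap scores to the rubric before averaging, as sketched below. `snap` and `comprehension` are hypothetical helpers and not part of this change.

```python
ALLOWED = (0.0, 0.5, 1.0)


def snap(score):
    """Snap a raw judge score to the nearest rubric value.

    Ties resolve to the lower value (the stricter choice), because min()
    returns the first minimal element and ALLOWED is sorted ascending.
    """
    return min(ALLOWED, key=lambda allowed: abs(allowed - score))


def comprehension(rows):
    """Mean of the snapped per-question scores; 0.0 when there are no rows."""
    if not rows:
        return 0.0
    return sum(snap(float(r.get("score", 0.0))) for r in rows) / len(rows)


# Seven questions, matching the new n_questions default.
rows = [{"score": s} for s in (1, 0.5, 0, 1, 0.5, 1, 0)]
result = comprehension(rows)  # 4.0 / 7
```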

diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/config.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/config.py
index 1e5ae10..ddcc248 100644
--- a/scripts/experiments/gepa-flowchart/gepa_flowchart/config.py
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/config.py
@@ -14,7 +14,7 @@ class Config:
     w_geom: float = 0.4
     geom_error_penalty: float = 0.34
     geom_warning_penalty: float = 0.08
-    n_questions: int = 5
+    n_questions: int = 7
     max_metric_calls: int = 150
     gen_max_tokens: int = 8000
 
diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/dataset.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/dataset.py
index 4313004..6a5d4ea 100644
--- a/scripts/experiments/gepa-flowchart/gepa_flowchart/dataset.py
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/dataset.py
@@ -20,16 +20,22 @@ def load_topics(path: str) -> list[Task]:
 
 
 _QGEN_SYSTEM = (
-    "You write the questions a first-time reader of a diagram would need answered "
-    "to actually understand the subject. Output ONLY a JSON array of concise question strings."
+    "You write probing questions that test whether a diagram contains REAL, SPECIFIC DETAIL — "
+    "not just a high-level sketch. Favor questions demanding specifics: exact triggers/conditions/"
+    "thresholds, what happens on each failure or error path, ordering and dependencies, who or what "
+    "performs each step (human vs automated system), and edge cases. A vague, high-level board "
+    "should FAIL many of them; only a detailed, well-contextualized board should answer them all. "
+    "Output ONLY a JSON array of concise question strings."
 )
 
 
 def generate_questions(task: Task, *, model: str, n: int, complete_fn) -> list[str]:
     user = (
         f"Topic: {task.topic}\nAudience: {task.audience}\nPurpose: {task.purpose}\n\n"
-        f"List exactly {n} distinct questions this reader should be able to answer "
-        f"from a good flowchart on this topic. JSON array of strings only."
+        f"List exactly {n} specific, detail-probing questions this reader needs answered. "
+        f"Each must demand a concrete fact the diagram should show (a condition, branch, actor, "
+        f"order, threshold, or failure path) — not something answerable from general knowledge. "
+        f"JSON array of strings only."
     )
     out = complete_fn(_QGEN_SYSTEM, user, model=model, effort="medium", max_tokens=1000)
     blob = extract_array(out)
diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py
index aab40f0..4712f11 100644
--- a/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py
@@ -35,10 +35,19 @@ def geometry_score(findings: list[dict], cfg: Config) -> tuple[float, str]:
 )
 
 _JUDGE_SYSTEM = (
-    "You grade how well a first-time reader's answers are supported by a flowchart. For each "
-    "question score 0..1 for correctness AND whether the board actually supports the answer. "
-    "An answer of 'not shown' or one not supported by the board scores low and is a context gap. "
-    "Output ONLY a JSON array of {\"q\", \"score\", \"supported\" (bool), \"reason\"}."
+    "You grade a first-time reader's answers STRICTLY, judging ONLY whether the BOARD ITSELF "
+    "contains the information — never whether the answer is true in general. A knowledgeable reader "
+    "can answer many questions from world knowledge; that does NOT count and must be penalized. "
+    "Score each question:\n"
+    "  1.0 = the board explicitly shows the SPECIFIC answer (the exact step, condition/trigger, "
+    "branch outcome, actor, or value is named on the board);\n"
+    "  0.5 = the board only partially or generically shows it, requiring the reader to infer;\n"
+    "  0.0 = the board does not show it, the answer is 'not shown', or it could only be answered "
+    "from outside knowledge.\n"
+    "Be demanding: vague, generic, high-level, or inferred answers are 0.0-0.5, never 1.0. A board "
+    "that merely sketches the topic should score low; only a board rich in specific detail scores high. "
+    "Output ONLY a JSON array of {\"q\", \"score\" (0, 0.5, or 1), \"supported\" (bool, true ONLY if "
+    "score is 1), \"reason\" (what specific detail is present or missing on the board)}."
 )
 
 
diff --git a/scripts/experiments/gepa-flowchart/tests/test_dataset.py b/scripts/experiments/gepa-flowchart/tests/test_dataset.py
index 062388b..30e1ec4 100644
--- a/scripts/experiments/gepa-flowchart/tests/test_dataset.py
+++ b/scripts/experiments/gepa-flowchart/tests/test_dataset.py
@@ -19,7 +19,8 @@ def fake_complete(system, user, *, model, **kw):
 
 def test_load_topics():
     tasks = load_topics(TOPICS)
-    assert len(tasks) == 12
+    assert len(tasks) == 20
+    assert len({t.id for t in tasks}) == len(tasks)  # ids unique
     assert all(isinstance(t, Task) and t.id and t.topic for t in tasks)
 
 
diff --git a/scripts/experiments/gepa-flowchart/topics/topics.json b/scripts/experiments/gepa-flowchart/topics/topics.json
index 110e513..38a19ae 100644
--- a/scripts/experiments/gepa-flowchart/topics/topics.json
+++ b/scripts/experiments/gepa-flowchart/topics/topics.json
@@ -10,5 +10,13 @@
   {"id": "onboarding", "topic": "New employee onboarding workflow", "audience": "an HR coordinator", "purpose": "track steps from offer to first day"},
   {"id": "state-machine", "topic": "Document approval state machine", "audience": "a product manager", "purpose": "understand statuses and transitions"},
   {"id": "k8s-deploy", "topic": "Kubernetes rolling deployment", "audience": "a platform engineer", "purpose": "understand how a new version rolls out safely"},
-  {"id": "signup-funnel", "topic": "SaaS signup and activation funnel", "audience": "a growth analyst", "purpose": "see where users drop off between signup and activation"}
+  {"id": "signup-funnel", "topic": "SaaS signup and activation funnel", "audience": "a growth analyst", "purpose": "see where users drop off between signup and activation"},
+  {"id": "saga-compensation", "topic": "Distributed transaction (saga) with per-step compensation", "audience": "a backend engineer", "purpose": "understand the commit path and exactly how each step rolls back on failure"},
+  {"id": "oauth-pkce", "topic": "OAuth 2.0 Authorization Code flow with PKCE, token refresh, and revocation", "audience": "a security engineer", "purpose": "understand every redirect, token exchange, and the refresh and revocation paths"},
+  {"id": "raft-election", "topic": "Raft consensus leader election and log replication", "audience": "a distributed-systems engineer", "purpose": "understand term changes, vote conditions, and how entries commit or get rejected"},
+  {"id": "k8s-scheduling", "topic": "Kubernetes pod scheduling: filtering, scoring, binding, and preemption", "audience": "a platform engineer", "purpose": "understand how a pod lands on a node and what happens when none fits"},
+  {"id": "tcp-lifecycle", "topic": "TCP connection lifecycle: handshake, data transfer, teardown, and reset/timeout paths", "audience": "a network engineer", "purpose": "understand each state transition and what triggers RST or timeout"},
+  {"id": "payment-3ds", "topic": "Card payment with 3-D Secure challenge, authorization, capture, and decline/retry", "audience": "a payments engineer", "purpose": "understand the challenge branch and every decline and retry path"},
+  {"id": "blue-green", "topic": "Blue-green deployment with health checks, traffic cutover, and rollback", "audience": "an SRE", "purpose": "understand the cutover decision and the exact rollback trigger and steps"},
+  {"id": "rate-limiter", "topic": "Distributed token-bucket rate limiter with refill, burst, and rejection", "audience": "a backend engineer", "purpose": "understand when a request is allowed, throttled, or rejected and how the bucket refills"}
 ]

From 75aa9153e515ea3e4002f6594a7c1e1110a00ac3 Mon Sep 17 00:00:00 2001
From: Ivan Cheung <ivanmkc@google.com>
Date: Thu, 18 Jun 2026 21:06:16 +0000
Subject: [PATCH 18/24] feat(gepa): multimodal + Playwright validators on the
 rendered board

Render each board in a real browser (persistent viewer + Chromium service) and add:
- rendered geometry (Playwright DOM): node-pair overlaps, off-canvas nodes, min on-screen font
- visual comprehension + visual quality: one Claude-vision call (Opus 4.8 via Vertex) reads the
  screenshot, answers the frozen questions from pixels, and rates legibility/crowding
Metric now blends text+visual comprehension and heuristic+rendered geometry plus visual quality:
total = w_comp*mean(text,visual) + w_geom*mean(heuristic,rendered) + w_vq*visual_quality
(defaults 0.5/0.3/0.2). Verified end-to-end on Vertex (render -> DOM metrics -> vision). 34 tests.
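
As a worked example of the blend above, the sketch below recomputes the total for
two boards. `blend` is a hypothetical helper written for this note, not a function
in the patch, and the sub-scores are made-up inputs; the weights are the stated
defaults.

```python
# Hypothetical helper (not part of the patch): blends the five sub-scores as
# described above, using the default weights 0.5 / 0.3 / 0.2.
def blend(text, visual, heuristic, rendered, visual_quality,
          w_comp=0.5, w_geom=0.3, w_vq=0.2):
    comp = (text + visual) / 2          # comprehension = mean(text, visual)
    geom = (heuristic + rendered) / 2   # geometry = mean(heuristic, rendered)
    return w_comp * comp + w_geom * geom + w_vq * visual_quality


# Made-up sub-scores for a board that renders cleanly.
total_ok = blend(0.8, 0.6, 1.0, 0.66, 0.9)

# Same text and heuristic scores, but the render failed: visual comprehension,
# rendered geometry, and visual quality are all scored at zero.
total_render_failed = blend(0.8, 0.0, 1.0, 0.0, 0.0)

print(round(total_ok, 3), round(total_render_failed, 3))
```

With these inputs the clean board totals about 0.779 and the board that failed to
render totals 0.35, so a valid spec that will not render keeps less than half of
the available score.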
---
 scripts/experiments/gepa-flowchart/README.md  |  31 ++++--
 .../gepa-flowchart/gepa_flowchart/adapter.py  |  14 ++-
 .../gepa-flowchart/gepa_flowchart/config.py   |  16 ++-
 .../gepa-flowchart/gepa_flowchart/llm.py      |  32 ++++++
 .../gepa-flowchart/gepa_flowchart/metric.py   |  99 +++++++++++++++--
 .../gepa_flowchart/render_bridge.py           | 104 +++++++++++++++++
 .../gepa_flowchart/render_service.mjs         | 105 ++++++++++++++++++
 .../gepa-flowchart/gepa_flowchart/run.py      |  49 ++++----
 .../gepa-flowchart/tests/test_adapter.py      |  19 +++-
 .../gepa-flowchart/tests/test_config.py       |   3 +-
 .../gepa-flowchart/tests/test_llm.py          |  12 ++
 .../gepa-flowchart/tests/test_metric.py       |  94 ++++++++++++++--
 .../tests/test_render_bridge.py               |  39 +++++++
 13 files changed, 559 insertions(+), 58 deletions(-)
 create mode 100644 scripts/experiments/gepa-flowchart/gepa_flowchart/render_bridge.py
 create mode 100644 scripts/experiments/gepa-flowchart/gepa_flowchart/render_service.mjs
 create mode 100644 scripts/experiments/gepa-flowchart/tests/test_render_bridge.py

diff --git a/scripts/experiments/gepa-flowchart/README.md b/scripts/experiments/gepa-flowchart/README.md
index 47f7eb9..ac989f0 100644
--- a/scripts/experiments/gepa-flowchart/README.md
+++ b/scripts/experiments/gepa-flowchart/README.md
@@ -9,7 +9,8 @@ Optimizes the `brainstorm` + `generate` prompts for first-user comprehension
 
 ```bash
 # Node deps for the geometry bridge (run once)
-npm install                         # repo root — gives the viewer its deps
+npm install                         # repo root — gives the viewer its deps + playwright
+npm run build --workspace @ivanmkc/termchart-viewer   # build the viewer (needed to RENDER boards)
 cd scripts/experiments/gepa-flowchart
 npm install                         # local tsx
 
@@ -45,15 +46,31 @@ python -m gepa_flowchart.run --max-metric-calls 80 --train 8
 Outputs land in `runs/<timestamp>/`: `best_prompts.json`, `report.md`,
 `frozen_questions.json`.
 
+## Validators (the metric)
+
+Each board is scored on five signals, gated by structural validity:
+
+- **text comprehension** — a fresh reader-LLM answers the run-frozen reader-questions from the board's structured content; a strict, board-grounded judge scores them.
+- **visual comprehension** — the board is **rendered in a real browser** (viewer + Chromium via a persistent render service) and a **multimodal LLM reads the screenshot**, answering the same questions from the pixels.
+- **visual quality** — the same vision call rates legibility / crowding / overlaps / clipping.
+- **heuristic geometry** — the fast TS `geometryReport` (edges-over-nodes, crossings, density).
+- **rendered geometry (Playwright)** — real-DOM measurements: node-pair overlaps, off-canvas nodes, smallest on-screen font.
+
+`comp = mean(text, visual)`, `geom = mean(heuristic, rendered)`,
+`total = w_comp·comp + w_geom·geom + w_vq·visual_quality`.
+
 ## Config (env)
 
 `GEPA_GEN_MODEL` (default `claude-opus-4-8`; set `claude-sonnet-4-6` to cut cost),
-`GEPA_READER_MODEL`, `GEPA_JUDGE_MODEL`, `GEPA_REFLECTION_MODEL`,
-`GEPA_W_COMP` (0.6), `GEPA_W_GEOM` (0.4), `GEPA_N_QUESTIONS` (5),
-`GEPA_MAX_METRIC_CALLS` (150).
+`GEPA_READER_MODEL`, `GEPA_JUDGE_MODEL`, `GEPA_VISION_MODEL`, `GEPA_REFLECTION_MODEL`,
+`GEPA_W_COMP` (0.5), `GEPA_W_GEOM` (0.3), `GEPA_W_VQ` (0.2), `GEPA_N_QUESTIONS` (7),
+`GEPA_MAX_METRIC_CALLS` (150). Rendered-geometry penalties: `GEPA_RG_OVERLAP_PENALTY`,
+`GEPA_RG_OFFSCREEN_PENALTY`, `GEPA_RG_TINYFONT_PENALTY`, `GEPA_RG_MIN_FONT_PX`.
 
 ## Cost note
 
-A full run does many generation + reader + judge calls per rollout plus
-reflection. Start with `--smoke`. Generation defaults to Opus; switch
-`GEPA_GEN_MODEL=claude-sonnet-4-6` for a cheaper high-volume role.
+Each rollout now does generation + text reader/judge + **a browser render + a
+multimodal vision call**, plus reflection, so it is heavier than a text-only metric.
+Start with `--smoke`. The render service (viewer + one Chromium) starts once per
+run and is reused. Generation defaults to Opus; set `GEPA_GEN_MODEL=claude-sonnet-4-6`
+to cut the cost of that high-volume role.
diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/adapter.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/adapter.py
index ece9108..bc876fd 100644
--- a/scripts/experiments/gepa-flowchart/gepa_flowchart/adapter.py
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/adapter.py
@@ -4,16 +4,23 @@
 
 from .config import Config
 from .geometry_bridge import validate_flow
-from .llm import complete
+from .llm import complete, complete_vision
 from .metric import score_board
 from .pipeline import run_pipeline
 
 
+def _no_render(_content):
+    return {"ok": False, "error": "no render_fn supplied"}
+
+
 class FlowchartAdapter(GEPAAdapter):
-    def __init__(self, cfg: Config, *, validate_fn=validate_flow, complete_fn=complete):
+    def __init__(self, cfg: Config, *, validate_fn=validate_flow, complete_fn=complete,
+                 render_fn=_no_render, vision_fn=complete_vision):
         self.cfg = cfg
         self.validate_fn = validate_fn
         self.complete_fn = complete_fn
+        self.render_fn = render_fn
+        self.vision_fn = vision_fn
 
     def evaluate(self, batch, candidate, capture_traces=False):
         outputs, scores, trajectories = [], [], [] if capture_traces else None
@@ -23,7 +30,8 @@ def evaluate(self, batch, candidate, capture_traces=False):
                 complete_fn=self.complete_fn, max_tokens=self.cfg.gen_max_tokens,
             )
             result = score_board(
-                content, task, self.cfg, validate_fn=self.validate_fn, complete_fn=self.complete_fn
+                content, task, self.cfg, validate_fn=self.validate_fn, complete_fn=self.complete_fn,
+                render_fn=self.render_fn, vision_fn=self.vision_fn,
             )
             outputs.append(content)
             scores.append(result.score)
diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/config.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/config.py
index ddcc248..77bf4b6 100644
--- a/scripts/experiments/gepa-flowchart/gepa_flowchart/config.py
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/config.py
@@ -10,10 +10,16 @@ class Config:
     reader_model: str = "claude-sonnet-4-6"
     judge_model: str = "claude-opus-4-8"
     reflection_model: str = "claude-opus-4-8"
-    w_comp: float = 0.6
-    w_geom: float = 0.4
+    vision_model: str = "claude-opus-4-8"      # multimodal judge over the rendered screenshot
+    w_comp: float = 0.5                          # comprehension = mean(text, visual)
+    w_geom: float = 0.3                          # geometry = mean(heuristic, rendered-DOM)
+    w_vq: float = 0.2                            # visual quality (legibility/crowding)
     geom_error_penalty: float = 0.34
     geom_warning_penalty: float = 0.08
+    rg_overlap_penalty: float = 0.34            # rendered-DOM: per overlapping node pair
+    rg_offscreen_penalty: float = 0.2           # rendered-DOM: per off-canvas node
+    rg_tinyfont_penalty: float = 0.3            # rendered-DOM: smallest label below rg_min_font_px
+    rg_min_font_px: float = 9.0
     n_questions: int = 7
     max_metric_calls: int = 150
     gen_max_tokens: int = 8000
@@ -36,10 +42,16 @@ def i(key: str, default: int) -> int:
             reader_model=s("GEPA_READER_MODEL", cls.reader_model),
             judge_model=s("GEPA_JUDGE_MODEL", cls.judge_model),
             reflection_model=s("GEPA_REFLECTION_MODEL", cls.reflection_model),
+            vision_model=s("GEPA_VISION_MODEL", cls.vision_model),
             w_comp=f("GEPA_W_COMP", cls.w_comp),
             w_geom=f("GEPA_W_GEOM", cls.w_geom),
+            w_vq=f("GEPA_W_VQ", cls.w_vq),
             geom_error_penalty=f("GEPA_GEOM_ERR_PENALTY", cls.geom_error_penalty),
             geom_warning_penalty=f("GEPA_GEOM_WARN_PENALTY", cls.geom_warning_penalty),
+            rg_overlap_penalty=f("GEPA_RG_OVERLAP_PENALTY", cls.rg_overlap_penalty),
+            rg_offscreen_penalty=f("GEPA_RG_OFFSCREEN_PENALTY", cls.rg_offscreen_penalty),
+            rg_tinyfont_penalty=f("GEPA_RG_TINYFONT_PENALTY", cls.rg_tinyfont_penalty),
+            rg_min_font_px=f("GEPA_RG_MIN_FONT_PX", cls.rg_min_font_px),
             n_questions=i("GEPA_N_QUESTIONS", cls.n_questions),
             max_metric_calls=i("GEPA_MAX_METRIC_CALLS", cls.max_metric_calls),
             gen_max_tokens=i("GEPA_GEN_MAX_TOKENS", cls.gen_max_tokens),
diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/llm.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/llm.py
index 7b021ae..53c688b 100644
--- a/scripts/experiments/gepa-flowchart/gepa_flowchart/llm.py
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/llm.py
@@ -48,6 +48,38 @@ def complete(
     return "".join(b.text for b in resp.content if getattr(b, "type", None) == "text")
 
 
+def complete_vision(
+    system: str,
+    user: str,
+    image_png: bytes,
+    *,
+    model: str,
+    effort: str = "high",
+    max_tokens: int = 2000,
+    client=None,
+) -> str:
+    """Single-turn multimodal call: a PNG image plus a text prompt. Used to judge the RENDERED
+    board (true first-user-experience). Same no-sampling-params rule as complete()."""
+    import base64
+
+    client = client or get_client()
+    b64 = base64.b64encode(image_png).decode()
+    resp = client.messages.create(
+        model=model,
+        max_tokens=max_tokens,
+        system=system,
+        output_config={"effort": effort},
+        messages=[{
+            "role": "user",
+            "content": [
+                {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": b64}},
+                {"type": "text", "text": user},
+            ],
+        }],
+    )
+    return "".join(b.text for b in resp.content if getattr(b, "type", None) == "text")
+
+
 def make_reflection_callable(model: str, *, client=None) -> Callable[[str], str]:
     def reflect(prompt: str) -> str:
         return complete(
diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py
index 4712f11..f6eb0ee 100644
--- a/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py
@@ -1,20 +1,23 @@
 from __future__ import annotations
 
 import json
-from dataclasses import dataclass
+from dataclasses import dataclass, field
 
 from .config import Config
 from .dataset import Task, extract_array
-from .render import board_to_text
+from .render import board_to_text, extract_json
+from .render_bridge import rendered_geometry_metrics
 
 
 @dataclass
 class ScoreResult:
     score: float
     feedback: str
-    comp: float
-    geom: float
+    comp: float           # combined comprehension: mean(text, visual)
+    geom: float           # combined geometry: mean(heuristic, rendered-DOM)
+    visual_quality: float
     valid: bool
+    sub: dict = field(default_factory=dict)  # text_comp / visual_comp / heuristic_geom / rendered_geom
 
 
 def geometry_score(findings: list[dict], cfg: Config) -> tuple[float, str]:
@@ -85,18 +88,90 @@ def comprehension_score(board_text: str, questions: list[str], *, reader_model:
     return comp, fb
 
 
-def score_board(content, task: Task, cfg: Config, *, validate_fn, complete_fn) -> ScoreResult:
+def rendered_geometry_score(reduced: dict, cfg: Config) -> tuple[float, str]:
+    """Score the REAL rendered DOM: partial node overlaps, off-viewport nodes, and tiny on-screen
+    text. Complements the heuristic geometryReport with what the browser actually drew."""
+    if reduced.get("n_nodes", 0) == 0:
+        return 0.0, "Rendered geometry: nothing rendered (0 nodes on the board)."
+    overlaps = reduced.get("overlaps", 0)
+    offscreen = reduced.get("offscreen", 0)
+    min_font = reduced.get("min_font_px", 99)
+    tiny = 1 if 0 < min_font < cfg.rg_min_font_px else 0
+    score = 1.0 - cfg.rg_overlap_penalty * overlaps - cfg.rg_offscreen_penalty * offscreen - cfg.rg_tinyfont_penalty * tiny
+    score = max(0.0, min(1.0, score))
+    bits = []
+    if overlaps:
+        bits.append(f"{overlaps} node-pair overlap(s)")
+    if offscreen:
+        bits.append(f"{offscreen} node(s) off the visible canvas")
+    if tiny:
+        bits.append(f"smallest on-screen label ~{min_font}px (below {cfg.rg_min_font_px}px, hard to read)")
+    fb = "Rendered geometry: " + ("; ".join(bits) if bits else "clean (no overlaps, on-screen, legible).")
+    return score, fb
+
+
+_VISUAL_SYSTEM = (
+    "You are a person seeing this flowchart IMAGE for the FIRST TIME. Judge ONLY what is actually "
+    "VISIBLE and LEGIBLE in the image — not what the topic means in general. For each question, "
+    "answer from the image, then score: 1 = the image clearly and specifically shows the answer; "
+    "0.5 = partially shown or too small/cramped to be sure; 0 = not shown or not legible. "
+    "Separately rate overall VISUAL QUALITY from 0 to 1 (1 = clean, well-spaced, legible labels, no "
+    "overlaps, nothing cut off; 0 = cramped, overlapping, tiny or clipped text). "
+    "Output ONLY a JSON object: {\"answers\": [{\"q\", \"score\" (0/0.5/1), \"reason\"}], "
+    "\"visual_quality\": number, \"quality_reason\": string}."
+)
+
+
+def visual_eval(png: bytes, questions: list[str], *, model: str, vision_fn) -> tuple[float, float, str]:
+    """One multimodal call over the rendered image: returns (visual_comprehension, visual_quality,
+    feedback). Fused reader+grader+quality to keep it to a single vision call per rollout."""
+    qlist = "\n".join(f"- {q}" for q in questions)
+    out = vision_fn(_VISUAL_SYSTEM, f"QUESTIONS:\n{qlist}", png, model=model, effort="high", max_tokens=1800)
+    blob = extract_json(out)
+    try:
+        obj = json.loads(blob) if blob else {}
+    except (json.JSONDecodeError, ValueError):
+        obj = {}
+    answers = obj.get("answers") or []
+    if not answers:
+        return 0.0, 0.0, "Visual: vision judge produced no parseable scores."
+    scores = [max(0.0, min(1.0, float(a.get("score", 0.0)))) for a in answers]  # clamp like visual_quality
+    vcomp = sum(scores) / len(scores)
+    vqual = max(0.0, min(1.0, float(obj.get("visual_quality", 0.0))))
+    gaps = [a for a in answers if float(a.get("score", 0.0)) < 1.0]
+    gap_txt = "; ".join(f"'{a.get('q')}' — {a.get('reason')}" for a in gaps[:6])
+    fb = (f"Visual comprehension {vcomp:.2f}, visual quality {vqual:.2f} ({obj.get('quality_reason','')}). "
+          + (f"Not clearly readable from the image: {gap_txt}" if gaps else "All questions readable from the image."))
+    return vcomp, vqual, fb
+
+
+def score_board(content, task: Task, cfg: Config, *, validate_fn, complete_fn, render_fn, vision_fn) -> ScoreResult:
     if not content:
-        return ScoreResult(0.0, "Generation produced no JSON object.", 0.0, 0.0, False)
+        return ScoreResult(0.0, "Generation produced no JSON object.", 0.0, 0.0, 0.0, False)
     report = validate_fn(content)
     if not report.get("valid"):
-        return ScoreResult(0.0, f"Invalid flow spec: {report.get('error')}", 0.0, 0.0, False)
+        return ScoreResult(0.0, f"Invalid flow spec: {report.get('error')}", 0.0, 0.0, 0.0, False)
 
-    geom, geom_fb = geometry_score(report.get("findings", []), cfg)
+    # 1. heuristic geometry (fast TS lint) + text comprehension (structured content)
+    hgeom, hgeom_fb = geometry_score(report.get("findings", []), cfg)
     board_text = board_to_text(json.loads(content))
-    comp, comp_fb = comprehension_score(
+    tcomp, tcomp_fb = comprehension_score(
         board_text, task.questions, reader_model=cfg.reader_model, judge_model=cfg.judge_model, complete_fn=complete_fn
     )
-    total = cfg.w_comp * comp + cfg.w_geom * geom
-    feedback = f"{comp_fb}\n{geom_fb}"
-    return ScoreResult(total, feedback, comp, geom, True)
+
+    # 2. render the board in a real browser → rendered-DOM geometry + multimodal visual judging
+    r = render_fn(content)
+    if r.get("ok") and r.get("png"):
+        rgeom, rgeom_fb = rendered_geometry_score(rendered_geometry_metrics(r.get("metrics", {})), cfg)
+        vcomp, vqual, v_fb = visual_eval(r["png"], task.questions, model=cfg.vision_model, vision_fn=vision_fn)
+    else:
+        # A valid spec that won't render is a real defect — score the visual signals at zero.
+        rgeom, rgeom_fb = 0.0, f"Rendered geometry: render failed ({r.get('error')})."
+        vcomp, vqual, v_fb = 0.0, 0.0, "Visual: the board failed to render in the viewer."
+
+    comp = (tcomp + vcomp) / 2
+    geom = (hgeom + rgeom) / 2
+    total = cfg.w_comp * comp + cfg.w_geom * geom + cfg.w_vq * vqual
+    feedback = "\n".join([tcomp_fb, v_fb, hgeom_fb, rgeom_fb])
+    sub = {"text_comp": tcomp, "visual_comp": vcomp, "heuristic_geom": hgeom, "rendered_geom": rgeom, "visual_quality": vqual}
+    return ScoreResult(total, feedback, comp, geom, vqual, True, sub)
diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/render_bridge.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/render_bridge.py
new file mode 100644
index 0000000..5fdcee0
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/render_bridge.py
@@ -0,0 +1,104 @@
+from __future__ import annotations
+
+import base64
+import json
+import os
+import subprocess
+import time
+import urllib.request
+from pathlib import Path
+
+_HERE = Path(__file__).resolve().parent
+_SERVICE = _HERE / "render_service.mjs"
+_REPO = _HERE.parents[3]  # gepa_flowchart -> gepa-flowchart -> experiments -> scripts -> repo
+
+
+class RenderService:
+    """Starts the Node render service (viewer + Chromium) once and renders flow specs to a PNG +
+    real-DOM metrics. Use as a context manager so the browser/viewer are torn down."""
+
+    def __init__(self, render_port: int = 8799, viewer_port: int = 8798):
+        self.render_port = render_port
+        self.viewer_port = viewer_port
+        self._proc: subprocess.Popen | None = None
+
+    def start(self, timeout: float = 90.0) -> "RenderService":
+        env = {**os.environ, "RENDER_PORT": str(self.render_port), "VIEWER_PORT": str(self.viewer_port)}
+        # cwd is the repo root; Node itself resolves `import "playwright"` upward from the script's directory.
+        self._proc = subprocess.Popen(
+            ["node", str(_SERVICE)],
+            cwd=str(_REPO),
+            env=env,
+            stdout=subprocess.PIPE,
+            stderr=subprocess.STDOUT,
+            text=True,
+        )
+        deadline = time.monotonic() + timeout
+        while time.monotonic() < deadline:
+            line = self._proc.stdout.readline() if self._proc.stdout else ""
+            if line.startswith("READY"):
+                return self
+            if self._proc.poll() is not None:
+                raise RuntimeError("render service exited before ready")
+        raise TimeoutError("render service did not become ready")
+
+    def render(self, flow_content: str, *, timeout: float = 60.0) -> dict:
+        """Returns {ok, png (bytes) | None, metrics, error?}; a non-2xx reply raises urllib.error.HTTPError."""
+        req = urllib.request.Request(
+            f"http://127.0.0.1:{self.render_port}/render",
+            data=json.dumps({"flow": flow_content}).encode(),
+            headers={"content-type": "application/json"},
+            method="POST",
+        )
+        with urllib.request.urlopen(req, timeout=timeout) as r:
+            out = json.loads(r.read())
+        if out.get("ok") and out.get("png"):
+            out["png"] = base64.b64decode(out["png"])
+        return out
+
+    def stop(self) -> None:
+        if self._proc and self._proc.poll() is None:
+            self._proc.terminate()
+            try:
+                self._proc.wait(timeout=10)
+            except subprocess.TimeoutExpired:
+                self._proc.kill()
+
+    def __enter__(self) -> "RenderService":
+        return self.start()
+
+    def __exit__(self, *exc) -> None:
+        self.stop()
+
+
+def rendered_geometry_metrics(metrics: dict) -> dict:
+    """Reduce raw DOM metrics to defect counts: partial node overlaps (excluding containment, which
+    is a group enclosing its children), nodes off the diagram viewport, and the smallest on-screen
+    font size (CSS px times the React Flow zoom)."""
+    nodes = metrics.get("nodes", [])
+    zoom = metrics.get("zoom", 1) or 1
+    d = metrics.get("diagram", {})
+    dx, dy, dw, dh = d.get("x", 0), d.get("y", 0), d.get("w", 1280), d.get("h", 900)
+
+    def contains(a, b):  # a fully contains b
+        return a["x"] <= b["x"] and a["y"] <= b["y"] and a["x"] + a["w"] >= b["x"] + b["w"] and a["y"] + a["h"] >= b["y"] + b["h"]
+
+    def overlaps(a, b):
+        return a["x"] < b["x"] + b["w"] and a["x"] + a["w"] > b["x"] and a["y"] < b["y"] + b["h"] and a["y"] + a["h"] > b["y"]
+
+    overlap_pairs = 0
+    for i in range(len(nodes)):
+        for j in range(i + 1, len(nodes)):
+            a, b = nodes[i], nodes[j]
+            if overlaps(a, b) and not contains(a, b) and not contains(b, a):
+                overlap_pairs += 1
+
+    margin = 4
+    off = sum(
+        1 for n in nodes
+        if n["x"] < dx - margin or n["y"] < dy - margin
+        or n["x"] + n["w"] > dx + dw + margin or n["y"] + n["h"] > dy + dh + margin
+    )
+    on_screen_fonts = [n["fs"] * zoom for n in nodes if n.get("fs")]
+    min_font = min(on_screen_fonts) if on_screen_fonts else 0.0
+    return {"overlaps": overlap_pairs, "offscreen": off, "min_font_px": round(min_font, 1), "n_nodes": len(nodes)}
diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/render_service.mjs b/scripts/experiments/gepa-flowchart/gepa_flowchart/render_service.mjs
new file mode 100644
index 0000000..284a765
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/render_service.mjs
@@ -0,0 +1,105 @@
+// Persistent render service for the GEPA harness. Starts the termchart viewer once, launches one
+// Chromium, and serves POST /render {flow} -> { png (base64 of the #diagram), metrics } using the
+// REAL browser layout. Reused across every rollout (spawning a browser per rollout would be brutal).
+//
+// Run from anywhere under the worktree (resolves `playwright` from the root node_modules):
+//   node gepa_flowchart/render_service.mjs            # prints "READY <port>" then serves
+// Env: RENDER_PORT (default 8799), VIEWER_PORT (default 8798), PUSH_TOKEN (default render-tok).
+import { chromium } from "playwright";
+import { spawn } from "node:child_process";
+import http from "node:http";
+import { fileURLToPath } from "node:url";
+import { dirname, join } from "node:path";
+
+const HERE = dirname(fileURLToPath(import.meta.url));
+const REPO = join(HERE, "..", "..", "..", ".."); // gepa_flowchart -> gepa-flowchart -> experiments -> scripts -> repo
+const VIEWER = join(REPO, "packages", "viewer", "dist", "server.js");
+
+const RENDER_PORT = Number(process.env.RENDER_PORT ?? 8799);
+const VIEWER_PORT = Number(process.env.VIEWER_PORT ?? 8798);
+const TOKEN = process.env.PUSH_TOKEN ?? "render-tok";
+const VIEWER_BASE = `http://127.0.0.1:${VIEWER_PORT}`;
+
+const viewer = spawn("node", [VIEWER], { env: { ...process.env, PORT: String(VIEWER_PORT), PUSH_TOKEN: TOKEN } });
+viewer.stdout.on("data", () => {}); viewer.stderr.on("data", () => {}); // drain viewer logs so its pipes never fill
+
+const browser = await chromium.launch();
+const page = await browser.newPage({ viewport: { width: 1280, height: 900 } });
+
+// Wait for the viewer to accept connections.
+for (let i = 0; i < 50; i++) {
+  try { const r = await fetch(`${VIEWER_BASE}/healthz`); if (r.ok) break; } catch { /* not up yet */ }
+  await new Promise((r) => setTimeout(r, 200));
+}
+
+let counter = 0;
+
+async function renderFlow(flowStr) {
+  const wsid = `r${++counter}`;
+  const push = await fetch(`${VIEWER_BASE}/w/${wsid}/push`, {
+    method: "POST",
+    headers: { "content-type": "application/json", authorization: `Bearer ${TOKEN}` },
+    body: JSON.stringify({ project: "gepa", agent: "board", type: "flow", description: "render", content: flowStr }),
+  });
+  if (!push.ok) {
+    return { ok: false, error: `push HTTP ${push.status}: ${(await push.text()).slice(0, 200)}` };
+  }
+  await page.goto(`${VIEWER_BASE}/w/${wsid}/`, { waitUntil: "domcontentloaded" });
+  try {
+    await page.waitForSelector(".react-flow__node", { timeout: 8000 });
+  } catch { /* no nodes rendered — still screenshot + report empty */ }
+  await page.waitForTimeout(900); // let fitView settle
+
+  const metrics = await page.evaluate(() => {
+    const vp = document.querySelector(".react-flow__viewport");
+    let zoom = 1;
+    if (vp) {
+      const t = getComputedStyle(vp).transform;
+      const m = t && t.match(/matrix\(([^)]+)\)/);
+      if (m) zoom = parseFloat(m[1].split(",")[0]) || 1;
+    }
+    const d = document.getElementById("diagram");
+    const dr = d ? d.getBoundingClientRect() : { x: 0, y: 0, width: 1280, height: 900 };
+    const nodes = [...document.querySelectorAll(".react-flow__node")].map((n) => {
+      const r = n.getBoundingClientRect();
+      return {
+        id: n.getAttribute("data-id") || "",
+        cls: n.className || "",
+        x: r.x, y: r.y, w: r.width, h: r.height,
+        fs: parseFloat(getComputedStyle(n).fontSize) || 0,
+      };
+    });
+    return { zoom, diagram: { x: dr.x, y: dr.y, w: dr.width, h: dr.height }, nodes };
+  });
+
+  const el = await page.$("#diagram");
+  const buf = el ? await el.screenshot() : await page.screenshot();
+  return { ok: true, png: buf.toString("base64"), metrics };
+}
+
+const server = http.createServer((req, res) => {
+  if (req.method === "GET" && req.url === "/health") { res.writeHead(200); res.end("ok"); return; }
+  if (req.method === "POST" && req.url === "/render") {
+    let body = "";
+    req.on("data", (c) => (body += c));
+    req.on("end", async () => {
+      try {
+        const { flow } = JSON.parse(body);
+        const out = await renderFlow(flow);
+        res.writeHead(out.ok ? 200 : 500, { "content-type": "application/json" });
+        res.end(JSON.stringify(out));
+      } catch (e) {
+        res.writeHead(500, { "content-type": "application/json" });
+        res.end(JSON.stringify({ ok: false, error: String(e).slice(0, 300) }));
+      }
+    });
+    return;
+  }
+  res.writeHead(404); res.end("not found");
+});
+
+server.listen(RENDER_PORT, () => console.log(`READY ${RENDER_PORT}`));
+
+async function shutdown() { try { await browser.close(); } catch {} try { viewer.kill(); } catch {} process.exit(0); }
+process.on("SIGTERM", shutdown);
+process.on("SIGINT", shutdown);
diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py
index 8d685cf..ec1fd6c 100644
--- a/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py
@@ -11,19 +11,21 @@
 from .adapter import FlowchartAdapter
 from .config import Config
 from .dataset import freeze_questions, load_topics
-from .llm import complete, make_reflection_callable
+from .llm import complete, complete_vision, make_reflection_callable
 from .metric import score_board
 from .geometry_bridge import validate_flow
 from .pipeline import SEED_CANDIDATE
+from .render_bridge import RenderService
 from .report import render_report
 
 _TOPICS = str(Path(__file__).resolve().parent.parent / "topics" / "topics.json")
 
 
-def _eval_scores(tasks, candidate, cfg) -> list[float]:
+def _eval_scores(tasks, candidate, cfg, *, render_fn) -> list[float]:
     return [
         score_board(
-            run_one(candidate, t, cfg), t, cfg, validate_fn=validate_flow, complete_fn=complete
+            run_one(candidate, t, cfg), t, cfg, validate_fn=validate_flow, complete_fn=complete,
+            render_fn=render_fn, vision_fn=complete_vision,
         ).score
         for t in tasks
     ]
@@ -55,7 +57,9 @@ def main(argv=None) -> int:
         train, val = topics[: args.train], topics[args.train :]
 
     print(f"[gepa] run_dir={run_dir} budget={budget} train={len(train)} val={len(val)}", file=sys.stderr)
-    print(f"[gepa] models gen={cfg.gen_model} reader={cfg.reader_model} judge={cfg.judge_model} reflect={cfg.reflection_model}", file=sys.stderr)
+    print(f"[gepa] models gen={cfg.gen_model} reader={cfg.reader_model} judge={cfg.judge_model} "
+          f"vision={cfg.vision_model} reflect={cfg.reflection_model}", file=sys.stderr)
+    print(f"[gepa] weights comp={cfg.w_comp} geom={cfg.w_geom} visual_quality={cfg.w_vq}", file=sys.stderr)
 
     # Freeze the reader-questions once for the whole run (fair comparison).
     all_tasks = freeze_questions(topics, run_dir, model=cfg.judge_model, n=cfg.n_questions, complete_fn=complete)
@@ -63,21 +67,28 @@ def main(argv=None) -> int:
     train = [by_id[t.id] for t in train]
     val = [by_id[t.id] for t in val]
 
-    seed_scores = _eval_scores(val, SEED_CANDIDATE, cfg)
-
-    adapter = FlowchartAdapter(cfg)
-    result = gepa.optimize(
-        seed_candidate=SEED_CANDIDATE,
-        trainset=train,
-        valset=val,
-        adapter=adapter,
-        reflection_lm=make_reflection_callable(cfg.reflection_model),
-        max_metric_calls=budget,
-        run_dir=run_dir,
-        display_progress_bar=True,
-    )
-    best = result.best_candidate
-    best_scores = _eval_scores(val, best, cfg)
+    print("[gepa] starting render service (viewer + chromium)…", file=sys.stderr)
+    service = RenderService().start()
+    print("[gepa] render service ready.", file=sys.stderr)
+    try:
+        render_fn = service.render
+        seed_scores = _eval_scores(val, SEED_CANDIDATE, cfg, render_fn=render_fn)
+
+        adapter = FlowchartAdapter(cfg, render_fn=render_fn, vision_fn=complete_vision)
+        result = gepa.optimize(
+            seed_candidate=SEED_CANDIDATE,
+            trainset=train,
+            valset=val,
+            adapter=adapter,
+            reflection_lm=make_reflection_callable(cfg.reflection_model),
+            max_metric_calls=budget,
+            run_dir=run_dir,
+            display_progress_bar=True,
+        )
+        best = result.best_candidate
+        best_scores = _eval_scores(val, best, cfg, render_fn=render_fn)
+    finally:
+        service.stop()
 
     (Path(run_dir) / "best_prompts.json").write_text(json.dumps(best, indent=2))
     report = render_report(seed_scores, best_scores, best, [t.id for t in val])
diff --git a/scripts/experiments/gepa-flowchart/tests/test_adapter.py b/scripts/experiments/gepa-flowchart/tests/test_adapter.py
index 4caf04c..bf432c4 100644
--- a/scripts/experiments/gepa-flowchart/tests/test_adapter.py
+++ b/scripts/experiments/gepa-flowchart/tests/test_adapter.py
@@ -24,8 +24,23 @@ def complete_fn(system, user, *, model, **kw):
     return json.dumps([{"q": "Q1?", "score": 1.0, "supported": True, "reason": "ok"}])
 
 
+def render_fn(content):
+    return {"ok": True, "png": b"png", "metrics": {
+        "zoom": 1, "diagram": {"x": 0, "y": 0, "w": 1200, "h": 800},
+        "nodes": [{"id": "a", "cls": "react-flow__node", "x": 10, "y": 10, "w": 120, "h": 40, "fs": 12}],
+    }}
+
+
+def vision_fn(system, user, png, *, model, **kw):
+    return json.dumps({"answers": [{"q": "Q1?", "score": 1, "reason": "ok"}], "visual_quality": 0.9, "quality_reason": "clean"})
+
+
+def _adapter():
+    return FlowchartAdapter(CFG, validate_fn=validate_fn, complete_fn=complete_fn, render_fn=render_fn, vision_fn=vision_fn)
+
+
 def test_evaluate_returns_scores_and_traces():
-    adapter = FlowchartAdapter(CFG, validate_fn=validate_fn, complete_fn=complete_fn)
+    adapter = _adapter()
     batch = [Task(id="x", topic="T", audience="A", purpose="P", questions=["Q1?"])]
     out = adapter.evaluate(batch, SEED_CANDIDATE, capture_traces=True)
     assert len(out.scores) == 1
@@ -34,7 +49,7 @@ def test_evaluate_returns_scores_and_traces():
 
 
 def test_make_reflective_dataset_has_requested_components():
-    adapter = FlowchartAdapter(CFG, validate_fn=validate_fn, complete_fn=complete_fn)
+    adapter = _adapter()
     batch = [Task(id="x", topic="T", audience="A", purpose="P", questions=["Q1?"])]
     ev = adapter.evaluate(batch, SEED_CANDIDATE, capture_traces=True)
     refl = adapter.make_reflective_dataset(SEED_CANDIDATE, ev, ["brainstorm", "generate"])
diff --git a/scripts/experiments/gepa-flowchart/tests/test_config.py b/scripts/experiments/gepa-flowchart/tests/test_config.py
index 691aaa9..b0571e8 100644
--- a/scripts/experiments/gepa-flowchart/tests/test_config.py
+++ b/scripts/experiments/gepa-flowchart/tests/test_config.py
@@ -7,7 +7,8 @@ def test_defaults():
     assert cfg.reader_model == "claude-sonnet-4-6"
     assert cfg.judge_model == "claude-opus-4-8"
     assert cfg.reflection_model == "claude-opus-4-8"
-    assert cfg.w_comp == 0.6 and cfg.w_geom == 0.4
+    assert cfg.w_comp == 0.5 and cfg.w_geom == 0.3 and cfg.w_vq == 0.2
+    assert cfg.vision_model == "claude-opus-4-8"
 
 
 def test_env_override():
diff --git a/scripts/experiments/gepa-flowchart/tests/test_llm.py b/scripts/experiments/gepa-flowchart/tests/test_llm.py
index b876aee..cfefc63 100644
--- a/scripts/experiments/gepa-flowchart/tests/test_llm.py
+++ b/scripts/experiments/gepa-flowchart/tests/test_llm.py
@@ -32,3 +32,15 @@ def test_reflection_callable_returns_str():
     fake = FakeClient()
     fn = llm.make_reflection_callable("claude-opus-4-8", client=fake)
     assert fn("reflect on this") == "hello world"
+
+
+def test_complete_vision_sends_image_block():
+    fake = FakeClient()
+    out = llm.complete_vision("sys", "describe", b"\x89PNGdata", model="claude-opus-4-8", client=fake)
+    assert out == "hello world"
+    content = fake.calls["messages"][0]["content"]
+    kinds = [b["type"] for b in content]
+    assert "image" in kinds and "text" in kinds
+    img = next(b for b in content if b["type"] == "image")
+    assert img["source"]["type"] == "base64" and img["source"]["media_type"] == "image/png"
+    assert "temperature" not in fake.calls
diff --git a/scripts/experiments/gepa-flowchart/tests/test_metric.py b/scripts/experiments/gepa-flowchart/tests/test_metric.py
index 25a6d68..8dae861 100644
--- a/scripts/experiments/gepa-flowchart/tests/test_metric.py
+++ b/scripts/experiments/gepa-flowchart/tests/test_metric.py
@@ -2,12 +2,36 @@
 
 from gepa_flowchart.config import Config
 from gepa_flowchart.dataset import Task
-from gepa_flowchart.metric import geometry_score, score_board
+from gepa_flowchart.metric import (
+    geometry_score,
+    rendered_geometry_score,
+    visual_eval,
+    score_board,
+)
 
 CFG = Config.from_env({})
 TASK = Task(id="x", topic="T", audience="A", purpose="P", questions=["Q1?", "Q2?"])
 GOOD_FLOW = json.dumps({"direction": "TB", "nodes": [{"id": "a", "data": {"label": "Start"}}], "edges": []})
 
+# A clean rendered board: one node, on-screen, legible font.
+RENDER_OK = {
+    "ok": True,
+    "png": b"\x89PNG-fake",
+    "metrics": {
+        "zoom": 1,
+        "diagram": {"x": 0, "y": 0, "w": 1200, "h": 800},
+        "nodes": [{"id": "a", "cls": "react-flow__node", "x": 10, "y": 10, "w": 120, "h": 40, "fs": 12}],
+    },
+}
+
+
+def _vision_ok(system, user, png, *, model, **kw):
+    return json.dumps({
+        "answers": [{"q": "Q1?", "score": 1, "reason": "shown"}, {"q": "Q2?", "score": 0, "reason": "cramped, unreadable"}],
+        "visual_quality": 0.8,
+        "quality_reason": "legible",
+    })
+
 
 def test_geometry_score_penalizes_errors():
     clean, _ = geometry_score([], CFG)
@@ -16,33 +40,79 @@ def test_geometry_score_penalizes_errors():
     assert err < 1.0 and "edge-over-node" in msg
 
 
+def test_rendered_geometry_score():
+    clean, _ = rendered_geometry_score({"overlaps": 0, "offscreen": 0, "min_font_px": 14, "n_nodes": 3}, CFG)
+    assert clean == 1.0
+    bad, msg = rendered_geometry_score({"overlaps": 2, "offscreen": 1, "min_font_px": 5, "n_nodes": 4}, CFG)
+    assert bad < clean and "overlap" in msg
+    empty, msg2 = rendered_geometry_score({"overlaps": 0, "offscreen": 0, "min_font_px": 0, "n_nodes": 0}, CFG)
+    assert empty == 0.0 and "nothing rendered" in msg2
+
+
+def test_visual_eval_parses_object():
+    vcomp, vqual, fb = visual_eval(b"png", ["Q1?", "Q2?"], model="m", vision_fn=_vision_ok)
+    assert vcomp == 0.5 and vqual == 0.8
+    assert "cramped" in fb
+
+
+def test_visual_eval_degrades_on_garbage():
+    vcomp, vqual, fb = visual_eval(b"png", ["Q1?"], model="m", vision_fn=lambda *a, **k: "no json here")
+    assert vcomp == 0.0 and vqual == 0.0 and "no parseable" in fb
+
+
 def test_invalid_board_scores_zero():
     def validate_fn(content):
         return {"valid": False, "error": "bad json", "findings": [], "warnings": []}
 
-    res = score_board("{bad", TASK, CFG, validate_fn=validate_fn, complete_fn=lambda *a, **k: "")
+    res = score_board(
+        "{bad", TASK, CFG, validate_fn=validate_fn, complete_fn=lambda *a, **k: "",
+        render_fn=lambda c: RENDER_OK, vision_fn=_vision_ok,
+    )
     assert res.score == 0.0 and res.valid is False
     assert "bad json" in res.feedback
 
 
-def test_missing_context_lowers_comp_and_is_fed_back():
+def test_combined_score_blends_text_visual_geometry_quality():
     def validate_fn(content):
         return {"valid": True, "error": None, "findings": [], "warnings": []}
 
-    # reader answers; judge marks Q2 unsupported
     def complete_fn(system, user, *, model, **kw):
-        if "first time" in system.lower():  # reader
-            return json.dumps([{"q": "Q1?", "a": "Start the process"}, {"q": "Q2?", "a": "not shown"}])
-        # judge
-        return json.dumps([
+        if "first time" in system.lower():  # text reader
+            return json.dumps([{"q": "Q1?", "a": "x"}, {"q": "Q2?", "a": "not shown"}])
+        return json.dumps([  # text judge
             {"q": "Q1?", "score": 1.0, "supported": True, "reason": "shown"},
             {"q": "Q2?", "score": 0.0, "supported": False, "reason": "board never shows the failure path"},
         ])
 
-    res = score_board(GOOD_FLOW, TASK, CFG, validate_fn=validate_fn, complete_fn=complete_fn)
-    assert 0.0 < res.comp < 1.0
+    res = score_board(
+        GOOD_FLOW, TASK, CFG, validate_fn=validate_fn, complete_fn=complete_fn,
+        render_fn=lambda c: RENDER_OK, vision_fn=_vision_ok,
+    )
+    # text_comp 0.5, visual_comp 0.5 -> comp 0.5; heuristic 1.0, rendered 1.0 -> geom 1.0; vq 0.8
+    assert res.comp == 0.5
     assert res.geom == 1.0
-    assert "failure path" in res.feedback
+    assert res.visual_quality == 0.8
+    assert abs(res.score - (CFG.w_comp * 0.5 + CFG.w_geom * 1.0 + CFG.w_vq * 0.8)) < 1e-9
+    assert res.sub["text_comp"] == 0.5 and res.sub["visual_comp"] == 0.5
+    assert "failure path" in res.feedback and "cramped" in res.feedback
+
+
+def test_render_failure_zeros_visual_signals():
+    def validate_fn(content):
+        return {"valid": True, "error": None, "findings": [], "warnings": []}
+
+    def complete_fn(system, user, *, model, **kw):
+        if "first time" in system.lower():
+            return json.dumps([{"q": "Q1?", "a": "x"}])
+        return json.dumps([{"q": "Q1?", "score": 1.0, "supported": True, "reason": "shown"}])
+
+    res = score_board(
+        GOOD_FLOW, Task(id="x", topic="T", audience="A", purpose="P", questions=["Q1?"]), CFG,
+        validate_fn=validate_fn, complete_fn=complete_fn,
+        render_fn=lambda c: {"ok": False, "error": "boom"}, vision_fn=_vision_ok,
+    )
+    assert res.sub["visual_comp"] == 0.0 and res.sub["rendered_geom"] == 0.0 and res.visual_quality == 0.0
+    assert "failed to render" in res.feedback
 
 
 def test_malformed_judge_degrades_safely():
@@ -51,7 +121,7 @@ def test_malformed_judge_degrades_safely():
     def complete_fn(system, user, *, model, **kw):
         if "first time" in system.lower():
             return '[{"q": "Q1?", "a": "x"}]'
-        return "[{ truncated malformed"  # non-empty bracketed but invalid JSON
+        return "[{ truncated malformed"
 
     comp, fb = comprehension_score(
         "board text", ["Q1?"], reader_model="r", judge_model="j", complete_fn=complete_fn
diff --git a/scripts/experiments/gepa-flowchart/tests/test_render_bridge.py b/scripts/experiments/gepa-flowchart/tests/test_render_bridge.py
new file mode 100644
index 0000000..22c5d84
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/tests/test_render_bridge.py
@@ -0,0 +1,39 @@
+from gepa_flowchart.render_bridge import rendered_geometry_metrics
+
+DIAG = {"x": 0, "y": 0, "w": 1000, "h": 800}
+
+
+def test_clean_layout_no_defects():
+    m = {"zoom": 1, "diagram": DIAG, "nodes": [
+        {"id": "a", "cls": "react-flow__node", "x": 10, "y": 10, "w": 100, "h": 40, "fs": 12},
+        {"id": "b", "cls": "react-flow__node", "x": 10, "y": 200, "w": 100, "h": 40, "fs": 12},
+    ]}
+    r = rendered_geometry_metrics(m)
+    assert r == {"overlaps": 0, "offscreen": 0, "min_font_px": 12.0, "n_nodes": 2}
+
+
+def test_partial_overlap_counted():
+    m = {"zoom": 1, "diagram": DIAG, "nodes": [
+        {"id": "a", "cls": "react-flow__node", "x": 10, "y": 10, "w": 100, "h": 40, "fs": 12},
+        {"id": "b", "cls": "react-flow__node", "x": 60, "y": 30, "w": 100, "h": 40, "fs": 12},  # overlaps a
+    ]}
+    assert rendered_geometry_metrics(m)["overlaps"] == 1
+
+
+def test_containment_is_not_overlap():
+    # a big group node enclosing a child must NOT count as an overlap
+    m = {"zoom": 1, "diagram": DIAG, "nodes": [
+        {"id": "g", "cls": "react-flow__node-group", "x": 0, "y": 0, "w": 400, "h": 300, "fs": 12},
+        {"id": "a", "cls": "react-flow__node", "x": 20, "y": 20, "w": 100, "h": 40, "fs": 12},
+    ]}
+    assert rendered_geometry_metrics(m)["overlaps"] == 0
+
+
+def test_offscreen_and_zoom_scaled_font():
+    m = {"zoom": 0.5, "diagram": DIAG, "nodes": [
+        {"id": "a", "cls": "react-flow__node", "x": 10, "y": 10, "w": 100, "h": 40, "fs": 12},
+        {"id": "b", "cls": "react-flow__node", "x": 1200, "y": 10, "w": 100, "h": 40, "fs": 12},  # off canvas (x>1000)
+    ]}
+    r = rendered_geometry_metrics(m)
+    assert r["offscreen"] == 1
+    assert r["min_font_px"] == 6.0  # 12 * 0.5 zoom

From 2e411ea3c19f53ce385c25b8de83a0527367aece Mon Sep 17 00:00:00 2001
From: Ivan Cheung <ivanmkc@google.com>
Date: Thu, 18 Jun 2026 21:20:15 +0000
Subject: [PATCH 19/24] fix(gepa): freeze reader-questions only for used topics
 (not all 20)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

A --smoke or small --train run generated questions for all 20 topics before
starting. Freeze only train ∪ val so partial runs don't pay for unused topics.
---
 scripts/experiments/gepa-flowchart/gepa_flowchart/run.py | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py
index ec1fd6c..4bc5958 100644
--- a/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py
@@ -61,9 +61,12 @@ def main(argv=None) -> int:
           f"vision={cfg.vision_model} reflect={cfg.reflection_model}", file=sys.stderr)
     print(f"[gepa] weights comp={cfg.w_comp} geom={cfg.w_geom} visual_quality={cfg.w_vq}", file=sys.stderr)
 
-    # Freeze the reader-questions once for the whole run (fair comparison).
-    all_tasks = freeze_questions(topics, run_dir, model=cfg.judge_model, n=cfg.n_questions, complete_fn=complete)
-    by_id = {t.id: t for t in all_tasks}
+    # Freeze the reader-questions once for the whole run (fair comparison), only for the topics
+    # actually used (train ∪ val) — freezing all 20 would waste calls on a small/smoke run.
+    used_ids = {t.id for t in train} | {t.id for t in val}
+    used = [t for t in topics if t.id in used_ids]
+    frozen = freeze_questions(used, run_dir, model=cfg.judge_model, n=cfg.n_questions, complete_fn=complete)
+    by_id = {t.id: t for t in frozen}
     train = [by_id[t.id] for t in train]
     val = [by_id[t.id] for t in val]
 

From 8b1f738ebce21ceae1ab7298bfc9e57e3ef4f92f Mon Sep 17 00:00:00 2001
From: Ivan Cheung <ivanmkc@google.com>
Date: Thu, 18 Jun 2026 22:32:50 +0000
Subject: [PATCH 20/24] fix(gepa): tolerant reader-question parsing (raw_decode
 + retry + fallback)

The harder question prompt occasionally makes the model emit a valid JSON array
followed by trailing prose, which crashed freeze_questions with JSONDecodeError
('Extra data') at the start of a run. Parse the first array via raw_decode,
ignoring any trailing text; retry once if that yields no questions; and fall
back to generic questions so no topic is ever question-less. Regression tests
added.
---
 .../gepa-flowchart/gepa_flowchart/dataset.py  | 32 ++++++++++++++++---
 .../gepa-flowchart/tests/test_dataset.py      | 16 ++++++++++
 2 files changed, 44 insertions(+), 4 deletions(-)
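
Reviewer note (ignored by git am): a minimal standalone sketch of why raw_decode
fixes the crash, assuming a reply shaped like the one in the regression test. The
helper name first_json_array is hypothetical and is not the patch's
parse_str_array; it only illustrates the same standard-library call.

```python
import json


def first_json_array(text: str) -> list:
    """Return the first JSON array in text, ignoring anything after it."""
    start = text.find("[")
    if start == -1:
        return []
    try:
        # raw_decode stops at the end of the first complete JSON value and
        # reports where it stopped, instead of rejecting trailing characters.
        value, _end = json.JSONDecoder().raw_decode(text[start:])
    except json.JSONDecodeError:
        return []
    return value if isinstance(value, list) else []


reply = '["What triggers it?", "What fails?"]\n\nThose are the key questions.'

# Strict parsing rejects the reply because of the prose after the array.
try:
    json.loads(reply)
    strict_error = None
except json.JSONDecodeError as exc:
    strict_error = exc.msg

# Tolerant parsing keeps the array and drops the trailing prose.
tolerant = first_json_array(reply)
print(strict_error, tolerant)
```

json.loads insists that the whole string is one JSON value, while raw_decode
only requires that a value starts at the given position, which is why slicing
from the first '[' is enough here.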

diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/dataset.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/dataset.py
index 6a5d4ea..e0d9724 100644
--- a/scripts/experiments/gepa-flowchart/gepa_flowchart/dataset.py
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/dataset.py
@@ -29,6 +29,28 @@ def load_topics(path: str) -> list[Task]:
 )
 
 
+_FALLBACK_QUESTIONS = [
+    "What are the main steps, in order?",
+    "What are the decision points and what does each branch lead to?",
+    "What triggers the process to start?",
+    "What happens on failure or error at each step?",
+    "Which steps are automated vs. performed by a person, and who?",
+]
+
+
+def parse_str_array(text: str) -> list[str]:
+    """Tolerantly pull a JSON array of strings from an LLM response. Uses raw_decode from the first
+    '[' so trailing text after a valid array (a common LLM habit) doesn't cause 'Extra data'."""
+    i = text.find("[")
+    if i == -1:
+        return []
+    try:
+        val, _ = json.JSONDecoder().raw_decode(text[i:])
+    except (json.JSONDecodeError, ValueError):
+        return []
+    return [str(x) for x in val] if isinstance(val, list) else []
+
+
 def generate_questions(task: Task, *, model: str, n: int, complete_fn) -> list[str]:
     user = (
         f"Topic: {task.topic}\nAudience: {task.audience}\nPurpose: {task.purpose}\n\n"
@@ -37,10 +59,12 @@ def generate_questions(task: Task, *, model: str, n: int, complete_fn) -> list[s
         f"order, threshold, or failure path) — not something answerable from general knowledge. "
         f"JSON array of strings only."
     )
-    out = complete_fn(_QGEN_SYSTEM, user, model=model, effort="medium", max_tokens=1000)
-    blob = extract_array(out)
-    qs = json.loads(blob) if blob else []
-    return [str(q) for q in qs][:n]
+    qs = parse_str_array(complete_fn(_QGEN_SYSTEM, user, model=model, effort="medium", max_tokens=1000))
+    if not qs:  # one retry on a clean miss, then a generic fallback so no topic is question-less
+        qs = parse_str_array(complete_fn(_QGEN_SYSTEM, user, model=model, effort="medium", max_tokens=1000))
+    if not qs:
+        qs = _FALLBACK_QUESTIONS
+    return qs[:n]
 
 
 def extract_array(text: str) -> str | None:
diff --git a/scripts/experiments/gepa-flowchart/tests/test_dataset.py b/scripts/experiments/gepa-flowchart/tests/test_dataset.py
index 30e1ec4..e6ff056 100644
--- a/scripts/experiments/gepa-flowchart/tests/test_dataset.py
+++ b/scripts/experiments/gepa-flowchart/tests/test_dataset.py
@@ -37,3 +37,19 @@ def test_freeze_and_reload(tmp_path):
     assert (tmp_path / "frozen_questions.json").exists()
     reloaded = load_frozen(load_topics(TOPICS)[:2], str(tmp_path))
     assert reloaded[0].questions == frozen[0].questions
+
+
+def test_generate_questions_tolerates_trailing_text():
+    """Regression: the model often emits a valid array PLUS trailing prose; raw_decode must take
+    the array and ignore the rest instead of raising 'Extra data'."""
+    def fc(system, user, *, model, **kw):
+        return '["What triggers it?", "What fails?"]\n\nThose are the key questions.'
+    qs = generate_questions(Task(id="x", topic="T", audience="A", purpose="P"), model="m", n=5, complete_fn=fc)
+    assert qs == ["What triggers it?", "What fails?"]
+
+
+def test_generate_questions_falls_back_when_unparseable():
+    def fc(system, user, *, model, **kw):
+        return "Sorry, here are some questions but not as JSON."
+    qs = generate_questions(Task(id="x", topic="T", audience="A", purpose="P"), model="m", n=3, complete_fn=fc)
+    assert len(qs) == 3 and all(isinstance(q, str) and q for q in qs)

From 830e68d456cd353ef84020c660aeffc4d52ecf47 Mon Sep 17 00:00:00 2001
From: Ivan Cheung <ivanmkc@google.com>
Date: Thu, 18 Jun 2026 22:47:38 +0000
Subject: [PATCH 21/24] chore(gepa): log dataset module + parse_str_array at
 startup
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add a one-line startup diagnostic that prints which dataset module file is
loaded and whether the tolerant parser is present. It caught a stale-bytecode
issue where a run executed pre-fix code even though the source was fixed.
---
 scripts/experiments/gepa-flowchart/gepa_flowchart/run.py | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py
index 4bc5958..9e11157 100644
--- a/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py
@@ -45,6 +45,9 @@ def main(argv=None) -> int:
     p.add_argument("--run-dir", default=None)
     args = p.parse_args(argv)
 
+    from . import dataset as _ds
+    print(f"[gepa] dataset={_ds.__file__} parse_str_array={hasattr(_ds, 'parse_str_array')}", file=sys.stderr)
+
     cfg = Config.from_env({})
     budget = args.max_metric_calls or (6 if args.smoke else cfg.max_metric_calls)
     run_dir = args.run_dir or str(Path("runs") / datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ"))

From 619d6302bb3367cc41639d30746c1989c213d5dc Mon Sep 17 00:00:00 2001
From: Ivan Cheung <ivanmkc@google.com>
Date: Fri, 19 Jun 2026 05:33:39 +0000
Subject: [PATCH 22/24] feat(gepa): weighted-harmonic-mean metric +
 multi-objective Pareto

- Aggregate the three axes (comprehension/geometry/visual_quality) with an eps-floored
  WEIGHTED HARMONIC MEAN instead of a linear sum: a weak axis can't be bought back by a
  strong one (anti-compensation), while the eps floor keeps a single 0 from collapsing the
  score and re-saturating the metric. Within-axis (text+visual, heuristic+rendered) stays
  arithmetic mean (denoising two estimates of one thing).
- Adapter now returns per-objective scores; run uses frontier_type=hybrid so GEPA keeps the
  Pareto front over both val instances and objectives instead of only the collapsed scalar.

Verified on Vertex: base valset 0.415; GEPA iterates with the hybrid frontier without errors. 37 tests.
---
 .../gepa-flowchart/gepa_flowchart/adapter.py  | 12 +++++++++++-
 .../gepa-flowchart/gepa_flowchart/config.py   | 10 ++++++++++
 .../gepa-flowchart/gepa_flowchart/metric.py   | 16 ++++++++++++++--
 .../gepa-flowchart/gepa_flowchart/run.py      |  1 +
 .../gepa-flowchart/tests/test_adapter.py      |  3 +++
 .../gepa-flowchart/tests/test_metric.py       | 19 ++++++++++++++++++-
 6 files changed, 57 insertions(+), 4 deletions(-)
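
Reviewer note (ignored by git am): a small worked comparison of the two
aggregations, assuming the default weights 0.5 / 0.3 / 0.2 and eps = 0.05. The
function body mirrors the one added to metric.py; the weighted_sum helper and
the sample axis values are illustrative only.

```python
def weighted_harmonic_mean(pairs, eps=0.05):
    """Weighted harmonic mean of (weight, value) pairs, each value floored at eps."""
    total_weight = sum(w for w, _ in pairs)
    denom = sum(w / max(eps, v) for w, v in pairs)
    return total_weight / denom if denom else 0.0


def weighted_sum(pairs):
    """Linear aggregate (the previous behavior), for comparison."""
    return sum(w * v for w, v in pairs)


cases = {
    "balanced": [(0.5, 0.8), (0.3, 0.8), (0.2, 0.8)],  # all axes equal
    "lopsided": [(0.5, 0.9), (0.3, 0.9), (0.2, 0.1)],  # one weak axis
    "zeroed": [(0.5, 0.9), (0.3, 0.9), (0.2, 0.0)],    # one axis at 0, floored to eps
}

for name, pairs in cases.items():
    linear = weighted_sum(pairs)
    harmonic = weighted_harmonic_mean(pairs)
    print(f"{name}: linear={linear:.3f} harmonic={harmonic:.3f}")
```

With equal axes the two aggregates agree at 0.8. In the lopsided case the linear
sum is 0.74 while the harmonic mean drops to about 0.35, and with one axis at 0
the floor keeps the harmonic score near 0.20 rather than exactly 0.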

diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/adapter.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/adapter.py
index bc876fd..1a10f2c 100644
--- a/scripts/experiments/gepa-flowchart/gepa_flowchart/adapter.py
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/adapter.py
@@ -24,6 +24,7 @@ def __init__(self, cfg: Config, *, validate_fn=validate_flow, complete_fn=comple
 
     def evaluate(self, batch, candidate, capture_traces=False):
         outputs, scores, trajectories = [], [], [] if capture_traces else None
+        objective_scores = []
         for task in batch:
             content, trace = run_pipeline(
                 candidate, task, model=self.cfg.gen_model,
@@ -35,9 +36,18 @@ def evaluate(self, batch, candidate, capture_traces=False):
             )
             outputs.append(content)
             scores.append(result.score)
+            # Per-objective scores so GEPA can track/Pareto the distinct axes instead of only the
+            # collapsed scalar (keeps candidates that are best on comprehension vs. layout separate).
+            objective_scores.append({
+                "comprehension": result.comp,
+                "geometry": result.geom,
+                "visual_quality": result.visual_quality,
+            })
             if capture_traces:
                 trajectories.append({"task": task, "trace": trace, "result": result, "output": content})
-        return EvaluationBatch(outputs=outputs, scores=scores, trajectories=trajectories)
+        return EvaluationBatch(
+            outputs=outputs, scores=scores, trajectories=trajectories, objective_scores=objective_scores
+        )
 
     def make_reflective_dataset(self, candidate, eval_batch, components_to_update):
         out: dict[str, list[dict]] = {c: [] for c in components_to_update}
diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/config.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/config.py
index 77bf4b6..ca50a77 100644
--- a/scripts/experiments/gepa-flowchart/gepa_flowchart/config.py
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/config.py
@@ -14,6 +14,14 @@ class Config:
     w_comp: float = 0.5                          # comprehension = mean(text, visual)
     w_geom: float = 0.3                          # geometry = mean(heuristic, rendered-DOM)
     w_vq: float = 0.2                            # visual quality (legibility/crowding)
+    # Aggregate the three axes with a WEIGHTED HARMONIC MEAN (anti-compensation: a weak axis can't
+    # be bought back by a strong one). score_epsilon floors each axis so a single 0 doesn't collapse
+    # the whole score to 0 (which would re-saturate the metric at the bottom and kill the gradient).
+    score_epsilon: float = 0.05
+    # GEPA candidate-frontier type. "hybrid" keeps the Pareto front over BOTH val instances and the
+    # per-objective scores we hand it (comprehension / geometry / visual_quality), so a candidate
+    # that's best on one objective or one topic survives. "instance" = original behavior.
+    frontier_type: str = "hybrid"
     geom_error_penalty: float = 0.34
     geom_warning_penalty: float = 0.08
     rg_overlap_penalty: float = 0.34            # rendered-DOM: per overlapping node pair
@@ -46,6 +54,8 @@ def i(key: str, default: int) -> int:
             w_comp=f("GEPA_W_COMP", cls.w_comp),
             w_geom=f("GEPA_W_GEOM", cls.w_geom),
             w_vq=f("GEPA_W_VQ", cls.w_vq),
+            score_epsilon=f("GEPA_SCORE_EPSILON", cls.score_epsilon),
+            frontier_type=s("GEPA_FRONTIER_TYPE", cls.frontier_type),
             geom_error_penalty=f("GEPA_GEOM_ERR_PENALTY", cls.geom_error_penalty),
             geom_warning_penalty=f("GEPA_GEOM_WARN_PENALTY", cls.geom_warning_penalty),
             rg_overlap_penalty=f("GEPA_RG_OVERLAP_PENALTY", cls.rg_overlap_penalty),
diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py
index f6eb0ee..df9d819 100644
--- a/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py
@@ -20,6 +20,14 @@ class ScoreResult:
     sub: dict = field(default_factory=dict)  # text_comp / visual_comp / heuristic_geom / rendered_geom
 
 
+def weighted_harmonic_mean(pairs: list[tuple[float, float]], *, eps: float = 0.05) -> float:
+    """Weighted harmonic mean of (weight, value) pairs, each value floored at eps. Dominated by the
+    smallest value → enforces 'good on every axis' (no compensation) without a single 0 nuking it."""
+    num = sum(w for w, _ in pairs)
+    den = sum(w / max(eps, v) for w, v in pairs)
+    return num / den if den else 0.0
+
+
 def geometry_score(findings: list[dict], cfg: Config) -> tuple[float, str]:
     errs = [f for f in findings if f.get("severity") == "error"]
     warns = [f for f in findings if f.get("severity") == "warning"]
@@ -169,9 +177,13 @@ def score_board(content, task: Task, cfg: Config, *, validate_fn, complete_fn, r
         rgeom, rgeom_fb = 0.0, f"Rendered geometry: render failed ({r.get('error')})."
         vcomp, vqual, v_fb = 0.0, 0.0, "Visual: the board failed to render in the viewer."
 
-    comp = (tcomp + vcomp) / 2
+    comp = (tcomp + vcomp) / 2   # two estimates of the same thing → average to denoise
     geom = (hgeom + rgeom) / 2
-    total = cfg.w_comp * comp + cfg.w_geom * geom + cfg.w_vq * vqual
+    # Three distinct requirements → weighted harmonic mean (a weak axis drags the whole score down
+    # and can't be compensated), ε-floored so a single 0 doesn't collapse everything to 0.
+    total = weighted_harmonic_mean(
+        [(cfg.w_comp, comp), (cfg.w_geom, geom), (cfg.w_vq, vqual)], eps=cfg.score_epsilon
+    )
     feedback = "\n".join([tcomp_fb, v_fb, hgeom_fb, rgeom_fb])
     sub = {"text_comp": tcomp, "visual_comp": vcomp, "heuristic_geom": hgeom, "rendered_geom": rgeom, "visual_quality": vqual}
     return ScoreResult(total, feedback, comp, geom, vqual, True, sub)
diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py
index 9e11157..ca17b0a 100644
--- a/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py
@@ -88,6 +88,7 @@ def main(argv=None) -> int:
             adapter=adapter,
             reflection_lm=make_reflection_callable(cfg.reflection_model),
             max_metric_calls=budget,
+            frontier_type=cfg.frontier_type,
             run_dir=run_dir,
             display_progress_bar=True,
         )
diff --git a/scripts/experiments/gepa-flowchart/tests/test_adapter.py b/scripts/experiments/gepa-flowchart/tests/test_adapter.py
index bf432c4..134719e 100644
--- a/scripts/experiments/gepa-flowchart/tests/test_adapter.py
+++ b/scripts/experiments/gepa-flowchart/tests/test_adapter.py
@@ -46,6 +46,9 @@ def test_evaluate_returns_scores_and_traces():
     assert len(out.scores) == 1
     assert out.scores[0] > 0.0
     assert out.trajectories is not None and len(out.trajectories) == 1
+    # per-objective scores handed to GEPA for multi-objective Pareto
+    assert out.objective_scores is not None and len(out.objective_scores) == 1
+    assert set(out.objective_scores[0]) == {"comprehension", "geometry", "visual_quality"}
 
 
 def test_make_reflective_dataset_has_requested_components():
diff --git a/scripts/experiments/gepa-flowchart/tests/test_metric.py b/scripts/experiments/gepa-flowchart/tests/test_metric.py
index 8dae861..feaa6fa 100644
--- a/scripts/experiments/gepa-flowchart/tests/test_metric.py
+++ b/scripts/experiments/gepa-flowchart/tests/test_metric.py
@@ -7,6 +7,7 @@
     rendered_geometry_score,
     visual_eval,
     score_board,
+    weighted_harmonic_mean,
 )
 
 CFG = Config.from_env({})
@@ -92,11 +93,27 @@ def complete_fn(system, user, *, model, **kw):
     assert res.comp == 0.5
     assert res.geom == 1.0
     assert res.visual_quality == 0.8
-    assert abs(res.score - (CFG.w_comp * 0.5 + CFG.w_geom * 1.0 + CFG.w_vq * 0.8)) < 1e-9
+    expected = weighted_harmonic_mean(
+        [(CFG.w_comp, 0.5), (CFG.w_geom, 1.0), (CFG.w_vq, 0.8)], eps=CFG.score_epsilon
+    )
+    assert abs(res.score - expected) < 1e-9
     assert res.sub["text_comp"] == 0.5 and res.sub["visual_comp"] == 0.5
     assert "failure path" in res.feedback and "cramped" in res.feedback
 
 
+def test_weighted_harmonic_mean_anti_compensation():
+    # equal values -> that value
+    assert abs(weighted_harmonic_mean([(1, 0.6), (1, 0.6), (1, 0.6)]) - 0.6) < 1e-9
+    # one weak axis drags the total well below the arithmetic mean (no buying it back)
+    vals = [(0.5, 0.9), (0.3, 0.9), (0.2, 0.1)]
+    arith = sum(w * v for w, v in vals)
+    hm = weighted_harmonic_mean(vals)
+    assert hm < arith - 0.1
+    # a single 0 is floored, not collapsed to exactly 0 (gradient preserved)
+    z = weighted_harmonic_mean([(0.5, 0.9), (0.3, 0.9), (0.2, 0.0)], eps=0.05)
+    assert 0.0 < z < 0.3
+
+
 def test_render_failure_zeros_visual_signals():
     def validate_fn(content):
         return {"valid": True, "error": None, "findings": [], "warnings": []}

From 75fc6dcb67b4bd40026b820af20a4ae18cfd80d3 Mon Sep 17 00:00:00 2001
From: Ivan Cheung <ivanmkc@google.com>
Date: Fri, 19 Jun 2026 05:43:26 +0000
Subject: [PATCH 23/24] =?UTF-8?q?feat(gepa):=20overnight=20autonomous=20sw?=
 =?UTF-8?q?eep=20=E2=80=94=20huge=20corpus=20+=20experiments=20+=20cross-e?=
 =?UTF-8?q?val?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- corpus_gen: ~180 diverse use-cases across 20 domains (tolerant parse, dedupe, fallback)
- run.py: --topics / --val cap / reuse shared frozen_questions (comparable experiments)
- agg toggle (harmonic|linear) for ablation
- overnight.sh: robust orchestrator — shared frozen questions, 3 timeout-bounded + failure-isolated
  experiments (WHM+hybrid opus, linear+instance ablation, sonnet-gen ablation), then crosseval
- crosseval: re-score seed + each best on a held-out set under one canonical metric -> SUMMARY.md
---
 .../gepa-flowchart/gepa_flowchart/config.py   |  2 +
 .../gepa_flowchart/corpus_gen.py              | 94 +++++++++++++++++++
 .../gepa_flowchart/crosseval.py               | 87 +++++++++++++++++
 .../gepa-flowchart/gepa_flowchart/metric.py   |  9 +-
 .../gepa-flowchart/gepa_flowchart/run.py      | 33 +++++--
 .../experiments/gepa-flowchart/overnight.sh   | 72 ++++++++++++++
 6 files changed, 286 insertions(+), 11 deletions(-)
 create mode 100644 scripts/experiments/gepa-flowchart/gepa_flowchart/corpus_gen.py
 create mode 100644 scripts/experiments/gepa-flowchart/gepa_flowchart/crosseval.py
 create mode 100755 scripts/experiments/gepa-flowchart/overnight.sh

diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/config.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/config.py
index ca50a77..6cce42b 100644
--- a/scripts/experiments/gepa-flowchart/gepa_flowchart/config.py
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/config.py
@@ -22,6 +22,7 @@ class Config:
     # per-objective scores we hand it (comprehension / geometry / visual_quality), so a candidate
     # that's best on one objective or one topic survives. "instance" = original behavior.
     frontier_type: str = "hybrid"
+    agg: str = "harmonic"   # "harmonic" (weighted HM, anti-compensation) | "linear" (weighted sum, ablation)
     geom_error_penalty: float = 0.34
     geom_warning_penalty: float = 0.08
     rg_overlap_penalty: float = 0.34            # rendered-DOM: per overlapping node pair
@@ -56,6 +57,7 @@ def i(key: str, default: int) -> int:
             w_vq=f("GEPA_W_VQ", cls.w_vq),
             score_epsilon=f("GEPA_SCORE_EPSILON", cls.score_epsilon),
             frontier_type=s("GEPA_FRONTIER_TYPE", cls.frontier_type),
+            agg=s("GEPA_AGG", cls.agg),
             geom_error_penalty=f("GEPA_GEOM_ERR_PENALTY", cls.geom_error_penalty),
             geom_warning_penalty=f("GEPA_GEOM_WARN_PENALTY", cls.geom_warning_penalty),
             rg_overlap_penalty=f("GEPA_RG_OVERLAP_PENALTY", cls.rg_overlap_penalty),
diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/corpus_gen.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/corpus_gen.py
new file mode 100644
index 0000000..a30ac19
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/corpus_gen.py
@@ -0,0 +1,94 @@
+"""Generate a large, diverse corpus of flowchart use-cases for the overnight experiments.
+Tolerant parsing + dedupe + a guaranteed fallback so a partial LLM failure never produces an
+empty corpus. Run: python -m gepa_flowchart.corpus_gen <out.json> [per_domain]."""
+from __future__ import annotations
+
+import json
+import re
+import sys
+from pathlib import Path
+
+from .config import Config
+from .llm import complete
+
+DOMAINS = [
+    "software architecture & backend services", "DevOps, CI/CD and release engineering",
+    "distributed systems and consensus", "networking and protocols",
+    "data engineering and ML pipelines", "security, authentication and authorization",
+    "databases, storage and caching", "e-commerce, checkout and payments",
+    "business operations, approvals and workflows", "customer support and ITSM",
+    "healthcare and clinical workflows", "finance, trading and risk",
+    "manufacturing, supply chain and logistics", "scientific and laboratory processes",
+    "game logic and state machines", "product onboarding and growth funnels",
+    "compilers, interpreters and language runtimes", "robotics and control loops",
+    "incident response and on-call", "cloud infrastructure provisioning",
+]
+
+_SYS = (
+    "You design diverse flowchart USE-CASES for benchmarking a diagram generator. Each must be a "
+    "real process/flow worth diagramming — with steps, decisions, branches and failure paths. "
+    "Output ONLY a JSON array of objects: {\"topic\", \"audience\", \"purpose\"}. Be specific and varied."
+)
+
+
+def parse_objs(text: str) -> list[dict]:
+    i = text.find("[")
+    if i == -1:
+        return []
+    try:
+        v, _ = json.JSONDecoder().raw_decode(text[i:])
+    except (json.JSONDecodeError, ValueError):
+        return []
+    return [o for o in v if isinstance(o, dict)] if isinstance(v, list) else []
+
+
+def slug(s: str) -> str:
+    return re.sub(r"[^a-z0-9]+", "-", s.lower())[:48].strip("-")
+
+
+def main(argv=None) -> int:
+    argv = sys.argv[1:] if argv is None else argv
+    out = argv[0] if argv else "overnight/corpus.json"
+    per = int(argv[1]) if len(argv) > 1 else 9
+    cfg = Config.from_env({})
+    Path(out).parent.mkdir(parents=True, exist_ok=True)
+
+    seen: set[str] = set()
+    rows: list[dict] = []
+    for d in DOMAINS:
+        try:
+            txt = complete(
+                _SYS, f"Domain: {d}\nList {per} distinct flowchart use-cases as JSON objects.",
+                model=cfg.judge_model, effort="medium", max_tokens=2200,
+            )
+            added = 0
+            for o in parse_objs(txt):
+                topic = str(o.get("topic", "")).strip()
+                sid = slug(topic)
+                if not topic or not sid or sid in seen:
+                    continue
+                seen.add(sid)
+                rows.append({
+                    "id": sid, "topic": topic,
+                    "audience": str(o.get("audience", "a practitioner")).strip() or "a practitioner",
+                    "purpose": str(o.get("purpose", "understand the process")).strip() or "understand the process",
+                })
+                added += 1
+            print(f"[corpus] {d}: +{added} (total {len(rows)})", file=sys.stderr)
+        except Exception as e:  # never let one domain kill the corpus
+            print(f"[corpus] {d}: FAILED {type(e).__name__}: {str(e)[:120]}", file=sys.stderr)
+
+    # Guaranteed fallback: fold in the built-in 20 so the corpus is never tiny.
+    base_path = Path(__file__).resolve().parent.parent / "topics" / "topics.json"
+    for b in json.loads(base_path.read_text()):
+        if b["id"] not in seen:
+            rows.append(b)
+            seen.add(b["id"])
+
+    Path(out).write_text(json.dumps(rows, indent=1))
+    print(f"[corpus] wrote {len(rows)} topics -> {out}")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/crosseval.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/crosseval.py
new file mode 100644
index 0000000..f93ed19
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/crosseval.py
@@ -0,0 +1,87 @@
+"""Apples-to-apples comparison: re-score the seed prompt and each experiment's best prompt on a
+shared HELD-OUT set (topics unseen during optimization) under ONE canonical metric (WHM + the
+default weights), so experiments using different aggregations/frontiers are comparable. Writes
+overnight/SUMMARY.md. Fully wrapped — a failure here leaves each run's own report.md intact.
+
+Usage: python -m gepa_flowchart.crosseval <corpus.json> <frozen_dir> <holdout_n> <exp_dir> [exp_dir...]
+"""
+from __future__ import annotations
+
+import json
+import sys
+from pathlib import Path
+
+from .config import Config
+from .dataset import generate_questions, load_frozen, load_topics
+from .geometry_bridge import validate_flow
+from .llm import complete, complete_vision
+from .metric import score_board
+from .pipeline import SEED_CANDIDATE, run_pipeline
+from .render_bridge import RenderService
+
+
+def _eval(cand, tasks, cfg, render_fn):
+    rows = []
+    for t in tasks:
+        content, _ = run_pipeline(cand, t, model=cfg.gen_model, complete_fn=complete, max_tokens=cfg.gen_max_tokens)
+        r = score_board(content, t, cfg, validate_fn=validate_flow, complete_fn=complete,
+                        render_fn=render_fn, vision_fn=complete_vision)
+        rows.append(r)
+    n = len(rows) or 1
+    return {
+        "score": sum(r.score for r in rows) / n,
+        "comp": sum(r.comp for r in rows) / n,
+        "geom": sum(r.geom for r in rows) / n,
+        "vq": sum(r.visual_quality for r in rows) / n,
+    }
+
+
+def main(argv=None) -> int:
+    argv = sys.argv[1:] if argv is None else argv
+    corpus, frozen_dir, holdout_n = argv[0], argv[1], int(argv[2])
+    exp_dirs = argv[3:]
+    cfg = Config.from_env({"GEPA_AGG": "harmonic", "GEPA_FRONTIER_TYPE": "hybrid"})  # canonical metric
+
+    holdout = load_topics(corpus)[-holdout_n:]
+    holdout = load_frozen(holdout, frozen_dir)  # reuse shared questions if present
+    for t in holdout:
+        if not t.questions:
+            t.questions = generate_questions(t, model=cfg.judge_model, n=cfg.n_questions, complete_fn=complete)
+
+    cands = {"seed": SEED_CANDIDATE}
+    for d in exp_dirs:
+        p = Path(d) / "best_prompts.json"
+        if p.exists():
+            try:
+                cands[Path(d).name] = json.loads(p.read_text())
+            except Exception as e:
+                print(f"[crosseval] skip {d}: {e}", file=sys.stderr)
+
+    results = {}
+    with RenderService() as svc:
+        for name, c in cands.items():
+            try:
+                results[name] = _eval(c, holdout, cfg, svc.render)
+                print(f"[crosseval] {name}: {results[name]['score']:.3f}", file=sys.stderr)
+            except Exception as e:
+                print(f"[crosseval] {name} FAILED: {e}", file=sys.stderr)
+
+    md = [
+        "# Overnight experiments — held-out comparison",
+        "",
+        f"Canonical metric (WHM, weights {cfg.w_comp}/{cfg.w_geom}/{cfg.w_vq}), "
+        f"held-out topics ({holdout_n}, unseen during optimization): "
+        f"{', '.join(t.id for t in holdout)}",
+        "",
+        "| candidate | score | comprehension | geometry | visual_quality |",
+        "|---|---|---|---|---|",
+    ]
+    for name, m in sorted(results.items(), key=lambda kv: -kv[1]["score"]):
+        md.append(f"| {name} | **{m['score']:.3f}** | {m['comp']:.2f} | {m['geom']:.2f} | {m['vq']:.2f} |")
+    Path("overnight/SUMMARY.md").write_text("\n".join(md) + "\n")
+    print("[crosseval] wrote overnight/SUMMARY.md")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py
index df9d819..965f88d 100644
--- a/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py
@@ -181,9 +181,12 @@ def score_board(content, task: Task, cfg: Config, *, validate_fn, complete_fn, r
     geom = (hgeom + rgeom) / 2
     # Three distinct requirements → weighted harmonic mean (a weak axis drags the whole score down
     # and can't be compensated), ε-floored so a single 0 doesn't collapse everything to 0.
-    total = weighted_harmonic_mean(
-        [(cfg.w_comp, comp), (cfg.w_geom, geom), (cfg.w_vq, vqual)], eps=cfg.score_epsilon
-    )
+    # "linear" is the compensatory weighted-sum baseline, kept for ablation.
+    axes = [(cfg.w_comp, comp), (cfg.w_geom, geom), (cfg.w_vq, vqual)]
+    if cfg.agg == "linear":
+        total = sum(w * v for w, v in axes)
+    else:
+        total = weighted_harmonic_mean(axes, eps=cfg.score_epsilon)
     feedback = "\n".join([tcomp_fb, v_fb, hgeom_fb, rgeom_fb])
     sub = {"text_comp": tcomp, "visual_comp": vcomp, "heuristic_geom": hgeom, "rendered_geom": rgeom, "visual_quality": vqual}
     return ScoreResult(total, feedback, comp, geom, vqual, True, sub)
diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py
index ca17b0a..e12ac99 100644
--- a/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py
@@ -10,7 +10,7 @@
 
 from .adapter import FlowchartAdapter
 from .config import Config
-from .dataset import freeze_questions, load_topics
+from .dataset import freeze_questions, generate_questions, load_frozen, load_topics
 from .llm import complete, complete_vision, make_reflection_callable
 from .metric import score_board
 from .geometry_bridge import validate_flow
@@ -42,6 +42,8 @@ def main(argv=None) -> int:
     p.add_argument("--smoke", action="store_true", help="1 topic, tiny budget")
     p.add_argument("--max-metric-calls", type=int, default=None)
     p.add_argument("--train", type=int, default=8)
+    p.add_argument("--val", type=int, default=None, help="cap validation set size (default: all topics after train)")
+    p.add_argument("--topics", default=None, help="path to a topics.json (default: the built-in 20)")
     p.add_argument("--run-dir", default=None)
     args = p.parse_args(argv)
 
@@ -53,22 +55,37 @@ def main(argv=None) -> int:
     run_dir = args.run_dir or str(Path("runs") / datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ"))
     Path(run_dir).mkdir(parents=True, exist_ok=True)
 
-    topics = load_topics(_TOPICS)
+    topics = load_topics(args.topics or _TOPICS)
     if args.smoke:
         train, val = topics[:1], topics[:1]
     else:
-        train, val = topics[: args.train], topics[args.train :]
+        train = topics[: args.train]
+        val = topics[args.train :]
+        if args.val is not None:
+            val = val[: args.val]
 
-    print(f"[gepa] run_dir={run_dir} budget={budget} train={len(train)} val={len(val)}", file=sys.stderr)
+    print(f"[gepa] run_dir={run_dir} budget={budget} train={len(train)} val={len(val)} "
+          f"topics={args.topics or 'builtin'}", file=sys.stderr)
     print(f"[gepa] models gen={cfg.gen_model} reader={cfg.reader_model} judge={cfg.judge_model} "
           f"vision={cfg.vision_model} reflect={cfg.reflection_model}", file=sys.stderr)
-    print(f"[gepa] weights comp={cfg.w_comp} geom={cfg.w_geom} visual_quality={cfg.w_vq}", file=sys.stderr)
+    print(f"[gepa] agg={cfg.agg} frontier={cfg.frontier_type} weights comp={cfg.w_comp} "
+          f"geom={cfg.w_geom} vq={cfg.w_vq}", file=sys.stderr)
 
-    # Freeze the reader-questions once for the whole run (fair comparison), only for the topics
-    # actually used (train ∪ val) — freezing all 20 would waste calls on a small/smoke run.
+    # Reader-questions are frozen once. If the run_dir already carries a frozen_questions.json
+    # (e.g. the overnight orchestrator pre-froze a shared set so experiments are comparable), reuse
+    # it; otherwise generate for the used (train ∪ val) topics.
     used_ids = {t.id for t in train} | {t.id for t in val}
     used = [t for t in topics if t.id in used_ids]
-    frozen = freeze_questions(used, run_dir, model=cfg.judge_model, n=cfg.n_questions, complete_fn=complete)
+    if (Path(run_dir) / "frozen_questions.json").exists():
+        print("[gepa] reusing existing frozen_questions.json", file=sys.stderr)
+        frozen = load_frozen(used, run_dir)
+        # any topic missing from the shared file gets questions generated in-memory (don't rewrite
+        # the shared file — that would clobber the other topics' frozen questions)
+        for t in frozen:
+            if not t.questions:
+                t.questions = generate_questions(t, model=cfg.judge_model, n=cfg.n_questions, complete_fn=complete)
+    else:
+        frozen = freeze_questions(used, run_dir, model=cfg.judge_model, n=cfg.n_questions, complete_fn=complete)
     by_id = {t.id: t for t in frozen}
     train = [by_id[t.id] for t in train]
     val = [by_id[t.id] for t in val]
diff --git a/scripts/experiments/gepa-flowchart/overnight.sh b/scripts/experiments/gepa-flowchart/overnight.sh
new file mode 100755
index 0000000..0cdde9a
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/overnight.sh
@@ -0,0 +1,72 @@
+#!/usr/bin/env bash
+# Autonomous overnight experiment sweep for the GEPA flowchart optimizer. Robust by design: each
+# experiment is timeout-bounded and failure-isolated, so one hang/crash never wastes the night.
+# Everything logs to overnight/master.log; final comparison in overnight/SUMMARY.md.
+set -u
+cd "$(dirname "$0")"
+export CLAUDE_CODE_USE_VERTEX=1 ANTHROPIC_VERTEX_PROJECT_ID=adk-coding-agents CLOUD_ML_REGION=global
+export PYTHONDONTWRITEBYTECODE=1
+rm -rf gepa_flowchart/__pycache__ tests/__pycache__ 2>/dev/null
+mkdir -p overnight
+LOG=overnight/master.log
+say(){ echo "[$(date -u +%H:%M:%SZ)] $*" | tee -a "$LOG"; }
+cleanup(){ pkill -9 -f render_service.mjs 2>/dev/null; pkill -9 -f "packages/viewer/dist/server.js" 2>/dev/null; sleep 2; }
+
+say "===== OVERNIGHT START ====="
+
+# 1. Corpus (reuse if already generated; else build it).
+if [ ! -s overnight/corpus.json ]; then
+  say "generating corpus..."
+  timeout 30m python3 -u -m gepa_flowchart.corpus_gen overnight/corpus.json 9 >> "$LOG" 2>&1 || say "corpus_gen returned $?"
+fi
+N=$(python3 -c "import json;print(len(json.load(open('overnight/corpus.json'))))" 2>/dev/null || echo 0)
+say "corpus size = $N"
+if [ "$N" -lt 20 ]; then say "FATAL: corpus too small ($N); aborting"; exit 1; fi
+VAL=16
+if   [ "$N" -gt 127 ]; then TRAIN=100
+elif [ "$N" -gt 48 ];  then TRAIN=$((N-32))
+else TRAIN=$(( (N-12)/2 )); VAL=$(( (N-12)/4 )); fi
+say "split: train=$TRAIN val=$VAL (holdout=last 12 for cross-eval)"
+
+# 2. Pre-freeze ONE shared question set for train+val, so experiments are directly comparable.
+say "pre-freezing shared reader-questions for train+val..."
+timeout 30m python3 -u - "$TRAIN" "$VAL" <<'PY' >> "$LOG" 2>&1 || say "freeze returned $?"
+import sys
+from gepa_flowchart.config import Config
+from gepa_flowchart.dataset import load_topics, freeze_questions
+from gepa_flowchart.llm import complete
+tr, va = int(sys.argv[1]), int(sys.argv[2])
+cfg = Config.from_env({})
+used = load_topics("overnight/corpus.json")[: tr + va]
+freeze_questions(used, "overnight/frozen", model=cfg.judge_model, n=cfg.n_questions, complete_fn=complete)
+print(f"[freeze] froze {len(used)} topics")
+PY
+
+# 3. Experiments. Each: own run-dir seeded with the shared frozen questions; timeout-bounded.
+run_exp(){
+  local name=$1 tmout=$2 budget=$3 envline=$4
+  local rd="overnight/$name"
+  mkdir -p "$rd"
+  cp -f overnight/frozen/frozen_questions.json "$rd/frozen_questions.json" 2>/dev/null || true
+  say "EXP $name START (timeout=$tmout budget=$budget | $envline)"
+  ( eval "export $envline"
+    timeout "$tmout" python3 -u -m gepa_flowchart.run \
+      --topics overnight/corpus.json --train "$TRAIN" --val "$VAL" \
+      --max-metric-calls "$budget" --run-dir "$rd"
+  ) >> "overnight/$name.log" 2>&1
+  local rc=$?
+  if [ -f "$rd/report.md" ]; then say "EXP $name END rc=$rc report:OK"; else say "EXP $name END rc=$rc report:MISSING"; fi
+  cleanup
+}
+
+run_exp whm_hybrid_opus     4h    220 "GEPA_AGG=harmonic GEPA_FRONTIER_TYPE=hybrid"
+run_exp linear_instance     2h30m 130 "GEPA_AGG=linear GEPA_FRONTIER_TYPE=instance"
+run_exp whm_hybrid_sonnet   2h30m 130 "GEPA_AGG=harmonic GEPA_FRONTIER_TYPE=hybrid GEPA_GEN_MODEL=claude-sonnet-4-6"
+
+# 4. Cross-eval all bests on a held-out set under the canonical metric.
+say "cross-eval on held-out set..."
+timeout 90m python3 -u -m gepa_flowchart.crosseval overnight/corpus.json overnight/frozen 12 \
+  overnight/whm_hybrid_opus overnight/linear_instance overnight/whm_hybrid_sonnet >> "$LOG" 2>&1 \
+  || say "cross-eval failed (per-run report.md files are still valid)"
+cleanup
+say "===== OVERNIGHT DONE ====="

From 36afc772aa19e1c5136baa68b028eb45253c1fe6 Mon Sep 17 00:00:00 2001
From: Ivan Cheung <ivanmkc@google.com>
Date: Fri, 19 Jun 2026 06:06:29 +0000
Subject: [PATCH 24/24] fix(gepa): overnight cleanup reaps chromium + runs
 before first experiment

Orphaned render services (from kill -9 skipping the graceful handler) hold the fixed
render/viewer ports and make the next experiment fail 'render service exited before ready'.
cleanup() now also kills stray headless_shell/chrome and runs once at startup.
---
 scripts/experiments/gepa-flowchart/overnight.sh | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/scripts/experiments/gepa-flowchart/overnight.sh b/scripts/experiments/gepa-flowchart/overnight.sh
index 0cdde9a..f0c4521 100755
--- a/scripts/experiments/gepa-flowchart/overnight.sh
+++ b/scripts/experiments/gepa-flowchart/overnight.sh
@@ -10,9 +10,16 @@ rm -rf gepa_flowchart/__pycache__ tests/__pycache__ 2>/dev/null
 mkdir -p overnight
 LOG=overnight/master.log
 say(){ echo "[$(date -u +%H:%M:%SZ)] $*" | tee -a "$LOG"; }
-cleanup(){ pkill -9 -f render_service.mjs 2>/dev/null; pkill -9 -f "packages/viewer/dist/server.js" 2>/dev/null; sleep 2; }
+cleanup(){
+  pkill -9 -f render_service.mjs 2>/dev/null
+  pkill -9 -f "packages/viewer/dist/server.js" 2>/dev/null
+  pkill -9 -f "ms-playwright.*headless_shell" 2>/dev/null
+  pkill -9 -f "ms-playwright.*chrome" 2>/dev/null
+  sleep 2
+}
 
 say "===== OVERNIGHT START ====="
+cleanup   # clear any pre-existing render/viewer/chromium orphans before we begin
 
 # 1. Corpus (reuse if already generated; else build it).
 if [ ! -s overnight/corpus.json ]; then