From 3b4c5b8aa8a2c6215490255039ab9b5624dd8b63 Mon Sep 17 00:00:00 2001 From: Ivan Cheung Date: Thu, 18 Jun 2026 04:26:40 +0000 Subject: [PATCH 01/24] docs: spec for GEPA flowchart optimization (comprehension + geometry) Offline GEPA loop that evolves the brainstorm + generate prompts for termchart flowcharts, optimizing primarily for first-user comprehension (a fresh reader-LLM answers auto-generated reader-questions, judged for board-supported correctness) with geometry/readability as a secondary guardrail. Claude direct via the Anthropic SDK; standalone gepa; structured-content reader surface; smoke-first. --- ...6-18-gepa-flowchart-optimization-design.md | 226 ++++++++++++++++++ 1 file changed, 226 insertions(+) create mode 100644 docs/superpowers/specs/2026-06-18-gepa-flowchart-optimization-design.md diff --git a/docs/superpowers/specs/2026-06-18-gepa-flowchart-optimization-design.md b/docs/superpowers/specs/2026-06-18-gepa-flowchart-optimization-design.md new file mode 100644 index 0000000..b53ee1f --- /dev/null +++ b/docs/superpowers/specs/2026-06-18-gepa-flowchart-optimization-design.md @@ -0,0 +1,226 @@ +# GEPA optimization for readable, comprehensible flowcharts + +**Date:** 2026-06-18 +**Status:** Approved design — ready for implementation plan + +## Problem + +termchart can generate flowcharts, but the *quality* of a generated board is +uneven. Two failure modes recur: + +1. **Unreadable geometry** — edges run over nodes, nodes overlap, the graph is + so large it renders tiny. (The push-time geometry lint already detects this.) +2. **Insufficient detail / context** — the more common problem. A board renders + cleanly but a first-time reader can't actually learn what they need from it: + labels are terse, the triggers/conditions/outcomes aren't shown, there's no + orienting context. The reader is left with unanswered questions. 
+ +We want to **systematically improve the prompts** that produce flowcharts so the +output is both readable *and* comprehensible to someone seeing it for the first +time. GEPA (reflective prompt evolution) is a good fit: it mutates text prompts +using an LLM that reflects on execution feedback and a metric. + +## Goal + +Stand up a GEPA optimization loop that evolves the prompts used to **brainstorm** +and **generate** a termchart flowchart, optimizing primarily for **first-user +comprehension** (a fresh reader can answer the questions they'd naturally have), +with **geometric readability** as a secondary guardrail. Produce the evolved +prompts plus a before/after report. + +## Non-goals (YAGNI) + +- **No** changes to the termchart viewer/CLI public surface (the geometry + validator is reused via a thin bridge, not re-exported). +- **No** human-in-the-loop labeling — the eval is fully automated (LLM reader + + LLM judge). +- **No** visual/image rendering of the board for the reader — comprehension is + scored on the board's **structured content** (see Decisions). True pixel + readability is the geometry lint's job, not the comprehension test's. +- **No** integration into the live push path — this is an offline experiment that + produces better prompts; wiring them into a recipe/skill is a separate follow-up. 
+ +## Decisions (from brainstorming) + +| Question | Decision | +|---|---| +| LLM access | Claude direct via the Anthropic Python SDK (not the LiteLLM proxy) | +| GEPA implementation | Standalone `gepa` package (bring-your-own adapter) | +| What GEPA optimizes | **Two** text components: a `brainstorm` prompt and a `generate` prompt | +| Dataset | ~12 diverse flowchart topics, ~8 train / ~4 val, authored in-repo | +| Reader-questions (FUX backbone) | **Auto-generated per topic at run start**, then frozen for the run so every candidate is scored against the same questions | +| Reader input surface | **Structured flow content** (node/edge/group labels + annotations), not an image | +| Run scope | **Smoke-first**: a cheap `--smoke` default; a modest full run (~150 metric calls) via flag | + +## Architecture + +New Python harness in a new git worktree: + +``` +scripts/experiments/gepa-flowchart/ + README.md # how to run, env vars, cost notes + pyproject.toml # deps: gepa, anthropic + gepa_flowchart/ + __init__.py + config.py # models, weights, budget, paths (env-overridable) + llm.py # thin Anthropic SDK wrappers: generate / read / judge / reflect + dataset.py # ~12 topics; loads + freezes auto-generated reader-questions + pipeline.py # brainstorm -> generate (with termchart skill context) + geometry_bridge.py # shells out to the tsx Node validator, parses findings JSON + validate_flow.ts # Node bridge: imports geometryReport, reads stdin, prints JSON + metric.py # structural gate + geometry score + comprehension score + feedback + adapter.py # gepa.GEPAAdapter: evaluate() + make_reflective_dataset() + seed_prompts.py # the seed brainstorm/generate prompts (the starting candidate) + run.py # CLI entrypoint: gepa.optimize(...), writes results + report + tests/ # pytest: pure-logic units with a fake LLM + one smoke + topics/ # the authored topic dataset (json/yaml) +``` + +### The pipeline GEPA optimizes (per task) + +``` +topic ─┬▶ [brainstorm prompt]* ─▶ plan: what to 
show + what context a reader needs + └▶ [generate prompt]* + termchart skill context ─▶ flow JSON + │ + ┌─────────────────────────────────────────┘ + ▼ + validators → combined score + textual feedback ─▶ GEPA reflection ─▶ better prompts* +``` + +`*` marks the two text components GEPA mutates. The seed candidate is +`{ "brainstorm": , "generate": }`. The generate prompt's static +context includes the termchart `flow` JSON schema and 1–2 shipped gallery example +specs (`plugin/skills/diagram-recipes/examples/*.flow.json`) — this is the +"generate using termchart skills" step. + +### The flow spec format (target output) + +```jsonc +{ + "direction": "TB" | "LR" | "BT" | "RL", // default TB + "nodes": [{ "id", "data": { "label", "status?" }, "group?" }], + "edges": [{ "source", "target", "data?": { "label?" } }], + "groups?": [{ "id", "label", "color?" }], // tiers/lanes/zones + "tiers?": bool, "lanes?": bool +} +``` +Detail and context live in `data.label` (rich node text), edge `data.label` +(what the transition is/when it happens), and group labels — the levers the +generate prompt learns to use. + +## The metric (the heart of this) + +`evaluate(task, flow_json)` returns a score in `[0,1]` plus a textual feedback +string for GEPA's reflection. + +1. **Structural validity (hard gate).** Run the shared `validateContent` logic. + Invalid/unparseable → score `0.0`, feedback = the precise path-pointed error. + Everything below only runs on a valid spec. + +2. **Geometry / readability** → `geom_score ∈ [0,1]`. From the geometry bridge's + findings: `error`-severity (edge-over-node, node-overlap, missing-ref) drive a + large penalty; `warning`-severity (crossings, edge-near-node, low-readability, + and the over-stuffed-density codes) a small one. Feedback = the findings' + actionable messages. + +3. **Comprehension / first-user-experience (PRIMARY)** → `comp_score ∈ [0,1]`. 
+ - A **fresh reader-LLM with no prior context** (system: "you're seeing this + board for the first time; answer ONLY from what's shown; say 'not shown' if + the board doesn't tell you") is given the board's **structured content** and + the task's frozen reader-questions. + - A **judge-LLM** scores each answer 0–1 on two axes: *correct* AND + *actually supported by the board*. An answer the reader had to invent, or + marked "not shown", counts as a comprehension miss attributable to **missing + detail/context** — flagged explicitly. + - `comp_score = mean(per-question scores)`. + - Feedback = the specific questions that scored low, each with the judge's + reason (e.g. "board never indicates what happens when validation fails"), + plus a short "missing context" list. This is what pushes reflection toward + adding detail and orienting context, not just cleaner layout. + +**Combined:** `total = w_comp * comp_score + w_geom * geom_score`, gated by +validity. Defaults `w_comp = 0.6`, `w_geom = 0.4` (env-configurable). Comprehension +leads because under-detailing is the main problem we're fixing; geometry remains a +real guardrail so the optimizer can't win by dumping unreadable walls of text. + +## GEPA wiring + +- `gepa.optimize(seed_candidate, trainset, valset, adapter, reflection_lm=, max_metric_calls=)`. +- `reflection_lm` is a **callable** wrapping the Anthropic SDK (Claude direct) — + not a LiteLLM model string — honoring the "Claude direct" decision. +- The `GEPAAdapter` implements `evaluate(batch, candidate, capture_traces)` (runs + the pipeline + metric per task) and `make_reflective_dataset(...)` (turns the + captured feedback into the per-component reflective examples GEPA mutates on). 
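As a rough illustration of the combined metric described above, the sketch below shows one way the validity gate, the geometry penalties, and the weighted sum could fit together. It is an assumption, not the harness's actual `metric.py`: the `Finding` class, the `geom_score` and `total_score` helpers, and the "subtract a penalty per finding, then clamp" mapping are hypothetical. Only the weights (`w_comp = 0.6`, `w_geom = 0.4`) and the penalty defaults (`0.34`, `0.08`) come from this document.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    # Hypothetical stand-in for one geometry finding from the bridge.
    severity: str  # "error" or "warning"
    code: str      # e.g. "edge-over-node"


def geom_score(findings, error_penalty=0.34, warning_penalty=0.08):
    """Assumed mapping: start at 1.0, subtract a penalty per finding, clamp at 0."""
    errors = sum(1 for f in findings if f.severity == "error")
    warnings = sum(1 for f in findings if f.severity == "warning")
    return max(0.0, 1.0 - error_penalty * errors - warning_penalty * warnings)


def total_score(valid, comp_score, findings, w_comp=0.6, w_geom=0.4):
    """Weighted sum of comprehension and geometry, gated by structural validity."""
    if not valid:
        # Hard gate: an invalid or unparseable spec scores zero outright.
        return 0.0
    return w_comp * comp_score + w_geom * geom_score(findings)


if __name__ == "__main__":
    one_warning = [Finding("warning", "crossings")]
    print(total_score(True, 0.5, one_warning))
```

Under this assumed mapping, three error-severity findings are enough to drive `geom_score` to zero, so the weighted sum for such a board is capped at `w_comp` (0.6) no matter how well the reader does.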
+ +## Models (Anthropic SDK direct) + +| Role | Default model | Notes | +|---|---|---| +| Generation (high volume) | `claude-opus-4-8` | env knob to `claude-sonnet-4-6` to cut cost | +| Reader (FUX) | `claude-sonnet-4-6` | simulates an average reader; cheap; no thinking needed | +| Judge | `claude-opus-4-8` | scores reader answers; distinct role from the generator | +| GEPA reflection | `claude-opus-4-8` | strongest model proposes prompt mutations | + +Adaptive thinking on the reasoning-heavy roles (reflection, judge). Auth via +`ANTHROPIC_API_KEY` or an `ant auth login` profile. + +## Geometry bridge (TS ↔ Python) + +The geometry validator (`packages/viewer/src/flow-geometry.ts` → `geometryReport`) +is TypeScript and intentionally not on the CLI/public surface. Rather than re-export +it or stand up a server, a small `validate_flow.ts` Node script imports +`geometryReport`, reads a flow-JSON spec on stdin, and prints +`{ findings: [...], warnings: [...] }` on stdout. It runs via `npx tsx` with the +viewer package as the resolution root (so `dagre` etc. resolve). `geometry_bridge.py` +shells out to it and parses the JSON. No changes to shipped packages. + +## Dataset + +`topics/` holds ~12 tasks, each: + +```jsonc +{ + "id": "ci-cd-pipeline", + "topic": "CI/CD pipeline for a web app", + "audience": "an engineer new to the team", + "purpose": "understand how code reaches production and what can go wrong" +} +``` +Reader-questions are **not** stored per topic — at run start the harness +auto-generates a fixed set per topic (via an LLM, from topic+audience+purpose) and +**freezes** them to the run directory, so every candidate in that run is judged +against identical questions (fair comparison; stable signal within a run). Split +~8 train / ~4 val. + +## Outputs + +Written to a timestamped run directory: +- `best_prompts.json` — the evolved `brainstorm` + `generate` prompts. +- `report.md` — seed vs. 
best on the val set: comprehension score, geometry score, + combined, and per-question deltas (which previously-unanswerable questions the + improved board now answers). +- `frozen_questions.json` — the reader-questions used for the run. + +## Testing + +- **Unit (pytest, fake LLM):** findings→`geom_score` mapping; comprehension + scoring + feedback assembly; structural-gate behavior; dataset load + question + freeze; the adapter's `evaluate`/`make_reflective_dataset` shapes. +- **Geometry bridge:** a known-bad spec (edge over node) yields the expected + finding; a clean spec yields none. +- **Smoke (live, tiny):** `run.py --smoke` runs 1 topic at a tiny budget end to + end and asserts a report is produced. Gated behind `ANTHROPIC_API_KEY`. + +## Cost & scope control + +- `--smoke` is the default-safe entry: 1 topic, minimal budget, a handful of LLM + calls — validates the whole loop for cents. +- A full run defaults to ~150 metric calls (`--max-metric-calls`), train/val as + above. All models, weights, and budget are env/flag-overridable. +- Every run prints an up-front estimate (rollouts × calls/rollout × models) before + spending. + +## Files touched / created + +All new, under `scripts/experiments/gepa-flowchart/` (+ this spec). No existing +package code is modified. From 08a3a2a23adb81bd04901a209b1c81995ac53b75 Mon Sep 17 00:00:00 2001 From: Ivan Cheung Date: Thu, 18 Jun 2026 04:34:17 +0000 Subject: [PATCH 02/24] docs: implementation plan for GEPA flowchart optimization 10 TDD tasks: scaffold+config, geometry bridge (tsx), Anthropic LLM wrappers, board serializer, dataset+question freeze, seed prompts+pipeline, three-part metric, GEPAAdapter, run+report, README. Verified gepa API (adapter protocol, EvaluationBatch, optimize signature, reflection callable) and Anthropic SDK patterns against source. 
--- .../2026-06-18-gepa-flowchart-optimization.md | 1535 +++++++++++++++++ 1 file changed, 1535 insertions(+) create mode 100644 docs/superpowers/plans/2026-06-18-gepa-flowchart-optimization.md diff --git a/docs/superpowers/plans/2026-06-18-gepa-flowchart-optimization.md b/docs/superpowers/plans/2026-06-18-gepa-flowchart-optimization.md new file mode 100644 index 0000000..302cddf --- /dev/null +++ b/docs/superpowers/plans/2026-06-18-gepa-flowchart-optimization.md @@ -0,0 +1,1535 @@ +# GEPA Flowchart Optimization Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Build an offline GEPA loop that evolves the `brainstorm` + `generate` prompts for termchart flowcharts, optimizing primarily for first-user comprehension and secondarily for geometric readability. + +**Architecture:** A Python harness (`scripts/experiments/gepa-flowchart/`) drives the standalone `gepa` package. Each rollout runs a two-prompt pipeline (brainstorm → generate flow JSON) through a three-part metric: structural-validity gate, geometry score (reused from the TS `geometryReport` via a `tsx` bridge), and comprehension score (a fresh reader-LLM answers auto-generated, run-frozen questions; a judge scores board-supported correctness). GEPA reflects on the combined feedback to mutate the two prompts. All LLM access is Claude direct via the Anthropic Python SDK. + +**Tech Stack:** Python 3.11+, `gepa`, `anthropic` (Python SDK); Node + `tsx` for the geometry bridge that imports the existing TypeScript validators. + +## Global Constraints + +- Python **3.11+** (uses `X | Y` unions, `list[...]` generics). +- LLM access is **Claude direct via the `anthropic` Python SDK** — never LiteLLM, never a non-Anthropic shim. 
+- **No existing package code is modified.** Everything new lives under `scripts/experiments/gepa-flowchart/`. The TS validators are *imported* by a bridge, not edited or re-exported. +- Models come from config; defaults: generation `claude-opus-4-8`, reader `claude-sonnet-4-6`, judge `claude-opus-4-8`, reflection `claude-opus-4-8`. All env-overridable. +- **Never pass `temperature`/`top_p`/`top_k`** to the Anthropic API (these 400 on Opus 4.8 / Sonnet 4.6). Steer with prompts; control depth with `output_config={"effort": ...}`. +- Combined score `total = w_comp*comp_score + w_geom*geom_score`, gated by validity. Defaults `w_comp=0.6`, `w_geom=0.4`. +- TDD: write the failing test, watch it fail, minimal code, watch it pass, commit. Frequent commits. +- Run all `pytest` from the project dir: `cd scripts/experiments/gepa-flowchart`. + +--- + +### Task 1: Project scaffold + config + +**Files:** +- Create: `scripts/experiments/gepa-flowchart/pyproject.toml` +- Create: `scripts/experiments/gepa-flowchart/package.json` +- Create: `scripts/experiments/gepa-flowchart/.gitignore` +- Create: `scripts/experiments/gepa-flowchart/gepa_flowchart/__init__.py` +- Create: `scripts/experiments/gepa-flowchart/gepa_flowchart/config.py` +- Test: `scripts/experiments/gepa-flowchart/tests/test_config.py` + +**Interfaces:** +- Produces: `Config` dataclass with fields `gen_model: str`, `reader_model: str`, `judge_model: str`, `reflection_model: str`, `w_comp: float`, `w_geom: float`, `geom_error_penalty: float`, `geom_warning_penalty: float`, `n_questions: int`, `max_metric_calls: int`, `gen_max_tokens: int`. Classmethod `Config.from_env(overrides: dict | None = None) -> Config`. 
+ +- [ ] **Step 1: Write the config files (non-test scaffolding)** + +`pyproject.toml`: +```toml +[project] +name = "gepa-flowchart" +version = "0.1.0" +requires-python = ">=3.11" +dependencies = ["gepa", "anthropic>=0.69"] + +[project.optional-dependencies] +dev = ["pytest>=8"] + +[tool.pytest.ini_options] +testpaths = ["tests"] +``` + +`package.json`: +```json +{ + "name": "gepa-flowchart-bridge", + "private": true, + "devDependencies": { "tsx": "^4.19.0" } +} +``` + +`.gitignore`: +``` +__pycache__/ +*.pyc +.venv/ +node_modules/ +runs/ +``` + +`gepa_flowchart/__init__.py`: +```python +"""GEPA optimization for readable, comprehensible termchart flowcharts.""" +``` + +- [ ] **Step 2: Write the failing test** + +`tests/test_config.py`: +```python +from gepa_flowchart.config import Config + + +def test_defaults(): + cfg = Config.from_env({}) + assert cfg.gen_model == "claude-opus-4-8" + assert cfg.reader_model == "claude-sonnet-4-6" + assert cfg.judge_model == "claude-opus-4-8" + assert cfg.reflection_model == "claude-opus-4-8" + assert cfg.w_comp == 0.6 and cfg.w_geom == 0.4 + + +def test_env_override(): + cfg = Config.from_env({"GEPA_GEN_MODEL": "claude-sonnet-4-6", "GEPA_W_COMP": "0.7"}) + assert cfg.gen_model == "claude-sonnet-4-6" + assert cfg.w_comp == 0.7 +``` + +- [ ] **Step 3: Run test to verify it fails** + +Run: `cd scripts/experiments/gepa-flowchart && python -m pytest tests/test_config.py -v` +Expected: FAIL — `ModuleNotFoundError: No module named 'gepa_flowchart.config'` + +- [ ] **Step 4: Write minimal implementation** + +`gepa_flowchart/config.py`: +```python +from __future__ import annotations + +import os +from dataclasses import dataclass + + +@dataclass +class Config: + gen_model: str = "claude-opus-4-8" + reader_model: str = "claude-sonnet-4-6" + judge_model: str = "claude-opus-4-8" + reflection_model: str = "claude-opus-4-8" + w_comp: float = 0.6 + w_geom: float = 0.4 + geom_error_penalty: float = 0.34 + geom_warning_penalty: float = 0.08 + 
n_questions: int = 5 + max_metric_calls: int = 150 + gen_max_tokens: int = 8000 + + @classmethod + def from_env(cls, overrides: dict | None = None) -> "Config": + env = {**os.environ, **(overrides or {})} + + def s(key: str, default: str) -> str: + return env.get(key, default) + + def f(key: str, default: float) -> float: + return float(env.get(key, default)) + + def i(key: str, default: int) -> int: + return int(env.get(key, default)) + + return cls( + gen_model=s("GEPA_GEN_MODEL", cls.gen_model), + reader_model=s("GEPA_READER_MODEL", cls.reader_model), + judge_model=s("GEPA_JUDGE_MODEL", cls.judge_model), + reflection_model=s("GEPA_REFLECTION_MODEL", cls.reflection_model), + w_comp=f("GEPA_W_COMP", cls.w_comp), + w_geom=f("GEPA_W_GEOM", cls.w_geom), + geom_error_penalty=f("GEPA_GEOM_ERR_PENALTY", cls.geom_error_penalty), + geom_warning_penalty=f("GEPA_GEOM_WARN_PENALTY", cls.geom_warning_penalty), + n_questions=i("GEPA_N_QUESTIONS", cls.n_questions), + max_metric_calls=i("GEPA_MAX_METRIC_CALLS", cls.max_metric_calls), + gen_max_tokens=i("GEPA_GEN_MAX_TOKENS", cls.gen_max_tokens), + ) +``` + +- [ ] **Step 5: Run test to verify it passes** + +Run: `cd scripts/experiments/gepa-flowchart && python -m pytest tests/test_config.py -v` +Expected: PASS (2 passed) + +- [ ] **Step 6: Commit** + +```bash +git add scripts/experiments/gepa-flowchart/ +git commit -m "feat(gepa): scaffold gepa-flowchart project + config" +``` + +--- + +### Task 2: Geometry bridge (TS validator + Python wrapper) + +**Files:** +- Create: `scripts/experiments/gepa-flowchart/gepa_flowchart/validate_flow.ts` +- Create: `scripts/experiments/gepa-flowchart/gepa_flowchart/geometry_bridge.py` +- Test: `scripts/experiments/gepa-flowchart/tests/test_geometry_bridge.py` + +**Interfaces:** +- Produces: `validate_flow(content: str, *, cwd: str | None = None) -> dict` returning `{"valid": bool, "error": str | None, "findings": list[dict], "warnings": list[str]}`. 
Each finding dict has `severity`, `code`, `message`, `count`. + +- [ ] **Step 1: Set up Node deps (one-time, not a code step)** + +Run (from repo root, then the project dir): +```bash +npm install # populates packages/viewer/node_modules (dagre, etc.) +cd scripts/experiments/gepa-flowchart && npm install # installs tsx locally +``` + +- [ ] **Step 2: Write the TS bridge** + +`gepa_flowchart/validate_flow.ts` (relative imports reach the existing validators; tsx resolves `.js`→`.ts`): +```typescript +// Reads a flow-JSON spec on stdin, prints { valid, error, findings, warnings } on stdout. +// Imports the existing validators directly — no changes to shipped packages. +import { validateContent } from "../../../../packages/core/src/validate.js"; +import { geometryReport } from "../../../../packages/viewer/src/flow-geometry.js"; + +function readStdin(): Promise<string> { + return new Promise((resolve) => { + let data = ""; + process.stdin.setEncoding("utf8"); + process.stdin.on("data", (c) => (data += c)); + process.stdin.on("end", () => resolve(data)); + }); +} + +const content = await readStdin(); +const error = validateContent("flow", content); +if (error) { + process.stdout.write(JSON.stringify({ valid: false, error, findings: [], warnings: [] })); +} else { + const { warnings, findings } = geometryReport("flow", content); + process.stdout.write(JSON.stringify({ valid: true, error: null, findings, warnings })); +} +``` + +- [ ] **Step 3: Write the failing test** + +`tests/test_geometry_bridge.py`: +```python +import json + +from gepa_flowchart.geometry_bridge import validate_flow + +CLEAN = json.dumps({ + "direction": "TB", + "nodes": [{"id": "a", "data": {"label": "A"}}, {"id": "b", "data": {"label": "B"}}], + "edges": [{"source": "a", "target": "b"}], +}) + +# a -> c with b sitting on the a-c line (explicit positions force the overlap) +EDGE_OVER_NODE = json.dumps({ + "layout": "manual", "direction": "LR", + "nodes": [ + {"id": "a", "position": {"x": 0, "y": 0}}, + {"id": 
"b", "position": {"x": 200, "y": 0}}, + {"id": "c", "position": {"x": 440, "y": 0}}, + ], + "edges": [{"source": "a", "target": "c"}], +}) + + +def test_clean_spec_is_valid_no_errors(): + r = validate_flow(CLEAN) + assert r["valid"] is True + assert [f for f in r["findings"] if f["severity"] == "error"] == [] + + +def test_edge_over_node_flagged(): + r = validate_flow(EDGE_OVER_NODE) + assert r["valid"] is True + assert any(f["code"] == "edge-over-node" for f in r["findings"]) + + +def test_invalid_json_reported(): + r = validate_flow("{ not json") + assert r["valid"] is False + assert r["error"] +``` + +- [ ] **Step 4: Run test to verify it fails** + +Run: `cd scripts/experiments/gepa-flowchart && python -m pytest tests/test_geometry_bridge.py -v` +Expected: FAIL — `ModuleNotFoundError: No module named 'gepa_flowchart.geometry_bridge'` + +- [ ] **Step 5: Write the Python wrapper** + +`gepa_flowchart/geometry_bridge.py`: +```python +from __future__ import annotations + +import json +import subprocess +from pathlib import Path + +_HERE = Path(__file__).resolve().parent +_PROJECT = _HERE.parent # scripts/experiments/gepa-flowchart +_SCRIPT = _HERE / "validate_flow.ts" + + +def validate_flow(content: str, *, cwd: str | None = None) -> dict: + """Run the TS validator on a flow-JSON string. Returns + {valid, error, findings, warnings}. Raises RuntimeError if the bridge fails.""" + proc = subprocess.run( + ["npx", "tsx", str(_SCRIPT)], + input=content, + capture_output=True, + text=True, + cwd=cwd or str(_PROJECT), + ) + if proc.returncode != 0: + raise RuntimeError(f"geometry bridge failed: {proc.stderr.strip()}") + return json.loads(proc.stdout) +``` + +- [ ] **Step 6: Run test to verify it passes** + +Run: `cd scripts/experiments/gepa-flowchart && python -m pytest tests/test_geometry_bridge.py -v` +Expected: PASS (3 passed). 
If `npx tsx` errors on resolution, confirm Step 1 ran (`packages/viewer/node_modules/dagre` exists and `scripts/experiments/gepa-flowchart/node_modules/.bin/tsx` exists). + +- [ ] **Step 7: Commit** + +```bash +git add scripts/experiments/gepa-flowchart/ +git commit -m "feat(gepa): geometry bridge reusing TS validateContent + geometryReport" +``` + +--- + +### Task 3: Anthropic LLM wrappers + reflection callable + +**Files:** +- Create: `scripts/experiments/gepa-flowchart/gepa_flowchart/llm.py` +- Test: `scripts/experiments/gepa-flowchart/tests/test_llm.py` + +**Interfaces:** +- Produces: + - `complete(system: str, user: str, *, model: str, effort: str = "medium", max_tokens: int = 4096, client=None) -> str` — single-turn Claude call; returns concatenated text blocks. + - `make_reflection_callable(model: str, *, client=None) -> Callable[[str], str]` — wraps `complete` as the `(prompt) -> str` callable GEPA's `reflection_lm` expects. + - `get_client()` — lazily constructs and caches an `anthropic.Anthropic()`. 
+ +- [ ] **Step 1: Write the failing test** + +`tests/test_llm.py`: +```python +from gepa_flowchart import llm + + +class FakeMessages: + def __init__(self, recorder): + self.recorder = recorder + + def create(self, **kwargs): + self.recorder.update(kwargs) + block = type("Block", (), {"type": "text", "text": "hello world"})() + return type("Resp", (), {"content": [block]})() + + +class FakeClient: + def __init__(self): + self.calls = {} + self.messages = FakeMessages(self.calls) + + +def test_complete_extracts_text_and_passes_params(): + fake = FakeClient() + out = llm.complete("sys", "usr", model="claude-opus-4-8", effort="high", client=fake) + assert out == "hello world" + assert fake.calls["model"] == "claude-opus-4-8" + assert fake.calls["output_config"] == {"effort": "high"} + assert "temperature" not in fake.calls # must never be sent + assert fake.calls["system"] == "sys" + assert fake.calls["messages"] == [{"role": "user", "content": "usr"}] + + +def test_reflection_callable_returns_str(): + fake = FakeClient() + fn = llm.make_reflection_callable("claude-opus-4-8", client=fake) + assert fn("reflect on this") == "hello world" +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run: `cd scripts/experiments/gepa-flowchart && python -m pytest tests/test_llm.py -v` +Expected: FAIL — `AttributeError: module 'gepa_flowchart.llm' has no attribute 'complete'` (or import error) + +- [ ] **Step 3: Write minimal implementation** + +`gepa_flowchart/llm.py`: +```python +from __future__ import annotations + +from typing import Callable + +_client = None + + +def get_client(): + global _client + if _client is None: + import anthropic + _client = anthropic.Anthropic() + return _client + + +def complete( + system: str, + user: str, + *, + model: str, + effort: str = "medium", + max_tokens: int = 4096, + client=None, +) -> str: + client = client or get_client() + resp = client.messages.create( + model=model, + max_tokens=max_tokens, + system=system, + 
output_config={"effort": effort}, + messages=[{"role": "user", "content": user}], + ) + return "".join(b.text for b in resp.content if getattr(b, "type", None) == "text") + + +def make_reflection_callable(model: str, *, client=None) -> Callable[[str], str]: + def reflect(prompt: str) -> str: + return complete( + "You are an expert prompt engineer improving prompts based on feedback.", + prompt, + model=model, + effort="high", + max_tokens=8000, + client=client, + ) + + return reflect +``` + +- [ ] **Step 4: Run test to verify it passes** + +Run: `cd scripts/experiments/gepa-flowchart && python -m pytest tests/test_llm.py -v` +Expected: PASS (2 passed) + +- [ ] **Step 5: Commit** + +```bash +git add scripts/experiments/gepa-flowchart/ +git commit -m "feat(gepa): Anthropic SDK wrappers + reflection callable" +``` + +--- + +### Task 4: Board serializer + JSON extractor + +**Files:** +- Create: `scripts/experiments/gepa-flowchart/gepa_flowchart/render.py` +- Test: `scripts/experiments/gepa-flowchart/tests/test_render.py` + +**Interfaces:** +- Produces: + - `board_to_text(flow: dict) -> str` — renders a flow spec as structured, reader-facing text (direction, groups, nodes-by-group with labels/status, edges with labels). + - `extract_json(text: str) -> str | None` — pulls the first JSON object out of an LLM response (strips ```json fences; falls back to first balanced `{...}`). 
+ +- [ ] **Step 1: Write the failing test** + +`tests/test_render.py`: +```python +import json + +from gepa_flowchart.render import board_to_text, extract_json + + +def test_board_to_text_includes_labels_and_edges(): + flow = { + "direction": "TB", + "groups": [{"id": "g1", "label": "Frontend"}], + "nodes": [ + {"id": "a", "group": "g1", "data": {"label": "Login form"}}, + {"id": "b", "data": {"label": "Auth service", "status": "active"}}, + ], + "edges": [{"source": "a", "target": "b", "data": {"label": "submit"}}], + } + text = board_to_text(flow) + assert "Login form" in text + assert "Auth service" in text + assert "Frontend" in text + assert "submit" in text + + +def test_extract_json_from_fenced(): + raw = 'Here:\n```json\n{"nodes": [], "edges": []}\n```\nDone.' + out = extract_json(raw) + assert json.loads(out) == {"nodes": [], "edges": []} + + +def test_extract_json_bare(): + out = extract_json('prefix {"x": 1} suffix') + assert json.loads(out) == {"x": 1} + + +def test_extract_json_none_when_absent(): + assert extract_json("no json here") is None +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run: `cd scripts/experiments/gepa-flowchart && python -m pytest tests/test_render.py -v` +Expected: FAIL — `ModuleNotFoundError: No module named 'gepa_flowchart.render'` + +- [ ] **Step 3: Write minimal implementation** + +`gepa_flowchart/render.py`: +```python +from __future__ import annotations + + +def board_to_text(flow: dict) -> str: + lines: list[str] = [] + direction = flow.get("direction", "TB") + lines.append(f"Flowchart (direction: {direction})") + groups = {g["id"]: g for g in flow.get("groups", []) if isinstance(g, dict) and "id" in g} + nodes = [n for n in flow.get("nodes", []) if isinstance(n, dict)] + + lines.append("\nNodes:") + for n in nodes: + nid = n.get("id", "?") + data = n.get("data", {}) if isinstance(n.get("data"), dict) else {} + label = data.get("label", "(no label)") + status = data.get("status") + grp = groups.get(n.get("group", 
""), {}).get("label") + suffix = [] + if grp: + suffix.append(f"group: {grp}") + if status: + suffix.append(f"status: {status}") + extra = f" [{', '.join(suffix)}]" if suffix else "" + lines.append(f" - {nid}: {label}{extra}") + + label_by_id = { + n.get("id"): (n.get("data", {}) or {}).get("label", n.get("id")) + for n in nodes + } + lines.append("\nConnections:") + for e in flow.get("edges", []): + if not isinstance(e, dict): + continue + src = label_by_id.get(e.get("source"), e.get("source")) + tgt = label_by_id.get(e.get("target"), e.get("target")) + elabel = (e.get("data", {}) or {}).get("label") if isinstance(e.get("data"), dict) else None + arrow = f' --[{elabel}]-->' if elabel else " -->" + lines.append(f" - {src}{arrow} {tgt}") + return "\n".join(lines) + + +def extract_json(text: str) -> str | None: + if "```" in text: + # take the content of the first fenced block + parts = text.split("```") + for chunk in parts[1:]: + body = chunk + if body.lstrip().lower().startswith("json"): + body = body.lstrip()[4:] + body = body.strip() + if body.startswith("{"): + bal = _balanced(body) + if bal: + return bal + start = text.find("{") + if start == -1: + return None + return _balanced(text[start:]) + + +def _balanced(s: str) -> str | None: + depth = 0 + in_str = False + esc = False + for i, ch in enumerate(s): + if in_str: + if esc: + esc = False + elif ch == "\\": + esc = True + elif ch == '"': + in_str = False + continue + if ch == '"': + in_str = True + elif ch == "{": + depth += 1 + elif ch == "}": + depth -= 1 + if depth == 0: + return s[: i + 1] + return None +``` + +- [ ] **Step 4: Run test to verify it passes** + +Run: `cd scripts/experiments/gepa-flowchart && python -m pytest tests/test_render.py -v` +Expected: PASS (4 passed) + +- [ ] **Step 5: Commit** + +```bash +git add scripts/experiments/gepa-flowchart/ +git commit -m "feat(gepa): board-to-text serializer + JSON extractor" +``` + +--- + +### Task 5: Dataset + reader-question freezing + +**Files:** +- 
Create: `scripts/experiments/gepa-flowchart/topics/topics.json`
+- Create: `scripts/experiments/gepa-flowchart/gepa_flowchart/dataset.py`
+- Test: `scripts/experiments/gepa-flowchart/tests/test_dataset.py`
+
+**Interfaces:**
+- Consumes: `llm.complete` signature (a callable matching `complete(system, user, *, model, ...) -> str` is injected as `complete_fn` for testability).
+- Produces:
+  - `Task` dataclass: `id: str`, `topic: str`, `audience: str`, `purpose: str`, `questions: list[str]` (empty until frozen).
+  - `load_topics(path: str) -> list[Task]`.
+  - `generate_questions(task: Task, *, model: str, n: int, complete_fn) -> list[str]`.
+  - `freeze_questions(tasks: list[Task], run_dir: str, *, model: str, n: int, complete_fn) -> list[Task]`: writes `frozen_questions.json` under `run_dir` and returns tasks with `questions` populated.
+  - `load_frozen(tasks: list[Task], run_dir: str) -> list[Task]`.
+
+- [ ] **Step 1: Write the topics dataset (data, not code)**
+
+`topics/topics.json` (12 entries):
+```json
+[
+  {"id": "ci-cd", "topic": "CI/CD pipeline for a web app", "audience": "an engineer new to the team", "purpose": "understand how code reaches production and what can go wrong"},
+  {"id": "user-auth", "topic": "User authentication and session flow", "audience": "a backend developer", "purpose": "understand login, token issuance, and refresh"},
+  {"id": "order-fulfillment", "topic": "E-commerce order fulfillment", "audience": "an operations analyst", "purpose": "trace an order from checkout to delivery"},
+  {"id": "incident-response", "topic": "On-call incident response process", "audience": "a new on-call engineer", "purpose": "know what to do when paged"},
+  {"id": "data-pipeline", "topic": "Batch ETL data pipeline", "audience": "a data engineer", "purpose": "understand ingestion, transform, and load stages"},
+  {"id": "pr-review", "topic": "Pull request review and merge process", "audience": "a contributor", "purpose": "know the path from
PR open to merge"}, + {"id": "support-triage", "topic": "Customer support ticket triage", "audience": "a support agent", "purpose": "route and escalate tickets correctly"}, + {"id": "payment-processing", "topic": "Online payment processing and retries", "audience": "a fintech engineer", "purpose": "understand auth, capture, and failure handling"}, + {"id": "onboarding", "topic": "New employee onboarding workflow", "audience": "an HR coordinator", "purpose": "track steps from offer to first day"}, + {"id": "state-machine", "topic": "Document approval state machine", "audience": "a product manager", "purpose": "understand statuses and transitions"}, + {"id": "k8s-deploy", "topic": "Kubernetes rolling deployment", "audience": "a platform engineer", "purpose": "understand how a new version rolls out safely"}, + {"id": "signup-funnel", "topic": "SaaS signup and activation funnel", "audience": "a growth analyst", "purpose": "see where users drop off between signup and activation"} +] +``` + +- [ ] **Step 2: Write the failing test** + +`tests/test_dataset.py`: +```python +import json +from pathlib import Path + +from gepa_flowchart.dataset import ( + Task, + load_topics, + generate_questions, + freeze_questions, + load_frozen, +) + +TOPICS = str(Path(__file__).resolve().parent.parent / "topics" / "topics.json") + + +def fake_complete(system, user, *, model, **kw): + # Return a JSON array of questions, regardless of input. 
+ return '["Q1?", "Q2?", "Q3?"]' + + +def test_load_topics(): + tasks = load_topics(TOPICS) + assert len(tasks) == 12 + assert all(isinstance(t, Task) and t.id and t.topic for t in tasks) + + +def test_generate_questions_parses_list(): + t = Task(id="x", topic="T", audience="A", purpose="P") + qs = generate_questions(t, model="m", n=3, complete_fn=fake_complete) + assert qs == ["Q1?", "Q2?", "Q3?"] + + +def test_freeze_and_reload(tmp_path): + tasks = load_topics(TOPICS)[:2] + frozen = freeze_questions(tasks, str(tmp_path), model="m", n=3, complete_fn=fake_complete) + assert all(t.questions for t in frozen) + assert (tmp_path / "frozen_questions.json").exists() + reloaded = load_frozen(load_topics(TOPICS)[:2], str(tmp_path)) + assert reloaded[0].questions == frozen[0].questions +``` + +- [ ] **Step 3: Run test to verify it fails** + +Run: `cd scripts/experiments/gepa-flowchart && python -m pytest tests/test_dataset.py -v` +Expected: FAIL — `ModuleNotFoundError: No module named 'gepa_flowchart.dataset'` + +- [ ] **Step 4: Write minimal implementation** + +`gepa_flowchart/dataset.py`: +```python +from __future__ import annotations + +import json +from dataclasses import dataclass, field +from pathlib import Path + +from .render import extract_json + + +@dataclass +class Task: + id: str + topic: str + audience: str + purpose: str + questions: list[str] = field(default_factory=list) + + +def load_topics(path: str) -> list[Task]: + raw = json.loads(Path(path).read_text()) + return [Task(id=r["id"], topic=r["topic"], audience=r["audience"], purpose=r["purpose"]) for r in raw] + + +_QGEN_SYSTEM = ( + "You write the questions a first-time reader of a diagram would need answered " + "to actually understand the subject. Output ONLY a JSON array of concise question strings." 
+) + + +def generate_questions(task: Task, *, model: str, n: int, complete_fn) -> list[str]: + user = ( + f"Topic: {task.topic}\nAudience: {task.audience}\nPurpose: {task.purpose}\n\n" + f"List exactly {n} distinct questions this reader should be able to answer " + f"from a good flowchart on this topic. JSON array of strings only." + ) + out = complete_fn(_QGEN_SYSTEM, user, model=model, effort="medium", max_tokens=1000) + blob = extract_array(out) + qs = json.loads(blob) if blob else [] + return [str(q) for q in qs][:n] + + +def extract_array(text: str) -> str | None: + start = text.find("[") + end = text.rfind("]") + if start == -1 or end == -1 or end < start: + return None + return text[start : end + 1] + + +def freeze_questions(tasks: list[Task], run_dir: str, *, model: str, n: int, complete_fn) -> list[Task]: + Path(run_dir).mkdir(parents=True, exist_ok=True) + frozen: dict[str, list[str]] = {} + for t in tasks: + t.questions = generate_questions(t, model=model, n=n, complete_fn=complete_fn) + frozen[t.id] = t.questions + (Path(run_dir) / "frozen_questions.json").write_text(json.dumps(frozen, indent=2)) + return tasks + + +def load_frozen(tasks: list[Task], run_dir: str) -> list[Task]: + frozen = json.loads((Path(run_dir) / "frozen_questions.json").read_text()) + for t in tasks: + t.questions = frozen.get(t.id, []) + return tasks +``` + +- [ ] **Step 5: Run test to verify it passes** + +Run: `cd scripts/experiments/gepa-flowchart && python -m pytest tests/test_dataset.py -v` +Expected: PASS (3 passed) + +- [ ] **Step 6: Commit** + +```bash +git add scripts/experiments/gepa-flowchart/ +git commit -m "feat(gepa): topic dataset + reader-question generation/freezing" +``` + +--- + +### Task 6: Seed prompts + pipeline + +**Files:** +- Create: `scripts/experiments/gepa-flowchart/gepa_flowchart/seed_prompts.py` +- Create: `scripts/experiments/gepa-flowchart/gepa_flowchart/pipeline.py` +- Test: `scripts/experiments/gepa-flowchart/tests/test_pipeline.py` + 
+**Interfaces:** +- Consumes: `Task` (Task 5), `extract_json` (Task 4), a `complete_fn` callable. +- Produces: + - `SEED_CANDIDATE: dict[str, str]` with keys `"brainstorm"` and `"generate"`. + - `SKILL_CONTEXT: str` (flow schema + an example spec). + - `run_pipeline(candidate: dict, task: Task, *, model: str, complete_fn, max_tokens: int) -> tuple[str | None, dict]` returning `(flow_json_str_or_None, trace)`. `trace` has keys `brainstorm_input`, `plan`, `generate_input`, `raw_generation`. + +- [ ] **Step 1: Write the failing test** + +`tests/test_pipeline.py`: +```python +import json + +from gepa_flowchart.dataset import Task +from gepa_flowchart.pipeline import run_pipeline, SEED_CANDIDATE + + +def test_seed_candidate_has_two_components(): + assert set(SEED_CANDIDATE.keys()) == {"brainstorm", "generate"} + assert "{topic}" in SEED_CANDIDATE["brainstorm"] + assert "{plan}" in SEED_CANDIDATE["generate"] + + +def test_run_pipeline_threads_plan_into_generation(): + calls = [] + flow = {"direction": "TB", "nodes": [{"id": "a", "data": {"label": "A"}}], "edges": []} + + def fake_complete(system, user, *, model, **kw): + calls.append(user) + if len(calls) == 1: + return "PLAN: show A then B" + return f"```json\n{json.dumps(flow)}\n```" + + task = Task(id="x", topic="Topic T", audience="Aud", purpose="Pur") + out, trace = run_pipeline(SEED_CANDIDATE, task, model="m", complete_fn=fake_complete, max_tokens=4000) + assert json.loads(out) == flow + assert trace["plan"] == "PLAN: show A then B" + # the plan must be threaded into the generate step's prompt + assert "PLAN: show A then B" in calls[1] + assert "Topic T" in calls[0] +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run: `cd scripts/experiments/gepa-flowchart && python -m pytest tests/test_pipeline.py -v` +Expected: FAIL — `ModuleNotFoundError: No module named 'gepa_flowchart.pipeline'` + +- [ ] **Step 3: Write the seed prompts** + +`gepa_flowchart/seed_prompts.py`: +```python +from __future__ import 
annotations + +from pathlib import Path + +_FLOW_SCHEMA = """A termchart `flow` spec is JSON: +{ + "direction": "TB" | "LR" | "BT" | "RL", // default TB; prefer TB for processes + "nodes": [{ "id": "x", "data": { "label": "...", "status": "active|info|warn|success|neutral" }, "group": "g1?" }], + "edges": [{ "source": "x", "target": "y", "data": { "label": "when/condition?" } }], + "groups": [{ "id": "g1", "label": "Lane/Zone name", "color": "#hex?" }] +} +Rules: every edge source/target must be an existing node id; keep it under ~24 nodes; labels carry the meaning.""" + + +def _example() -> str: + candidates = [ + "call-hierarchy.flow.json", + "okr-tree.flow.json", + "binary-search.flow.json", + ] + base = Path(__file__).resolve().parents[3] / "plugin" / "skills" / "diagram-recipes" / "examples" + for name in candidates: + p = base / name + if p.exists(): + return p.read_text() + return '{"direction":"TB","nodes":[{"id":"a","data":{"label":"Start"}}],"edges":[]}' + + +SKILL_CONTEXT = f"{_FLOW_SCHEMA}\n\nExample of a well-formed flow spec:\n{_example()}" + +SEED_CANDIDATE = { + "brainstorm": ( + "You are planning a flowchart.\n" + "Topic: {topic}\nAudience: {audience}\nPurpose: {purpose}\n\n" + "Decide what the flowchart must show so this reader can understand the subject. " + "List the key steps/states, the decisions and what triggers each branch, and the " + "context a newcomer needs. Output a concise plan in plain text." + ), + "generate": ( + "You generate a termchart `flow` diagram as JSON.\n\n" + "{skill_context}\n\n" + "Topic: {topic}\nAudience: {audience}\nPurpose: {purpose}\n\n" + "Plan to follow:\n{plan}\n\n" + "Produce a single flow JSON object. Use clear, specific labels; label edges with the " + "condition/trigger; group related nodes. Output ONLY the JSON." 
+    ),
+}
+```
+
+- [ ] **Step 4: Write the pipeline**
+
+`gepa_flowchart/pipeline.py`:
+```python
+from __future__ import annotations
+
+from .dataset import Task
+from .render import extract_json
+# SEED_CANDIDATE is re-exported here because the tests and run.py import it
+# from this module rather than from seed_prompts.
+from .seed_prompts import SEED_CANDIDATE, SKILL_CONTEXT  # noqa: F401
+
+
+def run_pipeline(candidate: dict, task: Task, *, model: str, complete_fn, max_tokens: int):
+    trace: dict = {}
+    # Stage 1: brainstorm a plain-text plan from the task description.
+    brainstorm_user = candidate["brainstorm"].format(
+        topic=task.topic, audience=task.audience, purpose=task.purpose
+    )
+    trace["brainstorm_input"] = brainstorm_user
+    plan = complete_fn(
+        "You are a diagram planner.", brainstorm_user, model=model, effort="medium", max_tokens=2000
+    )
+    trace["plan"] = plan
+
+    # Stage 2: thread the plan into the generate prompt and extract the flow JSON.
+    generate_user = candidate["generate"].format(
+        skill_context=SKILL_CONTEXT,
+        topic=task.topic,
+        audience=task.audience,
+        purpose=task.purpose,
+        plan=plan,
+    )
+    trace["generate_input"] = generate_user
+    raw = complete_fn(
+        "You output only valid JSON.", generate_user, model=model, effort="medium", max_tokens=max_tokens
+    )
+    trace["raw_generation"] = raw
+    return extract_json(raw), trace
+```
+
+- [ ] **Step 5: Run test to verify it passes**
+
+Run: `cd scripts/experiments/gepa-flowchart && python -m pytest tests/test_pipeline.py -v`
+Expected: PASS (2 passed)
+
+- [ ] **Step 6: Commit**
+
+```bash
+git add scripts/experiments/gepa-flowchart/
+git commit -m "feat(gepa): seed brainstorm/generate prompts + pipeline"
+```
+
+---
+
+### Task 7: Metric (gate + geometry + comprehension + feedback)
+
+**Files:**
+- Create: `scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py`
+- Test: `scripts/experiments/gepa-flowchart/tests/test_metric.py`
+
+**Interfaces:**
+- Consumes: `Config` (Task 1), `Task` (Task 5), `board_to_text` (Task 4), `extract_array`/`extract_json`, a `validate_fn` (matches `validate_flow`), a `complete_fn`.
+- Produces:
+  - `geometry_score(findings: list[dict], cfg: Config) -> tuple[float, str]`.
+ - `comprehension_score(board_text: str, questions: list[str], *, reader_model: str, judge_model: str, complete_fn) -> tuple[float, str]`. + - `score_board(content: str | None, task: Task, cfg: Config, *, validate_fn, complete_fn) -> ScoreResult` where `ScoreResult` is a dataclass `score: float`, `feedback: str`, `comp: float`, `geom: float`, `valid: bool`. + +- [ ] **Step 1: Write the failing test** + +`tests/test_metric.py`: +```python +import json + +from gepa_flowchart.config import Config +from gepa_flowchart.dataset import Task +from gepa_flowchart.metric import geometry_score, score_board + +CFG = Config.from_env({}) +TASK = Task(id="x", topic="T", audience="A", purpose="P", questions=["Q1?", "Q2?"]) +GOOD_FLOW = json.dumps({"direction": "TB", "nodes": [{"id": "a", "data": {"label": "Start"}}], "edges": []}) + + +def test_geometry_score_penalizes_errors(): + clean, _ = geometry_score([], CFG) + assert clean == 1.0 + err, msg = geometry_score([{"severity": "error", "code": "edge-over-node", "message": "x", "count": 1}], CFG) + assert err < 1.0 and "edge-over-node" in msg + + +def test_invalid_board_scores_zero(): + def validate_fn(content): + return {"valid": False, "error": "bad json", "findings": [], "warnings": []} + + res = score_board("{bad", TASK, CFG, validate_fn=validate_fn, complete_fn=lambda *a, **k: "") + assert res.score == 0.0 and res.valid is False + assert "bad json" in res.feedback + + +def test_missing_context_lowers_comp_and_is_fed_back(): + def validate_fn(content): + return {"valid": True, "error": None, "findings": [], "warnings": []} + + # reader answers; judge marks Q2 unsupported + def complete_fn(system, user, *, model, **kw): + if "first time" in system.lower(): # reader + return json.dumps([{"q": "Q1?", "a": "Start the process"}, {"q": "Q2?", "a": "not shown"}]) + # judge + return json.dumps([ + {"q": "Q1?", "score": 1.0, "supported": True, "reason": "shown"}, + {"q": "Q2?", "score": 0.0, "supported": False, "reason": "board never 
shows the failure path"}, + ]) + + res = score_board(GOOD_FLOW, TASK, CFG, validate_fn=validate_fn, complete_fn=complete_fn) + assert 0.0 < res.comp < 1.0 + assert res.geom == 1.0 + assert "failure path" in res.feedback +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run: `cd scripts/experiments/gepa-flowchart && python -m pytest tests/test_metric.py -v` +Expected: FAIL — `ModuleNotFoundError: No module named 'gepa_flowchart.metric'` + +- [ ] **Step 3: Write minimal implementation** + +`gepa_flowchart/metric.py`: +```python +from __future__ import annotations + +import json +from dataclasses import dataclass + +from .config import Config +from .dataset import Task, extract_array +from .render import board_to_text + + +@dataclass +class ScoreResult: + score: float + feedback: str + comp: float + geom: float + valid: bool + + +def geometry_score(findings: list[dict], cfg: Config) -> tuple[float, str]: + errs = [f for f in findings if f.get("severity") == "error"] + warns = [f for f in findings if f.get("severity") == "warning"] + score = 1.0 - cfg.geom_error_penalty * len(errs) - cfg.geom_warning_penalty * len(warns) + score = max(0.0, min(1.0, score)) + if not findings: + return score, "Geometry: clean (no findings)." + msgs = "; ".join(f"[{f.get('severity')}] {f.get('code')}: {f.get('message')}" for f in findings) + return score, f"Geometry findings: {msgs}" + + +_READER_SYSTEM = ( + "You are seeing this flowchart for the FIRST TIME and know nothing else about it. " + "Answer each question using ONLY what the board shows. If the board does not tell you, " + "answer exactly 'not shown'. Output ONLY a JSON array of {\"q\": question, \"a\": answer}." +) + +_JUDGE_SYSTEM = ( + "You grade how well a first-time reader's answers are supported by a flowchart. For each " + "question score 0..1 for correctness AND whether the board actually supports the answer. " + "An answer of 'not shown' or one not supported by the board scores low and is a context gap. 
" + "Output ONLY a JSON array of {\"q\", \"score\", \"supported\" (bool), \"reason\"}." +) + + +def comprehension_score(board_text: str, questions: list[str], *, reader_model: str, judge_model: str, complete_fn) -> tuple[float, str]: + qlist = "\n".join(f"- {q}" for q in questions) + reader_out = complete_fn( + _READER_SYSTEM, + f"BOARD:\n{board_text}\n\nQUESTIONS:\n{qlist}", + model=reader_model, + effort="low", + max_tokens=1500, + ) + judge_out = complete_fn( + _JUDGE_SYSTEM, + f"BOARD:\n{board_text}\n\nQUESTIONS:\n{qlist}\n\nREADER ANSWERS:\n{reader_out}", + model=judge_model, + effort="high", + max_tokens=1500, + ) + blob = extract_array(judge_out) + rows = json.loads(blob) if blob else [] + if not rows: + return 0.0, "Comprehension: judge produced no parseable scores." + scores = [float(r.get("score", 0.0)) for r in rows] + comp = sum(scores) / len(scores) + gaps = [r for r in rows if not r.get("supported", False) or float(r.get("score", 0.0)) < 0.5] + if gaps: + gap_txt = "; ".join(f"'{r.get('q')}' — {r.get('reason')}" for r in gaps) + fb = f"Comprehension {comp:.2f}. Reader could not answer (add detail/context): {gap_txt}" + else: + fb = f"Comprehension {comp:.2f}. Reader answered all questions from the board." 
+ return comp, fb + + +def score_board(content, task: Task, cfg: Config, *, validate_fn, complete_fn) -> ScoreResult: + if not content: + return ScoreResult(0.0, "Generation produced no JSON object.", 0.0, 0.0, False) + report = validate_fn(content) + if not report.get("valid"): + return ScoreResult(0.0, f"Invalid flow spec: {report.get('error')}", 0.0, 0.0, False) + + geom, geom_fb = geometry_score(report.get("findings", []), cfg) + board_text = board_to_text(json.loads(content)) + comp, comp_fb = comprehension_score( + board_text, task.questions, reader_model=cfg.reader_model, judge_model=cfg.judge_model, complete_fn=complete_fn + ) + total = cfg.w_comp * comp + cfg.w_geom * geom + feedback = f"{comp_fb}\n{geom_fb}" + return ScoreResult(total, feedback, comp, geom, True) +``` + +- [ ] **Step 4: Run test to verify it passes** + +Run: `cd scripts/experiments/gepa-flowchart && python -m pytest tests/test_metric.py -v` +Expected: PASS (3 passed) + +- [ ] **Step 5: Commit** + +```bash +git add scripts/experiments/gepa-flowchart/ +git commit -m "feat(gepa): three-part metric (validity gate, geometry, comprehension)" +``` + +--- + +### Task 8: GEPA adapter + +**Files:** +- Create: `scripts/experiments/gepa-flowchart/gepa_flowchart/adapter.py` +- Test: `scripts/experiments/gepa-flowchart/tests/test_adapter.py` + +**Interfaces:** +- Consumes: `Config`, `Task`, `run_pipeline` (Task 6), `score_board`/`ScoreResult` (Task 7), `validate_flow` (Task 2), `complete` (Task 3). `gepa.core.adapter.EvaluationBatch`. +- Produces: `FlowchartAdapter` implementing `evaluate(batch, candidate, capture_traces=False) -> EvaluationBatch` and `make_reflective_dataset(candidate, eval_batch, components_to_update) -> dict[str, list[dict]]`. Constructor: `FlowchartAdapter(cfg: Config, *, validate_fn=validate_flow, complete_fn=complete)`. 
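Before wiring the adapter, it may help to see how the Task 7 score behaves numerically, since that score is what `evaluate` reports for each task. The sketch below re-implements the weighted total in isolation. `Weights`, `geometry`, and `total` are hypothetical stand-ins for `Config`, `geometry_score`, and the final line of `score_board`; the 0.6/0.4 weights are the documented defaults, while the two penalty values are assumptions made only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Weights:
    # Hypothetical stand-in for the relevant Config fields.
    # 0.6 / 0.4 are the documented defaults; the penalties are ASSUMED values.
    w_comp: float = 0.6
    w_geom: float = 0.4
    geom_error_penalty: float = 0.25
    geom_warning_penalty: float = 0.1


def geometry(findings: list[dict], w: Weights) -> float:
    """Start at 1.0, subtract a penalty per finding, clamp to [0, 1]."""
    errs = sum(1 for f in findings if f.get("severity") == "error")
    warns = sum(1 for f in findings if f.get("severity") == "warning")
    raw = 1.0 - w.geom_error_penalty * errs - w.geom_warning_penalty * warns
    return max(0.0, min(1.0, raw))


def total(comp: float, findings: list[dict], w: Weights) -> float:
    """Weighted sum of comprehension and geometry, as in score_board."""
    return w.w_comp * comp + w.w_geom * geometry(findings, w)


w = Weights()
# Clean geometry but only half the questions answered.
clean_half = total(0.5, [], w)
# Perfect comprehension but one geometry error.
one_error = total(1.0, [{"severity": "error"}], w)
print(round(clean_half, 2), round(one_error, 2))
```

Under these assumed penalties, a board that answers every question but has one geometry error still outscores a clean board that answers half of them, which is consistent with comprehension being the primary objective and geometry the guardrail.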
+ +- [ ] **Step 1: Write the failing test** + +`tests/test_adapter.py`: +```python +import json + +from gepa_flowchart.config import Config +from gepa_flowchart.dataset import Task +from gepa_flowchart.adapter import FlowchartAdapter +from gepa_flowchart.pipeline import SEED_CANDIDATE + +CFG = Config.from_env({}) +FLOW = {"direction": "TB", "nodes": [{"id": "a", "data": {"label": "A"}}], "edges": []} + + +def validate_fn(content): + return {"valid": True, "error": None, "findings": [], "warnings": []} + + +def complete_fn(system, user, *, model, **kw): + if "planner" in system.lower(): + return "plan" + if "only valid json" in system.lower(): + return json.dumps(FLOW) + if "first time" in system.lower(): + return json.dumps([{"q": "Q1?", "a": "A"}]) + # judge + return json.dumps([{"q": "Q1?", "score": 1.0, "supported": True, "reason": "ok"}]) + + +def test_evaluate_returns_scores_and_traces(): + adapter = FlowchartAdapter(CFG, validate_fn=validate_fn, complete_fn=complete_fn) + batch = [Task(id="x", topic="T", audience="A", purpose="P", questions=["Q1?"])] + out = adapter.evaluate(batch, SEED_CANDIDATE, capture_traces=True) + assert len(out.scores) == 1 + assert out.scores[0] > 0.0 + assert out.trajectories is not None and len(out.trajectories) == 1 + + +def test_make_reflective_dataset_has_requested_components(): + adapter = FlowchartAdapter(CFG, validate_fn=validate_fn, complete_fn=complete_fn) + batch = [Task(id="x", topic="T", audience="A", purpose="P", questions=["Q1?"])] + ev = adapter.evaluate(batch, SEED_CANDIDATE, capture_traces=True) + refl = adapter.make_reflective_dataset(SEED_CANDIDATE, ev, ["brainstorm", "generate"]) + assert set(refl.keys()) == {"brainstorm", "generate"} + assert refl["generate"] and "Feedback" in refl["generate"][0] +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run: `cd scripts/experiments/gepa-flowchart && python -m pytest tests/test_adapter.py -v` +Expected: FAIL — `ModuleNotFoundError: No module named 
'gepa_flowchart.adapter'` + +- [ ] **Step 3: Write minimal implementation** + +`gepa_flowchart/adapter.py`: +```python +from __future__ import annotations + +from gepa.core.adapter import EvaluationBatch, GEPAAdapter + +from .config import Config +from .geometry_bridge import validate_flow +from .llm import complete +from .metric import score_board +from .pipeline import run_pipeline + + +class FlowchartAdapter(GEPAAdapter): + def __init__(self, cfg: Config, *, validate_fn=validate_flow, complete_fn=complete): + self.cfg = cfg + self.validate_fn = validate_fn + self.complete_fn = complete_fn + + def evaluate(self, batch, candidate, capture_traces=False): + outputs, scores, trajectories = [], [], [] if capture_traces else None + for task in batch: + content, trace = run_pipeline( + candidate, task, model=self.cfg.gen_model, + complete_fn=self.complete_fn, max_tokens=self.cfg.gen_max_tokens, + ) + result = score_board( + content, task, self.cfg, validate_fn=self.validate_fn, complete_fn=self.complete_fn + ) + outputs.append(content) + scores.append(result.score) + if capture_traces: + trajectories.append({"task": task, "trace": trace, "result": result, "output": content}) + return EvaluationBatch(outputs=outputs, scores=scores, trajectories=trajectories) + + def make_reflective_dataset(self, candidate, eval_batch, components_to_update): + out: dict[str, list[dict]] = {c: [] for c in components_to_update} + for traj in eval_batch.trajectories or []: + task = traj["task"] + result = traj["result"] + trace = traj["trace"] + shared_feedback = result.feedback + if "brainstorm" in out: + out["brainstorm"].append({ + "Inputs": f"Topic: {task.topic}\nAudience: {task.audience}\nPurpose: {task.purpose}", + "Generated Outputs": trace.get("plan", ""), + "Feedback": ( + "The plan feeds a downstream generator. The resulting board scored " + f"{result.score:.2f}. 
" + shared_feedback + ), + }) + if "generate" in out: + out["generate"].append({ + "Inputs": f"Plan:\n{trace.get('plan','')}", + "Generated Outputs": traj.get("output") or trace.get("raw_generation", ""), + "Feedback": shared_feedback, + }) + return out +``` + +- [ ] **Step 4: Run test to verify it passes** + +Run: `cd scripts/experiments/gepa-flowchart && python -m pytest tests/test_adapter.py -v` +Expected: PASS (2 passed) + +- [ ] **Step 5: Commit** + +```bash +git add scripts/experiments/gepa-flowchart/ +git commit -m "feat(gepa): GEPAAdapter (evaluate + reflective dataset)" +``` + +--- + +### Task 9: Run entrypoint + report + +**Files:** +- Create: `scripts/experiments/gepa-flowchart/gepa_flowchart/report.py` +- Create: `scripts/experiments/gepa-flowchart/gepa_flowchart/run.py` +- Test: `scripts/experiments/gepa-flowchart/tests/test_report.py` +- Test: `scripts/experiments/gepa-flowchart/tests/test_smoke.py` + +**Interfaces:** +- Consumes: everything above; `gepa.optimize`, `GEPAResult`. +- Produces: + - `render_report(seed_scores: list[float], best_scores: list[float], best_candidate: dict, val_ids: list[str]) -> str` (Markdown). + - `run.py` CLI: `--smoke`, `--max-metric-calls N`, `--run-dir PATH`, `--train N`. `main(argv=None) -> int`. 
+ +- [ ] **Step 1: Write the failing test (report)** + +`tests/test_report.py`: +```python +from gepa_flowchart.report import render_report + + +def test_report_shows_before_after(): + md = render_report([0.4, 0.5], [0.7, 0.8], {"brainstorm": "b", "generate": "g"}, ["t1", "t2"]) + assert "0.45" in md # seed mean + assert "0.75" in md # best mean + assert "brainstorm" in md and "generate" in md +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run: `cd scripts/experiments/gepa-flowchart && python -m pytest tests/test_report.py -v` +Expected: FAIL — `ModuleNotFoundError: No module named 'gepa_flowchart.report'` + +- [ ] **Step 3: Write report.py** + +`gepa_flowchart/report.py`: +```python +from __future__ import annotations + +import json + + +def _mean(xs: list[float]) -> float: + return sum(xs) / len(xs) if xs else 0.0 + + +def render_report(seed_scores, best_scores, best_candidate, val_ids) -> str: + seed_m, best_m = _mean(seed_scores), _mean(best_scores) + lines = [ + "# GEPA Flowchart Optimization — Report", + "", + f"- Val tasks: {len(val_ids)} ({', '.join(val_ids)})", + f"- Seed mean score: **{seed_m:.2f}**", + f"- Best mean score: **{best_m:.2f}**", + f"- Delta: **{best_m - seed_m:+.2f}**", + "", + "## Per-task (seed → best)", + "", + "| task | seed | best |", + "|---|---|---|", + ] + for tid, s, b in zip(val_ids, seed_scores, best_scores): + lines.append(f"| {tid} | {s:.2f} | {b:.2f} |") + lines += ["", "## Best prompts", "", "```json", json.dumps(best_candidate, indent=2), "```"] + return "\n".join(lines) +``` + +- [ ] **Step 4: Run report test to verify it passes** + +Run: `cd scripts/experiments/gepa-flowchart && python -m pytest tests/test_report.py -v` +Expected: PASS (1 passed) + +- [ ] **Step 5: Write run.py** + +`gepa_flowchart/run.py`: +```python +from __future__ import annotations + +import argparse +import json +import sys +from datetime import datetime, timezone +from pathlib import Path + +import gepa + +from .adapter import 
FlowchartAdapter +from .config import Config +from .dataset import freeze_questions, load_topics +from .llm import complete, make_reflection_callable +from .metric import score_board +from .geometry_bridge import validate_flow +from .pipeline import SEED_CANDIDATE +from .report import render_report + +_TOPICS = str(Path(__file__).resolve().parent.parent / "topics" / "topics.json") + + +def _eval_scores(tasks, candidate, cfg) -> list[float]: + return [ + score_board( + run_one(candidate, t, cfg), t, cfg, validate_fn=validate_flow, complete_fn=complete + ).score + for t in tasks + ] + + +def run_one(candidate, task, cfg): + from .pipeline import run_pipeline + content, _ = run_pipeline(candidate, task, model=cfg.gen_model, complete_fn=complete, max_tokens=cfg.gen_max_tokens) + return content + + +def main(argv=None) -> int: + p = argparse.ArgumentParser(description="GEPA flowchart prompt optimization") + p.add_argument("--smoke", action="store_true", help="1 topic, tiny budget") + p.add_argument("--max-metric-calls", type=int, default=None) + p.add_argument("--train", type=int, default=8) + p.add_argument("--run-dir", default=None) + args = p.parse_args(argv) + + cfg = Config.from_env({}) + budget = args.max_metric_calls or (6 if args.smoke else cfg.max_metric_calls) + run_dir = args.run_dir or str(Path("runs") / datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")) + Path(run_dir).mkdir(parents=True, exist_ok=True) + + topics = load_topics(_TOPICS) + if args.smoke: + train, val = topics[:1], topics[:1] + else: + train, val = topics[: args.train], topics[args.train :] + + print(f"[gepa] run_dir={run_dir} budget={budget} train={len(train)} val={len(val)}", file=sys.stderr) + print(f"[gepa] models gen={cfg.gen_model} reader={cfg.reader_model} judge={cfg.judge_model} reflect={cfg.reflection_model}", file=sys.stderr) + + # Freeze the reader-questions once for the whole run (fair comparison). 
+ all_tasks = freeze_questions(topics, run_dir, model=cfg.judge_model, n=cfg.n_questions, complete_fn=complete) + by_id = {t.id: t for t in all_tasks} + train = [by_id[t.id] for t in train] + val = [by_id[t.id] for t in val] + + seed_scores = _eval_scores(val, SEED_CANDIDATE, cfg) + + adapter = FlowchartAdapter(cfg) + result = gepa.optimize( + seed_candidate=SEED_CANDIDATE, + trainset=train, + valset=val, + adapter=adapter, + reflection_lm=make_reflection_callable(cfg.reflection_model), + max_metric_calls=budget, + run_dir=run_dir, + display_progress_bar=True, + ) + best = result.best_candidate + best_scores = _eval_scores(val, best, cfg) + + (Path(run_dir) / "best_prompts.json").write_text(json.dumps(best, indent=2)) + report = render_report(seed_scores, best_scores, best, [t.id for t in val]) + (Path(run_dir) / "report.md").write_text(report) + print(report) + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) +``` + +- [ ] **Step 6: Write the smoke test (gated on a real API key)** + +`tests/test_smoke.py`: +```python +import os + +import pytest + +from gepa_flowchart.run import main + + +@pytest.mark.skipif(not os.environ.get("ANTHROPIC_API_KEY"), reason="needs ANTHROPIC_API_KEY") +def test_smoke_end_to_end(tmp_path): + rc = main(["--smoke", "--run-dir", str(tmp_path)]) + assert rc == 0 + assert (tmp_path / "report.md").exists() + assert (tmp_path / "best_prompts.json").exists() +``` + +- [ ] **Step 7: Run the non-smoke tests; confirm smoke skips without a key** + +Run: `cd scripts/experiments/gepa-flowchart && python -m pytest tests/ -v` +Expected: all unit tests PASS; `test_smoke_end_to_end` SKIPPED (no key) — or PASS if `ANTHROPIC_API_KEY` is set and you choose to run it (costs a few cents). 
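The way `--smoke`, `--train N`, and `--max-metric-calls N` interact in `run.py` can be sketched in isolation. `split_topics` and `resolve_budget` are hypothetical helper names used only for this illustration (the entrypoint inlines the same logic), the 12-item list stands in for `topics.json`, and 150 is the documented default budget.

```python
def split_topics(topics: list, train_n: int, smoke: bool = False):
    """Mirror the train/val split: smoke reuses the first topic for both sets."""
    if smoke:
        return topics[:1], topics[:1]
    return topics[:train_n], topics[train_n:]


def resolve_budget(cli_value, smoke: bool, default: int) -> int:
    """An explicit CLI value wins; otherwise smoke gets 6 and full runs get the default."""
    return cli_value or (6 if smoke else default)


topics = [f"t{i}" for i in range(12)]  # stand-in for the 12 topic ids
train, val = split_topics(topics, 8)
print(len(train), len(val))
print(resolve_budget(None, smoke=True, default=150))
```

Two edge cases fall out of this sketch: `--train 12` leaves the validation set empty, so the report would average over no tasks, and `--max-metric-calls 0` falls through to the default because `0` is falsy in the `or` expression.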
+
+- [ ] **Step 8: Commit**
+
+```bash
+git add scripts/experiments/gepa-flowchart/
+git commit -m "feat(gepa): run entrypoint, report, and smoke test"
+```
+
+---
+
+### Task 10: README + full verification
+
+**Files:**
+- Create: `scripts/experiments/gepa-flowchart/README.md`
+
+**Interfaces:** none (docs + verification).
+
+- [ ] **Step 1: Write the README**
+
+`README.md`:
+```markdown
+# gepa-flowchart
+
+GEPA prompt optimization for readable, comprehensible termchart flowcharts.
+Optimizes the `brainstorm` + `generate` prompts for first-user comprehension
+(primary) and geometric readability (secondary). See
+`docs/superpowers/specs/2026-06-18-gepa-flowchart-optimization-design.md`.
+
+## Setup
+
+```bash
+# Node deps for the geometry bridge (run once)
+npm install                      # repo root: gives the viewer its deps
+cd scripts/experiments/gepa-flowchart
+npm install                      # local tsx
+
+# Python
+python -m venv .venv && . .venv/bin/activate
+pip install -e ".[dev]"
+export ANTHROPIC_API_KEY=sk-ant-...   # or: ant auth login
+```
+
+## Run
+
+```bash
+python -m gepa_flowchart.run --smoke                 # cheap end-to-end check (1 topic)
+python -m gepa_flowchart.run                         # full run (~150 metric calls)
+python -m gepa_flowchart.run --max-metric-calls 80 --train 8
+```
+
+Outputs land in a UTC-timestamped directory under `runs/` (or in the directory
+given by `--run-dir`): `best_prompts.json`, `report.md`, `frozen_questions.json`.
+
+## Config (env)
+
+`GEPA_GEN_MODEL` (default `claude-opus-4-8`; set `claude-sonnet-4-6` to cut cost),
+`GEPA_READER_MODEL`, `GEPA_JUDGE_MODEL`, `GEPA_REFLECTION_MODEL`,
+`GEPA_W_COMP` (0.6), `GEPA_W_GEOM` (0.4), `GEPA_N_QUESTIONS` (5),
+`GEPA_MAX_METRIC_CALLS` (150).
+
+## Cost note
+
+A full run does many generation + reader + judge calls per rollout plus
+reflection. Start with `--smoke`. Generation defaults to Opus; switch
+`GEPA_GEN_MODEL=claude-sonnet-4-6` for a cheaper high-volume role.
+```
+
+- [ ] **Step 2: Run the full unit suite**
+
+Run: `cd scripts/experiments/gepa-flowchart && python -m pytest tests/ -v`
+Expected: all unit tests PASS; smoke SKIPPED without a key.
+
+- [ ] **Step 3: (Optional, costs cents) Live smoke**
+
+Run: `cd scripts/experiments/gepa-flowchart && ANTHROPIC_API_KEY=... python -m gepa_flowchart.run --smoke`
+Expected: prints a report; `runs//report.md` and `best_prompts.json` exist.
+
+- [ ] **Step 4: Commit**
+
+```bash
+git add scripts/experiments/gepa-flowchart/README.md
+git commit -m "docs(gepa): README with setup, run, and cost notes"
+```
+
+---
+
+## Self-Review
+
+**Spec coverage:**
+- Two optimizable prompts (brainstorm/generate) → Task 6 (`SEED_CANDIDATE`), Task 8 (adapter mutates both).
+- Claude direct via Anthropic SDK → Task 3 (`llm.py`); reflection callable → Task 3 + Task 9.
+- Standalone `gepa` → Task 8 (`GEPAAdapter`), Task 9 (`gepa.optimize`).
+- Structural validity gate → Task 7 (`score_board` early return) via Task 2 bridge (`validateContent`).
+- Geometry/readability score → Task 2 (bridge `geometryReport`), Task 7 (`geometry_score`).
+- Comprehension/FUX primary (fresh reader + judge, board-supported correctness, missing-context flagged) → Task 7 (`comprehension_score`).
+- Auto-generated, run-frozen reader-questions → Task 5 (`freeze_questions`), Task 9 (frozen once per run).
+- Structured-content reader surface → Task 4 (`board_to_text`).
+- ~12 topics, train/val split → Task 5 (`topics.json`), Task 9 (split).
+- Geometry bridge, no package changes → Task 2.
+- Models (gen opus + sonnet knob, reader sonnet, judge/reflection opus) → Task 1 config defaults.
+- Combined weights 0.6/0.4 → Task 1 + Task 7.
+- Smoke-first, cost estimate → Task 9 (`--smoke`, stderr prints), Task 10 README.
+- Outputs (best_prompts.json, report.md, frozen_questions.json) → Task 9.
+- Testing (units with fakes + gated smoke) → every task + Task 9 smoke.
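The combined weights and geometry penalties named in the coverage list could fit together roughly as sketched here. This is an illustration under stated assumptions, not the plan's `score_board`: the linear per-finding penalty, the clamp to [0, 1], the zero score for a structurally invalid board, and the names `geometry_score_sketch` and `combined_score_sketch` are all hypothetical. Only the default numbers (0.6, 0.4, 0.34, 0.08) are taken from `Config`.

```python
# Hypothetical sketch: the real scoring lives in Task 7 and may differ.
W_COMP = 0.6                 # Config.w_comp default
W_GEOM = 0.4                 # Config.w_geom default
GEOM_ERROR_PENALTY = 0.34    # Config.geom_error_penalty default
GEOM_WARNING_PENALTY = 0.08  # Config.geom_warning_penalty default


def geometry_score_sketch(n_errors: int, n_warnings: int) -> float:
    """Assumed form: start at 1.0, subtract a fixed penalty per finding, clamp to [0, 1]."""
    raw = 1.0 - GEOM_ERROR_PENALTY * n_errors - GEOM_WARNING_PENALTY * n_warnings
    return max(0.0, min(1.0, raw))


def combined_score_sketch(comp: float, n_errors: int, n_warnings: int, *, valid: bool = True) -> float:
    """Assumed form: invalid boards score 0; otherwise a weighted sum of the two parts."""
    if not valid:
        return 0.0  # structural validity gate (assumed to short-circuit)
    return W_COMP * comp + W_GEOM * geometry_score_sketch(n_errors, n_warnings)


if __name__ == "__main__":
    # A board with good comprehension and one geometry error.
    print(combined_score_sketch(0.8, n_errors=1, n_warnings=0))
```

Under these assumptions three geometry errors already zero out the geometry term (3 × 0.34 exceeds 1), so comprehension then carries the whole score.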
+
+**Placeholder scan:** No TBD/TODO; every code step has real code; commands have expected output. `topics.json` shows all 12 entries.
+
+**Type consistency:** `complete(system, user, *, model, effort, max_tokens, client)` used consistently (Tasks 3, 5, 6, 7, adapter). `validate_flow(content) -> {valid,error,findings,warnings}` consistent (Tasks 2, 7, 8). `Task(id,topic,audience,purpose,questions)` consistent (Tasks 5–9). `ScoreResult(score,feedback,comp,geom,valid)` consistent (Tasks 7, 8). `SEED_CANDIDATE` keys `brainstorm`/`generate` consistent (Tasks 6, 8). `EvaluationBatch(outputs,scores,trajectories)` matches the gepa source.

From e6d8c168403470233ee6712d8c7fc3c815593bbd Mon Sep 17 00:00:00 2001
From: Ivan Cheung
Date: Thu, 18 Jun 2026 04:38:30 +0000
Subject: [PATCH 03/24] feat(gepa): scaffold gepa-flowchart project + config

---
 scripts/experiments/gepa-flowchart/.gitignore | 5 ++
 .../gepa-flowchart/gepa_flowchart/__init__.py | 1 +
 .../gepa-flowchart/gepa_flowchart/config.py | 46 +++++++++++++++++++
 .../experiments/gepa-flowchart/package.json | 5 ++
 .../experiments/gepa-flowchart/pyproject.toml | 11 +++++
 .../gepa-flowchart/tests/test_config.py | 16 +++++++
 6 files changed, 84 insertions(+)
 create mode 100644 scripts/experiments/gepa-flowchart/.gitignore
 create mode 100644 scripts/experiments/gepa-flowchart/gepa_flowchart/__init__.py
 create mode 100644 scripts/experiments/gepa-flowchart/gepa_flowchart/config.py
 create mode 100644 scripts/experiments/gepa-flowchart/package.json
 create mode 100644 scripts/experiments/gepa-flowchart/pyproject.toml
 create mode 100644 scripts/experiments/gepa-flowchart/tests/test_config.py

diff --git a/scripts/experiments/gepa-flowchart/.gitignore b/scripts/experiments/gepa-flowchart/.gitignore
new file mode 100644
index 0000000..b875e21
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/.gitignore
@@ -0,0 +1,5 @@
+__pycache__/
+*.pyc
+.venv/
+node_modules/
+runs/
diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/__init__.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/__init__.py
new file mode 100644
index 0000000..2e891bd
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/__init__.py
@@ -0,0 +1 @@
+"""GEPA optimization for readable, comprehensible termchart flowcharts."""
diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/config.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/config.py
new file mode 100644
index 0000000..1e5ae10
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/config.py
@@ -0,0 +1,46 @@
+from __future__ import annotations
+
+import os
+from dataclasses import dataclass
+
+
+@dataclass
+class Config:
+    gen_model: str = "claude-opus-4-8"
+    reader_model: str = "claude-sonnet-4-6"
+    judge_model: str = "claude-opus-4-8"
+    reflection_model: str = "claude-opus-4-8"
+    w_comp: float = 0.6
+    w_geom: float = 0.4
+    geom_error_penalty: float = 0.34
+    geom_warning_penalty: float = 0.08
+    n_questions: int = 5
+    max_metric_calls: int = 150
+    gen_max_tokens: int = 8000
+
+    @classmethod
+    def from_env(cls, overrides: dict | None = None) -> "Config":
+        env = {**os.environ, **(overrides or {})}
+
+        def s(key: str, default: str) -> str:
+            return env.get(key, default)
+
+        def f(key: str, default: float) -> float:
+            return float(env.get(key, default))
+
+        def i(key: str, default: int) -> int:
+            return int(env.get(key, default))
+
+        return cls(
+            gen_model=s("GEPA_GEN_MODEL", cls.gen_model),
+            reader_model=s("GEPA_READER_MODEL", cls.reader_model),
+            judge_model=s("GEPA_JUDGE_MODEL", cls.judge_model),
+            reflection_model=s("GEPA_REFLECTION_MODEL", cls.reflection_model),
+            w_comp=f("GEPA_W_COMP", cls.w_comp),
+            w_geom=f("GEPA_W_GEOM", cls.w_geom),
+            geom_error_penalty=f("GEPA_GEOM_ERR_PENALTY", cls.geom_error_penalty),
+            geom_warning_penalty=f("GEPA_GEOM_WARN_PENALTY", cls.geom_warning_penalty),
+            n_questions=i("GEPA_N_QUESTIONS", cls.n_questions),
+            max_metric_calls=i("GEPA_MAX_METRIC_CALLS", cls.max_metric_calls),
+            gen_max_tokens=i("GEPA_GEN_MAX_TOKENS", cls.gen_max_tokens),
+        )
diff --git a/scripts/experiments/gepa-flowchart/package.json b/scripts/experiments/gepa-flowchart/package.json
new file mode 100644
index 0000000..6bf8617
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/package.json
@@ -0,0 +1,5 @@
+{
+  "name": "gepa-flowchart-bridge",
+  "private": true,
+  "devDependencies": { "tsx": "^4.19.0" }
+}
diff --git a/scripts/experiments/gepa-flowchart/pyproject.toml b/scripts/experiments/gepa-flowchart/pyproject.toml
new file mode 100644
index 0000000..3d56452
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/pyproject.toml
@@ -0,0 +1,11 @@
+[project]
+name = "gepa-flowchart"
+version = "0.1.0"
+requires-python = ">=3.11"
+dependencies = ["gepa", "anthropic>=0.69"]
+
+[project.optional-dependencies]
+dev = ["pytest>=8"]
+
+[tool.pytest.ini_options]
+testpaths = ["tests"]
diff --git a/scripts/experiments/gepa-flowchart/tests/test_config.py b/scripts/experiments/gepa-flowchart/tests/test_config.py
new file mode 100644
index 0000000..691aaa9
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/tests/test_config.py
@@ -0,0 +1,16 @@
+from gepa_flowchart.config import Config
+
+
+def test_defaults():
+    cfg = Config.from_env({})
+    assert cfg.gen_model == "claude-opus-4-8"
+    assert cfg.reader_model == "claude-sonnet-4-6"
+    assert cfg.judge_model == "claude-opus-4-8"
+    assert cfg.reflection_model == "claude-opus-4-8"
+    assert cfg.w_comp == 0.6 and cfg.w_geom == 0.4
+
+
+def test_env_override():
+    cfg = Config.from_env({"GEPA_GEN_MODEL": "claude-sonnet-4-6", "GEPA_W_COMP": "0.7"})
+    assert cfg.gen_model == "claude-sonnet-4-6"
+    assert cfg.w_comp == 0.7

From abb5f9601501035573ce4ab40975a763aa676c7b Mon Sep 17 00:00:00 2001
From: Ivan Cheung
Date: Thu, 18 Jun 2026 04:42:33 +0000
Subject: [PATCH 04/24] feat(gepa): geometry bridge reusing TS validateContent + geometryReport

---
 .../gepa_flowchart/geometry_bridge.py | 24 +
 .../gepa_flowchart/validate_flow.ts | 22 +
 .../gepa-flowchart/package-lock.json | 531 ++++++++++++++++++
 .../experiments/gepa-flowchart/package.json | 1 +
 .../tests/test_geometry_bridge.py | 38 ++
 5 files changed, 616 insertions(+)
 create mode 100644 scripts/experiments/gepa-flowchart/gepa_flowchart/geometry_bridge.py
 create mode 100644 scripts/experiments/gepa-flowchart/gepa_flowchart/validate_flow.ts
 create mode 100644 scripts/experiments/gepa-flowchart/package-lock.json
 create mode 100644 scripts/experiments/gepa-flowchart/tests/test_geometry_bridge.py

diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/geometry_bridge.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/geometry_bridge.py
new file mode 100644
index 0000000..1cc4d28
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/geometry_bridge.py
@@ -0,0 +1,24 @@
+from __future__ import annotations
+
+import json
+import subprocess
+from pathlib import Path
+
+_HERE = Path(__file__).resolve().parent
+_PROJECT = _HERE.parent # scripts/experiments/gepa-flowchart
+_SCRIPT = _HERE / "validate_flow.ts"
+
+
+def validate_flow(content: str, *, cwd: str | None = None) -> dict:
+    """Run the TS validator on a flow-JSON string. Returns
+    {valid, error, findings, warnings}. Raises RuntimeError if the bridge fails."""
+    proc = subprocess.run(
+        ["npx", "tsx", str(_SCRIPT)],
+        input=content,
+        capture_output=True,
+        text=True,
+        cwd=cwd or str(_PROJECT),
+    )
+    if proc.returncode != 0:
+        raise RuntimeError(f"geometry bridge failed: {proc.stderr.strip()}")
+    return json.loads(proc.stdout)
diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/validate_flow.ts b/scripts/experiments/gepa-flowchart/gepa_flowchart/validate_flow.ts
new file mode 100644
index 0000000..7a4a1c1
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/validate_flow.ts
@@ -0,0 +1,22 @@
+// Reads a flow-JSON spec on stdin, prints { valid, error, findings, warnings } on stdout.
+// Imports the existing validators directly — no changes to shipped packages.
+import { validateContent } from "../../../../packages/core/src/validate.js";
+import { geometryReport } from "../../../../packages/viewer/src/flow-geometry.js";
+
+function readStdin(): Promise<string> {
+  return new Promise((resolve) => {
+    let data = "";
+    process.stdin.setEncoding("utf8");
+    process.stdin.on("data", (c) => (data += c));
+    process.stdin.on("end", () => resolve(data));
+  });
+}
+
+const content = await readStdin();
+const error = validateContent("flow", content);
+if (error) {
+  process.stdout.write(JSON.stringify({ valid: false, error, findings: [], warnings: [] }));
+} else {
+  const { warnings, findings } = geometryReport("flow", content);
+  process.stdout.write(JSON.stringify({ valid: true, error: null, findings, warnings }));
+}
diff --git a/scripts/experiments/gepa-flowchart/package-lock.json b/scripts/experiments/gepa-flowchart/package-lock.json
new file mode 100644
index 0000000..2e54688
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/package-lock.json
@@ -0,0 +1,531 @@
+{
+  "name": "gepa-flowchart-bridge",
+  "lockfileVersion": 3,
+  "requires": true,
+  "packages": {
+    "": {
+      "name": "gepa-flowchart-bridge",
+      "devDependencies": {
+        "tsx": "^4.19.0"
+      }
+    },
+
"node_modules/@esbuild/aix-ppc64": { + "version": "0.28.1", + "resolved": "https://registry.npmjs.org/@esbuild/aix-ppc64/-/aix-ppc64-0.28.1.tgz", + "integrity": "sha512-Svl7tq8k/08+p6CXPpRjQ1fKX+1odH/BQbb48fV6fj3CWHhsoIOoY87w1oHXm0qEpkIK3ZfVgp0hed3XBXzXMQ==", + "cpu": [ + "ppc64" + ], + "dev": true, + "license": "MIT", + "optional": true, + "os": [ + "aix" + ], + "engines": { + "node": ">=18" + } + }, + "node_modules/@esbuild/android-arm": { + "version": "0.28.1", + "resolved": "https://registry.npmjs.org/@esbuild/android-arm/-/android-arm-0.28.1.tgz", + "integrity": "sha512-0k2F129Xdio1TdJfzJ8sy1Q47vUD2NnwdhiAf7drUN1EBTfPf4hsFCtmMgu/6m8JSzsBrlmVjudMBQqOfG8usQ==", + "cpu": [ + "arm" + ], + "dev": true, + "license": "MIT", + "optional": true, + "os": [ + "android" + ], + "engines": { + "node": ">=18" + } + }, + "node_modules/@esbuild/android-arm64": { + "version": "0.28.1", + "resolved": "https://registry.npmjs.org/@esbuild/android-arm64/-/android-arm64-0.28.1.tgz", + "integrity": "sha512-34EGEbCIAgosYz6goLcopX6Mo7NyGv9tfwEM2/7Ce2VcVRk568iSvniGWcUXIy7wEDR1wzolcxcriFVrWYcwBg==", + "cpu": [ + "arm64" + ], + "dev": true, + "license": "MIT", + "optional": true, + "os": [ + "android" + ], + "engines": { + "node": ">=18" + } + }, + "node_modules/@esbuild/android-x64": { + "version": "0.28.1", + "resolved": "https://registry.npmjs.org/@esbuild/android-x64/-/android-x64-0.28.1.tgz", + "integrity": "sha512-dbwY7ltSMDWsRatcRpCnES4F+im88OCUgGZjy52shC7GqHRE/cYlxNbB4Z4UpJswpcc4Qxd2oE/ufM0p61IKng==", + "cpu": [ + "x64" + ], + "dev": true, + "license": "MIT", + "optional": true, + "os": [ + "android" + ], + "engines": { + "node": ">=18" + } + }, + "node_modules/@esbuild/darwin-arm64": { + "version": "0.28.1", + "resolved": "https://registry.npmjs.org/@esbuild/darwin-arm64/-/darwin-arm64-0.28.1.tgz", + "integrity": "sha512-TZbWkQY7kvTAXbXUT7uVACR5cMHsDiSz9z7ZKAX/RTq/WJEk3QyRr0wZpNhBDX+/0CtdqUIJlOiodQcta6tY3Q==", + "cpu": [ + "arm64" + ], + "dev": true, + "license": "MIT", + 
"optional": true, + "os": [ + "darwin" + ], + "engines": { + "node": ">=18" + } + }, + "node_modules/@esbuild/darwin-x64": { + "version": "0.28.1", + "resolved": "https://registry.npmjs.org/@esbuild/darwin-x64/-/darwin-x64-0.28.1.tgz", + "integrity": "sha512-zfdzgK9ACBNZLI/CyHTOx81SyNbM6YXn7rxSgX97VjyiPl9W1i4Ka4fgKECEoFCKGpvBj5qArWIGgQjOwkgskQ==", + "cpu": [ + "x64" + ], + "dev": true, + "license": "MIT", + "optional": true, + "os": [ + "darwin" + ], + "engines": { + "node": ">=18" + } + }, + "node_modules/@esbuild/freebsd-arm64": { + "version": "0.28.1", + "resolved": "https://registry.npmjs.org/@esbuild/freebsd-arm64/-/freebsd-arm64-0.28.1.tgz", + "integrity": "sha512-wG2EA8ENdEI0qhkSZMjfqrdY+ziCYCPMmtZjjIwOmXFjmyzEHn+UUxk5of+SYsjtfs3VpnlC7QLzSI5hY/rOAw==", + "cpu": [ + "arm64" + ], + "dev": true, + "license": "MIT", + "optional": true, + "os": [ + "freebsd" + ], + "engines": { + "node": ">=18" + } + }, + "node_modules/@esbuild/freebsd-x64": { + "version": "0.28.1", + "resolved": "https://registry.npmjs.org/@esbuild/freebsd-x64/-/freebsd-x64-0.28.1.tgz", + "integrity": "sha512-i7dZ9vQgnvSCzi/rYCXNgtF/U+eKZNJBzu3eTQbRgHnM7tNSizLOkRFAl3qzVc/Op/u5YkHHa4pf/3DOYHthLQ==", + "cpu": [ + "x64" + ], + "dev": true, + "license": "MIT", + "optional": true, + "os": [ + "freebsd" + ], + "engines": { + "node": ">=18" + } + }, + "node_modules/@esbuild/linux-arm": { + "version": "0.28.1", + "resolved": "https://registry.npmjs.org/@esbuild/linux-arm/-/linux-arm-0.28.1.tgz", + "integrity": "sha512-qVXBOHQS+d5Y722GwJzJUtOLlX7km3CraOaGormF1pDtPd2C/l1SHRPgjLunLGe51Sh5YYWKMFDyV4SxgMQYTQ==", + "cpu": [ + "arm" + ], + "dev": true, + "license": "MIT", + "optional": true, + "os": [ + "linux" + ], + "engines": { + "node": ">=18" + } + }, + "node_modules/@esbuild/linux-arm64": { + "version": "0.28.1", + "resolved": "https://registry.npmjs.org/@esbuild/linux-arm64/-/linux-arm64-0.28.1.tgz", + "integrity": 
"sha512-yHs+0uc8+nvEAfAfxrWQKK5peSNzBc4PegcMO0EJ2hT71uA7vB8Ihg2e77R2P7SG5uYjPbHlLLmve4LLLRCf0g==", + "cpu": [ + "arm64" + ], + "dev": true, + "license": "MIT", + "optional": true, + "os": [ + "linux" + ], + "engines": { + "node": ">=18" + } + }, + "node_modules/@esbuild/linux-ia32": { + "version": "0.28.1", + "resolved": "https://registry.npmjs.org/@esbuild/linux-ia32/-/linux-ia32-0.28.1.tgz", + "integrity": "sha512-d1z4ZuP0ajrfz/FhGT4vv278rX8KnPPJx8i5+AtK7TYbx9Le9F1hyzurZpkEyjkGa9dUGhQow4C1NmeGvqxN2w==", + "cpu": [ + "ia32" + ], + "dev": true, + "license": "MIT", + "optional": true, + "os": [ + "linux" + ], + "engines": { + "node": ">=18" + } + }, + "node_modules/@esbuild/linux-loong64": { + "version": "0.28.1", + "resolved": "https://registry.npmjs.org/@esbuild/linux-loong64/-/linux-loong64-0.28.1.tgz", + "integrity": "sha512-M5sRjUVZrkm1OAPR3dlOYzNmN+loZKGVi1VUQGrwuqLcbR6qeAz+famMhjASeH3YVKvZz+zT1jlh/keC3Rj/lg==", + "cpu": [ + "loong64" + ], + "dev": true, + "license": "MIT", + "optional": true, + "os": [ + "linux" + ], + "engines": { + "node": ">=18" + } + }, + "node_modules/@esbuild/linux-mips64el": { + "version": "0.28.1", + "resolved": "https://registry.npmjs.org/@esbuild/linux-mips64el/-/linux-mips64el-0.28.1.tgz", + "integrity": "sha512-mRObBZeHh2OxcBFPWE/FjylkRgZdYuiTR3vaTozquCGOH14iP9oN4x4Ge81CoIDYQrXmIxpFumJBu5MtZpnQJQ==", + "cpu": [ + "mips64el" + ], + "dev": true, + "license": "MIT", + "optional": true, + "os": [ + "linux" + ], + "engines": { + "node": ">=18" + } + }, + "node_modules/@esbuild/linux-ppc64": { + "version": "0.28.1", + "resolved": "https://registry.npmjs.org/@esbuild/linux-ppc64/-/linux-ppc64-0.28.1.tgz", + "integrity": "sha512-slScBsMAb3GFDcdrCgLwZtPYRoH2H/youv10QiZyRjmsP48fznoveWytSgCI/R0ZcUgpc0ZhIUEx6LHts8yrfQ==", + "cpu": [ + "ppc64" + ], + "dev": true, + "license": "MIT", + "optional": true, + "os": [ + "linux" + ], + "engines": { + "node": ">=18" + } + }, + "node_modules/@esbuild/linux-riscv64": { + "version": "0.28.1", + 
"resolved": "https://registry.npmjs.org/@esbuild/linux-riscv64/-/linux-riscv64-0.28.1.tgz", + "integrity": "sha512-kw0owk1o0GFETUJyW0jc0G4Yzs0BHZn0JDZ8JRT088vjJYX777BAs1fDGxAC+q831qOs2DTC96mNsG2opdfyyQ==", + "cpu": [ + "riscv64" + ], + "dev": true, + "license": "MIT", + "optional": true, + "os": [ + "linux" + ], + "engines": { + "node": ">=18" + } + }, + "node_modules/@esbuild/linux-s390x": { + "version": "0.28.1", + "resolved": "https://registry.npmjs.org/@esbuild/linux-s390x/-/linux-s390x-0.28.1.tgz", + "integrity": "sha512-/lAIjX8aYFRByhh6L5rYtPEDRqa9de/4V/juOXcta5frjvzXO4/sqEtyytse0g3zZFuWu5cDN0MkLz2qRDD2Ag==", + "cpu": [ + "s390x" + ], + "dev": true, + "license": "MIT", + "optional": true, + "os": [ + "linux" + ], + "engines": { + "node": ">=18" + } + }, + "node_modules/@esbuild/linux-x64": { + "version": "0.28.1", + "resolved": "https://registry.npmjs.org/@esbuild/linux-x64/-/linux-x64-0.28.1.tgz", + "integrity": "sha512-u/anNYF2mmVOEDwLtnQ1wOr3EZ9sTNGLWrsYGYwHWzGA3Si84IOkHXlbWTD1NB+9/1lcnweYKO54uhxZydNzfA==", + "cpu": [ + "x64" + ], + "dev": true, + "license": "MIT", + "optional": true, + "os": [ + "linux" + ], + "engines": { + "node": ">=18" + } + }, + "node_modules/@esbuild/netbsd-arm64": { + "version": "0.28.1", + "resolved": "https://registry.npmjs.org/@esbuild/netbsd-arm64/-/netbsd-arm64-0.28.1.tgz", + "integrity": "sha512-oks0DYbLwWMmaakTsCb+zL4E+aHRVLom9IJZOAthMQEPiQmydXHkziYEsGYRx0uNV/IjEKGAV941JzH02pflqw==", + "cpu": [ + "arm64" + ], + "dev": true, + "license": "MIT", + "optional": true, + "os": [ + "netbsd" + ], + "engines": { + "node": ">=18" + } + }, + "node_modules/@esbuild/netbsd-x64": { + "version": "0.28.1", + "resolved": "https://registry.npmjs.org/@esbuild/netbsd-x64/-/netbsd-x64-0.28.1.tgz", + "integrity": "sha512-aeL6lAnN89Hz43Mlh1G8ARasbuoYvSITDEx0tHh5b7jJnHcssqgjy9Yx430GDpmCa6OyrKoS0aNRjKundRizGg==", + "cpu": [ + "x64" + ], + "dev": true, + "license": "MIT", + "optional": true, + "os": [ + "netbsd" + ], + "engines": { + "node": ">=18" + 
} + }, + "node_modules/@esbuild/openbsd-arm64": { + "version": "0.28.1", + "resolved": "https://registry.npmjs.org/@esbuild/openbsd-arm64/-/openbsd-arm64-0.28.1.tgz", + "integrity": "sha512-MEFJe5C3R8pwXdZ5Y21oo6m7ePiS0d9pWucn99O/wvyJZChoIQKrQDxKrGeW8F5+T0okTHesAmDeiHDTIq0V/Q==", + "cpu": [ + "arm64" + ], + "dev": true, + "license": "MIT", + "optional": true, + "os": [ + "openbsd" + ], + "engines": { + "node": ">=18" + } + }, + "node_modules/@esbuild/openbsd-x64": { + "version": "0.28.1", + "resolved": "https://registry.npmjs.org/@esbuild/openbsd-x64/-/openbsd-x64-0.28.1.tgz", + "integrity": "sha512-i/ZLIOafE0Z8cI/XANJAixoJL/uRAoS2xOA3rb0xN+KK0K177cMAsQYkzHtBrtMXAKuAc7HGgcWiZ/sRC1Nxgw==", + "cpu": [ + "x64" + ], + "dev": true, + "license": "MIT", + "optional": true, + "os": [ + "openbsd" + ], + "engines": { + "node": ">=18" + } + }, + "node_modules/@esbuild/openharmony-arm64": { + "version": "0.28.1", + "resolved": "https://registry.npmjs.org/@esbuild/openharmony-arm64/-/openharmony-arm64-0.28.1.tgz", + "integrity": "sha512-ge+Z7EXFNt2BO1oAMsVpiQ8EwndV9i1xXerAeTIK7AtPs3bKFXQM7nlRxDSIUIMeueR1CNXxqztLzdNeReKBJg==", + "cpu": [ + "arm64" + ], + "dev": true, + "license": "MIT", + "optional": true, + "os": [ + "openharmony" + ], + "engines": { + "node": ">=18" + } + }, + "node_modules/@esbuild/sunos-x64": { + "version": "0.28.1", + "resolved": "https://registry.npmjs.org/@esbuild/sunos-x64/-/sunos-x64-0.28.1.tgz", + "integrity": "sha512-BEjgtECkL3vY+SaSQ6nzVfiALUeFxpawyp8Jmf5PtYhf1Ug40N1h/hxlhts+f1FvSvarEigdxS3BlSMI2PJLcQ==", + "cpu": [ + "x64" + ], + "dev": true, + "license": "MIT", + "optional": true, + "os": [ + "sunos" + ], + "engines": { + "node": ">=18" + } + }, + "node_modules/@esbuild/win32-arm64": { + "version": "0.28.1", + "resolved": "https://registry.npmjs.org/@esbuild/win32-arm64/-/win32-arm64-0.28.1.tgz", + "integrity": "sha512-lCv9eK/H6ZJWbE7bh2nw54CZ9M2nupBxJcTsdk/QQnWkdSjKGuxmmH8/GWrlT1eMmZfn4dGcCjRte397WqfQXA==", + "cpu": [ + "arm64" + ], + "dev": true, 
+      "license": "MIT",
+      "optional": true,
+      "os": [
+        "win32"
+      ],
+      "engines": {
+        "node": ">=18"
+      }
+    },
+    "node_modules/@esbuild/win32-ia32": {
+      "version": "0.28.1",
+      "resolved": "https://registry.npmjs.org/@esbuild/win32-ia32/-/win32-ia32-0.28.1.tgz",
+      "integrity": "sha512-zvb/mB2bSCoJOpoCBgYKKpX6YM6mJBlBUVUtVj41DlZJVEB6/0CKlRYxP5wWl1C1ILiCoAU5wZZ4q1P3qeS6Eg==",
+      "cpu": [
+        "ia32"
+      ],
+      "dev": true,
+      "license": "MIT",
+      "optional": true,
+      "os": [
+        "win32"
+      ],
+      "engines": {
+        "node": ">=18"
+      }
+    },
+    "node_modules/@esbuild/win32-x64": {
+      "version": "0.28.1",
+      "resolved": "https://registry.npmjs.org/@esbuild/win32-x64/-/win32-x64-0.28.1.tgz",
+      "integrity": "sha512-bm4Mowrv+GXMlpWX++EcXw/iLyd1o3+bJkC2DkWXYVvgZCqD/bSj9ctZeAMC3cIxgjRVR2Dufaiu4YPxr5gW1A==",
+      "cpu": [
+        "x64"
+      ],
+      "dev": true,
+      "license": "MIT",
+      "optional": true,
+      "os": [
+        "win32"
+      ],
+      "engines": {
+        "node": ">=18"
+      }
+    },
+    "node_modules/esbuild": {
+      "version": "0.28.1",
+      "resolved": "https://registry.npmjs.org/esbuild/-/esbuild-0.28.1.tgz",
+      "integrity": "sha512-HrJrvZv5ayxBzPfwphOoNzkzOIIlifzk0KJrGK2c8R4+LKpMtpYLQeUdjnwjWv/LZlkH2laZk+4w78pi99D4Vw==",
+      "dev": true,
+      "hasInstallScript": true,
+      "license": "MIT",
+      "bin": {
+        "esbuild": "bin/esbuild"
+      },
+      "engines": {
+        "node": ">=18"
+      },
+      "optionalDependencies": {
+        "@esbuild/aix-ppc64": "0.28.1",
+        "@esbuild/android-arm": "0.28.1",
+        "@esbuild/android-arm64": "0.28.1",
+        "@esbuild/android-x64": "0.28.1",
+        "@esbuild/darwin-arm64": "0.28.1",
+        "@esbuild/darwin-x64": "0.28.1",
+        "@esbuild/freebsd-arm64": "0.28.1",
+        "@esbuild/freebsd-x64": "0.28.1",
+        "@esbuild/linux-arm": "0.28.1",
+        "@esbuild/linux-arm64": "0.28.1",
+        "@esbuild/linux-ia32": "0.28.1",
+        "@esbuild/linux-loong64": "0.28.1",
+        "@esbuild/linux-mips64el": "0.28.1",
+        "@esbuild/linux-ppc64": "0.28.1",
+        "@esbuild/linux-riscv64": "0.28.1",
+        "@esbuild/linux-s390x": "0.28.1",
+        "@esbuild/linux-x64": "0.28.1",
+        "@esbuild/netbsd-arm64": "0.28.1",
+        "@esbuild/netbsd-x64": "0.28.1",
+        "@esbuild/openbsd-arm64": "0.28.1",
+        "@esbuild/openbsd-x64": "0.28.1",
+        "@esbuild/openharmony-arm64": "0.28.1",
+        "@esbuild/sunos-x64": "0.28.1",
+        "@esbuild/win32-arm64": "0.28.1",
+        "@esbuild/win32-ia32": "0.28.1",
+        "@esbuild/win32-x64": "0.28.1"
+      }
+    },
+    "node_modules/fsevents": {
+      "version": "2.3.3",
+      "resolved": "https://registry.npmjs.org/fsevents/-/fsevents-2.3.3.tgz",
+      "integrity": "sha512-5xoDfX+fL7faATnagmWPpbFtwh/R77WmMMqqHGS65C3vvB0YHrgF+B1YmZ3441tMj5n63k0212XNoJwzlhffQw==",
+      "dev": true,
+      "hasInstallScript": true,
+      "license": "MIT",
+      "optional": true,
+      "os": [
+        "darwin"
+      ],
+      "engines": {
+        "node": "^8.16.0 || ^10.6.0 || >=11.0.0"
+      }
+    },
+    "node_modules/tsx": {
+      "version": "4.22.4",
+      "resolved": "https://registry.npmjs.org/tsx/-/tsx-4.22.4.tgz",
+      "integrity": "sha512-X8EX+XV4QR5xCsrgxaED954zTDfY8KqlDtskKEL0cHhyS/P8b4IFOvGDQpsC9Q1XnLq915wEfwwY/zzskCtmhg==",
+      "dev": true,
+      "license": "MIT",
+      "dependencies": {
+        "esbuild": "~0.28.0"
+      },
+      "bin": {
+        "tsx": "dist/cli.mjs"
+      },
+      "engines": {
+        "node": ">=18.0.0"
+      },
+      "optionalDependencies": {
+        "fsevents": "~2.3.3"
+      }
+    }
+  }
+}
diff --git a/scripts/experiments/gepa-flowchart/package.json b/scripts/experiments/gepa-flowchart/package.json
index 6bf8617..6d32a40 100644
--- a/scripts/experiments/gepa-flowchart/package.json
+++ b/scripts/experiments/gepa-flowchart/package.json
@@ -1,5 +1,6 @@
 {
   "name": "gepa-flowchart-bridge",
   "private": true,
+  "type": "module",
   "devDependencies": { "tsx": "^4.19.0" }
 }
diff --git a/scripts/experiments/gepa-flowchart/tests/test_geometry_bridge.py b/scripts/experiments/gepa-flowchart/tests/test_geometry_bridge.py
new file mode 100644
index 0000000..60b403c
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/tests/test_geometry_bridge.py
@@ -0,0 +1,38 @@
+import json
+
+from gepa_flowchart.geometry_bridge import validate_flow
+
+CLEAN = json.dumps({
+    "direction": "TB",
+    "nodes": [{"id": "a", "data": {"label": "A"}}, {"id": "b", "data": {"label": "B"}}],
+    "edges": [{"source": "a", "target": "b"}],
+})
+
+# a -> c with b sitting on the a-c line (explicit positions force the overlap)
+EDGE_OVER_NODE = json.dumps({
+    "layout": "manual", "direction": "LR",
+    "nodes": [
+        {"id": "a", "position": {"x": 0, "y": 0}},
+        {"id": "b", "position": {"x": 200, "y": 0}},
+        {"id": "c", "position": {"x": 440, "y": 0}},
+    ],
+    "edges": [{"source": "a", "target": "c"}],
+})
+
+
+def test_clean_spec_is_valid_no_errors():
+    r = validate_flow(CLEAN)
+    assert r["valid"] is True
+    assert [f for f in r["findings"] if f["severity"] == "error"] == []
+
+
+def test_edge_over_node_flagged():
+    r = validate_flow(EDGE_OVER_NODE)
+    assert r["valid"] is True
+    assert any(f["code"] == "edge-over-node" for f in r["findings"])
+
+
+def test_invalid_json_reported():
+    r = validate_flow("{ not json")
+    assert r["valid"] is False
+    assert r["error"]

From 8a508bb1af16743b53aa5a6186842acfe86a3215 Mon Sep 17 00:00:00 2001
From: Ivan Cheung
Date: Thu, 18 Jun 2026 04:45:57 +0000
Subject: [PATCH 05/24] feat(gepa): Anthropic SDK wrappers + reflection callable

---
 .../gepa-flowchart/gepa_flowchart/llm.py | 47 +++++++++++++++++++
 .../gepa-flowchart/tests/test_llm.py | 34 ++++++++++++++
 2 files changed, 81 insertions(+)
 create mode 100644 scripts/experiments/gepa-flowchart/gepa_flowchart/llm.py
 create mode 100644 scripts/experiments/gepa-flowchart/tests/test_llm.py

diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/llm.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/llm.py
new file mode 100644
index 0000000..cd01808
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/llm.py
@@ -0,0 +1,47 @@
+from __future__ import annotations
+
+from typing import Callable
+
+_client = None
+
+
+def get_client():
+    global _client
+    if _client is None:
+        import anthropic
+        _client = anthropic.Anthropic()
+    return _client
+
+
+def complete(
+    system: str,
+    user: str,
+    *,
+    model: str,
+    effort: str = "medium",
+    max_tokens: int = 4096,
+    client=None,
+) -> str:
+    client = client or get_client()
+    resp = client.messages.create(
+        model=model,
+        max_tokens=max_tokens,
+        system=system,
+        output_config={"effort": effort},
+        messages=[{"role": "user", "content": user}],
+    )
+    return "".join(b.text for b in resp.content if getattr(b, "type", None) == "text")
+
+
+def make_reflection_callable(model: str, *, client=None) -> Callable[[str], str]:
+    def reflect(prompt: str) -> str:
+        return complete(
+            "You are an expert prompt engineer improving prompts based on feedback.",
+            prompt,
+            model=model,
+            effort="high",
+            max_tokens=8000,
+            client=client,
+        )
+
+    return reflect
diff --git a/scripts/experiments/gepa-flowchart/tests/test_llm.py b/scripts/experiments/gepa-flowchart/tests/test_llm.py
new file mode 100644
index 0000000..b876aee
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/tests/test_llm.py
@@ -0,0 +1,34 @@
+from gepa_flowchart import llm
+
+
+class FakeMessages:
+    def __init__(self, recorder):
+        self.recorder = recorder
+
+    def create(self, **kwargs):
+        self.recorder.update(kwargs)
+        block = type("Block", (), {"type": "text", "text": "hello world"})()
+        return type("Resp", (), {"content": [block]})()
+
+
+class FakeClient:
+    def __init__(self):
+        self.calls = {}
+        self.messages = FakeMessages(self.calls)
+
+
+def test_complete_extracts_text_and_passes_params():
+    fake = FakeClient()
+    out = llm.complete("sys", "usr", model="claude-opus-4-8", effort="high", client=fake)
+    assert out == "hello world"
+    assert fake.calls["model"] == "claude-opus-4-8"
+    assert fake.calls["output_config"] == {"effort": "high"}
+    assert "temperature" not in fake.calls # must never be sent
+    assert fake.calls["system"] == "sys"
+    assert fake.calls["messages"] == [{"role": "user", "content": "usr"}]
+
+
+def test_reflection_callable_returns_str():
+    fake = FakeClient()
+    fn = llm.make_reflection_callable("claude-opus-4-8", client=fake)
+    assert fn("reflect on this") == "hello world"

From 1b168602142516427af7cf647febf3194969a278 Mon Sep 17 00:00:00 2001
From: Ivan Cheung
Date: Thu, 18 Jun 2026 04:48:00 +0000
Subject: [PATCH 06/24] feat(gepa): board-to-text serializer + JSON extractor

---
 .../gepa-flowchart/gepa_flowchart/render.py | 82 +++++++++++++++++++
 .../gepa-flowchart/tests/test_render.py | 35 ++++++++
 2 files changed, 117 insertions(+)
 create mode 100644 scripts/experiments/gepa-flowchart/gepa_flowchart/render.py
 create mode 100644 scripts/experiments/gepa-flowchart/tests/test_render.py

diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/render.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/render.py
new file mode 100644
index 0000000..c913efe
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/render.py
@@ -0,0 +1,82 @@
+from __future__ import annotations
+
+
+def board_to_text(flow: dict) -> str:
+    lines: list[str] = []
+    direction = flow.get("direction", "TB")
+    lines.append(f"Flowchart (direction: {direction})")
+    groups = {g["id"]: g for g in flow.get("groups", []) if isinstance(g, dict) and "id" in g}
+    nodes = [n for n in flow.get("nodes", []) if isinstance(n, dict)]
+
+    lines.append("\nNodes:")
+    for n in nodes:
+        nid = n.get("id", "?")
+        data = n.get("data", {}) if isinstance(n.get("data"), dict) else {}
+        label = data.get("label", "(no label)")
+        status = data.get("status")
+        grp = groups.get(n.get("group", ""), {}).get("label")
+        suffix = []
+        if grp:
+            suffix.append(f"group: {grp}")
+        if status:
+            suffix.append(f"status: {status}")
+        extra = f" [{', '.join(suffix)}]" if suffix else ""
+        lines.append(f" - {nid}: {label}{extra}")
+
+    label_by_id = {
+        n.get("id"): (n.get("data", {}) or {}).get("label", n.get("id"))
+        for n in nodes
+    }
+    lines.append("\nConnections:")
+    for e in flow.get("edges", []):
+        if not isinstance(e, dict):
+            continue
+        src = label_by_id.get(e.get("source"), e.get("source"))
+        tgt = label_by_id.get(e.get("target"), e.get("target"))
+        elabel = (e.get("data", {}) or {}).get("label") if isinstance(e.get("data"), dict) else None
+        arrow = f' --[{elabel}]-->' if elabel else " -->"
+        lines.append(f" - {src}{arrow} {tgt}")
+    return "\n".join(lines)
+
+
+def extract_json(text: str) -> str | None:
+    if "```" in text:
+        # take the content of the first fenced block
+        parts = text.split("```")
+        for chunk in parts[1:]:
+            body = chunk
+            if body.lstrip().lower().startswith("json"):
+                body = body.lstrip()[4:]
+            body = body.strip()
+            if body.startswith("{"):
+                bal = _balanced(body)
+                if bal:
+                    return bal
+    start = text.find("{")
+    if start == -1:
+        return None
+    return _balanced(text[start:])
+
+
+def _balanced(s: str) -> str | None:
+    depth = 0
+    in_str = False
+    esc = False
+    for i, ch in enumerate(s):
+        if in_str:
+            if esc:
+                esc = False
+            elif ch == "\\":
+                esc = True
+            elif ch == '"':
+                in_str = False
+            continue
+        if ch == '"':
+            in_str = True
+        elif ch == "{":
+            depth += 1
+        elif ch == "}":
+            depth -= 1
+            if depth == 0:
+                return s[: i + 1]
+    return None
diff --git a/scripts/experiments/gepa-flowchart/tests/test_render.py b/scripts/experiments/gepa-flowchart/tests/test_render.py
new file mode 100644
index 0000000..83b68be
--- /dev/null
+++ b/scripts/experiments/gepa-flowchart/tests/test_render.py
@@ -0,0 +1,35 @@
+import json
+
+from gepa_flowchart.render import board_to_text, extract_json
+
+
+def test_board_to_text_includes_labels_and_edges():
+    flow = {
+        "direction": "TB",
+        "groups": [{"id": "g1", "label": "Frontend"}],
+        "nodes": [
+            {"id": "a", "group": "g1", "data": {"label": "Login form"}},
+            {"id": "b", "data": {"label": "Auth service", "status": "active"}},
+        ],
+        "edges": [{"source": "a", "target": "b", "data": {"label": "submit"}}],
+    }
+    text = board_to_text(flow)
+    assert "Login form" in text
+    assert "Auth service" in text
+    assert "Frontend" in text
+    assert "submit" in text
+
+
+def test_extract_json_from_fenced():
+    raw = 'Here:\n```json\n{"nodes": [], "edges": []}\n```\nDone.'
+ out = extract_json(raw) + assert json.loads(out) == {"nodes": [], "edges": []} + + +def test_extract_json_bare(): + out = extract_json('prefix {"x": 1} suffix') + assert json.loads(out) == {"x": 1} + + +def test_extract_json_none_when_absent(): + assert extract_json("no json here") is None From 30bb7e98927635b82c024267d0a097c633d5c5b2 Mon Sep 17 00:00:00 2001 From: Ivan Cheung Date: Thu, 18 Jun 2026 04:50:16 +0000 Subject: [PATCH 07/24] feat(gepa): topic dataset + reader-question generation/freezing --- .../gepa-flowchart/gepa_flowchart/dataset.py | 62 +++++++++++++++++++ .../gepa-flowchart/tests/test_dataset.py | 38 ++++++++++++ .../gepa-flowchart/topics/topics.json | 14 +++++ 3 files changed, 114 insertions(+) create mode 100644 scripts/experiments/gepa-flowchart/gepa_flowchart/dataset.py create mode 100644 scripts/experiments/gepa-flowchart/tests/test_dataset.py create mode 100644 scripts/experiments/gepa-flowchart/topics/topics.json diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/dataset.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/dataset.py new file mode 100644 index 0000000..4313004 --- /dev/null +++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/dataset.py @@ -0,0 +1,62 @@ +from __future__ import annotations + +import json +from dataclasses import dataclass, field +from pathlib import Path + + +@dataclass +class Task: + id: str + topic: str + audience: str + purpose: str + questions: list[str] = field(default_factory=list) + + +def load_topics(path: str) -> list[Task]: + raw = json.loads(Path(path).read_text()) + return [Task(id=r["id"], topic=r["topic"], audience=r["audience"], purpose=r["purpose"]) for r in raw] + + +_QGEN_SYSTEM = ( + "You write the questions a first-time reader of a diagram would need answered " + "to actually understand the subject. Output ONLY a JSON array of concise question strings." 
+) + + +def generate_questions(task: Task, *, model: str, n: int, complete_fn) -> list[str]: + user = ( + f"Topic: {task.topic}\nAudience: {task.audience}\nPurpose: {task.purpose}\n\n" + f"List exactly {n} distinct questions this reader should be able to answer " + f"from a good flowchart on this topic. JSON array of strings only." + ) + out = complete_fn(_QGEN_SYSTEM, user, model=model, effort="medium", max_tokens=1000) + blob = extract_array(out) + qs = json.loads(blob) if blob else [] + return [str(q) for q in qs][:n] + + +def extract_array(text: str) -> str | None: + start = text.find("[") + end = text.rfind("]") + if start == -1 or end == -1 or end < start: + return None + return text[start : end + 1] + + +def freeze_questions(tasks: list[Task], run_dir: str, *, model: str, n: int, complete_fn) -> list[Task]: + Path(run_dir).mkdir(parents=True, exist_ok=True) + frozen: dict[str, list[str]] = {} + for t in tasks: + t.questions = generate_questions(t, model=model, n=n, complete_fn=complete_fn) + frozen[t.id] = t.questions + (Path(run_dir) / "frozen_questions.json").write_text(json.dumps(frozen, indent=2)) + return tasks + + +def load_frozen(tasks: list[Task], run_dir: str) -> list[Task]: + frozen = json.loads((Path(run_dir) / "frozen_questions.json").read_text()) + for t in tasks: + t.questions = frozen.get(t.id, []) + return tasks diff --git a/scripts/experiments/gepa-flowchart/tests/test_dataset.py b/scripts/experiments/gepa-flowchart/tests/test_dataset.py new file mode 100644 index 0000000..062388b --- /dev/null +++ b/scripts/experiments/gepa-flowchart/tests/test_dataset.py @@ -0,0 +1,38 @@ +import json +from pathlib import Path + +from gepa_flowchart.dataset import ( + Task, + load_topics, + generate_questions, + freeze_questions, + load_frozen, +) + +TOPICS = str(Path(__file__).resolve().parent.parent / "topics" / "topics.json") + + +def fake_complete(system, user, *, model, **kw): + # Return a JSON array of questions, regardless of input. 
+ return '["Q1?", "Q2?", "Q3?"]' + + +def test_load_topics(): + tasks = load_topics(TOPICS) + assert len(tasks) == 12 + assert all(isinstance(t, Task) and t.id and t.topic for t in tasks) + + +def test_generate_questions_parses_list(): + t = Task(id="x", topic="T", audience="A", purpose="P") + qs = generate_questions(t, model="m", n=3, complete_fn=fake_complete) + assert qs == ["Q1?", "Q2?", "Q3?"] + + +def test_freeze_and_reload(tmp_path): + tasks = load_topics(TOPICS)[:2] + frozen = freeze_questions(tasks, str(tmp_path), model="m", n=3, complete_fn=fake_complete) + assert all(t.questions for t in frozen) + assert (tmp_path / "frozen_questions.json").exists() + reloaded = load_frozen(load_topics(TOPICS)[:2], str(tmp_path)) + assert reloaded[0].questions == frozen[0].questions diff --git a/scripts/experiments/gepa-flowchart/topics/topics.json b/scripts/experiments/gepa-flowchart/topics/topics.json new file mode 100644 index 0000000..110e513 --- /dev/null +++ b/scripts/experiments/gepa-flowchart/topics/topics.json @@ -0,0 +1,14 @@ +[ + {"id": "ci-cd", "topic": "CI/CD pipeline for a web app", "audience": "an engineer new to the team", "purpose": "understand how code reaches production and what can go wrong"}, + {"id": "user-auth", "topic": "User authentication and session flow", "audience": "a backend developer", "purpose": "understand login, token issuance, and refresh"}, + {"id": "order-fulfillment", "topic": "E-commerce order fulfillment", "audience": "an operations analyst", "purpose": "trace an order from checkout to delivery"}, + {"id": "incident-response", "topic": "On-call incident response process", "audience": "a new on-call engineer", "purpose": "know what to do when paged"}, + {"id": "data-pipeline", "topic": "Batch ETL data pipeline", "audience": "a data engineer", "purpose": "understand ingestion, transform, and load stages"}, + {"id": "pr-review", "topic": "Pull request review and merge process", "audience": "a contributor", "purpose": "know the path 
from PR open to merge"}, + {"id": "support-triage", "topic": "Customer support ticket triage", "audience": "a support agent", "purpose": "route and escalate tickets correctly"}, + {"id": "payment-processing", "topic": "Online payment processing and retries", "audience": "a fintech engineer", "purpose": "understand auth, capture, and failure handling"}, + {"id": "onboarding", "topic": "New employee onboarding workflow", "audience": "an HR coordinator", "purpose": "track steps from offer to first day"}, + {"id": "state-machine", "topic": "Document approval state machine", "audience": "a product manager", "purpose": "understand statuses and transitions"}, + {"id": "k8s-deploy", "topic": "Kubernetes rolling deployment", "audience": "a platform engineer", "purpose": "understand how a new version rolls out safely"}, + {"id": "signup-funnel", "topic": "SaaS signup and activation funnel", "audience": "a growth analyst", "purpose": "see where users drop off between signup and activation"} +] From 6233d5a6b57be5a542df01c532bbf9973a46e70e Mon Sep 17 00:00:00 2001 From: Ivan Cheung Date: Thu, 18 Jun 2026 04:52:44 +0000 Subject: [PATCH 08/24] feat(gepa): seed brainstorm/generate prompts + pipeline --- .../gepa-flowchart/gepa_flowchart/pipeline.py | 31 ++++++++++++ .../gepa_flowchart/seed_prompts.py | 47 +++++++++++++++++++ .../gepa-flowchart/tests/test_pipeline.py | 29 ++++++++++++ 3 files changed, 107 insertions(+) create mode 100644 scripts/experiments/gepa-flowchart/gepa_flowchart/pipeline.py create mode 100644 scripts/experiments/gepa-flowchart/gepa_flowchart/seed_prompts.py create mode 100644 scripts/experiments/gepa-flowchart/tests/test_pipeline.py diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/pipeline.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/pipeline.py new file mode 100644 index 0000000..fee812d --- /dev/null +++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/pipeline.py @@ -0,0 +1,31 @@ +from __future__ import annotations + +from 
.dataset import Task +from .render import extract_json +from .seed_prompts import SKILL_CONTEXT, SEED_CANDIDATE + + +def run_pipeline(candidate: dict, task: Task, *, model: str, complete_fn, max_tokens: int): + trace: dict = {} + brainstorm_user = candidate["brainstorm"].format( + topic=task.topic, audience=task.audience, purpose=task.purpose + ) + trace["brainstorm_input"] = brainstorm_user + plan = complete_fn( + "You are a diagram planner.", brainstorm_user, model=model, effort="medium", max_tokens=2000 + ) + trace["plan"] = plan + + generate_user = candidate["generate"].format( + skill_context=SKILL_CONTEXT, + topic=task.topic, + audience=task.audience, + purpose=task.purpose, + plan=plan, + ) + trace["generate_input"] = generate_user + raw = complete_fn( + "You output only valid JSON.", generate_user, model=model, effort="medium", max_tokens=max_tokens + ) + trace["raw_generation"] = raw + return extract_json(raw), trace diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/seed_prompts.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/seed_prompts.py new file mode 100644 index 0000000..fa515df --- /dev/null +++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/seed_prompts.py @@ -0,0 +1,47 @@ +from __future__ import annotations + +from pathlib import Path + +_FLOW_SCHEMA = """A termchart `flow` spec is JSON: +{ + "direction": "TB" | "LR" | "BT" | "RL", // default TB; prefer TB for processes + "nodes": [{ "id": "x", "data": { "label": "...", "status": "active|info|warn|success|neutral" }, "group": "g1?" }], + "edges": [{ "source": "x", "target": "y", "data": { "label": "when/condition?" } }], + "groups": [{ "id": "g1", "label": "Lane/Zone name", "color": "#hex?" 
}] +} +Rules: every edge source/target must be an existing node id; keep it under ~24 nodes; labels carry the meaning.""" + + +def _example() -> str: + candidates = [ + "call-hierarchy.flow.json", + "okr-tree.flow.json", + "binary-search.flow.json", + ] + base = Path(__file__).resolve().parents[3] / "plugin" / "skills" / "diagram-recipes" / "examples" + for name in candidates: + p = base / name + if p.exists(): + return p.read_text() + return '{"direction":"TB","nodes":[{"id":"a","data":{"label":"Start"}}],"edges":[]}' + + +SKILL_CONTEXT = f"{_FLOW_SCHEMA}\n\nExample of a well-formed flow spec:\n{_example()}" + +SEED_CANDIDATE = { + "brainstorm": ( + "You are planning a flowchart.\n" + "Topic: {topic}\nAudience: {audience}\nPurpose: {purpose}\n\n" + "Decide what the flowchart must show so this reader can understand the subject. " + "List the key steps/states, the decisions and what triggers each branch, and the " + "context a newcomer needs. Output a concise plan in plain text." + ), + "generate": ( + "You generate a termchart `flow` diagram as JSON.\n\n" + "{skill_context}\n\n" + "Topic: {topic}\nAudience: {audience}\nPurpose: {purpose}\n\n" + "Plan to follow:\n{plan}\n\n" + "Produce a single flow JSON object. Use clear, specific labels; label edges with the " + "condition/trigger; group related nodes. Output ONLY the JSON." 
+ ), +} diff --git a/scripts/experiments/gepa-flowchart/tests/test_pipeline.py b/scripts/experiments/gepa-flowchart/tests/test_pipeline.py new file mode 100644 index 0000000..25aa44b --- /dev/null +++ b/scripts/experiments/gepa-flowchart/tests/test_pipeline.py @@ -0,0 +1,29 @@ +import json + +from gepa_flowchart.dataset import Task +from gepa_flowchart.pipeline import run_pipeline, SEED_CANDIDATE + + +def test_seed_candidate_has_two_components(): + assert set(SEED_CANDIDATE.keys()) == {"brainstorm", "generate"} + assert "{topic}" in SEED_CANDIDATE["brainstorm"] + assert "{plan}" in SEED_CANDIDATE["generate"] + + +def test_run_pipeline_threads_plan_into_generation(): + calls = [] + flow = {"direction": "TB", "nodes": [{"id": "a", "data": {"label": "A"}}], "edges": []} + + def fake_complete(system, user, *, model, **kw): + calls.append(user) + if len(calls) == 1: + return "PLAN: show A then B" + return f"```json\n{json.dumps(flow)}\n```" + + task = Task(id="x", topic="Topic T", audience="Aud", purpose="Pur") + out, trace = run_pipeline(SEED_CANDIDATE, task, model="m", complete_fn=fake_complete, max_tokens=4000) + assert json.loads(out) == flow + assert trace["plan"] == "PLAN: show A then B" + # the plan must be threaded into the generate step's prompt + assert "PLAN: show A then B" in calls[1] + assert "Topic T" in calls[0] From 82d48acabbd36f5513464924aecc0202de982f70 Mon Sep 17 00:00:00 2001 From: Ivan Cheung Date: Thu, 18 Jun 2026 04:55:50 +0000 Subject: [PATCH 09/24] fix(gepa): load real gallery example in SKILL_CONTEXT (parents[4]) --- .../gepa-flowchart/gepa_flowchart/seed_prompts.py | 2 +- scripts/experiments/gepa-flowchart/tests/test_pipeline.py | 7 +++++++ 2 files changed, 8 insertions(+), 1 deletion(-) diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/seed_prompts.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/seed_prompts.py index fa515df..6cb7ab5 100644 --- a/scripts/experiments/gepa-flowchart/gepa_flowchart/seed_prompts.py +++ 
b/scripts/experiments/gepa-flowchart/gepa_flowchart/seed_prompts.py @@ -18,7 +18,7 @@ def _example() -> str: "okr-tree.flow.json", "binary-search.flow.json", ] - base = Path(__file__).resolve().parents[3] / "plugin" / "skills" / "diagram-recipes" / "examples" + base = Path(__file__).resolve().parents[4] / "plugin" / "skills" / "diagram-recipes" / "examples" for name in candidates: p = base / name if p.exists(): diff --git a/scripts/experiments/gepa-flowchart/tests/test_pipeline.py b/scripts/experiments/gepa-flowchart/tests/test_pipeline.py index 25aa44b..6ce1954 100644 --- a/scripts/experiments/gepa-flowchart/tests/test_pipeline.py +++ b/scripts/experiments/gepa-flowchart/tests/test_pipeline.py @@ -2,6 +2,7 @@ from gepa_flowchart.dataset import Task from gepa_flowchart.pipeline import run_pipeline, SEED_CANDIDATE +from gepa_flowchart.seed_prompts import SKILL_CONTEXT def test_seed_candidate_has_two_components(): @@ -27,3 +28,9 @@ def fake_complete(system, user, *, model, **kw): # the plan must be threaded into the generate step's prompt assert "PLAN: show A then B" in calls[1] assert "Topic T" in calls[0] + + +def test_skill_context_loads_real_example(): + # The real gallery example is far larger than the 75-char fallback stub. 
+ assert len(SKILL_CONTEXT) > 400 + assert "Example of a well-formed flow spec:" in SKILL_CONTEXT From c69dc8be74d05c3e2722576ed5bd138606a398d6 Mon Sep 17 00:00:00 2001 From: Ivan Cheung Date: Thu, 18 Jun 2026 04:57:06 +0000 Subject: [PATCH 10/24] feat(gepa): three-part metric (validity gate, geometry, comprehension) --- .../gepa-flowchart/gepa_flowchart/metric.py | 90 +++++++++++++++++++ .../gepa-flowchart/tests/test_metric.py | 45 ++++++++++ 2 files changed, 135 insertions(+) create mode 100644 scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py create mode 100644 scripts/experiments/gepa-flowchart/tests/test_metric.py diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py new file mode 100644 index 0000000..fbfc8d0 --- /dev/null +++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py @@ -0,0 +1,90 @@ +from __future__ import annotations + +import json +from dataclasses import dataclass + +from .config import Config +from .dataset import Task, extract_array +from .render import board_to_text + + +@dataclass +class ScoreResult: + score: float + feedback: str + comp: float + geom: float + valid: bool + + +def geometry_score(findings: list[dict], cfg: Config) -> tuple[float, str]: + errs = [f for f in findings if f.get("severity") == "error"] + warns = [f for f in findings if f.get("severity") == "warning"] + score = 1.0 - cfg.geom_error_penalty * len(errs) - cfg.geom_warning_penalty * len(warns) + score = max(0.0, min(1.0, score)) + if not findings: + return score, "Geometry: clean (no findings)." + msgs = "; ".join(f"[{f.get('severity')}] {f.get('code')}: {f.get('message')}" for f in findings) + return score, f"Geometry findings: {msgs}" + + +_READER_SYSTEM = ( + "You are seeing this flowchart for the FIRST TIME and know nothing else about it. " + "Answer each question using ONLY what the board shows. If the board does not tell you, " + "answer exactly 'not shown'. 
Output ONLY a JSON array of {\"q\": question, \"a\": answer}." +) + +_JUDGE_SYSTEM = ( + "You grade how well a first-time reader's answers are supported by a flowchart. For each " + "question score 0..1 for correctness AND whether the board actually supports the answer. " + "An answer of 'not shown' or one not supported by the board scores low and is a context gap. " + "Output ONLY a JSON array of {\"q\", \"score\", \"supported\" (bool), \"reason\"}." +) + + +def comprehension_score(board_text: str, questions: list[str], *, reader_model: str, judge_model: str, complete_fn) -> tuple[float, str]: + qlist = "\n".join(f"- {q}" for q in questions) + reader_out = complete_fn( + _READER_SYSTEM, + f"BOARD:\n{board_text}\n\nQUESTIONS:\n{qlist}", + model=reader_model, + effort="low", + max_tokens=1500, + ) + judge_out = complete_fn( + _JUDGE_SYSTEM, + f"BOARD:\n{board_text}\n\nQUESTIONS:\n{qlist}\n\nREADER ANSWERS:\n{reader_out}", + model=judge_model, + effort="high", + max_tokens=1500, + ) + blob = extract_array(judge_out) + rows = json.loads(blob) if blob else [] + if not rows: + return 0.0, "Comprehension: judge produced no parseable scores." + scores = [float(r.get("score", 0.0)) for r in rows] + comp = sum(scores) / len(scores) + gaps = [r for r in rows if not r.get("supported", False) or float(r.get("score", 0.0)) < 0.5] + if gaps: + gap_txt = "; ".join(f"'{r.get('q')}' — {r.get('reason')}" for r in gaps) + fb = f"Comprehension {comp:.2f}. Reader could not answer (add detail/context): {gap_txt}" + else: + fb = f"Comprehension {comp:.2f}. Reader answered all questions from the board." 
+ return comp, fb + + +def score_board(content, task: Task, cfg: Config, *, validate_fn, complete_fn) -> ScoreResult: + if not content: + return ScoreResult(0.0, "Generation produced no JSON object.", 0.0, 0.0, False) + report = validate_fn(content) + if not report.get("valid"): + return ScoreResult(0.0, f"Invalid flow spec: {report.get('error')}", 0.0, 0.0, False) + + geom, geom_fb = geometry_score(report.get("findings", []), cfg) + board_text = board_to_text(json.loads(content)) + comp, comp_fb = comprehension_score( + board_text, task.questions, reader_model=cfg.reader_model, judge_model=cfg.judge_model, complete_fn=complete_fn + ) + total = cfg.w_comp * comp + cfg.w_geom * geom + feedback = f"{comp_fb}\n{geom_fb}" + return ScoreResult(total, feedback, comp, geom, True) diff --git a/scripts/experiments/gepa-flowchart/tests/test_metric.py b/scripts/experiments/gepa-flowchart/tests/test_metric.py new file mode 100644 index 0000000..9dfc50c --- /dev/null +++ b/scripts/experiments/gepa-flowchart/tests/test_metric.py @@ -0,0 +1,45 @@ +import json + +from gepa_flowchart.config import Config +from gepa_flowchart.dataset import Task +from gepa_flowchart.metric import geometry_score, score_board + +CFG = Config.from_env({}) +TASK = Task(id="x", topic="T", audience="A", purpose="P", questions=["Q1?", "Q2?"]) +GOOD_FLOW = json.dumps({"direction": "TB", "nodes": [{"id": "a", "data": {"label": "Start"}}], "edges": []}) + + +def test_geometry_score_penalizes_errors(): + clean, _ = geometry_score([], CFG) + assert clean == 1.0 + err, msg = geometry_score([{"severity": "error", "code": "edge-over-node", "message": "x", "count": 1}], CFG) + assert err < 1.0 and "edge-over-node" in msg + + +def test_invalid_board_scores_zero(): + def validate_fn(content): + return {"valid": False, "error": "bad json", "findings": [], "warnings": []} + + res = score_board("{bad", TASK, CFG, validate_fn=validate_fn, complete_fn=lambda *a, **k: "") + assert res.score == 0.0 and res.valid is False + 
assert "bad json" in res.feedback + + +def test_missing_context_lowers_comp_and_is_fed_back(): + def validate_fn(content): + return {"valid": True, "error": None, "findings": [], "warnings": []} + + # reader answers; judge marks Q2 unsupported + def complete_fn(system, user, *, model, **kw): + if "first time" in system.lower(): # reader + return json.dumps([{"q": "Q1?", "a": "Start the process"}, {"q": "Q2?", "a": "not shown"}]) + # judge + return json.dumps([ + {"q": "Q1?", "score": 1.0, "supported": True, "reason": "shown"}, + {"q": "Q2?", "score": 0.0, "supported": False, "reason": "board never shows the failure path"}, + ]) + + res = score_board(GOOD_FLOW, TASK, CFG, validate_fn=validate_fn, complete_fn=complete_fn) + assert 0.0 < res.comp < 1.0 + assert res.geom == 1.0 + assert "failure path" in res.feedback From 34cf94ca17c5d8bd2efec62fab7821ac12ec3af5 Mon Sep 17 00:00:00 2001 From: Ivan Cheung Date: Thu, 18 Jun 2026 04:59:05 +0000 Subject: [PATCH 11/24] fix(gepa): judge parse degrades safely on malformed JSON --- .../gepa-flowchart/gepa_flowchart/metric.py | 5 ++++- .../gepa-flowchart/tests/test_metric.py | 15 +++++++++++++++ 2 files changed, 19 insertions(+), 1 deletion(-) diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py index fbfc8d0..aab40f0 100644 --- a/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py +++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py @@ -59,7 +59,10 @@ def comprehension_score(board_text: str, questions: list[str], *, reader_model: max_tokens=1500, ) blob = extract_array(judge_out) - rows = json.loads(blob) if blob else [] + try: + rows = json.loads(blob) if blob else [] + except (json.JSONDecodeError, ValueError): + rows = [] if not rows: return 0.0, "Comprehension: judge produced no parseable scores." 
scores = [float(r.get("score", 0.0)) for r in rows] diff --git a/scripts/experiments/gepa-flowchart/tests/test_metric.py b/scripts/experiments/gepa-flowchart/tests/test_metric.py index 9dfc50c..25a6d68 100644 --- a/scripts/experiments/gepa-flowchart/tests/test_metric.py +++ b/scripts/experiments/gepa-flowchart/tests/test_metric.py @@ -43,3 +43,18 @@ def complete_fn(system, user, *, model, **kw): assert 0.0 < res.comp < 1.0 assert res.geom == 1.0 assert "failure path" in res.feedback + + +def test_malformed_judge_degrades_safely(): + from gepa_flowchart.metric import comprehension_score + + def complete_fn(system, user, *, model, **kw): + if "first time" in system.lower(): + return '[{"q": "Q1?", "a": "x"}]' + return "[{ truncated malformed" # non-empty bracketed but invalid JSON + + comp, fb = comprehension_score( + "board text", ["Q1?"], reader_model="r", judge_model="j", complete_fn=complete_fn + ) + assert comp == 0.0 + assert "no parseable scores" in fb From 28fdcc035b4121e5ef8224d3139997f23e742111 Mon Sep 17 00:00:00 2001 From: Ivan Cheung Date: Thu, 18 Jun 2026 05:01:35 +0000 Subject: [PATCH 12/24] feat(gepa): GEPAAdapter (evaluate + reflective dataset) --- .../gepa-flowchart/gepa_flowchart/adapter.py | 56 +++++++++++++++++++ .../gepa-flowchart/tests/test_adapter.py | 42 ++++++++++++++ 2 files changed, 98 insertions(+) create mode 100644 scripts/experiments/gepa-flowchart/gepa_flowchart/adapter.py create mode 100644 scripts/experiments/gepa-flowchart/tests/test_adapter.py diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/adapter.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/adapter.py new file mode 100644 index 0000000..ece9108 --- /dev/null +++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/adapter.py @@ -0,0 +1,56 @@ +from __future__ import annotations + +from gepa.core.adapter import EvaluationBatch, GEPAAdapter + +from .config import Config +from .geometry_bridge import validate_flow +from .llm import complete +from .metric 
import score_board +from .pipeline import run_pipeline + + +class FlowchartAdapter(GEPAAdapter): + def __init__(self, cfg: Config, *, validate_fn=validate_flow, complete_fn=complete): + self.cfg = cfg + self.validate_fn = validate_fn + self.complete_fn = complete_fn + + def evaluate(self, batch, candidate, capture_traces=False): + outputs, scores, trajectories = [], [], [] if capture_traces else None + for task in batch: + content, trace = run_pipeline( + candidate, task, model=self.cfg.gen_model, + complete_fn=self.complete_fn, max_tokens=self.cfg.gen_max_tokens, + ) + result = score_board( + content, task, self.cfg, validate_fn=self.validate_fn, complete_fn=self.complete_fn + ) + outputs.append(content) + scores.append(result.score) + if capture_traces: + trajectories.append({"task": task, "trace": trace, "result": result, "output": content}) + return EvaluationBatch(outputs=outputs, scores=scores, trajectories=trajectories) + + def make_reflective_dataset(self, candidate, eval_batch, components_to_update): + out: dict[str, list[dict]] = {c: [] for c in components_to_update} + for traj in eval_batch.trajectories or []: + task = traj["task"] + result = traj["result"] + trace = traj["trace"] + shared_feedback = result.feedback + if "brainstorm" in out: + out["brainstorm"].append({ + "Inputs": f"Topic: {task.topic}\nAudience: {task.audience}\nPurpose: {task.purpose}", + "Generated Outputs": trace.get("plan", ""), + "Feedback": ( + "The plan feeds a downstream generator. The resulting board scored " + f"{result.score:.2f}. 
" + shared_feedback + ), + }) + if "generate" in out: + out["generate"].append({ + "Inputs": f"Plan:\n{trace.get('plan','')}", + "Generated Outputs": traj.get("output") or trace.get("raw_generation", ""), + "Feedback": shared_feedback, + }) + return out diff --git a/scripts/experiments/gepa-flowchart/tests/test_adapter.py b/scripts/experiments/gepa-flowchart/tests/test_adapter.py new file mode 100644 index 0000000..4caf04c --- /dev/null +++ b/scripts/experiments/gepa-flowchart/tests/test_adapter.py @@ -0,0 +1,42 @@ +import json + +from gepa_flowchart.config import Config +from gepa_flowchart.dataset import Task +from gepa_flowchart.adapter import FlowchartAdapter +from gepa_flowchart.pipeline import SEED_CANDIDATE + +CFG = Config.from_env({}) +FLOW = {"direction": "TB", "nodes": [{"id": "a", "data": {"label": "A"}}], "edges": []} + + +def validate_fn(content): + return {"valid": True, "error": None, "findings": [], "warnings": []} + + +def complete_fn(system, user, *, model, **kw): + if "planner" in system.lower(): + return "plan" + if "only valid json" in system.lower(): + return json.dumps(FLOW) + if "first time" in system.lower(): + return json.dumps([{"q": "Q1?", "a": "A"}]) + # judge + return json.dumps([{"q": "Q1?", "score": 1.0, "supported": True, "reason": "ok"}]) + + +def test_evaluate_returns_scores_and_traces(): + adapter = FlowchartAdapter(CFG, validate_fn=validate_fn, complete_fn=complete_fn) + batch = [Task(id="x", topic="T", audience="A", purpose="P", questions=["Q1?"])] + out = adapter.evaluate(batch, SEED_CANDIDATE, capture_traces=True) + assert len(out.scores) == 1 + assert out.scores[0] > 0.0 + assert out.trajectories is not None and len(out.trajectories) == 1 + + +def test_make_reflective_dataset_has_requested_components(): + adapter = FlowchartAdapter(CFG, validate_fn=validate_fn, complete_fn=complete_fn) + batch = [Task(id="x", topic="T", audience="A", purpose="P", questions=["Q1?"])] + ev = adapter.evaluate(batch, SEED_CANDIDATE, 
capture_traces=True) + refl = adapter.make_reflective_dataset(SEED_CANDIDATE, ev, ["brainstorm", "generate"]) + assert set(refl.keys()) == {"brainstorm", "generate"} + assert refl["generate"] and "Feedback" in refl["generate"][0] From cce5d9d1e67fe9ce82a0b91d9535a9f89c5eaf56 Mon Sep 17 00:00:00 2001 From: Ivan Cheung Date: Thu, 18 Jun 2026 05:05:04 +0000 Subject: [PATCH 13/24] feat(gepa): run entrypoint, report, and smoke test --- .../gepa-flowchart/gepa_flowchart/report.py | 28 ++++++ .../gepa-flowchart/gepa_flowchart/run.py | 90 +++++++++++++++++++ .../gepa-flowchart/tests/test_report.py | 8 ++ .../gepa-flowchart/tests/test_smoke.py | 13 +++ 4 files changed, 139 insertions(+) create mode 100644 scripts/experiments/gepa-flowchart/gepa_flowchart/report.py create mode 100644 scripts/experiments/gepa-flowchart/gepa_flowchart/run.py create mode 100644 scripts/experiments/gepa-flowchart/tests/test_report.py create mode 100644 scripts/experiments/gepa-flowchart/tests/test_smoke.py diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/report.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/report.py new file mode 100644 index 0000000..2c6eb09 --- /dev/null +++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/report.py @@ -0,0 +1,28 @@ +from __future__ import annotations + +import json + + +def _mean(xs: list[float]) -> float: + return sum(xs) / len(xs) if xs else 0.0 + + +def render_report(seed_scores, best_scores, best_candidate, val_ids) -> str: + seed_m, best_m = _mean(seed_scores), _mean(best_scores) + lines = [ + "# GEPA Flowchart Optimization — Report", + "", + f"- Val tasks: {len(val_ids)} ({', '.join(val_ids)})", + f"- Seed mean score: **{seed_m:.2f}**", + f"- Best mean score: **{best_m:.2f}**", + f"- Delta: **{best_m - seed_m:+.2f}**", + "", + "## Per-task (seed → best)", + "", + "| task | seed | best |", + "|---|---|---|", + ] + for tid, s, b in zip(val_ids, seed_scores, best_scores): + lines.append(f"| {tid} | {s:.2f} | {b:.2f} |") + lines += 
["", "## Best prompts", "", "```json", json.dumps(best_candidate, indent=2), "```"] + return "\n".join(lines) diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py new file mode 100644 index 0000000..8d685cf --- /dev/null +++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py @@ -0,0 +1,90 @@ +from __future__ import annotations + +import argparse +import json +import sys +from datetime import datetime, timezone +from pathlib import Path + +import gepa + +from .adapter import FlowchartAdapter +from .config import Config +from .dataset import freeze_questions, load_topics +from .llm import complete, make_reflection_callable +from .metric import score_board +from .geometry_bridge import validate_flow +from .pipeline import SEED_CANDIDATE +from .report import render_report + +_TOPICS = str(Path(__file__).resolve().parent.parent / "topics" / "topics.json") + + +def _eval_scores(tasks, candidate, cfg) -> list[float]: + return [ + score_board( + run_one(candidate, t, cfg), t, cfg, validate_fn=validate_flow, complete_fn=complete + ).score + for t in tasks + ] + + +def run_one(candidate, task, cfg): + from .pipeline import run_pipeline + content, _ = run_pipeline(candidate, task, model=cfg.gen_model, complete_fn=complete, max_tokens=cfg.gen_max_tokens) + return content + + +def main(argv=None) -> int: + p = argparse.ArgumentParser(description="GEPA flowchart prompt optimization") + p.add_argument("--smoke", action="store_true", help="1 topic, tiny budget") + p.add_argument("--max-metric-calls", type=int, default=None) + p.add_argument("--train", type=int, default=8) + p.add_argument("--run-dir", default=None) + args = p.parse_args(argv) + + cfg = Config.from_env({}) + budget = args.max_metric_calls or (6 if args.smoke else cfg.max_metric_calls) + run_dir = args.run_dir or str(Path("runs") / datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")) + Path(run_dir).mkdir(parents=True, 
exist_ok=True) + + topics = load_topics(_TOPICS) + if args.smoke: + train, val = topics[:1], topics[:1] + else: + train, val = topics[: args.train], topics[args.train :] + + print(f"[gepa] run_dir={run_dir} budget={budget} train={len(train)} val={len(val)}", file=sys.stderr) + print(f"[gepa] models gen={cfg.gen_model} reader={cfg.reader_model} judge={cfg.judge_model} reflect={cfg.reflection_model}", file=sys.stderr) + + # Freeze the reader-questions once for the whole run (fair comparison). + all_tasks = freeze_questions(topics, run_dir, model=cfg.judge_model, n=cfg.n_questions, complete_fn=complete) + by_id = {t.id: t for t in all_tasks} + train = [by_id[t.id] for t in train] + val = [by_id[t.id] for t in val] + + seed_scores = _eval_scores(val, SEED_CANDIDATE, cfg) + + adapter = FlowchartAdapter(cfg) + result = gepa.optimize( + seed_candidate=SEED_CANDIDATE, + trainset=train, + valset=val, + adapter=adapter, + reflection_lm=make_reflection_callable(cfg.reflection_model), + max_metric_calls=budget, + run_dir=run_dir, + display_progress_bar=True, + ) + best = result.best_candidate + best_scores = _eval_scores(val, best, cfg) + + (Path(run_dir) / "best_prompts.json").write_text(json.dumps(best, indent=2)) + report = render_report(seed_scores, best_scores, best, [t.id for t in val]) + (Path(run_dir) / "report.md").write_text(report) + print(report) + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/scripts/experiments/gepa-flowchart/tests/test_report.py b/scripts/experiments/gepa-flowchart/tests/test_report.py new file mode 100644 index 0000000..165984d --- /dev/null +++ b/scripts/experiments/gepa-flowchart/tests/test_report.py @@ -0,0 +1,8 @@ +from gepa_flowchart.report import render_report + + +def test_report_shows_before_after(): + md = render_report([0.4, 0.5], [0.7, 0.8], {"brainstorm": "b", "generate": "g"}, ["t1", "t2"]) + assert "0.45" in md # seed mean + assert "0.75" in md # best mean + assert "brainstorm" in md and 
"generate" in md diff --git a/scripts/experiments/gepa-flowchart/tests/test_smoke.py b/scripts/experiments/gepa-flowchart/tests/test_smoke.py new file mode 100644 index 0000000..bdf4355 --- /dev/null +++ b/scripts/experiments/gepa-flowchart/tests/test_smoke.py @@ -0,0 +1,13 @@ +import os + +import pytest + +from gepa_flowchart.run import main + + +@pytest.mark.skipif(not os.environ.get("ANTHROPIC_API_KEY"), reason="needs ANTHROPIC_API_KEY") +def test_smoke_end_to_end(tmp_path): + rc = main(["--smoke", "--run-dir", str(tmp_path)]) + assert rc == 0 + assert (tmp_path / "report.md").exists() + assert (tmp_path / "best_prompts.json").exists() From 03fae2baa327f968ba681f5beb8297e394c783b9 Mon Sep 17 00:00:00 2001 From: Ivan Cheung Date: Thu, 18 Jun 2026 05:07:57 +0000 Subject: [PATCH 14/24] docs(gepa): README with setup, run, and cost notes --- scripts/experiments/gepa-flowchart/README.md | 44 ++++++++++++++++++++ 1 file changed, 44 insertions(+) create mode 100644 scripts/experiments/gepa-flowchart/README.md diff --git a/scripts/experiments/gepa-flowchart/README.md b/scripts/experiments/gepa-flowchart/README.md new file mode 100644 index 0000000..4757c6b --- /dev/null +++ b/scripts/experiments/gepa-flowchart/README.md @@ -0,0 +1,44 @@ +# gepa-flowchart + +GEPA prompt optimization for readable, comprehensible termchart flowcharts. +Optimizes the `brainstorm` + `generate` prompts for first-user comprehension +(primary) and geometric readability (secondary). See +`docs/superpowers/specs/2026-06-18-gepa-flowchart-optimization-design.md`. + +## Setup + +```bash +# Node deps for the geometry bridge (run once) +npm install # repo root — gives the viewer its deps +cd scripts/experiments/gepa-flowchart +npm install # local tsx + +# Python +python -m venv .venv && . .venv/bin/activate +pip install -e ".[dev]" +export ANTHROPIC_API_KEY=sk-ant-... 
# or: ant auth login +``` + +## Run + +```bash +python -m gepa_flowchart.run --smoke # cheap end-to-end check (1 topic) +python -m gepa_flowchart.run # full run (~150 metric calls) +python -m gepa_flowchart.run --max-metric-calls 80 --train 8 +``` + +Outputs land in `runs//`: `best_prompts.json`, `report.md`, +`frozen_questions.json`. + +## Config (env) + +`GEPA_GEN_MODEL` (default `claude-opus-4-8`; set `claude-sonnet-4-6` to cut cost), +`GEPA_READER_MODEL`, `GEPA_JUDGE_MODEL`, `GEPA_REFLECTION_MODEL`, +`GEPA_W_COMP` (0.6), `GEPA_W_GEOM` (0.4), `GEPA_N_QUESTIONS` (5), +`GEPA_MAX_METRIC_CALLS` (150). + +## Cost note + +A full run does many generation + reader + judge calls per rollout plus +reflection. Start with `--smoke`. Generation defaults to Opus; switch +`GEPA_GEN_MODEL=claude-sonnet-4-6` for a cheaper high-volume role. From d39a67cf37ed0dc292ecf007f2dc2f6480eecc9c Mon Sep 17 00:00:00 2001 From: Ivan Cheung Date: Thu, 18 Jun 2026 15:43:36 +0000 Subject: [PATCH 15/24] feat(gepa): run via Claude on Vertex AI (ADC auth) get_client() auto-selects AnthropicVertex when CLAUDE_CODE_USE_VERTEX or GEPA_USE_VERTEX is set (project/region from env, ADC); falls back to the direct API otherwise. Verified end-to-end: smoke run on Vertex (project adk-coding-agents, region global) completed, seed 0.83 -> best 0.85 on the ci-cd val task. --- scripts/experiments/gepa-flowchart/README.md | 15 +++++++++++++++ .../gepa-flowchart/gepa_flowchart/llm.py | 19 +++++++++++++++++-- 2 files changed, 32 insertions(+), 2 deletions(-) diff --git a/scripts/experiments/gepa-flowchart/README.md b/scripts/experiments/gepa-flowchart/README.md index 4757c6b..47f7eb9 100644 --- a/scripts/experiments/gepa-flowchart/README.md +++ b/scripts/experiments/gepa-flowchart/README.md @@ -19,6 +19,21 @@ pip install -e ".[dev]" export ANTHROPIC_API_KEY=sk-ant-... # or: ant auth login ``` +### Auth: direct API or Vertex AI + +By default the harness uses the direct Anthropic API (`ANTHROPIC_API_KEY`). 
+To run via **Claude on Vertex AI** (ADC, no API key), install the Vertex extra
+and set the Vertex env vars — `get_client()` auto-selects the Vertex client when
+`CLAUDE_CODE_USE_VERTEX` (or `GEPA_USE_VERTEX`) is set:
+
+```bash
+pip install "anthropic[vertex]"
+export CLAUDE_CODE_USE_VERTEX=1
+export ANTHROPIC_VERTEX_PROJECT_ID=
+export CLOUD_ML_REGION=global  # or a Claude-on-Vertex region
+gcloud auth application-default login  # ADC
+```
+
 ## Run
 
 ```bash
diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/llm.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/llm.py
index cd01808..7b021ae 100644
--- a/scripts/experiments/gepa-flowchart/gepa_flowchart/llm.py
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/llm.py
@@ -6,10 +6,25 @@
 
 
 def get_client():
+    """Lazily build the Anthropic client. Uses Vertex (ADC) when CLAUDE_CODE_USE_VERTEX/
+    GEPA_USE_VERTEX is set, else the direct API (ANTHROPIC_API_KEY)."""
     global _client
     if _client is None:
-        import anthropic
-        _client = anthropic.Anthropic()
+        import os
+
+        if os.environ.get("GEPA_USE_VERTEX") or os.environ.get("CLAUDE_CODE_USE_VERTEX"):
+            from anthropic import AnthropicVertex
+
+            _client = AnthropicVertex(
+                project_id=os.environ.get("ANTHROPIC_VERTEX_PROJECT_ID")
+                or os.environ.get("GOOGLE_CLOUD_PROJECT"),
+                region=os.environ.get("CLOUD_ML_REGION")
+                or os.environ.get("GOOGLE_CLOUD_LOCATION", "global"),
+            )
+        else:
+            import anthropic
+
+            _client = anthropic.Anthropic()
     return _client
 
 

From a5ad73cf81c9ef329e0d9b4154f70213b5fe7dff Mon Sep 17 00:00:00 2001
From: Ivan Cheung
Date: Thu, 18 Jun 2026 19:35:11 +0000
Subject: [PATCH 16/24] fix(gepa): brace-safe prompt fill so GEPA-mutated prompts don't KeyError

GEPA mutates the brainstorm/generate prompts and routinely inserts literal JSON
braces (e.g. {"direction","nodes",...}). pipeline.py used str.format(), which treats
those as fields and raises KeyError mid-run (crashed the full run at ~10/150 rollouts).
Replace with a placeholder-only replace() that leaves other braces alone.
Regression test added.
---
 .../gepa-flowchart/gepa_flowchart/pipeline.py | 17 ++++++++++---
 .../gepa-flowchart/tests/test_pipeline.py | 25 +++++++++++++++++++
 2 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/pipeline.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/pipeline.py
index fee812d..dbadf3b 100644
--- a/scripts/experiments/gepa-flowchart/gepa_flowchart/pipeline.py
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/pipeline.py
@@ -5,10 +5,20 @@
 from .seed_prompts import SKILL_CONTEXT, SEED_CANDIDATE
 
 
+def _fill(template: str, **fields: str) -> str:
+    """Substitute only the named {placeholders}. Brace-safe: GEPA mutates these prompts and
+    routinely inserts literal JSON braces (e.g. {"nodes": ...}); str.format() would treat those
+    as fields and KeyError. A plain replace of each known placeholder leaves all other braces alone."""
+    out = template
+    for key, value in fields.items():
+        out = out.replace("{" + key + "}", value)
+    return out
+
+
 def run_pipeline(candidate: dict, task: Task, *, model: str, complete_fn, max_tokens: int):
     trace: dict = {}
-    brainstorm_user = candidate["brainstorm"].format(
-        topic=task.topic, audience=task.audience, purpose=task.purpose
+    brainstorm_user = _fill(
+        candidate["brainstorm"], topic=task.topic, audience=task.audience, purpose=task.purpose
     )
     trace["brainstorm_input"] = brainstorm_user
     plan = complete_fn(
@@ -16,7 +26,8 @@ def run_pipeline(candidate: dict, task: Task, *, model: str, complete_fn, max_to
     )
     trace["plan"] = plan
 
-    generate_user = candidate["generate"].format(
+    generate_user = _fill(
+        candidate["generate"],
         skill_context=SKILL_CONTEXT,
         topic=task.topic,
         audience=task.audience,
diff --git a/scripts/experiments/gepa-flowchart/tests/test_pipeline.py b/scripts/experiments/gepa-flowchart/tests/test_pipeline.py
index 6ce1954..88a3e6a 100644
--- a/scripts/experiments/gepa-flowchart/tests/test_pipeline.py
+++ b/scripts/experiments/gepa-flowchart/tests/test_pipeline.py
@@ -34,3 +34,28 @@ def test_skill_context_loads_real_example():
     # The real gallery example is far larger than the 75-char fallback stub.
     assert len(SKILL_CONTEXT) > 400
     assert "Example of a well-formed flow spec:" in SKILL_CONTEXT
+
+
+def test_generate_prompt_with_literal_json_braces_does_not_crash():
+    """Regression: GEPA mutates the generate prompt to include literal JSON braces
+    (e.g. {"direction","nodes","edges"}). The pipeline must still substitute the real
+    placeholders and must NOT raise KeyError from str.format()."""
+    flow = {"direction": "TB", "nodes": [{"id": "a", "data": {"label": "A"}}], "edges": []}
+
+    def fake_complete(system, user, *, model, **kw):
+        if "planner" in system.lower():
+            return "PLAN X"
+        return json.dumps(flow)
+
+    candidate = {
+        "brainstorm": "Plan for {topic}.",
+        # literal braces from a GEPA mutation, plus the real {plan} placeholder:
+        "generate": 'Schema is { "direction", "nodes", "edges", "groups" }.\n'
+        'Follow this plan: {plan}\nOutput only JSON.',
+    }
+    task = Task(id="x", topic="T", audience="A", purpose="P")
+    out, trace = run_pipeline(candidate, task, model="m", complete_fn=fake_complete, max_tokens=4000)
+    assert json.loads(out) == flow
+    # the real placeholder was filled; the literal braces survived verbatim
+    assert "PLAN X" in trace["generate_input"]
+    assert '{ "direction", "nodes", "edges", "groups" }' in trace["generate_input"]

From 1bc010b1f9acb1776a3c4576c3ea27a8f8c0121c Mon Sep 17 00:00:00 2001
From: Ivan Cheung
Date: Thu, 18 Jun 2026 20:38:32 +0000
Subject: [PATCH 17/24] feat(gepa): de-saturate the metric (strict judge + harder questions + harder topics)

Diagnosis: comprehension was pinned at ~1.0 (lenient judge), leaving GEPA no gradient.
Fixes:
- judge scores STRICTLY and board-grounded (0/0.5/1; credit only specifics shown on the board,
  never world knowledge)
- question generator demands specific, detail-probing questions a sketchy board fails
- n_questions 5 -> 7
- +8 harder multi-branch topics (saga, oauth-pkce, raft, k8s-sched, tcp, 3ds, blue-green,
  rate-limiter) -> 20 total
Result: comprehension now spreads 0.71-1.00 (was 0.88-1.00); totals 0.69-0.86.
---
 .../gepa-flowchart/gepa_flowchart/config.py | 2 +-
 .../gepa-flowchart/gepa_flowchart/dataset.py | 14 ++++++++++----
 .../gepa-flowchart/gepa_flowchart/metric.py | 17 +++++++++++++----
 .../gepa-flowchart/tests/test_dataset.py | 3 ++-
 .../gepa-flowchart/topics/topics.json | 10 +++++++++-
 5 files changed, 35 insertions(+), 11 deletions(-)

diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/config.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/config.py
index 1e5ae10..ddcc248 100644
--- a/scripts/experiments/gepa-flowchart/gepa_flowchart/config.py
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/config.py
@@ -14,7 +14,7 @@ class Config:
     w_geom: float = 0.4
     geom_error_penalty: float = 0.34
     geom_warning_penalty: float = 0.08
-    n_questions: int = 5
+    n_questions: int = 7
     max_metric_calls: int = 150
     gen_max_tokens: int = 8000
 
diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/dataset.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/dataset.py
index 4313004..6a5d4ea 100644
--- a/scripts/experiments/gepa-flowchart/gepa_flowchart/dataset.py
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/dataset.py
@@ -20,16 +20,22 @@ def load_topics(path: str) -> list[Task]:
 
 
 _QGEN_SYSTEM = (
-    "You write the questions a first-time reader of a diagram would need answered "
-    "to actually understand the subject. Output ONLY a JSON array of concise question strings."
+    "You write probing questions that test whether a diagram contains REAL, SPECIFIC DETAIL — "
+    "not just a high-level sketch. Favor questions demanding specifics: exact triggers/conditions/"
+    "thresholds, what happens on each failure or error path, ordering and dependencies, who or what "
+    "performs each step (human vs automated system), and edge cases. A vague, high-level board "
+    "should FAIL many of them; only a detailed, well-contextualized board should answer them all. "
+    "Output ONLY a JSON array of concise question strings."
 )
 
 
 def generate_questions(task: Task, *, model: str, n: int, complete_fn) -> list[str]:
     user = (
         f"Topic: {task.topic}\nAudience: {task.audience}\nPurpose: {task.purpose}\n\n"
-        f"List exactly {n} distinct questions this reader should be able to answer "
-        f"from a good flowchart on this topic. JSON array of strings only."
+        f"List exactly {n} specific, detail-probing questions this reader needs answered. "
+        f"Each must demand a concrete fact the diagram should show (a condition, branch, actor, "
+        f"order, threshold, or failure path) — not something answerable from general knowledge. "
+        f"JSON array of strings only."
     )
     out = complete_fn(_QGEN_SYSTEM, user, model=model, effort="medium", max_tokens=1000)
     blob = extract_array(out)
diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py
index aab40f0..4712f11 100644
--- a/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py
@@ -35,10 +35,19 @@ def geometry_score(findings: list[dict], cfg: Config) -> tuple[float, str]:
     )
 
 _JUDGE_SYSTEM = (
-    "You grade how well a first-time reader's answers are supported by a flowchart. For each "
-    "question score 0..1 for correctness AND whether the board actually supports the answer. "
-    "An answer of 'not shown' or one not supported by the board scores low and is a context gap. "
-    "Output ONLY a JSON array of {\"q\", \"score\", \"supported\" (bool), \"reason\"}."
+    "You grade a first-time reader's answers STRICTLY, judging ONLY whether the BOARD ITSELF "
+    "contains the information — never whether the answer is true in general. A knowledgeable reader "
+    "can answer many questions from world knowledge; that does NOT count and must be penalized. "
+    "Score each question:\n"
+    " 1.0 = the board explicitly shows the SPECIFIC answer (the exact step, condition/trigger, "
+    "branch outcome, actor, or value is named on the board);\n"
+    " 0.5 = the board only partially or generically shows it, requiring the reader to infer;\n"
+    " 0.0 = the board does not show it, the answer is 'not shown', or it could only be answered "
+    "from outside knowledge.\n"
+    "Be demanding: vague, generic, high-level, or inferred answers are 0.0-0.5, never 1.0. A board "
+    "that merely sketches the topic should score low; only a board rich in specific detail scores high. "
+    "Output ONLY a JSON array of {\"q\", \"score\" (0, 0.5, or 1), \"supported\" (bool, true ONLY if "
+    "score is 1), \"reason\" (what specific detail is present or missing on the board)}."
 )
 
 
diff --git a/scripts/experiments/gepa-flowchart/tests/test_dataset.py b/scripts/experiments/gepa-flowchart/tests/test_dataset.py
index 062388b..30e1ec4 100644
--- a/scripts/experiments/gepa-flowchart/tests/test_dataset.py
+++ b/scripts/experiments/gepa-flowchart/tests/test_dataset.py
@@ -19,7 +19,8 @@ def fake_complete(system, user, *, model, **kw):
 
 def test_load_topics():
     tasks = load_topics(TOPICS)
-    assert len(tasks) == 12
+    assert len(tasks) == 20
+    assert len({t.id for t in tasks}) == len(tasks)  # ids unique
     assert all(isinstance(t, Task) and t.id and t.topic for t in tasks)
 
 
diff --git a/scripts/experiments/gepa-flowchart/topics/topics.json b/scripts/experiments/gepa-flowchart/topics/topics.json
index 110e513..38a19ae 100644
--- a/scripts/experiments/gepa-flowchart/topics/topics.json
+++ b/scripts/experiments/gepa-flowchart/topics/topics.json
@@ -10,5 +10,13 @@
   {"id": "onboarding", "topic": "New employee onboarding workflow", "audience": "an HR coordinator", "purpose": "track steps from offer to first day"},
   {"id": "state-machine", "topic": "Document approval state machine", "audience": "a product manager", "purpose": "understand statuses and transitions"},
   {"id": "k8s-deploy", "topic": "Kubernetes rolling deployment", "audience": "a platform engineer", "purpose": "understand how a new version rolls out safely"},
-  {"id": "signup-funnel", "topic": "SaaS signup and activation funnel", "audience": "a growth analyst", "purpose": "see where users drop off between signup and activation"}
+  {"id": "signup-funnel", "topic": "SaaS signup and activation funnel", "audience": "a growth analyst", "purpose": "see where users drop off between signup and activation"},
+  {"id": "saga-compensation", "topic": "Distributed transaction (saga) with per-step compensation", "audience": "a backend engineer", "purpose": "understand the commit path and exactly how each step rolls back on failure"},
+  {"id": "oauth-pkce", "topic": "OAuth 2.0 Authorization Code flow with PKCE, token refresh, and revocation", "audience": "a security engineer", "purpose": "understand every redirect, token exchange, and the refresh and revocation paths"},
+  {"id": "raft-election", "topic": "Raft consensus leader election and log replication", "audience": "a distributed-systems engineer", "purpose": "understand term changes, vote conditions, and how entries commit or get rejected"},
+  {"id": "k8s-scheduling", "topic": "Kubernetes pod scheduling: filtering, scoring, binding, and preemption", "audience": "a platform engineer", "purpose": "understand how a pod lands on a node and what happens when none fits"},
+  {"id": "tcp-lifecycle", "topic": "TCP connection lifecycle: handshake, data transfer, teardown, and reset/timeout paths", "audience": "a network engineer", "purpose": "understand each state transition and what triggers RST or timeout"},
+  {"id": "payment-3ds", "topic": "Card payment with 3-D Secure challenge, authorization, capture, and decline/retry", "audience": "a payments engineer", "purpose": "understand the challenge branch and every decline and retry path"},
+  {"id": "blue-green", "topic": "Blue-green deployment with health checks, traffic cutover, and rollback", "audience": "an SRE", "purpose": "understand the cutover decision and the exact rollback trigger and steps"},
+  {"id": "rate-limiter", "topic": "Distributed token-bucket rate limiter with refill, burst, and rejection", "audience": "a backend engineer", "purpose": "understand when a request is allowed, throttled, or rejected and how the bucket refills"}
 ]

From 75aa9153e515ea3e4002f6594a7c1e1110a00ac3 Mon Sep 17 00:00:00 2001
From: Ivan Cheung
Date: Thu, 18 Jun 2026 21:06:16 +0000
Subject: [PATCH 18/24] feat(gepa): multimodal + Playwright validators on the rendered board

Render each board in a real browser (persistent viewer + Chromium service) and add:
- rendered geometry (Playwright DOM): node-pair overlaps, off-canvas nodes, min on-screen font
- visual comprehension + visual quality: one
  Claude-vision call (Opus 4.8 via Vertex) reads the screenshot, answers the frozen
  questions from pixels, and rates legibility/crowding

Metric now blends text+visual comprehension and heuristic+rendered geometry plus visual quality:
total = w_comp*mean(text,visual) + w_geom*mean(heuristic,rendered) + w_vq*visual_quality
(defaults 0.5/0.3/0.2). Verified end-to-end on Vertex (render -> DOM metrics -> vision). 34 tests.
---
 scripts/experiments/gepa-flowchart/README.md | 31 ++++--
 .../gepa-flowchart/gepa_flowchart/adapter.py | 14 ++-
 .../gepa-flowchart/gepa_flowchart/config.py | 16 ++-
 .../gepa-flowchart/gepa_flowchart/llm.py | 32 ++++++
 .../gepa-flowchart/gepa_flowchart/metric.py | 99 +++++++++++++++--
 .../gepa_flowchart/render_bridge.py | 104 +++++++++++++++++
 .../gepa_flowchart/render_service.mjs | 105 ++++++++++++++++++
 .../gepa-flowchart/gepa_flowchart/run.py | 49 ++++----
 .../gepa-flowchart/tests/test_adapter.py | 19 +++-
 .../gepa-flowchart/tests/test_config.py | 3 +-
 .../gepa-flowchart/tests/test_llm.py | 12 ++
 .../gepa-flowchart/tests/test_metric.py | 94 ++++++++++++++--
 .../tests/test_render_bridge.py | 39 +++++++
 13 files changed, 559 insertions(+), 58 deletions(-)
 create mode 100644 scripts/experiments/gepa-flowchart/gepa_flowchart/render_bridge.py
 create mode 100644 scripts/experiments/gepa-flowchart/gepa_flowchart/render_service.mjs
 create mode 100644 scripts/experiments/gepa-flowchart/tests/test_render_bridge.py

diff --git a/scripts/experiments/gepa-flowchart/README.md b/scripts/experiments/gepa-flowchart/README.md
index 47f7eb9..ac989f0 100644
--- a/scripts/experiments/gepa-flowchart/README.md
+++ b/scripts/experiments/gepa-flowchart/README.md
@@ -9,7 +9,8 @@ Optimizes the `brainstorm` + `generate` prompts for first-user comprehension
 
 ```bash
 # Node deps for the geometry bridge (run once)
-npm install  # repo root — gives the viewer its deps
+npm install  # repo root — gives the viewer its deps + playwright
+npm run build --workspace @ivanmkc/termchart-viewer  # build the viewer (needed to RENDER boards)
 cd scripts/experiments/gepa-flowchart
 npm install  # local tsx
 
@@ -45,15 +46,31 @@ python -m gepa_flowchart.run --max-metric-calls 80 --train 8
 Outputs land in `runs//`: `best_prompts.json`, `report.md`,
 `frozen_questions.json`.
 
+## Validators (the metric)
+
+Each board is scored on five signals, gated by structural validity:
+
+- **text comprehension** — a fresh reader-LLM answers the run-frozen reader-questions from the board's structured content; a strict, board-grounded judge scores them.
+- **visual comprehension** — the board is **rendered in a real browser** (viewer + Chromium via a persistent render service) and a **multimodal LLM reads the screenshot**, answering the same questions from the pixels.
+- **visual quality** — the same vision call rates legibility / crowding / overlaps / clipping.
+- **heuristic geometry** — the fast TS `geometryReport` (edges-over-nodes, crossings, density).
+- **rendered geometry (Playwright)** — real-DOM measurements: node-pair overlaps, off-canvas nodes, smallest on-screen font.
+
+`comp = mean(text, visual)`, `geom = mean(heuristic, rendered)`,
+`total = w_comp·comp + w_geom·geom + w_vq·visual_quality`.
+
 ## Config (env)
 
 `GEPA_GEN_MODEL` (default `claude-opus-4-8`; set `claude-sonnet-4-6` to cut cost),
-`GEPA_READER_MODEL`, `GEPA_JUDGE_MODEL`, `GEPA_REFLECTION_MODEL`,
-`GEPA_W_COMP` (0.6), `GEPA_W_GEOM` (0.4), `GEPA_N_QUESTIONS` (5),
-`GEPA_MAX_METRIC_CALLS` (150).
+`GEPA_READER_MODEL`, `GEPA_JUDGE_MODEL`, `GEPA_VISION_MODEL`, `GEPA_REFLECTION_MODEL`,
+`GEPA_W_COMP` (0.5), `GEPA_W_GEOM` (0.3), `GEPA_W_VQ` (0.2), `GEPA_N_QUESTIONS` (7),
+`GEPA_MAX_METRIC_CALLS` (150). Rendered-geometry penalties: `GEPA_RG_OVERLAP_PENALTY`,
+`GEPA_RG_OFFSCREEN_PENALTY`, `GEPA_RG_TINYFONT_PENALTY`, `GEPA_RG_MIN_FONT_PX`.
 
 ## Cost note
 
-A full run does many generation + reader + judge calls per rollout plus
-reflection. Start with `--smoke`. Generation defaults to Opus; switch
-`GEPA_GEN_MODEL=claude-sonnet-4-6` for a cheaper high-volume role.
+Each rollout now does generation + text reader/judge + **a browser render + a
+multimodal vision call**, plus reflection — heavier than a text-only metric.
+Start with `--smoke`. The render service (viewer + one Chromium) starts once per
+run and is reused. Generation defaults to Opus; `GEPA_GEN_MODEL=claude-sonnet-4-6`
+cuts the high-volume role.
diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/adapter.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/adapter.py
index ece9108..bc876fd 100644
--- a/scripts/experiments/gepa-flowchart/gepa_flowchart/adapter.py
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/adapter.py
@@ -4,16 +4,23 @@
 
 from .config import Config
 from .geometry_bridge import validate_flow
-from .llm import complete
+from .llm import complete, complete_vision
 from .metric import score_board
 from .pipeline import run_pipeline
 
 
+def _no_render(_content):
+    return {"ok": False, "error": "no render_fn supplied"}
+
+
 class FlowchartAdapter(GEPAAdapter):
-    def __init__(self, cfg: Config, *, validate_fn=validate_flow, complete_fn=complete):
+    def __init__(self, cfg: Config, *, validate_fn=validate_flow, complete_fn=complete,
+                 render_fn=_no_render, vision_fn=complete_vision):
         self.cfg = cfg
         self.validate_fn = validate_fn
         self.complete_fn = complete_fn
+        self.render_fn = render_fn
+        self.vision_fn = vision_fn
 
     def evaluate(self, batch, candidate, capture_traces=False):
         outputs, scores, trajectories = [], [], [] if capture_traces else None
@@ -23,7 +30,8 @@ def evaluate(self, batch, candidate, capture_traces=False):
                 complete_fn=self.complete_fn, max_tokens=self.cfg.gen_max_tokens,
             )
             result = score_board(
-                content, task, self.cfg, validate_fn=self.validate_fn, complete_fn=self.complete_fn
+                content, task, self.cfg, validate_fn=self.validate_fn, complete_fn=self.complete_fn,
+                render_fn=self.render_fn, vision_fn=self.vision_fn,
             )
             outputs.append(content)
             scores.append(result.score)
diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/config.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/config.py
index ddcc248..77bf4b6 100644
--- a/scripts/experiments/gepa-flowchart/gepa_flowchart/config.py
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/config.py
@@ -10,10 +10,16 @@ class Config:
     reader_model: str = "claude-sonnet-4-6"
     judge_model: str = "claude-opus-4-8"
     reflection_model: str = "claude-opus-4-8"
-    w_comp: float = 0.6
-    w_geom: float = 0.4
+    vision_model: str = "claude-opus-4-8"  # multimodal judge over the rendered screenshot
+    w_comp: float = 0.5  # comprehension = mean(text, visual)
+    w_geom: float = 0.3  # geometry = mean(heuristic, rendered-DOM)
+    w_vq: float = 0.2  # visual quality (legibility/crowding)
     geom_error_penalty: float = 0.34
     geom_warning_penalty: float = 0.08
+    rg_overlap_penalty: float = 0.34  # rendered-DOM: per overlapping node pair
+    rg_offscreen_penalty: float = 0.2  # rendered-DOM: per off-canvas node
+    rg_tinyfont_penalty: float = 0.3  # rendered-DOM: smallest label below rg_min_font_px
+    rg_min_font_px: float = 9.0
     n_questions: int = 7
     max_metric_calls: int = 150
     gen_max_tokens: int = 8000
@@ -36,10 +42,16 @@ def i(key: str, default: int) -> int:
             reader_model=s("GEPA_READER_MODEL", cls.reader_model),
             judge_model=s("GEPA_JUDGE_MODEL", cls.judge_model),
             reflection_model=s("GEPA_REFLECTION_MODEL", cls.reflection_model),
+            vision_model=s("GEPA_VISION_MODEL", cls.vision_model),
             w_comp=f("GEPA_W_COMP", cls.w_comp),
             w_geom=f("GEPA_W_GEOM", cls.w_geom),
+            w_vq=f("GEPA_W_VQ", cls.w_vq),
             geom_error_penalty=f("GEPA_GEOM_ERR_PENALTY", cls.geom_error_penalty),
             geom_warning_penalty=f("GEPA_GEOM_WARN_PENALTY", cls.geom_warning_penalty),
+            rg_overlap_penalty=f("GEPA_RG_OVERLAP_PENALTY", cls.rg_overlap_penalty),
+            rg_offscreen_penalty=f("GEPA_RG_OFFSCREEN_PENALTY", cls.rg_offscreen_penalty),
+            rg_tinyfont_penalty=f("GEPA_RG_TINYFONT_PENALTY", cls.rg_tinyfont_penalty),
+            rg_min_font_px=f("GEPA_RG_MIN_FONT_PX", cls.rg_min_font_px),
             n_questions=i("GEPA_N_QUESTIONS", cls.n_questions),
             max_metric_calls=i("GEPA_MAX_METRIC_CALLS", cls.max_metric_calls),
             gen_max_tokens=i("GEPA_GEN_MAX_TOKENS", cls.gen_max_tokens),
diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/llm.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/llm.py
index 7b021ae..53c688b 100644
--- a/scripts/experiments/gepa-flowchart/gepa_flowchart/llm.py
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/llm.py
@@ -48,6 +48,38 @@ def complete(
     return "".join(b.text for b in resp.content if getattr(b, "type", None) == "text")
 
 
+def complete_vision(
+    system: str,
+    user: str,
+    image_png: bytes,
+    *,
+    model: str,
+    effort: str = "high",
+    max_tokens: int = 2000,
+    client=None,
+) -> str:
+    """Single-turn multimodal call: a PNG image plus a text prompt. Used to judge the RENDERED
+    board (true first-user-experience). Same no-sampling-params rule as complete()."""
+    import base64
+
+    client = client or get_client()
+    b64 = base64.b64encode(image_png).decode()
+    resp = client.messages.create(
+        model=model,
+        max_tokens=max_tokens,
+        system=system,
+        output_config={"effort": effort},
+        messages=[{
+            "role": "user",
+            "content": [
+                {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": b64}},
+                {"type": "text", "text": user},
+            ],
+        }],
+    )
+    return "".join(b.text for b in resp.content if getattr(b, "type", None) == "text")
+
+
 def make_reflection_callable(model: str, *, client=None) -> Callable[[str], str]:
     def reflect(prompt: str) -> str:
         return complete(
diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py
index 4712f11..f6eb0ee 100644
--- a/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py
+++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py
@@ -1,20 +1,23 @@
 from __future__ import annotations
 
 import json
-from dataclasses import dataclass
+from dataclasses import dataclass, field
 
 from .config import Config
 from .dataset import Task, extract_array
-from .render import board_to_text
+from .render import board_to_text, extract_json
+from .render_bridge import rendered_geometry_metrics
 
 
 @dataclass
 class ScoreResult:
     score: float
     feedback: str
-    comp: float
-    geom: float
+    comp: float  # combined comprehension: mean(text, visual)
+    geom: float  # combined geometry: mean(heuristic, rendered-DOM)
+    visual_quality: float
     valid: bool
+    sub: dict = field(default_factory=dict)  # text_comp / visual_comp / heuristic_geom / rendered_geom
 
 
 def geometry_score(findings: list[dict], cfg: Config) -> tuple[float, str]:
@@ -85,18 +88,90 @@ def comprehension_score(board_text: str, questions: list[str], *, reader_model:
     return comp, fb
 
 
-def score_board(content, task: Task, cfg: Config, *, validate_fn, complete_fn) -> ScoreResult:
+def rendered_geometry_score(reduced: dict, cfg: Config) -> tuple[float, str]:
+    """Score the REAL rendered DOM: partial node overlaps, off-viewport nodes, and tiny on-screen
+    text. Complements the heuristic geometryReport with what the browser actually drew."""
+    if reduced.get("n_nodes", 0) == 0:
+        return 0.0, "Rendered geometry: nothing rendered (0 nodes on the board)."
+    overlaps = reduced.get("overlaps", 0)
+    offscreen = reduced.get("offscreen", 0)
+    min_font = reduced.get("min_font_px", 99)
+    tiny = 1 if 0 < min_font < cfg.rg_min_font_px else 0
+    score = 1.0 - cfg.rg_overlap_penalty * overlaps - cfg.rg_offscreen_penalty * offscreen - cfg.rg_tinyfont_penalty * tiny
+    score = max(0.0, min(1.0, score))
+    bits = []
+    if overlaps:
+        bits.append(f"{overlaps} node-pair overlap(s)")
+    if offscreen:
+        bits.append(f"{offscreen} node(s) off the visible canvas")
+    if tiny:
+        bits.append(f"smallest on-screen label ~{min_font}px (below {cfg.rg_min_font_px}px, hard to read)")
+    fb = "Rendered geometry: " + ("; ".join(bits) if bits else "clean (no overlaps, on-screen, legible).")
+    return score, fb
+
+
+_VISUAL_SYSTEM = (
+    "You are a person seeing this flowchart IMAGE for the FIRST TIME. Judge ONLY what is actually "
+    "VISIBLE and LEGIBLE in the image — not what the topic means in general. For each question, "
+    "answer from the image, then score: 1 = the image clearly and specifically shows the answer; "
+    "0.5 = partially shown or too small/cramped to be sure; 0 = not shown or not legible. "
+    "Separately rate overall VISUAL QUALITY from 0 to 1 (1 = clean, well-spaced, legible labels, no "
+    "overlaps, nothing cut off; 0 = cramped, overlapping, tiny or clipped text). "
+    "Output ONLY a JSON object: {\"answers\": [{\"q\", \"score\" (0/0.5/1), \"reason\"}], "
+    "\"visual_quality\": number, \"quality_reason\": string}."
+)
+
+
+def visual_eval(png: bytes, questions: list[str], *, model: str, vision_fn) -> tuple[float, float, str]:
+    """One multimodal call over the rendered image: returns (visual_comprehension, visual_quality,
+    feedback). Fused reader+grader+quality to keep it to a single vision call per rollout."""
+    qlist = "\n".join(f"- {q}" for q in questions)
+    out = vision_fn(_VISUAL_SYSTEM, f"QUESTIONS:\n{qlist}", png, model=model, effort="high", max_tokens=1800)
+    blob = extract_json(out)
+    try:
+        obj = json.loads(blob) if blob else {}
+    except (json.JSONDecodeError, ValueError):
+        obj = {}
+    answers = obj.get("answers") or []
+    if not answers:
+        return 0.0, 0.0, "Visual: vision judge produced no parseable scores."
+    scores = [float(a.get("score", 0.0)) for a in answers]
+    vcomp = sum(scores) / len(scores)
+    vqual = max(0.0, min(1.0, float(obj.get("visual_quality", 0.0))))
+    gaps = [a for a in answers if float(a.get("score", 0.0)) < 1.0]
+    gap_txt = "; ".join(f"'{a.get('q')}' — {a.get('reason')}" for a in gaps[:6])
+    fb = (f"Visual comprehension {vcomp:.2f}, visual quality {vqual:.2f} ({obj.get('quality_reason','')}). "
+          + (f"Not clearly readable from the image: {gap_txt}" if gaps else "All questions readable from the image."))
+    return vcomp, vqual, fb
+
+
+def score_board(content, task: Task, cfg: Config, *, validate_fn, complete_fn, render_fn, vision_fn) -> ScoreResult:
     if not content:
-        return ScoreResult(0.0, "Generation produced no JSON object.", 0.0, 0.0, False)
+        return ScoreResult(0.0, "Generation produced no JSON object.", 0.0, 0.0, 0.0, False)
     report = validate_fn(content)
     if not report.get("valid"):
-        return ScoreResult(0.0, f"Invalid flow spec: {report.get('error')}", 0.0, 0.0, False)
+        return ScoreResult(0.0, f"Invalid flow spec: {report.get('error')}", 0.0, 0.0, 0.0, False)
 
-    geom, geom_fb = geometry_score(report.get("findings", []), cfg)
+    # 1. heuristic geometry (fast TS lint) + text comprehension (structured content)
+    hgeom, hgeom_fb = geometry_score(report.get("findings", []), cfg)
     board_text = board_to_text(json.loads(content))
-    comp, comp_fb = comprehension_score(
+    tcomp, tcomp_fb = comprehension_score(
         board_text, task.questions, reader_model=cfg.reader_model, judge_model=cfg.judge_model, complete_fn=complete_fn
     )
-    total = cfg.w_comp * comp + cfg.w_geom * geom
-    feedback = f"{comp_fb}\n{geom_fb}"
-    return ScoreResult(total, feedback, comp, geom, True)
+
+    # 2. render the board in a real browser → rendered-DOM geometry + multimodal visual judging
+    r = render_fn(content)
+    if r.get("ok") and r.get("png"):
+        rgeom, rgeom_fb = rendered_geometry_score(rendered_geometry_metrics(r.get("metrics", {})), cfg)
+        vcomp, vqual, v_fb = visual_eval(r["png"], task.questions, model=cfg.vision_model, vision_fn=vision_fn)
+    else:
+        # A valid spec that won't render is a real defect — score the visual signals at zero.
+        rgeom, rgeom_fb = 0.0, f"Rendered geometry: render failed ({r.get('error')})."
+        vcomp, vqual, v_fb = 0.0, 0.0, "Visual: the board failed to render in the viewer."
+ + comp = (tcomp + vcomp) / 2 + geom = (hgeom + rgeom) / 2 + total = cfg.w_comp * comp + cfg.w_geom * geom + cfg.w_vq * vqual + feedback = "\n".join([tcomp_fb, v_fb, hgeom_fb, rgeom_fb]) + sub = {"text_comp": tcomp, "visual_comp": vcomp, "heuristic_geom": hgeom, "rendered_geom": rgeom, "visual_quality": vqual} + return ScoreResult(total, feedback, comp, geom, vqual, True, sub) diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/render_bridge.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/render_bridge.py new file mode 100644 index 0000000..5fdcee0 --- /dev/null +++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/render_bridge.py @@ -0,0 +1,104 @@ +from __future__ import annotations + +import base64 +import json +import os +import subprocess +import time +import urllib.request +from pathlib import Path + +_HERE = Path(__file__).resolve().parent +_SERVICE = _HERE / "render_service.mjs" +_REPO = _HERE.parents[3] # gepa_flowchart -> gepa-flowchart -> experiments -> scripts -> repo + + +class RenderService: + """Starts the Node render service (viewer + Chromium) once and renders flow specs to a PNG + + real-DOM metrics. Use as a context manager so the browser/viewer are torn down.""" + + def __init__(self, render_port: int = 8799, viewer_port: int = 8798): + self.render_port = render_port + self.viewer_port = viewer_port + self._proc: subprocess.Popen | None = None + + def start(self, timeout: float = 90.0) -> "RenderService": + env = {**os.environ, "RENDER_PORT": str(self.render_port), "VIEWER_PORT": str(self.viewer_port)} + # Run with the repo root as cwd so `import "playwright"` resolves from the root node_modules. 
+ self._proc = subprocess.Popen( + ["node", str(_SERVICE)], + cwd=str(_REPO), + env=env, + stdout=subprocess.PIPE, + stderr=subprocess.STDOUT, + text=True, + ) + deadline = time.monotonic() + timeout + while time.monotonic() < deadline: + line = self._proc.stdout.readline() if self._proc.stdout else "" + if line.startswith("READY"): + return self + if self._proc.poll() is not None: + raise RuntimeError("render service exited before ready") + raise TimeoutError("render service did not become ready") + + def render(self, flow_content: str, *, timeout: float = 60.0) -> dict: + """Returns {ok, png (bytes) | None, metrics, error?}.""" + req = urllib.request.Request( + f"http://127.0.0.1:{self.render_port}/render", + data=json.dumps({"flow": flow_content}).encode(), + headers={"content-type": "application/json"}, + method="POST", + ) + with urllib.request.urlopen(req, timeout=timeout) as r: + out = json.loads(r.read()) + if out.get("ok") and out.get("png"): + out["png"] = base64.b64decode(out["png"]) + return out + + def stop(self) -> None: + if self._proc and self._proc.poll() is None: + self._proc.terminate() + try: + self._proc.wait(timeout=10) + except subprocess.TimeoutExpired: + self._proc.kill() + + def __enter__(self) -> "RenderService": + return self.start() + + def __exit__(self, *exc) -> None: + self.stop() + + +def rendered_geometry_metrics(metrics: dict) -> dict: + """Reduce raw DOM metrics to defect counts: partial node overlaps (excluding containment, which + is a group enclosing its children), nodes off the diagram viewport, and the smallest on-screen + font size (CSS px times the React Flow zoom).""" + nodes = metrics.get("nodes", []) + zoom = metrics.get("zoom", 1) or 1 + d = metrics.get("diagram", {}) + dx, dy, dw, dh = d.get("x", 0), d.get("y", 0), d.get("w", 1280), d.get("h", 900) + + def contains(a, b): # a fully contains b + return a["x"] <= b["x"] and a["y"] <= b["y"] and a["x"] + a["w"] >= b["x"] + b["w"] and a["y"] + a["h"] >= b["y"] + b["h"] + 
+ def overlaps(a, b): + return a["x"] < b["x"] + b["w"] and a["x"] + a["w"] > b["x"] and a["y"] < b["y"] + b["h"] and a["y"] + a["h"] > b["y"] + + overlap_pairs = 0 + for i in range(len(nodes)): + for j in range(i + 1, len(nodes)): + a, b = nodes[i], nodes[j] + if overlaps(a, b) and not contains(a, b) and not contains(b, a): + overlap_pairs += 1 + + margin = 4 + off = sum( + 1 for n in nodes + if n["x"] < dx - margin or n["y"] < dy - margin + or n["x"] + n["w"] > dx + dw + margin or n["y"] + n["h"] > dy + dh + margin + ) + on_screen_fonts = [n["fs"] * zoom for n in nodes if n.get("fs")] + min_font = min(on_screen_fonts) if on_screen_fonts else 0.0 + return {"overlaps": overlap_pairs, "offscreen": off, "min_font_px": round(min_font, 1), "n_nodes": len(nodes)} diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/render_service.mjs b/scripts/experiments/gepa-flowchart/gepa_flowchart/render_service.mjs new file mode 100644 index 0000000..284a765 --- /dev/null +++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/render_service.mjs @@ -0,0 +1,105 @@ +// Persistent render service for the GEPA harness. Starts the termchart viewer once, launches one +// Chromium, and serves POST /render {flow} -> { png (base64 of the #diagram), metrics } using the +// REAL browser layout. Reused across every rollout (spawning a browser per rollout would be brutal). +// +// Run from anywhere under the worktree (resolves `playwright` from the root node_modules): +// node gepa_flowchart/render_service.mjs # prints "READY " then serves +// Env: RENDER_PORT (default 8799), VIEWER_PORT (default 8798), PUSH_TOKEN (default render-tok). 
+import { chromium } from "playwright"; +import { spawn } from "node:child_process"; +import http from "node:http"; +import { fileURLToPath } from "node:url"; +import { dirname, join } from "node:path"; + +const HERE = dirname(fileURLToPath(import.meta.url)); +const REPO = join(HERE, "..", "..", "..", ".."); // gepa_flowchart -> gepa-flowchart -> experiments -> scripts -> repo +const VIEWER = join(REPO, "packages", "viewer", "dist", "server.js"); + +const RENDER_PORT = Number(process.env.RENDER_PORT ?? 8799); +const VIEWER_PORT = Number(process.env.VIEWER_PORT ?? 8798); +const TOKEN = process.env.PUSH_TOKEN ?? "render-tok"; +const VIEWER_BASE = `http://127.0.0.1:${VIEWER_PORT}`; + +const viewer = spawn("node", [VIEWER], { env: { ...process.env, PORT: String(VIEWER_PORT), PUSH_TOKEN: TOKEN } }); +viewer.stderr.on("data", () => {}); // swallow viewer logs + +const browser = await chromium.launch(); +const page = await browser.newPage({ viewport: { width: 1280, height: 900 } }); + +// Wait for the viewer to accept connections. 
+for (let i = 0; i < 50; i++) { + try { const r = await fetch(`${VIEWER_BASE}/healthz`); if (r.ok) break; } catch { /* not up yet */ } + await new Promise((r) => setTimeout(r, 200)); +} + +let counter = 0; + +async function renderFlow(flowStr) { + const wsid = `r${++counter}`; + const push = await fetch(`${VIEWER_BASE}/w/${wsid}/push`, { + method: "POST", + headers: { "content-type": "application/json", authorization: `Bearer ${TOKEN}` }, + body: JSON.stringify({ project: "gepa", agent: "board", type: "flow", description: "render", content: flowStr }), + }); + if (!push.ok && push.status !== 200 && push.status !== 204) { + return { ok: false, error: `push HTTP ${push.status}: ${(await push.text()).slice(0, 200)}` }; + } + await page.goto(`${VIEWER_BASE}/w/${wsid}/`, { waitUntil: "domcontentloaded" }); + try { + await page.waitForSelector(".react-flow__node", { timeout: 8000 }); + } catch { /* no nodes rendered — still screenshot + report empty */ } + await page.waitForTimeout(900); // let fitView settle + + const metrics = await page.evaluate(() => { + const vp = document.querySelector(".react-flow__viewport"); + let zoom = 1; + if (vp) { + const t = getComputedStyle(vp).transform; + const m = t && t.match(/matrix\(([^)]+)\)/); + if (m) zoom = parseFloat(m[1].split(",")[0]) || 1; + } + const d = document.getElementById("diagram"); + const dr = d ? d.getBoundingClientRect() : { x: 0, y: 0, width: 1280, height: 900 }; + const nodes = [...document.querySelectorAll(".react-flow__node")].map((n) => { + const r = n.getBoundingClientRect(); + return { + id: n.getAttribute("data-id") || "", + cls: n.className || "", + x: r.x, y: r.y, w: r.width, h: r.height, + fs: parseFloat(getComputedStyle(n).fontSize) || 0, + }; + }); + return { zoom, diagram: { x: dr.x, y: dr.y, w: dr.width, h: dr.height }, nodes }; + }); + + const el = await page.$("#diagram"); + const buf = el ? 
await el.screenshot() : await page.screenshot(); + return { ok: true, png: buf.toString("base64"), metrics }; +} + +const server = http.createServer((req, res) => { + if (req.method === "GET" && req.url === "/health") { res.writeHead(200); res.end("ok"); return; } + if (req.method === "POST" && req.url === "/render") { + let body = ""; + req.on("data", (c) => (body += c)); + req.on("end", async () => { + try { + const { flow } = JSON.parse(body); + const out = await renderFlow(flow); + res.writeHead(out.ok ? 200 : 500, { "content-type": "application/json" }); + res.end(JSON.stringify(out)); + } catch (e) { + res.writeHead(500, { "content-type": "application/json" }); + res.end(JSON.stringify({ ok: false, error: String(e).slice(0, 300) })); + } + }); + return; + } + res.writeHead(404); res.end("not found"); +}); + +server.listen(RENDER_PORT, () => console.log(`READY ${RENDER_PORT}`)); + +function shutdown() { try { browser.close(); } catch {} try { viewer.kill(); } catch {} process.exit(0); } +process.on("SIGTERM", shutdown); +process.on("SIGINT", shutdown); diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py index 8d685cf..ec1fd6c 100644 --- a/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py +++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py @@ -11,19 +11,21 @@ from .adapter import FlowchartAdapter from .config import Config from .dataset import freeze_questions, load_topics -from .llm import complete, make_reflection_callable +from .llm import complete, complete_vision, make_reflection_callable from .metric import score_board from .geometry_bridge import validate_flow from .pipeline import SEED_CANDIDATE +from .render_bridge import RenderService from .report import render_report _TOPICS = str(Path(__file__).resolve().parent.parent / "topics" / "topics.json") -def _eval_scores(tasks, candidate, cfg) -> list[float]: +def _eval_scores(tasks, candidate, cfg, *, render_fn) 
-> list[float]: return [ score_board( - run_one(candidate, t, cfg), t, cfg, validate_fn=validate_flow, complete_fn=complete + run_one(candidate, t, cfg), t, cfg, validate_fn=validate_flow, complete_fn=complete, + render_fn=render_fn, vision_fn=complete_vision, ).score for t in tasks ] @@ -55,7 +57,9 @@ def main(argv=None) -> int: train, val = topics[: args.train], topics[args.train :] print(f"[gepa] run_dir={run_dir} budget={budget} train={len(train)} val={len(val)}", file=sys.stderr) - print(f"[gepa] models gen={cfg.gen_model} reader={cfg.reader_model} judge={cfg.judge_model} reflect={cfg.reflection_model}", file=sys.stderr) + print(f"[gepa] models gen={cfg.gen_model} reader={cfg.reader_model} judge={cfg.judge_model} " + f"vision={cfg.vision_model} reflect={cfg.reflection_model}", file=sys.stderr) + print(f"[gepa] weights comp={cfg.w_comp} geom={cfg.w_geom} visual_quality={cfg.w_vq}", file=sys.stderr) # Freeze the reader-questions once for the whole run (fair comparison). all_tasks = freeze_questions(topics, run_dir, model=cfg.judge_model, n=cfg.n_questions, complete_fn=complete) @@ -63,21 +67,28 @@ def main(argv=None) -> int: train = [by_id[t.id] for t in train] val = [by_id[t.id] for t in val] - seed_scores = _eval_scores(val, SEED_CANDIDATE, cfg) - - adapter = FlowchartAdapter(cfg) - result = gepa.optimize( - seed_candidate=SEED_CANDIDATE, - trainset=train, - valset=val, - adapter=adapter, - reflection_lm=make_reflection_callable(cfg.reflection_model), - max_metric_calls=budget, - run_dir=run_dir, - display_progress_bar=True, - ) - best = result.best_candidate - best_scores = _eval_scores(val, best, cfg) + print("[gepa] starting render service (viewer + chromium)…", file=sys.stderr) + service = RenderService().start() + print("[gepa] render service ready.", file=sys.stderr) + try: + render_fn = service.render + seed_scores = _eval_scores(val, SEED_CANDIDATE, cfg, render_fn=render_fn) + + adapter = FlowchartAdapter(cfg, render_fn=render_fn, 
vision_fn=complete_vision) + result = gepa.optimize( + seed_candidate=SEED_CANDIDATE, + trainset=train, + valset=val, + adapter=adapter, + reflection_lm=make_reflection_callable(cfg.reflection_model), + max_metric_calls=budget, + run_dir=run_dir, + display_progress_bar=True, + ) + best = result.best_candidate + best_scores = _eval_scores(val, best, cfg, render_fn=render_fn) + finally: + service.stop() (Path(run_dir) / "best_prompts.json").write_text(json.dumps(best, indent=2)) report = render_report(seed_scores, best_scores, best, [t.id for t in val]) diff --git a/scripts/experiments/gepa-flowchart/tests/test_adapter.py b/scripts/experiments/gepa-flowchart/tests/test_adapter.py index 4caf04c..bf432c4 100644 --- a/scripts/experiments/gepa-flowchart/tests/test_adapter.py +++ b/scripts/experiments/gepa-flowchart/tests/test_adapter.py @@ -24,8 +24,23 @@ def complete_fn(system, user, *, model, **kw): return json.dumps([{"q": "Q1?", "score": 1.0, "supported": True, "reason": "ok"}]) +def render_fn(content): + return {"ok": True, "png": b"png", "metrics": { + "zoom": 1, "diagram": {"x": 0, "y": 0, "w": 1200, "h": 800}, + "nodes": [{"id": "a", "cls": "react-flow__node", "x": 10, "y": 10, "w": 120, "h": 40, "fs": 12}], + }} + + +def vision_fn(system, user, png, *, model, **kw): + return json.dumps({"answers": [{"q": "Q1?", "score": 1, "reason": "ok"}], "visual_quality": 0.9, "quality_reason": "clean"}) + + +def _adapter(): + return FlowchartAdapter(CFG, validate_fn=validate_fn, complete_fn=complete_fn, render_fn=render_fn, vision_fn=vision_fn) + + def test_evaluate_returns_scores_and_traces(): - adapter = FlowchartAdapter(CFG, validate_fn=validate_fn, complete_fn=complete_fn) + adapter = _adapter() batch = [Task(id="x", topic="T", audience="A", purpose="P", questions=["Q1?"])] out = adapter.evaluate(batch, SEED_CANDIDATE, capture_traces=True) assert len(out.scores) == 1 @@ -34,7 +49,7 @@ def test_evaluate_returns_scores_and_traces(): def 
test_make_reflective_dataset_has_requested_components(): - adapter = FlowchartAdapter(CFG, validate_fn=validate_fn, complete_fn=complete_fn) + adapter = _adapter() batch = [Task(id="x", topic="T", audience="A", purpose="P", questions=["Q1?"])] ev = adapter.evaluate(batch, SEED_CANDIDATE, capture_traces=True) refl = adapter.make_reflective_dataset(SEED_CANDIDATE, ev, ["brainstorm", "generate"]) diff --git a/scripts/experiments/gepa-flowchart/tests/test_config.py b/scripts/experiments/gepa-flowchart/tests/test_config.py index 691aaa9..b0571e8 100644 --- a/scripts/experiments/gepa-flowchart/tests/test_config.py +++ b/scripts/experiments/gepa-flowchart/tests/test_config.py @@ -7,7 +7,8 @@ def test_defaults(): assert cfg.reader_model == "claude-sonnet-4-6" assert cfg.judge_model == "claude-opus-4-8" assert cfg.reflection_model == "claude-opus-4-8" - assert cfg.w_comp == 0.6 and cfg.w_geom == 0.4 + assert cfg.w_comp == 0.5 and cfg.w_geom == 0.3 and cfg.w_vq == 0.2 + assert cfg.vision_model == "claude-opus-4-8" def test_env_override(): diff --git a/scripts/experiments/gepa-flowchart/tests/test_llm.py b/scripts/experiments/gepa-flowchart/tests/test_llm.py index b876aee..cfefc63 100644 --- a/scripts/experiments/gepa-flowchart/tests/test_llm.py +++ b/scripts/experiments/gepa-flowchart/tests/test_llm.py @@ -32,3 +32,15 @@ def test_reflection_callable_returns_str(): fake = FakeClient() fn = llm.make_reflection_callable("claude-opus-4-8", client=fake) assert fn("reflect on this") == "hello world" + + +def test_complete_vision_sends_image_block(): + fake = FakeClient() + out = llm.complete_vision("sys", "describe", b"\x89PNGdata", model="claude-opus-4-8", client=fake) + assert out == "hello world" + content = fake.calls["messages"][0]["content"] + kinds = [b["type"] for b in content] + assert "image" in kinds and "text" in kinds + img = next(b for b in content if b["type"] == "image") + assert img["source"]["type"] == "base64" and img["source"]["media_type"] == "image/png" + 
assert "temperature" not in fake.calls diff --git a/scripts/experiments/gepa-flowchart/tests/test_metric.py b/scripts/experiments/gepa-flowchart/tests/test_metric.py index 25a6d68..8dae861 100644 --- a/scripts/experiments/gepa-flowchart/tests/test_metric.py +++ b/scripts/experiments/gepa-flowchart/tests/test_metric.py @@ -2,12 +2,36 @@ from gepa_flowchart.config import Config from gepa_flowchart.dataset import Task -from gepa_flowchart.metric import geometry_score, score_board +from gepa_flowchart.metric import ( + geometry_score, + rendered_geometry_score, + visual_eval, + score_board, +) CFG = Config.from_env({}) TASK = Task(id="x", topic="T", audience="A", purpose="P", questions=["Q1?", "Q2?"]) GOOD_FLOW = json.dumps({"direction": "TB", "nodes": [{"id": "a", "data": {"label": "Start"}}], "edges": []}) +# A clean rendered board: one node, on-screen, legible font. +RENDER_OK = { + "ok": True, + "png": b"\x89PNG-fake", + "metrics": { + "zoom": 1, + "diagram": {"x": 0, "y": 0, "w": 1200, "h": 800}, + "nodes": [{"id": "a", "cls": "react-flow__node", "x": 10, "y": 10, "w": 120, "h": 40, "fs": 12}], + }, +} + + +def _vision_ok(system, user, png, *, model, **kw): + return json.dumps({ + "answers": [{"q": "Q1?", "score": 1, "reason": "shown"}, {"q": "Q2?", "score": 0, "reason": "cramped, unreadable"}], + "visual_quality": 0.8, + "quality_reason": "legible", + }) + def test_geometry_score_penalizes_errors(): clean, _ = geometry_score([], CFG) @@ -16,33 +40,79 @@ def test_geometry_score_penalizes_errors(): assert err < 1.0 and "edge-over-node" in msg +def test_rendered_geometry_score(): + clean, _ = rendered_geometry_score({"overlaps": 0, "offscreen": 0, "min_font_px": 14, "n_nodes": 3}, CFG) + assert clean == 1.0 + bad, msg = rendered_geometry_score({"overlaps": 2, "offscreen": 1, "min_font_px": 5, "n_nodes": 4}, CFG) + assert bad < clean and "overlap" in msg + empty, msg2 = rendered_geometry_score({"overlaps": 0, "offscreen": 0, "min_font_px": 0, "n_nodes": 0}, CFG) + 
assert empty == 0.0 and "nothing rendered" in msg2 + + +def test_visual_eval_parses_object(): + vcomp, vqual, fb = visual_eval(b"png", ["Q1?", "Q2?"], model="m", vision_fn=_vision_ok) + assert vcomp == 0.5 and vqual == 0.8 + assert "cramped" in fb + + +def test_visual_eval_degrades_on_garbage(): + vcomp, vqual, fb = visual_eval(b"png", ["Q1?"], model="m", vision_fn=lambda *a, **k: "no json here") + assert vcomp == 0.0 and vqual == 0.0 and "no parseable" in fb + + def test_invalid_board_scores_zero(): def validate_fn(content): return {"valid": False, "error": "bad json", "findings": [], "warnings": []} - res = score_board("{bad", TASK, CFG, validate_fn=validate_fn, complete_fn=lambda *a, **k: "") + res = score_board( + "{bad", TASK, CFG, validate_fn=validate_fn, complete_fn=lambda *a, **k: "", + render_fn=lambda c: RENDER_OK, vision_fn=_vision_ok, + ) assert res.score == 0.0 and res.valid is False assert "bad json" in res.feedback -def test_missing_context_lowers_comp_and_is_fed_back(): +def test_combined_score_blends_text_visual_geometry_quality(): def validate_fn(content): return {"valid": True, "error": None, "findings": [], "warnings": []} - # reader answers; judge marks Q2 unsupported def complete_fn(system, user, *, model, **kw): - if "first time" in system.lower(): # reader - return json.dumps([{"q": "Q1?", "a": "Start the process"}, {"q": "Q2?", "a": "not shown"}]) - # judge - return json.dumps([ + if "first time" in system.lower(): # text reader + return json.dumps([{"q": "Q1?", "a": "x"}, {"q": "Q2?", "a": "not shown"}]) + return json.dumps([ # text judge {"q": "Q1?", "score": 1.0, "supported": True, "reason": "shown"}, {"q": "Q2?", "score": 0.0, "supported": False, "reason": "board never shows the failure path"}, ]) - res = score_board(GOOD_FLOW, TASK, CFG, validate_fn=validate_fn, complete_fn=complete_fn) - assert 0.0 < res.comp < 1.0 + res = score_board( + GOOD_FLOW, TASK, CFG, validate_fn=validate_fn, complete_fn=complete_fn, + render_fn=lambda c: 
RENDER_OK, vision_fn=_vision_ok, + ) + # text_comp 0.5, visual_comp 0.5 -> comp 0.5; heuristic 1.0, rendered 1.0 -> geom 1.0; vq 0.8 + assert res.comp == 0.5 assert res.geom == 1.0 - assert "failure path" in res.feedback + assert res.visual_quality == 0.8 + assert abs(res.score - (CFG.w_comp * 0.5 + CFG.w_geom * 1.0 + CFG.w_vq * 0.8)) < 1e-9 + assert res.sub["text_comp"] == 0.5 and res.sub["visual_comp"] == 0.5 + assert "failure path" in res.feedback and "cramped" in res.feedback + + +def test_render_failure_zeros_visual_signals(): + def validate_fn(content): + return {"valid": True, "error": None, "findings": [], "warnings": []} + + def complete_fn(system, user, *, model, **kw): + if "first time" in system.lower(): + return json.dumps([{"q": "Q1?", "a": "x"}]) + return json.dumps([{"q": "Q1?", "score": 1.0, "supported": True, "reason": "shown"}]) + + res = score_board( + GOOD_FLOW, Task(id="x", topic="T", audience="A", purpose="P", questions=["Q1?"]), CFG, + validate_fn=validate_fn, complete_fn=complete_fn, + render_fn=lambda c: {"ok": False, "error": "boom"}, vision_fn=_vision_ok, + ) + assert res.sub["visual_comp"] == 0.0 and res.sub["rendered_geom"] == 0.0 and res.visual_quality == 0.0 + assert "failed to render" in res.feedback def test_malformed_judge_degrades_safely(): @@ -51,7 +121,7 @@ def test_malformed_judge_degrades_safely(): def complete_fn(system, user, *, model, **kw): if "first time" in system.lower(): return '[{"q": "Q1?", "a": "x"}]' - return "[{ truncated malformed" # non-empty bracketed but invalid JSON + return "[{ truncated malformed" comp, fb = comprehension_score( "board text", ["Q1?"], reader_model="r", judge_model="j", complete_fn=complete_fn diff --git a/scripts/experiments/gepa-flowchart/tests/test_render_bridge.py b/scripts/experiments/gepa-flowchart/tests/test_render_bridge.py new file mode 100644 index 0000000..22c5d84 --- /dev/null +++ b/scripts/experiments/gepa-flowchart/tests/test_render_bridge.py @@ -0,0 +1,39 @@ +from 
gepa_flowchart.render_bridge import rendered_geometry_metrics + +DIAG = {"x": 0, "y": 0, "w": 1000, "h": 800} + + +def test_clean_layout_no_defects(): + m = {"zoom": 1, "diagram": DIAG, "nodes": [ + {"id": "a", "cls": "react-flow__node", "x": 10, "y": 10, "w": 100, "h": 40, "fs": 12}, + {"id": "b", "cls": "react-flow__node", "x": 10, "y": 200, "w": 100, "h": 40, "fs": 12}, + ]} + r = rendered_geometry_metrics(m) + assert r == {"overlaps": 0, "offscreen": 0, "min_font_px": 12.0, "n_nodes": 2} + + +def test_partial_overlap_counted(): + m = {"zoom": 1, "diagram": DIAG, "nodes": [ + {"id": "a", "cls": "react-flow__node", "x": 10, "y": 10, "w": 100, "h": 40, "fs": 12}, + {"id": "b", "cls": "react-flow__node", "x": 60, "y": 30, "w": 100, "h": 40, "fs": 12}, # overlaps a + ]} + assert rendered_geometry_metrics(m)["overlaps"] == 1 + + +def test_containment_is_not_overlap(): + # a big group node enclosing a child must NOT count as an overlap + m = {"zoom": 1, "diagram": DIAG, "nodes": [ + {"id": "g", "cls": "react-flow__node-group", "x": 0, "y": 0, "w": 400, "h": 300, "fs": 12}, + {"id": "a", "cls": "react-flow__node", "x": 20, "y": 20, "w": 100, "h": 40, "fs": 12}, + ]} + assert rendered_geometry_metrics(m)["overlaps"] == 0 + + +def test_offscreen_and_zoom_scaled_font(): + m = {"zoom": 0.5, "diagram": DIAG, "nodes": [ + {"id": "a", "cls": "react-flow__node", "x": 10, "y": 10, "w": 100, "h": 40, "fs": 12}, + {"id": "b", "cls": "react-flow__node", "x": 1200, "y": 10, "w": 100, "h": 40, "fs": 12}, # off canvas (x>1000) + ]} + r = rendered_geometry_metrics(m) + assert r["offscreen"] == 1 + assert r["min_font_px"] == 6.0 # 12 * 0.5 zoom From 2e411ea3c19f53ce385c25b8de83a0527367aece Mon Sep 17 00:00:00 2001 From: Ivan Cheung Date: Thu, 18 Jun 2026 21:20:15 +0000 Subject: [PATCH 19/24] fix(gepa): freeze reader-questions only for used topics (not all 20) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit A --smoke or small --train run was 
generating questions for all 20 topics before starting; freeze just train∪val so partial runs aren't paying for unused topics. --- scripts/experiments/gepa-flowchart/gepa_flowchart/run.py | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py index ec1fd6c..4bc5958 100644 --- a/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py +++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py @@ -61,9 +61,12 @@ def main(argv=None) -> int: f"vision={cfg.vision_model} reflect={cfg.reflection_model}", file=sys.stderr) print(f"[gepa] weights comp={cfg.w_comp} geom={cfg.w_geom} visual_quality={cfg.w_vq}", file=sys.stderr) - # Freeze the reader-questions once for the whole run (fair comparison). - all_tasks = freeze_questions(topics, run_dir, model=cfg.judge_model, n=cfg.n_questions, complete_fn=complete) - by_id = {t.id: t for t in all_tasks} + # Freeze the reader-questions once for the whole run (fair comparison), only for the topics + # actually used (train ∪ val) — freezing all 20 would waste calls on a small/smoke run. + used_ids = {t.id for t in train} | {t.id for t in val} + used = [t for t in topics if t.id in used_ids] + frozen = freeze_questions(used, run_dir, model=cfg.judge_model, n=cfg.n_questions, complete_fn=complete) + by_id = {t.id: t for t in frozen} train = [by_id[t.id] for t in train] val = [by_id[t.id] for t in val] From 8b1f738ebce21ceae1ab7298bfc9e57e3ef4f92f Mon Sep 17 00:00:00 2001 From: Ivan Cheung Date: Thu, 18 Jun 2026 22:32:50 +0000 Subject: [PATCH 20/24] fix(gepa): tolerant reader-question parsing (raw_decode + retry + fallback) The harder question prompt occasionally makes the model emit a valid JSON array followed by trailing prose, which crashed freeze_questions with JSONDecodeError ('Extra data') at the start of a run. 
Parse the first array via raw_decode (ignore trailing), retry once on a clean miss, and fall back to generic questions so no topic is ever question-less. Regression tests added. --- .../gepa-flowchart/gepa_flowchart/dataset.py | 32 ++++++++++++++++--- .../gepa-flowchart/tests/test_dataset.py | 16 ++++++++++ 2 files changed, 44 insertions(+), 4 deletions(-) diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/dataset.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/dataset.py index 6a5d4ea..e0d9724 100644 --- a/scripts/experiments/gepa-flowchart/gepa_flowchart/dataset.py +++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/dataset.py @@ -29,6 +29,28 @@ def load_topics(path: str) -> list[Task]: ) +_FALLBACK_QUESTIONS = [ + "What are the main steps, in order?", + "What are the decision points and what does each branch lead to?", + "What triggers the process to start?", + "What happens on failure or error at each step?", + "Which steps are automated vs. performed by a person, and who?", +] + + +def parse_str_array(text: str) -> list[str]: + """Tolerantly pull a JSON array of strings from an LLM response. Uses raw_decode from the first + '[' so trailing text after a valid array (a common LLM habit) doesn't cause 'Extra data'.""" + i = text.find("[") + if i == -1: + return [] + try: + val, _ = json.JSONDecoder().raw_decode(text[i:]) + except (json.JSONDecodeError, ValueError): + return [] + return [str(x) for x in val] if isinstance(val, list) else [] + + def generate_questions(task: Task, *, model: str, n: int, complete_fn) -> list[str]: user = ( f"Topic: {task.topic}\nAudience: {task.audience}\nPurpose: {task.purpose}\n\n" @@ -37,10 +59,12 @@ def generate_questions(task: Task, *, model: str, n: int, complete_fn) -> list[s f"order, threshold, or failure path) — not something answerable from general knowledge. " f"JSON array of strings only." 
) - out = complete_fn(_QGEN_SYSTEM, user, model=model, effort="medium", max_tokens=1000) - blob = extract_array(out) - qs = json.loads(blob) if blob else [] - return [str(q) for q in qs][:n] + qs = parse_str_array(complete_fn(_QGEN_SYSTEM, user, model=model, effort="medium", max_tokens=1000)) + if not qs: # one retry on a clean miss, then a generic fallback so no topic is question-less + qs = parse_str_array(complete_fn(_QGEN_SYSTEM, user, model=model, effort="medium", max_tokens=1000)) + if not qs: + qs = _FALLBACK_QUESTIONS + return qs[:n] def extract_array(text: str) -> str | None: diff --git a/scripts/experiments/gepa-flowchart/tests/test_dataset.py b/scripts/experiments/gepa-flowchart/tests/test_dataset.py index 30e1ec4..e6ff056 100644 --- a/scripts/experiments/gepa-flowchart/tests/test_dataset.py +++ b/scripts/experiments/gepa-flowchart/tests/test_dataset.py @@ -37,3 +37,19 @@ def test_freeze_and_reload(tmp_path): assert (tmp_path / "frozen_questions.json").exists() reloaded = load_frozen(load_topics(TOPICS)[:2], str(tmp_path)) assert reloaded[0].questions == frozen[0].questions + + +def test_generate_questions_tolerates_trailing_text(): + """Regression: the model often emits a valid array PLUS trailing prose; raw_decode must take + the array and ignore the rest instead of raising 'Extra data'.""" + def fc(system, user, *, model, **kw): + return '["What triggers it?", "What fails?"]\n\nThose are the key questions.' + qs = generate_questions(Task(id="x", topic="T", audience="A", purpose="P"), model="m", n=5, complete_fn=fc) + assert qs == ["What triggers it?", "What fails?"] + + +def test_generate_questions_falls_back_when_unparseable(): + def fc(system, user, *, model, **kw): + return "Sorry, here are some questions but not as JSON." 
+ qs = generate_questions(Task(id="x", topic="T", audience="A", purpose="P"), model="m", n=3, complete_fn=fc) + assert len(qs) == 3 and all(isinstance(q, str) and q for q in qs) From 830e68d456cd353ef84020c660aeffc4d52ecf47 Mon Sep 17 00:00:00 2001 From: Ivan Cheung Date: Thu, 18 Jun 2026 22:47:38 +0000 Subject: [PATCH 21/24] chore(gepa): log dataset module + parse_str_array at startup MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit A one-line startup diagnostic (which module file is loaded + whether the tolerant parser is present) — caught a stale-bytecode issue where a run executed pre-fix code despite fixed source. --- scripts/experiments/gepa-flowchart/gepa_flowchart/run.py | 3 +++ 1 file changed, 3 insertions(+) diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py index 4bc5958..9e11157 100644 --- a/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py +++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py @@ -45,6 +45,9 @@ def main(argv=None) -> int: p.add_argument("--run-dir", default=None) args = p.parse_args(argv) + from . 
import dataset as _ds + print(f"[gepa] dataset={_ds.__file__} parse_str_array={hasattr(_ds, 'parse_str_array')}", file=sys.stderr) + cfg = Config.from_env({}) budget = args.max_metric_calls or (6 if args.smoke else cfg.max_metric_calls) run_dir = args.run_dir or str(Path("runs") / datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")) From 619d6302bb3367cc41639d30746c1989c213d5dc Mon Sep 17 00:00:00 2001 From: Ivan Cheung Date: Fri, 19 Jun 2026 05:33:39 +0000 Subject: [PATCH 22/24] feat(gepa): weighted-harmonic-mean metric + multi-objective Pareto - Aggregate the three axes (comprehension/geometry/visual_quality) with an eps-floored WEIGHTED HARMONIC MEAN instead of a linear sum: a weak axis can't be bought back by a strong one (anti-compensation), while the eps floor keeps a single 0 from collapsing the score and re-saturating the metric. Within-axis (text+visual, heuristic+rendered) stays arithmetic mean (denoising two estimates of one thing). - Adapter now returns per-objective scores; run uses frontier_type=hybrid so GEPA keeps the Pareto front over both val instances and objectives instead of only the collapsed scalar. Verified on Vertex: base valset 0.415, GEPA iterates with the hybrid frontier, no errors. 37 tests. 
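A rough illustration of the anti-compensation claim, as a standalone sketch: the helper mirrors the `weighted_harmonic_mean` added to metric.py in this patch, the weights are the config defaults (0.5/0.3/0.2), and the axis values are made up for the example rather than measured.

```python
def weighted_harmonic_mean(pairs, *, eps=0.05):
    """Weighted harmonic mean of (weight, value) pairs, each value floored at eps."""
    num = sum(w for w, _ in pairs)
    den = sum(w / max(eps, v) for w, v in pairs)
    return num / den if den else 0.0


def weighted_sum(pairs):
    """The compensatory linear baseline this patch replaces."""
    return sum(w * v for w, v in pairs)


# Illustrative (not measured) axes: strong comprehension and geometry, weak visual quality.
weak_axis = [(0.5, 0.9), (0.3, 0.9), (0.2, 0.1)]
# Same board, but the weak axis is exactly 0; the eps floor treats it as 0.05.
zero_axis = [(0.5, 0.9), (0.3, 0.9), (0.2, 0.0)]

linear = weighted_sum(weak_axis)                 # about 0.74: strong axes mask the weak one
harmonic = weighted_harmonic_mean(weak_axis)     # about 0.35: the weak axis dominates
floored = weighted_harmonic_mean(zero_axis)      # about 0.20: low, but not collapsed to 0

print(f"linear={linear:.3f} harmonic={harmonic:.3f} floored={floored:.3f}")
```

The floored case stays strictly above 0, which is what keeps a render failure on one axis from erasing the signal coming from the other two.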
--- .../gepa-flowchart/gepa_flowchart/adapter.py | 12 +++++++++++- .../gepa-flowchart/gepa_flowchart/config.py | 10 ++++++++++ .../gepa-flowchart/gepa_flowchart/metric.py | 16 ++++++++++++++-- .../gepa-flowchart/gepa_flowchart/run.py | 1 + .../gepa-flowchart/tests/test_adapter.py | 3 +++ .../gepa-flowchart/tests/test_metric.py | 19 ++++++++++++++++++- 6 files changed, 57 insertions(+), 4 deletions(-) diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/adapter.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/adapter.py index bc876fd..1a10f2c 100644 --- a/scripts/experiments/gepa-flowchart/gepa_flowchart/adapter.py +++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/adapter.py @@ -24,6 +24,7 @@ def __init__(self, cfg: Config, *, validate_fn=validate_flow, complete_fn=comple def evaluate(self, batch, candidate, capture_traces=False): outputs, scores, trajectories = [], [], [] if capture_traces else None + objective_scores = [] for task in batch: content, trace = run_pipeline( candidate, task, model=self.cfg.gen_model, @@ -35,9 +36,18 @@ def evaluate(self, batch, candidate, capture_traces=False): ) outputs.append(content) scores.append(result.score) + # Per-objective scores so GEPA can track/Pareto the distinct axes instead of only the + # collapsed scalar (keeps candidates that are best on comprehension vs. layout separate). 
+ objective_scores.append({ + "comprehension": result.comp, + "geometry": result.geom, + "visual_quality": result.visual_quality, + }) if capture_traces: trajectories.append({"task": task, "trace": trace, "result": result, "output": content}) - return EvaluationBatch(outputs=outputs, scores=scores, trajectories=trajectories) + return EvaluationBatch( + outputs=outputs, scores=scores, trajectories=trajectories, objective_scores=objective_scores + ) def make_reflective_dataset(self, candidate, eval_batch, components_to_update): out: dict[str, list[dict]] = {c: [] for c in components_to_update} diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/config.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/config.py index 77bf4b6..ca50a77 100644 --- a/scripts/experiments/gepa-flowchart/gepa_flowchart/config.py +++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/config.py @@ -14,6 +14,14 @@ class Config: w_comp: float = 0.5 # comprehension = mean(text, visual) w_geom: float = 0.3 # geometry = mean(heuristic, rendered-DOM) w_vq: float = 0.2 # visual quality (legibility/crowding) + # Aggregate the three axes with a WEIGHTED HARMONIC MEAN (anti-compensation: a weak axis can't + # be bought back by a strong one). score_epsilon floors each axis so a single 0 doesn't collapse + # the whole score to 0 (which would re-saturate the metric at the bottom and kill the gradient). + score_epsilon: float = 0.05 + # GEPA candidate-frontier type. "hybrid" keeps the Pareto front over BOTH val instances and the + # per-objective scores we hand it (comprehension / geometry / visual_quality), so a candidate + # that's best on one objective or one topic survives. "instance" = original behavior. 
+ frontier_type: str = "hybrid" geom_error_penalty: float = 0.34 geom_warning_penalty: float = 0.08 rg_overlap_penalty: float = 0.34 # rendered-DOM: per overlapping node pair @@ -46,6 +54,8 @@ def i(key: str, default: int) -> int: w_comp=f("GEPA_W_COMP", cls.w_comp), w_geom=f("GEPA_W_GEOM", cls.w_geom), w_vq=f("GEPA_W_VQ", cls.w_vq), + score_epsilon=f("GEPA_SCORE_EPSILON", cls.score_epsilon), + frontier_type=s("GEPA_FRONTIER_TYPE", cls.frontier_type), geom_error_penalty=f("GEPA_GEOM_ERR_PENALTY", cls.geom_error_penalty), geom_warning_penalty=f("GEPA_GEOM_WARN_PENALTY", cls.geom_warning_penalty), rg_overlap_penalty=f("GEPA_RG_OVERLAP_PENALTY", cls.rg_overlap_penalty), diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py index f6eb0ee..df9d819 100644 --- a/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py +++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py @@ -20,6 +20,14 @@ class ScoreResult: sub: dict = field(default_factory=dict) # text_comp / visual_comp / heuristic_geom / rendered_geom +def weighted_harmonic_mean(pairs: list[tuple[float, float]], *, eps: float = 0.05) -> float: + """Weighted harmonic mean of (weight, value) pairs, each value floored at eps. Dominated by the + smallest value → enforces 'good on every axis' (no compensation) without a single 0 nuking it.""" + num = sum(w for w, _ in pairs) + den = sum(w / max(eps, v) for w, v in pairs) + return num / den if den else 0.0 + + def geometry_score(findings: list[dict], cfg: Config) -> tuple[float, str]: errs = [f for f in findings if f.get("severity") == "error"] warns = [f for f in findings if f.get("severity") == "warning"] @@ -169,9 +177,13 @@ def score_board(content, task: Task, cfg: Config, *, validate_fn, complete_fn, r rgeom, rgeom_fb = 0.0, f"Rendered geometry: render failed ({r.get('error')})." vcomp, vqual, v_fb = 0.0, 0.0, "Visual: the board failed to render in the viewer." 
- comp = (tcomp + vcomp) / 2 + comp = (tcomp + vcomp) / 2 # two estimates of the same thing → average to denoise geom = (hgeom + rgeom) / 2 - total = cfg.w_comp * comp + cfg.w_geom * geom + cfg.w_vq * vqual + # Three distinct requirements → weighted harmonic mean (a weak axis drags the whole score down + # and can't be compensated), ε-floored so a single 0 doesn't collapse everything to 0. + total = weighted_harmonic_mean( + [(cfg.w_comp, comp), (cfg.w_geom, geom), (cfg.w_vq, vqual)], eps=cfg.score_epsilon + ) feedback = "\n".join([tcomp_fb, v_fb, hgeom_fb, rgeom_fb]) sub = {"text_comp": tcomp, "visual_comp": vcomp, "heuristic_geom": hgeom, "rendered_geom": rgeom, "visual_quality": vqual} return ScoreResult(total, feedback, comp, geom, vqual, True, sub) diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py index 9e11157..ca17b0a 100644 --- a/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py +++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py @@ -88,6 +88,7 @@ def main(argv=None) -> int: adapter=adapter, reflection_lm=make_reflection_callable(cfg.reflection_model), max_metric_calls=budget, + frontier_type=cfg.frontier_type, run_dir=run_dir, display_progress_bar=True, ) diff --git a/scripts/experiments/gepa-flowchart/tests/test_adapter.py b/scripts/experiments/gepa-flowchart/tests/test_adapter.py index bf432c4..134719e 100644 --- a/scripts/experiments/gepa-flowchart/tests/test_adapter.py +++ b/scripts/experiments/gepa-flowchart/tests/test_adapter.py @@ -46,6 +46,9 @@ def test_evaluate_returns_scores_and_traces(): assert len(out.scores) == 1 assert out.scores[0] > 0.0 assert out.trajectories is not None and len(out.trajectories) == 1 + # per-objective scores handed to GEPA for multi-objective Pareto + assert out.objective_scores is not None and len(out.objective_scores) == 1 + assert set(out.objective_scores[0]) == {"comprehension", "geometry", "visual_quality"} def 
test_make_reflective_dataset_has_requested_components(): diff --git a/scripts/experiments/gepa-flowchart/tests/test_metric.py b/scripts/experiments/gepa-flowchart/tests/test_metric.py index 8dae861..feaa6fa 100644 --- a/scripts/experiments/gepa-flowchart/tests/test_metric.py +++ b/scripts/experiments/gepa-flowchart/tests/test_metric.py @@ -7,6 +7,7 @@ rendered_geometry_score, visual_eval, score_board, + weighted_harmonic_mean, ) CFG = Config.from_env({}) @@ -92,11 +93,27 @@ def complete_fn(system, user, *, model, **kw): assert res.comp == 0.5 assert res.geom == 1.0 assert res.visual_quality == 0.8 - assert abs(res.score - (CFG.w_comp * 0.5 + CFG.w_geom * 1.0 + CFG.w_vq * 0.8)) < 1e-9 + expected = weighted_harmonic_mean( + [(CFG.w_comp, 0.5), (CFG.w_geom, 1.0), (CFG.w_vq, 0.8)], eps=CFG.score_epsilon + ) + assert abs(res.score - expected) < 1e-9 assert res.sub["text_comp"] == 0.5 and res.sub["visual_comp"] == 0.5 assert "failure path" in res.feedback and "cramped" in res.feedback +def test_weighted_harmonic_mean_anti_compensation(): + # equal values -> that value + assert abs(weighted_harmonic_mean([(1, 0.6), (1, 0.6), (1, 0.6)]) - 0.6) < 1e-9 + # one weak axis drags the total well below the arithmetic mean (no buying it back) + vals = [(0.5, 0.9), (0.3, 0.9), (0.2, 0.1)] + arith = sum(w * v for w, v in vals) + hm = weighted_harmonic_mean(vals) + assert hm < arith - 0.1 + # a single 0 is floored, not collapsed to exactly 0 (gradient preserved) + z = weighted_harmonic_mean([(0.5, 0.9), (0.3, 0.9), (0.2, 0.0)], eps=0.05) + assert 0.0 < z < 0.3 + + def test_render_failure_zeros_visual_signals(): def validate_fn(content): return {"valid": True, "error": None, "findings": [], "warnings": []} From 75fc6dcb67b4bd40026b820af20a4ae18cfd80d3 Mon Sep 17 00:00:00 2001 From: Ivan Cheung Date: Fri, 19 Jun 2026 05:43:26 +0000 Subject: [PATCH 23/24] =?UTF-8?q?feat(gepa):=20overnight=20autonomous=20sw?= =?UTF-8?q?eep=20=E2=80=94=20huge=20corpus=20+=20experiments=20+=20cross-e?= 
=?UTF-8?q?val?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - corpus_gen: ~180 diverse use-cases across 20 domains (tolerant parse, dedupe, fallback) - run.py: --topics / --val cap / reuse shared frozen_questions (comparable experiments) - agg toggle (harmonic|linear) for ablation - overnight.sh: robust orchestrator — shared frozen questions, 3 timeout-bounded + failure-isolated experiments (WHM+hybrid opus, linear+instance ablation, sonnet-gen ablation), then crosseval - crosseval: re-score seed + each best on a held-out set under one canonical metric -> SUMMARY.md --- .../gepa-flowchart/gepa_flowchart/config.py | 2 + .../gepa_flowchart/corpus_gen.py | 94 +++++++++++++++++++ .../gepa_flowchart/crosseval.py | 87 +++++++++++++++++ .../gepa-flowchart/gepa_flowchart/metric.py | 9 +- .../gepa-flowchart/gepa_flowchart/run.py | 33 +++++-- .../experiments/gepa-flowchart/overnight.sh | 72 ++++++++++++++ 6 files changed, 286 insertions(+), 11 deletions(-) create mode 100644 scripts/experiments/gepa-flowchart/gepa_flowchart/corpus_gen.py create mode 100644 scripts/experiments/gepa-flowchart/gepa_flowchart/crosseval.py create mode 100755 scripts/experiments/gepa-flowchart/overnight.sh diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/config.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/config.py index ca50a77..6cce42b 100644 --- a/scripts/experiments/gepa-flowchart/gepa_flowchart/config.py +++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/config.py @@ -22,6 +22,7 @@ class Config: # per-objective scores we hand it (comprehension / geometry / visual_quality), so a candidate # that's best on one objective or one topic survives. "instance" = original behavior. 
frontier_type: str = "hybrid" + agg: str = "harmonic" # "harmonic" (weighted HM, anti-compensation) | "linear" (weighted sum, ablation) geom_error_penalty: float = 0.34 geom_warning_penalty: float = 0.08 rg_overlap_penalty: float = 0.34 # rendered-DOM: per overlapping node pair @@ -56,6 +57,7 @@ def i(key: str, default: int) -> int: w_vq=f("GEPA_W_VQ", cls.w_vq), score_epsilon=f("GEPA_SCORE_EPSILON", cls.score_epsilon), frontier_type=s("GEPA_FRONTIER_TYPE", cls.frontier_type), + agg=s("GEPA_AGG", cls.agg), geom_error_penalty=f("GEPA_GEOM_ERR_PENALTY", cls.geom_error_penalty), geom_warning_penalty=f("GEPA_GEOM_WARN_PENALTY", cls.geom_warning_penalty), rg_overlap_penalty=f("GEPA_RG_OVERLAP_PENALTY", cls.rg_overlap_penalty), diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/corpus_gen.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/corpus_gen.py new file mode 100644 index 0000000..a30ac19 --- /dev/null +++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/corpus_gen.py @@ -0,0 +1,94 @@ +"""Generate a large, diverse corpus of flowchart use-cases for the overnight experiments. +Tolerant parsing + dedupe + a guaranteed fallback so a partial LLM failure never produces an +empty corpus. 
Run: python -m gepa_flowchart.corpus_gen [out_path] [per_domain].""" +from __future__ import annotations + +import json +import re +import sys +from pathlib import Path + +from .config import Config +from .llm import complete + +DOMAINS = [ + "software architecture & backend services", "DevOps, CI/CD and release engineering", + "distributed systems and consensus", "networking and protocols", + "data engineering and ML pipelines", "security, authentication and authorization", + "databases, storage and caching", "e-commerce, checkout and payments", + "business operations, approvals and workflows", "customer support and ITSM", + "healthcare and clinical workflows", "finance, trading and risk", + "manufacturing, supply chain and logistics", "scientific and laboratory processes", + "game logic and state machines", "product onboarding and growth funnels", + "compilers, interpreters and language runtimes", "robotics and control loops", + "incident response and on-call", "cloud infrastructure provisioning", +] + +_SYS = ( + "You design diverse flowchart USE-CASES for benchmarking a diagram generator. Each must be a " + "real process/flow worth diagramming — with steps, decisions, branches and failure paths. " + "Output ONLY a JSON array of objects: {\"topic\", \"audience\", \"purpose\"}. Be specific and varied." 
+) + + +def parse_objs(text: str) -> list[dict]: + i = text.find("[") + if i == -1: + return [] + try: + v, _ = json.JSONDecoder().raw_decode(text[i:]) + except (json.JSONDecodeError, ValueError): + return [] + return [o for o in v if isinstance(o, dict)] if isinstance(v, list) else [] + + +def slug(s: str) -> str: + return re.sub(r"[^a-z0-9]+", "-", s.lower()).strip("-")[:48] + + +def main(argv=None) -> int: + argv = argv or sys.argv[1:] + out = argv[0] if argv else "overnight/corpus.json" + per = int(argv[1]) if len(argv) > 1 else 9 + cfg = Config.from_env({}) + Path(out).parent.mkdir(parents=True, exist_ok=True) + + seen: set[str] = set() + rows: list[dict] = [] + for d in DOMAINS: + try: + txt = complete( + _SYS, f"Domain: {d}\nList {per} distinct flowchart use-cases as JSON objects.", + model=cfg.judge_model, effort="medium", max_tokens=2200, + ) + added = 0 + for o in parse_objs(txt): + topic = str(o.get("topic", "")).strip() + sid = slug(topic) + if not topic or not sid or sid in seen: + continue + seen.add(sid) + rows.append({ + "id": sid, "topic": topic, + "audience": str(o.get("audience", "a practitioner")).strip() or "a practitioner", + "purpose": str(o.get("purpose", "understand the process")).strip() or "understand the process", + }) + added += 1 + print(f"[corpus] {d}: +{added} (total {len(rows)})", file=sys.stderr) + except Exception as e: # never let one domain kill the corpus + print(f"[corpus] {d}: FAILED {type(e).__name__}: {str(e)[:120]}", file=sys.stderr) + + # Guaranteed fallback: fold in the built-in 20 so the corpus is never tiny. 
+ base_path = Path(__file__).resolve().parent.parent / "topics" / "topics.json" + for b in json.loads(base_path.read_text()): + if b["id"] not in seen: + rows.append(b) + seen.add(b["id"]) + + Path(out).write_text(json.dumps(rows, indent=1)) + print(f"[corpus] wrote {len(rows)} topics -> {out}") + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/crosseval.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/crosseval.py new file mode 100644 index 0000000..f93ed19 --- /dev/null +++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/crosseval.py @@ -0,0 +1,87 @@ +"""Apples-to-apples comparison: re-score the seed prompt and each experiment's best prompt on a +shared HELD-OUT set (topics unseen during optimization) under ONE canonical metric (WHM + the +default weights), so experiments using different aggregations/frontiers are comparable. Writes +overnight/SUMMARY.md. Fully wrapped — a failure here leaves each run's own report.md intact. + +Usage: python -m gepa_flowchart.crosseval corpus frozen_dir holdout_n [exp_dir...] 
+""" +from __future__ import annotations + +import json +import sys +from pathlib import Path + +from .config import Config +from .dataset import generate_questions, load_frozen, load_topics +from .geometry_bridge import validate_flow +from .llm import complete, complete_vision +from .metric import score_board +from .pipeline import SEED_CANDIDATE, run_pipeline +from .render_bridge import RenderService + + +def _eval(cand, tasks, cfg, render_fn): + rows = [] + for t in tasks: + content, _ = run_pipeline(cand, t, model=cfg.gen_model, complete_fn=complete, max_tokens=cfg.gen_max_tokens) + r = score_board(content, t, cfg, validate_fn=validate_flow, complete_fn=complete, + render_fn=render_fn, vision_fn=complete_vision) + rows.append(r) + n = len(rows) or 1 + return { + "score": sum(r.score for r in rows) / n, + "comp": sum(r.comp for r in rows) / n, + "geom": sum(r.geom for r in rows) / n, + "vq": sum(r.visual_quality for r in rows) / n, + } + + +def main(argv=None) -> int: + argv = argv or sys.argv[1:] + corpus, frozen_dir, holdout_n = argv[0], argv[1], int(argv[2]) + exp_dirs = argv[3:] + cfg = Config.from_env({"GEPA_AGG": "harmonic", "GEPA_FRONTIER_TYPE": "hybrid"}) # canonical metric + + holdout = load_topics(corpus)[-holdout_n:] + holdout = load_frozen(holdout, frozen_dir) # reuse shared questions if present + for t in holdout: + if not t.questions: + t.questions = generate_questions(t, model=cfg.judge_model, n=cfg.n_questions, complete_fn=complete) + + cands = {"seed": SEED_CANDIDATE} + for d in exp_dirs: + p = Path(d) / "best_prompts.json" + if p.exists(): + try: + cands[Path(d).name] = json.loads(p.read_text()) + except Exception as e: + print(f"[crosseval] skip {d}: {e}", file=sys.stderr) + + results = {} + with RenderService() as svc: + for name, c in cands.items(): + try: + results[name] = _eval(c, holdout, cfg, svc.render) + print(f"[crosseval] {name}: {results[name]['score']:.3f}", file=sys.stderr) + except Exception as e: + print(f"[crosseval] {name} 
FAILED: {e}", file=sys.stderr) + + md = [ + "# Overnight experiments — held-out comparison", + "", + f"Canonical metric (WHM, weights {cfg.w_comp}/{cfg.w_geom}/{cfg.w_vq}), " + f"held-out topics ({holdout_n}, unseen during optimization): " + f"{', '.join(t.id for t in holdout)}", + "", + "| candidate | score | comprehension | geometry | visual_quality |", + "|---|---|---|---|---|", + ] + for name, m in sorted(results.items(), key=lambda kv: -kv[1]["score"]): + md.append(f"| {name} | **{m['score']:.3f}** | {m['comp']:.2f} | {m['geom']:.2f} | {m['vq']:.2f} |") + Path("overnight/SUMMARY.md").write_text("\n".join(md) + "\n") + print("[crosseval] wrote overnight/SUMMARY.md") + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py index df9d819..965f88d 100644 --- a/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py +++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/metric.py @@ -181,9 +181,12 @@ def score_board(content, task: Task, cfg: Config, *, validate_fn, complete_fn, r geom = (hgeom + rgeom) / 2 # Three distinct requirements → weighted harmonic mean (a weak axis drags the whole score down # and can't be compensated), ε-floored so a single 0 doesn't collapse everything to 0. - total = weighted_harmonic_mean( - [(cfg.w_comp, comp), (cfg.w_geom, geom), (cfg.w_vq, vqual)], eps=cfg.score_epsilon - ) + # "linear" is the compensatory weighted-sum baseline, kept for ablation. 
+ axes = [(cfg.w_comp, comp), (cfg.w_geom, geom), (cfg.w_vq, vqual)] + if cfg.agg == "linear": + total = sum(w * v for w, v in axes) + else: + total = weighted_harmonic_mean(axes, eps=cfg.score_epsilon) feedback = "\n".join([tcomp_fb, v_fb, hgeom_fb, rgeom_fb]) sub = {"text_comp": tcomp, "visual_comp": vcomp, "heuristic_geom": hgeom, "rendered_geom": rgeom, "visual_quality": vqual} return ScoreResult(total, feedback, comp, geom, vqual, True, sub) diff --git a/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py b/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py index ca17b0a..e12ac99 100644 --- a/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py +++ b/scripts/experiments/gepa-flowchart/gepa_flowchart/run.py @@ -10,7 +10,7 @@ from .adapter import FlowchartAdapter from .config import Config -from .dataset import freeze_questions, load_topics +from .dataset import freeze_questions, generate_questions, load_frozen, load_topics from .llm import complete, complete_vision, make_reflection_callable from .metric import score_board from .geometry_bridge import validate_flow @@ -42,6 +42,8 @@ def main(argv=None) -> int: p.add_argument("--smoke", action="store_true", help="1 topic, tiny budget") p.add_argument("--max-metric-calls", type=int, default=None) p.add_argument("--train", type=int, default=8) + p.add_argument("--val", type=int, default=None, help="cap validation set size (default: all topics after train)") + p.add_argument("--topics", default=None, help="path to a topics.json (default: the built-in 20)") p.add_argument("--run-dir", default=None) args = p.parse_args(argv) @@ -53,22 +55,37 @@ def main(argv=None) -> int: run_dir = args.run_dir or str(Path("runs") / datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")) Path(run_dir).mkdir(parents=True, exist_ok=True) - topics = load_topics(_TOPICS) + topics = load_topics(args.topics or _TOPICS) if args.smoke: train, val = topics[:1], topics[:1] else: - train, val = topics[: args.train], 
topics[args.train :] + train = topics[: args.train] + val = topics[args.train :] + if args.val is not None: + val = val[: args.val] - print(f"[gepa] run_dir={run_dir} budget={budget} train={len(train)} val={len(val)}", file=sys.stderr) + print(f"[gepa] run_dir={run_dir} budget={budget} train={len(train)} val={len(val)} " + f"topics={args.topics or 'builtin'}", file=sys.stderr) print(f"[gepa] models gen={cfg.gen_model} reader={cfg.reader_model} judge={cfg.judge_model} " f"vision={cfg.vision_model} reflect={cfg.reflection_model}", file=sys.stderr) - print(f"[gepa] weights comp={cfg.w_comp} geom={cfg.w_geom} visual_quality={cfg.w_vq}", file=sys.stderr) + print(f"[gepa] agg={cfg.agg} frontier={cfg.frontier_type} weights comp={cfg.w_comp} " + f"geom={cfg.w_geom} vq={cfg.w_vq}", file=sys.stderr) - # Freeze the reader-questions once for the whole run (fair comparison), only for the topics - # actually used (train ∪ val) — freezing all 20 would waste calls on a small/smoke run. + # Reader-questions are frozen once. If the run_dir already carries a frozen_questions.json + # (e.g. the overnight orchestrator pre-froze a shared set so experiments are comparable), reuse + # it; otherwise generate for the used (train ∪ val) topics. 
used_ids = {t.id for t in train} | {t.id for t in val} used = [t for t in topics if t.id in used_ids] - frozen = freeze_questions(used, run_dir, model=cfg.judge_model, n=cfg.n_questions, complete_fn=complete) + if (Path(run_dir) / "frozen_questions.json").exists(): + print("[gepa] reusing existing frozen_questions.json", file=sys.stderr) + frozen = load_frozen(used, run_dir) + # any topic missing from the shared file gets questions generated in-memory (don't rewrite + # the shared file — that would clobber the other topics' frozen questions) + for t in frozen: + if not t.questions: + t.questions = generate_questions(t, model=cfg.judge_model, n=cfg.n_questions, complete_fn=complete) + else: + frozen = freeze_questions(used, run_dir, model=cfg.judge_model, n=cfg.n_questions, complete_fn=complete) by_id = {t.id: t for t in frozen} train = [by_id[t.id] for t in train] val = [by_id[t.id] for t in val] diff --git a/scripts/experiments/gepa-flowchart/overnight.sh b/scripts/experiments/gepa-flowchart/overnight.sh new file mode 100755 index 0000000..0cdde9a --- /dev/null +++ b/scripts/experiments/gepa-flowchart/overnight.sh @@ -0,0 +1,72 @@ +#!/usr/bin/env bash +# Autonomous overnight experiment sweep for the GEPA flowchart optimizer. Robust by design: each +# experiment is timeout-bounded and failure-isolated, so one hang/crash never wastes the night. +# Everything logs to overnight/master.log; final comparison in overnight/SUMMARY.md. +set -u +cd "$(dirname "$0")" +export CLAUDE_CODE_USE_VERTEX=1 ANTHROPIC_VERTEX_PROJECT_ID=adk-coding-agents CLOUD_ML_REGION=global +export PYTHONDONTWRITEBYTECODE=1 +rm -rf gepa_flowchart/__pycache__ tests/__pycache__ 2>/dev/null +mkdir -p overnight +LOG=overnight/master.log +say(){ echo "[$(date -u +%H:%M:%SZ)] $*" | tee -a "$LOG"; } +cleanup(){ pkill -9 -f render_service.mjs 2>/dev/null; pkill -9 -f "packages/viewer/dist/server.js" 2>/dev/null; sleep 2; } + +say "===== OVERNIGHT START =====" + +# 1. 
Corpus (reuse if already generated; else build it). +if [ ! -s overnight/corpus.json ]; then + say "generating corpus..." + timeout 30m python3 -u -m gepa_flowchart.corpus_gen overnight/corpus.json 9 >> "$LOG" 2>&1 || say "corpus_gen returned $?" +fi +N=$(python3 -c "import json;print(len(json.load(open('overnight/corpus.json'))))" 2>/dev/null || echo 0) +say "corpus size = $N" +if [ "$N" -lt 20 ]; then say "FATAL: corpus too small ($N); aborting"; exit 1; fi +VAL=16 +if [ "$N" -gt 116 ]; then TRAIN=100 +elif [ "$N" -gt 48 ]; then TRAIN=$((N-32)) +else TRAIN=$((N/2)); VAL=$((N/4)); fi +say "split: train=$TRAIN val=$VAL (holdout=last 12 for cross-eval)" + +# 2. Pre-freeze ONE shared question set for train+val, so experiments are directly comparable. +say "pre-freezing shared reader-questions for train+val..." +timeout 30m python3 -u - "$TRAIN" "$VAL" <<'PY' >> "$LOG" 2>&1 || say "freeze returned $?" +import sys +from gepa_flowchart.config import Config +from gepa_flowchart.dataset import load_topics, freeze_questions +from gepa_flowchart.llm import complete +tr, va = int(sys.argv[1]), int(sys.argv[2]) +cfg = Config.from_env({}) +used = load_topics("overnight/corpus.json")[: tr + va] +freeze_questions(used, "overnight/frozen", model=cfg.judge_model, n=cfg.n_questions, complete_fn=complete) +print(f"[freeze] froze {len(used)} topics") +PY + +# 3. Experiments. Each: own run-dir seeded with the shared frozen questions; timeout-bounded. +run_exp(){ + local name=$1 tmout=$2 budget=$3 envline=$4 + local rd="overnight/$name" + mkdir -p "$rd" + cp -f overnight/frozen/frozen_questions.json "$rd/frozen_questions.json" 2>/dev/null || true + say "EXP $name START (timeout=$tmout budget=$budget | $envline)" + ( eval "export $envline" + timeout "$tmout" python3 -u -m gepa_flowchart.run \ + --topics overnight/corpus.json --train "$TRAIN" --val "$VAL" \ + --max-metric-calls "$budget" --run-dir "$rd" + ) >> "overnight/$name.log" 2>&1 + local rc=$? 
+ if [ -f "$rd/report.md" ]; then say "EXP $name END rc=$rc report:OK"; else say "EXP $name END rc=$rc report:MISSING"; fi + cleanup +} + +run_exp whm_hybrid_opus 4h 220 "GEPA_AGG=harmonic GEPA_FRONTIER_TYPE=hybrid" +run_exp linear_instance 2h30m 130 "GEPA_AGG=linear GEPA_FRONTIER_TYPE=instance" +run_exp whm_hybrid_sonnet 2h30m 130 "GEPA_AGG=harmonic GEPA_FRONTIER_TYPE=hybrid GEPA_GEN_MODEL=claude-sonnet-4-6" + +# 4. Cross-eval all bests on a held-out set under the canonical metric. +say "cross-eval on held-out set..." +timeout 90m python3 -u -m gepa_flowchart.crosseval overnight/corpus.json overnight/frozen 12 \ + overnight/whm_hybrid_opus overnight/linear_instance overnight/whm_hybrid_sonnet >> "$LOG" 2>&1 \ + || say "cross-eval failed (per-run report.md files are still valid)" +cleanup +say "===== OVERNIGHT DONE =====" From 36afc772aa19e1c5136baa68b028eb45253c1fe6 Mon Sep 17 00:00:00 2001 From: Ivan Cheung Date: Fri, 19 Jun 2026 06:06:29 +0000 Subject: [PATCH 24/24] fix(gepa): overnight cleanup reaps chromium + runs before first experiment Orphaned render services (from kill -9 skipping the graceful handler) hold the fixed render/viewer ports and make the next experiment fail 'render service exited before ready'. cleanup() now also kills stray headless_shell/chrome and runs once at startup. 
--- scripts/experiments/gepa-flowchart/overnight.sh | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/scripts/experiments/gepa-flowchart/overnight.sh b/scripts/experiments/gepa-flowchart/overnight.sh index 0cdde9a..f0c4521 100755 --- a/scripts/experiments/gepa-flowchart/overnight.sh +++ b/scripts/experiments/gepa-flowchart/overnight.sh @@ -10,9 +10,16 @@ rm -rf gepa_flowchart/__pycache__ tests/__pycache__ 2>/dev/null mkdir -p overnight LOG=overnight/master.log say(){ echo "[$(date -u +%H:%M:%SZ)] $*" | tee -a "$LOG"; } -cleanup(){ pkill -9 -f render_service.mjs 2>/dev/null; pkill -9 -f "packages/viewer/dist/server.js" 2>/dev/null; sleep 2; } +cleanup(){ + pkill -9 -f render_service.mjs 2>/dev/null + pkill -9 -f "packages/viewer/dist/server.js" 2>/dev/null + pkill -9 -f "ms-playwright.*headless_shell" 2>/dev/null + pkill -9 -f "ms-playwright.*chrome" 2>/dev/null + sleep 2 +} say "===== OVERNIGHT START =====" +cleanup # clear any pre-existing render/viewer/chromium orphans before we begin # 1. Corpus (reuse if already generated; else build it). if [ ! -s overnight/corpus.json ]; then