TL;DR
GEPA prompt-optimization is a hill-climber, so the scorer is the ceiling and the attack surface — gains are only as real as the verification. An audit of scripts/experiments/gepa-flowchart/gepa_flowchart/unified_metric.py against the LLM-as-judge / graph-readability literature surfaced a set of validity issues, plus a few SGCR blockers to making engine:"sgcr" the default. Several mitigations are already implemented (gated, default-off); the rest are proposed. Extends #195 (SGCR layout gaps).
A. Scoring / verification methodology
Our scorer: one Opus vision call grades comprehension (answer 7 frozen questions from the render), visual_quality (4 anchored dims), and the journey rubric; geometry is deterministic from the DOM; combined by weighted harmonic mean. The defining risks:
A1. Self-preference bias — generator and judge are the same family (Opus) ✅ mitigated
LLM evaluators recognize and favor their own generations (NeurIPS 2024); the standard fix is a different-family judge (PoLL, Verga 2024). We generate and judge with Opus → GEPA partly learns to please Opus's own style.
- Done: cross-family judge wired in (llm.py routes any gemini-* model to Vertex Gemini); set GEPA_VISION_MODEL=gemini-2.5-flash.
- Todo: make a cross-family judge the default for scoring.
A2. Single judge, single sample → high variance, no calibration ✅ partially mitigated
LLM judges are non-deterministic even at low temp, and Anthropic models don't stabilize the way GPT/Gemini do, with no stable seed (Same Input Different Scores, temperature study). We saw it: er-diagram swung 0.62→0.79 across identical setups.
- Done: GEPA_JUDGE_SAMPLES=N averages N vision calls (variance reduction).
- Todo: a small PoLL panel (diverse small judges) for bias-cancellation + variance, ~7× cheaper than one big judge.
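A minimal sketch of the N-sample averaging idea, assuming the judge can be wrapped as a zero-argument callable that returns one score per vision call. The name `averaged_judge_score` is hypothetical and is not taken from unified_metric.py. Reporting the spread next to the mean makes swings like the er-diagram one visible instead of hiding them.

```python
import statistics
from typing import Callable, Tuple


def averaged_judge_score(judge: Callable[[], float], n_samples: int) -> Tuple[float, float]:
    """Call the judge n_samples times; return (mean, sample standard deviation)."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    samples = [judge() for _ in range(n_samples)]
    mean = statistics.fmean(samples)
    # Standard deviation is undefined for a single sample; report 0.0 in that case.
    spread = statistics.stdev(samples) if n_samples > 1 else 0.0
    return mean, spread
```

Averaging reduces variance only; it does not remove a bias that every sample shares, which is why the panel idea in the Todo is a separate item.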
A3. No human / independent calibration — we don't know the ceiling ✅ meta-eval added (no humans)
"Human correlation is not inherent; it must be earned and validated per task" via Spearman/Cohen's κ (LLM-judge vs human). We had zero validation.
- Done: gepa_flowchart/judge_agreement.py scores the same board with Opus vs an independent Gemini judge and reports per-axis Pearson/Spearman/mean-|Δ| (geometry is deterministic → corr≈1 sanity control). This is the no-human proxy for judge reliability.
- Todo: a small human-labeled gold set to anchor the proxy.
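The three agreement statistics named above can be sketched in plain Python as follows. This is an illustration of the statistics, not the contents of judge_agreement.py; the function names are hypothetical, and the inputs are assumed to be two equal-length lists of per-board scores for one axis, one list per judge.

```python
import math
from typing import List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; NaN when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


def mean_abs_delta(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Mean absolute score difference between the two judges."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)
```

High rank correlation with a large mean-|Δ| would suggest the judges order boards similarly but disagree on scale, which matters less for a hill-climber than a low rank correlation does.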
A4. Comprehension leakage — measures topic priors, not legibility ✅ mitigated
Chart-QA work shows many items are solvable by text shortcut, not visual reading, and VLMs struggle with axis text/precise values (Judging the Judges, ACL 2025, ChartMuseum). Our judge can answer the 7 questions from prior knowledge of the topic.
- Done: GEPA_COMP_LIFT=1 re-asks the questions text-only (no image) and credits only the lift the render adds.
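One way to read "credits only the lift" is the difference between accuracy with the render and accuracy without it, floored at zero. The sketch below assumes that reading and assumes per-question correctness is available as 0/1 lists; the actual formula behind GEPA_COMP_LIFT may differ, and `comprehension_lift` is a hypothetical name.

```python
from typing import Sequence


def comprehension_lift(with_image: Sequence[int], text_only: Sequence[int]) -> float:
    """Accuracy gained by seeing the render over answering from priors alone.

    Both inputs are per-question correctness flags (1 = correct, 0 = wrong)
    for the same questions in the same order. Assumed definition, not the
    project's confirmed one.
    """
    if not with_image or len(with_image) != len(text_only):
        raise ValueError("need two non-empty, equal-length answer lists")
    acc_image = sum(with_image) / len(with_image)
    acc_text = sum(text_only) / len(text_only)
    # A render cannot earn negative credit under this definition.
    return max(0.0, acc_image - acc_text)
```

Under this definition a topic the judge already knows well caps the achievable credit: if all 7 questions are answerable text-only, the lift is 0 regardless of how legible the board is.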
A5. Geometry is sound but myopic — missing edge crossings ✅ mitigated
Deterministic geometry is a strength (objective, bias-free), but Purchase's repeated finding is that edge crossings are the #1 readability driver (graph-drawing metrics) — and we only measured overlap/off-canvas/font, not crossings.
- Done: deterministic edge-crossing count + penalty (GEPA_GEOM_CROSSINGS=1, _edge_crossings in unified_metric.py).
- Todo: crossing-angle / angular-resolution (secondary readability drivers).
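A deterministic crossing count can be sketched with the standard orientation test for segment intersection. This is not the _edge_crossings implementation: it assumes each edge is a single straight segment between node centers (rendered edges may be polylines or curves), and the names `count_edge_crossings` and `positions` are hypothetical.

```python
from itertools import combinations
from typing import Dict, Hashable, Sequence, Tuple

Point = Tuple[float, float]


def _orient(p: Point, q: Point, r: Point) -> float:
    """Signed area of triangle pqr: >0 counter-clockwise, <0 clockwise, 0 collinear."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def segments_cross(a: Point, b: Point, c: Point, d: Point) -> bool:
    """True when segment ab and segment cd intersect at a single interior point."""
    return (_orient(a, b, c) * _orient(a, b, d) < 0
            and _orient(c, d, a) * _orient(c, d, b) < 0)


def count_edge_crossings(
    positions: Dict[Hashable, Point],
    edges: Sequence[Tuple[Hashable, Hashable]],
) -> int:
    """Count pairs of edges that properly cross, checking all pairs (O(E^2))."""
    crossings = 0
    for (u1, v1), (u2, v2) in combinations(edges, 2):
        # Edges that share a node meet at that node; that is not a crossing.
        if {u1, v1} & {u2, v2}:
            continue
        if segments_cross(positions[u1], positions[v1], positions[u2], positions[v2]):
            crossings += 1
    return crossings
```

The strict inequalities ignore touching and collinear overlap, so this undercounts in degenerate layouts; for flowchart-sized graphs the quadratic pair check is usually cheap enough.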
A6. Reward overoptimization / Goodhart — no held-out gold ⬜ proposed
Optimizing an imperfect proxy eventually degrades true quality while the proxy rises (Gao 2023); a single same-family judge is the most gameable proxy. We have no held-out reference to detect the proxy↔truth gap.
- Todo: periodically score current best candidates on the A3 cross-judge/gold set; if proxy rises while gold flattens → overoptimization (early-stop signal). The hybrid Pareto frontier already gives partial protection.
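The proposed early-stop signal could look roughly like the check below: compare the gain of the proxy score against the gain of the held-out score over the last few checkpoints. Everything here is an assumption for illustration, including the function name, the window, and both thresholds; none of it exists in the codebase yet.

```python
from typing import Sequence


def overoptimization_signal(
    proxy: Sequence[float],
    gold: Sequence[float],
    window: int = 3,
    min_proxy_gain: float = 0.02,
    max_gold_gain: float = 0.005,
) -> bool:
    """True when the proxy keeps rising while the held-out score has flattened.

    proxy and gold are per-checkpoint scores of the current best candidate,
    oldest first. Thresholds are illustrative defaults, not tuned values.
    """
    if len(proxy) != len(gold):
        raise ValueError("proxy and gold histories must have the same length")
    if len(proxy) <= window:
        return False  # not enough history to compare against
    proxy_gain = proxy[-1] - proxy[-1 - window]
    gold_gain = gold[-1] - gold[-1 - window]
    return proxy_gain >= min_proxy_gain and gold_gain <= max_gold_gain
```

Because the held-out score is itself noisy (see A2), a real version would probably want averaged gold scores or a longer window before stopping a run.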
A7. Coarse scale + no reasoning-before-score ⬜ proposed
0/0.5/1 is low-resolution; CoT-before-score improves calibration and reduces variance (G-Eval / RubricEval). Our VQ/rubric ask for a bare score.
- Todo: request brief per-criterion reasoning before the score; consider a finer scale.
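For reference, the weighted harmonic mean that combines the axes (described at the top of this section) can be sketched as below. The axis names come from the scorer description; the weights, the epsilon floor, and the function name are assumptions, not values from unified_metric.py.

```python
from typing import Mapping


def weighted_harmonic_mean(
    scores: Mapping[str, float],
    weights: Mapping[str, float],
    eps: float = 1e-6,
) -> float:
    """Weighted harmonic mean of per-axis scores in [0, 1].

    eps floors each score so that a zero on one axis yields a tiny total
    rather than a division by zero.
    """
    if not scores:
        raise ValueError("no scores to combine")
    total_weight = sum(weights[axis] for axis in scores)
    denominator = sum(weights[axis] / max(scores[axis], eps) for axis in scores)
    return total_weight / denominator


# Illustrative values only: equal weights, one weak axis.
example_scores = {"comprehension": 0.9, "visual_quality": 0.8, "rubric": 0.7, "geometry": 0.2}
example_weights = {axis: 1.0 for axis in example_scores}
combined = weighted_harmonic_mean(example_scores, example_weights)
```

The harmonic mean is dominated by the weakest axis, so a candidate cannot offset poor geometry by inflating a judged axis. That property limits how much a biased judge axis can raise the total, but not how much it can lower it.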
B. SGCR readiness for engine:"sgcr" by default (extends #195)
B1. The "don't use on grouped/zoned" guidance is now obsolete (auto-fallback is safe)
flow.tsx:424-434: useSgcr = engine==="sgcr" && !grouped; grouped + engine:"sgcr" transparently falls back to dagre and logs sgcr-skipped-grouped. So defaulting engine:"sgcr" is safe for grouped boards (they just use dagre — this is why architecture-zones "isn't using SGCR"). The prompt caveat should change from "don't set it on grouped" → "auto-ignored on grouped." But grouped boards still get no SGCR benefit (grouped-graph support, #195 gap #1, remains unbuilt).
B2. No viewport-fit invariant + no dagre fallback on overflow → blocks safe default ⬜
P3 (check.ts:187) checks against SGCR's own layout.width/height, not the render viewport, and the only fallback is between SGCR modes (ring→routing in flow.tsx), never to dagre. So a large/wide ungrouped graph can pass every invariant yet overflow the screen — the A/B regressions (saga off-canvas 4→13, blue-green 4→8). This is the real blocker to defaulting engine:"sgcr": it helps small/medium ungrouped graphs but can render worse than dagre on large ones, with no safety net.
- Todo (any of): a P10 "fits target viewport after scale" invariant; fall back to dagre when the SGCR layout can't fit; or a node-count guard before choosing SGCR. With one of these, engine:"sgcr" is a safe default.
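The first two options can be illustrated together: compute the uniform scale needed to fit the layout in the viewport, and fall back to dagre when that scale drops below a legibility floor. The real logic would live in flow.tsx / check.ts (TypeScript); this Python sketch only shows the decision rule, and `scale_to_fit`, `choose_engine`, and the `min_scale` threshold are hypothetical.

```python
from typing import Tuple

Size = Tuple[float, float]  # (width, height) in pixels


def scale_to_fit(layout: Size, viewport: Size) -> float:
    """Uniform scale, capped at 1.0, that makes the layout fit inside the viewport."""
    layout_w, layout_h = layout
    viewport_w, viewport_h = viewport
    if layout_w <= 0 or layout_h <= 0:
        raise ValueError("layout dimensions must be positive")
    return min(1.0, viewport_w / layout_w, viewport_h / layout_h)


def choose_engine(
    requested: str,
    grouped: bool,
    sgcr_layout: Size,
    viewport: Size,
    min_scale: float = 0.5,
) -> str:
    """Pick the layout engine, falling back to dagre when SGCR cannot fit legibly."""
    if requested != "sgcr":
        return requested
    if grouped:
        return "dagre"  # mirrors the existing grouped fallback described in B1
    # Assumed P10-style invariant: the layout must fit without shrinking below min_scale.
    if scale_to_fit(sgcr_layout, viewport) < min_scale:
        return "dagre"
    return "sgcr"
```

A scale floor is only one possible fit criterion; a minimum rendered font size would tie the invariant more directly to the font check that geometry already measures.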
Repro / pointers
- gepa_flowchart/unified_metric.py, llm.py; meta-eval: gepa_flowchart/judge_agreement.py; regression gate: gepa_flowchart/topology_regression.py.
- packages/viewer/src/client/renderers/flow.tsx, sgcr/check.ts.