Verification quality bounds GEPA: scoring-methodology limitations + SGCR default-engine readiness #212

## TL;DR

GEPA prompt-optimization is a hill-climber, so **the scorer is the ceiling and the attack surface** — gains are only as real as the verification. An audit of `scripts/experiments/gepa-flowchart/gepa_flowchart/unified_metric.py` against the LLM-as-judge / graph-readability literature surfaced a set of validity issues, plus a few SGCR blockers to making `engine:"sgcr"` the default. Several mitigations are already implemented (gated, default-off); the rest are proposed. Extends #195 (SGCR layout gaps).

---

## A. Scoring / verification methodology

The scorer makes one Opus vision call that grades **comprehension** (answering 7 frozen questions from the render), **visual_quality** (4 anchored dimensions), and the journey **rubric**; **geometry** is computed deterministically from the DOM. The axis scores are combined by weighted harmonic mean. The defining risks:

### A1. Self-preference bias — generator and judge are the same family (Opus) ✅ mitigated
LLM evaluators recognize and favor their own generations ([NeurIPS 2024](https://arxiv.org/html/2410.02736v1)); the standard fix is a **different-family judge** ([PoLL, Verga 2024](https://arxiv.org/abs/2404.18796)). We generate *and* judge with Opus → GEPA partly learns to please Opus's own style.
- **Done:** cross-family judge wired in (`llm.py` routes any `gemini-*` model to Vertex Gemini); set `GEPA_VISION_MODEL=gemini-2.5-flash`.
- **Todo:** make a cross-family judge the default for scoring.

### A2. Single judge, single sample → high variance, no calibration ✅ partially mitigated
LLM judges are non-deterministic even at low temperature, and **Anthropic models don't stabilize** the way GPT/Gemini do, with no stable seed ([Same Input Different Scores](https://arxiv.org/abs/2603.04417), [temperature study](https://arxiv.org/html/2603.28304v1)). We saw this directly: the er-diagram score swung 0.62→0.79 across identical setups.
- **Done:** `GEPA_JUDGE_SAMPLES=N` averages N vision calls (variance reduction).
- **Todo:** a small **PoLL panel** (diverse small judges) for bias-cancellation + variance, ~7× cheaper than one big judge.
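
A minimal sketch of the sample-averaging idea behind `GEPA_JUDGE_SAMPLES=N`, assuming the judge can be wrapped as a zero-argument callable that returns one score per call. The function name `averaged_score` and the returned spread are illustrative, not the code in `unified_metric.py`.

```python
import statistics
from typing import Callable


def averaged_score(judge: Callable[[], float], n_samples: int) -> tuple[float, float]:
    """Call a noisy judge n_samples times; return (mean, sample std dev).

    `judge` is a hypothetical wrapper around one vision call.
    """
    if n_samples < 1:
        raise ValueError("n_samples must be >= 1")
    scores = [judge() for _ in range(n_samples)]
    mean = statistics.fmean(scores)
    # The sample standard deviation needs at least two points.
    spread = statistics.stdev(scores) if n_samples > 1 else 0.0
    return mean, spread
```

Reporting the spread next to the mean makes it visible when a score difference between two candidates is smaller than the judge's own noise. If the samples are independent, the standard error of the mean shrinks roughly as 1/sqrt(N).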

### A3. No human / independent calibration — we don't know the ceiling ✅ meta-eval added (no humans)
"Human correlation is not inherent; it must be earned and validated per task" via Spearman/Cohen's κ ([LLM-judge vs human](https://galileo.ai/blog/llm-as-a-judge-vs-human-evaluation)). We had zero validation.
- **Done:** `gepa_flowchart/judge_agreement.py` — scores the same board with Opus vs an independent Gemini judge, reports per-axis Pearson/Spearman/mean-|Δ| (geometry is deterministic → corr≈1 sanity control). This is the no-human proxy for judge reliability.
- **Todo:** a small human-labeled gold set to anchor the proxy.
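
The per-axis agreement statistics can be sketched with the standard library alone. This is an illustration of Pearson, Spearman, and mean-|Δ| over paired scores from two judges, not the contents of `judge_agreement.py`; all function names here are assumptions.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; NaN when either side has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


def ranks(values: Sequence[float]) -> list[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman correlation: Pearson on the ranks."""
    return pearson(ranks(xs), ranks(ys))


def mean_abs_delta(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Average absolute disagreement between the two judges."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)
```

Spearman and mean-|Δ| answer different questions: two judges can rank boards identically while one scores systematically lower, which shows up only in mean-|Δ|.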

### A4. Comprehension leakage — measures topic priors, not legibility ✅ mitigated
Chart-QA work shows many items are solvable by **text shortcut, not visual reading**, and VLMs struggle with axis text/precise values ([Judging the Judges, ACL 2025](https://arxiv.org/abs/2505.08468), [ChartMuseum](https://arxiv.org/pdf/2505.13444)). Our judge can answer the 7 questions from prior knowledge of the topic.
- **Done:** `GEPA_COMP_LIFT=1` re-asks the questions **text-only** (no image) and credits only the lift the render adds.
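
One way to read "credit only the lift" is per question: subtract the text-only score from the with-image score and floor at zero. The sketch below assumes per-question scores in [0, 1] for both passes; the exact formula used under `GEPA_COMP_LIFT=1` may differ, and `comprehension_lift` is a hypothetical name.

```python
def comprehension_lift(with_image: list[float], text_only: list[float]) -> float:
    """Mean per-question credit for what the render adds over topic priors.

    A question already answerable without the image contributes zero.
    """
    if not with_image or len(with_image) != len(text_only):
        raise ValueError("need two equal-length, non-empty score lists")
    lifts = [max(0.0, w - t) for w, t in zip(with_image, text_only)]
    return sum(lifts) / len(lifts)
```

With this definition a board whose 7 questions are all answerable from priors scores zero comprehension regardless of render quality, so heavily leaked question sets may need replacing rather than just discounting.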

### A5. Geometry is sound but myopic — missing edge crossings ✅ mitigated
Deterministic geometry is a strength (objective, bias-free), but Purchase's repeated finding is that **edge crossings are the #1 readability driver** ([graph-drawing metrics](https://www2.cs.arizona.edu/~kobourov/gd-metrics2024.pdf)) — and we only measured overlap/off-canvas/font, not crossings.
- **Done:** deterministic edge-crossing count + penalty (`GEPA_GEOM_CROSSINGS=1`, `_edge_crossings` in `unified_metric.py`).
- **Todo:** crossing-angle / angular-resolution (secondary readability drivers).
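
A deterministic crossing count can be sketched with the standard orientation test over every pair of edges. This assumes each edge is a single straight segment between two points; real rendered edges may be polylines or curves, and this is not the `_edge_crossings` implementation in `unified_metric.py`.

```python
from itertools import combinations

Point = tuple[float, float]
Segment = tuple[Point, Point]


def _orient(a: Point, b: Point, c: Point) -> float:
    """Signed area of triangle abc: >0 left turn, <0 right turn, 0 collinear."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def segments_cross(s1: Segment, s2: Segment) -> bool:
    """True when the segments intersect at one point interior to both."""
    (a, b), (c, d) = s1, s2
    if {a, b} & {c, d}:
        return False  # edges that meet at a shared node are not crossings
    d1, d2 = _orient(a, b, c), _orient(a, b, d)
    d3, d4 = _orient(c, d, a), _orient(c, d, b)
    # Strict sign changes on both sides exclude touching and collinear cases.
    return d1 * d2 < 0 and d3 * d4 < 0


def count_crossings(edges: list[Segment]) -> int:
    """Number of crossing edge pairs, O(E^2) pairwise comparison."""
    return sum(segments_cross(e1, e2) for e1, e2 in combinations(edges, 2))
```

The pairwise loop is quadratic in the edge count, which is likely acceptable at flowchart sizes; a sweep-line approach would be the usual replacement for much larger graphs.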

### A6. Reward overoptimization / Goodhart — no held-out gold ⬜ proposed
Optimizing an imperfect proxy eventually **degrades true quality while the proxy rises** ([Gao 2023](https://arxiv.org/abs/2210.10760)); a single same-family judge is the most gameable proxy. We have no held-out reference to detect the proxy↔truth gap.
- **Todo:** periodically score current best candidates on the A3 cross-judge/gold set; if proxy rises while gold flattens → overoptimization (early-stop signal). The hybrid Pareto frontier already gives partial protection.
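
The early-stop check could look like the sketch below, assuming the proxy score and the held-out (cross-judge or gold) score are recorded at the same checkpoints. The window size and both thresholds are placeholder values, and `overoptimization_signal` is a hypothetical name.

```python
def overoptimization_signal(
    proxy: list[float],
    gold: list[float],
    window: int = 3,
    min_proxy_gain: float = 0.02,
    max_gold_gain: float = 0.0,
) -> bool:
    """True when the proxy rose over the last `window` checkpoints but gold did not."""
    if len(proxy) != len(gold):
        raise ValueError("proxy and gold must cover the same checkpoints")
    if len(proxy) <= window:
        return False  # not enough history to compare
    proxy_gain = proxy[-1] - proxy[-1 - window]
    gold_gain = gold[-1] - gold[-1 - window]
    return proxy_gain >= min_proxy_gain and gold_gain <= max_gold_gain
```

Because the gold side is itself noisy (see A2), comparing endpoints over a window rather than consecutive checkpoints reduces false alarms; averaging several gold samples per checkpoint would reduce them further.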

### A7. Coarse scale + no reasoning-before-score ⬜ proposed
The 0/0.5/1 scale is low-resolution, and chain-of-thought before the score improves calibration and reduces variance (G-Eval / RubricEval). Our visual_quality and rubric prompts ask for a bare score.
- **Todo:** request brief per-criterion reasoning before the score; consider a finer scale.
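
A sketch of what reasoning-before-score might look like at the prompt and parsing level. The five-point scale, the JSON reply shape, and the names `build_prompt` and `parse_reply` are all assumptions for illustration; none of them come from the current scorer.

```python
import json

SCALE = (0.0, 0.25, 0.5, 0.75, 1.0)  # hypothetical finer scale


def build_prompt(criteria: list[str]) -> str:
    """Ask for reasoning first, then a score, per criterion."""
    lines = [
        "For each criterion, first write one or two sentences of reasoning,",
        "then give a score from this scale: " + ", ".join(str(s) for s in SCALE) + ".",
        'Reply with JSON only: [{"criterion": ..., "reasoning": ..., "score": ...}]',
        "Criteria:",
    ]
    lines += [f"- {c}" for c in criteria]
    return "\n".join(lines)


def parse_reply(reply: str, criteria: list[str]) -> dict[str, float]:
    """Validate that every criterion has non-empty reasoning and an on-scale score."""
    scores: dict[str, float] = {}
    for item in json.loads(reply):
        if not str(item.get("reasoning", "")).strip():
            raise ValueError(f"missing reasoning for {item.get('criterion')!r}")
        score = float(item["score"])
        if score not in SCALE:
            raise ValueError(f"off-scale score {score}")
        scores[item["criterion"]] = score
    missing = set(criteria) - set(scores)
    if missing:
        raise ValueError(f"criteria not scored: {sorted(missing)}")
    return scores
```

Placing the `reasoning` key before `score` in the requested JSON matters for an autoregressive judge: the score is then generated after the reasoning rather than justified after the fact.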

---

## B. SGCR readiness for `engine:"sgcr"` by default (extends #195)

### B1. The "don't use on grouped/zoned" guidance is now obsolete (auto-fallback is safe)
`flow.tsx:424-434`: `useSgcr = engine==="sgcr" && !grouped`; grouped + `engine:"sgcr"` **transparently falls back to dagre** and logs `sgcr-skipped-grouped`. So defaulting `engine:"sgcr"` is **safe for grouped boards** (they just use dagre — this is why `architecture-zones` "isn't using SGCR"). The prompt caveat should change from "don't set it on grouped" → "auto-ignored on grouped." But grouped boards still get **no** SGCR benefit (grouped-graph support, #195 gap #1, remains unbuilt).

### B2. No viewport-fit invariant + no dagre fallback on overflow → blocks safe default ⬜
P3 (`check.ts:187`) checks against SGCR's **own** `layout.width/height`, not the render viewport, and the only fallback is between SGCR modes (ring→routing in `flow.tsx`), **never to dagre**. So a large/wide ungrouped graph can pass every invariant yet overflow the screen, which is what the A/B regressions show (saga off-canvas 4→13, blue-green 4→8). **This is the real blocker to defaulting `engine:"sgcr"`:** it helps small/medium ungrouped graphs but can render *worse* than dagre on large ones, with no safety net.
- **Todo (any of):** a P10 "fits target viewport after scale" invariant; **fall back to dagre when the SGCR layout can't fit**; or a node-count guard before choosing SGCR. With one of these, `engine:"sgcr"` is a safe default.
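
The three options can be combined into one engine-selection guard. The sketch below is language-neutral logic written in Python for consistency with the scorer; the real change would live in `flow.tsx`. The `min_scale` and `max_nodes` thresholds, the `Size` type, and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Size:
    width: float
    height: float


def fit_scale(layout: Size, viewport: Size) -> float:
    """Largest uniform scale, capped at 1.0, at which the layout fits the viewport."""
    return min(1.0, viewport.width / layout.width, viewport.height / layout.height)


def choose_engine(
    requested: str,
    grouped: bool,
    node_count: int,
    sgcr_layout: Size,
    viewport: Size,
    min_scale: float = 0.5,  # placeholder: below this, text is assumed illegible
    max_nodes: int = 40,     # placeholder node-count guard
) -> str:
    """Return "sgcr" only when it is requested and safe; otherwise "dagre"."""
    if requested != "sgcr" or grouped:
        return "dagre"  # mirrors the existing grouped fallback (B1)
    if node_count > max_nodes:
        return "dagre"
    if fit_scale(sgcr_layout, viewport) < min_scale:
        return "dagre"  # the SGCR layout would overflow or shrink too far
    return "sgcr"
```

Any layout fits a viewport if scaled down far enough, so a "fits after scale" invariant only has teeth when paired with a minimum acceptable scale; that is the role of `min_scale` here.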

---

## Repro / pointers
- Scorer + gated mitigations: `gepa_flowchart/unified_metric.py`, `llm.py`; meta-eval: `gepa_flowchart/judge_agreement.py`; regression gate: `gepa_flowchart/topology_regression.py`.
- SGCR: `packages/viewer/src/client/renderers/flow.tsx`, `sgcr/check.ts`.
- Related: #195 (SGCR grouped no-op + viewport-fit).

