Verification quality bounds GEPA: scoring-methodology limitations + SGCR default-engine readiness #212

@ivanmkc

TL;DR

GEPA prompt-optimization is a hill-climber, so the scorer is the ceiling and the attack surface — gains are only as real as the verification. An audit of scripts/experiments/gepa-flowchart/gepa_flowchart/unified_metric.py against the LLM-as-judge / graph-readability literature surfaced a set of validity issues, plus a few SGCR blockers to making engine:"sgcr" the default. Several mitigations are already implemented (gated, default-off); the rest are proposed. Extends #195 (SGCR layout gaps).


A. Scoring / verification methodology

Our scorer: one Opus vision call grades comprehension (answer 7 frozen questions from the render), visual_quality (4 anchored dims), and the journey rubric; geometry is deterministic from the DOM; combined by weighted harmonic mean. The defining risks:
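For reference, a minimal sketch of the weighted harmonic mean combination. The axis names and weights here are illustrative assumptions, not the values in unified_metric.py; the point is that a weak axis drags the combined score down far more than it would under an arithmetic mean.

```python
def weighted_harmonic_mean(scores, weights, eps=1e-6):
    """Combine per-axis scores in [0, 1] into one number.

    scores / weights: dicts keyed by axis name (names are hypothetical).
    eps guards against division by zero when an axis scores exactly 0.
    """
    total_weight = sum(weights[axis] for axis in scores)
    denominator = sum(weights[axis] / max(scores[axis], eps) for axis in scores)
    return total_weight / denominator


# Illustrative axes only; equal weights are an assumption.
example_scores = {"comprehension": 1.0, "visual_quality": 0.5}
example_weights = {"comprehension": 1.0, "visual_quality": 1.0}
combined = weighted_harmonic_mean(example_scores, example_weights)
```

With these inputs the arithmetic mean would be 0.75, while the harmonic mean is 2/3, so a gain on one axis cannot fully mask a loss on another.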

A1. Self-preference bias — generator and judge are the same family (Opus) ✅ mitigated

LLM evaluators recognize and favor their own generations (NeurIPS 2024); the standard fix is a different-family judge (PoLL, Verga 2024). We generate and judge with Opus → GEPA partly learns to please Opus's own style.

  • Done: cross-family judge wired in (llm.py routes any gemini-* model to Vertex Gemini); set GEPA_VISION_MODEL=gemini-2.5-flash.
  • Todo: make a cross-family judge the default for scoring.
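One way to enforce the default is a guard that refuses to score when generator and judge share a family. This is a sketch only: the prefix table is an assumption (only the gemini-* routing is confirmed in llm.py), and the generator model id below is a placeholder.

```python
# Hypothetical prefix -> family table; only "gemini-" routing is known from llm.py.
_FAMILY_PREFIXES = (
    ("gemini-", "google"),
    ("claude-", "anthropic"),
    ("gpt-", "openai"),
)


def model_family(model: str) -> str:
    """Map a model id to a vendor family by prefix; 'unknown' if unrecognized."""
    name = model.lower()
    for prefix, family in _FAMILY_PREFIXES:
        if name.startswith(prefix):
            return family
    return "unknown"


def is_cross_family(generator: str, judge: str) -> bool:
    """True only when both families are recognized and differ."""
    gen_family, judge_family = model_family(generator), model_family(judge)
    return "unknown" not in (gen_family, judge_family) and gen_family != judge_family


# "claude-opus-example" is a placeholder id, not a real model name.
ok = is_cross_family("claude-opus-example", "gemini-2.5-flash")
```

Treating unrecognized ids as not cross-family is a deliberately conservative choice: an unknown judge should not silently count as independent.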

A2. Single judge, single sample → high variance, no calibration ✅ partially mitigated

LLM judges are non-deterministic even at low temperature, and Anthropic models, which expose no stable seed, don't stabilize the way GPT/Gemini do (Same Input Different Scores, temperature study). We saw this directly: er-diagram swung 0.62→0.79 across identical setups.

  • Done: GEPA_JUDGE_SAMPLES=N averages N vision calls (variance reduction).
  • Todo: a small PoLL panel (diverse small judges) for bias-cancellation + variance, ~7× cheaper than one big judge.
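A sketch of what averaging N judge samples buys, assuming each vision call returns one score in [0, 1]. The function name and return shape are hypothetical; reporting the spread alongside the mean makes swings like the er-diagram one visible instead of hidden.

```python
import statistics


def aggregate_judge_samples(samples):
    """Average repeated judge scores and report their spread.

    The standard error shrinks roughly as 1/sqrt(N) if samples are independent,
    which is the variance-reduction argument for GEPA_JUDGE_SAMPLES=N.
    """
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one judge sample")
    mean = statistics.fmean(samples)
    stdev = statistics.stdev(samples) if n > 1 else 0.0
    return {"mean": mean, "stdev": stdev, "stderr": stdev / n ** 0.5, "n": n}


# The two er-diagram scores quoted above, treated as two samples.
summary = aggregate_judge_samples([0.62, 0.79])
```

Averaging reduces variance but not shared bias, which is why the PoLL panel remains a separate todo.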

A3. No human / independent calibration — we don't know the ceiling ✅ meta-eval added (no humans)

"Human correlation is not inherent; it must be earned and validated per task" via Spearman/Cohen's κ (LLM-judge vs human). We had zero validation.

  • Done: gepa_flowchart/judge_agreement.py — scores the same board with Opus vs an independent Gemini judge, reports per-axis Pearson/Spearman/mean-|Δ| (geometry is deterministic → corr≈1 sanity control). This is the no-human proxy for judge reliability.
  • Todo: a small human-labeled gold set to anchor the proxy.
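The per-axis agreement statistics can be sketched in plain Python as below. This is not the code in judge_agreement.py; it only illustrates the three numbers, with Spearman computed as Pearson on average ranks so that ties (common on a coarse scale) are handled.

```python
import math


def pearson(x, y):
    """Pearson correlation; NaN when either side has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation as Pearson over average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_abs_delta(x, y):
    """Mean absolute difference between two judges' scores."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)
```

High rank correlation with a large mean-|Δ| would indicate the judges order boards the same way but disagree on scale, which matters for thresholds but less for hill-climbing.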

A4. Comprehension leakage — measures topic priors, not legibility ✅ mitigated

Chart-QA work shows many items are solvable by text shortcut, not visual reading, and VLMs struggle with axis text/precise values (Judging the Judges, ACL 2025, ChartMuseum). Our judge can answer the 7 questions from prior knowledge of the topic.

  • Done: GEPA_COMP_LIFT=1 re-asks the questions text-only (no image) and credits only the lift the render adds.
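A sketch of the lift idea, assuming per-question correctness is recorded as booleans for both passes. The exact formula behind GEPA_COMP_LIFT is not shown here; a plain difference clamped at zero is an assumption.

```python
def comprehension_lift(with_image, text_only):
    """Credit only what the render adds over topic priors.

    with_image / text_only: per-question correctness (bool) for the same
    frozen questions, asked with and without the image.
    """
    if len(with_image) != len(text_only):
        raise ValueError("both passes must cover the same questions")
    n = len(with_image)
    acc_with_image = sum(with_image) / n
    acc_text_only = sum(text_only) / n
    # Clamp at zero so a noisy text-only pass cannot produce negative credit.
    return max(0.0, acc_with_image - acc_text_only)


# 7 questions: 6 correct with the render, 4 answerable from priors alone.
lift = comprehension_lift([True] * 6 + [False], [True] * 4 + [False] * 3)
```

Here raw comprehension would be 6/7, but only 2/7 is attributable to the render.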

A5. Geometry is sound but myopic — missing edge crossings ✅ mitigated

Deterministic geometry is a strength (objective, bias-free), but Purchase's repeated finding is that edge crossings are the #1 readability driver (graph-drawing metrics) — and we only measured overlap/off-canvas/font, not crossings.

  • Done: deterministic edge-crossing count + penalty (GEPA_GEOM_CROSSINGS=1, _edge_crossings in unified_metric.py).
  • Todo: crossing-angle / angular-resolution (secondary readability drivers).
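For illustration, a crossing count over straight-line edges using orientation tests. This is not the _edge_crossings implementation; it assumes each edge is a single segment between two (x, y) points, whereas routed edges with bends would need to be split into segments first.

```python
from itertools import combinations


def _orientation(p, q, r):
    """Sign of the cross product (q - p) x (r - p): 1, -1, or 0 if collinear."""
    value = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (value > 0) - (value < 0)


def segments_cross(a, b, c, d):
    """True when segments ab and cd cross at a single interior point.

    Strict inequalities mean shared endpoints and collinear touching
    do not count as crossings.
    """
    return (
        _orientation(a, b, c) * _orientation(a, b, d) < 0
        and _orientation(c, d, a) * _orientation(c, d, b) < 0
    )


def count_edge_crossings(edges):
    """edges: list of ((x1, y1), (x2, y2)) straight segments."""
    return sum(
        1
        for (a, b), (c, d) in combinations(edges, 2)
        if segments_cross(a, b, c, d)
    )
```

The pairwise loop is O(E²), which is acceptable for board-sized graphs; a sweep-line approach would only matter at much larger edge counts.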

A6. Reward overoptimization / Goodhart — no held-out gold ⬜ proposed

Optimizing an imperfect proxy eventually degrades true quality while the proxy rises (Gao 2023); a single same-family judge is the most gameable proxy. We have no held-out reference to detect the proxy↔truth gap.

  • Todo: periodically score current best candidates on the A3 cross-judge/gold set; if proxy rises while gold flattens → overoptimization (early-stop signal). The hybrid Pareto frontier already gives partial protection.
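The early-stop signal could look like the sketch below, assuming the proxy score and the held-out (cross-judge or gold) score are logged once per checkpoint. The window and thresholds are placeholder values to tune, not recommendations.

```python
def overoptimization_signal(proxy, gold, window=3, min_proxy_gain=0.02, max_gold_gain=0.0):
    """Flag checkpoints where the proxy keeps rising but held-out quality does not.

    proxy / gold: per-checkpoint score histories of equal length.
    Compares the latest checkpoint with the one `window` steps earlier.
    """
    if len(proxy) != len(gold):
        raise ValueError("histories must be the same length")
    if len(proxy) <= window:
        return False  # not enough history yet
    proxy_gain = proxy[-1] - proxy[-1 - window]
    gold_gain = gold[-1] - gold[-1 - window]
    return proxy_gain >= min_proxy_gain and gold_gain <= max_gold_gain


proxy_history = [0.60, 0.65, 0.70, 0.75, 0.80]
gold_history = [0.60, 0.66, 0.66, 0.66, 0.65]
should_stop = overoptimization_signal(proxy_history, gold_history)
```

In this example the proxy gains 0.15 over the window while gold slips slightly, so the signal fires.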

A7. Coarse scale + no reasoning-before-score ⬜ proposed

A 0/0.5/1 scale is low-resolution; chain-of-thought before the score improves calibration and reduces variance (G-Eval / RubricEval). Our VQ/rubric prompts ask for a bare score.

  • Todo: request brief per-criterion reasoning before the score; consider a finer scale.
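A sketch of the parsing side, assuming the judge is asked to return JSON in which each criterion carries a "reasoning" field written before its "score". The reply schema, field names, and the five-point scale are all hypothetical.

```python
import json

# Hypothetical finer scale; the current rubric uses 0 / 0.5 / 1.
ALLOWED_SCORES = (0.0, 0.25, 0.5, 0.75, 1.0)


def parse_rubric_reply(reply: str) -> dict:
    """Extract per-criterion scores, rejecting bare scores without reasoning."""
    data = json.loads(reply)
    scores = {}
    for item in data["criteria"]:
        reasoning = item.get("reasoning", "").strip()
        if not reasoning:
            raise ValueError(f"criterion {item['name']!r} has no reasoning")
        score = float(item["score"])
        if score not in ALLOWED_SCORES:
            raise ValueError(f"criterion {item['name']!r} has off-scale score {score}")
        scores[item["name"]] = score
    return scores


example_reply = json.dumps(
    {"criteria": [{"name": "contrast", "reasoning": "Labels are legible.", "score": 0.75}]}
)
parsed = parse_rubric_reply(example_reply)
```

The parser cannot verify that reasoning was generated before the score; that ordering has to be enforced by the prompt and the field order it requests.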

B. SGCR readiness for engine:"sgcr" by default (extends #195)

B1. The "don't use on grouped/zoned" guidance is now obsolete (auto-fallback is safe)

In flow.tsx:424-434, useSgcr = engine==="sgcr" && !grouped, so a grouped board with engine:"sgcr" transparently falls back to dagre and logs sgcr-skipped-grouped. Defaulting engine:"sgcr" is therefore safe for grouped boards: they simply use dagre, which is why architecture-zones "isn't using SGCR". The prompt caveat should change from "don't set it on grouped" to "auto-ignored on grouped." Grouped boards still get no SGCR benefit, though: grouped-graph support (#195 gap #1) remains unbuilt.

B2. No viewport-fit invariant + no dagre fallback on overflow → blocks safe default ⬜

P3 (check.ts:187) checks against SGCR's own layout.width/height, not the render viewport, and the only fallback is between SGCR modes (ring→routing in flow.tsx), never to dagre. So a large/wide ungrouped graph can pass every invariant yet overflow the screen — the A/B regressions (saga off-canvas 4→13, blue-green 4→8). This is the real blocker to defaulting engine:"sgcr": it helps small/medium ungrouped graphs but can render worse than dagre on large ones, with no safety net.

  • Todo (any of): a P10 "fits target viewport after scale" invariant; fall back to dagre when the SGCR layout can't fit; or a node-count guard before choosing SGCR. With one of these, engine:"sgcr" is a safe default.
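The first two options can be combined as in the sketch below. It is written in Python for consistency with the scorer examples; the real logic would live in flow.tsx / check.ts. The function names, the min_scale threshold, and the two-engine assumption are all hypothetical.

```python
def fits_viewport(layout_w, layout_h, view_w, view_h, min_scale=0.5):
    """Would the layout fit the viewport without shrinking below min_scale?

    min_scale is a placeholder for "smallest zoom at which text stays legible".
    """
    if layout_w <= 0 or layout_h <= 0:
        return True  # empty layout trivially fits
    scale = min(view_w / layout_w, view_h / layout_h, 1.0)
    return scale >= min_scale


def choose_engine(requested, grouped, layout_w, layout_h, view_w, view_h, min_scale=0.5):
    """Pick the layout engine, falling back to dagre when SGCR is unsafe."""
    if requested != "sgcr":
        return requested
    if grouped:
        return "dagre"  # mirrors the existing grouped fallback
    if not fits_viewport(layout_w, layout_h, view_w, view_h, min_scale):
        return "dagre"  # proposed overflow fallback
    return "sgcr"


# A wide ungrouped SGCR layout that would need a 0.32x scale to fit 1280x720.
engine = choose_engine("sgcr", False, 4000, 800, 1280, 720)
```

Checking against the render viewport rather than the layout's own width/height is what distinguishes this from the existing P3 invariant.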

Repro / pointers
