
feat: ER B³ + confusion matrix, ungated reference board, and gated GoldenMatch auto-config #6

Merged
Ben Severn (benzsevern) merged 3 commits into main from claude/leaderboard-implementation-5lk8D on May 24, 2026

@benzsevern Ben Severn (benzsevern) commented May 24, 2026

Three ER/leaderboard reporting enhancements.

1. B³ (BCubed) + confusion matrix for ER

score_er_tier now also reports, alongside the existing pair P/R/F1:

  • B³ (BCubed) precision/recall/F1 — clusters built from the pair graph via connected components over all rows, scored element-by-element.
  • Confusion matrix — TP/FP/FN/TN (TN = C(n,2) − TP − FP − FN).

Both appear in a new rich table and in the JSON output. They are diagnostic only: the headline DQBench ER Score stays pair-F1-weighted, so published entries don't move. ("B²" isn't a standard metric; this implements B-Cubed/B³.)
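As a rough illustration of the two metrics described above, here is a minimal sketch, not the actual score_er_tier code. It assumes rows are integer ids 0..n-1 and that predicted and true matches arrive as lists of id pairs; the function names are hypothetical. Clusters come from connected components over all rows (so unmatched rows become singletons), B³ is averaged element by element, and the confusion matrix is counted on the raw pair sets.

```python
from collections import defaultdict
from math import comb


def clusters_from_pairs(n, pairs):
    """Connected components over rows 0..n-1 (union-find); one label per row."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)
    return [find(i) for i in range(n)]


def confusion_matrix(n, predicted, truth):
    """Pair-level TP/FP/FN/TN, with TN = C(n,2) - TP - FP - FN."""
    pred = {frozenset(p) for p in predicted}
    gold = {frozenset(p) for p in truth}
    tp = len(pred & gold)
    fp = len(pred - gold)
    fn = len(gold - pred)
    tn = comb(n, 2) - tp - fp - fn
    return tp, fp, fn, tn


def bcubed(n, predicted, truth):
    """B-Cubed precision/recall/F1, averaged over every row."""
    pred_labels = clusters_from_pairs(n, predicted)
    gold_labels = clusters_from_pairs(n, truth)
    pred_members, gold_members = defaultdict(set), defaultdict(set)
    for row in range(n):
        pred_members[pred_labels[row]].add(row)
        gold_members[gold_labels[row]].add(row)

    precision = recall = 0.0
    for row in range(n):
        p = pred_members[pred_labels[row]]
        g = gold_members[gold_labels[row]]
        overlap = len(p & g)  # always >= 1: the row itself
        precision += overlap / len(p)
        recall += overlap / len(g)
    precision /= n
    recall /= n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Five rows; truth says {0,1,2} are one entity, prediction links 0-1 and 2-3.
truth = [(0, 1), (1, 2), (0, 2)]
predicted = [(0, 1), (2, 3)]
print(confusion_matrix(5, predicted, truth))  # (1, 1, 2, 6)
print(bcubed(5, predicted, truth))
```

In this toy case the pair view penalises the single wrong link once, while B³ spreads the cost over both rows of the polluted cluster, which is why the two are useful side by side as diagnostics.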

2. GoldenMatch auto-config → gated ER board (92.36, top entry)

Investigating why auto-config was non-deterministic turned up the real cause: it's persisted state, not randomness. auto_configure_df caches configs in ~/.goldenmatch/autoconfig_memory.db and seeds each run from the last; the profiling sample itself is already seeded (df.sample(seed=42)). Disabling that store (GOLDENMATCH_AUTOCONFIG_MEMORY=0, set before importing goldenmatch since the flag is read once at import) makes it reproduce to the digit — verified across runs even with a warm DB present.
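The ordering constraint can be sketched as follows. This is an illustration of the pattern, not the adapter's actual code: the helper name is hypothetical, and the import itself is left commented out so the snippet runs without goldenmatch installed. Only the environment variable name and the "read once at import" behaviour come from the description above.

```python
import os
import sys


def disable_autoconfig_memory(module_name="goldenmatch"):
    """Set the flag and report whether it was set early enough to take effect.

    Because the flag is read once at import time, setting it after the
    module is already in sys.modules would be silently ignored.
    """
    already_imported = module_name in sys.modules
    os.environ["GOLDENMATCH_AUTOCONFIG_MEMORY"] = "0"
    return not already_imported


effective = disable_autoconfig_memory()
if not effective:
    print("warning: goldenmatch was imported before the flag was set")

# Only now import the library:
# import goldenmatch
```

Checking sys.modules first is a cheap guard against the failure mode described here, where a warm store keeps influencing runs because some earlier import already captured the flag.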

So the new GoldenMatchAutoConfigAdapter (goldenmatch-auto) is gate-verified and tops the ER board:

| Rank | Tool | Score | Source |
|---|---|---|---|
| 1 | GoldenMatch (auto-config) | 92.36 | auto-config |
| 2 | Splink | 87.14 | reproduced |
| 3 | recordlinkage | 80.28 | reproduced |
| 4 | GoldenMatch (tuned) | 76.91 | reproduced |

Notably, auto-config beats GoldenMatch's own hand-tuned config. Splink/recordlinkage have no auto-config mode (they require explicit settings), so this is the closest thing to a zero-tuning ER number.

3. Ungated "reference" board (infrastructure)

This board is for auto-config runs that are genuinely non-reproducible and can't be salvaged the way GoldenMatch was. Manifests marked "gated": false route to leaderboard/reference/, which requires no manifest-linkage and is skipped by the CI verify matrix and the refresh audit. These entries render in a separate "Reference — auto-config (not gate-verified)" section. The board is currently seeded with GoldenSuite (zero-config) (Pipeline, ~33.85).
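The routing rule might look roughly like the sketch below. The function names, the gated directory name, and the choice to treat a missing "gated" key as gated are all assumptions made for illustration; only the "gated": false flag and the leaderboard/reference/ destination come from the description above.

```python
REFERENCE_DIR = "leaderboard/reference"  # named in the PR description
GATED_DIR = "leaderboard/results"        # hypothetical name for the gated store


def is_gated(manifest):
    """Treat a manifest as gated unless it explicitly says "gated": false.

    Defaulting to gated is an assumption; it keeps existing manifests
    (which predate the flag) on the verified board.
    """
    return manifest.get("gated", True) is not False


def store_for(manifest):
    """Pick the results store a manifest's entry is written to."""
    return GATED_DIR if is_gated(manifest) else REFERENCE_DIR


def verify_targets(manifests):
    """Manifests the CI verify matrix should re-run: gated ones only."""
    return [m for m in manifests if is_gated(m)]


manifests = [
    {"name": "er-goldenmatch-auto"},                   # gated by default
    {"name": "pipeline-goldensuite", "gated": False},  # hypothetical manifest name
]
for m in manifests:
    print(m["name"], "->", store_for(m))
```

Keeping the decision in one predicate means publish, verify, and the refresh audit can all share the same definition of "gated".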

Test plan

  • tests/test_er_scorer.py: TestConfusionMatrix, TestBCubed.
  • tests/test_submission.py — ungated routing to reference store, verify skips ungated, reference entries need no manifest, reference section renders.
  • dqbench verify leaderboard/submissions/er-goldenmatch-auto.json reproduces 92.36 (incl. with a warm ~/.goldenmatch DB present).
  • Full suite: 251 passing; ruff clean; dqbench publish --check green.

https://claude.ai/code/session_01KjxRYsnVFPVJ3aUBmNm7vB

score_er_tier now also reports cluster-level B-Cubed precision/recall/F1
(clusters built from the pair graph via connected components over all
rows) and the full pair-level confusion matrix (TP/FP/FN/TN), shown in
the ER report (rich table + JSON). These are diagnostic only — the
headline DQBench ER Score stays pair-F1-weighted and unchanged, so
published leaderboard entries don't move.

https://claude.ai/code/session_01KjxRYsnVFPVJ3aUBmNm7vB
…ite)

Auto-config tools learn/sample and aren't reproducible, so they can't sit
on the gated board. Add a separate "Reference — auto-config (not
gate-verified)" section instead.

- Manifests marked "gated": false route to leaderboard/reference/ — no
  manifest-linkage required, skipped by the CI verify matrix and the
  refresh audit. publish/check_published render + validate (schema only).
- New GoldenMatchAutoConfigAdapter (goldenmatch-auto) via auto_configure_df.
- Seeded reference entries: GoldenMatch (auto-config) ER (~92 this cold run;
  observed 57-92, non-deterministic) and GoldenSuite (zero-config) Pipeline
  (~33.85). Both clearly flagged non-reproducible with their range.

https://claude.ai/code/session_01KjxRYsnVFPVJ3aUBmNm7vB
@benzsevern Ben Severn (benzsevern) changed the title from "feat: B-Cubed (B³) metrics and confusion matrix for ER" to "feat: ER B³ + confusion matrix, and an ungated auto-config reference board" on May 24, 2026
GoldenMatch auto-config's run-to-run drift was its persisted cross-run
learning store (~/.goldenmatch/autoconfig_memory.db), not randomness — the
profiling sample is already seeded. Disabling that store makes auto-config
reproduce exactly, so it graduates from the ungated reference board to the
gated ER leaderboard, topping it at 92.36 (vs Splink 87.14, tuned 76.91).

- GoldenMatchAutoConfigAdapter sets GOLDENMATCH_AUTOCONFIG_MEMORY=0 (and
  LLM=0) before importing goldenmatch — the flag is read once at import.
- er-goldenmatch-auto manifest flipped to gated; entry moved to the gated
  results store; dqbench verify confirms it reproduces.
- Reference board now seeds only GoldenSuite (zero-config) (Pipeline).

https://claude.ai/code/session_01KjxRYsnVFPVJ3aUBmNm7vB
@benzsevern Ben Severn (benzsevern) changed the title from "feat: ER B³ + confusion matrix, and an ungated auto-config reference board" to "feat: ER B³ + confusion matrix, ungated reference board, and gated GoldenMatch auto-config" on May 24, 2026
@benzsevern Ben Severn (benzsevern) marked this pull request as ready for review May 24, 2026 20:22
@benzsevern Ben Severn (benzsevern) merged commit 5e3f40a into main May 24, 2026
9 checks passed