feat: ER B³ + confusion matrix, ungated reference board, and gated GoldenMatch auto-config by benzsevern · Pull Request #6 · benseverndev-oss/dqbench

Ben Severn (benzsevern) · 2026-05-24T15:57:56Z

Three ER/leaderboard reporting enhancements.

1. B³ (BCubed) + confusion matrix for ER

score_er_tier now also reports, alongside the existing pair P/R/F1:

B³ (BCubed) precision/recall/F1 — clusters built from the pair graph via connected components over all rows, scored element-by-element.
Confusion matrix — TP/FP/FN/TN (TN = C(n,2) − TP − FP − FN).

Shown in a new rich table + the JSON output. Diagnostic only — the headline DQBench ER Score stays pair-F1-weighted, so published entries don't move. ("B²" isn't a standard metric; this implements B-Cubed/B³.)
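To make the two diagnostics concrete, here is a minimal sketch of how they could be computed. It is not the code in score_er_tier: the function names are hypothetical, rows are assumed to be integer ids 0..n-1, and the confusion matrix is assumed to count the raw predicted pairs rather than their transitive closure.

```python
from collections import defaultdict
from math import comb


def clusters_from_pairs(n_rows, pairs):
    """Label each row with its connected component in the pair graph.

    Rows that appear in no pair end up as singleton clusters.
    """
    parent = list(range(n_rows))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)
    return [find(i) for i in range(n_rows)]


def bcubed(pred_labels, true_labels):
    """Element-by-element B-Cubed precision, recall and F1."""
    pred_members = defaultdict(set)
    true_members = defaultdict(set)
    for row, (p, t) in enumerate(zip(pred_labels, true_labels)):
        pred_members[p].add(row)
        true_members[t].add(row)

    n = len(pred_labels)
    precision = recall = 0.0
    for row in range(n):
        p_set = pred_members[pred_labels[row]]
        t_set = true_members[true_labels[row]]
        overlap = len(p_set & t_set)  # always >= 1: the row itself
        precision += overlap / len(p_set)
        recall += overlap / len(t_set)
    precision /= n
    recall /= n
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def pair_confusion(n_rows, pred_pairs, true_pairs):
    """Pair-level TP/FP/FN/TN, with TN = C(n,2) - TP - FP - FN."""
    pred = {frozenset(p) for p in pred_pairs}
    true = {frozenset(p) for p in true_pairs}
    tp = len(pred & true)
    fp = len(pred - true)
    fn = len(true - pred)
    tn = comb(n_rows, 2) - tp - fp - fn
    return tp, fp, fn, tn


# Toy data: 5 rows, truth clusters {0,1,2} and {3,4}.
true_pairs = [(0, 1), (1, 2), (0, 2), (3, 4)]
pred_pairs = [(0, 1), (2, 3)]
pred_labels = clusters_from_pairs(5, pred_pairs)
true_labels = clusters_from_pairs(5, true_pairs)
b3_precision, b3_recall, b3_f1 = bcubed(pred_labels, true_labels)
tp, fp, fn, tn = pair_confusion(5, pred_pairs, true_pairs)
```

On this toy input the pair counts are TP=1, FP=1, FN=3, TN=5 (out of C(5,2)=10 pairs), while B³ precision is 0.8 and B³ recall is 8/15. The gap between the pair-level and cluster-level views is what makes the extra table useful as a diagnostic.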

2. GoldenMatch auto-config → gated ER board (92.36, top entry)

Investigating why auto-config was non-deterministic turned up the real cause: it's persisted state, not randomness. auto_configure_df caches configs in ~/.goldenmatch/autoconfig_memory.db and seeds each run from the last; the profiling sample itself is already seeded (df.sample(seed=42)). Disabling that store (GOLDENMATCH_AUTOCONFIG_MEMORY=0, set before importing goldenmatch since the flag is read once at import) makes it reproduce to the digit — verified across runs even with a warm DB present.
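Because the flag is read once at import, ordering is the whole fix. A minimal sketch of that ordering follows; the helper name is hypothetical and this is not the adapter's actual code. Only the GOLDENMATCH_AUTOCONFIG_MEMORY variable comes from this PR.

```python
import importlib
import os
import sys


def import_with_memory_disabled(module_name="goldenmatch"):
    """Set the flag, then import, so the module sees it at import time.

    Hypothetical helper. It refuses to run if the module is already
    loaded, since setting the variable afterwards would have no effect
    on a flag that is read once at import.
    """
    if module_name in sys.modules:
        raise RuntimeError(
            f"{module_name} is already imported; "
            "GOLDENMATCH_AUTOCONFIG_MEMORY must be set before the first import"
        )
    os.environ["GOLDENMATCH_AUTOCONFIG_MEMORY"] = "0"
    return importlib.import_module(module_name)
```

Raising instead of silently setting the variable is a deliberate choice in this sketch: a late assignment would look correct in the environment but leave the persisted store active.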

So the new GoldenMatchAutoConfigAdapter (goldenmatch-auto) is gate-verified and tops the ER board:

Rank	Tool	Score	Source
1	GoldenMatch (auto-config)	92.36	auto-config
2	Splink	87.14	reproduced
3	recordlinkage	80.28	reproduced
4	GoldenMatch (tuned)	76.91	reproduced

Notably, auto-config beats GoldenMatch's own hand-tuned config. Splink/recordlinkage have no auto-config mode (they require explicit settings), so this is the closest thing to a zero-tuning ER number.

3. Ungated "reference" board (infrastructure)

This board is for auto-config runs that are genuinely non-reproducible and can't be salvaged the way GoldenMatch was. Manifests marked "gated": false route to leaderboard/reference/; these entries need no manifest-linkage, are skipped by the CI verify matrix and the refresh audit, and render in a separate "Reference — auto-config (not gate-verified)" section. Currently seeded with GoldenSuite (zero-config) (Pipeline, ~33.85).
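The routing rule can be sketched as below. This is an illustration under assumptions, not the dqbench implementation: the gated store path, the "name" manifest key, and the default of treating a missing "gated" key as gated are all hypothetical. Only leaderboard/reference/ and the "gated": false marker come from this PR.

```python
from pathlib import Path

REFERENCE_DIR = Path("leaderboard/reference")  # named in this PR
GATED_DIR = Path("leaderboard/results")  # hypothetical path for the gated store


def is_gated(manifest: dict) -> bool:
    # Assumption: a manifest without a "gated" key is treated as gated.
    return bool(manifest.get("gated", True))


def route_entry(manifest: dict) -> Path:
    """Pick the store an entry is written to, based on the "gated" flag."""
    store = GATED_DIR if is_gated(manifest) else REFERENCE_DIR
    return store / f"{manifest['name']}.json"  # "name" key is hypothetical


def manifests_to_verify(manifests: list[dict]) -> list[dict]:
    """Ungated manifests are skipped by verification."""
    return [m for m in manifests if is_gated(m)]
```

Keeping the decision in one predicate means the publish path and the verify path cannot disagree about which entries are gated.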

Test plan

tests/test_er_scorer.py — TestConfusionMatrix, TestBCubed.
tests/test_submission.py — ungated routing to reference store, verify skips ungated, reference entries need no manifest, reference section renders.
dqbench verify leaderboard/submissions/er-goldenmatch-auto.json reproduces 92.36 (incl. with a warm ~/.goldenmatch DB present).
Full suite: 251 passing; ruff clean; dqbench publish --check green.

https://claude.ai/code/session_01KjxRYsnVFPVJ3aUBmNm7vB

score_er_tier now also reports cluster-level B-Cubed precision/recall/F1 (clusters built from the pair graph via connected components over all rows) and the full pair-level confusion matrix (TP/FP/FN/TN), shown in the ER report (rich table + JSON). These are diagnostic only — the headline DQBench ER Score stays pair-F1-weighted and unchanged, so published leaderboard entries don't move. https://claude.ai/code/session_01KjxRYsnVFPVJ3aUBmNm7vB

…ite)
Auto-config tools learn/sample and aren't reproducible, so they can't sit on the gated board. Add a separate "Reference — auto-config (not gate-verified)" section instead.
- Manifests marked "gated": false route to leaderboard/reference/: no manifest-linkage required, skipped by the CI verify matrix and the refresh audit. publish/check_published render + validate (schema only).
- New GoldenMatchAutoConfigAdapter (goldenmatch-auto) via auto_configure_df.
- Seeded reference entries: GoldenMatch (auto-config) ER (~92 this cold run; observed 57-92, non-deterministic) and GoldenSuite (zero-config) Pipeline (~33.85). Both clearly flagged non-reproducible with their range.
https://claude.ai/code/session_01KjxRYsnVFPVJ3aUBmNm7vB

GoldenMatch auto-config's run-to-run drift was its persisted cross-run learning store (~/.goldenmatch/autoconfig_memory.db), not randomness; the profiling sample is already seeded. Disabling that store makes auto-config reproduce exactly, so it graduates from the ungated reference board to the gated ER leaderboard, topping it at 92.36 (vs Splink 87.14, tuned 76.91).
- GoldenMatchAutoConfigAdapter sets GOLDENMATCH_AUTOCONFIG_MEMORY=0 (and LLM=0) before importing goldenmatch, because the flag is read once at import.
- er-goldenmatch-auto manifest flipped to gated; entry moved to the gated results store; dqbench verify confirms it reproduces.
- Reference board now seeds only GoldenSuite (zero-config) (Pipeline).
https://claude.ai/code/session_01KjxRYsnVFPVJ3aUBmNm7vB

Claude (claude) added 2 commits May 24, 2026 15:57

Ben Severn (benzsevern) changed the title ~~feat: B-Cubed (B³) metrics and confusion matrix for ER~~ feat: ER B³ + confusion matrix, and an ungated auto-config reference board May 24, 2026

Ben Severn (benzsevern) changed the title ~~feat: ER B³ + confusion matrix, and an ungated auto-config reference board~~ feat: ER B³ + confusion matrix, ungated reference board, and gated GoldenMatch auto-config May 24, 2026

Ben Severn (benzsevern) marked this pull request as ready for review May 24, 2026 20:22

Ben Severn (benzsevern) merged commit 5e3f40a into main May 24, 2026
9 checks passed

Ben Severn (benzsevern) merged 3 commits into main from claude/leaderboard-implementation-5lk8D
