feat: ER B³ + confusion matrix, ungated reference board, and gated GoldenMatch auto-config #6
Merged
Ben Severn (benzsevern) merged 3 commits on May 24, 2026
score_er_tier now also reports cluster-level B-Cubed precision/recall/F1 (clusters built from the pair graph via connected components over all rows) and the full pair-level confusion matrix (TP/FP/FN/TN), shown in the ER report (rich table + JSON). These are diagnostic only — the headline DQBench ER Score stays pair-F1-weighted and unchanged, so published leaderboard entries don't move. https://claude.ai/code/session_01KjxRYsnVFPVJ3aUBmNm7vB
…ite) Auto-config tools learn/sample and aren't reproducible, so they can't sit on the gated board. Add a separate "Reference — auto-config (not gate-verified)" section instead.
- Manifests marked "gated": false route to leaderboard/reference/: no manifest-linkage required, skipped by the CI verify matrix and the refresh audit. publish/check_published render + validate (schema only).
- New GoldenMatchAutoConfigAdapter (goldenmatch-auto) via auto_configure_df.
- Seeded reference entries: GoldenMatch (auto-config) ER (~92 this cold run; observed 57-92, non-deterministic) and GoldenSuite (zero-config) Pipeline (~33.85). Both clearly flagged non-reproducible with their range.
https://claude.ai/code/session_01KjxRYsnVFPVJ3aUBmNm7vB
GoldenMatch auto-config's run-to-run drift was its persisted cross-run learning store (~/.goldenmatch/autoconfig_memory.db), not randomness; the profiling sample is already seeded. Disabling that store makes auto-config reproduce exactly, so it graduates from the ungated reference board to the gated ER leaderboard, topping it at 92.36 (vs Splink 87.14, tuned 76.91).
- GoldenMatchAutoConfigAdapter sets GOLDENMATCH_AUTOCONFIG_MEMORY=0 (and LLM=0) before importing goldenmatch, because the flag is read once at import.
- er-goldenmatch-auto manifest flipped to gated; entry moved to the gated results store; dqbench verify confirms it reproduces.
- Reference board now seeds only GoldenSuite (zero-config) (Pipeline).
https://claude.ai/code/session_01KjxRYsnVFPVJ3aUBmNm7vB
Three ER/leaderboard reporting enhancements.
1. B³ (BCubed) + confusion matrix for ER
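score_er_tier now also reports, alongside the existing pair P/R/F1:
- Cluster-level B-Cubed (B³) precision/recall/F1, with clusters built from the pair graph via connected components over all rows.
- The full pair-level confusion matrix TP/FP/FN/TN (TN = C(n,2) − TP − FP − FN).
Shown in a new rich table + the JSON output. Diagnostic only: the headline DQBench ER Score stays pair-F1-weighted, so published entries don't move. ("B²" isn't a standard metric; this implements B-Cubed/B³.)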
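For readers unfamiliar with the two diagnostics, here is a minimal sketch of how they can be computed. It is not the code in score_er_tier: the function names, the choice of union-find for connected components, and the assumption that both predictions and ground truth arrive as sets of row-index pairs are all illustrative.

```python
from collections import Counter
from math import comb


def clusters_from_pairs(n_rows, pairs):
    """Label every row with its connected component in the pair graph.

    Rows that appear in no pair end up as singleton clusters.
    """
    parent = list(range(n_rows))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra
    return [find(i) for i in range(n_rows)]


def bcubed(pred_labels, true_labels):
    """Per-row B-Cubed precision/recall, averaged over all rows."""
    n = len(pred_labels)
    pred_size = Counter(pred_labels)
    true_size = Counter(true_labels)
    overlap = Counter(zip(pred_labels, true_labels))
    rows = list(zip(pred_labels, true_labels))
    precision = sum(overlap[(p, t)] / pred_size[p] for p, t in rows) / n
    recall = sum(overlap[(p, t)] / true_size[t] for p, t in rows) / n
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def pair_confusion(n_rows, pred_pairs, true_pairs):
    """Pair-level TP/FP/FN/TN, with TN = C(n,2) - TP - FP - FN."""
    pred = {frozenset(p) for p in pred_pairs}
    true = {frozenset(p) for p in true_pairs}
    tp = len(pred & true)
    fp = len(pred - true)
    fn = len(true - pred)
    tn = comb(n_rows, 2) - tp - fp - fn
    return tp, fp, fn, tn


# Toy data (made up for illustration): 5 rows, true clusters {0,1,2} and {3,4}.
N_ROWS = 5
TRUE_PAIRS = [(0, 1), (1, 2), (0, 2), (3, 4)]
PRED_PAIRS = [(0, 1), (3, 4), (2, 3)]

confusion = pair_confusion(N_ROWS, PRED_PAIRS, TRUE_PAIRS)
b3 = bcubed(
    clusters_from_pairs(N_ROWS, PRED_PAIRS),
    clusters_from_pairs(N_ROWS, TRUE_PAIRS),
)
```

On this toy input the single wrong pair (2, 3) counts as one pair-level false positive, but at the cluster level it merges row 2 into the {3, 4} cluster, which lowers B³ precision for all three rows in that cluster. That difference is the reason to report both views side by side.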
2. GoldenMatch auto-config → gated ER board (92.36, top entry)
Investigating why auto-config was non-deterministic turned up the real cause: it's persisted state, not randomness.
auto_configure_df caches configs in ~/.goldenmatch/autoconfig_memory.db and seeds each run from the last; the profiling sample itself is already seeded (df.sample(seed=42)). Disabling that store (GOLDENMATCH_AUTOCONFIG_MEMORY=0, set before importing goldenmatch since the flag is read once at import) makes it reproduce to the digit, verified across runs even with a warm DB present.
So the new GoldenMatchAutoConfigAdapter (goldenmatch-auto) is gate-verified and tops the ER board at 92.36 (vs Splink 87.14, GoldenMatch tuned 76.91).
Notably, auto-config beats GoldenMatch's own hand-tuned config. Splink/recordlinkage have no auto-config mode (they require explicit settings), so this is the closest thing to a zero-tuning ER number.
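Because the flag is read once at import, the ordering matters: the environment variable has to be in place before the first import of goldenmatch anywhere in the process. The sketch below shows one way to enforce that ordering. The helper name pin_autoconfig_env is hypothetical and this is not the adapter's actual code; only the variable name GOLDENMATCH_AUTOCONFIG_MEMORY comes from this PR.

```python
import os
import sys


def pin_autoconfig_env() -> None:
    """Disable the cross-run learning store before goldenmatch is imported.

    Hypothetical helper: fails loudly if the package was already imported,
    since setting the flag afterwards would have no effect.
    """
    if "goldenmatch" in sys.modules:
        raise RuntimeError(
            "goldenmatch was imported before the flag was set; "
            "GOLDENMATCH_AUTOCONFIG_MEMORY is only read at import time"
        )
    os.environ["GOLDENMATCH_AUTOCONFIG_MEMORY"] = "0"


pin_autoconfig_env()

try:
    import goldenmatch  # noqa: E402  (deliberately imported after the flag is set)
except ImportError:
    goldenmatch = None  # package not installed; the ordering is still illustrated
```

Keeping the import inside the adapter, after the environment is pinned, avoids depending on import order elsewhere in the benchmark.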
3. Ungated "reference" board (infrastructure)
For auto-config runs that are genuinely non-reproducible and can't be salvaged like GoldenMatch was. Manifests marked "gated": false route to leaderboard/reference/ (no manifest-linkage, skipped by the CI verify matrix and refresh audit) and render in a separate "Reference — auto-config (not gate-verified)" section. Currently seeded with GoldenSuite (zero-config) (Pipeline, ~33.85).
Test plan
- tests/test_er_scorer.py: TestConfusionMatrix, TestBCubed.
- tests/test_submission.py: ungated routing to reference store, verify skips ungated, reference entries need no manifest, reference section renders.
- dqbench verify leaderboard/submissions/er-goldenmatch-auto.json reproduces 92.36 (incl. with a warm ~/.goldenmatch DB present).
- ruff clean; dqbench publish --check green.
https://claude.ai/code/session_01KjxRYsnVFPVJ3aUBmNm7vB
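The ungated routing from section 3 can be sketched roughly as follows. This is an illustration, not the DQBench implementation: the function names are hypothetical, the gated store path is a placeholder (only leaderboard/reference/ is named in this PR), and treating a missing "gated" key as gated is an assumption.

```python
from pathlib import Path

REFERENCE_DIR = Path("leaderboard/reference")  # named in this PR
GATED_DIR = Path("leaderboard/results")  # placeholder path, not from the PR


def is_gated(manifest: dict) -> bool:
    """Only an explicit "gated": false opts a manifest out of the gate.

    Assumption: a manifest without the key stays gated.
    """
    return manifest.get("gated", True) is not False


def results_dir(manifest: dict) -> Path:
    """Pick the store a manifest's results are written to."""
    return GATED_DIR if is_gated(manifest) else REFERENCE_DIR


def verify_targets(manifests: list) -> list:
    """Manifests a CI verify matrix would re-run; ungated ones are skipped."""
    return [m for m in manifests if is_gated(m)]


# Made-up manifests for illustration.
manifests = [
    {"name": "er-goldenmatch-auto"},
    {"name": "pipeline-goldensuite-zero-config", "gated": False},
]
to_verify = [m["name"] for m in verify_targets(manifests)]
```

Defaulting to gated means an entry reaches the reference board only through an explicit opt-out, so a manifest that simply omits the key cannot bypass verification.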