feat: add Pointblank + ydata-profiling (Detect) and dedupe (ER reference) #7
Merged
Ben Severn (benzsevern) merged 1 commit on May 24, 2026
…nce)
New third-party adapters and leaderboard entries:
- Pointblank (Detect, 30.97): Validate plan with dtype-gated rules so interrogation never aborts on a numeric/string mismatch. Gated, verified.
- ydata-profiling (Detect, 4.70): a profiler, not a validator. Only its MISSING alerts map to a planted issue type; other statistical alerts are emitted as non-penalised INFO. Minimal-mode profiling is deterministic. Needs setuptools<81 for pkg_resources. Gated, verified.
- dedupe (ER): active-learning matcher run with deterministic weak supervision (shared email/phone -> match, random -> distinct, no ground truth) via a training file. Not reproducible run-to-run, so it lands on the ungated reference board only (~64 representative run).
Generalise the reference-board heading from "auto-config" to "not gate-verified", since it now also holds non-deterministic non-auto-config tools (dedupe's active learning).
Not added: dataprep (uninstallable on Python 3.11: it pins python-crfsuite 0.9.8, which won't build against the 3.11 C API) and pyjanitor (no cell-value canonicalizers, ~0 on the Transform planted-cell metric).
https://claude.ai/code/session_01KjxRYsnVFPVJ3aUBmNm7vB
Adds the feasible candidate tools to the leaderboard. After digging into each, the six candidates split into "clean fit", "reference-only", and "infeasible".
Added to the gated board (Detect)
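- Pointblank (30.97): Validate plan (not-null, distinct, in-set, between, regex) keyed on common column names. Pointblank aborts the whole interrogation on a numeric/string dtype mismatch, so rules are dtype-gated before they're added. Rivals cuallee (30.56). Deterministic, dqbench verify'd.
- ydata-profiling (4.70): a profiler, not a validator. Only its MISSING alert maps to a planted issue type (nullability); the rest are emitted as non-penalised INFO. minimal-mode profiling (sampling/correlations/duplicates/interactions off) is deterministic. Needs setuptools<81 for pkg_resources. dqbench verify'd.
Added to the ungated reference board (ER)
- dedupe (~64, representative run): active-learning matcher run with deterministic weak supervision (shared email/phone -> match, random -> distinct, no ground truth) via a training file. Not reproducible run-to-run, so it lands on the ungated reference board only.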
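The dtype gating described for the Pointblank adapter can be sketched in plain Python. This is only an illustration of the idea, not the adapter's code: `Rule`, `column_kind`, and `gate_rules` are hypothetical names, the table is a plain dict of lists rather than a dataframe, and no Pointblank calls are made. A rule is kept only when the column exists and its observed type matches what the rule needs, so a numeric check is never attached to a string or mixed column.

```python
from dataclasses import dataclass
from numbers import Number


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule descriptor; not a Pointblank type."""
    kind: str    # e.g. "not_null", "between", "regex"
    column: str
    needs: str   # "any", "numeric", or "string"


def column_kind(values):
    """Classify a column as 'numeric', 'string', or 'mixed', ignoring None."""
    present = [v for v in values if v is not None]
    if present and all(
        isinstance(v, Number) and not isinstance(v, bool) for v in present
    ):
        return "numeric"
    if present and all(isinstance(v, str) for v in present):
        return "string"
    return "mixed"


def gate_rules(table, candidate_rules):
    """Keep only rules whose column exists and has a compatible type."""
    kept = []
    for rule in candidate_rules:
        if rule.column not in table:
            continue  # rules are keyed on common column names; skip absent ones
        if rule.needs == "any" or rule.needs == column_kind(table[rule.column]):
            kept.append(rule)
    return kept


# Illustrative data: "zip" mixes strings and integers, "phone" is absent.
table = {
    "age": [31, None, 47],
    "email": ["a@x.io", "b@y.io", None],
    "zip": ["02139", 94110, None],
}
candidates = [
    Rule("not_null", "age", "any"),
    Rule("between", "age", "numeric"),
    Rule("regex", "email", "string"),
    Rule("between", "zip", "numeric"),   # dropped: column is mixed
    Rule("regex", "phone", "string"),    # dropped: column is missing
]
gated = gate_rules(table, candidates)
```

Only the surviving rules would then be handed to the validation plan, which is what keeps a single mismatched column from aborting the whole run.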
Also generalised the reference-board heading from "auto-config (not gate-verified)" to just "not gate-verified", since it now holds a non-auto-config non-deterministic tool too.
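The weak-supervision scheme named for dedupe (shared email/phone -> match, random -> distinct, no ground truth) can be sketched as below. This is a minimal sketch under stated assumptions, not the adapter's code: `weak_labels` is a hypothetical helper, the fixed seed is one way to make the labelling step deterministic, and excluding shared-key pairs from the random pool is an assumption. The "match" and "distinct" keys mirror the labels in the description; the on-disk layout of the training file is not shown in the source.

```python
import random
from itertools import combinations


def weak_labels(records, n_distinct, seed=0):
    """Label record pairs without ground truth.

    records: dict mapping record id -> dict with optional 'email' / 'phone'.
    Pairs sharing a non-empty email or phone are labelled as matches;
    a seeded random sample of the remaining pairs is labelled distinct.
    """
    ids = sorted(records)
    match = []
    for a, b in combinations(ids, 2):
        ra, rb = records[a], records[b]
        # Missing values (None or "") never count as a shared key.
        same_email = bool(ra.get("email")) and ra.get("email") == rb.get("email")
        same_phone = bool(ra.get("phone")) and ra.get("phone") == rb.get("phone")
        if same_email or same_phone:
            match.append((a, b))

    matched = set(match)
    pool = [pair for pair in combinations(ids, 2) if pair not in matched]
    rng = random.Random(seed)  # private, seeded RNG keeps the sample repeatable
    distinct = rng.sample(pool, min(n_distinct, len(pool)))
    return {"match": match, "distinct": distinct}


# Illustrative records only.
records = {
    1: {"email": "a@x.io", "phone": "111"},
    2: {"email": "a@x.io", "phone": "222"},
    3: {"email": "c@z.io", "phone": "222"},
    4: {"email": "d@w.io", "phone": None},
    5: {"email": None, "phone": None},
}
labels = weak_labels(records, n_distinct=3)
```

Deterministic labels do not make the whole pipeline deterministic: as noted above, the active-learning matcher itself is not reproducible run-to-run, which is why the entry stays on the ungated board.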
Not added (reported, not forced)
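- dataprep: uninstallable on Python 3.11. It pins python-crfsuite==0.9.8, a C extension that won't compile against the 3.11 C API. Since CI gates on 3.11, it can't be added.
- pyjanitor: no cell-value canonicalizers, so it scores ~0 on the Transform planted-cell metric.
Net: Transform gains no new tool (the only viable candidate, dataprep, can't install on 3.11).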
Test plan
- dqbench reproduce --write then dqbench verify both reproduce exactly.
- leaderboard/reference/er.json (gated:false manifest, skipped by the CI verify matrix).
- ruff clean; dqbench publish --check green.
https://claude.ai/code/session_01KjxRYsnVFPVJ3aUBmNm7vB
Generated by Claude Code