v1.1.0: Scale fuzzy matching to 50K rows, version bump by MsShawnP · Pull Request #12 · MsShawnP/data-hygiene-auditor

MsShawnP · 2026-05-16T18:46:53Z

Summary

Version bumped to 1.1.0 to align with the PyPI release tag and CHANGELOG
Fuzzy matching now scales to 50,000 rows: n-gram blocking replaces the old 500-row brute-force cap. A character-trigram inverted index generates candidate pairs, and Levenshtein runs only on those plausible matches
Small datasets (≤500 rows) still use brute force, which is faster at that size
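
For context, the brute-force path kept for small inputs amounts to comparing every pair of values. The following is a minimal sketch of that idea, not the code from this PR: `levenshtein`, `brute_force_matches`, and the `max_distance` threshold are hypothetical names and values chosen for illustration.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def brute_force_matches(values, max_distance=2):
    """Compare every pair of rows: O(n^2), acceptable for a few hundred rows.

    max_distance is an assumed threshold, not a value taken from the PR.
    """
    matches = []
    for (i, a), (j, b) in combinations(enumerate(values), 2):
        if a != b and levenshtein(a, b) <= max_distance:
            matches.append((i, j))
    return matches
```

At 500 rows this is roughly 125,000 comparisons, which is why a fixed cap was workable before; at 50,000 rows the same loop would need about 1.25 billion.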

How n-gram blocking works

Generate character 3-grams for each normalized record
Build inverted index: trigram → list of row indices
Count shared trigrams per pair, skipping very common trigrams (noise)
Only run Levenshtein on pairs sharing enough trigrams
Cap candidates per record (50) to bound the worst case
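
The steps above can be sketched roughly as follows. This is an illustrative approximation under stated assumptions, not the implementation in `data_hygiene_auditor`: the function names, the `normalize` rule, and the `min_shared` and `max_posting` thresholds are all hypothetical. Only the per-record cap of 50 comes from the description. The returned pairs are the ones a Levenshtein check would then be run on.

```python
from collections import Counter, defaultdict


def normalize(value: str) -> str:
    # Assumed normalization (trim + lowercase); the real rule may differ.
    return value.strip().lower()


def trigrams(text: str) -> set:
    """Set of character 3-grams; strings shorter than 3 yield an empty set."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def candidate_pairs(values, min_shared=2, max_posting=100, max_candidates=50):
    """Return sorted (i, j) row-index pairs worth an edit-distance check.

    min_shared and max_posting are assumed thresholds for illustration.
    """
    # Steps 1-2: trigrams per record, then inverted index trigram -> rows.
    index = defaultdict(list)
    grams_by_row = []
    for row, value in enumerate(values):
        grams = trigrams(normalize(value))
        grams_by_row.append(grams)
        for gram in grams:
            index[gram].append(row)

    pairs = set()
    for row, grams in enumerate(grams_by_row):
        # Step 3: count shared trigrams, skipping very common ones.
        shared = Counter()
        for gram in grams:
            posting = index[gram]
            if len(posting) > max_posting:
                continue  # trigram appears in too many rows to be informative
            for other in posting:
                if other != row:
                    shared[other] += 1
        # Steps 4-5: keep at most max_candidates rows that share enough trigrams.
        for other, count in shared.most_common(max_candidates):
            if count >= min_shared:
                pairs.add((min(row, other), max(row, other)))
    return sorted(pairs)
```

Because each record only looks up the rows that share one of its own trigrams, work grows with the size of the posting lists rather than with the square of the row count, and the per-record cap bounds the number of expensive comparisons that follow.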

Test plan

220 tests pass (8 new)
ruff check . — clean
mypy data_hygiene_auditor/ — 0 errors
N-gram blocking unit tests (similar/dissimilar/empty/short/1000-row scale)
Large dataset test: 600 rows with injected typo pair found without skip warning
501-row test confirms old limit no longer triggers skip

🤖 Generated with Claude Code

- Version bumped to 1.1.0 to align with PyPI release tag
- Levenshtein matching limit raised from 500 → 50,000 rows
- N-gram blocking generates candidate pairs via character trigram inverted index, avoiding O(n²) brute-force comparisons
- Small datasets (≤500 rows) still use brute-force (faster for them)
- 8 new tests: n-gram blocking unit tests + large dataset integration
- CHANGELOG converted from [Unreleased] to [1.1.0] release entry

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

MsShawnP force-pushed the claude/youthful-vaughan-3f51e1 branch from 6ad9f5c to 5f961e8 on May 16, 2026 18:48

MsShawnP merged commit fe4e9e9 into main May 16, 2026
0 of 3 checks passed

MsShawnP merged 1 commit into main from claude/youthful-vaughan-3f51e1
