
v1.1.0: Scale fuzzy matching to 50K rows, version bump #12

Merged
MsShawnP merged 1 commit into main from claude/youthful-vaughan-3f51e1
May 16, 2026
Conversation

@MsShawnP

Owner

Summary

  • Version bumped to 1.1.0, aligning with the PyPI release tag and CHANGELOG
  • Fuzzy matching scales to 50,000 rows: n-gram blocking replaces the old 500-row brute-force cap. A character-trigram inverted index generates candidate pairs, so Levenshtein runs only on plausible matches
  • Small datasets (≤500 rows) still use brute-force comparison, which is faster at that size
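
To make the size thresholds concrete, here is a minimal sketch of how the strategy choice might look. The names `choose_strategy`, `pair_count`, and the two constants are hypothetical and not taken from the package; the "skip" branch above 50,000 rows is an assumption based on the old skip behaviour at 500 rows.

```python
# Illustrative only: names and the "skip" branch are assumptions,
# not the actual data_hygiene_auditor API.
BRUTE_FORCE_MAX_ROWS = 500      # at or below this, compare every pair
BLOCKING_MAX_ROWS = 50_000      # new upper limit stated in this PR


def pair_count(n_rows: int) -> int:
    """Number of unordered pairs a brute-force pass would compare."""
    return n_rows * (n_rows - 1) // 2


def choose_strategy(n_rows: int) -> str:
    """Pick a fuzzy-matching strategy from the row count."""
    if n_rows <= BRUTE_FORCE_MAX_ROWS:
        return "brute_force"
    if n_rows <= BLOCKING_MAX_ROWS:
        return "ngram_blocking"
    return "skip"  # assumed: datasets past the limit are skipped


print(choose_strategy(501))    # ngram_blocking
print(pair_count(500))         # 124750
print(pair_count(50_000))      # 1249975000
```

The pair counts show why the brute-force path cannot simply be extended: going from 500 to 50,000 rows multiplies the number of comparisons by roughly 10,000.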

How n-gram blocking works

  1. Generate character 3-grams for each normalized record
  2. Build inverted index: trigram → list of row indices
  3. Count shared trigrams per pair, skipping very common trigrams (noise)
  4. Run Levenshtein only on pairs sharing enough trigrams
  5. Cap candidates per record (50) to bound the worst case
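
The steps above can be sketched roughly as follows. This is not the code from the PR: `trigrams`, `candidate_pairs`, and the `min_shared` and `max_bucket` defaults are hypothetical, chosen only for illustration. Only the per-record cap of 50 comes from the description. As a simplification, the cap here is applied to each record's higher-indexed partners.

```python
from __future__ import annotations

from collections import Counter, defaultdict


def trigrams(text: str) -> set[str]:
    """Step 1: character 3-grams (empty set if text is shorter than 3)."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def candidate_pairs(
    records: list[str],
    min_shared: int = 2,       # assumed threshold for "enough" shared trigrams
    max_bucket: int = 100,     # assumed cutoff for "very common" trigrams
    max_candidates: int = 50,  # per-record cap from step 5
) -> set[tuple[int, int]]:
    """Return (i, j) row-index pairs, i < j, worth an exact Levenshtein check."""
    # Step 2: inverted index, trigram -> row indices (ascending order).
    index: dict[str, list[int]] = defaultdict(list)
    for row, text in enumerate(records):
        for gram in trigrams(text):
            index[gram].append(row)

    # Step 3: count shared trigrams per pair, skipping noisy buckets.
    shared: dict[int, Counter[int]] = defaultdict(Counter)
    for rows in index.values():
        if len(rows) > max_bucket:
            continue
        for pos, a in enumerate(rows):
            for b in rows[pos + 1:]:
                shared[a][b] += 1

    # Steps 4-5: keep the strongest candidates that share enough trigrams.
    pairs: set[tuple[int, int]] = set()
    for a, counts in shared.items():
        for b, n in counts.most_common(max_candidates):
            if n >= min_shared:
                pairs.add((a, b))
    return pairs


records = ["john smith", "jon smith", "alice wong", "zzz"]
print(candidate_pairs(records))  # {(0, 1)}
```

Only the pairs returned here would be passed to the Levenshtein comparison, so the expensive edit-distance step runs on a small fraction of all possible pairs.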

Test plan

  • 220 tests pass (8 new)
  • ruff check . — clean
  • mypy data_hygiene_auditor/ — 0 errors
  • N-gram blocking unit tests (similar/dissimilar/empty/short/1000-row scale)
  • Large dataset test: 600 rows with injected typo pair found without skip warning
  • 501-row test confirms old limit no longer triggers skip

🤖 Generated with Claude Code

- Version bumped to 1.1.0 to align with PyPI release tag
- Levenshtein matching limit raised from 500 → 50,000 rows
- N-gram blocking generates candidate pairs via character trigram
  inverted index, avoiding O(n²) brute-force comparisons
- Small datasets (≤500 rows) still use brute-force (faster for them)
- 8 new tests: n-gram blocking unit tests + large dataset integration
- CHANGELOG converted from [Unreleased] to [1.1.0] release entry

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
MsShawnP force-pushed the claude/youthful-vaughan-3f51e1 branch from 6ad9f5c to 5f961e8 on May 16, 2026 18:48
MsShawnP merged commit fe4e9e9 into main on May 16, 2026
0 of 3 checks passed