Sprints 5-7: Polish, custom rules, profiling, multi-file, CI by MsShawnP · Pull Request #10 · MsShawnP/data-hygiene-auditor

MsShawnP · 2026-05-16T16:11:00Z

Summary

Sprint 5: Polish & DevEx

Fix CLI issue count (fuzzy dupes + schema violations were missing)
Fix AuditResult._raw type safety
Add --version, --quiet, --force flags
File size guard (warn at 500K rows, refuse at 2M rows without --force)
Warning surfaced in CLI + JSON when fuzzy (Levenshtein) matching is skipped (>500 rows)
Minimum Python version raised from 3.8 to >=3.9
CHANGELOG.md added
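
The size guard in Sprint 5 could look roughly like the sketch below. The two thresholds and the --force override come from this PR; the function name `check_row_count`, the `FileTooLargeError` class, and the choice to treat both thresholds as inclusive are assumptions made for illustration, not the project's actual code.

```python
import warnings

WARN_ROWS = 500_000      # warn threshold stated in the PR
REFUSE_ROWS = 2_000_000  # refuse threshold stated in the PR


class FileTooLargeError(Exception):
    """Hypothetical error raised when an input exceeds the hard limit."""


def check_row_count(row_count: int, force: bool = False) -> None:
    """Warn on large inputs; refuse very large ones unless force is set."""
    if row_count >= REFUSE_ROWS and not force:
        raise FileTooLargeError(
            f"{row_count:,} rows exceeds {REFUSE_ROWS:,}; pass --force to continue"
        )
    if row_count >= WARN_ROWS:
        warnings.warn(f"{row_count:,} rows: the audit may be slow", stacklevel=2)
```

Raising an exception rather than exiting inside the helper keeps the guard usable from both the CLI and the Python API; the CLI layer would translate the exception into a non-zero exit.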

Sprint 6: Custom Rule Engine

JSON-based rules via --rules flag with 7 condition types
Column targeting by regex pattern or explicit list
Findings render natively in HTML, Excel, PDF reports
Sample rules file + README documentation

Sprint 7: Profiling, Multi-file, CI

Column profiling: cardinality, uniqueness %, lengths, numeric range in all reports + typed API (ColumnProfile)
Multi-file mode: --input ./dir/ or glob patterns, per-file reports, run_multi_audit() API
CI integration: --fail-under exit codes, --sarif for GitHub Code Scanning, GitHub Action
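
As a rough illustration of the column profiling stats listed above, the sketch below computes cardinality, uniqueness %, length stats, and a numeric range for one column. The PR names a `ColumnProfile` dataclass but does not show its fields, so the `ColumnProfileSketch` class, its field names, and the definition of uniqueness as distinct values divided by non-missing values are all assumptions.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence


@dataclass
class ColumnProfileSketch:
    """Hypothetical stand-in for the project's ColumnProfile dataclass."""
    name: str
    cardinality: int
    uniqueness_pct: float
    min_length: int
    max_length: int
    avg_length: float
    numeric_min: Optional[float] = None
    numeric_max: Optional[float] = None


def profile_column(name: str, values: Sequence[object]) -> ColumnProfileSketch:
    # Treat None and blank strings as missing (an assumption).
    present = [str(v) for v in values if v is not None and str(v).strip() != ""]
    if not present:
        return ColumnProfileSketch(name, 0, 0.0, 0, 0, 0.0)

    lengths = [len(v) for v in present]
    cardinality = len(set(present))

    # Report a numeric range only when every present value parses as a number.
    try:
        numbers = [float(v) for v in present]
    except ValueError:
        numbers = []

    return ColumnProfileSketch(
        name=name,
        cardinality=cardinality,
        uniqueness_pct=round(100.0 * cardinality / len(present), 1),
        min_length=min(lengths),
        max_length=max(lengths),
        avg_length=mean(lengths),
        numeric_min=min(numbers) if numbers else None,
        numeric_max=max(numbers) if numbers else None,
    )


profile = profile_column("amount", ["10", "20", "20", None, "5.5"])
```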

Test plan

212 tests pass (including 30 rule engine + 7 profiling + 4 rules integration)
ruff check . clean
--input samples/input/ audits both files in directory
--fail-under 90 exits 1 on low-scoring data
--sarif produces valid SARIF 2.1.0 with rules + results
Profile stats visible in HTML report
Custom rule findings in all 3 report formats
--version, --quiet work correctly
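
For readers unfamiliar with the SARIF check in the test plan, here is a minimal sketch of a SARIF 2.1.0 document carrying both rules and results. The layout of the input findings (`rule_id`, `message`, `file`, `row`) and the function name `findings_to_sarif` are hypothetical; the tool's real --sarif output may include more fields.

```python
import json


def findings_to_sarif(findings, tool_name="data-hygiene-auditor"):
    """Build a minimal SARIF 2.1.0 log from a list of finding dicts."""
    rule_ids = sorted({f["rule_id"] for f in findings})
    rules = [{"id": rid, "shortDescription": {"text": rid}} for rid in rule_ids]

    results = []
    for f in findings:
        results.append({
            "ruleId": f["rule_id"],
            "level": "warning",
            "message": {"text": f["message"]},
            "locations": [{
                "physicalLocation": {
                    "artifactLocation": {"uri": f["file"]},
                    # SARIF line numbers are 1-based.
                    "region": {"startLine": f["row"]},
                }
            }],
        })

    return {
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "version": "2.1.0",
        "runs": [{
            "tool": {"driver": {"name": tool_name, "rules": rules}},
            "results": results,
        }],
    }


sarif_doc = findings_to_sarif([
    {"rule_id": "fuzzy_duplicate", "message": "Possible duplicate of row 2",
     "file": "samples/input/customers.csv", "row": 3},
])
sarif_text = json.dumps(sarif_doc, indent=2)
```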

🤖 Generated with Claude Code

- Fix CLI issue count to include fuzzy duplicates and schema violations by extracting shared count_issues() helper used by CLI, HTML report
- Fix AuditResult._raw to be a proper dataclass field (type-safe)
- Remove _load_sheets from __all__ (internal, not public API)
- Add --version/-V, --quiet/-q, --force flags
- Add file size guard (warn 500K rows, refuse 2M without --force)
- Surface warning when Levenshtein matching is skipped (>500 rows)
- Raise minimum Python from 3.8 to 3.9 (pyproject + ruff + README)
- Add CHANGELOG.md (Keep a Changelog format)
- Document all CLI flags in README options table

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Users can now define detection rules in a JSON file and apply them via --rules/-r flag. Rules run alongside built-in checks with findings integrated into all 3 report formats (HTML, Excel, PDF). Supports 7 condition types: regex_match, not_regex_match, min_length, max_length, allowed_values, disallowed_values, max_missing_pct. Rules target columns by regex pattern or explicit list.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
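
To make the rule engine description concrete, here is a minimal sketch of how JSON rules with those seven condition types might be evaluated. Only the condition names and the two targeting modes come from the commit message. The rule keys (`name`, `columns`, `column_pattern`, `condition`, `value`), the finding tuples, and the reading of each condition as something a valid value must satisfy are assumptions, not the project's actual schema.

```python
import json
import re

# Hypothetical rules file content; the real schema may differ.
RULES_JSON = r"""
[
  {"name": "sku_format", "columns": ["sku"],
   "condition": "regex_match", "value": "^[A-Z]{3}-\\d{4}$"},
  {"name": "status_values", "column_pattern": "status$",
   "condition": "allowed_values", "value": ["active", "inactive"]},
  {"name": "email_mostly_present", "columns": ["email"],
   "condition": "max_missing_pct", "value": 25}
]
"""


def is_missing(value):
    return value is None or str(value).strip() == ""


def value_violates(condition, expected, value):
    """Return True when a single value fails the rule's condition."""
    text = str(value)
    if condition == "regex_match":
        return re.search(expected, text) is None
    if condition == "not_regex_match":
        return re.search(expected, text) is not None
    if condition == "min_length":
        return len(text) < expected
    if condition == "max_length":
        return len(text) > expected
    if condition == "allowed_values":
        return text not in expected
    if condition == "disallowed_values":
        return text in expected
    raise ValueError(f"unknown condition: {condition}")


def target_columns(rule, columns):
    """Select columns by explicit list, else by regex pattern."""
    if "columns" in rule:
        return [c for c in columns if c in rule["columns"]]
    pattern = re.compile(rule["column_pattern"])
    return [c for c in columns if pattern.search(c)]


def apply_rules(rules, rows):
    """Return (rule name, column, row index or None) for each violation."""
    columns = list(rows[0].keys()) if rows else []
    findings = []
    for rule in rules:
        for col in target_columns(rule, columns):
            values = [row.get(col) for row in rows]
            if rule["condition"] == "max_missing_pct":
                # Column-level check: one finding for the whole column.
                pct = 100.0 * sum(is_missing(v) for v in values) / len(values)
                if pct > rule["value"]:
                    findings.append((rule["name"], col, None))
                continue
            for i, v in enumerate(values):
                if not is_missing(v) and value_violates(
                    rule["condition"], rule["value"], v
                ):
                    findings.append((rule["name"], col, i))
    return findings


rows = [
    {"sku": "ABC-1234", "order_status": "active", "email": "a@example.com"},
    {"sku": "bad", "order_status": "pending", "email": ""},
    {"sku": "XYZ-0001", "order_status": "inactive", "email": None},
]
findings = apply_rules(json.loads(RULES_JSON), rows)
```

In this sketch `max_missing_pct` is the only column-level condition; the other six are checked per value and skip missing cells so that missingness is reported once rather than by every rule.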

Column profiling:
- Add cardinality, uniqueness %, avg/min/max length per field
- Numeric stats (min/max/mean/median) for currency and ID columns
- Stats rendered in HTML, Excel, PDF reports
- ColumnProfile dataclass in typed API

Multi-file / directory mode:
- --input accepts directories and glob patterns
- Each file gets its own report set
- run_multi_audit() API with weighted overall score

CI / pipeline integration:
- --fail-under flag: exit 1 if health score < threshold
- --sarif flag: SARIF 2.1.0 output for GitHub Code Scanning
- GitHub Action composite action (.github/actions/audit/)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
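
The weighted overall score and the --fail-under gate can be sketched together as follows. The commit message does not say what the weights are, so weighting each file's health score by its row count is an assumption, as are the function names `weighted_overall_score` and `exit_code_for`. The exit-1-below-threshold behavior is taken from the commit message.

```python
from typing import Optional, Sequence, Tuple


def weighted_overall_score(file_scores: Sequence[Tuple[float, int]]) -> float:
    """Combine per-file (health score, row count) pairs, weighting by rows."""
    total_rows = sum(rows for _, rows in file_scores)
    if total_rows == 0:
        return 0.0
    return sum(score * rows for score, rows in file_scores) / total_rows


def exit_code_for(score: float, fail_under: Optional[float] = None) -> int:
    """Return 1 when a threshold is set and the score falls below it."""
    if fail_under is not None and score < fail_under:
        return 1
    return 0


# Two files: a clean small one and a messy large one.
overall = weighted_overall_score([(100.0, 100), (50.0, 300)])
code = exit_code_for(overall, fail_under=90)
```

Weighting by rows keeps a tiny clean file from masking a large problematic one; an unweighted mean of the same two files would report 75 instead of 62.5.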

Session summary (2026-05-16)

Starting point: Sprints 5-7 merged (PR #10). --export-fixes implemented but uncommitted.

What we did:
- Sprint 8: Added --export-fixes CHANGELOG entry, mypy type checking in CI, fixed dataclasses.field shadowing bug and other type errors, expanded PyPI metadata (classifiers, keywords, URLs)
- Sprint 9: Created automated PyPI publish workflow (OIDC trusted publisher), refreshed README with "data linter" positioning and quick-start install
- Published v1.0.0 to PyPI (debugged OIDC case sensitivity issue)
- Sprint 10: Bumped to v1.1.0, implemented n-gram blocking algorithm to scale fuzzy Levenshtein matching from 500 to 50,000 rows
- Published v1.1.0 to PyPI
- Considered and rejected package rename

What worked: Tiered mypy enforcement (strict on public API, relaxed on internals) caught real bugs without requiring full annotation. N-gram blocking algorithm (trigram inverted index + candidate caps) achieved linear-ish scaling. OIDC trusted publishing eliminates API token management.

What didn't work:
- PyPI OIDC: case sensitivity mismatch (GitHub sends lowercase repo name)
- Fuzzy test: synthetic "Person{i}" names classified as IDs, excluded from matching
- Rebase after squash merge: creates conflicts from redundant commits

State: v1.1.0 live on PyPI. 220 tests, ruff + mypy clean. CI: lint, type check, tests. Automated release: push v* tag -> publish. All audit items done.

Next: Project shipped. Options: use on real consulting data, collect feedback, create GitHub Release with notes for v1.1.0, or move to another project.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
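
The n-gram blocking idea mentioned in the session summary (trigram inverted index + candidate caps) can be illustrated with the sketch below. The implementation is not shown in this PR, so everything here is an assumption: the padding scheme, reading the "cap" as a limit on how many values one trigram bucket may hold, the `max_distance` and `max_bucket` defaults, and the pure-Python Levenshtein function.

```python
from collections import defaultdict


def trigrams(text):
    """Set of character trigrams, padded so short strings still produce some."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, keeping two rows."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_pairs(values, max_distance=2, max_bucket=200):
    """Return index pairs (i, j), i < j, within max_distance edits."""
    # Inverted index: trigram -> indices of values containing it.
    index = defaultdict(list)
    for i, value in enumerate(values):
        for gram in trigrams(value):
            index[gram].append(i)

    # Only values sharing at least one uncapped trigram become candidates.
    candidates = set()
    for bucket in index.values():
        if len(bucket) > max_bucket:
            continue  # very common trigram: skip to bound the pair count
        for x in range(len(bucket)):
            for y in range(x + 1, len(bucket)):
                candidates.add((bucket[x], bucket[y]))

    return sorted(
        (i, j) for i, j in candidates
        if levenshtein(values[i].lower(), values[j].lower()) <= max_distance
    )


names = ["Jonathan Smith", "Jonathon Smith", "Maria Garcia", "Acme Corp"]
pairs = fuzzy_pairs(names)
```

Blocking trades recall for speed: a pair that shares only capped trigrams is never compared, which is the price of avoiding the all-pairs comparison that limited the earlier matching to 500 rows.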

MsShawnP and others added 2 commits May 16, 2026 12:09

MsShawnP changed the title ~~Sprint 5: CLI polish, counting fix, and DevEx improvements~~ Sprint 5 + 6: CLI polish, counting fix, custom rule engine May 16, 2026

MsShawnP changed the title ~~Sprint 5 + 6: CLI polish, counting fix, custom rule engine~~ Sprints 5-7: Polish, custom rules, profiling, multi-file, CI May 16, 2026

MsShawnP merged commit 7094388 into main May 16, 2026
3 checks passed

MsShawnP merged 3 commits into main from claude/youthful-vaughan-3f51e1
