Sprints 5-7: Polish, custom rules, profiling, multi-file, CI #10
Merged
- Fix CLI issue count to include fuzzy duplicates and schema violations by extracting shared count_issues() helper used by CLI, HTML report
- Fix AuditResult._raw to be a proper dataclass field (type-safe)
- Remove _load_sheets from __all__ (internal, not public API)
- Add --version/-V, --quiet/-q, --force flags
- Add file size guard (warn 500K rows, refuse 2M without --force)
- Surface warning when Levenshtein matching is skipped (>500 rows)
- Raise minimum Python from 3.8 to 3.9 (pyproject + ruff + README)
- Add CHANGELOG.md (Keep a Changelog format)
- Document all CLI flags in README options table

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
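The file size guard in this commit can be pictured with a minimal sketch. The two thresholds come from the commit message; the function name, the exception class, and the choice to treat the thresholds as inclusive are assumptions for illustration, not the project's actual code.

```python
from typing import Optional

WARN_ROWS = 500_000      # warn threshold stated in the commit message
REFUSE_ROWS = 2_000_000  # refuse threshold stated in the commit message


class FileTooLargeError(Exception):
    """Hypothetical error for files over the hard limit."""


def check_row_count(n_rows: int, force: bool = False) -> Optional[str]:
    """Return a warning string for large files, None for small ones.

    Raises FileTooLargeError at or above REFUSE_ROWS unless force is set
    (assumed to mirror the --force flag).
    """
    if n_rows >= REFUSE_ROWS and not force:
        raise FileTooLargeError(
            f"{n_rows:,} rows exceeds {REFUSE_ROWS:,}; pass --force to continue"
        )
    if n_rows >= WARN_ROWS:
        return f"warning: {n_rows:,} rows; the audit may be slow"
    return None
```

With --force, an oversized file falls through to the warning branch instead of being refused, so the user still sees that the input is large.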
Users can now define detection rules in a JSON file and apply them via the --rules/-r flag. Rules run alongside built-in checks, with findings integrated into all 3 report formats (HTML, Excel, PDF). Supports 7 condition types: regex_match, not_regex_match, min_length, max_length, allowed_values, disallowed_values, max_missing_pct. Rules target columns by regex pattern or explicit list.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
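A rough sketch of how the seven condition types might be evaluated follows. The rule keys (`column_pattern`, `columns`, `condition`, `value`), the function names, and the reading of each condition (for example, that regex_match flags values that do not match) are assumptions; the actual JSON schema is not shown in this PR.

```python
import re
from typing import Any, Dict, List, Optional


def _is_missing(v: Optional[str]) -> bool:
    return v is None or str(v).strip() == ""


def check_column(values: List[Optional[str]], condition: str, arg: Any) -> List[int]:
    """Return the row indexes that violate one condition (assumed semantics)."""
    present = [(i, str(v)) for i, v in enumerate(values) if not _is_missing(v)]
    if condition == "regex_match":        # every value should match
        return [i for i, v in present if not re.search(arg, v)]
    if condition == "not_regex_match":    # no value should match
        return [i for i, v in present if re.search(arg, v)]
    if condition == "min_length":
        return [i for i, v in present if len(v) < arg]
    if condition == "max_length":
        return [i for i, v in present if len(v) > arg]
    if condition == "allowed_values":
        return [i for i, v in present if v not in arg]
    if condition == "disallowed_values":
        return [i for i, v in present if v in arg]
    if condition == "max_missing_pct":    # column-level check
        missing = [i for i, v in enumerate(values) if _is_missing(v)]
        pct = 100.0 * len(missing) / len(values) if values else 0.0
        return missing if pct > arg else []
    raise ValueError(f"unknown condition: {condition}")


def apply_rule(rule: Dict[str, Any], table: Dict[str, List[Optional[str]]]) -> Dict[str, List[int]]:
    """Pick target columns by explicit list or regex, then check each one."""
    if "columns" in rule:
        targets = [c for c in rule["columns"] if c in table]
    else:
        targets = [c for c in table if re.search(rule["column_pattern"], c)]
    return {c: check_column(table[c], rule["condition"], rule["value"]) for c in targets}
```

A rule loaded with `json.load` would arrive as the same kind of dict, so `apply_rule` could be called once per rule and its findings merged into the report data.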
Column profiling:
- Add cardinality, uniqueness %, avg/min/max length per field
- Numeric stats (min/max/mean/median) for currency and ID columns
- Stats rendered in HTML, Excel, PDF reports
- ColumnProfile dataclass in typed API

Multi-file / directory mode:
- --input accepts directories and glob patterns
- Each file gets its own report set
- run_multi_audit() API with weighted overall score

CI / pipeline integration:
- --fail-under flag: exit 1 if health score < threshold
- --sarif flag: SARIF 2.1.0 output for GitHub Code Scanning
- GitHub Action composite action (.github/actions/audit/)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
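The CI pieces described in this commit can be sketched as follows. This is an illustration under assumptions, not the project's implementation: the function names, the finding keys (`rule_id`, `message`, `file`), the tool name, and weighting the overall score by row count are all hypothetical. Only the SARIF 2.1.0 envelope fields and the exit-1-below-threshold behaviour come from the SARIF format and the commit message.

```python
from typing import Any, Dict, List, Optional, Tuple


def to_sarif(findings: List[Dict[str, str]], tool_name: str = "audit-tool") -> Dict[str, Any]:
    """Build a minimal SARIF 2.1.0 log with one run, its rules, and results."""
    rule_ids = sorted({f["rule_id"] for f in findings})
    return {
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "version": "2.1.0",
        "runs": [{
            "tool": {"driver": {"name": tool_name,
                                "rules": [{"id": r} for r in rule_ids]}},
            "results": [{
                "ruleId": f["rule_id"],
                "level": "warning",
                "message": {"text": f["message"]},
                "locations": [{"physicalLocation": {
                    "artifactLocation": {"uri": f["file"]}}}],
            } for f in findings],
        }],
    }


def weighted_score(per_file: List[Tuple[int, float]]) -> float:
    """Overall score from (row_count, score) pairs; row weighting is an assumption."""
    total_rows = sum(rows for rows, _ in per_file)
    if total_rows == 0:
        return 0.0
    return sum(rows * score for rows, score in per_file) / total_rows


def exit_code(score: float, fail_under: Optional[float]) -> int:
    """Return 1 when a threshold is set and the score falls below it."""
    return 1 if fail_under is not None and score < fail_under else 0
```

Serializing the dict with `json.dumps` gives a file in the shape GitHub Code Scanning accepts for upload, and passing `exit_code(...)` to `sys.exit` lets a pipeline step fail on low scores.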
MsShawnP added a commit that referenced this pull request May 16, 2026
Session summary - 2026-05-16

Starting point: Sprints 5-7 merged (PR #10). --export-fixes implemented but uncommitted.

What we did:
- Sprint 8: Added --export-fixes CHANGELOG entry, mypy type checking in CI, fixed dataclasses.field shadowing bug and other type errors, expanded PyPI metadata (classifiers, keywords, URLs)
- Sprint 9: Created automated PyPI publish workflow (OIDC trusted publisher), refreshed README with "data linter" positioning and quick-start install
- Published v1.0.0 to PyPI (debugged OIDC case sensitivity issue)
- Sprint 10: Bumped to v1.1.0, implemented n-gram blocking algorithm to scale fuzzy Levenshtein matching from 500 to 50,000 rows
- Published v1.1.0 to PyPI
- Considered and rejected package rename

What worked: Tiered mypy enforcement (strict on public API, relaxed on internals) caught real bugs without requiring full annotation. N-gram blocking algorithm (trigram inverted index + candidate caps) achieved linear-ish scaling. OIDC trusted publishing eliminates API token management.

What didn't work:
- PyPI OIDC: case sensitivity mismatch (GitHub sends lowercase repo name)
- Fuzzy test: synthetic "Person{i}" names classified as IDs, excluded from matching
- Rebase after squash merge: creates conflicts from redundant commits

State: v1.1.0 live on PyPI. 220 tests, ruff + mypy clean. CI: lint, type check, tests. Automated release: push v* tag -> publish. All audit items done.

Next: Project shipped. Options: use on real consulting data, collect feedback, create GitHub Release with notes for v1.1.0, or move to another project.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
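The n-gram blocking idea mentioned in the summary (trigram inverted index plus candidate caps) can be sketched as below. This is a generic illustration of the technique, not the code shipped in v1.1.0: the function names, padding scheme, cap values, distance threshold, and the choice to skip exact duplicates are all assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple


def trigrams(s: str) -> Set[str]:
    """Character trigrams of a lowercased, padded string."""
    padded = f"  {s.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def fuzzy_pairs(values: List[str], max_dist: int = 2,
                max_bucket: int = 200, max_candidates: int = 50) -> List[Tuple[int, int, int]]:
    """Return (i, j, distance) for near-duplicate pairs with i < j."""
    # Inverted index: trigram -> indexes of the values containing it.
    index: Dict[str, List[int]] = defaultdict(list)
    for i, v in enumerate(values):
        for g in trigrams(v):
            index[g].append(i)

    pairs: List[Tuple[int, int, int]] = []
    for i, v in enumerate(values):
        shared: Counter = Counter()
        for g in trigrams(v):
            bucket = index[g]
            if len(bucket) > max_bucket:   # cap 1: skip very common trigrams
                continue
            for j in bucket:
                if j > i:
                    shared[j] += 1
        # Cap 2: only compare the candidates sharing the most trigrams.
        for j, _ in shared.most_common(max_candidates):
            d = levenshtein(v.lower(), values[j].lower())
            if 0 < d <= max_dist:          # exact duplicates assumed handled elsewhere
                pairs.append((i, j, d))
    return pairs
```

Because each value is compared only against a bounded number of candidates that share trigrams with it, the number of Levenshtein calls grows roughly with the number of rows rather than with its square, which is consistent with the "linear-ish" scaling described above.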
Summary
Sprint 5: Polish & DevEx
- AuditResult._raw type safety
- --version, --quiet, --force flags
- --force)

Sprint 6: Custom Rule Engine
- --rules flag with 7 condition types

Sprint 7: Profiling, Multi-file, CI
- ColumnProfile)
- --input ./dir/ or glob patterns, per-file reports, run_multi_audit() API
- --fail-under exit codes, --sarif for GitHub Code Scanning, GitHub Action

Test plan
- ruff check . clean
- --input samples/input/ audits both files in directory
- --fail-under 90 exits 1 on low-scoring data
- --sarif produces valid SARIF 2.1.0 with rules + results
- --version, --quiet work correctly

🤖 Generated with Claude Code