Sprints 5-7: Polish, custom rules, profiling, multi-file, CI #10

Merged
MsShawnP merged 3 commits into main from claude/youthful-vaughan-3f51e1 on May 16, 2026

MsShawnP commented May 16, 2026

Owner

Summary

Sprint 5: Polish & DevEx

  • Fix CLI issue count (fuzzy dupes + schema violations were missing)
  • Fix AuditResult._raw type safety
  • Add --version, --quiet, --force flags
  • File size guard (warn at 500K rows, refuse at 2M rows without --force)
  • Warning surfaced in CLI + JSON when fuzzy (Levenshtein) matching is skipped (>500 rows)
  • Python version raised to >=3.9
  • CHANGELOG.md added
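
The file size guard above could be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function name `check_row_count`, the return values, and the choice to treat both thresholds as inclusive are assumptions. Only the 500K and 2M row thresholds and the `--force` override come from the summary.

```python
WARN_ROWS = 500_000      # warn threshold from the PR summary
REFUSE_ROWS = 2_000_000  # refuse threshold from the PR summary


def check_row_count(row_count: int, force: bool = False) -> str:
    """Return "ok", "warn", or "refuse" for a file with row_count rows.

    Hypothetical helper: the real CLI may structure this differently.
    """
    # Refuse very large files unless the user passed --force.
    if row_count >= REFUSE_ROWS and not force:
        return "refuse"
    # Large (or forced very large) files still get a warning.
    if row_count >= WARN_ROWS:
        return "warn"
    return "ok"
```

With `force=True`, a file over the refuse threshold falls through to the warning branch instead of being rejected, so the user still sees that the file is large.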

Sprint 6: Custom Rule Engine

  • JSON-based rules via --rules flag with 7 condition types
  • Column targeting by regex pattern or explicit list
  • Findings render natively in HTML, Excel, PDF reports
  • Sample rules file + README documentation
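
To make the rule engine idea concrete, here is a hedged sketch of how JSON rules with these condition types might be evaluated. The seven condition names come from the commit message in this PR; the JSON keys (`name`, `columns`, `column_pattern`, `condition`, `value`), the function names, and the finding tuples are all assumptions made for illustration and may not match the shipped rules format.

```python
import json
import re

# Hypothetical rules file content; the key names are assumed, not the real schema.
RULES_JSON = """
[
  {"name": "zip_format", "columns": ["zip"],
   "condition": "regex_match", "value": "^[0-9]{5}$"},
  {"name": "status_values", "column_pattern": "^status",
   "condition": "allowed_values", "value": ["active", "inactive"]}
]
"""


def value_violates(condition, expected, cell):
    """Return True if a single non-missing cell breaks the condition."""
    if condition == "regex_match":
        return re.search(expected, cell) is None
    if condition == "not_regex_match":
        return re.search(expected, cell) is not None
    if condition == "min_length":
        return len(cell) < expected
    if condition == "max_length":
        return len(cell) > expected
    if condition == "allowed_values":
        return cell not in expected
    if condition == "disallowed_values":
        return cell in expected
    raise ValueError(f"unknown condition: {condition}")


def target_columns(rule, columns):
    """Select columns by explicit list, else by regex pattern."""
    if "columns" in rule:
        return [c for c in columns if c in rule["columns"]]
    pattern = re.compile(rule["column_pattern"])
    return [c for c in columns if pattern.search(c)]


def apply_rules(rules, rows):
    """rows: list of dicts mapping column name to string ("" means missing).

    Returns (rule name, column, row index or None) tuples.
    """
    findings = []
    columns = list(rows[0].keys()) if rows else []
    for rule in rules:
        for col in target_columns(rule, columns):
            cells = [row.get(col, "") for row in rows]
            if rule["condition"] == "max_missing_pct":
                # Column-level check: one finding for the whole column.
                missing_pct = 100.0 * sum(c == "" for c in cells) / len(cells)
                if missing_pct > rule["value"]:
                    findings.append((rule["name"], col, None))
                continue
            for i, cell in enumerate(cells):
                if cell != "" and value_violates(rule["condition"], rule["value"], cell):
                    findings.append((rule["name"], col, i))
    return findings
```

In this sketch, six of the conditions are checked per cell while `max_missing_pct` is checked once per column, which is one plausible way to reconcile the two kinds of rule.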

Sprint 7: Profiling, Multi-file, CI

  • Column profiling: cardinality, uniqueness %, lengths, numeric range in all reports + typed API (ColumnProfile)
  • Multi-file mode: --input ./dir/ or glob patterns, per-file reports, run_multi_audit() API
  • CI integration: --fail-under exit codes, --sarif for GitHub Code Scanning, GitHub Action
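
As an illustration of the CI integration, the sketch below builds a minimal SARIF 2.1.0 document and computes a `--fail-under` style exit code. The top-level SARIF layout (`version`, `runs`, `tool.driver`, `results`) follows the SARIF 2.1.0 format, and the "exit 1 if health score < threshold" behavior is taken from this PR. The function names, the shape of the `findings` dicts, the `"warning"` level, and the tool name `example-data-linter` are assumptions, not the project's actual API.

```python
import json


def build_sarif(findings, tool_name="example-data-linter"):
    """Build a minimal SARIF 2.1.0 document from hypothetical finding dicts.

    Each finding is assumed to have: rule_id, message, file, line.
    """
    rule_ids = sorted({f["rule_id"] for f in findings})
    results = []
    for f in findings:
        results.append({
            "ruleId": f["rule_id"],
            "level": "warning",
            "message": {"text": f["message"]},
            "locations": [{
                "physicalLocation": {
                    "artifactLocation": {"uri": f["file"]},
                    "region": {"startLine": f["line"]},
                }
            }],
        })
    return {
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "version": "2.1.0",
        "runs": [{
            "tool": {"driver": {"name": tool_name,
                                "rules": [{"id": r} for r in rule_ids]}},
            "results": results,
        }],
    }


def exit_code(health_score, fail_under=None):
    """Return 1 if a threshold is set and the score is below it, else 0."""
    if fail_under is not None and health_score < fail_under:
        return 1
    return 0
```

A caller would typically write `json.dumps(build_sarif(findings), indent=2)` to a file and pass the result of `exit_code(...)` to `sys.exit`, so the pipeline fails only when a threshold was requested and missed.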

Test plan

  • 212 tests pass (including 30 rule engine + 7 profiling + 4 rules integration)
  • ruff check . clean
  • --input samples/input/ audits both files in directory
  • --fail-under 90 exits 1 on low-scoring data
  • --sarif produces valid SARIF 2.1.0 with rules + results
  • Profile stats visible in HTML report
  • Custom rule findings in all 3 report formats
  • --version, --quiet work correctly

🤖 Generated with Claude Code

MsShawnP and others added 2 commits May 16, 2026 12:09
- Fix CLI issue count to include fuzzy duplicates and schema violations
  by extracting a shared count_issues() helper used by the CLI and the HTML report
- Fix AuditResult._raw to be a proper dataclass field (type-safe)
- Remove _load_sheets from __all__ (internal, not public API)
- Add --version/-V, --quiet/-q, --force flags
- Add file size guard (warn 500K rows, refuse 2M without --force)
- Surface warning when Levenshtein matching is skipped (>500 rows)
- Raise minimum Python from 3.8 to 3.9 (pyproject + ruff + README)
- Add CHANGELOG.md (Keep a Changelog format)
- Document all CLI flags in README options table

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Users can now define detection rules in a JSON file and apply them
via --rules/-r flag. Rules run alongside built-in checks with findings
integrated into all 3 report formats (HTML, Excel, PDF).

Supports 7 condition types: regex_match, not_regex_match, min_length,
max_length, allowed_values, disallowed_values, max_missing_pct.
Rules target columns by regex pattern or explicit list.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
MsShawnP changed the title from "Sprint 5: CLI polish, counting fix, and DevEx improvements" to "Sprint 5 + 6: CLI polish, counting fix, custom rule engine" on May 16, 2026
Column profiling:
- Add cardinality, uniqueness %, avg/min/max length per field
- Numeric stats (min/max/mean/median) for currency and ID columns
- Stats rendered in HTML, Excel, PDF reports
- ColumnProfile dataclass in typed API

Multi-file / directory mode:
- --input accepts directories and glob patterns
- Each file gets its own report set
- run_multi_audit() API with weighted overall score

CI / pipeline integration:
- --fail-under flag: exit 1 if health score < threshold
- --sarif flag: SARIF 2.1.0 output for GitHub Code Scanning
- GitHub Action composite action (.github/actions/audit/)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
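
The column profiling stats listed in this commit (cardinality, uniqueness %, min/max/avg length) could be computed along these lines. This is a sketch under stated assumptions: `ProfileSketch` and `profile_column` are hypothetical names and do not reproduce the PR's `ColumnProfile` dataclass, and defining uniqueness % as distinct values over non-missing values is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ProfileSketch:
    """Hypothetical stand-in for a per-column profile; fields are assumed."""
    cardinality: int        # number of distinct non-missing values
    uniqueness_pct: float   # distinct / non-missing, as a percentage
    min_length: int
    max_length: int
    avg_length: float


def profile_column(values):
    """Profile a list of string cells, treating "" as missing."""
    present = [v for v in values if v != ""]
    if not present:
        return ProfileSketch(0, 0.0, 0, 0, 0.0)
    lengths = [len(v) for v in present]
    distinct = len(set(present))
    return ProfileSketch(
        cardinality=distinct,
        uniqueness_pct=100.0 * distinct / len(present),
        min_length=min(lengths),
        max_length=max(lengths),
        avg_length=sum(lengths) / len(lengths),
    )
```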
MsShawnP changed the title from "Sprint 5 + 6: CLI polish, counting fix, custom rule engine" to "Sprints 5-7: Polish, custom rules, profiling, multi-file, CI" on May 16, 2026
@MsShawnP MsShawnP merged commit 7094388 into main May 16, 2026
3 checks passed
MsShawnP added a commit that referenced this pull request May 16, 2026
Session summary — 2026-05-16

Starting point: Sprints 5-7 merged (PR #10). --export-fixes implemented
but uncommitted.

What we did:
- Sprint 8: Added --export-fixes CHANGELOG entry, mypy type checking in CI,
  fixed dataclasses.field shadowing bug and other type errors, expanded PyPI
  metadata (classifiers, keywords, URLs)
- Sprint 9: Created automated PyPI publish workflow (OIDC trusted publisher),
  refreshed README with "data linter" positioning and quick-start install
- Published v1.0.0 to PyPI (debugged OIDC case sensitivity issue)
- Sprint 10: Bumped to v1.1.0, implemented n-gram blocking algorithm to scale
  fuzzy Levenshtein matching from 500 to 50,000 rows
- Published v1.1.0 to PyPI
- Considered and rejected package rename

What worked: Tiered mypy enforcement (strict on public API, relaxed on
internals) caught real bugs without requiring full annotation. N-gram
blocking algorithm (trigram inverted index + candidate caps) achieved
linear-ish scaling. OIDC trusted publishing eliminates API token management.
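
For readers unfamiliar with n-gram blocking, here is a small sketch of the general technique: a trigram inverted index proposes candidate pairs, a cap skips overly common trigrams, and Levenshtein distance is computed only on the candidates. It is not the implementation shipped in v1.1.0; the function names, the `max_bucket` cap, and the `max_distance` threshold are illustrative assumptions.

```python
from collections import defaultdict


def trigrams(s):
    """Return the set of lowercase character trigrams of s."""
    s = s.lower()
    if len(s) < 3:
        return {s}
    return {s[i:i + 3] for i in range(len(s) - 2)}


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def fuzzy_pairs(names, max_distance=2, max_bucket=50):
    """Return sorted index pairs (i, j), i < j, within max_distance edits."""
    # Inverted index: trigram -> indices of the names containing it.
    index = defaultdict(list)
    for i, name in enumerate(names):
        for gram in trigrams(name):
            index[gram].append(i)

    candidates = set()
    for bucket in index.values():
        if len(bucket) > max_bucket:
            continue  # candidate cap: very common trigrams are skipped
        for x in range(len(bucket)):
            for y in range(x + 1, len(bucket)):
                candidates.add((bucket[x], bucket[y]))

    # Only candidate pairs pay for the edit-distance computation.
    return sorted(
        (i, j) for i, j in candidates
        if levenshtein(names[i].lower(), names[j].lower()) <= max_distance
    )
```

The trade-off in this sketch is that pairs sharing no trigram, or sharing only capped trigrams, are never compared, so some true matches can be missed in exchange for avoiding the all-pairs comparison.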

What didn't work:
- PyPI OIDC: case sensitivity mismatch (GitHub sends lowercase repo name)
- Fuzzy test: synthetic "Person{i}" names classified as IDs, excluded from matching
- Rebase after squash merge: creates conflicts from redundant commits

State: v1.1.0 live on PyPI. 220 tests, ruff + mypy clean. CI: lint, type
check, tests. Automated release: push v* tag -> publish. All audit items done.

Next: Project shipped. Options: use on real consulting data, collect feedback,
create GitHub Release with notes for v1.1.0, or move to another project.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>