From 8806d9e1bc1f7a03d95645dba5d74a83c98afb7a Mon Sep 17 00:00:00 2001 From: MsShawnP Date: Sat, 16 May 2026 12:09:56 -0400 Subject: [PATCH 1/3] Sprint 5: CLI polish, counting fix, version/quiet flags, CHANGELOG - Fix CLI issue count to include fuzzy duplicates and schema violations by extracting shared count_issues() helper used by CLI, HTML report - Fix AuditResult._raw to be a proper dataclass field (type-safe) - Remove _load_sheets from __all__ (internal, not public API) - Add --version/-V, --quiet/-q, --force flags - Add file size guard (warn 500K rows, refuse 2M without --force) - Surface warning when Levenshtein matching is skipped (>500 rows) - Raise minimum Python from 3.8 to 3.9 (pyproject + ruff + README) - Add CHANGELOG.md (Keep a Changelog format) - Document all CLI flags in README options table Co-Authored-By: Claude Opus 4.6 --- AUDIT.md | 197 ++++++++++++++++++ CHANGELOG.md | 45 ++++ PLAN.md | 89 +++++++- README.md | 18 +- data_hygiene_auditor/__init__.py | 4 +- data_hygiene_auditor/api.py | 6 +- data_hygiene_auditor/cli.py | 101 +++++---- data_hygiene_auditor/core.py | 49 ++++- data_hygiene_auditor/detection.py | 7 + data_hygiene_auditor/reporting/html.py | 22 +- pyproject.toml | 4 +- .../sample_messy_data_audit_findings.xlsx | Bin 11964 -> 11855 bytes .../sample_messy_data_audit_report.html | 27 +-- .../output/sample_messy_data_audit_report.pdf | Bin 24473 -> 24239 bytes tests/test_integration.py | 52 +++++ 15 files changed, 534 insertions(+), 87 deletions(-) create mode 100644 CHANGELOG.md diff --git a/AUDIT.md b/AUDIT.md index 20c63a4..ff18979 100644 --- a/AUDIT.md +++ b/AUDIT.md @@ -290,3 +290,200 @@ Move #13. AI-powered fix suggestions. Only attempt after the foundation and pres - **Don't build a GUI/web app yet.** The interactive HTML report gives you most of the "explorable" benefit without the deployment/hosting/auth complexity. A web app is a different product. - **Don't chase pipeline integration** (dbt, Airflow, CI). Your audience is consultants with spreadsheets, not data engineers with warehouses. Pipeline integration dilutes your focus without serving your users. - **Don't refactor before testing.** The temptation is to restructure first (it's messy!), but write tests against the current behavior first. Then refactor with confidence. + +--- + +# Audit Round 2 (2026-05-16) + +All items from the 2025 audit were shipped (PRs #1-#9). This round assesses the project's current state after that work, with fresh landscape data. + +## Phase 1: Baseline Assessment (2026) +**Date:** 2026-05-16 +**Project:** Data Hygiene Auditor v1.0.0 + +### What Exists Today + +A well-structured Python CLI + library (10 modules, ~3,750 LOC) that scans Excel/CSV/TSV files for data quality issues and produces interactive HTML, Excel, and PDF reports. Features shipped since last audit: schema validation, trend comparison, vectorized detection (3.4x speedup), fuzzy duplicate matching, typed Python API, health scores, interactive HTML, fix suggestions. + +### Current Architecture + +| Module | LOC | Purpose | +|--------|-----|---------| +| `detection.py` | 654 | 7 detection engines | +| `reporting/html.py` | 841 | Interactive HTML report | +| `reporting/pdf.py` | 418 | PDF deliverable | +| `reporting/excel.py` | 335 | Excel findings file | +| `api.py` | 412 | Typed Python API (dataclasses) | +| `core.py` | 292 | Orchestrator + data loading | +| `suggestions.py` | 285 | Fix suggestion engine | +| `cli.py` | 202 | CLI with colored output | +| `schema.py` | 144 | Schema validation | +| `trend.py` | 103 | Trend comparison | +| **Tests** | 1,576 | 167 tests across 8 files | + +### Health Indicators + +| Dimension | Status | +|-----------|--------| +| Tests | 167 passing, all detection engines covered | +| CI | GitHub Actions: ruff + pytest on 3.9/3.12/3.13 | +| Packaging | pyproject.toml, pip-installable, `data-hygiene-audit` CLI | +| API | `audit_file()` with typed dataclasses, py.typed marker | +| Docs | Comprehensive README with screenshots and library examples | +| Performance | Vectorized detection, 3.4x improvement on large files | + +### Gap Analysis + +**Resolved from 2025 audit:** CSV support, tests, CI, packaging, interactive HTML, health score, vectorized perf, fuzzy matching, typed API, fix suggestions, schema validation, trend comparison — all shipped. + +**Remaining or new issues:** +1. CLI under-counts issues (missing fuzzy duplicates in total) +2. `_raw` attribute set outside dataclass `__init__` — type-unsafe +3. Tests import via backward-compat shim, not package directly +4. No type checker in CI despite py.typed marker +5. Python 3.8 claimed but untested +6. Fuzzy matching silently skipped above 500 rows +7. No CHANGELOG or release tags + +## Phase 2: Internal Review (2026) +**Date:** 2026-05-16 +**Dimensions:** Code Quality, Architecture, Tests, Documentation, Performance, Security, UX, DevEx + +### Top Opportunities + +| # | Finding | Dimension | Impact | Effort | Leverage | Severity | +|---|---------|-----------|--------|--------|----------|----------| +| 1 | CLI missing fuzzy_duplicates in issue count — under-reports total | Code Quality | 3 | 1 | 3.0 | bug | +| 2 | `AuditResult._raw` monkey-patched outside `__init__` | Code Quality | 4 | 1 | 4.0 | important | +| 3 | Issue-counting logic still duplicated 3x (cli, html, excel) | Code Quality | 3 | 1 | 3.0 | important | +| 4 | `requires-python >= 3.8` but CI tests 3.9+ only | DevEx | 3 | 1 | 3.0 | important | +| 5 | Tests import from `audit` shim, not `data_hygiene_auditor` | Tests | 3 | 2 | 1.5 | important | +| 6 | No type checker in CI despite py.typed marker | DevEx | 3 | 2 | 1.5 | important | +| 7 | Levenshtein O(n²) hard-capped at 500 rows — silently skips | Performance | 3 | 3 | 1.0 | important | +| 8 | No file size guard — OOM on large crafted input | Security | 3 | 2 | 1.5 | important | +| 9 | `_load_sheets` exported in public `__all__` | Architecture | 2 | 1 | 2.0 | minor | +| 10 | `--schema`/`--baseline` undocumented in README options table | Documentation | 2 | 1 | 2.0 | minor | +| 11 | No `--quiet`/`--version` flags | UX | 2 | 1 | 2.0 | minor | +| 12 | No CHANGELOG | Documentation | 2 | 1 | 2.0 | minor | + +### Summary + +The project is in strong shape. The 2025 audit's critical issues (monolith, no tests, XSS, no CSV, no packaging) are all resolved. What remains is polish-tier work: a counting bug, a type safety issue, test import paths, and CI completeness. The architecture is clean and the detection logic is solid. + +## Phase 3: Landscape Scan (2026) +**Date:** 2026-05-16 +**Method:** Web research (verified through May 2026) + +### Key Landscape Changes (2025 → 2026) + +1. **ydata-profiling rebranded to fg-data-profiling** (v4.19.1, Apr 2026). Package/import renamed. Signals stewardship instability. +2. **GX added ExpectAI** — AI-generated expectations from data patterns. Possible acquisition May 2026 (unconfirmed). +3. **Data contracts became dominant framing** — Soda Core repositioned as "Data Contracts engine." Irrelevant to file-audit use case. +4. **Enterprise consolidation** — Metaplane → Datadog, SYNQ → Coalesce, Select Star → Snowflake. Affects $50K+ tier only. +5. **AI/LLM integration is commercial-tier only** — ExpectAI, SodaGPT. No OSS tool has AI fix suggestions. Window still open. +6. **DQX (Databricks Labs)** — new PySpark-native DQ framework. Not relevant to file-based auditing. +7. **DQOps** — OSS + commercial ($499/mo). 150+ built-in checks. Warehouse-only, no file support. + +### Competitive Position (2026) + +**Unique to this project (confirmed still unmatched):** +- Placeholder/test data detection +- Misused field detection (cross-column semantic validation) +- Triple output format (HTML + Excel + PDF) +- Severity ratings + plain-English explanations for non-technical stakeholders +- Health score (0-100) +- Deterministic fix suggestions with copy-paste code +- Schema validation + trend comparison (closes previous gaps) + +**The consultant gap remains completely unoccupied.** Every competitor is a warehouse connector for engineers, a profiler for data scientists, or an interactive GUI for researchers. No tool takes a file and produces a credentialed audit report with severity ratings and fix language for a client meeting. + +### Feature Parity Check + +| Table Stakes | Status | +|-------------|--------| +| CSV/TSV support | ✅ Shipped | +| Null/completeness analysis | ✅ | +| CLI + Python API | ✅ Both | +| Large file handling (100K+) | 🟡 Vectorized but fuzzy capped at 500 | +| Interactive report | ✅ Filters, search, TOC, collapsible | + +## Phase 4: Differentiation & Next Moves (2026) +**Date:** 2026-05-16 + +### Cross-Reference Summary + +The situation has inverted since the 2025 audit. A year ago, the project had strong detection but weak everything else. Now: +- **Foundation:** solid (tests, CI, packaging, clean architecture) +- **Presentation:** strong (interactive HTML, health score ring, fix suggestions) +- **Detection:** comprehensive (7 engines + schema + trend) +- **Competitive position:** unique and uncontested + +The remaining work is no longer transformative — it's **incremental quality improvements and strategic positioning**. The highest-impact moves are now about reach (getting the tool in front of users) and polish (fixing the few rough edges that undermine professional credibility). + +### Ranked Next Moves + +| # | Move | Category | Strategic | Internal | Effort | Score | Description | +|---|------|----------|-----------|----------|--------|-------|-------------| +| 1 | Fix CLI fuzzy dup counting bug | Correctness | 1 | 4 | 1 | 5.0 | CLI under-reports total issues by omitting fuzzy duplicates from count. One missing loop. | +| 2 | Fix `_raw` type safety | Code Quality | 1 | 3 | 1 | 4.0 | Move `_raw` into `AuditResult.__init__` as a proper field. Fixes mypy, IDE autocomplete. | +| 3 | Extract shared issue-counting helper | Code Quality | 1 | 3 | 1 | 4.0 | Single function used by CLI, HTML, and Excel. Prevents future counting bugs. | +| 4 | Document `--schema`/`--baseline` in README | Documentation | 2 | 2 | 1 | 4.0 | Features exist but aren't discoverable in README options table. | +| 5 | Add `--version` and `--quiet` flags | UX | 2 | 2 | 1 | 4.0 | Professional CLI conventions. `--quiet` enables scripted/CI usage. | +| 6 | Align Python version (drop 3.8 claim or add CI) | DevEx | 2 | 3 | 1 | 5.0 | Either add 3.8 to CI matrix or bump requires-python to >=3.9. | +| 7 | Add mypy/pyright to CI | DevEx | 2 | 3 | 2 | 2.5 | py.typed marker promises type safety — CI should enforce it. | +| 8 | Migrate test imports to `data_hygiene_auditor` | Tests | 1 | 3 | 2 | 2.0 | Tests should exercise the package, not the backward-compat shim. | +| 9 | Warn when fuzzy matching is skipped (>500 rows) | UX | 3 | 2 | 1 | 5.0 | User should know a detection pass was omitted on large sheets. | +| 10 | Scale fuzzy matching beyond 500 rows | Performance | 4 | 3 | 3 | 2.3 | Locality-sensitive hashing or blocking strategy to handle 10K+ rows. | +| 11 | Add CHANGELOG and release tagging | DevEx | 3 | 2 | 1 | 5.0 | Version tracking for users. Signal active maintenance. | +| 12 | PyPI publication | Reach | 5 | 1 | 2 | 3.0 | `pip install data-hygiene-auditor` from anywhere. Major discoverability boost. | +| 13 | "Data linter" positioning + README refresh | Reach | 4 | 1 | 2 | 2.5 | Adopt the "linter for data" framing that resonates with the developer audience. Keywords for discoverability. | +| 14 | File size guard / row limit warning | Security | 2 | 2 | 1 | 4.0 | Warn at 500K rows, refuse at 2M unless `--force`. Prevents OOM. | +| 15 | Remove `_load_sheets` from public `__all__` | Architecture | 1 | 2 | 1 | 3.0 | Private helper shouldn't be in the public API surface. | + +### Recommended Sequence + +**Sprint 5: Bug Fixes & Polish (half day)** +Moves #1-6, #9, #11, #14, #15. All effort-1 items. Brings the project to "no rough edges" state. +- Fix CLI counting bug +- Fix `_raw` type safety +- Extract issue-counting helper +- Document `--schema`/`--baseline` in README +- Add `--version` and `--quiet` +- Align Python version requirement +- Warn on skipped fuzzy matching +- Add CHANGELOG +- File size guard +- Remove `_load_sheets` from `__all__` + +**Sprint 6: Engineering Rigor (1 day)** +Moves #7, #8. Type checking + test migration. +- Add mypy/pyright to CI +- Migrate test imports to package + +**Sprint 7: Reach (1-2 days)** +Moves #12, #13. Get the tool in front of users. +- Publish to PyPI +- README refresh with "data linter" positioning + +**Sprint 8: Scale (2-3 days)** +Move #10. Requires algorithmic work. +- Scale fuzzy matching with LSH or blocking + +### What NOT to Do (2026 Update) + +Previous "don't do" items that were done anyway and **worked out:** +- ~~Don't add schema validation~~ → Added (PR #9). Lightweight, optional, complements rather than competes with GX/pandera. **Correct call to add it.** + +Updated guidance: +- **Don't add statistical profiling.** fg-data-profiling still owns this despite the rebrand. Your strength is consulting-specific findings. +- **Don't build a web app.** The interactive HTML file is self-contained, shareable, and zero-deployment. A server-side app is a different product for a different audience. +- **Don't chase pipeline integration.** The market moved further toward warehouse-native observability (DQOps, Soda, GX Cloud). That's their game. Yours is file-native audit reports. +- **Don't add LLM-powered features yet.** The deterministic fix suggestions already work well. LLM adds latency, API key requirements, and cost for marginal improvement. Revisit when local models are fast enough to run offline. +- **Don't over-engineer the fuzzy cap.** The 500-row Levenshtein cap is a reasonable default for spreadsheet-sized data. Add a warning, not a complex distributed algorithm. Only invest in scaling if real users hit the limit. +- **Don't compete on star count or downloads.** The niche is small but uncontested. One glowing testimonial from a consultant who used it on a real engagement is worth more than 1K GitHub stars from drive-by visitors. + +### Strategic Summary + +The project has successfully executed its transformation from "Claude Chat artifact" to "genuinely differentiated tool." The 2025 audit's thesis — that the detection was the moat but needed a stage — has been validated. The stage is now built. The next phase is about **credibility and reach**: fixing the remaining rough edges, publishing to PyPI, and positioning the tool where its target audience (consultants, analysts, data teams inheriting messy spreadsheets) can find it. + +The competitive landscape has moved *away* from this project's niche (toward warehouse observability and data contracts), which is strategically favorable — it means less competition, not more. The window for "file-native, consultant-focused, severity-rated audit reports" remains wide open with no credible competitor in 2026. diff --git a/CHANGELOG.md b/CHANGELOG.md new file mode 100644 index 0000000..eafcddc --- /dev/null +++ b/CHANGELOG.md @@ -0,0 +1,45 @@ +# Changelog + +All notable changes to this project will be documented in this file. + +Format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). + +## [Unreleased] + +### Fixed +- CLI issue count now includes fuzzy duplicates and schema violations +- `AuditResult._raw` is a proper dataclass field (type-checker visible) + +### Added +- `--version` / `-V` flag +- `--quiet` / `-q` flag to suppress terminal output +- `--force` flag to override the 2M row safety limit +- `count_issues()` shared helper for consistent issue counting +- Warning when fuzzy (Levenshtein) matching is skipped due to row count +- File size guard: warns at 500K rows, refuses at 2M without `--force` + +### Changed +- Minimum Python version raised from 3.8 to 3.9 + +## [1.0.0] - 2026-05-09 + +### Added +- Schema validation via `--schema` flag with JSON schema files +- `--generate-schema` to infer and export a schema from audit results +- `--baseline` / `-b` for trend comparison against previous audits +- Trend deltas shown in CLI output and reports +- `--threshold` / `-t` flag for fuzzy duplicate similarity tuning +- Typed Python API (`audit_file()`, dataclass results, `py.typed`) +- Fuzzy duplicate detection (fingerprint clustering + Levenshtein) +- Health score algorithm (0–100, penalty-based) +- Interactive HTML report with collapsible sections +- Fix suggestion engine with copyable code snippets +- Vectorized detection for 3.4x speedup on large files +- CSV/TSV support alongside Excel +- PDF report output (reportlab) +- Excel findings export (sortable/filterable) +- Test suite (171 tests) and CI pipeline +- MIT license + +[Unreleased]: https://github.com/MsShawnP/Data-Hygiene-Auditor/compare/v1.0.0...HEAD +[1.0.0]: https://github.com/MsShawnP/Data-Hygiene-Auditor/releases/tag/v1.0.0 diff --git a/PLAN.md b/PLAN.md index 96e04cd..8f91015 100644 --- a/PLAN.md +++ b/PLAN.md @@ -1,8 +1,9 @@ # Data Hygiene Auditor — Improvement Plan -**Source:** Full project audit (2025-05-15) +**Source:** Full project audit (2025-05-15), re-audited 2026-05-16 **Tier:** Medium -**Status:** Complete — all sprints + stretch goal shipped (PRs #1-#6, 2025-05-15) +**Status:** Sprints 1-4 + stretch complete. Sprint 5 (polish) in progress. +**Current focus:** Sprint 5 — Bug fixes, polish, and DevEx improvements --- @@ -362,3 +363,87 @@ Generate actionable fix scripts or transformation suggestions for each finding. #### Context Phase 3 Category Trends: AI-powered fix suggestions are emerging but nobody does them well. This is the leapfrog opportunity — but only after foundation and presentation are solid. + +--- + +## Sprint 5: Polish & DevEx + +**Source:** Audit Round 2 (2026-05-16) +**Priority:** Next +**Estimated effort:** Half day + +### Decomposition: Sprint 5 + +Goal: Fix remaining rough edges so the project has zero known bugs and professional-grade CLI/packaging. + +All items are independent unless noted — can be done in any order. + +--- + +#### A: Fix issue-counting (bug + dedup) + +- [ ] A1: Extract shared issue-counting helper into `core.py` + - Depends on: none + - Done when: a function `count_issues(results) -> dict` exists in `core.py` that returns `{'total': N, 'High': N, 'Medium': N, 'Low': N}` counting all issue sources (field issues, phantom dupes, fuzzy dupes, schema violations); unit test passes +- [ ] A2: Fix CLI counting bug — add fuzzy duplicates to total + - Depends on: A1 + - Done when: `cli.py` uses the shared helper; running `data-hygiene-audit` on the sample file reports the same total as the HTML report +- [ ] A3: Migrate html.py and excel.py to use the shared helper + - Depends on: A1 + - Done when: `html.py` and `excel.py` import and use `count_issues()`; all tests pass; HTML report totals unchanged + +#### B: Fix `_raw` type safety + +- [ ] B1: Make `_raw` a proper field on `AuditResult` + - Depends on: none + - Done when: `AuditResult` has `_raw: Dict[str, Any] = field(repr=False, default_factory=dict)` (or `init=False`); `audit_file()` sets it normally; `mypy --strict data_hygiene_auditor/api.py` produces no `_raw` errors; all tests pass + +#### C: Public API cleanup + +- [ ] C1: Remove `_load_sheets` from `__all__` in `__init__.py` + - Depends on: none + - Done when: `_load_sheets` is not in `__all__`; `from data_hygiene_auditor import _load_sheets` still works (it's not deleted, just not advertised); tests pass + +#### D: CLI improvements + +- [ ] D1: Add `--version` flag + - Depends on: none + - Done when: `data-hygiene-audit --version` prints `data-hygiene-auditor 1.0.0`; test or manual verification passes +- [ ] D2: Add `--quiet` flag + - Depends on: none + - Done when: `data-hygiene-audit --input ... --output ... --quiet` produces no stdout (only writes files); exit code 0 on success; test confirms no output + +#### E: Detection warnings and guards + +- [ ] E1: Warn when fuzzy matching is skipped (>500 rows) + - Depends on: none + - Done when: running on a file with >500 rows prints a warning like "Note: Fuzzy matching skipped for sheet X (501 rows > 500 limit)"; warning included in JSON output as metadata; test confirms warning appears +- [ ] E2: Add file size / row count guard + - Depends on: none + - Done when: files >500K rows print a warning "Large file: N rows. Processing may be slow."; files >2M rows exit with error unless `--force` is passed; test confirms both behaviors + +#### F: DevEx alignment + +- [ ] F1: Align Python version requirement + - Depends on: none + - Done when: `requires-python` in pyproject.toml set to `>=3.9`; CI matrix remains 3.9/3.12/3.13; README updated if it mentions 3.8 +- [ ] F2: Add CHANGELOG.md + - Depends on: none + - Done when: `CHANGELOG.md` exists with entries for v1.0.0 (initial feature set) and unreleased section for current work; follows Keep a Changelog format + +#### G: Documentation + +- [ ] G1: Document `--schema`, `--baseline`, `--generate-schema` in README options table + - Depends on: none + - Done when: README options table includes all 7 flags (--input, --output, --json, --threshold, --schema, --baseline, --generate-schema) with descriptions + +--- + +### Sprint 5 complete when: + +- [ ] All sub-tasks checked off +- [ ] `pytest` passes (167+ tests) +- [ ] `ruff check .` passes +- [ ] `data-hygiene-audit --version` works +- [ ] `data-hygiene-audit --input samples/input/sample_messy_data.xlsx --output samples/output/ --quiet` produces files with no stdout +- [ ] CLI issue count matches HTML report issue count on sample data diff --git a/README.md b/README.md index 10ba2a7..9d91e2c 100644 --- a/README.md +++ b/README.md @@ -103,7 +103,13 @@ Supports `.xlsx`, `.xls`, `.csv`, and `.tsv` files. | `--input`, `-i` | Path to the file to audit — `.xlsx`, `.csv`, or `.tsv` (required) | | `--output`, `-o` | Directory for generated reports (required) | | `--json` | Also output the raw findings as structured JSON | -| `--threshold`, `-t` | Fuzzy duplicate similarity threshold, 0.0-1.0 (default: 0.85) | +| `--threshold`, `-t` | Fuzzy duplicate similarity threshold, 0.0–1.0 (default: 0.85) | +| `--schema`, `-s` | Path to a schema JSON for type/completeness validation | +| `--generate-schema` | Infer types from the data and save a schema JSON to the given path | +| `--baseline`, `-b` | Path to a previous audit JSON for trend comparison (shows deltas) | +| `--quiet`, `-q` | Suppress all terminal output (just write report files) | +| `--force` | Process files exceeding the 2M row safety limit | +| `--version`, `-V` | Print version and exit | ### Example @@ -115,16 +121,16 @@ python audit.py --input samples/input/sample_messy_data.xlsx --output ./reports Data Hygiene Auditor Auditing: samples/input/sample_messy_data.xlsx - [1/2] Analyzed sheet: Customers - [2/2] Analyzed sheet: Orders + [1/2] Analyzed sheet: Customers (score: 42) + [2/2] Analyzed sheet: Orders (score: 68) Generating reports... HTML -> ./reports/sample_messy_data_audit_report.html Excel -> ./reports/sample_messy_data_audit_findings.xlsx PDF -> ./reports/sample_messy_data_audit_report.pdf - Audit complete: 59 issues found - High: 23 | Medium: 20 | Low: 16 + Health Score: 55/100 + 59 issues found — High: 23 | Medium: 20 | Low: 16 ``` ## Use as a Library @@ -179,7 +185,7 @@ python generate_sample.py ## Requirements -- Python 3.8+ +- Python 3.9+ - pandas - openpyxl - reportlab diff --git a/data_hygiene_auditor/__init__.py b/data_hygiene_auditor/__init__.py index e400a3a..ddd792c 100644 --- a/data_hygiene_auditor/__init__.py +++ b/data_hygiene_auditor/__init__.py @@ -12,7 +12,7 @@ TrendData, audit_file, ) -from .core import SUPPORTED_EXTENSIONS, WHY_IT_MATTERS, _load_sheets, run_audit +from .core import SUPPORTED_EXTENSIONS, WHY_IT_MATTERS, _load_sheets, count_issues, run_audit # noqa: F401 from .detection import ( analyze_fuzzy_duplicates, analyze_mixed_formats, @@ -39,7 +39,7 @@ 'SheetResult', 'TrendData', 'run_audit', - '_load_sheets', + 'count_issues', 'SUPPORTED_EXTENSIONS', 'WHY_IT_MATTERS', 'infer_field_type', diff --git a/data_hygiene_auditor/api.py b/data_hygiene_auditor/api.py index 84d71f8..082eccc 100644 --- a/data_hygiene_auditor/api.py +++ b/data_hygiene_auditor/api.py @@ -141,6 +141,7 @@ class AuditResult: overall_score: int sheets: List[SheetResult] = field(default_factory=list) trend: Optional[TrendData] = None + _raw: Dict[str, Any] = field(default_factory=dict, repr=False) @property def total_issues(self) -> int: @@ -401,12 +402,11 @@ def audit_file( sheets=raw_trend.get('sheets', {}), ) - result = AuditResult( + return AuditResult( input_file=raw['input_file'], audit_timestamp=raw['audit_timestamp'], overall_score=raw['overall_score'], sheets=sheets, trend=trend_obj, + _raw=raw, ) - result._raw = raw - return result diff --git a/data_hygiene_auditor/cli.py b/data_hygiene_auditor/cli.py index f518e2a..a33ed8f 100644 --- a/data_hygiene_auditor/cli.py +++ b/data_hygiene_auditor/cli.py @@ -4,10 +4,9 @@ import json import os import sys -from collections import Counter from pathlib import Path -from .core import SUPPORTED_EXTENSIONS, run_audit +from .core import SUPPORTED_EXTENSIONS, count_issues, run_audit from .reporting import generate_excel, generate_html, generate_pdf @@ -30,6 +29,15 @@ def _c(text, code): return f"\033[{code}m{text}\033[0m" +def _get_version(): + """Get package version from metadata.""" + from importlib.metadata import PackageNotFoundError, version + try: + return version('data-hygiene-auditor') + except PackageNotFoundError: + return '1.0.0' + + def main(): parser = argparse.ArgumentParser( description=( @@ -49,6 +57,10 @@ def main(): - audit_report.pdf (email-ready deliverable) """, ) + parser.add_argument( + '--version', '-V', action='version', + version=f'%(prog)s {_get_version()}', + ) parser.add_argument( '--input', '-i', required=True, help='Path to input file (.xlsx, .csv, .tsv)', @@ -77,6 +89,14 @@ def main(): '--baseline', '-b', help='Path to previous audit JSON for trend comparison', ) + parser.add_argument( + '--quiet', '-q', action='store_true', + help='Suppress all terminal output (just write report files)', + ) + parser.add_argument( + '--force', action='store_true', + help='Process files exceeding the 2M row safety limit', + ) args = parser.parse_args() if not os.path.exists(args.input): @@ -98,9 +118,31 @@ def main(): os.makedirs(args.output, exist_ok=True) + def _log(msg=''): + if not args.quiet: + print(msg) + + from .core import _load_sheets + ROW_WARN = 500_000 + ROW_LIMIT = 2_000_000 + sheets_preview = _load_sheets(args.input) + total_rows = sum(len(df) for df in sheets_preview.values()) + if total_rows > ROW_LIMIT and not args.force: + print( + f"Error: File has {total_rows:,} rows (limit: {ROW_LIMIT:,})." + f" Use --force to process anyway.", + file=sys.stderr, + ) + sys.exit(1) + if total_rows > ROW_WARN: + _log( + f" {_c('Warning:', '33')} Large file ({total_rows:,} rows)." + f" Processing may be slow." + ) + basename = Path(args.input).stem - print(f"\n {_c('Data Hygiene Auditor', '1')}") - print(f" Auditing: {_c(args.input, '36')}\n") + _log(f"\n {_c('Data Hygiene Auditor', '1')}") + _log(f" Auditing: {_c(args.input, '36')}\n") results = run_audit( args.input, @@ -112,7 +154,7 @@ def main(): for i, (name, sdata) in enumerate(results['sheets'].items(), 1): score = sdata['health_score'] score_color = '32' if score >= 90 else ('33' if score >= 70 else '31') - print( + _log( f" [{i}/{sheet_count}] Analyzed sheet: {_c(name, '36')}" f" (score: {_c(str(score), score_color)})" ) @@ -127,16 +169,16 @@ def main(): args.output, f"{basename}_audit_report.pdf", ) - print("\n Generating reports...") + _log("\n Generating reports...") generate_html(results, html_path) - print(f" {_c('HTML', '32')} -> {html_path}") + _log(f" {_c('HTML', '32')} -> {html_path}") generate_excel(results, xlsx_path) - print(f" {_c('Excel', '32')} -> {xlsx_path}") + _log(f" {_c('Excel', '32')} -> {xlsx_path}") generate_pdf(results, pdf_path) - print(f" {_c('PDF', '32')} -> {pdf_path}") + _log(f" {_c('PDF', '32')} -> {pdf_path}") if args.json: json_path = os.path.join( @@ -144,34 +186,21 @@ def main(): ) with open(json_path, 'w') as f: json.dump(results, f, indent=2, default=str) - print(f" {_c('JSON', '32')} -> {json_path}") + _log(f" {_c('JSON', '32')} -> {json_path}") if args.generate_schema: from .schema import generate_schema schema_data = generate_schema(results) with open(args.generate_schema, 'w') as f: json.dump(schema_data, f, indent=2) - print(f" {_c('Schema', '32')} -> {args.generate_schema}") - - total_issues = 0 - severity_totals = Counter() - schema_count = 0 - for sheet in results['sheets'].values(): - for field in sheet['fields'].values(): - for issue in field['issues']: - total_issues += 1 - severity_totals[issue['severity']] += 1 - for d in sheet['phantom_duplicates']: - total_issues += 1 - severity_totals[d['severity']] += 1 - for sv in sheet.get('schema_violations', []): - total_issues += 1 - severity_totals[sv['severity']] += 1 - schema_count += 1 - - high = severity_totals.get('High', 0) - med = severity_totals.get('Medium', 0) - low = severity_totals.get('Low', 0) + _log(f" {_c('Schema', '32')} -> {args.generate_schema}") + + counts = count_issues(results) + total_issues = counts.get('total', 0) + high = counts.get('High', 0) + med = counts.get('Medium', 0) + low = counts.get('Low', 0) + schema_count = counts.get('schema', 0) overall = results['overall_score'] score_color = '32' if overall >= 90 else ('33' if overall >= 70 else '31') @@ -182,7 +211,7 @@ def main(): delta = trend['overall_score_delta'] arrow = _c(f'+{delta}', '32') if delta > 0 else _c(f'{delta}', '31') if delta < 0 else '=' score_str += f" ({arrow} from baseline)" - print( + _log( f"\n Health Score: {_c(score_str, score_color)}" ) issue_line = ( @@ -196,7 +225,9 @@ def main(): if td != 0: sign = '+' if td > 0 else '' issue_line += f" ({sign}{td} from baseline)" - print(issue_line) + _log(issue_line) if schema_count: - print(f" Schema violations: {_c(str(schema_count), '31')}") - print() + _log(f" Schema violations: {_c(str(schema_count), '31')}") + for w in results.get('warnings', []): + _log(f" {_c('Note:', '33')} {w['message']}") + _log() diff --git a/data_hygiene_auditor/core.py b/data_hygiene_auditor/core.py index 3f5019a..48997b6 100644 --- a/data_hygiene_auditor/core.py +++ b/data_hygiene_auditor/core.py @@ -77,6 +77,36 @@ SUPPORTED_EXTENSIONS = {'.xlsx', '.xls', '.csv', '.tsv'} +def count_issues(results): + """Count total and per-severity issues across all sheets. + + Counts all issue sources: field issues, phantom duplicates, + fuzzy duplicates, and schema violations. + + Returns dict with keys: 'total', 'High', 'Medium', 'Low', 'schema'. + """ + from collections import Counter + totals = Counter() + schema_count = 0 + for sheet in results['sheets'].values(): + for field_data in sheet['fields'].values(): + for issue in field_data['issues']: + totals['total'] += 1 + totals[issue['severity']] += 1 + for d in sheet['phantom_duplicates']: + totals['total'] += 1 + totals[d['severity']] += 1 + for f in sheet.get('fuzzy_duplicates', []): + totals['total'] += 1 + totals[f['severity']] += 1 + for sv in sheet.get('schema_violations', []): + totals['total'] += 1 + totals[sv['severity']] += 1 + schema_count += 1 + totals['schema'] = schema_count + return dict(totals) + + def _load_sheets(input_path): """Load tabular data as a dict of {sheet_name: DataFrame}.""" ext = Path(input_path).suffix.lower() @@ -213,17 +243,32 @@ def run_audit(input_path, fuzzy_threshold=0.85, schema_path=None, baseline_path= frozenset(i - 2 for i in d['rows']) for d in dupes ] - fuzzy = analyze_fuzzy_duplicates( + fuzzy_raw = analyze_fuzzy_duplicates( df, sheet_name, field_types, threshold=fuzzy_threshold, phantom_row_sets=phantom_row_sets, ) - for f in fuzzy: + fuzzy = [] + for f in fuzzy_raw: + if f.get('type') == '_levenshtein_skipped': + results.setdefault('warnings', []).append({ + 'type': 'levenshtein_skipped', + 'sheet': sheet_name, + 'unmatched_rows': f['unmatched_count'], + 'limit': f['limit'], + 'message': ( + f"Fuzzy (Levenshtein) matching skipped for sheet" + f" '{sheet_name}': {f['unmatched_count']} unmatched" + f" rows exceeds the {f['limit']}-row limit." + ), + }) + continue f['severity'] = rate_severity('fuzzy_duplicate', f) f['why'] = WHY_IT_MATTERS['fuzzy_duplicate'] fix = generate_dup_fix('fuzzy_duplicate', f, sheet_name) if fix: f['fix'] = fix + fuzzy.append(f) sheet_results['fuzzy_duplicates'] = fuzzy if schema: diff --git a/data_hygiene_auditor/detection.py b/data_hygiene_auditor/detection.py index f74ad60..8dfd7f3 100644 --- a/data_hygiene_auditor/detection.py +++ b/data_hygiene_auditor/detection.py @@ -549,6 +549,13 @@ def analyze_fuzzy_duplicates( skip = already_matched | fp_matched unmatched = [i for i in range(len(df)) if i not in skip] + if len(unmatched) > 500: + findings.append({ + 'type': '_levenshtein_skipped', + 'unmatched_count': len(unmatched), + 'limit': 500, + }) + if len(unmatched) >= 2 and len(unmatched) <= 500: norm_strings = {} for idx in unmatched: diff --git a/data_hygiene_auditor/reporting/html.py b/data_hygiene_auditor/reporting/html.py index 3d2dc1f..136c814 100644 --- a/data_hygiene_auditor/reporting/html.py +++ b/data_hygiene_auditor/reporting/html.py @@ -1,9 +1,10 @@ """HTML report generator.""" import json -from collections import Counter from html import escape as _html_escape +from ..core import count_issues + def _h(val): """Escape a value for safe inclusion in HTML text or attributes.""" @@ -30,22 +31,9 @@ def _render_fix(fix): def generate_html(results, output_path): """Generate a client-readable HTML report.""" - total_issues = 0 - severity_totals = Counter() - for sheet in results['sheets'].values(): - for field in sheet['fields'].values(): - for issue in field['issues']: - total_issues += 1 - severity_totals[issue['severity']] += 1 - for d in sheet['phantom_duplicates']: - total_issues += 1 - severity_totals[d['severity']] += 1 - for f in sheet.get('fuzzy_duplicates', []): - total_issues += 1 - severity_totals[f['severity']] += 1 - for sv in sheet.get('schema_violations', []): - total_issues += 1 - severity_totals[sv['severity']] += 1 + counts = count_issues(results) + total_issues = counts.get('total', 0) + severity_totals = counts parts = [] parts.append(f""" diff --git a/pyproject.toml b/pyproject.toml index 8368447..c647a90 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -8,7 +8,7 @@ version = "1.0.0" description = "Detect data quality issues in Excel and CSV files — mixed formats, misused fields, placeholder floods, and phantom duplicates" readme = "README.md" license = {text = "MIT"} -requires-python = ">=3.8" +requires-python = ">=3.9" authors = [ {name = "Lailara LLC"}, ] @@ -42,7 +42,7 @@ data-hygiene-audit = "data_hygiene_auditor.cli:main" testpaths = ["tests"] [tool.ruff] -target-version = "py38" +target-version = "py39" line-length = 120 exclude = ["generate_sample.py"] diff --git a/samples/output/sample_messy_data_audit_findings.xlsx b/samples/output/sample_messy_data_audit_findings.xlsx index aa8e2ccc5f1f51e8de02cf121cd2df7b83f8cf03..1988e90845d3e3ad4e8c090f4c659b983bb42c65 100644 GIT binary patch delta 1222 zcmdlJdp?FYz?+#xgn@y9gF!N3!$jT#oIon3?PS>7cN4E^)SvLnyZOOXug?HA=}Y<>Q2lijs5kDzN8?iC zGZ_c?WUEdocrmV-t9GpZ^aaxbb{~U7Pd=41o^_hDx<2VS--hg>6J?%XYvaGIwlsgI z@ZfFcf1~x$5%WDOdmpdQQH@o~yOI0pb=L2S8ymMh3XT0xs`#pTh3fMn-FxcSes|=PIW7U92Og=PmQWtdad-)Z!P^sVN*fVkhEtCD(bgNo?Z;!$`WKD zm|oXW_@>xNTi^Tcjj*(p877NGb5w(PpS#3`9dOzGh3S30*$21&rQ4Y6rd@l&VST~= z^;#3#{pFe(?OuU_ik$6BzssnsZtX2++^@0a)wGzi;RM9b0E_ebd>rHDQ0|!qZF5zXvd~Uo!Zsc8fRWx^U$F z7=x$(0&gE>+t~ieE%RXg`XuKAnOAF1{)}q3KFGt*acH$d$(>jw_n%WUCM^xx;&ChE zahi1AO26C(8L8%}%a_hv>bBfi+iCXciB7C9c51G0+nrY*6fa#b{q*^>^qT6=nfZ35 z&mK1XWaGH3dh4l^)2H?GC$G2qS>j;F`vCJ^By-H)qKYd?q=oaCr$~>&QQIJ!t}o6R zhc2uR;$6`m@KJWk-j~aDqL%!7>3FJgev#d*qICY;rfcseG>7zWI_K_vQC6qrf3c64 z?S)6i4);?H9X9{q@s;6y_@n&#m3BW1s-VmK?DqStPweLa;c6PSlK!qX~qYWPw6Nrpr%m<1_@x6hJkI3iX4;eIHa@# zycwB97~r`*<_3RH#Brd?GuO~HG))fFRR>#9tt%Z_ugt(uk)vN+Qkj!ltXGkngKp6D zuH7;RfYu2Ea|k!gS|HuP_)Lv~A-*Uzr&wPPM51Y&{9ad@>Ad=69z9vGWjcD&j8&68 k^fcu`{>fXevPX`Qfgzj)lot`8fiZ3J1U)sjt=b?l0O7Ly%m4rY delta 1304 zcmX>fvnQ4}z?+#xgn@y9gCVbC{Y2gaoIomO#RS{p_Y<#a)NA-E8K?R|BpQ5+u&W~ljg~LIDpUhoUe63 zMRirp$=71XR44g8{Gu)C_`D&yZ0?_1Z{B_9U$c63Rcy%WruE-ylvlZ3nf?9d*Y^9q zH>z^^p2X2#4>7GPswn4`|XAiViEqXC-`C`1l2vT{ymVqoCp zU|^77m^@!cs($L!*!)*ukDfhUz3fjG z|F3HhHoq)$D^`W=@b7u-XDoZR$Jm?}3Ou+jY3rPdr9o#;e1E#=NY@GOdkJgZr^ST0 zyj&OYbCaM4m-ea|@~n=}!dtGz*=>9O&c*Io_R);qa;7J{R&`X&tXFz`yHmWXUHp4V zsMhC0=a2J7K9srI|IF{jy!$&vwj7EIifcab^vG^5_sa%`OxBOpkFK@JYK}Z5{QUf~ z;%J{e2Lu!kw;GAL*qd26AM2U??8jd7ue)ZrRot=L5r6yW>!UV;oyOyW&VS(8 z=)EcJ)x7-tg8d7w{Lrk6;lsJ0^GuHDJJ{etdIDi(?z-P*{`{ilC$G>luNnF~4qUun7iO3H^do1b%+Evt3yVp8%7sPs zF~Qg5_fJ(-Pni_Ld-yol(wa)0FK)4KX56u?xYVa!`f1DWHQV>i*D4Kib(}Jn{bl70 zrx|@s6(YQ{9+B-+EGNqCP~yp38ngIK>#4p4KCALSg&e>8rA4Mhk^S6+;^GW6wze$0mTqADfR1$M7A(DvMG#VE6Tb1H+pJulbL~mA7L;<1EY<`

Data Hygiene Audit Report

-

sample_messy_data.xlsx — 2026-05-15 15:03:29

+

sample_messy_data.xlsx — 2026-05-16 12:00:51

@@ -357,15 +357,6 @@

Data Hygiene Audit Report

-
-
- ↑4 - vs baseline (2026-05-15 15:03:07) -
-
Score: 28 → 32
-
Issues: 63 → 59 (-4)
-
-
59
@@ -420,7 +411,7 @@

Sheet: Customers lambda x: "coded" if isinstance(x, str) and "-" in x else "numeric" )

-
+
FirstName name @@ -436,7 +427,7 @@

Sheet: Customers suspect = df.loc[mask, "FirstName"]

Medium Placeholder detected: "Test" appears 3 times (11.5%)
Why this matters: Placeholder values ("Test", "N/A", "TBD") that persist in production data inflate counts, skew averages, and create phantom records. They often indicate incomplete data entry or inadequate validation at the point of capture.
Suggested Fix (replace_placeholders)
Replace 3 placeholder values ("Test") in "FirstName" with NaN for proper missing-data handling
import numpy as np
 df["FirstName"] = df["FirstName"].replace("Test", np.nan)
Low Placeholder detected: "TBD" appears 1 times (3.8%)
Why this matters: Placeholder values ("Test", "N/A", "TBD") that persist in production data inflate counts, skew averages, and create phantom records. They often indicate incomplete data entry or inadequate validation at the point of capture.
Suggested Fix (replace_placeholders)
Replace 1 placeholder values ("TBD") in "FirstName" with NaN for proper missing-data handling
import numpy as np
 df["FirstName"] = df["FirstName"].replace("TBD", np.nan)
-
+
LastName name @@ -453,7 +444,7 @@

Sheet: Customers )

Medium Suspicious repetition: "Doe" appears 3 times (11.5%)
Why this matters: When the same value appears far more often than expected, it may indicate a default value that was never updated, a copy-paste error, or a system glitch that stamped the same data across multiple records.
Suggested Fix (flag_repetitions)
Flag 3 rows where "LastName" = "Doe" (11.5%) for manual review
df["_LastName_review"] = (
     df["LastName"] == "Doe"
 )
-
+
Email email @@ -473,7 +464,7 @@

Sheet: Customers )

Medium Suspicious repetition: "test@test.com" appears 3 times (11.5%)
Why this matters: When the same value appears far more often than expected, it may indicate a default value that was never updated, a copy-paste error, or a system glitch that stamped the same data across multiple records.
Suggested Fix (flag_repetitions)
Flag 3 rows where "Email" = "test@test.com" (11.5%) for manual review
df["_Email_review"] = (
     df["Email"] == "test@test.com"
 )
-
+
Phone phone @@ -497,7 +488,7 @@

Sheet: Customers )

Medium Suspicious repetition: "555-555-5555" appears 3 times (11.5%)
Why this matters: When the same value appears far more often than expected, it may indicate a default value that was never updated, a copy-paste error, or a system glitch that stamped the same data across multiple records.
Suggested Fix (flag_repetitions)
Flag 3 rows where "Phone" = "555-555-5555" (11.5%) for manual review
df["_Phone_review"] = (
     df["Phone"] == "555-555-5555"
 )
-
+
JoinDate date @@ -516,7 +507,7 @@

Sheet: Customers )

Medium Suspicious repetition: "2023-01-15" appears 3 times (11.5%)
Why this matters: When the same value appears far more often than expected, it may indicate a default value that was never updated, a copy-paste error, or a system glitch that stamped the same data across multiple records.
Suggested Fix (flag_repetitions)
Flag 3 rows where "JoinDate" = "2023-01-15" (11.5%) for manual review
df["_JoinDate_review"] = (
     df["JoinDate"] == "2023-01-15"
 )
-
+
AccountBalance currency @@ -557,7 +548,7 @@

Sheet: Customers df["Status"] = df["Status"].replace("TBD", np.nan)

High Suspicious repetition: "Active" appears 18 times (69.2%)
Why this matters: When the same value appears far more often than expected, it may indicate a default value that was never updated, a copy-paste error, or a system glitch that stamped the same data across multiple records.
Suggested Fix (flag_repetitions)
Flag 18 rows where "Status" = "Active" (69.2%) for manual review
df["_Status_review"] = (
     df["Status"] == "Active"
 )
-
+
ZipCode zipcode @@ -743,7 +734,7 @@

Sheet: Orders OrderIDCustomerIDOrderDateAmountShipDateStatus ORD-006CUST-0102023-01-01$0.002023-01-01TestORD-007CUST-0102023-01-01$0.002023-01-01Test
Why this matters: Exact duplicate rows are the clearest sign of a data quality issue — they can result from double-submissions, ETL failures, or missing unique constraints. Every duplicate inflates counts and distorts any metric built on this data.
Suggested Fix (drop_exact_duplicates)
Remove 2 exact duplicate rows (rows 7, 8)
df = df.drop_duplicates(keep="first").reset_index(drop=True)