MsShawnP · May 16, 2026
diff --git a/.github/actions/audit/action.yml b/.github/actions/audit/action.yml
@@ -0,0 +1,81 @@
+name: 'Data Hygiene Audit'
+description: 'Run data quality checks on Excel/CSV files and fail if score is too low'
+inputs:
+  file:
+    description: 'Path to input file or directory'
+    required: true
+  output:
+    description: 'Output directory for reports'
+    required: false
+    default: './audit-reports'
+  fail-under:
+    description: 'Minimum health score (0-100). Fails if score is below this.'
+    required: false
+    default: '0'
+  threshold:
+    description: 'Fuzzy duplicate similarity threshold (0.0-1.0)'
+    required: false
+    default: '0.85'
+  rules:
+    description: 'Path to custom rules JSON file'
+    required: false
+    default: ''
+  schema:
+    description: 'Path to schema JSON file'
+    required: false
+    default: ''
+outputs:
+  score:
+    description: 'Overall health score (0-100)'
+    value: ${{ steps.audit.outputs.score }}
+  issues:
+    description: 'Total number of issues found'
+    value: ${{ steps.audit.outputs.issues }}
+runs:
+  using: 'composite'
+  steps:
+    - name: Install Data Hygiene Auditor
+      shell: bash
+      run: pip install .
+
+    - name: Run audit
+      id: audit
+      shell: bash
+      env:
+        # Inputs are passed via env so the shell never re-parses their contents
+        IN_FILE: ${{ inputs.file }}
+        OUT_DIR: ${{ inputs.output }}
+        THRESHOLD: ${{ inputs.threshold }}
+        RULES: ${{ inputs.rules }}
+        SCHEMA: ${{ inputs.schema }}
+        FAIL_UNDER: ${{ inputs.fail-under }}
+      run: |
+        ARGS=(--input "$IN_FILE" --output "$OUT_DIR" --json --threshold "$THRESHOLD")
+        if [ -n "$RULES" ]; then ARGS+=(--rules "$RULES"); fi
+        if [ -n "$SCHEMA" ]; then ARGS+=(--schema "$SCHEMA"); fi
+        if [ "$FAIL_UNDER" != "0" ]; then ARGS+=(--fail-under "$FAIL_UNDER"); fi
+        # Defer a non-zero exit (e.g. from --fail-under) until the outputs are written
+        STATUS=0; data-hygiene-audit "${ARGS[@]}" || STATUS=$?
+        # Extract score from JSON output
+        SCORE=$(python -c "
+        import json, glob, os
+        files = glob.glob(os.path.join(os.environ['OUT_DIR'], '*_audit_results.json'))
+        score = 0
+        if files:
+            with open(files[0]) as f:
+                score = json.load(f)['overall_score']
+        print(score)
+        ")
+        echo "score=$SCORE" >> "$GITHUB_OUTPUT"
+        # Count issues
+        ISSUES=$(python -c "
+        import json, glob, os
+        from data_hygiene_auditor.core import count_issues
+        total = 0
+        for path in glob.glob(os.path.join(os.environ['OUT_DIR'], '*_audit_results.json')):
+            with open(path) as fh:
+                total += count_issues(json.load(fh)).get('total', 0)
+        print(total)
+        ")
+        echo "issues=$ISSUES" >> "$GITHUB_OUTPUT"
+        exit "$STATUS"
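Review note on the score extraction in `action.yml`: the step reads only the first `*_audit_results.json` file it finds, so when `file` points at a directory the reported `score` may come from an arbitrary file. A minimal sketch of one alternative is shown below, assuming one results file per audited input and the `overall_score` key that the action already reads. The function name `lowest_score` is hypothetical and not part of the package; reporting the minimum is one possible policy, not the author's stated intent.

```python
import json
from pathlib import Path


def lowest_score(output_dir: str) -> float:
    """Return the minimum overall_score across all result files, or 0.0 if none exist."""
    scores = []
    # Same naming pattern the action globs for; sorted() only makes the order deterministic.
    for path in sorted(Path(output_dir).glob("*_audit_results.json")):
        with path.open() as fh:
            scores.append(float(json.load(fh)["overall_score"]))
    return min(scores) if scores else 0.0
```

Taking the minimum means the gate reflects the worst file in the batch, which fits a "fail if any file is too dirty" reading of `fail-under`; an average would be a different, equally defensible choice.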
diff --git a/AUDIT.md b/AUDIT.md
@@ -290,3 +290,200 @@ Move #13. AI-powered fix suggestions. Only attempt after the foundation and pres
 - **Don't build a GUI/web app yet.** The interactive HTML report gives you most of the "explorable" benefit without the deployment/hosting/auth complexity. A web app is a different product.
 - **Don't chase pipeline integration** (dbt, Airflow, CI). Your audience is consultants with spreadsheets, not data engineers with warehouses. Pipeline integration dilutes your focus without serving your users.
 - **Don't refactor before testing.** The temptation is to restructure first (it's messy!), but write tests against the current behavior first. Then refactor with confidence.
+
+---
+
+# Audit Round 2 (2026-05-16)
+
+All items from the 2025 audit were shipped (PRs #1-#9). This round assesses the project's current state after that work, with fresh landscape data.
+
+## Phase 1: Baseline Assessment (2026)
+**Date:** 2026-05-16
+**Project:** Data Hygiene Auditor v1.0.0
+
+### What Exists Today
+
+A well-structured Python CLI + library (10 modules, ~3,750 LOC) that scans Excel/CSV/TSV files for data quality issues and produces interactive HTML, Excel, and PDF reports. Features shipped since the last audit: schema validation, trend comparison, vectorized detection (3.4x speedup), fuzzy duplicate matching, typed Python API, health scores, interactive HTML, fix suggestions.
+
+### Current Architecture
+
+| Module | LOC | Purpose |
+|--------|-----|---------|
+| `detection.py` | 654 | 7 detection engines |
+| `reporting/html.py` | 841 | Interactive HTML report |
+| `reporting/pdf.py` | 418 | PDF deliverable |
+| `reporting/excel.py` | 335 | Excel findings file |
+| `api.py` | 412 | Typed Python API (dataclasses) |
+| `core.py` | 292 | Orchestrator + data loading |
+| `suggestions.py` | 285 | Fix suggestion engine |
+| `cli.py` | 202 | CLI with colored output |
+| `schema.py` | 144 | Schema validation |
+| `trend.py` | 103 | Trend comparison |
+| **Tests** | 1,576 | 167 tests across 8 files |
+
+### Health Indicators
+
+| Dimension | Status |
+|-----------|--------|
+| Tests | 167 passing, all detection engines covered |
+| CI | GitHub Actions: ruff + pytest on 3.9/3.12/3.13 |
+| Packaging | pyproject.toml, pip-installable, `data-hygiene-audit` CLI |
+| API | `audit_file()` with typed dataclasses, py.typed marker |
+| Docs | Comprehensive README with screenshots and library examples |
+| Performance | Vectorized detection, 3.4x improvement on large files |
+
+### Gap Analysis
+
+**Resolved from 2025 audit:** CSV support, tests, CI, packaging, interactive HTML, health score, vectorized perf, fuzzy matching, typed API, fix suggestions, schema validation, trend comparison — all shipped.
+
+**Remaining or new issues:**
+1. CLI under-counts issues (missing fuzzy duplicates in total)
+2. `_raw` attribute set outside dataclass `__init__` — type-unsafe
+3. Tests import via backward-compat shim, not package directly
+4. No type checker in CI despite py.typed marker
+5. Python 3.8 claimed but untested
+6. Fuzzy matching silently skipped above 500 rows
+7. No CHANGELOG or release tags
+
+## Phase 2: Internal Review (2026)
+**Date:** 2026-05-16
+**Dimensions:** Code Quality, Architecture, Tests, Documentation, Performance, Security, UX, DevEx
+
+### Top Opportunities
+
+| # | Finding | Dimension | Impact | Effort | Leverage | Severity |
+|---|---------|-----------|--------|--------|----------|----------|
+| 1 | CLI missing fuzzy_duplicates in issue count — under-reports total | Code Quality | 3 | 1 | 3.0 | bug |
+| 2 | `AuditResult._raw` monkey-patched outside `__init__` | Code Quality | 4 | 1 | 4.0 | important |
+| 3 | Issue-counting logic still duplicated 3x (cli, html, excel) | Code Quality | 3 | 1 | 3.0 | important |
+| 4 | `requires-python >= 3.8` but CI tests 3.9+ only | DevEx | 3 | 1 | 3.0 | important |
+| 5 | Tests import from `audit` shim, not `data_hygiene_auditor` | Tests | 3 | 2 | 1.5 | important |
+| 6 | No type checker in CI despite py.typed marker | DevEx | 3 | 2 | 1.5 | important |
+| 7 | Levenshtein O(n²) pass hard-capped at 500 rows; larger sheets are silently skipped | Performance | 3 | 3 | 1.0 | important |
+| 8 | No file size guard — OOM on large crafted input | Security | 3 | 2 | 1.5 | important |
+| 9 | `_load_sheets` exported in public `__all__` | Architecture | 2 | 1 | 2.0 | minor |
+| 10 | `--schema`/`--baseline` undocumented in README options table | Documentation | 2 | 1 | 2.0 | minor |
+| 11 | No `--quiet`/`--version` flags | UX | 2 | 1 | 2.0 | minor |
+| 12 | No CHANGELOG | Documentation | 2 | 1 | 2.0 | minor |
+
+### Summary
+
+The project is in strong shape. The 2025 audit's critical issues (monolith, no tests, XSS, no CSV, no packaging) are all resolved. What remains is polish-tier work: a counting bug, a type safety issue, test import paths, and CI completeness. The architecture is clean and the detection logic is solid.
+
+## Phase 3: Landscape Scan (2026)
+**Date:** 2026-05-16
+**Method:** Web research (verified through May 2026)
+
+### Key Landscape Changes (2025 → 2026)
+
+1. **ydata-profiling rebranded to fg-data-profiling** (v4.19.1, Apr 2026). Package/import renamed. Signals stewardship instability.
+2. **GX added ExpectAI** — AI-generated expectations from data patterns. Possible acquisition May 2026 (unconfirmed).
+3. **Data contracts became dominant framing** — Soda Core repositioned as "Data Contracts engine." Irrelevant to file-audit use case.
+4. **Enterprise consolidation** — Metaplane → Datadog, SYNQ → Coalesce, Select Star → Snowflake. Affects $50K+ tier only.
+5. **AI/LLM integration is commercial-tier only** — ExpectAI, SodaGPT. No OSS tool has AI fix suggestions. Window still open.
+6. **DQX (Databricks Labs)** — new PySpark-native DQ framework. Not relevant to file-based auditing.
+7. **DQOps** — OSS + commercial ($499/mo). 150+ built-in checks. Warehouse-only, no file support.
+
+### Competitive Position (2026)
+
+**Unique to this project (confirmed still unmatched):**
+- Placeholder/test data detection
+- Misused field detection (cross-column semantic validation)
+- Triple output format (HTML + Excel + PDF)
+- Severity ratings + plain-English explanations for non-technical stakeholders
+- Health score (0-100)
+- Deterministic fix suggestions with copy-paste code
+- Schema validation + trend comparison (closes previous gaps)
+
+**The consultant gap remains completely unoccupied.** Every competitor is a warehouse connector for engineers, a profiler for data scientists, or an interactive GUI for researchers. No tool takes a file and produces a credentialed audit report with severity ratings and fix language for a client meeting.
+
+### Feature Parity Check
+
+| Table Stakes | Status |
+|-------------|--------|
+| CSV/TSV support | ✅ Shipped |
+| Null/completeness analysis | ✅ |
+| CLI + Python API | ✅ Both |
+| Large file handling (100K+) | 🟡 Detection is vectorized, but fuzzy matching is capped at 500 rows |
+| Interactive report | ✅ Filters, search, TOC, collapsible |
+
+## Phase 4: Differentiation & Next Moves (2026)
+**Date:** 2026-05-16
+
+### Cross-Reference Summary
+
+The situation has inverted since the 2025 audit. A year ago, the project had strong detection but weak everything else. Now:
+- **Foundation:** solid (tests, CI, packaging, clean architecture)
+- **Presentation:** strong (interactive HTML, health score ring, fix suggestions)
+- **Detection:** comprehensive (7 engines + schema + trend)
+- **Competitive position:** unique and uncontested
+
+The remaining work is no longer transformative — it's **incremental quality improvements and strategic positioning**. The highest-impact moves are now about reach (getting the tool in front of users) and polish (fixing the few rough edges that undermine professional credibility).
+
+### Ranked Next Moves
+
+| # | Move | Category | Strategic | Internal | Effort | Score | Description |
+|---|------|----------|-----------|----------|--------|-------|-------------|
+| 1 | Fix CLI fuzzy dup counting bug | Correctness | 1 | 4 | 1 | 5.0 | CLI under-reports total issues by omitting fuzzy duplicates from count. One missing loop. |
+| 2 | Fix `_raw` type safety | Code Quality | 1 | 3 | 1 | 4.0 | Move `_raw` into `AuditResult.__init__` as a proper field. Fixes mypy, IDE autocomplete. |
+| 3 | Extract shared issue-counting helper | Code Quality | 1 | 3 | 1 | 4.0 | Single function used by CLI, HTML, and Excel. Prevents future counting bugs. |
+| 4 | Document `--schema`/`--baseline` in README | Documentation | 2 | 2 | 1 | 4.0 | Features exist but aren't discoverable in README options table. |
+| 5 | Add `--version` and `--quiet` flags | UX | 2 | 2 | 1 | 4.0 | Professional CLI conventions. `--quiet` enables scripted/CI usage. |
+| 6 | Align Python version (drop 3.8 claim or add CI) | DevEx | 2 | 3 | 1 | 5.0 | Either add 3.8 to CI matrix or bump requires-python to >=3.9. |
+| 7 | Add mypy/pyright to CI | DevEx | 2 | 3 | 2 | 2.5 | py.typed marker promises type safety — CI should enforce it. |
+| 8 | Migrate test imports to `data_hygiene_auditor` | Tests | 1 | 3 | 2 | 2.0 | Tests should exercise the package, not the backward-compat shim. |
+| 9 | Warn when fuzzy matching is skipped (>500 rows) | UX | 3 | 2 | 1 | 5.0 | User should know a detection pass was omitted on large sheets. |
+| 10 | Scale fuzzy matching beyond 500 rows | Performance | 4 | 3 | 3 | 2.3 | Locality-sensitive hashing or blocking strategy to handle 10K+ rows. |
+| 11 | Add CHANGELOG and release tagging | DevEx | 3 | 2 | 1 | 5.0 | Version tracking for users. Signal active maintenance. |
+| 12 | PyPI publication | Reach | 5 | 1 | 2 | 3.0 | `pip install data-hygiene-auditor` from anywhere. Major discoverability boost. |
+| 13 | "Data linter" positioning + README refresh | Reach | 4 | 1 | 2 | 2.5 | Adopt the "linter for data" framing that resonates with the developer audience. Keywords for discoverability. |
+| 14 | File size guard / row limit warning | Security | 2 | 2 | 1 | 4.0 | Warn at 500K rows, refuse at 2M unless `--force`. Prevents OOM. |
+| 15 | Remove `_load_sheets` from public `__all__` | Architecture | 1 | 2 | 1 | 3.0 | Private helper shouldn't be in the public API surface. |
+
+### Recommended Sequence
+
+**Sprint 5: Bug Fixes & Polish (half day)**
+Moves #1-6, #9, #11, #14, #15. All effort-1 items. Brings the project to a "no rough edges" state.
+- Fix CLI counting bug
+- Fix `_raw` type safety
+- Extract issue-counting helper
+- Document `--schema`/`--baseline` in README
+- Add `--version` and `--quiet`
+- Align Python version requirement
+- Warn on skipped fuzzy matching
+- Add CHANGELOG
+- File size guard
+- Remove `_load_sheets` from `__all__`
+
+**Sprint 6: Engineering Rigor (1 day)**
+Moves #7, #8. Type checking + test migration.
+- Add mypy/pyright to CI
+- Migrate test imports to package
+
+**Sprint 7: Reach (1-2 days)**
+Moves #12, #13. Get the tool in front of users.
+- Publish to PyPI
+- README refresh with "data linter" positioning
+
+**Sprint 8: Scale (2-3 days)**
+Move #10. Requires algorithmic work.
+- Scale fuzzy matching with LSH or blocking
+
+### What NOT to Do (2026 Update)
+
+Previous "don't do" items that were done anyway and **worked out:**
+- ~~Don't add schema validation~~ → Added (PR #9). Lightweight, optional, complements rather than competes with GX/pandera. **Correct call to add it.**
+
+Updated guidance:
+- **Don't add statistical profiling.** fg-data-profiling still owns this despite the rebrand. Your strength is consulting-specific findings.
+- **Don't build a web app.** The interactive HTML file is self-contained, shareable, and zero-deployment. A server-side app is a different product for a different audience.
+- **Don't chase pipeline integration.** The market moved further toward warehouse-native observability (DQOps, Soda, GX Cloud). That's their game. Yours is file-native audit reports.
+- **Don't add LLM-powered features yet.** The deterministic fix suggestions already work well. LLM adds latency, API key requirements, and cost for marginal improvement. Revisit when local models are fast enough to run offline.
+- **Don't over-engineer the fuzzy cap.** The 500-row Levenshtein cap is a reasonable default for spreadsheet-sized data. Add a warning, not a complex distributed algorithm. Only invest in scaling if real users hit the limit.
+- **Don't compete on star count or downloads.** The niche is small but uncontested. One glowing testimonial from a consultant who used it on a real engagement is worth more than 1K GitHub stars from drive-by visitors.
+
+### Strategic Summary
+
+The project has successfully executed its transformation from "Claude Chat artifact" to "genuinely differentiated tool." The 2025 audit's thesis — that the detection was the moat but needed a stage — has been validated. The stage is now built. The next phase is about **credibility and reach**: fixing the remaining rough edges, publishing to PyPI, and positioning the tool where its target audience (consultants, analysts, data teams inheriting messy spreadsheets) can find it.
+
+The competitive landscape has moved *away* from this project's niche (toward warehouse observability and data contracts), which is strategically favorable — it means less competition, not more. The window for "file-native, consultant-focused, severity-rated audit reports" remains wide open with no credible competitor in 2026.
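Review note on Move #2 in AUDIT.md (making `_raw` a proper field of `AuditResult`): the sketch below shows one way a declared dataclass field could look. The field names `file_name` and `overall_score` are placeholders chosen for illustration, and the real `AuditResult` in `api.py` may differ; only the `_raw` name comes from the audit text. Excluding `_raw` from `repr` and comparison is an assumption about what is convenient, not something the PR states.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class AuditResult:
    # Placeholder public fields, for illustration only.
    file_name: str
    overall_score: float
    # Declared as a real field so type checkers and IDEs can see it.
    # default_factory avoids sharing one dict between instances.
    _raw: Dict[str, Any] = field(default_factory=dict, repr=False, compare=False)


result = AuditResult("clients.csv", 91.0, _raw={"overall_score": 91.0})
print(result)       # _raw is left out of the repr
print(result._raw)  # but remains accessible as an attribute
```

Because the attribute is now part of the generated `__init__`, nothing has to be assigned after construction, which is the pattern the audit flags as type-unsafe.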
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -0,0 +1,58 @@
+# Changelog
+
+All notable changes to this project will be documented in this file.
+
+Format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
+
+## [Unreleased]
+
+### Fixed
+- CLI issue count now includes fuzzy duplicates and schema violations
+- `AuditResult._raw` is a proper dataclass field (type-checker visible)
+
+### Added
+- Custom rule engine: define detection rules in JSON (`--rules` flag)
+  - Conditions: `regex_match`, `not_regex_match`, `min_length`, `max_length`, `allowed_values`, `disallowed_values`, `max_missing_pct`
+  - Target columns by regex pattern or explicit list
+  - Findings integrated into all 3 report formats
+- Column-level profiling: cardinality, uniqueness %, avg length, numeric range
+  - Stats shown in HTML, Excel, PDF, and JSON output
+  - `ColumnProfile` dataclass in typed API
+- Multi-file / directory mode: `--input ./data/` audits all supported files
+  - `run_multi_audit()` API for programmatic multi-file audits
+- CI / pipeline integration
+  - `--fail-under` flag: exit code 1 if score < threshold
+  - `--sarif` flag: SARIF 2.1.0 output for GitHub Code Scanning
+  - GitHub Action (`.github/actions/audit/action.yml`)
+- `--version` / `-V` flag
+- `--quiet` / `-q` flag to suppress terminal output
+- `--force` flag to override the 2M row safety limit
+- `count_issues()` shared helper for consistent issue counting
+- Warning when fuzzy (Levenshtein) matching is skipped due to row count
+- File size guard: warns at 500K rows, refuses at 2M without `--force`
+
+### Changed
+- Minimum Python version raised from 3.8 to 3.9
+
+## [1.0.0] - 2026-05-09
+
+### Added
+- Schema validation via `--schema` flag with JSON schema files
+- `--generate-schema` to infer and export a schema from audit results
+- `--baseline` / `-b` for trend comparison against previous audits
+- Trend deltas shown in CLI output and reports
+- `--threshold` / `-t` flag for fuzzy duplicate similarity tuning
+- Typed Python API (`audit_file()`, dataclass results, `py.typed`)
+- Fuzzy duplicate detection (fingerprint clustering + Levenshtein)
+- Health score algorithm (0–100, penalty-based)
+- Interactive HTML report with collapsible sections
+- Fix suggestion engine with copyable code snippets
+- Vectorized detection for 3.4x speedup on large files
+- CSV/TSV support alongside Excel
+- PDF report output (reportlab)
+- Excel findings export (sortable/filterable)
+- Test suite (171 tests) and CI pipeline
+- MIT license
+
+[Unreleased]: https://github.com/MsShawnP/Data-Hygiene-Auditor/compare/v1.0.0...HEAD
+[1.0.0]: https://github.com/MsShawnP/Data-Hygiene-Auditor/releases/tag/v1.0.0
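Review note on the file size guard listed under `[Unreleased]`: the changelog describes a warning at 500K rows and a refusal at 2M rows unless `--force` is given. The sketch below illustrates that policy under stated assumptions. The names `check_row_count`, `WARN_ROWS`, and `MAX_ROWS` are hypothetical, the thresholds are treated as inclusive, and the package may implement the guard differently (for example by exiting from the CLI rather than raising).

```python
import warnings

WARN_ROWS = 500_000    # changelog: "warns at 500K rows"
MAX_ROWS = 2_000_000   # changelog: "refuses at 2M without --force"


def check_row_count(n_rows: int, force: bool = False) -> None:
    """Raise above the hard limit unless forced; warn above the soft limit."""
    if n_rows >= MAX_ROWS and not force:
        raise ValueError(
            f"{n_rows:,} rows exceeds the {MAX_ROWS:,}-row safety limit; "
            "pass force=True to audit anyway"
        )
    if n_rows >= WARN_ROWS:
        warnings.warn(
            f"{n_rows:,} rows is large; the audit may be slow or memory-hungry",
            stacklevel=2,
        )
```

Checking the hard limit first keeps the two outcomes distinct: a forced run over the limit still produces the warning, so the user is reminded of the cost even after opting in.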