MsShawnP · May 16, 2026
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -29,5 +29,8 @@ jobs:
       - name: Lint with ruff
         run: ruff check .
 
+      - name: Type check with mypy
+        run: mypy data_hygiene_auditor/
+
       - name: Run tests
         run: pytest -v
diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml
@@ -0,0 +1,58 @@
+name: Publish to PyPI
+
+on:
+  push:
+    tags: ["v*"]
+
+permissions:
+  id-token: write
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+
+      - name: Install build tools
+        run: pip install build
+
+      - name: Build package
+        run: python -m build
+
+      - name: Upload artifacts
+        uses: actions/upload-artifact@v4
+        with:
+          name: dist
+          path: dist/
+
+  test:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+
+      - name: Install and test
+        run: |
+          pip install ".[dev]"
+          ruff check .
+          mypy data_hygiene_auditor/
+          pytest -v
+
+  publish:
+    needs: [build, test]
+    runs-on: ubuntu-latest
+    environment: pypi
+    steps:
+      - uses: actions/download-artifact@v4
+        with:
+          name: dist
+          path: dist/
+
+      - uses: pypa/gh-action-pypi-publish@release/v1
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -26,13 +26,18 @@ Format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
   - GitHub Action (`.github/actions/audit/action.yml`)
 - `--version` / `-V` flag
 - `--quiet` / `-q` flag to suppress terminal output
+- `--export-fixes` flag: export remediation plan as CSV (sorted by severity, with fix code and assignee columns)
 - `--force` flag to override the 2M row safety limit
 - `count_issues()` shared helper for consistent issue counting
 - Warning when fuzzy (Levenshtein) matching is skipped due to row count
 - File size guard: warns at 500K rows, refuses at 2M without `--force`
 
 ### Changed
 - Minimum Python version raised from 3.8 to 3.9
+- mypy type checking added to CI (public API and rules module strictly typed)
+- PyPI classifiers expanded (license, Python versions, `Typing :: Typed`)
+- Automated PyPI publish workflow (push `v*` tag → build → test → publish)
+- README refreshed with "data linter" positioning and quick-start install
 
 ## [1.0.0] - 2026-05-09
 

diff --git a/README.md b/README.md
@@ -1,8 +1,15 @@
 # Data Hygiene Auditor
 
+**A linter for your data.** Point it at a spreadsheet, get back every inconsistency, placeholder, and hidden duplicate — with severity ratings, root causes, and fix code.
+
+```
+pip install data-hygiene-auditor
+data-hygiene-audit --input customers.xlsx --output ./reports
+```
+
 Phone numbers stored seven different ways in the same column. "TBD" sitting in a status field for three years. A customer record that looks unique until you notice that whitespace and casing are the only things separating it from four others. These are the issues consultants inherit when they take over someone else's spreadsheet — and the ones nobody finds until they're already in production.
 
-The Data Hygiene Auditor is a Python CLI that scans Excel workbooks for the specific real-world failure modes that show up in actual consulting engagements: mixed-format inconsistencies, fields used for the wrong purpose, placeholder values that escaped into production, and phantom duplicates hiding behind cosmetic differences.
+The Data Hygiene Auditor scans Excel, CSV, and TSV files for the specific real-world failure modes that show up in actual consulting engagements: mixed-format inconsistencies, fields used for the wrong purpose, placeholder values that escaped into production, and phantom duplicates hiding behind cosmetic differences.
 
 A single run produces three reports tailored to three audiences: an **HTML report** for the stakeholder meeting, an **Excel findings file** for the person doing the cleanup, and a **PDF** for the deliverable folder.
 
@@ -73,13 +80,13 @@ python audit.py --input samples/input/sample_messy_data.xlsx --output samples/ou
 ## Installation
 
 ```
-pip install .
+pip install data-hygiene-auditor
 ```
 
-Or install dependencies directly:
+Or install from source:
 
 ```
-pip install -r requirements.txt
+pip install .
 ```
 
 ## Usage
@@ -109,6 +116,7 @@ Supports `.xlsx`, `.xls`, `.csv`, and `.tsv` files.
 | `--baseline`, `-b` | Path to a previous audit JSON for trend comparison (shows deltas) |
 | `--rules`, `-r` | Path to custom rules JSON for additional checks |
 | `--sarif` | Output findings in SARIF format (for GitHub Code Scanning) |
+| `--export-fixes` | Export remediation plan as CSV (sorted by severity, with fix code) |
 | `--fail-under` | Exit with code 1 if health score is below this threshold (0-100) |
 | `--quiet`, `-q` | Suppress all terminal output (just write report files) |
 | `--force` | Process files exceeding the 2M row safety limit |
@@ -290,6 +298,20 @@ python generate_sample.py
 - openpyxl
 - reportlab
 
+## Releasing
+
+To publish a new version to PyPI:
+
+1. Update `version` in `pyproject.toml`
+2. Add a release entry to `CHANGELOG.md`
+3. Commit and push the changes, then tag the release and push the tag:
+   ```
+   git tag v1.1.0
+   git push origin v1.1.0
+   ```
+
+When a tag matching `v*` is pushed, the `publish.yml` workflow builds the package, runs the lint, type, and test checks, and uploads the build to PyPI.
+
 ## License
 
 MIT — see [LICENSE](LICENSE)
diff --git a/data_hygiene_auditor/api.py b/data_hygiene_auditor/api.py
@@ -14,9 +14,10 @@
 
 from __future__ import annotations
 
+import dataclasses
 import os
 import tempfile
-from dataclasses import dataclass, field
+from dataclasses import dataclass
 from pathlib import Path
 from typing import Any, Dict, List, Optional
 
@@ -42,7 +43,7 @@ class Finding:
     severity: str
     description: str
     why: str
-    detail: Dict[str, Any] = field(default_factory=dict)
+    detail: Dict[str, Any] = dataclasses.field(default_factory=dict)
     fix: Optional[FixSuggestion] = None
 
     @property
@@ -67,7 +68,7 @@ class Duplicate:
     rows: List[int]
     group_size: int
     why: str
-    sample_data: List[Dict[str, str]] = field(default_factory=list)
+    sample_data: List[Dict[str, str]] = dataclasses.field(default_factory=list)
     fix: Optional[FixSuggestion] = None
 
 
@@ -80,8 +81,8 @@ class FuzzyDuplicate:
     rows: List[int]
     group_size: int
     why: str
-    field_differences: Dict[str, Any] = field(default_factory=dict)
-    sample_data: List[Dict[str, str]] = field(default_factory=list)
+    field_differences: Dict[str, Any] = dataclasses.field(default_factory=dict)
+    sample_data: List[Dict[str, str]] = dataclasses.field(default_factory=list)
     similarity_threshold: Optional[float] = None
     fix: Optional[FixSuggestion] = None
 
@@ -115,7 +116,7 @@ class FieldResult:
     total_missing: int
     missing_pct: float
     total_rows: int
-    findings: List[Finding] = field(default_factory=list)
+    findings: List[Finding] = dataclasses.field(default_factory=list)
     profile: Optional[ColumnProfile] = None
 
 
@@ -127,10 +128,10 @@ class SheetResult:
     row_count: int
     col_count: int
     health_score: int
-    fields: List[FieldResult] = field(default_factory=list)
-    duplicates: List[Duplicate] = field(default_factory=list)
-    fuzzy_duplicates: List[FuzzyDuplicate] = field(default_factory=list)
-    schema_violations: List[SchemaViolation] = field(default_factory=list)
+    fields: List[FieldResult] = dataclasses.field(default_factory=list)
+    duplicates: List[Duplicate] = dataclasses.field(default_factory=list)
+    fuzzy_duplicates: List[FuzzyDuplicate] = dataclasses.field(default_factory=list)
+    schema_violations: List[SchemaViolation] = dataclasses.field(default_factory=list)
 
     @property
     def findings(self) -> List[Finding]:
@@ -157,9 +158,9 @@ class AuditResult:
     input_file: str
     audit_timestamp: str
     overall_score: int
-    sheets: List[SheetResult] = field(default_factory=list)
+    sheets: List[SheetResult] = dataclasses.field(default_factory=list)
     trend: Optional[TrendData] = None
-    _raw: Dict[str, Any] = field(default_factory=dict, repr=False)
+    _raw: Dict[str, Any] = dataclasses.field(default_factory=dict, repr=False)
 
     @property
     def total_issues(self) -> int:
@@ -231,7 +232,7 @@ class SchemaViolation:
     severity: str
     column: str
     why: str
-    detail: Dict[str, Any] = field(default_factory=dict)
+    detail: Dict[str, Any] = dataclasses.field(default_factory=dict)
 
 
 @dataclass
@@ -244,8 +245,8 @@ class TrendData:
     overall_score_previous: int
     total_issues_delta: int
     total_issues_previous: int
-    severity_deltas: Dict[str, int] = field(default_factory=dict)
-    sheets: Dict[str, Any] = field(default_factory=dict)
+    severity_deltas: Dict[str, int] = dataclasses.field(default_factory=dict)
+    sheets: Dict[str, Any] = dataclasses.field(default_factory=dict)
 
 
 def _describe_issue(issue_type: str, detail: dict) -> str:
@@ -257,7 +258,7 @@ def _describe_issue(issue_type: str, detail: dict) -> str:
             f" deviate from {detail.get('dominant_format', '')}"
         )
     if issue_type == 'wrong_purpose':
-        return detail.get('issue', 'Wrong purpose')
+        return str(detail.get('issue', 'Wrong purpose'))
     if issue_type in ('placeholder_value', 'placeholder'):
         return (
             f"Placeholder \"{detail.get('value', '')}\" found"

diff --git a/data_hygiene_auditor/cli.py b/data_hygiene_auditor/cli.py
@@ -139,6 +139,79 @@ def _generate_sarif(all_results, input_files):
     }
 
 
+def _export_remediation_csv(all_results, output_path):
+    """Export a CSV remediation plan, one row per issue, sorted by severity."""
+    import csv
+
+    rows = []
+    for results in all_results:
+        source_file = results.get('input_file', '')
+        for sheet_name, sheet_data in results['sheets'].items():
+            for col_name, field_data in sheet_data['fields'].items():
+                for issue in field_data['issues']:
+                    fix = issue.get('fix', {})
+                    detail = issue.get('detail', {})
+                    msg = ''
+                    if isinstance(detail, dict):
+                        msg = detail.get('message', '')
+                        if not msg and 'issue' in detail:
+                            msg = detail['issue']
+                    rows.append({
+                        'File': source_file,
+                        'Sheet': sheet_name,
+                        'Field': col_name,
+                        'Issue Type': issue.get('rule_name', issue['type']),
+                        'Severity': issue['severity'],
+                        'Description': msg,
+                        'Fix Strategy': fix.get('strategy', '') if fix else '',
+                        'Fix Code': fix.get('code', '') if fix else '',
+                        'Assigned To': '',
+                        'Status': 'Open',
+                    })
+
+            for dup in sheet_data['phantom_duplicates']:
+                fix = dup.get('fix', {})
+                rows.append({
+                    'File': source_file,
+                    'Sheet': sheet_name,
+                    'Field': '(row-level)',
+                    'Issue Type': dup['type'],
+                    'Severity': dup['severity'],
+                    'Description': f"{dup['group_size']} rows: {', '.join(str(r) for r in dup['rows'][:5])}",
+                    'Fix Strategy': fix.get('strategy', '') if fix else '',
+                    'Fix Code': fix.get('code', '') if fix else '',
+                    'Assigned To': '',
+                    'Status': 'Open',
+                })
+
+            for fuzz in sheet_data.get('fuzzy_duplicates', []):
+                fix = fuzz.get('fix', {})
+                rows.append({
+                    'File': source_file,
+                    'Sheet': sheet_name,
+                    'Field': '(row-level)',
+                    'Issue Type': 'fuzzy_duplicate',
+                    'Severity': fuzz['severity'],
+                    'Description': f"{fuzz['group_size']} rows: {', '.join(str(r) for r in fuzz['rows'][:5])}",
+                    'Fix Strategy': fix.get('strategy', '') if fix else '',
+                    'Fix Code': fix.get('code', '') if fix else '',
+                    'Assigned To': '',
+                    'Status': 'Open',
+                })
+
+    rows.sort(key=lambda r: {'High': 0, 'Medium': 1, 'Low': 2}.get(r['Severity'], 3))
+
+    fieldnames = [
+        'File', 'Sheet', 'Field', 'Issue Type', 'Severity',
+        'Description', 'Fix Strategy', 'Fix Code',
+        'Assigned To', 'Status',
+    ]
+    with open(output_path, 'w', newline='', encoding='utf-8') as f:
+        writer = csv.DictWriter(f, fieldnames=fieldnames)
+        writer.writeheader()
+        writer.writerows(rows)
+
+
 def main():
     parser = argparse.ArgumentParser(
         description=(
@@ -198,6 +271,10 @@ def main():
         '--sarif',
         help='Output findings in SARIF format to the given path',
     )
+    parser.add_argument(
+        '--export-fixes',
+        help='Export remediation plan as CSV to the given path',
+    )
     parser.add_argument(
         '--quiet', '-q', action='store_true',
         help='Suppress all terminal output (just write report files)',
@@ -323,6 +400,10 @@ def _log(msg=''):
             json.dump(sarif_data, f, indent=2)
         _log(f"    {_c('SARIF', '32')}  -> {args.sarif}")
 
+    if args.export_fixes:
+        _export_remediation_csv(all_results, args.export_fixes)
+        _log(f"    {_c('Fixes', '32')}  -> {args.export_fixes}")
+
     total_counts = {'total': 0, 'High': 0, 'Medium': 0, 'Low': 0, 'schema': 0}
     scores = []
     for results in all_results:
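
Note on `_export_remediation_csv` above: a minimal sketch of the severity ordering and `csv.DictWriter` round trip that the export appears to rely on. The rows, the `Info` label, and the three-column layout are hypothetical and chosen only for illustration; the real export writes ten columns and builds its rows from the audit results.

```python
import csv
import io

# Same ranking as the sort key in the diff; unknown labels fall back to 3.
SEVERITY_RANK = {'High': 0, 'Medium': 1, 'Low': 2}

# Hypothetical rows, a cut-down version of what the export builds.
rows = [
    {'Field': 'status', 'Severity': 'Low', 'Status': 'Open'},
    {'Field': 'phone', 'Severity': 'High', 'Status': 'Open'},
    {'Field': 'notes', 'Severity': 'Info', 'Status': 'Open'},
    {'Field': 'email', 'Severity': 'Medium', 'Status': 'Open'},
    {'Field': 'zip', 'Severity': 'High', 'Status': 'Open'},
]

# list.sort is stable, so rows of equal severity keep their original order.
rows.sort(key=lambda r: SEVERITY_RANK.get(r['Severity'], 3))

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['Field', 'Severity', 'Status'])
writer.writeheader()
writer.writerows(rows)

# Read the CSV back the way a downstream script might.
exported = list(csv.DictReader(io.StringIO(buffer.getvalue())))
ordered_fields = [row['Field'] for row in exported]
print(ordered_fields)
```

Because the sort is stable, findings within one severity level stay in file, sheet, and field discovery order, and any severity label outside the three known ones lands at the end of the plan.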

diff --git a/data_hygiene_auditor/core.py b/data_hygiene_auditor/core.py
@@ -86,7 +86,7 @@ def count_issues(results):
     Returns dict with keys: 'total', 'High', 'Medium', 'Low', 'schema'.
     """
     from collections import Counter
-    totals = Counter()
+    totals: Counter[str] = Counter()
     schema_count = 0
     for sheet in results['sheets'].values():
         for field_data in sheet['fields'].values():
@@ -145,7 +145,7 @@ def run_audit(input_path, fuzzy_threshold=0.85, schema_path=None, baseline_path=
         if df.empty:
             continue
 
-        sheet_results = {
+        sheet_results: dict = {
             'row_count': len(df),
             'col_count': len(df.columns),
             'fields': {},