3 changes: 3 additions & 0 deletions .github/workflows/ci.yml
@@ -29,5 +29,8 @@ jobs:
       - name: Lint with ruff
         run: ruff check .

+      - name: Type check with mypy
+        run: mypy data_hygiene_auditor/
+
       - name: Run tests
         run: pytest -v
58 changes: 58 additions & 0 deletions .github/workflows/publish.yml
@@ -0,0 +1,58 @@
+name: Publish to PyPI
+
+on:
+  push:
+    tags: ["v*"]
+
+permissions:
+  id-token: write
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+
+      - name: Install build tools
+        run: pip install build
+
+      - name: Build package
+        run: python -m build
+
+      - name: Upload artifacts
+        uses: actions/upload-artifact@v4
+        with:
+          name: dist
+          path: dist/
+
+  test:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+
+      - name: Install and test
+        run: |
+          pip install .[dev]
+          ruff check .
+          mypy data_hygiene_auditor/
+          pytest -v
+
+  publish:
+    needs: [build, test]
+    runs-on: ubuntu-latest
+    environment: pypi
+    steps:
+      - uses: actions/download-artifact@v4
+        with:
+          name: dist
+          path: dist/
+
+      - uses: pypa/gh-action-pypi-publish@release/v1
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -26,13 +26,18 @@ Format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
 - GitHub Action (`.github/actions/audit/action.yml`)
 - `--version` / `-V` flag
 - `--quiet` / `-q` flag to suppress terminal output
+- `--export-fixes` flag: export remediation plan as CSV (sorted by severity, with fix code and assignee columns)
 - `--force` flag to override the 2M row safety limit
 - `count_issues()` shared helper for consistent issue counting
 - Warning when fuzzy (Levenshtein) matching is skipped due to row count
 - File size guard: warns at 500K rows, refuses at 2M without `--force`

 ### Changed
 - Minimum Python version raised from 3.8 to 3.9
+- mypy type checking added to CI (public API and rules module strictly typed)
+- PyPI classifiers expanded (license, Python versions, `Typing :: Typed`)
+- Automated PyPI publish workflow (push `v*` tag → build → test → publish)
+- README refreshed with "data linter" positioning and quick-start install

 ## [1.0.0] - 2026-05-09

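Reviewer note: the changelog entry for the file size guard (warn at 500K rows, refuse at 2M unless `--force` is passed) can be illustrated with a minimal sketch. The function name `check_row_limit`, the constant names, the choice of `warnings.warn` and `ValueError`, and the exact boundary handling are all assumptions for illustration; this diff does not show how the auditor implements the guard.

```python
import warnings

# Thresholds come from the changelog entry; the names are hypothetical.
WARN_ROWS = 500_000
MAX_ROWS = 2_000_000


def check_row_limit(row_count: int, force: bool = False) -> None:
    """Warn on large inputs and refuse very large ones unless forced."""
    # Assumption: "exceeding" the limit means strictly more than MAX_ROWS.
    if row_count > MAX_ROWS and not force:
        raise ValueError(
            f"{row_count:,} rows exceeds the {MAX_ROWS:,} row safety limit; "
            "pass --force to process anyway"
        )
    # Assumption: the warning threshold is inclusive.
    if row_count >= WARN_ROWS:
        warnings.warn(f"Large input ({row_count:,} rows); the audit may be slow")
```

With `force=True` the sketch still emits the warning, so a forced run is never silent about the input size.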
30 changes: 26 additions & 4 deletions README.md
@@ -1,8 +1,15 @@
 # Data Hygiene Auditor

+**A linter for your data.** Point it at a spreadsheet, get back every inconsistency, placeholder, and hidden duplicate — with severity ratings, root causes, and fix code.
+
+```
+pip install data-hygiene-auditor
+data-hygiene-audit --input customers.xlsx --output ./reports
+```
+
 Phone numbers stored seven different ways in the same column. "TBD" sitting in a status field for three years. A customer record that looks unique until you notice that whitespace and casing are the only things separating it from four others. These are the issues consultants inherit when they take over someone else's spreadsheet — and the ones nobody finds until they're already in production.

-The Data Hygiene Auditor is a Python CLI that scans Excel workbooks for the specific real-world failure modes that show up in actual consulting engagements: mixed-format inconsistencies, fields used for the wrong purpose, placeholder values that escaped into production, and phantom duplicates hiding behind cosmetic differences.
+The Data Hygiene Auditor scans Excel, CSV, and TSV files for the specific real-world failure modes that show up in actual consulting engagements: mixed-format inconsistencies, fields used for the wrong purpose, placeholder values that escaped into production, and phantom duplicates hiding behind cosmetic differences.

 A single run produces three reports tailored to three audiences: an **HTML report** for the stakeholder meeting, an **Excel findings file** for the person doing the cleanup, and a **PDF** for the deliverable folder.

@@ -73,13 +80,13 @@ python audit.py --input samples/input/sample_messy_data.xlsx --output samples/ou
 ## Installation

 ```
-pip install .
+pip install data-hygiene-auditor
 ```

-Or install dependencies directly:
+Or install from source:

 ```
-pip install -r requirements.txt
+pip install .
 ```

 ## Usage
@@ -109,6 +116,7 @@ Supports `.xlsx`, `.xls`, `.csv`, and `.tsv` files.
 | `--baseline`, `-b` | Path to a previous audit JSON for trend comparison (shows deltas) |
 | `--rules`, `-r` | Path to custom rules JSON for additional checks |
 | `--sarif` | Output findings in SARIF format (for GitHub Code Scanning) |
+| `--export-fixes` | Export remediation plan as CSV (sorted by severity, with fix code) |
 | `--fail-under` | Exit with code 1 if health score is below this threshold (0-100) |
 | `--quiet`, `-q` | Suppress all terminal output (just write report files) |
 | `--force` | Process files exceeding the 2M row safety limit |
@@ -290,6 +298,20 @@ python generate_sample.py
 - openpyxl
 - reportlab

+## Releasing
+
+To publish a new version to PyPI:
+
+1. Update `version` in `pyproject.toml`
+2. Add a release entry to `CHANGELOG.md`
+3. Commit, tag, and push:
+   ```
+   git tag v1.1.0
+   git push origin v1.1.0
+   ```
+
+The `publish.yml` workflow builds, tests, and uploads to PyPI automatically on version tags.
+
 ## License

 MIT — see [LICENSE](LICENSE)
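Reviewer note: in the Releasing steps added to the README, the tag name and the `version` in `pyproject.toml` are updated by hand, and nothing in the `publish.yml` workflow as shown in this diff compares the two. A minimal pre-push check might look like the sketch below. The helper name `tag_matches_version` is hypothetical, and the regular expression is a simplification that assumes a plain `version = "..."` line instead of parsing TOML.

```python
import re


def tag_matches_version(tag: str, pyproject_text: str) -> bool:
    """Return True if a tag such as 'v1.1.0' matches the pyproject version."""
    # Simplified lookup: first line of the form  version = "x.y.z"
    match = re.search(r'^version\s*=\s*"([^"]+)"', pyproject_text, re.MULTILINE)
    if match is None:
        return False
    return tag == f"v{match.group(1)}"


# Hypothetical pyproject.toml contents for illustration.
sample = '[project]\nname = "data-hygiene-auditor"\nversion = "1.1.0"\n'
print(tag_matches_version("v1.1.0", sample))  # True
print(tag_matches_version("v1.2.0", sample))  # False
```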
33 changes: 17 additions & 16 deletions data_hygiene_auditor/api.py
@@ -14,9 +14,10 @@

 from __future__ import annotations

+import dataclasses
 import os
 import tempfile
-from dataclasses import dataclass, field
+from dataclasses import dataclass
 from pathlib import Path
 from typing import Any, Dict, List, Optional

@@ -42,7 +43,7 @@ class Finding:
     severity: str
     description: str
     why: str
-    detail: Dict[str, Any] = field(default_factory=dict)
+    detail: Dict[str, Any] = dataclasses.field(default_factory=dict)
     fix: Optional[FixSuggestion] = None

     @property
@@ -67,7 +68,7 @@ class Duplicate:
     rows: List[int]
     group_size: int
     why: str
-    sample_data: List[Dict[str, str]] = field(default_factory=list)
+    sample_data: List[Dict[str, str]] = dataclasses.field(default_factory=list)
     fix: Optional[FixSuggestion] = None


@@ -80,8 +81,8 @@ class FuzzyDuplicate:
     rows: List[int]
     group_size: int
     why: str
-    field_differences: Dict[str, Any] = field(default_factory=dict)
-    sample_data: List[Dict[str, str]] = field(default_factory=list)
+    field_differences: Dict[str, Any] = dataclasses.field(default_factory=dict)
+    sample_data: List[Dict[str, str]] = dataclasses.field(default_factory=list)
     similarity_threshold: Optional[float] = None
     fix: Optional[FixSuggestion] = None

@@ -115,7 +116,7 @@ class FieldResult:
     total_missing: int
     missing_pct: float
     total_rows: int
-    findings: List[Finding] = field(default_factory=list)
+    findings: List[Finding] = dataclasses.field(default_factory=list)
     profile: Optional[ColumnProfile] = None


@@ -127,10 +128,10 @@ class SheetResult:
     row_count: int
     col_count: int
     health_score: int
-    fields: List[FieldResult] = field(default_factory=list)
-    duplicates: List[Duplicate] = field(default_factory=list)
-    fuzzy_duplicates: List[FuzzyDuplicate] = field(default_factory=list)
-    schema_violations: List[SchemaViolation] = field(default_factory=list)
+    fields: List[FieldResult] = dataclasses.field(default_factory=list)
+    duplicates: List[Duplicate] = dataclasses.field(default_factory=list)
+    fuzzy_duplicates: List[FuzzyDuplicate] = dataclasses.field(default_factory=list)
+    schema_violations: List[SchemaViolation] = dataclasses.field(default_factory=list)

     @property
     def findings(self) -> List[Finding]:
@@ -157,9 +158,9 @@ class AuditResult:
     input_file: str
     audit_timestamp: str
     overall_score: int
-    sheets: List[SheetResult] = field(default_factory=list)
+    sheets: List[SheetResult] = dataclasses.field(default_factory=list)
     trend: Optional[TrendData] = None
-    _raw: Dict[str, Any] = field(default_factory=dict, repr=False)
+    _raw: Dict[str, Any] = dataclasses.field(default_factory=dict, repr=False)

     @property
     def total_issues(self) -> int:
@@ -231,7 +232,7 @@ class SchemaViolation:
     severity: str
     column: str
     why: str
-    detail: Dict[str, Any] = field(default_factory=dict)
+    detail: Dict[str, Any] = dataclasses.field(default_factory=dict)


 @dataclass
@@ -244,8 +245,8 @@ class TrendData:
     overall_score_previous: int
     total_issues_delta: int
     total_issues_previous: int
-    severity_deltas: Dict[str, int] = field(default_factory=dict)
-    sheets: Dict[str, Any] = field(default_factory=dict)
+    severity_deltas: Dict[str, int] = dataclasses.field(default_factory=dict)
+    sheets: Dict[str, Any] = dataclasses.field(default_factory=dict)


 def _describe_issue(issue_type: str, detail: dict) -> str:
@@ -257,7 +258,7 @@ def _describe_issue(issue_type: str, detail: dict) -> str:
             f" deviate from {detail.get('dominant_format', '')}"
         )
     if issue_type == 'wrong_purpose':
-        return detail.get('issue', 'Wrong purpose')
+        return str(detail.get('issue', 'Wrong purpose'))
     if issue_type in ('placeholder_value', 'placeholder'):
         return (
             f"Placeholder \"{detail.get('value', '')}\" found"
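Reviewer note: the diff does not say why `field(...)` became `dataclasses.field(...)` throughout `api.py`. One common motivation, offered here only as an assumption, is a dataclass attribute that is itself named `field`: a bare `field(...)` later in the same class body can then read ambiguously to a type checker such as mypy. The sketch below uses a hypothetical `ExampleFinding` class, not the real `Finding`, to show that the module-qualified call still gives each instance its own default dict.

```python
import dataclasses
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class ExampleFinding:  # hypothetical class, for illustration only
    field: str  # attribute name that collides with dataclasses.field
    severity: str
    # Module-qualified call, so the attribute above cannot be confused with it.
    detail: Dict[str, Any] = dataclasses.field(default_factory=dict)


first = ExampleFinding(field="phone", severity="High")
second = ExampleFinding(field="email", severity="Low")
first.detail["count"] = 3  # mutating one instance leaves the other untouched
```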
81 changes: 81 additions & 0 deletions data_hygiene_auditor/cli.py
@@ -139,6 +139,79 @@ def _generate_sarif(all_results, input_files):
     }


+def _export_remediation_csv(all_results, output_path):
+    """Export a CSV remediation plan with one row per fixable issue."""
+    import csv
+
+    rows = []
+    for results in all_results:
+        source_file = results.get('input_file', '')
+        for sheet_name, sheet_data in results['sheets'].items():
+            for col_name, field_data in sheet_data['fields'].items():
+                for issue in field_data['issues']:
+                    fix = issue.get('fix', {})
+                    detail = issue.get('detail', {})
+                    msg = ''
+                    if isinstance(detail, dict):
+                        msg = detail.get('message', '')
+                        if not msg and 'issue' in detail:
+                            msg = detail['issue']
+                    rows.append({
+                        'File': source_file,
+                        'Sheet': sheet_name,
+                        'Field': col_name,
+                        'Issue Type': issue.get('rule_name', issue['type']),
+                        'Severity': issue['severity'],
+                        'Description': msg,
+                        'Fix Strategy': fix.get('strategy', '') if fix else '',
+                        'Fix Code': fix.get('code', '') if fix else '',
+                        'Assigned To': '',
+                        'Status': 'Open',
+                    })
+
+            for dup in sheet_data['phantom_duplicates']:
+                fix = dup.get('fix', {})
+                rows.append({
+                    'File': source_file,
+                    'Sheet': sheet_name,
+                    'Field': '(row-level)',
+                    'Issue Type': dup['type'],
+                    'Severity': dup['severity'],
+                    'Description': f"{dup['group_size']} rows: {', '.join(str(r) for r in dup['rows'][:5])}",
+                    'Fix Strategy': fix.get('strategy', '') if fix else '',
+                    'Fix Code': fix.get('code', '') if fix else '',
+                    'Assigned To': '',
+                    'Status': 'Open',
+                })
+
+            for fuzz in sheet_data.get('fuzzy_duplicates', []):
+                fix = fuzz.get('fix', {})
+                rows.append({
+                    'File': source_file,
+                    'Sheet': sheet_name,
+                    'Field': '(row-level)',
+                    'Issue Type': 'fuzzy_duplicate',
+                    'Severity': fuzz['severity'],
+                    'Description': f"{fuzz['group_size']} rows: {', '.join(str(r) for r in fuzz['rows'][:5])}",
+                    'Fix Strategy': fix.get('strategy', '') if fix else '',
+                    'Fix Code': fix.get('code', '') if fix else '',
+                    'Assigned To': '',
+                    'Status': 'Open',
+                })
+
+    rows.sort(key=lambda r: {'High': 0, 'Medium': 1, 'Low': 2}.get(r['Severity'], 3))
+
+    fieldnames = [
+        'File', 'Sheet', 'Field', 'Issue Type', 'Severity',
+        'Description', 'Fix Strategy', 'Fix Code',
+        'Assigned To', 'Status',
+    ]
+    with open(output_path, 'w', newline='', encoding='utf-8') as f:
+        writer = csv.DictWriter(f, fieldnames=fieldnames)
+        writer.writeheader()
+        writer.writerows(rows)
+
+
 def main():
     parser = argparse.ArgumentParser(
         description=(
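Reviewer note: the severity ordering in `_export_remediation_csv` can be tried in isolation with the sketch below. It reuses the same ranking dictionary and `csv.DictWriter` calls as the function above, but the sample rows are hypothetical, trimmed to three of the ten columns, and written to an in-memory buffer instead of a file.

```python
import csv
import io

# Same ranking as the sort key in _export_remediation_csv;
# any severity not listed here sorts last (rank 3).
SEVERITY_ORDER = {'High': 0, 'Medium': 1, 'Low': 2}

# Hypothetical rows for illustration only.
rows = [
    {'Field': 'status', 'Severity': 'Low', 'Status': 'Open'},
    {'Field': 'phone', 'Severity': 'High', 'Status': 'Open'},
    {'Field': 'notes', 'Severity': 'Info', 'Status': 'Open'},
    {'Field': 'email', 'Severity': 'Medium', 'Status': 'Open'},
]
rows.sort(key=lambda r: SEVERITY_ORDER.get(r['Severity'], 3))

buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=['Field', 'Severity', 'Status'])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Because Python's sort is stable, rows that share a severity keep the order in which they were collected (field-level issues first, then duplicates, per sheet).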
@@ -198,6 +271,10 @@ def main():
         '--sarif',
         help='Output findings in SARIF format to the given path',
     )
+    parser.add_argument(
+        '--export-fixes',
+        help='Export remediation plan as CSV to the given path',
+    )
     parser.add_argument(
         '--quiet', '-q', action='store_true',
         help='Suppress all terminal output (just write report files)',
@@ -323,6 +400,10 @@ def _log(msg=''):
             json.dump(sarif_data, f, indent=2)
         _log(f"  {_c('SARIF', '32')} -> {args.sarif}")

+    if args.export_fixes:
+        _export_remediation_csv(all_results, args.export_fixes)
+        _log(f"  {_c('Fixes', '32')} -> {args.export_fixes}")
+
     total_counts = {'total': 0, 'High': 0, 'Medium': 0, 'Low': 0, 'schema': 0}
     scores = []
     for results in all_results:
4 changes: 2 additions & 2 deletions data_hygiene_auditor/core.py
@@ -86,7 +86,7 @@ def count_issues(results):
     Returns dict with keys: 'total', 'High', 'Medium', 'Low', 'schema'.
     """
     from collections import Counter
-    totals = Counter()
+    totals: Counter[str] = Counter()
     schema_count = 0
     for sheet in results['sheets'].values():
         for field_data in sheet['fields'].values():
@@ -145,7 +145,7 @@ def run_audit(input_path, fuzzy_threshold=0.85, schema_path=None, baseline_path=
         if df.empty:
             continue

-        sheet_results = {
+        sheet_results: dict = {
             'row_count': len(df),
             'col_count': len(df.columns),
             'fields': {},