datascope

PyPI: pip install datascope-dq

Data created upstream — by manufacturing teams entering UPCs, inventory staff assigning product codes, offshore developers choosing column types — silently breaks systems downstream. A product code with letters where EDI expects numbers. Fifteen "N/A" strings buried in 500 numeric rows that pandas silently drops, skewing every calculation by 3%.

datascope finds these problems, explains what's wrong in plain English, and tells you what to fix. It reads each cell's actual type (not what pandas infers), detects hidden quality issues, classifies their severity by downstream impact, and generates a professional diagnostic report.

What It Finds

Detection	Example	Severity
Mixed types	485 numbers + 15 strings in a "numeric" column	Critical
Sentinel values	"N/A", "TBD", "pending" hiding in numeric data	Critical
Missing values	40% of a column is blank — aggregations silently exclude those rows	Warning
Leading-zero inconsistency	"00123" alongside "456" — keys that won't match	Warning
Mixed date formats	"01/15/2026" and "2026-01-15" in the same column	Warning
Suspected duplicate IDs	98% unique in an ID column — the other 2% will fan out joins	Warning
Near-constant columns	1 distinct value across 10,000 rows	Info

Each finding is expressed as assumption vs. reality: what the data appears to be vs. what it actually contains. Every finding includes a downstream impact explanation, a fix recommendation, and a prevention rule.

Installation

pip install datascope-dq

For Parquet file support:

pip install datascope-dq[parquet]

Or install from source:

git clone https://github.com/MsShawnP/datascope.git
cd datascope
pip install -e .

Usage

# Analyze an Excel file
datascope data.xlsx

# Analyze a CSV
datascope sales_export.csv

# Analyze a Parquet file (requires pyarrow)
datascope data.parquet

# Specify a sheet and output directory
datascope data.xlsx --sheet Revenue --output-dir ./client_reports

Output Formats

# PDF report (default)
datascope data.xlsx

# Structured JSON for pipeline integration
datascope data.xlsx --format json

# Self-contained HTML report
datascope data.xlsx --format html

# Annotated Excel with highlighted problem cells
datascope data.xlsx --format annotated-excel

# PDF + JSON together
datascope data.xlsx --format both

CLI Flags

# Quiet mode — exit code only (0 = no critical, 1 = critical findings)
datascope data.xlsx --quiet

# Verbose mode — full tracebacks on analyzer failures
datascope data.xlsx --verbose

# Limit row count (default: warn at 500K cells, abort at 5M)
datascope huge_file.csv --max-rows 100000

Example Output

datascope: Analyzing sample_mixed_types.xlsx...
  200 rows x 6 columns

Found 4 findings:
  2 Critical  ########
  1 Warning   ####
  1 Info      ####

Top critical findings:
  * revenue_mixed: 15 non-numeric values hiding in an otherwise numeric column
  * status: Sentinel values 'N/A' and 'TBD' in numeric data

Report saved: reports/sample_mixed_types_diagnostic.pdf

The Report

Reports are structured for non-technical readers — no jargon, no composite scores, no unexplained metrics.

Executive Summary — overall health assessment, finding counts by severity, top critical issues highlighted.

Findings by Severity — each finding presented as a card:

Assumption: what the data appears to be
Reality: what it actually contains
Impact: what breaks downstream
Recommended Fix: what to do now
Prevention Rule: what right looks like going forward

Field Inventory — summary table of all columns with their detected issue types and severity.

Findings are color-coded (red/amber/blue) and grouped by severity so readers know what to fix first.

How It Works

Most tools let pandas (or the SQL driver, or Excel) decide column types. A column with 485 numbers and 15 strings becomes float64 — the strings become NaN, the type problem disappears, and every downstream calculation is quietly wrong.

datascope reads each cell's actual Python type via openpyxl (for Excel) or raw-string inference (for CSV). This cell-level type detection is always on — there's no flag to enable it because skipping it defeats the purpose.

The analysis pipeline:

Load — read with cell-level type preservation (no silent coercion)
Detect — seven analyzers scan for type inconsistencies, sentinels, missing values, format issues, and cardinality anomalies
Classify — severity assigned by downstream impact (critical = silent data loss, warning = likely misinterpretation, info = worth noting)
Compose — plain-English narrative generated for each finding
Report — output as PDF, HTML, JSON, or annotated Excel

Severity Model

Level	Meaning	Examples
Critical	Silent data loss or incorrect calculations will occur	Mixed types in numeric columns; sentinel values pandas drops without warning
Warning	Key mismatches or misinterpretation likely	Leading-zero stripping; ambiguous date formats; duplicate IDs; high null rates
Info	Worth noting, no direct downstream breakage	Near-constant columns; moderate missing values

Project Structure

datascope/
├── loaders/            # Excel, CSV, and Parquet with cell-level type tracking
│   ├── excel.py        # openpyxl-based, preserves per-cell Python types
│   ├── csv_loader.py   # Raw string inference with regex-accelerated datetime detection
│   ├── parquet.py      # Arrow schema → Python type mapping (optional pyarrow)
│   └── base.py         # Extension-based dispatch
├── analyzers/          # Seven detectors, each returns list[Finding]
│   ├── type_consistency.py
│   ├── sentinel.py
│   ├── format_check.py # Leading zeros + mixed dates
│   ├── cardinality.py  # Near-constant + duplicate IDs
│   └── missing_values.py
├── findings/           # Severity classifier + NL template engine
│   ├── severity.py     # Impact-based classification rules
│   ├── templates.py    # Plain-English templates per finding type
│   ├── composer.py     # Template dispatch
│   └── pipeline.py     # classify → compose → sort
├── reports/
│   ├── pdf.py          # Professional PDF with reportlab
│   ├── html.py         # Self-contained HTML with inline CSS
│   └── annotated_excel.py  # Highlighted cells + findings sheet
└── cli.py              # argparse CLI, pipeline orchestration

Requirements

Python 3.10+
pandas >= 2.0
openpyxl >= 3.1
reportlab >= 4.0
defusedxml >= 0.7
pyarrow >= 12.0 (optional, for Parquet support)

License

MIT

Built by Lailara LLC — data hygiene and analytics consulting for specialty food brands scaling into national retail.

Name		Name	Last commit message	Last commit date
Latest commit History 46 Commits
.dev		.dev
.github/workflows		.github/workflows
datascope		datascope
samples		samples
tests		tests
.gitattributes		.gitattributes
.gitignore		.gitignore
AUDIT.md		AUDIT.md
CHANGELOG.md		CHANGELOG.md
CLAUDE.md		CLAUDE.md
DECISIONS.md		DECISIONS.md
FAILURES.md		FAILURES.md
HANDOFF.md		HANDOFF.md
LICENSE		LICENSE
PLAN.md		PLAN.md
README.md		README.md
generate_sample.py		generate_sample.py
pyproject.toml		pyproject.toml
requirements-dev.txt		requirements-dev.txt
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

datascope

What It Finds

Installation

Usage

Output Formats

CLI Flags

Example Output

The Report

How It Works

Severity Model

Project Structure

Requirements

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

datascope

What It Finds

Installation

Usage

Output Formats

CLI Flags

Example Output

The Report

How It Works

Severity Model

Project Structure

Requirements

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages