Dift

Dift is an open-source platform for dataset comparison, drift detection, and automated data trust validation.

It helps data teams quickly understand:

what changed
why it matters
whether new data is safe to trust

Dift supports:

local datasets
SQL databases
analytical warehouses
drift analysis
automation workflows
historical validation
reusable comparison systems

Documentation

Full documentation is available here:

https://reginalderzoah.github.io/Dift/

Why Dift?

Bad data silently breaks:

dashboards
ETL pipelines
analytics workflows
ML models
warehouse transformations
business decisions

Dift helps teams detect risky dataset changes before they propagate into production systems.

Key Features

Dataset Comparison

Schema comparison
Row-level comparison
Null analysis
Duplicate analysis
Risk scoring
Drift analysis
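
To make schema and row-level comparison concrete, here is a minimal Python sketch of the general idea. It is not Dift's implementation: the function names compare_schema and compare_rows are hypothetical, and rows are assumed to be plain dictionaries matched on a single key column.

```python
from typing import Any

Row = dict[str, Any]


def compare_schema(old_cols: list[str], new_cols: list[str]) -> dict[str, list[str]]:
    """Report columns added to or removed from the new dataset."""
    old_set, new_set = set(old_cols), set(new_cols)
    return {
        "added": sorted(new_set - old_set),
        "removed": sorted(old_set - new_set),
    }


def compare_rows(old: list[Row], new: list[Row], key: str) -> dict[str, list[Any]]:
    """Classify rows as added, removed, or changed, matching on `key`."""
    old_by_key = {row[key]: row for row in old}
    new_by_key = {row[key]: row for row in new}
    shared = old_by_key.keys() & new_by_key.keys()
    return {
        "added": sorted(new_by_key.keys() - old_by_key.keys()),
        "removed": sorted(old_by_key.keys() - new_by_key.keys()),
        "changed": sorted(k for k in shared if old_by_key[k] != new_by_key[k]),
    }


# Illustrative data only; the column names are made up for this sketch.
old = [{"customer_id": 1, "plan": "free"}, {"customer_id": 2, "plan": "pro"}]
new = [{"customer_id": 2, "plan": "team"}, {"customer_id": 3, "plan": "free"}]

schema_diff = compare_schema(["customer_id", "plan"], ["customer_id", "plan", "region"])
row_diff = compare_rows(old, new, key="customer_id")
```

Matching on a key column, as the --key option in the examples suggests, is what lets a comparison separate changed rows from rows that were added or removed.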

Drift Detection

Numeric Drift

Mean shift detection
Standard deviation drift
Range shift analysis
Severity classification
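
The checks listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Dift's actual algorithm: the numeric_drift function, the relative-change formula, and the severity bands (low, medium, high at one and two times the threshold) are all hypothetical choices made for this example.

```python
import statistics


def relative_change(baseline: float, current: float) -> float:
    """Absolute change relative to the baseline, guarding against a zero baseline."""
    if baseline == 0:
        return 0.0 if current == 0 else float("inf")
    return abs(current - baseline) / abs(baseline)


def numeric_drift(old: list[float], new: list[float], threshold: float = 0.1) -> dict[str, object]:
    """Summarize mean shift, standard deviation drift, and range shift for one column."""
    mean_shift = relative_change(statistics.fmean(old), statistics.fmean(new))
    std_shift = relative_change(statistics.pstdev(old), statistics.pstdev(new))
    # A range shift here means the new data falls outside the old min/max.
    range_shift = min(new) < min(old) or max(new) > max(old)

    # Hypothetical severity bands based on the largest relative change.
    worst = max(mean_shift, std_shift)
    if worst >= 2 * threshold:
        severity = "high"
    elif worst >= threshold:
        severity = "medium"
    else:
        severity = "low"

    return {
        "mean_shift": mean_shift,
        "std_shift": std_shift,
        "range_shift": range_shift,
        "severity": severity,
    }


# Same spread, but every value moved up by 5.
result = numeric_drift([10, 12, 11, 13, 14], [15, 17, 16, 18, 19], threshold=0.1)
```

In this example the mean moves from 12 to 17 while the spread is unchanged, so the mean shift alone drives the severity.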

Categorical Drift

New value detection
Removed value detection
Frequency shift analysis
Severity scoring
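
As a rough illustration of the categorical checks above, the following Python sketch compares the value sets and relative frequencies of one column. The categorical_drift function is hypothetical, and measuring frequency shift as a difference in proportions is an assumption of this example rather than Dift's documented method.

```python
from collections import Counter


def categorical_drift(old: list[str], new: list[str]) -> dict[str, object]:
    """Find new values, removed values, and per-value frequency shifts."""
    old_counts, new_counts = Counter(old), Counter(new)

    # Frequency shift: change in each shared value's share of the column.
    frequency_shift = {
        value: new_counts[value] / len(new) - old_counts[value] / len(old)
        for value in old_counts.keys() & new_counts.keys()
    }

    return {
        "new_values": sorted(new_counts.keys() - old_counts.keys()),
        "removed_values": sorted(old_counts.keys() - new_counts.keys()),
        "frequency_shift": frequency_shift,
    }


drift = categorical_drift(["a", "a", "b", "c"], ["a", "b", "b", "d"])
```

Here "d" is reported as new, "c" as removed, and the shares of "a" and "b" move by a quarter in opposite directions.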

Outlier Detection

IQR outlier analysis
Outlier spike detection
Risk integration
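
For readers unfamiliar with the IQR method, this short Python sketch shows the standard rule: values further than 1.5 times the interquartile range from the quartiles are flagged. The function names and the quartile interpolation method are choices made for this example, so Dift's own results may differ in detail.

```python
import statistics


def iqr_outliers(values: list[float], k: float = 1.5) -> list[float]:
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def outlier_rate(values: list[float]) -> float:
    """Share of values flagged as outliers."""
    return len(iqr_outliers(values)) / len(values)


old_values = [10, 11, 12, 13, 14, 15, 16, 17]
new_values = [10, 11, 12, 13, 14, 15, 16, 95]

# An outlier spike can be read as a rise in the outlier rate between datasets.
spike = outlier_rate(new_values) - outlier_rate(old_values)
```

Comparing outlier rates between the old and new datasets, rather than looking at the new dataset alone, avoids flagging columns that have always contained extreme values.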

Supported Dataset Sources

Local Files

CSV
Parquet
Excel (.xlsx, .xls)
JSON

Databases & Warehouses

SQLite
PostgreSQL
MySQL
DuckDB
BigQuery
Redshift
Snowflake

Reporting

Dift supports:

Rich CLI reports
JSON reports
CSV reports
Excel reports
HTML reports

HTML Templates

Available templates:

default
clean
compact
enterprise
dark

Example:

dift old.csv new.csv \
  --report html \
  --template dark

Automation Features

Scheduled comparisons
Batch dataset comparison
Comparison history
Reusable profiles
Environment-based configs
Automation-friendly exit codes
Non-interactive execution

Installation

Install

pip install dift-cli

Upgrade

pip install --upgrade dift-cli

Optional Connector Dependencies

SQL Support

pip install sqlalchemy

PostgreSQL

pip install psycopg2-binary

MySQL

pip install pymysql

Redshift

pip install sqlalchemy-redshift redshift-connector

Snowflake

pip install snowflake-sqlalchemy

BigQuery

pip install google-cloud-bigquery db-dtypes

DuckDB

pip install duckdb

Quick Start

Compare CSV Files

dift examples/old.csv examples/new.csv \
  --key customer_id

Generate JSON Report

dift examples/old.csv examples/new.csv \
  --key customer_id \
  --report json \
  --output report.json

Generate HTML Report

dift examples/old.csv examples/new.csv \
  --key customer_id \
  --report html \
  --template enterprise \
  --output report.html

Detect Numeric Drift

dift examples/old_drift.csv examples/new_drift.csv \
  --key id \
  --threshold 0.1

Database & Warehouse Examples

PostgreSQL

dift postgresql://user:password@localhost:5432/sales_db:customers_old \
     postgresql://user:password@localhost:5432/sales_db:customers_new \
     --key customer_id

DuckDB

dift duckdb:///warehouse.duckdb:orders_old \
     duckdb:///warehouse.duckdb:orders_new \
     --key order_id

BigQuery

dift bigquery://analytics.sales.orders_old \
     bigquery://analytics.sales.orders_new \
     --key order_id

Batch Comparison

dift batch \
  --old-dir data/old \
  --new-dir data/new \
  --key id
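
Conceptually, a batch run has to decide which old file is compared with which new file. The Python sketch below pairs files that share a name in both directories. That pairing rule and the pair_datasets function are assumptions for illustration; how Dift actually matches files is not specified here.

```python
from pathlib import Path


def pair_datasets(old_dir: str, new_dir: str) -> list[tuple[Path, Path]]:
    """Pair files that appear under the same name in both directories."""
    old_files = {p.name: p for p in Path(old_dir).iterdir() if p.is_file()}
    new_files = {p.name: p for p in Path(new_dir).iterdir() if p.is_file()}
    shared_names = sorted(old_files.keys() & new_files.keys())
    return [(old_files[name], new_files[name]) for name in shared_names]
```

Files present in only one directory are skipped in this sketch; a real batch tool might instead report them as added or removed datasets.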

Scheduled Workflows

Create Profile

dift profile create nightly-check \
  --old examples/old.csv \
  --new examples/new.csv \
  --key customer_id

Run Profile

dift profile run nightly-check

Generate Cron Schedule

dift schedule cron nightly-check

Comparison History

Enable persistent history tracking:

dift examples/old.csv examples/new.csv \
  --history

Automation-Friendly Execution

dift prod.csv staging.csv \
  --strict-exit-codes \
  --quiet \
  --no-color
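
In a pipeline, the same command can be wrapped in a small Python helper that gates a later step on the result. This is a sketch only: the build_command and run_check helpers are hypothetical, and it assumes that a non-zero exit status signals a failed or risky comparison, which should be confirmed against the Dift documentation.

```python
import subprocess


def build_command(old: str, new: str) -> list[str]:
    """Assemble the non-interactive dift invocation shown above."""
    return ["dift", old, new, "--strict-exit-codes", "--quiet", "--no-color"]


def run_check(old: str, new: str) -> bool:
    """Run dift and report success.

    Assumption: exit status 0 means the comparison passed.
    """
    completed = subprocess.run(build_command(old, new), check=False)
    return completed.returncode == 0


command = build_command("prod.csv", "staging.csv")
```

Passing the arguments as a list rather than a shell string avoids quoting problems when dataset paths contain spaces.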

Configuration Support

Supported config formats:

YAML
TOML
JSON

Run using config:

dift --config examples/config_sample.yaml

Environment-Based Configs

dift --config examples/config_env.yaml \
  --env production

Example Files

Most examples use files located in the project's examples/ directory.

Example structure:

examples/
├── old.csv
├── new.csv
├── old.parquet
├── new.parquet
├── old.xlsx
├── new.xlsx
├── old.json
├── new.json
├── old_drift.csv
├── new_drift.csv
├── config_sample.yaml
├── config_thresholds.yaml
├── config_env.yaml
└── warehouse.duckdb

Project Structure

dift/
├── cli.py
├── core/
├── io/
│   ├── readers.py
│   ├── registry.py
│   ├── sql_reader.py
│   ├── duckdb_reader.py
│   ├── bigquery_reader.py
│   └── base_reader.py
├── reports/
├── profiles.py
├── schedules.py
├── history.py
└── utils/

docs/
tests/
examples/

Developer Features

Connector registry architecture
Shared reader interfaces
Plugin preparation architecture
Modular connector system
Extensible reporting system
Warehouse-ready workflows

Run Tests

pytest

Linting

ruff check .

Type Checking

mypy dift

Roadmap

Upcoming areas of focus include:

streaming comparisons
distributed execution
MongoDB support
ML feature drift analysis
observability dashboards
alerting workflows
native Airflow integration
plugin ecosystem
Python SDK
Web UI dashboard

See:

docs/roadmap.md

Contributing

Contributions are welcome.

See:

CONTRIBUTING.md

Ways to contribute:

fix bugs
improve docs
improve testing
improve performance
add connectors
improve reporting
improve automation workflows

License

MIT License

Vision

Dift aims to become the open-source standard for:

dataset regression testing
data drift monitoring
ML dataset validation
warehouse trust validation
automated data quality enforcement
data deployment validation
dataset observability
