Dift is an open-source platform for dataset comparison, drift detection, and automated data trust validation.
It helps data teams instantly understand:
- what changed
- why it matters
- whether new data is safe to trust
Dift supports:
- local datasets
- SQL databases
- analytical warehouses
- drift analysis
- automation workflows
- historical validation
- reusable comparison systems
Full documentation is available here:
https://reginalderzoah.github.io/Dift/
Bad data silently breaks:
- dashboards
- ETL pipelines
- analytics workflows
- ML models
- warehouse transformations
- business decisions
Dift helps teams detect risky dataset changes before they propagate into production systems.
- Schema comparison
- Row-level comparison
- Null analysis
- Duplicate analysis
- Risk scoring
- Drift analysis
- Mean shift detection
- Standard deviation drift
- Range shift analysis
- Severity classification
- New value detection
- Removed value detection
- Frequency shift analysis
- Severity scoring
- IQR outlier analysis
- Outlier spike detection
- Risk integration
- CSV
- Parquet
- Excel (.xlsx, .xls)
- JSON
- SQLite
- PostgreSQL
- MySQL
- DuckDB
- BigQuery
- Redshift
- Snowflake
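To make the numeric drift and outlier checks listed above more concrete, here is a minimal Python sketch of a relative mean shift, a severity label, and an IQR outlier filter. It is not Dift's implementation: the function names, the severity cut-offs, and the reading of `0.1` as a relative threshold are all assumptions made for illustration.

```python
from statistics import mean, quantiles


def mean_shift(old, new):
    """Relative change in the mean from old to new (0.1 means 10%)."""
    old_mean = mean(old)
    if old_mean == 0:
        raise ValueError("relative mean shift is undefined when the old mean is 0")
    return abs(mean(new) - old_mean) / abs(old_mean)


def classify_severity(shift, threshold=0.1):
    """Map a shift to a label. The cut-offs are illustrative assumptions."""
    if shift < threshold:
        return "low"
    if shift < 2 * threshold:
        return "medium"
    return "high"


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Made-up sample columns, standing in for an "old" and a "new" dataset.
old_amounts = [100, 102, 98, 101, 99]
new_amounts = [120, 125, 118, 122, 121, 119, 123, 500]

shift = mean_shift(old_amounts, new_amounts)
print(classify_severity(shift))
print(iqr_outliers(new_amounts))
```

The single large value in `new_amounts` both pulls the mean upward and falls outside the IQR fence, which is why drift and outlier checks are usually read together.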
Dift supports the following report formats:
- Rich CLI reports
- JSON reports
- CSV reports
- Excel reports
- HTML reports
Available templates:
- default
- clean
- compact
- enterprise
- dark
Example:
dift old.csv new.csv \
  --report html \
  --template dark

- Scheduled comparisons
- Batch dataset comparison
- Comparison history
- Reusable profiles
- Environment-based configs
- Automation-friendly exit codes
- Non-interactive execution
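As a rough illustration of how the exit codes and non-interactive mode listed above might be used from a pipeline step, the sketch below wraps the CLI in a small Python gate. The flags are the ones shown in this README. The helper names `build_command` and `gate` are hypothetical, and treating exit code 0 as "safe to proceed" is an assumption; check the documentation for the actual exit-code meanings.

```python
import subprocess


def build_command(old, new):
    """Assemble a non-interactive dift invocation as an argument list."""
    return [
        "dift", old, new,
        "--strict-exit-codes",
        "--quiet",
        "--no-color",
    ]


def gate(old, new, runner=subprocess.run):
    """Run the comparison and return True when the process exits with 0.

    Assumption: a non-zero exit code means the comparison should block
    the pipeline. The runner argument exists so the gate can be tested
    without the CLI installed.
    """
    result = runner(build_command(old, new), check=False)
    return result.returncode == 0


# Typical use inside a deployment script (file names are placeholders):
#     if not gate("prod.csv", "staging.csv"):
#         raise SystemExit(1)
```

Passing the command as a list instead of a shell string avoids quoting problems when dataset paths contain spaces.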
pip install dift-cli
pip install --upgrade dift-cli
pip install sqlalchemy
pip install psycopg2-binary
pip install pymysql
pip install sqlalchemy-redshift redshift-connector
pip install snowflake-sqlalchemy
pip install google-cloud-bigquery db-dtypes
pip install duckdb

dift examples/old.csv examples/new.csv \
  --key customer_id

dift examples/old.csv examples/new.csv \
  --key customer_id \
  --report json \
  --output report.json

dift examples/old.csv examples/new.csv \
  --key customer_id \
  --report html \
  --template enterprise \
  --output report.html

dift examples/old_drift.csv examples/new_drift.csv \
  --key id \
  --threshold 0.1

dift postgresql://user:password@localhost:5432/sales_db:customers_old \
  postgresql://user:password@localhost:5432/sales_db:customers_new \
  --key customer_id

dift duckdb:///warehouse.duckdb:orders_old \
  duckdb:///warehouse.duckdb:orders_new \
  --key order_id

dift bigquery://analytics.sales.orders_old \
  bigquery://analytics.sales.orders_new \
  --key order_id

dift batch \
  --old-dir data/old \
  --new-dir data/new \
  --key id

dift profile create nightly-check \
  --old examples/old.csv \
  --new examples/new.csv \
  --key customer_id

dift profile run nightly-check

dift schedule cron nightly-check

Enable persistent history tracking:
dift examples/old.csv examples/new.csv \
  --history

dift prod.csv staging.csv \
  --strict-exit-codes \
  --quiet \
  --no-color

Supported config formats:
- YAML
- TOML
- JSON
Run using config:
dift --config examples/config_sample.yaml

dift --config examples/config_env.yaml \
  --env production

Most examples use files located in the project's examples/ directory.
Example structure:
examples/
├── old.csv
├── new.csv
├── old.parquet
├── new.parquet
├── old.xlsx
├── new.xlsx
├── old.json
├── new.json
├── old_drift.csv
├── new_drift.csv
├── config_sample.yaml
├── config_thresholds.yaml
├── config_env.yaml
└── warehouse.duckdb

Project structure:
dift/
├── cli.py
├── core/
├── io/
│ ├── readers.py
│ ├── registry.py
│ ├── sql_reader.py
│ ├── duckdb_reader.py
│ ├── bigquery_reader.py
│ └── base_reader.py
├── reports/
├── profiles.py
├── schedules.py
├── history.py
└── utils/
docs/
tests/
examples/
- Connector registry architecture
- Shared reader interfaces
- Plugin preparation architecture
- Modular connector system
- Extensible reporting system
- Warehouse-ready workflows
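For readers who want a picture of what a connector registry with a shared reader interface can look like, here is a minimal Python sketch. It is an illustration of the general pattern only, not the contents of `registry.py` or `base_reader.py`: the names `BaseReader`, `register`, `scheme_of`, `get_reader`, and `CsvReader` are hypothetical, and dispatching on the URL scheme or file extension is an assumption.

```python
import csv
from abc import ABC, abstractmethod


class BaseReader(ABC):
    """Shared interface: every connector turns a source string into rows."""

    @abstractmethod
    def read(self, source: str) -> list[dict]:
        ...


# Maps a scheme ("postgresql") or file extension ("csv") to a reader class.
_REGISTRY: dict[str, type[BaseReader]] = {}


def register(scheme: str):
    """Class decorator that adds a reader to the registry under `scheme`."""
    def decorator(cls):
        _REGISTRY[scheme] = cls
        return cls
    return decorator


def scheme_of(source: str) -> str:
    """Use the URL scheme when present, otherwise the file extension."""
    if "://" in source:
        return source.split("://", 1)[0]
    return source.rsplit(".", 1)[-1].lower()


def get_reader(source: str) -> BaseReader:
    """Look up and instantiate the reader registered for this source."""
    key = scheme_of(source)
    try:
        return _REGISTRY[key]()
    except KeyError:
        raise ValueError(f"no reader registered for {key!r}") from None


@register("csv")
class CsvReader(BaseReader):
    def read(self, source: str) -> list[dict]:
        # Each row becomes a dict keyed by the header line.
        with open(source, newline="") as handle:
            return list(csv.DictReader(handle))
```

With this shape, adding a connector means writing one class and one `@register(...)` line, and the comparison code never needs to know which backend supplied the rows.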
pytest
ruff check .
mypy dift

Upcoming areas of focus include:
- streaming comparisons
- distributed execution
- MongoDB support
- ML feature drift analysis
- observability dashboards
- alerting workflows
- native Airflow integration
- plugin ecosystem
- Python SDK
- Web UI dashboard
See:
docs/roadmap.md
Contributions are welcome.
See:
CONTRIBUTING.md
Ways to contribute:
- fix bugs
- improve docs
- improve testing
- improve performance
- add connectors
- improve reporting
- improve automation workflows
MIT License
Dift aims to become the open-source standard for:
- dataset regression testing
- data drift monitoring
- ML dataset validation
- warehouse trust validation
- automated data quality enforcement
- data deployment validation
- dataset observability
