Dift is an open-source CLI platform for dataset comparison, drift detection, and data trust validation.
It helps data teams instantly understand:
- what changed
- why it matters
- whether the new data is safe to trust
Dift v0.5.0 introduces advanced drift analysis, outlier detection, reusable configurations, saved comparison profiles, improved reporting, and stronger dataset risk analysis.
- Numeric drift detection
- Advanced categorical drift analysis
- Outlier detection using IQR analysis
- Outlier severity classification
- Outlier risk scoring
- Frequency distribution shift detection
- Numeric drift reporting across all report formats
- Improved Excel reporting
- Improved HTML reporting
- Better CSV drift summaries
- Enhanced weighted risk scoring
- Improved warning system
- Better drift visibility in console reports
Bad data breaks:
- dashboards
- reports
- ETL pipelines
- analytics workflows
- ML models
- business decisions
Dift helps teams catch risky data changes before they cause damage.
- CSV
- Parquet
- Excel (.xlsx, .xls)
- JSON
- Mean shift detection
- Standard deviation drift
- Range shift detection
- Configurable drift thresholds
- Severity classification
- New categorical value detection
- Removed categorical value detection
- Frequency distribution shifts
- Severity classification
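The categorical checks listed above (new values, removed values, frequency shifts) can be illustrated with a short sketch. This is not Dift's actual implementation; the function name `categorical_shift` and the returned keys are hypothetical, and the shift is assumed to be the absolute difference in relative frequency per category.

```python
from collections import Counter


def categorical_shift(old_values, new_values):
    """Summarize category changes between two non-empty columns (illustrative only)."""
    old_counts, new_counts = Counter(old_values), Counter(new_values)
    categories = set(old_counts) | set(new_counts)
    # Absolute change in relative frequency for every category seen in either column.
    shifts = {
        c: abs(new_counts[c] / len(new_values) - old_counts[c] / len(old_values))
        for c in categories
    }
    return {
        "new_values": set(new_counts) - set(old_counts),
        "removed_values": set(old_counts) - set(new_counts),
        "max_frequency_shift": max(shifts.values()),
    }


old_segment = ["a"] * 8 + ["b"] * 2
new_segment = ["a"] * 2 + ["b"] * 2 + ["c"] * 6
summary = categorical_shift(old_segment, new_segment)
```

Here category `a` drops from 80% to 20% of rows and `c` appears for the first time, so the largest shift is about 0.6.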
- IQR outlier detection
- Outlier spike detection
- Outlier percentage tracking
- Risk integration
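For readers unfamiliar with IQR-based outlier detection, the idea can be sketched as follows. This is a generic illustration under the common 1.5 x IQR rule, not Dift's own code; the function name and the `k` parameter are assumptions.

```python
import statistics


def iqr_outlier_fraction(values, k=1.5):
    """Return the fraction of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # statistics.quantiles sorts the data and returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < low or v > high]
    return len(outliers) / len(values)


old_revenue = [10, 11, 12, 13, 14, 15, 16, 17]
new_revenue = [10, 11, 12, 13, 14, 15, 16, 500]
old_fraction = iqr_outlier_fraction(old_revenue)
new_fraction = iqr_outlier_fraction(new_revenue)
```

Comparing the outlier fraction of the old and new columns is one plausible way to express an "outlier spike"; the exact metric Dift reports may differ.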
- Rich CLI report
- JSON report
- CSV summary report
- Excel workbook report
- HTML dashboard-style report
Customize your HTML reports:
dift old.csv new.csv --report html --template clean
Available templates:
- default
- clean
- compact
- enterprise
- dark
Control drift sensitivity using --threshold.
Default threshold: 0.1
Example:
dift old.csv new.csv --key id --threshold 0.2
This helps detect silent numeric drift in:
- ML datasets
- ETL pipelines
- analytics tables
- production data feeds
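To make the threshold concrete, here is a minimal sketch of a relative mean shift check. It assumes the shift is measured as the change in mean relative to the old mean, which matches the percentage style of the console output but is not confirmed as Dift's exact formula; both function names are hypothetical.

```python
from statistics import fmean


def relative_mean_shift(old_values, new_values):
    """Change in mean, relative to the old mean (0.1 means a 10% shift)."""
    old_mean, new_mean = fmean(old_values), fmean(new_values)
    if old_mean == 0:
        # A relative shift is undefined for a zero baseline; treat any change as infinite.
        return 0.0 if new_mean == 0 else float("inf")
    return abs(new_mean - old_mean) / abs(old_mean)


def exceeds_threshold(old_values, new_values, threshold=0.1):
    """True when the relative mean shift is larger than the threshold."""
    return relative_mean_shift(old_values, new_values) > threshold


big_shift = relative_mean_shift([100, 100, 100], [1000, 1000, 1000])
small_shift_flagged = exceeds_threshold([100, 102], [101, 103], threshold=0.1)
```

A larger threshold such as 0.2 tolerates more movement before a warning is raised; a smaller one flags subtler drift.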
Save reports to a directory without specifying filenames:
dift old.csv new.csv --report json --output-dir reports/
Auto-generated filenames:
- dift_report.json
- dift_report.csv
- dift_report.xlsx
- dift_report.html
Dift supports reusable configuration files for cleaner and reproducible workflows.
dift old.csv new.csv --config examples/config_sample.yaml
- YAML (.yaml, .yml)
- TOML (.toml)
- JSON (.json)
old_dataset: "examples/old.csv"
new_dataset: "examples/new.csv"
key: "customer_id"
threshold: 0.05
report: "html"

old_dataset = "examples/old.csv"
new_dataset = "examples/new.csv"
key = "customer_id"
threshold = 0.1
report = "json"

{
  "old_dataset": "examples/old.csv",
  "new_dataset": "examples/new.csv",
  "key": "customer_id",
  "threshold": 0.2,
  "report": "csv"
}

Dift can also load dataset paths directly from config files.
This means you can run a comparison without typing the old and new dataset paths in the terminal every time.
dift --config examples/config_with_datasets.yaml

old_dataset: examples/old.csv
new_dataset: examples/new.csv
key: customer_id
threshold: 0.2
report: html
output: reports/config_report.html

CLI arguments still override config values.
dift examples/old_drift.csv examples/new_drift.csv \
--config examples/config_with_datasets.yaml \
--report json \
--output override_report.json
In this case, Dift uses:
- datasets from the CLI
- report/output from the CLI
- remaining values from the config file
Dift supports reusable threshold policies for advanced drift detection workflows.
Threshold configurations help teams:
- standardize drift sensitivity
- customize validation rules
- apply column-specific policies
- reuse validation settings across environments
thresholds:
  numeric: 0.1
  categorical: 0.2
  outlier: 0.15

thresholds:
  numeric: 0.1
  categorical: 0.2
  outlier: 0.15
  columns:
    revenue:
      numeric: 0.05
      outlier: 0.1
    segment:
      categorical: 0.3

old_dataset: examples/old.csv
new_dataset: examples/new.csv
key: customer_id
report: html
output: reports/threshold_report.html
thresholds:
  numeric: 0.1
  categorical: 0.2
  outlier: 0.15
  columns:
    revenue:
      numeric: 0.05
      outlier: 0.1
    status:
      categorical: 0.3

dift --config examples/config_thresholds.yaml

The CLI --threshold flag still overrides the global numeric threshold for backward compatibility.
dift --config examples/config_thresholds.yaml --threshold 0.5
This overrides:
thresholds:
  numeric: 0.1
But preserves:
- categorical thresholds
- outlier thresholds
- column-level overrides
| Threshold Type | Purpose |
|---|---|
| numeric | Numeric drift detection |
| categorical | Frequency shift detection |
| outlier | Outlier spike detection |
| columns | Column-specific overrides |
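The lookup order implied by these threshold types can be sketched in a few lines. This is an illustration under stated assumptions, not Dift's implementation: the function names are hypothetical, `columns` is assumed to be nested under `thresholds`, and the fallback `default` value is a placeholder.

```python
import copy


def apply_cli_threshold(thresholds, cli_threshold):
    """Return a copy where --threshold replaces only the global numeric value."""
    merged = copy.deepcopy(thresholds)
    if cli_threshold is not None:
        merged["numeric"] = cli_threshold
    return merged


def resolve_threshold(thresholds, column, kind, default=0.1):
    """Prefer a column-level override, then the global value, then the default."""
    overrides = thresholds.get("columns", {}).get(column, {})
    if kind in overrides:
        return overrides[kind]
    return thresholds.get(kind, default)


thresholds = {
    "numeric": 0.1,
    "categorical": 0.2,
    "outlier": 0.15,
    "columns": {
        "revenue": {"numeric": 0.05, "outlier": 0.1},
        "segment": {"categorical": 0.3},
    },
}
with_cli = apply_cli_threshold(thresholds, 0.5)
```

With `--threshold 0.5`, a column without overrides resolves its numeric threshold to 0.5, while `revenue` keeps its column-level 0.05.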
columns:
  revenue:
    numeric: 0.02
Detect even small revenue drift changes.
columns:
  segment:
    categorical: 0.4
Reduce noise for highly variable categorical fields.
columns:
  transactions:
    outlier: 0.05
Catch abnormal spikes aggressively.
Dift supports reusable environment-specific configurations for development, staging, and production workflows.
This helps teams maintain different comparison settings across environments while keeping configs clean and reusable.
dift --config examples/config_env.yaml --env development

key: customer_id
report: html
output: reports/env_report.html
environments:
  development:
    old_dataset: examples/old.csv
    new_dataset: examples/new.csv
    threshold: 0.2
  staging:
    old_dataset: staging_old.csv
    new_dataset: staging_new.csv
    threshold: 0.15
  production:
    old_dataset: ${OLD_DATASET}
    new_dataset: ${NEW_DATASET}
    threshold: 0.1

Dift supports environment variable interpolation inside config files.
Example:
old_dataset: ${OLD_DATASET}
new_dataset: ${NEW_DATASET}
Set variables before running Dift.
Bash:
export OLD_DATASET=examples/old.csv
export NEW_DATASET=examples/new.csv
PowerShell:
$env:OLD_DATASET="examples/old.csv"
$env:NEW_DATASET="examples/new.csv"
Then run:
dift --config examples/config_env.yaml --env production
If a required environment variable is missing, Dift reports which variable is missing.
Example:
Error: Missing environment variable 'OLD_DATASET'
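The `${VAR}` interpolation described above can be approximated with a small regular-expression substitution. This is a sketch of the general technique rather than Dift's loader; the `interpolate` function and its `env` parameter are hypothetical.

```python
import os
import re

# Matches ${NAME} where NAME is a conventional environment variable name.
_VAR_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def interpolate(value, env=None):
    """Replace every ${NAME} in value with its value from env (default: os.environ)."""
    env = os.environ if env is None else env

    def replace(match):
        name = match.group(1)
        if name not in env:
            raise ValueError(f"Missing environment variable '{name}'")
        return env[name]

    return _VAR_PATTERN.sub(replace, value)


resolved = interpolate("${OLD_DATASET}", {"OLD_DATASET": "examples/old.csv"})
```

Failing fast on a missing variable, rather than substituting an empty string, avoids silently comparing the wrong dataset.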
Environment configs help support:
- development workflows
- staging validation
- production deployment checks
- CI/CD pipelines
- secret management preparation
- reusable automation workflows
Dift supports reusable saved comparison profiles.
Profiles help automate recurring dataset checks and validation workflows.
dift profile create nightly-check \
--old examples/old.csv \
--new examples/new.csv \
--key customer_id \
--report html \
--threshold 0.1
dift profile run nightly-check
dift profile list
dift profile show nightly-check
dift profile delete nightly-check
Dift resolves settings using:
CLI arguments > Saved Profiles > Config Files > Defaults
This makes Dift flexible for:
- automation
- CI/CD pipelines
- scheduled validations
- reusable workflows
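The resolution order (CLI arguments, then saved profiles, then config files, then defaults) behaves like a layered lookup. The sketch below uses `collections.ChainMap` to show the idea; it is an assumption about the behavior, not Dift's code, and `resolve_settings` is a hypothetical name. Unset values are assumed to be represented as `None`.

```python
from collections import ChainMap

# Defaults as documented: threshold 0.1 and console reporting.
DEFAULTS = {"threshold": 0.1, "report": "console"}


def resolve_settings(cli, profile, config):
    """Merge settings so that earlier layers win over later ones."""

    def set_values(layer):
        # Drop unset options so they do not mask lower-priority layers.
        return {key: value for key, value in layer.items() if value is not None}

    layers = ChainMap(set_values(cli), set_values(profile), set_values(config), DEFAULTS)
    return dict(layers)


settings = resolve_settings(
    cli={"report": "json", "threshold": None},
    profile={"threshold": 0.2},
    config={"threshold": 0.05, "key": "customer_id"},
)
```

In this example the report format comes from the CLI, the threshold from the profile, and the key from the config file.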
Dift supports batch comparison workflows for validating multiple dataset pairs in one command.
This is useful for:
- ETL validation pipelines
- scheduled dataset monitoring
- multi-table warehouse checks
- automated regression testing
data/
├── old/
│   ├── customers.csv
│   ├── orders.csv
│   └── products.csv
│
└── new/
    ├── customers.csv
    ├── orders.csv
    └── products.csv
Dift automatically matches files by filename.
Example:
- old/customers.csv <--> new/customers.csv
- old/orders.csv <--> new/orders.csv
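Matching by filename can be pictured as a set intersection over the two directories. The following is a minimal sketch of that idea, not Dift's batch implementation; `match_dataset_pairs` is a hypothetical helper, and how Dift treats unmatched files is not stated here.

```python
from pathlib import Path


def match_dataset_pairs(old_dir, new_dir):
    """Pair files that share a filename; also report files present on one side only."""
    old_files = {p.name: p for p in Path(old_dir).iterdir() if p.is_file()}
    new_files = {p.name: p for p in Path(new_dir).iterdir() if p.is_file()}
    shared = sorted(old_files.keys() & new_files.keys())
    pairs = [(old_files[name], new_files[name]) for name in shared]
    only_old = sorted(old_files.keys() - new_files.keys())
    only_new = sorted(new_files.keys() - old_files.keys())
    return pairs, only_old, only_new
```

Calling `match_dataset_pairs("data/old", "data/new")` on the layout above would yield three pairs, one each for customers, orders, and products.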
dift batch \
--old-dir data/old \
--new-dir data/new \
--key id
dift batch \
--old-dir data/old \
--new-dir data/new \
--key id \
--report html \
--output-dir reports/batch
Example output structure:
reports/
└── batch/
    ├── customers/
    │   └── dift_report.html
    ├── orders/
    │   └── dift_report.html
    └── products/
        └── dift_report.html
dift batch \
--old-dir data/old \
--new-dir data/new \
--report csv \
--output-dir reports/csv
By default, Dift continues running other comparisons even if one fails.
dift batch \
--old-dir data/old \
--new-dir data/new \
--continue-on-error
Stop immediately on first failure:
dift batch \
--old-dir data/old \
--new-dir data/new \
--stop-on-error
old_dataset: "examples/old.csv"
new_dataset: "examples/new.csv"
key: "customer_id"
threshold: 0.05
report: "html"
Dift supports persistent comparison history tracking.
This helps teams monitor:
- dataset drift over time
- recurring quality issues
- historical risk changes
- long-term data trust trends
dift examples/old.csv examples/new.csv \
--key customer_id \
--history
By default, history is saved to:
.dift/history/history.jsonl
dift examples/old.csv examples/new.csv \
--key customer_id \
--history \
--history-dir reports/history
dift history list
Example output:
1. 2026-05-15T12:30:00Z | risk=medium | old.csv -> new.csv
2. 2026-05-16T08:10:00Z | risk=high | prod.csv -> staging.csv
dift history show 1
dift history clear
Save history during batch workflows:
dift batch \
--old-dir data/old \
--new-dir data/new \
--history \
--history-dir reports/batch-history
Example structure:
reports/
└── batch-history/
    ├── customers/
    │   └── history.jsonl
    ├── orders/
    │   └── history.jsonl
    └── products/
        └── history.jsonl
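Because history is stored as JSON Lines, it can be read with a few lines of standard-library Python. The record schema is not documented here, so the `risk` field used below is an assumption inferred from the `dift history list` output, and both helper names are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def load_history(path):
    """Parse a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def risk_counts(records):
    """Count runs per risk level; 'risk' is an assumed field name."""
    return Counter(record.get("risk", "unknown") for record in records)
```

For example, `risk_counts(load_history(".dift/history/history.jsonl"))` would give a quick tally of how often each risk level occurred, provided the records carry such a field.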
Dift resolves settings in a strict priority order:
- CLI arguments (highest priority, override everything)
- Saved profiles
- Configuration file (YAML, TOML, or JSON)
- Internal defaults (threshold: 0.1, report: console)
Dift supports reusable scheduled comparison workflows for automation, monitoring, CI/CD pipelines, and recurring data quality checks.
This makes it easy to:
- run nightly drift checks
- automate production dataset validation
- integrate with cron jobs
- schedule profile-based comparisons
- build monitoring workflows
First create a comparison profile:
dift profile create nightly-check \
--old examples/old.csv \
--new examples/new.csv \
--key customer_id \
--report html \
--output reports/nightly.html
This saves all comparison settings into a reusable profile.
Generate a cron-ready command:
dift schedule cron nightly-check
Example output:
0 2 * * * dift profile run nightly-check --history --strict-exit-codes
This means:
- run every day
- at 2:00 AM
- save comparison history
- use automation-friendly exit codes
Generate a custom cron schedule:
dift schedule cron nightly-check \
--hour 5 \
--minute 30
Output:
30 5 * * * dift profile run nightly-check --history --strict-exit-codes
Runs daily at 5:30 AM.
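The generated lines follow the standard five-field cron layout (minute, hour, day of month, month, day of week). A minimal sketch of how such a daily line could be assembled is shown below; `daily_cron_line` is a hypothetical helper, not part of Dift.

```python
def daily_cron_line(profile, hour=2, minute=0):
    """Build a crontab line that runs a saved profile once a day."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    if not 0 <= minute <= 59:
        raise ValueError("minute must be between 0 and 59")
    command = f"dift profile run {profile} --history --strict-exit-codes"
    # Cron field order: minute, hour, day of month, month, day of week.
    return f"{minute} {hour} * * * {command}"


default_line = daily_cron_line("nightly-check")
custom_line = daily_cron_line("nightly-check", hour=5, minute=30)
```

Note that the minute comes before the hour in a cron expression, which is why `--hour 5 --minute 30` produces `30 5 * * *`.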
Create a named schedule:
dift schedule create daily-check \
--profile nightly-check \
--cron "0 2 * * *"
dift schedule list
Example:
- daily-check
dift schedule show daily-check
Example output:
{
  "profile": "nightly-check",
  "cron": "0 2 * * *"
}
You can manually trigger a saved schedule:
dift schedule run daily-check
This runs the associated profile immediately.
dift schedule delete daily-check
Open your crontab:
crontab -e
Add:
0 2 * * * dift profile run nightly-check --history --strict-exit-codes
Use the generated command:
dift profile run nightly-check --history --strict-exit-codes
inside:
- Windows Task Scheduler
- Jenkins
- GitHub Actions
- Airflow
- Prefect
- Dagster
- CI/CD pipelines
Dift supports optional risk-based exit codes for automation workflows.
By default, Dift exits with 0 when comparisons complete successfully.
Enable strict automation behavior with:
dift prod.csv candidate.csv \
--key id \
--strict-exit-codes
| Exit Code | Meaning |
|---|---|
| 0 | Low-risk comparison |
| 1 | Medium-risk drift detected |
| 2 | High-risk drift detected |
| 3 | Runtime error, invalid input, or failed comparison |
This allows Dift to automatically fail pipelines when risky dataset changes are detected.
dift prod.csv staging.csv \
--key customer_id \
--strict-exit-codes
echo $?
Example output:
2
This means Dift detected a high-risk dataset change.
Strict exit codes are optional.
Without --strict-exit-codes, Dift preserves the original behavior and exits successfully when comparisons complete.
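A pipeline wrapper might translate these exit codes into a pass/fail decision. The sketch below encodes the documented meanings and one possible gating policy; the policy itself (`should_block_deploy` and its `allow_medium` flag) is a hypothetical example, not Dift behavior.

```python
# Meanings as documented for --strict-exit-codes.
EXIT_CODE_MEANINGS = {
    0: "low-risk comparison",
    1: "medium-risk drift detected",
    2: "high-risk drift detected",
    3: "runtime error, invalid input, or failed comparison",
}


def describe_exit_code(code):
    """Return a readable meaning for a Dift exit code."""
    return EXIT_CODE_MEANINGS.get(code, f"unexpected exit code {code}")


def should_block_deploy(code, allow_medium=True):
    """Example policy: always block on high risk or errors, optionally on medium."""
    if code == 0:
        return False
    if code == 1:
        return not allow_medium
    return True
```

In practice the code would come from the process return value, for example `subprocess.run([...]).returncode` in Python or `$?` in a shell script.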
Dift supports automation-friendly execution for CI/CD pipelines, cron jobs, Airflow, Jenkins, GitHub Actions, and scheduled workflows.
Suppress non-error output:
dift old.csv new.csv \
--key id \
--quiet
This is useful for:
- cron jobs
- CI/CD pipelines
- scheduled validation workflows
- automated monitoring
Errors will still be displayed.
Disable ANSI terminal colors for cleaner logs:
dift old.csv new.csv \
--key id \
--no-color
Useful for:
- log aggregation systems
- CI logs
- plain-text terminals
- automation tools
dift old.csv new.csv \
--key id \
--strict-exit-codes \
--quiet \
--no-color
This combination provides:
- predictable exit codes
- machine-friendly output
- clean CI logs
- non-interactive execution behavior
dift schedule cron nightly-check
Example output:
0 2 * * * dift profile run nightly-check --history --strict-exit-codes --quiet --no-color
Dift supports comparing datasets directly from DuckDB databases.
This enables warehouse-style comparisons, SQL-based validation, and Parquet-backed analytical workflows.
dift duckdb:///examples/warehouse.duckdb:customers_old \
duckdb:///examples/warehouse.duckdb:customers_new \
--key customer_id
duckdb:///path/to/database.duckdb:table_name
duckdb:///data/warehouse.duckdb:orders
- existing risk scoring
- HTML reports
- Excel reports
- JSON reports
- CSV reports
- drift detection
- quality validation
- local analytics warehouses
- Parquet validation workflows
- SQL-based dataset comparison
- batch data quality checks
- DuckDB database files must exist locally.
- Remote DuckDB connections are not currently supported.
- Table names may be case-sensitive depending on configuration.
- DuckDB comparisons use the existing Dift comparison engine and reports.
Dift supports comparing datasets directly from Google BigQuery.
This enables cloud warehouse validation, analytical comparisons, and SQL-driven data quality workflows.
dift bigquery://my-project.analytics.customers_old \
bigquery://my-project.analytics.customers_new \
--key customer_id
bigquery://project.dataset.table
bigquery://acme-analytics.sales.orders
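The `bigquery://project.dataset.table` form splits cleanly into three parts. The parser below is a sketch for illustration only; `parse_bigquery_source` is a hypothetical name, and it assumes none of the three parts contains a dot.

```python
def parse_bigquery_source(source):
    """Split bigquery://project.dataset.table into (project, dataset, table)."""
    prefix = "bigquery://"
    if not source.startswith(prefix):
        raise ValueError(f"not a BigQuery source: {source!r}")
    parts = source[len(prefix):].split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected bigquery://project.dataset.table")
    project, dataset, table = parts
    return project, dataset, table


project, dataset, table = parse_bigquery_source("bigquery://acme-analytics.sales.orders")
```

Validating the shape up front gives a clearer error than letting a malformed reference reach the warehouse client.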
Dift uses standard Google Cloud authentication.
Set your service account credentials:
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
export GOOGLE_APPLICATION_CREDENTIALS="C:/path/to/service-account.json"
pip install google-cloud-bigquery db-dtypes
- existing risk scoring
- HTML reports
- Excel reports
- JSON reports
- CSV reports
- drift detection
- quality validation
- cloud warehouse validation
- production dataset monitoring
- analytics QA workflows
- cross-environment data comparison
- SQL-driven data quality checks
- BigQuery access requires valid Google Cloud credentials.
- BigQuery billing and permissions are managed through Google Cloud.
- BigQuery comparisons use the existing Dift comparison engine and reports.
Dift can compare tables from SQL databases using SQLAlchemy connection strings.
This is useful for validating database migrations, checking staging vs production tables, and comparing database-backed datasets.
dift sqlite:///examples/old.db:customers_old \
sqlite:///examples/new.db:customers_new \
--key customer_id
connection_string:table_name
Examples:
sqlite:///examples/data.db:customers
postgresql://user:password@localhost:5432/mydb:customers
mysql://user:password@localhost:3306/mydb:customers
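Since connection strings themselves contain colons (scheme, password separator, port), the table name has to be taken from the last colon. A minimal sketch of that split follows; `parse_table_source` is a hypothetical helper, and it assumes table names never contain `/` or `@`.

```python
def parse_table_source(source):
    """Split 'connection_string:table_name' at the final colon."""
    connection_string, separator, table = source.rpartition(":")
    if not separator or not connection_string or not table:
        raise ValueError("expected connection_string:table_name")
    if "/" in table or "@" in table:
        # The last colon belonged to the URL itself, so no table name was given.
        raise ValueError(f"missing table name in {source!r}")
    return connection_string, table


pg_connection, pg_table = parse_table_source(
    "postgresql://user:password@localhost:5432/mydb:customers"
)
```

The same split also fits the `duckdb:///path/to/database.duckdb:table_name` form.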
pip install sqlalchemy
Database-specific drivers may also be required:
pip install psycopg2-binary # PostgreSQL
pip install pymysql # MySQL
- existing row comparison
- schema comparison
- drift detection
- risk scoring
- JSON, CSV, Excel, and HTML reports
- SQLite database files must exist locally.
- PostgreSQL and MySQL require valid connection strings and credentials.
- SQL comparisons use the existing Dift comparison engine and reports.
- Python 3.10+
pip install dift-cli
Then run:
dift --help

pip install --upgrade dift-cli

Windows (Git Bash):
python -m venv .venv
source .venv/Scripts/activate
pip install dift-cli

Windows (PowerShell):
python -m venv .venv
.venv\Scripts\Activate.ps1
pip install dift-cli

macOS/Linux:
python3 -m venv .venv
source .venv/bin/activate
pip install dift-cli

pipx install dift-cli

dift --help
or
python -m dift.cli --help

dift examples/old.csv examples/new.csv --key customer_id
If your paths are defined in the config, just run:
dift --config examples/config_sample.yaml

dift examples/old_drift.csv examples/new_drift.csv --key id --threshold 0.1

dift examples/old.csv examples/new.csv \
--key customer_id \
--report json \
--output report.json

dift examples/old.csv examples/new.csv \
--key customer_id \
--report csv \
--output report.csv

dift examples/old.csv examples/new.csv \
--key customer_id \
--report excel \
--output report.xlsx

dift examples/old.csv examples/new.csv \
--key customer_id \
--report html \
--output report.html

dift examples/old.csv examples/new.csv \
--key customer_id \
--report html \
--template dark \
--output report.html

dift examples/old.csv examples/new.csv \
--config examples/config_sample.yaml

dift profile run nightly-check

dift batch \
--old-dir data/old \
--new-dir data/new \
--key customer_id

╭─────────────────────────╮
│ Dift Dataset Comparison │
│ Risk Level: MEDIUM │
╰─────────────────────────╯
Warnings
Numeric drift:
'revenue'
mean shift 900.00%
(high, threshold 0.1)
Outlier spike:
'revenue' increased by 100.00%
(high)
Categorical shift:
'segment' max frequency shift 60.00%
(high)
examples/
├── old.csv
├── new.csv
├── old.parquet
├── new.parquet
├── old.xlsx
├── new.xlsx
├── old.json
├── new.json
├── old_drift.csv
├── new_drift.csv
├── config_sample.yaml
├── config_sample.toml
├── config_sample.json
├── config_thresholds.yaml
├── config_env.yaml
├── config_with_datasets.yaml
├── config_with_datasets.toml
└── config_with_datasets.json
dift before.csv after.csv
dift train_v1.csv train_v2.csv
dift prod.csv staging.csv --key id
dift train_v1.csv train_v2.csv --threshold 0.1
dift profile run nightly-check
dift batch \
--old-dir warehouse_snapshot_1 \
--new-dir warehouse_snapshot_2 \
--report html \
--output-dir reports/
dift prod.csv staging.csv \
--key customer_id \
--history
Track how risk and drift evolve across repeated comparison runs.
dift/
├── cli.py
├── core/
│   ├── comparator.py
│   ├── schema_diff.py
│   ├── row_diff.py
│   ├── quality_diff.py
│   ├── stats_diff.py
│   └── risk.py
├── io/
│   ├── config_loader.py
│   └── readers.py
├── reports/
│   ├── console_report.py
│   ├── json_report.py
│   ├── csv_report.py
│   ├── excel_report.py
│   ├── html_report.py
│   └── models.py
├── profiles.py
├── batch.py
├── thresholds.py
├── schedules.py
├── history.py
└── utils/
tests/
examples/
pytest
Lint:
ruff check .
Type checking:
mypy dift
- Direct database-to-database comparison
- Table-to-table comparison support
- Query-based dataset comparison
- Connection string support
- CLI database input support
- PostgreSQL table reader
- Schema inference support
- Query execution support
- Secure connection handling
- MySQL table reader
- Query-based comparisons
- Type compatibility handling
- SQLite local database support
- Lightweight comparison workflows
- File-based database comparison
- Native DuckDB integration
- Analytical dataset support
- Parquet interoperability
- Snowflake authentication support
- Warehouse query execution
- Large-scale dataset comparison
- BigQuery dataset comparison
- Service account authentication
- Query-based workflows
- Redshift warehouse support
- Efficient table extraction
- Warehouse schema compatibility
- YAML configuration support
- TOML configuration support
- JSON configuration support
- Dataset path support in configuration files
- Reusable comparison profiles
- Saved report configurations
- Named comparison presets
- Numeric drift thresholds
- Categorical shift thresholds
- Outlier thresholds
- Column-level threshold overrides
- Development/staging/production configs
- Environment variable support
- Secret management preparation
- Scheduled dataset checks
- Cron-friendly execution
- Time-based comparison workflows
- Non-interactive CLI support
- Automation-friendly exit codes
- Pipeline integration support
- Multi-dataset comparison support
- Folder-based comparisons
- Batch report generation
- Historical comparison tracking
- Drift trend analysis
- Historical risk tracking
- Severity color coding
- Conditional formatting
- Improved worksheet layouts
- Better readability styling
- Drift highlighting
- Severity badges
- Improved visual summaries
- Responsive layouts
- Execution timestamps
- Runtime metrics
- Dataset source metadata
- Threshold metadata
- Connector integration tests
- Cross-format consistency tests
- Warehouse mock testing
- Better help messages
- Clearer validation errors
- Progress indicators
- Extensible reader interfaces
- Connector registry architecture
- Internal plugin preparation
Contributions are welcome.
See:
CONTRIBUTING.md
Ways to help:
- Fix bugs
- Improve docs
- Add tests
- Improve performance
- Add connectors
- Improve CLI UX
MIT License
Dift aims to become the open-source standard for:
- dataset regression testing
- data drift monitoring
- ML data validation
- warehouse trust checks
- automated data quality enforcement
