Summary
Add sampling-based comparison support for large datasets.
Motivation
For very large datasets, full comparisons may be expensive or unnecessary.
Sampling-based comparisons allow Dift to provide faster, approximate insight into:
- drift
- schema changes
- quality issues
- distribution changes
This is useful for exploratory validation and large-scale monitoring.
Proposed Improvements
- Add sampling configuration support
- Support random sampling
- Support deterministic sampling with seeds
- Add sampled drift analysis workflows
- Clearly indicate when a report is based on sampled data
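A minimal sketch of how the configuration and seeded sampling listed above could fit together. The names `SamplingConfig` and `sample_rows` are hypothetical and are not part of Dift's current API; the sketch assumes rows are already loaded into a sequence.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class SamplingConfig:
    """Hypothetical sampling options (illustrative only)."""

    sample_size: Optional[int] = None  # None: run a full comparison
    seed: Optional[int] = None  # None: random, non-repeatable sampling


def sample_rows(rows: Sequence[T], config: SamplingConfig) -> List[T]:
    """Return a sample of rows, preserving their original order."""
    # Fall back to the full dataset when sampling is off or not needed.
    if config.sample_size is None or config.sample_size >= len(rows):
        return list(rows)
    # A dedicated Random instance avoids touching the global random state.
    rng = random.Random(config.seed)
    indices = sorted(rng.sample(range(len(rows)), config.sample_size))
    return [rows[i] for i in indices]
```

With a fixed seed, two calls on the same input return the same rows. With `seed=None`, each call draws a fresh random sample.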
Suggested Files
Potential implementation areas:
dift/core/comparator.py
dift/core/stats_diff.py
dift/io/readers.py
dift/reports/models.py
dift/reports/console_report.py
dift/reports/json_report.py
Suggested Tasks
- Add sampling options to comparison configuration
- Implement random sample selection
- Add deterministic seed support
- Add sampled comparison metadata
- Add tests for repeatable sampling behavior
- Add reporting indicators for sampled results
- Update documentation
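One possible shape for the sampled comparison metadata and the reporting indicators in the tasks above. `SamplingMetadata`, `console_banner`, and `to_json_fragment` are hypothetical names, and the field set is an assumption about what a report might need.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class SamplingMetadata:
    """Hypothetical record describing how a comparison was sampled."""

    sampled: bool
    sample_size: Optional[int]
    seed: Optional[int]
    total_rows_old: int
    total_rows_new: int


def console_banner(meta: SamplingMetadata) -> str:
    """Return a one-line notice for console reports, or "" for full runs."""
    if not meta.sampled:
        return ""
    seed = "unseeded" if meta.seed is None else f"seed={meta.seed}"
    return (
        f"NOTE: results are based on a sample of {meta.sample_size} rows "
        f"({seed}); values are approximate."
    )


def to_json_fragment(meta: SamplingMetadata) -> str:
    """Serialize the metadata under a "sampling" key for JSON reports."""
    return json.dumps({"sampling": asdict(meta)}, sort_keys=True)
```

Keeping the metadata in one model would let the console and JSON reports draw their sampling indicators from the same source.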
How to Test
Run:
pytest
ruff check .
Run targeted tests:
pytest tests/test_sampling.py
Manual validation:
dift large_old.csv large_new.csv \
--key id \
--sample-size 10000 \
--sample-seed 42
If CLI options are not added in the first implementation, test the internal API or config-driven workflow instead.
Verify:
- sampled comparisons complete successfully
- repeated runs with the same seed produce consistent results
- reports clearly show that sampling was used
- existing full comparison workflows remain unchanged
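The repeatability check above might be expressed as pytest-style tests along these lines. The sketch assumes a hash-based sampler keyed on the row key, which is one possible design and not necessarily what Dift will implement. `keep_key` and `sample_keys` are hypothetical helpers, and the sketch takes a sampling fraction rather than a fixed sample size.

```python
import hashlib
from typing import Iterable, Set


def keep_key(key: str, seed: int, fraction: float) -> bool:
    """Decide from a seeded hash of the key whether a row is sampled."""
    digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
    # Read the first 8 bytes as an integer and compare it against a
    # threshold proportional to the requested fraction.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(fraction * 2**64)


def sample_keys(keys: Iterable[str], seed: int, fraction: float) -> Set[str]:
    """Return the subset of keys selected for the sample."""
    return {k for k in keys if keep_key(k, seed, fraction)}


def test_same_seed_is_repeatable():
    keys = [str(i) for i in range(10_000)]
    first = sample_keys(keys, seed=42, fraction=0.1)
    second = sample_keys(keys, seed=42, fraction=0.1)
    assert first == second


def test_shared_keys_are_sampled_consistently():
    old_keys = [str(i) for i in range(0, 10_000)]
    new_keys = [str(i) for i in range(5_000, 15_000)]
    old_sample = sample_keys(old_keys, seed=42, fraction=0.1)
    new_sample = sample_keys(new_keys, seed=42, fraction=0.1)
    shared = set(old_keys) & set(new_keys)
    # A key present in both files is either sampled in both or in neither.
    assert old_sample & shared == new_sample & shared
```

Deciding per key matters for keyed comparisons: if the old and new files were sampled independently, most sampled keys would appear on only one side and could be misreported as added or removed rows.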
Documentation Impact
Update or create:
docs/performance.md
docs/configuration.md
docs/reports.md
Documentation should include:
- how sampling works
- CLI or config examples
- limitations of sampled comparisons
- when users should use sampling
- how sampled reports are labeled
Acceptance Criteria
- Sampling-based comparison is supported
- Deterministic sampling is available
- Reports include sampling metadata
- Existing workflows remain stable
- Tests pass
- Documentation updated