feat: Add Sampling-Based Dataset Comparison Support #100

Description

@ReginaldErzoah

Summary

Add sampling-based comparison support for large datasets.


Motivation

For very large datasets, full comparisons may be expensive or unnecessary.

Sampling-based comparisons allow Dift to provide faster approximate insight into:

  • drift
  • schema changes
  • quality issues
  • distribution changes

This is useful for exploratory validation and large-scale monitoring.


Proposed Improvements

  • Add sampling configuration support
  • Support random sampling
  • Support deterministic sampling with seeds
  • Add sampled drift analysis workflows
  • Clearly indicate when a report is based on sampled data
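As a rough illustration of the random and seeded sampling items above, here is a minimal sketch using only the Python standard library. The function name `sample_rows` and its parameters are hypothetical and not part of Dift's current API; the real implementation may sample at the reader level or through a dataframe library instead.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def sample_rows(rows: Sequence[T], sample_size: int, seed: Optional[int] = None) -> List[T]:
    """Return up to `sample_size` rows, preserving their original order.

    Hypothetical helper: with a fixed `seed` the same rows are chosen on
    every run; with `seed=None` the selection varies between runs.
    """
    if sample_size < 0:
        raise ValueError("sample_size must be non-negative")
    if sample_size >= len(rows):
        # Nothing to sample: the dataset already fits within the requested size.
        return list(rows)
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    indices = sorted(rng.sample(range(len(rows)), sample_size))
    return [rows[i] for i in indices]


rows = list(range(1_000))
first = sample_rows(rows, 10, seed=42)
second = sample_rows(rows, 10, seed=42)
print(first == second)  # same seed, same sample
```

Using a private `random.Random` instance rather than `random.seed` keeps the sampling repeatable without affecting other code that relies on the global generator.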

Suggested Files

Potential implementation areas:

dift/core/comparator.py
dift/core/stats_diff.py
dift/io/readers.py
dift/reports/models.py
dift/reports/console_report.py
dift/reports/json_report.py

Suggested Tasks

  • Add sampling options to comparison configuration
  • Implement random sample selection
  • Add deterministic seed support
  • Add sampled comparison metadata
  • Add tests for repeatable sampling behavior
  • Add reporting indicators for sampled results
  • Update documentation
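One possible shape for the sampled comparison metadata and the reporting indicator is sketched below. The `SamplingInfo` class, its field names, and the banner wording are assumptions made for illustration; the actual report models in `dift/reports/models.py` may look different.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class SamplingInfo:
    """Hypothetical metadata describing how a comparison was sampled."""

    sampled: bool = False
    sample_size: Optional[int] = None
    seed: Optional[int] = None
    total_rows: Optional[int] = None

    def banner(self) -> str:
        """Console indicator; empty string for full (unsampled) comparisons."""
        if not self.sampled:
            return ""
        total = "unknown" if self.total_rows is None else str(self.total_rows)
        return (
            f"NOTE: results are based on a sample of {self.sample_size} rows "
            f"(total rows: {total}, seed: {self.seed})"
        )


info = SamplingInfo(sampled=True, sample_size=10_000, seed=42, total_rows=2_500_000)
print(info.banner())
# The same metadata can be embedded in a JSON report under an assumed "sampling" key.
print(json.dumps({"sampling": asdict(info)}, indent=2))
```

Defaulting `sampled` to `False` means existing full-comparison reports can carry the same structure without changing their meaning.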

How to Test

Run:

pytest
ruff check .

Run targeted tests:

pytest tests/test_sampling.py

Manual validation:

dift large_old.csv large_new.csv \
  --key id \
  --sample-size 10000 \
  --sample-seed 42

If CLI options are not added in the first implementation, test the internal API or config-driven workflow instead.
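One design point worth checking during manual validation: when comparing by `--key`, drawing independent random samples from each file would leave few keys in common, so most sampled rows would appear as added or removed. A hedged alternative is to decide membership from a hash of the key, so both files select the same keys. The sketch below is only an illustration; `key_in_sample` and its `fraction` parameter are hypothetical, and this approach yields an approximate sample fraction rather than the exact row count implied by `--sample-size`.

```python
import hashlib


def key_in_sample(key: str, fraction: float, seed: int = 0) -> bool:
    """Deterministically decide whether `key` belongs to the sample.

    Hypothetical helper: the decision depends only on (seed, key), so the
    same key is kept or dropped in both the old and the new dataset.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < fraction


old_keys = [str(i) for i in range(0, 1_000)]
new_keys = [str(i) for i in range(500, 1_500)]

old_sample = {k for k in old_keys if key_in_sample(k, 0.1, seed=42)}
new_sample = {k for k in new_keys if key_in_sample(k, 0.1, seed=42)}

# Every sampled key that exists in both files is sampled in both.
shared = set(old_keys) & set(new_keys)
print((old_sample & shared) == (new_sample & shared))
```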

Verify:

  • sampled comparisons complete successfully
  • repeated runs with the same seed produce consistent results
  • reports clearly show that sampling was used
  • existing full comparison workflows remain unchanged

Documentation Impact

Update or create:

docs/performance.md
docs/configuration.md
docs/reports.md

Documentation should include:

  • how sampling works
  • CLI or config examples
  • limitations of sampled comparisons
  • when users should use sampling
  • how sampled reports are labeled

Acceptance Criteria

  • Sampling-based comparison is supported
  • Deterministic sampling is available
  • Reports include sampling metadata
  • Existing workflows remain stable
  • Tests pass
  • Documentation updated
