feat: Add Sampling-Based Dataset Comparison Support

## Summary

Add sampling-based comparison support for large datasets.

---

## Motivation

For very large datasets, a full comparison can be expensive, and it is often unnecessary when an approximate answer is enough.

Sampling-based comparisons allow Dift to provide faster approximate insight into:

- drift
- schema changes
- quality issues
- distribution changes

This is useful for exploratory validation and large-scale monitoring.

---

## Proposed Improvements

- Add sampling configuration support
- Support random sampling
- Support deterministic sampling with seeds
- Add sampled drift analysis workflows
- Clearly indicate when a report is based on sampled data
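
A minimal sketch of how random and seeded sampling could share one code path, using only the Python standard library. The function name `sample_keys` and its parameters are hypothetical and not part of Dift today. It samples key values rather than rows, on the assumption that a keyed comparison needs the same keys selected from both datasets; sampling each file independently would make unmatched keys look like added or removed rows.

```python
import random
from typing import Iterable, List, Optional


def sample_keys(keys: Iterable, sample_size: int, seed: Optional[int] = None) -> List:
    """Pick up to `sample_size` distinct keys (hypothetical helper).

    With a seed, the same input keys always yield the same sample.
    With seed=None, the sample is drawn from system randomness.
    Keys must be mutually comparable (for example all ints or all strings).
    """
    # Sort so the result does not depend on the order rows were read in.
    ordered = sorted(set(keys))
    if sample_size >= len(ordered):
        return ordered  # nothing to sample; fall back to the full key set
    rng = random.Random(seed)
    return sorted(rng.sample(ordered, sample_size))


# The same sampled keys would then be used to filter both the old and new dataset.
old_keys = range(1, 1001)
sampled = sample_keys(old_keys, sample_size=10, seed=42)
```

Whether the key pool should come from the old dataset, the new one, or their union is a design decision this sketch leaves open.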

---

## Suggested Files

Potential implementation areas:

```text
dift/core/comparator.py
dift/core/stats_diff.py
dift/io/readers.py
dift/reports/models.py
dift/reports/console_report.py
dift/reports/json_report.py
```

---

## Suggested Tasks

- Add sampling options to comparison configuration
- Implement random sample selection
- Add deterministic seed support
- Add sampled comparison metadata
- Add tests for repeatable sampling behavior
- Add reporting indicators for sampled results
- Update documentation
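
One possible shape for the sampled comparison metadata, sketched as a small dataclass that both the console and JSON reports could consume. `SamplingInfo`, its fields, and the banner wording are assumptions for illustration; the actual report models in `dift/reports/models.py` may look different.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class SamplingInfo:
    """Hypothetical metadata attached to a report built from sampled data."""

    sample_size: int
    total_rows: int
    seed: Optional[int] = None

    @property
    def fraction(self) -> float:
        # Guard against an empty dataset to avoid dividing by zero.
        return self.sample_size / self.total_rows if self.total_rows else 0.0

    def banner(self) -> str:
        """One-line indicator for the console report."""
        seed_text = f"seed={self.seed}" if self.seed is not None else "unseeded"
        return (
            f"SAMPLED: {self.sample_size} of {self.total_rows} rows "
            f"({self.fraction:.1%}, {seed_text})"
        )

    def to_dict(self) -> dict:
        """Plain dict suitable for embedding in the JSON report."""
        data = asdict(self)
        data["fraction"] = self.fraction
        return data


info = SamplingInfo(sample_size=10000, total_rows=1000000, seed=42)
print(info.banner())
```

A full (unsampled) comparison could simply carry no `SamplingInfo`, which keeps existing report output unchanged.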

---

## How to Test

Run:

```bash
pytest
ruff check .
```

Run targeted tests:

```bash
pytest tests/test_sampling.py
```

Manual validation:

```bash
dift large_old.csv large_new.csv \
  --key id \
  --sample-size 10000 \
  --sample-seed 42
```

If CLI options are not added in the first implementation, test the internal API or config-driven workflow instead.
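
If the CLI route is taken, the two flags from the manual validation command could be parsed roughly as follows. This is a sketch that assumes an `argparse`-based parser; Dift's actual CLI framework and argument names may differ, and `build_parser` and `positive_int` are hypothetical names.

```python
import argparse


def positive_int(text: str) -> int:
    """Reject zero and negative sample sizes at parse time."""
    value = int(text)
    if value <= 0:
        raise argparse.ArgumentTypeError("must be a positive integer")
    return value


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="dift")
    parser.add_argument("old", help="path to the old dataset")
    parser.add_argument("new", help="path to the new dataset")
    parser.add_argument("--key", help="key column used to match rows")
    # Both options default to None so that full comparisons stay the default.
    parser.add_argument("--sample-size", type=positive_int, default=None)
    parser.add_argument("--sample-seed", type=int, default=None)
    return parser


args = build_parser().parse_args(
    ["large_old.csv", "large_new.csv", "--key", "id",
     "--sample-size", "10000", "--sample-seed", "42"]
)
```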

Verify:

- sampled comparisons complete successfully
- repeated runs with the same seed and the same inputs produce identical results
- reports clearly show that sampling was used
- existing full comparison workflows remain unchanged
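
The repeatability and no-regression checks above could start out as tests like these in `tests/test_sampling.py`. `draw_sample` is a stand-in defined here only so the sketch runs on its own; the real tests would import whatever sampler Dift ends up exposing.

```python
import random


def draw_sample(rows, sample_size=None, seed=None):
    """Stand-in sampler (hypothetical): keeps original row order."""
    if sample_size is None or sample_size >= len(rows):
        return list(rows)  # sampling disabled or not needed
    rng = random.Random(seed)
    picked = sorted(rng.sample(range(len(rows)), sample_size))
    return [rows[i] for i in picked]


def test_same_seed_gives_same_sample():
    rows = list(range(1000))
    assert draw_sample(rows, 50, seed=42) == draw_sample(rows, 50, seed=42)


def test_sample_size_is_respected():
    rows = list(range(1000))
    assert len(draw_sample(rows, 50, seed=42)) == 50


def test_full_comparison_is_unchanged_without_sampling():
    rows = list(range(10))
    assert draw_sample(rows) == rows
    # A sample size larger than the dataset also falls back to all rows.
    assert draw_sample(rows, 100, seed=1) == rows
```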

---

## Documentation Impact

Update or create:

```text
docs/performance.md
docs/configuration.md
docs/reports.md
```

Documentation should include:

- how sampling works
- CLI or config examples
- limitations of sampled comparisons
- when users should use sampling
- how sampled reports are labeled

---

## Acceptance Criteria

- Sampling-based comparison is supported
- Deterministic sampling is available
- Reports include sampling metadata
- Existing workflows remain stable
- Tests pass
- Documentation updated
