dattri is a PyTorch library for developing, benchmarking, and deploying efficient data attribution algorithms. Data attribution quantifies how much each training sample influences model predictions. Published at NeurIPS 2024.
```bash
# Install (development)
pip install -e .[test]

# Install with CUDA acceleration
pip install -e .[test,sjlt]

# Recommended environment
conda create -n dattri python=3.10
conda activate dattri
pip install -e .[test]
```

```bash
# Run all checks (lint + tests)
make test

# Run tests only
pytest -v test/

# Run tests excluding GPU tests
pytest -v -m "not gpu" test/

# Run specific test file
pytest -v test/dattri/algorithm/test_influence_function.py

# Lint
ruff check

# Format
ruff format

# Lint with auto-fix
ruff check --fix
```

- Formatter/Linter: Ruff (preview mode, line length 88)
- Docstrings: Google-style
- Type hints: Required on all function parameters and return types
- Naming: PascalCase for classes, snake_case for functions, UPPER_SNAKE_CASE for constants
- Imports: `from __future__ import annotations` at top; use `TYPE_CHECKING` for expensive imports
- Import order: stdlib, third-party, local (`dattri.*`)
- Python version: 3.8+ compatible (no modern syntax like `X | Y` unions)
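Put together, a module header following these conventions might look like the sketch below. This is an illustration only, not actual library code; the `compute_scores` function and `MAX_BATCH_SIZE` constant are hypothetical.

```python
"""Example module illustrating the style conventions (hypothetical code)."""
from __future__ import annotations

from typing import TYPE_CHECKING, List

if TYPE_CHECKING:
    # Expensive imports deferred to type-checking time only.
    import torch  # noqa: F401

MAX_BATCH_SIZE: int = 128  # UPPER_SNAKE_CASE for constants


def compute_scores(values: List[float], scale: float = 1.0) -> List[float]:
    """Scale a list of attribution scores.

    Args:
        values: Raw attribution scores.
        scale: Multiplicative scaling factor.

    Returns:
        The scaled scores.
    """
    # 3.8-compatible hints: typing.List, not list[float]; no X | Y unions.
    return [v * scale for v in values]
```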
- `dattri/algorithm/`: Attribution algorithms (IF, TracIn, TRAK, RPS, Shapley, DVEmb, etc.)
  - All inherit from `BaseAttributor` or `BaseInnerProductAttributor`
  - Two-phase pattern: `cache()` precomputes, `attribute()` scores
- `dattri/func/`: Low-level utilities (HVP/IHVP, Fisher, random projection)
- `dattri/benchmark/`: Datasets and models for standardized evaluation
- `dattri/metric/`: Evaluation metrics (LOO, LDS, AUC, brittleness)
- `dattri/model_util/`: Model utilities (retraining, dropout, hooks)
- `dattri/task.py`: `AttributionTask` abstraction decoupling algorithms from tasks
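The two-phase pattern can be sketched as follows. This is a minimal stand-in, not the real `BaseAttributor` API: only the `cache()`/`attribute()` split comes from the design above; the class names, signatures, and toy scoring rule are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import List


class SketchBaseAttributor(ABC):
    """Toy stand-in for the two-phase attributor pattern (not dattri's API)."""

    @abstractmethod
    def cache(self, train_data: List[float]) -> None:
        """Phase 1: precompute expensive quantities once."""

    @abstractmethod
    def attribute(self, test_data: List[float]) -> List[List[float]]:
        """Phase 2: score every (train, test) pair using the cache."""


class DotProductAttributor(SketchBaseAttributor):
    """Toy attributor: influence of train point i on test point j is x_i * y_j."""

    def cache(self, train_data: List[float]) -> None:
        # In dattri this is where expensive quantities (e.g. gradients)
        # would be precomputed and stored.
        self._train = list(train_data)

    def attribute(self, test_data: List[float]) -> List[List[float]]:
        # Reuses the cache; can be called repeatedly with different test data.
        return [[x * y for y in test_data] for x in self._train]


attributor = DotProductAttributor()
attributor.cache([1.0, 2.0])          # phase 1: precompute
scores = attributor.attribute([3.0])  # phase 2: score
# scores[i][j] = influence of train sample i on test sample j
```

The point of the split is that `cache()` runs once per model/training set, while `attribute()` can be called cheaply for many different test sets.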
- Framework: pytest
- Tests mirror source structure under `test/dattri/`
- GPU tests marked with `@pytest.mark.gpu`
- Tests use synthetic data (`TensorDataset`) for speed
- `dattri/`: `AttributionTask` handles the ML model, training loss functions, and target function for attribution. Keep Attributors cleanly separated from `AttributionTask` to maintain generalizability across models, loss functions, and target functions. When implementing a new Attributor, inherit from an existing Attributor class to maximally reuse code. If the new Attributor is completely different from existing ones, inherit `BaseAttributor` to ensure API consistency.
- `examples/`: Bite-size examples, preferably a single script without a README. Examples should not be computationally heavy; heavy workloads belong in `experiments/` instead.
- `experiments/`: Large experiments/benchmarks requiring multiple files. One folder per experiment, preferably with a `README.md`.
- Unit tests (`test/`, `.github/workflows/pytest.yml`): Run-through tests and sanity checks. The test directory (almost) mirrors the structure of `dattri/`. A new unit test should be included whenever non-trivial changes are made in `dattri/` (new feature or bug fix). GPU tests are omitted by default in GitHub Actions and can be manually triggered by commenting "run gpu test" on the PR.
- Example tests (`.github/workflows/examples_test.yml`): A new example test should be added whenever a new example is added in `examples/`. An example test is a line in the workflow file that runs the example script.
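Concretely, such a line is a shell step in the workflow. A hypothetical fragment is sketched below; the job name, step names, and script path are illustrative, not copied from the actual `examples_test.yml`.

```yaml
# Hypothetical fragment of .github/workflows/examples_test.yml
jobs:
  example-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install dattri
        run: pip install -e .[test]
      - name: Run new example   # add one such run line per new example
        run: python examples/your_new_example.py
```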
- For non-trivial PRs (new feature, bug fix, significant refactoring), start with a new Issue outlining the planned changes for maintainer approval. This is required if the changes involve major API design changes.
- PRs can start as `[WIP]` without all tests passing to get early feedback.
- Code review focuses on correctness, sufficient and clear comments/docstrings, and whether unit tests or example tests are properly added.
- Manually trigger "run gpu test" if the changes involve existing GPU tests.
- Always use squash and merge to merge a PR.
- The document test runs after the PR is merged. If it fails, fix it promptly.
- Close the related issues after the document test passes.