CLI-first industrial visual anomaly / defect detection. Fit on the good parts → score the defects → eval against baseline → export to ONNX with verified parity → benchmark CPU latency, all driven by one machine-readable binary and tracked in MLflow. Built on the ml-pipeline-template house style.
Most ML demos are a notebook that works once on the author's laptop. This is the opposite: a real anomaly-detection pipeline that runs the same way in CI, in an agent loop, and on a fresh checkout — with the production bits (ONNX export, latency benchmark) actually wired and verified, not promised.
| Path | What | Status |
|---|---|---|
| Verified CPU | reconstruction-error anomaly detector (PCA / IsolationForest) on built-in load_digits, ONNX export + parity, CPU latency benchmark | ✅ runs in CI, on a laptop, in seconds |
| Scaffolded GPU | deep MVTec-AD model (PaDiM/PatchCore-style or autoencoder) on rented GPU via Modal | 🟡 wired (configs/mvtec.yaml, scripts/modal_train.py), not CI-verified |
The CPU path is genuinely a defensible CV anomaly pipeline. The deep path shows the same operational shell scaling to a real benchmark dataset on rented hardware, without pretending CI trained a deep model it didn't.
Industrial defect detection has many good parts and few/no labelled defects, so you fit only on "normal" data and score deviation from it. Here one digit class is the normal part; every other digit is a defect.
- PCA — fit a subspace on normal images; anomaly score is the squared reconstruction error. Defects don't lie on the normal-part manifold, so they reconstruct poorly. (The linear stand-in for a deep autoencoder.)
- IsolationForest — isolation-depth anomaly score, no reconstruction.
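In symbols, the PCA score described above can be written as follows. This is a sketch under the usual PCA conventions. The notation is introduced here for illustration and is not taken from the code: $\mu$ is the mean of the normal training images, and $W$ holds the top $k$ principal directions as orthonormal columns, with $k$ playing the role of n_components.

```latex
\hat{x} = \mu + W W^{\top} (x - \mu), \qquad s(x) = \lVert x - \hat{x} \rVert_2^2
```

A normal image lies close to the fitted subspace, so $s(x)$ is small; a defect has a large component outside that subspace, so $s(x)$ is large.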
Evaluation is ROC-AUC + average precision vs a random-score baseline — because a metric without a baseline is marketing, not evaluation.
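To make the baseline comparison concrete, here is a minimal, standard-library sketch of ROC-AUC measured against a random-score baseline. It is an illustration, not the project's evaluation code: `roc_auc` and `lift_over_random` are hypothetical helper names, the toy labels and scores are made up, and the pairwise AUC is only practical for small inputs.

```python
import random


def roc_auc(labels, scores):
    """ROC-AUC as the chance that a random anomaly outscores a random normal.

    labels: 1 = anomaly, 0 = normal. Ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one anomaly and one normal sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def lift_over_random(labels, scores, seed=0):
    """Return (model AUC, random-score baseline AUC, lift = model - baseline)."""
    rng = random.Random(seed)
    baseline_scores = [rng.random() for _ in labels]  # scores that ignore the input
    model = roc_auc(labels, scores)
    baseline = roc_auc(labels, baseline_scores)
    return model, baseline, model - baseline


# Toy data (made up): every anomaly scores higher than every normal sample.
labels = [0, 0, 0, 1, 1, 1, 1]
scores = [0.10, 0.20, 0.15, 0.90, 0.80, 0.70, 0.95]
model_auc, baseline_auc, lift = lift_over_random(labels, scores)
print(f"ROC-AUC {model_auc:.4f} (baseline {baseline_auc:.4f}, lift {lift:+.4f})")
```

A random scorer's ROC-AUC hovers around 0.5 regardless of class balance, whereas its average precision tracks the anomaly fraction, so the two baselines can look very different on the same holdout.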
uv sync --extra dev # install (CPU-only deps)
uv run vp doctor # environment readiness (--json for CI)
uv run vp train configs/digits.yaml # fit + eval vs baseline + log to MLflow
uv run vp eval digits-pca # recompute metrics on the holdout
uv run vp export digits-pca # -> ONNX, assert parity with sklearn
uv run vp bench digits-pca --n 200 # onnxruntime CPU latency (p50/p95)
uv run vp infer digits-pca --index 1 # score one image

The block above is marked `<!-- ci-test -->`: CI runs these exact commands on every push, so this quickstart can never silently drift from the code.
make demo runs the full verified loop. Output of train (human mode):
trained digits-pca (pca, normal class 0)
ROC-AUC 1.0 (baseline 0.4895, lift +0.5105)
avg precision 1.0 (baseline 0.9602)
fit on 106 normal · eval 1691 (1619 anomalies)
model -> artifacts/digits-pca/model.joblib
Digit 0 vs the rest is an easy anomaly task: PCA reconstruction separates it perfectly. Harder normal classes are still strong but not perfect (e.g. normal_class: 8 gives ROC-AUC ~0.97 with PCA, ~0.93 with IsolationForest). The verified contract is "model ≫ random baseline, parity holds", not a magic number. Switch normal_class / model in the config to see the range.
export (parity is asserted — a failed check is a non-zero exit):
exported digits-pca (pca) -> artifacts/digits-pca/model.onnx (2559 bytes)
parity ok: max abs diff 0.000e+00 <= tol 1e-03 over 64 samples
(For model: isoforest the ONNX graph computes the full tree-ensemble score, so
parity is a more interesting ~6e-08 rather than exact.)
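The parity assertion boils down to a max-absolute-difference check with a tolerance, mapped onto an exit code. The sketch below shows that shape in plain Python. It is an assumption about the general technique, not the repo's export code: `check_parity` and `parity_exit_code` are hypothetical names, and the score lists are made-up stand-ins for sklearn and onnxruntime outputs.

```python
def check_parity(reference, candidate, tol=1e-3):
    """Return (ok, max_abs_diff) for two equally long score vectors."""
    if len(reference) != len(candidate):
        raise ValueError("score vectors differ in length")
    if not reference:
        raise ValueError("need at least one sample")
    diff = max(abs(r - c) for r, c in zip(reference, candidate))
    return diff <= tol, diff


def parity_exit_code(reference, candidate, tol=1e-3):
    """Print a one-line report and return 0 on success, 1 on failure."""
    ok, diff = check_parity(reference, candidate, tol)
    status = "ok" if ok else "FAILED"
    print(f"parity {status}: max abs diff {diff:.3e}, tol {tol:.0e}, "
          f"{len(reference)} samples")
    return 0 if ok else 1


# Made-up scores: the second vector differs by a float-rounding-sized amount.
sklearn_scores = [0.12, 3.40, 0.07]
onnx_scores = [0.12, 3.40 + 6e-8, 0.07]
exit_code = parity_exit_code(sklearn_scores, onnx_scores)
```

Returning the code instead of calling `sys.exit` directly keeps the check easy to test; a CLI wrapper would pass the returned value to `sys.exit`.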
bench (onnxruntime, batch 1):
benchmarked digits-pca (pca) onnxruntime CPU, batch 1, n=200
latency p50 0.0028 ms · p95 0.0031 ms · mean 0.0028 ms
throughput ~351571 img/s
Exact numbers vary by machine; the shape (model ≫ baseline, parity within tolerance, sub-millisecond CPU latency) is what's verified in CI.
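For readers who want to see how p50/p95 figures like these are typically produced, here is a minimal timing sketch using only the standard library. It is not the `vp bench` implementation: `benchmark` is a hypothetical helper, `fake_inference` stands in for a single-image onnxruntime call, and the warmup count is an arbitrary choice.

```python
import statistics
import time


def benchmark(fn, n=200, warmup=10):
    """Time n single calls of fn and summarise latency in milliseconds."""
    for _ in range(warmup):  # untimed calls so one-off setup cost is excluded
        fn()
    samples_ms = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    mean_ms = statistics.fmean(samples_ms)
    return {
        "n": n,
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "mean_ms": mean_ms,
        "throughput_per_s": 1000.0 / mean_ms if mean_ms > 0 else float("inf"),
    }


def fake_inference():
    """Stand-in workload; a real benchmark would call the ONNX session here."""
    return sum(i * i for i in range(1000))


stats = benchmark(fake_inference, n=200)
print(f"latency p50 {stats['p50_ms']:.4f} ms, p95 {stats['p95_ms']:.4f} ms, "
      f"mean {stats['mean_ms']:.4f} ms")
```

Reporting percentiles alongside the mean matters because a few slow calls can inflate the mean while leaving p50 untouched.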
vp doctor [--json] # is this environment ready?
vp train <config> [--out] [--json] # fit, eval vs baseline, log to MLflow
vp eval <name> [--config] [--out] [--json] # recompute metrics on the holdout
vp export <name> [--out] [--json] # -> ONNX, verify sklearn parity
vp bench <name> [--n] [--out] [--json] # onnxruntime CPU latency (p50/p95)
vp infer <name> [--index] [--out] [--json] # score one dataset image
vp gpu-train <config> [--launch] [--json] # scaffolded deep MVTec path (Modal)
vp version [--json]
Switch models in configs/digits.yaml (model: pca or model: isoforest,
plus normal_class, n_components, contamination, seed).
uv sync --extra gpu # heavy deps: torch/timm/modal
modal token new # one-time auth
modal run scripts/modal_train.py --config configs/mvtec.yaml

vp gpu-train configs/mvtec.yaml validates the config and either prints the
launch command or fails cleanly when the gpu extra is absent. The Modal script
is real, coherent code (frozen timm backbone → patch features → per-patch
Gaussian → Mahalanobis scoring) with NotImplementedError where the licensed
MVTec-AD download is required — it is not run in CI.
make up # MLflow on localhost:5050
export MLFLOW_TRACKING_URI=http://localhost:5050

The CLI works without it (falls back to local sqlite:///mlflow.db).
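The fallback behaviour can be pictured as a small resolution step: use the environment variable when it is set, otherwise the local SQLite store. The sketch below is an assumption about how such a fallback might look, not the CLI's actual code; `resolve_tracking_uri` is a hypothetical name.

```python
import os

DEFAULT_TRACKING_URI = "sqlite:///mlflow.db"  # local fallback named in this README


def resolve_tracking_uri(env=None):
    """Pick the MLflow tracking URI: the env var if set and non-empty, else the fallback."""
    env = os.environ if env is None else env
    uri = env.get("MLFLOW_TRACKING_URI", "").strip()
    return uri or DEFAULT_TRACKING_URI


# The env mapping is injectable so the behaviour can be checked without touching os.environ.
with_server = resolve_tracking_uri({"MLFLOW_TRACKING_URI": "http://localhost:5050"})
without_server = resolve_tracking_uri({})
print(with_server, without_server)
```

The resolved value could then be handed to `mlflow.set_tracking_uri` before any run is logged.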
uv run marimo edit notebooks/01_pr_curve.py # PR/ROC curves vs baseline
uv run marimo edit notebooks/02_samples.py # normal vs anomaly images

| Path | Status |
|---|---|
| vp train / eval beats random baseline | ✅ verified (asserted in tests + CI) |
| vp export ONNX parity with sklearn | ✅ verified (PCA + IsolationForest) |
| vp bench onnxruntime CPU latency | ✅ verified |
| Full train→eval→export→bench→infer loop | ✅ verified (pytest + CI) |
| pytest smoke suite + ruff in CI | ✅ verified |
| MLflow local sqlite store | ✅ verified |
| MLflow server via docker-compose | 🟡 compose provided, runs locally |
| Deep MVTec-AD model via Modal (gpu-train) | 🟡 scaffolded, not CI-verified |
Every command is non-interactive, takes --json, and uses load-bearing exit codes — so Codex, Claude Code, Cursor, Copilot, and friends can drive the whole train→eval→export→bench→infer loop with no TTY, no UI, no running service: parse stdout, branch on the exit code.
uv run vp train configs/digits.yaml --json
# -> {"ok": true, "name": "digits-pca", "metrics": {"roc_auc": 1.0, "lift_roc_auc": 0.5105, ...}, ...}

Agent instructions live in AGENTS.md, the cross-tool standard; CLAUDE.md is a symlink to it.
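As an illustration of "parse stdout, branch on the exit code", here is a minimal driver sketch. It assumes exit code 0 means success and that the payload carries the `ok` field shown in the sample above; `parse_result` and `run_step` are hypothetical helpers, and the sample payload is trimmed to the fields that appear in this README.

```python
import json
import subprocess


def parse_result(returncode, stdout):
    """Combine the exit code and the JSON payload into one decision."""
    try:
        payload = json.loads(stdout)
    except json.JSONDecodeError:
        payload = {}
    if not isinstance(payload, dict):
        payload = {}
    ok = returncode == 0 and payload.get("ok") is True
    return ok, payload


def run_step(args):
    """Run one CLI step non-interactively and parse its --json output."""
    proc = subprocess.run(args, capture_output=True, text=True, check=False)
    return parse_result(proc.returncode, proc.stdout)


# Offline demonstration with a payload trimmed to the fields shown above.
sample_stdout = (
    '{"ok": true, "name": "digits-pca", '
    '"metrics": {"roc_auc": 1.0, "lift_roc_auc": 0.5105}}'
)
ok, payload = parse_result(0, sample_stdout)
if ok:
    print("next step: export", payload["name"])
else:
    print("stop: train step failed")
```

In a real loop the agent would call something like `run_step(["uv", "run", "vp", "train", "configs/digits.yaml", "--json"])` and only proceed to export when `ok` is true.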
Most repos' CI checks that the code parses. This one checks that the pipeline works — three things beyond lint + tests, all stdlib, no extra deps:
- It runs the pipeline and publishes the numbers. Every push fits the detector and posts a live metrics table (ROC-AUC, average precision, baseline, lift) to the GitHub Actions run summary (scripts/ci_report.py). The numbers in CI are produced on that commit, not pasted by hand.
- It keeps the docs honest. The Quickstart block is marked `<!-- ci-test -->` and scripts/test_readme.py runs those exact commands in CI. Docs that drift from the code fail the build.
- It proves determinism. scripts/check_repro.py trains twice and asserts identical metrics: a seed is a promise, and CI verifies the promise holds.
Run them locally too: make summary, make readme, make repro.
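The determinism check follows a simple pattern: run the same seeded job twice and compare every metric exactly. The sketch below shows that pattern with a toy workload. It is not scripts/check_repro.py: `train_and_eval` is a hypothetical stand-in for a seeded train + eval run, and `check_repro` is an illustrative helper.

```python
import random


def train_and_eval(seed):
    """Hypothetical stand-in for a seeded train + eval run; returns a metrics dict."""
    rng = random.Random(seed)
    scores = [rng.gauss(0.0, 1.0) for _ in range(100)]
    return {"mean_score": sum(scores) / len(scores), "max_score": max(scores)}


def check_repro(run, seed=42):
    """Run twice with the same seed; return (reproducible, mismatched metric names)."""
    first = run(seed)
    second = run(seed)
    keys = first.keys() | second.keys()
    mismatched = sorted(k for k in keys if first.get(k) != second.get(k))
    return not mismatched, mismatched


reproducible, mismatched = check_repro(train_and_eval)
print("reproducible" if reproducible else f"metrics differ: {mismatched}")
```

Exact equality, not a tolerance, is the point here: if two runs with the same seed differ at all, some source of randomness is not under the seed's control.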
Apache-2.0.