RubenHaisma/vision-pipeline

vision-pipeline

CLI-first industrial visual anomaly / defect detection. Fit on the good parts → score the defects → eval-against-baseline → export to ONNX with verified parity → benchmark CPU latency, all driven by one machine-readable binary, tracked in MLflow. Built on the ml-pipeline-template house style.

Most ML demos are a notebook that works once on the author's laptop. This is the opposite: a real anomaly-detection pipeline that runs the same way in CI, in an agent loop, and on a fresh checkout — with the production bits (ONNX export, latency benchmark) actually wired and verified, not promised.

Two paths, honestly labelled

| Path | What | Status |
| --- | --- | --- |
| Verified CPU | reconstruction-error anomaly detector (PCA / IsolationForest) on built-in load_digits, ONNX export + parity, CPU latency benchmark | ✅ runs in CI, on a laptop, in seconds |
| Scaffolded GPU | deep MVTec-AD model (PaDiM/PatchCore-style or autoencoder) on rented GPU via Modal | 🟡 wired (configs/mvtec.yaml, scripts/modal_train.py), not CI-verified |

The CPU path is genuinely a defensible CV anomaly pipeline. The deep path shows the same operational shell scaling to a real benchmark dataset on rented hardware, without pretending CI trained a deep model it didn't.

The model (verified path)

Industrial defect detection has many good parts and few/no labelled defects, so you fit only on "normal" data and score deviation from it. Here one digit class is the normal part; every other digit is a defect.

  • PCA — fit a subspace on normal images; anomaly score is the squared reconstruction error. Defects don't lie on the normal-part manifold, so they reconstruct poorly. (The linear stand-in for a deep autoencoder.)
  • IsolationForest — isolation-depth anomaly score, no reconstruction.
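The two scorers above can be sketched with scikit-learn on load_digits. This is a minimal illustration, not the repository's code: the half/half split of the normal class, `N_COMPONENTS`, the pixel scaling, and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

NORMAL_CLASS = 0    # the digit treated as the "good part"
N_COMPONENTS = 16   # hypothetical value, not taken from configs/digits.yaml

X, y = load_digits(return_X_y=True)
X = X / 16.0                                  # pixel values are 0..16
is_anomaly = (y != NORMAL_CLASS).astype(int)  # every other digit is a "defect"

# Assumed split: fit on half of the normal images, evaluate on everything else.
rng = np.random.default_rng(0)
normal_idx = rng.permutation(np.flatnonzero(y == NORMAL_CLASS))
fit_idx = normal_idx[: len(normal_idx) // 2]
eval_mask = np.ones(len(y), dtype=bool)
eval_mask[fit_idx] = False


def pca_score(model, samples):
    """Squared reconstruction error after projecting onto the normal subspace."""
    recon = model.inverse_transform(model.transform(samples))
    return ((samples - recon) ** 2).sum(axis=1)


pca = PCA(n_components=N_COMPONENTS, random_state=0).fit(X[fit_idx])
pca_scores = pca_score(pca, X[eval_mask])

# score_samples is higher for normal points, so negate it to get an anomaly score.
forest = IsolationForest(random_state=0).fit(X[fit_idx])
forest_scores = -forest.score_samples(X[eval_mask])

pca_auc = roc_auc_score(is_anomaly[eval_mask], pca_scores)
forest_auc = roc_auc_score(is_anomaly[eval_mask], forest_scores)
print(f"PCA ROC-AUC {pca_auc:.3f} · IsolationForest ROC-AUC {forest_auc:.3f}")
```

Both models only ever see normal images during fitting; the labels are used solely for evaluation.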

Evaluation is ROC-AUC + average precision vs a random-score baseline — because a metric without a baseline is marketing, not evaluation.
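To make the "metric vs baseline" idea concrete, here is a small stdlib-only sketch. The `roc_auc` and `lift_roc_auc` keys mirror the JSON output shown in this README; the other key names, the function names, and the simplified tie handling are assumptions, and the project itself may well compute these with scikit-learn.

```python
import random


def roc_auc(labels, scores):
    """Chance that a random anomaly outscores a random normal sample (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of precision@k at the rank of each true anomaly (no special tie handling)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


def evaluate_vs_baseline(labels, scores, seed=0):
    """Score the model and a seeded random scorer on the same labels."""
    rng = random.Random(seed)
    baseline_scores = [rng.random() for _ in labels]
    auc = roc_auc(labels, scores)
    base_auc = roc_auc(labels, baseline_scores)
    return {
        "roc_auc": auc,
        "baseline_roc_auc": base_auc,
        "lift_roc_auc": auc - base_auc,
        "avg_precision": average_precision(labels, scores),
        "baseline_avg_precision": average_precision(labels, baseline_scores),
    }
```

Both helpers assume the labels contain at least one anomaly (1) and one normal sample (0). The pairwise AUC is quadratic in the number of samples, which is fine for a sketch but not for large holdouts.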

Quickstart

uv sync --extra dev                       # install (CPU-only deps)
uv run vp doctor                          # environment readiness (--json for CI)
uv run vp train configs/digits.yaml       # fit + eval vs baseline + log to MLflow
uv run vp eval digits-pca                 # recompute metrics on the holdout
uv run vp export digits-pca               # -> ONNX, assert parity with sklearn
uv run vp bench digits-pca --n 200        # onnxruntime CPU latency (p50/p95)
uv run vp infer digits-pca --index 1      # score one image

The block above is marked <!-- ci-test -->. CI runs these exact commands on every push, so this quickstart cannot silently drift from the code.
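One way such a docs check could work is sketched below. It is not the contents of scripts/test_readme.py: the function name, the assumption that the marker sits on a line before the fenced block it tags, and the comment stripping via `shlex` are all illustrative choices.

```python
import shlex

MARKER = "<!-- ci-test -->"
FENCE = "`" * 3  # built programmatically so this example can itself live in a fenced block


def extract_commands(markdown):
    """Return the commands (as argv lists) from each fenced block that follows a marker."""
    commands = []
    armed = False     # a marker was seen and the next fence should be captured
    in_block = False  # currently inside a captured block
    for line in markdown.splitlines():
        stripped = line.strip()
        if not in_block and MARKER in stripped:
            armed = True
        elif stripped.startswith(FENCE):
            if in_block:
                in_block = False
                armed = False
            elif armed:
                in_block = True
        elif in_block:
            # comments=True drops trailing "# ..." annotations like those in the Quickstart
            argv = shlex.split(stripped, comments=True)
            if argv:
                commands.append(argv)
    return commands
```

A CI step could then pass each argv list to `subprocess.run(..., check=True)` so that any failing command fails the build.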

make demo runs the full verified loop. Output of train (human mode):

trained digits-pca (pca, normal class 0)
  ROC-AUC 1.0  (baseline 0.4895, lift +0.5105)
  avg precision 1.0  (baseline 0.9602)
  fit on 106 normal · eval 1691 (1619 anomalies)
  model -> artifacts/digits-pca/model.joblib

Digit 0 vs the rest is an easy anomaly task: PCA reconstruction separates it perfectly. Harder normal classes are still strong but not perfect (e.g. normal_class: 8 gives ROC-AUC ~0.97 with PCA, ~0.93 with IsolationForest). The average-precision baseline is already high (0.9602) because anomalies make up 1619 of the 1691 evaluation samples, so the ROC-AUC lift is the more telling number. The verified contract is "model ≫ random baseline, parity holds", not a magic number. Switch normal_class/model in the config to see the range.

export (parity is asserted — a failed check is a non-zero exit):

exported digits-pca (pca) -> artifacts/digits-pca/model.onnx (2559 bytes)
  parity ok: max abs diff 0.000e+00 <= tol 1e-03 over 64 samples

(For model: isoforest the ONNX graph computes the full tree-ensemble score, so parity is a more interesting ~6e-08 rather than exact.)
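The parity assertion itself reduces to a max-abs-diff comparison that maps to an exit code. The sketch below shows only that check under assumed names (`max_abs_diff`, `check_parity`). How the two score vectors are produced (the sklearn model and the exported ONNX graph run on the same samples) is left out, and the message format is only modelled on the output above.

```python
def max_abs_diff(reference, exported):
    """Largest element-wise gap between two equally long score vectors."""
    if len(reference) != len(exported):
        raise ValueError("score vectors must have the same length")
    return max(abs(r - e) for r, e in zip(reference, exported))


def check_parity(reference, exported, tol=1e-3):
    """Print a one-line verdict and return a process exit code (0 = parity holds)."""
    diff = max_abs_diff(reference, exported)
    ok = diff <= tol
    status = "ok" if ok else "FAILED"
    relation = "<=" if ok else ">"
    print(
        f"parity {status}: max abs diff {diff:.3e} {relation} tol {tol:.0e} "
        f"over {len(reference)} samples"
    )
    return 0 if ok else 1
```

Wiring the return value into `sys.exit(...)` is what would turn a failed check into a non-zero exit that CI or an agent can branch on.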

bench (onnxruntime, batch 1):

benchmarked digits-pca (pca) onnxruntime CPU, batch 1, n=200
  latency p50 0.0028 ms · p95 0.0031 ms · mean 0.0028 ms
  throughput ~351571 img/s

Exact numbers vary by machine; the shape (model ≫ baseline, parity within tolerance, sub-millisecond CPU latency) is what's verified in CI.
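A batch-1 latency benchmark of this shape can be sketched with the standard library alone. Here `predict` stands in for whatever callable runs a single inference (for example a wrapper around an onnxruntime session); the function names, the warm-up count, the dictionary keys, and the nearest-rank percentile are assumptions rather than the tool's actual implementation.

```python
import math
import time


def percentile(sorted_values, q):
    """Nearest-rank percentile of an already sorted list (q in 0..100)."""
    rank = max(1, math.ceil(q / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def bench(predict, sample, n=200, warmup=10):
    """Time n single-sample calls and summarise latency in milliseconds."""
    for _ in range(warmup):  # warm caches before measuring
        predict(sample)
    latencies_ms = []
    for _ in range(n):
        start = time.perf_counter()
        predict(sample)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    latencies_ms.sort()
    mean_ms = sum(latencies_ms) / n
    return {
        "n": n,
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "mean_ms": mean_ms,
        "throughput_per_s": 1000.0 / mean_ms if mean_ms > 0 else float("inf"),
    }
```

Reporting p95 alongside p50 matters because a mean can hide occasional slow calls, which is usually what a latency budget cares about.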

CLI surface

vp doctor [--json]                                  # is this environment ready?
vp train  <config> [--out] [--json]                 # fit, eval vs baseline, log to MLflow
vp eval   <name> [--config] [--out] [--json]        # recompute metrics on the holdout
vp export <name> [--out] [--json]                   # -> ONNX, verify sklearn parity
vp bench  <name> [--n] [--out] [--json]             # onnxruntime CPU latency (p50/p95)
vp infer  <name> [--index] [--out] [--json]         # score one dataset image
vp gpu-train <config> [--launch] [--json]           # scaffolded deep MVTec path (Modal)
vp version [--json]

Switch models in configs/digits.yaml (model: pca or model: isoforest, plus normal_class, n_components, contamination, seed).
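The config keys listed above could be validated along these lines. The key names and the two model values come from this README; the `DetectorConfig` class, its default values, and the specific range checks are hypothetical and are not read from configs/digits.yaml.

```python
from dataclasses import dataclass

ALLOWED_MODELS = ("pca", "isoforest")


@dataclass(frozen=True)
class DetectorConfig:
    """Illustrative config holder; every default here is a placeholder."""

    model: str = "pca"
    normal_class: int = 0        # load_digits has classes 0..9
    n_components: int = 16       # only meaningful for model="pca"
    contamination: float = 0.1   # only meaningful for model="isoforest"
    seed: int = 0

    def __post_init__(self):
        if self.model not in ALLOWED_MODELS:
            raise ValueError(f"model must be one of {ALLOWED_MODELS}, got {self.model!r}")
        if not 0 <= self.normal_class <= 9:
            raise ValueError("normal_class must be a digit between 0 and 9")
        if not 1 <= self.n_components <= 64:
            raise ValueError("n_components must be between 1 and 64 (8x8 images)")
        if not 0.0 < self.contamination <= 0.5:
            raise ValueError("contamination must be in (0, 0.5]")
```

Failing fast on a bad config keeps errors out of the training step, where they would be slower and harder to diagnose.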

The scaffolded GPU path

uv sync --extra gpu                                 # heavy deps: torch/timm/modal
modal token new                                     # one-time auth
modal run scripts/modal_train.py --config configs/mvtec.yaml

vp gpu-train configs/mvtec.yaml validates the config and either prints the launch command or fails cleanly when the gpu extra is absent. The Modal script is real, coherent code (frozen timm backbone → patch features → per-patch Gaussian → Mahalanobis scoring) with NotImplementedError where the licensed MVTec-AD download is required — it is not run in CI.
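The per-patch Gaussian and Mahalanobis steps of that chain can be illustrated in NumPy. This is a sketch of the general PaDiM-style idea, not the code in scripts/modal_train.py: random arrays stand in for the frozen backbone's patch features, and the function names, array shapes, and `eps` regulariser are assumptions.

```python
import numpy as np


def fit_patch_gaussians(feats, eps=0.01):
    """Fit one Gaussian per patch position from normal-image features.

    feats has shape (n_images, n_patches, dim). Returns per-patch means and
    inverse covariances with shapes (n_patches, dim) and (n_patches, dim, dim).
    """
    n, _, d = feats.shape
    mean = feats.mean(axis=0)
    centered = feats - mean
    cov = np.einsum("npi,npj->pij", centered, centered) / (n - 1)
    cov += eps * np.eye(d)  # regularise so every covariance is invertible
    return mean, np.linalg.inv(cov)


def patch_anomaly_map(feats, mean, cov_inv):
    """Mahalanobis distance of each patch of one image; feats is (n_patches, dim)."""
    delta = feats - mean
    sq = np.einsum("pi,pij,pj->p", delta, cov_inv, delta)
    return np.sqrt(np.maximum(sq, 0.0))


# Placeholder features: 200 "normal" images, 4 patch positions, 3-dim embeddings.
rng = np.random.default_rng(0)
mean, cov_inv = fit_patch_gaussians(rng.normal(size=(200, 4, 3)))
normal_map = patch_anomaly_map(rng.normal(size=(4, 3)), mean, cov_inv)
defect_map = patch_anomaly_map(rng.normal(size=(4, 3)) + 10.0, mean, cov_inv)
image_score = defect_map.max()  # a common image-level score: the worst patch
```

The per-patch map localises the defect, while its maximum gives a single image-level score that can be evaluated the same way as the CPU path.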

Tracking UI (optional)

make up                                             # MLflow on localhost:5050
export MLFLOW_TRACKING_URI=http://localhost:5050

The CLI works without it (falls back to local sqlite:///mlflow.db).
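The fallback described here amounts to an environment lookup with a default. A minimal sketch, assuming a helper named `resolve_tracking_uri` (hypothetical, not necessarily how the CLI is written):

```python
import os

DEFAULT_TRACKING_URI = "sqlite:///mlflow.db"


def resolve_tracking_uri(env=None):
    """Use MLFLOW_TRACKING_URI when set and non-empty, else the local sqlite store."""
    env = os.environ if env is None else env
    return env.get("MLFLOW_TRACKING_URI") or DEFAULT_TRACKING_URI
```

The resolved value could then be passed to `mlflow.set_tracking_uri(...)` before logging a run.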

Notebooks (marimo)

uv run marimo edit notebooks/01_pr_curve.py         # PR/ROC curves vs baseline
uv run marimo edit notebooks/02_samples.py          # normal vs anomaly images

What's verified

| Path | Status |
| --- | --- |
| vp train / eval beats random baseline | ✅ verified (asserted in tests + CI) |
| vp export ONNX parity with sklearn | ✅ verified (PCA + IsolationForest) |
| vp bench onnxruntime CPU latency | ✅ verified |
| Full train→eval→export→bench→infer loop | ✅ verified (pytest + CI) |
| pytest smoke suite + ruff in CI | ✅ verified |
| MLflow local sqlite store | ✅ verified |
| MLflow server via docker-compose | 🟡 compose provided, runs locally |
| Deep MVTec-AD model via Modal (gpu-train) | 🟡 scaffolded, not CI-verified |

Agent-friendly by design

Every command is non-interactive, takes --json, and uses load-bearing exit codes — so Codex, Claude Code, Cursor, Copilot, and friends can drive the whole train→eval→export→bench→infer loop with no TTY, no UI, no running service: parse stdout, branch on the exit code.

uv run vp train configs/digits.yaml --json
# -> {"ok": true, "name": "digits-pca", "metrics": {"roc_auc": 1.0, "lift_roc_auc": 0.5105, ...}, ...}
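From an agent's or script's side, "parse stdout, branch on the exit code" might look like the sketch below. The `ok` and `name` fields are taken from the sample output above; the helper names and the assumption that stdout holds exactly one JSON document are illustrative.

```python
import json
import subprocess


def run_json(cmd):
    """Run a command without a TTY; return (exit_code, parsed_stdout_or_None)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    try:
        payload = json.loads(proc.stdout)
    except json.JSONDecodeError:
        payload = None
    return proc.returncode, payload


def train_then_export(config="configs/digits.yaml"):
    """Train, then export only if training reported success. Returns an exit code."""
    code, result = run_json(["uv", "run", "vp", "train", config, "--json"])
    if code != 0 or not isinstance(result, dict) or not result.get("ok"):
        return code or 1
    code, _ = run_json(["uv", "run", "vp", "export", result["name"], "--json"])
    return code
```

`train_then_export()` needs `uv` and the project on the path, so it is defined here but not called.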

Agent instructions live in AGENTS.md — the cross-tool standard; CLAUDE.md is a symlink to it.

CI does more than lint

Most repos' CI checks that the code parses. This one checks that the pipeline works — three things beyond lint + tests, all stdlib, no extra deps:

  1. It runs the pipeline and publishes the numbers. Every push fits the detector and posts a live metrics table (ROC-AUC, average precision, baseline, lift) to the GitHub Actions run summary (scripts/ci_report.py). The numbers in CI are produced on that commit, not pasted by hand.
  2. It keeps the docs honest. The Quickstart block is marked <!-- ci-test --> and scripts/test_readme.py runs those exact commands in CI. Docs that drift from the code fail the build.
  3. It proves determinism. scripts/check_repro.py trains twice and asserts identical metrics — a seed is a promise, and CI verifies the promise holds.

Run them locally too: make summary, make readme, make repro.
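The determinism check in item 3 boils down to "run twice with the same seed, compare, exit non-zero on a mismatch". A stdlib sketch follows, with a toy `train_stub` standing in for the real fit. Both function names and the `metric` key are hypothetical; this is not the code in scripts/check_repro.py.

```python
import random


def train_stub(seed):
    """Toy stand-in for a training run: all randomness flows from the seed."""
    rng = random.Random(seed)
    samples = [rng.random() for _ in range(100)]
    return {"metric": round(sum(samples) / len(samples), 6)}


def check_repro(train, seed=0):
    """Train twice with the same seed; return 0 if the metrics match, else 1."""
    first, second = train(seed), train(seed)
    if first != second:
        print(f"non-deterministic: {first} != {second}")
        return 1
    print(f"reproducible: {first}")
    return 0
```

Comparing for exact equality is deliberate: with a fixed seed and a CPU-only pipeline, any difference points to an unseeded source of randomness.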

License

Apache-2.0.
