Classic ML done right. A CLI-first churn-style tabular classification pipeline whose value is the rigor most portfolios skip: an honest baseline, probability calibration, and drift monitoring. Train → eval-against-baseline → calibrate → monitor drift → serve, tracked in MLflow, driven by one machine-readable binary (mlp). Fully CPU-runnable, deterministic, and verified end-to-end.
Derives from
ml-pipeline-template— same operational shell (CLI,--json, load-bearing exit codes, MLflow as source of truth). The difference here is the ML: not a model that ranks well in a notebook, but one with a baseline you can trust, probabilities you can calibrate, and a drift report you can act on.
Most churn demos report an accuracy number and stop. The three things that separate that from a model you'd put in front of a retention budget are exactly the three things this repo does:
- Honest baseline — every metric is reported next to a
DummyClassifier. A model that doesn't beat its baseline is a finding, not a number to hide. ROC-AUC, PR-AUC, F1, Brier, and lift over baseline are reported together. - Calibration — a model can rank well (high ROC-AUC) and still output badly miscalibrated probabilities.
mlp calibratefitsCalibratedClassifierCVand reports Brier before/after plus reliability-curve bins, so the improvement (or lack of it) is on the table. - Drift monitoring —
mlp driftcomputes PSI and a KS test per feature between a reference set and a (synthetically shifted) current set, and flags features over a PSI threshold. The production-ops half of ML, not an afterthought.
Plus the house style: CLI-first, --json on every command, load-bearing exit codes, MLflow as the single source of truth, and marimo .py notebooks (never .ipynb).
uv sync --extra dev # install (CPU only, no GPU, no downloads)
uv run mlp doctor # environment readiness check (--json for CI)
uv run mlp train configs/churn.yaml
uv run mlp eval churn-gb
uv run mlp calibrate churn-gb
uv run mlp drift --reference configs/churn.yaml --shift 0.5
uv run mlp infer churn-gbThe block above is marked
<!-- ci-test -->— CI runs these exact commands on every push, so this quickstart can never silently drift from the code.
Output of train (human mode):
trained churn-gb (gradient_boosting)
roc_auc 0.863 (baseline 0.5, lift +0.363)
pr_auc 0.7298 f1 0.664 brier 0.1247
model -> artifacts/churn-gb/model.joblib
The DummyClassifier(strategy="prior") baseline has no ranking power (ROC-AUC 0.5), so the +0.36 lift is real signal. mlp calibrate then improves Brier from 0.1247 to 0.1223, and mlp drift --shift 0.5 flags support_calls and total_charges (PSI >= 0.25) while leaving the unshifted features stable — see below.
make demo runs the full rigor loop (train → eval → calibrate → drift → infer).
mlp doctor [--json] # is this environment ready?
mlp train <config> [--out] [--json] # train, eval vs baseline, log to MLflow
mlp eval <name> [--out] [--json] # recompute holdout metrics vs baseline
mlp calibrate <name> [--method] [--out] [--json] # calibrate, Brier before/after + reliability
mlp drift --reference <cfg> [--shift] [--threshold] [--json] # PSI + KS per feature, flag drift
mlp infer <name> [--json-input] [--out] [--json] # score one example -> churn probability
mlp version [--json]
Every command emits a single JSON object with --json and exits non-zero on failure with one line on stderr.
infer takes a full example as JSON:
uv run mlp infer churn-gb --json-input \
'{"tenure_months": 2, "monthly_charges": 110.0, "total_charges": 220.0, "support_calls": 8, "logins_last_30d": 1, "contract": "month_to_month", "payment_method": "card", "region": "randstad"}'The default dataset is a deterministic synthetic churn frame (sklearn.make_classification for the latent signal + hand-assembled numeric/categorical columns, fixed seed, ~27% churn base rate), so CI never downloads anything and results reproduce exactly. To run on real data, point the config at a CSV:
# configs/churn.yaml
dataset: data/customers.csv # CSV path instead of 'synthetic'
target: Churn # your label column; renamed to 'churn' internallyThe CSV must contain the expected feature columns (tenure_months, monthly_charges, total_charges, support_calls, logins_last_30d, contract, payment_method, region) — adapt src/ml_pipeline/lib/data.py to your schema.
make up # MLflow on localhost:5050
export MLFLOW_TRACKING_URI=http://localhost:5050
uv run marimo edit notebooks/01_reliability.py # reliability diagram before/after calibration
uv run marimo edit notebooks/02_drift.py # PSI + KS feature drift reportWithout a server the CLI falls back to a local sqlite:///mlflow.db store.
Everything in this repo is verified on CPU — no GPU path exists, nothing is hand-waved.
| Path | Status |
|---|---|
mlp train / eval / infer on CPU |
✅ verified |
| Feature pipeline (ColumnTransformer + Pipeline) | ✅ verified |
Model beats DummyClassifier baseline on ROC-AUC |
✅ verified |
mlp calibrate reduces/maintains Brier |
✅ verified |
mlp drift PSI + KS flags a shifted feature |
✅ verified |
pytest smoke + rigor suite + ruff in CI |
✅ verified |
| MLflow local sqlite store | ✅ verified |
| MLflow server via docker-compose | ✅ compose provided, runs locally |
Every command is non-interactive, emits a single JSON object with --json, and returns a load-bearing exit code — so AI coding agents (Codex, Claude Code, Cursor, Copilot, Windsurf, …) and plain scripts can drive the full train → eval → calibrate → drift → infer loop and parse results with no TTY, no UI, no screen-scraping.
mlp train configs/churn.yaml --json # -> {"ok": true, "metrics": {"roc_auc": ..., "lift_over_baseline": ...}} exit 0Agent instructions live in AGENTS.md — the cross-tool standard read natively by Codex, Cursor, Copilot, and more. CLAUDE.md is a symlink to it, so every tool reads one source of truth.
Most repos' CI checks that the code parses. This one checks that the pipeline works — three things beyond lint + tests, all stdlib, no extra deps:
- It runs the pipeline and publishes the numbers. Every push trains the churn model and posts a live metrics table (ROC-AUC, PR-AUC, F1, Brier, lift over baseline) to the GitHub Actions run summary (
scripts/ci_report.py). The numbers in CI are produced on that commit, not pasted by hand. - It keeps the docs honest. The Quickstart block is marked
<!-- ci-test -->andscripts/test_readme.pyruns those exact commands in CI. Docs that drift from the code fail the build. - It proves determinism.
scripts/check_repro.pytrains twice and asserts identical metrics — a seed is a promise, and for a model whose probabilities feed a retention budget that promise has to hold. CI verifies it.
Run them locally too: make summary, make readme, make repro.
Apache-2.0.