ml-pipeline

Classic ML done right. A CLI-first churn-style tabular classification pipeline whose value is the rigor most portfolios skip: an honest baseline, probability calibration, and drift monitoring. Train → eval-against-baseline → calibrate → monitor drift → serve, tracked in MLflow, driven by one machine-readable binary (mlp). Fully CPU-runnable, deterministic, and verified end-to-end.

Derives from ml-pipeline-template — same operational shell (CLI, --json, load-bearing exit codes, MLflow as source of truth). The difference here is the ML: not a model that ranks well in a notebook, but one with a baseline you can trust, probabilities you can calibrate, and a drift report you can act on.

Why it looks like this

Most churn demos report an accuracy number and stop. The three things that separate that from a model you'd put in front of a retention budget are exactly the three things this repo does:

Honest baseline — every metric is reported next to a DummyClassifier. A model that doesn't beat its baseline is a finding, not a number to hide. ROC-AUC, PR-AUC, F1, Brier, and lift over baseline are reported together.
Calibration — a model can rank well (high ROC-AUC) and still output badly miscalibrated probabilities. mlp calibrate fits CalibratedClassifierCV and reports Brier before/after plus reliability-curve bins, so the improvement (or lack of it) is on the table.
Drift monitoring — mlp drift computes PSI and a KS test per feature between a reference set and a (synthetically shifted) current set, and flags features over a PSI threshold. The production-ops half of ML, not an afterthought.
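To make the drift item concrete, here is a minimal, dependency-free sketch of the two statistics it names. It is not the code behind mlp drift: the quantile binning, the eps floor, the function names, and the Gaussian toy data are all assumptions for illustration. Only the 0.25 PSI threshold comes from this README.

```python
import bisect
import math
import random


def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index with bins cut at quantiles of the reference sample."""
    ref_sorted = sorted(reference)
    # n_bins - 1 interior edges taken at evenly spaced reference quantiles.
    edges = [ref_sorted[len(ref_sorted) * i // n_bins] for i in range(1, n_bins)]

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            counts[bisect.bisect_right(edges, v)] += 1
        # Floor at eps so an empty bin cannot produce log(0) or a division by zero.
        return [max(c / len(values), eps) for c in counts]

    p_ref, p_cur = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(p_ref, p_cur))


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step both CDFs past x (including ties) before measuring the gap.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


# Toy data (assumption): one feature that stays put and one shifted by one standard deviation.
rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(2000)]
stable = [rng.gauss(0.0, 1.0) for _ in range(2000)]
shifted = [rng.gauss(1.0, 1.0) for _ in range(2000)]

PSI_THRESHOLD = 0.25  # threshold quoted in this README
for name, current in {"stable": stable, "shifted": shifted}.items():
    score = psi(reference, current)
    flagged = score >= PSI_THRESHOLD
    print(name, round(score, 3), round(ks_statistic(reference, current), 3), flagged)
```

The sketch reports only the KS statistic. A full KS test also needs a p-value, which a statistics library would normally supply.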

Plus the house style: CLI-first, --json on every command, load-bearing exit codes, MLflow as the single source of truth, and marimo .py notebooks (never .ipynb).

Quickstart

uv sync --extra dev          # install (CPU only, no GPU, no downloads)
uv run mlp doctor            # environment readiness check (--json for CI)
uv run mlp train configs/churn.yaml
uv run mlp eval churn-gb
uv run mlp calibrate churn-gb
uv run mlp drift --reference configs/churn.yaml --shift 0.5
uv run mlp infer churn-gb

The block above carries a CI marker: CI runs these exact commands on every push, so this quickstart can never silently drift from the code.

Output of train (human mode):

trained churn-gb (gradient_boosting)
  roc_auc 0.863  (baseline 0.5, lift +0.363)
  pr_auc 0.7298  f1 0.664  brier 0.1247
  model -> artifacts/churn-gb/model.joblib

The DummyClassifier(strategy="prior") baseline has no ranking power (ROC-AUC 0.5), so the +0.36 lift is real signal. mlp calibrate then improves Brier from 0.1247 to 0.1223, and mlp drift --shift 0.5 flags support_calls and total_charges (PSI >= 0.25) while leaving the unshifted features stable — see below.
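The two claims in that paragraph can be checked by hand on a toy example. The sketch below is illustrative only: the labels and scores are invented, and the helper names are hypothetical rather than taken from this repo. It shows why a constant prior prediction scores ROC-AUC 0.5, how lift is the difference between model and baseline ROC-AUC (as in the train output above), and why ranking quality and probability quality are separate questions.

```python
def roc_auc(y_true, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier(y_true, probs):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)


# Toy labels and scores, invented for illustration only.
y = [1, 1, 0, 1, 0, 0, 0, 0]
model_probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

# A prior baseline predicts the base rate for every example.
prior = sum(y) / len(y)
baseline_probs = [prior] * len(y)

baseline_auc = roc_auc(y, baseline_probs)  # every pair is a tie, so this is 0.5
model_auc = roc_auc(y, model_probs)
lift = model_auc - baseline_auc

# A monotone rescaling keeps the ranking (same ROC-AUC) but changes the
# probabilities themselves, so the Brier score moves.
squashed = [p ** 3 for p in model_probs]

print("baseline roc_auc", baseline_auc, "brier", brier(y, baseline_probs))
print("model roc_auc", round(model_auc, 4), "lift", round(lift, 4))
print("brier raw", round(brier(y, model_probs), 4), "squashed", round(brier(y, squashed), 4))
```

Because ROC-AUC depends only on the ordering of scores, a calibration step that rescales them monotonically cannot change it. That is why Brier is reported alongside it.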

make demo runs the full rigor loop (train → eval → calibrate → drift → infer).

CLI surface

mlp doctor [--json]                                  # is this environment ready?
mlp train <config> [--out] [--json]                  # train, eval vs baseline, log to MLflow
mlp eval <name> [--out] [--json]                     # recompute holdout metrics vs baseline
mlp calibrate <name> [--method] [--out] [--json]     # calibrate, Brier before/after + reliability
mlp drift --reference <cfg> [--shift] [--threshold] [--json]   # PSI + KS per feature, flag drift
mlp infer <name> [--json-input] [--out] [--json]     # score one example -> churn probability
mlp version [--json]

Every command emits a single JSON object with --json and exits non-zero on failure with one line on stderr.

infer takes a full example as JSON:

uv run mlp infer churn-gb --json-input \
  '{"tenure_months": 2, "monthly_charges": 110.0, "total_charges": 220.0, "support_calls": 8, "logins_last_30d": 1, "contract": "month_to_month", "payment_method": "card", "region": "randstad"}'

Dataset

The default dataset is a deterministic synthetic churn frame (sklearn.datasets.make_classification for the latent signal + hand-assembled numeric/categorical columns, fixed seed, ~27% churn base rate), so CI never downloads anything and results reproduce exactly. To run on real data, point the config at a CSV:

# configs/churn.yaml
dataset: data/customers.csv   # CSV path instead of 'synthetic'
target: Churn                 # your label column; renamed to 'churn' internally

The CSV must contain the expected feature columns (tenure_months, monthly_charges, total_charges, support_calls, logins_last_30d, contract, payment_method, region) — adapt src/ml_pipeline/lib/data.py to your schema.
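Before pointing the config at a real file, it can help to check the header against that column list. The following is a small stdlib sketch, not part of the repo: missing_columns is a hypothetical helper, and it checks only column names, not dtypes or category values.

```python
import csv

# Feature columns listed in this README.
EXPECTED_FEATURES = [
    "tenure_months",
    "monthly_charges",
    "total_charges",
    "support_calls",
    "logins_last_30d",
    "contract",
    "payment_method",
    "region",
]


def missing_columns(csv_path, target):
    """Return the required columns (features plus target) absent from the CSV header."""
    with open(csv_path, newline="") as handle:
        header = next(csv.reader(handle), [])
    required = EXPECTED_FEATURES + [target]
    return [column for column in required if column not in header]


# Example usage (path and target taken from the config snippet above):
# missing = missing_columns("data/customers.csv", target="Churn")
# if missing:
#     raise SystemExit(f"CSV is missing columns: {', '.join(missing)}")
```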

Tracking & notebooks

make up                                   # MLflow on localhost:5050
export MLFLOW_TRACKING_URI=http://localhost:5050
uv run marimo edit notebooks/01_reliability.py   # reliability diagram before/after calibration
uv run marimo edit notebooks/02_drift.py         # PSI + KS feature drift report

Without a server the CLI falls back to a local sqlite:///mlflow.db store.
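One plausible way to implement that fallback is sketched below. It assumes that "without a server" means MLFLOW_TRACKING_URI is unset or empty; the actual CLI may detect this differently, and resolve_tracking_uri is a hypothetical name.

```python
import os

DEFAULT_STORE = "sqlite:///mlflow.db"  # local store named in this README


def resolve_tracking_uri(environ=os.environ):
    """Prefer an explicit MLFLOW_TRACKING_URI; otherwise use the local sqlite store."""
    # Assumption: an unset or blank variable means no server is configured.
    uri = environ.get("MLFLOW_TRACKING_URI", "").strip()
    return uri or DEFAULT_STORE


print(resolve_tracking_uri({"MLFLOW_TRACKING_URI": "http://localhost:5050"}))
print(resolve_tracking_uri({}))
```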

What's verified

Everything in this repo is verified on CPU — no GPU path exists, nothing is hand-waved.

Path	Status
`mlp train` / `eval` / `infer` on CPU	✅ verified
Feature pipeline (ColumnTransformer + Pipeline)	✅ verified
Model beats `DummyClassifier` baseline on ROC-AUC	✅ verified
`mlp calibrate` reduces/maintains Brier	✅ verified
`mlp drift` PSI + KS flags a shifted feature	✅ verified
`pytest` smoke + rigor suite + ruff in CI	✅ verified
MLflow local sqlite store	✅ verified
MLflow server via docker-compose	✅ compose provided, runs locally

Agent-friendly by design

Every command is non-interactive, emits a single JSON object with --json, and returns a load-bearing exit code — so AI coding agents (Codex, Claude Code, Cursor, Copilot, Windsurf, …) and plain scripts can drive the full train → eval → calibrate → drift → infer loop and parse results with no TTY, no UI, no screen-scraping.

mlp train configs/churn.yaml --json   # -> {"ok": true, "metrics": {"roc_auc": ..., "lift_over_baseline": ...}}   exit 0
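A script or agent can rely on that contract with a few lines of stdlib Python. This is a sketch under stated assumptions: run_json is a hypothetical helper, and the only payload keys it could rely on are the ones shown in the example above ("ok", "metrics").

```python
import json
import subprocess


def run_json(argv):
    """Run a command, fail loudly on a non-zero exit, and parse stdout as one JSON value."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    if proc.returncode != 0:
        # The CLI promises a single explanatory line on stderr when it fails.
        raise RuntimeError(f"{argv[0]} exited {proc.returncode}: {proc.stderr.strip()}")
    return json.loads(proc.stdout)


# Example usage (requires the project environment from the Quickstart):
# result = run_json(["uv", "run", "mlp", "train", "configs/churn.yaml", "--json"])
# print(result["metrics"]["roc_auc"], result["metrics"]["lift_over_baseline"])
```

Checking the exit code before parsing keeps the two failure modes apart: a failed command raises with the stderr line, while a successful command that prints invalid JSON raises a json.JSONDecodeError.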

Agent instructions live in AGENTS.md — the cross-tool standard read natively by Codex, Cursor, Copilot, and more. CLAUDE.md is a symlink to it, so every tool reads one source of truth.

CI does more than lint

Most repos' CI checks that the code parses. This one checks that the pipeline works — three things beyond lint + tests, all stdlib, no extra deps:

It runs the pipeline and publishes the numbers. Every push trains the churn model and posts a live metrics table (ROC-AUC, PR-AUC, F1, Brier, lift over baseline) to the GitHub Actions run summary (scripts/ci_report.py). The numbers in CI are produced on that commit, not pasted by hand.
It keeps the docs honest. The Quickstart block carries a marker, and scripts/test_readme.py runs those exact commands in CI. Docs that drift from the code fail the build.
It proves determinism. scripts/check_repro.py trains twice and asserts identical metrics — a seed is a promise, and for a model whose probabilities feed a retention budget that promise has to hold. CI verifies it.
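The determinism check in the last item follows a simple pattern: run twice with the same seed and compare metrics exactly. The sketch below shows that pattern only. train_and_score is a hypothetical stand-in for a real training run, and scripts/check_repro.py may be structured differently.

```python
import random


def train_and_score(seed):
    """Hypothetical stand-in for one training run; all randomness flows from the seed."""
    rng = random.Random(seed)
    scores = [rng.random() for _ in range(1000)]
    return {"mean_score": sum(scores) / len(scores), "max_score": max(scores)}


def repro_diffs(run, seed):
    """Run twice with the same seed and return every metric whose values differ."""
    first, second = run(seed), run(seed)
    return {
        name: (first.get(name), second.get(name))
        for name in sorted(set(first) | set(second))
        if first.get(name) != second.get(name)
    }


diffs = repro_diffs(train_and_score, seed=42)
if diffs:
    raise SystemExit(f"non-deterministic metrics: {diffs}")
print("reproducible")
```

The comparison is exact equality rather than a tolerance, which matches the "asserts identical metrics" wording above.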

Run them locally too: make summary, make readme, make repro.

License

Apache-2.0.
