SRE-style preflight checks for AI applications. One command, one readiness report, one ship/block verdict.
aipreflight brings the deployment discipline of CI gates, smoke tests, and SLO-based rollouts to LLM apps, RAG systems, and inference endpoints. It turns external acceptance testing (llmprobe) and internal server telemetry (Prometheus) into an automated go/no-go decision before traffic is routed.
The project's strongest proof is SLA gating for self-hosted inference, and it has broadened into a general production readiness gate for AI applications. It ships three profiles:
inference(SLA gating),app(cost, evals, observability, rollback for hosted-API apps), andrag(retrieval and answer quality).
Read the positioning post: The missing deployment gate for AI applications.
Normal software has CI gates, smoke tests, canaries, and SLOs. Classical ML has model validation and model blessing. Modern AI applications often still ship prompt changes, RAG changes, model/provider changes, and AI features without equivalent preflight checks for quality, cost, latency, observability, and rollback readiness.
aipreflight fills that gap at the release boundary:
Can we ship this AI change?
quality/evals PASS or FAIL
RAG behavior PASS or FAIL
latency/TTFT PASS or FAIL
error rate PASS or FAIL
cost budget PASS or FAIL
observability PASS or WARN or FAIL
rollback/runbook PASS or WARN or FAIL
overall verdict PASS or FAIL
Run the no-GPU demo checks:
./scripts/demo-quick.shOr run the core commands directly:
aipreflight doctor
aipreflight check --profile profiles/app.yml
aipreflight check --profile profiles/rag.ymlThe inference profile uses llmprobe against an OpenAI-compatible endpoint:
aipreflight check --profile profiles/inference.ymlServer metrics say "healthy" while users experience 3-second TTFT. The load balancer is misconfigured, TLS adds overhead, the rate limiter is throttling, or the model is silently returning empty responses. Server-side metrics alone often miss this because they do not measure the full client path. You need an external validator.
Run aipreflight check at the same gates where classical software already runs CI checks, smoke tests, and canary analysis:
- Before merging an AI feature, prompt change, or RAG change.
- Before routing traffic to a new model or a new provider.
- Before increasing a rollout percentage.
- Before approving a customer pilot or a production launch.
Each gate gives you one verdict and one exit code, so it drops into a pull request check, a deploy step, or a Kubernetes Job without a human in the loop.
llmprobe (external) --> IS there a problem? (client-side truth)
Prometheus (internal) --> WHY is there a problem? (server-side explanation)
aipreflight --> WHAT to do about it (automated verdict)
See docs/architecture.md for the full flow: how a profile, the external signals, and the per-check verdict aggregation produce one report and one exit code.
Deploy a new model, run acceptance probes, get a readiness verdict. Exit code 0 = pass, 1 = fail, 2 = config error, 3 = probe error.
aipreflight check --profile profiles/inference.yml
# Verdict: PASS (safe to route traffic)
# Verdict: FAIL (do not route traffic)Writes runs/latest/aipreflight-report.json and .md with the verdict, failed checks, and metrics. Integrates into any CI/CD pipeline with no human in the loop. The legacy ./scripts/gate.sh is still supported and now wraps this command.
Users report slow responses. Correlate client observations with server state.
aipreflight diagnose runs/latest --prometheus http://localhost:9090Output tells you whether the problem is in the network layer (client/server TTFT gap), the inference engine (queue depth, KV cache pressure), or upstream (errors, timeouts).
Find the concurrency level where your endpoint breaks its SLA.
./scripts/sweep.sh configs/llmprobe/vllm.yml 1,2,4,8,16Produces a comparison table showing how TTFT, latency, and throughput degrade under load. Tells you exactly how many concurrent users your config supports within SLA.
A profile defines what "ready" means for one kind of target. aipreflight check
runs the right checks and aggregates them into one verdict (PASS / WARN / FAIL).
| Profile | Kind | Needs | Checks |
|---|---|---|---|
profiles/inference.yml |
inference |
llmprobe (+ optional Prometheus) | TTFT, latency, throughput, error rate vs SLA |
profiles/app.yml |
app |
nothing self-hosted | cost budget (tokentoll), eval quality gate, observability fields, rollback runbook |
profiles/rag.yml |
rag |
nothing self-hosted | cost budget (tokentoll), retrieval precision, answer quality, citation rate, hallucination rate, empty-retrieval handling, observability, rollback |
The app profile is for teams calling hosted APIs that still need production
discipline: a cost gate, a quality eval suite, debuggable telemetry, and a
rollback path. It runs with no GPU and no probe. See
examples/hosted-api-app for a runnable target.
The rag profile gates the quality signals that infrastructure checks miss: a
RAG system can be "up" while retrieval has regressed, answers have stopped citing
sources, or the model has begun answering unanswerable questions. See
examples/rag-app for a runnable offline target that fails
readiness on a retrieval regression while the service itself stays healthy.
app and rag profiles can gate on the results of an eval suite, not just check
that one is configured. aipreflight does not implement evals. It runs whatever
eval command you already have (pytest, promptfoo, ragas, a custom script), reads
its JSON output, and turns it into one pass/fail gate:
evals:
command: "python evals/run_evals.py" # emits JSON on stdout
results_file: evals/results.json # optional: read this instead of stdout
min_pass_rate: 0.9 # gate on overall pass rate
metrics: # optional per-metric gates
retrieval_precision: {min: 0.8}
hallucination_rate: {max: 0.05}The eval step must emit JSON with total/passed (or pass_rate) and an
optional metrics map. The eval command's own exit code does not decide the
gate; the reported numbers do. Without min_pass_rate/metrics, the check falls
back to verifying an eval suite is configured and present (run it in CI).
aipreflight check --profile profiles/app.yml
# Verdict: PASS
# cost PASS $7.69/mo across 1 call site(s), within budget
# evals PASS quality gate passed: pass rate 100% (min 90%), answer_quality=1.00
# observability PASS telemetry config present with all 9 required fields
# deployment PASS rollback runbook present: runbooks/rollback.mdBoth shipped example profiles gate on real eval numbers: app.yml on answer
quality, rag.yml on retrieval precision, citations, and hallucination rate.
The cost gate uses tokentoll to statically price the LLM call sites in your source and fail if per-request or monthly cost exceeds the budget in the profile.
Prerequisites: Python 3.10+. The inference profile needs llmprobe v1.4.0+. The app profile additionally uses tokentoll (pip install tokentoll). The app and rag profiles need neither llmprobe nor a GPU.
git clone https://github.com/Jwrede/aipreflight && cd aipreflight
pip install -e .
# Install the llmprobe binary (only needed for the inference profile).
# Prebuilt, no Go toolchain (macOS/Linux):
curl -fsSL https://raw.githubusercontent.com/Jwrede/aipreflight/main/scripts/install-deps.sh | bash
# ... or build from source:
go install github.com/Jwrede/llmprobe@latest
# Check that your environment is ready (Python, llmprobe, tokentoll, profiles)
aipreflight doctor
# Point the inference profile at your endpoint (vLLM, Ollama, or any OpenAI-compatible server)
vim configs/llmprobe/vllm.yml # referenced by profiles/inference.yml
# Run the readiness gate (exit 0 = pass, 1 = fail, 2 = config error, 3 = probe error)
aipreflight check --profile profiles/inference.yml
# Score an existing probe run offline (no endpoint or GPU needed)
aipreflight check --profile profiles/inference.yml --probes fixtures/sample-probes.jsonl
# Check a hosted-API app instead (no llmprobe or GPU): cost, evals, observability, rollback
aipreflight check --profile profiles/app.yml
# Check a RAG app (no llmprobe or GPU): retrieval, answer quality, citations, hallucination
aipreflight check --profile profiles/rag.yml
# Find the concurrency breaking point
./scripts/sweep.sh configs/llmprobe/vllm.yml 1,2,4,8,16
# Diagnose with server-side metrics (requires Prometheus scraping your endpoint)
aipreflight diagnose runs/latest --prometheus http://localhost:9090A profile bundles everything a check needs: how to probe, the SLA contract to gate on, and optional observability settings. profiles/inference.yml reproduces the original gate:
name: inference
probe:
config: configs/llmprobe/vllm.yml # llmprobe config for your endpoint
duration: 30s
interval: 5s
thresholds:
sla:
ttft_ms: 500 # Max acceptable TTFT (p95)
latency_ms: 10000 # Max acceptable end-to-end latency (p95)
min_throughput: 3.0 # Min acceptable throughput (p50, tok/s)
max_error_rate: 0.01 # Max acceptable error rate
gate:
min_probes: 5 # Minimum probes before making a decision
pass_rate: 0.95 # Required healthy probe rate
observability:
prometheus: null # set to http://localhost:9090 to enable diagnose
queries: configs/prometheus/queries.ymlInvalid profiles fail fast with exit code 2 and an actionable message. The standalone thresholds.yml is still read by the legacy scripts/gate.sh path. configs/prometheus/queries.yml defines which server metrics to collect for diagnosis.
vLLM exposes a /metrics endpoint by default. To correlate client probes with server telemetry:
# 1. Start vLLM (Docker CPU example)
docker run -d --name vllm -p 8000:8000 \
vllm/vllm-openai-cpu:latest \
--model Qwen/Qwen2-0.5B-Instruct --max-model-len 512
# 2. Start Prometheus
cp prometheus.example.yml prometheus.yml
# Edit prometheus.yml target if vLLM is not on host.docker.internal:8000
docker run -d --name prometheus -p 9090:9090 \
--add-host=host.docker.internal:host-gateway \
-v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml:ro \
prom/prometheus:latest
# 3. Verify scraping works
curl -s http://localhost:9090/api/v1/targets | grep '"health":"up"'
# 4. Run probes and diagnose
aipreflight check --profile profiles/inference.yml
aipreflight diagnose runs/latest --prometheus http://localhost:9090The diagnosis correlates client-observed TTFT with server-reported TTFT. A large gap (>100ms) indicates network or proxy overhead between the client and the inference engine.
A pre-built dashboard visualizes the same metrics used by aipreflight diagnose. Grafana is for inspection; the readiness gate remains the source of deployment decisions.
# 1. Make sure prometheus.yml exists
cp prometheus.example.yml prometheus.yml
# 2. Start Prometheus + Grafana (Grafana on port 3001)
docker compose -f docker-compose.observability.yml up -d
# 3. Open in browser
open http://localhost:3001/d/vllm-readiness
# Login: admin / admin (or anonymous access enabled by default)Panels: TTFT p95/p50, end-to-end latency, running/waiting requests, KV cache usage, GPU utilization, GPU memory, GPU temperature/power, queue wait time, token throughput. Color thresholds match the SLA defaults in thresholds.yml.
To stop:
docker compose -f docker-compose.observability.yml downDeploy vLLM with GPU scheduling and use aipreflight as the readiness probe:
# Deploy vLLM with GPU and readiness probes
kubectl apply -f k8s/vllm-deployment.yml
kubectl apply -f k8s/vllm-service.yml
# Deploy DCGM exporter for GPU metrics (requires NVIDIA GPU Operator)
kubectl apply -f k8s/dcgm-exporter.yml
kubectl apply -f k8s/servicemonitor.yml
# Run readiness gate as a Job (post-deploy validation)
kubectl apply -f k8s/readiness-gate-job.yml
kubectl logs -f job/aipreflight-gateThe Deployment uses nvidia.com/gpu resource requests, a /health readiness probe for basic liveness, and the readiness gate Job for SLA validation after deployment. DCGM exporter feeds GPU utilization, memory, and temperature into Prometheus alongside vLLM inference metrics.
See docs/runpod-gpu-setup.md for reproducing GPU benchmarks on RunPod.
Concurrency sweep on Qwen2 0.5B across GPU and CPU:
| Concurrency | GPU TTFT p50 | GPU tok/s | CPU TTFT p50 | CPU tok/s | Ollama TTFT p50 | Ollama tok/s |
|---|---|---|---|---|---|---|
| 1 | 77ms | 528 | 110ms | 16.4 | 204ms | 42.3 |
| 4 | 73ms | 465 | 225ms | 17.5 | 750ms | 59.3 |
| 8 | 76ms | 406 | 327ms | 15.4 | 2.50s | 51.8 |
| 16 | 71ms | 380 | 591ms | 10.7 | 6.90s | 53.1 |
| 32 | 46ms | 450 | N/A | N/A | N/A | N/A |
GPU (RTX 3090, $0.22/hr): 32x throughput, TTFT stays flat under load. For a 500ms TTFT SLA:
| Engine | Max Concurrent Users Within SLA |
|---|---|
| vLLM + RTX 3090 | 16 |
| vLLM + CPU (8 vCPU) | 8 |
| Ollama + CPU (8 vCPU) | 1 |
Full analysis: reports/examples/cross-engine-comparison.md
- GPU readiness report (RTX 3090, c16)
- GPU concurrency sweep
- Cross-engine comparison (GPU vs CPU vs Ollama)
- Prometheus diagnosis (c16 under load)
- CPU readiness report (SLA violation at c16)
- CPU concurrency sweep
aipreflight/ # Python package (CLI + readiness logic)
cli.py # `aipreflight` entrypoint (check/report/diagnose/doctor)
doctor.py # environment readiness checks (`aipreflight doctor`)
profile.py # profile loading + validation (inference | app | rag)
checks.py # generic CheckResult + verdict aggregation
probes.py # llmprobe runner + JSONL loading
analyze.py # SLA gate logic (inference)
appcheck.py # app readiness checks (cost/evals/observability/deploy)
cost.py # tokentoll cost gate adapter
evals.py # eval quality gate adapter (pass rate + metrics)
report.py # unified JSON + Markdown report
diagnose.py # client + server + GPU correlation
compare.py # sweep comparison table
profiles/
inference.yml # vLLM / OpenAI-compatible endpoint profile
app.yml # hosted-API app profile (cost/evals/observability)
rag.yml # RAG profile (retrieval/answer quality gate)
examples/
hosted-api-app/ # runnable FastAPI app checked by profiles/app.yml
rag-app/ # runnable offline RAG app checked by profiles/rag.yml
thresholds.yml # legacy SLA contract (scripts/gate.sh)
prometheus.example.yml # Prometheus config template
docker-compose.observability.yml # Prometheus + Grafana + DCGM stack
k8s/
vllm-deployment.yml # vLLM with GPU scheduling + readiness probe
vllm-service.yml # ClusterIP service
readiness-gate-job.yml # Post-deploy SLA validation Job
dcgm-exporter.yml # NVIDIA DCGM GPU metrics DaemonSet
servicemonitor.yml # Prometheus ServiceMonitors
grafana/
dashboard.json # vLLM + GPU metrics dashboard
provisioning/ # Auto-config for datasource + dashboard
configs/
llmprobe/vllm.yml # vLLM probe configuration
llmprobe/vllm-k8s.yml # In-cluster vLLM probe configuration
llmprobe/ollama.yml # Ollama probe configuration
llmprobe/runpod-gpu.yml # RunPod GPU probe template
prometheus/queries.yml # Server + GPU metric queries
scripts/
install-deps.sh # install a prebuilt llmprobe without Go (checksum-verified)
gate.sh # CI/CD readiness gate (wraps `aipreflight check`)
sweep.sh # Concurrency sweep
diagnose.py # Wrapper -> `aipreflight diagnose`
compare.py # Wrapper -> sweep comparison table
report.py # Wrapper -> standalone readiness report
docs/
runpod-gpu-setup.md # GPU benchmark reproducibility guide
fixtures/ # Test data
tests/ # pytest suite
runbooks/ # Failure mode runbooks + rollback.md (app deployment check)
reports/examples/ # Example outputs
.github/workflows/ci.yml # CI (pytest + shellcheck)
- Readiness gate with SLA thresholds
- Diagnosis framework (client-only and with Prometheus)
- Concurrency sweep with comparison
- Real vLLM CPU experiment with published results
- Cross-engine comparison (vLLM vs Ollama)
- Prometheus-based server-side correlation with live data
- Runbooks for common failure modes
- Kubernetes manifests with GPU scheduling
- NVIDIA DCGM GPU metrics in Prometheus/Grafana
aipreflightCLI with profiles, exit-code contract, and unified JSON/MD report- App profile with tokentoll cost gate, observability, and rollback checks
- Runnable hosted-API example app (FastAPI, offline-testable)
- Eval/quality gate (run the eval suite and gate on pass rate + metrics)
- RAG profile and offline example (retrieval + answer quality readiness)
- No-Go install script and
aipreflight doctorenvironment check
aipreflight is a gate, not a platform. It runs the checks you point it at and turns them into one verdict. It does not replace the tools that produce the underlying signals:
- It does not replace your LLM router or proxy (LiteLLM).
- It does not replace your metrics stack (Prometheus, Grafana, OpenTelemetry). It reads from Prometheus during
diagnose. - It does not replace your eval framework (promptfoo, Ragas, pytest). It runs whatever eval command you already have and gates on its results.
- It does not replace your orchestrator (Kubernetes). It runs as a Job or a CI step inside it.
The gap it fills is the preflight verdict: deciding, in one command, whether quality, cost, latency, observability, and rollback readiness are good enough to ship.
MIT

