AI agent evaluations in containers. 101 benchmarks, 22 agents — ready to deploy at massive scale on any cloud.
An evaluation is one benchmark + one agent + one model — three independent axes, swappable without touching each other. Our goal is agent evaluations you can trust: fast to run, thin to ship, reliable in any environment, and faithful to what each benchmark really measures.
Working in this repo (human or AI agent)? It is governed by AGENTS.md and the .agents/ directory: its rules (what a result must be) and skills (how to produce it). Read the doctrine for the area you touch before changing it.
| | Cloud-native | Framework-free | Full interchangeability (agent × model × benchmark) | Speed audit | Size audit | Reliability audit | Native model tracing |
|---|---|---|---|---|---|---|---|
| Harbor | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| Inspect AI | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| Eval Containers | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
One URL for every evaluation — benchmark, agent, model, and task are all EVAL_* env vars, run by plain Docker Compose with no clone and no framework:
echo "OPENAI_API_KEY=sk-..." > .env
EVAL_TASK_ID=0 EVAL_AGENT=codex EVAL_MODEL=openai/gpt-5.4 \
docker compose -f oci://ghcr.io/exgentic/eval-aime up -y --abort-on-container-exit
cat output/aime/0/task/result.json
Prefer a CLI? Run cargo install eval-containers, then eval-containers run aime --task-id 0 --agent codex --model openai/gpt-5.4 prints and runs that exact Docker command. Every CLI command stands in for a plain docker/kubectl one (add --dry-run to print it without running).
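Because the task is just an EVAL_TASK_ID variable, the Compose command shown earlier can be repeated across tasks with an ordinary loop. The sketch below only prints the commands rather than running them. The helper name compose_cmd, the task IDs 0 1 2, and the idea that every benchmark image follows the eval-<benchmark> naming seen for eval-aime are assumptions made for illustration.

```shell
#!/usr/bin/env sh
# Sketch: print (not run) one Compose invocation per task ID.
# Assumptions: task IDs 0 1 2 are illustrative, and the image name
# follows the eval-<benchmark> pattern seen for eval-aime.

compose_cmd() {
  # $1 = benchmark, $2 = task ID, $3 = agent, $4 = model
  printf 'EVAL_TASK_ID=%s EVAL_AGENT=%s EVAL_MODEL=%s docker compose -f oci://ghcr.io/exgentic/eval-%s up -y --abort-on-container-exit\n' \
    "$2" "$3" "$4" "$1"
}

for task in 0 1 2; do
  compose_cmd aime "$task" codex openai/gpt-5.4
done
```

Saved under a hypothetical name such as sweep.sh, the output can be reviewed first and then piped to sh to run the tasks one after another, with .env in the working directory as before.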
The exact same evaluation runs at scale on a cluster — the oci:// Compose reference becomes one helm | kubectl apply, with the axes as --sets instead of EVAL_* vars:
helm template eval-aime oci://ghcr.io/exgentic/charts/eval \
--set benchmark=aime --set task=0 --set agent=codex --set model=openai/gpt-5.4 | kubectl apply -f -
→ Triple-mode (compose / container / job) · Deploy on Kubernetes · OpenShift
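Since the axes are plain --set values, a sweep over tasks, agents, and models on a cluster is a nested loop around the same helm | kubectl pipeline. The following is a minimal sketch that prints each pipeline instead of applying it. The helper name helm_cmd and the list variables are hypothetical, and reusing the release name eval-<benchmark> for every combination is an assumption: whether concurrent runs need distinct release names depends on the chart, which is not shown here.

```shell
#!/usr/bin/env sh
# Sketch: print (not run) one helm | kubectl pipeline per axis combination.
BENCHMARK=aime
TASKS="0 1"              # illustrative task IDs
AGENTS="codex"           # extend with other agent names
MODELS="openai/gpt-5.4"  # extend with other model names

helm_cmd() {
  # $1 = benchmark, $2 = task, $3 = agent, $4 = model
  printf 'helm template eval-%s oci://ghcr.io/exgentic/charts/eval --set benchmark=%s --set task=%s --set agent=%s --set model=%s | kubectl apply -f -\n' \
    "$1" "$1" "$2" "$3" "$4"
}

# Expand the three lists into one command per combination.
for task in $TASKS; do
  for agent in $AGENTS; do
    for model in $MODELS; do
      helm_cmd "$BENCHMARK" "$task" "$agent" "$model"
    done
  done
done
```

Printing first keeps the sweep inspectable: each line is a complete command that can be pasted into a shell as-is.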
oci:// references need Docker Compose ≥ 2.34. On older Docker, behind a firewall, or fully airgapped, see Run offline or airgapped. To iterate on local changes without pulling, add --local.
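The version requirement can be checked before running anything. Here is a small sketch, assuming the installed Compose plugin reports its version through docker compose version --short. The helper name compose_ok is hypothetical, and it compares only the major and minor components.

```shell
#!/usr/bin/env sh
# compose_ok VERSION: succeeds if VERSION is 2.34 or newer.
compose_ok() {
  v=${1#v}            # strip an optional leading "v"
  major=${v%%.*}      # text before the first dot
  rest=${v#*.}        # text after the first dot
  minor=${rest%%.*}   # second version component
  [ "$major" -gt 2 ] || { [ "$major" -eq 2 ] && [ "$minor" -ge 34 ]; }
}

# Only query Docker when it is installed; stay silent otherwise.
if command -v docker >/dev/null 2>&1; then
  current=$(docker compose version --short 2>/dev/null || true)
  if [ -n "$current" ] && compose_ok "$current"; then
    echo "Docker Compose $current supports oci:// references"
  else
    echo "Docker Compose 2.34 or newer is needed for oci:// references" >&2
  fi
fi
```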
Full walkthrough: Install → Run your first eval.
Human-facing docs — concepts, guides, and reference — live in docs/.
- Concepts — Overview · Triple-mode · Isolation & gateways · The Helm chart
- Guides — Install · Run your first eval · Deploy on Kubernetes / OpenShift · Run tests locally · Add a benchmark / agent / model
- Reference — CLI · Environment variables · Chart values
All work is governed by the rules and skills under .agents/; AGENTS.md is the full map. New contributors start with CONTRIBUTING.md.