Eval Containers

AI agent evaluations in containers. 101 benchmarks, 22 agents — ready to deploy at massive scale on any cloud.

An evaluation is one benchmark + one agent + one model — three independent axes, swappable without touching each other. Our goal is agent evaluations you can trust: fast to run, thin to ship, reliable in any environment, and faithful to what each benchmark really measures.

Working in this repo (human or AI agent)? It is governed by AGENTS.md and the .agents/ directory — its rules (what a result must be) and skills (how to produce it). Read the doctrine for the area you touch before changing it.

Why Eval Containers

	Cloud-native	Framework-free	Full interchangeability (agent × model × benchmark)	Speed audit	Size audit	Reliability audit	Native model tracing
Harbor	✗	✗	✗	✗	✗	✗	✗
Inspect AI	✗	✗	✗	✗	✗	✗	✗
Eval Containers	✓	✓	✓	✓	✓	✓	✓

Quick start

One URL for every evaluation — benchmark, agent, model, and task are all EVAL_* env vars, run by plain Docker Compose with no clone and no framework:

echo "OPENAI_API_KEY=sk-..." > .env

EVAL_TASK_ID=0 EVAL_AGENT=codex EVAL_MODEL=openai/gpt-5.4 \
  docker compose -f oci://ghcr.io/exgentic/eval-aime up -y --abort-on-container-exit

cat output/aime/0/task/result.json

Prefer a CLI? Run cargo install eval-containers; then eval-containers run aime --task-id 0 --agent codex --model openai/gpt-5.4 prints and runs that exact Docker command. Every CLI command is shorthand for a plain docker or kubectl command (pass --dry-run to only print it).
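Because the task is just one more EVAL_* variable, sweeping several tasks can be sketched as a small loop around the Compose command above. This is a minimal illustration, not part of the project: the function name eval_command, the RUN switch, and the assumption that task IDs 0 to 2 exist are all hypothetical. By default it only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch: build the Quick start Compose command for several task IDs.
# Assumption: task IDs 0, 1 and 2 exist for the aime benchmark.

COMPOSE_REF="oci://ghcr.io/exgentic/eval-aime"

# Print the Compose invocation for one task.
# Arguments: task id, agent, model (hypothetical helper, not a project API).
eval_command() {
  local task_id="$1" agent="$2" model="$3"
  printf 'EVAL_TASK_ID=%s EVAL_AGENT=%s EVAL_MODEL=%s docker compose -f %s up -y --abort-on-container-exit\n' \
    "$task_id" "$agent" "$model" "$COMPOSE_REF"
}

# Dry run by default: print each command. Set RUN=1 to actually execute them.
for task_id in 0 1 2; do
  if [ "${RUN:-0}" = "1" ]; then
    EVAL_TASK_ID="$task_id" EVAL_AGENT=codex EVAL_MODEL=openai/gpt-5.4 \
      docker compose -f "$COMPOSE_REF" up -y --abort-on-container-exit
  else
    eval_command "$task_id" codex openai/gpt-5.4
  fi
done
```

Printing first and executing only on request follows the same idea as the CLI's --dry-run: the plain Docker command stays visible.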

Same eval, on Kubernetes

The exact same evaluation runs at scale on a cluster: the oci:// Compose reference becomes a single helm template | kubectl apply pipeline, with the axes passed as --set flags instead of EVAL_* variables:

helm template eval-aime oci://ghcr.io/exgentic/charts/eval \
  --set benchmark=aime --set task=0 --set agent=codex --set model=openai/gpt-5.4 | kubectl apply -f -

→ Triple-mode (compose / container / job) · Deploy on Kubernetes · OpenShift
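The correspondence between the two modes can be sketched as a small translation from the Quick start variables to chart flags. The mapping (EVAL_TASK_ID to task, EVAL_AGENT to agent, EVAL_MODEL to model) is read off the two example commands; the helper name helm_sets_from_env is hypothetical, and the benchmark is passed as an argument because the Compose example carries it in the URL.

```shell
#!/usr/bin/env bash
# Sketch: turn the Quick start EVAL_* variables into chart --set flags.
# helm_sets_from_env is a hypothetical helper, not part of the project.
helm_sets_from_env() {
  local benchmark="$1"
  # ${VAR:?} aborts with an error if a required variable is unset or empty.
  printf -- '--set benchmark=%s --set task=%s --set agent=%s --set model=%s\n' \
    "$benchmark" "${EVAL_TASK_ID:?}" "${EVAL_AGENT:?}" "${EVAL_MODEL:?}"
}

# Print the flags for the example evaluation.
EVAL_TASK_ID=0 EVAL_AGENT=codex EVAL_MODEL=openai/gpt-5.4 helm_sets_from_env aime

# The flags could then be spliced into the pipeline shown above, for example:
#   helm template eval-aime oci://ghcr.io/exgentic/charts/eval \
#     $(helm_sets_from_env aime) | kubectl apply -f -
```

The splice relies on unquoted word splitting, so it assumes none of the values contain spaces.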

oci:// references need Docker Compose ≥ 2.34. On older Docker, behind a firewall, or fully airgapped, see Run offline or airgapped. To iterate on local changes without pulling, add --local.
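The version requirement can be checked before running anything. The following is a rough sketch, assuming a sort that supports -V (GNU coreutils and recent BSD/macOS) and that docker compose version --short prints a plain version such as 2.34.0; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: check whether the local Docker Compose is new enough for oci:// references.

# Succeeds when version $1 is at least version $2 (relies on sort -V).
version_at_least() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Strip a leading "v" so "v2.34.0" and "2.34.0" compare the same.
supports_oci_compose() {
  version_at_least "${1#v}" "2.34.0"
}

# Only query Docker if it is installed; otherwise just report that.
if command -v docker >/dev/null 2>&1; then
  if supports_oci_compose "$(docker compose version --short 2>/dev/null)"; then
    echo "Docker Compose supports oci:// references"
  else
    echo "Docker Compose is older than 2.34 or unavailable; see Run offline or airgapped"
  fi
else
  echo "docker not found"
fi
```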

Full walkthrough: Install → Run your first eval.

Documentation

Human-facing docs — concepts, guides, and reference — live in docs/.

Concepts — Overview · Triple-mode · Isolation & gateways · The Helm chart
Guides — Install · Run your first eval · Deploy on Kubernetes / OpenShift · Run tests locally · Add a benchmark / agent / model
Reference — CLI · Environment variables · Chart values

Contributing & governance

All work is governed by the rules and skills under .agents/; AGENTS.md is the full map. New contributors start with CONTRIBUTING.md.
