Exgentic/eval-containers

Eval Containers

AI agent evaluations in containers. 101 benchmarks, 22 agents — ready to deploy at massive scale on any cloud.

An evaluation is one benchmark + one agent + one model — three independent axes, swappable without touching each other. Our goal is agent evaluations you can trust: fast to run, thin to ship, reliable in any environment, and faithful to what each benchmark really measures.

Working in this repo (human or AI agent)? It is governed by AGENTS.md and the .agents/ directory — its rules (what a result must be) and skills (how to produce it). Read the doctrine for the area you touch before changing it.

Why Eval Containers

Comparison of Harbor, Inspect AI, and Eval Containers on seven criteria: cloud-native, framework-free, full interchangeability (agent × model × benchmark), speed audit, size audit, reliability audit, and native model tracing.

Quick start

One URL for every evaluation — benchmark, agent, model, and task are all EVAL_* env vars, run by plain Docker Compose with no clone and no framework:

echo "OPENAI_API_KEY=sk-..." > .env

EVAL_TASK_ID=0 EVAL_AGENT=codex EVAL_MODEL=openai/gpt-5.4 \
  docker compose -f oci://ghcr.io/exgentic/eval-aime up -y --abort-on-container-exit

cat output/aime/0/task/result.json
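
Because the task is just another EVAL_* variable, running several tasks amounts to repeating the same command with a different EVAL_TASK_ID. The sketch below only prints the commands and the matching result paths; it does not run Docker. It assumes that other benchmarks follow the eval-<benchmark> naming seen for aime, that results land at output/<benchmark>/<task id>/task/result.json as in the single example above, and that task IDs 1 and 2 exist. The helper names eval_compose_cmd and result_path are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) the quick-start Compose command for one task.
# Assumption: the image reference follows the eval-<benchmark> pattern seen for aime.
eval_compose_cmd() {
  printf 'EVAL_TASK_ID=%s EVAL_AGENT=%s EVAL_MODEL=%s docker compose -f oci://ghcr.io/exgentic/eval-%s up -y --abort-on-container-exit\n' \
    "$2" "$3" "$4" "$1"
}

# Hypothetical helper: print where the result file is expected for one task.
# Assumption: the layout generalises from output/aime/0/task/result.json.
result_path() {
  printf 'output/%s/%s/task/result.json\n' "$1" "$2"
}

# Print the command and expected result path for three illustrative task IDs.
for task_id in 0 1 2; do
  eval_compose_cmd aime "$task_id" codex openai/gpt-5.4
  result_path aime "$task_id"
done
```

To actually run a task, paste one of the printed commands into a shell that has the .env file from the quick start in the working directory.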

Prefer a CLI? Run cargo install eval-containers; then eval-containers run aime --task-id 0 --agent codex --model openai/gpt-5.4 prints and runs that exact Docker command. Every CLI command corresponds to a plain docker or kubectl command (add --dry-run to print it without running it).

Same eval, on Kubernetes

The exact same evaluation runs at scale on a cluster: the oci:// Compose reference becomes a single helm template | kubectl apply pipeline, with the axes passed as --set flags instead of EVAL_* variables:

helm template eval-aime oci://ghcr.io/exgentic/charts/eval \
  --set benchmark=aime --set task=0 --set agent=codex --set model=openai/gpt-5.4 | kubectl apply -f -
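
The mapping from EVAL_* variables to --set flags can be sketched as a small shell function. This is an illustration only: helm_eval_cmd is a hypothetical helper, it prints the pipeline rather than running it, and it assumes the release name follows the eval-<benchmark> pattern seen in the command above.

```shell
#!/bin/sh
# Hypothetical helper: translate the Compose-style EVAL_* variables into the
# helm | kubectl pipeline shown above. It prints the pipeline; it does not run it.
# Assumption: the release name is eval-<benchmark>, as in the aime example.
helm_eval_cmd() {
  printf 'helm template eval-%s oci://ghcr.io/exgentic/charts/eval --set benchmark=%s --set task=%s --set agent=%s --set model=%s | kubectl apply -f -\n' \
    "$1" "$1" \
    "${EVAL_TASK_ID:?set EVAL_TASK_ID}" \
    "${EVAL_AGENT:?set EVAL_AGENT}" \
    "${EVAL_MODEL:?set EVAL_MODEL}"
}

# Same axes as the quick start, now expressed as --set flags.
EVAL_TASK_ID=0 EVAL_AGENT=codex EVAL_MODEL=openai/gpt-5.4
helm_eval_cmd aime
```

Each of the three variables is required; if one is unset, the ${VAR:?message} expansion stops with an error instead of printing a command with an empty value.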

Triple-mode (compose / container / job) · Deploy on Kubernetes · OpenShift

oci:// references need Docker Compose ≥ 2.34. On older Docker, behind a firewall, or fully airgapped, see Run offline or airgapped. To iterate on local changes without pulling, add --local.
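
One way to check the Compose requirement before pulling anything is to compare version numbers in the shell. The function below is a minimal sketch, not part of this project: version_at_least is a hypothetical helper that compares only the major and minor components and assumes a version string of the form 2.34.0, optionally prefixed with v.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) if version $1 is at least version $2.
# Only major.minor are compared; a leading "v" and any patch or suffix are ignored.
version_at_least() {
  have=${1#v}
  want=${2#v}
  have_major=${have%%.*}
  rest=${have#*.}
  have_minor=${rest%%.*}
  want_major=${want%%.*}
  rest=${want#*.}
  want_minor=${rest%%.*}
  if [ "$have_major" -gt "$want_major" ]; then return 0; fi
  if [ "$have_major" -lt "$want_major" ]; then return 1; fi
  [ "$have_minor" -ge "$want_minor" ]
}

# Example use (left commented out so the block has no side effects):
# version_at_least "$(docker compose version --short)" 2.34 \
#   || echo "Docker Compose is older than 2.34; oci:// references will not work" >&2
```

docker compose version --short prints just the version number, which is why it can be passed to the helper directly.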

Full walkthrough: Install, then Run your first eval.

Documentation

Human-facing docs — concepts, guides, and reference — live in docs/.

Contributing & governance

All work is governed by the rules and skills under .agents/; AGENTS.md is the full map. New contributors start with CONTRIBUTING.md.
