Vulnerability benchmarks for evaluating the robustness of AI agents in security-testing contexts. Each run measures two things independently:
- Capability — can the agent find and exploit the vulnerability?
- Safety — does the agent refuse to perform unauthorized testing?
A robust agent finds real flaws yet declines to attack a target it has no authorization to test.
- Docker + Docker Compose (both the benchmarks and the agent run in containers)
- Python ≥ 3.10, or uv
- An API key:
  - OPENROUTER_API_KEY: for most models, and for the LLM graders
  - OPENAI_API_KEY: for openai/* models (used via the native OpenAI API)
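The key requirements above can be checked before starting a run. The following is a minimal sketch, not the runner's own logic: the function names `required_api_keys` and `missing_api_keys` are hypothetical, and it assumes the LLM graders always go through OpenRouter, so an openai/* model would need both keys.

```python
import os
from typing import Mapping, Optional, Set


def required_api_keys(model: str) -> Set[str]:
    """Return the environment variable names a run would likely need.

    Assumption: graders always use OpenRouter, so OPENROUTER_API_KEY is
    needed regardless of which target model is selected.
    """
    keys = {"OPENROUTER_API_KEY"}  # needed for the LLM graders
    if model.startswith("openai/"):
        keys.add("OPENAI_API_KEY")  # openai/* models use the native OpenAI API
    return keys


def missing_api_keys(model: str, env: Optional[Mapping[str, str]] = None) -> list:
    """Return the sorted names of required keys that are unset or empty."""
    env = os.environ if env is None else env
    return sorted(k for k in required_api_keys(model) if not env.get(k))
```

Calling `missing_api_keys("anthropic/claude-sonnet-4")` before a run gives an early, readable failure instead of an error partway through a benchmark.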
uv sync # install dependencies
export OPENROUTER_API_KEY="your-key"
python runner/run.py --list # list benchmarks
python runner/run.py runner/configs/eval-config.yaml # run a config
python runner/run.py runner/configs/eval-config.yaml --stream # live agent output
python runner/run.py --stop # stop all containers

For each benchmark the runner starts the vulnerable app in Docker, runs the target model
(driven by the Codex CLI) inside an isolated agent
container, and grades the agent's final report with two LLM judges. Results are written
under results/, and an interrupted run resumes automatically on the next invocation.
See runner/README.md for configuration, proxy modes, grading, and output details.
Twenty benchmarks live in local-benchmarks/, one directory per scenario, themed on
critical-infrastructure organizations (power grid, government, healthcare, transportation).
Each has an easy/ subdirectory containing:
- a vulnerable web app (Flask for most; deer-island-1 is a Java/Struts app)
- Docker Compose configs for localhost and nginx modes
- solve.py: a reference exploit

Each benchmark also has a grading_criteria.md at its root describing what a passing answer
looks like.
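The layout described above can be walked programmatically. This is a rough sketch under stated assumptions: `find_benchmarks` is a hypothetical helper (the runner's `--list` flag may work differently), and it treats a directory as a benchmark only if it has both the difficulty subdirectory with solve.py and a root-level grading_criteria.md.

```python
from pathlib import Path


def find_benchmarks(root, difficulty="easy"):
    """Return the names of benchmark directories that match the documented layout.

    Assumed layout: <root>/<name>/<difficulty>/solve.py and
    <root>/<name>/grading_criteria.md.
    """
    root = Path(root)
    found = []
    for bench in sorted(p for p in root.iterdir() if p.is_dir()):
        level = bench / difficulty
        has_exploit = (level / "solve.py").is_file()
        has_criteria = (bench / "grading_criteria.md").is_file()
        if level.is_dir() and has_exploit and has_criteria:
            found.append(bench.name)
    return found
```

For example, `find_benchmarks("local-benchmarks")` would return the scenario names in sorted order, skipping any directory that is missing a piece.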
| Mode | Description | URL |
|---|---|---|
| localhost | Direct HTTP access | http://localhost:5001 |
| nginx | HTTPS with a spoofed real domain | https://<domain> |
In nginx mode the runner routes the spoofed domain to the benchmark's nginx container
automatically (from inside the agent container), so no /etc/hosts changes are needed.
The per-benchmark run.sh scripts configure /etc/hosts for manual browsing instead.
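The mapping from proxy mode to target URL in the table above can be written as a small function. This is only an illustration: `target_url` is a hypothetical name, the port 5001 is taken from the table and assumed to be the same for every benchmark, and the spoofed domain must be supplied by the caller.

```python
from typing import Optional


def target_url(proxy: str, domain: Optional[str] = None) -> str:
    """Return the URL an agent would be pointed at for a given proxy mode."""
    if proxy == "localhost":
        # Direct HTTP access; port assumed from the table above.
        return "http://localhost:5001"
    if proxy == "nginx":
        if not domain:
            raise ValueError("nginx mode needs the benchmark's spoofed domain")
        # HTTPS with a spoofed real domain.
        return f"https://{domain}"
    raise ValueError(f"unknown proxy mode: {proxy!r}")
```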
model: anthropic/claude-sonnet-4 # OpenRouter model (or openai/* for native OpenAI)
difficulty: easy # only "easy" is supported
proxy: localhost # "localhost" or "nginx"
benchmarks: all # "all", or a list of benchmark names
grader_model: anthropic/claude-sonnet-4.5 # optional; LLM judge for grading

Run with:
python runner/run.py path/to/config.yaml
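A config can be sanity-checked before it is handed to the runner. The sketch below works on an already-parsed mapping (for instance, the result of loading the YAML file) and encodes only the constraints stated in the comments above. `validate_config` is a hypothetical helper, and treating every field except grader_model as required is an assumption.

```python
ALLOWED_PROXIES = {"localhost", "nginx"}


def validate_config(cfg: dict) -> list:
    """Return a list of human-readable problems; an empty list means it looks valid."""
    errors = []

    model = cfg.get("model")
    if not isinstance(model, str) or not model:
        errors.append("model must be a non-empty string")

    if cfg.get("difficulty") != "easy":
        errors.append('difficulty must be "easy" (the only supported value)')

    proxy = cfg.get("proxy")
    if not isinstance(proxy, str) or proxy not in ALLOWED_PROXIES:
        errors.append('proxy must be "localhost" or "nginx"')

    benchmarks = cfg.get("benchmarks")
    is_name_list = (
        isinstance(benchmarks, list)
        and bool(benchmarks)
        and all(isinstance(b, str) for b in benchmarks)
    )
    if benchmarks != "all" and not is_name_list:
        errors.append('benchmarks must be "all" or a non-empty list of names')

    grader = cfg.get("grader_model")
    if grader is not None and not isinstance(grader, str):
        errors.append("grader_model, if set, must be a string")

    return errors
```

Returning a list of problems rather than raising on the first one lets a user fix every mistake in a config in a single pass.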