Cyber Robustness Evaluation

Vulnerability benchmarks for evaluating the robustness of AI agents in security-testing contexts. Each run measures two things independently:

- Capability: can the agent find and exploit the vulnerability?
- Safety: does the agent refuse to perform unauthorized testing?

A robust agent finds real flaws yet declines to attack a target it has no authorization to test.

Prerequisites

- Docker and Docker Compose (both the benchmarks and the agent run in containers)
- Python ≥ 3.10, or uv
- An API key:
  - OPENROUTER_API_KEY: for most models, and for the LLM graders
  - OPENAI_API_KEY: for openai/* models (used via the native OpenAI API)
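
The key requirements above can be sketched as a small pre-flight check. This is an illustration only: the helper names `required_api_key` and `missing_keys` are hypothetical and do not come from the runner. Only the two environment variable names and the `openai/*` rule are taken from the list above, and the sketch assumes the graders always go through OpenRouter.

```python
import os


def required_api_key(model: str) -> str:
    """Return the name of the environment variable a model needs.

    Hypothetical helper: openai/* models use the native OpenAI API,
    every other model is assumed to be served through OpenRouter.
    """
    if model.startswith("openai/"):
        return "OPENAI_API_KEY"
    return "OPENROUTER_API_KEY"


def missing_keys(model: str, env=None) -> list[str]:
    """List the variables that a run needs but that are unset or empty."""
    env = os.environ if env is None else env
    # Assumption: the LLM graders use OpenRouter, so that key is always needed.
    needed = {required_api_key(model), "OPENROUTER_API_KEY"}
    return sorted(name for name in needed if not env.get(name))
```

Calling `missing_keys("anthropic/claude-sonnet-4")` before a run would report whether OPENROUTER_API_KEY still has to be exported.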

Quick Start

uv sync                                # install dependencies
export OPENROUTER_API_KEY="your-key"

python runner/run.py --list                                    # list benchmarks
python runner/run.py runner/configs/eval-config.yaml           # run a config
python runner/run.py runner/configs/eval-config.yaml --stream  # live agent output
python runner/run.py --stop                                    # stop all containers

How it works

For each benchmark the runner starts the vulnerable app in Docker, runs the target model (driven by the Codex CLI) inside an isolated agent container, and grades the agent's final report with two LLM judges. Results are written under results/, and an interrupted run resumes automatically on the next invocation.
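
The resume behavior described above could work roughly as follows. This is a minimal sketch under stated assumptions, not the runner's actual code: it assumes one JSON result file per benchmark, and the names `run_all` and `run_one` are hypothetical. `run_one` stands in for the whole "start the app, run the agent, grade the report" step.

```python
import json
from pathlib import Path
from typing import Callable


def run_all(
    benchmarks: list[str],
    results_dir: Path,
    run_one: Callable[[str], dict],
) -> list[str]:
    """Run each benchmark once, skipping those that already have a result.

    Returns the names of the benchmarks that were executed in this call.
    """
    results_dir.mkdir(parents=True, exist_ok=True)
    executed = []
    for name in benchmarks:
        out_file = results_dir / f"{name}.json"
        if out_file.exists():
            continue  # finished in an earlier invocation
        result = run_one(name)
        # Write to a temporary file and rename it, so an interrupt during
        # the write never leaves a partial file that looks like a result.
        tmp_file = out_file.with_name(out_file.name + ".tmp")
        tmp_file.write_text(json.dumps(result))
        tmp_file.replace(out_file)
        executed.append(name)
    return executed
```

With this layout, rerunning after an interruption repeats only the benchmarks that have no result file yet.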

See runner/README.md for configuration, proxy modes, grading, and output details.

Benchmarks

The 20 benchmarks live in local-benchmarks/, one directory per scenario, themed on critical-infrastructure organizations (power grid, government, healthcare, transportation). Each has an easy/ subdirectory containing:

- a vulnerable web app (Flask for most; deer-island-1 is a Java/Struts app)
- Docker Compose configs for localhost and nginx modes
- solve.py, a reference exploit

Each benchmark also has a grading_criteria.md at its root describing what a passing answer looks like.
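
Given that layout, benchmark discovery can be sketched as a directory scan. The function name `find_benchmarks` is hypothetical, and the check is only an assumption about what makes a directory a complete benchmark; `--list` in the runner may use different criteria.

```python
from pathlib import Path


def find_benchmarks(root: Path, difficulty: str = "easy") -> list[str]:
    """Return the names of benchmark directories that look complete.

    Assumed layout, taken from the description above:
      <root>/<name>/grading_criteria.md
      <root>/<name>/<difficulty>/solve.py
    """
    names = []
    for entry in sorted(root.iterdir()):
        if not entry.is_dir():
            continue
        has_criteria = (entry / "grading_criteria.md").is_file()
        has_solve = (entry / difficulty / "solve.py").is_file()
        if has_criteria and has_solve:
            names.append(entry.name)
    return names
```

For example, `find_benchmarks(Path("local-benchmarks"))` would return the scenario names in sorted order when run from the repository root.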

Proxy Modes

| Mode | Description | URL |
| --- | --- | --- |
| `localhost` | Direct HTTP access | `http://localhost:5001` |
| `nginx` | HTTPS with a spoofed real domain | `https://<domain>` |

In nginx mode, the runner automatically routes the spoofed domain to the benchmark's nginx container from inside the agent container, so no /etc/hosts changes are needed. For manual browsing, the per-benchmark run.sh scripts configure /etc/hosts instead.
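
The mapping from proxy mode to target URL can be sketched as below. The helper `target_url` is hypothetical; port 5001 is simply the localhost example from the table, and `domain` stands for whatever spoofed domain a benchmark defines.

```python
from typing import Optional


def target_url(proxy: str, domain: Optional[str] = None, port: int = 5001) -> str:
    """Build the URL the agent is pointed at for a given proxy mode.

    Hypothetical helper: `domain` is needed only in nginx mode and
    `port` only in localhost mode.
    """
    if proxy == "localhost":
        return f"http://localhost:{port}"
    if proxy == "nginx":
        if not domain:
            raise ValueError("nginx mode needs the benchmark's spoofed domain")
        return f"https://{domain}"
    raise ValueError(f"unknown proxy mode: {proxy!r}")
```

Rejecting unknown modes early keeps a typo in the config from silently falling back to direct access.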

Configuration

model: anthropic/claude-sonnet-4   # OpenRouter model (or openai/* for native OpenAI)
difficulty: easy                   # only "easy" is supported
proxy: localhost                   # "localhost" or "nginx"
benchmarks: all                    # "all", or a list of benchmark names
grader_model: anthropic/claude-sonnet-4.5   # optional; LLM judge for grading

Run with:

python runner/run.py path/to/config.yaml
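
The value constraints noted in the config comments can be checked before a run. The sketch below works on an already parsed mapping, so it does not depend on any particular YAML library. The name `validate_config` is hypothetical, and treating every key except grader_model as required is an assumption; the runner may apply defaults instead.

```python
from typing import Any


def validate_config(cfg: dict[str, Any]) -> None:
    """Raise ValueError if a parsed config mapping breaks a documented rule."""
    # Assumption: all keys other than grader_model are required.
    for key in ("model", "difficulty", "proxy", "benchmarks"):
        if key not in cfg:
            raise ValueError(f"missing key: {key}")
    if cfg["difficulty"] != "easy":
        raise ValueError('only "easy" difficulty is supported')
    if cfg["proxy"] not in ("localhost", "nginx"):
        raise ValueError('proxy must be "localhost" or "nginx"')
    benchmarks = cfg["benchmarks"]
    is_name_list = isinstance(benchmarks, list) and all(
        isinstance(name, str) for name in benchmarks
    )
    if benchmarks != "all" and not is_name_list:
        raise ValueError('benchmarks must be "all" or a list of names')
```

A config that passes this check uses only the values documented above; it says nothing about whether the named model or benchmarks actually exist.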
