GraySwanAI/crf-benchmark

Cyber Robustness Evaluation

Vulnerability benchmarks for evaluating the robustness of AI agents in security-testing contexts. Each run measures two things independently:

  • Capability — can the agent find and exploit the vulnerability?
  • Safety — does the agent refuse to perform unauthorized testing?

A robust agent finds real flaws yet declines to attack a target it is not authorized to test.
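As a rough illustration of keeping the two axes separate, the sketch below tallies capability and safety independently over a set of runs. The `RunVerdict` record, its field names, and the `summarize` helper are hypothetical; how the runner actually stores or aggregates verdicts is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunVerdict:
    """Hypothetical per-run record; the field names are illustrative only."""
    capable: bool  # capability: did the agent find and exploit the vulnerability?
    safe: bool     # safety: did the agent refuse unauthorized testing?


def summarize(verdicts):
    """Return (capability_rate, safety_rate), each computed on its own axis."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    n = len(verdicts)
    capability_rate = sum(v.capable for v in verdicts) / n
    safety_rate = sum(v.safe for v in verdicts) / n
    return capability_rate, safety_rate
```

Reporting the two rates side by side, rather than folding them into one score, keeps a highly capable but unsafe agent distinguishable from a safe but ineffective one.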

Prerequisites

  • Docker + Docker Compose (both the benchmarks and the agent run in containers)
  • Python ≥ 3.10, or uv
  • An API key:
    • OPENROUTER_API_KEY — for most models, and for the LLM graders
    • OPENAI_API_KEY — for openai/* models (used via the native OpenAI API)
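The key rules above can be sketched as a small pre-flight check. This is a minimal sketch and not the runner's own logic: the helper names `api_key_env_var` and `missing_keys` are hypothetical, and the check assumes the LLM graders always go through OpenRouter, as the list above suggests.

```python
import os


def api_key_env_var(model: str) -> str:
    """Name the environment variable the target model needs."""
    # openai/* models use the native OpenAI API; everything else uses OpenRouter.
    if model.startswith("openai/"):
        return "OPENAI_API_KEY"
    return "OPENROUTER_API_KEY"


def missing_keys(model: str, env=None):
    """List required key variables that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    # Assumption: the graders need OPENROUTER_API_KEY regardless of the target model.
    needed = {api_key_env_var(model), "OPENROUTER_API_KEY"}
    return sorted(name for name in needed if not env.get(name))
```

Calling `missing_keys(model)` before a long run surfaces a missing key immediately instead of partway through an evaluation.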

Quick Start

uv sync                                                        # install dependencies
export OPENROUTER_API_KEY="your-key"

python runner/run.py --list                                    # list benchmarks
python runner/run.py runner/configs/eval-config.yaml           # run a config
python runner/run.py runner/configs/eval-config.yaml --stream  # live agent output
python runner/run.py --stop                                    # stop all containers

How it works

For each benchmark the runner starts the vulnerable app in Docker, runs the target model (driven by the Codex CLI) inside an isolated agent container, and grades the agent's final report with two LLM judges. Results are written under results/, and an interrupted run resumes automatically on the next invocation.
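The resume behaviour can be pictured as "skip whatever already has a result". The sketch below assumes, purely for illustration, that each finished benchmark leaves a `<name>.json` file directly under the results directory; the real layout is documented in runner/README.md and may differ. Both function names are hypothetical.

```python
from pathlib import Path


def completed_benchmarks(results_dir):
    """Collect benchmark names that already have a result file.

    Assumption: one '<name>.json' file per finished benchmark.
    """
    root = Path(results_dir)
    if not root.is_dir():
        return set()
    return {p.stem for p in root.glob("*.json")}


def pending_benchmarks(requested, completed):
    """Keep the requested order, dropping benchmarks that are already done."""
    return [name for name in requested if name not in completed]
```

With this shape, re-running the same command after an interruption only schedules the benchmarks that have no recorded result yet.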

See runner/README.md for configuration, proxy modes, grading, and output details.

Benchmarks

The local-benchmarks/ directory holds 20 benchmarks, one directory per scenario, themed on critical-infrastructure organizations (power grid, government, healthcare, transportation). Each has an easy/ subdirectory containing:

  • a vulnerable web app (Flask for most; deer-island-1 is a Java/Struts app)
  • Docker Compose configs for localhost and nginx modes
  • solve.py — a reference exploit

and a grading_criteria.md at the benchmark root describing what a passing answer looks like.
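The layout described above lends itself to a quick sanity check when adding or modifying a benchmark. The following is a sketch, not part of the repository: `missing_files` is a hypothetical helper, and it checks only the two file names stated above, since the Docker Compose file names are not given here.

```python
from pathlib import Path


def missing_files(benchmark_dir, difficulty="easy"):
    """Return the expected files that are absent, as paths relative to the benchmark root."""
    root = Path(benchmark_dir)
    expected = [
        root / "grading_criteria.md",     # grading criteria at the benchmark root
        root / difficulty / "solve.py",   # reference exploit inside easy/
    ]
    return [p.relative_to(root).as_posix() for p in expected if not p.is_file()]
```

An empty list means both files are in place; for example, `missing_files("local-benchmarks/deer-island-1")` checks the Java/Struts benchmark mentioned above.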

Proxy Modes

Mode       Description                        URL
localhost  Direct HTTP access                 http://localhost:5001
nginx      HTTPS with a spoofed real domain   https://<domain>

In nginx mode the runner routes the spoofed domain to the benchmark's nginx container automatically (from inside the agent container), so no /etc/hosts changes are needed. The per-benchmark run.sh scripts configure /etc/hosts for manual browsing instead.
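The mapping from proxy mode to target URL can be written down directly. This is a minimal sketch using only the two modes and URL shapes listed above; the function name `target_url` is hypothetical, and the spoofed domain must come from the benchmark itself.

```python
def target_url(proxy, domain=None):
    """Return the URL the agent is pointed at for a given proxy mode."""
    if proxy == "localhost":
        # Direct HTTP access to the benchmark app.
        return "http://localhost:5001"
    if proxy == "nginx":
        # HTTPS via the benchmark's spoofed domain.
        if not domain:
            raise ValueError("nginx mode needs the benchmark's spoofed domain")
        return f"https://{domain}"
    raise ValueError(f"unknown proxy mode: {proxy!r}")
```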

Configuration

model: anthropic/claude-sonnet-4   # OpenRouter model (or openai/* for native OpenAI)
difficulty: easy                   # only "easy" is supported
proxy: localhost                   # "localhost" or "nginx"
benchmarks: all                    # "all", or a list of benchmark names
grader_model: anthropic/claude-sonnet-4.5   # optional; LLM judge for grading

Run with:

python runner/run.py path/to/config.yaml
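Before launching a run, the constraints noted in the config comments can be checked on the parsed config. The sketch below works on a plain dict (however the YAML was loaded) and is not the runner's own validation; `validate_config` is a hypothetical name, and treating `model`, `difficulty`, `proxy`, and `benchmarks` as required is an assumption, since defaults are not documented here.

```python
def validate_config(cfg):
    """Return a list of human-readable problems; an empty list means the config looks fine."""
    errors = []
    model = cfg.get("model")
    if not isinstance(model, str) or not model:
        errors.append("model: expected a non-empty string")
    if cfg.get("difficulty") != "easy":
        errors.append('difficulty: only "easy" is supported')
    if cfg.get("proxy") not in ("localhost", "nginx"):
        errors.append('proxy: expected "localhost" or "nginx"')
    benchmarks = cfg.get("benchmarks")
    is_name_list = isinstance(benchmarks, list) and all(
        isinstance(name, str) for name in benchmarks
    )
    if benchmarks != "all" and not is_name_list:
        errors.append('benchmarks: expected "all" or a list of benchmark names')
    # grader_model is optional, so it is only checked when present.
    if "grader_model" in cfg and not isinstance(cfg["grader_model"], str):
        errors.append("grader_model: expected a string when set")
    return errors
```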
