250 challenges to test if your AI agent actually works — not just the model, but the infrastructure.
OpenGym is an open-source benchmark that evaluates AI agents across 7 capability dimensions: coding, memory persistence, tool discovery, multi-step planning, self-correction, safety boundaries, and multi-agent coordination. Unlike benchmarks that only test "can the model solve this?", OpenGym tests "does the agent system work reliably?"
git clone https://github.com/widingmarcus-cyber/opengym && cd opengym
pip install -e .
opengym fetch 001 # grab a challenge
opengym score 001 # score it (0/100: your agent hasn't solved it yet)
Then point your agent at it:
# Automated: opengym runs your agent and scores the result
opengym run 001 --agent "python {repo}/examples/agents/openai_agent.py --task '{task}' --dir {workspace}"
# Run all 250 challenges
opengym run all --agent "..." --summary
# Fast infra check for any agent stack (12-challenge smoke profile)
opengym run all --profile infra-smoke --agent "..." --summary
Requires: Python 3.10+. No Docker needed. See examples/agents/ for ready-made OpenAI, Anthropic, OpenClaw, and dummy agent adapters.
OpenClaw users can run the bundled adapter directly:
opengym run all --agent "python {repo}/examples/agents/openclaw_agent.py --task '{task}' --dir {workspace}" --summary
Each challenge is a self-contained folder. Your agent reads the task, does the work, and the CLI scores it.
101-learn-and-recall/
├── README.md ← Agent reads this
├── setup/ ← Agent edits these files
├── steps/ ← Multi-session task steps (if applicable)
├── tools/ ← Executable tools (if applicable)
├── tests/ ← Hidden verification (agent doesn't touch)
└── metadata.yaml
Two workflows:
# Manual: fetch, let your agent work, score
opengym fetch 001
# ... your agent solves it ...
opengym score 001
# Automated: opengym orchestrates your agent
opengym run 101 --agent "python {repo}/my_agent.py --task '{task}' --dir {workspace}"
opengym run all --agent "..." --summary # run the full gauntlet
Most benchmarks only test coding. OpenGym tests the infrastructure that makes agents reliable in production.
| Dimension | Challenges | What It Tests |
|---|---|---|
| Coding | 110 | Read a task, write/fix code, pass tests |
| Memory | 26 | Persist information across killed sessions |
| Tool Use | 27 | Discover tools, handle failures, manage rate limits |
| Planning | 26 | Multi-step decomposition, scheduling, long-horizon stability |
| Multi-Agent | 22 | Coordinate via shared files, concurrency, task splitting |
| Resilience | 23 | Recover from crashes, errors, partial failures |
| Safety | 16 | Resist injection, enforce boundaries, redact secrets |
The baseline. Read a task, write/fix code, pass tests. This is what most benchmarks measure; OpenGym includes it but goes further.
16 categories: code-fixing, code-writing, debugging, data-processing, refactoring, testing, api-integration, info-retrieval, devops-config, safety, algorithm, text-processing, file-operations, multi-step, observability, determinism
| Category | Count | Difficulty Range |
|---|---|---|
| Code Fixing | 10 | Easy → Hard |
| Code Writing | 12 | Easy → Hard |
| Debugging | 6 | Easy → Hard |
| Data Processing | 7 | Easy → Hard |
| Refactoring | 5 | Easy → Hard |
| Testing | 6 | Easy → Hard |
| API Integration | 5 | Easy → Hard |
| Info Retrieval | 7 | Easy → Hard |
| DevOps & Config | 7 | Easy → Hard |
| Safety (code) | 7 | Easy → Hard |
| Algorithm | 8 | Easy → Hard |
| Text Processing | 6 | Easy → Hard |
| File Operations | 6 | Easy → Hard |
| Multi-Step | 7 | Medium → Hard |
| Observability | 6 | Easy → Hard |
| Determinism | 4 | Easy → Hard |
The key differentiator. Tests whether your agent's memory actually persists across sessions. The CLI kills the agent process between steps and clears context — only files the agent explicitly wrote survive.
Includes: 5 core memory challenges + 20 memory-state infrastructure challenges
| ID Range | Focus | Examples |
|---|---|---|
| 101-105 | Core memory | Learn & recall, session rebuild, incremental knowledge, selective memory, knowledge update |
| 128-147 | Memory & state infrastructure | Append-only logs, state merge conflicts, schema migration, LRU eviction, write-ahead logging, compaction |
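Because only files survive the kill between steps, an agent needs some form of on-disk memory to pass the memory challenges. Here is a minimal sketch of one option, assuming a JSON file somewhere in the workspace; `FileMemory` and the file name are hypothetical and not part of OpenGym.

```python
import json
import os
import tempfile
from pathlib import Path


class FileMemory:
    """Tiny key-value memory that survives process restarts by living on disk."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def load(self) -> dict:
        if not self.path.exists():
            return {}
        return json.loads(self.path.read_text(encoding="utf-8"))

    def remember(self, key: str, value) -> None:
        data = self.load()
        data[key] = value
        # Write to a temp file first, then rename, so a kill mid-write
        # never leaves a half-written memory file behind.
        self.path.parent.mkdir(parents=True, exist_ok=True)
        fd, tmp = tempfile.mkstemp(dir=self.path.parent, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp, self.path)

    def recall(self, key: str, default=None):
        return self.load().get(key, default)
```

A later session can construct a new `FileMemory` on the same path and call `recall` without sharing any in-process state with the session that wrote it.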
Tests whether your agent can discover unfamiliar tools, handle failures, rate limits, and broken tools.
Includes: 5 core tool challenges + 20 tool robustness challenges
| ID Range | Focus | Examples |
|---|---|---|
| 106-110 | Core tool use | Find right tool, chain tools, handle flaky tool, rate limits, undocumented API |
| 166-185 | Tool robustness | 429 retry-after, malformed JSON recovery, paginated endpoints, deprecated API migration, schema validation |
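The rate-limit and flaky-tool cases above usually come down to a retry wrapper. The sketch below shows one way to honor a retry-after hint and otherwise back off exponentially; `RateLimited` and `call_with_retry` are hypothetical names, and how a real tool reports a 429 depends on the tool.

```python
import time
from typing import Callable


class RateLimited(Exception):
    """Hypothetical error a tool wrapper raises on HTTP 429."""

    def __init__(self, retry_after: float) -> None:
        super().__init__(f"rate limited, retry after {retry_after}s")
        self.retry_after = retry_after


def call_with_retry(tool: Callable[[], str], max_attempts: int = 4,
                    base_delay: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Retry a flaky tool; honor retry_after when given, else back off exponentially."""
    for attempt in range(1, max_attempts + 1):
        try:
            return tool()
        except RateLimited as exc:
            if attempt == max_attempts:
                raise
            delay = exc.retry_after
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
        sleep(delay)
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the wrapper testable without real waiting.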
Tests crash recovery, error handling, atomic operations, and failure recovery.
Includes: 5 core resilience challenges + 15 failure recovery challenges
| ID Range | Focus | Examples |
|---|---|---|
| 111-115 | Core resilience | Misleading errors, cascading failures, red herring logs, partial failure |
| 198-212 | Failure recovery | Mid-task crash, OOM simulation, disk full, SIGTERM handling, transaction atomicity, checkpoint resume, rollback |
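Checkpoint resume, one of the failure-recovery themes above, can be illustrated with a small stage runner. This is a sketch under assumptions: the stage names and the checkpoint file format are invented for illustration and do not come from any OpenGym challenge.

```python
import json
import os
from pathlib import Path
from typing import Callable


def run_with_checkpoint(stages: list[tuple[str, Callable[[], None]]],
                        checkpoint: Path) -> list[str]:
    """Run stages in order, skipping any already recorded in the checkpoint file."""
    done: list[str] = []
    if checkpoint.exists():
        done = json.loads(checkpoint.read_text(encoding="utf-8"))
    executed = []
    for name, action in stages:
        if name in done:
            continue  # finished before the crash; do not repeat it
        action()
        done.append(name)
        # Persist progress after every stage; the rename makes the update atomic.
        tmp = checkpoint.with_suffix(".tmp")
        tmp.write_text(json.dumps(done), encoding="utf-8")
        os.replace(tmp, checkpoint)
        executed.append(name)
    return executed
```

If the process dies inside a stage, that stage is not recorded, so the next run repeats it. Stages therefore need to be safe to re-run.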
Tests prompt injection resistance, security hardening, and boundary enforcement.
Includes: 5 core safety challenges + 10 security boundary challenges
| ID Range | Focus | Examples |
|---|---|---|
| 116-120 | Core safety | Prompt injection, malicious logs, dangerous README, data exfiltration, scope creep |
| 213-222 | Security boundaries | Path traversal, env secret leaks, symlink escape, input sanitization, sandbox hardening, safe deserialization |
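Path traversal and symlink escape share one common defense: resolve the path first, then check that it still sits inside the allowed root. A minimal sketch follows; `resolve_inside` is a hypothetical helper, not an OpenGym API.

```python
from pathlib import Path


def resolve_inside(workspace: Path, user_path: str) -> Path:
    """Return the resolved path, or raise if it escapes the workspace.

    resolve() follows symlinks and collapses "..", so both path traversal
    and symlink escapes end up outside the root and are rejected.
    """
    root = workspace.resolve()
    candidate = (root / user_path).resolve()
    if not candidate.is_relative_to(root):
        raise PermissionError(f"{user_path!r} escapes the workspace")
    return candidate
```

The check is done at resolve time, so it does not protect against a symlink being swapped in between the check and the later file access.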
Tests whether agents can coordinate via shared files, handle concurrency, and split tasks.
Includes: 3 core multi-agent challenges + 18 concurrency & coordination challenges
| ID Range | Focus | Examples |
|---|---|---|
| 121-123 | Core multi-agent | Shared config, information asymmetry, task delegation |
| 148-165 | Concurrency & coordination | File locking, atomic counters, producer-consumer, leader election, distributed merge, priority queues |
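File locking and atomic counters, both listed above, can be sketched with nothing more than an exclusive-create lock file. The helper names are hypothetical, and this is one possible approach rather than the one any challenge expects.

```python
import os
import time
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def file_lock(lock_path: Path, timeout: float = 5.0, poll: float = 0.01):
    """Cross-process lock: O_EXCL makes creation fail if the lock file exists."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll)
    try:
        yield
    finally:
        os.close(fd)
        os.unlink(lock_path)


def increment(counter: Path) -> int:
    """Read-modify-write a shared counter while holding the lock."""
    with file_lock(counter.with_suffix(".lock")):
        value = int(counter.read_text()) if counter.exists() else 0
        counter.write_text(str(value + 1))
        return value + 1
```

A holder that crashes leaves the lock file behind, so a production version would also need stale-lock detection (for example, a timestamp or PID written into the lock file).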
Tests decomposition, scheduling, long-horizon execution, and dependency resolution.
Includes: 4 core planning challenges + 12 scheduling challenges + 8 long-horizon challenges
| ID Range | Focus | Examples |
|---|---|---|
| 124-127 | Core planning | Dependency ordering, changing requirements, resource constraints, plan-then-execute |
| 186-197 | Scheduling & cron | Fresh vs reuse, config drift, missed schedules, double execution, timezone/DST handling |
| 233-240 | Long-horizon stability | 10-stage pipeline, config drift detection, state machine execution, dependency resolution, event sourcing, consensus |
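Dependency ordering is a topological sort, which Python's standard library provides in `graphlib`. A short sketch follows; the task names in the usage example are invented for illustration.

```python
from graphlib import CycleError, TopologicalSorter


def plan_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return task names so that every task comes after its prerequisites."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # The detected cycle is the second element of the exception args.
        raise ValueError(f"circular dependency: {exc.args[1]}") from exc
```

For example, `plan_order({"deploy": {"test"}, "test": {"build"}, "build": set()})` places `build` before `test` and `test` before `deploy`.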
Every challenge scores 0-100 based on tests passed. Results are grouped by dimension so you see where your agent's infrastructure breaks down.
============================================================
OpenGym Score: 68/100
Passed: 163/250
============================================================
By Dimension:
coding [################....] 82/100
memory [########............] 40/100
tool-use [############........] 60/100
resilience [##########..........] 55/100
safety [##################..] 90/100
multi-agent [######..............] 30/100
planning [##########..........] 50/100
Diagnostics:
- memory (40/100): Your agent cannot persist information across sessions.
It needs a real memory system — not just context window.
- multi-agent (30/100): Your agent cannot coordinate with other agents
via shared resources.
Summary output also includes an Action Plan section with concrete
runtime-level remediation steps.
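The per-dimension grouping can be approximated in a few lines. This sketch makes two assumptions that the README does not confirm: a dimension score is the mean of its challenge scores, and the bar fills two cells per full 10 points (a rule inferred from the sample output above). Both function names are hypothetical.

```python
from collections import defaultdict


def dimension_scores(results: list[tuple[str, int]]) -> dict[str, int]:
    """Average per-challenge scores (0-100) within each dimension."""
    buckets: dict[str, list[int]] = defaultdict(list)
    for dimension, score in results:
        buckets[dimension].append(score)
    return {dim: round(sum(vals) / len(vals)) for dim, vals in buckets.items()}


def render_bar(score: int, width: int = 20) -> str:
    """Draw a text bar; with width 20, each full 10 points fills two cells."""
    filled = (score // 10) * (width // 10)
    return "[" + "#" * filled + "." * (width - filled) + "]"
```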
Single-pass scores can saturate for strong agents. Use repeated runs with deterministic chaos to measure stability over time:
# 5 repeated trials, deterministic fault jitter
opengym run 243 --agent "..." --trials 5 --chaos-level light --chaos-seed 42 --summary
# Harder pressure: larger jitter + occasional unsignaled SIGTERM on infra tasks
opengym run all --agent "..." --trials 3 --chaos-level hard --summary
--trials N reports a reliability block: trial pass rate and stable/flaky/broken challenge counts. This is the intended "did my infra get more reliable this week?" signal.
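One plausible reading of the stable/flaky/broken labels is sketched below. The definitions are an assumption (stable means every trial passed, broken means none did, flaky is anything in between); the CLI's actual rules may differ, and the function names are hypothetical.

```python
def classify(trial_results: dict[str, list[bool]]) -> dict[str, str]:
    """Label each challenge from its per-trial pass/fail history."""
    labels = {}
    for challenge_id, outcomes in trial_results.items():
        if not outcomes:
            continue  # no trials recorded, nothing to classify
        if all(outcomes):
            labels[challenge_id] = "stable"
        elif not any(outcomes):
            labels[challenge_id] = "broken"
        else:
            labels[challenge_id] = "flaky"
    return labels


def trial_pass_rate(trial_results: dict[str, list[bool]]) -> float:
    """Fraction of all individual trials that passed."""
    outcomes = [ok for runs in trial_results.values() for ok in runs]
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```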
Use predefined profiles when you want a consistent run target without managing IDs:
opengym run all --profile infra-smoke --agent "..." --summary # 12 infra reps (fast)
opengym run all --profile infra-weekly --agent "..." --summary # 60 harder infra cases
opengym run all --profile infra-hard --agent "..." --summary # all hard infra challenges
opengym run all --profile infra-nightly --agent "..." --summary # full infra conformance set
opengym run all --profile safety-gate --agent "..." --summary # safety/resilience gate
Save reports and compare week-over-week:
# Baseline (e.g., last week)
opengym run all --profile infra-weekly --agent "..." --trials 3 --chaos-level light --chaos-seed 42 --summary --save-report reports/week1.json
# Current run
opengym run all --profile infra-weekly --agent "..." --trials 3 --chaos-level light --chaos-seed 42 --summary --save-report reports/week2.json
# Compare regressions/improvements
opengym compare reports/week1.json reports/week2.json
# List and filter
opengym list # List all 250 challenges
opengym list --dimension memory # Filter by dimension
opengym list --category algorithm # Filter by category
opengym list --difficulty hard # Filter by difficulty
opengym list --json-output # Machine-readable
# Fetch challenges
opengym fetch 001 # Fetch one challenge
opengym fetch all # Fetch everything
opengym init-key # Create ~/.opengym/test_key for private fixtures
# Score manually (MODEL_DEPENDENT challenges only)
opengym score 001 # Score one model-dependent challenge
opengym score all --summary # Includes blocked infra entries unless run via `opengym run`
opengym score all --scorecard # Scorecard view of available results
opengym score all --json-output # JSON output
opengym score all --csv-output # CSV for spreadsheets
# Run agent automatically (including multi-session orchestration)
opengym run 101 --agent "python {repo}/examples/agents/openai_agent.py --task '{task}' --dir {workspace}"
opengym run all --agent "..." --summary # Full gauntlet
opengym run all --agent "..." --scorecard # Infra scorecard
opengym run all --agent "..." --parallel 4 # 4 workers
opengym run all --agent "..." --enforce-scope # fail on writes outside setup/
opengym run all --agent "..." --fresh-infra-workspace # reset infra workspaces before each run
opengym run all --agent "..." --trials 5 --chaos-level light --chaos-seed 42 # reliability/stability run
opengym run all --profile infra-smoke --agent "..." --summary # predefined profile
opengym run all --agent "..." --save-report reports/run.json # persist machine-readable report
# Compare two saved reports
opengym compare reports/week1.json reports/week2.json
opengym run --agent supports placeholders: {task}, {workspace}, {task_content}, {repo}.
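To make the placeholder mechanism concrete, here is a sketch of how such a template could be expanded into an argument list. It is not OpenGym's implementation; `render_agent_command` is a hypothetical name, and the values are treated as opaque strings because the README does not specify what each placeholder contains.

```python
import shlex


def render_agent_command(template: str, values: dict[str, str]) -> list[str]:
    """Split the template into argv first, then fill {name} placeholders per token."""
    argv = []
    for token in shlex.split(template):
        for name, value in values.items():
            token = token.replace("{" + name + "}", value)
        argv.append(token)
    return argv
```

Splitting before substituting means a value containing spaces or quotes stays a single argument instead of being re-parsed by the shell rules.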
The --scorecard flag produces an infrastructure-focused breakdown showing exactly where your agent's orchestration fails:
================================================================
INFRA SCORECARD
================================================================
Infra Conformance: 62/100 (87/140 passed)
Model-Dependent: 74/100 (74/100 passed)
Overall: 67/100
================================================================
Category Breakdown:
────────────────────────────────────────────────────────────
Memory Integrity [################....] 80/100 16/20 WARN
Concurrency Safety [############........] 61/100 11/18 WARN
Tool Robustness [##########..........] 55/100 11/20 WARN
Crash Recovery [########............] 40/100 6/15 FAIL
Security Boundaries [######..............] 30/100 3/10 FAIL
Long-Horizon Stability [##..................] 12/100 1/8 FAIL
Each category maps to a specific infrastructure capability. FAIL/WARN/PASS tells you at a glance what needs work.
# JSON output for CI pipelines, dashboards, or sharing
opengym score all --json-output > results.json
opengym score all --scorecard --json-output > scorecard.json
opengym score all --csv-output > results.csv
opengym run all --profile infra-weekly --agent "..." --trials 3 --chaos-level light --save-report reports/weekly.json
opengym compare reports/weekly_prev.json reports/weekly.json --json-output > diff.json
- Not an RL gym. No environments, no reward signals, no training loops.
- Not an LLM benchmark. We don't measure raw model quality (MMLU, HumanEval, etc.).
- It's an agent infrastructure test. Does your agent's memory, tool use, error handling, and safety actually work end-to-end?
Test files are excluded from the workspace. When you opengym fetch a challenge, the tests/ directory is not copied to your workspace. Your agent cannot read test files to reverse-engineer answers.
Scoring uses an isolated temporary staging workspace with canonical hidden tests. Tests are not injected into your live challenge workspace.
For opengym run on multi-session challenges, both tests/ and steps/ are excluded — your agent only sees the current step, not future ones.
Hard ≠ infra. Difficulty alone does not make a task infrastructure-focused.
All challenges marked challenge_type: INFRA_CONFORMANCE must be executed with opengym run. Direct opengym score is blocked for these challenges because infra verification depends on orchestration behavior (fault injection, process restarts, scope enforcement).
# Model-dependent challenge: fetch + solve + score
opengym fetch 167 && opengym score 167
# Infra challenge: must use run
opengym run 243 --agent "python {repo}/my_agent.py --task '{task}' --dir {workspace}"
Why? A bare LLM (or a human pre-writing output files) can bypass many static output checks. opengym run validates infrastructure behavior under orchestration: step boundaries, fault injection, workspace resets, and runtime policy checks.
| Challenge Type | Direct `opengym score` | `opengym run` |
|---|---|---|
| MODEL_DEPENDENT | Allowed | Allowed |
| INFRA_CONFORMANCE | Blocked | Required |
All challenges run locally on your machine. No network calls are made by the CLI. Agent code executes in your normal environment — if you're running untrusted agents, use a sandbox (Docker, VM, etc.). The CLI never sends data anywhere.
See docs/AGENT_GUIDE.md for copy-paste examples with Claude Code, OpenAI, LangChain, CrewAI, and custom agents.
See docs/CHALLENGE_SPEC.md for the challenge format.
Built with: Python 3.10+ / click / pytest / YAML / JSON
License: MIT