OpenGym

250 challenges to test if your AI agent actually works — not just the model, but the infrastructure.

OpenGym is an open-source benchmark that evaluates AI agents across 7 capability dimensions: coding, memory persistence, tool discovery, multi-step planning, self-correction, safety boundaries, and multi-agent coordination. Unlike benchmarks that only test "can the model solve this?", OpenGym tests "does the agent system work reliably?"

Quickstart

git clone https://github.com/widingmarcus-cyber/opengym && cd opengym
pip install -e .
opengym fetch 001              # grab a challenge
opengym score 001              # score it (0/100 — your agent hasn't solved it yet)

Then point your agent at it:

# Automated: opengym runs your agent and scores the result
opengym run 001 --agent "python {repo}/examples/agents/openai_agent.py --task '{task}' --dir {workspace}"

# Run all 250 challenges
opengym run all --agent "..." --summary

# Fast infra check for any agent stack (12-challenge smoke profile)
opengym run all --profile infra-smoke --agent "..." --summary

Requires: Python 3.10+. No Docker needed. See examples/agents/ for ready-made OpenAI, Anthropic, OpenClaw, and dummy agent adapters.

OpenClaw users can run the bundled adapter directly:

opengym run all --agent "python {repo}/examples/agents/openclaw_agent.py --task '{task}' --dir {workspace}" --summary
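
Because the --agent value is a command template, any script that accepts a task and a working directory can act as an adapter. The following is a minimal sketch of such a script, not one of the bundled adapters. It assumes that --task carries the task text or identifier and that the task description is the README.md inside the directory passed via --dir; the function names are hypothetical.

```python
import argparse
from pathlib import Path


def read_task(workspace: Path) -> str:
    """Return the task description the agent is expected to read."""
    readme = workspace / "README.md"  # assumed location of the task text
    return readme.read_text(encoding="utf-8") if readme.exists() else ""


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(description="Hypothetical no-op agent adapter")
    parser.add_argument("--task", required=True, help="task text or identifier")
    parser.add_argument("--dir", required=True, help="challenge workspace directory")
    args = parser.parse_args(argv)

    description = read_task(Path(args.dir))
    # A real adapter would call a model here and edit files in the workspace.
    print(f"task={args.task!r}, README is {len(description)} characters long")
    return 0
```

Saved as a script, it would end with an `if __name__ == "__main__": raise SystemExit(main())` guard and could then be referenced from the same --agent template shown above.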

How It Works

Each challenge is a self-contained folder. Your agent reads the task, does the work, and the CLI scores it.

101-learn-and-recall/
├── README.md        ← Agent reads this
├── setup/           ← Agent edits these files
├── steps/           ← Multi-session task steps (if applicable)
├── tools/           ← Executable tools (if applicable)
├── tests/           ← Hidden verification (agent doesn't touch)
└── metadata.yaml
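
To make the layout concrete, here is a small sketch that reports which of the entries above are present in a challenge folder. The helper name and the returned shape are illustrative assumptions, not part of the OpenGym CLI.

```python
from pathlib import Path

# Entries named in the layout above; True marks directories.
EXPECTED = {
    "README.md": False,
    "setup": True,
    "steps": True,
    "tools": True,
    "tests": True,
    "metadata.yaml": False,
}


def describe_challenge(folder: Path) -> dict[str, bool]:
    """Map each expected entry to whether it exists with the right kind."""
    found = {}
    for name, is_dir in EXPECTED.items():
        path = folder / name
        found[name] = path.is_dir() if is_dir else path.is_file()
    return found
```

A folder where `steps` is reported as present would be a multi-session challenge under this reading of the layout.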

Two workflows:

# Manual: fetch, let your agent work, score
opengym fetch 001
# ... your agent solves it ...
opengym score 001

# Automated: opengym orchestrates your agent
opengym run 101 --agent "python {repo}/my_agent.py --task '{task}' --dir {workspace}"
opengym run all --agent "..." --summary    # run the full gauntlet

7 Dimensions, 250 Challenges

Most benchmarks only test coding. OpenGym tests the infrastructure that makes agents reliable in production.

Dimension	Challenges	What It Tests
Coding	110	Read a task, write/fix code, pass tests
Memory	26	Persist information across killed sessions
Tool Use	27	Discover tools, handle failures, manage rate limits
Planning	26	Multi-step decomposition, scheduling, long-horizon stability
Multi-Agent	22	Coordinate via shared files, concurrency, task splitting
Resilience	23	Recover from crashes, errors, partial failures
Safety	16	Resist injection, enforce boundaries, redact secrets

Coding — 110 challenges

The baseline. Read a task, write/fix code, pass tests. This is what every benchmark measures — OpenGym includes it but goes further.

14 categories: code-fixing, code-writing, debugging, data-processing, refactoring, testing, api-integration, info-retrieval, devops-config, safety, algorithm, text-processing, file-operations, multi-step

Category	Count	Difficulty Range
Code Fixing	10	Easy → Hard
Code Writing	12	Easy → Hard
Debugging	6	Easy → Hard
Data Processing	7	Easy → Hard
Refactoring	5	Easy → Hard
Testing	6	Easy → Hard
API Integration	5	Easy → Hard
Info Retrieval	7	Easy → Hard
DevOps & Config	7	Easy → Hard
Safety (code)	7	Easy → Hard
Algorithm	8	Easy → Hard
Text Processing	6	Easy → Hard
File Operations	6	Easy → Hard
Multi-Step	7	Medium → Hard
Observability	6	Easy → Hard
Determinism	4	Easy → Hard

Memory Persistence — 25 challenges

The key differentiator. Tests whether your agent's memory actually persists across sessions. The CLI kills the agent process between steps and clears context — only files the agent explicitly wrote survive.

Includes: 5 core memory challenges + 20 memory-state infrastructure challenges

ID Range	Focus	Examples
101-105	Core memory	Learn & recall, session rebuild, incremental knowledge, selective memory, knowledge update
128-147	Memory & state infrastructure	Append-only logs, state merge conflicts, schema migration, LRU eviction, write-ahead logging, compaction
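
Since only files the agent explicitly wrote survive a killed session, an agent's memory has to live on disk. The sketch below shows one possible approach, assuming a single JSON file as the store: each update is written to a temporary file and renamed over the target, so a kill in the middle of a write does not leave a half-written store. The function names are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def recall_all(store: Path) -> dict:
    """Load everything remembered so far; empty if nothing was saved."""
    if not store.exists():
        return {}
    return json.loads(store.read_text(encoding="utf-8"))


def remember(store: Path, key: str, value) -> None:
    """Merge one fact into the JSON store, replacing the file atomically."""
    facts = recall_all(store)
    facts[key] = value
    # Write next to the target, flush to disk, then rename over the target.
    fd, tmp = tempfile.mkstemp(dir=store.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(facts, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp, store)
```

A later session that starts with an empty context can call `recall_all` on the same path to rebuild what the earlier session learned.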

Tool Discovery & Use — 25 challenges

Tests whether your agent can discover unfamiliar tools and cope with failures, rate limits, and broken tools.

Includes: 5 core tool challenges + 20 tool robustness challenges

ID Range	Focus	Examples
106-110	Core tool use	Find right tool, chain tools, handle flaky tool, rate limits, undocumented API
166-185	Tool robustness	429 retry-after, malformed JSON recovery, paginated endpoints, deprecated API migration, schema validation
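
As an illustration of the "429 retry-after" style of robustness, the sketch below retries a tool call after waiting for the delay the tool advertises. `RateLimited` is a hypothetical exception standing in for however a real tool wrapper reports HTTP 429; the challenges themselves may expose rate limits differently.

```python
import time


class RateLimited(Exception):
    """Hypothetical error a tool wrapper raises on HTTP 429."""

    def __init__(self, retry_after: float):
        super().__init__(f"rate limited, retry after {retry_after}s")
        self.retry_after = retry_after


def call_with_retry(tool, max_attempts: int = 3, sleep=time.sleep):
    """Call tool(); on RateLimited wait for the advertised delay and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return tool()
        except RateLimited as err:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            sleep(err.retry_after)
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.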

Self-Correction & Resilience — 20 challenges

Tests crash recovery, error handling, atomic operations, and failure recovery.

Includes: 5 core resilience challenges + 15 failure recovery challenges

ID Range	Focus	Examples
111-115	Core resilience	Misleading errors, cascading failures, red herring logs, partial failure
198-212	Failure recovery	Mid-task crash, OOM simulation, disk full, SIGTERM handling, transaction atomicity, checkpoint resume, rollback
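
Checkpoint resume, one of the failure-recovery themes above, can be sketched as follows. This is an illustrative pattern under assumed names, not the behavior any particular challenge requires: progress is recorded after every item so that a restarted process continues where the previous one stopped.

```python
import json
import os
from pathlib import Path


def process_with_checkpoint(items, handle, checkpoint: Path) -> None:
    """Process items in order, recording progress so a restart can resume."""
    done = 0
    if checkpoint.exists():
        done = json.loads(checkpoint.read_text(encoding="utf-8"))["done"]
    for index in range(done, len(items)):
        handle(items[index])
        # Persist progress after each item; os.replace makes the update atomic.
        tmp = checkpoint.with_suffix(".tmp")
        tmp.write_text(json.dumps({"done": index + 1}), encoding="utf-8")
        os.replace(tmp, checkpoint)
```

If the process dies after `handle` returns but before the checkpoint is replaced, that item runs again on resume, so `handle` should be idempotent.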

Safety & Boundaries — 15 challenges

Tests prompt injection resistance, security hardening, and boundary enforcement.

Includes: 5 core safety challenges + 10 security boundary challenges

ID Range	Focus	Examples
116-120	Core safety	Prompt injection, malicious logs, dangerous README, data exfiltration, scope creep
213-222	Security boundaries	Path traversal, env secret leaks, symlink escape, input sanitization, sandbox hardening, safe deserialization
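
Path traversal and symlink escape both come down to the same check: does the final, resolved path still sit inside the allowed directory? A minimal sketch of that check, with a hypothetical helper name, could look like this:

```python
from pathlib import Path


def resolve_inside(root: Path, user_path: str) -> Path:
    """Resolve user_path under root, rejecting anything that escapes it."""
    root = root.resolve()
    # resolve() collapses ".." and follows symlinks before the containment check.
    candidate = (root / user_path).resolve()
    if not candidate.is_relative_to(root):
        raise ValueError(f"path escapes workspace: {user_path}")
    return candidate
```

Checking the resolved path rather than the raw string is the design choice that matters here; a string test for ".." alone would miss absolute paths and symlinks.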

Multi-Agent Coordination — 21 challenges

Tests whether agents can coordinate via shared files, handle concurrency, and split tasks.

Includes: 3 core multi-agent challenges + 18 concurrency & coordination challenges

ID Range	Focus	Examples
121-123	Core multi-agent	Shared config, information asymmetry, task delegation
148-165	Concurrency & coordination	File locking, atomic counters, producer-consumer, leader election, distributed merge, priority queues
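
Coordinating through shared files usually needs some form of mutual exclusion. The sketch below shows one common technique, an exclusive-create lock file; it is an example of the general idea and is not taken from the challenges, which may expect a different locking scheme.

```python
import os
from pathlib import Path


def try_acquire(lock: Path) -> bool:
    """Create the lock file atomically; False means another agent holds it."""
    try:
        # O_CREAT | O_EXCL fails if the file already exists, in one atomic step.
        fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))  # record the owner for debugging
    return True


def release(lock: Path) -> None:
    """Remove the lock file so another agent can acquire it."""
    lock.unlink(missing_ok=True)
```

A lock file left behind by a crashed agent stays in place with this approach, so a production version would also need a staleness rule.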

Multi-Step Planning — 24 challenges

Tests decomposition, scheduling, long-horizon execution, and dependency resolution.

Includes: 4 core planning challenges + 12 scheduling challenges + 8 long-horizon challenges

ID Range	Focus	Examples
124-127	Core planning	Dependency ordering, changing requirements, resource constraints, plan-then-execute
186-197	Scheduling & cron	Fresh vs reuse, config drift, missed schedules, double execution, timezone/DST handling
233-240	Long-horizon stability	10-stage pipeline, config drift detection, state machine execution, dependency resolution, event sourcing, consensus
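
Dependency ordering, the first core planning example above, is a topological sort. A short sketch using the standard library's `graphlib`, where the wrapper function and the step names are made up for illustration:

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return steps ordered so every step comes after its prerequisites."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as err:
        # args[1] holds the list of nodes that form the detected cycle.
        raise ValueError(f"plan has a dependency cycle: {err.args[1]}") from err
```

Each key maps a step to the set of steps it depends on, for example `{"deploy": {"test"}, "test": {"build"}, "build": set()}`.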

Scoring

Every challenge is scored from 0 to 100 based on the tests it passes. Results are grouped by dimension so you can see where your agent's infrastructure breaks down.

============================================================
  OpenGym Score: 68/100
  Passed: 163/250
============================================================

By Dimension:
  coding         [################....] 82/100
  memory         [########............] 40/100
  tool-use       [############........] 60/100
  resilience     [##########..........] 55/100
  safety         [##################..] 90/100
  multi-agent    [######..............] 30/100
  planning       [##########..........] 50/100

Diagnostics:
  - memory (40/100): Your agent cannot persist information across sessions.
    It needs a real memory system — not just context window.
  - multi-agent (30/100): Your agent cannot coordinate with other agents
    via shared resources.

Summary output also includes an Action Plan section with concrete runtime-level remediation steps.
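
The exact aggregation the CLI uses is not spelled out here. As a rough sketch, assuming a challenge's score is the percentage of its tests that passed and a dimension's score is the plain average of its challenges, the grouping could be computed like this (both formulas and all names are assumptions):

```python
from collections import defaultdict


def challenge_score(passed: int, total: int) -> int:
    """Score one challenge 0-100 by the share of tests that passed."""
    return round(100 * passed / total) if total else 0


def by_dimension(results: list[tuple[str, int, int]]) -> dict[str, int]:
    """Average challenge scores within each dimension.

    results holds (dimension, tests_passed, tests_total) per challenge.
    """
    buckets = defaultdict(list)
    for dimension, passed, total in results:
        buckets[dimension].append(challenge_score(passed, total))
    return {name: round(sum(s) / len(s)) for name, s in buckets.items()}
```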

Reliability Runs (When Baseline Is 100/100)

Single-pass scores can saturate for strong agents. Use repeated runs with deterministic chaos to measure stability over time:

# 5 repeated trials, deterministic fault jitter
opengym run 243 --agent "..." --trials 5 --chaos-level light --chaos-seed 42 --summary

# Harder pressure: larger jitter + occasional unsignaled SIGTERM on infra tasks
opengym run all --agent "..." --trials 3 --chaos-level hard --summary

--trials N reports a reliability block: trial pass rate and stable/flaky/broken challenge counts. This is the intended "did my infra get more reliable this week?" signal.
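
One plausible reading of the stable/flaky/broken labels is sketched below: a challenge is stable if it passes every trial, broken if it passes none, and flaky otherwise. These definitions are an assumption made for illustration; the CLI's own thresholds may differ.

```python
from collections import Counter


def classify(trial_results: list[bool]) -> str:
    """Label one challenge from its per-trial pass/fail outcomes."""
    if all(trial_results):
        return "stable"
    if not any(trial_results):
        return "broken"
    return "flaky"


def reliability_block(runs: dict[str, list[bool]]) -> dict:
    """Summarize trial pass rate and stable/flaky/broken counts.

    runs maps a challenge ID to its outcomes; it must contain at least one trial.
    """
    outcomes = [ok for results in runs.values() for ok in results]
    counts = Counter(classify(results) for results in runs.values())
    return {
        "trial_pass_rate": sum(outcomes) / len(outcomes),
        "stable": counts["stable"],
        "flaky": counts["flaky"],
        "broken": counts["broken"],
    }
```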

Benchmark Profiles

Use predefined profiles when you want a consistent run target without managing IDs:

opengym run all --profile infra-smoke --agent "..." --summary   # 12 infra reps (fast)
opengym run all --profile infra-weekly --agent "..." --summary  # 60 harder infra cases
opengym run all --profile infra-hard --agent "..." --summary    # all hard infra challenges
opengym run all --profile infra-nightly --agent "..." --summary # full infra conformance set
opengym run all --profile safety-gate --agent "..." --summary   # safety/resilience gate

Weekly Reliability Diff

Save reports and compare week-over-week:

# Baseline (e.g., last week)
opengym run all --profile infra-weekly --agent "..." --trials 3 --chaos-level light --chaos-seed 42 --summary --save-report reports/week1.json

# Current run
opengym run all --profile infra-weekly --agent "..." --trials 3 --chaos-level light --chaos-seed 42 --summary --save-report reports/week2.json

# Compare regressions/improvements
opengym compare reports/week1.json reports/week2.json
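
The report format is not documented in this README, so the following only illustrates the idea of a week-over-week diff. It assumes each report has already been reduced to a hypothetical mapping of challenge ID to score.

```python
def diff_scores(before: dict[str, int], after: dict[str, int]) -> dict[str, list[str]]:
    """List challenge IDs whose score dropped or rose between two runs."""
    # Only challenges present in both runs can be compared.
    shared = sorted(before.keys() & after.keys())
    return {
        "regressions": [cid for cid in shared if after[cid] < before[cid]],
        "improvements": [cid for cid in shared if after[cid] > before[cid]],
    }
```

Keeping the chaos seed identical across both runs, as in the commands above, is what makes such a diff reflect agent changes rather than different injected faults.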

CLI Reference

# List and filter
opengym list                              # List all 250 challenges
opengym list --dimension memory           # Filter by dimension
opengym list --category algorithm         # Filter by category
opengym list --difficulty hard            # Filter by difficulty
opengym list --json-output                # Machine-readable

# Fetch challenges
opengym fetch 001                         # Fetch one challenge
opengym fetch all                         # Fetch everything
opengym init-key                          # Create ~/.opengym/test_key for private fixtures

# Score manually (MODEL_DEPENDENT challenges only)
opengym score 001                         # Score one model-dependent challenge
opengym score all --summary               # Includes blocked infra entries unless run via `opengym run`
opengym score all --scorecard             # Scorecard view of available results
opengym score all --json-output           # JSON output
opengym score all --csv-output            # CSV for spreadsheets

# Run agent automatically (including multi-session orchestration)
opengym run 101 --agent "python {repo}/examples/agents/openai_agent.py --task '{task}' --dir {workspace}"
opengym run all --agent "..." --summary   # Full gauntlet
opengym run all --agent "..." --scorecard # Infra scorecard
opengym run all --agent "..." --parallel 4 # 4 workers
opengym run all --agent "..." --enforce-scope # fail on writes outside setup/
opengym run all --agent "..." --fresh-infra-workspace # reset infra workspaces before each run
opengym run all --agent "..." --trials 5 --chaos-level light --chaos-seed 42 # reliability/stability run
opengym run all --profile infra-smoke --agent "..." --summary # predefined profile
opengym run all --agent "..." --save-report reports/run.json # persist machine-readable report

# Compare two saved reports
opengym compare reports/week1.json reports/week2.json

opengym run --agent supports placeholders: {task}, {workspace}, {task_content}, {repo}.
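
How the CLI expands and executes the template internally is not shown here. As a sketch of the general idea, assuming plain textual substitution followed by shell-style splitting, a template could be turned into an argument list like this (the function name is hypothetical):

```python
import shlex


def render_agent_command(template: str, values: dict[str, str]) -> list[str]:
    """Substitute {placeholders} and split the result into argv tokens."""
    command = template
    for name, value in values.items():
        command = command.replace("{" + name + "}", value)
    # shlex.split honors the single quotes around '{task}' in the template.
    return shlex.split(command)
```

With this naive substitution, a task string that itself contains a single quote would break the quoting, which is one reason a file-based handoff through {workspace} is the safer channel for long task text.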

Infra Scorecard

The --scorecard flag produces an infrastructure-focused breakdown showing exactly where your agent's orchestration fails:

================================================================
  INFRA SCORECARD
================================================================
  Infra Conformance:  62/100  (87/140 passed)
  Model-Dependent:    74/100  (74/100 passed)
  Overall:            67/100
================================================================

  Category Breakdown:
  ────────────────────────────────────────────────────────────
    Memory Integrity             [################....] 80/100  16/20  WARN
    Concurrency Safety           [############........] 61/100  11/18  WARN
    Tool Robustness              [##########..........] 55/100  11/20  WARN
    Crash Recovery               [########............] 40/100   6/15  FAIL
    Security Boundaries          [######..............] 30/100   3/10  FAIL
    Long-Horizon Stability       [##..................] 12/100   1/8   FAIL

Each category maps to a specific infrastructure capability. FAIL/WARN/PASS tells you at a glance what needs work.

Export Results

# JSON output for CI pipelines, dashboards, or sharing
opengym score all --json-output > results.json
opengym score all --scorecard --json-output > scorecard.json
opengym score all --csv-output > results.csv
opengym run all --profile infra-weekly --agent "..." --trials 3 --chaos-level light --save-report reports/weekly.json
opengym compare reports/weekly_prev.json reports/weekly.json --json-output > diff.json

What OpenGym Is NOT

Not an RL gym. No environments, no reward signals, no training loops.
Not an LLM benchmark. We don't measure raw model quality (MMLU, HumanEval, etc.).
It's an agent infrastructure test. Does your agent's memory, tool use, error handling, and safety actually work end-to-end?

Fair Use & Anti-Cheat

Test files are excluded from the workspace. When you opengym fetch a challenge, the tests/ directory is not copied to your workspace. Your agent cannot read test files to reverse-engineer answers.

Scoring uses an isolated temporary staging workspace with canonical hidden tests. Tests are not injected into your live challenge workspace.

For opengym run on multi-session challenges, both tests/ and steps/ are excluded — your agent only sees the current step, not future ones.

Infra Challenges (Run-Only)

Hard ≠ infra. Difficulty alone does not make a task infrastructure-focused.

All challenges marked challenge_type: INFRA_CONFORMANCE must be executed with opengym run. Direct opengym score is blocked for these challenges because infra verification depends on orchestration behavior (fault injection, process restarts, scope enforcement).

# Model-dependent challenge: fetch + solve + score
opengym fetch 167 && opengym score 167

# Infra challenge: must use run
opengym run 243 --agent "python {repo}/my_agent.py --task '{task}' --dir {workspace}"

Why? A bare LLM (or a human pre-writing output files) can bypass many static output checks. opengym run validates infrastructure behavior under orchestration: step boundaries, fault injection, workspace resets, and runtime policy checks.

Challenge Type	Direct `opengym score`	`opengym run`
`MODEL_DEPENDENT`	Allowed	Allowed
`INFRA_CONFORMANCE`	Blocked	Required
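
The table above amounts to a simple gate on the challenge_type field. A sketch of that gate, assuming the challenge metadata has been loaded into a dictionary and that a missing field is treated as model-dependent (both are assumptions):

```python
RUN_ONLY_TYPES = {"INFRA_CONFORMANCE"}


def can_score_directly(metadata: dict) -> bool:
    """True if the challenge may be scored without `opengym run`."""
    return metadata.get("challenge_type") not in RUN_ONLY_TYPES
```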

Safety

All challenges run locally on your machine. No network calls are made by the CLI. Agent code executes in your normal environment — if you're running untrusted agents, use a sandbox (Docker, VM, etc.). The CLI never sends data anywhere.

Test Your Agent

See docs/AGENT_GUIDE.md for copy-paste examples with Claude Code, OpenAI, LangChain, CrewAI, and custom agents.

Create Challenges

See docs/CHALLENGE_SPEC.md for the challenge format.

Tech Stack

Python 3.10+ / click / pytest / YAML / JSON

License

MIT
