Skip to content

widingmarcus-cyber/opengym

OpenGym

PyPI License: MIT Challenges Python 3.10+

250 challenges to test if your AI agent actually works — not just the model, but the infrastructure.

OpenGym is an open-source benchmark that evaluates AI agents across 7 capability dimensions: coding, memory persistence, tool discovery, multi-step planning, self-correction, safety boundaries, and multi-agent coordination. Unlike benchmarks that only test "can the model solve this?", OpenGym tests "does the agent system work reliably?"

Quickstart

git clone https://github.com/widingmarcus-cyber/opengym && cd opengym
pip install -e .
opengym fetch 001              # grab a challenge
opengym score 001              # score it (0/100 — your agent hasn't solved it yet)

Then point your agent at it:

# Automated: opengym runs your agent and scores the result
opengym run 001 --agent "python {repo}/examples/agents/openai_agent.py --task '{task}' --dir {workspace}"

# Run all 250 challenges
opengym run all --agent "..." --summary

# Fast infra check for any agent stack (12 challenge smoke profile)
opengym run all --profile infra-smoke --agent "..." --summary

Requires: Python 3.10+. No Docker needed. See examples/agents/ for ready-made OpenAI, Anthropic, OpenClaw, and dummy agent adapters.

OpenClaw users can run the bundled adapter directly:

opengym run all --agent "python {repo}/examples/agents/openclaw_agent.py --task '{task}' --dir {workspace}" --summary

How It Works

Each challenge is a self-contained folder. Your agent reads the task, does the work, and the CLI scores it.

101-learn-and-recall/
├── README.md        ← Agent reads this
├── setup/           ← Agent edits these files
├── steps/           ← Multi-session task steps (if applicable)
├── tools/           ← Executable tools (if applicable)
├── tests/           ← Hidden verification (agent doesn't touch)
└── metadata.yaml

Two workflows:

# Manual: fetch, let your agent work, score
opengym fetch 001
# ... your agent solves it ...
opengym score 001

# Automated: opengym orchestrates your agent
opengym run 101 --agent "python {repo}/my_agent.py --task '{task}' --dir {workspace}"
opengym run all --agent "..." --summary    # run the full gauntlet

7 Dimensions, 250 Challenges

Most benchmarks only test coding. OpenGym tests the infrastructure that makes agents reliable in production.

Dimension Challenges What It Tests
Coding 110 Read a task, write/fix code, pass tests
Memory 26 Persist information across killed sessions
Tool Use 27 Discover tools, handle failures, manage rate limits
Planning 26 Multi-step decomposition, scheduling, long-horizon stability
Multi-Agent 22 Coordinate via shared files, concurrency, task splitting
Resilience 23 Recover from crashes, errors, partial failures
Safety 16 Resist injection, enforce boundaries, redact secrets

Coding — 110 challenges

The baseline. Read a task, write/fix code, pass tests. This is what every benchmark measures — OpenGym includes it but goes further.

14 categories: code-fixing, code-writing, debugging, data-processing, refactoring, testing, api-integration, info-retrieval, devops-config, safety, algorithm, text-processing, file-operations, multi-step
Category Count Difficulty Range
Code Fixing 10 Easy → Hard
Code Writing 12 Easy → Hard
Debugging 6 Easy → Hard
Data Processing 7 Easy → Hard
Refactoring 5 Easy → Hard
Testing 6 Easy → Hard
API Integration 5 Easy → Hard
Info Retrieval 7 Easy → Hard
DevOps & Config 7 Easy → Hard
Safety (code) 7 Easy → Hard
Algorithm 8 Easy → Hard
Text Processing 6 Easy → Hard
File Operations 6 Easy → Hard
Multi-Step 7 Medium → Hard
Observability 6 Easy → Hard
Determinism 4 Easy → Hard

Memory Persistence — 25 challenges

The key differentiator. Tests whether your agent's memory actually persists across sessions. The CLI kills the agent process between steps and clears context — only files the agent explicitly wrote survive.

Includes: 5 core memory challenges + 20 memory-state infrastructure challenges
ID Range Focus Examples
101-105 Core memory Learn & recall, session rebuild, incremental knowledge, selective memory, knowledge update
128-147 Memory & state infrastructure Append-only logs, state merge conflicts, schema migration, LRU eviction, write-ahead logging, compaction

Tool Discovery & Use — 25 challenges

Tests whether your agent can discover unfamiliar tools, handle failures, rate limits, and broken tools.

Includes: 5 core tool challenges + 20 tool robustness challenges
ID Range Focus Examples
106-110 Core tool use Find right tool, chain tools, handle flaky tool, rate limits, undocumented API
166-185 Tool robustness 429 retry-after, malformed JSON recovery, paginated endpoints, deprecated API migration, schema validation

Self-Correction & Resilience — 20 challenges

Tests crash recovery, error handling, atomic operations, and failure recovery.

Includes: 5 core resilience challenges + 15 failure recovery challenges
ID Range Focus Examples
111-115 Core resilience Misleading errors, cascading failures, red herring logs, partial failure
198-212 Failure recovery Mid-task crash, OOM simulation, disk full, SIGTERM handling, transaction atomicity, checkpoint resume, rollback

Safety & Boundaries — 15 challenges

Tests prompt injection resistance, security hardening, and boundary enforcement.

Includes: 5 core safety challenges + 10 security boundary challenges
ID Range Focus Examples
116-120 Core safety Prompt injection, malicious logs, dangerous README, data exfiltration, scope creep
213-222 Security boundaries Path traversal, env secret leaks, symlink escape, input sanitization, sandbox hardening, safe deserialization

Multi-Agent Coordination — 21 challenges

Tests whether agents can coordinate via shared files, handle concurrency, and split tasks.

Includes: 3 core multi-agent challenges + 18 concurrency & coordination challenges
ID Range Focus Examples
121-123 Core multi-agent Shared config, information asymmetry, task delegation
148-165 Concurrency & coordination File locking, atomic counters, producer-consumer, leader election, distributed merge, priority queues

Multi-Step Planning — 24 challenges

Tests decomposition, scheduling, long-horizon execution, and dependency resolution.

Includes: 4 core planning challenges + 12 scheduling challenges + 8 long-horizon challenges
ID Range Focus Examples
124-127 Core planning Dependency ordering, changing requirements, resource constraints, plan-then-execute
186-197 Scheduling & cron Fresh vs reuse, config drift, missed schedules, double execution, timezone/DST handling
233-240 Long-horizon stability 10-stage pipeline, config drift detection, state machine execution, dependency resolution, event sourcing, consensus

Scoring

Every challenge scores 0-100 based on tests passed. Results are grouped by dimension so you see where your agent's infrastructure breaks down.

============================================================
  OpenGym Score: 68/100
  Passed: 163/250
============================================================

By Dimension:
  coding         [################....] 82/100
  memory         [########............] 40/100
  tool-use       [############........] 60/100
  resilience     [##########..........] 55/100
  safety         [##################..] 90/100
  multi-agent    [######..............] 30/100
  planning       [##########..........] 50/100

Diagnostics:
  - memory (40/100): Your agent cannot persist information across sessions.
    It needs a real memory system — not just context window.
  - multi-agent (30/100): Your agent cannot coordinate with other agents
    via shared resources.

Summary output also includes an Action Plan section with concrete runtime-level remediation steps.

Reliability Runs (When Baseline Is 100/100)

Single-pass scores can saturate for strong agents. Use repeated runs with deterministic chaos to measure stability over time:

# 5 repeated trials, deterministic fault jitter
opengym run 243 --agent "..." --trials 5 --chaos-level light --chaos-seed 42 --summary

# Harder pressure: larger jitter + occasional unsignaled SIGTERM on infra tasks
opengym run all --agent "..." --trials 3 --chaos-level hard --summary

--trials N reports a reliability block: trial pass rate and stable/flaky/broken challenge counts. This is the intended "did my infra get more reliable this week?" signal.

Benchmark Profiles

Use predefined profiles when you want a consistent run target without managing IDs:

opengym run all --profile infra-smoke --agent "..." --summary   # 12 infra reps (fast)
opengym run all --profile infra-weekly --agent "..." --summary  # 60 harder infra cases
opengym run all --profile infra-hard --agent "..." --summary    # all hard infra challenges
opengym run all --profile infra-nightly --agent "..." --summary # full infra conformance set
opengym run all --profile safety-gate --agent "..." --summary   # safety/resilience gate

Weekly Reliability Diff

Save reports and compare week-over-week:

# Baseline (e.g., last week)
opengym run all --profile infra-weekly --agent "..." --trials 3 --chaos-level light --chaos-seed 42 --summary --save-report reports/week1.json

# Current run
opengym run all --profile infra-weekly --agent "..." --trials 3 --chaos-level light --chaos-seed 42 --summary --save-report reports/week2.json

# Compare regressions/improvements
opengym compare reports/week1.json reports/week2.json

CLI Reference

# List and filter
opengym list                              # List all 250 challenges
opengym list --dimension memory           # Filter by dimension
opengym list --category algorithm         # Filter by category
opengym list --difficulty hard            # Filter by difficulty
opengym list --json-output                # Machine-readable

# Fetch challenges
opengym fetch 001                         # Fetch one challenge
opengym fetch all                         # Fetch everything
opengym init-key                          # Create ~/.opengym/test_key for private fixtures

# Score manually (MODEL_DEPENDENT challenges only)
opengym score 001                         # Score one model-dependent challenge
opengym score all --summary               # Includes blocked infra entries unless run via `opengym run`
opengym score all --scorecard             # Scorecard view of available results
opengym score all --json-output           # JSON output
opengym score all --csv-output            # CSV for spreadsheets

# Run agent automatically (including multi-session orchestration)
opengym run 101 --agent "python {repo}/examples/agents/openai_agent.py --task '{task}' --dir {workspace}"
opengym run all --agent "..." --summary   # Full gauntlet
opengym run all --agent "..." --scorecard # Infra scorecard
opengym run all --agent "..." --parallel 4 # 4 workers
opengym run all --agent "..." --enforce-scope # fail on writes outside setup/
opengym run all --agent "..." --fresh-infra-workspace # reset infra workspaces before each run
opengym run all --agent "..." --trials 5 --chaos-level light --chaos-seed 42 # reliability/stability run
opengym run all --profile infra-smoke --agent "..." --summary # predefined profile
opengym run all --agent "..." --save-report reports/run.json # persist machine-readable report

# Compare two saved reports
opengym compare reports/week1.json reports/week2.json

opengym run --agent supports placeholders: {task}, {workspace}, {task_content}, {repo}.

Infra Scorecard

The --scorecard flag produces an infrastructure-focused breakdown showing exactly where your agent's orchestration fails:

================================================================
  INFRA SCORECARD
================================================================
  Infra Conformance:  62/100  (87/140 passed)
  Model-Dependent:    74/100  (74/100 passed)
  Overall:            67/100
================================================================

  Category Breakdown:
  ────────────────────────────────────────────────────────────
    Memory Integrity             [################....] 80/100  16/20  WARN
    Concurrency Safety           [############........] 61/100  11/18  WARN
    Tool Robustness              [##########..........] 55/100  11/20  WARN
    Crash Recovery               [########............] 40/100   6/15  FAIL
    Security Boundaries          [######..............] 30/100   3/10  FAIL
    Long-Horizon Stability       [##..................] 12/100   1/8   FAIL

Each category maps to a specific infrastructure capability. FAIL/WARN/PASS tells you at a glance what needs work.

Export Results

# JSON output for CI pipelines, dashboards, or sharing
opengym score all --json-output > results.json
opengym score all --scorecard --json-output > scorecard.json
opengym score all --csv-output > results.csv
opengym run all --profile infra-weekly --agent "..." --trials 3 --chaos-level light --save-report reports/weekly.json
opengym compare reports/weekly_prev.json reports/weekly.json --json-output > diff.json

What OpenGym Is NOT

  • Not an RL gym. No environments, no reward signals, no training loops.
  • Not an LLM benchmark. We don't measure raw model quality (MMLU, HumanEval, etc.).
  • It's an agent infrastructure test. Does your agent's memory, tool use, error handling, and safety actually work end-to-end?

Fair Use & Anti-Cheat

Test files are excluded from the workspace. When you opengym fetch a challenge, the tests/ directory is not copied to your workspace. Your agent cannot read test files to reverse-engineer answers.

Scoring uses an isolated temporary staging workspace with canonical hidden tests. Tests are not injected into your live challenge workspace.

For opengym run on multi-session challenges, both tests/ and steps/ are excluded — your agent only sees the current step, not future ones.

Infra Challenges (Run-Only)

Hard ≠ infra. Difficulty alone does not make a task infrastructure-focused.

All challenges marked challenge_type: INFRA_CONFORMANCE must be executed with opengym run. Direct opengym score is blocked for these challenges because infra verification depends on orchestration behavior (fault injection, process restarts, scope enforcement).

# Model-dependent challenge: fetch + solve + score
opengym fetch 167 && opengym score 167

# Infra challenge: must use run
opengym run 243 --agent "python {repo}/my_agent.py --task '{task}' --dir {workspace}"

Why? A bare LLM (or a human pre-writing output files) can bypass many static output checks. opengym run validates infrastructure behavior under orchestration: step boundaries, fault injection, workspace resets, and runtime policy checks.

Challenge Type Direct opengym score opengym run
MODEL_DEPENDENT Allowed Allowed
INFRA_CONFORMANCE Blocked Required

Safety

All challenges run locally on your machine. No network calls are made by the CLI. Agent code executes in your normal environment — if you're running untrusted agents, use a sandbox (Docker, VM, etc.). The CLI never sends data anywhere.

Test Your Agent

See docs/AGENT_GUIDE.md for copy-paste examples with Claude Code, OpenAI, LangChain, CrewAI, and custom agents.

Create Challenges

See docs/CHALLENGE_SPEC.md for the challenge format.

Tech Stack

Python 3.10+ / click / pytest / YAML / JSON

License

MIT

About

250 challenges to test if your AI agent actually works — not just the model, but the infrastructure.

Topics

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages