This repository is the official code release for
ResearchEnvBench: Benchmarking Agents on Environment Synthesis for Research Code Execution.
It evaluates whether coding agents can bootstrap runnable ML/HPC research environments from raw repositories.
- 44 pinned research repositories (via manifest + run matrices).
- A host orchestrator that runs jobs in isolated Docker containers.
- A strict runtime verification pipeline (C0–C5):
  - C0: static missing-import check
  - C1: CPU entrypoint run
  - C2: CUDA alignment
  - C3: single-GPU execution
  - C4: multi-GPU/DDP execution (when applicable)
  - C5: report hallucination audit
- Multi-backend agent support: `codex`, `claude_code`, `nexau`, `nexau_deepseek31_nexn1`, `nexau_gemini30`, `nexau_claude_sonnet45`, `nexau_minimax25`.
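The C0 stage above (static missing-import check) admits a compact illustration. The sketch below is hypothetical and is not the harness's probe code; it only shows the idea of statically collecting top-level imports and checking whether they resolve in the current environment:

```python
import ast
import importlib.util

def missing_top_level_imports(source: str) -> list[str]:
    """C0-style static check: collect top-level imported module names
    that cannot be resolved in the current environment."""
    tree = ast.parse(source)
    missing = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module.split(".")[0]]
        else:
            continue
        for name in names:
            if name not in missing and importlib.util.find_spec(name) is None:
                missing.append(name)
    return missing

# Stdlib imports resolve; an uninstalled package is reported as missing.
print(missing_top_level_imports("import json\nimport not_installed_pkg_xyz"))
```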
Paper baselines and recommended model IDs:

- `codex`: `gpt-5.1-codex`
- `claude_code`: `glm-4.7`
- `nexau`: `nex-agi/deepseek-v3.1-nex-1` (or your equivalent DeepSeek-V3.1-Nex-N1 endpoint/model ID)
- `host_orchestrator.py`: top-level runner for full benchmark jobs.
- `m2/run_one_job.py`: one-job lifecycle (container start, agent run, benchmark run, summary).
- `scripts/<owner>@<repo>/benchmark_scripts/`: fixed per-repo runtime probes.
- `m1_repo_manifest_module/manifests/`: pinned repo manifests + run matrices.
- `m5/build_master_table.py`: aggregate run outputs into CSV/XLSX summaries.
- `tools/env_setup_runner/`: backend runner + report contract enforcement.
- `scripts_repos_test_categories.csv`: stage applicability (C1/C3/C4 denominators).
The following smoke test checks orchestration and benchmark wiring without requiring model/API credentials:
```
python host_orchestrator.py \
  --run-matrix m1_repo_manifest_module/manifests/run_matrix_codex.jsonl \
  --image researchenvbench:ultimate \
  --limit 1 \
  --skip-agent \
  --build-master-table \
  --run-id smoke_skip_agent
```

Outputs are written to `results/smoke_skip_agent/`.
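After a smoke run, it can be useful to confirm the expected output paths exist before moving on to full runs. A minimal check (hypothetical helper, not part of the harness) against the output layout used here:

```python
from pathlib import Path

def check_run_outputs(run_id: str, root: str = "results") -> list[str]:
    """Return the expected run outputs that are missing on disk."""
    base = Path(root) / run_id
    expected = [base / "jobs", base / "master_summary.csv"]
    return [str(p) for p in expected if not p.exists()]

# Reports both paths as missing until the orchestrator has produced them.
print(check_run_outputs("smoke_skip_agent"))
```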
For reproducibility details aligned to the paper (hardware assumptions, backend setup, exact commands, output checks), see `docs/REPRODUCIBILITY.md`.
Pass credentials via `--secrets-env-file <your_env_file>`.
Recommended bootstrap:
```
cp secrets/codex.env.example secrets/codex.env
cp secrets/claude_code.env.example secrets/claude_code.env
cp secrets/nexau.env.example secrets/nexau.env
```

Two supported auth modes:

- Session auth (default): pre-login on the host; M2 then mounts `~/.codex/` and `~/.config/codex`.
- Official OpenAI API mode (recommended): start from `secrets/codex.env.example`:
```
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_API_KEY=your_openai_api_key
CODEX_MODEL=gpt-5.1-codex
```

Backward-compatible aliases are supported for dev-era env files:

```
CODEX_BASE_URL=https://your_compatible_endpoint
CODEX_API_KEY=your_codex_api_key
# optional fallback if CODEX_API_KEY is absent:
# CODEX_TOKEN=your_codex_api_key
```

The runner maps `CODEX_BASE_URL`/`CODEX_API_KEY`/`CODEX_TOKEN` to `OPENAI_BASE_URL`/`OPENAI_API_KEY`.
You can also set the model from the CLI: `--codex-model gpt-5.1-codex`.
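The alias mapping can be sketched as follows; this is a simplified illustration of the precedence (explicit `OPENAI_*` values win, `CODEX_TOKEN` only as a fallback), not the runner's actual code:

```python
def resolve_openai_env(env: dict[str, str]) -> dict[str, str]:
    """Map dev-era CODEX_* aliases onto the OPENAI_* variables.
    Explicit OPENAI_* values take precedence; CODEX_TOKEN is only
    used when CODEX_API_KEY is absent."""
    resolved = dict(env)
    if "OPENAI_BASE_URL" not in resolved and "CODEX_BASE_URL" in resolved:
        resolved["OPENAI_BASE_URL"] = resolved["CODEX_BASE_URL"]
    if "OPENAI_API_KEY" not in resolved:
        key = resolved.get("CODEX_API_KEY") or resolved.get("CODEX_TOKEN")
        if key:
            resolved["OPENAI_API_KEY"] = key
    return resolved

env = {"CODEX_BASE_URL": "https://your_compatible_endpoint", "CODEX_TOKEN": "tok"}
print(resolve_openai_env(env)["OPENAI_API_KEY"])  # falls back to CODEX_TOKEN
```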
Env file example (start from `secrets/claude_code.env.example`):

```
ANTHROPIC_BASE_URL=https://your-anthropic-compatible-endpoint
ANTHROPIC_API_KEY=your_claude_api_key
CLAUDE_MODEL=glm-4.7
# optional alternative:
# ANTHROPIC_AUTH_TOKEN=your_claude_api_key
```

If `ANTHROPIC_BASE_URL` is not set, the runner defaults to:

```
ANTHROPIC_BASE_URL=https://open.bigmodel.cn/api/anthropic
```

You can also set the model from the CLI: `--claude-model glm-4.7`.
Use one of these templates and fill values:
- `secrets/nexau.env.example`
- `secrets/nex_deepseek31_nexn1.env.example`
- `secrets/nex_gemini30.env.example`
- `secrets/nex_claude_sonnet45.env.example`
- `secrets/nex_minimax25.env.example`
Core env fields:
```
LLM_MODEL=your_model_name
LLM_BASE_URL=https://your-openai-or-anthropic-compatible-endpoint
LLM_API_KEY=your_api_key
LLM_API_TYPE=anthropic_chat_completion
LLM_TOOL_CALL_MODE=anthropic
```

Config-level control (important for reproducibility):

- `nexau` defaults to the official NexAU DeepSeek config: `/opt/nexau/env_setup_config/deepseek-v3.1-nex-n1.yaml`
- `nexau_deepseek31_nexn1` defaults to the harness DeepSeek config: `tools/env_setup_runner/nexau_configs/nexau_deepseek31_nexn1.yaml`
- Any NexAU backend can override the YAML via the env file: `NEXAU_AGENT_CONFIG=/path/to/your_config.yaml`
- If your endpoint is OpenAI-compatible, set `LLM_API_TYPE=openai_chat_completion` and `LLM_TOOL_CALL_MODE=openai`.

Also supported for tracing/summarization: `SUMMARY_*`, `EXTRACT_*`, and `LANGFUSE_*` variables.
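Before launching a long run, it can be worth sanity-checking that a filled env file actually defines the core fields. A minimal checker (hypothetical helper, not shipped with the harness):

```python
REQUIRED = ["LLM_MODEL", "LLM_BASE_URL", "LLM_API_KEY",
            "LLM_API_TYPE", "LLM_TOOL_CALL_MODE"]

def parse_env_file(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, ignoring blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

def missing_fields(text: str) -> list[str]:
    """Return required fields that are absent or empty."""
    env = parse_env_file(text)
    return [k for k in REQUIRED if not env.get(k)]

sample = "LLM_MODEL=my-model\nLLM_BASE_URL=https://endpoint\n# note\n"
print(missing_fields(sample))  # ['LLM_API_KEY', 'LLM_API_TYPE', 'LLM_TOOL_CALL_MODE']
```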
```
python host_orchestrator.py \
  --run-matrix m1_repo_manifest_module/manifests/run_matrix_codex.jsonl \
  --image researchenvbench:ultimate \
  --secrets-env-file secrets/codex.env \
  --codex-model gpt-5.1-codex \
  --network host \
  --build-master-table \
  --run-id paper_codex
```

Use the included matrix:

`m1_repo_manifest_module/manifests/run_matrix_claude_code.jsonl`

If you want to regenerate it from the codex matrix (this also regenerates `job_id` safely):
```
python tools/matrix/rewrite_baseline_matrix.py \
  --source m1_repo_manifest_module/manifests/run_matrix_codex.jsonl \
  --baseline claude_code \
  --out m1_repo_manifest_module/manifests/run_matrix_claude_code.jsonl
```

Then run:
```
python host_orchestrator.py \
  --run-matrix m1_repo_manifest_module/manifests/run_matrix_claude_code.jsonl \
  --image researchenvbench:ultimate \
  --secrets-env-file secrets/claude_code.env \
  --claude-model glm-4.7 \
  --network host \
  --build-master-table \
  --run-id paper_claude_code
```

For the `nexau` baseline:

```
python host_orchestrator.py \
  --run-matrix m1_repo_manifest_module/manifests/run_matrix.jsonl \
  --image researchenvbench:ultimate \
  --secrets-env-file secrets/nexau.env \
  --network host \
  --build-master-table \
  --run-id paper_nexau
```

Use the included matrix + env template:
```
cp secrets/nex_deepseek31_nexn1.env.example secrets/nex_deepseek31_nexn1.env

python host_orchestrator.py \
  --run-matrix m1_repo_manifest_module/manifests/run_matrix_nexau_deepseek31_nexn1.jsonl \
  --image researchenvbench:ultimate \
  --baseline nexau_deepseek31_nexn1 \
  --secrets-env-file secrets/nex_deepseek31_nexn1.env \
  --network host \
  --build-master-table \
  --run-id paper_nexau_deepseek31_nexn1
```

For smoke validation, use:

`m1_repo_manifest_module/manifests/run_matrix_smoke_nexau_deepseek31_nexn1.jsonl`
For public/provider-flexible reproduction, you can use template NexAU backends (nexau_deepseek31_nexn1, nexau_gemini30, nexau_claude_sonnet45, nexau_minimax25) with env templates in secrets/*.env.example.
Reuse an existing `nexau_*` backend and only change env file values:

- Set `LLM_MODEL`/`LLM_BASE_URL`/`LLM_API_KEY`/`LLM_API_TYPE`/`LLM_TOOL_CALL_MODE`.
- Run with one existing matrix, for example `run_matrix_nexau_gemini30.jsonl`.
- Keep `--secrets-env-file` pointing to your filled env file.
- Add a new backend key (for example `nexau_my_model`) in `tools/env_setup_runner/runners.json`. Copy the `nexau_gemini30` backend (generic env-driven config) or `nexau_deepseek31_nexn1` (DeepSeek-oriented config).
- Optional: set `NEXAU_AGENT_CONFIG` in the env file to your own YAML path (official or custom); no code change required.
- Generate a new matrix with your baseline name:

```
python tools/matrix/rewrite_baseline_matrix.py \
  --source m1_repo_manifest_module/manifests/run_matrix.jsonl \
  --baseline nexau_my_model \
  --out m1_repo_manifest_module/manifests/run_matrix_nexau_my_model.jsonl
```

- Run:
```
python host_orchestrator.py \
  --run-matrix m1_repo_manifest_module/manifests/run_matrix_nexau_my_model.jsonl \
  --image researchenvbench:ultimate \
  --secrets-env-file /path/to/nexau_my_model.env \
  --network host \
  --build-master-table \
  --run-id paper_nexau_my_model
```

Tip: use the `nexau_` prefix for custom names so M2 automatically uses the NexAU report-mode defaults.
After each run:
- Job-level outputs: `results/<run_id>/jobs/<job_id>/...`
- Run-level summary: `results/<run_id>/master_summary.csv`
- Full table: `results/<run_id>/master_table.csv`
Important denominator checks (paper setting):
- C1 denominator: 29
- C2 denominator: 44
- C3 denominator: 43
- C4 denominator: 32
- C0 total denominator: 2858
These are enforced by tests and fixed benchmark metadata.
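Given these fixed denominators, stage pass rates should be computed against each stage's applicability set, not against however many rows happen to appear in a summary file. A small sketch (the denominators mirror the paper setting; the CSV column names here are hypothetical, not the harness's actual schema):

```python
import csv
import io

DENOMINATORS = {"C1": 29, "C2": 44, "C3": 43, "C4": 32}

def stage_pass_rates(csv_text: str) -> dict[str, float]:
    """Count rows with <stage>_pass == '1' and divide by the fixed
    per-stage denominator rather than len(rows)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return {
        stage: sum(1 for r in rows if r.get(f"{stage}_pass") == "1") / denom
        for stage, denom in DENOMINATORS.items()
    }

sample = "repo,C1_pass,C2_pass,C3_pass,C4_pass\nr1,1,1,0,\nr2,,1,1,1\n"
print(stage_pass_rates(sample)["C2"])  # 2 passing rows over the fixed 44
```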
- Default target runtime in the paper is Ubuntu 22.04 + CUDA 12.4 + 2x RTX 4090.
- Paper driver reference is `550.163.01`.
- `dockerfiles/ultimate.Dockerfile` uses the official base image `nvidia/cuda:12.4.1-devel-ubuntu22.04`.
- Benchmarks are designed around no modification of tracked repo source files during agent setup.
- For backend-specific runner details, see `tools/env_setup_runner/README.md`.
If you use this repository, please cite:
```bibtex
@misc{wang2026researchenvbenchbenchmarkingagentsenvironment,
  title={ResearchEnvBench: Benchmarking Agents on Environment Synthesis for Research Code Execution},
  author={Yubang Wang and Chenxi Zhang and Bowen Chen and Zezheng Huai and Zihao Dai and Xinchi Chen and Yuxin Wang and Yining Zheng and Jingjing Gong and Xipeng Qiu},
  year={2026},
  eprint={2603.06739},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2603.06739},
}
```