
Add claude code agent benchmarking #38

Open
rajathsalegame wants to merge 4 commits into NVlabs:main from rajathsalegame:rds/claude-code

rajathsalegame commented Feb 12, 2026

Add Claude Code agent example for CVDP benchmark

Adds a containerized agent that uses Claude Code to solve CVDP hardware verification challenges in a fully agentic workflow. Default model is Claude Sonnet 4.5.

  • Runs as a non-root user inside the container, because Claude Code refuses the --dangerously-skip-permissions flag when run as root.
  • Tasks are passed via stdin to avoid command-line length limits.
  • The optional CLAUDE_CODE_MAX_TURNS variable injects turn-budgeting instructions into the prompt for fair cross-agent comparison.
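The turn-budgeting behavior in the last bullet could look roughly like the sketch below. This is an illustration only, not the code in this PR: the function name `apply_turn_budget` and the wording of the injected instruction are assumptions; only the `CLAUDE_CODE_MAX_TURNS` variable name comes from the description above.

```python
import os


def apply_turn_budget(task, env=None):
    """Prepend a turn-budget instruction when CLAUDE_CODE_MAX_TURNS is set.

    Hypothetical helper: the instruction wording below is an assumption,
    not the text used by the PR's agent.py.
    """
    env = os.environ if env is None else env
    raw = env.get("CLAUDE_CODE_MAX_TURNS", "").strip()
    if not raw:
        # Variable unset or empty: leave the task prompt untouched.
        return task

    max_turns = int(raw)  # raises ValueError on non-numeric input
    if max_turns <= 0:
        raise ValueError("CLAUDE_CODE_MAX_TURNS must be a positive integer")

    budget_note = (
        f"You have a budget of at most {max_turns} turns. "
        "Plan your work so that you finish within this budget.\n\n"
    )
    return budget_note + task
```

Because the budget is expressed in the prompt rather than enforced by the CLI, the same limit can be stated to any agent under comparison.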

Changes

agent.py
Lightweight wrapper that reads the task from /code/prompt.json, pipes it to the Claude Code CLI in headless non-interactive mode (claude -p --dangerously-skip-permissions), and streams output. Web search and fetch tools are disabled so the agent works exclusively with mounted challenge files.
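A minimal sketch of such a wrapper follows, under stated assumptions. The `"prompt"` key in `/code/prompt.json` and the helper names are hypothetical, and using `--disallowedTools WebSearch WebFetch` is one plausible way to disable web access; the PR's actual agent.py may do this differently.

```python
import json
import subprocess
from pathlib import Path

PROMPT_PATH = Path("/code/prompt.json")


def load_task(path=PROMPT_PATH):
    """Read the task text. The "prompt" key is an assumed schema."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return data["prompt"]


def build_command():
    """Headless Claude Code invocation with web tools disallowed."""
    return [
        "claude",
        "-p",
        "--dangerously-skip-permissions",
        # Assumption: web access is disabled via the disallowed-tools flag.
        "--disallowedTools",
        "WebSearch",
        "WebFetch",
    ]


def run_agent(task):
    """Pipe the task over stdin; the child inherits stdout/stderr,
    so its output streams to the container log as it is produced."""
    completed = subprocess.run(build_command(), input=task, text=True)
    return completed.returncode


if __name__ == "__main__":
    raise SystemExit(run_agent(load_task()))
```

Passing the task through `input=` rather than as an argument keeps long prompts clear of the operating system's command-line length limit.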

Dockerfile
Non-commercial image (Ubuntu 22.04) with Node.js 20, Python 3, and the Claude Code CLI. Compatible with OSS simulators (Icarus Verilog, Verilator).

Dockerfile.commercial
Commercial image (Rocky Linux 9) with Cadence Xcelium, cocotb, and coverage tooling. Accepts CDS_LIC_FILE / LM_LICENSE_FILE build args for license server configuration.

build_agent.sh
Helper script that builds both Docker images in one step.

README.md
Build, run, configuration, and debugging instructions.

AgenticProcessor
Updated to pass ANTHROPIC_API_KEY and CLAUDE_CODE_MAX_TURNS environment variables through to agent containers.
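One way to forward these variables is sketched below. The helper `docker_env_args` is hypothetical and is not the AgenticProcessor code; it only illustrates passing variable names (not values) to `docker run -e`, which makes Docker copy each value from the calling process's environment.

```python
import os

# Variables named in the PR description.
PASSTHROUGH_VARS = ("ANTHROPIC_API_KEY", "CLAUDE_CODE_MAX_TURNS")


def docker_env_args(env=None, names=PASSTHROUGH_VARS):
    """Build `-e NAME` arguments for variables that are set on the host.

    Only the name is passed, so secret values never appear in the
    argument list; Docker reads the value from the environment of the
    process that launches `docker run`.
    """
    env = os.environ if env is None else env
    args = []
    for name in names:
        if env.get(name):  # skip unset or empty variables
            args += ["-e", name]
    return args


if __name__ == "__main__":
    # Example: print the command that would be run for the agent image.
    print(["docker", "run", "--rm", *docker_env_args(), "claude-code-agent"])
```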


Results (pass@1, n=5)

Image                                 pass@1
claude-code-agent (non-commercial)    44.5%
claude-code-agent-commercial          9.2%
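For readers unfamiliar with the metric, the sketch below shows the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples per problem and c the number that passed. Whether run_samples.py computes pass@k this way is an assumption, and the counts in the usage example are made up for illustration, not taken from this PR.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k for one problem: n samples, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(per_problem, k):
    """Average pass@k over (n, c) pairs, one pair per problem."""
    scores = [pass_at_k(n, c, k) for n, c in per_problem]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical counts for three problems, five samples each.
    print(mean_pass_at_k([(5, 5), (5, 2), (5, 0)], k=1))
```

With k=1 the estimator reduces to c/n per problem, so pass@1 with n=5 is the average fraction of passing samples.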

Testing

export ANTHROPIC_API_KEY=sk-ant-...

# Full benchmark
python run_benchmark.py -f dataset.jsonl -l -g claude-code-agent

# Multi-sample pass@k
python run_samples.py -f dataset.jsonl -l -g claude-code-agent -n 5 -k 1

Comment thread examples/claude-code-agent/build_agent.sh Outdated
rajathsalegame changed the title from "Rds/claude code" to "Add claude code benchmarking" on Feb 18, 2026
rajathsalegame changed the title from "Add claude code benchmarking" to "Add claude code agent benchmarking" on Feb 18, 2026
Collaborator

npfet commented May 5, 2026

@rajathsalegame could you look at PR #49 and the pr-38 branch I have associated with it to see how well that works on the current codebase? I rebased the original #38 onto cvdp-1.1.0 and then added some support to also pull in host auth from ~/.claude. I also added some agent metadata to the reporting to better capture the common agent run failure modes I was seeing (timeouts etc.), though this has only been lightly tested so far. Thanks!

