Add claude code agent benchmarking by rajathsalegame · Pull Request #38 · NVlabs/cvdp_benchmark

rajathsalegame · 2026-02-12T17:02:58Z

Add Claude Code agent example for CVDP benchmark

Adds a containerized agent that uses Claude Code to solve CVDP hardware verification challenges in a fully agentic workflow. Default model is Claude Sonnet 4.5.

Runs as a non-root user inside the container (Claude Code refuses the --dangerously-skip-permissions flag when run as root).
Tasks are passed via stdin to avoid command-line length limits.
The optional CLAUDE_CODE_MAX_TURNS variable injects turn-budgeting instructions into the prompt for fair cross-agent comparison.
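
One way the turn-budgeting injection could work is sketched below. This is an illustration only, not the code in this PR: the function name `build_prompt` and the wording of the budget note are hypothetical, and the only detail taken from the description is the `CLAUDE_CODE_MAX_TURNS` variable name.

```python
import os


def build_prompt(task: str, env=None) -> str:
    """Prepend a turn-budget note when CLAUDE_CODE_MAX_TURNS is a positive integer.

    Hypothetical helper: the real agent.py may phrase or place the note differently.
    """
    env = os.environ if env is None else env
    raw = env.get("CLAUDE_CODE_MAX_TURNS", "").strip()
    if not raw:
        # Variable unset or empty: leave the task untouched.
        return task
    try:
        max_turns = int(raw)
    except ValueError:
        # Non-numeric values are ignored rather than failing the run.
        return task
    if max_turns <= 0:
        return task
    budget_note = (
        f"You have a budget of at most {max_turns} turns. "
        "Plan your work so that the solution is complete within that budget.\n\n"
    )
    return budget_note + task
```

Putting the budget in the prompt text, rather than relying only on a hard cutoff, lets the agent plan around the limit, which is what makes the comparison across agents fairer.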

Changes

agent.py
Lightweight wrapper that reads the task from /code/prompt.json, pipes it to the Claude Code CLI in headless non-interactive mode (claude -p --dangerously-skip-permissions), and streams output. Web search and fetch tools are disabled so the agent works exclusively with mounted challenge files.
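
A minimal sketch of such a wrapper follows, under stated assumptions. The `"prompt"` key inside prompt.json and the helper names are hypothetical, and using `--disallowedTools` with the `WebSearch` and `WebFetch` tool names is one plausible way to disable web access, not necessarily how this PR does it.

```python
import json
import subprocess

PROMPT_PATH = "/code/prompt.json"  # path named in the PR description


def load_task(path=PROMPT_PATH):
    # Assumption: the task text is stored under a "prompt" key.
    with open(path, encoding="utf-8") as f:
        return json.load(f)["prompt"]


def build_command():
    # Headless, non-interactive invocation with the web tools disabled.
    return [
        "claude", "-p",
        "--dangerously-skip-permissions",
        "--disallowedTools", "WebSearch", "WebFetch",
    ]


def run_agent(task, cmd=None):
    """Send the task on stdin and let the CLI's output stream through."""
    cmd = build_command() if cmd is None else cmd
    # stdout and stderr are inherited from the parent process, so output
    # appears as it is produced and no pipe can fill up and block the child.
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, text=True)
    proc.communicate(input=task)
    return proc.returncode
```

A container entry point would then call `run_agent(load_task())` and exit with the returned code. Sending the task on stdin instead of as an argument keeps long prompts clear of the operating system's argument-length limit.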

Dockerfile
Non-commercial image (Ubuntu 22.04) with Node.js 20, Python 3, and the Claude Code CLI. Compatible with OSS simulators (Icarus Verilog, Verilator).

Dockerfile.commercial
Commercial image (Rocky Linux 9) with Cadence Xcelium, cocotb, and coverage tooling. Accepts CDS_LIC_FILE / LM_LICENSE_FILE build args for license server configuration.

build_agent.sh
Helper script that builds both Docker images in one step.

README.md
Build, run, configuration, and debugging instructions.

AgenticProcessor
Updated to pass ANTHROPIC_API_KEY and CLAUDE_CODE_MAX_TURNS environment variables through to agent containers.
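
The pass-through can be pictured with the sketch below. It assumes the containers are launched through `docker run`; `docker_env_args` and `PASSTHROUGH_VARS` are hypothetical names, and the actual AgenticProcessor change may use a different mechanism.

```python
import os

# Variables named in the PR description.
PASSTHROUGH_VARS = ("ANTHROPIC_API_KEY", "CLAUDE_CODE_MAX_TURNS")


def docker_env_args(env=None, names=PASSTHROUGH_VARS):
    """Build `docker run` arguments that forward selected host variables."""
    env = os.environ if env is None else env
    args = []
    for name in names:
        if env.get(name):
            # "-e NAME" without a value makes docker copy NAME from the
            # calling environment, so the API key never appears in argv.
            args += ["-e", name]
    return args
```

For example, with only `ANTHROPIC_API_KEY` set on the host, the result is `["-e", "ANTHROPIC_API_KEY"]`, and the unset optional variable is skipped rather than forwarded as an empty string.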

Results (pass@1, n=5)

Image	pass@1
`claude-code-agent` (non-commercial)	44.5%
`claude-code-agent-commercial`	9.2%
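
For reference, pass@k over n samples is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where c is the number of passing samples for a problem. The PR does not state how run_samples.py computes its numbers, so the sketch below is an assumption about the metric, not a description of that script.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples, c of them passing."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: every size-k subset contains a passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(pass_counts, n: int, k: int) -> float:
    """Average the per-problem estimate over a benchmark."""
    return sum(pass_at_k(n, c, k) for c in pass_counts) / len(pass_counts)
```

With k=1 the estimate reduces to c/n, so under this assumption the reported pass@1 with n=5 is the mean fraction of the five samples that pass each problem.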

Testing

export ANTHROPIC_API_KEY=sk-ant-...

# Full benchmark
python run_benchmark.py -f dataset.jsonl -l -g claude-code-agent

# Multi-sample pass@k
python run_samples.py -f dataset.jsonl -l -g claude-code-agent -n 5 -k 1

npfet · 2026-05-05T22:50:02Z

@rajathsalegame could you look at PR #49 and the pr-38 branch I have associated with it to see how well that works on the current codebase? I rebased the original #38 to cvdp-1.1.0 and then added some support to also pull in host auth from ~/.claude. I also added some agent metadata to the reporting to better capture the common agent run failure modes I was seeing (timeouts, etc.), though this has only been lightly tested so far. Thanks!

rajathsalegame added 3 commits February 12, 2026 16:47

add README

b9834fd

modify processor to accept env args

a9d0ddc

add claude code agent

de08a58

mojtaba-komeili reviewed Feb 12, 2026


Comment thread examples/claude-code-agent/build_agent.sh Outdated

add placeholder license files

eb90c85

rajathsalegame changed the title ~~Rds/claude code~~ Add claude code benchmarking Feb 18, 2026

rajathsalegame changed the title ~~Add claude code benchmarking~~ Add claude code agent benchmarking Feb 18, 2026

mojtaba-komeili approved these changes Feb 18, 2026


npfet mentioned this pull request May 5, 2026

Update Claude Code agent workflow #49

Open

rajathsalegame wants to merge 4 commits into NVlabs:main from rajathsalegame:rds/claude-code


3 participants
