Add claude code agent benchmarking #38
Open
rajathsalegame wants to merge 4 commits into
mojtaba-komeili
approved these changes
Feb 18, 2026
Collaborator
@rajathsalegame could you look at PR #49 and the pr-38 branch I have associated with it to see how well that works on the current codebase? I rebased the original #38 onto cvdp-1.1.0 and then added some support to also pull in host auth from ~/.claude. I also added some agent metadata to the reporting to better capture the common agent run failure modes I was seeing (timeouts, etc.), though this has only been lightly tested so far. Thanks!
Add Claude Code agent example for CVDP benchmark
Adds a containerized agent that uses Claude Code to solve CVDP hardware verification challenges in a fully agentic workflow. Default model is Claude Sonnet 4.5.
--dangerously-skip-permissions flag).
CLAUDE_CODE_MAX_TURNS variable injects turn-budgeting instructions into the prompt for fair cross-agent comparison.
Changes
agent.py
Lightweight wrapper that reads the task from /code/prompt.json, pipes it to the Claude Code CLI in headless non-interactive mode (claude -p --dangerously-skip-permissions), and streams output. Web search and fetch tools are disabled so the agent works exclusively with mounted challenge files.
Dockerfile
Non-commercial image (Ubuntu 22.04) with Node.js 20, Python 3, and the Claude Code CLI. Compatible with OSS simulators (Icarus Verilog, Verilator).
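As a rough illustration of the agent.py wrapper and the CLAUDE_CODE_MAX_TURNS behaviour described above, here is a minimal sketch of how such a wrapper might assemble its prompt and invoke the CLI. Only the /code/prompt.json path, the claude -p --dangerously-skip-permissions command, and the CLAUDE_CODE_MAX_TURNS variable come from this description. The "prompt" JSON key, the wording of the turn-budget note, and the helper names load_task, add_turn_budget, and run_agent are assumptions for illustration, not the PR's actual code. Disabling the web tools is left out because the description does not say how it is done.

```python
import json
import os
import subprocess
from pathlib import Path

# Path and command taken from the PR description.
PROMPT_PATH = Path("/code/prompt.json")
CLAUDE_CMD = ["claude", "-p", "--dangerously-skip-permissions"]


def load_task(path=PROMPT_PATH) -> str:
    """Read the task text from the prompt file.

    Assumption: the task is stored under a "prompt" key; the real
    schema may differ.
    """
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)["prompt"]


def add_turn_budget(task: str, env=os.environ) -> str:
    """Prepend a turn-budget note when CLAUDE_CODE_MAX_TURNS is set.

    The wording of the note is hypothetical.
    """
    raw = env.get("CLAUDE_CODE_MAX_TURNS")
    if not raw:
        return task
    max_turns = int(raw)
    note = (
        f"You have a budget of at most {max_turns} turns. "
        "Plan your work so that you finish within it."
    )
    return f"{note}\n\n{task}"


def run_agent(path=PROMPT_PATH) -> int:
    """Pipe the prompt to the CLI on stdin and return its exit code."""
    prompt = add_turn_budget(load_task(path))
    # stdout and stderr are inherited, so CLI output streams as it is produced.
    proc = subprocess.run(CLAUDE_CMD, input=prompt, text=True)
    return proc.returncode
```

Inside the container, an entry point would call run_agent() and exit with its return value. Keeping the prompt assembly in pure functions makes the turn-budget logic testable without the CLI installed.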
Dockerfile.commercial
Commercial image (Rocky Linux 9) with Cadence Xcelium, cocotb, and coverage tooling. Accepts CDS_LIC_FILE/LM_LICENSE_FILE build args for license server configuration.
build_agent.sh
Helper script that builds both Docker images in one step.
README.md
Build, run, configuration, and debugging instructions.
AgenticProcessor
Updated to pass ANTHROPIC_API_KEY and CLAUDE_CODE_MAX_TURNS environment variables through to agent containers.
Results (pass@1, n=5)
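For readers unfamiliar with the metric named in the heading above: pass@1 with n=5 samples per problem is commonly computed with the unbiased pass@k estimator, which for k=1 reduces to the fraction of passing samples per problem, averaged over problems. The sketch below assumes that convention. This PR does not state how the harness computes its numbers, and the function names and the example pass counts are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples drawn, c: samples that passed, k: attempt budget.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(pass_counts, n: int, k: int = 1) -> float:
    """Average the per-problem estimate over all problems."""
    return sum(pass_at_k(n, c, k) for c in pass_counts) / len(pass_counts)


# Hypothetical pass counts for three problems, five samples each.
score = mean_pass_at_k([5, 2, 0], n=5, k=1)
```

With k=1 the per-problem value is simply c/n, so the hypothetical counts above average (1.0 + 0.4 + 0.0) / 3.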
claude-code-agent (non-commercial)
claude-code-agent-commercial
Testing