feat(benchmarks): add CyBench evaluation — ZAK + GPT-5 reaches 9/11 (… by infosecsanket · Pull Request #13 · securezeron/zeron-agent-development-kit

infosecsanket · 2026-05-03T07:29:31Z

…81.8%) on official strict-offline subset

Phase 1 (curated 9-task offline slice from HackTheBox Cyber Apocalypse 2024):

- ZAK + gpt-5: 9/9 (100.0%) pass@2 at $2.99
- ZAK + gpt-4o-2024-11-20 on identical slice: 4/9 (44.4%) at $1.34
- 55.6 percentage-point same-task model-uplift gap, with harness, prompt, scoring, and iteration cap held constant
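
The gap between the two models follows directly from the two pass rates. A minimal sketch of that arithmetic, using only the counts and costs quoted above; the helper name `pass_rate` is illustrative and is not taken from run_cybench.py:

```python
def pass_rate(solved: int, total: int) -> float:
    """Return the solve rate as a percentage."""
    return 100.0 * solved / total


# Counts reported for the 9-task Phase 1 slice.
gpt5 = pass_rate(9, 9)
gpt4o = pass_rate(4, 9)

# Difference in percentage points between the two models.
gap = gpt5 - gpt4o
print(f"gpt-5: {gpt5:.1f}%  gpt-4o: {gpt4o:.1f}%  gap: {gap:.1f} pp")

# Cost per solved task, from the reported run totals.
cost_per_solve_gpt5 = 2.99 / 9
cost_per_solve_gpt4o = 1.34 / 4
print(f"cost per solve: gpt-5 ${cost_per_solve_gpt5:.3f}, "
      f"gpt-4o ${cost_per_solve_gpt4o:.3f}")
```

Cost per solved task is not a metric the PR reports; it is included here only as one way to compare the two runs on the same footing.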

Phase 2 (11-task strict-offline subset of cybench's official task_list.txt, spanning Very-Easy through Hard difficulty):

- ZAK + gpt-5 unguided pass@2: 8/11 (72.7%)
- ZAK + gpt-5/o3 best-mode pass@2: 9/11 (81.8%) (subtask-guided + selective o3 escalation on Partial Tenacity)
- Includes Hard-tier Permuted crypto solve
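
For readers unfamiliar with the metric, pass@2 is read here as "a task counts as solved if at least one of up to two attempts recovers the flag". That reading is an assumption, since the PR text does not define the metric. The function name and the toy data below are likewise hypothetical and are not results from this PR:

```python
from typing import Dict, List


def pass_at_k(attempts: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks solved by at least one of the first k attempts."""
    if not attempts:
        return 0.0
    solved = sum(1 for results in attempts.values() if any(results[:k]))
    return solved / len(attempts)


# Hypothetical outcomes for three made-up tasks (NOT data from this PR).
toy = {
    "task_a": [False, True],   # solved on the second attempt
    "task_b": [True],          # solved on the first attempt
    "task_c": [False, False],  # never solved
}
rate = pass_at_k(toy, k=2)

# The headline percentages follow from the reported counts alone.
unguided = 100 * 8 / 11
best_mode = 100 * 9 / 11
print(f"{unguided:.1f}% {best_mode:.1f}%")  # prints "72.7% 81.8%"
```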

To our knowledge this is the first published GPT-5 result on cybench-format CTF tasks; no GPT-5 entry currently exists on the public leaderboard at cybench.github.io.

Includes:

- run_cybench.py: single-file harness, ~700 LoC, sole dep openai>=2.28
- PAPER.pdf: 17-page academic-style writeup
- SCORECARD.md / REPORT.md / REPORT.html: multi-audience summaries
- LEADERBOARD_PR.md: submission package for cybench.github.io
- 48 per-task transcript JSONs (every LLM turn + every shell command) organized by configuration: unguided / subtask-guided / o3-escalation / phase1-curated / gpt4o-baseline
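
One quick way to check a checkout against the transcript count above is to tally the JSON files per configuration. The sketch below assumes one subdirectory per configuration with the transcripts stored directly inside it. That layout and the function name are assumptions, not something the PR text confirms:

```python
from collections import Counter
from pathlib import Path


def count_transcripts(root: Path) -> Counter:
    """Count *.json files in each immediate subdirectory of root."""
    counts: Counter = Counter()
    for path in root.glob("*/*.json"):
        # The parent directory name is treated as the configuration label.
        counts[path.parent.name] += 1
    return counts
```

Calling `count_transcripts(Path("transcripts"))` (a hypothetical directory name) and summing the values should give 48 if the layout matches this assumption.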

Total project cost: $34. Reproducible end-to-end.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

infosecsanket wants to merge 1 commit into securezeron:main from infosecsanket:cybench-benchmark
