feat(benchmarks): add CyBench evaluation — ZAK + GPT-5 reaches 9/11 (…#13

Open
infosecsanket wants to merge 1 commit into securezeron:main from infosecsanket:cybench-benchmark

@infosecsanket
Contributor

…81.8%) on official strict-offline subset

Phase 1 (curated 9-task offline slice from HackTheBox Cyber Apocalypse 2024):

  • ZAK + gpt-5: 9/9 (100.0%) pass@2 at $2.99
  • ZAK + gpt-4o-2024-11-20 on identical slice: 4/9 (44.4%) at $1.34
  • 55.6 percentage-point gap between the two models on the same tasks (100.0% vs. 44.4%), with harness, prompt, scoring, and iteration cap held constant
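
The Phase 1 figures follow from simple arithmetic on the reported counts and costs. A minimal sketch that uses only the numbers stated above; the helper names `solve_rate` and `cost_per_solve` are illustrative and are not taken from run_cybench.py:

```python
def solve_rate(solved: int, total: int) -> float:
    """Return the solve rate as a percentage."""
    return 100.0 * solved / total


def cost_per_solve(total_cost_usd: float, solved: int) -> float:
    """Return the average API cost per solved task, in USD."""
    return total_cost_usd / solved


# Counts and costs as reported in the Phase 1 summary.
gpt5_rate = solve_rate(9, 9)
gpt4o_rate = solve_rate(4, 9)
gap_points = gpt5_rate - gpt4o_rate  # difference in percentage points

print(f"gpt-5:  {gpt5_rate:.1f}% solved, ${cost_per_solve(2.99, 9):.3f} per solve")
print(f"gpt-4o: {gpt4o_rate:.1f}% solved, ${cost_per_solve(1.34, 4):.3f} per solve")
print(f"gap:    {gap_points:.1f} percentage points")
```

Dividing cost by solved tasks rather than attempted tasks is one possible normalization; it is a choice made for this sketch, not a metric reported in the PR.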

Phase 2 (11-task strict-offline subset of cybench's official task_list.txt, spanning Very-Easy through Hard difficulty):

  • ZAK + gpt-5 unguided pass@2: 8/11 (72.7%)
  • ZAK + gpt-5/o3 best-mode pass@2: 9/11 (81.8%), using subtask guidance plus selective o3 escalation on Partial Tenacity
  • Includes Hard-tier Permuted crypto solve
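
As a rough illustration of how the Phase 2 numbers could be aggregated: pass@2 counts a task as solved if either of its first two attempts succeeds, and 8/11 and 9/11 round to 72.7% and 81.8%. The sketch below assumes that "best-mode" means a task counts as solved when it passes under at least one configuration; that reading, the function names, and the placeholder task names are all assumptions, and the toy data is not the actual per-task result set.

```python
from typing import Dict, List


def pass_at_k(attempts: List[bool], k: int = 2) -> bool:
    """A task passes if any of its first k attempts succeeded."""
    return any(attempts[:k])


def best_mode(
    results_by_mode: Dict[str, Dict[str, List[bool]]], k: int = 2
) -> Dict[str, bool]:
    """Mark a task solved if it passes under at least one configuration."""
    solved: Dict[str, bool] = {}
    for per_task in results_by_mode.values():
        for task, attempts in per_task.items():
            solved[task] = solved.get(task, False) or pass_at_k(attempts, k)
    return solved


# Toy data with placeholder task names; NOT the real transcripts.
toy = {
    "unguided": {
        "task_a": [False, True],
        "task_b": [False, False],
        "task_c": [False, False],
    },
    "subtask-guided": {
        "task_b": [True],
        "task_c": [False, False],
    },
}

solved = best_mode(toy)
print(sum(solved.values()), "of", len(solved), "solved")
```

Reporting the unguided and best-mode numbers separately, as the list above does, keeps the effect of guidance and escalation visible instead of folding it into a single score.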

To our knowledge this is the first published GPT-5 result on cybench-format CTF tasks; no GPT-5 entry currently exists on the public leaderboard at cybench.github.io.

Includes:

  • run_cybench.py — single-file harness, ~700 LoC, sole dep openai>=2.28
  • PAPER.pdf — 17-page academic-style writeup
  • SCORECARD.md / REPORT.md / REPORT.html — multi-audience summaries
  • LEADERBOARD_PR.md — submission package for cybench.github.io
  • 48 per-task transcript JSONs (every LLM turn + every shell command) organized by configuration: unguided / subtask-guided / o3-escalation / phase1-curated / gpt4o-baseline
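
A reviewer might want to confirm the transcript count per configuration before reading individual runs. A minimal sketch, assuming each configuration is a subdirectory holding one JSON file per task; the `transcripts` directory name and this layout are assumptions, since the description does not spell out the on-disk structure:

```python
import json
from collections import Counter
from pathlib import Path


def count_transcripts(root: Path) -> Counter:
    """Count parseable *.json files in each immediate subdirectory of root."""
    counts: Counter = Counter()
    for path in sorted(root.glob("*/*.json")):
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            continue  # skip unreadable or malformed files
        counts[path.parent.name] += 1
    return counts


if __name__ == "__main__":
    root = Path("transcripts")  # hypothetical location; adjust to the repo layout
    if root.is_dir():
        counts = count_transcripts(root)
        for config, n in sorted(counts.items()):
            print(f"{config}: {n}")
        print("total:", sum(counts.values()))
```

If the layout matches the assumption, the total should equal the 48 transcripts listed above.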

Total project cost: $34. Reproducible end-to-end.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>