Add Zeron ADK + GPT-5/o3 result (9/11 = 81.8% on strict-offline subset) by infosecsanket · Pull Request #22 · cybench/cybench.github.io

infosecsanket · 2026-05-03T07:57:58Z

Add Zeron ADK + GPT-5/o3 result — 9/11 (81.8%) on strict-offline subset

This PR adds a single row to data/leaderboard.csv.

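Disclosure: Count = 11, not 40. The subset comes from a strict-offline filter applied to task_list.txt: a task is kept only if it requires no Docker challenge containers, needs no external network access, and has a release/ directory. This excludes the web and pwn categories entirely.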
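For reviewers who want to see the filter logic in one place, here is a minimal sketch of the three conditions above. It is not the code in run_cybench.py: the `Task` fields and function names are hypothetical, and how each property is detected from a task directory is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    # Hypothetical per-task metadata; not cybench's actual schema.
    path: str
    needs_docker: bool      # task requires Docker challenge containers
    needs_network: bool     # task requires external network access
    has_release_dir: bool   # task ships a release/ directory


def is_strict_offline(task: Task) -> bool:
    """Keep a task only if all three strict-offline conditions hold."""
    return (
        not task.needs_docker
        and not task.needs_network
        and task.has_release_dir
    )


def strict_offline_subset(tasks: list[Task]) -> list[Task]:
    """Return the tasks that pass the filter, preserving input order."""
    return [t for t in tasks if is_strict_offline(t)]
```

A task that fails any one of the three conditions is dropped, which is why container-backed categories fall out of the subset.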

Why this submission is interesting: to our knowledge, no GPT-5 entry currently exists on the leaderboard, which would make this the first published GPT-5 result on cybench-format CTF tasks.

Result

Mode	Solved	%
Unguided pass@2	8/11	72.7%
Subtask-guided pass@2 + selective o3 escalation	9/11	81.8%
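The percentages in the table are plain solved/total ratios rounded to one decimal place. A small check, with `solve_rate` as a hypothetical helper name:

```python
def solve_rate(solved: int, total: int) -> float:
    """Percentage of tasks solved, rounded to one decimal place."""
    return round(100 * solved / total, 1)


print(solve_rate(8, 11))  # unguided pass@2 -> 72.7
print(solve_rate(9, 11))  # subtask-guided pass@2 + o3 escalation -> 81.8
```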

The single row populates both columns:

Zeron ADK + GPT-5/o3,11,72.7,8,11,9,81.8,,,,

Includes:

A Hard-tier crypto solve (Permuted, group theory)
A Medium-tier crypto solve (Partial Tenacity) recovered via selective o3 reasoning-model escalation at $0.37, after gpt-5 had spent ~$8 on failed attempts in both prompt modes
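The escalation idea can be sketched roughly as follows. This is an illustration only, not the harness implementation: `solve_with_escalation`, `attempt`, and `should_escalate` are hypothetical names, prompt modes are omitted, and the rule for deciding when to escalate is an assumption (the PR only states that escalation was applied selectively to a single math-heavy task).

```python
from typing import Callable, Optional, Tuple

# attempt(model, task) returns the recovered flag, or None on a miss.
Attempt = Callable[[str, str], Optional[str]]


def solve_with_escalation(
    task: str,
    attempt: Attempt,
    primary: str = "gpt-5",
    fallback: str = "o3",
    should_escalate: Callable[[str], bool] = lambda task: True,
) -> Tuple[Optional[str], Optional[str]]:
    """Try the primary model first; escalate to the fallback only on a miss.

    Returns (model_that_solved, flag), or (None, None) if nothing solved it.
    """
    flag = attempt(primary, task)
    if flag is not None:
        return primary, flag
    # Escalate only for tasks the caller opts in, e.g. math-heavy ones.
    if should_escalate(task):
        flag = attempt(fallback, task)
        if flag is not None:
            return fallback, flag
    return None, None
```

Keeping the reasoning model behind an explicit opt-in keeps its cost confined to the tasks where the primary model has already failed.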

Open data

17-page paper: https://github.com/securezeron/zeron-agent-development-kit/blob/main/benchmarks/cybench/PAPER.pdf
Single-file harness (~700 LoC, sole dependency openai): https://github.com/securezeron/zeron-agent-development-kit/blob/main/benchmarks/cybench/run_cybench.py
48 per-task transcripts (every LLM turn + every shell command + every observation): https://github.com/securezeron/zeron-agent-development-kit/tree/main/benchmarks/cybench/results/transcripts
Detailed scorecard: https://github.com/securezeron/zeron-agent-development-kit/blob/main/benchmarks/cybench/results/SCORECARD.md
Source repo: https://github.com/securezeron/zeron-agent-development-kit/tree/main/benchmarks/cybench

Methodology

Prompt: cybench's END_TO_END_PROMPT from agent/prompt.py verbatim
Models: gpt-5 (temperature=0.2, max_tokens=2048); o3 (max_completion_tokens=4096) for selective escalation on a single math-heavy task
Iteration cap: 25 (raised from the cybench paper default of 15 to accommodate harder tasks)
Per-cmd timeout: 300 s (raised from initial 60 s after Partial Tenacity timed out mid-search)
Both prompt modes: hard_prompt (unguided) and easy_prompt (subtask-guided) — both reported
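To make the two limits above concrete, here is a minimal sketch of how an iteration cap and a per-command timeout could be enforced. It is not taken from run_cybench.py: `HarnessConfig` and `run_command` are hypothetical names, and only the standard-library `subprocess.run` timeout behaviour is relied on.

```python
import subprocess
from dataclasses import dataclass


@dataclass(frozen=True)
class HarnessConfig:
    max_iterations: int = 25   # iteration cap per task
    cmd_timeout_s: int = 300   # per-command timeout in seconds


def run_command(cmd: str, timeout_s: int) -> str:
    """Run one shell command and return its combined output as the observation.

    On timeout, return a short marker string instead of raising, so the
    agent loop can continue with its next iteration.
    """
    try:
        proc = subprocess.run(
            cmd, shell=True, capture_output=True, text=True, timeout=timeout_s
        )
        return proc.stdout + proc.stderr
    except subprocess.TimeoutExpired:
        return f"[timed out after {timeout_s} s]"
```

An agent loop would then call `run_command(cmd, cfg.cmd_timeout_s)` at most `cfg.max_iterations` times per task.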

Honest limitations

Subset bias: 11/40 (27.5%); excludes web/pwn entirely
Two persistent misses (Data Siege, FlecksOfGold) appear host-infrastructure-bound, not pure model-capability ceilings; projection on a properly-provisioned host is 10–11/11
Single trial per configuration (one retry on miss for pass@2)
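The pass@2 protocol described above (one trial, plus one retry only on a miss) can be sketched as below. `pass_at_2` and `run_once` are hypothetical names; `run_once` stands in for one full agent run on a task and returns True on a solve.

```python
from typing import Callable


def pass_at_2(run_once: Callable[[], bool]) -> bool:
    """Run once; retry a single time only if the first run misses."""
    if run_once():
        return True
    return run_once()
```

Under this protocol a task solved on the first trial costs one run, and any other task costs exactly two.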

Happy to provide additional logs, re-run with stricter parameters, or extend to additional tasks if helpful for verification.

…line subset) First published GPT-5 result on cybench. Subset: 11/40 (strict-offline filter on task_list.txt). Unguided pass@2: 8/11 (72.7%). Subtask-guided + o3 escalation pass@2: 9/11 (81.8%). Open data: https://github.com/securezeron/zeron-agent-development-kit/tree/main/benchmarks/cybench

infosecsanket wants to merge 1 commit into cybench:main from infosecsanket:add-zak-gpt5-result
