First OASIS Benchmarks: 18 runs across Claude, Gemini, and Grok #32
r3y3r53 started this conversation in Show and tell
The rapid maturation of frontier AI models—from Anthropic's Claude to Google's Gemini to xAI's Grok—has fundamentally disrupted threat modeling for security teams. These models now handle complex reasoning, multi-step problem-solving, and domain-specific tasks that were previously exclusive to human expertise. But the critical question remains unresolved: How reliably can these models exploit real vulnerabilities at scale?
OASIS answers that question with transparent, reproducible benchmarking. Here's what we found in our initial 18 benchmarks across 4 core vulnerability types:
Benchmarking Results
Lab Coverage: 4 Core Vulnerability Types
This benchmark suite focuses on foundational vulnerability types that represent common real-world exploitation scenarios:
Key Takeaways
🎯 The Core Finding
All 18 benchmarks succeeded. Every model solved every challenge. The differentiation is entirely in efficiency metrics—tokens consumed, time elapsed, iterations needed. This is the crucial insight: AI capability for exploitation is now table stakes; the competitive variables are speed, cost, and efficiency.
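With success rates at 100% across the board, ranking models reduces to comparing cost profiles. As an illustration only (the `Run` record, field names, and pricing are hypothetical, not OASIS's actual schema), here is a minimal sketch of how runs could be compared on the three metrics the post names:

```python
from dataclasses import dataclass

@dataclass
class Run:
    model: str
    tokens: int      # total tokens consumed
    seconds: float   # wall-clock time to first working exploit
    iterations: int  # agent loop iterations needed

def cost_profile(run: Run, usd_per_1k_tokens: float) -> dict:
    # When every run succeeds, there is no accuracy axis left to
    # compare on; the ranking is purely dollars, time, and iterations.
    return {
        "model": run.model,
        "usd": run.tokens / 1000 * usd_per_1k_tokens,
        "tokens_per_iter": run.tokens / run.iterations,
        "seconds": run.seconds,
    }

# Hypothetical numbers purely for illustration:
profile = cost_profile(Run("model-a", tokens=20000, seconds=45.0, iterations=4),
                       usd_per_1k_tokens=0.001)
```

Sorting a list of such profiles by `usd` or `seconds` gives the "fastest/cheapest" ordering the sections below use.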
🚀 Fastest Exploitation: Gemini-3-flash-preview on JWT-forgery
This is the performance ceiling for this challenge. Fastest. Cheapest. Fewest iterations.
The Scenic Route: Opus-4-6 on JWT-forgery
Same challenge. Same exploit. Different execution:
Both succeeded. Both solved the same JWT forgery. But Opus-4-6 took the scenic route—more thinking, more token expenditure, more iterations. Same outcome, vastly different cost profile.
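The post doesn't show the lab's internals, but for readers unfamiliar with the vulnerability class, here is a minimal sketch of the classic `alg: none` JWT forgery—one common variant of what a JWT-forgery lab might test, assuming a verifier that naively trusts the token header:

```python
import base64
import json

def b64url(data: bytes) -> str:
    # JWTs use unpadded base64url segments
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def forge_none_token(claims: dict) -> str:
    # Setting "alg": "none" tells a naive verifier to skip
    # signature validation entirely, so no key is needed.
    header = b64url(json.dumps({"alg": "none", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    return f"{header}.{payload}."  # trailing dot: empty signature

# Hypothetical claims purely for illustration:
token = forge_none_token({"sub": "1", "role": "admin"})
```

A correct verifier rejects `none` outright (or enforces an allow-list of algorithms), which is why this class of bug is both common and cleanly machine-checkable in a benchmark lab.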
Small Models Outperform on Simple Tasks: Haiku vs Grok on IDOR
On access control bypass (IDOR)—a fundamentally straightforward challenge—the smaller, cheaper Haiku model matched or slightly exceeded Grok's performance. This challenges the assumption that bigger models are always better. For simple vulnerabilities, model size becomes overhead.
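IDOR is simple precisely because the flaw is structural: an object is fetched by caller-supplied ID with no ownership check. As an illustration (the handler names and data are hypothetical, not the lab's actual code), a minimal sketch of the bug and its fix:

```python
# Hypothetical in-memory store standing in for a real database.
ORDERS = {
    1: {"owner": "alice", "total": 30},
    2: {"owner": "bob", "total": 99},
}

def get_order_vulnerable(order_id: int, current_user: str) -> dict:
    # IDOR: the record is looked up by ID alone, so any
    # authenticated user can read any other user's order.
    return ORDERS[order_id]

def get_order_fixed(order_id: int, current_user: str) -> dict:
    # Fix: authorize the object, not just the request.
    order = ORDERS[order_id]
    if order["owner"] != current_user:
        raise PermissionError("not your order")
    return order
```

Exploiting this takes nothing more than enumerating IDs, which is why a small, cheap model can match a frontier model here: there is no reasoning depth for the larger model to spend its extra capacity on.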
Next Steps & Future Testing
Since OASIS supports custom lab testing, the next phase will push the boundaries:
Multi-stage exploit chains — Sequential vulnerabilities, privilege escalation paths, lateral movement
Consistency benchmarking (3-5 runs per model/challenge) — Determine whether the differentiation holds across repeated runs
Obstacle-augmented challenges — Deploy variants with real-world defenses
Planned deliverables: