Overview
Evaluate OpenSymbolicAI on SWE-bench Pro (Scale AI), a harder variant of SWE-bench on which the best models resolve only ~23% of tasks.
Why this benchmark
- High prestige: SWE-bench is widely treated as the standard benchmark for SWE agent evaluation
- The best current models (GPT-5, Claude Opus 4.1) achieve only ~23%, leaving significant headroom
- Scale AI leaderboard provides visibility
- Demonstrates that OpenSymbolicAI can handle real-world GitHub issue resolution
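For context on what a score like "~23%" means: SWE-bench-style benchmarks report the fraction of task instances whose generated patch makes the repository's test suite pass ("resolved"). A minimal sketch of that metric, using made-up instance IDs and results rather than real leaderboard data:

```python
# Hedged sketch: SWE-bench-style scoring is the fraction of instances
# whose generated patch passes the repo's tests ("resolved").
# Instance IDs and outcomes below are illustrative only.

def resolved_rate(results: dict[str, bool]) -> float:
    """Fraction of instances marked resolved; 0.0 for an empty run."""
    if not results:
        return 0.0
    return sum(results.values()) / len(results)

results = {
    "repo-a__issue-101": True,   # patch applied and tests passed
    "repo-b__issue-202": False,  # patch failed the test suite
    "repo-c__issue-303": False,
    "repo-d__issue-404": False,
}
print(f"{resolved_rate(results):.0%}")  # → 25%
```

In the real harness, each boolean comes from running the repository's tests against the model's patch inside an isolated environment; the function above only aggregates those outcomes.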
References
Tasks