Overview
Evaluate OpenSymbolicAI on SWE-bench Pro (Scale AI), a harder variant of SWE-bench on which the best models resolve only ~23% of tasks.
Why this benchmark
- High prestige: SWE-bench is widely treated as the standard benchmark for SWE agent evaluation
- The best current models (GPT-5, Claude Opus 4.1) achieve only ~23%, leaving significant headroom
- Scale AI leaderboard provides visibility
- Demonstrates that OpenSymbolicAI can handle real-world GitHub issue resolution
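For context on what a score like "~23%" means: SWE-bench-style benchmarks report the fraction of task instances whose generated patch makes the repository's test suite pass ("resolved"). A minimal sketch of that metric, using made-up instance IDs and results rather than real leaderboard data:

```python
# Hedged sketch: SWE-bench-style scoring is the fraction of instances
# whose generated patch passes the repo's tests ("resolved").
# Instance IDs and outcomes below are illustrative only.

def resolved_rate(results: dict[str, bool]) -> float:
    """Fraction of instances marked resolved; 0.0 for an empty run."""
    if not results:
        return 0.0
    return sum(results.values()) / len(results)

results = {
    "repo-a__issue-101": True,   # patch applied and tests passed
    "repo-b__issue-202": False,  # patch failed the test suite
    "repo-c__issue-303": False,
    "repo-d__issue-404": False,
}
print(f"{resolved_rate(results):.0%}")  # → 25%
```

In the real harness, each boolean comes from running the repository's tests against the model's patch inside an isolated environment; the function above only aggregates those outcomes.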
References
Tasks