Benchmark: SWE-bench Pro — advanced software engineering tasks #28

@rajkumar42

Description

Overview

Evaluate OpenSymbolicAI against SWE-bench Pro (Scale AI), a harder variant of SWE-bench on which the best models resolve only ~23% of tasks.

Why this benchmark

  • High prestige — the gold standard for SWE agent evaluation
  • Best models (GPT-5, Claude Opus 4.1) only achieve ~23% — significant headroom
  • Scale AI leaderboard provides visibility
  • Demonstrates OpenSymbolicAI can handle real-world GitHub issue resolution

Tasks

  • Review SWE-bench Pro evaluation protocol
  • Design code-editing primitive set
  • Build SWE agent on top of OpenSymbolicAI blueprints
  • Run evaluation and collect results
  • Submit to Scale AI leaderboard
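To make the "code-editing primitive set" task concrete, here is a minimal sketch of what such primitives might look like. Everything here is an assumption for illustration: the names (`SearchReplace`, `apply_edits`) and the in-memory file-map representation are hypothetical and not part of OpenSymbolicAI or the SWE-bench Pro harness.

```python
# Hypothetical code-editing primitive: a search/replace edit applied to an
# in-memory snapshot of repository files. Illustrative only; not an
# OpenSymbolicAI or SWE-bench Pro API.
from dataclasses import dataclass

@dataclass
class SearchReplace:
    """One edit: replace the first occurrence of `search` with `replace` in `path`."""
    path: str
    search: str
    replace: str

def apply_edits(files: dict[str, str], edits: list[SearchReplace]) -> dict[str, str]:
    """Apply edits to a file map, failing loudly when a search string is stale."""
    patched = dict(files)  # leave the original snapshot untouched
    for edit in edits:
        text = patched[edit.path]
        if edit.search not in text:
            raise ValueError(f"stale edit: {edit.search!r} not found in {edit.path}")
        patched[edit.path] = text.replace(edit.search, edit.replace, 1)
    return patched

# Usage: fix a toy off-by-one bug the way an agent-proposed patch would.
repo = {"util.py": "def last(xs):\n    return xs[len(xs)]\n"}
fixed = apply_edits(repo, [SearchReplace("util.py", "xs[len(xs)]", "xs[-1]")])
```

Failing on a stale `search` string matters for agent loops: it turns a silently wrong patch into an error the agent can observe and retry on.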
