Skip to content

kimjune01/swebench-verified

Repository files navigation

swebench-verified

A three-stage agent pipeline for SWE-bench Verified, built to be re-run and inspected by skeptics. The point of this repo is not the score. It is that you can clone it, run the exact procedure on the exact skills, and check the artifacts against your own grading.

πŸ“ Want the readable version? The writeup walks the method stage by stage with annotated skills, a worked win, and a worked loss: june.kim/swebench-verified.

One repo per benchmark. This is the Verified run. The SWE-bench Pro run β€” and its port plan, goal predicate, and validation design (PRO_PORT.md) β€” lives in its own repo, swebench-pro. Separate repos so each run's artifacts and number stand alone, rather than hiding behind branches.

Where the numbers come from (start at 500)

Every one of the 500 SWE-bench Verified instances flows to a visible terminal below β€” nothing is silently dropped. Resolved: 426 / 438 eligible (~97%); 426 / 500 full set (~85%). Of the 426, 418 were first-attempt and 8 won only on a re-run β€” and every re-run was an external-fault correction (box-death, a serialization bug since fixed, co-tenant contention), never a re-roll of a reasoning loss; the original losing runs stay in history. The gap from 500 is 44 sphinx-doc (tox-based, can't run airgapped) + 18 documented defects (KNOWN_BAD.md) β€” not cherry-picking. Counts are re-derivable: the eligible/excluded split from the dataset + KNOWN_BAD.md, the resolved count from committed official_eval/summary.json files (driver/scoreboard.py).

sankey-beta
SWE-bench Verified,sphinx-doc (offline-infeasible),44
SWE-bench Verified,KNOWN_BAD defects,18
SWE-bench Verified,Eligible (attempted),438
Eligible (attempted),Resolved (official),426
Eligible (attempted),Not resolved,12
Resolved (official),first attempt,418
Resolved (official),external-fault re-run,8
Not resolved,Reasoning loss,9
Not resolved,Infra / timeout DNF,3
Reasoning loss,recon-ceiling,2
Reasoning loss,genuinely-hard,2
Reasoning loss,gate-divergence,2
Reasoning loss,craft-overfit,1
Reasoning loss,other (sympy),2
Infra / timeout DNF,heavy-suite hang,3
Loading

The 12 not-won, by name (audit them against results/): reasoning (9, not rerun β€” genuine no-solve) django-11734, django-14351 (recon-ceiling), astropy-13398, django-16263 (genuinely-hard), django-14170, pytest-5787 (gate-divergence), sympy-13091 (craft-overfit), sympy-20438, sympy-17139; infra/timeout (3) django-15957, matplotlib-25311, sympy-19040 (heavy-suite stage-hang, await a general stage-cap fix β€” not re-rolled). The 4 external-fault losses (django-15563/14404 box-death, django-15987 serialization, sympy-13878 contention) were corrected by re-run and now resolve; their original losing runs remain in history. Only external/infra faults are eligible for rerun β€” a reasoning loss stays a loss.

⚠️ Contamination disclaimer β€” read this before trusting the number

SWE-bench Verified is training-data contaminated for every modern model, including the ones used here. Claude Sonnet generates and codex (GPT-5.5) filters; both training cutoffs postdate the Verified instances. This is a leaderboard / capability configuration, not a contamination-clean science claim.

  • Contamination is a property of the benchmark, shared by all leaderboard entries β€” so any score here is a fair entry on the same terms as everyone else, and nothing more. It is not evidence that the method causes the result.
  • Isolating the method as the cause requires a separate clean-room ablation: post-cutoff instances (SWE-rebench), with-vs-without the pipeline on one fixed model. This repo does not make that claim. What it demonstrates is that the pipeline runs end-to-end and produces correct, official-grader-verified patches.
  • The honest unit is a win count with an explicit denominator (KNOWN_BAD.md exclusions committed), not a "solve rate" you should read as model capability.

LIMITATIONS.md is the full, unhedged list. Read it before drawing any conclusion.

What it is

Three skills, run as separate claude --print invocations, chained by a driver:

  • recon (read-only): reproduce the failing test, localize the root cause, emit a structured handoff. Single diagnostician.
  • craft (implementation): draft the patch from the handoff, have a codex subagent challenge it (codex generates nothing; it filters), apply, and loop against the test gate until the FAIL_TO_PASS tests pass.
  • audit (verification): run the full suite, classify each failure against a captured fail-on-base baseline, emit a verdict (RESOLVED / NOT_RESOLVED / PARTIAL) and a re-entry route.

The driver wraps these in an outer loop (max 3): a non-RESOLVED audit routes back to recon (wrong diagnosis) or craft (over-broad fix that regressed something). The system under test is an offline Docker container; the only ground truth is the test gate, never the model's own say-so.

The skills in skills/ are frozen independent copies β€” a pinned snapshot of the exact recon/craft/audit text that produced the Verified run (426/438). They were hardlinked to the canonical authoring copies during the campaign; now that the eligible pool is exhausted the link is broken, so ongoing edits to the canonical copies (e.g. the SWE-bench Pro port) do not mutate this frozen artifact. driver/link_skills.sh is retired for this repo β€” do not re-run it. A clone gets these real files regardless.

Recordkeeping (how the honesty is enforced)

The commit history is the audit trail of the method, and it is append-only by policy: no force-push, no rebase that drops a losing-run commit, no amending to tidy a failure out of view. A skeptic can replay the whole loop from the log β€” where the artifact failed, the general fix, the clean re-run β€” and a quietly-removed loss would show up as rewritten history. Concretely:

  • Every run is committed, wins and losses (results/<id>/<run>/), with the official grader report. Re-runs add new timestamped run dirs; prior losing runs are never deleted or overwritten.
  • SCOREBOARD.md is regenerated, but its source isn't. The per-run official_eval/summary.json files only accumulate; the scoreboard is derived from them.
  • Method changes are their own legible diffs. A reader judges each change for generality (instance-blind) vs a smuggled instance prior.
  • Frozen artifact versions are git-tagged. The deliverable is a single tagged version run from scratch on the whole target; checking out the tag shows exactly the code that produced those runs. The cross-version history is the development record, not the deliverable.

What counts as a win

A win is a passing attestation from the official swebench.harness.run_evaluation grader β€” nothing else. Not our gate, not the agent's RESOLVED claim, not "it would have passed without the infra hiccup." SCOREBOARD.md is the live count, re-derived by driver/scoreboard.py from the committed official_eval/summary.json files β€” trust it over any number in prose, which drifts.

Solve times

Wall-clock per instance, summed across the recon β†’ craft β†’ audit stages of the chosen run (the resolved run where there is more than one). Median 8 min, p90 15 min; the long tail past 16 min is the heavy-suite repos (sympy, matplotlib) where the audit's full-suite run dominates. Derived from the committed ledger.jsonl files by driver/solve_times.py β€” re-run it to reproduce the chart.

xychart-beta
  title "Solve time per instance (n=438): recon + craft + audit wall-clock"
  x-axis "minutes" [3-4, 4-5, 5-6, 6-7, 7-8, 8-9, 9-10, 10-11, 11-12, 12-13, 13-14, 14-15, 15-16, 16-17, 17-18, 18-19, 19-20, 20-21, 21-22, "22+"]
  y-axis "instances" 0 --> 80
  bar [4, 18, 62, 75, 59, 49, 33, 26, 24, 14, 17, 14, 6, 8, 3, 3, 1, 5, 2, 15]
Loading

p50 8.0 min Β· p90 15.0 min Β· p95 20.0 min Β· p99 65 min Β· max 77 min (a heavy-suite stage-hang). These are pipeline wall times on the driver host, not billed model time.

Exclusions

KNOWN_BAD.md lists the SWE-bench instances with broken Docker envs, flaky tests, gold patches that fail to grade, or weak coverage (sourced from SWE-bench issues and the UTBoost paper). Every batch is filtered against it before running and the exclusion is committed, so the denominator is honest. LIMITATIONS.md carries the full, unhedged list of everything else.

Run it yourself

Prerequisites:

  • An x86-64 Linux Docker host (the SWE-bench eval images are linux/amd64). The provided driver/provision.sh spins up an AWS EC2 box with a self-terminating watchdog; any amd64 Docker host works.
  • claude CLI (Anthropic) authenticated, and codex CLI (OpenAI) authenticated, both on the machine that runs the driver. The models run on the driver host; the container is offline.
  • Python with swebench and datasets (pip install -r requirements.txt).

Steps:

# 1. Build a task JSON for any Verified instance
python driver/make_task.py pallets__flask-5014 tasks/pallets__flask-5014.json

# 2. Provision an offline-capable Docker box (or point at your own)
bash driver/provision.sh          # writes /tmp/v4smoke.env (KEY, PUBIP, ...)

# 3. Run the pipeline
python driver/rung4_driver.py /tmp/v4smoke.env tasks/pallets__flask-5014.json pallets__flask-5014

# 4. Inspect: ledger, patch, hypothesis graph land in /tmp/swebench-abduction/

Grade the captured patch with the official SWE-bench harness (do not trust this repo's gate as the grader; the gate is the agent's stopping signal, the official harness is the verdict):

python -m swebench.harness.run_evaluation \
  --dataset_name princeton-nlp/SWE-bench_Verified \
  --predictions_path <patches.jsonl> --run_id check

See PROCEDURE.md for the full step-by-step and WORKLOG.md for the build history.

Layout

skills/{recon,craft,audit}/skill.md   the three skills (frozen snapshot; formerly hardlinked)
driver/rung4_driver.py                the orchestrator (recon -> craft -> audit + outer loop)
driver/make_task.py                   builds a task JSON from any Verified instance id
driver/provision.sh                   EC2 provisioning with self-terminating watchdog
driver/link_skills.sh                 (retired) re-established skill hardlinks during the campaign
tasks/                                generated task JSONs
results/<instance>/                   ledger, patch, hypothesis graph, agent logs, codex proof

Citation

To cite this pipeline, use the metadata in CITATION.cff (GitHub renders a "Cite this repository" button from it) or the archived DOI: DOI. The concept DOI 10.5281/zenodo.20383661 always resolves to the latest version. Writeup: june.kim/swebench-verified.

This work runs on the SWE-bench Verified benchmark. If you use this repo, also cite the original benchmark:

@inproceedings{
    jimenez2024swebench,
    title={{SWE}-bench: Can Language Models Resolve Real-world Github Issues?},
    author={Carlos E Jimenez and John Yang and Alexander Wettig and Shunyu Yao and Kexin Pei and Ofir Press and Karthik R Narasimhan},
    booktitle={The Twelfth International Conference on Learning Representations},
    year={2024},
    url={https://openreview.net/forum?id=VTF8yNQM66}
}

License

GPL-3.0 (copyleft). See LICENSE. If you build on the method, share back.

<iframe width="560" height="315" src="https://www.youtube.com/embed/gRVjAtPip0Y" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen> </iframe>

About

Reproducible recon/craft/audit agent pipeline for SWE-bench Verified. Official-graded, codex-attested, GPL-3.0. Run it yourself.

Resources

License

GPL-3.0, Unknown licenses found

Licenses found

GPL-3.0
LICENSE
Unknown
LICENSE.md

Stars

Watchers

Forks

Packages

 
 
 

Contributors