# Evaluation Summary
We ran a controlled evaluation across 24 bug-localisation tasks drawn from real GitHub issues in 5 open-source repositories. Each task was run under three conditions:
| Condition | Context provided to the model |
|---|---|
| A — Heuristic grep | Keyword search over raw source files |
| B — Structural index | AST-parsed index with file/function/class structure |
| C — Enriched index | Structural index + LLM-generated semantic summaries |
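The three conditions differ only in how context is assembled. Condition B can be approximated in a few lines with Python's `ast` module; the entry schema below is a hypothetical sketch, not the index format actually used in this eval:

```python
import ast

def structural_index(source: str, path: str) -> list[dict]:
    """Flat index of classes and functions, roughly as in condition B.

    Sketch only: the real index format is not specified in this report,
    so the entry schema here is an illustrative assumption.
    """
    entries = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            entries.append({
                "path": path,                 # file the symbol lives in
                "kind": type(node).__name__,  # ClassDef / FunctionDef / AsyncFunctionDef
                "name": node.name,
                "lineno": node.lineno,
            })
    return entries

index = structural_index("class Foo:\n    def bar(self):\n        pass\n", "foo.py")
```

Condition C would additionally attach an LLM-generated summary string to each entry; condition A skips the index entirely and greps raw source.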
Every task asked the model to identify the root cause of a bug and localise the fix to the correct file, class, or function — without seeing the solution.
Responses were scored on a 1–5 Retrieval-Localisation (RL) scale:
| Score | Meaning |
|---|---|
| 1 | Wrong area entirely |
| 2 | Correct repo area, wrong file |
| 3 | Correct file, wrong mechanism |
| 4 | Correct file + mechanism (file/class precision) |
| 5 | Exact function-level match |
Scoring was done by an LLM judge against oracle answers, with 32 of 72 runs reviewed and adjudicated by a human.
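The rubric can be stated as a decision ladder. The sketch below uses exact-match comparison as a simplification (the actual eval used an LLM judge on free-text answers), and the field names are illustrative assumptions:

```python
def rl_score(pred: dict, oracle: dict) -> int:
    """Apply the 1-5 Retrieval-Localisation rubric to a predicted bug location."""
    if pred.get("area") != oracle["area"]:
        return 1  # wrong area entirely
    if pred.get("file") != oracle["file"]:
        return 2  # correct repo area, wrong file
    if pred.get("mechanism") != oracle["mechanism"]:
        return 3  # correct file, wrong mechanism
    if pred.get("function") != oracle["function"]:
        return 4  # file + mechanism correct (file/class precision)
    return 5      # exact function-level match

oracle = {"area": "parser", "file": "core.py",
          "mechanism": "arg handling", "function": "parse_args"}
print(rl_score({**oracle, "function": "main"}, oracle))  # 4
```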
Tasks were drawn from five repositories:

| Repo | Language | Tasks |
|---|---|---|
| httpie/cli | Python | 6 |
| pallets/click | Python | 6 |
| pinterest/ktlint | Kotlin | 6 |
| cashapp/turbine | Kotlin | 3 |
| colinhacks/zod | TypeScript | 3 |
Overall results by condition:

| Condition | Mean RL | ≥4 (%) | Score distribution |
|---|---|---|---|
| A — Heuristic grep | 3.08 | 54% | 1:3 · 2:6 · 3:2 · 4:12 · 5:1 |
| B — Structural index | 3.25 | 67% | 1:3 · 2:5 · 3:0 · 4:15 · 5:1 |
| C — Enriched index | 3.12 | 62% | 1:4 · 2:5 · 3:0 · 4:14 · 5:1 |
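The summary statistics can be re-derived directly from the score distributions; for example, condition B:

```python
def summarize(dist: dict[int, int]) -> tuple[float, int]:
    """Mean RL and >=4 hit rate (%) from a score -> count distribution."""
    n = sum(dist.values())
    mean = sum(score * count for score, count in dist.items()) / n
    hit_rate = 100 * sum(count for score, count in dist.items() if score >= 4) / n
    return round(mean, 2), round(hit_rate)

# Condition B's distribution from the table above (score 3 had zero runs)
print(summarize({1: 3, 2: 5, 3: 0, 4: 15, 5: 1}))  # (3.25, 67)
```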
B outperforms A on the ≥4 rate by +13 percentage points (54% → 67%). The overall mean difference is modest (+0.17) and not statistically significant at this sample size — a larger task bank would be needed to confirm.
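The significance caveat can be checked directly. The sketch below runs an unpaired two-sided permutation test over per-run scores expanded from the distributions above; note the real eval is paired by task, so a paired test would be more powerful:

```python
import random

def permutation_test(a: list[int], b: list[int],
                     iters: int = 10_000, seed: int = 0) -> float:
    """Two-sided permutation test on the difference of mean scores."""
    rng = random.Random(seed)
    observed = sum(b) / len(b) - sum(a) / len(a)
    pooled = a + b
    extreme = 0
    for _ in range(iters):
        rng.shuffle(pooled)
        diff = sum(pooled[:len(b)]) / len(b) - sum(pooled[len(b):]) / len(a)
        if abs(diff) >= abs(observed):
            extreme += 1
    return extreme / iters

# Per-run scores expanded from the score distributions above
a = [1] * 3 + [2] * 6 + [3] * 2 + [4] * 12 + [5] * 1  # condition A
b = [1] * 3 + [2] * 5 + [4] * 15 + [5] * 1            # condition B
p = permutation_test(a, b)
```

On these samples the p-value comes out well above 0.05, consistent with the caveat.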
Mean RL by repo:

| Repo | A | B | C | B−A |
|---|---|---|---|---|
| httpie/cli | 2.50 | 3.33 | 2.83 | +0.83 |
| pallets/click | 4.17 | 4.00 | 4.00 | −0.17 |
| pinterest/ktlint | 3.67 | 3.67 | 3.67 | 0.00 |
| cashapp/turbine | 1.67 | 2.00 | 2.00 | +0.33 |
| colinhacks/zod | 2.33 | 2.00 | 2.00 | −0.33 |
Mean RL by task size:

| Size | A | B | C | B−A | Tasks |
|---|---|---|---|---|---|
| S (small) | 2.91 | 3.36 | 3.09 | +0.45 | 11 |
| M (medium) | 3.20 | 3.40 | 3.40 | +0.20 | 10 |
| L (large) | 3.33 | 2.33 | 2.33 | −1.00 | 3 |
The structural index (B) uses 50% fewer prompt tokens than heuristic grep (A) while scoring higher.
| Condition | Avg prompt tokens | Avg completion tokens |
|---|---|---|
| A | 1,521 | 479 |
| B | 761 | 520 |
| C | 1,257 | 491 |
This is a meaningful efficiency gain: B delivers better results at lower cost.
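The headline percentages follow directly from the token table:

```python
tokens = {"A": 1521, "B": 761, "C": 1257}  # avg prompt tokens per condition

b_saving = 1 - tokens["B"] / tokens["A"]    # how much B saves relative to A
c_overhead = tokens["C"] / tokens["B"] - 1  # how much C adds relative to B
print(f"B saves {b_saving:.0%} vs A; C adds {c_overhead:.0%} vs B")
# → B saves 50% vs A; C adds 65% vs B
```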
1. The structural index (B) is the sweet spot. B consistently matches or beats A on score while cutting prompt tokens nearly in half. The ≥4 hit rate jumps from 54% to 67% — meaning the model correctly localises the bug file in 2 out of 3 tasks under B, vs. just over half under A.
2. Semantic enrichment (C) does not beat structural retrieval (B). C adds ~65% more prompt tokens compared to B but scores lower in 3 of 5 repos. The LLM-generated summaries appear to add noise rather than signal at this retrieval depth. This warrants further investigation — either the enrichment prompt needs tuning, or top-k needs to be reduced for C.
3. The index helps most where the model has least training familiarity. httpie/cli shows the largest B−A gain (+0.83), while pallets/click is ceiling-bound across all conditions (≈4.0): the model already knows Click well enough from training data that retrieval context is redundant. The index adds value specifically in less-prominent codebases.
4. Some repos are hard regardless of retrieval. cashapp/turbine (1.67–2.00) and colinhacks/zod (2.00–2.33) score low across all conditions. Two turbine tasks scored 1 under every condition; these likely require runtime/flow knowledge that static retrieval cannot provide, and may be out of scope for this eval design.
5. Large repos show a retrieval penalty. The 3 large-repo tasks show B−A = −1.00. In larger codebases, the top-k retrieval window may be too narrow to surface the right file, or the heuristic grep happens to find better matches by luck. This is worth investigating with higher top-k values.
The structural index is a net positive: better localisation, lower cost, no regressions on repos where the model is already strong. The enrichment layer needs rethinking before it earns its token budget.
Next steps: expand the task bank to reach statistical significance, tune the enrichment prompts, and test higher top-k values for large repos.
72 runs · 24 tasks · 5 repos · scores post human review (32/72 adjudicated) · March 2026