
Evaluation Summary

Krzysztof Jackowski edited this page Mar 26, 2026 · 1 revision

Hybrid Code Indexing — Retrieval-Localisation Eval

Can a structural/semantic code index help an LLM find the right file faster?


What We Tested

We ran a controlled evaluation across 24 bug-localisation tasks drawn from real GitHub issues in 5 open-source repositories. Each task was run under three conditions:

| Condition | Context provided to the model |
|---|---|
| A — Heuristic grep | Keyword search over raw source files |
| B — Structural index | AST-parsed index with file/function/class structure |
| C — Enriched index | Structural index + LLM-generated semantic summaries |

Every task asked the model to identify the root cause of a bug and localise the fix to the correct file, class, or function — without seeing the solution.

Responses were scored on a 1–5 Retrieval-Localisation (RL) scale:

| Score | Meaning |
|---|---|
| 1 | Wrong area entirely |
| 2 | Correct repo area, wrong file |
| 3 | Correct file, wrong mechanism |
| 4 | Correct file + mechanism (file/class precision) |
| 5 | Exact function-level match |

Scoring was done by an LLM judge against oracle answers, with 32 of 72 runs reviewed and adjudicated by a human.
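For concreteness, the rubric behaves like a strict precedence check: each score requires every coarser level below it to already match. A minimal sketch of that logic (the `Location` fields here are illustrative, not the eval's actual data model):

```python
from dataclasses import dataclass

@dataclass
class Location:
    area: str       # coarse repo area, e.g. a top-level package
    file: str       # path of the implicated file
    mechanism: str  # class or code path responsible for the bug
    function: str   # exact function containing the fix

def rl_score(pred: Location, oracle: Location) -> int:
    """Map a predicted bug location onto the 1-5 RL scale."""
    if pred.area != oracle.area:
        return 1  # wrong area entirely
    if pred.file != oracle.file:
        return 2  # correct repo area, wrong file
    if pred.mechanism != oracle.mechanism:
        return 3  # correct file, wrong mechanism
    if pred.function != oracle.function:
        return 4  # correct file + mechanism, not exact function
    return 5      # exact function-level match
```

In practice the LLM judge applies this rubric in prose against the oracle answer; the deterministic version above is only a sanity-check formulation of the scale.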


Repositories

| Repo | Language | Tasks |
|---|---|---|
| httpie/cli | Python | 6 |
| pallets/click | Python | 6 |
| pinterest/ktlint | Kotlin | 6 |
| cashapp/turbine | Kotlin | 3 |
| colinhacks/zod | TypeScript | 3 |

Results

Overall by Condition

| Condition | Mean RL | ≥4 (%) | Score distribution (score:count) |
|---|---|---|---|
| A — Heuristic grep | 3.08 | 54% | 1:3 · 2:6 · 3:2 · 4:12 · 5:1 |
| B — Structural index | 3.25 | 67% | 1:3 · 2:5 · 3:0 · 4:15 · 5:1 |
| C — Enriched index | 3.12 | 62% | 1:4 · 2:5 · 3:0 · 4:14 · 5:1 |

B outperforms A on the ≥4 rate by +13 percentage points (54% → 67%). The overall mean difference is modest (+0.17) and not statistically significant at this sample size — a larger task bank would be needed to confirm.
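The means and ≥4 rates follow directly from the score distributions. A quick Python check (counts copied from the table above):

```python
# Score distributions from the "Overall by Condition" table: {score: count}
dists = {
    "A": {1: 3, 2: 6, 3: 2, 4: 12, 5: 1},
    "B": {1: 3, 2: 5, 3: 0, 4: 15, 5: 1},
    "C": {1: 4, 2: 5, 3: 0, 4: 14, 5: 1},
}

for cond, dist in dists.items():
    n = sum(dist.values())  # 24 tasks per condition
    mean = sum(score * count for score, count in dist.items()) / n
    hit = sum(count for score, count in dist.items() if score >= 4) / n
    print(f"{cond}: mean RL {mean:.2f}, >=4 rate {hit:.0%}")
```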

By Repository

| Repo | A | B | C | B−A |
|---|---|---|---|---|
| httpie/cli | 2.50 | 3.33 | 2.83 | +0.83 |
| pallets/click | 4.17 | 4.00 | 4.00 | −0.17 |
| pinterest/ktlint | 3.67 | 3.67 | 3.67 | 0.00 |
| cashapp/turbine | 1.67 | 2.00 | 2.00 | +0.33 |
| colinhacks/zod | 2.33 | 2.00 | 2.00 | −0.33 |

By Task Size

| Size | A | B | C | B−A | Tasks |
|---|---|---|---|---|---|
| S (small) | 2.91 | 3.36 | 3.09 | +0.45 | 11 |
| M (medium) | 3.20 | 3.40 | 3.40 | +0.20 | 10 |
| L (large) | 3.33 | 2.33 | 2.33 | −1.00 | 3 |

Token Usage

The structural index (B) uses 50% fewer prompt tokens than heuristic grep (A) while scoring higher.

| Condition | Avg prompt tokens | Avg completion tokens |
|---|---|---|
| A | 1,521 | 479 |
| B | 761 | 520 |
| C | 1,257 | 491 |

This is a meaningful efficiency gain: B delivers better results at lower cost.
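The headline percentages are straightforward to derive from the table. A quick check (token counts copied from above; `savings_vs` is a throwaway helper, not part of the eval harness):

```python
# Average prompt tokens per condition, from the "Token Usage" table
prompt_tokens = {"A": 1521, "B": 761, "C": 1257}

def savings_vs(base: str, other: str) -> float:
    """Fractional prompt-token reduction of `other` relative to `base`."""
    return 1 - prompt_tokens[other] / prompt_tokens[base]

print(f"B vs A: {savings_vs('A', 'B'):.0%} fewer prompt tokens")        # ~50%
print(f"C vs B: {prompt_tokens['C'] / prompt_tokens['B'] - 1:.0%} more")  # ~65%
```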


Key Findings

1. The structural index (B) is the sweet spot. B consistently matches or beats A on score while cutting prompt tokens nearly in half. The ≥4 hit rate jumps from 54% to 67% — meaning the model correctly localises the bug file in 2 out of 3 tasks under B, vs. just over half under A.

2. Semantic enrichment (C) does not beat structural retrieval (B). C adds ~65% more prompt tokens compared to B but scores lower in 3 of 5 repos. The LLM-generated summaries appear to add noise rather than signal at this retrieval depth. This warrants further investigation — either the enrichment prompt needs tuning, or top-k needs to be reduced for C.

3. The index helps most where the model has least training familiarity. httpie/cli shows the largest B−A gain (+0.83). pallets/click is ceiling-bound across all conditions (≈4.0) — the model already knows Click well enough from training data that retrieval context is redundant. The index adds value specifically in less-prominent codebases.

4. Some repos are hard regardless of retrieval. cashapp/turbine (1.67–2.00) and colinhacks/zod (2.00–2.33) score low across all conditions. Two turbine tasks scored 1 under every condition — these likely require runtime/flow knowledge that static retrieval cannot provide. These may be out of scope for this eval design.

5. Large repos show a retrieval penalty. The 3 large-repo tasks show B−A = −1.00, though with only three tasks this may simply be noise. In larger codebases, the top-k retrieval window may be too narrow to surface the right file, or heuristic grep may happen to find better matches by chance. This is worth investigating with higher top-k values.


What This Means

The structural index is a net positive: better localisation, lower cost, no regressions on repos where the model is already strong. The enrichment layer needs rethinking before it earns its token budget.

Next steps: expand the task bank to reach statistical significance, tune the enrichment prompts, and test higher top-k values for large repos.
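On the significance question: once per-task scores are available, a paired sign-flip permutation test is a natural way to check whether the B−A gap holds up. A sketch on hypothetical per-task pairs (the report publishes only aggregates, so the `pairs` data below is made up for illustration):

```python
import random

# Hypothetical per-task RL scores, paired by task: (score under A, score under B).
# These are NOT the eval's actual per-task results.
pairs = [(2, 4), (4, 4), (1, 2), (4, 4), (3, 4), (2, 2), (4, 5), (1, 1)]

def paired_permutation_p(pairs, iters=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired score differences.

    Under the null (no condition effect), each task's B-A difference is
    equally likely to have either sign; we flip signs at random and count
    how often the permuted total is at least as extreme as the observed one.
    """
    rng = random.Random(seed)
    diffs = [b - a for a, b in pairs]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(iters):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed:
            hits += 1
    return hits / iters

print(f"p = {paired_permutation_p(pairs):.3f}")
```

With only 24 tasks the test will be underpowered for a +0.17 mean difference, which is consistent with the "not statistically significant at this sample size" caveat above.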


72 runs · 24 tasks · 5 repos · scores post human review (32/72 adjudicated) · March 2026