# Evaluation Summary
We ran a controlled evaluation across 24 bug-localisation tasks drawn from real GitHub issues in 5 open-source repositories. Each task was run under three conditions:
| Condition | Context provided to the model |
|---|---|
| A — Heuristic grep | Keyword search over raw source files |
| B — Structural index | AST-parsed index with file/function/class structure |
| C — Enriched index | Structural index + LLM-generated semantic summaries |
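The three conditions differ only in how context is assembled. Condition B can be approximated in a few lines with Python's `ast` module; the entry schema below is a hypothetical sketch, not the index format actually used in this eval:

```python
import ast

def structural_index(source: str, path: str) -> list[dict]:
    """Flat index of classes and functions, roughly as in condition B.

    Sketch only: the real index format is not specified in this report,
    so the entry schema here is an illustrative assumption.
    """
    entries = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            entries.append({
                "path": path,                 # file the symbol lives in
                "kind": type(node).__name__,  # ClassDef / FunctionDef / AsyncFunctionDef
                "name": node.name,
                "lineno": node.lineno,
            })
    return entries

index = structural_index("class Foo:\n    def bar(self):\n        pass\n", "foo.py")
```

Condition C would additionally attach an LLM-generated summary string to each entry; condition A skips the index entirely and greps raw source.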
Every task asked the model to identify the root cause of a bug and localise the fix to the correct file, class, or function — without seeing the solution.
Responses were scored on a 1–5 Retrieval-Localisation (RL) scale:
| Score | Meaning |
|---|---|
| 1 | Wrong area entirely |
| 2 | Correct repo area, wrong file |
| 3 | Correct file, wrong mechanism |
| 4 | Correct file + mechanism (file/class precision) |
| 5 | Exact function-level match |
Scoring was done by an LLM judge against oracle answers, with 32 of 72 runs reviewed and adjudicated by a human.
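The rubric can be stated as a decision ladder. The sketch below uses exact-match comparison as a simplification (the actual eval used an LLM judge on free-text answers), and the field names are illustrative assumptions:

```python
def rl_score(pred: dict, oracle: dict) -> int:
    """Apply the 1-5 Retrieval-Localisation rubric to a predicted bug location."""
    if pred.get("area") != oracle["area"]:
        return 1  # wrong area entirely
    if pred.get("file") != oracle["file"]:
        return 2  # correct repo area, wrong file
    if pred.get("mechanism") != oracle["mechanism"]:
        return 3  # correct file, wrong mechanism
    if pred.get("function") != oracle["function"]:
        return 4  # file + mechanism correct (file/class precision)
    return 5      # exact function-level match

oracle = {"area": "parser", "file": "core.py",
          "mechanism": "arg handling", "function": "parse_args"}
print(rl_score({**oracle, "function": "main"}, oracle))  # 4
```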
Tasks were drawn from five repositories:

| Repo | Language | Tasks |
|---|---|---|
| httpie/cli | Python | 6 |
| pallets/click | Python | 6 |
| pinterest/ktlint | Kotlin | 6 |
| cashapp/turbine | Kotlin | 3 |
| colinhacks/zod | TypeScript | 3 |
Overall results by condition:

| Condition | Mean RL | ≥4 (%) | Score distribution |
|---|---|---|---|
| A — Heuristic grep | 3.08 | 54% | 1:3 · 2:6 · 3:2 · 4:12 · 5:1 |
| B — Structural index | 3.25 | 67% | 1:3 · 2:5 · 3:0 · 4:15 · 5:1 |
| C — Enriched index | 3.12 | 62% | 1:4 · 2:5 · 3:0 · 4:14 · 5:1 |
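The summary statistics can be re-derived directly from the score distributions; for example, condition B:

```python
def summarize(dist: dict[int, int]) -> tuple[float, int]:
    """Mean RL and >=4 hit rate (%) from a score -> count distribution."""
    n = sum(dist.values())
    mean = sum(score * count for score, count in dist.items()) / n
    hit_rate = 100 * sum(count for score, count in dist.items() if score >= 4) / n
    return round(mean, 2), round(hit_rate)

# Condition B's distribution from the table above (score 3 had zero runs)
print(summarize({1: 3, 2: 5, 3: 0, 4: 15, 5: 1}))  # (3.25, 67)
```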
B outperforms A on the ≥4 rate by +13 percentage points (54% → 67%). The overall mean difference is modest (+0.17) and not statistically significant at this sample size — a larger task bank would be needed to confirm.
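The significance caveat can be checked directly. The sketch below runs an unpaired two-sided permutation test over per-run scores expanded from the distributions above; note the real eval is paired by task, so a paired test would be more powerful:

```python
import random

def permutation_test(a: list[int], b: list[int],
                     iters: int = 10_000, seed: int = 0) -> float:
    """Two-sided permutation test on the difference of mean scores."""
    rng = random.Random(seed)
    observed = sum(b) / len(b) - sum(a) / len(a)
    pooled = a + b
    extreme = 0
    for _ in range(iters):
        rng.shuffle(pooled)
        diff = sum(pooled[:len(b)]) / len(b) - sum(pooled[len(b):]) / len(a)
        if abs(diff) >= abs(observed):
            extreme += 1
    return extreme / iters

# Per-run scores expanded from the score distributions above
a = [1] * 3 + [2] * 6 + [3] * 2 + [4] * 12 + [5] * 1  # condition A
b = [1] * 3 + [2] * 5 + [4] * 15 + [5] * 1            # condition B
p = permutation_test(a, b)
```

On these samples the p-value comes out well above 0.05, consistent with the caveat.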
Mean RL by repo:

| Repo | A | B | C | B−A |
|---|---|---|---|---|
| httpie/cli | 2.50 | 3.33 | 2.83 | +0.83 |
| pallets/click | 4.17 | 4.00 | 4.00 | −0.17 |
| pinterest/ktlint | 3.67 | 3.67 | 3.67 | 0.00 |
| cashapp/turbine | 1.67 | 2.00 | 2.00 | +0.33 |
| colinhacks/zod | 2.33 | 2.00 | 2.00 | −0.33 |
Mean RL by task size:

| Size | A | B | C | B−A | Tasks |
|---|---|---|---|---|---|
| S (small) | 2.91 | 3.36 | 3.09 | +0.45 | 11 |
| M (medium) | 3.20 | 3.40 | 3.40 | +0.20 | 10 |
| L (large) | 3.33 | 2.33 | 2.33 | −1.00 | 3 |
The structural index (B) uses 50% fewer prompt tokens than heuristic grep (A) while scoring higher.
| Condition | Avg prompt tokens | Avg completion tokens |
|---|---|---|
| A | 1,521 | 479 |
| B | 761 | 520 |
| C | 1,257 | 491 |
This is a meaningful efficiency gain: B delivers better results at lower cost.
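The headline percentages follow directly from the token table:

```python
tokens = {"A": 1521, "B": 761, "C": 1257}  # avg prompt tokens per condition

b_saving = 1 - tokens["B"] / tokens["A"]    # how much B saves relative to A
c_overhead = tokens["C"] / tokens["B"] - 1  # how much C adds relative to B
print(f"B saves {b_saving:.0%} vs A; C adds {c_overhead:.0%} vs B")
# → B saves 50% vs A; C adds 65% vs B
```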
1. The structural index (B) is the sweet spot. B consistently matches or beats A on score while cutting prompt tokens nearly in half. The ≥4 hit rate jumps from 54% to 67% — meaning the model correctly localises the bug file in 2 out of 3 tasks under B, vs. just over half under A.
2. Semantic enrichment (C) does not beat structural retrieval (B). C adds ~65% more prompt tokens compared to B but scores lower in 3 of 5 repos. The LLM-generated summaries appear to add noise rather than signal at this retrieval depth. This warrants further investigation — either the enrichment prompt needs tuning, or top-k needs to be reduced for C.
3. The index helps most where the model has least training familiarity. httpie/cli shows the largest B−A gain (+0.83), while pallets/click is ceiling-bound across all conditions (≈4.0): the model already knows Click well enough from training data that retrieval context is redundant. The index adds value specifically in less-prominent codebases.
4. Some repos are hard regardless of retrieval. cashapp/turbine (1.67–2.00) and colinhacks/zod (2.00–2.33) score low across all conditions. Two turbine tasks scored 1 under every condition; these likely require runtime/flow knowledge that static retrieval cannot provide, and may be out of scope for this eval design.
5. Large repos show a retrieval penalty. The 3 large-repo tasks show B−A = −1.00. In larger codebases, the top-k retrieval window may be too narrow to surface the right file, or the heuristic grep happens to find better matches by luck. This is worth investigating with higher top-k values.
The structural index is a net positive: better localisation, lower cost, no regressions on repos where the model is already strong. The enrichment layer needs rethinking before it earns its token budget.
Next steps: expand the task bank to reach statistical significance, tune the enrichment prompts, and test higher top-k values for large repos.
72 runs · 24 tasks · 5 repos · scores post human review (32/72 adjudicated) · March 2026