fix(recall): exclude self-match and make ef_search pooler-safe#1
Open
geekyfox90 wants to merge 1 commit into
Open
fix(recall): exclude self-match and make ef_search pooler-safe#1geekyfox90 wants to merge 1 commit into
geekyfox90 wants to merge 1 commit into
Conversation
Two correctness fixes in the recall path: 1. Self-match inflation. Query vectors are sampled straight from the indexed table, so each query's nearest neighbour is the row itself (distance 0) — a guaranteed hit in both the index result and the ground truth that inflates recall@k. sampleQueryVectors now also returns each vector's ctid; recall fetches k+1 and drops that self ctid from both sets, measuring the k real neighbours. 2. ef_search ignored behind a transaction pooler. The ef_search sweep set a session-level `SET hnsw.ef_search` on autocommit queries; through a transaction pooler (PgBouncer / Supabase pooler) consecutive statements can land on different backends, so the sweep silently ran at the server default. Each ef level now runs inside one transaction using `SET LOCAL hnsw.ef_search` (one BEGIN/COMMIT per level, no per-query overhead), which is pinned and honoured on both direct and pooled connections. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What & why
Two correctness fixes in the recall path, found during a review of the benchmark methodology.
1. Self-match inflates recall@k
sampleQueryVectorsdraws query vectors straight from the indexed table, so each query's nearest neighbour is the row itself (distance 0). That self-match is a guaranteed hit in both the index result and the exact ground truth, so recall@k always includes one "free" hit (a useless index still scores ≥1/k; a real 7-of-9 shows as 8/10).Fix:
sampleQueryVectorsnow also returns each vector'sctid. Recall fetchesk+1and drops the selfctidfrom both the index result and the ground truth, so we measure the k real neighbours.2.
ef_searchsilently ignored behind a transaction poolerThe ef_search sweep applied a session-level
SET hnsw.ef_searchon otherwise-autocommit queries. Through a transaction pooler (PgBouncer / the Supabase pooler), consecutive statements can land on different backends, so theSETdoesn't persist — the whole--ef-searchsweep silently runs at the server default (identical rows).Fix: each ef level now runs inside one transaction using
SET LOCAL hnsw.ef_search(one BEGIN/COMMIT per level — no per-query overhead). This is pinned and honoured on both direct and pooled connections. (Matches the pattern the ground-truth seq-scan path already uses.)Notes
go vetclean. Removed a now-deadscanIDsConnhelper.🤖 Generated with Claude Code