fix(recall): exclude self-match and make ef_search pooler-safe #1

Open
geekyfox90 wants to merge 1 commit into main from fix/recall-self-match-and-ef-search

@geekyfox90

Contributor

What & why

Two correctness fixes in the recall path, found during a review of the benchmark methodology.

1. Self-match inflates recall@k

sampleQueryVectors draws query vectors straight from the indexed table, so each query's nearest neighbour is the row itself (distance 0). That self-match is a guaranteed hit in both the index result and the exact ground truth, so recall@k always includes one "free" hit: a useless index still scores ≥1/k, and at k=10 an index that finds 7 of the 9 real neighbours reports 8/10.

Fix: sampleQueryVectors now also returns each vector's ctid. Recall fetches k+1 and drops the self ctid from both the index result and the ground truth, so we measure the k real neighbours.
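A minimal sketch of the self-excluding recall computation described above, not the code in this PR. The function name recallExcludingSelf, the use of plain strings for ctid values, and the sample identifiers are all assumptions made for illustration. Both lists are expected to hold k+1 row identifiers ordered by distance.

```go
package main

import "fmt"

// recallExcludingSelf (hypothetical helper) computes recall@k after dropping
// the query's own row from both the index result and the exact ground truth.
// Row identifiers are modelled as strings such as "(0,1)"; that is an assumption.
func recallExcludingSelf(indexIDs, truthIDs []string, selfID string, k int) float64 {
	if k <= 0 {
		return 0
	}
	// Collect the first k ground-truth neighbours that are not the query row.
	truth := make(map[string]struct{}, k)
	for _, id := range truthIDs {
		if id == selfID {
			continue
		}
		if len(truth) == k {
			break
		}
		truth[id] = struct{}{}
	}
	// Count hits among the first k index results that are not the query row.
	hits, kept := 0, 0
	for _, id := range indexIDs {
		if id == selfID {
			continue
		}
		if kept == k {
			break
		}
		kept++
		if _, ok := truth[id]; ok {
			hits++
		}
	}
	return float64(hits) / float64(k)
}

func main() {
	self := "(0,1)"
	truth := []string{"(0,1)", "(0,2)", "(0,3)", "(0,4)", "(0,5)"}
	index := []string{"(0,1)", "(0,2)", "(9,9)", "(0,3)", "(8,8)"}
	// k = 4: two of the four real neighbours are found, so recall is 2/4.
	fmt.Println(recallExcludingSelf(index, truth, self, 4))
}
```

Truncating to k after the drop also covers the case where the approximate index fails to return the self row: the list then holds k+1 real candidates, and only the first k are scored.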

2. ef_search silently ignored behind a transaction pooler

The ef_search sweep applied a session-level SET hnsw.ef_search on otherwise-autocommit queries. Through a transaction pooler (PgBouncer / the Supabase pooler), consecutive statements can land on different backends, so the SET doesn't persist. The whole --ef-search sweep then silently runs at the server default and reports identical rows for every ef value.

Fix: each ef level now runs inside one transaction using SET LOCAL hnsw.ef_search (one BEGIN/COMMIT per level, so no per-query overhead). A transaction pooler pins a whole transaction to a single backend, so the setting is honoured on both direct and pooled connections. (This matches the pattern the ground-truth seq-scan path already uses.)
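The per-level transaction pattern can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: it uses the standard database/sql package (the project's actual driver is not stated here), and the names setLocalEfSearchSQL and runAtEfSearch are hypothetical. SET does not accept bind parameters, so the integer is formatted into the statement text.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
)

// setLocalEfSearchSQL (hypothetical helper) builds the statement text.
// Formatting with %d keeps the value a plain integer, since SET cannot
// take a bind parameter.
func setLocalEfSearchSQL(ef int) string {
	return fmt.Sprintf("SET LOCAL hnsw.ef_search = %d", ef)
}

// runAtEfSearch (hypothetical helper) runs every query of one ef level inside
// a single transaction, so the SET LOCAL and the queries share one backend
// even behind a transaction pooler.
func runAtEfSearch(ctx context.Context, db *sql.DB, ef int, run func(tx *sql.Tx) error) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // returns sql.ErrTxDone after a successful Commit; ignored

	if _, err := tx.ExecContext(ctx, setLocalEfSearchSQL(ef)); err != nil {
		return err
	}
	// All recall queries for this ef level go through tx.
	if err := run(tx); err != nil {
		return err
	}
	return tx.Commit()
}

func main() {
	// Only the statement text is shown here; runAtEfSearch needs a live database.
	fmt.Println(setLocalEfSearchSQL(40))
}
```

Because SET LOCAL lasts only until the end of the transaction, nothing leaks into later work on the same pooled backend once the level commits or rolls back.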

Notes

  • Latency and throughput measurements are unchanged: they run at the default ef, so the hot timing paths are untouched.
  • Builds clean, go vet clean. Removed a now-dead scanIDsConn helper.

🤖 Generated with Claude Code

Two correctness fixes in the recall path:

1. Self-match inflation. Query vectors are sampled straight from the indexed
   table, so each query's nearest neighbour is the row itself (distance 0) — a
   guaranteed hit in both the index result and the ground truth that inflates
   recall@k. sampleQueryVectors now also returns each vector's ctid; recall
   fetches k+1 and drops that self ctid from both sets, measuring the k real
   neighbours.

2. ef_search ignored behind a transaction pooler. The ef_search sweep set a
   session-level `SET hnsw.ef_search` on autocommit queries; through a
   transaction pooler (PgBouncer / Supabase pooler) consecutive statements can
   land on different backends, so the sweep silently ran at the server default.
   Each ef level now runs inside one transaction using `SET LOCAL hnsw.ef_search`
   (one BEGIN/COMMIT per level, no per-query overhead), which is pinned and
   honoured on both direct and pooled connections.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>