Research: vector index for search performance at Context7-scale corpora #45

Description

@laradji

Investigate options for an actual vector index (HNSW, IVF, or equivalent) over the docs.embedding F32_BLOB column. Today db.SearchByEmbedding does a linear scan via vector_distance_cos. That is fine at hundreds of vectors, but the scan dominates query latency from roughly 100k vectors upward, which is well within the target scale.
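
For intuition, here is a minimal in-process sketch of what a linear scan amounts to. It is not the tursogo query path: `cosineDistance` and `linearScanTopK` are hypothetical helpers written for illustration, and the real scan runs inside the database via vector_distance_cos. The point is only that every query touches every vector.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// cosineDistance returns 1 - cos(a, b); smaller means more similar.
// It assumes a and b have the same length.
func cosineDistance(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 1 // treat zero vectors as maximally dissimilar
	}
	return 1 - dot/(math.Sqrt(na)*math.Sqrt(nb))
}

// linearScanTopK scores every vector against the query and returns the
// indices of the k closest ones. Work grows linearly with len(vectors).
func linearScanTopK(query []float32, vectors [][]float32, k int) []int {
	ids := make([]int, len(vectors))
	dist := make([]float64, len(vectors))
	for i, v := range vectors {
		ids[i] = i
		dist[i] = cosineDistance(query, v)
	}
	sort.SliceStable(ids, func(x, y int) bool { return dist[ids[x]] < dist[ids[y]] })
	if k > len(ids) {
		k = len(ids)
	}
	if k < 0 {
		k = 0
	}
	return ids[:k]
}

func main() {
	vectors := [][]float32{{1, 0}, {0, 1}, {0.9, 0.1}}
	fmt.Println(linearScanTopK([]float32{1, 0}, vectors, 2)) // prints [0 2]
}
```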

Parent: #15

Why now (ish)

Deadzone targets a Context7-scale corpus (~33k libs eventually, 2-3k near term). The math:

| Corpus | Vectors | Bytes scanned per query | Linear scan latency |
|---|---|---|---|
| 1 lib × 38 docs | 38 | ~58 KB | <1 ms |
| 50 libs × 100 docs | 5,000 | ~7.5 MB | ~10-20 ms |
| 500 libs × 100 docs | 50,000 | ~75 MB | ~100-200 ms |
| 3000 libs × 100 docs | 300,000 | ~450 MB | seconds |
| 33000 libs × 100 docs | 3.3M | ~5 GB | tens of seconds |

(Based on 384-dim float32 = 1.5 KB per vector, plus the row metadata.)
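
The bytes-scanned figures follow directly from the vector size. A small sketch of the arithmetic, counting raw embedding bytes only (row metadata is ignored, so real numbers are somewhat higher); `vectorBytes` is a hypothetical helper, not part of the codebase:

```go
package main

import "fmt"

const (
	dims          = 384 // embedding dimensions, from the note above
	bytesPerFloat = 4   // float32
)

// vectorBytes returns the raw embedding bytes a full scan must read
// for n vectors. Row metadata is not included.
func vectorBytes(n int) int64 {
	return int64(n) * dims * bytesPerFloat
}

func main() {
	for _, n := range []int{38, 5_000, 50_000, 300_000, 3_300_000} {
		fmt.Printf("%9d vectors -> %8.1f MB\n", n, float64(vectorBytes(n))/1e6)
	}
}
```

Each vector is 1,536 bytes, so 3.3M vectors come to about 5.07 GB, consistent with the ~5 GB in the last row.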

Even at the near-term 2-3k libs target, MCP query latency starts crossing the threshold where the LLM client perceives lag. At Context7-scale, linear scan is unviable. We need an actual ANN index before we get there.

docs/research/tursogo-migration.md already noted this:

Vector indexes are NOT yet implemented in turso — the docs say "All similarity searches use a linear scan over the table." For deadzone's small corpus (a handful of repos worth of markdown) this is fine. It would be a problem at >100k snippets.

This issue is the "let's not be surprised when we get there" research.

Areas to investigate

1. Tursogo / Turso roadmap

  • Is vector index support on the tursogo roadmap? Track upstream issues and discussions.
  • ETA, if any. If they ship one in the next 6-12 months, the answer might be "wait".
  • Quality of the index — IVF vs HNSW vs proprietary, recall/latency trade-offs.

2. Application-level index (in-memory HNSW)

  • Load all vectors into a Go-native HNSW index at server startup.
  • Libraries: hnsw, gomlx's vector ops, weaviate/hnswlib-go.
  • Pros: works today, no Turso dependency, fast.
  • Cons: rebuild on startup (latency cost), memory cost (vectors live in RAM AND on disk), index drift between scrape and serve.

3. Sidecar vector DB

  • Spin up qdrant / chroma / weaviate / milvus alongside Deadzone, write vectors there at scrape time, query from server.
  • Pros: production-grade vector indices, well-supported.
  • Cons: breaks the "single binary, no sidecar" promise we've been protecting. A sidecar was explicitly rejected for the embedder layer in #2 ("Real embedder via hugot (pure Go, GoMLX backend) to replace the stub"); the same logic applies here unless the wins are dramatic.

4. Switch storage entirely

  • Replace tursogo with a vector-native DB that has indices today (qdrant, lancedb, milvus, ...).
  • Pros: solves the problem permanently.
  • Cons: throws away the architecture we just built around tursogo. Major rework.

5. Hybrid: tursogo for everything, in-memory HNSW only for vectors

  • Keep tursogo for the docs table + metadata + management.
  • Mirror the embedding column into an in-memory HNSW index alongside.
  • The DB stays the source of truth; the index is rebuilt on startup or on demand.
  • This is the "least disruption" path that gets us real ANN performance.
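
A rough sketch of how option 5 could be wired, under stated assumptions: `Row`, `Index`, `Mirror`, and `flatIndex` are all hypothetical names, and `flatIndex` is an exact dot-product ranking used as a placeholder where a real HNSW library would plug in (it assumes normalized embeddings). What the sketch shows is the shape: the database remains the source of truth, and the in-memory index is rebuilt from its rows and swapped in atomically.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Row is a hypothetical projection of the docs table: id plus embedding.
type Row struct {
	ID        int64
	Embedding []float32
}

// Index is a hypothetical search interface; an HNSW library would sit behind it.
type Index interface {
	Search(query []float32, k int) []int64
}

// Mirror holds the in-memory index and replaces it wholesale on rebuild.
type Mirror struct {
	mu    sync.RWMutex
	idx   Index
	build func([]Row) Index
}

func NewMirror(build func([]Row) Index) *Mirror { return &Mirror{build: build} }

// Rebuild constructs a fresh index from rows read out of the database and
// swaps it in, so readers never see a half-built index.
func (m *Mirror) Rebuild(rows []Row) {
	idx := m.build(rows) // built outside the lock so searches are not blocked
	m.mu.Lock()
	m.idx = idx
	m.mu.Unlock()
}

// Search returns nil until the first Rebuild has completed.
func (m *Mirror) Search(query []float32, k int) []int64 {
	m.mu.RLock()
	idx := m.idx
	m.mu.RUnlock()
	if idx == nil {
		return nil
	}
	return idx.Search(query, k)
}

// flatIndex is an exact stand-in, NOT an ANN index.
type flatIndex struct{ rows []Row }

func buildFlat(rows []Row) Index { return flatIndex{rows: append([]Row(nil), rows...)} }

func dot(a, b []float32) float64 {
	var s float64
	for i := range a {
		s += float64(a[i]) * float64(b[i])
	}
	return s
}

func (f flatIndex) Search(query []float32, k int) []int64 {
	order := make([]int, len(f.rows))
	for i := range order {
		order[i] = i
	}
	sort.SliceStable(order, func(x, y int) bool {
		return dot(query, f.rows[order[x]].Embedding) > dot(query, f.rows[order[y]].Embedding)
	})
	if k > len(order) {
		k = len(order)
	}
	if k < 0 {
		k = 0
	}
	ids := make([]int64, 0, k)
	for _, i := range order[:k] {
		ids = append(ids, f.rows[i].ID)
	}
	return ids
}

func main() {
	m := NewMirror(buildFlat)
	m.Rebuild([]Row{
		{ID: 10, Embedding: []float32{1, 0}},
		{ID: 20, Embedding: []float32{0, 1}},
		{ID: 30, Embedding: []float32{0.6, 0.8}},
	})
	fmt.Println(m.Search([]float32{0, 1}, 2)) // prints [20 30]
}
```

Keeping the index behind a small interface would also let the spike compare the exact stand-in and an ANN implementation on the same queries.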

Output

A research note in docs/research/ (sibling of tursogo-migration.md) that:

  1. Documents the latency curve we're heading toward
  2. Picks one of the five options above with justification
  3. Spikes the chosen approach (probably option 5) with a quick perf comparison against the current linear scan
  4. Files concrete follow-up issues for the implementation
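
For the perf comparison in step 3, the spike will likely want two numbers per approach: latency and recall against the exact linear-scan result. A minimal sketch of both measurements, with hypothetical helper names (`recallAtK`, `meanLatency`) and toy ids in place of real search results:

```go
package main

import (
	"fmt"
	"time"
)

// recallAtK reports the fraction of the exact top-k ids that the
// candidate result also returned.
func recallAtK(exact, candidate []int64) float64 {
	if len(exact) == 0 {
		return 1
	}
	seen := make(map[int64]struct{}, len(candidate))
	for _, id := range candidate {
		seen[id] = struct{}{}
	}
	hits := 0
	for _, id := range exact {
		if _, ok := seen[id]; ok {
			hits++
		}
	}
	return float64(hits) / float64(len(exact))
}

// meanLatency runs search n times and returns the average wall-clock time.
func meanLatency(n int, search func()) time.Duration {
	if n <= 0 {
		return 0
	}
	start := time.Now()
	for i := 0; i < n; i++ {
		search()
	}
	return time.Since(start) / time.Duration(n)
}

func main() {
	exact := []int64{1, 2, 3, 4}  // ids from the linear scan (ground truth)
	approx := []int64{1, 2, 5, 6} // ids from a hypothetical ANN index
	fmt.Printf("recall@4 = %.2f\n", recallAtK(exact, approx)) // prints recall@4 = 0.50
	fmt.Println("mean latency:", meanLatency(100, func() { _ = recallAtK(exact, approx) }))
}
```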

When to act

  • Now: this research issue. Cheap.
  • At ~50 libs / 5k vectors: re-evaluate. Linear scan still OK but the trend is visible.
  • At ~500 libs / 50k vectors: implementation must land. Linear scan is becoming a perceptible delay.
  • Before ~3000 libs: implementation must be merged and load-tested.

Acceptance criteria

  • Research note in docs/research/vector-index.md
  • Picked approach has a measured baseline vs the current linear scan (spike numbers in the note)
  • Concrete follow-up issue filed for the chosen implementation
  • If "wait for tursogo upstream" is the chosen path, includes a tracking note + check-in cadence

Labels: P2 (Normal: clear value, not urgent), research (Research / spike)