Investigate options for an actual vector index (HNSW, IVF, or equivalent) over the docs.embedding F32_BLOB column. Today db.SearchByEmbedding uses a linear scan via vector_distance_cos, which is fine at hundreds of vectors but becomes the dominant query latency around 100k+ vectors — i.e. well within the target scale.
Parent: #15
Why now (ish)
Deadzone targets a Context7-scale corpus (~33k libs eventually, 2-3k near term). The math:
| Corpus | Vectors | Bytes scanned per query | Linear scan latency |
| --- | --- | --- | --- |
| 1 lib × 38 docs | 38 | ~58 KB | <1 ms |
| 50 libs × 100 docs | 5,000 | ~7.5 MB | ~10-20 ms |
| 500 libs × 100 docs | 50,000 | ~75 MB | ~100-200 ms |
| 3000 libs × 100 docs | 300,000 | ~450 MB | seconds |
| 33000 libs × 100 docs | 3.3M | ~5 GB | tens of seconds |
(Based on 384-dim float32 = 1.5 KB per vector, plus the row metadata.)
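The "bytes scanned" column can be reproduced with a few lines of Go. This is a minimal sketch that counts only raw embedding bytes (384 dimensions × 4 bytes per float32) and ignores row metadata; the function and constant names are illustrative, not taken from the deadzone codebase.

```go
package main

import "fmt"

const (
	dims          = 384 // embedding dimensions, as stated in the note above
	bytesPerFloat = 4   // size of one float32
	docsPerLib    = 100 // per-library doc count used in the table rows
)

// scanBytes returns the raw embedding bytes a full linear scan has to read
// for the given number of vectors. Row metadata is deliberately excluded.
func scanBytes(vectors int) int {
	return vectors * dims * bytesPerFloat
}

func main() {
	for _, libs := range []int{50, 500, 3000, 33000} {
		vectors := libs * docsPerLib
		mb := float64(scanBytes(vectors)) / 1e6
		fmt.Printf("%6d libs  %8d vectors  %8.1f MB\n", libs, vectors, mb)
	}
}
```

The results land slightly above the table's rounded figures because one vector is exactly 1,536 bytes rather than 1,500.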
Even at the near-term 2-3k libs target, MCP query latency starts crossing the threshold where the LLM client perceives lag. At Context7-scale, linear scan is unviable. We need an actual ANN index before we get there.
docs/research/tursogo-migration.md already noted this: "Vector indexes are NOT yet implemented in turso: the docs say 'All similarity searches use a linear scan over the table.' For deadzone's small corpus (a handful of repos worth of markdown) this is fine. It would be a problem at >100k snippets."
This issue is the "let's not be surprised when we get there" research.
Areas to investigate
1. Tursogo / Turso roadmap
Is vector index support on the tursogo roadmap? Track upstream issues and discussions.
ETA, if any. If they ship one in the next 6-12 months, the answer might be "wait".
Quality of the index — IVF vs HNSW vs proprietary, recall/latency trade-offs.
2. Application-level index (in-memory HNSW)
Load all vectors into a Go-native HNSW index at server startup.
hnsw, gomlx's vector ops, weaviate/hnswlib-go.
3. Sidecar vector DB
4. Switch storage entirely
5. Hybrid: tursogo for everything, in-memory HNSW only for vectors
Output
A research note in docs/research/ (sibling of tursogo-migration.md) that:
When to act
Acceptance criteria
docs/research/vector-index.md
Related
docs/research/tursogo-migration.md: flags the linear-scan limitation explicitly