Skip to content

vector_top_k JOIN with regular table causes pathological query plan (3700x slower than subquery) #2212

@eja-assistant

Description

@eja-assistant

Summary

When joining vector_top_k() with a regular table using JOIN ... ON, the SQLite query planner inverts the join order when the regular table has WHERE predicates. This causes the virtual table xFilter to be called once per row of the regular table instead of once total, resulting in catastrophic performance degradation.

Reproduction

Database: ~12,000 entries with 512-dim float32 embeddings, WAL mode.

-- Index definition
CREATE INDEX idx_entries_embedding ON entries (
  libsql_vector_idx(embedding, 'metric=cosine', 'compress_neighbors=float8', 'max_neighbors=50')
);

-- FAST (10ms): vector_top_k alone
SELECT id FROM vector_top_k('idx_entries_embedding', vector32(?), 50);

-- FAST (10ms): JOIN without WHERE clause on entries
SELECT e.id FROM vector_top_k('idx_entries_embedding', vector32(?), 50) AS v
JOIN entries AS e ON e.rowid = v.id;

-- SLOW (26,000ms, 100% CPU): JOIN with WHERE clause on entries
SELECT e.id FROM vector_top_k('idx_entries_embedding', vector32(?), 50) AS v
JOIN entries AS e ON e.rowid = v.id
WHERE e.superseded_by IS NULL;

-- FAST (6ms): CROSS JOIN forces correct order
SELECT e.id FROM vector_top_k('idx_entries_embedding', vector32(?), 50) AS v
CROSS JOIN entries AS e ON e.rowid = v.id
WHERE e.superseded_by IS NULL;

-- FAST (11ms): Subquery avoids join entirely
SELECT * FROM entries
WHERE rowid IN (SELECT id FROM vector_top_k('idx_entries_embedding', vector32(?), 50))
  AND superseded_by IS NULL;

Analysis

When the entries table has a WHERE predicate (superseded_by IS NULL), the query planner decides to scan entries first and probe vector_top_k in the inner loop. With ~11,500 qualifying rows, this means ~11,500 ANN searches instead of 50 rowid lookups.

V8 profiler confirms 99.9% of CPU time is spent in the libsql native module executing repeated vector searches.

The root cause appears to be that vector_top_k's xBestIndex does not report a high enough cost to prevent the planner from placing it in the inner loop when there are competing predicates on the joined table.

Workarounds

  1. Use CROSS JOIN instead of JOIN (forces left-to-right evaluation)
  2. Use WHERE rowid IN (SELECT id FROM vector_top_k(...)) instead of JOIN
  3. Split into two application-level queries

Environment

  • libsql native: 0.5.22
  • @libsql/client: 0.17.0
  • macOS arm64 (Apple M2)
  • Node.js v25.5.0

Notes

This is the same class of issue documented for FTS5 virtual tables on the SQLite forum:

The Turso DiskANN blog post (https://turso.tech/blog/approximate-nearest-neighbor-search-with-diskann-in-libsql) shows the WHERE rowid IN pattern first, suggesting awareness of this issue.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions