pgvector: forward-migration support + multi-chunk read-path dedup #9

Description

@stxkxs

Deferred from the quality-audit fix pass (#6). Two related gaps that block a real multi-chunk ingester.

Location: src/rag/backends/pgvector-schema.ts, src/rag/backends/pgvector.ts, src/rag/retriever.ts

Problem 1 — no migration story: initSchema is create-only (CREATE ... IF NOT EXISTS). The schema now keys chunks on a composite PRIMARY KEY (doc_id, chunk_index), but an existing DB created under the old doc_id PRIMARY KEY is never migrated: CREATE ... IF NOT EXISTS skips the existing table, so it silently keeps the old PK. The seeder's ON CONFLICT (doc_id, chunk_index) then fails at runtime, because Postgres finds no unique constraint matching that conflict target. There is no schema_migrations table and no up/down migration path for any future column or embedding-dimension change.
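To make the failure mode easier to detect, here is a minimal sketch of a guard that compares the table's live primary-key columns against the expected composite key. The catalog query follows the usual pg_index / pg_attribute pattern; the function name `needsPkMigration` and the idea of passing the table name as `$1` are assumptions for illustration, not code from this repo.

```typescript
// Expected key after the schema change described above.
const EXPECTED_PK: readonly string[] = ["doc_id", "chunk_index"];

// Lists the primary-key columns of the table passed as $1 (e.g. via a pg client).
// Column order is not guaranteed here, so the check below compares as sets.
const PK_COLUMNS_SQL = `
SELECT a.attname
FROM pg_index i
JOIN pg_attribute a
  ON a.attrelid = i.indrelid AND a.attnum = ANY(i.indkey)
WHERE i.indrelid = $1::regclass AND i.indisprimary`;

// Hypothetical helper: true when the live PK differs from the expected one,
// i.e. when ON CONFLICT (doc_id, chunk_index) would have no matching constraint.
function needsPkMigration(actualPkColumns: readonly string[]): boolean {
  if (actualPkColumns.length !== EXPECTED_PK.length) return true;
  const have = new Set(actualPkColumns);
  return EXPECTED_PK.some((col) => !have.has(col));
}
```

A startup check along these lines could fail fast with a clear message instead of letting the seeder hit the ON CONFLICT error later.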

Problem 2 — LIMIT before dedup: the read SQL (pgvector.ts KNN/TEXT) does ORDER BY ... LIMIT k, and RRF fusion dedups by doc_id only after the LIMIT (retriever.ts). With a real multi-chunk ingester, several chunks of one document can consume the top-k slots and crowd out other documents. Latent today because the demo seeder writes one chunk per doc (chunk_index = 0).
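The crowding effect can be reproduced without a database. The sketch below is a simplified stand-in for the read path (it is not the actual pgvector.ts or retriever.ts code, and the `Hit` shape and scores are made up): it compares taking the top k chunks and then deduplicating by doc_id against deduplicating first.

```typescript
interface Hit {
  docId: string;
  chunkIndex: number;
  score: number; // higher is better
}

// Current order of operations: top-k chunks first, dedup by doc_id afterwards.
function limitThenDedup(hits: Hit[], k: number): string[] {
  const topK = [...hits].sort((a, b) => b.score - a.score).slice(0, k);
  return Array.from(new Set(topK.map((h) => h.docId)));
}

// Alternative: keep only the best chunk per document, then take the top-k documents.
function dedupThenLimit(hits: Hit[], k: number): string[] {
  const best = new Map<string, Hit>();
  for (const h of hits) {
    const cur = best.get(h.docId);
    if (cur === undefined || h.score > cur.score) best.set(h.docId, h);
  }
  return Array.from(best.values())
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((h) => h.docId);
}

// Document A has three strong chunks; B and C have one chunk each.
const hits: Hit[] = [
  { docId: "A", chunkIndex: 0, score: 0.95 },
  { docId: "A", chunkIndex: 1, score: 0.93 },
  { docId: "A", chunkIndex: 2, score: 0.91 },
  { docId: "B", chunkIndex: 0, score: 0.9 },
  { docId: "C", chunkIndex: 0, score: 0.85 },
];

console.log(limitThenDedup(hits, 3)); // [ 'A' ]
console.log(dedupThenLimit(hits, 3)); // [ 'A', 'B', 'C' ]
```

With k = 3, the LIMIT-first path returns a single document because all three slots go to chunks of A, while the dedup-first path returns three distinct documents.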

Proposed fix: add a lightweight versioned-migrations mechanism (a schema_migrations table + ordered up/down files), and add DISTINCT ON (doc_id), or a wider candidate pool, before fusion so that several chunks of one document cannot crowd other documents out of the top k.
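One possible shape for both halves of the fix, as a rough sketch rather than a design. Everything named here is an assumption: the table `rag_chunks`, the column `embedding`, the default constraint name `rag_chunks_pkey`, and the helper `pendingMigrations`. The read query combines the two ideas from the proposal: it over-fetches a candidate pool with pgvector's `<=>` cosine-distance operator, then keeps the closest chunk per document with DISTINCT ON.

```typescript
interface Migration {
  version: number;
  name: string;
  up: string;   // SQL applied when moving forward
  down: string; // SQL applied when rolling back
}

// Bookkeeping table recording which versions have been applied.
const SCHEMA_MIGRATIONS_DDL = `
CREATE TABLE IF NOT EXISTS schema_migrations (
  version    integer PRIMARY KEY,
  applied_at timestamptz NOT NULL DEFAULT now()
)`;

// Hypothetical first migration: old doc_id PK -> composite (doc_id, chunk_index).
// Assumes the old PK was created unnamed, so Postgres named it <table>_pkey.
const MIGRATIONS: Migration[] = [
  {
    version: 1,
    name: "composite_chunk_pk",
    up: `
ALTER TABLE rag_chunks ADD COLUMN IF NOT EXISTS chunk_index integer NOT NULL DEFAULT 0;
ALTER TABLE rag_chunks DROP CONSTRAINT rag_chunks_pkey;
ALTER TABLE rag_chunks ADD PRIMARY KEY (doc_id, chunk_index);`,
    // Rolling back fails if any document already has more than one chunk.
    down: `
ALTER TABLE rag_chunks DROP CONSTRAINT rag_chunks_pkey;
ALTER TABLE rag_chunks ADD PRIMARY KEY (doc_id);`,
  },
];

// Returns migrations not yet recorded in schema_migrations, in ascending version order.
function pendingMigrations(
  all: readonly Migration[],
  applied: ReadonlySet<number>,
): Migration[] {
  return all
    .filter((m) => !applied.has(m.version))
    .sort((a, b) => a.version - b.version);
}

// $1 = query embedding, $2 = candidate pool size (e.g. k times an over-fetch factor), $3 = k.
const DEDUPED_KNN_SQL = `
WITH candidates AS (
  SELECT doc_id, chunk_index, embedding <=> $1::vector AS distance
  FROM rag_chunks
  ORDER BY embedding <=> $1::vector
  LIMIT $2
)
SELECT doc_id, chunk_index, distance
FROM (
  SELECT DISTINCT ON (doc_id) doc_id, chunk_index, distance
  FROM candidates
  ORDER BY doc_id, distance
) best
ORDER BY distance
LIMIT $3`;
```

A runner built on this would read the applied versions from schema_migrations, execute each pending `up` inside a transaction, and insert the version row in the same transaction. The inner LIMIT keeps the vector index usable; the pool size is a tuning knob, since a pool that is too small can still be filled by one document's chunks.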

Why deferred: there is no production ingestion pipeline in this repo yet (demo seeder only), so both are forward-looking; doing them well is its own design slice.

Labels: enhancement
