pgvector: forward-migration support + multi-chunk read-path dedup #9

Deferred from the quality-audit fix pass (#6). Two related gaps that block a real multi-chunk ingester.

**Location:** `src/rag/backends/pgvector-schema.ts`, `src/rag/backends/pgvector.ts`, `src/rag/retriever.ts`

**Problem 1 — no migration story:** `initSchema` is create-only (`CREATE ... IF NOT EXISTS`). The schema now keys `chunks` on a composite `PRIMARY KEY (doc_id, chunk_index)`, but `IF NOT EXISTS` skips a table that already exists, so a DB created under the old `doc_id PRIMARY KEY` is never migrated. It silently keeps the old PK, and the seeder's `ON CONFLICT (doc_id, chunk_index)` then fails at runtime because no unique constraint matches that conflict target. There is no `schema_migrations` table and no up/down migration path for any future column or embedding-dimension change.
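To make the gap concrete, here is a minimal sketch of what a versioned migration for the PK change might look like, plus a helper that picks the migrations still to be applied. The `Migration` shape, the `MIGRATIONS` list, and `pendingMigrations` are all hypothetical names, not code from this repo. The SQL assumes the old primary key carries Postgres's default constraint name `chunks_pkey` and that `chunk_index` may be missing from old tables; both would need checking against a real pre-change DB.

```typescript
// Hypothetical migration record: ordered SQL statements for each direction.
interface Migration {
  version: number;
  name: string;
  up: string[];
  down: string[];
}

// Assumption: the old PK was created unnamed, so Postgres called it chunks_pkey.
const MIGRATIONS: Migration[] = [
  {
    version: 1,
    name: "chunks_composite_pk",
    up: [
      // Old rows are single-chunk, so defaulting chunk_index to 0 is safe.
      "ALTER TABLE chunks ADD COLUMN IF NOT EXISTS chunk_index integer NOT NULL DEFAULT 0",
      "ALTER TABLE chunks DROP CONSTRAINT chunks_pkey",
      "ALTER TABLE chunks ADD PRIMARY KEY (doc_id, chunk_index)",
    ],
    down: [
      // Fails if any doc_id already has more than one chunk.
      "ALTER TABLE chunks DROP CONSTRAINT chunks_pkey",
      "ALTER TABLE chunks ADD PRIMARY KEY (doc_id)",
    ],
  },
];

// Returns the migrations not yet recorded as applied, in ascending version order.
// `applied` stands in for the versions read from a schema_migrations table.
function pendingMigrations(all: Migration[], applied: number[]): Migration[] {
  return all
    .filter((m) => applied.indexOf(m.version) === -1)
    .sort((a, b) => a.version - b.version);
}

// A DB with no recorded versions still needs migration 1.
console.log(pendingMigrations(MIGRATIONS, []).map((m) => m.name));
```

A real runner would execute each pending migration's `up` statements inside a transaction and insert the version into `schema_migrations` in that same transaction, so a failed migration leaves no partial state.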

**Problem 2 — LIMIT before dedup:** the read SQL (`pgvector.ts` KNN/TEXT) does `ORDER BY ... LIMIT k`, and RRF fusion deduplicates by `doc_id` only *after* the LIMIT (`retriever.ts`). With a real multi-chunk ingester, several chunks of one document can consume the top-k slots and crowd out other documents, so the fused result can contain fewer than k distinct documents. The problem is latent today because the demo seeder writes one chunk per doc (`chunk_index = 0`).
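The crowding effect can be reproduced in memory without a database. The sketch below uses a hypothetical `ChunkHit` row shape and made-up scores (not data from this repo) to compare the current order of operations with deduplicating before the limit.

```typescript
// Hypothetical row shape; the real columns in pgvector.ts may differ.
interface ChunkHit {
  docId: string;
  chunkIndex: number;
  score: number; // higher is better
}

// Current behavior: take the top k chunks, then dedup by docId.
function limitThenDedup(hits: ChunkHit[], k: number): string[] {
  const topK = hits.slice().sort((a, b) => b.score - a.score).slice(0, k);
  const seen: Record<string, boolean> = {};
  const docIds: string[] = [];
  for (const h of topK) {
    if (!seen[h.docId]) {
      seen[h.docId] = true;
      docIds.push(h.docId);
    }
  }
  return docIds;
}

// Alternative: keep the best chunk per document first, then take the top k.
function dedupThenLimit(hits: ChunkHit[], k: number): string[] {
  const best: Record<string, ChunkHit> = {};
  for (const h of hits) {
    const prev = best[h.docId];
    if (prev === undefined || h.score > prev.score) best[h.docId] = h;
  }
  return Object.keys(best)
    .map((id) => best[id])
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((h) => h.docId);
}

// Made-up example: document A has three strong chunks, B and C one each.
const hits: ChunkHit[] = [
  { docId: "A", chunkIndex: 0, score: 0.95 },
  { docId: "A", chunkIndex: 1, score: 0.93 },
  { docId: "A", chunkIndex: 2, score: 0.91 },
  { docId: "B", chunkIndex: 0, score: 0.9 },
  { docId: "C", chunkIndex: 0, score: 0.88 },
];

console.log(limitThenDedup(hits, 3)); // only document A survives
console.log(dedupThenLimit(hits, 3)); // documents A, B and C
```

With k = 3, the first function returns a single document even though three relevant documents exist, which is the recall loss described above.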

**Proposed fix:** add a lightweight versioned-migrations mechanism (a `schema_migrations` table + ordered up/down files), and on the read path add `DISTINCT ON (doc_id)` and/or a wider candidate pool before fusion, so that a document with many chunks cannot starve recall for the others.
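One possible shape for the read-path half of the fix is sketched below: over-fetch a candidate pool with the usual `ORDER BY ... LIMIT`, keep the closest chunk per document with `DISTINCT ON (doc_id)`, then apply the final limit. `buildDedupedKnnQuery`, the `overfetch` factor, the `embedding` column name, and the `$1::vector` cast are assumptions for illustration; only `chunks`, `doc_id`, and `chunk_index` come from this issue. The `<=>` operator is pgvector's cosine distance, and the real query may use a different metric.

```typescript
// Hypothetical return shape: SQL text plus the two numeric parameters.
interface KnnQuery {
  text: string;     // parameterized SQL: $1 = query vector, $2 = pool size, $3 = k
  poolSize: number; // inner LIMIT, k * overfetch
  limit: number;    // outer LIMIT, k
}

function buildDedupedKnnQuery(k: number, overfetch: number = 4): KnnQuery {
  if (!(k >= 1 && Math.floor(k) === k)) {
    throw new Error("k must be a positive integer");
  }
  if (!(overfetch >= 1 && Math.floor(overfetch) === overfetch)) {
    throw new Error("overfetch must be a positive integer");
  }
  const text = [
    "SELECT doc_id, chunk_index, distance",
    "FROM (",
    "  SELECT DISTINCT ON (doc_id) doc_id, chunk_index, distance",
    "  FROM (",
    // The inner query keeps the plain ORDER BY ... LIMIT shape.
    "    SELECT doc_id, chunk_index, embedding <=> $1::vector AS distance",
    "    FROM chunks",
    "    ORDER BY embedding <=> $1::vector",
    "    LIMIT $2",
    "  ) AS candidates",
    // DISTINCT ON keeps the first row per doc_id, i.e. the closest chunk.
    "  ORDER BY doc_id, distance",
    ") AS best_per_doc",
    "ORDER BY distance",
    "LIMIT $3",
  ].join("\n");
  return { text: text, poolSize: k * overfetch, limit: k };
}

const q = buildDedupedKnnQuery(10);
console.log(q.text);
console.log(q.poolSize, q.limit);
```

Over-fetching narrows the problem but does not remove it: a document with more close chunks than the pool size can still fill every candidate slot, so the factor is a tuning knob that should be sized against the ingester's real chunks-per-document distribution.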

**Why deferred:** there is no production ingestion pipeline in this repo yet (demo seeder only), so both are forward-looking; doing them well is its own design slice.
