From 95644dbf16f17c85e6795fb9e113bb8f65f97c3c Mon Sep 17 00:00:00 2001
From: Safwan Erooth
Date: Mon, 1 Jun 2026 19:23:20 +0400
Subject: [PATCH] docs: add pgvector knowledge architecture plan

---
 .../knowledge-pgvector-implementation-plan.md | 537 ++++++++++++++++++
 1 file changed, 537 insertions(+)
 create mode 100644 docs/knowledge-pgvector-implementation-plan.md

diff --git a/docs/knowledge-pgvector-implementation-plan.md b/docs/knowledge-pgvector-implementation-plan.md
new file mode 100644
index 00000000..17d747ad
--- /dev/null
+++ b/docs/knowledge-pgvector-implementation-plan.md
@@ -0,0 +1,537 @@
+# PGVector Knowledge Backend Implementation Plan
+
+_Last updated: 2026-06-01_
+
+This is a documentation-only proposal for adding PostgreSQL/PGVector support to HUF's knowledge layer. It is grounded in the current `develop` implementation of SQLite FTS, SQLite Vec, ChromaDB, the Knowledge Source DocType, and the React knowledge UI.
+
+## 1. Current state on `develop`
+
+HUF currently exposes three backend types in the backend abstraction:
+
+- `sqlite_fts` for keyword search.
+- `sqlite_vec` for local vector search using the `sqlite-vec` extension.
+- `chroma` for vector search using ChromaDB through the LlamaIndex adapter.
+
+The backend contract is currently defined by `KnowledgeBackend` and returns `ChunkResult` objects. Every backend is expected to implement:
+
+- `initialize(knowledge_source, config)`
+- `add_chunks(chunks)`
+- `delete_chunks(input_id)`
+- `search(query, top_k, filters=None)`
+- `clear()`
+- `get_stats()`
+
+Backend lookup is currently hardcoded through `get_backend()` in `huf/ai/knowledge/backends/__init__.py`.
+
+## 2. Current ingestion flow
+
+The indexing pipeline is in `huf/ai/knowledge/indexer.py`.
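For orientation, the six-method contract listed in section 1 could be sketched as below. This is not the actual HUF definition: the `ChunkResult` fields, the return types, and the toy `InMemoryBackend` are assumptions made only to show the call sequence an indexer and a search caller would follow.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ChunkResult:
    # Assumed fields; the real HUF ChunkResult may carry different ones.
    chunk_id: str
    text: str
    score: float
    metadata: dict[str, Any] = field(default_factory=dict)


class KnowledgeBackend(ABC):
    """Sketch of the contract from section 1 (return types are assumptions)."""

    @abstractmethod
    def initialize(self, knowledge_source: str, config: dict[str, Any]) -> None: ...

    @abstractmethod
    def add_chunks(self, chunks: list[dict[str, Any]]) -> int: ...

    @abstractmethod
    def delete_chunks(self, input_id: str) -> int: ...

    @abstractmethod
    def search(self, query: str, top_k: int = 5,
               filters: Optional[dict[str, Any]] = None) -> list[ChunkResult]: ...

    @abstractmethod
    def clear(self) -> None: ...

    @abstractmethod
    def get_stats(self) -> dict[str, Any]: ...


class InMemoryBackend(KnowledgeBackend):
    """Hypothetical keyword-matching backend, used only to exercise the contract."""

    def initialize(self, knowledge_source, config):
        self.knowledge_source = knowledge_source
        self.config = config
        self._chunks: dict[str, dict[str, Any]] = {}

    def add_chunks(self, chunks):
        for chunk in chunks:
            self._chunks[chunk["chunk_id"]] = chunk
        return len(chunks)

    def delete_chunks(self, input_id):
        doomed = [cid for cid, c in self._chunks.items() if c["input_id"] == input_id]
        for cid in doomed:
            del self._chunks[cid]
        return len(doomed)

    def search(self, query, top_k=5, filters=None):
        terms = query.lower().split()
        results = []
        for chunk in self._chunks.values():
            # Exact-match filters on top-level chunk fields.
            if filters and any(chunk.get(k) != v for k, v in filters.items()):
                continue
            words = chunk["text"].lower().split()
            hits = sum(1 for term in terms if term in words)
            if hits:
                results.append(ChunkResult(chunk["chunk_id"], chunk["text"],
                                           hits / len(terms), chunk.get("metadata", {})))
        results.sort(key=lambda r: r.score, reverse=True)
        return results[:top_k]

    def clear(self):
        self._chunks.clear()

    def get_stats(self):
        return {"chunk_count": len(self._chunks)}


backend = InMemoryBackend()
backend.initialize("support-docs", {"chunk_size": 512})
backend.add_chunks([
    {"chunk_id": "c1", "input_id": "in-1", "text": "Refund policy for orders"},
    {"chunk_id": "c2", "input_id": "in-2", "text": "Shipping times"},
])
top = backend.search("refund policy", top_k=1)
```

A PGVector backend would keep the same method surface and swap the dictionary for SQL statements, which is why re-indexing can stay a `delete_chunks()` call followed by `add_chunks()`.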
+
+Current flow:
+
+```text
+Knowledge Input
+  -> extract text
+  -> chunk text
+  -> build chunk payload
+  -> initialize backend from Knowledge Source.knowledge_type
+  -> delete old chunks for that input
+  -> add chunks
+  -> update stats/status
+```
+
+The indexer builds backend config from Knowledge Source fields. Today it passes common chunk settings to all backends. For `sqlite_vec` and `chroma`, it also passes:
+
+- `embedding_model`
+- `vector_dimension`
+- `embedding_provider`
+
+For Chroma, it additionally supports file mode and server mode:
+
+- file mode: private files path under `knowledge/_chroma`
+- server mode: `chroma_host`, `chroma_port`, `chroma_ssl`
+
+## 3. Current SQLite Vec implementation
+
+SQLite Vec is local and source-scoped.
+
+Current characteristics:
+
+- One SQLite file per Knowledge Source.
+- Stored under the site's private files directory: `private/files/knowledge/.sqlite3`.
+- Has a `chunks` metadata/content table.
+- Has a `chunks_vec` virtual table using `vec0`.
+- Uses `FLOAT[dimension]` vectors.
+- Generates embeddings during `add_chunks()` using HUF's embedding resolver.
+- Deletes chunks by `input_id` by deleting both vector rows and chunk rows.
+- Search flow:
+  - generate query embedding
+  - optionally append SQL filters
+  - match against `chunks_vec`
+  - join back to `chunks`
+  - convert distance into a score using `1 / (1 + distance)`
+
+Important limitation:
+
+- The current filter implementation appends `c. = ?`, so filters are only safe for real columns unless explicitly validated. Metadata JSON fields are not currently exposed as indexed filter columns.
+
+## 4. Current Chroma implementation
+
+Chroma is implemented through LlamaIndex's `ChromaVectorStore`.
+
+Current characteristics:
+
+- Supports local persistent mode and remote server mode.
+- Uses a collection per Knowledge Source by default: `huf_`.
+- Generates embeddings through HUF's embedding resolver before adding documents.
+- Stores standard metadata:
+  - `input_id`
+  - `input_type`
+  - `chunk_id`
+  - `source_title`
+  - `chunk_index`
+  - `knowledge_source`
+  - extracted metadata
+- Supports exact metadata filtering through LlamaIndex `MetadataFilters` and `ExactMatchFilter`.
+- Deletes chunks by finding collection IDs where `input_id` matches, then deleting those IDs.
+- Provides basic stats and health checks.
+
+Chroma is a useful reference for PGVector because it already demonstrates:
+
+- external vector backend config
+- optional dependency handling
+- HUF-managed embedding generation
+- metadata filters
+- backend-specific connection settings
+
+## 5. Current Knowledge Source DocType and UI gaps
+
+The backend DocType currently lists these options:
+
+```text
+sqlite_fts
+sqlite_vec
+chroma
+```
+
+Vector settings currently depend on:
+
+```js
+['sqlite_vec', 'chroma'].includes(doc.knowledge_type)
+```
+
+Chroma-specific fields are shown only when `knowledge_type === 'chroma'`.
+
+Frontend gaps observed:
+
+- `frontend/src/types/knowledge.types.ts` currently defines `KnowledgeType = 'sqlite_fts' | 'sqlite_vec'`, so it is behind the backend DocType because Chroma already exists in the DocType/backend.
+- `frontend/src/data/knowledge.ts` currently lists only SQLite FTS and SQLite Vec options.
+- `frontend/src/components/knowledge/GeneralTab.tsx` currently shows vector settings only for `sqlite_vec`, not for `chroma` or future `pgvector`.
+
+Before adding PGVector, the frontend should be aligned with current backend reality:
+
+```ts
+export type KnowledgeType = 'sqlite_fts' | 'sqlite_vec' | 'chroma' | 'pgvector';
+```
+
+And vector settings should be shown for all vector backends:
+
+```ts
+const isVectorBackend = ['sqlite_vec', 'chroma', 'pgvector'].includes(watchKnowledgeType);
+```
+
+## 6. Why PGVector is useful
+
+PGVector is useful when HUF needs a production-grade vector backend that can combine:
+
+- SQL filtering
+- vector similarity ranking
+- operational backups
+- indexes
+- multi-tenant/source-aware tables
+- structured metadata columns
+- analytics
+- future hybrid search with PostgreSQL full-text search
+
+For use cases like travel/hotel recommendation, PGVector is especially useful because the retrieval flow is naturally hybrid:
+
+```text
+city / destination / supplier / availability / budget filters
+  -> vector ranking for semantic preference fit
+  -> agent explanation and recommendation
+```
+
+Example user intent:
+
+```text
+quiet, comfy hotel with good coffee nearby, not too touristy
+```
+
+That should not search all global vectors. It should first filter by city/destination and candidate availability, then rank semantically.
+
+## 7. Recommended PGVector architecture for HUF knowledge
+
+### 7.1 Start with HUF-compatible generic knowledge backend
+
+First implementation should follow the existing backend contract and behave like Chroma/SQLite Vec:
+
+```text
+Knowledge Source
+  -> one logical PGVector collection/table namespace
+  -> chunks with text, metadata, embedding
+  -> search returns ChunkResult
+```
+
+Recommended backend name:
+
+```text
+pgvector
+```
+
+Recommended file:
+
+```text
+huf/ai/knowledge/backends/pgvector_backend.py
+```
+
+Add to backend registry:
+
+```python
+"pgvector": "huf.ai.knowledge.backends.pgvector_backend.PGVectorBackend"
+```
+
+### 7.2 Avoid one physical table per small source by default
+
+The earlier draft PR approach creates a table name like:
+
+```python
+table_name = f"huf_{frappe.scrub(self.knowledge_source)}"
+```
+
+That mirrors Chroma's collection-per-source model and is simple. However, for long-term production use, it can create many PostgreSQL tables.
+
+Recommended default:
+
+```text
+One shared table per site or configured PGVector database, with knowledge_source as a column.
+```
+
+Example logical schema:
+
+```sql
+CREATE TABLE huf_knowledge_vectors (
+    id BIGSERIAL PRIMARY KEY,
+    site_name TEXT NOT NULL,
+    knowledge_source TEXT NOT NULL,
+    input_id TEXT NOT NULL,
+    input_type TEXT NOT NULL,
+    chunk_id TEXT NOT NULL UNIQUE,
+    source_title TEXT,
+    chunk_index INTEGER,
+    text TEXT NOT NULL,
+    metadata JSONB DEFAULT '{}'::jsonb,
+    embedding VECTOR(1536),
+    created_at TIMESTAMPTZ DEFAULT now(),
+    updated_at TIMESTAMPTZ DEFAULT now()
+);
+```
+
+Minimum indexes:
+
+```sql
+CREATE INDEX idx_huf_knowledge_vectors_source
+ON huf_knowledge_vectors (site_name, knowledge_source);
+
+CREATE INDEX idx_huf_knowledge_vectors_input
+ON huf_knowledge_vectors (site_name, knowledge_source, input_id);
+
+CREATE INDEX idx_huf_knowledge_vectors_metadata
+ON huf_knowledge_vectors USING GIN (metadata);
+```
+
+Vector index, depending on distance strategy:
+
+```sql
+CREATE INDEX idx_huf_knowledge_vectors_embedding_hnsw
+ON huf_knowledge_vectors
+USING hnsw (embedding vector_cosine_ops);
+```
+
+For a simple MVP, an exact scan inside a filtered source is acceptable. Add HNSW/IVFFlat after data volume and latency are measured.
+
+### 7.3 For travel/product recommendation, use structured metadata columns
+
+Generic HUF knowledge can keep metadata in JSONB. But product-specific apps such as GoHoppy/Genie should not rely only on generic JSON metadata for high-cardinality filters.
+
+For hotels, either create a separate app-level PGVector table/tool or extend metadata extraction into structured columns.
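As a point of comparison for the hotel-specific shape, a query against the shared `huf_knowledge_vectors` table from 7.2 might be assembled roughly as follows. This is a sketch under stated assumptions, not the planned backend code: the helper names `to_vector_literal` and `build_search_query` are hypothetical, the pyformat placeholders assume a psycopg-style driver, and the metadata filter uses JSONB containment (`@>`) so it can be served by the GIN index shown above.

```python
import json
from typing import Any, Optional, Sequence


def to_vector_literal(embedding: Sequence[float]) -> str:
    """Format floats in pgvector's text input form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def build_search_query(
    site_name: str,
    knowledge_source: str,
    query_embedding: Sequence[float],
    top_k: int = 5,
    metadata_filters: Optional[dict[str, Any]] = None,
) -> tuple[str, dict[str, Any]]:
    """Return (sql, params) for a source-scoped cosine-distance search.

    Hypothetical helper: every value travels as a bound parameter, and the
    table name is a fixed constant rather than user input.
    """
    sql = (
        "SELECT chunk_id, text, metadata, "
        "1 - (embedding <=> %(query_embedding)s::vector) AS score "
        "FROM huf_knowledge_vectors "
        # Site and source scoping is always present, never optional.
        "WHERE site_name = %(site_name)s "
        "AND knowledge_source = %(knowledge_source)s"
    )
    params: dict[str, Any] = {
        "site_name": site_name,
        "knowledge_source": knowledge_source,
        "query_embedding": to_vector_literal(query_embedding),
        "top_k": top_k,
    }
    if metadata_filters:
        # Exact-match filters expressed as JSONB containment.
        sql += " AND metadata @> %(metadata_filters)s::jsonb"
        params["metadata_filters"] = json.dumps(metadata_filters, sort_keys=True)
    sql += " ORDER BY embedding <=> %(query_embedding)s::vector LIMIT %(top_k)s"
    return sql, params


sql, params = build_search_query(
    "site1", "docs", [0.1, 0.2], top_k=3, metadata_filters={"lang": "en"}
)
```

The resulting pair would be passed to a cursor's `execute(sql, params)`. Building the statement separately from executing it keeps the scoping rule easy to unit-test without a database.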
+
+Recommended hotel-specific table shape:
+
+```sql
+CREATE TABLE hotel_vector_profile (
+    id BIGSERIAL PRIMARY KEY,
+    hotel_id TEXT NOT NULL,
+    supplier TEXT,
+    country_code TEXT,
+    city_id TEXT NOT NULL,
+    neighborhood_id TEXT,
+    profile_text TEXT NOT NULL,
+    metadata JSONB DEFAULT '{}'::jsonb,
+    embedding VECTOR(1536),
+    source_hash TEXT,
+    updated_at TIMESTAMPTZ DEFAULT now()
+);
+```
+
+Indexes:
+
+```sql
+CREATE INDEX idx_hotel_vector_city
+ON hotel_vector_profile (city_id);
+
+CREATE INDEX idx_hotel_vector_neighborhood
+ON hotel_vector_profile (city_id, neighborhood_id);
+
+CREATE INDEX idx_hotel_vector_supplier
+ON hotel_vector_profile (supplier);
+
+CREATE INDEX idx_hotel_vector_embedding_hnsw
+ON hotel_vector_profile
+USING hnsw (embedding vector_cosine_ops);
+```
+
+Search pattern:
+
+```sql
+SELECT hotel_id, profile_text, metadata,
+       1 - (embedding <=> %(query_embedding)s) AS score
+FROM hotel_vector_profile
+WHERE city_id = %(city_id)s
+  AND (%(neighborhood_id)s IS NULL OR neighborhood_id = %(neighborhood_id)s)
+ORDER BY embedding <=> %(query_embedding)s
+LIMIT %(top_k)s;
+```
+
+This is the key advantage over a pure vector store: SQL filters reduce the candidate space before semantic ranking.
+
+## 8. PGVector backend implementation plan
+
+### Phase 1: Minimal backend parity
+
+Goal: PGVector behaves like existing vector backends for generic HUF Knowledge Sources.
+
+Changes:
+
+1. Add optional dependency support:
+   - `pgvector`
+   - `psycopg` or SQLAlchemy, depending on implementation choice
+   - optionally `llama-index-vector-stores-postgres` only if using LlamaIndex adapter
+2. Add `PGVectorBackend` implementing `KnowledgeBackend`.
+3. Generate embeddings with HUF's existing `get_embedding()` / `get_embeddings()` functions, not implicitly inside LlamaIndex.
+4. Store chunks with metadata and embedding.
+5. Implement delete by `input_id`.
+6. Implement clear by `site_name + knowledge_source`.
+7. Implement search with:
+   - query embedding
+   - source filter
+   - optional exact metadata filters
+   - vector distance order
+8. Return `ChunkResult` consistently.
+
+Recommended approach:
+
+- Prefer direct SQL first for predictable filtering/deletion.
+- Avoid hiding too much behavior behind LlamaIndex until filters, deletion, and schema are proven.
+
+### Phase 2: Knowledge Source config and UI
+
+Add `pgvector` to Knowledge Source options.
+
+Add PGVector settings:
+
+- `pgvector_connection_mode`: Site DB / External PostgreSQL
+- `pgvector_host`
+- `pgvector_port`
+- `pgvector_database`
+- `pgvector_user`
+- `pgvector_password` as Password field
+- `pgvector_sslmode`
+- `pgvector_table_name`
+- `pgvector_distance_metric`: cosine / l2 / inner_product
+- `pgvector_index_type`: none / hnsw / ivfflat
+
+For MVP, prefer one of these two choices:
+
+A. External PostgreSQL only:
+- simplest when Frappe site DB is MariaDB.
+- avoids pretending the current MariaDB site DB can host PGVector.
+
+B. Site DB only when site runs on PostgreSQL:
+- useful for future Frappe-on-Postgres installs.
+- must detect DB type and fail clearly if site DB is MariaDB.
+
+Given that most current Frappe deployments run on MariaDB, external PostgreSQL should be treated as the realistic default.
+
+### Phase 3: Backend factory / health support
+
+If the BackendFactory PR is merged, PGVector should register with the factory and expose:
+
+- `health_check()`
+- `supports_filters()`
+- `supports_hybrid_search()`
+
+Until then, keep compatibility with current hardcoded `get_backend()`.
+
+### Phase 4: Hybrid search
+
+PGVector can later support hybrid retrieval:
+
+- PostgreSQL full-text search over `text`
+- vector similarity over `embedding`
+- weighted score combination
+
+Example:
+
+```text
+final_score = (0.65 * vector_score) + (0.35 * text_score)
+```
+
+This should be optional per Knowledge Source.
+
+## 9. Recommended query semantics
+
+Use the same rule across HUF and product tools:
+
+```text
+Filter with SQL. Rank with vectors. Explain with the agent.
+```
+
+For generic knowledge:
+
+```text
+knowledge_source + input_type + metadata filters
+  -> vector ranking
+```
+
+For hotel recommendation:
+
+```text
+city / neighborhood / supplier / available candidates
+  -> vector ranking by preference text
+  -> refresh price and availability
+  -> agent explains selected options
+```
+
+Do not use the vector DB as the source of truth for:
+
+- payment state
+- live availability
+- live price
+- supplier booking identifiers
+- cancellation policy enforcement
+- inventory freshness
+
+Use relational/API tools for those.
+
+## 10. Ingestion preparation guidance
+
+Do not embed raw JSON directly.
+
+Prepare a stable, concise `profile_text`.
+
+Generic knowledge example:
+
+```text
+Title: Refund Policy
+Source: Customer Support Handbook
+Content: ...
+```
+
+Hotel example:
+
+```text
+Hotel Lumiere is a boutique hotel in Le Marais, Paris. It is suitable for couples, guests who prefer walkable neighborhoods, nearby cafes, galleries, bakeries, metro access, and a calm local feel. Rooms are compact but comfortable. It has breakfast, Wi-Fi, and family rooms.
+```
+
+Store exact filters separately:
+
+- `city_id`
+- `neighborhood_id`
+- `supplier`
+- `star_rating`
+- `price_band`
+- `family_friendly`
+- `business_friendly`
+- `updated_at`
+
+Only re-embed when the profile text changes. Store a `source_hash` to detect this.
+
+## 11. Safety and operational notes
+
+- Do not store PGVector passwords in plain Data fields. Use Frappe Password fields or site config.
+- Validate table names; never directly interpolate untrusted names.
+- Enforce source/site scoping on every query.
+- Add clear errors when optional dependencies are missing.
+- Add migration/backfill commands for existing Knowledge Sources.
+- Keep PGVector optional so a normal HUF install remains lightweight.
+- Add tests for:
+  - add/search/delete/clear
+  - metadata filters
+  - missing dependency error
+  - dimension mismatch
+  - source isolation
+  - rebuild flow
+
+## 12. Suggested MVP acceptance criteria
+
+PGVector MVP is acceptable when:
+
+1. `pgvector` appears as a Knowledge Type in backend and frontend.
+2. A Knowledge Source can be created with embedding settings and PGVector connection settings.
+3. A text/file/url Knowledge Input can be indexed.
+4. Re-indexing deletes old chunks for the input before inserting new chunks.
+5. Search returns relevant `ChunkResult` values.
+6. Search respects `knowledge_source` isolation.
+7. Search supports at least exact filters for selected metadata.
+8. Clear/rebuild works for one Knowledge Source without affecting others.
+9. Missing PGVector dependencies produce actionable errors.
+10. PGVector remains optional and does not break SQLite FTS, SQLite Vec, or Chroma installs.
+
+## 13. Suggested implementation order
+
+1. Fix frontend knowledge type parity for Chroma.
+2. Add PGVector backend with direct SQL MVP.
+3. Add DocType fields and client UI for PGVector connection settings.
+4. Add tests for backend behavior.
+5. Add bench command or patch to validate/create PGVector schema.
+6. Add optional hybrid search after MVP is stable.
+7. Add product-level examples for city-filtered travel/hotel retrieval.
+
+## 14. Product guidance for GoHoppy / Genie
+
+Use HUF Knowledge for generic travel policies, destination notes, support documents, and reusable explanation context.
+
+Use app-specific tools for hotel inventory and recommendation:
+
+```text
+search_hotels
+rank_hotels_by_preference
+refresh_hotel_price
+get_hotel_details
+save_recommendation_snapshot
+```
+
+Recommended hotel retrieval flow:
+
+```text
+User preference
+  -> parse city/neighborhood/date/budget constraints
+  -> fetch available candidate hotels from relational DB/API cache
+  -> rank candidate hotels semantically using PGVector
+  -> refresh live price/availability from supplier API
+  -> return shortlist to HUF agent
+  -> agent explains why each hotel fits
+  -> save shown options as recommendation snapshot
+```
+
+This avoids treating hotel inventory as static RAG knowledge while still using vectors where they are strongest: semantic preference matching.
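The filter-then-rank core of the retrieval flow above can be illustrated with a small in-memory sketch. Everything here is an assumption for illustration: the `Hotel` record, the `shortlist_hotels` helper, and the tiny two-dimensional embeddings are hypothetical, and in a real deployment the ranking step would be the PGVector `ORDER BY embedding <=> ...` query rather than Python code. Price refresh, agent explanation, and snapshot saving are left out.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Hotel:
    # Hypothetical record; real inventory would come from a relational DB or API cache.
    hotel_id: str
    city_id: str
    available: bool
    embedding: list


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def shortlist_hotels(hotels, city_id, preference_embedding, top_k=3):
    """Filter by hard constraints first, then rank the survivors semantically."""
    # Step 1: relational filter (city and availability are exact facts, not vectors).
    candidates = [h for h in hotels if h.city_id == city_id and h.available]
    # Step 2: semantic ranking over the reduced candidate set only.
    ranked = sorted(
        candidates,
        key=lambda h: cosine_similarity(h.embedding, preference_embedding),
        reverse=True,
    )
    return [h.hotel_id for h in ranked[:top_k]]


hotels = [
    Hotel("A", "paris", True, [1.0, 0.0]),
    Hotel("B", "paris", True, [0.0, 1.0]),
    Hotel("C", "rome", True, [1.0, 0.0]),    # wrong city, never ranked
    Hotel("D", "paris", False, [1.0, 0.0]),  # unavailable, never ranked
]
shortlist = shortlist_hotels(hotels, "paris", [0.9, 0.1], top_k=2)
```

Hotels C and D match the preference vector as well as A does, yet they never reach the ranking step, which is the behavior the flow is meant to guarantee.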