Skip to content

feat: embedding-based semantic search via pgvector#45

Open
rivetphilbot wants to merge 17 commits intoMartian-Engineering:mainfrom
rivetphilbot:feat/embedding-semantic-search
Open

feat: embedding-based semantic search via pgvector#45
rivetphilbot wants to merge 17 commits intoMartian-Engineering:mainfrom
rivetphilbot:feat/embedding-semantic-search

Conversation

@rivetphilbot
Copy link

@rivetphilbot rivetphilbot commented Mar 11, 2026

Summary

Adds semantic (embedding-based) search as a third search mode alongside regex and full-text. Uses OpenAI's text-embedding-3-small model and PostgreSQL's pgvector extension for cosine similarity search.

Depends on: #44 (PostgreSQL backend support)

What changed

New files

  • src/embeddings.tsEmbeddingClient class for OpenAI-compatible embedding APIs. Handles batching (512 texts/call), auto-truncation, and API key resolution from env vars or config.

Modified files

  • src/db/config.ts — New fields: embeddingApiKey, embeddingBaseUrl, embeddingModel. Env vars: LCM_EMBEDDING_API_KEY, LCM_EMBEDDING_BASE_URL, LCM_EMBEDDING_MODEL. Falls back to OPENAI_API_KEY.
  • src/db/migration.tsaddEmbeddingColumnsIfAvailable() auto-creates pgvector columns and HNSW indexes when the extension is available. Runs during schema creation and forward migrations. Silently skips if pgvector is not installed.
  • src/engine.ts — Wires embedding config through to both stores.
  • src/store/conversation-store.ts — Embed-on-insert (async, non-blocking), searchSemantic() via parameterized <=> cosine distance, graceful fallback to full_text if unavailable.
  • src/store/summary-store.ts — Same pattern for summaries.
  • src/store/index.ts — Re-exports EmbeddingClient, toVectorLiteral, fromVectorLiteral, EmbeddingConfig.
  • src/retrieval.tsGrepInput.mode expanded to include "semantic".
  • src/tools/lcm-grep-tool.tsmode enum and description updated for semantic search.

How it works

  1. Embed on insert: When a message or summary is created, an embedding is generated asynchronously and stored. Failures are logged but never block writes.
  2. Semantic search: lcm_grep(mode="semantic") embeds the query, then finds nearest neighbors via pgvector cosine distance (<=>). All query parameters (including vector literals and limits) are fully parameterized.
  3. Fallback: If semantic search is unavailable (no embeddings, no API key, SQLite backend), falls back to full-text search transparently.
  4. Auto-setup: Schema migration auto-creates embedding vector(1536) columns and HNSW indexes when pgvector is detected. No manual DDL needed.

Configuration

# Required for semantic search
LCM_EMBEDDING_API_KEY=sk-...  # or OPENAI_API_KEY

# Optional overrides
LCM_EMBEDDING_BASE_URL=https://api.openai.com/v1  # default
LCM_EMBEDDING_MODEL=text-embedding-3-small          # default, 1536 dimensions

Prerequisites

Cost

text-embedding-3-small: $0.02 per 1M tokens. A 10k message backfill costs ~$0.05.

Testing

Running in production with ~10k messages and ~1k summaries fully embedded. Semantic search returns relevant results across conversation history with sub-second latency.

@rivetphilbot rivetphilbot force-pushed the feat/embedding-semantic-search branch 3 times, most recently from a2d1bb7 to 84e7ad9 Compare March 11, 2026 14:47
- Add DbClient interface abstracting SQLite/Postgres behind common async API
- Add PostgresClient with lazy pg loading (via createRequire) so SQLite-only
  installs don't need pg as a dependency
- Add SqliteClient wrapping DatabaseSync in DbClient interface
- Add Dialect helper eliminating per-query backend branching ($1/$2 vs ?,
  NOW() vs datetime('now'), LPAD vs printf, etc.)
- Add auto-schema creation for Postgres (ensurePostgresSchema) with
  forward-compatible migration path
- Fix transaction scoping: queries inside withTransaction now use the
  transaction-scoped client, not the connection pool
- Fix assembler raw toolCall block passthrough causing xAI 422 errors
- Add tsquery-sanitize for safe full-text search input handling
- Add websearch_to_tsquery support for Postgres FTS (quoted phrases, exclusion)
- Add pg as optionalDependency (SQLite remains zero-dep default)
- Include migrate-to-postgres.mjs data migration script

SQLite remains the default backend. Postgres is opt-in via LCM_BACKEND=postgres
and LCM_CONNECTION_STRING environment variables.

Tested on PostgreSQL 16 with OpenClaw 2026.3.7/2026.3.8.
@rivetphilbot rivetphilbot force-pushed the feat/embedding-semantic-search branch from 84e7ad9 to fcccf4e Compare March 11, 2026 15:16
When a process crashes mid-turn, afterTurn never fires and that turn's
messages are never persisted to LCM — but they ARE in the JSONL file
(appendFileSync writes immediately). On next bootstrap, the fast path
in reconcileSessionTail sees the tail matches (a later turn wrote to
both) and declares 'in sync', silently orphaning the crashed turn's
messages forever.

Fix: after the fast path confirms the tail matches, compare total
message counts. If JSONL has more messages than the DB, scan for and
import the missing ones via identity-based occurrence matching.

Also adds logging to bootstrap reconciliation so we can confirm it
fires.
@coolmanns
Copy link

Love it!!!

Copy link
Contributor

@jalehman jalehman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Embedding-Based Semantic Search (pgvector)

The embedding architecture is well thought out — background batching via EmbeddingQueue, retry with exponential backoff, synthesizing embeddable text from tool-call parts, and the pgvector integration for cosine distance search. This is a valuable feature.

However, this PR is stacked on #44 and inherits its blockers, plus has two issues of its own:

🔴 Inherited from #44

All three blockers from #44 are present in this branch:

  1. SQLite FTS regression (feature probe missing sqliteDb argument)
  2. Postgres migration failures swallowed
  3. Transaction this.db swap not concurrency-safe

These need to be fixed in #44 and then rebased here.

🔴 1. EmbeddingQueue.stop() Does Not Actually Drain

stop() clears the timer and calls flush() once, but flush() only processes one batch (up to batchSize items). Everything beyond the first batch is silently dropped:

async stop(): Promise<void> {
  if (this.timer) {
    clearInterval(this.timer);
    this.timer = null;
  }
  if (this.queue.length > 0) {
    await this.flush(); // Only processes ONE batch
  }
  // Items beyond batchSize: gone
  // Items with future nextRetryAt: gone
}

Additionally, flush() re-enqueues failed items with a future nextRetryAt — these will never be processed since the timer is already cleared.

Fix: Loop flush() until the queue is empty (or a reasonable timeout), and handle retry-delayed items by processing them immediately on shutdown.

🔴 2. Engine Never Calls Queue Teardown

Both stores expose stopEmbeddingQueue(), but the engine's dispose() method never calls them, and there's no process exit hook:

// engine.ts dispose() — no queue cleanup:
async dispose(): Promise<void> {
  // conversationStore.stopEmbeddingQueue() ← never called
  // summaryStore.stopEmbeddingQueue() ← never called
}

Impact: Even the imperfect stop() is never triggered. On gateway restart or plugin reload, all pending embeddings are silently lost.

Fix: Wire up dispose() to call both stopEmbeddingQueue() methods, and consider a process exit hook as a safety net.

Other Notes

  • No tests for the embedding client, queue, schema migration, or semantic search. The queue's batching, retry, and drain semantics especially need test coverage.
  • The silent fallback from semanticfull_text on any exception is operationally forgiving but hides misconfigurations. Consider logging at warn level when this happens.
  • Backfill scripts have hardcoded connection strings and local paths — should be parameterized.
  • The IVFFlat index with 100 lists is reasonable for the current dataset size but should be documented as a tuning point for larger deployments.

rivetphilbot added a commit to rivetphilbot/lossless-claw that referenced this pull request Mar 13, 2026
…feedback

1. SQLite FTS5 regression: pass underlying DatabaseSync handle to
   getLcmDbFeatures() so FTS5 is properly probed on sqlite installs.
   Previously, the missing handle caused silent fallback to LIKE search.

2. Postgres migration failures: remove try/catch around ensurePostgresSchema()
   so errors propagate and this.migrated is only set on success. Subsequent
   calls will retry migration instead of operating against partial schema.

3. Transaction this.db swap race: replace instance-field swap pattern with
   AsyncLocalStorage in both ConversationStore and SummaryStore. The db getter
   returns the transaction-scoped client from ALS when inside a transaction,
   falling back to the root client otherwise. This eliminates concurrency
   hazards from shared singleton stores across concurrent sessions.

4. EmbeddingQueue.stop() drain: add drain() method that loops flush() until
   the queue is empty, clearing retry delays so backoff-delayed items are
   processed immediately. stop() now calls drain() before clearing the timer.

5. Engine dispose() wiring: dispose() now calls drainEmbeddingQueue() on both
   stores so pending embeddings are flushed between runs. Full stop() is
   reserved for process shutdown (embedding queues auto-unref their timers).
1. SQLite FTS5 regression: pass underlying DatabaseSync handle to
   getLcmDbFeatures() so FTS5 is properly probed on sqlite installs.
   Previously, the missing handle caused silent fallback to LIKE search.

2. Postgres migration failures: remove try/catch around ensurePostgresSchema()
   so errors propagate and this.migrated is only set on success. Subsequent
   calls will retry migration instead of operating against partial schema.

3. Transaction this.db swap race: replace instance-field swap pattern with
   AsyncLocalStorage in both ConversationStore and SummaryStore. The db getter
   returns the transaction-scoped client from ALS when inside a transaction,
   falling back to the root client otherwise. This eliminates concurrency
   hazards from shared singleton stores across concurrent sessions.
- Add EmbeddingClient for OpenAI-compatible embedding generation
  (text-embedding-3-small default, configurable model/endpoint)
- Add 'semantic' search mode to lcm_grep using pgvector cosine distance
- Embed messages and summaries on insert (fire-and-forget, non-blocking)
- Auto-create pgvector extension and embedding columns if available
- Falls back to full_text search when embeddings unavailable
- Config: LCM_EMBEDDING_API_KEY, LCM_EMBEDDING_BASE_URL, LCM_EMBEDDING_MODEL
- Add pgvector as optionalDependency (only needed for semantic search)

Requires PostgreSQL backend from PR Martian-Engineering#44. Embedding is purely optional —
Postgres installs without pgvector work fine with regex/full_text search.
…sageParts

Align method names with upstream lossless-claw interface expected by
OpenClaw v2026.3.8. The upstream store uses getMessageById (not getMessage)
and createMessageParts (batch, returns void) instead of createMessagePart
(single, returns record).
Dangling string concatenation operator (+) at end of description caused
ParseError at line 85. Changed to comma to properly terminate the property.
The bootstrap method inserts messages via ConversationStore and then
context_items via SummaryStore. On Postgres (unlike SQLite), these
stores use separate pool connections, so the SummaryStore can't see
uncommitted messages from the ConversationStore's transaction.

Added withSharedTransaction (exposes txClient) and withClient (joins
an existing transaction) methods. Bootstrap now wraps both stores in
the same Postgres transaction, fixing FK constraint violations on
context_items_message_id_fkey during first-time bootstrap.
- Define PgQueryResult, PgPoolClient, PgPool, PgModule interfaces for
  node-postgres types (avoids @types/pg dependency)
- Use getLcmConnection() for SQLite migration instead of db as any
- Change catch (err: any) to catch (err: unknown) with instanceof guard
Messages inserted via createMessagesBulk were not getting embeddings,
only single createMessage calls triggered embedOnInsert. This caused
~80% of messages to remain unembedded since LCM primarily uses bulk
inserts during compaction.
Replaces fire-and-forget embedOnInsert calls with a proper async queue:
- Batched flushes every 2s with configurable batch size
- Exponential backoff retry (up to 5 attempts) for transient API errors
- Synthesizes embedding text for tool-only messages from message_parts
- Truncation at 8k chars to stay within OpenAI 8192 token limit
- Backpressure and graceful shutdown

Includes backfill scripts for existing messages.

Old regime silently lost ~30% of embeddings. New queue achieves 100%.
1. Bootstrap FK constraint violation: concurrent session bootstraps
   cause this.db swap races on shared store singletons. Event loop
   interleaving means Session A's context_items insert can execute
   on Session B's transaction where its messages don't exist.
   Fix: global bootstrap mutex serializes all bootstrap operations.

2. Embedding timing race: createMessage enqueues with empty content,
   queue flushes 2s later, but createMessageParts hasn't been called
   yet. Tool-only messages get skipped with no content to synthesize.
   Fix: createMessageParts re-enqueues tool messages, and the queue
   deduplicates by checking if embedding already exists at flush time.
…feedback

1. SQLite FTS5 regression: pass underlying DatabaseSync handle to
   getLcmDbFeatures() so FTS5 is properly probed on sqlite installs.
   Previously, the missing handle caused silent fallback to LIKE search.

2. Postgres migration failures: remove try/catch around ensurePostgresSchema()
   so errors propagate and this.migrated is only set on success. Subsequent
   calls will retry migration instead of operating against partial schema.

3. Transaction this.db swap race: replace instance-field swap pattern with
   AsyncLocalStorage in both ConversationStore and SummaryStore. The db getter
   returns the transaction-scoped client from ALS when inside a transaction,
   falling back to the root client otherwise. This eliminates concurrency
   hazards from shared singleton stores across concurrent sessions.

4. EmbeddingQueue.stop() drain: add drain() method that loops flush() until
   the queue is empty, clearing retry delays so backoff-delayed items are
   processed immediately. stop() now calls drain() before clearing the timer.

5. Engine dispose() wiring: dispose() now calls drainEmbeddingQueue() on both
   stores so pending embeddings are flushed between runs. Full stop() is
   reserved for process shutdown (embedding queues auto-unref their timers).
@rivetphilbot rivetphilbot force-pushed the feat/embedding-semantic-search branch from f38fac8 to fd29152 Compare March 13, 2026 21:02
@rivetphilbot
Copy link
Author

All blockers addressed — both inherited from #44 and the two new ones:

Inherited from #44 (fixed in the base branch, rebased here):

  1. SQLite FTS5 regression — pass SQLite handle to feature probe
  2. Postgres migration failures — let errors propagate, retry on next call
  3. Transaction this.db swap — replaced with AsyncLocalStorage per-store

4. EmbeddingQueue.stop() doesn't drain — Fixed. Added drain() method that loops flush() until the queue is empty, clearing retry delays so backoff-delayed items are processed immediately. stop() now calls drain() before clearing the timer. Logs a warning if any items remain after drain attempts.

5. Engine dispose() never stops queues — Fixed. dispose() now calls drainEmbeddingQueue() on both stores (flush pending work without stopping timers, since the engine is a singleton reused across runs). Full stop() is available for process shutdown; queue timers are unref()'d so they don't block exit.

Re: tests, semantic→full_text fallback logging, and parameterized backfill scripts — noted for follow-up. The IVFFlat tuning point is a good callout, will document.

- Log warning on semantic → full_text search fallback (both stores)
- Remove hardcoded connection strings from backfill scripts (require LCM_DB_URL env)
- Document HNSW index tuning parameters for larger deployments
Fix all 40 existing test failures caused by the store/config interface changes:
- engine.test.ts: add missing backend/embedding config fields to createTestConfig
- fts-fallback.test.ts: wrap raw DatabaseSync in SqliteClient, update option names
- migration.test.ts: update fts5Available → fullTextAvailable option name
- plugin-config-registration.test.ts: add new config fields, correct defaults
- lcm-tools.test.ts: fix timestamp format expectation

Add 7 new test files (1,020 lines) covering:
- store-transaction.test.ts: ALS transaction safety, rollback, nesting, cross-store
- embedding-queue.test.ts: batching, retry, drain, stop, dedup, text synthesis
- features.test.ts: FTS5 probe, caching, backend detection
- conversation-store.test.ts: full CRUD, message parts, FTS search
- summary-store.test.ts: insert/retrieve, lineage linking, context items, FTS
- dialect.test.ts: placeholder styles, reset, backend-specific functions

Fix SqliteClient to coerce undefined → null in all param bindings (query,
queryOne, run) to prevent 'cannot bind SQLite parameter' errors when
optional fields are passed through.

284 tests, 26 files, 0 failures.
@rivetphilbot
Copy link
Author

Test coverage follow-up is done — all review feedback addressed in 3753d3e.

Adds dedicated tests for the embedding system:

  • embedding-queue.test.ts — batching across multiple flushes, retry with exponential backoff, drain() clearing retry delays, stop() draining before shutdown, deduplication of already-embedded items, text synthesis from message parts for empty content
  • features.test.ts — FTS5 probe with/without SQLite handle (validates the regression fix), caching, backend detection

Plus store-level tests covering the transaction model, CRUD operations, and full-text search that exercise the full stack both PRs touch.

284 tests, 26 files, 0 failures. Rebased on #44 with all blockers resolved.

@rivetphilbot
Copy link
Author

@jalehman Ready for re-review when #44 looks good.

@rivetphilbot rivetphilbot requested a review from jalehman March 13, 2026 21:40
Copy link
Contributor

@jalehman jalehman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A. Recommendation

Request changes.

The original blocker fixes are present on the PR head:

  • the SQLite FTS5 probe now receives the raw SQLite handle,
  • Postgres migration errors propagate and only mark migrated on success,
  • the store transaction client swap was replaced with AsyncLocalStorage,
  • EmbeddingQueue.stop() now drains instead of flushing only one batch,
  • LcmContextEngine.dispose() now drains both embedding queues.

However, the latest test-coverage commit introduces a failing test on the PR head, so this branch is not yet review-clean.

B. Scope Reviewed

  • src/db/features.ts
  • src/engine.ts
  • src/store/conversation-store.ts
  • src/store/summary-store.ts
  • src/embedding-queue.ts
  • test/embedding-queue.test.ts
  • test/features.test.ts
  • test/store-transaction.test.ts
  • test/plugin-config-registration.test.ts

C. Findings

  1. IMPORTANT: test/plugin-config-registration.test.ts still asserts maintainer-environment defaults instead of the plugin config supplied by the test. On a clean environment, npm test fails because the test expects freshTailCount: 64, backend: 'postgres', embeddingApiKey containing sk-, and timezone: 'UTC', even though the test only passes freshTailCount: 7 and no backend/API key/TZ override. This makes the new coverage non-portable and contradicts the commit claim that the suite is green.

D. Original Blockers

  1. PR #44 inherited blockers: resolved in code and covered by test/features.test.ts and test/store-transaction.test.ts.
  2. EmbeddingQueue.stop() abandoning queued / retry-delayed items: materially resolved by drain() plus direct tests in test/embedding-queue.test.ts.
  3. dispose() never tearing down / flushing the queues: materially resolved by drainEmbeddingQueue() calls in src/engine.ts.

E. Tests Run

  • npm test

Result: fail

26 test files ran. 25 passed, 1 failed.

Failing test:

  • test/plugin-config-registration.test.ts > lcm plugin registration > uses api.pluginConfig values during register

F. Risk

Low runtime risk for the original blocker fixes. Current merge risk is test-signal quality: the branch claims green coverage but does not reproduce cleanly.

G. Suggested Fix

Trim test/plugin-config-registration.test.ts back to asserting only the config fields actually provided by pluginConfig, or explicitly stub the environment for any defaults the test wants to verify.

H. Docs

Up to date enough for this review. No doc blocker.

I. Changelog

Required and present via the PR description / commit history.

J. Summary

The requested re-review came back positive on the original blocker fixes, but I cannot approve while the latest coverage commit leaves the branch with a reproducible failing test.

toolCallBlockFromPart emitted 'input' for toolCall-type blocks, but
OpenClaw's extractToolCalls and all Chat Completions providers read
'arguments'. This caused xAI 422 errors ('missing field arguments')
when LCM-assembled tool call history was sent to the API.

The fix adds toolCall to the condition that already handles functionCall,
so both OpenAI-family types use 'arguments'. Anthropic-native types
(tool_use, tool-use, toolUse) continue using 'input'.

Also fixes a stale test assertion for freshTailCount (expected 64,
config set 7) and updates the raw-metadata token counting test to
verify normalized tool_result output instead of the old raw-passthrough
behavior.

Adds assembler-blocks.test.ts with 22 tests covering all block types:
toolCall, functionCall, function_call, tool_use, tool-use, toolUse,
tool_result, function_call_output, toolResult, reasoning, thinking,
text, blockFromPart raw-bypass behavior, and OpenAI reasoning
restoration.
Replace hardcoded connection strings and paths in migrate-to-postgres.mjs
with environment variables (LCM_PG_CONNECTION, LCM_SQLITE_PATH,
LCM_AGENTS_DIR, LCM_BACKUP_AGENTS_DIR). Addresses PR review feedback
about hardcoded paths in migration scripts.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants