v0.37.11.0: fresh-install PGLite embedding setup fix wave #1286

Merged
garrytan merged 14 commits into master from garrytan/pglite-fixes-asap
May 22, 2026

@garrytan garrytan commented May 22, 2026

Summary

Fresh gbrain init --pglite produced a brain where no embedding provider worked out of the box: the schema was sized to 1536 (the OpenAI default) while the gateway resolved to ZeroEntropy at 1280, so the first gbrain embed --stale failed with a dimension mismatch; gbrain config set embedding_model was a silent no-op; and ZeroEntropy keys had no config plane. 9 bugs were reported; 2 rounds of codex outside-voice review caught 26 more. All 26 were folded into the plan, and the full wave shipped.

ELI10: if you ran gbrain init --pglite in v0.37.x and tried to embed, you got opaque dimension errors with no clear way out. v0.37.11.0 makes the fresh-install path actually work, makes the failure-mode loud + paste-ready when you do hit a mismatch, and adds gbrain reinit-pglite for one-command brain rebuilds.

What's in this PR

Five lanes plus a gbrain reinit-pglite sugar command. Each lane is a coherent commit set; the wave was structured so a future bisect can land on the right slice cleanly.

Lane A — Single source of truth for defaults. New module src/core/ai/defaults.ts exports DEFAULT_EMBEDDING_MODEL + DEFAULT_EMBEDDING_DIMENSIONS. Every hardcoded 1536 / text-embedding-3-large literal in production code paths (PGLite schema, Postgres schema, both engine fallbacks, embedding-column registry, isCacheSafe baseline, chunk-row INSERT defaults) replaced with the import. Schema seed no longer strips the provider prefix — DB config stores provider:model end-to-end.
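A minimal sketch of the single-source-of-truth pattern Lane A describes. The two constant names come from this PR and their values follow the commit message (zeroentropyai:zembed-1 / 1280); the helper `contentChunksDdl`, the table name, and the column names are hypothetical stand-ins, not the real getPGLiteSchema().

```typescript
// Sketch of a leaf "defaults" module plus one consumer.
// In a real module both constants would be exported.
const DEFAULT_EMBEDDING_MODEL = "zeroentropyai:zembed-1";
const DEFAULT_EMBEDDING_DIMENSIONS = 1280;

// Hypothetical schema helper: its default arguments track the constants,
// so a bare call cannot drift from the shared defaults.
function contentChunksDdl(
  dims: number = DEFAULT_EMBEDDING_DIMENSIONS,
  model: string = DEFAULT_EMBEDDING_MODEL,
): string {
  if (!Number.isInteger(dims) || dims <= 0) {
    throw new Error(`invalid embedding dimensions: ${dims}`);
  }
  return [
    "CREATE TABLE content_chunks (",
    "  id SERIAL PRIMARY KEY,",
    `  embedding vector(${dims}),`,
    // The full provider:model string is kept, prefix included.
    `  embedding_model TEXT DEFAULT '${model}'`,
    ");",
  ].join("\n");
}
```

Keeping the constants in a module with no other imports is what lets schema code use them without loading the full gateway.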

Lane B — Init paths configure gateway and merge config. initPGLite, initPostgres, and initMigrateOnly all configureGateway() unconditionally before engine.initSchema(). Resolution precedence locked across the codebase: CLI flags this invocation > existing file plane > resolved defaults from gateway. Resolved embedding model + dimensions get printed at init and persisted to config.json even when the user passes no flags, so gbrain config show reflects the active default. New loadConfigFileOnly() helper (in src/core/config.ts) — read-back source for safe merge that doesn't poison config.json with env-only state. v0.28.5 dim-mismatch guard extended to fire on re-init even when no explicit --embedding-dimensions is passed.
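One way to picture the locked precedence is a first-defined-wins resolver that also records where the value came from, so init can print it. This is a sketch under assumptions: the four-tier order `CLI > env > existing file > gateway internal` is taken from the Lane B commit message, while `Source`, `Resolved`, and `resolveWithPrecedence` are hypothetical names.

```typescript
// Hypothetical resolver; only the tier order comes from the PR.
type Source = "cli" | "env" | "file" | "gateway-default";

interface Resolved<T> {
  value: T;
  source: Source;
}

function resolveWithPrecedence<T>(
  tiers: { cli?: T; env?: T; file?: T },
  gatewayDefault: T,
): Resolved<T> {
  if (tiers.cli !== undefined) return { value: tiers.cli, source: "cli" };
  if (tiers.env !== undefined) return { value: tiers.env, source: "env" };
  if (tiers.file !== undefined) return { value: tiers.file, source: "file" };
  // Nothing explicit anywhere: fall through to the gateway's own default.
  return { value: gatewayDefault, source: "gateway-default" };
}

// Re-init with no flags and no env on a brain whose config file says 1536:
// the file tier wins over the gateway default.
const resolvedDims = resolveWithPrecedence<number>({ file: 1536 }, 1280);
```

Carrying the `source` alongside the value makes it cheap to tell the user whether the printed choice was a flag, an environment variable, a stored setting, or a default.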

Lane C — Config plane honesty. gbrain config set embedding_model / embedding_dimensions refused unconditionally (no --force escape) with paste-ready wipe-and-reinit recipe. ZeroEntropy credentials get a real config plane: zeroentropy_api_key field on GBrainConfig, env merge in loadConfig, mapping in cli.ts:buildGatewayConfig. Internal DB-write sites (ze-switch, migrate-engine) gained contract comments documenting the file-plane-is-canonical invariant.

Lane D — Error UX and recipe correctness. New tagged EmbeddingDimMismatchError class in src/commands/embed.ts; pre-flight check fires loud + structured before the embed loop begins. Both sync-side embed catch sites (incremental :990 and first-sync :1129) detect the tagged error and print the recipe + --no-embed tip. embeddingMismatchMessage extended with engine kind + database path so PGLite emits gbrain reinit-pglite ... and Postgres emits the SQL ALTER recipe. sync added to CLI_ONLY_SELF_HELP so gbrain sync --help reaches its dispatch with a comprehensive usage block. docs/embedding-migrations.md restructured PGLite-first.

Lane E — Doctor correctness. checkEmbeddingWidthConsistency, checkZeEmbeddingHealth, and loadRecommendationContext all read gateway state (the canonical schema-sizing source post-Lane C) instead of the DB plane. The provider-aware key check recognizes ZeroEntropy alongside OpenAI / Anthropic, so doctor no longer false-warns when ZE is the active provider and OpenAI keys aren't configured.

gbrain reinit-pglite — new one-command wrapper: backs up the existing brain to <path>.bak, runs gbrain init with the new flags (preserving every other config field — chat model, expansion model, API keys), and re-syncs the brain repo. --no-sync skips the resync, --yes skips the TTY confirmation, --json for scripts. Refuses non-PGLite engines (Postgres has the in-place SQL recipe). 293-line CLI in src/commands/reinit-pglite.ts.

Tests

  • Unit: 22 new cases in test/v0_37_fix_wave.test.ts (structural lane assertions) + 12 in test/v0_37_gap_fill.test.ts (end-to-end behavior + reinit-pglite contracts), plus updates to test/embedding-dim-check.test.ts, test/ai/schema-templating.test.ts, test/search/embedding-column.test.ts, test/cli.test.ts, test/doctor-ze-checks.test.ts, test/e2e/v0_28_5-fix-wave.test.ts.
  • E2E: new test/e2e/fresh-install-pglite.test.ts (in-process, no DATABASE_URL needed) exercises the headline path: bare gbrain init --pglite → import → embed (via __setEmbedTransportForTests injection) → chunks have non-null embeddings.
  • Test infra: new test/helpers/legacy-embedding-preload.ts registered via bunfig.toml preload array so 1536-dim test fixtures keep working under the new ZE-default world without per-file mutation.

What's deferred

Four follow-up TODOs filed in TODOS.md:

  • embed --try-fallback for auto-switching providers on quota / auth failures (silent provider switching = silent vector-space corruption; needs explicit consent design).
  • gbrain reinit-pglite analog for Postgres (currently SQL recipe).
  • Full plane unification audit for non-schema-sizing fields (chat/expansion/reranker could become live-mutable via DB plane).
  • embedAll() shared AbortController so worker-pool dim-mismatches stop within 1-2 in-flight pages instead of draining the queue (current behavior: catches per-page; the top-level still emits the recipe exactly once).

Test plan

  • bun run typecheck clean
  • Unit tests pass (3650+ across 8 shards)
  • E2E pass on real Postgres (bun run test:e2e)
  • gbrain reinit-pglite --help and gbrain sync --help reach dispatch and print usage
  • gbrain config set embedding_model openai:text-embedding-3-large exits 1 with the wipe-and-reinit recipe
  • Fresh gbrain init --pglite configures ZE / 1280 end-to-end; gbrain config show reflects the active default
  • gbrain doctor reads gateway-resolved values (not DB plane) for schema-sizing checks

🤖 Generated with Claude Code

garrytan and others added 12 commits May 21, 2026 15:55
… keep working

The v0.37 fix wave changes the canonical gateway defaults to
zeroentropyai:zembed-1 / 1280 (matching what v0.36 already chose as the
system default). 20+ test files have hardcoded new Float32Array(1536)
fixtures that match the OLD schema default. Without this preload, those
tests fail with a vector-dim-mismatch on insert.

The preload is gateway-only — it doesn't change which model gbrain ships
to production users. Tests that want the new ZE/1280 defaults call
configureGateway() explicitly in their own beforeAll.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…registry

Closes the v0.36 defaults drift bug class. The gateway shipped
zeroentropyai:zembed-1 / 1280 as the system default in v0.36 but eight
other places kept hardcoding 1536 / text-embedding-3-large. Fresh
gbrain init --pglite sized the column to 1536, the embed pipeline used
ZE/1280, and every page failed with dim mismatch.

- New src/core/ai/defaults.ts leaf module is the canonical source for
  DEFAULT_EMBEDDING_MODEL / DEFAULT_EMBEDDING_DIMENSIONS. Schema and
  registry helpers import from this lean module instead of pulling the
  full gateway (which loads every provider SDK).
- src/core/ai/gateway.ts re-exports the constants for back-compat.
- src/core/pglite-schema.ts getPGLiteSchema() defaults track gateway.
- src/core/postgres-engine.ts getPostgresSchema() default args track
  gateway (same drift on the Postgres path — codex round 1 CDX-1).
- Both engine.initSchema() fallbacks track gateway constants (no more
  stale OpenAI/1536 catch-block defaults).
- Schema seed stops stripping the provider prefix; full provider:model
  is stored in the DB config table (codex round 1 CDX-4).
- Chunk-row INSERT defaults track gateway (codex round 2 CDX2-4 —
  pglite-engine:1611 + postgres-engine:1647 were production write
  sites previously hardcoded to text-embedding-3-large).
- src/core/search/embedding-column.ts loadRegistry + isCacheSafe gain
  the cfg > gateway > DEFAULT resolution chain (codex round 2 CDX2-3).
  The gateway tier matters because callers that configure the gateway
  (init paths, tests, programmatic SDK) expect the registry to mirror
  that state when cfg doesn't have an explicit embedding_model.

Tests:
- schema-templating: default expectation flips to ZE/1280 (v0.37 truth).
- embedding-dim-check: 3 new engine-kind branching cases + updated
  fresh-brain expectation (under legacy preload).
- embedding-column: registry + isCacheSafe expectations match new chain.
- v0_28_5-fix-wave E2E: engineKind required arg propagated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nest config-set, sync/reinit help

Closes the "fresh init doesn't work + config-set silently lies" bug
class end-to-end. Six related changes that ship together because the
file-plane/DB-plane contract only holds when init paths, config-set,
the gateway env mapping, and the recipe text all agree.

Lane B (init paths):
- initPGLite, initPostgres, initMigrateOnly always configureGateway()
  before engine.initSchema(). Pre-fix the call was gated on flags, so
  bare `gbrain init --pglite` left the gateway unconfigured and the
  engine fell through to stale OpenAI/1536 defaults instead of the
  ZE/1280 the gateway would have resolved.
- New configureGatewayWithMergedPrecedence() helper applies the locked
  precedence chain `CLI > env > existing file > gateway internal`.
- printResolvedAIChoice() shows the resolved model/dim at init time +
  surfaces a ZE setup hint inline when the API key is missing.
- B.4: saveConfig merge uses loadConfigFileOnly() so transient env
  state (DATABASE_URL, etc.) never poisons ~/.gbrain/config.json
  (codex round 2 CDX-5).
- B.5: extend the v0.28.5 dim-mismatch detector so it fires when the
  gateway-resolved dim differs from the existing column, not only
  when --embedding-dimensions is explicit (codex round 2 CDX-6).
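
The B.4 write-back rule can be sketched as two separate views of the config: a runtime view that merges env, and a save view built only from what was on disk. The field names `embedding_model`, `embedding_dimensions`, and `zeroentropy_api_key` appear in this PR; `database_url`, `chat_model`, both function names, and the choice to let env override the file in the runtime view are assumptions for illustration.

```typescript
// Hypothetical config shape and helpers, not the real loadConfig/saveConfig.
interface BrainConfig {
  database_url?: string;
  chat_model?: string;
  embedding_model?: string;
  embedding_dimensions?: number;
  zeroentropy_api_key?: string;
}

type Env = Record<string, string | undefined>;

// Runtime view: file plus env. Fine for running, unsafe to write back.
function mergeEnvIntoConfig(file: BrainConfig, env: Env): BrainConfig {
  const merged: BrainConfig = { ...file };
  const dbUrl = env["DATABASE_URL"];
  const zeKey = env["ZEROENTROPY_API_KEY"];
  if (dbUrl) merged.database_url = dbUrl;
  if (zeKey) merged.zeroentropy_api_key = zeKey;
  return merged;
}

// Save view: start from the file-only config so env-only state is never
// persisted, then layer the resolved embedding choice on top.
function buildConfigToSave(
  fileOnly: BrainConfig | null,
  resolved: { model: string; dimensions: number },
): BrainConfig {
  return {
    ...(fileOnly ?? {}),
    embedding_model: resolved.model,
    embedding_dimensions: resolved.dimensions,
  };
}
```

Because the save view never sees the env-merged object, a key or URL exported in the shell for one session cannot end up written to the config file.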

Lane C (config plane):
- New `loadConfigFileOnly()` reads ~/.gbrain/config.json only — no env
  merge, no DATABASE_URL inference. Safe write-back source for init.
- GBrainConfig gains `zeroentropy_api_key?: string`. loadConfig merges
  process.env.ZEROENTROPY_API_KEY. buildGatewayConfig at cli.ts:1401
  maps it into env.ZEROENTROPY_API_KEY so ZE recipes finally see it
  (codex round 2 CDX2-5+6 — the v1 fix landed in the wrong file).
- `gbrain config set embedding_model` and `... embedding_dimensions`
  refuse unconditionally and print a paste-ready wipe-and-reinit
  recipe. No --force escape (codex round 2 CDX2-13).
- migrate-engine.ts adds a contract comment at the DB-plane write
  site documenting "DB stores schema-applied metadata; file plane is
  canonical for runtime gateway config" + preserves the existing
  file-plane config across engine migration.
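
The hard refusal for schema-sizing keys could look roughly like the sketch below. The two refused keys, the exit code of 1, and the `gbrain reinit-pglite` flags are from this PR; `configSet`, `SetResult`, the in-memory `Map` store, and the exact message wording are hypothetical.

```typescript
// Keys whose value must match the vector column width; from the PR.
const SCHEMA_SIZING_KEYS = new Set(["embedding_model", "embedding_dimensions"]);

interface SetResult {
  exitCode: number;
  message: string;
}

// Hypothetical handler: refuses schema-sizing keys, writes everything else.
function configSet(key: string, value: string, store: Map<string, string>): SetResult {
  if (SCHEMA_SIZING_KEYS.has(key)) {
    return {
      exitCode: 1,
      message: [
        `Refusing to set ${key}: the schema column has to resize alongside the config.`,
        "PGLite:   gbrain reinit-pglite --embedding-model <provider:model> --embedding-dimensions <N>",
        "Postgres: see the SQL recipe in docs/embedding-migrations.md",
      ].join("\n"),
    };
  }
  store.set(key, value);
  return { exitCode: 0, message: `set ${key}` };
}
```

There is deliberately no override parameter in the sketch, mirroring the "no --force escape" decision.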

Lane D.1 (recipe text):
- embeddingMismatchMessage() takes an `engineKind` arg. PGLite branch
  emits a wipe-and-reinit recipe using gbrainPath('brain.pglite') or
  the caller's databasePath override. Postgres branch keeps the SQL
  ALTER recipe.
- The PGLite recipe recommends `gbrain reinit-pglite` (new sugar
  command below) as the one-line path before falling back to the
  by-hand mv + init + sync sequence.

Lane D.4 (sync help dispatch):
- `sync` and `reinit-pglite` added to CLI_ONLY_SELF_HELP so their own
  --help branches reach the user (pre-fix the generic short-circuit
  fired first and the dedicated usage was unreachable; codex round 2
  CDX2-12).
- `gbrain sync --help` short-circuits BEFORE engine bind so users on
  a fresh tmpdir (no config) can read the help without hitting
  no-such-config errors.

Sugar:
- New `gbrain reinit-pglite --embedding-model X --embedding-dimensions N`
  wraps the wipe + init + sync dance into one command. Backs up the
  brain to <path>.bak. TTY confirmation unless --yes. --no-sync to
  defer the resync. --json for scripts.
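
A rough sketch of the argument contract for the new command, assuming a hand-rolled parser. The flag names, the `<path>.bak` backup convention, and the refusal on non-PGLite engines are from this PR; `parseReinitArgs`, `backupPath`, `ReinitOptions`, and the error strings are hypothetical and say nothing about how src/commands/reinit-pglite.ts is actually written.

```typescript
interface ReinitOptions {
  model: string;
  dimensions: number;
  sync: boolean;
  yes: boolean;
  json: boolean;
}

// Hypothetical parser: validates the engine first, then the required flags.
function parseReinitArgs(argv: string[], engineKind: "pglite" | "postgres"): ReinitOptions {
  if (engineKind !== "pglite") {
    throw new Error("reinit-pglite only handles PGLite brains; Postgres uses the SQL recipe");
  }
  let model: string | undefined;
  let dimensions: number | undefined;
  let sync = true;
  let yes = false;
  let json = false;
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === "--embedding-model") model = argv[++i];
    else if (arg === "--embedding-dimensions") dimensions = Number(argv[++i]);
    else if (arg === "--no-sync") sync = false;
    else if (arg === "--yes") yes = true;
    else if (arg === "--json") json = true;
    else throw new Error(`unknown flag: ${arg}`);
  }
  if (!model || dimensions === undefined || !Number.isInteger(dimensions) || dimensions <= 0) {
    throw new Error("--embedding-model and --embedding-dimensions are both required");
  }
  return { model, dimensions, sync, yes, json };
}

// Backup location for the existing brain, following the <path>.bak convention.
function backupPath(brainPath: string): string {
  return `${brainPath}.bak`;
}
```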

Tests:
- test/cli.test.ts sync-help test rewritten for the new
  per-command-usage output (lists --no-embed which is the v0.37
  user-visible flag the wave wanted to surface).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…atch sites

embedding-pipeline error UX. Pre-fix, a fresh-install dim mismatch
produced raw Postgres "expected N dimensions, not M" errors page after
page, surfacing only after the worker pool drained the entire corpus.
Sync swallowed embed errors at TWO catch sites and never surfaced
the recovery recipe.

embed.ts:
- New `EmbeddingDimMismatchError` tagged class with the paste-ready
  recipe baked in.
- `runEmbedCore` pre-flights via `readContentChunksEmbeddingDim` +
  gateway.getEmbeddingDimensions() before the worker pool spins up.
  On mismatch, throws the typed error which the CLI wrapper catches
  and prints. Dry-run skips the check (no embed risk).
- Catches the headline fresh-install bug class at first call instead
  of letting it hammer N parallel API calls into dim-rejected inserts.
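
The pre-flight idea can be sketched as a typed error plus a check that runs before any transport call. `EmbeddingDimMismatchError` is the class name from this PR; its fields, `preflightEmbed`, `runEmbedSketch`, and the synchronous stand-in transport are simplifying assumptions, not the real `runEmbedCore`.

```typescript
class EmbeddingDimMismatchError extends Error {
  readonly tag = "EmbeddingDimMismatchError";
  readonly schemaDim: number;
  readonly gatewayDim: number;

  constructor(schemaDim: number, gatewayDim: number) {
    super(`embedding column is vector(${schemaDim}) but the gateway resolves ${gatewayDim} dimensions`);
    // Keeps instanceof working even when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = "EmbeddingDimMismatchError";
    this.schemaDim = schemaDim;
    this.gatewayDim = gatewayDim;
  }
}

// Compare the column width with the gateway width before doing any work.
function preflightEmbed(schemaDim: number | null, gatewayDim: number, dryRun = false): void {
  if (dryRun) return; // nothing gets written, so nothing can mismatch
  if (schemaDim === null) return; // no column width known yet
  if (schemaDim !== gatewayDim) throw new EmbeddingDimMismatchError(schemaDim, gatewayDim);
}

// Simplified embed loop: the check runs once, before the first transport call.
function runEmbedSketch(
  pages: string[],
  schemaDim: number | null,
  gatewayDim: number,
  embed: (text: string) => number[],
): number {
  preflightEmbed(schemaDim, gatewayDim);
  return pages.map(embed).length;
}
```

With a 1536-wide column and a 1280-wide gateway, the transport callback is never invoked, which is the property the Lane D.2 test pins.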

sync.ts:
- Both embed catches at sync.ts:990 (incremental) and sync.ts:1129
  (first-sync) detect EmbeddingDimMismatchError and surface the recipe
  + a `--no-embed` tip on stderr (codex round 2 CDX2-8: incremental
  path was previously silent; only the first-sync path was flagged).
- Non-mismatch embed failures still stay best-effort (rate limits,
  transient network) — those shouldn't break sync.
- Sync calls runEmbedCore directly instead of runEmbed (which calls
  process.exit on error and bypasses sync's catch).
- Sync gets a proper --help block listing every meaningful flag:
  --no-embed, --workers, --source, --skip-failed, --retry-failed,
  --watch, --interval, --no-pull, --all, --json, --yes, --dry-run.
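
A sketch of how a sync-side catch site might separate the typed mismatch from ordinary best-effort failures. Only the class name and the `--no-embed` tip come from this PR; `finishSync`, `SyncOutcome`, the `recipe` field, and the message strings are hypothetical, and lines are collected in an array rather than written to stderr so the example stays self-contained.

```typescript
class EmbeddingDimMismatchError extends Error {
  readonly recipe: string;

  constructor(message: string, recipe: string) {
    super(message);
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = "EmbeddingDimMismatchError";
    this.recipe = recipe;
  }
}

interface SyncOutcome {
  pagesSynced: number;
  embedded: boolean;
  stderr: string[];
}

// Hypothetical tail of a sync run: pages are already written, embed is last.
function finishSync(pagesSynced: number, runEmbed: () => void): SyncOutcome {
  const outcome: SyncOutcome = { pagesSynced, embedded: false, stderr: [] };
  try {
    runEmbed();
    outcome.embedded = true;
  } catch (err) {
    if (err instanceof EmbeddingDimMismatchError) {
      // Loud path: show the recovery recipe and the opt-out flag.
      outcome.stderr.push(err.message, err.recipe);
      outcome.stderr.push("Tip: re-run with --no-embed to sync pages without embedding.");
    } else {
      // Best-effort path: rate limits or network blips should not break sync.
      const reason = err instanceof Error ? err.message : String(err);
      outcome.stderr.push(`embed skipped (best-effort): ${reason}`);
    }
  }
  return outcome;
}
```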

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…key lookup

Doctor's embedding checks were reading the DB config table for
embedding_model / embedding_dimensions / zeroentropy_api_key. Post
v0.37 the file plane is canonical (the DB plane is schema-applied
metadata, not runtime gateway config) so those reads produced stale
verdicts on fresh installs whose DB row hadn't been written.

- checkEmbeddingWidthConsistency reads gateway.getEmbeddingDimensions()
  and gateway.getEmbeddingModel() instead of engine.getConfig(...).
  Reuses readContentChunksEmbeddingDim from the same shared helper
  init + embed use. On mismatch, the fix hint threads engineKind +
  databasePath into the new branched recipe (codex round 1 CDX-8 +
  Lane E.1/E.2).
- checkZeEmbeddingHealth reads gateway for the model + loadConfigFileOnly
  for the key. Fires when (a) resolved model starts with zeroentropyai:
  AND (b) ZEROENTROPY_API_KEY is unset in env AND (c) file plane has
  no zeroentropy_api_key (codex round 2 CDX2-10).
- loadRecommendationContext reads gateway for both fields and
  recognizes the ZE key alongside OpenAI/Anthropic in the
  hasEmbeddingApiKey check, so brains on ZE no longer look "healthy"
  just because OPENAI_API_KEY happens to be set (codex round 2 CDX2-11).
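
The three-part firing condition for the ZeroEntropy health check, plus the cold-boot skip, can be written as a small pure function. The `zeroentropyai:` prefix test and the (a)/(b)/(c) conditions are from the commit message; `checkZeKey`, `ZeHealthInput`, `Check`, and the message text are hypothetical.

```typescript
interface ZeHealthInput {
  resolvedModel: string | null; // null means the gateway is unconfigured
  envKey?: string; // value of ZEROENTROPY_API_KEY in the environment
  fileKey?: string; // zeroentropy_api_key from the file plane
}

interface Check {
  status: "ok" | "warn";
  message: string;
}

function checkZeKey(input: ZeHealthInput): Check {
  if (input.resolvedModel === null) {
    return { status: "ok", message: "gateway unconfigured; check skipped" };
  }
  const usesZe = input.resolvedModel.startsWith("zeroentropyai:");
  const hasKey = Boolean(input.envKey) || Boolean(input.fileKey);
  if (usesZe && !hasKey) {
    return { status: "warn", message: "ZeroEntropy is the active provider but no API key is configured" };
  }
  return { status: "ok", message: "embedding provider key check passed" };
}
```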

Tests rewritten for the gateway-source-of-truth contract via
configureGateway() in beforeAll. Added a "gateway unconfigured: skips
with ok" case so doctor doesn't false-warn on cold-boot brains.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ipe + TODOS

Lands the v0.37 PGLite fresh-install fix wave's structural tests and
the user-facing migration recipe overhaul.

test/v0_37_fix_wave.test.ts (new): 22 unit cases pinning the lanes:
- Lane A: defaults module exports, getPGLiteSchema/getPostgresSchema
  default-args, registry + isCacheSafe under the `cfg > gateway >
  DEFAULT` chain (both gateway-set and gateway-reset branches).
- Lane B: loadConfigFileOnly env isolation + DATABASE_URL inference
  refusal + null-on-missing.
- Lane C.3: buildGatewayConfig maps zeroentropy_api_key + process.env
  wins over config (operator escape hatch contract).
- Lane D.2: EmbeddingDimMismatchError shape + tag.
- Lane D.4: structural assertion that `sync` is in CLI_ONLY_SELF_HELP.
- Deferred-TODO ship: reinit-pglite is registered correctly +
  embeddingMismatchMessage PGLite branch recommends it.

docs/embedding-migrations.md: PGLite section moved to top (the default
install). The recommended path is `gbrain reinit-pglite` one-liner;
the by-hand mv + init + sync sequence stays as the fallback recipe.
Postgres SQL ALTER recipe preserved. New section on `gbrain config
set` refusal explains the file-plane vs DB-plane contract so users
don't follow stale documentation.

TODOS.md: 4 deferred follow-ups filed with concrete file pointers:
- gbrain embed --try-fallback (provider auto-switch with consent gate)
- Full plane unification for non-schema-sizing fields
- Worker-pool shared AbortController for mid-run dim drift
- Cleanup of back-compat constants in src/core/embedding.ts

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The structural fix-wave tests in test/v0_37_fix_wave.test.ts pin lane-level
invariants (exports, registry chain, signature shapes). The audit found 10+
END-TO-END behaviors that the structural tests didn't actually reach.
This file fills the highest-leverage gaps.

Unit coverage (test/v0_37_gap_fill.test.ts, 12 cases):
- Lane A.7: chunk-row INSERT default tracks DEFAULT_EMBEDDING_MODEL
  constant (pre-fix this was the literal 'text-embedding-3-large' at
  pglite-engine.ts:1611 + postgres-engine.ts:1647 — production write
  sites that were never directly tested; codex round 2 CDX2-4).
- Lane A.8: schema seed stores full provider:model in DB config
  (pre-fix the .split(':') strip dropped the prefix; codex round 1
  CDX-4). Asserts a fresh ZE init stores `zeroentropyai:zembed-1`
  in the config table, not bare `zembed-1`.
- Lane B precedence: explicit CLI > env > existing file > default
  test (codex round 2 CDX2-7 contradiction guard).
- Lane C.3 env merge: process.env.ZEROENTROPY_API_KEY threads through
  loadConfig → cfg.zeroentropy_api_key; loadConfigFileOnly does NOT.
- Lane D.2 end-to-end: schema=1536 + gateway=1280 →
  EmbeddingDimMismatchError fires AND the embed transport is never
  called (the whole point of pre-flight). Plus dry-run skips the
  check.
- Lane D.3 source-text grep: both sync.ts catch sites detect the
  typed error + the `--no-embed` tip is present (CDX2-8).
- Lane E.4 source-text grep: loadRecommendationContext is
  provider-aware (reads gateway + branches on ZE/OpenAI key).
- reinit-pglite contract: refuses on non-PGLite engines + refuses
  when required flags are missing.

E2E (test/e2e/fresh-install-pglite.test.ts, 2 cases):
- Bare `gbrain init --pglite` produces a `vector(1280)` schema, prints
  the resolved choice, persists defaults to config.json — the headline
  scenario that v0.37 ships to fix.
- init → seed page → embed end-to-end: chunks have non-null
  embeddings; no dim mismatch despite the wave's defaults change.

Both E2E cases are IN-PROCESS (per CDX2-12: CLI-subprocess E2E can't
inherit `__setEmbedTransportForTests`). They run with stubbed transport
returning synthetic 1280-dim vectors so we never hit real provider APIs.
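
For illustration, a stub transport of the kind described might look like the sketch below. The real `__setEmbedTransportForTests` signature is not shown in this PR, so `makeStubTransport` and its return shape are hypothetical; the only details taken from the text are the 1280-dim synthetic vectors and the absence of network calls.

```typescript
// Hypothetical stub: deterministic vectors, no provider API involved.
function makeStubTransport(dims: number = 1280) {
  let calls = 0;

  function embed(texts: string[]): Float32Array[] {
    calls += 1;
    return texts.map((text) => {
      const vector = new Float32Array(dims);
      for (let i = 0; i < dims; i++) {
        // Derive each component from the text so equal inputs give equal vectors.
        const code = text.length > 0 ? text.charCodeAt(i % text.length) : 0;
        vector[i] = ((code + i) % 97) / 97;
      }
      return vector;
    });
  }

  return { embed, callCount: () => calls };
}
```

Exposing a call counter lets a test assert both that embeddings were produced and, in the mismatch case, that the transport was never reached.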

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an afterAll that restores the gateway to OpenAI/1536 (matching the
bunfig preload) at the end of the reinit-pglite describe. Belt-and-
suspenders: earlier describe blocks in this file already restore, but
if the reinit-pglite tests ever start mutating the gateway in the
future, this protects downstream test files in the same bun-test shard
from inheriting a non-default state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
README + topologies + embedding-providers were still pointing users at
`gbrain config set embedding_model X` / `embedding_dimensions N`. As of
v0.37.10.0 those writes are refused — the schema column has to resize
alongside the config. Point at `gbrain reinit-pglite` (PGLite) and the
SQL recipe in `docs/embedding-migrations.md` (Postgres) instead.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@garrytan garrytan changed the title from "v0.37.10.0: fresh-install PGLite embedding setup fix wave" to "v0.37.11.0: fresh-install PGLite embedding setup fix wave" May 22, 2026
garrytan and others added 2 commits May 21, 2026 20:29
Resolves conflicts in VERSION (kept 0.37.11.0), package.json, CHANGELOG.md
(kept both v0.37.11.0 and master's v0.37.10.0 entries), src/commands/config.ts
(combined hard-refuse for schema-sizing fields + master's general unknown-key
gate), src/commands/embed.ts (run master's noEmbedding gate first, then my
dim-mismatch preflight), src/commands/init.ts (took master's preflight +
configureGateway + post-init invariant flow, kept my Lane B.4 file-plane
merge via loadConfigFileOnly so existing user fields aren't clobbered, added
engineKind/databasePath args to the post-init mismatch recipe, inlined the
ZE-key setup hint), src/core/embedding-dim-check.ts (kept both gbrainPath
import and master's dim-validation + EmbeddingDisabledError surface).

Quarantines three pre-existing flakes to .serial.test.ts (HTTP server module
state + gateway state contamination under parallel shards):
- test/doctor-remote.test.ts
- test/cross-modal-hybrid-integration.test.ts
- test/search/hybrid-reranker-integration.test.ts

Regenerates llms.txt + llms-full.txt for the merged CLAUDE.md.

Verified: typecheck clean, 8375 pass / 0 fail on full unit suite.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CI's `check:test-isolation` lint flagged R1 violations (direct
`process.env.GBRAIN_HOME` mutation) in both new fix-wave test files.
Per the documented quarantine pattern in CLAUDE.md, rename to
`*.serial.test.ts` instead of refactoring through `withEnv()` — both
files use beforeEach/afterEach env wiring that's already serial-safe.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@garrytan garrytan merged commit d0d0e2a into master May 22, 2026
8 checks passed
garrytan added a commit that referenced this pull request May 22, 2026
Conflicts resolved:
- VERSION → 0.38.1.0 (higher semver wins; master bumped 0.37.10.0 → 0.37.11.0)
- package.json → 0.38.1.0 (trio agreement)
- CHANGELOG.md → my v0.38.1.0 entry stays on top; master's new v0.37.11.0
  entry inserted between mine and v0.37.10.0
- src/cli.ts CLI_ONLY Set → union of master's `reinit-pglite` and my
  `capture` CLI verbs

Master's v0.37.11.0 brings the fresh-install PGLite embedding setup fix
wave (#1286): default vector(1280) schema matching the gateway's
zembed-1 default, `gbrain reinit-pglite` wipe-and-reinit command, and
proper ZE API key plumbing. No collisions with v0.38 ingestion substrate
beyond the cli.ts dispatcher Set.

bun install + bun run typecheck → clean.