fix(sync): scope auto-embed to source #1120

Closed
hnshah wants to merge 1 commit into garrytan:master from hnshah:ren/dogfood-sync-embed-source

hnshah commented May 17, 2026

Summary

Dogfooding incremental code sync against a registered local source exposed a bug: sync imported the changed page under the correct source, but the auto-embed step then tried to embed the slug without passing --source.

That falls back to source=default, which can print:

Error embedding hello-js: Page not found: hello-js (source=default)

This threads sourceId into the auto-embed call so incremental source sync embeds the same source row it just imported.

Dogfood evidence

Before:

[sync.imports] 1/1 (100%) hello.js
  Error embedding hello-js: Page not found: hello-js (source=default)
Synced bb58d793..f0f41e36:
  +0 added, ~1 modified, -0 deleted, R0 renamed
  2 chunks created, 1 pages embedded

After:

[sync.imports] 1/1 (100%) hello.js
hello-js: all 2 chunks already embedded
Synced f0f41e36..86977128:
  +0 added, ~1 modified, -0 deleted, R0 renamed
  2 chunks created, 1 pages embedded

Tests

bun test test/sync.test.ts --timeout 120000

47 passing.

garrytan added a commit that referenced this pull request May 18, 2026
`gbrain sync --source-id X` triggered auto-embed for the affected slugs
but `runEmbed` ran with no `--source` flag, so it fell back to the
default source. For non-default-source syncs the page row lives at
(sourceId, slug) — the embed code saw "Page not found" for the right
slug under the wrong source, swallowed the error as best-effort, and
the sync result reported `embedded: 0` for the wrong reason.

`buildAutoEmbedArgs(slugs, sourceId)` is the new helper: when sourceId
is set, prepends `--source X`. Exported for the regression test.
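
A minimal sketch of the helper described above. Only the name `buildAutoEmbedArgs(slugs, sourceId)` and the `--source X` prefix come from the commit message; the `--slugs` flag used to pass the slug list is an assumption made for illustration.

```typescript
// Sketch only: the `--slugs` flag is an assumed argument shape, not
// something the commit message confirms.
function buildAutoEmbedArgs(slugs: string[], sourceId?: string): string[] {
  const args = ['--slugs', ...slugs];
  // When a source is set, prepend `--source X` so embed looks up the same
  // (sourceId, slug) row that sync just imported.
  return sourceId ? ['--source', sourceId, ...args] : args;
}
```

With no `sourceId` the argument list is unchanged, which keeps default-source syncs on their existing path.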

Pairs with the upcoming source-id write-path audit (P1 #8). Cherry-picked
from PR #1120.

Co-Authored-By: hnshah <hnshah@users.noreply.github.com>
garrytan added a commit that referenced this pull request May 19, 2026
* fix(sync): accept .tf / .tfvars / .hcl in CODE_EXTENSIONS

Terraform repos were invisible to `gbrain sync --strategy code` because
the three HCL-family extensions never reached the file walker. Silent
data loss — the user thinks the sync covered the repo but the IaC layer
was dropped on the floor.

detectCodeLanguage() returns null for these extensions, so the chunker
falls back to recursive (no tree-sitter grammar for HCL) — the same
path toml/yaml take.

Closes #878.

Co-Authored-By: johnybradshaw <johnybradshaw@users.noreply.github.com>

* fix(upgrade): run `bun update gbrain` from Bun's global install root

`gbrain upgrade --strategy bun` was failing on canonical
`bun install -g github:garrytan/gbrain` installs because `execSync('bun
update gbrain')` ran in the user's shell cwd. Bun's update operates on
whatever package.json it finds via cwd-walk, so a user not standing in
the global root got "No package.json, so nothing to update".

resolveBunGlobalRoot() returns the right directory:
1. `$BUN_INSTALL/install/global` when set (operator override).
2. `~/.bun/install/global` (Bun's documented default).
3. Walk up from realpath(argv[1]) looking for `node_modules/gbrain` —
   handles non-standard installs without trusting argv naming.
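
A minimal sketch of the first two lookup steps above. The helper name `bunGlobalRootCandidates` and its parameters are hypothetical; the real `resolveBunGlobalRoot()` also performs the argv[1] walk-up, which is omitted here.

```typescript
import * as path from 'node:path';

// Hypothetical pure helper: returns candidate directories in the order the
// commit message lists them. Env and home dir are passed in so the function
// has no hidden dependency on the caller's process state.
function bunGlobalRootCandidates(
  env: Record<string, string | undefined>,
  homeDir: string,
): string[] {
  const candidates: string[] = [];
  // 1. Operator override via $BUN_INSTALL.
  if (env.BUN_INSTALL) {
    candidates.push(path.join(env.BUN_INSTALL, 'install', 'global'));
  }
  // 2. Bun's default location under the home directory.
  candidates.push(path.join(homeDir, '.bun', 'install', 'global'));
  return candidates;
}
```

A caller would presumably pick the first candidate that exists on disk and pass it as `cwd` to the update command.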

execFileSync replaces execSync (no shell), with cwd pinned. Error path
prints the exact `cd && bun update` recovery command instead of a vague
hint.

Closes #1029. Cherry-picked from PR #1032.

Co-Authored-By: mvanhorn <mvanhorn@users.noreply.github.com>

* fix(config): redact sensitive values in `config set` output (closes #892)

`gbrain config set openai_api_key sk-...` was echoing the full key to
the terminal via `console.log('Set %s = %s', key, value)`. Shell scrollback and
tmux scroll buffers commonly retain that output for hours; a screen-share or
shoulder-glance during set leaked the secret.

The `show` path already redacted but used a naive `.includes('key')`
substring check that would also mask 'monkey' or 'parsekey' (a false
positive rather than a leak, but ugly).

Single source of truth: `isSensitiveConfigKey()` uses a word-boundary
regex (`/(^|[._-])(key|secret|token|password|pwd|passwd|auth)([._-]|$)/i`)
so 'openai_api_key' matches but 'monkey' doesn't. `redactConfigValue()`
composes the postgresql:// URL redactor + sensitive-key check, used by
both `show` and `set`. Helpers exported for unit tests.
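
A sketch of how the two helpers might compose. The regex is the one quoted above; the `***` placeholder and the exact URL-masking pattern are assumptions, not the project's actual output format.

```typescript
// Regex from the commit message: the sensitive word must be bounded by the
// start/end of the key or by one of `.`, `_`, `-`.
const SENSITIVE_KEY_RE = /(^|[._-])(key|secret|token|password|pwd|passwd|auth)([._-]|$)/i;

function isSensitiveConfigKey(key: string): boolean {
  return SENSITIVE_KEY_RE.test(key);
}

// Assumed redaction format, for illustration only.
function redactConfigValue(key: string, value: string): string {
  if (isSensitiveConfigKey(key)) return '***';
  // Mask only the password part of a postgresql:// URL; keep the rest readable.
  return value.replace(/^(postgres(?:ql)?:\/\/[^:\/@]+):[^@]*@/, '$1:***@');
}
```

Because both `show` and `set` would call `redactConfigValue()`, the two paths cannot drift apart on what counts as sensitive.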

Closes #892. Cherry-pick of @sharziki's PR #918 (config.ts hunk only —
the extract.ts walker change in that PR is unrelated and tracked in #202).

Co-Authored-By: sharziki <sharziki@users.noreply.github.com>

* fix(oauth): throw InvalidTokenError so bearerAuth returns 401, not 500

`verifyAccessToken` was throwing bare `Error` on expired or invalid
tokens. The MCP SDK's `requireBearerAuth` middleware catches
`InvalidTokenError` and returns 401 with WWW-Authenticate; bare Error
falls through to 500. Result: legitimate clients with stale tokens got
a 500 instead of a 401, so token-refresh logic (which keys off 401) never fires.

Two call sites in verifyAccessToken: token-expired path and
invalid-token path. Both now throw InvalidTokenError. Existing tests
continue to pass because they assert on the throw, not the message class.
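
The status mapping can be sketched as follows. The `InvalidTokenError` class here is a local stand-in so the example is self-contained (the real one comes from the MCP SDK), and the token store shape and field names are hypothetical.

```typescript
// Local stand-in for the SDK's InvalidTokenError.
class InvalidTokenError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'InvalidTokenError';
  }
}

// Hypothetical token record; the field name is an assumption.
interface TokenRecord { expiresAt: number }

function verifyAccessToken(
  store: Map<string, TokenRecord>,
  token: string,
  now: number,
): TokenRecord {
  const record = store.get(token);
  // Both failure paths throw the typed error, never a bare Error.
  if (!record) throw new InvalidTokenError('Invalid token');
  if (record.expiresAt <= now) throw new InvalidTokenError('Token expired');
  return record;
}

// Mirrors the middleware behavior described above: the typed error maps to
// 401, anything else falls through to 500.
function statusForAuthError(err: unknown): number {
  return err instanceof InvalidTokenError ? 401 : 500;
}
```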

Closes #935. Cherry-picked from PR #1012.

Co-Authored-By: Aashiqe10 <Aashiqe10@users.noreply.github.com>

* fix(serve): return 405 on GET /mcp instead of 404

MCP Streamable HTTP spec says GET /mcp opens an optional SSE backchannel
for server-initiated messages. gbrain's transport is stateless and
doesn't push server-initiated messages, so per spec we MUST return 405
with Allow: POST, DELETE — not 404. Probing clients (claude.ai, etc.)
distinguish "endpoint exists, no SSE channel" from "endpoint missing"
on this status code; 404 makes them give up.
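
A framework-free sketch of that decision. The function name and the response shape are hypothetical; only the 405 status and the `Allow: POST, DELETE` header come from the commit message.

```typescript
interface SimpleResponse {
  status: number;
  headers: Record<string, string>;
}

// Hypothetical gate for the /mcp path: returns null when the method should
// reach the real handler, or a 405 response otherwise.
function mcpMethodGate(method: string): SimpleResponse | null {
  const allowed = ['POST', 'DELETE'];
  if (allowed.includes(method.toUpperCase())) return null;
  // The endpoint exists, but this method (e.g. GET for SSE) is not offered.
  return { status: 405, headers: { Allow: allowed.join(', ') } };
}
```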

Cherry-picked from PR #1076.

Co-Authored-By: lukejduncan <lukejduncan@users.noreply.github.com>

* fix(doctor): resolve whoknows fixture from module location, not cwd

`gbrain doctor` warned about a missing whoknows fixture for every install
that wasn't standing in the gbrain source repo at run time — which is
everyone. The check used `process.cwd()` to locate the fixture, so any
real user (running doctor against `~/.gbrain`) saw a spurious warning.

`resolveWhoknowsFixturePath()` walks up from `import.meta.url` looking
for the source-repo signature (`src/cli.ts` + `skills/RESOLVER.md`),
respects `GBRAIN_WHOKNOWS_FIXTURE_PATH` env override (absolute or
cwd-relative), and returns null with an actionable warning when the
fixture can't be located.

Closes #969. Cherry-picked from PR #1034.

Co-Authored-By: mvanhorn <mvanhorn@users.noreply.github.com>

* fix(frontmatter): centralize --fix backups under ~/.gbrain/backups/

`gbrain frontmatter validate --fix` and `gbrain frontmatter generate
--fix` wrote `<file>.bak` siblings into the source tree. Users running
gbrain over a brain repo found .bak files scattered through people/,
companies/, etc. that broke gitignore expectations and showed up in
`git status` after every fix pass.

Backups now land under `~/.gbrain/backups/frontmatter/<run-id>/<rel>.bak`
with an iso-week-sorted run-id so a multi-fix session keeps the same
parent directory. Under that directory, the per-file structure mirrors
the original file's relative path. The .bak safety contract is intact
for both git and non-git brain repos.

Also adds `--include-catch-all` opt-in to `frontmatter generate` so the
default catch-all rule (`type: note`) is no longer applied to arbitrary
workspace documents that happen to live under a brain root.

Closes #902. Cherry-picked from PR #903.

Co-Authored-By: 100yenadmin <100yenadmin@users.noreply.github.com>

* fix(config): use path.isAbsolute() for GBRAIN_HOME on Windows

The GBRAIN_HOME validator rejected every valid Windows path (`C:\\Users\\...`,
`D:\\gbrain`, etc.) because it used `trimmed.startsWith('/')` to check for
absoluteness — only POSIX absolute paths pass that. `path.isAbsolute()` is
the cross-platform check.

Same fix for the `..` traversal check: split on both `/` and `\` so
Windows path separators don't sneak `..` through.
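
A small sketch of both checks. It calls `path.win32` and `path.posix` explicitly so the result is the same on any host; production code would presumably call `path.isAbsolute()` directly. The helper names are hypothetical.

```typescript
import * as path from 'node:path';

// Platform is passed in only to make the example deterministic.
function isAbsoluteOn(platform: 'win32' | 'posix', p: string): boolean {
  return path[platform].isAbsolute(p);
}

// Split on both separators so `..` cannot hide behind a backslash.
function hasTraversalSegment(p: string): boolean {
  return p.split(/[\\/]/).includes('..');
}
```

The old `startsWith('/')` check returns false for `C:\Users\me`, which is why every Windows path was rejected.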

Closes #1019. Cherry-picked from PR #1083.

Co-Authored-By: sharziki <sharziki@users.noreply.github.com>

* fix(ai): warn only for the configured embedding provider, not all recipes

Gateway construction was warning on stderr for every recipe with an
embedding touchpoint missing max_batch_tokens — including providers the
brain isn't using. Users on Voyage saw noise about OpenAI / Google /
DashScope / etc. recipes that never get loaded.

Filter the warning to recipes whose provider id is referenced by
`embedding_model` or `embedding_multimodal_model` in the active config.
The structural protection against forgetting max_batch_tokens stays in
place for the recipes that actually run; the noise for unrelated recipes
goes away.

Cherry-picked from PR #1117.

Co-Authored-By: hnshah <hnshah@users.noreply.github.com>

* fix(sync): skip git pull when repo has no origin remote

`gbrain sync` ran `git pull` unconditionally and printed scary stderr
on every cycle for brains that have no `origin` remote (local-only
workflows, single-machine setups, brains initialized via `gbrain init
--pglite` against an arbitrary directory). The pull failed harmlessly
but the noise was confusing and made operators think sync was broken.

`hasOriginRemote()` probes `git remote get-url origin` with stdio
ignored; on failure (`no such remote`), skip the pull, print a single
informational line, and proceed with the local working tree.
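
A minimal sketch of the probe. The injectable `probe` parameter is an addition for testability and is not part of the helper as described; the default probe runs the `git remote get-url origin` command named above.

```typescript
import { execFileSync } from 'node:child_process';

type Probe = (repoPath: string) => void;

// Default probe: throws when git exits non-zero (for example when the
// `origin` remote is not configured). Output is discarded.
const gitProbe: Probe = (repoPath) => {
  execFileSync('git', ['remote', 'get-url', 'origin'], {
    cwd: repoPath,
    stdio: 'ignore',
  });
};

function hasOriginRemote(repoPath: string, probe: Probe = gitProbe): boolean {
  try {
    probe(repoPath);
    return true;
  } catch {
    // No origin remote (or git unavailable): caller skips the pull.
    return false;
  }
}
```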

Cherry-picked from PR #1119.

Co-Authored-By: hnshah <hnshah@users.noreply.github.com>

* fix(query): drain cache writes before CLI exit

The query cache write was fired with `void promise.catch(...)` — true
fire-and-forget. On a fast CLI invocation (`gbrain query <q>` exits in
~50ms), the process terminates before the cache write commits. Result:
the cache effectively never warms from CLI use; every query is a miss.

`awaitPendingSearchCacheWrites()` tracks each in-flight cache write in a
module-level Set. The CLI dispatcher awaits the set after `query`
finishes formatting output but before the process exits. MCP server path
unchanged (long-lived process, fire-and-forget remains correct).
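
The tracking pattern can be sketched as below. Only `awaitPendingSearchCacheWrites()` and the module-level Set are named in the commit message; `trackSearchCacheWrite` and the Set's variable name are hypothetical.

```typescript
// Module-level set of in-flight cache writes.
const pendingCacheWrites = new Set<Promise<void>>();

// Register a cache write; it removes itself from the set once settled.
function trackSearchCacheWrite(write: Promise<unknown>): void {
  const tracked: Promise<void> = write
    .then(() => undefined, () => undefined) // cache failures stay best-effort
    .finally(() => {
      pendingCacheWrites.delete(tracked);
    });
  pendingCacheWrites.add(tracked);
}

// Awaited by the CLI dispatcher after output is printed, before exit.
async function awaitPendingSearchCacheWrites(): Promise<void> {
  await Promise.all(Array.from(pendingCacheWrites));
}
```

A long-lived server can keep calling the tracker and never await the set, which matches the fire-and-forget behavior kept for the MCP path.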

Cherry-picked from PR #1125.

Co-Authored-By: hnshah <hnshah@users.noreply.github.com>

* fix(backlinks): dedupe (source, target) pairs within a single source page

A source page that mentions the same entity N times produced N
duplicate "Referenced in" lines on the target. `extractEntityRefs`
returns one EntityRef per occurrence, and the per-ref `hasBacklink`
check reads a snapshot of `target.content` that's frozen at outer
scope — so every iteration sees "no backlink yet" and appends another
gap. The cumulative effect on a long meeting note with multiple
mentions of the same person was visible in PRs landing 3-5 identical
Timeline entries.

Track seen target slugs per source page; cap gaps at one per (source, target) pair.
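
A sketch of the dedupe, under assumed type and field names (`EntityRef.targetSlug`, `BacklinkGap`); only the per-source "seen" set reflects the fix itself.

```typescript
// Illustrative shapes; the real EntityRef carries more fields.
interface EntityRef { targetSlug: string }
interface BacklinkGap { sourceSlug: string; targetSlug: string }

function findBacklinkGaps(
  sourceSlug: string,
  refs: EntityRef[],
  hasBacklink: (targetSlug: string) => boolean,
): BacklinkGap[] {
  const seen = new Set<string>();
  const gaps: BacklinkGap[] = [];
  for (const ref of refs) {
    // Skip repeat mentions: this (source, target) pair is already handled.
    if (seen.has(ref.targetSlug)) continue;
    seen.add(ref.targetSlug);
    if (!hasBacklink(ref.targetSlug)) {
      gaps.push({ sourceSlug, targetSlug: ref.targetSlug });
    }
  }
  return gaps;
}
```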

Cherry-picked from PR #967 with a current-master regression test
covering both markdown-link and Obsidian-wikilink formats in the same
source page.

Co-Authored-By: p3ob7o <p3ob7o@users.noreply.github.com>

* fix(dream): audit backlinks without mutating pages during cycle

The dream/autopilot maintenance cycle ran the backlinks phase in 'fix'
mode, which writes "Referenced in" timeline bullets into entity pages
every sync. The graph extractor + auto-link path is the canonical link
store during sync/dream/autopilot — the legacy filesystem fixer wrote
markdown that fought with both the user's manual edits and the graph
layer's own timeline.

Cycle now runs backlinks in 'check' mode (audit-only); the materializer
remains available via `gbrain check-backlinks fix` for users who really
want markdown backlinks committed to disk.

Cherry-picked from PR #1027.

Co-Authored-By: sliday <sliday@users.noreply.github.com>

* fix(autopilot --install): source ~/.zshenv before zshrc/bashrc

zshenv is the canonical place for env vars in zsh on macOS — zshrc is
sourced only for interactive shells, so vars exported in zshrc don't
reach a non-interactive subprocess like the autopilot wrapper. Users
who exported GBRAIN_DATABASE_URL, OPENAI_API_KEY, or ANTHROPIC_API_KEY
in zshrc and assumed autopilot would inherit them hit silent missing-
secret failures on the LaunchAgent.

Source ~/.zshenv first (always reaches non-interactive shells per zsh
docs), then fall back to ~/.zshrc / ~/.bashrc for users on other
profile conventions.

Cherry-picked from PR #966.

Co-Authored-By: p3ob7o <p3ob7o@users.noreply.github.com>

* fix(apply-migrations): return exit 0 on list/dry-run/up-to-date

`gbrain apply-migrations list`, `gbrain apply-migrations --dry-run`, and
the "All migrations up to date" path were returning from the async
function but never calling `process.exit(0)`. The CLI dispatcher in
cli.ts treated the implicit fall-through as exit 1 when the parent
process inspected status via shell scripts, breaking automation that
gates on `apply-migrations list && do-something`.

Three call sites: list, dry-run, and the no-op path. All three now
exit(0) explicitly.

Cherry-picked from PR #1062.

Co-Authored-By: nezovskii <nezovskii@users.noreply.github.com>

* fix(sync): scope auto-embed to source on incremental syncs

`gbrain sync --source-id X` triggered auto-embed for the affected slugs
but `runEmbed` ran with no `--source` flag, so it fell back to the
default source. For non-default-source syncs the page row lives at
(sourceId, slug) — the embed code saw "Page not found" for the right
slug under the wrong source, swallowed the error as best-effort, and
the sync result reported `embedded: 0` for the wrong reason.

`buildAutoEmbedArgs(slugs, sourceId)` is the new helper: when sourceId
is set, prepends `--source X`. Exported for the regression test.

Pairs with the upcoming source-id write-path audit (P1 #8). Cherry-picked
from PR #1120.

Co-Authored-By: hnshah <hnshah@users.noreply.github.com>

* fix(query): honor source_id with no-expand for cross-source search

Two related corrections:

1. `gbrain query --no-expand` parsed `--no-expand` as the literal key
   `no_expand` instead of negating the boolean `expand` param. Result:
   the flag was silently ignored and expansion always ran. Now any
   `--no-<key>` where `<key>` is a boolean param flips it false.

2. The `query` op's source-id resolution treated `ctx.sourceId` as
   authoritative, so an explicit per-call `source_id` was overridden by
   the federated read scope. Now per-call `source_id` wins;
   `source_id=__all__` is an explicit opt-out for local cross-source
   search.
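
Both corrections can be sketched with two small pure functions. The names `parseFlags` and `resolveQuerySource` are hypothetical, and returning `undefined` to mean "all sources" is an assumption about how the opt-out is represented.

```typescript
// Hypothetical flag parser; `booleanParams` lists the op's boolean params.
function parseFlags(
  argv: string[],
  booleanParams: Set<string>,
): Record<string, string | boolean> {
  const out: Record<string, string | boolean> = {};
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (!arg.startsWith('--')) continue;
    const name = arg.slice(2).replace(/-/g, '_');
    if (name.startsWith('no_') && booleanParams.has(name.slice(3))) {
      out[name.slice(3)] = false; // --no-<key> negates a known boolean param
    } else if (booleanParams.has(name)) {
      out[name] = true;
    } else if (i + 1 < argv.length) {
      out[name] = argv[++i]; // value-taking flag consumes the next token
    }
  }
  return out;
}

// Per-call source_id wins over the context scope; `__all__` opts out.
function resolveQuerySource(
  perCall: string | undefined,
  ctxSourceId: string | undefined,
): string | undefined {
  if (perCall === '__all__') return undefined; // assumed "no scope" marker
  return perCall ?? ctxSourceId;
}
```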

Cherry-picked from PR #1124.

Co-Authored-By: hnshah <hnshah@users.noreply.github.com>

* fix(doctor): child-table orphan detection (closes #1063)

The autopilot orphans phase detects orphan PAGES (no inbound links via
page-graph) but never scans FK-child tables. After a bulk delete or a
pre-FK-migration code path, orphan rows can persist indefinitely in
content_chunks, page_versions, tags, takes, raw_data, timeline_entries,
or links — all declared ON DELETE CASCADE, so any orphan row is
unexpected.

`childTableOrphansCheck` enumerates 10 FK columns across 8 tables:
- 8 NOT NULL columns (cascade): any value not in pages.id is an orphan.
- 2 nullable SET NULL columns (links.origin_page_id, files.page_id):
  NULL is valid; only NOT-NULL-but-missing-in-pages counts.

Surfaces paste-ready cleanup SQL when orphans are found.
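
One possible shape for the per-column detection query, as a sketch. The query text, the alias names, and the helper name `orphanCountSql` are assumptions; only the table and column pairs such as `links.origin_page_id` and the `pages.id` target come from the commit message.

```typescript
interface FkColumn { table: string; column: string }

// Builds a count query for rows whose FK value points at no existing page.
// NULLs are excluded, which is what the SET NULL columns need; for NOT NULL
// columns the IS NOT NULL filter is simply a no-op.
function orphanCountSql({ table, column }: FkColumn): string {
  return (
    `SELECT count(*) AS orphans FROM ${table} c ` +
    `WHERE c.${column} IS NOT NULL ` +
    `AND NOT EXISTS (SELECT 1 FROM pages p WHERE p.id = c.${column})`
  );
}
```

Table and column names are interpolated directly, so this is only safe for a fixed, hardcoded list of FK columns, never for user input.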

Cherry-picked from PR #1064.

Co-Authored-By: vincedk-alt <vincedk-alt@users.noreply.github.com>

* fix(autopilot,cycle): stop respawn-storm from steady-state 'partial' cycles

Two compounding bugs under KeepAlive=true:

1. Autopilot tripped its circuit breaker on cycle.status === 'partial',
   not just 'failed'. 'partial' means at least one phase warned/failed
   while others ran — a soft signal, not fatal. On every cycle that
   warned, autopilot logged a failure and the supervisor respawned the
   worker.

2. The orphans phase emitted 'warn' when `count > 20` orphan pages.
   That threshold was tuned for small dev brains; on any corpus past a
   few hundred pages it fires every cycle in steady state. Together
   with bug 1, this produced visible respawn storms.
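
A minimal sketch of the two conditions in their corrected form, using the 0.5 ratio this commit settles on. The function names, the `'ok'` status value, and the handling of an empty brain are assumptions.

```typescript
// 'partial' and 'failed' come from the commit message; 'ok' is assumed.
type CycleStatus = 'ok' | 'partial' | 'failed';

// Only a hard failure trips the circuit breaker; 'partial' is a soft signal.
function shouldTripBreaker(status: CycleStatus): boolean {
  return status === 'failed';
}

// Warn by ratio rather than by an absolute orphan count.
function orphansPhaseWarns(orphans: number, totalPages: number): boolean {
  if (totalPages === 0) return false; // assumption: an empty brain never warns
  return orphans / totalPages > 0.5;
}
```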

Fix:
- Autopilot trips only on cycle.status === 'failed'.
- Orphans phase warns by ratio: orphans / total_pages > 0.5 (the real
  "your graph fell apart" signal), not by absolute count.

Cherry-picked from PR #1113.

Co-Authored-By: sergeclaesen <sergeclaesen@users.noreply.github.com>

* fix(ai): reject partial embedding responses before indexing

`embedSubBatch` only validated the FIRST embedding's dimension and never
asserted the response length matched the input length. If a provider
returned fewer embeddings than requested (rate-limit truncation,
malformed response, etc.), the gateway silently indexed an offset-shifted
result — every page after the missing index got the embedding of a
different page's chunk.

Two new guards:
1. `result.embeddings.length === texts.length` — fail loud if any count
   mismatch, with a paste-ready retry hint.
2. Validate dim on EVERY embedding, not just the first.
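
The two guards can be sketched as a single validation step. The function name, parameter names, and error wording are hypothetical; the checks themselves are the ones listed above.

```typescript
function assertCompleteEmbeddings(
  texts: string[],
  embeddings: number[][],
  expectedDim: number,
): void {
  // Guard 1: one embedding per input, otherwise indexes would shift.
  if (embeddings.length !== texts.length) {
    throw new Error(
      `Embedding count mismatch: sent ${texts.length}, got ${embeddings.length}`,
    );
  }
  // Guard 2: check the dimension of every vector, not just the first.
  embeddings.forEach((vec, i) => {
    if (vec.length !== expectedDim) {
      throw new Error(`Embedding ${i} has dim ${vec.length}, expected ${expectedDim}`);
    }
  });
}
```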

Cherry-picked from PR #926.

Co-Authored-By: 100yenadmin <100yenadmin@users.noreply.github.com>

* fix(serve): admin register-client supports auth_code + PKCE public clients

The admin dashboard's /admin/api/register-client endpoint hardcoded
client_credentials and ignored grantTypes, redirectUris, and
tokenEndpointAuthMethod. Result: you couldn't register a browser-based
PKCE client (claude.ai Custom Connector, Cursor, etc.) through the
dashboard — only confidential machine-to-machine clients worked.

Pass grantTypes / redirectUris through to registerClientManual. When
tokenEndpointAuthMethod === 'none', NULL out client_secret_hash so the
SDK's clientAuth middleware skips the hash-vs-plaintext compare that
would otherwise reject the no-secret PKCE flow.

Cherry-picked from PR #1077.

Co-Authored-By: lukejduncan <lukejduncan@users.noreply.github.com>

* fix(extract-facts): treat slugs:[] as no-op, not unscoped full-walk

`runExtractFacts` checked `opts.slugs && opts.slugs.length > 0` to
decide between scoped and full-brain walk. Both `undefined` (caller
omits → full walk intended) AND `[]` (sync no-op → zero work intended)
fall through to the same `else` branch and triggered
`engine.getAllSlugs()`.

On a multi-thousand-page brain, the unintended full walk exceeded
the autopilot-cycle ~600s timeout and dead-lettered the job — visible
in production as `[cycle.extract_facts] start` followed by silence
until `Autopilot stopping (cycle-failure-cap)`.

Use presence (`opts.slugs !== undefined`), not truthiness, to
distinguish the two modes. Empty array is a real incremental no-op.
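
The presence-versus-truthiness distinction, as a small sketch. `resolveExtractScope` and the `ExtractScope` type are hypothetical names used only to show the branch.

```typescript
type ExtractScope =
  | { kind: 'full' }
  | { kind: 'scoped'; slugs: string[] };

function resolveExtractScope(opts: { slugs?: string[] }): ExtractScope {
  // Presence check: `[]` is a scoped run over zero slugs, not a full walk.
  return opts.slugs !== undefined
    ? { kind: 'scoped', slugs: opts.slugs }
    : { kind: 'full' };
}
```

With the old `opts.slugs && opts.slugs.length > 0` test, the `[]` case would have landed in the `'full'` branch.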

Closes #1096. Three regression cases in test/extract-facts-phase.test.ts:
slugs=[] no-op, slugs=undefined still walks, slugs=['a'] walks just one.

Co-Authored-By: navin-moorthy <navin-moorthy@users.noreply.github.com>

* fix(serve): embed admin/dist into binary; serve from manifest (closes #1090)

Pre-fix, /admin returned 404 on every globally-installed binary because
serve-http.ts:780 resolved admin/dist via process.cwd(). The admin SPA
files are checked into git but `bun build --compile` does NOT embed
arbitrary directories — only assets imported via `with { type: 'file' }`
ESM imports land in the compiled binary.

Wire:

- scripts/build-admin-embedded.ts walks admin/dist/, emits
  src/admin-embedded.ts with one `with { type: 'file' }` import per
  file + a manifest map (request path → resolved path + mime).
  Auto-invoked by `bun run build:admin`.

- src/admin-embedded.ts is the auto-generated module. Bun resolves
  every file: import to a path that works at runtime inside the
  compiled binary (same pattern as src/core/chunkers/code.ts WASM
  imports).

- serve-http.ts switches to two-tier resolution: cwd-relative
  admin/dist for dev (Vite hot-rebuild), embedded manifest otherwise.
  Embedded path reads bytes lazily and caches per-asset for the
  lifetime of the process.

- scripts/check-admin-embedded.sh CI gate re-runs the generator and
  fails on drift (mirrors check-wasm-embedded.sh). PRs that rebuild
  admin/dist but forget to regenerate the embedded module fail loud.

- package.json wires build:admin-embedded + check:admin-embedded.

Closes #1090.

* test(source-id): lock in routing regression coverage (closes #891 #978 #1078)

Audit of every page write path (sync, embed, extract, dream, autopilot,
wikilinks, tags, chunks) confirmed that sourceId already threads
correctly through importFromContent → engine.putPage → SQL INSERT
since v0.18.0. The original bug reports from #891, #978, #1078 were
real at the time and got swept by the multi-source refactor; today's
master is correct.

This commit locks in that correctness with six PGLite regression cases
(no Postgres fixture needed; runs in CI everywhere):

1. importFromContent({sourceId:"work"}) lands at source_id=work, not
   the silent 'default' fallback.
2. Two sources hold the same slug independently.
3. Omitting sourceId falls through to 'default' (legacy contract).
4. Chunks land under the requested source.
5. Tags land under the requested source.
6. FK integrity smoke (originally #1078).

The earlier issue reports stay closed by the existing threading; this
suite ensures any future refactor of the write path can't silently
re-introduce the wrong-source-default bug. The 90-minute write-path
audit budget from the plan resolves here.

* fix(apply-migrations): unblock PGLite chain (closes #1100)

`gbrain apply-migrations --yes` was wedging on the v0.11.0 (Minions)
schema phase for PGLite installs. Two compounding bugs:

1. `apply-migrations` pre-flight schema-version warning connects to
   PGLite to read config.version, then disconnects. The brief lock
   hold races with downstream subprocess spawns that try to re-acquire
   it; the 30s lock timeout fires before the parent fully releases.
   Pre-flight is a *warning*; on PGLite it adds no information the
   orchestrators don't already handle. Skip the probe for PGLite.

2. v0.11.0 phase A spawned `gbrain init --migrate-only` as an execSync
   subprocess to apply schema migrations. PGLite is single-writer;
   the subprocess inherits HOME and tries to lock the same DB. On
   Postgres this works (concurrent connections OK); on PGLite it
   deadlocks. Route in-process for PGLite — create + connect +
   initSchema + disconnect directly, skipping the subprocess hop.
   Postgres keeps the legacy execSync path.

Verified: fresh PGLite install now walks the full migration chain
through v0.32.2 (Facts SoR) and lands "All migrations up to date" on
re-run.

Closes #1100.

* fix(serve): bootstrap token env override + suppress flag (closes #1024)

`gbrain serve --http` regenerated the admin bootstrap token on every
restart and printed it to stderr. In supervisor-managed production
deployments (LaunchAgent, systemd, k8s) every restart leaks the value
into log aggregators and rotates the access for any agent that paste-
copied it.

Two new knobs:

- **GBRAIN_ADMIN_BOOTSTRAP_TOKEN** env var: when set, used as the
  bootstrap secret instead of a fresh per-process token. Validated:
  must match `^[A-Za-z0-9_-]{32,}$` (32-char minimum), else refuse to
  start with a paste-ready generator hint. Failing closed beats
  silently accepting a weak token.

- **--suppress-bootstrap-token** CLI flag: suppresses the printed
  token line entirely. Operator takes responsibility for tracking the
  value out-of-band.

Startup banner now reflects the chosen source:
- `Admin Token: suppressed` when the flag is set.
- `Admin Token: from $GBRAIN_ADMIN_BOOTSTRAP_TOKEN` when env-sourced.
- Full token print only when both are absent (default behavior, dev
  installs).

Closes #1024.

Co-Authored-By: billy-armstrong <billy-armstrong@users.noreply.github.com>

* fix(config): migrate legacy 'provider' + 'model' to 'embedding_model'

Pre-v0.32 docs and some community templates used a config shape:

  { "provider": "voyage", "model": "voyage-4-large" }

The canonical shape (since the v0.31.12 gateway seam) is:

  { "embedding_model": "voyage:voyage-4-large" }

Users on the legacy shape hit silent fallthrough to the hardcoded
OpenAI default; sync + embed errored out with "OpenAI embedding
requires OPENAI_API_KEY" regardless of their actual provider config.

loadConfig() now translates the legacy keys at parse time:
- emits a one-line stderr nudge with the paste-ready canonical key
- preserves the rest of the config unchanged
- skipped when `embedding_model` is already set (forward-compat)
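
A sketch of the translation step. The function name, the `RawConfig` type, and the warning text are hypothetical; the key names and the `<provider>:<model>` format come from the commit message.

```typescript
interface RawConfig {
  provider?: string;
  model?: string;
  embedding_model?: string;
  [key: string]: unknown;
}

function migrateLegacyEmbeddingConfig(
  raw: RawConfig,
  warn: (msg: string) => void = console.error,
): RawConfig {
  // Canonical key wins; incomplete legacy shapes are left alone.
  if (raw.embedding_model || !raw.provider || !raw.model) return raw;
  const embedding_model = `${raw.provider}:${raw.model}`;
  warn(`Legacy config keys detected; set "embedding_model": "${embedding_model}"`);
  // Everything else in the config is preserved unchanged.
  return { ...raw, embedding_model };
}
```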

Closes #1086.

Co-Authored-By: jeunessima <jeunessima@users.noreply.github.com>

* chore(test): quarantine upgrade tests (process.env mutation)

PR #1032's cherry-picked tests use the static-snapshot + try/finally
pattern for env vars instead of the project's withEnv() helper. The
test-isolation lint catches process.env mutations outside withEnv to
prevent cross-test leakage in parallel runs.

Renaming to *.serial.test.ts (the quarantine convention) is the
documented out: runs sequentially, no cross-file race. A future cleanup
PR can migrate the tests to withEnv() and drop the quarantine.

* fix(test): update brain-writer .bak assertion for centralized backup path

The v0.36.x frontmatter backup change (bd60cdf, closes #902) moved
.bak files from sibling-of-source to ~/.gbrain/backups/frontmatter/...
The old test still asserted on the sibling path, so CI failed even
though the production behavior was correct.

Updated assertion contract: backup lands under the injected backupRoot
(test-isolated), the returned backupPath ends in .bak and exists, and
no sibling .bak is created next to the source file. The pre-fix
sibling-path is now a negative assertion.

* chore: bump version and changelog (v0.36.1.0)

v0.36.1.0 — community fix wave (28 atomic fixes + 22 PRs closed as
already-shipped + 14 issues triaged).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test(fix-wave): close test gaps surfaced by post-ship audit

After the fix-wave shipped, an audit found 11 commits with no new test
file. Some were inherently structural (build pipelines, shell content)
or had existing test coverage that worked either way; others had real
regression risk with no guard. This commit closes the gaps that matter.

New regression tests for:

- OAuth `verifyAccessToken` throws `InvalidTokenError` (not bare Error)
  on both expired and unknown token paths. Pre-fix, the SDK's
  `requireBearerAuth` middleware fell through to 500 instead of 401 →
  client token-refresh logic never fired (#935).

- `loadConfig` translates legacy `{provider, model}` config shape to
  the canonical `embedding_model: <provider>:<model>`. 3 cases: pure
  legacy → migrated; canonical wins over legacy when both present;
  canonical-only is untouched. Pre-fix, Voyage/Cohere/Mistral users
  silently fell through to OpenAI (#1086).

- `configDir` rejects relative paths; rejects `..` segments via both
  separators (regression guard for the Windows path acceptance fix
  #1019 / cherry-pick #1083).

- `resolveBootstrapToken` (new exported helper extracted from
  `runServeHttp`). 9 cases: unset env generates fresh, valid env
  accepted, hyphens/underscores accepted, < 32 chars rejected, special
  chars rejected, whitespace trimmed, empty string rejected, 32-char
  boundary accepted, 31-char one-short rejected. Security-critical
  validation surface (#1024).

- GET /mcp returns 405 with `Allow: POST, DELETE` (E2E case in
  `serve-http-oauth.test.ts`). Pre-fix, claude.ai and other probing
  MCP clients saw 404 and gave up (#1076).

- apply-migrations `process.exit(0)` on list / dry-run / up-to-date
  paths. Source-shape assertion locks the rule in; shell scripts
  gating on `$?` work (#1062).

- Autopilot wrapper sources `~/.zshenv` BEFORE `~/.zshrc`. zshenv is
  the canonical place for env vars in non-interactive zsh; without
  this ordering, LaunchAgent subprocesses never inherit secrets
  exported in zshrc (#966).

- `test/fix-wave-structural.test.ts` consolidates source-shape
  regression guards for fixes whose behavior is hard to runtime-test
  without heavy mocking: query cache drain (#1125), admin embed
  manifest + handler (#1090), admin register-client PKCE branch
  (#1077), PGLite v0.11.0 phase A in-process routing (#1100), query
  `--no-expand` negation (#1124). 9 source-grep assertions.

Refactored `runServeHttp` to extract `resolveBootstrapToken` as a pure
helper. The boot path now consumes the helper's tagged-union result
({kind:'ok'|'error'}); side effects (`process.exit`, `console.error`)
moved to the caller. Unit-testable without spinning up Express.
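
A sketch of what such a pure helper could look like. The `kind: 'ok' | 'error'` tags and the validation regex come from the commit messages; the other result fields, the error text, and the injected `generate` callback are assumptions.

```typescript
type BootstrapTokenResult =
  | { kind: 'ok'; token: string; source: 'env' | 'generated' }
  | { kind: 'error'; message: string };

const BOOTSTRAP_TOKEN_RE = /^[A-Za-z0-9_-]{32,}$/;

// Pure: no process.exit and no logging; the caller decides what to do.
function resolveBootstrapToken(
  envValue: string | undefined,
  generate: () => string,
): BootstrapTokenResult {
  if (envValue === undefined) {
    return { kind: 'ok', token: generate(), source: 'generated' };
  }
  const trimmed = envValue.trim();
  if (!BOOTSTRAP_TOKEN_RE.test(trimmed)) {
    // Fail closed on weak or malformed values, including the empty string.
    return {
      kind: 'error',
      message: 'GBRAIN_ADMIN_BOOTSTRAP_TOKEN must match ^[A-Za-z0-9_-]{32,}$',
    };
  }
  return { kind: 'ok', token: trimmed, source: 'env' };
}
```

Passing the generator in keeps the helper deterministic under test; the boot path would supply a real random-token function.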

Test counts: oauth 71 (was 69), config 20 (was 14), apply-migrations
19 (was 18), autopilot-install 5 (was 4), serve-http-bootstrap-token
9 (new file), fix-wave-structural 9 (new file). Net: +28 cases across
6 files; +1 new exported function with full coverage.

Remaining audit gaps (deferred):
- e82dda0 admin embed E2E (post-deploy curl smoke covers this)
- d93fa81 apply-migrations PGLite chain E2E (already smoke-tested
  manually in the original commit; subprocess test would be flaky in
  CI without DATABASE_URL gating)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test: close the two deferred E2E gaps from the post-ship audit

Both gaps now have real behavior coverage. No DATABASE_URL needed (PGLite
engine), so they run in standard unit CI alongside the rest of the suite.
Serial quarantine because both spawn subprocesses + bind ports / write
tmpdirs.

test/admin-embed-spawn.serial.test.ts (4 cases, ~6s wall-clock):
  - Spawns `gbrain serve --http` from a fresh tmpdir so `process.cwd()/
    admin/dist` does not exist — this forces the embedded-manifest
    branch (the one under test). Pre-fix, this exact setup hit 404.
  - GET /admin/ → 200 + SPA shell HTML (title + #root div), content-type
    text/html.
  - GET /admin/index.html → same body via explicit path.
  - GET /admin/agents → SPA fallback returns index.html for deep links.
  - GET /admin/api/stats → NOT 200 (regression guard: SPA fallback must
    not swallow /admin/api/* routes and silently return HTML to a JSON
    client). Closes #1090.

test/apply-migrations-pglite-spawn.serial.test.ts (3 cases, ~25s):
  - Seeds a fresh PGLite config in a tmpdir, runs `gbrain init
    --migrate-only` + `gbrain apply-migrations --yes --non-interactive`.
    Pre-fix this hit "GBrain: Timed out waiting for PGLite lock" because
    apply-migrations' pre-flight probe + v0.11.0's phase A subprocess
    both wanted the single-writer lock.
  - Asserts exit 0, no "Timed out" string, no "Phase A failed" string,
    brain.pglite file written.
  - Re-run case: idempotent — "All migrations up to date" exits 0
    (also locks in the #1062 exit-code fix end-to-end).
  - --list path exits 0 (third leg of the #1062 contract).
  Closes #1100.

Pinned bootstrap token via GBRAIN_ADMIN_BOOTSTRAP_TOKEN env so the
admin test doesn't have to scrape stderr; the startup banner format
is allowed to drift, the /health probe is the readiness contract.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(test): consolidate PGLite spawn test to one end-to-end pass

CI failed on test/apply-migrations-pglite-spawn.serial.test.ts (Ubuntu,
bun 1.3.14). The previous shape ran 3 tests × ~3 spawns each. Each
`bun run /abs/src/cli.ts` from a tmpdir cwd pays a full parse/transpile
cost (no near-cwd .bun cache); on Ubuntu CI that compounds past the
runner's per-test budget.

Consolidated to ONE test that exercises the full lifecycle in one
brain: init --migrate-only → apply-migrations --yes → re-run → --list.

Four spawns instead of eight. Local wall-clock: 32s → 11.5s. All four
assertion buckets preserved: no PGLite lock timeout, no Phase A
failure, brain.pglite written, idempotent re-run "All migrations up
to date" exits 0 (#1062 end-to-end), --list exits 0.

Per-test timeout 480_000ms as insurance against the runner's
--timeout=60000 default (bun's API spec: per-test wins).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test(diag): dump apply-migrations output when CI exit != 0

The PGLite spawn test passes locally on macOS/bun 1.3.13 in ~11s
end-to-end but fails on Ubuntu/bun 1.3.14 in 4.92s with apply.exitCode
= 1 — fast enough that something is failing early, not timing out.
The runCli helper captured stdout+stderr but never printed them, so
the CI log only showed the bare assertion failure.

This commit prints the captured streams from BOTH init and apply
when the exit code mismatches expectation. After the next CI run we
can read the actual error message and diagnose the Ubuntu-specific
failure mode (likely BUN_INSTALL / HOME / PGLite WASM env quirk).
No behavior change; pure diagnostic output, gated on failure.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(test): shim `gbrain` on PATH for PGLite spawn test

Root cause of the Ubuntu CI failure: the v0.11.0 orchestrator's phase B
runs `execSync('gbrain jobs smoke')`. PGLite phase A now routes
in-process (the #1100 fix), but phase B and several follow-up phases
still shell out to the `gbrain` binary on PATH. Locally the binary
resolves via `bun link`; on CI Ubuntu it does not exist on PATH, so
execSync exits 127 → orchestrator returns 'failed' → apply-migrations
exits 1. Test failed at 4.92s with exitCode=1, well before any timeout.

Verified locally by removing ~/.bun/bin/gbrain to simulate CI:
  pre-shim:  apply.exitCode=1 (same as CI)
  post-shim: apply.exitCode=0 in 8.4s

The shim writes a tiny `gbrain` executable to a tmpdir that just
`exec`s `bun run <repo>/src/cli.ts "$@"`. Prepended to PATH for the
spawned subprocesses. Mirrors the production contract (gbrain on
PATH) without depending on `bun link` having run in the CI image.
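
A minimal sketch of the shim described above, kept as pure string builders so it can be checked without touching the filesystem. The helper names `buildShimScript` and `prependToPath` are hypothetical and not taken from the test file; the `:` separator assumes a POSIX runner.

```typescript
// Hypothetical helpers for illustration; the real test may build the shim differently.

// Returns the text of a POSIX sh wrapper that forwards all arguments
// to the repo's TypeScript entry point via `bun run`.
export function buildShimScript(repoRoot: string): string {
  return `#!/bin/sh\nexec bun run "${repoRoot}/src/cli.ts" "$@"\n`;
}

// Puts the shim directory first on PATH so spawned subprocesses
// resolve `gbrain` to the shim before anything else.
export function prependToPath(
  dir: string,
  currentPath: string | undefined,
  sep = ":", // assumption: POSIX path-list separator
): string {
  return currentPath ? `${dir}${sep}${currentPath}` : dir;
}
```

The caller would write the script to `<tmpdir>/gbrain`, mark it executable, and pass the new PATH in the spawned process's environment.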

Diagnostic dump from the previous commit stays — useful insurance for
the next time something silently fails inside a spawned binary.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: johnybradshaw <johnybradshaw@users.noreply.github.com>
Co-authored-by: mvanhorn <mvanhorn@users.noreply.github.com>
Co-authored-by: sharziki <sharziki@users.noreply.github.com>
Co-authored-by: Aashiqe10 <Aashiqe10@users.noreply.github.com>
Co-authored-by: lukejduncan <lukejduncan@users.noreply.github.com>
Co-authored-by: 100yenadmin <100yenadmin@users.noreply.github.com>
Co-authored-by: hnshah <hnshah@users.noreply.github.com>
Co-authored-by: p3ob7o <p3ob7o@users.noreply.github.com>
Co-authored-by: sliday <sliday@users.noreply.github.com>
Co-authored-by: nezovskii <nezovskii@users.noreply.github.com>
Co-authored-by: vincedk-alt <vincedk-alt@users.noreply.github.com>
Co-authored-by: sergeclaesen <sergeclaesen@users.noreply.github.com>
Co-authored-by: navin-moorthy <navin-moorthy@users.noreply.github.com>
Co-authored-by: billy-armstrong <billy-armstrong@users.noreply.github.com>
Co-authored-by: jeunessima <jeunessima@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
@garrytan
Owner

Closing in favor of #1253 (v0.37.7.0 fix wave). The sync --source flag already existed; this PR's contribution focused on threading source-id into the auto-embed args, which the wave preserves via the v0.31.x buildAutoEmbedArgs helper at src/commands/sync.ts:215. Co-Authored-By: hnshah trailer included. Thank you for the dogfood time.

@garrytan garrytan closed this May 21, 2026
garrytan added a commit that referenced this pull request May 21, 2026
…dential clients (#1253)

* fix(reindex-frontmatter): connect engine before query (#1225)

`createEngine()` from src/core/engine-factory.ts only constructs the
engine; callers MUST call connect() before any executeRaw. The
reindex-frontmatter CLI was constructing the engine and going
straight to countAffected, which crashed on PGLite with "PGLite not
connected. Call connect() first." even on --dry-run.

Fix follows the existing-command pattern (src/commands/auth.ts,
src/commands/backfill.ts, src/commands/integrity.ts all do the
same): pass toEngineConfig(cfg) into both createEngine() AND
engine.connect(), then engine.initSchema() (idempotent on a current
schema, ~1ms cost).

Pre-fix verification: codex outside-voice CF5 flagged the related
"can't import connectEngine from cli.ts" misdirection in the
original fix plan. This implementation uses the canonical sibling
pattern instead.

Regression test pinned at test/reindex-frontmatter-connect.test.ts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump VERSION to 0.37.7.0 + stub CHANGELOG

v0.37.5.0 claimed by #1229 (warsaw-v4); v0.37.6.0 by #1246
(OpenRouter recipe). v0.37.7.0 is the next free slot for this
fix wave.

CHANGELOG entry stubbed in user-facing voice per CLAUDE.md
"CHANGELOG voice + release-summary format" — ELI10 lead-first,
real fix details below. The "## To take advantage of v0.37.7.0"
block follows the v0.13+ self-repair pattern from CLAUDE.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(subagent): short-circuit terminal-on-resume (#1151)

Bug: when the worker resumed a subagent job whose persisted last
message was an assistant turn with text-only content (no tool_use
blocks), the replay reconciler at subagent.ts:241-247 had no
branch for that case. The main loop then called messages.create
against a conversation ending in assistant role, which Sonnet 4.6+
rejects with HTTP 400 "This model does not support assistant
message prefill." 3 retries later → dead-letter, despite all the
job's work having committed in earlier turns.

@zscgeek's bug report pinned this exactly: dream-cycle Otter
corpus runs hit ~7% dead-letter rate, every dead job's last
subagent_messages row was a text-only synthesis summary listing
slugs that already existed in `pages`. Their proposed fix mirrors
this implementation.

Fix: add an else branch to the assistant-tail check that mirrors
the live-loop terminal logic at subagent.ts:440-447 — reconstruct
finalText from the persisted text blocks, return
stop_reason='end_turn' immediately. No LLM call, no schema change.
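
The resume decision can be sketched roughly as below. The types and the `decideOnResume` name are hypothetical, and joining text blocks with a newline is an assumption; the commit only says finalText is reconstructed from the persisted text blocks.

```typescript
// Hypothetical shapes for illustration; the real subagent.ts types may differ.
type Block =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string };
type Message = { role: "user" | "assistant"; content: Block[] };

type ResumeDecision =
  | { kind: "terminal"; stop_reason: "end_turn"; finalText: string }
  | { kind: "replay_tools"; toolUseIds: string[] }
  | { kind: "continue" };

export function decideOnResume(messages: Message[]): ResumeDecision {
  const last = messages[messages.length - 1];
  // Nothing persisted, or the tail is a user turn: the main loop can proceed.
  if (!last || last.role !== "assistant") return { kind: "continue" };

  const toolUseIds: string[] = [];
  const texts: string[] = [];
  for (const block of last.content) {
    if (block.type === "tool_use") toolUseIds.push(block.id);
    else texts.push(block.text);
  }

  // Existing branch: pending tool calls get replayed.
  if (toolUseIds.length > 0) return { kind: "replay_tools", toolUseIds };

  // New branch: a text-only assistant tail is already terminal, so return
  // without another LLM call (which would be rejected as assistant prefill).
  return { kind: "terminal", stop_reason: "end_turn", finalText: texts.join("\n") };
}
```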

Two new regression cases:
  - text-only terminal on resume returns immediately with zero
    messages.create calls
  - tool-use replay path unchanged (existing behavior preserved)

Codex outside-voice (CF13) initially flagged this fix as
mis-targeted, claiming subagent.ts already handled the case.
/investigate run revealed the live-loop terminal at :440-447 was
covered but the REPLAY-path terminal at :241-247 was missing —
both branches need symmetric handling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(autopilot): scope lockfile to GBRAIN_HOME (#1226)

The autopilot lockfile was hardcoded at `~/.gbrain/autopilot.lock`
(via `process.env.HOME`), bypassing GBRAIN_HOME. Two brains pointed
at different GBRAIN_HOME directories still wrote to the same global
lockfile; one would silently take over the other on each restart.

Fix: route through `gbrainPath('autopilot.lock')` from
src/core/config.ts (imported aliased as gbrainHomePath since the
local `gbrainPath` var in installAutopilot references the CLI
binary path). The mkdirSync(`~/.gbrain`) call also routes through
the helper so the directory is created in the right place too.
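
A rough sketch of the path resolution, for illustration only. It ASSUMES that GBRAIN_HOME names the config directory itself and that the fallback is `$HOME/.gbrain`; the real `gbrainPath()` in src/core/config.ts may define this differently. The function names are hypothetical.

```typescript
type Env = Record<string, string | undefined>;

// ASSUMPTION: GBRAIN_HOME is the config directory itself; when unset,
// fall back to the legacy `$HOME/.gbrain` location.
export function gbrainHomeDirSketch(env: Env): string {
  return env.GBRAIN_HOME ?? `${env.HOME ?? ""}/.gbrain`;
}

// The lockfile lives under the resolved directory instead of a
// hardcoded `~/.gbrain`, so two brains no longer share one lock.
export function autopilotLockPathSketch(env: Env): string {
  return `${gbrainHomeDirSketch(env)}/autopilot.lock`;
}
```

Under these assumptions, two different GBRAIN_HOME values yield two distinct lock paths, and an unset GBRAIN_HOME still resolves to the old default.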

Co-authored with @rafaelreis-r — same fix shape as PR #1227,
re-implemented against current master per the wave's
"re-implement, credit, close" workflow.

Tests cover: one GBRAIN_HOME → one canonical lock; two
GBRAIN_HOME values → two distinct locks; default fall-through
still works.

Co-Authored-By: rafaelreis-r <noreply@github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(graph-query): foreign-edge footer + --include-foreign (#1153)

The graph-query CLI silently dropped edges to pages in other sources
on federated brains. Users had no signal those edges existed unless
they read the source code.

Fix:
- New --include-foreign flag (off by default, preserves the existing
  scoping contract; on = explicit cross-source traversal).
- After every traversal, count edges from rootSlug whose target page
  lives in a different source. When count > 0 AND user didn't opt in,
  emit a stderr footer:
    `(N edge(s) to foreign-source pages hidden; pass --include-foreign
     to include them)`
- The "no edges found" path also runs the count + footer so users
  discover foreign edges even when scoped traversal returned nothing.
- Thin-client path skips the count (engine query not available);
  future T1 work threads source resolution through MCP for that path.
- Count SQL correctness: the links table is named `links` (not
  `page_links`); JOIN both endpoints to pages and compare
  source_id, NULL-safe via `IS NOT NULL` guards on both sides.
- Fail-open on missing source_id column for pre-v0.18 brains: return 0
  (no foreign edges to report) instead of throwing.
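
The footer gate described in the list above might look like the following. The function name is hypothetical, and the exact singular/plural wording is an assumption based on the `edge(s)` template and the pluralization regression guard mentioned below.

```typescript
// Hypothetical helper: returns the stderr footer text, or null when
// no footer should be printed.
export function foreignEdgeFooter(
  foreignCount: number,
  includeForeign: boolean,
): string | null {
  // No footer when nothing is hidden or the user already opted in.
  if (foreignCount <= 0 || includeForeign) return null;
  const noun = foreignCount === 1 ? "edge" : "edges"; // assumed pluralization
  return `(${foreignCount} ${noun} to foreign-source pages hidden; pass --include-foreign to include them)`;
}
```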

4 new test cases: footer fires on scoped query with foreign edge,
--include-foreign suppresses footer, zero-foreign no-footer case,
pluralization regression guard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(sources): `gbrain sources current` + tier attribution (#1222)

Federated-brain users running destructive ops (extract, import,
purge) need a way to verify which source they're targeting BEFORE
the op runs. Pre-fix, the only way was to grep config files or run
the op with --dry-run and inspect output.

New command:
  gbrain sources current             # human output
  gbrain sources current --json      # machine-readable
  gbrain sources current --source X  # show what an explicit --source
                                     # X would resolve to (validates
                                     # X exists in the sources table)

Output names BOTH the resolved source id AND which tier of the 6-tier
resolution chain won (flag / env / dotfile / local_path /
brain_default / seed_default), plus a `detail` line naming the
winning signal (e.g. "GBRAIN_SOURCE=dept-x" or ".gbrain-source" or
"/work/gstack/src").

Implementation:
- New `resolveSourceWithTier()` in source-resolver.ts as an additive
  variant of `resolveSourceId()`. Walks the same 6 steps in the same
  order; just returns `{ source_id, tier, detail? }` instead of bare
  string. Existing `resolveSourceId()` unchanged — all callers
  continue working.
- New `SOURCE_TIER_NAMES` const + `SourceTier` type export so the
  CLI, doctor (Tier 5 follow-up), and future MCP consumers share one
  vocabulary instead of inlining strings.
- Help text updated; `current` subcommand registered in dispatcher.
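
As an illustration of the tier ladder, here is a sketch that takes the signals as plain inputs. The `SourceSignals` shape and the `resolveWithTierSketch` name are hypothetical (the real `resolveSourceWithTier()` gathers the signals itself), the `--source` detail string is invented for the example, and `"default"` as the seed value is an assumption.

```typescript
// Tier names in priority order, as listed in the commit message.
export const SOURCE_TIER_NAMES = [
  "flag", "env", "dotfile", "local_path", "brain_default", "seed_default",
] as const;
export type SourceTier = (typeof SOURCE_TIER_NAMES)[number];

// Hypothetical input shape for illustration only.
export interface SourceSignals {
  flag?: string;
  env?: string;
  dotfile?: string;
  localPath?: { sourceId: string; path: string };
  brainDefault?: string;
}

export interface ResolvedSource {
  source_id: string;
  tier: SourceTier;
  detail?: string;
}

// First matching signal wins; each branch records which tier matched.
export function resolveWithTierSketch(s: SourceSignals): ResolvedSource {
  if (s.flag) return { source_id: s.flag, tier: "flag", detail: `--source ${s.flag}` };
  if (s.env) return { source_id: s.env, tier: "env", detail: `GBRAIN_SOURCE=${s.env}` };
  if (s.dotfile) return { source_id: s.dotfile, tier: "dotfile", detail: ".gbrain-source" };
  if (s.localPath) {
    return { source_id: s.localPath.sourceId, tier: "local_path", detail: s.localPath.path };
  }
  if (s.brainDefault) return { source_id: s.brainDefault, tier: "brain_default" };
  return { source_id: "default", tier: "seed_default" }; // assumed seed value
}
```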

11 new tests pin the 6-tier ladder + priority semantics. Existing
19 source-resolver tests still pass (regression preserved).

Per codex CF3, the original plan missed the existing
src/core/source-resolver.ts. This change re-uses that helper instead
of inventing a duplicate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(extract): --source-id scopes extraction to one brain source (#1204)

Federated brain users running `gbrain extract` had no way to scope
extraction to one source. The DB path walks all sources together via
listAllPageRefs(), which is correct for cross-source resolution but
sometimes the user wants to extract per-source explicitly (e.g.
re-running extract on a specific source after a manual import).

The pre-existing `--source` flag is the data-source axis (fs|db) and
can't be repurposed. New flag `--source-id <id>` joins it on the
brain-source-id axis:

  gbrain extract all --source db --source-id alpha
    -> walks only alpha-source pages; extracts links + timeline
       from those, into the alpha source

Important: the resolver maps (allSlugs + slugToSources) stay built
from the FULL listAllPageRefs result, not the scoped subset. This
ensures qualified cross-source wikilinks like `[[other-src:slug]]`
still resolve correctly even when the extract walk is scoped — the
filter is on which pages we extract FROM, not what we can resolve TO.

Threaded through both `extractLinksFromDB` and `extractTimelineFromDB`
with backward-compat: callers passing no opts get the old behavior.
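
The split between "what we extract FROM" and "what we can resolve TO" can be sketched as follows. `PageRef` and `planScopedExtract` are hypothetical names; only `allSlugs`, `slugToSources`, and `listAllPageRefs` come from the commit message.

```typescript
// Hypothetical row shape for what listAllPageRefs() might return.
export interface PageRef {
  slug: string;
  source_id: string;
}

export function planScopedExtract(allRefs: PageRef[], sourceId?: string) {
  // Resolver maps are built from the FULL set so cross-source
  // wikilinks still resolve when the walk is scoped.
  const allSlugs = new Set<string>();
  const slugToSources = new Map<string, Set<string>>();
  for (const ref of allRefs) {
    allSlugs.add(ref.slug);
    const sources = slugToSources.get(ref.slug) ?? new Set<string>();
    sources.add(ref.source_id);
    slugToSources.set(ref.slug, sources);
  }

  // Only the walk is filtered; no sourceId keeps the old walk-everything behavior.
  const walk =
    sourceId === undefined ? allRefs : allRefs.filter((r) => r.source_id === sourceId);

  return { walk, allSlugs, slugToSources };
}
```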

4 new test cases pin: walks-all-without-flag baseline,
alpha-only-when-scoped-to-alpha, beta-only-when-scoped-to-beta,
empty-set-on-unknown-source.

Note: #1204's wider "silent 0 links" report on federated brains has
additional facets beyond this flag (resolver path edge cases on
overlapping slugs). The scoped-walk fix gives users an explicit
workaround AND closes the per-source extraction gap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(todos): file v0.37.7.0 follow-ups (#1173, #1204, T5N)

Three items deferred from v0.37.7.0:

1. #1173 .sql indexing — verify-first gate found
   tree-sitter-sql.wasm missing from src/assets/wasm/grammars/.
   Dedicated wave needed: vendor the wasm, add .sql to walker
   filter, address slug-shape collision with #1172.

2. #1204 deeper investigation — wave added --source-id flag as
   workaround. Underlying silent-zero-links bug on unscoped
   federated extracts needs its own /investigate pass against
   a cross-source-duplicate-slug fixture.

3. Tier 5N doctor sweep for dead-lettered subagent jobs matching
   the #1151 fingerprint. Deferred to v0.37.8+ behind the islamabad
   doctor.ts conflict resolution.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(sync): walker skips git submodule directories (#1169)

Sync walker descended into git submodules and indexed their markdown
content as if it belonged to the parent brain. Users with submodules
in their brain repo saw foreign content in their pages table.

Fix: pruneDir gains an optional `parentDir` arg. When set, the helper
stats `<parentDir>/<name>/.git` and skips the directory if `.git`
exists as a FILE (gitfile pointer — the canonical submodule shape).
Directories containing `.git` as a DIRECTORY (a real nested repo,
not a submodule) are descended into; the inner `.git` dir itself is
then dot-prefix-excluded.
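
A sketch of the submodule check, with the filesystem probe injected so the logic can be tested in isolation. The `shouldPruneSketch` name and the `kindOf` callback are hypothetical; the real pruneDir stats the path directly and likely has further exclusion rules not shown here.

```typescript
type EntryKind = "file" | "dir" | "missing";

// Returns true when the walker should skip the directory `name`.
export function shouldPruneSketch(
  name: string,
  parentDir?: string,
  kindOf?: (path: string) => EntryKind, // hypothetical stat wrapper
): boolean {
  // Dot-prefixed directories (including an inner `.git`) are excluded.
  if (name.startsWith(".")) return true;
  // Without parentDir, keep the older name-only behavior.
  if (parentDir === undefined || kindOf === undefined) return false;
  // A `.git` FILE is a gitfile pointer, the submodule shape: skip it.
  // A `.git` DIRECTORY is a real nested repo: descend into it.
  return kindOf(`${parentDir}/${name}/.git`) === "file";
}
```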

Callers updated to pass parentDir:
- src/commands/extract.ts walkMarkdownFiles
- src/core/cycle/transcript-discovery.ts walker

Back-compat preserved: existing pruneDir(name) callers without
parentDir get the pre-v0.37.7.0 behavior unchanged.

Companion `.gitignore`-respect feature from PR #1159 (@jetsetterfl)
NOT in this wave — it would require adding the `ignore` npm package
as a dep, which the plan's "no new deps in this PR" gate excludes.
Filed as follow-up TODO for a dedicated wave.

5 new test cases pin the submodule shape + back-compat + nested-repo
ambiguity. Existing extract-fs / extract-db tests unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(brain-routing): document 6-tier source resolution chain (#1222)

The convention skill didn't have a tier-by-tier reference for how
gbrain resolves the active source. Users running federated brains
had to read the source code to know which signal wins.

Added:
- Canonical 6-tier table (flag → env → dotfile → local_path →
  brain_default → seed_default) matching src/core/source-resolver.ts.
- Pointer to `gbrain sources current` (new in v0.37.7.0) as the
  verification command.
- The CLI-layer trust boundary note: operations.ts handlers don't
  read env/dotfile (preserves v0.34.1.0 source-isolation work for
  MCP callers).
- Per-command flag map: --source, --source-id (extract), and
  --include-foreign (graph-query).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(import): --source-id flag routes pages to a brain source (#1167)

`gbrain import --source dept-x ./pages` silently fell back to the
default source because the CLI parser never consumed --source. PR
#707's design intent excluded the flag explicitly; users had no
signal their pages were going to the wrong place. #1167 + #1222
filed the regression.

Fix: parse `--source-id <id>` (matching v0.37.7.0 extract.ts T2's
naming convention — --source-id stays out of conflict with future
axes that may want --source). When set, the flag value wins over
any programmatic opts.sourceId; back-compat preserved for callers
that pass sourceId via opts only.

Also threaded into the positional-dir arg parser's flagValues set
so `--source-id <value> <dir>` doesn't treat <value> as the dir.
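
To illustrate the flagValues point, here is a sketch of a positional parser that consumes a flag's value before looking for the directory. The function name and return shape are hypothetical; only `--source-id` and the flagValues idea come from the commit.

```typescript
// Flags that take a value; their value must not be read as the positional dir.
const FLAGS_WITH_VALUES = new Set(["--source-id"]);

export function parseImportArgsSketch(
  args: string[],
): { dir?: string; sourceId?: string } {
  const out: { dir?: string; sourceId?: string } = {};
  for (let i = 0; i < args.length; i++) {
    const arg = args[i];
    if (FLAGS_WITH_VALUES.has(arg)) {
      const value = args[i + 1];
      if (arg === "--source-id" && value !== undefined) out.sourceId = value;
      i++; // skip the value so it is never taken as the dir
      continue;
    }
    if (arg.startsWith("--")) continue; // boolean flags are ignored in this sketch
    if (out.dir === undefined) out.dir = arg; // first bare argument is the dir
  }
  return out;
}
```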

Note on related surfaces:
- `gbrain query "X" --source_id dept-x` already routed correctly
  via the operations.ts query op (added in v0.34) — no fix needed.
- `gbrain extract --source-id <id>` shipped in T2.
- `gbrain sync --source <id>` already worked (pre-existing).
- `gbrain sources current` (shipped in T4) is the verification
  tool — run it before destructive ops to confirm routing.

Closes the silent-fallback for the import path. Co-authored with
@tyad67-netizen (#1168), @hnshah (#1124, #1120), whose patches
informed the shape; re-implemented against current master per
the wave's "re-implement, credit, close" workflow.

3 new test cases pin: default-without-flag, --source-id-routes-correctly,
flag-value-not-treated-as-dirArg.

Co-Authored-By: tyad67-netizen <noreply@github.com>
Co-Authored-By: hnshah <noreply@github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(autopilot): reconnect classifier + launchd ThrottleInterval (#1162)

Pre-fix: when database_url was unset/malformed, the DB-health-check
reconnect loop logged `config.database_url undefined` forever
because the catch swallowed every error type uniformly. launchd's
KeepAlive=true respawned immediately on any exit, so even when
the process did exit, it came right back into the same bad state.
@colin477 reported the daemon-thrash pattern.

Two-part fix:

1. In-process error classifier — `classifyReconnectError(err)`:
   - `unrecoverable` (database_url missing/empty/malformed, auth
     failure, no-brain-configured): exit immediately with a clear
     stderr line. Pattern-matched against postgres / config-loader
     error shapes. Tests pin the matcher against the #1162
     fingerprint exactly.
   - `recoverable` (network blip, pool saturated, connection refused
     on a port coming up, Supabase 503): retry. Up to
     GBRAIN_AUTOPILOT_MAX_RECONNECT_FAILS (default 30 = ~5min) before
     finally giving up with `max_reconnect_fails_exceeded`.
   - Counter resets on every successful health probe or reconnect.

2. launchd plist gains `ThrottleInterval=60`. Combined with the
   in-process exit, launchd waits 60s before relaunching instead
   of immediate respawn. Pure-function `generateLaunchdPlist()`
   exported for tests.
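
The classifier in part 1 might be shaped like this. The pattern list is an assumption pieced together from the error shapes named above and is not the repo's actual matcher; the `Sketch` suffix marks the function as illustrative.

```typescript
export type ReconnectClass = "unrecoverable" | "recoverable";

// ASSUMED patterns: config errors and auth failures that no retry can fix.
const UNRECOVERABLE_PATTERNS: RegExp[] = [
  /database_url/i,
  /password authentication failed/i,
  /role ".*" does not exist/i,
  /no brain configured/i,
];

export function classifyReconnectErrorSketch(err: unknown): ReconnectClass {
  // Accept non-Error inputs (strings, objects) by stringifying them.
  const message = err instanceof Error ? err.message : String(err);
  return UNRECOVERABLE_PATTERNS.some((p) => p.test(message))
    ? "unrecoverable"
    : "recoverable";
}
```

Everything that does not match falls through to `recoverable`, so an unknown error shape is retried up to the configured limit instead of killing the daemon.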

16 new test cases:
- 11 classifier cases (database_url shapes, malformed URL, auth,
  role-does-not-exist with quoted name, network blip, pool
  saturated, 503, non-Error inputs, case-insensitivity)
- 5 plist generator cases (ThrottleInterval=60, KeepAlive
  preserved, wrapper path, XML escaping, StandardErrorPath).

Pre-existing autopilot-lock-path tests unchanged — both fixes
land cleanly side-by-side.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(oauth): confidential clients via custom /token middleware (#1166)

v0.34.1.0 (#909) fixed PUBLIC PKCE clients (client_secret=undefined)
by normalizing NULL → undefined in getClient. Confidential clients
regressed: the MCP SDK's clientAuth middleware does plaintext
`client.client_secret !== presented_secret` compare, but gbrain
stores SHA-256 hashes, so the SDK's compare always failed for
authorization_code and refresh_token grants on confidential clients.
Result: /token returned `invalid_client` for every confidential
exchange.

Fix shape per locked-decision-5: custom /token middleware BEFORE the
SDK's authRouter, similar to the pre-existing client_credentials
handler. The middleware:

1. Detects confidential auth via `client_secret` in body
   (client_secret_post) OR `Authorization: Basic` header
   (client_secret_basic per RFC 6749 §2.3.1).
2. Falls through to the SDK when neither is present (public PKCE
   path stays canonical, preserves v0.34.1.0 behavior).
3. Calls the new `verifyConfidentialClientSecret(clientId, presented)`
   on the provider, which does the SHA-256 hash compare itself
   (same shape as exchangeClientCredentials' existing hash check).
4. On verification success, calls existing
   `exchangeAuthorizationCode` / `exchangeRefreshToken` directly
   with the validated client.
5. RFC 6749 §5.2 error semantics: 401 invalid_client for auth
   failures, 400 invalid_grant for code/token problems.
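
The hash compare in step 3 could be sketched like this. It ASSUMES the stored hash is a hex-encoded SHA-256 digest and uses a constant-time compare; the commit states neither detail, and both helper names are hypothetical.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Hex-encoded SHA-256 of the presented secret (assumed storage format).
export function sha256Hex(secret: string): string {
  return createHash("sha256").update(secret, "utf8").digest("hex");
}

// Compares the hash of the presented secret against the stored hash.
// A NULL stored hash (public client) never passes the confidential path.
export function secretMatchesHash(
  presented: string,
  storedHashHex: string | null,
): boolean {
  if (!storedHashHex) return false;
  const a = Buffer.from(sha256Hex(presented), "hex");
  const b = Buffer.from(storedHashHex, "hex");
  // timingSafeEqual throws on length mismatch, so check lengths first.
  return a.length === b.length && timingSafeEqual(a, b);
}
```

Hashing the presented secret before comparing is what the SDK's plaintext compare skips, which is why the SDK path can never match a stored hash.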

Per CLAUDE.md "GBRAIN:RLS_EXEMPT" annotation contract: this surface
sits in front of the SDK's clientAuth and doesn't depend on the
SDK's plaintext compare working — the SDK's middleware never
fires for confidential paths the new middleware claims.

7 new test cases pin: correct-secret-returns-client, wrong-secret
opaque rejection, non-existent client, public-client refuses
the confidential path, case-sensitivity, soft-deleted revocation,
verify-then-exchange-refresh round-trip with second-use rejection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(doctor): 3 new checks — source routing + oauth + autopilot lock (T12/T13/T14)

Three v0.37.7.0 doctor checks landing in one atomic commit (single
file, shared merge-conflict surface with garrytan/islamabad-v3 per
locked decision 1):

1. source_routing_health (T12 / #1167):
   Sample non-default sources for pages; warn when a registered
   source has zero pages (silent-collapse-to-default fingerprint).
   D5 lock: total-sample cap of 200 pages across all sources, with
   per-source cap = min(50, ceil(200/N)) so a 20-source CEO brain
   pays 200 selects, not 1000. The fix hint is a paste-ready
   `gbrain sources current --json` for verification.

2. oauth_confidential_client_health (T13 / #1166):
   Probe every oauth_clients row. Confidential clients (auth_method
   != 'none') must have a non-NULL client_secret_hash; if any row
   claims confidential auth but stores NULL hash, that's the
   pre-v0.37.7.0 regression. Public clients (auth_method='none')
   correctly keep NULL hash per v0.34.1.0 #909. Fix hint:
   `gbrain auth revoke-client + register-client` OR `gbrain upgrade`.
   Pre-OAuth schemas (missing oauth_clients table) skip gracefully.

3. autopilot_lock_scope (T14 / #1226):
   Detect stale ~/.gbrain/autopilot.lock outside the current
   GBRAIN_HOME. Codex CF11: a paste-ready `rm` is dangerous without
   verifying the owning PID isn't a live process. The hint reads the
   PID file and gives the user a `ps -p <pid>` check before any
   delete — matches sshd-style stale-lock recovery hints.
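
The D5 sampling cap from check 1 is plain arithmetic and can be written out directly. The function name is hypothetical; the formula min(50, ceil(200/N)) is from the commit message.

```typescript
// Per-source sample cap: min(50, ceil(200 / N)) for N non-default sources.
export function perSourceCap(sourceCount: number, total = 200, max = 50): number {
  if (sourceCount <= 0) return 0;
  return Math.min(max, Math.ceil(total / sourceCount));
}
```

For 20 sources this gives ceil(200/20) = 10 pages each, so 200 selects in total, versus 20 × 50 = 1000 with a flat cap of 50. Because of the ceil, the product can slightly exceed 200 (7 sources give 29 each, 203 in total), so a strict total would need an extra running check; whether the real check does that is not stated here.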

9 new test cases pin the canonical paths. Pre-existing 80+ doctor
checks unchanged.

Expected to conflict with garrytan/islamabad-v3 at merge time. The
3 new check functions live in their own block far from the
islamabad skill_brain_first check; the conflict surface should be
limited to the `checks.push(...)` call site near the end of
runDoctor's DB-checks phase (~10 lines).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(test): withEnv wrapper in source-resolver-with-tier (test-isolation lint)

The new source-resolver-with-tier.test.ts from T4 mutated
process.env.GBRAIN_SOURCE directly in two cases, which violates
scripts/check-test-isolation.sh R1 (env mutations leak across
parallel-loaded test files in the same shard process).

Fix: wrap both mutation sites in withEnv() from test/helpers/with-env.ts,
which saves+restores via try/finally per the canonical pattern in
CLAUDE.md.
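
A sketch of the save-and-restore pattern, for illustration. The real helper in test/helpers/with-env.ts is not shown here, so the signature below is an assumption; this version is synchronous and takes the env object as a parameter so it can be exercised against a plain object.

```typescript
type Env = Record<string, string | undefined>;

// Applies overrides, runs fn, then restores the previous values even if fn throws.
// Synchronous only: an async fn would see the env restored before it settles.
export function withEnvSketch<T>(
  overrides: Env,
  fn: () => T,
  env: Env = process.env,
): T {
  const saved: Env = {};
  for (const key of Object.keys(overrides)) {
    saved[key] = env[key];
    const value = overrides[key];
    if (value === undefined) delete env[key];
    else env[key] = value;
  }
  try {
    return fn();
  } finally {
    for (const key of Object.keys(saved)) {
      const previous = saved[key];
      if (previous === undefined) delete env[key];
      else env[key] = previous;
    }
  }
}
```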

Pure refactor — all 11 cases still green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: update project documentation for v0.37.7.0

CHANGELOG.md — populated the "What landed" stub with the 18-commit
brisbane wave (source-id flag threading, sources current subcommand,
graph-query foreign-edge footer, autopilot lockfile scope + reconnect
classifier + launchd ThrottleInterval, OAuth confidential client
middleware, reindex-frontmatter connect fix, subagent terminal-on-resume
fix, sync walker submodule skip, 3 new doctor checks, brain-routing.md
convention skill). Voice: ELI10 lead, capability table, paste-ready
verification, "what's safe to know" + "what we caught" sections.

CLAUDE.md — extended Key Files annotations for the v0.37.7.0 changes:
import/extract --source-id flags, sources current subcommand, graph-query
--include-foreign, resolveSourceWithTier() additive helper, autopilot
classifyReconnectError + generateLaunchdPlist exports, OAuth confidential
client middleware, pruneDir submodule detection, subagent terminal
short-circuit, 3 new doctor checks. Pinned by their test files.

llms-full.txt — regenerated via `bun run build:llms` (CI guard at
test/build-llms.test.ts will fail otherwise).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: rafaelreis-r <noreply@github.com>