fix: add timeout to doctor frontmatter_integrity check #1287

Open
garrytan-agents wants to merge 1 commit into garrytan:master from garrytan-agents:fix/doctor-frontmatter-timeout

@garrytan-agents
Contributor

Problem

On brains with 200K+ pages, gbrain doctor appears to hang during the frontmatter_integrity check. The check calls scanBrainSources, which synchronously walks every .md file on disk across all registered federated sources. On a production brain with 216K pages and 3 sources (default + zion-brain + media-corpus), this walk takes more than 60s.

The monitoring system that runs gbrain doctor from cron enforces a 60s timeout, so the slow frontmatter check causes the entire health report to fail, masking all other checks.

Root Cause

scanBrainSources already supports an AbortSignal via opts.signal — the walkDir callback checks signal.aborted on every file (line 430 of brain-writer.ts), and the source loop breaks on abort (line 382). However, the doctor caller in doctor.ts never passes a signal, so the scan runs unbounded.
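The cooperative-cancellation pattern described above can be sketched as follows. This is a minimal illustration, not the code in brain-writer.ts: the `Entry` tree, `scanTree`, `walk`, and the `onFile` hook are hypothetical names, and an in-memory tree stands in for the on-disk walk so the example stays self-contained.

```typescript
// Hypothetical in-memory stand-in for a directory tree; the real walker reads the disk.
type Entry = { name: string; children?: Entry[] };

interface ScanResult {
  scanned: string[]; // paths of the .md files visited so far
  aborted: boolean;  // true if the walk stopped early
}

interface ScanOpts {
  signal?: AbortSignal;
  onFile?: (path: string) => void; // illustrative hook, used below to trigger an abort
}

function walk(entry: Entry, prefix: string, opts: ScanOpts, out: ScanResult): void {
  // Check the signal before touching each entry, so an abort takes effect
  // at the next file boundary rather than mid-file.
  if (opts.signal?.aborted) {
    out.aborted = true;
    return;
  }
  const path = prefix ? `${prefix}/${entry.name}` : entry.name;
  if (entry.children) {
    for (const child of entry.children) {
      walk(child, path, opts, out);
      if (out.aborted) return; // unwind without visiting remaining siblings
    }
  } else if (entry.name.endsWith(".md")) {
    out.scanned.push(path);
    opts.onFile?.(path);
  }
}

function scanTree(root: Entry, opts: ScanOpts = {}): ScanResult {
  const out: ScanResult = { scanned: [], aborted: false };
  walk(root, "", opts, out);
  return out;
}

const tree: Entry = {
  name: "brain",
  children: [
    { name: "a.md" },
    { name: "b.md" },
    { name: "notes", children: [{ name: "c.md" }, { name: "d.txt" }] },
    { name: "e.md" },
  ],
};

// No signal: the walk is unbounded, which is the situation the doctor caller is in today.
const full = scanTree(tree);

// With a signal: abort after the second file and keep whatever was scanned so far.
const controller = new AbortController();
let seen = 0;
const partial = scanTree(tree, {
  signal: controller.signal,
  onFile: () => {
    if (++seen === 2) controller.abort();
  },
});
```

Without a signal the walk visits every .md file; with one, it stops at the next entry after the abort and reports the partial result instead of throwing.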

Solution

Pass AbortSignal.timeout(30000) from the doctor caller to scanBrainSources. When the timeout fires:

  • The walk stops cleanly at the next file boundary
  • Doctor reports a warn status with instructions to run the full scan directly
  • All other doctor checks continue normally

The timeout is configurable via GBRAIN_DOCTOR_FM_TIMEOUT_MS (default: 30s) for brains that need more or less time.
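One way the override could be resolved is sketched below. The PR text does not show how the variable is parsed, so `resolveFmTimeoutMs` is a hypothetical helper, and falling back to the default on a missing, non-numeric, or non-positive value is an assumption rather than confirmed behavior.

```typescript
const DEFAULT_FM_TIMEOUT_MS = 30_000; // 30s default, as stated in the PR description

// Hypothetical helper: takes the environment as a parameter so it is easy to test.
function resolveFmTimeoutMs(env: Record<string, string | undefined>): number {
  const raw = env.GBRAIN_DOCTOR_FM_TIMEOUT_MS;
  if (raw === undefined || raw.trim() === "") return DEFAULT_FM_TIMEOUT_MS;
  const parsed = Number(raw);
  // Assumption: reject NaN, Infinity, zero, and negatives instead of passing them on.
  if (!Number.isFinite(parsed) || parsed <= 0) return DEFAULT_FM_TIMEOUT_MS;
  return Math.floor(parsed);
}
```

In the doctor command this would be called as `resolveFmTimeoutMs(process.env)`. Validating up front keeps a typo in the variable from reaching `AbortSignal.timeout`, which throws on values outside its accepted range.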

Changes

src/commands/doctor.ts (frontmatter_integrity section):

  • Create AbortSignal.timeout(fmTimeoutMs) before the scan
  • Pass it to scanBrainSources(engine, { signal: fmAbort })
  • Detect AbortError in catch block and report actionable message
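The caller-side change might look roughly like the sketch below. It is not the diff from doctor.ts: `CheckResult`, `checkFrontmatter`, `isAbortLike`, and `stubScan` are hypothetical, and the stub stands in for scanBrainSources. Two details are worth noting. A signal created by `AbortSignal.timeout` aborts with a DOMException named `TimeoutError` rather than `AbortError`, so the sketch accepts both names. The timer behind the signal can only fire when the event loop is free, so the stub yields between files; whether the real scanner yields is not shown in this PR.

```typescript
type CheckResult = { name: string; status: "ok" | "warn" | "fail"; message: string };
type Scan = (opts: { signal?: AbortSignal }) => Promise<number>;

// Hypothetical stand-in for scanBrainSources: "scans" files and resolves to an issue count.
async function stubScan(totalFiles: number, opts: { signal?: AbortSignal }): Promise<number> {
  for (let i = 0; i < totalFiles; i++) {
    // Yield to the event loop so the timeout timer gets a chance to fire.
    await new Promise<void>((resolve) => setTimeout(resolve, 1));
    opts.signal?.throwIfAborted(); // throws the signal's abort reason
  }
  return 0;
}

// Treat both abort flavors as a timeout: controller.abort() yields AbortError,
// AbortSignal.timeout() yields TimeoutError.
function isAbortLike(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const name = (err as { name?: unknown }).name;
  return name === "AbortError" || name === "TimeoutError";
}

async function checkFrontmatter(scan: Scan, fmTimeoutMs: number): Promise<CheckResult> {
  const name = "frontmatter_integrity";
  const fmAbort = AbortSignal.timeout(fmTimeoutMs);
  try {
    const issues = await scan({ signal: fmAbort });
    return issues === 0
      ? { name, status: "ok", message: "no frontmatter issues found" }
      : { name, status: "warn", message: `${issues} frontmatter issue(s) found` };
  } catch (err) {
    if (isAbortLike(err)) {
      // Degrade to a warning so the remaining doctor checks still run.
      return {
        name,
        status: "warn",
        message: `scan timed out after ${fmTimeoutMs}ms; run gbrain frontmatter validate <path> directly`,
      };
    }
    return { name, status: "fail", message: String(err) };
  }
}

// Usage: a scan far larger than the budget is cut off by the timeout.
checkFrontmatter((opts) => stubScan(100_000, opts), 20).then((result) => {
  console.log(result.status, result.message);
});
```

Returning a result object from the catch block, instead of rethrowing, is what lets the other checks continue after the timeout.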

Results

| Metric | Before | After |
| --- | --- | --- |
| Doctor with 216K pages | Hangs >60s, killed by monitoring timeout | Completes in ~45s, frontmatter reports warn |
| Doctor with <50K pages | ~30s (no issue) | ~30s (timeout never fires) |
| gbrain frontmatter validate directly | Unaffected | Unaffected |

Testing

  • Verified on a production brain with 216K pages across 3 federated sources
  • Doctor completes in <50s with the timeout
  • Frontmatter check reports warn with actionable fix instructions
  • Full gbrain frontmatter validate <path> still works independently for targeted repair

On brains with 200K+ pages, the frontmatter scan walks every .md file
on disk across all registered sources. This synchronous FS walk can
take minutes (observed: >60s on a 216K-page brain with 3 sources),
causing the doctor command to appear hung.

scanBrainSources already supports an AbortSignal via opts.signal —
the walkDir callback checks signal.aborted on every file, and the
source loop breaks on abort. This commit passes AbortSignal.timeout
(default 30s) from the doctor caller so the check degrades gracefully
instead of blocking the entire health report.

Configurable via GBRAIN_DOCTOR_FM_TIMEOUT_MS for brains that need
more or less time. When the timeout fires, doctor reports a warn
with instructions to run the full scan directly.