Summary
On a multi-source (federated) brain, gbrain extract all (and extract links) silently inserts 0 links even when there is link-rich content to extract. gbrain doctor then recommends running exactly that command as the remedy for 0% graph coverage — so the user runs it, sees "created 0 from N pages," and has no signal that anything is wrong. Four interacting defects.
Environment
- gbrain v0.36.3.0
- Federated brain, 4 sources (default, brain, plus two repo-backed sources); one repo-backed source ~310 pages, a flat doc/memory repo
- gbrain doctor: graph_coverage 0%, brain_score 45/100 (links 0/25)
Repro
$ gbrain extract all # bare, from a cwd pinned to a non-default source
Links: created 0 from 306 pages
$ gbrain extract all --dry-run
Links: (dry run) would create 253 from 306 pages # <-- dry-run disagrees with the real run
Root cause
Bug 1 — FS extract path drops source_id. extractLinksFromFile emits link objects with only from_slug/to_slug/link_type/context — no source id (src/commands/extract.ts:218). The engine contract says an omitted from_source_id/to_source_id means source_id='default', not "infer from the page" (src/core/engine.ts:73). addLinksBatch maps missing → literal 'default' (src/core/postgres-engine.ts:1650) and INNER-JOINs pages on (slug, source_id) (src/core/postgres-engine.ts:1662). Pages that live under a non-default source_id never match (slug,'default') → every row is silently dropped by the JOIN, 0 inserted, no error.
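The drop can be illustrated with a small in-memory model. This is a sketch of the behaviour described above, not gbrain code: addLinksBatchModel, the repo-docs source id, and the sample slugs are hypothetical, and the model assumes both link endpoints must match a page on (slug, source_id).

```typescript
// Hypothetical in-memory model of the reported behaviour; not gbrain code.
type PageKey = { slug: string; source_id: string };
type LinkRow = {
  from_slug: string;
  to_slug: string;
  link_type: string;
  from_source_id?: string;
  to_source_id?: string;
};

// Mimics the reported batch insert: a missing source id becomes the literal
// 'default', then each endpoint must match a page on (slug, source_id).
function addLinksBatchModel(pages: PageKey[], rows: LinkRow[]): number {
  const known = new Set(pages.map((p) => `${p.source_id}::${p.slug}`));
  let inserted = 0;
  for (const row of rows) {
    const fromSource = row.from_source_id ?? "default";
    const toSource = row.to_source_id ?? "default";
    const fromOk = known.has(`${fromSource}::${row.from_slug}`);
    const toOk = known.has(`${toSource}::${row.to_slug}`);
    if (fromOk && toOk) inserted++; // rows failing the join vanish without an error
  }
  return inserted;
}

// Both pages live under a non-default source (assumed id: "repo-docs").
const pages: PageKey[] = [
  { slug: "notes/a", source_id: "repo-docs" },
  { slug: "notes/b", source_id: "repo-docs" },
];

// What the FS extract path is reported to emit: no source ids at all.
const unstamped: LinkRow[] = [
  { from_slug: "notes/a", to_slug: "notes/b", link_type: "mentions" },
];

// The same link with the source stamped on both ends.
const stamped: LinkRow[] = unstamped.map((r) => ({
  ...r,
  from_source_id: "repo-docs",
  to_source_id: "repo-docs",
}));

const unstampedInserted = addLinksBatchModel(pages, unstamped);
const stampedInserted = addLinksBatchModel(pages, stamped);
console.log({ unstampedInserted, stampedInserted });
```

In this model the unstamped row resolves to (notes/a, 'default') and is dropped, while the stamped row is inserted. That matches the "created 0" symptom without any error being raised.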
Bug 2 — doctor recommends a command that cannot work. The graph_coverage check prints Run: gbrain extract all. On a federated brain that command is structurally incapable of populating non-default-source links (Bug 1). The advice should be source-aware, or suppressed for federated brains.
Bug 3 — no backfill path for an already-synced federated brain. The only source-aware extractor is the post-sync hook extractLinksForSlugs, which correctly threads sourceId (src/commands/extract.ts:720-734) — but it runs only on pagesAffected (src/commands/sync.ts:942-944). sync --full re-walks but content-hash-skips unchanged pages (observed: imported=8 skipped=302), so the hook only ever re-extracts changed pages. There is no command that re-extracts links for an already-synced source.
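A minimal sketch of why the hook never reaches unchanged pages, and what a backfill would select instead. The function names and hash values are hypothetical; the only behaviour taken from the report is that sync skips pages whose content hash is unchanged.

```typescript
// Hypothetical model of hash-based skipping; names are illustrative, not gbrain's.
type WalkedPage = { slug: string; contentHash: string };

// Sync-style selection: only pages whose content hash differs from the stored one.
function selectChangedSlugs(stored: Map<string, string>, walked: WalkedPage[]): string[] {
  return walked.filter((p) => stored.get(p.slug) !== p.contentHash).map((p) => p.slug);
}

// Backfill-style selection: every page of the source, changed or not.
function selectAllSlugs(walked: WalkedPage[]): string[] {
  return walked.map((p) => p.slug);
}

const storedHashes = new Map<string, string>([
  ["notes/a", "h1"],
  ["notes/b", "h2"],
  ["notes/c", "h3"],
]);
const walkedPages: WalkedPage[] = [
  { slug: "notes/a", contentHash: "h1" },
  { slug: "notes/b", contentHash: "h2" },
  { slug: "notes/c", contentHash: "h3-edited" },
];

const changedSlugs = selectChangedSlugs(storedHashes, walkedPages);
const backfillSlugs = selectAllSlugs(walkedPages);
console.log(changedSlugs, backfillSlugs);
```

Only notes/c would reach a post-sync hook here. The two unchanged pages keep whatever links they had at import time, which on an affected brain is none.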
Bug 4 — --dry-run overcounts. Dry-run increments created at candidate-extraction time, before any DB write (src/commands/extract.ts:628-633); the real run counts only rows surviving the JOIN. So --dry-run reports "would create 253" while the real run creates 0 — the dry-run is not a faithful preview.
Adjacent: --source db extraction only recognizes entity-directory wikis (people/, companies/, deals/, meetings/ — the DIR_PATTERN whitelist at src/core/link-extraction.ts:46); it ignores plain [text](file.md) and [[flat-slug]] body links, so it returns 0 for flat doc/memory brains regardless of source.
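For reference, the two flat link forms mentioned above could be recognized along these lines. The patterns and the helper are hypothetical and are not the DIR_PATTERN from link-extraction.ts; stripping the .md suffix to get a slug is also an assumption.

```typescript
// Hypothetical patterns for flat body links; not gbrain's DIR_PATTERN.
const MARKDOWN_FILE_LINK = /\[([^\]]+)\]\(([^)\s]+\.md)\)/;
const WIKI_LINK = /\[\[([^\]|#]+)\]\]/;

// Collects one capture group from every match of `pattern` in `body`.
function collectMatches(pattern: RegExp, body: string, group: number): string[] {
  const re = new RegExp(pattern.source, "g");
  const out: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(body)) !== null) out.push(m[group]);
  return out;
}

function extractFlatLinkTargets(body: string): string[] {
  // Assumption: a link to file.md targets the slug "file".
  const fileTargets = collectMatches(MARKDOWN_FILE_LINK, body, 2).map((t) =>
    t.replace(/\.md$/, ""),
  );
  const wikiTargets = collectMatches(WIKI_LINK, body, 1).map((t) => t.trim());
  return [...fileTargets, ...wikiTargets];
}

const sampleBody = "See [setup notes](setup.md) and [[deploy-checklist]] before releasing.";
const flatTargets = extractFlatLinkTargets(sampleBody);
console.log(flatTargets);
```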
Proposed fixes
- Make the FS CLI extract path source-aware. runExtractCore / extractLinksFromDir / extractTimelineFromDir should resolve a sourceId (the resolver already exists: getDefaultSourcePath / .gbrain-source) and stamp from_source_id/to_source_id/origin_source_id (and timeline source_id) on every batch row, mirroring what extractLinksForSlugs already does.
- Add a backfill command, e.g. gbrain extract links --source <id> --backfill, or a --reextract-all flag on sync, that re-extracts links for all pages of a source, not just changed ones.
- Fix the doctor recommendation to point at a command that works on federated brains (or detect federation and adjust the advice).
- Make --dry-run faithful: resolve targets against (slug, source_id) the same way the real INSERT does, so the count matches.
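The first and last fixes can be sketched together: stamp the resolved source onto each candidate, then count only the candidates whose endpoints resolve on (slug, source_id). This is a sketch under stated assumptions, not the gbrain implementation. stampSource and countResolvable are hypothetical helpers, and stamping to_source_id with the same source as the origin page assumes links do not cross sources.

```typescript
// Hypothetical helpers illustrating the proposed fixes; not gbrain code.
type SourcePage = { slug: string; source_id: string };
type Candidate = { from_slug: string; to_slug: string; link_type: string };
type StampedLink = Candidate & { from_source_id: string; to_source_id: string };

// Fix 1: stamp the resolved source on every row.
// Assumption: the target lives in the same source as the origin page.
function stampSource(candidates: Candidate[], sourceId: string): StampedLink[] {
  return candidates.map((c) => ({
    ...c,
    from_source_id: sourceId,
    to_source_id: sourceId,
  }));
}

// Fix 4: a dry-run count that applies the same (slug, source_id) match
// the real insert is reported to apply.
function countResolvable(pages: SourcePage[], links: StampedLink[]): number {
  const known = new Set(pages.map((p) => `${p.source_id}::${p.slug}`));
  return links.filter(
    (l) =>
      known.has(`${l.from_source_id}::${l.from_slug}`) &&
      known.has(`${l.to_source_id}::${l.to_slug}`),
  ).length;
}

const sourcePages: SourcePage[] = [
  { slug: "notes/a", source_id: "repo-docs" },
  { slug: "notes/b", source_id: "repo-docs" },
];
const candidates: Candidate[] = [
  { from_slug: "notes/a", to_slug: "notes/b", link_type: "mentions" },
  { from_slug: "notes/a", to_slug: "notes/missing", link_type: "mentions" },
];

const naiveDryRunCount = candidates.length; // counted at extraction time
const faithfulDryRunCount = countResolvable(sourcePages, stampSource(candidates, "repo-docs"));
console.log({ naiveDryRunCount, faithfulDryRunCount });
```

The naive count includes the link to a page that does not exist, while the faithful count reports only what an insert with the same matching rule would keep.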
Severity
High for federated-brain users: graph features (get_backlinks, traverse_graph, find_*) are quietly non-functional, brain_score is permanently capped, and the diagnostic actively points the wrong way. Single-source / default-only brains are unaffected.