gbrain extract is not source-aware — silently inserts 0 links on federated brains; doctor recommends a command that cannot work #1204

@mdcruz88

Summary

On a multi-source (federated) brain, gbrain extract all (and extract links) silently inserts 0 links even when there is link-rich content to extract. gbrain doctor then recommends running exactly that command as the remedy for 0% graph coverage, so the user runs it, sees "created 0 from N pages," and has no signal that anything is wrong. Four interacting defects cause this; each is described under Root cause.

Environment

  • gbrain v0.36.3.0
  • Federated brain, 4 sources (default, brain, plus two repo-backed sources); one of the repo-backed sources is a flat doc/memory repo of ~310 pages
  • gbrain doctor: graph_coverage 0%, brain_score 45/100 (links 0/25)

Repro

$ gbrain extract all              # bare, from a cwd pinned to a non-default source
Links: created 0 from 306 pages
$ gbrain extract all --dry-run
Links: (dry run) would create 253 from 306 pages   # <-- dry-run disagrees with the real run

Root cause

Bug 1 — FS extract path drops source_id. extractLinksFromFile emits link objects with only from_slug/to_slug/link_type/context — no source id (src/commands/extract.ts:218). The engine contract says an omitted from_source_id/to_source_id means source_id='default', not "infer from the page" (src/core/engine.ts:73). addLinksBatch maps missing → literal 'default' (src/core/postgres-engine.ts:1650) and INNER-JOINs pages on (slug, source_id) (src/core/postgres-engine.ts:1662). Pages that live under a non-default source_id never match (slug,'default') → every row is silently dropped by the JOIN, 0 inserted, no error.
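To make the failure mode concrete, here is a minimal in-memory sketch of the join behaviour described above. It is not gbrain's code: Page, LinkRow, countInsertable and the source name "docs-repo" are all hypothetical, and the only behaviour modelled is "a missing source id becomes 'default', then both endpoints must match a page on (slug, source_id)".

```typescript
// Hypothetical in-memory model of the described behaviour; not gbrain's actual code.
interface Page { slug: string; source_id: string }
interface LinkRow {
  from_slug: string;
  to_slug: string;
  from_source_id?: string; // omitted by the FS extract path
  to_source_id?: string;
}

// Composite key for (source_id, slug); JSON keeps the two parts unambiguous.
const key = (sourceId: string, slug: string): string => JSON.stringify([sourceId, slug]);

// Mimics an INNER JOIN of candidate links against pages on (slug, source_id),
// with a missing source id falling back to the literal 'default'.
function countInsertable(pages: Page[], links: LinkRow[]): number {
  const known = new Set(pages.map((p) => key(p.source_id, p.slug)));
  return links.filter((l) =>
    known.has(key(l.from_source_id ?? "default", l.from_slug)) &&
    known.has(key(l.to_source_id ?? "default", l.to_slug)),
  ).length;
}

const pages: Page[] = [
  { slug: "notes/a", source_id: "docs-repo" },
  { slug: "notes/b", source_id: "docs-repo" },
];
const unstamped: LinkRow[] = [{ from_slug: "notes/a", to_slug: "notes/b" }];
const stamped: LinkRow[] = unstamped.map((l) => ({
  ...l,
  from_source_id: "docs-repo",
  to_source_id: "docs-repo",
}));

const unstampedCount = countInsertable(pages, unstamped);
const stampedCount = countInsertable(pages, stamped);
console.log(unstampedCount, stampedCount); // prints: 0 1
```

The same candidate link survives only once it carries the source id of the pages it refers to; without it the row is filtered out and nothing signals the loss.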

Bug 2 — doctor recommends a command that cannot work. The graph_coverage check prints Run: gbrain extract all. On a federated brain that command is structurally incapable of populating non-default-source links (Bug 1). The advice should be source-aware, or suppressed for federated brains.

Bug 3 — no backfill path for an already-synced federated brain. The only source-aware extractor is the post-sync hook extractLinksForSlugs, which correctly threads sourceId (src/commands/extract.ts:720-734) — but it runs only on pagesAffected (src/commands/sync.ts:942-944). sync --full re-walks but content-hash-skips unchanged pages (observed: imported=8 skipped=302), so the hook only ever re-extracts changed pages. There is no command that re-extracts links for an already-synced source.
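The interaction between content-hash skipping and a hook that only sees changed pages can be sketched as follows. This is an illustrative model under stated assumptions, not the sync implementation: StoredPage, hashContent and pagesAffected are hypothetical names, and the hash is a toy djb2 variant chosen only to keep the example dependency-free.

```typescript
// Hypothetical model of a content-hash-skipping sync; names are illustrative only.
interface StoredPage { slug: string; contentHash: string }

// Toy non-cryptographic hash (length prefix + djb2), used only to keep the sketch self-contained.
function hashContent(text: string): string {
  let h = 5381;
  for (let i = 0; i < text.length; i++) {
    h = ((h * 33) ^ text.charCodeAt(i)) >>> 0;
  }
  return `${text.length.toString(16)}-${h.toString(16)}`;
}

// Returns slugs whose on-disk content differs from the stored hash.
// In this model, only these slugs would ever reach a post-sync link extractor.
function pagesAffected(stored: StoredPage[], files: Map<string, string>): string[] {
  const known = new Map<string, string>();
  stored.forEach((p) => known.set(p.slug, p.contentHash));
  const affected: string[] = [];
  files.forEach((text, slug) => {
    if (known.get(slug) !== hashContent(text)) affected.push(slug);
  });
  return affected;
}

const files = new Map<string, string>([
  ["notes/a", "See [[notes/b]]"], // unchanged, but its link was never extracted
  ["notes/b", "edited body"],     // changed since the last sync
]);
const stored: StoredPage[] = [
  { slug: "notes/a", contentHash: hashContent("See [[notes/b]]") },
  { slug: "notes/b", contentHash: hashContent("old body") },
];

const affected = pagesAffected(stored, files);
console.log(affected); // only "notes/b" is listed
```

In this model the link inside notes/a is never revisited, because the page is skipped as unchanged on every sync. That is the gap a backfill command would close.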

Bug 4 — --dry-run overcounts. Dry-run increments created at candidate-extraction time, before any DB write (src/commands/extract.ts:628-633); the real run counts only rows surviving the JOIN. So --dry-run reports "would create 253" while the real run creates 0 — the dry-run is not a faithful preview.

Adjacent: --source db extraction only recognizes entity-directory wikis (people/, companies/, deals/, meetings/ — the DIR_PATTERN whitelist at src/core/link-extraction.ts:46); it ignores plain [text](file.md) and [[flat-slug]] body links, so it returns 0 for flat doc/memory brains regardless of source.
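For reference, the two body-link forms mentioned above can be recognized with fairly small patterns. The sketch below is purely illustrative: MD_LINK, WIKI_LINK and extractFlatLinkTargets are hypothetical names, the regexes are assumptions about what "plain" links look like, and none of this reproduces the actual DIR_PATTERN in src/core/link-extraction.ts.

```typescript
// Illustrative patterns only; gbrain's DIR_PATTERN is not reproduced here.
const MD_LINK = /\[([^\]]+)\]\(([^)\s]+\.md)\)/;            // [text](file.md)
const WIKI_LINK = /\[\[([^\]|#]+)(?:[|#][^\]]*)?\]\]/;      // [[flat-slug]] or [[slug|label]]

// Collects one capture group from every match of `re` in `body`.
function collect(re: RegExp, body: string, group: number): string[] {
  const out: string[] = [];
  const scan = new RegExp(re.source, "g"); // fresh global regex so no lastIndex state is shared
  let m: RegExpExecArray | null;
  while ((m = scan.exec(body)) !== null) out.push(m[group]);
  return out;
}

// Returns candidate target slugs: markdown links first (with ".md" stripped), then wiki links.
function extractFlatLinkTargets(body: string): string[] {
  const md = collect(MD_LINK, body, 2).map((t) => t.replace(/\.md$/, ""));
  const wiki = collect(WIKI_LINK, body, 1).map((t) => t.trim());
  return md.concat(wiki);
}

const body = "See [setup](setup.md) and [[release-notes]] plus [ext](https://example.com).";
const targets = extractFlatLinkTargets(body);
console.log(targets); // [ 'setup', 'release-notes' ]
```

The external URL is ignored because it does not end in .md. How relative paths map to slugs is left out, since that depends on gbrain's own slug rules.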

Proposed fixes

  1. Make the FS CLI extract path source-aware. runExtractCore / extractLinksFromDir / extractTimelineFromDir should resolve a sourceId (the resolver already exists — getDefaultSourcePath / .gbrain-source) and stamp from_source_id/to_source_id/origin_source_id (and timeline source_id) on every batch row, mirroring what extractLinksForSlugs already does.
  2. Add a backfill command — e.g. gbrain extract links --source <id> --backfill, or a --reextract-all flag on sync — that re-extracts links for all pages of a source, not just changed ones.
  3. Fix the doctor recommendation to point at a command that works on federated brains (or detect federation and adjust the advice).
  4. Make --dry-run faithful — resolve targets against (slug, source_id) the same way the real INSERT does, so the count matches.
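One way fixes 1 and 4 could fit together is to route both the dry run and the real run through a single resolution step, so the two counts cannot diverge. The following is a rough sketch under assumptions, not a patch: Candidate, StampedLink, resolveLinks and runExtract are hypothetical names, the page set stands in for a database lookup, and it assumes both endpoints of a link live in the same source.

```typescript
// Sketch only: one resolution step shared by dry-run and real run. All names are hypothetical.
interface Candidate { from_slug: string; to_slug: string }
interface StampedLink extends Candidate { from_source_id: string; to_source_id: string }

const pageKey = (sourceId: string, slug: string): string => JSON.stringify([sourceId, slug]);

// Stamps the resolved source id on every candidate, then keeps only links whose
// endpoints both exist as (source_id, slug) pages. Assumes intra-source links.
function resolveLinks(
  candidates: Candidate[],
  sourceId: string,
  existingPages: Set<string>,
): StampedLink[] {
  return candidates
    .map((c) => ({ ...c, from_source_id: sourceId, to_source_id: sourceId }))
    .filter((l) =>
      existingPages.has(pageKey(l.from_source_id, l.from_slug)) &&
      existingPages.has(pageKey(l.to_source_id, l.to_slug)),
    );
}

// Dry run and real run share resolveLinks, so the reported count is the same either way.
function runExtract(
  candidates: Candidate[],
  sourceId: string,
  existingPages: Set<string>,
  dryRun: boolean,
  insert: (rows: StampedLink[]) => void,
): number {
  const rows = resolveLinks(candidates, sourceId, existingPages);
  if (!dryRun) insert(rows);
  return rows.length;
}

const existing = new Set([pageKey("docs-repo", "a"), pageKey("docs-repo", "b")]);
const candidates: Candidate[] = [
  { from_slug: "a", to_slug: "b" },
  { from_slug: "a", to_slug: "missing" }, // dangling target, dropped in both modes
];
const written: StampedLink[] = [];
const sink = (rows: StampedLink[]): void => { written.push(...rows); };

const previewCount = runExtract(candidates, "docs-repo", existing, true, sink);
const writtenBeforeReal = written.length;
const createdCount = runExtract(candidates, "docs-repo", existing, false, sink);
console.log(previewCount, createdCount); // prints: 1 1
```

A real implementation would still need to decide how cross-source links are resolved; this sketch deliberately leaves that open.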

Severity

High for federated-brain users: graph features (get_backlinks, traverse_graph, find_*) are quietly non-functional, brain_score is permanently capped, and the diagnostic actively points the wrong way. Single-source / default-only brains are unaffected.
