gbrain extract is not source-aware — silently inserts 0 links on federated brains; doctor recommends a command that cannot work

## Summary

On a multi-source (federated) brain, `gbrain extract all` (and `extract links`) silently inserts **0 links** even when there is link-rich content to extract. `gbrain doctor` then recommends running exactly that command as the remedy for 0% graph coverage, so the user runs it, sees "created 0 from N pages," and has no signal that anything is wrong. Four interacting defects cause this; each is described under Root cause.

## Environment

- gbrain v0.36.3.0
- Federated brain, 4 sources (`default`, `brain`, plus two repo-backed sources); one of the repo-backed sources is a flat doc/memory repo of ~310 pages
- `gbrain doctor`: `graph_coverage 0%`, `brain_score 45/100 (links 0/25)`

## Repro

```
$ gbrain extract all              # bare, from a cwd pinned to a non-default source
Links: created 0 from 306 pages
$ gbrain extract all --dry-run
Links: (dry run) would create 253 from 306 pages   # <-- dry-run disagrees with the real run
```

## Root cause

**Bug 1 — FS extract path drops `source_id`.** `extractLinksFromFile` emits link objects with only `from_slug/to_slug/link_type/context` — no source id (`src/commands/extract.ts:218`). The engine contract says an omitted `from_source_id/to_source_id` means `source_id='default'`, not "infer from the page" (`src/core/engine.ts:73`). `addLinksBatch` maps missing → literal `'default'` (`src/core/postgres-engine.ts:1650`) and INNER-JOINs pages on `(slug, source_id)` (`src/core/postgres-engine.ts:1662`). Pages that live under a non-default `source_id` never match `(slug,'default')` → every row is silently dropped by the JOIN, 0 inserted, no error.
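The drop can be illustrated with a small in-memory model. This is only a sketch of the behaviour described above, not gbrain's code: `simulateAddLinksBatch`, the `Page` and `LinkRow` shapes, and the `docs-repo` source id are all hypothetical stand-ins for the real `addLinksBatch` and its JOIN.

```typescript
// Hypothetical in-memory model of the described behaviour; not gbrain's actual code.
interface Page {
  slug: string;
  source_id: string;
}

interface LinkRow {
  from_slug: string;
  to_slug: string;
  from_source_id?: string; // omitted by the FS extract path
  to_source_id?: string;
}

const pageKey = (sourceId: string, slug: string): string => `${sourceId}::${slug}`;

// Mimics an INNER JOIN on (slug, source_id): a row survives only if both
// endpoints resolve to an existing page under the resolved source id.
function simulateAddLinksBatch(pages: Page[], rows: LinkRow[]): number {
  const known: Record<string, true> = {};
  for (const p of pages) known[pageKey(p.source_id, p.slug)] = true;

  let inserted = 0;
  for (const row of rows) {
    const fromSource = row.from_source_id ?? "default"; // missing -> literal 'default'
    const toSource = row.to_source_id ?? "default";
    const fromOk = known[pageKey(fromSource, row.from_slug)] === true;
    const toOk = known[pageKey(toSource, row.to_slug)] === true;
    if (fromOk && toOk) inserted++;
  }
  return inserted;
}

// Both pages live under a non-default source.
const federatedPages: Page[] = [
  { slug: "notes/a", source_id: "docs-repo" },
  { slug: "notes/b", source_id: "docs-repo" },
];

const unstampedRows: LinkRow[] = [{ from_slug: "notes/a", to_slug: "notes/b" }];
const stampedRows: LinkRow[] = unstampedRows.map((r) => ({
  ...r,
  from_source_id: "docs-repo",
  to_source_id: "docs-repo",
}));

const insertedUnstamped = simulateAddLinksBatch(federatedPages, unstampedRows);
const insertedStamped = simulateAddLinksBatch(federatedPages, stampedRows);
console.log(insertedUnstamped, insertedStamped); // 0 1
```

In this model the unstamped row resolves to `('notes/a', 'default')`, matches nothing, and is dropped without an error; the same row with an explicit source id is inserted.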

**Bug 2 — `doctor` recommends a command that cannot work.** The `graph_coverage` check prints `Run: gbrain extract all`. On a federated brain that command is structurally incapable of populating non-default-source links (Bug 1). The advice should be source-aware, or suppressed for federated brains.

**Bug 3 — no backfill path for an already-synced federated brain.** The only source-aware extractor is the post-sync hook `extractLinksForSlugs`, which correctly threads `sourceId` (`src/commands/extract.ts:720-734`) — but it runs only on `pagesAffected` (`src/commands/sync.ts:942-944`). `sync --full` re-walks but content-hash-skips unchanged pages (observed: `imported=8 skipped=302`), so the hook only ever re-extracts *changed* pages. There is no command that re-extracts links for an already-synced source.
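A minimal sketch of why a content-hash-skipping sync cannot act as a backfill, assuming the hook only ever sees the changed pages. The function `pagesAffectedBySync` and its data shapes are hypothetical and only model the behaviour reported here.

```typescript
// Hypothetical model of a content-hash-skipping sync; not gbrain's implementation.
interface WalkedPage {
  slug: string;
  contentHash: string;
}

// Returns the slugs whose content changed since the last sync. Only these
// would reach a post-sync link-extraction hook.
function pagesAffectedBySync(
  storedHashes: Record<string, string>,
  walked: WalkedPage[],
): string[] {
  const affected: string[] = [];
  for (const page of walked) {
    if (storedHashes[page.slug] !== page.contentHash) {
      storedHashes[page.slug] = page.contentHash;
      affected.push(page.slug);
    }
  }
  return affected;
}

// Two pages were synced earlier, before any links were extracted for them.
const storedHashes: Record<string, string> = { "notes/a": "h1", "notes/b": "h2" };

const firstResync = pagesAffectedBySync(storedHashes, [
  { slug: "notes/a", contentHash: "h1" }, // unchanged, so skipped
  { slug: "notes/b", contentHash: "h2-edited" }, // changed, so re-extracted
]);

const secondResync = pagesAffectedBySync(storedHashes, [
  { slug: "notes/a", contentHash: "h1" },
  { slug: "notes/b", contentHash: "h2-edited" },
]);

console.log(firstResync, secondResync); // [ 'notes/b' ] []
```

In this model `notes/a` never appears in the affected list, however many full re-syncs run, so its links are never re-extracted.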

**Bug 4 — `--dry-run` overcounts.** Dry-run increments `created` at candidate-extraction time, before any DB write (`src/commands/extract.ts:628-633`); the real run counts only rows surviving the JOIN. So `--dry-run` reports "would create 253" while the real run creates 0 — the dry-run is not a faithful preview.
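The mismatch between the two counts can be sketched as follows. All names here (`Candidate`, `naiveDryRunCount`, `faithfulDryRunCount`) are hypothetical; the sketch assumes the real run keeps only candidates whose endpoints both exist under the resolved source id.

```typescript
// Hypothetical sketch; names and shapes are illustrative, not gbrain's API.
interface Candidate {
  fromSlug: string;
  toSlug: string;
  sourceId: string; // the source id the real write would resolve to
}

const candidateKey = (sourceId: string, slug: string): string => `${sourceId}::${slug}`;

// Counts at candidate-extraction time, before any lookup against stored pages.
function naiveDryRunCount(candidates: Candidate[]): number {
  return candidates.length;
}

// Counts only candidates whose endpoints both exist, mirroring a JOIN on (slug, source_id).
function faithfulDryRunCount(
  candidates: Candidate[],
  existingPages: Record<string, true>,
): number {
  return candidates.filter(
    (c) =>
      existingPages[candidateKey(c.sourceId, c.fromSlug)] === true &&
      existingPages[candidateKey(c.sourceId, c.toSlug)] === true,
  ).length;
}

// Pages are stored under a non-default source ...
const existingPages: Record<string, true> = {
  [candidateKey("docs-repo", "notes/a")]: true,
  [candidateKey("docs-repo", "notes/b")]: true,
};

// ... but unstamped candidates resolve to 'default'.
const candidates: Candidate[] = [
  { fromSlug: "notes/a", toSlug: "notes/b", sourceId: "default" },
  { fromSlug: "notes/b", toSlug: "notes/a", sourceId: "default" },
];

const naiveCount = naiveDryRunCount(candidates);
const faithfulCount = faithfulDryRunCount(candidates, existingPages);
console.log(naiveCount, faithfulCount); // 2 0
```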

**Adjacent:** `--source db` extraction only recognizes entity-directory wikis (`people/`, `companies/`, `deals/`, `meetings/` — the `DIR_PATTERN` whitelist at `src/core/link-extraction.ts:46`); it ignores plain `[text](file.md)` and `[[flat-slug]]` body links, so it returns 0 for flat doc/memory brains regardless of source.
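For reference, the two body-link forms mentioned above could be matched roughly like this. The regular expressions and function names are assumptions made for illustration, not taken from `link-extraction.ts`, and they ignore edge cases such as links inside code spans.

```typescript
// Hypothetical matchers for the two link forms; not gbrain's DIR_PATTERN logic.

// Collects capture group 1 of every match of `pattern` in `text`.
function collectMatches(pattern: RegExp, text: string): string[] {
  const re = new RegExp(pattern.source, "g"); // fresh regex, so no shared lastIndex state
  const out: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    out.push(m[1]);
  }
  return out;
}

// [text](file.md): captures the relative .md target.
function extractMarkdownFileLinks(body: string): string[] {
  return collectMatches(/\[[^\]]+\]\(([^)\s]+\.md)\)/, body);
}

// [[flat-slug]] or [[flat-slug|alias]]: captures the slug before any alias or anchor.
function extractWikiLinks(body: string): string[] {
  return collectMatches(/\[\[([^\]|#]+)(?:[|#][^\]]*)?\]\]/, body);
}

const sampleBody =
  "See [setup](setup.md), [[flat-slug]] and [[other|alias]]; also [ext](https://example.com).";

const mdTargets = extractMarkdownFileLinks(sampleBody);
const wikiTargets = extractWikiLinks(sampleBody);
console.log(mdTargets, wikiTargets); // [ 'setup.md' ] [ 'flat-slug', 'other' ]
```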

## Proposed fixes

1. **Make the FS CLI extract path source-aware.** `runExtractCore` / `extractLinksFromDir` / `extractTimelineFromDir` should resolve a `sourceId` (the resolver already exists — `getDefaultSourcePath` / `.gbrain-source`) and stamp `from_source_id/to_source_id/origin_source_id` (and timeline `source_id`) on every batch row, mirroring what `extractLinksForSlugs` already does.
2. **Add a backfill command** — e.g. `gbrain extract links --source <id> --backfill`, or a `--reextract-all` flag on `sync` — that re-extracts links for all pages of a source, not just changed ones.
3. **Fix the `doctor` recommendation** to point at a command that works on federated brains (or detect federation and adjust the advice).
4. **Make `--dry-run` faithful** — resolve targets against `(slug, source_id)` the same way the real INSERT does, so the count matches.
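
Fix 1 could look roughly like the sketch below. Only the `from_source_id`, `to_source_id` and `origin_source_id` field names come from this report; the `stampSourceId` helper and the row interfaces are hypothetical. The sketch also assumes every link in a batch is intra-source; cross-source links would need the target's source resolved separately.

```typescript
// Hypothetical row shapes and helper; only the *_source_id field names come from this report.
interface ExtractedLink {
  from_slug: string;
  to_slug: string;
  link_type: string;
  context: string;
}

interface StampedLink extends ExtractedLink {
  from_source_id: string;
  to_source_id: string;
  origin_source_id: string;
}

// Stamps one resolved source id on every row of a batch (assumes intra-source links).
function stampSourceId(rows: ExtractedLink[], sourceId: string): StampedLink[] {
  return rows.map((row) => ({
    ...row,
    from_source_id: sourceId,
    to_source_id: sourceId,
    origin_source_id: sourceId,
  }));
}

const stampedBatch = stampSourceId(
  [{ from_slug: "notes/a", to_slug: "notes/b", link_type: "mentions", context: "see notes/b" }],
  "docs-repo",
);
console.log(stampedBatch[0].from_source_id, stampedBatch[0].to_source_id);
```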

## Severity

High for federated-brain users: graph features (`get_backlinks`, `traverse_graph`, `find_*`) are quietly non-functional, `brain_score` is permanently capped, and the diagnostic actively points the wrong way. Single-source / default-only brains are unaffected.


gbrain extract is not source-aware — silently inserts 0 links on federated brains; doctor recommends a command that cannot work #1204

Summary

Environment

Repro

Root cause

Proposed fixes

Severity

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

gbrain extract is not source-aware — silently inserts 0 links on federated brains; doctor recommends a command that cannot work #1204

Description

Summary

Environment

Repro

Root cause

Proposed fixes

Severity

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions