feat(extract): wikilink alias / title / basename fallback resolution #1188

Open
rwbaker wants to merge 2 commits into garrytan:master from rwbaker:feat/wikilink-alias-resolution
rwbaker commented May 19, 2026

Problem

The wikilink extractor resolves targets via path-equality only, so Obsidian-style vaults whose links use short-form ([[_Team-Training]]), aliased-display ([[10_Projects/.../_User-Research|User Research]]), or title-only references almost never materialize as graph edges.

On a representative ~1K-page Obsidian vault (TimelyCare) this came out to 45 edges from 812 wikilinks (~5.5%). Most failures were the three classic shapes:

  • Short form [[_Team-Assessment]]: slugifyPath yields _team-assessment, but the DB stores it as 10_projects/team-assessment/_team-assessment. Miss.
  • Display-aliased [[10_Projects/.../_User-Research|User Research]]: 98 such links exist; the path should resolve, but the case mismatch (PascalCase author input vs lowercase DB slug) drops them.
  • Title-only [[Cool Topic]] for a page slugged differently.

Change

Introduce a per-extract-run WikilinkAliasIndex that maps three authored fallbacks → canonical slug:

  1. Frontmatter aliases: a page declaring aliases: [Team Training, TT] produces alias keys that later wikilinks can resolve against.
  2. First H1 heading: # Cool Topic lets [[Cool Topic]] resolve to the slug of the page that carries it.
  3. Filename basename — last /-segment of every canonical slug, so [[_Team-Assessment]] resolves to 10_projects/team-assessment/_team-assessment.

Resolver precedence: exact path > alias > title > basename. Each resolver that fires records its kind in the existing links.resolution_type column so we can audit recall. Migration v68 widens that column's CHECK constraint from {qualified, unqualified} to also include {path, alias, title, basename}.
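The precedence rule can be illustrated with a minimal in-memory sketch. Only the name WikilinkAliasIndex, the three fallback sources, the precedence order, and the first-write-wins collision rule come from this PR. The class name `AliasIndexSketch`, the `PageMeta` shape, and the trim-and-lowercase key normalization are assumptions made for illustration and may differ from the actual implementation.

```typescript
// Hypothetical sketch: field names, method names, and key normalization are
// assumptions, not the PR's actual code.
type ResolutionType = "path" | "alias" | "title" | "basename";

interface PageMeta {
  slug: string;       // canonical slug (assumed already lowercase)
  aliases?: string[]; // frontmatter `aliases:` entries
  title?: string;     // first H1 heading, if the page has one
}

interface Resolution {
  slug: string;
  resolutionType: ResolutionType;
}

// Assumed key normalization: trim and lowercase.
const normalizeKey = (s: string): string => s.trim().toLowerCase();

// First write wins: a key that is already claimed keeps its original slug.
function setIfAbsent(map: Map<string, string>, key: string, slug: string): void {
  if (!map.has(key)) map.set(key, slug);
}

class AliasIndexSketch {
  private readonly slugs = new Set<string>();
  private readonly aliases = new Map<string, string>();
  private readonly titles = new Map<string, string>();
  private readonly basenames = new Map<string, string>();

  add(page: PageMeta): void {
    this.slugs.add(page.slug);
    for (const alias of page.aliases ?? []) {
      setIfAbsent(this.aliases, normalizeKey(alias), page.slug);
    }
    if (page.title) setIfAbsent(this.titles, normalizeKey(page.title), page.slug);
    // Basename is the last `/`-segment of the canonical slug.
    const basename = page.slug.split("/").pop() ?? page.slug;
    setIfAbsent(this.basenames, normalizeKey(basename), page.slug);
  }

  // `raw` is the text between [[ and ]]; any `|display` suffix is ignored.
  resolve(raw: string): Resolution | null {
    const [target = ""] = raw.split("|");
    const key = normalizeKey(target);
    if (this.slugs.has(key)) return { slug: key, resolutionType: "path" };
    const passes: Array<[ResolutionType, Map<string, string>]> = [
      ["alias", this.aliases],
      ["title", this.titles],
      ["basename", this.basenames],
    ];
    for (const [resolutionType, map] of passes) {
      const slug = map.get(key);
      if (slug !== undefined) return { slug, resolutionType };
    }
    return null; // dangling link: no pass matched
  }
}
```

Under these assumptions, a page added with slug 10_projects/team-assessment/_team-assessment makes `resolve("_Team-Assessment")` return that slug with `resolutionType` set to "basename", while an unknown target returns null instead of guessing.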

Alongside the resolver, this PR fixes three closely related bugs that blocked the resolver from delivering its value:

  • resolveSlug() now slugifies the candidate path before lookup. Without this, PascalCase paths like 10_Projects/... never matched the lowercased DB slug, so display-aliased wikilinks fell through to "dangling" even when their path was valid.
  • walkMarkdownFiles() no longer skips _*.md Obsidian hub pages. The canonical sync path (isSyncable) ingests them, so they're real DB pages — the extract walker was the odd one out, hiding them from both the slug set and the alias index.
  • extractLinksFromDir() now resolves and threads sourceId through addLinksBatch. Pre-fix, FS-source extract always wrote with the implicit 'default' source, so any brain whose pages live under a non-default source (e.g. timelycare-vault) silently failed the page-table JOIN and dropped every row.
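The first fix in the list above, slugifying the candidate before the lookup, can be sketched as follows. `slugifySketch` is a hypothetical stand-in for the repository's slugifyPath; its rules (lowercase each segment, turn whitespace runs into hyphens) are inferred from the examples in this PR and may not match the real function.

```typescript
// Hypothetical stand-in for slugifyPath; the rules are inferred from the
// PR's examples, not copied from the repository.
function slugifySketch(path: string): string {
  return path
    .split("/")
    .map((segment) => segment.trim().toLowerCase().replace(/\s+/g, "-"))
    .filter((segment) => segment.length > 0)
    .join("/");
}

// Look the candidate up only after normalizing it, so PascalCase author
// input can match the lowercase slugs held in the set.
function resolveSlugSketch(
  candidate: string,
  allSlugs: ReadonlySet<string>,
): string | null {
  const slug = slugifySketch(candidate);
  return allSlugs.has(slug) ? slug : null;
}
```

With a slug set containing 10_projects/meeting-transcripts/foo, a raw `has("10_Projects/Meeting Transcripts/Foo")` check misses, whereas `resolveSlugSketch` normalizes the candidate first and finds the page.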

Result

Re-running gbrain extract links --source fs --dir <vault> on the same TimelyCare vault:

| Metric | Before | After |
|---|---|---|
| Total edges | 45 | 744 (16.5×) |
| resolution_type=path | n/a | 660 |
| resolution_type=basename | n/a | 63 |
| resolution_type=title | n/a | 21 |

The spot-check [[_Team-Training]] from _team-assessment now resolves to 10_projects/team-training/_team-training via the basename resolver.

Test plan

  • New test/extract-wikilink-aliases.test.ts:
    • WikilinkAliasIndex unit tests cover string vs array aliases:, alias-beats-title-beats-basename precedence, display-aliased lookups, and first-write-wins on collisions.
    • deriveTitleFromContent covers frontmatter skip + inline-markdown stripping.
    • FS-source integration tests exercise all four resolution paths plus a dangling-link no-false-positive guard.
  • test/extract-fs.test.ts, test/extract-db.test.ts, test/extract.test.ts, test/extract-incremental.test.ts all green (47 tests).
  • Adjacent regression sweep: test/link-extraction.test.ts, test/sync.test.ts, test/migrate.test.ts green (275 tests).
  • End-to-end on the TimelyCare vault confirms 45 → 744 edges with no batch errors.
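For readers unfamiliar with the title pass, the behavior that the deriveTitleFromContent tests describe (skip frontmatter, take the first H1, strip inline markdown) could look roughly like this. The function names and the exact set of stripped markers are assumptions, not the PR's code.

```typescript
// Hypothetical sketch of the behavior under test; not the PR's implementation.
function stripInlineSketch(text: string): string {
  return text
    .replace(/\[([^\]]+)\]\([^)]*\)/g, "$1") // [label](url) -> label
    .replace(/(\*\*|__|\*|`)/g, "")          // bold, emphasis, and code markers
    .trim();
}

function deriveTitleSketch(content: string): string | null {
  let body = content;
  // Skip a leading YAML frontmatter block delimited by `---` lines, so a
  // `# comment` inside it is never mistaken for a heading.
  const frontmatter = /^---\r?\n[\s\S]*?\r?\n---\r?\n?/.exec(body);
  if (frontmatter) body = body.slice(frontmatter[0].length);
  for (const line of body.split(/\r?\n/)) {
    // Exactly one `#` followed by whitespace marks an H1.
    const match = /^#\s+(.+?)\s*$/.exec(line);
    if (match) return stripInlineSketch(match[1] ?? "");
  }
  return null; // page has no H1
}
```

This sketch does not special-case fenced code blocks, so a `# comment` line inside a fence that precedes the real heading would be picked up as the title.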

Scope notes

  • Pair this with vault-side hygiene (declaring aliases: on hub pages) for the highest recall: the patch lifts recall structurally, and frontmatter cleanup tightens precision.
  • The addLinksBatch path keeps ON CONFLICT DO NOTHING (not DO UPDATE) because batches commonly contain duplicate (from, to, type, source, origin) tuples (e.g. same edge appearing in both compiled_truth and timeline), and Postgres refuses to let DO UPDATE affect the same row twice in one statement. Re-pinning resolution_type on existing rows is therefore not possible from INSERT ... ON CONFLICT; callers that need to refresh it should DELETE + re-INSERT.
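The consequence of the DO NOTHING choice can be modeled in memory. The sketch below is not the SQL in this PR. It is a hypothetical `LinkTableSketch` keyed on the (from, to, type, source, origin) tuple, showing that a duplicate inside one batch is skipped, that an existing row's resolution_type is left untouched, and that a delete followed by a re-insert is what refreshes it. For context, the Postgres failure avoided here surfaces as "ON CONFLICT DO UPDATE command cannot affect row a second time".

```typescript
// Hypothetical in-memory model of the conflict behavior; all names are
// illustrative and no database is involved.
interface LinkRow {
  from: string;
  to: string;
  type: string;
  source: string;
  origin: string;
  resolutionType: string;
}

// The conflict target: rows sharing this tuple count as the same edge.
const conflictKey = (r: LinkRow): string =>
  JSON.stringify([r.from, r.to, r.type, r.source, r.origin]);

class LinkTableSketch {
  private readonly rows = new Map<string, LinkRow>();

  // Models INSERT ... ON CONFLICT DO NOTHING; returns rows actually inserted.
  insertDoNothing(batch: readonly LinkRow[]): number {
    let inserted = 0;
    for (const row of batch) {
      const key = conflictKey(row);
      if (this.rows.has(key)) continue; // existing row is left as it is
      this.rows.set(key, row);
      inserted++;
    }
    return inserted;
  }

  // Models DELETE followed by re-INSERT, the refresh path named above.
  deleteAndReinsert(batch: readonly LinkRow[]): number {
    for (const row of batch) this.rows.delete(conflictKey(row));
    return this.insertDoNothing(batch);
  }

  resolutionTypeOf(row: LinkRow): string | undefined {
    return this.rows.get(conflictKey(row))?.resolutionType;
  }
}
```

In this model, inserting the same edge twice in one batch yields a single row, and a later batch carrying a different resolutionType changes nothing until `deleteAndReinsert` is used.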

The patch is engine-agnostic and ships with both PGLite and Postgres SQL.

🤖 Generated with Claude Code

rwbaker and others added 2 commits May 18, 2026 16:04
DIR_PATTERN's fixed semantic-dir whitelist (people|companies|meetings|...)
matched zero of the 947 wikilinks in a 1008-page Obsidian + PARA vault.
PARA layouts use numeric-prefixed dirs to force sidebar order
(10_Projects/, 20_Meetings/, 30_Resources/, 40_Areas/, 50_Pulse/,
80_Archived/), and the resulting graph had no edges.

Add a `\d+_word` alternative to DIR_PATTERN (using [A-Za-z] so PascalCase
matches without forcing an `i` flag on QUALIFIED_WIKILINK_RE, whose
source-id sub-expression is intentionally kebab-only). Extracted slugs
are then run through slugifyPath so wikilink paths like
`[[10_Projects/Meeting Transcripts/Foo|Foo]]` reduce to the lowercased,
hyphen-segmented DB slug `10_projects/meeting-transcripts/foo` that
`allSlugs.has()` expects.
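
As an illustration of the `\d+_word` idea, here is a standalone pattern that matches wikilinks whose first path segment is a numeric-prefixed PARA directory. It is a hypothetical simplification and not the repository's DIR_PATTERN or QUALIFIED_WIKILINK_RE; it only shows that `[A-Za-z]` accepts PascalCase without an `i` flag.

```typescript
// Hypothetical pattern: captures the path part of [[10_Projects/...|Display]].
// Group 1 stops at `]`, `|`, or `#`, so display text and anchors are excluded.
const PARA_WIKILINK_RE =
  /\[\[(\d+_[A-Za-z]+\/[^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]/g;

function extractParaTargets(text: string): string[] {
  const targets: string[] = [];
  // Fresh RegExp per call so a shared lastIndex cannot leak between calls.
  const re = new RegExp(PARA_WIKILINK_RE.source, "g");
  let match: RegExpExecArray | null;
  while ((match = re.exec(text)) !== null) {
    targets.push(match[1] ?? "");
  }
  return targets;
}
```

The captured paths still carry author casing and spaces, which is why the commit runs them through slugifyPath before the `allSlugs.has()` check.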

After this patch, `gbrain extract links --source db` reports 45 links
created against the timelycare vault (previously 0), and
`gbrain graph-query` returns typed edges for PARA-style pages.

Tests: 6 new cases for canonical PARA dirs (10_projects, 20_meetings,
30_resources, 40_areas, 50_pulse, 80_archived), PascalCase normalization,
spaced-segment normalization, and the markdown-link variant.

Tracked locally as TIM-27 in the Paperclip TimelyCare project.

Co-Authored-By: Paperclip <noreply@paperclip.ing>

Pre-fix, the wikilink extractor resolved targets via path-equality only,
so Obsidian-style vaults (short forms, aliased-display, renamed pages)
materialized as a tiny fraction of edges. On the TimelyCare vault that
worked out to 45 edges from 812 wikilinks (~5.5%).

This change introduces a wikilink alias index built per-extract run from:
  - frontmatter `aliases:`
  - the page's first H1 heading
  - the filename basename (last `/`-segment of the canonical slug)

Resolver precedence: exact-path > alias > title > basename. Each fired
resolver pins the new links.resolution_type column (widened in migration
v68 from {qualified, unqualified} to also include path/alias/title/basename).

Fixes alongside the resolver:
  - resolveSlug() now slugifies the candidate before lookup so PascalCase
    Obsidian paths (`10_Projects/...`) match the lowercased DB slug.
  - walkMarkdownFiles() no longer skips `_*.md` Obsidian hub pages; the
    canonical sync path (isSyncable) already ingests them, the extract
    walker was the odd one out.
  - extractLinksFromDir() now threads sourceId through addLinksBatch so
    multi-source brains (timelycare-vault, etc.) actually receive their
    edges instead of failing the page-table JOIN to the implicit
    'default' source.

Result on the TimelyCare vault: 45 → 744 edges (16.5x). Resolution
breakdown: 660 path / 63 basename / 21 title.

New test surface (test/extract-wikilink-aliases.test.ts):
  - WikilinkAliasIndex unit tests covering each resolver pass + the
    first-write-wins collision rule.
  - FS-source integration tests for all four resolution paths plus a
    dangling-link no-false-positive guard.

Closes TIM-28.

Co-Authored-By: Paperclip <noreply@paperclip.ing>