feat(extract): wikilink alias / title / basename fallback resolution #1188
Open
rwbaker wants to merge 2 commits into
DIR_PATTERN's fixed semantic-dir whitelist (people|companies|meetings|...) matched zero of the 947 wikilinks in a 1008-page Obsidian + PARA vault. PARA layouts use numeric-prefixed dirs to force sidebar order (10_Projects/, 20_Meetings/, 30_Resources/, 40_Areas/, 50_Pulse/, 80_Archived/), and the resulting graph had no edges.

Add a `\d+_word` alternative to DIR_PATTERN (using [A-Za-z] so PascalCase matches without forcing an `i` flag on QUALIFIED_WIKILINK_RE, whose source-id sub-expression is intentionally kebab-only). Extracted slugs are then run through slugifyPath so wikilink paths like `[[10_Projects/Meeting Transcripts/Foo|Foo]]` reduce to the lowercased, hyphen-segmented DB slug `10_projects/meeting-transcripts/foo` that `allSlugs.has()` expects.

After this patch, `gbrain extract links --source db` reports 45 links created against the timelycare vault (previously 0), and `gbrain graph-query` returns typed edges for PARA-style pages.

Tests: 6 new cases for canonical PARA dirs (10_projects, 20_meetings, 30_resources, 40_areas, 50_pulse, 80_archived), PascalCase normalization, spaced-segment normalization, and the markdown-link variant.

Tracked locally as TIM-27 in the Paperclip TimelyCare project.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
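The normalization described in this commit can be sketched as follows. This is a minimal illustration, not the project's code: `NUMERIC_PREFIX_DIR`, `wikilinkTarget`, and `slugifyPathSketch` are hypothetical stand-ins, and the real `DIR_PATTERN` and `slugifyPath` may handle more cases.

```typescript
// Hypothetical stand-in for the `\d+_word` alternative: digits, underscore,
// then a word starting with a letter of either case (no `i` flag needed).
const NUMERIC_PREFIX_DIR = /^\d+_[A-Za-z][A-Za-z0-9_-]*$/;

// Strip the [[ ]] brackets and any display text after the first `|`.
function wikilinkTarget(raw: string): string {
  const inner = raw.replace(/^\[\[/, "").replace(/\]\]$/, "");
  const pipe = inner.indexOf("|");
  return pipe === -1 ? inner : inner.slice(0, pipe);
}

// Assumed slug rule: lowercase each segment, turn whitespace runs into
// hyphens, and drop any other character outside [a-z0-9_-].
function slugifySegment(segment: string): string {
  return segment
    .trim()
    .toLowerCase()
    .replace(/\s+/g, "-")
    .replace(/[^a-z0-9_-]/g, "");
}

function slugifyPathSketch(path: string): string {
  return path
    .split("/")
    .filter((segment) => segment.length > 0)
    .map(slugifySegment)
    .join("/");
}

const slug = slugifyPathSketch(
  wikilinkTarget("[[10_Projects/Meeting Transcripts/Foo|Foo]]"),
);
console.log(slug); // 10_projects/meeting-transcripts/foo
```

Keeping the case handling in the slugifier rather than in the regex flag is what lets the kebab-only source-id sub-expression stay strict.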
Pre-fix, the wikilink extractor resolved targets via path-equality only,
so Obsidian-style vaults (short forms, aliased-display, renamed pages)
materialized as a tiny fraction of edges. On the TimelyCare vault that
worked out to 45 edges from 812 wikilinks (~5.5%).
This change introduces a wikilink alias index built per-extract run from:
- frontmatter `aliases:`
- the page's first H1 heading
- the filename basename (last `/`-segment of the canonical slug)
Resolver precedence: exact-path > alias > title > basename. Each fired
resolver pins the new links.resolution_type column (widened in migration
v68 from {qualified, unqualified} to also include path/alias/title/basename).
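The precedence and first-write-wins rules above can be sketched roughly as below. `AliasIndexSketch`, `PageEntry`, and `normalizeKey` are hypothetical names for illustration; the key normalization (lowercase, whitespace to hyphens) is an assumption, and the PR's actual `WikilinkAliasIndex` may differ.

```typescript
type ResolutionType = "path" | "alias" | "title" | "basename";

interface PageEntry {
  slug: string; // canonical DB slug
  aliases?: string[]; // frontmatter `aliases:`
  title?: string; // first H1 heading
}

interface Resolution {
  slug: string;
  resolutionType: ResolutionType;
}

// Assumed normalization so `Cool Topic` and `cool-topic` share one key.
const normalizeKey = (s: string): string =>
  s.trim().toLowerCase().replace(/\s+/g, "-");

class AliasIndexSketch {
  private readonly slugs = new Set<string>();
  private readonly byAlias = new Map<string, string>();
  private readonly byTitle = new Map<string, string>();
  private readonly byBasename = new Map<string, string>();

  constructor(pages: PageEntry[]) {
    for (const page of pages) {
      this.slugs.add(page.slug);
      for (const alias of page.aliases ?? []) {
        this.putFirst(this.byAlias, alias, page.slug);
      }
      if (page.title) this.putFirst(this.byTitle, page.title, page.slug);
      // Basename = last `/`-segment of the canonical slug.
      const basename = page.slug.slice(page.slug.lastIndexOf("/") + 1);
      this.putFirst(this.byBasename, basename, page.slug);
    }
  }

  // First-write-wins: a later page never overwrites an existing key.
  private putFirst(map: Map<string, string>, key: string, slug: string): void {
    const k = normalizeKey(key);
    if (k.length > 0 && !map.has(k)) map.set(k, slug);
  }

  // Precedence: exact path > alias > title > basename.
  resolve(target: string): Resolution | null {
    const key = normalizeKey(target);
    if (this.slugs.has(key)) return { slug: key, resolutionType: "path" };
    const passes: Array<[ResolutionType, Map<string, string>]> = [
      ["alias", this.byAlias],
      ["title", this.byTitle],
      ["basename", this.byBasename],
    ];
    for (const [resolutionType, map] of passes) {
      const slug = map.get(key);
      if (slug !== undefined) return { slug, resolutionType };
    }
    return null; // dangling link: no edge is created
  }
}
```

Returning the resolver name alongside the slug is what makes it possible to record a per-link `resolution_type` and audit which pass produced each edge.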
Fixes alongside the resolver:
- resolveSlug() now slugifies the candidate before lookup so PascalCase
Obsidian paths (`10_Projects/...`) match the lowercased DB slug.
- walkMarkdownFiles() no longer skips `_*.md` Obsidian hub pages; the
canonical sync path (isSyncable) already ingests them, the extract
walker was the odd one out.
- extractLinksFromDir() now threads sourceId through addLinksBatch so
multi-source brains (timelycare-vault, etc.) actually receive their
edges instead of failing the page-table JOIN to the implicit
'default' source.
Result on the TimelyCare vault: 45 → 744 edges (16.5x). Resolution
breakdown: 660 path / 63 basename / 21 title.
New test surface (test/extract-wikilink-aliases.test.ts):
- WikilinkAliasIndex unit tests covering each resolver pass + the
first-write-wins collision rule.
- FS-source integration tests for all four resolution paths plus a
dangling-link no-false-positive guard.
Closes TIM-28.
Co-Authored-By: Paperclip <noreply@paperclip.ing>
Problem
The wikilink extractor resolves targets via path-equality only, so Obsidian-style vaults whose links use short-form (`[[_Team-Training]]`), aliased-display (`[[10_Projects/.../_User-Research|User Research]]`), or title-only references almost never materialize as graph edges.

On a representative ~1K-page Obsidian vault (TimelyCare) this came out to 45 edges from 812 wikilinks (~5.5%). Most failures were the three classic shapes:
- `[[_Team-Assessment]]`: `slugifyPath` yields `_team-assessment`, but the DB stores it as `10_projects/team-assessment/_team-assessment`. Miss.
- `[[10_Projects/.../_User-Research|User Research]]`: 98 such links exist; the path should resolve, but the case mismatch (PascalCase author input vs lowercase DB slug) drops them.
- `[[Cool Topic]]` for a page slugged differently.

Change
Introduce a per-extract-run `WikilinkAliasIndex` that maps three authored fallbacks → canonical slug:
- Frontmatter `aliases:`: `aliases: [Team Training, TT]` on the page produces alias keys that future wikilinks can resolve against.
- First H1 heading: `# Cool Topic` lets `[[Cool Topic]]` resolve to the slug of the page that carries it.
- Filename basename: the last `/`-segment of every canonical slug, so `[[_Team-Assessment]]` resolves to `10_projects/team-assessment/_team-assessment`.

Resolver precedence: exact path > alias > title > basename. Each fallback that fires pins the existing `links.resolution_type` column so we can audit recall; migration v68 widens its CHECK constraint from `{qualified, unqualified}` to also include `{path, alias, title, basename}`.

Alongside the resolver, this PR fixes three closely related bugs that all blocked the resolver from delivering its value:
- `resolveSlug()` now slugifies the candidate path before lookup. Without this, PascalCase paths like `10_Projects/...` never matched the lowercased DB slug, so display-aliased wikilinks fell through to "dangling" even when their path was valid.
- `walkMarkdownFiles()` no longer skips `_*.md` Obsidian hub pages. The canonical sync path (`isSyncable`) ingests them, so they're real DB pages; the extract walker was the odd one out, hiding them from both the slug set and the alias index.
- `extractLinksFromDir()` now resolves and threads `sourceId` through `addLinksBatch`. Pre-fix, FS-source extract always wrote with the implicit `'default'` source, so any brain whose pages live under a non-default source (e.g. `timelycare-vault`) silently failed the page-table JOIN and dropped every row.

Result
Re-running `gbrain extract links --source fs --dir <vault>` on the same TimelyCare vault takes the graph from 45 to 744 edges (16.5x):
- `resolution_type=path`: 660
- `resolution_type=basename`: 63
- `resolution_type=title`: 21

The spot-check `[[_Team-Training]]` from `_team-assessment` now resolves to `10_projects/team-training/_team-training` via the basename resolver.

Test plan
- `test/extract-wikilink-aliases.test.ts`: `WikilinkAliasIndex` unit tests cover string vs array `aliases:`, alias-beats-title-beats-basename precedence, display-aliased lookups, and first-write-wins on collisions. `deriveTitleFromContent` covers frontmatter skip + inline-markdown stripping.
- `test/extract-fs.test.ts`, `test/extract-db.test.ts`, `test/extract.test.ts`, `test/extract-incremental.test.ts` all green (47 tests).
- `test/link-extraction.test.ts`, `test/sync.test.ts`, `test/migrate.test.ts` green (275 tests).

Scope notes
- `aliases:` on hub pages) for the highest recall: the patch lifts recall structurally, frontmatter cleanup tightens precision.
- `addLinksBatch` path keeps `ON CONFLICT DO NOTHING` (not `DO UPDATE`) because batches commonly contain duplicate `(from, to, type, source, origin)` tuples (e.g. the same edge appearing in both `compiled_truth` and `timeline`), and Postgres refuses to let `DO UPDATE` affect the same row twice in one statement. Re-pinning `resolution_type` on existing rows is therefore not possible from `INSERT ... ON CONFLICT`; callers that need to refresh it should `DELETE` + re-`INSERT`.

The patch is engine-agnostic and ships with both PGLite and Postgres SQL.
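To make the duplicate-tuple constraint concrete, here is a small sketch that reports which conflict keys occur more than once in a batch. `LinkRow`, `conflictKey`, and `findDuplicateKeys` are hypothetical names, and the `context` field is an invented non-key column for illustration; the real batch row shape is not shown in this PR description.

```typescript
// Hypothetical row shape: the five fields named in the scope note form the
// conflict tuple; `context` is an assumed extra column outside that tuple.
interface LinkRow {
  from: string;
  to: string;
  type: string;
  source: string;
  origin: string;
  context?: string;
}

// JSON-encode the tuple so field values containing separators cannot collide.
const conflictKey = (row: LinkRow): string =>
  JSON.stringify([row.from, row.to, row.type, row.source, row.origin]);

// Return every conflict key that appears more than once in the batch.
function findDuplicateKeys(batch: LinkRow[]): string[] {
  const seen = new Set<string>();
  const dupes = new Set<string>();
  for (const row of batch) {
    const key = conflictKey(row);
    if (seen.has(key)) dupes.add(key);
    else seen.add(key);
  }
  return Array.from(dupes);
}

const batch: LinkRow[] = [
  { from: "a", to: "b", type: "mentions", source: "timelycare-vault", origin: "extract", context: "compiled_truth" },
  { from: "a", to: "b", type: "mentions", source: "timelycare-vault", origin: "extract", context: "timeline" },
  { from: "a", to: "c", type: "mentions", source: "timelycare-vault", origin: "extract" },
];
console.log(findDuplicateKeys(batch).length); // 1
```

With `DO NOTHING`, the second row of such a pair is simply skipped; with `DO UPDATE`, the same statement would try to touch the first row's target twice, which is the case the scope note says Postgres rejects.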
🤖 Generated with Claude Code