feat: auto-link entity mentions for orphan reduction by garrytan-agents · Pull Request #1378 · garrytan/gbrain

garrytan-agents · 2026-05-24T20:16:36Z

Proposal: Auto-Link Entity Mentions (Orphan Reduction)

Problem

In a production brain with 165K+ pages, approximately 88% of pages are orphans — they have zero inbound links. This happens because the current link extraction only recognizes explicit markdown links ([Name](path)). When a page mentions an entity by name in body text (e.g., "we discussed Acme Corp's growth trajectory"), no link is created.

This means the vast majority of a brain's knowledge graph is disconnected, making it impossible to traverse relationships, find related content, or build meaningful entity profiles through link analysis.

Scale of Impact

Metric	Value
Total pages	~165,000
Orphan pages (0 inbound links)	~146,000 (88%)
Entity link coverage	~32%

Proposed Solution

Add a link-by-mention pass in the dream/extract cycle that creates mentions links from text references to known entities.

How It Works

1. Build a gazetteer from existing entity pages (person, company, etc.) by collecting each entity's title and aliases from frontmatter.
2. Scan recently synced pages for text mentions of those entity names using case-insensitive fuzzy matching.
3. Create mentions links from the mentioning page to the mentioned entity page, deduplicating so each page-to-entity pair is linked only once.
4. Gate the feature behind a config flag: gbrain config set auto_link_mentions true
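
The steps above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the `EntityPage` and `MentionLink` shapes and the function names are hypothetical, and the sketch substitutes exact whole-word, case-insensitive matching for the fuzzy matching the proposal calls for.

```typescript
// Hypothetical shapes: gbrain's real page and link types are not shown in this proposal.
interface EntityPage {
  slug: string;
  title: string;
  aliases: string[];
}

interface MentionLink {
  from: string;
  to: string;
  type: "mentions";
}

// Escape regex metacharacters so names like "A.C.M.E." match literally.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Step 1: map each lowercased title or alias to the entity's slug.
// The first entity to claim a name keeps it (an assumption of this sketch).
function buildGazetteer(entities: EntityPage[]): Map<string, string> {
  const gazetteer = new Map<string, string>();
  for (const e of entities) {
    for (const name of [e.title, ...e.aliases]) {
      const key = name.trim().toLowerCase();
      if (key.length > 0 && !gazetteer.has(key)) gazetteer.set(key, e.slug);
    }
  }
  return gazetteer;
}

// Steps 2 and 3: find whole-word, case-insensitive mentions and emit
// at most one link per target entity, skipping self-links.
function findMentions(
  pageSlug: string,
  body: string,
  gazetteer: Map<string, string>,
): MentionLink[] {
  const seen = new Set<string>();
  const links: MentionLink[] = [];
  for (const [name, target] of gazetteer) {
    if (target === pageSlug || seen.has(target)) continue;
    // Reject matches that are glued to a neighbouring letter or digit.
    const pattern = new RegExp(
      `(?<![\\p{L}\\p{N}])${escapeRegExp(name)}(?![\\p{L}\\p{N}])`,
      "iu",
    );
    if (pattern.test(body)) {
      seen.add(target);
      links.push({ from: pageSlug, to: target, type: "mentions" });
    }
  }
  return links;
}
```

Building one regular expression per name is simple but scales linearly with gazetteer size; a large brain would likely want a single combined automaton or a token index instead.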

CLI Interface

# One-time backfill for existing pages
gbrain extract links --by-mention

# Enable ongoing auto-linking in dream cycle
gbrain config set auto_link_mentions true

# Dry run to preview what would be linked
gbrain extract links --by-mention --dry-run

Implementation Notes

- The gazetteer is built from the brain's own pages, so no external NER model is needed for this pass.
- Fuzzy matching should handle common variations (e.g., "Acme" matching "Acme Corp" and "Acme Corporation").
- Deduplication makes the command idempotent: running it multiple times is safe.
- Performance: process in batches. Scanning all 165K pages could take a long time, so support --batch-size and --since flags.
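
Two of these notes lend themselves to a short sketch: generating name variants so that "Acme", "Acme Corp", and "Acme Corporation" share a key, and an idempotent link store. Everything here is an assumption for illustration: the suffix list, `nameVariants`, and `LinkStore` are hypothetical names, and suffix stripping is only one simple stand-in for fuzzy matching.

```typescript
// Assumed list of corporate suffixes; a real deployment would tune this.
const CORPORATE_SUFFIXES = ["corporation", "corp", "incorporated", "inc", "llc", "ltd"];

// Return the lowercased name plus a variant with a trailing corporate suffix removed.
function nameVariants(name: string): string[] {
  const normalized = name.trim().toLowerCase().replace(/[.,]+$/, "");
  const variants = new Set<string>([normalized]);
  for (const suffix of CORPORATE_SUFFIXES) {
    if (normalized.endsWith(" " + suffix)) {
      variants.add(normalized.slice(0, -(suffix.length + 1)).trim());
    }
  }
  return Array.from(variants);
}

// In-memory stand-in for the links table, keyed on (from, to, type).
class LinkStore {
  private keys = new Set<string>();

  // Returns true only when the link is new, so repeated runs add nothing.
  add(from: string, to: string, type: string): boolean {
    const key = JSON.stringify([from, to, type]);
    if (this.keys.has(key)) return false;
    this.keys.add(key);
    return true;
  }

  get size(): number {
    return this.keys.size;
  }
}
```

Note that short variants such as "acme" widen matching and therefore raise the false-positive risk discussed under Risks & Mitigations.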

Agent Onboarding

Doctor Detection

gbrain doctor should detect orphan ratio >50% and surface a recommendation:

⚠ High orphan ratio: 88% of pages have no inbound links
  Recommendation: Run `gbrain extract links --by-mention` to create links from text mentions
  Or enable auto-linking: `gbrain config set auto_link_mentions true`
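
A rough sketch of how such a check might compute the ratio. The `OrphanReport` shape and the `orphanReport` function are hypothetical, and ignoring self-links when counting inbound links is an assumption of this sketch; only the 50% threshold comes from the proposal.

```typescript
interface OrphanReport {
  total: number;
  orphans: number;
  ratio: number;
  warn: boolean;
}

// Count pages with no inbound links and flag the brain when the ratio exceeds the threshold.
function orphanReport(
  pageSlugs: string[],
  links: Array<{ from: string; to: string }>,
  threshold = 0.5,
): OrphanReport {
  const linked = new Set<string>();
  for (const l of links) {
    // Assumption: a page linking to itself does not count as an inbound link.
    if (l.from !== l.to) linked.add(l.to);
  }
  const orphans = pageSlugs.filter((slug) => !linked.has(slug)).length;
  const ratio = pageSlugs.length === 0 ? 0 : orphans / pageSlugs.length;
  return { total: pageSlugs.length, orphans, ratio, warn: ratio > threshold };
}

// Format the first line of the warning shown above.
function orphanWarning(report: OrphanReport): string {
  const pct = Math.round(report.ratio * 100);
  return `High orphan ratio: ${pct}% of pages have no inbound links`;
}
```

With the figures from Scale of Impact, 146,000 orphans out of 165,000 pages gives a ratio of about 0.885, which rounds to the 88% shown in the warning.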

Fresh Install

On fresh install, the setup wizard should ask:

Would you like to enable automatic entity mention linking? 
This creates links when pages mention known entities by name. [y/N]

Migration Prompt (v0.41+)

Add a one-time migration that runs after upgrade:

Your brain has 88% orphan pages. 
Run `gbrain extract links --by-mention` to create links from text mentions? [y/N]

The migration records completion in the kv table so it doesn't prompt again.
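
One way the prompt-once behaviour could look, as a sketch only. The `KvStore` interface, the key name, and the choice to record completion regardless of the user's answer are all assumptions; the actual kv table API is not shown in this proposal.

```typescript
// Hypothetical key-value interface standing in for the kv table.
interface KvStore {
  get(key: string): string | undefined;
  set(key: string, value: string): void;
}

// Hypothetical key name for this migration.
const MIGRATION_KEY = "migration.auto_link_mentions_prompted";

// In-memory implementation used only to demonstrate the flow.
class MemoryKv implements KvStore {
  private data = new Map<string, string>();
  get(key: string): string | undefined {
    return this.data.get(key);
  }
  set(key: string, value: string): void {
    this.data.set(key, value);
  }
}

// Return the prompt text, or null when the migration has already been recorded.
function migrationPrompt(kv: KvStore, orphanRatio: number): string | null {
  if (kv.get(MIGRATION_KEY) !== undefined) return null;
  const pct = Math.round(orphanRatio * 100);
  return (
    `Your brain has ${pct}% orphan pages.\n` +
    "Run `gbrain extract links --by-mention` to create links from text mentions? [y/N]"
  );
}

// Record completion so the prompt is never shown again.
function recordMigrationDone(kv: KvStore, now: Date = new Date()): void {
  kv.set(MIGRATION_KEY, now.toISOString());
}
```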

Evidence

This proposal is based on production data from a 165K-page brain where orphan pages accumulated over months of operation. The operator discovered the issue only after running gbrain doctor — by then, 146K pages had no connections despite being rich with entity references in their body text.

Risks & Mitigations

Risk	Mitigation
False positive matches (e.g., "Apple" the fruit vs "Apple" the company)	Require minimum name length, prefer exact matches, allow an ignore list
Performance on large brains	Batch processing, `--since` flag for incremental runs
Link spam on frequently-mentioned entities	Cap mentions per source page, or only link first mention
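
The mitigations in the table could be combined into a single filtering step, sketched below. The `Candidate` and `MentionPolicy` shapes and the `applyPolicy` function are hypothetical, and the specific thresholds in any real policy would need tuning.

```typescript
// A matched name, the entity page it points to, and where it occurs in the source page.
interface Candidate {
  name: string;
  target: string;
  offset: number;
}

interface MentionPolicy {
  minNameLength: number; // drop very short names
  ignoreNames: string[]; // ambiguous names such as "Apple"
  maxLinksPerPage: number; // cap on links created from one source page
}

// Keep only the first mention of each target, in document order, subject to the policy.
function applyPolicy(candidates: Candidate[], policy: MentionPolicy): Candidate[] {
  const ignored = new Set(policy.ignoreNames.map((n) => n.toLowerCase()));
  const seenTargets = new Set<string>();
  const kept: Candidate[] = [];
  const ordered = [...candidates].sort((a, b) => a.offset - b.offset);
  for (const c of ordered) {
    if (kept.length >= policy.maxLinksPerPage) break;
    if (c.name.length < policy.minNameLength) continue;
    if (ignored.has(c.name.toLowerCase())) continue;
    if (seenTargets.has(c.target)) continue;
    seenTargets.add(c.target);
    kept.push(c);
  }
  return kept;
}
```

An ignore list only suppresses a name everywhere; telling "Apple" the fruit from "Apple" the company on a per-page basis would need context-aware disambiguation, which is outside the scope of this pass.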

The judgeSignificance trimming (slice at 4000 chars) could split a UTF-16 surrogate pair when an emoji sits exactly at the boundary, producing a lone high surrogate that Anthropic's JSON parser rejects with 'no low surrogate in string'.

Add safeSliceEnd() helper that backs up by one char when the cut lands between a high and low surrogate. Apply to:
- judgeSignificance transcript trimming (the direct cause)
- findBoundary hard-split fallback (defense-in-depth)

Fixes: dream cycle SYNTH_PHASE_FAIL on 2026-05-24 caused by 🤖 emoji at pos 3999 in telegram/2026-05-20-topic-1-topic-1.md
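
A minimal sketch of what a helper like safeSliceEnd() might look like. The name comes from the commit message, but the signature and body below are assumptions, not the code from this PR.

```typescript
// Slice `s` to at most `end` UTF-16 code units without ending on a lone high surrogate.
// Assumed signature; the PR's actual helper may differ.
function safeSliceEnd(s: string, end: number): string {
  if (end <= 0) return "";
  if (end >= s.length) return s;
  const last = s.charCodeAt(end - 1);
  // High surrogates occupy U+D800..U+DBFF; if the last kept unit is one,
  // its low surrogate would be cut off, so back up by one code unit.
  const isHighSurrogate = last >= 0xd800 && last <= 0xdbff;
  return s.slice(0, isHighSurrogate ? end - 1 : end);
}
```

For the failure described above, an emoji whose high surrogate sits at index 3999 makes a plain `slice(0, 4000)` end on that high surrogate, whereas this sketch returns 3999 code units and drops the emoji entirely.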

Add proposal for automatic entity mention linking to reduce orphan pages. In a 165K-page production brain, 88% of pages are orphans because link extraction only finds explicit markdown links, not text mentions. Proposes a link-by-mention pass in the dream/extract cycle.

root and others added 3 commits May 24, 2026 09:16

Merge branch 'garrytan:master' into master

5fbb0a7

garrytan-agents mentioned this pull request May 24, 2026

feat: gbrain onboard — guided agent onboarding with migration prompts #1383

Open

garrytan-agents wants to merge 3 commits into garrytan:master from garrytan-agents:feat/auto-link-entity-mentions
