feat: NER-based entity link extraction #1380
Open
garrytan-agents wants to merge 3 commits into
The judgeSignificance trimming (slice at 4000 chars) could split a UTF-16 surrogate pair when an emoji sits exactly at the boundary, producing a lone high surrogate that Anthropic's JSON parser rejects with 'no low surrogate in string'.

Add safeSliceEnd() helper that backs up by one char when the cut lands between a high and low surrogate.

Apply to:
- judgeSignificance transcript trimming (the direct cause)
- findBoundary hard-split fallback (defense-in-depth)

Fixes: dream cycle SYNTH_PHASE_FAIL on 2026-05-24 caused by 🤖 emoji at pos 3999 in telegram/2026-05-20-topic-1-topic-1.md
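For illustration, a helper like the one described could look roughly like the sketch below. Only the name safeSliceEnd comes from the commit message; the signature, the parameter names, and the sample transcript are assumptions, and the actual implementation in the PR may differ.

```typescript
// Hypothetical sketch of safeSliceEnd(); the real signature is not shown in this PR text.
function safeSliceEnd(text: string, end: number): string {
  if (end >= text.length) return text;
  if (end <= 0) return "";
  const last = text.charCodeAt(end - 1); // last code unit that would be kept
  const next = text.charCodeAt(end);     // first code unit that would be dropped
  const lastIsHigh = last >= 0xd800 && last <= 0xdbff;
  const nextIsLow = next >= 0xdc00 && next <= 0xdfff;
  // Back up by one code unit only when the cut lands between a high and a low surrogate.
  return text.slice(0, lastIsHigh && nextIsLow ? end - 1 : end);
}

// Reproduces the reported shape: an emoji whose high surrogate sits at index 3999.
const sampleTranscript = "a".repeat(3999) + "🤖" + " rest of transcript";
const trimmedTranscript = safeSliceEnd(sampleTranscript, 4000);
console.log(trimmedTranscript.length);
```

A plain `slice(0, 4000)` on the same input would keep the high surrogate at index 3999 without its partner, which is the condition the commit message describes.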
Add proposal for extracting typed links from page body text using the brain's own entity pages as a gazetteer. 68% of entities have no typed links despite rich relationship data in text.
Proposal: NER-Based Entity Link Extraction
Problem
A production brain with 165K+ pages and a custom schema pack (30 types, 8 link verbs) has only 32% entity link coverage, meaning 68% of entities have no typed links at all. The schema pack defines rich relationship verbs (works_at, founded, invested_in, advises, etc.), but the data isn't populated because typed link extraction only works when frontmatter explicitly declares the relationship.

Body text is full of implicit relationships: "Alice is the CEO of Acme Corp", "Bob invested in Acme's Series A", "Alice previously worked at BigCo". None of these become typed links today.
Scale of Impact
Proposed Solution
NER Extraction Using the Brain's Own Gazetteer
Rather than adding an external NER model dependency, use the brain's own entity pages as the gazetteer:
- works_at (with role annotation)
- founded
- invested_in
- advises
- acquired_by

CLI Interface
Relationship Pattern Matching
The extraction engine should recognize common patterns per verb type:
- works_at
- founded
- invested_in
- advises
- acquired_by

Schema Pack Integration
Schema packs should be able to declare custom extraction patterns per verb:
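One way such per-verb declarations could be represented and compiled is sketched below. This is not the schema pack format: the `{source}`/`{role}`/`{target}` template syntax, the `verbTemplates` object, and the `compileTemplate`/`matchTemplate` helpers are all hypothetical names chosen for illustration.

```typescript
// Hypothetical declaration shape: verb name -> list of sentence templates.
const verbTemplates: Record<string, string[]> = {
  works_at: ["{source} is the {role} of {target}"],
  invested_in: ["{source} invested in {target}"],
};

interface CompiledTemplate {
  regex: RegExp;
  slots: string[]; // placeholder names, in capture-group order
}

function escapeRegExp(literal: string): string {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Turn "{source} invested in {target}" into an anchored regex with one group per slot.
function compileTemplate(template: string): CompiledTemplate {
  const slots: string[] = [];
  const source = template
    .split(/(\{[a-z]+\})/)
    .map((part) => {
      const placeholder = /^\{([a-z]+)\}$/.exec(part);
      if (!placeholder) return escapeRegExp(part);
      slots.push(placeholder[1]);
      return "(.+?)";
    })
    .join("");
  return { regex: new RegExp("^" + source + "$"), slots };
}

// Return the slot values for a sentence, or null when the template does not match.
function matchTemplate(compiled: CompiledTemplate, sentence: string): Record<string, string> | null {
  const m = compiled.regex.exec(sentence);
  if (!m) return null;
  const values: Record<string, string> = {};
  compiled.slots.forEach((name, i) => {
    values[name] = m[i + 1];
  });
  return values;
}

const ceoTemplate = compileTemplate(verbTemplates.works_at[0]);
console.log(matchTemplate(ceoTemplate, "Alice is the CEO of Acme Corp"));
```

Keeping the templates as plain strings would let a schema pack ship them as data rather than code, at the cost of less expressive matching than hand-written regular expressions.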
Agent Onboarding
Features Detection
gbrain features should detect low link coverage:

Migration Prompt
Evidence
The production brain has a fully installed schema pack with 30 types and 8 link verbs. The infrastructure for typed links is complete — the data just isn't there because no automated extraction exists. Meeting notes, company profiles, and person pages all contain relationship information in natural language that could be extracted.
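As a rough illustration of the gazetteer idea applied to sentences like the ones in the Problem section, the sketch below looks up known entity names in a sentence and checks the text between adjacent mentions for a verb cue. Everything in it is an assumption made for the example: the slugs, the `gazetteerEntries` and `verbCues` tables, and the function names are hypothetical, and the matching is deliberately naive (no word-boundary or alias handling).

```typescript
interface TypedLink { from: string; verb: string; to: string; }
interface Mention { slug: string; start: number; end: number; }

// Hypothetical gazetteer: entity page titles (and one alias) mapped to page slugs.
const gazetteerEntries: Record<string, string> = {
  "Alice": "people/alice",
  "Bob": "people/bob",
  "Acme Corp": "companies/acme",
  "Acme": "companies/acme",
};

// Hypothetical cue phrases per link verb.
const verbCues: Array<{ verb: string; cue: RegExp }> = [
  { verb: "works_at", cue: /\bis the \w+ of\b/i },
  { verb: "invested_in", cue: /\binvested in\b/i },
];

// Find non-overlapping mentions, preferring longer names ("Acme Corp" over "Acme").
function findMentions(text: string): Mention[] {
  const names = Object.keys(gazetteerEntries).sort((a, b) => b.length - a.length);
  const found: Mention[] = [];
  for (const name of names) {
    let from = 0;
    while (true) {
      const start = text.indexOf(name, from);
      if (start === -1) break;
      const end = start + name.length;
      const overlaps = found.some((m) => start < m.end && end > m.start);
      if (!overlaps) found.push({ slug: gazetteerEntries[name], start, end });
      from = end;
    }
  }
  return found.sort((a, b) => a.start - b.start);
}

// Emit a link when the text between two adjacent mentions contains a verb cue.
function extractLinks(sentence: string): TypedLink[] {
  const mentions = findMentions(sentence);
  const links: TypedLink[] = [];
  for (let i = 0; i + 1 < mentions.length; i++) {
    const between = sentence.slice(mentions[i].end, mentions[i + 1].start);
    for (const { verb, cue } of verbCues) {
      if (cue.test(between)) {
        links.push({ from: mentions[i].slug, verb, to: mentions[i + 1].slug });
      }
    }
  }
  return links;
}

console.log(extractLinks("Alice is the CEO of Acme Corp"));
console.log(extractLinks("Bob invested in Acme's Series A"));
```

Because the entity names come from existing pages, every extracted link already points at a resolvable page, which is the main appeal of a gazetteer over a general-purpose NER model here.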
Risks & Mitigations