Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
107 changes: 107 additions & 0 deletions docs/proposals/ner-entity-linking.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,107 @@
# Proposal: NER-Based Entity Link Extraction

## Problem

A production brain with 165K+ pages and a custom schema pack (30 types, 8 link verbs) has only 32% entity link coverage — meaning 68% of entities have no typed links at all. The schema pack defines rich relationship verbs (`works_at`, `founded`, `invested_in`, `advises`, etc.) but the data isn't populated because typed link extraction only works when frontmatter explicitly declares the relationship.

Body text is full of implicit relationships: "Alice is the CEO of Acme Corp", "Bob invested in Acme's Series A", "Alice previously worked at BigCo". None of these become typed links today.

### Scale of Impact

| Metric | Value |
|--------|-------|
| Total pages | ~165,000 |
| Entity link coverage | ~32% |
| Entities with zero typed links | ~68% |
| Schema verbs defined but underused | 8 |

## Proposed Solution

### NER Extraction Using the Brain's Own Gazetteer

Rather than adding an external NER model dependency, use the brain's own entity pages as the gazetteer:

1. **Build entity index**: Collect all person and company page titles + aliases
2. **Scan page body text** for co-occurrences of entities with relationship signals
3. **Extract typed links** based on context clues:
- "CEO of Acme" → `works_at` (with role annotation)
- "founded Acme" → `founded`
- "invested in Acme" → `invested_in`
- "advisor to Acme" → `advises`
- "acquired by BigCo" → `acquired_by`
4. **Create links** with confidence scores; only auto-create above threshold

### CLI Interface

```bash
# Extract typed links from all pages
gbrain extract links --ner

# Extract from specific page types only
gbrain extract links --ner --type meeting,company

# Dry run to preview extractions
gbrain extract links --ner --dry-run

# Only extract specific verb types
gbrain extract links --ner --verbs works_at,founded

# Set confidence threshold (default: 0.7)
gbrain extract links --ner --threshold 0.8
```

### Relationship Pattern Matching

The extraction engine should recognize common patterns per verb type:

| Verb | Patterns |
|------|----------|
| `works_at` | "CEO of X", "engineer at X", "joined X", "works at X" |
| `founded` | "founded X", "co-founded X", "started X in 2020" |
| `invested_in` | "invested in X", "led X's Series A", "backed X" |
| `advises` | "advisor to X", "advises X", "on X's board" |
| `acquired_by` | "acquired by X", "bought by X", "X acquired" |

### Schema Pack Integration

Schema packs should be able to declare custom extraction patterns per verb:

```yaml
verbs:
works_at:
extraction_patterns:
- "{person} is (the )?(CEO|CTO|COO|VP|engineer|designer) (of|at) {company}"
- "{person} (joined|works at|leads) {company}"
```

## Agent Onboarding

### Features Detection

`gbrain features` should detect low link coverage:

```
⚠ Entity link coverage: 32%
Your schema defines 8 relationship verbs but 68% of entities have no typed links.
Run `gbrain extract links --ner` to extract relationships from page text.
```

### Migration Prompt

```
Entity link coverage is 32%.
Run `gbrain extract links --ner` to extract typed links from page text? [y/N]
```

## Evidence

The production brain has a fully installed schema pack with 30 types and 8 link verbs. The infrastructure for typed links is complete — the data just isn't there because no automated extraction exists. Meeting notes, company profiles, and person pages all contain relationship information in natural language that could be extracted.

## Risks & Mitigations

| Risk | Mitigation |
|------|------------|
| False positive relationships | Confidence threshold, dry-run mode, manual review option |
| Ambiguous entity names | Prefer longer/more specific matches, use page context |
| Pattern fragility | Schema-declared patterns allow customization per deployment |
| Performance on large brains | Batch processing, incremental (only scan modified pages) |
23 changes: 20 additions & 3 deletions src/core/cycle/synthesize.ts
Original file line number Diff line number Diff line change
Expand Up @@ -110,6 +110,22 @@ function warnUnknownModelOnce(model: string): void {
);
}

// ── Surrogate-safe string slicing ─────────────────────────────────────

/**
* Slice a string at `index` without splitting a UTF-16 surrogate pair.
* If `index` lands between a high and low surrogate, back up by one so
* the pair stays intact in the left half.
*/
function safeSliceEnd(index: number, str: string): number {
if (index <= 0 || index >= str.length) return index;
const code = str.charCodeAt(index - 1);
// If the char just before the cut is a high surrogate (D800–DBFF),
// the cut would orphan it. Back up one.
if (code >= 0xD800 && code <= 0xDBFF) return index - 1;
return index;
}

// ── Hash-deterministic transcript chunker (D9) ────────────────────────

/**
Expand Down Expand Up @@ -178,8 +194,9 @@ function findBoundary(text: string, maxChars: number, searchStart: number): numb
// Tier 3: any newline.
const nlIdx = window.lastIndexOf('\n');
if (nlIdx >= 0) return searchStart + nlIdx;
// No boundary fits; hard-split at maxChars (deterministic).
return maxChars;
// No boundary fits; hard-split at maxChars (deterministic),
// but avoid splitting a UTF-16 surrogate pair.
return safeSliceEnd(maxChars, text);
}

/**
Expand Down Expand Up @@ -659,7 +676,7 @@ export async function judgeSignificance(
// doesn't need the full body; the opening + closing sections are usually
// representative of significance.
const trimmed = t.content.length > 8000
? t.content.slice(0, 4000) + '\n[...truncated...]\n' + t.content.slice(-4000)
? t.content.slice(0, safeSliceEnd(4000, t.content)) + '\n[...truncated...]\n' + t.content.slice(safeSliceEnd(t.content.length - 4000, t.content))
: t.content;

const sys = `You judge whether a conversation transcript is worth synthesizing into a personal knowledge brain.
Expand Down
Loading