Version: graphifyy 0.9.5, documentary corpus (~6800 nodes).
Problem
Fuzzy dedup (Jaro-Winkler, threshold 92, dedup.py) can't, and shouldn't, merge a trigram with a full name (MWA ↔ Mickael Wagner) or two phrasings of an org (datafab_by_alteca ↔ datafab). But no other mechanism in the tool merges them either. Lived consequences on our corpus:
- 200+ manual merges over a month;
- a god node destroyed by replace-on-re-extract after a subagent emitted a different ID for the same entity (174 edges → 0).
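To illustrate why string similarity cannot bridge these pairs, here is a minimal, self-contained sketch of textbook Jaro-Winkler. It is not the code in dedup.py: the lowercasing, the 0.1 prefix scale, and reading "threshold 92" as 0.92 on a 0 to 1 scale are all assumptions made for the example.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching within this sliding window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted by a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


THRESHOLD = 0.92  # assumed equivalent of "threshold 92"

pairs = [("MWA", "Mickael Wagner"), ("datafab_by_alteca", "datafab")]
scores = {pair: jaro_winkler(pair[0].lower(), pair[1].lower()) for pair in pairs}
for pair, score in scores.items():
    print(pair, round(score, 3), "merge" if score >= THRESHOLD else "no merge")
```

Under these assumptions both pairs land below the threshold (about 0.64 and 0.88), and lowering the threshold far enough to catch them would also merge unrelated names, which is why an explicit registry is a better fit than a looser fuzzy match.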
Proposal
A first-class canonical-entity registry, e.g. graphify-out/entities.json (id, label, type, aliases, forbidden-merge homonyms), consumed by:
- the extraction prompt/spec (subagents emit canonical IDs directly);
- build_merge (remap aliases → canonical ID before replace-on-re-extract);
- query expansion (querying an alias finds the canonical node).
The aliases field already exists on nodes — this is mostly about making it bidirectional, persistent, and honored at merge time. We maintain such a file by hand today and inject it into extraction prompts; it works, but it's all outside the tool.
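A minimal sketch of what the registry and the merge-time remap could look like. Everything here is hypothetical: the JSON schema (including the `forbidden_merges` field name), the node and edge dictionaries with `id`, `source`, and `target` keys, and the helper names are assumptions for illustration, not graphify's actual data model or the signature of build_merge.

```python
import json

# Hypothetical content of graphify-out/entities.json; the schema is an assumption.
REGISTRY_JSON = """
[
  {"id": "mickael_wagner", "label": "Mickael Wagner", "type": "person",
   "aliases": ["MWA"], "forbidden_merges": []},
  {"id": "datafab", "label": "datafab", "type": "org",
   "aliases": ["datafab_by_alteca"], "forbidden_merges": []}
]
"""


def normalize(name: str) -> str:
    """Case-insensitive, whitespace-trimmed lookup key."""
    return name.strip().casefold()


def build_alias_index(entities: list) -> dict:
    """Map every known name (id, label, alias) to its canonical ID."""
    index = {}
    for entity in entities:
        for name in [entity["id"], entity["label"], *entity["aliases"]]:
            existing = index.setdefault(normalize(name), entity["id"])
            if existing != entity["id"]:
                # An alias claimed by two entities is a registry error.
                raise ValueError(
                    f"{name!r} maps to both {existing!r} and {entity['id']!r}"
                )
    return index


def canonical_id(raw_id: str, index: dict) -> str:
    """Return the canonical ID for a name, or the name itself if unknown."""
    return index.get(normalize(raw_id), raw_id)


def remap_extraction(nodes: list, edges: list, index: dict):
    """Rewrite node IDs and edge endpoints to canonical IDs before merging."""
    remapped = {}
    for node in nodes:
        cid = canonical_id(node["id"], index)
        # Nodes that collapse onto the same canonical ID are kept once.
        remapped.setdefault(cid, {**node, "id": cid})
    new_edges = [
        {**edge,
         "source": canonical_id(edge["source"], index),
         "target": canonical_id(edge["target"], index)}
        for edge in edges
    ]
    return list(remapped.values()), new_edges


entities = json.loads(REGISTRY_JSON)
index = build_alias_index(entities)

# A re-extraction that emitted alias IDs instead of canonical ones.
nodes = [{"id": "MWA"}, {"id": "datafab_by_alteca"}, {"id": "unrelated_doc"}]
edges = [{"source": "MWA", "target": "datafab_by_alteca"}]
new_nodes, new_edges = remap_extraction(nodes, edges, index)
print([n["id"] for n in new_nodes], new_edges)
```

The same index serves query expansion: `canonical_id("mwa", index)` resolves an alias typed by a user to the canonical node. The forbidden-merge homonym list is carried in the schema but not exercised in this sketch; one option would be for the dedup step to skip any candidate pair listed there.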
Related: #595 (graph hygiene primitives).