Feature: first-class canonical entity registry (aliases/trigrams) consumed by extraction, build_merge and query #1651

Description

@Ns2384-star

Version: graphifyy 0.9.5, documentary corpus (~6800 nodes).

Problem

Fuzzy dedup (Jaro-Winkler, threshold 92, in dedup.py) can't, and shouldn't, merge a trigram with a full name (MWA vs. Mickael Wagner) or two phrasings of the same org (datafab_by_alteca vs. datafab). But nothing else in the tool does either. Consequences on our corpus:

  • 200+ manual merges over a month;
  • a god node destroyed by replace-on-re-extract after a subagent emitted a different ID for the same entity (174 edges → 0).

Proposal

A first-class canonical-entity registry, e.g. graphify-out/entities.json (id, label, type, aliases, forbidden-merge homonyms), consumed by:

  1. the extraction prompt/spec (subagents emit canonical IDs directly);
  2. build_merge (remap aliases → canonical ID before replace-on-re-extract);
  3. query expansion (querying an alias finds the canonical node).
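To make the proposal concrete, here is a minimal sketch of what the registry and the remap step in point 2 could look like. Everything in it is an assumption for illustration: the JSON layout, the `forbidden_merges` key name, the choice of `datafab` as the canonical ID, the node/edge dict shapes, and the helper names (`build_alias_index`, `canonical_id`, `remap_extraction`) are hypothetical and are not graphify's actual API.

```python
import json

# Hypothetical contents of graphify-out/entities.json. Field names follow the
# proposal (id, label, type, aliases, forbidden-merge homonyms); the exact
# schema is an assumption.
REGISTRY_JSON = """
[
  {"id": "mickael_wagner", "label": "Mickael Wagner", "type": "person",
   "aliases": ["MWA"], "forbidden_merges": []},
  {"id": "datafab", "label": "datafab", "type": "org",
   "aliases": ["datafab_by_alteca"], "forbidden_merges": []}
]
"""


def normalize(name):
    """Case-fold and treat underscores and runs of whitespace as one space."""
    return " ".join(name.casefold().replace("_", " ").split())


def build_alias_index(entities):
    """Map every known spelling (id, label, aliases) to its canonical ID."""
    index = {}
    for entity in entities:
        for name in [entity["id"], entity["label"], *entity["aliases"]]:
            owner = index.setdefault(normalize(name), entity["id"])
            if owner != entity["id"]:
                # Two entities claiming the same spelling is a registry error.
                raise ValueError(
                    f"alias {name!r} claimed by {owner!r} and {entity['id']!r}"
                )
    return index


def canonical_id(index, raw_id):
    """Return the canonical ID for raw_id, or raw_id itself if unknown."""
    return index.get(normalize(raw_id), raw_id)


def remap_extraction(index, nodes, edges):
    """Rewrite node IDs and edge endpoints to canonical IDs before merging."""
    merged = {}
    for node in nodes:
        cid = canonical_id(index, node["id"])
        # Keep the first node seen for each canonical ID.
        merged.setdefault(cid, {**node, "id": cid})
    remapped_edges = [
        {
            **edge,
            "source": canonical_id(index, edge["source"]),
            "target": canonical_id(index, edge["target"]),
        }
        for edge in edges
    ]
    return list(merged.values()), remapped_edges


ALIAS_INDEX = build_alias_index(json.loads(REGISTRY_JSON))
```

Running the remap before replace-on-re-extract would mean a subagent that emits `MWA` instead of `mickael_wagner` still lands on the existing node, so its edges are not orphaned.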

The aliases field already exists on nodes, so this is mostly about making it bidirectional, persistent, and honored at merge time. We maintain such a file by hand today and inject it into extraction prompts; it works, but it all lives outside the tool.
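As a rough illustration of "bidirectional", the sketch below keeps both directions of the mapping (alias to canonical ID, and canonical ID to all spellings) and uses them for query expansion and a forbidden-merge check. The entity list, the `forbidden_merges` key, the `michael_wagner` homonym, and all function names are invented for the example; none of them come from graphify itself.

```python
from collections import defaultdict

# Hypothetical registry entries. "michael_wagner" is an invented homonym used
# only to demonstrate the forbidden-merge check.
ENTITIES = [
    {"id": "mickael_wagner", "label": "Mickael Wagner",
     "aliases": ["MWA"], "forbidden_merges": ["michael_wagner"]},
    {"id": "michael_wagner", "label": "Michael Wagner",
     "aliases": [], "forbidden_merges": []},
]


def build_indexes(entities):
    """Build both directions: spelling -> canonical ID, and ID -> spellings."""
    alias_to_id = {}
    id_to_names = defaultdict(set)
    for entity in entities:
        for name in (entity["id"], entity["label"], *entity["aliases"]):
            alias_to_id[name.casefold()] = entity["id"]
            id_to_names[entity["id"]].add(name)
    return alias_to_id, id_to_names


def expand_query(term, alias_to_id, id_to_names):
    """Return every known spelling of the entity that term refers to."""
    cid = alias_to_id.get(term.casefold())
    if cid is None:
        return [term]  # Unknown term: search for it as typed.
    return sorted(id_to_names[cid])


def merge_allowed(id_a, id_b, entities):
    """False if either entity lists the other as a forbidden merge."""
    by_id = {entity["id"]: entity for entity in entities}
    forbidden_a = by_id.get(id_a, {}).get("forbidden_merges", [])
    forbidden_b = by_id.get(id_b, {}).get("forbidden_merges", [])
    return id_b not in forbidden_a and id_a not in forbidden_b


ALIAS_TO_ID, ID_TO_NAMES = build_indexes(ENTITIES)
```

With these indexes, a query for `mwa` expands to all spellings of the canonical node, and a fuzzy matcher can consult `merge_allowed` before merging two near-identical labels.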

Related: #595 (graph hygiene primitives).
