## Problem
When extraction runs per-file (the default for large vaults), agents emit file-scoped node IDs to avoid collisions (e.g., `2024-03_silvia`, `journal_silvia`, `crm_silvia`). These all refer to the same concept but appear as separate nodes. On our vault this caused 62% node bloat: 1,421 raw nodes collapsed to 548 canonical nodes in one batch.
## Proposed solution
A canonicalization pass that runs after extraction and merge:
- Label normalization: Lowercase, strip path prefixes, remove known suffix patterns (e.g., folder qualifiers that agents add for disambiguation).
- Canonical ID generation: Slugify the normalized label into a stable ID (alphanumeric + underscores, length-capped).
- Node merge: All nodes sharing the same canonical ID collapse into one. Metadata is merged (source files become a list, `mention_count` tracks frequency).
- Edge dedup: Deduplicate edges by `(source, target, relation)` triplet after remapping to canonical IDs.
- Validation: Filter self-loops and any edges that became dangling after the merge.
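The steps above could be sketched roughly as follows. This is a minimal illustration, not graphify's actual API: the prefix patterns, node/edge shapes, and helper names are all assumptions.

```python
import re

# Hypothetical qualifier patterns; in practice these would be configurable per vault.
PREFIX_RE = re.compile(r"^(?:\d{4}-\d{2}|journal|crm)_")

def canonical_id(label: str, max_len: int = 64) -> str:
    """Lowercase, strip known qualifiers, slugify into a stable length-capped ID."""
    label = PREFIX_RE.sub("", label.lower())
    slug = re.sub(r"[^a-z0-9]+", "_", label).strip("_")
    return slug[:max_len]

def canonicalize(nodes, edges):
    """Merge nodes by canonical ID, remap edges, dedup triplets, drop invalid edges."""
    merged = {}
    for node in nodes:
        cid = canonical_id(node["id"])
        entry = merged.setdefault(cid, {"id": cid, "sources": [], "mention_count": 0})
        entry["sources"].extend(node.get("sources", []))
        entry["mention_count"] += node.get("mention_count", 1)

    remap = {n["id"]: canonical_id(n["id"]) for n in nodes}
    seen, deduped = set(), []
    for src, dst, rel in edges:
        s, d = remap.get(src), remap.get(dst)
        if s is None or d is None or s == d:  # dangling after merge, or self-loop
            continue
        if (s, d, rel) not in seen:
            seen.add((s, d, rel))
            deduped.append((s, d, rel))
    return list(merged.values()), deduped

# Toy input: three aliases of one concept plus a second node.
nodes = [
    {"id": "2024-03_silvia"}, {"id": "journal_silvia"},
    {"id": "crm_silvia"}, {"id": "project_alpha"},
]
edges = [
    ("journal_silvia", "project_alpha", "mentions"),
    ("crm_silvia", "project_alpha", "mentions"),      # duplicate after remapping
    ("2024-03_silvia", "crm_silvia", "alias_of"),     # becomes a self-loop, dropped
]
canon_nodes, canon_edges = canonicalize(nodes, edges)
```

After the pass, the three `silvia` aliases collapse into one node with `mention_count` 3, the two `mentions` edges dedup to one, and the alias edge is filtered as a self-loop.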
## Why this matters
Without canonicalization, the graph grows linearly with file count even when the same concepts appear repeatedly. Community detection splits what should be one cluster into many fragments. God-node analysis misses the true hubs because their mentions are scattered across aliases.
On journal-heavy corpora, canonicalization reduces node count by about 34% per batch (the highest of any content type, because journal entries reference the same themes with slight phrasing variations).
## Scope
This could be a `graphify canonicalize` subcommand or a `--canonicalize` post-processing flag. The normalization rules would need to be configurable since different vault types have different naming conventions.
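As a rough sketch of what configurable rules might look like (every name here is hypothetical, not an existing graphify config format):

```python
import re

# Hypothetical per-vault rules that a --canonicalize flag could load from a
# config file; the keys and defaults below are illustrative only.
DEFAULT_RULES = {
    "lowercase": True,
    "strip_prefixes": [r"^\d{4}-\d{2}_", r"^journal_", r"^crm_"],
    "strip_suffixes": [r"_notes$"],
    "max_id_length": 64,
}

def normalize(label: str, rules: dict = DEFAULT_RULES) -> str:
    """Apply the configured normalization rules, then slugify."""
    if rules["lowercase"]:
        label = label.lower()
    for pat in rules["strip_prefixes"] + rules["strip_suffixes"]:
        label = re.sub(pat, "", label)
    slug = re.sub(r"[^a-z0-9]+", "_", label).strip("_")
    return slug[: rules["max_id_length"]]
```

With rules expressed as data, each vault type could ship its own prefix/suffix patterns without code changes (e.g., `normalize("crm_silvia_notes")` would yield `silvia` under the defaults above).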
Happy to send a PR.