## Problem
When extraction runs per-file (the default for large vaults), agents emit file-scoped node IDs to avoid collisions (e.g., `2024-03_silvia`, `journal_silvia`, `crm_silvia`). These all refer to the same concept but appear as separate nodes. On our vault this caused 62% node bloat: 1,421 raw nodes collapsed to 548 canonical nodes in one batch.
## Proposed solution
A canonicalization pass that runs after extraction and merge:
- Label normalization: Lowercase, strip path prefixes, remove known suffix patterns (e.g., folder qualifiers that agents add for disambiguation).
- Canonical ID generation: Slugify the normalized label into a stable ID (alphanumeric + underscores, length-capped).
- Node merge: All nodes sharing the same canonical ID collapse into one. Metadata is merged (source files become a list, `mention_count` tracks frequency).
- Edge dedup: Deduplicate edges by `(source, target, relation)` triplet after remapping to canonical IDs.
- Validation: Filter self-loops and any edges that became dangling after the merge.
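The steps above could be sketched roughly as follows. This is a minimal illustration, not graphify's actual API: the prefix patterns, node/edge shapes, and helper names are all assumptions.

```python
import re

# Hypothetical qualifier patterns; in practice these would be configurable per vault.
PREFIX_RE = re.compile(r"^(?:\d{4}-\d{2}|journal|crm)_")

def canonical_id(label: str, max_len: int = 64) -> str:
    """Lowercase, strip known qualifiers, slugify into a stable length-capped ID."""
    label = PREFIX_RE.sub("", label.lower())
    slug = re.sub(r"[^a-z0-9]+", "_", label).strip("_")
    return slug[:max_len]

def canonicalize(nodes, edges):
    """Merge nodes by canonical ID, remap edges, dedup triplets, drop invalid edges."""
    merged = {}
    for node in nodes:
        cid = canonical_id(node["id"])
        entry = merged.setdefault(cid, {"id": cid, "sources": [], "mention_count": 0})
        entry["sources"].extend(node.get("sources", []))
        entry["mention_count"] += node.get("mention_count", 1)

    remap = {n["id"]: canonical_id(n["id"]) for n in nodes}
    seen, deduped = set(), []
    for src, dst, rel in edges:
        s, d = remap.get(src), remap.get(dst)
        if s is None or d is None or s == d:  # dangling after merge, or self-loop
            continue
        if (s, d, rel) not in seen:
            seen.add((s, d, rel))
            deduped.append((s, d, rel))
    return list(merged.values()), deduped

# Toy input: three aliases of one concept plus a second node.
nodes = [
    {"id": "2024-03_silvia"}, {"id": "journal_silvia"},
    {"id": "crm_silvia"}, {"id": "project_alpha"},
]
edges = [
    ("journal_silvia", "project_alpha", "mentions"),
    ("crm_silvia", "project_alpha", "mentions"),      # duplicate after remapping
    ("2024-03_silvia", "crm_silvia", "alias_of"),     # becomes a self-loop, dropped
]
canon_nodes, canon_edges = canonicalize(nodes, edges)
```

After the pass, the three `silvia` aliases collapse into one node with `mention_count` 3, the two `mentions` edges dedup to one, and the alias edge is filtered as a self-loop.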
## Why this matters
Without canonicalization, the graph grows linearly with file count even when the same concepts appear repeatedly. Community detection splits what should be one cluster into many fragments. God-node analysis misses the true hubs because their mentions are scattered across aliases.
On journal-heavy corpora, canonicalization reduces node count by about 34% per batch (the highest of any content type, because journal entries reference the same themes with slight phrasing variations).
## Scope
This could be a `graphify canonicalize` subcommand or a `--canonicalize` post-processing flag. The normalization rules would need to be configurable since different vault types have different naming conventions.
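As a rough sketch of what configurable rules might look like (every name here is hypothetical, not an existing graphify config format):

```python
import re

# Hypothetical per-vault rules that a --canonicalize flag could load from a
# config file; the keys and defaults below are illustrative only.
DEFAULT_RULES = {
    "lowercase": True,
    "strip_prefixes": [r"^\d{4}-\d{2}_", r"^journal_", r"^crm_"],
    "strip_suffixes": [r"_notes$"],
    "max_id_length": 64,
}

def normalize(label: str, rules: dict = DEFAULT_RULES) -> str:
    """Apply the configured normalization rules, then slugify."""
    if rules["lowercase"]:
        label = label.lower()
    for pat in rules["strip_prefixes"] + rules["strip_suffixes"]:
        label = re.sub(pat, "", label)
    slug = re.sub(r"[^a-z0-9]+", "_", label).strip("_")
    return slug[: rules["max_id_length"]]
```

With rules expressed as data, each vault type could ship its own prefix/suffix patterns without code changes (e.g., `normalize("crm_silvia_notes")` would yield `silvia` under the defaults above).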
Happy to send a PR.