
Design notes: merge-import buckets, routes, and canonical-first analysis #141

Description

@tuirk

Related

Design-session follow-up to #54 (Wiki-to-wiki merge import). That issue lists edge cases and open design questions. This issue captures a Jun 2026 brainstorming session — thinking and exploration, not approved decisions.


Session arc (what we explored)

  1. Audited current import: empty-wiki gate in app/src/app/api/import/route.ts, .kompl.zip format, Settings UI stub.
  2. First direction (rejected): Obsidian-style vault merge — match pages by title, remap IDs, union provenance.
  3. Second direction: "compiler not vault" — merge sources only, recompile pages.
  4. Pushback: start from canonical entities/aliases, not merge strategy.
  5. Code audit: what Kompl treats as canonical vs instance-local.
  6. User-led brainstorm: 4 buckets, forward/backtrack on pages, routes A/B/D/F/G/R.
  7. Human review queue for ambiguous matches.

Canonical vs derived (from code audit)

Kompl already has canonicalization machinery:

| Artifact | Role | Multi-import risk |
| --- | --- | --- |
| aliases | Cross-session memory: alias → canonical_name (+ optional canonical_page_id) | Importing foreign alias rows can cause silent mis-routing |
| entity_mentions / relationship_mentions | Per-source extraction trace (canonical_name, source_id) | Can merge if source_id remapped |
| pages (entity/concept titles) | Resolver anchors (existing_page_title) | Title match ≠ same entity ("React" problem) |
| page_id | Instance-local (slug(title) + plan_uuid_suffix) | Importing as-is → duplicate pages |
| provenance | source_id ↔ page_id bridge | No UNIQUE key today → duplicate rows on re-import |
| pages markdown | Derived (compile output) | Merging text risks semantic drift across models/schema |

Lineage you can follow today:

  • source_id → entity_mentions / relationship_mentions (what was extracted)
  • source_id → provenance → page_id (what pages it contributed to)
  • Weaker: sentence-level "this paragraph came from source X" (not stored)

Extracted types: entities, concepts, relationships (not just "people").
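
To make the "instance-local page_id" row concrete, here is a minimal sketch of why importing a foreign page_id as-is yields a duplicate page. The exact slug rules and suffix length are assumptions for illustration; only the shape slug(title) + plan_uuid_suffix comes from the audit.

```python
import re
import uuid


def slugify(title: str) -> str:
    """Lowercase the title and collapse non-alphanumeric runs into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def make_page_id(title: str, plan_uuid: uuid.UUID) -> str:
    # Hypothetical format: slug plus the last 8 hex chars of the plan UUID.
    return f"{slugify(title)}-{plan_uuid.hex[-8:]}"


# Two wikis compile the same title under different plan UUIDs.
local_id = make_page_id("Chain-of-Thought", uuid.UUID(int=1))
foreign_id = make_page_id("Chain-of-Thought", uuid.UUID(int=2))

print(local_id)    # chain-of-thought-00000001
print(foreign_id)  # chain-of-thought-00000002
```

The slugs agree but the ids do not, so a merge has to re-pin imported rows to the local page_id rather than copy the pointer.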


Data flow (intake → compiled pages)

flowchart TD
  subgraph intake [Intake]
    raw[Raw sources URLs files etc]
    sources[(sources)]
    rawFiles[[raw/source_id.md.gz]]
    raw --> sources
    raw --> rawFiles
  end

  subgraph extract [Extract]
    extractApi[/compile/extract/]
    extractions[(extractions)]
    mentions[(entity_mentions relationship_mentions)]
    sources --> extractApi
    rawFiles --> extractApi
    extractApi --> extractions
    extractApi --> mentions
  end

  subgraph resolve [Resolve]
    resolveApi[/compile/resolve/]
    aliases[(aliases)]
    pageTitles[(entity concept page titles)]
    extractions --> resolveApi
    aliases --> resolveApi
    pageTitles --> resolveApi
    resolveApi --> aliases
    resolveApi --> mentions
  end

  subgraph commit [Commit]
    commitApi[/compile/commit/]
    pages[(pages)]
    pageFiles[[pages/page_id.md.gz]]
    provenance[(provenance)]
    pagePlans --> commitApi
    commitApi --> pages
    commitApi --> pageFiles
    commitApi --> provenance
    commitApi --> aliases
  end

Step 0 — Upload validation (discussed)

IF zip invalid OR no manifest → reject (422)
IF mode=restore AND wiki has user pages → reject wiki_not_empty (409)  [today]
IF mode=merge AND incoming schema_version > local → reject (422)  [discussed]
ELSE → continue to per-source analysis
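
The gate above could be sketched as a single function. This is a rough sketch, not the current route handler: the Upload shape and the reason strings invalid_archive and schema_too_new are hypothetical; wiki_not_empty and the status codes come from the rules above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Upload:
    zip_valid: bool
    manifest: Optional[dict]  # parsed manifest.json, or None if missing
    mode: str                 # "restore" or "merge"


def validate_upload(upload: Upload, local_schema_version: int, wiki_has_user_pages: bool):
    """Return (status, reason). Status 200 means continue to per-source analysis."""
    if not upload.zip_valid or upload.manifest is None:
        return 422, "invalid_archive"      # hypothetical reason code
    if upload.mode == "restore" and wiki_has_user_pages:
        return 409, "wiki_not_empty"       # existing behaviour
    incoming = upload.manifest.get("schema_version", 0)
    if upload.mode == "merge" and incoming > local_schema_version:
        return 422, "schema_too_new"       # hypothetical reason code
    return 200, "continue"
```

Note that mode=merge deliberately skips the empty-wiki check, since merging into a populated wiki is the whole point.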

Step 1 — Four buckets (per imported source)

Gate 1 — source identity (discussed keys: normalized URL, else content_hash):

IF no local source match → NEW source
ELSE → EXISTING source
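
A possible shape for the Gate 1 key, assuming "normalized URL" means lowercasing scheme and host and dropping the fragment and trailing slash. The real normalization rules were not settled in the session, so treat normalize_url and source_key as hypothetical helpers.

```python
import hashlib
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Assumed rules: lowercase scheme and host, drop fragment and trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def source_key(url: Optional[str], content: bytes) -> str:
    # Prefer the normalized URL; fall back to a content hash (e.g. file uploads).
    if url:
        return "url:" + normalize_url(url)
    return "hash:" + hashlib.sha256(content).hexdigest()


def is_existing_source(key: str, local_keys: set) -> bool:
    """Gate 1: the source is EXISTING when its key is already known locally."""
    return key in local_keys
```

Keying on URL first means the same URL with changed content still counts as an EXISTING source, which is what lets Gate 2 detect drift.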

Gate 2 — extraction trace (canonical entities/concepts/relationships from that source vs local graph):

IF trace same as local → KNOWN trace
ELSE → NEW trace

| Bucket | Code | Plain English |
| --- | --- | --- |
| b1 | new source + new trace | Pure new: never saw this source or this knowledge |
| b2 | new source + known trace | New evidence: new article about topics you already have |
| b3 | existing source + new trace | Semantic drift: same source, different extraction |
| b4 | existing source + known trace | Duplicate: skip |

Refinement discussed: a source can be new but semantically old (b2), or existing but with new extraction (b3).
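
The two gates combine into a small classifier. The sketch below assumes a trace is a set of canonical names and counts as "known" when all of them already exist in the local graph; the session did not pin down that comparison, so the subset test is an assumption.

```python
def classify_bucket(source_exists: bool, imported_trace: set, local_canonicals: set) -> str:
    """Map (Gate 1, Gate 2) outcomes to b1..b4."""
    # Assumption: "known trace" means every imported canonical is already local.
    trace_known = imported_trace <= local_canonicals
    if not source_exists:
        return "b2" if trace_known else "b1"
    return "b4" if trace_known else "b3"


local = {"gpt-4", "chain-of-thought"}
print(classify_bucket(False, {"vilnius street art"}, local))  # b1
print(classify_bucket(True, {"gpt-4"}, local))                # b4
```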


Step 2 — Bucket actions (before page logic)

b4 → Route G (no-op)
b3 → Route F (re-extract / recompile locally; don't trust imported pages)
b2 → import source + raw; Route D territory (provenance/mentions, no page rewrite)
b1 → HOLD pure-new units; then project page impact (Step 3)

Step 3 — Page impact (forward then backtrack)

Forward: from held sources, use imported provenance + mentions → list candidate imported pages.

Backtrack each candidate against local wiki:

IF no local page for that canonical node → net-new page candidate
ELSE local page exists → update candidate (merge evidence; do NOT paste imported markdown)
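
The forward and backtrack passes can be sketched over plain dictionaries. The argument names and the ("update" / "net_new") labels are hypothetical; the logic follows the two rules above.

```python
def project_page_impact(held_source_ids, imported_provenance,
                        imported_page_canonical, local_page_by_canonical):
    """
    held_source_ids: source ids held from bucket b1.
    imported_provenance: (source_id, page_id) pairs from the archive.
    imported_page_canonical: imported page_id -> canonical name.
    local_page_by_canonical: canonical name -> local page_id.
    """
    # Forward: which imported pages did the held sources contribute to?
    candidates = {page_id for source_id, page_id in imported_provenance
                  if source_id in held_source_ids}

    # Backtrack: does the local wiki already have a page for that canonical?
    impact = {}
    for page_id in sorted(candidates):
        canonical = imported_page_canonical[page_id]
        local_page = local_page_by_canonical.get(canonical)
        if local_page is None:
            impact[page_id] = ("net_new", None)
        else:
            # Update candidate: merge evidence only, never paste imported markdown.
            impact[page_id] = ("update", local_page)
    return impact
```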

Source overlap (for update candidates):

IF low source overlap + same canonical → likely additive evidence
IF high overlap but page text diverges → compiler/config drift → Route F, not import-as-is
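
One way to quantify "source overlap" is Jaccard similarity over the source ids attached to the imported page and to the local page. Both the metric and the 0.5 cutoff below are placeholders; the session only spoke of "low" and "high" overlap.

```python
def jaccard(a: set, b: set) -> float:
    """Share of sources the two pages have in common (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def overlap_signal(imported_sources: set, local_sources: set,
                   text_diverges: bool, high: float = 0.5) -> str:
    # `high` is a placeholder threshold, not a value agreed in the session.
    overlap = jaccard(imported_sources, local_sources)
    if overlap >= high and text_diverges:
        return "route_F"    # same evidence, different text: compiler/config drift
    if overlap < high:
        return "additive"   # mostly new evidence for the same canonical
    return "no_signal"      # same evidence, same text
```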

Step 4 — Node match bands (entity/concept canonicals)

Discussed thresholds using lexical match + all-MiniLM-L6-v2 embeddings (same family as resolver Layer 2):

| Band | Rule | Treatment |
| --- | --- | --- |
| EXACT | Normalized name equal OR linked via aliases | Same node |
| STRONG | Embedding cosine ≥ 0.90 | Same node |
| AMBIGUOUS | 0.70 ≤ cosine < 0.90 | Route R (human) |
| DISTINCT | cosine < 0.70 | Different node |
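
A minimal sketch of the band rules, taking a precomputed cosine similarity so it runs without the embedding model. The name normalization (lowercase, collapsed whitespace) and the alias map shape are assumptions; the 0.90 and 0.70 cutoffs are the discussed thresholds.

```python
import math


def cosine(u, v) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def normalize_name(name: str) -> str:
    # Assumed normalization: lowercase and collapse whitespace.
    return " ".join(name.lower().split())


def match_band(imported_name: str, local_name: str, aliases: dict, similarity: float) -> str:
    """aliases maps a normalized alias to its normalized canonical name."""
    a, b = normalize_name(imported_name), normalize_name(local_name)
    if aliases.get(a, a) == aliases.get(b, b):
        return "EXACT"
    if similarity >= 0.90:
        return "STRONG"
    if similarity >= 0.70:
        return "AMBIGUOUS"
    return "DISTINCT"
```

In practice `similarity` would come from `cosine()` applied to the two all-MiniLM-L6-v2 embeddings; passing it in keeps the band logic testable on its own.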

Step 5 — Routes A / B / D / F / G / R

Decision matrix (bucket × node match → route)

| Bucket | Node match | Route |
| --- | --- | --- |
| b1 (new + new) | DISTINCT | A: create source + page |
| b1 (new + new) | EXACT/STRONG | B: import source, attach to existing page, recompile |
| b1 (new + new) | AMBIGUOUS | R: human review |
| b2 (new + known) | EXACT/STRONG | D: source + provenance only, no page rewrite |
| b2 (new + known) | AMBIGUOUS | R |
| b2 (new + known) | DISTINCT | A (rare) |
| b3 (existing + new) | any | F: re-extract locally |
| b4 (existing + known) | any | G: no-op |
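
The matrix collapses to a short lookup. This sketch encodes the rows above as written; the function name and string labels are illustrative only.

```python
BUCKETS = {"b1", "b2", "b3", "b4"}
BANDS = {"EXACT", "STRONG", "AMBIGUOUS", "DISTINCT"}


def choose_route(bucket: str, band: str) -> str:
    """Return the route letter for a (bucket, node match band) pair."""
    if bucket not in BUCKETS or band not in BANDS:
        raise ValueError(f"unknown bucket/band: {bucket!r}, {band!r}")
    if bucket == "b4":
        return "G"   # duplicate: no-op regardless of band
    if bucket == "b3":
        return "F"   # drift: re-extract locally regardless of band
    if band == "AMBIGUOUS":
        return "R"   # human review
    if band == "DISTINCT":
        return "A"   # create source + page
    # EXACT or STRONG: b1 recompiles the existing page, b2 only adds provenance.
    return "B" if bucket == "b1" else "D"
```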

Routes in plain language

  • A — Brand new: genuinely new knowledge → add source + page normally.
  • B — Same topic, new substance: keep existing page, refresh via recompile with new evidence (no duplicate page, no blind overwrite).
  • D — Same topic, extra proof: add source + provenance links; page text mostly unchanged.
  • F — Same source, different extraction: don't trust imported page; re-run local pipeline.
  • G — Duplicate: skip.
  • R — Unclear: pause for human; default skip if no action.

Big rule discussed: for matching topics, no duplicate pages and no blind overwrite of existing page text.


Step 6 — Clash checks (gate writes)

  • Alias canonical conflict: same alias string → different canonicals → block auto-merge → R
  • canonical_page_id mismatch: never import foreign page_id pointers; re-pin to local
  • Provenance dup: needs UNIQUE on (source_id, page_id, content_hash, contribution_type); see the prerequisite in #54 (Wiki-to-wiki merge import, mode=merge)
  • Page content divergence (same canonical, different markdown): never overwrite → B or R
  • Relationship direction/type mismatch: normalize before dedup
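
For the provenance check, the effect of the proposed UNIQUE key can be shown with an in-memory SQLite table. This assumes a SQLite-style store and invents the column types and sample values; only the four column names come from the bullet above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE provenance (
        source_id         TEXT NOT NULL,
        page_id           TEXT NOT NULL,
        content_hash      TEXT NOT NULL,
        contribution_type TEXT NOT NULL
    )
""")
# The proposed prerequisite: one row per (source, page, hash, contribution type).
conn.execute("""
    CREATE UNIQUE INDEX idx_provenance_unique
    ON provenance (source_id, page_id, content_hash, contribution_type)
""")

row = ("src-1", "gpt-4-00000001", "hash-v1", "evidence")  # illustrative values
for _ in range(2):  # simulate importing the same backup twice
    conn.execute("INSERT OR IGNORE INTO provenance VALUES (?, ?, ?, ?)", row)

count = conn.execute("SELECT COUNT(*) FROM provenance").fetchone()[0]
print(count)  # 1
```

Without the index the second insert would succeed and the re-import would double every provenance row.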

Step 7 — Route R: human review UX (discussed)

One card per unclear pair:

  • Imported item: name, summary, top sources
  • Existing candidate: same
  • Why unclear: one line
  • Buttons: Merge into existing | Keep separate | Skip | Merge + add alias
  • Default if ignored: Skip
  • Log decision to activity
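
The card and its default could be modelled roughly as below. Field names, action strings, and the shape of the activity log are hypothetical; the four actions and the Skip default mirror the list above.

```python
from dataclasses import dataclass
from typing import List, Optional

ACTIONS = ("merge", "keep_separate", "skip", "merge_add_alias")
DEFAULT_ACTION = "skip"


@dataclass
class ReviewItem:
    name: str
    summary: str
    top_sources: List[str]


@dataclass
class ReviewCard:
    imported: ReviewItem
    existing: ReviewItem
    why_unclear: str
    decision: Optional[str] = None


def resolve(card: ReviewCard, action: Optional[str], activity_log: list) -> str:
    """Record the reviewer's choice; None (card ignored) falls back to Skip."""
    if action is not None and action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    card.decision = action if action is not None else DEFAULT_ACTION
    # Log every decision, including defaulted skips, to the activity feed.
    activity_log.append((card.imported.name, card.existing.name, card.decision))
    return card.decision
```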

Full route tree

flowchart TD
  start[Each imported source] --> sm{Source matches local?}
  sm -->|No| srcNew[NEW source]
  sm -->|Yes| srcEx[EXISTING source]

  srcNew --> tmA{Trace matches local graph?}
  srcEx --> tmB{Trace matches local graph?}

  tmA -->|No| b1[b1: new + new]
  tmA -->|Yes| b2[b2: new + known]
  tmB -->|No| b3[b3: existing + new]
  tmB -->|Yes| b4[b4: duplicate]

  b4 --> G[Route G: no-op]
  b3 --> F[Route F: re-extract locally]

  b1 --> hold[Hold + project pages]
  b2 --> proj[Project page impact]

  hold --> nm{Node match band}
  proj --> nm

  nm -->|DISTINCT| A[Route A: create new]
  nm -->|EXACT/STRONG| B_or_D{b1 or b2?}
  nm -->|AMBIGUOUS| R[Route R: human]

  B_or_D -->|b1| B[Route B: merge + recompile]
  B_or_D -->|b2| D[Route D: provenance only]

  A --> clash{Clash checks pass?}
  B --> clash
  D --> clash
  F --> clash

  clash -->|yes| write[Write]
  clash -->|no| R

Example cases

b1 → A (pure new)

Local: nothing about "Vilnius street art". Import: new URL + new entities + new page.
→ Add source and page as-is.

b2 → D (new evidence)

Local: GPT-4 entity page exists. Import: new URL about GPT-4, same canonical.
→ Add source + provenance to existing page; don't overwrite page text.

b3 → F (drift)

Local: source URL with hash v1. Import: same URL, hash v2, different entities.
→ Re-extract locally; don't import page markdown.

b4 → G (duplicate)

Re-import same backup.
→ No-op.

AMBIGUOUS → R ("React" problem)

Local: "React" = JS framework. Import: "React" = chemistry. Embedding ~0.75.
→ Human decides merge vs separate.

b1 → B (same topic, new substance)

Local: Chain-of-Thought page. Import: new paper, same canonical, different page text.
→ Add source, map to existing page_id, recompile — don't paste imported markdown.


What each bucket gives / risks (from session)

| Bucket | Value | Risk if handled wrong |
| --- | --- | --- |
| b1 | Net-new knowledge + pages | Duplicate pages if canonical match weak |
| b2 | More evidence for known topics | Noise if auto-creating pages |
| b3 | Detected semantic change | Silent drift if not recompiled |
| b4 | Nothing | Storage bloat only |

Relation to #54

#54's nine edge cases and prerequisite UNIQUE migration are still relevant. This session does not close #54's open decision list — it adds a canonical-first, route-based analysis path that may simplify some edge cases (e.g. page merge, alias union) by routing through buckets + recompile instead of vault-style text merge.

Discussed but not built:

  • POST /api/import?mode=merge
  • POST /api/import/analyze dry-run (zero writes, returns bucket/route preview)
  • Compiler fingerprint in manifest.json for optional page cache restore (v2 fast path)

Workaround today: re-ingest through onboarding (loses provenance, drafts, chat).


Possible implementation phases (for discussion)

Phase 0: UNIQUE on provenance / aliases (#54 prerequisite; schema now v25)

Phase 1: Dry-run analyze endpoint — classify sources into b1–b4, emit route preview, zero writes

Phase 2: Apply routes G/D/A with clash checks; human queue for R

Phase 3: Routes B/F with recompile trigger

Phase 4 (optional): fingerprint-matched page/vector restore


Files likely involved

  • app/src/app/api/import/route.ts
  • app/src/app/api/export/route.ts (fingerprint, later)
  • app/src/app/settings/page.tsx
  • scripts/migrate.py
  • app/src/lib/db.ts (normalize URL, match helpers)
  • Export/import tests
