Design-session follow-up to #54 (Wiki-to-wiki merge import). That issue lists edge cases and open design questions. This issue captures a Jun 2026 brainstorming session — thinking and exploration, not approved decisions.
Session arc (what we explored)
Audited current import: empty-wiki gate in app/src/app/api/import/route.ts, .kompl.zip format, Settings UI stub.
First direction (rejected): Obsidian-style vault merge — match pages by title, remap IDs, union provenance.
Second direction: "compiler not vault" — merge sources only, recompile pages.
Pushback: start from canonical entities/aliases, not merge strategy.
Code audit: what Kompl treats as canonical vs instance-local.
User-led brainstorm: 4 buckets, forward/backtrack on pages, routes A/B/D/F/G/R.
Human review queue for ambiguous matches.
Canonical vs derived (from code audit)
Kompl already has canonicalization machinery:
| Artifact | Role | Multi-import risk |
| --- | --- | --- |
| aliases | Cross-session memory: alias → canonical_name (+ optional canonical_page_id) | Importing foreign alias rows can cause silent mis-routing |
Step 0 — Upload validation (discussed)
IF zip invalid OR no manifest → reject (422)
IF mode=restore AND wiki has user pages → reject wiki_not_empty (409) [today]
IF mode=merge AND incoming schema_version > local → reject (422) [discussed]
ELSE → continue to per-source analysis
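The upload gate above could be sketched as a pure function that runs before any per-source work. This is only an illustration: the type names, the `validateUpload` helper, and the `invalid_archive` / `schema_too_new` error strings are assumptions, not existing Kompl code. Only `wiki_not_empty` and the 422/409 statuses come from the notes.

```typescript
// Hypothetical shapes for illustration; not taken from the Kompl codebase.
type ImportMode = "restore" | "merge";

interface UploadCheckInput {
  zipValid: boolean;                            // archive opened and parsed
  manifest: { schema_version: number } | null;  // null when the manifest is missing
  mode: ImportMode;
  localSchemaVersion: number;
  localUserPageCount: number;
}

type UploadCheckResult =
  | { ok: true }
  | { ok: false; status: 422 | 409; error: string };

function validateUpload(input: UploadCheckInput): UploadCheckResult {
  // Invalid zip or missing manifest: reject with 422.
  if (!input.zipValid || input.manifest === null) {
    return { ok: false, status: 422, error: "invalid_archive" };
  }
  // Restore keeps today's empty-wiki gate.
  if (input.mode === "restore" && input.localUserPageCount > 0) {
    return { ok: false, status: 409, error: "wiki_not_empty" };
  }
  // Merge refuses archives written by a newer schema than the local one.
  if (input.mode === "merge" && input.manifest.schema_version > input.localSchemaVersion) {
    return { ok: false, status: 422, error: "schema_too_new" };
  }
  return { ok: true };
}
```

Keeping the gate free of I/O would let the same checks back both the real import and a dry-run preview.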
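Step 1 — Four buckets (per imported source)
Gate 1 — source identity (discussed keys: normalized URL, else content_hash):
IF no local source match → NEW source
ELSE → EXISTING source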
Gate 2 — extraction trace (canonical entities/concepts/relationships from that source vs local graph):
IF trace same as local → KNOWN trace
ELSE → NEW trace
| Bucket | Code | Plain English |
| --- | --- | --- |
| b1 | new source + new trace | Pure new: never saw this source or this knowledge |
| b2 | new source + known trace | New evidence: new article about topics you already have |
| b3 | existing source + new trace | Semantic drift: same source, different extraction |
| b4 | existing source + known trace | Duplicate: skip |
Refinement discussed: a source can be new but semantically old (b2), or existing but with new extraction (b3).
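The two gates combine into a four-way classification, which might look like the sketch below. Every name here is hypothetical. It also assumes a trace counts as "known" when every canonical name extracted from the imported source already exists in the local graph, which is one possible reading of Gate 2 rather than a decision from the session.

```typescript
type Bucket = "b1" | "b2" | "b3" | "b4";

// Hypothetical view of one imported source after parsing the archive.
interface ImportedSource {
  normalizedUrl: string | null; // null for sources without a URL
  contentHash: string;
  canonicalNames: string[];     // canonical entities/concepts from its extraction trace
}

// Gate 1 key: normalized URL when present, else content hash.
function sourceKey(src: ImportedSource): string {
  return src.normalizedUrl ?? src.contentHash;
}

function classifyBucket(
  src: ImportedSource,
  localSourceKeys: Set<string>,
  localCanonicals: Set<string>,
): Bucket {
  const existingSource = localSourceKeys.has(sourceKey(src));
  // Assumption: "known trace" means nothing in the trace is new to the local graph.
  const knownTrace = src.canonicalNames.every((name) => localCanonicals.has(name));
  if (!existingSource) return knownTrace ? "b2" : "b1";
  return knownTrace ? "b4" : "b3";
}
```

Note that a source with an empty trace would classify as "known" under this reading; a real implementation would need to decide whether that is acceptable.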
Step 2 — Bucket actions (before page logic)
b4 → Route G (no-op)
b3 → Route F (re-extract / recompile locally; don't trust imported pages)
b2 → import source + raw; Route D territory (provenance/mentions, no page rewrite)
b1 → HOLD pure-new units; then project page impact (Step 3)
Step 3 — Page impact (forward then backtrack)
Forward: from held sources, use imported provenance + mentions → list candidate imported pages.
Backtrack each candidate against local wiki:
IF no local page for that canonical node → net-new page candidate
ELSE local page exists → update candidate (merge evidence; do NOT paste imported markdown)
Source overlap (for update candidates):
IF low source overlap + same canonical → likely additive evidence
IF high overlap but page text diverges → compiler/config drift → Route F, not import-as-is
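One way to make "low" and "high" source overlap concrete is Jaccard similarity over the source keys behind the local and imported page. The sketch below assumes that measure and an arbitrary 0.5 cut-off; neither was fixed in the session. The `unchanged` verdict is also an added placeholder for the case the notes leave unspecified.

```typescript
type OverlapVerdict = "additive" | "drift" | "unchanged";

// Jaccard similarity: |A ∩ B| / |A ∪ B|; two empty sets count as identical.
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1;
  let shared = 0;
  a.forEach((x) => {
    if (b.has(x)) shared += 1;
  });
  return shared / (a.size + b.size - shared);
}

// highOverlap is a hypothetical threshold, not a value from the design notes.
function assessUpdateCandidate(
  localSources: Set<string>,
  importedSources: Set<string>,
  pageTextDiverges: boolean,
  highOverlap = 0.5,
): OverlapVerdict {
  const overlap = jaccard(localSources, importedSources);
  if (overlap < highOverlap) return "additive";     // mostly different sources: new evidence
  return pageTextDiverges ? "drift" : "unchanged";  // same sources, different text: Route F
}
```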
Step 4 — Node match bands (entity/concept canonicals)
Discussed thresholds using lexical match + all-MiniLM-L6-v2 embeddings (same family as resolver Layer 2):
| Band | Rule | Treatment |
| --- | --- | --- |
| EXACT | Normalized name equal OR linked via aliases | Same node |
| STRONG | Embedding cosine ≥ 0.90 | Same node |
| AMBIGUOUS | 0.70 ≤ cosine < 0.90 | Route R (human) |
| DISTINCT | cosine < 0.70 | Different node |
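The bands above reduce to a lexical check followed by two cosine thresholds. The sketch below assumes embeddings have already been computed elsewhere (for example with all-MiniLM-L6-v2) and are passed in as plain number arrays. The `normalizeName` rule and the alias map shape are illustrative assumptions, not the resolver's actual behavior.

```typescript
type Band = "EXACT" | "STRONG" | "AMBIGUOUS" | "DISTINCT";

interface NodeCandidate {
  name: string;
  embedding: number[];
}

// Hypothetical normalization: trim, lowercase, collapse whitespace.
function normalizeName(name: string): string {
  return name.trim().toLowerCase().replace(/\s+/g, " ");
}

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// aliases maps a normalized alias to a normalized canonical name.
function matchBand(incoming: NodeCandidate, local: NodeCandidate, aliases: Map<string, string>): Band {
  const a = normalizeName(incoming.name);
  const b = normalizeName(local.name);
  if (a === b || aliases.get(a) === b || aliases.get(b) === a) return "EXACT";
  const sim = cosine(incoming.embedding, local.embedding);
  if (sim >= 0.9) return "STRONG";
  if (sim >= 0.7) return "AMBIGUOUS";
  return "DISTINCT";
}
```

Running the lexical check first means an alias link wins even when the two embeddings happen to be far apart.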
Step 5 — Routes A / B / D / F / G / R
Decision matrix (bucket × node match → route)
| Bucket | Node match | Route |
| --- | --- | --- |
| b1 (new + new) | DISTINCT | A: create source + page |
| b1 (new + new) | EXACT/STRONG | B: import source, attach to existing page, recompile |
| b1 (new + new) | AMBIGUOUS | R: human review |
| b2 (new + known) | EXACT/STRONG | D: source + provenance only, no page rewrite |
| b2 (new + known) | AMBIGUOUS | R |
| b2 (new + known) | DISTINCT | A (rare) |
| b3 (existing + new) | any | F: re-extract locally |
| b4 (existing + known) | any | G: no-op |
Routes in plain language
A — Brand new: genuinely new knowledge → add source + page normally.
B — Same topic, new substance: keep existing page, refresh via recompile with new evidence (no duplicate page, no blind overwrite).
D — Same topic, extra proof: add source + provenance links; page text mostly unchanged.
F — Same source, different extraction: don't trust imported page; re-run local pipeline.
G — Duplicate: skip.
R — Unclear: pause for human; default skip if no action.
Big rule discussed: for matching topics, no duplicate pages and no blind overwrite of existing page text.
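The decision matrix is small enough to express as a single lookup function. The sketch below is a direct transcription of the matrix; the type and function names are made up for illustration.

```typescript
type Bucket = "b1" | "b2" | "b3" | "b4";
type Band = "EXACT" | "STRONG" | "AMBIGUOUS" | "DISTINCT";
type Route = "A" | "B" | "D" | "F" | "G" | "R";

function chooseRoute(bucket: Bucket, band: Band): Route {
  // b4 and b3 ignore the node match band entirely.
  if (bucket === "b4") return "G"; // duplicate: no-op
  if (bucket === "b3") return "F"; // same source, different extraction: re-extract locally
  // b1 and b2 depend on how the canonical node matched.
  if (band === "AMBIGUOUS") return "R"; // human review
  if (band === "DISTINCT") return "A";  // create source + page
  // EXACT or STRONG: same node, so never create a duplicate page.
  return bucket === "b1" ? "B" : "D";
}
```

For example, `chooseRoute("b2", "STRONG")` yields `"D"`: the source and its provenance are added while the existing page text stays untouched.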
Step 6 — Clash checks (gate writes)
Alias canonical conflict: same alias string → different canonicals → block auto-merge → R
canonical_page_id mismatch: never import foreign page_id pointers; re-pin to local
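Both clash checks can be illustrated with in-memory rows. The column names follow the `aliases` artifact described earlier (alias, canonical_name, canonical_page_id); the helper names and the lookup map are assumptions made for this sketch.

```typescript
interface AliasRow {
  alias: string;
  canonical_name: string;
  canonical_page_id: string | null;
}

// Incoming rows whose alias already points at a different canonical locally.
// These should block auto-merge and go to Route R.
function findAliasConflicts(local: AliasRow[], incoming: AliasRow[]): AliasRow[] {
  const localByAlias = new Map<string, string>();
  local.forEach((row) => localByAlias.set(row.alias, row.canonical_name));
  return incoming.filter((row) => {
    const localCanonical = localByAlias.get(row.alias);
    return localCanonical !== undefined && localCanonical !== row.canonical_name;
  });
}

// Drop the foreign page pointer and re-pin to the local page for that canonical, if any.
function repinAlias(row: AliasRow, localPageIdByCanonical: Map<string, string>): AliasRow {
  return { ...row, canonical_page_id: localPageIdByCanonical.get(row.canonical_name) ?? null };
}
```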
Relation to #54
#54's nine edge cases and prerequisite UNIQUE migration are still relevant. This session does not close #54's open decision list; it adds a canonical-first, route-based analysis path that may simplify some edge cases (e.g. page merge, alias union) by routing through buckets + recompile instead of vault-style text merge.
Discussed but not built:
POST /api/import?mode=merge
POST /api/import/analyze dry-run (zero writes, returns bucket/route preview)
Compiler fingerprint in manifest.json for optional page cache restore (v2 fast path)
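As a rough idea of what a dry-run preview could return, the sketch below aggregates per-source classifications into route counts plus a list of sources waiting on human review. The response shape is purely hypothetical. No endpoint exists yet, and the function performs no writes.

```typescript
type Bucket = "b1" | "b2" | "b3" | "b4";
type Route = "A" | "B" | "D" | "F" | "G" | "R";

// Hypothetical per-source result of the analysis pass.
interface SourcePreview {
  sourceKey: string;
  bucket: Bucket;
  route: Route;
}

// Hypothetical preview payload for a dry-run analyze call.
interface AnalyzePreview {
  total: number;
  byRoute: Record<Route, number>;
  needsReview: string[]; // source keys routed to R
}

function buildPreview(items: SourcePreview[]): AnalyzePreview {
  const byRoute: Record<Route, number> = { A: 0, B: 0, D: 0, F: 0, G: 0, R: 0 };
  const needsReview: string[] = [];
  items.forEach((item) => {
    byRoute[item.route] += 1;
    if (item.route === "R") needsReview.push(item.sourceKey);
  });
  return { total: items.length, byRoute, needsReview };
}
```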
Workaround today: re-ingest through onboarding (loses provenance, drafts, chat).
Possible implementation phases (for discussion)
Phase 0: UNIQUE on provenance / aliases (#54 prerequisite; schema now v25)
Phase 1: Dry-run analyze endpoint — classify sources into b1–b4, emit route preview, zero writes
Phase 2: Apply routes G/D/A with clash checks; human queue for R
Related
Canonical vs derived (from code audit)
Kompl already has canonicalization machinery:
aliases: canonical_name (+ optional canonical_page_id)
entity_mentions / relationship_mentions: (canonical_name, source_id); source_id remapped
pages (entity/concept titles): existing_page_title
page_id: slug(title) + plan_uuid_suffix
provenance: source_id → page_id bridge
pages markdown
Lineage you can follow today:
source_id → entity_mentions / relationship_mentions (what was extracted)
source_id → provenance → page_id (what pages it contributed to)
Extracted types: entities, concepts, relationships (not just "people").
Data flow (intake → compiled pages)
flowchart TD
  subgraph intake [Intake]
    raw[Raw sources URLs files etc]
    sources[(sources)]
    rawFiles[[raw/source_id.md.gz]]
    raw --> sources
    raw --> rawFiles
  end
  subgraph extract [Extract]
    extractApi[/compile/extract/]
    extractions[(extractions)]
    mentions[(entity_mentions relationship_mentions)]
    sources --> extractApi
    rawFiles --> extractApi
    extractApi --> extractions
    extractApi --> mentions
  end
  subgraph resolve [Resolve]
    resolveApi[/compile/resolve/]
    aliases[(aliases)]
    pageTitles[(entity concept page titles)]
    extractions --> resolveApi
    aliases --> resolveApi
    pageTitles --> resolveApi
    resolveApi --> aliases
    resolveApi --> mentions
  end
  subgraph commit [Commit]
    commitApi[/compile/commit/]
    pages[(pages)]
    pageFiles[[pages/page_id.md.gz]]
    provenance[(provenance)]
    pagePlans --> commitApi
    commitApi --> pages
    commitApi --> pageFiles
    commitApi --> provenance
    commitApi --> aliases
  end
Step 0 — Upload validation (discussed)
Step 1 — Four buckets (per imported source)
Gate 1 — source identity (discussed keys: normalized URL, else content_hash):
Gate 2 — extraction trace (canonical entities/concepts/relationships from that source vs local graph):
Step 6 — Clash checks (gate writes)
canonical_page_id mismatch: never import foreign page_id pointers; re-pin to local
(source_id, page_id, content_hash, contribution_type): see Wiki-to-wiki merge import (mode=merge) #54 prerequisite
Step 7 — Route R: human review UX (discussed)
One card per unclear pair:
Full route tree
flowchart TD
  start[Each imported source] --> sm{Source matches local?}
  sm -->|No| srcNew[NEW source]
  sm -->|Yes| srcEx[EXISTING source]
  srcNew --> tmA{Trace matches local graph?}
  srcEx --> tmB{Trace matches local graph?}
  tmA -->|No| b1[b1: new + new]
  tmA -->|Yes| b2[b2: new + known]
  tmB -->|No| b3[b3: existing + new]
  tmB -->|Yes| b4[b4: duplicate]
  b4 --> G[Route G: no-op]
  b3 --> F[Route F: re-extract locally]
  b1 --> hold[Hold + project pages]
  b2 --> proj[Project page impact]
  hold --> nm{Node match band}
  proj --> nm
  nm -->|DISTINCT| A[Route A: create new]
  nm -->|EXACT/STRONG| B_or_D{b1 or b2?}
  nm -->|AMBIGUOUS| R[Route R: human]
  B_or_D -->|b1| B[Route B: merge + recompile]
  B_or_D -->|b2| D[Route D: provenance only]
  A --> clash{Clash checks pass?}
  B --> clash
  D --> clash
  F --> clash
  clash -->|yes| write[Write]
  clash -->|no| R
Example cases
b1 → A (pure new)
Local: nothing about "Vilnius street art". Import: new URL + new entities + new page.
→ Add source and page as-is.
b2 → D (new evidence)
Local: GPT-4 entity page exists. Import: new URL about GPT-4, same canonical.
→ Add source + provenance to existing page; don't overwrite page text.
b3 → F (drift)
Local: source URL with hash v1. Import: same URL, hash v2, different entities.
→ Re-extract locally; don't import page markdown.
b4 → G (duplicate)
Re-import same backup.
→ No-op.
AMBIGUOUS → R ("React" problem)
Local: "React" = JS framework. Import: "React" = chemistry. Embedding ~0.75.
→ Human decides merge vs separate.
b1 → B (same topic, new substance)
Local: Chain-of-Thought page. Import: new paper, same canonical, different page text.
→ Add source, map to existing page_id, recompile; don't paste imported markdown.
What each bucket gives / risks (from session)
Relation to #54
#54's nine edge cases and prerequisite UNIQUE migration are still relevant. This session does not close #54's open decision list — it adds a canonical-first, route-based analysis path that may simplify some edge cases (e.g. page merge, alias union) by routing through buckets + recompile instead of vault-style text merge.
Discussed but not built:
POST /api/import?mode=merge
POST /api/import/analyze dry-run (zero writes, returns bucket/route preview)
Compiler fingerprint in manifest.json for optional page cache restore (v2 fast path)
Workaround today: re-ingest through onboarding (loses provenance, drafts, chat).
Possible implementation phases (for discussion)
Phase 0: UNIQUE on provenance / aliases (#54 prerequisite; schema now v25)
Phase 1: Dry-run analyze endpoint: classify sources into b1–b4, emit route preview, zero writes
Phase 2: Apply routes G/D/A with clash checks; human queue for R
Phase 3: Routes B/F with recompile trigger
Phase 4 (optional): fingerprint-matched page/vector restore
Files likely involved
app/src/app/api/import/route.ts
app/src/app/api/export/route.ts (fingerprint, later)
app/src/app/settings/page.tsx
scripts/migrate.py
app/src/lib/db.ts (normalize URL, match helpers)