Design notes: merge-import buckets, routes, and canonical-first analysis

## Related

Design-session follow-up to **#54** (Wiki-to-wiki merge import). That issue lists edge cases and open design questions. This issue captures a **Jun 2026 brainstorming session** — thinking and exploration, **not approved decisions**.

---

## Session arc (what we explored)

1. Audited current import: empty-wiki gate in `app/src/app/api/import/route.ts`, `.kompl.zip` format, Settings UI stub.
2. **First direction (rejected):** Obsidian-style vault merge — match pages by title, remap IDs, union provenance.
3. **Second direction:** "compiler not vault" — merge sources only, recompile pages.
4. **Pushback:** start from canonical entities/aliases, not merge strategy.
5. Code audit: what Kompl treats as canonical vs instance-local.
6. User-led brainstorm: 4 buckets, forward/backtrack on pages, routes A/B/D/F/G/R.
7. Human review queue for ambiguous matches.

---

## Canonical vs derived (from code audit)

Kompl already has canonicalization machinery:

| Artifact | Role | Multi-import risk |
|----------|------|-------------------|
| `aliases` | Cross-session memory: alias → `canonical_name` (+ optional `canonical_page_id`) | Importing foreign alias rows can cause silent mis-routing |
| `entity_mentions` / `relationship_mentions` | Per-source extraction trace (`canonical_name`, `source_id`) | Can merge if `source_id` remapped |
| `pages` (entity/concept titles) | Resolver anchors (`existing_page_title`) | Title match ≠ same entity ("React" problem) |
| `page_id` | Instance-local (`slug(title) + plan_uuid_suffix`) | Importing as-is → duplicate pages |
| `provenance` | `source_id` → `page_id` bridge | No UNIQUE key today → duplicate rows on re-import |
| `pages` markdown | **Derived** (compile output) | Merging text risks semantic drift across models/schema |

**Lineage you can follow today:**
- `source_id` → `entity_mentions` / `relationship_mentions` (what was extracted)
- `source_id` → `provenance` → `page_id` (what pages it contributed to)
- Not followable today: sentence-level attribution ("this paragraph came from source X") is not stored

**Extracted types:** entities, concepts, relationships (not just "people").

---

## Data flow (intake → compiled pages)

```mermaid
flowchart TD
  subgraph intake [Intake]
    raw[Raw sources URLs files etc]
    sources[(sources)]
    rawFiles[[raw/source_id.md.gz]]
    raw --> sources
    raw --> rawFiles
  end

  subgraph extract [Extract]
    extractApi[/compile/extract/]
    extractions[(extractions)]
    mentions[(entity_mentions relationship_mentions)]
    sources --> extractApi
    rawFiles --> extractApi
    extractApi --> extractions
    extractApi --> mentions
  end

  subgraph resolve [Resolve]
    resolveApi[/compile/resolve/]
    aliases[(aliases)]
    pageTitles[(entity concept page titles)]
    extractions --> resolveApi
    aliases --> resolveApi
    pageTitles --> resolveApi
    resolveApi --> aliases
    resolveApi --> mentions
  end

  subgraph commit [Commit]
    commitApi[/compile/commit/]
    pages[(pages)]
    pageFiles[[pages/page_id.md.gz]]
    provenance[(provenance)]
    pagePlans --> commitApi
    commitApi --> pages
    commitApi --> pageFiles
    commitApi --> provenance
    commitApi --> aliases
  end
```

---

## Step 0 — Upload validation (discussed)

```
IF zip invalid OR no manifest → reject (422)
IF mode=restore AND wiki has user pages → reject wiki_not_empty (409)  [today]
IF mode=merge AND incoming schema_version > local → reject (422)  [discussed]
ELSE → continue to per-source analysis
```
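
The checks above could be sketched as a pure function that runs before any analysis. This is only an illustration of the ordering: the input shape and the error codes `invalid_archive` and `schema_too_new` are assumptions, while `wiki_not_empty` and the 409/422 statuses come from the rules above.

```typescript
// Hypothetical input shape; field names are illustrative, not Kompl's actual types.
type ImportMode = "restore" | "merge";

interface UploadCheckInput {
  zipValid: boolean;
  hasManifest: boolean;
  mode: ImportMode;
  wikiHasUserPages: boolean;
  incomingSchemaVersion: number;
  localSchemaVersion: number;
}

type UploadCheckResult =
  | { ok: true }
  | { ok: false; status: 409 | 422; error: string };

function validateUpload(input: UploadCheckInput): UploadCheckResult {
  // Rule 1: unreadable archive or missing manifest.
  if (!input.zipValid || !input.hasManifest) {
    return { ok: false, status: 422, error: "invalid_archive" }; // assumed error code
  }
  // Rule 2: restore keeps today's empty-wiki gate.
  if (input.mode === "restore" && input.wikiHasUserPages) {
    return { ok: false, status: 409, error: "wiki_not_empty" };
  }
  // Rule 3 (discussed): merge refuses archives from a newer schema.
  if (input.mode === "merge" && input.incomingSchemaVersion > input.localSchemaVersion) {
    return { ok: false, status: 422, error: "schema_too_new" }; // assumed error code
  }
  // Otherwise continue to per-source analysis.
  return { ok: true };
}
```

Keeping the gate free of I/O would let the same function back both a real import and a dry run.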

---

## Step 1 — Four buckets (per imported source)

**Gate 1 — source identity** (discussed keys: normalized URL, else `content_hash`):

```
IF no local source match → NEW source
ELSE → EXISTING source
```
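
A minimal sketch of Gate 1, assuming the identity key is "normalized URL, else `content_hash`" as discussed. The normalization rules here (drop the fragment, strip trailing slashes, rely on the URL parser to lowercase scheme and host) are assumptions; the session did not settle them. `SourceRef`, `sourceKey`, and `matchSource` are hypothetical names.

```typescript
// Hypothetical minimal view of a source row.
interface SourceRef {
  url?: string;
  contentHash: string;
}

// Assumed normalization: parse, drop the fragment, strip trailing slashes.
function normalizeUrl(raw: string): string | null {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return null; // not a parseable absolute URL
  }
  const path = url.pathname.replace(/\/+$/, "");
  return `${url.protocol}//${url.host}${path}${url.search}`;
}

// Identity key: normalized URL when available, else the content hash.
function sourceKey(source: SourceRef): string {
  const normalized = source.url ? normalizeUrl(source.url) : null;
  return normalized !== null ? `url:${normalized}` : `hash:${source.contentHash}`;
}

function matchSource(incoming: SourceRef, localKeys: Set<string>): "NEW" | "EXISTING" {
  return localKeys.has(sourceKey(incoming)) ? "EXISTING" : "NEW";
}
```

Because the URL wins over the hash, the same URL with changed content still counts as an EXISTING source, which is what lets Gate 2 detect b3.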

**Gate 2 — extraction trace** (canonical entities/concepts/relationships from that source vs local graph):

```
IF trace same as local → KNOWN trace
ELSE → NEW trace
```

| Bucket | Code | Plain English |
|--------|------|---------------|
| **b1** | new source + new trace | Pure new — never saw this source or this knowledge |
| **b2** | new source + known trace | New evidence — new article about topics you already have |
| **b3** | existing source + new trace | Semantic drift — same source, different extraction |
| **b4** | existing source + known trace | Duplicate — skip |

Refinement discussed: a source can be new but semantically old (b2), or existing but with new extraction (b3).
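
The two gates combine into the four buckets as a small truth table. The sketch below assumes a trace counts as "known" when every canonical name it carries already exists locally; that definition is an assumption, since the session did not pin down what "same as local" means.

```typescript
type Bucket = "b1" | "b2" | "b3" | "b4";

// Assumed definition of Gate 2: all incoming canonical names are already known locally.
// Note that an empty trace counts as known under this definition.
function isKnownTrace(incomingCanonicals: string[], localCanonicals: Set<string>): boolean {
  return incomingCanonicals.every((name) => localCanonicals.has(name));
}

// Gate 1 (source identity) x Gate 2 (extraction trace) -> bucket.
function classifyBucket(sourceExists: boolean, traceKnown: boolean): Bucket {
  if (!sourceExists) return traceKnown ? "b2" : "b1";
  return traceKnown ? "b4" : "b3";
}
```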

---

## Step 2 — Bucket actions (before page logic)

```
b4 → Route G (no-op)
b3 → Route F (re-extract / recompile locally; don't trust imported pages)
b2 → import source + raw; Route D territory (provenance/mentions, no page rewrite)
b1 → HOLD pure-new units; then project page impact (Step 3)
```

---

## Step 3 — Page impact (forward then backtrack)

**Forward:** from held sources, use imported `provenance` + mentions → list candidate imported pages.

**Backtrack** each candidate against local wiki:

```
IF no local page for that canonical node → net-new page candidate
ELSE local page exists → update candidate (merge evidence; do NOT paste imported markdown)
```

**Source overlap** (for update candidates):

```
IF low source overlap + same canonical → likely additive evidence
IF high overlap but page text diverges → compiler/config drift → Route F, not import-as-is
```
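
One way to make "low" and "high" source overlap concrete is Jaccard similarity over the source keys behind the imported page and the local page. Both the metric and the 0.2 / 0.8 cutoffs below are placeholders chosen for illustration; the session named neither.

```typescript
// Jaccard overlap between two sets of source keys: |A ∩ B| / |A ∪ B|.
function sourceOverlap(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 0;
  let shared = 0;
  a.forEach((key) => {
    if (b.has(key)) shared += 1;
  });
  return shared / (a.size + b.size - shared);
}

type Verdict = "additive_evidence" | "drift_route_F" | "needs_closer_look";

// Placeholder thresholds; the session only said "low" and "high".
const LOW_OVERLAP = 0.2;
const HIGH_OVERLAP = 0.8;

// Applies to update candidates only, i.e. the canonical node already matched.
function assessUpdateCandidate(
  importedSources: Set<string>,
  localSources: Set<string>,
  pageTextDiverges: boolean,
): Verdict {
  const overlap = sourceOverlap(importedSources, localSources);
  if (overlap <= LOW_OVERLAP) return "additive_evidence";
  if (overlap >= HIGH_OVERLAP && pageTextDiverges) return "drift_route_F";
  return "needs_closer_look"; // the rules above leave the middle range open
}
```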

---

## Step 4 — Node match bands (entity/concept canonicals)

Discussed thresholds using lexical match + `all-MiniLM-L6-v2` embeddings (same family as resolver Layer 2):

| Band | Rule | Treatment |
|------|------|-----------|
| **EXACT** | Normalized name equal OR linked via `aliases` | Same node |
| **STRONG** | Embedding cosine ≥ 0.90 | Same node |
| **AMBIGUOUS** | 0.70 ≤ cosine < 0.90 | Route R (human) |
| **DISTINCT** | cosine < 0.70 | Different node |
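
The band table can be read top to bottom as a first-match classifier. In the sketch below the embeddings are plain number arrays supplied by the caller (producing them with `all-MiniLM-L6-v2` is out of scope), and `normalizeName` is an assumed normalization (trim, lowercase, collapse whitespace).

```typescript
type Band = "EXACT" | "STRONG" | "AMBIGUOUS" | "DISTINCT";

// Cosine similarity of two equal-length vectors; returns 0 for a zero vector.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Assumed normalization; the real resolver may differ.
function normalizeName(name: string): string {
  return name.trim().toLowerCase().replace(/\s+/g, " ");
}

// First matching row of the band table wins.
function classifyBand(nameA: string, nameB: string, aliasLinked: boolean, similarity: number): Band {
  if (aliasLinked || normalizeName(nameA) === normalizeName(nameB)) return "EXACT";
  if (similarity >= 0.9) return "STRONG";
  if (similarity >= 0.7) return "AMBIGUOUS";
  return "DISTINCT";
}
```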

---

## Step 5 — Routes A / B / D / F / G / R

### Decision matrix (bucket × node match → route)

| Bucket | Node match | Route |
|--------|------------|-------|
| b1 new + new | DISTINCT | **A** — create source + page |
| b1 new + new | EXACT/STRONG | **B** — import source, attach to existing page, recompile |
| b1 new + new | AMBIGUOUS | **R** — human review |
| b2 new + known | EXACT/STRONG | **D** — source + provenance only, no page rewrite |
| b2 new + known | AMBIGUOUS | **R** |
| b2 new + known | DISTINCT | **A** (rare) |
| b3 existing + new | any | **F** — re-extract locally |
| b4 existing + known | any | **G** — no-op |
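
As a sanity check, the matrix reduces to a short lookup. `pickRoute` is a hypothetical name; the mapping itself is copied from the rows above.

```typescript
type Bucket = "b1" | "b2" | "b3" | "b4";
type Band = "EXACT" | "STRONG" | "AMBIGUOUS" | "DISTINCT";
type Route = "A" | "B" | "D" | "F" | "G" | "R";

function pickRoute(bucket: Bucket, band: Band): Route {
  // b4 and b3 ignore the node match entirely.
  if (bucket === "b4") return "G";
  if (bucket === "b3") return "F";
  // b1 and b2 share the AMBIGUOUS and DISTINCT outcomes.
  if (band === "AMBIGUOUS") return "R";
  if (band === "DISTINCT") return "A";
  // EXACT/STRONG: b1 recompiles the existing page, b2 only adds provenance.
  return bucket === "b1" ? "B" : "D";
}
```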

### Routes in plain language

- **A — Brand new:** genuinely new knowledge → add source + page normally.
- **B — Same topic, new substance:** keep existing page, refresh via recompile with new evidence (no duplicate page, no blind overwrite).
- **D — Same topic, extra proof:** add source + provenance links; page text mostly unchanged.
- **F — Same source, different extraction:** don't trust imported page; re-run local pipeline.
- **G — Duplicate:** skip.
- **R — Unclear:** pause for human; default skip if no action.

**Big rule discussed:** for matching topics, no duplicate pages and no blind overwrite of existing page text.

---

## Step 6 — Clash checks (gate writes)

- **Alias canonical conflict:** same alias string → different canonicals → block auto-merge → R
- **`canonical_page_id` mismatch:** never import foreign `page_id` pointers; re-pin to local
- **Provenance dup:** needs UNIQUE on `(source_id, page_id, content_hash, contribution_type)` — see #54 prerequisite
- **Page content divergence** (same canonical, different markdown): never overwrite → B or R
- **Relationship direction/type mismatch:** normalize before dedup
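
The first two checks could look roughly like this. `AliasRow` is a hypothetical, simplified view of the `aliases` table, and the case-insensitive alias comparison is an assumption.

```typescript
// Hypothetical, simplified view of an aliases row.
interface AliasRow {
  alias: string;
  canonicalName: string;
  canonicalPageId?: string;
}

// Alias canonical conflict: same alias string, different canonical. These go to Route R.
function findAliasConflicts(local: AliasRow[], incoming: AliasRow[]): AliasRow[] {
  const localByAlias = new Map<string, string>();
  for (const row of local) {
    localByAlias.set(row.alias.toLowerCase(), row.canonicalName);
  }
  return incoming.filter((row) => {
    const existing = localByAlias.get(row.alias.toLowerCase());
    return existing !== undefined && existing !== row.canonicalName;
  });
}

// canonical_page_id mismatch: drop foreign page pointers so they can be re-pinned locally.
function stripForeignPagePointers(rows: AliasRow[]): AliasRow[] {
  return rows.map(({ alias, canonicalName }) => ({ alias, canonicalName }));
}
```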

---

## Step 7 — Route R: human review UX (discussed)

One card per unclear pair:

- Imported item: name, summary, top sources
- Existing candidate: same
- Why unclear: one line
- Buttons: **Merge into existing** | **Keep separate** | **Skip** | **Merge + add alias**
- Default if ignored: **Skip**
- Log decision to activity
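
A possible data shape for one review card, with the "default Skip" rule and the activity log entry made explicit. All type and field names here are hypothetical.

```typescript
type ReviewAction = "merge_into_existing" | "keep_separate" | "skip" | "merge_and_add_alias";

interface ReviewSide {
  name: string;
  summary: string;
  topSources: string[];
}

interface ReviewCard {
  imported: ReviewSide;
  existing: ReviewSide;
  whyUnclear: string; // one line
  decision?: ReviewAction; // undefined while the card is ignored
}

interface ActivityEntry {
  kind: "merge_review";
  pair: [string, string];
  action: ReviewAction;
  defaulted: boolean; // true when nobody acted and Skip was applied
}

// Resolve a card into the entry that would be logged to activity.
function resolveCard(card: ReviewCard): ActivityEntry {
  const action: ReviewAction = card.decision ?? "skip";
  return {
    kind: "merge_review",
    pair: [card.imported.name, card.existing.name],
    action,
    defaulted: card.decision === undefined,
  };
}
```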

---

## Full route tree

```mermaid
flowchart TD
  start[Each imported source] --> sm{Source matches local?}
  sm -->|No| srcNew[NEW source]
  sm -->|Yes| srcEx[EXISTING source]

  srcNew --> tmA{Trace matches local graph?}
  srcEx --> tmB{Trace matches local graph?}

  tmA -->|No| b1[b1: new + new]
  tmA -->|Yes| b2[b2: new + known]
  tmB -->|No| b3[b3: existing + new]
  tmB -->|Yes| b4[b4: duplicate]

  b4 --> G[Route G: no-op]
  b3 --> F[Route F: re-extract locally]

  b1 --> hold[Hold + project pages]
  b2 --> proj[Project page impact]

  hold --> nm{Node match band}
  proj --> nm

  nm -->|DISTINCT| A[Route A: create new]
  nm -->|EXACT/STRONG| B_or_D{b1 or b2?}
  nm -->|AMBIGUOUS| R[Route R: human]

  B_or_D -->|b1| B[Route B: merge + recompile]
  B_or_D -->|b2| D[Route D: provenance only]

  A --> clash{Clash checks pass?}
  B --> clash
  D --> clash
  F --> clash

  clash -->|yes| write[Write]
  clash -->|no| R
```

---

## Example cases

### b1 → A (pure new)
Local: nothing about "Vilnius street art". Import: new URL + new entities + new page.
→ Add source and page as-is.

### b2 → D (new evidence)
Local: GPT-4 entity page exists. Import: new URL about GPT-4, same canonical.
→ Add source + provenance to existing page; don't overwrite page text.

### b3 → F (drift)
Local: source URL with hash v1. Import: same URL, hash v2, different entities.
→ Re-extract locally; don't import page markdown.

### b4 → G (duplicate)
Re-import same backup.
→ No-op.

### AMBIGUOUS → R ("React" problem)
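Local: "React" = JS framework. Import: "React" = chemistry. The names are identical, so a title match alone would treat them as one entity; embedding similarity ~0.75 falls in the AMBIGUOUS band.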
→ Human decides merge vs separate.

### b1 → B (same topic, new substance)
Local: Chain-of-Thought page. Import: new paper, same canonical, different page text.
→ Add source, map to existing `page_id`, recompile — don't paste imported markdown.

---

## What each bucket gives / risks (from session)

| Bucket | Value | Risk if handled wrong |
|--------|-------|----------------------|
| b1 | Net-new knowledge + pages | Duplicate pages if canonical match weak |
| b2 | More evidence for known topics | Noise if auto-creating pages |
| b3 | Detected semantic change | Silent drift if not recompiled |
| b4 | Nothing | Storage bloat only |

---

## Relation to #54

#54's nine edge cases and prerequisite UNIQUE migration are still relevant. This session **does not close** #54's open decision list — it adds a **canonical-first, route-based** analysis path that may simplify some edge cases (e.g. page merge, alias union) by routing through buckets + recompile instead of vault-style text merge.

**Discussed but not built:**
- `POST /api/import?mode=merge`
- `POST /api/import/analyze` dry-run (zero writes, returns bucket/route preview)
- Compiler fingerprint in `manifest.json` for optional page cache restore (v2 fast path)

**Workaround today:** re-ingest through onboarding (loses provenance, drafts, chat).

---

## Possible implementation phases (for discussion)

**Phase 0:** UNIQUE on `provenance` / `aliases` (#54 prerequisite; schema now v25)

**Phase 1:** Dry-run analyze endpoint — classify sources into b1–b4, emit route preview, zero writes

**Phase 2:** Apply routes G/D/A with clash checks; human queue for R

**Phase 3:** Routes B/F with recompile trigger

**Phase 4 (optional):** fingerprint-matched page/vector restore
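
For Phase 1, the dry-run preview might be as small as a per-route count plus the list of sources waiting on review. The shape below is a guess for discussion, not a proposed API contract; `PreviewItem` and `AnalyzePreview` are hypothetical names.

```typescript
type Bucket = "b1" | "b2" | "b3" | "b4";
type Route = "A" | "B" | "D" | "F" | "G" | "R";

interface PreviewItem {
  sourceKey: string;
  bucket: Bucket;
  route: Route;
}

interface AnalyzePreview {
  total: number;
  byRoute: Record<Route, number>;
  needsReview: string[]; // source keys routed to R
}

// Pure aggregation: no writes, so it is safe to call from a dry-run endpoint.
function summarizePreview(items: PreviewItem[]): AnalyzePreview {
  const byRoute: Record<Route, number> = { A: 0, B: 0, D: 0, F: 0, G: 0, R: 0 };
  const needsReview: string[] = [];
  for (const item of items) {
    byRoute[item.route] += 1;
    if (item.route === "R") needsReview.push(item.sourceKey);
  }
  return { total: items.length, byRoute, needsReview };
}
```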

---

## Files likely involved

- `app/src/app/api/import/route.ts`
- `app/src/app/api/export/route.ts` (fingerprint, later)
- `app/src/app/settings/page.tsx`
- `scripts/migrate.py`
- `app/src/lib/db.ts` (normalize URL, match helpers)
- Export/import tests

