RFC: Knowledge Graph Builder for Logseq #15

yoaquim · 2026-04-29T20:19:33Z

yoaquim
Apr 29, 2026
Maintainer

RFC: Knowledge Graph Builder for Logseq

Problem

Seam generates rich structured data from Pocket AI recordings — transcripts, summaries, action items, decisions, quotes, topics, speaker attribution, mind maps. But this data stays siloed in .seam/ as per-recording JSON files. There's no way to see patterns across recordings, track evolving topics over time, or build up a personal knowledge base from the raw material.

Users want a "second brain" — a place where a doctor appointment that mentions two health conditions automatically links to those conditions' histories, where work decisions accumulate into project timelines, and where searching "what did my doctor say about X" actually works.

Proposal

A tiered knowledge graph builder that outputs Logseq-compatible markdown, turning Seam's per-recording analyses into an interlinked personal knowledge base.

Why Logseq

Markdown files in a folder — no vendor lock-in, git-friendly, S3-syncable
Bidirectional [[links]] create a knowledge graph automatically
Block-level references link specific ideas across recordings
Journal pages map naturally to recording dates
Outliner format handles structured + freeform content
Open-source, aligns with Seam's philosophy

Design Principles

Links are the structure, not directories. No rigid folder hierarchy (health/doctors/). A recording about a doctor visit that mentions work stress links to both [[lower back pain]] and [[work/burnout]]. Logseq's graph handles the rest.
Append-only by default. Daily runs add entries. They don't rewrite history.
Synthesis is expensive and separate. Summarizing a topic's history across months of entries is a distinct step, run on-demand or periodically — not on every sync.
One level deep. Journal links to topic pages. Topic pages don't trigger updates to other topic pages. This keeps the system fast and predictable.

Architecture

Logseq Graph Structure

your-graph/
├── journals/
│   └── 2026_04_28.md          # Tier 1: daily journal entry
├── pages/
│   ├── lower back pain.md     # Tier 2: topic page
│   ├── Dr. Martinez.md        # Tier 2: person page
│   ├── Seam.md                # Tier 2: project page
│   └── exercise routine.md    # Tier 2: topic page

Tier 1: Daily Journal Builder

Script: scripts/build-journal.py
Runs: In the pipeline after analysis, or on-demand
Input: All recordings + analyses for a given day, people.json
Output: A single journal page in journals/YYYY_MM_DD.md

Takes each recording from the day and generates an entry with:

Recording title and type
source:: property linking back to the recording
Key facts, decisions, action items as bullet points
[[links]] to people, topics, conditions, projects, etc.
TODO items for Logseq's built-in task tracking

Example output:

- ## Recordings
  - ### Doctor appointment — Dr. Martinez
    source:: [[recordings/2026-04-28_doctor-visit]]
    - Discussed [[lower back pain]] — MRI results normal, recommended
      increasing [[exercise routine]] to 3x/week
    - Renewed prescription for [[omeprazole]] — continue for 3 months
    - TODO Schedule follow-up in 6 weeks
    - Mentioned work stress as possible contributor → see [[work/burnout]]
  - ### Team standup
    source:: [[recordings/2026-04-28_standup]]
    - [[Alice]] presented [[API redesign]] proposal — team approved
    - DONE [[Bob]] deployed staging fix from last week
    - Discussed [[hiring pipeline]] — 3 candidates in final round

Link selection guidance for Claude:

Always link	Sometimes link	Never link
People (`[[Dr. Martinez]]`)	Specific events (`[[2026 kitchen reno]]`)	Generic nouns ("the meeting")
Health conditions (`[[lower back pain]]`)	Tools/products (`[[Postgres migration]]`)	Throwaway mentions
Ongoing projects (`[[Seam]]`)	Places (`[[Mount Sinai]]`)	Small talk
Decisions with consequences	Recurring themes
Commitments/promises

Tier 2: Topic Page Updater

Script: scripts/update-topics.py
Runs: Immediately after Tier 1
Input: The journal entry just created + existing topic pages in pages/
Output: Created or updated topic pages

For each [[link]] in the journal entry:

Page exists: Append a new dated entry at the end
Page doesn't exist: Create it with an initial entry

Each entry is timestamped and sourced:

- **2026-04-28** — MRI results normal per [[Dr. Martinez]]. Increase
  [[exercise routine]] to 3x/week. Work stress flagged as contributor.
  via [[2026-04-28_doctor-visit]]

Hard rule: Only traverse links from the journal. Do not follow links from topic pages to other topic pages. This keeps the update bounded and predictable.
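A minimal sketch of the bounded link traversal, for illustration only. The function name `extract_links`, the regex, and the choice to skip `source::` lines are assumptions, not part of the RFC. The key property is that it reads only the journal text and never opens a topic page.

```python
import re

# Hypothetical pattern: matches [[page name]] and captures the name.
# Assumes page names never contain square brackets.
LINK_RE = re.compile(r"\[\[([^\[\]]+)\]\]")


def extract_links(journal_text: str) -> list[str]:
    """Return unique link targets from a journal page, in first-seen order."""
    seen: dict[str, None] = {}
    for line in journal_text.splitlines():
        # Assumption: source:: properties point at recordings, not topics.
        if line.strip().startswith("source::"):
            continue
        for match in LINK_RE.finditer(line):
            seen.setdefault(match.group(1).strip(), None)
    return list(seen)


journal = """\
- ### Doctor appointment
  source:: [[recordings/2026-04-28_doctor-visit]]
  - Discussed [[lower back pain]], recommended [[exercise routine]]
  - Stress as contributor, see [[work/burnout]] and [[lower back pain]]
"""
print(extract_links(journal))
```

Because the input is a single journal page, the amount of work per sync is bounded by the number of distinct links written that day.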

Person pages (from people.json) get structured differently:

title:: Dr. Martinez
type:: person
role:: Primary care physician

- **2026-04-28** — Regular checkup. MRI for [[lower back pain]] came back
  clean. Renewed [[omeprazole]].
  via [[2026-04-28_doctor-visit]]
- **2026-01-10** — Initial visit for back pain. Ordered MRI.
  via [[2026-01-10_doctor-visit]]

Tier 3: Periodic Synthesis

Script: scripts/synthesize-topics.py
Runs: On-demand or weekly (not every sync)
Input: Topic pages with accumulated entries (scoped to last ~3 months by default)
Output: Updated "Current Understanding" summary block at the top of topic pages

This is the expensive step — it reads all recent entries on a topic page and asks Claude to synthesize them into a coherent summary. The summary block sits at the top of the page; the individual entries remain below as the audit trail.

Example of a synthesized topic page:

- ## Current Understanding
  *(synthesized 2026-04-28 — last 3 months)*
  - Ongoing since ~Jan 2026, initially from poor desk posture
  - MRI clean as of Apr 2026. Managing with exercise, no medication needed.
  - Correlates with high-stress work periods.
- ---
- **2026-04-28** — MRI results normal per [[Dr. Martinez]]. Increase
  [[exercise routine]] to 3x/week. Work stress flagged as contributor.
  via [[2026-04-28_doctor-visit]]
- **2026-03-15** — Mentioned back stiffness during [[weekly standup]].
  Started stretching routine.
  via [[2026-03-15_standup]]
- **2026-01-10** — Initial visit to [[Dr. Martinez]] for persistent lower
  back pain. Ordered MRI. Advised ergonomic desk setup.
  via [[2026-01-10_doctor-visit]]

Synthesis window: Defaults to 3 months. Older entries remain on the page but aren't re-read during synthesis (too expensive). The summary captures the current state, not the full history.
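One way the window could be applied before calling Claude, as a rough sketch. It assumes the entry format from the examples above (`- **YYYY-MM-DD** ...` with indented continuation lines) and approximates "3 months" as 90 days. The name `entries_in_window` is hypothetical.

```python
import re
from datetime import date, timedelta

# Assumed entry format, taken from the examples: "- **YYYY-MM-DD** ..."
ENTRY_RE = re.compile(r"^- \*\*(\d{4})-(\d{2})-(\d{2})\*\*")


def entries_in_window(page_text: str, today: date, days: int = 90) -> list[str]:
    """Return dated entries (with their continuation lines) inside the window."""
    cutoff = today - timedelta(days=days)
    kept: list[str] = []
    in_window = False
    for line in page_text.splitlines():
        match = ENTRY_RE.match(line)
        if match:
            entry_date = date(*map(int, match.groups()))
            in_window = entry_date >= cutoff
            if in_window:
                kept.append(line)
        elif in_window and line.startswith(" "):
            kept[-1] += "\n" + line  # indented line belongs to the current entry
        else:
            in_window = False  # any other block ends the current entry
    return kept


page = """\
- ## Current Understanding
  - MRI clean as of Apr 2026.
- **2026-04-28** MRI results normal per [[Dr. Martinez]].
  via [[2026-04-28_doctor-visit]]
- **2026-03-15** Back stiffness mentioned during [[weekly standup]].
  via [[2026-03-15_standup]]
- **2026-01-10** Initial visit to [[Dr. Martinez]]. Ordered MRI.
  via [[2026-01-10_doctor-visit]]
"""
recent = entries_in_window(page, today=date(2026, 4, 28))
print(len(recent))
```

Older entries are simply not selected; nothing is removed from the page, so the audit trail stays intact.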

Configuration

New .env variables:

LOGSEQ_GRAPH_PATH=/path/to/your/logseq/graph

Set via the Settings page, optional (like S3). If not configured, the knowledge graph steps are skipped.
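A small sketch of the "skip if not configured" behavior, combined with the journals/YYYY_MM_DD.md naming from Tier 1. The helper name `journal_path` and the idea of passing the environment in as a mapping (to make it testable) are assumptions.

```python
import os
from datetime import date
from pathlib import Path
from typing import Mapping, Optional


def journal_path(day: date, env: Mapping[str, str] = os.environ) -> Optional[Path]:
    """Return journals/YYYY_MM_DD.md under the graph, or None if unconfigured."""
    graph = env.get("LOGSEQ_GRAPH_PATH", "").strip()
    if not graph:
        return None  # knowledge graph steps are skipped
    return Path(graph) / "journals" / f"{day:%Y_%m_%d}.md"


print(journal_path(date(2026, 4, 28), {"LOGSEQ_GRAPH_PATH": "/path/to/graph"}))
```

A caller would treat `None` as "do nothing", which keeps the feature fully optional in the same way S3 is.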

Pipeline Integration

Added as optional steps at the end of pocket-run.sh:

pull → analyze → stage people → build manifest → build journal → update topics

Synthesis (Tier 3) runs separately — either via a cron job, a dashboard button, or manual invocation.

Implementation Plan

Scripts

Script	Tier	Claude calls	Runs
`scripts/build-journal.py`	1	1 per day (all recordings batched)	Every sync
`scripts/update-topics.py`	2	1 per new/updated topic page	Every sync
`scripts/synthesize-topics.py`	3	1 per topic page being synthesized	On-demand / weekly

All three use claude -p (headless Claude Code) like the existing analysis step.

Idempotency

Journal: Overwrites the day's journal page on re-run (same day = same page)
Topic updates: Each entry includes a source recording reference. Re-running skips entries whose source is already present on the topic page.
Synthesis: Replaces the "Current Understanding" block, preserves individual entries.
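The topic-update rule could be sketched like this. It assumes the `via [[...]]` line from the entry examples is what identifies the source recording. The function name `append_if_new` is hypothetical.

```python
def append_if_new(page_text: str, entry: str, source: str) -> str:
    """Append entry unless an entry citing the same source already exists."""
    marker = f"via [[{source}]]"
    if marker in page_text:
        return page_text  # already recorded, so a re-run is a no-op
    separator = "" if (not page_text or page_text.endswith("\n")) else "\n"
    return f"{page_text}{separator}{entry.rstrip()}\n"


entry = "- **2026-04-28** MRI results normal.\n  via [[2026-04-28_doctor-visit]]"
once = append_if_new("", entry, "2026-04-28_doctor-visit")
twice = append_if_new(once, entry, "2026-04-28_doctor-visit")
print(once == twice)
```

Matching on the full `via [[...]]` string, brackets included, avoids false positives when one recording name is a prefix of another.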

Dashboard Integration

New section on the Settings page: "Knowledge Graph" with Logseq graph path
Optional: "Synthesize Topics" button (triggers Tier 3)
Optional: status indicator showing last journal build date

Open Questions

Should Tier 2 use Claude or be deterministic? Extracting [[links]] from the journal and appending entries to topic pages could be done without Claude — just parse the markdown for [[...]] patterns and append a templated entry. This would be faster and cheaper. Claude would only be needed if we want it to contextualize the entry for each topic page differently.
Namespace conventions. Should topics use Logseq namespaces ([[health/lower back pain]]) for loose categorization, or keep everything flat and let the graph organize it? Namespaces add hierarchy but require Claude to be consistent about categorization.
Conflict with manual edits. If the user manually edits a topic page in Logseq (adds their own notes), how do we avoid clobbering those edits on the next update? Proposed: only append below a ## Seam Entries marker. Everything above is user-owned.
Recording linkback format. Should journal entries link back to the Seam dashboard (http://localhost:5173/recording/...) or to the recording's directory name as a Logseq page? The latter keeps everything within the graph; the former connects to the richer UI.
Scale. How does this perform with 500+ recordings across a year? The journal build is bounded (one day at a time). Topic updates scale with the number of unique links per day. Synthesis is the bottleneck — may need to prioritize which topics to synthesize (e.g., only those with new entries since last synthesis).

cc @jedibrillo — would love your input on this, especially the open questions.

jedibrillo · 2026-05-01T03:36:35Z

jedibrillo
May 1, 2026
Collaborator

Analysis & recommended additions to the RFC

1. Graphify integration — strong candidate, but mismatch in scope

What graphify actually is: A skill for AI coding assistants (Claude Code, Cursor, etc.) that runs /graphify <folder> and produces:

graph.json — NetworkX graph with EXTRACTED/INFERRED/AMBIGUOUS edges
graph.html — interactive viz with Leiden community detection
GRAPH_REPORT.md — god nodes + surprising connections
Optional --obsidian flag to write a Logseq-adjacent vault, MCP server mode, Neo4j export

Fit with the RFC: Graphify's three-pass pipeline (deterministic AST → Whisper transcription → Claude subagents over docs/transcripts) is closest to what Tier 3 (synthesis) wants to be. It's already solved the "extract triples from a corpus, cluster, dedupe" problem.

Where graphify doesn't fit cleanly:

It's designed as a one-shot CLI on a folder — not an incremental, append-only daily journal builder. It rebuilds from cache on --update, but the Tier 1/2 flow in the RFC is fundamentally append-only with idempotency keyed on source recording.
Output is graph-shaped (nodes/edges/communities), not Logseq markdown with [[links]] and timestamped entries. The --obsidian flag exists but I'd verify it produces the journal+pages structure we want before betting on it.
Clustering is graph-topology-based (Leiden), not embedding-based. That's a feature for code, but for personal notes you may want temporal clustering ("things mentioned in the same week") which graphify doesn't do.

On using bulk/batch APIs to reduce cost: Right instinct. Graphify currently dispatches Claude subagents in parallel (one per file/chunk). The Anthropic Message Batches API gives 50% cost reduction for non-urgent work, which a daily/weekly graph rebuild absolutely is. But: it's async with up to 24h SLA. So:

Tier 1 (daily journal): Not a batch fit. User wants today's journal today. Use sync + cache.
Tier 2 (topic page updates): Could be batched. They're idempotent and don't need to be fresh.
Tier 3 (synthesis): Perfect batch fit. Run weekly, accept ~minutes-to-hours latency.

Recommendation: Don't fork graphify. Borrow three of its ideas: (a) EXTRACTED/INFERRED/AMBIGUOUS edge tagging, (b) a cache keyed by content hash to skip unchanged inputs, and (c) an MCP server over the graph for query-time use. Implement them in Seam-native scripts. Forking adds a Python dependency surface and a maintenance burden for a use case graphify wasn't designed for. Cite it as inspiration.

2. Taxonomy page — strongly recommend, addresses Open Question #2

This directly answers Open Question #2 (namespaces). A taxonomy page (pages/_taxonomy.md or .seam/taxonomy.json) would:

Define canonical namespace conventions (e.g. health/, work/, people/ are valid; flat is the default)
List blessed top-level categories so Claude doesn't invent medical/ one day and health/ the next
Be passed into every analysis prompt as context (like people.json already is)

This also lets us handle Open Question #1 (deterministic vs Claude Tier 2): if the taxonomy is well-defined, Tier 2 can be 100% deterministic — parse [[links]] from the journal, normalize against the taxonomy, append entries with templates. Claude is only needed for Tier 1 (extraction) and Tier 3 (synthesis). That's a major cost win.

3. Model selection — Sonnet for Tier 1, Haiku for Tier 3 candidates, never Opus

Tradeoffs in plain terms:

Tier	Recommended model	Why
Tier 1 — journal extraction	Sonnet 4.6	Needs to follow nuanced "always link / sometimes / never" rules from the RFC, infer speakers, distinguish decisions from small talk. Haiku will over-link or miss subtle commitments.
Tier 2 — topic page append	No model (deterministic)	If taxonomy is locked down, this is markdown manipulation.
Tier 3 — synthesis	Sonnet 4.6, with Haiku 4.5 for short pages	Synthesis needs to reason about cause/effect across months of entries. Haiku is fine for pages with <10 entries; we need Sonnet's reasoning for long medical/work histories.

Haiku failure modes to watch for:

Loses thread on long contexts (>50k tokens of accumulated entries)
Less reliable at "what's missing" reasoning — synthesis often requires noting a topic stopped being mentioned, which Haiku tends to skip
Better at structured extraction than freeform synthesis

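Concrete cost lever: prompt caching. The taxonomy + people.json + analyze prompt are stable across all recordings on a given run. With cache_control breakpoints on those, cache hits are billed at roughly 10% of the base input price (the call that first writes the cache costs slightly more than an uncached call). For a daily run with 5 recordings, this can approach a 4x reduction in input tokens billed when the stable prefix dominates each prompt; the exact figure depends on the prefix share and on whether the cache is already warm.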
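To make the caching arithmetic concrete, here is a back-of-the-envelope sketch. The multipliers (1.25x for the call that writes the cache, 0.10x for later reads) are assumptions based on Anthropic's published prompt-caching pricing for the default short-lived cache. The token counts are made up for illustration.

```python
def billed_input_ratio(prefix_tokens: int, variable_tokens: int, calls: int,
                       write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Uncached cost divided by cached cost, in base-input-token units."""
    uncached = calls * (prefix_tokens + variable_tokens)
    cached = (
        prefix_tokens * write_mult                 # first call writes the cache
        + (calls - 1) * prefix_tokens * read_mult  # later calls read it
        + calls * variable_tokens                  # per-recording text is not cached
    )
    return uncached / cached


# Hypothetical run: 9k stable tokens, 1k tokens per recording, 5 recordings.
print(round(billed_input_ratio(9000, 1000, 5), 2))
```

With these made-up numbers the reduction comes out at about 2.5x. Larger savings need a bigger stable prefix relative to the per-recording text, more calls per run, or a cache that is already warm from an earlier call.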

4. QMD — yes, for query-time but not for extraction

QMD is a local search engine (BM25 + vector + LLM rerank). It's the answer to a question the RFC isn't asking yet but should: how do users actually query their graph?

Where QMD helps reduce LLM costs:

Synthesis context selection (Tier 3): Instead of feeding Claude every entry on a page, use qmd query to retrieve the top-N most relevant entries to the synthesis question. For a topic page with 3 years of entries, this could cut input tokens by 90%+.
Recording lookup at journal-build time (Tier 1): When linking [[Dr. Martinez]], we might want context from the last 3 visits. QMD can retrieve those without putting all of Dr. Martinez.md into the prompt.
Dashboard search: "what did my doctor say about X" becomes qmd query instead of a Claude call.

Where QMD doesn't help:

It's a retriever, not an extractor. Tier 1 (turning a transcript into linked bullets) is generation, not retrieval. QMD doesn't change that cost.
Adds a non-trivial dep: node-llama-cpp + GGUF models, ~GB of local disk for the embedding index.

Practical integration: Run qmd embed after update-topics.py finishes. Now the Logseq vault is queryable. The MCP server mode means a future seam dashboard chat feature gets graph-aware search for free.

Concrete answers to the open questions

Tier 2 Claude vs deterministic: Deterministic, if we commit to a taxonomy page. The taxonomy is what makes deterministic safe. Without it, we'll get spelling/casing drift and no way to canonicalize.
Namespaces: Flat by default, namespaces only for the categories defined in the taxonomy. Don't let Claude invent namespaces ad-hoc.
Manual edit conflict: The ## Seam Entries marker is right. Add a second marker ## Current Understanding (Tier 3 owns this section, may be replaced wholesale) so users know what's safe to edit. Everything else is user-owned.
Recording linkback: Link to Logseq page (keeps graph self-contained), but include a dashboard:: property pointing at the localhost URL. Best of both — graph navigation works offline, dashboard link is one click away.
Scale: The real bottleneck isn't journal/topic builds, it's synthesis re-reading old entries. Two mitigations: (a) QMD-retrieve top-N entries instead of feeding the whole page, (b) cache synthesis output keyed on (page, entries_hash) so unchanged pages don't re-synthesize. Combined with Batches API for Tier 3, weekly synthesis stays cheap even at 1000+ recordings.

jedibrillo · 2026-05-01T11:36:34Z

jedibrillo
May 1, 2026
Collaborator

https://medium.com/graph-praxis/causal-inference-on-knowledge-graphs-the-fourth-layer-of-context-blindness-2bc5f0ab66d7

I found this article, and the series of articles it links to, really helpful for understanding knowledge graphs and how best to build them.

yoaquim · 2026-05-03T12:46:08Z

yoaquim
May 3, 2026
Maintainer Author

Implementation Plan

Based on the RFC + @jedibrillo's feedback, here's the concrete plan. Ordered by dependency — each phase builds on the previous.

Architecture Overview

graph TD
    R[Pocket Recordings] -->|pull + analyze| A[.seam/analysis/]
    P[people.json] --> T1
    TX[taxonomy.json] --> T1
    TX --> T2
    A --> T1[Tier 1: Daily Journal Builder]
    T1 -->|parse links| T2[Tier 2: Topic Page Updater]
    T2 -->|weekly / on-demand| T3[Tier 3: Synthesis]
    T1 --> J[journals/YYYY_MM_DD.md]
    T2 --> TP[pages/*.md]
    T3 --> TP

Phase 0: Taxonomy Bootstrap

What: Create taxonomy.json — the canonical list of namespace categories that controls how [[links]] are organized.

How:

One-time bootstrap script (scripts/bootstrap_taxonomy.py) scans all existing analysis.json files — extracts topics, participants, action item owners, decision makers
Clusters them into proposed categories (health, work, personal, etc.)
Outputs a draft taxonomy.json for human review
Add to Settings page (like S3 config) for editing

Format: JSON (consistent with people.json, sync-history.json, etc.). Auto-generates markdown in-memory when injecting into Claude prompts — no separate .md file on disk.

{
  "namespaces": {
    "health": ["conditions", "medications", "doctors", "appointments"],
    "work": ["projects", "decisions", "hiring"],
    "personal": ["finances", "travel", "home"]
  },
  "aliases": {
    "medical": "health",
    "career": "work"
  }
}

Why first: Tier 2 deterministic mode depends on this. Without it, Claude invents namespaces ad-hoc and you get medical/ one day, health/ the next.
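The in-memory markdown rendering mentioned under Format could look roughly like this. The function name `taxonomy_to_markdown` and the wording of the rendered text are assumptions. So is the reading of each namespace's list as "sub-areas it covers".

```python
import json

TAXONOMY_JSON = """
{
  "namespaces": {
    "health": ["conditions", "medications", "doctors", "appointments"],
    "work": ["projects", "decisions", "hiring"]
  },
  "aliases": {"medical": "health"}
}
"""


def taxonomy_to_markdown(taxonomy: dict) -> str:
    """Render taxonomy.json as a short markdown block for prompt context."""
    lines = ["Allowed namespaces:"]
    for namespace, subtopics in taxonomy["namespaces"].items():
        lines.append(f"- {namespace}/ (covers: {', '.join(subtopics)})")
    aliases = taxonomy.get("aliases", {})
    if aliases:
        lines.append("Aliases (use the canonical name on the right):")
        for alias, canonical in aliases.items():
            lines.append(f"- {alias} -> {canonical}")
    return "\n".join(lines)


prompt_context = taxonomy_to_markdown(json.loads(TAXONOMY_JSON))
print(prompt_context)
```

Keeping the JSON as the only file on disk means there is a single source of truth, and the prompt text is regenerated on every run.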

Phase 1: Daily Journal Builder

Script: scripts/build_journal.py
Claude model: Sonnet 4.6 (needs nuance for link selection, speaker inference)
Runs: Every sync, after analysis

Input: All recordings + analyses for a given day, people.json, taxonomy.json
Output: journals/YYYY_MM_DD.md in the configured Logseq graph

Prompt design (the make-or-break piece):

Receives the day's analyses + taxonomy + people as context
Follows the "always/sometimes/never link" guidance from the RFC
Links use taxonomy namespaces where applicable (e.g. [[health/lower back pain]] not just [[lower back pain]])
Outputs Logseq outliner format (bullets, source:: properties, TODO items)

Idempotency: Overwrites the day's journal page on re-run (same day = same page).

Cost optimization: Prompt caching on taxonomy + people.json + prompt template (stable across all recordings in a run). ~4x input token reduction per jedibrillo's suggestion.

Optional flag: --tag-confidence — adds confidence:: extracted|inferred|ambiguous properties to each linked bullet. Off by default for v1. Useful for trust calibration but adds prompt complexity.

Phase 2: Topic Page Updater

Script: scripts/update_topics.py
Claude model: None — fully deterministic
Runs: Every sync, immediately after Phase 1

How it works:

Parse the just-created journal for [[link]] patterns
Normalize each link against taxonomy.json (apply aliases, validate namespace)
For each link:
- Page exists → append dated entry below ## Seam Entries marker
- Page doesn't exist → create page with properties + first entry
Skip entries whose source recording is already present (idempotency)

One level deep only. Only processes links from the journal. Does not follow links from topic pages.
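The normalization step could be sketched as below. Falling back to a flat page name when the namespace is not in the taxonomy is an assumption (it follows the "flat by default" suggestion from the thread). `normalize_link` is a hypothetical name.

```python
TAXONOMY = {
    "namespaces": {
        "health": ["conditions", "medications", "doctors", "appointments"],
        "work": ["projects", "decisions", "hiring"],
        "personal": ["finances", "travel", "home"],
    },
    "aliases": {"medical": "health", "career": "work"},
}


def normalize_link(link: str, taxonomy: dict) -> str:
    """Rewrite aliased namespaces and drop namespaces the taxonomy lacks."""
    namespace, sep, name = link.partition("/")
    if not sep:
        return link.strip()  # flat link: nothing to normalize
    namespace = namespace.strip().lower()
    namespace = taxonomy["aliases"].get(namespace, namespace)
    if namespace not in taxonomy["namespaces"]:
        return name.strip()  # assumption: unknown namespace falls back to flat
    return f"{namespace}/{name.strip()}"


print(normalize_link("medical/lower back pain", TAXONOMY))
```

Since this is a pure function over the link text and the taxonomy, it needs no model call and produces the same page name on every run.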

Manual edit safety: Two markers in each topic page:

## Seam Entries — Tier 2 appends below this. Everything above is user-owned.
## Current Understanding — Tier 3 owns this section (future).
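A possible shape for the marker-based append, as a sketch. It assumes the marker is stored as its own top-level block (`- ## Seam Entries`) and that entries are written flat after it rather than nested under it. Both are open design details. It requires Python 3.9+ for `str.removeprefix`.

```python
MARKER = "## Seam Entries"


def append_below_marker(page_text: str, entry: str) -> str:
    """Append entry after the marker, creating the marker if it is missing."""
    lines = page_text.splitlines()
    has_marker = any(line.strip().removeprefix("- ") == MARKER for line in lines)
    if not has_marker:
        lines.append(f"- {MARKER}")
    lines.extend(entry.rstrip().splitlines())
    return "\n".join(lines) + "\n"


user_page = "- My own notes about stretching\n"
first = append_below_marker(user_page, "- **2026-04-28** MRI normal.")
second = append_below_marker(first, "- **2026-05-02** Follow-up booked.")
print(second)
```

Existing lines are only ever extended, never rewritten, so anything the user wrote above the marker survives each run unchanged.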

Person pages (names found in people.json) get structured differently with type:: person, role:: properties.

Phase 3: Synthesis (v1 or v2 — see open question below)

Script: scripts/synthesize_topics.py
Claude model: Sonnet 4.6 (Haiku for pages with <10 entries)
Runs: On-demand or weekly (not every sync)

What it does:

Reads recent entries (last ~3 months) on topic pages
Generates/replaces the ## Current Understanding section at top
Individual entries below remain as audit trail

Cost optimizations:

Batches API (50% reduction, async with up to 24h SLA — fine for weekly runs)
Content-hash caching: (page_path, entries_hash) → skip unchanged pages
Future: QMD top-N retrieval instead of feeding full page (deferred)
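The content-hash check could be as small as the sketch below. The helper names and the in-memory `dict` cache are assumptions. Where the cache would be persisted is left open.

```python
import hashlib


def entries_hash(entries: list[str]) -> str:
    """Stable SHA-256 over the entries that would be sent to synthesis."""
    digest = hashlib.sha256()
    for entry in entries:
        digest.update(entry.encode("utf-8"))
        digest.update(b"\x00")  # separator so ["ab", "c"] differs from ["a", "bc"]
    return digest.hexdigest()


def needs_synthesis(page_path: str, entries: list[str], cache: dict[str, str]) -> bool:
    """True if the page's entries changed since the last recorded synthesis."""
    return cache.get(page_path) != entries_hash(entries)


def record_synthesis(page_path: str, entries: list[str], cache: dict[str, str]) -> None:
    """Remember the hash of the entries that were just synthesized."""
    cache[page_path] = entries_hash(entries)


cache: dict[str, str] = {}
entries = ["- **2026-04-28** MRI normal."]
page = "pages/lower back pain.md"
print(needs_synthesis(page, entries, cache))
record_synthesis(page, entries, cache)
print(needs_synthesis(page, entries, cache))
```

Hashing only the entries inside the synthesis window would mean a page is re-synthesized when a new entry arrives or an old one ages out, and skipped otherwise.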

Phase 4: Pipeline + Settings Integration

Pipeline (pocket-run.sh):

pull → analyze → stage people → build manifest → build journal → update topics

Settings page:

New "Knowledge Graph" section
LOGSEQ_GRAPH_PATH — path to Logseq graph directory
Taxonomy editor (or just path display with "edit in Logseq" guidance)
Optional: "Synthesize Now" button (triggers Phase 3)

Configuration (.env):

LOGSEQ_GRAPH_PATH=/path/to/your/logseq/graph

Optional, like S3. If not set, journal/topic steps are skipped.

README Updates

Mermaid diagram explaining the tier architecture + data flow. New "Knowledge Graph (optional)" section similar to the S3 Backup section.

What's explicitly deferred

QMD integration — solve data generation before retrieval
MCP server over the graph — build the graph first
Edge tagging (confidence) — available as --tag-confidence flag but off by default
Dashboard graph view (React Flow) — Logseq's built-in graph covers this
Mermaid in topic pages — doesn't add value over Logseq's graph view

Open questions

@jedibrillo — Tier 3 (synthesis) in v1 or v2?

Tier 1 + Tier 2 alone already give you daily journals with [[links]] and topic pages with dated entries — browseable in Logseq with full graph visualization. Synthesis adds the "Current Understanding" summaries on top.

Argument for v1: synthesis is the "second brain" payoff — without it, topic pages are just append-only logs.
Argument for v2: Tier 1 + Tier 2 are already useful, synthesis is the hardest prompt engineering, and shipping sooner gets feedback sooner.

What's your take — ship Tier 1+2 first and iterate, or is synthesis essential to the initial value prop?

For everyone — is Linear worth it for this project? We could break this plan into Linear tickets for tracking. But if it adds more overhead than value for a project this size, we can just track progress in this discussion thread. Thoughts on whether it's overkill or actually helpful here?

yoaquim · 2026-05-06T12:28:07Z

yoaquim
May 6, 2026
Maintainer Author

https://github.com/cocoindex-io/cocoindex/blob/main/examples/conversation_to_knowledge/spec.md
