Why
The current pipeline embeds whole sessions as a single point (TF-IDF + Tool-IDF + structural, see scripts/knowledge-graph/feature-extraction.ts). That is appropriate for "what is the shape of this corpus" (clustering, community detection, centrality) but cannot answer "given a query, fetch the most relevant moments".
Skill synthesis, UI search, and bookmark recommendation all want chunk-level retrieval that today's pipeline cannot provide.
What
Generate dense embeddings at turn granularity (the natural unit on SessionDetail.turns).
Persist them in a vector index that both the static frontend and skill-server can load.
Expose a hybrid retrieval helper (sparse BM25 + dense cosine, optional reranking) so callers do not need to know the underlying details.
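As a rough illustration of what such a hybrid helper might do internally, the sketch below scores each chunk with BM25 and with cosine similarity, min-max normalizes both score lists, and blends them with a weight. All names here (`Chunk`, `hybridSearch`, `alpha`, `topK`) are hypothetical and not taken from the crune codebase, the min-max fusion is just one common choice, and the optional reranking step is left out.

```typescript
// Hypothetical shapes for illustration only.
interface Chunk {
  id: string;       // e.g. "<sessionId>:<turnIndex>"
  tokens: string[]; // pre-tokenized turn text for the sparse side
  vector: number[]; // dense embedding of the same turn
}

interface Scored {
  id: string;
  score: number;
}

// BM25 with the commonly used k1/b defaults and a non-negative idf variant.
function bm25Scores(queryTokens: string[], chunks: Chunk[], k1 = 1.2, b = 0.75): number[] {
  const n = chunks.length;
  const avgLen = chunks.reduce((sum, c) => sum + c.tokens.length, 0) / Math.max(n, 1);
  // Document frequency: in how many chunks each token appears.
  const df = new Map<string, number>();
  chunks.forEach((c) => {
    Array.from(new Set(c.tokens)).forEach((t) => df.set(t, (df.get(t) || 0) + 1));
  });
  const uniqueQuery = Array.from(new Set(queryTokens));
  return chunks.map((c) => {
    let score = 0;
    uniqueQuery.forEach((q) => {
      const tf = c.tokens.filter((t) => t === q).length;
      if (tf === 0) return;
      const d = df.get(q) || 0;
      const idf = Math.log(1 + (n - d + 0.5) / (d + 0.5));
      const lengthNorm = 1 - b + b * (c.tokens.length / avgLen);
      score += (idf * tf * (k1 + 1)) / (tf + k1 * lengthNorm);
    });
    return score;
  });
}

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Rescale scores to [0, 1] so the sparse and dense sides are comparable.
function minMax(xs: number[]): number[] {
  const lo = Math.min(...xs);
  const hi = Math.max(...xs);
  return xs.map((x) => (hi === lo ? 0 : (x - lo) / (hi - lo)));
}

// alpha = 1 is pure BM25, alpha = 0 is pure dense cosine.
function hybridSearch(
  queryTokens: string[],
  queryVector: number[],
  chunks: Chunk[],
  alpha = 0.5,
  topK = 5,
): Scored[] {
  const sparse = minMax(bm25Scores(queryTokens, chunks));
  const dense = minMax(chunks.map((c) => cosine(queryVector, c.vector)));
  return chunks
    .map((c, i) => ({ id: c.id, score: alpha * sparse[i] + (1 - alpha) * dense[i] }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}
```

A caller would only pass a tokenized query, its embedding, and the loaded chunks. Tokenization, the choice of `alpha`, and any reranker stay behind the helper.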
How
New module scripts/knowledge-graph/embedder.ts — chunk extraction (turn-level) + embedding generation.
New module scripts/knowledge-graph/retriever.ts — hybrid retrieval, cluster-aware diversification.
Pluggable embedding backend:
Default: local model via Transformers.js (e.g. bge-small, multilingual variant for the JA/EN mix). Fits crune's offline-by-default stance.
Optional: OpenAI / Voyage / Cohere via CLI flag for users who already have API access.
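One way the embedder could keep the backend pluggable is to put every provider behind a small interface and run turn-level chunk extraction independently of it. The sketch below is an assumption about how that might look, not the actual embedder.ts: `Turn.text`, `TurnChunk`, `EmbeddingBackend`, `HashingBackend`, and `embedSession` are hypothetical names, and the hashing backend is a stand-in for tests that produces no semantically meaningful vectors. A real backend (Transformers.js or a hosted API) would implement the same `embed` method.

```typescript
// Hypothetical shapes: only the existence of `turns` on SessionDetail comes from the proposal.
interface Turn {
  role: string;
  text: string;
}

interface SessionDetail {
  sessionId: string;
  turns: Turn[];
}

interface TurnChunk {
  sessionId: string;
  turnIndex: number;
  role: string;
  text: string;
}

// Any backend, local or hosted, only has to map texts to vectors.
interface EmbeddingBackend {
  readonly name: string;
  readonly dimensions: number;
  embed(texts: string[]): Promise<number[][]>;
}

// One chunk per non-empty turn; turnIndex keeps the position in the original session.
function extractTurnChunks(session: SessionDetail): TurnChunk[] {
  const chunks: TurnChunk[] = [];
  session.turns.forEach((turn, turnIndex) => {
    const text = turn.text.trim();
    if (text.length === 0) return; // skip empty turns
    chunks.push({ sessionId: session.sessionId, turnIndex, role: turn.role, text });
  });
  return chunks;
}

// Test stand-in: buckets character codes into a small L2-normalized vector.
class HashingBackend implements EmbeddingBackend {
  readonly name = "hashing-stub";
  constructor(readonly dimensions = 8) {}

  async embed(texts: string[]): Promise<number[][]> {
    return texts.map((text) => {
      const v = new Array<number>(this.dimensions).fill(0);
      for (let i = 0; i < text.length; i++) {
        v[text.charCodeAt(i) % this.dimensions] += 1;
      }
      const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0)) || 1;
      return v.map((x) => x / norm);
    });
  }
}

// Embeds a session in batches so large sessions do not hit the backend in one call.
async function embedSession(
  session: SessionDetail,
  backend: EmbeddingBackend,
  batchSize = 32,
): Promise<{ chunks: TurnChunk[]; vectors: number[][] }> {
  const chunks = extractTurnChunks(session);
  const vectors: number[][] = [];
  for (let i = 0; i < chunks.length; i += batchSize) {
    const batch = chunks.slice(i, i + batchSize).map((c) => c.text);
    const out = await backend.embed(batch);
    out.forEach((v) => vectors.push(v));
  }
  return { chunks, vectors };
}
```

With this split, a CLI flag would only need to select which `EmbeddingBackend` instance is passed to `embedSession`. Chunk extraction and batching stay the same for every provider.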
Vector index files:
public/data/embeddings/index.bin: quantized (int8) vector matrix
public/data/embeddings/meta.json: per-chunk metadata (sessionId, turnIndex, role, short snippet)
Integration with analyze-sessions.ts as an opt-in (--embed); promotion to default depends on PoC results.
Dependencies
Notes