Add embedding pipeline and chunk-level vector index for RAG-based retrieval #32

@chigichan24

Description

Why

  • The current pipeline embeds whole sessions as a single point (TF-IDF + Tool-IDF + structural, see scripts/knowledge-graph/feature-extraction.ts). That is appropriate for "what is the shape of this corpus" (clustering, community detection, centrality) but cannot answer "given a query, fetch the most relevant moments".
  • Skill synthesis, UI search, and bookmark recommendation all want chunk-level retrieval that today's pipeline cannot provide.

What

  • Generate dense embeddings at turn granularity (the natural unit on SessionDetail.turns).
  • Persist them in a vector index that both the static frontend and skill-server can load.
  • Expose a hybrid retrieval helper (sparse BM25 + dense cosine, optional reranking) so callers do not need to know the underlying details.
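
A minimal sketch of what such a hybrid helper could look like, assuming chunks are already tokenized and embedded. Every name here (Chunk, bm25Scores, hybridSearch, alpha) is hypothetical and not part of the existing codebase. Fusing the two signals by min-max normalization plus a weighted sum is one simple option among several; reciprocal rank fusion would be another, and reranking is left out.

```typescript
// Hypothetical chunk shape for illustration; not taken from the crune codebase.
interface Chunk {
  id: string;
  tokens: string[]; // pre-tokenized text used by BM25
  vector: number[]; // dense embedding
}

// Cosine similarity between two equal-length vectors (0 if either is all zeros).
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// Okapi BM25 score of every chunk against the query tokens.
function bm25Scores(queryTokens: string[], chunks: Chunk[], k1 = 1.2, b = 0.75): number[] {
  const n = chunks.length;
  const avgLen = chunks.reduce((sum, c) => sum + c.tokens.length, 0) / Math.max(n, 1);
  // Document frequency: number of chunks containing each token.
  const df = new Map<string, number>();
  for (const c of chunks) {
    for (const t of new Set(c.tokens)) df.set(t, (df.get(t) ?? 0) + 1);
  }
  return chunks.map((c) => {
    const tf = new Map<string, number>();
    for (const t of c.tokens) tf.set(t, (tf.get(t) ?? 0) + 1);
    let score = 0;
    for (const q of new Set(queryTokens)) {
      const f = tf.get(q) ?? 0;
      if (f === 0) continue;
      const d = df.get(q) ?? 0;
      const idf = Math.log(1 + (n - d + 0.5) / (d + 0.5));
      score += (idf * f * (k1 + 1)) / (f + k1 * (1 - b + (b * c.tokens.length) / avgLen));
    }
    return score;
  });
}

// Rescale scores to [0, 1] so sparse and dense signals are comparable.
function minMax(xs: number[]): number[] {
  const lo = Math.min(...xs);
  const hi = Math.max(...xs);
  return xs.map((x) => (hi === lo ? 0 : (x - lo) / (hi - lo)));
}

// alpha weights the dense score; (1 - alpha) weights BM25.
function hybridSearch(
  queryTokens: string[],
  queryVector: number[],
  chunks: Chunk[],
  alpha = 0.5,
  topK = 5,
): { id: string; score: number }[] {
  const sparse = minMax(bm25Scores(queryTokens, chunks));
  const dense = minMax(chunks.map((c) => cosine(queryVector, c.vector)));
  return chunks
    .map((c, i) => ({ id: c.id, score: alpha * dense[i] + (1 - alpha) * sparse[i] }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}
```

Callers would pass a tokenized query and its embedding and get back ranked chunk ids; the right alpha (and whether weighted fusion is adequate at all) is something the PoC would have to measure.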

How

  • New module scripts/knowledge-graph/embedder.ts — chunk extraction (turn-level) + embedding generation.
  • New module scripts/knowledge-graph/retriever.ts — hybrid retrieval, cluster-aware diversification.
  • Pluggable embedding backend:
    • Default: local model via Transformers.js (e.g. bge-small, multilingual variant for the JA/EN mix). Fits crune's offline-by-default stance.
    • Optional: OpenAI / Voyage / Cohere via CLI flag for users who already have API access.
  • Output:
    • public/data/embeddings/index.bin — quantized (int8) vector matrix
    • public/data/embeddings/meta.json — per-chunk metadata (sessionId, turnIndex, role, short snippet)
  • Integrate into analyze-sessions.ts as an opt-in (--embed) — promotion to default depends on PoC results.
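
The pieces above could fit together roughly as follows. This is a sketch under stated assumptions, not the planned implementation: the Session/Turn shapes are guesses at what SessionDetail.turns looks like, the Embedder interface and HashEmbedder are hypothetical (HashEmbedder is a toy stand-in so the sketch runs without a model; a real backend would wrap Transformers.js or a hosted API behind the same interface), and the int8 scheme assumes unit-normalized vectors so a fixed scale of 127 is enough.

```typescript
// Hypothetical shapes; the real SessionDetail type may differ.
interface Turn { role: string; text: string }
interface Session { sessionId: string; turns: Turn[] }
interface ChunkMeta { sessionId: string; turnIndex: number; role: string; snippet: string }

// Pluggable backend: a local model or a hosted API would implement this.
interface Embedder {
  readonly dim: number;
  embed(texts: string[]): Promise<Float32Array[]>;
}

// Toy embedder (assumption, for illustration only): buckets character codes
// into `dim` slots and L2-normalizes. It carries no semantic meaning.
class HashEmbedder implements Embedder {
  readonly dim: number;
  constructor(dim = 8) {
    this.dim = dim;
  }
  async embed(texts: string[]): Promise<Float32Array[]> {
    return texts.map((t) => {
      const v = new Float32Array(this.dim);
      for (let i = 0; i < t.length; i++) v[t.charCodeAt(i) % this.dim] += 1;
      let sq = 0;
      for (let j = 0; j < this.dim; j++) sq += v[j] * v[j];
      const norm = Math.sqrt(sq) || 1;
      return v.map((x) => x / norm);
    });
  }
}

// Turn-level chunking: one chunk per non-empty turn, with metadata kept in
// the same order as the texts so row i of the matrix matches meta[i].
function extractChunks(sessions: Session[], snippetLen = 80): { meta: ChunkMeta[]; texts: string[] } {
  const meta: ChunkMeta[] = [];
  const texts: string[] = [];
  for (const s of sessions) {
    s.turns.forEach((turn, turnIndex) => {
      if (turn.text.trim() === "") return; // skip empty turns
      meta.push({
        sessionId: s.sessionId,
        turnIndex,
        role: turn.role,
        snippet: turn.text.slice(0, snippetLen),
      });
      texts.push(turn.text);
    });
  }
  return { meta, texts };
}

// Symmetric int8 quantization into a row-major matrix.
// Assumes components lie in [-1, 1] (true for unit-normalized vectors).
function quantizeInt8(vectors: Float32Array[], dim: number): Int8Array {
  const out = new Int8Array(vectors.length * dim);
  vectors.forEach((v, row) => {
    for (let j = 0; j < dim; j++) {
      const clamped = Math.max(-1, Math.min(1, v[j]));
      out[row * dim + j] = Math.round(clamped * 127);
    }
  });
  return out;
}

// End-to-end: chunk, embed, quantize. The caller decides where to write the result.
async function buildIndex(
  sessions: Session[],
  embedder: Embedder,
): Promise<{ matrix: Int8Array; meta: ChunkMeta[] }> {
  const { meta, texts } = extractChunks(sessions);
  const vectors = await embedder.embed(texts);
  return { matrix: quantizeInt8(vectors, embedder.dim), meta };
}
```

Under these assumptions, the returned matrix would be the payload for index.bin and meta would be serialized to meta.json; a reader dequantizes by dividing each component by 127. Whether a fixed scale is accurate enough, or per-vector scales are needed, is one of the questions for the PoC.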

Dependencies

Notes

  • Final decisions on model choice, quantization, chunking strategy, and index format should come out of the PoC, which is tracked in a separate epic issue.

Metadata

Assignees

No one assigned

    Labels

    enhancement: New feature or request

    Projects

    Status
    Todo

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests