Problem
recall returns top-K raw memories. for agents on tight context budgets (haiku, local models, multi-tool chains where memory is one of many tools), 10 raw memories can burn 2-3k tokens. the agent often only needs the synthesis: "what does synapto know about kafka in this project?" — not the verbatim memory texts.
Proposed Solution
new MCP tool summarize_recall that runs the standard hybrid retrieval, then synthesizes the top-K via LLM into a single answer with citations.
signature
summarize_recall(
query: str,
tenant: str | None = None,
max_tokens: int = 500, # synthesis budget
k: int = 10, # candidates retrieved
include_citations: bool = True,
) -> SummaryResult
returns:
class SummaryResult:
summary: str # synthesized answer
citations: list[Citation] # [memory_id, snippet, score, depth_layer]
sources_used: int # how many of the k contributed
cache_hit: bool
caching
- key =
sha256(query | tenant | top_k_memory_ids | model)
- TTL =
ephemeral_max_age_hours / 4 (re-synth if memories likely changed)
- stored in redis (already in deps)
- bypass with
force_refresh=True
LLM provider
- reuse
synapto.embeddings.registry pattern: pluggable provider
- default to
claude-haiku-4-5 (fast + cheap)
- fall back to disabled-with-error if no API key (this tool requires LLM)
prompt template
system prompt enforces: "only synthesize from provided memories, cite memory_id inline, refuse if memories don't support an answer". reduces hallucination risk.
Trade-offs
- cost: each call = LLM invocation. mitigation: aggressive cache, opt-in tool (agent picks
recall vs summarize_recall).
- fidelity loss: summarization can lose nuance. mitigation: always return citations so caller can drill into raw if needed.
- latency: adds ~500-1500ms vs raw recall. acceptable for context-saving usecase.
- dependency: pulls anthropic SDK as optional extra (
synapto[summarize]).
Out of scope
- streaming summaries (return-and-be-done is fine for MCP)
- cross-tenant summarization (same tenant only)
Success criteria
- agents on haiku see meaningful token savings on memory-heavy chains
- citations are accurate (every claim in summary maps to a memory_id)
- cache hit rate > 60% on stable topics in a session
References
- byterover ships 6 compression strategies for context optimization — confirms agents want this
Problem
recallreturns top-K raw memories. for agents on tight context budgets (haiku, local models, multi-tool chains where memory is one of many tools), 10 raw memories can burn 2-3k tokens. the agent often only needs the synthesis: "what does synapto know about kafka in this project?" — not the verbatim memory texts.Proposed Solution
new MCP tool
summarize_recallthat runs the standard hybrid retrieval, then synthesizes the top-K via LLM into a single answer with citations.signature
returns:
caching
sha256(query | tenant | top_k_memory_ids | model)ephemeral_max_age_hours / 4(re-synth if memories likely changed)force_refresh=TrueLLM provider
synapto.embeddings.registrypattern: pluggable providerclaude-haiku-4-5(fast + cheap)prompt template
system prompt enforces: "only synthesize from provided memories, cite memory_id inline, refuse if memories don't support an answer". reduces hallucination risk.
Trade-offs
recallvssummarize_recall).synapto[summarize]).Out of scope
Success criteria
References