Ideas worth borrowing from mempalace: historical transcript mining and a LongMemEval retrieval benchmark #233

@Oddly

I looked at milla-jovovich/mempalace — the ChromaDB-backed memory system that posted 96.6% on LongMemEval R@5 in raw mode — to see whether any of it was worth lifting into distillery. Most of it duplicates or regresses what you already have, but two ideas sit cleanly alongside work that's already queued. Both are small.

1. Historical transcript mining, as the backfill complement to #199

#199 covers the live path: when Claude Code compacts a session, a hook extracts the last chunk of conversation via a fast model and stores summarised entries. That catches what's happening now. What it doesn't catch is everything that already happened — the months of prior sessions sitting as .jsonl files in ~/.claude/projects/ that nobody thought to /distill at the time.

Mempalace's convo_miner.py is the backfill half. Point it at a directory of transcripts and it chunks them verbatim, skips files it has already ingested, and removes near-duplicates after the fact. There is no LLM in the loop, just embeddings and a few small heuristics. The result is a low-trust safety net for /recall and /investigate, living behind the review queue so curated entries still rank first.

The shape fits what #191 already gave us — session_id is in place, source=inference exists, the review queue routes unverified entries. I think this wants a separate source=mined (or transcript) rather than overloading inference, since #199's output is LLM-summarised and this would be verbatim, but that's a small enum addition and not a migration. Auto-discovery of ~/.claude/projects/*/*.jsonl would be a natural first consumer — mempalace itself doesn't do that, you point it at a directory, but the walk is a few lines.

Technical details — chunking, dedup, integration points

The three bits of convo_miner.py worth stealing rather than reinventing:

Chunking heuristic. When a document contains three or more lines starting with >, the miner pairs each > user turn with up to eight following lines of response as one chunk (_chunk_by_exchange), falling back to paragraph chunking otherwise. This handles both Markdown-quoted replies and plain-text exports without any format sniffing.
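
A minimal sketch of that heuristic, written from the description above rather than copied from convo_miner.py. The function names, and the choice to end a reply at the next > line, are assumptions for illustration.

```python
import re

MIN_QUOTED_TURNS = 3   # "three or more lines starting with >"
MAX_REPLY_LINES = 8    # "up to eight following lines of response"


def chunk_by_paragraph(text: str) -> list[str]:
    """Fallback: split on blank lines and drop empty paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def chunk_by_exchange(text: str) -> list[str]:
    """Pair each '>' user line with up to MAX_REPLY_LINES lines of reply."""
    lines = text.splitlines()
    quoted = [i for i, line in enumerate(lines) if line.startswith(">")]
    if len(quoted) < MIN_QUOTED_TURNS:
        return chunk_by_paragraph(text)

    chunks = []
    for pos, start in enumerate(quoted):
        # Assumption: a reply ends at the next '>' line or after eight lines,
        # whichever comes first.
        next_turn = quoted[pos + 1] if pos + 1 < len(quoted) else len(lines)
        end = min(start + 1 + MAX_REPLY_LINES, next_turn)
        body = [line for line in lines[start:end] if line.strip()]
        chunks.append("\n".join(body))
    return chunks
```

Note that under this reading a multi-line quoted user turn becomes several chunks; the real implementation may treat that case differently.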

Idempotency via mtime. Each stored chunk carries a source_mtime metadata field. Re-running the miner against the same directory is a no-op for unmodified files and re-mines only what's been edited. Free incremental mining.
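
The mtime check can be sketched as below. This assumes the recorded source_mtime values have already been loaded into a dict keyed by source_file; how that dict is read back from the store is left out, and `files_to_mine` is a hypothetical name.

```python
import os
from pathlib import Path


def files_to_mine(root: Path, seen_mtimes: dict[str, float]) -> list[Path]:
    """Return files that are new or whose mtime differs from the stored one.

    `seen_mtimes` maps source_file -> source_mtime as recorded on stored chunks.
    """
    pending = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        # A missing key (never mined) or a changed mtime both trigger a re-mine.
        if seen_mtimes.get(str(path)) != os.path.getmtime(path):
            pending.append(path)
    return pending
```

A re-mined file would also need its old chunks deleted before the new ones are stored, which this sketch does not cover.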

Post-hoc dedup via the vector store itself. Group stored chunks by source_file, then for each candidate query the store's own nearest-neighbour index and delete anything within cosine distance 0.15 (that is, cosine similarity of at least 0.85) of an already-kept entry. No separate hashing pass, no SimHash. DistilleryStore.find_similar(threshold=0.85) is already the right method, so the dedup pass is roughly ten lines.
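
To make the greedy pass concrete, here is a self-contained sketch that uses a brute-force comparison as a stand-in for the store's nearest-neighbour index. It does not call DistilleryStore; the chunk dict keys are assumptions, and comparing only within a source_file group is one reading of the description.

```python
import math
from collections import defaultdict

DISTANCE_THRESHOLD = 0.15  # cosine distance; equivalent to similarity 0.85


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity. Assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def dedup(chunks: list[dict]) -> list[dict]:
    """Keep a chunk unless it is within the threshold of one already kept.

    Each chunk is a dict with (hypothetical) keys: source_file, embedding.
    """
    by_file = defaultdict(list)
    for chunk in chunks:
        by_file[chunk["source_file"]].append(chunk)

    kept = []
    for group in by_file.values():
        kept_here = []
        for chunk in group:
            if all(
                cosine_distance(chunk["embedding"], k["embedding"]) > DISTANCE_THRESHOLD
                for k in kept_here
            ):
                kept_here.append(chunk)
        kept.extend(kept_here)
    return kept
```

In the real pass the inner comparison would be a single nearest-neighbour query against the store, so the cost stays close to linear in the number of chunks.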

Integration points. The batch miner should not go through /distill, which is interactive by design. It should be a CLI command or a direct MCP tool that calls DistilleryStore.store() and find_similar() directly. Entries land as:

source=mined            # new enum value
verification=unverified
status=pending_review
session_id=<from jsonl filename or manifest>
metadata={source_file, source_mtime, chunk_index}

and the existing review queue picks them up from there. /recall and /pour probably want a default filter that excludes source=mined unless explicitly asked.
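
As an illustration of the entry shape and the default filter, here is a small sketch. `MinedEntry` and `default_filter` are hypothetical names, not existing distillery types; only the field names and values come from the list above.

```python
from dataclasses import dataclass, field


@dataclass
class MinedEntry:  # hypothetical shape mirroring the fields listed above
    content: str
    session_id: str
    source: str = "mined"              # the proposed new enum value
    verification: str = "unverified"
    status: str = "pending_review"
    metadata: dict = field(default_factory=dict)  # source_file, source_mtime, chunk_index


def default_filter(entries, include_mined: bool = False):
    """Drop source=mined entries unless the caller opts in."""
    if include_mined:
        return list(entries)
    return [e for e in entries if e.source != "mined"]
```

Keeping the exclusion as a default rather than a hard rule means /investigate or an explicit flag can still reach the mined material.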

The auto-discovery piece — walking ~/.claude/projects/*/*.jsonl — is new on top of what mempalace does. The session JSONL files have role/content turns per line, which is a friendlier format than mempalace's >-pattern detection. A Claude-Code-aware reader can skip the heuristic entirely and use the structured turns directly.
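
The walk and the structured reader might look like the sketch below. It assumes each JSONL line is an object with top-level role and content keys, as described above; lines that do not parse or lack those keys are skipped. The function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Iterator


def discover_transcripts(root: Path) -> list[Path]:
    """Find session files matching <root>/<project>/<session>.jsonl."""
    return sorted(root.glob("*/*.jsonl"))


def read_turns(path: Path) -> Iterator[dict]:
    """Yield {role, content} dicts from one session file."""
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # tolerate partially written or corrupt lines
            if isinstance(record, dict) and "role" in record and "content" in record:
                yield {"role": record["role"], "content": record["content"]}
```

Usage would be along the lines of `discover_transcripts(Path.home() / ".claude" / "projects")`, with `path.stem` as a candidate session_id when it is taken from the filename.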

2. Port their LongMemEval runner as a public retrieval benchmark

src/distillery/eval/ today measures skill behaviour — did the right tool get called, did the response contain the right text. retrieval_scorer.py already knows how to compute P@k / R@k / MRR, but it reads golden labels out of scenario YAMLs, not a public dataset. With hybrid BM25+vector shipped in #164 and RRF normalisation in #170, there's a real retrieval pipeline now, but no standardised end-to-end number to say "this change improved recall from X to Y on a dataset anyone can reproduce."

LongMemEval is that dataset. It's 500 questions over multi-session conversation haystacks with ground-truth labels for which sessions contain the answer. Mempalace's benchmark runner is ~800 lines but the part that matters is a tight loop — ingest this question's haystack, query, score — that maps cleanly onto DistilleryStore.store and search. Porting it gives you one number you can watch move whenever retrieval changes, and it lets you compare the new hybrid retrieval against raw dense as a publishable result.

Technical details — port plan, dataset shape, gotchas

Where the port goes. New file src/distillery/eval/longmemeval.py. For each question in the dataset: instantiate a fresh DuckDBStore on :memory: to match ChromaDB's ephemeral-per-question semantics, ingest haystack_sessions as entries, call store.search(query, limit=50), map the returned SearchResult.entry.metadata["session_id"] back to haystack_session_ids, and score against answer_session_ids.

Dataset shape. The LongMemEval dataset lives on HuggingFace at xiaowu0162/longmemeval-cleaned. Per-question JSON fields:

question, question_date
haystack_sessions        # list of [{role, content}, ...]
haystack_session_ids
haystack_dates
answer_session_ids       # ground truth
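
Given those fields, the per-question step could be sketched as follows. The `Ranker` callable is a hypothetical stand-in for "ingest into a fresh store, then search"; the keyword ranker exists only so the sketch runs without an embedding provider.

```python
from typing import Callable

# Hypothetical interface: (query, {session_id: text}) -> session ids, best first.
Ranker = Callable[[str, dict[str, str]], list[str]]


def session_text(session: list[dict]) -> str:
    """Flatten one haystack session into a single document."""
    return "\n".join(f"{turn['role']}: {turn['content']}" for turn in session)


def rank_sessions(question: dict, rank: Ranker, limit: int = 50) -> list[str]:
    """Build one document per haystack session and return the top ranked ids."""
    docs = {
        sid: session_text(session)
        for sid, session in zip(
            question["haystack_session_ids"], question["haystack_sessions"]
        )
    }
    return rank(question["question"], docs)[:limit]


def keyword_overlap_rank(query: str, docs: dict[str, str]) -> list[str]:
    """Toy ranker for demonstration only: order by shared lowercase tokens."""
    q = set(query.lower().split())
    return sorted(docs, key=lambda sid: -len(q & set(docs[sid].lower().split())))
```

Storing one entry per session is an assumption here; chunking sessions into several entries would need the session_id carried in metadata so results can be mapped back.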

The scoring functions dcg, ndcg, and evaluate_retrieval can be lifted verbatim from benchmarks/longmemeval_bench.py. Mempalace is MIT, distillery is Apache-2.0 — compatible, add a NOTICE attribution for the lifted code.

Reusing retrieval_scorer.py. The P@k / R@k / MRR math is already correct. The adapter is turning answer_session_ids into the golden_labels=[{"entry_id": ..., "relevant": bool}] shape it expects. No protocol changes.
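
A sketch of that adapter, assuming one stored entry per session so that entry_id can simply be the session id. The `recall_at_k` helper is an illustrative stand-in to show how the labels are consumed, not the code in retrieval_scorer.py.

```python
def to_golden_labels(haystack_session_ids, answer_session_ids):
    """Turn LongMemEval ground truth into the golden_labels shape."""
    answers = set(answer_session_ids)
    return [
        {"entry_id": sid, "relevant": sid in answers}
        for sid in haystack_session_ids
    ]


def recall_at_k(ranked_ids, golden_labels, k):
    """Fraction of relevant entries that appear in the top k results."""
    relevant = {g["entry_id"] for g in golden_labels if g["relevant"]}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_ids[:k])) / len(relevant)
```

If the store assigns its own entry ids, the adapter would need a session_id to entry_id lookup built at ingest time.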

Gotchas.

  • Embedding cost. 500 questions × ~53 sessions each is roughly 26,500 sessions to embed per run, which adds up if every PR runs it through Jina or OpenAI. It should be a local-provider opt-in or a nightly job, not CI-gated. Mempalace hits 96.6% with ChromaDB's default on-device model, so adding a fastembed-based local provider under src/distillery/embedding/ would make this benchmark cheap to run and would be useful independently.
  • Don't chase their 99.4%. The higher number uses hybrid-v2 + Sonnet reranking with k=50. At roughly 53 sessions per haystack, a top-50 cut returns nearly every session, so a high score is close to structurally guaranteed. The honest v1 target is the 96.6% raw baseline. Score the hybrid BM25+vector search from #164 alongside it as the useful comparison.
  • In-memory DuckDB vs filesystem. Don't create a new file per question — :memory: and dropping/recreating tables between questions is the analogue of chromadb.EphemeralClient. Otherwise the fresh-store-per-question pattern is disk-bound and slow.
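
The k=50 point and the embedding-cost point are both simple arithmetic, sketched below. `worst_case_recall` is a hypothetical helper that gives the lowest recall@k any ranking can score; the 40-session haystack is an invented example, while 53 and 500 are the figures quoted above.

```python
def worst_case_recall(n_sessions: int, n_relevant: int, k: int) -> float:
    """Lowest possible recall@k: every relevant session is ranked last."""
    irrelevant = n_sessions - n_relevant
    # Once k exceeds the irrelevant count, some hits are unavoidable.
    forced_hits = min(n_relevant, max(0, k - irrelevant))
    return forced_hits / n_relevant


print(worst_case_recall(40, 2, 50))  # 1.0: k covers the whole haystack
print(worst_case_recall(53, 1, 50))  # 0.0: three sessions can still be missed
print(500 * 53)                      # 26500 sessions to embed per full run
```

When the floor is already 1.0 the metric says nothing about ranking quality, which is why a small k such as R@5 is the more informative number.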

Approximate scope. ~300 lines: dataset loader, the question loop, the adapter to retrieval_scorer.py, and a CLI entry point. No changes to the store protocol.

Happy to write either PR. The benchmark port is the higher signal per line of code and gives you a regression number to watch; the miner is larger but the heuristics lift cleanly and the schema work is already done thanks to #191. I'd tackle the benchmark first and the miner second.
