Ideas worth borrowing from mempalace: historical transcript mining and a LongMemEval retrieval benchmark

I looked at [milla-jovovich/mempalace](https://github.com/milla-jovovich/mempalace) — the ChromaDB-backed memory system that posted 96.6% on LongMemEval R@5 in raw mode — to see whether any of it was worth lifting into distillery. Most of it duplicates or regresses what you already have, but two ideas sit cleanly alongside work that's already queued. Both are small.

## 1. Historical transcript mining, as the backfill complement to #199

#199 covers the live path: when Claude Code compacts a session, a hook extracts the last chunk of conversation via a fast model and stores summarised entries. That catches what's happening now. What it doesn't catch is everything that already happened — the months of prior sessions sitting as `.jsonl` files in `~/.claude/projects/` that nobody thought to `/distill` at the time.

Mempalace's `convo_miner.py` is the backfill half. Point it at a directory of transcripts and it chunks them verbatim, skips files it's already ingested, and deduplicates near-duplicates after the fact. No LLM in the loop — just embeddings and some small heuristics. The result is a low-trust safety net for `/recall` and `/investigate`, living behind the review queue so curated entries still rank first.

The shape fits what #191 already gave us — `session_id` is in place, `source=inference` exists, the review queue routes unverified entries. I think this wants a separate `source=mined` (or `transcript`) rather than overloading `inference`, since #199's output is LLM-summarised and this would be verbatim, but that's a small enum addition and not a migration. Auto-discovery of `~/.claude/projects/*/*.jsonl` would be a natural first consumer — mempalace itself doesn't do that, you point it at a directory, but the walk is a few lines.

<details>
<summary>Technical details — chunking, dedup, integration points</summary>

The three bits of `convo_miner.py` worth stealing rather than reinventing:

**Chunking heuristic.** When a document contains three or more lines starting with `>`, `_chunk_by_exchange` pairs each `>` user turn with up to eight following lines of response as one chunk; otherwise the miner falls back to paragraph chunking. That handles both Markdown-quoted replies and plain text exports without any format sniffing.
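To make the heuristic concrete, here is a minimal sketch of the behaviour described above. It is not mempalace's code: the function name, the constants, and the choice to skip blank lines without counting them toward the eight-line cap are all assumptions for illustration.

```python
from typing import List

MIN_QUOTED_LINES = 3    # threshold described above
MAX_RESPONSE_LINES = 8  # response lines kept per user turn


def chunk_by_exchange(text: str) -> List[str]:
    """Pair each '>' line with up to MAX_RESPONSE_LINES following lines."""
    lines = text.splitlines()
    quoted = sum(1 for line in lines if line.startswith(">"))
    if quoted < MIN_QUOTED_LINES:
        # Fallback: paragraph chunking on blank lines.
        return [p.strip() for p in text.split("\n\n") if p.strip()]

    chunks: List[str] = []
    i = 0
    while i < len(lines):
        if not lines[i].startswith(">"):
            i += 1  # skip preamble and any response lines past the cap
            continue
        turn = [lines[i]]
        i += 1
        taken = 0
        while (
            i < len(lines)
            and not lines[i].startswith(">")
            and taken < MAX_RESPONSE_LINES
        ):
            if lines[i].strip():  # assumption: blank lines are dropped
                turn.append(lines[i])
                taken += 1
            i += 1
        chunks.append("\n".join(turn))
    return chunks
```

Response lines beyond the cap are dropped in this sketch rather than spilled into a separate chunk; whether that is the right trade-off for long assistant replies is worth checking against the original before porting.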

**Idempotency via mtime.** Each stored chunk carries a `source_mtime` metadata field. Re-running the miner against the same directory is a no-op for unmodified files and re-mines only what's been edited. Free incremental mining.
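A sketch of the mtime check, assuming the previously stored `source_mtime` values have already been read back into a plain dict keyed by file path. The function name `files_to_mine` and that dict are hypothetical; how the values are fetched from the store is left out.

```python
from pathlib import Path
from typing import Dict, Iterable, List


def files_to_mine(paths: Iterable[Path], seen_mtimes: Dict[str, float]) -> List[Path]:
    """Return files that are new or whose mtime differs from the stored one."""
    pending: List[Path] = []
    for path in paths:
        current = path.stat().st_mtime
        # A missing key (never mined) or a changed mtime both trigger a re-mine.
        if seen_mtimes.get(str(path)) != current:
            pending.append(path)
    return pending
```

A re-mined file would also need its old chunks deleted before the new ones are stored, otherwise an edited transcript leaves stale chunks behind.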

**Post-hoc dedup via the vector store itself.** Group stored chunks by `source_file`, then for each candidate query the store's own nearest-neighbour index and delete anything within cosine distance 0.15 of an already-kept entry (equivalently, cosine similarity of 0.85 or higher). No separate hashing pass, no SimHash. `DistilleryStore.find_similar(threshold=0.85)` is already the right method, so the dedup pass is roughly ten lines.
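The greedy pass can be sketched as follows. This is an illustration under stated assumptions: a brute-force comparison stands in for the store's nearest-neighbour index, and the `Chunk` tuple and `ids_to_delete` name are hypothetical rather than anything in mempalace or distillery.

```python
import math
from typing import Dict, List, Sequence, Tuple

Chunk = Tuple[str, str, Sequence[float]]  # (chunk_id, source_file, embedding)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def ids_to_delete(chunks: List[Chunk], max_distance: float = 0.15) -> List[str]:
    """Greedy dedup: drop a chunk if it sits within max_distance of an
    earlier kept chunk from the same source file."""
    kept: Dict[str, List[Sequence[float]]] = {}
    doomed: List[str] = []
    for chunk_id, source_file, vector in chunks:
        bucket = kept.setdefault(source_file, [])
        if any(cosine_distance(vector, other) <= max_distance for other in bucket):
            doomed.append(chunk_id)
        else:
            bucket.append(vector)
    return doomed
```

The result depends on iteration order, since the first chunk seen in each cluster is the one that survives. Iterating by `chunk_index` keeps the earliest occurrence.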

**Integration points.** The batch miner should not go through `/distill` (interactive by design). It should be a CLI command or a direct MCP tool that calls `DistilleryStore.store()` and `find_similar()` directly. Entries land as:

```
source=mined            # new enum value
verification=unverified
status=pending_review
session_id=<from jsonl filename or manifest>
metadata={source_file, source_mtime, chunk_index}
```

and the existing review queue picks them up from there. `/recall` and `/pour` probably want a default filter that excludes `source=mined` unless explicitly asked.
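As a rough illustration of the entry shape and the default filter, assuming a plain dataclass stands in for distillery's real entry model. `MinedEntry` and `visible_by_default` are hypothetical names, not existing APIs.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class MinedEntry:
    content: str
    session_id: str
    source: str = "mined"                # proposed new enum value
    verification: str = "unverified"
    status: str = "pending_review"
    metadata: Dict[str, object] = field(default_factory=dict)


def visible_by_default(
    entries: Iterable[MinedEntry], include_mined: bool = False
) -> List[MinedEntry]:
    """Hide mined entries unless the caller opts in."""
    return [e for e in entries if include_mined or e.source != "mined"]
```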

The auto-discovery piece — walking `~/.claude/projects/*/*.jsonl` — is new on top of what mempalace does. The session JSONL files have role/content turns per line, which is a friendlier format than mempalace's `>`-pattern detection. A Claude-Code-aware reader can skip the heuristic entirely and use the structured turns directly.
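A sketch of the walk and a structured reader. It assumes each JSONL line is an object with top-level string `role` and `content` keys, which is a simplification of the description above; real session files may nest these fields or carry other record types, so the field access would need checking against actual files. Both function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Iterator, List, Tuple


def discover_transcripts(root: Path) -> List[Path]:
    """Find session files one level below root, e.g. root/<project>/<session>.jsonl."""
    return sorted(root.glob("*/*.jsonl"))


def read_turns(path: Path) -> Iterator[Tuple[str, str]]:
    """Yield (role, content) pairs, skipping lines that do not fit the assumed shape."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # tolerate truncated or corrupt lines
            if not isinstance(record, dict):
                continue
            role, content = record.get("role"), record.get("content")
            if isinstance(role, str) and isinstance(content, str):
                yield role, content
```

Called as `discover_transcripts(Path.home() / ".claude" / "projects")`, this covers the glob named above, and `path.stem` gives a candidate `session_id` when it is taken from the filename.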

</details>

## 2. Port their LongMemEval runner as a public retrieval benchmark

`src/distillery/eval/` today measures skill behaviour — did the right tool get called, did the response contain the right text. `retrieval_scorer.py` already knows how to compute P@k / R@k / MRR, but it reads golden labels out of scenario YAMLs, not a public dataset. With hybrid BM25+vector shipped in #164 and RRF normalisation in #170, there's a real retrieval pipeline now, but no standardised end-to-end number to say *"this change improved recall from X to Y on a dataset anyone can reproduce."*

LongMemEval is that dataset. It's 500 questions over multi-session conversation haystacks with ground-truth labels for which sessions contain the answer. Mempalace's benchmark runner is ~800 lines but the part that matters is a tight loop — ingest this question's haystack, query, score — that maps cleanly onto `DistilleryStore.store` and `search`. Porting it gives you one number you can watch move whenever retrieval changes, and it lets you compare the new hybrid retrieval against raw dense as a publishable result.

<details>
<summary>Technical details — port plan, dataset shape, gotchas</summary>

**Where the port goes.** New file `src/distillery/eval/longmemeval.py`. For each question in the dataset: instantiate a fresh `DuckDBStore` on `:memory:` to match ChromaDB's ephemeral-per-question semantics, ingest `haystack_sessions` as entries, call `store.search(query, limit=50)`, map the returned `SearchResult.entry.metadata["session_id"]` back to `haystack_session_ids`, and score against `answer_session_ids`.
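The per-question loop might look like the sketch below. The `Store` protocol is a hypothetical simplification of the real `store` and `search` signatures, and `ToyStore` is a word-overlap ranker included only so the loop runs without an embedding provider; neither reflects distillery's actual classes.

```python
from typing import Callable, Dict, List, Protocol, Tuple


class Store(Protocol):
    def store(self, content: str, metadata: Dict[str, str]) -> None: ...
    def search(self, query: str, limit: int) -> List[Dict[str, str]]: ...


class ToyStore:
    """Word-overlap ranker, here only to make the loop runnable."""

    def __init__(self) -> None:
        self.rows: List[Tuple[str, Dict[str, str]]] = []

    def store(self, content: str, metadata: Dict[str, str]) -> None:
        self.rows.append((content, metadata))

    def search(self, query: str, limit: int) -> List[Dict[str, str]]:
        terms = set(query.lower().split())
        ranked = sorted(
            self.rows, key=lambda row: -len(terms & set(row[0].lower().split()))
        )
        return [meta for _, meta in ranked[:limit]]


def ranked_session_ids(
    question: dict, make_store: Callable[[], Store], limit: int = 50
) -> List[str]:
    """Ingest one question's haystack into a fresh store and return the
    session ids in ranked order, first occurrence only."""
    store = make_store()  # fresh store per question
    pairs = zip(question["haystack_session_ids"], question["haystack_sessions"])
    for session_id, turns in pairs:
        text = "\n".join(turn["content"] for turn in turns)
        store.store(text, {"session_id": session_id})
    ranked: List[str] = []
    for meta in store.search(question["question"], limit=limit):
        if meta["session_id"] not in ranked:
            ranked.append(meta["session_id"])
    return ranked
```

Storing one entry per session keeps the id mapping trivial. If sessions are chunked instead, the first-occurrence step is what collapses several chunk hits back into one session rank.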

**Dataset shape.** The LongMemEval dataset lives on HuggingFace at `xiaowu0162/longmemeval-cleaned`. Per-question JSON fields:

```
question, question_date
haystack_sessions        # list of [{role, content}, ...]
haystack_session_ids
haystack_dates
answer_session_ids       # ground truth
```
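Before running the loop it may be worth validating each record against the invariants the scoring relies on. The check below is a sketch: it assumes the three `haystack_*` lists are parallel and that every answer session appears in the haystack, and `check_question` is a hypothetical helper name.

```python
from typing import List

REQUIRED_FIELDS = (
    "question",
    "question_date",
    "haystack_sessions",
    "haystack_session_ids",
    "haystack_dates",
    "answer_session_ids",
)


def check_question(record: dict) -> List[str]:
    """Return a list of problems; an empty list means the record looks usable."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    if problems:
        return problems
    count = len(record["haystack_session_ids"])
    if len(record["haystack_sessions"]) != count or len(record["haystack_dates"]) != count:
        problems.append("haystack lists differ in length")
    stray = set(record["answer_session_ids"]) - set(record["haystack_session_ids"])
    if stray:
        problems.append(f"answer sessions not in haystack: {sorted(stray)}")
    return problems
```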

The scoring functions `dcg`, `ndcg`, and `evaluate_retrieval` can be lifted verbatim from `benchmarks/longmemeval_bench.py`. Mempalace is MIT and distillery is Apache-2.0, so the licences are compatible: keep the MIT copyright and permission notice with the lifted code and add a NOTICE attribution.
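For orientation, the textbook form of DCG and NDCG over a ranked relevance list is sketched below. This is the standard definition, not a copy of mempalace's functions, whose signatures and discount convention may differ.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top k ranks (rank 1 is undiscounted)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg(relevances: Sequence[float], k: int) -> float:
    """DCG divided by the DCG of the ideal ordering; 0.0 when nothing is relevant."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal else 0.0
```

Here `relevances` is the per-rank relevance of the returned sessions, 1 for a session in `answer_session_ids` and 0 otherwise.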

**Reusing `retrieval_scorer.py`.** The P@k / R@k / MRR math is already correct. The adapter is turning `answer_session_ids` into the `golden_labels=[{"entry_id": ..., "relevant": bool}]` shape it expects. No protocol changes.
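The adapter could be as small as the sketch below. It assumes the session id can serve as `entry_id` (in practice the store's own entry ids would need mapping back to sessions first), and the `recall_at_k` function is a stand-in to show the shape in use, not the implementation in `retrieval_scorer.py`.

```python
from typing import Dict, List, Sequence, Union

Label = Dict[str, Union[str, bool]]


def to_golden_labels(
    haystack_session_ids: Sequence[str], answer_session_ids: Sequence[str]
) -> List[Label]:
    """Mark each haystack session as relevant or not, in the scorer's label shape."""
    answers = set(answer_session_ids)
    return [{"entry_id": sid, "relevant": sid in answers} for sid in haystack_session_ids]


def recall_at_k(ranked_ids: Sequence[str], golden_labels: List[Label], k: int) -> float:
    """Fraction of relevant entries that appear in the top k results."""
    relevant = {label["entry_id"] for label in golden_labels if label["relevant"]}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_ids[:k])) / len(relevant)
```

One thing to confirm before comparing numbers: whether mempalace's R@5 is this fractional recall or an any-hit rate per question, since the two diverge on questions with several answer sessions.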

**Gotchas.**

- *Embedding cost.* 500 questions × ~53 sessions each is roughly 26,500 session embeddings per run, which is meaningful if every PR runs it through Jina or OpenAI. Should be a local-provider opt-in or a nightly job, not CI-gated. Mempalace hits 96.6% with ChromaDB's default on-device model, so adding a fastembed-based local provider under `src/distillery/embedding/` would make this benchmark cheap to run and would be useful independently.
- *Don't chase their 99.4%.* The higher number uses hybrid-v2 + Sonnet reranking with k=50, and recall at k=50 is structurally guaranteed whenever 50 exceeds the session count of the haystack. The honest v1 target is the 96.6% raw baseline. Score the hybrid BM25+vector from #164 alongside it as the useful comparison.
- *In-memory DuckDB vs filesystem.* Don't create a new file per question — `:memory:` and dropping/recreating tables between questions is the analogue of `chromadb.EphemeralClient`. Otherwise the fresh-store-per-question pattern is disk-bound and slow.

**Approximate scope.** ~300 lines: dataset loader, the question loop, the adapter to `retrieval_scorer.py`, and a CLI entry point. No changes to the store protocol.

</details>

Happy to write either PR. The benchmark port is the higher signal per line of code and gives you a regression number to watch; the miner is larger but the heuristics lift cleanly and the schema work is already done thanks to #191. I'd tackle the benchmark first and the miner second.

