Rank-aware recall gate + CodeMemEval coding-agent benchmark (Phase 1) #3
Merged
Phase 1: optimize memory retrieval and establish the coding-agent vertical.

Retrieval optimization (the core change):
- Memory recall's query gate discarded the #1 FTS hit whenever the question and the stored answer shared fewer than two significant tokens; this was the cause of all 16 LongMemEval R@5 misses (the correct evidence was always the top lexical match). Make the gate rank-aware in both backends (_sqlite_memory.py, _pg_memory.py): keep any result within 0.15x of the best FTS score, and only apply the token-overlap gate to the weaker tail.
- Impact (gpt-5.4-mini reader, apples-to-apples): R@1/R@5/R@10 96.6% -> 100% (0 misses); LongMemEval E2E 78.8% -> 84.8%; memory stress 22/23 -> 23/23 (conflicting_dates now lets the newer fact win via time-decay). 277 tests pass.
- Regression test: test_strong_fts_match_survives_query_gate_low_overlap.

CodeMemEval (new benchmark, the coding-agent vertical):
- gen_codemem.py builds a LongMemEval-schema dataset for coding-agent memory (architecture decisions, conventions, API signatures, bug fixes, code locations, knowledge-updates, multi-session, abstention). Ground truth is hand-authored; the LLM only renders transcripts. Runs through the existing harness via --data codemem_dataset.json.
- Results: retrieval R@5 100%, E2E 92.6% (gpt-5.4-mini) / 96.3% (gpt-5.5), abstention 100%, 0.8ms recall.

Benchmark infra + docs:
- Make the LLM base_url env-configurable (OPENROUTER_BASE_URL) so the suite runs against any OpenAI-compatible endpoint; .env stays gitignored.
- Refresh stale committed result JSONs (R@K, memory stress) to reproduced runs.
- README: lead with coding-agent memory; REPORT: add Part 8 (CodeMemEval) and reproduced/optimized notes. docs/hybrid-retrieval-plan.md documents Phase 2 (optional local-vector hybrid).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
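The rank-aware gate described in the commit message can be sketched roughly as follows. This is an illustration only, not the code in _sqlite_memory.py or _pg_memory.py: the Hit type, STOPWORDS, significant_tokens, and the parameter names keep_ratio and min_overlap are hypothetical. Scores are assumed to be normalized so that higher is better (a lower-is-better score such as SQLite FTS5's bm25() would need flipping first), and "within 0.15x of the best FTS score" is read here as score >= 0.15 * best, which may not match the actual implementation.

```python
import re
from dataclasses import dataclass

# Hypothetical stopword list; the backends' real notion of a
# "significant" token is not shown in this PR.
STOPWORDS = frozenset(
    {"the", "and", "for", "was", "what", "when", "where", "did", "does", "with"}
)


@dataclass
class Hit:
    text: str
    score: float  # assumption: higher is better


def significant_tokens(text: str) -> set[str]:
    """Lowercased word tokens longer than two characters, minus stopwords."""
    words = re.findall(r"[a-z0-9_]+", text.lower())
    return {w for w in words if len(w) > 2 and w not in STOPWORDS}


def rank_aware_gate(
    query: str,
    hits: list[Hit],
    keep_ratio: float = 0.15,  # assumed reading of "within 0.15x of the best"
    min_overlap: int = 2,      # "fewer than two significant tokens" fails the gate
) -> list[Hit]:
    """Keep strong lexical matches unconditionally; gate only the weaker tail."""
    if not hits:
        return []
    best = max(h.score for h in hits)
    query_tokens = significant_tokens(query)
    kept = []
    for h in hits:
        strong = best > 0 and h.score >= keep_ratio * best
        overlap = len(query_tokens & significant_tokens(h.text))
        if strong or overlap >= min_overlap:
            kept.append(h)
    return kept
```

With this reading, a top-ranked hit whose wording shares no tokens with the question (the failure mode behind the 16 misses) survives because it is the best score, while a low-scoring hit still has to clear the two-token overlap check.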
Summary
Phase 1: optimize memory retrieval and establish the coding-agent vertical as OpenDB's home turf.
Retrieval optimization (the core change)
Memory recall's query gate discarded the #1 FTS hit whenever the question and the stored answer shared fewer than two significant tokens, which caused all 16 LongMemEval R@5 misses (the correct evidence was always the top lexical match). The gate is now rank-aware in both backends (_sqlite_memory.py, _pg_memory.py): keep any result within 0.15× of the best FTS score, and only apply the token-overlap gate to the weaker tail.
The memory stress case conflicting_dates now lets the newer fact win via time-decay. Regression test added (test_strong_fts_match_survives_query_gate_low_overlap). 277 tests pass.
CodeMemEval: new coding-agent memory benchmark
gen_codemem.py builds a LongMemEval-schema dataset for coding-agent memory (architecture decisions, conventions, API signatures, bug fixes, code locations, knowledge-updates, multi-session, abstention). Ground truth is hand-authored; the LLM only renders transcripts. Runs through the existing harness via --data codemem_dataset.json.
Benchmark infra + docs
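As a rough picture of the env-configurable base_url in this section: a minimal sketch, assuming the harness reads OPENROUTER_BASE_URL once at startup and falls back to a default. The function name resolve_base_url and the fallback value are assumptions for illustration; the PR does not show the actual default or where the lookup lives.

```python
import os
from typing import Mapping, Optional

# Assumed fallback; the PR does not state which default the suite uses.
DEFAULT_BASE_URL = "https://openrouter.ai/api/v1"


def resolve_base_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the endpoint from OPENROUTER_BASE_URL, or the assumed default."""
    env = os.environ if env is None else env
    url = env.get("OPENROUTER_BASE_URL", "").strip()
    # Drop a trailing slash so later path joins do not produce "//".
    return (url or DEFAULT_BASE_URL).rstrip("/")
```

Passing the environment in as a mapping keeps the lookup testable without mutating os.environ.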
- base_url is now env-configurable (OPENROUTER_BASE_URL) so the suite runs against any OpenAI-compatible endpoint; .env stays gitignored.
- docs/hybrid-retrieval-plan.md documents Phase 2 (optional local-vector hybrid), which is not implemented here.
Test plan
- pytest tests/ → 277 passed
- python benchmark/longmemeval_bench.py → R@5 100%
- python benchmark/memory_stress_bench.py → 23/23
- python benchmark/longmemeval_bench.py --data benchmark/codemem_dataset.json → R@5 100%

🤖 Generated with Claude Code
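For readers checking the R@5 figures in the test plan, the metric can be sketched as below. This assumes the "any hit" definition (a question counts as recalled when at least one evidence session appears in the top k results); the benchmark scripts may define it differently, and the names hit_at_k and recall_at_k are hypothetical.

```python
from typing import Iterable, Sequence, Tuple


def hit_at_k(ranked_ids: Sequence[str], gold_ids: Iterable[str], k: int) -> bool:
    """True when any gold evidence id appears among the first k retrieved ids."""
    return bool(set(ranked_ids[:k]) & set(gold_ids))


def recall_at_k(
    questions: Sequence[Tuple[Sequence[str], Sequence[str]]], k: int
) -> float:
    """Fraction of questions with at least one gold id in the top k.

    Each question is a (ranked_ids, gold_ids) pair.
    """
    if not questions:
        return 0.0
    hits = sum(hit_at_k(ranked, gold, k) for ranked, gold in questions)
    return hits / len(questions)
```

Under this definition, "R@5 100%" means every question had at least one piece of its evidence among the first five recalled items.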