Rank-aware recall gate + CodeMemEval coding-agent benchmark (Phase 1) by wuwangzhang1216 · Pull Request #3 · opendb-ai/openDB

wuwangzhang1216 · 2026-06-09T02:30:33Z

Summary

Phase 1: optimize memory retrieval and establish the coding-agent vertical as OpenDB's home turf.

Retrieval optimization (the core change)

Memory recall's query gate discarded the #1 FTS hit whenever the question and the stored answer shared fewer than two significant tokens — the cause of all 16 LongMemEval R@5 misses (the correct evidence was always the top lexical match). The gate is now rank-aware in both backends (_sqlite_memory.py, _pg_memory.py): keep any result within 0.15× of the best FTS score, and only apply the token-overlap gate to the weaker tail.

Metric (gpt-5.4-mini reader, apples-to-apples)	Before	After
LongMemEval R@1 / R@5 / R@10	96.6%	100% (0 misses)
LongMemEval E2E	78.8%	84.8% (+6.0)
Memory stress	22/23	23/23

conflicting_dates now lets the newer fact win via time-decay. Regression test added (test_strong_fts_match_survives_query_gate_low_overlap). 277 tests pass.

CodeMemEval — new coding-agent memory benchmark

gen_codemem.py builds a LongMemEval-schema dataset for coding-agent memory (architecture decisions, conventions, API signatures, bug fixes, code locations, knowledge-updates, multi-session, abstention). Ground truth is hand-authored; the LLM only renders transcripts. Runs through the existing harness via --data codemem_dataset.json.

	Result
Retrieval R@5	100%
E2E	92.6% (gpt-5.4-mini) / 96.3% (gpt-5.5)
Anti-hallucination (abstention)	100%
Median recall	0.8 ms

Benchmark infra + docs

LLM base_url is now env-configurable (OPENROUTER_BASE_URL) so the suite runs against any OpenAI-compatible endpoint; .env stays gitignored.
Refreshed stale committed result JSONs (R@K, memory stress) to reproduced runs.
README leads with coding-agent memory; REPORT adds Part 8 (CodeMemEval) + reproduced/optimized notes.
docs/hybrid-retrieval-plan.md documents Phase 2 (optional local-vector hybrid) — not implemented here.

Test plan

pytest tests/ → 277 passed
python benchmark/longmemeval_bench.py → R@5 100%
python benchmark/memory_stress_bench.py → 23/23
python benchmark/longmemeval_bench.py --data benchmark/codemem_dataset.json → R@5 100%

🤖 Generated with Claude Code

Phase 1: optimize memory retrieval and establish the coding-agent vertical. Retrieval optimization (the core change): - Memory recall's query gate discarded the #1 FTS hit whenever the question and the stored answer shared fewer than two significant tokens — the cause of all 16 LongMemEval R@5 misses (the correct evidence was always the top lexical match). Make the gate rank-aware in both backends (_sqlite_memory.py, _pg_memory.py): keep any result within 0.15x of the best FTS score, and only apply the token-overlap gate to the weaker tail. - Impact (gpt-5.4-mini reader, apples-to-apples): R@1/R@5/R@10 96.6% -> 100% (0 misses); LongMemEval E2E 78.8% -> 84.8%; memory stress 22/23 -> 23/23 (conflicting_dates now lets the newer fact win via time-decay). 277 tests pass. - Regression test: test_strong_fts_match_survives_query_gate_low_overlap. CodeMemEval (new benchmark, the coding-agent vertical): - gen_codemem.py builds a LongMemEval-schema dataset for coding-agent memory (architecture decisions, conventions, API signatures, bug fixes, code locations, knowledge-updates, multi-session, abstention). Ground truth is hand-authored; the LLM only renders transcripts. Runs through the existing harness via --data codemem_dataset.json. - Results: retrieval R@5 100%, E2E 92.6% (gpt-5.4-mini) / 96.3% (gpt-5.5), abstention 100%, 0.8ms recall. Benchmark infra + docs: - Make the LLM base_url env-configurable (OPENROUTER_BASE_URL) so the suite runs against any OpenAI-compatible endpoint; .env stays gitignored. - Refresh stale committed result JSONs (R@K, memory stress) to reproduced runs. - README: lead with coding-agent memory; REPORT: add Part 8 (CodeMemEval) and reproduced/optimized notes. docs/hybrid-retrieval-plan.md documents Phase 2 (optional local-vector hybrid). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

wuwangzhang1216 merged commit 7bd324a into main Jun 9, 2026
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rank-aware recall gate + CodeMemEval coding-agent benchmark (Phase 1)#3

Rank-aware recall gate + CodeMemEval coding-agent benchmark (Phase 1)#3
wuwangzhang1216 merged 1 commit into
mainfrom
claude/hopeful-meninsky-f9542a

wuwangzhang1216 commented Jun 9, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

wuwangzhang1216 commented Jun 9, 2026

Summary

Retrieval optimization (the core change)

CodeMemEval — new coding-agent memory benchmark

Benchmark infra + docs

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant