Skip to content

Rank-aware recall gate + CodeMemEval coding-agent benchmark (Phase 1)#3

Merged
wuwangzhang1216 merged 1 commit into
mainfrom
claude/hopeful-meninsky-f9542a
Jun 9, 2026
Merged

Rank-aware recall gate + CodeMemEval coding-agent benchmark (Phase 1)#3
wuwangzhang1216 merged 1 commit into
mainfrom
claude/hopeful-meninsky-f9542a

Conversation

@wuwangzhang1216

Copy link
Copy Markdown
Contributor

Summary

Phase 1: optimize memory retrieval and establish the coding-agent vertical as OpenDB's home turf.

Retrieval optimization (the core change)

Memory recall's query gate discarded the #1 FTS hit whenever the question and the stored answer shared fewer than two significant tokens — the cause of all 16 LongMemEval R@5 misses (the correct evidence was always the top lexical match). The gate is now rank-aware in both backends (_sqlite_memory.py, _pg_memory.py): keep any result within 0.15× of the best FTS score, and only apply the token-overlap gate to the weaker tail.

Metric (gpt-5.4-mini reader, apples-to-apples) Before After
LongMemEval R@1 / R@5 / R@10 96.6% 100% (0 misses)
LongMemEval E2E 78.8% 84.8% (+6.0)
Memory stress 22/23 23/23

conflicting_dates now lets the newer fact win via time-decay. Regression test added (test_strong_fts_match_survives_query_gate_low_overlap). 277 tests pass.

CodeMemEval — new coding-agent memory benchmark

gen_codemem.py builds a LongMemEval-schema dataset for coding-agent memory (architecture decisions, conventions, API signatures, bug fixes, code locations, knowledge-updates, multi-session, abstention). Ground truth is hand-authored; the LLM only renders transcripts. Runs through the existing harness via --data codemem_dataset.json.

Result
Retrieval R@5 100%
E2E 92.6% (gpt-5.4-mini) / 96.3% (gpt-5.5)
Anti-hallucination (abstention) 100%
Median recall 0.8 ms

Benchmark infra + docs

  • LLM base_url is now env-configurable (OPENROUTER_BASE_URL) so the suite runs against any OpenAI-compatible endpoint; .env stays gitignored.
  • Refreshed stale committed result JSONs (R@K, memory stress) to reproduced runs.
  • README leads with coding-agent memory; REPORT adds Part 8 (CodeMemEval) + reproduced/optimized notes.
  • docs/hybrid-retrieval-plan.md documents Phase 2 (optional local-vector hybrid) — not implemented here.

Test plan

  • pytest tests/ → 277 passed
  • python benchmark/longmemeval_bench.py → R@5 100%
  • python benchmark/memory_stress_bench.py → 23/23
  • python benchmark/longmemeval_bench.py --data benchmark/codemem_dataset.json → R@5 100%

🤖 Generated with Claude Code

Phase 1: optimize memory retrieval and establish the coding-agent vertical.

Retrieval optimization (the core change):
- Memory recall's query gate discarded the #1 FTS hit whenever the question
  and the stored answer shared fewer than two significant tokens — the cause
  of all 16 LongMemEval R@5 misses (the correct evidence was always the top
  lexical match). Make the gate rank-aware in both backends
  (_sqlite_memory.py, _pg_memory.py): keep any result within 0.15x of the best
  FTS score, and only apply the token-overlap gate to the weaker tail.
- Impact (gpt-5.4-mini reader, apples-to-apples): R@1/R@5/R@10 96.6% -> 100%
  (0 misses); LongMemEval E2E 78.8% -> 84.8%; memory stress 22/23 -> 23/23
  (conflicting_dates now lets the newer fact win via time-decay). 277 tests pass.
- Regression test: test_strong_fts_match_survives_query_gate_low_overlap.

CodeMemEval (new benchmark, the coding-agent vertical):
- gen_codemem.py builds a LongMemEval-schema dataset for coding-agent memory
  (architecture decisions, conventions, API signatures, bug fixes, code
  locations, knowledge-updates, multi-session, abstention). Ground truth is
  hand-authored; the LLM only renders transcripts. Runs through the existing
  harness via --data codemem_dataset.json.
- Results: retrieval R@5 100%, E2E 92.6% (gpt-5.4-mini) / 96.3% (gpt-5.5),
  abstention 100%, 0.8ms recall.

Benchmark infra + docs:
- Make the LLM base_url env-configurable (OPENROUTER_BASE_URL) so the suite
  runs against any OpenAI-compatible endpoint; .env stays gitignored.
- Refresh stale committed result JSONs (R@K, memory stress) to reproduced runs.
- README: lead with coding-agent memory; REPORT: add Part 8 (CodeMemEval) and
  reproduced/optimized notes. docs/hybrid-retrieval-plan.md documents Phase 2
  (optional local-vector hybrid).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@wuwangzhang1216 wuwangzhang1216 merged commit 7bd324a into main Jun 9, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant