Problem
competitor byterover-cli publishes 96.1% on LoCoMo and 92.8% on LongMemEval-S in their paper. synapto has zero published accuracy numbers. without them we cannot make claims in the README, can't prove HRR adds value over plain hybrid, and can't detect accuracy regressions when refactoring.
complementary to #10 (which is performance benchmarks — latency, throughput); this issue is about retrieval accuracy against published competitor baselines.
Proposed Solution
run synapto against two industry-standard memory benchmarks, publish results, automate in CI.
benchmarks
| benchmark | what it tests | published baselines |
| --- | --- | --- |
| LoCoMo | long-context conversational memory; single-hop, multi-hop, temporal, open-domain | byterover 96.1%, mem0 ~70%, baseline RAG ~50% |
| LongMemEval-S | long-form memory recall over many sessions | byterover 92.8% |
implementation
- new dir `tests/benchmarks/accuracy/`
- harness loads dataset, calls `recall` for each query, scores against gold answers
- 4 ablation runs per benchmark:
- vector-only (pgvector)
- vector + bm25 (2-way RRF)
- vector + bm25 + HRR (3-way RRF) ← our hypothesis
- 3-way + decay/trust weighting (full pipeline)
- publish results table to README + a generated `BENCHMARKS.md`
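
a minimal sketch of how the first three ablation runs could be wired up, assuming each retriever returns a ranked list of memory ids. everything named here (`Retriever`, `ABLATIONS`, `rrf_fuse`, `hit_rate`) is hypothetical and not synapto's actual API. `k = 60` is the constant commonly used for reciprocal rank fusion, not a value taken from synapto. the hit-rate scoring is a stand-in: the real harness would score against each benchmark's gold answers with that benchmark's own metric. the fourth run (decay/trust weighting) is left out because the weighting scheme isn't specified in this issue.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

# Assumption: a retriever maps a query to a ranked list of memory ids, best first.
Retriever = Callable[[str], List[str]]

# Hypothetical names for the first three ablation runs listed above.
ABLATIONS: Dict[str, Sequence[str]] = {
    "vector_only": ("vector",),
    "vector_bm25": ("vector", "bm25"),
    "vector_bm25_hrr": ("vector", "bm25", "hrr"),
}


def rrf_fuse(rankings: Sequence[List[str]], k: int = 60) -> List[str]:
    """Reciprocal rank fusion: score(d) = sum over rankings of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def hit_rate(
    retrievers: Dict[str, Retriever],
    names: Sequence[str],
    dataset: Sequence[Tuple[str, str]],
    top_k: int = 5,
) -> float:
    """Fraction of (query, gold_id) pairs whose gold id is in the fused top_k."""
    hits = 0
    for query, gold_id in dataset:
        fused = rrf_fuse([retrievers[name](query) for name in names])
        hits += gold_id in fused[:top_k]
    return hits / len(dataset)
```

running `hit_rate` once per entry in `ABLATIONS` on the same dataset gives one column of the ablation table per configuration, so the only thing that varies between runs is which rankings get fused.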
CI
- nightly GitHub Action on `main`
- post results as PR comment when accuracy changes by more than 2pp in either direction
- fail PR if accuracy drops > 5pp without explicit `[accuracy-regression-ok]` label
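
the comment/fail rules above could be sketched as a small pure function that the CI job calls after each run. the function name `gate`, its return shape, and the idea of comparing against a stored baseline percentage are assumptions for illustration. the thresholds and the label text come from the bullets above.

```python
from typing import Dict, Iterable

COMMENT_THRESHOLD_PP = 2.0  # comment when accuracy moves more than this, up or down
FAIL_THRESHOLD_PP = 5.0     # fail when accuracy drops more than this
# Label text copied verbatim from this issue; adjust if the real label has no brackets.
OVERRIDE_LABEL = "[accuracy-regression-ok]"


def gate(baseline_pct: float, current_pct: float, labels: Iterable[str]) -> Dict[str, object]:
    """Decide whether to comment on or fail a PR given two accuracy percentages."""
    delta = current_pct - baseline_pct
    should_comment = abs(delta) > COMMENT_THRESHOLD_PP
    dropped_too_far = -delta > FAIL_THRESHOLD_PP
    should_fail = dropped_too_far and OVERRIDE_LABEL not in set(labels)
    return {"delta_pp": delta, "comment": should_comment, "fail": should_fail}
```

keeping the decision separate from the GitHub API calls means the thresholds can be unit-tested without running a benchmark.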
outputs
- `BENCHMARKS.md` with full methodology + raw numbers + ablations
- README badge: `LoCoMo: 9X.X% | LongMemEval-S: XX.X%`
- blog post: "HRR adds N% on multi-hop queries" (conditional on it being true)
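
one possible way to generate the ablation table and the badge text from a results mapping, so the numbers in `BENCHMARKS.md` and the README come from the same source. the function names and the `results` shape (benchmark name to ablation name to accuracy in percent) are hypothetical.

```python
from typing import Dict, List

Results = Dict[str, Dict[str, float]]  # benchmark -> ablation -> accuracy in percent


def render_table(results: Results) -> str:
    """Render a markdown table with one row per benchmark and one column per ablation."""
    ablations: List[str] = []
    for per_benchmark in results.values():
        for name in per_benchmark:
            if name not in ablations:
                ablations.append(name)  # keep first-seen column order
    header = "| benchmark | " + " | ".join(ablations) + " |"
    divider = "|" + " --- |" * (len(ablations) + 1)
    rows = []
    for benchmark, scores in results.items():
        # Missing runs are shown as n/a rather than silently dropped.
        cells = [f"{scores[a]:.1f}%" if a in scores else "n/a" for a in ablations]
        rows.append(f"| {benchmark} | " + " | ".join(cells) + " |")
    return "\n".join([header, divider, *rows])


def render_badge(results: Results, ablation: str) -> str:
    """Render the README badge text for one ablation (e.g. the full pipeline)."""
    return " | ".join(
        f"{benchmark}: {scores[ablation]:.1f}%"
        for benchmark, scores in results.items()
        if ablation in scores
    )
```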
Trade-offs
- honesty risk: synapto might be worse than byterover on these. that's the point — we need to know. failure is acceptable; ignorance is not.
- dataset licensing: confirm both datasets allow public benchmark publication (LoCoMo is MIT, LongMemEval-S is CC-BY).
- CI cost: full LoCoMo run takes ~15min on CPU embeddings; cap to nightly + manual trigger.
Success criteria
- numbers published in README with reproducible methodology
- ablation table shows whether HRR pulls its weight (kill it if it doesn't)
- regression-detection in CI
References