[feature] accuracy benchmarks against LoCoMo and LongMemEval-S (public, in CI) #39

@ramonlimaramos

Problem

competitor byterover-cli reports 96.1% on LoCoMo and 92.8% on LongMemEval-S in their paper. synapto has zero published accuracy numbers. without them we can't back any accuracy claims in the README, can't show that HRR adds value over plain hybrid (vector + bm25) retrieval, and can't detect accuracy regressions when refactoring.

complementary to #10, which covers performance benchmarks (latency, throughput); this issue is about retrieval accuracy against published competitor baselines.

Proposed Solution

run synapto against two industry-standard memory benchmarks, publish results, automate in CI.

benchmarks

| benchmark | what it tests | published baselines |
|---|---|---|
| LoCoMo | long-context conversational memory; single-hop, multi-hop, temporal, open-domain | byterover 96.1%, mem0 ~70%, baseline RAG ~50% |
| LongMemEval-S | long-form memory recall over many sessions | byterover 92.8% |

implementation

  • new dir tests/benchmarks/accuracy/
  • harness loads dataset, calls recall for each query, scores against gold answers
  • 4 ablation runs per benchmark:
    1. vector-only (pgvector)
    2. vector + bm25 (2-way RRF)
    3. vector + bm25 + HRR (3-way RRF) ← our hypothesis
    4. 3-way + decay/trust weighting (full pipeline)
  • publish results table to README + a generated BENCHMARKS.md
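
a minimal sketch of the harness loop and the first three ablations, under stated assumptions. each retriever is a hypothetical callable that maps a query to a ranked list of memory ids (synapto's real recall interface may differ), RRF uses the commonly cited constant k = 60, and scoring is simplified to a top-k hit rate against a single gold id; the real benchmarks score answers, so the scorer would need to be swapped out. the decay/trust run is left out because it re-weights results rather than adding a ranker.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical retriever signature: query text -> ranked list of memory ids.
Ranker = Callable[[str], List[str]]


def rrf_fuse(rankings: Sequence[List[str]], k: int = 60) -> List[str]:
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank_d)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by id so runs are deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def hit_rate_at_k(
    rankers: Sequence[Ranker],
    dataset: Sequence[Tuple[str, str]],  # (query, gold memory id) pairs
    top_k: int = 5,
) -> float:
    """Fraction of queries whose gold id appears in the fused top_k."""
    if not dataset:
        raise ValueError("dataset is empty")
    hits = 0
    for query, gold_id in dataset:
        fused = rrf_fuse([ranker(query) for ranker in rankers])
        if gold_id in fused[:top_k]:
            hits += 1
    return hits / len(dataset)


# Ablation name -> which retrievers feed the fusion step.
ABLATIONS: Dict[str, List[str]] = {
    "vector-only": ["vector"],
    "vector+bm25": ["vector", "bm25"],
    "vector+bm25+hrr": ["vector", "bm25", "hrr"],
}


def run_ablations(
    retrievers: Dict[str, Ranker],
    dataset: Sequence[Tuple[str, str]],
    top_k: int = 5,
) -> Dict[str, float]:
    """Run every ablation over the same dataset and return its hit rate."""
    return {
        name: hit_rate_at_k([retrievers[part] for part in parts], dataset, top_k)
        for name, parts in ABLATIONS.items()
    }
```

keeping all ablations on one dataset pass and one scorer means any difference between rows comes from the retriever mix, not from harness drift.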

CI

  • nightly GitHub Action on main
  • post results as a PR comment when accuracy changes by more than 2pp in either direction
  • fail the PR if accuracy drops by more than 5pp, unless it carries an explicit [accuracy-regression-ok] label
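
the thresholds above could be sketched as a small gate function that the CI job calls after a run. the names here (`evaluate_gate`, `GateDecision`) are hypothetical, the baseline is assumed to be the last accuracy recorded on main, and the label is assumed to be stored without the square brackets.

```python
from dataclasses import dataclass
from typing import Iterable

COMMENT_THRESHOLD_PP = 2.0  # comment when |change| exceeds this
FAIL_THRESHOLD_PP = 5.0     # fail when the drop exceeds this
OVERRIDE_LABEL = "accuracy-regression-ok"  # assumed label name, no brackets


@dataclass(frozen=True)
class GateDecision:
    delta_pp: float     # current minus baseline, in percentage points
    post_comment: bool  # accuracy moved enough to be worth reporting
    fail: bool          # regression large enough to block the PR


def evaluate_gate(
    baseline_pct: float,
    current_pct: float,
    labels: Iterable[str] = (),
) -> GateDecision:
    """Apply the comment and fail thresholds to one benchmark score."""
    delta = current_pct - baseline_pct
    post_comment = abs(delta) > COMMENT_THRESHOLD_PP
    large_drop = -delta > FAIL_THRESHOLD_PP
    # The override label turns a blocking regression into a reported one.
    fail = large_drop and OVERRIDE_LABEL not in set(labels)
    return GateDecision(delta_pp=delta, post_comment=post_comment, fail=fail)
```

a CI step would exit non-zero when `fail` is true and post the comment when `post_comment` is true; a drop that is overridden by the label still gets a comment, so the regression stays visible.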

outputs

  • BENCHMARKS.md with full methodology + raw numbers + ablations
  • README badge: LoCoMo: 9X.X% | LongMemEval-S: XX.X%
  • blog post: "HRR adds N% on multi-hop queries" (conditional on it being true)
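
a rough sketch of how the BENCHMARKS.md table and the README badge text could be generated from a results dict. the function names and the input shape (configuration name mapped to per-benchmark percentages) are assumptions for illustration; no real scores are implied.

```python
from typing import Dict, List


def render_table(results: Dict[str, Dict[str, float]], benchmarks: List[str]) -> str:
    """Render one markdown row per configuration, one column per benchmark."""
    header = "| configuration | " + " | ".join(benchmarks) + " |"
    rule = "|---|" + "---:|" * len(benchmarks)
    rows = [
        "| " + config + " | "
        + " | ".join(f"{scores[b]:.1f}%" for b in benchmarks) + " |"
        for config, scores in results.items()
    ]
    return "\n".join([header, rule, *rows])


def render_badge(scores: Dict[str, float]) -> str:
    """Render the badge text in the 'Name: XX.X% | Name: XX.X%' shape."""
    return " | ".join(f"{name}: {pct:.1f}%" for name, pct in scores.items())
```

generating both outputs from the same results dict keeps the README badge from drifting away from the numbers in BENCHMARKS.md.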

Trade-offs

  • honesty risk: synapto might be worse than byterover on these. that's the point — we need to know. failure is acceptable; ignorance is not.
  • dataset licensing: confirm both datasets allow public benchmark publication (LoCoMo is MIT, LongMemEval-S is CC-BY).
  • CI cost: a full LoCoMo run takes ~15min with embeddings computed on CPU; limit runs to nightly + manual trigger.

Success criteria

  • numbers published in README with reproducible methodology
  • ablation table shows whether HRR pulls its weight (kill it if it doesn't)
  • regression-detection in CI
