[feature] accuracy benchmarks against LoCoMo and LongMemEval-S (public, in CI) #39

## Problem

the competitor byterover-cli reports 96.1% on LoCoMo and 92.8% on LongMemEval-S in its paper. synapto has no published accuracy numbers. without them we can't make claims in the README, can't show that HRR adds value over plain hybrid retrieval, and can't detect accuracy regressions when refactoring.

complementary to #10 (which is **performance** benchmarks — latency, throughput); this issue is about **retrieval accuracy** against published competitor baselines.

## Proposed Solution

run synapto against two industry-standard memory benchmarks, publish results, automate in CI.

### benchmarks

| benchmark | what it tests | published baselines |
|-----------|---------------|--------------------|
| **LoCoMo** | long-context conversational memory; single-hop, multi-hop, temporal, open-domain | byterover 96.1%, mem0 ~70%, baseline RAG ~50% |
| **LongMemEval-S** | long-form memory recall over many sessions | byterover 92.8% |

### implementation

- new dir `tests/benchmarks/accuracy/`
- harness loads dataset, calls `recall` for each query, scores against gold answers
- 4 ablation runs per benchmark:
  1. vector-only (pgvector)
  2. vector + bm25 (2-way RRF)
  3. vector + bm25 + HRR (3-way RRF) ← our hypothesis
  4. 3-way + decay/trust weighting (full pipeline)
- publish results table to README + a generated `BENCHMARKS.md`
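
a minimal sketch of how ablations 1-3 and the scoring loop could fit together, assuming each retriever returns a ranked list of document ids. `ABLATIONS`, `rrf_fuse`, `accuracy`, `recall_fn` and `is_correct` are hypothetical names, not synapto's actual API, and `k = 60` is only the commonly used RRF constant, not a tuned value. ablation 4 (decay/trust weighting) is left out because its scoring isn't specified in this issue.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical mapping: ablation name -> retrievers whose rankings get fused.
ABLATIONS: Dict[str, List[str]] = {
    "vector_only": ["vector"],
    "vector_bm25": ["vector", "bm25"],
    "vector_bm25_hrr": ["vector", "bm25", "hrr"],
}


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank(d))."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so runs are deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def accuracy(
    examples: Sequence[Tuple[str, str]],
    recall_fn: Callable[[str], str],
    is_correct: Callable[[str, str], bool],
) -> float:
    """Fraction of (query, gold) pairs that the scorer accepts."""
    if not examples:
        return 0.0
    hits = sum(1 for query, gold in examples if is_correct(recall_fn(query), gold))
    return hits / len(examples)
```

`is_correct` is deliberately pluggable: each benchmark defines its own scoring protocol, so the harness should follow that protocol rather than hard-code exact match.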

### CI

- nightly GitHub Action on `main`
- post results as a PR comment when accuracy moves by more than 2pp in either direction
- fail the PR if accuracy drops by more than 5pp, unless it carries an explicit `[accuracy-regression-ok]` label
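
the thresholds above could be encoded as a small gate that the workflow calls after a benchmark run. this is a sketch only: `gate` and `GateDecision` are hypothetical names, and fetching the baseline score and the PR labels is assumed to happen elsewhere in the workflow.

```python
from dataclasses import dataclass
from typing import Iterable

COMMENT_THRESHOLD_PP = 2.0  # comment when |delta| exceeds this
FAIL_THRESHOLD_PP = 5.0     # fail when the drop exceeds this
OVERRIDE_LABEL = "[accuracy-regression-ok]"


@dataclass(frozen=True)
class GateDecision:
    delta_pp: float      # current minus baseline, in percentage points
    post_comment: bool   # accuracy moved enough to be worth reporting
    fail: bool           # regression large enough to block the PR


def gate(
    baseline_pct: float, current_pct: float, labels: Iterable[str] = ()
) -> GateDecision:
    """Apply the 2pp comment rule and the 5pp fail rule to one benchmark."""
    delta = current_pct - baseline_pct
    post_comment = abs(delta) > COMMENT_THRESHOLD_PP
    overridden = OVERRIDE_LABEL in set(labels)
    fail = delta < -FAIL_THRESHOLD_PP and not overridden
    return GateDecision(round(delta, 2), post_comment, fail)
```

both comparisons are strict, so a drop of exactly 5pp passes; if the intent is "5pp or more", switch to `<=`.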

### outputs

- `BENCHMARKS.md` with full methodology + raw numbers + ablations
- README badge: `LoCoMo: 9X.X% | LongMemEval-S: XX.X%`
- blog post: "HRR adds N% on multi-hop queries" (conditional on it being true)
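
a possible way to generate the badge text and the ablation table from harness output, so that README and `BENCHMARKS.md` never drift from the measured numbers. `render_badge` and `render_ablation_table` are hypothetical helpers, and the input shape (ablation name mapped to per-benchmark percentages) is an assumption about what the harness emits.

```python
from typing import Dict, List


def render_badge(scores: Dict[str, float]) -> str:
    """Format scores as 'LoCoMo: 12.3% | LongMemEval-S: 45.6%' style text."""
    return " | ".join(f"{name}: {pct:.1f}%" for name, pct in scores.items())


def render_ablation_table(
    results: Dict[str, Dict[str, float]], benchmarks: List[str]
) -> str:
    """Render one markdown row per ablation, one column per benchmark."""
    header = "| ablation | " + " | ".join(benchmarks) + " |"
    divider = "|" + "---|" * (len(benchmarks) + 1)
    rows = []
    for ablation, scores in results.items():
        # Missing runs are shown as n/a instead of being silently dropped.
        cells = [f"{scores[b]:.1f}%" if b in scores else "n/a" for b in benchmarks]
        rows.append(f"| {ablation} | " + " | ".join(cells) + " |")
    return "\n".join([header, divider, *rows])
```

rendering `n/a` for a missing run keeps a partial nightly result visible rather than letting the table shrink without notice.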

## Trade-offs

- **honesty risk**: synapto might be worse than byterover on these. that's the point — we need to know. **failure is acceptable; ignorance is not.**
- **dataset licensing**: confirm both datasets allow public benchmark publication (LoCoMo is believed to be MIT and LongMemEval-S CC-BY; verify against each repo's license file before publishing).
- **CI cost**: full LoCoMo run takes ~15min on CPU embeddings; cap to nightly + manual trigger.

## Success criteria

- numbers published in README with reproducible methodology
- ablation table shows whether HRR pulls its weight (kill it if it doesn't)
- regression-detection in CI

## References

- LoCoMo: https://arxiv.org/abs/2402.17753
- LongMemEval: https://arxiv.org/abs/2410.10813
- byterover paper: https://arxiv.org/abs/2604.01599
