Skip to content

Latest commit

 

History

History
executable file
·
402 lines (293 loc) · 18.7 KB

File metadata and controls

executable file
·
402 lines (293 loc) · 18.7 KB

MemoryBench

A standardized, extensible benchmark for memory and continual learning in LLM systems.

HF Dataset HF Dataset Full arXiv License ICML 2026 Python 3.10+

Quick StartDatasetsBaselinesExperimentsFrontendExtendingCitation


📢 News

  • 2026-05-26 — 🌟 Accepted to ICML 2026 as a Spotlight paper.
  • 2026-04-15 — Streamlit frontend released. Configure and run experiments without touching any YAML. See frontend/README.md.
  • 2025-12-08 — Extended version released: THUIR/MemoryBench-Full.
  • 2025-12-05 — User-feedback simulator upgraded to Mistral-Small-3.2-24B-Instruct-2506.

🔍 Overview

Scaling data, parameters, and test-time compute is hitting diminishing returns for LLM systems (LLMsys). MemoryBench evaluates a complementary axis: can LLM systems learn from accumulated user feedback during service time? Memory and continual-learning frameworks claim to enable this, but most existing benchmarks reduce the problem to long-form reading comprehension — a poor proxy for real feedback-driven adaptation.

MemoryBench tests the harder regime: multi-task, multi-domain, multilingual evaluation with simulated user feedback, across both off-policy (replay pre-recorded dialogs) and on-policy (generate dialogs on the fly) settings.

Highlights

  • 28 datasets across 3 domains (Academic & Knowledge, Legal, Open-Domain) and 4 task shapes (Long-Long, Long-Short, Short-Long, Short-Short).
  • 8 memory-system baselines with a one-call registry interface (vanilla, BM25-M/S, Emb-M/S, A-Mem, Mem0, MemoryOS).
  • 4 experiment regimes: off-policy, stepwise off-policy, on-policy, and training-set performance.
  • User-feedback simulator based on Mistral-Small-3.2-24B-Instruct-2506.
  • LLM providers: vLLM, OpenAI-compatible, and Anthropic — wired through one LlmFactory.
  • Streamlit frontend with conditional UI and explicit dataset-path support.
  • Plug-and-play extension via a single registry entry. See CONTRIBUTING.md.

This repository hosts the lightweight benchmark interface and baseline implementations. The full reproduction code for the paper lives at LittleDinoC/MemoryBench-code.


📚 Table of Contents


🚀 Quick Start

Installation

conda create -n memorybench python=3.10 -y
conda activate memorybench

git clone https://github.com/QingyaoAi/MemoryBench.git
cd MemoryBench

pip install -r requirements.txt
pip install -e baselines/mem0          # editable install required by Mem0

python -c "import nltk; [nltk.download(p) for p in ('punkt','wordnet','stopwords')]"

Smoke test (no API keys, no downloads)

python smoke_test.py

Hello World

from memorybench import load_memory_bench, evaluate, summary_results

dataset = load_memory_bench(dataset_type="single", name="JRE-L")
predicts = [
    {"test_idx": int(row["test_idx"]), "response": "...", "dataset": "JRE-L"}
    for row in dataset.dataset["test"]
]
details = evaluate("single", "JRE-L", predicts)
print(summary_results("single", "JRE-L", predicts, details)["summary"])

Run an experiment

# Off-policy with BM25 on the Open-Domain split
python -m src.off-policy \
    --memory_system bm25_message \
    --dataset_type domain \
    --set_name Open-Domain

Develop offline against the TinyDataset

# The TinyDataset ships 3 train + 2 test rows per dataset; no HF download needed.
export MEMORY_BENCH_PATH=$(pwd)/../TinyDataset
python -m src.off-policy --memory_system bm25_message --dataset_type single --set_name Locomo-0
python -m unittest tests.test_refactor -v

📊 Datasets

The full dataset is on Hugging Face: THUIR/MemoryBench.

Domain Task Shape Datasets
Open-Domain Long-Short Locomo-0 … Locomo-9, DialSim-friends, DialSim-bigbang, DialSim-theoffice
Open-Domain Long-Long HelloBench-Creative&Design, WritingBench-Creative&Design
Open-Domain Short-Long WritingPrompts
Open-Domain Short-Short NFCats
Academic & Knowledge Long-Short LimitGen-Syn, IdeaBench
Academic & Knowledge Long-Long HelloBench-Academic&Knowledge-Writing, WritingBench-Academic&Engineering
Academic & Knowledge Short-Long HelloBench-Academic&Knowledge-QA
Academic & Knowledge Short-Short JRE-L
Legal Long-Short LexEval-Summarization
Legal Long-Long LexEval-Judge, WritingBench-Politics&Law
Legal Short-Long JuDGE
Legal Short-Short LexEval-QA

The full list (28 datasets) lives in configs/datasets/each.json; domain and task groupings are in domain.json / task.json.

Corpus datasets. LoCoMo and DialSim ship a multi-session conversation corpus that the memory system must ingest before answering. MemoryBench dispatches per-corpus loading by an attribute on the dataset class:

class Locomo_Dataset(BaseDataset):
    corpus_format = "locomo"      # → solver.memory_locomo_conversation
    summary_group_name = "Locomo" # → collapse Locomo-0..9 under one normalization key

🧠 Baselines

All baselines are registered in src/memory_systems.py. The runner CLI (--memory_system <name>), the frontend dropdown, and the run scripts all derive their lists from this single source of truth.

Paper Name Code Name Type Config File
Vanilla wo_memory No memory (baseline) base.json
BM25-M bm25_message Lexical, message-level bm25.json
BM25-S bm25_dialog Lexical, session-level bm25.json
Emb-M embedder_message Dense, message-level embedder.json
Emb-S embedder_dialog Dense, session-level embedder.json
A-Mem a_mem Note-based associative a_mem.json
Mem0 mem0 Fact-extraction memory mem0.json
MemoryOS memoryos Hierarchical OS-style memoryos.json

Upstream sources for a_mem, mem0, and memoryos are vendored under baselines/.


🧪 Experiments

MemoryBench evaluates memory systems under four complementary regimes. Each one ships with both a per-experiment Python entry point (src/<experiment>.py) and a sweep driver (run_scripts/<experiment>.py) that iterates every registered memory system.

Regime Train→Memory Test access When to use Entry point
Off-policy Bulk replay Read only Compare baselines on a fixed training-dialog corpus python -m src.off-policy
Stepwise off-policy Replay in batches Read between batches Track scaling with training data python -m src.stepwise_off-policy
On-policy Live generation Read between steps Realistic continual-learning loop python -m src.on-policy
Training perf. Bulk replay Re-eval on train Detect overfit / catastrophic forgetting python -m src.train_performance

Common arguments: --memory_system <name>, --dataset_type single|domain|task, --set_name <name>. See --help on any entry point for the full list.

Example: off-policy run on the Open-Domain split
python -m src.off-policy \
    --memory_system bm25_message \
    --dataset_type domain \
    --set_name Open-Domain \
    --retrieve_k 5

Results are written to off-policy/results/domain/Open-Domain/bm25_message/start_at_<timestamp>/.

Example: full sweep across all baselines × domains
python run_scripts/off-policy.py

The sweep iterates memory_systems.all_names() × domain.json and task.json, automatically skipping known-incompatible combinations declared in the registry (e.g. mem0 on Open-Domain).

Example: on-policy with live feedback generation
python -m src.on-policy \
    --memory_system mem0 \
    --dataset_type domain \
    --set_name Legal \
    --step 10 --batch_size 100 --max_rounds 3
Default vLLM deployment (only required to reproduce paper results)
vllm serve Qwen/Qwen3-32B  --port 12345 --chat-template qwen3_nonthinking.jinja   # Main LLM
vllm serve Qwen/Qwen3-8B   --port 12366 --chat-template qwen3_nonthinking.jinja   # Memory-system LLM
vllm serve Qwen/Qwen3-Embedding-0.6B --port 12377 --task embed                    # Embedder
vllm serve AQuarterMile/WritingBench-Critic-Model-Qwen-7B --port 12388            # WritingBench evaluator

With these ports the default configs/memory_systems/*.json files work as-is.


🐍 Python API

The benchmark exposes three top-level functions in memorybench.py.

load_memory_bench(dataset_type, name, eval_mode=False)

Returns a BaseDataset (when dataset_type="single") or a list[BaseDataset] (for "domain" / "task").

ds = load_memory_bench("single", "JRE-L")
ds.dataset_name           # "JRE-L"
ds.dataset                # HF DatasetDict with "train" and "test" splits
ds.has_corpus             # bool — True for LoCoMo/DialSim
ds.get_data(test_idx=42)  # → row dict

evaluate(dataset_type, name, predicts) → list[dict]

predicts = [{"test_idx": 0, "response": "...", "dataset": "JRE-L"}, ...]
details  = evaluate("single", "JRE-L", predicts)
# [{"dataset": "JRE-L", "test_idx": 0, "metrics": {"Rouge-L": ..., ...}}, ...]

summary_results(dataset_type, name, predicts, evaluate_details)

Mean metrics for a single dataset; min-max-normalized + z-normalized aggregates for a domain or task.

summary = summary_results("domain", "Open-Domain", predicts, details)
summary["summary"]["weighted_average"]
summary["minmax_normalized_average"]

Local dataset path

By default load_memory_bench pulls from THUIR/MemoryBench and caches under ~/.cache/huggingface/. Override either via MEMORY_BENCH_PATH=/abs/path/to/local/dataset or the Dataset source selector in the frontend.


🖥 Frontend

python -m streamlit run frontend/streamlit_app.py
# → http://localhost:8501

The frontend covers off-policy and on-policy runs end to end. It auto-hides irrelevant fields:

  • LLM provider dropdown is filtered per baseline — mem0 / a_mem / memoryos don't expose the Anthropic option because they route through their own provider abstractions.
  • LLM base URL default updates when you switch providers (vllm / openai / anthropic).
  • Embedder section only appears for baselines that consume embeddings (embedder_*, mem0).
  • Retrieve k is hidden for wo_memory.
  • Dataset source is an explicit radio: Hugging Face Hub vs Local path, with live path validation.

See frontend/README.md for the full walkthrough.


🧰 Extending MemoryBench

Adding a new baseline or dataset is a single-file change plus one registry entry.

Add a new memory-system baseline:

  1. Write src/agent/<name>.py (agent + pydantic config) and src/solver/<name>.py (solver).
  2. Drop configs/memory_systems/<name>.json.
  3. Add one register(MemorySystemSpec(...)) call in src/memory_systems.py.

Everything else — CLI choices, frontend dropdowns, sweep scripts, dialog-field lookups, skip rules — picks the new entry up automatically.

Add a new dataset: subclass BaseDataset, add one entry to configs/datasets/each.json, and (for corpus-style datasets) set corpus_format = "<name>" on the class.

Full step-by-step walkthrough: CONTRIBUTING.md.

The parametric test tests/test_refactor.py::TestAllBaselinesContract walks every registered baseline and asserts the off-policy + on-policy method contract — your new baseline is auto-tested.


🏗 Repository Layout

MemoryBench/
├── memorybench.py              # Public API: load_memory_bench, evaluate, summary_results
├── configs/
│   ├── datasets/               # each.json, domain.json, task.json
│   ├── memory_systems/         # one JSON per baseline
│   └── final_evaluate_summary_wo_details.json  # min/max/mu/sigma stats
├── src/
│   ├── memory_systems.py       # ← central registry of baselines
│   ├── dataset/                # BaseDataset + per-dataset subclasses
│   ├── agent/                  # Agent implementations
│   ├── solver/                 # Per-baseline solvers
│   ├── llms/                   # OpenAI / vLLM / Anthropic clients
│   ├── generate_dialogs/       # Dialog-generation scripts
│   ├── off-policy.py · on-policy.py · stepwise_off-policy.py · train_performance.py
│   └── utils.py
├── run_scripts/                # Sweep drivers (loops over every registered baseline)
├── baselines/                  # Vendored upstream baselines (mem0, A-Mem, MemoryOS)
├── frontend/                   # Streamlit app
├── tests/                      # Unit + integration tests
├── CONTRIBUTING.md             # How to add baselines / datasets
└── README.md

📝 Notes & Caveats

  • bert_score truncation bug. Some datasets (e.g. JRE-L) evaluate with bert_score. Locally-loaded models don't truncate inputs — load from Hugging Face Hub to avoid "exceeding max length" errors.
  • WritingBench evaluator. Long-form writing datasets use a 7 B critic; we recommend serving WritingBench-Critic-Model-Qwen-7B via vLLM and pointing WRITINGBENCH_EVAL_BASE_URL at it.
  • Mem0 cost. mem0 is slow on Open-Domain and Long-Short; the run scripts skip these combinations by default — skip_combinations in the registry entry.
  • Secrets. API_config.json, .env* (except .env.example), and frontend/runtime_configs/ are all gitignored — see .gitignore.

Citation

If you use MemoryBench in your research, please cite:

@inproceedings{memorybench2026,
  title     = {MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems},
  author    = {Qingyao Ai, Yichen Tang, Changyue Wang, Jianming Long, Weihang Su, Yiqun Liu},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  year      = {2026},
  note      = {Spotlight}
}

(Full BibTeX will be updated once the camera-ready DOI is available.)


License

Released under the MIT License. Upstream baseline code under baselines/ retains its original license — see each subdirectory's LICENSE file.


Acknowledgements

MemoryBench builds on prior datasets and memory systems from many open-source efforts: LoCoMo, DialSim, HelloBench, WritingBench, IdeaBench, LimitGen, JRE-L, JuDGE, LexEval, NFCats, WritingPrompts, A-Mem, Mem0, and MemoryOS. Thank you to all upstream authors.

For questions and feedback, open an issue on GitHub or contact the maintainers.