This guide explains how to extend MemoryBench with new memory-system baselines or new datasets. Both extension points are designed to be single-file changes plus one registry entry — the rest of the system (CLI argparse, run scripts, frontend dropdowns, evaluation pipeline) picks the new entry up automatically.
If you only skim one section, read the Mental model first.
MemoryBench is composed of three loosely-coupled layers:
+--------------------+
datasets ──▶ | src/dataset/*.py | ──▶ load_memory_bench() / evaluate()
+--------------------+
+-------------------+ +-------------------+
baselines ──▶ | src/agent/*.py | ── | src/solver/*.py | ──▶ runners (off-policy, on-policy, ...)
+-------------------+ +-------------------+
+----------------------+
registry ──▶ | src/memory_systems | ──▶ SolverFactory, get_dialog_key,
+----------------------+ argparse choices, frontend dropdowns
Two integration points decide whether new code is picked up everywhere:
- Datasets are listed in
configs/datasets/each.json(plusdomain.json/task.jsonfor groupings) and instantiated throughBaseDatasetsubclasses insrc/dataset/. - Memory-system baselines register one
MemorySystemSpecinsrc/memory_systems.pyand provide an agent class + a solver class. Everything else flows from the spec — the CLI's--memory_systemchoices, the frontend's dropdown, the per-baseline config-file lookup, thedialog_<name>HF-dataset field, and themem0-style skip rules for known-incompatible combos.
The steps below use generic <name> / <Name> placeholders for your
baseline. The existing baselines on main (mem0, memoryos, a_mem,
the embedder_* family) are good references to read alongside this guide.
If your baseline depends on an external codebase (like mem0,
MemoryOS, or A_mem), clone the upstream repo into baselines/ and
drop the nested .git so the outer repo can track the files:
cd baselines/
git clone --depth 1 https://github.com/<org>/<repo>.git <Name>
rm -rf <Name>/.git
git add <Name>There are two patterns in baselines/, and the right choice depends on
whether your MemoryBench agent imports anything from the upstream code:
| Pattern | When to use it | Example |
|---|---|---|
Vendored & tracked. Upstream source is committed in-tree; runs after a fresh git clone. |
src/agent/<name>.py imports from baselines/<Name>/ (so the source must be present). |
mem0, a_mem, memoryos |
Reference-only mirror. Upstream source is cloned locally for reading but .gitignore'd. |
Your agent reimplements the algorithm in MemoryBench style and doesn't import from the upstream. | Keep the clone on your machine, list the folder in .gitignore, and note the upstream URL in baselines/README.md. |
For the reference-only pattern, add the folder to the top-level
.gitignore so it doesn't pollute git status, and add a
short note to baselines/README.md pointing at
the upstream URL.
The agent class encapsulates the memory store and the chat-completion
path. It must subclass BaseAgent and implement the surfaces below.
Each row notes which runner (off-policy / on-policy / stepwise) actually
calls it — that's what tests/test_refactor.py::TestAllBaselinesContract
enforces for every registered baseline.
| Method | Used by | Purpose |
|---|---|---|
add_conversation_to_memory(messages, conversation_idx) |
off-policy (bulk), on-policy (per-step), stepwise (per-batch) | Ingest a complete dialog. For on-policy this is called every step with a freshly generated dialog, so it must be re-entrant. |
retrieve_memory(query, k) |
called internally by generate_response |
Return retrieved context, either as a list of strings or one assembled string. |
generate_response(messages, lang, retrieve_k) |
off-policy + on-policy first round (via solver.predict_single_data) |
Run inference. Must accept both lang and an optional retrieve_k override. |
self.llm.generate_response(messages) |
on-policy follow-up rounds | The inner LLM client. Comes for free as long as you build it via LlmFactory.create(...). |
save_memories() / load_memories() |
BaseSolver._create_or_load_memory |
Persist / rehydrate the store across runs. The cache dir is config.memory_cache_dir. |
Pydantic config goes in the same file. Use Literal["openai", "vllm", "anthropic"]
for llm_provider if your agent routes through LlmFactory; pin it to
Literal["openai", "vllm"] if it wires to an upstream provider
abstraction that doesn't understand Anthropic.
Patterns worth borrowing from the existing agents in src/agent/:
- Vector store + summary embedding. Embedding an LLM-generated summary of each chunk (denser, more queryable) usually retrieves better than embedding raw turns.
- FIFO working memory. A
collections.deque(maxlen=N)makes bounded-size recent-turn retention free. - Defensive embedding loop. When you pipe long texts at vllm/openai
embeddings, the body sometimes exceeds the token limit — shrink-and-retry
(see
_embedinsrc/agent/embedder.py).
The solver is a thin wrapper over the agent that orchestrates ingestion.
For non-corpus datasets it just delegates to BaseSolver._create_or_load_memory,
but for corpus datasets (LoCoMo, DialSim) it must implement
memory_<corpus_format>_conversation.
Minimum body:
from src.agent.<name> import <Name>Agent, <Name>AgentConfig
from src.solver.base import BaseSolver
class <Name>Solver(BaseSolver):
AGENT_CLASS = <Name>Agent
def __init__(self, config, memory_cache_dir):
super().__init__(config, memory_cache_dir)
self.method_name = "<Name>"
def create_or_load_memory(self, dialogs):
return super()._create_or_load_memory(dialogs, can_thread=False)
def memory_locomo_conversation(self, conversation, session_cnt):
... # ingest one session at a time
def memory_dialsim_conversation(self, conversation, session_cnt):
return self.memory_locomo_conversation(conversation, session_cnt)If you don't ship a memory_<format>_conversation for a corpus format,
load_corpus_to_memory will raise NotImplementedError when someone
runs your baseline on a Locomo/DialSim dataset. That's the right
behavior — failing fast is better than silently skipping the corpus.
For full references, read the existing solvers under src/solver/.
Drop a JSON file under configs/memory_systems/<name>.json with the
fields your <Name>AgentConfig expects. The existing files under
configs/memory_systems/ are good templates.
Add one entry to src/memory_systems.py:
register(MemorySystemSpec(
name="<name>",
solver_class="src.solver.<name>.<Name>Solver",
config_class="src.solver.<name>.<Name>AgentConfig",
config_file="configs/memory_systems/<name>.json",
paper_name="<Name>",
# Optional fields:
# dialog_stem="my_stem" # → field name "dialog_my_stem" on HF rows;
# # defaults to "dialog_<name>".
# skip_combinations=[("domain", "Open-Domain"), ("task", "Long-Short")],
))That's it. The registry entry now drives:
--memory_system <name>is accepted by every runnerrun_scripts/*.pyincludes<name>in its outer loopSolverFactory.create("<name>", config, ...)resolves to your classesget_memory_system_config_file("<name>")returns the JSON pathget_dialog_key("<name>")returns the right field name- The Streamlit frontend's "Memory system" dropdown lists
<name>
This is purely a compatibility step so external code that does
from src.solver import <Name>AgentConfig keeps working, and so
missing baseline deps surface immediately at startup:
Edit src/solver/__init__.py and add:
from src.solver.<name> import <Name>AgentConfignext to the other eager imports. Skip this step only if your baseline has heavy optional dependencies you want to keep lazy.
If your baseline consumes a text-embedding model (like mem0 and the
embedder_* family do), add it to _BASELINES_NEED_EMBEDDER in
frontend/streamlit_app.py.
The frontend will auto-show the Embedder section when your baseline is selected.
If your baseline can't use the Anthropic provider (e.g. it routes
through an upstream LLM client that doesn't speak Anthropic), add it to
_BASELINES_NO_ANTHROPIC so the provider dropdown drops the option
when your baseline is selected.
Off-policy experiments read pre-generated training dialogs from the
HF dataset under the field dialog_<name>. For a fresh baseline this
field doesn't exist yet on THUIR/MemoryBench. To create it:
python -m src.generate_dialogs.reading \
--memory_system <name> \
--dataset Locomo-0Run that for every Locomo + DialSim split. The script writes new dialogs to your local cache; you'll need to re-upload them to the dataset (or distribute them separately) for other users to run off-policy on your baseline.
For non-corpus datasets the dialog field is shared across baselines,
so no re-generation is needed.
The repo ships a registry-driven test suite at
tests/test_refactor.py. Add your baseline
to the LEGACY_* oracle dicts at the top of the file, then run:
python -m unittest tests.test_refactor -vAdding a new entry to the oracles is a regression catch: if the
registry's dialog_key() ever drifts from what runner scripts expect,
this test fails immediately.
For a smoke-test that doesn't need an LLM, you can build the agent
with a dummy provider and exercise add_memory / retrieve_memory
offline:
from src.agent.<name> import <Name>Agent, <Name>AgentConfig
cfg = <Name>AgentConfig(
llm_provider="openai",
llm_config={"model": "noop", "api_key": "noop"},
embedder_provider="openai",
embedder_base_url="http://127.0.0.1:1/v1",
memory_cache_dir="/tmp/<name>-smoke",
# disable any options that would trigger live LLM calls
)
agent = <Name>Agent(cfg)If your dataset is "one input → one expected output", with no
multi-session conversation corpus, you only need a BaseDataset
subclass and a config entry.
Put src/dataset/<MyDataset>.py:
from typing import Any, Dict, List
from src.dataset.base import BaseDataset
class MyDataset(BaseDataset):
def __init__(
self,
data_path: str = None,
dataset_name: str = "MyDataset",
test_metrics: List[str] = ("accuracy",),
max_output_len: int = 2048,
eval_mode: bool = True,
):
self.dataset_name = dataset_name
super().__init__(
data_path=data_path,
test_metrics=list(test_metrics),
max_output_len=max_output_len,
)
def evaluate_single(
self,
user_prompt: str,
info: Dict[str, Any],
llm_response: str,
) -> Dict[str, float]:
# Return a dict containing every key in `self.test_metrics`.
return {"accuracy": float(...)}Required row shape on the underlying HF dataset (returned by
load_from_hf): each row must contain test_idx, dataset_name,
info, lang, and either input_prompt (string) or
input_chat_messages (list of role/content dicts). BaseDataset
asserts these in its constructor.
Add one entry:
"MyDataset": {
"dataset_name": "MyDataset",
"task_tag": "Short-Short",
"domain_tag": "Open-Domain",
"test_metrics": ["accuracy"],
"max_output_len": 2048,
"class_name": "MyDataset.MyDataset"
}The class_name is <file_stem>.<class_name> resolved against
src/dataset/. If you also want the dataset to participate in domain-
or task-level evaluations, add the same block (with optional
sample_count) to configs/datasets/domain.json / task.json.
summary_results(dataset_type="domain", ...) needs per-dataset
min/max/mu/sigma in
configs/final_evaluate_summary_wo_details.json. The simplest way is
to run all baselines on your new dataset first and let
src/single_summary.py / src/summary_evaluate_result.py regenerate
this file.
from memorybench import load_memory_bench, evaluate, summary_results
ds = load_memory_bench("single", "MyDataset", eval_mode=True)
print(ds.dataset_name, len(ds.dataset["train"]), len(ds.dataset["test"]))
predicts = [
{"test_idx": int(row["test_idx"]), "response": "...", "dataset": "MyDataset"}
for row in ds.dataset["test"]
]
details = evaluate("single", "MyDataset", predicts)
summary = summary_results("single", "MyDataset", predicts, details)
print(summary["summary"])For datasets that ship a multi-session conversation corpus you need a few extra hooks.
class MyCorpusDataset(BaseDataset):
corpus_format = "my_corpus" # used to route to per-solver methods
summary_group_name = "MyCorpus" # optional: collapse "MyCorpus-0..9"
# under one normalization keycorpus_format is the dispatch key. When someone calls
load_corpus_to_memory(solver, dataset), MemoryBench invokes
solver.memory_my_corpus_conversation(corpus, session_cnt=...).
The default loader handles LoCoMo and DialSim corpus shapes. For a
new corpus shape, extend load_from_hf (or override _load_data in
your subclass) to populate dataset.corpus and dataset.session_cnt.
The dataset class should set self.has_corpus = True on
its instance — that's already done by BaseDataset._load_data when
the loader returns a non-None corpus.
Each baseline you want to support on this dataset needs a
memory_<corpus_format>_conversation(corpus, session_cnt) method on
its solver class. This is the only per-baseline code change required
to onboard a new corpus shape — and it's why MemoryBench keeps the
corpus-loading API attribute-driven rather than dataset-name-driven.
If a baseline doesn't implement it, load_corpus_to_memory raises
NotImplementedError at runtime.
Run python -m src.generate_dialogs.reading --dataset MyCorpus-0 --memory_system <baseline>
to materialize dialog_<baseline> fields. This is only required for
off-policy / stepwise-off-policy experiments; on-policy generates
dialogs at runtime.
- Don't add parallel hardcoded dicts. Every method-name list lives
in the registry. New baselines must not introduce a "list of all
memory systems" anywhere else. The CI test
tests/test_refactor.py::TestRegistrycatches drift. - Don't string-match on dataset names. Use
dataset.corpus_formatanddataset.summary_group_nameinstead ofdataset.dataset_name.startswith("Locomo"). - Don't commit secrets.
API_config.json,.env*(except.env.example), andfrontend/runtime_configs/are all gitignored — see.gitignore. The audit done intests/test_refactor.py::TestOptionalLLMPingalso avoids printing tokens in failure messages. - Defaults should not require a paid API. Default configs in
configs/memory_systems/*.jsonpoint at vllm/local endpoints. If your baseline needs an OpenAI/Anthropic key, document it but don't hardcode it. - TinyDataset is the smoke-test target. It lives outside the repo
at
../TinyDataset/and contains 3 train + 2 test rows per dataset. SetMEMORY_BENCH_PATH=../TinyDatasetto develop without downloading the full 200 MB hub dataset.
| If you want to... | Read these files in this order |
|---|---|
| Add a memory baseline | src/memory_systems.py → src/agent/embedder.py → src/solver/embedder.py → src/solver/base.py |
| Add a dataset | src/dataset/base.py → src/dataset/Locomo.py → src/dataset/utils.py |
| Understand the evaluation pipeline | memorybench.py → src/dataset/base.py (evaluate, evaluate_test) |
| Understand how runners are wired | src/off-policy.py → src/solver/base.py (predict_test, predict_test_with_corpus) |
| Understand frontend wiring | frontend/streamlit_app.py (run_page, build_runtime_memory_config) |