Minimal harness for memory benchmarking using OpenClaw on the LOCOMO dataset.
This repo benchmarks OpenClaw against LOCOMO with three backends:

- `memory-core`: writes lossless LOCOMO session markdown into the OpenClaw workspace, then reindexes
- `memory-lancedb`: writes the same LOCOMO session markdown, reuses the exact `memory-core` indexed chunks and embeddings, then stores those chunks in the built-in LanceDB plugin table
- `memory-lancedb-pro`: starts from that same chunk-aligned LanceDB corpus and uses the `memory-lancedb-pro` plugin for more advanced retrieval tuning
QA still runs through a local OpenClaw gateway, and an LLM judge scores the answers.
Install the OpenClaw CLI and onboard:

```
npm install -g openclaw@latest
openclaw onboard
```

Put this in `.env` at the repo root:

```
OPENAI_API_KEY=your_key_here
```

Fetch the dataset and sync dependencies:

```
mkdir -p datasets
curl -fsSL https://raw.githubusercontent.com/snap-research/locomo/main/data/locomo10.json -o datasets/locomo10.json
uv sync
```

For the memory-core leg, use:

```
./setup_memory_core.sh
```

For the LanceDB leg, use:

```
./setup_memory_lancedb.sh
```

For the LanceDB Pro leg, use:

```
./setup_memory_lancedb_pro.sh
```

Before the first LanceDB Pro run, install the plugin once:

```
openclaw plugins install memory-lancedb-pro@beta
```

Start the gateway:

```
./start_gateway.sh
```

The benchmark expects the gateway at http://127.0.0.1:18789.
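A quick sanity check that the downloaded dataset parsed and has the expected shape can save a failed run later. This sketch assumes the LOCOMO file is a JSON list of samples, each carrying a `qa` list; the exact schema may differ:

```python
import json

def dataset_stats(path):
    """Count samples and flattened QA rows in a LOCOMO-style JSON file."""
    with open(path, encoding="utf-8") as f:
        samples = json.load(f)
    qa_rows = sum(len(sample.get("qa", [])) for sample in samples)
    return len(samples), qa_rows

# e.g. dataset_stats("datasets/locomo10.json") -> (num_dialogues, num_qa_rows)
```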
start_gateway.sh hardcodes the OpenClaw agent model to openai/gpt-4.1-mini so the gateway does not fall back to another provider by default.
For memory-core, prebuild the workspace markdown and SQLite index:
```
./setup_memory_core.sh
uv run python scripts/build_memory_core_corpus.py \
  --input datasets/locomo10.json
```

That writes the full LOCOMO session markdown into the benchmark workspace and runs the built-in `openclaw memory index --force`. It records a manifest at `locomo-bench/prebuilt-memory-core.json`.
For memory-lancedb, prebuild the chunk-aligned LanceDB store:
```
./setup_memory_lancedb.sh
uv run python scripts/build_memory_lancedb_corpus.py \
  --input datasets/locomo10.json
```

That writes the full LOCOMO session markdown into the workspace, runs the built-in memory-core indexer, reads back the exact indexed chunk text and embeddings from SQLite, and writes those chunks into the configured `memory-lancedb` store. It records a manifest at `locomo-bench/prebuilt-memory-lancedb.json`.
For memory-lancedb-pro, build from the existing memory-lancedb store:
```
./setup_memory_lancedb_pro.sh
uv run python scripts/build_memory_lancedb_pro_corpus.py \
  --source-db locomo-bench/lancedb
```

This script requires that the `memory-lancedb` store pre-exist, because it migrates the already-computed embeddings into the format expected by the `memory-lancedb-pro` plugin. It uses the plugin's migration path to materialize a separate `memory-lancedb-pro` store from the already-built chunk-aligned `memory-lancedb` corpus without re-embedding anything. It records a manifest at `locomo-bench/prebuilt-memory-lancedb-pro.json`.
Use the --limit parameter to specify the number of QA pairs to benchmark.
Note
`--limit` applies to the flattened LOCOMO QA rows, not to the number of dialogues: the loader flattens every sample's `qa` array into one ordered benchmark list, so `--limit 5` means "run the first 5 QA rows". If those rows all come from one dialogue, the harness still ingests the full source dialogue for that sample; it does not ingest the entire LOCOMO dataset unless your selected rows span it.
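A hypothetical sketch of this selection logic (the real loader may differ in detail) makes the semantics concrete: every sample's `qa` array is flattened in order, `--limit` truncates that flat list, and ingest covers each dialogue that contributed at least one selected row:

```python
def select_rows(samples, limit):
    """Flatten every sample's qa array into one ordered list, then truncate.

    Returns the selected QA rows plus the indices of the dialogues that
    contributed at least one row, i.e. the dialogues that must be ingested.
    """
    flat = [
        (idx, qa)
        for idx, sample in enumerate(samples)
        for qa in sample.get("qa", [])
    ]
    selected = flat if limit is None else flat[:limit]
    dialogues_to_ingest = sorted({idx for idx, _ in selected})
    return [qa for _, qa in selected], dialogues_to_ingest
```

With two dialogues of 4 and 2 QA rows, a limit of 5 selects rows from both dialogues, while a limit of 3 touches only the first.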
The benchmark exposes three model controls:
- `--agent-model`: the actual OpenClaw agent model used for QA
- `--gateway-model`: optional model value sent to the OpenClaw `/v1/responses` endpoint; if omitted it defaults to `--agent-model`
- `--judge-model`: model used by the LLM judge
In a second terminal, enter the following for memory-core:
```
uv run python scripts/run_memory_core.py \
  --input datasets/locomo10.json \
  --limit 5 \
  --gateway http://127.0.0.1:18789 \
  --agent-model openai/gpt-4.1-mini \
  --judge-model openai/gpt-4.1-mini \
  --skip-ingest
```

For the LanceDB leg:
```
uv run python scripts/run_memory_lancedb.py \
  --input datasets/locomo10.json \
  --limit 5 \
  --gateway http://127.0.0.1:18789 \
  --agent-model openai/gpt-4.1-mini \
  --judge-model openai/gpt-4.1-mini \
  --skip-ingest
```

For the LanceDB Pro leg:
```
uv run python scripts/run_memory_lancedb_pro.py \
  --input datasets/locomo10.json \
  --limit 5 \
  --gateway http://127.0.0.1:18789 \
  --agent-model openai/gpt-4.1-mini \
  --judge-model openai/gpt-4.1-mini \
  --skip-ingest
```

All three runs write artifacts under `outputs/` by default.
The latest summaries under outputs/ show the following picture for recent --limit 10 runs:
| Backend | Rows | Correct | Wrong | Completion Rate | Avg latency (s) |
|---|---|---|---|---|---|
| memory-core | 50 | 27 | 23 | 0.54 | 6.7 |
| memory-lancedb | 50 | 32 | 18 | 0.64 | 4.1 |
| memory-lancedb-pro | 50 | 36 | 14 | 0.72 | 10.9 |
Source summaries are from the output files after running the benchmark using each memory plugin.
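As a quick consistency check, the completion rates in the table are just Correct / Rows:

```python
# (rows, correct, avg latency in seconds) per backend, from the table above.
results = {
    "memory-core": (50, 27, 6.7),
    "memory-lancedb": (50, 32, 4.1),
    "memory-lancedb-pro": (50, 36, 10.9),
}
for backend, (rows, correct, _latency) in results.items():
    print(f"{backend}: completion rate {correct / rows:.2f}")
```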
For real benchmark runs, repeatedly ingesting the same corpus is wasteful. The recommended workflow is:
- build the `memory-core` corpus once
- build the `memory-lancedb` corpus once
- build the `memory-lancedb-pro` corpus once from that prebuilt `memory-lancedb` store
- run each benchmark leg with `--skip-ingest`
After the corpora exist, benchmark in query-only mode.
For memory-core:
```
./setup_memory_core.sh
./start_gateway.sh
```

In a second terminal:

```
uv run python scripts/run_memory_core.py \
  --input datasets/locomo10.json \
  --limit 100 \
  --gateway http://127.0.0.1:18789 \
  --agent-model openai/gpt-4.1-mini \
  --judge-model openai/gpt-4.1-mini \
  --skip-ingest
```

For memory-lancedb:
```
./setup_memory_lancedb.sh
./start_gateway.sh
```

In a second terminal:

```
uv run python scripts/run_memory_lancedb.py \
  --input datasets/locomo10.json \
  --limit 100 \
  --gateway http://127.0.0.1:18789 \
  --agent-model openai/gpt-4.1-mini \
  --judge-model openai/gpt-4.1-mini \
  --skip-ingest
```

For memory-lancedb-pro:
```
./setup_memory_lancedb_pro.sh
./start_gateway.sh
```

In a second terminal:

```
uv run python scripts/run_memory_lancedb_pro.py \
  --input datasets/locomo10.json \
  --limit 100 \
  --gateway http://127.0.0.1:18789 \
  --agent-model openai/gpt-4.1-mini \
  --judge-model openai/gpt-4.1-mini \
  --skip-ingest
```

`--skip-ingest` means:
- the benchmark does not touch the existing store
- the run is query-only
- `memory_status_before.json` and `memory_status_after.json` reflect the prebuilt store as-is
- for `memory-core`, this only works after `build_memory_core_corpus.py` has populated the workspace markdown and SQLite index
For memory-core, the benchmark:
- writes raw LOCOMO session markdown into `workspace/memory/locomo/`
- clears only that benchmark-managed subtree before each run
- runs `openclaw memory index --force`
- asks QA through the gateway after reindexing
- does no summarization during ingest
For memory-lancedb, the benchmark:
- writes the same LOCOMO session markdown used by `memory-core`
- runs the built-in `openclaw memory index --force`
- reads the exact indexed `memory-core` chunks and stored embeddings from the SQLite `chunks` table
- writes those chunks directly into the plugin's `memories` table
- clears the benchmark-managed LanceDB directory before each run
- uses the bundled `memory-lancedb` plugin with its Node dependency installed once
- does no summarization during ingest
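The chunk read-back step can be sketched as follows. The actual memory-core SQLite schema and embedding encoding are assumptions here (a `chunks(text, embedding)` table with embeddings stored as packed little-endian float32 blobs), not the plugin's documented layout:

```python
import sqlite3
import struct

def load_chunks(db_path):
    """Read chunk text and embedding vectors back from a memory-core style
    SQLite index. Assumes a chunks(text TEXT, embedding BLOB) table with
    embeddings packed as little-endian float32 -- the real schema may differ.
    """
    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT text, embedding FROM chunks").fetchall()
    conn.close()
    chunks = []
    for text, blob in rows:
        vector = struct.unpack(f"<{len(blob) // 4}f", blob)
        chunks.append((text, list(vector)))
    return chunks
```

Each `(text, vector)` pair can then be written into the LanceDB `memories` table without recomputing any embeddings.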
When --skip-ingest is set, it skips the write/reset step and queries the already-built store.
The standalone build_memory_core_corpus.py script performs that same write-and-index step once up front so later memory-core runs can safely use --skip-ingest.
For memory-lancedb-pro, the benchmark:
- writes the same LOCOMO session markdown used by `memory-core`
- runs the built-in `openclaw memory index --force`
- reads the exact indexed `memory-core` chunks and stored embeddings from the SQLite `chunks` table
- materializes a temporary legacy `memory-lancedb` store from those chunks
- migrates that temporary legacy store into `memory-lancedb-pro` so Pro sees the same chunk corpus and vectors
- uses the installed `memory-lancedb-pro` plugin and the retrieval settings from `setup_memory_lancedb_pro.sh`
- does no summarization during ingest
This is intentional. memory-lancedb-pro maintains additional search/index state, so the benchmark migrates through the plugin's supported path instead of treating it like the simpler built-in LanceDB store.
When --skip-ingest is set, it skips the import step and queries the already-built Pro store.
Each benchmark run writes a directory under `outputs/`. Typical files in one run directory:
- `selected_rows.jsonl`
  - the flattened LOCOMO QA rows selected by `--limit`
  - useful for seeing exactly which benchmark questions were included
- `ingest_log.jsonl`
  - one row per stored memory unit written during ingest
  - for `memory-core`, this is one row per session markdown file
  - for `memory-lancedb`, this is one row per `memory-core` chunk stored in LanceDB
  - for `memory-lancedb-pro`, this is one row per `memory-core` chunk stored in LanceDB Pro
- `reindex.log`
  - stdout from `openclaw memory index --force` for `memory-core`
  - stdout from the chunk-source `openclaw memory index --force` step for both LanceDB legs
  - for `memory-lancedb-pro`, this file also includes the plugin migration output
- `document_log.jsonl`
  - the session markdown files written before chunk extraction
  - produced for both LanceDB legs because their chunk source is the same session-document corpus used by `memory-core`
- `memory_status_before.json`
  - backend status before ingest
  - for `memory-core`, this records the workspace, SQLite path, and indexed file/chunk counts
  - for `memory-lancedb`, this records the LanceDB path and whether the store already existed
  - for `memory-lancedb-pro`, this records the LanceDB Pro path and whether the store already existed
- `memory_status_after.json`
  - backend status after ingest
  - for `memory-core`, this confirms how many files and chunks were indexed
  - for `memory-lancedb`, this confirms the LanceDB path and how many chunk rows were written
  - for `memory-lancedb-pro`, this confirms the LanceDB Pro path and how many chunk rows were written
- `qa_results.jsonl`
  - raw benchmark answers returned by the gateway for each selected QA row
  - includes latency, token usage if present, and any gateway errors
  - token usage for gateway-mediated runs should be treated as approximate until the harness fully normalizes the gateway usage fields
- `judged_results.jsonl`
  - the QA rows after LLM judging
  - marks each answer as `CORRECT` or `WRONG` with short reasoning
- `summary.json`
  - small run-level summary
  - includes task completion rate, counts, token totals, latency, and memory status before/after
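The run-level numbers can be recomputed from the judged rows. This sketch assumes illustrative field names (`verdict`, `latency_s`) that may differ from the harness's real output keys:

```python
import json

def summarize(judged_path):
    """Recompute run-level metrics from a judged-results JSONL file.

    Field names (verdict, latency_s) are illustrative assumptions; check
    the actual judged_results.jsonl for the real keys.
    """
    with open(judged_path, encoding="utf-8") as f:
        rows = [json.loads(line) for line in f if line.strip()]
    n = len(rows)
    correct = sum(1 for r in rows if r["verdict"] == "CORRECT")
    return {
        "rows": n,
        "correct": correct,
        "wrong": n - correct,
        "completion_rate": correct / n if n else 0.0,
        "avg_latency_s": sum(r["latency_s"] for r in rows) / n if n else 0.0,
    }
```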
- Start with `memory-core` as the baseline, then compare it against `memory-lancedb` and `memory-lancedb-pro`.
- The same `.env` key is used for both OpenClaw and judge calls.
- For the cleanest benchmark, keep unrelated files out of the active OpenClaw workspace memory corpus.
- `setup_memory_lancedb_pro.sh` is the comparable baseline script; `setup_memory_lancedb_pro_tune.sh` is the experimental tuned script for retrieval sweeps and A/B testing.
- For large runs, prefer the prebuilt-store workflow and `--skip-ingest`. Re-ingesting the same corpus on every run is mainly useful for debugging, not for full benchmarks.
- `memory-lancedb-pro` still depends on the prebuilt `memory-lancedb` store as its corpus source.