LOCOMO Benchmark for OpenClaw Memory

Minimal harness for memory benchmarking using OpenClaw on the LOCOMO dataset.

This repo benchmarks OpenClaw against LOCOMO with three backends:

memory-core: writes lossless LOCOMO session markdown into the OpenClaw workspace, then reindexes
memory-lancedb: writes the same LOCOMO session markdown, reuses the exact memory-core indexed chunks and embeddings, then stores those chunks in the built-in LanceDB plugin table
memory-lancedb-pro: starts from that same chunk-aligned LanceDB corpus and uses the memory-lancedb-pro plugin for more advanced retrieval tuning

For every backend, QA runs through a local OpenClaw gateway, and an LLM judge scores the answers.

Quick Start

1. Install OpenClaw

npm install -g openclaw@latest
openclaw onboard

2. Add your OpenAI key

Put this in .env at the repo root:

OPENAI_API_KEY=your_key_here

3. Download the dataset

mkdir -p datasets
curl -fsSL https://raw.githubusercontent.com/snap-research/locomo/main/data/locomo10.json -o datasets/locomo10.json

4. Install Python dependencies

uv sync

5. Configure OpenClaw for the backend you want to benchmark

For the memory-core leg, use:

./setup_memory_core.sh

For the LanceDB leg, use:

./setup_memory_lancedb.sh

For the LanceDB Pro leg, use:

./setup_memory_lancedb_pro.sh

Before the first LanceDB Pro run, install the plugin once:

openclaw plugins install memory-lancedb-pro@beta

6. Start the gateway

./start_gateway.sh

The benchmark expects the gateway at http://127.0.0.1:18789.
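Before launching a run, it can help to confirm that something is listening at that address. The following is a minimal sketch, not part of the harness: the helper name gateway_reachable is hypothetical, and it only checks that a TCP connection can be opened. It does not verify that the listener is actually OpenClaw.

```python
import socket
from urllib.parse import urlsplit


def gateway_reachable(url: str = "http://127.0.0.1:18789", timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds.

    Hypothetical helper: it checks only that a listener exists, not that
    the listener is an OpenClaw gateway.
    """
    parts = urlsplit(url)
    host = parts.hostname or "127.0.0.1"
    # Fall back to the scheme's default port when the URL does not name one.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

Calling gateway_reachable() with no arguments checks the default address the benchmark expects.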

start_gateway.sh hardcodes the OpenClaw agent model to openai/gpt-4.1-mini so the gateway does not fall back to another provider by default.

7. Build the backend corpus once

For memory-core, prebuild the workspace markdown and SQLite index:

./setup_memory_core.sh

uv run python scripts/build_memory_core_corpus.py \
  --input datasets/locomo10.json

That writes the full LOCOMO session markdown into the benchmark workspace and runs the built-in openclaw memory index --force. It records a manifest at locomo-bench/prebuilt-memory-core.json.

For memory-lancedb, prebuild the chunk-aligned LanceDB store:

./setup_memory_lancedb.sh

uv run python scripts/build_memory_lancedb_corpus.py \
  --input datasets/locomo10.json

That writes the full LOCOMO session markdown into the workspace, runs the built-in memory-core indexer, reads back the exact indexed chunk text and embeddings from SQLite, and writes those chunks into the configured memory-lancedb store. It records a manifest at locomo-bench/prebuilt-memory-lancedb.json.

For memory-lancedb-pro, build from the existing memory-lancedb store:

./setup_memory_lancedb_pro.sh

uv run python scripts/build_memory_lancedb_pro_corpus.py \
  --source-db locomo-bench/lancedb

This script requires the memory-lancedb store to exist already, because it migrates the already-computed embeddings into the format expected by the memory-lancedb-pro plugin.

It uses the plugin's migration path to materialize a separate memory-lancedb-pro store from the already-built chunk-aligned memory-lancedb corpus without re-embedding it. It records a manifest at locomo-bench/prebuilt-memory-lancedb-pro.json.

8. Run a subset of the benchmark

Use the --limit parameter to specify the number of QA pairs to benchmark.

Note

--limit applies to the flattened LOCOMO QA rows, not to the number of dialogues. The loader flattens every sample's qa array into one ordered benchmark list, so --limit 5 means "run the first 5 QA rows". If those rows all come from one dialogue, the harness still ingests the full source dialogue for that sample. It does not ingest the entire LOCOMO dataset unless the selected rows span the entire dataset.
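The flattening behavior can be illustrated with a short sketch. This is not the harness's loader. The function names are hypothetical, and the field names sample_id, question, and answer are assumptions about the dataset layout. Only the qa array is named in this README.

```python
from typing import Any, Optional


def flatten_qa(samples: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Flatten every sample's qa array into one ordered list of rows."""
    rows = []
    for sample_index, sample in enumerate(samples):
        for qa_index, qa in enumerate(sample.get("qa", [])):
            rows.append({
                "sample_index": sample_index,
                # "sample_id" is an assumed field; fall back to the position.
                "sample_id": sample.get("sample_id", str(sample_index)),
                "qa_index": qa_index,
                "question": qa.get("question"),
                "answer": qa.get("answer"),
            })
    return rows


def select_rows(samples: list[dict[str, Any]], limit: Optional[int] = None):
    """Return the first `limit` QA rows and the sample indices they touch."""
    rows = flatten_qa(samples)
    selected = rows if limit is None else rows[:limit]
    # Only dialogues that contribute at least one selected row need ingesting.
    to_ingest = list(dict.fromkeys(row["sample_index"] for row in selected))
    return selected, to_ingest
```

With two samples holding three and two QA rows, a limit of 2 selects rows from the first sample only, so only that dialogue would be ingested. A limit of 4 reaches into the second sample, so both would be.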

The benchmark exposes three model controls:

--agent-model: the actual OpenClaw agent model used for QA
--gateway-model: optional model value sent to OpenClaw /v1/responses; if omitted it defaults to --agent-model
--judge-model: model used by the LLM judge
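The fallback from --gateway-model to --agent-model could be wired up roughly as follows. This is a sketch of the described behavior, not the harness's actual argument parser. The flag names come from this README, while the default values and the function names are assumptions.

```python
import argparse
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="LOCOMO benchmark runner (sketch)")
    parser.add_argument("--input", required=True, help="path to locomo10.json")
    parser.add_argument("--limit", type=int, default=None, help="number of QA rows to run")
    # Defaults below are assumptions chosen to match the examples in this README.
    parser.add_argument("--gateway", default="http://127.0.0.1:18789")
    parser.add_argument("--agent-model", default="openai/gpt-4.1-mini")
    parser.add_argument("--gateway-model", default=None)
    parser.add_argument("--judge-model", default="openai/gpt-4.1-mini")
    parser.add_argument("--skip-ingest", action="store_true")
    return parser


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    args = build_parser().parse_args(argv)
    # If --gateway-model is omitted, it defaults to --agent-model.
    if args.gateway_model is None:
        args.gateway_model = args.agent_model
    return args
```

Resolving the fallback after parsing, rather than through a static default, keeps the gateway model in step with whatever agent model the caller passes.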

In a second terminal, enter the following for memory-core:

uv run python scripts/run_memory_core.py \
  --input datasets/locomo10.json \
  --limit 5 \
  --gateway http://127.0.0.1:18789 \
  --agent-model openai/gpt-4.1-mini \
  --judge-model openai/gpt-4.1-mini \
  --skip-ingest

For the LanceDB leg:

uv run python scripts/run_memory_lancedb.py \
  --input datasets/locomo10.json \
  --limit 5 \
  --gateway http://127.0.0.1:18789 \
  --agent-model openai/gpt-4.1-mini \
  --judge-model openai/gpt-4.1-mini \
  --skip-ingest

For the LanceDB Pro leg:

uv run python scripts/run_memory_lancedb_pro.py \
  --input datasets/locomo10.json \
  --limit 5 \
  --gateway http://127.0.0.1:18789 \
  --agent-model openai/gpt-4.1-mini \
  --judge-model openai/gpt-4.1-mini \
  --skip-ingest

All three runs write artifacts under outputs/ by default.

Sample Results

The latest summaries under outputs/ show the following picture for recent --limit 10 runs:

Backend	Rows	Correct	Wrong	Completion Rate	Avg latency (s)
`memory-core`	50	27	23	0.54	6.7
`memory-lancedb`	50	32	18	0.64	4.1
`memory-lancedb-pro`	50	36	14	0.72	10.9

These figures come from the summary files written under outputs/ after running the benchmark with each memory plugin.
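The Completion Rate column is consistent with Correct divided by Rows. The short check below reproduces it from the table's own counts. The formula is an inference from the numbers, not something the harness documents here.

```python
# Counts copied from the Sample Results table.
SAMPLE_RESULTS = {
    "memory-core": {"rows": 50, "correct": 27, "wrong": 23},
    "memory-lancedb": {"rows": 50, "correct": 32, "wrong": 18},
    "memory-lancedb-pro": {"rows": 50, "correct": 36, "wrong": 14},
}


def completion_rate(correct: int, rows: int) -> float:
    """Fraction of judged rows marked correct, rounded to two decimals."""
    if rows <= 0:
        raise ValueError("rows must be positive")
    return round(correct / rows, 2)


for backend, result in SAMPLE_RESULTS.items():
    # Every row is judged either correct or wrong.
    assert result["correct"] + result["wrong"] == result["rows"]
    print(f"{backend}: {completion_rate(result['correct'], result['rows']):.2f}")
```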

9. Large-scale runs: prebuild the stores once, then benchmark in query-only mode

For real benchmark runs, repeatedly ingesting the same corpus is wasteful. The recommended workflow is:

build the memory-core corpus once
build the memory-lancedb corpus once
build the memory-lancedb-pro corpus once from that prebuilt memory-lancedb store
run each benchmark leg with --skip-ingest

After the corpora exist, benchmark in query-only mode.

For memory-core:

./setup_memory_core.sh
./start_gateway.sh

In a second terminal:

uv run python scripts/run_memory_core.py \
  --input datasets/locomo10.json \
  --limit 100 \
  --gateway http://127.0.0.1:18789 \
  --agent-model openai/gpt-4.1-mini \
  --judge-model openai/gpt-4.1-mini \
  --skip-ingest

For memory-lancedb:

./setup_memory_lancedb.sh
./start_gateway.sh

In a second terminal:

uv run python scripts/run_memory_lancedb.py \
  --input datasets/locomo10.json \
  --limit 100 \
  --gateway http://127.0.0.1:18789 \
  --agent-model openai/gpt-4.1-mini \
  --judge-model openai/gpt-4.1-mini \
  --skip-ingest

For memory-lancedb-pro:

./setup_memory_lancedb_pro.sh
./start_gateway.sh

In a second terminal:

uv run python scripts/run_memory_lancedb_pro.py \
  --input datasets/locomo10.json \
  --limit 100 \
  --gateway http://127.0.0.1:18789 \
  --agent-model openai/gpt-4.1-mini \
  --judge-model openai/gpt-4.1-mini \
  --skip-ingest

--skip-ingest means:

the benchmark does not touch the existing store
the run is query-only
memory_status_before.json and memory_status_after.json reflect the prebuilt store as-is
for memory-core, this only works after build_memory_core_corpus.py has populated the workspace markdown and SQLite index
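Because a query-only run depends on a prebuilt store, a simple guard is to look for the manifest that the build script records. The sketch below assumes the manifest paths listed in step 7 resolve relative to a base directory you supply. The README does not say what they are relative to, and the helper names are hypothetical.

```python
from pathlib import Path

# Manifest paths as recorded by the build scripts in step 7.
MANIFESTS = {
    "memory-core": "locomo-bench/prebuilt-memory-core.json",
    "memory-lancedb": "locomo-bench/prebuilt-memory-lancedb.json",
    "memory-lancedb-pro": "locomo-bench/prebuilt-memory-lancedb-pro.json",
}


def prebuilt_manifest(backend: str, base_dir: str = ".") -> Path:
    """Return the expected manifest path for a backend (base_dir is an assumption)."""
    if backend not in MANIFESTS:
        raise ValueError(f"unknown backend: {backend}")
    return Path(base_dir) / MANIFESTS[backend]


def can_skip_ingest(backend: str, base_dir: str = ".") -> bool:
    """True if the manifest exists, suggesting the corpus was prebuilt."""
    return prebuilt_manifest(backend, base_dir).is_file()
```

A manifest file only suggests that a build completed. It does not prove the store is still intact.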

What the Runner Does

For memory-core, the benchmark:

writes raw LOCOMO session markdown into workspace/memory/locomo/
clears only that benchmark-managed subtree before each run
runs openclaw memory index --force
asks QA through the gateway after reindexing
does no summarization during ingest

For memory-lancedb, the benchmark:

writes the same LOCOMO session markdown used by memory-core
runs the built-in openclaw memory index --force
reads the exact indexed memory-core chunks and stored embeddings from the SQLite chunks table
writes those chunks directly into the plugin's memories table
clears the benchmark-managed LanceDB directory before each run
uses the bundled memory-lancedb plugin with its Node dependency installed once
does no summarization during ingest

When --skip-ingest is set, it skips the write/reset step and queries the already-built store.

The standalone build_memory_core_corpus.py script performs that same write-and-index step once up front so later memory-core runs can safely use --skip-ingest.

For memory-lancedb-pro, the benchmark:

writes the same LOCOMO session markdown used by memory-core
runs the built-in openclaw memory index --force
reads the exact indexed memory-core chunks and stored embeddings from the SQLite chunks table
materializes a temporary legacy memory-lancedb store from those chunks
migrates that temporary legacy store into memory-lancedb-pro so Pro sees the same chunk corpus and vectors
uses the installed memory-lancedb-pro plugin and the retrieval settings from setup_memory_lancedb_pro.sh
does no summarization during ingest

This is intentional. memory-lancedb-pro maintains additional search/index state, so the benchmark migrates through the plugin's supported path instead of treating it like the simpler built-in LanceDB store.

When --skip-ingest is set, it skips the import step and queries the already-built Pro store.

Run Artifacts (What is Output)

Each benchmark run writes a directory under:

./outputs

Typical files in one run directory:

selected_rows.jsonl
- the flattened LOCOMO QA rows selected by --limit
- useful for seeing exactly which benchmark questions were included
ingest_log.jsonl
- one row per stored memory unit written during ingest
- for memory-core, this is one row per session markdown file
- for memory-lancedb, this is one row per memory-core chunk stored in LanceDB
- for memory-lancedb-pro, this is one row per memory-core chunk stored in LanceDB Pro
reindex.log
- stdout from openclaw memory index --force for memory-core
- stdout from the chunk-source openclaw memory index --force step for both LanceDB legs
- for memory-lancedb-pro, this file also includes the plugin migration output
document_log.jsonl
- the session markdown files written before chunk extraction
- produced for both LanceDB legs because their chunk source is now the same session-document corpus used by memory-core
memory_status_before.json
- backend status before ingest
- for memory-core, this records the workspace, SQLite path, and indexed file/chunk counts
- for memory-lancedb, this records the LanceDB path and whether the store already existed
- for memory-lancedb-pro, this records the LanceDB Pro path and whether the store already existed
memory_status_after.json
- backend status after ingest
- for memory-core, this confirms how many files and chunks were indexed
- for memory-lancedb, this confirms the LanceDB path and how many chunk rows were written
- for memory-lancedb-pro, this confirms the LanceDB Pro path and how many chunk rows were written
qa_results.jsonl
- raw benchmark answers returned by the gateway for each selected QA row
- includes latency, token usage if present, and any gateway errors
- token usage for gateway-mediated runs should be treated as approximate until the gateway usage fields are fully normalized by the harness
judged_results.jsonl
- the QA rows after LLM judging
- marks each answer as CORRECT or WRONG with short reasoning
summary.json
- small run-level summary
- includes task completion rate, counts, token totals, latency, and memory status before/after
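To inspect a run by hand, the judged rows can be tallied directly from judged_results.jsonl. This is a minimal sketch. The field name "label" is a guess at where the CORRECT or WRONG verdict is stored, so adjust it to match the actual rows.

```python
import json
from collections import Counter
from pathlib import Path


def count_verdicts(path: str, field: str = "label") -> Counter:
    """Count verdicts in a JSONL file, one JSON object per line.

    `field` is a hypothetical key name; rows without it count as UNKNOWN.
    """
    counts: Counter = Counter()
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            row = json.loads(line)
            counts[str(row.get(field, "UNKNOWN")).upper()] += 1
    return counts
```

For example, count_verdicts("outputs/<run-dir>/judged_results.jsonl") returns a Counter keyed by verdict, where <run-dir> stands for whichever run directory you want to inspect.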

Notes

Start with memory-core as the baseline, then compare it against memory-lancedb and memory-lancedb-pro.
The same .env key is used for both OpenClaw and judge calls.
For the cleanest benchmark, keep unrelated files out of the active OpenClaw workspace memory corpus.
setup_memory_lancedb_pro.sh is the comparable baseline script.
setup_memory_lancedb_pro_tune.sh is the experimental tuned script for retrieval sweeps and A/B testing.
For large runs, prefer the prebuilt-store workflow and --skip-ingest. Re-ingesting the same corpus on every run is mainly useful for debugging, not for full benchmarks.
memory-lancedb-pro still depends on the prebuilt memory-lancedb store as its corpus source.
