Memory Service Benchmarks

Benchmark harness for evaluating the memory-service cognition pipeline against three published academic long-term-memory datasets.

Supported Benchmarks

LoCoMo: "Can you remember what friends talked about?" 10 long conversations, each between two friends chatting over several months, with ~1,540 questions once the adversarial category is excluded. Tests basic memory: can the system recall facts, dates, and causes, and connect information across sessions? Small and fast, so it is good for quick iteration. Published at ACL 2024 by researchers from UNC Chapel Hill, USC, and Snap. Paper | Dataset

LongMemEval: "Can you remember what each user told you?" 500 different users, each with months of chat history (~53 sessions per user). Tests the same skills as LoCoMo at a much larger scale, and also checks whether the system tracks facts that get updated and abstains when it has no evidence. Each question is independent, so no information leaks between users. Published at ICLR 2025 by UCLA. Paper | Dataset

BEAM: "Can you handle massive conversations across many topics?" 100 conversations ranging from 100K to 10 million tokens, covering coding, math, health, finance, and personal topics. The hardest of the three: it tests 10 different memory abilities and scores them with rubric nuggets that allow partial credit rather than a plain right/wrong verdict. Published at ICLR 2026. Paper | Dataset

Prerequisites

Java 21+
Docker and Docker Compose
Task (brew install go-task)
air (go install github.com/air-verse/air@latest; this needs a Go toolchain, and ~/go/bin must be on your PATH)
An OpenAI API key (for embeddings, cognition extraction, and the benchmark LLM judge)

Setup (one time)

1. Configure the memory-service for cognition

Run the following in the memory-service/ checkout, which is assumed to sit next to this repository and next to cognitive-memory/:

cd ../memory-service
cp ../cognitive-memory/memory-service/compose.override.yaml.example ./compose.override.yaml

2. Install the memory-service REST client

cd ../memory-service
./java/mvnw -f java/pom.xml -pl quarkus/memory-service-rest-quarkus -am install -DskipTests

3. Set your OpenAI API key

Add to your ~/.zshrc:

export OPENAI_API_KEY=sk-...
export PATH=$PATH:$HOME/go/bin

Then run source ~/.zshrc to apply the changes to your current shell.

4. Build the benchmark

cd memory-service-benchmarks
./mvnw clean package -DskipTests

Running the benchmarks

Start the services

Terminal 1 — Memory service:

cd memory-service
MEMORY_SERVICE_OPENAI_API_KEY=$OPENAI_API_KEY \
MEMORY_SERVICE_ROLES_ADMIN_CLIENTS="admin,turn_traces_processor,cognition_processor" \
MEMORY_SERVICE_ROLES_INDEXER_CLIENTS="agent,cognition_processor" \
MEMORY_SERVICE_API_KEYS_COGNITION_PROCESSOR=cognition-processor-key-123 \
MEMORY_SERVICE_AIR_FULL_BIN="./bin/memory-service serve" \
task dev:memory-service

Terminal 2 — Cognition processor (skip for no-cognition runs):

cd cognitive-memory/cognition-processor-quarkus
MEMORY_SERVICE_API_KEY=cognition-processor-key-123 \
MEMORY_MODEL_PROVIDER=openai \
MEMORY_MODEL_ID=gpt-4o-mini \
OPENAI_BASE_URL=https://api.openai.com/v1 \
OPENAI_MODEL_NAME=gpt-4o-mini \
./mvnw quarkus:dev

LoCoMo

# Single conversation (quick test)
java -Xmx2g -Dbenchmark.conversations=0 -jar target/quarkus-app/quarkus-run.jar locomo

# All 10 conversations
java -Xmx2g -jar target/quarkus-app/quarkus-run.jar locomo

# Without cognition (stop cognition processor first)
java -Xmx2g -Dbenchmark.cognition.enabled=false -jar target/quarkus-app/quarkus-run.jar locomo

LongMemEval

# Quick smoke test (2 per type = 12 questions)
java -Xmx2g -Dbenchmark.longmemeval.per-type=2 -jar target/quarkus-app/quarkus-run.jar longmemeval

# Default (5 per type = 30 questions)
java -Xmx2g -jar target/quarkus-app/quarkus-run.jar longmemeval

# All 500 questions (takes many hours)
java -Xmx2g -Dbenchmark.longmemeval.per-type=0 -jar target/quarkus-app/quarkus-run.jar longmemeval

# Filter by question type
java -Xmx2g -Dbenchmark.longmemeval.question-types=temporal-reasoning,multi-session \
  -jar target/quarkus-app/quarkus-run.jar longmemeval

# Without cognition
java -Xmx2g -Dbenchmark.cognition.enabled=false -jar target/quarkus-app/quarkus-run.jar longmemeval

BEAM

# Quick smoke test (1 chat from 100K tier, 20 questions)
java -Xmx2g -Dbenchmark.beam.max-chats=1 -jar target/quarkus-app/quarkus-run.jar beam

# Default (2 chats from 100K, 40 questions)
java -Xmx2g -jar target/quarkus-app/quarkus-run.jar beam

# All 100K chats (20 chats, 400 questions)
java -Xmx2g -Dbenchmark.beam.max-chats=0 -jar target/quarkus-app/quarkus-run.jar beam

# Larger size tiers
java -Xmx4g -Dbenchmark.beam.chat-sizes=100K,500K -jar target/quarkus-app/quarkus-run.jar beam

# Without cognition
java -Xmx2g -Dbenchmark.cognition.enabled=false -jar target/quarkus-app/quarkus-run.jar beam

Clean run (reset database)

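To start fresh between runs, remove the containers together with their volumes (docker compose down -v deletes all stored data) and bring the backing services back up: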

cd memory-service
docker compose down -v
docker compose up -d qdrant postgres redis keycloak prometheus minio minio-init clickhouse langfuse-worker langfuse-web

Then restart the memory service and the cognition processor.

Results

Each run writes one JSON file to results/, named after the benchmark, the mode, and a timestamp:

results/locomo_cognition_2026-06-25T08-27-19.json
results/locomo_substrate_2026-06-25T08-30-00.json
results/longmemeval_cognition_2026-06-25T10-15-00.json
results/beam_cognition_2026-06-26T15-00-00.json
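
When comparing runs it can help to split these file names back into their parts. The sketch below is one way to do that; the pattern is inferred only from the example names above, not from the harness source, and ResultFile is a hypothetical helper name.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Parsed pieces of a name such as locomo_cognition_2026-06-25T08-27-19.json. */
record ResultFile(String benchmark, String mode, LocalDateTime timestamp) {

    // Assumed pattern: <benchmark>_<mode>_<yyyy-MM-ddTHH-mm-ss>.json (inferred from the examples).
    private static final Pattern NAME = Pattern.compile(
            "([a-z]+)_([a-z]+)_(\\d{4}-\\d{2}-\\d{2}T\\d{2}-\\d{2}-\\d{2})\\.json");

    // The time part uses dashes instead of colons, so the ISO formatter cannot be used directly.
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("uuuu-MM-dd'T'HH-mm-ss");

    /** Returns the parsed name, or empty if the file does not follow the assumed pattern. */
    static Optional<ResultFile> parse(String fileName) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new ResultFile(
                m.group(1), m.group(2), LocalDateTime.parse(m.group(3), STAMP)));
    }
}
```

Calling ResultFile.parse on each entry of results/ and grouping by benchmark() and mode() gives a quick way to line up cognition runs against substrate runs.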

Metrics

Each benchmark reports three metrics per question, the same set used in published memory-system evaluations such as Mem0's:

Metric	What it measures	How it works
LLM Judge	Semantic correctness	An LLM compares the answer to the ground truth by meaning, so "NYC" matches "New York City". Most accurate, but costs one LLM call per question.
F1 Score	Word overlap	Harmonic mean of word-level precision and recall between the answer and the ground truth. Ignores word order, unlike BLEU, but is still purely lexical.
BLEU Score	N-gram precision	Scores how many of the answer's word sequences (n-grams) appear verbatim in the ground truth. Originally designed for machine translation. Cheapest, but most rigid.
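
To make the F1 row concrete, here is a minimal sketch of token-level F1. The normalization (lowercasing, dropping punctuation, splitting on whitespace) is an assumption for illustration; the harness may tokenize differently, and TokenF1 is a hypothetical class name.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class TokenF1 {

    /** Lowercases, replaces punctuation with spaces, and splits on whitespace (assumed normalization). */
    static List<String> tokenize(String text) {
        String cleaned = text.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}\\p{N}\\s]", " ")
                .trim();
        return cleaned.isEmpty() ? List.of() : Arrays.asList(cleaned.split("\\s+"));
    }

    /** Harmonic mean of token precision and recall; a repeated token only counts as often as it occurs in both. */
    static double score(String prediction, String truth) {
        List<String> pred = tokenize(prediction);
        List<String> gold = tokenize(truth);
        if (pred.isEmpty() || gold.isEmpty()) {
            return pred.isEmpty() && gold.isEmpty() ? 1.0 : 0.0;
        }
        Map<String, Integer> goldCounts = new HashMap<>();
        for (String token : gold) {
            goldCounts.merge(token, 1, Integer::sum);
        }
        int overlap = 0;
        for (String token : pred) {
            Integer remaining = goldCounts.get(token);
            if (remaining != null && remaining > 0) {
                goldCounts.put(token, remaining - 1);
                overlap++;
            }
        }
        if (overlap == 0) {
            return 0.0;
        }
        double precision = (double) overlap / pred.size();
        double recall = (double) overlap / gold.size();
        return 2 * precision * recall / (precision + recall);
    }
}
```

With this sketch, TokenF1.score("NYC", "New York City") is 0.0 because the two strings share no tokens, which is exactly the kind of case the LLM Judge is there to catch.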

For BEAM, the LLM Judge uses rubric-based nugget scoring (0/0.5/1.0 per criterion) instead of binary CORRECT/WRONG, giving partial credit for partially correct answers.
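
As a rough illustration of how partial credit could roll up into one number per question, the sketch below averages the per-criterion grades. Averaging is an assumption made for this example; the text above only specifies the 0/0.5/1.0 grades, and NuggetScore is a hypothetical name.

```java
import java.util.List;

final class NuggetScore {

    /**
     * Averages rubric grades for one question.
     * Each grade must be 0, 0.5, or 1.0; the mean as the aggregate is an assumption.
     */
    static double average(List<Double> grades) {
        if (grades.isEmpty()) {
            throw new IllegalArgumentException("at least one rubric criterion is required");
        }
        double sum = 0.0;
        for (double grade : grades) {
            if (grade != 0.0 && grade != 0.5 && grade != 1.0) {
                throw new IllegalArgumentException("grade must be 0, 0.5, or 1.0 but was " + grade);
            }
            sum += grade;
        }
        return sum / grades.size();
    }
}
```

Under this assumption, an answer that fully meets one criterion, half meets a second, and misses a third would score 0.5 instead of being marked simply wrong.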

Configuration

All settings live in src/main/resources/application.properties:

Property	Default	Description
`memory-service.url`	`http://localhost:8082`	Memory service URL
`memory-service.api-key`	`agent-api-key-1`	API key for authentication
`benchmark.top-k`	`50`	Max memories to retrieve per question
`benchmark.output-dir`	`results`	Output directory
`benchmark.cognition.enabled`	`true`	Whether to wait for the cognition processor before searching
`benchmark.cognition.namespace`	`cognition.v1`	Cognition memory namespace
`benchmark.cognition.wait-timeout-seconds`	`600`	Max wait for extraction
`benchmark.cognition.poll-interval-seconds`	`10`	Polling interval
`benchmark.cognition.stable-seconds`	`90`	Seconds of stability before proceeding
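
The three wait settings suggest a poll-until-stable loop. The sketch below shows one plausible shape for it, assuming that "stable" means the number of extracted memories has stopped changing. That interpretation, the memoryCount supplier, and the CognitionWait name are all assumptions, not the harness's actual code.

```java
import java.time.Duration;
import java.util.function.IntSupplier;

final class CognitionWait {

    /**
     * Polls memoryCount every pollInterval. Returns true once the count has not
     * changed for at least `stable`, or false if `timeout` elapses first
     * (or the thread is interrupted).
     */
    static boolean awaitStable(IntSupplier memoryCount, Duration timeout,
                               Duration pollInterval, Duration stable) {
        long start = System.nanoTime();
        int lastCount = memoryCount.getAsInt();
        long lastChange = start;
        while (true) {
            long now = System.nanoTime();
            if (now - lastChange >= stable.toNanos()) {
                return true;                      // count held steady long enough
            }
            if (now - start >= timeout.toNanos()) {
                return false;                     // gave up waiting
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
            int count = memoryCount.getAsInt();
            if (count != lastCount) {             // new memories arrived: restart the stability window
                lastCount = count;
                lastChange = System.nanoTime();
            }
        }
    }
}
```

With the defaults above this would correspond to awaitStable(counter, Duration.ofSeconds(600), Duration.ofSeconds(10), Duration.ofSeconds(90)). Note that this sketch also treats a count that stays at zero as stable, which a real implementation might want to handle separately.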

LoCoMo-specific:

Property	Default	Description
`benchmark.dataset`	`datasets/locomo10.json`	Path to LoCoMo dataset
`benchmark.conversations`	`0,1,2,3,4,5,6,7,8,9`	Which conversations to run

BEAM-specific:

Property	Default	Description
`benchmark.beam.dataset-dir`	`datasets/beam`	Path to BEAM chats directory
`benchmark.beam.chat-sizes`	`100K`	Comma-separated size tiers (100K, 500K, 1M, 10M)
`benchmark.beam.max-chats`	`2`	Max chats per size tier (0 = all)

LongMemEval-specific:

Property	Default	Description
`benchmark.longmemeval.dataset`	`datasets/longmemeval_s_cleaned.json`	Path to LongMemEval dataset
`benchmark.longmemeval.per-type`	`5`	Questions per type (0 = all)
`benchmark.longmemeval.seed`	`42`	Random seed for sampling
`benchmark.longmemeval.question-types`	(all)	Comma-separated type filter
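
The per-type, seed, and question-types settings together describe a reproducible stratified sample. A minimal sketch of that idea follows; Question and PerTypeSampler are hypothetical names, and the harness's real sampling order may differ even with the same seed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;

/** Minimal stand-in for a benchmark question: an id plus its question type. */
record Question(String id, String type) {}

final class PerTypeSampler {

    /**
     * Keeps up to perType questions of each type (0 = keep all), optionally
     * restricted to typeFilter (empty = all types). The same seed yields the same sample.
     */
    static List<Question> sample(List<Question> all, int perType, long seed, Set<String> typeFilter) {
        // Group by type, preserving first-seen order so the output is deterministic.
        Map<String, List<Question>> byType = new LinkedHashMap<>();
        for (Question q : all) {
            if (typeFilter.isEmpty() || typeFilter.contains(q.type())) {
                byType.computeIfAbsent(q.type(), t -> new ArrayList<>()).add(q);
            }
        }
        Random random = new Random(seed);
        List<Question> picked = new ArrayList<>();
        for (List<Question> group : byType.values()) {
            if (perType > 0 && group.size() > perType) {
                Collections.shuffle(group, random);   // seeded shuffle, then take the first perType
                picked.addAll(group.subList(0, perType));
            } else {
                picked.addAll(group);
            }
        }
        return picked;
    }
}
```

With six question types, perType = 2 selects 12 questions and perType = 5 selects 30, which matches the counts quoted in the LongMemEval run commands.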

Override any property on the command line by passing -D<property>=<value> before -jar:

java -Xmx2g -Dbenchmark.top-k=20 -jar target/quarkus-app/quarkus-run.jar locomo
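
A -D flag sets a JVM system property, and Quarkus configuration gives system properties precedence over application.properties. The harness reads its settings through Quarkus; the sketch below only illustrates the underlying JVM mechanism with a hypothetical helper, BenchmarkProps.

```java
final class BenchmarkProps {

    /** Reads an integer system property, falling back to defaultValue when it is not set. */
    static int intProperty(String name, int defaultValue) {
        String raw = System.getProperty(name);
        return raw == null ? defaultValue : Integer.parseInt(raw.trim());
    }
}
```

Started with -Dbenchmark.top-k=20, BenchmarkProps.intProperty("benchmark.top-k", 50) returns 20; without the flag it returns the default of 50.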

Architecture

Dataset (LoCoMo / LongMemEval / BEAM)
        │
        ▼
┌──────────────────────────┐
│  Benchmark Runner        │
│  1. Ingest conversations │──→  Memory Service (REST API :8082)
│  2. Wait for cognition   │         │
│  3. Search memories      │         ▼
│  4. LLM answers          │     Cognition Processor (Quarkus :8090)
│  5. LLM judges           │     extracts facts, preferences, procedures
│  6. Compute metrics      │         │
└──────────────────────────┘         ▼
        │                    Memory Service stores extracted memories
        ▼                    under ["user", userId, "cognition.v1", ...]
   results/*.json
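
The diagram shows extracted memories stored under a namespace that starts with "user", the user id, and the cognition namespace. The sketch below shows how such a prefix could be built and checked; CognitionNamespace is a hypothetical helper, and whatever follows the prefix (the "..." above) is left unspecified.

```java
import java.util.List;

final class CognitionNamespace {

    /** Builds the namespace prefix ["user", userId, namespace], e.g. with namespace "cognition.v1". */
    static List<String> prefixFor(String userId, String namespace) {
        return List.of("user", userId, namespace);
    }

    /** True when fullNamespace starts with every segment of prefix, in order. */
    static boolean isUnder(List<String> prefix, List<String> fullNamespace) {
        return fullNamespace.size() >= prefix.size()
                && fullNamespace.subList(0, prefix.size()).equals(prefix);
    }
}
```

Because the user id is the second segment, a prefix check like this keeps one benchmark user's memories from matching another's, which matters for LongMemEval where each question has its own user.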

What the benchmarks test

LoCoMo (ACL 2024) — "Can you remember what friends talked about?"

10 long conversations, each between two friends chatting over several months, with ~1,540 questions (after excluding the adversarial category). Tests basic memory: can the system recall facts, dates, and causes, and connect information across sessions? Small and fast, so it is good for quick iteration.

5 categories:

Category	Tests
Multi-hop	Connecting facts across sessions
Temporal	Dates, timing, sequences
Causal	Reasoning about why things happened
Factual	Direct fact recall
Adversarial	Questions about things never discussed (skipped)

LongMemEval (ICLR 2025) — "Can you remember what each user told you?"

500 different users, each with months of chat history (~53 sessions per user). Tests the same skills as LoCoMo at a much larger scale, and also checks whether the system tracks facts that get updated. Each question is independent, so no information leaks between users.

6 question types:

Question Type	Tests
single-session-user	Facts stated by the user in one session
single-session-assistant	Information provided by the assistant
single-session-preference	User preferences expressed in one session
temporal-reasoning	Time-based reasoning across sessions
knowledge-update	Updated facts that override earlier ones
multi-session	Connecting information across multiple sessions

BEAM (ICLR 2026) — "Can you handle massive conversations across many topics?"

100 conversations ranging from 100K to 10 million tokens, covering coding, math, health, finance, and personal topics. The hardest of the three: it tests 10 different memory abilities and scores them with rubric nuggets that allow partial credit rather than a plain right/wrong verdict.

10 question types:

Question Type	Tests
abstention	Withholding answers when evidence is absent
contradiction_resolution	Detecting and reconciling inconsistent statements
event_ordering	Reconstructing chronological sequences
information_extraction	Recalling entities, dates, numbers, facts
instruction_following	Sustained adherence to user constraints
knowledge_update	Revising stored facts with new information
multi_session_reasoning	Integrating evidence across non-adjacent segments
preference_following	Adapting to evolving user preferences
summarization	Abstracting and compressing dialogue content
temporal_reasoning	Reasoning about time relations and durations
