Benchmark harness for evaluating the memory-service cognition pipeline using industry-standard academic datasets.
LoCoMo: "Can you remember what friends talked about?" Two friends chat over several months across 10 long conversations, with ~1,540 questions. It tests basic memory: recalling facts, dates, and causes, and connecting information across sessions. Small and fast, so it suits quick iteration. Published at ACL 2024 by researchers at UNC Chapel Hill, USC, and Snap Research.
LongMemEval: "Can you remember what each user told you?" 500 different users, each with months of chat history (~53 sessions per user). It tests the same skills as LoCoMo at a much larger scale, and also checks whether the system tracks updated facts and abstains when it should. Each question is independent, so no information leaks between users. Published at ICLR 2025 by UCLA.
BEAM: "Can you handle massive conversations across many topics?" 100 conversations ranging from 100K to 10 million tokens, covering coding, math, health, finance, and personal topics. The hardest of the three: it tests 10 memory abilities and scores them with rubric nuggets that award partial credit rather than a plain right/wrong verdict. Published at ICLR 2026.
- Java 21+
- Docker and Docker Compose
- Task (`brew install go-task`)
- air (`go install github.com/air-verse/air@latest`; ensure `~/go/bin` is in PATH)
- An OpenAI API key (for embeddings, cognition extraction, and the benchmark LLM judge)
From the memory-service/ directory:
cd ../memory-service
cp ../cognitive-memory/memory-service/compose.override.yaml.example ./compose.override.yaml
cd ../memory-service
./java/mvnw -f java/pom.xml -pl quarkus/memory-service-rest-quarkus -am install -DskipTests
Add to your ~/.zshrc:
export OPENAI_API_KEY=sk-...
export PATH=$PATH:$HOME/go/bin
Then run source ~/.zshrc.
cd memory-service-benchmarks
./mvnw clean package -DskipTests
Terminal 1 (memory service):
cd memory-service
MEMORY_SERVICE_OPENAI_API_KEY=$OPENAI_API_KEY \
MEMORY_SERVICE_ROLES_ADMIN_CLIENTS="admin,turn_traces_processor,cognition_processor" \
MEMORY_SERVICE_ROLES_INDEXER_CLIENTS="agent,cognition_processor" \
MEMORY_SERVICE_API_KEYS_COGNITION_PROCESSOR=cognition-processor-key-123 \
MEMORY_SERVICE_AIR_FULL_BIN="./bin/memory-service serve" \
task dev:memory-service
Terminal 2 (cognition processor; skip for no-cognition runs):
cd cognitive-memory/cognition-processor-quarkus
MEMORY_SERVICE_API_KEY=cognition-processor-key-123 \
MEMORY_MODEL_PROVIDER=openai \
MEMORY_MODEL_ID=gpt-4o-mini \
OPENAI_BASE_URL=https://api.openai.com/v1 \
OPENAI_MODEL_NAME=gpt-4o-mini \
./mvnw quarkus:dev
# Single conversation (quick test)
java -Xmx2g -Dbenchmark.conversations=0 -jar target/quarkus-app/quarkus-run.jar locomo
# All 10 conversations
java -Xmx2g -jar target/quarkus-app/quarkus-run.jar locomo
# Without cognition (stop cognition processor first)
java -Xmx2g -Dbenchmark.cognition.enabled=false -jar target/quarkus-app/quarkus-run.jar locomo
# Quick smoke test (2 per type = 12 questions)
java -Xmx2g -Dbenchmark.longmemeval.per-type=2 -jar target/quarkus-app/quarkus-run.jar longmemeval
# Default (5 per type = 30 questions)
java -Xmx2g -jar target/quarkus-app/quarkus-run.jar longmemeval
# All 500 questions (takes many hours)
java -Xmx2g -Dbenchmark.longmemeval.per-type=0 -jar target/quarkus-app/quarkus-run.jar longmemeval
# Filter by question type
java -Xmx2g -Dbenchmark.longmemeval.question-types=temporal-reasoning,multi-session \
-jar target/quarkus-app/quarkus-run.jar longmemeval
# Without cognition
java -Xmx2g -Dbenchmark.cognition.enabled=false -jar target/quarkus-app/quarkus-run.jar longmemeval
# Quick smoke test (1 chat from 100K tier, 20 questions)
java -Xmx2g -Dbenchmark.beam.max-chats=1 -jar target/quarkus-app/quarkus-run.jar beam
# Default (2 chats from 100K, 40 questions)
java -Xmx2g -jar target/quarkus-app/quarkus-run.jar beam
# All 100K chats (20 chats, 400 questions)
java -Xmx2g -Dbenchmark.beam.max-chats=0 -jar target/quarkus-app/quarkus-run.jar beam
# Larger size tiers
java -Xmx4g -Dbenchmark.beam.chat-sizes=100K,500K -jar target/quarkus-app/quarkus-run.jar beam
# Without cognition
java -Xmx2g -Dbenchmark.cognition.enabled=false -jar target/quarkus-app/quarkus-run.jar beam
To start fresh between runs:
cd memory-service
docker compose down -v
docker compose up -d qdrant postgres redis keycloak prometheus minio minio-init clickhouse langfuse-worker langfuse-web
Then restart the memory service and the cognition processor.
Results are written to results/ as JSON:
results/locomo_cognition_2026-06-25T08-27-19.json
results/locomo_substrate_2026-06-25T08-30-00.json
results/longmemeval_cognition_2026-06-25T10-15-00.json
results/beam_cognition_2026-06-26T15-00-00.json
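The file names above appear to follow a `<benchmark>_<mode>_<timestamp>.json` pattern. The sketch below reproduces that pattern under two assumptions that the listing does not confirm: that `substrate` is the mode label for runs with cognition disabled, and that the timestamp is the local start time of the run. The class and method names are hypothetical and are not taken from the harness.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class ResultFileNameSketch {

    // Colons are not portable in file names, so the time part uses dashes.
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH-mm-ss");

    // Assumption: "substrate" marks a run with benchmark.cognition.enabled=false.
    static String fileName(String benchmark, boolean cognitionEnabled, LocalDateTime startedAt) {
        String mode = cognitionEnabled ? "cognition" : "substrate";
        return "results/" + benchmark + "_" + mode + "_" + STAMP.format(startedAt) + ".json";
    }

    public static void main(String[] args) {
        LocalDateTime startedAt = LocalDateTime.of(2026, 6, 25, 8, 27, 19);
        // prints results/locomo_cognition_2026-06-25T08-27-19.json
        System.out.println(fileName("locomo", true, startedAt));
    }
}
```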
Each benchmark reports three scoring metrics per question, matching the evaluation framework used by industry benchmarks (e.g., Mem0):
| Metric | What it measures | How it works |
|---|---|---|
| LLM Judge | Semantic correctness | An LLM compares the answer to ground truth by meaning. "NYC" = "New York City". Most accurate but costs an LLM call per question. |
| F1 Score | Word overlap | Measures precision and recall of word overlap between answer and ground truth. Better than BLEU but still purely word-based. |
| BLEU Score | N-gram precision | Counts how many word sequences match exactly. Originally designed for machine translation. Cheapest but most rigid. |
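To make the F1 row concrete, here is a minimal sketch of token-level F1. The normalization (lowercasing, dropping punctuation, splitting on whitespace) is an assumption, and the harness may normalize differently. `TokenF1Sketch` is a hypothetical name.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TokenF1Sketch {

    // Assumed normalization: lowercase, replace punctuation with spaces, split on whitespace.
    static List<String> tokenize(String text) {
        String cleaned = text.toLowerCase().replaceAll("[^\\p{L}\\p{N}\\s]", " ").trim();
        return cleaned.isEmpty() ? List.of() : Arrays.asList(cleaned.split("\\s+"));
    }

    static double f1(String answer, String truth) {
        List<String> answerTokens = tokenize(answer);
        List<String> truthTokens = tokenize(truth);
        if (answerTokens.isEmpty() || truthTokens.isEmpty()) {
            return answerTokens.isEmpty() && truthTokens.isEmpty() ? 1.0 : 0.0;
        }
        // Count each ground-truth token so repeated words are matched at most as often as they occur.
        Map<String, Integer> remaining = new HashMap<>();
        for (String token : truthTokens) {
            remaining.merge(token, 1, Integer::sum);
        }
        int overlap = 0;
        for (String token : answerTokens) {
            Integer left = remaining.get(token);
            if (left != null && left > 0) {
                remaining.put(token, left - 1);
                overlap++;
            }
        }
        if (overlap == 0) {
            return 0.0;
        }
        double precision = (double) overlap / answerTokens.size();
        double recall = (double) overlap / truthTokens.size();
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // Precision 3/6, recall 3/3, so F1 is 2/3.
        System.out.println(f1("She moved to New York City", "New York City"));
        // No shared words, so F1 is 0 even though the meaning matches.
        System.out.println(f1("NYC", "New York City"));
    }
}
```

The second call shows why a word-overlap metric is paired with the LLM Judge: "NYC" scores zero against "New York City" despite being semantically correct.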
For BEAM, the LLM Judge uses rubric-based nugget scoring (0/0.5/1.0 per criterion) instead of binary CORRECT/WRONG, giving partial credit for partially correct answers.
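As an illustration of how partial credit could roll up into a question score, the sketch below averages the per-criterion values. Averaging is an assumption; the text only states the 0/0.5/1.0 scale, not how criteria are combined. The `Verdict` names are hypothetical.

```java
import java.util.List;

public class NuggetScoreSketch {

    // One verdict per rubric criterion, using the 0 / 0.5 / 1.0 scale.
    enum Verdict {
        MISSING(0.0), PARTIAL(0.5), SATISFIED(1.0);

        final double score;

        Verdict(double score) {
            this.score = score;
        }
    }

    // Assumption: the question score is the unweighted mean of its nugget scores.
    static double rubricScore(List<Verdict> verdicts) {
        if (verdicts.isEmpty()) {
            throw new IllegalArgumentException("rubric has no nuggets");
        }
        return verdicts.stream().mapToDouble(v -> v.score).average().orElseThrow();
    }

    public static void main(String[] args) {
        // One criterion met, one partially met, one missed: (1.0 + 0.5 + 0.0) / 3 = 0.5
        System.out.println(rubricScore(List.of(Verdict.SATISFIED, Verdict.PARTIAL, Verdict.MISSING)));
    }
}
```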
All settings in src/main/resources/application.properties:
| Property | Default | Description |
|---|---|---|
| `memory-service.url` | `http://localhost:8082` | Memory service URL |
| `memory-service.api-key` | `agent-api-key-1` | API key for authentication |
| `benchmark.top-k` | `50` | Max memories to retrieve per question |
| `benchmark.output-dir` | `results` | Output directory |
| `benchmark.cognition.enabled` | `true` | Wait for cognition processor |
| `benchmark.cognition.namespace` | `cognition.v1` | Cognition memory namespace |
| `benchmark.cognition.wait-timeout-seconds` | `600` | Max wait for extraction |
| `benchmark.cognition.poll-interval-seconds` | `10` | Polling interval |
| `benchmark.cognition.stable-seconds` | `90` | Seconds of stability before proceeding |
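The three `benchmark.cognition.*` timing properties interact, so a sketch of one plausible wait loop may help. It assumes "stability" means the number of extracted memories has stopped changing, which the table does not state. The `Sleeper` hook and all names are hypothetical; they exist only so the loop can be exercised without real sleeping.

```java
import java.util.function.IntSupplier;

public class CognitionWaitSketch {

    /** Hypothetical hook so the sketch can run without actually sleeping. */
    interface Sleeper {
        void sleepSeconds(long seconds) throws InterruptedException;
    }

    /**
     * Polls countSource every pollSeconds. Returns true once the value has stayed
     * unchanged for stableSeconds, or false if timeoutSeconds elapse first.
     */
    static boolean waitUntilStable(IntSupplier countSource, long timeoutSeconds,
                                   long pollSeconds, long stableSeconds,
                                   Sleeper sleeper) throws InterruptedException {
        int last = countSource.getAsInt();
        long elapsed = 0;
        long unchangedFor = 0;
        while (elapsed < timeoutSeconds) {
            sleeper.sleepSeconds(pollSeconds);
            elapsed += pollSeconds;
            int current = countSource.getAsInt();
            if (current == last) {
                unchangedFor += pollSeconds;
                if (unchangedFor >= stableSeconds) {
                    return true;
                }
            } else {
                // New memories arrived, so restart the stability window.
                last = current;
                unchangedFor = 0;
            }
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulated extraction: the count grows for five polls, then stays at 5.
        int[] calls = {0};
        IntSupplier fakeCount = () -> Math.min(calls[0]++, 5);
        // Defaults from the table: 600 s timeout, 10 s poll interval, 90 s stability window.
        System.out.println(waitUntilStable(fakeCount, 600, 10, 90, seconds -> { }));
    }
}
```

With the defaults, a run proceeds at the earliest 90 seconds after the last observed change, so lowering `stable-seconds` shortens every run but risks searching before extraction has finished.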
LoCoMo-specific:
| Property | Default | Description |
|---|---|---|
| `benchmark.dataset` | `datasets/locomo10.json` | Path to LoCoMo dataset |
| `benchmark.conversations` | `0,1,2,3,4,5,6,7,8,9` | Which conversations to run |
BEAM-specific:
| Property | Default | Description |
|---|---|---|
| `benchmark.beam.dataset-dir` | `datasets/beam` | Path to BEAM chats directory |
| `benchmark.beam.chat-sizes` | `100K` | Comma-separated size tiers (100K, 500K, 1M, 10M) |
| `benchmark.beam.max-chats` | `2` | Max chats per size tier (0 = all) |
LongMemEval-specific:
| Property | Default | Description |
|---|---|---|
| `benchmark.longmemeval.dataset` | `datasets/longmemeval_s_cleaned.json` | Path to LongMemEval dataset |
| `benchmark.longmemeval.per-type` | `5` | Questions per type (0 = all) |
| `benchmark.longmemeval.seed` | `42` | Random seed for sampling |
| `benchmark.longmemeval.question-types` | (all) | Comma-separated type filter |
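The `per-type` and `seed` properties suggest seeded sampling within each question type. The sketch below shows one way that could work, using a seeded shuffle per type. This is an assumption about the approach, not the harness's actual code, and `Question` is a hypothetical record.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class PerTypeSamplerSketch {

    record Question(String id, String type) {}

    // perType <= 0 keeps every question, mirroring "0 = all" in the table.
    static List<Question> sample(List<Question> all, int perType, long seed) {
        // LinkedHashMap keeps types in first-seen order so results are reproducible.
        Map<String, List<Question>> byType = new LinkedHashMap<>();
        for (Question q : all) {
            byType.computeIfAbsent(q.type(), k -> new ArrayList<>()).add(q);
        }
        Random random = new Random(seed);
        List<Question> picked = new ArrayList<>();
        for (List<Question> group : byType.values()) {
            if (perType <= 0 || perType >= group.size()) {
                picked.addAll(group);
                continue;
            }
            List<Question> shuffled = new ArrayList<>(group);
            Collections.shuffle(shuffled, random);
            picked.addAll(shuffled.subList(0, perType));
        }
        return picked;
    }

    public static void main(String[] args) {
        List<Question> all = new ArrayList<>();
        for (String type : List.of("temporal-reasoning", "multi-session", "knowledge-update")) {
            for (int i = 0; i < 10; i++) {
                all.add(new Question(type + "-" + i, type));
            }
        }
        // 3 types x 2 per type = 6 questions
        System.out.println(sample(all, 2, 42L).size());
    }
}
```

Fixing the seed means two runs with the same dataset and `per-type` value pick the same questions, which keeps cognition and no-cognition runs comparable.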
Override any property with -D:
java -Xmx2g -Dbenchmark.top-k=20 -jar target/quarkus-app/quarkus-run.jar locomo
Dataset (LoCoMo / LongMemEval / BEAM)
        │
        ▼
┌──────────────────────────┐
│ Benchmark Runner         │
│ 1. Ingest conversations  │──→ Memory Service (REST API :8082)
│ 2. Wait for cognition    │         │
│ 3. Search memories       │         ▼
│ 4. LLM answers           │    Cognition Processor (Quarkus :8090)
│ 5. LLM judges            │    extracts facts, preferences, procedures
│ 6. Compute metrics       │         │
└──────────────────────────┘         ▼
        │                       Memory Service stores extracted memories
        ▼                       under ["user", userId, "cognition.v1", ...]
  results/*.json
LoCoMo: two friends chat over several months across 10 long conversations, with ~1,540 questions (after excluding adversarial). It tests basic memory: recalling facts, dates, and causes, and connecting information across sessions. Small and fast, so it suits quick iteration.
5 categories:
| Category | Tests |
|---|---|
| Multi-hop | Connecting facts across sessions |
| Temporal | Dates, timing, sequences |
| Causal | Reasoning about why things happened |
| Factual | Direct fact recall |
| Adversarial | Questions about things never discussed (skipped) |
LongMemEval: 500 different users, each with months of chat history (~53 sessions per user). It tests the same skills as LoCoMo at a much larger scale, and also checks whether the system tracks updated facts. Each question is independent, so no information leaks between users.
6 question types:
| Question Type | Tests |
|---|---|
| single-session-user | Facts stated by the user in one session |
| single-session-assistant | Information provided by the assistant |
| single-session-preference | User preferences expressed in one session |
| temporal-reasoning | Time-based reasoning across sessions |
| knowledge-update | Updated facts that override earlier ones |
| multi-session | Connecting information across multiple sessions |
BEAM: 100 conversations ranging from 100K to 10 million tokens, covering coding, math, health, finance, and personal topics. The hardest of the three: it tests 10 memory abilities and scores them with rubric nuggets that award partial credit rather than a plain right/wrong verdict.
10 question types:
| Question Type | Tests |
|---|---|
| abstention | Withholding answers when evidence is absent |
| contradiction_resolution | Detecting and reconciling inconsistent statements |
| event_ordering | Reconstructing chronological sequences |
| information_extraction | Recalling entities, dates, numbers, facts |
| instruction_following | Sustained adherence to user constraints |
| knowledge_update | Revising stored facts with new information |
| multi_session_reasoning | Integrating evidence across non-adjacent segments |
| preference_following | Adapting to evolving user preferences |
| summarization | Abstracting and compressing dialogue content |
| temporal_reasoning | Reasoning about time relations and durations |