karesti/memory-service-benchmarks

Memory Service Benchmarks

Benchmark harness for evaluating the memory-service cognition pipeline against widely used academic long-term-memory datasets.

Supported Benchmarks

LoCoMo: "Can you remember what friends talked about?" Two friends chatting over months across 10 long conversations, ~1,540 questions (after excluding adversarial ones). Tests basic memory: can you recall facts, dates, causes, and connect information across sessions? Small and fast, so it is good for quick iteration. Published at ACL 2024 by researchers from UNC Chapel Hill, USC, and Snap Inc. Paper | Dataset

LongMemEval: "Can you remember what each user told you?" 500 different users, each with months of chat history (~53 sessions per user). Tests the same skills as LoCoMo at a much larger scale, and additionally checks whether the system tracks updated facts and abstains when it should. Each question is independent: no information leaks between users. Published at ICLR 2025 by UCLA. Paper | Dataset

BEAM: "Can you handle massive conversations across many topics?" 100 conversations ranging from 100K to 10 million tokens, covering coding, math, health, finance, and personal topics. The hardest of the three: it tests 10 different memory abilities with a more detailed scoring system (partial credit via rubric nuggets instead of just right/wrong). Published at ICLR 2026. Paper | Dataset

Prerequisites

  • Java 21+
  • Docker and Docker Compose
  • Task (brew install go-task)
  • air (go install github.com/air-verse/air@latest — ensure ~/go/bin is in PATH)
  • An OpenAI API key (for embeddings, cognition extraction, and the benchmark LLM judge)

Setup (one time)

1. Configure the memory-service for cognition

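The commands in this section assume that memory-service, cognitive-memory, and memory-service-benchmarks are checked out side by side in the same parent directory. Starting from this repository, switch to the memory-service/ directory and copy the example override file: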

cd ../memory-service
cp ../cognitive-memory/memory-service/compose.override.yaml.example ./compose.override.yaml

2. Install the memory-service REST client

cd ../memory-service
./java/mvnw -f java/pom.xml -pl quarkus/memory-service-rest-quarkus -am install -DskipTests

3. Set your OpenAI API key

Add to your ~/.zshrc:

export OPENAI_API_KEY=sk-...
export PATH=$PATH:$HOME/go/bin

Then source ~/.zshrc.

4. Build the benchmark

cd memory-service-benchmarks
./mvnw clean package -DskipTests

Running the benchmarks

Start the services

Terminal 1 — Memory service:

cd memory-service
MEMORY_SERVICE_OPENAI_API_KEY=$OPENAI_API_KEY \
MEMORY_SERVICE_ROLES_ADMIN_CLIENTS="admin,turn_traces_processor,cognition_processor" \
MEMORY_SERVICE_ROLES_INDEXER_CLIENTS="agent,cognition_processor" \
MEMORY_SERVICE_API_KEYS_COGNITION_PROCESSOR=cognition-processor-key-123 \
MEMORY_SERVICE_AIR_FULL_BIN="./bin/memory-service serve" \
task dev:memory-service

Terminal 2 — Cognition processor (skip for no-cognition runs):

cd cognitive-memory/cognition-processor-quarkus
MEMORY_SERVICE_API_KEY=cognition-processor-key-123 \
MEMORY_MODEL_PROVIDER=openai \
MEMORY_MODEL_ID=gpt-4o-mini \
OPENAI_BASE_URL=https://api.openai.com/v1 \
OPENAI_MODEL_NAME=gpt-4o-mini \
./mvnw quarkus:dev

LoCoMo

# Single conversation (quick test)
java -Xmx2g -Dbenchmark.conversations=0 -jar target/quarkus-app/quarkus-run.jar locomo

# All 10 conversations
java -Xmx2g -jar target/quarkus-app/quarkus-run.jar locomo

# Without cognition (stop cognition processor first)
java -Xmx2g -Dbenchmark.cognition.enabled=false -jar target/quarkus-app/quarkus-run.jar locomo

LongMemEval

# Quick smoke test (2 per type = 12 questions)
java -Xmx2g -Dbenchmark.longmemeval.per-type=2 -jar target/quarkus-app/quarkus-run.jar longmemeval

# Default (5 per type = 30 questions)
java -Xmx2g -jar target/quarkus-app/quarkus-run.jar longmemeval

# All 500 questions (takes many hours)
java -Xmx2g -Dbenchmark.longmemeval.per-type=0 -jar target/quarkus-app/quarkus-run.jar longmemeval

# Filter by question type
java -Xmx2g -Dbenchmark.longmemeval.question-types=temporal-reasoning,multi-session \
  -jar target/quarkus-app/quarkus-run.jar longmemeval

# Without cognition
java -Xmx2g -Dbenchmark.cognition.enabled=false -jar target/quarkus-app/quarkus-run.jar longmemeval

BEAM

# Quick smoke test (1 chat from 100K tier, 20 questions)
java -Xmx2g -Dbenchmark.beam.max-chats=1 -jar target/quarkus-app/quarkus-run.jar beam

# Default (2 chats from 100K, 40 questions)
java -Xmx2g -jar target/quarkus-app/quarkus-run.jar beam

# All 100K chats (20 chats, 400 questions)
java -Xmx2g -Dbenchmark.beam.max-chats=0 -jar target/quarkus-app/quarkus-run.jar beam

# Larger size tiers
java -Xmx4g -Dbenchmark.beam.chat-sizes=100K,500K -jar target/quarkus-app/quarkus-run.jar beam

# Without cognition
java -Xmx2g -Dbenchmark.cognition.enabled=false -jar target/quarkus-app/quarkus-run.jar beam

Clean run (reset database)

To start fresh between runs:

cd memory-service
docker compose down -v
docker compose up -d qdrant postgres redis keycloak prometheus minio minio-init clickhouse langfuse-worker langfuse-web

Then restart memory-service and cognition processor.

Results

Results are written to results/ as JSON:

results/locomo_cognition_2026-06-25T08-27-19.json
results/locomo_substrate_2026-06-25T08-30-00.json
results/longmemeval_cognition_2026-06-25T10-15-00.json
results/beam_cognition_2026-06-26T15-00-00.json
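
The file names above appear to follow a benchmark_mode_timestamp pattern, with the colons of the time replaced by hyphens. The following is a minimal sketch of how such a name could be built, assuming that pattern holds. The class and method names are hypothetical and are not taken from the harness.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: illustrates the apparent naming pattern only.
final class ResultFileName {

    // Hyphens instead of colons keep the name valid on common file systems.
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH-mm-ss");

    // Joins benchmark name, run mode, and timestamp into a JSON file name.
    static String of(String benchmark, String mode, LocalDateTime when) {
        return benchmark + "_" + mode + "_" + STAMP.format(when) + ".json";
    }

    public static void main(String[] args) {
        LocalDateTime when = LocalDateTime.of(2026, 6, 25, 8, 27, 19);
        // Prints locomo_cognition_2026-06-25T08-27-19.json
        System.out.println(of("locomo", "cognition", when));
    }
}
```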

Metrics

Each benchmark reports three scoring metrics per question, matching the evaluation setup used in published memory-system evaluations (e.g., Mem0):

| Metric | What it measures | How it works |
| --- | --- | --- |
| LLM Judge | Semantic correctness | An LLM compares the answer to ground truth by meaning. "NYC" = "New York City". Most accurate but costs an LLM call per question. |
| F1 Score | Word overlap | Measures precision and recall of word overlap between answer and ground truth. Better than BLEU but still purely word-based. |
| BLEU Score | N-gram precision | Counts how many word sequences match exactly. Originally designed for machine translation. Cheapest but most rigid. |
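
To make the word-overlap idea concrete, here is a minimal sketch of a token-level F1 score. It assumes lowercase, punctuation-stripped, whitespace-separated tokens. The harness's actual tokenization and normalization may differ, and the class name is hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of token-level F1; not the harness's implementation.
final class TokenF1 {

    // Lowercases, replaces non-alphanumeric characters with spaces, splits on whitespace.
    static List<String> tokens(String text) {
        String cleaned = text.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9\\s]", " ").trim();
        return cleaned.isEmpty() ? List.of() : Arrays.asList(cleaned.split("\\s+"));
    }

    // Harmonic mean of precision and recall over the multiset of shared tokens.
    static double f1(String answer, String truth) {
        List<String> a = tokens(answer);
        List<String> t = tokens(truth);
        if (a.isEmpty() || t.isEmpty()) {
            return a.isEmpty() && t.isEmpty() ? 1.0 : 0.0;
        }
        Map<String, Integer> remaining = new HashMap<>();
        for (String word : t) {
            remaining.merge(word, 1, Integer::sum);
        }
        int overlap = 0;
        for (String word : a) {
            Integer left = remaining.get(word);
            if (left != null && left > 0) {
                remaining.put(word, left - 1); // each truth token can match only once
                overlap++;
            }
        }
        if (overlap == 0) {
            return 0.0;
        }
        double precision = (double) overlap / a.size();
        double recall = (double) overlap / t.size();
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        System.out.println(f1("New York City", "new york city"));
        System.out.println(f1("NYC", "New York City"));
    }
}
```

In this sketch, "NYC" scored against "New York City" shares no tokens and gets 0, which is the kind of case where the LLM Judge and the word-based metrics disagree.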

For BEAM, the LLM Judge uses rubric-based nugget scoring (0/0.5/1.0 per criterion) instead of binary CORRECT/WRONG, giving partial credit for partially correct answers.
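
As an illustration of partial credit, the sketch below combines per-criterion nugget scores into one question score. Aggregating by a plain mean is an assumption made here for illustration. The text above states only the 0/0.5/1.0 scale, not how the harness combines the criteria, and the class name is hypothetical.

```java
import java.util.List;

// Hypothetical sketch: mean aggregation is an assumption, not confirmed by the harness.
final class NuggetScore {

    // Averages per-criterion scores, each of which must be 0, 0.5, or 1.0.
    static double score(List<Double> nuggetScores) {
        if (nuggetScores.isEmpty()) {
            throw new IllegalArgumentException("at least one rubric criterion is required");
        }
        double sum = 0.0;
        for (double s : nuggetScores) {
            if (s != 0.0 && s != 0.5 && s != 1.0) {
                throw new IllegalArgumentException("nugget score must be 0, 0.5, or 1.0: " + s);
            }
            sum += s;
        }
        return sum / nuggetScores.size();
    }

    public static void main(String[] args) {
        // One criterion fully met, one partially met, one missed.
        System.out.println(score(List.of(1.0, 0.5, 0.0)));
    }
}
```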

Configuration

All settings in src/main/resources/application.properties:

| Property | Default | Description |
| --- | --- | --- |
| memory-service.url | http://localhost:8082 | Memory service URL |
| memory-service.api-key | agent-api-key-1 | API key for authentication |
| benchmark.top-k | 50 | Max memories to retrieve per question |
| benchmark.output-dir | results | Output directory |
| benchmark.cognition.enabled | true | Wait for cognition processor |
| benchmark.cognition.namespace | cognition.v1 | Cognition memory namespace |
| benchmark.cognition.wait-timeout-seconds | 600 | Max wait for extraction |
| benchmark.cognition.poll-interval-seconds | 10 | Polling interval |
| benchmark.cognition.stable-seconds | 90 | Seconds of stability before proceeding |
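
The three cognition timing properties suggest a poll-until-stable loop. The sketch below shows one way such a loop could work, assuming "stable" means the number of extracted memories has stopped changing. That definition, the memoryCount supplier, and the class name are all assumptions for illustration, not the harness's code.

```java
import java.util.function.LongSupplier;

// Hypothetical sketch of a poll-until-stable wait; not the harness's implementation.
final class StableWait {

    // Returns true once memoryCount has been unchanged for stableMs,
    // or false if timeoutMs elapses first or the thread is interrupted.
    static boolean awaitStable(LongSupplier memoryCount, long timeoutMs, long pollMs, long stableMs) {
        long start = System.nanoTime();
        long lastChange = start;
        long lastCount = memoryCount.getAsLong();
        while (true) {
            long now = System.nanoTime();
            if (millisBetween(lastChange, now) >= stableMs) {
                return true;
            }
            if (millisBetween(start, now) >= timeoutMs) {
                return false;
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for callers
                return false;
            }
            long count = memoryCount.getAsLong();
            if (count != lastCount) {
                lastCount = count;
                lastChange = System.nanoTime(); // any change restarts the stability window
            }
        }
    }

    private static long millisBetween(long fromNanos, long toNanos) {
        return (toNanos - fromNanos) / 1_000_000;
    }

    public static void main(String[] args) {
        // Table defaults (600 s, 10 s, 90 s) scaled down 1000x so the demo finishes quickly.
        // The constant supplier stands in for a real query that counts extracted memories.
        boolean stable = awaitStable(() -> 42L, 600, 10, 90);
        System.out.println(stable ? "extraction settled" : "timed out");
    }
}
```

A real check would probably also require at least one extracted memory before treating the count as stable. The sketch omits that to stay short.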

LoCoMo-specific:

| Property | Default | Description |
| --- | --- | --- |
| benchmark.dataset | datasets/locomo10.json | Path to LoCoMo dataset |
| benchmark.conversations | 0,1,2,3,4,5,6,7,8,9 | Which conversations to run |

BEAM-specific:

| Property | Default | Description |
| --- | --- | --- |
| benchmark.beam.dataset-dir | datasets/beam | Path to BEAM chats directory |
| benchmark.beam.chat-sizes | 100K | Comma-separated size tiers (100K, 500K, 1M, 10M) |
| benchmark.beam.max-chats | 2 | Max chats per size tier (0 = all) |

LongMemEval-specific:

| Property | Default | Description |
| --- | --- | --- |
| benchmark.longmemeval.dataset | datasets/longmemeval_s_cleaned.json | Path to LongMemEval dataset |
| benchmark.longmemeval.per-type | 5 | Questions per type (0 = all) |
| benchmark.longmemeval.seed | 42 | Random seed for sampling |
| benchmark.longmemeval.question-types | (all) | Comma-separated type filter |
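
The per-type and seed properties together imply reproducible, stratified sampling. A minimal sketch of that idea follows, assuming questions are grouped by type and each group is shuffled with a seeded java.util.Random. The Question record and the sampler class are hypothetical, and the harness may select questions differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Hypothetical sketch of seeded per-type sampling; not the harness's implementation.
final class PerTypeSampler {

    record Question(String id, String type) {}

    // Picks up to perType questions from each type; perType <= 0 keeps every question.
    static List<Question> sample(List<Question> all, int perType, long seed) {
        // TreeMap fixes the iteration order of types so the same seed gives the same picks.
        Map<String, List<Question>> byType = new TreeMap<>();
        for (Question q : all) {
            byType.computeIfAbsent(q.type(), k -> new ArrayList<>()).add(q);
        }
        Random random = new Random(seed);
        List<Question> picked = new ArrayList<>();
        for (List<Question> group : byType.values()) {
            Collections.shuffle(group, random);
            int n = perType <= 0 ? group.size() : Math.min(perType, group.size());
            picked.addAll(group.subList(0, n));
        }
        return picked;
    }

    public static void main(String[] args) {
        List<Question> all = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            all.add(new Question("q" + i, i % 2 == 0 ? "multi-session" : "temporal-reasoning"));
        }
        // Two question types, two questions each.
        System.out.println(sample(all, 2, 42L).size());
    }
}
```

With six question types, per-type=2 gives the 12 questions quoted for the smoke test and the default of 5 gives 30.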

Override any property with -D:

java -Xmx2g -Dbenchmark.top-k=20 -jar target/quarkus-app/quarkus-run.jar locomo
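
A -D flag sets a JVM system property, and Quarkus configuration gives system properties precedence over application.properties. The dependency-free sketch below mimics that behaviour for benchmark.conversations, using plain System.getProperty with the documented default as fallback. The class is hypothetical, and the real harness presumably reads the value through Quarkus configuration instead.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: reads a -D override with a fallback to the documented default.
final class ConversationSelection {

    // Default value of benchmark.conversations as listed in the LoCoMo table.
    static final String DEFAULT = "0,1,2,3,4,5,6,7,8,9";

    // Parses a comma-separated list of conversation indices, ignoring blanks.
    static List<Integer> parse(String csv) {
        return Arrays.stream(csv.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .map(Integer::valueOf)
                .toList();
    }

    // Uses the system property when set (for example via -Dbenchmark.conversations=0).
    static List<Integer> fromSystemProperty() {
        return parse(System.getProperty("benchmark.conversations", DEFAULT));
    }

    public static void main(String[] args) {
        System.out.println(fromSystemProperty());
    }
}
```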

Architecture

Dataset (LoCoMo / LongMemEval / BEAM)
        │
        ▼
┌──────────────────────────┐
│  Benchmark Runner        │
│  1. Ingest conversations │──→  Memory Service (REST API :8082)
│  2. Wait for cognition   │         │
│  3. Search memories      │         ▼
│  4. LLM answers          │     Cognition Processor (Quarkus :8090)
│  5. LLM judges           │     extracts facts, preferences, procedures
│  6. Compute metrics      │         │
└──────────────────────────┘         ▼
        │                    Memory Service stores extracted memories
        ▼                    under ["user", userId, "cognition.v1", ...]
   results/*.json

What the benchmarks test

LoCoMo (ACL 2024) — "Can you remember what friends talked about?"

Two friends chatting over months across 10 long conversations, ~1,540 questions (after excluding adversarial ones). Tests basic memory: can you recall facts, dates, causes, and connect information across sessions? Small and fast, so it is good for quick iteration.

5 categories:

| Category | Tests |
| --- | --- |
| Multi-hop | Connecting facts across sessions |
| Temporal | Dates, timing, sequences |
| Causal | Reasoning about why things happened |
| Factual | Direct fact recall |
| Adversarial | Questions about things never discussed (skipped) |

LongMemEval (ICLR 2025) — "Can you remember what each user told you?"

500 different users, each with months of chat history (~53 sessions per user). Tests the same skills as LoCoMo at a much larger scale, and additionally checks whether the system tracks updated facts. Each question is independent: no information leaks between users.

6 question types:

| Question Type | Tests |
| --- | --- |
| single-session-user | Facts stated by the user in one session |
| single-session-assistant | Information provided by the assistant |
| single-session-preference | User preferences expressed in one session |
| temporal-reasoning | Time-based reasoning across sessions |
| knowledge-update | Updated facts that override earlier ones |
| multi-session | Connecting information across multiple sessions |

BEAM (ICLR 2026) — "Can you handle massive conversations across many topics?"

100 conversations ranging from 100K to 10 million tokens, covering coding, math, health, finance, and personal topics. The hardest of the three — tests 10 different memory abilities with a more detailed scoring system (partial credit via rubric nuggets instead of just right/wrong).

| Question Type | Tests |
| --- | --- |
| abstention | Withholding answers when evidence is absent |
| contradiction_resolution | Detecting and reconciling inconsistent statements |
| event_ordering | Reconstructing chronological sequences |
| information_extraction | Recalling entities, dates, numbers, facts |
| instruction_following | Sustained adherence to user constraints |
| knowledge_update | Revising stored facts with new information |
| multi_session_reasoning | Integrating evidence across non-adjacent segments |
| preference_following | Adapting to evolving user preferences |
| summarization | Abstracting and compressing dialogue content |
| temporal_reasoning | Reasoning about time relations and durations |
