superCat-star/LoCoMo_refined

 
 


LoCoMo Refined

LoCoMo Refined is a systematic recalibration of the original LoCoMo benchmark. LoCoMo is a benchmark for long-conversation memory, with questions about time, events, interpersonal relationships, and user preferences. Its purpose is to test whether an agent or memory framework can still recall information accurately after a conversation becomes very long.

We kept working on this benchmark not because the original LoCoMo lacked value, but precisely because it is already widely used, which makes it important that this measuring stick is reliable. Recently, many memory systems have posted impressive scores on the benchmark, yet once deployed in real applications they still get times wrong, mix up details, and go beyond the evidence in their answers. After breaking down the LoCoMo evaluation pipeline, we found that the main issue was not that the questions were too hard, but that the benchmark itself was not strict enough.

What LoCoMo-Refined Changes

This release focuses on two things: making the LLM judger behave more like a real evaluator, and cleaning the dataset itself so that the benchmark becomes more trustworthy.

1. A stricter judger

The original LLM judger used boundaries that were too loose, so a lot of answers that were “roughly on the right track but wrong in the details” could still pass. That kind of looseness may be acceptable for open-ended generation tasks, but it is not enough for memory evaluation. Memory benchmarks are not really about whether an answer “sounds close enough”; they are about whether the system accurately recalls known information.

So we redefined the core principle of the judger:

Inclusive without contradiction, complete without overreach.

In more direct terms, that means three things:

  • The answer has to cover all of the required information, not just part of it.
  • It cannot add content that is not supported by evidence.
  • Time information has to align strictly, rather than being glossed over through vague conversion or unsupported extra detail.

You can find the full prompt in src/llm_judge.py. The repository keeps both the new and old judger implementations for comparison.
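As a rough mental model of the three rules above, the following toy check treats an answer as a set of facts. It is only an illustration: the real judger is an LLM prompt, and the function name, the set representation, and the example facts are all hypothetical.

```python
def toy_verdict(required: set[str], supported: set[str], answered: set[str]) -> str:
    """Toy stand-in for the judging principle; not the repository's judger."""
    if not required <= answered:
        return "incomplete"  # a required fact is missing from the answer
    if not answered <= supported:
        return "overreach"   # the answer adds a fact the evidence does not support
    return "correct"


# Hypothetical facts for one question.
required = {"visited Paris", "on 7 May 2023"}
supported = required | {"with her sister"}

print(toy_verdict(required, supported, {"visited Paris"}))
print(toy_verdict(required, supported, required | {"stayed two weeks"}))
print(toy_verdict(required, supported, required | {"with her sister"}))
```

Under this model, an answer that gives only "May 2023" when the gold answer is "7 May 2023" would count as incomplete, which is the kind of time drift the stricter judger is meant to catch.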

We also ran a human-alignment experiment to check whether this new judger is actually closer to human judgment. On 300 manually annotated samples, Qwen/Qwen3-14B + the refined prompt reached 86.33% agreement accuracy with human annotations, while the original LoCoMo setup, GPT-4o-mini + the original prompt, reached only 43.67%. This suggests that our change is not simply about making the rules harsher; it is about pulling the decision boundary back toward human consensus.
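For reference, agreement accuracy here can be read as the share of samples where the judger's verdict matches the human label. The sketch below assumes boolean correct/incorrect verdicts and uses synthetic labels, not the real annotations. On 300 samples, 259 matches round to 86.33% and 131 matches round to 43.67%; these counts are inferred from the percentages and are not reported figures.

```python
def agreement_accuracy(judge_verdicts, human_verdicts):
    """Fraction of samples where the judge's verdict equals the human label."""
    if len(judge_verdicts) != len(human_verdicts):
        raise ValueError("both verdict lists must have the same length")
    if not human_verdicts:
        raise ValueError("at least one annotated sample is required")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(human_verdicts)


# Synthetic labels for illustration only: 259 agreements out of 300.
human = [True] * 300
judge = [True] * 259 + [False] * 41
print(f"{agreement_accuracy(judge, human):.2%}")
```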

2. A cleaner dataset

Besides the judger, the quality of the question set itself also directly affects how trustworthy the benchmark is. We used AI for initial screening and then had 5 human annotators review the results, checking the core memory-evaluation questions in LoCoMo one by one. In total, we revised 337 samples with logical or factual issues. These issues included ambiguous question wording, reversed subject-object relationships, and time information inconsistent with the original conversations.

The point of this step is straightforward: if the question or gold answer itself is off, then the final evaluation reflects not real capability, but who got lucky. After fixing these problematic samples, LoCoMo-Refined becomes much more suitable for serious memory-system evaluation.

The public dataset is available at ./data/raw/locomo_refined.json (1382 questions).

In this repository, the QA schema is unified as:

  • answer: list of acceptable gold answers

What We Hope This Benchmark Solves

The goal of LoCoMo-Refined is to make the scores more meaningful. With a stricter judger and a cleaner question set, issues that used to slip through, such as time drift, redundant information, and unsupported extra details, can now be identified much more reliably.

In our reruns, the same system predictions score noticeably lower under the stricter evaluation standard. This indicates that some of the high scores on the old benchmark were inflated by overly loose rules.

For people building Agent memory systems, this matters more than a pretty score. We hope this benchmark helps the community see the real bottlenecks earlier and spend optimization effort on the things that actually affect long-term memory ability.


See LICENSE.txt for the license and NOTICE for modification notes.

Below are the steps to run the evaluation.

1. Environment setup

First, make sure these two requirements are met:

  • Python 3.11+
  • The openai and tenacity packages can be installed

If you do not already have a usable environment, the simplest option is to create a Python 3.11 environment with conda:

cd /path/to/LoCoMo_refined
conda create -n locomo-refined python=3.11 -y
conda activate locomo-refined
pip install openai tenacity
export LOCOMO_PYTHON_BIN="$(which python)"

If you already have a conda environment or any usable Python 3.11 environment, just confirm the following:

cd /path/to/LoCoMo_refined
conda activate <your-env-name>
python --version
python -m pip show openai tenacity
export LOCOMO_PYTHON_BIN="$(which python)"
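If you prefer to check the two requirements from Python itself, a small helper along these lines may be useful. It is a sketch rather than part of the repository, and the function name missing_requirements is made up for this example.

```python
import importlib.util
import sys


def missing_requirements(packages=("openai", "tenacity"), min_version=(3, 11)):
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []
    if sys.version_info < min_version:
        found = sys.version.split()[0]
        problems.append(
            f"Python {min_version[0]}.{min_version[1]}+ required, found {found}"
        )
    for name in packages:
        # find_spec returns None when a top-level package cannot be imported.
        if importlib.util.find_spec(name) is None:
            problems.append(f"package not installed: {name}")
    return problems


for problem in missing_requirements():
    print(problem)
```

An empty result only shows that the interpreter version and imports are in place; it does not verify package versions.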

2. Prepare the prediction file

By default, the evaluator reads this prediction file:

./outputs/predictions.jsonl

The file format is JSONL, which means one JSON object per line. At minimum, each line should contain the two fields qa_id and predicted_answer:

{"qa_id":"conv-26#q0000","predicted_answer":"7 May 2023"}
{"qa_id":"conv-26#q0001","predicted_answer":"2022"}

The qa_id values should match the IDs in:

./data/public/questions.jsonl
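One possible way to produce and sanity-check this file from Python is sketched below. The helper names are hypothetical, and the optional known_ids argument assumes you have already collected the valid IDs from the questions file yourself; its exact layout is not described here.

```python
import json

REQUIRED_FIELDS = ("qa_id", "predicted_answer")


def to_jsonl(predictions: dict[str, str]) -> str:
    """Serialize {qa_id: predicted_answer} as one JSON object per line."""
    return "".join(
        json.dumps({"qa_id": qa_id, "predicted_answer": answer}, ensure_ascii=False)
        + "\n"
        for qa_id, answer in predictions.items()
    )


def check_lines(lines, known_ids=None):
    """Return a list of problems found in JSONL lines (empty list = OK)."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {lineno}: not valid JSON")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {lineno}: expected a JSON object")
            continue
        missing = [key for key in REQUIRED_FIELDS if key not in record]
        if missing:
            problems.append(f"line {lineno}: missing {', '.join(missing)}")
        elif known_ids is not None and record["qa_id"] not in known_ids:
            problems.append(f"line {lineno}: unknown qa_id {record['qa_id']}")
    return problems


text = to_jsonl({"conv-26#q0000": "7 May 2023", "conv-26#q0001": "2022"})
print(text, end="")
print(check_lines(text.splitlines()))
```

Writing text to ./outputs/predictions.jsonl with UTF-8 encoding then gives the evaluator a file in the expected shape.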

3. Run lexical evaluation

If you only want lexical metrics, run:

./scripts/run_eval.sh --metrics f1 bleu

The evaluation outputs are written by default to:

  • ./outputs/predictions_scored.jsonl
  • ./outputs/predictions_scored_summary.json
  • ./outputs/predictions_scored_summary.md

Once predictions.jsonl is in place, this step should run without further configuration.
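To give a feel for what a lexical metric such as f1 measures, here is a minimal token-overlap F1 sketch. It is not the repository's implementation: the tokenization (lowercased word characters) and the choice to take the best score across the list of gold answers are assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into word tokens (an assumed normalization)."""
    return re.findall(r"\w+", text.lower())


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall against one gold answer."""
    pred, ref = tokenize(prediction), tokenize(gold)
    if not pred or not ref:
        return float(pred == ref)  # both empty counts as a match
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction: str, gold_answers: list[str]) -> float:
    """Score against every acceptable gold answer and keep the best."""
    return max((token_f1(prediction, g) for g in gold_answers), default=0.0)


print(round(best_f1("May 2023", ["7 May 2023"]), 2))
```

In this sketch, "May 2023" against "7 May 2023" still earns 0.8, which shows why lexical overlap alone is lenient about dropped time detail and why the LLM judge is run as a separate metric.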

4. Run LLM Judge evaluation

If you also want to run the llm metric, first configure the evaluator:

export EVALUATOR_MODEL=qwen3-14b
# Optional: set this if you use a custom OpenAI-compatible endpoint
export EVALUATOR_API_BASE=https://your-endpoint/v1
# Optional: set API key if your endpoint requires authentication
export EVALUATOR_API_KEY=your_api_key

LoCoMo-Refined's official judge LLM is Qwen3-14B. Accepted aliases are:

  • qwen3-14b
  • qwen3_14b
  • Qwen/Qwen3-14B
  • qwen/qwen3-14b
  • vendor-prefixed variants that still end with one of the aliases above, such as dashscope/qwen3-14b

If you run with a non-Qwen model, the script will warn and require you to type yes before continuing.
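The alias rule above could be approximated as follows. This is a guess at the behavior rather than the script's actual code: case-insensitive comparison and "/" as the vendor separator are assumptions, and is_official_judge is a hypothetical name.

```python
# Listed aliases, lowercased; "Qwen/Qwen3-14B" folds into "qwen/qwen3-14b".
ALIASES = ("qwen3-14b", "qwen3_14b", "qwen/qwen3-14b")


def is_official_judge(model_name: str) -> bool:
    """True if the name is an alias or a vendor-prefixed name ending in one."""
    name = model_name.strip().lower()
    return any(name == alias or name.endswith("/" + alias) for alias in ALIASES)


for candidate in ("Qwen/Qwen3-14B", "dashscope/qwen3-14b", "gpt-4o-mini"):
    print(candidate, is_official_judge(candidate))
```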

Then run:

./scripts/run_eval.sh --metrics llm f1 bleu --llm-judge refined

--llm-judge accepts two values:

  • refined (default): the revised judger from this repository. It handles time granularity and list completeness more strictly and gives more stable results.
  • original: the original LoCoMo judger, which is more lenient. This is mainly useful when you want to compare against the original paper.
