ShoppingComp: Are LLMs Really Ready for Your Shopping Cart?

Key Features • Overview • Dataset • Metrics • Quickstart • Citation

ShoppingComp is a realistic benchmark for evaluating LLM-powered shopping agents under open-world, safety-critical, and consumer-driven settings.

It evaluates whether models can:

retrieve correct products,
satisfy fine-grained user constraints,
generate faithful shopping reports,
and recognize unsafe or invalid usage scenarios.

For the Chinese version, see README_ZH.md.

⭐ Key Features

🛒 Realistic expert-curated tasks grounded in authentic shopping needs
📏 Unified evaluation framework covering retrieval, reasoning, and safety
🧩 Rubric-based verification for fine-grained, interpretable scoring
🔍 Evidence-grounded evaluation with official specs and trusted reviews
⚡ Lightweight & reproducible judge pipeline (LLM-as-a-Judge + fast metrics)

🔭 Overview

Each ShoppingComp instance centers on a user shopping question, paired with:

expert-annotated ground-truth product lists,
structured rubrics capturing atomic constraints and safety conditions,
and verifiable evidence supporting expert decisions.

The evaluation pipeline is implemented in ShoppingCompJudge, which separates:

Judging: LLM-based rubric decisions producing structured JSONL
Scoring: deterministic aggregation without additional LLM calls

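Because scoring is deterministic and makes no further LLM calls, metrics can be recomputed cheaply from the saved judge outputs, and repeated scoring runs over the same judge outputs give identical results.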
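To illustrate the split, here is a minimal sketch of a scoring step that aggregates saved judge decisions without calling an LLM. The record fields (`question_id`, `rubric_id`, `passed`) are hypothetical and are not ShoppingCompJudge's actual output schema; see ShoppingCompJudge/ for the real format.

```python
import json
from collections import defaultdict


def rubric_pass_rates(jsonl_lines):
    """Aggregate judge decisions into a per-question rubric pass rate.

    Each line is assumed (hypothetically) to be a JSON object with a
    "question_id" field and a boolean "passed" field.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        qid = record["question_id"]
        totals[qid] += 1
        passes[qid] += bool(record["passed"])
    return {qid: passes[qid] / totals[qid] for qid in totals}


# Hypothetical judge output: one rubric decision per line.
judged = [
    '{"question_id": "q1", "rubric_id": "r1", "passed": true}',
    '{"question_id": "q1", "rubric_id": "r2", "passed": false}',
    '{"question_id": "q2", "rubric_id": "r1", "passed": true}',
]
print(rubric_pass_rates(judged))  # {'q1': 0.5, 'q2': 1.0}
```

The function reads only the stored decisions, so rerunning it on the same input always yields the same scores.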

📦 Dataset

The ShoppingComp dataset is hosted on Hugging Face:

👉 https://huggingface.co/datasets/huaixiao/ShoppingComp

Files

ShoppingComp_97_20260127.en.jsonl / .zh.jsonl — expert-curated shopping tasks
ShoppingComp_traps_48_20260127.en.jsonl / .zh.jsonl — safety-critical and trap scenarios

Load with 🤗 Datasets

from datasets import load_dataset

data_files = {
  "gt_en": "ShoppingComp_97_20260127.en.jsonl",
  "gt_zh": "ShoppingComp_97_20260127.zh.jsonl",
  "traps_en": "ShoppingComp_traps_48_20260127.en.jsonl",
  "traps_zh": "ShoppingComp_traps_48_20260127.zh.jsonl",
}

dataset = load_dataset("huaixiao/ShoppingComp", data_files=data_files)
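If you have downloaded the JSONL files and prefer not to depend on 🤗 Datasets, they can also be read with the standard library. The helper below is a minimal sketch; `read_jsonl` is a hypothetical name, and it makes no assumptions about which fields each record contains.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                records.append(json.loads(line))
    return records
```

For example, `read_jsonl("ShoppingComp_97_20260127.en.jsonl")` would return one dict per task, assuming the file is in the current directory.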

📏 Evaluation Metrics

ShoppingCompJudge currently supports the following metrics:

AnswerMatch-F1 — whether ground-truth products are retrieved
SoP (Selection Accuracy) — rubric satisfaction rate of selected products
Scenario Coverage — coverage of extracted user demands in reports
Rationale Validity (RV) — faithfulness and evidence grounding
Safety Rubric Pass Rate — compliance with safety-critical rubrics
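As a rough illustration of an F1-style retrieval metric such as AnswerMatch-F1, the sketch below computes precision, recall, and F1 over sets of product identifiers. Matching by exact identifier is an assumption made for this example; the benchmark's actual matching logic lives in ShoppingCompJudge and may differ (for instance, it may rely on the LLM judge). The function name `answer_match_f1` is hypothetical.

```python
def answer_match_f1(predicted, ground_truth):
    """F1 between predicted and ground-truth product sets.

    Assumes (for illustration only) that products match by exact identifier.
    """
    pred, gt = set(predicted), set(ground_truth)
    hits = len(pred & gt)
    if hits == 0:
        return 0.0  # also covers empty predictions or empty ground truth
    precision = hits / len(pred)
    recall = hits / len(gt)
    return 2 * precision * recall / (precision + recall)


# 2 of 3 predictions are correct; 2 of 4 ground-truth products are found.
score = answer_match_f1(["A", "B", "C"], ["B", "C", "D", "E"])
print(round(score, 4))  # 0.5714
```

Here precision is 2/3 and recall is 1/2, giving F1 = 4/7.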

⚡ Quickstart

1) Install

pip install -r requirements.txt
pip install -e .

2) Configure LLM API

cp api_config.example.yaml api_config.yaml
export SHOPPINGCOMPJUDGE_API_CONFIG="$(pwd)/api_config.yaml"

3) Run Evaluation

python -m ShoppingCompJudge run \
  --gt data/ShoppingComp_97_20260127.en.jsonl \
  --pred data/predictions.jsonl \
  --out-dir shoppingcomp_eval/ \
  --judge-model gemini-2.5-pro

For detailed formats and advanced options, see ShoppingCompJudge/.

🗂️ Repository Structure

ShoppingComp/
├── ShoppingCompJudge/      # evaluation framework (judge + metrics)
├── workflow.png            # overview figure
├── README.md               # benchmark overview
└── README_ZH.md            # Chinese-language README

📚 Citation

@article{tou2025shoppingcomp,
  title={ShoppingComp: Are LLMs Really Ready for Your Shopping Cart?},
  author={Tou, Huaixiao and Zeng, Ying and Ma, Cong and Li, Muzhi and Li, Minghao and Yuan, Weijie and Zhang, He and Jia, Kai},
  journal={arXiv preprint arXiv:2511.22978},
  year={2025}
}
