🍳 KitchenBench

A bimanual kitchen-manipulation benchmark for VLA models.

Built on RoboLens · part of WorldEvals, the "Inspect Evals for robotics".

KitchenBench is a set of 10 kitchen-manipulation tasks expressed as RoboLens Tasks. The tasks are embodiment-agnostic, so you can run them against any compatible policy and embodiment. The set emphasizes bimanual coordination (pouring, lid removal, folding, part-mating, a pure two-arm handover, and tool-mediated scooping) alongside classic pick-place, stacking, slotted insertion, and a multi-instance sort.

It ships a dependency-free mock kitchen so the whole suite runs in CI, and the same tasks are designed to run on real hardware, e.g. YAM bimanual arms driven by MolmoAct2.

The tasks

Task (`--task`)	Instruction	Variations	Bimanual	Category
`kitchenbench/place_cutlery`	place the {cutlery} on the {dishware}	spoon/fork/knife × plate/bowl/napkin		pick-place
`kitchenbench/stack`	stack the {items}	cups/bowls/plates		stacking
`kitchenbench/place_in_rack`	place the {dishware} into the dish rack	plate/bowl/cup		insertion
`kitchenbench/pour_pasta`	pour the dry pasta into the {vessel}	bowl/cup/pot	✅	granular
`kitchenbench/open_container`	open the {container}	jar/bottle/food container	✅	articulated
`kitchenbench/fold_cloth`	fold the {cloth}	dish towel/napkin/cloth	✅	deformable
`kitchenbench/seal_container`	seal the {container} with its lid	food container/pot/jar	✅	mating
`kitchenbench/handoff`	hand off the {item} from one arm to the other	utensil/cup/produce item	✅	coordination
`kitchenbench/sort_cutlery`	sort the cutlery into the correct tray compartments	3 pile layouts		classification
`kitchenbench/scoop_pasta`	scoop the {pasta} with the {tool} and transfer it to the container	penne/rigatoni × spoon/measuring cup	✅	granular+tool

Each task expands its variation axes into one Scene per combination (37 scenes total), each with a filled-in language instruction and a success Target.
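
The scene count can be checked by hand. The sketch below is not KitchenBench's own code: it copies the variation axes from the table into a plain dictionary (the names `VARIATION_AXES` and `expand`, and the placeholder layout labels for `sort_cutlery`, are illustrative assumptions) and takes the Cartesian product of each task's axes.

```python
from itertools import product

# Variation axes copied from the task table. This dict is illustrative,
# not the data structure KitchenBench uses internally.
VARIATION_AXES = {
    "place_cutlery": {"cutlery": ["spoon", "fork", "knife"],
                      "dishware": ["plate", "bowl", "napkin"]},
    "stack": {"items": ["cups", "bowls", "plates"]},
    "place_in_rack": {"dishware": ["plate", "bowl", "cup"]},
    "pour_pasta": {"vessel": ["bowl", "cup", "pot"]},
    "open_container": {"container": ["jar", "bottle", "food container"]},
    "fold_cloth": {"cloth": ["dish towel", "napkin", "cloth"]},
    "seal_container": {"container": ["food container", "pot", "jar"]},
    "handoff": {"item": ["utensil", "cup", "produce item"]},
    # The table only says "3 pile layouts"; these labels are placeholders.
    "sort_cutlery": {"layout": ["layout 1", "layout 2", "layout 3"]},
    "scoop_pasta": {"pasta": ["penne", "rigatoni"],
                    "tool": ["spoon", "measuring cup"]},
}


def expand(axes):
    """Return one dict per combination of axis values."""
    names = list(axes)
    return [dict(zip(names, combo)) for combo in product(*axes.values())]


scene_counts = {task: len(expand(axes)) for task, axes in VARIATION_AXES.items()}
total_scenes = sum(scene_counts.values())

# Filling in an instruction template for each combination of one task.
template = "place the {cutlery} on the {dishware}"
instructions = [template.format(**combo)
                for combo in expand(VARIATION_AXES["place_cutlery"])]

print(total_scenes)      # 37
print(instructions[0])   # place the spoon on the plate
```

Eight tasks contribute 3 scenes each, `place_cutlery` contributes 3 × 3 = 9, and `scoop_pasta` contributes 2 × 2 = 4, which gives 24 + 9 + 4 = 37.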

Install

# RoboLens isn't on PyPI yet, so install both from GitHub (uv recommended):
uv pip install "robolens @ git+https://github.com/robocurve/robolens@v0.1.0"
uv pip install "kitchenbench @ git+https://github.com/robocurve/kitchenbench"

Run it (mock kitchen, no hardware)

KitchenBench registers a dependency-free mock embodiment (kitchen) and policies (kitchen_scripted / kitchen_random / kitchen_noop) via entry points:

robolens list tasks                       # see all kitchenbench/* tasks
robolens run --task kitchenbench/pour_pasta --policy kitchen_scripted --embodiment kitchen

Or in Python:

from robolens import eval

(log,) = eval("kitchenbench/open_container", "kitchen_scripted", "kitchen")
print(log.status, log.results.metrics)    # success {'task_success': 1.0, 'episode_length': ...}

The mock is abstract (it models progress toward the scene goal, like RoboLens's CubePick). Its job is to exercise the pipeline and give you a template; the real value is in the task definitions, which run unchanged on a real robot.
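
To make "models progress toward the scene goal" concrete, here is a toy sketch. It is not the actual `kitchen` embodiment: the class `ProgressMock`, the scalar `effort` action, and the `"timeout"` reason are all assumptions made for illustration. The idea is a single progress value in [0, 1] that each step nudges, with the episode ending when it reaches 1 or the step budget runs out.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ProgressMock:
    """Toy abstract mock: tracks goal progress only (hypothetical class)."""
    max_steps: int = 50
    progress: float = 0.0
    steps: int = 0

    def reset(self) -> float:
        self.progress = 0.0
        self.steps = 0
        return self.progress

    def step(self, effort: float) -> tuple[float, bool, str | None]:
        """Add `effort` to progress (clamped to [0, 1]); return (progress, done, reason)."""
        self.steps += 1
        self.progress = min(1.0, max(0.0, self.progress + effort))
        if self.progress >= 1.0:
            return self.progress, True, "success"
        if self.steps >= self.max_steps:
            return self.progress, True, "timeout"
        return self.progress, False, None


def run_episode(env: ProgressMock, effort_per_step: float) -> tuple[str | None, int]:
    """Apply a constant effort until the episode ends; return (reason, steps)."""
    env.reset()
    done, reason = False, None
    while not done:
        _, done, reason = env.step(effort_per_step)
    return reason, env.steps


print(run_episode(ProgressMock(), 0.25))  # ('success', 4)
print(run_episode(ProgressMock(), 0.0))   # ('timeout', 50)
```

A policy that always makes progress finishes in a few steps, while one that does nothing runs out the step budget. This is loosely the contrast you would expect between a scripted and a no-op policy, though the real mock policies may behave differently.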

Run it on real hardware (YAM arms + MolmoAct2)

KitchenBench tasks are embodiment-agnostic. To evaluate on real YAM bimanual arms with MolmoAct2, provide two RoboLens components (e.g. in your own adapter package such as robocurve/embodiments):

A Policy wrapping MolmoAct2: act(observation) -> ActionChunk (the scene's instruction is fed to the VLA verbatim).
An Embodiment for the YAM arms: it implements reset/step/close and declares its action space (e.g. two 7-DoF arms + grippers) and cameras. Real hardware has no privileged success oracle, so the embodiment should turn the operator's confirmation at episode end into StepResult(terminated=True, termination_reason="success"), or set record.operator_judgement; KitchenBench's task_success scorer reads either. It should also declare the "self_paced" capability and pace the control loop inside step().

robolens run --task kitchenbench/pour_pasta --policy molmoact2 --embodiment yam_arms
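
The shape of those two components might look roughly like the sketch below. It does not import RoboLens: `ActionChunk` and `StepResult` are local stand-ins named after the types mentioned above, and their fields, the `capabilities` attribute, the `"failure"` reason, the `instruction` observation key, and the `model` and `ask_operator` callables are all assumptions. Check the RoboLens source for the real interfaces before adapting it.

```python
from __future__ import annotations

import time
from dataclasses import dataclass
from typing import Callable

ACTION_DIM = 16  # two 7-DoF arms plus one gripper each (assumed layout)


# Local stand-ins so the sketch runs without RoboLens installed.
# The real ActionChunk / StepResult may carry different fields.
@dataclass
class ActionChunk:
    actions: list[list[float]]  # one action vector per timestep


@dataclass
class StepResult:
    observation: dict
    terminated: bool = False
    termination_reason: str | None = None


class MolmoAct2Policy:
    """Policy side: `model` is a hypothetical callable standing in for MolmoAct2."""

    def __init__(self, model: Callable[[dict, str], list[list[float]]]):
        self.model = model

    def act(self, observation: dict) -> ActionChunk:
        # Assumption: the instruction reaches the policy inside the observation.
        # It is forwarded to the model unchanged.
        return ActionChunk(self.model(observation, observation["instruction"]))


class YamArmsEmbodiment:
    """Embodiment side: hardware I/O is stubbed out."""

    capabilities = frozenset({"self_paced"})  # assumed way to declare it

    def __init__(self, ask_operator: Callable[[], bool],
                 control_hz: float = 30.0, max_steps: int = 300):
        self.ask_operator = ask_operator
        self.period = 1.0 / control_hz
        self.max_steps = max_steps
        self.steps = 0
        self.instruction = ""

    def _observe(self) -> dict:
        # Real code would read cameras and joint state here.
        return {"instruction": self.instruction, "state": [0.0] * ACTION_DIM}

    def reset(self, instruction: str) -> dict:
        self.steps = 0
        self.instruction = instruction
        return self._observe()

    def step(self, action: list[float]) -> StepResult:
        if len(action) != ACTION_DIM:
            raise ValueError(f"expected {ACTION_DIM} values, got {len(action)}")
        # Real code would send `action` to the arms here.
        time.sleep(self.period)  # self-paced: the loop rate is enforced in step()
        self.steps += 1
        if self.steps >= self.max_steps:
            # No success oracle on hardware: ask the operator at episode end.
            # "failure" is an assumed reason string for a negative answer.
            reason = "success" if self.ask_operator() else "failure"
            return StepResult(self._observe(), True, reason)
        return StepResult(self._observe())

    def close(self) -> None:
        pass  # release hardware handles in a real adapter


# Dry run with a dummy model and an operator who always confirms.
policy = MolmoAct2Policy(lambda obs, text: [[0.0] * ACTION_DIM])
robot = YamArmsEmbodiment(ask_operator=lambda: True, control_hz=1000.0, max_steps=3)
result = StepResult(robot.reset("pour the dry pasta into the bowl"))
while not result.terminated:
    for action in policy.act(result.observation).actions:
        result = robot.step(action)
        if result.terminated:
            break
robot.close()
print(result.termination_reason)  # success
```

In a real adapter, `ask_operator` would block on a keypress or button, and the episode would usually end on a time limit or an operator stop rather than a fixed step count.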

RoboLens checks (policy, embodiment) compatibility (action dims, semantics, camera/state keys) before any motion and writes an immutable EvalLog.
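
As an illustration of what such a pre-flight check involves, the sketch below compares a policy's requirements with an embodiment's declared interface. It is not RoboLens's implementation: the `Spec` class, the `compatibility_errors` function, and the camera and state key names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spec:
    """Hypothetical summary of what a policy emits and needs, or an embodiment accepts and provides."""
    action_dim: int
    action_semantics: str
    camera_keys: frozenset
    state_keys: frozenset


def compatibility_errors(policy: Spec, embodiment: Spec) -> list:
    """Return human-readable mismatches; an empty list means the pair is compatible."""
    errors = []
    if policy.action_dim != embodiment.action_dim:
        errors.append(f"action dim: policy emits {policy.action_dim}, "
                      f"embodiment expects {embodiment.action_dim}")
    if policy.action_semantics != embodiment.action_semantics:
        errors.append(f"action semantics: {policy.action_semantics!r} "
                      f"vs {embodiment.action_semantics!r}")
    # The embodiment may offer extra keys; it must cover everything the policy reads.
    for label, needed, offered in (
        ("camera", policy.camera_keys, embodiment.camera_keys),
        ("state", policy.state_keys, embodiment.state_keys),
    ):
        missing = sorted(needed - offered)
        if missing:
            errors.append(f"missing {label} keys: {missing}")
    return errors


vla = Spec(16, "joint_position", frozenset({"top", "wrist_left"}), frozenset({"joints"}))
arms_ok = Spec(16, "joint_position",
               frozenset({"top", "wrist_left", "wrist_right"}), frozenset({"joints"}))
arms_bad = Spec(14, "joint_position", frozenset({"wrist_left"}), frozenset({"joints"}))

print(compatibility_errors(vla, arms_ok))        # []
print(len(compatibility_errors(vla, arms_bad)))  # 2
```

Running the check before `reset()` means a mismatched pair fails with a readable message instead of sending wrongly shaped commands to the arms.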

Development

uv venv && uv pip install -e ".[dev]"     # robolens resolved from the v0.1.0 tag
uv run pre-commit install
uv run pytest --cov                        # 100% coverage required
uv run ruff check . && uv run mypy

License

MIT
