The Inspect Evals for robotics.
A curated catalog of physical-AI / VLA benchmarks built on RoboLens.
RoboLens is the framework (the "Inspect AI for robotics"). WorldEvals is the collection — but unlike Inspect Evals' monorepo, each benchmark here lives in its own repository so it owns its release cadence, dependencies, hardware notes, and leaderboard. WorldEvals is the lightweight index that ties them together: what benchmarks exist, what tasks each provides, and how to install them.
`robolens list` tells you what's installed. `worldevals list` tells you what exists and how to get it.
| Benchmark | Tasks | Tags | Status |
|---|---|---|---|
| KitchenBench — 10 bimanual kitchen-manipulation tasks | 10 | kitchen, bimanual, manipulation | alpha |

```bash
pip install "worldevals @ git+https://github.com/robocurve/worldevals"
worldevals list                  # all benchmarks
worldevals list --tag bimanual   # filter by tag
worldevals info kitchenbench     # repo, install command, task keys
worldevals tasks                 # RoboLens tasks installed locally, by benchmark
```

Then install a benchmark and run it through RoboLens:

```bash
pip install "kitchenbench @ git+https://github.com/robocurve/kitchenbench"
robolens run --task kitchenbench/pour_pasta --policy kitchen_scripted --embodiment kitchen
```

A benchmark is any repo that:
- depends on `robolens`,
- defines one or more RoboLens `Task`s, and
- registers them via `[project.entry-points."robolens.tasks"]` (and, if it ships a sim/embodiment or policy, `robolens.embodiments` / `robolens.policies`).
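
The entry-point registration in the last bullet can be inspected with the standard library alone. The sketch below lists whatever is registered under the three group names mentioned above in the current environment. It only illustrates the generic Python entry-point mechanism and is not RoboLens's own discovery code; the helper name `discover_entry_points` is hypothetical.

```python
from __future__ import annotations

from importlib.metadata import entry_points


def discover_entry_points(group: str) -> dict[str, str]:
    """Map each entry-point name in `group` to its 'module:attr' target."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: selectable entry-point collection.
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: plain dict keyed by group name.
        selected = eps.get(group, [])
    return {ep.name: ep.value for ep in selected}


if __name__ == "__main__":
    # Group names taken from the list above.
    for group in ("robolens.tasks", "robolens.embodiments", "robolens.policies"):
        found = discover_entry_points(group)
        print(f"{group}: {len(found)} registered")
        for name, target in sorted(found.items()):
            print(f"  {name} -> {target}")
```

In an environment with no benchmark installed, each group should simply report zero entries; after installing a benchmark package, its registered task keys would be expected to show up under `robolens.tasks`.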

To list it here, add a `Benchmark(...)` entry to `src/worldevals/catalog.py` and open a PR. A test validates every entry (unique name, well-formed repo URL, ≥1 task key). See KitchenBench as the reference implementation.
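
The three checks named above (unique name, well-formed repo URL, at least one task key) can be sketched roughly as follows. The `Benchmark` fields and the `validate_catalog` helper are assumptions made for illustration, as is the choice to treat only `https` URLs with a host and a path as well-formed; the real definitions live in `src/worldevals/catalog.py` and may differ.

```python
from __future__ import annotations

from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Benchmark:
    # Hypothetical fields; the real class in catalog.py may differ.
    name: str
    repo: str
    tasks: tuple[str, ...]
    tags: tuple[str, ...] = ()


def validate_catalog(catalog: list[Benchmark]) -> list[str]:
    """Return human-readable problems; an empty list means the catalog passed."""
    problems: list[str] = []
    seen: set[str] = set()
    for bench in catalog:
        # Check 1: names must be unique across the catalog.
        if bench.name in seen:
            problems.append(f"duplicate name: {bench.name}")
        seen.add(bench.name)
        # Check 2 (assumed rule): https scheme, a host, and a non-empty path.
        url = urlparse(bench.repo)
        if url.scheme != "https" or not url.netloc or not url.path.strip("/"):
            problems.append(f"malformed repo URL for {bench.name}: {bench.repo}")
        # Check 3: at least one task key.
        if not bench.tasks:
            problems.append(f"{bench.name} lists no task keys")
    return problems


catalog = [
    Benchmark(
        name="kitchenbench",
        repo="https://github.com/robocurve/kitchenbench",
        tasks=("kitchenbench/pour_pasta",),
        tags=("kitchen", "bimanual", "manipulation"),
    ),
]
assert validate_catalog(catalog) == []
```

Returning a list of problems instead of raising on the first failure lets a single test run report every broken entry at once, which tends to be friendlier for PR review.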

```bash
uv venv && uv pip install -e ".[dev]"   # robolens resolved from the v0.1.0 tag
uv run pre-commit install
uv run pytest --cov                      # 100% coverage required
uv run ruff check . && uv run mypy
```