🌍 WorldEvals

The Inspect Evals for robotics.

A curated catalog of physical-AI / VLA benchmarks built on RoboLens.

RoboLens is the framework (the "Inspect AI for robotics"); WorldEvals is the collection. Unlike Inspect Evals' monorepo, each benchmark here lives in its own repository, so it owns its release cadence, dependencies, hardware notes, and leaderboard. WorldEvals is the lightweight index that ties them together: which benchmarks exist, which tasks each provides, and how to install them.

robolens list tells you what's installed.
worldevals list tells you what exists and how to get it.

Benchmarks

Benchmark	Tasks	Tags	Status
KitchenBench — 10 bimanual kitchen-manipulation tasks	10	kitchen, bimanual, manipulation	alpha

Install & use

pip install "worldevals @ git+https://github.com/robocurve/worldevals"

worldevals list                 # all benchmarks
worldevals list --tag bimanual  # filter by tag
worldevals info kitchenbench    # repo, install command, task keys
worldevals tasks                # RoboLens tasks installed locally, by benchmark

Then install a benchmark and run it through RoboLens:

pip install "kitchenbench @ git+https://github.com/robocurve/kitchenbench"
robolens run --task kitchenbench/pour_pasta --policy kitchen_scripted --embodiment kitchen

Add your benchmark

A benchmark is any repo that:

depends on robolens,
defines one or more RoboLens Tasks, and
registers them via [project.entry-points."robolens.tasks"] (and, if it ships a sim/embodiment or policy, robolens.embodiments / robolens.policies).

To list it here, add a Benchmark(...) entry to src/worldevals/catalog.py and open a PR. A test validates every entry (unique name, well-formed repo URL, ≥1 task key). See KitchenBench as the reference implementation.
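
The three checks named above can be sketched as follows. This is a minimal stand-in under stated assumptions: the real Benchmark class in src/worldevals/catalog.py is not shown in this README, so the field names (name, repo, task_keys, tags, status) and the validate_catalog helper are hypothetical, and the actual test may be stricter or shaped differently.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Benchmark:
    # Hypothetical fields; the real catalog entry may differ.
    name: str
    repo: str
    task_keys: tuple[str, ...]
    tags: tuple[str, ...] = ()
    status: str = "alpha"


def validate_catalog(catalog: list[Benchmark]) -> list[str]:
    """Return human-readable problems; an empty list means the catalog passes."""
    errors: list[str] = []
    seen: set[str] = set()
    for bench in catalog:
        # Check 1: names must be unique across the catalog.
        if bench.name in seen:
            errors.append(f"duplicate name: {bench.name}")
        seen.add(bench.name)
        # Check 2: the repo URL needs an https scheme, a host, and a path.
        parsed = urlparse(bench.repo)
        if parsed.scheme != "https" or not parsed.netloc or not parsed.path.strip("/"):
            errors.append(f"{bench.name}: malformed repo URL {bench.repo!r}")
        # Check 3: at least one task key.
        if not bench.task_keys:
            errors.append(f"{bench.name}: needs at least one task key")
    return errors


CATALOG = [
    Benchmark(
        name="kitchenbench",
        repo="https://github.com/robocurve/kitchenbench",
        task_keys=("kitchenbench/pour_pasta",),
        tags=("kitchen", "bimanual", "manipulation"),
    ),
]
```

Returning a list of problems instead of raising on the first one lets a single test run report every bad entry at once.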

Development

uv venv && uv pip install -e ".[dev]"     # robolens resolved from the v0.1.0 tag
uv run pre-commit install
uv run pytest --cov                        # 100% coverage required
uv run ruff check . && uv run mypy

License

MIT
