A benchmark for evaluating AI spatial reasoning through Minecraft-style voxel construction.
Models are given a natural-language prompt and must produce raw 3D coordinates as JSON. In tool mode, models call voxel.exec (minimal primitives: block, box, line) to generate large builds beyond token-only JSON limits. MineBench visualizes the output and ranks models via head-to-head voting with a confidence-aware Glicko-style system (public ordering by conservative score).
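To make the two output modes concrete, here is a minimal sketch of what a build might look like in each. The field names and primitive shapes below are illustrative assumptions, not MineBench's actual schema; see docs/voxel-exec-raw-output.md for the real formats.

```typescript
// Hypothetical shapes only -- field and op names are assumptions for illustration.
type Block = { x: number; y: number; z: number; block: string };

// JSON mode: the model emits every block as an explicit coordinate.
const jsonBuild: Block[] = [
  { x: 0, y: 0, z: 0, block: "stone_bricks" },
  { x: 0, y: 1, z: 0, block: "stone_bricks" },
];

// Tool mode: voxel.exec primitives (block, box, line) expand server-side,
// so a large build does not need every coordinate spelled out in tokens.
type Primitive =
  | { op: "block"; at: [number, number, number]; block: string }
  | { op: "box"; from: [number, number, number]; to: [number, number, number]; block: string }
  | { op: "line"; from: [number, number, number]; to: [number, number, number]; block: string };

const toolCalls: Primitive[] = [
  { op: "box", from: [0, 0, 0], to: [8, 5, 8], block: "oak_planks" },
  { op: "line", from: [0, 6, 0], to: [8, 6, 8], block: "oak_log" },
];
```

A single `box` call here stands in for hundreds of individual blocks, which is the point of tool mode.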
Most LLM benchmarks test text and raw accuracy. MineBench instead tests whether a model can reason about 3D space. Given a prompt like "a medieval castle with four towers", the model must mentally construct the geometry, pick materials, and output thousands of precise block coordinates. No vision model or diffusion – just math and spatial logic.
As it turns out, this kind of spatial reasoning correlates strongly with a model's raw general intelligence: the MineBench leaderboard anecdotally tracks the same hierarchy most people observe in real-world usage, and the strongest reasoning models stand out clearly when asked to produce visual builds.
Unlike most benchmarks, MineBench offers a quick visual way to gauge (at least one aspect of) a model's raw intelligence. The ranking system also exposes models that are clearly 'bench-maxed', i.e. models with impressive benchmark numbers on paper that clearly fall short in real-world usage.
- Arena — blind head-to-head comparisons of pre-generated builds with confidence-aware ranking
- Sandbox — compare existing builds or generate new ones live with your own API keys
- Local Lab — copy the benchmark prompt, run it in any model, paste the JSON back to render
- Leaderboard — live rankings with win/loss/draw stats across all models
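The "confidence-aware" arena ordering can be sketched as a conservative score: rating minus a multiple of the rating deviation, a common Glicko-style convention. The 2x multiplier and the exact numbers below are assumptions for illustration, not MineBench's actual parameters (see docs/arena-ranking-system.md for the real math).

```typescript
// Sketch of conservative-score ordering, assuming each model carries a
// Glicko-like (rating, deviation) pair. The 2x multiplier is a common
// convention, not necessarily what MineBench uses.
interface ModelRating {
  name: string;
  rating: number; // estimated skill
  rd: number;     // rating deviation (uncertainty)
}

function conservativeScore(m: ModelRating): number {
  // Lower bound of the plausible rating range: uncertain models rank lower.
  return m.rating - 2 * m.rd;
}

// Hypothetical entries: model-b has the higher mean rating but far fewer
// votes, so its deviation is large and it ranks below model-a publicly.
const models: ModelRating[] = [
  { name: "model-a", rating: 1620, rd: 40 },
  { name: "model-b", rating: 1650, rd: 120 },
];

const ranked = [...models].sort(
  (a, b) => conservativeScore(b) - conservativeScore(a),
);
```

This is why a freshly added model does not jump to the top of the leaderboard on a few lucky wins: its score only rises as votes accumulate and its deviation shrinks.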
- Full docs index: docs/README.md
- Local development: docs/local-development.md
- Operations and API reference: docs/operations.md
- Deployment: docs/deployment.md
- Ranking math and matchmaking walkthrough: docs/arena-ranking-system.md
- Ranking policy: docs/arena-ranking-validity-policy-v2.md
- Voxel tool runtime, conversion, and import workflows: docs/voxel-exec-raw-output.md
MineBench currently benchmarks models from OpenAI, Anthropic, Google, Moonshot, DeepSeek, MiniMax, xAI, Z.AI, Qwen, Meta, and any model available through OpenRouter.
This path lets you run the full app and compare existing builds from uploads/ without generating new ones.
Prereqs: Node.js 18+, pnpm, Docker.
pnpm install
cp .env.example .env
pnpm dev:setup

In a second terminal:

pnpm prompt --import

Then open:

- http://localhost:3000/ (Arena)
- http://localhost:3000/sandbox
- http://localhost:3000/leaderboard
For environment variables, live generation, seeding/import workflows, batch generation, API routes, troubleshooting, and deployment, see the docs:
Contributions are welcome! See CONTRIBUTING.md for how to add new models, submit benchmark prompts, improve the UI, or fix bugs.
Running MineBench is expensive: model inference, storage, and hosting costs add up quickly as the benchmark grows.
Support directly via Buy Me a Coffee.
MineBench is also sponsored by 3D-Agent, an AI assistant for Blender and 3D workflows. Use code MINEBENCH10 for 10% off a subscription.
Disclosure: MineBench earns a recurring affiliate commission when this code is used.
Texture pack: Faithful (see assets/texture-pack/LICENSE.txt)
[Disclaimer: all documentation (including this README) and the frontend are almost entirely AI-created]



