LEVANTE VLM Benchmark – Documentation

This directory contains user-facing and developer documentation for the LEVANTE VLM Benchmark.

data_schema.md – Canonical schema for trials, human responses, and item_uid → corpus → assets mapping.
releases.md – How to obtain LEVANTE trials data (Redivis) and run the R download script; versioning.
adding_tasks.md – How to add a LEVANTE task to the benchmark.
adding_models.md – How to add a VLM to the benchmark.
prompting_and_parsing.md – Prompt templates, use_json_format paths, and how option letters flow from dataset trials into the instructions.
runtime_exports.md – Public runtime API for external repos (load_model, run_trials, run-trials-jsonl).

Quick start

1. Install R and the redivis package; configure auth per releases.md.
2. Run scripts/data_prep/download_levante_assets.py (optional --version YYYY-MM-DD) to download corpus and images from the public LEVANTE assets bucket.
3. Run scripts/download_levante_data.R to fetch trials from Redivis into data/responses/<version>/.
4. Validate environment and GPU:
   - levante-bench list-tasks
   - levante-bench list-models
   - levante-bench check-gpu
5. Run evaluation and benchmarks:
   - levante-bench run-eval --task <task> --model <model> [--version <version>]
   - levante-bench run-eval --task <task> --model <model> --true-random-option-order --num-runs 3
   - levante-bench run-benchmark --benchmark v1 --device auto
   - levante-bench run-benchmark --benchmark vocab --device auto
6. Run comparison (R):
   - levante-bench run-comparison --task <task> --model <model> --version <version>
   - or run comparison/compare_levante.R directly
7. Use validation and result-history helpers:
   - scripts/validate_all.sh (smoke validations)
   - scripts/validate_all.sh --full-benchmarks
   - scripts/validate_all.sh --with-r-validation (adds R package checks)
   - scripts/validate_r.sh --run-comparison-smoke --version <version> (R comparison smoke test)
   - python3 scripts/list_benchmark_results.py --limit 20

Experiment YAML mode (eval-style)

You can run YAML-defined experiments directly through the CLI:

python -m levante_bench.cli experiment=configs/experiments/experiment.yaml
bash run_experiment.sh configs/experiments/experiment.yaml

You can also use OmegaConf dotlist overrides for task subsets and smoke caps:

python -m levante_bench.cli experiment=configs/experiments/experiment.yaml tasks=[vocab] max_items_vocab=8 device=cpu
python -m levante_bench.cli experiment=configs/experiments/experiment.yaml tasks=[egma-math] max_items_math=2 device=cpu
python -m levante_bench.cli experiment=configs/experiments/experiment.yaml tasks=[theory-of-mind] max_items_tom=2 device=cpu

When true_random_option_order is enabled (via the CLI flag or the experiment YAML), each run's outputs are written to a numbered subfolder (0001, 0002, ...), and the per-item option-ordering seeds are recorded in cache/responses.json. On Slurm/sbatch, run folders are by default nested under a job-specific parent of the form job.../0001 (for example job12345-task7/0001) so that concurrent jobs do not collide.
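
The numbered run folders and per-item seeds described above can be pictured with a short sketch. This is an illustration only, not the benchmark's actual code: the helper names next_run_dir and shuffle_options are hypothetical, and the real layout of cache/responses.json is not shown here.

```python
import random
import tempfile
from pathlib import Path


def next_run_dir(parent: Path) -> Path:
    """Create and return the next zero-padded run folder (0001, 0002, ...) under parent.

    Hypothetical helper; the benchmark's own numbering logic may differ.
    """
    parent.mkdir(parents=True, exist_ok=True)
    existing = [int(p.name) for p in parent.iterdir() if p.is_dir() and p.name.isdigit()]
    run_dir = parent / f"{max(existing, default=0) + 1:04d}"
    run_dir.mkdir()
    return run_dir


def shuffle_options(options: list[str], seed: int) -> list[str]:
    """Return a shuffled copy of options; reusing the seed reproduces the order."""
    shuffled = list(options)
    random.Random(seed).shuffle(shuffled)
    return shuffled


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # A job-specific parent keeps concurrent jobs from sharing run folders.
        parent = Path(tmp) / "job12345-task7"
        print(next_run_dir(parent).name)  # first run folder
        print(next_run_dir(parent).name)  # second run folder

    # Recording the seed next to each item is what makes a random order replayable.
    seed = 7  # assumed value, for illustration
    print(shuffle_options(["A", "B", "C", "D"], seed))
```

Storing the seed rather than the shuffled order keeps the record small while still allowing the exact option order of any item to be regenerated later.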

Runner migration checklist

Use this checklist when moving legacy benchmark scripts onto the registry-based levante_bench.evaluation.runner path.

1. Lock parity target artifacts
   - Math: predictions + summary + by-type outputs
   - ToM: predictions + summary and trial-type breakdown
   - Vocab: predictions + summary and quadrant stats
2. Implement task adapter hooks
   - Add per-task prepare/postprocess hooks in runner for script-only logic
3. Run parity gates
   - Row counts, parse rates, and metrics must match agreed tolerances
   - Required output files must exist with compatible schemas
4. Switch command routing incrementally
   - Move one task at a time from legacy scripts to runner-backed flow
5. Deprecate legacy paths only after stable overlap
   - Keep --legacy or equivalent during transition window
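
A parity gate from the checklist above could look roughly like the following sketch. It assumes that both the legacy script and the runner produce a flat summary dictionary; the function name parity_failures, the metric keys (n_rows, parse_rate, accuracy), and the tolerance values are all hypothetical and stand in for whatever the team agrees on.

```python
import math


def parity_failures(legacy: dict, runner: dict, tolerances: dict) -> list[str]:
    """Compare two summaries key by key; an empty result means the gate passes.

    tolerances maps each metric name to its allowed absolute difference.
    """
    failures = []
    for key, tol in tolerances.items():
        if key not in legacy or key not in runner:
            failures.append(f"{key}: missing from one summary")
            continue
        if not math.isclose(legacy[key], runner[key], rel_tol=0.0, abs_tol=tol):
            failures.append(f"{key}: legacy={legacy[key]} runner={runner[key]} (tol={tol})")
    return failures


if __name__ == "__main__":
    # Illustrative numbers only, not real benchmark results.
    legacy = {"n_rows": 120, "parse_rate": 0.98, "accuracy": 0.61}
    runner = {"n_rows": 120, "parse_rate": 0.975, "accuracy": 0.66}
    # Row counts must match exactly; rates and metrics get a small absolute tolerance.
    tolerances = {"n_rows": 0, "parse_rate": 0.01, "accuracy": 0.02}

    for line in parity_failures(legacy, runner, tolerances):
        print("PARITY FAIL", line)
```

Returning a list of failures instead of a single boolean makes it easier to see which metric drifted when a task is moved onto the runner.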

Citing

When using this benchmark, cite the LEVANTE manuscript and the DevBench paper (see main README).
