This directory contains user-facing and developer documentation for the LEVANTE VLM Benchmark.
- data_schema.md – Canonical schema for trials, human responses, and item_uid → corpus → assets mapping.
- releases.md – How to obtain LEVANTE trials data (Redivis) and run the R download script; versioning.
- adding_tasks.md – How to add a LEVANTE task to the benchmark.
- adding_models.md – How to add a VLM to the benchmark.
- prompting_and_parsing.md – Prompt templates, `use_json_format` paths, and how option letters flow from dataset trials into the instructions.
- runtime_exports.md – Public runtime API for external repos (`load_model`, `run_trials`, `run-trials-jsonl`).
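From an external repo, the runtime exports above are called roughly as follows. This is a hypothetical sketch only: the stub model, trial dicts, and signatures below are invented for illustration, and the real API is documented in runtime_exports.md.

```python
# Hypothetical sketch of the runtime-export call pattern from
# runtime_exports.md. The stub stands in for a real VLM so the shape of
# load_model / run_trials is visible; actual signatures may differ.

def load_model(name):
    """Stand-in for load_model: returns a callable model."""
    def model(trial):
        # A real model would consume the trial's image and prompt;
        # this stub just echoes the first option.
        return trial["options"][0]
    return model

def run_trials(model, trials):
    """Stand-in for run_trials: one prediction per trial."""
    return [{"item_uid": t["item_uid"], "prediction": model(t)} for t in trials]

trials = [
    {"item_uid": "vocab-001", "options": ["A", "B", "C"]},
    {"item_uid": "vocab-002", "options": ["B", "A", "C"]},
]
results = run_trials(load_model("stub-vlm"), trials)
```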
- Install R and the `redivis` package; configure auth per releases.md.
- Run `scripts/data_prep/download_levante_assets.py` (optional `--version YYYY-MM-DD`) to download the corpus and images from the public LEVANTE assets bucket.
- Run `scripts/download_levante_data.R` to fetch trials from Redivis into `data/responses/<version>/`.
- Validate the environment and GPU:

  ```
  levante-bench list-tasks
  levante-bench list-models
  levante-bench check-gpu
  ```
- Run evaluations and benchmarks:

  ```
  levante-bench run-eval --task <task> --model <model> [--version <version>]
  levante-bench run-eval --task <task> --model <model> --true-random-option-order --num-runs 3
  levante-bench run-benchmark --benchmark v1 --device auto
  levante-bench run-benchmark --benchmark vocab --device auto
  ```
- Run the comparison (R): `levante-bench run-comparison --task <task> --model <model> --version <version>`, or run `comparison/compare_levante.R` directly.
- Use the validation and result-history helpers:
  - `scripts/validate_all.sh` (smoke validations)
  - `scripts/validate_all.sh --full-benchmarks`
  - `scripts/validate_all.sh --with-r-validation` (adds R package checks)
  - `scripts/validate_r.sh --run-comparison-smoke --version <version>` (R comparison smoke test)
  - `python3 scripts/list_benchmark_results.py --limit 20`
You can run YAML-defined experiments directly through the CLI:

```
python -m levante_bench.cli experiment=configs/experiments/experiment.yaml
bash run_experiment.sh configs/experiments/experiment.yaml
```
You can also use OmegaConf dotlist overrides for task subsets and smoke caps:

```
python -m levante_bench.cli experiment=configs/experiments/experiment.yaml tasks=[vocab] max_items_vocab=8 device=cpu
python -m levante_bench.cli experiment=configs/experiments/experiment.yaml tasks=[egma-math] max_items_math=2 device=cpu
python -m levante_bench.cli experiment=configs/experiments/experiment.yaml tasks=[theory-of-mind] max_items_tom=2 device=cpu
```
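Each dotlist token overrides one config field. As a rough illustration of how the tokens above decompose (this toy parser is not OmegaConf, which additionally handles typed values and nested `a.b.c` keys):

```python
# Toy illustration of dotlist-style overrides: each "key=value" token
# sets one config field. OmegaConf does the real parsing; this version
# only splits on '=' and treats "[...]" values as lists.

def parse_dotlist(tokens):
    overrides = {}
    for token in tokens:
        key, _, value = token.partition("=")
        # Bracketed values like "[vocab]" denote lists in dotlist syntax.
        if value.startswith("[") and value.endswith("]"):
            value = [v.strip() for v in value[1:-1].split(",") if v.strip()]
        overrides[key] = value
    return overrides

cfg = parse_dotlist(["tasks=[vocab]", "max_items_vocab=8", "device=cpu"])
```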
When `true_random_option_order` is enabled (CLI flag or experiment YAML), run outputs are written under numbered subfolders (`0001`, `0002`, ...) and per-item option-ordering seeds are recorded in `cache/responses.json`. On Slurm/sbatch, run folders default to `job.../0001`-style parents (for example, `job12345-task7/0001`) to prevent cross-job collisions.
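The numbered-subfolder convention can be sketched as follows. This is a hypothetical illustration of the `0001`, `0002`, ... scheme, not the benchmark's actual allocation code:

```python
import os
import tempfile

def next_run_dir(parent):
    """Return the next zero-padded run subfolder (0001, 0002, ...) under parent."""
    os.makedirs(parent, exist_ok=True)
    existing = [int(d) for d in os.listdir(parent) if d.isdigit()]
    return os.path.join(parent, f"{max(existing, default=0) + 1:04d}")

# Example parent mimicking a Slurm-style job folder (name invented).
parent = os.path.join(tempfile.mkdtemp(), "job12345-task7")
run_dir = next_run_dir(parent)
os.makedirs(run_dir)  # first run lands in .../job12345-task7/0001
```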
Use this checklist when moving legacy benchmark scripts onto the registry-based `levante_bench.evaluation.runner` path.
- Lock parity target artifacts
- Math: predictions + summary + by-type outputs
- ToM: predictions + summary and trial-type breakdown
- Vocab: predictions + summary and quadrant stats
- Implement task adapter hooks
- Add per-task prepare/postprocess hooks in runner for script-only logic
- Run parity gates
- Row counts, parse rates, and metrics must match agreed tolerances
- Required output files must exist with compatible schemas
- Switch command routing incrementally
- Move one task at a time from legacy scripts to runner-backed flow
- Deprecate legacy paths only after stable overlap
- Keep `--legacy` (or an equivalent flag) during the transition window
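The parity gates above can be expressed as a small check. A hypothetical sketch: the metric names, values, and tolerances below are invented for illustration, and the agreed per-task tolerances are whatever the migration specifies.

```python
def parity_failures(legacy, migrated, tolerances):
    """Compare legacy vs. runner-backed metrics within per-metric tolerances.

    Exact-match gates (e.g. row counts) use a tolerance of 0.
    Returns the list of metrics that exceed their tolerance.
    """
    failures = []
    for metric, tol in tolerances.items():
        if abs(legacy[metric] - migrated[metric]) > tol:
            failures.append(metric)
    return failures

# Example gate: row counts must match exactly; parse rate and accuracy
# may drift within small tolerances (all numbers invented).
failures = parity_failures(
    {"rows": 120, "parse_rate": 0.98, "accuracy": 0.715},
    {"rows": 120, "parse_rate": 0.97, "accuracy": 0.714},
    {"rows": 0, "parse_rate": 0.02, "accuracy": 0.005},
)
```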
When using this benchmark, cite the LEVANTE manuscript and the DevBench paper (see main README).