Extensible Python-first benchmark comparing VLMs (CLIP-style and LLaVA-style) to children's behavioral data from LEVANTE. R is used for downloading trials (Redivis), fetching IRT models, and for statistical comparison; Python is used for config, data loaders, model adapters, and the evaluation runner.
```bash
python3 -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements.txt
pip install -e .  # install this package
# Optional: pip install -r requirements-transformers.txt  # for CLIP
```

Use Python 3.10–3.13. On 3.13, `requirements.txt` pins `torch>=2.6` and newer numpy/pandas so pip can install wheels (older torch/pandas often have no cp313 builds).
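The version constraint above can be checked programmatically before installing; a minimal sketch (the helper name is ours, not part of the package):

```python
import sys

def python_version_supported(version_info=sys.version_info):
    """Return True when the interpreter is in the supported 3.10-3.13 range."""
    major, minor = version_info[0], version_info[1]
    return major == 3 and 10 <= minor <= 13
```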
Pinned dependencies are in `requirements.txt`; development extras are in `requirements-dev.txt`.
- IRT model mapping: Edit `src/levante_bench/config/irt_model_mapping.csv` to map each task to its IRT model `.rds` file in the Redivis model registry (e.g. `trog`, `trog/multigroup_site/overlap_items/trog_rasch_f1_scalar.rds`).
- Data (R): Install R and the `redivis` package; run `Rscript scripts/download_levante_data.R` to fetch trials and IRT models into `data/responses/<version>/`.
- Assets (Python): Run `python scripts/download_levante_assets.py [--version YYYY-MM-DD]` to download corpus and images from the public LEVANTE bucket into `data/assets/<version>/`.
- Evaluate:

  ```bash
  levante-bench list-tasks
  levante-bench list-models
  levante-bench check-gpu  # verify local CUDA availability
  levante-bench run-eval --task trog --model clip_base [--version VERSION]
  levante-bench run-benchmark --benchmark v1 --device auto
  levante-bench run-benchmark --benchmark vocab --device auto
  levante-bench run-workflow --workflow smol-vocab -- --help
  levante-bench run-workflow --workflow benchmark-v1 -- --help
  scripts/validate_all.sh  # ruff + pytest + GPU check + benchmark smoke runs
  scripts/validate_all.sh --full-benchmarks  # same checks + full v1 and vocab benchmarks
  scripts/validate_all.sh --with-r-validation  # include R/Redivis package checks
  scripts/validate_r.sh --run-comparison-smoke --version 2026-03-24  # optional R comparison smoke test
  ```

- Compare (R): Run `levante-bench run-comparison --task trog --model clip_base` or run `Rscript comparison/compare_levante.R --task TASK --model MODEL` directly. Outputs accuracy (with IRT item difficulty) and D_KL (by ability bin) to `results/comparison/`.
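A sketch of how the task-to-model mapping file might be consumed; the `task`/`model_path` column names are assumptions about `irt_model_mapping.csv`, not confirmed schema:

```python
import csv
import io

def load_irt_mapping(csv_text):
    """Parse a task -> .rds model path mapping from CSV text.

    Assumes two columns named 'task' and 'model_path' (hypothetical names).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["task"]: row["model_path"] for row in reader}

sample = """task,model_path
trog,trog/multigroup_site/overlap_items/trog_rasch_f1_scalar.rds
"""
mapping = load_irt_mapping(sample)
```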
You can run experiment configs directly using the eval-style command structure:
```bash
# Direct
python -m levante_bench.cli experiment=configs/experiment.yaml

# Wrapper (same behavior)
bash run_experiment.sh configs/experiment.yaml
```

Use dotlist-style overrides to change task subsets and smoke caps:
```bash
# Vocab smoke
python -m levante_bench.cli experiment=configs/experiment.yaml tasks=[vocab] max_items_vocab=8 device=cpu

# Math smoke
python -m levante_bench.cli experiment=configs/experiment.yaml tasks=[egma-math] max_items_math=2 device=cpu

# ToM smoke
python -m levante_bench.cli experiment=configs/experiment.yaml tasks=[theory-of-mind] max_items_tom=2 device=cpu
```

- Framework integration: SmolVLM benchmark scripts are now integrated under the `levante-bench` CLI (`run-workflow` and `run-benchmark`), including first-class `v1` and `vocab` benchmark presets.
- GPU-aware execution: Added `levante-bench check-gpu` and automatic device resolution (`--device auto`) with safe CUDA->CPU fallback.
- Math prompt improvements: `scripts/build_math_prompts.py` now defaults to shuffled options, supports numberline image attachment via `--numberline-graphics-dir`, and has configurable numberline instruction styles (`minimal`, `stepwise`).
- Numberline multimodal evaluation: `scripts/run_smolvlmv2_math_eval.py` now accepts image-backed prompt records (`image_paths`) so numberline items can be evaluated with actual graphics.
- Vocab benchmark support: Added an image-grid vocab evaluation flow and integrated it into `levante-bench run-benchmark --benchmark vocab`.
- Validation runner: Added `scripts/validate_all.sh` to run lint/tests/GPU check plus smoke or full benchmark validations in one command.
- Result history reporting: Added `scripts/list_benchmark_results.py` to list benchmark and prompt-experiment outputs with metric deltas vs prior runs.
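The metric-delta idea behind `scripts/list_benchmark_results.py` can be sketched as a comparison of two metric dicts (the function and keys here are illustrative, not the script's actual API):

```python
def metric_deltas(current, prior):
    """Compute current - prior for each metric in the current run.

    Metrics missing from the prior run are reported with a delta of None.
    """
    deltas = {}
    for name, value in current.items():
        old = prior.get(name)
        deltas[name] = None if old is None else round(value - old, 6)
    return deltas

prev = {"accuracy": 0.62, "d_kl": 1.40}
curr = {"accuracy": 0.68, "d_kl": 1.25, "n_items": 40}
delta = metric_deltas(curr, prev)
```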
```bash
# Heatmap of models x tasks accuracy
python scripts/plot_results.py

# Specific version
python scripts/plot_results.py --version 2026-03-24

# Text table only
python scripts/plot_results.py --no-plot
```

Use these commands to verify what ran and compare with prior runs:
```bash
# Show benchmark + prompt experiment history with deltas.
python3 scripts/list_benchmark_results.py --limit 20

# Run full validation pipeline (lint/tests/gpu + smoke benchmarks).
scripts/validate_all.sh

# Run full benchmarks instead of smoke.
scripts/validate_all.sh --full-benchmarks

# Include R package validation in the full validation pass.
scripts/validate_all.sh --with-r-validation
```

The test suite is split into fast unit/property tests (default) and opt-in integration tests (model loading / dataset end-to-end checks).
- Default pytest run: `python -m pytest`
  - Runs unit tests for parsing, scoring, aggregation, runner utils, cache behavior, and API retry logic.
  - Includes property-based fuzz tests (Hypothesis) for parser robustness.
- Integration tests (opt-in): `tests/test_model_inference.py`, `tests/test_task_datasets.py`
  - These are intentionally gated behind `LEVANTE_RUN_INTEGRATION=1` so default CI/local runs stay deterministic and fast.
  - Run with: `LEVANTE_RUN_INTEGRATION=1 python -m pytest`
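The `LEVANTE_RUN_INTEGRATION` gate described above follows the standard pytest skip pattern; a minimal sketch (the marker placement and helper name are illustrative, not the repo's actual test code):

```python
import os
import pytest

def integration_enabled(env=os.environ):
    """Integration tests run only when LEVANTE_RUN_INTEGRATION=1 is set."""
    return env.get("LEVANTE_RUN_INTEGRATION") == "1"

requires_integration = pytest.mark.skipif(
    not integration_enabled(),
    reason="set LEVANTE_RUN_INTEGRATION=1 to run model/dataset integration tests",
)

@requires_integration
def test_model_loads():
    ...  # heavy model-loading check would live here
```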
Current parser-focused coverage includes:
- `parse_answer`/`parse_answer_v2` branch coverage (JSON, embedded JSON, phrase patterns, exact/prefix forms, ambiguous-prose rejection).
- `parse_numeric_answer`/`parse_numeric_v2` branch coverage (strict JSON, embedded JSON, slider mode constraints, fallback behavior).
- `<imageN>` interleaving behavior across model adapters.
- `evaluate_trial` correctness for label, numeric, and slider formats.
- Postprocessing accuracy aggregation and ordering checks.
- Cache round-trip and cache-hit behavior in `run_eval`.
- GPT-5.3 retry logic (`5xx` retry and token-cap doubling path).
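The retry behavior those tests cover (retry on 5xx, double the token cap when the response was truncated) can be sketched generically; this is not the adapter's actual code, and `call(max_tokens)` is a hypothetical client function:

```python
def call_with_retry(call, max_tokens, max_attempts=3):
    """Retry transient 5xx failures; double the token cap on truncation.

    `call(max_tokens)` is assumed to return (status_code, truncated, text).
    """
    for _ in range(max_attempts):
        status, truncated, text = call(max_tokens)
        if 500 <= status < 600:
            continue          # transient server error: retry unchanged
        if truncated:
            max_tokens *= 2   # response hit the cap: retry with a doubled cap
            continue
        return text
    raise RuntimeError(f"gave up after {max_attempts} attempts")

# Simulated client: first response truncated, second succeeds.
responses = iter([(200, True, ""), (200, False, "answer_a")])
seen_caps = []

def fake_call(cap):
    seen_caps.append(cap)
    return next(responses)

result = call_with_retry(fake_call, max_tokens=256)
```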
Evaluation now uses a canonical parse layer with provenance, so correctness is decided in a normalized answer space rather than on the raw output format.
- Label tasks: normalize to `predicted_label` in `option_labels`.
- Numeric/slider tasks: normalize to `predicted_value` (float), then compare to `target_value` using `slider_tolerance`.
- Slider tasks: normalize the slider position, clamp to `[0, 1]`, then map back to the task scale via `slider_min`/`slider_max`.
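The slider normalization described above (clamp to [0, 1], map onto the task scale, score within tolerance) in a minimal sketch; the function names mirror the fields mentioned but are not the library's API:

```python
def slider_to_task_scale(position, slider_min, slider_max):
    """Clamp a raw slider position to [0, 1], then map it linearly
    onto the task's [slider_min, slider_max] range."""
    clamped = min(max(position, 0.0), 1.0)
    return slider_min + clamped * (slider_max - slider_min)

def slider_correct(predicted_value, target_value, slider_tolerance):
    """A numeric/slider answer counts as correct when within tolerance."""
    return abs(predicted_value - target_value) <= slider_tolerance
```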
`ParseResult` (in `src/levante_bench/models/base.py`) returns:

- `value`: canonical parsed value/label, or `None`
- `reason`: extracted reason or source text
- `parse_method`: which rule matched
- `parse_confidence`: high/medium/low/none
- `raw_candidate`: raw extracted token
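The field list suggests a shape like the following dataclass; this is a sketch, and the actual definition in `src/levante_bench/models/base.py` may differ:

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class ParseResult:
    value: Optional[Any]          # canonical parsed value/label, or None
    reason: Optional[str]         # extracted reason or source text
    parse_method: str             # which parsing rule matched
    parse_confidence: str         # "high" | "medium" | "low" | "none"
    raw_candidate: Optional[str]  # raw extracted token before normalization

# A failed parse would carry provenance even with no usable value:
failed = ParseResult(value=None, reason="no JSON found", parse_method="none",
                     parse_confidence="none", raw_candidate=None)
```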
`evaluate_trial()` now uses:

- `parse_answer_result(...)` for label tasks
- `parse_numeric_result(...)` for numeric/slider tasks
Backward-compatible APIs (`parse_answer`, `parse_numeric_answer`) are kept for existing callers, but benchmark scoring uses the parser-v2 paths.
Per-task CSV outputs now include parser provenance columns:

- `parse_method`
- `parse_confidence`
- `parse_raw_candidate`
This supports score audits (for example, reviewing accuracy by parse method or identifying low-confidence parses).
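Using those provenance columns, an accuracy-by-parse-method audit is a small group-by; a sketch over per-trial rows (the `correct` column and row shape are assumptions, not the actual CSV schema):

```python
from collections import defaultdict

def accuracy_by_parse_method(rows):
    """Group per-trial rows by parse_method and compute accuracy per group.

    Each row is assumed to carry 'parse_method' and a boolean 'correct'.
    """
    totals = defaultdict(lambda: [0, 0])  # method -> [n_correct, n_total]
    for row in rows:
        bucket = totals[row["parse_method"]]
        bucket[0] += int(row["correct"])
        bucket[1] += 1
    return {method: n_ok / n for method, (n_ok, n) in totals.items()}

rows = [
    {"parse_method": "json", "correct": True},
    {"parse_method": "json", "correct": True},
    {"parse_method": "prefix", "correct": False},
    {"parse_method": "prefix", "correct": True},
]
acc = accuracy_by_parse_method(rows)
```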
The benchmark compares model outputs to human behavioral data on two dimensions:
- Accuracy vs item difficulty: Model accuracy (correct/incorrect per item) is paired with IRT item difficulty parameters extracted from fitted Rasch models. A negative correlation indicates the model finds harder items harder, as children do.
- Response distribution D_KL by ability bin: Human response distributions are computed within subgroups of children binned by IRT ability (1-logit width bins on the logit scale). KL divergence between these human distributions and the model's softmax distribution quantifies alignment at each ability level.
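The D_KL comparison can be sketched for discrete response distributions; the epsilon smoothing here is our choice, not necessarily what `comparison/compare_levante.R` does:

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """D_KL(p || q) for discrete distributions over the same options.

    A small epsilon guards against zero probabilities before renormalizing.
    """
    p = [x + eps for x in p]
    q = [x + eps for x in q]
    zp, zq = sum(p), sum(q)
    return sum((pi / zp) * math.log((pi / zp) / (qi / zq))
               for pi, qi in zip(p, q))

human = [0.7, 0.2, 0.1]  # children's response distribution in one ability bin
model = [0.6, 0.3, 0.1]  # model softmax over the same options
d = kl_divergence(human, model)
```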
See comparison/README.md for details.
- See `docs/README.md` for data schema, releases, adding tasks/models, and secrets setup.
- See `docs/aquila-intermediate-runbook.md` for Aquila intermediate checkpoint integration and dual-environment setup.
- See `docs/environment-split.md` for benchmark vs Aquila virtualenv activation and usage.
- See `scripts/README.md` for a script-by-script command index.
- See `CHANGELOG.md` for ongoing project update history.
Cite the LEVANTE manuscript and the DevBench (NeurIPS 2024) paper when using this benchmark.