# vlm-eval-kit

A small, reproducible evaluation harness for vision-language model outputs.
Modern VLM work benefits from a consistent interface for:
- dataset samples
- model predictions
- metric calculation
- run artifacts
vlm-eval-kit focuses on this interface without adding heavy infrastructure.
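That interface can be sketched as two record types (a sketch only; the field names follow the JSONL examples in this README, while the class names are illustrative and not part of the library's API):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One dataset sample: an id plus the reference answer."""
    id: str
    answer: str


@dataclass(frozen=True)
class Prediction:
    """One model prediction, joined to a Sample by id."""
    id: str
    prediction: str
```

Metrics are then computed over `(answer, prediction)` pairs joined on `id`.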
## Features

- JSONL task schema
- pluggable metrics (`exact_match`, `token_f1`, `rouge_l`)
- bootstrap confidence intervals for robust reporting
- CLI runner that exports machine-readable reports
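As an illustration of what a pluggable metric looks like, here is a minimal sketch of a whitespace-token F1 (a common definition of `token_f1`; this is an assumption about the metric's behavior, not the kit's actual implementation):

```python
from collections import Counter


def token_f1(reference: str, prediction: str) -> float:
    """Whitespace-token F1 between a reference answer and a prediction."""
    ref_tokens = reference.lower().split()
    pred_tokens = prediction.lower().split()
    if not ref_tokens or not pred_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(ref_tokens == pred_tokens)
    # Multiset intersection of tokens.
    overlap = sum((Counter(ref_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

On the example pair below ("cat on chair" vs. "a cat sits on a chair") this yields precision 0.5, recall 1.0, F1 ≈ 0.667.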
## Installation

```bash
python -m venv .venv
source .venv/bin/activate
pip install -e .
```

## Usage

```bash
python -m vlm_eval_kit.cli run \
  --samples examples/samples.jsonl \
  --predictions examples/predictions.jsonl \
  --output examples/report.json
```

### samples.jsonl

```jsonl
{"id":"1","answer":"cat on chair"}
{"id":"2","answer":"no"}
```

### predictions.jsonl

```jsonl
{"id":"1","prediction":"a cat sits on a chair"}
{"id":"2","prediction":"no"}
```

## Output

A JSON report containing metrics, confidence intervals, and missing-id diagnostics.
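The confidence intervals come from bootstrapping per-sample scores. A minimal sketch of the percentile-bootstrap idea (the function name, defaults, and exact procedure are assumptions for illustration, not the kit's implementation):

```python
import random


def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-sample metric scores."""
    rng = random.Random(seed)  # fixed seed for reproducible reports
    means = []
    for _ in range(n_resamples):
        # Resample scores with replacement and record the resampled mean.
        resample = [rng.choice(scores) for _ in scores]
        means.append(sum(resample) / len(resample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(scores) / len(scores), (lo, hi)
```

Reporting the interval alongside the point estimate makes small-dataset comparisons between runs more robust.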
## Roadmap

- Add VQA normalization rules
- Add pairwise preference scoring
- Add Weights & Biases logging adapter
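For the VQA normalization item, a hypothetical sketch of the usual kind of rule set (lowercasing plus stripping punctuation and articles is a common VQA convention; this is not behavior the kit currently implements):

```python
import string

# Articles commonly dropped by VQA answer normalization.
_ARTICLES = {"a", "an", "the"}


def normalize_vqa_answer(answer: str) -> str:
    """Lowercase, drop punctuation, and remove articles before matching."""
    answer = answer.lower()
    answer = answer.translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in answer.split() if t not in _ARTICLES]
    return " ".join(tokens)
```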
Designed for fast ablation loops in academic settings.