vlm-eval-kit

A small, reproducible evaluation harness for vision-language model (VLM) outputs.

Motivation

Modern VLM work benefits from a consistent interface for:

  • dataset samples
  • model predictions
  • metric calculation
  • run artifacts

vlm-eval-kit focuses on this interface without adding heavy infrastructure.

Features

  • JSONL task schema
  • pluggable metrics (exact_match, token_f1, rouge_l)
  • bootstrap confidence intervals for robust reporting (see the sketch after this list)
  • CLI runner that exports machine-readable reports
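
The statistical pieces are small enough to sketch. The following is illustrative only: token_f1 is named in the feature list, but the function signatures and the bootstrap_ci helper are assumptions, not the actual vlm_eval_kit API.

import random

def token_f1(prediction, answer):
    # Token-level F1: harmonic mean of precision and recall over
    # whitespace tokens, using multiset overlap (SQuAD-style scoring).
    pred, gold = prediction.split(), answer.split()
    common = sum(min(pred.count(t), gold.count(t)) for t in set(pred))
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(gold)
    return 2 * precision * recall / (precision + recall)

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    # Percentile bootstrap: resample per-example scores with
    # replacement and take the empirical alpha/2 and 1-alpha/2
    # quantiles of the resampled means.
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

Reporting a percentile interval over per-example scores, rather than a single mean, is what makes small ablation runs comparable.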

Install

python -m venv .venv
source .venv/bin/activate
pip install -e .

Quickstart

python -m vlm_eval_kit.cli run \
  --samples examples/samples.jsonl \
  --predictions examples/predictions.jsonl \
  --output examples/report.json

Input format

samples.jsonl

{"id":"1","answer":"cat on chair"}
{"id":"2","answer":"no"}

predictions.jsonl

{"id":"1","prediction":"a cat sits on a chair"}
{"id":"2","prediction":"no"}

Output

A JSON report containing metrics, confidence intervals, and missing-id diagnostics.
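
The exact schema is not documented here, but a report of that description would look roughly like the following; every field name and value is hypothetical:

{
  "metrics": {"exact_match": 0.50, "token_f1": 0.78},
  "confidence_intervals": {"token_f1": [0.61, 0.90]},
  "missing_ids": []
}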

Roadmap

  • Add VQA normalization rules
  • Add pairwise preference scoring
  • Add Weights & Biases logging adapter

Notes

Designed for fast ablation loops in academic settings.
