Vision LLM - Batch Document VQA with structured responses


Performance vs Cost Trade-off

The chart below shows the Pareto frontier of models, highlighting the most cost-efficient options for different performance levels:

(Figure: Model Performance vs Cost Trade-off Pareto plot.)

Benchmarks

Our small test dataset (./imgs/quiz11-presidents.pdf) consists of 32 documents representing Physics quizzes. The task is to match each document to the student who took the quiz via their 8-digit university ID and, optionally, their name (./tests/data/test_ids.csv). We have already saturated this test dataset with 100% statistically confident detections, so the remaining optimizations focus on decreasing inference cost. You can find more details in this wiki.

The table below shows the top performing models by category. See BENCHMARKS.md for comprehensive results with all tested models.

| Metric | OpenCV+CNN | moonshotai/kimi-k2.5 | qwen/qwen3-vl-8b-instruct | google/gemini-2.5-flash-lite | google/gemini-3-flash-preview |
|---|---|---|---|---|---|
| LLM model size | N/A | 1000A32 | 8B | ?? | ?? |
| Open-weights | N/A | Yes | Yes | No | No |
| digit_top1 | 85.16% | 100.00% | 99.61% | 99.22% | 99.22% |
| 8-digit id_top1 | ?? | 100.00% | 96.88% | 93.75% | 93.75% |
| lastname_top1 | N/A | 100.00% | 100.00% | 96.88% | 100.00% |
| ID Avg d_Lev | N/A | 0.0000 | 0.0312 | 0.0625 | 0.0625 |
| Lastname Avg d_Lev | N/A | 0.0000 | 0.0000 | 0.0312 | 0.0000 |
| Docs detected | 90.62% (29/32) | 100.00% (32/32) | 100.00% (32/32) | 100.00% (32/32) | 100.00% (32/32) |
| Runtime (p) | ~1 second | N/A | 11 seconds | 11 seconds | N/A |
| Cost per image | $0.00 | $0.004679 | $0.000266 | $0.000214 | $0.001636 |
| Total cost | $0.00 | $0.2995 | $0.0171 | $0.0137 | $0.1047 |

This repository uses vision-capable Large Language Models to extract information from collections of documents and reports performance on a clearly specified document-VQA setup. The goal is a fully local pipeline that runs on a single machine and can extract information from document collections for use in downstream tasks.

Run Benchmark: Included Dataset (No Code Changes)

Run the built-in q11 benchmark end-to-end using the default default_student extraction task.

This repo includes benchmark inputs, but not pre-rendered benchmark images:

  • Source PDF: imgs/quiz11-presidents.pdf
  • Ground truth: tests/data/test_ids.csv

You will generate doc_info.csv and page images locally in one command.

1. Clone the repository

git clone https://github.com/IonMich/batch-doc-vqa.git
cd batch-doc-vqa

2. Install uv

curl -LsSf https://astral.sh/uv/install.sh | sh

After uv is installed, run commands directly with uv run .... No uv sync, pip install, or conda setup is required for this workflow.

3. Generate benchmark images + doc_info.csv

uv run --with pymupdf pdf-to-imgs \
  --filepath imgs/quiz11-presidents.pdf \
  --pages_i 4 \
  --dpi 300 \
  --output_dir imgs/q11

What this does:

  • Splits the source PDF into PNG page images in imgs/q11/.
  • Treats every 4 pages as one document (--pages_i 4), so filenames map to doc-<index>-page-<n>-*.png.
  • Writes imgs/q11/doc_info.csv with doc,page,filename so downstream commands know exactly which images belong to each document.
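As an optional sanity check (a shell sketch using the output paths from the command above), list a few generated images and peek at the manifest:

ls imgs/q11 | head
head -n 5 imgs/q11/doc_info.csv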

4. Run OpenRouter inference (interactive organization + model selection, then provider approval)

uv run openrouter-inference \
  --concurrency 64 \
  --rate-limit 64

The command checks for OPENROUTER_API_KEY and prompts for setup if missing (openrouter.ai/keys).
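If you prefer to set the key up front instead of being prompted, export it in your shell first (a sketch; substitute your own key from openrouter.ai/keys):

export OPENROUTER_API_KEY="<your-openrouter-api-key>"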

(Screenshots: OpenRouter interactive terminal UI for organization selection and for model selection.)

What this command does by default:

  • Since --model is omitted, the terminal UI asks you to choose organization + model.
  • Uses preset default_student.
  • Uses images from imgs/q11.
  • Auto-detects dataset manifest at imgs/q11/doc_info.csv when present.
  • Uses pages 1,3 by default for this preset.

Optional overrides:

  • Use a different preset: --preset <preset_id>
  • Use a different image directory: --images-dir /path/to/images (auto-detects /path/to/images/doc_info.csv when present)
  • Use a different manifest path: --dataset-manifest /path/to/doc_info.csv
  • Override page selection: --pages 1,3
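Putting these together, a fully non-interactive run of this benchmark might look like the following (a sketch that only combines the flags documented above; the model here is one of the benchmarked options):

uv run openrouter-inference \
  --preset default_student \
  --model qwen/qwen3-vl-8b-instruct \
  --images-dir imgs/q11 \
  --pages 1,3 \
  --concurrency 64 \
  --rate-limit 64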

Interactive flow:

  1. Run the command above.
  2. If prompted, enter your OpenRouter API key (and optionally save it to .env).
  3. In the terminal UI, choose the model organization (the model creator).
  4. Choose the model from that organization.
  5. Review provider policies for hosts serving that model and approve to continue.
  6. Confirm and start the run.

5. Regenerate benchmark artifacts

uv run update-benchmarks

This updates BENCHMARKS.md, pareto_plot.png, and the benchmark section in README.md.

Run Benchmark: Synthetic Dataset (default_student, No Code Changes)

Use this when you want to automatically generate a labeled DocVQA benchmark dataset, then run the same benchmark pipeline end-to-end. This workflow uses PyMuPDF for both synthetic PDF rendering and PDF-to-image conversion; uv run --with pymupdf ... installs it on demand.

1. Generate synthetic PDFs + labels

uv run --with pymupdf generate-synthetic-pdf-task \
  --entities-file docs/examples/synthetic/default_student_entities.csv \
  --task-config docs/examples/synthetic/default_student_task_config.yaml \
  --output-dir /tmp/synthetic_benchmark \
  --seed 42 \
  --overwrite

This writes:

  • /tmp/synthetic_benchmark/task_docs.pdf
  • /tmp/synthetic_benchmark/test_ids.csv
  • /tmp/synthetic_benchmark/generation_plan.json

2. Convert synthetic PDF to images + doc_info.csv

uv run --with pymupdf pdf-to-imgs \
  --filepath /tmp/synthetic_benchmark/task_docs.pdf \
  --pages_i 4 \
  --dpi 300 \
  --output_dir /tmp/synthetic_benchmark/images

3. Run inference on synthetic images

uv run openrouter-inference \
  --preset default_student \
  --model qwen/qwen3-vl-8b-instruct \
  --images-dir /tmp/synthetic_benchmark/images \
  --concurrency 64 \
  --rate-limit 64

4. Generate benchmark table + Pareto plot

uv run generate-benchmark-table \
  --doc-info /tmp/synthetic_benchmark/images/doc_info.csv \
  --test-ids /tmp/synthetic_benchmark/test_ids.csv \
  --format markdown \
  --output /tmp/synthetic_benchmark/BENCHMARKS.md

uv run generate-pareto-plot \
  --doc-info /tmp/synthetic_benchmark/images/doc_info.csv \
  --test-ids /tmp/synthetic_benchmark/test_ids.csv \
  --output /tmp/synthetic_benchmark/pareto_plot.png \
  --title "Model Performance vs Cost Trade-off (synthetic benchmark)"

Run Benchmark: Your Dataset (default_student, No Code Changes)

Use this when you have a labeled dataset compatible with the default student benchmark task and want benchmark tables/plots.

1. Convert your PDF batch to images

uv run --with pymupdf pdf-to-imgs \
  --filepath /path/to/new-batch.pdf \
  --pages_i 4 \
  --dpi 300 \
  --output_dir /tmp/my_benchmark/images

This creates /tmp/my_benchmark/images/doc_info.csv. Use --pages_i equal to the number of pages per logical document in your batch.

2. Prepare /tmp/my_benchmark/test_ids.csv

Create this CSV in Excel/Google Sheets and export as CSV. Required columns and format:

doc,student_id,student_full_name
0,33206068,Harry S. Truman
1,89797090,Franklin D. Roosevelt
2,98470266,Herbert Hoover

Notes:

  • doc must match the document index used in doc_info.csv (doc-0-*, doc-1-*, ...).
  • Keep one row per document.
  • student_id should be the 8-digit ground-truth ID as text.
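As a quick sanity check (a shell sketch using the paths from this walkthrough), the number of data rows should match the number of documents in doc_info.csv:

tail -n +2 /tmp/my_benchmark/test_ids.csv | wc -l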

3. Run inference on the new dataset

uv run openrouter-inference \
  --preset default_student \
  --model qwen/qwen3-vl-8b-instruct \
  --images-dir /tmp/my_benchmark/images \
  --concurrency 64 \
  --rate-limit 64

Because pdf-to-imgs wrote /tmp/my_benchmark/images/doc_info.csv, the manifest is auto-detected.

4. Generate benchmark markdown + Pareto plot for that dataset

uv run generate-benchmark-table \
  --doc-info /tmp/my_benchmark/images/doc_info.csv \
  --test-ids /tmp/my_benchmark/test_ids.csv \
  --format markdown \
  --output /tmp/my_benchmark/BENCHMARKS.md

uv run generate-pareto-plot \
  --doc-info /tmp/my_benchmark/images/doc_info.csv \
  --test-ids /tmp/my_benchmark/test_ids.csv \
  --output /tmp/my_benchmark/pareto_plot.png \
  --title "Model Performance vs Cost Trade-off (my benchmark)"

Run Extraction: Your Dataset (default_student, No Code Changes)

Use this when you want structured JSON extraction only (no benchmark scoring/plots) with the built-in student task.

Default extracted fields are:

  • student_full_name
  • university_id
  • section_number

If these fields match your use case, run:

uv run openrouter-inference \
  --preset default_student \
  --model qwen/qwen3-vl-8b-instruct \
  --images-dir /path/to/images \
  --concurrency 32 \
  --rate-limit 32

If your manifest is not at /path/to/images/doc_info.csv, pass --dataset-manifest /path/to/manifest.csv. If your target pages differ from the preset default, pass --pages ....

Results are saved under tests/output/runs/<run_name>/results.json.
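To skim a run's output without relying on its exact JSON layout, pretty-print it (a minimal sketch; json.tool ships with Python):

uv run python -m json.tool tests/output/runs/<run_name>/results.json | head -n 40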

To tune extraction behavior for your documents while keeping the same fields, edit the default preset:

  • src/batch_doc_vqa/openrouter/presets/student.py

If you need different output fields but do not want to edit code, use a custom prompt + schema.

Run Extraction: Your Dataset (Custom Prompt + Schema, No Code Changes)

Use this when you want extraction-only JSON outputs with your own prompt and schema, without editing Python code.

Example files included in this repo:

  • docs/examples/prompts/basic-entity-extraction.md
  • docs/examples/schemas/basic-entity-extraction.schema.json

Run the extraction with:

uv run openrouter-inference \
  --model qwen/qwen3-vl-8b-instruct \
  --images-dir /path/to/images \
  --prompt-file docs/examples/prompts/basic-entity-extraction.md \
  --schema-file docs/examples/schemas/basic-entity-extraction.schema.json \
  --output-json /tmp/custom_entities.json \
  --concurrency 32 \
  --rate-limit 32

Notes:

  • If your manifest is not at /path/to/images/doc_info.csv, pass --dataset-manifest /path/to/manifest.csv.
  • Page selection still applies; pass --pages ... for your dataset.
  • When --schema-file is provided, strict schema mode is enabled by default. Use --no-strict-schema for best-effort passthrough.
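For example, a variant of the command above that relaxes strict schema mode and pins the page selection might look like this (a sketch; adjust paths and pages for your dataset):

uv run openrouter-inference \
  --model qwen/qwen3-vl-8b-instruct \
  --images-dir /path/to/images \
  --prompt-file docs/examples/prompts/basic-entity-extraction.md \
  --schema-file docs/examples/schemas/basic-entity-extraction.schema.json \
  --no-strict-schema \
  --pages 1,3 \
  --output-json /tmp/custom_entities.json \
  --concurrency 32 \
  --rate-limit 32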

Define New Task: Preset + Benchmark Logic (Code Changes)

Use this when you want a new reusable task integrated into the codebase (not just a one-off run).

For a different extraction schema or different scoring rules:

  1. Create a new preset module (copy and adapt src/batch_doc_vqa/openrouter/presets/student.py).
  2. Register it in src/batch_doc_vqa/openrouter/presets/__init__.py.
  3. Run inference with uv run openrouter-inference --preset <your_preset> ...
  4. Update the ground-truth matching logic if your scoring changes: src/batch_doc_vqa/utils/string_matching.py
  5. Update the benchmark table metrics/rows: src/batch_doc_vqa/benchmarks/table_generator.py

Other Investigations

  • Statistical calibration (legacy experiment): see statistical-calibration.md.
  • Full analysis: Row-of-Digits-OCR: OpenCV-CNN versus LLMs.
  • Calibration artifact used in that analysis: tests/output/public/calibration_curves.png.

Motivations

Recent advances in LLM modelling have made it conceivable to build a quantifiably reliable pipeline to extract information in bulk from documents:

  • Well-formatted JSON can be fully enforced. In fact, using context-free grammars, precise JSON schemas can be enforced in language models that support structured responses (e.g. see OpenAI's blog post).
  • OpenAI's o1-preview appears to be well-calibrated, i.e. the frequency of its answers to fact-seeking questions is a good proxy for their accuracy. This makes it possible to sample multiple times from the model and infer probabilities for each distinct answer. It is unclear, however, how well this calibration generalizes to open-source models, or whether the purely textual SimpleQA task is a good proxy for text+vision tasks.
  • The latest open-source models, such as the (Q4-quantized) Llama3.2-Vision 11B, show good performance on a variety of tasks, including DocVQA, when compared to closed-source models like GPT-4. The OCRBench Space on Hugging Face has a nice summary of their performance on various OCR tasks.
  • Hardware with acceptable memory bandwidth and large-enough memory capacity for LLM tasks is becoming more affordable.
