Skip to content

Latest commit

 

History

History
307 lines (212 loc) · 7.66 KB

File metadata and controls

307 lines (212 loc) · 7.66 KB

Can MLLMs Reason Beyond Language? VisReason: A Comprehensive Benchmark for Vision-Centric Reasoning

This repository provides the official code and evaluation tools for VisReason, including dataset indexing, API and vLLM inference scripts, automatic judging, and result summarization. (ACL 2026 Findings)

Dataset   Paper

📢 News

  • [2026-05-26] We have added the paper link (arXiv).
  • [2026-05-23] We have organized and released the inference and evaluation scripts.
  • [2026-05-23] We provide examples for both API-based inference and vLLM-based local inference.
  • [2026-05-23] We have released the full VisReason dataset on Hugging Face.

✅ TODO

  • Add citation information.
  • Add license information.
  • Add continual benchmark updates

This repository hosts the stable official release. For the testing version, please visit YifanWang-Lingf/VisReason.

👀 Overview

VisReason is a benchmark for evaluating vision-centric reasoning in everyday scenarios where perception and inference are tightly coupled. Unlike many STEM-oriented or knowledge-intensive visual reasoning benchmarks, VisReason is designed to test whether MLLMs can reason directly from visual evidence rather than relying mainly on language-mediated abstractions.

VisReason contains 1,505 carefully curated questions across 10 reasoning categories, covering perceptual, structural, and conceptual reasoning. The tasks include visual difference identification, 3D-spatial reasoning, game-board reasoning, and implicit rule inference from visual cues.

Our evaluation shows that VisReason poses a qualitatively different challenge from existing benchmarks, exposing large gaps between humans and current MLLMs and revealing that test-time reasoning strategies such as explicit CoT prompting provide limited gains without stronger visual grounding.

📦 Dataset

The full dataset description, field definitions, and download instructions are provided on Hugging Face.

🗂️ Repository Structure

VisReason/
  data/
    img_<class_number>/
      datajson_label.<ext>
    ...
    class_1.jsonl
    ...
    datasets.json
  results/
    <model>/
      class_<class_number>_results.json
      class_<class_number>_judge.json
      summary.json
  utils/
    datainfer/
    bbox_utils.py
  test_api_batch.py
  test_vLLM.py
  judge.py
  README.md

🚀 Quick Start

⚙️ Installation

pip install openai tqdm pydantic datasets rich pillow

For local vLLM inference, install and run vLLM separately in the environment that serves the model.

🧪 Inference

Option A: 🌐 API Inference

test_api_batch.py calls an OpenAI-compatible API endpoint. Set your credentials through environment variables:

export API_KEY="xxx"
export BASE_URL="xxx"

Run inference:

# thinking models
python test_api_batch.py <model> 
# instruct models
python test_api_batch.py <model> cot
python test_api_batch.py <model> nocot

The default prompt mode is nocot.

Option B: 🖥️ vLLM Inference

Start a vLLM OpenAI-compatible server, for example:

CUDA_VISIBLE_DEVICES=0,1 \
python -m vllm.entrypoints.openai.api_server \
  --model /path/to/model \
  --trust-remote-code \
  --port 8008 \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --max-num-seqs 32 \
  --gpu-memory-utilization 0.8 \
  --served-model-name <model>

Run inference:

# thinking models
python test_vLLM.py <model> --host localhost --port 8008
# instruct models
python test_vLLM.py <model> cot --host localhost --port 8008
python test_vLLM.py <model> nocot --host localhost --port 8008

The default prompt mode is nocot.

Option C: 🧩 lmms-eval Inference

Coming soon.

📁 Output Format

Inference results are saved under:

results/<model>/class_X_results.json
results/<model>_cot/class_X_results.json

Each result file is a JSON list. Each item keeps the original sample fields and adds:

{
  "response": "model output"
}

🧑‍⚖️ Evaluation

Run judging:

export API_KEY="xxx"
export BASE_URL="xxx"

python judge.py <judge_model> results/<model>

The evaluator writes:

results/<model>/class_X_judge.json
results/<model>/summary.json

For class_1 and class_2, bounding-box predictions are evaluated with IoU at threshold 0.5. Other classes are evaluated by the judge model.

🖼️ Demos

Option A: 🌐 API Inference (o4-mini)

Set API credentials:

export API_KEY="xxx"
export BASE_URL="xxx"

Run inference:

python test_api_batch.py o4-mini

The generated results will be saved to:

results/o4-mini/class_1_results.json
results/o4-mini/class_2_results.json
...
results/o4-mini/class_10_results.json

Run evaluation:

python judge.py gpt-5 results/o4-mini

The judged outputs and final summary will be saved to:

results/o4-mini/class_1_judge.json
results/o4-mini/class_2_judge.json
...
results/o4-mini/summary.json

Option B: 🖥️ vLLM Inference (Qwen3-VL-235B-A22B-Instruct)

Start the vLLM server:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -m vllm.entrypoints.openai.api_server \
  --model ../hf_models/Qwen3-VL-235B-A22B-Instruct \
  --trust-remote-code \
  --port 8008 \
  --tensor-parallel-size 8 \
  --max-model-len 8192 \
  --max-num-seqs 8 \
  --gpu-memory-utilization 0.9 \
  --served-model-name Qwen3-VL-235B-A22B-Instruct

Run inference:

python test_vLLM.py Qwen3-VL-235B-A22B-Instruct cot --host 12x.xx.xx.xx --port 8008

The generated results will be saved to:

results/Qwen3-VL-235B-A22B-Instruct_cot/class_1_results.json
results/Qwen3-VL-235B-A22B-Instruct_cot/class_2_results.json
...
results/Qwen3-VL-235B-A22B-Instruct_cot/class_10_results.json

Run evaluation:

python judge.py gpt-5 results/Qwen3-VL-235B-A22B-Instruct_cot

The judged outputs and final summary will be saved to:

results/Qwen3-VL-235B-A22B-Instruct_cot/class_1_judge.json
results/Qwen3-VL-235B-A22B-Instruct_cot/class_2_judge.json
...
results/Qwen3-VL-235B-A22B-Instruct_cot/summary.json

📊 Metrics

VisReason reports accuracy at the class level and uses the average class accuracy as the final score.

For class_1 and class_2, predictions are evaluated as bounding-box localization tasks. The evaluator parses the predicted boxes from the model response, matches them with the ground-truth boxes, and counts a box as correct when its IoU is at least 0.5.

For the remaining classes, predictions are judged by an LLM-based evaluator. The judge compares the model response with the ground-truth answer and outputs whether the prediction is correct.

For each class, accuracy is computed as:

accuracy = number of correct samples / number of evaluated samples

The final score is the unweighted mean over all class accuracies:

average accuracy = mean(class_1_acc, class_2_acc, ..., class_10_acc)

The evaluator writes per-class scores and the final average score to:

results/<model>/summary.json

🏆 Results

📝 Citation

📄 License

TODO

📬 Contact

If you have any questions, please reach out to: