Official repository for the project "A Geometric Generative Reasoning Benchmark for Unified Multimodal Models"
[🌍 Homepage] [📜 OpenReview Paper] [🤗 HF Datasets] [💻 GitHub Code]
Overview of GGBench: A Geometric Generative Reasoning Benchmark for Unified Multimodal Models.
We introduce GGBench, a geometric generative reasoning benchmark purpose-built for unified multimodal models (UMMs). Unlike prior evaluations that treat discriminative understanding and unconstrained image generation separately, GGBench diagnoses whether a model can fuse language comprehension with precise visual construction. Geometric construction serves as an ideal testbed, revealing how well a system can actively reason and synthesize structured solutions across modalities.
We investigate a key question: Can unified multimodal models integrate reasoning with controlled visual synthesis? While modern UMMs can perceive and understand complex visual scenes, their actual reliability in generative reasoning—where language understanding must guide precise geometric construction—remains unverified.
We conduct a comprehensive evaluation across multiple dimensions including planning, middle process, and final result quality, introducing GGBench as a standardized benchmark for systematic generative reasoning assessment. Our findings reveal the current capabilities and limitations of UMMs in geometric generative reasoning tasks.
We provide a comprehensive investigation of unified multimodal models, analyzing their potential for geometric generative reasoning and detailing representative successes, characteristic errors, and the conditions under which generative reasoning emerges, holds, or breaks.
Visit our homepage to see video demonstrations showing how different models solve geometric problems step by step.
```bash
git lfs install
git clone https://huggingface.co/datasets/opendatalab-raiser/GGBench
```

The evaluation script supports multiple evaluation dimensions, including VLM-based text/image evaluation, mid-process evaluation, and image quality metrics (LPIPS, PSNR, SSIM).
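As a quick sanity check after cloning, the annotations can be inspected in Python. This is only a minimal sketch: the annotation file name and layout used below are assumptions, since the README does not specify the repository's internal structure.

```python
import json
from pathlib import Path

# Hypothetical layout: point this at whatever annotation file the cloned
# GGBench repository actually contains (not specified in this README).
DATASET_DIR = Path("GGBench")
annotation_file = DATASET_DIR / "data.json"  # assumed filename

with open(annotation_file, "r", encoding="utf-8") as f:
    samples = json.load(f)

print(f"Loaded {len(samples)} geometric construction problems")
print(samples[0])  # inspect a single problem record
```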
- Navigate to the `dataset/` directory
- Edit lines 52-53 in `evaluate.py` to add your Judge Model URL and API Key
- Configure `MODEL_OUTPUT_PATH` in `evaluate.py` to point to your model's output JSON file
- Run: `python evaluate.py`
Results will be saved to `eval_output/result.json`, with aggregated scores in `eval_output/score.json`.
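Once the run finishes, the aggregated scores can be inspected directly. A minimal sketch follows; the internal schema of the two output files is not documented here, so the snippet only pretty-prints their contents.

```python
import json

# Only the file paths come from this README; the schema of the JSON
# contents is an assumption, so we simply pretty-print the scores.
with open("eval_output/score.json", "r", encoding="utf-8") as f:
    scores = json.load(f)

print(json.dumps(scores, indent=2, ensure_ascii=False))
```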
- VLM-T: Text-based step reasoning evaluation (1-5 scale)
- VLM-I-Mid: Middle process image quality evaluation (Step Accuracy, Process Consistency, Problem-Solution Accuracy)
- VLM-I-Res: Final result image quality evaluation (1-5 scale)
- LPIPS ×10⁻²: Learned Perceptual Image Patch Similarity
- PSNR: Peak Signal-to-Noise Ratio
- SSIM ×10⁻²: Structural Similarity Index (a usage sketch for these three image metrics follows below)
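Below is a minimal sketch of how the three image similarity metrics above are typically computed, using `scikit-image` for PSNR/SSIM and the `lpips` package for LPIPS. It is illustrative only and not taken from `evaluate.py`; the file names, image size, and scaling shown are assumptions.

```python
import lpips  # pip install lpips
import numpy as np
import torch
from PIL import Image
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def load_rgb(path, size=(512, 512)):
    """Load an image as an HxWx3 uint8 array (the size is an assumed default)."""
    return np.array(Image.open(path).convert("RGB").resize(size))

pred = load_rgb("model_output.png")   # hypothetical file names
ref = load_rgb("ground_truth.png")

# PSNR and SSIM operate directly on the uint8 arrays.
psnr = peak_signal_noise_ratio(ref, pred, data_range=255)
ssim = structural_similarity(ref, pred, channel_axis=-1, data_range=255)

# LPIPS expects NCHW float tensors scaled to [-1, 1].
to_tensor = lambda im: torch.from_numpy(im).permute(2, 0, 1).float().unsqueeze(0) / 127.5 - 1.0
loss_fn = lpips.LPIPS(net="alex")
lpips_score = loss_fn(to_tensor(pred), to_tensor(ref)).item()

# LPIPS and SSIM are reported in the ×10⁻² columns, i.e. scaled by 100.
print(f"PSNR: {psnr:.2f}  SSIM x 1e2: {ssim * 100:.2f}  LPIPS x 1e2: {lpips_score * 100:.2f}")
```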
We curate GGBench, a comprehensive benchmark providing a standardized taxonomy and an evaluation protocol, enabling consistent and category-wise assessment beyond surface-level metrics.
Evaluation Radar Map and Category Distribution of GGBench.
- Total Samples: 1,411 geometric construction problems
- Categories: Multiple geometric problem types including basic constructions, circle properties, geometric transformations, triangle properties, theorem applications, polygon properties, measurement & ratios, and locus construction
- Evaluation Dimensions: Planning, Middle Process, Final Result, and Overall Scores
Main results on GGBench. VLM-T and VLM-I denote step reasoning and final diagram quality, respectively. VLM-Avg averages middle and final stages. All values are percentages.
See the full leaderboard for detailed results across all evaluated models.
If you find our repository or paper useful, please cite:
```bibtex
@article{wei2025ggbench,
  title={GGBench: A Geometric Generative Reasoning Benchmark for Unified Multimodal Models},
  author={Wei, Jingxuan and Jia, Caijun and Bai, Xi and Xu, Xinglong and Li, Siyuan and Sun, Linzhuang and Yu, Bihui and He, Conghui and Wu, Lijun and Tan, Cheng},
  journal={arXiv preprint arXiv:2511.11134},
  year={2025}
}
```


