Official repository for the project "A Geometric Generative Reasoning Benchmark for Unified Multimodal Models"
[🌍 Homepage] [📜 OpenReview Paper] [🤗 HF Datasets] [💻 GitHub Code]
Overview of GGBench: A Geometric Generative Reasoning Benchmark for Unified Multimodal Models.
We introduce GGBench, a geometric generative reasoning benchmark purpose-built for unified multimodal models (UMMs). Unlike prior evaluations that treat discriminative understanding and unconstrained image generation separately, GGBench diagnoses whether a model can fuse language comprehension with precise visual construction. Geometric construction serves as an ideal testbed, revealing how well a system can actively reason and synthesize structured solutions across modalities.
We investigate a key question: Can unified multimodal models integrate reasoning with controlled visual synthesis? While modern UMMs can perceive and understand complex visual scenes, their actual reliability in generative reasoning—where language understanding must guide precise geometric construction—remains unverified.
We conduct a comprehensive evaluation across multiple dimensions including planning, middle process, and final result quality, introducing GGBench as a standardized benchmark for systematic generative reasoning assessment. Our findings reveal the current capabilities and limitations of UMMs in geometric generative reasoning tasks.
We provide a comprehensive investigation of unified multimodal models, analyzing their potential for geometric generative reasoning and detailing representative successes, characteristic errors, and the conditions under which generative reasoning emerges, holds, or breaks.
Visit our homepage to see video demonstrations showing how different models solve geometric problems step by step.
```bash
git lfs install
git clone https://huggingface.co/datasets/opendatalab-raiser/GGBench
```

The evaluation script supports multiple evaluation dimensions, including VLM-based text/image evaluation, mid-process evaluation, and image quality metrics (LPIPS, PSNR, SSIM).
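As a quick sanity check after cloning, the annotations can be inspected in Python. This is only a minimal sketch: the annotation file name and layout used below are assumptions, since the README does not specify the repository's internal structure.

```python
import json
from pathlib import Path

# Hypothetical layout: point this at whatever annotation file the cloned
# GGBench repository actually contains (not specified in this README).
DATASET_DIR = Path("GGBench")
annotation_file = DATASET_DIR / "data.json"  # assumed filename

with open(annotation_file, "r", encoding="utf-8") as f:
    samples = json.load(f)

print(f"Loaded {len(samples)} geometric construction problems")
print(samples[0])  # inspect a single problem record
```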
- Navigate to the `dataset/` directory
- Edit lines 52-53 in `evaluate.py` to add your Judge Model URL and API Key
- Configure `MODEL_OUTPUT_PATH` in `evaluate.py` to point to your model's output JSON file
- Run: `python evaluate.py`
Results will be saved to `eval_output/result.json`, with aggregated scores in `eval_output/score.json`.
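Once the run finishes, the aggregated scores can be inspected directly. A minimal sketch follows; the internal schema of the two output files is not documented here, so the snippet only pretty-prints their contents.

```python
import json

# Only the file paths come from this README; the schema of the JSON
# contents is an assumption, so we simply pretty-print the scores.
with open("eval_output/score.json", "r", encoding="utf-8") as f:
    scores = json.load(f)

print(json.dumps(scores, indent=2, ensure_ascii=False))
```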
- VLM-T: Text-based step reasoning evaluation (1-5 scale)
- VLM-I-Mid: Middle process image quality evaluation (Step Accuracy, Process Consistency, Problem-Solution Accuracy)
- VLM-I-Res: Final result image quality evaluation (1-5 scale)
- LPIPS ×10⁻²: Learned Perceptual Image Patch Similarity
- PSNR: Peak Signal-to-Noise Ratio
- SSIM ×10⁻²: Structural Similarity Index (a usage sketch for these three image metrics follows below)
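Below is a minimal sketch of how the three image similarity metrics above are typically computed, using `scikit-image` for PSNR/SSIM and the `lpips` package for LPIPS. It is illustrative only and not taken from `evaluate.py`; the file names, image size, and scaling shown are assumptions.

```python
import lpips  # pip install lpips
import numpy as np
import torch
from PIL import Image
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def load_rgb(path, size=(512, 512)):
    """Load an image as an HxWx3 uint8 array (the size is an assumed default)."""
    return np.array(Image.open(path).convert("RGB").resize(size))

pred = load_rgb("model_output.png")   # hypothetical file names
ref = load_rgb("ground_truth.png")

# PSNR and SSIM operate directly on the uint8 arrays.
psnr = peak_signal_noise_ratio(ref, pred, data_range=255)
ssim = structural_similarity(ref, pred, channel_axis=-1, data_range=255)

# LPIPS expects NCHW float tensors scaled to [-1, 1].
to_tensor = lambda im: torch.from_numpy(im).permute(2, 0, 1).float().unsqueeze(0) / 127.5 - 1.0
loss_fn = lpips.LPIPS(net="alex")
lpips_score = loss_fn(to_tensor(pred), to_tensor(ref)).item()

# LPIPS and SSIM are reported in the ×10⁻² columns, i.e. scaled by 100.
print(f"PSNR: {psnr:.2f}  SSIM x 1e2: {ssim * 100:.2f}  LPIPS x 1e2: {lpips_score * 100:.2f}")
```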
We curate GGBench, a comprehensive benchmark providing a standardized taxonomy and an evaluation protocol, enabling consistent and category-wise assessment beyond surface-level metrics.
Evaluation Radar Map and Category Distribution of GGBench.
- Total Samples: 1,411 geometric construction problems
- Categories: Multiple geometric problem types including basic constructions, circle properties, geometric transformations, triangle properties, theorem applications, polygon properties, measurement & ratios, and locus construction
- Evaluation Dimensions: Planning, Middle Process, Final Result, and Overall Scores
Main results on GGBench. VLM-T and VLM-I denote step reasoning and final diagram quality, respectively. VLM-Avg averages middle and final stages. All values are percentages.
See the full leaderboard for detailed results across all evaluated models.
If you find our repository or paper useful, please cite:
```bibtex
@article{wei2025ggbench,
  title={GGBench: A Geometric Generative Reasoning Benchmark for Unified Multimodal Models},
  author={Wei, Jingxuan and Jia, Caijun and Bai, Xi and Xu, Xinglong and Li, Siyuan and Sun, Linzhuang and Yu, Bihui and He, Conghui and Wu, Lijun and Tan, Cheng},
  journal={arXiv preprint arXiv:2511.11134},
  year={2025}
}
```


