Di Wang1, Shunyu Liu2, Wentao Jiang1, Fengxiang Wang3, Yi Liu1, Xiaolei Qin1, Zhiming Luo1,
Chaoyang Zhou1, Haonan Guo1, Jing Zhang1 †, Bo Du1 †, Dacheng Tao2, Liangpei Zhang1 †
1 Wuhan University, 2 Nanyang Technological University, 3 Shanghai AI Laboratory.
† Corresponding author
Update | Overview | Datasets | Models | Usage | Statement
2025.12.04
- All components required for building an inference demo have been prepared.
- The updated model weights are available on:
  - Hugging Face
- The JSON annotation files for the test sets of several benchmarks used in our evaluation have been released and are available at:
  - Hugging Face

  (Note: Only the JSON files are provided; the corresponding images should be downloaded from the original datasets.)
2025.12.01
- The paper is posted on arXiv! (arXiv)
We present GeoZero, the first MLLM capable of performing emergent reasoning on geospatial scenes from scratch without any predefined CoT supervision. To encourage deep and reliable reasoning while maintaining answer accuracy, we construct two datasets, GeoZero-Instruct and GeoZero-Hard. GeoZero-Instruct allows the model to acquire preliminary geospatial knowledge through supervised fine-tuning, while GeoZero-Hard stimulates deep reasoning during the subsequent reinforcement learning stage. We also propose Answer-Anchored Group Relative Policy Optimization (A$^2$GRPO), where the reasoning process is regularized by the model’s own answers, encouraging diverse yet accurate thinking. GeoZero not only reduces annotation costs but also enhances the cognitive capability of MLLMs, offering new insights toward general geospatial AI.
Figure 1. Framework of GeoZero.
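To make the answer-anchored, group-relative idea behind A$^2$GRPO concrete, here is a minimal, illustrative sketch of how a GRPO-style advantage could be computed from an answer-anchored reward. The reward weights and the "reasoning consistent with the answer" bonus are our own placeholder assumptions for illustration, not the paper's exact formulation.

```python
import statistics

def answer_anchored_reward(answer_correct, reasoning_consistent,
                           w_answer=1.0, w_anchor=0.5):
    """Illustrative reward: correctness of the final answer, plus a bonus
    when the reasoning trace is consistent with (anchored by) that answer.
    The weights here are hypothetical, not the paper's values."""
    return w_answer * float(answer_correct) + w_anchor * float(
        answer_correct and reasoning_consistent)

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each rollout's reward against the
    mean/std of its own group of rollouts for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# One group of 4 sampled rollouts for the same geospatial question
rewards = [answer_anchored_reward(c, s) for c, s in
           [(True, True), (True, False), (False, False), (False, False)]]
advs = group_relative_advantages(rewards)
```

Rollouts whose reasoning supports a correct answer receive the largest positive advantage, pushing the policy toward reasoning that is both diverse and anchored to accurate answers.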
GeoZero relies on multiple remote sensing benchmarks for both model development and evaluation. Please manually download the corresponding image datasets from their original sources.
| Dataset | Dataset | Dataset |
|---|---|---|
| VHM-Instruct | RESISC-45 | EuroSAT |
| AID | NASC-TG2 | fMoW |
| WHU-RS19 | RSVQA | UCM |
| RSVG | DIOR-RSVG | SkyEye-968k |
| VRSBench | SIRI-WHU | UCM-Captions |
| Sydney-Captions | NWPU-Captions | RSICD |
We provide pre-formatted JSON annotation files to ensure consistent data loading and usage:
Coming Soon.
Evaluation samples across different benchmarks are available on our continually updated Hugging Face dataset repository:
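Once the JSON files are released, loading them should amount to standard JSON parsing. The record fields below (`image`, `question`, `answer`) are hypothetical placeholders until the actual schema is published; inspect the released files and adjust the keys accordingly.

```python
import json
import os

def load_annotations(json_path, image_root):
    """Load a pre-formatted annotation file and resolve image paths.

    NOTE: the field names used here are assumed, not confirmed.
    Images are not bundled with the JSON files and must be downloaded
    from the original datasets, then placed under `image_root`.
    """
    with open(json_path, "r", encoding="utf-8") as f:
        records = json.load(f)
    return [
        {
            "image": os.path.join(image_root, r["image"]),
            "question": r["question"],
            "answer": r["answer"],
        }
        for r in records
    ]
```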
| Model | Weights |
|---|---|
| GeoZero w/o RFT | Hugging Face & Baidu Drive |
More model weights will be released in future updates.
We provide an inference script for Qwen3-VL and related models on various remote sensing vision–language tasks:

```shell
python single_infer_eval_qwen3vl_think.py \
    --model_path [model path] \
    --json_path [dataset json path] \
    --output_path [output saved path] \
    --task [task type] --batchsize 4 --gpu [gpu id] \
    --system [whether to use the system prompt (Type1)]
```
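After inference, the saved outputs can be scored with a few lines of Python. The prediction-file schema below (a list of records with `answer` and `prediction` string keys) is a hypothetical example; the script's actual output format may differ.

```python
import json

def exact_match_accuracy(pred_path):
    """Compute case-insensitive exact-match accuracy over a predictions file.

    Assumed (hypothetical) schema: a JSON list of records, each holding
    the ground-truth `answer` and the model's `prediction` as strings.
    """
    with open(pred_path, "r", encoding="utf-8") as f:
        records = json.load(f)
    hits = sum(
        r["prediction"].strip().lower() == r["answer"].strip().lower()
        for r in records
    )
    return hits / len(records) if records else 0.0
```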
If you find GeoZero helpful, please give a ⭐ and cite it as follows:
@article{wang2025geozero,
title = {GeoZero: Incentivizing Reasoning from Scratch on Geospatial Scenes},
author = {Wang, Di and Liu, Shunyu and Jiang, Wentao and Wang, Fengxiang and Liu, Yi and Qin, Xiaolei and Luo, Zhiming and Zhou, Chaoyang and Guo, Haonan and Zhang, Jing and Du, Bo and Tao, Dacheng and Zhang, Liangpei},
journal = {arXiv preprint arXiv:2511.22645},
year = {2025}
}
For any other questions, please contact di.wang at gmail.com or whu.edu.cn.
This project is based on Qwen3-VL, ms-swift, and RSEvalKit. Thanks for their wonderful work!

