EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs

Zhenghao Chen¹,², Huiqun Wang¹,², Di Huang¹,²✉
¹State Key Laboratory of Complex and Critical Software Environment, Beihang University
²School of Computer Science and Engineering, Beihang University

✨ News

[2026.04.07] 🎉🎉 We have released the model weights and the evaluation code!
[2026.04.01] 🎉 We have released our paper on arXiv!
[2026.02.21] 🎉 Our paper has been accepted to CVPR 2026!

🚀 Framework

EgoMind is a Chain-of-Thought (CoT) framework that enables geometry-free spatial reasoning through two key components:

Role-Play Caption (RPC): Simulates an agent navigating an environment from a first-person perspective, generating coherent descriptions of frame-wise observations and viewpoint transitions to build a consistent global understanding of the scene.
Progressive Spatial Analysis (PSA): First localizes objects explicitly mentioned in the query, then expands its attention to surrounding entities, and finally reasons about their spatial relationships in an integrated manner.
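The two components above can be pictured as consecutive stages of one chain-of-thought instruction. The sketch below is only an illustration: the stage wording is paraphrased from this README, and the `CoTPrompt` class, its fields, and the prompt layout are hypothetical. It is not the prompt used to train or evaluate EgoMind.

```python
from dataclasses import dataclass

# Hypothetical stage descriptions paraphrased from the README text;
# they are NOT the prompts used by the EgoMind authors.
RPC_STAGE = (
    "Role-Play Caption: describe the video as an agent moving through the "
    "scene in first person, covering what each frame shows and how the "
    "viewpoint changes between frames."
)
PSA_STAGE = (
    "Progressive Spatial Analysis: (1) locate the objects named in the "
    "question, (2) describe the entities around them, (3) reason about "
    "their spatial relationships together."
)


@dataclass
class CoTPrompt:
    """Illustrative container for a question and the number of input frames."""

    question: str
    num_frames: int

    def render(self) -> str:
        """Assemble a two-stage chain-of-thought instruction as plain text."""
        return "\n".join(
            [
                f"You are given {self.num_frames} video frames.",
                f"Step 1. {RPC_STAGE}",
                f"Step 2. {PSA_STAGE}",
                f"Question: {self.question}",
                "Give the final answer after both steps.",
            ]
        )


if __name__ == "__main__":
    print(CoTPrompt("Which is closer to the sofa, the lamp or the table?", 16).render())
```

The point of the ordering is that the scene description (RPC) is produced before any question-specific analysis (PSA), so the later stage can refer back to it.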

With only 5K auto-generated SFT samples and 20K RL samples, EgoMind achieves competitive results on VSI-Bench, SPAR-Bench, SITE-Bench, and SPBench, demonstrating the potential of linguistic reasoning for spatial cognition.

🏆 Main Results

EgoMind achieves competitive performance among open-source MLLMs across four spatial reasoning benchmarks, using only 25K training samples (5K CoT-supervised + 20K RL) without any explicit 3D priors.

🔬 Evaluation

1. Environment Installation

# Create and activate a Conda environment (Python 3.11)
conda create -n egomind python=3.11 -y
conda activate egomind

# Install uv, PyTorch, and project dependencies
pip install uv
uv pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
uv pip install -r requirements.txt
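
After installation, it can help to confirm that the pinned PyTorch packages were actually resolved to the versions above. The following is a minimal sketch using only the standard library; the helper name `check_versions` is an assumption for illustration and is not part of the repository.

```python
from importlib import metadata

# Versions pinned by the install commands in this section.
EXPECTED = {"torch": "2.6.0", "torchvision": "0.21.0", "torchaudio": "2.6.0"}


def check_versions(expected: dict[str, str]) -> dict[str, str]:
    """Return {package: problem} for every package that is missing or mismatched."""
    problems = {}
    for name, wanted in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        # Wheels from the PyTorch index carry a local suffix such as "+cu124".
        if found.split("+")[0] != wanted:
            problems[name] = f"found {found}, expected {wanted}"
    return problems


if __name__ == "__main__":
    issues = check_versions(EXPECTED)
    for name, problem in issues.items():
        print(f"{name}: {problem}")
    if not issues:
        print("All pinned packages match.")
```

Run it inside the activated egomind environment. To also confirm GPU visibility, `torch.cuda.is_available()` can be called once the version check passes.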

2. Model Preparation

Download the model weights into the repo’s models/ directory by running the command below from the EgoMind repository root. This requires the Hugging Face CLI, which is installed with pip install huggingface_hub.

huggingface-cli download Hyggge/EgoMind-7B --resume-download --local-dir ./models/EgoMind-7B

After the download, point --model_path to models/EgoMind-7B for local inference, or pass Hyggge/EgoMind-7B instead to load the weights directly from the Hub.
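
The choice between the local directory and the Hub identifier can be made automatically. This is a small sketch, not code from the repository: `resolve_model_path` is a hypothetical helper, and it assumes that a usable checkpoint directory contains a config.json file.

```python
from pathlib import Path

HUB_ID = "Hyggge/EgoMind-7B"           # Hub repository named in this section
LOCAL_DIR = Path("models/EgoMind-7B")  # target of the download command


def resolve_model_path(local_dir: Path = LOCAL_DIR, hub_id: str = HUB_ID) -> str:
    """Prefer the local checkpoint when it looks complete, else use the Hub id."""
    # Assumption: a usable checkpoint directory contains a config.json file.
    if (local_dir / "config.json").is_file():
        return str(local_dir)
    return hub_id


if __name__ == "__main__":
    print(resolve_model_path())
```

The returned string is what would be passed as the value of --model_path.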

3. Dataset Preparation

Download the benchmark data and place it under evaluation/datasets/. See evaluation/datasets/README.md for detailed instructions.

The expected directory structure is:

evaluation/datasets/
├── VSI-Bench/
│   ├── qa_processed.jsonl
│   └── data/                  # arkitscenes/, scannet/, scannetpp/
├── SPAR-Bench/
│   ├── qa_processed.jsonl
│   └── data/                  # images/
├── SITE-Bench/
│   ├── qa_processed.jsonl
│   └── data/                  # ActivityNet/, MLVU/, MVBench/, ...
└── SPBench/
    ├── qa_processed.jsonl
    └── data/                  # SPBench-MV-images/, SPBench-SI-images/
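
Before launching a long evaluation run, the layout above can be checked programmatically. The sketch below only tests for the qa_processed.jsonl file and the data/ folder of each benchmark; it does not inspect the contents of data/. The function name `missing_entries` is an assumption for illustration.

```python
from pathlib import Path

# Benchmark folder names copied from the directory tree in this section.
BENCHMARKS = ("VSI-Bench", "SPAR-Bench", "SITE-Bench", "SPBench")


def missing_entries(root: Path) -> list[str]:
    """List the expected files and folders that are absent under root."""
    missing = []
    for bench in BENCHMARKS:
        qa_file = root / bench / "qa_processed.jsonl"
        data_dir = root / bench / "data"
        if not qa_file.is_file():
            missing.append(str(qa_file.relative_to(root)))
        if not data_dir.is_dir():
            missing.append(str(data_dir.relative_to(root)))
    return missing


if __name__ == "__main__":
    gaps = missing_entries(Path("evaluation/datasets"))
    print("Layout OK" if not gaps else "Missing: " + ", ".join(gaps))
```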

4. Running Evaluation

All benchmarks share the same entry point evaluation/run_eval.py. Below are the commands for each benchmark.

VSI-Bench

python evaluation/run_eval.py \
    --model_path models/EgoMind-7B \
    --output_path outputs/EgoMind-7B_vsibench.jsonl \
    --benchmark vsibench

SPAR-Bench

python evaluation/run_eval.py \
    --model_path models/EgoMind-7B \
    --output_path outputs/EgoMind-7B_sparbench.jsonl \
    --benchmark sparbench

SITE-Bench

python evaluation/run_eval.py \
    --model_path models/EgoMind-7B \
    --output_path outputs/EgoMind-7B_sitebench.jsonl \
    --benchmark sitebench

SPBench

python evaluation/run_eval.py \
    --model_path models/EgoMind-7B \
    --output_path outputs/EgoMind-7B_spbench.jsonl \
    --benchmark spbench
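
Since the four commands differ only in the benchmark name and output file, they can be generated in a loop. The sketch below is a dry run that prints each command; it uses only the flags shown above, while the helper `build_command` and its default arguments are assumptions for illustration.

```python
import shlex
import sys

# Values accepted by --benchmark, as used in the commands in this section.
BENCHMARKS = ("vsibench", "sparbench", "sitebench", "spbench")


def build_command(benchmark: str, model_path: str = "models/EgoMind-7B",
                  output_dir: str = "outputs") -> list[str]:
    """Build the argv list for one benchmark, mirroring the documented commands."""
    # Use the last path component so a Hub id such as "Hyggge/EgoMind-7B"
    # yields the same output file name as the local directory.
    model_name = model_path.rstrip("/").split("/")[-1]
    return [
        sys.executable, "evaluation/run_eval.py",
        "--model_path", model_path,
        "--output_path", f"{output_dir}/{model_name}_{benchmark}.jsonl",
        "--benchmark", benchmark,
    ]


if __name__ == "__main__":
    # Dry run: print each command instead of executing it.
    for name in BENCHMARKS:
        print(shlex.join(build_command(name)))
```

To execute rather than print, each list can be handed to `subprocess.run(cmd, check=True)` from the repository root.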

To compute metrics from existing outputs without rerunning inference, pass --only_eval and omit --model_path:

python evaluation/run_eval.py \
    --output_path outputs/EgoMind-7B_vsibench.jsonl \
    --benchmark vsibench \
    --only_eval
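
The output files are JSON Lines, but this README does not document their fields. The sketch below therefore makes no assumption about the schema: it only counts records and tallies the top-level keys it finds, which is a quick way to see what an output file contains. The helper name `summarize_jsonl` is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_jsonl(path: Path) -> tuple[int, Counter]:
    """Return the record count and how often each top-level key appears."""
    count = 0
    keys: Counter = Counter()
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            count += 1
            if isinstance(record, dict):
                keys.update(record.keys())
    return count, keys


if __name__ == "__main__":
    target = Path("outputs/EgoMind-7B_vsibench.jsonl")
    if target.is_file():
        total, key_counts = summarize_jsonl(target)
        print(total, dict(key_counts))
```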

📜 Citation

If you find our work helpful, please consider citing our paper:

@misc{chen2026egomind,
      title={EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs}, 
      author={Zhenghao Chen and Huiqun Wang and Di Huang},
      year={2026},
      eprint={2604.03318},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2604.03318}, 
}
