AeroAct: Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning

📰 News

Apr 8, 2026: We open-source the AeroAct training code.

💡 Introduction

We present a unified aerial VLN framework that operates solely on egocentric monocular RGB observations and natural language instructions. The model formulates navigation as a next-token prediction problem, jointly optimizing spatial perception, trajectory reasoning, and action prediction through prompt-guided multi-task learning.

🛠️ Installation

To build the environment for training AeroAct, run the following:

./environment_setup.sh aeroact
conda activate aeroact

📦 Data Preparation

This project uses three groups of data: navigation trajectories for action prediction, spatial VQA data for egocentric reasoning, and trajectory-summary annotations for sub-trajectory understanding. Please organize the resources exactly as shown below before launching training.

1. AerialVLN Data

This is the main navigation dataset used for action prediction. It contains step-wise tuples after action merging and keyframe selection.

Obtain the raw data from AirVLN, then place the processed annotations and raw frames under Dataset/AerialVLN-Dataset/.

The code expects the following files:

Dataset/AerialVLN-Dataset/Raw_data/aerialvln-s/
Dataset/AerialVLN-Dataset/data/aerialvln-s/train_merged_triple.json
Dataset/AerialVLN-Dataset/data/aerialvln-s/train_episode2instruction.json
Dataset/AerialVLN-Dataset/data/aerialvln-s/train_action_prob_weight.json
Dataset/AerialVLN-Dataset/data/aerialvln-s/train_episodes2idx_merge.json

These JSON files provide instruction mapping, action reweighting, and merged frame indices used by llava/data/dataset.py.
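Before launching training, it can help to confirm that these paths exist. The following is a minimal sketch, not part of the repository: the helper name `find_missing_paths` is hypothetical, and it assumes the script is run from the repository root.

```python
from pathlib import Path

# Paths copied from the list above, relative to the repository root.
EXPECTED_AERIALVLN_PATHS = [
    "Dataset/AerialVLN-Dataset/Raw_data/aerialvln-s",
    "Dataset/AerialVLN-Dataset/data/aerialvln-s/train_merged_triple.json",
    "Dataset/AerialVLN-Dataset/data/aerialvln-s/train_episode2instruction.json",
    "Dataset/AerialVLN-Dataset/data/aerialvln-s/train_action_prob_weight.json",
    "Dataset/AerialVLN-Dataset/data/aerialvln-s/train_episodes2idx_merge.json",
]


def find_missing_paths(repo_root=".", expected=EXPECTED_AERIALVLN_PATHS):
    """Return the expected paths that do not exist under repo_root (hypothetical helper)."""
    root = Path(repo_root)
    return [p for p in expected if not (root / p).exists()]


if __name__ == "__main__":
    missing = find_missing_paths()
    if missing:
        print("Missing paths:")
        for path in missing:
            print("  " + path)
    else:
        print("All expected AerialVLN paths are present.")
```

The check only tests for existence; it does not validate the JSON contents, whose schema is defined by llava/data/dataset.py.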

2. VQA Data

This data provides QA pairs for drone egocentric spatial reasoning and spatial relations. It combines Open3DVQA-style annotations with a spatial subset of ShareGPT4V-SFT.

Download the Open3DVQA content from EmbodiedCity/Open3DVQA-v2 and arrange it under Dataset/VQA-Dataset/Open3DVQA.

For the GQA-style spatial subset, download the image assets from GQA and prepare the ShareGPT4V-SFT subset under Dataset/VQA-Dataset/ShareGPT4V-SFT.

3. Trajectory Summary Data

This is the human-annotated sub-trajectory segmentation and progress-summary data used to train the trajectory-summary branch. In the current code, it is consumed together with the AerialVLN raw frames and auxiliary JSON files listed above.

We currently provide a demo split to quickly test and sanity-check the training code.

The data should be placed in Dataset/AerialVLN-Dataset/data/aerialvln-s/, together with labeled_segments.json and subgoal_pointer_list.json.

Expected Directory Structure

Dataset
├─ AerialVLN-Dataset
|  ├─ data
|  |  └─ aerialvln-s
|  |     ├─ train_merged_triple.json
|  |     ├─ subgoal_pointer_list.json
|  |     ├─ train_episode2instruction.json
|  |     ├─ train_action_prob_weight.json
|  |     ├─ train_episodes2idx_merge.json
|  |     └─ labeled_segments.json
|  └─ Raw_data
|     └─ aerialvln-s
|        └─ <episode_id>
|           └─ rgb
|              ├─ frame_000.jpg
|              ├─ frame_001.jpg
|              └─ ...
└─ VQA-Dataset
   ├─ Open3DVQA
   |  └─ O3DVQA
   |     ├─ EmbodiedCity
   |     |  └─ Wuhan
   |     |     ├─ merged_qa.json
   |     |     └─ rgb
   |     └─ UrbanScene
   |        ├─ Campus
   |        |  ├─ merged_qa.json
   |        |  └─ rgb
   |        └─ Residence
   |           ├─ merged_qa.json
   |           └─ rgb
   └─ ShareGPT4V-SFT
      ├─ sharegpt4v_gqa_spatial_outdoor.json
      └─ gqa
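
As a quick sanity check on the raw-frame layout, the sketch below counts the .jpg frames in each episode's rgb folder. It is an illustrative helper rather than repository code: the function name `count_frames_per_episode` is hypothetical, and it assumes the `<episode_id>/rgb/*.jpg` layout shown in the tree.

```python
from pathlib import Path


def count_frames_per_episode(raw_root):
    """Map each episode directory name to the number of .jpg files in its rgb/ folder.

    Hypothetical helper: episode directories without an rgb/ folder are skipped.
    """
    counts = {}
    for episode_dir in sorted(Path(raw_root).iterdir()):
        rgb_dir = episode_dir / "rgb"
        if rgb_dir.is_dir():
            counts[episode_dir.name] = len(list(rgb_dir.glob("*.jpg")))
    return counts


if __name__ == "__main__":
    raw_root = Path("Dataset/AerialVLN-Dataset/Raw_data/aerialvln-s")
    if raw_root.is_dir():
        counts = count_frames_per_episode(raw_root)
        empty = [name for name, n in counts.items() if n == 0]
        print(f"{len(counts)} episodes found, {len(empty)} with no frames")
    else:
        print(f"{raw_root} does not exist")
```

Episodes that report zero frames usually point to an incomplete download or a misplaced folder.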

🚀 Training

To launch training, please run the following command:

# Train on the AerialVLN dataset only.
bash script/train_aerialvln.sh airvln_merge_triple_updated
# Train on all datasets. First edit llava/data/datasets_mixture.py to include the datasets you want, and prepare the data for each of them.
bash script/train_aerialvln.sh airvln_merge_triple_updated+airvln_subtraj_sum_updated+open3dvqa_embodiedcity_wuhan+open3dvqa_urban_scene_campus+open3dvqa_urban_scene_residence+gqa_spatial
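
The commands above pass the dataset mixture as a single argument, with dataset names joined by `+`. For readers who prefer to assemble that argument programmatically, here is a minimal sketch. The helper `build_train_command` is hypothetical and only formats the command; the dataset names are the ones shown above and must match what is registered in llava/data/datasets_mixture.py.

```python
import shlex

# Dataset names taken from the training commands above.
AERIALVLN_ONLY = ["airvln_merge_triple_updated"]
ALL_DATASETS = [
    "airvln_merge_triple_updated",
    "airvln_subtraj_sum_updated",
    "open3dvqa_embodiedcity_wuhan",
    "open3dvqa_urban_scene_campus",
    "open3dvqa_urban_scene_residence",
    "gqa_spatial",
]


def build_train_command(dataset_names, script="script/train_aerialvln.sh"):
    """Return the argv list for a training run (hypothetical helper).

    The names are joined with '+' into one mixture argument.
    """
    if not dataset_names:
        raise ValueError("at least one dataset name is required")
    return ["bash", script, "+".join(dataset_names)]


if __name__ == "__main__":
    # Print the command line instead of running it.
    print(shlex.join(build_train_command(ALL_DATASETS)))
```

The returned list can be passed to `subprocess.run` if you want to launch training from Python rather than from the shell.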

🧪 Evaluation

To evaluate a trained AeroAct checkpoint on AerialVLN, first start the simulator server, then run the evaluation script. Before running evaluation, please configure the test environment and simulator platform according to AirVLN.

cd AirVLN
nohup python airsim_plugin/AirVLNSimulatorServerTool.py --gpus 0 &
cd ..
# Example: bash script/eval_aeriavln.sh checkpoints/exp1 0.2
bash script/eval_aeriavln.sh <model_path> <temperature>

📊 Performance

Table 1 | Main comparison results on AerialVLN benchmark.

Method	Observation	Seen NE↓	Seen SR↑	Seen OSR↑	Seen SDTW↑	Unseen NE↓	Unseen SR↑	Unseen OSR↑	Unseen SDTW↑
Grid-based VS	Pano + Odo	70.3	20.8	33.4	10.2	121.3	7.4	16.1	2.5
CityNavAgent	Depth + Pano + Odo	80.8	13.9	30.2	5.1	60.2	11.7	35.2	5.0
Random Sampling	-	109.6	0.0	0.0	0.0	149.7	0.0	0.0	0.0
Action Sampling	-	213.8	0.9	5.7	0.3	237.6	0.2	1.1	0.1
LingUNet	S.RGB + Depth	383.8	0.6	6.9	0.2	368.4	0.4	3.6	0.9
Seq2Seq	S.RGB + Depth	146.0	4.8	19.8	1.6	218.9	2.3	11.7	0.7
CMA	S.RGB + Depth	121.0	3.0	23.2	0.6	172.1	3.2	16.0	1.1
LAG	S.RGB + Depth	90.2	7.2	15.7	2.4	127.9	5.1	10.5	1.4
STMR	S.RGB + Depth	96.3	12.6	31.6	2.7	119.5	10.8	23.0	1.9
MapGPT	S.RGB	124.9	2.1	4.7	0.8	107.0	0.0	0.0	0.0
Navid	S.RGB	105.1	6.8	15.5	1.1	106.9	6.1	12.3	0.7
OpenFly	S.RGB	127.2	8.1	21.8	1.6	113.8	7.6	18.2	1.5
AeroAct (Ours)	S.RGB	79.6	11.4	37.7	6.3	95.8	8.1	28.9	2.2
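
The metric columns are not defined in this README. As a rough guide, the sketch below computes per-episode NE (navigation error), SR (success rate) and OSR (oracle success rate) following the definitions commonly used in the VLN literature. It is not the evaluation code behind Table 1: the function names are hypothetical, the 20 m success radius is an assumption based on the usual AerialVLN setting, and SDTW is omitted because it also requires the reference path.

```python
import math


def episode_metrics(trajectory, goal, success_radius=20.0):
    """Compute NE, SR and OSR for one episode (illustrative sketch).

    trajectory: list of (x, y, z) positions visited by the agent.
    goal: (x, y, z) target position.
    success_radius: assumed success threshold in meters.
    """
    if not trajectory:
        raise ValueError("trajectory must contain at least one position")
    distances = [math.dist(point, goal) for point in trajectory]
    final_distance = distances[-1]
    return {
        "NE": final_distance,                                  # distance to goal at stop
        "SR": float(final_distance <= success_radius),         # stopped within the radius
        "OSR": float(min(distances) <= success_radius),        # ever passed within the radius
    }


def aggregate_metrics(per_episode):
    """Average NE in meters and report SR/OSR as percentages."""
    n = len(per_episode)
    return {
        "NE": sum(m["NE"] for m in per_episode) / n,
        "SR": 100.0 * sum(m["SR"] for m in per_episode) / n,
        "OSR": 100.0 * sum(m["OSR"] for m in per_episode) / n,
    }
```

Under these definitions OSR is never lower than SR, since an episode that stops within the radius has also passed within it.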

🙏 Acknowledgement

This repository is partly based on AirVLN and NVILA.

📚 Citation

If you find our project helpful, please consider citing:

@article{xu2025aerial,
	title={Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning},
	author={Xu, Huilin and Liu, Zhuoyang and Luomei, Yixiang and Xu, Feng},
	journal={arXiv preprint arXiv:2512.08639},
	year={2025}
}

📄 License

This project is released under the Apache 2.0 License.
