return-sleep/AeroAct

AeroAct: Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning

arXiv Code License: Apache 2.0 Python 3.10+

📰 News

  • Apr 8, 2026: We open-source the AeroAct training code.

💡 Introduction

We present a unified aerial VLN framework that operates solely on egocentric monocular RGB observations and natural language instructions. The model formulates navigation as a next-token prediction problem, jointly optimizing spatial perception, trajectory reasoning, and action prediction through prompt-guided multi-task learning.

🛠️ Installation

To build the environment for training AeroAct, run the following:

./environment_setup.sh aeroact
conda activate aeroact

📦 Data Preparation

This project uses three groups of data: navigation trajectories for action prediction, spatial VQA data for egocentric reasoning, and trajectory-summary annotations for sub-trajectory understanding. Please organize the resources exactly as shown below before launching training.

1. AerialVLN Data

This is the main navigation dataset used for action prediction. It contains step-wise tuples after action merging and keyframe selection.

Source the raw data from AirVLN, then place the processed annotations and raw frames under Dataset/AerialVLN-Dataset/.

The code expects the following files:

  • Dataset/AerialVLN-Dataset/Raw_data/aerialvln-s/
  • Dataset/AerialVLN-Dataset/data/aerialvln-s/train_merged_triple.json
  • Dataset/AerialVLN-Dataset/data/aerialvln-s/train_episode2instruction.json
  • Dataset/AerialVLN-Dataset/data/aerialvln-s/train_action_prob_weight.json
  • Dataset/AerialVLN-Dataset/data/aerialvln-s/train_episodes2idx_merge.json

These JSON files provide instruction mapping, action reweighting, and merged frame indices used by llava/data/dataset.py.
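
Before launching training, it can help to confirm that every path listed above is present. The following is a minimal sketch, not part of the repository: the helper name `find_missing` is hypothetical, and it assumes the script is run from the repository root so that the relative paths resolve.

```python
from pathlib import Path

# Paths copied from the list above, relative to the repository root.
REQUIRED_PATHS = [
    "Dataset/AerialVLN-Dataset/Raw_data/aerialvln-s",
    "Dataset/AerialVLN-Dataset/data/aerialvln-s/train_merged_triple.json",
    "Dataset/AerialVLN-Dataset/data/aerialvln-s/train_episode2instruction.json",
    "Dataset/AerialVLN-Dataset/data/aerialvln-s/train_action_prob_weight.json",
    "Dataset/AerialVLN-Dataset/data/aerialvln-s/train_episodes2idx_merge.json",
]


def find_missing(root=".", required=REQUIRED_PATHS):
    """Return the required paths that do not exist under `root` (hypothetical helper)."""
    root = Path(root)
    return [p for p in required if not (root / p).exists()]


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Missing AerialVLN resources:")
        for path in missing:
            print(f"  {path}")
    else:
        print("All expected AerialVLN files were found.")
```

The check only tests for existence; it does not validate the contents of the JSON files.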

2. VQA Data

This data provides spatial-relation QA pairs for training the drone's egocentric spatial reasoning. It combines Open3DVQA-style annotations with a spatial subset of ShareGPT4V-SFT.

Download the Open3DVQA content from EmbodiedCity/Open3DVQA-v2 and arrange it under Dataset/VQA-Dataset/Open3DVQA.

For the GQA-style spatial subset, download the image assets from GQA and prepare the ShareGPT4V-SFT subset under Dataset/VQA-Dataset/ShareGPT4V-SFT.

3. Trajectory Summary Data

This is the human-annotated sub-trajectory segmentation and progress-summary data used to train the trajectory-summary branch. In the current code, it is consumed together with the AerialVLN raw frames and auxiliary JSON files listed above.

We currently provide a demo split to quickly test and sanity-check the training code.

The data should be placed in Dataset/AerialVLN-Dataset/data/aerialvln-s/, together with labeled_segments.json and subgoal_pointer_list.json.

Expected Directory Structure

Dataset
├─ AerialVLN-Dataset
|  ├─ data
|  |  └─ aerialvln-s
|  |     ├─ train_merged_triple.json
|  |     ├─ subgoal_pointer_list.json
|  |     ├─ train_episode2instruction.json
|  |     ├─ train_action_prob_weight.json
|  |     ├─ train_episodes2idx_merge.json
|  |     └─ labeled_segments.json
|  └─ Raw_data
|     └─ aerialvln-s
|        └─ <episode_id>
|           └─ rgb
|              ├─ frame_000.jpg
|              ├─ frame_001.jpg
|              └─ ...
└─ VQA-Dataset
   ├─ Open3DVQA
   |  └─ O3DVQA
   |     ├─ EmbodiedCity
   |     |  └─ Wuhan
   |     |     ├─ merged_qa.json
   |     |     └─ rgb
   |     └─ UrbanScene
   |        ├─ Campus
   |        |  ├─ merged_qa.json
   |        |  └─ rgb
   |        └─ Residence
   |           ├─ merged_qa.json
   |           └─ rgb
   └─ ShareGPT4V-SFT
      ├─ sharegpt4v_gqa_spatial_outdoor.json
      └─ gqa
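
As a quick sanity check on the raw frames, the sketch below counts the `.jpg` files in each `<episode_id>/rgb` folder shown in the tree above. It is an illustration only: the function name `count_frames` is hypothetical, and it assumes every subdirectory of `Raw_data/aerialvln-s` is an episode.

```python
from pathlib import Path


def count_frames(raw_root):
    """Map each episode directory name to its number of rgb/*.jpg frames.

    Hypothetical helper; assumes the <episode_id>/rgb layout shown above.
    """
    counts = {}
    for episode_dir in sorted(Path(raw_root).iterdir()):
        if episode_dir.is_dir():
            frames = list((episode_dir / "rgb").glob("*.jpg"))
            counts[episode_dir.name] = len(frames)
    return counts


if __name__ == "__main__":
    raw_root = Path("Dataset/AerialVLN-Dataset/Raw_data/aerialvln-s")
    if raw_root.is_dir():
        counts = count_frames(raw_root)
        empty = [name for name, n in counts.items() if n == 0]
        print(f"{len(counts)} episodes, {sum(counts.values())} frames in total")
        if empty:
            print(f"Episodes without frames: {empty}")
    else:
        print(f"{raw_root} not found; run this from the repository root.")
```

Episodes reported without frames usually point to an incomplete download or a misplaced `rgb` folder.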

🚀 Training

To launch training, please run the following command:

# Train on the AerialVLN dataset only
bash script/train_aerialvln.sh airvln_merge_triple_updated
# To train on all datasets, edit llava/data/datasets_mixture.py to include the datasets you want, and prepare the data for each of them
bash script/train_aerialvln.sh airvln_merge_triple_updated+airvln_subtraj_sum_updated+open3dvqa_embodiedcity_wuhan+open3dvqa_urban_scene_campus+open3dvqa_urban_scene_residence+gqa_spatial
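
The second command passes a dataset mixture as a single argument, with dataset names joined by `+`. As a sketch of how such a command could be assembled programmatically, the snippet below builds the same argument list from the names used above. The helper `build_train_command` is hypothetical and not part of the repository; it only constructs and prints the command and does not run training.

```python
import shlex

# Dataset names copied from the full-mixture command above.
ALL_DATASETS = [
    "airvln_merge_triple_updated",
    "airvln_subtraj_sum_updated",
    "open3dvqa_embodiedcity_wuhan",
    "open3dvqa_urban_scene_campus",
    "open3dvqa_urban_scene_residence",
    "gqa_spatial",
]


def build_train_command(datasets, script="script/train_aerialvln.sh"):
    """Return the argv list for the training script (hypothetical helper)."""
    if not datasets:
        raise ValueError("at least one dataset name is required")
    mixture = "+".join(datasets)  # names are joined with '+', as in the commands above
    return ["bash", script, mixture]


if __name__ == "__main__":
    # Print the command; pass the list to subprocess.run(...) to execute it.
    print(shlex.join(build_train_command(ALL_DATASETS)))
```

Dropping names from `ALL_DATASETS` yields a smaller mixture; each remaining name must still be registered in llava/data/datasets_mixture.py.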

🧪 Evaluation

To evaluate a trained AeroAct checkpoint on AerialVLN, first start the simulator server, then run the evaluation script. Before running evaluation, please configure the test environment and simulator platform according to AirVLN.

cd AirVLN
nohup python airsim_plugin/AirVLNSimulatorServerTool.py --gpus 0 &
cd ..
# Example: bash script/eval_aeriavln.sh checkpoints/exp1 0.2
bash script/eval_aeriavln.sh <model_path> <temperature>

📊 Performance

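Table 1 | Main comparison results on the AerialVLN benchmark.

| Method | Observation | Seen NE↓ | Seen SR↑ | Seen OSR↑ | Seen SDTW↑ | Unseen NE↓ | Unseen SR↑ | Unseen OSR↑ | Unseen SDTW↑ |
|---|---|---|---|---|---|---|---|---|---|
| Grid-based VS | Pano + Odo | 70.3 | 20.8 | 33.4 | 10.2 | 121.3 | 7.4 | 16.1 | 2.5 |
| CityNavAgent | Depth + Pano + Odo | 80.8 | 13.9 | 30.2 | 5.1 | 60.2 | 11.7 | 35.2 | 5.0 |
| Random Sampling | - | 109.6 | 0.0 | 0.0 | 0.0 | 149.7 | 0.0 | 0.0 | 0.0 |
| Action Sampling | - | 213.8 | 0.9 | 5.7 | 0.3 | 237.6 | 0.2 | 1.1 | 0.1 |
| LingUNet | S.RGB + Depth | 383.8 | 0.6 | 6.9 | 0.2 | 368.4 | 0.4 | 3.6 | 0.9 |
| Seq2Seq | S.RGB + Depth | 146.0 | 4.8 | 19.8 | 1.6 | 218.9 | 2.3 | 11.7 | 0.7 |
| CMA | S.RGB + Depth | 121.0 | 3.0 | 23.2 | 0.6 | 172.1 | 3.2 | 16.0 | 1.1 |
| LAG | S.RGB + Depth | 90.2 | 7.2 | 15.7 | 2.4 | 127.9 | 5.1 | 10.5 | 1.4 |
| STMR | S.RGB + Depth | 96.3 | 12.6 | 31.6 | 2.7 | 119.5 | 10.8 | 23.0 | 1.9 |
| MapGPT | S.RGB | 124.9 | 2.1 | 4.7 | 0.8 | 107.0 | 0.0 | 0.0 | 0.0 |
| Navid | S.RGB | 105.1 | 6.8 | 15.5 | 1.1 | 106.9 | 6.1 | 12.3 | 0.7 |
| OpenFly | S.RGB | 127.2 | 8.1 | 21.8 | 1.6 | 113.8 | 7.6 | 18.2 | 1.5 |
| AeroAct (Ours) | S.RGB | 79.6 | 11.4 | 37.7 | 6.3 | 95.8 | 8.1 | 28.9 | 2.2 |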

🙏 Acknowledgement

This repository is partly based on AirVLN and NVILA.

📚 Citation

If you find our project helpful, please consider citing:

@article{xu2025aerial,
	title={Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning},
	author={Xu, Huilin and Liu, Zhuoyang and Luomei, Yixiang and Xu, Feng},
	journal={arXiv preprint arXiv:2512.08639},
	year={2025}
}

📄 License

This project is released under the Apache 2.0 License.
