- Apr 8, 2026: We open-source the AeroAct training code.
We present a unified aerial VLN framework that operates solely on egocentric monocular RGB observations and natural language instructions. The model formulates navigation as a next-token prediction problem, jointly optimizing spatial perception, trajectory reasoning, and action prediction through prompt-guided multi-task learning.
To build the environment for training AeroAct, run the following:
./environment_setup.sh aeroact
conda activate aeroact

This project uses three groups of data: navigation trajectories for action prediction, spatial VQA data for egocentric reasoning, and trajectory-summary annotations for sub-trajectory understanding. Please organize the resources exactly as shown below before launching training.
This is the main navigation dataset used for action prediction. It contains step-wise tuples after action merging and keyframe selection.
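The exact merging rule is not spelled out here. As a rough illustration only, the sketch below assumes that "action merging" means collapsing runs of identical consecutive actions into a single step with a repeat count. The action names and the `merge_repeated_actions` helper are hypothetical and are not taken from the AeroAct code.

```python
from itertools import groupby


def merge_repeated_actions(actions):
    """Collapse runs of identical consecutive actions into (action, count) pairs.

    Assumption: this is one plausible reading of "action merging";
    the released preprocessing may differ.
    """
    return [(action, sum(1 for _ in run)) for action, run in groupby(actions)]


# Hypothetical per-step action labels for one short episode.
steps = ["forward", "forward", "forward", "turn_left", "forward", "forward"]
print(merge_repeated_actions(steps))
```

Under this assumption, a long run of repeated low-level actions becomes a single training step, which shortens the sequences the model has to predict.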
Source the raw data from AirVLN, then place the processed annotations and raw frames under Dataset/AerialVLN-Dataset/.
The code expects the following files:
Dataset/AerialVLN-Dataset/Raw_data/aerialvln-s/
Dataset/AerialVLN-Dataset/data/aerialvln-s/train_merged_triple.json
Dataset/AerialVLN-Dataset/data/aerialvln-s/train_episode2instruction.json
Dataset/AerialVLN-Dataset/data/aerialvln-s/train_action_prob_weight.json
Dataset/AerialVLN-Dataset/data/aerialvln-s/train_episodes2idx_merge.json
These JSON files provide instruction mapping, action reweighting, and merged frame indices used by llava/data/dataset.py.
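Before launching training, it can help to confirm that the expected paths are in place. The following is a minimal sketch, not part of the repository: the `find_missing` helper is hypothetical, and it assumes the script is run from the directory that contains `Dataset/`.

```python
from pathlib import Path

# Paths relative to Dataset/AerialVLN-Dataset, copied from the list above.
EXPECTED_FILES = [
    "data/aerialvln-s/train_merged_triple.json",
    "data/aerialvln-s/train_episode2instruction.json",
    "data/aerialvln-s/train_action_prob_weight.json",
    "data/aerialvln-s/train_episodes2idx_merge.json",
]
EXPECTED_DIRS = ["Raw_data/aerialvln-s"]


def find_missing(root):
    """Return the expected files and directories that are absent under `root`."""
    root = Path(root)
    missing = [p for p in EXPECTED_FILES if not (root / p).is_file()]
    missing += [p for p in EXPECTED_DIRS if not (root / p).is_dir()]
    return missing


if __name__ == "__main__":
    # Assumed location of the dataset root relative to the working directory.
    missing = find_missing("Dataset/AerialVLN-Dataset")
    if missing:
        print("Missing paths:")
        for path in missing:
            print("  " + path)
    else:
        print("All expected AerialVLN paths are present.")
```

The check only tests for existence; it does not validate the contents of the JSON files.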
This data supports drone egocentric spatial reasoning and consists of spatial-relation QA pairs. It combines Open3DVQA-style annotations with a spatial subset of ShareGPT4V-SFT.
Download the Open3DVQA content from EmbodiedCity/Open3DVQA-v2 and arrange it under Dataset/VQA-Dataset/Open3DVQA.
For the GQA-style spatial subset, download the image assets from GQA and prepare the ShareGPT4V-SFT subset under Dataset/VQA-Dataset/ShareGPT4V-SFT.
This is the human-annotated sub-trajectory segmentation and progress-summary data used to train the trajectory-summary branch. In the current code, it is consumed together with the AerialVLN raw frames and auxiliary JSON files listed above.
We currently provide a demo split to quickly test and sanity-check the training code.
The data should be placed in Dataset/AerialVLN-Dataset/data/aerialvln-s/, together with labeled_segments.json and subgoal_pointer_list.json.
Dataset
├─ AerialVLN-Dataset
|  ├─ data
|  |  └─ aerialvln-s
|  |     ├─ train_merged_triple.json
|  |     ├─ subgoal_pointer_list.json
|  |     ├─ train_episode2instruction.json
|  |     ├─ train_action_prob_weight.json
|  |     ├─ train_episodes2idx_merge.json
|  |     └─ labeled_segments.json
|  └─ Raw_data
|     └─ aerialvln-s
|        └─ <episode_id>
|           └─ rgb
|              ├─ frame_000.jpg
|              ├─ frame_001.jpg
|              └─ ...
└─ VQA-Dataset
   ├─ Open3DVQA
   |  └─ O3DVQA
   |     ├─ EmbodiedCity
   |     |  └─ Wuhan
   |     |     ├─ merged_qa.json
   |     |     └─ rgb
   |     └─ UrbanScene
   |        ├─ Campus
   |        |  ├─ merged_qa.json
   |        |  └─ rgb
   |        └─ Residence
   |           ├─ merged_qa.json
   |           └─ rgb
   └─ ShareGPT4V-SFT
      ├─ sharegpt4v_gqa_spatial_outdoor.json
      └─ gqa

To launch training, please run the following command:
# train only on the AerialVLN dataset
bash script/train_aerialvln.sh airvln_merge_triple_updated
# to train on all datasets, add the datasets you want to llava/data/datasets_mixture.py and prepare their data first
bash script/train_aerialvln.sh airvln_merge_triple_updated+airvln_subtraj_sum_updated+open3dvqa_embodiedcity_wuhan+open3dvqa_urban_scene_campus+open3dvqa_urban_scene_residence+gqa_spatial

To evaluate a trained AeroAct checkpoint on AerialVLN, first start the simulator server, then run the evaluation script. Before running evaluation, configure the test environment and simulator platform according to AirVLN.
cd AirVLN
nohup python airsim_plugin/AirVLNSimulatorServerTool.py --gpus 0 &
cd ..
# bash script/eval_aeriavln.sh checkpoints/exp1 0.2
bash script/eval_aeriavln.sh <model_path> <temperature>

Table 1 | Main comparison results on the AerialVLN benchmark.

| Method | Observation | Seen NE↓ | Seen SR↑ | Seen OSR↑ | Seen SDTW↑ | Unseen NE↓ | Unseen SR↑ | Unseen OSR↑ | Unseen SDTW↑ |
|---|---|---|---|---|---|---|---|---|---|
| Grid-based VS | Pano + Odo | 70.3 | 20.8 | 33.4 | 10.2 | 121.3 | 7.4 | 16.1 | 2.5 |
| CityNavAgent | Depth + Pano + Odo | 80.8 | 13.9 | 30.2 | 5.1 | 60.2 | 11.7 | 35.2 | 5.0 |
| Random Sampling | - | 109.6 | 0.0 | 0.0 | 0.0 | 149.7 | 0.0 | 0.0 | 0.0 |
| Action Sampling | - | 213.8 | 0.9 | 5.7 | 0.3 | 237.6 | 0.2 | 1.1 | 0.1 |
| LingUNet | S.RGB + Depth | 383.8 | 0.6 | 6.9 | 0.2 | 368.4 | 0.4 | 3.6 | 0.9 |
| Seq2Seq | S.RGB + Depth | 146.0 | 4.8 | 19.8 | 1.6 | 218.9 | 2.3 | 11.7 | 0.7 |
| CMA | S.RGB + Depth | 121.0 | 3.0 | 23.2 | 0.6 | 172.1 | 3.2 | 16.0 | 1.1 |
| LAG | S.RGB + Depth | 90.2 | 7.2 | 15.7 | 2.4 | 127.9 | 5.1 | 10.5 | 1.4 |
| STMR | S.RGB + Depth | 96.3 | 12.6 | 31.6 | 2.7 | 119.5 | 10.8 | 23.0 | 1.9 |
| MapGPT | S.RGB | 124.9 | 2.1 | 4.7 | 0.8 | 107.0 | 0.0 | 0.0 | 0.0 |
| Navid | S.RGB | 105.1 | 6.8 | 15.5 | 1.1 | 106.9 | 6.1 | 12.3 | 0.7 |
| OpenFly | S.RGB | 127.2 | 8.1 | 21.8 | 1.6 | 113.8 | 7.6 | 18.2 | 1.5 |
| AeroAct (Ours) | S.RGB | 79.6 | 11.4 | 37.7 | 6.3 | 95.8 | 8.1 | 28.9 | 2.2 |
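
To make the comparison among methods that use only S.RGB observations concrete, the short sketch below computes AeroAct's OSR margin over the best other S.RGB-only baseline on each split. The numbers are copied from Table 1; the `osr_margin` helper is illustrative and is not part of the repository.

```python
# Rows copied from Table 1 for methods whose observation is S.RGB only:
# (method, seen OSR, unseen OSR)
RGB_ONLY_BASELINES = [
    ("MapGPT", 4.7, 0.0),
    ("Navid", 15.5, 12.3),
    ("OpenFly", 21.8, 18.2),
]
AEROACT = ("AeroAct", 37.7, 28.9)


def osr_margin(split_index):
    """Return (best baseline name, AeroAct OSR minus that baseline's OSR).

    split_index is 1 for the seen split and 2 for the unseen split.
    """
    best = max(RGB_ONLY_BASELINES, key=lambda row: row[split_index])
    return best[0], round(AEROACT[split_index] - best[split_index], 1)


print("seen:", osr_margin(1))
print("unseen:", osr_margin(2))
```

The same pattern applies to the other columns, with `min` in place of `max` for NE, where lower is better.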
This repository is partly based on AirVLN and NVILA.
If you find our project helpful, please consider citing:
@article{xu2025aerial,
title={Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning},
author={Xu, Huilin and Liu, Zhuoyang and Luomei, Yixiang and Xu, Feng},
journal={arXiv preprint arXiv:2512.08639},
year={2025}
}

This project is released under the Apache 2.0 License.

