Existing Vision-Language Navigation (VLN) methods primarily focus on single-stage navigation, limiting their effectiveness in multi-stage, long-horizon tasks within complex and dynamic environments. To address these limitations, we propose a novel VLN task, named Long-Horizon Vision-Language Navigation (LH-VLN), which emphasizes long-term planning and decision consistency across consecutive subtasks. To support LH-VLN, we develop an automated data generation platform, NavGen, which constructs datasets with complex task structures and improves data utility through a bidirectional, multi-granularity generation approach. To accurately evaluate complex tasks, we construct the Long-Horizon Planning and Reasoning in VLN (LHPR-VLN) benchmark, consisting of 3,260 tasks with an average of 150 task steps, serving as the first dataset specifically designed for the long-horizon vision-language navigation task. Furthermore, we propose the Independent Success Rate (ISR), Conditional Success Rate (CSR), and CSR weighted by Ground Truth (CGT) metrics to provide fine-grained assessments of task completion. To improve model adaptability in complex tasks, we propose a novel Multi-Granularity Dynamic Memory (MGDM) module that integrates short-term memory blurring with long-term memory retrieval to enable flexible navigation in dynamic environments. Together, our platform, benchmark, and method supply LH-VLN with a robust data generation pipeline, a comprehensive model evaluation dataset, well-grounded metrics, and a novel VLN model, establishing a foundational framework for advancing LH-VLN.
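The exact definitions of ISR, CSR, and CGT are given in the paper. As a rough illustration only, the sketch below assumes an episode is a sequence of per-subtask outcomes, that ISR averages subtask success independently of ordering, and that CSR credits a subtask only when every preceding subtask also succeeded; both assumptions are ours, not the paper's formulas.

```python
# Illustrative sketch only -- consult the paper for the official metric
# definitions. We assume outcomes[i] is True iff subtask i was completed.

def independent_success_rate(outcomes: list) -> float:
    """ISR (sketch): fraction of subtasks completed, each judged independently."""
    return sum(outcomes) / len(outcomes)

def conditional_success_rate(outcomes: list) -> float:
    """CSR (sketch): fraction of subtasks in the longest successful prefix."""
    done = 0
    for ok in outcomes:
        if not ok:
            break
        done += 1
    return done / len(outcomes)

# Hypothetical 4-subtask episode: the third subtask fails.
outcomes = [True, True, False, True]
print(independent_success_rate(outcomes))  # 0.75
print(conditional_success_rate(outcomes))  # 0.5
```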
The 1st Long-Horizon Vision-Language Navigation Challenge, hosted in the “Embodied AI Challenge” track of the IEEE 27th International Workshop on Multimedia Signal Processing (MMSP 2025), is now open! Please visit the challenge page for more information.
This project is developed with Python 3.9. You can use miniconda or anaconda to create the environment:
```
conda create -n lhvln python=3.9
conda activate lhvln
```

We use Habitat-Sim as the simulator, which can be built from source or installed from conda:

```
conda install habitat-sim==0.3.1 headless -c conda-forge -c aihabitat
```

Then you can install the dependencies required for the project:
```
git clone https://github.com/HCPLab-SYSU/LH-VLN.git
cd LH-VLN
pip install -r requirements.txt
```

We use HM3D as the scene dataset. You can download the splits we need with the commands below. Note that you need to submit an application to Matterport before using it. For more information, please refer to this link.
```
python -m habitat_sim.utils.datasets_download --username <api-token-id> --password <api-token-secret> --uids hm3d_train_v0.2
python -m habitat_sim.utils.datasets_download --username <api-token-id> --password <api-token-secret> --uids hm3d_val_v0.2
```

In NavGen, we use the pre-trained model of RAM. You can download the model here.
We use pre-trained CLIP and BERT models for encoding; their weights can be obtained from the following links: EVA02_CLIP_L_336_psz14_s6B.pt, clip-vit-base-patch16, and bert-large-uncased.
Your final directory structure should be like this:
```
LH-VLN
├── data
│   ├── hm3d
│   │   ├── train
│   │   ├── val
│   │   └── hm3d_annotated_basis.scene_dataset_config.json
│   ├── models
│   │   ├── ram_plus_swin_large_14m.pth
│   │   ├── EVA02_CLIP_L_336_psz14_s6B.pt
│   │   ├── clip-vit-base-patch16
│   │   └── bert-large-uncased
│   ├── task
│   │   ├── batch_1
│   │   ├── ...
│   │   └── batch_8
│   ├── step_task
│   │   ├── batch_1
│   │   ├── ...
│   │   └── batch_8
│   └── episode_task
│       ├── batch_1.json.gz
│       ├── ...
│       └── batch_8.json.gz
```
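Before launching data generation or training, it can save time to verify that the data directory matches the layout above. The helper below is hypothetical (not part of the repo); the path list simply mirrors the tree shown here.

```python
# Hypothetical helper (not part of the repo): report any expected data
# paths that are missing under the given project root.
from pathlib import Path

EXPECTED = [
    "data/hm3d/train",
    "data/hm3d/val",
    "data/hm3d/hm3d_annotated_basis.scene_dataset_config.json",
    "data/models/ram_plus_swin_large_14m.pth",
    "data/models/EVA02_CLIP_L_336_psz14_s6B.pt",
    "data/models/clip-vit-base-patch16",
    "data/models/bert-large-uncased",
    "data/task",
    "data/step_task",
    "data/episode_task",
]

def missing_paths(root: str) -> list:
    """Return the expected paths that do not exist under ``root``."""
    return [p for p in EXPECTED if not (Path(root) / p).exists()]

if __name__ == "__main__":
    for p in missing_paths("."):
        print(f"missing: {p}")
```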
Our dataset is now available on Hugging Face and ModelScope. Thanks a lot for your patience!
After completing the preparations, you can now refer to the guide to generate your LH-VLN task!
You can adjust the parameters in configs/lh_vln.yaml based on your own needs.
Run:

```
python train.py
```

Or use distributed training:

```
torchrun --nnodes=1 --nproc_per_node=4 train.py
```

Please set `--nnodes` and `--nproc_per_node` based on your machine configuration.
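On a single node, `--nproc_per_node` typically matches the number of visible GPUs. A small shell sketch (ours, not a repo script) that derives it automatically, falling back to 1 on machines without `nvidia-smi`:

```shell
# Count visible GPUs; nvidia-smi -L prints one line per GPU.
NGPU=$(nvidia-smi -L 2>/dev/null | wc -l)
# Fall back to a single process when no GPU is detected.
[ "$NGPU" -gt 0 ] || NGPU=1
echo "torchrun --nnodes=1 --nproc_per_node=${NGPU} train.py"
```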
We currently provide a simplified version of the model and have extended its adaptability so that it can be trained and run for inference on Llama/Qwen-based models of different scales (from 0.5B to 13B and beyond). You can adjust the parameters in configs/model.yaml based on your own needs (change the config file in utils/parser.py).
Run:

```
python train.py
```

Or use distributed training:

```
torchrun --nnodes=1 --nproc_per_node=4 train.py
```

Please set `--nnodes` and `--nproc_per_node` based on your machine configuration.
In addition, we provide supervised fine-tuning code using VLA data; please refer to sft.py.
We use RAM's source code in nav_gen/recognize_anything and EVA's source code in NavModel/LLMModel/EVA, and we also refer to some code from NaviLLM. Thanks for their contributions!
If you find our paper and code useful in your research, please consider giving us a star ⭐ and citing our work 📝 :)
```
@inproceedings{song2024towards,
  title={Towards long-horizon vision-language navigation: Platform, benchmark and method},
  author={Song, Xinshuai and Chen, Weixing and Liu, Yang and Chen, Weikai and Li, Guanbin and Lin, Liang},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}
```