
Towards Long-Horizon Vision-Language Navigation:
Platform, Benchmark and Method (CVPR-25)


¹Sun Yat-sen University  ²Peng Cheng Laboratory  ³Guangdong Key Laboratory of Big Data Analysis and Processing

Existing Vision-Language Navigation (VLN) methods primarily focus on single-stage navigation, limiting their effectiveness in multi-stage and long-horizon tasks within complex and dynamic environments. To address these limitations, we propose a novel VLN task, named Long-Horizon Vision-Language Navigation (LH-VLN), which emphasizes long-term planning and decision consistency across consecutive subtasks. To support LH-VLN, we develop an automated data generation platform, NavGen, which constructs datasets with complex task structures and improves data utility through a bidirectional, multi-granularity generation approach. To accurately evaluate complex tasks, we construct the Long-Horizon Planning and Reasoning in VLN (LHPR-VLN) benchmark, consisting of 3,260 tasks with an average of 150 task steps, serving as the first dataset specifically designed for the long-horizon vision-language navigation task. Furthermore, we propose the Independent Success Rate (ISR), Conditional Success Rate (CSR), and CSR weighted by Ground Truth (CGT) metrics to provide fine-grained assessments of task completion. To improve model adaptability in complex tasks, we propose a novel Multi-Granularity Dynamic Memory (MGDM) module that integrates short-term memory blurring with long-term memory retrieval to enable flexible navigation in dynamic environments. Our platform, benchmark, and method supply LH-VLN with a robust data generation pipeline, a comprehensive model evaluation dataset, reasonable metrics, and a novel VLN model, establishing a foundational framework for advancing LH-VLN.

MMSP2025-Challenge

The 1st Long-Horizon Vision-Language Navigation Challenge, part of the "Embodied AI Challenge" track of the IEEE 27th International Workshop on Multimedia Signal Processing (MMSP 2025), is now open! Please go to the challenge page for more information.

Preparation

Environment

This project is developed with Python 3.9. You can use Miniconda or Anaconda to create the environment:

conda create -n lhvln python=3.9
conda activate lhvln

We use Habitat-Sim as the simulator; it can be built from source or installed via conda:

conda install habitat-sim==0.3.1 headless -c conda-forge -c aihabitat

Then install the remaining project dependencies:

git clone https://github.com/HCPLab-SYSU/LH-VLN.git
cd LH-VLN
pip install -r requirements.txt

Data

We use HM3D as the scene dataset. You can download the required splits with the commands below. Note that you must apply for access from Matterport before use; for more information, please refer to this link.

python -m habitat_sim.utils.datasets_download --username <api-token-id> --password <api-token-secret> --uids hm3d_train_v0.2
python -m habitat_sim.utils.datasets_download --username <api-token-id> --password <api-token-secret> --uids hm3d_val_v0.2

In NavGen, we use the pre-trained RAM (Recognize Anything Model). You can download the checkpoint here.

We use pre-trained CLIP and BERT encoders in the model; their weights can be obtained from the following links: EVA02_CLIP_L_336_psz14_s6B.pt, clip-vit-base-patch16, and bert-large-uncased.

Your final directory structure should look like this:

LH-VLN
├── data
│   ├── hm3d
│   │   ├── train
│   │   ├── val
│   │   └── hm3d_annotated_basis.scene_dataset_config.json
│   ├── models
│   │   ├── ram_plus_swin_large_14m.pth
│   │   ├── EVA02_CLIP_L_336_psz14_s6B.pt
│   │   ├── clip-vit-base-patch16
│   │   └── bert-large-uncased
│   ├── task
│   │   ├── batch_1
│   │   ├── ...
│   │   └── batch_8
│   ├── step_task
│   │   ├── batch_1
│   │   ├── ...
│   │   └── batch_8
│   └── episode_task
│       ├── batch_1.json.gz
│       ├── ...
│       └── batch_8.json.gz
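
Before running anything, it may help to verify this layout programmatically. A minimal sketch (the paths mirror the tree above; the helper function is ours, not part of the repo):

```python
from pathlib import Path

# Required assets from the layout above; the task/step_task/episode_task
# batches are produced by NavGen or downloaded separately.
REQUIRED = [
    "hm3d/hm3d_annotated_basis.scene_dataset_config.json",
    "models/ram_plus_swin_large_14m.pth",
    "models/EVA02_CLIP_L_336_psz14_s6B.pt",
    "models/clip-vit-base-patch16",
    "models/bert-large-uncased",
]

def missing_assets(data_root="data"):
    """Return the required paths that are absent under data_root."""
    root = Path(data_root)
    return [p for p in REQUIRED if not (root / p).exists()]

if __name__ == "__main__":
    for p in missing_assets():
        print("missing:", p)
```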

LHPR-VLN Dataset

Our dataset is now available on Hugging Face and ModelScope. Thanks a lot for your patience!

NavGen Pipeline

After completing the preparations, you can refer to the guide to generate your own LH-VLN tasks!

Benchmark

You can adjust the parameters in configs/lh_vln.yaml based on your own needs.
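
The file is plain YAML, so overrides can also be scripted before launching. A minimal sketch using PyYAML (the `batch_size`/`max_steps` keys are hypothetical placeholders, not the repo's actual schema):

```python
import yaml  # PyYAML

def load_config(path="configs/lh_vln.yaml", **overrides):
    """Load a YAML config and apply keyword overrides on top."""
    with open(path) as f:
        cfg = yaml.safe_load(f) or {}
    cfg.update(overrides)
    return cfg

# In-memory stand-in for the real file; keys are illustrative only.
cfg = yaml.safe_load("batch_size: 4\nmax_steps: 150\n")
cfg.update(batch_size=8)  # override before launching train.py
```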

Run:

python train.py

Or use distributed:

torchrun --nnodes=1 --nproc_per_node=4 train.py  

Set --nnodes and --nproc_per_node according to your machine configuration.
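
torchrun exports the process-group layout to each worker via environment variables; a minimal sketch of reading them, independent of this repo's actual train.py (the function name is illustrative):

```python
import os

def dist_info():
    """Read the layout torchrun exports (RANK, LOCAL_RANK, WORLD_SIZE).

    Falls back to single-process defaults when launched with plain `python`.
    """
    return {
        "rank": int(os.environ.get("RANK", 0)),
        "local_rank": int(os.environ.get("LOCAL_RANK", 0)),
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),
    }
```

With `--nnodes=1 --nproc_per_node=4`, each of the four workers sees a distinct `RANK` in 0..3 and `WORLD_SIZE=4`.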

Baseline

We currently provide a simplified version of the model and extend its adaptability so that it can be trained and run inference on Llama/Qwen-based models of different scales (from 0.5B to 13B and beyond). You can adjust the parameters in configs/model.yaml based on your own needs (change the config file path in utils/parser.py).

Run:

python train.py

Or use distributed:

torchrun --nnodes=1 --nproc_per_node=4 train.py  

Set --nnodes and --nproc_per_node according to your machine configuration.

In addition, we provide supervised fine-tuning code using VLA data; please refer to sft.py.

Acknowledgement

We use RAM's source code in nav_gen/recognize_anything and EVA's source code in NavModel/LLMModel/EVA. We also refer to some code from NaviLLM. Thanks for their contributions!

Citation

If you find our paper and code useful in your research, please consider giving us a star ⭐ and citing our work 📝 :)

@inproceedings{song2024towards,
  title={Towards long-horizon vision-language navigation: Platform, benchmark and method},
  author={Song, Xinshuai and Chen, Weixing and Liu, Yang and Chen, Weikai and Li, Guanbin and Lin, Liang},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}
