Existing Vision-Language Navigation (VLN) methods primarily focus on single-stage navigation, limiting their effectiveness in multi-stage, long-horizon tasks within complex and dynamic environments. To address these limitations, we propose a novel VLN task, named Long-Horizon Vision-Language Navigation (LH-VLN), which emphasizes long-term planning and decision consistency across consecutive subtasks. To support LH-VLN, we develop an automated data generation platform, NavGen, which constructs datasets with complex task structures and improves data utility through a bidirectional, multi-granularity generation approach. To accurately evaluate complex tasks, we construct the Long-Horizon Planning and Reasoning in VLN (LHPR-VLN) benchmark, consisting of 3,260 tasks with an average of 150 task steps, serving as the first dataset specifically designed for the long-horizon vision-language navigation task. Furthermore, we propose the Independent Success Rate (ISR), Conditional Success Rate (CSR), and CSR weighted by Ground Truth (CGT) metrics to provide fine-grained assessments of task completion. To improve model adaptability in complex tasks, we propose a novel Multi-Granularity Dynamic Memory (MGDM) module that integrates short-term memory blurring with long-term memory retrieval to enable flexible navigation in dynamic environments. Together, our platform, benchmark, and method supply LH-VLN with a robust data generation pipeline, a comprehensive model evaluation dataset, well-grounded metrics, and a novel VLN model, establishing a foundational framework for advancing LH-VLN.
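The exact definitions of ISR, CSR, and CGT are given in the paper. As a rough illustration only, the sketch below assumes an episode is a sequence of per-subtask outcomes, that ISR averages subtask success independently of ordering, and that CSR credits a subtask only when every preceding subtask also succeeded; both assumptions are ours, not the paper's formulas.

```python
# Illustrative sketch only -- consult the paper for the official metric
# definitions. We assume outcomes[i] is True iff subtask i was completed.

def independent_success_rate(outcomes: list) -> float:
    """ISR (sketch): fraction of subtasks completed, each judged independently."""
    return sum(outcomes) / len(outcomes)

def conditional_success_rate(outcomes: list) -> float:
    """CSR (sketch): fraction of subtasks in the longest successful prefix."""
    done = 0
    for ok in outcomes:
        if not ok:
            break
        done += 1
    return done / len(outcomes)

# Hypothetical 4-subtask episode: the third subtask fails.
outcomes = [True, True, False, True]
print(independent_success_rate(outcomes))  # 0.75
print(conditional_success_rate(outcomes))  # 0.5
```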
The 1st Long-Horizon Vision-Language Navigation Challenge, hosted in the “Embodied AI Challenge” track of the IEEE 27th International Workshop on Multimedia Signal Processing (MMSP 2025), is now open! Please visit the challenge page for more information.
This project is developed with Python 3.9. You can use miniconda or anaconda to create the environment:
```
conda create -n lhvln python=3.9
conda activate lhvln
```

We use Habitat-Sim as the simulator, which can be built from source or installed from conda:

```
conda install habitat-sim==0.3.1 headless -c conda-forge -c aihabitat
```

Then you can install the dependencies required for the project:
```
git clone https://github.com/HCPLab-SYSU/LH-VLN.git
cd LH-VLN
pip install -r requirements.txt
```

We use HM3D as the scene dataset. You can download the splits we need with the commands below. Note that you need to submit an application to Matterport before using it. For more information, please refer to this link.
```
python -m habitat_sim.utils.datasets_download --username <api-token-id> --password <api-token-secret> --uids hm3d_train_v0.2
python -m habitat_sim.utils.datasets_download --username <api-token-id> --password <api-token-secret> --uids hm3d_val_v0.2
```

In NavGen, we use the pre-trained model of RAM. You can download the model here.
We use pre-trained CLIP and BERT models for encoding; their weights can be obtained from the following links: EVA02_CLIP_L_336_psz14_s6B.pt, clip-vit-base-patch16, and bert-large-uncased.
Your final directory structure should be like this:
```
LH-VLN
├── data
│   ├── hm3d
│   │   ├── train
│   │   ├── val
│   │   └── hm3d_annotated_basis.scene_dataset_config.json
│   ├── models
│   │   ├── ram_plus_swin_large_14m.pth
│   │   ├── EVA02_CLIP_L_336_psz14_s6B.pt
│   │   ├── clip-vit-base-patch16
│   │   └── bert-large-uncased
│   ├── task
│   │   ├── batch_1
│   │   ├── ...
│   │   └── batch_8
│   ├── step_task
│   │   ├── batch_1
│   │   ├── ...
│   │   └── batch_8
│   └── episode_task
│       ├── batch_1.json.gz
│       ├── ...
│       └── batch_8.json.gz
```
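Before launching data generation or training, it can save time to verify that the data directory matches the layout above. The helper below is hypothetical (not part of the repo); the path list simply mirrors the tree shown here.

```python
# Hypothetical helper (not part of the repo): report any expected data
# paths that are missing under the given project root.
from pathlib import Path

EXPECTED = [
    "data/hm3d/train",
    "data/hm3d/val",
    "data/hm3d/hm3d_annotated_basis.scene_dataset_config.json",
    "data/models/ram_plus_swin_large_14m.pth",
    "data/models/EVA02_CLIP_L_336_psz14_s6B.pt",
    "data/models/clip-vit-base-patch16",
    "data/models/bert-large-uncased",
    "data/task",
    "data/step_task",
    "data/episode_task",
]

def missing_paths(root: str) -> list:
    """Return the expected paths that do not exist under ``root``."""
    return [p for p in EXPECTED if not (Path(root) / p).exists()]

if __name__ == "__main__":
    for p in missing_paths("."):
        print(f"missing: {p}")
```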
Our dataset is now available on Hugging Face and ModelScope. Thanks a lot for your patience!
After completing the preparations, you can now refer to the guide to generate your LH-VLN task!
You can adjust the parameters in configs/lh_vln.yaml based on your own needs.
Run:

```
python train.py
```

Or use distributed training:

```
torchrun --nnodes=1 --nproc_per_node=4 train.py
```

Please set `--nnodes` and `--nproc_per_node` based on your machine configuration.
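On a single node, `--nproc_per_node` typically matches the number of visible GPUs. A small shell sketch (ours, not a repo script) that derives it automatically, falling back to 1 on machines without `nvidia-smi`:

```shell
# Count visible GPUs; nvidia-smi -L prints one line per GPU.
NGPU=$(nvidia-smi -L 2>/dev/null | wc -l)
# Fall back to a single process when no GPU is detected.
[ "$NGPU" -gt 0 ] || NGPU=1
echo "torchrun --nnodes=1 --nproc_per_node=${NGPU} train.py"
```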
We currently provide a simplified version of the model and have extended its adaptability so that it can be trained and run for inference on Llama/Qwen-based models of different scales (from 0.5B to 13B and beyond). You can adjust the parameters in configs/model.yaml based on your own needs (change the config file in utils/parser.py).
Run:

```
python train.py
```

Or use distributed training:

```
torchrun --nnodes=1 --nproc_per_node=4 train.py
```

Please set `--nnodes` and `--nproc_per_node` based on your machine configuration.
In addition, we provide supervised fine-tuning code using VLA data; please refer to sft.py.
We use RAM's source code in nav_gen/recognize_anything and EVA's source code in NavModel/LLMModel/EVA, and we also refer to some code from NaviLLM. Thanks for their contributions!
If you find our paper and code useful in your research, please consider giving us a star ⭐ and citing our work 📝 :)
```
@inproceedings{song2024towards,
  title={Towards long-horizon vision-language navigation: Platform, benchmark and method},
  author={Song, Xinshuai and Chen, Weixing and Liu, Yang and Chen, Weikai and Li, Guanbin and Lin, Liang},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}
```