# WAM-Diff: A Masked Diffusion VLA Framework with MoE and Online Reinforcement Learning for Autonomous Driving
2025/12/06: 🎉🎉🎉 Paper submitted to arXiv.
| Status | Milestone | ETA |
|---|---|---|
| 🚀 | Release the inference source code | 2025.12.21 |
| 🚀 | Release the SFT and inference code | 2025.12.21 |
| 🚀 | Release pretrained models on Huggingface | TBD |
| 🚀 | Release NAVSIM evaluation code | TBD |
| 🚀 | Release the RL code | TBD |
## Quick Inference Demo

WAM-Diff will be available on the Hugging Face Hub soon. To quickly test the model, follow these simple steps:
- **Clone the repository**

  ```bash
  git clone https://github.com/fudan-generative-vision/WAM-Diff
  cd WAM-Diff
  ```

- **Initialize the environment**

  If you prefer conda, run the environment setup script to install the necessary dependencies:

  ```bash
  bash init_env.sh
  ```

  Or you can use uv to create the environment:

  ```bash
  uv venv && uv sync
  ```

- **Prepare the models**

  Download the pretrained WAM-Diff model from Hugging Face (pending release) to the `./model/WAM-Diff` directory: https://huggingface.co/fudan-generative-ai/WAM-Diff

  Download the pretrained SigLIP 2 model from Hugging Face to the `./model/siglip2-so400m-patch14-384` directory: https://huggingface.co/google/siglip2-so400m-patch14-384

  A scripted alternative for both downloads is sketched after this list.

- **Run the demo script**

  Execute the demo script to test WAM-Diff on an example image:

  ```bash
  bash inf.sh
  ```
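The model downloads in the "Prepare the models" step can also be scripted. Below is a minimal sketch using `huggingface_hub.snapshot_download`; it assumes the `huggingface_hub` package is installed (`pip install huggingface_hub`) and that the WAM-Diff repository is public once released. The repo IDs and target directories match those above.

```python
# Minimal sketch: fetch both checkpoints into the directories expected by inf.sh.
# Assumes huggingface_hub is installed and the WAM-Diff repo is public once released.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="fudan-generative-ai/WAM-Diff",  # pending release
    local_dir="./model/WAM-Diff",
)
snapshot_download(
    repo_id="google/siglip2-so400m-patch14-384",
    local_dir="./model/siglip2-so400m-patch14-384",
)
```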
## Fine-tuning

To fine-tune WAM-Diff, please follow these steps:

- **Set up the environment**

  Follow the same environment setup steps as in the Quick Inference Demo section.

- **Prepare the data**

  Prepare your training dataset in JSON format as follows (a schema sanity check is sketched after this list):

  ```json
  [
    {
      "image": ["path/to/image1.png"],
      "conversations": [
        {
          "from": "human",
          "value": "Here is front views of a driving vehicle:\n<image>\nThe navigation information is: straight\nThe current position is (0.00,0.00)\nCurrent velocity is: (13.48,-0.29) and current accelerate is: (0.19,0.05)\nPredict the optimal driving action for the next 4 seconds with 8 new waypoints."
        },
        {
          "from": "gpt",
          "value": "6.60,-0.01,13.12,-0.03,19.58,-0.04,25.95,-0.03,32.27,-0.03,38.56,-0.05,44.88,-0.06,51.16,-0.09"
        }
      ]
    },
    ...
  ]
  ```

- **Run the training script**

  Execute the training script with the following command:

  ```bash
  cd train
  bash ./scripts/llada_v_finetune.sh
  ```
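As a sanity check for the data-preparation step above, the sketch below validates the expected schema and parses the ground-truth answer into (x, y) waypoints. It uses only the standard library; `train_data.json` is a placeholder file name, not a file shipped with the repository.

```python
# Minimal sketch: validate the training JSON schema shown above and parse the
# ground-truth answer string into (x, y) waypoints. train_data.json is a placeholder.
import json

with open("train_data.json") as f:
    samples = json.load(f)

for sample in samples:
    assert sample["image"], "each sample needs at least one image path"
    human, gpt = sample["conversations"]
    assert human["from"] == "human" and "<image>" in human["value"]
    assert gpt["from"] == "gpt"

    # The gpt value is a flat comma-separated list: x1,y1,x2,y2,...,x8,y8.
    coords = [float(v) for v in gpt["value"].split(",")]
    waypoints = list(zip(coords[0::2], coords[1::2]))
    assert len(waypoints) == 8, f"expected 8 waypoints, got {len(waypoints)}"
```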
## Citation

If you find our work useful for your research, please consider citing the paper:

```bibtex
@article{xu2025wam,
  title={WAM-Diff: A Masked Diffusion VLA Framework with MoE and Online Reinforcement Learning for Autonomous Driving},
  author={Xu, Mingwang and Cui, Jiahao and Cai, Feipeng and Shang, Hanlin and Zhu, Zhihao and Luan, Shan and Xu, Yifang and Zhang, Neng and Li, Yaoyi and Cai, Jia and others},
  journal={arXiv preprint arXiv:2512.11872},
  year={2025}
}
```
## Acknowledgements

We gratefully acknowledge the contributors to the LLaDA-V repository, whose commitment to open source has provided us with excellent codebases and pretrained models.


