
WAM-Diff: A Masked Diffusion VLA Framework with MoE and Online Reinforcement Learning for Autonomous Driving

¹Fudan University   ²Yinwang Intelligent Technology Co., Ltd.


📰 News

  • 2025/12/06: 🎉🎉🎉 Paper submitted to arXiv.

📅️ Roadmap

Status | Milestone                                  | ETA
-------|--------------------------------------------|-----------
🚀     | Release the inference source code          | 2025.12.21
🚀     | Release the SFT and inference code         | 2025.12.21
🚀     | Release pretrained models on Hugging Face  | TBD
🚀     | Release NAVSIM evaluation code             | TBD
🚀     | Release the RL code                        | TBD

🔧️ Framework

[Framework figure]

🏆 Qualitative Results on NAVSIM

NAVSIM-v1 benchmark results

[NAVSIM-v1 results figure]

NAVSIM-v2 benchmark results

[NAVSIM-v2 results figure]

Quick Inference Demo

The WAM-Diff model will be available on the Hugging Face Hub soon. To quickly test it, follow these steps:

  1. Clone the repository

    git clone https://github.com/fudan-generative-vision/WAM-Diff
    cd WAM-Diff
  2. Initialize the environment
    If you prefer conda, run the environment setup script to install necessary dependencies:

    bash init_env.sh

    Or you can use uv to create the environment:

    uv venv && uv sync
  3. Prepare the Model
    Download the pretrained WAM-Diff model from Hugging Face (pending release) to the ./model/WAM-Diff directory; a scripted download sketch follows this list:

    https://huggingface.co/fudan-generative-ai/WAM-Diff
    

    Download the pretrained Siglip2 model from Hugging Face to the ./model/siglip2-so400m-patch14-384 directory:

    https://huggingface.co/google/siglip2-so400m-patch14-384
    
  4. Run the demo script
    Execute the demo script to test WAM-Diff on an example image:

    bash inf.sh
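
As an alternative to downloading the checkpoints by hand, here is a minimal sketch that fetches both models into the ./model directory with the huggingface_hub Python package (an extra dependency assumed to be installed in the environment). The WAM-Diff repo id is taken from the URL in step 3 and is pending release, so that part will only work once the weights are published.

    # download_models.py -- hypothetical helper script, not part of the repository.
    from huggingface_hub import snapshot_download

    # Vision encoder (publicly available).
    snapshot_download(
        repo_id="google/siglip2-so400m-patch14-384",
        local_dir="./model/siglip2-so400m-patch14-384",
    )

    # WAM-Diff checkpoint (repo id from step 3; pending release on Hugging Face).
    snapshot_download(
        repo_id="fudan-generative-ai/WAM-Diff",
        local_dir="./model/WAM-Diff",
    )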

Training

To fine-tune WAM-Diff, please follow these steps:

  1. Set Up the Environment
    Follow the same environment setup steps as in the Quick Inference Demo section.
  2. Prepare the Data
    Prepare your training dataset in JSON format like the example below (a script sketch for generating such entries follows this list):
    [
        {
            "image": ["path/to/image1.png"],
            "conversations": [
                {
                    "from": "human",
                    "value": "Here is front views of a driving vehicle:\n<image>\nThe navigation information is: straight\nThe current position is (0.00,0.00)\nCurrent velocity is: (13.48,-0.29)  and current accelerate is: (0.19,0.05)\nPredict the optimal driving action for the next 4 seconds with 8 new waypoints."
                },
                {
                    "from": "gpt",
                    "value": "6.60,-0.01,13.12,-0.03,19.58,-0.04,25.95,-0.03,32.27,-0.03,38.56,-0.05,44.88,-0.06,51.16,-0.09"
                }
            ]
        },
        ...
    ]
  3. Run the Training Script
    Execute the training script with the following commands:

    cd train
    bash ./scripts/llada_v_finetune.sh
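
As a complement to step 2, the following is a minimal sketch (the script name and helper function are hypothetical, not part of the repository) that assembles one training sample in the JSON format shown above and writes a single-entry dataset to disk. It assumes the future trajectory is given as 8 (x, y) waypoints in the ego frame, mirroring the example values.

    # make_sft_entry.py -- illustrative sketch; the file name and helper are hypothetical.
    import json

    def make_entry(image_path, navigation, position, velocity, accel, waypoints):
        """Build one SFT sample in the JSON format shown in step 2.

        waypoints: list of 8 (x, y) tuples covering the next 4 seconds.
        """
        prompt = (
            "Here is front views of a driving vehicle:\n<image>\n"
            f"The navigation information is: {navigation}\n"
            f"The current position is ({position[0]:.2f},{position[1]:.2f})\n"
            f"Current velocity is: ({velocity[0]:.2f},{velocity[1]:.2f})  "
            f"and current accelerate is: ({accel[0]:.2f},{accel[1]:.2f})\n"
            "Predict the optimal driving action for the next 4 seconds with 8 new waypoints."
        )
        answer = ",".join(f"{x:.2f},{y:.2f}" for x, y in waypoints)
        return {
            "image": [image_path],
            "conversations": [
                {"from": "human", "value": prompt},
                {"from": "gpt", "value": answer},
            ],
        }

    # Example: serialize a one-sample dataset with placeholder values.
    entry = make_entry(
        image_path="path/to/image1.png",
        navigation="straight",
        position=(0.0, 0.0),
        velocity=(13.48, -0.29),
        accel=(0.19, 0.05),
        waypoints=[(6.60, -0.01), (13.12, -0.03), (19.58, -0.04), (25.95, -0.03),
                   (32.27, -0.03), (38.56, -0.05), (44.88, -0.06), (51.16, -0.09)],
    )
    with open("train_data.json", "w") as f:
        json.dump([entry], f, indent=2)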

📝 Citation

If you find our work useful for your research, please consider citing the paper:

@article{xu2025wam,
  title={WAM-Diff: A Masked Diffusion VLA Framework with MoE and Online Reinforcement Learning for Autonomous Driving},
  author={Xu, Mingwang and Cui, Jiahao and Cai, Feipeng and Shang, Hanlin and Zhu, Zhihao and Luan, Shan and Xu, Yifang and Zhang, Neng and Li, Yaoyi and Cai, Jia and others},
  journal={arXiv preprint arXiv:2512.11872},
  year={2025}
}

🤗 Acknowledgements

We gratefully acknowledge the contributors to the LLaDA-V repository, whose commitment to open source has provided us with excellent codebases and pretrained models.
