
Inquiry about MoE (Mixture of Experts) Training Support #272

@dyyoungg

Description


Hello VILA team!

First, thank you for open-sourcing this incredible family of Vision Language Models! The work on VILA and NVILA is truly impressive, and the focus on efficiency and deployment is particularly valuable for the community.

I have been exploring the codebase and documentation with great interest. My question is about the future development roadmap: are there any plans to support training VILA models with a Mixture of Experts (MoE) architecture (such as Qwen3-MoE or DeepSeek-MoE)?

Integrating MoE could be a powerful way to further scale the model's capacity while maintaining inference efficiency, which aligns well with the project's goals. It would be especially exciting for more complex multi-image and long-video understanding tasks. A rough sketch of the kind of layer I have in mind is below.
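For concreteness, here is a minimal sketch of a top-k routed MoE feed-forward block in the style used by models like Qwen3-MoE and DeepSeek-MoE. This is not VILA code; the class name, expert structure, and hyperparameters are all illustrative assumptions.

```python
# Minimal top-k MoE feed-forward sketch (illustrative only, not VILA's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, dim: int, hidden_dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router scores each token against every expert.
        self.router = nn.Linear(dim, num_experts, bias=False)
        # Each expert is an ordinary feed-forward block; only top_k run per token.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim) -> flatten to per-token routing.
        tokens = x.reshape(-1, x.shape[-1])
        logits = self.router(tokens)                        # (num_tokens, num_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)  # select top_k experts per token
        weights = F.softmax(weights, dim=-1)                 # normalize over the selected experts

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            # Find which tokens routed to expert e and in which top-k slot.
            token_idx, slot_idx = (indices == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue
            out[token_idx] += weights[token_idx, slot_idx].unsqueeze(-1) * expert(tokens[token_idx])
        return out.reshape_as(x)
```

The appeal is that total parameters grow with the number of experts while per-token compute stays roughly at `top_k` expert forwards, which is why this seems like a natural fit for the project's efficiency focus.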

I would be very interested to know if this is a direction you are considering.
