feat(engine): add Qwen3-VL dense support to Megatron path#1299
Adiactive wants to merge 2 commits
Extend the Megatron engine to train Qwen3-VL dense models end-to-end: mcore→HF weight conversion for update_weights and HF→mcore loading that handles Qwen3-VL's nested HF config layout. Without this, GRPO/PPO of any Qwen3-VL model on the Megatron backend is blocked.

Key changes:
- megatron_utils/megatron.py: convert_qwen3_vl_to_hf (anchored on mbridge.models.qwen3_vl.Qwen3VLBridge), registered before "qwen3" in _CONVERSION_FN_REGISTRY.
- mcore/hf_load.py: _lang_config() helper for the HF→mcore loader routes language-side config reads through getattr(hf_config, "text_config", hf_config) so Qwen3-VL's nested text_config works alongside Qwen2.5-VL and pure text models.
- megatron_engine.py: _collect_param reads text_config.vocab_size with the same getattr fallback.
- test_megatron_engine_vlm.py: add TestConvertQwen3VLToHF and parametrize the VLM integration tests across qwen25_vl and qwen3_vl.
- run_megatron_engine_vlm.py: mock_vlm_input reads patch geometry from engine.hf_config so it works for both VLMs.
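The "registered before" requirement is easier to see with a toy registry. The sketch below is not the AReaL code: it assumes the registry is scanned in insertion order and matched by substring against the model type, and the converter names, the `resolve` helper, and the MoE name markers are all hypothetical.

```python
# Hypothetical MoE parameter-name markers; the real names may differ.
MOE_MARKERS = (".experts.", ".router.")


def dense_text_converter(param_name):
    """Stand-in for a text-only Qwen3 converter."""
    return ("qwen3", param_name)


def dense_vl_converter(param_name):
    """Stand-in for a dense Qwen3-VL converter with an early MoE guard."""
    if any(marker in param_name for marker in MOE_MARKERS):
        raise NotImplementedError(
            f"{param_name!r} looks like a MoE parameter; "
            "register a MoE converter before 'qwen3_vl'"
        )
    return ("qwen3_vl", param_name)


def resolve(model_type, registry):
    """Return the first converter whose key is a substring of model_type."""
    for key, converter in registry.items():
        if key in model_type:
            return converter
    raise KeyError(model_type)


# The more specific key must come first, because "qwen3" is a substring
# of "qwen3_vl" and dicts preserve insertion order.
GOOD_ORDER = {"qwen3_vl": dense_vl_converter, "qwen3": dense_text_converter}
BAD_ORDER = {"qwen3": dense_text_converter, "qwen3_vl": dense_vl_converter}

print(resolve("qwen3_vl", GOOD_ORDER).__name__)      # dense_vl_converter
print(resolve("qwen3_vl", BAD_ORDER).__name__)       # dense_text_converter
print(resolve("qwen3_vl_moe", GOOD_ORDER).__name__)  # dense_vl_converter
```

The last call shows why a guard inside the dense converter is useful: a `qwen3_vl_moe` model type still matches the `qwen3_vl` key, so only a check on the parameter names can turn the silent fallthrough into an error.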
Code Review
This pull request adds support for the Qwen3-VL model, featuring a new parameter conversion utility and updates to the weight loading logic to handle nested text configurations. The testing suite was also updated to include Qwen3-VL and support parameterized integration tests. Feedback was provided to simplify the head dimension calculation in the conversion logic for improved readability.
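As a rough illustration of the nested text configuration handling mentioned above, the following sketch uses the `getattr(hf_config, "text_config", hf_config)` accessor quoted in the PR. The config objects are `SimpleNamespace` stand-ins with made-up values, not real Hugging Face configs.

```python
from types import SimpleNamespace


def lang_config(hf_config):
    """Return the object that holds language-model attributes.

    Nested layouts keep them under `text_config`; flat layouts fall back
    to the top-level config itself.
    """
    return getattr(hf_config, "text_config", hf_config)


# Illustrative stand-ins only; the numbers are invented for the example.
nested = SimpleNamespace(
    text_config=SimpleNamespace(vocab_size=1000, hidden_size=64),
    vision_config=SimpleNamespace(num_heads=4),
)
flat = SimpleNamespace(vocab_size=2000, hidden_size=128)

for cfg in (nested, flat):
    print(lang_config(cfg).vocab_size)
```

Callers read `lang_config(cfg).vocab_size` in both cases and never need to branch on the model family.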
```python
try:
    head_dim = (
        tf_config.kv_channels
        if tf_config.kv_channels is not None
        else tf_config.hidden_size // tf_config.num_attention_heads
    )
except (AttributeError, TypeError):
    head_dim = tf_config.hidden_size // tf_config.num_attention_heads
```
The try-except block for calculating head_dim can be simplified for better readability and to more clearly express the intent of falling back if kv_channels is not available.
```python
kv_channels = getattr(tf_config, "kv_channels", None)
if kv_channels is not None:
    head_dim = kv_channels
else:
    head_dim = tf_config.hidden_size // tf_config.num_attention_heads
```
Addressed in the follow-up commit
- Consolidate `lang_config` helper into areal/engine/core/model.py so hf_load.py and megatron_engine.py:_collect_param share a single `getattr(hf_config, "text_config", hf_config)` accessor instead of inlining it twice.
- convert_qwen3_vl_to_hf: early-raise on Qwen3-VL-MoE expert / router param names. _CONVERSION_FN_REGISTRY uses substring matching, so `qwen3_vl_moe` would silently fall through to the dense converter unless registered before `qwen3_vl`. Make the registry-order requirement explicit and actionable.
- _vision_qkv_mcore_to_hf: assert vision has no GQA (num_kv_heads == num_heads). Both Qwen2.5-VL and Qwen3-VL satisfy this; the guard catches future vision-GQA VLMs that would otherwise silently miscompile QKV.
- convert_qwen3_vl_to_hf: replace try/except for tf_config.kv_channels with getattr (per gemini-code-assist bot suggestion on PR areal-project#1299).
- TestConvertQwen3VLToHF: pin fixture dims to real Qwen/Qwen3-VL-2B-Instruct values (vision hidden=1024 instead of synthetic 1152) and update dependent shape literals.
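To make the no-GQA guard concrete, here is a minimal sketch of the kind of row reordering a fused vision QKV conversion involves. It assumes the mcore weight stores rows per head as `[q, k, v]` and the HF side expects all `q` rows, then all `k`, then all `v`; that layout, the function name, and the use of `ValueError` rather than `assert` are assumptions for illustration, not a copy of `_vision_qkv_mcore_to_hf`.

```python
def vision_qkv_rows_mcore_to_hf(rows, num_heads, num_kv_heads, head_dim):
    """Reorder fused QKV rows from per-head [q, k, v] to [all q, all k, all v].

    `rows` is a plain list standing in for the weight's leading dimension.
    """
    # The per-head slicing below is only valid when every head has its own
    # k and v, i.e. when there is no grouped-query attention.
    if num_kv_heads != num_heads:
        raise ValueError(
            f"vision GQA is not supported: num_kv_heads={num_kv_heads}, "
            f"num_heads={num_heads}"
        )
    expected = 3 * num_heads * head_dim
    if len(rows) != expected:
        raise ValueError(f"expected {expected} rows, got {len(rows)}")

    out = []
    for part in range(3):  # 0 = q, 1 = k, 2 = v
        for head in range(num_heads):
            start = (head * 3 + part) * head_dim
            out.extend(rows[start:start + head_dim])
    return out


# Two heads with head_dim=1, labelled so the permutation is visible.
mcore_rows = ["q0", "k0", "v0", "q1", "k1", "v1"]
print(vision_qkv_rows_mcore_to_hf(mcore_rows, num_heads=2, num_kv_heads=2, head_dim=1))
# ['q0', 'q1', 'k0', 'k1', 'v0', 'v1']
```

Without the guard, a model with fewer KV heads than query heads would pass through the same index arithmetic and produce a weight of the right size with the wrong rows in it, which is the silent failure the assertion is meant to rule out.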
I will submit another PR that includes Qwen3-VL-MoE support along with the current changes, so I am closing this PR.
Description
Adds Qwen3-VL dense support to AReaL's Megatron engine via mbridge, so GRPO/PPO of any Qwen3-VL model on the Megatron backend is unblocked.
Key changes:
- `areal/engine/megatron_utils/megatron.py`: new `convert_qwen3_vl_to_hf` registered before `"qwen3"` in `_CONVERSION_FN_REGISTRY` (mapping anchored on `mbridge.models.qwen3_vl.Qwen3VLBridge`). Carries an early-raise on Qwen3-VL-MoE param names so `qwen3_vl_moe` cannot silently dispatch to the dense converter via substring match. `_vision_qkv_mcore_to_hf` gains a no-GQA assertion that guards against future vision-GQA VLMs.
- `areal/engine/core/model.py`: new `lang_config(hf_config)` helper, defined as `getattr(hf_config, "text_config", hf_config)`, so callers can read language-side attrs (`vocab_size`, `num_attention_heads`, `num_key_value_heads`, `hidden_size`, `head_dim`) uniformly across Qwen3-VL (nested) and Qwen2.5-VL / pure text (flat).
- `areal/models/mcore/hf_load.py`: `_merge_qkv_weights`, `_load_fused_qkv_weight`, and the GQA branch of `_weight_to_mcore_tp` use the shared `lang_config` helper.
- `areal/engine/megatron_engine.py`: `_collect_param` uses `lang_config(self.hf_config).vocab_size` for `remove_padding`.
- `tests/test_megatron_engine_vlm.py`: add `TestConvertQwen3VLToHF` and parametrize the VLM integration tests across `qwen25_vl` and `qwen3_vl` via `_VLM_MODELS` + `env_overrides`.
- `tests/torchrun/run_megatron_engine_vlm.py`: `mock_vlm_input` reads patch geometry from `engine.hf_config` so it works for both VLMs without code-side branching.

Tests added:
- `TestConvertQwen3VLToHF` (CPU unit tests for `convert_qwen3_vl_to_hf`):
  - `qwen3_vl` resolves before `qwen3` substring fallback.
  - [...] `patch_norm`, `linear_fc1/2`).
  - [...] `hf_config.vision_config.num_heads`; attn proj; norm1 / norm2 weight + bias; non-gated MLP regression guard (`linear_fc1` must NOT chunk); fc2.
  - [...] `[0, 1, 2]` for `norm` and `linear_fc{1,2}`.
- `test_qwen3_vl_detected` added to `TestVisionModelDetection`.
- `test_vlm_*` integration tests (`test_engine_initializes`, `test_simple_forward`, `test_hf_save_load_weights`, `test_train_tensor_parallel`) become parametrized over `_VLM_MODELS` so they run once per VLM (`qwen25_vl`, `qwen3_vl`). Adding a new VLM is a one-line addition to `_VLM_MODELS`.

Scope:
- `bridge_type: mbridge` only. The `bridge_type: megatron-bridge` path with Qwen3-VL + `gradient_checkpointing: true` crashes inside `Qwen3VLTransformerBlock._checkpointed_forward` with `TypeError: save_for_backward can only save variables, but argument 6 is of type list`, because `deepstack_visual_embeds` is handed verbatim to `tensor_parallel.checkpoint`. Fixed upstream in megatron-bridge v0.4.0 (variadic-flatten of `deepstack_visual_embeds`); this PR intentionally does NOT vendor-patch the megatron-bridge path, so it lights up automatically when the dependency upgrade lands, as "chore(deps): upgrade runtime dependencies and CI workflow" (#1206) plans to do.
- [...] `context_parallel_size > 1`) continues to raise `NotImplementedError`, which matches the Qwen2.5-VL state.
- [...] `qwen3_vl_moe`), `pixel_values_videos` plumbing for streaming video inputs.

Related Issue
N/A — net-new feature support.
Type of Change
Checklist
- [...] `pre-commit run --all-files`)
- [...] `./docs/build_all.sh`)
- [...] `main`
- [...] `/review-pr` command
- [...] `/create-pr`

Breaking Change Details (if applicable):
N/A.
Additional Context
The fix for #1298 is required to run the integration tests or actual training. This PR was tested with a local patch that is not committed.
Training Reward Example
Image: ghcr.io/inclusionai/areal-runtime:v1.0.3-vllm with mbridge upgraded according to #1258

Dataset: Geometry3k
Model: Qwen3-VL-3B-Instruct / Qwen3-VL-32B-Instruct
Scheduler: Slurm