Skip to content

feat(engine): add Qwen3-VL dense support to Megatron path#1299

Closed
Adiactive wants to merge 2 commits into
areal-project:mainfrom
Adiactive:pr/megatron-qwen3-vl
Closed

feat(engine): add Qwen3-VL dense support to Megatron path#1299
Adiactive wants to merge 2 commits into
areal-project:mainfrom
Adiactive:pr/megatron-qwen3-vl

Conversation

@Adiactive
Copy link
Copy Markdown
Contributor

@Adiactive Adiactive commented May 4, 2026

Description

Adds Qwen3-VL dense support to AReaL's Megatron engine via mbridge, so GRPO/PPO of any Qwen3-VL model on the Megatron backend is unblocked.

Key changes:

  • areal/engine/megatron_utils/megatron.py: new convert_qwen3_vl_to_hf registered before "qwen3" in _CONVERSION_FN_REGISTRY (mapping anchored on mbridge.models.qwen3_vl.Qwen3VLBridge). Carries an early-raise on Qwen3-VL-MoE param names so qwen3_vl_moe cannot silently dispatch to the dense converter via substring match. _vision_qkv_mcore_to_hf gains a no-GQA assertion that guards future vision-GQA VLMs.
  • areal/engine/core/model.py: new lang_config(hf_config) helper — getattr(hf_config, "text_config", hf_config) — so callers can read language-side attrs (vocab_size, num_attention_heads, num_key_value_heads, hidden_size, head_dim) uniformly across Qwen3-VL (nested) and Qwen2.5-VL / pure text (flat).
  • areal/models/mcore/hf_load.py: _merge_qkv_weights, _load_fused_qkv_weight, and the GQA branch of _weight_to_mcore_tp use the shared lang_config helper.
  • areal/engine/megatron_engine.py: _collect_param uses lang_config(self.hf_config).vocab_size for remove_padding.
  • tests/test_megatron_engine_vlm.py: add TestConvertQwen3VLToHF and parametrize the VLM integration tests across qwen25_vl and qwen3_vl via _VLM_MODELS + env_overrides.
  • tests/torchrun/run_megatron_engine_vlm.py: mock_vlm_input reads patch geometry from engine.hf_config so it works for both VLMs without code-side branching.

Tests added:

  • TestConvertQwen3VLToHF (CPU unit tests for convert_qwen3_vl_to_hf):
    • Registry dispatch: qwen3_vl resolves before qwen3 substring fallback.
    • Language model: embedding / final norm / output_layer prefixes; QKV weight + bias split with GQA; q_norm / k_norm / o_proj / input_layernorm / post_attention_layernorm; gated MLP fc1 split + fc2.
    • Vision direct mappings: patch_embed weight + bias, pos_embed, merger (patch_norm, linear_fc1/2).
    • Vision per-block: QKV per-head→grouped reorder for weight + bias; requires hf_config.vision_config.num_heads; attn proj; norm1 / norm2 weight + bias; non-gated MLP regression guard (linear_fc1 must NOT chunk); fc2.
    • Deepstack mergers: parametrized over indices [0, 1, 2] for norm and linear_fc{1,2}.
    • Error handling: unknown language-model and unknown vision parameter names.
  • test_qwen3_vl_detected added to TestVisionModelDetection.
  • The 4 existing test_vlm_* integration tests (test_engine_initializes, test_simple_forward, test_hf_save_load_weights, test_train_tensor_parallel) become parametrized over _VLM_MODELS so they run once per VLM (qwen25_vl, qwen3_vl). Adding a new VLM is a one-line addition to _VLM_MODELS.

Scope:

  • bridge_type: mbridge only. The bridge_type: megatron-bridge path with Qwen3-VL + gradient_checkpointing: true crashes inside Qwen3VLTransformerBlock._checkpointed_forward with TypeError: save_for_backward can only save variables, but argument 6 is of type listdeepstack_visual_embeds is handed verbatim to tensor_parallel.checkpoint. Fixed upstream in megatron-bridge v0.4.0 (variadic-flatten of deepstack_visual_embeds); this PR intentionally does NOT vendor-patch the megatron-bridge path so it lights up automatically when the dependency upgrade lands as chore(deps): upgrade runtime dependencies and CI workflow #1206 plans to do.
  • Context parallelism for VLM (context_parallel_size > 1) continues to raise NotImplementedError — matches Qwen2.5-VL state.
  • Out of scope: Qwen3-VL MoE (qwen3_vl_moe), pixel_values_videos plumbing for streaming video inputs.

Related Issue

N/A — net-new feature support.

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📝 Documentation update
  • ♻️ Refactoring
  • ⚡ Performance improvement
  • ✅ Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • Pre-commit hooks pass (pre-commit run --all-files)
  • Relevant tests pass; new tests added for new functionality
  • Documentation updated (if applicable; built with ./docs/build_all.sh)
  • Branch is up to date with main
  • Self-reviewed via /review-pr command
  • This PR was created by a coding agent via /create-pr
  • This PR is a breaking change

Breaking Change Details (if applicable):

N/A.

Additional Context

Fix for #1298 is required to run integration tests or actual training. This PR was tested with a local patch which is not committed.

Training Reward Example

Image: ghcr.io/inclusionai/areal-runtime:v1.0.3-vllm with mbridge upgraded according to #1258
Dataset: Geometry3k
Model: Qwen3-VL-3B-Instruct / Qwen3-VL-32B-Instruct
Scheduler: Slurm
image

Extend the Megatron engine to train Qwen3-VL dense models end-to-end:
mcore→HF weight conversion for update_weights and HF→mcore loading that
handles Qwen3-VL's nested HF config layout. Without this, GRPO/PPO of
any Qwen3-VL model on the Megatron backend is blocked.

Key changes:
- megatron_utils/megatron.py: convert_qwen3_vl_to_hf (anchored on
  mbridge.models.qwen3_vl.Qwen3VLBridge), registered before "qwen3"
  in _CONVERSION_FN_REGISTRY.
- mcore/hf_load.py: _lang_config() helper for the HF→mcore loader
  routes language-side config reads through
  getattr(hf_config, "text_config", hf_config) so Qwen3-VL's nested
  text_config works alongside Qwen2.5-VL and pure text models.
- megatron_engine.py: _collect_param reads text_config.vocab_size
  with the same getattr fallback.
- test_megatron_engine_vlm.py: add TestConvertQwen3VLToHF and
  parametrize the VLM integration tests across qwen25_vl and qwen3_vl.
- run_megatron_engine_vlm.py: mock_vlm_input reads patch geometry
  from engine.hf_config so it works for both VLMs.
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds support for the Qwen3-VL model, featuring a new parameter conversion utility and updates to the weight loading logic to handle nested text configurations. The testing suite was also updated to include Qwen3-VL and support parameterized integration tests. Feedback was provided to simplify the head dimension calculation in the conversion logic for improved readability.

Comment thread areal/engine/megatron_utils/megatron.py Outdated
Comment on lines +530 to +537
try:
head_dim = (
tf_config.kv_channels
if tf_config.kv_channels is not None
else tf_config.hidden_size // tf_config.num_attention_heads
)
except (AttributeError, TypeError):
head_dim = tf_config.hidden_size // tf_config.num_attention_heads
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The try-except block for calculating head_dim can be simplified for better readability and to more clearly express the intent of falling back if kv_channels is not available.

            kv_channels = getattr(tf_config, "kv_channels", None)
            if kv_channels is not None:
                head_dim = kv_channels
            else:
                head_dim = tf_config.hidden_size // tf_config.num_attention_heads

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in the follow-up commit

- Consolidate `lang_config` helper into areal/engine/core/model.py so
  hf_load.py and megatron_engine.py:_collect_param share a single
  `getattr(hf_config, "text_config", hf_config)` accessor instead of
  inlining it twice.
- convert_qwen3_vl_to_hf: early-raise on Qwen3-VL-MoE expert / router
  param names. _CONVERSION_FN_REGISTRY uses substring matching, so
  `qwen3_vl_moe` would silently fall through to the dense converter
  unless registered before `qwen3_vl`. Make the registry-order
  requirement explicit and actionable.
- _vision_qkv_mcore_to_hf: assert vision has no GQA
  (num_kv_heads == num_heads). Both Qwen2.5-VL and Qwen3-VL satisfy
  this; guard catches future vision-GQA VLMs that would otherwise
  silently miscompile QKV.
- convert_qwen3_vl_to_hf: replace try/except for tf_config.kv_channels
  with getattr (per gemini-code-assist bot suggestion on PR areal-project#1299).
- TestConvertQwen3VLToHF: pin fixture dims to real
  Qwen/Qwen3-VL-2B-Instruct values (vision hidden=1024 instead of
  synthetic 1152) and update dependent shape literals.
@Adiactive
Copy link
Copy Markdown
Contributor Author

Will submit another PR which includes Qwen3-VL-MOE support along with current changes therefore closing this PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant