feat(mcore): add GLM-5/DeepSeek-V3 model support (mbridge + megatron-bridge) by dingzhiqiang · Pull Request #1373 · areal-project/AReaL

dingzhiqiang · 2026-05-28T03:52:24Z

Summary

Adds GLM-5.1 / DeepSeek-V3 / GLM-4.7-Flash model support — three architectures open-source AReaL did not previously support. Coverage spans both the mbridge path (default) and the megatron-bridge path opted into via mcore.bridge_type=megatron-bridge.

This PR stacks on top of #1372 (Bailing-MoE megatron-bridge adapter), which introduced the shared cross-cutting infrastructure (optional mbridge import wrapping, Bridge type-annotation cleanups, migration doc skeleton). The diff shown here only contains GLM-5/DSV3 additions; #1372 must merge first.

Previously, both this work and the Bailing adapter were combined in #1362 (now closed). They were split into PR-A (#1372) and this PR-B for easier review.

What's added

New model code

areal/models/mcore/deepseek_v3.py — HF config → MLATransformerConfig conversion, homogeneous MLA layer specs, _has_dsa() helper. Handles DeepseekV3ForCausalLM, GlmMoeDsaForCausalLM (GLM-5.1), and Glm4MoeForCausalLM (GLM-4.7-Flash) — all three share the underlying MLA + MoE topology; GLM-5.1 additionally exposes a DSA indexer.
areal/models/mcore/deepseek_v3_bridge.py — mbridge LLMBridge subclass registered as deepseek_v3 / glm_moe_dsa / glm4_moe_lite.
areal/models/mcore/dsa_mla_attention.py — custom DSAMLASelfAttention module that inherits Attention directly (not DSAttention) so packed THD inputs work without modification. Implements the DSA indexer (wq_b, wk, k_norm, weights_proj) called by the layer spec.
areal/models/mcore/glm5_megatron_bridge.py — megatron-bridge MegatronModelBridge subclass for GlmMoeDsaForCausalLM, with DSA indexer weight mappings and MTP layer support.
areal/experimental/ops/dsa/ — six tilelang kernel files implementing the DSA indexer forward/backward and the sparse MLA forward/backward used by DSAMLASelfAttention. Specialized for DSV3/GLM-5.1 latent geometry (kv_lora_rank=512, qk_rope_head_dim=64).
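As a rough illustration of how the three architectures can share one conversion path while only GLM-5.1 takes the DSA branch, the dispatch can be modeled in plain Python. The architecture strings come from the list above. The `index_topk` attribute, the `has_dsa` body, and the `pick_attention` helper are assumptions made for illustration; they are not the PR's actual `_has_dsa()` implementation.

```python
from types import SimpleNamespace

# Architecture names taken from the PR description.
DEEPSEEK_V3_ARCHITECTURES = {
    "DeepseekV3ForCausalLM",
    "GlmMoeDsaForCausalLM",  # GLM-5.1
    "Glm4MoeForCausalLM",    # GLM-4.7-Flash
}


def has_dsa(hf_config) -> bool:
    """Hypothetical DSA check.

    Assumes a DSA-enabled config carries an `index_topk` attribute; getattr
    with a default avoids an AttributeError on configs that lack it.
    """
    return getattr(hf_config, "index_topk", None) is not None


def pick_attention(hf_config) -> str:
    """Return which attention variant a config would use in this sketch."""
    archs = getattr(hf_config, "architectures", None) or []
    if not any(arch in DEEPSEEK_V3_ARCHITECTURES for arch in archs):
        raise ValueError(f"unsupported architectures: {archs}")
    return "dsa_mla" if has_dsa(hf_config) else "mla"
```

With stand-in configs built from `SimpleNamespace`, a `GlmMoeDsaForCausalLM` config that sets `index_topk` selects `"dsa_mla"`, while a `DeepseekV3ForCausalLM` config without it falls back to `"mla"`.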

Registry / engine wiring

areal/models/mcore/registry.py — add _DEEPSEEK_V3_ARCHITECTURES set and _supplement_dsa_config() helper; register DSV3/GLM-5/GLM-4.7 architectures in make_hf_and_mcore_config (no-bridge fallback) and make_mcore_layer_specs; inject AReaL's DSA-aware layer spec into provider.transformer_layer_spec for the megatron-bridge path when _has_dsa(hf_config) is True.
areal/engine/megatron_engine.py — import deepseek_v3_bridge (lazy via try/except, since it depends on mbridge) and glm5_megatron_bridge (unconditional, since it uses megatron-bridge) so their decorators fire on engine load.
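The lazy-import pattern described for deepseek_v3_bridge can be sketched as follows. This is a minimal illustration, not the code in megatron_engine.py: the `import_optional` helper name and the logging behavior are assumptions.

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def import_optional(module_name: str) -> bool:
    """Import a module purely for its registration side effects.

    Returns False instead of raising when the module, or one of its own
    dependencies such as mbridge, is not installed.
    """
    try:
        importlib.import_module(module_name)
    except ImportError as exc:
        logger.debug("skipping %s: %s", module_name, exc)
        return False
    return True


# Intended use (module path taken from the PR description):
#   import_optional("areal.models.mcore.deepseek_v3_bridge")
```

A module that must always be present, such as glm5_megatron_bridge in this PR, would instead use a plain `import` so that a missing dependency fails loudly.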

Docs

docs/en/best_practices/migrate_to_megatron_bridge.md — extend the supported-architectures table with DSV3/GLM-5/GLM-4.7 entries and describe the DSA-aware layer spec injection alongside the Bailing one.

Validation

The mbridge ↔ megatron-bridge numerical equivalence (matching starting logp / grad_norm / loss with same seed and config) has been validated on the internal branch this PR is ported from. All provider.<field> assignments in glm5_megatron_bridge.py are preserved as-is from the validated source.

Test plan

pre-commit run --all-files — passing locally
CI smoke tests (no regression on existing models) — to run on PR
Manual: GLM-5.1 GRPO step with bridge_type: megatron-bridge on a real cluster
Manual: DeepSeek-V3 GRPO with both bridge_type: mbridge and bridge_type: megatron-bridge for 5-step loss alignment
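For the 5-step loss alignment check, a small comparison helper along these lines could be used. It is only a sketch: the function name and the relative tolerance are assumptions, and the PR does not state what threshold counts as aligned.

```python
import math


def first_misaligned_step(run_a, run_b, rel_tol=1e-3):
    """Return the index of the first step whose values differ beyond rel_tol.

    run_a and run_b are per-step metrics (for example loss) from the mbridge
    and megatron-bridge runs. Returns None when every step matches.
    """
    if len(run_a) != len(run_b):
        raise ValueError("runs must cover the same number of steps")
    for step, (a, b) in enumerate(zip(run_a, run_b)):
        if not math.isclose(a, b, rel_tol=rel_tol):
            return step
    return None
```

The same helper applies to grad_norm or starting logp. Values close to zero would need an `abs_tol` as well, since a purely relative tolerance never matches an exact zero against a nonzero value.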

Dependencies

Depends on #1372 (feat(mcore): add Bailing-MoE V2.5 megatron-bridge adapter; PR-A). Targets the chucai.dzq/bailing-megatron-bridge branch; once #1372 merges I'll retarget this PR onto main and rebase.

Known limitations / not in scope

VPP > 1 with the DSA layer-spec injection: the lambda currently captures stage-0 specs (validated runs were VPP=1). Tracked as follow-up.
The two tilelang kernels (tilelang_sparse_mla_{fwd,bwd}.py) are specialized for the DSV3/GLM-5.1 latent geometry: D=512, dim_plus_tail_dim=576. Supporting other latent dimensions requires a kernel rewrite and re-tuning.
examples/math/glm5_grpo_megatron_bridge.yaml is deferred (it would require placeholder paths and a large multi-node config; users should adapt the existing examples/math/gsm8k_grpo_megatron.yaml with mcore.bridge_type: megatron-bridge).
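The fixed kernel geometry can be guarded with an early check such as the one below. It assumes that D corresponds to kv_lora_rank and that dim_plus_tail_dim is kv_lora_rank + qk_rope_head_dim, which matches the 512, 64 and 576 figures quoted in this PR. The helper name is hypothetical.

```python
SUPPORTED_D = 512
SUPPORTED_DIM_PLUS_TAIL_DIM = 576


def check_sparse_mla_geometry(kv_lora_rank: int, qk_rope_head_dim: int) -> int:
    """Validate the latent geometry and return dim_plus_tail_dim.

    Assumption: the sparse MLA kernels see the latent dimension followed by
    the rotary "tail", so the combined width is the sum of the two.
    """
    dim_plus_tail_dim = kv_lora_rank + qk_rope_head_dim
    if (kv_lora_rank, dim_plus_tail_dim) != (
        SUPPORTED_D,
        SUPPORTED_DIM_PLUS_TAIL_DIM,
    ):
        raise ValueError(
            "sparse MLA kernels are specialized for D=512, "
            f"dim_plus_tail_dim=576; got D={kv_lora_rank}, "
            f"dim_plus_tail_dim={dim_plus_tail_dim}"
        )
    return dim_plus_tail_dim
```

Failing at config time gives a clearer error than a shape mismatch inside the kernel.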

gemini-code-assist

Code Review

This pull request introduces support for DeepSeek V3, GLM-5.1, and GLM-4.7-Flash models in megatron-core, implementing custom Multi-Latent Attention (MLA) and DeepSeek Sparse Attention (DSA) operators with TileLang kernels. The review feedback identifies several important issues: a critical mismatch in the number of gradients returned by the autograd IndexerFunction.backward, which will trigger a runtime error; potential AttributeErrors when directly accessing HuggingFace configuration attributes; a missing validation check when DSA is enabled but Transformer Engine is disabled; and a missing dimension assertion in the TileLang backward kernel.

forward(ctx, ...) takes 7 inputs after ctx (index_q, index_k, weights, cu_seqlen_ks, cu_seqlen_ke, topk, topk_indices), so PyTorch's autograd contract requires backward to return exactly 7 values. The previous return tuple had 10 entries (3 grads + 7 Nones); fix to 7 (3 grads + 4 Nones) so non-frozen indexer training does not crash with "function returned an incorrect number of gradients".

This bug was masked in validated experiments because the default AREAL_DSA_TRAIN_INDEXER=0 freezes all 4 indexer parameter modules (wq_b, wk, k_norm, weights_proj) via requires_grad=False, and dsa_mla_attention.py additionally detaches q_compressed and hidden_states before the indexer call. With all upstream tensors and parameters requires_grad=False, autograd skips the IndexerFunction backward entirely and the buggy return tuple is never inspected. Setting AREAL_DSA_TRAIN_INDEXER=1 (per the code comment, "for future RL stages") would have triggered the crash.

This fix changes no numerical behavior on the validated default-freeze path; it just unblocks the indexer-training path.
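The arity rule behind this fix can be checked without running autograd at all. The sketch below uses a plain class as a stand-in for the real torch.autograd.Function subclass, reusing the forward signature quoted above. The class name and both helper names are hypothetical. `pad_grads` additionally assumes that the differentiable inputs come first in the signature, as they do here.

```python
import inspect


class IndexerFunctionSketch:
    """Stand-in with the forward signature from the commit message.

    Not a torch.autograd.Function; only the signature matters here.
    """

    @staticmethod
    def forward(ctx, index_q, index_k, weights,
                cu_seqlen_ks, cu_seqlen_ke, topk, topk_indices):
        raise NotImplementedError


def expected_num_grads(function_cls) -> int:
    """Number of values backward must return: forward inputs minus ctx."""
    params = inspect.signature(function_cls.forward).parameters
    return len(params) - 1


def pad_grads(function_cls, *tensor_grads):
    """Pad real gradients with None for the non-differentiable inputs."""
    n = expected_num_grads(function_cls)
    if len(tensor_grads) > n:
        raise ValueError(f"got {len(tensor_grads)} grads for {n} inputs")
    return tensor_grads + (None,) * (n - len(tensor_grads))
```

For the signature above, `expected_num_grads` gives 7, and padding three gradients yields three entries followed by four `None` values, which matches the corrected return tuple. A unit test built on this idea would catch the 10-entry tuple even on the default frozen-indexer path, where autograd never calls backward.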

dingzhiqiang requested review from PrometheusComing, fishcrap, garrett4wade, geshi001, nuzant, rchardx and sitabulaixizawaluduo as code owners May 28, 2026 03:52

gemini-code-assist Bot reviewed May 28, 2026

Comment thread areal/experimental/ops/dsa/indexer.py Outdated

Comment thread areal/models/mcore/registry.py

Comment thread areal/models/mcore/deepseek_v3.py

Comment thread areal/experimental/ops/dsa/tilelang_sparse_mla_bwd.py

chucai.dzq added 2 commits May 28, 2026 12:23

dingzhiqiang force-pushed the chucai.dzq/glm5-deepseek-v3-support branch from 875ffc4 to 9aad59f on May 28, 2026 04:24

dingzhiqiang wants to merge 2 commits into chucai.dzq/bailing-megatron-bridge from chucai.dzq/glm5-deepseek-v3-support
