feat(mcore): add GLM-5/DeepSeek-V3 model support (mbridge + megatron-bridge) #1373
Open
dingzhiqiang wants to merge 2 commits into
Code Review
This pull request introduces support for DeepSeek V3, GLM-5.1, and GLM-4.7-Flash models in megatron-core, implementing custom Multi-Latent Attention (MLA) and Dynamic Sparse Attention (DSA) operators with TileLang kernels. The review feedback identifies several important issues: a critical mismatch in the number of returned gradients in the autograd IndexerFunction.backward which will trigger a runtime error, potential AttributeErrors when directly accessing HuggingFace configuration attributes, a missing validation check when DSA is enabled but Transformer Engine is disabled, and a missing dimension assertion in the TileLang backward kernel.
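Two of the review points above (direct attribute access on the HuggingFace config, and the missing check for DSA without Transformer Engine) can be illustrated with a small sketch. The helper names `cfg_get` and `validate_dsa_config`, the attribute name `index_topk`, and the convention that a positive `index_topk` means "DSA enabled" are assumptions made for illustration; they are not taken from the PR's code.

```python
from types import SimpleNamespace

_MISSING = object()


def cfg_get(hf_config, name, default=_MISSING):
    """Read a config attribute; raise a clear error only if no default was given."""
    value = getattr(hf_config, name, default)
    if value is _MISSING:
        raise ValueError(f"config is missing required field {name!r}")
    return value


def validate_dsa_config(hf_config, use_transformer_engine):
    """Return True when DSA is enabled; fail early on an unsupported combination."""
    # Assumed convention: a positive index_topk means a DSA indexer is present.
    dsa_enabled = (cfg_get(hf_config, "index_topk", 0) or 0) > 0
    if dsa_enabled and not use_transformer_engine:
        raise ValueError("DSA is enabled but Transformer Engine is disabled")
    return dsa_enabled


# Stand-ins for HF config objects; real configs carry many more fields.
dense_cfg = SimpleNamespace(hidden_size=1024)
dsa_cfg = SimpleNamespace(hidden_size=1024, index_topk=2048)

print(validate_dsa_config(dense_cfg, use_transformer_engine=False))
print(validate_dsa_config(dsa_cfg, use_transformer_engine=True))
```

Reading optional fields through a defaulted accessor keeps a config that lacks the indexer fields from failing with an `AttributeError`, while the explicit check turns a silent misconfiguration into an early, readable error.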
added 2 commits
May 28, 2026 12:23
…bridge)

Adds the GLM-5.1 / DeepSeek-V3 / GLM-4.7-Flash architecture family, which open-source AReaL did not previously support. Coverage spans both the mbridge path (used by default) and the megatron-bridge path opted into via mcore.bridge_type=megatron-bridge.

Stacks on top of the Bailing-MoE megatron-bridge adapter PR, which introduced the shared cross-cutting infrastructure (optional mbridge import wrapping, Bridge type-annotation cleanups, migration doc).

New model code:
- areal/models/mcore/deepseek_v3.py: HF config -> MLATransformerConfig conversion, homogeneous MLA layer specs, _has_dsa() helper. Handles DeepseekV3ForCausalLM, GlmMoeDsaForCausalLM (GLM-5.1), and Glm4MoeForCausalLM (GLM-4.7-Flash); all three share the underlying MLA + MoE topology, GLM-5.1 additionally exposes a DSA indexer.
- areal/models/mcore/deepseek_v3_bridge.py: mbridge LLMBridge subclass registered as deepseek_v3 / glm_moe_dsa / glm4_moe_lite.
- areal/models/mcore/dsa_mla_attention.py: custom DSAMLASelfAttention module that inherits Attention directly (not DSAttention) so packed THD inputs work without modification. Implements the DSA indexer (wq_b, wk, k_norm, weights_proj) called by the layer spec.
- areal/models/mcore/glm5_megatron_bridge.py: megatron-bridge MegatronModelBridge subclass for GlmMoeDsaForCausalLM, with DSA indexer weight mappings and MTP layer support.
- areal/experimental/ops/dsa/: six tilelang kernel files implementing the DSA indexer forward/backward and the sparse MLA forward/backward used by DSAMLASelfAttention. Specialized for DSV3/GLM-5.1 latent geometry (kv_lora_rank=512, qk_rope_head_dim=64).

Registry / engine wiring:
- areal/models/mcore/registry.py: add _DEEPSEEK_V3_ARCHITECTURES set and _supplement_dsa_config() helper; register DSV3/GLM-5/GLM-4.7 architectures in make_hf_and_mcore_config (no-bridge fallback) and make_mcore_layer_specs; inject AReaL's DSA-aware layer spec into provider.transformer_layer_spec for the megatron-bridge path when _has_dsa(hf_config) is True.
- areal/engine/megatron_engine.py: import deepseek_v3_bridge (lazy via try/except, since it depends on mbridge) and glm5_megatron_bridge (unconditional, since it uses megatron-bridge) so their decorators fire on engine load.

Docs:
- docs/en/best_practices/migrate_to_megatron_bridge.md: extend the supported-architectures table with DSV3/GLM-5/GLM-4.7 entries and describe the DSA-aware layer spec injection alongside the Bailing one.
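The engine wiring described in this commit message relies on two ideas: importing a module purely so that its registration decorators run, and guarding that import when the module depends on an optional package. A minimal sketch of both follows. The names `try_import`, `register`, `REGISTRY`, and `SketchBridge` are hypothetical stand-ins, not the mbridge or megatron-bridge APIs.

```python
import importlib

# Hypothetical registry mapping model-type names to bridge classes.
REGISTRY = {}


def register(*names):
    """Hypothetical decorator: record a class under one or more model-type names."""
    def wrap(cls):
        for name in names:
            REGISTRY[name] = cls
        return cls
    return wrap


def try_import(module_name):
    """Import a module for its side effects; return None if it cannot be imported."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Optional dependency missing: skip registration instead of failing engine load.
        return None


# Defining (or importing) the decorated class is what populates the registry.
@register("deepseek_v3", "glm_moe_dsa", "glm4_moe_lite")
class SketchBridge:
    pass


print(try_import("json") is not None)
print(try_import("hypothetical_missing_bridge_module") is None)
print(sorted(REGISTRY))
```

The trade-off of the guarded import is that a missing optional dependency surfaces later, as an unregistered model type, rather than at import time; the unconditional import fails fast instead.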
forward(ctx, ...) takes 7 inputs after ctx (index_q, index_k, weights, cu_seqlen_ks, cu_seqlen_ke, topk, topk_indices), so PyTorch's autograd contract requires backward to return exactly 7 values. The previous return tuple had 10 entries (3 grads + 7 Nones); fix to 7 (3 grads + 4 Nones) so non-frozen indexer training does not crash with "function returned an incorrect number of gradients".

This bug was masked in validated experiments because the default AREAL_DSA_TRAIN_INDEXER=0 freezes all 4 indexer parameter modules (wq_b, wk, k_norm, weights_proj) via requires_grad=False, and dsa_mla_attention.py additionally detaches q_compressed and hidden_states before the indexer call. With all upstream tensors and parameters requires_grad=False, autograd skips the IndexerFunction backward entirely and the buggy return tuple is never inspected. Setting AREAL_DSA_TRAIN_INDEXER=1 (per the code comment, "for future RL stages") would have triggered the crash.

This fix changes no numerical behavior on the validated default-freeze path; it just unblocks the indexer-training path.
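The arity rule behind this fix (one backward return value per forward input after ctx) can be checked mechanically. The sketch below is a torch-free stand-in: `IndexerFunctionSketch`, `forward_input_count`, and `check_backward_arity` are hypothetical names, the gradient values are placeholder strings, and the check enforces the strict one-value-per-input rule rather than reproducing PyTorch's own validation logic. Only the forward parameter list is taken from the commit message.

```python
import inspect


def forward_input_count(fn_cls):
    """Count forward() parameters after ctx (assumes no *args / **kwargs)."""
    params = list(inspect.signature(fn_cls.forward).parameters)
    return len(params) - 1  # drop ctx


def check_backward_arity(fn_cls, grads):
    """Raise if backward's return tuple does not have one entry per forward input."""
    expected = forward_input_count(fn_cls)
    if len(grads) != expected:
        raise ValueError(
            f"backward returned {len(grads)} values, expected {expected}"
        )
    return grads


class IndexerFunctionSketch:
    """Shape-only stand-in for an autograd Function with 7 forward inputs."""

    @staticmethod
    def forward(ctx, index_q, index_k, weights,
                cu_seqlen_ks, cu_seqlen_ke, topk, topk_indices):
        return None

    @staticmethod
    def backward(ctx, grad_output):
        # Placeholders for the three real gradients; the four non-differentiable
        # inputs (sequence offsets, topk, topk_indices) get None.
        grad_q, grad_k, grad_w = "grad_q", "grad_k", "grad_w"
        return grad_q, grad_k, grad_w, None, None, None, None


grads = IndexerFunctionSketch.backward(None, None)
print(forward_input_count(IndexerFunctionSketch), len(check_backward_arity(IndexerFunctionSketch, grads)))
```

A check like this could run in a unit test, which would catch a mismatched return tuple even when the default configuration never exercises the backward pass.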
875ffc4 to 9aad59f
Summary
Adds GLM-5.1 / DeepSeek-V3 / GLM-4.7-Flash model support: three architectures open-source AReaL did not previously support. Coverage spans both the mbridge path (default) and the megatron-bridge path opted into via mcore.bridge_type=megatron-bridge.

This PR stacks on top of #1372 (Bailing-MoE megatron-bridge adapter), which introduced the shared cross-cutting infrastructure (optional mbridge import wrapping, Bridge type-annotation cleanups, migration doc skeleton). The diff shown here only contains GLM-5/DSV3 additions; #1372 must merge first.

What's added

New model code
- areal/models/mcore/deepseek_v3.py: HF config → MLATransformerConfig conversion, homogeneous MLA layer specs, _has_dsa() helper. Handles DeepseekV3ForCausalLM, GlmMoeDsaForCausalLM (GLM-5.1), and Glm4MoeForCausalLM (GLM-4.7-Flash); all three share the underlying MLA + MoE topology; GLM-5.1 additionally exposes a DSA indexer.
- areal/models/mcore/deepseek_v3_bridge.py: mbridge LLMBridge subclass registered as deepseek_v3 / glm_moe_dsa / glm4_moe_lite.
- areal/models/mcore/dsa_mla_attention.py: custom DSAMLASelfAttention module that inherits Attention directly (not DSAttention) so packed THD inputs work without modification. Implements the DSA indexer (wq_b, wk, k_norm, weights_proj) called by the layer spec.
- areal/models/mcore/glm5_megatron_bridge.py: megatron-bridge MegatronModelBridge subclass for GlmMoeDsaForCausalLM, with DSA indexer weight mappings and MTP layer support.
- areal/experimental/ops/dsa/: six tilelang kernel files implementing the DSA indexer forward/backward and the sparse MLA forward/backward used by DSAMLASelfAttention. Specialized for DSV3/GLM-5.1 latent geometry (kv_lora_rank=512, qk_rope_head_dim=64).

Registry / engine wiring
- areal/models/mcore/registry.py: add _DEEPSEEK_V3_ARCHITECTURES set and _supplement_dsa_config() helper; register DSV3/GLM-5/GLM-4.7 architectures in make_hf_and_mcore_config (no-bridge fallback) and make_mcore_layer_specs; inject AReaL's DSA-aware layer spec into provider.transformer_layer_spec for the megatron-bridge path when _has_dsa(hf_config) is True.
- areal/engine/megatron_engine.py: import deepseek_v3_bridge (lazy via try/except, since it depends on mbridge) and glm5_megatron_bridge (unconditional, since it uses megatron-bridge) so their decorators fire on engine load.

Docs
- docs/en/best_practices/migrate_to_megatron_bridge.md: extend the supported-architectures table with DSV3/GLM-5/GLM-4.7 entries and describe the DSA-aware layer spec injection alongside the Bailing one.

Validation

The mbridge ↔ megatron-bridge numerical equivalence (matching starting logp / grad_norm / loss with same seed and config) has been validated on the internal branch this PR is ported from. All provider.<field> assignments in glm5_megatron_bridge.py are preserved as-is from the validated source.

Test plan
- pre-commit run --all-files: passing locally
- bridge_type: megatron-bridge on a real cluster
- bridge_type: mbridge and bridge_type: megatron-bridge for 5-step loss alignment

Dependencies
- chucai.dzq/bailing-megatron-bridge branch; once feat(mcore): add Bailing-MoE V2.5 megatron-bridge adapter #1372 merges I'll retarget this PR onto main and rebase.

Known limitations / not in scope
- tilelang_sparse_mla_{fwd,bwd}.py) are specialized for DSV3/GLM-5.1 latent geometry: D=512, dim_plus_tail_dim=576. Supporting other latent dimensions requires kernel rewrite and re-tuning.
- examples/math/glm5_grpo_megatron_bridge.yaml deferred (would require placeholder paths and large multi-node config; users should adapt existing examples/math/gsm8k_grpo_megatron.yaml with mcore.bridge_type: megatron-bridge).
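Because the sparse MLA kernels are specialized for one latent geometry, a caller-side guard can reject unsupported shapes before a kernel is launched. The sketch below assumes that dim_plus_tail_dim is the sum kv_lora_rank + qk_rope_head_dim (512 + 64 = 576), which matches the numbers quoted in this PR but is not stated explicitly; the function name `check_sparse_mla_geometry` is hypothetical.

```python
# Geometry quoted in the PR description for DSV3 / GLM-5.1.
KV_LORA_RANK = 512
QK_ROPE_HEAD_DIM = 64
# Assumed relation: latent dim plus the RoPE "tail" dim.
DIM_PLUS_TAIL_DIM = KV_LORA_RANK + QK_ROPE_HEAD_DIM


def check_sparse_mla_geometry(kv_lora_rank, qk_rope_head_dim):
    """Return the combined head dim, or raise if the kernels do not cover this shape."""
    if kv_lora_rank != KV_LORA_RANK or qk_rope_head_dim != QK_ROPE_HEAD_DIM:
        raise ValueError(
            "sparse MLA kernels are specialized for "
            f"kv_lora_rank={KV_LORA_RANK}, qk_rope_head_dim={QK_ROPE_HEAD_DIM}; "
            f"got kv_lora_rank={kv_lora_rank}, qk_rope_head_dim={qk_rope_head_dim}"
        )
    return kv_lora_rank + qk_rope_head_dim


print(check_sparse_mla_geometry(512, 64))  # 576
```

Raising a `ValueError` with both the expected and the actual values makes an unsupported model configuration fail with a readable message instead of an opaque kernel error.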