feat(mcore): add GLM-5/DeepSeek-V3 model support (mbridge + megatron-bridge)#1373

Open
dingzhiqiang wants to merge 2 commits into chucai.dzq/bailing-megatron-bridge from chucai.dzq/glm5-deepseek-v3-support

@dingzhiqiang
Collaborator

Summary

Adds GLM-5.1 / DeepSeek-V3 / GLM-4.7-Flash model support — three architectures open-source AReaL did not previously support. Coverage spans both the mbridge path (default) and the megatron-bridge path opted into via mcore.bridge_type=megatron-bridge.

This PR stacks on top of #1372 (Bailing-MoE megatron-bridge adapter), which introduced the shared cross-cutting infrastructure (optional mbridge import wrapping, Bridge type-annotation cleanups, migration doc skeleton). The diff shown here only contains GLM-5/DSV3 additions; #1372 must merge first.

Previously, both this work and the Bailing adapter were combined in #1362 (now closed). They were split into PR-A (#1372) and this PR-B for easier review.

What's added

New model code

  • areal/models/mcore/deepseek_v3.py — HF config → MLATransformerConfig conversion, homogeneous MLA layer specs, _has_dsa() helper. Handles DeepseekV3ForCausalLM, GlmMoeDsaForCausalLM (GLM-5.1), and Glm4MoeForCausalLM (GLM-4.7-Flash) — all three share the underlying MLA + MoE topology; GLM-5.1 additionally exposes a DSA indexer.
  • areal/models/mcore/deepseek_v3_bridge.py — mbridge LLMBridge subclass registered as deepseek_v3 / glm_moe_dsa / glm4_moe_lite.
  • areal/models/mcore/dsa_mla_attention.py — custom DSAMLASelfAttention module that inherits Attention directly (not DSAttention) so packed THD inputs work without modification. Implements the DSA indexer (wq_b, wk, k_norm, weights_proj) called by the layer spec.
  • areal/models/mcore/glm5_megatron_bridge.py — megatron-bridge MegatronModelBridge subclass for GlmMoeDsaForCausalLM, with DSA indexer weight mappings and MTP layer support.
  • areal/experimental/ops/dsa/ — six tilelang kernel files implementing the DSA indexer forward/backward and the sparse MLA forward/backward used by DSAMLASelfAttention. Specialized for DSV3/GLM-5.1 latent geometry (kv_lora_rank=512, qk_rope_head_dim=64).
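
As a rough illustration of how a `_has_dsa()`-style check could separate GLM-5.1 from the two architectures without an indexer, the sketch below probes an HF-style config defensively. The function name `has_dsa_sketch` and the field name `index_topk` are assumptions made for this example and are not taken from the PR; the real helper in deepseek_v3.py may inspect different attributes.

```python
from types import SimpleNamespace


def has_dsa_sketch(hf_config) -> bool:
    """Return True when the config appears to describe a DSA indexer.

    `index_topk` is a hypothetical field name used only for illustration.
    """
    # getattr with a default avoids AttributeError on configs lacking the field.
    topk = getattr(hf_config, "index_topk", None)
    return topk is not None and topk > 0


# Stand-in configs; real ones would come from the HF checkpoint.
dsv3_like = SimpleNamespace(architectures=["DeepseekV3ForCausalLM"])
glm5_like = SimpleNamespace(architectures=["GlmMoeDsaForCausalLM"], index_topk=2048)

print(has_dsa_sketch(dsv3_like), has_dsa_sketch(glm5_like))
```

Reading the field through `getattr` keeps the same helper usable for all three architectures, since only one of them is expected to carry indexer settings.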

Registry / engine wiring

  • areal/models/mcore/registry.py — add _DEEPSEEK_V3_ARCHITECTURES set and _supplement_dsa_config() helper; register DSV3/GLM-5/GLM-4.7 architectures in make_hf_and_mcore_config (no-bridge fallback) and make_mcore_layer_specs; inject AReaL's DSA-aware layer spec into provider.transformer_layer_spec for the megatron-bridge path when _has_dsa(hf_config) is True.
  • areal/engine/megatron_engine.py — import deepseek_v3_bridge (lazy via try/except, since it depends on mbridge) and glm5_megatron_bridge (unconditional, since it uses megatron-bridge) so their decorators fire on engine load.

Docs

  • docs/en/best_practices/migrate_to_megatron_bridge.md — extend the supported-architectures table with DSV3/GLM-5/GLM-4.7 entries and describe the DSA-aware layer spec injection alongside the Bailing one.

Validation

The mbridge ↔ megatron-bridge numerical equivalence (matching starting logp / grad_norm / loss with the same seed and config) has been validated on the internal branch this PR is ported from. All provider.<field> assignments in glm5_megatron_bridge.py are preserved as-is from the validated source.

Test plan

  • pre-commit run --all-files — passing locally
  • CI smoke tests (no regression on existing models) — to run on PR
  • Manual: GLM-5.1 GRPO step with bridge_type: megatron-bridge on a real cluster
  • Manual: DeepSeek-V3 GRPO with both bridge_type: mbridge and bridge_type: megatron-bridge for 5-step loss alignment
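
One way the 5-step loss alignment could be checked is sketched below. The helper names, the relative tolerance of 1e-3, and the loss values are all placeholders chosen for illustration; they are not measurements from any run of this PR.

```python
def max_relative_gap(run_a, run_b):
    """Largest per-step relative difference between two metric traces."""
    if len(run_a) != len(run_b):
        raise ValueError("runs must cover the same number of steps")
    return max(
        abs(a - b) / max(abs(a), abs(b), 1e-12) for a, b in zip(run_a, run_b)
    )


def runs_aligned(run_a, run_b, rel_tol=1e-3):
    """True when every step agrees within rel_tol (an assumed threshold)."""
    return max_relative_gap(run_a, run_b) <= rel_tol


# Synthetic placeholder traces, NOT real results.
mbridge_loss = [0.50, 0.48, 0.47, 0.45, 0.44]
megatron_bridge_loss = [0.5001, 0.4801, 0.4699, 0.4501, 0.4400]
diverged_loss = [0.50, 0.48, 0.47, 0.45, 0.60]

print(runs_aligned(mbridge_loss, megatron_bridge_loss))
print(runs_aligned(mbridge_loss, diverged_loss))
```

The same comparison applies to logp and grad_norm traces; the appropriate tolerance depends on precision and parallelism settings and would need to be chosen per setup.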

Dependencies

Known limitations / not in scope

  • VPP > 1 with the DSA layer-spec injection: the lambda currently captures stage-0 specs (validated runs were VPP=1). Tracked as follow-up.
  • The two tilelang kernels (tilelang_sparse_mla_{fwd,bwd}.py) are specialized for DSV3/GLM-5.1 latent geometry (D=512, dim_plus_tail_dim=576). Supporting other latent dimensions requires a kernel rewrite and re-tuning.
  • examples/math/glm5_grpo_megatron_bridge.yaml is deferred (it would require placeholder paths and a large multi-node config; users should adapt the existing examples/math/gsm8k_grpo_megatron.yaml with mcore.bridge_type: megatron-bridge).
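
The geometry constraint on the sparse MLA kernels can be expressed as a small guard. The sketch below is hypothetical: the function name is invented for this example, and it assumes that the "tail" dimension is qk_rope_head_dim, which is consistent with 512 + 64 = 576 but is not stated explicitly in the PR.

```python
# Values quoted in this PR for the DSV3/GLM-5.1 latent geometry.
KV_LORA_RANK = 512      # "D" in the kernel description
QK_ROPE_HEAD_DIM = 64   # assumed here to be the "tail" dimension


def dim_plus_tail_dim(kv_lora_rank: int, qk_rope_head_dim: int) -> int:
    """Return the combined width, rejecting geometries the kernels do not cover."""
    if (kv_lora_rank, qk_rope_head_dim) != (KV_LORA_RANK, QK_ROPE_HEAD_DIM):
        raise ValueError(
            f"unsupported latent geometry: kv_lora_rank={kv_lora_rank}, "
            f"qk_rope_head_dim={qk_rope_head_dim}"
        )
    return kv_lora_rank + qk_rope_head_dim


print(dim_plus_tail_dim(512, 64))  # 576
```

Failing fast on an unsupported geometry at configuration time is easier to diagnose than a shape mismatch raised from inside a compiled kernel.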

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces support for DeepSeek V3, GLM-5.1, and GLM-4.7-Flash models in megatron-core, implementing custom Multi-Latent Attention (MLA) and DeepSeek Sparse Attention (DSA) operators with TileLang kernels. The review feedback identifies several important issues: a critical mismatch in the number of returned gradients in the autograd IndexerFunction.backward which will trigger a runtime error, potential AttributeErrors when directly accessing HuggingFace configuration attributes, a missing validation check when DSA is enabled but Transformer Engine is disabled, and a missing dimension assertion in the TileLang backward kernel.

Comment thread areal/experimental/ops/dsa/indexer.py Outdated
Comment thread areal/models/mcore/registry.py
Comment thread areal/models/mcore/deepseek_v3.py
Comment thread areal/experimental/ops/dsa/tilelang_sparse_mla_bwd.py
chucai.dzq added 2 commits May 28, 2026 12:23
…bridge)

Adds the GLM-5.1 / DeepSeek-V3 / GLM-4.7-Flash architecture family,
which open-source AReaL did not previously support. Coverage spans both
the mbridge path (used by default) and the megatron-bridge path opted
into via mcore.bridge_type=megatron-bridge.

Stacks on top of the Bailing-MoE megatron-bridge adapter PR, which
introduced the shared cross-cutting infrastructure (optional mbridge
import wrapping, Bridge type-annotation cleanups, migration doc).

New model code:
- areal/models/mcore/deepseek_v3.py: HF config -> MLATransformerConfig
  conversion, homogeneous MLA layer specs, _has_dsa() helper. Handles
  DeepseekV3ForCausalLM, GlmMoeDsaForCausalLM (GLM-5.1), and
  Glm4MoeForCausalLM (GLM-4.7-Flash) — all three share the underlying
  MLA + MoE topology, GLM-5.1 additionally exposes a DSA indexer.
- areal/models/mcore/deepseek_v3_bridge.py: mbridge LLMBridge subclass
  registered as deepseek_v3 / glm_moe_dsa / glm4_moe_lite.
- areal/models/mcore/dsa_mla_attention.py: custom DSAMLASelfAttention
  module that inherits Attention directly (not DSAttention) so packed
  THD inputs work without modification. Implements the DSA indexer
  (wq_b, wk, k_norm, weights_proj) called by the layer spec.
- areal/models/mcore/glm5_megatron_bridge.py: megatron-bridge
  MegatronModelBridge subclass for GlmMoeDsaForCausalLM, with DSA
  indexer weight mappings and MTP layer support.
- areal/experimental/ops/dsa/: six tilelang kernel files implementing
  the DSA indexer forward/backward and the sparse MLA forward/backward
  used by DSAMLASelfAttention. Specialized for DSV3/GLM-5.1 latent
  geometry (kv_lora_rank=512, qk_rope_head_dim=64).

Registry / engine wiring:
- areal/models/mcore/registry.py: add _DEEPSEEK_V3_ARCHITECTURES set
  and _supplement_dsa_config() helper; register DSV3/GLM-5/GLM-4.7
  architectures in make_hf_and_mcore_config (no-bridge fallback) and
  make_mcore_layer_specs; inject AReaL's DSA-aware layer spec into
  provider.transformer_layer_spec for the megatron-bridge path when
  _has_dsa(hf_config) is True.
- areal/engine/megatron_engine.py: import deepseek_v3_bridge (lazy
  via try/except, since it depends on mbridge) and glm5_megatron_bridge
  (unconditional, since it uses megatron-bridge) so their
  decorators fire on engine load.

Docs:
- docs/en/best_practices/migrate_to_megatron_bridge.md: extend the
  supported-architectures table with DSV3/GLM-5/GLM-4.7 entries and
  describe the DSA-aware layer spec injection alongside the Bailing
  one.

forward(ctx, ...) takes 7 inputs after ctx (index_q, index_k, weights,
cu_seqlen_ks, cu_seqlen_ke, topk, topk_indices), so PyTorch's autograd
contract requires backward to return exactly 7 values. The previous
return tuple had 10 entries (3 grads + 7 Nones); fix to 7 (3 grads + 4
Nones) so non-frozen indexer training does not crash with
"function returned an incorrect number of gradients".

This bug was masked in validated experiments because the default
AREAL_DSA_TRAIN_INDEXER=0 freezes all 4 indexer parameter modules
(wq_b, wk, k_norm, weights_proj) via requires_grad=False, and
dsa_mla_attention.py additionally detaches q_compressed and
hidden_states before the indexer call. With all upstream tensors and
parameters requires_grad=False, autograd skips the IndexerFunction
backward entirely and the buggy return tuple is never inspected.

Setting AREAL_DSA_TRAIN_INDEXER=1 (per the code comment, "for future
RL stages") would have triggered the crash. This fix changes no
numerical behavior on the validated default-freeze path; it just
unblocks the indexer-training path.
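
The arity rule this commit message relies on (one backward return value per forward input after ctx) can be illustrated with a toy `torch.autograd.Function`. This is a sketch only: `ToyIndexerFunction` and its scoring formula are invented for the example and do not reproduce the real IndexerFunction or its tilelang kernels; only the seven-argument signature is taken from the text above.

```python
import torch


class ToyIndexerFunction(torch.autograd.Function):
    """Toy stand-in with the same 7-input forward signature."""

    @staticmethod
    def forward(ctx, index_q, index_k, weights,
                cu_seqlen_ks, cu_seqlen_ke, topk, topk_indices):
        ctx.save_for_backward(index_q, index_k, weights)
        # Invented score: per-row dot product scaled by a per-row weight.
        return (index_q * index_k).sum(-1) * weights

    @staticmethod
    def backward(ctx, grad_out):
        index_q, index_k, weights = ctx.saved_tensors
        scaled = (grad_out * weights).unsqueeze(-1)
        grad_q = scaled * index_k
        grad_k = scaled * index_q
        grad_w = grad_out * (index_q * index_k).sum(-1)
        # 3 gradients + 4 Nones = 7 values, matching the 7 forward inputs.
        return grad_q, grad_k, grad_w, None, None, None, None


torch.manual_seed(0)
q = torch.randn(4, 8, dtype=torch.float64, requires_grad=True)
k = torch.randn(4, 8, dtype=torch.float64, requires_grad=True)
w = torch.randn(4, dtype=torch.float64, requires_grad=True)
ks = torch.zeros(4, dtype=torch.int32)            # non-differentiable inputs
ke = torch.full((4,), 4, dtype=torch.int32)
topk_indices = torch.zeros(4, 2, dtype=torch.int64)

out = ToyIndexerFunction.apply(q, k, w, ks, ke, 2, topk_indices)
out.sum().backward()
print(q.grad.shape, k.grad.shape, w.grad.shape)
```

If `q`, `k`, and `w` were all created with `requires_grad=False`, autograd would not call `backward` at all, which matches the explanation above of why the default frozen-indexer path never exercised the return tuple.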
@dingzhiqiang dingzhiqiang force-pushed the chucai.dzq/glm5-deepseek-v3-support branch from 875ffc4 to 9aad59f Compare May 28, 2026 04:24