[fsdp, model] feat: support glm_moe_dsa FSDP training with DSA attention#6525

Open
Kite0011 wants to merge 9 commits into verl-project:main from Kite0011:support_glm_moe_dsa_fsdp


Kite0011 commented May 28, 2026

What does this PR do?

Add FSDP training support for the glm_moe_dsa model, including:

  • Custom attention forward with DSA (Dynamic Sparse Attention) and Ulysses sequence parallelism support
  • Monkey patch registration for glm_moe_dsa model type
  • FLOPs estimation function accounting for MLA + DSA indexer + sparse attention

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching
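
The title rules above can be checked locally before pushing. The following is a minimal sketch only: the regular expression and the helper name `check_pr_title` are illustrative assumptions, not the project's actual CI check, which may differ in details such as whitespace handling.

```python
import re

# Module and type names copied from the checklist above.
ALLOWED_MODULES = {
    "fsdp", "megatron", "veomni", "sglang", "vllm", "rollout", "trainer", "ci",
    "training_utils", "recipe", "hardware", "deployment", "ray", "worker",
    "single_controller", "misc", "perf", "model", "algo", "env", "tool", "ckpt",
    "doc", "data", "cfg", "reward", "fully_async", "one_step_off",
}
ALLOWED_TYPES = {"feat", "fix", "refactor", "chore", "test"}

# Hypothetical pattern: optional [BREAKING], then [modules] type: description.
TITLE_RE = re.compile(r"^(\[BREAKING\])?\[([^\]]+)\] (\w+): (.+)$")


def check_pr_title(title: str) -> bool:
    """Return True if the title matches [{modules}] {type}: {description}."""
    match = TITLE_RE.match(title)
    if match is None:
        return False
    # Modules are comma-separated inside the brackets.
    modules = [part.strip() for part in match.group(2).split(",")]
    modules_ok = all(module in ALLOWED_MODULES for module in modules)
    return modules_ok and match.group(3) in ALLOWED_TYPES
```

For instance, `check_pr_title("[BREAKING][fsdp, megatron] feat: dynamic batching")` is accepted, while a title with an unlisted module such as `[foo] feat: x` is rejected.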

Test

Loss converges over 10 training steps with a randomly initialized glm_moe_dsa model.
Task: open_reasoning_math SFT.
[Screenshot: Clipboard_Screenshot_1780041072]

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces support for the glm_moe_dsa model type, including a custom attention forward implementation (glm_moe_dsa_attn_forward_with_dsa) that supports Ulysses sequence parallelism and padding removal, corresponding monkey patching, and FLOPs estimation. The review feedback highlights several potential runtime issues: position_ids needs to be unsqueezed to 2D if it is 1D to avoid shape mismatches; 2D or 3D attention_mask inputs should be unsqueezed to 4D before combining with index_mask to prevent crashes; the FLOPs counter must safely handle configurations where q_lora_rank is None; and use_fused_kernels should be explicitly disabled for glm_moe_dsa since they are not supported.
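
To illustrate the `q_lora_rank` point in the review, here is a minimal sketch of how the query-projection term of a FLOPs estimate could branch on a missing rank. It is not the code in `verl/utils/flops_counter.py`. The config field names (`hidden_size`, `num_attention_heads`, `qk_nope_head_dim`, `qk_rope_head_dim`, `q_lora_rank`) follow DeepSeek-V3-style MLA configs and are assumed to apply to glm_moe_dsa; the helper names are hypothetical, and the factor of 6 is the common forward-plus-backward approximation per parameter per token.

```python
from types import SimpleNamespace


def mla_query_proj_params(config) -> int:
    """Count weight parameters in one layer's MLA query projection."""
    q_head_dim = config.qk_nope_head_dim + config.qk_rope_head_dim
    q_out = config.num_attention_heads * q_head_dim
    if getattr(config, "q_lora_rank", None) is None:
        # No low-rank compression: one dense projection hidden_size -> q_out.
        return config.hidden_size * q_out
    # Low-rank path: down-project to q_lora_rank, then up-project to q_out.
    return config.hidden_size * config.q_lora_rank + config.q_lora_rank * q_out


def query_proj_flops(config, num_tokens: int) -> int:
    """Approximate training FLOPs as 6 * params * tokens (assumed rule of thumb)."""
    return 6 * mla_query_proj_params(config) * num_tokens


# Toy configs for illustration only; the values are not taken from any real model.
dense_cfg = SimpleNamespace(
    hidden_size=1024, num_attention_heads=8,
    qk_nope_head_dim=64, qk_rope_head_dim=32, q_lora_rank=None,
)
lora_cfg = SimpleNamespace(
    hidden_size=1024, num_attention_heads=8,
    qk_nope_head_dim=64, qk_rope_head_dim=32, q_lora_rank=256,
)
```

Using `getattr(config, "q_lora_rank", None)` covers both a config that sets the field to `None` and one that omits it entirely, so neither case raises a `TypeError` during multiplication.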

Comment thread verl/models/transformers/glm_moe_dsa.py
Comment thread verl/models/transformers/glm_moe_dsa.py Outdated
Comment thread verl/utils/flops_counter.py
Comment thread verl/models/transformers/monkey_patch.py Outdated
@Kite0011 Kite0011 closed this May 28, 2026
@Kite0011 Kite0011 reopened this May 29, 2026
Kite0011 and others added 6 commits May 29, 2026 15:54
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>