Add MoE calibration wrapper for GLM-4.7-Flash (Glm4MoeLiteMoE)#2547
Nottlespike wants to merge 3 commits into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Code Review
This pull request introduces a calibration wrapper for GLM-4.7-Flash models to ensure all experts are properly calibrated during quantization, preventing suboptimal expert weight quantization. The feedback identifies a potential shape mismatch in the routing logic that could occur if input logits are not flattened and suggests a refactor to the forward method to eliminate redundant code execution paths.
Pull request overview
Adds MoE calibration support for GLM-4.7-Flash models by introducing a dedicated calibration wrapper for the Glm4MoeLiteMoE architecture, ensuring expert activation statistics are collected instead of silently skipping MoE calibration.
Changes:
- Added `CalibrationGlm4MoeLiteMoE` wrapper with GLM-4.7-Flash group-based routing and "all experts see tokens" calibration behavior.
- Registered the new wrapper via `llmcompressor.modeling` package import side effects.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/llmcompressor/modeling/glm4_moe_lite.py | New MoE calibration wrapper for Glm4MoeLiteMoE, including routing + calibration-only all-expert pass. |
| src/llmcompressor/modeling/__init__.py | Imports the new wrapper to trigger registry registration. |
Force-pushed from d2c0afa to e460938
brian-dellabetta left a comment
Hi @Nottlespike , thanks for preparing this. It looks like a lot of the code is shared with what is in src/llmcompressor/modeling/glm_moe_dsa.py. Have you explored what it would look like to import and subclass the classes from that file directly? I know transformers sticks to the approach of no shared code across model definitions, but given we want to apply the same operation to the 3D expert tensors in both, maybe we won't have to repeat our code
Address review feedback from brian-dellabetta on PR vllm-project#2547: the routing and forward logic in glm4_moe_lite.py was identical to glm_moe_dsa.py. Refactored to inherit from CalibrationGlmMoeDsaMoE using the template method pattern (_get_num_experts, _make_experts).

- glm_moe_dsa.py: add _get_num_experts() and _make_experts() factory methods; __init__ now calls them instead of hardcoding types
- glm4_moe_lite.py: subclass CalibrationGlmMoeDsaMoE, override only the two factory methods; drop duplicated route_tokens_to_experts(), forward(), and redundant restore()
- __init__.py: isort-ordered imports

Net: -64 lines, zero behavior change.
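The template-method refactor this commit describes can be sketched as follows. The class and method names come from the commit message; the config handling and expert containers here are simplified placeholders for illustration, not the actual llm-compressor code.

```python
import torch.nn as nn


class CalibrationGlmMoeDsaMoE(nn.Module):
    """Base calibration wrapper (sketch): the shared routing/forward logic
    would live here; subclasses customize only the two factory methods."""

    def __init__(self, original, config):
        super().__init__()
        self.num_experts = self._get_num_experts(config)
        self.experts = self._make_experts(original, config)

    # --- template methods a subclass may override ---
    def _get_num_experts(self, config):
        return config["n_routed_experts"]

    def _make_experts(self, original, config):
        # placeholder expert container; the real class unpacks `original`
        return nn.ModuleList(nn.Identity() for _ in range(self.num_experts))


class CalibrationGlm4MoeLiteMoE(CalibrationGlmMoeDsaMoE):
    """GLM-4.7-Flash variant: inherits routing/forward, overrides factories."""

    def _make_experts(self, original, config):
        return nn.ModuleList(
            nn.Linear(config["hidden_size"], config["hidden_size"], bias=False)
            for _ in range(self.num_experts)
        )


cfg = {"n_routed_experts": 4, "hidden_size": 8}
lite = CalibrationGlm4MoeLiteMoE(original=None, config=cfg)
```

Because `__init__` only ever calls the two factory methods, the subclass carries no duplicated routing or forward code, which is the -64-line saving the commit reports.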
Walkthrough

This PR adds GLM4 MoE Lite model calibration support by introducing a specialized calibration module that unpacks routed expert parameters, refactoring the base GLM MoE DSA class to support expert customization through overridable helper methods, and adding comprehensive GPU-based tests to verify behavior.
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant CalibrationGlm4MoeLiteMoE as Calibration<br/>Wrapper
    participant SequentialGlm4MoeLiteExperts as Expert<br/>Unpacker
    participant ExpertMLPs as Expert<br/>Modules
    User->>CalibrationGlm4MoeLiteMoE: Initialize with<br/>original model
    CalibrationGlm4MoeLiteMoE->>SequentialGlm4MoeLiteExperts: Create expert container<br/>via _make_experts()
    SequentialGlm4MoeLiteExperts->>SequentialGlm4MoeLiteExperts: Extract expert count<br/>via _get_num_experts()
    SequentialGlm4MoeLiteExperts->>ExpertMLPs: Unpack weights from<br/>original experts
    ExpertMLPs->>ExpertMLPs: Slice gate_up_proj tensors<br/>assign to Linear layers
    User->>CalibrationGlm4MoeLiteMoE: Forward pass during<br/>calibration
    CalibrationGlm4MoeLiteMoE->>ExpertMLPs: Route tokens via<br/>inherited logic
    ExpertMLPs-->>User: Return calibrated<br/>outputs
```
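The "unpack weights / slice gate_up_proj" step in the diagram can be sketched like this. Shapes and the fused gate/up layout are assumptions for illustration; the real wrapper reads these from the original `Glm4MoeLiteMoE` module.

```python
import torch
import torch.nn as nn

# Hypothetical fused expert parameters: dim 0 indexes experts, and each
# expert's gate and up projections are stacked along the rows of gate_up_proj.
num_experts, intermediate, hidden = 4, 16, 8
gate_up_proj = torch.randn(num_experts, 2 * intermediate, hidden)
down_proj = torch.randn(num_experts, hidden, intermediate)

experts = nn.ModuleList()
for i in range(num_experts):
    mlp = nn.Module()
    mlp.gate_proj = nn.Linear(hidden, intermediate, bias=False)
    mlp.up_proj = nn.Linear(hidden, intermediate, bias=False)
    mlp.down_proj = nn.Linear(intermediate, hidden, bias=False)
    gate_w, up_w = gate_up_proj[i].chunk(2, dim=0)  # split fused gate/up rows
    with torch.no_grad():
        mlp.gate_proj.weight.copy_(gate_w.contiguous())
        mlp.up_proj.weight.copy_(up_w.contiguous())
        mlp.down_proj.weight.copy_(down_proj[i].contiguous())
    experts.append(mlp)
# every expert now exposes plain nn.Linear modules, which is what makes them
# visible to targets="Linear" quantization
```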
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Thanks for the feedback @brian-dellabetta — great call on the shared code. Refactored `CalibrationGlm4MoeLiteMoE` to subclass `CalibrationGlmMoeDsaMoE`, overriding only the two factory methods.
Also addressed the automated review feedback:
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from ce4cddf to cb2a5b6
Address review feedback from brian-dellabetta on PR vllm-project#2547: the routing and forward logic in glm4_moe_lite.py was identical to glm_moe_dsa.py. Refactored to inherit from CalibrationGlmMoeDsaMoE using the template method pattern (_get_num_experts, _make_experts).

- glm_moe_dsa.py: add _get_num_experts() and _make_experts() factory methods; __init__ now calls them instead of hardcoding types
- glm4_moe_lite.py: subclass CalibrationGlmMoeDsaMoE, override only the two factory methods; drop duplicated route_tokens_to_experts(), forward(), and redundant restore()
- __init__.py: isort-ordered imports

Net: -64 lines, zero behavior change.

Signed-off-by: Jason Lu <Nottlespike@users.noreply.github.com>
Actionable comments posted: 1
🧹 Nitpick comments (2)
tests/llmcompressor/modeling/test_calib_glm4_moe_lite.py (1)

110-112: Compute the shared-expert Linear expectation from config.

Hardcoding `expected_shared_linears = 3` makes the test brittle. Derive it from `config.n_shared_experts`.

♻️ Suggested change

```diff
- expected_shared_linears = 3
+ expected_shared_linears = config.n_shared_experts * 3
```
src/llmcompressor/modeling/glm4_moe_lite.py (1)
68-78: Avoid `.data` aliasing when copying expert weights.

Using `.data` for parameter transfer bypasses autograd safety checks and can cause subtle state issues during future refactors. The codebase already uses the safer pattern elsewhere (e.g., gpt_oss.py): prefer `torch.no_grad()` with `copy_()`.

♻️ Proposed safer copy pattern

```diff
-        gate_up_data = original.gate_up_proj.data
-        down_data = original.down_proj.data
+        gate_up_data = original.gate_up_proj
+        down_data = original.down_proj
         for i in range(self.num_experts):
             gate_up = gate_up_data[i]
             down = down_data[i]
             gate_proj, up_proj = gate_up.chunk(2, dim=0)
-            self[i].gate_proj.weight.data = gate_proj.contiguous()
-            self[i].up_proj.weight.data = up_proj.contiguous()
-            self[i].down_proj.weight.data = down.contiguous()
+            with torch.no_grad():
+                self[i].gate_proj.weight.copy_(gate_proj.contiguous())
+                self[i].up_proj.weight.copy_(up_proj.contiguous())
+                self[i].down_proj.weight.copy_(down.contiguous())
```
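A small standalone demonstration of why this review prefers `copy_()` over `.data` rebinding: assigning to `.data` aliases the source tensor's storage, while an in-place copy under `no_grad()` leaves the Parameter's own storage intact. The module and tensor names here are invented for the demo.

```python
import torch
import torch.nn as nn

src = torch.arange(6.0).reshape(2, 3)

# Anti-pattern flagged above: rebinding .data aliases the source tensor's
# storage and bypasses autograd's version/safety tracking.
aliased = nn.Linear(3, 2, bias=False)
aliased.weight.data = src

# Safer pattern: in-place copy under no_grad keeps the Parameter's own
# storage, so later mutation of `src` cannot leak into the module.
copied = nn.Linear(3, 2, bias=False)
with torch.no_grad():
    copied.weight.copy_(src)

src[0, 0] = 99.0
print(aliased.weight[0, 0].item())  # 99.0 (shares storage with src)
print(copied.weight[0, 0].item())   # 0.0 (independent copy)
```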
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/llmcompressor/modeling/test_calib_glm4_moe_lite.py`:
- Around line 55-56: The hook function hook_fn currently uses a parameter named
input which shadows the built-in input() (Ruff A002); rename that parameter to a
non-built-in name (e.g., _inputs or inputs) in the hook_fn signature and update
any uses inside the function accordingly so the signature reads hook_fn(i,
module, _inputs, output) and the body references the new name; ensure any
registration/calls that pass that argument remain compatible with the new
parameter name.
---
Nitpick comments:
In `@src/llmcompressor/modeling/glm4_moe_lite.py`:
- Around line 68-78: The code is using .data to read and write parameter tensors
(gate_up_data, down_data and assignments to self[i].gate_proj.weight.data /
up_proj.weight.data / down_proj.weight.data), which bypasses autograd; replace
this with a torch.no_grad() block and use tensor.copy_() to copy values safely
(retain .contiguous() where needed) — e.g., read original.gate_up_proj.weight
and original.down_proj.weight (not .data), split gate_up via chunk, then inside
torch.no_grad() call self[i].gate_proj.weight.copy_(gate_proj.contiguous()),
self[i].up_proj.weight.copy_(up_proj.contiguous()), and
self[i].down_proj.weight.copy_(down.contiguous()) so autograd/state tracking is
preserved.
In `@tests/llmcompressor/modeling/test_calib_glm4_moe_lite.py`:
- Around line 110-112: The test currently hardcodes expected_shared_linears = 3
which is brittle; change it to compute the expectation from the configuration by
using config.n_shared_experts (e.g., set expected_shared_linears =
config.n_shared_experts or expected_shared_linears = config.n_shared_experts *
1) so the assertion comparing len(linear_names) to expected_expert_linears +
expected_shared_linears derives the shared-linear count from
config.n_shared_experts instead of a hardcoded literal.
📒 Files selected for processing (4)
- src/llmcompressor/modeling/__init__.py
- src/llmcompressor/modeling/glm4_moe_lite.py
- src/llmcompressor/modeling/glm_moe_dsa.py
- tests/llmcompressor/modeling/test_calib_glm4_moe_lite.py
Worth noting: the safer `copy_()`-based parameter transfer lands in the shared base class. This is especially timely given the recent release of GLM-5.1, which uses the DSA MoE architecture. Any future calibration runs on GLM-5.1 (or other DSA-family models) will get the safer parameter transfer out of the box.
brian-dellabetta left a comment
Thanks @Nottlespike for the updates, this is looking good. Just one nit on the change to import ordering in `__init__.py`.
Add CalibrationGlm4MoeLiteMoE that subclasses CalibrationGlmMoeDsaMoE, overriding only the _get_num_experts() and _make_experts() factory methods. Unpacks 3D expert parameters into individual nn.Linear modules so they are visible to targets="Linear" quantization.

Also improves the base class (glm_moe_dsa.py):
- Add _get_num_experts() / _make_experts() template methods for subclassing
- Replace .data aliasing with torch.no_grad() + copy_() for safer parameter transfer (benefits all DSA-family models including GLM-5.1)

Tests: 3 GPU tests covering expert triggering, output correctness, and structural properties.

Signed-off-by: Jason Lu <Nottlespike@users.noreply.github.com>
Force-pushed from 5a43b30 to 0a07c50
Fixed — restored the upstream import ordering in `__init__.py`.
Could a maintainer add the `ready` label?
brian-dellabetta left a comment
Thanks for the contribution! lgtm
Summary

GLM-4.7-Flash uses a separate MoE class (`Glm4MoeLiteMoE`) that is not covered by the existing `Glm4MoeMoE` wrapper. Without this fix, MoE calibration is silently skipped for GLM-4.7-Flash models, resulting in quantization that doesn't properly calibrate expert weights.

Problem

The model `zai-org/GLM-4.7-Flash` (31B MoE) uses `Glm4MoeLiteMoE`, which has a different architecture than `Glm4MoeMoE`:

- `shared_experts` attribute (not `shared_expert`)
- `Glm4MoeLiteNaiveMoe` experts interface: `(hidden_states, topk_indices, topk_weights)`
- Group-based routing with `n_group`, `topk_group` parameters

When quantizing with NVFP4, the MoE calibration context manager checks for registered wrappers, but `Glm4MoeLiteMoE` doesn't match `Glm4MoeMoE`, so calibration silently falls back to standard forward passes without collecting expert activation statistics.
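The group-based routing mentioned above can be sketched as follows: score expert groups, keep only the best `topk_group` groups per token, then take the top-k experts from those groups. This is an illustrative group-limited top-k in the `n_group`/`topk_group` style; the function name, shapes, and exact scoring are assumptions, not the model's actual routing code.

```python
import torch

def group_limited_topk(scores, n_group, topk_group, top_k):
    """Group-based routing sketch: restrict top-k selection to the best
    `topk_group` of `n_group` expert groups per token (illustrative only)."""
    num_tokens, num_experts = scores.shape
    experts_per_group = num_experts // n_group
    # score each group by its best expert
    group_scores = scores.view(num_tokens, n_group, experts_per_group).max(dim=-1).values
    group_idx = group_scores.topk(topk_group, dim=-1).indices
    group_mask = torch.zeros_like(group_scores).scatter_(1, group_idx, 1.0)
    # expand the group mask back to per-expert granularity
    score_mask = group_mask.repeat_interleave(experts_per_group, dim=-1)
    masked = scores.masked_fill(score_mask == 0, float("-inf"))
    topk_weights, topk_indices = masked.topk(top_k, dim=-1)
    return topk_weights, topk_indices

scores = torch.rand(5, 8)  # 5 tokens, 8 experts
weights, indices = group_limited_topk(scores, n_group=4, topk_group=2, top_k=2)
```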
Solution

Add a `CalibrationGlm4MoeLiteMoE` wrapper class that:

- Targets the `Glm4MoeLiteMoE` class specifically
- Implements `route_tokens_to_experts()` for proper group-based routing
- Provides a calibration `forward()` that routes through all experts
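The "routes through all experts" calibration behavior can be sketched like this: every expert runs on every token so quantization observers record activations for all experts, while only the router-selected outputs contribute to the result, keeping numerics equivalent to normal routing. The function name and signature are illustrative assumptions, not the wrapper's actual API.

```python
import torch
import torch.nn as nn

def calibration_moe_forward(hidden_states, experts, topk_indices, topk_weights):
    """Calibration-time MoE forward (sketch): run EVERY expert on every token
    so observers see all expert activations, but weight only routed outputs."""
    out = torch.zeros_like(hidden_states)
    for expert_id, expert in enumerate(experts):
        expert_out = expert(hidden_states)  # all tokens pass through, for stats
        token_idx, slot = torch.where(topk_indices == expert_id)
        if token_idx.numel() > 0:
            routed_w = topk_weights[token_idx, slot].unsqueeze(-1)
            out[token_idx] += routed_w * expert_out[token_idx]
    return out

experts = nn.ModuleList(nn.Linear(4, 4) for _ in range(3))
x = torch.randn(6, 4)
router_probs = torch.randn(6, 3).softmax(dim=-1)
topk_weights, topk_indices = router_probs.topk(2, dim=-1)
y = calibration_moe_forward(x, experts, topk_indices, topk_weights)
```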
Changes

- `src/llmcompressor/modeling/glm4_moe_lite.py` - new 119-line wrapper
- `src/llmcompressor/modeling/__init__.py` - import the new wrapper

Testing
Verified that quantization now shows:
Instead of the previous behavior, where no MoE modules were detected for `Glm4MoeLiteMoE`.

Hardware Tested
Summary by CodeRabbit
New Features
- Added `CalibrationGlm4MoeLiteMoE` and `CalibrationOffsetNorm` modules to public exports

Tests