Add MoE calibration wrapper for GLM-4.7-Flash (Glm4MoeLiteMoE)#2547
Nottlespike wants to merge 3 commits into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Code Review
This pull request introduces a calibration wrapper for GLM-4.7-Flash models to ensure all experts are properly calibrated during quantization, preventing suboptimal expert weight quantization. The feedback identifies a potential shape mismatch in the routing logic that could occur if input logits are not flattened and suggests a refactor to the forward method to eliminate redundant code execution paths.
Pull request overview
Adds MoE calibration support for GLM-4.7-Flash models by introducing a dedicated calibration wrapper for the Glm4MoeLiteMoE architecture, ensuring expert activation statistics are collected instead of silently skipping MoE calibration.
Changes:
- Added `CalibrationGlm4MoeLiteMoE` wrapper with GLM-4.7-Flash group-based routing and "all experts see tokens" calibration behavior.
- Registered the new wrapper via `llmcompressor.modeling` package import side effects.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/llmcompressor/modeling/glm4_moe_lite.py | New MoE calibration wrapper for Glm4MoeLiteMoE, including routing + calibration-only all-expert pass. |
| src/llmcompressor/modeling/__init__.py | Imports the new wrapper to trigger registry registration. |
Force-pushed from d2c0afa to e460938
brian-dellabetta left a comment
Hi @Nottlespike , thanks for preparing this. It looks like a lot of the code is shared with what is in src/llmcompressor/modeling/glm_moe_dsa.py. Have you explored what it would look like to import and subclass the classes from that file directly? I know transformers sticks to the approach of no shared code across model definitions, but given we want to apply the same operation to the 3D expert tensors in both, maybe we won't have to repeat our code
Address review feedback from brian-dellabetta on PR vllm-project#2547: the routing and forward logic in glm4_moe_lite.py was identical to glm_moe_dsa.py. Refactored to inherit from CalibrationGlmMoeDsaMoE using the template method pattern (_get_num_experts, _make_experts).

- glm_moe_dsa.py: add _get_num_experts() and _make_experts() factory methods; __init__ now calls them instead of hardcoding types
- glm4_moe_lite.py: subclass CalibrationGlmMoeDsaMoE, override only the two factory methods; drop duplicated route_tokens_to_experts(), forward(), and redundant restore()
- __init__.py: isort-ordered imports

Net: -64 lines, zero behavior change.
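The template-method refactor this commit describes can be sketched as follows. The class and method names come from the commit message; the config handling and expert containers here are simplified placeholders for illustration, not the actual llm-compressor code.

```python
import torch.nn as nn


class CalibrationGlmMoeDsaMoE(nn.Module):
    """Base calibration wrapper (sketch): the shared routing/forward logic
    would live here; subclasses customize only the two factory methods."""

    def __init__(self, original, config):
        super().__init__()
        self.num_experts = self._get_num_experts(config)
        self.experts = self._make_experts(original, config)

    # --- template methods a subclass may override ---
    def _get_num_experts(self, config):
        return config["n_routed_experts"]

    def _make_experts(self, original, config):
        # placeholder expert container; the real class unpacks `original`
        return nn.ModuleList(nn.Identity() for _ in range(self.num_experts))


class CalibrationGlm4MoeLiteMoE(CalibrationGlmMoeDsaMoE):
    """GLM-4.7-Flash variant: inherits routing/forward, overrides factories."""

    def _make_experts(self, original, config):
        return nn.ModuleList(
            nn.Linear(config["hidden_size"], config["hidden_size"], bias=False)
            for _ in range(self.num_experts)
        )


cfg = {"n_routed_experts": 4, "hidden_size": 8}
lite = CalibrationGlm4MoeLiteMoE(original=None, config=cfg)
```

Because `__init__` only ever calls the two factory methods, the subclass carries no duplicated routing or forward code, which is the -64-line saving the commit reports.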
Walkthrough

This PR adds GLM4 MoE Lite model calibration support by introducing a specialized calibration module that unpacks routed expert parameters, refactoring the base GLM MoE DSA class to support expert customization through overridable helper methods, and adding comprehensive GPU-based tests to verify behavior.
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant CalibrationGlm4MoeLiteMoE as Calibration<br/>Wrapper
    participant SequentialGlm4MoeLiteExperts as Expert<br/>Unpacker
    participant ExpertMLPs as Expert<br/>Modules
    User->>CalibrationGlm4MoeLiteMoE: Initialize with<br/>original model
    CalibrationGlm4MoeLiteMoE->>SequentialGlm4MoeLiteExperts: Create expert container<br/>via _make_experts()
    SequentialGlm4MoeLiteExperts->>SequentialGlm4MoeLiteExperts: Extract expert count<br/>via _get_num_experts()
    SequentialGlm4MoeLiteExperts->>ExpertMLPs: Unpack weights from<br/>original experts
    ExpertMLPs->>ExpertMLPs: Slice gate_up_proj tensors<br/>assign to Linear layers
    User->>CalibrationGlm4MoeLiteMoE: Forward pass during<br/>calibration
    CalibrationGlm4MoeLiteMoE->>ExpertMLPs: Route tokens via<br/>inherited logic
    ExpertMLPs-->>User: Return calibrated<br/>outputs
```
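The "unpack weights / slice gate_up_proj" step in the diagram can be sketched like this. Shapes and the fused gate/up layout are assumptions for illustration; the real wrapper reads these from the original `Glm4MoeLiteMoE` module.

```python
import torch
import torch.nn as nn

# Hypothetical fused expert parameters: dim 0 indexes experts, and each
# expert's gate and up projections are stacked along the rows of gate_up_proj.
num_experts, intermediate, hidden = 4, 16, 8
gate_up_proj = torch.randn(num_experts, 2 * intermediate, hidden)
down_proj = torch.randn(num_experts, hidden, intermediate)

experts = nn.ModuleList()
for i in range(num_experts):
    mlp = nn.Module()
    mlp.gate_proj = nn.Linear(hidden, intermediate, bias=False)
    mlp.up_proj = nn.Linear(hidden, intermediate, bias=False)
    mlp.down_proj = nn.Linear(intermediate, hidden, bias=False)
    gate_w, up_w = gate_up_proj[i].chunk(2, dim=0)  # split fused gate/up rows
    with torch.no_grad():
        mlp.gate_proj.weight.copy_(gate_w.contiguous())
        mlp.up_proj.weight.copy_(up_w.contiguous())
        mlp.down_proj.weight.copy_(down_proj[i].contiguous())
    experts.append(mlp)
# every expert now exposes plain nn.Linear modules, which is what makes them
# visible to targets="Linear" quantization
```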
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Thanks for the feedback @brian-dellabetta — great call on the shared code. Refactored `CalibrationGlm4MoeLiteMoE` to subclass `CalibrationGlmMoeDsaMoE`, overriding only the two factory methods.
Also addressed the automated review feedback:
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from ce4cddf to cb2a5b6
Address review feedback from brian-dellabetta on PR vllm-project#2547: the routing and forward logic in glm4_moe_lite.py was identical to glm_moe_dsa.py. Refactored to inherit from CalibrationGlmMoeDsaMoE using the template method pattern (_get_num_experts, _make_experts).

- glm_moe_dsa.py: add _get_num_experts() and _make_experts() factory methods; __init__ now calls them instead of hardcoding types
- glm4_moe_lite.py: subclass CalibrationGlmMoeDsaMoE, override only the two factory methods; drop duplicated route_tokens_to_experts(), forward(), and redundant restore()
- __init__.py: isort-ordered imports

Net: -64 lines, zero behavior change.

Signed-off-by: Jason Lu <Nottlespike@users.noreply.github.com>
Actionable comments posted: 1
🧹 Nitpick comments (2)
tests/llmcompressor/modeling/test_calib_glm4_moe_lite.py (1)

110-112: Compute the shared-expert Linear expectation from config.

Hardcoding `expected_shared_linears = 3` makes the test brittle. Derive it from `config.n_shared_experts`.

♻️ Suggested change

```diff
- expected_shared_linears = 3
+ expected_shared_linears = config.n_shared_experts * 3
```
src/llmcompressor/modeling/glm4_moe_lite.py (1)
68-78: Avoid `.data` aliasing when copying expert weights.

Using `.data` for parameter transfer bypasses autograd safety checks and can cause subtle state issues during future refactors. The codebase already uses the safer pattern elsewhere (e.g., gpt_oss.py): prefer `torch.no_grad()` with `copy_()`.

♻️ Proposed safer copy pattern

```diff
-        gate_up_data = original.gate_up_proj.data
-        down_data = original.down_proj.data
+        gate_up_data = original.gate_up_proj
+        down_data = original.down_proj
         for i in range(self.num_experts):
             gate_up = gate_up_data[i]
             down = down_data[i]
             gate_proj, up_proj = gate_up.chunk(2, dim=0)
-            self[i].gate_proj.weight.data = gate_proj.contiguous()
-            self[i].up_proj.weight.data = up_proj.contiguous()
-            self[i].down_proj.weight.data = down.contiguous()
+            with torch.no_grad():
+                self[i].gate_proj.weight.copy_(gate_proj.contiguous())
+                self[i].up_proj.weight.copy_(up_proj.contiguous())
+                self[i].down_proj.weight.copy_(down.contiguous())
```
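A small standalone demonstration of why this review prefers `copy_()` over `.data` rebinding: assigning to `.data` aliases the source tensor's storage, while an in-place copy under `no_grad()` leaves the Parameter's own storage intact. The module and tensor names here are invented for the demo.

```python
import torch
import torch.nn as nn

src = torch.arange(6.0).reshape(2, 3)

# Anti-pattern flagged above: rebinding .data aliases the source tensor's
# storage and bypasses autograd's version/safety tracking.
aliased = nn.Linear(3, 2, bias=False)
aliased.weight.data = src

# Safer pattern: in-place copy under no_grad keeps the Parameter's own
# storage, so later mutation of `src` cannot leak into the module.
copied = nn.Linear(3, 2, bias=False)
with torch.no_grad():
    copied.weight.copy_(src)

src[0, 0] = 99.0
print(aliased.weight[0, 0].item())  # 99.0 (shares storage with src)
print(copied.weight[0, 0].item())   # 0.0 (independent copy)
```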
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/llmcompressor/modeling/test_calib_glm4_moe_lite.py`:
- Around line 55-56: The hook function hook_fn currently uses a parameter named
input which shadows the built-in input() (Ruff A002); rename that parameter to a
non-built-in name (e.g., _inputs or inputs) in the hook_fn signature and update
any uses inside the function accordingly so the signature reads hook_fn(i,
module, _inputs, output) and the body references the new name; ensure any
registration/calls that pass that argument remain compatible with the new
parameter name.
---
Nitpick comments:
In `@src/llmcompressor/modeling/glm4_moe_lite.py`:
- Around line 68-78: The code is using .data to read and write parameter tensors
(gate_up_data, down_data and assignments to self[i].gate_proj.weight.data /
up_proj.weight.data / down_proj.weight.data), which bypasses autograd; replace
this with a torch.no_grad() block and use tensor.copy_() to copy values safely
(retain .contiguous() where needed) — e.g., read original.gate_up_proj.weight
and original.down_proj.weight (not .data), split gate_up via chunk, then inside
torch.no_grad() call self[i].gate_proj.weight.copy_(gate_proj.contiguous()),
self[i].up_proj.weight.copy_(up_proj.contiguous()), and
self[i].down_proj.weight.copy_(down.contiguous()) so autograd/state tracking is
preserved.
In `@tests/llmcompressor/modeling/test_calib_glm4_moe_lite.py`:
- Around line 110-112: The test currently hardcodes expected_shared_linears = 3
which is brittle; change it to compute the expectation from the configuration by
using config.n_shared_experts (e.g., set expected_shared_linears =
config.n_shared_experts or expected_shared_linears = config.n_shared_experts *
1) so the assertion comparing len(linear_names) to expected_expert_linears +
expected_shared_linears derives the shared-linear count from
config.n_shared_experts instead of a hardcoded literal.
📒 Files selected for processing (4)
- src/llmcompressor/modeling/__init__.py
- src/llmcompressor/modeling/glm4_moe_lite.py
- src/llmcompressor/modeling/glm_moe_dsa.py
- tests/llmcompressor/modeling/test_calib_glm4_moe_lite.py
Worth noting: the safer `copy_()`-based parameter transfer lands in the shared base class. This is especially timely given the recent release of GLM-5.1, which uses the DSA MoE architecture. Any future calibration runs on GLM-5.1 (or other DSA-family models) will get the safer parameter transfer out of the box.
brian-dellabetta left a comment
Thanks @Nottlespike for the updates, this is looking good. Just one nit on the change to import ordering in `__init__.py`.
Add CalibrationGlm4MoeLiteMoE that subclasses CalibrationGlmMoeDsaMoE, overriding only the _get_num_experts() and _make_experts() factory methods. Unpacks 3D expert parameters into individual nn.Linear modules so they are visible to targets="Linear" quantization.

Also improves the base class (glm_moe_dsa.py):
- Add _get_num_experts() / _make_experts() template methods for subclassing
- Replace .data aliasing with torch.no_grad() + copy_() for safer parameter transfer (benefits all DSA-family models including GLM-5.1)

Tests: 3 GPU tests covering expert triggering, output correctness, and structural properties.

Signed-off-by: Jason Lu <Nottlespike@users.noreply.github.com>
Force-pushed from 5a43b30 to 0a07c50
Fixed — restored the upstream import ordering in `__init__.py`.
Could a maintainer add the `ready` label?
brian-dellabetta left a comment
Thanks for the contribution! lgtm
Summary

GLM-4.7-Flash uses a separate MoE class (`Glm4MoeLiteMoE`) that is not covered by the existing `Glm4MoeMoE` wrapper. Without this fix, MoE calibration is silently skipped for GLM-4.7-Flash models, resulting in quantization that doesn't properly calibrate expert weights.

Problem

The model `zai-org/GLM-4.7-Flash` (31B MoE) uses `Glm4MoeLiteMoE`, which has a different architecture than `Glm4MoeMoE`:

- `shared_experts` attribute (not `shared_expert`)
- `Glm4MoeLiteNaiveMoe` experts interface: `(hidden_states, topk_indices, topk_weights)`
- Group-based routing with `n_group`, `topk_group` parameters

When quantizing with NVFP4, the MoE calibration context manager checks for registered wrappers, but `Glm4MoeLiteMoE` doesn't match `Glm4MoeMoE`, so calibration silently falls back to standard forward passes without collecting expert activation statistics.
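The group-based routing mentioned above can be sketched as follows: score expert groups, keep only the best `topk_group` groups per token, then take the top-k experts from those groups. This is an illustrative group-limited top-k in the `n_group`/`topk_group` style; the function name, shapes, and exact scoring are assumptions, not the model's actual routing code.

```python
import torch

def group_limited_topk(scores, n_group, topk_group, top_k):
    """Group-based routing sketch: restrict top-k selection to the best
    `topk_group` of `n_group` expert groups per token (illustrative only)."""
    num_tokens, num_experts = scores.shape
    experts_per_group = num_experts // n_group
    # score each group by its best expert
    group_scores = scores.view(num_tokens, n_group, experts_per_group).max(dim=-1).values
    group_idx = group_scores.topk(topk_group, dim=-1).indices
    group_mask = torch.zeros_like(group_scores).scatter_(1, group_idx, 1.0)
    # expand the group mask back to per-expert granularity
    score_mask = group_mask.repeat_interleave(experts_per_group, dim=-1)
    masked = scores.masked_fill(score_mask == 0, float("-inf"))
    topk_weights, topk_indices = masked.topk(top_k, dim=-1)
    return topk_weights, topk_indices

scores = torch.rand(5, 8)  # 5 tokens, 8 experts
weights, indices = group_limited_topk(scores, n_group=4, topk_group=2, top_k=2)
```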
Solution

Add a `CalibrationGlm4MoeLiteMoE` wrapper class that:

- Targets the `Glm4MoeLiteMoE` class specifically
- Implements `route_tokens_to_experts()` for proper group-based routing
- Provides a calibration `forward()` that routes through all experts
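The "routes through all experts" calibration behavior can be sketched like this: every expert runs on every token so quantization observers record activations for all experts, while only the router-selected outputs contribute to the result, keeping numerics equivalent to normal routing. The function name and signature are illustrative assumptions, not the wrapper's actual API.

```python
import torch
import torch.nn as nn

def calibration_moe_forward(hidden_states, experts, topk_indices, topk_weights):
    """Calibration-time MoE forward (sketch): run EVERY expert on every token
    so observers see all expert activations, but weight only routed outputs."""
    out = torch.zeros_like(hidden_states)
    for expert_id, expert in enumerate(experts):
        expert_out = expert(hidden_states)  # all tokens pass through, for stats
        token_idx, slot = torch.where(topk_indices == expert_id)
        if token_idx.numel() > 0:
            routed_w = topk_weights[token_idx, slot].unsqueeze(-1)
            out[token_idx] += routed_w * expert_out[token_idx]
    return out

experts = nn.ModuleList(nn.Linear(4, 4) for _ in range(3))
x = torch.randn(6, 4)
router_probs = torch.randn(6, 3).softmax(dim=-1)
topk_weights, topk_indices = router_probs.topk(2, dim=-1)
y = calibration_moe_forward(x, experts, topk_indices, topk_weights)
```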
Changes

- `src/llmcompressor/modeling/glm4_moe_lite.py` - new 119-line wrapper
- `src/llmcompressor/modeling/__init__.py` - import the new wrapper

Testing
Verified that quantization now shows:
Instead of the previous behavior, where no MoE modules were detected for `Glm4MoeLiteMoE`.

Hardware Tested
Summary by CodeRabbit
New Features
- Added `CalibrationGlm4MoeLiteMoE` and `CalibrationOffsetNorm` modules to public exports

Tests