Add MXFP4 packed export, precision-aware scorer, and AWQ/GPTQ support #1
Open
Conversation
Add MXFP4 packed export, precision-aware scorer, and AWQ/GPTQ support

- Add MXFP4 packing utilities (mxfp4_pack.py): FP4 E2M1 + E8M0 scale format achieving a 3.76x compression ratio for OCP MX specification compliance (see the sketch below)
- Fix sensitivity scorer to use the actual target precision (OCP_MXFP4Spec) instead of a hardcoded INT4 proxy, reducing false exclusions from 68% to 5% of layers
- Wire AWQ/GPTQ algorithm support via LLMTemplate.get_config(algorithm=...)
- Add pack_mxfp4 config flag to control packed vs. BF16 export
- Add solar_open model type mapping to the qwen3_moe template
- Fix Int4PerGroupSpec fallback for Quark 0.11.1 compatibility (ch_axis param)

Tested on MI355 (gfx950) with Solar-Open-100B:
- Packed checkpoint: 53GB (vs. 192GB original, 3.62x compression)
- MMLU: 76.14% (-1.44% from baseline 77.58%)
- KMMLU: 57.03% (-0.35% from baseline 57.38%)
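For reference, here is a minimal sketch of how an FP4 E2M1 + E8M0 packing routine of this shape typically works. This is an illustrative reconstruction, not the mxfp4_pack.py from this PR; the group size of 32 and the function name are assumptions.

```python
import torch

# Positive half of the FP4 E2M1 grid; the sign lives in bit 3 of each code.
E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def pack_mxfp4_sketch(w: torch.Tensor, group_size: int = 32):
    """Pack a weight tensor into 4-bit codes plus per-group E8M0 scales."""
    w = w.float().reshape(-1, group_size)
    # E8M0 shared scale: a power of two chosen so the group max maps to
    # 6.0, the largest representable E2M1 magnitude.
    amax = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-30)
    exp = torch.floor(torch.log2(amax / 6.0))
    scale = torch.exp2(exp)
    # Round each scaled value to the nearest E2M1 grid point.
    q = (w / scale).clamp(-6.0, 6.0)
    idx = (q.abs().unsqueeze(-1) - E2M1_GRID).abs().argmin(dim=-1)
    codes = (idx | ((q < 0).long() << 3)).to(torch.uint8)
    # Two 4-bit codes per uint8: even elements go in the low nibble.
    packed = codes[:, 0::2] | (codes[:, 1::2] << 4)
    return packed, (exp.squeeze(-1) + 127).to(torch.uint8)  # biased E8M0
```

At 4 bits per weight plus one shared 8-bit scale per 32 elements, the effective width is 4.25 bits, i.e. 16 / 4.25 ≈ 3.76x versus BF16, which matches the compression ratio quoted above.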
Code Review
This pull request introduces support for MXFP4 weight packing, which significantly reduces model size by using FP4 E2M1 encoding with E8M0 shared scales. Key changes include the addition of MXFP4 packing utilities, updates to the UnifiedConfig to support different quantization algorithms (RTN, AWQ, GPTQ), and enhancements to the sensitivity analyzer to use target-specific precisions during scoring. Review feedback focuses on improving memory efficiency during shard processing to avoid OOM errors, correcting the scale computation logic to prevent clipping, replacing hardcoded group sizes with constants, and implementing safer file-writing patterns to prevent checkpoint corruption.
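On the last point, a minimal sketch of the write pattern being requested (atomic_save and save_fn are illustrative names, not the project's API):

```python
import os

def atomic_save(save_fn, path: str) -> None:
    """Write to a temp file, then rename into place, so an interrupted
    run never leaves a truncated checkpoint shard behind."""
    tmp = path + ".tmp"
    save_fn(tmp)           # e.g. lambda p: save_file(tensors, p)
    os.replace(tmp, path)  # atomic on POSIX within one filesystem
```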
Replace custom mxfp4_pack.py with Quark's quantize_model_per_safetensor for MXFP4 quantization. This produces properly packed uint8 weights that vLLM loads natively as a Quark-quantized checkpoint.

- Add _run_file2file_quantization() using Quark's file2file path
- Route MXFP precision to file2file in the run() dispatch
- Remove custom mxfp4_pack.py and the pack_mxfp4 config option
- Resolve HF hub IDs to local paths for file2file compatibility

Solar-Open-100B: 192GB → 53GB (3.62x), 73s quantization time. Matches AMD's official MXFP4 model format (Kimi-K2.5-MXFP4).
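The hub-ID resolution in the last bullet might look roughly like this (a sketch; _resolve_local_path is an illustrative name, and the call into Quark itself is omitted since its exact signature isn't shown here):

```python
import os
from huggingface_hub import snapshot_download

def _resolve_local_path(model_id_or_path: str) -> str:
    """file2file quantization reads safetensors shards from disk, so an
    HF hub ID must first be materialized as a local snapshot."""
    if os.path.isdir(model_id_or_path):
        return model_id_or_path
    return snapshot_download(model_id_or_path)
```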
vLLM fuses certain projections into single linear layers (qkv_proj, gate_up_proj), requiring all members to share the same quantization scheme. Add _align_exclude_groups() to ensure that if any projection in a fused group is excluded, the entire group is excluded together; a sketch of the alignment logic follows the list below.

Fused groups handled:
- self_attn: q_proj + k_proj + v_proj
- mlp: gate_proj + up_proj
- mlp.shared_experts: gate_proj + up_proj

Solar-Open-100B: 16 → 32 excluded layers after alignment.
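A minimal sketch of what the alignment could look like (the group table matches the commit message; the set-based matching is an illustrative reconstruction, not the actual _align_exclude_groups() implementation):

```python
FUSED_GROUPS = [
    ("self_attn", ("q_proj", "k_proj", "v_proj")),     # fused into qkv_proj
    ("mlp.shared_experts", ("gate_proj", "up_proj")),
    ("mlp", ("gate_proj", "up_proj")),                 # fused into gate_up_proj
]

def align_exclude_groups(excluded: set[str]) -> set[str]:
    """If one member of a fused projection is excluded, exclude the whole
    group so vLLM sees a uniform quantization scheme per fused layer."""
    aligned = set(excluded)
    for name in excluded:
        prefix, _, leaf = name.rpartition(".")
        for parent, members in FUSED_GROUPS:
            if prefix.endswith(parent) and leaf in members:
                aligned.update(f"{prefix}.{m}" for m in members)
                break
    return aligned
```

For example, excluding only model.layers.0.self_attn.q_proj would pull in k_proj and v_proj as well, which is how the exclusion count grows after alignment as noted above.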
MoE router gates (*.gate, not gate_proj) must be excluded from MXFP4 quantization because vLLM's SolarOpenTopkRouter uses a regular nn.Linear, which cannot load packed uint8 weights.

Also update CLAUDE.md with documentation for the file2file quantization path, AWQ/GPTQ support, and fused layer alignment.

Verified: Solar-Open-100B MXFP4 checkpoint loads and runs inference on vLLM (MI355, TP=1, 53GB checkpoint).
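Telling the two suffixes apart is the subtle part, since a naive substring match on "gate" would also hit gate_proj. A sketch (is_router_gate is an illustrative helper, not the project's API):

```python
import re

ROUTER_GATE = re.compile(r"\.gate$")  # matches *.gate, but not *.gate_proj

def is_router_gate(layer_name: str) -> bool:
    return ROUTER_GATE.search(layer_name) is not None

assert is_router_gate("model.layers.0.mlp.gate")
assert not is_router_gate("model.layers.0.mlp.experts.0.gate_proj")
```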
- Add JSON config fallback in _setup() and detect_model_type() when AutoConfig fails for models not yet in transformers (e.g., exaone4_5); see the sketch after this list
- Add graceful tokenizer fallback when AutoTokenizer fails
- Add EXAONE model type mappings (exaone, exaone4_5, exaone4_5_text → llama)
- Keep auto-strategy detection intact for non-MXFP paths
- For multimodal models, merge text_config into the top-level config

Tested: EXAONE-4.5-33B MXFP4 quantization (64GB → 20GB, 3.2x, 20s)
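A rough sketch of how the config fallback and the text_config merge might fit together (the function name and exception set are assumptions; the real _setup() may differ):

```python
import json
import os
from transformers import AutoConfig

def load_config_with_fallback(model_path: str) -> dict:
    try:
        cfg = AutoConfig.from_pretrained(model_path, trust_remote_code=True).to_dict()
    except (KeyError, ValueError, OSError):
        # Model type not yet registered in transformers (e.g. exaone4_5):
        # read the raw JSON config instead.
        with open(os.path.join(model_path, "config.json")) as f:
            cfg = json.load(f)
    # Multimodal checkpoints nest LM settings under text_config; merge
    # them into the top level so downstream code sees a flat config.
    return {**cfg, **cfg.get("text_config", {})}
```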
- Add main() to auto_quantize.py with a full argparse CLI (--model_path, --precision, --exclude_layers_file, etc.)
- Fix __main__.py dispatcher to pass args through to quantization mode
- Add kimi_k2/kimi_k25 model type mapping in constants.py
- Update CLAUDE.md with CLI usage examples
- Remove project structure section from README.md
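The entry point presumably looks something like this (flag names come from the commit message; everything past parse_args is a placeholder for the real quantization driver):

```python
import argparse

def main() -> None:
    parser = argparse.ArgumentParser(description="Auto-quantize a model checkpoint")
    parser.add_argument("--model_path", required=True, help="HF hub ID or local path")
    parser.add_argument("--precision", default="mxfp4", help="Target precision, e.g. mxfp4")
    parser.add_argument("--exclude_layers_file", default=None,
                        help="Optional file listing layers to keep in high precision")
    args = parser.parse_args()
    # The real main() hands these to the quantization driver; print here
    # just to keep the sketch self-contained.
    print(vars(args))

if __name__ == "__main__":
    main()
```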
Summary
- Sensitivity scorer now uses the actual target precision (OCP_MXFP4Spec), reducing false layer exclusions from 68% to 5%
- algorithm="awq"|"gptq" config option passed to LLMTemplate.get_config()
- solar_open → qwen3_moe model type mapping
- Int4PerGroupSpec fallback with ch_axis parameter

Test Results (MI355 gfx950, Solar-Open-100B)
Test plan