Add MXFP4 packed export, precision-aware scorer, and AWQ/GPTQ support #1

Open

haanjack wants to merge 8 commits into main from feature/mxfp4-packed-export

Conversation

@haanjack (Owner)

Summary

  • MXFP4 packed export: OCP MX FP4 (E2M1) + E8M0 scale format achieving 3.62x compression (192GB → 53GB on Solar-Open-100B); a packing sketch follows this list
  • MXFP4-aware sensitivity scorer: Replaces hardcoded INT4 proxy with OCP_MXFP4Spec, reducing false layer exclusions from 68% to 5%
  • AWQ/GPTQ wiring: algorithm="awq"|"gptq" config option passed to LLMTemplate.get_config()
  • Solar-Open-100B support: Added solar_open → qwen3_moe model type mapping
  • Quark 0.11.1 compatibility: Fixed Int4PerGroupSpec fallback with ch_axis parameter
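As a reference for the packed format, here is a minimal, self-contained sketch of FP4 E2M1 + shared E8M0 packing for one 32-element group. The function and constant names (`pack_mxfp4_group`, `FP4_VALUES`, `GROUP_SIZE`) are illustrative and not the API of this PR's `mxfp4_pack.py`, and the scale selection is a simple power-of-two fit rather than whatever policy the PR implements.

```python
# Illustrative sketch of OCP MXFP4 packing (FP4 E2M1 elements + one shared
# E8M0 power-of-two scale per 32-element group). Names and scale policy are
# assumptions for illustration, not the mxfp4_pack.py API from this PR.
import numpy as np

GROUP_SIZE = 32
# All 16 E2M1 code points; the index doubles as the 4-bit code (bit 3 = sign).
FP4_VALUES = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0], dtype=np.float32)

def pack_mxfp4_group(x: np.ndarray) -> tuple[np.ndarray, np.uint8]:
    """Quantize one group of 32 floats to 16 packed uint8 bytes + E8M0 scale."""
    amax = float(np.abs(x).max())
    # Power-of-two scale so the largest magnitude fits within the FP4 max (6.0).
    exp = 0 if amax == 0.0 else int(np.ceil(np.log2(amax / 6.0)))
    scale = np.float32(2.0 ** exp)
    # Round each scaled element to the nearest representable E2M1 value.
    codes = np.argmin(np.abs(x[:, None] / scale - FP4_VALUES[None, :]), axis=1)
    codes = codes.astype(np.uint8)
    packed = ((codes[1::2] << 4) | codes[0::2]).astype(np.uint8)  # two nibbles per byte
    return packed, np.uint8(exp + 127)                            # E8M0 = biased exponent

def unpack_mxfp4_group(packed: np.ndarray, scale_e8m0: np.uint8) -> np.ndarray:
    """Inverse of pack_mxfp4_group (dequantize back to float32)."""
    codes = np.empty(packed.size * 2, dtype=np.int64)
    codes[0::2] = packed & 0x0F
    codes[1::2] = packed >> 4
    scale = np.float32(2.0 ** (int(scale_e8m0) - 127))
    return FP4_VALUES[codes] * scale

# Roundtrip check of the kind the test plan describes:
x = np.random.randn(GROUP_SIZE).astype(np.float32)
x_hat = unpack_mxfp4_group(*pack_mxfp4_group(x))
print("max abs error:", float(np.abs(x - x_hat).max()))
```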

Test Results (MI355 gfx950, Solar-Open-100B)

Benchmark     Baseline   MXFP4 (new scorer)   Delta
MMLU          77.58%     76.14%               -1.44%
KMMLU         57.38%     57.03%               -0.35%
Checkpoint    192 GB     53 GB                3.62x compression

Test plan

  • MXFP4 pack/unpack roundtrip tests (7 tests passed on remote)
  • Compression ratio verification (3.76x theoretical, 3.62x achieved; see the arithmetic after this list)
  • MMLU + KMMLU evaluation on MI355
  • AWQ algorithm end-to-end test
  • vLLM packed MXFP4 weight loader
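The theoretical/achieved numbers above are easy to reproduce: 4-bit elements plus one 8-bit E8M0 scale per 32-element group cost 4.25 bits per weight versus 16 bits for BF16. The shortfall from 3.76x to 3.62x is plausibly due to excluded (still-BF16) layers and unpacked tensors; that attribution is an inference, not something stated in the PR.

```python
# Back-of-the-envelope check of the MXFP4 compression ratio vs. BF16.
BF16_BITS = 16
FP4_BITS = 4
SCALE_BITS = 8        # one shared E8M0 scale...
GROUP_SIZE = 32       # ...per 32-element OCP MX group

bits_per_weight = FP4_BITS + SCALE_BITS / GROUP_SIZE     # = 4.25 bits
theoretical = BF16_BITS / bits_per_weight                # ≈ 3.76x
achieved = 192 / 53                                      # ≈ 3.62x (checkpoint sizes in GB)
print(f"theoretical {theoretical:.2f}x vs achieved {achieved:.2f}x")
```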

…GPTQ support

- Add MXFP4 packing utilities (mxfp4_pack.py): FP4 E2M1 + E8M0 scale format
  achieving 3.76x compression ratio for OCP MX specification compliance
- Fix sensitivity scorer to use actual target precision (OCP_MXFP4Spec) instead
  of hardcoded INT4 proxy, reducing false exclusions from 68% to 5% of layers
  (see the spec-selection sketch after this commit message)
- Wire AWQ/GPTQ algorithm support via LLMTemplate.get_config(algorithm=...)
- Add pack_mxfp4 config flag to control packed vs BF16 export
- Add solar_open model type mapping to qwen3_moe template
- Fix Int4PerGroupSpec fallback for Quark 0.11.1 compatibility (ch_axis param)

Tested on MI355 (gfx950) with Solar-Open-100B:
  Packed checkpoint: 53GB (vs 192GB original, 3.62x compression)
  MMLU: 76.14% (-1.44% from baseline 77.58%)
  KMMLU: 57.03% (-0.35% from baseline 57.38%)
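A minimal sketch of what the scorer change above amounts to: pick the scoring spec from the target precision rather than always using an INT4 proxy. `OCP_MXFP4Spec`, `Int4PerGroupSpec`, the `ch_axis` argument, and `LLMTemplate.get_config(algorithm=...)` are all named in this PR, but the stand-in classes and the helper below are illustrative, not Quark's actual definitions.

```python
# Illustrative stand-ins for the Quark spec classes named in this PR; the real
# classes live in Quark and their constructors may differ.
class OCP_MXFP4Spec:                      # stand-in for the MXFP4 target spec
    pass

class Int4PerGroupSpec:                   # stand-in; ch_axis required on Quark 0.11.1
    def __init__(self, group_size: int = 32, ch_axis: int = -1):
        self.group_size, self.ch_axis = group_size, ch_axis

def build_scoring_spec(target_precision: str):
    """Hypothetical helper: score layer sensitivity against the real target."""
    if target_precision.startswith("mxfp"):
        return OCP_MXFP4Spec()            # no more hardcoded INT4 proxy
    return Int4PerGroupSpec(group_size=32, ch_axis=-1)

# Algorithm wiring from the summary (argument name as described in the PR):
#   config = LLMTemplate.get_config(algorithm="awq")   # or "gptq"; default RTN
```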

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces support for MXFP4 weight packing, which significantly reduces model size by using FP4 E2M1 encoding with E8M0 shared scales. Key changes include the addition of MXFP4 packing utilities, updates to the UnifiedConfig to support different quantization algorithms (RTN, AWQ, GPTQ), and enhancements to the sensitivity analyzer to use target-specific precisions during scoring. Review feedback focuses on improving memory efficiency during shard processing to avoid OOM errors, correcting the scale computation logic to prevent clipping, replacing hardcoded group sizes with constants, and implementing safer file-writing patterns to prevent checkpoint corruption.
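On the reviewer's "safer file-writing patterns" point: the usual fix is to write each shard to a temporary file and promote it with an atomic rename, so an interrupted export never leaves a truncated file at the final path. A generic sketch, not the reviewer's exact suggestion:

```python
import os
import tempfile

def write_shard_atomically(data: bytes, dest_path: str) -> None:
    """Write a checkpoint shard via temp file + atomic rename (generic sketch)."""
    dest_dir = os.path.dirname(dest_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())          # ensure bytes reach disk before rename
        os.replace(tmp_path, dest_path)   # atomic within one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```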

Outdated comment threads: src/quanto/core/unified_quantizer.py (3), src/quanto/utils/mxfp4_pack.py (1)

Replace custom mxfp4_pack.py with Quark's quantize_model_per_safetensor
for MXFP4 quantization. This produces properly packed uint8 weights that
vLLM loads natively as a Quark-quantized checkpoint.

- Add _run_file2file_quantization() using Quark's file2file path
- Route MXFP precision to file2file in run() dispatch
- Remove custom mxfp4_pack.py and pack_mxfp4 config option
- Resolve HF hub IDs to local paths for file2file compatibility (resolution
  sketch after this commit message)

Solar-Open-100B: 192GB → 53GB (3.62x), 73s quantization time.
Matches AMD's official MXFP4 model format (Kimi-K2.5-MXFP4).
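The hub-ID resolution step above can be done with huggingface_hub's snapshot_download, which downloads (or reuses) a repo snapshot and returns its local directory. A generic sketch of that step, not necessarily the code in this commit:

```python
import os
from huggingface_hub import snapshot_download

def resolve_model_path(model_id_or_path: str) -> str:
    """Return a local directory usable by the file2file quantization path."""
    if os.path.isdir(model_id_or_path):
        return model_id_or_path                        # already a local checkout
    # Hub IDs like "org/model" are materialized into the local HF cache first.
    return snapshot_download(repo_id=model_id_or_path)
```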
vLLM fuses certain projections into single linear layers (qkv_proj,
gate_up_proj), requiring all members to share the same quantization
scheme. Add _align_exclude_groups() to ensure that if any projection
in a fused group is excluded, the entire group is excluded together.

Fused groups handled:
- self_attn: q_proj + k_proj + v_proj
- mlp: gate_proj + up_proj
- mlp.shared_experts: gate_proj + up_proj

Solar-Open-100B: 16 → 32 excluded layers after alignment.
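A hypothetical implementation of the alignment described above, expanding the exclude list so every member of a fused vLLM projection group is excluded together (which is what grows 16 exclusions into 32); the real `_align_exclude_groups()` may match names differently.

```python
# Hypothetical sketch of fused-group alignment; group members are module-name suffixes.
FUSED_GROUPS = [
    ("q_proj", "k_proj", "v_proj"),   # fused into qkv_proj by vLLM
    ("gate_proj", "up_proj"),         # fused into gate_up_proj by vLLM
]

def align_exclude_groups(excluded: list[str]) -> list[str]:
    """If any member of a fused group is excluded, exclude the whole group."""
    aligned = set(excluded)
    for name in excluded:
        for group in FUSED_GROUPS:
            for member in group:
                if name.endswith("." + member):
                    prefix = name[: -len(member)]      # e.g. "model.layers.3.self_attn."
                    aligned.update(prefix + m for m in group)
    return sorted(aligned)

# Example: excluding only k_proj pulls in q_proj and v_proj for that layer.
print(align_exclude_groups(["model.layers.3.self_attn.k_proj"]))
```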
MoE router gates (*.gate, not gate_proj) must be excluded from MXFP4
quantization because vLLM's SolarOpenTopkRouter uses regular nn.Linear
which cannot load packed uint8 weights.

Also update CLAUDE.md with file2file quantization path, AWQ/GPTQ
support, and fused layer alignment documentation.

Verified: Solar-Open-100B MXFP4 checkpoint loads and runs inference
on vLLM (MI355, TP=1, 53GB checkpoint).
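The distinction above, excluding router `*.gate` modules while still quantizing `gate_proj`, comes down to matching the exact module-name suffix. A small illustration, with module names assumed from the commit text:

```python
import re

# Router gates end in ".gate" exactly; "gate_proj" / "gate_up_proj" must not match.
ROUTER_GATE_RE = re.compile(r"\.gate$")

def is_router_gate(module_name: str) -> bool:
    """True for MoE router gates that must stay unquantized."""
    return bool(ROUTER_GATE_RE.search(module_name))

assert is_router_gate("model.layers.7.mlp.gate")                      # router: exclude
assert not is_router_gate("model.layers.7.mlp.experts.0.gate_proj")   # expert MLP: quantize
```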
- Add JSON config fallback in _setup() and detect_model_type() when
  AutoConfig fails for models not yet in transformers (e.g., exaone4_5);
  a fallback sketch follows this commit message
- Add graceful tokenizer fallback when AutoTokenizer fails
- Add EXAONE model type mappings (exaone, exaone4_5, exaone4_5_text → llama)
- Keep auto-strategy detection intact for non-MXFP paths
- For multimodal models, merge text_config into top-level config

Tested: EXAONE-4.5-33B MXFP4 quantization (64GB → 20GB, 3.2x, 20s)
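A generic sketch of the fallback described above: try AutoConfig first, then read config.json directly when the installed transformers does not know the model type, merging text_config for multimodal checkpoints. The exact exceptions caught and merge behavior in the PR may differ.

```python
import json
import os
from transformers import AutoConfig

def load_model_config(model_path: str) -> dict:
    """AutoConfig with a raw config.json fallback (illustrative sketch)."""
    try:
        return AutoConfig.from_pretrained(model_path, trust_remote_code=True).to_dict()
    except (ValueError, KeyError, OSError):
        # Model type not registered in this transformers version (e.g. exaone4_5):
        # fall back to the raw JSON on disk.
        with open(os.path.join(model_path, "config.json")) as f:
            cfg = json.load(f)
        if "text_config" in cfg:               # multimodal: promote text settings
            cfg = {**cfg, **cfg["text_config"]}
        return cfg
```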
- Add main() to auto_quantize.py with full argparse CLI (--model_path,
  --precision, --exclude_layers_file, etc.); a CLI sketch follows this list
- Fix __main__.py dispatcher to pass args through to quantization mode
- Add kimi_k2/kimi_k25 model type mapping in constants.py
- Update CLAUDE.md with CLI usage examples
- Remove project structure section from README.md
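A minimal sketch of the CLI shape listed above. The flag names come from the commit message; the help strings, defaults, and the final print are placeholders for the real dispatch into the quantization pipeline.

```python
import argparse

def main() -> None:
    parser = argparse.ArgumentParser(description="Auto-quantize a checkpoint (sketch).")
    parser.add_argument("--model_path", required=True, help="Local path or HF hub ID")
    parser.add_argument("--precision", default="mxfp4", help="Target precision, e.g. mxfp4")
    parser.add_argument("--exclude_layers_file", default=None,
                        help="Optional file listing layers to keep unquantized")
    args = parser.parse_args()
    # Placeholder: the real entry point dispatches into the quantization pipeline here.
    print(f"quantize {args.model_path} -> {args.precision} "
          f"(exclusions: {args.exclude_layers_file})")

if __name__ == "__main__":
    main()
```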