Add MXFP4 packed export, precision-aware scorer, and AWQ/GPTQ support #1

Open

haanjack wants to merge 8 commits into main from feature/mxfp4-packed-export

Conversation

@haanjack (Owner)

Summary

  • MXFP4 packed export: OCP MX FP4 (E2M1) + E8M0 scale format achieving 3.62x compression (192GB → 53GB on Solar-Open-100B); a packing sketch follows this list
  • MXFP4-aware sensitivity scorer: Replaces hardcoded INT4 proxy with OCP_MXFP4Spec, reducing false layer exclusions from 68% to 5%
  • AWQ/GPTQ wiring: algorithm="awq"|"gptq" config option passed to LLMTemplate.get_config()
  • Solar-Open-100B support: Added solar_open → qwen3_moe model type mapping
  • Quark 0.11.1 compatibility: Fixed Int4PerGroupSpec fallback with ch_axis parameter
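As a reference for the packed format, here is a minimal, self-contained sketch of FP4 E2M1 + shared E8M0 packing for one 32-element group. The function and constant names (`pack_mxfp4_group`, `FP4_VALUES`, `GROUP_SIZE`) are illustrative and not the API of this PR's `mxfp4_pack.py`, and the scale selection is a simple power-of-two fit rather than whatever policy the PR implements.

```python
# Illustrative sketch of OCP MXFP4 packing (FP4 E2M1 elements + one shared
# E8M0 power-of-two scale per 32-element group). Names and scale policy are
# assumptions for illustration, not the mxfp4_pack.py API from this PR.
import numpy as np

GROUP_SIZE = 32
# All 16 E2M1 code points; the index doubles as the 4-bit code (bit 3 = sign).
FP4_VALUES = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0], dtype=np.float32)

def pack_mxfp4_group(x: np.ndarray) -> tuple[np.ndarray, np.uint8]:
    """Quantize one group of 32 floats to 16 packed uint8 bytes + E8M0 scale."""
    amax = float(np.abs(x).max())
    # Power-of-two scale so the largest magnitude fits within the FP4 max (6.0).
    exp = 0 if amax == 0.0 else int(np.ceil(np.log2(amax / 6.0)))
    scale = np.float32(2.0 ** exp)
    # Round each scaled element to the nearest representable E2M1 value.
    codes = np.argmin(np.abs(x[:, None] / scale - FP4_VALUES[None, :]), axis=1)
    codes = codes.astype(np.uint8)
    packed = ((codes[1::2] << 4) | codes[0::2]).astype(np.uint8)  # two nibbles per byte
    return packed, np.uint8(exp + 127)                            # E8M0 = biased exponent

def unpack_mxfp4_group(packed: np.ndarray, scale_e8m0: np.uint8) -> np.ndarray:
    """Inverse of pack_mxfp4_group (dequantize back to float32)."""
    codes = np.empty(packed.size * 2, dtype=np.int64)
    codes[0::2] = packed & 0x0F
    codes[1::2] = packed >> 4
    scale = np.float32(2.0 ** (int(scale_e8m0) - 127))
    return FP4_VALUES[codes] * scale

# Roundtrip check of the kind the test plan describes:
x = np.random.randn(GROUP_SIZE).astype(np.float32)
x_hat = unpack_mxfp4_group(*pack_mxfp4_group(x))
print("max abs error:", float(np.abs(x - x_hat).max()))
```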

Test Results (MI355 gfx950, Solar-Open-100B)

Benchmark     Baseline   MXFP4 (new scorer)   Delta
MMLU          77.58%     76.14%               -1.44%
KMMLU         57.38%     57.03%               -0.35%
Checkpoint    192 GB     53 GB                3.62x compression

Test plan

  • MXFP4 pack/unpack roundtrip tests (7 tests passed on remote)
  • Compression ratio verification (3.76x theoretical, 3.62x achieved; see the arithmetic after this list)
  • MMLU + KMMLU evaluation on MI355
  • AWQ algorithm end-to-end test
  • vLLM packed MXFP4 weight loader
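The theoretical/achieved numbers above are easy to reproduce: 4-bit elements plus one 8-bit E8M0 scale per 32-element group cost 4.25 bits per weight versus 16 bits for BF16. The shortfall from 3.76x to 3.62x is plausibly due to excluded (still-BF16) layers and unpacked tensors; that attribution is an inference, not something stated in the PR.

```python
# Back-of-the-envelope check of the MXFP4 compression ratio vs. BF16.
BF16_BITS = 16
FP4_BITS = 4
SCALE_BITS = 8        # one shared E8M0 scale...
GROUP_SIZE = 32       # ...per 32-element OCP MX group

bits_per_weight = FP4_BITS + SCALE_BITS / GROUP_SIZE     # = 4.25 bits
theoretical = BF16_BITS / bits_per_weight                # ≈ 3.76x
achieved = 192 / 53                                      # ≈ 3.62x (checkpoint sizes in GB)
print(f"theoretical {theoretical:.2f}x vs achieved {achieved:.2f}x")
```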

…GPTQ support

- Add MXFP4 packing utilities (mxfp4_pack.py): FP4 E2M1 + E8M0 scale format
  achieving 3.76x compression ratio for OCP MX specification compliance
- Fix sensitivity scorer to use actual target precision (OCP_MXFP4Spec) instead
  of hardcoded INT4 proxy, reducing false exclusions from 68% to 5% of layers
  (see the spec-selection sketch after this commit message)
- Wire AWQ/GPTQ algorithm support via LLMTemplate.get_config(algorithm=...)
- Add pack_mxfp4 config flag to control packed vs BF16 export
- Add solar_open model type mapping to qwen3_moe template
- Fix Int4PerGroupSpec fallback for Quark 0.11.1 compatibility (ch_axis param)

Tested on MI355 (gfx950) with Solar-Open-100B:
  Packed checkpoint: 53GB (vs 192GB original, 3.62x compression)
  MMLU: 76.14% (-1.44% from baseline 77.58%)
  KMMLU: 57.03% (-0.35% from baseline 57.38%)
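A minimal sketch of what the scorer change above amounts to: pick the scoring spec from the target precision rather than always using an INT4 proxy. `OCP_MXFP4Spec`, `Int4PerGroupSpec`, the `ch_axis` argument, and `LLMTemplate.get_config(algorithm=...)` are all named in this PR, but the stand-in classes and the helper below are illustrative, not Quark's actual definitions.

```python
# Illustrative stand-ins for the Quark spec classes named in this PR; the real
# classes live in Quark and their constructors may differ.
class OCP_MXFP4Spec:                      # stand-in for the MXFP4 target spec
    pass

class Int4PerGroupSpec:                   # stand-in; ch_axis required on Quark 0.11.1
    def __init__(self, group_size: int = 32, ch_axis: int = -1):
        self.group_size, self.ch_axis = group_size, ch_axis

def build_scoring_spec(target_precision: str):
    """Hypothetical helper: score layer sensitivity against the real target."""
    if target_precision.startswith("mxfp"):
        return OCP_MXFP4Spec()            # no more hardcoded INT4 proxy
    return Int4PerGroupSpec(group_size=32, ch_axis=-1)

# Algorithm wiring from the summary (argument name as described in the PR):
#   config = LLMTemplate.get_config(algorithm="awq")   # or "gptq"; default RTN
```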

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces support for MXFP4 weight packing, which significantly reduces model size by using FP4 E2M1 encoding with E8M0 shared scales. Key changes include the addition of MXFP4 packing utilities, updates to the UnifiedConfig to support different quantization algorithms (RTN, AWQ, GPTQ), and enhancements to the sensitivity analyzer to use target-specific precisions during scoring. Review feedback focuses on improving memory efficiency during shard processing to avoid OOM errors, correcting the scale computation logic to prevent clipping, replacing hardcoded group sizes with constants, and implementing safer file-writing patterns to prevent checkpoint corruption.
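On the reviewer's "safer file-writing patterns" point: the usual fix is to write each shard to a temporary file and promote it with an atomic rename, so an interrupted export never leaves a truncated file at the final path. A generic sketch, not the reviewer's exact suggestion:

```python
import os
import tempfile

def write_shard_atomically(data: bytes, dest_path: str) -> None:
    """Write a checkpoint shard via temp file + atomic rename (generic sketch)."""
    dest_dir = os.path.dirname(dest_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())          # ensure bytes reach disk before rename
        os.replace(tmp_path, dest_path)   # atomic within one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```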

Outdated comment threads: src/quanto/core/unified_quantizer.py (3), src/quanto/utils/mxfp4_pack.py (1)

Replace custom mxfp4_pack.py with Quark's quantize_model_per_safetensor
for MXFP4 quantization. This produces properly packed uint8 weights that
vLLM loads natively as a Quark-quantized checkpoint.

- Add _run_file2file_quantization() using Quark's file2file path
- Route MXFP precision to file2file in run() dispatch
- Remove custom mxfp4_pack.py and pack_mxfp4 config option
- Resolve HF hub IDs to local paths for file2file compatibility (resolution
  sketch after this commit message)

Solar-Open-100B: 192GB → 53GB (3.62x), 73s quantization time.
Matches AMD's official MXFP4 model format (Kimi-K2.5-MXFP4).
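The hub-ID resolution step above can be done with huggingface_hub's snapshot_download, which downloads (or reuses) a repo snapshot and returns its local directory. A generic sketch of that step, not necessarily the code in this commit:

```python
import os
from huggingface_hub import snapshot_download

def resolve_model_path(model_id_or_path: str) -> str:
    """Return a local directory usable by the file2file quantization path."""
    if os.path.isdir(model_id_or_path):
        return model_id_or_path                        # already a local checkout
    # Hub IDs like "org/model" are materialized into the local HF cache first.
    return snapshot_download(repo_id=model_id_or_path)
```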
vLLM fuses certain projections into single linear layers (qkv_proj,
gate_up_proj), requiring all members to share the same quantization
scheme. Add _align_exclude_groups() to ensure that if any projection
in a fused group is excluded, the entire group is excluded together.

Fused groups handled:
- self_attn: q_proj + k_proj + v_proj
- mlp: gate_proj + up_proj
- mlp.shared_experts: gate_proj + up_proj

Solar-Open-100B: 16 → 32 excluded layers after alignment.
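A hypothetical implementation of the alignment described above, expanding the exclude list so every member of a fused vLLM projection group is excluded together (which is what grows 16 exclusions into 32); the real `_align_exclude_groups()` may match names differently.

```python
# Hypothetical sketch of fused-group alignment; group members are module-name suffixes.
FUSED_GROUPS = [
    ("q_proj", "k_proj", "v_proj"),   # fused into qkv_proj by vLLM
    ("gate_proj", "up_proj"),         # fused into gate_up_proj by vLLM
]

def align_exclude_groups(excluded: list[str]) -> list[str]:
    """If any member of a fused group is excluded, exclude the whole group."""
    aligned = set(excluded)
    for name in excluded:
        for group in FUSED_GROUPS:
            for member in group:
                if name.endswith("." + member):
                    prefix = name[: -len(member)]      # e.g. "model.layers.3.self_attn."
                    aligned.update(prefix + m for m in group)
    return sorted(aligned)

# Example: excluding only k_proj pulls in q_proj and v_proj for that layer.
print(align_exclude_groups(["model.layers.3.self_attn.k_proj"]))
```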
MoE router gates (*.gate, not gate_proj) must be excluded from MXFP4
quantization because vLLM's SolarOpenTopkRouter uses regular nn.Linear
which cannot load packed uint8 weights.

Also update CLAUDE.md with file2file quantization path, AWQ/GPTQ
support, and fused layer alignment documentation.

Verified: Solar-Open-100B MXFP4 checkpoint loads and runs inference
on vLLM (MI355, TP=1, 53GB checkpoint).
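The distinction above, excluding router `*.gate` modules while still quantizing `gate_proj`, comes down to matching the exact module-name suffix. A small illustration, with module names assumed from the commit text:

```python
import re

# Router gates end in ".gate" exactly; "gate_proj" / "gate_up_proj" must not match.
ROUTER_GATE_RE = re.compile(r"\.gate$")

def is_router_gate(module_name: str) -> bool:
    """True for MoE router gates that must stay unquantized."""
    return bool(ROUTER_GATE_RE.search(module_name))

assert is_router_gate("model.layers.7.mlp.gate")                      # router: exclude
assert not is_router_gate("model.layers.7.mlp.experts.0.gate_proj")   # expert MLP: quantize
```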
- Add JSON config fallback in _setup() and detect_model_type() when
  AutoConfig fails for models not yet in transformers (e.g., exaone4_5);
  a fallback sketch follows this commit message
- Add graceful tokenizer fallback when AutoTokenizer fails
- Add EXAONE model type mappings (exaone, exaone4_5, exaone4_5_text → llama)
- Keep auto-strategy detection intact for non-MXFP paths
- For multimodal models, merge text_config into top-level config

Tested: EXAONE-4.5-33B MXFP4 quantization (64GB → 20GB, 3.2x, 20s)
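A generic sketch of the fallback described above: try AutoConfig first, then read config.json directly when the installed transformers does not know the model type, merging text_config for multimodal checkpoints. The exact exceptions caught and merge behavior in the PR may differ.

```python
import json
import os
from transformers import AutoConfig

def load_model_config(model_path: str) -> dict:
    """AutoConfig with a raw config.json fallback (illustrative sketch)."""
    try:
        return AutoConfig.from_pretrained(model_path, trust_remote_code=True).to_dict()
    except (ValueError, KeyError, OSError):
        # Model type not registered in this transformers version (e.g. exaone4_5):
        # fall back to the raw JSON on disk.
        with open(os.path.join(model_path, "config.json")) as f:
            cfg = json.load(f)
        if "text_config" in cfg:               # multimodal: promote text settings
            cfg = {**cfg, **cfg["text_config"]}
        return cfg
```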
- Add main() to auto_quantize.py with full argparse CLI (--model_path,
  --precision, --exclude_layers_file, etc.); a CLI sketch follows this list
- Fix __main__.py dispatcher to pass args through to quantization mode
- Add kimi_k2/kimi_k25 model type mapping in constants.py
- Update CLAUDE.md with CLI usage examples
- Remove project structure section from README.md
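A minimal sketch of the CLI shape listed above. The flag names come from the commit message; the help strings, defaults, and the final print are placeholders for the real dispatch into the quantization pipeline.

```python
import argparse

def main() -> None:
    parser = argparse.ArgumentParser(description="Auto-quantize a checkpoint (sketch).")
    parser.add_argument("--model_path", required=True, help="Local path or HF hub ID")
    parser.add_argument("--precision", default="mxfp4", help="Target precision, e.g. mxfp4")
    parser.add_argument("--exclude_layers_file", default=None,
                        help="Optional file listing layers to keep unquantized")
    args = parser.parse_args()
    # Placeholder: the real entry point dispatches into the quantization pipeline here.
    print(f"quantize {args.model_path} -> {args.precision} "
          f"(exclusions: {args.exclude_layers_file})")

if __name__ == "__main__":
    main()
```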