[Performance] W4A8 MoE GEMM significantly slower than W4A16 on H200 #26

@huweim

Description

On H200, W4A8 MoE GEMM is roughly 1.1x to 2.2x slower than W4A16 at almost every M (the only exception in the tables below is w2 at M=64). Weights are quantized to MXFP4, and activations are quantized to FP8 with group_size=0 (per-token), 32, or 128. All three activation quantization configurations perform poorly compared to W4A16.
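To make the three activation configurations concrete, here is a minimal sketch of how the number of scales per token changes with group_size. It is not the kernel's actual quantization code: the absmax scaling rule, the E4M3 variant of FP8, and the helper name `activation_scales` are assumptions for illustration only. K=256 is taken from the w2 shape.

```python
FP8_E4M3_MAX = 448.0  # largest finite E4M3 value; the FP8 variant is an assumption


def activation_scales(row, group_size):
    """Return one absmax scale per group for a single token's activations.

    group_size == 0 means per-token: the whole row shares one scale.
    (Hypothetical helper, not part of any library.)
    """
    if not row:
        raise ValueError("row must not be empty")
    if group_size == 0:
        group_size = len(row)
    if len(row) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    scales = []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        absmax = max(abs(x) for x in group)
        # Fall back to 1.0 for an all-zero group so later division is safe.
        scales.append(absmax / FP8_E4M3_MAX if absmax > 0 else 1.0)
    return scales


# One token with K = 256 (the w2 input dimension), synthetic values in [-8, 8].
row = [float(i % 17 - 8) for i in range(256)]
for g in (0, 32, 128):
    print(f"group_size={g}: {len(activation_scales(row, g))} scale(s) per token")
```

Per-token quantization carries a single scale per row, while group_size=32 carries K/32 scales that the kernel has to load and apply along the K dimension.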

Performance on H20 is good with sm90_h20.py heuristics, but poor on H200 with sm90.py.

Setup

  • GPU: H200 (SM90)
  • Shape: w13 [M, 7000] x [7000, 512], w2 [M, 256] x [256, 7000]

Performance table on H200 (ms)

Below is the performance after tuning sm90.py.

w13 [M, 7000] x [7000, 512]

| M | W4A8-act-g128 | W4A8-act-pertoken | W4A8-act-g32 | W4A16 | Marlin W4A16 |
|---|---|---|---|---|---|
| 1 | 0.035 | 0.035 | 0.027 | 0.019 | 0.027 |
| 64 | 0.070 | 0.056 | 0.057 | 0.051 | 0.047 |
| 512 | 0.158 | 0.118 | 0.146 | 0.092 | 0.168 |
| 4096 | 0.987 | 0.643 | 0.898 | 0.452 | 0.863 |
| 32768 | 7.529 | 5.073 | 7.021 | 3.399 | 7.045 |

w2 [M, 256] x [256, 7000]

| M | W4A8-act-g128 | W4A8-act-pertoken | W4A8-act-g32 | W4A16 | Marlin W4A16 |
|---|---|---|---|---|---|
| 1 | 0.028 | 0.028 | 0.019 | 0.015 | 0.015 |
| 64 | 0.041 | 0.042 | 0.043 | 0.054 | 0.041 |
| 512 | 0.112 | 0.107 | 0.120 | 0.096 | 0.170 |
| 4096 | 0.669 | 0.581 | 0.670 | 0.513 | 0.850 |
| 32768 | 4.740 | 4.344 | 5.064 | 3.857 | 6.720 |
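The slowdown of each W4A8 configuration relative to W4A16 can be read off the two tables above with a short script. This is only a convenience sketch: the latencies are copied from the tables, while the dictionary layout and the helper name `slowdown` are illustrative choices.

```python
# Latencies in ms copied from the tables above.
# Each value is (W4A8-act-g128, W4A8-act-pertoken, W4A8-act-g32, W4A16).
W13 = {
    1: (0.035, 0.035, 0.027, 0.019),
    64: (0.070, 0.056, 0.057, 0.051),
    512: (0.158, 0.118, 0.146, 0.092),
    4096: (0.987, 0.643, 0.898, 0.452),
    32768: (7.529, 5.073, 7.021, 3.399),
}
W2 = {
    1: (0.028, 0.028, 0.019, 0.015),
    64: (0.041, 0.042, 0.043, 0.054),
    512: (0.112, 0.107, 0.120, 0.096),
    4096: (0.669, 0.581, 0.670, 0.513),
    32768: (4.740, 4.344, 5.064, 3.857),
}


def slowdown(table):
    """Map each M to the W4A8 / W4A16 latency ratio for the three W4A8 configs."""
    return {
        m: tuple(w4a8 / row[3] for w4a8 in row[:3])
        for m, row in table.items()
    }


for name, table in (("w13", W13), ("w2", W2)):
    for m, ratios in slowdown(table).items():
        g128, pertoken, g32 = ratios
        print(f"{name} M={m}: g128 {g128:.2f}x, per-token {pertoken:.2f}x, g32 {g32:.2f}x")
```

With these numbers, the w13 ratios span roughly 1.1x to 2.2x, the w2 ratios are smaller, and w2 at M=64 is the one row where all three W4A8 configurations are faster than W4A16.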

Questions

  • W4A8 performance on H20 is good ([Quantization] add humming mxfp4 moe backend vllm-project/vllm#41083), but on H200 the default sm90.py config was not tuned for MoE. After tuning, the improvement is significant, but W4A8 is still up to ~2.2x slower than W4A16. Is this a known issue on H200?
  • H200 has significantly higher tensor core throughput than H20. Could the W4A8 gap be a fundamental bottleneck on H200 that cannot be avoided?
  • Are there any suggestions for further optimization, or are dedicated H200 heuristics planned? Several MoE models have GEMM shapes similar to those in this issue.
