[Performance] W4A8 MoE GEMM significantly slower than W4A16 on H200

On H200, W4A8 MoE GEMM is slower than W4A16 at almost every M tested: about 1.1x to 2.2x slower on w13, and about 1.1x to 1.9x slower on w2, except at M=64 on w2, where W4A8 is slightly faster. Weights are quantized to MXFP4 and activations to FP8 with group_size=0 (per-token) / 32 / 128. All three activation quantization configurations perform poorly compared to W4A16.

Performance on H20 is good with `sm90_h20.py` heuristics, but poor on H200 with `sm90.py`.

## Setup

- GPU: H200 (SM90)
- Shape: w13 [M, 7000] x [7000, 512], w2 [M, 256] x [256, 7000]

## Performance table on H200 (ms)

Below is the performance after tuning `sm90.py`.

w13 [M, 7000] x [7000, 512]

|     M | W4A8-act-g128 | W4A8-act-pertoken | W4A8-act-g32 | W4A16 | Marlin W4A16 |
| ----: | ------------: | ----------------: | -----------: | ----: | -----------: |
|     1 |         0.035 |             0.035 |        0.027 | 0.019 |        0.027 |
|    64 |         0.070 |             0.056 |        0.057 | 0.051 |        0.047 |
|   512 |         0.158 |             0.118 |        0.146 | 0.092 |        0.168 |
|  4096 |         0.987 |             0.643 |        0.898 | 0.452 |        0.863 |
| 32768 |         7.529 |             5.073 |        7.021 | 3.399 |        7.045 |

w2 [M, 256] x [256, 7000]

|     M | W4A8-act-g128 | W4A8-act-pertoken | W4A8-act-g32 | W4A16 | Marlin W4A16 |
| ----: | ------------: | ----------------: | -----------: | ----: | -----------: |
|     1 |         0.028 |             0.028 |        0.019 | 0.015 |        0.015 |
|    64 |         0.041 |             0.042 |        0.043 | 0.054 |        0.041 |
|   512 |         0.112 |             0.107 |        0.120 | 0.096 |        0.170 |
|  4096 |         0.669 |             0.581 |        0.670 | 0.513 |        0.850 |
| 32768 |         4.740 |             4.344 |        5.064 | 3.857 |        6.720 |
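For convenience, the slowdown of each W4A8 configuration relative to W4A16 can be derived from the two tables. The snippet below is only a small helper sketch: the latencies are copied from the tables, while the dictionary layout, the column keys, and the function name `slowdown_vs_w4a16` are illustrative choices and not part of any benchmark script.

```python
# Latencies in ms, copied from the two tables above (key = M).
# Column order matches the tables; the key names are illustrative.
COLS = ["w4a8_g128", "w4a8_pertoken", "w4a8_g32", "w4a16", "marlin_w4a16"]

W13 = {
    1:     [0.035, 0.035, 0.027, 0.019, 0.027],
    64:    [0.070, 0.056, 0.057, 0.051, 0.047],
    512:   [0.158, 0.118, 0.146, 0.092, 0.168],
    4096:  [0.987, 0.643, 0.898, 0.452, 0.863],
    32768: [7.529, 5.073, 7.021, 3.399, 7.045],
}

W2 = {
    1:     [0.028, 0.028, 0.019, 0.015, 0.015],
    64:    [0.041, 0.042, 0.043, 0.054, 0.041],
    512:   [0.112, 0.107, 0.120, 0.096, 0.170],
    4096:  [0.669, 0.581, 0.670, 0.513, 0.850],
    32768: [4.740, 4.344, 5.064, 3.857, 6.720],
}


def slowdown_vs_w4a16(table):
    """Return {M: {config: latency / W4A16 latency}} for the W4A8 columns."""
    base = COLS.index("w4a16")
    result = {}
    for m, row in table.items():
        result[m] = {
            col: row[i] / row[base]
            for i, col in enumerate(COLS)
            if col.startswith("w4a8")
        }
    return result


if __name__ == "__main__":
    for name, table in (("w13", W13), ("w2", W2)):
        print(name)
        for m, ratios in slowdown_vs_w4a16(table).items():
            cells = ", ".join(f"{col}={r:.2f}x" for col, r in ratios.items())
            print(f"  M={m:>5}: {cells}")
```

A ratio above 1.0 means the W4A8 configuration is slower than W4A16 at that M; a ratio below 1.0 means it is faster.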

## Questions

+ W4A8 performance on H20 is good (https://github.com/vllm-project/vllm/pull/41083), but on H200 the default `sm90.py` config was not tuned for MoE. After tuning, the improvement is significant, but W4A8 is still up to ~2.2x slower than W4A16 (w13, act-g128, large M). Is this a known issue on H200?
+ H200 has significantly higher tensor core throughput than H20. Could the remaining gap to W4A16 on H200 be a fundamental bottleneck that cannot be avoided?
+ Are there any suggestions for further optimization, or are dedicated H200 heuristics planned? Several MoE models have GEMM shapes similar to those in this issue.
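To put the bottleneck question in context, the measured latencies can be converted into a rough effective throughput. The sketch below assumes each table row corresponds to a single dense `[M, K] x [K, N]` GEMM with `2 * M * K * N` FLOPs; the number of experts and the top-k routing are not specified in this issue, so the real work per call may differ and the numbers are only indicative. The function name `dense_gemm_tflops` is hypothetical.

```python
def dense_gemm_tflops(m, k, n, latency_ms):
    """Effective TFLOP/s of an [m, k] x [k, n] GEMM, assuming 2*m*k*n FLOPs.

    Assumption: one dense GEMM per measurement; MoE routing (expert count,
    top-k) is not modeled because it is not given in the issue.
    """
    flops = 2 * m * k * n
    seconds = latency_ms * 1e-3
    return flops / seconds / 1e12


if __name__ == "__main__":
    # (label, M, K, N, latency in ms) taken from the M=32768 table rows.
    cases = [
        ("w13 W4A16",         32768, 7000, 512,  3.399),
        ("w13 W4A8-act-g128", 32768, 7000, 512,  7.529),
        ("w2  W4A16",         32768, 256,  7000, 3.857),
        ("w2  W4A8-act-g128", 32768, 256,  7000, 4.740),
    ]
    for label, m, k, n, ms in cases:
        print(f"{label}: {dense_gemm_tflops(m, k, n, ms):.1f} TFLOP/s")
```

Comparing these figures against the published peak tensor core throughput of the GPU gives a first indication of how far each kernel is from being compute bound under the stated assumption.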


