On H200, W4A8 MoE GEMM is ~1.5x slower than W4A16 across all M values. Weights are quantized to MXFP4 and activations to FP8 with group_size=0 (per-token), 32, or 128. All three activation quantization configurations perform poorly compared to W4A16.
Performance on H20 is good with sm90_h20.py heuristics, but poor on H200 with sm90.py.
Setup
GPU: H200 (SM90)
Shape: w13 [M, 7000] x [7000, 512], w2 [M, 256] x [256, 7000]
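For readers unfamiliar with the three activation settings, here is a minimal NumPy sketch of how the scale granularity differs between group_size=0 (per-token), 32, and 128. It is not the kernel's actual quantization code: the helper name `activation_scales` is hypothetical, the FP8 format is assumed to be E4M3 (maximum finite value 448), and K=256 is taken from the w2 shape above.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite FP8 E4M3 value (format assumed, not stated in the issue)

def activation_scales(x, group_size):
    """Hypothetical helper: one scale per group of `group_size` elements along K.

    group_size == 0 is treated as per-token, i.e. one scale per row.
    """
    m, k = x.shape
    g = k if group_size == 0 else group_size
    if k % g != 0:
        raise ValueError(f"K={k} is not a multiple of group size {g}")
    # Largest magnitude inside each group, mapped onto the FP8 range.
    amax = np.abs(x).reshape(m, k // g, g).max(axis=-1)
    return np.maximum(amax, 1e-12) / FP8_E4M3_MAX

x = np.random.default_rng(0).standard_normal((4, 256)).astype(np.float32)
for gs in (0, 32, 128):
    print(gs, activation_scales(x, gs).shape)
```

Smaller groups mean more scales per token, so the GEMM has more scale factors to load and apply per row of activations.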
Performance table on H200 (ms)
Below is the performance after tuning sm90.py.
w13 [M, 7000] x [7000, 512]
w2 [M, 256] x [256, 7000]
Questions
The sm90.py config was not tuned for MoE. After tuning, the improvement is significant, but W4A8 is still ~2x behind W4A16. Is this a known issue on H200?