vLLM image: vllm/vllm-openai:nightly-aarch64 (v0.20.1rc1.dev91+ga749a33d8, 2026-04-30); same behavior on v0.20.0
🐛 Describe the bug
Serving openai/gpt-oss-120b with native MXFP4 on GB10 / DGX Spark (SM 12.1) has no working --moe-backend:
| backend | outcome |
| --- | --- |
| marlin (default) | Runs but emits broken first Harmony token → content: null, reasoning: null (#37030) |
| triton (after patching capability gate) | ptxas error: Feature '.tile::scatter4' not supported on .target 'sm_121a' |
| flashinfer_cutlass | Quant scheme mismatch (u8 / GroupShape(1,32) not supported) |
| flashinfer_trtllm | "kernel does not support current device cuda" |
| flashinfer_cutedsl | Engine init failure |
| deep_gemm | "kernel does not support current device cuda" |
| emulation | Works, but ≤5 tok/s |
Both OAITritonExperts and UnfusedOAITritonExperts call matmul_ogs from triton_kernels, which is JIT-compiled with .tile::scatter4 PTX. That instruction is a TMA scatter feature (Hopper SM 9.x / Blackwell datacenter SM 10.x) — it is not part of the SM 12.1 (GB10/consumer Blackwell) ISA.
So the gpt-oss-120b MXFP4 kernel families currently in vLLM all assume datacenter-class TMA, which GB10 does not have. The only path that does not hit TMA is Marlin, which has the separate first-token correctness bug from #37030.
After patching both ranges to < (13, 0) so SM 12.1 passes the gate, the kernel reaches JIT and fails:
triton.runtime.errors.PTXASError: Internal Triton PTX codegen error
ptxas line 3727; error : Feature '.tile::scatter4' not supported on .target 'sm_121a'
... (16 more identical errors at different line numbers) ...
ptxas fatal : Ptx assembly aborted due to errors
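The gate itself can be pictured as a plain tuple comparison. The sketch below assumes the bounds reported in this issue, `(9, 0) <= cap < (11, 0)` for the stock gate and `< (13, 0)` for the patched one. The names `passes_gate`, `STOCK_GATE`, and `PATCHED_GATE` are hypothetical and are not vLLM identifiers.

```python
from typing import Tuple

Capability = Tuple[int, int]
Gate = Tuple[Capability, Capability]

# Bounds as reported in this issue; the names are illustrative only.
STOCK_GATE: Gate = ((9, 0), (11, 0))
PATCHED_GATE: Gate = ((9, 0), (13, 0))


def passes_gate(cap: Capability, gate: Gate = STOCK_GATE) -> bool:
    """Return True if `cap` lies in the half-open range [low, high)."""
    low, high = gate
    # Python compares tuples lexicographically: major first, then minor.
    return low <= cap < high


GB10 = (12, 1)  # SM 12.1
print(passes_gate(GB10))                # stock gate rejects SM 12.1
print(passes_gate(GB10, PATCHED_GATE))  # patched gate lets it through
```

Widening the range only moves the failure from the Python-side check to ptxas: the gate says nothing about whether the generated PTX is valid for the target.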
Expected behavior
gpt_oss_mxfp4 + GB10/DGX Spark should produce a usable response — the hardware advertises native MXFP4 tensor cores, DGX Spark is marketed as an AI workstation, and openai/gpt-oss-120b MXFP4 is the canonical Blackwell deployment.
Suggested directions
Triton kernel path without tile::scatter4 — SM 12.x branch in matmul_ogs (or vLLM-local override) using regular tl.store scatter. Slower but functional.
BF16 dequantize-on-load fallback — when gpt_oss_mxfp4 runs on a device with neither working Marlin nor TMA Triton, dequantize MXFP4 → BF16 at load and use the standard MoE kernel. Costs ~2× weight memory.
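To make the dequantize-on-load direction concrete, here is a minimal pure-Python reference for decoding one MXFP4 block. It assumes the usual MX layout: 32 E2M1 elements sharing one E8M0 scale byte (a power of two with bias 127), two elements packed per byte. The low-nibble-first packing order is an assumption, not confirmed against vLLM's loader, and the function names are hypothetical. A real fallback would do this vectorized on the GPU and cast the result to BF16.

```python
from typing import List

# E2M1 magnitudes indexed by the low 3 bits of a nibble; bit 3 is the sign.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 32  # elements sharing one scale, matching GroupShape(1, 32)


def decode_fp4(nibble: int) -> float:
    """Decode a 4-bit E2M1 code into a float."""
    magnitude = E2M1_VALUES[nibble & 0x7]
    return -magnitude if nibble & 0x8 else magnitude


def dequantize_block(packed: bytes, scale_e8m0: int) -> List[float]:
    """Expand 16 packed bytes plus one E8M0 scale into 32 floats.

    Assumes low nibble first within each byte. The E8M0 NaN encoding
    (0xFF) is not handled in this sketch.
    """
    if len(packed) != BLOCK // 2:
        raise ValueError("expected 16 packed bytes per 32-element block")
    scale = 2.0 ** (scale_e8m0 - 127)
    out: List[float] = []
    for byte in packed:
        out.append(decode_fp4(byte & 0x0F) * scale)
        out.append(decode_fp4(byte >> 4) * scale)
    return out


# 0x21 packs code 1 (0.5) in the low nibble and code 2 (1.0) in the high one.
print(dequantize_block(bytes([0x21] * 16), 128)[:2])
```

Once the weights are expanded this way, the standard BF16 MoE kernel could be used unchanged, which is what makes this fallback independent of both Marlin and TMA.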
🔁 Reproduction
    docker run --rm -it --runtime nvidia --ipc host --shm-size 16g \
      -v ~/.cache/huggingface:/root/.cache/huggingface \
      -e VLLM_MXFP4_USE_MARLIN=0 \
      vllm/vllm-openai:nightly-aarch64 \
      --model openai/gpt-oss-120b \
      --quantization gpt_oss_mxfp4 \
      --moe-backend triton \
      --gpu-memory-utilization 0.85 \
      --max-model-len 32768 \
      --reasoning-parser openai_gptoss

triton is gated by (9, 0) <= cap < (11, 0) in:
vllm/model_executor/layers/fused_moe/oracle/mxfp4.py:255
vllm/model_executor/layers/fused_moe/experts/gpt_oss_triton_kernels_moe.py:660

After patching both ranges to < (13, 0) so SM 12.1 passes the gate, the kernel reaches JIT and fails with the ptxas error shown above (Feature '.tile::scatter4' not supported on .target 'sm_121a').

Default Marlin path: engine starts, every /v1/chat/completions returns content: null (= #37030).

Happy to test patches and collect logs from GB10 hardware.
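When testing candidate patches, the Marlin symptom can be checked mechanically. The helper below is a hypothetical sketch that inspects an already-parsed /v1/chat/completions response; it assumes the `content` and `reasoning` field names quoted in this issue and the standard `choices[0].message` layout.

```python
from typing import Any, Dict


def is_empty_completion(response: Dict[str, Any]) -> bool:
    """Return True if the first choice has neither content nor reasoning."""
    choices = response.get("choices") or []
    if not choices:
        return True
    message = choices[0].get("message") or {}
    # Both fields being null is the symptom described in this issue.
    return message.get("content") is None and message.get("reasoning") is None


broken = {"choices": [{"message": {"content": None, "reasoning": None}}]}
healthy = {"choices": [{"message": {"content": "Hello", "reasoning": None}}]}
print(is_empty_completion(broken), is_empty_completion(healthy))
```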