Skip to content

[Bug]: Triton MXFP4 MoE kernel uses .tile::scatter4 PTX (Hopper/SM10 only) — fails on SM 12.1 (GB10/DGX Spark); Marlin fallback hits #37030 #41477

@vbalko-claimate

Description

@vbalko-claimate

Your current environment

  • GPU: NVIDIA GB10 (DGX Spark) — SM 12.1 / sm_121a, 128 GB unified memory
  • Architecture: Grace Blackwell (consumer/edge variant)
  • Driver 580.142, CUDA 13.0, Ubuntu 24.04 ARM64
  • vLLM image: vllm/vllm-openai:nightly-aarch64 (v0.20.1rc1.dev91+ga749a33d8, 2026-04-30); same behavior on v0.20.0

🐛 Describe the bug

Serving openai/gpt-oss-120b with native MXFP4 on GB10 / DGX Spark (SM 12.1) has no working --moe-backend:

backend outcome
marlin (default) Runs but emits broken first Harmony token → content: null, reasoning: null (#37030)
triton (after patching capability gate) ptxas error: Feature '.tile::scatter4' not supported on .target 'sm_121a'
flashinfer_cutlass Quant scheme mismatch (u8 / GroupShape(1,32) not supported)
flashinfer_trtllm "kernel does not support current device cuda"
flashinfer_cutedsl Engine init failure
deep_gemm "kernel does not support current device cuda"
emulation Works, but ≤5 tok/s

Both OAITritonExperts and UnfusedOAITritonExperts call matmul_ogs from triton_kernels, which is JIT-compiled with .tile::scatter4 PTX. That instruction is a TMA scatter feature (Hopper SM 9.x / Blackwell datacenter SM 10.x) — it is not part of the SM 12.1 (GB10/consumer Blackwell) ISA.

So the gpt-oss-120b MXFP4 kernel families currently in vLLM all assume datacenter-class TMA, which GB10 does not have. The only path that does not hit TMA is Marlin, which has the separate first-token correctness bug from #37030.

🔁 Reproduction

docker run --rm -it --runtime nvidia --ipc host --shm-size 16g \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e VLLM_MXFP4_USE_MARLIN=0 \
  vllm/vllm-openai:nightly-aarch64 \
  --model openai/gpt-oss-120b \
  --quantization gpt_oss_mxfp4 \
  --moe-backend triton \
  --gpu-memory-utilization 0.85 \
  --max-model-len 32768 \
  --reasoning-parser openai_gptoss

triton is gated by (9, 0) <= cap < (11, 0) in:

  • vllm/model_executor/layers/fused_moe/oracle/mxfp4.py:255
  • vllm/model_executor/layers/fused_moe/experts/gpt_oss_triton_kernels_moe.py:660

After patching both ranges to < (13, 0) so SM 12.1 passes the gate, the kernel reaches JIT and fails:

triton.runtime.errors.PTXASError: Internal Triton PTX codegen error
ptxas line 3727; error : Feature '.tile::scatter4' not supported on .target 'sm_121a'
... (16 more identical errors at different line numbers) ...
ptxas fatal : Ptx assembly aborted due to errors

Default Marlin path: engine starts, every /v1/chat/completions returns content: null (= #37030).

Expected behavior

gpt_oss_mxfp4 + GB10/DGX Spark should produce a usable response — the hardware advertises native MXFP4 tensor cores, DGX Spark is marketed as an AI workstation, and openai/gpt-oss-120b MXFP4 is the canonical Blackwell deployment.

Suggested directions

  1. Triton kernel path without tile::scatter4 — SM 12.x branch in matmul_ogs (or vLLM-local override) using regular tl.store scatter. Slower but functional.
  2. BF16 dequantize-on-load fallback — when gpt_oss_mxfp4 runs on a device with neither working Marlin nor TMA Triton, dequantize MXFP4 → BF16 at load and use the standard MoE kernel. Costs ~2× weight memory.
  3. Fix Marlin SM 12.1 first-token ([Bug]: GPT-OSS-120B gpt-oss MXFP4 on SM121 (Blackwell DGX Spark): Marlin kernel generates wrong first Harmony token, producing null content/reasoning #37030) — restores the de facto fallback.

Happy to test patches and collect logs from GB10 hardware.

Related

Before submitting

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions