[Bug]: Triton MXFP4 MoE kernel uses .tile::scatter4 PTX (Hopper/SM10 only) — fails on SM 12.1 (GB10/DGX Spark); Marlin fallback hits #37030

### Your current environment

- GPU: NVIDIA **GB10 (DGX Spark)** — SM 12.1 / sm_121a, 128 GB unified memory
- Architecture: Grace Blackwell (consumer/edge variant)
- Driver 580.142, CUDA 13.0, Ubuntu 24.04 ARM64
- vLLM image: `vllm/vllm-openai:nightly-aarch64` (v0.20.1rc1.dev91+ga749a33d8, 2026-04-30); same behavior on `v0.20.0`

### 🐛 Describe the bug

Serving `openai/gpt-oss-120b` with native MXFP4 on **GB10 / DGX Spark (SM 12.1)** has no usable `--moe-backend`. Every option either fails at startup or JIT, returns empty output, or is too slow to serve:

| backend | outcome |
|---|---|
| `marlin` (default) | Runs but emits broken first Harmony token → `content: null`, `reasoning: null` (#37030) |
| `triton` (after patching capability gate) | `ptxas error: Feature '.tile::scatter4' not supported on .target 'sm_121a'` |
| `flashinfer_cutlass` | Quant scheme mismatch (`u8 / GroupShape(1,32)` not supported) |
| `flashinfer_trtllm` | "kernel does not support current device cuda" |
| `flashinfer_cutedsl` | Engine init failure |
| `deep_gemm` | "kernel does not support current device cuda" |
| `emulation` | Works, but ≤5 tok/s |

Both `OAITritonExperts` and `UnfusedOAITritonExperts` call `matmul_ogs` from `triton_kernels`, which is JIT-compiled with `.tile::scatter4` PTX. **That instruction is a TMA scatter feature (Hopper SM 9.x / Blackwell datacenter SM 10.x) — it is _not_ part of the SM 12.1 (GB10/consumer Blackwell) ISA.**

So the gpt-oss-120b MXFP4 kernel families currently in vLLM all assume datacenter-class TMA, which GB10 does not have. The only non-emulated path that does not hit TMA is Marlin, which has the separate first-token correctness bug from #37030.

### 🔁 Reproduction

```bash
docker run --rm -it --runtime nvidia --ipc host --shm-size 16g \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e VLLM_MXFP4_USE_MARLIN=0 \
  vllm/vllm-openai:nightly-aarch64 \
  --model openai/gpt-oss-120b \
  --quantization gpt_oss_mxfp4 \
  --moe-backend triton \
  --gpu-memory-utilization 0.85 \
  --max-model-len 32768 \
  --reasoning-parser openai_gptoss
```

`triton` is gated by `(9, 0) <= cap < (11, 0)` in:
- `vllm/model_executor/layers/fused_moe/oracle/mxfp4.py:255`
- `vllm/model_executor/layers/fused_moe/experts/gpt_oss_triton_kernels_moe.py:660`

After patching both ranges to `< (13, 0)` so SM 12.1 passes the gate, the kernel reaches JIT and fails:

```
triton.runtime.errors.PTXASError: Internal Triton PTX codegen error
ptxas line 3727; error : Feature '.tile::scatter4' not supported on .target 'sm_121a'
... (16 more identical errors at different line numbers) ...
ptxas fatal : Ptx assembly aborted due to errors
```
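
For illustration, the effect of widening the gate can be reproduced with plain tuple comparison. This is a minimal sketch, not vLLM's actual code: `triton_gate` and its `upper` parameter are hypothetical names, and only the bounds `(9, 0)`, `(11, 0)` and `(13, 0)` come from the report above.

```python
# Hypothetical stand-in for the capability gate described above.
# It mirrors the `(9, 0) <= cap < (11, 0)` check; it is not vLLM code.

def triton_gate(cap: tuple[int, int], upper: tuple[int, int] = (11, 0)) -> bool:
    """Return True when compute capability `cap` falls inside the gated range."""
    # Python compares tuples lexicographically: major first, then minor.
    return (9, 0) <= cap < upper


GB10 = (12, 1)  # SM 12.1, as reported for DGX Spark

print("stock gate admits GB10:  ", triton_gate(GB10))
print("patched gate admits GB10:", triton_gate(GB10, upper=(13, 0)))
```

Passing the gate only selects the backend. It does not make the generated PTX valid for `sm_121a`, which is why the failure moves from backend selection to `ptxas`.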

Default Marlin path: engine starts, every `/v1/chat/completions` returns `content: null` (= #37030).

### Expected behavior

`gpt_oss_mxfp4` + GB10/DGX Spark should produce a usable response — the hardware advertises native MXFP4 tensor cores, DGX Spark is marketed as an AI workstation, and `openai/gpt-oss-120b` MXFP4 is the canonical Blackwell deployment.

### Suggested directions

1. **Triton kernel path without `tile::scatter4`** — SM 12.x branch in `matmul_ogs` (or vLLM-local override) using regular `tl.store` scatter. Slower but functional.
2. **BF16 dequantize-on-load fallback** — when `gpt_oss_mxfp4` runs on a device with neither working Marlin nor TMA Triton, dequantize MXFP4 → BF16 at load and use the standard MoE kernel. Costs roughly 4× memory for the expert weights (16 bits per weight vs. about 4.25 bits for MXFP4 including block scales).
3. **Fix Marlin SM 12.1 first-token** (#37030) — restores the de facto fallback.
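
To make direction 2 concrete, here is a minimal pure-Python sketch of decoding one MXFP4 block. It assumes the OCP Microscaling layout: 32 E2M1 elements sharing one E8M0 scale. The nibble order (low nibble first) is an assumption about the checkpoint packing, and `decode_nibble` and `dequantize_block` are hypothetical helper names rather than vLLM or `triton_kernels` APIs. A real fallback would do this vectorized on the GPU and cast to BF16; the sketch returns plain Python floats.

```python
import math

# E2M1 magnitudes indexed by the low 3 bits of a nibble; bit 3 is the sign.
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 32  # elements that share one E8M0 scale in MXFP4


def decode_nibble(nibble: int) -> float:
    """Decode a 4-bit E2M1 code into its unscaled value."""
    magnitude = E2M1_VALUES[nibble & 0x7]
    return -magnitude if nibble & 0x8 else magnitude


def dequantize_block(packed: bytes, scale_e8m0: int) -> list[float]:
    """Dequantize one block: 16 packed bytes plus one E8M0 scale byte."""
    if len(packed) != BLOCK // 2:
        raise ValueError(f"expected {BLOCK // 2} bytes, got {len(packed)}")
    if scale_e8m0 == 0xFF:
        # E8M0 reserves 0xFF for NaN.
        return [math.nan] * BLOCK
    scale = 2.0 ** (scale_e8m0 - 127)  # E8M0 is a biased power-of-two exponent
    out: list[float] = []
    for byte in packed:
        # Assumption: the low nibble holds the earlier element.
        out.append(decode_nibble(byte & 0x0F) * scale)
        out.append(decode_nibble(byte >> 4) * scale)
    return out


block = bytes([0x21, 0xF7] + [0x00] * 14)
values = dequantize_block(block, scale_e8m0=128)  # scale = 2.0
print(values[:4])
```

Because each decoded value is a small E2M1 mantissa times a power of two, the computation needs no lookup beyond the 8-entry table, which is what makes a load-time conversion straightforward to implement.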

Happy to test patches and collect logs from GB10 hardware.

### Related

- #37030 (Marlin null content on SM 12.1 — the fallback we currently land on)
- #31607 (try/except HarmonyError; turns crash into empty response, doesn't fix kernel output)
- #31740 (SM121/GB10 platform support, needs-rebase, FP8 focus, no MXFP4 MoE kernel)
- #41028, #40923, #34822 (device-range extensions; helpful but don't fix TMA / Marlin)

### Before submitting

- [x] Searched existing issues; #37030 is Marlin-only, this is a different layer (Triton PTX feature gap).

