[Tracking] DeepGEMM SM 12.x kernel coverage gaps for DeepSeek-V4-Flash on consumer Blackwell (RTX 50 / GB10)

# SM 12.x kernel coverage gaps for DeepSeek-V4-Flash on consumer Blackwell (RTX 50 / GB10)

Filed against `jasl/DeepGEMM` and tracked alongside vLLM #40899.

This is a comprehensive map of what's missing to run **DeepSeek-V4-Flash on SM 12.x** end-to-end through DeepGEMM, gathered from a 10-layer end-to-end debugging pass on dual NVIDIA GB10 (DGX Spark, SM 121, aarch64) using `jasl/DeepGEMM@7a7a41a1` + `jasl/vllm:ds4-sm120@8d0ebb76` against vLLM nightly `0.19.2rc1.dev220+g7b1bc0a3e`.

## What works in `jasl/DeepGEMM` today

The fork already includes SM 120-native CuTeDSL kernels for these V4 paths:
- `sm120_tf32_hc_prenorm_gemm` (mHC pre-norm GEMM — V4's hyper-connection layer)
- `sm120_fp8_einsum`
- `sm120_fp8_paged_mqa_logits`
- `MmaMXF4NVF4Op` (with the `cute_dsl/warp/mma.py` `sm_120a` whitelist patch — NOT in fork)

These paths boot end-to-end on RTX Pro 6000 (SM 120) per the PR description.

## What's still missing for V4-Flash on SM 12.x

These dispatch sites in the fork still hard-code `arch_major == 9` (Hopper) or `arch_major == 10` (datacenter Blackwell) and reject SM 120/121, OR route to SM 10 kernels that use `tcgen05.*` instructions which SM 12.x silicon does not implement.

### 1. Dispatch routing — pure dispatch fixes (safe to extend to SM 12.x today)

| File | Line | Site | Suggested fix |
|---|---|---|---|
| `csrc/utils/layout.hpp` | 76 | `get_default_recipe`: SM 10 → `(1, 1, 128)` | accept `arch_major == 12` for the same recipe (verified working on GB10) |
| `csrc/apis/layout.hpp` | 48, 56, 106, 110 | SF transformation for FP4 | accept `arch_major == 12` (verified GB10) |
| `csrc/jit/compiler.hpp` | NVCCCompiler + NVRTCCompiler `include_dirs` | cutlass headers from `third-party/cutlass/include` | both compiler classes need `-I third-party/cutlass/include` added — currently NVCC fails with `fatal error: cutlass/detail/helper_macros.hpp: No such file or directory` |

These **alone** unblock the next layer of failures.

### 2. The hard wall — kernel-source-level gaps (require actual SM 12.x kernel writing)

| File | Site | Symptom on GB10 |
|---|---|---|
| `csrc/apis/gemm.hpp:99` | `fp8_fp4_gemm_nt` dispatch only knows `sm90_fp8_gemm_*` and `sm100_fp8_fp4_gemm_1d1d` | Routing to `sm100_fp8_fp4_gemm_1d1d` produces `ptxas: Instruction 'tcgen05.fence' not supported on .target 'sm_120f'`, `Feature '.block32' not supported on .target 'sm_120f'`. **Needs `sm120_fp8_fp4_gemm_1d1d` kernel implementation** — there is no SM 120 equivalent yet in the fork. |
| `csrc/apis/attention.hpp:67, 177, 367` | FP4 attention dispatch | Same `tcgen05.*` ISA mismatch as above. **Needs SM 120 FP4 attention kernel(s)**. |
| `csrc/apis/einsum.hpp:55` | FP4 einsum dispatch | Same problem. **Needs SM 120 FP4 einsum kernel** (the existing `sm120_fp8_einsum` covers FP8 not FP4). |

### 3. Companion vLLM-side gates that also need to relax

These are independent of DeepGEMM but block the deployment from reaching DeepGEMM in the first place. We've submitted the corresponding vLLM PRs:

| vLLM file | What | PR |
|---|---|---|
| `vllm/platforms/cuda.py support_deep_gemm` | accept `is_device_capability_family(120)` | #41062 |
| `vllm/model_executor/layers/fused_moe/experts/deep_gemm_moe.py` | `DeepGemmFP4Experts._supports_current_device` | #41062 |
| `vllm/model_executor/layers/fused_moe/experts/gpt_oss_triton_kernels_moe.py` | `_supports_current_device` cap range `< (11, 0)` → `< (13, 0)` | #41028 |
| `vllm/.../csrc CMakeLists.txt MARLIN_*_ARCHS` | add `12.0f` family | #40923 (Approved by @Harry-Chen) |

## Reproduction command

```bash
# Inside vllm/vllm-openai:nightly-aarch64 with jasl/DeepGEMM@7a7a41a1 + the
# jasl/vllm ds4-sm120 branch synced into /usr/local/lib/python3.12/dist-packages/vllm/
vllm serve /models/DeepSeek-V4-Flash \
  --tensor-parallel-size 2 --distributed-executor-backend ray \
  --moe-backend deep_gemm \
  --kv-cache-dtype fp8_ds_mla \
  --gpu-memory-utilization 0.78 \
  --load-format instanttensor \
  --compilation-config '{"cudagraph_mode": "PIECEWISE"}' \
  --trust-remote-code
```

After applying section-1 patches, the `tcgen05` failure surfaces during torch.compile / cudagraph capture (engine init, not at runtime).

## Suggested rollup for `cleaner re-submission`

When you do the cleaner PR rebase you mentioned in vllm-project/vllm#40969, please consider including:

1. **Section 1 patches** wholesale — they're literal `or arch_major == 12` additions and the cutlass include for the JIT compiler (~10 lines total)
2. **Authoring SM 120 FP4 kernels** for the three Section-2 sites, or alternatively gating `--moe-backend deep_gemm` to fall through to the Triton sparse-MLA fallback on SM 12.x for FP4 ops until the kernels exist

Happy to test any rebased branch on dual GB10 TP=2 — quick turnaround (boot ≤ 2 min, TP=2 init ≤ 90s) and we already have the patch stack staged. cc @jasl @WoosukKwon @zyongye @lukealonso

— @tonyliu312 (filed from a debug trace on dual DGX Spark / SM 121 at 2026-04-28 01:25 Taipei)


File	Line	Site	Suggested fix
`csrc/utils/layout.hpp`	76	`get_default_recipe`: SM 10 → `(1, 1, 128)`	accept `arch_major == 12` for the same recipe (verified working on GB10)
`csrc/apis/layout.hpp`	48, 56, 106, 110	SF transformation for FP4	accept `arch_major == 12` (verified GB10)
`csrc/jit/compiler.hpp`	NVCCCompiler + NVRTCCompiler `include_dirs`	cutlass headers from `third-party/cutlass/include`	both compiler classes need `-I third-party/cutlass/include` added — currently NVCC fails with `fatal error: cutlass/detail/helper_macros.hpp: No such file or directory`

File	Site	Symptom on GB10
`csrc/apis/gemm.hpp:99`	`fp8_fp4_gemm_nt` dispatch only knows `sm90_fp8_gemm_*` and `sm100_fp8_fp4_gemm_1d1d`	Routing to `sm100_fp8_fp4_gemm_1d1d` produces `ptxas: Instruction 'tcgen05.fence' not supported on .target 'sm_120f'`, `Feature '.block32' not supported on .target 'sm_120f'`. Needs `sm120_fp8_fp4_gemm_1d1d` kernel implementation — there is no SM 120 equivalent yet in the fork.
`csrc/apis/attention.hpp:67, 177, 367`	FP4 attention dispatch	Same `tcgen05.` ISA mismatch as above. Needs SM 120 FP4 attention kernel(s)*.
`csrc/apis/einsum.hpp:55`	FP4 einsum dispatch	Same problem. Needs SM 120 FP4 einsum kernel (the existing `sm120_fp8_einsum` covers FP8 not FP4).

vLLM file	What	PR
`vllm/platforms/cuda.py support_deep_gemm`	accept `is_device_capability_family(120)`	#41062
`vllm/model_executor/layers/fused_moe/experts/deep_gemm_moe.py`	`DeepGemmFP4Experts._supports_current_device`	#41062
`vllm/model_executor/layers/fused_moe/experts/gpt_oss_triton_kernels_moe.py`	`_supports_current_device` cap range `< (11, 0)` → `< (13, 0)`	#41028
`vllm/.../csrc CMakeLists.txt MARLIN_*_ARCHS`	add `12.0f` family	#40923 (Approved by @Harry-Chen)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Tracking] DeepGEMM SM 12.x kernel coverage gaps for DeepSeek-V4-Flash on consumer Blackwell (RTX 50 / GB10) #41063

SM 12.x kernel coverage gaps for DeepSeek-V4-Flash on consumer Blackwell (RTX 50 / GB10)

What works in `jasl/DeepGEMM` today

What's still missing for V4-Flash on SM 12.x

1. Dispatch routing — pure dispatch fixes (safe to extend to SM 12.x today)

2. The hard wall — kernel-source-level gaps (require actual SM 12.x kernel writing)

3. Companion vLLM-side gates that also need to relax

Reproduction command

Suggested rollup for `cleaner re-submission`

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

[Tracking] DeepGEMM SM 12.x kernel coverage gaps for DeepSeek-V4-Flash on consumer Blackwell (RTX 50 / GB10) #41063

Description

SM 12.x kernel coverage gaps for DeepSeek-V4-Flash on consumer Blackwell (RTX 50 / GB10)

What works in jasl/DeepGEMM today

What's still missing for V4-Flash on SM 12.x

1. Dispatch routing — pure dispatch fixes (safe to extend to SM 12.x today)

2. The hard wall — kernel-source-level gaps (require actual SM 12.x kernel writing)

3. Companion vLLM-side gates that also need to relax

Reproduction command

Suggested rollup for cleaner re-submission

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

What works in `jasl/DeepGEMM` today

Suggested rollup for `cleaner re-submission`