SM 12.x kernel coverage gaps for DeepSeek-V4-Flash on consumer Blackwell (RTX 50 / GB10)
Filed against jasl/DeepGEMM and tracked alongside vLLM #40899.
This is a comprehensive map of what's missing to run DeepSeek-V4-Flash on SM 12.x end-to-end through DeepGEMM, gathered from a 10-layer end-to-end debugging pass on dual NVIDIA GB10 (DGX Spark, SM 121, aarch64) using jasl/DeepGEMM@7a7a41a1 + jasl/vllm:ds4-sm120@8d0ebb76 against vLLM nightly 0.19.2rc1.dev220+g7b1bc0a3e.
What works in jasl/DeepGEMM today
The fork already includes SM 120-native CuTeDSL kernels for these V4 paths:
sm120_tf32_hc_prenorm_gemm (mHC pre-norm GEMM — V4's hyper-connection layer)
sm120_fp8_einsum
sm120_fp8_paged_mqa_logits
MmaMXF4NVF4Op (with the cute_dsl/warp/mma.py sm_120a whitelist patch — NOT in fork)
These paths boot end-to-end on RTX Pro 6000 (SM 120) per the PR description.
What's still missing for V4-Flash on SM 12.x
These dispatch sites in the fork still hard-code arch_major == 9 (Hopper) or arch_major == 10 (datacenter Blackwell) and reject SM 120/121, OR route to SM 10 kernels that use tcgen05.* instructions which SM 12.x silicon does not implement.
1. Dispatch routing — pure dispatch fixes (safe to extend to SM 12.x today)
| File |
Line |
Site |
Suggested fix |
csrc/utils/layout.hpp |
76 |
get_default_recipe: SM 10 → (1, 1, 128) |
accept arch_major == 12 for the same recipe (verified working on GB10) |
csrc/apis/layout.hpp |
48, 56, 106, 110 |
SF transformation for FP4 |
accept arch_major == 12 (verified GB10) |
csrc/jit/compiler.hpp |
NVCCCompiler + NVRTCCompiler include_dirs |
cutlass headers from third-party/cutlass/include |
both compiler classes need -I third-party/cutlass/include added — currently NVCC fails with fatal error: cutlass/detail/helper_macros.hpp: No such file or directory |
These alone unblock the next layer of failures.
2. The hard wall — kernel-source-level gaps (require actual SM 12.x kernel writing)
| File |
Site |
Symptom on GB10 |
csrc/apis/gemm.hpp:99 |
fp8_fp4_gemm_nt dispatch only knows sm90_fp8_gemm_* and sm100_fp8_fp4_gemm_1d1d |
Routing to sm100_fp8_fp4_gemm_1d1d produces ptxas: Instruction 'tcgen05.fence' not supported on .target 'sm_120f', Feature '.block32' not supported on .target 'sm_120f'. Needs sm120_fp8_fp4_gemm_1d1d kernel implementation — there is no SM 120 equivalent yet in the fork. |
csrc/apis/attention.hpp:67, 177, 367 |
FP4 attention dispatch |
Same tcgen05.* ISA mismatch as above. Needs SM 120 FP4 attention kernel(s). |
csrc/apis/einsum.hpp:55 |
FP4 einsum dispatch |
Same problem. Needs SM 120 FP4 einsum kernel (the existing sm120_fp8_einsum covers FP8 not FP4). |
3. Companion vLLM-side gates that also need to relax
These are independent of DeepGEMM but block the deployment from reaching DeepGEMM in the first place. We've submitted the corresponding vLLM PRs:
| vLLM file |
What |
PR |
vllm/platforms/cuda.py support_deep_gemm |
accept is_device_capability_family(120) |
#41062 |
vllm/model_executor/layers/fused_moe/experts/deep_gemm_moe.py |
DeepGemmFP4Experts._supports_current_device |
#41062 |
vllm/model_executor/layers/fused_moe/experts/gpt_oss_triton_kernels_moe.py |
_supports_current_device cap range < (11, 0) → < (13, 0) |
#41028 |
vllm/.../csrc CMakeLists.txt MARLIN_*_ARCHS |
add 12.0f family |
#40923 (Approved by @Harry-Chen) |
Reproduction command
# Inside vllm/vllm-openai:nightly-aarch64 with jasl/DeepGEMM@7a7a41a1 + the
# jasl/vllm ds4-sm120 branch synced into /usr/local/lib/python3.12/dist-packages/vllm/
vllm serve /models/DeepSeek-V4-Flash \
--tensor-parallel-size 2 --distributed-executor-backend ray \
--moe-backend deep_gemm \
--kv-cache-dtype fp8_ds_mla \
--gpu-memory-utilization 0.78 \
--load-format instanttensor \
--compilation-config '{"cudagraph_mode": "PIECEWISE"}' \
--trust-remote-code
After applying section-1 patches, the tcgen05 failure surfaces during torch.compile / cudagraph capture (engine init, not at runtime).
Suggested rollup for cleaner re-submission
When you do the cleaner PR rebase you mentioned in #40969, please consider including:
- Section 1 patches wholesale — they're literal
or arch_major == 12 additions and the cutlass include for the JIT compiler (~10 lines total)
- Authoring SM 120 FP4 kernels for the three Section-2 sites, or alternatively gating
--moe-backend deep_gemm to fall through to the Triton sparse-MLA fallback on SM 12.x for FP4 ops until the kernels exist
Happy to test any rebased branch on dual GB10 TP=2 — quick turnaround (boot ≤ 2 min, TP=2 init ≤ 90s) and we already have the patch stack staged. cc @jasl @WoosukKwon @zyongye @lukealonso
— @tonyliu312 (filed from a debug trace on dual DGX Spark / SM 121 at 2026-04-28 01:25 Taipei)
SM 12.x kernel coverage gaps for DeepSeek-V4-Flash on consumer Blackwell (RTX 50 / GB10)
Filed against
jasl/DeepGEMMand tracked alongside vLLM #40899.This is a comprehensive map of what's missing to run DeepSeek-V4-Flash on SM 12.x end-to-end through DeepGEMM, gathered from a 10-layer end-to-end debugging pass on dual NVIDIA GB10 (DGX Spark, SM 121, aarch64) using
jasl/DeepGEMM@7a7a41a1+jasl/vllm:ds4-sm120@8d0ebb76against vLLM nightly0.19.2rc1.dev220+g7b1bc0a3e.What works in
jasl/DeepGEMMtodayThe fork already includes SM 120-native CuTeDSL kernels for these V4 paths:
sm120_tf32_hc_prenorm_gemm(mHC pre-norm GEMM — V4's hyper-connection layer)sm120_fp8_einsumsm120_fp8_paged_mqa_logitsMmaMXF4NVF4Op(with thecute_dsl/warp/mma.pysm_120awhitelist patch — NOT in fork)These paths boot end-to-end on RTX Pro 6000 (SM 120) per the PR description.
What's still missing for V4-Flash on SM 12.x
These dispatch sites in the fork still hard-code
arch_major == 9(Hopper) orarch_major == 10(datacenter Blackwell) and reject SM 120/121, OR route to SM 10 kernels that usetcgen05.*instructions which SM 12.x silicon does not implement.1. Dispatch routing — pure dispatch fixes (safe to extend to SM 12.x today)
csrc/utils/layout.hppget_default_recipe: SM 10 →(1, 1, 128)arch_major == 12for the same recipe (verified working on GB10)csrc/apis/layout.hpparch_major == 12(verified GB10)csrc/jit/compiler.hppinclude_dirsthird-party/cutlass/include-I third-party/cutlass/includeadded — currently NVCC fails withfatal error: cutlass/detail/helper_macros.hpp: No such file or directoryThese alone unblock the next layer of failures.
2. The hard wall — kernel-source-level gaps (require actual SM 12.x kernel writing)
csrc/apis/gemm.hpp:99fp8_fp4_gemm_ntdispatch only knowssm90_fp8_gemm_*andsm100_fp8_fp4_gemm_1d1dsm100_fp8_fp4_gemm_1d1dproducesptxas: Instruction 'tcgen05.fence' not supported on .target 'sm_120f',Feature '.block32' not supported on .target 'sm_120f'. Needssm120_fp8_fp4_gemm_1d1dkernel implementation — there is no SM 120 equivalent yet in the fork.csrc/apis/attention.hpp:67, 177, 367tcgen05.*ISA mismatch as above. Needs SM 120 FP4 attention kernel(s).csrc/apis/einsum.hpp:55sm120_fp8_einsumcovers FP8 not FP4).3. Companion vLLM-side gates that also need to relax
These are independent of DeepGEMM but block the deployment from reaching DeepGEMM in the first place. We've submitted the corresponding vLLM PRs:
vllm/platforms/cuda.py support_deep_gemmis_device_capability_family(120)vllm/model_executor/layers/fused_moe/experts/deep_gemm_moe.pyDeepGemmFP4Experts._supports_current_devicevllm/model_executor/layers/fused_moe/experts/gpt_oss_triton_kernels_moe.py_supports_current_devicecap range< (11, 0)→< (13, 0)vllm/.../csrc CMakeLists.txt MARLIN_*_ARCHS12.0ffamilyReproduction command
After applying section-1 patches, the
tcgen05failure surfaces during torch.compile / cudagraph capture (engine init, not at runtime).Suggested rollup for
cleaner re-submissionWhen you do the cleaner PR rebase you mentioned in #40969, please consider including:
or arch_major == 12additions and the cutlass include for the JIT compiler (~10 lines total)--moe-backend deep_gemmto fall through to the Triton sparse-MLA fallback on SM 12.x for FP4 ops until the kernels existHappy to test any rebased branch on dual GB10 TP=2 — quick turnaround (boot ≤ 2 min, TP=2 init ≤ 90s) and we already have the patch stack staged. cc @jasl @WoosukKwon @zyongye @lukealonso
— @tonyliu312 (filed from a debug trace on dual DGX Spark / SM 121 at 2026-04-28 01:25 Taipei)