Skip to content

[Tracking] DeepGEMM SM 12.x kernel coverage gaps for DeepSeek-V4-Flash on consumer Blackwell (RTX 50 / GB10) #41063

@tonyliu312

Description

@tonyliu312

SM 12.x kernel coverage gaps for DeepSeek-V4-Flash on consumer Blackwell (RTX 50 / GB10)

Filed against jasl/DeepGEMM and tracked alongside vLLM #40899.

This is a comprehensive map of what's missing to run DeepSeek-V4-Flash on SM 12.x end-to-end through DeepGEMM, gathered from a 10-layer end-to-end debugging pass on dual NVIDIA GB10 (DGX Spark, SM 121, aarch64) using jasl/DeepGEMM@7a7a41a1 + jasl/vllm:ds4-sm120@8d0ebb76 against vLLM nightly 0.19.2rc1.dev220+g7b1bc0a3e.

What works in jasl/DeepGEMM today

The fork already includes SM 120-native CuTeDSL kernels for these V4 paths:

  • sm120_tf32_hc_prenorm_gemm (mHC pre-norm GEMM — V4's hyper-connection layer)
  • sm120_fp8_einsum
  • sm120_fp8_paged_mqa_logits
  • MmaMXF4NVF4Op (with the cute_dsl/warp/mma.py sm_120a whitelist patch — NOT in fork)

These paths boot end-to-end on RTX Pro 6000 (SM 120) per the PR description.

What's still missing for V4-Flash on SM 12.x

These dispatch sites in the fork still hard-code arch_major == 9 (Hopper) or arch_major == 10 (datacenter Blackwell) and reject SM 120/121, OR route to SM 10 kernels that use tcgen05.* instructions which SM 12.x silicon does not implement.

1. Dispatch routing — pure dispatch fixes (safe to extend to SM 12.x today)

File Line Site Suggested fix
csrc/utils/layout.hpp 76 get_default_recipe: SM 10 → (1, 1, 128) accept arch_major == 12 for the same recipe (verified working on GB10)
csrc/apis/layout.hpp 48, 56, 106, 110 SF transformation for FP4 accept arch_major == 12 (verified GB10)
csrc/jit/compiler.hpp NVCCCompiler + NVRTCCompiler include_dirs cutlass headers from third-party/cutlass/include both compiler classes need -I third-party/cutlass/include added — currently NVCC fails with fatal error: cutlass/detail/helper_macros.hpp: No such file or directory

These alone unblock the next layer of failures.

2. The hard wall — kernel-source-level gaps (require actual SM 12.x kernel writing)

File Site Symptom on GB10
csrc/apis/gemm.hpp:99 fp8_fp4_gemm_nt dispatch only knows sm90_fp8_gemm_* and sm100_fp8_fp4_gemm_1d1d Routing to sm100_fp8_fp4_gemm_1d1d produces ptxas: Instruction 'tcgen05.fence' not supported on .target 'sm_120f', Feature '.block32' not supported on .target 'sm_120f'. Needs sm120_fp8_fp4_gemm_1d1d kernel implementation — there is no SM 120 equivalent yet in the fork.
csrc/apis/attention.hpp:67, 177, 367 FP4 attention dispatch Same tcgen05.* ISA mismatch as above. Needs SM 120 FP4 attention kernel(s).
csrc/apis/einsum.hpp:55 FP4 einsum dispatch Same problem. Needs SM 120 FP4 einsum kernel (the existing sm120_fp8_einsum covers FP8 not FP4).

3. Companion vLLM-side gates that also need to relax

These are independent of DeepGEMM but block the deployment from reaching DeepGEMM in the first place. We've submitted the corresponding vLLM PRs:

vLLM file What PR
vllm/platforms/cuda.py support_deep_gemm accept is_device_capability_family(120) #41062
vllm/model_executor/layers/fused_moe/experts/deep_gemm_moe.py DeepGemmFP4Experts._supports_current_device #41062
vllm/model_executor/layers/fused_moe/experts/gpt_oss_triton_kernels_moe.py _supports_current_device cap range < (11, 0)< (13, 0) #41028
vllm/.../csrc CMakeLists.txt MARLIN_*_ARCHS add 12.0f family #40923 (Approved by @Harry-Chen)

Reproduction command

# Inside vllm/vllm-openai:nightly-aarch64 with jasl/DeepGEMM@7a7a41a1 + the
# jasl/vllm ds4-sm120 branch synced into /usr/local/lib/python3.12/dist-packages/vllm/
vllm serve /models/DeepSeek-V4-Flash \
  --tensor-parallel-size 2 --distributed-executor-backend ray \
  --moe-backend deep_gemm \
  --kv-cache-dtype fp8_ds_mla \
  --gpu-memory-utilization 0.78 \
  --load-format instanttensor \
  --compilation-config '{"cudagraph_mode": "PIECEWISE"}' \
  --trust-remote-code

After applying section-1 patches, the tcgen05 failure surfaces during torch.compile / cudagraph capture (engine init, not at runtime).

Suggested rollup for cleaner re-submission

When you do the cleaner PR rebase you mentioned in #40969, please consider including:

  1. Section 1 patches wholesale — they're literal or arch_major == 12 additions and the cutlass include for the JIT compiler (~10 lines total)
  2. Authoring SM 120 FP4 kernels for the three Section-2 sites, or alternatively gating --moe-backend deep_gemm to fall through to the Triton sparse-MLA fallback on SM 12.x for FP4 ops until the kernels exist

Happy to test any rebased branch on dual GB10 TP=2 — quick turnaround (boot ≤ 2 min, TP=2 init ≤ 90s) and we already have the patch stack staged. cc @jasl @WoosukKwon @zyongye @lukealonso

@tonyliu312 (filed from a debug trace on dual DGX Spark / SM 121 at 2026-04-28 01:25 Taipei)

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions