Name of failing test
Qwen3.6-35B-A3B-FP8 on an NVIDIA GB10 system
Basic information
🧪 Describe the failing test
Summary
I am testing Qwen3.6-35B-A3B-FP8 on an NVIDIA GB10 system with:
NVIDIA-SMI 580.142
Driver Version: 580.142
CUDA Version: 13.0
I can reproduce a startup failure in vLLM when launching the FP8 model through the OpenAI server.
The failure happens during engine initialization / profile run and crashes inside:
torch.ops._C.cutlass_scaled_mm.default(...)
cutlass_gemm_caller ... Error Internal
📝 History of failing test
Environment
- Hardware: NVIDIA GB10
- Driver: 580.142
- CUDA: 13.0
- Image: vllm/vllm-openai:nightly
- vLLM in container log: 0.19.2rc1.dev134+gfe9c3d6c5
- Host Python env: torch 2.11.0+cu130, vllm 0.19.2rc1.dev142+g4a79262e0
Model
Qwen3.6-35B-A3B-FP8
Launch args
Current relevant launch args:
--model /models/Qwen3.6-35B-A3B-FP8
--served-model-name Qwen3.6-35B-A3B-FP8
--gpu-memory-utilization 0.7
--max-model-len 4096
--enforce-eager
Before adding --enforce-eager, the crash also went through:
vllm/compilation/cuda_graph.py
torch/_inductor
cutlass_scaled_mm
With --enforce-eager, vLLM reports that torch.compile and CUDAGraph are disabled, which avoids the compilation / CUDA graph path listed above. I am still validating whether the model can fully come up in this mode.
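To make the two tested configurations easy to reproduce, here is a minimal sketch that assembles the argument list with and without --enforce-eager. The flag values are copied from this report; the helper name `launch_args` is hypothetical, and the sketch only builds the argument string rather than starting the server, since the exact container entrypoint is not stated here.

```python
import shlex

# Flags copied from the launch args in this report.
# --enforce-eager is the only flag toggled between the two runs.
BASE_ARGS = [
    "--model", "/models/Qwen3.6-35B-A3B-FP8",
    "--served-model-name", "Qwen3.6-35B-A3B-FP8",
    "--gpu-memory-utilization", "0.7",
    "--max-model-len", "4096",
]


def launch_args(enforce_eager: bool) -> list[str]:
    """Return the server argument list for one of the two tested configurations."""
    args = list(BASE_ARGS)
    if enforce_eager:
        args.append("--enforce-eager")
    return args


if __name__ == "__main__":
    # Print both variants so they can be pasted after the server command.
    for eager in (False, True):
        print(f"enforce_eager={eager}: {shlex.join(launch_args(eager))}")
```

The printed strings can be appended to whatever command starts the OpenAI server in the image being tested.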
Error
Relevant traceback:
RuntimeError: cutlass_gemm_caller, /workspace/csrc/libtorch_stable/quantization/w8a8/cutlass/c3x/cutlass_gemm_caller.cuh:61, Error Internal
And the call site is:
torch.ops._C.cutlass_scaled_mm.default(...)
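When comparing logs across images, it can help to confirm that each run fails with the same signature. The following is a rough sketch that parses the error line above into its parts. It assumes other occurrences follow the same "caller, path:line, status" layout as the single line in this report; the function name `parse_cutlass_error` is hypothetical.

```python
import re

# The error line exactly as it appears in this report.
ERROR_LINE = (
    "RuntimeError: cutlass_gemm_caller, "
    "/workspace/csrc/libtorch_stable/quantization/w8a8/cutlass/c3x/"
    "cutlass_gemm_caller.cuh:61, Error Internal"
)

# Assumed layout: "RuntimeError: <caller>, <path>:<line>, <status>"
PATTERN = re.compile(
    r"RuntimeError: (?P<caller>\w+), (?P<path>\S+):(?P<line>\d+), (?P<status>.+)$"
)


def parse_cutlass_error(line: str):
    """Return caller, path, line and status as a dict, or None if the line does not match."""
    match = PATTERN.search(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["line"] = int(fields["line"])
    return fields


if __name__ == "__main__":
    print(parse_cutlass_error(ERROR_LINE))
```

Two runs that produce the same caller, path, line and status are very likely hitting the same check in the CUTLASS caller.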
CC List.
What I already checked
- Switched from vllm/vllm-openai:latest to vllm/vllm-openai:nightly
- Upgraded host environment to: torch 2.11.0+cu130, vllm 0.19.2rc1.dev142+...
- Confirmed this is not only an old-image issue
- Confirmed this is not the earlier KV-cache sizing failure
- Confirmed the FP8 path specifically is involved
Question
Is Qwen3.6-35B-A3B-FP8 on GB10 / CUDA 13.0 currently expected to work in vLLM nightly?
If yes, is there a known workaround for the cutlass_scaled_mm / cutlass_gemm_caller Error Internal failure on GB10?
Possible things I would like guidance on:
- recommended nightly image / commit for GB10
- required environment variables or flags
- whether FP8 currently requires disabling a specific backend
- whether this is a known CUTLASS / torch / vLLM issue on sm_121
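For anyone trying to reproduce this, a small check of what torch reports for the device may help confirm the sm_121 target. This is a sketch only: it assumes the GB10 reports compute capability 12.1, as the sm_121 mention implies, and the helper names `sm_name` and `report_device` are hypothetical. It uses `torch.cuda.get_device_capability`, `torch.cuda.get_device_name` and `torch.version.cuda`.

```python
def sm_name(capability):
    """Format a (major, minor) compute capability tuple as an sm_XY string."""
    major, minor = capability
    return f"sm_{major}{minor}"


def report_device():
    """Describe the first CUDA device as seen by torch, or say why that is not possible."""
    try:
        import torch
    except ImportError:
        return "torch is not installed in this environment"
    if not torch.cuda.is_available():
        return "torch is installed but no CUDA device is visible"
    capability = torch.cuda.get_device_capability(0)
    return (
        f"{torch.cuda.get_device_name(0)}: {sm_name(capability)}, "
        f"torch {torch.__version__}, CUDA {torch.version.cuda}"
    )


if __name__ == "__main__":
    print(report_device())
```

Running this both on the host and inside the container shows whether the two environments see the same device and CUDA build.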