[CI Failure]: Qwen3.6-35B-A3B-FP8 fails on NVIDIA GB10 with cutlass_scaled_mm / cutlass_gemm_caller Error Internal under vLLM nightly + CUDA 13.0 #40758

@amuin-2hz

Name of failing test

Qwen3.6-35B-A3B-FP8 on an NVIDIA GB10 system

Basic information

  • Flaky test
  • Can reproduce locally
  • Caused by external libraries (e.g. bug in transformers)

🧪 Describe the failing test

Summary

I am testing Qwen3.6-35B-A3B-FP8 on an NVIDIA GB10 system with:

  • NVIDIA-SMI 580.142
  • Driver Version: 580.142
  • CUDA Version: 13.0

I can reproduce a startup failure in vLLM when launching the FP8 model through the OpenAI server.

The failure occurs during engine initialization (the profile run) and crashes inside:

  • torch.ops._C.cutlass_scaled_mm.default(...)
  • cutlass_gemm_caller ... Error Internal

📝 History of failing test

Environment

  • Hardware: NVIDIA GB10
  • Driver: 580.142
  • CUDA: 13.0
  • Image: vllm/vllm-openai:nightly
  • vLLM in container log:
    • 0.19.2rc1.dev134+gfe9c3d6c5
  • Host Python env:
    • torch 2.11.0+cu130
    • vllm 0.19.2rc1.dev142+g4a79262e0
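The container log and the host environment report different vLLM builds (dev134 vs. dev142), so it helps to record exactly which versions are present wherever the failing launch runs. A minimal sketch using only the standard library; the helper names `installed_versions` and `local_build` are illustrative and not part of vLLM:

```python
from importlib import metadata


def installed_versions(packages=("torch", "vllm")):
    """Return {package: version or None} for the current Python environment."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is absent from this environment.
            versions[name] = None
    return versions


def local_build(version):
    """Split a version string into (public part, local build tag)."""
    public, _, local = version.partition("+")
    return public, local


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running it both inside the container and on the host makes the difference between the two builds explicit, for example `local_build("0.19.2rc1.dev142+g4a79262e0")` separates the dev number from the commit tag.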

Model

  • Qwen3.6-35B-A3B-FP8

Launch args

Current relevant launch args:

--model /models/Qwen3.6-35B-A3B-FP8
--served-model-name Qwen3.6-35B-A3B-FP8
--gpu-memory-utilization 0.7
--max-model-len 4096
--enforce-eager

Before I added --enforce-eager, the traceback also passed through:

  • vllm/compilation/cuda_graph.py
  • torch/_inductor
  • cutlass_scaled_mm

With --enforce-eager, vLLM reports that torch.compile and CUDAGraph are disabled, so the crash no longer goes through the compilation / CUDA graph path. I am still validating whether the model can fully come up in this mode.
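To separate the engine-initialization failure from the OpenAI server layer, the same settings can be passed to vLLM's offline `LLM` class. This is a sketch, assuming the model is mounted at the same `/models/...` path as in the launch args; the prompt and `max_tokens` value are arbitrary placeholders, and I have not confirmed that this reproduces the identical traceback:

```python
# Engine settings mirroring the launch args listed above.
ENGINE_KWARGS = {
    "model": "/models/Qwen3.6-35B-A3B-FP8",
    "gpu_memory_utilization": 0.7,
    "max_model_len": 4096,
    "enforce_eager": True,  # disables torch.compile and CUDA graphs
}


def main():
    # Imported lazily so the settings can be inspected without vLLM installed.
    from vllm import LLM, SamplingParams

    # Engine initialization (including the profile run) happens in the constructor.
    llm = LLM(**ENGINE_KWARGS)
    outputs = llm.generate(["Hello"], SamplingParams(max_tokens=8))
    print(outputs[0].outputs[0].text)


if __name__ == "__main__":
    main()
```

If the constructor raises the same `cutlass_gemm_caller ... Error Internal`, the server entrypoint can be ruled out as a factor.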

Error

Relevant traceback:

RuntimeError: cutlass_gemm_caller, /workspace/csrc/libtorch_stable/quantization/w8a8/cutlass/c3x/cutlass_gemm_caller.cuh:61, Error Internal

And the call site is:

torch.ops._C.cutlass_scaled_mm.default(...)

What I already checked

  • Switched from vllm/vllm-openai:latest to vllm/vllm-openai:nightly
  • Upgraded host environment to:
    • torch 2.11.0+cu130
    • vllm 0.19.2rc1.dev142+...
  • Confirmed the failure is not limited to the older image (it also occurs on nightly)
  • Confirmed this is not the earlier KV-cache sizing failure
  • Confirmed the FP8 path specifically is involved

Question

Is Qwen3.6-35B-A3B-FP8 on GB10 / CUDA 13.0 currently expected to work in vLLM nightly?

If yes, is there a known workaround for the cutlass_scaled_mm / cutlass_gemm_caller Error Internal failure on GB10?

Possible things I would like guidance on:

  • recommended nightly image / commit for GB10
  • required environment variables or flags
  • whether FP8 currently requires disabling a specific backend
  • whether this is a known CUTLASS / torch / vLLM issue on sm_121
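For anyone triaging the sm_121 question, the following sketch prints what PyTorch itself reports for the device and for the architectures its binary was built for. It uses only standard `torch.cuda` calls; `sm_name` and `report_gpu` are illustrative helper names, and the output says nothing about which architectures the vLLM CUTLASS kernels were compiled for:

```python
def sm_name(capability):
    """Format a (major, minor) compute capability as an sm_XY string."""
    major, minor = capability
    return f"sm_{major}{minor}"


def report_gpu():
    """Describe device 0 as seen by PyTorch, or explain why that is not possible."""
    try:
        import torch
    except ImportError:
        return "torch is not installed"
    if not torch.cuda.is_available():
        return "no CUDA device visible"
    capability = torch.cuda.get_device_capability(0)
    return (
        f"{torch.cuda.get_device_name(0)}: {sm_name(capability)}, "
        f"torch {torch.__version__}, CUDA {torch.version.cuda}, "
        f"torch arch list {torch.cuda.get_arch_list()}"
    )


if __name__ == "__main__":
    print(report_gpu())
```

Including this line in the report makes it easier to tell whether the device is detected as sm_121 and which CUDA build of torch is actually in use.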

Metadata

  • Labels: ci-failure (Issue about an unexpected test failure in CI)
  • Status: Done