[CI Failure]: `Qwen3.6-35B-A3B-FP8` fails on `NVIDIA GB10` with `cutlass_scaled_mm` / `cutlass_gemm_caller Error Internal` under vLLM nightly + CUDA 13.0 #40758

### Name of failing test

`Qwen3.6-35B-A3B-FP8` on an `NVIDIA GB10` system

### Basic information

- [ ] Flaky test
- [ ] Can reproduce locally
- [ ] Caused by external libraries (e.g. bug in `transformers`)

### 🧪 Describe the failing test

# Summary

I am testing `Qwen3.6-35B-A3B-FP8` on an `NVIDIA GB10` system with:

- `NVIDIA-SMI 580.142`
- `Driver Version: 580.142`
- `CUDA Version: 13.0`

I can reproduce a startup failure in vLLM when launching the FP8 model through the OpenAI-compatible server.

The failure happens during engine initialization (the profile run) and crashes inside:

- `torch.ops._C.cutlass_scaled_mm.default(...)`
- `cutlass_gemm_caller ... Error Internal`

### 📝 History of failing test

# Environment

- Hardware: `NVIDIA GB10`
- Driver: `580.142`
- CUDA: `13.0`
- Image: `vllm/vllm-openai:nightly`
- vLLM in container log:
  - `0.19.2rc1.dev134+gfe9c3d6c5`
- Host Python env:
  - `torch 2.11.0+cu130`
  - `vllm 0.19.2rc1.dev142+g4a79262e0`

# Model

- `Qwen3.6-35B-A3B-FP8`

# Launch args

Current relevant launch args:

```text
--model /models/Qwen3.6-35B-A3B-FP8
--served-model-name Qwen3.6-35B-A3B-FP8
--gpu-memory-utilization 0.7
--max-model-len 4096
--enforce-eager
```

Before adding `--enforce-eager`, the crash also went through:

- `vllm/compilation/cuda_graph.py`
- `torch/_inductor`
- `cutlass_scaled_mm`

With `--enforce-eager`, vLLM reports that torch.compile and CUDA graphs are disabled, so the crash no longer goes through the compilation path listed above. I am still validating whether the model comes up fully in this mode.
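For anyone trying to reproduce this outside the container, the launch args above can be assembled into a full command line. This is a minimal sketch, assuming the server is started through the `python -m vllm.entrypoints.openai.api_server` module entrypoint; the exact command run inside the `vllm/vllm-openai` image may differ, and `build_command` is a hypothetical helper, not part of vLLM.

```python
import shlex
import sys

# Launch args copied verbatim from this report.
LAUNCH_ARGS = [
    "--model", "/models/Qwen3.6-35B-A3B-FP8",
    "--served-model-name", "Qwen3.6-35B-A3B-FP8",
    "--gpu-memory-utilization", "0.7",
    "--max-model-len", "4096",
    "--enforce-eager",
]


def build_command(python=sys.executable, extra_args=()):
    """Return the argv list for starting the OpenAI-compatible server.

    Assumption: the server is launched via the module entrypoint below.
    """
    return [
        python,
        "-m",
        "vllm.entrypoints.openai.api_server",
        *LAUNCH_ARGS,
        *extra_args,
    ]


if __name__ == "__main__":
    # Print a copy-pasteable shell command; nothing is executed here.
    print(shlex.join(build_command()))
```

Dropping `--enforce-eager` from `LAUNCH_ARGS` gives the earlier configuration, where the crash went through the CUDA graph / inductor path.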

# Error

Relevant traceback:

```text
RuntimeError: cutlass_gemm_caller, /workspace/csrc/libtorch_stable/quantization/w8a8/cutlass/c3x/cutlass_gemm_caller.cuh:61, Error Internal
```

And the call site is:

```text
torch.ops._C.cutlass_scaled_mm.default(...)
```
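When comparing logs across images or commits, it can help to pull the source file, line number, and status string out of the error message. The sketch below is only illustrative: the regular expression is an assumption based on the single message shown above, and `parse_cutlass_error` is a hypothetical helper name.

```python
import re

# Pattern inferred from the one error line in this report:
#   "<function>, <source path>:<line>, <status text>"
_CUTLASS_ERROR = re.compile(
    r"(?P<func>\w+), (?P<path>\S+):(?P<line>\d+), (?P<status>.+)$"
)


def parse_cutlass_error(message):
    """Split a cutlass_gemm_caller error message into its parts.

    Returns a dict with func, path, line, and status, or None if the
    message does not match the assumed layout.
    """
    match = _CUTLASS_ERROR.search(message.strip())
    if match is None:
        return None
    parts = match.groupdict()
    parts["line"] = int(parts["line"])
    return parts


if __name__ == "__main__":
    msg = (
        "RuntimeError: cutlass_gemm_caller, "
        "/workspace/csrc/libtorch_stable/quantization/w8a8/cutlass/c3x/"
        "cutlass_gemm_caller.cuh:61, Error Internal"
    )
    print(parse_cutlass_error(msg))
```

If the reported line number changes between builds, that would suggest the failing check moved or a different status is being raised.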



### CC List.

# What I already checked

- Switched from `vllm/vllm-openai:latest` to `vllm/vllm-openai:nightly`
- Upgraded host environment to:
  - `torch 2.11.0+cu130`
  - `vllm 0.19.2rc1.dev142+...`
- Confirmed the failure is not limited to the older image (it also reproduces on nightly)
- Confirmed this is a different failure from the earlier KV-cache sizing error
- Confirmed the crash occurs specifically in the FP8 path

# Question

Is `Qwen3.6-35B-A3B-FP8` on `GB10 / CUDA 13.0` currently expected to work in vLLM nightly?

If yes, is there a known workaround for the `cutlass_scaled_mm` / `cutlass_gemm_caller Error Internal` failure on GB10?

I would appreciate guidance on any of the following:

- recommended nightly image / commit for `GB10`
- required environment variables or flags
- whether FP8 currently requires disabling a specific backend
- whether this is a known CUTLASS / torch / vLLM issue on `sm_121`
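To make the `sm_121` question easier to check from inside the container, the snippet below prints the device name, architecture tag, and torch / CUDA versions that PyTorch sees. It is a rough sketch using standard `torch.cuda` queries; `sm_tag` and `describe_gpu` are hypothetical helper names, and the output on GB10 is expected, not verified here, to show `sm_121`.

```python
def sm_tag(capability):
    """Format a (major, minor) compute capability as an sm_XY tag."""
    major, minor = capability
    return f"sm_{major}{minor}"


def describe_gpu(index=0):
    """Return basic GPU / toolkit info, or None without torch or CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return {
        "name": torch.cuda.get_device_name(index),
        "arch": sm_tag(torch.cuda.get_device_capability(index)),
        "torch": torch.__version__,
        "cuda": torch.version.cuda,
    }


if __name__ == "__main__":
    print(describe_gpu())
```

Attaching this output alongside the traceback would show which architecture the FP8 kernels are being dispatched for.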

