Qwen3.5-27B fails on inference with NCCL tensor parallelism on 2 GPUs. Model loads fine, but every request errors with a matmul dimension mismatch. Also affects Qwen3.5 MoE and Qwen3 Next (all models using GDN layers).
Commit: aa1f166 (v0.8.1)
Minimal reproducible example:
`mistralrs serve --port 8080 -m Qwen/Qwen3.5-27B --isq 6`
With 2x RTX 3090 (NCCL auto-enabled, world_size=2). Built with cuda,cudnn,flash-attn,nccl features.
Error:
`ERROR mistralrs_core::engine: prompt step - Model failed with error: mismatch on matmul dim [5120, 3072] [1, 1, 6144]`
What I found:
In `models/gdn.rs`, the GDN layers have a TP mismatch:
- `out_proj` is loaded as `RowParallelLayer` (TP-sharded), so its weight becomes [5120, 3072] with TP=2
- `in_proj_qkvz` and `in_proj_ba` are loaded as plain `Linear` (not TP-sharded)

The GDN forward pass therefore runs the full recurrence at the unsharded width (value_dim = 6144), then hits the sharded `out_proj`, which expects an input width of 3072.
Dimensions from the model config confirm this:
- value_dim = linear_num_value_heads (48) * linear_value_head_dim (128) = 6144
- hidden_size = 5120
- RowParallel with TP=2: weight [5120, 3072], but the input is [1, 1, 6144]
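The mismatch is pure shape arithmetic. A minimal, shape-only sketch (no real tensors; the numbers come from the config values above):

```rust
// Shape-only reproduction of the dimension mismatch with TP=2.
fn main() {
    let world_size = 2;
    let hidden_size = 5120;
    let value_dim = 48 * 128; // linear_num_value_heads * linear_value_head_dim = 6144

    // out_proj is RowParallelLayer: its input dim is sharded across ranks,
    // so the per-rank weight is [hidden_size, value_dim / world_size].
    let out_proj_weight = (hidden_size, value_dim / world_size); // [5120, 3072]

    // in_proj_qkvz / in_proj_ba are plain Linear, so the GDN recurrence
    // produces activations at the full, unsharded width.
    let recurrence_out_width = value_dim; // 6144

    // A matmul requires the contraction dims to agree -- here they don't.
    assert_ne!(recurrence_out_width, out_proj_weight.1);
    println!(
        "mismatch on matmul dim [{}, {}] [1, 1, {}]",
        out_proj_weight.0, out_proj_weight.1, recurrence_out_width
    );
}
```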
The full attention layers in `text.rs` use `ColumnParallelLayer`/`RowParallelLayer` consistently, which avoids this mismatch.
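For contrast, here is the consistent pattern as shape arithmetic: a column-parallel layer splits its output dim and a row-parallel layer splits its input dim, so the per-rank widths compose cleanly (this is a sketch of the sharding scheme, not mistral.rs API):

```rust
// Shape-only sketch: ColumnParallel up-projection feeding RowParallel out_proj.
fn main() {
    let world_size = 2;
    let hidden_size = 5120;
    let value_dim = 6144;

    // ColumnParallel: the output dim is sharded, so each rank produces
    // activations of width value_dim / world_size.
    let col_out_per_rank = value_dim / world_size; // 3072

    // RowParallel: the input dim is sharded, so each rank's out_proj
    // weight is [hidden_size, value_dim / world_size].
    let row_in_per_rank = value_dim / world_size; // 3072

    // The per-rank widths line up, and an all-reduce after the row-parallel
    // layer restores the full hidden_size output across ranks.
    assert_eq!(col_out_per_rank, row_in_per_rank);
    println!(
        "per-rank activations [.., {}] feed out_proj [{}, {}]",
        col_out_per_rank, hidden_size, row_in_per_rank
    );
}
```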
Also found a secondary issue: UQFF serialization filters artifacts by `isq_serde_supported()` (isq.rs:718), but deserialization checks against the unfiltered count, causing an artifact count mismatch when loading UQFF files.
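A minimal sketch of that count-mismatch pattern (the names `serde_supported` and the layer list are hypothetical, not the actual isq.rs code):

```rust
// Hypothetical illustration of the serialize/deserialize count mismatch.
fn serde_supported(kind: &str) -> bool {
    // Pretend some layer kinds cannot be serialized to UQFF.
    kind != "unsupported"
}

fn main() {
    let layers = ["linear", "linear", "unsupported", "linear"];

    // Serialization writes only the supported artifacts...
    let written: Vec<_> = layers.iter().filter(|k| serde_supported(k)).collect();

    // ...but a loader that validates against the unfiltered layer count
    // sees 3 artifacts where it expects 4, and rejects the file.
    let expected_naive = layers.len();
    assert_ne!(written.len(), expected_naive);

    // Fix: the loader must apply the same serde_supported() filter.
    let expected_fixed = layers.iter().filter(|k| serde_supported(k)).count();
    assert_eq!(written.len(), expected_fixed);
}
```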
Other information:
- Ubuntu 24.04, CUDA 13.2.0
- 2x RTX 3090 24GB, PCIe (no NVLink)
Fix in PR #2054. Separate UQFF + TP VRAM issue tracked in #2053.