Qwen3.5-27B fails on inference with NCCL tensor parallelism on 2 GPUs. Model loads fine, but every request errors with a matmul dimension mismatch. Also affects Qwen3.5 MoE and Qwen3 Next (all models using GDN layers).
Commit: aa1f166 (v0.8.1)
Minimal reproducible example:
`mistralrs serve --port 8080 -m Qwen/Qwen3.5-27B --isq 6`
With 2x RTX 3090 (NCCL auto-enabled, world_size=2). Built with cuda,cudnn,flash-attn,nccl features.
Error:
`ERROR mistralrs_core::engine: prompt step - Model failed with error: mismatch on matmul dim [5120, 3072] [1, 1, 6144]`
What I found:
In `models/gdn.rs`, the GDN layers have a TP mismatch:
- `out_proj` is loaded as `RowParallelLayer` (TP-sharded), so its weight becomes [5120, 3072] with TP=2
- `in_proj_qkvz` and `in_proj_ba` are loaded as plain `Linear` (not TP-sharded)

The GDN forward pass therefore runs the full recurrence at the unsharded width (value_dim = 6144), then hits the sharded `out_proj`, which expects an input width of 3072.
Dimensions from the model config confirm this:
- value_dim = linear_num_value_heads (48) * linear_value_head_dim (128) = 6144
- hidden_size = 5120
- RowParallel with TP=2: weight [5120, 3072], but the input is [1, 1, 6144]
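The mismatch is pure shape arithmetic. A minimal, shape-only sketch (no real tensors; the numbers come from the config values above):

```rust
// Shape-only reproduction of the dimension mismatch with TP=2.
fn main() {
    let world_size = 2;
    let hidden_size = 5120;
    let value_dim = 48 * 128; // linear_num_value_heads * linear_value_head_dim = 6144

    // out_proj is RowParallelLayer: its input dim is sharded across ranks,
    // so the per-rank weight is [hidden_size, value_dim / world_size].
    let out_proj_weight = (hidden_size, value_dim / world_size); // [5120, 3072]

    // in_proj_qkvz / in_proj_ba are plain Linear, so the GDN recurrence
    // produces activations at the full, unsharded width.
    let recurrence_out_width = value_dim; // 6144

    // A matmul requires the contraction dims to agree -- here they don't.
    assert_ne!(recurrence_out_width, out_proj_weight.1);
    println!(
        "mismatch on matmul dim [{}, {}] [1, 1, {}]",
        out_proj_weight.0, out_proj_weight.1, recurrence_out_width
    );
}
```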
The full attention layers in `text.rs` use `ColumnParallelLayer`/`RowParallelLayer` consistently, which avoids this mismatch.
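For contrast, here is the consistent pattern as shape arithmetic: a column-parallel layer splits its output dim and a row-parallel layer splits its input dim, so the per-rank widths compose cleanly (this is a sketch of the sharding scheme, not mistral.rs API):

```rust
// Shape-only sketch: ColumnParallel up-projection feeding RowParallel out_proj.
fn main() {
    let world_size = 2;
    let hidden_size = 5120;
    let value_dim = 6144;

    // ColumnParallel: the output dim is sharded, so each rank produces
    // activations of width value_dim / world_size.
    let col_out_per_rank = value_dim / world_size; // 3072

    // RowParallel: the input dim is sharded, so each rank's out_proj
    // weight is [hidden_size, value_dim / world_size].
    let row_in_per_rank = value_dim / world_size; // 3072

    // The per-rank widths line up, and an all-reduce after the row-parallel
    // layer restores the full hidden_size output across ranks.
    assert_eq!(col_out_per_rank, row_in_per_rank);
    println!(
        "per-rank activations [.., {}] feed out_proj [{}, {}]",
        col_out_per_rank, hidden_size, row_in_per_rank
    );
}
```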
Also found a secondary issue: UQFF serialization filters artifacts by `isq_serde_supported()` (isq.rs:718), but deserialization checks against the unfiltered count, causing an artifact count mismatch when loading UQFF files.
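A minimal sketch of that count-mismatch pattern (the names `serde_supported` and the layer list are hypothetical, not the actual isq.rs code):

```rust
// Hypothetical illustration of the serialize/deserialize count mismatch.
fn serde_supported(kind: &str) -> bool {
    // Pretend some layer kinds cannot be serialized to UQFF.
    kind != "unsupported"
}

fn main() {
    let layers = ["linear", "linear", "unsupported", "linear"];

    // Serialization writes only the supported artifacts...
    let written: Vec<_> = layers.iter().filter(|k| serde_supported(k)).collect();

    // ...but a loader that validates against the unfiltered layer count
    // sees 3 artifacts where it expects 4, and rejects the file.
    let expected_naive = layers.len();
    assert_ne!(written.len(), expected_naive);

    // Fix: the loader must apply the same serde_supported() filter.
    let expected_fixed = layers.iter().filter(|k| serde_supported(k)).count();
    assert_eq!(written.len(), expected_fixed);
}
```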
Other information:
- Ubuntu 24.04, CUDA 13.2.0
- 2x RTX 3090 24GB, PCIe (no NVLink)
Fix in PR #2054. Separate UQFF + TP VRAM issue tracked in #2053.