Qwen3.5-27B: matmul dimension mismatch with NCCL tensor parallelism (2 GPUs) #2052

@ormandj

Description

Qwen3.5-27B fails on inference with NCCL tensor parallelism on 2 GPUs. Model loads fine, but every request errors with a matmul dimension mismatch. Also affects Qwen3.5 MoE and Qwen3 Next (all models using GDN layers).

Commit: aa1f166 (v0.8.1)

Minimum reproducible example:

mistralrs serve --port 8080 -m Qwen/Qwen3.5-27B --isq 6

With 2x RTX 3090 (NCCL auto-enabled, world_size=2). Built with cuda,cudnn,flash-attn,nccl features.

Error:

ERROR mistralrs_core::engine: prompt step - Model failed with error: mismatch on matmul dim [5120, 3072] [1, 1, 6144]

What I found:

In models/gdn.rs, the GDN layers have a TP mismatch:

  • out_proj is loaded as RowParallelLayer (TP-sharded) — weight becomes [5120, 3072] with TP=2
  • in_proj_qkvz and in_proj_ba are loaded as plain Linear (not TP-sharded)

The GDN forward pass runs the full recurrence at non-sharded width (value_dim=6144), then hits the sharded out_proj expecting 3072.

Dimensions from the model config confirm:

  • value_dim = linear_num_value_heads(48) * linear_value_head_dim(128) = 6144
  • hidden_size = 5120
  • RowParallel with TP=2: weight [5120, 3072], but input is [1, 1, 6144]
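The failing contraction can be sketched in a few lines. This is a hypothetical helper, not mistral.rs code: it just mirrors the matmul shape check, showing that the unsharded recurrence output cannot feed the sharded out_proj, while a consistently sharded width would.

```rust
/// Mimics the matmul contraction check: the input's last dim must equal
/// the weight's input dim. (Illustrative stand-in, not the real kernel.)
fn matmul_dims_ok(input_last_dim: usize, weight_in_dim: usize) -> bool {
    input_last_dim == weight_in_dim
}

fn main() {
    let value_dim = 48 * 128; // linear_num_value_heads * linear_value_head_dim = 6144
    let tp = 2;
    let sharded_in_dim = value_dim / tp; // row-parallel out_proj shard: [5120, 3072]

    // Unsharded recurrence output vs sharded weight: the reported failure.
    assert!(!matmul_dims_ok(value_dim, sharded_in_dim));

    // If in_proj were TP-sharded too, the recurrence would run at the
    // sharded width and line up with out_proj:
    assert!(matmul_dims_ok(value_dim / tp, sharded_in_dim));
}
```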

The full attention layers in text.rs use ColumnParallelLayer/RowParallelLayer consistently, which avoids the mismatch.
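Why the paired pattern works can be shown with a shape sketch (hypothetical helper names, not the mistral.rs API): a column-parallel layer shards its output dim, so each rank hands the following row-parallel layer exactly the sharded input width its weight shard expects, and the row-parallel all-reduce restores the full output.

```rust
/// Column-parallel: full weight [out, in] sharded over `out`.
/// Input stays full-width; each rank produces out_dim / tp features.
fn column_parallel_out(out_dim: usize, tp: usize) -> usize {
    out_dim / tp
}

/// Row-parallel: full weight [out, in] sharded over `in`.
/// Each rank expects in_dim / tp input features, then all-reduces.
fn row_parallel_expected_in(in_dim: usize, tp: usize) -> usize {
    in_dim / tp
}

fn main() {
    let (value_dim, tp) = (6144, 2);
    // Column-parallel up-projection emits 3072 features per rank...
    let shard = column_parallel_out(value_dim, tp);
    // ...which matches what the row-parallel out_proj shard expects:
    assert_eq!(shard, row_parallel_expected_in(value_dim, tp));
}
```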

Also found a secondary issue: UQFF serialization filters tensors by isq_serde_supported() (isq.rs:718), but deserialization checks the artifact count against the unfiltered total, causing an artifact count mismatch when loading UQFF files.
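A minimal sketch of that count bug, with stand-in names (the real predicate lives in isq.rs): since serialization only writes tensors that pass the supported check, deserialization must compare against the filtered count, not the raw layer count.

```rust
/// Stand-in for the real isq_serde_supported() predicate in isq.rs.
fn isq_serde_supported(kind: &str) -> bool {
    kind != "unsupported"
}

/// Correct expectation: count only what serialization actually wrote.
fn expected_artifact_count(layers: &[&str]) -> usize {
    layers.iter().filter(|k| isq_serde_supported(k)).count()
}

fn main() {
    let layers = ["q6k", "q6k", "unsupported", "q6k"];
    // The UQFF file holds 3 artifacts; the buggy check compared against 4.
    assert_eq!(expected_artifact_count(&layers), 3);
    assert_ne!(expected_artifact_count(&layers), layers.len());
}
```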

Other information:

  • Ubuntu 24.04, CUDA 13.2.0
  • 2x RTX 3090 24GB, PCIe (no NVLink)

Fix in PR #2054. Separate UQFF + TP VRAM issue tracked in #2053.
