Qwen3.5-35B-A3B on NVIDIA DGX Spark

A complete guide to running Qwen3.5-35B-A3B on the NVIDIA DGX Spark (GB10) using vLLM. Includes installation instructions, benchmark results, and configuration tips.


Overview

Qwen3.5-35B-A3B is a Mixture-of-Experts (MoE) multimodal model with:

  • 35B total parameters, but only 3B active at inference
  • 262K native context (extendable to 1M+ with YaRN)
  • Multimodal: text, images, and video input
  • Thinking mode: built-in chain-of-thought reasoning
  • Tool calling: function calling and MCP support
  • 201 languages supported
  • Apache 2.0 license

Hardware

| Component | Specification |
|---|---|
| Device | NVIDIA DGX Spark |
| GPU | NVIDIA GB10 (Blackwell) |
| Memory | 128 GB unified (shared CPU/GPU) |
| CUDA Capability | 12.1 |
| Storage | 3.7 TB NVMe SSD |
| OS | DGX OS (Ubuntu 24.04 based) |

Why This Model on DGX Spark

The MoE architecture makes this model uniquely suited for the DGX Spark:

  • Only 3B active parameters means fast inference (~31 tok/s) despite 35B total
  • ~70 GB model weights in BF16 fits comfortably in 128 GB unified memory
  • 28.6 GB remaining for KV cache after loading, supporting 374K tokens
  • Benchmarks competitive with models 10-40x the inference cost

Installation

Prerequisites

  • NVIDIA DGX Spark with DGX OS
  • Docker installed and configured
  • SSH access to the DGX Spark

Step 1: Configure Docker Permissions

sudo usermod -aG docker $USER
newgrp docker

Step 2: Pull the vLLM Docker Image

Important: The standard NVIDIA vLLM container (nvcr.io/nvidia/vllm:26.01-py3) ships with vLLM 0.13.0, which does not support Qwen3.5. You need the nightly build with Qwen3.5 support.

docker pull vllm/vllm-openai:cu130-nightly

This image contains:

  • vLLM v0.16.0+ (with Qwen3_5MoeForConditionalGeneration support)
  • CUDA 13.1
  • PyTorch with Blackwell support
  • FlashAttention backend

Note on community DGX Spark images: The avarok/vllm-dgx-spark image is purpose-built for GB10 with SM12.1 kernel optimizations, but ships with vLLM 0.14.0 which does not support Qwen3.5. As of March 2026, the nightly build above is required for Qwen3.5.

Step 3: Launch the Model

docker run -d \
  --name qwen35 \
  --restart unless-stopped \
  --gpus all \
  --ipc host \
  --shm-size 64gb \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:cu130-nightly \
  Qwen/Qwen3.5-35B-A3B \
    --served-model-name qwen3.5-35b \
    --port 8000 \
    --host 0.0.0.0 \
    --max-model-len 262144 \
    --gpu-memory-utilization 0.80 \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --enable-prefix-caching

Note: The vllm/vllm-openai:cu130-nightly image has vllm serve as its entrypoint, so the model name is passed directly as the first argument (not vllm serve Qwen/...).

CUDA graphs: On first startup, watch the logs (docker logs qwen35 -f) and confirm CUDA graph capture completes (look for Capturing CUDA graphs ... 100%). The first inference request after a fresh container start will be slow (~57s) due to torch.compile warmup, but subsequent requests run at full speed.

First run will download ~70 GB of model weights. Subsequent starts use the cached weights.

Step 4: Verify

Wait for the server to fully initialize (model download + CUDA graph capture takes ~15 minutes on first run), then:

curl http://localhost:8000/v1/models

Expected output:

{
  "data": [{"id": "qwen3.5-35b", "object": "model", "max_model_len": 262144}]
}
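
If you script against the server, a small readiness poll saves guessing when the ~15-minute first start has finished. A minimal sketch using the requests library; the URL and port match the launch command above, and the 30-minute ceiling is an arbitrary choice:

import time
import requests

def wait_for_server(base_url="http://localhost:8000", timeout_s=1800):
    """Poll /v1/models until vLLM answers or the timeout expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            r = requests.get(f"{base_url}/v1/models", timeout=5)
            if r.status_code == 200:
                return r.json()["data"][0]["id"]
        except requests.ConnectionError:
            pass  # server not listening yet
        time.sleep(10)
    raise TimeoutError("vLLM did not become ready in time")

print("Model ready:", wait_for_server())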

Configuration

Memory Allocation

| --gpu-memory-utilization | Model Weights | KV Cache | Notes |
|---|---|---|---|
| 0.80 (recommended) | ~70 GB | 28.6 GB (374K tokens) | Stable, no OOM risk |
| 0.85 | ~70 GB | ~35 GB (~460K tokens) | More headroom for long context |
| 0.90 | ~70 GB | ~42 GB (~550K tokens) | Risk of OOM after extended use |

Community reports suggest 0.90 can cause OOM after ~1 hour. Stick with 0.80 for stability.
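
The token capacities in the table follow from simple arithmetic: whatever remains of the utilization budget after the ~70 GB of weights becomes KV cache, and the measured 0.80 run (28.6 GB for 374K tokens) fixes the tokens-per-GB ratio. A back-of-the-envelope sketch using only the numbers from this guide; the authoritative figures are whatever your own startup logs report:

# Back-of-the-envelope KV cache sizing from the measured 0.80 run (BF16).
WEIGHTS_GB = 70.0
KV_GB_AT_080, KV_TOKENS_AT_080 = 28.6, 374_000

# Total memory vLLM budgets against, inferred from the 0.80 measurement
usable_gb = (WEIGHTS_GB + KV_GB_AT_080) / 0.80
tokens_per_gb = KV_TOKENS_AT_080 / KV_GB_AT_080  # ~13K tokens per GB

for util in (0.80, 0.85, 0.90):
    kv_gb = usable_gb * util - WEIGHTS_GB
    print(f"util={util:.2f}: ~{kv_gb:.1f} GB KV cache "
          f"≈ {kv_gb * tokens_per_gb / 1000:.0f}K tokens")

The extrapolated 0.85 and 0.90 figures land within rounding of the table above.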

Context Length Options

Default (262K native):

--max-model-len 262144

Extended (1M with YaRN scaling):

VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve Qwen/Qwen3.5-35B-A3B \
  --max-model-len 1010000 \
  --hf-overrides '{"text_config": {"rope_parameters": {
    "mrope_interleaved": true,
    "mrope_section": [11, 11, 10],
    "rope_type": "yarn",
    "rope_theta": 10000000,
    "partial_rotary_factor": 0.25,
    "factor": 4.0,
    "original_max_position_embeddings": 262144
  }}}'

NVFP4 Quantization (Experimental)

NVIDIA's firmware updates unlocked NVFP4 (4-bit floating point) on the DGX Spark, offering up to 2.5x throughput gains. With NVFP4, model weights shrink from ~70 GB to ~18 GB, leaving ~80+ GB for KV cache:

docker run -d \
  --name qwen35-nvfp4 \
  --restart unless-stopped \
  --gpus all \
  --ipc host \
  --shm-size 64gb \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:cu130-nightly \
  Qwen/Qwen3.5-35B-A3B \
    --served-model-name qwen3.5-35b \
    --port 8000 \
    --host 0.0.0.0 \
    --max-model-len 262144 \
    --gpu-memory-utilization 0.80 \
    --quantization nvfp4 \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --enable-prefix-caching

| Mode | Weights | KV Cache (0.80 util) | Expected Speed |
|---|---|---|---|
| BF16 (default) | ~70 GB | ~28.6 GB | ~31 tok/s |
| NVFP4 | ~18 GB | ~80+ GB | ~60+ tok/s (estimated) |

Note: NVFP4 quantization may affect output quality slightly. Benchmark your specific use case before switching. Requires DGX OS firmware that supports NVFP4.

Text-Only Mode (Disable Vision Encoder)

If you only need text inference, disable the vision encoder to save memory:

--language-model-only

Speculative Decoding (Multi-Token Prediction)

For potentially faster inference:

--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'

API Usage

The server exposes an OpenAI-compatible API at http://<dgx-spark-ip>:8000/v1.

Text Chat

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-35b",
    "messages": [{"role": "user", "content": "Explain quantum computing in simple terms."}],
    "max_tokens": 1024,
    "temperature": 1.0,
    "top_p": 0.95
  }'

Image Analysis

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-35b",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        {"type": "text", "text": "Describe this image in detail."}
      ]
    }],
    "max_tokens": 1024
  }'

Disable Thinking Mode

By default, the model uses chain-of-thought reasoning. To disable it for faster responses:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-35b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 1024,
    "extra_body": {"chat_template_kwargs": {"enable_thinking": false}}
  }'
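
The same toggle from Python: chat_template_kwargs is not part of the OpenAI SDK's method signature, so it has to travel through the SDK's extra_body passthrough (in raw HTTP it sits at the top level of the JSON body, as in the curl example above). A sketch:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="qwen3.5-35b",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=1024,
    # vLLM-specific field, forwarded verbatim into the request body
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)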

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(base_url="http://192.168.42.2:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="qwen3.5-35b",
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    max_tokens=1024,
    temperature=1.0,
    top_p=0.95,
)

print(response.choices[0].message.content)

Recommended Sampling Parameters

| Mode | Temperature | top_p | top_k | presence_penalty |
|---|---|---|---|---|
| Thinking - General | 1.0 | 0.95 | 20 | 1.5 |
| Thinking - Coding | 0.6 | 0.95 | 20 | 0.0 |
| Instruct - General | 0.7 | 0.8 | 20 | 1.5 |
| Instruct - Reasoning | 1.0 | 1.0 | 40 | 2.0 |
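
Note that top_k is not a standard OpenAI parameter, so it also goes through extra_body; temperature, top_p, and presence_penalty are native SDK arguments. A sketch applying the "Thinking - General" row (the prompt is just an example):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="qwen3.5-35b",
    messages=[{"role": "user", "content": "Plan a weekend trip to Kyoto."}],
    max_tokens=2048,
    temperature=1.0,           # Thinking - General row
    top_p=0.95,
    presence_penalty=1.5,
    extra_body={"top_k": 20},  # vLLM extension, not in the OpenAI spec
)
print(response.choices[0].message.content)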

Benchmark Results

Token Speed on DGX Spark

Measured on February 25, 2026 with BF16 precision:

| Test | Prompt Tokens | Output Tokens | Time | Speed |
|---|---|---|---|---|
| Short response | 18 | 128 | 4.1s | 31.1 tok/s |
| Medium response | 35 | 1,024 | 32.2s | 31.8 tok/s |
| Long response | 32 | 3,831 | 121.0s | 31.6 tok/s |

  • Consistent ~31-32 tokens/sec regardless of output length
  • Time to first token: ~0.1s
  • Approximately 24 words per second

Reasoning Tests

The model answered all of the classic reasoning questions correctly:

| Test | Question | Model Answer | Correct |
|---|---|---|---|
| Trick question | "A farmer has 17 sheep. All but 9 die. How many left?" | 9 sheep | Yes |
| Widget problem | "5 machines, 5 minutes, 5 widgets. 100 machines, 100 widgets?" | 5 minutes | Yes |
| Sibling puzzle | "Sally has 3 brothers. Each brother has 2 sisters. How many sisters?" | 1 sister | Yes |
| Bat & ball (CRT) | "Bat costs $1 more than ball. Total $1.10. Ball cost?" | $0.05 | Yes |
| Box labeling | Mislabeled boxes logic puzzle | Perfect reasoning | Yes |
| LIS algorithm | Code + O(n log n) complexity analysis | Both approaches correct | Yes |

Academic Benchmarks vs. Comparable Models

| Benchmark | Qwen3.5-35B-A3B (3B active) | Qwen3.5-27B (27B dense) | GPT-OSS-120B | Qwen3-235B | GPT-5-mini |
|---|---|---|---|---|---|
| MMLU-Pro | 85.3 | 86.1 | 80.8 | 84.4 | 83.7 |
| GPQA Diamond | 84.2 | 85.5 | 80.1 | 81.1 | 82.8 |
| MMLU-Redux | 93.3 | 93.2 | 91.0 | 93.8 | 93.7 |
| IFEval | 91.9 | 95.0 | 88.9 | 87.8 | 93.9 |
| SWE-bench Verified | 69.2 | 72.4 | 62.0 | -- | 72.0 |
| LiveCodeBench v6 | 74.6 | 80.7 | 82.7 | 75.1 | 80.5 |
| CodeForces | 2028 | 1899 | 2157 | 2146 | 2160 |
| HMMT Feb 25 | 89.0 | 92.0 | 90.0 | 85.1 | 89.2 |
| HLE w/ CoT | 22.4 | 24.3 | 14.9 | 18.2 | 19.4 |

Key takeaway: With only 3B active parameters, this model beats GPT-OSS-120B and Qwen3-235B on most knowledge and reasoning benchmarks.

Multi-User Concurrency Benchmarks

Tested with realistic RAG-style requests (system prompt + retrieved document chunks + question, 200-token responses, thinking mode disabled) to simulate a company assistant workload.
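
The full harness is not reproduced here, but a minimal sketch of the approach looks like this: the async OpenAI client fires N requests at once and derives aggregate and approximate per-user throughput. The prompt is a stand-in for the real system-prompt-plus-retrieved-chunks payload:

import asyncio
import time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
PROMPT = "You are a company assistant. <retrieved chunks here> Question: ..."

async def one_user() -> int:
    resp = await client.chat.completions.create(
        model="qwen3.5-35b",
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=200,
        extra_body={"chat_template_kwargs": {"enable_thinking": False}},
    )
    return resp.usage.completion_tokens

async def load_test(n_users: int):
    start = time.time()
    tokens = await asyncio.gather(*[one_user() for _ in range(n_users)])
    elapsed = time.time() - start
    total = sum(tokens)
    print(f"{n_users} users: {total / elapsed:.1f} tok/s aggregate, "
          f"~{total / elapsed / n_users:.1f} tok/s per user, {elapsed:.1f}s wall")

asyncio.run(load_test(10))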

Results

| Concurrent Users | Per-User Speed | Avg Latency (200 tokens) | Aggregate Throughput | Errors |
|---|---|---|---|---|
| 1 | 3.3 tok/s | 60.7s | 3.3 tok/s | 0 |
| 5 | 13.0 tok/s | 15.4s | 64.9 tok/s | 0 |
| 10 | 8.2 tok/s | 24.4s | 82.0 tok/s | 0 |
| 20 | 9.4 tok/s | 21.4s | 186.4 tok/s | 0 |
| 50 | 6.2 tok/s | 32.5s | 307.7 tok/s | 0 |
| 100 | 4.3 tok/s | 47.2s | 423.5 tok/s | 0 |

Analysis

  • Aggregate throughput scales from 3.3 to 423.5 tok/s (128x improvement) as concurrency increases
  • 100 concurrent users: all requests completed successfully, 4.3 tok/s per user, ~47s latency for a 200-token answer
  • Zero errors at all concurrency levels — vLLM's continuous batching handles load gracefully
  • Sweet spot at 5-20 users: best balance of per-user speed (9-13 tok/s) and aggregate throughput

Why It Scales So Well

The MoE architecture is uniquely suited for concurrent serving:

  1. Only 3B active parameters — GPU compute per token is minimal, leaving headroom for batching
  2. vLLM continuous batching — new requests join the active batch without waiting for others to finish
  3. 128 GB unified memory — large KV cache pool shared efficiently across concurrent requests
  4. Short RAG contexts (4-16K tokens per user) — KV cache per user is small, allowing many concurrent sessions

Enterprise Use Case Viability

| Scenario | Users | Expected Latency | Verdict |
|---|---|---|---|
| Small team RAG assistant | 5-10 | 15-25s | Excellent |
| Department-wide assistant | 20-50 | 21-33s | Good |
| Company-wide (peak load) | 100 | ~47s | Viable |

Note: Latencies above are for 200-token responses. Shorter responses (e.g., 50-100 tokens for quick Q&A) will be proportionally faster. With streaming enabled, users see the first tokens almost immediately regardless of concurrency.
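
Streaming is a one-line change with the SDK. A sketch:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="qwen3.5-35b",
    messages=[{"role": "user", "content": "Summarize our vacation policy."}],
    max_tokens=200,
    stream=True,  # tokens arrive as they are generated
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()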

Stress Testing / Context Limits

We ran extensive stress tests to find the breaking point of vLLM + Qwen3.5-35B-A3B on DGX Spark. Spoiler: vLLM is extremely resilient — it never OOM'd or crashed.

Single Request Context Tests

| Test | Prompt Tokens | Max Tokens | Result | Time |
|---|---|---|---|---|
| 50K moderate | 62,501 | 10 | OK | 18.8s |
| 130K half capacity | 162,501 | 10 | OK | 48.3s |
| 250K near max | ~262K | 10 | Rejected (over limit by 1 token) | instant |
| 500K double limit | ~500K text sent | 100 | Rejected (tokenizer capped at 262K) | instant |
| 1M quadruple limit | ~1M text sent | 100 | Rejected (tokenizer capped at 262K) | instant |

Key finding: vLLM truncates tokenization at max_model_len. Even when sent 1 million tokens of text, the tokenizer stops at 262,144 and returns a clean error, so no OOM is possible through the API.
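
This makes over-limit inputs safe to handle programmatically: the server answers with an HTTP 400 instead of attempting allocation, which the SDK surfaces as BadRequestError. A sketch, where the repeated filler text is a stand-in for a genuinely oversized payload:

import openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
huge_prompt = "word " * 1_000_000  # far past the 262,144-token limit

try:
    client.chat.completions.create(
        model="qwen3.5-35b",
        messages=[{"role": "user", "content": huge_prompt}],
        max_tokens=10,
    )
except openai.BadRequestError as e:
    # Clean 400 from vLLM's length validation: no OOM, no crash
    print("Rejected as expected:", e)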

Concurrent Request Tests (262K context config)

| Test | Requests | Tokens Each | Total Demand | Result |
|---|---|---|---|---|
| 4x prefill only | 4 | 250K prompt + 10 output | 1M tokens | All OK (serialized, 72s each) |
| 4x with generation | 4 | 192K prompt + 2000 output | ~776K tokens | 1 completed (595s), 3 timed out waiting |

Key finding: vLLM's scheduler serializes requests when KV cache is full. It processes one, frees KV, then processes the next. Requests queue rather than crash.

Forced 1M Context Window (Override)

We restarted vLLM with --max-model-len 1048576 and VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 to force 1M context (4x the model's trained 262K):

docker run -d \
  -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
  -e VLLM_FLASHINFER_MOE_BACKEND=latency \
  --gpus all --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:cu130-nightly \
  Qwen/Qwen3.5-35B-A3B \
  --max-model-len 1048576 \
  --gpu-memory-utilization 0.95

Startup results:

  • Model weights: 65.53 GiB
  • Available KV cache: 44.47 GiB
  • KV cache capacity: 581,856 tokens
  • Max concurrency for 1M requests: 2.21x

| Test | Tokens | Result | Time |
|---|---|---|---|
| 300K prompt (over 262K native) | 300,001 | OK | 179.7s |
| 3x concurrent 300K | ~450K each | All completed (serialized) | ~380s |

Key finding: The model processes 300K+ tokens beyond its trained 262K context via RoPE extrapolation. vLLM allocated 581K tokens of KV cache from the available 44.47 GiB. No OOM — the scheduler queues requests that don't fit.

Conclusions

  1. You cannot OOM vLLM through the API — it validates input length before allocating KV cache
  2. Concurrent large requests queue, they don't crash — vLLM serializes when KV cache is full
  3. The 262K context limit is soft — with VLLM_ALLOW_LONG_MAX_MODEL_LEN=1, the model processes 300K+ tokens (quality may degrade beyond trained range)
  4. At 0.95 GPU memory utilization, the system handles 1M context config with 581K token KV cache capacity
  5. vLLM's scheduler is the real safety net — it never allocates more than available, just queues

Vision / Multimodal Features

The model includes a vision encoder supporting:

| Capability | Status | Notes |
|---|---|---|
| Image description | Works | Detailed, nuanced descriptions |
| Object detection | Works | Provides bounding box coordinates |
| Object counting | Works | Accurate counting of people, objects |
| OCR / Text recognition | Works | Reads text from images (signs, documents) |
| Spatial reasoning | Works | Left/center/right positioning |
| Video understanding | Supported | Pass video URLs in messages |
| Chart / diagram analysis | Works | Requires raster images (not SVG) |

Supported Image Formats

  • JPEG, PNG, WebP (via URL or base64; a base64 sketch follows below)
  • Not supported: SVG, vector graphics
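
For local files, base64 data URLs work the same way as remote URLs. A sketch, where photo.jpg is a placeholder path:

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("photo.jpg", "rb") as f:  # placeholder local file
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="qwen3.5-35b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }],
    max_tokens=1024,
)
print(response.choices[0].message.content)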

Video Input

{
  "role": "user",
  "content": [
    {"type": "video_url", "video_url": {"url": "https://example.com/video.mp4"}},
    {"type": "text", "text": "What happens in this video?"}
  ]
}

Troubleshooting

"Model type qwen3_5_moe not recognized"

Your vLLM version is too old. The Qwen3_5MoeForConditionalGeneration architecture requires vLLM v0.16.0+. Use vllm/vllm-openai:cu130-nightly instead of the NVIDIA container.

"Architectures not supported"

Same issue as above. The NVIDIA nvcr.io/nvidia/vllm:26.01-py3 container ships with vLLM 0.13.0 which doesn't support Qwen3.5. Upgrading just transformers inside the container is not enough -- vLLM itself needs the model implementation.

OOM after extended use

Reduce --gpu-memory-utilization from 0.90 to 0.80. Community reports confirm 0.80 is stable for long-running sessions.

First request is very slow (~57s)

This is normal after a fresh container start. vLLM uses torch.compile with inductor backend, and the first inference triggers compilation and caching. Subsequent requests run at full speed (~31 tok/s). The compiled cache persists within the container lifetime.
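
If the ~57s warmup hit matters (for example, after automated restarts), it can be absorbed with a throwaway request as soon as the server is up. A sketch:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Throwaway request to trigger torch.compile warmup before real traffic
client.chat.completions.create(
    model="qwen3.5-35b",
    messages=[{"role": "user", "content": "warmup"}],
    max_tokens=1,
)
print("Warmup complete; subsequent requests run at full speed.")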

CUDA graphs not captured

If startup logs don't show Capturing CUDA graphs ... 100%, performance will be degraded (eager mode fallback). Check docker logs qwen35 for CUDA graph messages. If graphs fail, try adding --enforce-eager temporarily to confirm the issue, then investigate the underlying CUDA compatibility.

"MoE config file not found" warning

WARNING: Using default MoE config. Performance might be sub-optimal!
Config file not found at .../E=256,N=512,device_name=NVIDIA_GB10.json

This is expected -- there is no optimized MoE kernel config for the GB10 GPU yet. The model still runs correctly with default settings. We tested custom MoE configs adapted from the avarok/vllm-dgx-spark image (tuned for GB10 with fp8) but found that the GB10's shared memory limit (101,376 bytes) is too small for the larger block sizes, and conservative configs actually performed worse (~30.5 tok/s) than vLLM's auto-tuned defaults (~32 tok/s).

PyTorch CUDA capability warning

Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
Maximum cuda capability supported by this version of PyTorch is (8.0) - (12.0)

This is a harmless warning. The GB10 works fine despite the version mismatch message.

Prefix caching warning

Prefix caching in Mamba cache 'align' mode is currently enabled. Its support for Mamba layers is experimental.

This is informational. Prefix caching works and improves performance for repeated prompts. It can be disabled with --no-enable-prefix-caching if issues arise.
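
To see prefix caching at work, send two requests that share a long identical prefix and compare timings; the second should complete noticeably faster because the prefix's KV cache is reused. A rough sketch, where the repeated system prompt stands in for any long shared prefix:

import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
system = "You are a support assistant. " + "Policy details. " * 2000  # long shared prefix

def timed_request(question: str) -> float:
    start = time.time()
    client.chat.completions.create(
        model="qwen3.5-35b",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": question}],
        max_tokens=50,
    )
    return time.time() - start

print(f"cold prefix: {timed_request('What is the refund policy?'):.2f}s")
print(f"warm prefix: {timed_request('What is the shipping policy?'):.2f}s")  # prefix KV reused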

Useful Commands

# Check if server is running
curl http://localhost:8000/v1/models

# View logs
docker logs qwen35 -f

# Stop the server
docker stop qwen35

# Start the server
docker start qwen35

# Remove and recreate
docker rm -f qwen35
# Then re-run the docker run command above

# Check GPU memory usage
nvidia-smi

SM12.1 Optimization Attempts (Tested)

Community forums suggest several optimizations for the GB10's SM12.1 (Blackwell) compute capability. We tested the following on March 11, 2026:

| Optimization | Result | Details |
|---|---|---|
| VLLM_FLASHINFER_MOE_BACKEND=latency | No effect | vLLM v0.16.0 uses Triton (not FlashInfer) for unquantized MoE; the env var is ignored. |
| avarok/vllm-dgx-spark image | Incompatible | Ships vLLM 0.14.0, which lacks Qwen3_5MoeForConditionalGeneration support. |
| Custom GB10 MoE config (from avarok) | Crashed | BLOCK_SIZE_K=128, num_stages=5 needs 163,840 bytes of shared memory; the GB10 limit is 101,376. |
| Conservative MoE config (BLOCK_SIZE_K=64, num_stages=2) | Slower | ~30.5 tok/s vs baseline ~32 tok/s; vLLM's auto-tuned defaults are better. |

Conclusion: The nightly vllm/vllm-openai:cu130-nightly image with default settings is already near-optimal for BF16 inference on GB10. The ~31-32 tok/s appears to be the hardware ceiling for this precision. The TORCH_CUDA_ARCH_LIST in the nightly already includes 12.1, confirming SM12.1 kernels are compiled.

Remaining untested: NVFP4 quantization (--quantization nvfp4) could potentially double throughput by reducing model weights from ~70 GB to ~18 GB, but requires firmware support and vLLM NVFP4 backend compatibility verification.

License

This guide is provided as-is under the MIT License. The Qwen3.5-35B-A3B model itself is licensed under Apache 2.0.
