A complete guide to running Qwen3.5-35B-A3B on the NVIDIA DGX Spark (GB10) using vLLM. Includes installation instructions, benchmark results, and configuration tips.
- Overview
- Hardware
- Why This Model on DGX Spark
- Installation
- Configuration
- API Usage
- Benchmark Results
- Vision / Multimodal Features
- Multi-User Concurrency Benchmarks
- Stress Testing / Context Limits
- Comparison with Other Models
- Troubleshooting
- License
Qwen3.5-35B-A3B is a Mixture-of-Experts (MoE) multimodal model with:
- 35B total parameters, but only 3B active at inference
- 262K native context (extendable to 1M+ with YaRN)
- Multimodal: text, images, and video input
- Thinking mode: built-in chain-of-thought reasoning
- Tool calling: function calling and MCP support
- 201 languages supported
- Apache 2.0 license
| Component | Specification |
|---|---|
| Device | NVIDIA DGX Spark |
| GPU | NVIDIA GB10 (Blackwell) |
| Memory | 128 GB unified (shared CPU/GPU) |
| CUDA Capability | 12.1 |
| Storage | 3.7 TB NVMe SSD |
| OS | DGX OS (Ubuntu 24.04 based) |
The MoE architecture makes this model uniquely suited for the DGX Spark:
- Only 3B active parameters means fast inference (~31 tok/s) despite 35B total
- ~70 GB model weights in BF16 fits comfortably in 128 GB unified memory
- 28.6 GB remaining for KV cache after loading, supporting 374K tokens (see the arithmetic sketch below)
- Benchmarks competitive with models 10-40x the inference cost
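The memory budget can be sanity-checked with simple arithmetic. The sketch below only uses the numbers quoted in this guide (128 GB unified memory, 0.80 utilization, ~70 GB of BF16 weights, 28.6 GB of KV cache for 374K tokens); the per-token KV cost is derived from those figures, not an official specification.

```python
# Rough memory budget for Qwen3.5-35B-A3B on a 128 GB DGX Spark (BF16).
# All inputs are the values reported in this guide; the bytes-per-token
# figure is a derived estimate, not a published number.

total_memory_gb = 128          # unified CPU/GPU memory
gpu_mem_utilization = 0.80     # --gpu-memory-utilization
weights_gb = 70                # BF16 model weights

budget_gb = total_memory_gb * gpu_mem_utilization       # 102.4 GB reserved for vLLM
kv_cache_gb = 28.6                                       # reported free KV cache
overhead_gb = budget_gb - weights_gb - kv_cache_gb       # activations, graphs, etc.

kv_tokens = 374_000                                      # reported KV capacity
bytes_per_token = kv_cache_gb * 1e9 / kv_tokens          # ~76 KB per cached token

print(f"vLLM budget: {budget_gb:.1f} GB, non-weight overhead: {overhead_gb:.1f} GB")
print(f"KV cache: {kv_cache_gb} GB -> ~{bytes_per_token / 1e3:.0f} KB per token")
```

At roughly 76 KB per cached token, raising utilization to 0.85 or 0.90 buys on the order of 80-170K extra tokens of KV capacity, which lines up with the `--gpu-memory-utilization` table in the Configuration section.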
- NVIDIA DGX Spark with DGX OS
- Docker installed and configured
- SSH access to the DGX Spark
sudo usermod -aG docker $USER
newgrp docker

Important: The standard NVIDIA vLLM container (nvcr.io/nvidia/vllm:26.01-py3) ships with vLLM 0.13.0, which does not support Qwen3.5. You need the nightly build with Qwen3.5 support.
docker pull vllm/vllm-openai:cu130-nightly

This image contains:
- vLLM v0.16.0+ (with Qwen3_5MoeForConditionalGeneration support)
- CUDA 13.1
- PyTorch with Blackwell support
- FlashAttention backend
Note on community DGX Spark images: The avarok/vllm-dgx-spark image is purpose-built for GB10 with SM12.1 kernel optimizations, but ships with vLLM 0.14.0 which does not support Qwen3.5. As of March 2026, the nightly build above is required for Qwen3.5.
docker run -d \
--name qwen35 \
--restart unless-stopped \
--gpus all \
--ipc host \
--shm-size 64gb \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:cu130-nightly \
Qwen/Qwen3.5-35B-A3B \
--served-model-name qwen3.5-35b \
--port 8000 \
--host 0.0.0.0 \
--max-model-len 262144 \
--gpu-memory-utilization 0.80 \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--enable-prefix-caching

Note: The vllm/vllm-openai:cu130-nightly image has vllm serve as its entrypoint, so the model name is passed directly as the first argument (not vllm serve Qwen/...).
CUDA graphs: On first startup, watch the logs (docker logs qwen35 -f) and confirm CUDA graph capture completes (look for Capturing CUDA graphs ... 100%). The first inference request after a fresh container start will be slow (~57s) due to torch.compile warmup, but subsequent requests run at full speed.
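Rather than watching the logs manually, a small script can poll the API until the model is registered. This is a minimal standard-library sketch; adjust the host and port if you changed the mapping.

```python
# Poll the vLLM OpenAI-compatible endpoint until the model is registered.
# Standard-library only; assumes the container maps port 8000 on localhost.
import json
import time
import urllib.request

URL = "http://localhost:8000/v1/models"

def wait_for_server(timeout_s: int = 1800, interval_s: int = 15) -> dict:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(URL, timeout=5) as resp:
                models = json.load(resp)
                if models.get("data"):
                    return models
        except OSError:
            pass  # server not up yet (connection refused, timeout, ...)
        time.sleep(interval_s)
    raise TimeoutError("vLLM did not become ready in time")

if __name__ == "__main__":
    info = wait_for_server()
    print("Ready:", [m["id"] for m in info["data"]])
```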
First run will download ~70 GB of model weights. Subsequent starts use the cached weights.
Wait for the server to fully initialize (model download + CUDA graph capture takes ~15 minutes on first run), then:
curl http://localhost:8000/v1/models

Expected output:
{
"data": [{"id": "qwen3.5-35b", "object": "model", "max_model_len": 262144}]
}

| --gpu-memory-utilization | Model Weights | KV Cache | Notes |
|---|---|---|---|
| 0.80 (recommended) | ~70 GB | 28.6 GB (374K tokens) | Stable, no OOM risk |
| 0.85 | ~70 GB | ~35 GB (~460K tokens) | More headroom for long context |
| 0.90 | ~70 GB | ~42 GB (~550K tokens) | Risk of OOM after extended use |
Community reports suggest 0.90 can cause OOM after ~1 hour. Stick with 0.80 for stability.
Default (262K native):
--max-model-len 262144
Extended (1M with YaRN scaling):
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve Qwen/Qwen3.5-35B-A3B \
--max-model-len 1010000 \
--hf-overrides '{"text_config": {"rope_parameters": {
"mrope_interleaved": true,
"mrope_section": [11, 11, 10],
"rope_type": "yarn",
"rope_theta": 10000000,
"partial_rotary_factor": 0.25,
"factor": 4.0,
"original_max_position_embeddings": 262144
}}}'

NVIDIA's firmware updates unlocked NVFP4 (4-bit floating point) on the DGX Spark, offering up to 2.5x throughput gains. With NVFP4, model weights shrink from ~70 GB to ~18 GB, leaving ~80+ GB for KV cache:
docker run -d \
--name qwen35-nvfp4 \
--restart unless-stopped \
--gpus all \
--ipc host \
--shm-size 64gb \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:cu130-nightly \
Qwen/Qwen3.5-35B-A3B \
--served-model-name qwen3.5-35b \
--port 8000 \
--host 0.0.0.0 \
--max-model-len 262144 \
--gpu-memory-utilization 0.80 \
--quantization nvfp4 \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--enable-prefix-caching

| Mode | Weights | KV Cache (0.80 util) | Expected Speed |
|---|---|---|---|
| BF16 (default) | ~70 GB | ~28.6 GB | ~31 tok/s |
| NVFP4 | ~18 GB | ~80+ GB | ~60+ tok/s (estimated) |
Note: NVFP4 quantization may affect output quality slightly. Benchmark your specific use case before switching. Requires DGX OS firmware that supports NVFP4.
If you only need text inference, disable the vision encoder to save memory:
--language-model-only
For potentially faster inference:
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
The server exposes an OpenAI-compatible API at http://<dgx-spark-ip>:8000/v1.
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3.5-35b",
"messages": [{"role": "user", "content": "Explain quantum computing in simple terms."}],
"max_tokens": 1024,
"temperature": 1.0,
"top_p": 0.95
}'

curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3.5-35b",
"messages": [{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
{"type": "text", "text": "Describe this image in detail."}
]
}],
"max_tokens": 1024
}'

By default, the model uses chain-of-thought reasoning. To disable it for faster responses:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3.5-35b",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 1024,
"extra_body": {"chat_template_kwargs": {"enable_thinking": false}}
}'

from openai import OpenAI
client = OpenAI(base_url="http://192.168.42.2:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
model="qwen3.5-35b",
messages=[{"role": "user", "content": "What is the meaning of life?"}],
max_tokens=1024,
temperature=1.0,
top_p=0.95,
)
print(response.choices[0].message.content)

| Mode | Temperature | top_p | top_k | presence_penalty |
|---|---|---|---|---|
| Thinking - General | 1.0 | 0.95 | 20 | 1.5 |
| Thinking - Coding | 0.6 | 0.95 | 20 | 0.0 |
| Instruct - General | 0.7 | 0.8 | 20 | 1.5 |
| Instruct - Reasoning | 1.0 | 1.0 | 40 | 2.0 |
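When using the Python client, top_k is not part of the standard OpenAI schema, so it has to be passed through extra_body (vLLM accepts it as a sampling extension). The sketch below applies the "Thinking - Coding" preset from the table above; the server address and prompt are example values.

```python
# Apply the recommended "Thinking - Coding" sampling preset from the table above.
# top_k is a vLLM-specific extension and goes via extra_body; temperature, top_p
# and presence_penalty are regular OpenAI parameters.
from openai import OpenAI

client = OpenAI(base_url="http://192.168.42.2:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="qwen3.5-35b",
    messages=[{"role": "user", "content": "Write a Python function that merges two sorted lists."}],
    max_tokens=2048,
    temperature=0.6,           # Thinking - Coding preset
    top_p=0.95,
    presence_penalty=0.0,
    extra_body={"top_k": 20},  # vLLM sampling extension
)
print(response.choices[0].message.content)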
Measured on February 25, 2026 with BF16 precision:
| Test | Prompt Tokens | Output Tokens | Time | Speed |
|---|---|---|---|---|
| Short response | 18 | 128 | 4.1s | 31.1 tok/s |
| Medium response | 35 | 1,024 | 32.2s | 31.8 tok/s |
| Long response | 32 | 3,831 | 121.0s | 31.6 tok/s |
- Consistent ~31-32 tokens/sec regardless of output length
- Time to first token: ~0.1s
- Approximately 24 words per second
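The single-stream numbers above can be approximated with a simple timing loop against the API. This is a sketch, not the exact benchmark script: it uses the token counts reported in response.usage, so it measures end-to-end speed including request and scheduling overhead.

```python
# Rough single-stream throughput measurement against the running server.
# Uses the usage stats returned by vLLM; results include request overhead.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

PROMPTS = [
    ("short", "Give me a one-line fun fact.", 128),
    ("medium", "Explain how TCP congestion control works.", 1024),
]

for name, prompt, max_tokens in PROMPTS:
    start = time.time()
    resp = client.chat.completions.create(
        model="qwen3.5-35b",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    elapsed = time.time() - start
    out_tokens = resp.usage.completion_tokens
    print(f"{name}: {out_tokens} tokens in {elapsed:.1f}s "
          f"-> {out_tokens / elapsed:.1f} tok/s")
```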
All classic reasoning benchmarks passed correctly:
| Test | Question | Model Answer | Correct |
|---|---|---|---|
| Trick question | "A farmer has 17 sheep. All but 9 die. How many left?" | 9 sheep | Yes |
| Widget problem | "5 machines, 5 minutes, 5 widgets. 100 machines, 100 widgets?" | 5 minutes | Yes |
| Sibling puzzle | "Sally has 3 brothers. Each brother has 2 sisters. How many sisters?" | 1 sister | Yes |
| Bat & ball (CRT) | "Bat costs $1 more than ball. Total $1.10. Ball cost?" | $0.05 | Yes |
| Box labeling | Mislabeled boxes logic puzzle | Perfect reasoning | Yes |
| LIS algorithm | Code + O(n log n) complexity analysis | Both approaches correct | Yes |
| Benchmark | Qwen3.5-35B-A3B (3B active) | Qwen3.5-27B (27B dense) | GPT-OSS-120B | Qwen3-235B | GPT-5-mini |
|---|---|---|---|---|---|
| MMLU-Pro | 85.3 | 86.1 | 80.8 | 84.4 | 83.7 |
| GPQA Diamond | 84.2 | 85.5 | 80.1 | 81.1 | 82.8 |
| MMLU-Redux | 93.3 | 93.2 | 91.0 | 93.8 | 93.7 |
| IFEval | 91.9 | 95.0 | 88.9 | 87.8 | 93.9 |
| SWE-bench Verified | 69.2 | 72.4 | 62.0 | -- | 72.0 |
| LiveCodeBench v6 | 74.6 | 80.7 | 82.7 | 75.1 | 80.5 |
| CodeForces | 2028 | 1899 | 2157 | 2146 | 2160 |
| HMMT Feb 25 | 89.0 | 92.0 | 90.0 | 85.1 | 89.2 |
| HLE w/ CoT | 22.4 | 24.3 | 14.9 | 18.2 | 19.4 |
Key takeaway: With only 3B active parameters, this model beats GPT-OSS-120B and Qwen3-235B on most knowledge and reasoning benchmarks.
Tested with realistic RAG-style requests (system prompt + retrieved document chunks + question, 200-token responses, thinking mode disabled) to simulate a company assistant workload.
| Concurrent Users | Per-User Speed | Avg Latency (200 tokens) | Aggregate Throughput | Errors |
|---|---|---|---|---|
| 1 | 3.3 tok/s | 60.7s | 3.3 tok/s | 0 |
| 5 | 13.0 tok/s | 15.4s | 64.9 tok/s | 0 |
| 10 | 8.2 tok/s | 24.4s | 82.0 tok/s | 0 |
| 20 | 9.4 tok/s | 21.4s | 186.4 tok/s | 0 |
| 50 | 6.2 tok/s | 32.5s | 307.7 tok/s | 0 |
| 100 | 4.3 tok/s | 47.2s | 423.5 tok/s | 0 |
- Aggregate throughput scales from 3.3 to 423.5 tok/s (128x improvement) as concurrency increases
- 100 concurrent users: all requests completed successfully, 4.3 tok/s per user, ~47s latency for a 200-token answer
- Zero errors at all concurrency levels — vLLM's continuous batching handles load gracefully
- Sweet spot at 5-20 users: best balance of per-user speed (9-13 tok/s) and aggregate throughput
The MoE architecture is uniquely suited for concurrent serving:
- Only 3B active parameters — GPU compute per token is minimal, leaving headroom for batching
- vLLM continuous batching — new requests join the active batch without waiting for others to finish
- 128 GB unified memory — large KV cache pool shared efficiently across concurrent requests
- Short RAG contexts (4-16K tokens per user) — KV cache per user is small, allowing many concurrent sessions
| Scenario | Users | Expected Latency | Verdict |
|---|---|---|---|
| Small team RAG assistant | 5-10 | 15-25s | Excellent |
| Department-wide assistant | 20-50 | 21-33s | Good |
| Company-wide (peak load) | 100 | ~47s | Viable |
Note: Latencies above are for 200-token responses. Shorter responses (e.g., 50-100 tokens for quick Q&A) will be proportionally faster. With streaming enabled, users see the first tokens almost immediately regardless of concurrency.
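A scaled-down version of the concurrency test can be reproduced with the async client. This is a sketch under simplifying assumptions: the prompt is a short stand-in for the RAG-style payloads used in the actual benchmark, so absolute numbers will differ with longer contexts.

```python
# Fire N concurrent requests and report per-user and aggregate throughput.
# Simplified stand-in for the RAG-style concurrency benchmark described above.
import asyncio
import time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
CONCURRENCY = 10
MAX_TOKENS = 200

async def one_request() -> int:
    resp = await client.chat.completions.create(
        model="qwen3.5-35b",
        messages=[{"role": "user", "content": "Summarize the benefits of unified memory."}],
        max_tokens=MAX_TOKENS,
        extra_body={"chat_template_kwargs": {"enable_thinking": False}},
    )
    return resp.usage.completion_tokens

async def main() -> None:
    start = time.time()
    counts = await asyncio.gather(*[one_request() for _ in range(CONCURRENCY)])
    elapsed = time.time() - start
    total = sum(counts)
    print(f"{CONCURRENCY} users: {total} tokens in {elapsed:.1f}s "
          f"-> {total / elapsed:.1f} tok/s aggregate, "
          f"{total / CONCURRENCY / elapsed:.1f} tok/s per user")

asyncio.run(main())
```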
We ran extensive stress tests to find the breaking point of vLLM + Qwen3.5-35B-A3B on DGX Spark. Spoiler: vLLM is extremely resilient — it never OOM'd or crashed.
| Test | Prompt Tokens | Max Tokens | Result | Time |
|---|---|---|---|---|
| 50K moderate | 62,501 | 10 | OK | 18.8s |
| 130K half capacity | 162,501 | 10 | OK | 48.3s |
| 250K near max | ~262K | 10 | Rejected (over limit by 1 token) | instant |
| 500K double limit | ~500K text sent | 100 | Rejected (tokenizer capped at 262K) | instant |
| 1M quadruple limit | ~1M text sent | 100 | Rejected (tokenizer capped at 262K) | instant |
Key finding: vLLM truncates tokenization at max_model_len. Even sending 1 million tokens of text, the tokenizer stops at 262,144 and returns a clean error. No OOM possible through the API.
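The clean rejection behaviour is easy to verify yourself: send a prompt that exceeds the configured context and the server returns a validation error instead of attempting to allocate KV cache. A sketch is below; the repeated filler text is just a cheap way to exceed 262K tokens.

```python
# Demonstrate that an over-length prompt is rejected cleanly rather than OOMing.
# The filler string is a crude way to exceed the 262,144-token context limit.
import openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
huge_prompt = "lorem ipsum " * 400_000  # far beyond the 262K-token limit

try:
    client.chat.completions.create(
        model="qwen3.5-35b",
        messages=[{"role": "user", "content": huge_prompt}],
        max_tokens=10,
    )
except openai.BadRequestError as err:
    print("Rejected as expected:", err)
```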
| Test | Requests | Tokens Each | Total Demand | Result |
|---|---|---|---|---|
| 4x prefill only | 4 | 250K prompt + 10 output | 1M tokens | All OK (serialized, 72s each) |
| 4x with generation | 4 | 192K prompt + 2000 output | ~776K tokens | 1 completed (595s), 3 timed out waiting |
Key finding: vLLM's scheduler serializes requests when KV cache is full. It processes one, frees KV, then processes the next. Requests queue rather than crash.
We restarted vLLM with --max-model-len 1048576 and VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 to force 1M context (4x the model's trained 262K):
docker run -d \
-e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
-e VLLM_FLASHINFER_MOE_BACKEND=latency \
--gpus all --ipc=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
vllm/vllm-openai:cu130-nightly \
Qwen/Qwen3.5-35B-A3B \
--max-model-len 1048576 \
--gpu-memory-utilization 0.95

Startup results:
- Model weights: 65.53 GiB
- Available KV cache: 44.47 GiB
- KV cache capacity: 581,856 tokens
- Max concurrency for 1M requests: 2.21x
| Test | Tokens | Result | Time |
|---|---|---|---|
| 300K prompt (over 262K native) | 300,001 | OK | 179.7s |
| 3x concurrent 300K | ~450K each | All completed (serialized) | ~380s |
Key finding: The model processes 300K+ tokens beyond its trained 262K context via RoPE extrapolation. vLLM allocated 581K tokens of KV cache from the available 44.47 GiB. No OOM — the scheduler queues requests that don't fit.
- You cannot OOM vLLM through the API — it validates input length before allocating KV cache
- Concurrent large requests queue, they don't crash — vLLM serializes when KV cache is full
- The 262K context limit is soft — with VLLM_ALLOW_LONG_MAX_MODEL_LEN=1, the model processes 300K+ tokens (quality may degrade beyond the trained range)
- At 0.95 GPU memory utilization, the system handles a 1M context config with 581K tokens of KV cache capacity
- vLLM's scheduler is the real safety net — it never allocates more than available, just queues
The model includes a vision encoder supporting:
| Capability | Status | Notes |
|---|---|---|
| Image description | Works | Detailed, nuanced descriptions |
| Object detection | Works | Provides bounding box coordinates |
| Object counting | Works | Accurate counting of people, objects |
| OCR / Text recognition | Works | Reads text from images (signs, documents) |
| Spatial reasoning | Works | Left/center/right positioning |
| Video understanding | Supported | Pass video URLs in messages |
| Chart / diagram analysis | Works | Requires raster images (not SVG) |
- JPEG, PNG, WebP (via URL or base64)
- Not supported: SVG, vector graphics
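For local files, encode the image as a base64 data URL instead of passing a public URL. A minimal sketch follows; photo.png is a placeholder path, and any local JPEG, PNG, or WebP works.

```python
# Send a local image to the vision endpoint as a base64 data URL.
# "photo.png" is a placeholder; any local JPEG/PNG/WebP file works.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("photo.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="qwen3.5-35b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }],
    max_tokens=1024,
)
print(response.choices[0].message.content)
```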
{
"role": "user",
"content": [
{"type": "video_url", "video_url": {"url": "https://example.com/video.mp4"}},
{"type": "text", "text": "What happens in this video?"}
]
}

Your vLLM version is too old. The Qwen3_5MoeForConditionalGeneration architecture requires vLLM v0.16.0+. Use vllm/vllm-openai:cu130-nightly instead of the NVIDIA container.
Same issue as above. The NVIDIA nvcr.io/nvidia/vllm:26.01-py3 container ships with vLLM 0.13.0 which doesn't support Qwen3.5. Upgrading just transformers inside the container is not enough -- vLLM itself needs the model implementation.
Reduce --gpu-memory-utilization from 0.90 to 0.80. Community reports confirm 0.80 is stable for long-running sessions.
This is normal after a fresh container start. vLLM uses torch.compile with inductor backend, and the first inference triggers compilation and caching. Subsequent requests run at full speed (~31 tok/s). The compiled cache persists within the container lifetime.
If startup logs don't show Capturing CUDA graphs ... 100%, performance will be degraded (eager mode fallback). Check docker logs qwen35 for CUDA graph messages. If graphs fail, try adding --enforce-eager temporarily to confirm the issue, then investigate the underlying CUDA compatibility.
WARNING: Using default MoE config. Performance might be sub-optimal!
Config file not found at .../E=256,N=512,device_name=NVIDIA_GB10.json
This is expected -- there is no optimized MoE kernel config for the GB10 GPU yet. The model still runs correctly with default settings. We tested custom MoE configs adapted from the avarok/vllm-dgx-spark image (tuned for GB10 with fp8) but found that the GB10's shared memory limit (101,376 bytes) is too small for the larger block sizes, and conservative configs actually performed worse (~30.5 tok/s) than vLLM's auto-tuned defaults (~32 tok/s).
Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
Maximum cuda capability supported by this version of PyTorch is (8.0) - (12.0)
This is a harmless warning. The GB10 works fine despite the version mismatch message.
Prefix caching in Mamba cache 'align' mode is currently enabled. Its support for Mamba layers is experimental.
This is informational. Prefix caching works and improves performance for repeated prompts. It can be disabled with --no-enable-prefix-caching if issues arise.
# Check if server is running
curl http://localhost:8000/v1/models
# View logs
docker logs qwen35 -f
# Stop the server
docker stop qwen35
# Start the server
docker start qwen35
# Remove and recreate
docker rm -f qwen35
# Then re-run the docker run command above
# Check GPU memory usage
nvidia-smi

Community forums suggest several optimizations for the GB10's SM12.1 (Blackwell) compute capability. We tested the following on March 11, 2026:
| Optimization | Result | Details |
|---|---|---|
| VLLM_FLASHINFER_MOE_BACKEND=latency | No effect | vLLM v0.16.0 uses Triton (not FlashInfer) for unquantized MoE. The env var is ignored. |
| avarok/vllm-dgx-spark image | Incompatible | Ships vLLM 0.14.0 — no Qwen3_5MoeForConditionalGeneration support. |
| Custom GB10 MoE config (from avarok) | Crashed | BLOCK_SIZE_K=128, num_stages=5 exceeds GB10 shared memory (101,376 bytes; needs 163,840). |
| Conservative MoE config (BLOCK_SIZE_K=64, num_stages=2) | Slower | ~30.5 tok/s vs baseline ~32 tok/s. vLLM's auto-tuned defaults are better. |
Conclusion: The nightly vllm/vllm-openai:cu130-nightly image with default settings is already near-optimal for BF16 inference on GB10. The ~31-32 tok/s appears to be the hardware ceiling for this precision. The TORCH_CUDA_ARCH_LIST in the nightly already includes 12.1, confirming SM12.1 kernels are compiled.
Remaining untested: NVFP4 quantization (--quantization nvfp4) could potentially double throughput by reducing model weights from ~70 GB to ~18 GB, but requires firmware support and vLLM NVFP4 backend compatibility verification.
- Qwen3.5-35B-A3B Model Card
- Qwen3.5 vLLM Recipe
- DGX Spark User Guide
- vLLM Documentation
- DGX Spark vLLM Community Docker
- NVIDIA DGX Spark Playbooks
This guide is provided as-is under the MIT License. The Qwen3.5-35B-A3B model itself is licensed under Apache 2.0.