DGX Spark Inference Stack - serve the home!

Your Nvidia DGX Spark should not be another side project. Start using it! This is a Docker-based inference stack for serving large language models (LLMs) using NVIDIA vLLM with intelligent resource management. This stack provides on-demand model loading with automatic idle shutdown, single-tenant GPU scheduling, and a unified API gateway.

The goal of the project is to provide an inference server for your home. After testing this and adding new models for a month, I decided to release it for the community. Please understand that this is a hobby project and that concrete help to improve it is highly appreciated. It is based on information I found on the Internet and on the NVIDIA Forums, I really hope it helps driving forward homelabs. This is mainly focused on the single DGX Spark setup and must work on it by default but adding support for 2 is welcome.

Documentation

Architecture & How It Works - Understanding the stack, waker service, and request flow.
Configuration - Environment variables, network settings, and waker tuning.
Model Selection Guide - Detailed list of 29+ supported models, quick chooser, and use cases.
Integrations - Guides for Cline (VS Code) and OpenCode (Terminal Agent).
Security & Remote Access - Hardening SSH and setting up restricted port forwarding.
Troubleshooting & Monitoring - Debugging, logs, and common error solutions.
Advanced Usage - Adding new models, custom configurations, and persistent operation.
TODO Notes - Ideas I have for what to do next.

Quick Start

Clone the repository

git clone <repository-url>
cd dgx-spark-inference-stack

Create necessary directories

mkdir -p models vllm_cache_huggingface manual_download/openai_gpt-oss-encodings_fix

Download required tokenizers (CRITICAL) The stack requires manual download of tiktoken files for GPT-OSS models.

wget https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken -O manual_download/openai_gpt-oss-encodings_fix/cl100k_base.tiktoken
wget https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken -O manual_download/openai_gpt-oss-encodings_fix/o200k_base.tiktoken

Build Custom Docker Images (MANDATORY) The stack uses custom-optimized vLLM images that should be built locally to ensure maximum performance.
- Time: Expect ~20 minutes per image.
- Auth: You must authenticate with NVIDIA NGC to pull base images.
  1. Create a developer account at NVIDIA NGC Catalog (must not be in a sanctioned country).
  2. Run docker login nvcr.io with your credentials.
- Build Commands:
```
# Build Avarok image (General Purpose) - MUST use this tag to use local version over upstream
docker build -t avarok/vllm-dgx-spark:v11 custom-docker-containers/avarok

# Build Christopher Owen image (MXFP4 Optimized)
docker build -t christopherowen/vllm-dgx-spark:v12 custom-docker-containers/christopherowen
```

Start the stack

# Start gateway and waker only (models start on-demand)
docker compose up -d

# Pre-create all enabled model containers (recommended)
docker compose --profile models up --no-start

Test the API

# Request to qwen2.5-1.5b (will auto-start)
curl -X POST http://localhost:8009/v1/qwen2.5-1.5b-instruct/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${VLLM_API_KEY:-63TestTOKEN0REPLACEME}" \
  -d '{
    "model": "qwen2.5-1.5b-instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Prerequisites

Docker 20.10+ with Docker Compose
NVIDIA GPU(s) with CUDA support and NVIDIA Container Toolkit
Linux host (tested on Ubuntu)

Contributing

Pull requests very welcome. :) However, to ensure stability, I enforce a strict Pull Request Template.

⚠️ Known Issues

Experimental Models (GB10/CUDA 12.1 Compatibility)

The following models are marked as experimental due to sporadic crashes on DGX Spark (GB10 GPU):

Qwen3-Next-80B-A3B-Instruct - Crashes randomly in linear attention layer
Qwen3-Next-80B-A3B-Thinking - Same issue

Root cause: The GB10 GPU uses CUDA 12.1, but the current vLLM/PyTorch stack only supports CUDA ≤12.0. This causes cudaErrorIllegalInstruction errors after several successful requests.

Workaround: Use gpt-oss-20b or gpt-oss-120b for stable tool calling until an updated vLLM image with proper GB10 support is available.

Nemotron 3 Nano 30B (NVFP4)

The nemotron-3-nano-30b-nvfp4 model is currently disabled. Reason: Incompatible with current vLLM build on GB10. Requires proper V1 engine support or updated backend implementation.

OpenCode Image/Screenshot Support on Linux

OpenCode (terminal AI agent) has a known bug on Linux where clipboard images and file path images do not work with vision models. The model responds with "The model you're using does not support image input" even though VL models work correctly via API.

Root cause: OpenCode's Linux clipboard handling corrupts binary image data before encoding (uses .text() instead of .arrayBuffer()). No image data is actually sent to the server.

Status: This seems to be a client-side OpenCode bug. Help investigating/fixing is welcome! The inference stack correctly handles base64 images when properly sent (verified via curl).

Workaround: Use curl or other API clients to send images directly to VL models like qwen2.5-vl-7b.

Qwen 2.5 Coder 7B & OpenCode Incompatibility

The qwen2.5-coder-7b-instruct model has a strict context limit of 32,768 tokens. However, OpenCode typically sends very large requests (buffer + input) exceeding 35,000 tokens, causing ValueError and request failures.

Recommendation: Do not use qwen2.5-coder-7b with OpenCode for long-context tasks. Instead, use qwen3-coder-30b-instruct which supports 65,536 tokens context and handles OpenCode's large requests comfortably.

Llama 3.3 & OpenCode Incompatibility

The llama-3.3-70b-instruct-fp4 model is not recommended for use with OpenCode. Reason: While the model works correctly via API, it exhibits aggressive tool calling behavior when initialized by OpenCode's specific client prompts. This leads to validation errors and a degraded user experience (e.g., trying to call tools immediately upon greeting). Recommendation: Use gpt-oss-20b or qwen3-next-80b-a3b-instruct for OpenCode sessions instead.

Credits

Special thanks to the community members who made optimized Docker images used in this stack:

Thomas P. Braun from Avarok: For the general-purpose vLLM image (avarok/vllm-dgx-spark) with support for non-gated activations (Nemotron) and hybrid models and posts like this https://blog.avarok.net/dgx-spark-nemotron3-and-nvfp4-getting-to-65-tps-8c5569025eb6.
Christopher Owen: For the MXFP4-optimized vLLM image (christopherowen/vllm-dgx-spark) enabling high-performance inference on DGX Spark.
eugr: For all the work on the original vLLM image (eugr/vllm-dgx-spark) customizations and the great postings on NVIDIA Forums.

Model Providers

Huge thanks to the organizations optimizing these models for FP4/FP8 inference:

Firworks AI (Firworks): For a wide range of optimized models including GLM-4.5, Llama 3.3, and Ministral.
NVIDIA: For Qwen3-Next, Nemotron, and standard FP4 implementations.
RedHat: For Qwen3-VL and Mistral Small.
QuantTrio: For Qwen3-VL-Thinking.
OpenAI: For the GPT-OSS models.

License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
.github		.github
compose		compose
custom-docker-containers		custom-docker-containers
debugging		debugging
docs		docs
models		models
mods/fix-glm-4.7-flash-AWQ		mods/fix-glm-4.7-flash-AWQ
plugins		plugins
request-validator		request-validator
waker		waker
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
README_AR.md		README_AR.md
README_DE.md		README_DE.md
README_ES.md		README_ES.md
README_FR.md		README_FR.md
README_IT.md		README_IT.md
README_JA.md		README_JA.md
README_KO.md		README_KO.md
README_PT.md		README_PT.md
README_RU.md		README_RU.md
README_TR.md		README_TR.md
README_UK.md		README_UK.md
README_VI.md		README_VI.md
README_ZH_CN.md		README_ZH_CN.md
README_ZH_TW.md		README_ZH_TW.md
TODO.md		TODO.md
docker-compose.yml		docker-compose.yml
gateway.conf		gateway.conf
opencode.json		opencode.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DGX Spark Inference Stack - serve the home!

Documentation

Quick Start

Prerequisites

Contributing

⚠️ Known Issues

Experimental Models (GB10/CUDA 12.1 Compatibility)

Nemotron 3 Nano 30B (NVFP4)

OpenCode Image/Screenshot Support on Linux

Qwen 2.5 Coder 7B & OpenCode Incompatibility

Llama 3.3 & OpenCode Incompatibility

Credits

Model Providers

License

About

Uh oh!

Releases

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

DGX Spark Inference Stack - serve the home!

Documentation

Quick Start

Prerequisites

Contributing

⚠️ Known Issues

Experimental Models (GB10/CUDA 12.1 Compatibility)

Nemotron 3 Nano 30B (NVFP4)

OpenCode Image/Screenshot Support on Linux

Qwen 2.5 Coder 7B & OpenCode Incompatibility

Llama 3.3 & OpenCode Incompatibility

Credits

Model Providers

License

About

Topics

Resources

License

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Uh oh!

Contributors

Uh oh!

Languages