Run Diffusion Language Models locally — like Ollama, but for dLLMs.
dLLM Runner is a lightweight CLI tool that lets you run Diffusion Language Models (Dream, LLaDA) on your own hardware with a simple dllm run command. No cloud, no API keys, full privacy.
Traditional LLMs (GPT, LLaMA, Qwen) generate text one token at a time, left to right. Diffusion LLMs work differently — they start with a masked sequence and refine all tokens in parallel through iterative denoising, similar to how Stable Diffusion generates images.
This gives them unique advantages:
- Parallel generation — potential for significantly faster inference
- Bidirectional context — every token sees the full sequence, not just what came before
- No "reversal curse" — they handle "A is B" ↔ "B is A" naturally
- Superior planning — outperform much larger autoregressive (AR) models on constraint tasks (Sudoku, Countdown)
- Native infilling — fill in blanks anywhere in the text, not just at the end
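The denoising loop behind these properties can be sketched in a few lines. This is a toy illustration only (the `toy_model` stand-in, the `None` mask sentinel, and the unmasking schedule are simplifications, not Dream's or LLaDA's actual decoding code): start from a fully masked suffix, then commit the most confident predictions a fraction at a time.

```python
import random

MASK = None  # placeholder mask token (illustrative)

def toy_model(seq, vocab):
    """Stand-in for the network: a (confidence, token) guess per position."""
    return [(random.random(), random.choice(vocab)) for _ in seq]

def diffusion_generate(prompt, gen_len=16, steps=4, vocab=("a", "b", "c")):
    """Masked-diffusion decoding sketch: every step re-predicts all
    positions in parallel, then unmasks only the most confident ones."""
    seq = list(prompt) + [MASK] * gen_len
    for step in range(steps):
        preds = toy_model(seq, vocab)  # every position sees the full sequence
        masked = [i for i, t in enumerate(seq) if t is MASK]
        # unmask roughly 1/remaining-steps of the still-masked positions
        k = max(1, len(masked) // (steps - step))
        best = sorted(masked, key=lambda i: preds[i][0], reverse=True)[:k]
        for i in best:
            seq[i] = preds[i][1]
    return seq
```

Note how the prompt tokens are never touched: only masked positions are filled in, which is also what makes native infilling possible.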
| Model | Params | VRAM (bf16) | VRAM (4bit) | Notes |
|---|---|---|---|---|
| dream-7b | 7B | ~14 GB | ~5 GB | Best open dLLM for general use |
| dream-7b-base | 7B | ~14 GB | ~5 GB | Untuned foundation model |
| llada-8b | 8B | ~16 GB | ~5 GB | NeurIPS 2025 Oral paper |
| llada-8b-base | 8B | ~16 GB | ~5 GB | Untuned foundation model |
| llada2-mini | 16B MoE | ~32 GB | ~10 GB | 1.4B active params, efficient |
| llada2.1-mini | 16B MoE | ~32 GB | ~10 GB | Latest with token editing + RL |
| llada2-flash | 100B MoE | ~200 GB | ~60 GB | Flagship, multi-GPU required |
| llada2.1-flash | 100B MoE | ~200 GB | ~60 GB | Latest 100B, multi-GPU |
Linux / macOS:
# Download the latest release for your platform from Releases page, then:
tar xzf dllm-*.tar.gz
sudo mkdir -p /opt/dllm
sudo cp -r dllm python /opt/dllm/
sudo ln -s /opt/dllm/dllm /usr/local/bin/dllm

Windows:
Extract the .zip, add the folder to your PATH.
git clone https://github.com/YOUR_USER/dllm-runner.git
cd dllm-runner
cargo build --release
sudo mkdir -p /opt/dllm/python
sudo cp target/release/dllm /opt/dllm/
sudo cp python/*.py /opt/dllm/python/
sudo ln -s /opt/dllm/dllm /usr/local/bin/dllm

# 1. Install Python dependencies (PyTorch + CUDA, ~3 GB download)
dllm setup
# 2. Start chatting!
dllm run dream-7b --quantize 4bit

dllm <command> [options]
Commands:
list Show all available models
setup [--force] Install Python environment + dependencies
status Check GPU, CUDA, and dependency status
run <model> Interactive chat with a model
pull <model> Pre-download model weights from HuggingFace
Options for 'run':
--quantize 4bit|8bit Reduce VRAM usage (requires BitsAndBytes)
--steps <N> Number of diffusion steps (default: model-specific)
# Check your system is ready
dllm status
# Run Dream 7B in 4-bit quantization (~5 GB VRAM)
dllm run dream-7b --quantize 4bit
# Run LLaDA 8B at full precision
dllm run llada-8b
# Pre-download a model for offline use
dllm pull llada2.1-mini

┌─────────────────────────┐
│ dllm (Rust CLI, ~1 MB) │ Orchestration, model management,
│ Fast startup, TUI │ interactive chat loop
└────────┬────────────────┘
│ JSON over stdin/stdout
▼
┌─────────────────────────┐
│ engine.py (Python) │ PyTorch inference engine,
│ Runs in isolated venv │ diffusion generation loop
└────────┬────────────────┘
│
▼
┌─────────────────────────┐
│ PyTorch + CUDA │ GPU-accelerated matrix ops,
│ HuggingFace models │ model weights from HF Hub
└─────────────────────────┘
The Rust binary (~1 MB) handles CLI, process management, and model registry with instant startup. The Python engine runs inside an isolated venv (~/.dllm/venv), loading models via HuggingFace Transformers and executing the diffusion inference loop on GPU. They communicate through JSON messages over stdin/stdout.
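The engine side of that protocol can be mimicked in a few lines of Python. This is an illustrative sketch, not dllm's actual wire format: the `cmd`, `prompt`, and `ok` field names are assumptions made for the example.

```python
import json
import sys

def handle_message(req, generate):
    """Dispatch one request dict to a response dict (fields illustrative)."""
    if req.get("cmd") == "generate":
        return {"ok": True, "text": generate(req["prompt"])}
    return {"ok": False, "error": f"unknown command: {req.get('cmd')}"}

def serve(generate):
    """Engine-side loop: one JSON object per line on stdin, one per line
    on stdout. Flushing after each reply keeps the CLI from blocking."""
    for line in sys.stdin:
        resp = handle_message(json.loads(line), generate)
        sys.stdout.write(json.dumps(resp) + "\n")
        sys.stdout.flush()
```

Newline-delimited JSON keeps the framing trivial: the Rust side writes a line, then blocks on reading exactly one line back, with no length prefixes or binary framing to get wrong.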
- Python 3.10+
- GPU (recommended): NVIDIA with CUDA 11.8+ and ≥8 GB VRAM
- macOS: Apple Silicon with MPS acceleration (M1/M2/M3/M4)
- CPU: Supported but very slow for 7B+ models
- Disk: ~3 GB for Python deps + 5–14 GB per model
| Platform | GPU Acceleration | Quantization (4/8bit) | Status |
|---|---|---|---|
| Linux x86_64 | CUDA | ✅ | Full support |
| Linux ARM64 | CUDA | ✅ | Full support |
| macOS Apple Silicon | MPS | ❌ | Works, no quantization |
| macOS Intel | CPU only | ❌ | Slow |
| Windows x86_64 | CUDA | ✅ | Full support |
dllm setup fails with "python3-venv not installed"
sudo apt install python3-venv python3-pip

PyTorch CUDA not detected after setup
dllm status # Check what's detected
dllm setup --force # Reinstall with auto-detection

Out of VRAM
dllm run dream-7b --quantize 4bit # ~14 GB → ~5 GB

Model download slow / interrupted
dllm pull dream-7b # Pre-download, resumes automatically

- Dream — HKU & Huawei
- LLaDA — Renmin University & Ant Group
- LLaDA 2.0/2.1 — Ant Group inclusionAI
MIT