primoco/dllm-runner
🌀 dLLM Runner

Run Diffusion Language Models locally — like Ollama, but for dLLMs.

dLLM Runner is a lightweight CLI tool that lets you run Diffusion Language Models (Dream, LLaDA) on your own hardware with a simple dllm run command. No cloud, no API keys, full privacy.



What are Diffusion Language Models?

Traditional LLMs (GPT, LLaMA, Qwen) generate text one token at a time, left to right. Diffusion LLMs work differently — they start with a masked sequence and refine all tokens in parallel through iterative denoising, similar to how Stable Diffusion generates images.

This gives them unique advantages:

  • Parallel generation — potential for significantly faster inference
  • Bidirectional context — every token sees the full sequence, not just what came before
  • No "reversal curse" — they handle "A is B" ↔ "B is A" naturally
  • Superior planning — they outperform much larger autoregressive (AR) models on constraint tasks (Sudoku, Countdown)
  • Native infilling — fill in blanks anywhere in the text, not just at the end
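The denoising idea can be illustrated with a toy sketch in plain Python. This is not the project's inference code: `toy_model` is a random stand-in for a real network, which would instead score every masked position using the full bidirectional context.

```python
import random

MASK = "<mask>"

def toy_model(tokens):
    """Stand-in for a real dLLM: returns a (guess, confidence) pair for
    every masked position. A real model scores all positions in parallel."""
    vocab = ["the", "cat", "sat", "on", "a", "mat"]
    return {i: (random.choice(vocab), random.random())
            for i, t in enumerate(tokens) if t == MASK}

def denoise(length=6, steps=3):
    tokens = [MASK] * length          # start from a fully masked sequence
    per_step = length // steps        # how many positions to commit per step
    for _ in range(steps):
        preds = toy_model(tokens)
        # keep only the highest-confidence predictions this step,
        # leave the rest masked for the next refinement pass
        best = sorted(preds, key=lambda i: preds[i][1], reverse=True)[:per_step]
        for i in best:
            tokens[i] = preds[i][0]
    return tokens

print(denoise())
```

Each pass fills in the positions the model is most confident about and re-predicts the rest, which is why fewer steps than tokens can suffice; the --steps option below controls exactly this trade-off.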

Supported Models

| Model           | Params   | VRAM (bf16) | VRAM (4bit) | Notes                          |
|-----------------|----------|-------------|-------------|--------------------------------|
| dream-7b        | 7B       | ~14 GB      | ~5 GB       | Best open dLLM for general use |
| dream-7b-base   | 7B       | ~14 GB      | ~5 GB       | Untuned foundation model       |
| llada-8b        | 8B       | ~16 GB      | ~5 GB       | NeurIPS 2025 Oral paper        |
| llada-8b-base   | 8B       | ~16 GB      | ~5 GB       | Untuned foundation model       |
| llada2-mini     | 16B MoE  | ~32 GB      | ~10 GB      | 1.4B active params, efficient  |
| llada2.1-mini   | 16B MoE  | ~32 GB      | ~10 GB      | Latest, with token editing + RL |
| llada2-flash    | 100B MoE | ~200 GB     | ~60 GB      | Flagship, multi-GPU required   |
| llada2.1-flash  | 100B MoE | ~200 GB     | ~60 GB      | Latest 100B, multi-GPU         |
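The bf16 column follows roughly from two bytes per parameter, and 4-bit from half a byte, for the weights alone. A quick back-of-the-envelope check (this is an estimate, not the tool's actual accounting; real usage adds activations, caches, and framework overhead):

```python
def weights_vram_gb(params_billions: float, bytes_per_param: float) -> float:
    """Weights-only VRAM: parameter count times storage width per parameter.
    Treat the result as a floor; runtime overhead sits on top of it."""
    return params_billions * bytes_per_param

print(weights_vram_gb(7, 2.0))   # bf16: 14.0 GB weights, matching ~14 GB above
print(weights_vram_gb(7, 0.5))   # 4-bit: 3.5 GB weights; overhead brings it to ~5 GB
```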

Quick Start

Install from Release (recommended)

Linux / macOS:

# Download the latest release for your platform from Releases page, then:
tar xzf dllm-*.tar.gz
sudo mkdir -p /opt/dllm
sudo cp -r dllm python /opt/dllm/
sudo ln -s /opt/dllm/dllm /usr/local/bin/dllm

Windows:

Extract the .zip and add the extracted folder to your PATH.

Build from source

git clone https://github.com/YOUR_USER/dllm-runner.git
cd dllm-runner
cargo build --release
sudo mkdir -p /opt/dllm/python
sudo cp target/release/dllm /opt/dllm/
sudo cp python/*.py /opt/dllm/python/
sudo ln -s /opt/dllm/dllm /usr/local/bin/dllm

Setup & Run

# 1. Install Python dependencies (PyTorch + CUDA, ~3 GB download)
dllm setup

# 2. Start chatting!
dllm run dream-7b --quantize 4bit

Usage

dllm <command> [options]

Commands:
  list                    Show all available models
  setup [--force]         Install Python environment + dependencies
  status                  Check GPU, CUDA, and dependency status
  run <model>             Interactive chat with a model
  pull <model>            Pre-download model weights from HuggingFace

Options for 'run':
  --quantize 4bit|8bit    Reduce VRAM usage (requires BitsAndBytes)
  --steps <N>             Number of diffusion steps (default: model-specific)

Examples

# Check your system is ready
dllm status

# Run Dream 7B in 4-bit quantization (~5 GB VRAM)
dllm run dream-7b --quantize 4bit

# Run LLaDA 8B at full precision
dllm run llada-8b

# Pre-download a model for offline use
dllm pull llada2.1-mini

Architecture

┌─────────────────────────┐
│  dllm (Rust CLI, ~1 MB) │   Orchestration, model management,
│  Fast startup, TUI      │   interactive chat loop
└────────┬────────────────┘
         │ JSON over stdin/stdout
         ▼
┌─────────────────────────┐
│  engine.py (Python)     │   PyTorch inference engine,
│  Runs in isolated venv  │   diffusion generation loop
└────────┬────────────────┘
         │
         ▼
┌─────────────────────────┐
│  PyTorch + CUDA         │   GPU-accelerated matrix ops,
│  HuggingFace models     │   model weights from HF Hub
└─────────────────────────┘

The Rust binary (~1 MB) handles CLI, process management, and model registry with instant startup. The Python engine runs inside an isolated venv (~/.dllm/venv), loading models via HuggingFace Transformers and executing the diffusion inference loop on GPU. They communicate through JSON messages over stdin/stdout.
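A minimal sketch of what such a stdin/stdout engine loop can look like. The field names here (cmd, prompt, text, ok) are illustrative assumptions, not the project's actual wire schema:

```python
import json
import sys

def send(msg: dict, out=sys.stdout):
    # One JSON object per line, flushed so the Rust parent sees it immediately
    out.write(json.dumps(msg) + "\n")
    out.flush()

def engine_loop(inp=sys.stdin, out=sys.stdout):
    """Toy engine skeleton: read one JSON request per line, reply in kind.
    Field names are hypothetical; a real engine would run the diffusion
    generation loop where the echo response is produced below."""
    for line in inp:
        req = json.loads(line)
        if req.get("cmd") == "generate":
            send({"ok": True, "text": "echo: " + req["prompt"]}, out)
        elif req.get("cmd") == "shutdown":
            break

if __name__ == "__main__":
    engine_loop()
```

Newline-delimited JSON keeps the protocol trivially parseable from Rust (one serde_json call per line) and avoids any shared-memory or FFI coupling between the two runtimes.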

Requirements

  • Python 3.10+
  • GPU (recommended): NVIDIA with CUDA 11.8+ and ≥8 GB VRAM
  • macOS: Apple Silicon with MPS acceleration (M1/M2/M3/M4)
  • CPU: Supported but very slow for 7B+ models
  • Disk: ~3 GB for Python deps + 5–14 GB per model

Platform Support

| Platform            | GPU Acceleration | Quantization (4/8-bit) | Status                  |
|---------------------|------------------|------------------------|-------------------------|
| Linux x86_64        | CUDA             | Yes                    | Full support            |
| Linux ARM64         | CUDA             | Yes                    | Full support            |
| macOS Apple Silicon | MPS              | No                     | Works, no quantization  |
| macOS Intel         | CPU only         | No                     | Slow                    |
| Windows x86_64      | CUDA             | Yes                    | Full support            |

Troubleshooting

dllm setup fails with "python3-venv not installed"

sudo apt install python3-venv python3-pip

PyTorch CUDA not detected after setup

dllm status          # Check what's detected
dllm setup --force   # Reinstall with auto-detection

Out of VRAM

dllm run dream-7b --quantize 4bit   # ~14 GB → ~5 GB

Model download slow / interrupted

dllm pull dream-7b   # Pre-download, resumes automatically

License

MIT
