This repository contains Docker Compose configurations for deploying various large language models using vLLM with LiteLLM as an API gateway.
| Model | Directory | Description |
|---|---|---|
| GLM-4.6V-NVFP4 | `glm46v/` | Visual language model from Zhipu AI (GLM) |
| IQuest-Coder 40B | `IQuest-Coder/instruct/` | Code-focused LLM from Fireworks AI |
| GPT-OSS 120B | `gpt-oss-120b/` | 120B parameter open-source model |
| Qwen3-Coder | `w8a8/qwen3-coder/` | Code model from Alibaba (W8A8 quantization) |
```
Claude Code → LiteLLM (port 4000) → vLLM (port 8000) → Model
```
Each deployment includes:
- vLLM: Inference backend serving the model via OpenAI-compatible API
- LiteLLM: API gateway providing Anthropic-compatible endpoints
- Tailscale (optional): Network connectivity for distributed setups
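The gateway's job is to accept Anthropic-style requests and forward them to vLLM's OpenAI-compatible API. The sketch below illustrates the shape of that translation; the function and field handling are simplified illustrations, not LiteLLM's actual internals:

```python
# Illustrative sketch of the Anthropic -> OpenAI request translation that
# the gateway performs. This is NOT LiteLLM's real code, just the idea.
def anthropic_to_openai(payload: dict) -> dict:
    messages = list(payload.get("messages", []))
    # Anthropic carries the system prompt as a top-level field;
    # OpenAI-style chat expects it as the first message.
    if "system" in payload:
        messages = [{"role": "system", "content": payload["system"]}] + messages
    return {
        "model": payload["model"],
        "messages": messages,
        "max_tokens": payload.get("max_tokens", 1024),
    }

# Hypothetical request as Claude Code might send it (model name is a placeholder):
req = {
    "model": "glm-4.6v",
    "system": "You are a coding assistant.",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Hello"}],
}
print(anthropic_to_openai(req))
```

In the real stack this mapping (plus streaming, tool calls, and auth) is handled entirely by LiteLLM; clients only ever see the Anthropic-compatible surface on port 4000.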
- Navigate to the model directory:

  ```bash
  cd glm46v/  # or any other model directory
  ```

- Copy `.env.example` to `.env` and configure:

  ```bash
  cp .env.example .env
  # Edit .env with your API keys and settings
  ```

- Start the services:

  ```bash
  docker compose up -d
  ```

- Verify the deployment:

  ```bash
  curl http://localhost:4000/models
  ```
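The `/models` endpoint returns an OpenAI-style list (`{"object": "list", "data": [{"id": ...}, ...]}`). A small helper like this (hypothetical, not part of the repo) can confirm that a specific model is being served:

```python
import json
import urllib.request

def model_ids(models_payload: dict) -> list:
    """Extract model IDs from an OpenAI-style /models response."""
    return [m["id"] for m in models_payload.get("data", [])]

# Sample payload in the shape the gateway returns (model id is a placeholder):
sample = {"object": "list", "data": [{"id": "glm-4.6v", "object": "model"}]}
print(model_ids(sample))  # ['glm-4.6v']

# Against a live gateway (requires the stack from `docker compose up -d`):
# with urllib.request.urlopen("http://localhost:4000/models") as resp:
#     print(model_ids(json.load(resp)))
```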
| Variable | Description |
|---|---|
| `HF_TOKEN` | Hugging Face access token for model downloads |
| `LITELLM_MASTER_KEY` | API key for LiteLLM gateway authentication |
| `TS_AUTHKEY` | Tailscale authentication key (if using tailnet) |
| `HEADSCALE_URL` | Headscale server URL (if using self-hosted) |
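Put together, a minimal `.env` might look like this (all values below are placeholders, not working credentials):

```bash
# .env — do not commit real values
HF_TOKEN=hf_xxxxxxxxxxxxxxxx
LITELLM_MASTER_KEY=sk-change-me
# Only needed when joining a tailnet:
TS_AUTHKEY=tskey-auth-xxxxxxxx
HEADSCALE_URL=https://headscale.example.com
```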
| Service | Port |
|---|---|
| LiteLLM Gateway | 4000 |
| vLLM Backend | 8000 |
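In Compose terms the two ports map roughly like this (service names here are illustrative; each deployment directory defines its own):

```yaml
services:
  vllm:
    ports:
      - "8000:8000"   # OpenAI-compatible inference API
  litellm:
    ports:
      - "4000:4000"   # gateway entry point used by clients
```

Clients should only ever need port 4000; port 8000 is exposed mainly for debugging the backend directly.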
```
vllm-deployments/
├── glm46v/                  # GLM-4.6V deployment
│   ├── .env                 # Environment variables
│   ├── .env.litellm         # LiteLLM-specific config
│   ├── litellm-config.yaml  # LiteLLM model routing
│   ├── docker-compose.yml   # Service orchestration
│   └── Dockerfile           # vLLM image build
├── IQuest-Coder/
│   ├── instruct/            # IQuest-Coder 40B deployment
│   │   ├── .env
│   │   ├── litellm-config.yaml
│   │   └── docker-compose.yml
│   └── loop/                # Additional config
├── gpt-oss-120b/            # GPT-OSS 120B deployment
│   ├── .env
│   └── docker-compose.yml
└── w8a8/
    └── qwen3-coder/         # Qwen3-Coder deployment
```
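As a rough illustration of the routing layer, a `litellm-config.yaml` entry typically looks like the sketch below (model names, hostnames, and keys are placeholders, not this repo's actual config):

```yaml
model_list:
  - model_name: glm-4.6v             # name exposed to clients
    litellm_params:
      model: openai/glm-4.6v         # treat vLLM as an OpenAI-compatible backend
      api_base: http://vllm:8000/v1  # vLLM service inside the compose network
      api_key: dummy                 # vLLM does not require a real key by default
```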
- Docker & Docker Compose
- NVIDIA GPU with sufficient VRAM
- CUDA drivers
- (Optional) Tailscale/Headscale for networking
[Add your license here]