This repository contains Docker Compose configurations for deploying various large language models using vLLM with LiteLLM as an API gateway.
| Model | Directory | Description |
|---|---|---|
| GLM-4.6V-NVFP4 | `glm46v/` | Visual language model from Zhipu AI (GLM) |
| IQuest-Coder 40B | `IQuest-Coder/instruct/` | Code-focused LLM from Fireworks AI |
| GPT-OSS 120B | `gpt-oss-120b/` | 120B parameter open-source model |
| Qwen3-Coder | `w8a8/qwen3-coder/` | Code model from Alibaba (W8A8 quantization) |
```
Claude Code → LiteLLM (port 4000) → vLLM (port 8000) → Model
```
Each deployment includes:
- vLLM: Inference backend serving the model via OpenAI-compatible API
- LiteLLM: API gateway providing Anthropic-compatible endpoints
- Tailscale (optional): Network connectivity for distributed setups
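The gateway's job is to accept Anthropic-style requests and forward them to vLLM's OpenAI-compatible API. The sketch below illustrates the shape of that translation; the function and field handling are simplified illustrations, not LiteLLM's actual internals:

```python
# Illustrative sketch of the Anthropic -> OpenAI request translation that
# the gateway performs. This is NOT LiteLLM's real code, just the idea.
def anthropic_to_openai(payload: dict) -> dict:
    messages = list(payload.get("messages", []))
    # Anthropic carries the system prompt as a top-level field;
    # OpenAI-style chat expects it as the first message.
    if "system" in payload:
        messages = [{"role": "system", "content": payload["system"]}] + messages
    return {
        "model": payload["model"],
        "messages": messages,
        "max_tokens": payload.get("max_tokens", 1024),
    }

# Hypothetical request as Claude Code might send it (model name is a placeholder):
req = {
    "model": "glm-4.6v",
    "system": "You are a coding assistant.",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Hello"}],
}
print(anthropic_to_openai(req))
```

In the real stack this mapping (plus streaming, tool calls, and auth) is handled entirely by LiteLLM; clients only ever see the Anthropic-compatible surface on port 4000.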
- Navigate to the model directory:

  ```bash
  cd glm46v/  # or any other model directory
  ```

- Copy `.env.example` to `.env` and configure:

  ```bash
  cp .env.example .env
  # Edit .env with your API keys and settings
  ```

- Start the services:

  ```bash
  docker compose up -d
  ```

- Verify the deployment:

  ```bash
  curl http://localhost:4000/models
  ```
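The `/models` endpoint returns an OpenAI-style list (`{"object": "list", "data": [{"id": ...}, ...]}`). A small helper like this (hypothetical, not part of the repo) can confirm that a specific model is being served:

```python
import json
import urllib.request

def model_ids(models_payload: dict) -> list:
    """Extract model IDs from an OpenAI-style /models response."""
    return [m["id"] for m in models_payload.get("data", [])]

# Sample payload in the shape the gateway returns (model id is a placeholder):
sample = {"object": "list", "data": [{"id": "glm-4.6v", "object": "model"}]}
print(model_ids(sample))  # ['glm-4.6v']

# Against a live gateway (requires the stack from `docker compose up -d`):
# with urllib.request.urlopen("http://localhost:4000/models") as resp:
#     print(model_ids(json.load(resp)))
```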
| Variable | Description |
|---|---|
| `HF_TOKEN` | Hugging Face access token for model downloads |
| `LITELLM_MASTER_KEY` | API key for LiteLLM gateway authentication |
| `TS_AUTHKEY` | Tailscale authentication key (if using tailnet) |
| `HEADSCALE_URL` | Headscale server URL (if using self-hosted) |
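Put together, a minimal `.env` might look like this (all values below are placeholders, not working credentials):

```bash
# .env — do not commit real values
HF_TOKEN=hf_xxxxxxxxxxxxxxxx
LITELLM_MASTER_KEY=sk-change-me
# Only needed when joining a tailnet:
TS_AUTHKEY=tskey-auth-xxxxxxxx
HEADSCALE_URL=https://headscale.example.com
```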
| Service | Port |
|---|---|
| LiteLLM Gateway | 4000 |
| vLLM Backend | 8000 |
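In Compose terms the two ports map roughly like this (service names here are illustrative; each deployment directory defines its own):

```yaml
services:
  vllm:
    ports:
      - "8000:8000"   # OpenAI-compatible inference API
  litellm:
    ports:
      - "4000:4000"   # gateway entry point used by clients
```

Clients should only ever need port 4000; port 8000 is exposed mainly for debugging the backend directly.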
```
vllm-deployments/
├── glm46v/                  # GLM-4.6V deployment
│   ├── .env                 # Environment variables
│   ├── .env.litellm         # LiteLLM-specific config
│   ├── litellm-config.yaml  # LiteLLM model routing
│   ├── docker-compose.yml   # Service orchestration
│   └── Dockerfile           # vLLM image build
├── IQuest-Coder/
│   ├── instruct/            # IQuest-Coder 40B deployment
│   │   ├── .env
│   │   ├── litellm-config.yaml
│   │   └── docker-compose.yml
│   └── loop/                # Additional config
├── gpt-oss-120b/            # GPT-OSS 120B deployment
│   ├── .env
│   └── docker-compose.yml
└── w8a8/
    └── qwen3-coder/         # Qwen3-Coder deployment
```
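As a rough illustration of the routing layer, a `litellm-config.yaml` entry typically looks like the sketch below (model names, hostnames, and keys are placeholders, not this repo's actual config):

```yaml
model_list:
  - model_name: glm-4.6v             # name exposed to clients
    litellm_params:
      model: openai/glm-4.6v         # treat vLLM as an OpenAI-compatible backend
      api_base: http://vllm:8000/v1  # vLLM service inside the compose network
      api_key: dummy                 # vLLM does not require a real key by default
```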
- Docker & Docker Compose
- NVIDIA GPU with sufficient VRAM
- CUDA drivers
- (Optional) Tailscale/Headscale for networking
[Add your license here]