CLI Reference

Commands Overview

| Command | Description |
|---|---|
| `vllm-mlx serve` | Start OpenAI-compatible server |
| `vllm-mlx-bench` | Run performance benchmarks |
| `vllm-mlx-chat` | Start Gradio chat interface |

vllm-mlx serve

Start the OpenAI-compatible API server.

Usage

vllm-mlx serve <model> [options]

Options

| Option | Description | Default |
|---|---|---|
| `--served-model-name` | Custom model name exposed through the OpenAI API. If not set, the model path is used as the name. | None |
| `--port` | Server port | 8000 |
| `--host` | Server host | 0.0.0.0 |
| `--api-key` | API key for authentication | None |
| `--rate-limit` | Requests per minute per client (0 = disabled) | 0 |
| `--timeout` | Request timeout in seconds | 300 |
| `--enable-metrics` | Expose Prometheus metrics on /metrics | False |
| `--continuous-batching` | Enable continuous batching for multiple concurrent users | False |
| `--cache-memory-mb` | Cache memory limit in MB | Auto |
| `--cache-memory-percent` | Fraction of RAM for cache | 0.20 |
| `--no-memory-aware-cache` | Use legacy entry-count cache | False |
| `--use-paged-cache` | Enable paged KV cache | False |
| `--max-tokens` | Default max tokens | 32768 |
| `--stream-interval` | Tokens per stream chunk | 1 |
| `--mcp-config` | Path to MCP config file | None |
| `--paged-cache-block-size` | Tokens per cache block | 64 |
| `--max-cache-blocks` | Maximum cache blocks | 1000 |
| `--max-num-seqs` | Max concurrent sequences | 256 |
| `--default-temperature` | Default temperature when not specified in request | None |
| `--default-top-p` | Default top_p when not specified in request | None |
| `--reasoning-parser` | Parser for reasoning models (qwen3, deepseek_r1) | None |
| `--embedding-model` | Pre-load an embedding model at startup | None |
| `--enable-auto-tool-choice` | Enable automatic tool calling | False |
| `--tool-call-parser` | Tool call parser (auto, mistral, qwen, llama, hermes, deepseek, kimi, granite, nemotron, xlam, functionary, glm47) | None |
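
As a rough illustration of how the cache-sizing options relate to each other, the sketch below converts a `--cache-memory-percent` fraction into megabytes and estimates how many tokens a paged cache could hold. Both helper names are hypothetical. The arithmetic assumes the fraction applies to total system RAM and that every block holds exactly `--paged-cache-block-size` tokens; the server's actual accounting may differ.

```python
def cache_budget_mb(total_ram_bytes: int, fraction: float = 0.20) -> int:
    """Hypothetical helper: convert a RAM fraction to whole megabytes (MiB)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return int(total_ram_bytes * fraction / (1024 * 1024))


def paged_cache_token_capacity(block_size: int = 64, max_blocks: int = 1000) -> int:
    """Hypothetical helper: upper bound on tokens held by a paged cache."""
    return block_size * max_blocks


if __name__ == "__main__":
    ram_32_gib = 32 * 1024**3
    # Budget in MB for a 32 GiB machine at the default fraction of 0.20.
    print(cache_budget_mb(ram_32_gib))
    # Token upper bound at the default block size and block count.
    print(paged_cache_token_capacity())
```

Under these assumptions, the default block settings (64 tokens per block, 1000 blocks) cap the paged cache at 64,000 tokens.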

Examples

# Simple mode (single user, max throughput)
# Model path is used as the model name in the OpenAI API (e.g. model="mlx-community/Llama-3.2-3B-Instruct-4bit")
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit

The model is listed as `mlx-community/Llama-3.2-3B-Instruct-4bit` by the `/v1/models` endpoint. To check, run `curl http://localhost:8000/v1/models`.
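
To check the served name from a script rather than by eye, the response body can be parsed for model IDs. The sketch below assumes the endpoint returns the OpenAI-style list shape (`{"object": "list", "data": [{"id": ...}]}`). The helper name is hypothetical and the sample payload is illustrative, not captured server output.

```python
import json


def model_ids(models_response: str) -> list[str]:
    """Extract model IDs from an OpenAI-style /v1/models JSON body."""
    payload = json.loads(models_response)
    return [entry["id"] for entry in payload.get("data", [])]


# Illustrative payload (assumed shape, not real server output).
sample = json.dumps(
    {
        "object": "list",
        "data": [
            {"id": "mlx-community/Llama-3.2-3B-Instruct-4bit", "object": "model"}
        ],
    }
)

if __name__ == "__main__":
    print(model_ids(sample))
```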

# With a custom API model name
# --served-model-name sets the name clients must use when calling the API (e.g. model="my-model")
vllm-mlx serve --served-model-name my-model mlx-community/Llama-3.2-3B-Instruct-4bit
# Note: the model is listed as 'my-model' by the /v1/models endpoint.

# Continuous batching (multiple users)
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --continuous-batching

# With memory limit for large models
vllm-mlx serve mlx-community/GLM-4.7-Flash-4bit \
  --continuous-batching \
  --cache-memory-mb 2048

# Production with paged cache
vllm-mlx serve mlx-community/Qwen3-0.6B-8bit \
  --continuous-batching \
  --use-paged-cache \
  --port 8000

# With MCP tools
vllm-mlx serve mlx-community/Qwen3-4B-4bit --mcp-config mcp.json

# Multimodal model
vllm-mlx serve mlx-community/Qwen3-VL-4B-Instruct-3bit

# Reasoning model (separates thinking from answer)
vllm-mlx serve mlx-community/Qwen3-8B-4bit --reasoning-parser qwen3

# DeepSeek reasoning model
vllm-mlx serve mlx-community/DeepSeek-R1-Distill-Qwen-7B-4bit --reasoning-parser deepseek_r1

# Tool calling with Mistral/Devstral
vllm-mlx serve mlx-community/Devstral-Small-2507-4bit \
  --enable-auto-tool-choice --tool-call-parser mistral

# Tool calling with Granite
vllm-mlx serve mlx-community/granite-4.0-tiny-preview-4bit \
  --enable-auto-tool-choice --tool-call-parser granite

# With API key authentication
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --api-key your-secret-key

# Expose Prometheus metrics
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --enable-metrics

# Production setup with security options
vllm-mlx serve mlx-community/Qwen3-4B-4bit \
  --api-key your-secret-key \
  --rate-limit 60 \
  --timeout 120 \
  --continuous-batching
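
When the server is launched from a script, it can help to assemble the command line programmatically. The sketch below is a hypothetical helper (not part of vllm-mlx) that builds an argument list from the flags documented above. It only constructs the list and does not validate flag names; running it is left to the caller, for example via `subprocess.Popen`.

```python
from typing import Union

OptionValue = Union[bool, int, float, str]


def build_serve_command(model: str, **options: OptionValue) -> list[str]:
    """Hypothetical helper: build a `vllm-mlx serve` argument list.

    Keyword names map to flags by replacing underscores with hyphens,
    e.g. rate_limit=60 -> --rate-limit 60. A boolean True emits a bare
    flag; False omits the flag entirely.
    """
    command = ["vllm-mlx", "serve", model]
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if isinstance(value, bool):
            if value:
                command.append(flag)
        else:
            command.extend([flag, str(value)])
    return command


if __name__ == "__main__":
    print(
        build_serve_command(
            "mlx-community/Qwen3-4B-4bit",
            api_key="your-secret-key",
            rate_limit=60,
            timeout=120,
            continuous_batching=True,
        )
    )
```

Passing a list rather than a shell string avoids quoting problems when values such as API keys contain special characters.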

Security

When --api-key is set, all API requests require the Authorization: Bearer <api-key> header:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-secret-key"  # Must match --api-key
)

Or with curl:

curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer your-secret-key"
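
For clients that do not use the OpenAI SDK, the same header can be attached by hand. The following sketch uses only the Python standard library. The helper name is hypothetical, and the request is built but not sent, so the snippet runs without a server.

```python
import urllib.request


def authorized_request(url: str, api_key: str) -> urllib.request.Request:
    """Hypothetical helper: attach the Bearer token expected with --api-key."""
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {api_key}"}
    )


if __name__ == "__main__":
    request = authorized_request(
        "http://localhost:8000/v1/models", "your-secret-key"
    )
    print(request.get_header("Authorization"))
    # To send it against a running server:
    # with urllib.request.urlopen(request) as response:
    #     print(response.read().decode())
```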

vllm-mlx-bench

Run performance benchmarks.

Usage

vllm-mlx-bench --model <model> [options]

Options

| Option | Description | Default |
|---|---|---|
| `--model` | Model name | Required |
| `--prompts` | Number of prompts | 5 |
| `--max-tokens` | Max tokens per prompt | 256 |
| `--quick` | Quick benchmark mode | False |
| `--video` | Run video benchmark | False |
| `--video-url` | Custom video URL | None |
| `--video-path` | Custom video path | None |

Examples

# LLM benchmark
vllm-mlx-bench --model mlx-community/Llama-3.2-1B-Instruct-4bit

# Quick benchmark
vllm-mlx-bench --model mlx-community/Llama-3.2-1B-Instruct-4bit --quick

# Image benchmark (auto-detected for VLM models)
vllm-mlx-bench --model mlx-community/Qwen3-VL-8B-Instruct-4bit

# Video benchmark
vllm-mlx-bench --model mlx-community/Qwen3-VL-8B-Instruct-4bit --video

# Custom video
vllm-mlx-bench --model mlx-community/Qwen3-VL-8B-Instruct-4bit \
  --video --video-url https://example.com/video.mp4

vllm-mlx-chat

Start the Gradio chat interface.

Usage

vllm-mlx-chat --served-model-name <model-name> [options]

Options

| Option | Description | Default |
|---|---|---|
| `--model` | Model name | Required |
| `--port` | Gradio port | 7860 |
| `--text-only` | Disable multimodal | False |

Examples

# Multimodal chat (text + images + video)
vllm-mlx-chat --served-model-name mlx-community/Qwen3-VL-4B-Instruct-3bit

# Text-only chat
vllm-mlx-chat --served-model-name mlx-community/Llama-3.2-3B-Instruct-4bit --text-only

Environment Variables

| Variable | Description |
|---|---|
| `VLLM_MLX_TEST_MODEL` | Model for tests |
| `HF_TOKEN` | HuggingFace token |
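
When the CLI is started from another program, these variables can be set on the child process instead of the parent shell. The sketch below is a hypothetical helper that copies an environment mapping and adds `HF_TOKEN` when a token is supplied. The token value shown is a placeholder.

```python
import os
from typing import Mapping, Optional


def server_environment(
    hf_token: Optional[str] = None,
    base: Optional[Mapping[str, str]] = None,
) -> dict[str, str]:
    """Hypothetical helper: copy an environment and set HF_TOKEN if given."""
    env = dict(os.environ if base is None else base)
    if hf_token:
        env["HF_TOKEN"] = hf_token
    return env


if __name__ == "__main__":
    env = server_environment(hf_token="hf_example_token")
    # Pass `env=env` to subprocess.run or subprocess.Popen when launching the CLI.
    print("HF_TOKEN" in env)
```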