This directory contains pre-configured vLLM deployments for different use cases. Each .yml file represents a standalone model configuration with tuned parameters for specific workflows.
| Model Variant | Primary Use Case | Context Length | Concurrency |
|---|---|---|---|
| glm47-flash | General-purpose reasoning & tool calling | Up to 128K tokens | Low (16 seqs) |
| step-3.5-flash-full | Long-context reasoning & problem-solving | Variable (full model capability) | Moderate (24 seqs) |
| step-3.5-flash-high-concurrency | High-throughput inference | 8K tokens | High (64 seqs) |
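The table can also be read as a simple lookup. The sketch below encodes the three rows and picks a variant for a given prompt size and concurrency. The `Variant` class and `pick_variant` helper are hypothetical names, "128K" and "8K" are assumed to mean 128 × 1024 and 8 × 1024 tokens, and the "full model capability" row is treated as having no fixed limit, which is a simplification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Variant:
    name: str
    max_context_tokens: Optional[int]  # None = "full model capability" (no fixed number given)
    max_seqs: int


# Rows copied from the table; token counts assume K = 1024.
VARIANTS = [
    Variant("glm47-flash", 128 * 1024, 16),
    Variant("step-3.5-flash-full", None, 24),
    Variant("step-3.5-flash-high-concurrency", 8 * 1024, 64),
]


def pick_variant(context_tokens: int, concurrent_requests: int) -> Optional[Variant]:
    """Return a variant that fits both requirements, or None if none does.

    Tie-break (an arbitrary heuristic): prefer the smallest sequence budget
    that still covers the requested concurrency.
    """
    fitting = [
        v
        for v in VARIANTS
        if v.max_seqs >= concurrent_requests
        and (v.max_context_tokens is None or v.max_context_tokens >= context_tokens)
    ]
    return min(fitting, key=lambda v: v.max_seqs, default=None)
```

A `None` result means no single variant covers the request, for example a 100K-token prompt at 40 concurrent sequences.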
This is a concrete model card-style summary for one configuration in models/.
| Field | Value |
|---|---|
| Model Name | step-3.5-flash-full |
| Upstream Model | stepfun-ai/Step-3.5-Flash-FP8 |
| Primary Use Case | Long-context reasoning, code generation, complex problem solving |
| Context Length | auto (full model capability) |
| Quantization | fp8 |
| KV Cache Dtype | fp8 |
| Max Concurrency | --max-num-seqs 24 |
| Batch Token Budget | --max-num-batched-tokens 16K |
| Tool Calling | Enabled (--enable-auto-tool-choice, --tool-call-parser step3p5) |
| Reasoning Parser | step3p5 |
| Speculative Decoding | Enabled (step3p5_mtp, num_speculative_tokens=1) |
| Load Format | fastsafetensors |
| Key Runtime Env | VLLM_ATTENTION_BACKEND=FLASH_ATTN, VLLM_USE_FLASHINFER_MOE_FP8=1 |
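To make the card concrete, the following sketch assembles a `vllm serve` argument list from the fields above. It is an approximation rather than the contents of the actual .yml file: it assumes "16K" means 16384 tokens, that the speculative settings are passed as JSON through `--speculative-config`, and that leaving out `--max-model-len` corresponds to the "auto" context length. `build_serve_args` and `ENV` are hypothetical names.

```python
import json
import shlex


def build_serve_args() -> list[str]:
    """Translate the model card fields into a vllm serve argument list."""
    speculative = {"method": "step3p5_mtp", "num_speculative_tokens": 1}
    return [
        "vllm", "serve", "stepfun-ai/Step-3.5-Flash-FP8",
        "--served-model-name", "step-3.5-flash-full",
        "--quantization", "fp8",
        "--kv-cache-dtype", "fp8",
        "--max-num-seqs", "24",
        "--max-num-batched-tokens", "16384",  # assumes "16K" = 16 * 1024
        "--enable-auto-tool-choice",
        "--tool-call-parser", "step3p5",
        "--reasoning-parser", "step3p5",
        "--speculative-config", json.dumps(speculative),
        "--load-format", "fastsafetensors",
    ]


# Runtime environment variables listed in the card.
ENV = {"VLLM_ATTENTION_BACKEND": "FLASH_ATTN", "VLLM_USE_FLASHINFER_MOE_FP8": "1"}

if __name__ == "__main__":
    prefix = " ".join(f"{key}={value}" for key, value in ENV.items())
    print(prefix, shlex.join(build_serve_args()))
```

Running the script prints a single shell-quoted command line that can be compared against the `command:` entry of the real configuration.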
Choose: glm47-flash
Good for:
- Conversational AI assistants
- Knowledge Q&A
- Light reasoning tasks
- Tool calling integration
Strengths:
- Excellent reasoning capabilities (GLM-4.7)
- Long context support (128K)
- Efficient AWQ quantization
- Built-in tool calling parsers
Choose: step-3.5-flash-full
Good for:
- Mathematical proofs
- Code generation & analysis
- Chain-of-thought reasoning
- Extended reasoning chains
Strengths:
- State-of-the-art reasoning model (Step 3.5)
- Full-length context utilization
- Native speculative decoding
- MoE-optimized for reasoning workloads
Choose: step-3.5-flash-high-concurrency
Good for:
- Batch processing pipelines
- Large-scale API hosting
- Text generation services
- Applications needing high throughput
Strengths:
- Maximizes concurrent requests (64 seqs)
- Optimized for predictable latencies
- Compact context window fits efficiently
- Balances throughput with reasonable quality
# Activate your environment
conda activate vllm-env
# Deploy GLM-4.7 Flash
make deploy-glm47-flash
# Deploy Step-3.5 Flash (Full-Length)
make deploy-step35-flash-full
# Deploy Step-3.5 Flash (High-Concurrency)
make deploy-step35-flash-hcsw
Or manually:
# GLM-4.7
vllm serve "$(grep glm47_flash ../configs/model_names.conf)" \
  --served-model-name glm47-flash \
  --config ./glm47-flash.yml
# Step-3.5 Flash variations
for cfg in step-3.5-flash.yml step-3.5-flash-hcsw.yml; do
  vllm serve ... --config "$cfg"
done
All models expose OpenAI-compatible APIs:
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-api-key",  # See CONFIGURATION.md for setup
)
# Query whichever model is served
response = client.chat.completions.create(
    model="selected-model-name",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
Each model variant targets a specific performance envelope:
| Metric | GLM-4.7 | Step-3.5 (Long) | Step-3.5 (Fast) |
|---|---|---|---|
| Primary Goal | Balanced intelligence | Depth of reasoning | Volume of output |
| Concurrency | 16 | 24 | 64 |
| Tokens/sec | Moderate | High | Very High |
| Latency p95 | ~200ms | ~150ms | ~100ms |
| Quality Focus | Well-rounded | Deep thinking | Reliable consistency |
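The latency figures in the table depend on hardware and workload, so it can help to measure p95 on your own deployment. The sketch below times a request callable and computes a nearest-rank percentile. `send_request` is a hypothetical zero-argument function that performs one request, for example a wrapper around the client call shown earlier.

```python
import math
import time
from typing import Callable, List


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples <= it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def measure_latencies(send_request: Callable[[], None], runs: int = 50) -> List[float]:
    """Call send_request `runs` times and return per-call wall-clock latency in ms."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        send_request()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies
```

For example, `percentile(measure_latencies(send_request), 95)` gives a p95 in milliseconds for sequential requests; measuring under concurrent load needs a separate harness.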
All configurations leverage modern vLLM optimizations:
- ✅ AWQ/FastSafeTensors - Modern weight formats
- ✅ FP8 KV Cache - 2× effective cache capacity
- ✅ Chunked Prefill - Better GPU utilization
- ✅ Speculative Decoding - 1.5–2× speedup (when applicable)
See ../MODELS.md for detailed explanation of each technique.
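The "2×" figure for the FP8 KV cache follows from element size: each cached value takes 1 byte instead of 2 with a 16-bit cache. The arithmetic below is a rough sketch for standard multi-head or grouped-query attention. The layer count, head count, head dimension, and 20 GiB budget are hypothetical values, not the dimensions of any model in this directory, and per-block scaling metadata is ignored.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """Bytes of KV cache for one token: one key and one value vector per layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def tokens_that_fit(cache_bytes: int, per_token_bytes: int) -> int:
    """Whole tokens that fit in a cache budget of cache_bytes."""
    return cache_bytes // per_token_bytes


# Hypothetical model shape and cache budget, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128
CACHE_BUDGET = 20 * 1024**3  # 20 GiB

fp16_tokens = tokens_that_fit(CACHE_BUDGET, kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 2))
fp8_tokens = tokens_that_fit(CACHE_BUDGET, kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 1))
print(f"16-bit cache: {fp16_tokens} tokens, fp8 cache: {fp8_tokens} tokens")
```

The same memory budget holds about twice as many cached tokens, which can be spent on longer contexts or on more concurrent sequences.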
# Human-readable comment
description: Optional one-line summary shown by ./runMe.sh
command: |
  # Actual vLLM command broken across lines for readability
env:
  # Environment variables affecting the entire container/run
litellm:
  # Application-layer overrides (temperature, max_tokens, etc.)

| Parameter | Meaning | Trade-off |
|---|---|---|
| max-model-len | Maximum context window | Larger = more capable, consumes more memory |
| max-num-seqs | Concurrent sequence budget | Higher = more throughput, worse latency |
| max-num-batched-tokens | Per-iteration token budget | Controls TTFT vs ITL balance |
| gpu-memory-utilization | GPU RAM allocation fraction | Closer to 1.0 = maximal throughput, risks OOM |
Detailed guidance: Refer to ../MODELS.md.
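The TTFT versus ITL trade-off for max-num-batched-tokens can be illustrated with a little arithmetic. Assuming chunked prefill is active and a prompt gets the whole per-iteration budget to itself, the number of scheduler iterations needed to prefill it is at least the prompt length divided by the budget, rounded up. `min_prefill_steps` is a hypothetical helper that gives a lower bound, not a model of the real scheduler.

```python
def min_prefill_steps(prompt_tokens: int, max_num_batched_tokens: int) -> int:
    """Lower bound on scheduler iterations needed to prefill one prompt."""
    if prompt_tokens <= 0 or max_num_batched_tokens <= 0:
        raise ValueError("both arguments must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return -(-prompt_tokens // max_num_batched_tokens)


if __name__ == "__main__":
    for budget in (2048, 8192, 16384):
        steps = min_prefill_steps(100_000, budget)
        print(f"budget={budget}: at least {steps} iterations for a 100000-token prompt")
```

A larger budget finishes prefill in fewer iterations, which tends to lower time to first token. A smaller budget keeps each iteration shorter, which tends to help inter-token latency for requests that are already decoding.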
Models consume substantial VRAM even when idle. Mitigation:
# Reduce memory pressure temporarily
export VLLM_GPU_MEMORY_UTILIZATION=0.85
# Restart affected container/service
docker compose restart <service-name>
Consider switching to step-3.5-flash-high-concurrency for tighter memory budgets.
Increase allowed concurrency—but watch for increased latency:
# Adjust locally or in docker-compose.yml
services:
  vllm-service:
    environment:
      MAX_CONCURRENT_REQUESTS: 128 # Was maybe 64
Some configurations prioritize throughput over nuance. Steps:
- Identify bottleneck (GPU util?)
- Review metrics: watch -n 1 nvidia-smi
- Compare with alternative variant profiles
- Tune temperature, top_p, presence_penalty in application layer
# Service health
curl http://localhost:8000/health
# Metrics endpoint
curl http://localhost:8000/metrics
# Loaded models
curl -H "Authorization: Bearer YOUR_KEY" http://localhost:8000/v1/models
View real-time activity:
# Container logs
docker compose logs -f vllm-service
# Show the last 100 lines, then follow
journalctl -u vllm.service -n 100 -f
Weights are cached in ~/.cache/huggingface/hub/.
Force refresh:
rm -rf ~/.cache/huggingface/hub/models--*
# Then redeploy
make deploy-<variant>
Edit the corresponding .yml file, then reload:
docker compose restart vllm-service
No rebuild required; the runner loads configs dynamically.
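After a restart the server can take a while to load weights before it accepts requests. The sketch below polls the /health endpoint listed among the monitoring commands until it answers with HTTP 200. `http_ok` and `wait_until_healthy` are hypothetical helpers, and the 10-minute deadline and 5-second interval are arbitrary defaults.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if a GET to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, timeouts, refused connections
        return False


def wait_until_healthy(
    url: str = "http://localhost:8000/health",
    deadline_s: float = 600.0,
    interval_s: float = 5.0,
    probe: Callable[[str], bool] = http_ok,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll url until probe succeeds or roughly deadline_s of waiting has elapsed."""
    waited = 0.0  # counts sleep time only, so the deadline is approximate
    while True:
        if probe(url):
            return True
        if waited >= deadline_s:
            return False
        sleep(interval_s)
        waited += interval_s


if __name__ == "__main__":
    print("healthy" if wait_until_healthy() else "timed out")
```

The `probe` and `sleep` parameters exist so the loop can be exercised without a running server.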
- Technical Specs: ../MODELS.md - Complete vLLM configuration reference
- Deployment Guide: ../LAUNCHER_GUIDE.md - How to spin up the stack
- Settings Doc: ../CONFIGURATION.md - Environment & network config
- Official Docs: https://docs.vllm.ai/
Encountered unexpected behavior?
- Check logs: docker compose logs vllm-service
- Validate configs: vllm serve <model> --check-weights
- Consult: ../MODELS.md § "Troubleshooting"
- Report issues: Project repository issue tracker
Last Updated: February 2026
Compatible With: vLLM v1+