Model Catalog

This directory contains pre-configured vLLM deployments for different use cases. Each .yml file represents a standalone model configuration with tuned parameters for specific workflows.

📋 Index

Model Variant	Primary Use Case	Context Length	Concurrency
`glm47-flash`	General-purpose reasoning & tool calling	Up to 128K tokens	Low (16 seqs)
`step-3.5-flash-full`	Long-context reasoning & problem-solving	Variable (full model capability)	Moderate (24 seqs)
`step-3.5-flash-high-concurrency`	High-throughput inference	8K tokens	High (64 seqs)

🧾 Model Card Example

`step-3.5-flash-full` (`./step-3.5-flash.yml`)

This is a concrete model card-style summary for one configuration in models/.

Field	Value
Model Name	`step-3.5-flash-full`
Upstream Model	`stepfun-ai/Step-3.5-Flash-FP8`
Primary Use Case	Long-context reasoning, code generation, complex problem solving
Context Length	`auto` (full model capability)
Quantization	`fp8`
KV Cache Dtype	`fp8`
Max Concurrency	`--max-num-seqs 24`
Batch Token Budget	`--max-num-batched-tokens 16K`
Tool Calling	Enabled (`--enable-auto-tool-choice`, `--tool-call-parser step3p5`)
Reasoning Parser	`step3p5`
Speculative Decoding	Enabled (`step3p5_mtp`, `num_speculative_tokens=1`)
Load Format	`fastsafetensors`
Key Runtime Env	`VLLM_ATTENTION_BACKEND=FLASH_ATTN`, `VLLM_USE_FLASHINFER_MOE_FP8=1`

🎯 Choose Right Model

Scenario 1: General-Purpose Assistant

Choose: glm47-flash

Good for:

Conversational AI assistants
Knowledge Q&A
Light reasoning tasks
Tool calling integration

Strengths:

Excellent reasoning capabilities (GLM-4.7)
Long context support (128K)
Efficient AWQ quantization
Built-in tool calling parsers

Scenario 2: Complex Problem Solving

Choose: step-3.5-flash-full

Good for:

Mathematical proofs
Code generation & analysis
Chain-of-thought reasoning
Extended reasoning chains

Strengths:

State-of-the-art reasoning model (Step 3.5)
Full-length context utilization
Native speculative decoding
MoE-optimized for reasoning workloads

Scenario 3: High-Traffic API Serving

Choose: step-3.5-flash-high-concurrency

Good for:

Batch processing pipelines
Large-scale API hosting
Text generation services
Applications needing high throughput

Strengths:

Maximizes concurrent requests (64 seqs)
Optimized for predictable latencies
Compact context window fits efficiently
Balances throughput with reasonable quality

🚀 Quick Start

Deploy a Specific Model

# Activate your environment
conda activate vllm-env

# Deploy GLM-4.7 Flash
make deploy-glm47-flash

# Deploy Step-3.5 Flash (Full-Length)
make deploy-step35-flash-full

# Deploy Step-3.5 Flash (High-Concurrency)
make deploy-step35-flash-hcsw

Or manually:

# GLM-4.7
vllm serve $(cat ../configs/model_names.conf | grep glm47_flash) \
  --model-name glm47-flash \
  --config-file ./glm47-flash.yml

# Step-3.5 Flash variations
for cfg in step-3.5-flash.yml step-3.5-flash-hcsw.yml; do
  vllm serve ... --config-file $cfg
done

Access via API

All models expose compatible OpenAI-compatible APIs:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-api-key"  # See CONFIGURATION.md for setup
)

# Query whichever model is served
response = client.chat.completions.create(
    model="selected-model-name",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

📊 Configuration Philosophy

Why Three Variants?

Each model variant targets a specific performance envelope:

Metric	GLM-4.7	Step-3.5 (Long)	Step-3.5 (Fast)
Primary Goal	Balanced intelligence	Depth of reasoning	Volume of output
Concurrency	16	24	64
Tokens/sec	Moderate	High	Very High
Latency p95	~200ms	~150ms	~100ms
Quality Focus	Well-rounded	Deep thinking	Reliable consistency

Shared Optimizations Across All

All configurations leverage modern vLLM optimizations:

✅ AWQ/FastSafeTensors - Modern weight formats ✅ FP8 KV Cache - 2× effective cache capacity ✅ Chunked Prefill - Better GPU utilization ✅ Speculative Decoding - 1.5–2× speedup (when applicable)

See ../MODELS.md for detailed explanation of each technique.

🔧 Understanding the Files

Structure of a Model Config

# Human-readable comment
description: Optional one-line summary shown by ./runMe.sh

command: |
  # Actual vLLM command broken across lines for readability

env:
  # Environment variables affecting the entire container/run

litellm:
  # Application-layer overrides (temperature, max_tokens, etc.)

Critical Parameters Explained

Parameter	Meaning	Trade-off
`max-model-len`	Maximum context window	Larger = more capable, consumes more memory
`max-num-seqs`	Concurrent sequence budget	Higher = more throughput, worse latency
`max-num-batched-tokens`	Per-iteration token budget	Controls TTFT vs ITL balance
`gpu-memory-utilization`	GPU RAM allocation fraction	Closer to 1.0 = maximal throughput, risks OOM

Detailed guidance: Refer to ../MODELS.md.

🐛 Troubleshooting

Issue: Out of Memory

Models consume substantial VRAM even when idle. Mitigation:

# Reduce memory pressure temporarily
export VLLM_GPU_MEMORY_UTILIZATION=0.85

# Restart affected container/service
docker compose restart <service-name>

Consider switching to step-3.5-flash-high-concurrency for tighter memory budgets.

Issue: Too Many Requests Queued

Increase allowed concurrency—but watch for increased latency:

# Adjust locally or in docker-compose.yml
services:
  vllm-service:
    environment:
      MAX_CONCURRENT_REQUESTS: 128  # Was maybe 64

Issue: Response Quality Degradation

Some configurations prioritize throughput over nuance. Steps:

Identify bottleneck (GPU util?)
Review metrics: watch -n 1 nvidia-smi
Compare with alternative variant profiles
Tune temperature, top_p, presence_penalty in application layer

📈 Monitoring

Health Checks

# Service health
curl http://localhost:8000/health

# Metrics endpoint
curl http://localhost:8000/metrics

# Loaded models
curl -H "Authorization: Bearer YOUR_KEY" http://localhost:8000/v1/models

Logs

View real-time activity:

# Container logs
docker compose logs -f vllm-service

# Tail last 100 lines
journalctl -u vllm.service -f

🔄 Updating Models

Pull New Weights

Weights cached in ~/.cache/huggingface/hub/.

Force refresh:

rm -rf ~/.cache/huggingface/hub/models--*
# Then redeploy
make deploy-*variant*

Modify Settings

Edit the corresponding .yml file, then reload:

docker compose restart vllm-service

No rebuild required—the runner loads configs dynamically.

📚 Further Reading

Technical Specs: ../MODELS.md - Complete vLLM configuration reference
Deployment Guide: ../LAUNCHER_GUIDE.md - How to spin up the stack
Settings Doc: ../CONFIGURATION.md - Environment & network config
Official Docs: https://docs.vllm.ai/

🆘 Getting Help

Encountered unexpected behavior?

Check logs: docker compose logs vllm-service
Validate configs: vllm serve <model> --check-weights
Consult: ../MODELS.md § "Troubleshooting"
Report issues: Project repository issues tracker

Last Updated: February 2026 Compatible With: vLLM v1+

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Model Catalog

📋 Index

🧾 Model Card Example

`step-3.5-flash-full` (`./step-3.5-flash.yml`)

🎯 Choose Right Model

Scenario 1: General-Purpose Assistant

Scenario 2: Complex Problem Solving

Scenario 3: High-Traffic API Serving

🚀 Quick Start

Deploy a Specific Model

Access via API

📊 Configuration Philosophy

Why Three Variants?

Shared Optimizations Across All

🔧 Understanding the Files

Structure of a Model Config

Critical Parameters Explained

🐛 Troubleshooting

Issue: Out of Memory

Issue: Too Many Requests Queued

Issue: Response Quality Degradation

📈 Monitoring

Health Checks

Logs

🔄 Updating Models

Pull New Weights

Modify Settings

📚 Further Reading

🆘 Getting Help

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

Model Catalog

📋 Index

🧾 Model Card Example

step-3.5-flash-full (./step-3.5-flash.yml)

🎯 Choose Right Model

Scenario 1: General-Purpose Assistant

Scenario 2: Complex Problem Solving

Scenario 3: High-Traffic API Serving

🚀 Quick Start

Deploy a Specific Model

Access via API

📊 Configuration Philosophy

Why Three Variants?

Shared Optimizations Across All

🔧 Understanding the Files

Structure of a Model Config

Critical Parameters Explained

🐛 Troubleshooting

Issue: Out of Memory

Issue: Too Many Requests Queued

Issue: Response Quality Degradation

📈 Monitoring

Health Checks

Logs

🔄 Updating Models

Pull New Weights

Modify Settings

📚 Further Reading

🆘 Getting Help

`step-3.5-flash-full` (`./step-3.5-flash.yml`)