
Releases: lightseekorg/smg

v1.4.1

09 Apr 18:18
ea9005d

🚀 Shepherd Model Gateway v1.4.1 Released

Patch release with a mesh HA stability fix, DP rank scheduling, reasoning parser fixes, and engine version bumps.

Mesh HA Stability Fix

Fixed premature worker removal during rolling deploys:

  • Workers synced via mesh with health: false were being removed by the health checker before they had a chance to pass local health checks
  • Fix: health checker now only removes workers whose health check actually failed this tick, not workers that are merely marked unhealthy from mesh state
  • Eliminates the 500/503 error spike during gateway redeploys with --remove-unhealthy-workers enabled
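
The fixed removal rule can be sketched in Python (the gateway itself is not Python; field and function names here are illustrative, not the actual SMG code):

```python
from dataclasses import dataclass

@dataclass
class Worker:
    url: str
    healthy: bool           # health flag, possibly synced from mesh peers
    failed_this_tick: bool  # did a *local* health probe fail on this tick?

def workers_to_remove(workers):
    """Before the fix, any worker with healthy=False was removed -- including
    workers just synced from the mesh that had not yet been probed locally.
    After the fix, only workers whose local check failed this tick go."""
    return [w for w in workers if w.failed_this_tick]

# A worker synced via mesh with health: false but not yet probed survives:
synced = Worker("http://w1:8000", healthy=False, failed_this_tick=False)
dead = Worker("http://w2:8000", healthy=False, failed_this_tick=True)
assert workers_to_remove([synced, dead]) == [dead]
```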

DP Rank Scheduling

Data-parallel rank scheduling for multi-GPU inference:

  • Supports scheduling with the minimum number of required ranks
  • New scheduling policy for DP-aware worker selection
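
The minimum-rank requirement can be sketched as follows (a hypothetical helper, not the actual scheduler): a request is only dispatchable once at least the required number of DP ranks is available.

```python
def schedulable_ranks(available_ranks, required_ranks):
    """DP-aware selection sketch: pick the lowest `required_ranks` ranks
    if enough are available, else signal 'not schedulable yet'."""
    if len(available_ranks) < required_ranks:
        return None  # wait until the minimum rank count is up
    return sorted(available_ranks)[:required_ranks]

assert schedulable_ranks({0, 2, 3}, 2) == [0, 2]
assert schedulable_ranks({1}, 2) is None
```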

MCP Tool Improvements

  • Argument overrides (#1048) -- Per-request overrides of MCP tool call arguments, enabling request-level customization of tool parameters
  • Passthrough output flattening (#1041) -- MCP passthrough mcp_call output now flattened to plain strings for consistency
  • ID normalization (#989) -- MCP call item IDs normalized to mcp_ prefix for OpenAI alignment
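
The ID normalization is simple enough to show as a sketch (illustrative Python, not the actual SMG code):

```python
def normalize_mcp_id(item_id: str) -> str:
    """Normalize an MCP call item ID to the mcp_ prefix used by
    OpenAI-style responses; already-prefixed IDs pass through."""
    return item_id if item_id.startswith("mcp_") else f"mcp_{item_id}"

assert normalize_mcp_id("call_abc123") == "mcp_call_abc123"
assert normalize_mcp_id("mcp_call_abc123") == "mcp_call_abc123"
```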

Reasoning Parser Fixes

  • Thinking toggle detection (#1031) -- Detect thinking toggle from chat template and override parser state automatically
  • NanoV3/Nemotron fix (#1067) -- Changed parser to always_in_reasoning=false to fix incorrect reasoning block detection
  • Harmony routing (#1025) -- Route reasoning_content to analysis channel per Harmony spec

Bug Fixes

  • Routing: Eliminate unconditional token allocation on the hot path (#1024)
  • Responses API: Stop defaulting top_p when it is omitted from requests (#1043), unify upstream header handling (#1029)
  • gRPC: Update vLLM imports for inputs reorganization (#1033)
  • Frontend: Fix smg serve rejecting vLLM OpenAI args (#832)
  • Discovery: Periodic reconciliation with identity-based pod equality (#1039)

Engine Version Bumps

  • vLLM: v0.18.0 -> v0.19.0
  • SGLang: v0.5.9/v0.5.10rc0 -> v0.5.10
  • TensorRT-LLM: 1.3.0rc8 -> 1.3.0rc10

Infrastructure

  • Claude review workflow hardened with incremental reviews and auto-approve (#1036, #1040, #1042)
  • E2E worker failure diagnostics and cleanup improvements (#1015)
  • gRPC package releases: smg-grpc-proto 0.4.6, smg-grpc-servicer 0.5.2

Upgrade now: pip install smg --upgrade

🐑 Shepherd your LLM infrastructure with confidence.

Docker Images

Pre-built engine images on GitHub Container Registry:

SGLang:

docker pull ghcr.io/lightseekorg/smg:1.4.1-sglang-v0.5.10

vLLM:

docker pull ghcr.io/lightseekorg/smg:1.4.1-vllm-v0.19.0

TensorRT-LLM:

docker pull ghcr.io/lightseekorg/smg:1.4.1-trtllm-1.3.0rc10

All images for v1.4.1:

| Engine | Tag | Pull Command |
| --- | --- | --- |
| sglang | 1.4.1-sglang-v0.5.10 | docker pull ghcr.io/lightseekorg/smg:1.4.1-sglang-v0.5.10 |
| trtllm | 1.4.1-trtllm-1.3.0rc10 | docker pull ghcr.io/lightseekorg/smg:1.4.1-trtllm-1.3.0rc10 |
| vllm | 1.4.1-vllm-v0.19.0 | docker pull ghcr.io/lightseekorg/smg:1.4.1-vllm-v0.19.0 |

What's Changed

  • perf: Eliminate unconditional token allocation on the routing hot path by @ppraneth in #1024
  • refactor(e2e): rename worker_args to sglang_args by @CatherineSue in #1019
  • fix(ci): improve e2e worker failure diagnostics and cleanup by @key4ng in #1015
  • feat(metrics-ws): [2/4] add protocol types and watch registry by @key4ng in #982
  • fix(harmony): route reasoning_content to analysis channel per Harmony spec by @CatherineSue in #1025
  • fix(openai): unify responses upstream header handling by @zhaowenzi in #1029
  • fix(grpc): update vLLM imports for inputs reorganization by @CatherineSue in #1033
  • fix(reasoning): detect thinking toggle from chat template and override parser state by @CatherineSue in #1031
  • fix(ci): harden Claude review workflow with incremental reviews and resilience by @key4ng in #1036
  • fix(ci): fix comment fetch, add review summary, and auto-approve by @key4ng in #1040
  • fix(ci): handle array-format execution output in review summary by @key4ng in #1042
  • fix(mcp): flatten passthrough mcp_call output to plain strings by @zhaowenzi in #1041
  • feat(metrics-ws): [3/4] add event-driven and polled collectors by @key4ng in #1027
  • fix(responses): stop defaulting top_p for omitted requests by @zhaowenzi in #1043
  • fix(frontend): Fix smg serve reject vLLM OpenAI args by @YouNeedCryDear in #832
  • feat(realtime-api): WebRTC relay bridge by @pallasathena92 in #733
  • feat(overrides): add support for argument overrides with mcp tools by @Tobel158 in #1048
  • fix(mcp): normalize mcp_call item IDs to use mcp_ prefix for OpenAI alignment by @zhaowenzi in #989
  • feat: supports dp rank scheduling and scheduling with the minimun number of… by @jiashaokun-1 in #1007
  • fix(discovery): periodic reconciliation with identity-based pod equality by @Kangyan-Zhou in #1039
  • chore(deps): update wasm-encoder requirement from 0.245 to 0.246 by @dependabot[bot] in #1054
  • chore(deps): update lz4_flex requirement from 0.11 to 0.13 by @dependabot[bot] in #1053
  • chore(deps): update str0m requirement from 0.16 to 0.18 by @dependabot[bot] in #1052
  • chore(deps): bump vllm base image from v0.18.0 to v0.19.0 by @slin1237 in #1066
  • fix(reasoning): change NanoV3/Nemotron parser to always_in_reasoning=false by @CatherineSue in #1067
  • chore(deps): bump sglang from 0.5.9/0.5.10rc0 to 0.5.10 by @slin1237 in #1064
  • feat(metrics-ws): [4/4] add /ws/metrics endpoint with subscription support by @key4ng in #1050
  • fix(mesh): prevent premature removal of unhealthy workers by health checker by @slin1237 in #1076
  • chore(deps): bump TensorRT-LLM from 1.3.0rc8 to 1.3.0rc10 by @slin1237 in #1077
  • chore(grpc): release smg-grpc-proto 0.4.6 and smg-grpc-servicer 0.5.2 by @slin1237 in #1078
  • chore: bump versions for v1.4.1 release by @slin1237 in #1080

Full Changelog: v1.4.0...v1.4.1

v1.4.0

02 Apr 15:53
52564df

🚀 Shepherd Model Gateway v1.4.0 Released

The biggest SMG release yet -- Kubernetes-native deployment via Helm, a terminal dashboard, 200x mesh memory reduction, 7-11x faster multimodal preprocessing, native Completion API over gRPC, and per-model retry configuration.

Kubernetes-Native Deployment with Helm

Production-ready Helm chart for deploying SMG on Kubernetes:

  • One-command deployment -- helm install smg oci://ghcr.io/lightseekorg/smg-helm deploys the full gateway stack
  • Router + Worker deployment -- A single chart deploys both the gateway router and inference engine workers (vLLM, SGLang, TRT-LLM) with GPU scheduling
  • Mesh HA with service discovery -- Deploy multiple gateway replicas as a StatefulSet with automatic gossip-based peer discovery via --router-selector
  • Full K8s integration -- RBAC, Ingress, HPA, PDB, ServiceMonitor, Grafana dashboard ConfigMap, JSON Schema validation at helm lint time
  • 5 example configurations -- Router-only, with-postgres, with-service-discovery, with-ingress, with-monitoring

Impact: Zero-to-production SMG deployment on Kubernetes with a single helm install. Declarative configuration, automatic scaling, and built-in observability.

Terminal Dashboard (smg-tui)

Full-featured terminal UI for real-time monitoring and interactive chat:

  • 7 tabs -- Pulse (real-time dashboard with sparklines), Workers (per-worker stats + circuit breaker state), Chat (streaming markdown playground), Logs (per-component with ANSI stripping), Benchmark, Traffic, Mesh
  • Worker management -- Quick-add presets for OpenAI/Anthropic/xAI/Gemini, local worker launch with automatic GPU selection via nvidia-smi, GPU claim tracking to prevent double-allocation
  • Gateway auto-start -- smg-tui --auto-start launches the gateway, polls health, and cleans up on exit
  • Chat playground -- Streaming SSE with live cursor, markdown rendering, multi-turn support, Tab to cycle models

Mesh Performance & Reliability Revolution

Eliminated catastrophic memory growth and achieved >200x improvement in mesh resource usage:

  • Delta encoding (#899): Only send new tree operations since last sync -- 40x smaller sync payloads (18.3 MB → 417 KB), gzip compression for additional 5-8x wire reduction
  • Lazy serialization (#919): Moved full TreeState serialization off the hot path -- memory: OOM crash → 31 MB stable, CPU: 280-345% → 56-58%, latency: 12s degrading → stable
  • CRDT bypass (#961): Moved tree state out of CRDT operation log -- eliminated ~1 GB/1.5hr memory leak under sustained load
  • Two-layer sync fix (#1011): Eliminated remaining memory leaks in the tree sync protocol
  • Snapshot serialization (#974): Structure-preserving radix tree snapshots for mesh sync -- shared prefixes stored once, replacing 40 MB flat operation replay with compact tree format
  • Timeout enforcement (#952): Consistent timeout contract across all RPC and stream paths
  • Health mirroring (#912, #892): Mesh-synced workers now register locally for health checking with proper status mirroring
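
The delta-encoding idea behind #899 can be sketched in a few lines (illustrative Python; the real implementation is part of the mesh sync protocol and also gzip-compresses the wire payload):

```python
def delta_since(op_log, last_synced_index):
    """Instead of shipping the whole tree state on every sync, send only
    the operations appended since the peer's last acknowledged index."""
    return op_log[last_synced_index:]

op_log = ["insert:a", "insert:ab", "evict:a", "insert:abc"]
# The peer already has the first two ops, so only two are sent:
assert delta_since(op_log, 2) == ["evict:a", "insert:abc"]
```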

Benchmark Results (20 min, 500 rps, 20K-char prompts):

  • 565,920 requests, 0 errors
  • Memory plateaus at ~2.3 GB (no linear growth)

7-11x Faster Multimodal Image Preprocessing

SMG now matches or beats HuggingFace Python preprocessing performance:

  • SIMD resize -- Replaced image crate (pure Rust) with fast_image_resize v6 (AVX2/SSE4.1) for 10-25x faster resize
  • Fused operations -- Combined to_tensor_and_normalize(), zero-copy patchify_into(), fused pad + normalize + tile split for Llama4
  • Additional optimizations -- Thread-local Resizer reuse, eliminated DynamicImage clones, optimized serialization and tensor conversion

Benchmark Results (Qwen3-VL):

| Image Size | Before | After | vs HuggingFace Python |
| --- | --- | --- | --- |
| 224×224 | 4.77 ms | 0.44 ms (10.8x) | 2.5x faster |
| 640×480 | 15.5 ms | 1.59 ms (9.7x) | 1.8x faster |
| 1024×768 | 40.6 ms | 4.31 ms (9.4x) | 1.6x faster |
| 1920×1080 | 286 ms | 39.6 ms (7.2x) | ~parity |

Native Completion API over gRPC

Full /v1/completions support through the gRPC pipeline with streaming and PD disaggregation:

  • 6-PR pipeline -- CompletionRequest type, preparation stage, request building with backend sampling params, response processing, pipeline wiring, streaming support
  • Streaming -- OpenAI-compatible SSE events with per-index stop decoder tracking, echo and suffix handling
  • PD mode -- Dual streaming for prefill-decode disaggregation
  • Type safety -- Native RequestType::Completion throughout the pipeline, exhaustive match arms in shared stages

Per-Model Retry Configuration

Different models can now have different retry policies:

  • WorkerRegistry integration -- Workers declare per-model retry config via WorkerSpec.resilience, stored in WorkerRegistry with last-write-wins semantics
  • All routers updated -- HTTP, gRPC, OpenAI, Gemini, gRPC PD, and HTTP PD routers all look up per-model config at request time, falling back to the global default
  • Cleanup on removal -- Retry config is automatically cleaned up when the last worker for a model is removed

Impact: GPU-constrained models can have longer timeouts and more retries, while fast models use aggressive retry budgets. No more one-size-fits-all.
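
The lookup-with-fallback and last-write-wins semantics can be sketched like this (illustrative Python with hypothetical data shapes, not the actual WorkerRegistry code):

```python
GLOBAL_RETRY = {"max_retries": 2, "timeout_s": 30}

# Per-model retry config declared by workers at registration time.
per_model_retry = {}

def register_worker(model, resilience):
    """Workers declare retry config via their spec; the most recent
    registration for a model wins (last-write-wins)."""
    if resilience is not None:
        per_model_retry[model] = resilience

def retry_config(model):
    """Routers look up per-model config at request time, falling back
    to the global default when a model declared none."""
    return per_model_retry.get(model, GLOBAL_RETRY)

register_worker("slow-70b", {"max_retries": 5, "timeout_s": 120})
register_worker("fast-7b", None)
assert retry_config("slow-70b")["timeout_s"] == 120
assert retry_config("fast-7b") == GLOBAL_RETRY
```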

Three-Phase Graceful Shutdown

Replaces the fixed-timeout shutdown with a Gate → Drain → Teardown approach:

  • Phase 1 (Gate): Stop accepting new requests
  • Phase 2 (Drain): Wait for in-flight requests to complete (up to configured timeout)
  • Phase 3 (Teardown): MCP orchestrator cleanup + exit

Impact: Requests finishing in 2s no longer wait 28s for a fixed grace period. Requests needing 35s no longer get killed at 30s. The system drains to zero when possible.
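
The drain logic can be sketched as a simulation (illustrative Python, not the actual Rust implementation; `in_flight_counts` stands in for the in-flight request count observed on each poll tick):

```python
def shutdown(in_flight_counts, drain_timeout_ticks):
    """Gate -> Drain -> Teardown sketch. Phase 1 (Gate) is implicit:
    no new requests are admitted once this runs."""
    for tick, count in enumerate(in_flight_counts):
        if count == 0:
            return ("teardown", tick)      # Phase 2: drained to zero early
        if tick + 1 >= drain_timeout_ticks:
            break                          # drain timeout reached
    return ("teardown", None)              # Phase 3 runs either way

# Drains at tick 2 instead of waiting out a fixed grace period:
assert shutdown([3, 1, 0], drain_timeout_ticks=30) == ("teardown", 2)
# A still-busy system is torn down once the drain timeout expires:
assert shutdown([5, 5, 5], drain_timeout_ticks=3) == ("teardown", None)
```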

Worker Registry & REST API Improvements

  • Model field required (#713) -- Clients omitting model now get 400 Bad Request instead of silent "unknown" injection. Matches OpenAI API spec. Breaking change.
  • REST semantics (#875) -- POST /workers (create-only, 409 on conflict), PUT /workers/{id} (full replace), PATCH /workers/{id} (partial update). Breaking change: PUT now requires WorkerSpec instead of WorkerUpdateRequest.
  • Split register paths (#836) -- register() (create-only), replace() (overwrite-then-diff, no transient gap), register_or_replace() (idempotent upsert)
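
The new POST/PUT/PATCH semantics can be sketched with status codes (illustrative Python handlers, not the actual REST layer):

```python
workers = {}  # worker id -> spec

def post_worker(wid, spec):
    """POST /workers: create-only -- 409 Conflict if the ID exists."""
    if wid in workers:
        return 409
    workers[wid] = spec
    return 201

def put_worker(wid, spec):
    """PUT /workers/{id}: full replace (now takes a complete WorkerSpec)."""
    workers[wid] = spec
    return 200

def patch_worker(wid, partial):
    """PATCH /workers/{id}: partial update of an existing worker."""
    if wid not in workers:
        return 404
    workers[wid] = {**workers[wid], **partial}
    return 200

assert post_worker("w1", {"model": "llama"}) == 201
assert post_worker("w1", {"model": "qwen"}) == 409   # create-only
assert patch_worker("w1", {"priority": 5}) == 200
assert workers["w1"] == {"model": "llama", "priority": 5}
```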

vLLM gRPC Embedding Support

End-to-end embedding pipeline for vLLM via gRPC:

  • Rust gateway + Python servicer (calls engine.encode() with PoolingParams)
  • Flattened SGLang EmbedResponse proto (removed oneof, uses tonic::Status for errors)
  • Removed SGLang-specific log_metrics and cached_tokens from embed/classify protos

DeepSeek V3.1 Tool Call Parser

Native parser for DeepSeek V3.1's tool calling format:

  • Handles V3.1's simplified format (no function type prefix, no markdown code blocks)
  • Complete + streaming (parse_incremental) support
  • Auto-registered for deepseek-v3.1* and deepseek-ai/DeepSeek-V3.1* model patterns
  • E2E validated against live DeepSeek V3.1 (FP8) on 8×H200
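
The auto-registration patterns above behave like glob matches; a minimal sketch (the pattern strings come from the release notes, the matching logic is illustrative):

```python
from fnmatch import fnmatchcase

DEEPSEEK_V31_PATTERNS = ["deepseek-v3.1*", "deepseek-ai/DeepSeek-V3.1*"]

def uses_deepseek_v31_parser(model_id: str) -> bool:
    """True when a model ID matches one of the registered patterns."""
    return any(fnmatchcase(model_id, p) for p in DEEPSEEK_V31_PATTERNS)

assert uses_deepseek_v31_parser("deepseek-ai/DeepSeek-V3.1")
assert uses_deepseek_v31_parser("deepseek-v3.1-terminus")
assert not uses_deepseek_v31_parser("deepseek-ai/DeepSeek-V3")
```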

Additional Features

  • Configurable storage hook context (#807) -- Map HTTP headers to storage hook request context via storage_context_headers
  • Conversation memories schema (#976) -- First-class conversation_memories table in data-connector with Oracle Flyway DDL and insert seam
  • gRPC health checking (#885) -- Standard grpc.health.v1 health service for vLLM workers
  • Model metadata in GetModelInfo (#871) -- vLLM GetModelInfo RPC now returns model metadata fields
  • Metrics server refactored to axum (#966) -- Foundation for /ws/metrics WebSocket endpoint
  • max_total_num_tokens in GetServerInfo (#817) -- Aligns gRPC response with HTTP server

Performance Improvements

  • Tokenizer: Optimized stop decoder and incremental sequence decoding (#990)
  • Routing: Optimized extract_text_for_routing string handling (#967)
  • Mesh: Eliminated per-request CRDT serialization in sync_tree_operation (#948)
  • Multimodal: Thread-local Resizer reuse (#923), eliminated DynamicImage clones (#928), optimized serialization and tensor conversion (#1012)

Bug Fixes

  • Multimodal: Fixed Phi-3-vision for string-format chat templates (#942), LLaVA-Next anyres multi-crop for vLLM gRPC (#941), hardened registry matching and token geometry (#945), propagated placeholder resolution errors (#943), fixed images smaller than patch_size × merge_size (#908), use preprocessor token counts in LlavaSpec (#958), fall back to config.model_type for aliased model IDs (#898)
  • Chat Templates: Inject special tokens (bos_token, eos_token) into chat template context (#914), correct content format detection for Qwen3-style templates (#981), inject special tokens inside tokenizer impls (#918)
  • Protocol: Accept null for boolean fields logprobs and stream (#1020), validate reasoning parser name at CLI and startup (#901)
  • gRPC: Fix assistant tool_calls message serialization for chat templates (#1023), include stop tokens in TRT-LLM output for Harmony parsing (#879), handle vllm log forwarding on servicer side (#975)
  • Mesh: Stop advertising 0.0.0.0 to peers (#883), set tonic message size limits to match application limit (#893), prevent duplicate store events from inflating tree_sizes (#946)
  • Gateway: Update metric when removing unhealthy workers (#884), filter empty-string backend defaults in CLI arg fallback (#934)
  • Responses API: Align store=false state persistence behavior (#916)
  • Serve: Respect user-set CUDA_VISIBLE_DEVICES in gp...

v1.3.3

21 Mar 16:56
9f34d35

🚀 Shepherd Model Gateway v1.3.3 Released

Major performance release with 7x faster mesh synchronization and critical bug fixes.

⚡ Mesh Performance Revolution

Switched mesh serialization from JSON to bincode with dramatic performance improvements:

Benchmark Results (production workload - 1024 operations, 4000 tokens):

  • Serialization: 7.1x faster (35.5ms → 5.0ms)
  • Deserialization: 14.8x faster (63.4ms → 4.3ms)
  • Wire size: 4.3x smaller (67.9MB → 15.7MB)
  • Multi-model aggregate (10 models): 4.3x smaller (679MB → 157MB)

Additional mesh improvements:

  • Operation log auto-compaction and tombstone GC
  • Skip full-store scans when nothing has changed
  • Prevent stale snapshot chunks from mixing across retries
  • Break infinite retry loop for oversized incremental updates

Impact: Massive reduction in network bandwidth and CPU usage for multi-node deployments. Mesh state synchronization is now 7-15x faster with 4.3x less bandwidth consumption.

🎯 Structured Output Support

response_format support in Chat Completions API for Harmony models:

  • JSON schema constrained output
  • Structured generation for tool calling and data extraction
  • Fixed structural tag triggers for json_schema mode
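
A request using the feature might look like the following OpenAI-compatible body (sketch only; the model name and schema are illustrative, and the `response_format` shape follows the OpenAI Chat Completions convention):

```python
import json

request = {
    "model": "gpt-oss-120b",  # hypothetical Harmony model name
    "messages": [{"role": "user", "content": "Extract the city."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "extraction",
            "schema": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    },
}
body = json.dumps(request)  # POST this to /v1/chat/completions
assert json.loads(body)["response_format"]["type"] == "json_schema"
```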

🔧 PD Disaggregation Improvements

Enhanced reliability for prefill-decode mode:

  • Abort both PD requests when one side hits transport error (prevents hanging requests)
  • Handle mismatched metric labels in PD disaggregation mode
  • Fixed classify race condition with URL-based detection

🐛 Bug Fixes

  • Protocol: Validate /v1/messages tool_choice contract
  • Harmony: Include developer message when instructions are present
  • Gateway: Disable auto-detection if "runtime": "sglang" explicitly set
  • Client: Auto-close streaming responses on iteration exhaustion
  • Docker: Install gRPC proto and servicer for vLLM images

📚 Documentation Overhaul

Comprehensive audit and fixes across all documentation:

  • Quickstart and getting-started guides
  • Worker configuration and gRPC pipeline
  • Tokenizer, MCP, and WASM plugin extensibility
  • PD disaggregation and cache-aware routing
  • Reliability features and monitoring
  • Configuration, metrics, and architecture
  • API reference documentation

🏗️ Infrastructure

  • Default engine versions: vLLM 0.18.0, TensorRT-LLM 1.3.0rc8
  • Added minimaxai/minimax-m2 to nightly benchmarks
  • Improved E2E test infrastructure with parametrized fixtures

Full Changelog: v1.3.2...v1.3.3

Upgrade now: pip install smg --upgrade

🐑 Shepherd your LLM infrastructure with confidence.

Docker Images

Pre-built engine images on GitHub Container Registry:

SGLang:

docker pull ghcr.io/lightseekorg/smg:1.3.3-sglang-v0.5.9

vLLM:

docker pull ghcr.io/lightseekorg/smg:1.3.3-vllm-v0.18.0

TensorRT-LLM:

docker pull ghcr.io/lightseekorg/smg:1.3.3-trtllm-1.3.0rc8

All images for v1.3.3:

| Engine | Tag | Pull Command |
| --- | --- | --- |
| sglang | 1.3.3-sglang-v0.5.9 | docker pull ghcr.io/lightseekorg/smg:1.3.3-sglang-v0.5.9 |
| trtllm | 1.3.3-trtllm-1.3.0rc8 | docker pull ghcr.io/lightseekorg/smg:1.3.3-trtllm-1.3.0rc8 |
| vllm | 1.3.3-vllm-v0.18.0 | docker pull ghcr.io/lightseekorg/smg:1.3.3-vllm-v0.18.0 |

What's Changed

  • chore(release): bump llm-multimodal to 1.4.0 by @slin1237 in #788
  • fix(harmony): fix structural tag triggers for json_schema constrained output by @CatherineSue in #789
  • feat(harmony): support response_format in Chat Completions by @CatherineSue in #791
  • refactor(e2e): reorganize Harmony tests, add validation, remove unused gateway args by @CatherineSue in #796
  • chore(deps): bump dorny/paths-filter from 3 to 4 by @dependabot[bot] in #792
  • fix(ci): fix mergify stale/close rules that never trigger by @CatherineSue in #798
  • fix(ci): use ignore-pr-updates for stale PR detection, disable issues by @CatherineSue in #800
  • chore(deps): update tokio-tungstenite requirement from 0.28 to 0.29 by @dependabot[bot] in #793
  • feat(core): add per-worker resilience and HTTP pool config types by @CatherineSue in #799
  • fix(ci): remove ignore-pr-updates that marks active PRs as stale by @CatherineSue in #804
  • refactor(openai): cleanup dead code, redundant state, and hot-path inefficiencies by @slin1237 in #802
  • fix(mesh): break infinite retry loop for oversized incremental updates by @slin1237 in #808
  • feat(core): wire per-worker resilience and HTTP client into BasicWorker by @CatherineSue in #803
  • test(mesh): add serialization benchmark for mesh state sync by @slin1237 in #810
  • test(e2e): re-enable skipped tests for vLLM and TRT-LLM by @CatherineSue in #806
  • perf(mesh): switch all mesh serialization from JSON to bincode by @slin1237 in #809
  • fix(mesh): use bincode for snapshot generation to match receivers by @slin1237 in #816
  • feat(gateway): Propagate otel context for distributed tracing by @ekzhang in #814
  • perf(mesh): skip full-store scans when nothing has changed by @slin1237 in #823
  • fix(ci): drop [grpc] extra from nightly vllm install by @CatherineSue in #826
  • feat(ci): add minimaxai/minimax-m2 to nightly benchmark by @smfirmin in #795
  • fix(gateway): Disable auto-detection if "runtime": "sglang" explicitly set by @ekzhang in #820
  • perf(mesh): add operation log auto-compaction and tombstone GC by @slin1237 in #825
  • refactor(e2e): replace smg_compare with parametrized api_client fixture by @CatherineSue in #812
  • fix(client): auto-close streaming responses on iteration exhaustion by @CatherineSue in #835
  • refactor(e2e): add model fixture, remove deprecated smg fixture by @CatherineSue in #834
  • fix(protocol): validate /v1/messages tool_choice contract by @nishanthp in #833
  • fix(mesh): prevent stale snapshot chunks from mixing across retries by @slin1237 in #837
  • test(mesh): improve benchmark summary with timing and side-by-side comparison by @slin1237 in #841
  • fix(docker): install gRPC proto and servicer for vLLM images by @slin1237 in #843
  • fix(gateway): use URL-based detection to eliminate classify race condition by @slin1237 in #839
  • ci: bump default engine vllm(0.18.0) and trt(1.3.0rc8) versions by @slin1237 in #845
  • fix(gateway): handle mismatched metric labels in PD disaggregation mode by @slin1237 in #846
  • fix(pd): abort both PD requests when one side hits a transport error by @slin1237 in #844
  • docs(quickstart): audit and fix getting-started documentation by @slin1237 in #848
  • docs(extensibility): audit and fix tokenizer, MCP, and WASM plugin documentation by @slin1237 in #849
  • docs(workers): audit and fix worker configuration and gRPC pipeline documentation by @slin1237 in #850
  • docs(reliability): audit and fix reliability feature documentation by @slin1237 in #851
  • docs(operations): audit and fix monitoring and data connection documentation by @slin1237 in #854
  • docs(routing): audit and fix PD disaggregation and cache-aware routing documentation by @slin1237 in #853
  • docs(config): audit and fix configuration, metrics, and architecture documentation by @slin1237 in #852
  • docs(api): audit and fix API reference documentation by @slin1237 in #855
  • fix(ci): scope VERSION_OVERRIDE to smg crate only by @slin1237 in #856
  • chore(release): bump version to 1.3.3 by @slin1237 in #857


v1.3.2

17 Mar 17:29
047ed98

🚀 Shepherd Model Gateway v1.3.2 Released

Feature release adding multimodal support to Messages API and Python mesh bindings.

🎨 Multimodal Support for Messages API

Complete vision/image support in Messages API gRPC pipeline:

  • Native image processing for Messages API requests
  • Works across all gRPC backends (SGLang, vLLM, TensorRT-LLM)
  • Full feature parity with Anthropic's Messages API including vision

Impact: Messages API now supports both text and vision workloads. Deploy vision-language models with full reasoning and thinking capabilities through the Messages API protocol.

🌐 Mesh High Availability in Python

--enable-mesh support added to Python bindings:

  • Configure mesh HA directly from Python CLI
  • Distributed state synchronization accessible from Python deployments
  • Complete Python API coverage for mesh features

🐛 Bug Fixes

  • API: Restored model_id field in /workers response
  • Harmony: Reject ignore_eos with HTTP 400 for compatibility
  • Harmony: Include developer message when instructions are present
  • Lossy UTF-8 decode fallback for malformed text
  • Fixed gRPC PD mode detection
  • Enabled loop_controls for Jinja2 templates

📚 Documentation

  • Added comprehensive Messages API documentation
  • External providers integration guide
  • Mesh HA deployment documentation
  • Docker/PyPI badges added to README

Full Changelog: v1.3.1...v1.3.2

Upgrade now: pip install smg --upgrade

🐑 Shepherd your LLM infrastructure with confidence.

What's Changed

  • fix(api): restore model_id field in /workers response by @slin1237 in #774
  • feat(python): add --enable-mesh support to Python bindings by @slin1237 in #775
  • fix(ci): bump NCCL to 2.28+ for TensorRT-LLM compatibility by @slin1237 in #777
  • fix(harmony): reject ignore_eos for Harmony models with HTTP 400 by @CatherineSue in #778
  • fix(ci): install NCCL 2.28+ after TRT-LLM requirements to prevent downgrade by @slin1237 in #779
  • fix(harmony): include developer message when instructions are present by @CatherineSue in #781
  • feat(gateway): add multimodal support to Messages API gRPC pipeline by @slin1237 in #776
  • fix(ci): use k8s pod env vars for API keys instead of GitHub secrets by @slin1237 in #782
  • feat(tokenizer): lossy UTF-8 decode fallback, fix gRPC PD mode detection, enable loop_controls by @Kangyan-Zhou in #769
  • docs: add Messages API, external providers, and mesh HA documentation by @slin1237 in #786
  • docs: add Docker/PyPI badges and release docker notes script by @slin1237 in #787
  • chore: add release 1.3.2 by @slin1237 in #780


v1.3.1

16 Mar 18:13
eae871e

🚀 Shepherd Model Gateway v1.3.1 Released

Minor release with operational improvements and bug fixes.

🛠️ New Features

Operational improvements:

  • --remove-unhealthy-workers flag - Automatically remove workers that fail health checks
  • disable_tokenizer_autoload support - Skip automatic tokenizer loading for custom configurations

🐛 Bug Fixes

  • Gateway: Index external workers by all discovered models (not just primary model)
  • CI: Added VERSION_OVERRIDE to version check script

Full Changelog: v1.3.0...v1.3.1

Upgrade now: pip install smg --upgrade

🐑 Shepherd your LLM infrastructure with confidence.

What's Changed

  • fix(deps): remove aws-lc-sys to fix aarch64 PyPI build by @slin1237 in #762
  • feat(gateway): add --remove-unhealthy-workers by @ekzhang in #714
  • feat(gateway): support disable_tokenizer_autoload by @Huixxi in #740
  • feat(messages): add unit tests for Messages API streaming & response … by @ConnorLi96 in #763
  • fix(gateway): index external workers by all discovered models by @zhaowenzi in #756
  • fix(ci): add VERSION_OVERRIDE to check-versions by @slin1237 in #764
  • feat(interactions): Implement steps to handle non-stream interactions req without tool call by @XinyueZhang369 in #723
  • chore: auto-close stale PRs after 30 days and validate DCO sign-off identity by @CatherineSue in #743
  • chore(release): bump version to 1.3.1 by @slin1237 in #767


v1.3.0

15 Mar 00:26
92effcc

🚀 Shepherd Model Gateway v1.3.0 Released

We're excited to announce Shepherd Model Gateway v1.3.0 – a major release bringing native Messages API support and expanding our agentic workload capabilities.

🎯 Messages API: First-Class Implementation

Native Messages API implementation with core protocol support (credits to @CatherineSue):

  • True first-class support — Direct protocol implementation, not a translation layer
  • Extended thinking — Native ThinkingConfig with per-model reasoning activation and streaming thinking_delta events
  • Full streaming + non-streaming — Complete Anthropic SSE event protocol
  • Tool use — Custom tool definitions, tool_choice, structured tool output
  • Works across all backends — SGLang, vLLM, TensorRT-LLM via gRPC
  • Drop-in Anthropic SDK compatibility — Same API shape, your infrastructure

Why first-class matters: Wiring Messages API through chat completion (the common approach) silently drops thinking blocks — both in conversation history and model output — because the chat completion protocol has no concept of reasoning content. SMG's native implementation preserves thinking blocks end-to-end: ThinkingConfig activates model-specific reasoning, the streaming state machine emits proper thinking_delta events, and interleaved reasoning + text + tool use content blocks are assembled in correct order. No translation layer, no silent data loss.
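
The block-assembly step of that state machine can be sketched as follows (event names follow the Anthropic SSE protocol; the assembly logic is illustrative Python, not the actual implementation):

```python
def assemble_blocks(events):
    """Interleaved thinking_delta and text_delta events are assembled
    into ordered content blocks instead of being silently dropped."""
    blocks = []
    for kind, text in events:
        btype = "thinking" if kind == "thinking_delta" else "text"
        if blocks and blocks[-1]["type"] == btype:
            blocks[-1]["content"] += text      # extend the current block
        else:
            blocks.append({"type": btype, "content": text})
    return blocks

events = [("thinking_delta", "Let me check..."),
          ("text_delta", "The answer "), ("text_delta", "is 42.")]
assert assemble_blocks(events) == [
    {"type": "thinking", "content": "Let me check..."},
    {"type": "text", "content": "The answer is 42."},
]
```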

🔗 Expanding Agentic Workload Support

SMG now supports three major agentic APIs:

  • Chat Completions API (OpenAI) - Standard conversational interface
  • Responses API (OpenAI) - Still the only gateway supporting this for open-source models and third-party vendors
  • Messages API (Anthropic) - NEW - First-class native implementation with reasoning support

Plus routing to all major 3rd party providers: OpenAI, Anthropic, Gemini and more.

Impact: SMG sits behind any agent framework (Claude Code, Codex, OpenClaw, OpenCode) and routes to any model. Run agentic workflows designed for Claude on Llama 4, Qwen 3, DeepSeek, Kimi-K2.5—your infrastructure, full protocol fidelity including reasoning.

🌐 Unified /v1/models Across All Providers

Consistent model discovery experience across Anthropic, OpenAI, and Gemini:

  • Unified /v1/models response format across all routers
  • Consistent schema regardless of backend provider
  • Single API surface for model enumeration

⚡ High Availability Mesh Improvements

Sync cache-aware policy state across mesh HA nodes:

  • Cache policy state replicated across all mesh nodes
  • Automatic failover with consistent routing decisions
  • Zero-downtime deployments with state continuity

🛠️ smg-grpc-servicer Enhancements

  • Native SGLang backend support
  • Multi-backend extras for flexible deployment
  • Improved vLLM GetModelInfo response with served_model_name

🐛 Bug Fixes

  • Mesh: Plumbed --router-selector through CLI and Python bindings
  • Realtime API: Fixed worker health tracking in WebSocket session
  • gRPC servicer: Return served_model_name in vLLM GetModelInfo response
  • Dependencies: Pinned gRPC packages to 1.78.0 in SGLang install script

🏗️ Infrastructure

  • TensorRT-LLM default base image bumped to 1.3.0rc7
  • Added DeepWiki badge to README
  • Docker CI improvements

Full Changelog: v1.2.0...v1.3.0

Upgrade now: pip install smg --upgrade

🐑 Shepherd your LLM infrastructure with confidence.

Built for speed. Engineered for scale. Production-proven.

What's Changed

  • fix(ci): add packages:write permission to engine docker release workflows by @slin1237 in #697
  • refactor(gateway): unify /v1/models response across all routers by @slin1237 in #692
  • refactor(gateway): route realtime API through RouterTrait by @CatherineSue in #690
  • chore(deps): bump docker/setup-qemu-action from 3 to 4 by @dependabot[bot] in #704
  • chore(deps): update tokio-tungstenite requirement from 0.26 to 0.28 by @dependabot[bot] in #706
  • chore(deps): bump docker/login-action from 3 to 4 by @dependabot[bot] in #703
  • chore(deps): bump actions/checkout from 4 to 6 by @dependabot[bot] in #702
  • chore(deps): bump docker/setup-buildx-action from 3 to 4 by @dependabot[bot] in #701
  • chore(deps): bump docker/build-push-action from 6 to 7 by @dependabot[bot] in #700
  • Update max_concurrent_jobs from upstream by @ekzhang in #711
  • chore: add gongwei to code owner of docker and installation by @slin1237 in #715
  • chore: add gongwei to code owner of python binding by @slin1237 in #716
  • chore(ci): bump trtllm default base image to 1.3.0rc7 by @slin1237 in #717
  • docs: add DeepWiki badge to README by @slin1237 in #718
  • fix(deps): pin gRPC packages to 1.78.0 in sglang install script by @YouNeedCryDear in #719
  • feat(gateway): API-key-aware /v1/models with upstream fan-out by @slin1237 in #698
  • chore: fix lint by @slin1237 in #720
  • refactor(gateway): extract shared worker selection module by @slin1237 in #721
  • fix(mesh): plumb --router-selector through CLI and Python bindings by @slin1237 in #724
  • fix(grpc_servicer): return served_model_name in vLLM GetModelInfo response by @CatherineSue in #727
  • refactor(gateway): split OpenAI router.rs into chat and health modules by @slin1237 in #726
  • fix(realtime-api): worker health tracking in websocket session by @pallasathena92 in #725
  • refactor(gateway): extract MCP module from OpenAI responses by @slin1237 in #730
  • feat(realtime-api): WebRTC Router trait interface + HTTP route regist… by @pallasathena92 in #731
  • feat(realtime-api): WebRTC config plumbing through AppContext and CLI by @pallasathena92 in #729
  • refactor(gateway): extract history loading and storage queries from router.rs by @slin1237 in #732
  • feat: sync cache-aware policy state across mesh HA nodes by @llfl in #655
  • fix: update CODEOWNERS paths after crate relocation by @slin1237 in #734
  • refactor(gateway): extract route_responses orchestration into responses/route.rs by @slin1237 in #735
  • refactor(gateway): simplify openai router internals by @slin1237 in #737
  • feat(gateway): add Messages API type scaffolding to gRPC router by @slin1237 in #739
  • feat(gateway): add message_utils and MessagePreparationStage for Messages API by @slin1237 in #741
  • feat(gateway): add MessageRequestBuildingStage for Messages API by @slin1237 in #744
  • feat(grpc_servicer): add sglang support with multi-backend extras by @slin1237 in #745
  • fix(ci): prevent upload-servicer from being skipped by @slin1237 in #746
  • docs(template): add slack link by @lightseek-bot in #749
  • feat(gateway): add MessageResponseProcessingStage for Messages API (non-streaming) by @slin1237 in #747
  • feat(gateway): wire Messages API pipeline into gRPC routers by @slin1237 in #753
  • feat(gateway): add Messages API streaming support to gRPC router by @slin1237 in #758
  • chore: bump versions for v1.3.0 release by @slin1237 in #760


v1.2.0

10 Mar 16:23
4b03d32


🚀 Shepherd Model Gateway v1.2.0 Released!

We're thrilled to announce Shepherd Model Gateway v1.2.0 – a transformative release featuring enhanced event-driven cache-aware routing, production-ready client SDKs, Google Gemini integration, and vLLM gRPC server adoption!

Enhanced Event-Driven Cache-Aware Routing

Inspired by Amazon Dynamo's distributed caching principles, SMG extends its existing cache-aware routing with real-time KV cache event subscriptions:

  • SubscribeKvEvents RPC - Real-time KV cache event stream from all backends (SGLang, vLLM, TensorRT-LLM)
  • KvEventMonitor - Per-worker KV cache event subscriptions with automatic recovery
  • PositionalIndexer - Event-driven cache-aware routing with router prefix hash for query-path disambiguation
  • Auto-learned block_size - Dynamically learn from KV event stream
  • Flash Indexer parity - Closed 4 performance gaps, tuned DashMap shards to 256

Production Results (8 Llama model replicas):

  • TTFT avg: -23.0% (93.10 → 71.66 ms)
  • TTFT p99: -27.9% (186.98 → 134.88 ms)
  • TPOT avg: -0.9% (6.39 → 6.33 ms)
  • Latency avg: -3.8% (731.60 → 703.92 ms)
  • Latency max: -11.8% (1034.27 → 912.47 ms)
  • Req/sec: +1.3% (9.959 → 10.093)

Impact: Maximum KV cache utilization across your inference fleet. Route requests to workers with matching cached prefixes, eliminating redundant prefill computation and cutting TTFT by 23-28% in the benchmarks above.
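The routing idea can be sketched in a few lines. This is an illustrative simplification, not the PositionalIndexer's actual logic: the hashing scheme, the fixed block_size, and the names block_hashes/pick_worker are all assumptions for the sake of the example.

```python
import hashlib

def block_hashes(prompt: str, block_size: int = 16) -> list[str]:
    # Split the prompt into fixed-size blocks and chain-hash them, so two
    # prompts sharing a prefix produce identical leading hash sequences.
    hashes, prev = [], ""
    for i in range(0, len(prompt), block_size):
        block = prompt[i:i + block_size]
        prev = hashlib.sha256((prev + block).encode()).hexdigest()[:16]
        hashes.append(prev)
    return hashes

def pick_worker(prompt: str, worker_blocks: dict[str, set[str]]) -> str:
    # Choose the worker whose cached blocks cover the longest prefix of the
    # request, maximizing KV cache reuse.
    want = block_hashes(prompt)

    def prefix_len(cached: set[str]) -> int:
        n = 0
        for h in want:
            if h not in cached:
                break
            n += 1
        return n

    return max(worker_blocks, key=lambda w: prefix_len(worker_blocks[w]))
```

In the real system the per-worker block sets are fed by the KV event subscriptions, and block_size is learned from the event stream rather than hard-coded as it is here.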

🎨 TensorRT-LLM Multimodal Support

Complete vision-language model integration:

  • gRPC multimodal pipeline - preprocessed data with hashing
  • Backend-specific variants - optimized for TRT-LLM
  • String-based stop sequences - no pre-tokenization overhead
  • matched_stop support - proper stop sequence handling

🔄 vLLM Upstream gRPC Adoption

SMG's gRPC server implementation is now upstream in vLLM!

vLLM's PR #36169 formalizes SMG's protobuf and gRPC server implementation as an upstream dependency. gRPC is now an officially supported protocol in the vllm serve command.

  • smg-grpc-servicer package published to PyPI
  • Production-grade gRPC server infrastructure
  • Credit to @CatherineSue and @njhill for driving this milestone

Impact: A significant milestone for the project—SMG's gRPC innovations are now the foundation for vLLM's official gRPC support.

📦 Production-Ready Client SDKs

Multi-language SDK ecosystem with OpenAPI codegen:

  • Python SDK - Drop-in replacement for OpenAI/Anthropic SDKs with complete API coverage
  • Rust HTTP Client - Type-safe, async-first client with all endpoints
  • Java Type Generation - Full OpenAPI-derived types

Endpoints: Chat completions, classify, parser, responses, workers, loads, and more.

Impact: Integrate SMG into any tech stack with idiomatic, type-safe clients. Zero-friction migration from the OpenAI and Anthropic SDKs.
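Because the gateway speaks the OpenAI wire protocol, any OpenAI-style client can target it by pointing the base URL at SMG. A minimal standard-library sketch; the host/port, model name, and API key here are placeholders, not values shipped by SMG:

```python
import json
from urllib import request

GATEWAY = "http://localhost:8000"  # placeholder: wherever your SMG instance listens

def chat_request(model, messages, stream=False):
    # Build an OpenAI-style /v1/chat/completions request aimed at the gateway.
    body = json.dumps({"model": model, "messages": messages, "stream": stream}).encode()
    return request.Request(
        f"{GATEWAY}/v1/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer sk-placeholder",  # only if auth is enabled
        },
        method="POST",
    )

req = chat_request("llama-3", [{"role": "user", "content": "Hello"}])
# request.urlopen(req) would return a standard chat.completion JSON body.
```

The generated Python SDK wraps exactly this shape, so existing OpenAI client code only needs a base-URL change.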

🐳 Engine-Specific Docker Images

Pre-built Docker images for each inference engine:

  • docker pull ghcr.io/lightseekorg/smg:1.2.0-sglang-v0.5.9
  • docker pull ghcr.io/lightseekorg/smg:1.2.0-vllm-v0.17.0
  • docker pull ghcr.io/lightseekorg/smg:1.2.0-trtllm-1.3.0rc6

Impact: Zero-configuration deployment with engine-optimized images. Pull and run your preferred backend instantly. Credit to @gongwei-130 for driving this feature.

🔮 Google Gemini Integration

New Gemini router for Google's Interactions API:

  • Complete router registration and infrastructure
  • Native protocol support for Gemini models
  • Seamless integration alongside OpenAI, Anthropic, and self-hosted engines

Impact: Route to Gemini alongside your entire model fleet. One gateway, all providers.

💾 Advanced Data Persistence

Enterprise-grade data connector enhancements:

  • Schema versioning with safe-by-default migrations (Flyway integration)
  • SchemaConfig - Customizable table and column names for existing databases
  • Storage hooks - Pre/post persistence callbacks
  • WASM bridge - Call storage APIs from WASM middleware

Performance: Reduced Oracle DB round trips in pagination, batch linking, and delete operations.

📊 Load Monitoring & Discovery

  • New /v1/loads endpoint with gRPC support
  • Per-worker model customization (--model-id-from)
  • External worker discovery with per-provider API keys
  • Model aliasing

🔌 MCP Enhancements

Responses API Integration:

  • MCP approval items in protocols and routers
  • X-SMG-MCP header for MCP passthrough control

Anthropic Features:

  • tool_search_tool support
  • defer_loading for lazy tool initialization

Performance:

  • Concurrent tool execution in McpToolSession
  • Lock-free pool stats with AtomicUsize
  • Fixed O(n²) insert in inject_mcp_output_items
  • Reverse iteration with optional limit in AuditLog

🎨 Multimodal Improvements

Qwen VL Support:

  • Proper patchification and prompt replacement token counts
  • Backend-specific MultimodalData variants

vLLM Integration:

  • Send preprocessed multimodal data with hashing and structured tokens
  • Derive keep_on_cpu keys from model spec

Code Quality:

  • Split registry.rs into per-model spec modules
  • Removed dead MultiModalInputs/Tensor/Value types

Performance Optimizations

Core Engine:

  • Zero-allocation JSON streaming validation with IgnoredAny
  • Move semantics in gRPC request handling instead of cloning
  • Optimized clone usage and finalize/emit_completed ordering

Indexing:

  • worker_blocks moved to caller-owned storage
  • Setup moved out of concurrent benchmark timing loop

🛡️ Critical Bug Fixes

  • gRPC: Proper status codes, circuit breaker accuracy, return tonic::Status directly, use e.message() for errors
  • Reasoning Parser: 4MB buffer limit, non-fatal parse errors
  • Tokenizer: Jinja2 trim_blocks/lstrip_blocks enabled, Value::UNDEFINED for missing params, SGLang tp_size fix
  • Mesh: Prevent stale state overwrites via relay paths, increased multi-node timeouts
  • Gateway: Release semaphore permit before completion wait, unified worker registration
  • Multimodal: Qwen VL patchification and token counts, lazy model discovery in /v1/models

🔧 Refactoring & Code Quality

  • Repository: Library crates moved to crates/ directory, UUID v4 → v7 migration workspace-wide
  • MCP: Extracted shared iterators, deduplicated constructors, removed dead paths
  • gRPC: Split utils.rs into focused modules, deduplicated streaming logic
  • Workflow Engine: Simplified definition and internals
  • E2E Testing: Removed parallel infrastructure, simplified helpers and architecture
  • CI/CD: Reusable build workflows, pre-commit checks, conventional commits enforcement, file changes detection for E2E skipping, sequential execution, block AI co-author lines

📚 Documentation

  • Standardized runtime ordering to SGLang, vLLM, TensorRT-LLM
  • Documented O(n) complexity on pool URL-based lookups

🎯 Additional Features

  • Unified flag for cache token usage report in HTTP mode
  • OpenAI-compatible cached token usage
  • GetTokenizer proto and tokenizer bundle streaming
  • TRT-LLM parameter pass-through fixes
  • matched_stop support for vLLM and TensorRT-LLM
  • Respect workers_config in multi-worker gRPC setup
  • Realtime API protocol foundations (session, conversation, response, WebSocket handler)

🔗 Full Changelog: v1.1.0...v1.2.0

Upgrade now: pip install smg --upgrade

🐑 Shepherd your LLM infrastructure with confidence.

⚡ Built for speed. Engineered for scale. Production-proven.

What's Changed

  • ci(nightly): Add vLLM HTTP support to nightly benchmarks by @CatherineSue in #502
  • docs: standardize runtime ordering to SGLang, vLLM, TensorRT-LLM by @slin1237 in #514
  • fix(ci): install CUDA toolkit for SGLang JIT kernel compilation by @slin1237 in #513
  • fix parameters pass through for trtllm by @gongwei-130 in #509
  • chore(ci): remove pull request trigger from nightly benchmark workflow by @key4ng in #524
  • refactor: migrate from UUID v4 to v7 across the workspace by @slin1237 in #518
  • perf: optimize JSON streaming validation with zero-allocation IgnoredAny by @ppraneth in #516
  • fix(e2e): respect workers_config in vLLM/TRT-LLM gRPC multi-worker setup by @slin1237 in #525
  • feat(responses): add mcp approval items to protocols and routers by @zhaowenzi in #491
  • fix(ci): modify docker-storage dir and add scaler for a10 runners by @XinyueZhang369 in #402
  • feat(anthropic): add X-SMG-MCP header for MCP passthrough by @key4ng in #517
  • feat: add --served-model-name support for model aliasing by @ConnorLi96 in #521
  • feat(data-connector): add SchemaConfig for customizable table and column names by @slin1237 in #526
  • refactor(protocol): model ResponseTool as tagged enum to match Responses spec and tighten MCP validation by @zhaowenzi in #532
  • ci: add PR title conventional commits check by @CatherineSue in #540
  • feat: add unified fl...

v1.1.0

23 Feb 06:40
b6f9bb5


🚀 Shepherd Model Gateway v1.1.0 Released!

We're excited to announce Shepherd Model Gateway v1.1.0 – a major feature release bringing universal multimodal support, Messages API MCP integration, and critical production hardening across the entire stack!

🎨 Universal Multimodal Support 🔥

Industry-leading multimodal processing across all major inference engines:

  • SGLang gRPC - Full multimodal pipeline with vision processing
  • vLLM gRPC - Fetch + preprocess pipeline with multimodal support
  • TensorRT-LLM gRPC - Complete multimodal integration
  • Llama 4 Vision - First-class support with model spec and processor registration

Impact: Deploy vision-language models across SGLang, vLLM, and TensorRT-LLM with unified processing. Data URI detection, 4D pixel values, and i64 aspect ratios for production-grade image handling.

🔌 Messages API Gets MCP

Complete MCP tool integration for Anthropic Messages API:

  • Streaming and non-streaming MCP tool use
  • Unified tool allowlist enforcement across OpenAI and gRPC routers
  • Server binding architecture with session lifecycle management
  • Built-in server filtering from tool listings
  • Unique server_label requirements for tool collision prevention

E2E tested with comprehensive MCP tool use coverage.

Major New Features

🌐 vLLM HTTP Backend Support
Auto-detection and support for vLLM HTTP endpoints via DetectBackendStep – seamlessly switch between gRPC and HTTP workers.

🎯 smg serve Engine Args Pass-through
Pass arbitrary engine-specific arguments directly through smg serve to your inference engines. Maximum flexibility for custom configurations.

🧠 Tiktoken Hub Model Support
Unified chat template API with tiktoken hub integration. Improved OpenAI o-series model detection and error handling.

🔍 NanoV3 Reasoning Parser
Native support for Nemotron Nano V3 reasoning output parsing.

🎨 Startup Banner
Beautiful braille art shepherd motif on startup – because production systems deserve aesthetics.

Performance Optimizations

WASM Runtime Enhancements:

  • Optimized component cache lookup
  • Reduced per-request cloning overhead
  • SHA-256 cache keys for efficient middleware

🛡️ Critical Production Hardening

Responses API Fixes:

  • Fixed data loss and panic risks
  • Sanitized upstream error bodies
  • Improved structural integrity

Middleware Reliability:

  • Fixed extension loss in request pipeline
  • Eliminated auth timing leak
  • Corrected streaming body buffering

Tokenizer Robustness:

  • Cache correctness fixes
  • Streaming reliability improvements
  • Chat template error handling

Data Connector Hardening:

  • Eliminated deadlock, block_on, and triple pool bugs
  • Storage backend protection against data corruption
  • Race condition fixes

Concurrency Safety:

  • Fixed tokio mutex release before awaiting in LoadMonitor
  • Improved SSE event processing and buffer management

Multimodal Correctness:

  • Proper data URI detection
  • 4D pixel_values output
  • i64 aspect_ratios for large images

🏗️ Architectural Improvements

Worker Infrastructure:

  • Consolidated DPAwareWorker into BasicWorker
  • Moved DP fields to WorkerSpec
  • Unified worker metadata discovery
  • Cleaner registration workflow

Code Quality:

  • Enforced strict clippy linting workspace-wide
  • Added clippy::absolute_paths and single_component_path_imports lints
  • Improved error handling across all modules

🔧 Developer Experience

CI/DevOps:

  • DCO check with probot app
  • Mergify automation for PR management
  • Branch naming enforcement
  • Docker image release workflow
  • Auto-trigger benchmark workflows on code changes

Tooling:

  • Workspace version checker script
  • PyPI proto version validation
  • Remote dev workflow for proto testing

Python Support:

  • Lowered minimum Python version from 3.12 to 3.9

🐛 Bug Fixes

  • Fixed worker health config, bootstrap parsing, and model card cloning issues
  • Improved serve CLI arg filtering and config error handling
  • Better pre-commit hook configuration
  • Corrected labeler workflow for fork PRs

📚 Interactions API

Added comprehensive validations for the Interactions API protocol.

🔗 Full Changelog: v1.0.1...v1.1.0

Upgrade now: pip install smg --upgrade

🐑 Shepherd your LLM infrastructure with confidence.

⚡ Built for speed. Engineered for scale. Production-proven.

What's Changed

  • fix: render README images on PyPI/crates.io and bump version to 1.0.1 by @slin1237 in #420
  • chore(ci): Change nightly benchmark schedule to midnight PST by @key4ng in #422
  • ci: add DCO check, Mergify automation, and branch naming enforcement by @CatherineSue in #424
  • ci: temporarily disable auto-close for branch naming violations by @CatherineSue in #426
  • ci: add needs-rebase label management to Mergify by @CatherineSue in #427
  • fix(ci): use correct Mergify syntax for negated regex condition by @CatherineSue in #429
  • ci: improve label management with router-specific and feature labels by @CatherineSue in #428
  • ci: add Docker image release workflow by @slin1237 in #431
  • feat(message api): MCP tool use with streaming and non-streaming support by @key4ng in #352
  • refactor(core): consolidate DPAwareWorker into BasicWorker by @slin1237 in #434
  • fix: pre-existing issues in worker health config, bootstrap parsing, and model card cloning by @slin1237 in #415
  • refactor(core): move DP fields to WorkerSpec and remove default_model_type by @slin1237 in #436
  • chore: fix main log by @slin1237 in #437
  • test(e2e): add MCP tool use tests for Anthropic Messages API by @key4ng in #433
  • feat(core): add DetectBackendStep for vLLM HTTP support by @slin1237 in #438
  • feat(tokenizer): add tiktoken hub model support and unify chat template API by @slin1237 in #439
  • perf(wasm): optimize WASM component cache lookup and reduce per-request cloning by @ppraneth in #440
  • refactor(core): unify worker metadata discovery and clean up registration by @slin1237 in #447
  • feat(version): add startup banner with braille art shepherd motif by @slin1237 in #448
  • fix(python): lower minimum Python version from 3.12 to 3.9 by @slin1237 in #449
  • ci(mergify): enable auto-close for non-conforming branch names by @CatherineSue in #454
  • ci(mergify): allow multi-segment branch names for dependabot by @CatherineSue in #456
  • ci(dco): switch DCO check from GitHub Actions to probot DCO app by @CatherineSue in #462
  • fix(openai): fix data loss, panic risk, and structural issues in Responses API by @slin1237 in #468
  • fix(mcp): filter builtin servers from mcp_list_tools output by @key4ng in #450
  • feat(interactions): Add validations for interactions api by @XinyueZhang369 in #399
  • feat(mcp): enforce allowed_tools filtering across openai and grpc routers by @zhaowenzi in #467
  • fix(openai): sanitize upstream error bodies in Responses API by @slin1237 in #473
  • fix(middleware): fix extension loss, auth timing leak, and streaming body buffering by @slin1237 in #472
  • fix(concurrency): release tokio mutex before awaiting task in LoadMonitor::stop() by @slin1237 in #475
  • fix(tokenizer): correctness and robustness fixes for cache and streaming by @slin1237 in #474
  • fix(ci): use pull_request_target for labeler to support fork PRs by @CatherineSue in #477
  • fix(data-connector): fix deadlock, block_on, triple pool, and DDL type bugs by @slin1237 in #471
  • feat(reasoning-parser): add NanoV3 reasoning parser by @slin1237 in #480
  • refactor(anthropic): simplify worker lifecycle in Anthropic router by @key4ng in #476
  • feat(realtime api): realtime api session and transcription_session protocols by @pallasathena92 in #364
  • fix(protocols): require unique server_label for MCP tools by @zhaowenzi in #479
  • feat: smg serve pass through engine args to engine by @gongwei-130 in #460
  • fix(serve): harden CLI arg filtering and config error handling by @slin1237 in #483
  • feat(scripts): replace release notes generator with workspace version checker by @slin1237 in #484
  • fix(ci): match probot DCO app check name in Mergify rule by @CatherineSue in https://github.com/lightseekorg/smg/pul...

v1.0.1

13 Feb 16:35


🎉 Introducing Shepherd Model Gateway v1.0.1!

We're thrilled to announce Shepherd Model Gateway v1.0.1 – formerly SGLang Model Gateway. This major release marks a new chapter with a complete architectural overhaul, new enterprise features, and production-grade improvements!

🐑 Welcome to Shepherd

SGLang Model Gateway is now Shepherd Model Gateway (SMG).

Truly Engine-Agnostic Architecture: Shepherd is your universal gateway supporting all major inference engines – SGLang, vLLM, and TensorRT-LLM – plus complete 3rd party model provider integration including OpenAI, Anthropic, and Gemini. One gateway to route them all.

Universal API Support: Native implementation of Chat Completions, Responses API, Messages API, Interactions API, and Realtime API. Whether you're running open-source models on your infrastructure or routing to cloud providers, Shepherd handles it seamlessly.

Same powerful technology, new identity focused on guiding and managing your entire LLM infrastructure at scale – regardless of where your models run.

Major New Features

⚡ TensorRT-LLM Backend Support - Native gRPC integration for NVIDIA TensorRT-LLM

🔄 vLLM Prefill-Decode-Disaggregation Support
Mooncake and NIXL-based KV transfer for disaggregated inference:

  • Auto-discovery for seamless integration
  • Massive scalability improvements for large deployments
  • Efficient KV cache sharing across workers

🎯 smg serve - Unified Worker Management
New serve subcommand with complete worker lifecycle orchestration:

  • Multi-worker data parallelism with GPU assignment
  • ServeOrchestrator for automated worker management
  • Two-pass argument parsing for flexible configuration
  • One command to rule them all

🤖 Anthropic Messages API Support
Full implementation of Anthropic's Messages API with streaming and non-streaming support. Deploy Claude models alongside your existing inference fleet.

🔌 Industry-First: Universal Built-in Tools via MCP 🔥

Turn any MCP server into built-in tools for all models – an industry-first capability that brings OpenAI-style built-in tools (FileSearch, WebSearch, CodeInterpreter) to every LLM, not just proprietary models.

Complete MCP Orchestration Stack:

  • McpOrchestrator with YAML policy configuration
  • Built-in tool routing infrastructure with qualified names – seamlessly integrate any MCP server as a native capability
  • ResponseFormat transformation pipeline - expose MCP servers as built-in tools (FileSearch, WebSearch, CodeInterpreter, and custom tools)
  • Auth-aware connection pooling for scalable multi-tenant deployments
  • Batch tool execution API for efficient processing
  • Approval system for controlled tool execution
  • Automatic reconnection manager for reliability
  • Graceful shutdown support
  • HTTP header forwarding to MCP servers

Impact: Deploy Llama, Qwen, DeepSeek, or any open-source model with the same built-in tool capabilities as GPT-4. Your infrastructure, your models, OpenAI-grade tooling.
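The qualified-name idea behind collision prevention can be pictured with a small sketch. This is hypothetical: SMG's actual qualified-name separator and registry types are not specified here, only the reason unique server_label values are enforced.

```python
def qualify(server_label, tool_name):
    # Namespace a tool by its MCP server label so two servers can each
    # expose a tool with the same bare name (e.g. "search") without clashing.
    return f"{server_label}.{tool_name}"

def route_tool_call(qualified, servers):
    # Resolve a qualified name back to (server, tool), rejecting unknown
    # servers or tools. If two servers shared a label, this lookup would
    # be ambiguous -- hence the uniqueness requirement.
    server, _, tool = qualified.partition(".")
    if server not in servers or tool not in servers[server]:
        raise KeyError(f"unknown tool {qualified!r}")
    return server, tool
```

With unique labels, the model sees one flat tool list while the gateway can still dispatch each call to the right MCP server.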

📡 Realtime API Foundation
Event types and protocol support for real-time streaming applications.

🏗️ Architectural Revolution

Workspace Modularization
Complete extraction into standalone, publishable crates:

  • smg-auth - JWT/OIDC authentication
  • smg-mesh - High availability mesh networking
  • smg-mcp - Model Context Protocol orchestration
  • smg-wasm - WebAssembly middleware
  • smg-grpc-client - gRPC client infrastructure
  • smg-grpc-proto - Protocol definitions (published to PyPI!)
  • smg-kv-index - Cache-aware routing engine
  • llm-tokenizer - Tokenization logic
  • llm-multimodal - Multimodal processing
  • openai-protocol - OpenAI API specifications
  • wfaas - Workflow-as-a-Service engine
  • And more...

Result: Faster builds, independent evolution, better maintainability, and easy integration into your own projects.

Performance Optimizations

Zero-Copy & Algorithm Improvements:

  • Zero-copy multimodal payload handling
  • Aho-Corasick algorithm for stop sequence and special token search
  • WASM Linker reuse across executions
  • Optimized consistent hashing with zero allocations
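The Aho-Corasick change replaces per-pattern scanning with a single automaton pass: all stop sequences and special tokens are found in one sweep over the text, instead of one sweep per pattern. A compact Python illustration of the algorithm (SMG's implementation is in Rust; this sketch only mirrors the technique):

```python
from collections import deque

def build_automaton(patterns):
    # Build an Aho-Corasick automaton: a trie plus failure links, so one
    # pass over the text reports every occurrence of every pattern.
    goto, fail, out = [{}], [0], [set()]
    for pat in patterns:
        s = 0
        for ch in pat:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(pat)
    # BFS to compute failure links; a node's failure target is always
    # shallower, so it is finalized before the node itself is processed.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]
    return goto, fail, out

def find_stops(text, automaton):
    # Return (end_index, pattern) for every stop-sequence hit in one scan.
    goto, fail, out = automaton
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for pat in out[s]:
            hits.append((i, pat))
    return hits
```

The payoff is that matching cost is O(text length + number of hits) regardless of how many stop strings are configured, versus O(patterns × text) for the naive scan.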

🛠️ Production Enhancements

High Availability:

  • Mesh service refactoring and cleanup
  • State synchronization improvements
  • Oracle external auth support for enterprise backends

Observability:

  • Nightly benchmark workflow for comprehensive model performance tracking
  • gRPC vs HTTP comparison benchmarks
  • GetLoads RPC for load metrics

Developer Experience:

  • Comprehensive documentation restructure (concept-centric)
  • Issue templates and PR templates
  • Pre-commit hooks with Ruff + mypy Python linting
  • Automated crate publishing workflows
  • Dependabot integration

Testing Infrastructure:

  • Kubernetes-based CI runners
  • Service containers for Oracle and Brave
  • vLLM and TensorRT-LLM gRPC E2E tests
  • Thread-safe test fixtures with proper resource management

🐛 Critical Bug Fixes

  • Fixed synthetic "empty" tenant pollution in radix tree
  • Prevented resource leaks causing GPU starvation
  • Fixed STDIO MCP server triggering
  • Aligned multi-server MCP output handling across routers
  • Fixed completion token counting for vLLM harmony streaming
  • Corrected proto definitions (logprobs token_ids uint32)

📚 Documentation

  • Complete restructure from configuration-centric to concept-centric
  • Architecture diagrams and gradient mesh homepage
  • Comprehensive README with features overview
  • Admin API reference
  • Getting started guides

🔧 Tool Parser Support

New model support:

  • Cohere Command models (tool parser + reasoning parser)
  • Qwen Coder (XML format for Qwen3 Coder and MicroThinker)

🔗 Repository: https://github.com/lightseekorg/smg

Install now: pip install smg --upgrade

🐑 Shepherd your LLM infrastructure with confidence.

⚡ Built for speed. Engineered for scale. Production-proven.

What's Changed

  • fix: render README images on PyPI/crates.io and bump version to 1.0.1 by Simo Lin
  • fix(ci): fix H200 nightly benchmark model path, worker logs and CUDA errors (#411) by @key4ng in #411
  • fix(ci): use single Python interpreter for Windows/macOS PyPI builds (#418) by @slin1237 in #418
  • chore(mesh): bump smg-mesh version to 1.1.0 (#419) by @slin1237 in #419
  • chore: unify workspace dependency management and bump crate versions (#344) by @slin1237 in #344
  • refactor: remove remaining pub use re-export aliases from lib.rs (#416) by @slin1237 in #416
  • refactor: remove pub use re-export aliases from lib.rs (#413) by @slin1237 in #413
  • refactor(protocols,gateway): redesign worker type hierarchy and consolidate protocol layer (#412) by @slin1237 in #412
  • fix(grpc-proto): bump grpcio minimum to >=1.78.0 (#409) by @CatherineSue in #409
  • chore(ci): increase chat-completions-trtllm timeout to 60 minutes (#408) by @CatherineSue in #408
  • fix(trtllm): tokenize and inject user stop sequences for TRT-LLM requests (#346) by @ppraneth in #346
  • fix(e2e): migrate genai-bench to Docker and fix router pipe hang (#403) by @key4ng in #403
  • chore(deps): update kube requirement from 1.1.0 to 3.0.1 (#397) by @app/dependabot in #397
  • chore(deps): update opentelemetry-proto requirement from 0.27 to 0.31 (#398) by @app/dependabot in #398
  • chore(deps): update ndarray requirement from 0.16 to 0.17 (#394) by @app/dependabot in #394
  • feat: support oracle external auth for oracle backend (#404) by @zhaowenzi in #404
  • fix(grpc-proto): reorder authors in pyproject.toml (#400) by @CatherineSue in #400
  • chore[ci]: upgrade oracle image (#393) by @key4ng in #393
  • chore(e2e): overhaul nightly benchmark summary and trim model list (#392) by @slin1237 in #392
  • feat: Implement ReconnectionManager for automatic MCP server recovery (#265) by @ppraneth in #265
  • perf(multimodal): optimize payload handling with zero-copy (#391) by @ppraneth in #391
  • refactor(mcp): standardize output injection ordering across routers (#388) by @slin1237 in #388
  • ci(grpc): add proto package publishing and codegen checks (#386) by @CatherineSue in #386
  • feat(grpc): add smg-grpc-proto Python package for proto definitions (#385) by @CatherineSue in #385
  • chore(e2e): include model size in gpt-oss nightly benchmark slug (#384) by @CatherineSue in #384
  • refactor(mcp): remove requested_servers and introduce ResponsesCallContext (#382) by @CatherineSue in #382
  • refactor(mcp): use imports instead of fully-qualified paths in McpToolSession (#383) by @CatherineSue in #383
  • e2e: rewrite nightly summary with gRPC vs HTTP comparison (#381) by @slin1237 in #381
  • feat(realtime api): realtime api event types (#349) b...