
Releases: lightseekorg/smg

v1.4.1

09 Apr 18:18
ea9005d

🚀 Shepherd Model Gateway v1.4.1 Released

Patch release with a mesh HA stability fix, DP rank scheduling, reasoning parser fixes, and engine version bumps.

Mesh HA Stability Fix

Fixed premature worker removal during rolling deploys:

  • Workers synced via mesh with health: false were being removed by the health checker before they had a chance to pass local health checks
  • Fix: health checker now only removes workers whose health check actually failed this tick, not workers that are merely marked unhealthy from mesh state
  • Eliminates the 500/503 error spike during gateway redeploys with --remove-unhealthy-workers enabled
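
The fixed removal rule can be sketched in Python (the gateway itself is not Python; field and function names here are illustrative, not the actual SMG code):

```python
from dataclasses import dataclass

@dataclass
class Worker:
    url: str
    healthy: bool           # health flag, possibly synced from mesh peers
    failed_this_tick: bool  # did a *local* health probe fail on this tick?

def workers_to_remove(workers):
    """Before the fix, any worker with healthy=False was removed -- including
    workers just synced from the mesh that had not yet been probed locally.
    After the fix, only workers whose local check failed this tick go."""
    return [w for w in workers if w.failed_this_tick]

# A worker synced via mesh with health: false but not yet probed survives:
synced = Worker("http://w1:8000", healthy=False, failed_this_tick=False)
dead = Worker("http://w2:8000", healthy=False, failed_this_tick=True)
assert workers_to_remove([synced, dead]) == [dead]
```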

DP Rank Scheduling

Data-parallel rank scheduling for multi-GPU inference:

  • Supports scheduling with the minimum number of required ranks
  • New scheduling policy for DP-aware worker selection
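
The minimum-rank requirement can be sketched as follows (a hypothetical helper, not the actual scheduler): a request is only dispatchable once at least the required number of DP ranks is available.

```python
def schedulable_ranks(available_ranks, required_ranks):
    """DP-aware selection sketch: pick the lowest `required_ranks` ranks
    if enough are available, else signal 'not schedulable yet'."""
    if len(available_ranks) < required_ranks:
        return None  # wait until the minimum rank count is up
    return sorted(available_ranks)[:required_ranks]

assert schedulable_ranks({0, 2, 3}, 2) == [0, 2]
assert schedulable_ranks({1}, 2) is None
```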

MCP Tool Improvements

  • Argument overrides (#1048) -- Per-request overrides of MCP tool call arguments, enabling request-level customization of tool parameters
  • Passthrough output flattening (#1041) -- MCP passthrough mcp_call output now flattened to plain strings for consistency
  • ID normalization (#989) -- MCP call item IDs normalized to mcp_ prefix for OpenAI alignment
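
The ID normalization is simple enough to show as a sketch (illustrative Python, not the actual SMG code):

```python
def normalize_mcp_id(item_id: str) -> str:
    """Normalize an MCP call item ID to the mcp_ prefix used by
    OpenAI-style responses; already-prefixed IDs pass through."""
    return item_id if item_id.startswith("mcp_") else f"mcp_{item_id}"

assert normalize_mcp_id("call_abc123") == "mcp_call_abc123"
assert normalize_mcp_id("mcp_call_abc123") == "mcp_call_abc123"
```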

Reasoning Parser Fixes

  • Thinking toggle detection (#1031) -- Detect thinking toggle from chat template and override parser state automatically
  • NanoV3/Nemotron fix (#1067) -- Changed parser to always_in_reasoning=false to fix incorrect reasoning block detection
  • Harmony routing (#1025) -- Route reasoning_content to analysis channel per Harmony spec

Bug Fixes

  • Routing: Eliminate unconditional token allocation on the hot path (#1024)
  • Responses API: Stop defaulting top_p when it is omitted from requests (#1043), unify upstream header handling (#1029)
  • gRPC: Update vLLM imports for inputs reorganization (#1033)
  • Frontend: Fix smg serve rejecting vLLM OpenAI args (#832)
  • Discovery: Periodic reconciliation with identity-based pod equality (#1039)

Engine Version Bumps

  • vLLM: v0.18.0 -> v0.19.0
  • SGLang: v0.5.9/v0.5.10rc0 -> v0.5.10
  • TensorRT-LLM: 1.3.0rc8 -> 1.3.0rc10

Infrastructure

  • Claude review workflow hardened with incremental reviews and auto-approve (#1036, #1040, #1042)
  • E2E worker failure diagnostics and cleanup improvements (#1015)
  • gRPC package releases: smg-grpc-proto 0.4.6, smg-grpc-servicer 0.5.2

Upgrade now: pip install smg --upgrade

🐑 Shepherd your LLM infrastructure with confidence.

Docker Images

Pre-built engine images on GitHub Container Registry:

SGLang:

docker pull ghcr.io/lightseekorg/smg:1.4.1-sglang-v0.5.10

vLLM:

docker pull ghcr.io/lightseekorg/smg:1.4.1-vllm-v0.19.0

TensorRT-LLM:

docker pull ghcr.io/lightseekorg/smg:1.4.1-trtllm-1.3.0rc10

All images for v1.4.1:

| Engine | Tag | Pull Command |
| --- | --- | --- |
| sglang | 1.4.1-sglang-v0.5.10 | docker pull ghcr.io/lightseekorg/smg:1.4.1-sglang-v0.5.10 |
| trtllm | 1.4.1-trtllm-1.3.0rc10 | docker pull ghcr.io/lightseekorg/smg:1.4.1-trtllm-1.3.0rc10 |
| vllm | 1.4.1-vllm-v0.19.0 | docker pull ghcr.io/lightseekorg/smg:1.4.1-vllm-v0.19.0 |

What's Changed

  • perf: Eliminate unconditional token allocation on the routing hot path by @ppraneth in #1024
  • refactor(e2e): rename worker_args to sglang_args by @CatherineSue in #1019
  • fix(ci): improve e2e worker failure diagnostics and cleanup by @key4ng in #1015
  • feat(metrics-ws): [2/4] add protocol types and watch registry by @key4ng in #982
  • fix(harmony): route reasoning_content to analysis channel per Harmony spec by @CatherineSue in #1025
  • fix(openai): unify responses upstream header handling by @zhaowenzi in #1029
  • fix(grpc): update vLLM imports for inputs reorganization by @CatherineSue in #1033
  • fix(reasoning): detect thinking toggle from chat template and override parser state by @CatherineSue in #1031
  • fix(ci): harden Claude review workflow with incremental reviews and resilience by @key4ng in #1036
  • fix(ci): fix comment fetch, add review summary, and auto-approve by @key4ng in #1040
  • fix(ci): handle array-format execution output in review summary by @key4ng in #1042
  • fix(mcp): flatten passthrough mcp_call output to plain strings by @zhaowenzi in #1041
  • feat(metrics-ws): [3/4] add event-driven and polled collectors by @key4ng in #1027
  • fix(responses): stop defaulting top_p for omitted requests by @zhaowenzi in #1043
  • fix(frontend): Fix smg serve reject vLLM OpenAI args by @YouNeedCryDear in #832
  • feat(realtime-api): WebRTC relay bridge by @pallasathena92 in #733
  • feat(overrides): add support for argument overrides with mcp tools by @Tobel158 in #1048
  • fix(mcp): normalize mcp_call item IDs to use mcp_ prefix for OpenAI alignment by @zhaowenzi in #989
  • feat: supports dp rank scheduling and scheduling with the minimun number of… by @jiashaokun-1 in #1007
  • fix(discovery): periodic reconciliation with identity-based pod equality by @Kangyan-Zhou in #1039
  • chore(deps): update wasm-encoder requirement from 0.245 to 0.246 by @dependabot[bot] in #1054
  • chore(deps): update lz4_flex requirement from 0.11 to 0.13 by @dependabot[bot] in #1053
  • chore(deps): update str0m requirement from 0.16 to 0.18 by @dependabot[bot] in #1052
  • chore(deps): bump vllm base image from v0.18.0 to v0.19.0 by @slin1237 in #1066
  • fix(reasoning): change NanoV3/Nemotron parser to always_in_reasoning=false by @CatherineSue in #1067
  • chore(deps): bump sglang from 0.5.9/0.5.10rc0 to 0.5.10 by @slin1237 in #1064
  • feat(metrics-ws): [4/4] add /ws/metrics endpoint with subscription support by @key4ng in #1050
  • fix(mesh): prevent premature removal of unhealthy workers by health checker by @slin1237 in #1076
  • chore(deps): bump TensorRT-LLM from 1.3.0rc8 to 1.3.0rc10 by @slin1237 in #1077
  • chore(grpc): release smg-grpc-proto 0.4.6 and smg-grpc-servicer 0.5.2 by @slin1237 in #1078
  • chore: bump versions for v1.4.1 release by @slin1237 in #1080

Full Changelog: v1.4.0...v1.4.1

v1.4.0

02 Apr 15:53
52564df

🚀 Shepherd Model Gateway v1.4.0 Released

The biggest SMG release yet -- Kubernetes-native deployment via Helm, a terminal dashboard, 200x mesh memory reduction, 7-11x faster multimodal preprocessing, native Completion API over gRPC, and per-model retry configuration.

Kubernetes-Native Deployment with Helm

Production-ready Helm chart for deploying SMG on Kubernetes:

  • One-command deployment -- helm install smg oci://ghcr.io/lightseekorg/smg-helm deploys the full gateway stack
  • Router + Worker deployment -- A single chart deploys both the gateway router and inference engine workers (vLLM, SGLang, TRT-LLM) with GPU scheduling
  • Mesh HA with service discovery -- Deploy multiple gateway replicas as a StatefulSet with automatic gossip-based peer discovery via --router-selector
  • Full K8s integration -- RBAC, Ingress, HPA, PDB, ServiceMonitor, Grafana dashboard ConfigMap, JSON Schema validation at helm lint time
  • 5 example configurations -- Router-only, with-postgres, with-service-discovery, with-ingress, with-monitoring

Impact: Zero-to-production SMG deployment on Kubernetes with a single helm install. Declarative configuration, automatic scaling, and built-in observability.

Terminal Dashboard (smg-tui)

Full-featured terminal UI for real-time monitoring and interactive chat:

  • 7 tabs -- Pulse (real-time dashboard with sparklines), Workers (per-worker stats + circuit breaker state), Chat (streaming markdown playground), Logs (per-component with ANSI stripping), Benchmark, Traffic, Mesh
  • Worker management -- Quick-add presets for OpenAI/Anthropic/xAI/Gemini, local worker launch with automatic GPU selection via nvidia-smi, GPU claim tracking to prevent double-allocation
  • Gateway auto-start -- smg-tui --auto-start launches the gateway, polls health, and cleans up on exit
  • Chat playground -- Streaming SSE with live cursor, markdown rendering, multi-turn support, Tab to cycle models

Mesh Performance & Reliability Revolution

Eliminated catastrophic memory growth and achieved >200x improvement in mesh resource usage:

  • Delta encoding (#899): Only send new tree operations since last sync -- 40x smaller sync payloads (18.3 MB → 417 KB), gzip compression for additional 5-8x wire reduction
  • Lazy serialization (#919): Moved full TreeState serialization off the hot path -- memory: OOM crash → 31 MB stable, CPU: 280-345% → 56-58%, latency: 12s degrading → stable
  • CRDT bypass (#961): Moved tree state out of CRDT operation log -- eliminated ~1 GB/1.5hr memory leak under sustained load
  • Two-layer sync fix (#1011): Eliminated remaining memory leaks in the tree sync protocol
  • Snapshot serialization (#974): Structure-preserving radix tree snapshots for mesh sync -- shared prefixes stored once, replacing 40 MB flat operation replay with compact tree format
  • Timeout enforcement (#952): Consistent timeout contract across all RPC and stream paths
  • Health mirroring (#912, #892): Mesh-synced workers now register locally for health checking with proper status mirroring
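
The delta-encoding idea behind #899 can be sketched in a few lines (illustrative Python; the real implementation is part of the mesh sync protocol and also gzip-compresses the wire payload):

```python
def delta_since(op_log, last_synced_index):
    """Instead of shipping the whole tree state on every sync, send only
    the operations appended since the peer's last acknowledged index."""
    return op_log[last_synced_index:]

op_log = ["insert:a", "insert:ab", "evict:a", "insert:abc"]
# The peer already has the first two ops, so only two are sent:
assert delta_since(op_log, 2) == ["evict:a", "insert:abc"]
```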

Benchmark Results (20 min, 500 rps, 20K-char prompts):

  • 565,920 requests, 0 errors
  • Memory plateaus at ~2.3 GB (no linear growth)

7-11x Faster Multimodal Image Preprocessing

SMG now matches or beats HuggingFace Python preprocessing performance:

  • SIMD resize -- Replaced image crate (pure Rust) with fast_image_resize v6 (AVX2/SSE4.1) for 10-25x faster resize
  • Fused operations -- Combined to_tensor_and_normalize(), zero-copy patchify_into(), fused pad + normalize + tile split for Llama4
  • Additional optimizations -- Thread-local Resizer reuse, eliminated DynamicImage clones, optimized serialization and tensor conversion

Benchmark Results (Qwen3-VL):

| Image Size | Before | After | vs HuggingFace Python |
| --- | --- | --- | --- |
| 224×224 | 4.77 ms | 0.44 ms (10.8x) | 2.5x faster |
| 640×480 | 15.5 ms | 1.59 ms (9.7x) | 1.8x faster |
| 1024×768 | 40.6 ms | 4.31 ms (9.4x) | 1.6x faster |
| 1920×1080 | 286 ms | 39.6 ms (7.2x) | ~parity |

Native Completion API over gRPC

Full /v1/completions support through the gRPC pipeline with streaming and PD disaggregation:

  • 6-PR pipeline -- CompletionRequest type, preparation stage, request building with backend sampling params, response processing, pipeline wiring, streaming support
  • Streaming -- OpenAI-compatible SSE events with per-index stop decoder tracking, echo and suffix handling
  • PD mode -- Dual streaming for prefill-decode disaggregation
  • Type safety -- Native RequestType::Completion throughout the pipeline, exhaustive match arms in shared stages

Per-Model Retry Configuration

Different models can now have different retry policies:

  • WorkerRegistry integration -- Workers declare per-model retry config via WorkerSpec.resilience, stored in WorkerRegistry with last-write-wins semantics
  • All routers updated -- HTTP, gRPC, OpenAI, Gemini, gRPC PD, and HTTP PD routers all look up per-model config at request time, falling back to the global default
  • Cleanup on removal -- Retry config is automatically cleaned up when the last worker for a model is removed

Impact: GPU-constrained models can have longer timeouts and more retries, while fast models use aggressive retry budgets. No more one-size-fits-all.
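
The lookup-with-fallback and last-write-wins semantics can be sketched like this (illustrative Python with hypothetical data shapes, not the actual WorkerRegistry code):

```python
GLOBAL_RETRY = {"max_retries": 2, "timeout_s": 30}

# Per-model retry config declared by workers at registration time.
per_model_retry = {}

def register_worker(model, resilience):
    """Workers declare retry config via their spec; the most recent
    registration for a model wins (last-write-wins)."""
    if resilience is not None:
        per_model_retry[model] = resilience

def retry_config(model):
    """Routers look up per-model config at request time, falling back
    to the global default when a model declared none."""
    return per_model_retry.get(model, GLOBAL_RETRY)

register_worker("slow-70b", {"max_retries": 5, "timeout_s": 120})
register_worker("fast-7b", None)
assert retry_config("slow-70b")["timeout_s"] == 120
assert retry_config("fast-7b") == GLOBAL_RETRY
```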

Three-Phase Graceful Shutdown

Replaces the fixed-timeout shutdown with a Gate → Drain → Teardown approach:

  • Phase 1 (Gate): Stop accepting new requests
  • Phase 2 (Drain): Wait for in-flight requests to complete (up to configured timeout)
  • Phase 3 (Teardown): MCP orchestrator cleanup + exit

Impact: Requests finishing in 2s no longer wait 28s for a fixed grace period. Requests needing 35s no longer get killed at 30s. The system drains to zero when possible.
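
The drain logic can be sketched as a simulation (illustrative Python, not the actual Rust implementation; `in_flight_counts` stands in for the in-flight request count observed on each poll tick):

```python
def shutdown(in_flight_counts, drain_timeout_ticks):
    """Gate -> Drain -> Teardown sketch. Phase 1 (Gate) is implicit:
    no new requests are admitted once this runs."""
    for tick, count in enumerate(in_flight_counts):
        if count == 0:
            return ("teardown", tick)      # Phase 2: drained to zero early
        if tick + 1 >= drain_timeout_ticks:
            break                          # drain timeout reached
    return ("teardown", None)              # Phase 3 runs either way

# Drains at tick 2 instead of waiting out a fixed grace period:
assert shutdown([3, 1, 0], drain_timeout_ticks=30) == ("teardown", 2)
# A still-busy system is torn down once the drain timeout expires:
assert shutdown([5, 5, 5], drain_timeout_ticks=3) == ("teardown", None)
```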

Worker Registry & REST API Improvements

  • Model field required (#713) -- Clients omitting model now get 400 Bad Request instead of silent "unknown" injection. Matches OpenAI API spec. Breaking change.
  • REST semantics (#875) -- POST /workers (create-only, 409 on conflict), PUT /workers/{id} (full replace), PATCH /workers/{id} (partial update). Breaking change: PUT now requires WorkerSpec instead of WorkerUpdateRequest.
  • Split register paths (#836) -- register() (create-only), replace() (overwrite-then-diff, no transient gap), register_or_replace() (idempotent upsert)
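
The new POST/PUT/PATCH semantics can be sketched with status codes (illustrative Python handlers, not the actual REST layer):

```python
workers = {}  # worker id -> spec

def post_worker(wid, spec):
    """POST /workers: create-only -- 409 Conflict if the ID exists."""
    if wid in workers:
        return 409
    workers[wid] = spec
    return 201

def put_worker(wid, spec):
    """PUT /workers/{id}: full replace (now takes a complete WorkerSpec)."""
    workers[wid] = spec
    return 200

def patch_worker(wid, partial):
    """PATCH /workers/{id}: partial update of an existing worker."""
    if wid not in workers:
        return 404
    workers[wid] = {**workers[wid], **partial}
    return 200

assert post_worker("w1", {"model": "llama"}) == 201
assert post_worker("w1", {"model": "qwen"}) == 409   # create-only
assert patch_worker("w1", {"priority": 5}) == 200
assert workers["w1"] == {"model": "llama", "priority": 5}
```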

vLLM gRPC Embedding Support

End-to-end embedding pipeline for vLLM via gRPC:

  • Rust gateway + Python servicer (calls engine.encode() with PoolingParams)
  • Flattened SGLang EmbedResponse proto (removed oneof, uses tonic::Status for errors)
  • Removed SGLang-specific log_metrics and cached_tokens from embed/classify protos

DeepSeek V3.1 Tool Call Parser

Native parser for DeepSeek V3.1's tool calling format:

  • Handles V3.1's simplified format (no function type prefix, no markdown code blocks)
  • Complete + streaming (parse_incremental) support
  • Auto-registered for deepseek-v3.1* and deepseek-ai/DeepSeek-V3.1* model patterns
  • E2E validated against live DeepSeek V3.1 (FP8) on 8×H200
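
The auto-registration patterns above behave like glob matches; a minimal sketch (the pattern strings come from the release notes, the matching logic is illustrative):

```python
from fnmatch import fnmatchcase

DEEPSEEK_V31_PATTERNS = ["deepseek-v3.1*", "deepseek-ai/DeepSeek-V3.1*"]

def uses_deepseek_v31_parser(model_id: str) -> bool:
    """True when a model ID matches one of the registered patterns."""
    return any(fnmatchcase(model_id, p) for p in DEEPSEEK_V31_PATTERNS)

assert uses_deepseek_v31_parser("deepseek-ai/DeepSeek-V3.1")
assert uses_deepseek_v31_parser("deepseek-v3.1-terminus")
assert not uses_deepseek_v31_parser("deepseek-ai/DeepSeek-V3")
```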

Additional Features

  • Configurable storage hook context (#807) -- Map HTTP headers to storage hook request context via storage_context_headers
  • Conversation memories schema (#976) -- First-class conversation_memories table in data-connector with Oracle Flyway DDL and insert seam
  • gRPC health checking (#885) -- Standard grpc.health.v1 health service for vLLM workers
  • Model metadata in GetModelInfo (#871) -- vLLM GetModelInfo RPC now returns model metadata fields
  • Metrics server refactored to axum (#966) -- Foundation for /ws/metrics WebSocket endpoint
  • max_total_num_tokens in GetServerInfo (#817) -- Aligns gRPC response with HTTP server

Performance Improvements

  • Tokenizer: Optimized stop decoder and incremental sequence decoding (#990)
  • Routing: Optimized extract_text_for_routing string handling (#967)
  • Mesh: Eliminated per-request CRDT serialization in sync_tree_operation (#948)
  • Multimodal: Thread-local Resizer reuse (#923), eliminated DynamicImage clones (#928), optimized serialization and tensor conversion (#1012)

Bug Fixes

  • Multimodal: Fixed Phi-3-vision for string-format chat templates (#942), LLaVA-Next anyres multi-crop for vLLM gRPC (#941), hardened registry matching and token geometry (#945), propagated placeholder resolution errors (#943), fixed images smaller than patch_size × merge_size (#908), use preprocessor token counts in LlavaSpec (#958), fall back to config.model_type for aliased model IDs (#898)
  • Chat Templates: Inject special tokens (bos_token, eos_token) into chat template context (#914), correct content format detection for Qwen3-style templates (#981), inject special tokens inside tokenizer impls (#918)
  • Protocol: Accept null for boolean fields logprobs and stream (#1020), validate reasoning parser name at CLI and startup (#901)
  • gRPC: Fix assistant tool_calls message serialization for chat templates (#1023), include stop tokens in TRT-LLM output for Harmony parsing (#879), handle vllm log forwarding on servicer side (#975)
  • Mesh: Stop advertising 0.0.0.0 to peers (#883), set tonic message size limits to match application limit (#893), prevent duplicate store events from inflating tree_sizes (#946)
  • Gateway: Update metric when removing unhealthy workers (#884), filter empty-string backend defaults in CLI arg fallback (#934)
  • Responses API: Align store=false state persistence behavior (#916)
  • Serve: Respect user-set CUDA_VISIBLE_DEVICES in gp...

v1.3.3

21 Mar 16:56
9f34d35

🚀 Shepherd Model Gateway v1.3.3 Released

Major performance release with 7x faster mesh synchronization and critical bug fixes.

⚡ Mesh Performance Revolution

Switched mesh serialization from JSON to bincode with dramatic performance improvements:

Benchmark Results (production workload - 1024 operations, 4000 tokens):

  • Serialization: 7.1x faster (35.5ms → 5.0ms)
  • Deserialization: 14.8x faster (63.4ms → 4.3ms)
  • Wire size: 4.3x smaller (67.9MB → 15.7MB)
  • Multi-model aggregate (10 models): 4.3x smaller (679MB → 157MB)

Additional mesh improvements:

  • Operation log auto-compaction and tombstone GC
  • Skip full-store scans when nothing has changed
  • Prevent stale snapshot chunks from mixing across retries
  • Break infinite retry loop for oversized incremental updates

Impact: Massive reduction in network bandwidth and CPU usage for multi-node deployments. Mesh state synchronization is now 7-15x faster with 4.3x less bandwidth consumption.

🎯 Structured Output Support

response_format support in Chat Completions API for Harmony models:

  • JSON schema constrained output
  • Structured generation for tool calling and data extraction
  • Fixed structural tag triggers for json_schema mode
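
A request using the feature might look like the following OpenAI-compatible body (sketch only; the model name and schema are illustrative, and the `response_format` shape follows the OpenAI Chat Completions convention):

```python
import json

request = {
    "model": "gpt-oss-120b",  # hypothetical Harmony model name
    "messages": [{"role": "user", "content": "Extract the city."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "extraction",
            "schema": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    },
}
body = json.dumps(request)  # POST this to /v1/chat/completions
assert json.loads(body)["response_format"]["type"] == "json_schema"
```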

🔧 PD Disaggregation Improvements

Enhanced reliability for prefill-decode mode:

  • Abort both PD requests when one side hits transport error (prevents hanging requests)
  • Handle mismatched metric labels in PD disaggregation mode
  • Fixed classify race condition with URL-based detection

🐛 Bug Fixes

  • Protocol: Validate /v1/messages tool_choice contract
  • Harmony: Include developer message when instructions are present
  • Gateway: Disable auto-detection if "runtime": "sglang" explicitly set
  • Client: Auto-close streaming responses on iteration exhaustion
  • Docker: Install gRPC proto and servicer for vLLM images

📚 Documentation Overhaul

Comprehensive audit and fixes across all documentation:

  • Quickstart and getting-started guides
  • Worker configuration and gRPC pipeline
  • Tokenizer, MCP, and WASM plugin extensibility
  • PD disaggregation and cache-aware routing
  • Reliability features and monitoring
  • Configuration, metrics, and architecture
  • API reference documentation

🏗️ Infrastructure

  • Default engine versions: vLLM 0.18.0, TensorRT-LLM 1.3.0rc8
  • Added minimaxai/minimax-m2 to nightly benchmarks
  • Improved E2E test infrastructure with parametrized fixtures

Full Changelog: v1.3.2...v1.3.3

Upgrade now: pip install smg --upgrade

🐑 Shepherd your LLM infrastructure with confidence.

Docker Images

Pre-built engine images on GitHub Container Registry:

SGLang:

docker pull ghcr.io/lightseekorg/smg:1.3.3-sglang-v0.5.9

vLLM:

docker pull ghcr.io/lightseekorg/smg:1.3.3-vllm-v0.18.0

TensorRT-LLM:

docker pull ghcr.io/lightseekorg/smg:1.3.3-trtllm-1.3.0rc8

All images for v1.3.3:

| Engine | Tag | Pull Command |
| --- | --- | --- |
| sglang | 1.3.3-sglang-v0.5.9 | docker pull ghcr.io/lightseekorg/smg:1.3.3-sglang-v0.5.9 |
| trtllm | 1.3.3-trtllm-1.3.0rc8 | docker pull ghcr.io/lightseekorg/smg:1.3.3-trtllm-1.3.0rc8 |
| vllm | 1.3.3-vllm-v0.18.0 | docker pull ghcr.io/lightseekorg/smg:1.3.3-vllm-v0.18.0 |

What's Changed

  • chore(release): bump llm-multimodal to 1.4.0 by @slin1237 in #788
  • fix(harmony): fix structural tag triggers for json_schema constrained output by @CatherineSue in #789
  • feat(harmony): support response_format in Chat Completions by @CatherineSue in #791
  • refactor(e2e): reorganize Harmony tests, add validation, remove unused gateway args by @CatherineSue in #796
  • chore(deps): bump dorny/paths-filter from 3 to 4 by @dependabot[bot] in #792
  • fix(ci): fix mergify stale/close rules that never trigger by @CatherineSue in #798
  • fix(ci): use ignore-pr-updates for stale PR detection, disable issues by @CatherineSue in #800
  • chore(deps): update tokio-tungstenite requirement from 0.28 to 0.29 by @dependabot[bot] in #793
  • feat(core): add per-worker resilience and HTTP pool config types by @CatherineSue in #799
  • fix(ci): remove ignore-pr-updates that marks active PRs as stale by @CatherineSue in #804
  • refactor(openai): cleanup dead code, redundant state, and hot-path inefficiencies by @slin1237 in #802
  • fix(mesh): break infinite retry loop for oversized incremental updates by @slin1237 in #808
  • feat(core): wire per-worker resilience and HTTP client into BasicWorker by @CatherineSue in #803
  • test(mesh): add serialization benchmark for mesh state sync by @slin1237 in #810
  • test(e2e): re-enable skipped tests for vLLM and TRT-LLM by @CatherineSue in #806
  • perf(mesh): switch all mesh serialization from JSON to bincode by @slin1237 in #809
  • fix(mesh): use bincode for snapshot generation to match receivers by @slin1237 in #816
  • feat(gateway): Propagate otel context for distributed tracing by @ekzhang in #814
  • perf(mesh): skip full-store scans when nothing has changed by @slin1237 in #823
  • fix(ci): drop [grpc] extra from nightly vllm install by @CatherineSue in #826
  • feat(ci): add minimaxai/minimax-m2 to nightly benchmark by @smfirmin in #795
  • fix(gateway): Disable auto-detection if "runtime": "sglang" explicitly set by @ekzhang in #820
  • perf(mesh): add operation log auto-compaction and tombstone GC by @slin1237 in #825
  • refactor(e2e): replace smg_compare with parametrized api_client fixture by @CatherineSue in #812
  • fix(client): auto-close streaming responses on iteration exhaustion by @CatherineSue in #835
  • refactor(e2e): add model fixture, remove deprecated smg fixture by @CatherineSue in #834
  • fix(protocol): validate /v1/messages tool_choice contract by @nishanthp in #833
  • fix(mesh): prevent stale snapshot chunks from mixing across retries by @slin1237 in #837
  • test(mesh): improve benchmark summary with timing and side-by-side comparison by @slin1237 in #841
  • fix(docker): install gRPC proto and servicer for vLLM images by @slin1237 in #843
  • fix(gateway): use URL-based detection to eliminate classify race condition by @slin1237 in #839
  • ci: bump default engine vllm(0.18.0) and trt(1.3.0rc8) versions by @slin1237 in #845
  • fix(gateway): handle mismatched metric labels in PD disaggregation mode by @slin1237 in #846
  • fix(pd): abort both PD requests when one side hits a transport error by @slin1237 in #844
  • docs(quickstart): audit and fix getting-started documentation by @slin1237 in #848
  • docs(extensibility): audit and fix tokenizer, MCP, and WASM plugin documentation by @slin1237 in #849
  • docs(workers): audit and fix worker configuration and gRPC pipeline documentation by @slin1237 in #850
  • docs(reliability): audit and fix reliability feature documentation by @slin1237 in #851
  • docs(operations): audit and fix monitoring and data connection documentation by @slin1237 in #854
  • docs(routing): audit and fix PD disaggregation and cache-aware routing documentation by @slin1237 in #853
  • docs(config): audit and fix configuration, metrics, and architecture documentation by @slin1237 in #852
  • docs(api): audit and fix API reference documentation by @slin1237 in #855
  • fix(ci): scope VERSION_OVERRIDE to smg crate only by @slin1237 in #856
  • chore(release): bump version to 1.3.3 by @slin1237 in #857


v1.3.2

17 Mar 17:29
047ed98

🚀 Shepherd Model Gateway v1.3.2 Released

Feature release adding multimodal support to Messages API and Python mesh bindings.

🎨 Multimodal Support for Messages API

Complete vision/image support in Messages API gRPC pipeline:

  • Native image processing for Messages API requests
  • Works across all gRPC backends (SGLang, vLLM, TensorRT-LLM)
  • Full feature parity with Anthropic's Messages API including vision

Impact: Messages API now supports both text and vision workloads. Deploy vision-language models with full reasoning and thinking capabilities through the Messages API protocol.

🌐 Mesh High Availability in Python

--enable-mesh support added to Python bindings:

  • Configure mesh HA directly from Python CLI
  • Distributed state synchronization accessible from Python deployments
  • Complete Python API coverage for mesh features

🐛 Bug Fixes

  • API: Restored model_id field in /workers response
  • Harmony: Reject ignore_eos with HTTP 400 for compatibility
  • Harmony: Include developer message when instructions are present
  • Lossy UTF-8 decode fallback for malformed text
  • Fixed gRPC PD mode detection
  • Enabled loop_controls for Jinja2 templates

📚 Documentation

  • Added comprehensive Messages API documentation
  • External providers integration guide
  • Mesh HA deployment documentation
  • Docker/PyPI badges added to README

Full Changelog: v1.3.1...v1.3.2

Upgrade now: pip install smg --upgrade

🐑 Shepherd your LLM infrastructure with confidence.

What's Changed

  • fix(api): restore model_id field in /workers response by @slin1237 in #774
  • feat(python): add --enable-mesh support to Python bindings by @slin1237 in #775
  • fix(ci): bump NCCL to 2.28+ for TensorRT-LLM compatibility by @slin1237 in #777
  • fix(harmony): reject ignore_eos for Harmony models with HTTP 400 by @CatherineSue in #778
  • fix(ci): install NCCL 2.28+ after TRT-LLM requirements to prevent downgrade by @slin1237 in #779
  • fix(harmony): include developer message when instructions are present by @CatherineSue in #781
  • feat(gateway): add multimodal support to Messages API gRPC pipeline by @slin1237 in #776
  • fix(ci): use k8s pod env vars for API keys instead of GitHub secrets by @slin1237 in #782
  • feat(tokenizer): lossy UTF-8 decode fallback, fix gRPC PD mode detection, enable loop_controls by @Kangyan-Zhou in #769
  • docs: add Messages API, external providers, and mesh HA documentation by @slin1237 in #786
  • docs: add Docker/PyPI badges and release docker notes script by @slin1237 in #787
  • chore: add release 1.3.2 by @slin1237 in #780


v1.3.1

16 Mar 18:13
eae871e

🚀 Shepherd Model Gateway v1.3.1 Released

Minor release with operational improvements and bug fixes.

🛠️ New Features

Operational improvements:

  • --remove-unhealthy-workers flag - Automatically remove workers that fail health checks
  • disable_tokenizer_autoload support - Skip automatic tokenizer loading for custom configurations

🐛 Bug Fixes

  • Gateway: Index external workers by all discovered models (not just primary model)
  • CI: Added VERSION_OVERRIDE to version check script

Full Changelog: v1.3.0...v1.3.1

Upgrade now: pip install smg --upgrade

🐑 Shepherd your LLM infrastructure with confidence.

What's Changed

  • fix(deps): remove aws-lc-sys to fix aarch64 PyPI build by @slin1237 in #762
  • feat(gateway): add --remove-unhealthy-workers by @ekzhang in #714
  • feat(gateway): support disable_tokenizer_autoload by @Huixxi in #740
  • feat(messages): add unit tests for Messages API streaming & response … by @ConnorLi96 in #763
  • fix(gateway): index external workers by all discovered models by @zhaowenzi in #756
  • fix(ci): add VERSION_OVERRIDE to check-versions by @slin1237 in #764
  • feat(interactions): Implement steps to handle non-stream interactions req without tool call by @XinyueZhang369 in #723
  • chore: auto-close stale PRs after 30 days and validate DCO sign-off identity by @CatherineSue in #743
  • chore(release): bump version to 1.3.1 by @slin1237 in #767


v1.3.0

15 Mar 00:26
92effcc

🚀 Shepherd Model Gateway v1.3.0 Released

We're excited to announce Shepherd Model Gateway v1.3.0 – a major release bringing native Messages API support and expanding our agentic workload capabilities.

🎯 Messages API: First-Class Implementation

Native Messages API implementation with core protocol support (credits to @CatherineSue):

  • True first-class support — Direct protocol implementation, not a translation layer
  • Extended thinking — Native ThinkingConfig with per-model reasoning activation and streaming thinking_delta events
  • Full streaming + non-streaming — Complete Anthropic SSE event protocol
  • Tool use — Custom tool definitions, tool_choice, structured tool output
  • Works across all backends — SGLang, vLLM, TensorRT-LLM via gRPC
  • Drop-in Anthropic SDK compatibility — Same API shape, your infrastructure

Why first-class matters: Wiring Messages API through chat completion (the common approach) silently drops thinking blocks — both in conversation history and model output — because the chat completion protocol has no concept of reasoning content. SMG's native implementation preserves thinking blocks end-to-end: ThinkingConfig activates model-specific reasoning, the streaming state machine emits proper thinking_delta events, and interleaved reasoning + text + tool use content blocks are assembled in correct order. No translation layer, no silent data loss.
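
The block-assembly step of that state machine can be sketched as follows (event names follow the Anthropic SSE protocol; the assembly logic is illustrative Python, not the actual implementation):

```python
def assemble_blocks(events):
    """Interleaved thinking_delta and text_delta events are assembled
    into ordered content blocks instead of being silently dropped."""
    blocks = []
    for kind, text in events:
        btype = "thinking" if kind == "thinking_delta" else "text"
        if blocks and blocks[-1]["type"] == btype:
            blocks[-1]["content"] += text      # extend the current block
        else:
            blocks.append({"type": btype, "content": text})
    return blocks

events = [("thinking_delta", "Let me check..."),
          ("text_delta", "The answer "), ("text_delta", "is 42.")]
assert assemble_blocks(events) == [
    {"type": "thinking", "content": "Let me check..."},
    {"type": "text", "content": "The answer is 42."},
]
```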

🔗 Expanding Agentic Workload Support

SMG now supports three major agentic APIs:

  • Chat Completions API (OpenAI) - Standard conversational interface
  • Responses API (OpenAI) - Still the only gateway supporting this for open-source models and third-party vendors
  • Messages API (Anthropic) - NEW - First-class native implementation with reasoning support

Plus routing to all major 3rd party providers: OpenAI, Anthropic, Gemini and more.

Impact: SMG sits behind any agent framework (Claude Code, Codex, OpenClaw, OpenCode) and routes to any model. Run agentic workflows designed for Claude on Llama 4, Qwen 3, DeepSeek, Kimi-K2.5—your infrastructure, full protocol fidelity including reasoning.

🌐 Unified /v1/models Across All Providers

Consistent model discovery experience across Anthropic, OpenAI, and Gemini:

  • Unified /v1/models response format across all routers
  • Consistent schema regardless of backend provider
  • Single API surface for model enumeration

⚡ High Availability Mesh Improvements

Sync cache-aware policy state across mesh HA nodes:

  • Cache policy state replicated across all mesh nodes
  • Automatic failover with consistent routing decisions
  • Zero-downtime deployments with state continuity

🛠️ smg-grpc-servicer Enhancements

  • Native SGLang backend support
  • Multi-backend extras for flexible deployment
  • Improved vLLM GetModelInfo response with served_model_name

🐛 Bug Fixes

  • Mesh: Plumbed --router-selector through CLI and Python bindings
  • Realtime API: Fixed worker health tracking in WebSocket session
  • gRPC servicer: Return served_model_name in vLLM GetModelInfo response
  • Dependencies: Pinned gRPC packages to 1.78.0 in SGLang install script

🏗️ Infrastructure

  • TensorRT-LLM default base image bumped to 1.3.0rc7
  • Added DeepWiki badge to README
  • Docker CI improvements

Full Changelog: v1.2.0...v1.3.0

Upgrade now: pip install smg --upgrade

🐑 Shepherd your LLM infrastructure with confidence.

Built for speed. Engineered for scale. Production-proven.

What's Changed

  • fix(ci): add packages:write permission to engine docker release workflows by @slin1237 in #697
  • refactor(gateway): unify /v1/models response across all routers by @slin1237 in #692
  • refactor(gateway): route realtime API through RouterTrait by @CatherineSue in #690
  • chore(deps): bump docker/setup-qemu-action from 3 to 4 by @dependabot[bot] in #704
  • chore(deps): update tokio-tungstenite requirement from 0.26 to 0.28 by @dependabot[bot] in #706
  • chore(deps): bump docker/login-action from 3 to 4 by @dependabot[bot] in #703
  • chore(deps): bump actions/checkout from 4 to 6 by @dependabot[bot] in #702
  • chore(deps): bump docker/setup-buildx-action from 3 to 4 by @dependabot[bot] in #701
  • chore(deps): bump docker/build-push-action from 6 to 7 by @dependabot[bot] in #700
  • Update max_concurrent_jobs from upstream by @ekzhang in #711
  • chore: add gongwei to code owner of docker and installation by @slin1237 in #715
  • chore: add gongwei to code owner of python binding by @slin1237 in #716
  • chore(ci): bump trtllm default base image to 1.3.0rc7 by @slin1237 in #717
  • docs: add DeepWiki badge to README by @slin1237 in #718
  • fix(deps): pin gRPC packages to 1.78.0 in sglang install script by @YouNeedCryDear in #719
  • feat(gateway): API-key-aware /v1/models with upstream fan-out by @slin1237 in #698
  • chore: fix lint by @slin1237 in #720
  • refactor(gateway): extract shared worker selection module by @slin1237 in #721
  • fix(mesh): plumb --router-selector through CLI and Python bindings by @slin1237 in #724
  • fix(grpc_servicer): return served_model_name in vLLM GetModelInfo response by @CatherineSue in #727
  • refactor(gateway): split OpenAI router.rs into chat and health modules by @slin1237 in #726
  • fix(realtime-api): worker health tracking in websocket session by @pallasathena92 in #725
  • refactor(gateway): extract MCP module from OpenAI responses by @slin1237 in #730
  • feat(realtime-api): WebRTC Router trait interface + HTTP route regist… by @pallasathena92 in #731
  • feat(realtime-api): WebRTC config plumbing through AppContext and CLI by @pallasathena92 in #729
  • refactor(gateway): extract history loading and storage queries from router.rs by @slin1237 in #732
  • feat: sync cache-aware policy state across mesh HA nodes by @llfl in #655
  • fix: update CODEOWNERS paths after crate relocation by @slin1237 in #734
  • refactor(gateway): extract route_responses orchestration into responses/route.rs by @slin1237 in #735
  • refactor(gateway): simplify openai router internals by @slin1237 in #737
  • feat(gateway): add Messages API type scaffolding to gRPC router by @slin1237 in #739
  • feat(gateway): add message_utils and MessagePreparationStage for Messages API by @slin1237 in #741
  • feat(gateway): add MessageRequestBuildingStage for Messages API by @slin1237 in #744
  • feat(grpc_servicer): add sglang support with multi-backend extras by @slin1237 in #745
  • fix(ci): prevent upload-servicer from being skipped by @slin1237 in #746
  • docs(template): add slack link by @lightseek-bot in #749
  • feat(gateway): add MessageResponseProcessingStage for Messages API (non-streaming) by @slin1237 in #747
  • feat(gateway): wire Messages API pipeline into gRPC routers by @slin1237 in #753
  • feat(gateway): add Messages API streaming support to gRPC router by @slin1237 in #758
  • chore: bump versions for v1.3.0 release by @slin1237 in #760


v1.2.0

10 Mar 16:23
4b03d32


🚀 Shepherd Model Gateway v1.2.0 Released!

We're thrilled to announce Shepherd Model Gateway v1.2.0 – a transformative release featuring enhanced event-driven cache-aware routing, production-ready client SDKs, Google Gemini integration, and vLLM gRPC server adoption!

Enhanced Event-Driven Cache-Aware Routing

Inspired by Amazon Dynamo's distributed caching principles, SMG extends its existing cache-aware routing with real-time KV cache event subscriptions:

  • SubscribeKvEvents RPC - Real-time KV cache event stream from all backends (SGLang, vLLM, TensorRT-LLM)
  • KvEventMonitor - Per-worker KV cache event subscriptions with automatic recovery
  • PositionalIndexer - Event-driven cache-aware routing with router prefix hash for query-path disambiguation
  • Auto-learned block_size - Dynamically learn from KV event stream
  • Flash Indexer parity - Closed 4 performance gaps, tuned DashMap shards to 256

Production Results (8 Llama model replicas):

  • TTFT avg: -23.0% (93.10 → 71.66 ms)
  • TTFT p99: -27.9% (186.98 → 134.88 ms)
  • TPOT avg: -0.9% (6.39 → 6.33 ms)
  • Latency avg: -3.8% (731.60 → 703.92 ms)
  • Latency max: -11.8% (1034.27 → 912.47 ms)
  • Req/sec: +1.3% (9.959 → 10.093)

Impact: Maximum KV cache utilization across your inference fleet. Route requests to workers with matching cached prefixes, eliminating redundant prefill computation and cutting TTFT by 23-28% in the benchmarks above.
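The routing idea can be sketched in a few lines. This is an illustrative simplification, not the PositionalIndexer's actual logic: the hashing scheme, the fixed block_size, and the names block_hashes/pick_worker are all assumptions for the sake of the example.

```python
import hashlib

def block_hashes(prompt: str, block_size: int = 16) -> list[str]:
    # Split the prompt into fixed-size blocks and chain-hash them, so two
    # prompts sharing a prefix produce identical leading hash sequences.
    hashes, prev = [], ""
    for i in range(0, len(prompt), block_size):
        block = prompt[i:i + block_size]
        prev = hashlib.sha256((prev + block).encode()).hexdigest()[:16]
        hashes.append(prev)
    return hashes

def pick_worker(prompt: str, worker_blocks: dict[str, set[str]]) -> str:
    # Choose the worker whose cached blocks cover the longest prefix of the
    # request, maximizing KV cache reuse.
    want = block_hashes(prompt)

    def prefix_len(cached: set[str]) -> int:
        n = 0
        for h in want:
            if h not in cached:
                break
            n += 1
        return n

    return max(worker_blocks, key=lambda w: prefix_len(worker_blocks[w]))
```

In the real system the per-worker block sets are fed by the KV event subscriptions, and block_size is learned from the event stream rather than hard-coded as it is here.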

🎨 TensorRT-LLM Multimodal Support

Complete vision-language model integration:

  • gRPC multimodal pipeline - preprocessed data with hashing
  • Backend-specific variants - optimized for TRT-LLM
  • String-based stop sequences - no pre-tokenization overhead
  • matched_stop support - proper stop sequence handling

🔄 vLLM Upstream gRPC Adoption

SMG's gRPC server implementation is now upstream in vLLM!

vLLM's PR #36169 formalizes SMG's protobuf and gRPC server implementation as an upstream dependency. gRPC is now an officially supported protocol in the vllm serve command.

  • smg-grpc-servicer package published to PyPI
  • Production-grade gRPC server infrastructure
  • Credit to @CatherineSue and @njhill for driving this milestone

Impact: A significant milestone for the project—SMG's gRPC innovations are now the foundation for vLLM's official gRPC support.

📦 Production-Ready Client SDKs

Multi-language SDK ecosystem with OpenAPI codegen:

  • Python SDK - Drop-in replacement for OpenAI/Anthropic SDKs with complete API coverage
  • Rust HTTP Client - Type-safe, async-first client with all endpoints
  • Java Type Generation - Full OpenAPI-derived types

Endpoints: Chat completions, classify, parser, responses, workers, loads, and more.

Impact: Integrate SMG into any tech stack with idiomatic, type-safe clients. Zero-friction migration from the OpenAI and Anthropic SDKs.
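Because the gateway speaks the OpenAI wire protocol, any OpenAI-style client can target it by pointing the base URL at SMG. A minimal standard-library sketch; the host/port, model name, and API key here are placeholders, not values shipped by SMG:

```python
import json
from urllib import request

GATEWAY = "http://localhost:8000"  # placeholder: wherever your SMG instance listens

def chat_request(model, messages, stream=False):
    # Build an OpenAI-style /v1/chat/completions request aimed at the gateway.
    body = json.dumps({"model": model, "messages": messages, "stream": stream}).encode()
    return request.Request(
        f"{GATEWAY}/v1/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer sk-placeholder",  # only if auth is enabled
        },
        method="POST",
    )

req = chat_request("llama-3", [{"role": "user", "content": "Hello"}])
# request.urlopen(req) would return a standard chat.completion JSON body.
```

The generated Python SDK wraps exactly this shape, so existing OpenAI client code only needs a base-URL change.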

🐳 Engine-Specific Docker Images

Pre-built Docker images for each inference engine:

  • docker pull ghcr.io/lightseekorg/smg:1.2.0-sglang-v0.5.9
  • docker pull ghcr.io/lightseekorg/smg:1.2.0-vllm-v0.17.0
  • docker pull ghcr.io/lightseekorg/smg:1.2.0-trtllm-1.3.0rc6

Impact: Zero-configuration deployment with engine-optimized images. Pull and run your preferred backend instantly. Credit to @gongwei-130 for driving this feature.

🔮 Google Gemini Integration

New Gemini router for Google's Interactions API:

  • Complete router registration and infrastructure
  • Native protocol support for Gemini models
  • Seamless integration alongside OpenAI, Anthropic, and self-hosted engines

Impact: Route to Gemini alongside your entire model fleet. One gateway, all providers.

💾 Advanced Data Persistence

Enterprise-grade data connector enhancements:

  • Schema versioning with safe-by-default migrations (Flyway integration)
  • SchemaConfig - Customizable table and column names for existing databases
  • Storage hooks - Pre/post persistence callbacks
  • WASM bridge - Call storage APIs from WASM middleware

Performance: Reduced Oracle DB round trips in pagination, batch linking, and delete operations.

📊 Load Monitoring & Discovery

  • New /v1/loads endpoint with gRPC support
  • Per-worker model customization (--model-id-from)
  • External worker discovery with per-provider API keys
  • Model aliasing

🔌 MCP Enhancements

Responses API Integration:

  • MCP approval items in protocols and routers
  • X-SMG-MCP header for MCP passthrough control

Anthropic Features:

  • tool_search_tool support
  • defer_loading for lazy tool initialization

Performance:

  • Concurrent tool execution in McpToolSession
  • Lock-free pool stats with AtomicUsize
  • Fixed O(n²) insert in inject_mcp_output_items
  • Reverse iteration with optional limit in AuditLog

🎨 Multimodal Improvements

Qwen VL Support:

  • Proper patchification and prompt replacement token counts
  • Backend-specific MultimodalData variants

vLLM Integration:

  • Send preprocessed multimodal data with hashing and structured tokens
  • Derive keep_on_cpu keys from model spec

Code Quality:

  • Split registry.rs into per-model spec modules
  • Removed dead MultiModalInputs/Tensor/Value types

Performance Optimizations

Core Engine:

  • Zero-allocation JSON streaming validation with IgnoredAny
  • Move semantics in gRPC request handling instead of cloning
  • Optimized clone usage and finalize/emit_completed ordering

Indexing:

  • worker_blocks moved to caller-owned storage
  • Setup moved out of concurrent benchmark timing loop

🛡️ Critical Bug Fixes

  • gRPC: Proper status codes, circuit breaker accuracy, return tonic::Status directly, use e.message() for errors
  • Reasoning Parser: 4MB buffer limit, non-fatal parse errors
  • Tokenizer: Jinja2 trim_blocks/lstrip_blocks enabled, Value::UNDEFINED for missing params, SGLang tp_size fix
  • Mesh: Prevent stale state overwrites via relay paths, increased multi-node timeouts
  • Gateway: Release semaphore permit before completion wait, unified worker registration
  • Multimodal: Qwen VL patchification and token counts, lazy model discovery in /v1/models

🔧 Refactoring & Code Quality

  • Repository: Library crates moved to crates/ directory, UUID v4 → v7 migration workspace-wide
  • MCP: Extracted shared iterators, deduplicated constructors, removed dead paths
  • gRPC: Split utils.rs into focused modules, deduplicated streaming logic
  • Workflow Engine: Simplified definition and internals
  • E2E Testing: Removed parallel infrastructure, simplified helpers and architecture
  • CI/CD: Reusable build workflows, pre-commit checks, conventional commits enforcement, file changes detection for E2E skipping, sequential execution, block AI co-author lines

📚 Documentation

  • Standardized runtime ordering to SGLang, vLLM, TensorRT-LLM
  • Documented O(n) complexity on pool URL-based lookups

🎯 Additional Features

  • Unified flag for cache token usage report in HTTP mode
  • OpenAI-compatible cached token usage
  • GetTokenizer proto and tokenizer bundle streaming
  • TRT-LLM parameter pass-through fixes
  • matched_stop support for vLLM and TensorRT-LLM
  • Respect workers_config in multi-worker gRPC setup
  • Realtime API protocol foundations (session, conversation, response, WebSocket handler)

🔗 Full Changelog: v1.1.0...v1.2.0

Upgrade now: pip install smg --upgrade

🐑 Shepherd your LLM infrastructure with confidence.

⚡ Built for speed. Engineered for scale. Production-proven.

What's Changed

  • ci(nightly): Add vLLM HTTP support to nightly benchmarks by @CatherineSue in #502
  • docs: standardize runtime ordering to SGLang, vLLM, TensorRT-LLM by @slin1237 in #514
  • fix(ci): install CUDA toolkit for SGLang JIT kernel compilation by @slin1237 in #513
  • fix parameters pass through for trtllm by @gongwei-130 in #509
  • chore(ci): remove pull request trigger from nightly benchmark workflow by @key4ng in #524
  • refactor: migrate from UUID v4 to v7 across the workspace by @slin1237 in #518
  • perf: optimize JSON streaming validation with zero-allocation IgnoredAny by @ppraneth in #516
  • fix(e2e): respect workers_config in vLLM/TRT-LLM gRPC multi-worker setup by @slin1237 in #525
  • feat(responses): add mcp approval items to protocols and routers by @zhaowenzi in #491
  • fix(ci): modify docker-storage dir and add scaler for a10 runners by @XinyueZhang369 in #402
  • feat(anthropic): add X-SMG-MCP header for MCP passthrough by @key4ng in #517
  • feat: add --served-model-name support for model aliasing by @ConnorLi96 in #521
  • feat(data-connector): add SchemaConfig for customizable table and column names by @slin1237 in #526
  • refactor(protocol): model ResponseTool as tagged enum to match Responses spec and tighten MCP validation by @zhaowenzi in #532
  • ci: add PR title conventional commits check by @CatherineSue in #540
  • feat: add unified fl...

v1.1.0

23 Feb 06:40
b6f9bb5


🚀 Shepherd Model Gateway v1.1.0 Released!

We're excited to announce Shepherd Model Gateway v1.1.0 – a major feature release bringing universal multimodal support, Messages API MCP integration, and critical production hardening across the entire stack!

🎨 Universal Multimodal Support 🔥

Industry-leading multimodal processing across all major inference engines:

  • SGLang gRPC - Full multimodal pipeline with vision processing
  • vLLM gRPC - Fetch + preprocess pipeline with multimodal support
  • TensorRT-LLM gRPC - Complete multimodal integration
  • Llama 4 Vision - First-class support with model spec and processor registration

Impact: Deploy vision-language models across SGLang, vLLM, and TensorRT-LLM with unified processing. Data URI detection, 4D pixel values, and i64 aspect ratios for production-grade image handling.

🔌 Messages API Gets MCP

Complete MCP tool integration for Anthropic Messages API:

  • Streaming and non-streaming MCP tool use
  • Unified tool allowlist enforcement across OpenAI and gRPC routers
  • Server binding architecture with session lifecycle management
  • Built-in server filtering from tool listings
  • Unique server_label requirements for tool collision prevention

E2E tested with comprehensive MCP tool use coverage.

Major New Features

🌐 vLLM HTTP Backend Support
Auto-detection and support for vLLM HTTP endpoints via DetectBackendStep – seamlessly switch between gRPC and HTTP workers.

🎯 smg serve Engine Args Pass-through
Pass arbitrary engine-specific arguments directly through smg serve to your inference engines. Maximum flexibility for custom configurations.

🧠 Tiktoken Hub Model Support
Unified chat template API with tiktoken hub integration. Improved OpenAI o-series model detection and error handling.

🔍 NanoV3 Reasoning Parser
Native support for Nemotron Nano V3 reasoning output parsing.

🎨 Startup Banner
Beautiful braille art shepherd motif on startup – because production systems deserve aesthetics.

Performance Optimizations

WASM Runtime Enhancements:

  • Optimized component cache lookup
  • Reduced per-request cloning overhead
  • SHA-256 cache keys for efficient middleware

🛡️ Critical Production Hardening

Responses API Fixes:

  • Fixed data loss and panic risks
  • Sanitized upstream error bodies
  • Improved structural integrity

Middleware Reliability:

  • Fixed extension loss in request pipeline
  • Eliminated auth timing leak
  • Corrected streaming body buffering

Tokenizer Robustness:

  • Cache correctness fixes
  • Streaming reliability improvements
  • Chat template error handling

Data Connector Hardening:

  • Eliminated deadlock, block_on, and triple pool bugs
  • Storage backend protection against data corruption
  • Race condition fixes

Concurrency Safety:

  • Fixed tokio mutex release before awaiting in LoadMonitor
  • Improved SSE event processing and buffer management

Multimodal Correctness:

  • Proper data URI detection
  • 4D pixel_values output
  • i64 aspect_ratios for large images

🏗️ Architectural Improvements

Worker Infrastructure:

  • Consolidated DPAwareWorker into BasicWorker
  • Moved DP fields to WorkerSpec
  • Unified worker metadata discovery
  • Cleaner registration workflow

Code Quality:

  • Enforced strict clippy linting workspace-wide
  • Added clippy::absolute_paths and single_component_path_imports lints
  • Improved error handling across all modules

🔧 Developer Experience

CI/DevOps:

  • DCO check with probot app
  • Mergify automation for PR management
  • Branch naming enforcement
  • Docker image release workflow
  • Auto-trigger benchmark workflows on code changes

Tooling:

  • Workspace version checker script
  • PyPI proto version validation
  • Remote dev workflow for proto testing

Python Support:

  • Lowered minimum Python version from 3.12 to 3.9

🐛 Bug Fixes

  • Fixed worker health config, bootstrap parsing, and model card cloning issues
  • Improved serve CLI arg filtering and config error handling
  • Better pre-commit hook configuration
  • Corrected labeler workflow for fork PRs

📚 Interactions API

Added comprehensive validations for the Interactions API protocol.

🔗 Full Changelog: v1.0.1...v1.1.0

Upgrade now: pip install smg --upgrade

🐑 Shepherd your LLM infrastructure with confidence.

⚡ Built for speed. Engineered for scale. Production-proven.

What's Changed

  • fix: render README images on PyPI/crates.io and bump version to 1.0.1 by @slin1237 in #420
  • chore(ci): Change nightly benchmark schedule to midnight PST by @key4ng in #422
  • ci: add DCO check, Mergify automation, and branch naming enforcement by @CatherineSue in #424
  • ci: temporarily disable auto-close for branch naming violations by @CatherineSue in #426
  • ci: add needs-rebase label management to Mergify by @CatherineSue in #427
  • fix(ci): use correct Mergify syntax for negated regex condition by @CatherineSue in #429
  • ci: improve label management with router-specific and feature labels by @CatherineSue in #428
  • ci: add Docker image release workflow by @slin1237 in #431
  • feat(message api): MCP tool use with streaming and non-streaming support by @key4ng in #352
  • refactor(core): consolidate DPAwareWorker into BasicWorker by @slin1237 in #434
  • fix: pre-existing issues in worker health config, bootstrap parsing, and model card cloning by @slin1237 in #415
  • refactor(core): move DP fields to WorkerSpec and remove default_model_type by @slin1237 in #436
  • chore: fix main log by @slin1237 in #437
  • test(e2e): add MCP tool use tests for Anthropic Messages API by @key4ng in #433
  • feat(core): add DetectBackendStep for vLLM HTTP support by @slin1237 in #438
  • feat(tokenizer): add tiktoken hub model support and unify chat template API by @slin1237 in #439
  • perf(wasm): optimize WASM component cache lookup and reduce per-request cloning by @ppraneth in #440
  • refactor(core): unify worker metadata discovery and clean up registration by @slin1237 in #447
  • feat(version): add startup banner with braille art shepherd motif by @slin1237 in #448
  • fix(python): lower minimum Python version from 3.12 to 3.9 by @slin1237 in #449
  • ci(mergify): enable auto-close for non-conforming branch names by @CatherineSue in #454
  • ci(mergify): allow multi-segment branch names for dependabot by @CatherineSue in #456
  • ci(dco): switch DCO check from GitHub Actions to probot DCO app by @CatherineSue in #462
  • fix(openai): fix data loss, panic risk, and structural issues in Responses API by @slin1237 in #468
  • fix(mcp): filter builtin servers from mcp_list_tools output by @key4ng in #450
  • feat(interactions): Add validations for interactions api by @XinyueZhang369 in #399
  • feat(mcp): enforce allowed_tools filtering across openai and grpc routers by @zhaowenzi in #467
  • fix(openai): sanitize upstream error bodies in Responses API by @slin1237 in #473
  • fix(middleware): fix extension loss, auth timing leak, and streaming body buffering by @slin1237 in #472
  • fix(concurrency): release tokio mutex before awaiting task in LoadMonitor::stop() by @slin1237 in #475
  • fix(tokenizer): correctness and robustness fixes for cache and streaming by @slin1237 in #474
  • fix(ci): use pull_request_target for labeler to support fork PRs by @CatherineSue in #477
  • fix(data-connector): fix deadlock, block_on, triple pool, and DDL type bugs by @slin1237 in #471
  • feat(reasoning-parser): add NanoV3 reasoning parser by @slin1237 in #480
  • refactor(anthropic): simplify worker lifecycle in Anthropic router by @key4ng in #476
  • feat(realtime api): realtime api session and transcription_session protocols by @pallasathena92 in #364
  • fix(protocols): require unique server_label for MCP tools by @zhaowenzi in #479
  • feat: smg serve pass through engine args to engine by @gongwei-130 in #460
  • fix(serve): harden CLI arg filtering and config error handling by @slin1237 in #483
  • feat(scripts): replace release notes generator with workspace version checker by @slin1237 in #484
  • fix(ci): match probot DCO app check name in Mergify rule by @CatherineSue in https://github.com/lightseekorg/smg/pul...

v1.0.1

13 Feb 16:35


🎉 Introducing Shepherd Model Gateway v1.0.1!

We're thrilled to announce Shepherd Model Gateway v1.0.1 – formerly SGLang Model Gateway. This major release marks a new chapter with a complete architectural overhaul, new enterprise features, and production-grade improvements!

🐑 Welcome to Shepherd

SGLang Model Gateway is now Shepherd Model Gateway (SMG).

Truly Engine-Agnostic Architecture: Shepherd is your universal gateway supporting all major inference engines – SGLang, vLLM, and TensorRT-LLM – plus complete 3rd party model provider integration including OpenAI, Anthropic, and Gemini. One gateway to route them all.

Universal API Support: Native implementation of Chat Completions, Responses API, Messages API, Interactions API, and Realtime API. Whether you're running open-source models on your infrastructure or routing to cloud providers, Shepherd handles it seamlessly.

Same powerful technology, new identity focused on guiding and managing your entire LLM infrastructure at scale – regardless of where your models run.

Major New Features

⚡ TensorRT-LLM Backend Support - Native gRPC integration for NVIDIA TensorRT-LLM

🔄 vLLM Prefill-Decode-Disaggregation Support
Mooncake and NIXL-based KV transfer for disaggregated inference:

  • Auto-discovery for seamless integration
  • Massive scalability improvements for large deployments
  • Efficient KV cache sharing across workers

🎯 smg serve - Unified Worker Management
New serve subcommand with complete worker lifecycle orchestration:

  • Multi-worker data parallelism with GPU assignment
  • ServeOrchestrator for automated worker management
  • Two-pass argument parsing for flexible configuration
  • One command to rule them all

🤖 Anthropic Messages API Support
Full implementation of Anthropic's Messages API with streaming and non-streaming support. Deploy Claude models alongside your existing inference fleet.

🔌 Industry-First: Universal Built-in Tools via MCP 🔥

Turn any MCP server into built-in tools for all models – an industry-first capability that brings OpenAI-style built-in tools (FileSearch, WebSearch, CodeInterpreter) to every LLM, not just proprietary models.

Complete MCP Orchestration Stack:

  • McpOrchestrator with YAML policy configuration
  • Built-in tool routing infrastructure with qualified names – seamlessly integrate any MCP server as a native capability
  • ResponseFormat transformation pipeline - expose MCP servers as built-in tools (FileSearch, WebSearch, CodeInterpreter, and custom tools)
  • Auth-aware connection pooling for scalable multi-tenant deployments
  • Batch tool execution API for efficient processing
  • Approval system for controlled tool execution
  • Automatic reconnection manager for reliability
  • Graceful shutdown support
  • HTTP header forwarding to MCP servers

Impact: Deploy Llama, Qwen, DeepSeek, or any open-source model with the same built-in tool capabilities as GPT-4. Your infrastructure, your models, OpenAI-grade tooling.
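The qualified-name idea behind collision prevention can be pictured with a small sketch. This is hypothetical: SMG's actual qualified-name separator and registry types are not specified here, only the reason unique server_label values are enforced.

```python
def qualify(server_label, tool_name):
    # Namespace a tool by its MCP server label so two servers can each
    # expose a tool with the same bare name (e.g. "search") without clashing.
    return f"{server_label}.{tool_name}"

def route_tool_call(qualified, servers):
    # Resolve a qualified name back to (server, tool), rejecting unknown
    # servers or tools. If two servers shared a label, this lookup would
    # be ambiguous -- hence the uniqueness requirement.
    server, _, tool = qualified.partition(".")
    if server not in servers or tool not in servers[server]:
        raise KeyError(f"unknown tool {qualified!r}")
    return server, tool
```

With unique labels, the model sees one flat tool list while the gateway can still dispatch each call to the right MCP server.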

📡 Realtime API Foundation
Event types and protocol support for real-time streaming applications.

🏗️ Architectural Revolution

Workspace Modularization
Complete extraction into standalone, publishable crates:

  • smg-auth - JWT/OIDC authentication
  • smg-mesh - High availability mesh networking
  • smg-mcp - Model Context Protocol orchestration
  • smg-wasm - WebAssembly middleware
  • smg-grpc-client - gRPC client infrastructure
  • smg-grpc-proto - Protocol definitions (published to PyPI!)
  • smg-kv-index - Cache-aware routing engine
  • llm-tokenizer - Tokenization logic
  • llm-multimodal - Multimodal processing
  • openai-protocol - OpenAI API specifications
  • wfaas - Workflow-as-a-Service engine
  • And more...

Result: Faster builds, independent evolution, better maintainability, and easy integration into your own projects.

Performance Optimizations

Zero-Copy & Algorithm Improvements:

  • Zero-copy multimodal payload handling
  • Aho-Corasick algorithm for stop sequence and special token search
  • WASM Linker reuse across executions
  • Optimized consistent hashing with zero allocations
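The Aho-Corasick change replaces per-pattern scanning with a single automaton pass: all stop sequences and special tokens are found in one sweep over the text, instead of one sweep per pattern. A compact Python illustration of the algorithm (SMG's implementation is in Rust; this sketch only mirrors the technique):

```python
from collections import deque

def build_automaton(patterns):
    # Build an Aho-Corasick automaton: a trie plus failure links, so one
    # pass over the text reports every occurrence of every pattern.
    goto, fail, out = [{}], [0], [set()]
    for pat in patterns:
        s = 0
        for ch in pat:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(pat)
    # BFS to compute failure links; a node's failure target is always
    # shallower, so it is finalized before the node itself is processed.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]
    return goto, fail, out

def find_stops(text, automaton):
    # Return (end_index, pattern) for every stop-sequence hit in one scan.
    goto, fail, out = automaton
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for pat in out[s]:
            hits.append((i, pat))
    return hits
```

The payoff is that matching cost is O(text length + number of hits) regardless of how many stop strings are configured, versus O(patterns × text) for the naive scan.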

🛠️ Production Enhancements

High Availability:

  • Mesh service refactoring and cleanup
  • State synchronization improvements
  • Oracle external auth support for enterprise backends

Observability:

  • Nightly benchmark workflow for comprehensive model performance tracking
  • gRPC vs HTTP comparison benchmarks
  • GetLoads RPC for load metrics

Developer Experience:

  • Comprehensive documentation restructure (concept-centric)
  • Issue templates and PR templates
  • Pre-commit hooks with Ruff + mypy Python linting
  • Automated crate publishing workflows
  • Dependabot integration

Testing Infrastructure:

  • Kubernetes-based CI runners
  • Service containers for Oracle and Brave
  • vLLM and TensorRT-LLM gRPC E2E tests
  • Thread-safe test fixtures with proper resource management

🐛 Critical Bug Fixes

  • Fixed synthetic "empty" tenant pollution in radix tree
  • Prevented resource leaks causing GPU starvation
  • Fixed STDIO MCP server triggering
  • Aligned multi-server MCP output handling across routers
  • Fixed completion token counting for vLLM harmony streaming
  • Corrected proto definitions (logprobs token_ids uint32)

📚 Documentation

  • Complete restructure from configuration-centric to concept-centric
  • Architecture diagrams and gradient mesh homepage
  • Comprehensive README with features overview
  • Admin API reference
  • Getting started guides

🔧 Tool Parser Support

New model support:

  • Cohere Command models (tool parser + reasoning parser)
  • Qwen Coder (XML format for Qwen3 Coder and MicroThinker)

🔗 Repository: https://github.com/lightseekorg/smg

Install now: pip install smg --upgrade

🐑 Shepherd your LLM infrastructure with confidence.

⚡ Built for speed. Engineered for scale. Production-proven.

What's Changed

  • fix: render README images on PyPI/crates.io and bump version to 1.0.1 by Simo Lin
  • fix(ci): fix H200 nightly benchmark model path, worker logs and CUDA errors (#411) by @key4ng in #411
  • fix(ci): use single Python interpreter for Windows/macOS PyPI builds (#418) by @slin1237 in #418
  • chore(mesh): bump smg-mesh version to 1.1.0 (#419) by @slin1237 in #419
  • chore: unify workspace dependency management and bump crate versions (#344) by @slin1237 in #344
  • refactor: remove remaining pub use re-export aliases from lib.rs (#416) by @slin1237 in #416
  • refactor: remove pub use re-export aliases from lib.rs (#413) by @slin1237 in #413
  • refactor(protocols,gateway): redesign worker type hierarchy and consolidate protocol layer (#412) by @slin1237 in #412
  • fix(grpc-proto): bump grpcio minimum to >=1.78.0 (#409) by @CatherineSue in #409
  • chore(ci): increase chat-completions-trtllm timeout to 60 minutes (#408) by @CatherineSue in #408
  • fix(trtllm): tokenize and inject user stop sequences for TRT-LLM requests (#346) by @ppraneth in #346
  • fix(e2e): migrate genai-bench to Docker and fix router pipe hang (#403) by @key4ng in #403
  • chore(deps): update kube requirement from 1.1.0 to 3.0.1 (#397) by @app/dependabot in #397
  • chore(deps): update opentelemetry-proto requirement from 0.27 to 0.31 (#398) by @app/dependabot in #398
  • chore(deps): update ndarray requirement from 0.16 to 0.17 (#394) by @app/dependabot in #394
  • feat: support oracle external auth for oracle backend (#404) by @zhaowenzi in #404
  • fix(grpc-proto): reorder authors in pyproject.toml (#400) by @CatherineSue in #400
  • chore[ci]: upgrade oracle image (#393) by @key4ng in #393
  • chore(e2e): overhaul nightly benchmark summary and trim model list (#392) by @slin1237 in #392
  • feat: Implement ReconnectionManager for automatic MCP server recovery (#265) by @ppraneth in #265
  • perf(multimodal): optimize payload handling with zero-copy (#391) by @ppraneth in #391
  • refactor(mcp): standardize output injection ordering across routers (#388) by @slin1237 in #388
  • ci(grpc): add proto package publishing and codegen checks (#386) by @CatherineSue in #386
  • feat(grpc): add smg-grpc-proto Python package for proto definitions (#385) by @CatherineSue in #385
  • chore(e2e): include model size in gpt-oss nightly benchmark slug (#384) by @CatherineSue in #384
  • refactor(mcp): remove requested_servers and introduce ResponsesCallContext (#382) by @CatherineSue in #382
  • refactor(mcp): use imports instead of fully-qualified paths in McpToolSession (#383) by @CatherineSue in #383
  • e2e: rewrite nightly summary with gRPC vs HTTP comparison (#381) by @slin1237 in #381
  • feat(realtime api): realtime api event types (#349) b...