Releases: ai-dynamo/modelexpress

ModelExpress Release v0.3.0

17 Apr 22:04
76fc5d7

ModelExpress v0.3.0 Release Notes

The big picture

Today, scaling inference means waiting. A 671B model takes 40+ minutes to load from storage before it can serve a single token. Every new node repeats that wait. ModelExpress exists to eliminate it, by turning GPU memory itself into the fastest model cache in the cluster.

With v0.3.0, that vision gets the infrastructure to back it. This release makes NIXL-based GPU-to-GPU transfer a first-class, production-grade path: users are now deploying on Kubernetes with proper lifecycle management, metadata coordination, and failure handling. In production inference environments, DeepSeek-V3 transfers take ~15 seconds across 8 GPUs.

What's new in this release

v0.3.0 lands three things that matter:

  1. P2P transfers that work in production: not just the data plane, but the control plane around it. Multi-source metadata exchange, heartbeat-based liveness, stale source cleanup, and a NIXL-native metadata path that removes the Redis dependency for coordination. You can now have multiple workers publishing and consuming transfer metadata cleanly, and the system detects and sheds dead sources automatically.
  2. One loader, not two: the previous mx-source / mx-target split is gone. The unified MX loader auto-detects the right path (HuggingFace download, disk cache, GDS, or P2P receive) based on what's available. Less configuration, fewer failure modes, clearer operations.
  3. Kubernetes as a real deployment target: metadata management through Kubernetes-native backends, Helm chart hardening (ephemeral storage limits, ServiceAccount lockdown), no-shared-storage test coverage, and multi-node gRPC transfer validation. MX no longer just "runs on K8s"; it is K8s-native.

P2P transfer and metadata

  • End-to-end NIXL P2P with post-processed tensor registration and non-contiguous RDMA: full module trees get registered, not just top-level parameters. Storage-level RDMA handles real tensor layouts. (#135, #169, #188)
  • NIXL-native metadata exchange: P2P coordination without a centralized store. Fewer moving parts for distributed setups. (#177)
  • Multi-source metadata model: multiple workers can publish and consume metadata concurrently, required for real multi-node topologies. (#170)
  • TransferEngine metadata backend + simplified schema: pluggable metadata backends with a cleaner wire format. Foundation for resilience in production. (#157, #165)
  • Client heartbeats + stale-source reaper: the server detects dead sources and cleans them up. P2P doesn't hang waiting for a node that's gone. (#182)
  • NIXL listener lifecycle: listener only starts when P2P metadata is enabled. No phantom ports in cache-only deployments. (#210)
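
The heartbeat and stale-source reaper behavior described above can be pictured with a small sketch. This is an illustration only, not the ModelExpress implementation: the `SourceRegistry` class, its method names, and the 30-second timeout are all hypothetical.

```python
from __future__ import annotations

import time
from dataclasses import dataclass, field


@dataclass
class SourceRegistry:
    """Tracks the last heartbeat per source and drops sources that go quiet."""

    stale_after_s: float = 30.0  # hypothetical timeout, not a ModelExpress default
    _last_seen: dict[str, float] = field(default_factory=dict)

    def heartbeat(self, source_id: str, now: float | None = None) -> None:
        """Record that a source is alive at time `now` (monotonic seconds)."""
        self._last_seen[source_id] = time.monotonic() if now is None else now

    def reap(self, now: float | None = None) -> list[str]:
        """Remove and return sources whose last heartbeat is older than the timeout."""
        now = time.monotonic() if now is None else now
        dead = [s for s, t in self._last_seen.items() if now - t > self.stale_after_s]
        for source_id in dead:
            del self._last_seen[source_id]
        return sorted(dead)

    def live_sources(self) -> list[str]:
        """Sources a target may still be pointed at."""
        return sorted(self._last_seen)


registry = SourceRegistry(stale_after_s=30.0)
registry.heartbeat("worker-a", now=0.0)
registry.heartbeat("worker-b", now=0.0)
registry.heartbeat("worker-a", now=25.0)  # worker-b never reports again
print(registry.reap(now=40.0))            # ['worker-b']
```

Passing `now` explicitly keeps the sketch deterministic; a real server would presumably run the reaper on a timer.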

Loading and caching

  • Unified MX loader: one path replaces the source/target split. Auto-detects the fastest available route. (#147)
  • GDS-aware loading: GPU Direct Storage integrated into the auto-detection path, used when available, skipped when not. (#166)
  • Provider-aware cache and streaming: cache and download behavior follows the active provider, not a hardcoded HuggingFace assumption. (#172)
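
As a rough mental model of the auto-detection, the loader can be thought of as walking a priority list of routes. The sketch below is an assumption about ordering, not the actual loader logic: the `Route` and `Environment` types and the `choose_route` function are hypothetical names, and the real checks are more involved.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    P2P_RECEIVE = "p2p_receive"
    GDS = "gds"
    DISK_CACHE = "disk_cache"
    HF_DOWNLOAD = "hf_download"


@dataclass(frozen=True)
class Environment:
    live_p2p_source: bool  # another worker already holds the weights in GPU memory
    cached_on_disk: bool   # the model files are present in the local cache
    gds_available: bool    # GPU Direct Storage is usable on this node


def choose_route(env: Environment) -> Route:
    """Pick a route, preferring the ones assumed to be fastest."""
    if env.live_p2p_source:
        return Route.P2P_RECEIVE
    if env.cached_on_disk and env.gds_available:
        return Route.GDS
    if env.cached_on_disk:
        return Route.DISK_CACHE
    return Route.HF_DOWNLOAD  # nothing local: fall back to downloading


env = Environment(live_p2p_source=False, cached_on_disk=True, gds_available=True)
print(choose_route(env))  # Route.GDS
```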

Kubernetes and deployment

  • Helm chart hardened: ephemeral-storage limits, ServiceAccount automount disabled, PVC naming aligned with docs (#144, #143, #145, #192, #191)
  • Redis metadata setup made explicit in README and docker-compose (#203)
  • Docker build context excludes target/, cutting image build upload size (#175)
  • Unnecessary serviceAccountName removed from vLLM client manifests (#198)
  • Standard gRPC health check service for probe integration (#205)
  • CodeRabbit added for automated PR review (#138)

Bug fixes

  • Model deletion no longer triggers unintended downloads; eviction path is consistent (#154, #168)
  • HuggingFace: handles empty files, skips dotfiles, honors HF_HUB_OFFLINE (#139, #128)
  • Cache clearing works in real usage (#130)
  • Non-root workers can use PVCs correctly (#132)
  • Kubernetes status round-tripping with default Unknown status (#174)
  • Metadata publish retries before failing the loader (#196)
  • Security: Pygments CVE-2026-4539, rustls-webpki RUSTSEC-2026-0049 (#195, #178)
  • Multi-node K8s tests respect KUBECONFIG; Python tests re-enabled in CI (#189, #171)
  • P2P throughput docs corrected; metadata backend logging shows actual type (#181, #190)
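
To illustrate the "retries before failing the loader" fix (#196), here is a minimal retry-with-backoff sketch. The function name, the attempt count, the delays, and the choice of `ConnectionError` are assumptions made for the example, not values taken from ModelExpress.

```python
import time
from typing import Callable


def publish_with_retries(
    publish: Callable[[], None],
    attempts: int = 3,            # hypothetical retry budget
    base_delay_s: float = 0.5,    # hypothetical initial backoff
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `publish` until it succeeds; return the attempt number that worked.

    Re-raises the last ConnectionError once the retry budget is exhausted,
    so the caller only fails after every attempt has been used.
    """
    for attempt in range(1, attempts + 1):
        try:
            publish()
            return attempt
        except ConnectionError:
            if attempt == attempts:
                raise
            sleep(base_delay_s * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...
    raise ValueError("attempts must be at least 1")
```

The `sleep` parameter is injectable so the backoff can be exercised in tests without real waiting.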

Where we're headed

v0.3.0 establishes the P2P transfer plane and pluggable metadata as production primitives. The next releases focus on three areas:

  • Performance: transfer throughput optimization, contiguous region support, and benchmarking across network topologies
  • Broader runtime coverage: stronger SGLang and TensorRT-LLM integration alongside vLLM
  • Day-2 operations: observability (metrics, tracing), rolling upgrades without transfer disruption, and multi-tenant isolation

The longer arc: ModelExpress becomes the weight management layer for inference and RL systems, the piece that makes model placement, scaling, and migration fast enough that the orchestrator can treat GPU memory as a fungible resource across the cluster.

Contributors

Thank you to everyone who contributed to this release, especially the sustained effort that landed NIXL P2P and metadata coherently across dozens of PRs, and to all reviewers and testers who exercised the Kubernetes and multi-node paths.

Full changelog

v0.2.2...v0.3.0

ModelExpress Release v0.2.2

12 Feb 18:37
e01eff3

ModelExpress - Release 0.2.2

Summary

The ModelExpress 0.2.2 release introduces gRPC-based weight transfer for improved peer-to-peer model sharing, comprehensive Helm chart support for production Kubernetes deployments, and extensive enhancements to Hugging Face model handling. Combined with critical bug fixes, enhanced configuration management, and significantly improved documentation, this release delivers a more robust, production-ready experience for teams deploying AI models at scale.

Key Highlights

gRPC Weight Transfer

The headline feature of this release is the introduction of gRPC-based weight transfer (#115), enabling efficient peer-to-peer model distribution between ModelExpress instances. This foundational capability paves the way for advanced model sharing architectures and reduced download times in distributed environments.

Production-Ready Kubernetes Support

Complete Helm chart support (#69) makes deploying ModelExpress to Kubernetes environments straightforward and maintainable. Updated examples now work with the latest Dynamo Operator (#105), and the Kubernetes configuration has been tested with both standalone ModelExpress and aggregated Dynamo deployments (#31).

Enhanced Hugging Face Integration

Improved Hugging Face model handling, with sub-directory exclusion (#108), selective weight downloading (#77), model name mapping (#73), and API enhancements (#7), provides greater flexibility and efficiency when working with Hugging Face Hub models.

Features & Enhancements

Model Distribution & Performance

  • gRPC Weight Transfer: Introduced peer-to-peer weight transfer via gRPC, enabling efficient model sharing between ModelExpress instances (#115)
  • High-CPU Download Mode: Enabled high-CPU download capabilities for faster model acquisition in compute-rich environments (#42)
  • Selective Weight Download: Added support for the ignore_weights parameter, allowing users to download models without specific weight files for reduced storage usage (#77)
  • HF Sub-Directory Handling: ModelExpress now intelligently ignores Hugging Face sub-directories during operations, preventing errors and improving compatibility (#108)
  • HF Name Mapping: Added support for mapping model names back to their original Hugging Face identifiers (#73)
  • HF API Enhancements: Improved the Hugging Face downloading API for better reliability and functionality (#7)

Kubernetes & Deployment

  • Helm Charts: Introduced official Helm charts for streamlined ModelExpress server deployment in Kubernetes environments (#69)
  • Full K8s Integration: Provided complete Kubernetes configuration supporting both standalone ModelExpress and aggregated Dynamo deployments (#31)
  • Ubuntu 24.04 Base: Migrated base image to Ubuntu 24.04 for improved security and modern package support (#84)

Configuration & Integration

  • Environment Variable Support: Extended environment variable configuration for cache settings, ports, and logging levels, simplifying containerized deployments (#68, #55)
  • Trait Interface for Providers: Introduced a clean trait interface for model providers, improving extensibility and maintainability (#12)
  • Dynamo Integration API: Added get_model_path API specifically for Dynamo integration scenarios (#75)
  • Versioning Consolidation: Moved versioning and dependency references to the top-level Cargo.toml for easier maintenance (#86)

Tooling & Developer Experience

  • ModelExpress CLI: Introduced the Model Express Cache CLI for command-line management of cached models (#6)
  • Local Cache Configuration: Added ability to update local Hugging Face model cache directory from configuration files (#18)
  • DevContainer Environment: Created a basic devcontainer setup for consistent development environments (#19)
  • Repository Rules: Added Copilot and Cursor repository rules to enhance AI-assisted development workflows (#33)
  • Contributing Guidelines: Added comprehensive CONTRIBUTING.md file with DCO bot integration for streamlined contributions (#93)
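
The selective weight download option (#77) amounts to filtering the repository file list before fetching. A minimal sketch, assuming the weight files are identified by the `.bin` and `.safetensors` extensions that the v0.2.0 notes mention; the `files_to_download` helper is a hypothetical name, not the ModelExpress API.

```python
from pathlib import PurePosixPath
from typing import Iterable

# Extensions treated as weight files; taken from the v0.2.0 notes.
WEIGHT_SUFFIXES = {".bin", ".safetensors"}


def files_to_download(repo_files: Iterable[str], ignore_weights: bool) -> list[str]:
    """Return the files to fetch, dropping weight files when requested."""
    if not ignore_weights:
        return list(repo_files)
    return [f for f in repo_files if PurePosixPath(f).suffix not in WEIGHT_SUFFIXES]


repo = [
    "config.json",
    "tokenizer.json",
    "model-00001-of-00002.safetensors",
    "pytorch_model.bin",
]
print(files_to_download(repo, ignore_weights=True))  # ['config.json', 'tokenizer.json']
```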

Bug Fixes & Stability

Critical Fixes

  • Race Condition Fix: Resolved a potential race condition in the initial model download process that could cause intermittent failures (#2)
  • Concurrent Download Improvements: Enhanced error handling and retry logic for concurrent model downloads, significantly improving stability under load (#46)
  • gRPC Port Configuration: Fixed gRPC port usage to ensure proper service communication (#9)
  • Shared Storage Handling: Fixed preload functionality to properly follow the shared_storage parameter (#125)
  • CLI Argument Flattening: Corrected argument parsing by properly flattening CLI arguments from the common structure (#123)

Configuration & Validation

  • Environment Variable Override: Fixed a bug where environment variables were not correctly overriding configuration file settings (#48)
  • Config File Validation: Improved configuration file validation with clearer error messages (#44)
  • Custom Config Serialization: Resolved a serialization bug affecting custom configuration settings (#76)
  • Home Directory Expansion: Fixed tilde (~) expansion to correctly resolve to the user's home directory for cache paths (#74)

Kubernetes Fixes

  • K8s Deployment Issues: Resolved deployment problems in Kubernetes environments (#20)
  • PVC Cache Configuration: Fixed Persistent Volume Claim (PVC) cache directory configuration for Kubernetes deployments (#71)
  • ServiceAccount YAML: Removed problematic trimming from ServiceAccount.yaml that was causing deployment issues (#101)
  • Helm Chart Naming: Corrected Helm chart naming to use "Modelexpress" consistently (#89)
  • Operator Compatibility: Updated Kubernetes examples to work with the latest Dynamo Operator version (#105)
  • SPDX Headers: Added required SPDX license headers to Helm chart files for compliance (#94)

Dependency & Compatibility

  • Tracing Subscriber Version: Loosened the tracing-subscriber dependency version to ensure compatibility with the Dynamo runtime (#79)
  • Rust 1.90 Upgrade: Upgraded to Rust 1.90 to resolve continuous integration issues and maintain build stability (#109)
  • Security Audit Fixes: Resolved unlicensed dependency errors flagged by security audits (#39)

Endpoint & Naming

  • Default Endpoint Handling: Fixed default endpoint configuration to ensure proper service discovery (#28)
  • Container Naming: Updated container name and version references for consistency (#87)
  • CLI Naming Consistency: Renamed model-express-cli to modelexpress-cli for consistency across the project (#103)
  • Image References: Updated references to point to the new release container images (#92)

Housekeeping

  • Bash Default Removed: Removed setting default shell to bash for better cross-platform compatibility (#67)
  • Version Bumps: Updated version numbers for the 0.2.2 release cycle (#114, #121)

Looking Ahead
With gRPC weight transfer now available, ModelExpress is positioned to enable sophisticated peer-to-peer model distribution patterns. The foundation laid in this release—including Helm charts, enhanced Hugging Face integration, and robust Kubernetes support—prepares the platform for enterprise-scale deployments. Future releases will focus on optimizing transfer performance, expanding provider integrations, and...

ModelExpress Release v0.2.1

05 Dec 00:49
ebae023

ModelExpress - Release 0.2.1

Summary

ModelExpress 0.2.1 is a maintenance release focused on stability and critical fixes. This update incorporates backported fixes to ensure a smoother and more reliable deployment experience for users. It addresses issues identified in previous versions, specifically around CI stability, model directory handling, and system observability, and prepares the environment for future feature updates.

Bug Fixes

  • Upgraded Rust Version to 1.90: Upgraded the Rust compiler to version 1.90 to resolve continuous integration (CI) issues, ensuring compatibility with the latest Rust features and maintaining build stability.
  • Ignore Hugging Face Sub-Directories: Updated the system to exclude Hugging Face sub-directories during operations, preventing potential errors and improving compatibility with Hugging Face models.
  • Improved Logging Mechanism: Enhanced the logging system to provide more detailed and informative logs, facilitating easier debugging and monitoring.

Known Limitations

  • Aggregated Kubernetes Example Deployment Failure: The examples/aggregated_k8s/agg.yaml configuration file currently fails to deploy with Dynamo 0.7.0. This is due to the deprecation and removal of the pvc field under spec.services.<serviceName> in the newer Dynamo CRD. Users attempting to deploy this example will encounter a strict decoding error.
  • Workaround: Update the agg.yaml file to adhere to the new API format by defining PVCs at the spec.pvcs level and referencing them using spec.services.<name>.volumeMounts.

Full Changelog

v0.2.0...v0.2.1

ModelExpress v0.2.0

09 Oct 17:04
3e24472

ModelExpress v0.2.0 Release Notes

This release marks a significant step forward for ModelExpress, evolving it from a foundational service to a deployable, production-ready component for large-scale inference. The key themes for this release are Performance, Kubernetes Integration, and Enhanced Configuration. We've introduced a full Helm chart for easy deployment, significantly improved download performance, and added critical features for integration with inference servers like Dynamo.

Features & Enhancements

  • High-Performance Downloads (--high): You can now enable a high-CPU download mode that multiplexes downloads to better saturate high-bandwidth network connections, dramatically speeding up model fetching. (#42)
  • Helm Chart for Kubernetes Deployment: A complete Helm chart has been added, allowing you to deploy a production-ready ModelExpress server to any Kubernetes cluster with a single command. (#69)
  • End-to-End Dynamo Integration Example: We've added a full Kubernetes configuration example demonstrating how to run ModelExpress as a sidecar with an aggregated Dynamo deployment, providing a clear blueprint for production use. (#31)
  • get_model_path API for Seamless Integration: A new get_model_path API has been added, which is a critical function for integrating with inference servers like Dynamo that need to resolve the local path of a model. (#75)
  • Support for Partial Downloads (--ignore-weights): You can now download model files while ignoring the large weight files (.bin, .safetensors). This is useful for quickly fetching tokenizer and configuration files for validation or development. (#77)
  • Improved Model Name Mapping: The server can now correctly map the Hugging Face cache folder names (e.g., models--google--gemma-7b) back to their human-readable IDs (google/gemma-7b). (#73)
  • Official Dockerfile Compliance: The Dockerfile has been updated to meet OSRB compliance standards, ensuring it's secure and ready for enterprise environments. (#83)
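
The name mapping in #73 reverses the Hugging Face cache folder convention shown in the example above. A minimal sketch of that reversal, assuming the `models--<org>--<name>` layout; `cache_folder_to_model_id` is a hypothetical helper, not the server's actual function.

```python
def cache_folder_to_model_id(folder: str) -> str:
    """Map a Hugging Face cache folder name back to a model ID.

    Only the first "--" after the prefix is treated as the org/name
    separator, so a model name that itself contains "--" is left intact.
    """
    prefix = "models--"
    if not folder.startswith(prefix):
        raise ValueError(f"not a model cache folder: {folder!r}")
    org, sep, name = folder[len(prefix):].partition("--")
    return f"{org}/{name}" if sep else org


print(cache_folder_to_model_id("models--google--gemma-7b"))  # google/gemma-7b
```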

Deployment & Configuration

  • Expanded Environment Variable Support: You can now configure cache settings, ports, and logging levels directly through environment variables, making containerized deployments more flexible. (#68, #55)
  • Corrected Kubernetes PVC Configuration: The cache directory configuration for Kubernetes deployments has been fixed, ensuring that the Persistent Volume Claim (PVC) is correctly utilized for model storage. (#71)
  • Configuration Overriding Fix: Fixed a bug where environment variables were not correctly overriding settings from a configuration file, ensuring a predictable configuration hierarchy. (#48)
  • Improved Config File Validation: The server now provides clearer error messages when validating configuration files. (#44)
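
The configuration hierarchy restored by #48 (environment variable first, then configuration file, then built-in default) can be sketched as follows. The `MODEL_EXPRESS_` prefix, the setting names, and the `resolve_setting` helper are hypothetical and only illustrate the precedence; consult the README for the real variable names.

```python
import os
from typing import Mapping, Optional


def resolve_setting(
    name: str,
    file_config: Mapping[str, object],
    default: str,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Resolve one setting with precedence: environment > config file > default."""
    env = os.environ if env is None else env
    env_key = f"MODEL_EXPRESS_{name.upper()}"  # hypothetical naming scheme
    if env_key in env:
        return env[env_key]
    if name in file_config:
        return str(file_config[name])
    return default


file_config = {"log_level": "info", "port": 8001}
env = {"MODEL_EXPRESS_LOG_LEVEL": "debug"}
print(resolve_setting("log_level", file_config, "warn", env))  # debug
print(resolve_setting("port", file_config, "9000", env))       # 8001
```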

Bug Fixes & Stability

  • Improved Concurrent Download Stability: Enhanced the error handling and retry logic for concurrent model downloads, making the server more resilient under high load. (#46)
  • Correct Home Directory Expansion: Fixed a bug where the tilde (~) character was not correctly expanding to the user's home directory for cache paths. (#74)
  • Dependency Version Fix: Loosened the tracing-subscriber dependency version to resolve conflicts and ensure smooth integration with the Dynamo runtime. (#79)
  • Serialization Bug Fix: Corrected a bug related to the serialization of custom configuration settings. (#76)
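
For readers unfamiliar with the tilde issue fixed in #74: a path such as `~/.cache/...` is only meaningful after the shell, or the program itself, expands `~`. ModelExpress is written in Rust, so the Python sketch below only illustrates the behavior; the cache path is a made-up example.

```python
from pathlib import Path


def resolve_cache_dir(raw: str) -> Path:
    """Expand a leading "~" to the current user's home directory."""
    return Path(raw).expanduser()


# Paths read from a config file never pass through a shell, so the
# program has to expand "~" itself before using the path.
print(resolve_cache_dir("~/.cache/example-models"))
print(resolve_cache_dir("/var/cache/example-models"))  # absolute paths are unchanged
```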

Housekeeping & Documentation

  • Code Cleanup: Removed dead and redundant code to improve maintainability. (#41, #43)
  • Updated Documentation: The README and other documentation files have been updated to reflect the latest changes and remove deprecated information. (#57, #70)
  • Build System: Crate names have been updated for consistency. (#65)

Looking Ahead

With the foundational Kubernetes integration now in place, our next major focus is to unlock the next level of performance by enabling direct peer-to-peer (P2P) model transfers with NIXL. Stay tuned for updates!

New Contributors

Full Changelog: https://github.com/ai-dynamo/modelexpress/compare/v0.1.0...v0.2.0

modelexpress v0.1.0

26 Aug 22:50
2309e58

ModelExpress v0.1.0 is the first release of Dynamo's ModelExpress and our first alpha release.
ModelExpress is a Rust-based client-server system designed to accelerate the loading of inference models in a distributed Kubernetes cluster.
Please refer to our README for more information and guides on how to use ModelExpress in Kubernetes.
This release comes with 3 Rust crates that can be found on crates.io:

You can also install and run ModelExpress by downloading this release and following the build instructions.