From ca4cffc12d64924c374e6598572e68f8501cc8f3 Mon Sep 17 00:00:00 2001
From: Ganesh Kudleppanavar
Date: Thu, 30 Apr 2026 11:09:44 -0500
Subject: [PATCH 01/17] docs: refine README storage and provider support
---
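Note for reviewers (not part of the commit message): the `MX_MODEL_URI` row added to the configuration table accepts `s3://`, `gs://`, `az://`, absolute local paths, or Hugging Face model IDs. The sketch below shows one way such a value could be classified. `classify_model_uri` and the labels it returns are hypothetical names used for illustration only, and treating every non-matching value as a Hugging Face model ID is an assumption rather than documented ModelExpress behavior.

```python
# Hypothetical helper for illustration; not part of the ModelExpress API.
# The prefixes mirror the MX_MODEL_URI description in the README table.
SCHEME_KINDS = {
    "s3://": "s3",
    "gs://": "gcs",
    "az://": "azure_blob",
}


def classify_model_uri(uri: str) -> str:
    """Return a coarse source label for an MX_MODEL_URI value."""
    for prefix, kind in SCHEME_KINDS.items():
        if uri.startswith(prefix):
            return kind
    # Absolute POSIX-style paths are read from the local filesystem.
    if uri.startswith("/"):
        return "local_path"
    # Assumption: anything else is resolved as a Hugging Face model ID.
    return "hf_model_id"
```

With this sketch, a `gs://` value maps to the GCS label, while a bare `org/model` string falls through to the Hugging Face case.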
README.md | 66 +++++++++++++++++++------------------------------------
1 file changed, 23 insertions(+), 43 deletions(-)
diff --git a/README.md b/README.md
index cd4923e7..787a13e4 100644
--- a/README.md
+++ b/README.md
@@ -40,9 +40,9 @@ ModelExpress is a Rust-based service that manages the complete model weight life
### How ModelExpress manages weights in the cluster
-ModelExpress orchestrates the full flow—from download to GPU memory. It ensures only one node downloads a model from external sources (e.g., HuggingFace); other nodes receive weights via P2P or shared storage—eliminating duplicate downloads and reducing cluster ingress.
+ModelExpress orchestrates the full flow—from download to GPU memory. It ensures only one node downloads or streams a model from external sources (for example Hugging Face, NGC, GCS, or object storage through ModelStreamer); other nodes receive weights via P2P or shared storage—eliminating duplicate downloads and reducing cluster ingress.
-1. **Download from HuggingFace** — One node pulls the model; ModelExpress coordinates so no other node duplicates this download, reducing external ingress. In air-gapped mode, serve from cache only (`HF_HUB_OFFLINE=1`).
+1. **Download or stream from external storage** — One node pulls or streams the model from Hugging Face, NGC, GCS, or object storage through ModelStreamer; ModelExpress coordinates so no other node duplicates this work. In air-gapped mode, serve from cache only (`HF_HUB_OFFLINE=1`).
2. **Persist to disk** — Store in a cache backed by disk:
- **Host-attached disk** — Local disk on the node (single-node or per-node cache).
- **PVC** — RWO (ReadWriteOnce) for single-node; RWX (ReadWriteMany) for shared access across nodes.
@@ -54,9 +54,12 @@ ModelExpress orchestrates the full flow—from download to GPU memory. It ensure
## Features
- **Cold start reduction** — GPU-to-GPU P2P transfer over InfiniBand instead of disk load
-- **HuggingFace caching** — PVC-backed cache, `HF_HUB_OFFLINE`, `ignore_weights`, `get_model_path` for Dynamo
+- **Model store providers** — built-in providers for Hugging Face, NVIDIA NGC, and Google Cloud Storage
+- **ModelStreamer loading** — stream weights from S3, GCS, Azure Blob, local paths, or Hugging Face cache into vLLM with `MX_MODEL_URI`
+- **GPUDirect Storage** — direct file-to-GPU loading path when GDS hardware and software are available
+- **Cache and path resolution** — PVC-backed cache, `HF_HUB_OFFLINE`, `ignore_weights`, `get_model_path` for Dynamo, and provider-specific cache layouts
- **P2P GPU transfer** — vLLM `mx` loader and TRT-LLM `PRESHARDED` loader with NVIDIA NIXL over RDMA
-- **Metadata backends** — In-memory, Redis, or Kubernetes CRD (layered write-through for HA)
+- **Metadata backends** — Redis or Kubernetes CRD for distributed coordination
- **Kubernetes** — Helm chart, CRDs/Redis for P2P, no-shared-storage support
- **CLI** — Health, download, list, validate, clear; init-container support for pre-warming
@@ -69,6 +72,16 @@ ModelExpress orchestrates the full flow—from download to GPU memory. It ensure
| TensorRT-LLM | `LoadFormat.PRESHARDED` with `MxLiveCheckpointLoader` for P2P weight transfer (beta) — [TRT-LLM examples](examples/p2p_transfer_k8s/client/trtllm/) |
| SGLang | Coming soon |
+### Model Store Providers
+
+ModelExpress supports two storage-access paths:
+
+| Path | Supported sources |
+|------|-------------------|
+| Model providers | Hugging Face, NVIDIA NGC, Google Cloud Storage |
+| ModelStreamer (`MX_MODEL_URI`) | S3 / S3-compatible, GCS, Azure Blob Storage, local filesystem, and Hugging Face cache-resolved model IDs |
+| GPUDirect Storage | Local filesystem or cached model files loaded directly to GPU |
+
---
## ModelExpress Architecture
@@ -96,9 +109,9 @@ ModelExpress orchestrates the full flow—from download to GPU memory. It ensure
*Source and Target exchange metadata with the server for coordination; weights transfer directly over RDMA between GPUs.*
-- **modelexpress_server**: gRPC server with configurable metadata backends (Redis, Kubernetes CRD).
-- **modelexpress_client**: Rust CLI for cache management; Python package with vLLM loaders and `MxClient` for gRPC.
-- **modelexpress_common**: Protobuf definitions, provider trait (HuggingFace), shared configuration.
+- **modelexpress_server**: gRPC server with distributed metadata backends (Redis or Kubernetes CRD)
+- **modelexpress_client**: Rust CLI for cache management; Python package with vLLM loaders and `MxClient`
+- **modelexpress_common**: Protobuf definitions, provider abstractions, and shared configuration
See [Architecture](docs/ARCHITECTURE.md).
@@ -130,7 +143,7 @@ modelexpress-cli health
```
**Without shared storage:** use `--no-shared-storage` for gRPC streaming.
-**Air-gapped:** `HF_HUB_OFFLINE=1 modelexpress-cli model get `.
+**Air-gapped:** `HF_HUB_OFFLINE=1 modelexpress-cli model download `.
---
@@ -174,6 +187,7 @@ docker-compose up --build
| `MX_METADATA_BACKEND` | (required) | `redis` \| `kubernetes` |
| `REDIS_URL` | `redis://localhost:6379` | Redis connection URL (`redis` backend only) |
| `MODEL_EXPRESS_URL` | `localhost:8001` | gRPC server (P2P) |
+| `MX_MODEL_URI` | (none) | Enable ModelStreamer with `s3://`, `gs://`, `az://`, absolute local paths, or Hugging Face model IDs |
| `UCX_TLS` | `rc_x,rc,dc_x,dc,cuda_copy` | InfiniBand transports |
```bash
@@ -185,20 +199,6 @@ Full reference: [docs/DEPLOYMENT.md](docs/DEPLOYMENT.md).
---
-## CLI
-
-```bash
-modelexpress-cli health
-modelexpress-cli model download
-modelexpress-cli model list
-modelexpress-cli model validate
-modelexpress-cli model clear
-```
-
-[CLI Reference](docs/CLI.md)
-
----
-
## Testing
```bash
@@ -218,6 +218,7 @@ cargo bench
| [Deployment](docs/DEPLOYMENT.md) | Server/client config, Docker, K8s, P2P |
| [Architecture](docs/ARCHITECTURE.md) | Components, gRPC, NIXL, FP8 |
| [CLI](docs/CLI.md) | Full CLI reference |
+| [GCS Provider](docs/GCS_PROVIDER.md) | GCS provider design, cache layout, manifest behavior, and credentials |
| [Metadata](docs/metadata.md) | Redis keys, K8s CRD schema |
| [Helm](helm/README.md) | Kubernetes configuration |
@@ -233,27 +234,6 @@ cargo bench
---
-## Roadmap
-
-### Priorities Under Development
-
-- **P2P compile/warmup caching**: torch.compile/deepGEMM cache for follower workers. Leader performs full warmup; followers consume caches and skip full warmup.
-- **ModelStreamer Integration**: Pull weights from cold storage with multi-cloud and multi-engine support.
-- **DRAM and NVMe-resident shard streaming**: Stream shards across workers while keeping weights in DRAM and host local high-speed NVMe.
-- **RL workloads**: Explore fast P2P transfers to optimize RL refit phase and support for weight resharding.
-- **Earlier weight availability**: Bring weights to prefill earlier; identify prefill workers that can act as strong source nodes.
-- **Expanded model pull providers**: Support NGC in addition to Hugging Face.
-- **GDS (GPUDirect Storage) integration**: Load model weights directly from NVMe into GPU memory, bypassing the CPU/DRAM copy path.
-- **Multi-tier cache hierarchy**: Promote and demote models across DRAM, NVMe, and PVC tiers based on access patterns.
-- **Distributed sharded cache**: Shard large models across nodes using consistent hashing and parallel shard assembly.
-- **Training checkpoint management**: Cache and reuse CUDA kernel compilations (torch.compile, deepGEMM) and CUDA graphs across restarts.
-- **Metrics and observability**: Cache hit rates, eviction frequency, transfer throughput, and P2P RDMA utilization via Prometheus/OpenTelemetry.
-- **Predictive prefetching**: Pre-warm caches from workload history or scheduling hints.
-- **P2P transfer fault tolerance**: Auto-recovery from stale rkeys on source restart; retry and fallback to storage loading.
-- **Multi-cloud storage backends**: Native support for AWS S3, Azure Blob, and NFS as model pull sources.
-
----
-
## Contributing
Contributions welcome. See [CONTRIBUTING.md](CONTRIBUTING.md).
From 66534700a0e4ededa743e1607f8c3fc423500fb0 Mon Sep 17 00:00:00 2001
From: Ganesh Kudleppanavar
Date: Thu, 30 Apr 2026 11:19:29 -0500
Subject: [PATCH 02/17] docs: clarify README ingress reduction
---
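Note for reviewers (not part of the commit message): the sentence added below claims that only one model copy is downloaded even when several clients request it concurrently. A minimal in-process sketch of that "single flight" idea follows. It uses threads and a lock as a stand-in, and assumes nothing about how ModelExpress actually coordinates through Redis or Kubernetes CRDs; `SingleFlight` and `fetch` are hypothetical names.

```python
import threading


class SingleFlight:
    """Run fetch(key) at most once per key; concurrent callers share the result."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}  # key -> {"done": Event, "value": ..., "error": ...}

    def do(self, key, fetch):
        # Decide under the lock whether this caller is the one that downloads.
        with self._lock:
            entry = self._entries.get(key)
            is_leader = entry is None
            if is_leader:
                entry = {"done": threading.Event(), "value": None, "error": None}
                self._entries[key] = entry

        if is_leader:
            try:
                entry["value"] = fetch(key)  # the single external download
            except Exception as exc:
                # Simplification: a failed entry is kept; a real system
                # would evict it so that a later request can retry.
                entry["error"] = exc
            finally:
                entry["done"].set()
        else:
            entry["done"].wait()  # followers block until the leader finishes

        if entry["error"] is not None:
            raise entry["error"]
        return entry["value"]
```

Completed entries are never removed, so the sketch also behaves like a cache: later requests for the same key return the stored value without calling `fetch` again.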
README.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/README.md b/README.md
index 787a13e4..9872dd12 100644
--- a/README.md
+++ b/README.md
@@ -36,7 +36,7 @@ ModelExpress is a Rust-based service that manages the complete model weight life
| LLM serving problem | How ModelExpress helps |
|---------------------|------------------------|
| **Models take too long to load** | GPU-to-GPU transfer via NIXL/RDMA instead of loading from storage. In P2P mode, weights already serving inference act as the cache—no extra storage. |
-| **Many nodes need the same model** | Metadata backends (Redis, K8s CRD) coordinate sharing: one node loads; others receive via P2P or local paths. |
+| **Many nodes need the same model** | Metadata backends (Redis, K8s CRD) coordinate sharing: one node loads; others receive via P2P or local paths. This reduces ingress bandwidth from external providers such as Hugging Face and ensures only one model copy is downloaded even when multiple clients request the same model concurrently. |
### How ModelExpress manages weights in the cluster
From e5e3f65ea030f8eb48d9adeb4333c3a1ad019f0f Mon Sep 17 00:00:00 2001
From: Ganesh Kudleppanavar
Date: Thu, 30 Apr 2026 11:24:03 -0500
Subject: [PATCH 03/17] docs: align README with latest mainline features
---
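Note for reviewers (not part of the commit message): the new "Distributed Backend Prerequisites" section lists `redis` and `kubernetes` as the two backends, and the configuration table marks `MX_METADATA_BACKEND` as required with `REDIS_URL` defaulting to `redis://localhost:6379`. The sketch below shows how a startup check along those lines might look. `resolve_backend` and the returned dictionary shape are hypothetical; the real server's validation may differ.

```python
VALID_BACKENDS = ("redis", "kubernetes")
DEFAULT_REDIS_URL = "redis://localhost:6379"  # default listed in the README table


def resolve_backend(env):
    """Validate backend settings from an environment-like mapping (hypothetical)."""
    backend = env.get("MX_METADATA_BACKEND")
    if backend not in VALID_BACKENDS:
        raise ValueError(
            "MX_METADATA_BACKEND must be one of %s, got %r" % (VALID_BACKENDS, backend)
        )
    config = {"backend": backend}
    if backend == "redis":
        # REDIS_URL only applies to the redis backend.
        config["redis_url"] = env.get("REDIS_URL", DEFAULT_REDIS_URL)
    return config
```

Passing `os.environ` as `env` would apply the same check to a real process environment.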
README.md | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
diff --git a/README.md b/README.md
index 9872dd12..0c0c2736 100644
--- a/README.md
+++ b/README.md
@@ -54,6 +54,7 @@ ModelExpress orchestrates the full flow—from download to GPU memory. It ensure
## Features
- **Cold start reduction** — GPU-to-GPU P2P transfer over InfiniBand instead of disk load
+- **Distributed registry** — model download state, cache lifecycle, and P2P metadata coordinated through Redis or Kubernetes CRDs
- **Model store providers** — built-in providers for Hugging Face, NVIDIA NGC, and Google Cloud Storage
- **ModelStreamer loading** — stream weights from S3, GCS, Azure Blob, local paths, or Hugging Face cache into vLLM with `MX_MODEL_URI`
- **GPUDirect Storage** — direct file-to-GPU loading path when GDS hardware and software are available
@@ -158,6 +159,19 @@ helm install modelexpress ./helm --namespace modelexpress --create-namespace
Override [values-production.yaml](helm/values-production.yaml) for your env. Full config: [helm/README.md](helm/README.md).
+### Distributed Backend Prerequisites
+
+ModelExpress requires a distributed backend for model registry state and P2P coordination:
+
+- `redis` for Redis-backed deployments
+- `kubernetes` for CRD-backed deployments
+
+For the Kubernetes backend, install the CRDs before starting the server:
+
+```bash
+kubectl apply -f examples/crds.yaml
+```
+
### P2P GPU Transfer (vLLM)
```python
@@ -168,6 +182,12 @@ register_modelexpress_loaders()
First instance loads from disk; subsequent instances receive via RDMA. [P2P guide](examples/p2p_transfer_k8s/README.md) · [Server setup](examples/p2p_transfer_k8s/server/README.md).
+### Example Deployments
+
+- [vLLM P2P transfer](examples/p2p_transfer_k8s/README.md)
+- [Dynamo P2P transfer](examples/dynamo_p2p_transfer_k8s/README.md)
+- [TensorRT-LLM beta examples](examples/p2p_transfer_k8s/client/trtllm/README.md)
+
### Docker
```bash
From 5934f9c5d4a8429e8cabc70ba51122b6796e758f Mon Sep 17 00:00:00 2001
From: Ganesh Kudleppanavar
Date: Thu, 30 Apr 2026 11:24:47 -0500
Subject: [PATCH 04/17] docs: add air-gapped README guidance
---
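Note for reviewers (not part of the commit message): the air-gapped guidance added below boils down to two environment checks, `HF_HUB_OFFLINE=1` for cache-only Hugging Face resolution and a local filesystem path for `MX_MODEL_URI`. A small preflight sketch under those assumptions follows. `air_gap_warnings` is a hypothetical helper, not a ModelExpress command, and the list of remote prefixes is taken from the `MX_MODEL_URI` description only.

```python
REMOTE_PREFIXES = ("s3://", "gs://", "az://")


def air_gap_warnings(env):
    """Return human-readable warnings for settings that imply network access."""
    warnings = []
    if env.get("HF_HUB_OFFLINE") != "1":
        warnings.append(
            "HF_HUB_OFFLINE is not set to 1; Hugging Face lookups may attempt network access"
        )
    uri = env.get("MX_MODEL_URI", "")
    # str.startswith accepts a tuple of prefixes.
    if uri.startswith(REMOTE_PREFIXES):
        warnings.append(
            "MX_MODEL_URI points at remote object storage; mirror it to a local path first"
        )
    return warnings
```

An empty list means neither check found a setting that would reach outside the environment; it does not prove the deployment is fully disconnected.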
README.md | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/README.md b/README.md
index 0c0c2736..b415e12b 100644
--- a/README.md
+++ b/README.md
@@ -83,6 +83,16 @@ ModelExpress supports two storage-access paths:
| ModelStreamer (`MX_MODEL_URI`) | S3 / S3-compatible, GCS, Azure Blob Storage, local filesystem, and Hugging Face cache-resolved model IDs |
| GPUDirect Storage | Local filesystem or cached model files loaded directly to GPU |
+### Air-Gapped Environments
+
+ModelExpress supports air-gapped deployments when model files are already present inside the environment.
+
+- Use a pre-populated local cache or a mounted local/PVC path as the source of truth.
+- For Hugging Face cache-only operation, set `HF_HUB_OFFLINE=1`; ModelExpress resolves models from the local HF cache and does not attempt network access.
+- For fully disconnected runtime loading, point `MX_MODEL_URI` at a local filesystem path so ModelStreamer reads from local storage instead of external object stores.
+- Once one source pod has loaded the model, additional pods can receive the weights through P2P RDMA without re-downloading from an external provider.
+- External providers such as NGC, GCS, S3, and Azure Blob still require network reachability unless their contents are mirrored into local storage inside the air-gapped environment.
+
---
## ModelExpress Architecture
From f672d51e4c1f7314f17e3ceb990c838ca4931bb0 Mon Sep 17 00:00:00 2001
From: Ganesh Kudleppanavar
Date: Thu, 30 Apr 2026 11:26:29 -0500
Subject: [PATCH 05/17] docs: link deployment guide to air-gapped overview
---
docs/DEPLOYMENT.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/DEPLOYMENT.md b/docs/DEPLOYMENT.md
index 13eb66e7..90b46f20 100644
--- a/docs/DEPLOYMENT.md
+++ b/docs/DEPLOYMENT.md
@@ -5,7 +5,7 @@ SPDX-License-Identifier: Apache-2.0
# ModelExpress Deployment Guide
-User-facing guide for configuring and deploying ModelExpress. For architecture details, see [`ARCHITECTURE.md`](ARCHITECTURE.md). For development setup, see [`../CONTRIBUTING.md`](../CONTRIBUTING.md).
+User-facing guide for configuring and deploying ModelExpress. For architecture details, see [`ARCHITECTURE.md`](ARCHITECTURE.md). For development setup, see [`../CONTRIBUTING.md`](../CONTRIBUTING.md). For a concise overview of offline operation, see the air-gapped section in [`../README.md`](../README.md).
## Server Configuration
From 5288458114441cd163af63420924dfca4f17019c Mon Sep 17 00:00:00 2001
From: Ganesh Kudleppanavar
Date: Thu, 30 Apr 2026 11:27:52 -0500
Subject: [PATCH 06/17] docs: add ingress problem statement to README intro
---
README.md | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/README.md b/README.md
index b415e12b..8a43d894 100644
--- a/README.md
+++ b/README.md
@@ -18,6 +18,10 @@ SPDX-License-Identifier: Apache-2.0
Model weight management for LLM inference — cache, transfer, and serve weights at scale with GPU-to-GPU RDMA and multi-node coordination.
+
+ Reduce repeated ingress from external model providers by ensuring only one copy of a model is downloaded even when many clients request it concurrently.
+
+
Features •
Architecture •
From 6c1f3509195c8cfa7d47a3de72c68f395cf78512 Mon Sep 17 00:00:00 2001
From: Ganesh Kudleppanavar
Date: Thu, 30 Apr 2026 11:29:31 -0500
Subject: [PATCH 07/17] docs: clarify README download responsibility
---
README.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/README.md b/README.md
index 8a43d894..a3903e09 100644
--- a/README.md
+++ b/README.md
@@ -46,7 +46,7 @@ ModelExpress is a Rust-based service that manages the complete model weight life
ModelExpress orchestrates the full flow—from download to GPU memory. It ensures only one node downloads or streams a model from external sources (for example Hugging Face, NGC, GCS, or object storage through ModelStreamer); other nodes receive weights via P2P or shared storage—eliminating duplicate downloads and reducing cluster ingress.
-1. **Download or stream from external storage** — One node pulls or streams the model from Hugging Face, NGC, GCS, or object storage through ModelStreamer; ModelExpress coordinates so no other node duplicates this work. In air-gapped mode, serve from cache only (`HF_HUB_OFFLINE=1`).
+1. **Download or stream from external storage** — The ModelExpress server pulls the model from Hugging Face, NGC, or GCS, or a client streams it through ModelStreamer from object storage or local disk; ModelExpress coordinates so no other node duplicates this work. In air-gapped mode, serve from cache only (`HF_HUB_OFFLINE=1`).
2. **Persist to disk** — Store in a cache backed by disk:
- **Host-attached disk** — Local disk on the node (single-node or per-node cache).
- **PVC** — RWO (ReadWriteOnce) for single-node; RWX (ReadWriteMany) for shared access across nodes.
From 0b5cb055f78b89c7c8b9584fce0f9996f071897b Mon Sep 17 00:00:00 2001
From: Ganesh Kudleppanavar
Date: Thu, 30 Apr 2026 11:45:51 -0500
Subject: [PATCH 08/17] docs: rewrite README in product style
---
README.md | 37 ++++++++++++++++++-------------------
1 file changed, 18 insertions(+), 19 deletions(-)
diff --git a/README.md b/README.md
index a3903e09..9aefab34 100644
--- a/README.md
+++ b/README.md
@@ -15,7 +15,7 @@ SPDX-License-Identifier: Apache-2.0
Dynamo ModelExpress
- Model weight management for LLM inference — cache, transfer, and serve weights at scale with GPU-to-GPU RDMA and multi-node coordination.
+ Accelerate LLM startup and scale-out with intelligent model distribution
@@ -35,7 +35,7 @@ SPDX-License-Identifier: Apache-2.0
## Overview
-ModelExpress is a Rust-based service that manages the complete model weight lifecycle in the cluster—from acquisition to GPU memory. It accelerates LLM inference by caching, routing, and transferring weights through the fastest available path. Deploy standalone or as a sidecar alongside vLLM, NVIDIA Dynamo, and other inference runtimes.
+ModelExpress is a model distribution layer for LLM inference. It manages how model weights are acquired, cached, shared, and transferred across a cluster so inference systems can start faster, scale more efficiently, and avoid repeated downloads from external model providers. Deploy it as a standalone service or alongside runtimes such as vLLM, NVIDIA Dynamo, and TensorRT-LLM.
| LLM serving problem | How ModelExpress helps |
|---------------------|------------------------|
@@ -44,7 +44,7 @@ ModelExpress is a Rust-based service that manages the complete model weight life
### How ModelExpress manages weights in the cluster
-ModelExpress orchestrates the full flow—from download to GPU memory. It ensures only one node downloads or streams a model from external sources (for example Hugging Face, NGC, GCS, or object storage through ModelStreamer); other nodes receive weights via P2P or shared storage—eliminating duplicate downloads and reducing cluster ingress.
+ModelExpress orchestrates the weight lifecycle from external source to GPU memory. It minimizes repeated provider traffic, keeps cache state coordinated across the cluster, and routes each load through the most efficient available path.
1. **Download or stream from external storage** — The ModelExpress server pulls the model from Hugging Face, NGC, or GCS, or a client streams it through ModelStreamer from object storage or local disk; ModelExpress coordinates so no other node duplicates this work. In air-gapped mode, serve from cache only (`HF_HUB_OFFLINE=1`).
2. **Persist to disk** — Store in a cache backed by disk:
@@ -57,16 +57,15 @@ ModelExpress orchestrates the full flow—from download to GPU memory. It ensure
## Features
-- **Cold start reduction** — GPU-to-GPU P2P transfer over InfiniBand instead of disk load
-- **Distributed registry** — model download state, cache lifecycle, and P2P metadata coordinated through Redis or Kubernetes CRDs
-- **Model store providers** — built-in providers for Hugging Face, NVIDIA NGC, and Google Cloud Storage
-- **ModelStreamer loading** — stream weights from S3, GCS, Azure Blob, local paths, or Hugging Face cache into vLLM with `MX_MODEL_URI`
-- **GPUDirect Storage** — direct file-to-GPU loading path when GDS hardware and software are available
-- **Cache and path resolution** — PVC-backed cache, `HF_HUB_OFFLINE`, `ignore_weights`, `get_model_path` for Dynamo, and provider-specific cache layouts
-- **P2P GPU transfer** — vLLM `mx` loader and TRT-LLM `PRESHARDED` loader with NVIDIA NIXL over RDMA
-- **Metadata backends** — Redis or Kubernetes CRD for distributed coordination
-- **Kubernetes** — Helm chart, CRDs/Redis for P2P, no-shared-storage support
-- **CLI** — Health, download, list, validate, clear; init-container support for pre-warming
+- **Reduce startup time** — shift model loads from storage-bound workflows to GPU-to-GPU RDMA over InfiniBand
+- **Reduce provider ingress** — coordinate downloads so concurrent requests share one external fetch instead of duplicating traffic
+- **Operate with distributed state** — keep model lifecycle state and P2P metadata in Redis or Kubernetes CRDs
+- **Support multiple model sources** — built-in providers for Hugging Face, NVIDIA NGC, and Google Cloud Storage
+- **Load from object storage** — use ModelStreamer with `MX_MODEL_URI` for S3, GCS, Azure Blob, local paths, or Hugging Face cache
+- **Use direct file-to-GPU loading** — enable GPUDirect Storage when hardware and software are available
+- **Integrate with inference runtimes** — vLLM `mx` loader and TensorRT-LLM `PRESHARDED` support for RDMA-based startup
+- **Deploy in Kubernetes** — use Helm, CRDs, Redis, shared storage, or no-shared-storage topologies
+- **Operate through CLI and APIs** — health, download, list, validate, and clear models with shared server/client interfaces
### Integrations
@@ -79,7 +78,7 @@ ModelExpress orchestrates the full flow—from download to GPU memory. It ensure
### Model Store Providers
-ModelExpress supports two storage-access paths:
+ModelExpress exposes a small set of storage-access patterns, depending on how you want weights delivered:
| Path | Supported sources |
|------|-------------------|
@@ -89,7 +88,7 @@ ModelExpress supports two storage-access paths:
### Air-Gapped Environments
-ModelExpress supports air-gapped deployments when model files are already present inside the environment.
+ModelExpress supports air-gapped deployments when model files are already available inside the environment.
- Use a pre-populated local cache or a mounted local/PVC path as the source of truth.
- For Hugging Face cache-only operation, set `HF_HUB_OFFLINE=1`; ModelExpress resolves models from the local HF cache and does not attempt network access.
@@ -122,11 +121,11 @@ ModelExpress supports air-gapped deployments when model files are already presen
└──────────────────┘ │ └──────────────────┘
```
-*Source and Target exchange metadata with the server for coordination; weights transfer directly over RDMA between GPUs.*
+*The server coordinates discovery and lifecycle state; the weight bytes move directly between GPUs.*
-- **modelexpress_server**: gRPC server with distributed metadata backends (Redis or Kubernetes CRD)
-- **modelexpress_client**: Rust CLI for cache management; Python package with vLLM loaders and `MxClient`
-- **modelexpress_common**: Protobuf definitions, provider abstractions, and shared configuration
+- **modelexpress_server**: control plane for downloads, cache state, and P2P coordination
+- **modelexpress_client**: Rust CLI and Python integration layer for runtime-facing workflows
+- **modelexpress_common**: shared protobufs, provider abstractions, and configuration types
See [Architecture](docs/ARCHITECTURE.md).
From 1bd782c85cd21af3ac832f7eea661d5336cf8416 Mon Sep 17 00:00:00 2001
From: Ganesh Kudleppanavar
Date: Thu, 30 Apr 2026 11:57:27 -0500
Subject: [PATCH 09/17] docs: restore roadmap and storage wording
---
README.md | 19 ++++++++++++++++++-
1 file changed, 18 insertions(+), 1 deletion(-)
diff --git a/README.md b/README.md
index 9aefab34..30ce5034 100644
--- a/README.md
+++ b/README.md
@@ -46,7 +46,7 @@ ModelExpress is a model distribution layer for LLM inference. It manages how mod
ModelExpress orchestrates the weight lifecycle from external source to GPU memory. It minimizes repeated provider traffic, keeps cache state coordinated across the cluster, and routes each load through the most efficient available path.
-1. **Download or stream from external storage** — The ModelExpress server pulls the model from Hugging Face, NGC, or GCS, or a client streams it through ModelStreamer from object storage or local disk; ModelExpress coordinates so no other node duplicates this work. In air-gapped mode, serve from cache only (`HF_HUB_OFFLINE=1`).
+1. **Download or stream from external storage** — The ModelExpress server pulls the model from Hugging Face, NGC, or GCS, or a client streams it through ModelStreamer from S3, Azure Blob Storage, other object storage, or local disk; ModelExpress coordinates so no other node duplicates this work. In air-gapped mode, serve from cache only (`HF_HUB_OFFLINE=1`).
2. **Persist to disk** — Store in a cache backed by disk:
- **Host-attached disk** — Local disk on the node (single-node or per-node cache).
- **PVC** — RWO (ReadWriteOnce) for single-node; RWX (ReadWriteMany) for shared access across nodes.
@@ -267,6 +267,23 @@ cargo bench
---
+## Roadmap
+
+### Priorities Under Development
+
+- **P2P compile/warmup caching**: torch.compile/deepGEMM cache for follower workers. Leader performs full warmup; followers consume caches and skip full warmup.
+- **DRAM and NVMe-resident shard streaming**: Stream shards across workers while keeping weights in DRAM and host local high-speed NVMe.
+- **RL workloads**: Explore fast P2P transfers to optimize RL refit phase and support for weight resharding.
+- **Earlier weight availability**: Bring weights to prefill earlier; identify prefill workers that can act as strong source nodes.
+- **Multi-tier cache hierarchy**: Promote and demote models across DRAM, NVMe, and PVC tiers based on access patterns.
+- **Distributed sharded cache**: Shard large models across nodes using consistent hashing and parallel shard assembly.
+- **Training checkpoint management**: Cache and reuse CUDA kernel compilations (torch.compile, deepGEMM) and CUDA graphs across restarts.
+- **Metrics and observability**: Cache hit rates, eviction frequency, transfer throughput, and P2P RDMA utilization via Prometheus/OpenTelemetry.
+- **Predictive prefetching**: Pre-warm caches from workload history or scheduling hints.
+- **P2P transfer fault tolerance**: Auto-recovery from stale rkeys on source restart; retry and fallback to storage loading.
+
+---
+
## Contributing
Contributions welcome. See [CONTRIBUTING.md](CONTRIBUTING.md).
From 861b3b45bde49d4347e70234cacb1bc9ebacbc57 Mon Sep 17 00:00:00 2001
From: Ganesh Kudleppanavar
Date: Thu, 30 Apr 2026 12:00:20 -0500
Subject: [PATCH 10/17] docs: broaden README product framing
---
README.md | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/README.md b/README.md
index 30ce5034..26298a60 100644
--- a/README.md
+++ b/README.md
@@ -35,7 +35,7 @@ SPDX-License-Identifier: Apache-2.0
## Overview
-ModelExpress is a model distribution layer for LLM inference. It manages how model weights are acquired, cached, shared, and transferred across a cluster so inference systems can start faster, scale more efficiently, and avoid repeated downloads from external model providers. Deploy it as a standalone service or alongside runtimes such as vLLM, NVIDIA Dynamo, and TensorRT-LLM.
+ModelExpress is a model distribution layer for large-model workloads. It manages how model weights are acquired, cached, shared, and transferred across a cluster so systems can start faster, scale more efficiently, and avoid repeated downloads from external model providers. Deploy it as a standalone service or alongside runtimes such as vLLM, NVIDIA Dynamo, and TensorRT-LLM.
| LLM serving problem | How ModelExpress helps |
|---------------------|------------------------|
@@ -63,7 +63,7 @@ ModelExpress orchestrates the weight lifecycle from external source to GPU memor
- **Support multiple model sources** — built-in providers for Hugging Face, NVIDIA NGC, and Google Cloud Storage
- **Load from object storage** — use ModelStreamer with `MX_MODEL_URI` for S3, GCS, Azure Blob, local paths, or Hugging Face cache
- **Use direct file-to-GPU loading** — enable GPUDirect Storage when hardware and software are available
-- **Integrate with inference runtimes** — vLLM `mx` loader and TensorRT-LLM `PRESHARDED` support for RDMA-based startup
+- **Integrate with runtime platforms** — vLLM `mx` loader and TensorRT-LLM `PRESHARDED` support for RDMA-based startup
- **Deploy in Kubernetes** — use Helm, CRDs, Redis, shared storage, or no-shared-storage topologies
- **Operate through CLI and APIs** — health, download, list, validate, and clear models with shared server/client interfaces
From 71d4a5d7a17faf2b8a293ba9c6cb6e8ca929f740 Mon Sep 17 00:00:00 2001
From: Ganesh Kudleppanavar
Date: Thu, 30 Apr 2026 23:06:30 -0500
Subject: [PATCH 11/17] docs: fix MLA known issue, add MX_SKIP_FEATURE_CHECK,
add MLA roadmap item
- Update MLA known issue to reflect merged adopt_hidden_tensors workaround
  and verified correct P2P transfer for Kimi-K2.5-NVFP4; add GLM-5.1 to
  blocked model list; correct fallback chain (not disk-only)
- Add MX_SKIP_FEATURE_CHECK to configuration table
- Add MLA P2P transfer as active roadmap item
Co-Authored-By: Claude Sonnet 4.6
---
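Note for reviewers (not part of the commit message): the updated known issue says MLA models are blocked from P2P by default, fall back through ModelStreamer, then GDS, then disk, and can opt back in with `MX_SKIP_FEATURE_CHECK=1`. The sketch below expresses that ordering as a function. `load_strategy_order` and the strategy labels are hypothetical, and placing P2P first for non-blocked models is an assumption made for illustration.

```python
FALLBACK_CHAIN = ["modelstreamer", "gds", "disk"]  # order given in the known issue


def load_strategy_order(uses_mla, env):
    """Return the strategies to try, in order, for one model load (hypothetical)."""
    skip_check = env.get("MX_SKIP_FEATURE_CHECK", "0") == "1"
    if uses_mla and not skip_check:
        # Default behavior for MLA models: P2P is blocked, use fallbacks only.
        return list(FALLBACK_CHAIN)
    # Assumption: P2P is attempted first whenever it is not blocked.
    return ["p2p"] + FALLBACK_CHAIN
```

The function returns a fresh list each time, so callers can modify the result without changing the module-level chain.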
README.md | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/README.md b/README.md
index 26298a60..3ca27caa 100644
--- a/README.md
+++ b/README.md
@@ -221,6 +221,7 @@ docker-compose up --build
| `REDIS_URL` | `redis://localhost:6379` | Redis connection URL (`redis` backend only) |
| `MODEL_EXPRESS_URL` | `localhost:8001` | gRPC server (P2P) |
| `MX_MODEL_URI` | (none) | Enable ModelStreamer with `s3://`, `gs://`, `az://`, absolute local paths, or Hugging Face model IDs |
+| `MX_SKIP_FEATURE_CHECK` | `0` | Set to `1` to bypass the MLA transfer block (see Known Issues) |
| `UCX_TLS` | `rc_x,rc,dc_x,dc,cuda_copy` | InfiniBand transports |
```bash
@@ -259,7 +260,7 @@ cargo bench
## Known Issues
-- **MLA models blocked from P2P transfer** — Models using Multi-head Latent Attention (DeepSeek-V2/V3, Kimi K2/K2.5) are automatically blocked from GPU-to-GPU transfer and fall back to disk loading. Bytes transfer correctly but inference produces corrupted output. Set `MX_SKIP_FEATURE_CHECK=1` to bypass for debugging. See [ARCHITECTURE.md](docs/ARCHITECTURE.md) for details.
+- **MLA models blocked from P2P transfer** — Models using Multi-head Latent Attention (DeepSeek-V2/V3, Kimi K2/K2.5, GLM-5.1) are blocked from GPU-to-GPU transfer by default and fall back through the load strategy chain (ModelStreamer → GDS → disk). The root cause of the post-transfer inference divergence is still under investigation, but a merged workaround (`adopt_hidden_tensors` plus storage-level transfer for non-contiguous MLA projections) has been verified to transfer Kimi-K2.5-NVFP4 correctly over P2P. Set `MX_SKIP_FEATURE_CHECK=1` to enable P2P for MLA models; see [ARCHITECTURE.md](docs/ARCHITECTURE.md) for details.
- **NIXL_ERR_REMOTE_DISCONNECT** — Source restarts invalidate rkeys. Flush Redis, redeploy.
- **Long source warmup** — DeepSeek-V3 (DeepGemm, CUDA graphs) can take significant time; targets wait via coordination.
- **Large model gRPC stream** — May not close automatically; use client timeout.
@@ -275,6 +276,7 @@ cargo bench
- **DRAM and NVMe-resident shard streaming**: Stream shards across workers while keeping weights in DRAM and host local high-speed NVMe.
- **RL workloads**: Explore fast P2P transfers to optimize RL refit phase and support for weight resharding.
- **Earlier weight availability**: Bring weights to prefill earlier; identify prefill workers that can act as strong source nodes.
+- **MLA P2P transfer**: Resolve the root cause of post-transfer inference divergence on MLA models (DeepSeek-V2/V3, Kimi K2/K2.5, GLM-5.1) and lift the default block.
- **Multi-tier cache hierarchy**: Promote and demote models across DRAM, NVMe, and PVC tiers based on access patterns.
- **Distributed sharded cache**: Shard large models across nodes using consistent hashing and parallel shard assembly.
- **Training checkpoint management**: Cache and reuse CUDA kernel compilations (torch.compile, deepGEMM) and CUDA graphs across restarts.
From 1890e7e44ca4bdf5f7dee2ef32a94ada8698a237 Mon Sep 17 00:00:00 2001
From: Ganesh Kudleppanavar
Date: Thu, 30 Apr 2026 23:17:07 -0500
Subject: [PATCH 12/17] docs: replace ASCII architecture diagram with Mermaid
flowchart
ASCII layout was misaligned; Mermaid renders correctly on GitHub
and is the project standard per CLAUDE.md.
Co-Authored-By: Claude Sonnet 4.6
---
README.md | 26 +++++++++++---------------
1 file changed, 11 insertions(+), 15 deletions(-)
diff --git a/README.md b/README.md
index 3ca27caa..91c0c78a 100644
--- a/README.md
+++ b/README.md
@@ -104,21 +104,17 @@ ModelExpress supports air-gapped deployments when model files are already availa
*Phase 1 — Upload once:* Model Source (HuggingFace Hub, NFS) downloads to the Seed Pod (GPU), which loads and postprocesses weights, registers VRAM with NIXL, and publishes metadata to the MX Server. *Phase 2 — Autoscale:* New pods receive weights via NIXL GPUDirect RDMA (GPU VRAM → GPU VRAM, zero-copy) from the seed GPU, using `--load-format mx` for inference.
-```
- ┌─────────────────────────────────────────────────────────────────┐
- │ ModelExpress Server │
- │ Health • Model • P2P Metadata • Redis/K8s CRD backends │
- └──────────────────────┬──────────────────────────────────────────┘
- │
- ┌─────────────────┼─────────────────┐
- │ metadata │ │ metadata
- ▼ │ ▼
- ┌──────────────────┐ │ ┌──────────────────┐
- │ Source (vLLM) │ RDMA │ │ Target (vLLM) │
- │ mx loader │════════►│ │ mx loader │
- │ Load → NIXL │ NIXL │ │ Receive → FP8 │
- │ Publish metadata│ │ │ Serve inference │
- └──────────────────┘ │ └──────────────────┘
+```mermaid
+flowchart TB
+ server["**ModelExpress Server**\nHealth · P2P Metadata · Redis / K8s CRD backends"]
+
+ source["**Source pod (vLLM)**\nmx loader\nLoad weights → NIXL\nPublish metadata"]
+
+ target["**Target pod (vLLM)**\nmx loader\nReceive weights\nServe inference"]
+
+ server -->|metadata| source
+ server -->|metadata| target
+ source -- "GPU-to-GPU RDMA / NIXL" --> target
```
*The server coordinates discovery and lifecycle state; the weight bytes move directly between GPUs.*
From 2d644e3f4f3dde2b8a19605482402520e4d124d1 Mon Sep 17 00:00:00 2001
From: Ganesh Kudleppanavar
Date: Thu, 30 Apr 2026 23:20:39 -0500
Subject: [PATCH 13/17] docs: redesign architecture diagram to show full
ModelExpress flow
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Replace minimal 3-node diagram with one that covers:
- External model sources (HF, NGC, GCS) with one-time download / ingress reduction
- ModelExpress Server with Redis / K8s CRD backend
- Source pod load pipeline (cache → post-process → NIXL → publish metadata)
- Target pod ordered fallback chain (RDMA → ModelStreamer → GDS → Default)
- Scale impact callout: 1 download → N pods at ~15 s each via P2P
Co-Authored-By: Claude Sonnet 4.6
---
README.md | 41 ++++++++++++++++++++++++++++++++---------
1 file changed, 32 insertions(+), 9 deletions(-)
diff --git a/README.md b/README.md
index 91c0c78a..15a9569e 100644
--- a/README.md
+++ b/README.md
@@ -106,15 +106,38 @@ ModelExpress supports air-gapped deployments when model files are already availa
```mermaid
flowchart TB
- server["**ModelExpress Server**\nHealth · P2P Metadata · Redis / K8s CRD backends"]
-
- source["**Source pod (vLLM)**\nmx loader\nLoad weights → NIXL\nPublish metadata"]
-
- target["**Target pod (vLLM)**\nmx loader\nReceive weights\nServe inference"]
-
- server -->|metadata| source
- server -->|metadata| target
- source -- "GPU-to-GPU RDMA / NIXL" --> target
+ subgraph ext["External Model Sources"]
+ direction LR
+ HF["Hugging Face Hub"]
+ NGC["NVIDIA NGC"]
+ GCS["Google Cloud Storage"]
+ end
+
+ subgraph mx["ModelExpress Server · Redis / K8s CRD backend"]
+ api["gRPC API · Download · Cache management · P2P coordination"]
+ end
+
+ subgraph source["Source Pod · vLLM + mx loader (1 pod)"]
+ sl["Load from cache → post-process → NIXL registration\nPublish P2P metadata to server"]
+ end
+
+ subgraph targets["Target Pods · vLLM + mx loader · ordered fallback (scales to N pods)"]
+ direction LR
+ t1["① RDMA / NIXL\nGPU-to-GPU\n~15 s / 681 GB"]
+ t2["② ModelStreamer\nS3 · GCS · Azure"]
+ t3["③ GPUDirect\nStorage"]
+ t4["④ Default\ndisk load"]
+ t1 -.->|fallback| t2 -.->|fallback| t3 -.->|fallback| t4
+ end
+
+ scale["Scale impact: 1 external download → N pods loaded via P2P\nno repeated ingress · each target pod loads in ~15 s regardless of N"]
+
+ ext -->|"downloaded once\nno duplicate ingress"| mx
+ mx -->|cached weights| source
+ source <-->|P2P metadata| mx
+ mx -->|P2P metadata| targets
+ source -->|"GPU-to-GPU RDMA / NIXL\n~45 Gbps per IB link"| targets
+ targets --- scale
```
*The server coordinates discovery and lifecycle state; the weight bytes move directly between GPUs.*
From 3e09c8d103df4e84cd705d686627ca9288c6e0a8 Mon Sep 17 00:00:00 2001
From: Ganesh Kudleppanavar
Date: Thu, 30 Apr 2026 23:22:40 -0500
Subject: [PATCH 14/17] docs: split architecture into two focused phase
diagrams
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Replace single crowded diagram with two clean diagrams:
- Phase 1: external download and cache (sources → server → cache)
- Phase 2: autoscale and rolling update (source pod → RDMA → N target pods
with ordered fallback chain)
Co-Authored-By: Claude Sonnet 4.6
---
README.md | 56 +++++++++++++++++++++++++++++++++----------------------
1 file changed, 34 insertions(+), 22 deletions(-)
diff --git a/README.md b/README.md
index 15a9569e..650a9b0f 100644
--- a/README.md
+++ b/README.md
@@ -102,45 +102,57 @@ ModelExpress supports air-gapped deployments when model files are already availa

-*Phase 1 — Upload once:* Model Source (HuggingFace Hub, NFS) downloads to the Seed Pod (GPU), which loads and postprocesses weights, registers VRAM with NIXL, and publishes metadata to the MX Server. *Phase 2 — Autoscale:* New pods receive weights via NIXL GPUDirect RDMA (GPU VRAM → GPU VRAM, zero-copy) from the seed GPU, using `--load-format mx` for inference.
+**Phase 1 — External download and cache:** ModelExpress ensures only one node pulls from external providers; all others read from the shared cache.
```mermaid
-flowchart TB
+flowchart LR
subgraph ext["External Model Sources"]
- direction LR
+ direction TB
HF["Hugging Face Hub"]
NGC["NVIDIA NGC"]
GCS["Google Cloud Storage"]
end
- subgraph mx["ModelExpress Server · Redis / K8s CRD backend"]
- api["gRPC API · Download · Cache management · P2P coordination"]
+ subgraph mx["ModelExpress Server"]
+ api["Download · Cache management\ngRPC API"]
+ be[("Redis / K8s CRD\nmetadata backend")]
+ api --- be
end
- subgraph source["Source Pod · vLLM + mx loader (1 pod)"]
- sl["Load from cache → post-process → NIXL registration\nPublish P2P metadata to server"]
+ cache[("Model Cache\nlocal disk / PVC")]
+
+ ext -->|"one-time download\nno duplicate ingress"| mx
+ mx -->|"store weights"| cache
+ cache -->|"subsequent\nrequests served\nfrom cache"| mx
+```
+
+**Phase 2 — Autoscale and rolling update:** A single source pod loads from cache and serves weights to all new pods via GPU-to-GPU RDMA — each target pod loads in ~15 s regardless of cluster size.
+
+```mermaid
+flowchart LR
+ subgraph mx["ModelExpress Server · Redis / K8s CRD"]
+ coord["P2P coordination\nmetadata registry"]
end
- subgraph targets["Target Pods · vLLM + mx loader · ordered fallback (scales to N pods)"]
- direction LR
- t1["① RDMA / NIXL\nGPU-to-GPU\n~15 s / 681 GB"]
- t2["② ModelStreamer\nS3 · GCS · Azure"]
- t3["③ GPUDirect\nStorage"]
- t4["④ Default\ndisk load"]
- t1 -.->|fallback| t2 -.->|fallback| t3 -.->|fallback| t4
+ subgraph source["Source Pod · vLLM + mx loader"]
+ sl["Load from cache\n→ post-process\n→ NIXL registration\n→ publish metadata"]
end
- scale["Scale impact: 1 external download → N pods loaded via P2P\nno repeated ingress · each target pod loads in ~15 s regardless of N"]
+ subgraph targets["Target Pods × N · vLLM + mx loader"]
+ direction TB
+ t1["① RDMA / NIXL GPU-to-GPU ~15 s / 681 GB"]
+ t2["② ModelStreamer S3 · GCS · Azure Blob"]
+ t3["③ GPUDirect Storage NVMe → GPU"]
+ t4["④ Default disk → CPU → GPU"]
+ t1 -.->|fallback| t2 -.->|fallback| t3 -.->|fallback| t4
+ end
- ext -->|"downloaded once\nno duplicate ingress"| mx
- mx -->|cached weights| source
- source <-->|P2P metadata| mx
- mx -->|P2P metadata| targets
- source -->|"GPU-to-GPU RDMA / NIXL\n~45 Gbps per IB link"| targets
- targets --- scale
+ source <-->|"P2P metadata"| mx
+ mx -->|"P2P metadata"| targets
+ source -->|"GPU-to-GPU RDMA / NIXL · ~45 Gbps per IB link"| targets
```
-*The server coordinates discovery and lifecycle state; the weight bytes move directly between GPUs.*
+*The server coordinates discovery and lifecycle state; weight bytes transfer directly between GPUs.*
- **modelexpress_server**: control plane for downloads, cache state, and P2P coordination
- **modelexpress_client**: Rust CLI and Python integration layer for runtime-facing workflows
From 11ddb6e8496edf7e62a82dcfa075de9fdd90489d Mon Sep 17 00:00:00 2001
From: Ganesh Kudleppanavar
Date: Thu, 30 Apr 2026 23:24:50 -0500
Subject: [PATCH 15/17] docs: replace inaccurate IB bandwidth with ConnectX /
fast interconnect
Co-Authored-By: Claude Sonnet 4.6
---
README.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/README.md b/README.md
index 650a9b0f..c5e9e237 100644
--- a/README.md
+++ b/README.md
@@ -149,7 +149,7 @@ flowchart LR
source <-->|"P2P metadata"| mx
mx -->|"P2P metadata"| targets
- source -->|"GPU-to-GPU RDMA / NIXL · ~45 Gbps per IB link"| targets
+ source -->|"GPU-to-GPU RDMA / NIXL · NVIDIA ConnectX / fast interconnect"| targets
```
*The server coordinates discovery and lifecycle state; weight bytes transfer directly between GPUs.*
From 8b8057efbc4fc2ba483b55c8be9ce863f41711bf Mon Sep 17 00:00:00 2001
From: Ganesh Kudleppanavar
Date: Thu, 30 Apr 2026 23:25:52 -0500
Subject: [PATCH 16/17] docs: remove benchmark numbers from architecture
diagram
Co-Authored-By: Claude Sonnet 4.6
---
README.md | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/README.md b/README.md
index c5e9e237..510f2b82 100644
--- a/README.md
+++ b/README.md
@@ -126,7 +126,7 @@ flowchart LR
cache -->|"subsequent\nrequests served\nfrom cache"| mx
```
-**Phase 2 — Autoscale and rolling update:** A single source pod loads from cache and serves weights to all new pods via GPU-to-GPU RDMA — each target pod loads in ~15 s regardless of cluster size.
+**Phase 2 — Autoscale and rolling update:** A single source pod loads from cache and serves weights to all new pods via GPU-to-GPU RDMA, so the load time of each target pod does not depend on cluster size.
```mermaid
flowchart LR
@@ -140,7 +140,7 @@ flowchart LR
subgraph targets["Target Pods × N · vLLM + mx loader"]
direction TB
- t1["① RDMA / NIXL GPU-to-GPU ~15 s / 681 GB"]
+ t1["① RDMA / NIXL GPU-to-GPU"]
t2["② ModelStreamer S3 · GCS · Azure Blob"]
t3["③ GPUDirect Storage NVMe → GPU"]
t4["④ Default disk → CPU → GPU"]
From a2680ecc29a17b655e100a3133c0b42991b48e7a Mon Sep 17 00:00:00 2001
From: Ganesh Kudleppanavar
Date: Thu, 30 Apr 2026 23:28:48 -0500
Subject: [PATCH 17/17] docs: expand model sources section and update SGLang
integration status
- Replace compact Model Store Providers table with full breakdown of
server-side providers (HF, NGC, GCS) and ModelStreamer backends
(S3/S3-compatible, GCS, Azure Blob, local filesystem, HF cache)
with URI format and auth notes for each
- Add GDS entry with activation conditions
- Update SGLang row from "coming soon" to in-progress GDS PR link
Co-Authored-By: Claude Sonnet 4.6
---
README.md | 30 ++++++++++++++++++++++--------
1 file changed, 22 insertions(+), 8 deletions(-)
diff --git a/README.md b/README.md
index 510f2b82..a6917758 100644
--- a/README.md
+++ b/README.md
@@ -74,17 +74,31 @@ ModelExpress orchestrates the weight lifecycle from external source to GPU memor
| vLLM | `--load-format mx` for P2P weight transfer |
| NVIDIA Dynamo (vLLM) | `get_model_path` API; [aggregated K8s example](examples/aggregated_k8s/README.md) |
| TensorRT-LLM | `LoadFormat.PRESHARDED` with `MxLiveCheckpointLoader` for P2P weight transfer (beta) — [TRT-LLM examples](examples/p2p_transfer_k8s/client/trtllm/) |
-| SGLang | Coming soon |
+| SGLang | `--load-format mx_gds` for GPUDirect Storage loading (beta, [in progress](https://github.com/sgl-project/sglang/pull/20288)) |
-### Model Store Providers
+### Model Sources
-ModelExpress exposes a small set of storage-access patterns, depending on how you want weights delivered:
+ModelExpress routes each load through the fastest available path. There are two access patterns: server-side provider downloads and client-side ModelStreamer streaming.
-| Path | Supported sources |
-|------|-------------------|
-| Model providers | Hugging Face, NVIDIA NGC, Google Cloud Storage |
-| ModelStreamer (`MX_MODEL_URI`) | S3 / S3-compatible, GCS, Azure Blob Storage, local filesystem, and Hugging Face cache-resolved model IDs |
-| GPUDirect Storage | Local filesystem or cached model files loaded directly to GPU |
+**Server-side providers** — the ModelExpress server pulls the model once; all other nodes read from the shared cache:
+
+| Provider | How to use |
+|----------|------------|
+| Hugging Face Hub | Default; set `HF_TOKEN` for gated models |
+| NVIDIA NGC | Set `NGC_API_KEY`; use paths of the form `ngc://org/team/model:version` |
+| Google Cloud Storage | Set `GOOGLE_APPLICATION_CREDENTIALS`; use `gs://bucket/path` |
+
+**ModelStreamer** (`MX_MODEL_URI`) — streams safetensors directly from object storage or local disk into GPU memory, bypassing the server cache. Activate it by setting `MX_MODEL_URI` to one of these URI forms:
+
+| Backend | URI format | Notes |
+|---------|-----------|-------|
+| Amazon S3 / S3-compatible | `s3://bucket/path/to/model` | MinIO, Ceph, and other S3-compatible stores work |
+| Google Cloud Storage | `gs://bucket/path/to/model` | |
+| Azure Blob Storage | `az://container/path/to/model` | |
+| Local filesystem | `/absolute/path/to/model` | Useful for NVMe, NFS, or pre-staged volumes |
+| Hugging Face cache | `org/model-name` (e.g. `deepseek-ai/DeepSeek-V3`) | Resolved via `HF_HUB_CACHE` or `~/.cache/huggingface/hub` |
+
+**GPUDirect Storage (GDS)** — loads directly from NVMe into GPU memory, avoiding the CPU bounce buffer in host DRAM. Activated automatically when `cuFile` and compatible hardware are detected.
### Air-Gapped Environments