179 changes: 128 additions & 51 deletions README.md
@@ -15,7 +15,11 @@ SPDX-License-Identifier: Apache-2.0
<h1 align="center">Dynamo ModelExpress</h1>

<p align="center">
<strong>Model weight management for LLM inference</strong> — cache, transfer, and serve weights at scale with GPU-to-GPU RDMA and multi-node coordination.
<strong>Accelerate LLM startup and scale-out with intelligent model distribution</strong>
</p>

<p align="center">
Reduce repeated ingress from external model providers by ensuring only one copy of a model is downloaded even when many clients request it concurrently.
</p>

<p align="center">
@@ -31,18 +35,18 @@ SPDX-License-Identifier: Apache-2.0

## Overview

ModelExpress is a Rust-based service that manages the complete model weight lifecycle in the cluster—from acquisition to GPU memory. It accelerates LLM inference by caching, routing, and transferring weights through the fastest available path. Deploy standalone or as a sidecar alongside vLLM, NVIDIA Dynamo, and other inference runtimes.
ModelExpress is a model distribution layer for large-model workloads. It manages how model weights are acquired, cached, shared, and transferred across a cluster so systems can start faster, scale more efficiently, and avoid repeated downloads from external model providers. Deploy it as a standalone service or alongside runtimes such as vLLM, NVIDIA Dynamo, and TensorRT-LLM.

| LLM serving problem | How ModelExpress helps |
|---------------------|------------------------|
| **Models take too long to load** | GPU-to-GPU transfer via NIXL/RDMA instead of loading from storage. In P2P mode, weights already serving inference act as the cache—no extra storage. |
| **Many nodes need the same model** | Metadata backends (Redis, K8s CRD) coordinate sharing: one node loads; others receive via P2P or local paths. |
| **Many nodes need the same model** | Metadata backends (Redis, K8s CRD) coordinate sharing: one node loads; others receive via P2P or local paths. This reduces ingress bandwidth from external providers such as Hugging Face and ensures only one model copy is downloaded even when multiple clients request the same model concurrently. |

### How ModelExpress manages weights in the cluster

ModelExpress orchestrates the full flow—from download to GPU memory. It ensures only one node downloads a model from external sources (e.g., HuggingFace); other nodes receive weights via P2P or shared storage—eliminating duplicate downloads and reducing cluster ingress.
ModelExpress orchestrates the weight lifecycle from external source to GPU memory. It minimizes repeated provider traffic, keeps cache state coordinated across the cluster, and routes each load through the most efficient available path.
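
The "one download for many concurrent requests" behavior described above can be illustrated with a single-flight pattern. The sketch below is not ModelExpress code: `DownloadCoordinator` and `fake_fetch` are hypothetical names, and an in-process lock stands in for the cluster-wide coordination that the README attributes to the Redis or Kubernetes CRD backends.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor


class DownloadCoordinator:
    """Hypothetical helper: collapse concurrent requests for one model into one fetch."""

    def __init__(self, fetch):
        self._fetch = fetch              # callable: model_id -> local path (assumed)
        self._lock = threading.Lock()
        self._done = {}                  # model_id -> cached local path
        self._inflight = {}              # model_id -> Future shared by waiting callers

    def get(self, model_id):
        with self._lock:
            if model_id in self._done:   # already cached: no external traffic
                return self._done[model_id]
            fut = self._inflight.get(model_id)
            leader = fut is None         # first caller becomes the downloader
            if leader:
                fut = Future()
                self._inflight[model_id] = fut
        if leader:
            try:
                path = self._fetch(model_id)
                with self._lock:
                    self._done[model_id] = path
                fut.set_result(path)
            except Exception as exc:     # propagate the failure to every waiter
                fut.set_exception(exc)
            finally:
                with self._lock:
                    self._inflight.pop(model_id, None)
        return fut.result()              # followers block here until the leader finishes


calls = []


def fake_fetch(model_id):
    # Stand-in for an external provider download; records each real fetch.
    calls.append(model_id)
    return f"/cache/{model_id}"


coordinator = DownloadCoordinator(fake_fetch)
with ThreadPoolExecutor(max_workers=8) as pool:
    paths = list(pool.map(coordinator.get, ["org/model"] * 8))
```

Eight callers request the same model, but `fake_fetch` runs once and every caller receives the same cached path.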

1. **Download from HuggingFace** — One node pulls the model; ModelExpress coordinates so no other node duplicates this download, reducing external ingress. In air-gapped mode, serve from cache only (`HF_HUB_OFFLINE=1`).
1. **Download or stream from external storage** — Either the ModelExpress server pulls the model from Hugging Face, NGC, or GCS, or a client streams it through ModelStreamer from S3, Azure Blob Storage, other object storage, or local disk. In both cases, ModelExpress coordinates so that no other node duplicates this work. In air-gapped mode, serve from cache only (`HF_HUB_OFFLINE=1`).
2. **Persist to disk** — Store in a cache backed by disk:
- **Host-attached disk** — Local disk on the node (single-node or per-node cache).
- **PVC** — RWO (ReadWriteOnce) for single-node; RWX (ReadWriteMany) for shared access across nodes.
@@ -53,12 +57,15 @@ ModelExpress orchestrates the full flow—from download to GPU memory. It ensure

## Features

- **Cold start reduction** — GPU-to-GPU P2P transfer over InfiniBand instead of disk load
- **HuggingFace caching** — PVC-backed cache, `HF_HUB_OFFLINE`, `ignore_weights`, `get_model_path` for Dynamo
- **P2P GPU transfer** — vLLM `mx` loader and TRT-LLM `PRESHARDED` loader with NVIDIA NIXL over RDMA
- **Metadata backends** — In-memory, Redis, or Kubernetes CRD (layered write-through for HA)
- **Kubernetes** — Helm chart, CRDs/Redis for P2P, no-shared-storage support
- **CLI** — Health, download, list, validate, clear; init-container support for pre-warming
- **Reduce startup time** — shift model loads from storage-bound workflows to GPU-to-GPU RDMA over InfiniBand
- **Reduce provider ingress** — coordinate downloads so concurrent requests share one external fetch instead of duplicating traffic
- **Operate with distributed state** — keep model lifecycle state and P2P metadata in Redis or Kubernetes CRDs
- **Support multiple model sources** — built-in providers for Hugging Face, NVIDIA NGC, and Google Cloud Storage
- **Load from object storage** — use ModelStreamer with `MX_MODEL_URI` for S3, GCS, Azure Blob, local paths, or the Hugging Face cache
- **Use direct file-to-GPU loading** — GPUDirect Storage is used automatically when compatible hardware and software are detected
- **Integrate with runtime platforms** — vLLM `mx` loader and TensorRT-LLM `PRESHARDED` support for RDMA-based startup
- **Deploy in Kubernetes** — use Helm, CRDs, Redis, shared storage, or no-shared-storage topologies
- **Operate through CLI and APIs** — health, download, list, validate, and clear models with shared server/client interfaces

### Integrations

@@ -67,38 +74,103 @@ ModelExpress orchestrates the full flow—from download to GPU memory. It ensure
| vLLM | `--load-format mx` for P2P weight transfer |
| NVIDIA Dynamo (vLLM) | `get_model_path` API; [aggregated K8s example](examples/aggregated_k8s/README.md) |
| TensorRT-LLM | `LoadFormat.PRESHARDED` with `MxLiveCheckpointLoader` for P2P weight transfer (beta) — [TRT-LLM examples](examples/p2p_transfer_k8s/client/trtllm/) |
| SGLang | Coming soon |
| SGLang | `--load-format mx_gds` for GPUDirect Storage loading (beta, [in progress](https://github.com/sgl-project/sglang/pull/20288)) |

### Model Sources

ModelExpress routes each load through the fastest available path. There are two access patterns: server-side provider downloads and client-side ModelStreamer streaming.

**Server-side providers** — the ModelExpress server pulls the model once; all other nodes read from the shared cache:

| Provider | How to use |
|----------|------------|
| Hugging Face Hub | Default; set `HF_TOKEN` for gated models |
| NVIDIA NGC | Set `NGC_API_KEY`; use `ngc://org/team/model:version`-style paths |
| Google Cloud Storage | Set `GOOGLE_APPLICATION_CREDENTIALS`; use `gs://bucket/path` |

**ModelStreamer** (`MX_MODEL_URI`) — streams safetensors directly from object storage or local disk into GPU memory, bypassing the server cache. Activate by setting `MX_MODEL_URI`:

| Backend | URI format | Notes |
|---------|-----------|-------|
| Amazon S3 / S3-compatible | `s3://bucket/path/to/model` | MinIO, Ceph, and other S3-compatible stores work |
| Google Cloud Storage | `gs://bucket/path/to/model` | |
| Azure Blob Storage | `az://container/path/to/model` | |
| Local filesystem | `/absolute/path/to/model` | Useful for NVMe, NFS, or pre-staged volumes |
| Hugging Face cache | `org/model-name` (e.g. `deepseek-ai/DeepSeek-V3`) | Resolved via `HF_HUB_CACHE` or `~/.cache/huggingface/hub` |
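
As an illustration of the URI formats in the table above, the following sketch maps an `MX_MODEL_URI` value to a backend label. The function name `classify_model_uri` and the returned labels are assumptions made for this example; the actual ModelStreamer resolution logic may differ.

```python
# Scheme prefixes taken from the table above; the labels on the right are illustrative.
_SCHEME_TO_BACKEND = {
    "s3": "s3",
    "gs": "gcs",
    "az": "azure_blob",
}


def classify_model_uri(uri: str) -> str:
    """Hypothetical helper: return a backend label for an MX_MODEL_URI value."""
    scheme, sep, _rest = uri.partition("://")
    if sep:
        # Value has an explicit scheme such as s3://bucket/path.
        try:
            return _SCHEME_TO_BACKEND[scheme]
        except KeyError:
            raise ValueError(f"unsupported scheme in MX_MODEL_URI: {scheme}") from None
    if uri.startswith("/"):
        # Absolute path: NVMe, NFS, or a pre-staged volume.
        return "local"
    parts = uri.split("/")
    if len(parts) == 2 and all(parts):
        # org/model-name form, resolved from the local Hugging Face cache.
        return "hf_cache"
    raise ValueError(f"unrecognized MX_MODEL_URI: {uri!r}")


backend = classify_model_uri("deepseek-ai/DeepSeek-V3")
```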

**GPUDirect Storage (GDS)** — loads weights directly from NVMe into GPU memory, bypassing the CPU/DRAM copy path. Activated automatically when `cuFile` and compatible hardware are detected.

### Air-Gapped Environments

ModelExpress supports air-gapped deployments when model files are already available inside the environment.

- Use a pre-populated local cache or a mounted local/PVC path as the source of truth.
- For Hugging Face cache-only operation, set `HF_HUB_OFFLINE=1`; ModelExpress resolves models from the local HF cache and does not attempt network access.
- For fully disconnected runtime loading, point `MX_MODEL_URI` at a local filesystem path so ModelStreamer reads from local storage instead of external object stores.
- Once one source pod has loaded the model, additional pods can receive the weights through P2P RDMA without re-downloading from an external provider.
- External providers such as NGC, GCS, S3, and Azure Blob still require network reachability unless their contents are mirrored into local storage inside the air-gapped environment.
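
The rules above can be turned into a small pre-flight check before starting a pod in a disconnected environment. This is a conservative sketch, not part of ModelExpress: `airgap_warnings` is a hypothetical helper, and it assumes that any value other than an absolute local path may trigger a Hugging Face lookup unless `HF_HUB_OFFLINE=1` is set.

```python
# URI prefixes from the ModelStreamer table that imply an external object store.
REMOTE_PREFIXES = ("s3://", "gs://", "az://")


def airgap_warnings(env):
    """Hypothetical check: list settings that could require network access."""
    warnings = []
    uri = env.get("MX_MODEL_URI", "")
    if uri.startswith(REMOTE_PREFIXES):
        # Object stores need network reachability unless mirrored locally.
        warnings.append(
            f"MX_MODEL_URI={uri} points at an external store; mirror it to a local path"
        )
    elif not uri.startswith("/") and env.get("HF_HUB_OFFLINE") != "1":
        # Unset URI or a Hugging Face model ID: keep resolution inside the local cache.
        warnings.append(
            "set HF_HUB_OFFLINE=1 so Hugging Face lookups stay in the local cache"
        )
    return warnings


local_only = airgap_warnings({"MX_MODEL_URI": "/models/my-model"})
```

In practice the function could be called with `os.environ`; an empty list means none of the checked settings reaches outside the environment.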

---

## ModelExpress Architecture

![ModelExpress Architecture: Upload once, then autoscale new pods via NIXL GPUDirect RDMA from seed GPU](model-express-architecture.png)

*Phase 1 — Upload once:* Model Source (HuggingFace Hub, NFS) downloads to the Seed Pod (GPU), which loads and postprocesses weights, registers VRAM with NIXL, and publishes metadata to the MX Server. *Phase 2 — Autoscale:* New pods receive weights via NIXL GPUDirect RDMA (GPU VRAM → GPU VRAM, zero-copy) from the seed GPU, using `--load-format mx` for inference.
**Phase 1 — External download and cache:** ModelExpress ensures only one node pulls from external providers; all others read from the shared cache.

```mermaid
flowchart LR
subgraph ext["External Model Sources"]
direction TB
HF["Hugging Face Hub"]
NGC["NVIDIA NGC"]
GCS["Google Cloud Storage"]
end

subgraph mx["ModelExpress Server"]
api["Download · Cache management\ngRPC API"]
be[("Redis / K8s CRD\nmetadata backend")]
api --- be
end

cache[("Model Cache\nlocal disk / PVC")]

ext -->|"one-time download\nno duplicate ingress"| mx
mx -->|"store weights"| cache
cache -->|"subsequent\nrequests served\nfrom cache"| mx
```
┌─────────────────────────────────────────────────────────────────┐
│ ModelExpress Server │
│ Health • Model • P2P Metadata • Redis/K8s CRD backends │
└──────────────────────┬──────────────────────────────────────────┘
┌─────────────────┼─────────────────┐
│ metadata │ │ metadata
▼ │ ▼
┌──────────────────┐ │ ┌──────────────────┐
│ Source (vLLM) │ RDMA │ │ Target (vLLM) │
│ mx loader │════════►│ │ mx loader │
│ Load → NIXL │ NIXL │ │ Receive → FP8 │
│ Publish metadata│ │ │ Serve inference │
└──────────────────┘ │ └──────────────────┘

**Phase 2 — Autoscale and rolling update:** A single source pod loads from cache and serves weights to all new pods via GPU-to-GPU RDMA — each target pod loads in the same time regardless of cluster size.

```mermaid
flowchart LR
subgraph mx["ModelExpress Server · Redis / K8s CRD"]
coord["P2P coordination\nmetadata registry"]
end

subgraph source["Source Pod · vLLM + mx loader"]
sl["Load from cache\n→ post-process\n→ NIXL registration\n→ publish metadata"]
end

subgraph targets["Target Pods × N · vLLM + mx loader"]
direction TB
t1["① RDMA / NIXL GPU-to-GPU"]
t2["② ModelStreamer S3 · GCS · Azure Blob"]
t3["③ GPUDirect Storage NVMe → GPU"]
t4["④ Default disk → CPU → GPU"]
t1 -.->|fallback| t2 -.->|fallback| t3 -.->|fallback| t4
end

source <-->|"P2P metadata"| mx
mx -->|"P2P metadata"| targets
source -->|"GPU-to-GPU RDMA / NIXL · NVIDIA ConnectX / fast interconnect"| targets
```

*Source and Target exchange metadata with the server for coordination; weights transfer directly over RDMA between GPUs.*
*The server coordinates discovery and lifecycle state; weight bytes transfer directly between GPUs.*
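
The fallback order shown for target pods (RDMA, then ModelStreamer, then GPUDirect Storage, then default disk loading) follows a common try-in-order pattern. The sketch below shows only that pattern: the strategy functions and the `StrategyUnavailable` exception are placeholders invented for this example, not the loaders that ModelExpress ships.

```python
from typing import Callable, Sequence, Tuple


class StrategyUnavailable(Exception):
    """Raised by a placeholder strategy that cannot run in this environment."""


def load_with_fallback(strategies: Sequence[Tuple[str, Callable[[], str]]]):
    """Try each (name, loader) pair in order and return the first success."""
    errors = []
    for name, load in strategies:
        try:
            return name, load()
        except StrategyUnavailable as exc:
            errors.append(f"{name}: {exc}")   # remember why this path was skipped
    raise RuntimeError("all load strategies failed: " + "; ".join(errors))


# Placeholder strategies; real loaders would move weights instead of returning text.
def rdma_p2p():
    raise StrategyUnavailable("no source pod has published metadata")


def model_streamer():
    raise StrategyUnavailable("MX_MODEL_URI is not set")


def gpudirect_storage():
    raise StrategyUnavailable("cuFile not detected")


def default_disk():
    return "weights loaded via disk -> CPU -> GPU"


chosen, result = load_with_fallback([
    ("rdma", rdma_p2p),
    ("model_streamer", model_streamer),
    ("gds", gpudirect_storage),
    ("disk", default_disk),
])
```

Keeping the chain as an ordered list makes the priority explicit and lets each skipped strategy report its reason.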

- **modelexpress_server**: gRPC server with configurable metadata backends (Redis, Kubernetes CRD).
- **modelexpress_client**: Rust CLI for cache management; Python package with vLLM loaders and `MxClient` for gRPC.
- **modelexpress_common**: Protobuf definitions, provider trait (HuggingFace), shared configuration.
- **modelexpress_server**: control plane for downloads, cache state, and P2P coordination
- **modelexpress_client**: Rust CLI and Python integration layer for runtime-facing workflows
- **modelexpress_common**: shared protobufs, provider abstractions, and configuration types

See [Architecture](docs/ARCHITECTURE.md).

@@ -130,7 +202,7 @@ modelexpress-cli health
```

**Without shared storage:** use `--no-shared-storage` for gRPC streaming.
**Air-gapped:** `HF_HUB_OFFLINE=1 modelexpress-cli model get <model-id>`.
**Air-gapped:** `HF_HUB_OFFLINE=1 modelexpress-cli model download <model-id>`.

---

@@ -145,6 +217,19 @@ helm install modelexpress ./helm --namespace modelexpress --create-namespace

Override [values-production.yaml](helm/values-production.yaml) for your environment. Full configuration reference: [helm/README.md](helm/README.md).

### Distributed Backend Prerequisites

ModelExpress requires a distributed backend for model registry state and P2P coordination:

- `redis` for Redis-backed deployments
- `kubernetes` for CRD-backed deployments

For the Kubernetes backend, install the CRDs before starting the server:

```bash
kubectl apply -f examples/crds.yaml
```

### P2P GPU Transfer (vLLM)

```python
@@ -155,6 +240,12 @@ register_modelexpress_loaders()

First instance loads from disk; subsequent instances receive via RDMA. [P2P guide](examples/p2p_transfer_k8s/README.md) · [Server setup](examples/p2p_transfer_k8s/server/README.md).

### Example Deployments

- [vLLM P2P transfer](examples/p2p_transfer_k8s/README.md)
- [Dynamo P2P transfer](examples/dynamo_p2p_transfer_k8s/README.md)
- [TensorRT-LLM beta examples](examples/p2p_transfer_k8s/client/trtllm/README.md)

### Docker

```bash
@@ -174,6 +265,8 @@ docker-compose up --build
| `MX_METADATA_BACKEND` | (required) | `redis` \| `kubernetes` |
| `REDIS_URL` | `redis://localhost:6379` | Redis connection URL (`redis` backend only) |
| `MODEL_EXPRESS_URL` | `localhost:8001` | gRPC server (P2P) |
| `MX_MODEL_URI` | (none) | Enable ModelStreamer with `s3://`, `gs://`, `az://`, absolute local paths, or Hugging Face model IDs |
| `MX_SKIP_FEATURE_CHECK` | `0` | Set to `1` to bypass the MLA transfer block (see Known Issues) |
| `UCX_TLS` | `rc_x,rc,dc_x,dc,cuda_copy` | InfiniBand transports |

```bash
@@ -185,20 +278,6 @@ Full reference: [docs/DEPLOYMENT.md](docs/DEPLOYMENT.md).

---

## CLI

```bash
modelexpress-cli health
modelexpress-cli model download <model-id>
modelexpress-cli model list
modelexpress-cli model validate <model-id>
modelexpress-cli model clear <model-id>
```

[CLI Reference](docs/CLI.md)

---

## Testing

```bash
@@ -218,14 +297,15 @@ cargo bench
| [Deployment](docs/DEPLOYMENT.md) | Server/client config, Docker, K8s, P2P |
| [Architecture](docs/ARCHITECTURE.md) | Components, gRPC, NIXL, FP8 |
| [CLI](docs/CLI.md) | Full CLI reference |
| [GCS Provider](docs/GCS_PROVIDER.md) | GCS provider design, cache layout, manifest behavior, and credentials |
| [Metadata](docs/metadata.md) | Redis keys, K8s CRD schema |
| [Helm](helm/README.md) | Kubernetes configuration |

---

## Known Issues

- **MLA models blocked from P2P transfer** — Models using Multi-head Latent Attention (DeepSeek-V2/V3, Kimi K2/K2.5) are automatically blocked from GPU-to-GPU transfer and fall back to disk loading. Bytes transfer correctly but inference produces corrupted output. Set `MX_SKIP_FEATURE_CHECK=1` to bypass for debugging. See [ARCHITECTURE.md](docs/ARCHITECTURE.md) for details.
- **MLA models blocked from P2P transfer** — Models using Multi-head Latent Attention (DeepSeek-V2/V3, Kimi K2/K2.5, GLM-5.1) are blocked from GPU-to-GPU transfer by default and fall back through the load strategy chain (ModelStreamer → GDS → disk). The root cause of the post-transfer inference divergence is still under investigation, but a workaround has been merged (`adopt_hidden_tensors` plus storage-level transfer for non-contiguous MLA projections), and P2P transfer has been verified correct for Kimi-K2.5-NVFP4. Set `MX_SKIP_FEATURE_CHECK=1` to enable P2P for MLA models; see [ARCHITECTURE.md](docs/ARCHITECTURE.md) for details.
- **NIXL_ERR_REMOTE_DISCONNECT** — A source restart invalidates its rkeys. Flush Redis and redeploy.
- **Long source warmup** — DeepSeek-V3 (DeepGEMM, CUDA graphs) can take significant time to warm up; targets wait via coordination.
- **Large model gRPC stream** — The stream may not close automatically; set a client-side timeout.
@@ -238,19 +318,16 @@ cargo bench
### Priorities Under Development

- **P2P compile/warmup caching**: torch.compile/DeepGEMM cache for follower workers. The leader performs full warmup; followers consume its caches and skip full warmup.
- **ModelStreamer Integration**: Pull weights from cold storage with multi-cloud and multi-engine support.
- **DRAM and NVMe-resident shard streaming**: Stream shards across workers while keeping weights in DRAM and on host-local high-speed NVMe.
- **RL workloads**: Explore fast P2P transfers to optimize the RL refit phase and support weight resharding.
- **Earlier weight availability**: Bring weights to prefill earlier; identify prefill workers that can act as strong source nodes.
- **Expanded model pull providers**: Support NGC in addition to Hugging Face.
- **GDS (GPUDirect Storage) integration**: Load model weights directly from NVMe into GPU memory, bypassing the CPU/DRAM copy path.
- **MLA P2P transfer**: Resolve root cause of post-transfer inference divergence on MLA models (DeepSeek-V2/V3, Kimi K2/K2.5) and lift the default block.
- **Multi-tier cache hierarchy**: Promote and demote models across DRAM, NVMe, and PVC tiers based on access patterns.
- **Distributed sharded cache**: Shard large models across nodes using consistent hashing and parallel shard assembly.
- **Training checkpoint management**: Cache and reuse CUDA kernel compilations (torch.compile, DeepGEMM) and CUDA graphs across restarts.
- **Metrics and observability**: Cache hit rates, eviction frequency, transfer throughput, and P2P RDMA utilization via Prometheus/OpenTelemetry.
- **Predictive prefetching**: Pre-warm caches from workload history or scheduling hints.
- **P2P transfer fault tolerance**: Auto-recovery from stale rkeys on source restart; retry and fallback to storage loading.
- **Multi-cloud storage backends**: Native support for AWS S3, Azure Blob, and NFS as model pull sources.

---

2 changes: 1 addition & 1 deletion docs/DEPLOYMENT.md
@@ -5,7 +5,7 @@ SPDX-License-Identifier: Apache-2.0

# ModelExpress Deployment Guide

User-facing guide for configuring and deploying ModelExpress. For architecture details, see [`ARCHITECTURE.md`](ARCHITECTURE.md). For development setup, see [`../CONTRIBUTING.md`](../CONTRIBUTING.md).
User-facing guide for configuring and deploying ModelExpress. For architecture details, see [`ARCHITECTURE.md`](ARCHITECTURE.md). For development setup, see [`../CONTRIBUTING.md`](../CONTRIBUTING.md). For a concise overview of offline operation, see the air-gapped section in [`../README.md`](../README.md).

## Server Configuration
