345 changes: 345 additions & 0 deletions Cargo.lock

Large diffs are not rendered by default.

3 changes: 3 additions & 0 deletions Cargo.toml
Expand Up @@ -41,13 +41,15 @@ jiff = { version = "0.2.15", features = ["serde"] }
modelexpress-common = { path = "modelexpress_common", version = "0.3.0" }
modelexpress-client = { path = "modelexpress_client", version = "0.3.0" }
modelexpress-server = { path = "modelexpress_server", version = "0.3.0" }
oci-client = { version = "0.16.1", default-features = false, features = ["rustls-tls"] }
once_cell = "1.21.3"
prost = "0.13"
rustls = { version = "0.23.37", default-features = false, features = ["ring", "std"] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
mockall = "0.14.0"
tempfile = "3.20"
tar = "0.4"
tokio = { version = "1.46", features = ["full"] }
tokio-stream = "0.1"
tonic = "0.13"
Expand All @@ -57,6 +59,7 @@ tracing = "0.1"
tracing-subscriber = { version = "0.3", features = ["env-filter"] }
futures = "0.3"
uuid = { version = "1.17", features = ["v4", "serde"] }
zstd = "0.13"
thiserror = "2.0"
redis = { version = "0.27", features = ["tokio-comp", "connection-manager"] }
reqwest = { version = "0.12", default-features = false, features = ["json", "rustls-tls", "stream"] }
11 changes: 5 additions & 6 deletions README.md
Expand Up @@ -40,9 +40,9 @@ ModelExpress is a Rust-based service that manages the complete model weight life

### How ModelExpress manages weights in the cluster

ModelExpress orchestrates the full flow—from download to GPU memory. It ensures only one node downloads a model from external sources (e.g., HuggingFace); other nodes receive weights via P2P or shared storage—eliminating duplicate downloads and reducing cluster ingress.
ModelExpress orchestrates the full flow—from download to GPU memory. It ensures only one node downloads a model from external sources (e.g., HuggingFace, NGC, GCS, or OCI registries); other nodes receive weights via P2P or shared storage—eliminating duplicate downloads and reducing cluster ingress.

1. **Download from HuggingFace** — One node pulls the model; ModelExpress coordinates so no other node duplicates this download, reducing external ingress. In air-gapped mode, serve from cache only (`HF_HUB_OFFLINE=1`).
1. **Download from a model source** — One node pulls the model from HuggingFace, NGC, GCS, or a file/archive OCI artifact; ModelExpress coordinates so no other node duplicates this download, reducing external ingress. In air-gapped HuggingFace mode, serve from cache only (`HF_HUB_OFFLINE=1`).
2. **Persist to disk** — Store in a cache backed by disk:
- **Host-attached disk** — Local disk on the node (single-node or per-node cache).
- **PVC** — RWO (ReadWriteOnce) for single-node; RWX (ReadWriteMany) for shared access across nodes.
Expand All @@ -54,7 +54,7 @@ ModelExpress orchestrates the full flow—from download to GPU memory. It ensure
## Features

- **Cold start reduction** — GPU-to-GPU P2P transfer over InfiniBand instead of disk load
- **HuggingFace caching** — PVC-backed cache, `HF_HUB_OFFLINE`, `ignore_weights`, `get_model_path` for Dynamo
- **Model source caching** — HuggingFace, NGC, GCS, and OCI artifact providers with PVC-backed cache support, `ignore_weights`, and `get_model_path` for Dynamo
- **P2P GPU transfer** — vLLM `mx` loader and TRT-LLM `PRESHARDED` loader with NVIDIA NIXL over RDMA
- **Metadata backends** — In-memory, Redis, or Kubernetes CRD (layered write-through for HA)
- **Kubernetes** — Helm chart, CRDs/Redis for P2P, no-shared-storage support
Expand Down Expand Up @@ -98,9 +98,9 @@ ModelExpress orchestrates the full flow—from download to GPU memory. It ensure

- **modelexpress_server**: gRPC server with configurable metadata backends (Redis, Kubernetes CRD).
- **modelexpress_client**: Rust CLI for cache management; Python package with vLLM loaders and `MxClient` for gRPC.
- **modelexpress_common**: Protobuf definitions, provider trait (HuggingFace), shared configuration.
- **modelexpress_common**: Protobuf definitions, provider trait (HuggingFace, NGC, GCS, OCI), shared configuration.

See [Architecture](docs/ARCHITECTURE.md).
See [Architecture](docs/ARCHITECTURE.md), [GCS provider](docs/GCS_PROVIDER.md), and [OCI provider](docs/OCI_PROVIDER.md).

---

Expand Down Expand Up @@ -241,7 +241,6 @@ cargo bench
- **DRAM and NVMe-resident shard streaming**: Stream shards across workers while keeping weights in DRAM and host local high-speed NVMe.
- **RL workloads**: Explore fast P2P transfers to optimize RL refit phase and support for weight resharding.
- **Earlier weight availability**: Bring weights to prefill earlier; identify prefill workers that can act as strong source nodes.
- **Expanded model pull providers**: Support NGC in addition to Hugging Face.
- **GDS (GPUDirect Storage) integration**: Load model weights directly from NVMe into GPU memory, bypassing the CPU/DRAM copy path.
- **Multi-tier cache hierarchy**: Promote and demote models across DRAM, NVMe, and PVC tiers based on access patterns.
- **Distributed sharded cache**: Shard large models across nodes using consistent hashing and parallel shard assembly.
19 changes: 11 additions & 8 deletions docs/ARCHITECTURE.md
Expand Up @@ -11,7 +11,7 @@ Detailed reference document for the ModelExpress codebase. For deployment and co

ModelExpress is a Rust-based model cache management service and GPU-to-GPU model weight transfer system. It serves two roles:

- **Model Cache Service** - A sidecar alongside inference solutions (vLLM, SGLang, NVIDIA Dynamo) that accelerates model downloads from HuggingFace, NGC, and GCS. Model lifecycle state lives in a distributed registry — Redis or Kubernetes CRDs (`ModelCacheEntry`), selected via `MX_METADATA_BACKEND` — so multiple server replicas can coordinate without a shared-filesystem database. LRU cache eviction runs off the same registry.
- **Model Cache Service** - A sidecar alongside inference solutions (vLLM, SGLang, NVIDIA Dynamo) that accelerates model downloads from HuggingFace, NGC, GCS, and file/archive OCI artifacts. Model lifecycle state lives in a distributed registry — Redis or Kubernetes CRDs (`ModelCacheEntry`), selected via `MX_METADATA_BACKEND` — so multiple server replicas can coordinate without a shared-filesystem database. LRU cache eviction runs off the same registry.
- **P2P Weight Transfer** - GPU-to-GPU model weight transfers between vLLM instances using NVIDIA NIXL over RDMA/InfiniBand, enabling ~15-second transfers for 681GB models.

### Current Status
Expand All @@ -31,6 +31,7 @@ graph TD
S1 --> HF[HuggingFace Hub]
S1 --> NGC[NVIDIA NGC]
S1 --> GCS[Google Cloud Storage]
S1 --> OCI[OCI Registry]
S1 --> Cache[Model Cache Dir]
end

Expand Down Expand Up @@ -177,7 +178,8 @@ ModelExpress/
│ ├── gcs.rs # GcsProvider implementation
│ ├── gcs/ # GCS manifest, cache layout, locking, download helpers
│ ├── huggingface.rs # HuggingFaceProvider implementation
│ └── ngc.rs # NgcProvider implementation
│ ├── ngc.rs # NgcProvider implementation
│ └── oci.rs # OciProvider implementation
├── workspace-tests/
│ ├── Cargo.toml
Expand Down Expand Up @@ -283,7 +285,7 @@ Four proto files define four services, all compiled via `tonic-build` in `modele
| `StreamModelFiles` | `ModelFilesRequest` | stream `FileChunk` | Stream model file contents (1MB chunks) |
| `ListModelFiles` | `ModelFilesRequest` | `ModelFileList` | List files with sizes |

Key message types: `ModelProvider` (HuggingFace, NGC, GCS), `ModelStatus` (Downloading, Downloaded, Error), `ModelStatusUpdate`, `FileChunk`.
Key message types: `ModelProvider` (HuggingFace, NGC, GCS, OCI), `ModelStatus` (Downloading, Downloaded, Error), `ModelStatusUpdate`, `FileChunk`.

### p2p.proto - P2pService

Expand Down Expand Up @@ -465,7 +467,7 @@ Output formats: `--format human` (default), `--format json`, `--format json-pret
| `config` | Config trait utilities |
| `download` | Download orchestration with strategy pattern |
| `models` | `Status`, `ModelProvider`, `ModelStatus`, `ModelStatusResponse` |
| `providers` | `ModelProviderTrait` + `HuggingFaceProvider` + `NgcProvider` + `GcsProvider` |
| `providers` | `ModelProviderTrait` + `HuggingFaceProvider` + `NgcProvider` + `GcsProvider` + `OciProvider` |
| `grpc` | Generated tonic stubs for all 4 services |
| `constants` | `DEFAULT_GRPC_PORT` (8001), `DEFAULT_TIMEOUT_SECS` (30), `DEFAULT_TRANSFER_CHUNK_SIZE` (32KB) |

Expand All @@ -484,10 +486,11 @@ pub trait ModelProviderTrait: Send + Sync {
}
```

Three implementations:
- `HuggingFaceProvider` - uses the `hf-hub` crate with high-CPU download mode.
- `NgcProvider` - downloads from NVIDIA NGC via the V2 artifact API (Bearer-authenticated `/files/{path}` for team artifacts; presigned S3 URLs for org-level artifacts). Falls back to `checksums.blake3` manifest enumeration when bulk file listing returns 400. Resolves the NGC API key from `NGC_API_KEY`, `NGC_CLI_API_KEY`, or `~/.ngc/config`.
- `GcsProvider` - downloads objects under a full `gs://<bucket>/<object-prefix>` URL using Google Application Default Credentials. It writes a `.mx/manifest.json` cache manifest, verifies downloaded files with GCS CRC32C checksums, skips dotfiles, README, and images, and stores models under `<cache>/gcs/<bucket>/<object-prefix>`. See [`GCS_PROVIDER.md`](GCS_PROVIDER.md) for the detailed design.
Provider implementations:
- `HuggingFaceProvider` — uses the `hf-hub` crate with high-CPU download mode.
- `NgcProvider` — downloads from NVIDIA NGC via the V2 artifact API (Bearer-authenticated `/files/{path}` for team artifacts; presigned S3 URLs for org-level artifacts). Falls back to `checksums.blake3` manifest enumeration when bulk file listing returns 400. Resolves the NGC API key from `NGC_API_KEY`, `NGC_CLI_API_KEY`, or `~/.ngc/config`.
- `GcsProvider` — downloads objects under a full `gs://<bucket>/<object-prefix>` URL using Google Application Default Credentials. It writes a `.mx/manifest.json` cache manifest, verifies downloaded files with GCS CRC32C checksums, skips dotfiles, README, and images, and stores models under `<cache>/gcs/<bucket>/<object-prefix>`. See [`GCS_PROVIDER.md`](GCS_PROVIDER.md) for the detailed design.
- `OciProvider` — downloads OCI model artifacts via `oci-client`. Raw layers use `org.opencontainers.image.title` or `org.cncf.model.filepath` as the output file path; simple `tar` and `tar+zstd` layers are safely extracted. ModelExpress atomically publishes the completed `files` directory. Container image unpacking remains out of scope: no whiteouts or rootfs layer merging. See [`OCI_PROVIDER.md`](OCI_PROVIDER.md).

### ClientConfig / ClientArgs

8 changes: 8 additions & 0 deletions docs/CLI.md
Expand Up @@ -97,6 +97,11 @@ modelexpress-cli model download gs://my-bucket/models/qwen/rev-1 \
modelexpress-cli model download microsoft/DialoGPT-medium \
--strategy direct

# Download an OCI artifact from a registry
modelexpress-cli model download registry.example.com/team/model:v1 \
--provider oci \
--strategy direct

# Download with file transfer when no shared storage exists
# Note: Global options must come before the subcommand
modelexpress-cli --no-shared-storage --transfer-chunk-size 65536 \
Expand Down Expand Up @@ -153,9 +158,12 @@ modelexpress-cli model stats --detailed
- `hugging-face`: Hugging Face model hub (default)
- `ngc`: NVIDIA NGC catalog
- `gcs`: Google Cloud Storage object prefix. The model name must be a full `gs://<bucket>/<path>` URL. See [`GCS_PROVIDER.md`](GCS_PROVIDER.md) for cache layout and provider behavior.
- `oci`: OCI model artifact with raw file blobs or simple `tar`/`tar+zstd` archive layers. References must be registry-qualified and include a tag or digest, for example `oci://registry.example.com/team/model:v1` or `registry.example.com/team/model@sha256:...`. See [`OCI_PROVIDER.md`](OCI_PROVIDER.md) for artifact format, cache layout, and publish behavior.

For GCS downloads, configure Google Application Default Credentials on the process that performs the download: the server for `server-only`, the client for `direct`, and either process for `smart-fallback`. Common options are `GOOGLE_APPLICATION_CREDENTIALS`, `gcloud auth application-default login`, or Workload Identity on GKE.

For OCI downloads, set `MODEL_EXPRESS_OCI_*` credentials on the process that performs the download when anonymous registry access is not enough. See [`OCI_PROVIDER.md`](OCI_PROVIDER.md) for the exact auth precedence.

**Model Commands:**
- `download`: Download model with automatic storage (use `--strategy` and `--provider` for options)
- `init`: Initialize model storage configuration
13 changes: 13 additions & 0 deletions docs/DEPLOYMENT.md
Expand Up @@ -171,6 +171,8 @@ Cache directory resolution for NGC: `MODEL_EXPRESS_CACHE_DIRECTORY` -> `~/.cache

GCS uses the configured/default ModelExpress cache root; `MODEL_EXPRESS_CACHE_DIRECTORY` overrides it. Cached GCS models are stored under `<cache>/gcs/<bucket>/<object-prefix>`. See [`GCS_PROVIDER.md`](GCS_PROVIDER.md) for provider internals.

OCI uses the configured/default ModelExpress cache root; `MODEL_EXPRESS_CACHE_DIRECTORY` overrides it. Cached OCI artifacts are stored under `<cache>/oci/<registry>/<repo...>/tags/<tag>/files` or `<cache>/oci/<registry>/<repo...>/digests/<algorithm>-<hex>/files`. See [`OCI_PROVIDER.md`](OCI_PROVIDER.md) for provider internals.

See [`CLI.md`](CLI.md) for full CLI usage documentation.

## Docker
Expand Down Expand Up @@ -259,6 +261,17 @@ kubectl create secret generic gcs-service-account-key \

Mount the secret into the server or client pod and set `GOOGLE_APPLICATION_CREDENTIALS` to the mounted file path. When using Workload Identity, no key secret is needed. For cache layout, manifest behavior, and failure modes, see [`GCS_PROVIDER.md`](GCS_PROVIDER.md).

### OCI Registry Credentials

OCI artifact downloads use registry-qualified refs such as `oci://registry.example.com/team/model:v1` or `registry.example.com/team/model@sha256:...`. Auth is selected in this order:

1. `MODEL_EXPRESS_OCI_BEARER_TOKEN`
2. `MODEL_EXPRESS_OCI_USERNAME` plus `MODEL_EXPRESS_OCI_PASSWORD`
3. `MODEL_EXPRESS_OCI_USERNAME` plus `MODEL_EXPRESS_OCI_TOKEN`
4. Anonymous access

For artifact format, archive support, cache layout, and failure behavior, see [`OCI_PROVIDER.md`](OCI_PROVIDER.md).

### Helm Chart

The `helm/` directory provides a full Helm chart with configurable replicas, PVC, ingress, and resource limits.
83 changes: 83 additions & 0 deletions docs/OCI_PROVIDER.md
@@ -0,0 +1,83 @@
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->

# OCI Provider

ModelExpress can download file-oriented OCI model artifacts. The provider supports raw file blobs and simple archive layers. It uses the Rust `oci-client` crate for registry reference parsing, authentication, manifest fetches, and blob streaming.

OCI support is a materializer, not a container image unpacker. It does not apply whiteouts or merge root filesystem layers, and it rejects symlinks, hardlinks, and special files.

## References

Use `--provider oci` with a registry-qualified reference that includes a tag or digest:

```bash
modelexpress-cli model download registry.example.com/team/model:v1 --provider oci
modelexpress-cli model download oci://registry.example.com/team/model:v1 --provider oci
modelexpress-cli model download registry.example.com/team/model@sha256:<digest> --provider oci
```

The optional `oci://` prefix is stripped before parsing and cache key generation.
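
As a rough illustration of the reference rules above, the sketch below strips the optional prefix and checks for a registry host plus a tag or digest. It is a standalone approximation, not the provider's code: the real parsing is delegated to `oci-client`, and the function name `normalize_oci_reference` and the host heuristic (a dot, a port, or `localhost` in the first segment) are assumptions made for this example.

```rust
/// Hypothetical helper: strip an optional `oci://` prefix and check that the
/// reference is registry-qualified and carries a tag or digest.
pub fn normalize_oci_reference(input: &str) -> Result<String, String> {
    let reference = input.strip_prefix("oci://").unwrap_or(input);

    // The first path segment must look like a registry host (assumed heuristic).
    let (registry, rest) = reference
        .split_once('/')
        .ok_or_else(|| format!("missing repository in `{input}`"))?;
    let looks_like_host =
        registry.contains('.') || registry.contains(':') || registry == "localhost";
    if !looks_like_host {
        return Err(format!("reference `{input}` is not registry-qualified"));
    }

    // A digest is introduced by `@`; a tag by `:` in the last path segment.
    let has_digest = rest.contains('@');
    let last_segment = rest.rsplit('/').next().unwrap_or(rest);
    let has_tag = last_segment.contains(':');
    if !(has_digest || has_tag) {
        return Err(format!("reference `{input}` needs a tag or digest"));
    }

    Ok(reference.to_string())
}
```

Under these assumptions, both `oci://registry.example.com/team/model:v1` and `registry.example.com/team/model:v1` normalize to the same string, which is what lets them share one cache key.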

## Artifact Format

Raw file layers must include `org.opencontainers.image.title` or `org.cncf.model.filepath`. ModelExpress uses that annotation as the output path relative to the model directory.

Archive layers are supported when their media type identifies a `tar` or `tar+zstd` payload, including `application/vnd.oci.image.layer.v1.tar+zstd` and model-specific media types ending in `.tar`. Tar member paths are materialized relative to the model directory. Layer titles are labels only; include any desired directory prefixes in the tar member names.

The provider rejects empty paths, absolute paths, `.` and `..` components, backslashes, non-UTF-8 path data, duplicate output paths, symlinks, hardlinks, and special archive entries. README files, dotfiles, and images are skipped. When `ignore_weights=true`, raw weight-file layers are skipped before download and archive-like layers are skipped as whole blobs.
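
The path checks listed above could look roughly like the following. This is a minimal sketch, not the code in `oci.rs`: the function name `validate_output_path` is hypothetical, rejecting empty components such as `a//b` is an assumption, and non-UTF-8 input is excluded simply because the parameter is a `&str`. Symlink, hardlink, and special-entry checks depend on the tar entry type and are not shown.

```rust
use std::collections::HashSet;

/// Hypothetical validator for a layer annotation or tar member path.
/// `seen` tracks paths already accepted so duplicates can be rejected.
pub fn validate_output_path(path: &str, seen: &mut HashSet<String>) -> Result<(), String> {
    if path.is_empty() {
        return Err("empty path".to_string());
    }
    if path.starts_with('/') {
        return Err(format!("absolute path: {path}"));
    }
    if path.contains('\\') {
        return Err(format!("backslash in path: {path}"));
    }
    for component in path.split('/') {
        // Empty components are rejected here as an assumption of this sketch.
        if component.is_empty() || component == "." || component == ".." {
            return Err(format!("invalid component in path: {path}"));
        }
    }
    // `insert` returns false when the path was already present.
    if !seen.insert(path.to_string()) {
        return Err(format!("duplicate output path: {path}"));
    }
    Ok(())
}
```

Validating every path before any bytes are written keeps a malicious or malformed artifact from escaping the model directory.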

Example artifact layout:

```bash
oras push registry.example.com/team/model:v1 \
config.json:application/json \
tokenizer.json:application/json \
model.safetensors:application/octet-stream
```

Example archive artifact layout:

```text
layer media type: application/vnd.oci.image.layer.v1.tar+zstd
tar members:
tokenizer/tokenizer.json
part-0/program.0.gas
part-1/program.8.gas
```

This materializes those same tar member paths under the cache entry.

## Authentication

Authentication uses this precedence:

1. `MODEL_EXPRESS_OCI_BEARER_TOKEN`
2. `MODEL_EXPRESS_OCI_USERNAME` plus `MODEL_EXPRESS_OCI_PASSWORD`
3. `MODEL_EXPRESS_OCI_USERNAME` plus `MODEL_EXPRESS_OCI_TOKEN`
4. Anonymous access
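
The precedence above can be sketched as a small selection function. The environment variable names come from this document; everything else is an assumption for illustration: the `OciAuth` enum and `select_auth` function are hypothetical, empty values are treated as unset, and a username without a password or token falls through to anonymous access.

```rust
/// Hypothetical auth choice; the real provider maps onto `oci-client` types.
#[derive(Debug, PartialEq)]
pub enum OciAuth {
    Bearer(String),
    Basic { username: String, secret: String },
    Anonymous,
}

/// Pick credentials in the documented order. `lookup` abstracts the
/// environment so the logic can be exercised without touching process state.
pub fn select_auth(lookup: impl Fn(&str) -> Option<String>) -> OciAuth {
    // Assumption: an empty variable counts as unset.
    let get = |name: &str| lookup(name).filter(|value| !value.is_empty());

    if let Some(token) = get("MODEL_EXPRESS_OCI_BEARER_TOKEN") {
        return OciAuth::Bearer(token);
    }
    if let Some(username) = get("MODEL_EXPRESS_OCI_USERNAME") {
        // The password wins over the token when both are set.
        let secret = get("MODEL_EXPRESS_OCI_PASSWORD")
            .or_else(|| get("MODEL_EXPRESS_OCI_TOKEN"));
        if let Some(secret) = secret {
            return OciAuth::Basic { username, secret };
        }
    }
    OciAuth::Anonymous
}
```

In a real process the lookup would typically be `select_auth(|name| std::env::var(name).ok())`.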

## Cache Layout

OCI artifacts are cached under the ModelExpress cache root:

```text
<cache-root>/oci/<registry>/<repo...>/tags/<tag>/files
<cache-root>/oci/<registry>/<repo...>/digests/<algorithm>-<hex>/files
```

The provider follows NGC-like cache reuse semantics: `ignore_weights` affects which files are materialized during the download, but it is not part of the cache identity. An existing non-empty `files` directory for the same OCI reference is reused.
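
To make the layout concrete, here is a hedged sketch of how a cache path could be derived from an already parsed reference. The `OciVersion` enum and `oci_cache_path` function are hypothetical names for this example, and the inputs are assumed to be validated beforehand. Note that `ignore_weights` does not appear anywhere in the signature, matching the statement that it is not part of the cache identity.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical description of how a reference pins its content.
pub enum OciVersion<'a> {
    Tag(&'a str),
    Digest { algorithm: &'a str, hex: &'a str },
}

/// Build `<cache-root>/oci/<registry>/<repo...>/{tags|digests}/<id>/files`.
pub fn oci_cache_path(
    cache_root: &Path,
    registry: &str,
    repository: &str,
    version: &OciVersion,
) -> PathBuf {
    let mut path = cache_root.join("oci").join(registry);
    // Each repository segment becomes its own directory level.
    for segment in repository.split('/') {
        path.push(segment);
    }
    match version {
        OciVersion::Tag(tag) => {
            path.push("tags");
            path.push(*tag);
        }
        OciVersion::Digest { algorithm, hex } => {
            path.push("digests");
            path.push(format!("{algorithm}-{hex}"));
        }
    }
    path.push("files");
    path
}
```

For `registry.example.com/team/model:v1` with a cache root of `/cache`, this yields `/cache/oci/registry.example.com/team/model/tags/v1/files`.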

## Publish Behavior

Downloads materialize into a staging directory:

```text
<cache-root>/oci/.tmp/<uuid>/files
```

Raw blobs stream directly into files. Archive blobs are streamed to a temporary blob file under the staging entry, extracted into `files`, and removed before publish.

After all selected blobs are written, the staging entry is atomically renamed into the final cache path. If the final cache entry already exists and has a non-empty `files` directory, ModelExpress removes the staging entry and reuses the existing cache. If the final cache entry exists but is incomplete or corrupt, publish fails with a cache-corruption error and removes the staging entry; clear the corrupt cache entry before retrying.
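
The publish step described above can be approximated with the standard library alone. This is a sketch under stated assumptions, not the provider's implementation: `publish`, `PublishOutcome`, and `has_files` are hypothetical names, the corruption case is reported as a generic `io::Error`, and the existence check followed by a rename is not safe against concurrent publishers without additional locking.

```rust
use std::fs;
use std::io;
use std::path::Path;

#[derive(Debug, PartialEq)]
pub enum PublishOutcome {
    Published,
    ReusedExisting,
}

/// True when `<entry>/files` exists and contains at least one entry.
fn has_files(entry: &Path) -> bool {
    fs::read_dir(entry.join("files"))
        .map(|mut entries| entries.next().is_some())
        .unwrap_or(false)
}

/// Move a completed staging entry into its final cache location.
pub fn publish(staging_entry: &Path, final_entry: &Path) -> io::Result<PublishOutcome> {
    if final_entry.exists() {
        // Either way the staging entry is no longer needed.
        fs::remove_dir_all(staging_entry)?;
        return if has_files(final_entry) {
            Ok(PublishOutcome::ReusedExisting)
        } else {
            Err(io::Error::new(
                io::ErrorKind::Other,
                "cache entry exists but is incomplete or corrupt",
            ))
        };
    }
    if let Some(parent) = final_entry.parent() {
        fs::create_dir_all(parent)?;
    }
    // A rename within one filesystem replaces the path in a single step.
    fs::rename(staging_entry, final_entry)?;
    Ok(PublishOutcome::Published)
}
```

Keeping the staging directory under the same cache root is what makes the rename a same-filesystem operation, so readers see either no cache entry or a complete one.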
2 changes: 1 addition & 1 deletion docs/metadata.md
Expand Up @@ -173,7 +173,7 @@ Three types of Redis keys are relevant:

| Field | Value | Purpose |
|-------|-------|---------|
| `provider` | `HuggingFace`, `Ngc`, or `Gcs` | Provider associated with the cached model |
| `provider` | `HuggingFace`, `Ngc`, `Gcs`, or `Oci` | Provider associated with the cached model |
| `status` | `DOWNLOADING`, `DOWNLOADED`, or `ERROR` | Download lifecycle state |
| `created_at` | RFC3339 timestamp | First write time, preserved across status updates |
| `last_used_at` | RFC3339 timestamp | Last status write or cache hit time for LRU eviction |
2 changes: 2 additions & 0 deletions examples/crds.yaml
Expand Up @@ -183,6 +183,8 @@ spec:
enum:
- HuggingFace
- Ngc
- Gcs
- Oci
status:
type: object
properties:
7 changes: 5 additions & 2 deletions modelexpress-cli-completion.bash
Expand Up @@ -73,7 +73,7 @@ _model_express_cli_completions() {
elif [[ "${words[i+1]}" == "download" ]]; then
case "${prev}" in
--provider|-p)
COMPREPLY=($(compgen -W "hugging-face" -- "$cur"))
COMPREPLY=($(compgen -W "hugging-face ngc gcs oci" -- "$cur"))
;;
--strategy|-s)
COMPREPLY=($(compgen -W "smart-fallback server-only direct" -- "$cur"))
Expand Down Expand Up @@ -108,13 +108,16 @@ _model_express_cli_completions() {
fi
elif [[ "${words[i+1]}" == "clear" ]]; then
case "${prev}" in
--provider|-p)
COMPREPLY=($(compgen -W "hugging-face ngc gcs oci" -- "$cur"))
;;
clear)
# Could potentially list actual downloaded models here
COMPREPLY=($(compgen -W "google-t5/t5-small microsoft/DialoGPT-small" -- "$cur"))
;;
*)
if [[ "$cur" == -* ]]; then
COMPREPLY=($(compgen -W "--help" -- "$cur"))
COMPREPLY=($(compgen -W "--provider --help" -- "$cur"))
fi
;;
esac