ai-dynamo · andrewpaprotsky · May 7, 2026
diff --git a/Cargo.lock b/Cargo.lock
diff --git a/Cargo.toml b/Cargo.toml
@@ -41,13 +41,15 @@ jiff = { version = "0.2.15", features = ["serde"] }
 modelexpress-common = { path = "modelexpress_common", version = "0.3.0" }
 modelexpress-client = { path = "modelexpress_client", version = "0.3.0" }
 modelexpress-server = { path = "modelexpress_server", version = "0.3.0" }
+oci-client = { version = "0.16.1", default-features = false, features = ["rustls-tls"] }
 once_cell = "1.21.3"
 prost = "0.13"
 rustls = { version = "0.23.37", default-features = false, features = ["ring", "std"] }
 serde = { version = "1.0", features = ["derive"] }
 serde_json = "1.0"
 mockall = "0.14.0"
 tempfile = "3.20"
+tar = "0.4"
 tokio = { version = "1.46", features = ["full"] }
 tokio-stream = "0.1"
 tonic = "0.13"
@@ -57,6 +59,7 @@ tracing = "0.1"
 tracing-subscriber = { version = "0.3", features = ["env-filter"] }
 futures = "0.3"
 uuid = { version = "1.17", features = ["v4", "serde"] }
+zstd = "0.13"
 thiserror = "2.0"
 redis = { version = "0.27", features = ["tokio-comp", "connection-manager"] }
 reqwest = { version = "0.12", default-features = false, features = ["json", "rustls-tls", "stream"] }

diff --git a/README.md b/README.md
@@ -40,9 +40,9 @@ ModelExpress is a Rust-based service that manages the complete model weight life
 
 ### How ModelExpress manages weights in the cluster
 
-ModelExpress orchestrates the full flow—from download to GPU memory. It ensures only one node downloads a model from external sources (e.g., HuggingFace); other nodes receive weights via P2P or shared storage—eliminating duplicate downloads and reducing cluster ingress.
+ModelExpress orchestrates the full flow—from download to GPU memory. It ensures only one node downloads a model from external sources (e.g., HuggingFace, NGC, GCS, or OCI registries); other nodes receive weights via P2P or shared storage—eliminating duplicate downloads and reducing cluster ingress.
 
-1. **Download from HuggingFace** — One node pulls the model; ModelExpress coordinates so no other node duplicates this download, reducing external ingress. In air-gapped mode, serve from cache only (`HF_HUB_OFFLINE=1`).
+1. **Download from a model source** — One node pulls the model from HuggingFace, NGC, GCS, or a file/archive OCI artifact; ModelExpress coordinates so no other node duplicates this download, reducing external ingress. In air-gapped HuggingFace mode, serve from cache only (`HF_HUB_OFFLINE=1`).
 2. **Persist to disk** — Store in a cache backed by disk:
    - **Host-attached disk** — Local disk on the node (single-node or per-node cache).
    - **PVC** — RWO (ReadWriteOnce) for single-node; RWX (ReadWriteMany) for shared access across nodes.
@@ -54,7 +54,7 @@ ModelExpress orchestrates the full flow—from download to GPU memory. It ensure
 ## Features
 
 - **Cold start reduction** — GPU-to-GPU P2P transfer over InfiniBand instead of disk load
-- **HuggingFace caching** — PVC-backed cache, `HF_HUB_OFFLINE`, `ignore_weights`, `get_model_path` for Dynamo
+- **Model source caching** — HuggingFace, NGC, GCS, and OCI artifact providers with PVC-backed cache support, `ignore_weights`, and `get_model_path` for Dynamo
 - **P2P GPU transfer** — vLLM `mx` loader and TRT-LLM `PRESHARDED` loader with NVIDIA NIXL over RDMA
 - **Metadata backends** — In-memory, Redis, or Kubernetes CRD (layered write-through for HA)
 - **Kubernetes** — Helm chart, CRDs/Redis for P2P, no-shared-storage support
@@ -98,9 +98,9 @@ ModelExpress orchestrates the full flow—from download to GPU memory. It ensure
 
 - **modelexpress_server**: gRPC server with configurable metadata backends (Redis, Kubernetes CRD).
 - **modelexpress_client**: Rust CLI for cache management; Python package with vLLM loaders and `MxClient` for gRPC.
-- **modelexpress_common**: Protobuf definitions, provider trait (HuggingFace), shared configuration.
+- **modelexpress_common**: Protobuf definitions, provider trait (HuggingFace, NGC, GCS, OCI), shared configuration.
 
-See [Architecture](docs/ARCHITECTURE.md).
+See [Architecture](docs/ARCHITECTURE.md), [GCS provider](docs/GCS_PROVIDER.md), and [OCI provider](docs/OCI_PROVIDER.md).
 
 ---
 
@@ -241,7 +241,6 @@ cargo bench
 - **DRAM and NVMe-resident shard streaming**: Stream shards across workers while keeping weights in DRAM and host local high-speed NVMe.
 - **RL workloads**: Explore fast P2P transfers to optimize RL refit phase and support for weight resharding.
 - **Earlier weight availability**: Bring weights to prefill earlier; identify prefill workers that can act as strong source nodes.
-- **Expanded model pull providers**: Support NGC in addition to Hugging Face.
 - **GDS (GPUDirect Storage) integration**: Load model weights directly from NVMe into GPU memory, bypassing the CPU/DRAM copy path.
 - **Multi-tier cache hierarchy**: Promote and demote models across DRAM, NVMe, and PVC tiers based on access patterns.
 - **Distributed sharded cache**: Shard large models across nodes using consistent hashing and parallel shard assembly.

diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md
@@ -11,7 +11,7 @@ Detailed reference document for the ModelExpress codebase. For deployment and co
 
 ModelExpress is a Rust-based model cache management service and GPU-to-GPU model weight transfer system. It serves two roles:
 
-- **Model Cache Service** - A sidecar alongside inference solutions (vLLM, SGLang, NVIDIA Dynamo) that accelerates model downloads from HuggingFace, NGC, and GCS. Model lifecycle state lives in a distributed registry — Redis or Kubernetes CRDs (`ModelCacheEntry`), selected via `MX_METADATA_BACKEND` — so multiple server replicas can coordinate without a shared-filesystem database. LRU cache eviction runs off the same registry.
+- **Model Cache Service** - A sidecar alongside inference solutions (vLLM, SGLang, NVIDIA Dynamo) that accelerates model downloads from HuggingFace, NGC, GCS, and file/archive OCI artifacts. Model lifecycle state lives in a distributed registry — Redis or Kubernetes CRDs (`ModelCacheEntry`), selected via `MX_METADATA_BACKEND` — so multiple server replicas can coordinate without a shared-filesystem database. LRU cache eviction runs off the same registry.
 - **P2P Weight Transfer** - GPU-to-GPU model weight transfers between vLLM instances using NVIDIA NIXL over RDMA/InfiniBand, enabling ~15-second transfers for 681GB models.
 
 ### Current Status
@@ -31,6 +31,7 @@ graph TD
         S1 --> HF[HuggingFace Hub]
         S1 --> NGC[NVIDIA NGC]
         S1 --> GCS[Google Cloud Storage]
+        S1 --> OCI[OCI Registry]
         S1 --> Cache[Model Cache Dir]
     end
 
@@ -177,7 +178,8 @@ ModelExpress/
 │           ├── gcs.rs                  # GcsProvider implementation
 │           ├── gcs/                    # GCS manifest, cache layout, locking, download helpers
 │           ├── huggingface.rs          # HuggingFaceProvider implementation
-│           └── ngc.rs                  # NgcProvider implementation
+│           ├── ngc.rs                  # NgcProvider implementation
+│           └── oci.rs                  # OciProvider implementation
 │
 ├── workspace-tests/
 │   ├── Cargo.toml
@@ -283,7 +285,7 @@ Four proto files define four services, all compiled via `tonic-build` in `modele
 | `StreamModelFiles` | `ModelFilesRequest` | stream `FileChunk` | Stream model file contents (1MB chunks) |
 | `ListModelFiles` | `ModelFilesRequest` | `ModelFileList` | List files with sizes |
 
-Key message types: `ModelProvider` (HuggingFace, NGC, GCS), `ModelStatus` (Downloading, Downloaded, Error), `ModelStatusUpdate`, `FileChunk`.
+Key message types: `ModelProvider` (HuggingFace, NGC, GCS, OCI), `ModelStatus` (Downloading, Downloaded, Error), `ModelStatusUpdate`, `FileChunk`.
 
 ### p2p.proto - P2pService
 
@@ -465,7 +467,7 @@ Output formats: `--format human` (default), `--format json`, `--format json-pret
 | `config` | Config trait utilities |
 | `download` | Download orchestration with strategy pattern |
 | `models` | `Status`, `ModelProvider`, `ModelStatus`, `ModelStatusResponse` |
-| `providers` | `ModelProviderTrait` + `HuggingFaceProvider` + `NgcProvider` + `GcsProvider` |
+| `providers` | `ModelProviderTrait` + `HuggingFaceProvider` + `NgcProvider` + `GcsProvider` + `OciProvider` |
 | `grpc` | Generated tonic stubs for all 4 services |
 | `constants` | `DEFAULT_GRPC_PORT` (8001), `DEFAULT_TIMEOUT_SECS` (30), `DEFAULT_TRANSFER_CHUNK_SIZE` (32KB) |
 
@@ -484,10 +486,11 @@ pub trait ModelProviderTrait: Send + Sync {
 }
 ```
 
-Three implementations:
-- `HuggingFaceProvider` - uses the `hf-hub` crate with high-CPU download mode.
-- `NgcProvider` - downloads from NVIDIA NGC via the V2 artifact API (Bearer-authenticated `/files/{path}` for team artifacts; presigned S3 URLs for org-level artifacts). Falls back to `checksums.blake3` manifest enumeration when bulk file listing returns 400. Resolves the NGC API key from `NGC_API_KEY`, `NGC_CLI_API_KEY`, or `~/.ngc/config`.
-- `GcsProvider` - downloads objects under a full `gs://<bucket>/<object-prefix>` URL using Google Application Default Credentials. It writes a `.mx/manifest.json` cache manifest, verifies downloaded files with GCS CRC32C checksums, skips dotfiles, README, and images, and stores models under `<cache>/gcs/<bucket>/<object-prefix>`. See [`GCS_PROVIDER.md`](GCS_PROVIDER.md) for the detailed design.
+Provider implementations:
+- `HuggingFaceProvider` — uses the `hf-hub` crate with high-CPU download mode.
+- `NgcProvider` — downloads from NVIDIA NGC via the V2 artifact API (Bearer-authenticated `/files/{path}` for team artifacts; presigned S3 URLs for org-level artifacts). Falls back to `checksums.blake3` manifest enumeration when bulk file listing returns 400. Resolves the NGC API key from `NGC_API_KEY`, `NGC_CLI_API_KEY`, or `~/.ngc/config`.
+- `GcsProvider` — downloads objects under a full `gs://<bucket>/<object-prefix>` URL using Google Application Default Credentials. It writes a `.mx/manifest.json` cache manifest, verifies downloaded files with GCS CRC32C checksums, skips dotfiles, README, and images, and stores models under `<cache>/gcs/<bucket>/<object-prefix>`. See [`GCS_PROVIDER.md`](GCS_PROVIDER.md) for the detailed design.
+- `OciProvider` — downloads OCI model artifacts via `oci-client`. Raw layers use `org.opencontainers.image.title` or `org.cncf.model.filepath` as the output file path; simple `tar` and `tar+zstd` layers are safely extracted. ModelExpress atomically publishes the completed `files` directory. Container image unpacking remains out of scope: no whiteouts or rootfs layer merging. See [`OCI_PROVIDER.md`](OCI_PROVIDER.md).
 
 ### ClientConfig / ClientArgs
 

diff --git a/docs/CLI.md b/docs/CLI.md
@@ -97,6 +97,11 @@ modelexpress-cli model download gs://my-bucket/models/qwen/rev-1 \
 modelexpress-cli model download microsoft/DialoGPT-medium \
   --strategy direct
 
+# Download an OCI artifact from a registry
+modelexpress-cli model download registry.example.com/team/model:v1 \
+  --provider oci \
+  --strategy direct
+
 # Download with file transfer when no shared storage exists
 # Note: Global options must come before the subcommand
 modelexpress-cli --no-shared-storage --transfer-chunk-size 65536 \
@@ -153,9 +158,12 @@ modelexpress-cli model stats --detailed
 - `hugging-face`: Hugging Face model hub (default)
 - `ngc`: NVIDIA NGC catalog
 - `gcs`: Google Cloud Storage object prefix. The model name must be a full `gs://<bucket>/<path>` URL. See [`GCS_PROVIDER.md`](GCS_PROVIDER.md) for cache layout and provider behavior.
+- `oci`: OCI model artifact with raw file blobs or simple `tar`/`tar+zstd` archive layers. References must be registry-qualified and include a tag or digest, for example `oci://registry.example.com/team/model:v1` or `registry.example.com/team/model@sha256:...`. See [`OCI_PROVIDER.md`](OCI_PROVIDER.md) for artifact format, cache layout, and publish behavior.
 
 For GCS downloads, configure Google Application Default Credentials on the process that performs the download: the server for `server-only`, the client for `direct`, and either process for `smart-fallback`. Common options are `GOOGLE_APPLICATION_CREDENTIALS`, `gcloud auth application-default login`, or Workload Identity on GKE.
 
+For OCI downloads, set `MODEL_EXPRESS_OCI_*` credentials on the process that performs the download when anonymous registry access is not enough. See [`OCI_PROVIDER.md`](OCI_PROVIDER.md) for the exact auth precedence.
+
 **Model Commands:**
 - `download`: Download model with automatic storage (use `--strategy` and `--provider` for options)
 - `init`: Initialize model storage configuration
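Note on the `oci` reference rules added to `docs/CLI.md` above (optional `oci://` prefix, registry-qualified name, mandatory tag or digest): a minimal std-only sketch of that validation. The PR delegates reference parsing to `oci-client`, so `parse_oci_ref`, `OciRef`, and `Pin` are hypothetical names, and the "looks like a host" test is a Docker-style heuristic assumed here, not something the diff specifies.

```rust
/// How the reference pins a version.
#[derive(Debug, PartialEq)]
enum Pin {
    Tag(String),
    Digest(String),
}

/// Hypothetical parsed form of an OCI reference (not a type from this PR).
#[derive(Debug, PartialEq)]
struct OciRef {
    registry: String,
    repository: String,
    pin: Pin,
}

fn parse_oci_ref(input: &str) -> Result<OciRef, String> {
    // The optional `oci://` prefix is stripped before parsing.
    let rest = input.strip_prefix("oci://").unwrap_or(input);

    // Look for a digest first: digests contain ':' and must not be read as tags.
    let (name, pin) = if let Some((name, digest)) = rest.split_once('@') {
        if !digest.contains(':') {
            return Err(format!("`{input}`: digest must look like <algorithm>:<hex>"));
        }
        (name, Pin::Digest(digest.to_string()))
    } else {
        // A tag follows the last ':' after the last '/', so the port in
        // `localhost:5000/model` is not mistaken for a tag.
        let last_slash = rest.rfind('/').unwrap_or(0);
        match rest[last_slash..].rfind(':') {
            Some(offset) => {
                let colon = last_slash + offset;
                (&rest[..colon], Pin::Tag(rest[colon + 1..].to_string()))
            }
            None => return Err(format!("`{input}`: a tag or digest is required")),
        }
    };
    if matches!(&pin, Pin::Tag(tag) if tag.is_empty()) {
        return Err(format!("`{input}`: tag must not be empty"));
    }

    // Registry-qualified: the first path component must look like a host.
    // Assumption: a host contains '.', contains ':' (port), or is `localhost`.
    let (registry, repository) = name
        .split_once('/')
        .ok_or_else(|| format!("`{input}`: reference is not registry-qualified"))?;
    let looks_like_host =
        registry.contains('.') || registry.contains(':') || registry == "localhost";
    if !looks_like_host || repository.is_empty() {
        return Err(format!("`{input}`: reference is not registry-qualified"));
    }

    Ok(OciRef {
        registry: registry.to_string(),
        repository: repository.to_string(),
        pin,
    })
}
```

Under these assumptions, `oci://registry.example.com/team/model:v1` and `registry.example.com/team/model:v1` parse to the same value, which matches the statement that the prefix does not affect cache key generation.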

diff --git a/docs/DEPLOYMENT.md b/docs/DEPLOYMENT.md
@@ -171,6 +171,8 @@ Cache directory resolution for NGC: `MODEL_EXPRESS_CACHE_DIRECTORY` -> `~/.cache
 
 GCS uses the configured/default ModelExpress cache root; `MODEL_EXPRESS_CACHE_DIRECTORY` overrides it. Cached GCS models are stored under `<cache>/gcs/<bucket>/<object-prefix>`. See [`GCS_PROVIDER.md`](GCS_PROVIDER.md) for provider internals.
 
+OCI uses the configured/default ModelExpress cache root; `MODEL_EXPRESS_CACHE_DIRECTORY` overrides it. Cached OCI artifacts are stored under `<cache>/oci/<registry>/<repo...>/tags/<tag>/files` or `<cache>/oci/<registry>/<repo...>/digests/<algorithm>-<hex>/files`. See [`OCI_PROVIDER.md`](OCI_PROVIDER.md) for provider internals.
+
 See [`CLI.md`](CLI.md) for full CLI usage documentation.
 
 ## Docker
@@ -259,6 +261,17 @@ kubectl create secret generic gcs-service-account-key \
 
 Mount the secret into the server or client pod and set `GOOGLE_APPLICATION_CREDENTIALS` to the mounted file path. When using Workload Identity, no key secret is needed. For cache layout, manifest behavior, and failure modes, see [`GCS_PROVIDER.md`](GCS_PROVIDER.md).
 
+### OCI Registry Credentials
+
+OCI artifact downloads use registry-qualified refs such as `oci://registry.example.com/team/model:v1` or `registry.example.com/team/model@sha256:...`. Auth is selected in this order:
+
+1. `MODEL_EXPRESS_OCI_BEARER_TOKEN`
+2. `MODEL_EXPRESS_OCI_USERNAME` plus `MODEL_EXPRESS_OCI_PASSWORD`
+3. `MODEL_EXPRESS_OCI_USERNAME` plus `MODEL_EXPRESS_OCI_TOKEN`
+4. Anonymous access
+
+For artifact format, archive support, cache layout, and failure behavior, see [`OCI_PROVIDER.md`](OCI_PROVIDER.md).
+
 ### Helm Chart
 
 The `helm/` directory provides a full Helm chart with configurable replicas, PVC, ingress, and resource limits.
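Note on the "OCI Registry Credentials" section added above: the four-step precedence can be sketched as a single function over an environment lookup. This is an illustration only. `OciAuth`, `select_auth_with`, and `select_auth_from_env` are hypothetical names, the mapping of username plus password/token to a basic-auth credential is assumed, and treating empty variables as unset is also an assumption the diff does not confirm.

```rust
use std::env;

/// Hypothetical credential type for this sketch (not the PR's type).
#[derive(Debug, PartialEq)]
enum OciAuth {
    Bearer(String),
    Basic { username: String, secret: String },
    Anonymous,
}

/// Applies the documented precedence using any variable lookup,
/// which keeps the logic testable without touching the process environment.
fn select_auth_with(lookup: impl Fn(&str) -> Option<String>) -> OciAuth {
    // Assumption: an empty value is treated the same as an unset variable.
    let get = |name: &str| lookup(name).filter(|value| !value.is_empty());

    // 1. A bearer token wins over everything else.
    if let Some(token) = get("MODEL_EXPRESS_OCI_BEARER_TOKEN") {
        return OciAuth::Bearer(token);
    }
    if let Some(username) = get("MODEL_EXPRESS_OCI_USERNAME") {
        // 2. Username plus password, then 3. username plus token.
        let secret = get("MODEL_EXPRESS_OCI_PASSWORD")
            .or_else(|| get("MODEL_EXPRESS_OCI_TOKEN"));
        if let Some(secret) = secret {
            return OciAuth::Basic { username, secret };
        }
    }
    // 4. Nothing usable was configured.
    OciAuth::Anonymous
}

/// Convenience wrapper that reads the real process environment.
fn select_auth_from_env() -> OciAuth {
    select_auth_with(|name| env::var(name).ok())
}
```

In this sketch a username without a password or token falls through to anonymous access; whether the provider does the same or reports an error is not stated in the docs.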

diff --git a/docs/OCI_PROVIDER.md b/docs/OCI_PROVIDER.md
@@ -0,0 +1,83 @@
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+-->
+
+# OCI Provider
+
+ModelExpress can download file-oriented OCI model artifacts. The provider supports raw file blobs and simple archive layers. It uses the Rust `oci-client` crate for registry reference parsing, authentication, manifest fetches, and blob streaming.
+
+OCI support is a materializer, not a container image unpacker. It does not process whiteouts or merge root filesystem layers, and it rejects symlinks, hardlinks, and special files.
+
+## References
+
+Use `--provider oci` with a registry-qualified reference that includes a tag or digest:
+
+```bash
+modelexpress-cli model download registry.example.com/team/model:v1 --provider oci
+modelexpress-cli model download oci://registry.example.com/team/model:v1 --provider oci
+modelexpress-cli model download registry.example.com/team/model@sha256:<digest> --provider oci
+```
+
+The optional `oci://` prefix is stripped before parsing and cache key generation.
+
+## Artifact Format
+
+Raw file layers must include `org.opencontainers.image.title` or `org.cncf.model.filepath`. ModelExpress uses that annotation as the output path relative to the model directory.
+
+Archive layers are supported when their media type identifies a `tar` or `tar+zstd` payload, including `application/vnd.oci.image.layer.v1.tar+zstd` and model-specific media types ending in `.tar`. Tar member paths are materialized relative to the model directory. Layer titles are labels only; include any desired directory prefixes in the tar member names.
+
+The provider rejects empty paths, absolute paths, `.` and `..` components, backslashes, non-UTF-8 path data, duplicate output paths, symlinks, hardlinks, and special archive entries. README files, dotfiles, and images are skipped. When `ignore_weights=true`, raw weight-file layers are skipped before download and archive-like layers are skipped as whole blobs.
+
+Example artifact layout:
+
+```bash
+oras push registry.example.com/team/model:v1 \
+  config.json:application/json \
+  tokenizer.json:application/json \
+  model.safetensors:application/octet-stream
+```
+
+Example archive artifact layout:
+
+```text
+layer media type: application/vnd.oci.image.layer.v1.tar+zstd
+tar members:
+  tokenizer/tokenizer.json
+  part-0/program.0.gas
+  part-1/program.8.gas
+```
+
+This materializes those same tar member paths under the cache entry.
+
+## Authentication
+
+Authentication uses this precedence:
+
+1. `MODEL_EXPRESS_OCI_BEARER_TOKEN`
+2. `MODEL_EXPRESS_OCI_USERNAME` plus `MODEL_EXPRESS_OCI_PASSWORD`
+3. `MODEL_EXPRESS_OCI_USERNAME` plus `MODEL_EXPRESS_OCI_TOKEN`
+4. Anonymous access
+
+## Cache Layout
+
+OCI artifacts are cached under the ModelExpress cache root:
+
+```text
+<cache-root>/oci/<registry>/<repo...>/tags/<tag>/files
+<cache-root>/oci/<registry>/<repo...>/digests/<algorithm>-<hex>/files
+```
+
+The provider follows NGC-like cache reuse semantics: `ignore_weights` affects which files are materialized during the download, but it is not part of the cache identity. An existing non-empty `files` directory for the same OCI reference is reused.
+
+## Publish Behavior
+
+Downloads materialize into a staging directory:
+
+```text
+<cache-root>/oci/.tmp/<uuid>/files
+```
+
+Raw blobs stream directly into files. Archive blobs are streamed to a temporary blob file under the staging entry, extracted into `files`, and removed before publish.
+
+After all selected blobs are written, the staging entry is atomically renamed into the final cache path. If the final cache entry already exists and has a non-empty `files` directory, ModelExpress removes the staging entry and reuses the existing cache. If the final cache entry exists but is incomplete or corrupt, publish fails with a cache-corruption error and removes the staging entry; clear the corrupt cache entry before retrying.
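Note on the path rules in the "Artifact Format" section of `docs/OCI_PROVIDER.md` above: a minimal std-only sketch of how an annotation or tar member path could be checked before anything is written. `check_output_path` and `PathDecision` are hypothetical names. The skip rules here match on the file name only and use an assumed list of image extensions; the document states that README files, dotfiles, and images are skipped but not how they are detected.

```rust
use std::collections::HashSet;
use std::path::PathBuf;

/// Outcome for a path that passed validation.
#[derive(Debug, PartialEq)]
enum PathDecision {
    /// Materialize the entry at this path, relative to the model directory.
    Write(PathBuf),
    /// Valid, but intentionally not materialized (README, dotfile, image).
    Skip,
}

/// Validates one output path taken from a layer annotation or tar member.
/// `seen` carries the paths already accepted for this artifact.
fn check_output_path(raw: &[u8], seen: &mut HashSet<String>) -> Result<PathDecision, String> {
    // Non-UTF-8 path data is rejected outright.
    let path = std::str::from_utf8(raw).map_err(|_| "path is not valid UTF-8".to_string())?;

    if path.is_empty() {
        return Err("empty path".to_string());
    }
    if path.contains('\\') {
        return Err(format!("backslash in path: {path}"));
    }
    if path.starts_with('/') {
        return Err(format!("absolute path: {path}"));
    }
    // `.` and `..` components could escape or alias the model directory.
    if path.split('/').any(|c| c == "." || c == "..") {
        return Err(format!("`.` or `..` component in path: {path}"));
    }

    // Skip rules. Assumption: matched on the final component, case-insensitively.
    let file_name = path.rsplit('/').next().unwrap_or(path);
    let lower = file_name.to_ascii_lowercase();
    let is_image = [".png", ".jpg", ".jpeg", ".gif", ".webp"]
        .iter()
        .any(|ext| lower.ends_with(*ext));
    if file_name.starts_with('.') || lower.starts_with("readme") || is_image {
        return Ok(PathDecision::Skip);
    }

    // Two layers or members must not produce the same output file.
    if !seen.insert(path.to_string()) {
        return Err(format!("duplicate output path: {path}"));
    }
    Ok(PathDecision::Write(PathBuf::from(path)))
}
```

Running the check per entry with a shared `seen` set means a duplicate is caught at the second occurrence, before any bytes for it are written into the staging directory.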
diff --git a/docs/metadata.md b/docs/metadata.md
@@ -173,7 +173,7 @@ Three types of Redis keys are relevant:
 
 | Field | Value | Purpose |
 |-------|-------|---------|
-| `provider` | `HuggingFace`, `Ngc`, or `Gcs` | Provider associated with the cached model |
+| `provider` | `HuggingFace`, `Ngc`, `Gcs`, or `Oci` | Provider associated with the cached model |
 | `status` | `DOWNLOADING`, `DOWNLOADED`, or `ERROR` | Download lifecycle state |
 | `created_at` | RFC3339 timestamp | First write time, preserved across status updates |
 | `last_used_at` | RFC3339 timestamp | Last status write or cache hit time for LRU eviction |

diff --git a/examples/crds.yaml b/examples/crds.yaml
@@ -183,6 +183,8 @@ spec:
                   enum:
                     - HuggingFace
                     - Ngc
+                    - Gcs
+                    - Oci
             status:
               type: object
               properties:

diff --git a/modelexpress-cli-completion.bash b/modelexpress-cli-completion.bash
@@ -73,7 +73,7 @@ _model_express_cli_completions() {
             elif [[ "${words[i+1]}" == "download" ]]; then
                 case "${prev}" in
                     --provider|-p)
-                        COMPREPLY=($(compgen -W "hugging-face" -- "$cur"))
+                        COMPREPLY=($(compgen -W "hugging-face ngc gcs oci" -- "$cur"))
                         ;;
                     --strategy|-s)
                         COMPREPLY=($(compgen -W "smart-fallback server-only direct" -- "$cur"))
@@ -108,13 +108,16 @@ _model_express_cli_completions() {
                 fi
             elif [[ "${words[i+1]}" == "clear" ]]; then
                 case "${prev}" in
+                    --provider|-p)
+                        COMPREPLY=($(compgen -W "hugging-face ngc gcs oci" -- "$cur"))
+                        ;;
                     clear)
                         # Could potentially list actual downloaded models here
                         COMPREPLY=($(compgen -W "google-t5/t5-small microsoft/DialoGPT-small" -- "$cur"))
                         ;;
                     *)
                         if [[ "$cur" == -* ]]; then
-                            COMPREPLY=($(compgen -W "--help" -- "$cur"))
+                            COMPREPLY=($(compgen -W "--provider --help" -- "$cur"))
                         fi
                         ;;
                 esac
@@ -183,6 +183,8 @@ spec:
                       enum:
                         - HuggingFace
                         - Ngc
+                        - Gcs
+                        - Oci
                 status:
                   type: object
                   properties: