[FEATURE] Per-InferenceService model cache PVC override (spec.modelCache.claimName) #928

Description

@joryirving

Summary

Add an optional per-InferenceService field that points the built-in model
cache (prep + download init containers) at a user-managed PVC, instead of
the operator's shared/perService cache PVC:

apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
spec:
  modelCache:
    claimName: my-model-cache   # pre-existing, user-owned PVC
  # ... other InferenceService fields unchanged

When set, the operator uses claimName as the writable cache volume for that
workload and downloads into it exactly as it does today for the shared cache.
When unset, behavior is unchanged (the global shared/perService mode
applies). This lets a cluster mix: most InferenceServices ride the shared
cache, while specific ones use their own PVC.

Naming is the maintainer's call — spec.modelCache.claimName keeps the
existing "model cache" vocabulary and leaves room to grow; flat
spec.modelCacheClaimName or a more generic spec.persistence.claimName
were also considered.

Motivation

The model cache backend is currently an operator-global choice
(--model-cache-mode / chart modelCache): either one shared PVC (shared)
or an operator-provisioned per-workload PVC (perService), both using a single
chart-level storageClass. There is no way to say "this one model should
cache on a different volume."

Concrete case: on a multi-node cluster the shared cache is an RWX class
(e.g. a networked filesystem). For most models that's fine, but for a large
model on a specific node it's desirable to cache on node-local storage —
the networked filesystem's cold-load (first prefill reading the weights) is
markedly slower, and the model is pinned to that node anyway. Today the only
ways to get node-local weights are:

  • Switch the whole operator to perService — but that moves every model
    off the shared cache and still uses one storage class, so you can't target
    just the one workload or give it a different (local) class.
  • Pre-stage onto a PVC and use source: pvc://…, but that path is read-only
    with no download, so you have to build and maintain your own staging Job,
    which is exactly the toil the built-in downloader was meant to remove.

A per-InferenceService claimName override closes that gap: you bring a PVC
backed by whatever storage class you want, and the operator's existing
prep+download machinery fills and serves it — no staging Job, no giving up the
managed download path.

Proposed behavior

When spec.modelCache.claimName is set on an InferenceService:

  • The operator uses that PVC as the model-cache volume in
    buildCachedStorageConfig — same code path as the shared/perService cache:
    model-cache-prep (chown) + model-downloader run against it, and the
    server is started with --model pointing into it. No new download logic.
  • Weights land under the existing <cacheKey>/ subdirectory (not the PVC
    root), matching current layout so RefreshPolicy and cache-key semantics are
    unchanged, and pointing two models at one PVC can't collide.
  • The main container mounts the PVC read-only (as today); the init
    containers mount it read-write.

When unset: unchanged — the global shared/perService mode applies.
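
As a rough illustration of the behavior listed above, the pod generated for an InferenceService with claimName: my-model-cache might contain a fragment like the one below. This is a sketch under stated assumptions, not the operator's actual output: the volume name, mount path, and main container name are hypothetical. Only the init container names (model-cache-prep, model-downloader) and the read-write vs. read-only split come from this proposal. Images and other required container fields are omitted.

```yaml
# Hypothetical pod spec fragment; anything marked "assumed" is not taken from the operator.
spec:
  volumes:
    - name: model-cache                # assumed volume name
      persistentVolumeClaim:
        claimName: my-model-cache      # value of spec.modelCache.claimName
  initContainers:
    - name: model-cache-prep           # chowns the cache directory
      volumeMounts:
        - name: model-cache
          mountPath: /models           # assumed mount path
    - name: model-downloader           # downloads weights under <cacheKey>/
      volumeMounts:
        - name: model-cache
          mountPath: /models           # read-write by default
  containers:
    - name: server                     # assumed container name
      volumeMounts:
        - name: model-cache
          mountPath: /models
          readOnly: true               # main container only reads the weights
```

The only difference from the shared or perService path would be the claimName in the volume source; the mounts and init containers stay the same.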

Lifecycle

The operator never creates or deletes a claimName PVC — the user owns it
end-to-end (contrast with perService, where the operator provisions and may
GC <isvc>-model-cache). The operator only mounts and populates it.

Edge cases / guardrails

  • Missing PVC: surface a clear Degraded condition / event rather than
    silently falling back to the shared cache.
  • source: pvc://… (already-staged, read-only, no download) +
    claimName: mutually exclusive — claimName targets the download path, so
    it's meaningless for a pre-staged source. Reject at validation or document as
    ignored.
  • Node alignment is the user's responsibility: for a node-local RWO PVC,
    the InferenceService's nodeSelector must land the pod on the node where
    the PVC binds. With a WaitForFirstConsumer local class, the volume binds
    on the node of the first consuming pod; a pre-bound RWO PVC pins the pod
    to its node. The operator does not enforce this.
  • fsGroup/chown: no new handling needed — the same root model-cache-prep
    runs, so fsGroupPolicy: None local storage classes are covered exactly like
    the shared path.
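
To make the node-alignment case concrete, here is a minimal sketch of what a user might apply. The storage class name, node name, PVC size, and InferenceService name are hypothetical. spec.modelCache.claimName is the field proposed in this issue, and the exact path of nodeSelector on the InferenceService spec is an assumption.

```yaml
# User-managed PVC on a node-local class; the operator would neither create nor delete it.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-model-cache
  namespace: default                   # must be the InferenceService's namespace
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: local-nvme         # hypothetical WaitForFirstConsumer local class
  resources:
    requests:
      storage: 200Gi                   # illustrative size
---
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: big-model                      # hypothetical name
  namespace: default
spec:
  modelCache:
    claimName: my-model-cache          # proposed field
  nodeSelector:                        # assumed field path
    kubernetes.io/hostname: gpu-node-1 # hypothetical node
  # ... other InferenceService fields unchanged
```

With a WaitForFirstConsumer class, the PVC stays Pending until the pod is scheduled, so the nodeSelector is what decides which node's local storage ends up holding the weights.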

Out of scope (possible follow-ups)

  • llmkube cache list / --purge-cache would not inspect BYO cache PVCs (they
    read the shared cache). A natural later enhancement; it relates to the CLI
    cache-key handling noted on #912 (fix: cache hf:// multi-file models
    instead of emptyDir).
  • Operator-provisioned per-model cache (a per-InferenceService
    storageClassName + size instead of a pre-made PVC) — a reasonable sibling
    feature, but intentionally left out here to keep this focused on the
    bring-your-own-PVC case.
