Summary
Add an optional per-InferenceService field that points the built-in model
cache (prep + download init containers) at a user-managed PVC, instead of
the operator's shared/perService cache PVC:
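As a minimal sketch, the field could look like this. The apiVersion, the metadata and PVC names, and the modelRef field are illustrative assumptions; only spec.modelCache.claimName comes from this proposal.

```yaml
# Sketch only: apiVersion, names, and modelRef are assumed for illustration.
apiVersion: inference.llmkube.dev/v1alpha1   # assumed API group/version
kind: InferenceService
metadata:
  name: big-model                      # hypothetical
spec:
  modelRef: big-model                  # assumed existing field; not part of this proposal
  modelCache:
    claimName: big-model-local-cache   # proposed: user-managed PVC in the same namespace
```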
When set, the operator uses claimName as the writable cache volume for that
workload and downloads into it exactly as it does today for the shared cache.
When unset, behavior is unchanged (the global shared/perService mode
applies). This lets a cluster mix: most InferenceServices ride the shared
cache, while specific ones use their own PVC.
Naming is the maintainer's call: spec.modelCache.claimName keeps the
existing "model cache" vocabulary and leaves room to grow. A flat
spec.modelCacheClaimName and a more generic spec.persistence.claimName
were also considered.
Motivation
The model cache backend is currently an operator-global choice
(--model-cache-mode / chart modelCache): either one shared PVC (shared)
or an operator-provisioned per-workload PVC (perService), both using a single
chart-level storageClass. There is no way to say "this one model should
cache on a different volume."
Concrete case: on a multi-node cluster the shared cache is an RWX class
(e.g. a networked filesystem). For most models that's fine, but for a large
model on a specific node it's desirable to cache on node-local storage —
the networked filesystem's cold-load (first prefill reading the weights) is
markedly slower, and the model is pinned to that node anyway. Today the only
ways to get node-local weights are:
- Switch the whole operator to perService, but that moves every model
  off the shared cache and still uses one storage class, so you can't target
  just the one workload or give it a different (local) class.
- Pre-stage onto a PVC and use source: pvc://…, but that is read-only with
  no download, so you have to build and maintain your own staging Job, which
  is exactly the toil the built-in downloader was meant to remove.
A per-InferenceService claimName override closes that gap: you bring a PVC
backed by whatever storage class you want, and the operator's existing
prep+download machinery fills and serves it — no staging Job, no giving up the
managed download path.
Proposed behavior
When spec.modelCache.claimName is set on an InferenceService:
- The operator uses that PVC as the model-cache volume in
  buildCachedStorageConfig, the same code path as the shared/perService
  cache: model-cache-prep (chown) + model-downloader run against it, and the
  server is started with --model pointing into it. No new download logic.
- Weights land under the existing <cacheKey>/ subdirectory (not the PVC
  root), matching current layout so RefreshPolicy and cache-key semantics are
  unchanged, and pointing two models at one PVC can't collide.
- The main container mounts the PVC read-only (as today); the init
  containers mount it read-write.
When unset: unchanged — the global shared/perService mode applies.
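The mount layout described above could translate into a pod spec fragment roughly like this. It is a sketch, not the operator's actual output: the mount path, images, main container name, and model file name are placeholders, while the model-cache volume name and the two init container names come from the description above.

```yaml
# Illustrative pod spec fragment; paths, images, and the main container name are assumed.
spec:
  volumes:
    - name: model-cache
      persistentVolumeClaim:
        claimName: big-model-local-cache         # value of spec.modelCache.claimName
  initContainers:
    - name: model-cache-prep                     # chowns the cache directory
      image: example.invalid/prep:latest         # placeholder image
      volumeMounts:
        - name: model-cache
          mountPath: /models                     # assumed mount path, read-write
    - name: model-downloader                     # downloads weights into <cacheKey>/
      image: example.invalid/downloader:latest   # placeholder image
      volumeMounts:
        - name: model-cache
          mountPath: /models
  containers:
    - name: server                               # hypothetical main container name
      image: example.invalid/server:latest       # placeholder image
      args: ["--model", "/models/<cacheKey>/model.gguf"]   # file name is a placeholder
      volumeMounts:
        - name: model-cache
          mountPath: /models
          readOnly: true                         # main container only reads the weights
```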
Lifecycle
The operator never creates or deletes a claimName PVC — the user owns it
end-to-end (contrast with perService, where the operator provisions and may
GC <isvc>-model-cache). The operator only mounts and populates it.
Edge cases / guardrails
- Missing PVC: surface a clear Degraded condition / event rather than
  silently falling back to the shared cache.
- source: pvc://… (already-staged, read-only, no download) + claimName:
  mutually exclusive. claimName targets the download path, so it's
  meaningless for a pre-staged source. Reject at validation or document as
  ignored.
- Node alignment is the user's responsibility: for a node-local RWO PVC,
  the InferenceService's nodeSelector must land the pod where the PVC binds
  (a WaitForFirstConsumer local class binds on first consumer = the pod's
  node; a pre-bound RWO PVC pins the pod). The operator does not enforce this.
- fsGroup/chown: no new handling needed. The same root model-cache-prep
  runs, so fsGroupPolicy: None local storage classes are covered exactly like
  the shared path.
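For the node-alignment case, one possible setup pairs a local StorageClass using WaitForFirstConsumer with a nodeSelector on the InferenceService. This is a sketch under assumptions: all names, the size, the node hostname, the InferenceService apiVersion, and the exact location of nodeSelector in its spec are illustrative.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-nvme                           # hypothetical
provisioner: kubernetes.io/no-provisioner    # static local PVs must be created separately
volumeBindingMode: WaitForFirstConsumer      # binding waits until a consuming pod is scheduled
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: big-model-local-cache                # hypothetical
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: local-nvme
  resources:
    requests:
      storage: 200Gi                         # placeholder; size to fit the model weights
---
apiVersion: inference.llmkube.dev/v1alpha1   # assumed API group/version
kind: InferenceService
metadata:
  name: big-model                            # hypothetical
spec:
  nodeSelector:                              # assumed field location
    kubernetes.io/hostname: gpu-node-1       # hypothetical node
  modelCache:
    claimName: big-model-local-cache         # proposed field
```

With static local PVs, the PV's own node affinity decides where the claim can bind, so the nodeSelector has to name that same node.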
Out of scope (possible follow-ups)
- llmkube cache list / --purge-cache would not inspect BYO cache PVCs (they
  read the shared cache). A natural later enhancement; relates to the CLI
  cache-key handling noted on fix: cache hf:// multi-file models instead of
  emptyDir #912.
- Operator-provisioned per-model cache (a per-InferenceService
  storageClassName + size instead of a pre-made PVC): a reasonable sibling
  feature, but intentionally left out here to keep this focused on the
  bring-your-own-PVC case.