[FEATURE] Per-InferenceService model cache PVC override (spec.modelCache.claimName) #928

## Summary

Add an optional per-`InferenceService` field that points the built-in model
cache (prep + download init containers) at a **user-managed PVC**, instead of
the operator's shared/perService cache PVC:

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
spec:
  modelCache:
    claimName: my-model-cache   # pre-existing, user-owned PVC
  # ... (other spec fields unchanged)
```

When set, the operator uses `claimName` as the writable cache volume for that
workload and downloads into it exactly as it does today for the shared cache.
When unset, behavior is unchanged (the global `shared`/`perService` mode
applies). This lets a cluster **mix**: most InferenceServices ride the shared
cache, while specific ones use their own PVC.
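
A minimal sketch of such a mixed setup, assuming two InferenceServices in the same namespace. The names `small-chat` and `big-model` are hypothetical, and all other spec fields are left out:

```yaml
# Rides the operator-global cache (shared or perService): no modelCache block.
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: small-chat              # hypothetical name
spec: {}                        # placeholder; real spec fields omitted
---
# Opts in to a user-managed PVC via the proposed field.
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: big-model               # hypothetical name
spec:
  modelCache:
    claimName: my-model-cache   # pre-existing, user-owned PVC
```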

> Naming is the maintainer's call — `spec.modelCache.claimName` keeps the
> existing "model cache" vocabulary and leaves room to grow; flat
> `spec.modelCacheClaimName` or a more generic `spec.persistence.claimName`
> were also considered.

## Motivation

The model cache backend is currently an **operator-global** choice
(`--model-cache-mode` / chart `modelCache`): either one shared PVC (`shared`)
or an operator-provisioned per-workload PVC (`perService`), both using a single
chart-level `storageClass`. There is no way to say "this *one* model should
cache on a *different* volume."

Concrete case: on a multi-node cluster the shared cache uses an RWX storage
class (e.g. a networked filesystem). For most models that's fine, but for a
large model pinned to a specific node it's desirable to cache on **node-local**
storage: cold loads (the first prefill reading the weights) are markedly slower
from the networked filesystem, and the model stays on that node anyway. Today
the only ways to get node-local weights are:

- Switch the **whole operator** to `perService` — but that moves *every* model
  off the shared cache and still uses one storage class, so you can't target
  just the one workload or give it a different (local) class.
- Pre-stage onto a PVC and use `source: pvc://…` — but that is **read-only with
  no download**, so you have to build and maintain your own staging Job, which
  is exactly the toil the built-in downloader was meant to remove.

A per-InferenceService `claimName` override closes that gap: you bring a PVC
backed by whatever storage class you want, and the operator's existing
prep+download machinery fills and serves it — no staging Job, no giving up the
managed download path.
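
For illustration, the user-owned PVC could be an ordinary `PersistentVolumeClaim` like the sketch below. The storage class name `local-nvme` and the requested size are assumptions, not values from this proposal; the PVC has to live in the same namespace as the InferenceService, since a pod can only mount claims from its own namespace.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-model-cache            # referenced by spec.modelCache.claimName
spec:
  accessModes:
    - ReadWriteOnce               # typical for node-local volumes
  storageClassName: local-nvme    # hypothetical node-local storage class
  resources:
    requests:
      storage: 200Gi              # illustrative; size it for the model weights
```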

## Proposed behavior

When `spec.modelCache.claimName` is set on an InferenceService:

- The operator uses that PVC as the `model-cache` volume in
  `buildCachedStorageConfig` — same code path as the shared/perService cache:
  `model-cache-prep` (chown) + `model-downloader` run against it, and the
  server is started with `--model` pointing into it. No new download logic.
- Weights land under the existing **`<cacheKey>/` subdirectory** (not the PVC
  root), matching the current layout, so `RefreshPolicy` and cache-key semantics
  are unchanged and two models pointed at the same PVC cannot collide.
- The main container mounts the PVC **read-only** (as today); the init
  containers mount it read-write.

When unset: unchanged — the global `shared`/`perService` mode applies.
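
With `claimName` set, the relevant part of the generated pod spec might look roughly like this. It is a sketch only, not actual operator output: the mount path `/models` and the main container name `server` are assumptions, and images and other fields are omitted.

```yaml
# Illustrative fragment of the pod spec when spec.modelCache.claimName is set.
spec:
  volumes:
    - name: model-cache
      persistentVolumeClaim:
        claimName: my-model-cache       # taken from spec.modelCache.claimName
  initContainers:
    - name: model-cache-prep            # chown step, mounts read-write
      volumeMounts:
        - name: model-cache
          mountPath: /models            # hypothetical mount path
    - name: model-downloader            # downloads into <cacheKey>/, read-write
      volumeMounts:
        - name: model-cache
          mountPath: /models
  containers:
    - name: server                      # hypothetical container name
      args: ["--model", "/models/<cacheKey>/..."]   # file path below <cacheKey>/ elided
      volumeMounts:
        - name: model-cache
          mountPath: /models
          readOnly: true                # main container only reads the weights
```

The only difference from the shared/perService case would be the `claimName` in the volume source; the init containers and mounts stay the same.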

## Lifecycle

The operator **never creates or deletes** a `claimName` PVC — the user owns it
end-to-end (contrast with `perService`, where the operator provisions and may
GC `<isvc>-model-cache`). The operator only mounts and populates it.

## Edge cases / guardrails

- **Missing PVC:** surface a clear `Degraded` condition / event rather than
  silently falling back to the shared cache.
- **`source: pvc://…` combined with `claimName`:** mutually exclusive. A
  `pvc://` source is already staged, read-only, and never downloaded, while
  `claimName` targets the *download* path, so the field is meaningless there.
  Reject at validation or document as ignored.
- **Node alignment is the user's responsibility:** for a node-local RWO PVC,
  the InferenceService's `nodeSelector` must land the pod on the node where the
  PVC binds. A `WaitForFirstConsumer` local class binds on the first consumer's
  node (the pod's node), while a pre-bound RWO PVC pins the pod to its node. The
  operator does not enforce this.
- **fsGroup/chown:** no new handling needed — the same root `model-cache-prep`
  runs, so `fsGroupPolicy: None` local storage classes are covered exactly like
  the shared path.
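
As a sketch of the node-alignment case above, a statically provisioned local storage class with delayed binding can be paired with a `nodeSelector` on the InferenceService. The class name, node name, and the placement of `nodeSelector` directly under `spec` are assumptions for illustration:

```yaml
# Standard Kubernetes StorageClass for statically provisioned local volumes.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-nvme                          # hypothetical class name
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer     # bind only once a pod is scheduled
---
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: big-model                           # hypothetical name
spec:
  nodeSelector:                             # assumed to sit directly under spec
    kubernetes.io/hostname: gpu-node-1      # hypothetical node name
  modelCache:
    claimName: my-model-cache               # PVC that uses the local-nvme class
```

With delayed binding, the PVC binds to a volume on whichever node the pod is scheduled to, so the `nodeSelector` decides where the weights end up.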

## Out of scope (possible follow-ups)

- `llmkube cache list` / `--purge-cache` would not inspect BYO cache PVCs (they
  read the shared cache). A natural later enhancement; relates to the CLI
  cache-key handling noted on #912.
- Operator-provisioned per-model cache (a per-InferenceService
  `storageClassName` + `size` instead of a pre-made PVC) — a reasonable sibling
  feature, but intentionally left out here to keep this focused on the
  bring-your-own-PVC case.

