Summary
Allow operators to configure different replica counts for different physical GPUs on the same node under the existing sharing.timeSlicing and sharing.mps config. The schema for this already exists (ReplicatedResource.Devices + ReplicatedResource.Rename); the device-map construction in internal/rm/device_map.go already supports it. The only thing blocking the feature is one helper (disableResoureRenaming in api/config/v1/replicas.go) that strips these fields on config load.
Motivation
Today, every physical GPU on a node ends up with the same replica count because sharing.timeSlicing.resources collapses to a single homogeneous entry. That's fine for symmetric nodes, but it forces an all-or-nothing trade-off: either every GPU on the node gets aggressive sharing (more slices, smaller share each) or none does.
Real-world cases where per-GPU replica counts would help:
- Light + heavy on the same node. Reserve one GPU at replicas: 2 for latency-sensitive inference (50% share each, two co-tenants max) and another at replicas: 8 for batch jobs that don't mind small slices.
- Mixed GPU classes on one node (different ages or memory tiers). With distinct renames, each GPU advertises under a different resource name, and consumers explicitly request the tier they need.
- MPS specifically. Because MPS enforces CUDA_MPS_ACTIVE_THREAD_PERCENTAGE = 100 / replicas, a per-GPU replicas value translates directly into a per-GPU compute cap. This is where the feature is most semantically clean.
Per-node configs (via the GPU Operator's labelled-config map) only solve this when entire nodes are homogeneous. Within a single node, no current option works.
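To make the MPS arithmetic above concrete, here is a minimal Go sketch of the 100 / replicas rule applied per GPU. The function name activeThreadPercentage and the use of integer division are assumptions made for illustration; they are not taken from the device plugin source.

```go
package main

import "fmt"

// activeThreadPercentage is a hypothetical helper that applies the
// 100 / replicas rule described above. Integer division is an assumption
// of this sketch, so replicas=8 yields 12 rather than 12.5.
func activeThreadPercentage(replicas int) (int, error) {
	if replicas < 1 {
		return 0, fmt.Errorf("replicas must be >= 1, got %d", replicas)
	}
	return 100 / replicas, nil
}

// gpuSharing pairs a GPU index with its configured replica count.
type gpuSharing struct {
	index    string
	replicas int
}

func main() {
	// The two GPUs from the light/heavy example in this issue.
	gpus := []gpuSharing{{"0", 2}, {"1", 8}}
	for _, g := range gpus {
		pct, err := activeThreadPercentage(g.replicas)
		if err != nil {
			fmt.Println("invalid config:", err)
			continue
		}
		fmt.Printf("GPU %s: replicas=%d, cap=%d%% per client\n", g.index, g.replicas, pct)
	}
}
```

With per-GPU replica counts, the two GPUs end up with different caps on the same node, which a single homogeneous entry cannot express.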
Proposal
After the change, this config would do what it reads as:
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          rename: nvidia.com/gpu-light
          devices: ["0"]
          replicas: 2
        - name: nvidia.com/gpu
          rename: nvidia.com/gpu-heavy
          devices: ["1"]
          replicas: 8
Pods request nvidia.com/gpu-light or nvidia.com/gpu-heavy explicitly. Behavior for configs that omit devices: / rename: is unchanged (existing single-resource configs keep working).
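As a rough illustration of what the node would advertise under this proposal, the sketch below expands a list of resource entries into per-resource-name replica totals. The replicatedResource struct is a simplified, hypothetical stand-in for the real ReplicatedResource type, and the advertised function is not the plugin's device-map code; it only assumes that each selected GPU contributes replicas allocatable units and that an entry without devices applies to every GPU on the node.

```go
package main

import "fmt"

// replicatedResource is a simplified, hypothetical mirror of the fields
// used in the config in this issue (name, rename, devices, replicas).
type replicatedResource struct {
	Name     string
	Rename   string
	Devices  []string // empty means "all GPUs on the node" in this sketch
	Replicas int
}

// advertised returns the number of allocatable replicas per resource name.
// nodeGPUs is the number of physical GPUs on the node; it is only used for
// entries that do not list devices explicitly.
func advertised(resources []replicatedResource, nodeGPUs int) map[string]int {
	out := make(map[string]int)
	for _, r := range resources {
		name := r.Name
		if r.Rename != "" {
			name = r.Rename
		}
		count := len(r.Devices)
		if count == 0 {
			count = nodeGPUs
		}
		out[name] += count * r.Replicas
	}
	return out
}

func main() {
	perGPU := []replicatedResource{
		{Name: "nvidia.com/gpu", Rename: "nvidia.com/gpu-light", Devices: []string{"0"}, Replicas: 2},
		{Name: "nvidia.com/gpu", Rename: "nvidia.com/gpu-heavy", Devices: []string{"1"}, Replicas: 8},
	}
	fmt.Println(advertised(perGPU, 2))

	// An existing single-resource config without devices or rename.
	legacy := []replicatedResource{{Name: "nvidia.com/gpu", Replicas: 4}}
	fmt.Println(advertised(legacy, 2))
}
```

Under these assumptions the per-GPU config yields 2 units of nvidia.com/gpu-light and 8 units of nvidia.com/gpu-heavy, while the legacy config still yields a single nvidia.com/gpu resource covering both GPUs.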
What this is NOT trying to do
- Not changing time-slicing compute semantics. Time-slicing on a single physical GPU round-robins CUDA contexts in the closed-source driver. Holding more time-slicing replicas on the same physical GPU does not give a pod a larger compute share, because the device plugin has no way to inject scheduler weight there. The failRequestsGreaterThanOne: true default still applies and is still the right guardrail for time-slicing.
- For MPS, the picture is cleaner: replicas → CUDA_MPS_ACTIVE_THREAD_PERCENTAGE, which the MPS daemon actually enforces. So MPS users get the intuitive behavior that fewer replicas on a GPU means a larger compute share per replica, and the per-GPU replicas config becomes a per-GPU compute-cap knob.
- Not introducing new k8s scheduling semantics. The existing per-resource-name model is unchanged. Pods just see two (or more) distinct resource names instead of one.
Prior art
The same feature gap has been raised before. None of these were closed by a decision; most went stale:
The comment on DisableResourceNamingInConfig already hints at intent: "This may be reenabled in a future release."
Next steps
We have a working implementation with tests against 5c15cbe and will open a PR shortly. Filing this issue first so the design context isn't buried inside the PR description.
cc @elezar @klueska