Support per-physical-GPU replica counts in sharing.timeSlicing and sharing.mps #1786

@jonathan-meiri

Summary

Allow operators to configure different replica counts for different physical GPUs on the same node under the existing sharing.timeSlicing and sharing.mps config. The schema for this already exists (ReplicatedResource.Devices + ReplicatedResource.Rename); the device-map construction in internal/rm/device_map.go already supports it. The only thing blocking the feature is one helper (disableResoureRenaming in api/config/v1/replicas.go) that strips these fields on config load.

Motivation

Today, every physical GPU on a node ends up with the same replica count because sharing.timeSlicing.resources collapses to a single homogeneous entry. That's fine for symmetric nodes, but it forces an all-or-nothing trade-off: either every GPU on the node gets aggressive sharing (more slices, smaller share each) or none does.

Real-world cases where per-GPU replica counts would help:

  • Light + heavy on the same node. Reserve one GPU at replicas: 2 for latency-sensitive inference (50% share each, two co-tenants max) and another at replicas: 8 for batch jobs that don't mind small slices.
  • Mixed GPU classes on one node (different ages or memory tiers). With distinct renames, each GPU advertises as a different resource name — and consumers explicitly request the tier they need.
  • MPS specifically. Because MPS enforces CUDA_MPS_ACTIVE_THREAD_PERCENTAGE = 100 / replicas, a per-GPU replica count translates directly into a per-GPU compute cap. This is where the feature is semantically cleanest.

Per-node configs (via the GPU Operator's labelled-config map) only solve this when entire nodes are homogeneous. Within a single node, no current option works.
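The MPS arithmetic from the list above can be sketched in a few lines of Go. This is only an illustration of the 100 / replicas relationship stated in this issue: the function name activeThreadPercentage, the use of integer division, and the per-GPU map are assumptions made for the example, not the plugin's actual code.

```go
package main

import "fmt"

// activeThreadPercentage illustrates the relationship described above:
// CUDA_MPS_ACTIVE_THREAD_PERCENTAGE = 100 / replicas.
// Integer division is an assumption made for this sketch.
func activeThreadPercentage(replicas int) (int, error) {
	if replicas < 1 {
		return 0, fmt.Errorf("replicas must be >= 1, got %d", replicas)
	}
	return 100 / replicas, nil
}

func main() {
	// Hypothetical per-GPU replica counts, keyed by device index.
	perGPU := map[string]int{"0": 2, "1": 8}
	for _, idx := range []string{"0", "1"} {
		pct, err := activeThreadPercentage(perGPU[idx])
		if err != nil {
			panic(err)
		}
		fmt.Printf("GPU %s: replicas=%d, thread percentage=%d\n", idx, perGPU[idx], pct)
	}
}
```

With per-GPU replica counts, the two GPUs in this sketch end up with different caps (50 and 12 under the integer-division assumption), which a single homogeneous entry cannot express.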

Proposal

After the change, this config would do what it reads as:

version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        rename: nvidia.com/gpu-light
        devices: ["0"]
        replicas: 2
      - name: nvidia.com/gpu
        rename: nvidia.com/gpu-heavy
        devices: ["1"]
        replicas: 8

Pods request nvidia.com/gpu-light or nvidia.com/gpu-heavy explicitly. Behavior for configs that omit devices: / rename: is unchanged (existing single-resource configs keep working).
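To make the intended behavior concrete, here is a minimal Go sketch of how the config above could expand into advertised resources. Everything in it is hypothetical: replicatedResource is a simplified stand-in rather than the plugin's real type, the "<device>::<n>" replica ID format is chosen for illustration, and rejecting a device claimed by two entries is an assumed validation rule, not confirmed behavior.

```go
package main

import (
	"fmt"
	"sort"
)

// replicatedResource is a simplified, hypothetical stand-in for one entry
// under sharing.timeSlicing.resources; it is not the plugin's actual type.
type replicatedResource struct {
	Name     string
	Rename   string
	Devices  []string // device indexes; empty means "all devices" in this sketch
	Replicas int
}

// advertised returns, per advertised resource name, the replica IDs exposed.
// The "<device>::<n>" ID format is an assumption made for illustration.
func advertised(resources []replicatedResource, allDevices []string) (map[string][]string, error) {
	out := map[string][]string{}
	claimed := map[string]string{} // device index -> resource name that claimed it
	for _, r := range resources {
		name := r.Name
		if r.Rename != "" {
			name = r.Rename
		}
		devices := r.Devices
		if len(devices) == 0 {
			devices = allDevices
		}
		for _, d := range devices {
			// Assumed rule: a physical GPU may belong to only one entry.
			if prev, ok := claimed[d]; ok {
				return nil, fmt.Errorf("device %s claimed by both %s and %s", d, prev, name)
			}
			claimed[d] = name
			for i := 0; i < r.Replicas; i++ {
				out[name] = append(out[name], fmt.Sprintf("%s::%d", d, i))
			}
		}
	}
	return out, nil
}

func main() {
	resources := []replicatedResource{
		{Name: "nvidia.com/gpu", Rename: "nvidia.com/gpu-light", Devices: []string{"0"}, Replicas: 2},
		{Name: "nvidia.com/gpu", Rename: "nvidia.com/gpu-heavy", Devices: []string{"1"}, Replicas: 8},
	}
	got, err := advertised(resources, []string{"0", "1"})
	if err != nil {
		panic(err)
	}
	names := make([]string, 0, len(got))
	for n := range got {
		names = append(names, n)
	}
	sort.Strings(names)
	for _, n := range names {
		fmt.Printf("%s: %d allocatable\n", n, len(got[n]))
	}
}
```

An entry that omits devices and rename falls back to the original name across all GPUs in this sketch, which mirrors the backward-compatibility claim above.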

What this is NOT trying to do

  • Not changing time-slicing compute semantics. Time-slicing on a single physical GPU round-robins CUDA contexts in the closed-source driver. Holding more time-slicing replicas on the same physical GPU does not give a pod a larger compute share — the device plugin has no way to inject scheduler weight there. The failRequestsGreaterThanOne: true default still applies and is still the right guardrail for time-slicing.
    • For MPS, the picture is cleaner: replicas → CUDA_MPS_ACTIVE_THREAD_PERCENTAGE, which the MPS daemon actually enforces. So MPS users get a predictable mapping: the replica count configured for a GPU sets each client's compute share on that GPU (100 / replicas percent), and the per-GPU replicas config becomes a per-GPU compute-cap knob.
  • Not introducing new k8s scheduling semantics. The existing per-resource-name model is unchanged. Pods just see two (or more) distinct resource names instead of one.

Prior art

The same feature gap has been raised before. None of these were closed by a decision; most went stale:

The comment on DisableResourceNamingInConfig already hints at intent: "This may be reenabled in a future release."

Next steps

We have a working implementation with tests against 5c15cbe and will open a PR shortly. Filing this issue first so the design context isn't buried inside the PR description.

cc @elezar @klueska
