Allow ability for MPS to have single replica by vrdn-23 · Pull Request #1655 · NVIDIA/k8s-device-plugin

vrdn-23 · 2026-03-11T12:50:38Z

Summary

Allow replicas: 1 in MPS sharing configuration so the MPS daemon can provide concurrent GPU access without per-client resource throttling. Fixes #1548

Motivation

When using MPS purely as a concurrency layer — where an external device plugin handles scheduling — the current minimum of replicas: 2 forces the MPS daemon to impose unnecessary per-client limits:

Active thread percentage: 100 / replicas = 50% per client
Pinned memory limit: total_memory / replicas = half per client

This means every MPS client is capped at 50% GPU compute, even when it's the only process running. The remaining capacity sits idle.

With replicas: 1, the daemon sets 100% thread percentage and full memory per client — MPS provides spatial sharing (multiple CUDA processes executing concurrently on different SMs) without artificial throttling.
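To make the arithmetic above concrete, here is a minimal sketch of how the per-client defaults scale with the replica count. The function name `perClientDefaults` and the 16 GiB device size are assumptions chosen for illustration; only the two divisions come from the description.

```go
package main

import "fmt"

// perClientDefaults is a hypothetical helper that mirrors the arithmetic in
// the description: every MPS client gets an equal share of active threads
// and pinned memory. It is not the plugin's actual function.
func perClientDefaults(totalMemoryBytes uint64, replicas int) (threadPct int, memLimitMiB uint64) {
	threadPct = 100 / replicas // integer division, as in the description
	memLimitMiB = totalMemoryBytes / uint64(replicas) / 1024 / 1024
	return threadPct, memLimitMiB
}

func main() {
	const totalMemory = 16 << 30 // assumed 16 GiB GPU, for illustration only
	for _, replicas := range []int{1, 2, 4} {
		pct, mem := perClientDefaults(totalMemory, replicas)
		fmt.Printf("replicas=%d -> %d%% threads, %dM pinned memory per client\n", replicas, pct, mem)
	}
}
```

With `replicas: 1` both divisions are by one, so the defaults equal the full device; with `replicas: 2` each client is held to half, even if it is the only client connected.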

How `replicas: 1` differs from no MPS

Both configurations result in no per-client throttling, but the execution model is different:

| | No MPS | MPS with `replicas: 1` |
|---|---|---|
| Compute mode | `DEFAULT` | `EXCLUSIVE_PROCESS` |
| Concurrent execution | Time-slicing (one process at a time, context switches) | Spatial sharing (kernels from different clients run concurrently on different SMs) |
| Per-client limits | None | None (100% thread, full memory) |
| GPU access | Any process on the node | Only processes connecting through the MPS pipe |
| MPS daemon | Not running | Running |

MPS with replicas: 1 is the right choice when you want true concurrent GPU execution for multiple pods (which are scheduled based on some other resource like GPU memory) without artificially limiting any individual client's resource usage.

Changes

Lower minimum replicas from 2 to 1 in config validation (replicas.go)
Update isReplicated() to recognize replicas = 1 as a valid sharing configuration, so SharingStrategy() correctly returns MPS and the daemon starts
Add test coverage for single-replica MPS configuration
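The shape of the change can be sketched roughly as follows. This is an illustration under assumptions, not the code in the PR: `ReplicatedResource`, `minReplicas`, and `validateReplicas` are hypothetical names, and the real `isReplicated()` in the repository may take different arguments.

```go
package main

import "fmt"

// ReplicatedResource is a hypothetical stand-in for one entry in the
// sharing configuration; the real type in replicas.go may differ.
type ReplicatedResource struct {
	Name     string
	Replicas int
}

// minReplicas is the lower bound after this change (it was 2 before).
const minReplicas = 1

// validateReplicas rejects zero and negative counts but accepts 1.
func validateReplicas(r ReplicatedResource) error {
	if r.Replicas < minReplicas {
		return fmt.Errorf("resource %q: replicas must be >= %d, got %d", r.Name, minReplicas, r.Replicas)
	}
	return nil
}

// isReplicated reports whether any resource is configured for sharing.
// With the floor at 1, a single replica still counts as sharing.
func isReplicated(resources []ReplicatedResource) bool {
	for _, r := range resources {
		if r.Replicas >= minReplicas {
			return true
		}
	}
	return false
}

func main() {
	for _, n := range []int{1, 0, -1} {
		r := ReplicatedResource{Name: "nvidia.com/gpu", Replicas: n}
		fmt.Printf("replicas=%d valid=%t\n", n, validateReplicas(r) == nil)
	}
	single := []ReplicatedResource{{Name: "nvidia.com/gpu", Replicas: 1}}
	fmt.Println("sharing enabled:", isReplicated(single))
}
```

The point of the sketch is that both checks must move together: if validation accepts 1 but the sharing check still requires more than one replica, the config parses but the daemon never starts.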

Test plan

Existing tests pass (go test ./api/config/v1/... ./cmd/mps-control-daemon/mps/...)
New test: replicas: 1 config parses successfully
New test: assertReplicas() passes for single replica on pre-Volta and Volta+ devices
replicas: 0 and replicas: -1 still rejected

copy-pr-bot · 2026-03-11T12:50:42Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Vinay Damodaran <vrdn@hey.com>

vrdn-23 · 2026-03-20T22:11:20Z

Hey folks,
Just wanted to know if there is any PR etiquette I should follow, or any performance data I can gather, to help the team decide whether this is a use case worth supporting. Since GPU execution inherently differs between running with and without MPS, being able to use MPS for all executions on a GPU, without having to split it into replicas, seems useful.
@elezar @cdesiniotis

vrdn-23 · 2026-04-09T11:36:13Z

@rahulait
Just wanted to check in and see if there is any insight into how I can get someone to take a look at this PR. Are there any benchmarks or test suites I can run to help convince the maintainer team to spend some time reviewing this?

vrdn-23 · 2026-04-21T21:07:12Z

Hi — wanted to add some context on the use case driving this PR, in case it helps the team evaluate whether this is worth supporting.

We schedule multiple model inference pods onto shared GPUs based on GPU memory rather than replica slices. Our setup:

EKS with Bottlerocket GPU AMIs, which ship the NVIDIA device plugin and MPS control daemon as part of the OS
Karpenter with NodeOverlays to expose a GPU memory extended resource for scheduling
A custom device plugin that advertises GPU memory, injects GPU access via CDI, and connects pods to the MPS daemon's pipe — pods don't request nvidia.com/gpu
MPS for concurrent spatial sharing — multiple model pods run CUDA kernels concurrently on different SMs rather than time-slicing

The challenge we're running into is that the minimum replicas: 2 causes the MPS daemon to cap every connecting client at 50% active threads and half the GPU's pinned memory:

// daemon.go:278
return fmt.Sprintf("%d", 100/replicasPerDevice)

// daemon.go:268
limits[index] = fmt.Sprintf("%vM", totalMemory/replicas/1024/1024)

These are applied as daemon-wide defaults via set_default_active_thread_percentage and set_default_device_pinned_mem_limit at startup, and as far as I can tell there's no per-client override in the plugin — so every pod connecting through the MPS pipe gets throttled regardless of how it was scheduled.
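As a rough sketch of what those daemon-wide defaults look like for a given replica count, the following builds the control commands from the two expressions quoted above. The `device` type, the `defaultLimitCommands` helper, and the exact argument layout passed to the control daemon are assumptions for illustration; only the two `Sprintf` expressions and the command names come from this thread.

```go
package main

import "fmt"

// device is a hypothetical description of one GPU visible to the daemon.
type device struct {
	Index       string
	TotalMemory uint64 // bytes
}

// defaultLimitCommands returns the MPS control commands that would set the
// daemon-wide defaults for the given replica count. The argument layout is
// an assumption; the arithmetic follows the daemon.go lines quoted above.
func defaultLimitCommands(devices []device, replicas int) []string {
	cmds := []string{
		fmt.Sprintf("set_default_active_thread_percentage %d", 100/replicas),
	}
	for _, d := range devices {
		limit := fmt.Sprintf("%vM", d.TotalMemory/uint64(replicas)/1024/1024)
		cmds = append(cmds, fmt.Sprintf("set_default_device_pinned_mem_limit %s %s", d.Index, limit))
	}
	return cmds
}

func main() {
	gpus := []device{{Index: "0", TotalMemory: 16 << 30}} // assumed 16 GiB GPU
	for _, replicas := range []int{2, 1} {
		fmt.Printf("replicas=%d\n", replicas)
		for _, c := range defaultLimitCommands(gpus, replicas) {
			fmt.Println("  " + c)
		}
	}
}
```

Because these are defaults rather than per-client settings, every client that connects through the pipe inherits them, which is why the replica floor matters even when scheduling happens elsewhere.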

We'd love to just run our own MPS daemon, but since Bottlerocket bundles the device plugin and its MPS control daemon as platform infrastructure, running a parallel one introduces conflicts we'd rather avoid. replicas: 1 would let us use the existing daemon cleanly — MPS spatial sharing without the per-client resource partitioning.

I understand this may not be a use case the team has considered or wants to support, and I'm happy to provide benchmarks, test results, or any other information that would help with the evaluation. Appreciate you taking the time to look at this.

cc @rahulait @elezar @cdesiniotis

vrdn-23 · 2026-05-11T15:35:33Z

Gentle bump — it's been about three weeks since the detailed write-up above, and I want to make this as easy as possible to action.

Could a maintainer let me know which of these applies?

Approach is fine, just bandwidth — happy to wait, just want to confirm it's on the radar.
Approach needs changes — happy to iterate if you can point at the concern (API shape, validation location, test coverage, docs, etc.).
Use case isn't one the project wants to support — totally fair, would appreciate knowing so I can stop pinging and look at maintaining this downstream.

cc @elezar @cdesiniotis @rahulait @rajatchopra @tariq1890 @RenaudWasTaken

tariq1890 · 2026-05-11T21:35:54Z

Hi @vrdn-23, thanks for your patience. I have some questions:

You say that you have a custom device plugin, does that mean you don't use the device-plugin deployed by BottleRocket?
Since you need spatial sharing of the GPUs, does that mean you are enabling --static-partitioning in your nvidia-cuda-mps-control daemon? If yes, how do you do this currently? I don't think the MPS daemonset we deploy allows for this.

It is worth keeping in mind that we make an assumption of the mps daemon being coupled with the device plugin here as the device plugin itself is the primary component of this project/repository. It is understandable that users want their own device plugins for various reasons, but configuring the MPS daemon in NVIDIA/k8s-device-plugin to work with them is likely out of scope.

vrdn-23 · 2026-05-12T00:26:31Z

Thanks @tariq1890, I really appreciate the response. Happy to clarify a couple of things:

> You say that you have a custom device plugin, does that mean you don't use the device-plugin deployed by BottleRocket?

We're not replacing the one Bottlerocket deploys — Bottlerocket's GPU AMI ships both the NVIDIA k8s-device-plugin and the MPS control daemon as part of the OS, and we use them as-is. The "custom plugin" I mentioned runs alongside and advertises GPU memory as a separate extended resource so we can bin-pack inference pods by memory rather than replica slices. Pods scheduled that way still connect to Bottlerocket's MPS daemon for concurrency — i.e. the daemon this project ships, used unmodified, in a distribution NVIDIA already supports.

> Since you need spatial sharing of the GPUs, does that mean you are enabling --static-partitioning in your nvidia-cuda-mps-control daemon

No, we're not using it. To clarify the term: by spatial sharing I mean concurrent kernels on different SMs, which MPS gives you by default. That is distinct from --static-partitioning, which (from my limited understanding) explicitly pins clients to SM subsets. I just don't want the daemon to also set set_default_active_thread_percentage=50 and set_default_device_pinned_mem_limit=half, which is what replicas: 2 forces today.

The fundamental capability I'm looking to achieve is MPS concurrency without per-client default throttling.

Currently, replicas is the only knob exposed for MPS defaults, and the floor of 2 forces every MPS user — including the path Bottlerocket exposes out of the box — into 50%-thread / half-memory partitioning. There's no other way to tell the daemon "run, but don't impose default caps," and replicas: 1 seems to be the most natural way to express that.

Happy to add whatever would help land this — docs explaining when replicas: 1 is and isn't the right choice, benchmarks, or gating it behind a separate field if replicas: 1 feels like it might be confusing with the current terminology.

Thanks again for engaging on this. Let me know if there's anything else I can clarify.

tariq1890 · 2026-05-12T05:27:46Z

Thanks for the clarifications. So, if I understand correctly, in your case setting the replica count to 1 in the MPS sharing strategy does not prevent the GPU from being shared by workloads, because the custom device plugin you have deployed allocates GPUs with memory as the advertised resource (as opposed to the resource). In that case, it does roughly fall under this category of use cases:

> configuring the MPS daemon in NVIDIA/k8s-device-plugin to work with them is likely out of scope.

Let me discuss this with the other maintainers and get back to you.

vrdn-23 · 2026-05-12T19:22:35Z

> So, if I understand correctly, in your case, the setting of the replica count to 1 in the MPS sharing strategy does not limit the gpu from being shared by workloads because of the custom device plugin that you have deployed which allocates GPUs with memory as the advertised resource (as opposed to the resource).

That is correct. However, if allowing replicas to be set to 1 seems architecturally odd, then an alternative would be to let users override the active_thread_percentage value for each client that connects via MPS, irrespective of the number of replicas.

> Let me discuss this with the other maintainers and get back to you.

Thanks again and appreciate your timely response @tariq1890

Allow ability for MPS to have single replica

b22d56d

Signed-off-by: Vinay Damodaran <vrdn@hey.com>

vrdn-23 force-pushed the vidamoda/mps-single-replica branch from 7f4202b to b22d56d on March 11, 2026 12:53

vrdn-23 mentioned this pull request Mar 11, 2026

Setting MPS replicas to 1 #1548

Open

vrdn-23 added 2 commits March 31, 2026 09:38

Merge branch 'main' into vidamoda/mps-single-replica

23395ff

Merge branch 'main' into vidamoda/mps-single-replica

8d9cdd7

Merge branch 'main' into vidamoda/mps-single-replica

5293887

vrdn-23 added 3 commits April 27, 2026 10:41

Merge branch 'main' into vidamoda/mps-single-replica

3e01ddb

Merge branch 'main' into vidamoda/mps-single-replica

f84c2e8

Merge branch 'main' into vidamoda/mps-single-replica

d53a306

vrdn-23 added 4 commits May 12, 2026 12:49

Merge branch 'main' into vidamoda/mps-single-replica

25ea9e7

Merge branch 'main' into vidamoda/mps-single-replica

8a692e1

Merge branch 'NVIDIA:main' into vidamoda/mps-single-replica

7163e78

Merge branch 'main' into vidamoda/mps-single-replica

8c17b5b

vrdn-23 wants to merge 11 commits into NVIDIA:main from vrdn-23:vidamoda/mps-single-replica

2 participants
