Allow ability for MPS to have single replica #1655

Open
vrdn-23 wants to merge 11 commits into NVIDIA:main from vrdn-23:vidamoda/mps-single-replica

vrdn-23 commented Mar 11, 2026

Summary

Allow replicas: 1 in MPS sharing configuration so the MPS daemon can provide concurrent GPU access without per-client resource throttling. Fixes #1548

Motivation

When using MPS purely as a concurrency layer — where an external device plugin handles scheduling — the current minimum of replicas: 2 forces the MPS daemon to impose unnecessary per-client limits:

  • Active thread percentage: 100 / replicas = 50% per client
  • Pinned memory limit: total_memory / replicas = half per client

This means every MPS client is capped at 50% GPU compute, even when it's the only process running. The remaining capacity sits idle.

With replicas: 1, the daemon sets 100% thread percentage and full memory per client — MPS provides spatial sharing (multiple CUDA processes executing concurrently on different SMs) without artificial throttling.
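The arithmetic above can be illustrated with a minimal Go sketch. The type `clientLimits`, the function `limitsFor`, and the 24 GiB device size are assumptions made up for this example; they are not names or values from the plugin.

```go
package main

import "fmt"

// clientLimits holds the per-client defaults derived from a replica count.
// Hypothetical type for illustration only.
type clientLimits struct {
	ActiveThreadPercent int    // share of GPU threads each client may use
	PinnedMemBytes      uint64 // device memory cap per client, in bytes
}

// limitsFor applies the two divisions described above:
// 100 / replicas and total_memory / replicas.
func limitsFor(totalMemBytes uint64, replicas int) clientLimits {
	return clientLimits{
		ActiveThreadPercent: 100 / replicas,
		PinnedMemBytes:      totalMemBytes / uint64(replicas),
	}
}

func main() {
	const total = 24 << 30 // assumed 24 GiB GPU, purely illustrative
	for _, r := range []int{2, 1} {
		l := limitsFor(total, r)
		fmt.Printf("replicas=%d: %d%% threads, %d MiB per client\n",
			r, l.ActiveThreadPercent, l.PinnedMemBytes>>20)
	}
}
```

Under these assumptions, replicas: 2 yields 50% and 12288 MiB per client, while replicas: 1 yields 100% and the full 24576 MiB.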

How replicas: 1 differs from no MPS

Both configurations result in no per-client throttling, but the execution model is different:

| | No MPS | MPS with replicas: 1 |
|---|---|---|
| Compute mode | DEFAULT | EXCLUSIVE_PROCESS |
| Concurrent execution | Time-slicing (one process at a time, context switches) | Spatial sharing (kernels from different clients run concurrently on different SMs) |
| Per-client limits | None | None (100% thread, full memory) |
| GPU access | Any process on the node | Only processes connecting through the MPS pipe |
| MPS daemon | Not running | Running |

MPS with replicas: 1 is the right choice when you want true concurrent GPU execution for multiple pods (which are scheduled based on some other resource like GPU memory) without artificially limiting any individual client's resource usage.

Changes

  • Lower minimum replicas from 2 to 1 in config validation (replicas.go)
  • Update isReplicated() to recognize replicas = 1 as a valid sharing configuration, so SharingStrategy() correctly returns MPS and the daemon starts
  • Add test coverage for single-replica MPS configuration

Test plan

  • Existing tests pass (go test ./api/config/v1/... ./cmd/mps-control-daemon/mps/...)
  • New test: replicas: 1 config parses successfully
  • New test: assertReplicas() passes for single replica on pre-Volta and Volta+ devices
  • replicas: 0 and replicas: -1 still rejected


copy-pr-bot Bot commented Mar 11, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Vinay Damodaran <vrdn@hey.com>
vrdn-23 force-pushed the vidamoda/mps-single-replica branch from 7f4202b to b22d56d on March 11, 2026 12:53
Author

vrdn-23 commented Mar 20, 2026

Hey folks,
Just wanted to know if there is any PR etiquette I should follow, or any performance data I can gather, to help the team decide whether this is a use case it wants to support. Since GPU execution inherently works differently with and without MPS, being able to use MPS for all executions on a GPU without having to split it seems useful.
@elezar @cdesiniotis

Author

vrdn-23 commented Apr 9, 2026

@rahulait
Just wanted to check in and see if there is any insight into how I can get someone to take a look at this PR. Are there any benchmarks or test suites I can run to help convince the maintainers to spend some time reviewing it?

Author

vrdn-23 commented Apr 21, 2026

Hi — wanted to add some context on the use case driving this PR, in case it helps the team evaluate whether this is worth supporting.

We schedule multiple model inference pods onto shared GPUs based on GPU memory rather than replica slices. Our setup:

  • EKS with Bottlerocket GPU AMIs, which ship the NVIDIA device plugin and MPS control daemon as part of the OS
  • Karpenter with NodeOverlays to expose a GPU memory extended resource for scheduling
  • A custom device plugin that advertises GPU memory, injects GPU access via CDI, and connects pods to the MPS daemon's pipe — pods don't request nvidia.com/gpu
  • MPS for concurrent spatial sharing — multiple model pods run CUDA kernels concurrently on different SMs rather than time-slicing

The challenge we're running into is that the minimum replicas: 2 causes the MPS daemon to cap every connecting client at 50% active threads and half the GPU's pinned memory:

// daemon.go:278
return fmt.Sprintf("%d", 100/replicasPerDevice)

// daemon.go:268
limits[index] = fmt.Sprintf("%vM", totalMemory/replicas/1024/1024)

These are applied as daemon-wide defaults via set_default_active_thread_percentage and set_default_device_pinned_mem_limit at startup, and as far as I can tell there's no per-client override in the plugin — so every pod connecting through the MPS pipe gets throttled regardless of how it was scheduled.
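To make the effect of those defaults concrete, here is a hedged Go sketch that builds the two control commands from the expressions quoted above. The function `defaultCommands`, the use of a numeric device index as the command argument, and the 16 GiB device size are assumptions for illustration; the plugin's real startup code may structure this differently.

```go
package main

import (
	"fmt"
	"strings"
)

// defaultCommands returns the daemon-wide default commands implied by a
// replica count. totalMemBytes holds one entry per device.
// Hypothetical helper, not the plugin's API.
func defaultCommands(totalMemBytes []uint64, replicas int) []string {
	cmds := []string{
		// Same expression as the daemon.go snippet: 100/replicas.
		fmt.Sprintf("set_default_active_thread_percentage %d", 100/replicas),
	}
	for index, total := range totalMemBytes {
		// Same expression as the daemon.go snippet: bytes -> MiB, per replica.
		limit := fmt.Sprintf("%vM", total/uint64(replicas)/1024/1024)
		cmds = append(cmds,
			fmt.Sprintf("set_default_device_pinned_mem_limit %d %s", index, limit))
	}
	return cmds
}

func main() {
	devices := []uint64{16 << 30} // assumed single 16 GiB GPU
	for _, r := range []int{2, 1} {
		fmt.Printf("replicas=%d\n  %s\n", r,
			strings.Join(defaultCommands(devices, r), "\n  "))
	}
}
```

With these inputs, replicas=2 produces a 50% thread default and an 8192M memory limit, whereas replicas=1 produces 100% and 16384M, which is the unthrottled configuration this PR is asking for.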

We'd love to just run our own MPS daemon, but since Bottlerocket bundles the device plugin and its MPS control daemon as platform infrastructure, running a parallel one introduces conflicts we'd rather avoid. replicas: 1 would let us use the existing daemon cleanly — MPS spatial sharing without the per-client resource partitioning.

I understand this may not be a use case the team has considered or wants to support, and I'm happy to provide benchmarks, test results, or any other information that would help with the evaluation. Appreciate you taking the time to look at this.

cc @rahulait @elezar @cdesiniotis

Author

vrdn-23 commented May 11, 2026

Gentle bump — it's been about three weeks since the detailed write-up above, and I want to make this as easy as possible to action.

Could a maintainer let me know which of these applies?

  1. Approach is fine, just bandwidth — happy to wait, just want to confirm it's on the radar.
  2. Approach needs changes — happy to iterate if you can point at the concern (API shape, validation location, test coverage, docs, etc.).
  3. Use case isn't one the project wants to support — totally fair, would appreciate knowing so I can stop pinging and look at maintaining this downstream.

cc @elezar @cdesiniotis @rahulait @rajatchopra @tariq1890 @RenaudWasTaken

Contributor

tariq1890 commented May 11, 2026

Hi @vrdn-23, thanks for your patience. I have some questions:

  • You say that you have a custom device plugin, does that mean you don't use the device-plugin deployed by BottleRocket?
  • Since you need spatial sharing of the GPUs, does that mean you are enabling --static-partitioning in your nvidia-cuda-mps-control daemon? If yes, how do you do this currently? I don't think the MPS daemonset we deploy allows for this.

It is worth keeping in mind that we make an assumption of the mps daemon being coupled with the device plugin here as the device plugin itself is the primary component of this project/repository. It is understandable that users want their own device plugins for various reasons, but configuring the MPS daemon in NVIDIA/k8s-device-plugin to work with them is likely out of scope.

Author

vrdn-23 commented May 12, 2026

Thanks @tariq1890, I really appreciate the response. Happy to clarify a couple of things:

> You say that you have a custom device plugin, does that mean you don't use the device-plugin deployed by BottleRocket?

We're not replacing the one Bottlerocket deploys — Bottlerocket's GPU AMI ships both the NVIDIA k8s-device-plugin and the MPS control daemon as part of the OS, and we use them as-is. The "custom plugin" I mentioned runs alongside and advertises GPU memory as a separate extended resource so we can bin-pack inference pods by memory rather than replica slices. Pods scheduled that way still connect to Bottlerocket's MPS daemon for concurrency — i.e. the daemon this project ships, used unmodified, in a distribution NVIDIA already supports.

> Since you need spatial sharing of the GPUs, does that mean you are enabling --static-partitioning in your nvidia-cuda-mps-control daemon?

No, we're not using it. To clarify the term: by spatial sharing I mean concurrent kernels on different SMs, which MPS gives you by default. That is distinct from --static-partitioning, which explicitly pins clients to SM subsets (from my limited understanding). I just don't want the daemon to also set set_default_active_thread_percentage=50 and set_default_device_pinned_mem_limit=half, which is what replicas: 2 forces today.

The fundamental capability I'm looking to achieve is MPS concurrency without per-client default throttling.

Currently, replicas is the only knob exposed for MPS defaults, and the floor of 2 forces every MPS user — including the path Bottlerocket exposes out of the box — into 50%-thread / half-memory partitioning. There's no other way to tell the daemon "run, but don't impose default caps," and replicas: 1 seems to be the most natural way to express that.

Happy to add whatever would help land this — docs explaining when replicas: 1 is and isn't the right choice, benchmarks, or gating it behind a separate field if replicas: 1 feels like it might be confusing with the current terminology.

Thanks again for engaging on this. Let me know if there's anything else I can clarify.

@tariq1890
Contributor

Thanks for the clarifications. So, if I understand correctly, in your case setting the replica count to 1 in the MPS sharing strategy does not limit the GPU from being shared by workloads, because the custom device plugin that you have deployed allocates GPUs with memory as the advertised resource (as opposed to the resource). In that case, it does roughly fall under this category of use cases:

> configuring the MPS daemon in NVIDIA/k8s-device-plugin to work with them is likely out of scope.

Let me discuss this with the other maintainers and get back to you.

Author

vrdn-23 commented May 12, 2026

> So, if I understand correctly, in your case setting the replica count to 1 in the MPS sharing strategy does not limit the GPU from being shared by workloads, because the custom device plugin that you have deployed allocates GPUs with memory as the advertised resource (as opposed to the resource).

That is correct. However, if allowing replicas to be 1 seems architecturally odd, then an alternative would be to allow overriding the value of active_thread_percentage for each client that connects via MPS, irrespective of the number of replicas.

> Let me discuss this with the other maintainers and get back to you.

Thanks again and appreciate your timely response @tariq1890

Development

Successfully merging this pull request may close these issues.

Setting MPS replicas to 1

2 participants