Allow ability for MPS to have single replica #1655
Signed-off-by: Vinay Damodaran <vrdn@hey.com>
|
Hey folks, |
|
@rahulait |
|
Hi — wanted to add some context on the use case driving this PR, in case it helps the team evaluate whether this is worth supporting. We schedule multiple model inference pods onto shared GPUs based on GPU memory rather than replica slices. Our setup:
The challenge we're running into is that the minimum
// daemon.go:278
return fmt.Sprintf("%d", 100/replicasPerDevice)
// daemon.go:268
limits[index] = fmt.Sprintf("%vM", totalMemory/replicas/1024/1024)
These are applied as daemon-wide defaults via
We'd love to just run our own MPS daemon, but since Bottlerocket bundles the device plugin and its MPS control daemon as platform infrastructure, running a parallel one introduces conflicts we'd rather avoid. I understand this may not be a use case the team has considered or wants to support, and I'm happy to provide benchmarks, test results, or any other information that would help with the evaluation. Appreciate you taking the time to look at this. |
|
Gentle bump — it's been about three weeks since the detailed write-up above, and I want to make this as easy as possible to action. Could a maintainer let me know which of these applies?
cc @elezar @cdesiniotis @rahulait @rajatchopra @tariq1890 @RenaudWasTaken |
|
Hi @vrdn-23, thanks for your patience. I have some questions:
It is worth keeping in mind that we make an assumption of the mps daemon being coupled with the device plugin here as the device plugin itself is the primary component of this project/repository. It is understandable that users want their own device plugins for various reasons, but configuring the MPS daemon in |
|
Thanks @tariq1890, I really appreciate the response. Happy to clarify a couple of things:
We're not replacing the one Bottlerocket deploys — Bottlerocket's GPU AMI ships both the NVIDIA k8s-device-plugin and the MPS control daemon as part of the OS, and we use them as-is. The "custom plugin" I mentioned runs alongside and advertises GPU memory as a separate extended resource so we can bin-pack inference pods by memory rather than replica slices. Pods scheduled that way still connect to Bottlerocket's MPS daemon for concurrency — i.e. the daemon this project ships, used unmodified, in a distribution NVIDIA already supports.
No, we're not using it. To clarify the term: by spatial sharing I mean concurrent kernels on different SMs, which MPS gives you by default, distinct from
The fundamental capability I'm looking to achieve is MPS concurrency without per-client default throttling. Currently,
Happy to add whatever would help land this: docs explaining when
Thanks again for engaging on this. Let me know if there's anything else I can clarify. |
|
Thanks for the clarifications. So, if I understand correctly, in your case setting the replica count to 1 in the MPS sharing strategy does not prevent the GPU from being shared by workloads, because the custom device plugin you have deployed allocates GPUs with memory as the advertised resource (as opposed to the resource). In that case, it does roughly fall under this category of use cases
Let me discuss this with the other maintainers and get back to you. |
That is correct. However, if allowing the number of replicas to be 1 seems architecturally weird, then maybe the functionality that can be provided is to be able to override the value of
Thanks again and appreciate your timely response @tariq1890 |
Summary
Allow replicas: 1 in MPS sharing configuration so the MPS daemon can provide concurrent GPU access without per-client resource throttling. Fixes #1548
Motivation
When using MPS purely as a concurrency layer, where an external device plugin handles scheduling, the current minimum of replicas: 2 forces the MPS daemon to impose unnecessary per-client limits:
100 / replicas = 50% per client
total_memory / replicas = half per client
This means every MPS client is capped at 50% GPU compute, even when it's the only process running. The remaining capacity sits idle.
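The arithmetic above can be sketched in a few lines of Go. This is an illustration only: `perClientLimits` is a hypothetical helper, not a function in the project, and the 16 GiB device size is an assumed value. It reuses the two expressions quoted earlier in the thread (`100/replicas` and `totalMemory/replicas/1024/1024`).

```go
package main

import "fmt"

// perClientLimits is a hypothetical helper that reproduces the two formulas
// quoted in the discussion: thread percentage = 100 / replicas and
// memory limit = totalMemory / replicas, reported in MiB with an "M" suffix.
func perClientLimits(totalMemoryBytes uint64, replicas int) (int, string, error) {
	if replicas < 1 {
		return 0, "", fmt.Errorf("replicas must be >= 1, got %d", replicas)
	}
	threadPct := 100 / replicas
	memLimit := fmt.Sprintf("%vM", totalMemoryBytes/uint64(replicas)/1024/1024)
	return threadPct, memLimit, nil
}

func main() {
	// Assumed device size for illustration: 16 GiB.
	const total = 16 * 1024 * 1024 * 1024
	for _, r := range []int{1, 2, 4} {
		pct, mem, err := perClientLimits(total, r)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("replicas=%d thread%%=%d mem=%s\n", r, pct, mem)
	}
}
```

With these assumed inputs, replicas=2 yields 50% and 8192M per client, while replicas=1 yields 100% and 16384M, which is the unthrottled case this PR is asking for.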
With replicas: 1, the daemon sets 100% thread percentage and full memory per client: MPS provides spatial sharing (multiple CUDA processes executing concurrently on different SMs) without artificial throttling.
How replicas: 1 differs from no MPS
Both configurations result in no per-client throttling, but the execution model is different:
replicas: 1 DEFAULT EXCLUSIVE_PROCESS
MPS with replicas: 1 is the right choice when you want true concurrent GPU execution for multiple pods (which are scheduled based on some other resource like GPU memory) without artificially limiting any individual client's resource usage.
Changes
replicas.go) isReplicated() to recognize replicas = 1 as a valid sharing configuration, so SharingStrategy() correctly returns MPS and the daemon starts
Test plan
go test ./api/config/v1/... ./cmd/mps-control-daemon/mps/...)
replicas: 1 config parses successfully
assertReplicas() passes for single replica on pre-Volta and Volta+ devices
replicas: 0 and replicas: -1 still rejected
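The behavior described in the change and the test plan can be approximated with the following minimal sketch. The type and function names (`replicatedResource`, `validateReplicas`, `isShared`) are hypothetical stand-ins, not the project's actual `isReplicated()` or `assertReplicas()` implementations, and the resource name is only an example value.

```go
package main

import "fmt"

// replicatedResource is a simplified, hypothetical stand-in for one
// resource entry in a sharing configuration.
type replicatedResource struct {
	Name     string
	Replicas int
}

// validateReplicas accepts a single replica and rejects zero or negative
// counts, matching the cases listed in the test plan.
func validateReplicas(r replicatedResource) error {
	if r.Replicas < 1 {
		return fmt.Errorf("resource %q: replicas must be >= 1, got %d", r.Name, r.Replicas)
	}
	return nil
}

// isShared reports whether any resource requests sharing. In this sketch a
// replica count of 1 counts as a sharing configuration, which is the
// behavior change the PR describes.
func isShared(resources []replicatedResource) bool {
	for _, r := range resources {
		if r.Replicas >= 1 {
			return true
		}
	}
	return false
}

func main() {
	for _, n := range []int{1, 0, -1} {
		r := replicatedResource{Name: "nvidia.com/gpu", Replicas: n}
		fmt.Printf("replicas=%d valid=%t\n", n, validateReplicas(r) == nil)
	}
	fmt.Println("shared:", isShared([]replicatedResource{{Name: "nvidia.com/gpu", Replicas: 1}}))
}
```

The point of the sketch is that the lower bound moves from 2 to 1 in both checks, while 0 and negative values are still rejected.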