Currently, Sharing.MPS.FailRequestsGreaterThanOne is a valid config item. However, its value is ignored.
We have a use case where we need to request all GPUs available on a node. The device plugin's promises about what such a request means in terms of time-on-device have no bearing here, because we know that this workload will have exclusivity when running. So the only requirement is to be able to request multiple (all) replicas in order to have access to all GPUs.
Based on this, is there interest in making the value of Sharing.MPS.FailRequestsGreaterThanOne authoritative when MPS is enabled, instead of it being ignored? Happy to PR the change if yes.