fix distributedAlloc tie-break to preserve distribution across physical GPUs #1788
Open
jonathan-meiri wants to merge 3 commits into
TDD red phase for an issue independent of NVIDIA#1787.

Setup: a node with two physical GPUs of equal advertised replica counts, where one slot on the "second" GPU has already been allocated to another pod. A new pod requests two more slots. The function name and docstring of distributedAlloc promise an even spread, but today the function deterministically picks both of the new pod's slots from the GPU with the most remaining replicas, leaving the other physical GPU's available slot untouched.

The bug is in the sort tie-break. After the first pick, the per-GPU 'used' counts tie across the candidates, and sort.Slice is unstable, so the next iteration ends up picking the next slot on the GPU we just picked from rather than rotating to the sibling GPU that still has capacity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: runatom-ai <258621014+runatom-ai@users.noreply.github.com>
Signed-off-by: Jonathan Meiri <33288957+Meiri28@users.noreply.github.com>
… physical GPUs

distributedAlloc sorts candidate replicas by 'used' (total - available) per underlying physical device. When two physical devices end up with the same 'used' count after some picks, sort.Slice is unstable and breaks the tie arbitrarily, which in practice causes the loop to keep picking from whichever device's slots happened to land first in the candidate list rather than rotating to a sibling physical device.

That manifests when the available pool starts uneven (e.g., one slot on GPU-1 has already been allocated to another pod). The function name and docstring promise an even spread across replicated GPUs; the tie-break failure deterministically concentrates the new pod's slots on the GPU that had more available replicas, leaving the other physical device(s) untouched.

Introduce a pickedFrom map tracking how many slots have been taken from each physical device during this allocation, and consult it as a tie-break sort key. The existing 'used' ordering remains primary; when two devices tie on that, the one we have not touched (or have touched the least) during this allocation comes first. Behavior is unchanged whenever 'used' counts differ, including for fresh state and for cases where only one physical device has free slots.

Failing tests added in the preceding commit now pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: runatom-ai <258621014+runatom-ai@users.noreply.github.com>
Signed-off-by: Jonathan Meiri <33288957+Meiri28@users.noreply.github.com>
Summary
distributedAlloc is designed to spread allocations across replicated physical GPUs evenly, but a tie in its sort key (easy to reach once any prior pod has consumed a slot on one of the GPUs) falls through to sort.Slice's arbitrary order. In practice this means the loop keeps picking the next slot on the GPU it just picked from, instead of rotating to a sibling physical GPU that still has capacity.

Contributed by @Meiri28 on behalf of @runatom-ai.
What's here
Two commits, TDD-style:
Introduce a pickedFrom map and consult it as a tie-break sort key. The existing ordering remains primary. Behavior is unchanged whenever the primary sort key differs; the change only touches the previously-arbitrary tied case.

Commits are DCO-signed.
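For reviewers who want to see the tie-break in isolation, here is a minimal, self-contained sketch of the idea. It is not the code in this PR: the function pickDistributed, the helper physicalID, and the "GPU-0::1" replica ID format are hypothetical names chosen for illustration. Only the two sort keys ('used' as primary, pickedFrom as tie-break) come from the description above.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// physicalID strips the replica suffix from an ID such as "GPU-0::1".
// The "::" separator is an assumption made for this sketch.
func physicalID(replica string) string {
	return strings.SplitN(replica, "::", 2)[0]
}

// pickDistributed (hypothetical) picks up to `needed` replicas from candidates.
// total maps each physical device to its advertised replica count.
func pickDistributed(candidates []string, total map[string]int, needed int) []string {
	// Work on a copy so the caller's slice is not reordered.
	candidates = append([]string(nil), candidates...)

	// Count the free replicas per physical device.
	available := map[string]int{}
	for _, c := range candidates {
		available[physicalID(c)]++
	}

	// pickedFrom counts slots taken from each device during this allocation.
	pickedFrom := map[string]int{}
	var picked []string

	for n := 0; n < needed && len(candidates) > 0; n++ {
		sort.Slice(candidates, func(i, j int) bool {
			a, b := physicalID(candidates[i]), physicalID(candidates[j])
			usedA := total[a] - available[a]
			usedB := total[b] - available[b]
			if usedA != usedB {
				return usedA < usedB // primary key: least-used device first
			}
			// Tie-break: prefer the device touched least in this allocation.
			return pickedFrom[a] < pickedFrom[b]
		})
		id := physicalID(candidates[0])
		available[id]--
		pickedFrom[id]++
		picked = append(picked, candidates[0])
		candidates = candidates[1:]
	}
	return picked
}

func main() {
	// Two GPUs with two replicas each; one GPU-1 slot is already taken.
	total := map[string]int{"GPU-0": 2, "GPU-1": 2}
	candidates := []string{"GPU-0::0", "GPU-0::1", "GPU-1::1"}
	fmt.Println(pickDistributed(candidates, total, 2))
}
```

In this scenario the first pick comes from GPU-0 because its 'used' count is lower. Both devices then tie at one used slot, and the pickedFrom key sends the second pick to GPU-1, so the result contains one replica from each physical GPU. Which of the two GPU-0 replicas is chosen is still left to sort.Slice.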