
fix distributedAlloc tie-break to preserve distribution across physical GPUs#1788

Open
jonathan-meiri wants to merge 3 commits into
NVIDIA:mainfrom
jonathan-meiri:prefer-distinct-physical-gpu-on-tiebreak

Conversation

@jonathan-meiri

Summary

distributedAlloc is designed to spread allocations evenly across replicated physical GPUs, but its sort key has no tie-break. A tie is easy to reach once any prior pod has consumed a slot on one of the GPUs, and sort.Slice gives no ordering guarantee for elements that compare equal. In practice the loop keeps picking the next slot on the GPU it just picked from, instead of rotating to a sibling physical GPU that still has capacity.

Contributed by @Meiri28 on behalf of @runatom-ai.

What's here

Two commits, TDD-style:

  1. A failing test describing the bug — 2 physical GPUs, 1 slot on GPU-1 already allocated to another pod, new pod requests 2 slots. Today both slots deterministically land on GPU-0; the test asserts 1-from-each.
  2. The fix — introduce a small pickedFrom map and consult it as a tie-break sort key. The existing ordering remains primary. Behavior is unchanged whenever the primary sort key differs; the change only touches the previously-arbitrary tied case.
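The two commits above can be illustrated with a minimal, self-contained sketch. It is not the plugin's actual distributedAlloc: the `replica` type, the `allocateSpread` function, the replica IDs, and the count of two replicas per GPU are all assumptions made for illustration. Only the idea comes from this PR: keep 'used' as the primary key and consult a pickedFrom map when two devices tie. The sketch uses sort.SliceStable rather than sort.Slice purely so that its output is reproducible.

```go
package main

import (
	"fmt"
	"sort"
)

// replica is a hypothetical stand-in for one shareable slot on a physical GPU.
type replica struct {
	ID  string // illustrative replica ID, e.g. "GPU-0::1"
	GPU string // underlying physical device
}

// allocateSpread is an illustrative sketch, not the real distributedAlloc.
// total maps each physical GPU to its advertised replica count.
func allocateSpread(available []replica, total map[string]int, size int) []string {
	candidates := append([]replica(nil), available...) // do not mutate the caller's slice
	free := map[string]int{}
	for _, r := range candidates {
		free[r.GPU]++
	}
	pickedFrom := map[string]int{} // slots taken per GPU during this call

	var picked []string
	for len(picked) < size && len(candidates) > 0 {
		sort.SliceStable(candidates, func(i, j int) bool {
			gi, gj := candidates[i].GPU, candidates[j].GPU
			ui, uj := total[gi]-free[gi], total[gj]-free[gj]
			if ui != uj {
				return ui < uj // primary key: least-used GPU first
			}
			// Tie-break: prefer the GPU touched least during this allocation.
			return pickedFrom[gi] < pickedFrom[gj]
		})
		r := candidates[0]
		candidates = candidates[1:]
		free[r.GPU]--
		pickedFrom[r.GPU]++
		picked = append(picked, r.ID)
	}
	return picked
}

func main() {
	// Two GPUs with two replicas each; one GPU-1 slot already belongs to another pod.
	total := map[string]int{"GPU-0": 2, "GPU-1": 2}
	available := []replica{
		{"GPU-0::0", "GPU-0"}, {"GPU-0::1", "GPU-0"}, {"GPU-1::1", "GPU-1"},
	}
	fmt.Println(allocateSpread(available, total, 2)) // [GPU-0::0 GPU-1::1]
}
```

In this sketch the first pick goes to GPU-0 because it has the lower 'used' count. After that pick both GPUs tie on 'used', and the pickedFrom comparison sends the second pick to GPU-1, which matches the one-from-each outcome the test asserts.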

Commits are DCO-signed.

Meiri28 and others added 2 commits May 19, 2026 17:15
TDD red phase for an issue independent of NVIDIA#1787.

Setup: a node with two physical GPUs of equal advertised replica
counts, where one slot on the "second" GPU has already been allocated
to another pod. A new pod requests two more slots. The function name
and docstring of distributedAlloc promise an even spread, but today
the function deterministically picks both of the new pod's slots from
the GPU with the most remaining replicas — leaving the other physical
GPU's available slot untouched.

The bug is in the sort tie-break. After the first pick the per-GPU
'used' counts tie across the candidates, and sort.Slice is unstable,
so the next iteration ends up picking the next slot on the GPU we
just picked from rather than rotating to the sibling GPU that still
has capacity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: runatom-ai <258621014+runatom-ai@users.noreply.github.com>
Signed-off-by: Jonathan Meiri <33288957+Meiri28@users.noreply.github.com>
… physical GPUs

distributedAlloc sorts candidate replicas by 'used' (total - available)
per underlying physical device. When two physical devices end up with
the same 'used' count after some picks, sort.Slice is unstable and
breaks the tie arbitrarily, which in practice causes the loop to keep
picking from whichever device's slots happened to land first in the
candidate list rather than rotating to a sibling physical device.

That manifests when the available pool starts uneven (e.g., one slot
on GPU-1 has already been allocated to another pod). The function name
and docstring promise an even spread across replicated GPUs; the
tie-break failure deterministically concentrates the new pod's slots
on the GPU that had more available replicas, leaving the other
physical device(s) untouched.
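The arithmetic behind that scenario can be traced with a short sketch. The helper name `usedPerDevice` and the count of two replicas per GPU are assumptions chosen for illustration; only the definition of 'used' as total minus available comes from the commit message.

```go
package main

import "fmt"

// usedPerDevice returns total - available for each physical device,
// mirroring the 'used' sort key described in the commit message.
func usedPerDevice(total, available map[string]int) map[string]int {
	used := make(map[string]int, len(total))
	for dev, t := range total {
		used[dev] = t - available[dev]
	}
	return used
}

func main() {
	total := map[string]int{"GPU-0": 2, "GPU-1": 2}
	// One slot on GPU-1 has already been allocated to another pod.
	available := map[string]int{"GPU-0": 2, "GPU-1": 1}
	fmt.Println(usedPerDevice(total, available)) // map[GPU-0:0 GPU-1:1]

	// The first pick goes to GPU-0, the least-used device.
	available["GPU-0"]--
	fmt.Println(usedPerDevice(total, available)) // map[GPU-0:1 GPU-1:1]
}
```

After the first pick both devices report the same 'used' value, so the primary key alone can no longer decide which device supplies the second slot.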

Introduce a pickedFrom map tracking how many slots have been taken
from each physical device during this allocation, and consult it as a
tie-break sort key. The existing 'used' ordering remains primary;
when two devices tie on that, the one we have not touched (or have
touched the least) this allocation comes first. Behavior is unchanged
whenever 'used' counts differ, including for fresh state and for
cases where only one physical device has free slots.

Failing tests added in the preceding commit now pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: runatom-ai <258621014+runatom-ai@users.noreply.github.com>
Signed-off-by: Jonathan Meiri <33288957+Meiri28@users.noreply.github.com>
@copy-pr-bot

copy-pr-bot Bot commented May 19, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.



2 participants