feat(stealing): add margin-of-improvement bound to prevent work-stealing thrash #9262
prince8273 wants to merge 10 commits into
Unit Test Results
See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.
31 files ±0, 31 suites ±0, 11h 7m 24s ⏱️ (-1m 37s)
For more details on these failures, see this check.
Results for commit 4026b1c. ± Comparison against base commit cf508b9.
♻️ This comment has been updated with latest results.
… death
When a worker drops off the cluster unexpectedly (e.g., due to an OOM kill), the scheduler tracks its processing_keys but previously did not log them to the console. This change surfaces exactly which tasks were interrupted, which makes cluster hangs and memory crashes much easier to diagnose.
f052680 to 2e48bea
CI Status Note
The failing checks are pre-existing flaky tests unrelated to this PR.
The dask/distributed test report
The GitHub Actions bot confirms: 3 ❌ -1 against base commit cf508b9
- Add reject_count_margin_total metric to WorkStealing.metrics
- Add observability logging for interrupted tasks in scheduler.py
- Add test_reject_count_margin_metric to test_steal.py
- Revert accidental range() changes in test_steal.py

Signed-off-by: prince8273 <princesingh29757@gmail.com>
@fjetter Could you take a look when you get a chance? Quick summary in case it helps:
Change: a single logger.warning in remove_worker when expected=False, surfacing the interrupted processing_keys at the moment of worker death.
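For reviewers who want a concrete picture of the logging change, here is a minimal sketch of the idea. It is not the code from this PR: the helper name `warn_interrupted_tasks`, its arguments, the logger name, and the choice to skip the warning when no keys were in flight are all assumptions made for illustration.

```python
import logging

logger = logging.getLogger("scheduler_sketch")  # hypothetical logger name


def warn_interrupted_tasks(address, processing_keys, expected):
    """Log the keys a worker was processing when it died unexpectedly.

    Hypothetical helper: returns the sorted list of keys that were logged,
    or an empty list when nothing was logged.
    """
    # Assumption: an expected removal (graceful shutdown) is not worth a warning,
    # and neither is an unexpected removal with no tasks in flight.
    if expected or not processing_keys:
        return []
    keys = sorted(processing_keys)
    logger.warning(
        "Worker %s removed unexpectedly while processing %d task(s): %s",
        address,
        len(keys),
        keys,
    )
    return keys
```

Sorting the keys keeps the log line stable across runs, which helps when comparing logs from repeated failures.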
Problem
The scheduler would steal a task whenever the thief was even 1ms
faster than the victim. For data-heavy, compute-light tasks this
caused chronic thrashing, because transfer costs routinely exceeded the savings.
Change
Added a margin constraint to balance():
The thief must now promise a speedup of at least 50% of the network
transfer cost. Marginal steals that are net-negative under realistic
network jitter are suppressed.
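The rule can be illustrated with a small standalone sketch. This is not the actual condition in balance(); the function `should_steal`, its parameters, and the assumption that the thief's estimated finish time already includes the transfer are hypothetical and only illustrate the 50% margin described above.

```python
def should_steal(victim_finish, thief_finish, comm_cost, margin_fraction=0.5):
    """Return True if the estimated speedup clears the required margin.

    All arguments are in seconds. Assumption: thief_finish already includes
    the time needed to transfer the task's dependencies to the thief.
    """
    speedup = victim_finish - thief_finish
    required = margin_fraction * comm_cost
    return speedup >= required


# A 1ms gain against a 2s transfer is rejected; a 2s gain is accepted.
print(should_steal(10.0, 9.999, 2.0))  # False
print(should_steal(10.0, 8.0, 2.0))    # True
```

Tying the margin to the transfer cost rather than using a fixed threshold means cheap-to-move tasks can still be stolen for small gains, while data-heavy tasks need a proportionally larger payoff.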
Observability
Added reject_count_margin_total (keyed by level) to WorkStealing.metrics so operators can measure exactly how many thrashing steals are being prevented. A logger.debug line is emitted on each rejection with full task and margin details.
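As a rough picture of how a per-level counter with a debug line might fit together, here is a minimal sketch. The class `MarginMetricsSketch`, the method `record_rejection`, and the log message format are hypothetical; only the metric name reject_count_margin_total and the keying by level come from the description above.

```python
import logging
from collections import defaultdict

logger = logging.getLogger("stealing_sketch")  # hypothetical logger name


class MarginMetricsSketch:
    """Hypothetical stand-in for the metrics bookkeeping, not WorkStealing itself."""

    def __init__(self):
        # Maps stealing level -> number of steals rejected by the margin check.
        self.reject_count_margin_total = defaultdict(int)

    def record_rejection(self, level, key, speedup, required):
        """Count one rejected steal and emit the task and margin details."""
        self.reject_count_margin_total[level] += 1
        logger.debug(
            "Rejected steal of %s at level %s: speedup %.3fs < required margin %.3fs",
            key,
            level,
            speedup,
            required,
        )


metrics = MarginMetricsSketch()
metrics.record_rejection(level=3, key="task-a", speedup=0.001, required=1.0)
metrics.record_rejection(level=3, key="task-b", speedup=0.2, required=1.0)
print(dict(metrics.reject_count_margin_total))  # {3: 2}
```

Keeping the log at debug level avoids flooding production logs, while the counter stays cheap enough to leave on permanently.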
Tests
Added test_reject_count_margin_metric, which simulates a high comm_cost / low compute scenario, triggers balance(), and asserts reject_count_margin_total >= 1.