Skip to content

[NIXL] Fix NIXL UCX worker node placement#922

Merged
podkidyshev merged 1 commit into
mainfrom
ipod/nixl-inter-node
Jun 11, 2026
Merged

[NIXL] Fix NIXL UCX worker node placement#922
podkidyshev merged 1 commit into
mainfrom
ipod/nixl-inter-node

Conversation

@podkidyshev

Copy link
Copy Markdown
Contributor

Summary

On some clusters NIXLBench UCX Inter-node tests don’t actually run across two nodes - both the initiator and the target end up on the same node.
Some clusters though are not affected (probably due to older Slurm version, which is most likely the reason). The issue affects all CloudAI workloads that launch two SLURM steps per UCX inter-node run

Test Plan

  • Automated CI
  • Manual runs

Additional Notes

@podkidyshev podkidyshev self-assigned this Jun 11, 2026
@podkidyshev podkidyshev added the bug Something isn't working label Jun 11, 2026
@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 9bfc064f-e4e3-4d89-aaff-de3d67074b7d

📥 Commits

Reviewing files that changed from the base of the PR and between 7a52c83 and 3c06305.

📒 Files selected for processing (4)
  • src/cloudai/workloads/common/nixl.py
  • tests/ref_data/nixl-kvbench.sbatch
  • tests/ref_data/nixl_bench.sbatch
  • tests/workloads/nixl_bench/test_command_gen_strategy_slurm.py

📝 Walkthrough

Walkthrough

This PR replaces SLURM --relative={idx} node selection with explicit --nodelist expressions in UCX multi-node workload generation. The implementation now extracts target hostnames from $SLURM_JOB_NODELIST via scontrol show hostname ... | sed -n '{idx + 1}p', and reference test data and assertions are updated to match.

Changes

SLURM UCX multi-node node selection

Layer / File(s) Summary
Core implementation and test assertion update
src/cloudai/workloads/common/nixl.py, tests/workloads/nixl_bench/test_command_gen_strategy_slurm.py
gen_nixlbench_srun_commands generates per-index srun commands with --nodelist expressions from $SLURM_JOB_NODELIST, and the test assertions verify this pattern and absence of --relative.
Reference test data updates
tests/ref_data/nixl-kvbench.sbatch, tests/ref_data/nixl_bench.sbatch
Expected kvbench and nixlbench srun command outputs updated to show --nodelist expressions targeting first and second hostnames instead of --relative offsets.

🎯 2 (Simple) | ⏱️ ~10 minutes


🐰 From --relative to explicit node lists we hop,
Each task now knows its target node, clean and top!
SLURM commands dance with sed and scontrol's might,
Multi-node clarity blooms in the night! 🌙✨

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically identifies the main change: fixing NIXL UCX worker node placement in multi-node test scenarios.
Description check ✅ Passed The description explains the issue (UCX inter-node tests running on same node instead of across nodes), its scope, and provides a test plan, all directly related to the changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch ipod/nixl-inter-node

Comment @coderabbitai help to get the list of available commands and usage tips.

@podkidyshev podkidyshev marked this pull request as ready for review June 11, 2026 11:06
@podkidyshev podkidyshev merged commit 1d0a713 into main Jun 11, 2026
5 checks passed
@podkidyshev podkidyshev deleted the ipod/nixl-inter-node branch June 11, 2026 13:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants