[NIXL] Fix NIXL UCX worker node placement #922
Summary
On some clusters, NIXLBench UCX inter-node tests don't actually run across two nodes: both the initiator and the target end up on the same node.
Other clusters are not affected, most likely because they run an older Slurm version. The issue affects all CloudAI workloads that launch two SLURM steps per UCX inter-node run.
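
To make the symptom concrete, here is a minimal Python sketch of one way two job steps could be pinned to distinct nodes of an allocation, using `srun`'s `--nodelist` option. It is an illustration only and is not taken from this PR's diff: the helper name `build_step_commands`, the node names, and the `./bench` placeholder commands are all hypothetical.

```python
import shlex


def build_step_commands(nodes, initiator_cmd, target_cmd):
    """Return (target, initiator) srun argv lists pinned to different nodes.

    Hypothetical helper for illustration; not CloudAI's actual implementation.
    """
    unique = list(dict.fromkeys(nodes))  # de-duplicate while keeping order
    if len(unique) < 2:
        raise ValueError("an inter-node run needs at least two distinct nodes")

    def pinned(node, cmd):
        # --nodelist requests the named host for this step; --nodes=1 and
        # --ntasks=1 keep the step to a single task on a single node.
        return ["srun", "--nodes=1", "--ntasks=1", f"--nodelist={node}", *cmd]

    return pinned(unique[0], target_cmd), pinned(unique[1], initiator_cmd)


target, initiator = build_step_commands(
    ["node-a", "node-b"],                              # assumed allocation
    initiator_cmd=["./bench", "--role", "initiator"],  # placeholder command
    target_cmd=["./bench", "--role", "target"],        # placeholder command
)
print(shlex.join(target))
print(shlex.join(initiator))
```

In a real job the node list would come from the allocation (for example via `scontrol show hostnames "$SLURM_JOB_NODELIST"`) rather than being hard-coded.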
Test Plan
Additional Notes