docs: add fallback guide for downgrading HyperPod clusters from Slurm 25.11 to 24.11 #1107

Open
hossamazon wants to merge 2 commits into awslabs:main from hossamazon:main

@hossamazon

Summary

Adds a runbook for customers who need to downgrade a SageMaker HyperPod cluster from Slurm 25.11 to Slurm 24.11.

HyperPod AMIs ship with both versions pre-installed (/opt/slurm-24.11/ and /opt/slurm-25.11/), so downgrading comes down to switching the /opt/slurm symlink and making sure the correct binaries are active before the Slurm daemons start. The document covers two paths, chosen by whether a new AMI is available.
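
The symlink switch described above can be sketched in Python, the language of lifecycle_script.py. This is a minimal illustration, not the code from the guide: the helper names `switch_symlink` and `active_version` are hypothetical, and the rename-over-the-old-link approach is an assumption about how one might avoid a window in which /opt/slurm does not exist.

```python
import os


def switch_symlink(link_path: str, target: str) -> None:
    """Point link_path at target, replacing an existing symlink in one rename.

    Hypothetical helper for illustration; the guide's own steps may differ.
    """
    if not os.path.isdir(target):
        raise FileNotFoundError(f"target directory does not exist: {target}")
    if os.path.lexists(link_path) and not os.path.islink(link_path):
        raise ValueError(f"{link_path} exists and is not a symlink")

    # Create the new link under a temporary name next to the old one.
    tmp_link = f"{link_path}.tmp-{os.getpid()}"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(target, tmp_link)

    # Renaming over the old link replaces it without ever removing it first.
    os.replace(tmp_link, link_path)


def active_version(link_path: str) -> str:
    """Return the directory name the symlink currently resolves to."""
    return os.path.basename(os.path.realpath(link_path))
```

On a cluster node this would be called as `switch_symlink("/opt/slurm", "/opt/slurm-24.11")` with root privileges and before the Slurm daemons start. `active_version("/opt/slurm")` can then confirm which installation is live.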

New file: LifecycleScripts/base-config/hotfix/fallback-to-slurm-24.md

What's covered

  • Prerequisite: a backup section using the HyperPod-provided patching-backup.sh script, placed before both options because it applies to both
  • Option 1 (recommended): update lifecycle_script.py to copy the configs and switch the symlink to /opt/slurm-24.11 before the Slurm services start, then trigger Update Cluster Software. This is the approach from the enable Slurm 25.11 guide, applied in reverse. Includes diffs for both the "replacing existing 25.11 lines" and "adding version switch for the first time" cases
  • Option 2 (manual, no new AMI): SSH-based procedure that:
    1. Updates the lifecycle scripts in S3 first so future replaced/scaled nodes also come up on 24.11
    2. Captures the node list before making any changes (avoids a sinfo version mismatch after the symlink switch)
    3. Copies config files into /opt/slurm-24.11/etc/ on all nodes in parallel
    4. Switches the symlink on all nodes
    5. Restarts slurmdbd (if accounting is enabled) before slurmctld to avoid a library version mismatch
    6. Restarts slurmctld on the head node, then slurmd on all worker nodes
  • Revert section: how to return to Slurm 25.11 via either option
  • Notes: mixed-version warning, state directory behavior, config file compatibility
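
The ordering in Option 2 (capture the node list first, then restart slurmdbd, slurmctld, and slurmd in sequence) can be sketched as a small planning helper. This is an illustration under stated assumptions rather than the procedure from the guide: the function names are hypothetical, the daemons are assumed to run as systemd units named slurmdbd, slurmctld, and slurmd, slurmdbd is assumed to run on the head node, and the node list is assumed to come from something like `sinfo -N -h -o "%N"`, saved before the symlink is touched.

```python
from typing import List, Tuple


def parse_node_list(sinfo_output: str) -> List[str]:
    """Turn one-hostname-per-line output into a de-duplicated, ordered list.

    Assumes the text was captured before the symlink switch, for example
    from `sinfo -N -h -o "%N"` (an assumption, not the guide's command).
    """
    seen = set()
    nodes = []
    for line in sinfo_output.splitlines():
        name = line.strip()
        if name and name not in seen:
            seen.add(name)
            nodes.append(name)
    return nodes


def restart_plan(
    head_node: str, workers: List[str], accounting_enabled: bool
) -> List[Tuple[str, str]]:
    """Return (node, command) pairs in the order the daemons are restarted."""
    plan: List[Tuple[str, str]] = []
    if accounting_enabled:
        # slurmdbd goes first so slurmctld reconnects to a matching version.
        plan.append((head_node, "systemctl restart slurmdbd"))
    plan.append((head_node, "systemctl restart slurmctld"))
    # Workers restart only after the controller is back on the new version.
    for worker in workers:
        plan.append((worker, "systemctl restart slurmd"))
    return plan
```

Returning the plan as data, instead of running the commands directly, lets the order be reviewed or logged before anything is executed over SSH.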

Testing

Validated the symlink-switch approach and daemon restart order on a live HyperPod cluster running Slurm 25.11.

hossamazon requested a review from a team as a code owner on May 22, 2026 at 02:39