Add "Install Slurm" documentation #72

Open
lunamorrow wants to merge 12 commits into OpenCHAMI:main from lunamorrow:lunamorrow/install-slurm-documentation

Conversation

@lunamorrow

Pull Request Template

Thank you for your contribution! Please ensure the following before submitting:

Checklist

  • My code follows the style guidelines of this project
  • I have added/updated comments where needed
  • I have added tests that prove my fix is effective or my feature works
  • I have run make test (or equivalent) locally and all tests pass
  • DCO Sign-off: All commits are signed off (git commit -s) with my real name and email
  • REUSE Compliance:
    • Each new/modified source file has SPDX copyright and license headers
    • Any non-commentable files include a <filename>.license sidecar
    • All referenced licenses are present in the LICENSES/ directory

Description

Contributing to the "Install Slurm" documentation under the OpenCHAMI guides.

Any feedback or suggestions about making the documentation broad enough for general purpose or to fit in well with the existing documentation are appreciated.

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation update

For more info, see Contributing Guidelines.

…n will need some further updates to align better with the Tutorial (e.g. changing IP addresses, adjusting comments to support bare-metal and cloud setups, etc.) and to ensure the documented approach is sufficiently broad for general purpose.

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
@lunamorrow
Author

I am in the process of adjusting the documentation to align better with the format of the OpenCHAMI tutorial to enable ease of use for OpenCHAMI users. I will be making some more commits to adjust and fine-tune the documentation, and I would appreciate feedback/suggestions as this is my first time contributing to this project.

@davidallendj
Contributor

At a glance, this looks great! I'm going to try to run through this today if I get a chance.

… for creating some files from cat to copy-paste to prevent issues with bash command/variable processing

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
… - this should make this guide easy to follow on with after the tutorial

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…Next step will be expanding comments/explanations to provide more context to users, as well as providing more code blocks to show expected output of commands that produce output.

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…id. Changes include making it clearer when the pwgen password is used, correcting the file creation step for slurm.conf to prevent errors, removing instructions for aliasing the build command (and instead redirecting to the appropriate tutorial section), updating instructions in line with a recent PR to replace MinIO with Versity S3, and some minor typo fixes

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
@lunamorrow
Author

Thank you for your suggestions/fixes @davidallendj, I have applied all of them except for simplifying the MariaDB configuration process. I will look into if there is a more hands-off approach to configuring MariaDB that we can implement instead. I will also expand the comments/explanations that are provided with code blocks, add more 'expected output' code blocks and provide support for cloud or bare-metal deployment variations over the next few days.

…ck from David.

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>

{{< callout context="note" title="Note" icon="outline/info-circle" >}}
Find all directories owned by old munge UID/GID with the following command:

Contributor

If we see directories from the output of `find / -uid 991 -type d`, should we `chown -R munge:munge` them? I'm wondering if combining the commands into something like `chown -R munge:munge $(find / -uid 991 -type d)` would make sense here?

Contributor

Or even `find / -uid 991 -type d -exec chown -R munge:munge \{\} \;`.

Author

@lunamorrow Feb 13, 2026

Some of the directories that get picked up are mounted in read-only file systems, so `chown` will fail. We could apply a filter to skip these, or just let users know to ignore the `chown: changing ownership of '/run/rootfsbase/other/directories': Read-only file system` warnings?

How does something like this sound for a filter?

chown -R munge:munge $(find / -uid 991 -type d | grep -v /run/)

Contributor

Just a quick note on `find`: you can avoid recursing into other filesystems with `-mount` (on GNU find) and can exclude paths via `! -path <pathspec>` as well. Putting it all together, you can do everything within a single `find` command:

find / -uid 991 -type d -mount ! -path '/run/*' -exec chown -R munge:munge \{\} \;

(You'll of course need root.)
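
To see what the single-command form would select before changing any ownership, the same expression can be tried as a dry run. The sketch below is only an illustration under assumptions: it builds a hypothetical scratch tree in a temporary directory instead of scanning `/`, uses the current user's UID in place of 991, and swaps the `chown` action for `-print` so nothing is modified. It also places `-mount` before the tests, since GNU find may print a warning when a global option follows them.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Scratch tree standing in for "/" (hypothetical layout, for illustration only).
root="$(mktemp -d)"
mkdir -p "$root/var/lib/munge" "$root/etc/munge" "$root/run/rootfsbase/etc/munge"

# The current user's UID stands in for the old munge UID (991 in the thread).
old_uid="$(id -u)"

# Same shape as the suggested command, but with -print in place of
# -exec chown -R munge:munge {} \; so this run changes nothing.
matches="$(find "$root" -mount -uid "$old_uid" -type d ! -path "$root/run/*" -print | sort)"
printf '%s\n' "$matches"

# Clean up the scratch tree.
rm -rf "$root"
```

Note that the `run` directory itself is still listed, because the pattern `run/*` only matches paths below it; everything underneath is skipped. Once the printed list looks right, the `-print` action can be replaced with the `-exec chown` form from the comment above.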

…Some reviews are still pending as I figure out the source of the problem and a solution, and I will address these in a later commit.

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
… to VM head nodes.

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…certain commands should behave and/or the output they should produce.

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
@lunamorrow
Author

I've bulked up the documentation with support for bare-metal and cloud instance head nodes, and more 'expected output' code blocks. I've also added a few more subsections to help define the flow more. I will keep expanding the text and the text explanations for certain choices/commands next week to make the guide more of a one-stop shop for Slurm configuration.

@lunamorrow
Author

I have been notified of a security vulnerability with versions 0.5-0.5.17 of munge. I will update the documentation next week to pin installation of munge >= 0.5.18, so we can ensure anyone following the guide isn't installing a vulnerable version of Munge.

More info: https://nvd.nist.gov/vuln/detail/CVE-2026-25506

@alexlovelltroy
Member

I really appreciate all the work you're doing here. Really shaping up nicely!

@davidallendj
Contributor

> Thank you for your suggestions/fixes @davidallendj, I have applied all of them except for simplifying the MariaDB configuration process. I will look into if there is a more hands-off approach to configuring MariaDB that we can implement instead. I will also expand the comments/explanations that are provided with code blocks, add more 'expected output' code blocks and provide support for cloud or bare-metal deployment variations over the next few days.

Thanks for making all these changes! Just a few more changes/adjustments and we should be able to merge soon!

Run a test job as the user 'testuser':

```bash
srun hostname
```

Contributor

@davidallendj Feb 13, 2026

Okay, I think this is the last issue for me. Whenever I try to run this, I'm getting this error about accounts and partitions. Did I miss a step?

[testuser@openchami-dev ~]$ srun hostname
srun: error: Unable to allocate resources: Invalid account or account/partition combination specified

Here's what I'm seeing for journalctl slurmctld.

Feb 13 18:53:48 openchami-dev.novalocal slurmctld[1412948]: slurmctld: _job_create: invalid account or partition for user 1002, account '(null)', and partition 'main'
Feb 13 18:53:48 openchami-dev.novalocal slurmctld[1412948]: slurmctld: _slurm_rpc_allocate_resources: Invalid account or account/partition combination specified

And here's the sinfo too.

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
main*        up   infinite      1   idle de01
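
An error naming account '(null)' usually means the user has no association in the Slurm accounting database. The thread does not record which commands resolved it here, so the following is only a sketch of how one might inspect and add an association with `sacctmgr`. The account name `testaccount` and the description and organization values are hypothetical, and the `run` helper is an illustrative wrapper that prints each command instead of executing it unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical names for illustration; substitute the values used on your cluster.
slurm_user="testuser"
slurm_account="testaccount"

# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# 1. Inspect the user's current associations (no rows means none exist).
run sacctmgr show associations user="$slurm_user" format=Cluster,Account,User,Partition

# 2. Create an account and attach the user to it (-i skips the confirmation prompt).
run sacctmgr -i add account "$slurm_account" Description="test account" Organization="test"
run sacctmgr -i add user "$slurm_user" Account="$slurm_account"
```

Running it as-is only prints the commands for review; running with `DRY_RUN=0` as a Slurm administrator would apply them.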

Author

Hmmm strange. I've only had that issue when I switch user and immediately try to run a job, but it looks like you are running srun hostname as the user you've created with Slurm privileges. I will look into this and get back to you.

Contributor

I tried this again after running the commands above and I got this instead now.

[testuser@openchami-dev ~]$ srun hostname
srun: job 1 queued and waiting for resources
srun: job 1 has been allocated resources
slurmstepd: error: couldn't chdir to `/home/testuser': No such file or directory: going to /tmp instead
slurmstepd: error: couldn't chdir to `/home/testuser': No such file or directory: going to /tmp instead
de01
ERROR: ld.so: object '/software/r9/xalt/3.0.1/$LIB/libxalt_init.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.

I do see the hostname as expected, but I'm not sure whether the other warnings/errors are all that important here. If not, then I'd say we're pretty much done with the PR and it should be ready to merge.

Author

Fantastic! That looks to be working now, which is great. The two slurmstepd errors are expected as LDAP is not configured. I haven't seen the LD_PRELOAD error before, but it could be due to your cluster having xalt installed or on a path somewhere? It isn't something that should be installed from my directions and I can't see it on my cluster. It doesn't seem to be causing issues with srun though, so I am not worried.

Hopefully the issues you had previously were a one-off. I was trying to replicate them, but issues with my test cluster have unfortunately been preventing me from doing so.

@lunamorrow
Author

> I really appreciate all the work you're doing here. Really shaping up nicely!

Thanks Alex :)

> Thanks for making all these changes! Just a few more changes/adjustments and we should be able to merge soon!

No problem and sounds great! Thanks David :)

…ecurity vulnerabilities with versions 0.5-0.5.17

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…ompute node. Additionally made some tweaks to the documentation to make the workflow more robust after repeating it on a fresh node.

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
@lunamorrow
Author

lunamorrow commented Feb 20, 2026

I've set up a new test cluster to run through the documentation again and have made some minor tweaks, which should address some of the issues found with slurmdbd and slurmctld. Additionally, the munge install has been pinned to version 0.5.18 to address security concerns.

@davidallendj I was originally planning to add more comments/explanations to explain the workflow and certain choices more, but if that isn't needed (or we want to do that at a later time) then I am happy with the state of the documentation to merge now if you are.
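
For readers who want to confirm that an existing node meets the munge version floor mentioned above, a version comparison can be sketched in plain shell. This is a minimal sketch, assuming GNU `sort -V` is available; `version_ge` is a hypothetical helper and the candidate versions in the loop are illustrative values, not output from a real system.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Return success when version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# First release outside the affected range cited in the thread.
min_version="0.5.18"

# Illustrative candidates; in practice this would be the installed version.
for candidate in 0.5.15 0.5.17 0.5.18 0.5.20; do
    if version_ge "$candidate" "$min_version"; then
        echo "$candidate: ok"
    else
        echo "$candidate: older than $min_version"
    fi
done
```

On an RPM-based system the installed version could presumably be read with something like `rpm -q --qf '%{VERSION}' munge` and passed to `version_ge`; that lookup is an assumption about the target distribution, not part of the guide.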

@davidallendj
Contributor

> @davidallendj I was originally planning to add more comments/explanations to explain the workflow and certain choices more, but if that isn't needed (or we want to do that at a later time) then I am happy with the state of the documentation to merge now if you are.

I think this is a good start so we can add that later if you want but that's up to you. I'm going to go ahead and approve.

Contributor

@davidallendj left a comment

Tested and can verify that it works 🚀

@lunamorrow
Author

> I think this is a good start so we can add that later if you want but that's up to you. I'm going to go ahead and approve.

Thanks David, that sounds great to me! I've got some things coming up at work, so I can pivot back to expand it later. I also want to flesh out some other documentation components once I have figured them out if the OpenCHAMI dev team would be interested (e.g. K8s, serving images with NFS, etc.).

…in a few places

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
@synackd
Contributor

synackd commented Feb 24, 2026

> (e.g. K8s, serving images with NFS, etc.)

All of the above! I've been intending to add these to the handbook at some point as well, but I think we could use some help as we've been quite busy. 🙂

4 participants