Conversation
…n will need some further updates to align better with the Tutorial (e.g. changing IP addresses, adjusting comments to support bare-metal and cloud setups, etc.) and to ensure the documented approach is sufficiently broad for general purpose. Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
I am in the process of adjusting the documentation to align better with the format of the OpenCHAMI tutorial, to make it easier for OpenCHAMI users to follow. I will be making some more commits to adjust and fine-tune the documentation, and I would appreciate feedback/suggestions as this is my first time contributing to this project.
At a glance, this looks great! I'm going to try to take time to run through this today if I get a chance.
… for creating some files from cat to copy-paste to prevent issues with bash command/variable processing Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
… - this should make this guide easy to follow on with after the tutorial Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…Next step will be expanding comments/explanations to provide more context to users, as well as providing more code blocks to show expected output of commands that produce output. Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…id. Changes include making it clearer when the pwgen password is used, correcting the file creation step for slurm.conf to prevent errors, removing instructions for aliasing the build command (and instead redirecting to the appropriate tutorial section), updating instructions in line with a recent PR to replace MinIO with Versity S3, and some minor typo fixes Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
Thank you for your suggestions/fixes @davidallendj, I have applied all of them except for simplifying the MariaDB configuration process. I will look into whether there is a more hands-off approach to configuring MariaDB that we can implement instead. I will also expand the comments/explanations that are provided with code blocks, add more 'expected output' code blocks and provide support for cloud or bare-metal deployment variations over the next few days.
…ck from David. Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
| ``` | ||
| {{< callout context="note" title="Note" icon="outline/info-circle" >}} | ||
| Find all directories owned by old munge UID/GID with the following command: |
If we see directories from the output of find / -uid 991 -type d, should we chown -R munge:munge them? I'm wondering if combining the commands into something like chown -R munge:munge $(find / -uid 991 -type d) would make sense here?
Or even find / -uid 991 -type d -exec chown -R munge:munge \{\} \;.
Some of the directories that get picked up are mounted in read-only file systems, so chown will fail. We could apply a filter to skip over these or just let users know to ignore the chown: changing ownership of '/run/rootfsbase/other/directories': Read-only file system warnings?
How does something like this sound for a filter?
chown -R munge:munge $(find / -uid 991 -type d | grep -v /run/)
Just a quick note on find, you can avoid recursing into other filesystems with -mount (on GNU find) and can exclude paths via ! -path <pathspec> as well. Putting it all together, you can do everything within a single find command:
find / -uid 991 -type d -mount ! -path '/run/*' -exec chown -R munge:munge \{\} \;
(You'll of course need root.)
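The single-command approach suggested above can be tried as a dry run first. The following is a minimal sketch, assuming GNU find; `list_dirs_owned_by` is a hypothetical helper name, while UID 991 and the `/run/*` exclusion are taken from this thread. It places `-mount` directly after the start point, where the GNU find manual asks for global options, and prints matches instead of changing them.

```shell
# Hypothetical helper (not part of the guide): list directories owned by a
# given UID, staying on the start point's filesystem and skipping one subtree.
list_dirs_owned_by() {
    start="$1"      # where to start searching, e.g. /
    uid="$2"        # numeric UID to match, e.g. the old munge UID 991
    exclude="$3"    # path pattern to skip, e.g. '/run/*'
    find "$start" -mount -type d -uid "$uid" ! -path "$exclude" -print
}

# Dry run: review the list before changing any ownership.
#   list_dirs_owned_by / 991 '/run/*'
#
# Then, as root, apply the change with the same filters:
#   find / -mount -type d -uid 991 ! -path '/run/*' -exec chown -R munge:munge {} +
```

Because `-mount` keeps find on the filesystem of the start point, directories on other mounted filesystems are not visited at all, so they would need a separate run if they also require fixing.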
…Some reviews are still pending as I figure out the source of the problem and a solution, and I will address these in a later commit. Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
… to VM head nodes. Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…certain commands should behave and/or the output they should produce. Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
I've bulked up the documentation with support for bare-metal and cloud instance head nodes, and more 'expected output' code blocks. I've also added a few more subsections to better define the flow. I will keep expanding the explanations for certain choices/commands next week to make the guide more of a one-stop shop for Slurm configuration.
I have been notified of a security vulnerability in versions 0.5-0.5.17 of munge. I will update the documentation next week to pin installation of munge >= 0.5.18, so that anyone following the guide isn't installing a vulnerable version of munge.
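A quick way to confirm that an installed munge meets the pinned minimum is to compare version strings. This is a minimal sketch, assuming `sort -V` (version sort) is available and that `munge --version` prints a string beginning with `munge-<version>`, which is an assumption about the output format. `version_ge` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if version $1 is greater than or equal to $2.
# Relies on version sort (sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

minimum="0.5.18"   # first version outside the affected 0.5-0.5.17 range

# Only run the check where munge is installed.
if command -v munge >/dev/null 2>&1; then
    # Assumption: the output starts with "munge-<version>".
    installed=$(munge --version | sed -n 's/^munge-\([0-9.]*\).*/\1/p')
    if version_ge "$installed" "$minimum"; then
        echo "munge $installed is at or above $minimum"
    else
        echo "munge ${installed:-unknown} is below $minimum; upgrade before continuing" >&2
    fi
fi
```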
I really appreciate all the work you're doing here. Really shaping up nicely!
Thanks for making all these changes! Just a few more changes/adjustments and we should be able to merge soon!
| Run a test job as the user 'testuser': | ||
| ```bash | ||
| srun hostname |
Okay, I think this is the last issue for me. Whenever I try to run this, I'm getting this error about accounts and partitions. Did I miss a step?
[testuser@openchami-dev ~]$ srun hostname
srun: error: Unable to allocate resources: Invalid account or account/partition combination specified
Here's what I'm seeing for journalctl slurmctld.
Feb 13 18:53:48 openchami-dev.novalocal slurmctld[1412948]: slurmctld: _job_create: invalid account or partition for user 1002, account '(null)', and partition 'main'
Feb 13 18:53:48 openchami-dev.novalocal slurmctld[1412948]: slurmctld: _slurm_rpc_allocate_resources: Invalid account or account/partition combination specified
And here's the sinfo too.
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
main* up infinite 1 idle de01
Hmmm strange. I've only had that issue when I switch user and immediately try to run a job, but it looks like you are running srun hostname as the user you've created with Slurm privileges. I will look into this and get back to you.
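The `account '(null)'` in the slurmctld log above suggests that no account was resolved for the user when the job was submitted. One possible cause, which is an assumption and not confirmed in this thread, is that the user has no association in the accounting database yet. The sketch below pulls the account out of such a log line; `account_from_log_line` and the account name `demo` are hypothetical, while the `sacctmgr` subcommands shown in the comments are standard Slurm commands.

```shell
# Hypothetical helper: print the account named in a slurmctld
# "invalid account or partition" log line, e.g. (null).
account_from_log_line() {
    printf '%s\n' "$1" | sed -n "s/.*account '\([^']*\)'.*/\1/p"
}

line="slurmctld: _job_create: invalid account or partition for user 1002, account '(null)', and partition 'main'"

if [ "$(account_from_log_line "$line")" = "(null)" ]; then
    echo "No account was resolved for this job; check the user's associations."
fi

# On the head node, the associations can be inspected with:
#   sacctmgr show associations user=testuser format=cluster,account,user,partition
#
# If none exist, an account and user can be added (account name is illustrative):
#   sacctmgr -i add account demo description="Demo account" organization="demo"
#   sacctmgr -i add user testuser account=demo
```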
I tried this again after running the commands above and I got this instead now.
[testuser@openchami-dev ~]$ srun hostname
srun: job 1 queued and waiting for resources
srun: job 1 has been allocated resources
slurmstepd: error: couldn't chdir to `/home/testuser': No such file or directory: going to /tmp instead
slurmstepd: error: couldn't chdir to `/home/testuser': No such file or directory: going to /tmp instead
de01
ERROR: ld.so: object '/software/r9/xalt/3.0.1/$LIB/libxalt_init.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
I do see the hostname as expected, but I'm not sure if the other warnings/errors are really all that important here. If not, then I'd say we're pretty much done with this PR and it should be ready to merge.
Fantastic! That looks to be working now, which is great. The two slurmstepd errors are expected as LDAP is not configured. I haven't seen the LD_PRELOAD error before, but it could be due to your cluster having xalt installed or on a path somewhere? It isn't something that should be installed from my directions and I can't see it on my cluster. It doesn't seem to be causing issues with srun though, so I am not worried.
Hopefully the issues you had previously were a one-off. I was trying to replicate them, but issues with my test cluster have unfortunately prevented me from doing so.
Thanks Alex :)
No problem and sounds great! Thanks David :)
…ecurity vulnerabilities with versions 0.5-0.5.17 Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…ompute node. Additionally made some tweaks to the documentation to make the workflow more robust after repeating it on a fresh node. Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
I've set up a new test cluster to run through the documentation again and have made some minor tweaks which should address some of the issues found with slurmdbd and slurmctld. Additionally, the munge install has been pinned to version 0.5.18 to address security concerns. @davidallendj I was originally planning to add more comments/explanations to explain the workflow and certain choices, but if that isn't needed (or we want to do that at a later time) then I am happy with the state of the documentation to merge now if you are.
I think this is a good start, so we can add that later if you want, but that's up to you. I'm going to go ahead and approve.
davidallendj
left a comment
Tested and can verify that it works 🚀
Thanks David, that sounds great to me! I've got some things coming up at work, so I can pivot back to expand it later. I also want to flesh out some other documentation components once I have figured them out if the OpenCHAMI dev team would be interested (e.g. K8s, serving images with NFS, etc.).
…in a few places Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
All of the above! I've been intending to add these to the handbook at some point as well, but I think we could use some help as we've been quite busy. 🙂
Pull Request Template
Thank you for your contribution! Please ensure the following before submitting:
Checklist
… make test (or equivalent) locally and all tests pass
… git commit -s) with my real name and email
… <filename>.license sidecar
… LICENSES/ directory
Description
Contributing to the "Install Slurm" documentation under the OpenCHAMI guides.
Any feedback or suggestions about making the documentation broad enough for general purpose or to fit in well with the existing documentation are appreciated.
Type of Change
For more info, see Contributing Guidelines.