8 changes: 5 additions & 3 deletions openpbs/README.md
@@ -7,7 +7,7 @@ This repository contains the source code of different container images:
- `alphaunito/openpbs-server:23.06.06`, which runs the OpenPBS control plane
- `alphaunito/openpbs-execution:23.06.06`, which runs an OpenPBS compute node

Plus, it also contains a [docker-compose.yml](./docker-compose.yml) file that can deplyo an entire OpenPBS cluster with a single controller and a set of compute nodes. All these components are detailed below
Plus, it also contains a [docker-compose.yml](./docker-compose.yml) file that can deploy an entire OpenPBS cluster with a single controller and a set of compute nodes. All these components are detailed below

## OpenPBS Server

@@ -29,11 +29,11 @@ To correctly register the execution nodes, an `openpbs-server` container needs 2
- The `PBS_EXECUTION_NODES` variable must contain the number of compute nodes that OpenPBS should manage. If this variable is not set, the container displays an error message and terminates
- The `PBS_EXECUTION_HOST_NAME_PREFIX` variable should contain the prefix of the hostname used to identify compute nodes. If this variable is not set, the container displays an error message and terminates
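
The two requirements above amount to a fail-fast check at container start. The following is a minimal sketch of such a check, not the code shipped in the image: `require_env` is a hypothetical helper name, and the exported values are illustrative.

```bash
# Hypothetical helper (not taken from the image scripts): fail with a message
# on stderr when the named environment variable is unset or empty.
require_env() {
    local name="$1"
    if [ -z "$(printenv "${name}")" ]; then
        >&2 echo "Missing environment variable ${name}"
        return 1
    fi
}

# Illustrative values; a real deployment sets these on the container.
export PBS_EXECUTION_NODES=2
export PBS_EXECUTION_HOST_NAME_PREFIX=node

require_env PBS_EXECUTION_NODES
require_env PBS_EXECUTION_HOST_NAME_PREFIX
echo "Both variables are set"
```

Because the helper returns a non-zero status, a script running under `bash -e` would stop at the first missing variable.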

Note that all the compute nodes in the simulated HPC cluster should have a reachable hostname equal to `"${PBS_EXECUTION_NODES}${X}"`, where `X` is an integer in the range `[1, ${PBS_EXECUTION_NODES}]`
Note that all the compute nodes in the simulated HPC cluster should have a reachable hostname equal to `"${PBS_EXECUTION_HOST_NAME_PREFIX}-${X}"`, where `X` is an integer in the range `[1, ${PBS_EXECUTION_NODES}]`.
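
As an illustration of the naming rule, the sketch below prints the hostnames the server would expect for a given prefix and node count. The function name `expected_node_names` and the fallback values `node` and `3` are assumptions made for this example only.

```bash
# Hypothetical helper: print "<prefix>-<X>" for X in [1, count], one per line.
expected_node_names() {
    local prefix="$1" count="$2" x
    for x in $(seq 1 "${count}"); do
        echo "${prefix}-${x}"
    done
}

# Fall back to illustrative values when the variables are not set.
expected_node_names "${PBS_EXECUTION_HOST_NAME_PREFIX:-node}" "${PBS_EXECUTION_NODES:-3}"
```

Each printed name should resolve from inside the server container, for example via the `hostname` or service name assigned in the compose file.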

## OpenPBS Execution

The `pbs_mom` process is the compute node daemon for OpenPBS. It places jobs into execution as directed by the server, establishes resource usage limits, monitors the job's usage, and notifies the server when the job completes. The `openpbs-execution` Docker image can be built and published using the following commands
The `pbs_mom` process is the compute node daemon for OpenPBS. It places jobs into execution as directed by the server, establishes resource usage limits, monitors the job's usage, and notifies the server when the job completes. The `openpbs-execution` Docker image can be built and published using the following commands.

```bash
docker build -t alphaunito/openpbs-execution:23.06.06 execution
@@ -48,6 +48,8 @@ The `openpbs-server` and `openpbs-execution` images described above can be used

Note that the `openpbs-server` node should have an identifiable hostname, as compute nodes must register with the control plane to be addressable. In Docker Compose, an explicit hostname can be set for a given service using the `hostname` keyword.

**Beware of long hostnames.** The Docker FQDN (`<container>.<network>`) includes the compose project name as part of the domain. If the FQDN exceeds PBS's `HOST_NAME_MAX` (64), `pbs_mom` fails with `Failed to get fullhostname`, jobs abort with `Exit_status = -3`, and the server holds them after too many retries. Use a short project name (e.g. `docker compose -p openpbs up` or `export COMPOSE_PROJECT_NAME=openpbs`).

To allow for unprivileged workloads, an `hpcuser` has been configured inside the images. Commands can be executed by explicitly impersonating this user, through the `--user hpcuser` flag. For example

```bash
4 changes: 4 additions & 0 deletions openpbs/execution/Dockerfile
@@ -47,8 +47,12 @@ ENV PATH="${PATH}:/opt/pbs/bin"
COPY config/supervisord.conf \
/etc/supervisor/conf.d/

COPY config/pbs.conf \
/etc/pbs.conf

COPY scripts/healthcheck \
scripts/run-sshd \
scripts/run-munge \
scripts/run-openpbs \
/bin/

2 changes: 1 addition & 1 deletion openpbs/execution/config/supervisord.conf
@@ -15,7 +15,7 @@ supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface
serverurl=unix:///var/run/supervisor.sock

[program:munge]
command=gosu munge /usr/sbin/munged --foreground
command=run-munge
autostart=true
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0
11 changes: 11 additions & 0 deletions openpbs/execution/scripts/run-munge
@@ -0,0 +1,11 @@
#!/bin/bash -e

MUNGE_KEY=/etc/munge/munge.key
while [[ ! -s "${MUNGE_KEY}" || \
"$(stat -c "%U" "${MUNGE_KEY}" 2>/dev/null)" != "munge" || \
"$(stat -c "%a" "${MUNGE_KEY}" 2>/dev/null)" != "400" ]]; do
echo "Waiting for munge.key with correct ownership (munge) and permissions (400)..."
sleep 2
done

gosu munge /usr/sbin/munged --foreground
11 changes: 10 additions & 1 deletion openpbs/execution/scripts/run-openpbs
@@ -1,12 +1,21 @@
#!/bin/bash
#!/bin/bash -e

if [ -z "${PBS_SERVER_HOST_NAME}" ]; then
>&2 echo "Missing environment variable PBS_SERVER_HOST_NAME containing the OpenPBS server hostname"
exit 1
fi

if [ ! -f "/etc/pbs.conf" ]; then
>&2 echo "Missing /etc/pbs.conf file"
exit 1
fi
sed -i "s/__PBS_SERVER_HOST_NAME__/${PBS_SERVER_HOST_NAME}/g" /etc/pbs.conf

source /etc/pbs.conf

until munge -n >/dev/null 2>&1; do
echo "Waiting for munged to become ready..."
sleep 1
done

${PBS_EXEC}/sbin/pbs_mom -N
2 changes: 2 additions & 0 deletions openpbs/server/Dockerfile
@@ -42,6 +42,8 @@ RUN curl -fsSL -O https://vcdn.altair.com/rl/OpenPBS/openpbs_23.06.06.rockylinux
&& ln -s /usr/lib64/libmunge.so.2 /usr/lib64/libmunge.so \
&& adduser hpcuser

RUN ln -s /usr/bin/pg_resetwal /usr/bin/pg_resetxlog

ENV PATH="${PATH}:/opt/pbs/bin"

COPY config/pbs.conf \
1 change: 1 addition & 0 deletions openpbs/server/config/supervisord.conf
@@ -38,6 +38,7 @@ redirect_stderr=true
[program:pbs_server]
command=run-openpbs
autostart=true
autorestart=false
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0
redirect_stderr=true
9 changes: 8 additions & 1 deletion openpbs/server/scripts/run-munge
@@ -1,5 +1,12 @@
#!/bin/bash
#!/bin/bash -e

if [ -f "/etc/munge/munge.key" ]; then
>&2 echo "File /etc/munge/munge.key already exists"
exit 1
fi

create-munge-key
chown munge:munge /etc/munge/munge.key
chmod 400 /etc/munge/munge.key

gosu munge /usr/sbin/munged --foreground
28 changes: 25 additions & 3 deletions openpbs/server/scripts/run-openpbs
@@ -1,4 +1,4 @@
#!/bin/bash
#!/bin/bash -e

set -m

@@ -14,10 +14,32 @@ if [ -z "${PBS_EXECUTION_NODES}" ]; then
exit 1
fi

echo "PBS_SERVER_HOST_NAME=${PBS_SERVER_HOST_NAME}" >> /var/spool/pbs/pbs_environment

if [ ! -f "/etc/pbs.conf" ]; then
>&2 echo "Missing /etc/pbs.conf file"
exit 1
fi
sed -i "s/__PBS_SERVER_HOST_NAME__/${PBS_SERVER_HOST_NAME}/g" /etc/pbs.conf

source /etc/pbs.conf

MAX_HOSTNAME_LEN=$(getconf HOST_NAME_MAX 2>/dev/null || echo 64)
for NODE in $(seq 1 ${PBS_EXECUTION_NODES}); do
NODE_NAME="${PBS_EXECUTION_HOST_NAME_PREFIX}-${NODE}"
NODE_IP=$(getent hosts "${NODE_NAME}" | awk '{print $1}' 2>/dev/null || true)
if [ -n "$NODE_IP" ]; then
FQDN=$(python3 -c "import socket; print(socket.gethostbyaddr('${NODE_IP}')[0])" 2>/dev/null || echo "")
if [ -n "$FQDN" ] && [ ${#FQDN} -ge ${MAX_HOSTNAME_LEN} ]; then
>&2 echo "ERROR: FQDN for node ${NODE_NAME} is ${#FQDN} chars (>= ${MAX_HOSTNAME_LEN})."
>&2 echo "PBS cannot resolve hostnames this long. Use a shorter project name"
>&2 echo "or set explicit short hostnames in docker-compose.yml."
exit 1
fi
fi
done


if [ ! -f "${PBS_HOME}/pbs_version" ]; then
echo "PBS Home directory ${PBS_HOME} needs updating."
echo "Running ${PBS_EXEC}/libexec/pbs_habitat to update it."
@@ -33,9 +55,9 @@ ${PBS_EXEC}/bin/qmgr -c "set server job_history_enable = True"
${PBS_EXEC}/bin/qmgr -c "set server job_history_duration = 00:05:00"

echo "Configure OpenPBS scheduler"
${PBS_EXEC}/bin/qmgr -c "set sched sched_host = localhost"
${PBS_EXEC}/bin/qmgr -c "set sched sched_host = ${PBS_SERVER_HOST_NAME}"

for NODE in $(seq 0 $((PBS_EXECUTION_NODES-1))); do
for NODE in $(seq 1 ${PBS_EXECUTION_NODES}); do
echo "Create node ${PBS_EXECUTION_HOST_NAME_PREFIX}-${NODE}"
${PBS_EXEC}/bin/qmgr -c "create node ${PBS_EXECUTION_HOST_NAME_PREFIX}-${NODE} queue=workq"
done
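
The hostname length guard added to `openpbs/server/scripts/run-openpbs` in this change can be tried in isolation. The sketch below reproduces only the comparison, assuming the same `-ge` test against a limit of 64; `fqdn_fits` is a hypothetical helper, and both FQDNs are made-up examples of the `<container>.<network>` form described in the README.

```bash
# Hypothetical helper: succeed only when the name is strictly shorter than
# the limit (default 64), i.e. the inverse of the script's "-ge" rejection.
fqdn_fits() {
    local fqdn="$1" max="${2:-64}"
    [ "${#fqdn}" -lt "${max}" ]
}

# Made-up names: a short compose project name versus a very long one.
SHORT_FQDN="node-1.openpbs_default"
LONG_FQDN="node-1.a-very-long-compose-project-name-that-keeps-going-and-going_default"

for FQDN in "${SHORT_FQDN}" "${LONG_FQDN}"; do
    if fqdn_fits "${FQDN}"; then
        echo "${FQDN}: ${#FQDN} chars, accepted"
    else
        echo "${FQDN}: ${#FQDN} chars, rejected"
    fi
done
```

On a real cluster the limit would come from `getconf HOST_NAME_MAX`, as in the script, rather than from a hard-coded default.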