
Tutorial 09


๐Ÿ–ฅ๏ธ Working on the TAU Supercomputer

🧵 Running long scripts on the POWER cluster (with TMUX)

Normally you need to keep your connection to the cluster open while a script runs, but long calculations often take hours or even days to complete. 😩 A dropped VPN or Wi-Fi connection can kill your job, forcing you to start over.

Good news! 🎉 You can keep your scripts running in the background using tmux.

tmux stands for "terminal multiplexer": it lets you open multiple terminal windows inside a single SSH session, and a detached session keeps running after you disconnect.

🟢 Start a new session:

tmux

โœ๏ธ Start a named session:

tmux new -s myname

➕ Create a new window:

Press Ctrl+B, then c

๐Ÿ” Switch between windows:

  • Ctrl+B then n โ†’ next window
  • Ctrl+B then p โ†’ previous window
  • Ctrl+B then 0โ€“9 โ†’ numbered windows

📴 Detach from the session:

Press Ctrl+B, then d

🔙 Reattach to a session:

tmux a -t myname

📋 List running sessions:

tmux ls
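
Putting the commands together, a typical long-run workflow looks like this (long_job.py is just a placeholder name for your script):

tmux new -s vasp_job         # start a named session
python long_job.py           # launch the long-running script inside it
# press Ctrl+B, then d       # detach; the script keeps running on the cluster
tmux ls                      # later, from a new SSH login: list sessions
tmux a -t vasp_job           # reattach and check progress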

🔗 More info: tmuxcheatsheet.com


📅 The Queue System on the POWER Cluster (SLURM)

For long or resource-intensive jobs, use SLURM 🧠 to manage resources. It queues jobs, assigns compute nodes, and prevents collisions between users.

🚀 Submitting a Python job

#!/bin/bash
#
# filename: slurm_script
#SBATCH -p leeburton-pool
#SBATCH --account=power-leeburton-users_v2
#SBATCH --job-name=vasprun
#SBATCH --time=24:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=40GB

ulimit -s 81920   # raise the shell's stack size limit (value in KB)

# job
python filename.py > output_file.out

📌 Replace filename.py with your script. Save as job.script, then run:

sbatch job.script

๐Ÿ“ The output appears in output_file.out.

โš™๏ธ SLURM Directives Explained

  • --partition 🧩: Which queue partition to use
  • --account 👥: The shared user account
  • --job-name 🏷️: A readable job name
  • --time ⏲️: Max run time (up to 10 days)
  • --nodes 🖥️: Number of compute nodes
  • --ntasks 🔢: Number of tasks (e.g., cores)
  • --mem 🧠: Memory per node

✨ Common Extras

  • --output: Save stdout
  • --error: Save stderr
  • --cpus-per-task: For multi-threaded jobs
  • --mail-type: Get notified (BEGIN, END, FAIL, ALL)
  • --mail-user: Email for notifications
  • --dependency: Chain jobs (e.g., afterok:12345)
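
As a sketch of how these extras fit together, the header below adds logging and e-mail notification to the earlier Python job (the e-mail address and file names are placeholders, not required values):

#!/bin/bash
#SBATCH -p leeburton-pool
#SBATCH --account=power-leeburton-users_v2
#SBATCH --job-name=myjob
#SBATCH --time=24:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=40GB
#SBATCH --output=myjob_%j.out     # stdout; %j expands to the job ID
#SBATCH --error=myjob_%j.err      # stderr
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=your.name@example.com   # placeholder address

python filename.py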

โš›๏ธ Running VASP on POWER with SLURM

๐Ÿ–ฅ๏ธ CPU Version

#!/bin/bash
#
# filename: slurm_script
#SBATCH -p leeburton-pool
#SBATCH --account=power-leeburton-users_v2
#SBATCH --job-name=vasprun
#SBATCH --time=72:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=24
#SBATCH --mem=120GB

ulimit -s 81920

# job
module load intel/rocky8-oneAPI-2023
module load vasp/rocky8-intel-6.4.1

mpirun -n $SLURM_NTASKS vasp_std > output

🧱 For large systems (~1000 atoms):

#!/bin/bash
#
#SBATCH -p leeburton-pool
#SBATCH --account=power-leeburton-users_v2
#SBATCH -J bigslab
#SBATCH --time=99:00:00
#SBATCH --nodes=2
#SBATCH --ntasks=96
#SBATCH --mem=920GB

ulimit -s 81920

# job
module load intel/rocky8-oneAPI-2023
module load vasp/rocky8-intel-6.4.1

mpirun -n $SLURM_NTASKS vasp_std > output
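
If you prefer to state the rank distribution explicitly rather than letting SLURM split --ntasks across the two nodes, the same request can be written with --ntasks-per-node (an equivalent sketch, not a required change):

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=48    # 2 x 48 = 96 MPI ranks, matching --ntasks=96 above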

🎮 GPU Version (and MIGs)

POWER has 4 GPUs. You can:

  • Use 1 full GPU (max 4 parallel jobs)
  • Or use MIGs (Multi-Instance GPUs) to run more jobs 🤯

โš ๏ธ Only 2 GPUs are configured for MIGs (each split into 7 MIGs, each with 10GB).

More info: NVIDIA MIG Guide
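
If you are unsure what the GPU partition currently offers, a generic SLURM query like the one below should list its GRES entries (the partition name is taken from the script that follows; the output format depends on the local SLURM configuration):

sinfo -p gpu-leeburton-pool -o "%P %G"    # %G prints the generic resources (GPUs / MIG slices) per partition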

โš™๏ธ SLURM script for GPU VASP

#!/bin/bash
#
# filename: slurm_script
#SBATCH -p gpu-leeburton-pool
#SBATCH --account=power-leeburton-users_v2
#SBATCH --job-name=vasprun
#SBATCH --time=04:00:00
#SBATCH --ntasks=1
##SBATCH --gres=gpu:1                  # Full GPU (2 available)
##SBATCH --gres=gpu:1g.10gb:1          # One MIG (14 available)
#SBATCH --cpus-per-task=2
#SBATCH --mem=16GB

ulimit -s 81920

# Job setup
module purge
module load vasp/vasp.6.5.1-hpc_sdk
echo $CUDA_VISIBLE_DEVICES

mpirun -n $SLURM_NTASKS vasp_std > output

🔀 MIG or Full GPU?

Pick one and uncomment the right line:

✅ To use a MIG:

#SBATCH --gres=gpu:1g.10gb:1

✅ To use a full GPU:

#SBATCH --gres=gpu:1

To activate one of them, delete one of the two leading # characters so the line starts with a single #SBATCH; leave the other line prefixed with ## so SLURM ignores it.
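
After the job starts, the echo $CUDA_VISIBLE_DEVICES line in the script prints the device that was assigned. You can also ask SLURM what it granted (12345 is a placeholder job ID; the exact field name varies between SLURM versions):

scontrol show job 12345 | grep -iE 'gres|tres'    # shows the GPU or MIG resource attached to the job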


✅ Next Steps

🎉 Congratulations! You now know how to work with the POWER cluster.

Continue to Tutorial 10 🚀
