Tutorial 16

nadavmoav edited this page Jun 24, 2025 · 9 revisions

High-Throughput Computing on POWER

If you want to run more than a hundred or so calculations, you enter the realm of high-throughput computing. Below that number the queue can handle your submissions directly, but once you reach 200, 300, or even thousands of calculations you have to be more considerate of the system and of other users.

High-throughput job submission

Here is an example of a submitter script. Assuming you have prepared all the files needed for each calculation in separate folders beneath the current directory (see tutorial 14), you can save the following code as a file and run it in a tmux session (see tutorial 9).

#!/bin/bash

for f in */; do
    [ -d "$f" ] || continue  # Skip if not a directory
    cd "$f" || continue  # Change to directory, skip if it fails

    if grep -sq "Voluntary" OUTCAR; then
        echo "$f completed successfully"
    else
        echo "$f not completed. Checking queue"

        # Count the squeue lines that match "lee"; replace with your own user or account name
        if (( $(squeue | grep -c "lee") < 20 )); then
            echo "There are fewer than 20 jobs in the queue. Submitting..."
            sbatch job.script
            sleep 60
        else
            echo "Queue is full, waiting to submit"
            sleep 180
            echo "Slept, retrying submission..."
            sbatch job.script
        fi
    fi

    cd - > /dev/null  # Suppress output of 'cd -'
done

The code checks whether the VASP calculation in each folder has finished and, if so, reports this and moves on. If not, it counts the jobs in the queue. If there are fewer than 20 jobs, it submits the calculation and sleeps for 60 seconds. If there are 20 or more, it sleeps for 180 seconds and then submits the current job anyway, without rechecking the queue, before moving on to the next folder. This is a fairly basic routine in which submission speeds up or slows down depending on how busy the queue is. Depending on how long your average calculation takes, you may want to increase or decrease the sleep durations, but the principle is the same.
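A stricter variant of the same idea is to keep waiting until the queue actually has room, instead of submitting after a single fixed sleep. The following is a minimal sketch under stated assumptions: the function names `count_my_jobs` and `wait_for_slot` are hypothetical, and it assumes your jobs can be counted with `squeue -h -u "$USER"` (no header, current user only).

```shell
#!/bin/bash
# Sketch: block until the number of queued jobs drops below a limit.
# count_my_jobs and wait_for_slot are illustrative names, not SLURM commands.

# Count the current user's jobs: -h drops the header line, -u filters by user.
count_my_jobs() {
    squeue -h -u "$USER" | wc -l
}

# wait_for_slot MAX_JOBS POLL_SECONDS
# Returns as soon as fewer than MAX_JOBS jobs are queued,
# rechecking the queue every POLL_SECONDS seconds.
wait_for_slot() {
    local max_jobs=$1 poll=$2
    while (( $(count_my_jobs) >= max_jobs )); do
        sleep "$poll"
    done
}
```

Inside the submission loop, calling `wait_for_slot 20 180` immediately before `sbatch job.script` would hold each submission until fewer than 20 jobs are queued.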

High-throughput job analysis

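Here is an example of a script that checks whether calculations are done and, if so, moves them into a done folder. A calculation that stopped only because it reached the NSW step limit is restarted from its CONTCAR if the moving average of its d E values is decreasing.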

#!/bin/bash

mkdir -p done  # Ensure "done" directory exists

for f in */; do
    [ -d "$f" ] || continue  # Ensure it's a directory
    [[ "$f" == "done/" ]] && continue  # Do not process the output directory itself

    OUTCAR="${f}OUTCAR"
    OSZICAR="${f}OSZICAR"
    INCAR="${f}INCAR"
    JOB_SCRIPT="job.script"  # File name only: it is looked up after 'cd' into the folder. Adjust if the job script has a different name

    if [[ ! -f "$OUTCAR" || ! -f "$OSZICAR" || ! -f "$INCAR" ]]; then
        echo "Skipping $f: Missing OUTCAR, OSZICAR, or INCAR"
        continue
    fi

    # Extract NSW value from INCAR (tolerates whitespace and a trailing comment)
    NSW=$(awk -F '=' '/^[[:space:]]*NSW[[:space:]]*=/ {print $2+0; exit}' "$INCAR")

    if [[ -z "$NSW" ]]; then
        echo "Skipping $f: Could not determine NSW from INCAR"
        continue
    fi

    if grep -q "Voluntary" "$OUTCAR"; then
        # Match the ionic step number exactly, so NSW=10 does not also match step 110
        if ! grep -qE "^[[:space:]]*${NSW} F=" "$OSZICAR"; then
            echo "$f completed successfully"
            mv "$f" done/
        else
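            # Extract the magnitude of "d E" for each ionic step in OSZICAR.
            # VASP writes values such as -.123456E-02 (no leading zero). awk converts
            # them to plain decimals because bc cannot read E notation, and takes the
            # absolute value so that a decreasing average means the steps are shrinking.
            dE_values=($(grep -oP '^\s*\d+\s+F=.*?d E =\s*\K[-+]?\d*\.\d+E[-+]?\d+' "$OSZICAR" | awk '{v = $1 + 0; if (v < 0) v = -v; printf "%.10f\n", v}'))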

            if [[ ${#dE_values[@]} -eq 0 ]]; then
                echo "Skipping $f: No valid d E values found in OSZICAR"
                continue
            fi

            # Define moving average window size
            window_size=3  # Adjust as needed

            # Function to compute moving average over a window
            compute_moving_avg() {
                local -n arr=$1
                local win_size=$2
                local num_values=${#arr[@]}
                local moving_avg=()

                for ((i = 0; i <= num_values - win_size; i++)); do
                    sum=0
                    for ((j = 0; j < win_size; j++)); do
                        sum=$(echo "$sum + ${arr[i+j]}" | bc -l)
                    done
                    avg=$(echo "$sum / $win_size" | bc -l)
                    moving_avg+=("$avg")
                done

                echo "${moving_avg[@]}"
            }

            # Compute moving averages
            moving_avg_values=($(compute_moving_avg dE_values $window_size))

            # Check if moving average is decreasing
            converging=true
            for ((i = 1; i < ${#moving_avg_values[@]}; i++)); do
                if (( $(echo "${moving_avg_values[i]} > ${moving_avg_values[i-1]}" | bc) )); then
                    converging=false
                    break
                fi
            done

            if $converging; then
                echo "$f is converging (based on moving average)"
            else
                echo "$f is NOT converging"
            fi

            if $converging; then
                echo "$f reached NSW but is converging - restarting job"

                # Move into the directory, restart the job, then return to original location
                if cd "$f"; then
                    if [[ -f "CONTCAR" ]]; then
                        cp CONTCAR POSCAR
                        echo "Copied CONTCAR to POSCAR"
                    else
                        echo "Warning: CONTCAR not found in $f"
                    fi

                    if command -v sbatch &>/dev/null; then
                        if [[ -f "$JOB_SCRIPT" ]]; then
                            sbatch "$JOB_SCRIPT"
                            echo "Resubmitted job in $f"
                        else
                            echo "Warning: Job script not found in $f, skipping sbatch"
                        fi
                    else
                        echo "Error: SLURM is not available. Job not submitted."
                    fi

                    cd - > /dev/null  # Return to the original directory
                else
                    echo "Failed to enter $f"
                    continue
                fi
            else
                echo "$f reached NSW and is NOT converging"
            fi
        fi
    fi
done
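The convergence test can also be written with awk doing both the parsing and the arithmetic, since awk reads E-notation floats directly. This is a minimal sketch, not a replacement for the script above: the function name `de_trend` and its three output strings are hypothetical, and it assumes the ionic-step lines of OSZICAR contain ` F= ` and `d E =` as in the script.

```shell
#!/bin/bash
# Sketch: report whether the moving average of |d E| is shrinking.
# de_trend is an illustrative name; the window defaults to 3 ionic steps.

# de_trend OSZICAR_FILE [WINDOW]
# Prints "decreasing" if the moving average of |d E| never rises,
# "not-decreasing" if it rises at least once, and "too-few-steps"
# if there are not enough ionic steps to form two windows.
de_trend() {
    local file=$1 window=${2:-3}
    awk -v w="$window" '
        / F= / && /d E =/ {
            v = $0
            sub(/.*d E =[ \t]*/, "", v)   # drop everything up to the value
            sub(/[ \t].*/, "", v)         # drop anything after it (e.g. mag=)
            v += 0                        # numeric conversion handles -.12E-02
            if (v < 0) v = -v             # compare magnitudes, not signed values
            de[++n] = v
        }
        END {
            if (n < w + 1) { print "too-few-steps"; exit }
            for (i = 1; i + w - 1 <= n; i++) {
                s = 0
                for (j = 0; j < w; j++) s += de[i + j]
                avg = s / w
                if (i > 1 && avg > prev) { print "not-decreasing"; exit }
                prev = avg
            }
            print "decreasing"
        }
    ' "$file"
}
```

For example, `de_trend "${f}OSZICAR" 3` could be called inside the loop and its output compared against `decreasing` to decide whether to restart a job.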