wait_for_copycomplete.sh 10-minute walltime is too tight when slurmctld is slow #84

@rsandstromUW

Summary

The wait_for_copycomplete.sh template (rendered into the run directory and self-resubmitted hourly via sbatch --begin=now+1hour) runs with #SBATCH --time=00:10:00. That window leaves no slack when slurmctld is under pressure: if a single iteration's sbatch resubmit call hangs for long enough, the job exceeds its walltime and is killed before it can enqueue the next link in the chain.

Reproducer (real incident)

Flowcell 23K3FVLT3 / LIMS flowcell_run/3552, run dir /net/seq/data2/sequencers/20260424_LH00361_0116_B23K3FVLT3.

The hourly chain ran cleanly for 21 iterations from 2026-04-24 14:34 through 2026-04-25 09:42. Iteration 13948496 started on hpcy-0708 at 2026-04-25T10:42:56 and was killed by Slurm at 2026-04-25T10:53:25:

slurmstepd: error: *** JOB 13948496 ON hpcy-0708 CANCELLED AT 2026-04-25T10:53:25 DUE TO TIME LIMIT ***

sacct -j 13948496 -X confirms TIMEOUT, ExitCode 0:0.
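As a quick sanity check on the timestamps above, the iteration's runtime can be computed directly. This is only a sketch: it assumes GNU date (for the -d option), and elapsed_seconds is an illustrative helper name, not something from the template.

```shell
#!/usr/bin/env bash
# Illustrative helper (assumed name): seconds between two ISO timestamps.
# Requires GNU date; both timestamps are interpreted in the same zone (UTC here),
# so the difference is independent of the cluster's actual time zone.
elapsed_seconds() {
    local start end
    start=$(date -u -d "$1" +%s)
    end=$(date -u -d "$2" +%s)
    echo $(( end - start ))
}

# Start and cancellation times of job 13948496, as reported above.
elapsed_seconds 2026-04-25T10:42:56 2026-04-25T10:53:25   # prints 629
```

629 seconds is 10 minutes 29 seconds, i.e. the job was cancelled just past its 00:10:00 limit, which matches the TIMEOUT state.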

The log file (wait_bcl2fastq_13948496.log, 104 bytes) contains only that cancellation banner; the script's first echo "CopyComplete.txt not found. Requeuing in 1 hour..." line never appears. Earlier iterations write that exact string and a Submitted batch job … line in 80 bytes, so output is normally flushed promptly. The most plausible explanation is that the iteration reached the (cd "$RUNDIR" && sbatch --begin=now+1hour "$0") resubmit and the sbatch call hung against a busy slurmctld for >10 minutes.

CopyComplete.txt arrived at 2026-04-25 12:17, 84 minutes after the chain broke, and had to be picked up manually.

Source

  • Rendered script: wait_for_copycomplete.sh in the run dir.
  • Template: <TODO: path in stampipes repo, e.g. the template introduced/edited in Feat/bcl convert #83>.

Proposed fix

  • Bump --time to something like 00:30:00. The script's actual work is trivial; the walltime is only there as a backstop, and 30 minutes is still cheap on the cluster while giving large headroom against transient slurmctld latency.
  • Optionally add srun --time=… or a foreground timeout around the sbatch resubmit so a hung control-plane call fails fast with a clear error instead of swallowing the whole iteration.
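One possible shape for the second option, as a minimal sketch rather than a patch against the real template: wrap the resubmit in GNU coreutils timeout so a hung call fails with an explicit message. The helper name run_with_limit and the 120-second limit are assumptions for illustration; only the sbatch invocation itself is taken from this issue.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not taken from the stampipes template.
# Assumes GNU coreutils `timeout` is available on the compute nodes.

# Usage: run_with_limit <seconds> <command> [args...]
# Runs the command, killing it if it has not returned within <seconds>.
run_with_limit() {
    local limit="$1"
    shift
    local rc=0
    timeout "$limit" "$@" || rc=$?
    if [ "$rc" -eq 124 ]; then
        # GNU timeout exits with status 124 when it had to kill the command.
        echo "ERROR: '$*' did not return within ${limit}s" >&2
    fi
    return "$rc"
}

# Intended use in the resubmit step (120 s is an arbitrary example value):
#   (cd "$RUNDIR" && run_with_limit 120 sbatch --begin=now+1hour "$0") || exit 1
```

One caveat if this is combined with a retry loop: a timed-out sbatch may still have been accepted by slurmctld, so a blind retry could enqueue a duplicate link. Checking squeue for an existing pending job before retrying would be one way to guard against that.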

Related

See companion issue: a single TIMEOUT in this chain silently breaks the whole launcher with no watchdog/resurrect.
