Summary
The wait_for_copycomplete.sh template (rendered into the run directory and self-resubmitted hourly via sbatch --begin=now+1hour) runs with #SBATCH --time=00:10:00. That window leaves no slack when slurmctld is under pressure: a single iteration's sbatch resubmit call can hang long enough to push the job past its walltime, killing the iteration before it can enqueue the next link in the chain.
Reproducer (real incident)
Flowcell 23K3FVLT3 / LIMS flowcell_run/3552, run dir /net/seq/data2/sequencers/20260424_LH00361_0116_B23K3FVLT3.
The hourly chain ran cleanly for 21 iterations from 2026-04-24 14:34 through 2026-04-25 09:42. Iteration 13948496 started on hpcy-0708 at 2026-04-25T10:42:56 and was killed by Slurm at 2026-04-25T10:53:25, 10 minutes 29 seconds later:
slurmstepd: error: *** JOB 13948496 ON hpcy-0708 CANCELLED AT 2026-04-25T10:53:25 DUE TO TIME LIMIT ***
sacct -j 13948496 -X confirms TIMEOUT, ExitCode 0:0.
The log file (wait_bcl2fastq_13948496.log, 104 bytes) contains only that cancellation banner; the script's first echo "CopyComplete.txt not found. Requeuing in 1 hour..." line never appears. Earlier iterations write that exact string and a Submitted batch job … line in 80 bytes, so output is normally flushed promptly. The most plausible explanation is that the iteration reached the (cd "$RUNDIR" && sbatch --begin=now+1hour "$0") resubmit and the sbatch call hung against a busy slurmctld for more than 10 minutes.
CopyComplete.txt arrived 2026-04-25 12:17 — 84 minutes after the chain broke — and had to be picked up manually.
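To check how a given link in the chain ended, the accounting record can be queried with sacct. The following is a minimal sketch, assuming job accounting is enabled on the cluster; the helper names job_end_state and job_timed_out are hypothetical and not part of the template.

```shell
#!/bin/bash
# Sketch: report how a Slurm job ended, using accounting data.
# job_end_state and job_timed_out are hypothetical helper names.

job_end_state() {
  local jobid="$1"
  # -X: allocation only (no steps); -n: no header; -P: pipe-delimited output
  sacct -j "$jobid" -X -n -P \
    --format=JobID,State,ExitCode,Start,End,Elapsed,Timelimit
}

# Return 0 if the job's State field is TIMEOUT, non-zero otherwise.
job_timed_out() {
  local state
  state="$(job_end_state "$1" | head -n 1 | cut -d'|' -f2)"
  [ "$state" = "TIMEOUT" ]
}
```

For the incident above, this would be called as job_end_state 13948496, or as job_timed_out 13948496 && echo "chain link hit its walltime" when only the yes/no answer matters.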
Source
Rendered script: wait_for_copycomplete.sh in the run dir.
Template: <TODO: path in stampipes repo, e.g. the template introduced/edited in Feat/bcl convert #83>.
Proposed fix
Bump --time to something like 00:30:00. The script's actual work is trivial; the walltime is only there as a backstop, and 30 minutes is still cheap on the cluster while giving large headroom against transient slurmctld latency.
Optionally add srun --time=… or a foreground timeout around the sbatch resubmit so a hung control-plane call fails fast with a clear error instead of swallowing the whole iteration.
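Both suggestions can be sketched together as follows. This is an illustration under stated assumptions, not the actual template: the resubmit_self helper and the RESUBMIT_TIMEOUT variable are hypothetical names, the 300-second default is arbitrary, and the sketch assumes GNU coreutils timeout is available on the compute nodes.

```shell
#!/bin/bash
#SBATCH --time=00:30:00   # raised from 00:10:00; the walltime is only a backstop

# Hypothetical helper (not in the template): resubmit the given script from
# the run directory, but give up if sbatch itself does not return in time.
# RESUBMIT_TIMEOUT is an illustrative variable; 300 s is an arbitrary default.
resubmit_self() {
  local rundir="$1" script="$2"
  local limit="${RESUBMIT_TIMEOUT:-300}"
  local rc=0
  (cd "$rundir" && timeout "$limit" sbatch --begin=now+1hour "$script") || rc=$?
  if [ "$rc" -eq 124 ]; then
    # GNU timeout exits with 124 when it had to kill the command
    echo "ERROR: sbatch resubmit did not return within ${limit}s" >&2
  elif [ "$rc" -ne 0 ]; then
    echo "ERROR: sbatch resubmit failed with exit code ${rc}" >&2
  fi
  return "$rc"
}
```

In the script, resubmit_self "$RUNDIR" "$0" would take the place of the bare (cd "$RUNDIR" && sbatch --begin=now+1hour "$0") subshell. Failing fast leaves a clear error in the log but does not by itself re-establish the chain, so the caller would still need to retry or alert on a non-zero return.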
Related
See companion issue: a single TIMEOUT in this chain silently breaks the whole launcher with no watchdog/resurrect.