wait_for_copycomplete.sh 10-minute walltime is too tight when slurmctld is slow #84

## Summary
  The `wait_for_copycomplete.sh` template is rendered into the run directory and resubmits itself hourly via `sbatch --begin=now+1hour`. It runs with `#SBATCH --time=00:10:00`, which leaves no slack when slurmctld is under pressure: if a single iteration's `sbatch` resubmit call hangs for long enough, the job exceeds its walltime and is killed before it can enqueue the next link in the chain.
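
For orientation, the chain described above has roughly the following shape. This is a minimal sketch, not the actual template: the function name `check_or_requeue`, its arguments, and the "found" message are assumptions made for illustration. Only the `--time=00:10:00` directive, the "not found" message, and the `sbatch --begin=now+1hour` resubmit come from this report.

```shell
#!/bin/bash
#SBATCH --time=00:10:00   # the walltime this issue is about

# Hypothetical helper; the name and arguments are illustrative only.
# Args: $1 = run directory, $2 = path of the script to resubmit.
check_or_requeue() {
    local rundir=$1
    local script=$2
    if [ -e "$rundir/CopyComplete.txt" ]; then
        # Placeholder: what happens next is not described in this report.
        echo "CopyComplete.txt found."
        return 0
    fi
    echo "CopyComplete.txt not found. Requeuing in 1 hour..."
    # The whole chain depends on this one call returning within the walltime.
    (cd "$rundir" && sbatch --begin=now+1hour "$script")
}
```

In a rendered script this would be called as `check_or_requeue "$RUNDIR" "$0"`. The point of the sketch is that there is exactly one link to the next iteration, so anything that stops `sbatch` from returning ends the chain.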

  ## Reproducer (real incident)
  Flowcell `23K3FVLT3` / LIMS `flowcell_run/3552`, run dir `/net/seq/data2/sequencers/20260424_LH00361_0116_B23K3FVLT3`.

  The hourly chain ran cleanly for 21 iterations from `2026-04-24 14:34` through `2026-04-25 09:42`. Iteration `13948496` started on `hpcy-0708` at `2026-04-25T10:42:56` and was killed by Slurm at `2026-04-25T10:53:25`:

  `slurmstepd: error: *** JOB 13948496 ON hpcy-0708 CANCELLED AT 2026-04-25T10:53:25 DUE TO TIME LIMIT ***`

  `sacct -j 13948496 -X` confirms `TIMEOUT, ExitCode 0:0`.

  The log file (`wait_bcl2fastq_13948496.log`, 104 bytes) contains only that cancellation banner; the script's first `echo "CopyComplete.txt not found. Requeuing in 1 hour..."` line never appears. Earlier iterations write that exact string plus a `Submitted batch job …` line (80 bytes in total), so output is normally flushed promptly. The most plausible explanation is that the iteration reached the `(cd "$RUNDIR" && sbatch --begin=now+1hour "$0")` resubmit and the `sbatch` call hung against a busy slurmctld for more than 10 minutes.

  `CopyComplete.txt` arrived at `2026-04-25 12:17`, 84 minutes after the chain broke, and had to be picked up manually.
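
The timeline arithmetic can be checked with a few lines of shell. This sketch assumes GNU `date` (for `-d`) and assumes the `12:17` arrival was at `:00` seconds, since the report gives no seconds for it; the helper name `to_epoch` is illustrative.

```shell
# Convert a timestamp to epoch seconds; -u avoids local-timezone and DST effects.
to_epoch() { date -u -d "$1" +%s; }

start=$(to_epoch "2026-04-25 10:42:56")    # iteration 13948496 started
killed=$(to_epoch "2026-04-25 10:53:25")   # cancelled due to time limit
arrived=$(to_epoch "2026-04-25 12:17:00")  # CopyComplete.txt arrived (seconds assumed)

runtime=$(( killed - start ))
gap_min=$(( (arrived - killed + 59) / 60 ))   # round up to whole minutes

echo "iteration ran for ${runtime} s"                  # prints: iteration ran for 629 s
echo "gap until CopyComplete.txt: ${gap_min} min"      # prints: gap until CopyComplete.txt: 84 min
```

The 629 s runtime is 10 min 29 s, which is consistent with the job being killed at its 10-minute limit rather than exiting on its own.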

  ## Source
  - Rendered script: `wait_for_copycomplete.sh` in the run dir.
  - Template: <TODO: path in stampipes repo, e.g. the template introduced/edited in #83>.

  ## Proposed fix
  - Bump `--time` to something like `00:30:00`. The script's actual work is trivial; the walltime is only there as a backstop, and 30 minutes is still cheap on the cluster while giving large headroom against transient slurmctld latency.
  - Optionally add `srun --time=…` or a foreground `timeout` around the `sbatch` resubmit so a hung control-plane call fails fast with a clear error instead of swallowing the whole iteration.
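
A minimal sketch of the two bullets above, assuming GNU coreutils `timeout` is available on the compute nodes. The wrapper name `run_with_deadline` and the 120-second limit in the usage line are hypothetical choices, not values from the template.

```shell
#!/bin/bash
#SBATCH --time=00:30:00   # proposed walltime backstop

# Hypothetical wrapper: run a command under GNU coreutils `timeout`.
# Args: $1 = limit in seconds, remaining args = command to run.
run_with_deadline() {
    local limit=$1
    shift
    timeout "$limit" "$@"
    local rc=$?
    if [ "$rc" -eq 124 ]; then
        # timeout(1) exits with status 124 when the command hit the limit.
        echo "ERROR: '$*' did not return within ${limit}s" >&2
    fi
    return "$rc"
}
```

The resubmit would then read something like `(cd "$RUNDIR" && run_with_deadline 120 sbatch --begin=now+1hour "$0")`. One caveat with this design: if `sbatch` is killed by `timeout`, the script cannot tell whether slurmctld accepted the submission, so any retry logic added on top would need to guard against enqueuing a duplicate link.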

  ## Related
  See companion issue: a single TIMEOUT in this chain silently breaks the whole launcher, and there is no watchdog or resurrection mechanism to restart it.
