watchdog_copycomplete.sh: replace existence check with readability check on CopyComplete.txt - latent race against Isilon perms-correction #88

  ## Observed

  When the sequencer writes `CopyComplete.txt` to a rundir on Isilon, the
  file initially lands as `-rwx------ sequencers stamlab` (mode 700). Some
  time later - observed to be **up to ~20 minutes** in some cases - an
  Isilon-side perms-correction process flips it to `-rw-r--r--` (mode 644).
  The mode-644 state is the actual "rundir is ready to process" signal:
  until then, other files in the rundir may also still be perm-restricted.

  Example: today's flowcells `20260508_LH00361_0119_A22Y7LYLT4` and
  `20260508_LH00361_0120_B22Y7M5LT4` (CopyComplete landed 2026-05-10 09:07
  and 09:13). By the time the watchdog fired at ~09:41, both files were
  already in the post-correction state, so today's run was unaffected.

  ## Issue

  `scripts/flowcells/watchdog_copycomplete.sh` checks file *existence*:

  ```bash
  # Only act if CopyComplete.txt exists
  if [ ! -e "$fc/CopyComplete.txt" ] ; then
      continue
  fi
  ```
  `[ -e ]` returns true as soon as the file shows up on disk, independent
  of mode. The rundir itself is `drwxrws--- sequencers stamlab` (group
  `stamlab` has rwx, with setgid). `solexa` is in `stamlab`, so `solexa` can
  search the dir and stat files inside it from the first moment the
  sequencer writes them. The watchdog therefore acts on a rundir as soon
  as `CopyComplete.txt` exists, which can be up to ~20 min before the
  rundir is actually ready.

  If the watchdog fires inside that window, `setup.sh && run_bcl2fastq.sh`
  starts against a not-yet-fully-ready rundir.
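
The window described above is the period in which the marker exists but is not yet readable by the watchdog's user. A minimal sketch of a probe that tells the three states apart is shown here; `copycomplete_state` is a hypothetical helper for illustration, not something in `watchdog_copycomplete.sh`.

```bash
# Hypothetical helper, not part of watchdog_copycomplete.sh.
# Prints the state of a rundir's CopyComplete.txt as seen by the
# user running the script. Note: root passes -r regardless of mode,
# so this is only meaningful for an unprivileged account.
copycomplete_state() {
    local marker="$1/CopyComplete.txt"
    if [ ! -e "$marker" ]; then
        echo "missing"              # sequencer has not written the marker
    elif [ ! -r "$marker" ]; then
        echo "present-unreadable"   # exists, but we cannot read it yet
    else
        echo "ready"                # readable by us
    fi
}
```

An existence-only check treats `present-unreadable` and `ready` the same, which is the gap this issue describes.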

  ### Why we're safe today

  The watchdog cycles via `scrontab --begin=now+1hour` self-resubmit, so it
  fires roughly hourly. In practice the gap between CopyComplete-landing
  and the next watchdog fire has been longer than the perms-correction
  window (roughly 28 and 34 minutes for today's two flowcells), so the
  race has not opened. The fix here is hardening, not recovering from an
  outage.

  ### Why it's worth fixing anyway

  If scrontab cadence is ever tightened (e.g. to 5 minutes for snappier
  turnaround), or if perms-correction is unusually slow on a given run,
  the race opens. Better to harden now while it's a 1-line diff than
  discover it at 3am.

  ### Recommended fix: switch to readability check

  Skip the rundir until perms-correction has made the file readable by us.
  `CopyComplete.txt` lands mode 700 owned by `sequencers`; the perms
  corrector flips it to 644 once the rundir is fully ready (observed
  latency up to ~20 min). Until then `solexa` (group=stamlab, not owner)
  can't read it, so `[ -r ]` returns false.

  ```bash
  # Only act once CopyComplete.txt is readable by us
  if [ ! -r "$fc/CopyComplete.txt" ] ; then
      continue
  fi
  ```

  Why readability is right here:
  - Tests the actual signal (perms have flipped), not a timing proxy
  - Naturally aligned with the 700→644 transition: at mode 700 the group
    bits are `---` so solexa-as-group-member can't read; at 644 the group
    bits are `r--` so solexa can.
  - Decoupled from the corrector's latency: works regardless of whether
    the perms flip happens in 30 seconds or 20 minutes.
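
The group-bit reasoning in the list above can be checked with a little octal arithmetic. This is only an illustration of the mode bits; `group_can_read` is a hypothetical name, and the real `[ -r ]` test asks the filesystem for effective access, which could also be influenced by ACLs.

```bash
# Hypothetical helper: does an octal mode grant read to the owning group?
group_can_read() {
    local mode="$1"                    # e.g. 700 or 644
    [ $(( 0$mode & 040 )) -ne 0 ]      # 040 is the group-read bit
}

for m in 700 644; do
    if group_can_read "$m"; then
        echo "$m: group can read"
    else
        echo "$m: group cannot read"
    fi
done
# prints:
#   700: group cannot read
#   644: group can read
```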

  ### Why mtime alone wouldn't be sufficient

  A `find -mmin +N` style guard only works if you know an upper bound on
  perms-correction latency. With observations up to ~20 minutes, the
  threshold would have to be larger than that, and we don't have a
  guaranteed upper bound. Readability tests the actual signal directly.
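
For contrast, here is a sketch of what such an mtime guard might look like, assuming a `find` that supports `-mmin`. `older_than_minutes` is a hypothetical helper and the threshold passed to it is a guess, which is exactly the weakness described above.

```bash
# Hypothetical mtime guard, shown only for contrast with the -r check.
# Succeeds if FILE was last modified more than MINUTES minutes ago.
older_than_minutes() {
    local file="$1" minutes="$2"
    [ -n "$(find "$file" -mmin "+$minutes" 2>/dev/null)" ]
}

# Usage sketch: the 20 here is an assumed threshold, not a known bound.
# if ! older_than_minutes "$fc/CopyComplete.txt" 20; then continue; fi
```

A guard like this says nothing about whether the perms have actually flipped; it only asserts that enough time has probably passed.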

  ### Priority

  Low. No current outage; today's run was unaffected. File for when
  there's bandwidth.
