Observed
When the sequencer writes CopyComplete.txt to a rundir on Isilon, the
file initially lands as -rwx------ sequencers stamlab (mode 700). Some
time later - observed to be up to ~20 minutes in some cases - an
Isilon-side perms-correction process flips it to -rw-r--r-- (mode 644).
The mode-644 state is the actual "rundir is ready to process" signal:
until then, other files in the rundir may also still be perm-restricted.
Example: today's flowcells 20260508_LH00361_0119_A22Y7LYLT4 and
20260508_LH00361_0120_B22Y7M5LT4 (CopyComplete landed 2026-05-10 09:07
and 09:13). By the time the watchdog fired at ~09:41, both files were
already in the post-correction state, so today's run was unaffected.
Issue
scripts/flowcells/watchdog_copycomplete.sh checks file existence:
# Only act if CopyComplete.txt exists
if [ ! -e "$fc/CopyComplete.txt" ] ; then
    continue
fi
[ -e ] returns true as soon as the file shows up on disk, independent
of mode. The rundir itself is drwxrws--- sequencers stamlab (group
stamlab has rwx, with setgid). The solexa user is in stamlab, so it can
search the dir and stat files inside it from the first moment the
sequencer writes them. The watchdog therefore fires the moment
CopyComplete.txt exists, which can be up to ~20 min before the rundir
is actually ready.
If the watchdog fires inside that window, setup.sh && run_bcl2fastq.sh
starts against a not-yet-fully-ready rundir.
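
The gap between the two tests can be reproduced locally. The following is a minimal sketch, not the watchdog's code: it uses a scratch directory from mktemp, and mode 000 stands in for "no read bit applies to us" (on the real rundir the file is mode 700 owned by sequencers, which has the same effect for solexa). Note that root bypasses permission checks, so run as root the readability test passes in both states.

```shell
#!/usr/bin/env bash
# Demo: existence and readability are separate questions.
# demo_dir is a scratch location, not a real rundir.
demo_dir=$(mktemp -d)
touch "$demo_dir/CopyComplete.txt"
chmod 000 "$demo_dir/CopyComplete.txt"   # stand-in for the pre-correction state

exists=no
readable=no
if [ -e "$demo_dir/CopyComplete.txt" ]; then exists=yes; fi
if [ -r "$demo_dir/CopyComplete.txt" ]; then readable=yes; fi
echo "before correction: exists=$exists readable=$readable"

chmod 644 "$demo_dir/CopyComplete.txt"   # simulate the perms corrector
readable_after=no
if [ -r "$demo_dir/CopyComplete.txt" ]; then readable_after=yes; fi
echo "after correction:  readable=$readable_after"
```

For a non-root user, the existence test passes in both states while the readability test passes only after the chmod to 644.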
Why we're safe today
The watchdog cycles via scrontab --begin=now+1hour self-resubmit, so it
fires roughly hourly. In the runs observed so far, the gap between
CopyComplete landing and the next watchdog fire has been longer than the
perms-correction window (34 and 28 minutes in today's example, against a
window of up to ~20), so the race has not opened. The fix here is
hardening, not recovering from an outage.
Why it's worth fixing anyway
If scrontab cadence is ever tightened (e.g. to 5 minutes for snappier
turnaround), or if perms-correction is unusually slow on a given run,
the race opens. Better to harden now while it's a 1-line diff than
discover it at 3am.
Recommended fix: switch to readability check
# Skip until perms-correction has flipped the file readable by us.
# CopyComplete.txt lands mode 700 owned by sequencers; the perms
# corrector flips it to 644 once the rundir is fully ready (observed
# latency up to ~20 min). Until then solexa (group=stamlab, not owner)
# can't read it, so [ -r ] returns false.
if [ ! -r "$fc/CopyComplete.txt" ] ; then
    continue
fi
Why readability is right here:
- Tests the actual signal (perms have flipped), not a timing proxy
- Naturally aligned with the 700→644 transition: at mode 700 the group
bits are --- so solexa-as-group-member can't read; at 644 the group
bits are r-- so solexa can.
- Decoupled from the corrector's latency - works regardless of whether
the perms flip happens in 30 seconds or 20 minutes.
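
If it would help when debugging, the three states a rundir can be in could be made explicit so the watchdog can log "waiting on perms correction" rather than skipping silently. This is an optional sketch, not part of the one-line fix: copycomplete_state, RUNDIR_ROOT and the loop shape are hypothetical names, and the real script's glob and variables may differ. It also assumes the watchdog does not run as root, since root passes [ -r ] regardless of mode.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify a rundir by the state of CopyComplete.txt.
# Prints one of: absent | waiting | ready
copycomplete_state() {
    local marker="$1/CopyComplete.txt"
    if [ ! -e "$marker" ]; then
        echo absent      # sequencer has not written the marker yet
    elif [ ! -r "$marker" ]; then
        echo waiting     # marker exists, perms correction has not run
    else
        echo ready       # readable by us: safe to start processing
    fi
}

# Hypothetical loop; RUNDIR_ROOT is an assumed variable, not from the script.
for fc in "${RUNDIR_ROOT:-/nonexistent}"/*/; do
    fc=${fc%/}
    [ -d "$fc" ] || continue
    case "$(copycomplete_state "$fc")" in
        ready)   echo "ready: $fc" ;;
        waiting) echo "waiting on perms correction: $fc" >&2 ;;
        absent)  : ;;
    esac
done
```

Only the "ready" branch would go on to setup.sh && run_bcl2fastq.sh; the other two are equivalent to the existing continue.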
Why mtime alone wouldn't be sufficient
A find -mmin +N style guard only works if you know an upper bound on
perms-correction latency. With observations up to ~20 minutes, the
threshold would have to be larger than that, and we don't have a
guaranteed upper bound. Readability tests the actual signal directly.
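
For comparison, an age-based guard might look like the sketch below. MIN_AGE_MIN and older_than_threshold are hypothetical names, and the 30-minute value is a guess, which is exactly the problem. The demo backdates a scratch file and leaves it unreadable to show that the age test says nothing about whether the perms have flipped.

```shell
#!/usr/bin/env bash
# Sketch of an mtime-based guard, shown only to illustrate its weakness.
# MIN_AGE_MIN is a hypothetical threshold; no guaranteed upper bound on
# perms-correction latency is known, so any value here is a guess.
MIN_AGE_MIN=30

older_than_threshold() {
    # Succeeds if the file's mtime is more than MIN_AGE_MIN minutes ago.
    [ -n "$(find "$1" -mmin +"$MIN_AGE_MIN" 2>/dev/null)" ]
}

marker=$(mktemp)                    # scratch file, not a real CopyComplete.txt
touch -t 200001010000 "$marker"     # backdate mtime far past the threshold
chmod 000 "$marker"                 # but leave it in an unreadable state

aged=no
readable=no
if older_than_threshold "$marker"; then aged=yes; fi
if [ -r "$marker" ]; then readable=yes; fi
echo "aged=$aged readable=$readable"
```

For a non-root user the age guard passes while the file is still unreadable, so a slow correction run would slip through it.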
Priority
Low. No current outage; today's run was unaffected. File for when
there's bandwidth.