watchdog_copycomplete.sh: replace existence check with readability check on CopyComplete.txt - latent race against Isilon perms-correction #88

@rsandstromUW

Observed

When the sequencer writes CopyComplete.txt to a rundir on Isilon, the
file initially lands as -rwx------ sequencers stamlab (mode 700). Some
time later - observed to be up to ~20 minutes in some cases - an
Isilon-side perms-correction process flips it to -rw-r--r-- (mode 644).
The mode-644 state is the actual "rundir is ready to process" signal:
until then, other files in the rundir may also still be perm-restricted.

Example: today's flowcells 20260508_LH00361_0119_A22Y7LYLT4 and
20260508_LH00361_0120_B22Y7M5LT4 (CopyComplete landed 2026-05-10 09:07
and 09:13). By the time the watchdog fired at ~09:41, both files were
already in the post-correction state, so today's run was unaffected.

Issue

scripts/flowcells/watchdog_copycomplete.sh checks file existence:

# Only act if CopyComplete.txt exists
if [ ! -e "$fc/CopyComplete.txt" ] ; then
    continue
fi

[ -e ] returns true as soon as the file shows up on disk, independent
of mode. The rundir itself is drwxrws--- sequencers stamlab (group
stamlab has rwx, with setgid). The solexa user is in stamlab, so it can
search the dir and stat files inside it from the first moment the
sequencer writes them. The watchdog therefore fires the moment
CopyComplete.txt exists, which can be up to ~20 min before the rundir
is actually ready.

If the watchdog fires inside that window, setup.sh && run_bcl2fastq.sh
starts against a not-yet-fully-ready rundir.
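
For diagnosing which state a given rundir is in, the three cases can be
told apart with the standard test operators. This is only a sketch: the
function name copycomplete_state is hypothetical and is not part of
watchdog_copycomplete.sh, and the result depends on which user runs it.

```shell
# Hypothetical helper, not part of watchdog_copycomplete.sh.
# Takes a rundir path and prints one of:
#   missing | present-unreadable | readable
# The answer is relative to the calling user.
copycomplete_state() {
    f="$1/CopyComplete.txt"
    if [ ! -e "$f" ]; then
        echo missing
    elif [ ! -r "$f" ]; then
        echo present-unreadable
    else
        echo readable
    fi
}
```

Run as solexa during the correction window, this would be expected to
print present-unreadable, which is exactly the state the current [ -e ]
check cannot distinguish from readable. Run as root it would typically
print readable regardless of mode, so it is only meaningful as the
account the watchdog uses.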

Why we're safe today

The watchdog cycles via scrontab --begin=now+1hour self-resubmit, so it
fires roughly hourly. In practice the gap between CopyComplete-landing
and the next watchdog fire is much longer than the perms-correction
window, so the race never opens. The fix here is hardening, not
recovering from an outage.

Why it's worth fixing anyway

If scrontab cadence is ever tightened (e.g. to 5 minutes for snappier
turnaround), or if perms-correction is unusually slow on a given run,
the race opens. Better to harden now while it's a 1-line diff than
discover it at 3am.

Recommended fix: switch to readability check

# Skip until perms-correction has flipped the file readable by us.
# CopyComplete.txt lands mode 700 owned by sequencers; the perms
# corrector flips it to 644 once the rundir is fully ready (observed
# latency up to ~20 min). Until then solexa (group=stamlab, not owner)
# can't read it, so [ -r ] returns false.
if [ ! -r "$fc/CopyComplete.txt" ] ; then
    continue
fi

Why readability is right here:

  • Tests the actual signal (perms have flipped), not a timing proxy
  • Naturally aligned with the 700→644 transition: at mode 700 the group
    bits are --- so solexa-as-group-member can't read; at 644 the group
    bits are r-- so solexa can.
  • Decoupled from the corrector's latency - works regardless of whether
    the perms flip happens in 30 seconds or 20 minutes.
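
The group-bit reasoning above can be checked with a little octal
arithmetic. This is an illustration only: group_can_read is a
hypothetical name, and it models just the mode bits for a caller who is
a group member but neither the owner nor root (no ACLs considered).

```shell
# Illustrative only: does an octal mode grant group read (bit 040)?
# Assumes the caller is a group member who is neither owner nor root.
group_can_read() {
    # A leading 0 makes the shell parse the argument as octal.
    [ $(( 0$1 & 040 )) -ne 0 ]
}

for m in 700 644; do
    if group_can_read "$m"; then
        echo "$m: group can read"
    else
        echo "$m: group cannot read"
    fi
done
# prints:
#   700: group cannot read
#   644: group can read
```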

Why mtime alone wouldn't be sufficient

A find -mmin +N style guard only works if you know an upper bound on
perms-correction latency. With observations up to ~20 minutes, the
threshold would have to be larger than that, and we don't have a
guaranteed upper bound. Readability tests the actual signal directly.
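
For comparison, an mtime guard might look like the sketch below. The
helper name older_than_minutes and the 30-minute threshold in the usage
note are hypothetical; -mmin is a GNU/BSD find extension rather than
POSIX.

```shell
# Hypothetical mtime guard, shown only to illustrate the weakness.
# Succeeds if FILE ($1) was last modified more than MINUTES ($2) ago.
# It looks at age only and never inspects the file's mode.
older_than_minutes() {
    [ -n "$(find "$1" -mmin +"$2" 2>/dev/null)" ]
}
```

A call such as older_than_minutes "$fc/CopyComplete.txt" 30 would pass
as soon as the file is 30 minutes old, even if the corrector has not run
yet and the file is still mode 700. Any fixed threshold has the same
problem whenever the correction takes longer than the threshold.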

Priority

Low. No current outage; today's run was unaffected. Filed for when
there's bandwidth.
