Crash/reboot resilience + atomic O_EXCL lock (fix WSL2 mkdir non-atomicity)#48

Open
owenzhang26-sys wants to merge 2 commits into kunchenguid:main from owenzhang26-sys:fm/crash-resilience

@owenzhang26-sys

Why

A WSL VM teardown (host sleep, idle timeout, or a Windows Update reboot) kills tmux and every crewmate at once, and nothing relaunched firstmate after the VM came back — the fleet stayed dead until the captain reconnected. Separately, the test suite was intermittently flaky.

Crash/reboot resilience

  • systemd/firstmate.service + bin/fm-resume.sh — a watchdog recreates the persistent firstmate tmux session on boot and self-heals if it dies. KillMode=process means stopping the unit never kills a live firstmate. Verified: restarts ~10s after a kill; survivor session untouched.
  • bin/fm-install-autostart.sh: install / status / uninstall; fully reversible, never touches a running session.
  • Recommended companion (not in this repo): ~/.wslconfig with vmIdleTimeout=-1 to prevent the idle teardown in the first place (needs wsl --shutdown to apply).
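The "recreate the session if it is missing" step of the watchdog can be sketched as below. This is a minimal illustration, not the contents of bin/fm-resume.sh: the function name `ensure_session` and the session name `firstmate` are assumptions, and only the standard tmux subcommands `has-session` and `new-session` are used.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of an idempotent "ensure the session exists" step.
# ensure_session and the session name are illustrative, not taken from the PR.

ensure_session() {
  local name=$1
  # "=name" asks tmux for an exact session-name match, not a prefix match.
  if tmux has-session -t "=$name" 2>/dev/null; then
    return 0   # already running: leave the live session untouched
  fi
  # Create the session detached so this also works with no terminal attached.
  tmux new-session -d -s "$name"
}
```

Calling `ensure_session firstmate` twice in a row creates the session at most once, which is the idempotency property a periodically restarted watchdog would rely on.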

Lock correctness (root cause of the flakiness)

The wake-queue / watcher-singleton lock used mkdir as its atomic primitive, but mkdir is not atomic on WSL2's filesystem — a 20-way barrier race produced up to 4 concurrent mkdir successes on one path. The lock therefore double-granted under contention (duplicate watchers, raced wake-queue drains, flaky tests). A canonical mutex probe showed 4–8 double-grants per run.
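A barrier race of the kind described above could look roughly like the following sketch. It is not the probe from this PR: `probe_mkdir_race` is a hypothetical name, and the barrier is a simple "go" file that all workers spin on before calling mkdir on the same path. On a filesystem where mkdir is atomic, the function should print 1.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: count how many of N racing workers see mkdir succeed
# on the same path. probe_mkdir_race is an illustrative name.

probe_mkdir_race() {
  local n=${1:-20} dir i=0 wins
  dir=$(mktemp -d) || return 1
  while [ "$i" -lt "$n" ]; do
    (
      # Barrier: spin until the parent creates the "go" file.
      until [ -e "$dir/go" ]; do :; done
      # Each worker that wins the mkdir leaves its own marker file.
      mkdir "$dir/lock" 2>/dev/null && : > "$dir/win.$i"
    ) &
    i=$((i + 1))
  done
  : > "$dir/go"   # release all workers at roughly the same moment
  wait            # wait for every background worker to finish
  wins=$(find "$dir" -name 'win.*' | wc -l)
  rm -rf "$dir"
  echo $((wins))
}
```

Running `probe_mkdir_race 20` repeatedly and flagging any result above 1 is one way to reproduce a double-grant on an affected filesystem.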

Replaced the primitive with an O_EXCL (noclobber) create — atomic on Linux, WSL2, and macOS — which also writes the holder pid in the same step, eliminating the empty-pid window. Dead holders are reclaimed in a single try_acquire (so fm-watch.sh takes over a crashed watcher); a live holder's lock is never stolen. Probe now reports 0 double-grants.
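The noclobber pattern can be sketched as follows. This is an illustration under assumptions, not the PR's code: the function body, the lock-file layout (a single line holding the pid), and the retry count are all hypothetical. The sketch also does not try to close the window between the liveness check and the removal of a stale file; how the PR handles that is not shown in this description.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of an O_EXCL (noclobber) lock with dead-holder reclaim.
# The lock file is assumed to hold the owner's pid on a single line.

try_acquire() {
  local lock=$1 holder attempt
  for attempt in 1 2; do
    # With noclobber set, '>' opens the file with O_CREAT|O_EXCL, so the
    # redirection fails if the lock file already exists. $$ is the pid of
    # the main shell, even inside this subshell.
    if ( set -o noclobber; printf '%s\n' "$$" > "$lock" ) 2>/dev/null; then
      return 0
    fi
    holder=$(cat "$lock" 2>/dev/null) || holder=""
    # Missing or non-numeric pid: treat the holder as live and do not steal.
    case $holder in
      ''|*[!0-9]*) return 1 ;;
    esac
    # kill -0 sends no signal; it only checks that the process can be signalled.
    if kill -0 "$holder" 2>/dev/null; then
      return 1
    fi
    # Holder is gone: drop the stale file and retry the exclusive create once.
    rm -f "$lock"
  done
  return 1
}
```

A caller would typically run `try_acquire "$lockfile" || exit 0` at startup and remove the file on exit, for example from a `trap`.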

Tests

  • tests/fm-lock-exclusivity.test.sh — canonical mutex probe + dead/live reclaim (guards the lock fix).
  • tests/fm-resume.test.sh — base-index-safe session creation + idempotency (runs against an isolated tmux server).
  • tests/fm-wake-queue.test.sh — singleton-start now polls for the invariant instead of a fixed sleep.
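Polling for an invariant instead of sleeping a fixed time can be done with a small helper along these lines. The name `wait_until` and the 0.1 s interval are assumptions for illustration; the actual test may be structured differently.

```bash
#!/usr/bin/env bash
# Hypothetical test helper: retry a command until it succeeds or the
# attempt budget runs out. wait_until is an illustrative name.

wait_until() {
  local tries=$1
  shift
  while [ "$tries" -gt 0 ]; do
    # "$@" is the condition to check, e.g.: test -e /path/to/pidfile
    if "$@"; then
      return 0
    fi
    tries=$((tries - 1))
    sleep 0.1
  done
  return 1   # the invariant never held within the budget
}
```

For example, `wait_until 50 test -e "$pidfile"` returns as soon as the file appears and gives up after about five seconds, so a fast machine is not slowed down and a slow one is not failed prematurely.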

Local CI green: shellcheck bin/*.sh tests/*.sh, all 5 test files, and the repo invariants.

Also included

fix(spawn,teardown): fm-spawn waits for a */.treehouse/* cwd before recording the worktree; fm-teardown ignores the tracked .claude/settings.local.json (turn-end hook) in its dirty check.
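Both checks can be sketched in a few lines of shell. These are hypothetical helpers, not the code in fm-spawn or fm-teardown: the function names are illustrative, the cwd is obtained from whatever command the caller passes in, and the dirty check assumes git's `:(exclude)` pathspec is an acceptable way to skip the hook-modified file.

```bash
#!/usr/bin/env bash
# Hypothetical sketches; function names are illustrative, not from the PR.

# Poll a caller-supplied command that prints a cwd until the path lies
# inside a .treehouse directory, then print it. Any other change of cwd
# is ignored, so a transient default-cwd reading is never accepted.
wait_for_worktree_cwd() {
  local tries=$1 cwd
  shift
  while [ "$tries" -gt 0 ]; do
    cwd=$("$@") || cwd=""
    case $cwd in
      */.treehouse/*) printf '%s\n' "$cwd"; return 0 ;;
    esac
    tries=$((tries - 1))
    sleep 0.1
  done
  return 1
}

# Succeed if the repo has changes other than the hook-modified settings file.
is_dirty_ignoring_hook_file() {
  local repo=$1
  [ -n "$(git -C "$repo" status --porcelain -- . \
      ':(exclude).claude/settings.local.json')" ]
}
```

With tmux, the cwd command could be something like `tmux display-message -p -t "$pane" '#{pane_current_path}'`, though the description above does not say how fm-spawn reads it.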

🤖 Generated with Claude Code

root added 2 commits June 23, 2026 10:34
…ocal.json in dirty check

fm-spawn waits specifically for a */.treehouse/* cwd rather than any change off
the project dir, so a transient default-cwd reading is never misrecorded as the
worktree. fm-teardown's dirty check ignores the tracked .claude/settings.local.json
that the turn-end hook modifies, so teardown is not blocked by firstmate's own hook.

Crash/reboot resilience: a WSL VM teardown (host sleep, idle timeout, Windows
Update reboot) kills tmux and every crewmate at once and nothing relaunched
firstmate afterward. systemd/firstmate.service runs bin/fm-resume.sh as a
watchdog that recreates the persistent firstmate tmux session on boot and
self-heals if it dies (KillMode=process, so stopping the unit never kills a live
firstmate). bin/fm-install-autostart.sh installs/removes it. Pair with
~/.wslconfig vmIdleTimeout=-1 to prevent the idle teardown in the first place.

Lock correctness: the wake-queue/singleton lock used mkdir as its atomic
primitive, but mkdir is NOT atomic on WSL2's filesystem (verified: 4 concurrent
mkdir calls succeeded on one path in a barrier race), so the lock double-granted
under contention - duplicate watchers, raced wake-queue drains, flaky tests.
Replaced with an O_EXCL (noclobber) create, atomic on Linux, WSL2, and macOS,
which also writes the holder pid in the same step (no empty-pid window). Dead
holders are reclaimed in one call; a live holder's lock is never stolen.

Tests: fm-lock-exclusivity (canonical mutex probe + reclaim) and fm-resume
(base-index-safe create + idempotency) guard both fixes; the singleton-start
test now polls for the invariant instead of a fixed sleep.