v0.35.8.1 fix(locks): evict dead same-host PID rows from gbrain_cycle_locks #1161
zekear wants to merge 2 commits
A SIGKILL'd cycle holder leaves a row in gbrain_cycle_locks whose TTL is up to 30 minutes in the future. The TTL-bounded UPSERT in acquirePostgresLock + tryAcquireDbLock reads that row, classifies it as a live holder, and bails, so every subsequent invocation no-ops for half an hour.

pglite-lock.ts and acquireFileLock already do the same-host PID liveness check via process.kill(pid, 0); the DB-row layer didn't. Mirror that check into both DB-lock acquire paths. Before the UPSERT, SELECT the existing row's holder_pid + holder_host, and when the host matches os.hostname() and process.kill(pid, 0) throws ESRCH, DELETE the row keyed on (id, holder_pid) so the UPSERT below can proceed. Cross-host holders fall through to the existing TTL eviction (we have no way to probe a remote PID table). Same-host alive-but-unsignalable holders (EPERM, e.g. PID 1) are treated as alive, not stale.

The helper is duplicated across cycle.ts and db-lock.ts rather than imported, to avoid a circular dependency between the two modules. Both copies carry a JSDoc note pointing at the sibling.

Probe is best-effort: any failure in the SELECT or DELETE falls through to the legacy TTL-bounded UPSERT, which is still the safety net.

Test coverage: 6 cases in test/db-lock-stale-pid.test.ts pin the contract: dead same-host PID evicted, cross-host left alone, live same-host (PID 1, EPERM) left alone, TTL-expired cross-host evicted by the existing UPSERT, no-existing-row INSERT path unaffected, own PID never evicted.

Relates to garrytan#1065.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Companion to the db-lock stale-PID eviction fix. No schema migration; the gbrain_cycle_locks row shape is unchanged. Only the acquire-path read of it changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hi @garrytan, congrats on the v0.36.x release wave; #1182 looks like a heroic triage.

Quick note on why this PR is still relevant after the wave: #1065 was closed as already-fixed, which is correct for the filesystem-lock variant ( On

The symptom we hit on a real install: an openclaw cron's

Happy to rebase against

Regardless, thanks for the open-source work. First PR I've ever opened on a project that isn't mine, picked yours because of the brain-as-a-substrate thesis. Really fun codebase to read.
Summary
A SIGKILL'd `gbrain dream` (OOM, cron timeout, hard Ctrl-C) leaves a row in `gbrain_cycle_locks` whose `ttl_expires_at` is up to 30 minutes in the future. Every subsequent acquire reads that row, classifies it as a live holder via the TTL gate, and bails. On cron-driven setups, every dream invocation no-ops for half an hour after one of them dies.
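
To make the failure mode concrete, here is a minimal in-memory sketch of a TTL-only acquire gate. It is not the gbrain implementation: `LockRow`, `ttlOnlyAcquire`, and the `Map`-backed table are hypothetical stand-ins, and only the 30-minute TTL comes from the description above.

```typescript
// Hypothetical in-memory model of a TTL-gated lock row (not the gbrain schema).
interface LockRow {
  holderPid: number;
  holderHost: string;
  ttlExpiresAt: number; // epoch milliseconds
}

const TTL_MS = 30 * 60 * 1000; // 30 minutes, as stated in the summary

// Acquire succeeds only when no row exists or the existing row's TTL has lapsed.
// Nothing here asks whether the holder process is still running.
function ttlOnlyAcquire(
  table: Map<string, LockRow>,
  id: string,
  me: { pid: number; host: string },
  now: number,
): boolean {
  const row = table.get(id);
  if (row !== undefined && row.ttlExpiresAt > now) {
    return false; // row looks live purely because its TTL is in the future
  }
  table.set(id, { holderPid: me.pid, holderHost: me.host, ttlExpiresAt: now + TTL_MS });
  return true;
}

// Usage: PID 100 takes the lock at t=0 and is then killed without cleanup.
const MINUTE = 60 * 1000;
const locks = new Map<string, LockRow>();
ttlOnlyAcquire(locks, "cycle", { pid: 100, host: "box-a" }, 0);
console.log(ttlOnlyAcquire(locks, "cycle", { pid: 200, host: "box-a" }, 10 * MINUTE)); // false
console.log(ttlOnlyAcquire(locks, "cycle", { pid: 200, host: "box-a" }, 31 * MINUTE)); // true
```

The second caller is locked out until the dead holder's TTL lapses, which is the half-hour no-op window described above.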
`pglite-lock.ts` and `acquireFileLock` already do a same-host `process.kill(pid, 0)` liveness check; the DB-row layer didn't. This PR mirrors that check into both `acquirePostgresLock` (the broad cycle lock, `src/core/cycle.ts`) and `tryAcquireDbLock` (the generic named-DB-lock helper used by `gbrain-sync` and future siblings, `src/core/db-lock.ts`).

Before the TTL-bounded UPSERT, the acquire path SELECTs the existing row's `holder_pid` + `holder_host`, and when the host matches `os.hostname()` and `process.kill(pid, 0)` throws `ESRCH`, DELETEs the row keyed on `(id, holder_pid)` so the UPSERT below can proceed. Cross-host holders fall through to the existing TTL eviction. Same-host alive-but-unsignalable holders (EPERM, e.g. PID 1) are treated as alive, not stale.
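
The probe-then-evict step described above could look roughly like the sketch below. This is an illustration under stated assumptions, not the code in this PR: the `SqlClient` interface, the function names `probePid` and `evictDeadSameHostHolder`, and the exact SQL strings are hypothetical. Only the table and column names, the `process.kill(pid, 0)` / `ESRCH` / `EPERM` behaviour, and the best-effort fallthrough come from the PR text.

```typescript
import * as os from "node:os";

// Hypothetical minimal client shape; the real gbrain database wrapper is not shown here.
interface SqlClient {
  query(sql: string, params: unknown[]): Promise<Array<Record<string, unknown>>>;
}

type Liveness = "alive" | "dead";

// Signal 0 runs the kernel's existence and permission checks without delivering a signal.
function probePid(pid: number): Liveness {
  // pid <= 0 addresses process groups rather than one process, so never probe it.
  if (!Number.isInteger(pid) || pid <= 0) return "alive";
  try {
    process.kill(pid, 0);
    return "alive";
  } catch (err) {
    // ESRCH means no such process. Anything else (e.g. EPERM) cannot prove death,
    // so the holder is conservatively treated as alive.
    return (err as { code?: string }).code === "ESRCH" ? "dead" : "alive";
  }
}

// Returns true only when a dead same-host holder row was deleted.
// Any failure returns false so the caller falls through to the TTL-bounded UPSERT.
async function evictDeadSameHostHolder(
  db: SqlClient,
  lockId: string,
  probe: (pid: number) => Liveness = probePid,
  self: { pid: number; host: string } = { pid: process.pid, host: os.hostname() },
): Promise<boolean> {
  try {
    // Assumed SQL; column names follow the PR description.
    const rows = await db.query(
      "SELECT holder_pid, holder_host FROM gbrain_cycle_locks WHERE id = $1",
      [lockId],
    );
    const row = rows[0];
    if (row === undefined) return false; // no existing row: plain INSERT path
    const pid = Number(row.holder_pid);
    if (row.holder_host !== self.host) return false; // cross-host: TTL is the only signal
    if (pid === self.pid) return false; // never evict our own row
    if (probe(pid) !== "dead") return false; // alive or unsignalable
    // Keyed on (id, holder_pid) so a row re-acquired by another process is not deleted.
    await db.query(
      "DELETE FROM gbrain_cycle_locks WHERE id = $1 AND holder_pid = $2",
      [lockId, pid],
    );
    return true;
  } catch {
    return false; // best-effort: the legacy TTL path remains the safety net
  }
}
```

Passing `probe` and `self` as parameters is a choice made for this sketch so the decision logic can be exercised without killing real processes; the PR does not say whether its helper is structured this way.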

The helper is duplicated across `cycle.ts` and `db-lock.ts` rather than imported; both modules would otherwise form a circular dependency. Each copy carries a JSDoc note pointing at the sibling.
Test Coverage
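`test/db-lock-stale-pid.test.ts` (new, 6 cases, PGLite in-memory):
- dead same-host PID evicted
- cross-host row (different `holder_host`) left alone: TTL is the only safe signal
- live same-host PID (PID 1, EPERM) left alone
- TTL-expired cross-host row evicted by the existing UPSERT
- no-existing-row INSERT path unaffected
- own PID never evicted

`bun test test/db-lock-stale-pid.test.ts`: 6 pass, 0 fail.
`bun run typecheck`: clean.
`bun run check:test-isolation`: clean.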
Relationship to #1065
#1065 reports `gbrain dream` hanging at 100% CPU on Linux with
`.gbrain-lock/lock` surviving SIGKILL. This PR closes the DB-row layer
of that broader cleanup gap. The busy-loop hang and the filesystem
`.gbrain-lock/` cleanup behavior on Linux SIGKILL are separate symptoms
still open in #1065.
Release docs
`v0.35.8.1`. No schema migration, no config change. The
`gbrain_cycle_locks` row shape is identical to v0.35.8.0; only the
acquire-path read of it changes. Operator action: `gbrain upgrade`.
Test plan
Relates to #1065