v0.35.8.1 fix(locks): evict dead same-host PID rows from gbrain_cycle_locks by zekear · Pull Request #1161 · garrytan/gbrain

zekear · 2026-05-18T15:29:58Z

Summary

A SIGKILL'd gbrain dream (OOM, cron timeout, hard Ctrl-C) leaves a row
in gbrain_cycle_locks whose ttl_expires_at is up to 30 minutes in the
future. Every subsequent acquire reads that row, classifies it as a live
holder via the TTL gate, and bails. On cron-driven setups, every dream
invocation no-ops for half an hour after one of them dies.

pglite-lock.ts and acquireFileLock already do a same-host
process.kill(pid, 0) liveness check; the DB-row layer didn't. This PR
mirrors that check into both acquirePostgresLock (the broad cycle lock,
src/core/cycle.ts) and tryAcquireDbLock (the generic named-DB-lock
helper used by gbrain-sync and future siblings, src/core/db-lock.ts).

Before the TTL-bounded UPSERT, the acquire path SELECTs the existing
row's holder_pid + holder_host, and when the host matches
os.hostname() and process.kill(pid, 0) throws ESRCH, DELETEs the
row keyed on (id, holder_pid) so the UPSERT below can proceed.
Cross-host holders fall through to the existing TTL eviction. Same-host
alive-but-unsignalable holders (EPERM, e.g. PID 1) are treated as alive,
not stale.

The helper is duplicated across cycle.ts and db-lock.ts rather than
imported — both modules would otherwise form a circular dependency.
Each copy carries a JSDoc note pointing at the sibling.

Test Coverage

test/db-lock-stale-pid.test.ts (new, 6 cases, PGLite in-memory):

Dead same-host PID (999999) gets evicted, next acquire writes our PID
Cross-host row (different holder_host) left alone — TTL is the only safe signal
Live same-host PID 1 (EPERM, not ESRCH) left alone
TTL-expired cross-host row still evicted by the existing UPSERT (new path is additive)
No-existing-row clean INSERT path unaffected
Current process's own PID never evicted (re-acquire safety)

`bun test test/db-lock-stale-pid.test.ts` — 6 pass, 0 fail.
`bun run typecheck` — clean.
`bun run check:test-isolation` — clean.

Relationship to #1065

#1065 reports `gbrain dream` hanging at 100% CPU on Linux with
`.gbrain-lock/lock` surviving SIGKILL. This PR closes the DB-row layer
of that broader cleanup gap. The busy-loop hang and the filesystem
`.gbrain-lock/` cleanup behavior on Linux SIGKILL are separate symptoms
still open in #1065.

Release docs

`v0.35.8.1`. No schema migration, no config change. The
`gbrain_cycle_locks` row shape is identical to v0.35.8.0; only the
acquire-path read of it changes. Operator action: `gbrain upgrade`.

Test plan

`bun test test/db-lock-stale-pid.test.ts` passes (6/6)
`bun run typecheck` clean
`bun run check:test-isolation` clean
CHANGELOG entry + VERSION + package.json bumped in a separate commit (bisect-friendly)
CI version-gate accepts v0.35.8.1
CI shard suite green

Relates to #1065

A SIGKILL'd cycle holder leaves a row in gbrain_cycle_locks whose TTL is up to 30 minutes in the future. The TTL-bounded UPSERT in acquirePostgresLock + tryAcquireDbLock reads that row, classifies it as a live holder, and bails — every subsequent invocation no-ops for half an hour. pglite-lock.ts and acquireFileLock already do the same-host PID liveness check via process.kill(pid, 0); the DB-row layer didn't. Mirror that check into both DB-lock acquire paths. Before the UPSERT, SELECT the existing row's holder_pid + holder_host, and when the host matches os.hostname() and process.kill(pid, 0) throws ESRCH, DELETE the row keyed on (id, holder_pid) so the UPSERT below can proceed. Cross-host holders fall through to the existing TTL eviction (we have no way to probe a remote PID table). Same-host alive-but-unsignalable holders (EPERM, e.g. PID 1) are treated as alive, not stale. The helper is duplicated across cycle.ts and db-lock.ts rather than imported, to avoid a circular dependency between the two modules. Both copies carry a JSDoc note pointing at the sibling. Probe is best-effort: any failure in the SELECT or DELETE falls through to the legacy TTL-bounded UPSERT, which is still the safety net. Test coverage: 6 cases in test/db-lock-stale-pid.test.ts pin the contract — dead same-host PID evicted, cross-host left alone, live same-host (PID 1, EPERM) left alone, TTL-expired cross-host evicted by the existing UPSERT, no-existing-row INSERT path unaffected, own PID never evicted. Relates to garrytan#1065. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Companion to the db-lock stale-PID eviction fix. No schema migration; the gbrain_cycle_locks row shape is unchanged. Only the acquire-path read of it changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

zekear · 2026-05-19T04:32:23Z

Hi @garrytan, congrats on the v0.36.x release wave — #1182 looks like a heroic triage.

Quick note on why this PR is still relevant after the wave: #1065 was closed as already-fixed, which is correct for the filesystem-lock variant (.gbrain-lock/lock) that's covered by acquireFileLock and pglite-lock.ts. This PR addresses a different code path — the row in the gbrain_cycle_locks Postgres table that acquirePostgresLock and tryAcquireDbLock UPSERT into.

On master at v0.36.2.0:

src/core/cycle.ts:405-450 — acquireFileLock does the same-host PID liveness probe (ESRCH/EPERM split). Good.
src/core/cycle.ts:320-393 — acquirePostgresLock goes straight to INSERT ... ON CONFLICT (id) DO UPDATE ... WHERE ttl_expires_at < NOW() with no PID liveness probe.
src/core/db-lock.ts — same shape as acquirePostgresLock, no probe.

The symptom we hit on a real install: an openclaw cron's gbrain dream got SIGKILL'd (the busy-loop variant from #1065), and even after I cleaned up the filesystem lock manually, every subsequent dream invocation no-op'd for the full 30-minute TTL. The row in gbrain_cycle_locks was still there with holder_pid pointing at a PID that no longer existed and ttl_expires_at far in the future. This PR mirrors the same liveness probe that acquireFileLock already does into the two DB-row acquire sites.

Happy to rebase against master and bump to v0.36.2.1 if this is in scope. If the design call is "no DB-row PID probe, the TTL is enough", totally fair — just let me know and I'll close.

Regardless, thanks for the open-source work. First PR I've ever opened on a project that isn't mine, picked yours because of the brain-as-a-substrate thesis — really fun codebase to read.

Clawdia and others added 2 commits May 18, 2026 12:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v0.35.8.1 fix(locks): evict dead same-host PID rows from gbrain_cycle_locks#1161

v0.35.8.1 fix(locks): evict dead same-host PID rows from gbrain_cycle_locks#1161
zekear wants to merge 2 commits into
garrytan:masterfrom
zekear:garrytan/db-lock-stale-pid

zekear commented May 18, 2026

Uh oh!

zekear commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

zekear commented May 18, 2026

Summary

Test Coverage

Relationship to #1065

Release docs

Test plan

Uh oh!

zekear commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant