Skip to content

v0.35.8.1 fix(locks): evict dead same-host PID rows from gbrain_cycle_locks#1161

Open
zekear wants to merge 2 commits into
garrytan:masterfrom
zekear:garrytan/db-lock-stale-pid
Open

v0.35.8.1 fix(locks): evict dead same-host PID rows from gbrain_cycle_locks#1161
zekear wants to merge 2 commits into
garrytan:masterfrom
zekear:garrytan/db-lock-stale-pid

Conversation

@zekear
Copy link
Copy Markdown

@zekear zekear commented May 18, 2026

Summary

A SIGKILL'd gbrain dream (OOM, cron timeout, hard Ctrl-C) leaves a row
in gbrain_cycle_locks whose ttl_expires_at is up to 30 minutes in the
future. Every subsequent acquire reads that row, classifies it as a live
holder via the TTL gate, and bails. On cron-driven setups, every dream
invocation no-ops for half an hour after one of them dies.

pglite-lock.ts and acquireFileLock already do a same-host
process.kill(pid, 0) liveness check; the DB-row layer didn't. This PR
mirrors that check into both acquirePostgresLock (the broad cycle lock,
src/core/cycle.ts) and tryAcquireDbLock (the generic named-DB-lock
helper used by gbrain-sync and future siblings, src/core/db-lock.ts).

Before the TTL-bounded UPSERT, the acquire path SELECTs the existing
row's holder_pid + holder_host, and when the host matches
os.hostname() and process.kill(pid, 0) throws ESRCH, DELETEs the
row keyed on (id, holder_pid) so the UPSERT below can proceed.
Cross-host holders fall through to the existing TTL eviction. Same-host
alive-but-unsignalable holders (EPERM, e.g. PID 1) are treated as alive,
not stale.

The helper is duplicated across cycle.ts and db-lock.ts rather than
imported — both modules would otherwise form a circular dependency.
Each copy carries a JSDoc note pointing at the sibling.

Test Coverage

test/db-lock-stale-pid.test.ts (new, 6 cases, PGLite in-memory):

  • Dead same-host PID (999999) gets evicted, next acquire writes our PID
  • Cross-host row (different holder_host) left alone — TTL is the only safe signal
  • Live same-host PID 1 (EPERM, not ESRCH) left alone
  • TTL-expired cross-host row still evicted by the existing UPSERT (new path is additive)
  • No-existing-row clean INSERT path unaffected
  • Current process's own PID never evicted (re-acquire safety)

`bun test test/db-lock-stale-pid.test.ts` — 6 pass, 0 fail.
`bun run typecheck` — clean.
`bun run check:test-isolation` — clean.

Relationship to #1065

#1065 reports `gbrain dream` hanging at 100% CPU on Linux with
`.gbrain-lock/lock` surviving SIGKILL. This PR closes the DB-row layer
of that broader cleanup gap. The busy-loop hang and the filesystem
`.gbrain-lock/` cleanup behavior on Linux SIGKILL are separate symptoms
still open in #1065.

Release docs

`v0.35.8.1`. No schema migration, no config change. The
`gbrain_cycle_locks` row shape is identical to v0.35.8.0; only the
acquire-path read of it changes. Operator action: `gbrain upgrade`.

Test plan

  • `bun test test/db-lock-stale-pid.test.ts` passes (6/6)
  • `bun run typecheck` clean
  • `bun run check:test-isolation` clean
  • CHANGELOG entry + VERSION + package.json bumped in a separate commit (bisect-friendly)
  • CI version-gate accepts v0.35.8.1
  • CI shard suite green

Relates to #1065

Clawdia and others added 2 commits May 18, 2026 12:27
A SIGKILL'd cycle holder leaves a row in gbrain_cycle_locks whose TTL is
up to 30 minutes in the future. The TTL-bounded UPSERT in
acquirePostgresLock + tryAcquireDbLock reads that row, classifies it as a
live holder, and bails — every subsequent invocation no-ops for half an
hour. pglite-lock.ts and acquireFileLock already do the same-host PID
liveness check via process.kill(pid, 0); the DB-row layer didn't.

Mirror that check into both DB-lock acquire paths. Before the UPSERT,
SELECT the existing row's holder_pid + holder_host, and when the host
matches os.hostname() and process.kill(pid, 0) throws ESRCH, DELETE the
row keyed on (id, holder_pid) so the UPSERT below can proceed. Cross-host
holders fall through to the existing TTL eviction (we have no way to
probe a remote PID table). Same-host alive-but-unsignalable holders
(EPERM, e.g. PID 1) are treated as alive, not stale.

The helper is duplicated across cycle.ts and db-lock.ts rather than
imported, to avoid a circular dependency between the two modules. Both
copies carry a JSDoc note pointing at the sibling.

Probe is best-effort: any failure in the SELECT or DELETE falls through
to the legacy TTL-bounded UPSERT, which is still the safety net.

Test coverage: 6 cases in test/db-lock-stale-pid.test.ts pin the
contract — dead same-host PID evicted, cross-host left alone, live
same-host (PID 1, EPERM) left alone, TTL-expired cross-host evicted by
the existing UPSERT, no-existing-row INSERT path unaffected, own PID
never evicted.

Relates to garrytan#1065.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Companion to the db-lock stale-PID eviction fix. No schema migration; the
gbrain_cycle_locks row shape is unchanged. Only the acquire-path read of
it changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@zekear
Copy link
Copy Markdown
Author

zekear commented May 19, 2026

Hi @garrytan, congrats on the v0.36.x release wave — #1182 looks like a heroic triage.

Quick note on why this PR is still relevant after the wave: #1065 was closed as already-fixed, which is correct for the filesystem-lock variant (.gbrain-lock/lock) that's covered by acquireFileLock and pglite-lock.ts. This PR addresses a different code path — the row in the gbrain_cycle_locks Postgres table that acquirePostgresLock and tryAcquireDbLock UPSERT into.

On master at v0.36.2.0:

  • src/core/cycle.ts:405-450acquireFileLock does the same-host PID liveness probe (ESRCH/EPERM split). Good.
  • src/core/cycle.ts:320-393acquirePostgresLock goes straight to INSERT ... ON CONFLICT (id) DO UPDATE ... WHERE ttl_expires_at < NOW() with no PID liveness probe.
  • src/core/db-lock.ts — same shape as acquirePostgresLock, no probe.

The symptom we hit on a real install: an openclaw cron's gbrain dream got SIGKILL'd (the busy-loop variant from #1065), and even after I cleaned up the filesystem lock manually, every subsequent dream invocation no-op'd for the full 30-minute TTL. The row in gbrain_cycle_locks was still there with holder_pid pointing at a PID that no longer existed and ttl_expires_at far in the future. This PR mirrors the same liveness probe that acquireFileLock already does into the two DB-row acquire sites.

Happy to rebase against master and bump to v0.36.2.1 if this is in scope. If the design call is "no DB-row PID probe, the TTL is enough", totally fair — just let me know and I'll close.

Regardless, thanks for the open-source work. First PR I've ever opened on a project that isn't mine, picked yours because of the brain-as-a-substrate thesis — really fun codebase to read.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant