[AAASM-3793] ✅ (test): Add stale-lock detection to packaging test lock by Chisanan232 · Pull Request #201 · ai-agent-assembly/node-sdk

Chisanan232 · 2026-06-26T04:20:51Z

Target

Task summary:

Harden tests/packaging/lock.ts against an orphaned .packaging-test.lock cascade. When a packaging test timed out, vitest abandoned the promise before the finally removed the exclusive lock; the orphan then made every later packaging test spin in the unbounded retry loop until its own per-test timeout, producing a cascade of unrelated "Test timed out" failures.
Task tickets:
- Task ID: AAASM-3793.
- Relative task IDs:
  - N/A.
- Relative PRs:
  - N/A.
Key point change (optional):
- Lock metadata. The lock file now stores {pid, timestamp} on acquire.
- Stale detection + reclaim. A waiter reclaims an orphan when the owning process is dead, when a 0-byte/unparseable lock (e.g. left by a previous run) is older than a short grace, or — as a backstop against pid reuse — when the lock exceeds a large TTL. A live, slow-but-legitimate build (run via blocking execSync) is never reclaimed, so two builds can never race the same dist/.
- Bounded acquire timeout. Kept just under the 30s per-test budget so a lock held by a live, non-stale process fails with an actionable error instead of an opaque "Test timed out". Every packaging critical section is a multi-second build, so this only ever converts a would-be timeout into a clear message — it never fails a waiter that would otherwise have acquired the lock in time.

Effecting Scope

Action Types:
- 🔧 Fixing bug
- 🍀 Improving something (performance, code quality, security, etc.)
Scopes:
- 🧪 Testing
  - 🧪 Unit testing
- Additional description:
  Test-infra only (tests/packaging/). No product/runtime code changes; no public API, transport, or build-output changes.

Description

tests/packaging/lock.ts:
- Write {pid, timestamp} into the lock on acquire.
- Add isProcessAlive (signal-0 probe), readLockMetadata, lockFileAgeMs, and reclaimStaleLock helpers; reclaim orphaned locks in the retry loop.
- Add a bounded acquire timeout (28s) that throws a clear error (with the caught EEXIST as cause).
tests/packaging/lock-helper.test.ts:
- Add a regression test for the dead-owner reclaim path.
- Mock fs.writeFileSync in the existing contention test now that the helper writes metadata.

How to verify

pnpm lint — clean.
pnpm typecheck — clean.
pnpm test — full suite green (338 passed, 2 skipped), including the new reclaim test.
Pre-seed a stale lock from a dead process and re-run the packaging suite: the orphan is reclaimed rather than cascading into timeouts.

Closes AAASM-3793

🤖 Generated with Claude Code

Record the holder's pid and acquisition time inside .packaging-test.lock so a later waiter can distinguish a live holder from an orphaned lock. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019mSz31RysZF6DYToUoBWLf

Replace the unbounded retry loop with stale-lock detection: an orphaned .packaging-test.lock (dead owner pid, or a 0-byte lock left by a previous run past a short grace) is reclaimed and retried, and a large TTL backs that up against pid reuse without ever evicting a live, slow-but-legitimate build. Also bound the acquire wait just under the 30s per-test timeout so a lock held by a live, non-stale process fails with an actionable error instead of an opaque "Test timed out". Previously a leaked lock made every later packaging test spin until its own timeout, cascading unrelated failures. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019mSz31RysZF6DYToUoBWLf

Cover the path where the lock is held by a dead owner: the helper probes pid liveness, reclaims the orphan, and acquires the lock so the callback runs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019mSz31RysZF6DYToUoBWLf

codecov · 2026-06-26T04:22:34Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

The new stale-detection path reads the lock file and probes the owner pid, so the existing contention-retry test became dependent on whatever real .packaging-test.lock a concurrent packaging test holds on CI — a stale orphan left by a sibling test made reclaim fire, breaking the assertions (and on the slower CI Node 18 runner, timing the test out). Model "no lock on disk" by mocking readFileSync/statSync so the test deterministically exercises the not-stale fast-retry path independent of shared filesystem state. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019mSz31RysZF6DYToUoBWLf

sonarqubecloud · 2026-06-26T05:19:33Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

Chisanan232 · 2026-06-26T05:44:15Z

🤖 Claude Code — PR Review (AAASM-3793)

Verdict: Approve (non-blocking observations below). Independent review of test-infra hardening for tests/packaging/lock.ts.

CI

All green at head 0e6de6fb: test (18/20/22/24), napi-build (20/22), module-smoke (18/20/22), quality, audit, coverage-and-analysis, codecov/patch, CodeQL, SonarCloud. No fix needed.

Scope / acceptance

Change is confined to tests/packaging/lock.ts and tests/packaging/lock-helper.test.ts — test-infra only, no product/runtime, public-API, transport, or build-output change. All four acceptance items for AAASM-3793 are delivered:

✅ {pid, timestamp} metadata written into the lock on acquire.
✅ Stale detection/reclaim: dead-pid (signal-0 probe) + empty/unparseable-lock grace (5s) + TTL backstop (120s) for pid reuse.
✅ Bounded acquire timeout (28s) throwing a clear, actionable error with the caught EEXIST as cause.
✅ Regression test for the dead-owner reclaim path.

Locally verified: pnpm lint clean, pnpm typecheck clean, tests/packaging/lock-helper.test.ts 3/3 pass.

Side-effect assessment (this lock guards ALL packaging tests)

No false-reclaim of a live, slow-but-legitimate build. During an active build the owning vitest worker pid is alive (the build runs under blocking execSync, which pins that worker) and the lock's timestamp is seconds old. So none of the three staleness signals can fire on a live holder: dead-pid sees an alive pid; the 120s TTL dwarfs any real build; the 5s empty-lock grace only covers the microsecond openSync→writeFileSync window. Two builds therefore cannot race the same dist/. Dead-pid is correctly the primary discriminator and TTL is a conservative backstop — the safety property holds.
No new shared-FS flakiness for other packaging tests. The metadata write is cleared in the finally (closeSync + rmSync), and the hermetic-test mocks are vi.spyOn + vi.restoreAllMocks() in afterEach, so nothing leaks across tests. The 4th commit correctly makes the pre-existing contention test independent of any sibling test's real .packaging-test.lock (mocking readFileSync/statSync to model "no lock on disk" → deterministic not-stale fast-retry path). Good catch and the right fix.

Non-blocking observations

Acquire-timeout budget is justified against a non-uniform per-test timeout. LOCK_ACQUIRE_TIMEOUT_MS = 28_000 and its comment assume a "30s per-test budget," but the budgets are mixed: exports-types.test.ts and package-size.test.ts use }, 90000) (90s), and there is no global testTimeout in vitest.config.ts. For those 90s tests the 28s bound leaves ~62s of budget unused — under contention heavier than today's CI, a legitimate waiter sitting behind a slow-but-live holder could be failed with the "Timed out acquiring packaging lock" error even though it had ample budget to acquire and build. This is still strictly better than the original opaque cascade (the error is clear and bounded) and CI is green, so it is not a blocker — but the justifying comment is inaccurate. Consider deriving the bound from the calling test's timeout (or making it a parameter) so it scales with the 90s tests.
"Primary signal" comment vs. the actual bug. The comment says "Dead-pid detection, not this TTL, is the primary staleness signal," yet the ticket's own root cause is that a vitest test timeout abandons the lock while the worker process stays alive — so dead-pid does not fire in the real cascade, and recovery falls to the TTL backstop (with intervening tests surfacing the clear acquire-timeout error until the 120s TTL elapses). Functionally fine and bounded; the comment just overstates dead-pid's role for this specific failure mode.

Neither observation blocks merge. Recommend addressing the 28s-vs-90s mismatch (and the comment wording) in a quick follow-up if convenient. Not merging — review only.

Chisanan232 and others added 3 commits June 26, 2026 12:00

Chisanan232 closed this Jun 26, 2026

Chisanan232 reopened this Jun 26, 2026

Chisanan232 marked this pull request as ready for review June 26, 2026 06:45

Chisanan232 merged commit a493fe3 into master Jun 26, 2026
24 of 30 checks passed

Chisanan232 deleted the v0.0.1/AAASM-3793/packaging_lock_stale_detection branch June 26, 2026 07:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[AAASM-3793] ✅ (test): Add stale-lock detection to packaging test lock#201

[AAASM-3793] ✅ (test): Add stale-lock detection to packaging test lock#201
Chisanan232 merged 4 commits into
masterfrom
v0.0.1/AAASM-3793/packaging_lock_stale_detection

Chisanan232 commented Jun 26, 2026

Uh oh!

codecov Bot commented Jun 26, 2026

Uh oh!

sonarqubecloud Bot commented Jun 26, 2026

Uh oh!

Chisanan232 commented Jun 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

Chisanan232 commented Jun 26, 2026

Target

Task summary:

Task tickets:

Key point change (optional):

Effecting Scope

Description

How to verify

Uh oh!

codecov Bot commented Jun 26, 2026

Codecov Report

Uh oh!

sonarqubecloud Bot commented Jun 26, 2026

Quality Gate passed

Uh oh!

Chisanan232 commented Jun 26, 2026

🤖 Claude Code — PR Review (AAASM-3793)

CI

Scope / acceptance

Side-effect assessment (this lock guards ALL packaging tests)

Non-blocking observations

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant