Skip to content

[AAASM-3793] ✅ (test): Add stale-lock detection to packaging test lock#201

Merged
Chisanan232 merged 4 commits into
masterfrom
v0.0.1/AAASM-3793/packaging_lock_stale_detection
Jun 26, 2026
Merged

[AAASM-3793] ✅ (test): Add stale-lock detection to packaging test lock#201
Chisanan232 merged 4 commits into
masterfrom
v0.0.1/AAASM-3793/packaging_lock_stale_detection

Conversation

@Chisanan232

Copy link
Copy Markdown
Contributor

Target

  • Task summary:

    Harden tests/packaging/lock.ts against an orphaned .packaging-test.lock cascade. When a packaging test timed out, vitest abandoned the promise before the finally removed the exclusive lock; the orphan then made every later packaging test spin in the unbounded retry loop until its own per-test timeout, producing a cascade of unrelated "Test timed out" failures.

  • Task tickets:

    • Task ID: AAASM-3793.
    • Relative task IDs:
      • N/A.
    • Relative PRs:
      • N/A.
  • Key point change (optional):

    • Lock metadata. The lock file now stores {pid, timestamp} on acquire.
    • Stale detection + reclaim. A waiter reclaims an orphan when the owning process is dead, when a 0-byte/unparseable lock (e.g. left by a previous run) is older than a short grace, or — as a backstop against pid reuse — when the lock exceeds a large TTL. A live, slow-but-legitimate build (run via blocking execSync) is never reclaimed, so two builds can never race the same dist/.
    • Bounded acquire timeout. Kept just under the 30s per-test budget so a lock held by a live, non-stale process fails with an actionable error instead of an opaque "Test timed out". Every packaging critical section is a multi-second build, so this only ever converts a would-be timeout into a clear message — it never fails a waiter that would otherwise have acquired the lock in time.

Effecting Scope

  • Action Types:
    • 🔧 Fixing bug
    • 🍀 Improving something (performance, code quality, security, etc.)
  • Scopes:
    • 🧪 Testing
      • 🧪 Unit testing
    • Additional description:
      Test-infra only (tests/packaging/). No product/runtime code changes; no public API, transport, or build-output changes.

Description

  • tests/packaging/lock.ts:
    • Write {pid, timestamp} into the lock on acquire.
    • Add isProcessAlive (signal-0 probe), readLockMetadata, lockFileAgeMs, and reclaimStaleLock helpers; reclaim orphaned locks in the retry loop.
    • Add a bounded acquire timeout (28s) that throws a clear error (with the caught EEXIST as cause).
  • tests/packaging/lock-helper.test.ts:
    • Add a regression test for the dead-owner reclaim path.
    • Mock fs.writeFileSync in the existing contention test now that the helper writes metadata.

How to verify

  • pnpm lint — clean.
  • pnpm typecheck — clean.
  • pnpm test — full suite green (338 passed, 2 skipped), including the new reclaim test.
  • Pre-seed a stale lock from a dead process and re-run the packaging suite: the orphan is reclaimed rather than cascading into timeouts.

Closes AAASM-3793

🤖 Generated with Claude Code

Chisanan232 and others added 3 commits June 26, 2026 12:00
Record the holder's pid and acquisition time inside .packaging-test.lock so
a later waiter can distinguish a live holder from an orphaned lock.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019mSz31RysZF6DYToUoBWLf
Replace the unbounded retry loop with stale-lock detection: an orphaned
.packaging-test.lock (dead owner pid, or a 0-byte lock left by a previous run
past a short grace) is reclaimed and retried, and a large TTL backs that up
against pid reuse without ever evicting a live, slow-but-legitimate build.

Also bound the acquire wait just under the 30s per-test timeout so a lock held
by a live, non-stale process fails with an actionable error instead of an
opaque "Test timed out". Previously a leaked lock made every later packaging
test spin until its own timeout, cascading unrelated failures.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019mSz31RysZF6DYToUoBWLf
Cover the path where the lock is held by a dead owner: the helper probes pid
liveness, reclaims the orphan, and acquires the lock so the callback runs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019mSz31RysZF6DYToUoBWLf
@codecov

codecov Bot commented Jun 26, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

The new stale-detection path reads the lock file and probes the owner pid, so
the existing contention-retry test became dependent on whatever real
.packaging-test.lock a concurrent packaging test holds on CI — a stale orphan
left by a sibling test made reclaim fire, breaking the assertions (and on the
slower CI Node 18 runner, timing the test out). Model "no lock on disk" by
mocking readFileSync/statSync so the test deterministically exercises the
not-stale fast-retry path independent of shared filesystem state.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019mSz31RysZF6DYToUoBWLf
@Chisanan232 Chisanan232 reopened this Jun 26, 2026
@sonarqubecloud

Copy link
Copy Markdown

@Chisanan232

Copy link
Copy Markdown
Contributor Author

🤖 Claude Code — PR Review (AAASM-3793)

Verdict: Approve (non-blocking observations below). Independent review of test-infra hardening for tests/packaging/lock.ts.

CI

All green at head 0e6de6fb: test (18/20/22/24), napi-build (20/22), module-smoke (18/20/22), quality, audit, coverage-and-analysis, codecov/patch, CodeQL, SonarCloud. No fix needed.

Scope / acceptance

Change is confined to tests/packaging/lock.ts and tests/packaging/lock-helper.test.ts — test-infra only, no product/runtime, public-API, transport, or build-output change. All four acceptance items for AAASM-3793 are delivered:

  • {pid, timestamp} metadata written into the lock on acquire.
  • ✅ Stale detection/reclaim: dead-pid (signal-0 probe) + empty/unparseable-lock grace (5s) + TTL backstop (120s) for pid reuse.
  • ✅ Bounded acquire timeout (28s) throwing a clear, actionable error with the caught EEXIST as cause.
  • ✅ Regression test for the dead-owner reclaim path.

Locally verified: pnpm lint clean, pnpm typecheck clean, tests/packaging/lock-helper.test.ts 3/3 pass.

Side-effect assessment (this lock guards ALL packaging tests)

  • No false-reclaim of a live, slow-but-legitimate build. During an active build the owning vitest worker pid is alive (the build runs under blocking execSync, which pins that worker) and the lock's timestamp is seconds old. So none of the three staleness signals can fire on a live holder: dead-pid sees an alive pid; the 120s TTL dwarfs any real build; the 5s empty-lock grace only covers the microsecond openSyncwriteFileSync window. Two builds therefore cannot race the same dist/. Dead-pid is correctly the primary discriminator and TTL is a conservative backstop — the safety property holds.
  • No new shared-FS flakiness for other packaging tests. The metadata write is cleared in the finally (closeSync + rmSync), and the hermetic-test mocks are vi.spyOn + vi.restoreAllMocks() in afterEach, so nothing leaks across tests. The 4th commit correctly makes the pre-existing contention test independent of any sibling test's real .packaging-test.lock (mocking readFileSync/statSync to model "no lock on disk" → deterministic not-stale fast-retry path). Good catch and the right fix.

Non-blocking observations

  1. Acquire-timeout budget is justified against a non-uniform per-test timeout. LOCK_ACQUIRE_TIMEOUT_MS = 28_000 and its comment assume a "30s per-test budget," but the budgets are mixed: exports-types.test.ts and package-size.test.ts use }, 90000) (90s), and there is no global testTimeout in vitest.config.ts. For those 90s tests the 28s bound leaves ~62s of budget unused — under contention heavier than today's CI, a legitimate waiter sitting behind a slow-but-live holder could be failed with the "Timed out acquiring packaging lock" error even though it had ample budget to acquire and build. This is still strictly better than the original opaque cascade (the error is clear and bounded) and CI is green, so it is not a blocker — but the justifying comment is inaccurate. Consider deriving the bound from the calling test's timeout (or making it a parameter) so it scales with the 90s tests.
  2. "Primary signal" comment vs. the actual bug. The comment says "Dead-pid detection, not this TTL, is the primary staleness signal," yet the ticket's own root cause is that a vitest test timeout abandons the lock while the worker process stays alive — so dead-pid does not fire in the real cascade, and recovery falls to the TTL backstop (with intervening tests surfacing the clear acquire-timeout error until the 120s TTL elapses). Functionally fine and bounded; the comment just overstates dead-pid's role for this specific failure mode.

Neither observation blocks merge. Recommend addressing the 28s-vs-90s mismatch (and the comment wording) in a quick follow-up if convenient. Not merging — review only.

@Chisanan232 Chisanan232 marked this pull request as ready for review June 26, 2026 06:45
@Chisanan232 Chisanan232 merged commit a493fe3 into master Jun 26, 2026
24 of 30 checks passed
@Chisanan232 Chisanan232 deleted the v0.0.1/AAASM-3793/packaging_lock_stale_detection branch June 26, 2026 07:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant