Skip to content

feat(groomer): serialize runs behind a DB lock#517

Merged
joryirving merged 2 commits into
mainfrom
feat/groomer-run-lock
Jul 1, 2026
Merged

feat(groomer): serialize runs behind a DB lock#517
joryirving merged 2 commits into
mainfrom
feat/groomer-run-lock

Conversation

@joryirving

Copy link
Copy Markdown
Contributor

Summary

  • Add a run-level DB lock for the hosted groomer (acquireGroomerLock/releaseGroomerLock in src/lib/groomer/groomer-lock.ts). runHostedGroomer takes it up front, bails (returns null, like an empty tick) on conflict, and releases in finally.

Why

/api/groomer/run had no global lock — candidate selection (selector.ts) and per-issue lease acquisition (run.ts) aren't atomic, and the "other lease" check only compares different agent names. So two concurrent groomer runs could select + groom the same issue (duplicate LLM call + label writes). This makes the groomer safe on a schedule / >1 replica — the prerequisite for internalizing it in #504.

Design

  • Mirrors sync-lock.ts: first-writer-wins single row, 30-min stale reclaim, double-check inside a transaction, conditional release by token.
  • No migration — reuses the generic sync_lock table with id="groomer" and a random UUID token (its syncRunId column is a plain nullable string, no FK). Sync uses id="global"; the rows don't interfere.

Verification

  • vitest run src/lib/groomer/ → 189 passing (2 new: bails-without-selecting-when-locked, releases-lock-after-run). tsc → 0; eslint → clean.

Closes #503 · part of #498

/api/groomer/run had no run-level lock: candidate selection (selector.ts) and
per-issue lease acquisition (run.ts) are not atomic, and the "other lease"
check only looks at different agent names — so two concurrent groomer runs
could select and groom the same issue (duplicate LLM call + label writes).

Add acquireGroomerLock/releaseGroomerLock (src/lib/groomer/groomer-lock.ts),
mirroring sync-lock.ts: a first-writer-wins single-row lock with 30-min stale
reclaim, reusing the generic sync_lock table with id="groomer" and a random
token (its syncRunId column is a plain nullable string, no FK) — so no
migration. runHostedGroomer acquires the lock up front, bails (returns null,
like an empty tick) on conflict, and releases in finally. This makes the
groomer safe to run on a schedule / >1 replica (prereq for #504).

Closes #503
its-saffron[bot]

This comment was marked as outdated.

@joryirving joryirving enabled auto-merge (squash) July 1, 2026 21:43
@joryirving joryirving merged commit 4d6529a into main Jul 1, 2026
6 checks passed
@joryirving joryirving deleted the feat/groomer-run-lock branch July 1, 2026 21:45

@its-saffron its-saffron Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AI Automated Review

Full PR review.

Analysis engine: MiniMax-M2.7@https://litellm.jory.dev/v1 (anthropic) — escalated (fast_low_confidence)

Recommendation: Approve

This PR correctly implements a DB-backed lock for the hosted groomer, solving the race condition described in issue PR 503 where concurrent groomer runs could double-groom the same issue.

Change-by-Change Findings

src/lib/groomer/groomer-lock.ts (new file, 58 lines)

  • acquireGroomerLock(): Checks for existing lock (stale if >30 min), deletes stale lock, then does a transaction with double-check + create. Throws/rethrows on unexpected errors, returns { locked: true, token } or { locked: false }.
  • releaseGroomerLock(token): Conditional deleteMany by id AND syncRunId token — guards against releasing another run's lock.
  • Lock ID: Uses LOCK_ID = "groomer", distinct from sync's "global" — rows don't interfere.
  • Token: Random UUID via crypto.randomUUID(), stored in syncRunId column (already nullable per schema).

src/lib/groomer/run.ts (modified)

  • Imports acquireGroomerLock/releaseGroomerLock.
  • runHostedGroomer now acquires the lock upfront, bails with null on conflict, and releases in finally.
  • Original logic refactored into executeGroomerRun (called only when lock is held).
  • GroomerDeps interface extended with lock dependencies for testability.

src/lib/groomer/run.test.ts (modified)

  • Added mocks for acquireGroomerLock/releaseGroomerLock.
  • "bails without selecting when the groomer lock is held": Verifies lock conflict returns null without calling selectGroomingCandidate or releaseGroomerLock.
  • "releases the groomer lock after a run completes": Verifies acquireGroomerLock called once and releaseGroomerLock called with correct token.

Sources

  • PR body and diff confirm 189 tests passing, tsc clean, eslint clean.
  • CI check results: Tests ✅, Lint ✅, Build ✅, Typecheck ✅, Docker Build ✅.
  • Tool harness: src/lib/sync-lock.ts read successfully; confirmed the pattern this PR mirrors.
  • Repository history: confirms this commit (8a3aed2) adds the groomer lock.

Standards Compliance

Convention Status Evidence
Error handling (error instanceof Error) groomer-lock.ts line ~44: if (err instanceof Error && err.message === "already_locked")
Prisma schema reuse (no unnecessary migrations) Reuses existing sync_lock table with id="groomer"; no schema change needed
Lock semantics match existing pattern First-writer-wins, 30-min stale reclaim, double-check in transaction, conditional release by token — mirrors sync-lock.ts
Lock is always released (no leak) finally block in runHostedGroomer guarantees release

Linked Issue Fit

Issue PR 503 requested: a DB lock for groomer analogous to sync-lock.ts, preventing concurrent runs from double-grooming, protecting manual UI trigger racing scheduler, and being safe on >1 replica.

This PR satisfies all requirements:

  • ✅ DB lock created (groomer-lock.ts), same pattern as sync-lock.ts
  • ✅ Race condition eliminated: lock acquired before candidate selection
  • ✅ Manual trigger + scheduler safe from race
  • ✅ Multi-replica safe (DB lock works across replicas)
  • ✅ No migration needed (reuses sync_lock table)

Issue PR 504 is referenced in the PR body as a follow-up (internalizing groomer on >1 replica), but PR 503 is the prerequisite that this PR completes.

Evidence Provider Findings

No evidence providers configured for this review.

Tool Harness Findings

The tool harness attempted to verify sync-lock.ts and the groomer directory via GitHub API — both returned 404. However, git_status_short and read_file for src/lib/sync-lock.ts succeeded. The read confirmed:

  • sync-lock.ts uses the same syncLock table
  • Same MAX_AGE_MS = 30 * 60 * 1000 constant
  • Same double-check-in-transaction pattern
  • Token-based conditional release

The groomer-lock implementation correctly mirrors this established pattern.

Unknowns / Needs Verification

None. The PR is fully verifiable from the corpus:

  • Lock semantics verified against sync-lock.ts (read directly)
  • Tests confirm bail-on-conflict and release-on-completion paths
  • CI confirms all checks pass
  • No schema migration was needed (confirms syncRunId is already nullable)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(groomer): add a DB lock so the groomer is concurrency/multi-replica safe

1 participant