fix(jobs): retry resets started_at + attempts counters #1376

Open
ethanbeard wants to merge 1 commit into garrytan:master from ethanbeard:feat/retry-resets-started-at

@ethanbeard
What

`gbrain jobs retry <id>` now resets `started_at`, `attempts_made`, and `attempts_started`. Before this fix, those fields kept the original run's values, so a retried long-dead job was dead-lettered again before it could run.

Why

`handleWallClockTimeouts` (`queue.ts`) dead-letters any active job where `now() - started_at > timeout_ms * 2`. For a retry of a dead job, that window has essentially always elapsed: the operator only retries jobs they have noticed are dead, which happens well after the original timeout window.
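The elapsed-time check described above can be sketched in isolation. This is a minimal illustration, not the code in `queue.ts`: the `JobTiming` type, the `exceedsWallClock` helper, and the ten-minute timeout are all hypothetical, and the status filter and dead-letter write are left out.

```typescript
// Hypothetical in-memory stand-in for a job row; only fields named in this PR.
interface JobTiming {
  started_at: Date | null; // null until a worker claims the job
  timeout_ms: number;
}

// The elapsed-time predicate described above: now() - started_at > timeout_ms * 2.
// Status filtering and the actual dead-letter write are deliberately left out.
function exceedsWallClock(job: JobTiming, now: Date): boolean {
  if (job.started_at === null) return false; // nothing to measure yet
  return now.getTime() - job.started_at.getTime() > job.timeout_ms * 2;
}

const now = new Date("2025-01-04T00:00:00Z");
const tenMinutes = 10 * 60 * 1000; // illustrative timeout, not a project default

// A job that first started three days ago trips the check as soon as it is retried.
const staleRetry: JobTiming = {
  started_at: new Date("2025-01-01T00:00:00Z"),
  timeout_ms: tenMinutes,
};
// With started_at cleared, the same job has no elapsed time to compare.
const resetRetry: JobTiming = { ...staleRetry, started_at: null };

console.log(exceedsWallClock(staleRetry, now)); // true
console.log(exceedsWallClock(resetRetry, now)); // false
```

The second call shows why clearing `started_at` is enough to get the retry past the detector: with no start time recorded, there is no elapsed interval to compare against the window.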

Symptom pre-fix: `gbrain jobs retry 123` flips status dead → waiting, then the next worker tick (≤60s later) flips it waiting → dead again with `error_text="wall-clock timeout exceeded"` and `attempts_made` unchanged. No handler ever runs.

Reproduced locally, pre-fix, on a Supabase brain: retried 9 dead jobs (2 to 4 days old), and all 9 were re-dead-lettered within 60 seconds without the underlying scripts being invoked.

The three resets

  • `started_at = NULL`: the actual bug fix. The claim path does `started_at = COALESCE(started_at, now())`, so a fresh start is recorded only when the retry is claimed.
  • `attempts_made = 0`: gives the retry a fresh `max_attempts` round. Otherwise a 3/3 dead job would dead-letter again on the first failure of the retry, defeating the explicit operator intent of "try this again."
  • `attempts_started = 0`: consistency with `attempts_made`. Both are per-job attempt counters; resetting one but not the other would leave the job in an inconsistent state on the next dashboard render.
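The interaction between the resets and the claim-side `COALESCE` can be sketched with plain objects. This is an assumption-laden illustration rather than the project's code: `JobRecord`, `retryJob`, and `claimJob` are hypothetical names, the field types are guessed, and any other bookkeeping the real claim path performs is omitted.

```typescript
// Hypothetical in-memory job record; field names follow the PR, types are assumed.
interface JobRecord {
  status: "waiting" | "active" | "dead";
  started_at: Date | null;
  attempts_made: number;
  attempts_started: number;
  stacktrace: string | null;
}

// Retry as described: re-queue and reset the three fields, keep stacktrace.
function retryJob(job: JobRecord): JobRecord {
  return { ...job, status: "waiting", started_at: null, attempts_made: 0, attempts_started: 0 };
}

// Claim-side equivalent of started_at = COALESCE(started_at, now()).
// Any other bookkeeping the real claim path performs is omitted.
function claimJob(job: JobRecord, claimedAt: Date): JobRecord {
  return { ...job, status: "active", started_at: job.started_at ?? claimedAt };
}

const deadJob: JobRecord = {
  status: "dead",
  started_at: new Date("2025-01-01T00:00:00Z"),
  attempts_made: 3,
  attempts_started: 3,
  stacktrace: "Error: upstream failed",
};

const claimTime = new Date("2025-01-04T00:00:00Z");

// With the resets, the claim records a fresh start time and the counters begin at 0.
const reclaimed = claimJob(retryJob(deadJob), claimTime);
console.log(reclaimed.started_at?.toISOString()); // 2025-01-04T00:00:00.000Z
console.log(reclaimed.attempts_made, reclaimed.stacktrace); // 0 Error: upstream failed

// Without the reset, COALESCE keeps the three-day-old timestamp.
const preFix = claimJob({ ...deadJob, status: "waiting" }, claimTime);
console.log(preFix.started_at?.toISOString()); // 2025-01-01T00:00:00.000Z
```

The last case is the pre-fix behavior: because `COALESCE` only fills in a missing value, a retry that keeps its old `started_at` never gets a new start time.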

`stacktrace` is preserved across retries to retain debug history.

Test

Extended the existing `retry dead job re-queues` test in `test/minions.test.ts` with a `retry resets started_at and attempts counters` case that asserts all three fields are reset and that, before the retry, they hold the post-failure values. The original test still passes.

Compat

The reset values (NULL and 0) are already valid states for these fields, since they are set on insert, so existing consumers reading them see no new shapes. No schema change. No CLI signature change.

Adjacent

Surfaced while operating on the alt-account-shared Postgres brain, the same install where #1177 (`--lock-duration` / `--low-pri-rate-cap`) and #1185 (doctor dead-jobs surface) came from. The branch is rooted on v0.35.8.0 to match those PRs' shape; happy to rebase forward if that would help it land.

`gbrain jobs retry <id>` previously left `started_at` populated from the
original run. The DB-side wall-clock detector (`handleWallClockTimeouts`)
compares `now() - started_at` against `timeout_ms * 2` and dead-letters
any job that exceeds the window. For a retry of a dead job, that elapsed
time is always > `timeout_ms * 2` (the operator only retries jobs they've
noticed are dead, which takes longer than the timeout window itself).

Result pre-fix: the retry got immediately re-dead-lettered by the next
worker tick before any handler ran. Symptom: `gbrain jobs retry 123` →
job status flips waiting → dead within 60s, error_text="wall-clock
timeout exceeded", attempts_made unchanged.

This fix resets three fields on retry:

- `started_at = NULL` — the actual bug fix. claim() does
  `started_at = COALESCE(started_at, now())` so a fresh start is
  recorded only when the retry is claimed.
- `attempts_made = 0` — gives the retry a fresh max_attempts round.
  Otherwise a 3/3 dead job would dead-letter on first failure of the
  retry, defeating the point.
- `attempts_started = 0` — consistency with attempts_made.

`stacktrace` is preserved to retain debug history across retries.

Validated locally: extended the existing `retry dead job re-queues`
test to assert the three resets. Original test still passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>