fix(jobs): retry resets started_at + attempts counters by ethanbeard · Pull Request #1376 · garrytan/gbrain

ethanbeard · 2026-05-24T19:50:34Z

What

gbrain jobs retry <id> now resets started_at, attempts_made, and attempts_started. Pre-fix, those fields kept the original run's values, which made retries of long-dead jobs unrecoverable.

Why

handleWallClockTimeouts (queue.ts) dead-letters any active job where now() - started_at > timeout_ms * 2. On a retry of a dead job, that threshold is essentially always exceeded: operators only retry jobs they have noticed are dead, which happens well after the original timeout window has passed.

Symptom pre-fix: gbrain jobs retry 123 flips status dead → waiting, then the next worker tick (≤60s later) flips it waiting → dead again with error_text="wall-clock timeout exceeded" and attempts_made unchanged. No handler ever runs.

Validated locally on a Supabase brain: retried 9 dead jobs (2-4 days old), and all 9 were re-dead-lettered within 60 seconds without the underlying scripts being invoked.
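
The wall-clock check described above can be sketched as a small predicate. This is an illustration only, not the code in queue.ts: the `JobTiming` type, the `exceedsWallClock` name, and the 10-minute timeout are assumptions made for the example. Treating a NULL started_at as "not exceeded" follows SQL comparison semantics, where a comparison against NULL never matches a WHERE clause.

```typescript
// Hypothetical shape: only the two fields the wall-clock rule reads.
interface JobTiming {
  started_at: Date | null;
  timeout_ms: number;
}

// Mirrors the rule from the PR text: now() - started_at > timeout_ms * 2.
// A null started_at never matches, as a SQL comparison with NULL would not.
function exceedsWallClock(job: JobTiming, now: Date): boolean {
  if (job.started_at === null) return false;
  return now.getTime() - job.started_at.getTime() > job.timeout_ms * 2;
}

const now = new Date("2026-05-24T12:00:00Z");
const tenMinutes = 10 * 60 * 1000; // assumed timeout, for illustration

// Pre-fix retry: started_at still holds the original run's value (3 days old).
const staleRetry: JobTiming = {
  started_at: new Date("2026-05-21T12:00:00Z"),
  timeout_ms: tenMinutes,
};

// Post-fix retry: started_at was cleared, so the claim records a fresh start.
const freshRetry: JobTiming = { started_at: now, timeout_ms: tenMinutes };

console.log(exceedsWallClock(staleRetry, now)); // true
console.log(exceedsWallClock(freshRetry, now)); // false
```

With a stale started_at, the elapsed time is measured from the original run, so any job retried more than twice its timeout after that run started trips the check immediately.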

The three resets

- started_at = NULL: the actual bug fix. The claim path does started_at = COALESCE(started_at, now()), so a fresh start is recorded only when the retry is claimed.
- attempts_made = 0: gives the retry a fresh max_attempts round. Otherwise a 3/3 dead job would dead-letter again on the first failure of the retry, defeating the explicit operator intent of "try this again."
- attempts_started = 0: consistency with attempts_made. Both are per-job attempt counters; resetting one but not the other would leave the job in an inconsistent state on the next dashboard render.

stacktrace is preserved across retries to retain debug history.
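
As a rough model of how the three resets interact with the claim path, the sketch below uses an in-memory row instead of SQL. The `JobRow` type and the `retryJob` and `claimJob` names are hypothetical, the stacktrace is modelled as a plain string, and the status transitions are assumed from the PR text; the real logic lives in the queue's SQL.

```typescript
// Hypothetical in-memory stand-in for a job row; field names follow the PR.
interface JobRow {
  status: "waiting" | "active" | "dead";
  started_at: Date | null;
  attempts_made: number;
  attempts_started: number;
  stacktrace: string | null;
}

// Retry: re-queue the job and apply the three resets.
// stacktrace is carried over untouched to retain debug history.
function retryJob(job: JobRow): JobRow {
  return {
    ...job,
    status: "waiting",
    started_at: null,
    attempts_made: 0,
    attempts_started: 0,
  };
}

// Claim: models started_at = COALESCE(started_at, now()).
// An existing started_at is kept; a null one is replaced with the claim time.
function claimJob(job: JobRow, now: Date): JobRow {
  return { ...job, status: "active", started_at: job.started_at ?? now };
}

const deadJob: JobRow = {
  status: "dead",
  started_at: new Date("2026-05-21T12:00:00Z"),
  attempts_made: 3,
  attempts_started: 3,
  stacktrace: "Error: original failure",
};

const claimedAt = new Date("2026-05-24T12:00:00Z");
const claimed = claimJob(retryJob(deadJob), claimedAt);

console.log(claimed.started_at?.toISOString()); // 2026-05-24T12:00:00.000Z
console.log(claimed.attempts_made); // 0
console.log(claimed.stacktrace); // Error: original failure
```

Without the reset, the `??` in `claimJob` would keep the three-day-old started_at, which is the pre-fix behavior.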

Test

Extended the existing "retry dead job re-queues" test in test/minions.test.ts with a "retry resets started_at and attempts counters" case that asserts all three fields are reset and that, before the retry, they hold the post-failure values. The original test still passes.

Compat

The reset values (NULL and 0) are the same values these fields are given on insert, so existing consumers reading them see no new shapes. No schema change. No CLI signature change.

Adjacent

Surfaced while operating on the alt-account-shared Postgres brain — same install where #1177 (--lock-duration / --low-pri-rate-cap) and #1185 (doctor dead-jobs surface) came from. Branch rooted on v0.35.8.0 to match those PRs' shape; happy to rebase forward if it would help land.

`gbrain jobs retry <id>` previously left `started_at` populated from the original run. The DB-side wall-clock detector (`handleWallClockTimeouts`) compares `now() - started_at` against `timeout_ms * 2` and dead-letters any job that exceeds the window. For a retry of a dead job, that elapsed time is always > `timeout_ms * 2` (the operator only retries jobs they've noticed are dead, which takes longer than the timeout window itself). Result pre-fix: the retry got immediately re-dead-lettered by the next worker tick before any handler ran.

Symptom: `gbrain jobs retry 123` → job status flips waiting → dead within 60s, error_text="wall-clock timeout exceeded", attempts_made unchanged.

This fix resets three fields on retry:

- `started_at = NULL`: the actual bug fix. claim() does `started_at = COALESCE(started_at, now())` so a fresh start is recorded only when the retry is claimed.
- `attempts_made = 0`: gives the retry a fresh max_attempts round. Otherwise a 3/3 dead job would dead-letter on first failure of the retry, defeating the point.
- `attempts_started = 0`: consistency with attempts_made.

`stacktrace` is preserved to retain debug history across retries.

Validated locally: extended the existing `retry dead job re-queues` test to assert the three resets. Original test still passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ethanbeard wants to merge 1 commit into garrytan:master from ethanbeard:feat/retry-resets-started-at
