fix(jobs): retry resets started_at + attempts counters #1376
Open
ethanbeard wants to merge 1 commit into
`gbrain jobs retry <id>` previously left `started_at` populated from the original run. The DB-side wall-clock detector (`handleWallClockTimeouts`) compares `now() - started_at` against `timeout_ms * 2` and dead-letters any job that exceeds the window. For a retry of a dead job, that elapsed time is always > `timeout_ms * 2` (the operator only retries jobs they've noticed are dead, which takes longer than the timeout window itself).

Result pre-fix: the retry got immediately re-dead-lettered by the next worker tick before any handler ran.

Symptom: after `gbrain jobs retry 123`, the job status flips waiting → dead within 60s, with `error_text="wall-clock timeout exceeded"` and `attempts_made` unchanged.

This fix resets three fields on retry:

- `started_at = NULL`: the actual bug fix. `claim()` does `started_at = COALESCE(started_at, now())`, so a fresh start is recorded only when the retry is claimed.
- `attempts_made = 0`: gives the retry a fresh `max_attempts` round. Otherwise a 3/3 dead job would dead-letter on the first failure of the retry, defeating the point.
- `attempts_started = 0`: consistency with `attempts_made`.

`stacktrace` is preserved to retain debug history across retries.

Validated locally: extended the existing `retry dead job re-queues` test to assert the three resets. Original test still passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
What
`gbrain jobs retry <id>` now resets `started_at`, `attempts_made`, and `attempts_started`. Pre-fix, those fields kept the original run's values, which made retries of long-dead jobs unrecoverable.
Why
`handleWallClockTimeouts` (`queue.ts`) dead-letters any active job where `now() - started_at > timeout_ms * 2`. For a retry of a dead job, the elapsed time essentially always exceeds that threshold: the operator only retries jobs they've noticed are dead, which by definition is well after the original timeout window.
Symptom pre-fix: `gbrain jobs retry 123` flips status `dead → waiting`, then the next worker tick (≤60s later) flips it `waiting → dead` again with `error_text="wall-clock timeout exceeded"` and `attempts_made` unchanged. No handler ever runs.
Validated locally on a Supabase brain: retried 9 dead jobs (2-4 days old), and all 9 got re-dead-lettered within 60 seconds without the underlying scripts being invoked.
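To make the failure mode concrete, here is a minimal sketch of the wall-clock rule as described above. It is not the code in `queue.ts`: the `JobRow` shape, the `exceedsWallClock` name, and the 60-second timeout are assumptions chosen for illustration only.

```typescript
// Hypothetical, simplified job shape; the real row in queue.ts may differ.
interface JobRow {
  id: number;
  startedAt: Date | null; // stands in for the started_at column
  timeoutMs: number;      // stands in for the timeout_ms column
}

// Mirrors the rule described in this PR: the time elapsed since started_at
// is compared against twice the job's timeout.
function exceedsWallClock(job: JobRow, now: Date): boolean {
  if (job.startedAt === null) return false; // never started, nothing to measure
  return now.getTime() - job.startedAt.getTime() > job.timeoutMs * 2;
}

// A job that first started three days ago, retried today.
const now = new Date();
const threeDaysAgo = new Date(now.getTime() - 3 * 24 * 60 * 60 * 1000);
const stale: JobRow = { id: 123, startedAt: threeDaysAgo, timeoutMs: 60_000 };

console.log(exceedsWallClock(stale, now));                        // true: stale started_at trips the check
console.log(exceedsWallClock({ ...stale, startedAt: null }, now)); // false: a cleared started_at does not
```

Under these assumptions, any retry that keeps the original `started_at` is already past the threshold before a handler gets a chance to run, which matches the symptom reported above.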
The three resets
- `started_at = NULL`: the actual bug fix. The claim path does `started_at = COALESCE(started_at, now())`, so a fresh start is recorded only when the retry is claimed.
- `attempts_made = 0`: gives the retry a fresh `max_attempts` round. Otherwise a 3/3 dead job would dead-letter again on the first failure of the retry, defeating the explicit operator intent of "try this again."
- `attempts_started = 0`: consistency with `attempts_made`. Both track per-job attempt counters; resetting one but not the other would leave the job in an inconsistent state on the next dashboard render.

`stacktrace` is preserved across retries to retain debug history.
Test
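As a rough illustration of the behaviour the test pins down, the retry and claim steps can be modelled in memory as follows. This is a sketch under stated assumptions, not the project's implementation: the `Job` interface, the `retry` and `claim` function names, the string type for `stacktrace`, and the `active` status set on claim are all hypothetical. Only the column names and the `COALESCE` behaviour come from this PR.

```typescript
// Hypothetical in-memory job record; field names follow the columns named in this PR.
interface Job {
  status: "waiting" | "active" | "dead";
  started_at: Date | null;
  attempts_made: number;
  attempts_started: number;
  max_attempts: number;
  stacktrace: string; // assumed type, for illustration only
}

// Sketch of retry: re-queue the job and reset the three fields.
// stacktrace is carried over untouched so debug history survives.
function retry(job: Job): Job {
  return {
    ...job,
    status: "waiting",
    started_at: null,
    attempts_made: 0,
    attempts_started: 0,
  };
}

// Sketch of claim: the in-memory equivalent of
// started_at = COALESCE(started_at, now()).
// An existing started_at wins; now is used only when it is null.
function claim(job: Job, now: Date): Job {
  return {
    ...job,
    status: "active", // assumption: a claimed job becomes active
    started_at: job.started_at ?? now,
  };
}
```

The `??` in `claim` is why the reset matters in this model: without `started_at: null` in `retry`, the claim would keep the original run's timestamp rather than recording a fresh one.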
Extended the existing `retry dead job re-queues` test in `test/minions.test.ts` with a `retry resets started_at and attempts counters` case that asserts all three fields are reset and that pre-retry they hold the post-failure values. Original test still passes.
Compat
The values written by the reset (`NULL`/`0`) are already valid states for these fields (they are set on insert), so existing consumers reading them see no new shapes. No schema change. No CLI signature change.
Adjacent
Surfaced while operating on the alt-account-shared Postgres brain, the same install where #1177 (`--lock-duration`/`--low-pri-rate-cap`) and #1185 (doctor dead-jobs surface) came from. Branch rooted on v0.35.8.0 to match those PRs' shape; happy to rebase forward if it would help land.