fix: prevent cron job flood after server restart and interval drift #796
Merged
viettranx merged 2 commits into nextlevelbuilder:main on Apr 9, 2026
On startup, recomputeStaleJobs only fixed jobs with next_run_at IS NULL. Jobs whose next_run_at was set but in the past (from server downtime) were not touched, causing ALL past-due jobs to fire simultaneously on the first scheduler tick.

Additionally, interval-based (every) jobs computed next_run from now instead of the original scheduled time, causing permanent synchronization after any simultaneous execution and cumulative drift.

Fix both issues across all 3 store implementations (PG, SQLite, JSON):
- Advance past-due next_run_at to next future time on startup
- Use anchor-based next_run computation for interval jobs
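To make the drift problem concrete, here is a minimal sketch of anchor-based scheduling. It is not the code from this PR: the names nextAnchoredMs and nextAnchored, the millisecond representation, and the 7-second run duration are all assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// nextAnchoredMs returns the first point on the grid anchor + k*interval
// (k >= 1) that is strictly after now. All values are Unix milliseconds.
// Hypothetical helper for illustration, not the project's function.
func nextAnchoredMs(anchorMs, intervalMs, nowMs int64) int64 {
	if intervalMs <= 0 {
		return nowMs // defensive: no valid interval to schedule on
	}
	if nowMs < anchorMs {
		return anchorMs + intervalMs
	}
	elapsed := nowMs - anchorMs
	return anchorMs + (elapsed/intervalMs+1)*intervalMs
}

// nextAnchored is a time.Time wrapper around nextAnchoredMs.
func nextAnchored(anchor time.Time, interval time.Duration, now time.Time) time.Time {
	ms := nextAnchoredMs(anchor.UnixMilli(), interval.Milliseconds(), now.UnixMilli())
	return time.UnixMilli(ms).UTC()
}

func main() {
	start := time.Date(2026, 4, 9, 10, 0, 0, 0, time.UTC)
	interval := 5 * time.Minute
	runTime := 7 * time.Second // assumed duration of each run

	nowBased, anchored := start, start
	for i := 0; i < 3; i++ {
		// Old behaviour: next run is measured from when the run finished.
		nowBased = nowBased.Add(runTime).Add(interval)
		// Anchor-based: next run stays on the original 5-minute grid.
		anchored = nextAnchored(start, interval, anchored.Add(runTime))
	}
	fmt.Println(nowBased.Format("15:04:05")) // 10:15:21 (7s of drift per run)
	fmt.Println(anchored.Format("15:04:05")) // 10:15:00 (no drift)
}
```

Because the anchored result depends only on the job's own grid, two jobs with the same interval but different offsets keep those offsets even if they happen to execute at the same moment.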
…ite manual run bug

- Fix SQLite RunJob silently skipping: add next_run_at=NULL claim and reloadClaimed param to match PG store behavior
- Nil-anchor guard for manual RunJob: use now+interval instead of anchor-based scheduling for manual triggers across PG/SQLite stores
- Replace O(N) advance loop with O(1) modular arithmetic to prevent CPU starvation after prolonged downtime with short-interval jobs
- Fix JSON store Start() leaving past-due at-jobs as enabled zombies
- Add zero-anchor guard in JSON store to match DB stores' nil check
- Add startup synchronization limitation comments in all 3 stores
- Add 4 unit tests: flood prevention, anchor arithmetic, RunJob scheduling, and past-due at-job disabling
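The loop-to-arithmetic change and the nil-anchor guard can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the merged code: advanceLoop, advanceMod, and nextAfterRun are hypothetical names, and times are plain Unix milliseconds.

```go
package main

import "fmt"

// advanceLoop steps forward one interval at a time until the result is in
// the future. Cost is O(N) in the number of missed intervals.
func advanceLoop(nextMs, intervalMs, nowMs int64) int64 {
	if intervalMs <= 0 {
		return nextMs // guard against an infinite loop
	}
	for nextMs <= nowMs {
		nextMs += intervalMs
	}
	return nextMs
}

// advanceMod computes the same result in O(1) with integer division.
func advanceMod(nextMs, intervalMs, nowMs int64) int64 {
	if intervalMs <= 0 || nextMs > nowMs {
		return nextMs
	}
	missed := (nowMs-nextMs)/intervalMs + 1
	return nextMs + missed*intervalMs
}

// nextAfterRun picks the next run time once a job has executed.
// A nil anchor stands for a manual trigger with no scheduled time,
// in which case the sketch falls back to now + interval.
func nextAfterRun(anchorMs *int64, intervalMs, nowMs int64) int64 {
	if anchorMs == nil {
		return nowMs + intervalMs
	}
	return advanceMod(*anchorMs, intervalMs, nowMs)
}

func main() {
	// A 1-second job that was due at t=1000ms, checked at t=3500ms.
	fmt.Println(advanceLoop(1000, 1000, 3500)) // 4000
	fmt.Println(advanceMod(1000, 1000, 3500))  // 4000

	// Manual run at t=42ms of a 5-minute job: no anchor, so now + interval.
	fmt.Println(nextAfterRun(nil, 300000, 42)) // 300042
}
```

With a one-second interval and weeks of downtime, the loop version would iterate millions of times per job, while the arithmetic version does a single division.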
Review & Improvements Applied

Original fix correctly identified both root causes (flood after restart + interval drift). During review, found and fixed additional issues before merging:

Fixes added on top
Tests added
Known limitation (documented in code)
After prolonged downtime, all past-due |
Summary
- recomputeStaleJobs() only fixed jobs with next_run_at IS NULL. Jobs whose next_run_at was set but in the past (from server downtime) were not handled, causing ALL past-due jobs to fire simultaneously on the first scheduler tick. Now also advances past-due next_run_at to the next future time.
- Interval-based (every) jobs computed next_run_at from now (execution end time) instead of the original scheduled time. This caused permanent synchronization after any simultaneous execution (e.g. restart) and cumulative timing drift. Now uses anchor-based computation to preserve each job's original offset.

Both fixes applied across all 3 store implementations (PG, SQLite, JSON).
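As a rough illustration of the startup pass summarized above, the sketch below advances past-due every jobs along their own grid and disables past-due at jobs (the latter comes from the follow-up commit). The Job struct, its field names, and recomputeOnStartup are hypothetical and do not reflect the store schemas; cron-expression jobs are left out.

```go
package main

import "fmt"

// Job is a simplified stand-in for a stored cron job (assumed shape).
type Job struct {
	Name       string
	Kind       string // "every" or "at"
	IntervalMs int64  // used by "every" jobs
	NextRunMs  int64  // next scheduled run, Unix milliseconds
	Enabled    bool
}

// recomputeOnStartup fixes up jobs whose next run is already in the past.
func recomputeOnStartup(jobs []Job, nowMs int64) {
	for i := range jobs {
		j := &jobs[i]
		if !j.Enabled || j.NextRunMs > nowMs {
			continue // nothing to repair
		}
		switch j.Kind {
		case "at":
			// One-time job whose moment has passed: disable it.
			j.Enabled = false
		case "every":
			if j.IntervalMs > 0 {
				// Jump to the next future slot on the job's own grid.
				missed := (nowMs-j.NextRunMs)/j.IntervalMs + 1
				j.NextRunMs += missed * j.IntervalMs
			}
		}
	}
}

func main() {
	const minute = int64(60000)
	jobs := []Job{
		{Name: "a", Kind: "every", IntervalMs: 5 * minute, NextRunMs: 1 * minute, Enabled: true},
		{Name: "b", Kind: "every", IntervalMs: 5 * minute, NextRunMs: 3 * minute, Enabled: true},
		{Name: "c", Kind: "at", NextRunMs: 10 * minute, Enabled: true},
	}
	// Server comes back 31.5 minutes after the reference point.
	recomputeOnStartup(jobs, 31*minute+30000)
	for _, j := range jobs {
		fmt.Println(j.Name, j.NextRunMs/minute, j.Enabled)
	}
	// a 36 true
	// b 33 true
	// c 10 false
}
```

Jobs a and b share an interval but end up at minutes 36 and 33, so their original two-minute offset survives the restart instead of both firing on the first tick.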
Test plan
- every jobs with the same interval (e.g. every 5min): verify they don't cluster after server restart
- RunJob still works correctly (uses now as base, no anchor)
- cron expression jobs unaffected (still use gronx for next tick)
- at (one-time) jobs with past time correctly disabled on startup

🤖 Generated with Claude Code