fix: prevent cron job flood after server restart and interval drift #796

Merged
viettranx merged 2 commits into nextlevelbuilder:main from Luvu182:fix/cron-flood-after-restart
Apr 9, 2026


Luvu182 commented Apr 9, 2026

Summary

  • Flood after restart: recomputeStaleJobs() only fixed jobs with next_run_at IS NULL. Jobs whose next_run_at was set but in the past (from server downtime) were not handled — causing ALL past-due jobs to fire simultaneously on the first scheduler tick. Now also advances past-due next_run_at to the next future time.
  • Interval drift & synchronization: Interval-based (every) jobs computed next_run_at from now (execution end time) instead of the original scheduled time. This caused permanent synchronization after any simultaneous execution (e.g. restart) and cumulative timing drift. Now uses anchor-based computation to preserve each job's original offset.

Both fixes applied across all 3 store implementations (PG, SQLite, JSON).

Test plan

  • Create multiple every jobs with the same interval (e.g. every 5min), verify they don't cluster after server restart
  • Stop server for longer than job interval, restart — verify past-due jobs are advanced to future, not executed immediately
  • Manual RunJob still works correctly (uses now as base, no anchor)
  • cron expression jobs unaffected (still use gronx for next tick)
  • at (one-time) jobs with past time correctly disabled on startup
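
The startup behavior this checklist expects can be sketched roughly as follows. This is an illustration only: the `Job` struct, its fields, and `recomputeOnStart` are hypothetical names, not the stores' actual types, and cron-expression jobs are left untouched here because their next tick comes from gronx.

```go
package main

import "fmt"

// Job is a hypothetical, simplified job record (times in Unix milliseconds).
type Job struct {
	Kind      string // "every", "at", or "cron"
	EveryMS   int64  // interval for "every" jobs
	AtMS      int64  // target time for "at" jobs
	NextRunMS int64  // 0 means unset
	Enabled   bool
}

// recomputeOnStart fixes jobs whose next run is unset or already in the
// past, so that nothing fires on the first scheduler tick.
func recomputeOnStart(jobs []Job, nowMS int64) {
	for i := range jobs {
		j := &jobs[i]
		if !j.Enabled || j.NextRunMS > nowMS {
			continue // disabled, or already scheduled in the future
		}
		switch j.Kind {
		case "every":
			// No stored anchor at startup, so restart from now + interval.
			j.NextRunMS = nowMS + j.EveryMS
		case "at":
			if j.AtMS <= nowMS {
				// One-time job whose moment has passed: disable it.
				j.Enabled = false
				j.NextRunMS = 0
			} else {
				j.NextRunMS = j.AtMS
			}
		}
	}
}

func main() {
	now := int64(1_000_000)
	jobs := []Job{
		{Kind: "every", EveryMS: 300_000, NextRunMS: 1_000, Enabled: true},
		{Kind: "at", AtMS: 500, NextRunMS: 500, Enabled: true},
		{Kind: "every", EveryMS: 60_000, NextRunMS: 2_000_000, Enabled: true},
	}
	recomputeOnStart(jobs, now)
	for _, j := range jobs {
		fmt.Println(j.Kind, j.Enabled, j.NextRunMS)
	}
}
```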

🤖 Generated with Claude Code

Luvu182 and others added 2 commits April 9, 2026 22:59
On startup, recomputeStaleJobs only fixed jobs with next_run_at IS NULL.
Jobs whose next_run_at was set but in the past (from server downtime)
were not touched, causing ALL past-due jobs to fire simultaneously on
the first scheduler tick.

Additionally, interval-based (every) jobs computed next_run from now
instead of the original scheduled time, causing permanent synchronization
after any simultaneous execution and cumulative drift.

Fix both issues across all 3 store implementations (PG, SQLite, JSON):
- Advance past-due next_run_at to next future time on startup
- Use anchor-based next_run computation for interval jobs
…ite manual run bug

- Fix SQLite RunJob silently skipping: add next_run_at=NULL claim and
  reloadClaimed param to match PG store behavior
- Nil-anchor guard for manual RunJob: use now+interval instead of
  anchor-based scheduling for manual triggers across PG/SQLite stores
- Replace O(N) advance loop with O(1) modular arithmetic to prevent
  CPU starvation after prolonged downtime with short-interval jobs
- Fix JSON store Start() leaving past-due at-jobs as enabled zombies
- Add zero-anchor guard in JSON store to match DB stores' nil check
- Add startup synchronization limitation comments in all 3 stores
- Add 4 unit tests: flood prevention, anchor arithmetic, RunJob
  scheduling, and past-due at-job disabling
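
The loop-to-arithmetic replacement mentioned in this commit message can be illustrated with a small Go sketch. Both function names and the millisecond representation are assumptions made for the example; the merged code may be shaped differently.

```go
package main

import "fmt"

// advanceLoop is the O(N) form: step forward one interval at a time
// until the result is strictly after now.
func advanceLoop(anchorMS, nowMS, everyMS int64) int64 {
	if everyMS <= 0 {
		return anchorMS // guard against an endless loop
	}
	next := anchorMS
	for next <= nowMS {
		next += everyMS
	}
	return next
}

// advanceMod is the O(1) form: compute how many whole intervals have
// elapsed and jump straight to the first slot after now.
func advanceMod(anchorMS, nowMS, everyMS int64) int64 {
	if everyMS <= 0 || anchorMS > nowMS {
		return anchorMS
	}
	steps := (nowMS-anchorMS)/everyMS + 1
	return anchorMS + steps*everyMS
}

func main() {
	// Anchor at 1.5s, now at 10.25s, interval 1s: the next slot is 10.5s.
	fmt.Println(advanceMod(1_500, 10_250, 1_000)) // prints 10500

	// 30 days of downtime with a 1s interval: same answer, no 2.6M iterations.
	now := int64(1_500 + 30*24*60*60*1000 + 250)
	fmt.Println(advanceMod(1_500, now, 1_000) == advanceLoop(1_500, now, 1_000)) // prints true
}
```

Because the result is always the anchor plus a whole number of intervals, the job's original offset is preserved no matter how long the gap was.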
@viettranx viettranx merged commit 292e63d into nextlevelbuilder:main Apr 9, 2026
2 checks passed
@viettranx

Review & Improvements Applied

Original fix correctly identified both root causes (flood after restart + interval drift). During review, found and fixed additional issues before merging:

Fixes added on top

| Fix | Impact |
| --- | --- |
| SQLite RunJob pre-existing bug | RunJob never set next_run_at = NULL → loadClaimedJob always returned false → manual runs silently skipped |
| O(N) → O(1) advance loop | for next <= now { next += interval } could iterate millions of times after prolonged downtime (e.g., 1s interval × 30 days = 2.6M iterations per job) |
| RunJob anchor inconsistency | PG RunJob used anchor-based scheduling (preserving rhythm); JSON store used now + interval. Aligned all stores to now + interval for manual runs |
| JSON Start() zombie at-jobs | Past-due one-time at jobs stayed enabled with nil NextRunAtMS; now properly disabled |
| Zero-anchor guard | JSON store scheduledAtMS == 0 falls through to computeNextRun to match DB stores' nil-pointer check |

Tests added

  • TestService_Start_AdvancesPastDueJobs — flood prevention
  • TestAnchorBasedNextRun_PreservesOffset — deterministic O(1) arithmetic verification
  • TestService_RunJob_UsesNowBasedScheduling — manual run uses now + interval
  • TestService_Start_DisablesPastDueAtJobs — zombie at-job prevention

Known limitation (documented in code)

After prolonged downtime, all past-due every jobs with the same interval will synchronize (now + interval). Inherent — no stored anchor. Diverges after first execution cycle via anchor-based scheduling.
