fix: prevent cron job flood after server restart and interval drift by Luvu182 · Pull Request #796 · nextlevelbuilder/goclaw

Luvu182 · 2026-04-09T16:00:13Z

Summary

Flood after restart: recomputeStaleJobs() only fixed jobs with next_run_at IS NULL. Jobs whose next_run_at was set but in the past (from server downtime) were not handled — causing ALL past-due jobs to fire simultaneously on the first scheduler tick. Now also advances past-due next_run_at to the next future time.
Interval drift & synchronization: Interval-based (every) jobs computed next_run_at from now (execution end time) instead of the original scheduled time. This caused permanent synchronization after any simultaneous execution (e.g. restart) and cumulative timing drift. Now uses anchor-based computation to preserve each job's original offset.

Both fixes applied across all 3 store implementations (PG, SQLite, JSON).
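
To illustrate the drift described above, here is a minimal sketch comparing the two ways of computing the next run for an every job. The function names, the millisecond timestamps, and the fixed run duration are assumptions made for illustration only; this is not the code from the stores.

```go
package main

// simulateNowBased models the old behaviour: after each run, the next run is
// scheduled one interval after the moment the run finished, so the run
// duration is added to the schedule on every cycle.
func simulateNowBased(firstRunMS, intervalMS, runDurationMS int64, cycles int) int64 {
	next := firstRunMS
	for i := 0; i < cycles; i++ {
		finished := next + runDurationMS // job fires at next and takes runDurationMS
		next = finished + intervalMS
	}
	return next
}

// simulateAnchorBased models the anchor idea: the next run is one interval
// after the slot the job was scheduled for, regardless of how long it ran.
// (Assumes each run finishes before the next slot.)
func simulateAnchorBased(firstRunMS, intervalMS int64, cycles int) int64 {
	next := firstRunMS
	for i := 0; i < cycles; i++ {
		scheduledAt := next // anchor: the slot this run was meant for
		next = scheduledAt + intervalMS
	}
	return next
}
```

With a 60 s interval and runs that take 1.5 s, the now-based schedule is 4.5 s late after three cycles, while the anchor-based schedule stays on the original grid. Two jobs that start 20 s apart also stay 20 s apart under the anchor-based variant.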

Test plan

Create multiple every jobs with the same interval (e.g. every 5min), verify they don't cluster after server restart
Stop server for longer than job interval, restart — verify past-due jobs are advanced to future, not executed immediately
Manual RunJob still works correctly (uses now as base, no anchor)
cron expression jobs unaffected (still use gronx for next tick)
at (one-time) jobs with past time correctly disabled on startup
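
The startup behaviour this plan checks can be sketched roughly as follows. The `Job` type, its field names, and `recomputeOnStartup` are hypothetical stand-ins, not the project's actual types. The sketch reschedules past-due every jobs to now + interval (an assumption about the startup path), and it leaves out both the existing next_run_at IS NULL handling and the cron case.

```go
package main

// Job is a hypothetical, simplified job record used only for this sketch.
type Job struct {
	Kind        string // "every", "at", or "cron"
	Enabled     bool
	IntervalMS  int64  // used by "every" jobs only
	NextRunAtMS *int64 // nil means "not scheduled"
}

// recomputeOnStartup rewrites past-due schedules so that nothing fires
// immediately on the first scheduler tick. It returns the number of jobs
// it changed.
func recomputeOnStartup(jobs []*Job, nowMS int64) int {
	changed := 0
	for _, j := range jobs {
		// Skip disabled jobs, unscheduled jobs, and jobs still in the future.
		if !j.Enabled || j.NextRunAtMS == nil || *j.NextRunAtMS > nowMS {
			continue
		}
		switch j.Kind {
		case "every":
			if j.IntervalMS <= 0 {
				continue // nothing sensible to compute
			}
			next := nowMS + j.IntervalMS // assumption: restart from now
			j.NextRunAtMS = &next
			changed++
		case "at":
			// A one-time job whose moment has passed is disabled, not run late.
			j.Enabled = false
			j.NextRunAtMS = nil
			changed++
		}
		// "cron" jobs would be recomputed from their expression; omitted here.
	}
	return changed
}
```

A test along the lines of the plan would build one past-due every job, one past-due at job, and one future job, call `recomputeOnStartup`, and assert that only the first two changed.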

🤖 Generated with Claude Code

On startup, recomputeStaleJobs only fixed jobs with next_run_at IS NULL. Jobs whose next_run_at was set but in the past (from server downtime) were not touched, causing ALL past-due jobs to fire simultaneously on the first scheduler tick.

Additionally, interval-based (every) jobs computed next_run from now instead of the original scheduled time, causing permanent synchronization after any simultaneous execution and cumulative drift.

Fix both issues across all 3 store implementations (PG, SQLite, JSON):
- Advance past-due next_run_at to next future time on startup
- Use anchor-based next_run computation for interval jobs

…ite manual run bug

- Fix SQLite RunJob silently skipping: add next_run_at=NULL claim and reloadClaimed param to match PG store behavior
- Nil-anchor guard for manual RunJob: use now+interval instead of anchor-based scheduling for manual triggers across PG/SQLite stores
- Replace O(N) advance loop with O(1) modular arithmetic to prevent CPU starvation after prolonged downtime with short-interval jobs
- Fix JSON store Start() leaving past-due at-jobs as enabled zombies
- Add zero-anchor guard in JSON store to match DB stores' nil check
- Add startup synchronization limitation comments in all 3 stores
- Add 4 unit tests: flood prevention, anchor arithmetic, RunJob scheduling, and past-due at-job disabling

viettranx · 2026-04-09T16:41:11Z

Review & Improvements Applied

Original fix correctly identified both root causes (flood after restart + interval drift). During review, found and fixed additional issues before merging:

Fixes added on top

| Fix | Impact |
| --- | --- |
| SQLite RunJob pre-existing bug | `RunJob` never set `next_run_at = NULL` → `loadClaimedJob` always returned false → manual runs silently skipped |
| O(N) → O(1) advance loop | `for next <= now { next += interval }` could iterate millions of times after prolonged downtime (e.g., 1s interval × 30 days = 2.6M iterations per job) |
| RunJob anchor inconsistency | PG RunJob used anchor-based scheduling (preserving rhythm); JSON store used `now + interval`. Aligned all stores to `now + interval` for manual runs |
| JSON Start() zombie at-jobs | Past-due one-time `at` jobs stayed enabled with nil NextRunAtMS; now properly disabled |
| Zero-anchor guard | JSON store `scheduledAtMS == 0` falls through to `computeNextRun` to match DB stores' nil-pointer check |
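
The O(N) to O(1) change can be sketched as below. This is a minimal illustration of replacing the quoted loop with integer division, using assumed function names and millisecond timestamps; the arithmetic in the merged code may be written differently.

```go
package main

// advanceLoop is the O(N) form quoted in the table: it steps one interval
// at a time until the result is in the future.
func advanceLoop(nextMS, nowMS, intervalMS int64) int64 {
	if intervalMS <= 0 {
		return nextMS // guard against an endless loop
	}
	for nextMS <= nowMS {
		nextMS += intervalMS
	}
	return nextMS
}

// advanceConstant computes the same result in O(1): count how many whole
// intervals fit between nextMS and nowMS, then skip one more to land
// strictly after nowMS. The result stays on the grid nextMS + k*intervalMS,
// so the job's original offset is preserved.
func advanceConstant(nextMS, nowMS, intervalMS int64) int64 {
	if intervalMS <= 0 || nextMS > nowMS {
		return nextMS
	}
	missed := (nowMS-nextMS)/intervalMS + 1
	return nextMS + missed*intervalMS
}
```

For the example in the table (1 s interval, 30 days of downtime), `advanceConstant` does one division where the loop would run about 2.6 million iterations, and both return the same timestamp.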

Tests added

TestService_Start_AdvancesPastDueJobs — flood prevention
TestAnchorBasedNextRun_PreservesOffset — deterministic O(1) arithmetic verification
TestService_RunJob_UsesNowBasedScheduling — manual run uses now + interval
TestService_Start_DisablesPastDueAtJobs — zombie at-job prevention

Known limitation (documented in code)

After prolonged downtime, all past-due every jobs with the same interval are rescheduled to now + interval and therefore synchronize. This is inherent, because no anchor is stored. The jobs diverge again after the first execution cycle via anchor-based scheduling.

Luvu182 and others added 2 commits April 9, 2026 22:59

viettranx merged commit 292e63d into nextlevelbuilder:main Apr 9, 2026
2 checks passed

viettranx merged 2 commits into nextlevelbuilder:main from Luvu182:fix/cron-flood-after-restart
