Add post-reset spread to _github_csma_sleep_after_rate_limit to prevent thundering herd on retries #44

Description

@mysticgohan1

What happened

PR #2304 added a post-reset spread (up to 60s via GITHUB_CSMA_SPREAD_MAX_SEC) to github_csma_sense in internal/scaffold/fullsend-repo/scripts/lib/github-api-csma.sh to prevent multiple runners from waking simultaneously after a rate-limit reset. The review agent correctly identified that _github_csma_sleep_after_rate_limit (lines 142-164) has the same thundering-herd problem: when multiple runners hit a 429/secondary-rate-limit, they all sleep until the same reset timestamp and wake together. The exponential backoff component adds some randomness but does not fully desynchronize runners on the same attempt count. The finding was not addressed before the PR was approved and merged on 2026-06-16.
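
The synchronization can be illustrated with a minimal sketch. It assumes the delay is derived as the reset timestamp minus the current time and ignores the backoff component; `seconds_until_reset` and the variable names are hypothetical and not copied from github-api-csma.sh.

```shell
# Illustration only: a typical sleep-until-reset delay calculation.
# Function and variable names are hypothetical.
seconds_until_reset() {
  local reset_epoch="$1" now_epoch="$2"
  local delay=$(( reset_epoch - now_epoch ))
  # Never return a negative delay if the reset time has already passed.
  if [ "${delay}" -lt 0 ]; then
    delay=0
  fi
  echo "${delay}"
}

reset=1000
# Three runners notice the rate limit at different times...
for now in 940 955 990; do
  delay="$(seconds_until_reset "${reset}" "${now}")"
  # ...but each one computes a wake-up time equal to the shared reset.
  echo "runner at t=${now} sleeps ${delay}s and wakes at t=$(( now + delay ))"
done
```

Every simulated runner reports waking at t=1000 regardless of when it hit the limit, which is the herd that the spread is meant to break up.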

What could go better

The fix should have been applied to both sleep-until-reset paths. This is a straightforward omission: the same spread logic that was added to github_csma_sense should also be added after the sleep "${delay}" call in _github_csma_sleep_after_rate_limit. Confidence: high. The review agent's analysis is correct and the two code paths are structurally parallel.

Proposed change

In internal/scaffold/fullsend-repo/scripts/lib/github-api-csma.sh, add the same post-reset spread logic after the sleep "${delay}" call in _github_csma_sleep_after_rate_limit (around line 160). Either extract the spread logic into a shared helper function (e.g., _github_csma_post_reset_spread) called from both github_csma_sense and _github_csma_sleep_after_rate_limit, or inline the same spread_max / RANDOM % spread_max / sleep pattern. The helper approach is preferred to avoid duplication.
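
A minimal sketch of what the shared helper might look like, assuming bash and the RANDOM % spread_max pattern described above. The 60-second default and the input validation are assumptions; the actual spread code from PR #2304 may differ.

```shell
# Hypothetical sketch of the shared helper. The real spread logic in
# github-api-csma.sh may use different defaults and validation.
_github_csma_post_reset_spread() {
  local spread_max="${GITHUB_CSMA_SPREAD_MAX_SEC:-60}"  # 60s default is assumed

  # Skip the spread when the setting is empty, non-numeric, or zero.
  case "${spread_max}" in
    ''|*[!0-9]*) return 0 ;;
  esac
  if [ "${spread_max}" -eq 0 ]; then
    return 0
  fi

  # RANDOM is a bash variable in 0..32767, so for spread_max up to 32768
  # this picks a whole number of seconds in 0..spread_max-1.
  local spread=$(( RANDOM % spread_max ))
  sleep "${spread}"
}
```

Under this sketch, _github_csma_sleep_after_rate_limit would call the helper immediately after its existing sleep "${delay}" line, and github_csma_sense would call it in place of its inlined spread.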

Validation criteria

After the fix: (1) _github_csma_sleep_after_rate_limit includes a random spread delay after sleeping until the reset timestamp, using the same GITHUB_CSMA_SPREAD_MAX_SEC configuration. (2) Existing tests (post-prioritize-test.sh and any CSMA unit tests) continue to pass. (3) In a batch of 5+ concurrent runners hitting a 429, runners should wake at staggered times rather than simultaneously.
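
Criterion (3) could be partly checked without real waiting by stubbing sleep and recording the requested durations. The sketch below is an assumption-laden illustration: `post_reset_spread` is a hypothetical stand-in for the spread step, and the five draws happen in one process, so it only checks range and variation, not true cross-runner behavior.

```shell
# Hypothetical check: simulate 5 runners and record the spread each one
# would sleep, without actually waiting.
GITHUB_CSMA_SPREAD_MAX_SEC=60
offsets_file="$(mktemp)"

# Stub: a shell function shadows the external sleep command.
sleep() {
  echo "$1" >> "${offsets_file}"
}

# Stand-in for the spread step (assumed RANDOM % spread_max pattern).
post_reset_spread() {
  sleep "$(( RANDOM % GITHUB_CSMA_SPREAD_MAX_SEC ))"
}

runner=1
while [ "${runner}" -le 5 ]; do
  post_reset_spread
  runner=$(( runner + 1 ))
done

# Count recorded offsets outside 0..max-1, and how many distinct values appear.
out_of_range=$(( $(awk -v max="${GITHUB_CSMA_SPREAD_MAX_SEC}" '$1 < 0 || $1 >= max' "${offsets_file}" | wc -l) ))
distinct=$(( $(sort -u "${offsets_file}" | wc -l) ))
echo "offsets out of range: ${out_of_range}"
echo "distinct offsets: ${distinct} of 5"
rm -f "${offsets_file}"
```

A real validation run would still need separate runner processes hitting an actual 429, since each process has its own RANDOM sequence.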


Generated by retro agent from fullsend-ai#2304

Metadata

    Labels

    component/dispatch: Workflow dispatch and triggers
    ready-to-code: Triaged and ready for the code agent
    type/bug: Confirmed defect in existing behavior
