
feat: add configurable limit on concurrent bulk dispatch goroutines #6751

Merged
ycombinator merged 7 commits into elastic:main from ycombinator:fix/pending-dispatch-limit
Apr 8, 2026

Conversation

@ycombinator (Contributor)

What is the problem this PR solves?

During a 30k Serverless scale test, 22 of 39 fleet-server pods were OOMKilled. Analysis of the captured pod logs showed:

  • An upgrade and policy-reassignment storm caused checkin durations to spike from the normal millisecond range to 5-45 seconds on average.
  • This created ~479 concurrent checkins per pod, each blocked in `dispatch()` waiting to enqueue onto the bulk engine's channel (capacity 32).
  • Elasticsearch was not the bottleneck: ES had ~88% free heap after GC, with sub-30ms pause times.
  • Each blocked goroutine holds its stack (~4-8 KB) plus the `bulkT` object. With no upper bound on concurrent dispatches, goroutines piled up until pods hit their memory limit (~154 Mi) and were killed (see the sketch below).
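
To make the failure mode concrete, here is a minimal, self-contained sketch (hypothetical names and sizes, not fleet-server code) of the pattern described above: each request spawns a goroutine that blocks sending on a small bounded channel, so when nothing drains the channel fast enough, parked goroutines, and the memory their stacks and payloads pin, accumulate without bound.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// bulkT stands in for the per-request object each blocked goroutine pins.
type bulkT struct{ payload [4096]byte }

func main() {
	// A small bounded queue, like the bulk engine's 32-slot channel.
	queue := make(chan *bulkT, 32)

	// Simulate a checkin storm with nothing draining the queue: every
	// dispatch past the first 32 parks on the send and stays resident.
	for i := 0; i < 10000; i++ {
		go func() {
			blk := &bulkT{}
			queue <- blk // goroutine stack + blk are pinned here
		}()
	}

	time.Sleep(time.Second)
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("goroutines: %d, heap: %d KiB\n",
		runtime.NumGoroutine(), m.HeapAlloc/1024)
}
```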

How does this PR solve the problem?

This adds an optional cap on concurrent dispatch goroutines to bound memory usage.

The limit check runs at the top of `dispatch()`, before blocking on the channel:

```go
// Check pending dispatch limit before blocking on the channel.
if limit := b.opts.maxPendingDispatches; limit > 0 {
	pending := b.pendingDispatches.Add(1)
	defer b.pendingDispatches.Add(-1)
	if pending > int64(limit) {
		zerolog.Ctx(ctx).Warn().
			Str("mod", kModBulk).
			Str("action", blk.action.String()).
			Int64("pending", pending).
			Int("limit", limit).
			Msg("Dispatch rejected: too many pending")
		b.freeBlk(blk)
		return respT{err: ErrTooManyDispatches}
	}
}
```
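
The admit-or-reject gate generalizes beyond fleet-server. The standalone sketch below (hypothetical names, not the fleet-server API) uses the same atomic-counter pattern, increment first, compare against the limit, decrement on exit, and shows that concurrent admissions never exceed the limit while everything else is shed immediately instead of blocking.

```go
package main

import (
	"errors"
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

var errTooMany = errors.New("too many pending dispatches")

// tryDispatch admits at most limit concurrent callers; the rest are
// rejected immediately rather than being allowed to block and pile up.
func tryDispatch(pending *atomic.Int64, limit int64, work func()) error {
	if pending.Add(1) > limit {
		pending.Add(-1)
		return errTooMany
	}
	defer pending.Add(-1)
	work()
	return nil
}

func main() {
	var pending, admitted, rejected atomic.Int64
	var wg sync.WaitGroup
	release := make(chan struct{})

	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			err := tryDispatch(&pending, 8, func() {
				admitted.Add(1)
				<-release // hold a slot until every goroutine has tried
			})
			if err != nil {
				rejected.Add(1)
			}
		}()
	}
	// Wait until each goroutine has been admitted or rejected, then release.
	for admitted.Load()+rejected.Load() < 100 {
		runtime.Gosched()
	}
	close(release)
	wg.Wait()
	fmt.Printf("admitted=%d rejected=%d\n", admitted.Load(), rejected.Load())
}
```

The counter can transiently overshoot the limit while a rejected caller sits between its increment and decrement, but admissions themselves never exceed the limit, and it is the admitted, blocked goroutines that pin memory.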

When the limit is reached, the dispatch is rejected immediately with `ErrTooManyDispatches`, which maps to HTTP 429 so agents retry on their next checkin interval:

```go
{
	bulk.ErrTooManyDispatches,
	HTTPErrResp{
		http.StatusTooManyRequests,
		"TooManyDispatches",
		"too many pending dispatches",
		zerolog.WarnLevel,
	},
},
```
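
End to end, an overloaded fleet-server now answers the checkin with a plain 429 instead of holding the connection open. The following self-contained sketch, using `net/http/httptest` with a hypothetical handler and path (not fleet-server's actual router), illustrates what an agent would observe:

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

var errTooManyDispatches = errors.New("too many pending dispatches")

func main() {
	// Handler sketch: translate the dispatch error into HTTP 429, mirroring
	// the error-to-status mapping shown above.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		err := errTooManyDispatches // pretend dispatch() just rejected us
		if errors.Is(err, errTooManyDispatches) {
			http.Error(w, "too many pending dispatches", http.StatusTooManyRequests)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	resp, err := http.Get(srv.URL + "/api/fleet/agents/checkin")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	// A 429 tells the agent to back off and retry on its next checkin interval.
	fmt.Println("status:", resp.StatusCode) // prints: status: 429
}
```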

The limit is configurable via `max_pending_dispatches` in the bulk config:

```go
MaxPendingDispatches int `config:"max_pending_dispatches"`
```

The default is 0 (no limit), so existing deployments are unaffected. Operators opt in by setting a value appropriate for their deployment size.
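
As an illustration, an operator opting in might add something like the following to the fleet-server configuration. This is a sketch only: the `2048` value is an arbitrary example, and the exact nesting of the `bulk` block depends on where this config struct is unpacked in a given deployment's configuration layout.

```yaml
bulk:
  # Reject new dispatches once this many are already waiting to enqueue.
  # 0 (the default) disables the limit entirely.
  max_pending_dispatches: 2048
```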

How to test this PR locally

```sh
go test -race ./internal/pkg/bulk/ -run TestDispatch -v
```

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
  • I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have added tests that prove my fix is effective or that my feature works

Related issues

🤖 Generated with Claude Code

@ycombinator ycombinator marked this pull request as ready for review April 3, 2026 22:11
@ycombinator ycombinator requested a review from a team as a code owner April 3, 2026 22:11
@ycombinator ycombinator changed the title from "feat: add configurable limit on concurrent dispatch goroutines" to "feat: add configurable limit on concurrent bulk dispatch goroutines" Apr 3, 2026
@pierrehilbert pierrehilbert added the Team:Elastic-Agent-Control-Plane label Apr 4, 2026
@mergify (bot) commented Apr 4, 2026

This pull request does not have a backport label. Could you fix it @ycombinator? 🙏
To fix up this pull request, you need to add the backport labels for the needed branches, such as:

  • `backport-./d./d` is the label to automatically backport to the `8./d` branch (`/d` is a digit).
  • `backport-active-all` is the label that automatically backports to all active branches.
  • `backport-active-8` is the label that automatically backports to all active minor branches for the 8 major.
  • `backport-active-9` is the label that automatically backports to all active minor branches for the 9 major.

@michel-laterman (Contributor) left a comment:

We really should be consistent about the type of maxPendingBulkDispatches: either make it an int64 in both the implementation and the config, or just an int.

Also, the new tests don't follow our existing test structure, which uses the require test package.

michel-laterman previously approved these changes Apr 7, 2026

@michel-laterman (Contributor) left a comment:

lgtm, should we fix the linter warnings?

@ycombinator ycombinator force-pushed the fix/pending-dispatch-limit branch from db3c667 to 487b09a Compare April 7, 2026 22:48
@ycombinator ycombinator enabled auto-merge (squash) April 7, 2026 23:10
@ycombinator ycombinator added the backport-skip label Apr 7, 2026
ycombinator and others added 7 commits April 8, 2026 10:44
When agent count exceeds what the bulk engine can process, goroutines
pile up in dispatch() waiting to send on the 32-slot channel. Each
blocked goroutine holds its stack plus the bulkT object. With 30k+
agents under upgrade/policy storms, this grows unbounded until OOM.

This adds an optional cap (max_pending_dispatches) on concurrent
dispatch goroutines. When the limit is reached, new dispatches are
rejected immediately with ErrTooManyDispatches, which maps to HTTP 429.
Agents retry on their next checkin interval, spreading load over time.

The default is 0 (no limit) so existing deployments are unaffected.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fixes errcheck linter warnings by capturing the error from
bytes.Buffer.WriteString and asserting via require.NoError. The
WriteString calls inside the worker goroutines are hoisted to the
test goroutine so require can be used safely.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ycombinator ycombinator force-pushed the fix/pending-dispatch-limit branch from 487b09a to 3285fa6 Compare April 8, 2026 17:44
@ycombinator ycombinator merged commit 54caea6 into elastic:main Apr 8, 2026
11 of 17 checks passed

Labels

backport-skip: Skip notification from the automated backport with mergify
Team:Elastic-Agent-Control-Plane: Label for the Agent Control Plane team
