
feat: add configurable limit on concurrent bulk dispatch goroutines #6751

Merged
ycombinator merged 7 commits into elastic:main from ycombinator:fix/pending-dispatch-limit
Apr 8, 2026

Conversation

@ycombinator (Contributor)

What is the problem this PR solves?

During a 30k Serverless scale test, 22 of 39 fleet-server pods were OOMKilled. Analysis of the captured pod logs showed:

  • An upgrade and policy-reassignment storm caused checkin durations to spike from the normal millisecond range to 5-45 seconds on average.
  • This created ~479 concurrent checkins per pod, each blocked in `dispatch()` waiting to enqueue onto the bulk engine's channel (capacity 32).
  • Elasticsearch was not the bottleneck: ES had ~88% free heap after GC, with sub-30ms pause times.
  • Each blocked goroutine holds its stack (~4-8 KB) plus the `bulkT` object. With no upper bound on concurrent dispatches, goroutines piled up until pods hit their memory limit (~154 Mi) and were killed (see the sketch below).
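
To make the failure mode concrete, here is a minimal, self-contained sketch (hypothetical names and sizes, not fleet-server code) of the pattern described above: each request spawns a goroutine that blocks sending on a small bounded channel, so when nothing drains the channel fast enough, parked goroutines, and the memory their stacks and payloads pin, accumulate without bound.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// bulkT stands in for the per-request object each blocked goroutine pins.
type bulkT struct{ payload [4096]byte }

func main() {
	// A small bounded queue, like the bulk engine's 32-slot channel.
	queue := make(chan *bulkT, 32)

	// Simulate a checkin storm with nothing draining the queue: every
	// dispatch past the first 32 parks on the send and stays resident.
	for i := 0; i < 10000; i++ {
		go func() {
			blk := &bulkT{}
			queue <- blk // goroutine stack + blk are pinned here
		}()
	}

	time.Sleep(time.Second)
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("goroutines: %d, heap: %d KiB\n",
		runtime.NumGoroutine(), m.HeapAlloc/1024)
}
```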

How does this PR solve the problem?

This adds an optional cap on concurrent dispatch goroutines to bound memory usage.

The limit check runs at the top of `dispatch()`, before blocking on the channel:

```go
// Check pending dispatch limit before blocking on the channel.
if limit := b.opts.maxPendingDispatches; limit > 0 {
	pending := b.pendingDispatches.Add(1)
	defer b.pendingDispatches.Add(-1)
	if pending > int64(limit) {
		zerolog.Ctx(ctx).Warn().
			Str("mod", kModBulk).
			Str("action", blk.action.String()).
			Int64("pending", pending).
			Int("limit", limit).
			Msg("Dispatch rejected: too many pending")
		b.freeBlk(blk)
		return respT{err: ErrTooManyDispatches}
	}
}
```
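
The admit-or-reject gate generalizes beyond fleet-server. The standalone sketch below (hypothetical names, not the fleet-server API) uses the same atomic-counter pattern, increment first, compare against the limit, decrement on exit, and shows that concurrent admissions never exceed the limit while everything else is shed immediately instead of blocking.

```go
package main

import (
	"errors"
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

var errTooMany = errors.New("too many pending dispatches")

// tryDispatch admits at most limit concurrent callers; the rest are
// rejected immediately rather than being allowed to block and pile up.
func tryDispatch(pending *atomic.Int64, limit int64, work func()) error {
	if pending.Add(1) > limit {
		pending.Add(-1)
		return errTooMany
	}
	defer pending.Add(-1)
	work()
	return nil
}

func main() {
	var pending, admitted, rejected atomic.Int64
	var wg sync.WaitGroup
	release := make(chan struct{})

	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			err := tryDispatch(&pending, 8, func() {
				admitted.Add(1)
				<-release // hold a slot until every goroutine has tried
			})
			if err != nil {
				rejected.Add(1)
			}
		}()
	}
	// Wait until each goroutine has been admitted or rejected, then release.
	for admitted.Load()+rejected.Load() < 100 {
		runtime.Gosched()
	}
	close(release)
	wg.Wait()
	fmt.Printf("admitted=%d rejected=%d\n", admitted.Load(), rejected.Load())
}
```

The counter can transiently overshoot the limit while a rejected caller sits between its increment and decrement, but admissions themselves never exceed the limit, and it is the admitted, blocked goroutines that pin memory.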

When the limit is reached, the dispatch is rejected immediately with `ErrTooManyDispatches`, which maps to HTTP 429 so agents retry on their next checkin interval:

```go
{
	bulk.ErrTooManyDispatches,
	HTTPErrResp{
		http.StatusTooManyRequests,
		"TooManyDispatches",
		"too many pending dispatches",
		zerolog.WarnLevel,
	},
},
```
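
End to end, an overloaded fleet-server now answers the checkin with a plain 429 instead of holding the connection open. The following self-contained sketch, using `net/http/httptest` with a hypothetical handler and path (not fleet-server's actual router), illustrates what an agent would observe:

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

var errTooManyDispatches = errors.New("too many pending dispatches")

func main() {
	// Handler sketch: translate the dispatch error into HTTP 429, mirroring
	// the error-to-status mapping shown above.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		err := errTooManyDispatches // pretend dispatch() just rejected us
		if errors.Is(err, errTooManyDispatches) {
			http.Error(w, "too many pending dispatches", http.StatusTooManyRequests)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	resp, err := http.Get(srv.URL + "/api/fleet/agents/checkin")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	// A 429 tells the agent to back off and retry on its next checkin interval.
	fmt.Println("status:", resp.StatusCode) // prints: status: 429
}
```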

The limit is configurable via `max_pending_dispatches` in the bulk config:

```go
MaxPendingDispatches int `config:"max_pending_dispatches"`
```

The default is 0 (no limit), so existing deployments are unaffected. Operators opt in by setting a value appropriate for their deployment size.
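
As an illustration, an operator opting in might add something like the following to the fleet-server configuration. This is a sketch only: the `2048` value is an arbitrary example, and the exact nesting of the `bulk` block depends on where this config struct is unpacked in a given deployment's configuration layout.

```yaml
bulk:
  # Reject new dispatches once this many are already waiting to enqueue.
  # 0 (the default) disables the limit entirely.
  max_pending_dispatches: 2048
```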

How to test this PR locally

```sh
go test -race ./internal/pkg/bulk/ -run TestDispatch -v
```

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
  • I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have added tests that prove my fix is effective or that my feature works

Related issues

🤖 Generated with Claude Code

@ycombinator ycombinator marked this pull request as ready for review April 3, 2026 22:11
@ycombinator ycombinator requested a review from a team as a code owner April 3, 2026 22:11
@ycombinator ycombinator changed the title from "feat: add configurable limit on concurrent dispatch goroutines" to "feat: add configurable limit on concurrent bulk dispatch goroutines" Apr 3, 2026
@pierrehilbert pierrehilbert added the Team:Elastic-Agent-Control-Plane label Apr 4, 2026
@mergify (bot) commented Apr 4, 2026

This pull request does not have a backport label. Could you fix it @ycombinator? 🙏
To fix up this pull request, you need to add the backport labels for the needed branches, such as:

  • `backport-./d./d` is the label to automatically backport to the `8./d` branch (`/d` is a digit).
  • `backport-active-all` is the label that automatically backports to all active branches.
  • `backport-active-8` is the label that automatically backports to all active minor branches for the 8 major.
  • `backport-active-9` is the label that automatically backports to all active minor branches for the 9 major.

@michel-laterman (Contributor) left a comment:

We really should be consistent about the type of maxPendingBulkDispatches: either make it an int64 in both the implementation and the config, or just an int.

Also, the new tests don't follow our existing test structure, which uses the require test package.

michel-laterman previously approved these changes Apr 7, 2026

@michel-laterman (Contributor) left a comment:

lgtm, should we fix the linter warnings?

@ycombinator ycombinator force-pushed the fix/pending-dispatch-limit branch from db3c667 to 487b09a Compare April 7, 2026 22:48
@ycombinator ycombinator enabled auto-merge (squash) April 7, 2026 23:10
@ycombinator ycombinator added the backport-skip label Apr 7, 2026
ycombinator and others added 7 commits April 8, 2026 10:44
When agent count exceeds what the bulk engine can process, goroutines
pile up in dispatch() waiting to send on the 32-slot channel. Each
blocked goroutine holds its stack plus the bulkT object. With 30k+
agents under upgrade/policy storms, this grows unbounded until OOM.

This adds an optional cap (max_pending_dispatches) on concurrent
dispatch goroutines. When the limit is reached, new dispatches are
rejected immediately with ErrTooManyDispatches, which maps to HTTP 429.
Agents retry on their next checkin interval, spreading load over time.

The default is 0 (no limit) so existing deployments are unaffected.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fixes errcheck linter warnings by capturing the error from
bytes.Buffer.WriteString and asserting via require.NoError. The
WriteString calls inside the worker goroutines are hoisted to the
test goroutine so require can be used safely.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ycombinator ycombinator force-pushed the fix/pending-dispatch-limit branch from 487b09a to 3285fa6 Compare April 8, 2026 17:44
@ycombinator ycombinator merged commit 54caea6 into elastic:main Apr 8, 2026
11 of 17 checks passed

Labels

backport-skip: Skip notification from the automated backport with mergify
Team:Elastic-Agent-Control-Plane: Label for the Agent Control Plane team
