Enforce the ECS worker-fleet cap under a concurrent cold-start burst - Closes #67 #68
Merged
The worker-fleet cap counted only list_tasks before each RunTask. ECS list_tasks is eventually consistent, so in a simultaneous burst of distinct uncached files (the exact scenario the cap guards against) every overflow spawn decision observed a stale ~0 count and launched a worker. Verified live: a 24-file burst drove 24 Fargate workers (peak 24, cap 16).

Count this provisioner's own recently issued launches (aged out after a visibility window) on top of the list_tasks fleet, under a lock so concurrent burst deciders observe each other's reservations. A failed RunTask releases its reservation so a spawn that never happened is not over-counted. A single API task is now held to the cap; in a multi-task API, each task would track only its own launches, so a shared lease would be needed (documented).

Add a regression test that drives a concurrent burst against a permanently list_tasks-blind client and asserts that at most max_workers RunTask calls happen. The test fails on the previous list_tasks-only cap.
Summary
Make the ECS worker-fleet cap actually bound a concurrent cold-start burst. The cap previously counted only list_tasks before each RunTask; because ECS list_tasks is eventually consistent, a simultaneous burst from a cold fleet had every spawn decision observe a stale ~0 count and launch. This was verified live: a 24-file burst drove 24 workers against a cap of 16.

Count the ECS-visible fleet plus the provisioner's own recently issued launches (aged out after a visibility window) under a lock, so concurrent burst deciders observe each other's reservations; release the reservation when a RunTask fails so a spawn that never happened is not over-counted. The cap is held per API task: in a multi-task API, each task would track only its own launches, so a shared lease would be needed (documented).

Trade-off: counting in-flight launches is conservative. Over-counting only briefly delays a spawn, whereas under-counting (the old behavior) let the fleet overshoot to the full demand.

Closes #67
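The accounting described in this summary can be sketched roughly as follows. This is a minimal illustration, not the code in provisioner.py: the class WorkerSlotCap, the launch helper, the injected clock, and the 60-second window value are all assumptions made for the example (the PR names _LAUNCH_VISIBILITY_WINDOW_S but this description does not state its value).

```python
import threading
import time
from typing import Callable, List, Optional

# Assumed value for illustration only; the real constant's value is not given here.
LAUNCH_VISIBILITY_WINDOW_S = 60.0


class WorkerSlotCap:
    """Hypothetical stand-in for the provisioner's cap accounting."""

    def __init__(self, max_workers: int, count_visible: Callable[[], int],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_workers = max_workers
        self._count_visible = count_visible  # e.g. a wrapper around list_tasks
        self._clock = clock
        self._cap_lock = threading.Lock()
        self._recent_launches: List[float] = []  # timestamps of in-flight launches

    def reserve(self) -> Optional[float]:
        """Return a reservation token, or None when the fleet is at the cap."""
        with self._cap_lock:
            now = self._clock()
            # Drop launches old enough that list_tasks should report them itself.
            self._recent_launches = [
                t for t in self._recent_launches
                if now - t < LAUNCH_VISIBILITY_WINDOW_S
            ]
            in_use = self._count_visible() + len(self._recent_launches)
            if in_use >= self._max_workers:
                return None
            self._recent_launches.append(now)
            return now

    def release(self, token: float) -> None:
        """Give back a reservation whose launch never happened."""
        with self._cap_lock:
            try:
                self._recent_launches.remove(token)
            except ValueError:
                pass  # the reservation already aged out


def launch(cap: WorkerSlotCap, run_task: Callable[[], list]) -> list:
    """Reserve, then launch; release the slot if the launch call raises."""
    token = cap.reserve()
    if token is None:
        return []  # at the cap: queue the work, spawn nothing
    try:
        return run_task()
    except Exception:
        cap.release(token)
        raise
```

Holding the lock across both the count and the append is what lets concurrent deciders see each other's reservations. A recent launch that has also become visible in list_tasks is counted twice until it ages out, which is the conservative side of the trade-off noted above.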
Proposed changes
Count in-flight launches in the worker-fleet cap
src/cfdb/workflows/provisioner.py: add _reserve_worker_slot / _release_worker_slot, the _recent_launches list, a _cap_lock, and the _LAUNCH_VISIBILITY_WINDOW_S constant. _run_task_owned reserves a slot under the lock (counting list_tasks + recent launches) and returns [] (queue, no spawn) when at the cap; a failed RunTask releases the reservation.
Document the in-flight accounting
src/cfdb/workflows/provisioner.py max_workers docstring and README.md: describe that the count includes in-flight launches (what bounds a burst) and the single-vs-multi-API-task scope.
Test cases
TestEcsProvisionerWorkerCap
list_tasks always reports zero (modeling ECS visibility lag)
RunTask calls happen; the other three return []
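The shape of such a regression test might look like the sketch below. It does not reproduce TestEcsProvisionerWorkerCap: BlindEcsClient, SketchProvisioner, the burst helper, and the sizes (5 concurrent deciders, cap of 2) are hypothetical names and numbers chosen for illustration. The count_in_flight flag switches between the old list_tasks-only check and the in-flight accounting so that both outcomes can be compared.

```python
import threading


class BlindEcsClient:
    """Fake client whose list_tasks never sees any task (permanent visibility lag)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.run_task_calls = 0

    def list_tasks(self) -> list:
        return []

    def run_task(self) -> list:
        with self._lock:
            self.run_task_calls += 1
            return [f"task-{self.run_task_calls}"]


class SketchProvisioner:
    """Hypothetical provisioner; count_in_flight=False models the old cap."""

    def __init__(self, client, max_workers: int, count_in_flight: bool) -> None:
        self._client = client
        self._max_workers = max_workers
        self._count_in_flight = count_in_flight
        self._cap_lock = threading.Lock()
        self._in_flight = 0

    def spawn(self) -> list:
        with self._cap_lock:
            reserved = self._in_flight if self._count_in_flight else 0
            if len(self._client.list_tasks()) + reserved >= self._max_workers:
                return []  # at the cap: no spawn
            self._in_flight += 1
        return self._client.run_task()


def burst(count_in_flight: bool, files: int = 5, max_workers: int = 2):
    """Fire `files` spawn decisions at once; return (RunTask calls, results)."""
    client = BlindEcsClient()
    provisioner = SketchProvisioner(client, max_workers, count_in_flight)
    barrier = threading.Barrier(files)
    results = [None] * files

    def decide(i: int) -> None:
        barrier.wait(timeout=10)  # line all deciders up before any of them spawns
        results[i] = provisioner.spawn()

    threads = [threading.Thread(target=decide, args=(i,)) for i in range(files)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return client.run_task_calls, results


calls, results = burst(count_in_flight=True)
assert calls <= 2                      # the cap holds despite the blind client
old_calls, _ = burst(count_in_flight=False)
assert old_calls == 5                  # the list_tasks-only check overshoots
```

Because the fake client never reports a task, only the in-flight count can stop the burst, so the first assertion fails if the provisioner falls back to a list_tasks-only check.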