Cap the ephemeral worker fleet at the ECS provisioner - Closes #65 #66
Merged
Bound the number of concurrent ephemeral Fargate worker tasks without limiting the queue. Before each RunTask the EcsProvisioner counts running/starting worker tasks via list_tasks and skips the spawn when the fleet is already at ECS_MAX_WORKERS (default 16). The job stays pending and the durable scheduler dispatches it once an existing worker frees up, so the worker fleet is bounded while the queue is not.

This is deliberately distinct from CFDB_WORKFLOW_MAX_ACTIVE, which bounds queue-plus-running (admission) and sheds with 429; the worker cap never sheds. The knob flows env (ECS_MAX_WORKERS) -> _EcsConfig -> EcsProvisioner and is exposed as the backend stack's EcsMaxWorkers parameter. Setting it to 0 disables the cap, leaving the Fargate vCPU quota as the only limit.

This is a soft cap: list_tasks eventual-consistency lag, plus a count-then-spawn race across distinct workflow keys, can briefly overshoot it. A list_tasks failure raises a retryable error so the job queues and retries rather than spawning blind past the cap.
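The path of the knob from the environment into the provisioner's configuration can be pictured roughly as follows. This is a sketch under stated assumptions, not the code in this PR: `EcsConfig` and `parse_max_workers` are hypothetical stand-ins for the real `_EcsConfig` wiring, and rejecting negative values is an assumption. Only the `ECS_MAX_WORKERS` name, the default of 16, and the meaning of 0 come from the description above.

```python
from dataclasses import dataclass
from typing import Mapping

DEFAULT_MAX_WORKERS = 16  # default stated in the PR description


@dataclass(frozen=True)
class EcsConfig:
    """Hypothetical stand-in for the PR's _EcsConfig."""

    max_workers: int

    @property
    def cap_enabled(self) -> bool:
        # 0 disables the cap; the Fargate vCPU quota is then the only limit.
        return self.max_workers > 0


def parse_max_workers(env: Mapping[str, str]) -> EcsConfig:
    """Read ECS_MAX_WORKERS from an environment-like mapping."""
    raw = env.get("ECS_MAX_WORKERS", "").strip()
    value = int(raw) if raw else DEFAULT_MAX_WORKERS
    if value < 0:
        # Assumption: negative values are rejected rather than clamped.
        raise ValueError(f"ECS_MAX_WORKERS must be >= 0, got {value}")
    return EcsConfig(max_workers=value)
```

In a running service the mapping would typically be `os.environ`; passing a plain dict keeps the sketch easy to exercise in isolation.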
Summary
Bound the concurrent ephemeral Fargate worker fleet on the ECS profile without limiting the durable queue. Before each RunTask the EcsProvisioner counts running/starting worker tasks (list_tasks) and skips the spawn when at ECS_MAX_WORKERS (default 16); the job stays pending and the durable scheduler dispatches it when an existing worker frees. So the worker fleet is bounded while the queue is preserved and nothing is shed.

This is deliberately separate from CFDB_WORKFLOW_MAX_ACTIVE (left at its 1024 default), which bounds queue + running (admission) and sheds with 429. Lowering the MAX_ACTIVE default to cap workers was considered and rejected because it would throttle the queue, not just the workers.

Closes #65
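To make the difference between the two limits concrete, here is a small decision sketch. It is illustrative only: the `Decision` enum and `decide` function are hypothetical, and it assumes one worker task per running job so that "queue + running" can be compared against the admission limit directly. The defaults of 1024 and 16 are the ones quoted in the summary.

```python
from enum import Enum


class Decision(Enum):
    SHED_429 = "rejected at admission with HTTP 429"
    STAY_PENDING = "admitted, waits in the queue for a free worker"
    SPAWN = "admitted, a new worker task is started"


def decide(pending_jobs: int, running_jobs: int,
           max_active: int = 1024, max_workers: int = 16) -> Decision:
    """Hypothetical model of how the two caps interact for one new job."""
    # Admission cap (CFDB_WORKFLOW_MAX_ACTIVE): bounds queue + running, sheds.
    if pending_jobs + running_jobs >= max_active:
        return Decision.SHED_429
    # Worker cap (ECS_MAX_WORKERS): bounds running workers only, never sheds.
    # A value of 0 disables it.
    if max_workers > 0 and running_jobs >= max_workers:
        return Decision.STAY_PENDING
    return Decision.SPAWN
```

With the defaults, a job that arrives while 16 workers are busy and 10 jobs wait is queued rather than rejected; it is only shed once queue plus running reaches 1024.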
Proposed changes
Worker-fleet cap in the provisioner
src/cfdb/workflows/provisioner.py: new max_workers (and task_family for the list_tasks filter). _run_task_owned counts the fleet via a new _current_worker_count and returns [] (no spawn) when at the cap; a list_tasks failure raises RetryableProvisionerError so the job queues and retries rather than spawning blind.
Wire the ECS_MAX_WORKERS knob
src/cfdb/api/__init__.py (ECS_MAX_WORKERS, default 16), src/cfdb/api/profile.py (_EcsConfig.max_workers), src/cfdb/api/main.py (_build_provisioner passes max_workers + task_family).
Expose the CloudFormation knob
cloudformation/backend.yml: EcsMaxWorkers parameter (default 16) -> ECS_MAX_WORKERS on the API task.
Docs
README.md: env table + bounded-concurrency section: ECS_MAX_WORKERS (caps workers, preserves queue) vs CFDB_WORKFLOW_MAX_ACTIVE (caps queue + running, sheds).
Test cases
TestEcsProvisionerWorkerCap | list_tasks reporting 2 running tasks | request is awaited | RunTask; returns []; family strips the :revision
TestEcsProvisionerWorkerCap | | request is awaited | RunTask; returns its ARN
TestEcsProvisionerWorkerCap | max_workers=0 | request is awaited | list_tasks
TestEcsProvisionerWorkerCap | list_tasks raising a ClientError | request is awaited | RetryableProvisionerError; no RunTask
TestEcsProvisionerWorkerCap | max_workers | | ValueError
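The count-then-spawn gate that these tests exercise could look roughly like the sketch below. It is not the PR's `_run_task_owned`: it is synchronous where the real provisioner is awaited, and `family_of`, `current_worker_count`, `run_task_capped`, and `FakeEcs` are hypothetical names. It assumes a boto3-style client whose `list_tasks` accepts `cluster`, `family`, `desiredStatus`, and `nextToken`, that `desiredStatus="RUNNING"` is how running and starting tasks are counted, and that the ValueError case refers to a negative `max_workers`.

```python
from typing import Any, Dict, List


class RetryableProvisionerError(Exception):
    """Stand-in for the PR's retryable error; the real class lives in cfdb."""


def family_of(task_definition: str) -> str:
    """Reduce 'family:revision' (or a task-definition ARN) to the bare family."""
    return task_definition.rsplit("/", 1)[-1].split(":", 1)[0]


def current_worker_count(ecs: Any, cluster: str, family: str) -> int:
    """Count tasks with desiredStatus RUNNING, following pagination."""
    count, token = 0, None
    while True:
        kwargs: Dict[str, Any] = {
            "cluster": cluster, "family": family, "desiredStatus": "RUNNING",
        }
        if token:
            kwargs["nextToken"] = token
        try:
            page = ecs.list_tasks(**kwargs)
        except Exception as exc:  # a botocore ClientError with a real client
            raise RetryableProvisionerError("cannot count workers") from exc
        count += len(page.get("taskArns", []))
        token = page.get("nextToken")
        if not token:
            return count


def run_task_capped(ecs: Any, cluster: str, task_definition: str,
                    max_workers: int) -> List[str]:
    """Spawn one worker unless the fleet is already at max_workers."""
    if max_workers < 0:
        raise ValueError("max_workers must be >= 0")
    if max_workers > 0:  # 0 disables the cap, so list_tasks is skipped
        family = family_of(task_definition)
        if current_worker_count(ecs, cluster, family) >= max_workers:
            return []  # at the cap: no spawn, the job stays pending
    # A real Fargate launch also needs networkConfiguration; omitted here.
    resp = ecs.run_task(cluster=cluster, taskDefinition=task_definition,
                        launchType="FARGATE")
    return [task["taskArn"] for task in resp.get("tasks", [])]


class FakeEcs:
    """Minimal in-memory client used only to exercise the sketch."""

    def __init__(self, running: int, fail: bool = False) -> None:
        self.running, self.fail, self.calls = running, fail, []

    def list_tasks(self, **kwargs: Any) -> Dict[str, Any]:
        self.calls.append(("list_tasks", kwargs))
        if self.fail:
            raise RuntimeError("simulated list_tasks failure")
        return {"taskArns": [f"arn:task/{i}" for i in range(self.running)]}

    def run_task(self, **kwargs: Any) -> Dict[str, Any]:
        self.calls.append(("run_task", kwargs))
        return {"tasks": [{"taskArn": "arn:task/new"}]}
```

Because the count and the spawn are two separate calls, two provisioners that count at the same moment can both decide to spawn. That is the overshoot the description calls a soft cap, and the sketch makes no attempt to close it.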