Description
Bound the number of concurrently-running ephemeral Fargate worker tasks (the worker-container fleet) on the ECS profile, without limiting the durable job queue.
In the ECS profile the API launches roughly one ephemeral Fargate worker per running workflow (one task per worker). Nothing capped the worker-container count — concurrency was bounded only by the account/region Fargate On-Demand vCPU quota (4000 in us-east-2), which is shared with other services (cvh-admin, gosling-dmc). A burst of distinct uncached files could fan out toward that quota.
Cap the fleet at the provisioner while preserving the queue:
- Add ECS_MAX_WORKERS (default 16; 0 disables). Before each RunTask the EcsProvisioner counts running/starting worker tasks via list_tasks and skips the spawn when at the cap; the job stays pending and the durable scheduler dispatches it when an existing worker frees up. The fleet is bounded; the queue is not, and nothing is shed.
- Thread the knob through env (ECS_MAX_WORKERS) -> _EcsConfig -> EcsProvisioner, and expose it as the backend stack's EcsMaxWorkers parameter.
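The count-then-spawn check can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the actual EcsProvisioner code: the names count_worker_tasks and maybe_spawn, the cluster/family arguments, and the run_task_kwargs dict are hypothetical. The ecs argument is assumed to behave like a boto3 ECS client, and the sketch relies on list_tasks with desiredStatus="RUNNING" also returning tasks that are still starting.

```python
from typing import Any, Dict


def count_worker_tasks(ecs: Any, cluster: str, family: str) -> int:
    """Count worker tasks whose desired status is RUNNING (running or starting)."""
    total = 0
    token = None
    while True:
        kwargs: Dict[str, Any] = {
            "cluster": cluster,
            "family": family,  # assumption: workers share one task-definition family
            "desiredStatus": "RUNNING",
        }
        if token:
            kwargs["nextToken"] = token
        page = ecs.list_tasks(**kwargs)
        total += len(page.get("taskArns", []))
        token = page.get("nextToken")
        if not token:
            return total


def maybe_spawn(ecs: Any, cluster: str, family: str,
                max_workers: int, run_task_kwargs: Dict[str, Any]) -> bool:
    """Spawn one worker unless the fleet is at the cap. Returns True if spawned."""
    if max_workers > 0 and count_worker_tasks(ecs, cluster, family) >= max_workers:
        # At the cap: skip RunTask. The job stays pending and is picked up
        # by the scheduler on a later dispatch pass.
        return False
    ecs.run_task(cluster=cluster, **run_task_kwargs)
    return True
```

Returning False rather than raising keeps the skipped spawn out of the error path, which matches the intent that nothing is shed when the cap is hit.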
This is deliberately distinct from CFDB_WORKFLOW_MAX_ACTIVE (kept at 1024), the admission ceiling on queue + running that sheds with 429. An earlier idea of lowering the MAX_ACTIVE default to cap workers was rejected: MAX_ACTIVE bounds queue + running together, so it would have throttled the queue, not just the workers.
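A small model may help show why the two bounds are not interchangeable. The function names, the 202 success code, and the use of >= at the ceiling are assumptions for illustration; only the defaults (1024 and 16) and the 429 come from the description.

```python
MAX_ACTIVE = 1024  # CFDB_WORKFLOW_MAX_ACTIVE default
MAX_WORKERS = 16   # ECS_MAX_WORKERS default


def admit(queued: int, running: int, max_active: int = MAX_ACTIVE) -> int:
    """Admission check: shed with 429 once queue + running reaches the ceiling."""
    # 202 is an assumed success status, not taken from the source.
    return 429 if queued + running >= max_active else 202


def can_spawn(workers: int, max_workers: int = MAX_WORKERS) -> bool:
    """Provisioner check: at the cap the job waits; 0 disables the cap."""
    return max_workers == 0 or workers < max_workers


# 16 workflows are running, 4 more are queued, and a new submission arrives.
print(admit(queued=4, running=16))                 # accepted under the 1024 ceiling
print(can_spawn(workers=16))                       # no new worker yet; the job waits
print(admit(queued=0, running=16, max_active=16))  # rejected idea: shed at 16 active
```

With MAX_ACTIVE lowered to 16 the seventeenth workflow would be rejected outright, whereas the worker cap leaves it queued.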
Motivation
The worker fleet had no per-service upper bound, and the Fargate quota cannot serve as one because the account is shared. Capping workers at the provisioner bounds blast radius and cost while preserving the queue (excess work waits rather than being rejected).
Expected Outcome
- With the default ECS_MAX_WORKERS=16, at most ~16 ephemeral worker tasks run concurrently; additional distinct uncached files queue and run as workers free up, with no 429 from the cap.
- CFDB_WORKFLOW_MAX_ACTIVE remains the separate queue/admission bound (default 1024).
- Configurable via env and the EcsMaxWorkers CloudFormation parameter; 0 disables it.
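Reading the knob from the environment might look like the sketch below. EcsConfigSketch is a hypothetical stand-in for _EcsConfig, whose real fields are not shown here, and rejecting negative values is an assumption rather than documented behavior.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class EcsConfigSketch:
    """Hypothetical stand-in for _EcsConfig; only the cap is modelled."""
    max_workers: int = 16  # default from the description; 0 disables the cap


def load_max_workers(env: Mapping[str, str] = os.environ) -> EcsConfigSketch:
    """Build the config from ECS_MAX_WORKERS, falling back to the default."""
    raw = env.get("ECS_MAX_WORKERS", "").strip()
    if not raw:
        return EcsConfigSketch()
    value = int(raw)  # raises ValueError on non-numeric input
    if value < 0:
        # Assumption: negative values are treated as a configuration error.
        raise ValueError("ECS_MAX_WORKERS must be >= 0 (0 disables the cap)")
    return EcsConfigSketch(max_workers=value)
```

Passing the mapping in explicitly, rather than reading os.environ inside the function, keeps the parsing easy to exercise without touching the process environment.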
Notes / scope
- Soft cap: list_tasks eventual-consistency lag plus a count-then-spawn race across distinct workflow keys can briefly overshoot the cap.
- ECS profile only (no provisioner in the local/LAN profile, where the pool is fixed).