Cap the ephemeral worker fleet at the ECS provisioner #65

Description

@conradbzura

Bound the number of concurrently-running ephemeral Fargate worker tasks (the worker-container fleet) on the ECS profile, without limiting the durable job queue.

In the ECS profile, the API launches roughly one ephemeral Fargate worker per running workflow (one task per worker). Nothing capped the worker-container count: concurrency was bounded only by the account- and region-level Fargate On-Demand vCPU quota (4000 in us-east-2), which is shared with other services (cvh-admin, gosling-dmc). A burst of distinct uncached files could therefore fan out toward that shared quota.

Cap the fleet at the provisioner, queue-preserving:

  • Add ECS_MAX_WORKERS (default 16; 0 disables the cap). Before each RunTask call, the EcsProvisioner counts running and starting worker tasks via list_tasks and skips the spawn when the count has reached the cap. The job stays pending, and the durable scheduler dispatches it once an existing worker frees up. The fleet is bounded; the queue is not, and nothing is shed.
  • Thread the setting from the ECS_MAX_WORKERS environment variable through _EcsConfig to EcsProvisioner, and expose it as the backend stack's EcsMaxWorkers parameter.
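
The count-then-spawn check described above might look roughly like the following. This is a minimal sketch, not the actual EcsProvisioner: the class name `CappedProvisioner`, the methods `count_workers` and `maybe_spawn`, and the `cluster` and `family` fields are hypothetical, and `ecs` is assumed to be a boto3-style ECS client exposing `list_tasks` and `run_task`. It also assumes that filtering on `desiredStatus="RUNNING"` covers both running and still-starting tasks.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class CappedProvisioner:
    """Hypothetical stand-in for EcsProvisioner; all names are illustrative."""

    ecs: Any               # assumed: a boto3-style ECS client
    cluster: str           # assumed: cluster the worker tasks run in
    family: str            # assumed: task-definition family of the workers
    max_workers: int = 16  # ECS_MAX_WORKERS; 0 disables the cap

    def count_workers(self) -> int:
        """Count worker tasks with desired status RUNNING, following pagination."""
        total, token = 0, None
        while True:
            kwargs = {
                "cluster": self.cluster,
                "family": self.family,
                "desiredStatus": "RUNNING",
            }
            if token:
                kwargs["nextToken"] = token
            page = self.ecs.list_tasks(**kwargs)
            total += len(page.get("taskArns", []))
            token = page.get("nextToken")
            if not token:
                return total

    def maybe_spawn(self) -> bool:
        """Spawn one worker unless the fleet is at the cap.

        Returns False when the spawn is skipped. The caller leaves the job
        pending so the scheduler can dispatch it later; nothing is rejected.
        """
        if self.max_workers > 0 and self.count_workers() >= self.max_workers:
            return False
        # A real Fargate call needs more arguments (for example
        # networkConfiguration); they are omitted in this sketch.
        self.ecs.run_task(
            cluster=self.cluster,
            taskDefinition=self.family,
            launchType="FARGATE",
        )
        return True
```

Returning a boolean rather than raising keeps the skip path cheap: a skipped spawn is an expected outcome under load, not an error.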

This is deliberately distinct from CFDB_WORKFLOW_MAX_ACTIVE (kept at 1024), the admission ceiling on queued plus running workflows, which sheds excess requests with HTTP 429. An earlier idea of lowering the MAX_ACTIVE default to cap workers was rejected: MAX_ACTIVE bounds queue + running together, so it would have throttled the queue, not just the workers.
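
To make the difference between the two bounds concrete, here is a toy model. The function names, the 202 status for an accepted job, and the exact comparison operators are assumptions for illustration only; the real admission and provisioning code may differ.

```python
MAX_ACTIVE = 1024   # CFDB_WORKFLOW_MAX_ACTIVE: bounds queued + running together
MAX_WORKERS = 16    # ECS_MAX_WORKERS: bounds running workers only


def admit(queued: int, running: int) -> int:
    """Admission check: HTTP-style status for a newly submitted job."""
    # Sheds with 429 once queue + running reaches the ceiling.
    return 429 if queued + running >= MAX_ACTIVE else 202


def workers_to_start(queued: int, running: int) -> int:
    """Provisioner check: how many queued jobs may get a worker right now."""
    if MAX_WORKERS == 0:  # cap disabled
        return queued
    return max(0, min(queued, MAX_WORKERS - running))
```

With 100 jobs queued and 16 workers running, `admit` still accepts new work while `workers_to_start` returns 0: the queue grows, and the fleet does not.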

Motivation

The worker fleet had no per-service upper bound, and the Fargate quota cannot serve as one because the account is shared. Capping workers at the provisioner bounds blast radius and cost while preserving the queue (excess work waits rather than being rejected).

Expected Outcome

  • With the default ECS_MAX_WORKERS=16, roughly 16 ephemeral worker tasks run concurrently at most (the cap is soft; see Notes / scope). Additional distinct uncached files queue and run as workers free up, and the cap itself never produces a 429.
  • CFDB_WORKFLOW_MAX_ACTIVE remains the separate queue/admission bound (default 1024).
  • Configurable via env and the EcsMaxWorkers CloudFormation parameter; 0 disables it.
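
Reading the setting from the environment could be sketched as follows. `EcsConfigSketch` and `load_ecs_config` are hypothetical names standing in for the real `_EcsConfig` plumbing, and rejecting negative values is an assumption; the issue only specifies the default of 16 and that 0 disables the cap.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class EcsConfigSketch:
    """Hypothetical stand-in for _EcsConfig; only the new field is shown."""

    max_workers: int = 16  # 0 disables the cap


def load_ecs_config(env: Mapping[str, str] = os.environ) -> EcsConfigSketch:
    """Build the config from ECS_MAX_WORKERS, falling back to the default."""
    value = int(env.get("ECS_MAX_WORKERS", "16"))
    if value < 0:
        # Assumption: a negative cap has no meaning, so fail fast at startup.
        raise ValueError("ECS_MAX_WORKERS must be >= 0")
    return EcsConfigSketch(max_workers=value)
```

Passing the environment in as a mapping keeps the parser easy to exercise without mutating `os.environ`.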

Notes / scope

  • Soft cap: list_tasks eventual-consistency lag plus a count-then-spawn race across distinct workflow keys can briefly push the fleet over the limit.
  • ECS profile only (no provisioner in the local/LAN profile, where the pool is fixed).
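
The overshoot from the count-then-spawn race can be illustrated with a small worst-case model. This is not the provisioner's code; `racing_spawns` is a hypothetical helper that assumes every concurrent call reads the task count before any of them has spawned.

```python
def racing_spawns(running: int, cap: int, concurrent_requests: int) -> int:
    """Fleet size after N provisioner calls that all count before any spawns."""
    # Step 1: every call observes the same pre-spawn count.
    observed = [running] * concurrent_requests
    # Step 2: each call decides independently from its stale observation.
    spawned = sum(1 for seen in observed if seen < cap)
    return running + spawned
```

For example, three simultaneous calls that each see 15 running tasks under a cap of 16 all spawn, leaving 18 tasks. Once the fleet is at or above the cap, later calls skip, so the overshoot is bounded by the number of calls in flight at once.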

Metadata

Labels: enhancement (New feature or request)
