Cap the ephemeral worker fleet at the ECS provisioner - Closes #65 #66

Merged: conradbzura merged 1 commit into master from 65-cap-worker-fleet on Jun 23, 2026
@conradbzura conradbzura commented Jun 23, 2026

Collaborator

Summary

Bound the concurrent ephemeral Fargate worker fleet on the ECS profile without limiting the durable queue. Before each RunTask, the EcsProvisioner counts running and starting worker tasks (via list_tasks) and skips the spawn when the count has reached ECS_MAX_WORKERS (default 16). The job stays pending, and the durable scheduler dispatches it once an existing worker frees up. The worker fleet is therefore bounded while the queue is preserved and nothing is shed.

This is deliberately separate from CFDB_WORKFLOW_MAX_ACTIVE (left at its default of 1024), which bounds queued plus running jobs at admission and sheds the excess with a 429. Lowering the CFDB_WORKFLOW_MAX_ACTIVE default to cap workers was considered and rejected because it would throttle the queue, not just the workers.
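The difference between the two limits can be illustrated with a toy model. This is only a sketch: `ToyScheduler`, its method names, and the `"accepted"` return value are hypothetical and are not taken from the codebase; only the two knobs and the 429 shed behaviour come from the description above.

```python
from collections import deque


class ToyScheduler:
    """Toy model: the admission cap sheds, the worker cap only delays."""

    def __init__(self, max_active, max_workers):
        self.max_active = max_active    # stands in for CFDB_WORKFLOW_MAX_ACTIVE
        self.max_workers = max_workers  # stands in for ECS_MAX_WORKERS
        self.pending = deque()
        self.running = set()

    def submit(self, job):
        # Admission counts queued + running; over the limit the job is shed.
        if len(self.pending) + len(self.running) >= self.max_active:
            return 429
        self.pending.append(job)
        self.dispatch()
        return "accepted"

    def dispatch(self):
        # Worker cap: start jobs only while the fleet has room; the rest wait.
        while self.pending and len(self.running) < self.max_workers:
            self.running.add(self.pending.popleft())

    def finish(self, job):
        # A worker freeing up lets the next pending job start.
        self.running.discard(job)
        self.dispatch()


sched = ToyScheduler(max_active=4, max_workers=2)
codes = [sched.submit(f"job-{i}") for i in range(5)]
print(codes)
sched.finish("job-0")
print(sorted(sched.running), list(sched.pending))
```

With these toy numbers, only the fifth submission is shed; the third and fourth are accepted and simply wait for a worker slot.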

Closes #65

Proposed changes

Worker-fleet cap in the provisioner

src/cfdb/workflows/provisioner.py: new max_workers parameter (plus task_family for the list_tasks filter). _run_task_owned counts the fleet via a new _current_worker_count helper and returns [] (no spawn) when at the cap. A list_tasks failure raises RetryableProvisionerError, so the job queues and retries rather than spawning blind.
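A minimal, synchronous sketch of the count-then-spawn guard described above, not the actual implementation. `CappedProvisioner`, `family_of`, `FakeEcs`, the cluster name `demo`, and the task definition `cfdb-worker:7` are hypothetical stand-ins. The `list_tasks` and `run_task` keyword arguments mirror the boto3 ECS client, but a real Fargate `run_task` call also needs a network configuration, which is omitted here, and the real code presumably catches a botocore `ClientError` rather than a bare `Exception`.

```python
class RetryableProvisionerError(Exception):
    """Stand-in for the PR's retryable error type."""


def family_of(task_definition):
    # "cfdb-worker:7" (or an ARN ending in ".../cfdb-worker:7") -> "cfdb-worker"
    return task_definition.rsplit("/", 1)[-1].split(":", 1)[0]


class CappedProvisioner:
    def __init__(self, ecs, cluster, task_definition, max_workers=0):
        if max_workers < 0:
            raise ValueError("max_workers must be >= 0")
        self.ecs = ecs
        self.cluster = cluster
        self.task_definition = task_definition
        self.task_family = family_of(task_definition)
        self.max_workers = max_workers  # 0 disables the cap

    def _current_worker_count(self):
        # desiredStatus="RUNNING" also covers tasks that are still starting.
        count, token = 0, None
        while True:
            kwargs = {"cluster": self.cluster, "family": self.task_family,
                      "desiredStatus": "RUNNING"}
            if token:
                kwargs["nextToken"] = token
            try:
                page = self.ecs.list_tasks(**kwargs)
            except Exception as exc:  # assumption: real code narrows this
                raise RetryableProvisionerError("could not count workers") from exc
            count += len(page.get("taskArns", []))
            token = page.get("nextToken")
            if not token:
                return count

    def run_task_owned(self):
        # At the cap: spawn nothing, so the job stays pending.
        if self.max_workers and self._current_worker_count() >= self.max_workers:
            return []
        resp = self.ecs.run_task(cluster=self.cluster,
                                 taskDefinition=self.task_definition,
                                 launchType="FARGATE", count=1)
        return [task["taskArn"] for task in resp.get("tasks", [])]


class FakeEcs:
    """In-memory fake so the sketch runs without AWS credentials."""

    def __init__(self, running, fail=False):
        self.running = list(running)
        self.fail = fail
        self.calls = []

    def list_tasks(self, **kwargs):
        self.calls.append(("list_tasks", kwargs))
        if self.fail:
            raise RuntimeError("simulated ECS error")
        return {"taskArns": list(self.running)}

    def run_task(self, **kwargs):
        self.calls.append(("run_task", kwargs))
        arn = f"arn:fake:task/{len(self.running) + 1}"
        self.running.append(arn)
        return {"tasks": [{"taskArn": arn}]}


at_cap = CappedProvisioner(FakeEcs(["t1", "t2"]), "demo", "cfdb-worker:7", max_workers=2)
print(at_cap.run_task_owned())
```

Returning an empty list rather than raising at the cap keeps "fleet is full" distinct from "could not tell whether the fleet is full"; only the latter surfaces as a retryable error.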

Wire the ECS_MAX_WORKERS knob

src/cfdb/api/__init__.py (ECS_MAX_WORKERS, default 16), src/cfdb/api/profile.py (_EcsConfig.max_workers), src/cfdb/api/main.py (_build_provisioner passes max_workers + task_family).
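One way the env-to-config step might look, as a sketch only. `EcsConfigSketch` and `load_ecs_config` are hypothetical names, and the real `_EcsConfig` almost certainly carries more fields; only the `ECS_MAX_WORKERS` variable, its default of 16, and the meaning of 0 come from this PR.

```python
import os
from dataclasses import dataclass

DEFAULT_ECS_MAX_WORKERS = 16  # default stated in this PR


@dataclass(frozen=True)
class EcsConfigSketch:
    max_workers: int


def load_ecs_config(environ=None):
    # Accept a mapping for testability; fall back to the process environment.
    env = os.environ if environ is None else environ
    value = int(env.get("ECS_MAX_WORKERS", DEFAULT_ECS_MAX_WORKERS))
    if value < 0:
        raise ValueError("ECS_MAX_WORKERS must be >= 0 (0 disables the cap)")
    return EcsConfigSketch(max_workers=value)


print(load_ecs_config({}))
print(load_ecs_config({"ECS_MAX_WORKERS": "0"}))
```

Rejecting negative values at load time is an assumption here; the PR itself only states that the provisioner constructor raises ValueError for a negative max_workers.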

Expose the CloudFormation knob

cloudformation/backend.yml: EcsMaxWorkers parameter (default 16) -> ECS_MAX_WORKERS on the API task.

Docs

README.md: update the env table and the bounded-concurrency section to contrast ECS_MAX_WORKERS (caps workers, preserves the queue) with CFDB_WORKFLOW_MAX_ACTIVE (caps queue + running, sheds).

Test cases

| # | Test Suite | Given | When | Then | Coverage Target |
|---|------------|-------|------|------|-----------------|
| 1 | TestEcsProvisionerWorkerCap | A cap of 2 and list_tasks reporting 2 running tasks | request is awaited | No RunTask; returns []; family strips the :revision | Skip spawn at capacity |
| 2 | TestEcsProvisionerWorkerCap | A cap of 3 and one running task | request is awaited | One RunTask; returns its ARN | Spawn below capacity |
| 3 | TestEcsProvisionerWorkerCap | The default max_workers=0 | request is awaited | Spawns without ever calling list_tasks | Cap disabled by default |
| 4 | TestEcsProvisionerWorkerCap | A cap and list_tasks raising a ClientError | request is awaited | Raises RetryableProvisionerError; no RunTask | Fail-safe on count error |
| 5 | TestEcsProvisionerWorkerCap | A negative max_workers | Provisioner is constructed | Raises ValueError | Construction guard |

@conradbzura conradbzura self-assigned this Jun 23, 2026
Bound the number of concurrent ephemeral Fargate worker tasks without
limiting the queue. Before each RunTask the EcsProvisioner counts
running/starting worker tasks via list_tasks and skips the spawn when
already at ECS_MAX_WORKERS (default 16); the job stays pending and the
durable scheduler dispatches it when an existing worker frees, so the
worker fleet is bounded while the queue is not.

This is deliberately distinct from CFDB_WORKFLOW_MAX_ACTIVE, which bounds
queue-plus-running (admission) and sheds with 429 — the worker cap never
sheds. The knob flows env (ECS_MAX_WORKERS) -> _EcsConfig ->
EcsProvisioner and is exposed as the backend stack's EcsMaxWorkers
parameter. 0 disables the cap (rely on the Fargate vCPU quota).

Soft cap: list_tasks eventual-consistency lag plus a count-then-spawn
race across distinct workflow keys can briefly overshoot. A list_tasks
failure raises a retryable error so the job queues and retries rather
than spawning blind past the cap.
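
The count-then-spawn race behind the soft cap can be shown with a deterministic toy interleaving. This is an illustration under assumed names (`SharedFleet`, `MAX_WORKERS`), not code from the provisioner: two requests for distinct workflow keys both read the count before either spawn is visible.

```python
class SharedFleet:
    """Toy stand-in for the set of worker tasks visible through list_tasks."""

    def __init__(self):
        self.tasks = []

    def count(self):
        return len(self.tasks)

    def spawn(self, owner):
        self.tasks.append(owner)


MAX_WORKERS = 2
fleet = SharedFleet()
fleet.spawn("existing-worker")

# Both requests count first, so each sees one task and room for one more.
seen_by_a = fleet.count()
seen_by_b = fleet.count()

# Each then spawns based on its own stale count.
if seen_by_a < MAX_WORKERS:
    fleet.spawn("workflow-a")
if seen_by_b < MAX_WORKERS:
    fleet.spawn("workflow-b")

overshoot = fleet.count() - MAX_WORKERS
print(fleet.count(), overshoot)
```

In this interleaving the fleet ends one task above the cap, which is the brief overshoot the commit message accepts.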
@conradbzura conradbzura force-pushed the 65-cap-worker-fleet branch from 6176fe7 to 1136553 on June 23, 2026 17:41
@conradbzura conradbzura changed the title from "Cap the ephemeral worker fleet via a conservative CFDB_WORKFLOW_MAX_ACTIVE default - Closes #65" to "Cap the ephemeral worker fleet at the ECS provisioner - Closes #65" on Jun 23, 2026
@conradbzura conradbzura marked this pull request as ready for review June 23, 2026 17:57
@conradbzura conradbzura merged commit 8f3a484 into master Jun 23, 2026
3 checks passed