Enforce the ECS worker-fleet cap under a concurrent cold-start burst - Closes #67 (#68)

Merged: conradbzura merged 1 commit into master from 67-enforce-ecs-worker-cap-burst on Jun 24, 2026

@conradbzura

Collaborator

Summary

Make the ECS worker-fleet cap actually bound a concurrent cold-start burst. Previously, the cap check counted only the tasks returned by list_tasks before each RunTask. Because ECS list_tasks is eventually consistent, every spawn decision in a simultaneous burst from a cold fleet observed a stale count of roughly zero and launched a worker. This was verified live: a 24-file burst drove 24 workers against a cap of 16.
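
The failure mode can be reproduced in miniature. The sketch below is illustrative only: `LaggingEcs` and `spawn_if_under_cap` are hypothetical stand-ins, not code from the provisioner, and the fake `list_tasks` simply never reports a launch, which models the visibility lag at its worst.

```python
import asyncio

CAP = 16    # cap from the live report
BURST = 24  # burst size from the live report


class LaggingEcs:
    """Hypothetical stand-in for the ECS client: launches never become visible."""

    def __init__(self) -> None:
        self.launched = 0

    async def list_tasks(self) -> list:
        await asyncio.sleep(0)  # yield control, as a real network call would
        return []               # stale view: the cold fleet still looks empty

    async def run_task(self) -> None:
        await asyncio.sleep(0)
        self.launched += 1


async def spawn_if_under_cap(ecs: LaggingEcs) -> bool:
    # List-only check: the decision sees just what list_tasks reports.
    if len(await ecs.list_tasks()) >= CAP:
        return False
    await ecs.run_task()
    return True


async def burst() -> int:
    ecs = LaggingEcs()
    await asyncio.gather(*(spawn_if_under_cap(ecs) for _ in range(BURST)))
    return ecs.launched


launched = asyncio.run(burst())
print(f"launched {launched} workers against a cap of {CAP}")
```

Every decider sees an empty fleet, so all 24 launch, which matches the overshoot reported above.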

The cap now counts the ECS-visible fleet plus the provisioner's own recently issued launches (aged out after a visibility window), all under a lock, so concurrent burst deciders observe each other's reservations. When a RunTask fails, its reservation is released so that a spawn that never happened is not over-counted. The cap is enforced per API task: in a multi-task API, each task would track only its own launches, so a fleet-wide cap would need a shared lease (documented). Trade-off: counting in-flight launches is conservative. Over-counting only briefly delays a spawn, whereas under-counting (the old behavior) let the fleet overshoot to the full demand.

Closes #67

Proposed changes

Count in-flight launches in the worker-fleet cap

src/cfdb/workflows/provisioner.py — add _reserve_worker_slot / _release_worker_slot, the _recent_launches list, a _cap_lock, and the _LAUNCH_VISIBILITY_WINDOW_S constant. _run_task_owned reserves a slot under the lock (counting list_tasks + recent launches) and returns [] (queue, no spawn) when at the cap; a failed RunTask releases the reservation.
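
A minimal sketch of the reservation logic described above, under stated assumptions. The names `_reserve_worker_slot`, `_release_worker_slot`, `_recent_launches`, `_cap_lock`, and `_LAUNCH_VISIBILITY_WINDOW_S` come from this PR, but the `WorkerCap` class, the `count_visible` callable, the use of `asyncio.Lock`, the timestamp token, and the 60-second window are assumptions made for illustration. The real code in provisioner.py may differ.

```python
import asyncio
import time
from typing import Awaitable, Callable, List, Optional

_LAUNCH_VISIBILITY_WINDOW_S = 60.0  # assumed value; the PR does not state it


class WorkerCap:
    """Illustrative cap tracker, not the actual provisioner class."""

    def __init__(self, max_workers: int,
                 count_visible: Callable[[], Awaitable[int]]) -> None:
        self.max_workers = max_workers
        self._count_visible = count_visible      # hypothetical wrapper around list_tasks
        self._recent_launches: List[float] = []  # timestamps of our own launches
        self._cap_lock = asyncio.Lock()

    async def _reserve_worker_slot(self) -> Optional[float]:
        """Return a reservation token, or None when the fleet is at the cap."""
        async with self._cap_lock:
            now = time.monotonic()
            # Age out launches old enough that list_tasks should report them.
            self._recent_launches = [
                t for t in self._recent_launches
                if now - t < _LAUNCH_VISIBILITY_WINDOW_S
            ]
            visible = await self._count_visible()
            if visible + len(self._recent_launches) >= self.max_workers:
                return None
            self._recent_launches.append(now)
            return now

    async def _release_worker_slot(self, token: float) -> None:
        """Drop a reservation whose RunTask failed."""
        async with self._cap_lock:
            if token in self._recent_launches:
                self._recent_launches.remove(token)


async def demo() -> List[Optional[float]]:
    async def count_visible() -> int:
        return 0  # list_tasks has not caught up with any launch yet

    cap = WorkerCap(max_workers=2, count_visible=count_visible)
    first = await cap._reserve_worker_slot()
    second = await cap._reserve_worker_slot()
    third = await cap._reserve_worker_slot()   # at the cap: no slot
    assert first is not None
    await cap._release_worker_slot(first)      # pretend the first RunTask failed
    retry = await cap._reserve_worker_slot()   # the freed slot is available again
    return [first, second, third, retry]
```

In this sketch a launch that is already visible to list_tasks but still inside the window is counted twice. That errs on the conservative side described in the summary: it can delay a spawn briefly but cannot let the fleet exceed the cap.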

Document the in-flight accounting

src/cfdb/workflows/provisioner.py max_workers docstring and README.md — describe that the count includes in-flight launches (what bounds a burst) and the single-vs-multi-API-task scope.

Test cases

| # | Test Suite | Given | When | Then | Coverage Target |
| --- | --- | --- | --- | --- | --- |
| 1 | TestEcsProvisionerWorkerCap | A cap of 2 and a client whose list_tasks always reports zero (modeling ECS visibility lag) | Five distinct-key requests are awaited concurrently | At most two RunTask calls happen; the other three return [] | Cap bounds a concurrent cold-start burst (regression for the live failure) |

The worker-fleet cap counted only the tasks list_tasks returned before
each RunTask. ECS list_tasks is eventually consistent, so in a
simultaneous burst of distinct uncached files (the exact scenario the cap
guards against) every overflow spawn decision observed a stale count of
roughly zero and launched a worker. Verified live: a 24-file burst drove
24 Fargate workers (peak 24, cap 16).

Count this provisioner's own recently-issued launches (aged out after a
visibility window) on top of the list_tasks fleet, under a lock so
concurrent burst deciders observe each other's reservations. A failed
RunTask releases its reservation so a spawn that never happened is not
over-counted. A single API task is now held to the cap; in a multi-task
API, each task would track only its own launches, so a fleet-wide cap
would need a shared lease (documented).

Add a regression test that drives a concurrent burst against a
permanently list_tasks-blind client and asserts at most max_workers
RunTask calls happen. It fails on the previous list_tasks-only cap.

@conradbzura self-assigned this Jun 23, 2026
@conradbzura marked this pull request as ready for review June 23, 2026 19:48
@conradbzura merged commit 459c2d6 into master Jun 24, 2026
3 checks passed