Description
The ECS worker-fleet cap (ECS_MAX_WORKERS, default 16) does not bound a simultaneous burst of distinct uncached files — the exact scenario it exists to guard.
Reproduced on live dev (2026-06-23): a burst of 24 distinct uncached /data requests drove 24 concurrent ephemeral Fargate workers (peak 24) against a cap of 16 — zero capping. All 24 jobs dispatched; none queued.
Steps to reproduce:
- Deploy the ECS profile with ECS_MAX_WORKERS=16 (the default).
- Issue ~24 concurrent GET /data/{dcc}/{id} for distinct uncached processable files while the worker fleet is cold (no workers running).
- Observe the running worker-task count (aws ecs list-tasks --family <worker-family>) reach ~24, not plateau at 16.
Expected Behavior
The concurrent worker fleet is bounded at ECS_MAX_WORKERS (≈16, modulo small documented overshoot); requests beyond the cap stay pending and the durable scheduler dispatches them as workers free up. The cap bounds workers, not the queue, and never sheds.
Root Cause
EcsProvisioner counted only list_tasks before each RunTask. ECS list_tasks is eventually consistent — a freshly launched task is not visible for several seconds. In a simultaneous burst from a cold fleet, all ~24 overflow->spawn decisions run within a few seconds, before any just-launched task becomes visible, so every decision sees a stale ~0 count, passes the < max_workers check, and spawns. The original "soft cap, briefly overshoots by a small margin" framing was wrong: against a concurrent cold-start burst it overshoots to the full demand. Unit tests passed because they fed list_tasks a synchronous, already-correct count and did not model the visibility lag / concurrent-spawn race.
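The race described above can be reproduced without AWS. The following is a minimal sketch, not the actual EcsProvisioner code: BlindEcsClient, naive_maybe_spawn and run_naive_burst are hypothetical names, and the client models the worst case of the visibility lag by never reporting a launched task. With a check that relies only on the visible count, every decider in a burst of 24 passes the cap check.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class BlindEcsClient:
    """Hypothetical stand-in for an ECS client (not boto3).

    list_tasks never shows launched tasks, modelling the window in which
    a freshly started task is not yet visible.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self.run_task_calls = 0

    def list_tasks(self):
        return []  # launched tasks are not visible yet

    def run_task(self):
        with self._lock:
            self.run_task_calls += 1


def naive_maybe_spawn(client, max_workers):
    """Check-then-launch using only the ECS-visible task count."""
    if len(client.list_tasks()) < max_workers:
        client.run_task()
        return True
    return False


def run_naive_burst(demand=24, max_workers=16):
    """Fire `demand` concurrent spawn decisions against a cold, blind fleet."""
    client = BlindEcsClient()
    with ThreadPoolExecutor(max_workers=demand) as pool:
        list(pool.map(lambda _: naive_maybe_spawn(client, max_workers),
                      range(demand)))
    return client.run_task_calls


if __name__ == "__main__":
    # Every decider sees a count of 0, so the cap of 16 never engages.
    print(run_naive_burst())
```

A unit test that feeds list_tasks an already-correct count would never exercise this path, which matches why the original tests passed.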
Fix: count the ECS-visible fleet plus the provisioner's own recently-issued launches (aged out after a visibility window) under a lock, so concurrent burst deciders observe each other's reservations; release the reservation when a RunTask fails. Add a regression test that drives a concurrent burst against a permanently list_tasks-blind client and asserts at most max_workers RunTask calls. (PR #66 introduced the cap; it is merged and live but ineffective against bursts until this lands.)
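The fix described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in the PR: ReservingProvisioner, BlindEcsClient, maybe_spawn, pending_reservations and the 30-second visibility window are all hypothetical. The cap check counts the ECS-visible tasks plus the provisioner's own recent launches, under a lock, and drops the reservation if the launch fails.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class BlindEcsClient:
    """Hypothetical ECS client stand-in: list_tasks never sees new tasks."""

    def __init__(self, fail=False):
        self._lock = threading.Lock()
        self.run_task_calls = 0
        self.fail = fail

    def list_tasks(self):
        return []

    def run_task(self):
        with self._lock:
            self.run_task_calls += 1
        if self.fail:
            raise RuntimeError("simulated RunTask failure")


class ReservingProvisioner:
    """Sketch of a cap that also counts its own recently issued launches."""

    def __init__(self, client, max_workers=16, visibility_window_s=30.0):
        self._client = client
        self._max_workers = max_workers
        self._window = visibility_window_s  # assumed value, not from the source
        self._lock = threading.Lock()
        self._reservations = {}  # token -> monotonic launch time

    def pending_reservations(self):
        with self._lock:
            return len(self._reservations)

    def maybe_spawn(self):
        token = object()
        with self._lock:
            now = time.monotonic()
            # Age out launches old enough to be visible in list_tasks.
            self._reservations = {k: t for k, t in self._reservations.items()
                                  if now - t < self._window}
            visible = len(self._client.list_tasks())
            if visible + len(self._reservations) >= self._max_workers:
                return False  # job stays pending; nothing is shed
            self._reservations[token] = now
        try:
            self._client.run_task()
        except Exception:
            # Release the reservation so a failed launch does not eat capacity.
            with self._lock:
                self._reservations.pop(token, None)
            raise
        return True


def run_burst(demand=24, max_workers=16):
    """Concurrent burst against a permanently list_tasks-blind client."""
    client = BlindEcsClient()
    prov = ReservingProvisioner(client, max_workers=max_workers)
    with ThreadPoolExecutor(max_workers=demand) as pool:
        results = list(pool.map(lambda _: prov.maybe_spawn(), range(demand)))
    return client.run_task_calls, results.count(False)
```

In this sketch a task that has become visible while its reservation is still inside the window is counted twice, so the cap errs toward launching too few workers rather than too many until the reservation ages out. The burst helper doubles as the shape of the regression test: 24 deciders, a blind client, and an assertion that RunTask is called at most max_workers times.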