Enforce the ECS worker-fleet cap under a concurrent cold-start burst

## Description

The ECS worker-fleet cap (`ECS_MAX_WORKERS`, default 16) does not bound a simultaneous burst of distinct uncached files — the exact scenario it exists to guard.

Reproduced on live dev (2026-06-23): a burst of 24 distinct uncached `/data` requests launched ephemeral Fargate workers that peaked at **24** concurrent against a cap of 16, so the cap had no effect. All 24 jobs dispatched; none queued.

Steps to reproduce:
1. Deploy the ECS profile with `ECS_MAX_WORKERS=16` (the default).
2. Issue ~24 concurrent `GET /data/{dcc}/{id}` for distinct uncached processable files while the worker fleet is cold (no workers running).
3. Observe the running worker-task count (`aws ecs list-tasks --family <worker-family>`) reach ~24, not plateau at 16.

## Expected Behavior

The concurrent worker fleet is bounded at `ECS_MAX_WORKERS` (≈16, modulo small documented overshoot); requests beyond the cap stay `pending` and the durable scheduler dispatches them as workers free up. The cap bounds workers, not the queue, and never sheds.

## Root Cause

`EcsProvisioner` counted only the tasks returned by `list_tasks` before each `RunTask`. ECS `list_tasks` is eventually consistent: a freshly launched task is not visible for several seconds. In a simultaneous burst from a cold fleet, all ~24 overflow-to-spawn decisions run within a few seconds, before any just-launched task becomes visible, so every decision sees a stale count of ~0, passes the `< max_workers` check, and spawns. The original "soft cap, briefly overshoots by a small margin" framing was wrong: against a concurrent cold-start burst the fleet overshoots to the full demand. Unit tests passed because they fed `list_tasks` a synchronous, already-correct count and did not model the visibility lag or the concurrent-spawn race.
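
The race can be illustrated with a small deterministic model. This is only a sketch: `LaggyEcs`, `naive_burst`, the 5-second lag, and the 0.1-second spacing between decisions are assumptions made for illustration, not the real `EcsProvisioner` or measured ECS behavior.

```python
class LaggyEcs:
    """Hypothetical stand-in for ECS: a launched task only shows up in
    list_tasks once `lag` seconds have passed since its launch."""

    def __init__(self, lag):
        self.lag = lag
        self.launched_at = []  # launch time of every task, visible or not

    def list_tasks(self, now):
        # Count only the tasks old enough to be visible at time `now`.
        return sum(1 for t in self.launched_at if now - t >= self.lag)

    def run_task(self, now):
        self.launched_at.append(now)


def naive_burst(demand, max_workers, lag, spacing):
    """Run `demand` spawn decisions `spacing` seconds apart, each checking
    only the visible count, and return how many tasks were launched."""
    ecs = LaggyEcs(lag)
    for i in range(demand):
        now = i * spacing
        if ecs.list_tasks(now) < max_workers:  # stale during the lag window
            ecs.run_task(now)
    return len(ecs.launched_at)


# All 24 decisions land inside the lag window, so each one sees 0 tasks.
print(naive_burst(demand=24, max_workers=16, lag=5.0, spacing=0.1))
# With no lag, the same check holds the fleet at the cap.
print(naive_burst(demand=24, max_workers=16, lag=0.0, spacing=0.1))
```

In this model the cap holds only when every decision can see the launches that came before it, which is the property the unit tests assumed and the live fleet did not have.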

Fix: count the ECS-visible fleet **plus** the provisioner's own recently-issued launches (aged out after a visibility window) under a lock, so concurrent burst deciders observe each other's reservations; release the reservation when a `RunTask` fails. Add a regression test that drives a concurrent burst against a permanently `list_tasks`-blind client and asserts at most `max_workers` `RunTask` calls. (PR #66 introduced the cap; it is merged and live but ineffective against bursts until this lands.)
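
A minimal sketch of the reservation approach and the burst regression check, under stated assumptions: `BlindEcs`, `ReservingProvisioner`, `try_spawn`, `burst`, and the 30-second `visibility_window` are hypothetical names and values chosen for illustration, and the fake client stands in for the real ECS calls. It is not the code in the actual fix.

```python
import threading
import time


class BlindEcs:
    """Hypothetical fake client: list_tasks never sees any launched task."""

    def __init__(self, fail=False):
        self.fail = fail
        self.run_task_calls = 0
        self._lock = threading.Lock()

    def list_tasks(self):
        return 0  # permanently blind to launches

    def run_task(self):
        with self._lock:
            self.run_task_calls += 1
        if self.fail:
            raise RuntimeError("simulated RunTask failure")


class ReservingProvisioner:
    """Counts the visible fleet plus its own recent launches under a lock."""

    def __init__(self, ecs, max_workers, visibility_window=30.0,
                 clock=time.monotonic):
        self.ecs = ecs
        self.max_workers = max_workers
        self.visibility_window = visibility_window  # assumed value
        self.clock = clock
        self._lock = threading.Lock()
        self._reservations = []  # timestamps of recently issued launches

    def try_spawn(self):
        with self._lock:
            now = self.clock()
            # Age out reservations older than the visibility window.
            self._reservations = [t for t in self._reservations
                                  if now - t < self.visibility_window]
            in_flight = self.ecs.list_tasks() + len(self._reservations)
            if in_flight >= self.max_workers:
                return False  # caller leaves the job pending
            self._reservations.append(now)
        try:
            self.ecs.run_task()
        except Exception:
            with self._lock:
                # Release the reservation so a failed launch frees its slot.
                if now in self._reservations:
                    self._reservations.remove(now)
            return False
        return True


def burst(provisioner, demand):
    """Fire `demand` concurrent spawn decisions; return how many launched."""
    results = []
    threads = [threading.Thread(
        target=lambda: results.append(provisioner.try_spawn()))
        for _ in range(demand)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)


ecs = BlindEcs()
launched = burst(ReservingProvisioner(ecs, max_workers=16), demand=24)
print(launched, ecs.run_task_calls)
```

One trade-off in this sketch: a reservation can overlap with the same task once ECS reports it, so the count errs toward under-provisioning until the reservation ages out. The window should be at least as long as the worst visibility lag expected from `list_tasks`.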
