Enforce the ECS worker-fleet cap under a concurrent cold-start burst #67

Description

@conradbzura

The ECS worker-fleet cap (ECS_MAX_WORKERS, default 16) does not bound a simultaneous burst of requests for distinct uncached files, which is the exact scenario it exists to guard against.

Reproduced on live dev (2026-06-23): a burst of 24 distinct uncached /data requests drove 24 concurrent ephemeral Fargate workers (peak 24) against a cap of 16, so the cap had no effect. All 24 jobs were dispatched; none queued.

Steps to reproduce:

  1. Deploy the ECS profile with ECS_MAX_WORKERS=16 (the default).
  2. Issue ~24 concurrent GET /data/{dcc}/{id} for distinct uncached processable files while the worker fleet is cold (no workers running).
  3. Observe the running worker-task count (aws ecs list-tasks --family <worker-family>) reach ~24, not plateau at 16.
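
A minimal sketch of a burst driver for step 2, assuming plain unauthenticated HTTP GETs. The helper names (`http_status`, `fire_burst`, `burst_urls`) and any base URL, DCC name, or file IDs passed in are hypothetical and not part of the service; only the /data/{dcc}/{id} path shape comes from this issue. A barrier holds every thread until all are ready, so the requests leave together rather than trickling out.

```python
import threading
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def http_status(url, timeout=300):
    """Fetch one URL and return its HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def burst_urls(base_url, dcc, file_ids):
    """Build one /data/{dcc}/{id} URL per distinct file ID."""
    return [f"{base_url}/data/{dcc}/{fid}" for fid in file_ids]


def fire_burst(urls, fetch=http_status):
    """Issue every request at the same moment; return results in input order."""
    if not urls:
        return []
    barrier = threading.Barrier(len(urls))

    def one(url):
        barrier.wait()  # hold each thread until all of them are ready
        return fetch(url)

    # One thread per URL so that no request waits behind another.
    with ThreadPoolExecutor(max_workers=len(urls)) as pool:
        return list(pool.map(one, urls))
```

Calling `fire_burst(burst_urls(base_url, dcc, ids))` with 24 distinct uncached IDs, while polling the task count from step 3 in another terminal, should reproduce the overshoot if the environment matches the one described here.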

Expected Behavior

The concurrent worker fleet is bounded at ECS_MAX_WORKERS (≈16, modulo a small documented overshoot); requests beyond the cap stay pending, and the durable scheduler dispatches them as workers free up. The cap bounds workers, not the queue, and never sheds requests.

Root Cause

EcsProvisioner counted only the tasks returned by list_tasks before each RunTask. ECS list_tasks is eventually consistent: a freshly launched task is not visible for several seconds. In a simultaneous burst from a cold fleet, all ~24 overflow->spawn decisions run within a few seconds, before any just-launched task becomes visible, so every decision sees a stale count of ~0, passes the < max_workers check, and spawns. The original "soft cap, briefly overshoots by a small margin" framing was wrong: against a concurrent cold-start burst, the fleet overshoots to the full demand. Unit tests passed because they stubbed list_tasks with a synchronous, already-correct count and did not model the visibility lag or the concurrent-spawn race.
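
The race above can be illustrated with a small deterministic simulation. This is a sketch, not the actual EcsProvisioner code: `LaggyEcs` and `naive_burst` are hypothetical names, the clock is simulated, and the 5-second lag is an assumed stand-in for "several seconds" of list_tasks invisibility.

```python
class LaggyEcs:
    """Fake ECS whose list_tasks hides tasks younger than `lag` seconds."""

    def __init__(self, lag):
        self.lag = lag
        self.launched_at = []  # simulated launch time of every task

    def run_task(self, now):
        self.launched_at.append(now)

    def list_tasks(self, now):
        # Only tasks that have aged past the visibility lag are returned.
        return [t for t in self.launched_at if now - t >= self.lag]


def naive_burst(ecs, max_workers, arrivals):
    """Apply the count-then-spawn check at each arrival; return the spawn count."""
    spawned = 0
    for now in arrivals:
        if len(ecs.list_tasks(now)) < max_workers:  # stale during a burst
            ecs.run_task(now)
            spawned += 1
    return spawned


# 24 spawn decisions spread over 2.3 simulated seconds.
arrivals = [i * 0.1 for i in range(24)]
print(naive_burst(LaggyEcs(lag=5.0), 16, arrivals))  # 24: no task is visible yet
print(naive_burst(LaggyEcs(lag=0.0), 16, arrivals))  # 16: an instantly consistent count caps
```

The second call mirrors what the original unit tests modelled, an always-current count, which is why they passed while the cap did nothing against the burst.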

Fix: count the ECS-visible fleet plus the provisioner's own recently issued launches (aged out after a visibility window) under a lock, so that concurrent burst deciders observe each other's reservations; release the reservation when a RunTask call fails. Add a regression test that drives a concurrent burst against a client whose list_tasks never shows the launched tasks, and asserts at most max_workers RunTask calls. (PR #66 introduced the cap; it is merged and live but ineffective against bursts until this lands.)
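
One way the fix and its regression test might look, as a hedged sketch rather than the real implementation. `ReservingProvisioner`, `BlindClient`, `count_tasks`, `run_task`, and the 30-second default window are all hypothetical; the real EcsProvisioner interface is not shown in this issue. Note that a task that is already visible but still inside the window is counted twice, so this version errs toward under-provisioning rather than overshoot.

```python
import threading
import time


class ReservingProvisioner:
    """Caps launches using the visible fleet plus recent local reservations."""

    def __init__(self, client, max_workers, visibility_window=30.0,
                 clock=time.monotonic):
        self._client = client
        self._max_workers = max_workers
        self._window = visibility_window
        self._clock = clock
        self._lock = threading.Lock()
        self._recent = []  # timestamps of launches that may not be visible yet

    def try_spawn(self):
        """Launch one worker if below the cap; return True if it launched."""
        with self._lock:
            now = self._clock()
            # Age out reservations old enough to appear in the listing.
            self._recent = [t for t in self._recent if now - t < self._window]
            visible = self._client.count_tasks()
            if visible + len(self._recent) >= self._max_workers:
                return False
            self._recent.append(now)  # reserve before launching
        try:
            self._client.run_task()
        except Exception:
            with self._lock:
                if now in self._recent:
                    self._recent.remove(now)  # release the failed reservation
            raise
        return True


class BlindClient:
    """Fake client whose task listing never shows launched tasks."""

    def __init__(self):
        self.run_calls = 0
        self._lock = threading.Lock()

    def count_tasks(self):
        return 0  # permanently blind to launches

    def run_task(self):
        with self._lock:
            self.run_calls += 1


def burst(provisioner, n):
    """Call try_spawn from n threads at once."""
    threads = [threading.Thread(target=provisioner.try_spawn) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


client = BlindClient()
burst(ReservingProvisioner(client, max_workers=16), 24)
assert client.run_calls <= 16  # the cap holds even with a blind listing
```

Holding the lock across the count and the reservation is what lets concurrent deciders see each other; the launch call itself runs outside the lock in this sketch so a slow RunTask does not serialize the whole burst.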

Labels: bug
