From 918c5ab5ffe1f353b7d1e003cff6b8d8d4562f61 Mon Sep 17 00:00:00 2001
From: Conrad <conradbzura@gmail.com>
Date: Mon, 22 Jun 2026 12:11:14 -0400
Subject: [PATCH 01/19] feat: Add bounded-concurrency env knobs for the
 workflow pipeline
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Introduce the configuration surface for issue #45's bounded-concurrency
control, parsed and validated at import alongside the existing workflow
runtime caps:

- CFDB_WORKER_MAX_CONCURRENT_TASKS (default 1) — per-worker backpressure
  threshold; 0 disables backpressure.
- CFDB_WORKFLOW_MAX_ACTIVE (default 1024) — admission ceiling on active
  workflows.
- CFDB_WORKFLOW_DISPATCH_DEADLINE_S (default 14400) — how long a job may
  wait for worker capacity before it is failed.
- CFDB_WORKFLOW_RETRY_INTERVAL_S (default 120) — base cadence for the
  durable dispatch-retry scheduler.

The threshold knob is consumed by the per-worker backpressure wiring; the
remaining three are consumed by the admission and durable-retry layers
that follow.
---
 src/cfdb/workflows/__init__.py | 55 ++++++++++++++++++++++++++++++++++
 1 file changed, 55 insertions(+)

diff --git a/src/cfdb/workflows/__init__.py b/src/cfdb/workflows/__init__.py
index e6509ca..597b5f2 100644
--- a/src/cfdb/workflows/__init__.py
+++ b/src/cfdb/workflows/__init__.py
@@ -44,6 +44,28 @@
   RUNNING/PENDING row is considered stale and reclaimable. Default ``900``
   (15 min) — sized as ``2 × heartbeat_interval + safety_margin`` so a
   single missed heartbeat does not falsely reclaim a healthy worker.
+
+Concurrency / admission (bounded-concurrency control, issue #45):
+
+- ``CFDB_WORKER_MAX_CONCURRENT_TASKS`` — per-worker backpressure
+  threshold: a worker rejects a dispatch (gRPC ``RESOURCE_EXHAUSTED``,
+  which the load balancer treats as transient and rotates past) once it
+  already has this many tasks in flight. Default ``1`` (serialize the
+  subprocess pipelines on a 1-vCPU worker); ``0`` disables backpressure
+  (unbounded — the prior behavior). Enforced worker-side via a
+  ``wool.BackpressureLike`` hook.
+- ``CFDB_WORKFLOW_MAX_ACTIVE`` — admission ceiling on concurrently active
+  workflows (``pending`` + ``running`` jobs). ``ensure_workflow`` returns
+  ``429 Retry-After`` once this many are active, shedding load before
+  claiming the per-file mutex. Default ``1024``. Soft cap
+  (count-then-insert may briefly overshoot).
+- ``CFDB_WORKFLOW_DISPATCH_DEADLINE_S`` — how long a job may wait for
+  worker capacity (re-dispatched on the retry cadence below) before it is
+  marked ``failed``. Default ``14400`` (4 h).
+- ``CFDB_WORKFLOW_RETRY_INTERVAL_S`` — base cadence at which the durable
+  retry scheduler re-attempts dispatch for a queued job awaiting
+  capacity; a small random jitter is added per attempt. Default ``120``
+  (2 min).
 """
 
 from __future__ import annotations
@@ -153,3 +175,36 @@ def _positive_int(name: str, value: str, *, minimum: int = 0) -> int:
         f"({WORKFLOW_HEARTBEAT_INTERVAL_S}s) to avoid false stale-reclaim "
         "of healthy workers between heartbeats."
     )
+
+# --- Bounded-concurrency control (issue #45) --------------------------------
+
+# Per-worker backpressure threshold. ``minimum=0`` because 0 is a valid
+# "disable backpressure" sentinel — the worker wiring passes
+# ``backpressure=None`` when 0, restoring the unbounded prior behavior.
+WORKER_MAX_CONCURRENT_TASKS: Final = _positive_int(
+    "CFDB_WORKER_MAX_CONCURRENT_TASKS",
+    os.getenv("CFDB_WORKER_MAX_CONCURRENT_TASKS", "1"),
+    minimum=0,
+)
+# Admission ceiling on concurrently active (pending + running) workflows.
+# ``ensure_workflow`` returns 429 once this many are active. ``minimum=1``
+# because 0 would reject every request.
+WORKFLOW_MAX_ACTIVE: Final = _positive_int(
+    "CFDB_WORKFLOW_MAX_ACTIVE",
+    os.getenv("CFDB_WORKFLOW_MAX_ACTIVE", "1024"),
+    minimum=1,
+)
+# How long a job may wait for worker capacity before being failed.
+# ``minimum=1`` — 0 would fail every queued job on its first attempt.
+WORKFLOW_DISPATCH_DEADLINE_S: Final = _positive_int(
+    "CFDB_WORKFLOW_DISPATCH_DEADLINE_S",
+    os.getenv("CFDB_WORKFLOW_DISPATCH_DEADLINE_S", "14400"),
+    minimum=1,
+)
+# Base cadence for the durable retry scheduler's re-dispatch attempts;
+# per-attempt jitter is layered on top. ``minimum=1`` — 0 would busy-spin.
+WORKFLOW_RETRY_INTERVAL_S: Final = _positive_int(
+    "CFDB_WORKFLOW_RETRY_INTERVAL_S",
+    os.getenv("CFDB_WORKFLOW_RETRY_INTERVAL_S", "120"),
+    minimum=1,
+)

From 67e1d404010bb2004634e52246ca92cafc954f4e Mon Sep 17 00:00:00 2001
From: Conrad <conradbzura@gmail.com>
Date: Mon, 22 Jun 2026 12:11:24 -0400
Subject: [PATCH 02/19] feat: Add per-worker backpressure for the wool worker
 pool

Workers ran with backpressure=None and accepted many concurrent
routines, so a single 1-vCPU/2-GiB Fargate worker oversubscribed the
samtools/bgzip/tabix/sort subprocess pipelines its routines shell out to.

Add TaskCountBackpressure, a picklable wool BackpressureLike hook that
rejects a dispatch once the worker already has
CFDB_WORKER_MAX_CONCURRENT_TASKS routines in flight (default 1, so each
worker serializes its pipeline). Rejection surfaces to the dispatcher as
gRPC RESOURCE_EXHAUSTED, which the load balancer treats as transient and
rotates past. Wire it into both worker entrypoints: worker_main builds the
hook directly, and worker_lan binds it onto the spawn factory via
functools.partial so the pool still prescribes the bind host. A threshold
of 0 passes backpressure=None, restoring the unbounded prior behavior.

The hook holds only an int and no async state so it survives wool's
serialization into the worker subprocess.
---
 src/cfdb/workflows/backpressure.py        |  56 ++++++++
 src/cfdb/workflows/worker_lan.py          |  20 ++-
 src/cfdb/workflows/worker_main.py         |  17 ++-
 tests/test_workflows/test_backpressure.py | 165 ++++++++++++++++++++++
 4 files changed, 254 insertions(+), 4 deletions(-)
 create mode 100644 src/cfdb/workflows/backpressure.py
 create mode 100644 tests/test_workflows/test_backpressure.py

diff --git a/src/cfdb/workflows/backpressure.py b/src/cfdb/workflows/backpressure.py
new file mode 100644
index 0000000..a5abaed
--- /dev/null
+++ b/src/cfdb/workflows/backpressure.py
@@ -0,0 +1,56 @@
+"""Per-worker backpressure hook for the wool worker pool (issue #45).
+
+A worker rejects an incoming dispatch once it already has a configured
+number of tasks in flight, serializing the subprocess pipelines
+(samtools/bgzip/tabix/sort/…) the routine shells out to so a single
+1-vCPU/2-GiB worker stops oversubscribing. wool surfaces the rejection to
+the dispatcher as gRPC ``RESOURCE_EXHAUSTED`` — a transient error the load
+balancer rotates past to the next worker.
+
+The hook is a plain, picklable object holding only an ``int`` because
+``wool.LocalWorker`` serializes ``backpressure`` into its worker
+subprocess. Keep it free of async/runtime state (no locks, queues, or
+event-loop references) so it survives that serialization boundary cleanly.
+"""
+
+from __future__ import annotations
+
+import wool
+
+
+class TaskCountBackpressure:
+    """Reject a dispatch when the worker already has ``threshold`` tasks.
+
+    Implements :class:`wool.BackpressureLike`: returns ``True`` (reject)
+    when the worker's in-flight task count has reached ``threshold`` and
+    ``False`` (accept) otherwise. With ``threshold == 1`` each worker runs
+    at most one routine at a time, serializing its CPU/IO-heavy subprocess
+    pipeline. The "disable backpressure" case (threshold 0) is handled by
+    the wiring, which passes ``backpressure=None`` rather than this hook.
+    """
+
+    def __init__(self, threshold: int) -> None:
+        if threshold < 1:
+            raise ValueError(
+                f"TaskCountBackpressure threshold must be >= 1; got {threshold} "
+                "(pass backpressure=None to disable backpressure entirely)"
+            )
+        self.threshold = threshold
+
+    def __call__(self, ctx: wool.BackpressureContext) -> bool:
+        """Return True to reject the incoming task, False to accept it."""
+        return ctx.active_task_count >= self.threshold
+
+    def __repr__(self) -> str:
+        return f"{type(self).__name__}(threshold={self.threshold})"
+
+
+def backpressure_for(threshold: int) -> TaskCountBackpressure | None:
+    """Build the per-worker backpressure hook for ``threshold``, or None.
+
+    ``threshold == 0`` disables backpressure: the worker wiring passes
+    ``backpressure=None`` to wool, restoring unbounded admission (the prior
+    behavior). Any positive value yields a :class:`TaskCountBackpressure`
+    that rejects once the worker has that many tasks in flight.
+    """
+    return TaskCountBackpressure(threshold) if threshold > 0 else None
diff --git a/src/cfdb/workflows/worker_lan.py b/src/cfdb/workflows/worker_lan.py
index 832ccc3..80c6437 100644
--- a/src/cfdb/workflows/worker_lan.py
+++ b/src/cfdb/workflows/worker_lan.py
@@ -32,6 +32,7 @@
 from __future__ import annotations
 
 import asyncio
+import functools
 import logging
 import signal
 from typing import Optional
@@ -40,6 +41,8 @@
 import wool
 from wool.runtime.discovery.lan import LanDiscovery
 
+from cfdb.workflows import WORKER_MAX_CONCURRENT_TASKS
+from cfdb.workflows.backpressure import backpressure_for
 from cfdb.workflows.credentials import build_worker_credentials
 
 logger = logging.getLogger(__name__)
@@ -97,18 +100,31 @@ def _signal_handler_threaded(*_: object) -> None:
         except NotImplementedError:
             signal.signal(sig, _signal_handler_threaded)
 
+    # Bind per-worker backpressure onto the spawn factory so each spawned
+    # LocalWorker serializes its routines (mirrors the ECS worker_main
+    # wiring). ``functools.partial(..., backpressure=hook)`` keeps wool's
+    # ``declares_host`` True, so the pool still prescribes the bind host.
+    backpressure = backpressure_for(WORKER_MAX_CONCURRENT_TASKS)
+    worker_factory = (
+        functools.partial(wool.LocalWorker, backpressure=backpressure)
+        if backpressure is not None
+        else wool.LocalWorker
+    )
+
     pool = wool.WorkerPool(
         spawn=workers,
+        worker=worker_factory,
         discovery=LanDiscovery(namespace),
         credentials=credentials,
     )
     async with pool:
         logger.info(
-            "Published %d wool worker(s) under LAN namespace %r (mTLS %s) "
-            "— Ctrl-C to drain and exit",
+            "Published %d wool worker(s) under LAN namespace %r (mTLS %s, "
+            "max concurrent tasks %s) — Ctrl-C to drain and exit",
             workers,
             namespace,
             "enabled" if credentials is not None else "disabled",
+            WORKER_MAX_CONCURRENT_TASKS if backpressure is not None else "unbounded",
         )
         await stop_event.wait()
     return 0
diff --git a/src/cfdb/workflows/worker_main.py b/src/cfdb/workflows/worker_main.py
index c4cc603..5a92264 100644
--- a/src/cfdb/workflows/worker_main.py
+++ b/src/cfdb/workflows/worker_main.py
@@ -35,6 +35,8 @@
 import click
 import wool
 
+from cfdb.workflows import WORKER_MAX_CONCURRENT_TASKS
+from cfdb.workflows.backpressure import backpressure_for
 from cfdb.workflows.constants import DEFAULT_WORKER_PORT
 from cfdb.workflows.credentials import build_worker_credentials
 
@@ -136,16 +138,27 @@ def _signal_handler_threaded(*_: object) -> None:
         health_port, lambda: stop_event.is_set()
     )
 
+    # Per-worker backpressure: reject a dispatch once this worker already
+    # has CFDB_WORKER_MAX_CONCURRENT_TASKS routines in flight (default 1),
+    # so its subprocess pipelines serialize instead of oversubscribing the
+    # 1-vCPU task. ``None`` (threshold 0) keeps the unbounded prior behavior.
+    backpressure = backpressure_for(WORKER_MAX_CONCURRENT_TASKS)
+
     worker = wool.LocalWorker(
-        host="0.0.0.0", port=worker_port, credentials=credentials
+        host="0.0.0.0",
+        port=worker_port,
+        credentials=credentials,
+        backpressure=backpressure,
     )
     await worker.start()
     try:
         logger.info(
-            "Wool worker listening on port %d (health on %d, mTLS %s)",
+            "Wool worker listening on port %d (health on %d, mTLS %s, "
+            "max concurrent tasks %s)",
             worker_port,
             health_port,
             "enabled" if credentials is not None else "disabled",
+            WORKER_MAX_CONCURRENT_TASKS if backpressure is not None else "unbounded",
         )
         while True:
             # Check the self-termination path first. Max-lifetime is a
diff --git a/tests/test_workflows/test_backpressure.py b/tests/test_workflows/test_backpressure.py
new file mode 100644
index 0000000..5ab61d7
--- /dev/null
+++ b/tests/test_workflows/test_backpressure.py
@@ -0,0 +1,165 @@
+"""Tests for the per-worker backpressure hook."""
+
+from __future__ import annotations
+
+import cloudpickle
+import pytest
+import wool
+
+from cfdb.workflows.backpressure import TaskCountBackpressure
+from cfdb.workflows.backpressure import backpressure_for
+
+
+def _ctx(active_task_count: int) -> wool.BackpressureContext:
+    """Build a BackpressureContext with a sentinel task for hook tests."""
+    return wool.BackpressureContext(active_task_count=active_task_count, task=object())
+
+
+class TestTaskCountBackpressure:
+    def test___init___should_raise_when_threshold_below_one(self):
+        """Test that a non-positive threshold is rejected at construction.
+
+        Given:
+            A threshold of 0, the "disable backpressure" sentinel that the
+            wiring handles by passing None instead of this hook.
+        When:
+            TaskCountBackpressure is constructed with it.
+        Then:
+            It should raise ValueError rather than build a hook that never
+            rejects.
+        """
+        # Arrange, act, & assert
+        with pytest.raises(ValueError, match="threshold must be >= 1"):
+            TaskCountBackpressure(0)
+
+    def test___call___should_accept_when_below_threshold(self):
+        """Test that the hook accepts a task below the threshold.
+
+        Given:
+            A hook with threshold 1 and a worker with no tasks in flight.
+        When:
+            The hook is invoked with that context.
+        Then:
+            It should return False (accept).
+        """
+        # Arrange
+        hook = TaskCountBackpressure(1)
+
+        # Act
+        decision = hook(_ctx(active_task_count=0))
+
+        # Assert
+        assert decision is False
+
+    def test___call___should_reject_when_at_threshold(self):
+        """Test that the hook rejects a task at the threshold.
+
+        Given:
+            A hook with threshold 1 and a worker already running one task.
+        When:
+            The hook is invoked with that context.
+        Then:
+            It should return True (reject), which wool surfaces as
+            RESOURCE_EXHAUSTED.
+        """
+        # Arrange
+        hook = TaskCountBackpressure(1)
+
+        # Act
+        decision = hook(_ctx(active_task_count=1))
+
+        # Assert
+        assert decision is True
+
+    def test___call___should_reject_when_above_threshold(self):
+        """Test that the hook rejects once in-flight exceeds the threshold.
+
+        Given:
+            A hook with threshold 2 and a worker running three tasks.
+        When:
+            The hook is invoked with that context.
+        Then:
+            It should return True (reject).
+        """
+        # Arrange
+        hook = TaskCountBackpressure(2)
+
+        # Act
+        decision = hook(_ctx(active_task_count=3))
+
+        # Assert
+        assert decision is True
+
+    def test_cloudpickle_roundtrip_should_preserve_threshold(self):
+        """Test that the hook survives the worker-serialization boundary.
+
+        Given:
+            A hook that wool will serialize into a LocalWorker subprocess.
+        When:
+            It is cloudpickle dumped and loaded back.
+        Then:
+            The restored hook preserves its threshold and still rejects at
+            it, proving it is safe to ship across the worker boundary.
+        """
+        # Arrange
+        hook = TaskCountBackpressure(2)
+
+        # Act
+        restored = cloudpickle.loads(cloudpickle.dumps(hook))
+
+        # Assert
+        assert isinstance(restored, TaskCountBackpressure)
+        assert restored.threshold == 2
+        assert restored(_ctx(active_task_count=2)) is True
+        assert restored(_ctx(active_task_count=1)) is False
+
+    def test_conforms_to_backpressure_like_protocol(self):
+        """Test that the hook satisfies wool's BackpressureLike protocol.
+
+        Given:
+            A constructed hook.
+        When:
+            It is checked against the runtime-checkable BackpressureLike
+            protocol.
+        Then:
+            It should be recognized as a BackpressureLike, so wool accepts
+            it where a backpressure hook is expected.
+        """
+        # Arrange
+        hook = TaskCountBackpressure(1)
+
+        # Act & assert
+        assert isinstance(hook, wool.BackpressureLike)
+
+
+class TestBackpressureFor:
+    def test_backpressure_for_should_return_none_when_threshold_is_zero(self):
+        """Test that a threshold of 0 disables backpressure.
+
+        Given:
+            A threshold of 0 (the disable sentinel).
+        When:
+            backpressure_for is called.
+        Then:
+            It should return None so the wiring passes backpressure=None to
+            wool (unbounded admission).
+        """
+        # Act & assert
+        assert backpressure_for(0) is None
+
+    def test_backpressure_for_should_build_hook_when_threshold_positive(self):
+        """Test that a positive threshold yields a configured hook.
+
+        Given:
+            A threshold of 2.
+        When:
+            backpressure_for is called.
+        Then:
+            It should return a TaskCountBackpressure carrying that threshold.
+        """
+        # Act
+        hook = backpressure_for(2)
+
+        # Assert
+        assert isinstance(hook, TaskCountBackpressure)
+        assert hook.threshold == 2

From 8823a31270da34ce7f7aaf82294d4cc503c448ae Mon Sep 17 00:00:00 2001
From: Conrad <conradbzura@gmail.com>
Date: Mon, 22 Jun 2026 12:11:34 -0400
Subject: [PATCH 03/19] feat: Add a priority load balancer for the workflow
 worker pool
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

wool's default round-robin balancer spreads load evenly, so every worker
carries a thin perpetual slice of traffic and none drains to idle —
defeating worker self-reaping on an over-provisioned fleet.

Add PriorityLoadBalancer, a stateless wool LoadBalancerLike that offers
each task to the discovered workers in a stable order (sorted by uid) on
every dispatch. Combined with per-worker backpressure (one task each),
load fills the lowest-ordered workers first and higher-ordered ones drain
to idle and self-terminate on their max-lifetime — the leaky-bucket
behavior. It honors wool's worker-health contract: rotate to the next
worker on TransientRpcError (a backpressure RESOURCE_EXHAUSTED rejection),
evict on a non-transient RpcError, propagate anything else, and raise
NoWorkersAvailable when no worker accepts. Wire it into the API
WorkerPool.

Being stateless (no per-context index, no lock) keeps it trivially
picklable.
---
 src/cfdb/api/main.py                      |   8 +
 src/cfdb/workflows/loadbalancer.py        | 109 ++++++++++
 tests/test_workflows/test_loadbalancer.py | 239 ++++++++++++++++++++++
 3 files changed, 356 insertions(+)
 create mode 100644 src/cfdb/workflows/loadbalancer.py
 create mode 100644 tests/test_workflows/test_loadbalancer.py

diff --git a/src/cfdb/api/main.py b/src/cfdb/api/main.py
index eb1fdb8..55ef8f0 100644
--- a/src/cfdb/api/main.py
+++ b/src/cfdb/api/main.py
@@ -33,6 +33,7 @@
 from cfdb.workflows.credentials import build_worker_credentials
 from cfdb.workflows.discovery import EcsDiscovery
 from cfdb.workflows.executor import WoolExecutor
+from cfdb.workflows.loadbalancer import PriorityLoadBalancer
 from cfdb.workflows.processors.bam import BamIndexProcessor
 from cfdb.workflows.processors.registry import default_registry
 from cfdb.workflows.processors.tabix import TabixIntervalProcessor
@@ -342,6 +343,13 @@ async def lifespan(_: FastAPI):
                     discovery=discovery,
                     credentials=worker_credentials,
                     quorum=0,
+                    # Priority/leaky-bucket balancing: always offer a task to
+                    # discovered workers in the same stable order. With
+                    # per-worker backpressure (one task each), load fills the
+                    # lowest-ordered workers first and higher-ordered ones
+                    # drain to idle and self-reap, instead of every worker
+                    # carrying a thin perpetual slice (issue #45).
+                    loadbalancer=PriorityLoadBalancer(),
                 ):
                     # Snapshot the lifespan task's contextvars after the
                     # pool's ``__aenter__`` has populated wool's internals.
diff --git a/src/cfdb/workflows/loadbalancer.py b/src/cfdb/workflows/loadbalancer.py
new file mode 100644
index 0000000..fd60ac9
--- /dev/null
+++ b/src/cfdb/workflows/loadbalancer.py
@@ -0,0 +1,109 @@
+"""Priority (leaky-bucket) load balancer for the wool worker pool (issue #45).
+
+Unlike wool's round-robin balancer, this one always offers a task to the
+discovered workers in the same stable order (sorted by
+``WorkerMetadata.uid``) on every dispatch. Combined with per-worker
+backpressure (one task per worker), load concentrates on the
+lowest-ordered workers and higher-ordered workers drain to idle — so an
+over-provisioned fleet sheds idle capacity (workers self-terminate on
+their max-lifetime) instead of every worker carrying a thin, perpetual
+slice of traffic. This is the "leaky bucket": fill the priority workers
+first and let the overflow spill to the next.
+
+It honors wool's load-balancer worker-health contract (see
+:class:`wool.LoadBalancerLike`): rotate to the next worker on
+``TransientRpcError`` — which includes a backpressure ``RESOURCE_EXHAUSTED``
+rejection — evict the worker on a non-transient ``RpcError``, and let any
+other exception propagate untouched. When no worker accepts the task (the
+pool is empty, or every worker rejected or was evicted in one pass) it
+raises ``NoWorkersAvailable``; the executor treats that as the signal to
+add a worker and re-queue the job.
+
+Stateless by design: the stable order is recomputed per call, so there is
+no per-context index, no lock, and nothing that resists pickling.
+"""
+
+from __future__ import annotations
+
+import logging
+from typing import TYPE_CHECKING
+from typing import AsyncGenerator
+
+from wool import LoadBalancerContextLike
+from wool import LoadBalancerLike
+from wool import NoWorkersAvailable
+from wool import RpcError
+from wool import TransientRpcError
+
+if TYPE_CHECKING:
+    from wool.runtime.routine.task import Task
+
+logger = logging.getLogger(__name__)
+
+
+class PriorityLoadBalancer(LoadBalancerLike):
+    """Offer each task to workers in a stable priority order.
+
+    See the module docstring for the leaky-bucket rationale and the
+    worker-health contract this honors.
+    """
+
+    def __reduce__(self):
+        # Stateless — reconstruct via the bare constructor so the balancer
+        # is trivially picklable (mirrors RoundRobinLoadBalancer, which
+        # uses __reduce__ to shed its lock/index state).
+        return (self.__class__, ())
+
+    async def dispatch(
+        self,
+        task: Task,
+        *,
+        context: LoadBalancerContextLike,
+        timeout: float | None = None,
+    ) -> AsyncGenerator:
+        """Dispatch *task* to the first worker, in priority order, that accepts.
+
+        :param task:
+            The :class:`wool.Task` to dispatch.
+        :param context:
+            The :class:`wool.LoadBalancerContextLike` whose ``workers`` map
+            names the candidates and through which ``remove_worker`` evicts
+            unhealthy peers.
+        :param timeout:
+            Per-attempt timeout in seconds against a single worker.
+        :returns:
+            The worker's response stream (an async iterator).
+        :raises NoWorkersAvailable:
+            When the pool is empty or no worker accepts the task across one
+            stable-order pass (all rejected transiently or were evicted).
+        """
+        # Snapshot a stable, deterministic order up front. Sorting by uid
+        # makes "priority" reproducible across dispatches, so the same
+        # low-order workers stay saturated. ``str(uid)`` guarantees an
+        # orderable key regardless of the uid's concrete type. Iterating a
+        # snapshot keeps the pass well-defined even when ``remove_worker``
+        # mutates the live context mid-loop on eviction.
+        candidates = sorted(
+            context.workers.items(), key=lambda item: str(item[0].uid)
+        )
+
+        for metadata, connection in candidates:
+            try:
+                stream = await connection.dispatch(task, timeout=timeout)
+            except TransientRpcError as exc:
+                logger.debug(
+                    "Skipping worker %s on transient error: %s", metadata.uid, exc
+                )
+                continue
+            except RpcError as exc:
+                logger.warning(
+                    "Evicting worker %s after non-transient RPC error: %s",
+                    metadata.uid,
+                    exc,
+                )
+                context.remove_worker(metadata)
+                continue
+            else:
+                return stream
+
+        raise NoWorkersAvailable("No healthy workers available for dispatch")
diff --git a/tests/test_workflows/test_loadbalancer.py b/tests/test_workflows/test_loadbalancer.py
new file mode 100644
index 0000000..715eadb
--- /dev/null
+++ b/tests/test_workflows/test_loadbalancer.py
@@ -0,0 +1,239 @@
+"""Tests for the priority (leaky-bucket) load balancer."""
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+from types import MappingProxyType
+from typing import Any
+
+import cloudpickle
+import pytest
+import wool
+
+from cfdb.workflows.loadbalancer import PriorityLoadBalancer
+
+
+@dataclass(frozen=True)
+class _FakeMeta:
+    """Stand-in for WorkerMetadata; frozen so it is hashable as a dict key."""
+
+    uid: str
+
+
+class _FakeConn:
+    """Worker connection stub whose dispatch returns a stream or raises."""
+
+    def __init__(self, behavior: Any) -> None:
+        # behavior: a sentinel "stream" to return, or an Exception to raise.
+        self.behavior = behavior
+        self.calls = 0
+
+    async def dispatch(self, task: Any, *, timeout: float | None = None) -> Any:
+        self.calls += 1
+        if isinstance(self.behavior, Exception):
+            raise self.behavior
+        return self.behavior
+
+
+class _FakeContext:
+    """Minimal LoadBalancerContextLike backed by an ordered dict."""
+
+    def __init__(self, workers: dict[_FakeMeta, _FakeConn]) -> None:
+        self._workers = dict(workers)
+        self.removed: list[_FakeMeta] = []
+
+    @property
+    def workers(self) -> MappingProxyType:
+        return MappingProxyType(self._workers)
+
+    def add_worker(self, metadata: _FakeMeta, connection: _FakeConn) -> None:
+        self._workers[metadata] = connection
+
+    def update_worker(self, metadata, connection, *, upsert: bool = False) -> None:
+        if upsert or metadata in self._workers:
+            self._workers[metadata] = connection
+
+    def remove_worker(self, metadata: _FakeMeta) -> None:
+        self.removed.append(metadata)
+        self._workers.pop(metadata, None)
+
+
+class TestPriorityLoadBalancer:
+    @pytest.mark.asyncio
+    async def test_dispatch_should_return_first_worker_in_priority_order(self):
+        """Test that the lowest-uid worker is offered the task first.
+
+        Given:
+            Three healthy workers inserted out of uid order.
+        When:
+            A task is dispatched.
+        Then:
+            It should return the lowest-uid worker's stream and not even
+            contact the higher-uid workers (priority short-circuit).
+        """
+        # Arrange — insert c, a, b so insertion order != priority order.
+        conns = {
+            _FakeMeta("c"): _FakeConn("stream-c"),
+            _FakeMeta("a"): _FakeConn("stream-a"),
+            _FakeMeta("b"): _FakeConn("stream-b"),
+        }
+        context = _FakeContext(conns)
+
+        # Act
+        stream = await PriorityLoadBalancer().dispatch(object(), context=context)
+
+        # Assert
+        assert stream == "stream-a"
+        assert conns[_FakeMeta("a")].calls == 1
+        assert conns[_FakeMeta("b")].calls == 0
+        assert conns[_FakeMeta("c")].calls == 0
+
+    @pytest.mark.asyncio
+    async def test_dispatch_should_rotate_past_busy_worker_without_evicting(self):
+        """Test that a transiently-rejecting worker is skipped, not evicted.
+
+        Given:
+            The priority worker rejects with TransientRpcError (a
+            backpressure RESOURCE_EXHAUSTED) and the next accepts.
+        When:
+            A task is dispatched.
+        Then:
+            It should return the next worker's stream and leave the busy
+            worker in the pool (no eviction).
+        """
+        # Arrange
+        busy = _FakeConn(wool.TransientRpcError(details="busy"))
+        free = _FakeConn("stream-b")
+        context = _FakeContext({_FakeMeta("a"): busy, _FakeMeta("b"): free})
+
+        # Act
+        stream = await PriorityLoadBalancer().dispatch(object(), context=context)
+
+        # Assert
+        assert stream == "stream-b"
+        assert busy.calls == 1
+        assert context.removed == []
+
+    @pytest.mark.asyncio
+    async def test_dispatch_should_evict_worker_on_non_transient_error(self):
+        """Test that a non-transient RpcError evicts the failing worker.
+
+        Given:
+            The priority worker fails with a non-transient RpcError and the
+            next accepts.
+        When:
+            A task is dispatched.
+        Then:
+            It should return the next worker's stream and evict the failing
+            worker via remove_worker.
+        """
+        # Arrange
+        broken_meta = _FakeMeta("a")
+        broken = _FakeConn(wool.RpcError(details="broken"))
+        free = _FakeConn("stream-b")
+        context = _FakeContext({broken_meta: broken, _FakeMeta("b"): free})
+
+        # Act
+        stream = await PriorityLoadBalancer().dispatch(object(), context=context)
+
+        # Assert
+        assert stream == "stream-b"
+        assert context.removed == [broken_meta]
+
+    @pytest.mark.asyncio
+    async def test_dispatch_should_raise_when_pool_is_empty(self):
+        """Test that an empty pool surfaces NoWorkersAvailable.
+
+        Given:
+            A context with no workers.
+        When:
+            A task is dispatched.
+        Then:
+            It should raise NoWorkersAvailable (the executor's signal to add
+            a worker and re-queue).
+        """
+        # Arrange
+        context = _FakeContext({})
+
+        # Act & assert
+        with pytest.raises(wool.NoWorkersAvailable):
+            await PriorityLoadBalancer().dispatch(object(), context=context)
+
+    @pytest.mark.asyncio
+    async def test_dispatch_should_raise_when_all_workers_reject(self):
+        """Test that an all-busy pool surfaces NoWorkersAvailable.
+
+        Given:
+            Every worker rejects with TransientRpcError (all at capacity).
+        When:
+            A task is dispatched.
+        Then:
+            It should raise NoWorkersAvailable after one full pass, without
+            evicting the still-healthy busy workers.
+        """
+        # Arrange
+        a = _FakeConn(wool.TransientRpcError(details="busy"))
+        b = _FakeConn(wool.TransientRpcError(details="busy"))
+        context = _FakeContext({_FakeMeta("a"): a, _FakeMeta("b"): b})
+
+        # Act & assert
+        with pytest.raises(wool.NoWorkersAvailable):
+            await PriorityLoadBalancer().dispatch(object(), context=context)
+
+        # ...and busy workers stay in the pool for the next attempt.
+        assert context.removed == []
+        assert a.calls == 1 and b.calls == 1
+
+    @pytest.mark.asyncio
+    async def test_dispatch_should_let_unexpected_errors_propagate(self):
+        """Test that a non-RpcError propagates without touching the pool.
+
+        Given:
+            The priority worker raises a plain ValueError (a caller-fault,
+            not a worker-health signal).
+        When:
+            A task is dispatched.
+        Then:
+            It should propagate the ValueError unwrapped and not evict the
+            worker, per wool's worker-health contract.
+        """
+        # Arrange
+        meta = _FakeMeta("a")
+        context = _FakeContext({meta: _FakeConn(ValueError("boom"))})
+
+        # Act & assert
+        with pytest.raises(ValueError, match="boom"):
+            await PriorityLoadBalancer().dispatch(object(), context=context)
+        assert context.removed == []
+
+    def test_cloudpickle_roundtrip_should_succeed(self):
+        """Test that the stateless balancer cloudpickle-serializes.
+
+        Given:
+            A PriorityLoadBalancer instance.
+        When:
+            It is cloudpickle dumped and loaded back.
+        Then:
+            The round-trip yields another PriorityLoadBalancer (no lock or
+            index state to poison serialization).
+        """
+        # Act
+        restored = cloudpickle.loads(cloudpickle.dumps(PriorityLoadBalancer()))
+
+        # Assert
+        assert isinstance(restored, PriorityLoadBalancer)
+
+    def test_conforms_to_loadbalancer_like_protocol(self):
+        """Test that the balancer satisfies wool's LoadBalancerLike protocol.
+
+        Given:
+            A constructed balancer.
+        When:
+            It is checked against the runtime-checkable LoadBalancerLike
+            protocol.
+        Then:
+            It should be recognized as a LoadBalancerLike so WorkerPool
+            accepts it.
+        """
+        # Act & assert
+        assert isinstance(PriorityLoadBalancer(), wool.LoadBalancerLike)

From 5ff27db8943e781a8044ec58d66268cae5c16fff Mon Sep 17 00:00:00 2001
From: Conrad <conradbzura@gmail.com>
Date: Mon, 22 Jun 2026 12:44:33 -0400
Subject: [PATCH 04/19] feat: Add JobRecord dispatch-scheduling fields and lock
 queries

Extend the workflow job record and mutex layer for issue #45's durable
dispatch-retry and admission control, without changing dispatch behavior
yet (the executor consumes these next):

- JobRecord gains next_dispatch_at (when the retry scheduler should next
  attempt dispatch) and dispatch_attempts (deferral count). Both are
  serialized by to_mongo; the aware-datetime validator and
  _record_from_mongo treat next_dispatch_at as nullable UTC.
- lock.py gains count_active_workflows (admission count),
  reschedule_dispatch (defer a still-PENDING job, fenced on PENDING), and
  lease_due_dispatch (atomically claim one due PENDING job, pushing
  next_dispatch_at forward so concurrent scheduler ticks or API replicas
  cannot double-pick it).
---
 src/cfdb/workflows/lock.py   | 66 +++++++++++++++++++++++++++++++++++-
 src/cfdb/workflows/models.py | 24 ++++++++++---
 2 files changed, 85 insertions(+), 5 deletions(-)

diff --git a/src/cfdb/workflows/lock.py b/src/cfdb/workflows/lock.py
index e5f1fef..c7457b7 100644
--- a/src/cfdb/workflows/lock.py
+++ b/src/cfdb/workflows/lock.py
@@ -20,6 +20,7 @@
 from datetime import datetime, timedelta, timezone
 from typing import Any, Optional
 
+from pymongo import ReturnDocument
 from pymongo.errors import DuplicateKeyError
 from pymongo.write_concern import WriteConcern
 
@@ -133,7 +134,7 @@ def _record_from_mongo(doc: dict[str, Any]) -> JobRecord:
     here so the round-trip survives.
     """
     sanitized = {k: v for k, v in doc.items() if k != "_id"}
-    for field in ("submitted_at", "updated_at"):
+    for field in ("submitted_at", "updated_at", "next_dispatch_at"):
         value = sanitized.get(field)
         if isinstance(value, datetime) and value.tzinfo is None:
             sanitized[field] = value.replace(tzinfo=timezone.utc)
@@ -471,3 +472,66 @@ async def heartbeat_workflow(db, job_id: str) -> None:
         },
         {"$set": {"updated_at": _utcnow()}},
     )
+
+
+# --- Bounded-concurrency admission + durable dispatch retry (issue #45) ------
+
+
+async def count_active_workflows(db) -> int:
+    """Count workflows currently holding the mutex (pending + running).
+
+    Backs the admission ceiling in ``ensure_workflow``: once this reaches
+    ``CFDB_WORKFLOW_MAX_ACTIVE`` new requests are shed with 429. Soft by
+    nature — a count-then-insert race can briefly overshoot the cap, which
+    is acceptable for a flood guard.
+    """
+    jobs = _jobs(db)
+    return await jobs.count_documents({"status": {"$in": _ACTIVE_STATUS_VALUES}})
+
+
+async def reschedule_dispatch(db, job_id: str, *, next_at: datetime) -> None:
+    """Defer a still-PENDING job's next dispatch attempt to ``next_at``.
+
+    Called when a dispatch attempt finds no worker capacity: the job stays
+    PENDING and the durable scheduler re-attempts it at ``next_at``.
+    Fenced on PENDING so a job that has since gone RUNNING/terminal (a
+    racing attempt won the worker, or it was stale-reclaimed) is not
+    dragged back into the dispatch queue. Bumps ``dispatch_attempts`` for
+    observability.
+    """
+    jobs = _jobs(db)
+    await jobs.update_one(
+        {"job_id": job_id, "status": JobStatus.PENDING.value},
+        {
+            "$set": {"next_dispatch_at": next_at, "updated_at": _utcnow()},
+            "$inc": {"dispatch_attempts": 1},
+        },
+    )
+
+
+async def lease_due_dispatch(
+    db, *, now: datetime, next_at: datetime
+) -> Optional[JobRecord]:
+    """Atomically claim one PENDING job whose dispatch is due.
+
+    Selects a PENDING job with ``next_dispatch_at <= now`` and, in the same
+    operation, pushes its ``next_dispatch_at`` forward to ``next_at`` — so a
+    concurrent scheduler tick (or a second API replica) cannot lease the
+    same job, and a crash mid-attempt still leaves the job scheduled for a
+    later retry. Returns the leased job, or ``None`` when nothing is due.
+
+    The caller runs a dispatch attempt for the returned job; that attempt
+    either wins a worker (``mark_running``) or calls ``reschedule_dispatch``
+    again on overflow. Jobs with ``next_dispatch_at`` unset (``None``) are
+    never due, so they are not leased here.
+    """
+    jobs = _jobs(db)
+    doc = await jobs.find_one_and_update(
+        {
+            "status": JobStatus.PENDING.value,
+            "next_dispatch_at": {"$lte": now},
+        },
+        {"$set": {"next_dispatch_at": next_at, "updated_at": now}},
+        return_document=ReturnDocument.AFTER,
+    )
+    return _record_from_mongo(doc) if doc is not None else None
diff --git a/src/cfdb/workflows/models.py b/src/cfdb/workflows/models.py
index b1fff7d..2cc4058 100644
--- a/src/cfdb/workflows/models.py
+++ b/src/cfdb/workflows/models.py
@@ -81,6 +81,14 @@ class JobRecord(BaseModel):
             metadata as it was at claim time. Persisted on insert so a
             re-run / observer can reconstruct the dispatch context
             without re-reading the (possibly mutated) ``files`` document.
+        next_dispatch_at: When the durable retry scheduler should next
+            attempt to dispatch this (still-PENDING) job to a worker. Set
+            on claim and pushed forward each time an attempt finds no
+            capacity; ``None`` once the job is running or terminal. Drives
+            the scheduler's "due for dispatch" query.
+        dispatch_attempts: How many times dispatch has been deferred for
+            lack of worker capacity. Observability only — the failure
+            deadline is measured from ``submitted_at``, not this count.
     """
 
     model_config = ConfigDict(
@@ -107,18 +115,24 @@ class JobRecord(BaseModel):
     error: Annotated[str, StringConstraints(max_length=1024)] | None = None
     superseded_by: Annotated[str, StringConstraints(max_length=64)] | None = None
     file_meta_snapshot: dict[str, Any] | None = None
+    next_dispatch_at: datetime | None = None
+    dispatch_attempts: int = Field(default=0, ge=0)
 
-    @field_validator("submitted_at", "updated_at")
+    @field_validator("submitted_at", "updated_at", "next_dispatch_at")
     @classmethod
-    def _require_aware_datetime(cls, value: datetime) -> datetime:
+    def _require_aware_datetime(cls, value: datetime | None) -> datetime | None:
         """Reject naive datetimes.
 
         All internal producers go through ``lock._utcnow()`` which returns
         an aware UTC datetime; a naive value would either be a refactor
         regression or a malformed document. Caught here so the stale
-        cutoff comparison in ``claim_workflow`` can never raise
-        ``TypeError`` on a mismatched-naivety operand.
+        cutoff comparison in ``claim_workflow`` (and the scheduler's
+        ``next_dispatch_at`` comparison) can never raise ``TypeError`` on a
+        mismatched-naivety operand. ``next_dispatch_at`` is nullable, so a
+        ``None`` passes through untouched.
         """
+        if value is None:
+            return value
         if value.tzinfo is None:
             raise ValueError(
                 "JobRecord datetimes must be timezone-aware "
@@ -163,4 +177,6 @@ def to_mongo(self) -> dict[str, Any]:
                 if self.file_meta_snapshot is not None
                 else None
             ),
+            "next_dispatch_at": self.next_dispatch_at,
+            "dispatch_attempts": self.dispatch_attempts,
         }

From de19caeef1304e4f08ec9e77be04e33df92ad272 Mon Sep 17 00:00:00 2001
From: Conrad <conradbzura@gmail.com>
Date: Mon, 22 Jun 2026 13:56:19 -0400
Subject: [PATCH 05/19] fix: Harden dispatch-scheduling lock helpers per review

Address review findings on the foundational #45 primitives:

- lease_due_dispatch leases due jobs oldest-first (next_dispatch_at
  ascending) so sustained overflow cannot starve the longest-waiting job
  toward its deadline, and rejects a next_at that is not strictly in the
  future (an equal/past value would let a concurrent tick re-lease the
  same job before the attempt resolves, defeating the single-claim guard).
- reschedule_dispatch logs a debug line on a no-op fenced write, matching
  the observability convention of the module's other fenced updates.
- Correct the JobRecord.next_dispatch_at docstring: it is unset on claim,
  not set on claim.
- backpressure_for rejects a negative threshold rather than silently
  disabling backpressure, making its contract total.
---
 src/cfdb/workflows/backpressure.py |  9 ++++++++-
 src/cfdb/workflows/lock.py         | 23 ++++++++++++++++++++++-
 src/cfdb/workflows/models.py       | 10 ++++++----
 3 files changed, 36 insertions(+), 6 deletions(-)

diff --git a/src/cfdb/workflows/backpressure.py b/src/cfdb/workflows/backpressure.py
index a5abaed..8f89966 100644
--- a/src/cfdb/workflows/backpressure.py
+++ b/src/cfdb/workflows/backpressure.py
@@ -51,6 +51,13 @@ def backpressure_for(threshold: int) -> TaskCountBackpressure | None:
     ``threshold == 0`` disables backpressure: the worker wiring passes
     ``backpressure=None`` to wool, restoring unbounded admission (the prior
     behavior). Any positive value yields a :class:`TaskCountBackpressure`
-    that rejects once the worker has that many tasks in flight.
+    that rejects once the worker has that many tasks in flight. A negative
+    threshold is rejected rather than silently disabling backpressure, so
+    the helper's contract is total.
     """
+    if threshold < 0:
+        raise ValueError(
+            f"backpressure threshold must be >= 0; got {threshold} "
+            "(0 disables backpressure, a positive value sets the limit)"
+        )
     return TaskCountBackpressure(threshold) if threshold > 0 else None
diff --git a/src/cfdb/workflows/lock.py b/src/cfdb/workflows/lock.py
index c7457b7..41f805c 100644
--- a/src/cfdb/workflows/lock.py
+++ b/src/cfdb/workflows/lock.py
@@ -500,13 +500,19 @@ async def reschedule_dispatch(db, job_id: str, *, next_at: datetime) -> None:
     observability.
     """
     jobs = _jobs(db)
-    await jobs.update_one(
+    result = await jobs.update_one(
         {"job_id": job_id, "status": JobStatus.PENDING.value},
         {
             "$set": {"next_dispatch_at": next_at, "updated_at": _utcnow()},
             "$inc": {"dispatch_attempts": 1},
         },
     )
+    if getattr(result, "matched_count", 0) == 0:
+        logger.debug(
+            "reschedule_dispatch no-op for job %s — row no longer PENDING "
+            "(won a worker, terminated, or was stale-reclaimed)",
+            job_id,
+        )
 
 
 async def lease_due_dispatch(
@@ -524,7 +530,21 @@ async def lease_due_dispatch(
     either wins a worker (``mark_running``) or calls ``reschedule_dispatch``
     again on overflow. Jobs with ``next_dispatch_at`` unset (``None``) are
     never due, so they are not leased here.
+
+    Due jobs are leased oldest-first (``next_dispatch_at`` ascending) so
+    sustained overflow cannot starve the longest-waiting job toward its
+    deadline. ``next_at`` MUST be strictly in the future relative to
+    ``now``: the single-claim guarantee depends on the leased row's
+    ``next_dispatch_at`` moving out of the ``<= now`` window, so an
+    equal/past value would let a concurrent tick re-lease the same job
+    before the attempt resolves.
     """
+    if next_at <= now:
+        raise ValueError(
+            f"lease_due_dispatch requires next_at ({next_at}) strictly after "
+            f"now ({now}); an equal/past value would let a concurrent tick "
+            "immediately re-lease the same job"
+        )
     jobs = _jobs(db)
     doc = await jobs.find_one_and_update(
         {
@@ -532,6 +552,7 @@ async def lease_due_dispatch(
             "next_dispatch_at": {"$lte": now},
         },
         {"$set": {"next_dispatch_at": next_at, "updated_at": now}},
+        sort=[("next_dispatch_at", 1)],
         return_document=ReturnDocument.AFTER,
     )
     return _record_from_mongo(doc) if doc is not None else None
diff --git a/src/cfdb/workflows/models.py b/src/cfdb/workflows/models.py
index 2cc4058..cdeda5f 100644
--- a/src/cfdb/workflows/models.py
+++ b/src/cfdb/workflows/models.py
@@ -82,10 +82,12 @@ class JobRecord(BaseModel):
             re-run / observer can reconstruct the dispatch context
             without re-reading the (possibly mutated) ``files`` document.
         next_dispatch_at: When the durable retry scheduler should next
-            attempt to dispatch this (still-PENDING) job to a worker. Set
-            on claim and pushed forward each time an attempt finds no
-            capacity; ``None`` once the job is running or terminal. Drives
-            the scheduler's "due for dispatch" query.
+            attempt to dispatch this (still-PENDING) job to a worker.
+            Unset (``None``) on claim and on a running/terminal job; set
+            and pushed forward by the retry scheduler each time a dispatch
+            attempt finds no capacity. Drives the scheduler's "due for
+            dispatch" query, which matches only rows whose
+            ``next_dispatch_at`` is set and in the past.
         dispatch_attempts: How many times dispatch has been deferred for
             lack of worker capacity. Observability only — the failure
             deadline is measured from ``submitted_at``, not this count.

From 2e7925aa9b026eca1e7fea4da2e30ab7488dc2e5 Mon Sep 17 00:00:00 2001
From: Conrad <conradbzura@gmail.com>
Date: Mon, 22 Jun 2026 13:57:40 -0400
Subject: [PATCH 06/19] feat: Index the durable retry scheduler's due-dispatch
 lease query

lease_due_dispatch selects {status: pending, next_dispatch_at: {$lte:
now}} sorted by next_dispatch_at ascending, but the existing
status_updated_at index covers only the status prefix, so the
next_dispatch_at range/sort would scan all PENDING rows on every
scheduler tick. Add a (status, next_dispatch_at) index to
operational_index_specs and mirror it in create-indexes.js so the poll is
index-served. Adding it now, alongside the primitive that defines the
query, avoids an index migration on a hot, TTL-populated collection once
the scheduler goes live.
---
 scripts/create-indexes.js | 11 +++++++++++
 src/cfdb/indexes.py       | 10 ++++++++++
 2 files changed, 21 insertions(+)

diff --git a/scripts/create-indexes.js b/scripts/create-indexes.js
index dad8f29..c7ebff3 100644
--- a/scripts/create-indexes.js
+++ b/scripts/create-indexes.js
@@ -221,6 +221,17 @@ ensureIndex(
     { name: "status_updated_at" }
 );
 
+// Serves the durable retry scheduler's due-dispatch lease
+// (workflows.lock.lease_due_dispatch): { status: "pending",
+// next_dispatch_at: { $lte: now } } sorted by next_dispatch_at asc. The
+// status equality prefix plus the next_dispatch_at range/sort are both
+// index-served, so the per-tick poll is not a scan of all PENDING rows.
+ensureIndex(
+    db.jobs,
+    { status: 1, next_dispatch_at: 1 },
+    { name: "status_next_dispatch_at" }
+);
+
 // TTL index on terminal job rows. Without this, every stale-reclaim
 // transition leaves a permanent ``failed`` document and the collection
 // grows unbounded for any frequently-touched workflow_key. The partial
diff --git a/src/cfdb/indexes.py b/src/cfdb/indexes.py
index a3f9ccd..7bffd5a 100644
--- a/src/cfdb/indexes.py
+++ b/src/cfdb/indexes.py
@@ -134,6 +134,16 @@ def operational_index_specs() -> list[IndexSpec]:
         ),
         IndexSpec("jobs", [("job_id", 1)], name="job_id_unique", unique=True),
         IndexSpec("jobs", [("status", 1), ("updated_at", 1)], name="status_updated_at"),
+        # Serves the durable retry scheduler's due-dispatch lease
+        # (workflows.lock.lease_due_dispatch): {status: pending,
+        # next_dispatch_at: {$lte: now}} sorted by next_dispatch_at asc.
+        # The status equality prefix + next_dispatch_at range/sort are
+        # both index-served, so the per-tick poll is not a PENDING scan.
+        IndexSpec(
+            "jobs",
+            [("status", 1), ("next_dispatch_at", 1)],
+            name="status_next_dispatch_at",
+        ),
         # TTL on terminal rows so the collection doesn't grow unbounded.
         # The partial filter excludes active rows (active=True) so an
         # in-flight job is never reaped. 7 days gives operators a window

From d7a63feeb92796acade085f57a7a97114a36d8c8 Mon Sep 17 00:00:00 2001
From: Conrad <conradbzura@gmail.com>
Date: Mon, 22 Jun 2026 14:03:34 -0400
Subject: [PATCH 07/19] test: Cover the dispatch-scheduling lock helpers and
 JobRecord fields

Pin the foundational #45 concurrency primitives flagged by review as
shipping untested:

- lease_due_dispatch: leases a due PENDING job and pushes next_dispatch_at
  forward; a second lease in the same window returns None (single-claim
  guard); skips unscheduled (None), not-yet-due, and non-PENDING jobs;
  leases oldest-due first; rejects a non-future next_at.
- reschedule_dispatch: defers a PENDING job and bumps dispatch_attempts;
  no-ops (fenced) on a non-PENDING row.
- count_active_workflows: counts pending+running, excludes terminal.
- JobRecord: next_dispatch_at/dispatch_attempts defaults, naive-datetime
  rejection, aware round-trip, and to_mongo serialization.

Extend the in-memory FakeCollection test double with $inc and a
single-key sort on find_one_and_update so these queries are exercisable
(both additive; existing tests use neither).
---
 tests/conftest.py                   |  19 +-
 tests/test_workflows/test_lock.py   | 317 +++++++++++++++++++++++++++-
 tests/test_workflows/test_models.py |  79 +++++++
 3 files changed, 410 insertions(+), 5 deletions(-)

diff --git a/tests/conftest.py b/tests/conftest.py
index 21e5b6d..c767496 100644
--- a/tests/conftest.py
+++ b/tests/conftest.py
@@ -173,6 +173,10 @@ def _apply_update(doc: dict, update: dict, *, is_insert: bool = False) -> bool:
                 if v not in existing:
                     existing.append(v)
                     changed = True
+        elif op == "$inc":
+            for k, v in fields.items():
+                _set_nested(doc, k, (_resolve(doc, k) or 0) + v)
+                changed = True
         else:
             raise NotImplementedError(f"FakeCollection update op: {op}")
     return changed
@@ -267,12 +271,19 @@ async def find_one_and_update(
         *,
         upsert: bool = False,
         return_document=None,
+        sort: list | None = None,
         **_kwargs,
     ) -> dict | None:
-        for d in self.docs:
-            if _match(d, query):
-                _apply_update(d, update, is_insert=False)
-                return dict(d)
+        candidates = [d for d in self.docs if _match(d, query)]
+        if sort:
+            # Single-key sort is all the production queries use; honoring it
+            # lets tests exercise ordered claims (e.g. oldest-due-first).
+            field, direction = sort[0]
+            candidates.sort(key=lambda d: d.get(field), reverse=direction < 0)
+        if candidates:
+            d = candidates[0]
+            _apply_update(d, update, is_insert=False)
+            return dict(d)
         if upsert:
             seed: dict = {}
             for k, v in query.items():
diff --git a/tests/test_workflows/test_lock.py b/tests/test_workflows/test_lock.py
index 7061e7f..8b84e4b 100644
--- a/tests/test_workflows/test_lock.py
+++ b/tests/test_workflows/test_lock.py
@@ -7,12 +7,39 @@
 import pytest
 
 from cfdb.workflows import lock
-from cfdb.workflows.models import ACTIVE_STATUSES, ArtifactKind, JobStatus
+from cfdb.workflows.models import ACTIVE_STATUSES, ArtifactKind, JobRecord, JobStatus
 from tests.test_workflows import FIXTURE_MD5
 
 PIPELINE_VERSION = 1
 
 
+def _insert_job(
+    mock_db,
+    *,
+    workflow_key: str,
+    status: JobStatus,
+    next_dispatch_at: datetime | None = None,
+    dispatch_attempts: int = 0,
+) -> JobRecord:
+    """Insert a crafted job doc into the fake collection and return it."""
+    now = datetime.now(timezone.utc)
+    record = JobRecord(
+        job_id=f"job-{workflow_key}",
+        workflow_key=workflow_key,
+        status=status,
+        dcc="encode",
+        local_id="x",
+        md5=FIXTURE_MD5,
+        pipeline_version=PIPELINE_VERSION,
+        submitted_at=now,
+        updated_at=now,
+        next_dispatch_at=next_dispatch_at,
+        dispatch_attempts=dispatch_attempts,
+    )
+    mock_db.jobs.docs.append(record.to_mongo())
+    return record
+
+
 def _install_jobs_index(mock_db) -> None:
     """Register the partial unique index on the FakeDB jobs collection."""
     mock_db.jobs.create_index(
@@ -875,3 +902,291 @@ async def test_get_job_should_hydrate_persisted_record(self, mock_db):
         assert hydrated is not None
         assert hydrated.job_id == record.job_id
         assert hydrated.workflow_key == record.workflow_key
+
+
+class TestCountActiveWorkflows:
+    @pytest.mark.asyncio
+    async def test_count_active_workflows_should_count_pending_and_running(
+        self, mock_db
+    ):
+        """Test that the admission count includes only active jobs.
+
+        Given:
+            Two pending, one running, and two terminal job rows.
+        When:
+            count_active_workflows is awaited.
+        Then:
+            It should return 3 — pending + running only, excluding the
+            terminal rows.
+        """
+        # Arrange
+        _insert_job(mock_db, workflow_key="a", status=JobStatus.PENDING)
+        _insert_job(mock_db, workflow_key="b", status=JobStatus.PENDING)
+        _insert_job(mock_db, workflow_key="c", status=JobStatus.RUNNING)
+        _insert_job(mock_db, workflow_key="d", status=JobStatus.COMPLETED)
+        _insert_job(mock_db, workflow_key="e", status=JobStatus.FAILED)
+
+        # Act
+        count = await lock.count_active_workflows(mock_db)
+
+        # Assert
+        assert count == 3
+
+    @pytest.mark.asyncio
+    async def test_count_active_workflows_should_return_zero_when_empty(self, mock_db):
+        """Test that an empty collection counts as zero active workflows.
+
+        Given:
+            A jobs collection with no rows.
+        When:
+            count_active_workflows is awaited.
+        Then:
+            It should return 0.
+        """
+        # Act
+        count = await lock.count_active_workflows(mock_db)
+
+        # Assert
+        assert count == 0
+
+
+class TestRescheduleDispatch:
+    @pytest.mark.asyncio
+    async def test_reschedule_dispatch_should_defer_pending_and_bump_attempts(
+        self, mock_db
+    ):
+        """Test that rescheduling a pending job advances its retry time.
+
+        Given:
+            A pending job with no next_dispatch_at and zero attempts.
+        When:
+            reschedule_dispatch is awaited with a future next_at.
+        Then:
+            It should set next_dispatch_at to that time and increment
+            dispatch_attempts.
+        """
+        # Arrange
+        _insert_job(mock_db, workflow_key="a", status=JobStatus.PENDING)
+        next_at = datetime.now(timezone.utc) + timedelta(seconds=120)
+
+        # Act
+        await lock.reschedule_dispatch(mock_db, "job-a", next_at=next_at)
+
+        # Assert
+        doc = mock_db.jobs.docs[0]
+        assert doc["next_dispatch_at"] == next_at
+        assert doc["dispatch_attempts"] == 1
+
+    @pytest.mark.asyncio
+    async def test_reschedule_dispatch_should_be_noop_when_not_pending(self, mock_db):
+        """Test that rescheduling a non-pending job changes nothing.
+
+        Given:
+            A running job (it already won a worker).
+        When:
+            reschedule_dispatch is awaited for it.
+        Then:
+            It should leave next_dispatch_at and dispatch_attempts untouched,
+            because the update is fenced on PENDING.
+        """
+        # Arrange
+        _insert_job(mock_db, workflow_key="a", status=JobStatus.RUNNING)
+        next_at = datetime.now(timezone.utc) + timedelta(seconds=120)
+
+        # Act
+        await lock.reschedule_dispatch(mock_db, "job-a", next_at=next_at)
+
+        # Assert
+        doc = mock_db.jobs.docs[0]
+        assert doc["next_dispatch_at"] is None
+        assert doc["dispatch_attempts"] == 0
+
+
+class TestLeaseDueDispatch:
+    @pytest.mark.asyncio
+    async def test_lease_due_dispatch_should_claim_due_job_and_push_forward(
+        self, mock_db
+    ):
+        """Test that a due pending job is leased and its next attempt deferred.
+
+        Given:
+            A pending job whose next_dispatch_at is in the past.
+        When:
+            lease_due_dispatch is awaited with now past that time and a
+            future next_at.
+        Then:
+            It should return the job and push its next_dispatch_at forward to
+            next_at so a later tick won't re-lease it.
+        """
+        # Arrange
+        now = datetime.now(timezone.utc)
+        _insert_job(
+            mock_db,
+            workflow_key="a",
+            status=JobStatus.PENDING,
+            next_dispatch_at=now - timedelta(seconds=1),
+        )
+        next_at = now + timedelta(seconds=120)
+
+        # Act
+        leased = await lock.lease_due_dispatch(mock_db, now=now, next_at=next_at)
+
+        # Assert
+        assert leased is not None
+        assert leased.workflow_key == "a"
+        assert mock_db.jobs.docs[0]["next_dispatch_at"] == next_at
+
+    @pytest.mark.asyncio
+    async def test_lease_due_dispatch_should_not_re_lease_within_lease_window(
+        self, mock_db
+    ):
+        """Test that a freshly-leased job is not immediately re-leasable.
+
+        Given:
+            One due pending job that has just been leased (its
+            next_dispatch_at pushed into the future).
+        When:
+            lease_due_dispatch is awaited again at the same now.
+        Then:
+            It should return None — the single-claim guard, so two ticks or
+            replicas cannot double-dispatch one job.
+        """
+        # Arrange
+        now = datetime.now(timezone.utc)
+        _insert_job(
+            mock_db,
+            workflow_key="a",
+            status=JobStatus.PENDING,
+            next_dispatch_at=now - timedelta(seconds=1),
+        )
+        next_at = now + timedelta(seconds=120)
+        first = await lock.lease_due_dispatch(mock_db, now=now, next_at=next_at)
+
+        # Act
+        second = await lock.lease_due_dispatch(mock_db, now=now, next_at=next_at)
+
+        # Assert
+        assert first is not None
+        assert second is None
+
+    @pytest.mark.asyncio
+    async def test_lease_due_dispatch_should_skip_unscheduled_job(self, mock_db):
+        """Test that a job with next_dispatch_at unset is never leased.
+
+        Given:
+            A pending job whose next_dispatch_at is None (freshly claimed,
+            not yet deferred by the scheduler).
+        When:
+            lease_due_dispatch is awaited.
+        Then:
+            It should return None — an unscheduled job is not due.
+        """
+        # Arrange
+        now = datetime.now(timezone.utc)
+        _insert_job(
+            mock_db, workflow_key="a", status=JobStatus.PENDING, next_dispatch_at=None
+        )
+
+        # Act
+        leased = await lock.lease_due_dispatch(
+            mock_db, now=now, next_at=now + timedelta(seconds=120)
+        )
+
+        # Assert
+        assert leased is None
+
+    @pytest.mark.asyncio
+    async def test_lease_due_dispatch_should_skip_future_and_running_jobs(
+        self, mock_db
+    ):
+        """Test that not-yet-due and non-pending jobs are not leased.
+
+        Given:
+            A pending job due in the future and a running job already past
+            its time.
+        When:
+            lease_due_dispatch is awaited.
+        Then:
+            It should return None — the future job is not due and the running
+            job is not pending.
+        """
+        # Arrange
+        now = datetime.now(timezone.utc)
+        _insert_job(
+            mock_db,
+            workflow_key="future",
+            status=JobStatus.PENDING,
+            next_dispatch_at=now + timedelta(seconds=60),
+        )
+        _insert_job(
+            mock_db,
+            workflow_key="running",
+            status=JobStatus.RUNNING,
+            next_dispatch_at=now - timedelta(seconds=60),
+        )
+
+        # Act
+        leased = await lock.lease_due_dispatch(
+            mock_db, now=now, next_at=now + timedelta(seconds=120)
+        )
+
+        # Assert
+        assert leased is None
+
+    @pytest.mark.asyncio
+    async def test_lease_due_dispatch_should_lease_oldest_due_job_first(self, mock_db):
+        """Test that the longest-waiting due job is leased first.
+
+        Given:
+            Two due pending jobs with different next_dispatch_at times,
+            inserted newest-first.
+        When:
+            lease_due_dispatch is awaited.
+        Then:
+            It should lease the one with the earliest next_dispatch_at, so
+            sustained overflow cannot starve the oldest job.
+        """
+        # Arrange
+        now = datetime.now(timezone.utc)
+        _insert_job(
+            mock_db,
+            workflow_key="newer",
+            status=JobStatus.PENDING,
+            next_dispatch_at=now - timedelta(seconds=10),
+        )
+        _insert_job(
+            mock_db,
+            workflow_key="older",
+            status=JobStatus.PENDING,
+            next_dispatch_at=now - timedelta(seconds=300),
+        )
+
+        # Act
+        leased = await lock.lease_due_dispatch(
+            mock_db, now=now, next_at=now + timedelta(seconds=120)
+        )
+
+        # Assert
+        assert leased is not None
+        assert leased.workflow_key == "older"
+
+    @pytest.mark.asyncio
+    async def test_lease_due_dispatch_should_reject_next_at_not_in_future(
+        self, mock_db
+    ):
+        """Test that a non-future next_at is rejected.
+
+        Given:
+            A now timestamp and a next_at equal to it.
+        When:
+            lease_due_dispatch is awaited.
+        Then:
+            It should raise ValueError — an equal/past next_at would let a
+            concurrent tick immediately re-lease the same job.
+        """
+        # Arrange
+        now = datetime.now(timezone.utc)
+
+        # Act & assert
+        with pytest.raises(ValueError, match="strictly after"):
+            await lock.lease_due_dispatch(mock_db, now=now, next_at=now)
diff --git a/tests/test_workflows/test_models.py b/tests/test_workflows/test_models.py
index f5eb23d..09b458b 100644
--- a/tests/test_workflows/test_models.py
+++ b/tests/test_workflows/test_models.py
@@ -309,6 +309,85 @@ def test_to_mongo_should_serialize_stages_progress_and_superseded_by(self):
         assert payload["superseded_by"] == "job-xyz"
         assert payload["progress"] == "indexing"
 
+    def test___init___should_default_dispatch_scheduling_fields(self):
+        """Test that the dispatch-scheduling fields default sanely.
+
+        Given:
+            A JobRecord built without dispatch-scheduling overrides.
+        When:
+            It is constructed.
+        Then:
+            next_dispatch_at should default to None and dispatch_attempts
+            to 0.
+        """
+        # Act
+        job = _make_job()
+
+        # Assert
+        assert job.next_dispatch_at is None
+        assert job.dispatch_attempts == 0
+
+    def test___init___should_reject_naive_next_dispatch_at(self):
+        """Test that a naive next_dispatch_at is rejected.
+
+        Given:
+            A naive next_dispatch_at (no tzinfo).
+        When:
+            A JobRecord is constructed with it.
+        Then:
+            It should raise ValidationError — the scheduler's
+            next_dispatch_at comparison requires aware UTC.
+        """
+        # Arrange
+        naive = datetime(2026, 4, 21, 13, 0, 0)
+
+        # Act & assert
+        with pytest.raises(ValidationError, match="timezone-aware"):
+            _make_job(next_dispatch_at=naive)
+
+    def test___init___should_accept_aware_next_dispatch_at(self):
+        """Test that an aware next_dispatch_at validates and round-trips.
+
+        Given:
+            An aware UTC next_dispatch_at.
+        When:
+            A JobRecord is constructed with it.
+        Then:
+            It should validate and preserve the value (None remains the
+            nullable default elsewhere).
+        """
+        # Arrange
+        when = datetime(2026, 4, 21, 13, 0, 0, tzinfo=timezone.utc)
+
+        # Act
+        job = _make_job(next_dispatch_at=when)
+
+        # Assert
+        assert job.next_dispatch_at == when
+
+    def test_to_mongo_should_serialize_dispatch_scheduling_fields(self):
+        """Test that to_mongo persists next_dispatch_at and dispatch_attempts.
+
+        Given:
+            A JobRecord with an aware next_dispatch_at and a non-zero
+            dispatch_attempts.
+        When:
+            to_mongo is invoked.
+        Then:
+            Both fields should appear in the dump with their values
+            preserved verbatim.
+        """
+        # Arrange
+        when = datetime(2026, 4, 21, 13, 0, 0, tzinfo=timezone.utc)
+        job = _make_job(next_dispatch_at=when, dispatch_attempts=3)
+
+        # Act
+        payload = job.to_mongo()
+
+        # Assert
+        assert payload["next_dispatch_at"] == when
+        assert payload["dispatch_attempts"] == 3
+
     @pytest.mark.parametrize(
         "status",
         list(JobStatus),

From c8eae8d60fcf826eb9e26cf32586bb8656d8d4b5 Mon Sep 17 00:00:00 2001
From: Conrad <conradbzura@gmail.com>
Date: Mon, 22 Jun 2026 14:08:06 -0400
Subject: [PATCH 08/19] test: Harden load-balancer, backpressure, and wiring
 coverage

Address the test-coverage findings from review:

- PriorityLoadBalancer: pin cross-dispatch order stability (the
  leaky-bucket invariant, distinct from round-robin), continued in-order
  dispatch after evicting the first worker (the snapshot-iteration
  guarantee a 2-worker eviction did not exercise), and the str(uid)
  ordering key. Clarify the docstring that the uid order is
  arbitrary-but-stable, not a seniority ranking.
- backpressure_for: assert it rejects a negative threshold; give the
  protocol-conformance test teeth via the bool-return contract.
- Lifespan: assert the WorkerPool is constructed with a
  PriorityLoadBalancer (the headline wiring, previously uncovered).
- worker_lan: pin that the backpressure-bound spawn factory keeps wool's
  declares_host True, so the pool still prescribes the bind host.
---
 src/cfdb/workflows/loadbalancer.py         | 17 +++--
 tests/test_api/test_lifespan_workerpool.py |  6 ++
 tests/test_workflows/test_backpressure.py  | 20 ++++-
 tests/test_workflows/test_loadbalancer.py  | 87 ++++++++++++++++++++++
 tests/test_workflows/test_worker_lan.py    | 30 ++++++++
 5 files changed, 152 insertions(+), 8 deletions(-)

diff --git a/src/cfdb/workflows/loadbalancer.py b/src/cfdb/workflows/loadbalancer.py
index fd60ac9..d5bc06a 100644
--- a/src/cfdb/workflows/loadbalancer.py
+++ b/src/cfdb/workflows/loadbalancer.py
@@ -2,13 +2,16 @@
 
 Unlike wool's round-robin balancer, this one always offers a task to the
 discovered workers in the same stable order (sorted by
-``WorkerMetadata.uid``) on every dispatch. Combined with per-worker
-backpressure (one task per worker), load concentrates on the
-lowest-ordered workers and higher-ordered workers drain to idle — so an
-over-provisioned fleet sheds idle capacity (workers self-terminate on
-their max-lifetime) instead of every worker carrying a thin, perpetual
-slice of traffic. This is the "leaky bucket": fill the priority workers
-first and let the overflow spill to the next.
+``WorkerMetadata.uid``) on every dispatch. The order is arbitrary but
+reproducible — ``uid`` is a per-worker UUID, not a seniority or cost
+ranking; what matters is only that the same workers are offered work first
+on each dispatch. Combined with per-worker backpressure (one task per
+worker), load concentrates on the lowest-ordered workers and
+higher-ordered workers drain to idle — so an over-provisioned fleet sheds
+idle capacity (workers self-terminate on their max-lifetime) instead of
+every worker carrying a thin, perpetual slice of traffic. This is the
+"leaky bucket": fill the priority workers first and let the overflow spill
+to the next.
 
 It honors wool's load-balancer worker-health contract (see
 :class:`wool.LoadBalancerLike`): rotate to the next worker on
diff --git a/tests/test_api/test_lifespan_workerpool.py b/tests/test_api/test_lifespan_workerpool.py
index d0eb79c..30c857f 100644
--- a/tests/test_api/test_lifespan_workerpool.py
+++ b/tests/test_api/test_lifespan_workerpool.py
@@ -97,6 +97,12 @@ async def _fake_build_discovery(_profile):
     assert kwargs["quorum"] == 0
     assert kwargs["discovery"] is sentinel_discovery
     assert kwargs["credentials"] == "CREDS-SENTINEL"
+    # The priority/leaky-bucket balancer must be wired in (issue #45). This
+    # file exists to pin pool kwargs against silent regressions, so the
+    # load balancer belongs here alongside quorum/discovery/credentials.
+    from cfdb.workflows.loadbalancer import PriorityLoadBalancer
+
+    assert isinstance(kwargs["loadbalancer"], PriorityLoadBalancer)
 
 
 def test_worker_pool_signature_should_accept_quorum():
diff --git a/tests/test_workflows/test_backpressure.py b/tests/test_workflows/test_backpressure.py
index 5ab61d7..df984be 100644
--- a/tests/test_workflows/test_backpressure.py
+++ b/tests/test_workflows/test_backpressure.py
@@ -128,8 +128,11 @@ def test_conforms_to_backpressure_like_protocol(self):
         # Arrange
         hook = TaskCountBackpressure(1)
 
-        # Act & assert
+        # Act & assert — structural conformance, plus the bool-return
+        # contract wool actually depends on (BackpressureLike is satisfied
+        # by any callable, so the return-type check gives the test teeth).
         assert isinstance(hook, wool.BackpressureLike)
+        assert isinstance(hook(_ctx(active_task_count=0)), bool)
 
 
 class TestBackpressureFor:
@@ -163,3 +166,18 @@ def test_backpressure_for_should_build_hook_when_threshold_positive(self):
         # Assert
         assert isinstance(hook, TaskCountBackpressure)
         assert hook.threshold == 2
+
+    def test_backpressure_for_should_raise_on_negative_threshold(self):
+        """Test that a negative threshold is rejected, not silently disabled.
+
+        Given:
+            A negative threshold.
+        When:
+            backpressure_for is called.
+        Then:
+            It should raise ValueError, so a negative value cannot
+            accidentally disable backpressure (only 0 disables it).
+        """
+        # Act & assert
+        with pytest.raises(ValueError, match=">= 0"):
+            backpressure_for(-1)
diff --git a/tests/test_workflows/test_loadbalancer.py b/tests/test_workflows/test_loadbalancer.py
index 715eadb..01b61be 100644
--- a/tests/test_workflows/test_loadbalancer.py
+++ b/tests/test_workflows/test_loadbalancer.py
@@ -237,3 +237,90 @@ def test_conforms_to_loadbalancer_like_protocol(self):
         """
         # Act & assert
         assert isinstance(PriorityLoadBalancer(), wool.LoadBalancerLike)
+
+    @pytest.mark.asyncio
+    async def test_dispatch_should_pick_same_priority_worker_across_dispatches(self):
+        """Test that the priority order is stable across repeated dispatches.
+
+        Given:
+            Three healthy workers and a single balancer + context reused
+            across two consecutive dispatches.
+        When:
+            Two tasks are dispatched in succession.
+        Then:
+            Both should go to the lowest-uid worker — the leaky-bucket
+            invariant, distinct from round-robin which advances each call.
+        """
+        # Arrange
+        conns = {
+            _FakeMeta("a"): _FakeConn("s-a"),
+            _FakeMeta("b"): _FakeConn("s-b"),
+            _FakeMeta("c"): _FakeConn("s-c"),
+        }
+        context = _FakeContext(conns)
+        balancer = PriorityLoadBalancer()
+
+        # Act
+        first = await balancer.dispatch(object(), context=context)
+        second = await balancer.dispatch(object(), context=context)
+
+        # Assert
+        assert first == "s-a"
+        assert second == "s-a"
+        assert conns[_FakeMeta("a")].calls == 2
+        assert conns[_FakeMeta("b")].calls == 0
+
+    @pytest.mark.asyncio
+    async def test_dispatch_should_continue_in_order_after_evicting_first_worker(self):
+        """Test that evicting the first worker still tries the rest in order.
+
+        Given:
+            Three workers where the lowest-uid one fails with a non-transient
+            RpcError and the next accepts.
+        When:
+            A task is dispatched.
+        Then:
+            It should evict the first, dispatch to the second, and finish the
+            pass without error — pinning that iteration runs over a snapshot,
+            not the live worker map mutated by remove_worker.
+        """
+        # Arrange
+        broken = _FakeMeta("a")
+        conns = {
+            broken: _FakeConn(wool.RpcError(details="broken")),
+            _FakeMeta("b"): _FakeConn("s-b"),
+            _FakeMeta("c"): _FakeConn("s-c"),
+        }
+        context = _FakeContext(conns)
+
+        # Act
+        stream = await PriorityLoadBalancer().dispatch(object(), context=context)
+
+        # Assert
+        assert stream == "s-b"
+        assert context.removed == [broken]
+        assert conns[_FakeMeta("c")].calls == 0
+
+    @pytest.mark.asyncio
+    async def test_dispatch_should_order_workers_by_string_form_of_uid(self):
+        """Test that workers are ordered by the string form of their uid.
+
+        Given:
+            Two workers with non-string uids (10 and 2) whose lexicographic
+            string order ("10" < "2") differs from numeric order.
+        When:
+            A task is dispatched.
+        Then:
+            It should pick the worker whose str(uid) sorts first ("10"),
+            pinning the str(uid) coercion that gives a total, orderable key
+            regardless of the uid's concrete type.
+        """
+        # Arrange
+        conns = {_FakeMeta(10): _FakeConn("s-10"), _FakeMeta(2): _FakeConn("s-2")}
+        context = _FakeContext(conns)
+
+        # Act
+        stream = await PriorityLoadBalancer().dispatch(object(), context=context)
+
+        # Assert
+        assert stream == "s-10"
diff --git a/tests/test_workflows/test_worker_lan.py b/tests/test_workflows/test_worker_lan.py
index f99f784..5305670 100644
--- a/tests/test_workflows/test_worker_lan.py
+++ b/tests/test_workflows/test_worker_lan.py
@@ -174,3 +174,33 @@ def test_main_tls_cli_flags_override_env_vars(self, monkeypatch):
         # Assert
         assert exit_code == 0
         assert captured["tls_ca"] == "/cli/ca.pem"
+
+
+def test_backpressure_bound_factory_should_still_declare_host():
+    """Test that the backpressure-bound spawn factory still declares a host.
+
+    Given:
+        The ``functools.partial(LocalWorker, backpressure=hook)`` factory
+        that ``worker_lan.serve`` hands to ``WorkerPool(spawn=…)`` to apply
+        per-worker backpressure.
+    When:
+        wool's ``declares_host`` classifier inspects it.
+    Then:
+        It should return True, so the pool still prescribes the bind host. A
+        False here would make spawned workers bind 127.0.0.1 and advertise
+        unreachable addresses — a runtime failure with no other unit signal,
+        so this pins the load-bearing invariant the worker_lan wiring relies
+        on against a wool signature change.
+    """
+    # Arrange
+    import functools
+
+    import wool
+    from wool.runtime.worker.base import declares_host
+
+    from cfdb.workflows.backpressure import TaskCountBackpressure
+
+    factory = functools.partial(wool.LocalWorker, backpressure=TaskCountBackpressure(1))
+
+    # Act & assert
+    assert declares_host(factory) is True

From d6c5309841db51192e2580772212d3e71c9eb37b Mon Sep 17 00:00:00 2001
From: Conrad <conradbzura@gmail.com>
Date: Mon, 22 Jun 2026 19:07:30 -0400
Subject: [PATCH 09/19] feat: Drive workflow dispatch with a durable
 bounded-concurrency scheduler

Wire the dormant bounded-concurrency primitives into the executor's
request path so concurrency and cost become predictable and a dispatch
that finds no capacity queues durably instead of blocking a request.

ensure_workflow now enforces an admission ceiling: once the active
backlog (pending + running) reaches CFDB_WORKFLOW_MAX_ACTIVE it raises
AdmissionRejected before claiming the mutex, shedding a flood at the
door rather than queuing unbounded work.

A fresh claim dispatches its first attempt inline; each attempt offers
the task to existing workers once via the priority load balancer and
marks the job RUNNING only after a worker accepts, so a queued job stays
PENDING until it genuinely runs. On overflow the job is rescheduled
(staying PENDING, next_dispatch_at set) and, on the ECS profile, a single
best-effort worker spawn is requested -- inverting the old unconditional
pre-dispatch spawn. A durable, Mongo-backed scheduler re-attempts due
jobs on CFDB_WORKFLOW_RETRY_INTERVAL_S until a worker frees up or the
CFDB_WORKFLOW_DISPATCH_DEADLINE_S deadline elapses (then it fails the job
with a capacity: prefix). Because the queue lives in Mongo it survives an
API restart. A fresh claim leaves next_dispatch_at unset so the scheduler
cannot race the inline attempt.

This replaces the single in-request CFDB_WORKFLOW_DISPATCH_WAIT_S cold-
start wait, which is removed; the retry cadence plus deadline supersede
it.
---
 src/cfdb/workflows/__init__.py              |  25 +-
 src/cfdb/workflows/executor.py              | 607 +++++++++++++-------
 src/cfdb/workflows/lock.py                  |   7 +
 tests/integration/routines.py               |   8 +-
 tests/integration/test_executor_boundary.py |   4 +-
 tests/test_workflows/test_executor.py       | 573 ++++++++++++------
 6 files changed, 824 insertions(+), 400 deletions(-)

diff --git a/src/cfdb/workflows/__init__.py b/src/cfdb/workflows/__init__.py
index 597b5f2..8bd5012 100644
--- a/src/cfdb/workflows/__init__.py
+++ b/src/cfdb/workflows/__init__.py
@@ -31,12 +31,6 @@
   preprocessing runs (e.g., a ``samtools sort`` on a multi-GB BAM
   followed by ``samtools index``). Operators running on bounded fixtures
   in dev should lower this via env.
-- ``CFDB_WORKFLOW_DISPATCH_WAIT_S`` — how long ``ensure_workflow`` waits
-  for a wool worker to become available before giving up. Default ``240``
-  (4 min) — sized for an ECS Fargate cold start (image pull + health
-  check, typically 1-3 min). A smaller value risks exhausting the retry
-  budget before a freshly-provisioned worker reports HEALTHY; lower it
-  for fixture-bound dev where workers are already running.
 - ``CFDB_WORKFLOW_HEARTBEAT_INTERVAL_S`` — how often the routine emits a
   heartbeat event during quiet periods so the API can refresh
   ``updated_at`` on the JobRecord. Default ``300`` (5 min).
@@ -134,24 +128,17 @@ def _positive_int(name: str, value: str, *, minimum: int = 0) -> int:
 # Runtime caps require ``minimum=1`` — zero values silently break the
 # corresponding loop or timeout (``asyncio.timeout(0)`` fires immediately,
 # ``HEARTBEAT_INTERVAL_S=0`` spins, ``STALE_THRESHOLD_S=0`` reclaims every
-# active job on every check, ``DISPATCH_WAIT_S=0`` makes every cold start
-# look like a hard failure).
+# active job on every check).
 WORKFLOW_DURATION_CAP_S: Final = _positive_int(
     "CFDB_WORKFLOW_DURATION_CAP_S",
     os.getenv("CFDB_WORKFLOW_DURATION_CAP_S", "14400"),
     minimum=1,
 )
-# Default sized for a Fargate cold start (image pull + health check,
-# ~1-3 min). With ``quorum=0`` a cold-start dispatch surfaces
-# ``NoWorkersAvailable``, which the executor retries inside this budget;
-# the old 60s default could expire before the just-launched worker
-# reports HEALTHY, hard-failing the first request. 240s covers the
-# cold-start window with headroom while staying env-overridable.
-WORKFLOW_DISPATCH_WAIT_S: Final = _positive_int(
-    "CFDB_WORKFLOW_DISPATCH_WAIT_S",
-    os.getenv("CFDB_WORKFLOW_DISPATCH_WAIT_S", "240"),
-    minimum=1,
-)
+# NOTE: the former ``CFDB_WORKFLOW_DISPATCH_WAIT_S`` (a single in-request
+# cold-start wait) was removed in the #45 restructure. A dispatch that
+# finds no capacity no longer blocks the request — the job queues durably
+# and the retry scheduler re-attempts it on ``CFDB_WORKFLOW_RETRY_INTERVAL_S``
+# until ``CFDB_WORKFLOW_DISPATCH_DEADLINE_S`` (both below).
 WORKFLOW_HEARTBEAT_INTERVAL_S: Final = _positive_int(
     "CFDB_WORKFLOW_HEARTBEAT_INTERVAL_S",
     os.getenv("CFDB_WORKFLOW_HEARTBEAT_INTERVAL_S", "300"),
diff --git a/src/cfdb/workflows/executor.py b/src/cfdb/workflows/executor.py
index 41510a2..9c19ce6 100644
--- a/src/cfdb/workflows/executor.py
+++ b/src/cfdb/workflows/executor.py
@@ -38,18 +38,22 @@
 import asyncio
 import contextlib
 import logging
+import random
 import shutil
 from abc import ABC, abstractmethod
 from collections.abc import AsyncIterator
+from datetime import datetime, timedelta, timezone
 from pathlib import Path
 from typing import Any, Optional
 
 import wool
 
 from cfdb.workflows import (
-    WORKFLOW_DISPATCH_WAIT_S,
+    WORKFLOW_DISPATCH_DEADLINE_S,
     WORKFLOW_DURATION_CAP_S,
     WORKFLOW_HEARTBEAT_INTERVAL_S,
+    WORKFLOW_MAX_ACTIVE,
+    WORKFLOW_RETRY_INTERVAL_S,
     keys as key_utils,
 )
 from cfdb.workflows.cache import CacheBackend
@@ -63,10 +67,13 @@
 )
 from cfdb.workflows.lock import (
     claim_workflow,
+    count_active_workflows,
     heartbeat_workflow,
+    lease_due_dispatch,
     mark_running,
     record_stage_complete,
     release_workflow,
+    reschedule_dispatch,
     update_progress,
 )
 from cfdb.workflows.models import JobRecord, JobStatus
@@ -87,37 +94,41 @@
 #: Sourced from ``CFDB_WORKFLOW_DURATION_CAP_S`` (default 14400 s = 4 h)
 #: so deployments can override without a code change. Module-level
 #: rather than a ctor kwarg because tests cap it via ``monkeypatch.setattr``
-#: alongside the other dispatch knobs (``_DISPATCH_WAIT_SECONDS``,
+#: alongside the other dispatch knobs (``_RETRY_INTERVAL_SECONDS``,
 #: ``_HEARTBEAT_INTERVAL_S``) and uniform module-state keeps fixture
 #: shape consistent.
 _WORKFLOW_DURATION_CAP_SECONDS = WORKFLOW_DURATION_CAP_S
 
-#: Persisted ``error`` prefix surfaced to clients when the dispatch
-#: budget is exhausted with no workers available. The prefix is part
-#: of the on-wire contract — clients parse it to decide whether to
-#: resubmit. Kept as a module constant so tests can import the same
-#: symbol the executor writes.
+#: Persisted ``error`` prefix surfaced to clients when a job is failed for
+#: lack of worker capacity (its dispatch deadline elapsed while queued).
+#: Part of the on-wire contract — clients parse it to decide whether to
+#: resubmit. Kept as a module constant so tests can import the same symbol
+#: the executor writes.
 _ERR_PREFIX_CAPACITY = "capacity:"
 
-#: Persisted ``error`` prefix surfaced to clients when the provisioner
-#: itself raises a non-retryable exception. Distinct from
-#: :data:`_ERR_PREFIX_CAPACITY` so clients can tell "retry me later,
-#: capacity issue" apart from "the provisioner crashed".
-_ERR_PREFIX_PROVISIONER = "provisioner:"
-
-#: Total wall-clock budget waiting for a leased worker to surface during
-#: dispatch. Sized for an ECS Fargate cold start (image pull + health
-#: check, typically 1-3 min; see ``CFDB_WORKFLOW_DISPATCH_WAIT_S``,
-#: default 240s) so a ``NoWorkersAvailable`` at the moment of dispatch
-#: does not immediately fail the job — instead we poll the pool until a
-#: worker shows up or the budget expires.
-_DISPATCH_WAIT_SECONDS = float(WORKFLOW_DISPATCH_WAIT_S)
-
-#: Cadence at which we re-attempt the dispatch while waiting on capacity.
-#: Sub-second polling buys us nothing because ECS scale-up dominates the
-#: wait; a 1-second cadence keeps the noise floor low while still
-#: surfacing a freshly-available worker quickly.
-_DISPATCH_RETRY_INTERVAL_SECONDS = 1.0
+#: Admission ceiling on concurrently-active workflows (pending + running).
+#: ``ensure_workflow`` sheds requests beyond this with ``AdmissionRejected``
+#: → HTTP 429. Module-level so tests ``monkeypatch.setattr`` it alongside
+#: the other dispatch knobs. Soft cap: a count-then-claim race can briefly
+#: overshoot, which is acceptable for a flood guard.
+_MAX_ACTIVE_WORKFLOWS = WORKFLOW_MAX_ACTIVE
+
+#: Base cadence for the durable retry scheduler. A dispatch attempt that
+#: finds no worker capacity leaves the job PENDING and reschedules its next
+#: attempt this far out (plus jitter). Also the scheduler loop's idle wait
+#: between ticks, so a freshly-rescheduled job is re-attempted promptly.
+_RETRY_INTERVAL_SECONDS = float(WORKFLOW_RETRY_INTERVAL_S)
+
+#: Upper bound on the random jitter added to each reschedule, to spread a
+#: thundering herd of queued jobs across retry ticks instead of stampeding
+#: the pool in lockstep. Tests set this to 0 for deterministic timing.
+_RETRY_JITTER_MAX_SECONDS = min(30.0, _RETRY_INTERVAL_SECONDS)
+
+#: Max wall-clock a job may wait for capacity (measured from
+#: ``submitted_at``) before the scheduler fails it ``capacity:``. Replaces
+#: the old in-request 240s wait: instead of blocking one request, the job
+#: queues durably and is retried until it runs or this deadline elapses.
+_DISPATCH_DEADLINE_SECONDS = float(WORKFLOW_DISPATCH_DEADLINE_S)
 
 #: Cadence at which the wool routine emits ``heartbeat`` events into its
 #: stream during quiet periods. The API process consumes the stream and
@@ -153,6 +164,59 @@ class ExecutorDraining(WorkflowNotApplicable):
     """
 
 
+class AdmissionRejected(RuntimeError):
+    """Raised by ``ensure_workflow`` when the active-workflow ceiling is hit.
+
+    Deliberately NOT a :class:`WorkflowNotApplicable` subclass: a caller
+    that catches ``WorkflowNotApplicable`` falls through to direct upstream
+    streaming, but an admission rejection must surface as a distinct 429
+    (with ``Retry-After``) so the client backs off rather than silently
+    bypassing the bounded pipeline. Carries the observed ``active`` count,
+    the ``ceiling`` it hit, and a ``retry_after_seconds`` hint for the
+    response header.
+    """
+
+    def __init__(
+        self,
+        *,
+        active: int,
+        ceiling: int,
+        retry_after_seconds: Optional[int] = None,
+    ) -> None:
+        self.active = active
+        self.ceiling = ceiling
+        self.retry_after_seconds = (
+            retry_after_seconds
+            if retry_after_seconds is not None
+            else max(1, int(_RETRY_INTERVAL_SECONDS))
+        )
+        super().__init__(
+            f"Active workflow ceiling reached ({active} >= {ceiling}); "
+            "shedding request"
+        )
+
+
+def _utcnow() -> datetime:
+    return datetime.now(timezone.utc)
+
+
+def _next_dispatch_time(now: datetime) -> datetime:
+    """Compute the next dispatch attempt time for an overflowed job.
+
+    ``now + retry_interval + jitter``. The jitter spreads a queued backlog
+    across ticks; tests zero ``_RETRY_JITTER_MAX_SECONDS`` for determinism.
+    Always strictly after ``now`` (retry interval is >= 1s), which
+    ``lease_due_dispatch`` requires so a leased row moves out of the
+    due-window before its attempt resolves.
+    """
+    jitter = (
+        random.uniform(0, _RETRY_JITTER_MAX_SECONDS)
+        if _RETRY_JITTER_MAX_SECONDS > 0
+        else 0.0
+    )
+    return now + timedelta(seconds=_RETRY_INTERVAL_SECONDS + jitter)
+
+
 class JobExecutor(ABC):
     """Abstract interface for workflow dispatch + job tracking."""
 
@@ -322,11 +386,20 @@ def __init__(
         self._provisioner = provisioner
         self._pending_tasks: set[asyncio.Task] = set()
         #: Finalize tasks (release_workflow + workdir cleanup) created by
-        #: _run_workflow's `finally`. Tracked separately so drain() can
-        #: await them after the main _pending_tasks gather, ensuring they
-        #: complete before the lifespan teardown closes the Motor client.
+        #: _consume_and_finalize's `finally`. Tracked separately so drain()
+        #: can await them after the main _pending_tasks gather, ensuring
+        #: they complete before the lifespan teardown closes the Motor
+        #: client.
         self._finalize_tasks: set[asyncio.Task] = set()
         self._draining = False
+        #: The durable retry scheduler — re-attempts dispatch for jobs that
+        #: overflowed (no worker capacity) and are awaiting a free worker.
+        #: Fresh claims dispatch inline via ``ensure_workflow``; the
+        #: scheduler only handles the retry path. Started by
+        #: ``start_scheduler`` inside the lifespan's wool-pool context (so
+        #: it inherits wool's dispatch contextvars) and cancelled by
+        #: ``drain`` before the pool closes.
+        self._scheduler_task: Optional[asyncio.Task] = None
 
     async def ensure_workflow(
         self, file_meta: dict[str, Any]
@@ -345,6 +418,17 @@ async def ensure_workflow(
                 "No processor registered for this file, or no work required"
             )
 
+        # Admission ceiling: shed new work once the active backlog
+        # (pending + running) hits the cap, so an unauthenticated flood on
+        # /data and /index can't queue unbounded jobs in Mongo. Checked
+        # AFTER applicability (don't 429 a file that needs no workflow) and
+        # BEFORE claiming the mutex. Soft cap — a count-then-claim race may
+        # transiently overshoot, acceptable for a flood guard; the
+        # per-source mutex still dedups same-file requests.
+        active = await count_active_workflows(self._db)
+        if active >= _MAX_ACTIVE_WORKFLOWS:
+            raise AdmissionRejected(active=active, ceiling=_MAX_ACTIVE_WORKFLOWS)
+
         dcc, local_id, md5 = extract_identity(file_meta)
         wf_key = key_utils.workflow_key(
             dcc=dcc,
@@ -363,15 +447,110 @@ async def ensure_workflow(
         )
 
         if fresh:
-            task = asyncio.create_task(
-                self._run_workflow(record, processor, file_meta)
-            )
-            self._pending_tasks.add(task)
-            task.add_done_callback(self._pending_tasks.discard)
-            task.add_done_callback(_log_unexpected_exception)
+            # Dispatch the first attempt inline (fire-and-forget), so a
+            # fresh claim runs immediately rather than waiting out a
+            # scheduler tick. The job's ``next_dispatch_at`` stays None
+            # until this attempt overflows, so the durable scheduler — which
+            # only leases jobs with a due ``next_dispatch_at`` — cannot
+            # race this inline attempt. Retries (after an overflow sets
+            # ``next_dispatch_at``) are driven by the scheduler.
+            self._spawn_attempt(record, processor, file_meta)
 
         return record, fresh
 
+    def _spawn_attempt(
+        self,
+        record: JobRecord,
+        processor: Processor,
+        file_meta: dict[str, Any],
+    ) -> None:
+        """Spawn a tracked background task running one dispatch attempt."""
+        task = asyncio.create_task(
+            self._attempt_dispatch(record, processor, file_meta)
+        )
+        self._pending_tasks.add(task)
+        task.add_done_callback(self._pending_tasks.discard)
+        task.add_done_callback(_log_unexpected_exception)
+
+    def start_scheduler(self) -> None:
+        """Start the durable retry scheduler as a background task.
+
+        MUST be called from inside the lifespan's ``wool.WorkerPool``
+        context so the created task inherits wool's dispatch contextvars
+        (the same mechanism request-spawned tasks rely on via
+        ``attach_wool_context``); otherwise ``@wool.routine`` dispatch from
+        the scheduler would fail with no pool in context. Idempotent.
+        """
+        if self._scheduler_task is not None:
+            return
+        self._scheduler_task = asyncio.create_task(self._scheduler_loop())
+        self._scheduler_task.add_done_callback(_log_unexpected_exception)
+
+    async def _scheduler_loop(self) -> None:
+        """Re-attempt dispatch for queued jobs until drain.
+
+        Fresh claims dispatch inline (``ensure_workflow``); this loop drives
+        the *retry* path. Each tick leases every currently-due job — one
+        whose inline (or prior) attempt overflowed and set a
+        ``next_dispatch_at`` now in the past — and spawns a fresh dispatch
+        attempt for it, then sleeps one retry interval. A per-tick exception
+        is logged and swallowed so the scheduler never dies; cancellation
+        (from ``drain``) breaks the loop. DB-backed, so a freshly-started
+        replica resumes whatever queue the previous process left behind.
+        """
+        while not self._draining:
+            try:
+                await self._drain_due_jobs()
+            except asyncio.CancelledError:
+                raise
+            except Exception:
+                logger.exception("Scheduler dispatch tick failed; continuing")
+            await asyncio.sleep(_RETRY_INTERVAL_SECONDS)
+
+    async def _drain_due_jobs(self) -> None:
+        """Lease every currently-due queued job and spawn its retry attempt.
+
+        ``lease_due_dispatch`` atomically claims one due job and pushes its
+        ``next_dispatch_at`` forward, so concurrent ticks (or replicas)
+        can't double-lease and a crashed attempt is still retried. We loop
+        until nothing is due, dispatching each leased job as a tracked
+        background task so a winning attempt's long stream-consume doesn't
+        block the scheduler from filling the rest of the pool.
+        """
+        while not self._draining:
+            now = _utcnow()
+            leased = await lease_due_dispatch(
+                self._db, now=now, next_at=_next_dispatch_time(now)
+            )
+            if leased is None:
+                return
+            file_meta = leased.file_meta_snapshot
+            if file_meta is None:
+                logger.error(
+                    "Leased job %s has no file_meta_snapshot — failing",
+                    leased.job_id,
+                )
+                await release_workflow(
+                    self._db,
+                    leased.job_id,
+                    JobStatus.FAILED,
+                    error="internal: leased job missing file_meta_snapshot",
+                )
+                continue
+            processor = self._registry.lookup_for(file_meta)
+            if processor is None:
+                logger.error(
+                    "No processor for leased job %s — failing", leased.job_id
+                )
+                await release_workflow(
+                    self._db,
+                    leased.job_id,
+                    JobStatus.FAILED,
+                    error="internal: no processor for leased job",
+                )
+                continue
+            self._spawn_attempt(leased, processor, file_meta)
+
     async def drain(self, *, timeout: float) -> int:
         """Wait for in-flight workflow tasks to complete.
 
@@ -391,6 +570,14 @@ async def drain(self, *, timeout: float) -> int:
             immediately.
         """
         self._draining = True
+        # Stop the dispatch driver first so no new attempt tasks are spawned
+        # into the set we're about to snapshot, and so it stops dispatching
+        # into a pool that's about to close.
+        if self._scheduler_task is not None:
+            self._scheduler_task.cancel()
+            with contextlib.suppress(asyncio.CancelledError, Exception):
+                await self._scheduler_task
+            self._scheduler_task = None
         pending = list(self._pending_tasks)
         if not pending:
             return 0
@@ -405,7 +592,7 @@ async def drain(self, *, timeout: float) -> int:
             # against a closed Motor client surfaces as a noisy
             # exception in the lifespan exit, and the wool stream they
             # hold blocks pool aclose. Cancel them explicitly and
-            # await the cancellation so _run_workflow's shielded
+            # await the cancellation so _consume_and_finalize's shielded
             # _finalize gets its 2s best-effort grace before we let
             # the lifespan continue to teardown.
             for task in pending:
@@ -429,14 +616,147 @@ async def drain(self, *, timeout: float) -> int:
                 )
         return len(pending)
 
-    async def _run_workflow(
+    async def _attempt_dispatch(
         self,
         record: JobRecord,
         processor: Processor,
         file_meta: dict[str, Any],
     ) -> None:
-        """Background coroutine: mark running, consume the routine's event
-        stream, and release a terminal status.
+        """Run one dispatch attempt for a leased PENDING job.
+
+        Offers the task to the existing worker pool **once** via the
+        priority load balancer. On success the job is marked RUNNING (only
+        now that a worker has accepted — a queued job stays PENDING until
+        then) and its event stream is consumed to completion. On overflow
+        (every worker rejected, or none exist) the job stays PENDING, a
+        best-effort bounded worker spawn is requested, and it is rescheduled
+        for a later tick — unless it has blown its dispatch deadline, in
+        which case it is failed ``capacity:``. This inverts the old
+        unconditional pre-dispatch spawn: existing capacity is tried first
+        and scale-up happens only on overflow.
+        """
+        workdir = self._workdir_root / record.job_id
+
+        # Deadline: a job that has waited for capacity longer than the
+        # dispatch deadline (measured from submission) is failed rather than
+        # retried forever. ``capacity:`` is the stable on-wire prefix
+        # clients parse to decide whether to resubmit.
+        waited = (_utcnow() - record.submitted_at).total_seconds()
+        if waited > _DISPATCH_DEADLINE_SECONDS:
+            logger.warning(
+                "Dispatch deadline exceeded for %s after %.0fs; failing",
+                record.job_id,
+                waited,
+            )
+            await self._finalize(
+                workdir,
+                record.job_id,
+                JobStatus.FAILED,
+                f"{_ERR_PREFIX_CAPACITY} dispatch deadline exceeded",
+            )
+            return
+
+        try:
+            stream = await self._open_stream_once(processor, file_meta, workdir)
+        except asyncio.CancelledError:
+            # Cancelled (e.g. lifespan drain) before a worker accepted.
+            # Finalize FAILED so the row doesn't dangle PENDING until
+            # stale-reclaim; the client resubmits. Best-effort, then
+            # propagate the cancellation.
+            with contextlib.suppress(Exception):
+                await self._finalize(
+                    workdir,
+                    record.job_id,
+                    JobStatus.FAILED,
+                    "Workflow cancelled (worker shutdown)",
+                )
+            raise
+        except wool.NoWorkersAvailable:
+            # No capacity this pass — keep the job PENDING, request a
+            # bounded scale-up (ECS only), and reschedule for a later tick.
+            await self._handle_overflow(record)
+            return
+        except Exception as exc:
+            logger.exception("Failed to open routine stream for %s", record.job_id)
+            await self._finalize(
+                workdir, record.job_id, JobStatus.FAILED, str(exc)
+            )
+            return
+
+        # A worker accepted the task. Claim RUNNING now — after the stream
+        # opened — so a job only leaves PENDING once it is genuinely
+        # running. ``mark_running`` is fenced on PENDING; a RuntimeError
+        # means a racing claimant (or stale-reclaim) already owns the row,
+        # so hand off: close the stream (cancelling the remote routine) and
+        # bail without duplicate work.
+        try:
+            await mark_running(self._db, record.job_id)
+        except asyncio.CancelledError:
+            with contextlib.suppress(BaseException):
+                await stream.aclose()
+            raise
+        except RuntimeError:
+            logger.info(
+                "mark_running rejected for %s (row no longer PENDING); "
+                "closing stream and handing off",
+                record.job_id,
+            )
+            with contextlib.suppress(BaseException):
+                await stream.aclose()
+            return
+        except Exception as exc:
+            logger.exception("Failed to mark job %s as running", record.job_id)
+            with contextlib.suppress(BaseException):
+                await stream.aclose()
+            await self._finalize(
+                workdir,
+                record.job_id,
+                JobStatus.FAILED,
+                f"mark_running failed: {exc}",
+            )
+            return
+
+        await self._consume_and_finalize(record, stream, workdir)
+
+    async def _handle_overflow(self, record: JobRecord) -> None:
+        """Handle a dispatch attempt that found no worker capacity.
+
+        Requests one best-effort worker spawn through the provisioner (ECS
+        profile only; a no-op in the local LAN profile, where the pool is
+        fixed) and reschedules the job's next dispatch attempt. The spawn is
+        strictly best-effort — existing workers may free up before the next
+        tick — so any provisioner error is logged and swallowed rather than
+        failing the job; the dispatch deadline bounds the retry loop.
+        """
+        if self._provisioner is not None:
+            try:
+                await self._provisioner.request(dedup_key=record.workflow_key)
+            except asyncio.CancelledError:
+                raise
+            except RetryableProvisionerError as exc:
+                logger.warning(
+                    "Provisioner retryable capacity error for %s; will retry "
+                    "on the next tick: %s",
+                    record.job_id,
+                    exc,
+                )
+            except Exception:
+                logger.exception(
+                    "Provisioner request failed for %s; will retry on the "
+                    "next tick",
+                    record.job_id,
+                )
+        await reschedule_dispatch(
+            self._db, record.job_id, next_at=_next_dispatch_time(_utcnow())
+        )
+
+    async def _consume_and_finalize(
+        self,
+        record: JobRecord,
+        stream: AsyncIterator[Any],
+        workdir: Path,
+    ) -> None:
+        """Consume a running workflow's event stream and release it terminal.
 
         The routine emits ``heartbeat``, ``stage_complete``, ``complete``,
         and ``error`` events as an async generator over wool's gRPC stream.
@@ -446,102 +766,16 @@ async def _run_workflow(
         ``stage_complete`` arrives, so /jobs/{id} reflects partial
         progress in real time rather than all-at-once at the end.
 
-        A ``try/finally`` ensures the job always reaches a terminal
-        status and the per-job workdir is cleaned up, even when
-        ``mark_running`` raises or the task is cancelled during
-        shutdown. ``CancelledError`` propagates after the terminal write.
+        A ``try/finally`` ensures the job always reaches a terminal status
+        and the per-job workdir is cleaned up, even when the task is
+        cancelled during shutdown. ``CancelledError`` propagates after the
+        terminal write. The caller has already marked the job RUNNING and
+        opened ``stream``.
         """
-        workdir = self._workdir_root / record.job_id
-
         final_status: JobStatus = JobStatus.FAILED
         final_error: Optional[str] = None
 
         try:
-            # mark_running lives inside the try so a Mongo write failure
-            # here still routes through the finally — the row gets
-            # released to FAILED rather than dangling at PENDING for
-            # the full stale-reclaim window.
-            try:
-                await mark_running(self._db, record.job_id)
-            except Exception as exc:
-                logger.exception("Failed to mark job %s as running", record.job_id)
-                final_error = f"mark_running failed: {exc}"
-                return
-
-            # Request a worker via the external provisioner (e.g. ECS
-            # ``RunTask``) before opening the routine stream. The
-            # provisioner dedup-keys on the workflow mutex so two
-            # concurrent fresh claims for the same source file
-            # share one ``RunTask`` and one worker.
-            #
-            # ``RetryableProvisionerError`` is treated as a best-effort
-            # scale-up failure, not a workflow-level failure: the Wool
-            # pool routes ``@wool.routine`` calls to *any* available
-            # worker via the discovery namespace, so an existing idle
-            # worker can still service this job. Log a warning and fall
-            # through to ``_open_stream_with_retry``; the
-            # ``capacity:``-prefixed terminal status is reserved for the
-            # case where dispatch *also* exhausts its budget with no
-            # workers available.
-            if self._provisioner is not None:
-                try:
-                    await self._provisioner.request(dedup_key=record.workflow_key)
-                except asyncio.CancelledError:
-                    final_error = "Workflow cancelled (worker shutdown)"
-                    raise
-                except RetryableProvisionerError as exc:
-                    logger.warning(
-                        "Provisioner reported retryable capacity error for %s; "
-                        "falling through to dispatch against existing workers: %s",
-                        record.job_id,
-                        exc,
-                    )
-                except Exception as exc:
-                    logger.exception(
-                        "Provisioner request failed for %s", record.job_id
-                    )
-                    final_error = f"{_ERR_PREFIX_PROVISIONER} {exc}"
-                    return
-                # Bound the stale-reclaim window across the provisioner
-                # round-trip: ``mark_running`` ran before
-                # ``provisioner.request`` and a Fargate cold start can
-                # plausibly exceed STALE_WORKFLOW_THRESHOLD. Heartbeat
-                # so the row stays fresh through the upcoming dispatch
-                # wait. A failed heartbeat is logged but does not
-                # abort the workflow — the next ``heartbeat`` event in
-                # the stream loop will retry.
-                try:
-                    await heartbeat_workflow(self._db, record.job_id)
-                except Exception:
-                    logger.exception(
-                        "Post-provisioner heartbeat failed for %s; continuing",
-                        record.job_id,
-                    )
-
-            try:
-                stream = await self._open_stream_with_retry(
-                    processor, file_meta, workdir
-                )
-            except asyncio.CancelledError:
-                final_error = "Workflow cancelled (worker shutdown)"
-                raise
-            except wool.NoWorkersAvailable as exc:
-                # Dispatch budget exhausted with no workers reachable.
-                # ``capacity:`` is the stable on-wire prefix clients
-                # parse to decide whether to resubmit the job — its
-                # origin shifted from the provisioner-failure branch
-                # (which now falls through) to here, but the wire
-                # contract is unchanged.
-                logger.warning(
-                    "Dispatch budget exhausted for %s with no workers available",
-                    record.job_id,
-                )
-                final_error = f"{_ERR_PREFIX_CAPACITY} {exc}"
-                return
-            except Exception as exc:
-                logger.exception("Failed to open routine stream for %s", record.job_id)
-                final_error = str(exc)
-                return
 
             try:
                 async with asyncio.timeout(_WORKFLOW_DURATION_CAP_SECONDS):
@@ -686,74 +920,43 @@ async def _finalize(
         except Exception:
             logger.exception("Unable to clean workdir %s", workdir)
 
-    async def _open_stream_with_retry(
+    async def _open_stream_once(
         self,
         processor: Processor,
         file_meta: dict[str, Any],
         workdir: Path,
     ) -> AsyncIterator[dict[str, Any]]:
-        """Open the routine's event stream, retrying on ``NoWorkersAvailable``.
-
-        The API leases workers from a pool that's provisioned externally
-        (ECS in production); cold starts can take ~30-90s. Until wool
-        exposes a quorum-style readiness gate we poll the pool with
-        ``NoWorkersAvailable`` retries on each attempt to open the
-        stream. ``_DISPATCH_WAIT_SECONDS`` caps the total wait so a
-        stuck scale-up surfaces as a workflow failure rather than
-        hanging until the much-larger ``_WORKFLOW_DURATION_CAP_SECONDS``
-        cap fires.
-
-        Returns an async iterator that yields the first event followed
-        by the rest of the routine's stream. The retry-on-capacity logic
-        only applies to the first ``__anext__`` call (which is where
-        wool surfaces ``NoWorkersAvailable``); once events start
-        flowing, subsequent ``__anext__`` calls just wait on the worker.
+        """Open the routine's event stream in a single dispatch attempt.
+
+        Constructs the ``@wool.routine`` stream and pulls its first event,
+        which is where wool surfaces ``wool.NoWorkersAvailable`` when the
+        priority load balancer found no worker willing to accept the task.
+        That exception propagates to ``_attempt_dispatch``, which reschedules
+        the job — the durable retry scheduler is now the retry mechanism, so
+        there is no in-attempt polling loop (a single reschedule tick
+        covers an ECS cold start). On success returns an async iterator
+        yielding the first event followed by the rest of the stream.
         """
-        loop = asyncio.get_running_loop()
-        deadline = loop.time() + _DISPATCH_WAIT_SECONDS
-        attempt = 0
-        last_exc: wool.NoWorkersAvailable | None = None
-        while True:
-            stream = _run_processor_routine(
-                processor,
-                file_meta,
-                str(workdir),
-                self._cache,
-            )
-            # ``returning`` flips True only after we've handed the
-            # stream off to ``_prepend_first_event``. Any other exit
-            # from this iteration (NoWorkersAvailable, CancelledError
-            # from __anext__ or the inter-iteration sleep, unexpected
-            # exception) leaves the freshly-constructed stream
-            # unawaited; the finally aclose()s it so wool's pool can
-            # release the lease cleanly on drain.
-            returning = False
-            try:
-                try:
-                    first_event = await stream.__anext__()
-                except wool.NoWorkersAvailable as exc:
-                    last_exc = exc
-                    attempt += 1
-                    if loop.time() >= deadline:
-                        break
-                    if attempt >= 2:
-                        logger.warning(
-                            "No workers available — retrying dispatch "
-                            "(attempt %d, %.1fs of %.1fs budget remaining)",
-                            attempt,
-                            deadline - loop.time(),
-                            _DISPATCH_WAIT_SECONDS,
-                        )
-                    await asyncio.sleep(_DISPATCH_RETRY_INTERVAL_SECONDS)
-                    continue
-                returning = True
-                return _prepend_first_event(first_event, stream)
-            finally:
-                if not returning:
-                    with contextlib.suppress(BaseException):
-                        await stream.aclose()
-        assert last_exc is not None
-        raise last_exc
+        stream = _run_processor_routine(
+            processor,
+            file_meta,
+            str(workdir),
+            self._cache,
+        )
+        # ``returning`` flips True only after we've handed the stream off to
+        # ``_prepend_first_event``. Any other exit (NoWorkersAvailable, a
+        # CancelledError from __anext__, an unexpected exception) leaves the
+        # freshly-constructed stream unawaited; the finally aclose()s it so
+        # wool's pool can release the lease cleanly.
+        returning = False
+        try:
+            first_event = await stream.__anext__()
+            returning = True
+            return _prepend_first_event(first_event, stream)
+        finally:
+            if not returning:
+                with contextlib.suppress(BaseException):
+                    await stream.aclose()
 
 
 async def _prepend_first_event(
@@ -761,10 +964,10 @@ async def _prepend_first_event(
 ) -> AsyncIterator[dict[str, Any]]:
     """Yield ``first`` then the rest of ``stream``.
 
-    Used by ``_open_stream_with_retry`` to consume the initial
-    ``__anext__`` (where wool surfaces ``NoWorkersAvailable``) inside
-    the retry loop, then expose the remaining events to the workflow
-    consumer through a single uniform async-iterator interface.
+    Used by ``_open_stream_once`` to consume the initial ``__anext__``
+    (where wool surfaces ``NoWorkersAvailable``), then expose the remaining
+    events to the workflow consumer through a single uniform async-iterator
+    interface.
     """
     try:
         yield first
@@ -778,12 +981,12 @@ async def _prepend_first_event(
 def _log_unexpected_exception(task: asyncio.Task) -> None:
     """Done-callback that surfaces an unexpected exception via the logger.
 
-    ``_run_workflow`` is supposed to catch every failure and route it to
-    a terminal release; if anything escapes that contract (e.g. a future
-    refactor leaves a bare ``raise`` outside the try), asyncio's
-    default behavior is to log the unretrieved exception only at
-    interpreter shutdown. Surfacing it through the project logger
-    immediately makes the regression visible.
+    The scheduler loop and ``_attempt_dispatch`` are supposed to catch
+    every failure and route it to a terminal release or a reschedule; if
+    anything escapes that contract (e.g. a future refactor leaves a bare
+    ``raise`` outside the try), asyncio's default behavior is to log the
+    unretrieved exception only at interpreter shutdown. Surfacing it
+    through the project logger immediately makes the regression visible.
     """
     if task.cancelled():
         return
diff --git a/src/cfdb/workflows/lock.py b/src/cfdb/workflows/lock.py
index 41f805c..ce1753e 100644
--- a/src/cfdb/workflows/lock.py
+++ b/src/cfdb/workflows/lock.py
@@ -191,6 +191,13 @@ async def claim_workflow(
             submitted_at=now,
             updated_at=now,
             file_meta_snapshot=file_meta_snapshot,
+            # ``next_dispatch_at`` is intentionally left unset (None) on a
+            # fresh claim. ``ensure_workflow`` dispatches the first attempt
+            # inline; only if that attempt overflows (no worker capacity)
+            # does it set ``next_dispatch_at``, handing the job to the
+            # durable retry scheduler. A None value is never "due", so the
+            # scheduler can't double-dispatch a job whose inline attempt is
+            # still in flight.
         )
 
         # Attempt the insert first. If the partial-unique index admits it,
diff --git a/tests/integration/routines.py b/tests/integration/routines.py
index 11f7d0a..5c4a291 100644
--- a/tests/integration/routines.py
+++ b/tests/integration/routines.py
@@ -53,10 +53,10 @@ class StubProcessor(Processor):
       workflow to race against.
     - ``sleep_between_yields``: a delay inserted AFTER each
       ``stage_complete`` yield. Used by the runtime-cap test: a quick
-      first event lets ``_open_stream_with_retry`` return so the
-      executor's ``asyncio.timeout(cap)`` block is entered, then the
-      between-yields delay outlasts the cap so the wait for the next
-      event triggers the cap.
+      first event lets ``_open_stream_once`` return (and the executor mark
+      the job RUNNING) so the ``asyncio.timeout(cap)`` block is entered,
+      then the between-yields delay outlasts the cap so the wait for the
+      next event triggers the cap.
     - ``unpicklable_field``: any value assigned to ``self.unpicklable_field``
       to deliberately poison cloudpickle. Used by the boundary tests
       that need to verify a pickling failure surfaces as a clean
diff --git a/tests/integration/test_executor_boundary.py b/tests/integration/test_executor_boundary.py
index 2a6d1a8..582d0d8 100644
--- a/tests/integration/test_executor_boundary.py
+++ b/tests/integration/test_executor_boundary.py
@@ -127,8 +127,8 @@ async def test_ensure_workflow_should_record_failed_when_routine_raises_across_p
         assert final.status == JobStatus.FAILED
         assert final.error is not None
         # When the routine raises before yielding, wool propagates the
-        # exception across the boundary; the executor records
-        # ``str(exc)`` from the ``_open_stream_with_retry`` failure path.
+        # exception across the boundary; the executor records ``str(exc)``
+        # from ``_attempt_dispatch``'s stream-open failure path.
         # The exception class name is not part of the persisted error.
         assert "boom" in final.error
         # The scrub regex strips multi-segment absolute paths from the
diff --git a/tests/test_workflows/test_executor.py b/tests/test_workflows/test_executor.py
index 76d315f..364fef3 100644
--- a/tests/test_workflows/test_executor.py
+++ b/tests/test_workflows/test_executor.py
@@ -3,7 +3,9 @@
 from __future__ import annotations
 
 import asyncio
+import uuid
 from collections.abc import AsyncIterator
+from datetime import datetime, timedelta, timezone
 from pathlib import Path
 from typing import Any
 
@@ -20,13 +22,19 @@
 )
 from cfdb.workflows.executor import (
     PIPELINE_VERSION,
+    AdmissionRejected,
     ExecutorDraining,
     WoolExecutor,
     WorkflowNotApplicable,
     extract_identity,
 )
 from cfdb.workflows.lock import get_job
-from cfdb.workflows.models import ACTIVE_STATUSES, ArtifactKind, JobStatus
+from cfdb.workflows.models import (
+    ACTIVE_STATUSES,
+    ArtifactKind,
+    JobRecord,
+    JobStatus,
+)
 from cfdb.workflows.processors.base import Processor
 from cfdb.workflows.processors.registry import ProcessorRegistry
 from cfdb.workflows.provisioner import EcsProvisioner, RetryableProvisionerError
@@ -124,6 +132,42 @@ async def _wait_for_terminal(mock_db, job_id: str, timeout: float = 2.0) -> None
     raise AssertionError(f"Job {job_id} did not reach terminal status")
 
 
+async def _seed_pending_job(
+    mock_db,
+    *,
+    submitted_at: datetime | None = None,
+    next_dispatch_at: datetime | None = None,
+    meta: dict[str, Any] | None = None,
+) -> JobRecord:
+    """Insert a PENDING job row directly and return its JobRecord.
+
+    Lets a test drive ``_attempt_dispatch`` / ``_drain_due_jobs`` against a
+    queued job without going through ``ensure_workflow`` (which would spawn
+    its own inline attempt). ``submitted_at`` controls the deadline clock;
+    ``next_dispatch_at`` controls scheduler due-ness.
+    """
+    meta = meta or _file_meta()
+    dcc, local_id, md5 = extract_identity(meta)
+    now = datetime.now(timezone.utc)
+    record = JobRecord(
+        job_id=str(uuid.uuid4()),
+        workflow_key=key_utils.workflow_key(
+            dcc=dcc, local_id=local_id, md5=md5, pipeline_version=PIPELINE_VERSION
+        ),
+        status=JobStatus.PENDING,
+        dcc=dcc,
+        local_id=local_id,
+        md5=md5,
+        pipeline_version=PIPELINE_VERSION,
+        submitted_at=submitted_at or now,
+        updated_at=now,
+        file_meta_snapshot=meta,
+        next_dispatch_at=next_dispatch_at,
+    )
+    await mock_db.jobs.insert_one(record.to_mongo())
+    return record
+
+
 class TestExtractIdentity:
     def test_extract_identity_should_prefer_nested_dcc_abbreviation(self):
         """Test that extract_identity reads the canonical C2M2 shape.
@@ -862,22 +906,21 @@ def test_pipeline_version_should_be_positive():
     assert isinstance(PIPELINE_VERSION, int)
 
 
-class TestOpenStreamWithRetry:
+class TestOpenStreamOnce:
     @pytest.mark.asyncio
-    async def test_open_stream_with_retry_should_recover_after_one_no_workers_error(
+    async def test_open_stream_once_should_return_stream_on_first_event(
         self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch, mocker
     ):
-        """Test that ``NoWorkersAvailable`` triggers a retry that eventually wins.
+        """Test that a routine yielding an event returns a usable stream.
 
         Given:
-            An executor whose ``_run_processor_routine`` raises
-            ``wool.NoWorkersAvailable`` on its first ``__anext__`` then
-            yields events on the second attempt.
+            An executor whose ``_run_processor_routine`` yields a regular
+            event followed by a ``complete``.
         When:
-            ``_open_stream_with_retry`` is awaited.
+            ``_open_stream_once`` is awaited.
         Then:
-            It should return a stream that yields the second invocation's
-            events without re-raising.
+            It should return a stream that re-yields the first event plus
+            the rest, in a single dispatch attempt (no retry loop).
         """
         # Arrange
         _install_jobs_index(mock_db)
@@ -888,57 +931,45 @@ async def test_open_stream_with_retry_should_recover_after_one_no_workers_error(
             mock_db, tmp_cache, registry, workdir_root=tmp_workdir
         )
 
-        # Build two routines: one that immediately raises NoWorkersAvailable
-        # and one that yields a regular event followed by a complete.
         attempts = {"count": 0}
 
-        async def failing_stream():
-            attempts["count"] += 1
-            raise wool.NoWorkersAvailable("simulated cold start")
-            yield  # pragma: no cover
-
         async def working_stream():
             attempts["count"] += 1
             yield StageComplete(kind=ArtifactKind.INDEX, key=_INDEX_KEY)
             yield Complete(artifacts={})
 
-        streams = [failing_stream(), working_stream()]
-
-        def fake_routine(*_args, **_kwargs):
-            return streams.pop(0)
-
-        mocker.patch.object(executor_module, "_run_processor_routine", fake_routine)
-        # Patch sleep so retries don't slow the test down.
-        async def fast_sleep(_s):
-            return None
-
-        mocker.patch.object(executor_module.asyncio, "sleep", fast_sleep)
+        mocker.patch.object(
+            executor_module,
+            "_run_processor_routine",
+            lambda *_a, **_k: working_stream(),
+        )
 
         # Act
-        stream = await executor._open_stream_with_retry(
+        stream = await executor._open_stream_once(
             processor, _file_meta(), tmp_workdir / "wd"
         )
         events = [event async for event in stream]
 
-        # Assert
-        assert attempts["count"] == 2
+        # Assert — a single attempt, both events delivered.
+        assert attempts["count"] == 1
+        assert any(isinstance(e, StageComplete) for e in events)
         assert any(isinstance(e, Complete) for e in events)
 
     @pytest.mark.asyncio
-    async def test_open_stream_with_retry_should_raise_when_deadline_expires(
+    async def test_open_stream_once_should_propagate_no_workers_available(
         self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch, mocker
     ):
-        """Test that the retry loop eventually surfaces ``NoWorkersAvailable``.
+        """Test that ``NoWorkersAvailable`` propagates without retrying.
 
         Given:
-            An executor whose ``_run_processor_routine`` always raises
-            ``wool.NoWorkersAvailable``; the dispatch-wait budget is
-            patched to a tiny value so the deadline expires quickly.
+            An executor whose ``_run_processor_routine`` raises
+            ``wool.NoWorkersAvailable`` on its first ``__anext__``.
         When:
-            ``_open_stream_with_retry`` is awaited.
+            ``_open_stream_once`` is awaited.
         Then:
-            It should raise the last ``NoWorkersAvailable`` after the
-            dispatch budget is exhausted.
+            It should propagate the ``NoWorkersAvailable`` in a single
+            attempt — the durable retry scheduler, not an in-attempt loop,
+            owns retries now. The freshly-constructed stream is closed.
         """
         # Arrange
         _install_jobs_index(mock_db)
@@ -949,26 +980,25 @@ async def test_open_stream_with_retry_should_raise_when_deadline_expires(
             mock_db, tmp_cache, registry, workdir_root=tmp_workdir
         )
 
+        attempts = {"count": 0}
+
         async def failing_stream():
-            raise wool.NoWorkersAvailable("still no workers")
+            attempts["count"] += 1
+            raise wool.NoWorkersAvailable("simulated cold start")
             yield  # pragma: no cover
 
-        def fake_routine(*_args, **_kwargs):
-            return failing_stream()
-
-        mocker.patch.object(executor_module, "_run_processor_routine", fake_routine)
-        mocker.patch.object(executor_module, "_DISPATCH_WAIT_SECONDS", 0.01)
-
-        async def fast_sleep(_s):
-            return None
-
-        mocker.patch.object(executor_module.asyncio, "sleep", fast_sleep)
+        mocker.patch.object(
+            executor_module,
+            "_run_processor_routine",
+            lambda *_a, **_k: failing_stream(),
+        )
 
         # Act & assert
         with pytest.raises(wool.NoWorkersAvailable):
-            await executor._open_stream_with_retry(
+            await executor._open_stream_once(
                 processor, _file_meta(), tmp_workdir / "wd"
             )
+        assert attempts["count"] == 1
 
 
 class TestStreamConsumerProgressEvents:
@@ -1121,7 +1151,7 @@ def test_executor_draining_should_be_caught_by_workflow_not_applicable(self):
 
 class TestOpenStreamMalformedFirstEvent:
     @pytest.mark.asyncio
-    async def test_open_stream_with_retry_should_propagate_non_capacity_error(
+    async def test_open_stream_once_should_propagate_non_capacity_error(
         self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch, mocker
     ):
         """Test that a non-``NoWorkersAvailable`` first error propagates.
@@ -1130,11 +1160,11 @@ async def test_open_stream_with_retry_should_propagate_non_capacity_error(
             A routine whose first ``__anext__`` raises a generic
             ``RuntimeError`` (not a wool capacity error).
         When:
-            ``_open_stream_with_retry`` is awaited.
+            ``_open_stream_once`` is awaited.
         Then:
-            The exception should propagate to the caller — the retry
-            loop only handles ``NoWorkersAvailable``, every other class
-            is a real failure.
+            The exception should propagate to the caller —
+            ``_attempt_dispatch`` translates it into a FAILED job; only
+            ``NoWorkersAvailable`` is treated as overflow.
         """
         # Arrange
         _install_jobs_index(mock_db)
@@ -1156,7 +1186,7 @@ def fake_routine(*_args, **_kwargs):
 
         # Act & assert
         with pytest.raises(RuntimeError, match="not a capacity"):
-            await executor._open_stream_with_retry(
+            await executor._open_stream_once(
                 processor, _file_meta(), tmp_workdir / "wd"
             )
 
@@ -1197,25 +1227,28 @@ async def test_ensure_workflow_should_persist_file_meta_snapshot_on_insert(
 
 class TestWoolExecutorWithProvisioner:
     @pytest.mark.asyncio
-    async def test_ensure_workflow_should_request_worker_when_provisioner_set(
+    async def test_attempt_should_not_request_provisioner_when_worker_available(
         self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch, mocker
     ):
-        """Test that a fresh claim dispatches a provisioner request.
+        """Test that the provisioner is left idle when a worker accepts.
 
         Given:
-            A WoolExecutor wired with a stub EcsProvisioner.
+            A WoolExecutor wired with a stub EcsProvisioner and an
+            in-process worker (no_wool_dispatch) that accepts the task.
         When:
-            ``ensure_workflow`` is awaited on a previously-unseen file.
+            ``ensure_workflow`` dispatches the inline attempt and it wins
+            a worker.
         Then:
-            The provisioner's ``request`` should be awaited exactly once
-            with ``dedup_key`` set to the workflow mutex key for the file.
+            ``provisioner.request`` should NOT be awaited — spawn-on-
+            overflow means scale-up only happens when no worker accepts,
+            inverting the old unconditional pre-dispatch spawn (the cost
+            leak).
         """
         # Arrange
         _install_jobs_index(mock_db)
         registry = ProcessorRegistry()
         registry.register(_StubProcessor())
         provisioner = mocker.AsyncMock(spec=EcsProvisioner)
-        provisioner.request.return_value = ["arn:fake:task/abc"]
         executor = WoolExecutor(
             mock_db,
             tmp_cache,
@@ -1223,51 +1256,41 @@ async def test_ensure_workflow_should_request_worker_when_provisioner_set(
             workdir_root=tmp_workdir,
             provisioner=provisioner,
         )
-        meta = _file_meta()
-        dcc, local_id, md5 = extract_identity(meta)
-        expected_key = key_utils.workflow_key(
-            dcc=dcc, local_id=local_id, md5=md5, pipeline_version=PIPELINE_VERSION
-        )
 
         # Act
-        record, _ = await executor.ensure_workflow(meta)
+        record, _ = await executor.ensure_workflow(_file_meta())
         await _wait_for_terminal(mock_db, record.job_id)
 
         # Assert
-        provisioner.request.assert_awaited_once_with(dedup_key=expected_key)
+        final = await get_job(mock_db, record.job_id)
+        assert final is not None
+        assert final.status == JobStatus.COMPLETED
+        provisioner.request.assert_not_awaited()
 
     @pytest.mark.asyncio
-    async def test_ensure_workflow_should_skip_provisioner_when_attaching_to_existing_job(
+    async def test_attempt_should_request_provisioner_on_overflow(
         self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch, mocker
     ):
-        """Test that the provisioner is only invoked on fresh claims.
+        """Test that overflow requests one bounded worker spawn and reschedules.
 
         Given:
-            An executor with a stub provisioner and a blocking processor
-            so the first claim stays active when the second call arrives.
+            A WoolExecutor wired with a stub provisioner and a queued job
+            whose dispatch attempt finds no worker capacity
+            (``_open_stream_once`` raises ``NoWorkersAvailable``).
         When:
-            ``ensure_workflow`` is awaited twice for the same file_meta.
+            ``_attempt_dispatch`` runs.
         Then:
-            ``provisioner.request`` should be awaited exactly once — the
-            attach path must not double-spend a Fargate worker against
-            an already-claimed workflow.
+            ``provisioner.request`` should be awaited once with the job's
+            workflow key, and the job should stay PENDING with
+            ``next_dispatch_at`` set and ``dispatch_attempts`` bumped —
+            queued durably for a later retry, not failed.
         """
         # Arrange
         _install_jobs_index(mock_db)
-        release = asyncio.Event()
-
-        class _BlockingProcessor(_StubProcessor):
-            async def run(self, file_meta, workdir, cache_root):
-                self.run_calls += 1
-                await release.wait()
-                for kind, key in self.artifacts.items():
-                    yield StageComplete(kind=ArtifactKind(kind), key=key)
-                yield Complete(artifacts=dict(self.artifacts))
-
         registry = ProcessorRegistry()
-        registry.register(_BlockingProcessor())
+        processor = _StubProcessor()
+        registry.register(processor)
         provisioner = mocker.AsyncMock(spec=EcsProvisioner)
-        provisioner.request.return_value = ["arn:fake:task/abc"]
         executor = WoolExecutor(
             mock_db,
             tmp_cache,
@@ -1275,50 +1298,47 @@ async def run(self, file_meta, workdir, cache_root):
             workdir_root=tmp_workdir,
             provisioner=provisioner,
         )
+        record = await _seed_pending_job(mock_db)
+        mocker.patch.object(
+            executor,
+            "_open_stream_once",
+            new=mocker.AsyncMock(side_effect=wool.NoWorkersAvailable("empty")),
+        )
 
         # Act
-        record_a, fresh_a = await executor.ensure_workflow(_file_meta())
-        record_b, fresh_b = await executor.ensure_workflow(_file_meta())
-        release.set()
-        await _wait_for_terminal(mock_db, record_a.job_id)
+        await executor._attempt_dispatch(record, processor, _file_meta())
 
         # Assert
-        assert fresh_a is True
-        assert fresh_b is False
-        assert record_a.job_id == record_b.job_id
-        provisioner.request.assert_awaited_once()
+        provisioner.request.assert_awaited_once_with(dedup_key=record.workflow_key)
+        final = await get_job(mock_db, record.job_id)
+        assert final is not None
+        assert final.status == JobStatus.PENDING
+        assert final.next_dispatch_at is not None
+        assert final.dispatch_attempts == 1
 
     @pytest.mark.asyncio
-    async def test_ensure_workflow_should_fall_through_when_retryable_provisioner_raises(
+    async def test_overflow_should_swallow_retryable_provisioner_error(
         self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch, mocker
     ):
-        """Test that a retryable provisioner failure falls through to dispatch.
+        """Test that a retryable provisioner error is swallowed, job reschedules.
 
         Given:
             A provisioner whose ``request`` raises
-            ``RetryableProvisionerError`` (capacity / ENI exhaustion /
-            throttling) AND a wool pool that can still dispatch to an
-            existing worker (the no_wool_dispatch fixture runs the
-            routine in-process).
+            ``RetryableProvisionerError`` on an overflowed dispatch.
         When:
-            ``ensure_workflow`` is awaited.
+            ``_attempt_dispatch`` runs.
         Then:
-            The retryable error should be logged but not fail the
-            workflow — the Wool pool routes to any available worker
-            via discovery, so an existing idle worker can service the
-            job even when ECS quota blocks a fresh launch. The
-            ``capacity:`` prefix is reserved for the case where
-            dispatch also exhausts its budget (see
-            ``test_ensure_workflow_should_record_capacity_failure_when_dispatch_also_fails``).
+            The error should be swallowed (best-effort scale-up) and the
+            job should stay PENDING, rescheduled for a later attempt — not
+            failed.
         """
         # Arrange
         _install_jobs_index(mock_db)
         registry = ProcessorRegistry()
-        registry.register(_StubProcessor())
+        processor = _StubProcessor()
+        registry.register(processor)
         provisioner = mocker.AsyncMock(spec=EcsProvisioner)
-        provisioner.request.side_effect = RetryableProvisionerError(
-            "ClientError: capacity-unavailable"
-        )
+        provisioner.request.side_effect = RetryableProvisionerError("capacity")
         executor = WoolExecutor(
             mock_db,
             tmp_cache,
@@ -1326,49 +1346,47 @@ async def test_ensure_workflow_should_fall_through_when_retryable_provisioner_ra
             workdir_root=tmp_workdir,
             provisioner=provisioner,
         )
+        record = await _seed_pending_job(mock_db)
+        mocker.patch.object(
+            executor,
+            "_open_stream_once",
+            new=mocker.AsyncMock(side_effect=wool.NoWorkersAvailable("empty")),
+        )
 
         # Act
-        record, _ = await executor.ensure_workflow(_file_meta())
-        await _wait_for_terminal(mock_db, record.job_id)
+        await executor._attempt_dispatch(record, processor, _file_meta())
 
         # Assert
         final = await get_job(mock_db, record.job_id)
         assert final is not None
-        provisioner.request.assert_awaited_once()
-        # Provisioner failure was swallowed; in-process dispatch
-        # succeeded, so the workflow completes despite the
-        # retryable provisioner error.
-        assert final.status == JobStatus.COMPLETED
-        assert final.error is None
+        assert final.status == JobStatus.PENDING
+        assert final.next_dispatch_at is not None
 
     @pytest.mark.asyncio
-    async def test_ensure_workflow_should_record_capacity_failure_when_dispatch_also_fails(
+    async def test_overflow_should_swallow_generic_provisioner_error(
         self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch, mocker
     ):
-        """Test that the capacity: prefix fires when dispatch also exhausts.
+        """Test that a generic provisioner error doesn't fail the queued job.
 
         Given:
-            A provisioner whose ``request`` raises
-            ``RetryableProvisionerError`` AND a dispatch path that
-            exhausts its budget with ``wool.NoWorkersAvailable``.
+            A provisioner whose ``request`` raises a generic
+            ``RuntimeError`` (misconfiguration / unexpected boto failure)
+            on an overflowed dispatch.
         When:
-            ``ensure_workflow`` is awaited.
+            ``_attempt_dispatch`` runs.
         Then:
-            The workflow should reach ``FAILED`` with the
-            ``capacity:`` prefix — the wire contract clients parse to
-            decide whether to resubmit. Per C26 the origin of the
-            prefix shifted from the provisioner-failure branch to the
-            dispatch-budget-exhausted branch, but the on-wire shape
-            is unchanged.
+            The error should be logged and swallowed and the job should
+            stay PENDING (rescheduled) rather than failing — the
+            provisioner is a best-effort scale-up hint on overflow, not a
+            dispatch gate; the dispatch deadline bounds the retry loop.
         """
         # Arrange
         _install_jobs_index(mock_db)
         registry = ProcessorRegistry()
-        registry.register(_StubProcessor())
+        processor = _StubProcessor()
+        registry.register(processor)
         provisioner = mocker.AsyncMock(spec=EcsProvisioner)
-        provisioner.request.side_effect = RetryableProvisionerError(
-            "ClientError: capacity-unavailable"
-        )
+        provisioner.request.side_effect = RuntimeError("misconfigured cluster")
         executor = WoolExecutor(
             mock_db,
             tmp_cache,
@@ -1376,70 +1394,279 @@ async def test_ensure_workflow_should_record_capacity_failure_when_dispatch_also
             workdir_root=tmp_workdir,
             provisioner=provisioner,
         )
-        # Force the dispatch path to also fail with
-        # ``wool.NoWorkersAvailable``; the C26 stream-open branch
-        # translates that to the ``capacity:`` prefix.
+        record = await _seed_pending_job(mock_db)
         mocker.patch.object(
             executor,
-            "_open_stream_with_retry",
-            side_effect=wool.NoWorkersAvailable("pool empty after budget"),
+            "_open_stream_once",
+            new=mocker.AsyncMock(side_effect=wool.NoWorkersAvailable("empty")),
         )
 
         # Act
-        record, _ = await executor.ensure_workflow(_file_meta())
-        await _wait_for_terminal(mock_db, record.job_id)
+        await executor._attempt_dispatch(record, processor, _file_meta())
 
         # Assert
         final = await get_job(mock_db, record.job_id)
         assert final is not None
-        assert final.status == JobStatus.FAILED
-        assert final.error is not None
-        assert final.error.startswith("capacity: "), (
-            f"expected 'capacity: ' prefix, got {final.error!r}"
+        assert final.status == JobStatus.PENDING
+        assert final.next_dispatch_at is not None
+
+
+class TestWoolExecutorAdmission:
+    @pytest.mark.asyncio
+    async def test_ensure_workflow_should_reject_at_ceiling(
+        self, mock_db, tmp_cache, tmp_workdir, mocker
+    ):
+        """Test that ensure_workflow sheds load once the active ceiling is hit.
+
+        Given:
+            One active workflow already in the jobs collection and the
+            admission ceiling pinned to 1.
+        When:
+            ``ensure_workflow`` is awaited for another file.
+        Then:
+            It should raise ``AdmissionRejected`` carrying the observed
+            active count, the ceiling, and a positive ``retry_after`` hint
+            — before claiming the mutex, so a flood is shed at the door.
+        """
+        # Arrange
+        _install_jobs_index(mock_db)
+        registry = ProcessorRegistry()
+        registry.register(_StubProcessor())
+        executor = WoolExecutor(
+            mock_db, tmp_cache, registry, workdir_root=tmp_workdir
         )
-        assert "pool empty" in final.error
+        mocker.patch.object(executor_module, "_MAX_ACTIVE_WORKFLOWS", 1)
+        await _seed_pending_job(mock_db)  # one active workflow
+
+        # Act & assert
+        with pytest.raises(AdmissionRejected) as excinfo:
+            await executor.ensure_workflow(_file_meta())
+        assert excinfo.value.active == 1
+        assert excinfo.value.ceiling == 1
+        assert excinfo.value.retry_after_seconds >= 1
 
     @pytest.mark.asyncio
-    async def test_ensure_workflow_should_record_provisioner_prefix_for_generic_exception(
+    async def test_ensure_workflow_should_admit_below_ceiling(
         self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch, mocker
     ):
-        """Test that non-retryable provisioner failures use the 'provisioner: ' prefix.
+        """Test that ensure_workflow claims normally below the ceiling.
 
         Given:
-            A provisioner whose ``request`` raises a generic
-            ``RuntimeError`` (a misconfiguration or unexpected boto
-            failure, not a retryable transient).
+            An admission ceiling of 5 and no active workflows.
         When:
             ``ensure_workflow`` is awaited.
         Then:
-            The job should reach ``FAILED`` with an error string that
-            starts with ``"provisioner: "`` so clients can distinguish
-            "retry me later" from "the provisioner crashed".
+            It should claim and dispatch the job to completion — the
+            ceiling only sheds load once it is reached.
         """
         # Arrange
         _install_jobs_index(mock_db)
         registry = ProcessorRegistry()
         registry.register(_StubProcessor())
-        provisioner = mocker.AsyncMock(spec=EcsProvisioner)
-        provisioner.request.side_effect = RuntimeError("misconfigured cluster")
         executor = WoolExecutor(
-            mock_db,
-            tmp_cache,
-            registry,
-            workdir_root=tmp_workdir,
-            provisioner=provisioner,
+            mock_db, tmp_cache, registry, workdir_root=tmp_workdir
         )
+        mocker.patch.object(executor_module, "_MAX_ACTIVE_WORKFLOWS", 5)
 
         # Act
-        record, _ = await executor.ensure_workflow(_file_meta())
+        record, fresh = await executor.ensure_workflow(_file_meta())
         await _wait_for_terminal(mock_db, record.job_id)
 
         # Assert
+        assert fresh is True
+        final = await get_job(mock_db, record.job_id)
+        assert final is not None
+        assert final.status == JobStatus.COMPLETED
+
+
+class TestWoolExecutorDispatchDeadline:
+    @pytest.mark.asyncio
+    async def test_attempt_should_fail_capacity_when_deadline_exceeded(
+        self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch, mocker
+    ):
+        """Test that a job past its dispatch deadline is failed ``capacity:``.
+
+        Given:
+            A queued job submitted long ago (older than a tiny patched
+            dispatch deadline).
+        When:
+            ``_attempt_dispatch`` runs.
+        Then:
+            The job should be failed with the ``capacity:`` prefix without
+            attempting dispatch — the deadline bounds the durable retry
+            loop so a job can't queue forever.
+        """
+        # Arrange
+        _install_jobs_index(mock_db)
+        registry = ProcessorRegistry()
+        processor = _StubProcessor()
+        registry.register(processor)
+        executor = WoolExecutor(
+            mock_db, tmp_cache, registry, workdir_root=tmp_workdir
+        )
+        mocker.patch.object(executor_module, "_DISPATCH_DEADLINE_SECONDS", 1.0)
+        old = datetime.now(timezone.utc) - timedelta(seconds=10000)
+        record = await _seed_pending_job(mock_db, submitted_at=old)
+        open_spy = mocker.patch.object(
+            executor, "_open_stream_once", new=mocker.AsyncMock()
+        )
+
+        # Act
+        await executor._attempt_dispatch(record, processor, _file_meta())
+
+        # Assert
+        open_spy.assert_not_awaited()  # never tried to dispatch
         final = await get_job(mock_db, record.job_id)
         assert final is not None
         assert final.status == JobStatus.FAILED
-        assert final.error is not None
-        assert final.error.startswith("provisioner: "), (
-            f"expected 'provisioner: ' prefix, got {final.error!r}"
+        assert (final.error or "").startswith("capacity:")
+
+
+class TestWoolExecutorScheduler:
+    @pytest.mark.asyncio
+    async def test_drain_due_jobs_should_dispatch_a_due_queued_job(
+        self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch
+    ):
+        """Test that the scheduler tick dispatches a due rescheduled job.
+
+        Given:
+            A queued job whose ``next_dispatch_at`` is in the past (an
+            earlier attempt overflowed) and a registered processor.
+        When:
+            One scheduler tick (``_drain_due_jobs``) runs.
+        Then:
+            The job should be leased, dispatched to the in-process worker,
+            and reach COMPLETED — the durable retry path picking up a
+            previously-overflowed job.
+        """
+        # Arrange
+        _install_jobs_index(mock_db)
+        registry = ProcessorRegistry()
+        registry.register(_StubProcessor())
+        executor = WoolExecutor(
+            mock_db, tmp_cache, registry, workdir_root=tmp_workdir
+        )
+        due = datetime.now(timezone.utc) - timedelta(seconds=1)
+        record = await _seed_pending_job(mock_db, next_dispatch_at=due)
+
+        # Act
+        await executor._drain_due_jobs()
+        await _wait_for_terminal(mock_db, record.job_id)
+
+        # Assert
+        final = await get_job(mock_db, record.job_id)
+        assert final is not None
+        assert final.status == JobStatus.COMPLETED
+
+    @pytest.mark.asyncio
+    async def test_drain_due_jobs_should_ignore_not_yet_due_jobs(
+        self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch
+    ):
+        """Test that a queued job whose retry time is in the future is skipped.
+
+        Given:
+            A queued job whose ``next_dispatch_at`` is in the future.
+        When:
+            One scheduler tick runs.
+        Then:
+            The job should be left PENDING and undispatched — only due
+            jobs are leased, so a not-yet-due retry isn't run early.
+        """
+        # Arrange
+        _install_jobs_index(mock_db)
+        registry = ProcessorRegistry()
+        processor = _StubProcessor()
+        registry.register(processor)
+        executor = WoolExecutor(
+            mock_db, tmp_cache, registry, workdir_root=tmp_workdir
+        )
+        future = datetime.now(timezone.utc) + timedelta(seconds=3600)
+        record = await _seed_pending_job(mock_db, next_dispatch_at=future)
+
+        # Act
+        await executor._drain_due_jobs()
+
+        # Assert
+        final = await get_job(mock_db, record.job_id)
+        assert final is not None
+        assert final.status == JobStatus.PENDING
+        assert processor.run_calls == 0
+
+    @pytest.mark.asyncio
+    async def test_scheduler_loop_should_survive_a_tick_exception(
+        self, mock_db, tmp_cache, tmp_workdir, mocker
+    ):
+        """Test that an exception in one tick doesn't kill the scheduler.
+
+        Given:
+            A scheduler whose lease query raises on its first call then
+            returns nothing, with a tiny retry interval.
+        When:
+            ``start_scheduler`` runs the loop across several ticks.
+        Then:
+            The scheduler task should still be alive and have ticked again
+            after the exception — a transient Mongo blip must not silently
+            stop the dispatch driver.
+        """
+        # Arrange
+        _install_jobs_index(mock_db)
+        registry = ProcessorRegistry()
+        registry.register(_StubProcessor())
+        executor = WoolExecutor(
+            mock_db, tmp_cache, registry, workdir_root=tmp_workdir
         )
-        assert "misconfigured cluster" in final.error
+        calls = {"n": 0}
+
+        async def flaky_lease(*_a, **_k):
+            calls["n"] += 1
+            if calls["n"] == 1:
+                raise RuntimeError("transient mongo blip")
+            return None
+
+        mocker.patch.object(executor_module, "lease_due_dispatch", flaky_lease)
+        mocker.patch.object(executor_module, "_RETRY_INTERVAL_SECONDS", 0.01)
+
+        # Act
+        executor.start_scheduler()
+        try:
+            await asyncio.sleep(0.1)
+            # Assert — survived the first-tick exception and ticked again.
+            assert executor._scheduler_task is not None
+            assert not executor._scheduler_task.done()
+            assert calls["n"] >= 2
+        finally:
+            await executor.drain(timeout=1.0)
+
+
+class TestWoolExecutorStartScheduler:
+    @pytest.mark.asyncio
+    async def test_drain_should_cancel_the_scheduler(
+        self, mock_db, tmp_cache, tmp_workdir
+    ):
+        """Test that drain stops the durable scheduler task.
+
+        Given:
+            An executor whose scheduler has been started.
+        When:
+            ``drain`` runs.
+        Then:
+            The scheduler task should be cancelled and cleared, so it stops
+            dispatching before the wool pool closes on shutdown.
+        """
+        # Arrange
+        _install_jobs_index(mock_db)
+        registry = ProcessorRegistry()
+        registry.register(_StubProcessor())
+        executor = WoolExecutor(
+            mock_db, tmp_cache, registry, workdir_root=tmp_workdir
+        )
+        executor.start_scheduler()
+        task = executor._scheduler_task
+        assert task is not None
+
+        # Act
+        await executor.drain(timeout=1.0)
+
+        # Assert
+        assert task.cancelled() or task.done()
+        assert executor._scheduler_task is None

From c60f6ec817519c5c8b6e89f239e820caa5ab0db8 Mon Sep 17 00:00:00 2001
From: Conrad <conradbzura@gmail.com>
Date: Mon, 22 Jun 2026 19:07:43 -0400
Subject: [PATCH 10/19] feat: Surface admission rejection as 429 and start the
 retry scheduler

Map the executor's AdmissionRejected to a 429 response with a Retry-After
header on the /data and /index dispatch path, caught explicitly before
WorkflowNotApplicable so a rejected request backs off rather than falling
through to direct upstream streaming and bypassing the bounded pipeline.
The side-effect-free /status readiness probes never dispatch, so they
never 429.

Start the durable retry scheduler in the lifespan, inside the wool pool
context so the task inherits wool's dispatch contextvars (the same
mechanism request-spawned tasks rely on). The existing drain teardown
cancels it before the pool closes.
---
 src/cfdb/api/main.py                       | 20 ++++++---
 src/cfdb/api/routers/cache_stream.py       | 18 +++++++-
 tests/test_api/test_lifespan_workerpool.py |  4 ++
 tests/test_cache_stream.py                 | 48 +++++++++++++++++++++-
 4 files changed, 82 insertions(+), 8 deletions(-)

diff --git a/src/cfdb/api/main.py b/src/cfdb/api/main.py
index 55ef8f0..53db36d 100644
--- a/src/cfdb/api/main.py
+++ b/src/cfdb/api/main.py
@@ -330,15 +330,16 @@ async def lifespan(_: FastAPI):
                 # dispatch — fired while the Fargate worker is still booting
                 # (image pull + health check, ~1-3 min) — blow the quorum
                 # wait and raise an un-retried ``TimeoutError`` out of
-                # ``WorkerPool.start``. cfdb's ``_open_stream_with_retry``
-                # only retries ``wool.NoWorkersAvailable``, so that
+                # ``WorkerPool.start``. cfdb's dispatch attempt only treats
+                # ``wool.NoWorkersAvailable`` as "no capacity", so that
                 # ``TimeoutError`` fails the first request hard (and the
                 # provisioner's cancelled ``RunTask`` leaks an idle worker).
                 # With the gate off, a cold-start dispatch surfaces
-                # ``NoWorkersAvailable`` instead, which the executor already
-                # absorbs within its own ``_DISPATCH_WAIT_SECONDS`` budget —
-                # keeping the cold-start wait in one layer cfdb owns rather
-                # than duplicating it across wool's quorum wait too.
+                # ``NoWorkersAvailable`` instead, which the executor absorbs
+                # by leaving the job PENDING and re-attempting it on the
+                # durable retry scheduler's cadence — keeping the cold-start
+                # wait in one layer cfdb owns rather than duplicating it
+                # across wool's quorum wait too.
                 async with wool.WorkerPool(
                     discovery=discovery,
                     credentials=worker_credentials,
@@ -362,6 +363,13 @@ async def lifespan(_: FastAPI):
                         provisioner=provisioner,
                     )
                     executor_handle = api.executor
+                    # Start the durable retry scheduler — the single dispatch
+                    # driver — here, inside the pool context, so the created
+                    # task inherits wool's dispatch contextvars (same as
+                    # request-spawned tasks via ``attach_wool_context``).
+                    # ``_drain_executor`` cancels it on teardown before the
+                    # pool closes.
+                    api.executor.start_scheduler()
                     log.info(
                         "Workflow subsystem enabled: profile=%s cache=%s "
                         "workdir=%s discovery=%s provisioner=%s "
diff --git a/src/cfdb/api/routers/cache_stream.py b/src/cfdb/api/routers/cache_stream.py
index 5b89651..830f5cc 100644
--- a/src/cfdb/api/routers/cache_stream.py
+++ b/src/cfdb/api/routers/cache_stream.py
@@ -19,7 +19,11 @@
 from cfdb import api
 from cfdb.services import drs
 from cfdb.workflows.cache import CacheBackend
-from cfdb.workflows.executor import ExecutorDraining, WorkflowNotApplicable
+from cfdb.workflows.executor import (
+    AdmissionRejected,
+    ExecutorDraining,
+    WorkflowNotApplicable,
+)
 from cfdb.workflows.models import ArtifactKind
 
 logger = logging.getLogger(__name__)
@@ -275,6 +279,18 @@ async def serve_workflow_artifact_or_dispatch(
             detail="Service is shutting down; please retry",
             headers={"Retry-After": "30"},
         )
+    except AdmissionRejected as exc:
+        # The active-workflow ceiling is hit — shed this request with a
+        # 429 so the client backs off rather than queuing unbounded work.
+        # NOT a WorkflowNotApplicable subclass, so it must be caught
+        # explicitly (before that handler) to avoid falling through to
+        # direct upstream streaming, which would bypass the bounded
+        # pipeline entirely.
+        raise HTTPException(
+            status_code=status.HTTP_429_TOO_MANY_REQUESTS,
+            detail="Too many active preprocessing jobs; please retry shortly",
+            headers={"Retry-After": str(exc.retry_after_seconds)},
+        )
     except WorkflowNotApplicable:
         # Race: processor accepted this file, but executor disagreed.
         # Fall through to the caller's own path rather than 500.
diff --git a/tests/test_api/test_lifespan_workerpool.py b/tests/test_api/test_lifespan_workerpool.py
index 30c857f..b4681fd 100644
--- a/tests/test_api/test_lifespan_workerpool.py
+++ b/tests/test_api/test_lifespan_workerpool.py
@@ -103,6 +103,10 @@ async def _fake_build_discovery(_profile):
     from cfdb.workflows.loadbalancer import PriorityLoadBalancer
 
     assert isinstance(kwargs["loadbalancer"], PriorityLoadBalancer)
+    # The durable retry scheduler must be started inside the pool context
+    # (issue #45) so it inherits wool's dispatch contextvars; pin it here
+    # alongside the other load-bearing lifespan wiring.
+    fake_executor.start_scheduler.assert_called_once()
 
 
 def test_worker_pool_signature_should_accept_quorum():
diff --git a/tests/test_cache_stream.py b/tests/test_cache_stream.py
index 5626d8b..e415829 100644
--- a/tests/test_cache_stream.py
+++ b/tests/test_cache_stream.py
@@ -17,7 +17,11 @@
     stream_cache_entry,
 )
 from cfdb.workflows.cache import LocalFsCache
-from cfdb.workflows.executor import ExecutorDraining, WorkflowNotApplicable
+from cfdb.workflows.executor import (
+    AdmissionRejected,
+    ExecutorDraining,
+    WorkflowNotApplicable,
+)
 from cfdb.workflows.models import ArtifactKind, JobRecord, JobStatus
 from cfdb.workflows.processors.base import Processor
 from cfdb.workflows.processors.registry import ProcessorRegistry
@@ -492,6 +496,48 @@ async def test_should_raise_503_when_executor_is_draining(
         assert exc_info.value.status_code == 503
         assert "Retry-After" in exc_info.value.headers
 
+    @pytest.mark.asyncio
+    async def test_should_raise_429_when_admission_rejected(
+        self, mocker, tmp_path
+    ):
+        """Test that ``AdmissionRejected`` is translated to 429 with Retry-After.
+
+        Given:
+            Cache miss + GET + an executor that raises ``AdmissionRejected``
+            (the active-workflow ceiling is hit).
+        When:
+            The helper is awaited.
+        Then:
+            It should raise ``HTTPException(429)`` carrying the exception's
+            ``retry_after_seconds`` as the ``Retry-After`` header — so a
+            flood backs off rather than falling through to direct streaming
+            (which would bypass the bounded pipeline).
+        """
+        # Arrange
+        registry = ProcessorRegistry()
+        registry.register(_StubProcessor())
+        mocker.patch.object(api, "processor_registry", registry)
+        mocker.patch.object(api, "cache", LocalFsCache(tmp_path / "cache"))
+        mocker.patch.object(
+            api,
+            "executor",
+            _RecordingExecutor(
+                exc=AdmissionRejected(active=12, ceiling=12, retry_after_seconds=7)
+            ),
+        )
+
+        # Act & assert
+        with pytest.raises(HTTPException) as exc_info:
+            await serve_workflow_artifact_or_dispatch(
+                _file_doc(),
+                ArtifactKind.DATA,
+                _Request(),
+                None,
+                head_404_detail="missing",
+            )
+        assert exc_info.value.status_code == 429
+        assert exc_info.value.headers["Retry-After"] == "7"
+
     @pytest.mark.asyncio
     async def test_should_return_none_on_workflow_not_applicable_race(
         self, mocker, tmp_path

From 8480a65a7d0b4de9f04194713d4303c17e795d9b Mon Sep 17 00:00:00 2001
From: Conrad <conradbzura@gmail.com>
Date: Mon, 22 Jun 2026 19:07:52 -0400
Subject: [PATCH 11/19] docs: Document bounded-concurrency, durable queuing,
 and admission control

Describe the four cooperating dispatch layers (per-worker backpressure,
priority load balancing, durable retry-to-deadline queue, admission
ceiling), add the new CFDB_WORKER_MAX_CONCURRENT_TASKS /
CFDB_WORKFLOW_MAX_ACTIVE / CFDB_WORKFLOW_RETRY_INTERVAL_S /
CFDB_WORKFLOW_DISPATCH_DEADLINE_S knobs, note that the removed
CFDB_WORKFLOW_DISPATCH_WAIT_S is superseded, and list 429 in the /data
and /index status-code tables.
---
 README.md | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 01cd22a..af665ed 100644
--- a/README.md
+++ b/README.md
@@ -554,6 +554,7 @@ Stream file contents from DCCs via HTTPS.
 | 403 | File requires authentication (consortium/protected access) or denied by upstream repository |
 | 404 | File not found, or HEAD probe of a not-yet-cached processed artifact (GET would dispatch) |
 | 416 | Range not satisfiable (out of bounds, or file size unknown so no range can be satisfied) |
+| 429 | Too many active preprocessing jobs — the active-workflow ceiling (`CFDB_WORKFLOW_MAX_ACTIVE`) is reached; `Retry-After` header set. Retry shortly. |
 | 501 | No supported access method (e.g., Globus-only files) |
 | 502 | Upstream service error |
 | 503 | Workflow subsystem shutting down (`Retry-After`) |
@@ -592,6 +593,7 @@ Stream index files (e.g., `.px2`, `.bai`) associated with DCC data files.
 | 403 | File requires consortium/protected access (HuBMAP) |
 | 404 | File not found, format has no index (CSV/TSV/bigWig), or HEAD probe of a not-yet-cached index |
 | 416 | Range not satisfiable |
+| 429 | Too many active preprocessing jobs — the active-workflow ceiling (`CFDB_WORKFLOW_MAX_ACTIVE`) is reached; `Retry-After` header set. Retry shortly. |
 | 502 | Upstream service error or malformed sidecar |
 | 503 | Workflow subsystem disabled (set `SYNC_DATA_DIR`) for a processable format, or shutting down (`Retry-After`) |
 
@@ -651,6 +653,13 @@ The preprocessed artifact is the default response. Clients that want the raw ups
 
 Cache keys are content-addressed using each file's upstream `md5`, so a byte change upstream (with the sync pipeline refreshing `md5`) invalidates the cache automatically.
 
+**Bounded concurrency, durable queuing, and admission control.** Dispatch is bounded on three cooperating layers so an unauthenticated burst on `/data` and `/index` can't oversubscribe the worker fleet or queue unbounded work:
+
+- **Per-worker backpressure** — each worker accepts at most `CFDB_WORKER_MAX_CONCURRENT_TASKS` tasks at once (default `1`), serializing the subprocess pipelines on a 1-vCPU worker. A worker at capacity rejects the dispatch and the API's priority load balancer rotates to the next worker.
+- **Priority (leaky-bucket) load balancing** — the API offers each task to discovered workers in a stable order, so load concentrates on the lowest-ordered workers and over-provisioned workers drain to idle and self-reap (via `CFDB_WORKER_MAX_LIFETIME_SECONDS`) instead of every worker carrying a thin perpetual slice.
+- **Durable queue + retry-to-deadline** — when no worker has capacity, the job is **not** failed and does **not** block the request: it stays `pending` and a durable, Mongo-backed scheduler re-attempts dispatch every `CFDB_WORKFLOW_RETRY_INTERVAL_S` (plus jitter) until a worker frees up or the `CFDB_WORKFLOW_DISPATCH_DEADLINE_S` deadline elapses (then it is failed with a `capacity:`-prefixed error). Because the queue lives in Mongo, an API restart resumes it. On the ECS profile, an overflow also requests one bounded worker spawn (the leaky bucket overflowing), inverting the old unconditional per-request spawn.
+- **Admission ceiling** — once `CFDB_WORKFLOW_MAX_ACTIVE` workflows are active (`pending` + `running`), further preprocessing requests are shed with `429 Retry-After` rather than queued, so the backlog itself is bounded. The readiness `/status` probes never dispatch and so never `429`.
+
 `/index` continues to serve upstream sidecars first when present (the 218 BED→beddb and 4 BED→tbi 4DN cases that publish under `extra.extra_files` or `extra.fourdn.extra_files`); the workflow path is dispatched only when no sidecar exists. Set `?raw=true` to bypass the workflow path entirely and return only the upstream sidecar (404 when none exists).
 
 Required environment variables:
@@ -659,10 +668,16 @@ Required environment variables:
 - `WORKFLOW_WORKER_COUNT` — local-dev only: how many workers the LAN pool (`python -m cfdb.workflows.worker_lan`) spawns and publishes (default `2`). The API itself no longer leases a fixed count — its `WorkerPool` admits every worker discovery surfaces. In the ECS profile one ephemeral worker is launched per workflow (via `EcsProvisioner` `RunTask`), so the **maximum concurrent worker count is bounded by the AWS Fargate vCPU service quota** for the account/region — raise it in Service Quotas; at 1 vCPU per worker it maps roughly 1:1 to concurrent workers. A burst of N distinct uncached files can launch up to ~N workers concurrently, subject to that quota.
 - `WORKFLOW_POOL_NAMESPACE` — wool discovery namespace shared by the API and the external worker pool (default `cfdb-workers`). Both processes must agree on this value or dispatch will hang waiting for workers.
 
+Bounded-concurrency control (issue #45):
+
+- `CFDB_WORKER_MAX_CONCURRENT_TASKS` — per-worker backpressure threshold: a worker rejects a dispatch (gRPC `RESOURCE_EXHAUSTED`, which the priority load balancer treats as transient and rotates past) once it already has this many tasks in flight (default `1`, to serialize the subprocess pipelines on a 1-vCPU worker; `0` disables backpressure). Set on the **worker** process.
+- `CFDB_WORKFLOW_MAX_ACTIVE` — admission ceiling on concurrently active workflows (`pending` + `running` jobs). Once this many are active, `/data` and `/index` shed new preprocessing requests with `429 Retry-After` before claiming the per-file mutex (default `1024`). Soft cap — a count-then-claim race may briefly overshoot. Set on the **API**.
+- `CFDB_WORKFLOW_RETRY_INTERVAL_S` — base cadence (plus a small random jitter) at which the durable retry scheduler re-attempts dispatch for a job awaiting worker capacity (default `120`, i.e. 2 min). A dispatch attempt that finds no free worker leaves the job `pending` and rescheduled rather than blocking the request. Set on the **API**.
+- `CFDB_WORKFLOW_DISPATCH_DEADLINE_S` — how long a job may wait for worker capacity (re-attempted on the retry cadence above) before the scheduler fails it with a `capacity:`-prefixed error (default `14400`, i.e. 4 h). Replaces the former single in-request `CFDB_WORKFLOW_DISPATCH_WAIT_S` wait — instead of blocking one request on a cold start, the job queues durably and is retried to this deadline. Set on the **API**.
+
 Optional tunables (with defaults):
 
 - `CFDB_WORKFLOW_DURATION_CAP_S` — per-workflow wall-clock cap (default `14400`, i.e. 4 h — sized for multi-hour preprocessing runs; lower it for fixture-bound dev).
-- `CFDB_WORKFLOW_DISPATCH_WAIT_S` — how long `ensure_workflow` waits for a free worker before giving up (default `240`, i.e. 4 min — sized for an ECS Fargate cold start: image pull + health check typically take 1-3 min, so a smaller budget can expire before a freshly-provisioned worker reports HEALTHY and hard-fail the first request. Lower it for fixture-bound dev where workers are already running).
 - `CFDB_WORKFLOW_HEARTBEAT_INTERVAL_S` — cadence at which the wool routine emits heartbeat events during quiet stages so the API can refresh `JobRecord.updated_at` (default `300`). The stale-reclaim threshold below is sized as `2 × heartbeat + safety_margin`; lowering this knob without also lowering the threshold widens the false-reclaim window.
 - `CFDB_WORKFLOW_STALE_THRESHOLD_S` — `updated_at` age beyond which an active row is reclaimable (default `900`; sized as `2 × heartbeat_interval + safety_margin` so a single missed heartbeat does not falsely reclaim a healthy worker).
 - `CFDB_SAMTOOLS_THREADS` (default `1`), `CFDB_SORT_PARALLEL` (default `2`) — CPU/thread caps for `samtools sort/index` and GNU `sort` respectively.

From db6529e3867ca02eb66571a5ae15bfb29c17be0f Mon Sep 17 00:00:00 2001
From: Conrad <conradbzura@gmail.com>
Date: Mon, 22 Jun 2026 19:29:36 -0400
Subject: [PATCH 12/19] fix: Mark a workflow RUNNING on worker acceptance to
 stop double-dispatch
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The dispatch routine now yields an immediate heartbeat as its first event,
the instant a worker accepts the task and before the upstream download.
_attempt_dispatch marks the job RUNNING on that first event, so the job
leaves the leasable PENDING set the moment a worker takes it.

Previously the processor's first event arrived only after the download, so
a job stayed PENDING for the whole (potentially multi-second) download.
When that exceeded the retry interval the durable scheduler re-leased the
same job and dispatched a second concurrent attempt onto the same per-job
workdir, corrupting the in-flight download (observed live as "invalid
compressed data" and a source.raw.part rename race) and inflating apparent
worker concurrency. A rejected dispatch raises NoWorkersAvailable before
the routine body runs, so the leading heartbeat fires only on acceptance —
overflowed jobs are never marked RUNNING, keeping concurrency bounded to
the worker count.
---
 src/cfdb/workflows/executor.py        | 19 +++++++
 tests/test_workflows/test_executor.py | 71 +++++++++++++++++++++++++++
 2 files changed, 90 insertions(+)

diff --git a/src/cfdb/workflows/executor.py b/src/cfdb/workflows/executor.py
index 9c19ce6..2636405 100644
--- a/src/cfdb/workflows/executor.py
+++ b/src/cfdb/workflows/executor.py
@@ -263,6 +263,20 @@ async def _run_processor_routine(
     converted to an :class:`~cfdb.workflows.events.Error` event so the API
     consumer can record a clean terminal state.
 
+    The routine yields an immediate :class:`~cfdb.workflows.events.Heartbeat`
+    as its very first event — the instant a worker *accepts* the dispatch,
+    before the (potentially slow) upstream download. This is the dispatcher's
+    "worker accepted" signal: ``_attempt_dispatch`` marks the job RUNNING on
+    the first event, so a job leaves the leasable PENDING set the moment a
+    worker takes it rather than only after the first processor event lands.
+    Without this, a long download (which happens before the processor's
+    first event) keeps the job PENDING, and the durable scheduler would
+    re-lease it and double-dispatch onto the same workdir once the retry
+    interval elapsed. A rejected dispatch (worker backpressure) raises
+    ``NoWorkersAvailable`` before the routine body runs, so this leading
+    heartbeat fires only on acceptance — cleanly distinguishing accepted
+    from overflowed.
+
     Arguments are cloudpickle-serializable: ``processor`` is a stateless
     class instance, ``file_meta`` is a plain dict (B1 strips Mongo's
     ``_id``), ``workdir_str`` crosses as a string, and ``cache`` is the
@@ -277,6 +291,11 @@ class instance, ``file_meta`` is a plain dict (B1 strips Mongo's
     """
     workdir = Path(workdir_str)
     workdir.mkdir(parents=True, exist_ok=True)
+    # "Worker accepted" signal — see the docstring. Emitted before the
+    # processor runs so the dispatcher marks the job RUNNING the instant a
+    # worker takes it (before the download), closing the re-lease /
+    # double-dispatch window. A rejected dispatch never reaches here.
+    yield Heartbeat()
     inner = processor.run(file_meta, workdir, cache).__aiter__()
     next_task: Optional[asyncio.Task] = None
     try:
diff --git a/tests/test_workflows/test_executor.py b/tests/test_workflows/test_executor.py
index 364fef3..9ece050 100644
--- a/tests/test_workflows/test_executor.py
+++ b/tests/test_workflows/test_executor.py
@@ -132,6 +132,19 @@ async def _wait_for_terminal(mock_db, job_id: str, timeout: float = 2.0) -> None
     raise AssertionError(f"Job {job_id} did not reach terminal status")
 
 
+async def _wait_for_status(
+    mock_db, job_id: str, status: JobStatus, timeout: float = 2.0
+) -> None:
+    """Poll the job record until it reaches ``status`` (or time out)."""
+    deadline = asyncio.get_event_loop().time() + timeout
+    while asyncio.get_event_loop().time() < deadline:
+        record = await get_job(mock_db, job_id)
+        if record is not None and record.status == status:
+            return
+        await asyncio.sleep(0.01)
+    raise AssertionError(f"Job {job_id} did not reach status {status}")
+
+
 async def _seed_pending_job(
     mock_db,
     *,
@@ -1411,6 +1424,64 @@ async def test_overflow_should_swallow_generic_provisioner_error(
         assert final.next_dispatch_at is not None
 
 
+class TestWoolExecutorMarkRunningOnAcceptance:
+    @pytest.mark.asyncio
+    async def test_attempt_should_mark_running_before_first_processor_event(
+        self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch
+    ):
+        """Test that a job is marked RUNNING the instant a worker accepts it.
+
+        Given:
+            A processor that blocks before yielding any real event (standing
+            in for a slow upstream download, which is the processor's first
+            action before its first stage event).
+        When:
+            ``ensure_workflow`` dispatches the inline attempt and a worker
+            accepts it.
+        Then:
+            The job should reach RUNNING while the processor is still blocked
+            — driven by the routine's leading "worker accepted" heartbeat,
+            not the first processor event. This is the invariant that keeps a
+            job from lingering PENDING (and being re-leased / double-
+            dispatched onto the same workdir) for the whole download.
+        """
+        # Arrange
+        _install_jobs_index(mock_db)
+        release = asyncio.Event()
+
+        class _BlockBeforeFirstEvent(_StubProcessor):
+            async def run(self, file_meta, workdir, cache_root):
+                self.run_calls += 1
+                await release.wait()  # block before the first real event
+                yield Complete(artifacts={})
+
+        registry = ProcessorRegistry()
+        registry.register(_BlockBeforeFirstEvent())
+        executor = WoolExecutor(
+            mock_db, tmp_cache, registry, workdir_root=tmp_workdir
+        )
+
+        # Act
+        record, _ = await executor.ensure_workflow(_file_meta())
+        try:
+            # Would time out (job stuck PENDING) if mark_running waited for
+            # the first processor event rather than the acceptance heartbeat.
+            await _wait_for_status(mock_db, record.job_id, JobStatus.RUNNING)
+            running = await get_job(mock_db, record.job_id)
+            release.set()
+            await _wait_for_terminal(mock_db, record.job_id)
+        finally:
+            release.set()
+            await executor.drain(timeout=2.0)
+
+        # Assert
+        assert running is not None
+        assert running.status == JobStatus.RUNNING
+        final = await get_job(mock_db, record.job_id)
+        assert final is not None
+        assert final.status == JobStatus.COMPLETED
+
+
 class TestWoolExecutorAdmission:
     @pytest.mark.asyncio
     async def test_ensure_workflow_should_reject_at_ceiling(

From 23ec652fe13202abde8b4100ae31b7af01e510dc Mon Sep 17 00:00:00 2001
From: Conrad <conradbzura@gmail.com>
Date: Mon, 22 Jun 2026 21:28:17 -0400
Subject: [PATCH 13/19] feat: Autonomously recover workflows orphaned by an API
 or worker crash
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add a scheduler orphan-recovery sweep so restart recovery no longer depends
on a client re-requesting the same file.

The durable scheduler only leases PENDING jobs whose next_dispatch_at is set
and due, so two states a crash leaves behind would otherwise sit until
claim_workflow's stale-reclaim fired on a fresh same-file request: a RUNNING
row whose API consumer died mid-stream, and a fresh PENDING claim whose
inline attempt never rescheduled. requeue_orphaned_dispatch resets both to
PENDING with next_dispatch_at=now once they pass the stale threshold, and
the scheduler runs it at the top of every tick — including the first, on
boot — so the next tick re-dispatches them.

The sweep is gated on the same updated_at staleness threshold as
stale-reclaim, so a healthy in-flight job (which heartbeats) is never
yanked from its worker; partial-commit recovery means a re-dispatched job
reuses any stages already cached. Unlike stale-reclaim, which fails the row
so a new claimant supersedes it, the sweep revives the same row since there
is no new claimant.

Also fix the FakeCollection.update_many test double to report a row-level
modified_count (matching real Mongo) instead of a per-field-write count.
---
 README.md                             |   2 +-
 src/cfdb/workflows/executor.py        |  25 ++++
 src/cfdb/workflows/lock.py            |  50 ++++++++
 tests/conftest.py                     |   8 +-
 tests/test_workflows/test_executor.py |  96 ++++++++++++--
 tests/test_workflows/test_lock.py     | 177 +++++++++++++++++++++++++-
 6 files changed, 344 insertions(+), 14 deletions(-)

diff --git a/README.md b/README.md
index af665ed..847a3ba 100644
--- a/README.md
+++ b/README.md
@@ -657,7 +657,7 @@ Cache keys are content-addressed using each file's upstream `md5`, so a byte cha
 
 - **Per-worker backpressure** — each worker accepts at most `CFDB_WORKER_MAX_CONCURRENT_TASKS` tasks at once (default `1`), serializing the subprocess pipelines on a 1-vCPU worker. A worker at capacity rejects the dispatch and the API's priority load balancer rotates to the next worker.
 - **Priority (leaky-bucket) load balancing** — the API offers each task to discovered workers in a stable order, so load concentrates on the lowest-ordered workers and over-provisioned workers drain to idle and self-reap (via `CFDB_WORKER_MAX_LIFETIME_SECONDS`) instead of every worker carrying a thin perpetual slice.
-- **Durable queue + retry-to-deadline** — when no worker has capacity, the job is **not** failed and does **not** block the request: it stays `pending` and a durable, Mongo-backed scheduler re-attempts dispatch every `CFDB_WORKFLOW_RETRY_INTERVAL_S` (plus jitter) until a worker frees up or the `CFDB_WORKFLOW_DISPATCH_DEADLINE_S` deadline elapses (then it is failed with a `capacity:`-prefixed error). Because the queue lives in Mongo, an API restart resumes it. On the ECS profile, an overflow also requests one bounded worker spawn (the leaky bucket overflowing), inverting the old unconditional per-request spawn.
+- **Durable queue + retry-to-deadline** — when no worker has capacity, the job is **not** failed and does **not** block the request: it stays `pending` and a durable, Mongo-backed scheduler re-attempts dispatch every `CFDB_WORKFLOW_RETRY_INTERVAL_S` (plus jitter) until a worker frees up or the `CFDB_WORKFLOW_DISPATCH_DEADLINE_S` deadline elapses (then it is failed with a `capacity:`-prefixed error). Because the queue lives in Mongo, an API restart resumes it. On every scheduler tick (including the first, on boot) an orphan-recovery sweep re-queues jobs a crash left mid-flight — a `running` job whose API consumer died, or a fresh `pending` claim that never rescheduled — once they pass the stale threshold (`CFDB_WORKFLOW_STALE_THRESHOLD_S`), so recovery is autonomous and does not wait for a client to re-request the file. On the ECS profile, an overflow also requests one bounded worker spawn (the leaky bucket overflowing), inverting the old unconditional per-request spawn.
 - **Admission ceiling** — once `CFDB_WORKFLOW_MAX_ACTIVE` workflows are active (`pending` + `running`), further preprocessing requests are shed with `429 Retry-After` rather than queued, so the backlog itself is bounded. The readiness `/status` probes never dispatch and so never `429`.
 
 `/index` continues to serve upstream sidecars first when present (the 218 BED→beddb and 4 BED→tbi 4DN cases that publish under `extra.extra_files` or `extra.fourdn.extra_files`); the workflow path is dispatched only when no sidecar exists. Set `?raw=true` to bypass the workflow path entirely and return only the upstream sidecar (404 when none exists).
diff --git a/src/cfdb/workflows/executor.py b/src/cfdb/workflows/executor.py
index 2636405..4e8b03c 100644
--- a/src/cfdb/workflows/executor.py
+++ b/src/cfdb/workflows/executor.py
@@ -66,6 +66,7 @@
     WorkflowEvent,
 )
 from cfdb.workflows.lock import (
+    STALE_WORKFLOW_THRESHOLD,
     claim_workflow,
     count_active_workflows,
     heartbeat_workflow,
@@ -73,6 +74,7 @@
     mark_running,
     record_stage_complete,
     release_workflow,
+    requeue_orphaned_dispatch,
     reschedule_dispatch,
     update_progress,
 )
@@ -519,6 +521,7 @@ async def _scheduler_loop(self) -> None:
         """
         while not self._draining:
             try:
+                await self._recover_orphans()
                 await self._drain_due_jobs()
             except asyncio.CancelledError:
                 raise
@@ -526,6 +529,28 @@ async def _scheduler_loop(self) -> None:
                 logger.exception("Scheduler dispatch tick failed; continuing")
             await asyncio.sleep(_RETRY_INTERVAL_SECONDS)
 
+    async def _recover_orphans(self) -> None:
+        """Re-queue jobs orphaned by an API/worker crash for re-dispatch.
+
+        Runs at the top of every scheduler tick — including the first, on
+        boot — so a process that died mid-flight does not strand jobs the
+        normal lease query cannot pick up (``RUNNING`` rows whose consumer
+        died, and fresh ``PENDING`` claims that never rescheduled). Recovery
+        is then autonomous: it no longer depends on a client re-requesting
+        the same file to trigger ``claim_workflow``'s stale-reclaim. Gated
+        on the stale threshold so healthy heartbeating jobs are untouched.
+        """
+        now = _utcnow()
+        requeued = await requeue_orphaned_dispatch(
+            self._db, now=now, stale_before=now - STALE_WORKFLOW_THRESHOLD
+        )
+        if requeued:
+            logger.info(
+                "Re-queued %d orphaned workflow(s) for re-dispatch after a "
+                "restart or worker crash",
+                requeued,
+            )
+
     async def _drain_due_jobs(self) -> None:
         """Lease every currently-due queued job and spawn its retry attempt.
 
diff --git a/src/cfdb/workflows/lock.py b/src/cfdb/workflows/lock.py
index ce1753e..6f5763d 100644
--- a/src/cfdb/workflows/lock.py
+++ b/src/cfdb/workflows/lock.py
@@ -563,3 +563,53 @@ async def lease_due_dispatch(
         return_document=ReturnDocument.AFTER,
     )
     return _record_from_mongo(doc) if doc is not None else None
+
+
+async def requeue_orphaned_dispatch(
+    db, *, now: datetime, stale_before: datetime
+) -> int:
+    """Re-queue jobs orphaned by an API/worker crash so they re-dispatch.
+
+    The durable scheduler only leases ``PENDING`` jobs whose
+    ``next_dispatch_at`` is set and due, so two states survive a restart
+    that it would never pick up on its own:
+
+    - ``RUNNING`` rows whose API consumer died mid-stream (no heartbeat
+      since), and
+    - ``PENDING`` rows with ``next_dispatch_at`` unset — a fresh claim whose
+      inline first attempt never got to reschedule before the crash.
+
+    Both are reset to ``PENDING`` with ``next_dispatch_at = now`` so the next
+    scheduler tick leases them. Gated on ``updated_at < stale_before`` (the
+    same staleness threshold :func:`claim_workflow` uses) so a healthy
+    in-flight job — which keeps ``updated_at`` fresh via heartbeats — is
+    never falsely reclaimed. Returns the number of rows requeued.
+
+    Unlike ``claim_workflow``'s stale-reclaim, which fails the stale row so a
+    *new* claimant can supersede it, this revives the same row: there is no
+    new claimant, so the goal is to recover the in-flight work itself.
+    Partial-commit recovery means the re-dispatched job reuses any stages
+    already committed to cache.
+    """
+    jobs = _jobs(db)
+    result = await jobs.update_many(
+        {
+            "active": True,
+            "updated_at": {"$lt": stale_before},
+            "$or": [
+                {"status": JobStatus.RUNNING.value},
+                {"status": JobStatus.PENDING.value, "next_dispatch_at": None},
+            ],
+        },
+        {
+            "$set": {
+                "status": JobStatus.PENDING.value,
+                # PENDING is active; re-assert defensively so the
+                # discriminator can never drift from the status.
+                "active": True,
+                "next_dispatch_at": now,
+                "updated_at": now,
+            }
+        },
+    )
+    return getattr(result, "modified_count", 0)
diff --git a/tests/conftest.py b/tests/conftest.py
index c767496..c706d7f 100644
--- a/tests/conftest.py
+++ b/tests/conftest.py
@@ -322,10 +322,10 @@ async def update_many(self, query: dict, update: dict, **kwargs) -> _UpdateResul
         for d in self.docs:
             if _match(d, query):
                 matched += 1
-                for k, v in update.get("$set", {}).items():
-                    if d.get(k) != v:
-                        d[k] = v
-                        modified += 1
+                # Row-level modified count, matching real Mongo semantics
+                # (number of documents changed, not number of field writes).
+                if _apply_update(d, update, is_insert=False):
+                    modified += 1
         return _UpdateResult(matched, modified)
 
     async def update_one(self, query: dict, update: dict, **kwargs) -> _UpdateResult:
diff --git a/tests/test_workflows/test_executor.py b/tests/test_workflows/test_executor.py
index 9ece050..3ec7028 100644
--- a/tests/test_workflows/test_executor.py
+++ b/tests/test_workflows/test_executor.py
@@ -151,13 +151,17 @@ async def _seed_pending_job(
     submitted_at: datetime | None = None,
     next_dispatch_at: datetime | None = None,
     meta: dict[str, Any] | None = None,
+    status: JobStatus = JobStatus.PENDING,
+    updated_at: datetime | None = None,
 ) -> JobRecord:
-    """Insert a PENDING job row directly and return its JobRecord.
-
-    Lets a test drive ``_attempt_dispatch`` / ``_drain_due_jobs`` against a
-    queued job without going through ``ensure_workflow`` (which would spawn
-    its own inline attempt). ``submitted_at`` controls the deadline clock;
-    ``next_dispatch_at`` controls scheduler due-ness.
+    """Insert a crafted job row directly and return its JobRecord.
+
+    Lets a test drive ``_attempt_dispatch`` / ``_drain_due_jobs`` /
+    ``_recover_orphans`` against a job without going through
+    ``ensure_workflow`` (which would spawn its own inline attempt).
+    ``submitted_at`` controls the deadline clock; ``next_dispatch_at``
+    controls scheduler due-ness; ``status`` / ``updated_at`` craft orphan
+    states (e.g. a stale RUNNING row).
     """
     meta = meta or _file_meta()
     dcc, local_id, md5 = extract_identity(meta)
@@ -167,13 +171,13 @@ async def _seed_pending_job(
         workflow_key=key_utils.workflow_key(
             dcc=dcc, local_id=local_id, md5=md5, pipeline_version=PIPELINE_VERSION
         ),
-        status=JobStatus.PENDING,
+        status=status,
         dcc=dcc,
         local_id=local_id,
         md5=md5,
         pipeline_version=PIPELINE_VERSION,
         submitted_at=submitted_at or now,
-        updated_at=now,
+        updated_at=updated_at or now,
         file_meta_snapshot=meta,
         next_dispatch_at=next_dispatch_at,
     )
@@ -1663,6 +1667,82 @@ async def test_drain_due_jobs_should_ignore_not_yet_due_jobs(
         assert final.status == JobStatus.PENDING
         assert processor.run_calls == 0
 
+    @pytest.mark.asyncio
+    async def test_recover_orphans_should_requeue_and_dispatch_a_stale_running_job(
+        self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch
+    ):
+        """Test that a RUNNING job orphaned by a crash is recovered autonomously.
+
+        Given:
+            A stale RUNNING job (no heartbeat past the stale threshold,
+            `next_dispatch_at` unset) — what an API crash mid-stream leaves
+            behind — and a registered processor.
+        When:
+            A scheduler tick runs (`_recover_orphans` then `_drain_due_jobs`).
+        Then:
+            The orphan is re-queued to PENDING, leased, dispatched, and
+            reaches COMPLETED — without any new request for the file, closing
+            the restart-recovery gap.
+        """
+        # Arrange
+        _install_jobs_index(mock_db)
+        registry = ProcessorRegistry()
+        registry.register(_StubProcessor())
+        executor = WoolExecutor(
+            mock_db, tmp_cache, registry, workdir_root=tmp_workdir
+        )
+        stale = datetime.now(timezone.utc) - timedelta(seconds=2000)
+        record = await _seed_pending_job(
+            mock_db,
+            status=JobStatus.RUNNING,
+            updated_at=stale,
+            next_dispatch_at=None,
+        )
+
+        # Act — one scheduler tick: recover orphans, then dispatch due jobs.
+        await executor._recover_orphans()
+        await executor._drain_due_jobs()
+        await _wait_for_terminal(mock_db, record.job_id)
+
+        # Assert
+        final = await get_job(mock_db, record.job_id)
+        assert final is not None
+        assert final.status == JobStatus.COMPLETED
+
+    @pytest.mark.asyncio
+    async def test_recover_orphans_should_leave_a_fresh_running_job_alone(
+        self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch
+    ):
+        """Test that a healthy in-flight RUNNING job is not reclaimed.
+
+        Given:
+            A RUNNING job with a recent `updated_at` (a live, heartbeating
+            job).
+        When:
+            `_recover_orphans` runs.
+        Then:
+            The job is left RUNNING — the staleness gate keeps the sweep from
+            yanking a live job away from its worker.
+        """
+        # Arrange
+        _install_jobs_index(mock_db)
+        registry = ProcessorRegistry()
+        registry.register(_StubProcessor())
+        executor = WoolExecutor(
+            mock_db, tmp_cache, registry, workdir_root=tmp_workdir
+        )
+        record = await _seed_pending_job(
+            mock_db, status=JobStatus.RUNNING, next_dispatch_at=None
+        )
+
+        # Act
+        await executor._recover_orphans()
+
+        # Assert
+        job = await get_job(mock_db, record.job_id)
+        assert job is not None
+        assert job.status == JobStatus.RUNNING
+
     @pytest.mark.asyncio
     async def test_scheduler_loop_should_survive_a_tick_exception(
         self, mock_db, tmp_cache, tmp_workdir, mocker
diff --git a/tests/test_workflows/test_lock.py b/tests/test_workflows/test_lock.py
index 8b84e4b..1f280d3 100644
--- a/tests/test_workflows/test_lock.py
+++ b/tests/test_workflows/test_lock.py
@@ -20,6 +20,7 @@ def _insert_job(
     status: JobStatus,
     next_dispatch_at: datetime | None = None,
     dispatch_attempts: int = 0,
+    updated_at: datetime | None = None,
 ) -> JobRecord:
     """Insert a crafted job doc into the fake collection and return it."""
     now = datetime.now(timezone.utc)
@@ -32,7 +33,7 @@ def _insert_job(
         md5=FIXTURE_MD5,
         pipeline_version=PIPELINE_VERSION,
         submitted_at=now,
-        updated_at=now,
+        updated_at=updated_at or now,
         next_dispatch_at=next_dispatch_at,
         dispatch_attempts=dispatch_attempts,
     )
@@ -1190,3 +1191,177 @@ async def test_lease_due_dispatch_should_reject_next_at_not_in_future(
         # Act & assert
         with pytest.raises(ValueError, match="strictly after"):
             await lock.lease_due_dispatch(mock_db, now=now, next_at=now)
+
+
+class TestRequeueOrphanedDispatch:
+    @pytest.mark.asyncio
+    async def test_should_requeue_a_stale_running_orphan(self, mock_db):
+        """Test that a RUNNING job with no recent heartbeat is re-queued.
+
+        Given:
+            A RUNNING job whose `updated_at` is older than the stale
+            threshold — an in-flight job orphaned when its API consumer
+            died mid-stream.
+        When:
+            requeue_orphaned_dispatch runs with that staleness cutoff.
+        Then:
+            The job is reset to PENDING with `next_dispatch_at` set so the
+            scheduler re-dispatches it, and the call reports one requeued
+            row.
+        """
+        # Arrange
+        now = datetime.now(timezone.utc)
+        stale = now - timedelta(seconds=2000)
+        _insert_job(
+            mock_db, workflow_key="a", status=JobStatus.RUNNING, updated_at=stale
+        )
+
+        # Act
+        count = await lock.requeue_orphaned_dispatch(
+            mock_db, now=now, stale_before=now - timedelta(seconds=900)
+        )
+
+        # Assert
+        assert count == 1
+        job = await lock.get_job(mock_db, "job-a")
+        assert job is not None
+        assert job.status == JobStatus.PENDING
+        assert job.next_dispatch_at == now
+
+    @pytest.mark.asyncio
+    async def test_should_requeue_a_stale_pending_job_with_no_next_dispatch(
+        self, mock_db
+    ):
+        """Test that a stale PENDING job with unset next_dispatch_at is re-queued.
+
+        Given:
+            A PENDING job with `next_dispatch_at` unset (a fresh claim whose
+            inline attempt never rescheduled) and a stale `updated_at`.
+        When:
+            requeue_orphaned_dispatch runs.
+        Then:
+            Its `next_dispatch_at` is set to now so the scheduler — which
+            never leases a null-`next_dispatch_at` row — can pick it up.
+        """
+        # Arrange
+        now = datetime.now(timezone.utc)
+        _insert_job(
+            mock_db,
+            workflow_key="a",
+            status=JobStatus.PENDING,
+            next_dispatch_at=None,
+            updated_at=now - timedelta(seconds=2000),
+        )
+
+        # Act
+        count = await lock.requeue_orphaned_dispatch(
+            mock_db, now=now, stale_before=now - timedelta(seconds=900)
+        )
+
+        # Assert
+        assert count == 1
+        job = await lock.get_job(mock_db, "job-a")
+        assert job is not None
+        assert job.status == JobStatus.PENDING
+        assert job.next_dispatch_at == now
+
+    @pytest.mark.asyncio
+    async def test_should_not_touch_a_fresh_running_job(self, mock_db):
+        """Test that a healthy, recently-heartbeating RUNNING job is left alone.
+
+        Given:
+            A RUNNING job whose `updated_at` is recent (within the stale
+            threshold), i.e. a healthy in-flight job.
+        When:
+            requeue_orphaned_dispatch runs.
+        Then:
+            It is not touched — staleness gating prevents the sweep from
+            yanking a live job away from its worker.
+        """
+        # Arrange
+        now = datetime.now(timezone.utc)
+        _insert_job(
+            mock_db,
+            workflow_key="a",
+            status=JobStatus.RUNNING,
+            updated_at=now - timedelta(seconds=10),
+        )
+
+        # Act
+        count = await lock.requeue_orphaned_dispatch(
+            mock_db, now=now, stale_before=now - timedelta(seconds=900)
+        )
+
+        # Assert
+        assert count == 0
+        job = await lock.get_job(mock_db, "job-a")
+        assert job is not None
+        assert job.status == JobStatus.RUNNING
+
+    @pytest.mark.asyncio
+    async def test_should_not_touch_an_already_queued_pending_job(self, mock_db):
+        """Test that a stale PENDING job already carrying next_dispatch_at is left.
+
+        Given:
+            A PENDING job that already has `next_dispatch_at` set (it is
+            queued and leasable by the scheduler), even with a stale
+            `updated_at`.
+        When:
+            requeue_orphaned_dispatch runs.
+        Then:
+            It is not touched — the scheduler already handles it; the sweep
+            targets only the non-leasable orphan states.
+        """
+        # Arrange
+        now = datetime.now(timezone.utc)
+        original = now - timedelta(seconds=50)
+        _insert_job(
+            mock_db,
+            workflow_key="a",
+            status=JobStatus.PENDING,
+            next_dispatch_at=original,
+            updated_at=now - timedelta(seconds=2000),
+        )
+
+        # Act
+        count = await lock.requeue_orphaned_dispatch(
+            mock_db, now=now, stale_before=now - timedelta(seconds=900)
+        )
+
+        # Assert
+        assert count == 0
+        job = await lock.get_job(mock_db, "job-a")
+        assert job is not None
+        assert job.next_dispatch_at == original
+
+    @pytest.mark.asyncio
+    async def test_should_not_touch_terminal_jobs(self, mock_db):
+        """Test that terminal (completed/failed) jobs are never re-queued.
+
+        Given:
+            Stale COMPLETED and FAILED jobs.
+        When:
+            requeue_orphaned_dispatch runs.
+        Then:
+            Neither is touched — only active rows (the mutex holders) are
+            candidates for recovery.
+        """
+        # Arrange
+        now = datetime.now(timezone.utc)
+        stale = now - timedelta(seconds=2000)
+        _insert_job(
+            mock_db, workflow_key="done", status=JobStatus.COMPLETED, updated_at=stale
+        )
+        _insert_job(
+            mock_db, workflow_key="fail", status=JobStatus.FAILED, updated_at=stale
+        )
+
+        # Act
+        count = await lock.requeue_orphaned_dispatch(
+            mock_db, now=now, stale_before=now - timedelta(seconds=900)
+        )
+
+        # Assert
+        assert count == 0
+        assert (await lock.get_job(mock_db, "job-done")).status == JobStatus.COMPLETED
+        assert (await lock.get_job(mock_db, "job-fail")).status == JobStatus.FAILED

From 26f0ab44977fc15c931e48a575d3e06fbbb05de7 Mon Sep 17 00:00:00 2001
From: Conrad <conradbzura@gmail.com>
Date: Mon, 22 Jun 2026 23:17:30 -0400
Subject: [PATCH 14/19] fix: Resolve review-2 findings in the workflow executor
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Address the principal-engineer review of the bounded-concurrency change
(PR #63, round 2):

- B1: make a stale RUNNING row reliably mean a dead consumer. The stream
  consumer aborts (HeartbeatLost) once its heartbeat writes have been
  failing longer than _HEARTBEAT_LOSS_ABORT_S — sized strictly between one
  heartbeat interval and the orphan sweep's stale threshold — so a
  Mongo-blind consumer stops computing (aclose cancels the worker) before
  recovery would re-dispatch it, eliminating the duplicate run rather than
  merely making it non-corrupting. Each dispatch attempt also uses a
  per-attempt workdir (job_id + nonce) so the pathological residual can't
  corrupt a sibling attempt's scratch.
- A1: clean up the per-attempt workdir on the mark_running hand-off branch
  (without releasing the successor's mutex), with a regression test.
- A3: cap leases per scheduler tick (_MAX_DISPATCHES_PER_TICK) so a backlog
  can't spawn ~MAX_ACTIVE concurrent attempts in a single tick.
- A6: pass heartbeat_interval explicitly to the routine so the value is
  resolved API-side at dispatch rather than snapshotted by cloudpickle.
- A9/A12: correct the stale WoolExecutor docstring and comments.
- A2/A4: document that the ceiling sheds attach re-GETs and that the
  dispatch deadline runs from submission and is not reset on recovery.
---
 src/cfdb/workflows/executor.py        | 167 ++++++++++++++---
 tests/test_workflows/test_executor.py | 247 ++++++++++++++++++++++++--
 2 files changed, 371 insertions(+), 43 deletions(-)

diff --git a/src/cfdb/workflows/executor.py b/src/cfdb/workflows/executor.py
index 4e8b03c..7615ed7 100644
--- a/src/cfdb/workflows/executor.py
+++ b/src/cfdb/workflows/executor.py
@@ -40,6 +40,7 @@
 import logging
 import random
 import shutil
+import uuid
 from abc import ABC, abstractmethod
 from collections.abc import AsyncIterator
 from datetime import datetime, timedelta, timezone
@@ -123,7 +124,10 @@
 
 #: Upper bound on the random jitter added to each reschedule, to spread a
 #: thundering herd of queued jobs across retry ticks instead of stampeding
-#: the pool in lockstep. Tests set this to 0 for deterministic timing.
+#: the pool in lockstep. Clamped to the retry interval, so tests that
+#: monkeypatch ``_RETRY_INTERVAL_SECONDS`` low get correspondingly small,
+#: bounded jitter (determinism in tests comes from that clamp, not from
+#: zeroing this constant).
 _RETRY_JITTER_MAX_SECONDS = min(30.0, _RETRY_INTERVAL_SECONDS)
 
 #: Max wall-clock a job may wait for capacity (measured from
@@ -138,6 +142,28 @@
 #: long-running stage doesn't get reclaimed as stale.
 _HEARTBEAT_INTERVAL_S = float(WORKFLOW_HEARTBEAT_INTERVAL_S)
 
+#: How long a consumer may fail to refresh its heartbeat before it aborts
+#: its own job. This makes a stale ``RUNNING`` row reliably mean "the
+#: consumer is gone": a consumer that cannot reach Mongo stops computing
+#: (and lets ``stream.aclose`` cancel the remote worker) *before* the
+#: orphan sweep's ``STALE_WORKFLOW_THRESHOLD`` would revive the row, so
+#: recovery never double-dispatches a still-live job. Sized strictly
+#: between one heartbeat interval (a single transient write failure must
+#: not abort) and the sweep threshold (the consumer must give up with
+#: margin to spare); the ``STALE >= 2*HEARTBEAT`` startup invariant keeps
+#: that window non-empty.
+_HEARTBEAT_LOSS_ABORT_S = max(
+    _HEARTBEAT_INTERVAL_S,
+    STALE_WORKFLOW_THRESHOLD.total_seconds() - _HEARTBEAT_INTERVAL_S,
+)
+
+#: Max jobs leased per scheduler tick. Bounds the per-tick dispatch
+#: fan-out so a large queued backlog (up to ``WORKFLOW_MAX_ACTIVE``) can't
+#: spawn thousands of concurrent attempts in one tick; the remainder is
+#: leased on subsequent ticks. Generous enough to keep a reasonable fleet
+#: saturated, small enough to avoid a thundering herd against the pool.
+_MAX_DISPATCHES_PER_TICK = 64
+
 #: Grace given to ``_finalize`` tasks during ``drain``'s second phase
 #: after the primary ``_pending_tasks`` gather is done. A workflow that
 #: failed mid-write — Mongo connection blip, S3 cache cleanup blocked
@@ -198,6 +224,20 @@ def __init__(
         )
 
 
+class HeartbeatLost(RuntimeError):
+    """Raised by the stream consumer when it can no longer refresh the job.
+
+    Signals that ``heartbeat_workflow`` has been failing for longer than
+    :data:`_HEARTBEAT_LOSS_ABORT_S` while a worker is still streaming — i.e.
+    this consumer has lost its connection to Mongo and can no longer keep
+    the row fresh. Consuming the rest of the stream would let the worker
+    keep computing invisibly until the orphan sweep (on a healthy replica)
+    reclaims the now-stale row and re-dispatches it, double-running the job.
+    Raising this aborts the attempt and ``aclose``s the stream, cancelling
+    the remote worker so recovery has a single live attempt to revive.
+    """
+
+
 def _utcnow() -> datetime:
     return datetime.now(timezone.utc)
 
@@ -251,6 +291,7 @@ async def _run_processor_routine(
     file_meta: dict[str, Any],
     workdir_str: str,
     cache: CacheBackend,
+    heartbeat_interval: float,
 ) -> AsyncIterator[WorkflowEvent]:
     """Wool routine body: stream processor events back to the dispatcher.
 
@@ -258,12 +299,16 @@ async def _run_processor_routine(
     :class:`~cfdb.workflows.events.StageComplete` and
     :class:`~cfdb.workflows.events.Complete` events. We wrap its stream
     with a heartbeat-aware iterator: while waiting on the next event from
-    the processor, every ``_HEARTBEAT_INTERVAL_S`` we inject a
+    the processor, every ``heartbeat_interval`` seconds we inject a
     :class:`~cfdb.workflows.events.Heartbeat` so the API process can
     refresh ``JobRecord.updated_at`` and signal liveness without requiring
-    the worker to touch Mongo. An exception from the processor is
-    converted to an :class:`~cfdb.workflows.events.Error` event so the API
-    consumer can record a clean terminal state.
+    the worker to touch Mongo. ``heartbeat_interval`` is passed explicitly
+    (resolved API-side at dispatch) rather than read from the worker's
+    module globals, because cloudpickle ships this routine by value and the
+    worker's own ``_HEARTBEAT_INTERVAL_S`` would otherwise be ignored. An
+    exception from the processor is converted to an
+    :class:`~cfdb.workflows.events.Error` event so the API consumer can
+    record a clean terminal state.
 
     The routine yields an immediate :class:`~cfdb.workflows.events.Heartbeat`
     as its very first event — the instant a worker *accepts* the dispatch,
@@ -304,14 +349,14 @@ class instance, ``file_meta`` is a plain dict (B1 strips Mongo's
         while True:
             next_task = asyncio.ensure_future(inner.__anext__())
             # Loop on a per-event wait_for so a heartbeat injects every
-            # _HEARTBEAT_INTERVAL_S of stage silence; shield the task so
+            # heartbeat_interval of stage silence; shield the task so
             # a timeout doesn't propagate cancellation into the inner
             # generator mid-stage.
             while True:
                 try:
                     await asyncio.wait_for(
                         asyncio.shield(next_task),
-                        timeout=_HEARTBEAT_INTERVAL_S,
+                        timeout=heartbeat_interval,
                     )
                     break
                 except asyncio.TimeoutError:
@@ -356,18 +401,19 @@ class instance, ``file_meta`` is a plain dict (B1 strips Mongo's
 class WoolExecutor(JobExecutor):
     """Executor backed by a ``wool.WorkerPool`` and a Mongo jobs collection.
 
-    When configured with an :class:`EcsProvisioner`, the executor also
-    issues a ``RunTask`` on each fresh claim before opening the routine
-    stream, so a Fargate worker boots before the dispatch retry loop
-    begins polling for it. Without a provisioner (PoC dev profile) the
-    executor relies on workers already published into the discovery
-    namespace.
+    Dispatch tries existing workers first via the priority load balancer.
+    Only when an attempt overflows (no worker accepts) does the executor —
+    when configured with an :class:`EcsProvisioner` — issue a best-effort
+    ``RunTask`` to scale the fleet up, then leave the job queued for a later
+    retry; it never spawns a worker pre-dispatch. Without a provisioner
+    (PoC dev profile) the executor relies on workers already published into
+    the discovery namespace.
 
-    A :class:`RetryableProvisionerError` from the provisioner surfaces
-    as a terminal ``FAILED`` job with a ``capacity:``-prefixed error
-    string; any other provisioner failure surfaces with a
-    ``provisioner:`` prefix. Clients parse the prefix to decide
-    whether to resubmit.
+    A job that cannot find capacity before its dispatch deadline is failed
+    with a ``capacity:``-prefixed error string, which clients parse to
+    decide whether to resubmit. Provisioner failures on the overflow path
+    are best-effort and swallowed (the job simply retries on the next tick),
+    so they never surface as a terminal job state.
 
     Args:
         db: Motor database handle holding the ``jobs`` collection.
@@ -376,7 +422,7 @@ class WoolExecutor(JobExecutor):
             processors persist artifacts to the deployment's real store
             (local FS or S3), not a worker-local filesystem.
         registry: Processor registry mapping artifact kinds to runners.
-        workdir_root: Parent directory under which per-job workdirs land.
+        workdir_root: Parent directory under which per-attempt workdirs land.
         pipeline_version: Embedded in workflow keys; bump to invalidate
             every in-flight workflow.
         provisioner: Optional :class:`EcsProvisioner`. When ``None``
@@ -445,7 +491,12 @@ async def ensure_workflow(
         # AFTER applicability (don't 429 a file that needs no workflow) and
         # BEFORE claiming the mutex. Soft cap — a count-then-claim race may
         # transiently overshoot, acceptable for a flood guard; the
-        # per-source mutex still dedups same-file requests.
+        # per-source mutex still dedups same-file requests below the cap.
+        # NOTE: because this precedes the claim, at the ceiling a re-GET for
+        # a file whose workflow is already active is also shed with 429
+        # (rather than attaching to the in-flight job); the client retries.
+        # This is the deliberate trade for shedding before an unbounded
+        # count-then-insert race window.
         active = await count_active_workflows(self._db)
         if active >= _MAX_ACTIVE_WORKFLOWS:
             raise AdmissionRejected(active=active, ceiling=_MAX_ACTIVE_WORKFLOWS)
@@ -552,16 +603,24 @@ async def _recover_orphans(self) -> None:
             )
 
     async def _drain_due_jobs(self) -> None:
-        """Lease every currently-due queued job and spawn its retry attempt.
+        """Lease currently-due queued jobs and spawn their retry attempts.
 
         ``lease_due_dispatch`` atomically claims one due job and pushes its
         ``next_dispatch_at`` forward, so concurrent ticks (or replicas)
         can't double-lease and a crashed attempt is still retried. We loop
-        until nothing is due, dispatching each leased job as a tracked
-        background task so a winning attempt's long stream-consume doesn't
-        block the scheduler from filling the rest of the pool.
+        until nothing is due (or the per-tick cap is hit), dispatching each
+        leased job as a tracked background task so a winning attempt's long
+        stream-consume doesn't block the scheduler from filling the pool.
+
+        The per-tick cap (``_MAX_DISPATCHES_PER_TICK``) bounds the dispatch
+        fan-out: after a large backlog (up to ``WORKFLOW_MAX_ACTIVE``) a
+        single tick would otherwise spawn thousands of concurrent attempts,
+        each opening a wool stream that immediately overflows. The remainder
+        is leased on subsequent ticks; the deadline/retry machinery tolerates
+        the delay.
         """
-        while not self._draining:
+        dispatched = 0
+        while not self._draining and dispatched < _MAX_DISPATCHES_PER_TICK:
             now = _utcnow()
             leased = await lease_due_dispatch(
                 self._db, now=now, next_at=_next_dispatch_time(now)
@@ -594,6 +653,7 @@ async def _drain_due_jobs(self) -> None:
                 )
                 continue
             self._spawn_attempt(leased, processor, file_meta)
+            dispatched += 1
 
     async def drain(self, *, timeout: float) -> int:
         """Wait for in-flight workflow tasks to complete.
@@ -679,12 +739,22 @@ async def _attempt_dispatch(
         unconditional pre-dispatch spawn: existing capacity is tried first
         and scale-up happens only on overflow.
         """
-        workdir = self._workdir_root / record.job_id
+        # Per-attempt workdir (unique nonce), not per-job. If the orphan
+        # sweep ever re-dispatches a job whose previous attempt is somehow
+        # still alive (e.g. event-loop starvation that defeated the
+        # heartbeat-loss abort), the two attempts use disjoint scratch dirs
+        # and cannot corrupt each other's downloads / sort temp / cleanup.
+        # Partial-commit recovery is unaffected — it keys off the cache, not
+        # the workdir.
+        workdir = self._workdir_root / f"{record.job_id}-{uuid.uuid4().hex}"
 
         # Deadline: a job that has waited for capacity longer than the
-        # dispatch deadline (measured from submission) is failed rather than
-        # retried forever. ``capacity:`` is the stable on-wire prefix
-        # clients parse to decide whether to resubmit.
+        # dispatch deadline (measured from ``submitted_at``) is failed rather
+        # than retried forever. ``capacity:`` is the stable on-wire prefix
+        # clients parse to decide whether to resubmit. Note the clock runs
+        # from original submission and is NOT reset on orphan recovery, so a
+        # job recovered after its deadline is failed here rather than resumed
+        # — recovery is best-effort, bounded by the same deadline.
         waited = (_utcnow() - record.submitted_at).total_seconds()
         if waited > _DISPATCH_DEADLINE_SECONDS:
             logger.warning(
@@ -747,6 +817,13 @@ async def _attempt_dispatch(
             )
             with contextlib.suppress(BaseException):
                 await stream.aclose()
+            # The successor owns the row, so do NOT release the mutex here —
+            # but this attempt's workdir was already created by the routine,
+            # so clean it up to avoid leaking scratch on the hand-off (A1).
+            with contextlib.suppress(Exception):
+                await asyncio.to_thread(
+                    shutil.rmtree, str(workdir), ignore_errors=True
+                )
             return
         except Exception as exc:
             logger.exception("Failed to mark job %s as running", record.job_id)
@@ -818,6 +895,12 @@ async def _consume_and_finalize(
         """
         final_status: JobStatus = JobStatus.FAILED
         final_error: Optional[str] = None
+        # Time of the last successful heartbeat write. Seeded now because
+        # the caller just marked the job RUNNING (the row is fresh). If
+        # heartbeat writes then fail for longer than _HEARTBEAT_LOSS_ABORT_S
+        # we raise HeartbeatLost and abort, so a Mongo-blind consumer stops
+        # computing before the orphan sweep revives its row (B1).
+        last_heartbeat_ok = _utcnow()
 
         try:
 
@@ -827,7 +910,20 @@ async def _consume_and_finalize(
                         if isinstance(event, Heartbeat):
                             try:
                                 await heartbeat_workflow(self._db, record.job_id)
+                                last_heartbeat_ok = _utcnow()
                             except Exception:
+                                stalled = (
+                                    _utcnow() - last_heartbeat_ok
+                                ).total_seconds()
+                                if stalled > _HEARTBEAT_LOSS_ABORT_S:
+                                    # Lost Mongo long enough that the orphan
+                                    # sweep will treat this row as dead; abort
+                                    # so recovery doesn't double-run the job.
+                                    raise HeartbeatLost(
+                                        f"no successful heartbeat for "
+                                        f"{stalled:.0f}s "
+                                        f"(> {_HEARTBEAT_LOSS_ABORT_S:.0f}s)"
+                                    )
                                 logger.exception(
                                     "Heartbeat failed for %s; continuing",
                                     record.job_id,
@@ -900,6 +996,18 @@ async def _consume_and_finalize(
                 if final_status == JobStatus.FAILED:
                     final_error = "Workflow cancelled (worker shutdown)"
                 raise
+            except HeartbeatLost as exc:
+                # Expected abort, not a crash: this consumer lost Mongo and
+                # is giving up the job so the orphan sweep can recover it
+                # without a concurrent live attempt. The terminal write
+                # below will likely no-op (Mongo is unreachable for us); the
+                # inner finally still aclose()s the stream, cancelling the
+                # remote worker so it stops computing.
+                logger.warning(
+                    "Aborting workflow %s: %s", record.job_id, exc
+                )
+                final_status = JobStatus.FAILED
+                final_error = f"heartbeat lost: {exc}"
             except Exception as exc:
                 logger.exception(
                     "Workflow %s failed during stream consumption",
@@ -986,6 +1094,7 @@ async def _open_stream_once(
             file_meta,
             str(workdir),
             self._cache,
+            _HEARTBEAT_INTERVAL_S,
         )
         # ``returning`` flips True only after we've handed the stream off to
         # ``_prepend_first_event``. Any other exit (NoWorkersAvailable, a
diff --git a/tests/test_workflows/test_executor.py b/tests/test_workflows/test_executor.py
index 3ec7028..cd45e38 100644
--- a/tests/test_workflows/test_executor.py
+++ b/tests/test_workflows/test_executor.py
@@ -115,9 +115,7 @@ def _install_jobs_index(mock_db) -> None:
     mock_db.jobs.create_index(
         {"workflow_key": 1},
         unique=True,
-        partialFilterExpression={
-            "status": {"$in": [s.value for s in ACTIVE_STATUSES]}
-        },
+        partialFilterExpression={"active": True},
     )
 
 
@@ -454,7 +452,7 @@ async def test_ensure_workflow_should_cleanup_workdir_after_success(
         tmp_workdir,
         no_wool_dispatch,
     ):
-        """Test that the per-job workdir is removed after a successful run.
+        """Test that the per-attempt workdir is removed after a successful run.
 
         Given:
             A registry with a stub processor that writes a file into its
@@ -462,7 +460,8 @@ async def test_ensure_workflow_should_cleanup_workdir_after_success(
         When:
             ensure_workflow is awaited and the background task completes.
         Then:
-            The per-job workdir should no longer exist on disk.
+            No scratch directory should remain under the workdir root (the
+            per-attempt workdir, named with a nonce, is removed).
         """
 
         class _WorkdirTouchingProcessor(_StubProcessor):
@@ -486,8 +485,9 @@ async def run(self, file_meta, workdir, cache_root):
         record, _ = await executor.ensure_workflow(_file_meta())
         await _wait_for_terminal(mock_db, record.job_id)
 
-        # Assert
-        assert not (tmp_workdir / record.job_id).exists()
+        # Assert — the per-attempt workdir (job_id + nonce) is removed, so
+        # no scratch directory leaks under the workdir root.
+        assert list(tmp_workdir.iterdir()) == []
 
     @pytest.mark.asyncio
     async def test_ensure_workflow_should_release_even_when_record_stage_fails(
@@ -556,7 +556,7 @@ async def test_ensure_workflow_should_record_timeout_when_cap_exceeded(
 
         # Arrange — yield one event first so the consumer enters the
         # asyncio.timeout block (the timeout wraps the event loop, not
-        # the initial _open_stream_with_retry call); then hang so the
+        # the initial _open_stream_once call); then hang so the
         # cap fires on the second iteration.
         class _HangingProcessor(_StubProcessor):
             async def run(self, file_meta, workdir, cache_root):
@@ -1244,7 +1244,7 @@ async def test_ensure_workflow_should_persist_file_meta_snapshot_on_insert(
 
 class TestWoolExecutorWithProvisioner:
     @pytest.mark.asyncio
-    async def test_attempt_should_not_request_provisioner_when_worker_available(
+    async def test__attempt_dispatch_should_not_request_provisioner_when_worker_available(
         self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch, mocker
     ):
         """Test that the provisioner is left idle when a worker accepts.
@@ -1285,7 +1285,7 @@ async def test_attempt_should_not_request_provisioner_when_worker_available(
         provisioner.request.assert_not_awaited()
 
     @pytest.mark.asyncio
-    async def test_attempt_should_request_provisioner_on_overflow(
+    async def test__attempt_dispatch_should_request_provisioner_on_overflow(
         self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch, mocker
     ):
         """Test that overflow requests one bounded worker spawn and reschedules.
@@ -1334,7 +1334,7 @@ async def test_attempt_should_request_provisioner_on_overflow(
         assert final.dispatch_attempts == 1
 
     @pytest.mark.asyncio
-    async def test_overflow_should_swallow_retryable_provisioner_error(
+    async def test__handle_overflow_should_swallow_retryable_provisioner_error(
         self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch, mocker
     ):
         """Test that a retryable provisioner error is swallowed, job reschedules.
@@ -1380,7 +1380,7 @@ async def test_overflow_should_swallow_retryable_provisioner_error(
         assert final.next_dispatch_at is not None
 
     @pytest.mark.asyncio
-    async def test_overflow_should_swallow_generic_provisioner_error(
+    async def test__handle_overflow_should_swallow_generic_provisioner_error(
         self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch, mocker
     ):
         """Test that a generic provisioner error doesn't fail the queued job.
@@ -1428,9 +1428,228 @@ async def test_overflow_should_swallow_generic_provisioner_error(
         assert final.next_dispatch_at is not None
 
 
+class TestConsumeAndFinalizeHeartbeatLoss:
+    def test_heartbeat_loss_abort_threshold_sits_between_interval_and_stale(self):
+        """Test that the self-cancel threshold is between heartbeat and stale.
+
+        Given:
+            The module's derived heartbeat-loss abort threshold.
+        When:
+            It is compared against the heartbeat interval and the orphan
+            sweep's stale threshold.
+        Then:
+            It should be at least one heartbeat interval (a single transient
+            write failure must not abort) and strictly below the stale
+            threshold (a Mongo-blind consumer must give up before the sweep
+            would revive its row).
+        """
+        # Assert
+        assert (
+            executor_module._HEARTBEAT_LOSS_ABORT_S
+            >= executor_module._HEARTBEAT_INTERVAL_S
+        )
+        assert (
+            executor_module._HEARTBEAT_LOSS_ABORT_S
+            < executor_module.STALE_WORKFLOW_THRESHOLD.total_seconds()
+        )
+
+    @pytest.mark.asyncio
+    async def test__consume_and_finalize_should_abort_when_heartbeat_lost(
+        self, mock_db, tmp_cache, tmp_workdir, mocker
+    ):
+        """Test that sustained heartbeat-write failure aborts the workflow.
+
+        Given:
+            A RUNNING job whose ``heartbeat_workflow`` write always fails and
+            a zeroed abort threshold (so the first failure is already past
+            it).
+        When:
+            ``_consume_and_finalize`` consumes a stream that emits a
+            heartbeat.
+        Then:
+            It should abort with a ``heartbeat lost`` FAILED status and close
+            the stream (cancelling the remote worker), rather than consuming
+            on invisibly while the orphan sweep reclaims the row.
+        """
+        # Arrange
+        _install_jobs_index(mock_db)
+        registry = ProcessorRegistry()
+        registry.register(_StubProcessor())
+        executor = WoolExecutor(
+            mock_db, tmp_cache, registry, workdir_root=tmp_workdir
+        )
+        record = await _seed_pending_job(mock_db, status=JobStatus.RUNNING)
+        mocker.patch.object(executor_module, "_HEARTBEAT_LOSS_ABORT_S", -1.0)
+
+        async def boom(*_a, **_kw):
+            raise RuntimeError("mongo down")
+
+        mocker.patch.object(executor_module, "heartbeat_workflow", boom)
+        closed = {"v": False}
+
+        async def stream():
+            try:
+                yield Heartbeat()
+                yield Complete(artifacts={})  # pragma: no cover (aborted first)
+            finally:
+                closed["v"] = True
+
+        # Act
+        await executor._consume_and_finalize(record, stream(), tmp_workdir / "wd")
+
+        # Assert
+        final = await get_job(mock_db, record.job_id)
+        assert final is not None
+        assert final.status == JobStatus.FAILED
+        assert "heartbeat lost" in (final.error or "")
+        assert closed["v"] is True
+
+    @pytest.mark.asyncio
+    async def test__consume_and_finalize_should_tolerate_a_single_heartbeat_failure(
+        self, mock_db, tmp_cache, tmp_workdir, mocker
+    ):
+        """Test that one transient heartbeat-write failure does not abort.
+
+        Given:
+            A RUNNING job whose first ``heartbeat_workflow`` write fails but
+            the abort threshold is far in the future.
+        When:
+            ``_consume_and_finalize`` consumes a heartbeat then a complete.
+        Then:
+            The single failure is logged and swallowed and the job still
+            reaches COMPLETED.
+        """
+        # Arrange
+        _install_jobs_index(mock_db)
+        registry = ProcessorRegistry()
+        registry.register(_StubProcessor())
+        executor = WoolExecutor(
+            mock_db, tmp_cache, registry, workdir_root=tmp_workdir
+        )
+        record = await _seed_pending_job(mock_db, status=JobStatus.RUNNING)
+        mocker.patch.object(executor_module, "_HEARTBEAT_LOSS_ABORT_S", 9999.0)
+        calls = {"n": 0}
+
+        async def flaky(*_a, **_kw):
+            calls["n"] += 1
+            if calls["n"] == 1:
+                raise RuntimeError("transient")
+
+        mocker.patch.object(executor_module, "heartbeat_workflow", flaky)
+
+        async def stream():
+            yield Heartbeat()
+            yield Complete(artifacts={})
+
+        # Act
+        await executor._consume_and_finalize(record, stream(), tmp_workdir / "wd")
+
+        # Assert
+        final = await get_job(mock_db, record.job_id)
+        assert final is not None
+        assert final.status == JobStatus.COMPLETED
+
+
+class TestAttemptDispatchWorkdir:
+    @pytest.mark.asyncio
+    async def test__attempt_dispatch_should_use_a_distinct_workdir_per_attempt(
+        self, mock_db, tmp_cache, tmp_workdir, mocker
+    ):
+        """Test that each dispatch attempt for a job gets its own workdir.
+
+        Given:
+            A queued job whose dispatch attempts always overflow.
+        When:
+            ``_attempt_dispatch`` runs twice for the same job.
+        Then:
+            The two attempts should compute distinct workdirs (both prefixed
+            with the job_id), so a re-dispatch can never share scratch with a
+            still-live prior attempt.
+        """
+        # Arrange
+        _install_jobs_index(mock_db)
+        registry = ProcessorRegistry()
+        processor = _StubProcessor()
+        registry.register(processor)
+        executor = WoolExecutor(
+            mock_db, tmp_cache, registry, workdir_root=tmp_workdir
+        )
+        record = await _seed_pending_job(mock_db)
+        seen: list[Path] = []
+
+        async def capture(_processor, _file_meta, workdir):
+            seen.append(workdir)
+            raise wool.NoWorkersAvailable("overflow")
+
+        mocker.patch.object(executor, "_open_stream_once", new=capture)
+
+        # Act
+        await executor._attempt_dispatch(record, processor, _file_meta())
+        await executor._attempt_dispatch(record, processor, _file_meta())
+
+        # Assert
+        assert len(seen) == 2
+        assert seen[0] != seen[1]
+        assert all(w.name.startswith(record.job_id) for w in seen)
+
+    @pytest.mark.asyncio
+    async def test__attempt_dispatch_should_clean_workdir_when_mark_running_rejected(
+        self, mock_db, tmp_cache, tmp_workdir, mocker
+    ):
+        """Test that the hand-off branch cleans up its workdir.
+
+        Given:
+            A worker accepts (the routine created the per-attempt workdir),
+            but ``mark_running`` rejects because a successor now owns the row.
+        When:
+            ``_attempt_dispatch`` runs.
+        Then:
+            It should remove this attempt's workdir (no scratch leak) and
+            leave the row untouched for the successor — without releasing the
+            mutex.
+        """
+        # Arrange
+        _install_jobs_index(mock_db)
+        registry = ProcessorRegistry()
+        processor = _StubProcessor()
+        registry.register(processor)
+        executor = WoolExecutor(
+            mock_db, tmp_cache, registry, workdir_root=tmp_workdir
+        )
+        record = await _seed_pending_job(mock_db)
+        captured: dict[str, Path] = {}
+
+        async def fake_open(_processor, _file_meta, workdir):
+            workdir.mkdir(parents=True, exist_ok=True)
+            (workdir / "scratch").write_bytes(b"x")
+            captured["workdir"] = workdir
+
+            async def s():
+                yield Heartbeat()
+
+            return s()
+
+        mocker.patch.object(executor, "_open_stream_once", new=fake_open)
+
+        async def reject(*_a, **_kw):
+            raise RuntimeError("row no longer PENDING")
+
+        mocker.patch.object(executor_module, "mark_running", reject)
+
+        # Act
+        await executor._attempt_dispatch(record, processor, _file_meta())
+
+        # Assert
+        assert "workdir" in captured
+        assert not captured["workdir"].exists()
+        final = await get_job(mock_db, record.job_id)
+        assert final is not None
+        assert final.status == JobStatus.PENDING
+
+
 class TestWoolExecutorMarkRunningOnAcceptance:
     @pytest.mark.asyncio
-    async def test_attempt_should_mark_running_before_first_processor_event(
+    async def test__attempt_dispatch_should_mark_running_before_first_processor_event(
         self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch
     ):
         """Test that a job is marked RUNNING the instant a worker accepts it.
@@ -1556,7 +1775,7 @@ async def test_ensure_workflow_should_admit_below_ceiling(
 
 class TestWoolExecutorDispatchDeadline:
     @pytest.mark.asyncio
-    async def test_attempt_should_fail_capacity_when_deadline_exceeded(
+    async def test__attempt_dispatch_should_fail_capacity_when_deadline_exceeded(
         self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch, mocker
     ):
         """Test that a job past its dispatch deadline is failed ``capacity:``.

From a263f5c5bcde4b4b5e11cee26c5abc6391a22348 Mon Sep 17 00:00:00 2001
From: Conrad <conradbzura@gmail.com>
Date: Mon, 22 Jun 2026 23:17:43 -0400
Subject: [PATCH 15/19] test: Tighten test-double fidelity, mutex predicate,
 and naming per review

- A8: register the unit jobs mutex index with the production
  {active: true} partialFilterExpression (not the DocumentDB-rejected
  status-$in form), so the unit layer guards active/status lockstep drift
  the way production does.
- A15: make the FakeCollection matcher AND $and/$or with sibling top-level
  keys instead of short-circuiting on them, so a mixed query (e.g. the
  orphan-recovery filter) is modeled regardless of key order.
- A7: derive the create-indexes.js parity assertion from
  operational_index_specs() so a newly-added operational index is guarded
  automatically (now covers status_next_dispatch_at).
- A11: rename executor/lock tests to match the method __name__ convention.
---
 tests/conftest.py                           | 12 +++++++++--
 tests/integration/test_executor_boundary.py |  4 +---
 tests/test_data.py                          |  5 +----
 tests/test_index.py                         |  6 ++----
 tests/test_indexes.py                       | 22 +++++++++++++--------
 tests/test_workflows/test_lock.py           | 20 ++++++++++---------
 tests/test_workflows/test_s3_contract.py    |  6 ++----
 7 files changed, 41 insertions(+), 34 deletions(-)

diff --git a/tests/conftest.py b/tests/conftest.py
index c706d7f..8d8ab76 100644
--- a/tests/conftest.py
+++ b/tests/conftest.py
@@ -23,10 +23,18 @@ def _resolve(doc: dict, key: str):
 def _match(doc: dict, query: dict) -> bool:
     """Minimal MongoDB query matcher supporting a small operator subset."""
     for key, cond in query.items():
+        # AND $and/$or with the rest of the top-level keys rather than
+        # short-circuiting the whole match, so a query that mixes them with
+        # sibling field conditions (e.g. {active, updated_at, $or}) is
+        # evaluated faithfully regardless of key order.
         if key == "$and":
-            return all(_match(doc, sub) for sub in cond)
+            if not all(_match(doc, sub) for sub in cond):
+                return False
+            continue
         if key == "$or":
-            return any(_match(doc, sub) for sub in cond)
+            if not any(_match(doc, sub) for sub in cond):
+                return False
+            continue
 
         value = _resolve(doc, key)
 
diff --git a/tests/integration/test_executor_boundary.py b/tests/integration/test_executor_boundary.py
index 582d0d8..b1deadb 100644
--- a/tests/integration/test_executor_boundary.py
+++ b/tests/integration/test_executor_boundary.py
@@ -30,9 +30,7 @@ def _install_jobs_index(mock_db) -> None:
     mock_db.jobs.create_index(
         {"workflow_key": 1},
         unique=True,
-        partialFilterExpression={
-            "status": {"$in": [s.value for s in ACTIVE_STATUSES]}
-        },
+        partialFilterExpression={"active": True},
     )
 
 
diff --git a/tests/test_data.py b/tests/test_data.py
index a81d769..f012969 100644
--- a/tests/test_data.py
+++ b/tests/test_data.py
@@ -11,7 +11,6 @@
 from cfdb.api.routers.data import stream_file, stream_file_status
 from cfdb.services import drs, locks
 from cfdb.workflows.executor import WoolExecutor
-from cfdb.workflows.models import ACTIVE_STATUSES
 from cfdb.workflows.processors.bam import BamIndexProcessor
 from tests.test_workflows import FIXTURE_MD5
 from cfdb.workflows.processors.registry import ProcessorRegistry, default_registry
@@ -115,9 +114,7 @@ async def test_stream_file_should_dispatch_workflow_when_processor_applies(
         mock_db.jobs.create_index(
             {"workflow_key": 1},
             unique=True,
-            partialFilterExpression={
-                "status": {"$in": [s.value for s in ACTIVE_STATUSES]}
-            },
+            partialFilterExpression={"active": True},
         )
 
         mock_db.dcc.docs = [
diff --git a/tests/test_index.py b/tests/test_index.py
index 8e286fc..3dc00b0 100644
--- a/tests/test_index.py
+++ b/tests/test_index.py
@@ -11,7 +11,7 @@
 from cfdb.api.routers.index import stream_index_file, stream_index_file_status
 from cfdb.services import drs, locks
 from cfdb.workflows.executor import WoolExecutor
-from cfdb.workflows.models import ACTIVE_STATUSES, ArtifactKind
+from cfdb.workflows.models import ArtifactKind
 from cfdb.workflows.processors.bam import BamIndexProcessor
 from cfdb.workflows.processors.registry import ProcessorRegistry, default_registry
 from tests.test_workflows import FIXTURE_MD5
@@ -1048,9 +1048,7 @@ async def test_stream_index_file_should_return_202_when_processor_applies_and_ca
         mock_db.jobs.create_index(
             {"workflow_key": 1},
             unique=True,
-            partialFilterExpression={
-                "status": {"$in": [s.value for s in ACTIVE_STATUSES]}
-            },
+            partialFilterExpression={"active": True},
         )
 
         # Register a real processor + cache + executor on the module.
diff --git a/tests/test_indexes.py b/tests/test_indexes.py
index 7b1780d..46e04b3 100644
--- a/tests/test_indexes.py
+++ b/tests/test_indexes.py
@@ -373,16 +373,22 @@ def test_create_indexes_js_should_define_all_operational_index_names():
         The JS source is read.
     Then:
         It should reference each named operational index and the
-        locks.active index.
+        locks.active index. The names are derived from
+        ``operational_index_specs()`` (not hard-coded) so a newly-added
+        operational index is guarded automatically.
     """
+    # Arrange — the named jobs-collection operational indexes (the locks
+    # index relies on Mongo's auto-generated name and is asserted below).
+    named = [
+        spec.name
+        for spec in operational_index_specs()
+        if spec.name and spec.collection == "jobs"
+    ]
+
     # Assert
-    for name in (
-        "workflow_key_active_unique",
-        "job_id_unique",
-        "status_updated_at",
-        "terminal_ttl",
-    ):
-        assert name in _JS
+    assert named, "expected at least one named jobs operational index"
+    for name in named:
+        assert name in _JS, f"{name} missing from create-indexes.js"
     assert "db.locks.createIndex({ active: 1 })" in _JS
 
 
diff --git a/tests/test_workflows/test_lock.py b/tests/test_workflows/test_lock.py
index 1f280d3..2b785c6 100644
--- a/tests/test_workflows/test_lock.py
+++ b/tests/test_workflows/test_lock.py
@@ -7,7 +7,7 @@
 import pytest
 
 from cfdb.workflows import lock
-from cfdb.workflows.models import ACTIVE_STATUSES, ArtifactKind, JobRecord, JobStatus
+from cfdb.workflows.models import ArtifactKind, JobRecord, JobStatus
 from tests.test_workflows import FIXTURE_MD5
 
 PIPELINE_VERSION = 1
@@ -46,9 +46,11 @@ def _install_jobs_index(mock_db) -> None:
     mock_db.jobs.create_index(
         {"workflow_key": 1},
         unique=True,
-        partialFilterExpression={
-            "status": {"$in": [s.value for s in ACTIVE_STATUSES]}
-        },
+        # Match production: the mutex partial-unique index filters on the
+        # ``active`` boolean discriminator (DocumentDB rejects ``$in`` inside
+        # a partialFilterExpression). Keying the fake on ``active`` lets the
+        # unit layer catch an ``active``/``status`` lockstep drift.
+        partialFilterExpression={"active": True},
     )
 
 
@@ -1195,7 +1197,7 @@ async def test_lease_due_dispatch_should_reject_next_at_not_in_future(
 
 class TestRequeueOrphanedDispatch:
     @pytest.mark.asyncio
-    async def test_should_requeue_a_stale_running_orphan(self, mock_db):
+    async def test_requeue_orphaned_dispatch_should_requeue_a_stale_running_orphan(self, mock_db):
         """Test that a RUNNING job with no recent heartbeat is re-queued.
 
         Given:
@@ -1229,7 +1231,7 @@ async def test_should_requeue_a_stale_running_orphan(self, mock_db):
         assert job.next_dispatch_at == now
 
     @pytest.mark.asyncio
-    async def test_should_requeue_a_stale_pending_job_with_no_next_dispatch(
+    async def test_requeue_orphaned_dispatch_should_requeue_a_stale_pending_job_with_no_next_dispatch(
         self, mock_db
     ):
         """Test that a stale PENDING job with unset next_dispatch_at is re-queued.
@@ -1266,7 +1268,7 @@ async def test_should_requeue_a_stale_pending_job_with_no_next_dispatch(
         assert job.next_dispatch_at == now
 
     @pytest.mark.asyncio
-    async def test_should_not_touch_a_fresh_running_job(self, mock_db):
+    async def test_requeue_orphaned_dispatch_should_not_touch_a_fresh_running_job(self, mock_db):
         """Test that a healthy, recently-heartbeating RUNNING job is left alone.
 
         Given:
@@ -1299,7 +1301,7 @@ async def test_should_not_touch_a_fresh_running_job(self, mock_db):
         assert job.status == JobStatus.RUNNING
 
     @pytest.mark.asyncio
-    async def test_should_not_touch_an_already_queued_pending_job(self, mock_db):
+    async def test_requeue_orphaned_dispatch_should_not_touch_an_already_queued_pending_job(self, mock_db):
         """Test that a stale PENDING job already carrying next_dispatch_at is left.
 
         Given:
@@ -1335,7 +1337,7 @@ async def test_should_not_touch_an_already_queued_pending_job(self, mock_db):
         assert job.next_dispatch_at == original
 
     @pytest.mark.asyncio
-    async def test_should_not_touch_terminal_jobs(self, mock_db):
+    async def test_requeue_orphaned_dispatch_should_not_touch_terminal_jobs(self, mock_db):
         """Test that terminal (completed/failed) jobs are never re-queued.
 
         Given:
diff --git a/tests/test_workflows/test_s3_contract.py b/tests/test_workflows/test_s3_contract.py
index 16c599f..3f3a5f4 100644
--- a/tests/test_workflows/test_s3_contract.py
+++ b/tests/test_workflows/test_s3_contract.py
@@ -27,7 +27,7 @@
 from cfdb.workflows.events import Complete, StageComplete, WorkflowEvent
 from cfdb.workflows.executor import WoolExecutor
 from cfdb.workflows.lock import get_job
-from cfdb.workflows.models import ACTIVE_STATUSES, ArtifactKind, JobStatus
+from cfdb.workflows.models import ArtifactKind, JobStatus
 from cfdb.workflows.processors.base import Processor
 from cfdb.workflows.processors.registry import ProcessorRegistry
 from tests.test_workflows import FIXTURE_MD5
@@ -79,9 +79,7 @@ def _install_jobs_index(mock_db) -> None:
     mock_db.jobs.create_index(
         {"workflow_key": 1},
         unique=True,
-        partialFilterExpression={
-            "status": {"$in": [s.value for s in ACTIVE_STATUSES]}
-        },
+        partialFilterExpression={"active": True},
     )
 
 

From 9a2e4dbc1e24dc935065e197d05a815d069bfb8a Mon Sep 17 00:00:00 2001
From: Conrad <conradbzura@gmail.com>
Date: Mon, 22 Jun 2026 23:17:43 -0400
Subject: [PATCH 16/19] docs: Clarify admission-ceiling attach behavior and 202
 queuing

A2: note that at the ceiling a re-GET for an already-active file is also
shed with 429 (it does not attach). A18: clarify the /data and /index 202
rows that the accepted job may sit pending in the durable queue under load.
---
 README.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 847a3ba..2f30c80 100644
--- a/README.md
+++ b/README.md
@@ -548,7 +548,7 @@ Stream file contents from DCCs via HTTPS.
 | Code | Description |
 |------|-------------|
 | 200 | Full file content (GET) or file metadata (HEAD) |
-| 202 | Preprocessed artifact not yet cached — workflow dispatched. `Location: /jobs/{id}` and `Retry-After` headers point to the polling endpoint. |
+| 202 | Preprocessed artifact not yet cached — workflow accepted. `Location: /jobs/{id}` and `Retry-After` headers point to the polling endpoint. Under load the job may sit `pending` in the durable queue before a worker picks it up; poll `/jobs/{id}` for progress. |
 | 206 | Partial content (Range request) |
 | 400 | Invalid DCC, path-param shape, or Range header |
 | 403 | File requires authentication (consortium/protected access) or denied by upstream repository |
@@ -587,7 +587,7 @@ Stream index files (e.g., `.px2`, `.bai`) associated with DCC data files.
 | Code | Description |
 |------|-------------|
 | 200 | Full index file content (GET) or file metadata (HEAD) |
-| 202 | Index not yet cached — workflow dispatched. `Location: /jobs/{id}` and `Retry-After` headers point to the polling endpoint. |
+| 202 | Index not yet cached — workflow accepted. `Location: /jobs/{id}` and `Retry-After` headers point to the polling endpoint. Under load the job may sit `pending` in the durable queue before a worker picks it up; poll `/jobs/{id}` for progress. |
 | 206 | Partial content (Range request) |
 | 400 | Invalid DCC, path-param shape, or Range header |
 | 403 | File requires consortium/protected access (HuBMAP) |
@@ -658,7 +658,7 @@ Cache keys are content-addressed using each file's upstream `md5`, so a byte cha
 - **Per-worker backpressure** — each worker accepts at most `CFDB_WORKER_MAX_CONCURRENT_TASKS` tasks at once (default `1`), serializing the subprocess pipelines on a 1-vCPU worker. A worker at capacity rejects the dispatch and the API's priority load balancer rotates to the next worker.
 - **Priority (leaky-bucket) load balancing** — the API offers each task to discovered workers in a stable order, so load concentrates on the lowest-ordered workers and over-provisioned workers drain to idle and self-reap (via `CFDB_WORKER_MAX_LIFETIME_SECONDS`) instead of every worker carrying a thin perpetual slice.
 - **Durable queue + retry-to-deadline** — when no worker has capacity, the job is **not** failed and does **not** block the request: it stays `pending` and a durable, Mongo-backed scheduler re-attempts dispatch every `CFDB_WORKFLOW_RETRY_INTERVAL_S` (plus jitter) until a worker frees up or the `CFDB_WORKFLOW_DISPATCH_DEADLINE_S` deadline elapses (then it is failed with a `capacity:`-prefixed error). Because the queue lives in Mongo, an API restart resumes it. On every scheduler tick (including the first, on boot) an orphan-recovery sweep re-queues jobs a crash left mid-flight — a `running` job whose API consumer died, or a fresh `pending` claim that never rescheduled — once they pass the stale threshold (`CFDB_WORKFLOW_STALE_THRESHOLD_S`), so recovery is autonomous and does not wait for a client to re-request the file. On the ECS profile, an overflow also requests one bounded worker spawn (the leaky bucket overflowing), inverting the old unconditional per-request spawn.
-- **Admission ceiling** — once `CFDB_WORKFLOW_MAX_ACTIVE` workflows are active (`pending` + `running`), further preprocessing requests are shed with `429 Retry-After` rather than queued, so the backlog itself is bounded. The readiness `/status` probes never dispatch and so never `429`.
+- **Admission ceiling** — once `CFDB_WORKFLOW_MAX_ACTIVE` workflows are active (`pending` + `running`), further preprocessing requests are shed with `429 Retry-After` rather than queued, so the backlog itself is bounded. The check runs before the per-file mutex, so at the ceiling even a re-`GET` for a file whose workflow is already in flight is shed with `429` (rather than attaching to the in-flight job) and the client retries — the deliberate trade for shedding before an unbounded admission race. The readiness `/status` probes never dispatch and so never `429`.
 
 `/index` continues to serve upstream sidecars first when present (the 218 BED→beddb and 4 BED→tbi 4DN cases that publish under `extra.extra_files` or `extra.fourdn.extra_files`); the workflow path is dispatched only when no sidecar exists. Set `?raw=true` to bypass the workflow path entirely and return only the upstream sidecar (404 when none exists).
 

From 779b0605ad0fe6cbc223f15c77d662775888af58 Mon Sep 17 00:00:00 2001
From: Conrad <conradbzura@gmail.com>
Date: Tue, 23 Jun 2026 11:40:32 -0400
Subject: [PATCH 17/19] fix: Resolve review-3 advisories in the workflow
 dispatch layer

Tighten the stale-reclaim invariant to a strict STALE > 2 * HEARTBEAT so
the heartbeat-loss abort window (_HEARTBEAT_LOSS_ABORT_S = max(HEARTBEAT,
STALE - HEARTBEAT)) keeps a non-zero margin above one interval; at the
former permitted boundary STALE == 2 * HEARTBEAT it collapsed to exactly
one interval, defeating the single-transient-failure tolerance the design
advertises.

Count the admission ceiling on the canonical active boolean discriminator
rather than the derived status-in-ACTIVE_STATUSES view, so the ceiling and
the mutex stay in lockstep if the two ever drift.

Emit leading operability signals: a per-tick dispatch summary and an
overflow-reschedule line, so a saturated pool is visible before jobs start
failing capacity: at the deadline.

Correct stale or overstated docstrings and comments: the HeartbeatLost
abort is documented as a best-effort optimization (it only fires on quiet
stages), with the per-attempt workdir named as the actual double-dispatch
guard; the _next_dispatch_time jitter note no longer claims a test zeroes
the constant; the load balancer dispatch return type is an async iterator,
not an async generator; the worker_lan module docstring describes the
durable-queue model instead of the removed hang-then-NoWorkersAvailable
behavior; _recover_orphans and the JobRecord.error field document the
from-submission deadline clock and the internal: error prefix.
---
 src/cfdb/workflows/__init__.py     | 21 +++++---
 src/cfdb/workflows/executor.py     | 85 +++++++++++++++++++++++-------
 src/cfdb/workflows/loadbalancer.py |  5 +-
 src/cfdb/workflows/lock.py         |  7 ++-
 src/cfdb/workflows/models.py       |  9 +++-
 src/cfdb/workflows/worker_lan.py   | 12 +++--
 6 files changed, 104 insertions(+), 35 deletions(-)

diff --git a/src/cfdb/workflows/__init__.py b/src/cfdb/workflows/__init__.py
index 8bd5012..12ae181 100644
--- a/src/cfdb/workflows/__init__.py
+++ b/src/cfdb/workflows/__init__.py
@@ -150,17 +150,22 @@ def _positive_int(name: str, value: str, *, minimum: int = 0) -> int:
     minimum=1,
 )
 
-# The stale-reclaim threshold MUST exceed two heartbeat intervals so a
-# single missed heartbeat (e.g., a brief Mongo write delay) does not
-# falsely reclaim a healthy worker. The default values (300 / 900)
-# satisfy this with a 300s safety margin; an operator-tuned pair that
-# violates the invariant is a configuration error.
-if WORKFLOW_STALE_THRESHOLD_S < 2 * WORKFLOW_HEARTBEAT_INTERVAL_S:
+# The stale-reclaim threshold MUST strictly exceed two heartbeat intervals.
+# A single missed heartbeat (e.g., a brief Mongo write delay) must not
+# falsely reclaim a healthy worker, AND the executor's
+# ``_HEARTBEAT_LOSS_ABORT_S = max(HEARTBEAT, STALE - HEARTBEAT)`` needs a
+# non-zero margin above one interval to keep its single-transient-failure
+# tolerance: at the boundary ``STALE == 2 * HEARTBEAT`` that window
+# collapses to exactly one interval (see issue #45 review A7). The default
+# values (300 / 900) satisfy the strict form with a 300s margin; an
+# operator-tuned pair that violates it is a configuration error.
+if WORKFLOW_STALE_THRESHOLD_S <= 2 * WORKFLOW_HEARTBEAT_INTERVAL_S:
     raise ValueError(
         f"CFDB_WORKFLOW_STALE_THRESHOLD_S ({WORKFLOW_STALE_THRESHOLD_S}s) "
-        f"must be >= 2 * CFDB_WORKFLOW_HEARTBEAT_INTERVAL_S "
+        f"must be > 2 * CFDB_WORKFLOW_HEARTBEAT_INTERVAL_S "
         f"({WORKFLOW_HEARTBEAT_INTERVAL_S}s) to avoid false stale-reclaim "
-        "of healthy workers between heartbeats."
+        "of healthy workers between heartbeats and to keep the "
+        "heartbeat-loss abort window strictly positive."
     )
 
 # --- Bounded-concurrency control (issue #45) --------------------------------
diff --git a/src/cfdb/workflows/executor.py b/src/cfdb/workflows/executor.py
index 7615ed7..54c8619 100644
--- a/src/cfdb/workflows/executor.py
+++ b/src/cfdb/workflows/executor.py
@@ -147,11 +147,13 @@
 #: consumer is gone": a consumer that cannot reach Mongo stops computing
 #: (and lets ``stream.aclose`` cancel the remote worker) *before* the
 #: orphan sweep's ``STALE_WORKFLOW_THRESHOLD`` would revive the row, so
-#: recovery never double-dispatches a still-live job. Sized strictly
-#: between one heartbeat interval (a single transient write failure must
-#: not abort) and the sweep threshold (the consumer must give up with
-#: margin to spare); the ``STALE >= 2*HEARTBEAT`` startup invariant keeps
-#: that window non-empty.
+#: recovery is spared a redundant second run of a still-live job. Sized
+#: strictly between one heartbeat interval (a single transient write
+#: failure must not abort) and the sweep threshold (the consumer must give
+#: up with margin to spare); the strict ``STALE > 2*HEARTBEAT`` startup
+#: invariant keeps that window strictly positive (at ``STALE == 2*HEARTBEAT``
+#: it would collapse to exactly one interval, leaving no transient-failure
+#: tolerance — see issue #45 review A7).
 _HEARTBEAT_LOSS_ABORT_S = max(
     _HEARTBEAT_INTERVAL_S,
     STALE_WORKFLOW_THRESHOLD.total_seconds() - _HEARTBEAT_INTERVAL_S,
@@ -228,13 +230,23 @@ class HeartbeatLost(RuntimeError):
     """Raised by the stream consumer when it can no longer refresh the job.
 
     Signals that ``heartbeat_workflow`` has been failing for longer than
-    :data:`_HEARTBEAT_LOSS_ABORT_S` while a worker is still streaming — i.e.
-    this consumer has lost its connection to Mongo and can no longer keep
-    the row fresh. Consuming the rest of the stream would let the worker
-    keep computing invisibly until the orphan sweep (on a healthy replica)
-    reclaims the now-stale row and re-dispatches it, double-running the job.
-    Raising this aborts the attempt and ``aclose``s the stream, cancelling
-    the remote worker so recovery has a single live attempt to revive.
+    :data:`_HEARTBEAT_LOSS_ABORT_S` while a worker is still streaming
+    ``heartbeat`` events — i.e. this consumer has lost its connection to
+    Mongo and can no longer keep the row fresh. Raising this aborts the
+    attempt and ``aclose``s the stream, cancelling the remote worker so it
+    stops computing.
+
+    This is a *best-effort optimization*, NOT the load-bearing
+    double-dispatch guard. A ``heartbeat`` event is only injected after a
+    quiet ``heartbeat_interval`` of stage silence, so a stage that streams
+    ``stage_complete`` / ``progress`` faster than that interval never trips
+    the check — meaning a Mongo-blind consumer is not guaranteed to abort
+    before the orphan sweep revives its row. What actually keeps a recovered
+    re-dispatch from corrupting a still-live attempt is the per-attempt
+    workdir nonce (``_attempt_dispatch``) plus content-addressed, atomically
+    committed cache artifacts. The abort just spares the wasted second run
+    when it can; do not remove the per-attempt workdir on the assumption
+    this covers the race.
     """
 
 
@@ -246,10 +258,12 @@ def _next_dispatch_time(now: datetime) -> datetime:
     """Compute the next dispatch attempt time for an overflowed job.
 
     ``now + retry_interval + jitter``. The jitter spreads a queued backlog
-    across ticks; tests zero ``_RETRY_JITTER_MAX_SECONDS`` for determinism.
-    Always strictly after ``now`` (retry interval is >= 1s), which
-    ``lease_due_dispatch`` requires so a leased row moves out of the
-    due-window before its attempt resolves.
+    across ticks; it is bounded by ``_RETRY_JITTER_MAX_SECONDS`` (itself
+    clamped to the retry interval), so a test that monkeypatches
+    ``_RETRY_INTERVAL_SECONDS`` low gets correspondingly small, bounded
+    jitter — no test zeroes the constant. Always strictly after ``now``
+    (retry interval is >= 1s), which ``lease_due_dispatch`` requires so a
+    leased row moves out of the due-window before its attempt resolves.
     """
     jitter = (
         random.uniform(0, _RETRY_JITTER_MAX_SECONDS)
@@ -590,6 +604,14 @@ async def _recover_orphans(self) -> None:
         is then autonomous: it no longer depends on a client re-requesting
         the same file to trigger ``claim_workflow``'s stale-reclaim. Gated
         on the stale threshold so healthy heartbeating jobs are untouched.
+
+        Note the recovery promise is bounded by the same dispatch deadline
+        as a fresh job: ``requeue_orphaned_dispatch`` preserves the original
+        ``submitted_at``, and ``_attempt_dispatch`` measures the deadline
+        from it, so an orphan older than ``CFDB_WORKFLOW_DISPATCH_DEADLINE_S``
+        is failed ``capacity:`` on its first recovery attempt rather than
+        resumed (its committed cache artifacts survive for a later fresh
+        ``GET`` to reuse). Recovery is best-effort, not unbounded.
         """
         now = _utcnow()
         requeued = await requeue_orphaned_dispatch(
@@ -626,7 +648,7 @@ async def _drain_due_jobs(self) -> None:
                 self._db, now=now, next_at=_next_dispatch_time(now)
             )
             if leased is None:
-                return
+                break
             file_meta = leased.file_meta_snapshot
             if file_meta is None:
                 logger.error(
@@ -655,6 +677,19 @@ async def _drain_due_jobs(self) -> None:
             self._spawn_attempt(leased, processor, file_meta)
             dispatched += 1
 
+        # Leading operability signal: a saturated queue is otherwise silent
+        # until jobs start failing ``capacity:`` at the dispatch deadline.
+        # Emitted at most once per tick (cadence ``_RETRY_INTERVAL_SECONDS``),
+        # so it does not flood the log even under a sustained backlog.
+        if dispatched:
+            logger.info(
+                "Scheduler tick dispatched %d queued workflow(s)%s",
+                dispatched,
+                " (per-tick cap hit; remainder deferred to the next tick)"
+                if dispatched >= _MAX_DISPATCHES_PER_TICK
+                else "",
+            )
+
     async def drain(self, *, timeout: float) -> int:
         """Wait for in-flight workflow tasks to complete.
 
@@ -867,6 +902,14 @@ async def _handle_overflow(self, record: JobRecord) -> None:
                     "next tick",
                     record.job_id,
                 )
+        # Leading operability signal that the pool is over capacity: the job
+        # found no worker and is being deferred to a later tick. Logged at
+        # info so a steadily-overflowing fleet is visible before jobs start
+        # failing ``capacity:`` at the dispatch deadline.
+        logger.info(
+            "No worker capacity for %s; rescheduling (pool overflow)",
+            record.job_id,
+        )
         await reschedule_dispatch(
             self._db, record.job_id, next_at=_next_dispatch_time(_utcnow())
         )
@@ -899,7 +942,10 @@ async def _consume_and_finalize(
         # the caller just marked the job RUNNING (the row is fresh). If
         # heartbeat writes then fail for longer than _HEARTBEAT_LOSS_ABORT_S
         # we raise HeartbeatLost and abort, so a Mongo-blind consumer stops
-        # computing before the orphan sweep revives its row (B1).
+        # computing before the orphan sweep revives its row. Best-effort: it
+        # only re-evaluates on a Heartbeat event (i.e. during quiet stages),
+        # so the per-attempt workdir — not this abort — is the actual
+        # corruption guard (see HeartbeatLost).
         last_heartbeat_ok = _utcnow()
 
         try:
@@ -918,7 +964,8 @@ async def _consume_and_finalize(
                                 if stalled > _HEARTBEAT_LOSS_ABORT_S:
                                     # Lost Mongo long enough that the orphan
                                     # sweep will treat this row as dead; abort
-                                    # so recovery doesn't double-run the job.
+                                    # so recovery is spared a redundant second
+                                    # run of this still-live attempt.
                                     raise HeartbeatLost(
                                         f"no successful heartbeat for "
                                         f"{stalled:.0f}s "
diff --git a/src/cfdb/workflows/loadbalancer.py b/src/cfdb/workflows/loadbalancer.py
index d5bc06a..bacbd8c 100644
--- a/src/cfdb/workflows/loadbalancer.py
+++ b/src/cfdb/workflows/loadbalancer.py
@@ -30,7 +30,8 @@
 
 import logging
 from typing import TYPE_CHECKING
-from typing import AsyncGenerator
+from typing import Any
+from typing import AsyncIterator
 
 from wool import LoadBalancerContextLike
 from wool import LoadBalancerLike
@@ -63,7 +64,7 @@ async def dispatch(
         *,
         context: LoadBalancerContextLike,
         timeout: float | None = None,
-    ) -> AsyncGenerator:
+    ) -> AsyncIterator[Any]:
         """Dispatch *task* to the first worker, in priority order, that accepts.
 
         :param task:
diff --git a/src/cfdb/workflows/lock.py b/src/cfdb/workflows/lock.py
index 6f5763d..05503f3 100644
--- a/src/cfdb/workflows/lock.py
+++ b/src/cfdb/workflows/lock.py
@@ -491,9 +491,14 @@ async def count_active_workflows(db) -> int:
     ``CFDB_WORKFLOW_MAX_ACTIVE`` new requests are shed with 429. Soft by
     nature — a count-then-insert race can briefly overshoot the cap, which
     is acceptable for a flood guard.
+
+    Counts on the canonical ``active`` boolean discriminator (the single
+    source of truth every other jobs read/write/index keys off), not the
+    derived ``status $in ACTIVE_STATUSES`` view, so the ceiling and the
+    mutex stay in lockstep even if the two ever drift.
     """
     jobs = _jobs(db)
-    return await jobs.count_documents({"status": {"$in": _ACTIVE_STATUS_VALUES}})
+    return await jobs.count_documents({"active": True})
 
 
 async def reschedule_dispatch(db, job_id: str, *, next_at: datetime) -> None:
diff --git a/src/cfdb/workflows/models.py b/src/cfdb/workflows/models.py
index cdeda5f..6156b25 100644
--- a/src/cfdb/workflows/models.py
+++ b/src/cfdb/workflows/models.py
@@ -73,7 +73,14 @@ class JobRecord(BaseModel):
         error: Populated on ``FAILED`` status with a human-readable reason.
             Capped at 1024 chars after path-scrubbing in
             ``lock.release_workflow`` so absolute filesystem paths from
-            tool stderr don't leak via ``/jobs/{id}``.
+            tool stderr don't leak via ``/jobs/{id}``. Some failures carry a
+            stable, client-parseable prefix: ``capacity:`` (the dispatch
+            deadline elapsed while the job waited for a worker — safe to
+            resubmit later), ``heartbeat lost:`` (the consumer lost its
+            Mongo connection and aborted so the orphan sweep can recover the
+            job), and ``internal:`` (a should-never-happen executor
+            invariant failure, e.g. a leased job missing its file_meta
+            snapshot or processor).
         superseded_by: When this row is stale-reclaimed by a fresh claim,
             the winner's ``job_id`` is recorded here so clients polling
             this (now-FAILED) job can follow the chain to the live one.
diff --git a/src/cfdb/workflows/worker_lan.py b/src/cfdb/workflows/worker_lan.py
index 80c6437..a9d2512 100644
--- a/src/cfdb/workflows/worker_lan.py
+++ b/src/cfdb/workflows/worker_lan.py
@@ -17,16 +17,20 @@
     python -m cfdb.workflows.worker_lan --namespace cfdb-workers --workers 2
 
 SIGINT (Ctrl-C) or SIGTERM drains the pool and exits. With no pool
-running, ``/data`` and ``/index`` requests for processable formats hang
-on the dispatch retry budget before failing with ``NoWorkersAvailable``.
+running, ``/data`` and ``/index`` requests for processable formats do not
+hang or surface ``NoWorkersAvailable`` to the client: the job is claimed
+and queued PENDING (the request returns ``202``), and the API's durable
+retry scheduler re-attempts dispatch every ``CFDB_WORKFLOW_RETRY_INTERVAL_S``
+until a worker appears or the ``CFDB_WORKFLOW_DISPATCH_DEADLINE_S`` deadline
+elapses (then the job is failed ``capacity:``).
 
 Environment variables (CLI flags mirror them):
 
 * ``WORKFLOW_POOL_NAMESPACE`` — LAN discovery namespace the pool
   publishes under (default ``cfdb-workers``). MUST match the API's value.
 * ``WORKFLOW_WORKER_COUNT`` — number of workers to spawn and publish
-  (default ``2``). Size it at least as high as the API's lease count or
-  the API blocks waiting for workers.
+  (default ``2``). The API admits every worker discovery surfaces (there is
+  no fixed lease count), so size this to the local concurrency you want.
 """
 
 from __future__ import annotations

From 217690362edceaedbecad547215ffa6aa92675f6 Mon Sep 17 00:00:00 2001
From: Conrad <conradbzura@gmail.com>
Date: Tue, 23 Jun 2026 11:40:45 -0400
Subject: [PATCH 18/19] test: Add review-3 coverage and tighten test fidelity

Replace the vacuous workdir-cleanup assertions in the executor-boundary
and processor e2e tests: since the per-attempt workdir is named
job_id-uuid, a bare-job_id path is never created, so asserting its absence
passed even if scratch leaked. Assert the workdir root is empty instead.

Pin the per-tick lease cap: one _drain_due_jobs tick leases at most
_MAX_DISPATCHES_PER_TICK jobs and leaves the remainder leasable, so a
refactor dropping the thundering-herd guard fails CI.

Pin the production pool configuration over the real cloudpickle/gRPC
boundary: a two-worker pool wired with PriorityLoadBalancer and a
TaskCountBackpressure(1)-bound spawn factory distributes two workflows
across both workers and leaves the third PENDING with a next_dispatch_at,
proving rotation-on-rejection and durable reschedule end to end.

Pin the ECS worker entrypoint's backpressure wiring (worker_main.serve)
and add Hypothesis property coverage for the JobRecord datetime
aware/naive to_mongo/from_mongo round-trip (now including next_dispatch_at)
and the cloudpickle round-trip of the backpressure-bound spawn factory.

Strengthen the heartbeat-loss threshold assertion to the strict margin the
tightened invariant now guarantees, rename the scheduler-method tests to
the leading-underscore convention, and refresh the stale wool_pool fixture
docstring.
---
 tests/integration/conftest.py               |   7 +-
 tests/integration/test_executor_boundary.py | 100 +++++++++++++++++++-
 tests/integration/test_processor_e2e.py     |   7 +-
 tests/test_workflows/test_executor.py       |  70 ++++++++++++--
 tests/test_workflows/test_lock.py           |  61 ++++++++++++
 tests/test_workflows/test_worker_lan.py     |  32 +++++++
 tests/test_workflows/test_worker_main.py    |  82 ++++++++++++++++
 7 files changed, 342 insertions(+), 17 deletions(-)

diff --git a/tests/integration/conftest.py b/tests/integration/conftest.py
index 41269d5..2b1418f 100644
--- a/tests/integration/conftest.py
+++ b/tests/integration/conftest.py
@@ -530,9 +530,10 @@ async def wool_pool():
 
     Function-scoped so every test gets a clean pool — prevents any
     cross-test state bleeding through worker process memory. The
-    executor's ``_dispatch_with_retry`` covers the brief window
-    between pool startup and LocalDiscovery surfacing the worker,
-    so no explicit sleep is needed here.
+    executor opens a single dispatch attempt and, on overflow,
+    reschedules the job onto its durable retry scheduler — so the brief
+    window between pool startup and the worker being surfaced needs no
+    explicit sleep here.
     """
     async with wool.WorkerPool(spawn=1):
         yield
diff --git a/tests/integration/test_executor_boundary.py b/tests/integration/test_executor_boundary.py
index b1deadb..24acad5 100644
--- a/tests/integration/test_executor_boundary.py
+++ b/tests/integration/test_executor_boundary.py
@@ -10,11 +10,16 @@
 from __future__ import annotations
 
 import asyncio
+import functools
 
 import pytest
+import wool
 
+from cfdb.workflows import executor as executor_module
+from cfdb.workflows.backpressure import TaskCountBackpressure
 from cfdb.workflows.cache import LocalFsCache
 from cfdb.workflows.executor import WoolExecutor
+from cfdb.workflows.loadbalancer import PriorityLoadBalancer
 from cfdb.workflows.lock import get_job
 from cfdb.workflows.models import ACTIVE_STATUSES, JobStatus
 from cfdb.workflows.processors.registry import ProcessorRegistry
@@ -181,8 +186,11 @@ async def test_ensure_workflow_should_record_failed_when_workflow_duration_cap_f
         assert final.status == JobStatus.FAILED
         assert final.error is not None
         assert "runtime cap" in final.error.lower()
-        # Workdir is cleaned up regardless of failure mode.
-        assert not (executor._workdir_root / record.job_id).exists()
+        # Workdir is cleaned up regardless of failure mode. Assert against
+        # the workdir ROOT, not ``root / job_id``: the per-attempt workdir is
+        # ``root / f"{job_id}-{uuid}"`` (B1), so a bare-``job_id`` path is
+        # never created and asserting its absence would be vacuously true.
+        assert list(executor._workdir_root.iterdir()) == []
         # The internal bookkeeping sets are drained after the await.
         assert len(executor._pending_tasks) == 0
         assert len(executor._finalize_tasks) == 0
@@ -273,3 +281,91 @@ async def test_ensure_workflow_should_attach_second_caller_to_active_job_across_
             d for d in mock_db.jobs.docs if d["workflow_key"] == rec_a.workflow_key
         ]
         assert len(records_for_key) == 1
+
+
+class TestProductionPoolConfiguration:
+    @pytest.mark.asyncio
+    async def test_ensure_workflow_should_distribute_and_reschedule_across_priority_pool(
+        self, mock_db, tmp_path, monkeypatch
+    ):
+        """Test the production pool config over the real pickle boundary.
+
+        Given:
+            A two-worker ``wool.WorkerPool`` wired exactly as production —
+            ``PriorityLoadBalancer`` plus a ``TaskCountBackpressure(1)``-bound
+            spawn factory — the durable retry scheduler running, and three
+            distinct slow workflows that each hold their worker busy for the
+            whole test.
+        When:
+            All three are dispatched via ``ensure_workflow`` and the system
+            is allowed to settle.
+        Then:
+            Exactly two reach RUNNING concurrently (one per worker — proving
+            the load balancer distributes across both over the cloudpickle /
+            gRPC boundary), and the third is left PENDING with a
+            ``next_dispatch_at``: the priority balancer rotates past both
+            ``RESOURCE_EXHAUSTED`` rejections, surfaces ``NoWorkersAvailable``,
+            and the durable scheduler keeps it queued rather than dropping or
+            double-running it.
+        """
+        # Arrange — fast retries so an early overflow (a worker not yet
+        # surfaced at pool startup) is healed by the scheduler promptly.
+        monkeypatch.setattr(executor_module, "_RETRY_INTERVAL_SECONDS", 0.2)
+        monkeypatch.setattr(executor_module, "_RETRY_JITTER_MAX_SECONDS", 0.1)
+        _install_jobs_index(mock_db)
+        cache = LocalFsCache(tmp_path / "cache")
+        registry = ProcessorRegistry()
+        registry.register(StubProcessor(sleep_seconds=30.0))
+        executor = WoolExecutor(
+            mock_db, cache, registry, workdir_root=tmp_path / "jobs"
+        )
+        metas = [{**stub_file_meta(), "local_id": f"ENCFF-{i}"} for i in range(3)]
+        pool = wool.WorkerPool(
+            spawn=2,
+            worker=functools.partial(
+                wool.LocalWorker, backpressure=TaskCountBackpressure(1)
+            ),
+            loadbalancer=PriorityLoadBalancer(),
+        )
+
+        async def _settled() -> tuple[int, int]:
+            running = pending_resched = 0
+            for meta in metas:
+                # job_id isn't known until ensure_workflow returns; match on
+                # the workflow_key the records carry instead.
+                for doc in mock_db.jobs.docs:
+                    if doc["local_id"] != meta["local_id"]:
+                        continue
+                    if doc["status"] == JobStatus.RUNNING.value:
+                        running += 1
+                    elif (
+                        doc["status"] == JobStatus.PENDING.value
+                        and doc.get("next_dispatch_at") is not None
+                    ):
+                        pending_resched += 1
+            return running, pending_resched
+
+        # Act / Assert
+        try:
+            async with pool:
+                executor.start_scheduler()
+                for meta in metas:
+                    await executor.ensure_workflow(meta)
+
+                # Backpressure(1) on two workers caps concurrent RUNNING at
+                # two, so the steady state is (2 running, 1 overflowed). The
+                # long-running stubs hold that state for the whole poll.
+                deadline = asyncio.get_event_loop().time() + 25.0
+                running = pending_resched = 0
+                while asyncio.get_event_loop().time() < deadline:
+                    running, pending_resched = await _settled()
+                    if running == 2 and pending_resched == 1:
+                        break
+                    await asyncio.sleep(0.2)
+
+                assert running == 2, f"expected 2 RUNNING (one per worker), got {running}"
+                assert pending_resched == 1, (
+                    f"expected 1 overflowed+rescheduled job, got {pending_resched}"
+                )
+        finally:
+            await executor.drain(timeout=15.0)
diff --git a/tests/integration/test_processor_e2e.py b/tests/integration/test_processor_e2e.py
index f592878..780eb23 100644
--- a/tests/integration/test_processor_e2e.py
+++ b/tests/integration/test_processor_e2e.py
@@ -643,5 +643,8 @@ async def test_ensure_workflow_should_surface_failed_status_when_processor_raise
         # The path-scrub regex strips multi-segment absolute paths so a
         # ``/<seg>/<seg>`` shape MUST NOT survive into persisted text.
         assert re.search(r"/[A-Za-z][A-Za-z0-9_.-]*/[A-Za-z]", final.error) is None
-        # Workdir is cleaned up regardless of failure mode.
-        assert not (integration_executor._workdir_root / record.job_id).exists()
+        # Workdir is cleaned up regardless of failure mode. Assert against
+        # the workdir ROOT, not ``root / job_id``: the per-attempt workdir is
+        # ``root / f"{job_id}-{uuid}"`` (B1), so a bare-``job_id`` path is
+        # never created and asserting its absence would be vacuously true.
+        assert list(integration_executor._workdir_root.iterdir()) == []
diff --git a/tests/test_workflows/test_executor.py b/tests/test_workflows/test_executor.py
index cd45e38..a33a86b 100644
--- a/tests/test_workflows/test_executor.py
+++ b/tests/test_workflows/test_executor.py
@@ -1438,15 +1438,17 @@ def test_heartbeat_loss_abort_threshold_sits_between_interval_and_stale(self):
             It is compared against the heartbeat interval and the orphan
             sweep's stale threshold.
         Then:
-            It should be at least one heartbeat interval (a single transient
-            write failure must not abort) and strictly below the stale
-            threshold (a Mongo-blind consumer must give up before the sweep
-            would revive its row).
+            It should be strictly more than one heartbeat interval (a single
+            transient write failure must not abort — the strict
+            ``STALE > 2*HEARTBEAT`` startup invariant guarantees the margin,
+            issue #45 review A7) and strictly below the stale threshold (a
+            Mongo-blind consumer must give up before the sweep would revive
+            its row).
         """
         # Assert
         assert (
             executor_module._HEARTBEAT_LOSS_ABORT_S
-            >= executor_module._HEARTBEAT_INTERVAL_S
+            > executor_module._HEARTBEAT_INTERVAL_S
         )
         assert (
             executor_module._HEARTBEAT_LOSS_ABORT_S
@@ -1818,7 +1820,7 @@ async def test__attempt_dispatch_should_fail_capacity_when_deadline_exceeded(
 
 class TestWoolExecutorScheduler:
     @pytest.mark.asyncio
-    async def test_drain_due_jobs_should_dispatch_a_due_queued_job(
+    async def test__drain_due_jobs_should_dispatch_a_due_queued_job(
         self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch
     ):
         """Test that the scheduler tick dispatches a due rescheduled job.
@@ -1853,7 +1855,7 @@ async def test_drain_due_jobs_should_dispatch_a_due_queued_job(
         assert final.status == JobStatus.COMPLETED
 
     @pytest.mark.asyncio
-    async def test_drain_due_jobs_should_ignore_not_yet_due_jobs(
+    async def test__drain_due_jobs_should_ignore_not_yet_due_jobs(
         self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch
     ):
         """Test that a queued job whose retry time is in the future is skipped.
@@ -1887,7 +1889,7 @@ async def test_drain_due_jobs_should_ignore_not_yet_due_jobs(
         assert processor.run_calls == 0
 
     @pytest.mark.asyncio
-    async def test_recover_orphans_should_requeue_and_dispatch_a_stale_running_job(
+    async def test__recover_orphans_should_requeue_and_dispatch_a_stale_running_job(
         self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch
     ):
         """Test that a RUNNING job orphaned by a crash is recovered autonomously.
@@ -1929,7 +1931,7 @@ async def test_recover_orphans_should_requeue_and_dispatch_a_stale_running_job(
         assert final.status == JobStatus.COMPLETED
 
     @pytest.mark.asyncio
-    async def test_recover_orphans_should_leave_a_fresh_running_job_alone(
+    async def test__recover_orphans_should_leave_a_fresh_running_job_alone(
         self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch
     ):
         """Test that a healthy in-flight RUNNING job is not reclaimed.
@@ -1963,7 +1965,7 @@ async def test_recover_orphans_should_leave_a_fresh_running_job_alone(
         assert job.status == JobStatus.RUNNING
 
     @pytest.mark.asyncio
-    async def test_scheduler_loop_should_survive_a_tick_exception(
+    async def test__scheduler_loop_should_survive_a_tick_exception(
         self, mock_db, tmp_cache, tmp_workdir, mocker
     ):
         """Test that an exception in one tick doesn't kill the scheduler.
@@ -2007,6 +2009,54 @@ async def flaky_lease(*_a, **_k):
         finally:
             await executor.drain(timeout=1.0)
 
+    @pytest.mark.asyncio
+    async def test__drain_due_jobs_should_lease_at_most_the_per_tick_cap(
+        self, mock_db, tmp_cache, tmp_workdir, mocker
+    ):
+        """Test that one tick leases no more than the per-tick cap.
+
+        Given:
+            More due queued jobs than ``_MAX_DISPATCHES_PER_TICK`` (the cap
+            monkeypatched low) and a stubbed ``_spawn_attempt`` so the tick
+            only counts dispatches.
+        When:
+            One scheduler tick (``_drain_due_jobs``) runs.
+        Then:
+            Exactly the cap's worth of attempts are spawned and the
+            remainder is left leasable for a subsequent tick — the
+            thundering-herd guard bounds the per-tick fan-out so a large
+            backlog can't spawn thousands of concurrent attempts at once.
+        """
+        # Arrange
+        _install_jobs_index(mock_db)
+        registry = ProcessorRegistry()
+        registry.register(_StubProcessor())
+        executor = WoolExecutor(
+            mock_db, tmp_cache, registry, workdir_root=tmp_workdir
+        )
+        mocker.patch.object(executor_module, "_MAX_DISPATCHES_PER_TICK", 2)
+        due = datetime.now(timezone.utc) - timedelta(seconds=1)
+        for i in range(5):  # cap (2) + 3 remainder, each a distinct workflow
+            await _seed_pending_job(
+                mock_db,
+                meta={**_file_meta(), "local_id": f"ENCFF{i:03d}"},
+                next_dispatch_at=due,
+            )
+        spawn = mocker.patch.object(executor, "_spawn_attempt")
+
+        # Act — a single tick.
+        await executor._drain_due_jobs()
+
+        # Assert — capped at the per-tick limit...
+        assert spawn.call_count == 2
+        # ...with the remainder still due and leasable on the next tick.
+        leftover = await executor_module.lease_due_dispatch(
+            mock_db,
+            now=datetime.now(timezone.utc),
+            next_at=datetime.now(timezone.utc) + timedelta(seconds=60),
+        )
+        assert leftover is not None
+
 
 class TestWoolExecutorStartScheduler:
     @pytest.mark.asyncio
diff --git a/tests/test_workflows/test_lock.py b/tests/test_workflows/test_lock.py
index 2b785c6..af7a9ae 100644
--- a/tests/test_workflows/test_lock.py
+++ b/tests/test_workflows/test_lock.py
@@ -5,6 +5,8 @@
 from datetime import datetime, timedelta, timezone
 
 import pytest
+from hypothesis import given
+from hypothesis import strategies as st
 
 from cfdb.workflows import lock
 from cfdb.workflows.models import ArtifactKind, JobRecord, JobStatus
@@ -807,6 +809,65 @@ def test_record_from_mongo_should_attach_utc_to_naive_timestamps(self):
         assert record.submitted_at.tzinfo is timezone.utc
         assert record.updated_at.tzinfo is timezone.utc
 
+    @given(
+        submitted=st.datetimes(timezones=st.just(timezone.utc)),
+        updated=st.datetimes(timezones=st.just(timezone.utc)),
+        next_dispatch=st.one_of(
+            st.none(), st.datetimes(timezones=st.just(timezone.utc))
+        ),
+    )
+    def test_record_from_mongo_should_roundtrip_aware_and_naive_datetimes(
+        self, submitted, updated, next_dispatch
+    ):
+        """Test that the to_mongo/_record_from_mongo datetime round-trip is total.
+
+        Given:
+            A JobRecord with arbitrary aware-UTC ``submitted_at`` /
+            ``updated_at`` and an aware-or-``None`` ``next_dispatch_at``,
+            serialized via ``to_mongo`` and then rendered naive — the BSON
+            Date round-trip some Motor / mongomock-motor variants produce.
+        When:
+            ``_record_from_mongo`` hydrates the doc back into a JobRecord.
+        Then:
+            Every datetime instant is preserved and re-attached to UTC, and
+            a ``None`` ``next_dispatch_at`` survives as ``None`` — the third
+            nullable datetime threads the round-trip without loss.
+        """
+        # Arrange
+        record = JobRecord(
+            job_id="job-rt",
+            workflow_key=f"encode/x/{FIXTURE_MD5}/v1",
+            status=JobStatus.PENDING,
+            dcc="encode",
+            local_id="x",
+            md5=FIXTURE_MD5,
+            pipeline_version=1,
+            submitted_at=submitted,
+            updated_at=updated,
+            next_dispatch_at=next_dispatch,
+        )
+        doc = record.to_mongo()
+        # Simulate the BSON Date round-trip: strip tzinfo on stored dates,
+        # which is the naive rendering ``_record_from_mongo`` must repair.
+        for field in ("submitted_at", "updated_at", "next_dispatch_at"):
+            value = doc.get(field)
+            if isinstance(value, datetime):
+                doc[field] = value.astimezone(timezone.utc).replace(tzinfo=None)
+
+        # Act
+        hydrated = lock._record_from_mongo(doc)
+
+        # Assert
+        assert hydrated.submitted_at == submitted
+        assert hydrated.updated_at == updated
+        assert hydrated.submitted_at.tzinfo is timezone.utc
+        assert hydrated.updated_at.tzinfo is timezone.utc
+        if next_dispatch is None:
+            assert hydrated.next_dispatch_at is None
+        else:
+            assert hydrated.next_dispatch_at == next_dispatch
+            assert hydrated.next_dispatch_at.tzinfo is timezone.utc
+
 
 class TestClaimWorkflowSupersededByChain:
     @pytest.mark.asyncio
diff --git a/tests/test_workflows/test_worker_lan.py b/tests/test_workflows/test_worker_lan.py
index 5305670..0c545b4 100644
--- a/tests/test_workflows/test_worker_lan.py
+++ b/tests/test_workflows/test_worker_lan.py
@@ -204,3 +204,35 @@ def test_backpressure_bound_factory_should_still_declare_host():
 
     # Act & assert
     assert declares_host(factory) is True
+
+
+def test_backpressure_bound_factory_should_survive_cloudpickle():
+    """Test that the backpressure-bound spawn factory round-trips cloudpickle.
+
+    Given:
+        The ``functools.partial(LocalWorker, backpressure=TaskCountBackpressure(1))``
+        factory ``worker_lan.serve`` hands to ``WorkerPool(spawn=…)`` — the
+        exact composite object wool serializes into each spawned worker
+        subprocess.
+    When:
+        It is cloudpickled and reloaded.
+    Then:
+        The reconstructed partial still carries a backpressure hook with the
+        same threshold, so per-worker backpressure survives the spawn
+        boundary rather than silently reverting to unbounded admission.
+    """
+    # Arrange
+    import functools
+
+    import cloudpickle
+    import wool
+
+    from cfdb.workflows.backpressure import TaskCountBackpressure
+
+    factory = functools.partial(wool.LocalWorker, backpressure=TaskCountBackpressure(1))
+
+    # Act
+    restored = cloudpickle.loads(cloudpickle.dumps(factory))
+
+    # Assert
+    assert restored.keywords["backpressure"].threshold == 1
diff --git a/tests/test_workflows/test_worker_main.py b/tests/test_workflows/test_worker_main.py
index 8f4ea9e..757d58d 100644
--- a/tests/test_workflows/test_worker_main.py
+++ b/tests/test_workflows/test_worker_main.py
@@ -4,6 +4,7 @@
 
 from unittest.mock import patch
 
+import pytest
 from click.testing import CliRunner
 
 from cfdb.workflows import worker_main
@@ -183,3 +184,84 @@ def test_main_tls_cli_flags_override_env_vars(self, monkeypatch):
         # Assert
         assert exit_code == 0
         assert captured["tls_ca"] == "/cli/ca.pem"
+
+
+class _StopServe(Exception):
+    """Sentinel raised from the mocked ``worker.start`` to halt ``serve``.
+
+    ``serve`` constructs the worker (with its backpressure kwarg) immediately
+    before ``await worker.start()``, so raising here exits the coroutine right
+    after the construction under test without entering the run loop, the
+    signal-wait, or the health/drain teardown.
+    """
+
+
+def _arrange_serve(mocker) -> object:
+    """Patch ``serve``'s collaborators and return the ``LocalWorker`` spy."""
+    mocker.patch.object(worker_main, "build_worker_credentials", return_value=None)
+    mocker.patch.object(
+        worker_main, "_start_health_server", mocker.AsyncMock(return_value=mocker.Mock())
+    )
+    worker_instance = mocker.Mock()
+    worker_instance.start = mocker.AsyncMock(side_effect=_StopServe)
+    worker_instance.stop = mocker.AsyncMock()
+    return mocker.patch.object(
+        worker_main.wool, "LocalWorker", return_value=worker_instance
+    )
+
+
+class TestServeBackpressureWiring:
+    @pytest.mark.asyncio
+    async def test_serve_should_wire_taskcount_backpressure_when_threshold_positive(
+        self, mocker, monkeypatch
+    ):
+        """Test that a positive task ceiling becomes a TaskCountBackpressure.
+
+        Given:
+            ``WORKER_MAX_CONCURRENT_TASKS`` set to 1 and ``wool.LocalWorker``
+            spied so ``serve`` halts right after constructing the worker.
+        When:
+            ``serve`` is run.
+        Then:
+            The worker is constructed with a ``backpressure`` hook whose
+            threshold is 1, so the ECS worker entrypoint actually serializes
+            its subprocess pipelines.
+        """
+        # Arrange
+        monkeypatch.setattr(worker_main, "WORKER_MAX_CONCURRENT_TASKS", 1)
+        local_worker = _arrange_serve(mocker)
+
+        # Act
+        with pytest.raises(_StopServe):
+            await worker_main.serve(worker_port=0, health_port=0)
+
+        # Assert
+        backpressure = local_worker.call_args.kwargs["backpressure"]
+        assert backpressure is not None
+        assert backpressure.threshold == 1
+
+    @pytest.mark.asyncio
+    async def test_serve_should_disable_backpressure_when_threshold_zero(
+        self, mocker, monkeypatch
+    ):
+        """Test that a zero task ceiling wires ``backpressure=None``.
+
+        Given:
+            ``WORKER_MAX_CONCURRENT_TASKS`` set to 0 (the disable sentinel)
+            and ``wool.LocalWorker`` spied.
+        When:
+            ``serve`` is run.
+        Then:
+            The worker is constructed with ``backpressure=None``, restoring
+            the unbounded admission behavior.
+        """
+        # Arrange
+        monkeypatch.setattr(worker_main, "WORKER_MAX_CONCURRENT_TASKS", 0)
+        local_worker = _arrange_serve(mocker)
+
+        # Act
+        with pytest.raises(_StopServe):
+            await worker_main.serve(worker_port=0, health_port=0)
+
+        # Assert
+        assert local_worker.call_args.kwargs["backpressure"] is None

From d841aa45a108e5bf35530fcaaaad4c1b88f5f362 Mon Sep 17 00:00:00 2001
From: Conrad <conradbzura@gmail.com>
Date: Tue, 23 Jun 2026 11:40:56 -0400
Subject: [PATCH 19/19] docs: Note recovered jobs share the from-submission
 dispatch deadline

Autonomous orphan recovery preserves the original submission time, so an
orphan older than the dispatch deadline is failed capacity: on its first
recovery attempt rather than resumed. State this in the durable-queue
bullet so the autonomous-recovery promise is not read as unbounded;
committed cache artifacts still survive for a later fresh GET to reuse.
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 2f30c80..f7e4ce8 100644
--- a/README.md
+++ b/README.md
@@ -657,7 +657,7 @@ Cache keys are content-addressed using each file's upstream `md5`, so a byte cha
 
 - **Per-worker backpressure** — each worker accepts at most `CFDB_WORKER_MAX_CONCURRENT_TASKS` tasks at once (default `1`), serializing the subprocess pipelines on a 1-vCPU worker. A worker at capacity rejects the dispatch and the API's priority load balancer rotates to the next worker.
 - **Priority (leaky-bucket) load balancing** — the API offers each task to discovered workers in a stable order, so load concentrates on the lowest-ordered workers and over-provisioned workers drain to idle and self-reap (via `CFDB_WORKER_MAX_LIFETIME_SECONDS`) instead of every worker carrying a thin perpetual slice.
-- **Durable queue + retry-to-deadline** — when no worker has capacity, the job is **not** failed and does **not** block the request: it stays `pending` and a durable, Mongo-backed scheduler re-attempts dispatch every `CFDB_WORKFLOW_RETRY_INTERVAL_S` (plus jitter) until a worker frees up or the `CFDB_WORKFLOW_DISPATCH_DEADLINE_S` deadline elapses (then it is failed with a `capacity:`-prefixed error). Because the queue lives in Mongo, an API restart resumes it. On every scheduler tick (including the first, on boot) an orphan-recovery sweep re-queues jobs a crash left mid-flight — a `running` job whose API consumer died, or a fresh `pending` claim that never rescheduled — once they pass the stale threshold (`CFDB_WORKFLOW_STALE_THRESHOLD_S`), so recovery is autonomous and does not wait for a client to re-request the file. On the ECS profile, an overflow also requests one bounded worker spawn (the leaky bucket overflowing), inverting the old unconditional per-request spawn.
+- **Durable queue + retry-to-deadline** — when no worker has capacity, the job is **not** failed and does **not** block the request: it stays `pending` and a durable, Mongo-backed scheduler re-attempts dispatch every `CFDB_WORKFLOW_RETRY_INTERVAL_S` (plus jitter) until a worker frees up or the `CFDB_WORKFLOW_DISPATCH_DEADLINE_S` deadline elapses (then it is failed with a `capacity:`-prefixed error). Because the queue lives in Mongo, an API restart resumes it. On every scheduler tick (including the first, on boot) an orphan-recovery sweep re-queues jobs a crash left mid-flight — a `running` job whose API consumer died, or a fresh `pending` claim that never rescheduled — once they pass the stale threshold (`CFDB_WORKFLOW_STALE_THRESHOLD_S`), so recovery is autonomous and does not wait for a client to re-request the file. Recovery shares the same deadline clock as a fresh job: the re-queue preserves the original submission time, so an orphan older than `CFDB_WORKFLOW_DISPATCH_DEADLINE_S` is failed `capacity:` on its first recovery attempt rather than resumed (its committed cache artifacts survive for a later fresh `GET` to reuse) — recovery is best-effort, not unbounded. On the ECS profile, an overflow also requests one bounded worker spawn (the leaky bucket overflowing), inverting the old unconditional per-request spawn.
 - **Admission ceiling** — once `CFDB_WORKFLOW_MAX_ACTIVE` workflows are active (`pending` + `running`), further preprocessing requests are shed with `429 Retry-After` rather than queued, so the backlog itself is bounded. The check runs before the per-file mutex, so at the ceiling even a re-`GET` for a file whose workflow is already in flight is shed with `429` (rather than attaching to the in-flight job) and the client retries — the deliberate trade for shedding before an unbounded admission race. The readiness `/status` probes never dispatch and so never `429`.
 
 `/index` continues to serve upstream sidecars first when present (the 218 BED→beddb and 4 BED→tbi 4DN cases that publish under `extra.extra_files` or `extra.fourdn.extra_files`); the workflow path is dispatched only when no sidecar exists. Set `?raw=true` to bypass the workflow path entirely and return only the upstream sidecar (404 when none exists).