diff --git a/README.md b/README.md
index 01cd22a..f7e4ce8 100644
--- a/README.md
+++ b/README.md
@@ -548,12 +548,13 @@ Stream file contents from DCCs via HTTPS.
 | Code | Description |
 |------|-------------|
 | 200 | Full file content (GET) or file metadata (HEAD) |
-| 202 | Preprocessed artifact not yet cached — workflow dispatched. `Location: /jobs/{id}` and `Retry-After` headers point to the polling endpoint. |
+| 202 | Preprocessed artifact not yet cached — workflow accepted. `Location: /jobs/{id}` and `Retry-After` headers point to the polling endpoint. Under load the job may sit `pending` in the durable queue before a worker picks it up; poll `/jobs/{id}` for progress. |
 | 206 | Partial content (Range request) |
 | 400 | Invalid DCC, path-param shape, or Range header |
 | 403 | File requires authentication (consortium/protected access) or denied by upstream repository |
 | 404 | File not found, or HEAD probe of a not-yet-cached processed artifact (GET would dispatch) |
 | 416 | Range not satisfiable (out of bounds, or file size unknown so no range can be satisfied) |
+| 429 | Too many active preprocessing jobs — the active-workflow ceiling (`CFDB_WORKFLOW_MAX_ACTIVE`) is reached; `Retry-After` header set. Retry shortly. |
 | 501 | No supported access method (e.g., Globus-only files) |
 | 502 | Upstream service error |
 | 503 | Workflow subsystem shutting down (`Retry-After`) |
@@ -586,12 +587,13 @@ Stream index files (e.g., `.px2`, `.bai`) associated with DCC data files.
 | Code | Description |
 |------|-------------|
 | 200 | Full index file content (GET) or file metadata (HEAD) |
-| 202 | Index not yet cached — workflow dispatched. `Location: /jobs/{id}` and `Retry-After` headers point to the polling endpoint. |
+| 202 | Index not yet cached — workflow accepted. `Location: /jobs/{id}` and `Retry-After` headers point to the polling endpoint. Under load the job may sit `pending` in the durable queue before a worker picks it up; poll `/jobs/{id}` for progress. |
 | 206 | Partial content (Range request) |
 | 400 | Invalid DCC, path-param shape, or Range header |
 | 403 | File requires consortium/protected access (HuBMAP) |
 | 404 | File not found, format has no index (CSV/TSV/bigWig), or HEAD probe of a not-yet-cached index |
 | 416 | Range not satisfiable |
+| 429 | Too many active preprocessing jobs — the active-workflow ceiling (`CFDB_WORKFLOW_MAX_ACTIVE`) is reached; `Retry-After` header set. Retry shortly. |
 | 502 | Upstream service error or malformed sidecar |
 | 503 | Workflow subsystem disabled (set `SYNC_DATA_DIR`) for a processable format, or shutting down (`Retry-After`) |
 
@@ -651,6 +653,13 @@ The preprocessed artifact is the default response. Clients that want the raw ups
 
 Cache keys are content-addressed using each file's upstream `md5`, so a byte change upstream (with the sync pipeline refreshing `md5`) invalidates the cache automatically.
 
+**Bounded concurrency, durable queuing, and admission control.** Dispatch is bounded on three cooperating layers so an unauthenticated burst on `/data` and `/index` can't oversubscribe the worker fleet or queue unbounded work:
+
+- **Per-worker backpressure** — each worker accepts at most `CFDB_WORKER_MAX_CONCURRENT_TASKS` tasks at once (default `1`), serializing the subprocess pipelines on a 1-vCPU worker. A worker at capacity rejects the dispatch and the API's priority load balancer rotates to the next worker.
+- **Priority (leaky-bucket) load balancing** — the API offers each task to discovered workers in a stable order, so load concentrates on the lowest-ordered workers and over-provisioned workers drain to idle and self-reap (via `CFDB_WORKER_MAX_LIFETIME_SECONDS`) instead of every worker carrying a thin perpetual slice.
+- **Durable queue + retry-to-deadline** — when no worker has capacity, the job is **not** failed and does **not** block the request: it stays `pending` and a durable, Mongo-backed scheduler re-attempts dispatch every `CFDB_WORKFLOW_RETRY_INTERVAL_S` (plus jitter) until a worker frees up or the `CFDB_WORKFLOW_DISPATCH_DEADLINE_S` deadline elapses (then it is failed with a `capacity:`-prefixed error). Because the queue lives in Mongo, an API restart resumes it. On every scheduler tick (including the first, on boot) an orphan-recovery sweep re-queues jobs a crash left mid-flight — a `running` job whose API consumer died, or a fresh `pending` claim that never rescheduled — once they pass the stale threshold (`CFDB_WORKFLOW_STALE_THRESHOLD_S`), so recovery is autonomous and does not wait for a client to re-request the file. Recovery shares the same deadline clock as a fresh job: the re-queue preserves the original submission time, so an orphan older than `CFDB_WORKFLOW_DISPATCH_DEADLINE_S` is failed `capacity:` on its first recovery attempt rather than resumed (its committed cache artifacts survive for a later fresh `GET` to reuse) — recovery is best-effort, not unbounded. On the ECS profile, an overflow also requests one bounded worker spawn (the leaky bucket overflowing), inverting the old unconditional per-request spawn.
+- **Admission ceiling** — once `CFDB_WORKFLOW_MAX_ACTIVE` workflows are active (`pending` + `running`), further preprocessing requests are shed with `429 Retry-After` rather than queued, so the backlog itself is bounded. The check runs before the per-file mutex, so at the ceiling even a re-`GET` for a file whose workflow is already in flight is shed with `429` (rather than attaching to the in-flight job) and the client retries — the deliberate trade for shedding before an unbounded admission race. The readiness `/status` probes never dispatch and so never `429`.
+
 `/index` continues to serve upstream sidecars first when present (the 218 BED→beddb and 4 BED→tbi 4DN cases that publish under `extra.extra_files` or `extra.fourdn.extra_files`); the workflow path is dispatched only when no sidecar exists. Set `?raw=true` to bypass the workflow path entirely and return only the upstream sidecar (404 when none exists).
 
 Required environment variables:
@@ -659,10 +668,16 @@ Required environment variables:
 - `WORKFLOW_WORKER_COUNT` — local-dev only: how many workers the LAN pool (`python -m cfdb.workflows.worker_lan`) spawns and publishes (default `2`). The API itself no longer leases a fixed count — its `WorkerPool` admits every worker discovery surfaces. In the ECS profile one ephemeral worker is launched per workflow (via `EcsProvisioner` `RunTask`), so the **maximum concurrent worker count is bounded by the AWS Fargate vCPU service quota** for the account/region — raise it in Service Quotas; at 1 vCPU per worker it maps roughly 1:1 to concurrent workers. A burst of N distinct uncached files can launch up to ~N workers concurrently, subject to that quota.
 - `WORKFLOW_POOL_NAMESPACE` — wool discovery namespace shared by the API and the external worker pool (default `cfdb-workers`). Both processes must agree on this value or dispatch will hang waiting for workers.
 
+Bounded-concurrency control (issue #45):
+
+- `CFDB_WORKER_MAX_CONCURRENT_TASKS` — per-worker backpressure threshold: a worker rejects a dispatch (gRPC `RESOURCE_EXHAUSTED`, which the priority load balancer treats as transient and rotates past) once it already has this many tasks in flight (default `1`, to serialize the subprocess pipelines on a 1-vCPU worker; `0` disables backpressure). Set on the **worker** process.
+- `CFDB_WORKFLOW_MAX_ACTIVE` — admission ceiling on concurrently active workflows (`pending` + `running` jobs). Once this many are active, `/data` and `/index` shed new preprocessing requests with `429 Retry-After` before claiming the per-file mutex (default `1024`). Soft cap — a count-then-claim race may briefly overshoot. Set on the **API**.
+- `CFDB_WORKFLOW_RETRY_INTERVAL_S` — base cadence (plus a small random jitter) at which the durable retry scheduler re-attempts dispatch for a job awaiting worker capacity (default `120`, i.e. 2 min). A dispatch attempt that finds no free worker leaves the job `pending` and rescheduled rather than blocking the request. Set on the **API**.
+- `CFDB_WORKFLOW_DISPATCH_DEADLINE_S` — how long a job may wait for worker capacity (re-attempted on the retry cadence above) before the scheduler fails it with a `capacity:`-prefixed error (default `14400`, i.e. 4 h). Replaces the former single in-request `CFDB_WORKFLOW_DISPATCH_WAIT_S` wait — instead of blocking one request on a cold start, the job queues durably and is retried to this deadline. Set on the **API**.
+
 Optional tunables (with defaults):
 
 - `CFDB_WORKFLOW_DURATION_CAP_S` — per-workflow wall-clock cap (default `14400`, i.e. 4 h — sized for multi-hour preprocessing runs; lower it for fixture-bound dev).
-- `CFDB_WORKFLOW_DISPATCH_WAIT_S` — how long `ensure_workflow` waits for a free worker before giving up (default `240`, i.e. 4 min — sized for an ECS Fargate cold start: image pull + health check typically take 1-3 min, so a smaller budget can expire before a freshly-provisioned worker reports HEALTHY and hard-fail the first request. Lower it for fixture-bound dev where workers are already running).
 - `CFDB_WORKFLOW_HEARTBEAT_INTERVAL_S` — cadence at which the wool routine emits heartbeat events during quiet stages so the API can refresh `JobRecord.updated_at` (default `300`). The stale-reclaim threshold below is sized as `2 × heartbeat + safety_margin`; lowering this knob without also lowering the threshold widens the false-reclaim window.
 - `CFDB_WORKFLOW_STALE_THRESHOLD_S` — `updated_at` age beyond which an active row is reclaimable (default `900`; sized as `2 × heartbeat_interval + safety_margin` so a single missed heartbeat does not falsely reclaim a healthy worker).
 - `CFDB_SAMTOOLS_THREADS` (default `1`), `CFDB_SORT_PARALLEL` (default `2`) — CPU/thread caps for `samtools sort/index` and GNU `sort` respectively.
diff --git a/scripts/create-indexes.js b/scripts/create-indexes.js
index dad8f29..c7ebff3 100644
--- a/scripts/create-indexes.js
+++ b/scripts/create-indexes.js
@@ -221,6 +221,17 @@ ensureIndex(
     { name: "status_updated_at" }
 );
 
+// Serves the durable retry scheduler's due-dispatch lease
+// (workflows.lock.lease_due_dispatch): { status: "pending",
+// next_dispatch_at: { $lte: now } } sorted by next_dispatch_at asc. The
+// status equality prefix plus the next_dispatch_at range/sort are both
+// index-served, so the per-tick poll is not a scan of all PENDING rows.
+ensureIndex(
+    db.jobs,
+    { status: 1, next_dispatch_at: 1 },
+    { name: "status_next_dispatch_at" }
+);
+
 // TTL index on terminal job rows. Without this, every stale-reclaim
 // transition leaves a permanent ``failed`` document and the collection
 // grows unbounded for any frequently-touched workflow_key. The partial
diff --git a/src/cfdb/api/main.py b/src/cfdb/api/main.py
index eb1fdb8..53db36d 100644
--- a/src/cfdb/api/main.py
+++ b/src/cfdb/api/main.py
@@ -33,6 +33,7 @@
 from cfdb.workflows.credentials import build_worker_credentials
 from cfdb.workflows.discovery import EcsDiscovery
 from cfdb.workflows.executor import WoolExecutor
+from cfdb.workflows.loadbalancer import PriorityLoadBalancer
 from cfdb.workflows.processors.bam import BamIndexProcessor
 from cfdb.workflows.processors.registry import default_registry
 from cfdb.workflows.processors.tabix import TabixIntervalProcessor
@@ -329,19 +330,27 @@ async def lifespan(_: FastAPI):
                 # dispatch — fired while the Fargate worker is still booting
                 # (image pull + health check, ~1-3 min) — blow the quorum
                 # wait and raise an un-retried ``TimeoutError`` out of
-                # ``WorkerPool.start``. cfdb's ``_open_stream_with_retry``
-                # only retries ``wool.NoWorkersAvailable``, so that
+                # ``WorkerPool.start``. cfdb's dispatch attempt only treats
+                # ``wool.NoWorkersAvailable`` as "no capacity", so that
                 # ``TimeoutError`` fails the first request hard (and the
                 # provisioner's cancelled ``RunTask`` leaks an idle worker).
                 # With the gate off, a cold-start dispatch surfaces
-                # ``NoWorkersAvailable`` instead, which the executor already
-                # absorbs within its own ``_DISPATCH_WAIT_SECONDS`` budget —
-                # keeping the cold-start wait in one layer cfdb owns rather
-                # than duplicating it across wool's quorum wait too.
+                # ``NoWorkersAvailable`` instead, which the executor absorbs
+                # by leaving the job PENDING and re-attempting it on the
+                # durable retry scheduler's cadence — keeping the cold-start
+                # wait in one layer cfdb owns rather than duplicating it
+                # across wool's quorum wait too.
                 async with wool.WorkerPool(
                     discovery=discovery,
                     credentials=worker_credentials,
                     quorum=0,
+                    # Priority/leaky-bucket balancing: always offer a task to
+                    # discovered workers in the same stable order. With
+                    # per-worker backpressure (one task each), load fills the
+                    # lowest-ordered workers first and higher-ordered ones
+                    # drain to idle and self-reap, instead of every worker
+                    # carrying a thin perpetual slice (issue #45).
+                    loadbalancer=PriorityLoadBalancer(),
                 ):
                     # Snapshot the lifespan task's contextvars after the
                     # pool's ``__aenter__`` has populated wool's internals.
@@ -354,6 +363,13 @@ async def lifespan(_: FastAPI):
                         provisioner=provisioner,
                     )
                     executor_handle = api.executor
+                    # Start the durable retry scheduler — the single dispatch
+                    # driver — here, inside the pool context, so the created
+                    # task inherits wool's dispatch contextvars (same as
+                    # request-spawned tasks via ``attach_wool_context``).
+                    # ``_drain_executor`` cancels it on teardown before the
+                    # pool closes.
+                    api.executor.start_scheduler()
                     log.info(
                         "Workflow subsystem enabled: profile=%s cache=%s "
                         "workdir=%s discovery=%s provisioner=%s "
diff --git a/src/cfdb/api/routers/cache_stream.py b/src/cfdb/api/routers/cache_stream.py
index 5b89651..830f5cc 100644
--- a/src/cfdb/api/routers/cache_stream.py
+++ b/src/cfdb/api/routers/cache_stream.py
@@ -19,7 +19,11 @@
 from cfdb import api
 from cfdb.services import drs
 from cfdb.workflows.cache import CacheBackend
-from cfdb.workflows.executor import ExecutorDraining, WorkflowNotApplicable
+from cfdb.workflows.executor import (
+    AdmissionRejected,
+    ExecutorDraining,
+    WorkflowNotApplicable,
+)
 from cfdb.workflows.models import ArtifactKind
 
 logger = logging.getLogger(__name__)
@@ -275,6 +279,18 @@ async def serve_workflow_artifact_or_dispatch(
             detail="Service is shutting down; please retry",
             headers={"Retry-After": "30"},
         )
+    except AdmissionRejected as exc:
+        # The active-workflow ceiling is hit — shed this request with a
+        # 429 so the client backs off rather than queuing unbounded work.
+        # NOT a WorkflowNotApplicable subclass, so it must be caught
+        # explicitly (before that handler) to avoid falling through to
+        # direct upstream streaming, which would bypass the bounded
+        # pipeline entirely.
+        raise HTTPException(
+            status_code=status.HTTP_429_TOO_MANY_REQUESTS,
+            detail="Too many active preprocessing jobs; please retry shortly",
+            headers={"Retry-After": str(exc.retry_after_seconds)},
+        )
     except WorkflowNotApplicable:
         # Race: processor accepted this file, but executor disagreed.
         # Fall through to the caller's own path rather than 500.
diff --git a/src/cfdb/indexes.py b/src/cfdb/indexes.py
index a3f9ccd..7bffd5a 100644
--- a/src/cfdb/indexes.py
+++ b/src/cfdb/indexes.py
@@ -134,6 +134,16 @@ def operational_index_specs() -> list[IndexSpec]:
         ),
         IndexSpec("jobs", [("job_id", 1)], name="job_id_unique", unique=True),
         IndexSpec("jobs", [("status", 1), ("updated_at", 1)], name="status_updated_at"),
+        # Serves the durable retry scheduler's due-dispatch lease
+        # (workflows.lock.lease_due_dispatch): {status: pending,
+        # next_dispatch_at: {$lte: now}} sorted by next_dispatch_at asc.
+        # The status equality prefix + next_dispatch_at range/sort are
+        # both index-served, so the per-tick poll is not a PENDING scan.
+        IndexSpec(
+            "jobs",
+            [("status", 1), ("next_dispatch_at", 1)],
+            name="status_next_dispatch_at",
+        ),
         # TTL on terminal rows so the collection doesn't grow unbounded.
         # The partial filter excludes active rows (active=True) so an
         # in-flight job is never reaped. 7 days gives operators a window
diff --git a/src/cfdb/workflows/__init__.py b/src/cfdb/workflows/__init__.py
index e6509ca..12ae181 100644
--- a/src/cfdb/workflows/__init__.py
+++ b/src/cfdb/workflows/__init__.py
@@ -31,12 +31,6 @@
   preprocessing runs (e.g., a ``samtools sort`` on a multi-GB BAM
   followed by ``samtools index``). Operators running on bounded fixtures
   in dev should lower this via env.
-- ``CFDB_WORKFLOW_DISPATCH_WAIT_S`` — how long ``ensure_workflow`` waits
-  for a wool worker to become available before giving up. Default ``240``
-  (4 min) — sized for an ECS Fargate cold start (image pull + health
-  check, typically 1-3 min). A smaller value risks exhausting the retry
-  budget before a freshly-provisioned worker reports HEALTHY; lower it
-  for fixture-bound dev where workers are already running.
 - ``CFDB_WORKFLOW_HEARTBEAT_INTERVAL_S`` — how often the routine emits a
   heartbeat event during quiet periods so the API can refresh
   ``updated_at`` on the JobRecord. Default ``300`` (5 min).
@@ -44,6 +38,28 @@
   RUNNING/PENDING row is considered stale and reclaimable. Default ``900``
   (15 min) — sized as ``2 × heartbeat_interval + safety_margin`` so a
   single missed heartbeat does not falsely reclaim a healthy worker.
+
+Concurrency / admission (bounded-concurrency control, issue #45):
+
+- ``CFDB_WORKER_MAX_CONCURRENT_TASKS`` — per-worker backpressure
+  threshold: a worker rejects a dispatch (gRPC ``RESOURCE_EXHAUSTED``,
+  which the load balancer treats as transient and rotates past) once it
+  already has this many tasks in flight. Default ``1`` (serialize the
+  subprocess pipelines on a 1-vCPU worker); ``0`` disables backpressure
+  (unbounded — the prior behavior). Enforced worker-side via a
+  ``wool.BackpressureLike`` hook.
+- ``CFDB_WORKFLOW_MAX_ACTIVE`` — admission ceiling on concurrently active
+  workflows (``pending`` + ``running`` jobs). ``ensure_workflow`` returns
+  ``429 Retry-After`` once this many are active, shedding load before
+  claiming the per-file mutex. Default ``1024``. Soft cap
+  (count-then-insert may briefly overshoot).
+- ``CFDB_WORKFLOW_DISPATCH_DEADLINE_S`` — how long a job may wait for
+  worker capacity (re-dispatched on the retry cadence below) before it is
+  marked ``failed``. Default ``14400`` (4 h).
+- ``CFDB_WORKFLOW_RETRY_INTERVAL_S`` — base cadence at which the durable
+  retry scheduler re-attempts dispatch for a queued job awaiting
+  capacity; a small random jitter is added per attempt. Default ``120``
+  (2 min).
 """
 
 from __future__ import annotations
@@ -112,24 +128,17 @@ def _positive_int(name: str, value: str, *, minimum: int = 0) -> int:
 # Runtime caps require ``minimum=1`` — zero values silently break the
 # corresponding loop or timeout (``asyncio.timeout(0)`` fires immediately,
 # ``HEARTBEAT_INTERVAL_S=0`` spins, ``STALE_THRESHOLD_S=0`` reclaims every
-# active job on every check, ``DISPATCH_WAIT_S=0`` makes every cold start
-# look like a hard failure).
+# active job on every check).
 WORKFLOW_DURATION_CAP_S: Final = _positive_int(
     "CFDB_WORKFLOW_DURATION_CAP_S",
     os.getenv("CFDB_WORKFLOW_DURATION_CAP_S", "14400"),
     minimum=1,
 )
-# Default sized for a Fargate cold start (image pull + health check,
-# ~1-3 min). With ``quorum=0`` a cold-start dispatch surfaces
-# ``NoWorkersAvailable``, which the executor retries inside this budget;
-# the old 60s default could expire before the just-launched worker
-# reports HEALTHY, hard-failing the first request. 240s covers the
-# cold-start window with headroom while staying env-overridable.
-WORKFLOW_DISPATCH_WAIT_S: Final = _positive_int(
-    "CFDB_WORKFLOW_DISPATCH_WAIT_S",
-    os.getenv("CFDB_WORKFLOW_DISPATCH_WAIT_S", "240"),
-    minimum=1,
-)
+# NOTE: the former ``CFDB_WORKFLOW_DISPATCH_WAIT_S`` (a single in-request
+# cold-start wait) was removed in the #45 restructure. A dispatch that
+# finds no capacity no longer blocks the request — the job queues durably
+# and the retry scheduler re-attempts it on ``CFDB_WORKFLOW_RETRY_INTERVAL_S``
+# until ``CFDB_WORKFLOW_DISPATCH_DEADLINE_S`` (both below).
 WORKFLOW_HEARTBEAT_INTERVAL_S: Final = _positive_int(
     "CFDB_WORKFLOW_HEARTBEAT_INTERVAL_S",
     os.getenv("CFDB_WORKFLOW_HEARTBEAT_INTERVAL_S", "300"),
@@ -141,15 +150,53 @@ def _positive_int(name: str, value: str, *, minimum: int = 0) -> int:
     minimum=1,
 )
 
-# The stale-reclaim threshold MUST exceed two heartbeat intervals so a
-# single missed heartbeat (e.g., a brief Mongo write delay) does not
-# falsely reclaim a healthy worker. The default values (300 / 900)
-# satisfy this with a 300s safety margin; an operator-tuned pair that
-# violates the invariant is a configuration error.
-if WORKFLOW_STALE_THRESHOLD_S < 2 * WORKFLOW_HEARTBEAT_INTERVAL_S:
+# The stale-reclaim threshold MUST strictly exceed two heartbeat intervals.
+# A single missed heartbeat (e.g., a brief Mongo write delay) must not
+# falsely reclaim a healthy worker, AND the executor's
+# ``_HEARTBEAT_LOSS_ABORT_S = max(HEARTBEAT, STALE - HEARTBEAT)`` needs a
+# non-zero margin above one interval to keep its single-transient-failure
+# tolerance: at the boundary ``STALE == 2 * HEARTBEAT`` that window
+# collapses to exactly one interval (see issue #45 review A7). The default
+# values (300 / 900) satisfy the strict form with a 300s margin; an
+# operator-tuned pair that violates it is a configuration error.
+if WORKFLOW_STALE_THRESHOLD_S <= 2 * WORKFLOW_HEARTBEAT_INTERVAL_S:
     raise ValueError(
         f"CFDB_WORKFLOW_STALE_THRESHOLD_S ({WORKFLOW_STALE_THRESHOLD_S}s) "
-        f"must be >= 2 * CFDB_WORKFLOW_HEARTBEAT_INTERVAL_S "
+        f"must be > 2 * CFDB_WORKFLOW_HEARTBEAT_INTERVAL_S "
         f"({WORKFLOW_HEARTBEAT_INTERVAL_S}s) to avoid false stale-reclaim "
-        "of healthy workers between heartbeats."
+        "of healthy workers between heartbeats and to keep the "
+        "heartbeat-loss abort window strictly positive."
     )
+
+# --- Bounded-concurrency control (issue #45) --------------------------------
+
+# Per-worker backpressure threshold. ``minimum=0`` because 0 is a valid
+# "disable backpressure" sentinel — the worker wiring passes
+# ``backpressure=None`` when 0, restoring the unbounded prior behavior.
+WORKER_MAX_CONCURRENT_TASKS: Final = _positive_int(
+    "CFDB_WORKER_MAX_CONCURRENT_TASKS",
+    os.getenv("CFDB_WORKER_MAX_CONCURRENT_TASKS", "1"),
+    minimum=0,
+)
+# Admission ceiling on concurrently active (pending + running) workflows.
+# ``ensure_workflow`` returns 429 once this many are active. ``minimum=1``
+# because 0 would reject every request.
+WORKFLOW_MAX_ACTIVE: Final = _positive_int(
+    "CFDB_WORKFLOW_MAX_ACTIVE",
+    os.getenv("CFDB_WORKFLOW_MAX_ACTIVE", "1024"),
+    minimum=1,
+)
+# How long a job may wait for worker capacity before being failed.
+# ``minimum=1`` — 0 would fail every queued job on its first attempt.
+WORKFLOW_DISPATCH_DEADLINE_S: Final = _positive_int(
+    "CFDB_WORKFLOW_DISPATCH_DEADLINE_S",
+    os.getenv("CFDB_WORKFLOW_DISPATCH_DEADLINE_S", "14400"),
+    minimum=1,
+)
+# Base cadence for the durable retry scheduler's re-dispatch attempts;
+# per-attempt jitter is layered on top. ``minimum=1`` — 0 would busy-spin.
+WORKFLOW_RETRY_INTERVAL_S: Final = _positive_int(
+    "CFDB_WORKFLOW_RETRY_INTERVAL_S",
+    os.getenv("CFDB_WORKFLOW_RETRY_INTERVAL_S", "120"),
+    minimum=1,
+)
diff --git a/src/cfdb/workflows/backpressure.py b/src/cfdb/workflows/backpressure.py
new file mode 100644
index 0000000..8f89966
--- /dev/null
+++ b/src/cfdb/workflows/backpressure.py
@@ -0,0 +1,63 @@
+"""Per-worker backpressure hook for the wool worker pool (issue #45).
+
+A worker rejects an incoming dispatch once it already has a configured
+number of tasks in flight, serializing the subprocess pipelines
+(samtools/bgzip/tabix/sort/…) the routine shells out to so a single
+1-vCPU/2-GiB worker stops oversubscribing. wool surfaces the rejection to
+the dispatcher as gRPC ``RESOURCE_EXHAUSTED`` — a transient error the load
+balancer rotates past to the next worker.
+
+The hook is a plain, picklable object holding only an ``int`` because
+``wool.LocalWorker`` serializes ``backpressure`` into its worker
+subprocess. Keep it free of async/runtime state (no locks, queues, or
+event-loop references) so it survives that serialization boundary cleanly.
+"""
+
+from __future__ import annotations
+
+import wool
+
+
+class TaskCountBackpressure:
+    """Reject a dispatch when the worker already has ``threshold`` tasks.
+
+    Implements :class:`wool.BackpressureLike`: returns ``True`` (reject)
+    when the worker's in-flight task count has reached ``threshold`` and
+    ``False`` (accept) otherwise. With ``threshold == 1`` each worker runs
+    at most one routine at a time, serializing its CPU/IO-heavy subprocess
+    pipeline. The "disable backpressure" case (threshold 0) is handled by
+    the wiring, which passes ``backpressure=None`` rather than this hook.
+    """
+
+    def __init__(self, threshold: int) -> None:
+        if threshold < 1:
+            raise ValueError(
+                f"TaskCountBackpressure threshold must be >= 1; got {threshold} "
+                "(pass backpressure=None to disable backpressure entirely)"
+            )
+        self.threshold = threshold
+
+    def __call__(self, ctx: wool.BackpressureContext) -> bool:
+        """Return True to reject the incoming task, False to accept it."""
+        return ctx.active_task_count >= self.threshold
+
+    def __repr__(self) -> str:
+        return f"{type(self).__name__}(threshold={self.threshold})"
+
+
+def backpressure_for(threshold: int) -> TaskCountBackpressure | None:
+    """Build the per-worker backpressure hook for ``threshold``, or None.
+
+    ``threshold == 0`` disables backpressure: the worker wiring passes
+    ``backpressure=None`` to wool, restoring unbounded admission (the prior
+    behavior). Any positive value yields a :class:`TaskCountBackpressure`
+    that rejects once the worker has that many tasks in flight. A negative
+    threshold is rejected rather than silently disabling backpressure, so
+    the helper's contract is total.
+    """
+    if threshold < 0:
+        raise ValueError(
+            f"backpressure threshold must be >= 0; got {threshold} "
+            "(0 disables backpressure, a positive value sets the limit)"
+        )
+    return TaskCountBackpressure(threshold) if threshold > 0 else None
diff --git a/src/cfdb/workflows/executor.py b/src/cfdb/workflows/executor.py
index 41510a2..54c8619 100644
--- a/src/cfdb/workflows/executor.py
+++ b/src/cfdb/workflows/executor.py
@@ -38,18 +38,23 @@
 import asyncio
 import contextlib
 import logging
+import random
 import shutil
+import uuid
 from abc import ABC, abstractmethod
 from collections.abc import AsyncIterator
+from datetime import datetime, timedelta, timezone
 from pathlib import Path
 from typing import Any, Optional
 
 import wool
 
 from cfdb.workflows import (
-    WORKFLOW_DISPATCH_WAIT_S,
+    WORKFLOW_DISPATCH_DEADLINE_S,
     WORKFLOW_DURATION_CAP_S,
     WORKFLOW_HEARTBEAT_INTERVAL_S,
+    WORKFLOW_MAX_ACTIVE,
+    WORKFLOW_RETRY_INTERVAL_S,
     keys as key_utils,
 )
 from cfdb.workflows.cache import CacheBackend
@@ -62,11 +67,16 @@
     WorkflowEvent,
 )
 from cfdb.workflows.lock import (
+    STALE_WORKFLOW_THRESHOLD,
     claim_workflow,
+    count_active_workflows,
     heartbeat_workflow,
+    lease_due_dispatch,
     mark_running,
     record_stage_complete,
     release_workflow,
+    requeue_orphaned_dispatch,
+    reschedule_dispatch,
     update_progress,
 )
 from cfdb.workflows.models import JobRecord, JobStatus
@@ -87,37 +97,44 @@
 #: Sourced from ``CFDB_WORKFLOW_DURATION_CAP_S`` (default 14400 s = 4 h)
 #: so deployments can override without a code change. Module-level
 #: rather than a ctor kwarg because tests cap it via ``monkeypatch.setattr``
-#: alongside the other dispatch knobs (``_DISPATCH_WAIT_SECONDS``,
+#: alongside the other dispatch knobs (``_RETRY_INTERVAL_SECONDS``,
 #: ``_HEARTBEAT_INTERVAL_S``) and uniform module-state keeps fixture
 #: shape consistent.
 _WORKFLOW_DURATION_CAP_SECONDS = WORKFLOW_DURATION_CAP_S
 
-#: Persisted ``error`` prefix surfaced to clients when the dispatch
-#: budget is exhausted with no workers available. The prefix is part
-#: of the on-wire contract — clients parse it to decide whether to
-#: resubmit. Kept as a module constant so tests can import the same
-#: symbol the executor writes.
+#: Persisted ``error`` prefix surfaced to clients when a job is failed for
+#: lack of worker capacity (its dispatch deadline elapsed while queued).
+#: Part of the on-wire contract — clients parse it to decide whether to
+#: resubmit. Kept as a module constant so tests can import the same symbol
+#: the executor writes.
 _ERR_PREFIX_CAPACITY = "capacity:"
 
-#: Persisted ``error`` prefix surfaced to clients when the provisioner
-#: itself raises a non-retryable exception. Distinct from
-#: :data:`_ERR_PREFIX_CAPACITY` so clients can tell "retry me later,
-#: capacity issue" apart from "the provisioner crashed".
-_ERR_PREFIX_PROVISIONER = "provisioner:"
-
-#: Total wall-clock budget waiting for a leased worker to surface during
-#: dispatch. Sized for an ECS Fargate cold start (image pull + health
-#: check, typically 1-3 min; see ``CFDB_WORKFLOW_DISPATCH_WAIT_S``,
-#: default 240s) so a ``NoWorkersAvailable`` at the moment of dispatch
-#: does not immediately fail the job — instead we poll the pool until a
-#: worker shows up or the budget expires.
-_DISPATCH_WAIT_SECONDS = float(WORKFLOW_DISPATCH_WAIT_S)
-
-#: Cadence at which we re-attempt the dispatch while waiting on capacity.
-#: Sub-second polling buys us nothing because ECS scale-up dominates the
-#: wait; a 1-second cadence keeps the noise floor low while still
-#: surfacing a freshly-available worker quickly.
-_DISPATCH_RETRY_INTERVAL_SECONDS = 1.0
+#: Admission ceiling on concurrently-active workflows (pending + running).
+#: ``ensure_workflow`` sheds requests beyond this with ``AdmissionRejected``
+#: → HTTP 429. Module-level so tests ``monkeypatch.setattr`` it alongside
+#: the other dispatch knobs. Soft cap: a count-then-claim race can briefly
+#: overshoot, which is acceptable for a flood guard.
+_MAX_ACTIVE_WORKFLOWS = WORKFLOW_MAX_ACTIVE
+
+#: Base cadence for the durable retry scheduler. A dispatch attempt that
+#: finds no worker capacity leaves the job PENDING and reschedules its next
+#: attempt this far out (plus jitter). Also the scheduler loop's idle wait
+#: between ticks, so a freshly-rescheduled job is re-attempted promptly.
+_RETRY_INTERVAL_SECONDS = float(WORKFLOW_RETRY_INTERVAL_S)
+
+#: Upper bound on the random jitter added to each reschedule, to spread a
+#: thundering herd of queued jobs across retry ticks instead of stampeding
+#: the pool in lockstep. Clamped to the retry interval, so tests that
+#: monkeypatch ``_RETRY_INTERVAL_SECONDS`` low get correspondingly small,
+#: bounded jitter (determinism in tests comes from that clamp, not from
+#: zeroing this constant).
+_RETRY_JITTER_MAX_SECONDS = min(30.0, _RETRY_INTERVAL_SECONDS)
+
+#: Max wall-clock a job may wait for capacity (measured from
+#: ``submitted_at``) before the scheduler fails it ``capacity:``. Replaces
+#: the old in-request 240s wait: instead of blocking one request, the job
+#: queues durably and is retried until it runs or this deadline elapses.
+_DISPATCH_DEADLINE_SECONDS = float(WORKFLOW_DISPATCH_DEADLINE_S)
 
 #: Cadence at which the wool routine emits ``heartbeat`` events into its
 #: stream during quiet periods. The API process consumes the stream and
@@ -125,6 +142,30 @@
 #: long-running stage doesn't get reclaimed as stale.
 _HEARTBEAT_INTERVAL_S = float(WORKFLOW_HEARTBEAT_INTERVAL_S)
 
+#: How long a consumer may fail to refresh its heartbeat before it aborts
+#: its own job. This makes a stale ``RUNNING`` row reliably mean "the
+#: consumer is gone": a consumer that cannot reach Mongo stops computing
+#: (and lets ``stream.aclose`` cancel the remote worker) *before* the
+#: orphan sweep's ``STALE_WORKFLOW_THRESHOLD`` would revive the row, so
+#: recovery is spared a redundant second run of a still-live job. Sized
+#: strictly between one heartbeat interval (a single transient write
+#: failure must not abort) and the sweep threshold (the consumer must give
+#: up with margin to spare); the strict ``STALE > 2*HEARTBEAT`` startup
+#: invariant keeps that window strictly positive (at ``STALE == 2*HEARTBEAT``
+#: it would collapse to exactly one interval, leaving no transient-failure
+#: tolerance — see issue #45 review A7).
+_HEARTBEAT_LOSS_ABORT_S = max(
+    _HEARTBEAT_INTERVAL_S,
+    STALE_WORKFLOW_THRESHOLD.total_seconds() - _HEARTBEAT_INTERVAL_S,
+)
+
+#: Max jobs leased per scheduler tick. Bounds the per-tick dispatch
+#: fan-out so a large queued backlog (up to ``WORKFLOW_MAX_ACTIVE``) can't
+#: spawn thousands of concurrent attempts in one tick; the remainder is
+#: leased on subsequent ticks. Generous enough to keep a reasonable fleet
+#: saturated, small enough to avoid a thundering herd against the pool.
+_MAX_DISPATCHES_PER_TICK = 64
+
 #: Grace given to ``_finalize`` tasks during ``drain``'s second phase
 #: after the primary ``_pending_tasks`` gather is done. A workflow that
 #: failed mid-write — Mongo connection blip, S3 cache cleanup blocked
@@ -153,6 +194,85 @@ class ExecutorDraining(WorkflowNotApplicable):
     """
 
 
+class AdmissionRejected(RuntimeError):
+    """Raised by ``ensure_workflow`` when the active-workflow ceiling is hit.
+
+    Deliberately NOT a :class:`WorkflowNotApplicable` subclass: a caller
+    that catches ``WorkflowNotApplicable`` falls through to direct upstream
+    streaming, but an admission rejection must surface as a distinct 429
+    (with ``Retry-After``) so the client backs off rather than silently
+    bypassing the bounded pipeline. Carries the observed ``active`` count,
+    the ``ceiling`` it hit, and a ``retry_after_seconds`` hint for the
+    response header.
+    """
+
+    def __init__(
+        self,
+        *,
+        active: int,
+        ceiling: int,
+        retry_after_seconds: Optional[int] = None,
+    ) -> None:
+        self.active = active
+        self.ceiling = ceiling
+        self.retry_after_seconds = (
+            retry_after_seconds
+            if retry_after_seconds is not None
+            else max(1, int(_RETRY_INTERVAL_SECONDS))
+        )
+        super().__init__(
+            f"Active workflow ceiling reached ({active} >= {ceiling}); "
+            "shedding request"
+        )
+
+
+class HeartbeatLost(RuntimeError):
+    """Raised by the stream consumer when it can no longer refresh the job.
+
+    Signals that ``heartbeat_workflow`` has been failing for longer than
+    :data:`_HEARTBEAT_LOSS_ABORT_S` while a worker is still streaming
+    ``heartbeat`` events — i.e. this consumer has lost its connection to
+    Mongo and can no longer keep the row fresh. Raising this aborts the
+    attempt and ``aclose``s the stream, cancelling the remote worker so it
+    stops computing.
+
+    This is a *best-effort optimization*, NOT the load-bearing
+    double-dispatch guard. A ``heartbeat`` event is only injected after a
+    quiet ``heartbeat_interval`` of stage silence, so a stage that streams
+    ``stage_complete`` / ``progress`` faster than that interval never trips
+    the check — meaning a Mongo-blind consumer is not guaranteed to abort
+    before the orphan sweep revives its row. What actually keeps a recovered
+    re-dispatch from corrupting a still-live attempt is the per-attempt
+    workdir nonce (``_attempt_dispatch``) plus content-addressed, atomically
+    committed cache artifacts. The abort just spares the wasted second run
+    when it can; do not remove the per-attempt workdir on the assumption
+    this covers the race.
+    """
+
+
+def _utcnow() -> datetime:
+    return datetime.now(timezone.utc)
+
+
+def _next_dispatch_time(now: datetime) -> datetime:
+    """Compute the next dispatch attempt time for an overflowed job.
+
+    ``now + retry_interval + jitter``. The jitter spreads a queued backlog
+    across ticks; it is bounded by ``_RETRY_JITTER_MAX_SECONDS`` (itself
+    clamped to the retry interval), so a test that monkeypatches
+    ``_RETRY_INTERVAL_SECONDS`` low gets correspondingly small, bounded
+    jitter — no test zeroes the constant. Always strictly after ``now``
+    (retry interval is >= 1s), which ``lease_due_dispatch`` requires so a
+    leased row moves out of the due-window before its attempt resolves.
+    """
+    jitter = (
+        random.uniform(0, _RETRY_JITTER_MAX_SECONDS)
+        if _RETRY_JITTER_MAX_SECONDS > 0
+        else 0.0
+    )
+    return now + timedelta(seconds=_RETRY_INTERVAL_SECONDS + jitter)
+
+
 class JobExecutor(ABC):
     """Abstract interface for workflow dispatch + job tracking."""
 
@@ -185,6 +305,7 @@ async def _run_processor_routine(
     file_meta: dict[str, Any],
     workdir_str: str,
     cache: CacheBackend,
+    heartbeat_interval: float,
 ) -> AsyncIterator[WorkflowEvent]:
     """Wool routine body: stream processor events back to the dispatcher.
 
@@ -192,12 +313,30 @@ async def _run_processor_routine(
     :class:`~cfdb.workflows.events.StageComplete` and
     :class:`~cfdb.workflows.events.Complete` events. We wrap its stream
     with a heartbeat-aware iterator: while waiting on the next event from
-    the processor, every ``_HEARTBEAT_INTERVAL_S`` we inject a
+    the processor, every ``heartbeat_interval`` seconds we inject a
     :class:`~cfdb.workflows.events.Heartbeat` so the API process can
     refresh ``JobRecord.updated_at`` and signal liveness without requiring
-    the worker to touch Mongo. An exception from the processor is
-    converted to an :class:`~cfdb.workflows.events.Error` event so the API
-    consumer can record a clean terminal state.
+    the worker to touch Mongo. ``heartbeat_interval`` is passed explicitly
+    (resolved API-side at dispatch) rather than read from the worker's
+    module globals, because cloudpickle ships this routine by value and the
+    worker's own ``_HEARTBEAT_INTERVAL_S`` would otherwise be ignored. An
+    exception from the processor is converted to an
+    :class:`~cfdb.workflows.events.Error` event so the API consumer can
+    record a clean terminal state.
+
+    The routine yields an immediate :class:`~cfdb.workflows.events.Heartbeat`
+    as its very first event — the instant a worker *accepts* the dispatch,
+    before the (potentially slow) upstream download. This is the dispatcher's
+    "worker accepted" signal: ``_attempt_dispatch`` marks the job RUNNING on
+    the first event, so a job leaves the leasable PENDING set the moment a
+    worker takes it rather than only after the first processor event lands.
+    Without this, a long download (which happens before the processor's
+    first event) keeps the job PENDING, and the durable scheduler would
+    re-lease it and double-dispatch onto the same workdir once the retry
+    interval elapsed. A rejected dispatch (worker backpressure) raises
+    ``NoWorkersAvailable`` before the routine body runs, so this leading
+    heartbeat fires only on acceptance — cleanly distinguishing accepted
+    from overflowed.
 
     Arguments are cloudpickle-serializable: ``processor`` is a stateless
     class instance, ``file_meta`` is a plain dict (B1 strips Mongo's
@@ -213,20 +352,25 @@ class instance, ``file_meta`` is a plain dict (B1 strips Mongo's
     """
     workdir = Path(workdir_str)
     workdir.mkdir(parents=True, exist_ok=True)
+    # "Worker accepted" signal — see the docstring. Emitted before the
+    # processor runs so the dispatcher marks the job RUNNING the instant a
+    # worker takes it (before the download), closing the re-lease /
+    # double-dispatch window. A rejected dispatch never reaches here.
+    yield Heartbeat()
     inner = processor.run(file_meta, workdir, cache).__aiter__()
     next_task: Optional[asyncio.Task] = None
     try:
         while True:
             next_task = asyncio.ensure_future(inner.__anext__())
             # Loop on a per-event wait_for so a heartbeat injects every
-            # _HEARTBEAT_INTERVAL_S of stage silence; shield the task so
+            # heartbeat_interval of stage silence; shield the task so
             # a timeout doesn't propagate cancellation into the inner
             # generator mid-stage.
             while True:
                 try:
                     await asyncio.wait_for(
                         asyncio.shield(next_task),
-                        timeout=_HEARTBEAT_INTERVAL_S,
+                        timeout=heartbeat_interval,
                     )
                     break
                 except asyncio.TimeoutError:
@@ -271,18 +415,19 @@ class instance, ``file_meta`` is a plain dict (B1 strips Mongo's
 class WoolExecutor(JobExecutor):
     """Executor backed by a ``wool.WorkerPool`` and a Mongo jobs collection.
 
-    When configured with an :class:`EcsProvisioner`, the executor also
-    issues a ``RunTask`` on each fresh claim before opening the routine
-    stream, so a Fargate worker boots before the dispatch retry loop
-    begins polling for it. Without a provisioner (PoC dev profile) the
-    executor relies on workers already published into the discovery
-    namespace.
+    Dispatch tries existing workers first via the priority load balancer.
+    Only when an attempt overflows (no worker accepts) does the executor —
+    when configured with an :class:`EcsProvisioner` — issue a best-effort
+    ``RunTask`` to scale the fleet up, then leave the job queued for a later
+    retry; it never spawns a worker pre-dispatch. Without a provisioner
+    (PoC dev profile) the executor relies on workers already published into
+    the discovery namespace.
 
-    A :class:`RetryableProvisionerError` from the provisioner surfaces
-    as a terminal ``FAILED`` job with a ``capacity:``-prefixed error
-    string; any other provisioner failure surfaces with a
-    ``provisioner:`` prefix. Clients parse the prefix to decide
-    whether to resubmit.
+    A job that cannot find capacity before its dispatch deadline is failed
+    with a ``capacity:``-prefixed error string, which clients parse to
+    decide whether to resubmit. Provisioner failures on the overflow path
+    are best-effort and swallowed (the job simply retries on the next tick),
+    so they never surface as a terminal job state.
 
     Args:
         db: Motor database handle holding the ``jobs`` collection.
@@ -291,7 +436,7 @@ class WoolExecutor(JobExecutor):
             processors persist artifacts to the deployment's real store
             (local FS or S3), not a worker-local filesystem.
         registry: Processor registry mapping artifact kinds to runners.
-        workdir_root: Parent directory under which per-job workdirs land.
+        workdir_root: Parent directory under which per-attempt workdirs land.
         pipeline_version: Embedded in workflow keys; bump to invalidate
             every in-flight workflow.
         provisioner: Optional :class:`EcsProvisioner`. When ``None``
@@ -322,11 +467,20 @@ def __init__(
         self._provisioner = provisioner
         self._pending_tasks: set[asyncio.Task] = set()
         #: Finalize tasks (release_workflow + workdir cleanup) created by
-        #: _run_workflow's `finally`. Tracked separately so drain() can
-        #: await them after the main _pending_tasks gather, ensuring they
-        #: complete before the lifespan teardown closes the Motor client.
+        #: _consume_and_finalize's `finally`. Tracked separately so drain()
+        #: can await them after the main _pending_tasks gather, ensuring
+        #: they complete before the lifespan teardown closes the Motor
+        #: client.
         self._finalize_tasks: set[asyncio.Task] = set()
         self._draining = False
+        #: The durable retry scheduler — re-attempts dispatch for jobs that
+        #: overflowed (no worker capacity) and are awaiting a free worker.
+        #: Fresh claims dispatch inline via ``ensure_workflow``; the
+        #: scheduler only handles the retry path. Started by
+        #: ``start_scheduler`` inside the lifespan's wool-pool context (so
+        #: it inherits wool's dispatch contextvars) and cancelled by
+        #: ``drain`` before the pool closes.
+        self._scheduler_task: Optional[asyncio.Task] = None
 
     async def ensure_workflow(
         self, file_meta: dict[str, Any]
@@ -345,6 +499,22 @@ async def ensure_workflow(
                 "No processor registered for this file, or no work required"
             )
 
+        # Admission ceiling: shed new work once the active backlog
+        # (pending + running) hits the cap, so an unauthenticated flood on
+        # /data and /index can't queue unbounded jobs in Mongo. Checked
+        # AFTER applicability (don't 429 a file that needs no workflow) and
+        # BEFORE claiming the mutex. Soft cap — a count-then-claim race may
+        # transiently overshoot, acceptable for a flood guard; the
+        # per-source mutex still dedups same-file requests below the cap.
+        # NOTE: because this precedes the claim, at the ceiling a re-GET for
+        # a file whose workflow is already active is also shed with 429
+        # (rather than attaching to the in-flight job); the client retries.
+        # This is the deliberate trade for shedding before an unbounded
+        # count-then-insert race window.
+        active = await count_active_workflows(self._db)
+        if active >= _MAX_ACTIVE_WORKFLOWS:
+            raise AdmissionRejected(active=active, ceiling=_MAX_ACTIVE_WORKFLOWS)
+
         dcc, local_id, md5 = extract_identity(file_meta)
         wf_key = key_utils.workflow_key(
             dcc=dcc,
@@ -363,15 +533,163 @@ async def ensure_workflow(
         )
 
         if fresh:
-            task = asyncio.create_task(
-                self._run_workflow(record, processor, file_meta)
-            )
-            self._pending_tasks.add(task)
-            task.add_done_callback(self._pending_tasks.discard)
-            task.add_done_callback(_log_unexpected_exception)
+            # Dispatch the first attempt inline (fire-and-forget), so a
+            # fresh claim runs immediately rather than waiting out a
+            # scheduler tick. The job's ``next_dispatch_at`` stays None
+            # until this attempt overflows, so the durable scheduler — which
+            # only leases jobs with a due ``next_dispatch_at`` — cannot
+            # race this inline attempt. Retries (after an overflow sets
+            # ``next_dispatch_at``) are driven by the scheduler.
+            self._spawn_attempt(record, processor, file_meta)
 
         return record, fresh
 
+    def _spawn_attempt(
+        self,
+        record: JobRecord,
+        processor: Processor,
+        file_meta: dict[str, Any],
+    ) -> None:
+        """Spawn a tracked background task running one dispatch attempt."""
+        task = asyncio.create_task(
+            self._attempt_dispatch(record, processor, file_meta)
+        )
+        self._pending_tasks.add(task)
+        task.add_done_callback(self._pending_tasks.discard)
+        task.add_done_callback(_log_unexpected_exception)
+
+    def start_scheduler(self) -> None:
+        """Start the durable retry scheduler as a background task.
+
+        MUST be called from inside the lifespan's ``wool.WorkerPool``
+        context so the created task inherits wool's dispatch contextvars
+        (the same mechanism request-spawned tasks rely on via
+        ``attach_wool_context``); otherwise ``@wool.routine`` dispatch from
+        the scheduler would fail with no pool in context. Idempotent.
+        """
+        if self._scheduler_task is not None:
+            return
+        self._scheduler_task = asyncio.create_task(self._scheduler_loop())
+        self._scheduler_task.add_done_callback(_log_unexpected_exception)
+
+    async def _scheduler_loop(self) -> None:
+        """Re-attempt dispatch for queued jobs until drain.
+
+        Fresh claims dispatch inline (``ensure_workflow``); this loop drives
+        the *retry* path. Each tick leases every currently-due job — one
+        whose inline (or prior) attempt overflowed and set a
+        ``next_dispatch_at`` now in the past — and spawns a fresh dispatch
+        attempt for it, then sleeps one retry interval. A per-tick exception
+        is logged and swallowed so the scheduler never dies; cancellation
+        (from ``drain``) breaks the loop. DB-backed, so a freshly-started
+        replica resumes whatever queue the previous process left behind.
+        """
+        while not self._draining:
+            try:
+                await self._recover_orphans()
+                await self._drain_due_jobs()
+            except asyncio.CancelledError:
+                raise
+            except Exception:
+                logger.exception("Scheduler dispatch tick failed; continuing")
+            await asyncio.sleep(_RETRY_INTERVAL_SECONDS)
+
+    async def _recover_orphans(self) -> None:
+        """Re-queue jobs orphaned by an API/worker crash for re-dispatch.
+
+        Runs at the top of every scheduler tick — including the first, on
+        boot — so a process that died mid-flight does not strand jobs the
+        normal lease query cannot pick up (``RUNNING`` rows whose consumer
+        died, and fresh ``PENDING`` claims that never rescheduled). Recovery
+        is then autonomous: it no longer depends on a client re-requesting
+        the same file to trigger ``claim_workflow``'s stale-reclaim. Gated
+        on the stale threshold so healthy heartbeating jobs are untouched.
+
+        Note the recovery promise is bounded by the same dispatch deadline
+        as a fresh job: ``requeue_orphaned_dispatch`` preserves the original
+        ``submitted_at``, and ``_attempt_dispatch`` measures the deadline
+        from it, so an orphan older than ``CFDB_WORKFLOW_DISPATCH_DEADLINE_S``
+        is failed ``capacity:`` on its first recovery attempt rather than
+        resumed (its committed cache artifacts survive for a later fresh
+        ``GET`` to reuse). Recovery is best-effort, not unbounded.
+        """
+        now = _utcnow()
+        requeued = await requeue_orphaned_dispatch(
+            self._db, now=now, stale_before=now - STALE_WORKFLOW_THRESHOLD
+        )
+        if requeued:
+            logger.info(
+                "Re-queued %d orphaned workflow(s) for re-dispatch after a "
+                "restart or worker crash",
+                requeued,
+            )
+
+    async def _drain_due_jobs(self) -> None:
+        """Lease currently-due queued jobs and spawn their retry attempts.
+
+        ``lease_due_dispatch`` atomically claims one due job and pushes its
+        ``next_dispatch_at`` forward, so concurrent ticks (or replicas)
+        can't double-lease and a crashed attempt is still retried. We loop
+        until nothing is due (or the per-tick cap is hit), dispatching each
+        leased job as a tracked background task so a winning attempt's long
+        stream-consume doesn't block the scheduler from filling the pool.
+
+        The per-tick cap (``_MAX_DISPATCHES_PER_TICK``) bounds the dispatch
+        fan-out: after a large backlog (up to ``WORKFLOW_MAX_ACTIVE``) a
+        single tick would otherwise spawn thousands of concurrent attempts,
+        each opening a wool stream that immediately overflows. The remainder
+        is leased on subsequent ticks; the deadline/retry machinery tolerates
+        the delay.
+        """
+        dispatched = 0
+        while not self._draining and dispatched < _MAX_DISPATCHES_PER_TICK:
+            now = _utcnow()
+            leased = await lease_due_dispatch(
+                self._db, now=now, next_at=_next_dispatch_time(now)
+            )
+            if leased is None:
+                break
+            file_meta = leased.file_meta_snapshot
+            if file_meta is None:
+                logger.error(
+                    "Leased job %s has no file_meta_snapshot — failing",
+                    leased.job_id,
+                )
+                await release_workflow(
+                    self._db,
+                    leased.job_id,
+                    JobStatus.FAILED,
+                    error="internal: leased job missing file_meta_snapshot",
+                )
+                continue
+            processor = self._registry.lookup_for(file_meta)
+            if processor is None:
+                logger.error(
+                    "No processor for leased job %s — failing", leased.job_id
+                )
+                await release_workflow(
+                    self._db,
+                    leased.job_id,
+                    JobStatus.FAILED,
+                    error="internal: no processor for leased job",
+                )
+                continue
+            self._spawn_attempt(leased, processor, file_meta)
+            dispatched += 1
+
+        # Leading operability signal: a saturated queue is otherwise silent
+        # until jobs start failing ``capacity:`` at the dispatch deadline.
+        # Emitted at most once per tick (cadence ``_RETRY_INTERVAL_SECONDS``),
+        # so it does not flood the log even under a sustained backlog.
+        if dispatched:
+            logger.info(
+                "Scheduler tick dispatched %d queued workflow(s)%s",
+                dispatched,
+                " (per-tick cap hit; remainder deferred to the next tick)"
+                if dispatched >= _MAX_DISPATCHES_PER_TICK
+                else "",
+            )
+
     async def drain(self, *, timeout: float) -> int:
         """Wait for in-flight workflow tasks to complete.
 
@@ -391,6 +709,14 @@ async def drain(self, *, timeout: float) -> int:
             immediately.
         """
         self._draining = True
+        # Stop the dispatch driver first so no new attempt tasks are spawned
+        # into the set we're about to snapshot, and so it stops dispatching
+        # into a pool that's about to close.
+        if self._scheduler_task is not None:
+            self._scheduler_task.cancel()
+            with contextlib.suppress(asyncio.CancelledError, Exception):
+                await self._scheduler_task
+            self._scheduler_task = None
         pending = list(self._pending_tasks)
         if not pending:
             return 0
@@ -405,7 +731,7 @@ async def drain(self, *, timeout: float) -> int:
             # against a closed Motor client surfaces as a noisy
             # exception in the lifespan exit, and the wool stream they
             # hold blocks pool aclose. Cancel them explicitly and
-            # await the cancellation so _run_workflow's shielded
+            # await the cancellation so _consume_and_finalize's shielded
             # _finalize gets its 2s best-effort grace before we let
             # the lifespan continue to teardown.
             for task in pending:
@@ -429,14 +755,172 @@ async def drain(self, *, timeout: float) -> int:
                 )
         return len(pending)
 
-    async def _run_workflow(
+    async def _attempt_dispatch(
         self,
         record: JobRecord,
         processor: Processor,
         file_meta: dict[str, Any],
     ) -> None:
-        """Background coroutine: mark running, consume the routine's event
-        stream, and release a terminal status.
+        """Run one dispatch attempt for a leased PENDING job.
+
+        Offers the task to the existing worker pool **once** via the
+        priority load balancer. On success the job is marked RUNNING (only
+        now that a worker has accepted — a queued job stays PENDING until
+        then) and its event stream is consumed to completion. On overflow
+        (every worker rejected, or none exist) the job stays PENDING, a
+        best-effort bounded worker spawn is requested, and it is rescheduled
+        for a later tick — unless it has blown its dispatch deadline, in
+        which case it is failed ``capacity:``. This inverts the old
+        unconditional pre-dispatch spawn: existing capacity is tried first
+        and scale-up happens only on overflow.
+        """
+        # Per-attempt workdir (unique nonce), not per-job. If the orphan
+        # sweep ever re-dispatches a job whose previous attempt is somehow
+        # still alive (e.g. event-loop starvation that defeated the
+        # heartbeat-loss abort), the two attempts use disjoint scratch dirs
+        # and cannot corrupt each other's downloads / sort temp / cleanup.
+        # Partial-commit recovery is unaffected — it keys off the cache, not
+        # the workdir.
+        workdir = self._workdir_root / f"{record.job_id}-{uuid.uuid4().hex}"
+
+        # Deadline: a job that has waited for capacity longer than the
+        # dispatch deadline (measured from ``submitted_at``) is failed rather
+        # than retried forever. ``capacity:`` is the stable on-wire prefix
+        # clients parse to decide whether to resubmit. Note the clock runs
+        # from original submission and is NOT reset on orphan recovery, so a
+        # job recovered after its deadline is failed here rather than resumed
+        # — recovery is best-effort, bounded by the same deadline.
+        waited = (_utcnow() - record.submitted_at).total_seconds()
+        if waited > _DISPATCH_DEADLINE_SECONDS:
+            logger.warning(
+                "Dispatch deadline exceeded for %s after %.0fs; failing",
+                record.job_id,
+                waited,
+            )
+            await self._finalize(
+                workdir,
+                record.job_id,
+                JobStatus.FAILED,
+                f"{_ERR_PREFIX_CAPACITY} dispatch deadline exceeded",
+            )
+            return
+
+        try:
+            stream = await self._open_stream_once(processor, file_meta, workdir)
+        except asyncio.CancelledError:
+            # Cancelled (e.g. lifespan drain) before a worker accepted.
+            # Finalize FAILED so the row doesn't dangle PENDING until
+            # stale-reclaim; the client resubmits. Best-effort, then
+            # propagate the cancellation.
+            with contextlib.suppress(Exception):
+                await self._finalize(
+                    workdir,
+                    record.job_id,
+                    JobStatus.FAILED,
+                    "Workflow cancelled (worker shutdown)",
+                )
+            raise
+        except wool.NoWorkersAvailable:
+            # No capacity this pass — keep the job PENDING, request a
+            # bounded scale-up (ECS only), and reschedule for a later tick.
+            await self._handle_overflow(record)
+            return
+        except Exception as exc:
+            logger.exception("Failed to open routine stream for %s", record.job_id)
+            await self._finalize(
+                workdir, record.job_id, JobStatus.FAILED, str(exc)
+            )
+            return
+
+        # A worker accepted the task. Claim RUNNING now — after the stream
+        # opened — so a job only leaves PENDING once it is genuinely
+        # running. ``mark_running`` is fenced on PENDING; a RuntimeError
+        # means a racing claimant (or stale-reclaim) already owns the row,
+        # so hand off: close the stream (cancelling the remote routine) and
+        # bail without duplicate work.
+        try:
+            await mark_running(self._db, record.job_id)
+        except asyncio.CancelledError:
+            with contextlib.suppress(BaseException):
+                await stream.aclose()
+            raise
+        except RuntimeError:
+            logger.info(
+                "mark_running rejected for %s (row no longer PENDING); "
+                "closing stream and handing off",
+                record.job_id,
+            )
+            with contextlib.suppress(BaseException):
+                await stream.aclose()
+            # The successor owns the row, so do NOT release the mutex here —
+            # but this attempt's workdir was already created by the routine,
+            # so clean it up to avoid leaking scratch on the hand-off (A1).
+            with contextlib.suppress(Exception):
+                await asyncio.to_thread(
+                    shutil.rmtree, str(workdir), ignore_errors=True
+                )
+            return
+        except Exception as exc:
+            logger.exception("Failed to mark job %s as running", record.job_id)
+            with contextlib.suppress(BaseException):
+                await stream.aclose()
+            await self._finalize(
+                workdir,
+                record.job_id,
+                JobStatus.FAILED,
+                f"mark_running failed: {exc}",
+            )
+            return
+
+        await self._consume_and_finalize(record, stream, workdir)
+
+    async def _handle_overflow(self, record: JobRecord) -> None:
+        """Handle a dispatch attempt that found no worker capacity.
+
+        Requests one best-effort worker spawn through the provisioner (ECS
+        profile only; a no-op in the local LAN profile, where the pool is
+        fixed) and reschedules the job's next dispatch attempt. The spawn is
+        strictly best-effort — existing workers may free up before the next
+        tick — so any provisioner error is logged and swallowed rather than
+        failing the job; the dispatch deadline bounds the retry loop.
+        """
+        if self._provisioner is not None:
+            try:
+                await self._provisioner.request(dedup_key=record.workflow_key)
+            except asyncio.CancelledError:
+                raise
+            except RetryableProvisionerError as exc:
+                logger.warning(
+                    "Provisioner retryable capacity error for %s; will retry "
+                    "on the next tick: %s",
+                    record.job_id,
+                    exc,
+                )
+            except Exception:
+                logger.exception(
+                    "Provisioner request failed for %s; will retry on the "
+                    "next tick",
+                    record.job_id,
+                )
+        # Leading operability signal that the pool is over capacity: the job
+        # found no worker and is being deferred to a later tick. Logged at
+        # info so a steadily-overflowing fleet is visible before jobs start
+        # failing ``capacity:`` at the dispatch deadline.
+        logger.info(
+            "No worker capacity for %s; rescheduling (pool overflow)",
+            record.job_id,
+        )
+        await reschedule_dispatch(
+            self._db, record.job_id, next_at=_next_dispatch_time(_utcnow())
+        )
+
+    async def _consume_and_finalize(
+        self,
+        record: JobRecord,
+        stream: AsyncIterator[Any],
+        workdir: Path,
+    ) -> None:
+        """Consume a running workflow's event stream and release it terminal.
 
         The routine emits ``heartbeat``, ``stage_complete``, ``complete``,
         and ``error`` events as an async generator over wool's gRPC stream.
@@ -446,102 +930,25 @@ async def _run_workflow(
         ``stage_complete`` arrives, so /jobs/{id} reflects partial
         progress in real time rather than all-at-once at the end.
 
-        A ``try/finally`` ensures the job always reaches a terminal
-        status and the per-job workdir is cleaned up, even when
-        ``mark_running`` raises or the task is cancelled during
-        shutdown. ``CancelledError`` propagates after the terminal write.
+        A ``try/finally`` ensures the job always reaches a terminal status
+        and the per-job workdir is cleaned up, even when the task is
+        cancelled during shutdown. ``CancelledError`` propagates after the
+        terminal write. The caller has already marked the job RUNNING and
+        opened ``stream``.
         """
-        workdir = self._workdir_root / record.job_id
-
         final_status: JobStatus = JobStatus.FAILED
         final_error: Optional[str] = None
+        # Time of the last successful heartbeat write. Seeded now because
+        # the caller just marked the job RUNNING (the row is fresh). If
+        # heartbeat writes then fail for longer than _HEARTBEAT_LOSS_ABORT_S
+        # we raise HeartbeatLost and abort, so a Mongo-blind consumer stops
+        # computing before the orphan sweep revives its row. Best-effort: it
+        # only re-evaluates on a Heartbeat event (i.e. during quiet stages),
+        # so the per-attempt workdir — not this abort — is the actual
+        # corruption guard (see HeartbeatLost).
+        last_heartbeat_ok = _utcnow()
 
         try:
-            # mark_running lives inside the try so a Mongo write failure
-            # here still routes through the finally — the row gets
-            # released to FAILED rather than dangling at PENDING for
-            # the full stale-reclaim window.
-            try:
-                await mark_running(self._db, record.job_id)
-            except Exception as exc:
-                logger.exception("Failed to mark job %s as running", record.job_id)
-                final_error = f"mark_running failed: {exc}"
-                return
-
-            # Request a worker via the external provisioner (e.g. ECS
-            # ``RunTask``) before opening the routine stream. The
-            # provisioner dedup-keys on the workflow mutex so two
-            # concurrent fresh claims for the same source file
-            # share one ``RunTask`` and one worker.
-            #
-            # ``RetryableProvisionerError`` is treated as a best-effort
-            # scale-up failure, not a workflow-level failure: the Wool
-            # pool routes ``@wool.routine`` calls to *any* available
-            # worker via the discovery namespace, so an existing idle
-            # worker can still service this job. Log a warning and fall
-            # through to ``_open_stream_with_retry``; the
-            # ``capacity:``-prefixed terminal status is reserved for the
-            # case where dispatch *also* exhausts its budget with no
-            # workers available.
-            if self._provisioner is not None:
-                try:
-                    await self._provisioner.request(dedup_key=record.workflow_key)
-                except asyncio.CancelledError:
-                    final_error = "Workflow cancelled (worker shutdown)"
-                    raise
-                except RetryableProvisionerError as exc:
-                    logger.warning(
-                        "Provisioner reported retryable capacity error for %s; "
-                        "falling through to dispatch against existing workers: %s",
-                        record.job_id,
-                        exc,
-                    )
-                except Exception as exc:
-                    logger.exception(
-                        "Provisioner request failed for %s", record.job_id
-                    )
-                    final_error = f"{_ERR_PREFIX_PROVISIONER} {exc}"
-                    return
-                # Bound the stale-reclaim window across the provisioner
-                # round-trip: ``mark_running`` ran before
-                # ``provisioner.request`` and a Fargate cold start can
-                # plausibly exceed STALE_WORKFLOW_THRESHOLD. Heartbeat
-                # so the row stays fresh through the upcoming dispatch
-                # wait. A failed heartbeat is logged but does not
-                # abort the workflow — the next ``heartbeat`` event in
-                # the stream loop will retry.
-                try:
-                    await heartbeat_workflow(self._db, record.job_id)
-                except Exception:
-                    logger.exception(
-                        "Post-provisioner heartbeat failed for %s; continuing",
-                        record.job_id,
-                    )
-
-            try:
-                stream = await self._open_stream_with_retry(
-                    processor, file_meta, workdir
-                )
-            except asyncio.CancelledError:
-                final_error = "Workflow cancelled (worker shutdown)"
-                raise
-            except wool.NoWorkersAvailable as exc:
-                # Dispatch budget exhausted with no workers reachable.
-                # ``capacity:`` is the stable on-wire prefix clients
-                # parse to decide whether to resubmit the job — its
-                # origin shifted from the provisioner-failure branch
-                # (which now falls through) to here, but the wire
-                # contract is unchanged.
-                logger.warning(
-                    "Dispatch budget exhausted for %s with no workers available",
-                    record.job_id,
-                )
-                final_error = f"{_ERR_PREFIX_CAPACITY} {exc}"
-                return
-            except Exception as exc:
-                logger.exception("Failed to open routine stream for %s", record.job_id)
-                final_error = str(exc)
-                return
 
             try:
                 async with asyncio.timeout(_WORKFLOW_DURATION_CAP_SECONDS):
@@ -549,7 +956,21 @@ async def _run_workflow(
                         if isinstance(event, Heartbeat):
                             try:
                                 await heartbeat_workflow(self._db, record.job_id)
+                                last_heartbeat_ok = _utcnow()
                             except Exception:
+                                stalled = (
+                                    _utcnow() - last_heartbeat_ok
+                                ).total_seconds()
+                                if stalled > _HEARTBEAT_LOSS_ABORT_S:
+                                    # Lost Mongo long enough that the orphan
+                                    # sweep will treat this row as dead; abort
+                                    # so recovery is spared a redundant second
+                                    # run of this still-live attempt.
+                                    raise HeartbeatLost(
+                                        f"no successful heartbeat for "
+                                        f"{stalled:.0f}s "
+                                        f"(> {_HEARTBEAT_LOSS_ABORT_S:.0f}s)"
+                                    )
                                 logger.exception(
                                     "Heartbeat failed for %s; continuing",
                                     record.job_id,
@@ -622,6 +1043,18 @@ async def _run_workflow(
                 if final_status == JobStatus.FAILED:
                     final_error = "Workflow cancelled (worker shutdown)"
                 raise
+            except HeartbeatLost as exc:
+                # Expected abort, not a crash: this consumer lost Mongo and
+                # is giving up the job so the orphan sweep can recover it
+                # without a concurrent live attempt. The terminal write
+                # below will likely no-op (Mongo is unreachable for us); the
+                # inner finally still aclose()s the stream, cancelling the
+                # remote worker so it stops computing.
+                logger.warning(
+                    "Aborting workflow %s: %s", record.job_id, exc
+                )
+                final_status = JobStatus.FAILED
+                final_error = f"heartbeat lost: {exc}"
             except Exception as exc:
                 logger.exception(
                     "Workflow %s failed during stream consumption",
@@ -686,74 +1119,44 @@ async def _finalize(
         except Exception:
             logger.exception("Unable to clean workdir %s", workdir)
 
-    async def _open_stream_with_retry(
+    async def _open_stream_once(
         self,
         processor: Processor,
         file_meta: dict[str, Any],
         workdir: Path,
     ) -> AsyncIterator[dict[str, Any]]:
-        """Open the routine's event stream, retrying on ``NoWorkersAvailable``.
-
-        The API leases workers from a pool that's provisioned externally
-        (ECS in production); cold starts can take ~30-90s. Until wool
-        exposes a quorum-style readiness gate we poll the pool with
-        ``NoWorkersAvailable`` retries on each attempt to open the
-        stream. ``_DISPATCH_WAIT_SECONDS`` caps the total wait so a
-        stuck scale-up surfaces as a workflow failure rather than
-        hanging until the much-larger ``_WORKFLOW_DURATION_CAP_SECONDS``
-        cap fires.
-
-        Returns an async iterator that yields the first event followed
-        by the rest of the routine's stream. The retry-on-capacity logic
-        only applies to the first ``__anext__`` call (which is where
-        wool surfaces ``NoWorkersAvailable``); once events start
-        flowing, subsequent ``__anext__`` calls just wait on the worker.
+        """Open the routine's event stream in a single dispatch attempt.
+
+        Constructs the ``@wool.routine`` stream and pulls its first event,
+        which is where wool surfaces ``wool.NoWorkersAvailable`` when the
+        priority load balancer found no worker willing to accept the task.
+        That exception propagates to ``_attempt_dispatch``, which reschedules
+        the job — the durable retry scheduler is now the retry mechanism, so
+        there is no in-attempt polling loop (a single reschedule tick
+        covers an ECS cold start). On success returns an async iterator
+        yielding the first event followed by the rest of the stream.
         """
-        loop = asyncio.get_running_loop()
-        deadline = loop.time() + _DISPATCH_WAIT_SECONDS
-        attempt = 0
-        last_exc: wool.NoWorkersAvailable | None = None
-        while True:
-            stream = _run_processor_routine(
-                processor,
-                file_meta,
-                str(workdir),
-                self._cache,
-            )
-            # ``returning`` flips True only after we've handed the
-            # stream off to ``_prepend_first_event``. Any other exit
-            # from this iteration (NoWorkersAvailable, CancelledError
-            # from __anext__ or the inter-iteration sleep, unexpected
-            # exception) leaves the freshly-constructed stream
-            # unawaited; the finally aclose()s it so wool's pool can
-            # release the lease cleanly on drain.
-            returning = False
-            try:
-                try:
-                    first_event = await stream.__anext__()
-                except wool.NoWorkersAvailable as exc:
-                    last_exc = exc
-                    attempt += 1
-                    if loop.time() >= deadline:
-                        break
-                    if attempt >= 2:
-                        logger.warning(
-                            "No workers available — retrying dispatch "
-                            "(attempt %d, %.1fs of %.1fs budget remaining)",
-                            attempt,
-                            deadline - loop.time(),
-                            _DISPATCH_WAIT_SECONDS,
-                        )
-                    await asyncio.sleep(_DISPATCH_RETRY_INTERVAL_SECONDS)
-                    continue
-                returning = True
-                return _prepend_first_event(first_event, stream)
-            finally:
-                if not returning:
-                    with contextlib.suppress(BaseException):
-                        await stream.aclose()
-        assert last_exc is not None
-        raise last_exc
+        stream = _run_processor_routine(
+            processor,
+            file_meta,
+            str(workdir),
+            self._cache,
+            _HEARTBEAT_INTERVAL_S,
+        )
+        # ``returning`` flips True only after we've handed the stream off to
+        # ``_prepend_first_event``. Any other exit (NoWorkersAvailable, a
+        # CancelledError from __anext__, an unexpected exception) leaves the
+        # freshly-constructed stream unawaited; the finally aclose()s it so
+        # wool's pool can release the lease cleanly.
+        returning = False
+        try:
+            first_event = await stream.__anext__()
+            returning = True
+            return _prepend_first_event(first_event, stream)
+        finally:
+            if not returning:
+                with contextlib.suppress(BaseException):
+                    await stream.aclose()
 
 
 async def _prepend_first_event(
@@ -761,10 +1164,10 @@ async def _prepend_first_event(
 ) -> AsyncIterator[dict[str, Any]]:
     """Yield ``first`` then the rest of ``stream``.
 
-    Used by ``_open_stream_with_retry`` to consume the initial
-    ``__anext__`` (where wool surfaces ``NoWorkersAvailable``) inside
-    the retry loop, then expose the remaining events to the workflow
-    consumer through a single uniform async-iterator interface.
+    Used by ``_open_stream_once`` to consume the initial ``__anext__``
+    (where wool surfaces ``NoWorkersAvailable``), then expose the remaining
+    events to the workflow consumer through a single uniform async-iterator
+    interface.
     """
     try:
         yield first
@@ -778,12 +1181,12 @@ async def _prepend_first_event(
 def _log_unexpected_exception(task: asyncio.Task) -> None:
     """Done-callback that surfaces an unexpected exception via the logger.
 
-    ``_run_workflow`` is supposed to catch every failure and route it to
-    a terminal release; if anything escapes that contract (e.g. a future
-    refactor leaves a bare ``raise`` outside the try), asyncio's
-    default behavior is to log the unretrieved exception only at
-    interpreter shutdown. Surfacing it through the project logger
-    immediately makes the regression visible.
+    The scheduler loop and ``_attempt_dispatch`` are supposed to catch
+    every failure and route it to a terminal release or a reschedule; if
+    anything escapes that contract (e.g. a future refactor leaves a bare
+    ``raise`` outside the try), asyncio's default behavior is to log the
+    unretrieved exception only at interpreter shutdown. Surfacing it
+    through the project logger immediately makes the regression visible.
     """
     if task.cancelled():
         return
diff --git a/src/cfdb/workflows/loadbalancer.py b/src/cfdb/workflows/loadbalancer.py
new file mode 100644
index 0000000..bacbd8c
--- /dev/null
+++ b/src/cfdb/workflows/loadbalancer.py
@@ -0,0 +1,113 @@
+"""Priority (leaky-bucket) load balancer for the wool worker pool (issue #45).
+
+Unlike wool's round-robin balancer, this one always offers a task to the
+discovered workers in the same stable order (sorted by
+``WorkerMetadata.uid``) on every dispatch. The order is arbitrary but
+reproducible — ``uid`` is a per-worker UUID, not a seniority or cost
+ranking; what matters is only that the same workers are offered work first
+on each dispatch. Combined with per-worker backpressure (one task per
+worker), load concentrates on the lowest-ordered workers and
+higher-ordered workers drain to idle — so an over-provisioned fleet sheds
+idle capacity (workers self-terminate on their max-lifetime) instead of
+every worker carrying a thin, perpetual slice of traffic. This is the
+"leaky bucket": fill the priority workers first and let the overflow spill
+to the next.
+
+It honors wool's load-balancer worker-health contract (see
+:class:`wool.LoadBalancerLike`): rotate to the next worker on
+``TransientRpcError`` — which includes a backpressure ``RESOURCE_EXHAUSTED``
+rejection — evict the worker on a non-transient ``RpcError``, and let any
+other exception propagate untouched. When no worker accepts the task (the
+pool is empty, or every worker rejected or was evicted in one pass) it
+raises ``NoWorkersAvailable``; the executor treats that as the signal to
+add a worker and re-queue the job.
+
+Stateless by design: the stable order is recomputed per call, so there is
+no per-context index, no lock, and nothing that resists pickling.
+"""
+
+from __future__ import annotations
+
+import logging
+from typing import TYPE_CHECKING
+from typing import Any
+from typing import AsyncIterator
+
+from wool import LoadBalancerContextLike
+from wool import LoadBalancerLike
+from wool import NoWorkersAvailable
+from wool import RpcError
+from wool import TransientRpcError
+
+if TYPE_CHECKING:
+    from wool.runtime.routine.task import Task
+
+logger = logging.getLogger(__name__)
+
+
+class PriorityLoadBalancer(LoadBalancerLike):
+    """Offer each task to workers in a stable priority order.
+
+    See the module docstring for the leaky-bucket rationale and the
+    worker-health contract this honors.
+    """
+
+    def __reduce__(self):
+        # Stateless — reconstruct via the bare constructor so the balancer
+        # is trivially picklable (mirrors RoundRobinLoadBalancer, which
+        # uses __reduce__ to shed its lock/index state).
+        return (self.__class__, ())
+
+    async def dispatch(
+        self,
+        task: Task,
+        *,
+        context: LoadBalancerContextLike,
+        timeout: float | None = None,
+    ) -> AsyncIterator[Any]:
+        """Dispatch *task* to the first worker, in priority order, that accepts.
+
+        :param task:
+            The :class:`wool.Task` to dispatch.
+        :param context:
+            The :class:`wool.LoadBalancerContextLike` whose ``workers`` map
+            names the candidates and through which ``remove_worker`` evicts
+            unhealthy peers.
+        :param timeout:
+            Per-attempt timeout in seconds against a single worker.
+        :returns:
+            The worker's response stream (an async iterator).
+        :raises NoWorkersAvailable:
+            When the pool is empty or no worker accepts the task across one
+            stable-order pass (all rejected transiently or were evicted).
+        """
+        # Snapshot a stable, deterministic order up front. Sorting by uid
+        # makes "priority" reproducible across dispatches, so the same
+        # low-order workers stay saturated. ``str(uid)`` guarantees an
+        # orderable key regardless of the uid's concrete type. Iterating a
+        # snapshot keeps the pass well-defined even when ``remove_worker``
+        # mutates the live context mid-loop on eviction.
+        candidates = sorted(
+            context.workers.items(), key=lambda item: str(item[0].uid)
+        )
+
+        for metadata, connection in candidates:
+            try:
+                stream = await connection.dispatch(task, timeout=timeout)
+            except TransientRpcError as exc:
+                logger.debug(
+                    "Skipping worker %s on transient error: %s", metadata.uid, exc
+                )
+                continue
+            except RpcError as exc:
+                logger.warning(
+                    "Evicting worker %s after non-transient RPC error: %s",
+                    metadata.uid,
+                    exc,
+                )
+                context.remove_worker(metadata)
+                continue
+            else:
+                return stream
+
+        raise NoWorkersAvailable("No healthy workers available for dispatch")
diff --git a/src/cfdb/workflows/lock.py b/src/cfdb/workflows/lock.py
index e5f1fef..05503f3 100644
--- a/src/cfdb/workflows/lock.py
+++ b/src/cfdb/workflows/lock.py
@@ -20,6 +20,7 @@
 from datetime import datetime, timedelta, timezone
 from typing import Any, Optional
 
+from pymongo import ReturnDocument
 from pymongo.errors import DuplicateKeyError
 from pymongo.write_concern import WriteConcern
 
@@ -133,7 +134,7 @@ def _record_from_mongo(doc: dict[str, Any]) -> JobRecord:
     here so the round-trip survives.
     """
     sanitized = {k: v for k, v in doc.items() if k != "_id"}
-    for field in ("submitted_at", "updated_at"):
+    for field in ("submitted_at", "updated_at", "next_dispatch_at"):
         value = sanitized.get(field)
         if isinstance(value, datetime) and value.tzinfo is None:
             sanitized[field] = value.replace(tzinfo=timezone.utc)
@@ -190,6 +191,13 @@ async def claim_workflow(
             submitted_at=now,
             updated_at=now,
             file_meta_snapshot=file_meta_snapshot,
+            # ``next_dispatch_at`` is intentionally left unset (None) on a
+            # fresh claim. ``ensure_workflow`` dispatches the first attempt
+            # inline; only if that attempt overflows (no worker capacity)
+            # does it set ``next_dispatch_at``, handing the job to the
+            # durable retry scheduler. A None value is never "due", so the
+            # scheduler can't double-dispatch a job whose inline attempt is
+            # still in flight.
         )
 
         # Attempt the insert first. If the partial-unique index admits it,
@@ -471,3 +479,142 @@ async def heartbeat_workflow(db, job_id: str) -> None:
         },
         {"$set": {"updated_at": _utcnow()}},
     )
+
+
+# --- Bounded-concurrency admission + durable dispatch retry (issue #45) ------
+
+
+async def count_active_workflows(db) -> int:
+    """Count workflows currently holding the mutex (pending + running).
+
+    Backs the admission ceiling in ``ensure_workflow``: once this reaches
+    ``CFDB_WORKFLOW_MAX_ACTIVE`` new requests are shed with 429. Soft by
+    nature — a count-then-insert race can briefly overshoot the cap, which
+    is acceptable for a flood guard.
+
+    Counts on the canonical ``active`` boolean discriminator (the single
+    source of truth every other jobs read/write/index keys off), not the
+    derived ``status $in ACTIVE_STATUSES`` view, so the ceiling and the
+    mutex stay in lockstep even if the two ever drift.
+    """
+    jobs = _jobs(db)
+    return await jobs.count_documents({"active": True})
+
+
+async def reschedule_dispatch(db, job_id: str, *, next_at: datetime) -> None:
+    """Defer a still-PENDING job's next dispatch attempt to ``next_at``.
+
+    Called when a dispatch attempt finds no worker capacity: the job stays
+    PENDING and the durable scheduler re-attempts it at ``next_at``.
+    Fenced on PENDING so a job that has since gone RUNNING/terminal (a
+    racing attempt won the worker, or it was stale-reclaimed) is not
+    dragged back into the dispatch queue. Bumps ``dispatch_attempts`` for
+    observability.
+    """
+    jobs = _jobs(db)
+    result = await jobs.update_one(
+        {"job_id": job_id, "status": JobStatus.PENDING.value},
+        {
+            "$set": {"next_dispatch_at": next_at, "updated_at": _utcnow()},
+            "$inc": {"dispatch_attempts": 1},
+        },
+    )
+    if getattr(result, "matched_count", 0) == 0:
+        logger.debug(
+            "reschedule_dispatch no-op for job %s — row no longer PENDING "
+            "(won a worker, terminated, or was stale-reclaimed)",
+            job_id,
+        )
+
+
+async def lease_due_dispatch(
+    db, *, now: datetime, next_at: datetime
+) -> Optional[JobRecord]:
+    """Atomically claim one PENDING job whose dispatch is due.
+
+    Selects a PENDING job with ``next_dispatch_at <= now`` and, in the same
+    operation, pushes its ``next_dispatch_at`` forward to ``next_at`` — so a
+    concurrent scheduler tick (or a second API replica) cannot lease the
+    same job, and a crash mid-attempt still leaves the job scheduled for a
+    later retry. Returns the leased job, or ``None`` when nothing is due.
+
+    The caller runs a dispatch attempt for the returned job; that attempt
+    either wins a worker (``mark_running``) or calls ``reschedule_dispatch``
+    again on overflow. Jobs with ``next_dispatch_at`` unset (``None``) are
+    never due, so they are not leased here.
+
+    Due jobs are leased oldest-first (``next_dispatch_at`` ascending) so
+    sustained overflow cannot starve the longest-waiting job toward its
+    deadline. ``next_at`` MUST be strictly in the future relative to
+    ``now``: the single-claim guarantee depends on the leased row's
+    ``next_dispatch_at`` moving out of the ``<= now`` window, so an
+    equal/past value would let a concurrent tick re-lease the same job
+    before the attempt resolves.
+    """
+    if next_at <= now:
+        raise ValueError(
+            f"lease_due_dispatch requires next_at ({next_at}) strictly after "
+            f"now ({now}); an equal/past value would let a concurrent tick "
+            "immediately re-lease the same job"
+        )
+    jobs = _jobs(db)
+    doc = await jobs.find_one_and_update(
+        {
+            "status": JobStatus.PENDING.value,
+            "next_dispatch_at": {"$lte": now},
+        },
+        {"$set": {"next_dispatch_at": next_at, "updated_at": now}},
+        sort=[("next_dispatch_at", 1)],
+        return_document=ReturnDocument.AFTER,
+    )
+    return _record_from_mongo(doc) if doc is not None else None
+
+
+async def requeue_orphaned_dispatch(
+    db, *, now: datetime, stale_before: datetime
+) -> int:
+    """Re-queue jobs orphaned by an API/worker crash so they re-dispatch.
+
+    The durable scheduler only leases ``PENDING`` jobs whose
+    ``next_dispatch_at`` is set and due, so two states survive a restart
+    that it would never pick up on its own:
+
+    - ``RUNNING`` rows whose API consumer died mid-stream (no heartbeat
+      since), and
+    - ``PENDING`` rows with ``next_dispatch_at`` unset — a fresh claim whose
+      inline first attempt never got to reschedule before the crash.
+
+    Both are reset to ``PENDING`` with ``next_dispatch_at = now`` so the next
+    scheduler tick leases them. Gated on ``updated_at < stale_before`` (the
+    same staleness threshold :func:`claim_workflow` uses) so a healthy
+    in-flight job — which keeps ``updated_at`` fresh via heartbeats — is
+    never falsely reclaimed. Returns the number of rows requeued.
+
+    Unlike ``claim_workflow``'s stale-reclaim, which fails the stale row so a
+    *new* claimant can supersede it, this revives the same row: there is no
+    new claimant, so the goal is to recover the in-flight work itself.
+    Partial-commit recovery means the re-dispatched job reuses any stages
+    already committed to cache.
+    """
+    jobs = _jobs(db)
+    result = await jobs.update_many(
+        {
+            "active": True,
+            "updated_at": {"$lt": stale_before},
+            "$or": [
+                {"status": JobStatus.RUNNING.value},
+                {"status": JobStatus.PENDING.value, "next_dispatch_at": None},
+            ],
+        },
+        {
+            "$set": {
+                "status": JobStatus.PENDING.value,
+                # PENDING is active; re-assert defensively so the
+                # discriminator can never drift from the status.
+                "active": True,
+                "next_dispatch_at": now,
+                "updated_at": now,
+            }
+        },
+    )
+    return getattr(result, "modified_count", 0)
diff --git a/src/cfdb/workflows/models.py b/src/cfdb/workflows/models.py
index b1fff7d..6156b25 100644
--- a/src/cfdb/workflows/models.py
+++ b/src/cfdb/workflows/models.py
@@ -73,7 +73,14 @@ class JobRecord(BaseModel):
         error: Populated on ``FAILED`` status with a human-readable reason.
             Capped at 1024 chars after path-scrubbing in
             ``lock.release_workflow`` so absolute filesystem paths from
-            tool stderr don't leak via ``/jobs/{id}``.
+            tool stderr don't leak via ``/jobs/{id}``. Some failures carry a
+            stable, client-parseable prefix: ``capacity:`` (the dispatch
+            deadline elapsed while the job waited for a worker — safe to
+            resubmit later), ``heartbeat lost:`` (the consumer lost its
+            Mongo connection and aborted so the orphan sweep can recover the
+            job), and ``internal:`` (a should-never-happen executor
+            invariant failure, e.g. a leased job missing its file_meta
+            snapshot or processor).
         superseded_by: When this row is stale-reclaimed by a fresh claim,
             the winner's ``job_id`` is recorded here so clients polling
             this (now-FAILED) job can follow the chain to the live one.
@@ -81,6 +88,16 @@ class JobRecord(BaseModel):
             metadata as it was at claim time. Persisted on insert so a
             re-run / observer can reconstruct the dispatch context
             without re-reading the (possibly mutated) ``files`` document.
+        next_dispatch_at: When the durable retry scheduler should next
+            attempt to dispatch this (still-PENDING) job to a worker.
+            Unset (``None``) on claim and on a running/terminal job; set
+            and pushed forward by the retry scheduler each time a dispatch
+            attempt finds no capacity. Drives the scheduler's "due for
+            dispatch" query, which matches only rows whose
+            ``next_dispatch_at`` is set and in the past.
+        dispatch_attempts: How many times dispatch has been deferred for
+            lack of worker capacity. Observability only — the failure
+            deadline is measured from ``submitted_at``, not this count.
     """
 
     model_config = ConfigDict(
@@ -107,18 +124,24 @@ class JobRecord(BaseModel):
     error: Annotated[str, StringConstraints(max_length=1024)] | None = None
     superseded_by: Annotated[str, StringConstraints(max_length=64)] | None = None
     file_meta_snapshot: dict[str, Any] | None = None
+    next_dispatch_at: datetime | None = None
+    dispatch_attempts: int = Field(default=0, ge=0)
 
-    @field_validator("submitted_at", "updated_at")
+    @field_validator("submitted_at", "updated_at", "next_dispatch_at")
     @classmethod
-    def _require_aware_datetime(cls, value: datetime) -> datetime:
+    def _require_aware_datetime(cls, value: datetime | None) -> datetime | None:
         """Reject naive datetimes.
 
         All internal producers go through ``lock._utcnow()`` which returns
         an aware UTC datetime; a naive value would either be a refactor
         regression or a malformed document. Caught here so the stale
-        cutoff comparison in ``claim_workflow`` can never raise
-        ``TypeError`` on a mismatched-naivety operand.
+        cutoff comparison in ``claim_workflow`` (and the scheduler's
+        ``next_dispatch_at`` comparison) can never raise ``TypeError`` on a
+        mismatched-naivety operand. ``next_dispatch_at`` is nullable, so a
+        ``None`` passes through untouched.
         """
+        if value is None:
+            return value
         if value.tzinfo is None:
             raise ValueError(
                 "JobRecord datetimes must be timezone-aware "
@@ -163,4 +186,6 @@ def to_mongo(self) -> dict[str, Any]:
                 if self.file_meta_snapshot is not None
                 else None
             ),
+            "next_dispatch_at": self.next_dispatch_at,
+            "dispatch_attempts": self.dispatch_attempts,
         }
diff --git a/src/cfdb/workflows/worker_lan.py b/src/cfdb/workflows/worker_lan.py
index 832ccc3..a9d2512 100644
--- a/src/cfdb/workflows/worker_lan.py
+++ b/src/cfdb/workflows/worker_lan.py
@@ -17,21 +17,26 @@
     python -m cfdb.workflows.worker_lan --namespace cfdb-workers --workers 2
 
 SIGINT (Ctrl-C) or SIGTERM drains the pool and exits. With no pool
-running, ``/data`` and ``/index`` requests for processable formats hang
-on the dispatch retry budget before failing with ``NoWorkersAvailable``.
+running, ``/data`` and ``/index`` requests for processable formats do not
+hang or surface ``NoWorkersAvailable`` to the client: the job is claimed
+and queued PENDING (the request returns ``202``), and the API's durable
+retry scheduler re-attempts dispatch every ``CFDB_WORKFLOW_RETRY_INTERVAL_S``
+until a worker appears or the ``CFDB_WORKFLOW_DISPATCH_DEADLINE_S`` deadline
+elapses (then the job is failed ``capacity:``).
 
 Environment variables (CLI flags mirror them):
 
 * ``WORKFLOW_POOL_NAMESPACE`` — LAN discovery namespace the pool
   publishes under (default ``cfdb-workers``). MUST match the API's value.
 * ``WORKFLOW_WORKER_COUNT`` — number of workers to spawn and publish
-  (default ``2``). Size it at least as high as the API's lease count or
-  the API blocks waiting for workers.
+  (default ``2``). The API admits every worker discovery surfaces (there is
+  no fixed lease count), so size this to the local concurrency you want.
 """
 
 from __future__ import annotations
 
 import asyncio
+import functools
 import logging
 import signal
 from typing import Optional
@@ -40,6 +45,8 @@
 import wool
 from wool.runtime.discovery.lan import LanDiscovery
 
+from cfdb.workflows import WORKER_MAX_CONCURRENT_TASKS
+from cfdb.workflows.backpressure import backpressure_for
 from cfdb.workflows.credentials import build_worker_credentials
 
 logger = logging.getLogger(__name__)
@@ -97,18 +104,31 @@ def _signal_handler_threaded(*_: object) -> None:
         except NotImplementedError:
             signal.signal(sig, _signal_handler_threaded)
 
+    # Bind per-worker backpressure onto the spawn factory so each spawned
+    # LocalWorker serializes its routines (mirrors the ECS worker_main
+    # wiring). ``functools.partial(..., backpressure=hook)`` keeps wool's
+    # ``declares_host`` True, so the pool still prescribes the bind host.
+    backpressure = backpressure_for(WORKER_MAX_CONCURRENT_TASKS)
+    worker_factory = (
+        functools.partial(wool.LocalWorker, backpressure=backpressure)
+        if backpressure is not None
+        else wool.LocalWorker
+    )
+
     pool = wool.WorkerPool(
         spawn=workers,
+        worker=worker_factory,
         discovery=LanDiscovery(namespace),
         credentials=credentials,
     )
     async with pool:
         logger.info(
-            "Published %d wool worker(s) under LAN namespace %r (mTLS %s) "
-            "— Ctrl-C to drain and exit",
+            "Published %d wool worker(s) under LAN namespace %r (mTLS %s, "
+            "max concurrent tasks %s) — Ctrl-C to drain and exit",
             workers,
             namespace,
             "enabled" if credentials is not None else "disabled",
+            WORKER_MAX_CONCURRENT_TASKS if backpressure is not None else "unbounded",
         )
         await stop_event.wait()
     return 0
diff --git a/src/cfdb/workflows/worker_main.py b/src/cfdb/workflows/worker_main.py
index c4cc603..5a92264 100644
--- a/src/cfdb/workflows/worker_main.py
+++ b/src/cfdb/workflows/worker_main.py
@@ -35,6 +35,8 @@
 import click
 import wool
 
+from cfdb.workflows import WORKER_MAX_CONCURRENT_TASKS
+from cfdb.workflows.backpressure import backpressure_for
 from cfdb.workflows.constants import DEFAULT_WORKER_PORT
 from cfdb.workflows.credentials import build_worker_credentials
 
@@ -136,16 +138,27 @@ def _signal_handler_threaded(*_: object) -> None:
         health_port, lambda: stop_event.is_set()
     )
 
+    # Per-worker backpressure: reject a dispatch once this worker already
+    # has CFDB_WORKER_MAX_CONCURRENT_TASKS routines in flight (default 1),
+    # so its subprocess pipelines serialize instead of oversubscribing the
+    # 1-vCPU task. ``None`` (threshold 0) keeps the unbounded prior behavior.
+    backpressure = backpressure_for(WORKER_MAX_CONCURRENT_TASKS)
+
     worker = wool.LocalWorker(
-        host="0.0.0.0", port=worker_port, credentials=credentials
+        host="0.0.0.0",
+        port=worker_port,
+        credentials=credentials,
+        backpressure=backpressure,
     )
     await worker.start()
     try:
         logger.info(
-            "Wool worker listening on port %d (health on %d, mTLS %s)",
+            "Wool worker listening on port %d (health on %d, mTLS %s, "
+            "max concurrent tasks %s)",
             worker_port,
             health_port,
             "enabled" if credentials is not None else "disabled",
+            WORKER_MAX_CONCURRENT_TASKS if backpressure is not None else "unbounded",
         )
         while True:
             # Check the self-termination path first. Max-lifetime is a
diff --git a/tests/conftest.py b/tests/conftest.py
index 21e5b6d..8d8ab76 100644
--- a/tests/conftest.py
+++ b/tests/conftest.py
@@ -23,10 +23,18 @@ def _resolve(doc: dict, key: str):
 def _match(doc: dict, query: dict) -> bool:
     """Minimal MongoDB query matcher supporting a small operator subset."""
     for key, cond in query.items():
+        # AND $and/$or with the rest of the top-level keys rather than
+        # short-circuiting the whole match, so a query that mixes them with
+        # sibling field conditions (e.g. {active, updated_at, $or}) is
+        # evaluated faithfully regardless of key order.
         if key == "$and":
-            return all(_match(doc, sub) for sub in cond)
+            if not all(_match(doc, sub) for sub in cond):
+                return False
+            continue
         if key == "$or":
-            return any(_match(doc, sub) for sub in cond)
+            if not any(_match(doc, sub) for sub in cond):
+                return False
+            continue
 
         value = _resolve(doc, key)
 
@@ -173,6 +181,10 @@ def _apply_update(doc: dict, update: dict, *, is_insert: bool = False) -> bool:
                 if v not in existing:
                     existing.append(v)
                     changed = True
+        elif op == "$inc":
+            for k, v in fields.items():
+                _set_nested(doc, k, (_resolve(doc, k) or 0) + v)
+                changed = True
         else:
             raise NotImplementedError(f"FakeCollection update op: {op}")
     return changed
@@ -267,12 +279,19 @@ async def find_one_and_update(
         *,
         upsert: bool = False,
         return_document=None,
+        sort: list | None = None,
         **_kwargs,
     ) -> dict | None:
-        for d in self.docs:
-            if _match(d, query):
-                _apply_update(d, update, is_insert=False)
-                return dict(d)
+        candidates = [d for d in self.docs if _match(d, query)]
+        if sort:
+            # Single-key sort is all the production queries use; honoring it
+            # lets tests exercise ordered claims (e.g. oldest-due-first).
+            field, direction = sort[0]
+            candidates.sort(key=lambda d: d.get(field), reverse=direction < 0)
+        if candidates:
+            d = candidates[0]
+            _apply_update(d, update, is_insert=False)
+            return dict(d)
         if upsert:
             seed: dict = {}
             for k, v in query.items():
@@ -311,10 +330,10 @@ async def update_many(self, query: dict, update: dict, **kwargs) -> _UpdateResul
         for d in self.docs:
             if _match(d, query):
                 matched += 1
-                for k, v in update.get("$set", {}).items():
-                    if d.get(k) != v:
-                        d[k] = v
-                        modified += 1
+                # Row-level modified count, matching real Mongo semantics
+                # (number of documents changed, not number of field writes).
+                if _apply_update(d, update, is_insert=False):
+                    modified += 1
         return _UpdateResult(matched, modified)
 
     async def update_one(self, query: dict, update: dict, **kwargs) -> _UpdateResult:
diff --git a/tests/integration/conftest.py b/tests/integration/conftest.py
index 41269d5..2b1418f 100644
--- a/tests/integration/conftest.py
+++ b/tests/integration/conftest.py
@@ -530,9 +530,10 @@ async def wool_pool():
 
     Function-scoped so every test gets a clean pool — prevents any
     cross-test state bleeding through worker process memory. The
-    executor's ``_dispatch_with_retry`` covers the brief window
-    between pool startup and LocalDiscovery surfacing the worker,
-    so no explicit sleep is needed here.
+    executor opens a single dispatch attempt and, on overflow,
+    reschedules the job onto its durable retry scheduler — so the brief
+    window between pool startup and the worker being surfaced needs no
+    explicit sleep here.
     """
     async with wool.WorkerPool(spawn=1):
         yield
diff --git a/tests/integration/routines.py b/tests/integration/routines.py
index 11f7d0a..5c4a291 100644
--- a/tests/integration/routines.py
+++ b/tests/integration/routines.py
@@ -53,10 +53,10 @@ class StubProcessor(Processor):
       workflow to race against.
     - ``sleep_between_yields``: a delay inserted AFTER each
       ``stage_complete`` yield. Used by the runtime-cap test: a quick
-      first event lets ``_open_stream_with_retry`` return so the
-      executor's ``asyncio.timeout(cap)`` block is entered, then the
-      between-yields delay outlasts the cap so the wait for the next
-      event triggers the cap.
+      first event lets ``_open_stream_once`` return (and the executor mark
+      the job RUNNING) so the ``asyncio.timeout(cap)`` block is entered,
+      then the between-yields delay outlasts the cap so the wait for the
+      next event triggers the cap.
     - ``unpicklable_field``: any value assigned to ``self.unpicklable_field``
       to deliberately poison cloudpickle. Used by the boundary tests
       that need to verify a pickling failure surfaces as a clean
diff --git a/tests/integration/test_executor_boundary.py b/tests/integration/test_executor_boundary.py
index 2a6d1a8..24acad5 100644
--- a/tests/integration/test_executor_boundary.py
+++ b/tests/integration/test_executor_boundary.py
@@ -10,11 +10,16 @@
 from __future__ import annotations
 
 import asyncio
+import functools
 
 import pytest
+import wool
 
+from cfdb.workflows import executor as executor_module
+from cfdb.workflows.backpressure import TaskCountBackpressure
 from cfdb.workflows.cache import LocalFsCache
 from cfdb.workflows.executor import WoolExecutor
+from cfdb.workflows.loadbalancer import PriorityLoadBalancer
 from cfdb.workflows.lock import get_job
 from cfdb.workflows.models import ACTIVE_STATUSES, JobStatus
 from cfdb.workflows.processors.registry import ProcessorRegistry
@@ -30,9 +35,7 @@ def _install_jobs_index(mock_db) -> None:
     mock_db.jobs.create_index(
         {"workflow_key": 1},
         unique=True,
-        partialFilterExpression={
-            "status": {"$in": [s.value for s in ACTIVE_STATUSES]}
-        },
+        partialFilterExpression={"active": True},
     )
 
 
@@ -127,8 +130,8 @@ async def test_ensure_workflow_should_record_failed_when_routine_raises_across_p
         assert final.status == JobStatus.FAILED
         assert final.error is not None
         # When the routine raises before yielding, wool propagates the
-        # exception across the boundary; the executor records
-        # ``str(exc)`` from the ``_open_stream_with_retry`` failure path.
+        # exception across the boundary; the executor records ``str(exc)``
+        # from ``_attempt_dispatch``'s stream-open failure path.
         # The exception class name is not part of the persisted error.
         assert "boom" in final.error
         # The scrub regex strips multi-segment absolute paths from the
@@ -183,8 +186,11 @@ async def test_ensure_workflow_should_record_failed_when_workflow_duration_cap_f
         assert final.status == JobStatus.FAILED
         assert final.error is not None
         assert "runtime cap" in final.error.lower()
-        # Workdir is cleaned up regardless of failure mode.
-        assert not (executor._workdir_root / record.job_id).exists()
+        # Workdir is cleaned up regardless of failure mode. Assert against
+        # the workdir ROOT, not ``root / job_id``: the per-attempt workdir is
+        # ``root / f"{job_id}-{uuid}"`` (B1), so a bare-``job_id`` path is
+        # never created and asserting its absence would be vacuously true.
+        assert list(executor._workdir_root.iterdir()) == []
         # The internal bookkeeping sets are drained after the await.
         assert len(executor._pending_tasks) == 0
         assert len(executor._finalize_tasks) == 0
@@ -275,3 +281,91 @@ async def test_ensure_workflow_should_attach_second_caller_to_active_job_across_
             d for d in mock_db.jobs.docs if d["workflow_key"] == rec_a.workflow_key
         ]
         assert len(records_for_key) == 1
+
+
+class TestProductionPoolConfiguration:
+    @pytest.mark.asyncio
+    async def test_ensure_workflow_should_distribute_and_reschedule_across_priority_pool(
+        self, mock_db, tmp_path, monkeypatch
+    ):
+        """Test the production pool config over the real pickle boundary.
+
+        Given:
+            A two-worker ``wool.WorkerPool`` wired exactly as production —
+            ``PriorityLoadBalancer`` plus a ``TaskCountBackpressure(1)``-bound
+            spawn factory — the durable retry scheduler running, and three
+            distinct slow workflows that each hold their worker busy for the
+            whole test.
+        When:
+            All three are dispatched via ``ensure_workflow`` and the system
+            is allowed to settle.
+        Then:
+            Exactly two reach RUNNING concurrently (one per worker — proving
+            the load balancer distributes across both over the cloudpickle /
+            gRPC boundary), and the third is left PENDING with a
+            ``next_dispatch_at``: the priority balancer rotates past both
+            ``RESOURCE_EXHAUSTED`` rejections, surfaces ``NoWorkersAvailable``,
+            and the durable scheduler keeps it queued rather than dropping or
+            double-running it.
+        """
+        # Arrange — fast retries so an early overflow (a worker not yet
+        # surfaced at pool startup) is healed by the scheduler promptly.
+        monkeypatch.setattr(executor_module, "_RETRY_INTERVAL_SECONDS", 0.2)
+        monkeypatch.setattr(executor_module, "_RETRY_JITTER_MAX_SECONDS", 0.1)
+        _install_jobs_index(mock_db)
+        cache = LocalFsCache(tmp_path / "cache")
+        registry = ProcessorRegistry()
+        registry.register(StubProcessor(sleep_seconds=30.0))
+        executor = WoolExecutor(
+            mock_db, cache, registry, workdir_root=tmp_path / "jobs"
+        )
+        metas = [{**stub_file_meta(), "local_id": f"ENCFF-{i}"} for i in range(3)]
+        pool = wool.WorkerPool(
+            spawn=2,
+            worker=functools.partial(
+                wool.LocalWorker, backpressure=TaskCountBackpressure(1)
+            ),
+            loadbalancer=PriorityLoadBalancer(),
+        )
+
+        async def _settled() -> tuple[int, int]:
+            running = pending_resched = 0
+            for meta in metas:
+                # job_id isn't known until ensure_workflow returns; match on
+                # the workflow_key the records carry instead.
+                for doc in mock_db.jobs.docs:
+                    if doc["local_id"] != meta["local_id"]:
+                        continue
+                    if doc["status"] == JobStatus.RUNNING.value:
+                        running += 1
+                    elif (
+                        doc["status"] == JobStatus.PENDING.value
+                        and doc.get("next_dispatch_at") is not None
+                    ):
+                        pending_resched += 1
+            return running, pending_resched
+
+        # Act / Assert
+        try:
+            async with pool:
+                executor.start_scheduler()
+                for meta in metas:
+                    await executor.ensure_workflow(meta)
+
+                # Backpressure(1) on two workers caps concurrent RUNNING at
+                # two, so the steady state is (2 running, 1 overflowed). The
+                # long-running stubs hold that state for the whole poll.
+                deadline = asyncio.get_event_loop().time() + 25.0
+                running = pending_resched = 0
+                while asyncio.get_event_loop().time() < deadline:
+                    running, pending_resched = await _settled()
+                    if running == 2 and pending_resched == 1:
+                        break
+                    await asyncio.sleep(0.2)
+
+                assert running == 2, f"expected 2 RUNNING (one per worker), got {running}"
+                assert pending_resched == 1, (
+                    f"expected 1 overflowed+rescheduled job, got {pending_resched}"
+                )
+        finally:
+            await executor.drain(timeout=15.0)
diff --git a/tests/integration/test_processor_e2e.py b/tests/integration/test_processor_e2e.py
index f592878..780eb23 100644
--- a/tests/integration/test_processor_e2e.py
+++ b/tests/integration/test_processor_e2e.py
@@ -643,5 +643,8 @@ async def test_ensure_workflow_should_surface_failed_status_when_processor_raise
         # The path-scrub regex strips multi-segment absolute paths so a
         # ``/<seg>/<seg>`` shape MUST NOT survive into persisted text.
         assert re.search(r"/[A-Za-z][A-Za-z0-9_.-]*/[A-Za-z]", final.error) is None
-        # Workdir is cleaned up regardless of failure mode.
-        assert not (integration_executor._workdir_root / record.job_id).exists()
+        # Workdir is cleaned up regardless of failure mode. Assert against
+        # the workdir ROOT, not ``root / job_id``: the per-attempt workdir is
+        # ``root / f"{job_id}-{uuid}"`` (B1), so a bare-``job_id`` path is
+        # never created and asserting its absence would be vacuously true.
+        assert list(integration_executor._workdir_root.iterdir()) == []
diff --git a/tests/test_api/test_lifespan_workerpool.py b/tests/test_api/test_lifespan_workerpool.py
index d0eb79c..b4681fd 100644
--- a/tests/test_api/test_lifespan_workerpool.py
+++ b/tests/test_api/test_lifespan_workerpool.py
@@ -97,6 +97,16 @@ async def _fake_build_discovery(_profile):
     assert kwargs["quorum"] == 0
     assert kwargs["discovery"] is sentinel_discovery
     assert kwargs["credentials"] == "CREDS-SENTINEL"
+    # The priority/leaky-bucket balancer must be wired in (issue #45). This
+    # file exists to pin pool kwargs against silent regressions, so the
+    # load balancer belongs here alongside quorum/discovery/credentials.
+    from cfdb.workflows.loadbalancer import PriorityLoadBalancer
+
+    assert isinstance(kwargs["loadbalancer"], PriorityLoadBalancer)
+    # The durable retry scheduler must be started inside the pool context
+    # (issue #45) so it inherits wool's dispatch contextvars; pin it here
+    # alongside the other load-bearing lifespan wiring.
+    fake_executor.start_scheduler.assert_called_once()
 
 
 def test_worker_pool_signature_should_accept_quorum():
diff --git a/tests/test_cache_stream.py b/tests/test_cache_stream.py
index 5626d8b..e415829 100644
--- a/tests/test_cache_stream.py
+++ b/tests/test_cache_stream.py
@@ -17,7 +17,11 @@
     stream_cache_entry,
 )
 from cfdb.workflows.cache import LocalFsCache
-from cfdb.workflows.executor import ExecutorDraining, WorkflowNotApplicable
+from cfdb.workflows.executor import (
+    AdmissionRejected,
+    ExecutorDraining,
+    WorkflowNotApplicable,
+)
 from cfdb.workflows.models import ArtifactKind, JobRecord, JobStatus
 from cfdb.workflows.processors.base import Processor
 from cfdb.workflows.processors.registry import ProcessorRegistry
@@ -492,6 +496,48 @@ async def test_should_raise_503_when_executor_is_draining(
         assert exc_info.value.status_code == 503
         assert "Retry-After" in exc_info.value.headers
 
+    @pytest.mark.asyncio
+    async def test_should_raise_429_when_admission_rejected(
+        self, mocker, tmp_path
+    ):
+        """Test that ``AdmissionRejected`` is translated to 429 with Retry-After.
+
+        Given:
+            Cache miss + GET + an executor that raises ``AdmissionRejected``
+            (the active-workflow ceiling is hit).
+        When:
+            The helper is awaited.
+        Then:
+            It should raise ``HTTPException(429)`` carrying the exception's
+            ``retry_after_seconds`` as the ``Retry-After`` header — so a
+            flood backs off rather than falling through to direct streaming
+            (which would bypass the bounded pipeline).
+        """
+        # Arrange
+        registry = ProcessorRegistry()
+        registry.register(_StubProcessor())
+        mocker.patch.object(api, "processor_registry", registry)
+        mocker.patch.object(api, "cache", LocalFsCache(tmp_path / "cache"))
+        mocker.patch.object(
+            api,
+            "executor",
+            _RecordingExecutor(
+                exc=AdmissionRejected(active=12, ceiling=12, retry_after_seconds=7)
+            ),
+        )
+
+        # Act & assert
+        with pytest.raises(HTTPException) as exc_info:
+            await serve_workflow_artifact_or_dispatch(
+                _file_doc(),
+                ArtifactKind.DATA,
+                _Request(),
+                None,
+                head_404_detail="missing",
+            )
+        assert exc_info.value.status_code == 429
+        assert exc_info.value.headers["Retry-After"] == "7"
+
     @pytest.mark.asyncio
     async def test_should_return_none_on_workflow_not_applicable_race(
         self, mocker, tmp_path
diff --git a/tests/test_data.py b/tests/test_data.py
index a81d769..f012969 100644
--- a/tests/test_data.py
+++ b/tests/test_data.py
@@ -11,7 +11,6 @@
 from cfdb.api.routers.data import stream_file, stream_file_status
 from cfdb.services import drs, locks
 from cfdb.workflows.executor import WoolExecutor
-from cfdb.workflows.models import ACTIVE_STATUSES
 from cfdb.workflows.processors.bam import BamIndexProcessor
 from tests.test_workflows import FIXTURE_MD5
 from cfdb.workflows.processors.registry import ProcessorRegistry, default_registry
@@ -115,9 +114,7 @@ async def test_stream_file_should_dispatch_workflow_when_processor_applies(
         mock_db.jobs.create_index(
             {"workflow_key": 1},
             unique=True,
-            partialFilterExpression={
-                "status": {"$in": [s.value for s in ACTIVE_STATUSES]}
-            },
+            partialFilterExpression={"active": True},
         )
 
         mock_db.dcc.docs = [
diff --git a/tests/test_index.py b/tests/test_index.py
index 8e286fc..3dc00b0 100644
--- a/tests/test_index.py
+++ b/tests/test_index.py
@@ -11,7 +11,7 @@
 from cfdb.api.routers.index import stream_index_file, stream_index_file_status
 from cfdb.services import drs, locks
 from cfdb.workflows.executor import WoolExecutor
-from cfdb.workflows.models import ACTIVE_STATUSES, ArtifactKind
+from cfdb.workflows.models import ArtifactKind
 from cfdb.workflows.processors.bam import BamIndexProcessor
 from cfdb.workflows.processors.registry import ProcessorRegistry, default_registry
 from tests.test_workflows import FIXTURE_MD5
@@ -1048,9 +1048,7 @@ async def test_stream_index_file_should_return_202_when_processor_applies_and_ca
         mock_db.jobs.create_index(
             {"workflow_key": 1},
             unique=True,
-            partialFilterExpression={
-                "status": {"$in": [s.value for s in ACTIVE_STATUSES]}
-            },
+            partialFilterExpression={"active": True},
         )
 
         # Register a real processor + cache + executor on the module.
diff --git a/tests/test_indexes.py b/tests/test_indexes.py
index 7b1780d..46e04b3 100644
--- a/tests/test_indexes.py
+++ b/tests/test_indexes.py
@@ -373,16 +373,22 @@ def test_create_indexes_js_should_define_all_operational_index_names():
         The JS source is read.
     Then:
         It should reference each named operational index and the
-        locks.active index.
+        locks.active index. The names are derived from
+        ``operational_index_specs()`` (not hard-coded) so a newly-added
+        operational index is guarded automatically.
     """
+    # Arrange — the named jobs-collection operational indexes (the locks
+    # index relies on Mongo's auto-generated name and is asserted below).
+    named = [
+        spec.name
+        for spec in operational_index_specs()
+        if spec.name and spec.collection == "jobs"
+    ]
+
     # Assert
-    for name in (
-        "workflow_key_active_unique",
-        "job_id_unique",
-        "status_updated_at",
-        "terminal_ttl",
-    ):
-        assert name in _JS
+    assert named, "expected at least one named jobs operational index"
+    for name in named:
+        assert name in _JS, f"{name} missing from create-indexes.js"
     assert "db.locks.createIndex({ active: 1 })" in _JS
 
 
diff --git a/tests/test_workflows/test_backpressure.py b/tests/test_workflows/test_backpressure.py
new file mode 100644
index 0000000..df984be
--- /dev/null
+++ b/tests/test_workflows/test_backpressure.py
@@ -0,0 +1,183 @@
+"""Tests for the per-worker backpressure hook."""
+
+from __future__ import annotations
+
+import cloudpickle
+import pytest
+import wool
+
+from cfdb.workflows.backpressure import TaskCountBackpressure
+from cfdb.workflows.backpressure import backpressure_for
+
+
+def _ctx(active_task_count: int) -> wool.BackpressureContext:
+    """Build a BackpressureContext with a sentinel task for hook tests."""
+    return wool.BackpressureContext(active_task_count=active_task_count, task=object())
+
+
+class TestTaskCountBackpressure:
+    def test___init___should_raise_when_threshold_below_one(self):
+        """Test that a non-positive threshold is rejected at construction.
+
+        Given:
+            A threshold of 0, the "disable backpressure" sentinel that the
+            wiring handles by passing None instead of this hook.
+        When:
+            TaskCountBackpressure is constructed with it.
+        Then:
+            It should raise ValueError rather than build a hook that never
+            rejects.
+        """
+        # Arrange, act, & assert
+        with pytest.raises(ValueError, match="threshold must be >= 1"):
+            TaskCountBackpressure(0)
+
+    def test___call___should_accept_when_below_threshold(self):
+        """Test that the hook accepts a task below the threshold.
+
+        Given:
+            A hook with threshold 1 and a worker with no tasks in flight.
+        When:
+            The hook is invoked with that context.
+        Then:
+            It should return False (accept).
+        """
+        # Arrange
+        hook = TaskCountBackpressure(1)
+
+        # Act
+        decision = hook(_ctx(active_task_count=0))
+
+        # Assert
+        assert decision is False
+
+    def test___call___should_reject_when_at_threshold(self):
+        """Test that the hook rejects a task at the threshold.
+
+        Given:
+            A hook with threshold 1 and a worker already running one task.
+        When:
+            The hook is invoked with that context.
+        Then:
+            It should return True (reject), which wool surfaces as
+            RESOURCE_EXHAUSTED.
+        """
+        # Arrange
+        hook = TaskCountBackpressure(1)
+
+        # Act
+        decision = hook(_ctx(active_task_count=1))
+
+        # Assert
+        assert decision is True
+
+    def test___call___should_reject_when_above_threshold(self):
+        """Test that the hook rejects once in-flight exceeds the threshold.
+
+        Given:
+            A hook with threshold 2 and a worker running three tasks.
+        When:
+            The hook is invoked with that context.
+        Then:
+            It should return True (reject).
+        """
+        # Arrange
+        hook = TaskCountBackpressure(2)
+
+        # Act
+        decision = hook(_ctx(active_task_count=3))
+
+        # Assert
+        assert decision is True
+
+    def test_cloudpickle_roundtrip_should_preserve_threshold(self):
+        """Test that the hook survives the worker-serialization boundary.
+
+        Given:
+            A hook that wool will serialize into a LocalWorker subprocess.
+        When:
+            It is cloudpickle dumped and loaded back.
+        Then:
+            The restored hook preserves its threshold and still rejects at
+            it, proving it is safe to ship across the worker boundary.
+        """
+        # Arrange
+        hook = TaskCountBackpressure(2)
+
+        # Act
+        restored = cloudpickle.loads(cloudpickle.dumps(hook))
+
+        # Assert
+        assert isinstance(restored, TaskCountBackpressure)
+        assert restored.threshold == 2
+        assert restored(_ctx(active_task_count=2)) is True
+        assert restored(_ctx(active_task_count=1)) is False
+
+    def test_conforms_to_backpressure_like_protocol(self):
+        """Test that the hook satisfies wool's BackpressureLike protocol.
+
+        Given:
+            A constructed hook.
+        When:
+            It is checked against the runtime-checkable BackpressureLike
+            protocol.
+        Then:
+            It should be recognized as a BackpressureLike, so wool accepts
+            it where a backpressure hook is expected.
+        """
+        # Arrange
+        hook = TaskCountBackpressure(1)
+
+        # Act & assert — structural conformance, plus the bool-return
+        # contract wool actually depends on (BackpressureLike is satisfied
+        # by any callable, so the return-type check gives the test teeth).
+        assert isinstance(hook, wool.BackpressureLike)
+        assert isinstance(hook(_ctx(active_task_count=0)), bool)
+
+
+class TestBackpressureFor:
+    def test_backpressure_for_should_return_none_when_threshold_is_zero(self):
+        """Test that a threshold of 0 disables backpressure.
+
+        Given:
+            A threshold of 0 (the disable sentinel).
+        When:
+            backpressure_for is called.
+        Then:
+            It should return None so the wiring passes backpressure=None to
+            wool (unbounded admission).
+        """
+        # Act & assert
+        assert backpressure_for(0) is None
+
+    def test_backpressure_for_should_build_hook_when_threshold_positive(self):
+        """Test that a positive threshold yields a configured hook.
+
+        Given:
+            A threshold of 2.
+        When:
+            backpressure_for is called.
+        Then:
+            It should return a TaskCountBackpressure carrying that threshold.
+        """
+        # Act
+        hook = backpressure_for(2)
+
+        # Assert
+        assert isinstance(hook, TaskCountBackpressure)
+        assert hook.threshold == 2
+
+    def test_backpressure_for_should_raise_on_negative_threshold(self):
+        """Test that a negative threshold is rejected, not silently disabled.
+
+        Given:
+            A negative threshold.
+        When:
+            backpressure_for is called.
+        Then:
+            It should raise ValueError, so a negative value cannot
+            accidentally disable backpressure (only 0 disables it).
+        """
+        # Act & assert
+        with pytest.raises(ValueError, match=">= 0"):
+            backpressure_for(-1)
diff --git a/tests/test_workflows/test_executor.py b/tests/test_workflows/test_executor.py
index 76d315f..a33a86b 100644
--- a/tests/test_workflows/test_executor.py
+++ b/tests/test_workflows/test_executor.py
@@ -3,7 +3,9 @@
 from __future__ import annotations
 
 import asyncio
+import uuid
 from collections.abc import AsyncIterator
+from datetime import datetime, timedelta, timezone
 from pathlib import Path
 from typing import Any
 
@@ -20,13 +22,19 @@
 )
 from cfdb.workflows.executor import (
     PIPELINE_VERSION,
+    AdmissionRejected,
     ExecutorDraining,
     WoolExecutor,
     WorkflowNotApplicable,
     extract_identity,
 )
 from cfdb.workflows.lock import get_job
-from cfdb.workflows.models import ACTIVE_STATUSES, ArtifactKind, JobStatus
+from cfdb.workflows.models import (
+    ACTIVE_STATUSES,
+    ArtifactKind,
+    JobRecord,
+    JobStatus,
+)
 from cfdb.workflows.processors.base import Processor
 from cfdb.workflows.processors.registry import ProcessorRegistry
 from cfdb.workflows.provisioner import EcsProvisioner, RetryableProvisionerError
@@ -107,9 +115,7 @@ def _install_jobs_index(mock_db) -> None:
     mock_db.jobs.create_index(
         {"workflow_key": 1},
         unique=True,
-        partialFilterExpression={
-            "status": {"$in": [s.value for s in ACTIVE_STATUSES]}
-        },
+        partialFilterExpression={"active": True},
     )
 
 
@@ -124,6 +130,59 @@ async def _wait_for_terminal(mock_db, job_id: str, timeout: float = 2.0) -> None
     raise AssertionError(f"Job {job_id} did not reach terminal status")
 
 
+async def _wait_for_status(
+    mock_db, job_id: str, status: JobStatus, timeout: float = 2.0
+) -> None:
+    """Poll the job record until it reaches ``status`` (or time out)."""
+    deadline = asyncio.get_event_loop().time() + timeout
+    while asyncio.get_event_loop().time() < deadline:
+        record = await get_job(mock_db, job_id)
+        if record is not None and record.status == status:
+            return
+        await asyncio.sleep(0.01)
+    raise AssertionError(f"Job {job_id} did not reach status {status}")
+
+
+async def _seed_pending_job(
+    mock_db,
+    *,
+    submitted_at: datetime | None = None,
+    next_dispatch_at: datetime | None = None,
+    meta: dict[str, Any] | None = None,
+    status: JobStatus = JobStatus.PENDING,
+    updated_at: datetime | None = None,
+) -> JobRecord:
+    """Insert a crafted job row directly and return its JobRecord.
+
+    Lets a test drive ``_attempt_dispatch`` / ``_drain_due_jobs`` /
+    ``_recover_orphans`` against a job without going through
+    ``ensure_workflow`` (which would spawn its own inline attempt).
+    ``submitted_at`` controls the deadline clock; ``next_dispatch_at``
+    controls scheduler due-ness; ``status`` / ``updated_at`` craft orphan
+    states (e.g. a stale RUNNING row).
+    """
+    meta = meta or _file_meta()
+    dcc, local_id, md5 = extract_identity(meta)
+    now = datetime.now(timezone.utc)
+    record = JobRecord(
+        job_id=str(uuid.uuid4()),
+        workflow_key=key_utils.workflow_key(
+            dcc=dcc, local_id=local_id, md5=md5, pipeline_version=PIPELINE_VERSION
+        ),
+        status=status,
+        dcc=dcc,
+        local_id=local_id,
+        md5=md5,
+        pipeline_version=PIPELINE_VERSION,
+        submitted_at=submitted_at or now,
+        updated_at=updated_at or now,
+        file_meta_snapshot=meta,
+        next_dispatch_at=next_dispatch_at,
+    )
+    await mock_db.jobs.insert_one(record.to_mongo())
+    return record
+
+
 class TestExtractIdentity:
     def test_extract_identity_should_prefer_nested_dcc_abbreviation(self):
         """Test that extract_identity reads the canonical C2M2 shape.
@@ -393,7 +452,7 @@ async def test_ensure_workflow_should_cleanup_workdir_after_success(
         tmp_workdir,
         no_wool_dispatch,
     ):
-        """Test that the per-job workdir is removed after a successful run.
+        """Test that the per-attempt workdir is removed after a successful run.
 
         Given:
             A registry with a stub processor that writes a file into its
@@ -401,7 +460,8 @@ async def test_ensure_workflow_should_cleanup_workdir_after_success(
         When:
             ensure_workflow is awaited and the background task completes.
         Then:
-            The per-job workdir should no longer exist on disk.
+            No scratch directory should remain under the workdir root (the
+            per-attempt workdir, named with a nonce, is removed).
         """
 
         class _WorkdirTouchingProcessor(_StubProcessor):
@@ -425,8 +485,9 @@ async def run(self, file_meta, workdir, cache_root):
         record, _ = await executor.ensure_workflow(_file_meta())
         await _wait_for_terminal(mock_db, record.job_id)
 
-        # Assert
-        assert not (tmp_workdir / record.job_id).exists()
+        # Assert — the per-attempt workdir (job_id + nonce) is removed, so
+        # no scratch directory leaks under the workdir root.
+        assert list(tmp_workdir.iterdir()) == []
 
     @pytest.mark.asyncio
     async def test_ensure_workflow_should_release_even_when_record_stage_fails(
@@ -495,7 +556,7 @@ async def test_ensure_workflow_should_record_timeout_when_cap_exceeded(
 
         # Arrange — yield one event first so the consumer enters the
         # asyncio.timeout block (the timeout wraps the event loop, not
-        # the initial _open_stream_with_retry call); then hang so the
+        # the initial _open_stream_once call); then hang so the
         # cap fires on the second iteration.
         class _HangingProcessor(_StubProcessor):
             async def run(self, file_meta, workdir, cache_root):
@@ -862,22 +923,21 @@ def test_pipeline_version_should_be_positive():
     assert isinstance(PIPELINE_VERSION, int)
 
 
-class TestOpenStreamWithRetry:
+class TestOpenStreamOnce:
     @pytest.mark.asyncio
-    async def test_open_stream_with_retry_should_recover_after_one_no_workers_error(
+    async def test_open_stream_once_should_return_stream_on_first_event(
         self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch, mocker
     ):
-        """Test that ``NoWorkersAvailable`` triggers a retry that eventually wins.
+        """Test that a routine yielding an event returns a usable stream.
 
         Given:
-            An executor whose ``_run_processor_routine`` raises
-            ``wool.NoWorkersAvailable`` on its first ``__anext__`` then
-            yields events on the second attempt.
+            An executor whose ``_run_processor_routine`` yields a regular
+            event followed by a ``complete``.
         When:
-            ``_open_stream_with_retry`` is awaited.
+            ``_open_stream_once`` is awaited.
         Then:
-            It should return a stream that yields the second invocation's
-            events without re-raising.
+            It should return a stream that re-yields the first event plus
+            the rest, in a single dispatch attempt (no retry loop).
         """
         # Arrange
         _install_jobs_index(mock_db)
@@ -888,57 +948,45 @@ async def test_open_stream_with_retry_should_recover_after_one_no_workers_error(
             mock_db, tmp_cache, registry, workdir_root=tmp_workdir
         )
 
-        # Build two routines: one that immediately raises NoWorkersAvailable
-        # and one that yields a regular event followed by a complete.
         attempts = {"count": 0}
 
-        async def failing_stream():
-            attempts["count"] += 1
-            raise wool.NoWorkersAvailable("simulated cold start")
-            yield  # pragma: no cover
-
         async def working_stream():
             attempts["count"] += 1
             yield StageComplete(kind=ArtifactKind.INDEX, key=_INDEX_KEY)
             yield Complete(artifacts={})
 
-        streams = [failing_stream(), working_stream()]
-
-        def fake_routine(*_args, **_kwargs):
-            return streams.pop(0)
-
-        mocker.patch.object(executor_module, "_run_processor_routine", fake_routine)
-        # Patch sleep so retries don't slow the test down.
-        async def fast_sleep(_s):
-            return None
-
-        mocker.patch.object(executor_module.asyncio, "sleep", fast_sleep)
+        mocker.patch.object(
+            executor_module,
+            "_run_processor_routine",
+            lambda *_a, **_k: working_stream(),
+        )
 
         # Act
-        stream = await executor._open_stream_with_retry(
+        stream = await executor._open_stream_once(
             processor, _file_meta(), tmp_workdir / "wd"
         )
         events = [event async for event in stream]
 
-        # Assert
-        assert attempts["count"] == 2
+        # Assert — a single attempt, both events delivered.
+        assert attempts["count"] == 1
+        assert any(isinstance(e, StageComplete) for e in events)
         assert any(isinstance(e, Complete) for e in events)
 
     @pytest.mark.asyncio
-    async def test_open_stream_with_retry_should_raise_when_deadline_expires(
+    async def test_open_stream_once_should_propagate_no_workers_available(
         self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch, mocker
     ):
-        """Test that the retry loop eventually surfaces ``NoWorkersAvailable``.
+        """Test that ``NoWorkersAvailable`` propagates without retrying.
 
         Given:
-            An executor whose ``_run_processor_routine`` always raises
-            ``wool.NoWorkersAvailable``; the dispatch-wait budget is
-            patched to a tiny value so the deadline expires quickly.
+            An executor whose ``_run_processor_routine`` raises
+            ``wool.NoWorkersAvailable`` on its first ``__anext__``.
         When:
-            ``_open_stream_with_retry`` is awaited.
+            ``_open_stream_once`` is awaited.
         Then:
-            It should raise the last ``NoWorkersAvailable`` after the
-            dispatch budget is exhausted.
+            It should propagate the ``NoWorkersAvailable`` in a single
+            attempt — the durable retry scheduler, not an in-attempt loop,
+            owns retries now. The freshly-constructed stream is closed.
         """
         # Arrange
         _install_jobs_index(mock_db)
@@ -949,26 +997,25 @@ async def test_open_stream_with_retry_should_raise_when_deadline_expires(
             mock_db, tmp_cache, registry, workdir_root=tmp_workdir
         )
 
+        attempts = {"count": 0}
+
         async def failing_stream():
-            raise wool.NoWorkersAvailable("still no workers")
+            attempts["count"] += 1
+            raise wool.NoWorkersAvailable("simulated cold start")
             yield  # pragma: no cover
 
-        def fake_routine(*_args, **_kwargs):
-            return failing_stream()
-
-        mocker.patch.object(executor_module, "_run_processor_routine", fake_routine)
-        mocker.patch.object(executor_module, "_DISPATCH_WAIT_SECONDS", 0.01)
-
-        async def fast_sleep(_s):
-            return None
-
-        mocker.patch.object(executor_module.asyncio, "sleep", fast_sleep)
+        mocker.patch.object(
+            executor_module,
+            "_run_processor_routine",
+            lambda *_a, **_k: failing_stream(),
+        )
 
         # Act & assert
         with pytest.raises(wool.NoWorkersAvailable):
-            await executor._open_stream_with_retry(
+            await executor._open_stream_once(
                 processor, _file_meta(), tmp_workdir / "wd"
             )
+        assert attempts["count"] == 1
 
 
 class TestStreamConsumerProgressEvents:
@@ -1121,7 +1168,7 @@ def test_executor_draining_should_be_caught_by_workflow_not_applicable(self):
 
 class TestOpenStreamMalformedFirstEvent:
     @pytest.mark.asyncio
-    async def test_open_stream_with_retry_should_propagate_non_capacity_error(
+    async def test_open_stream_once_should_propagate_non_capacity_error(
         self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch, mocker
     ):
         """Test that a non-``NoWorkersAvailable`` first error propagates.
@@ -1130,11 +1177,11 @@ async def test_open_stream_with_retry_should_propagate_non_capacity_error(
             A routine whose first ``__anext__`` raises a generic
             ``RuntimeError`` (not a wool capacity error).
         When:
-            ``_open_stream_with_retry`` is awaited.
+            ``_open_stream_once`` is awaited.
         Then:
-            The exception should propagate to the caller — the retry
-            loop only handles ``NoWorkersAvailable``, every other class
-            is a real failure.
+            The exception should propagate to the caller —
+            ``_attempt_dispatch`` translates it into a FAILED job; only
+            ``NoWorkersAvailable`` is treated as overflow.
         """
         # Arrange
         _install_jobs_index(mock_db)
@@ -1156,7 +1203,7 @@ def fake_routine(*_args, **_kwargs):
 
         # Act & assert
         with pytest.raises(RuntimeError, match="not a capacity"):
-            await executor._open_stream_with_retry(
+            await executor._open_stream_once(
                 processor, _file_meta(), tmp_workdir / "wd"
             )
 
@@ -1197,25 +1244,28 @@ async def test_ensure_workflow_should_persist_file_meta_snapshot_on_insert(
 
 class TestWoolExecutorWithProvisioner:
     @pytest.mark.asyncio
-    async def test_ensure_workflow_should_request_worker_when_provisioner_set(
+    async def test__attempt_dispatch_should_not_request_provisioner_when_worker_available(
         self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch, mocker
     ):
-        """Test that a fresh claim dispatches a provisioner request.
+        """Test that the provisioner is left idle when a worker accepts.
 
         Given:
-            A WoolExecutor wired with a stub EcsProvisioner.
+            A WoolExecutor wired with a stub EcsProvisioner and an
+            in-process worker (no_wool_dispatch) that accepts the task.
         When:
-            ``ensure_workflow`` is awaited on a previously-unseen file.
+            ``ensure_workflow`` dispatches the inline attempt and it wins
+            a worker.
         Then:
-            The provisioner's ``request`` should be awaited exactly once
-            with ``dedup_key`` set to the workflow mutex key for the file.
+            ``provisioner.request`` should NOT be awaited — spawn-on-
+            overflow means scale-up only happens when no worker accepts,
+            inverting the old unconditional pre-dispatch spawn (the cost
+            leak).
         """
         # Arrange
         _install_jobs_index(mock_db)
         registry = ProcessorRegistry()
         registry.register(_StubProcessor())
         provisioner = mocker.AsyncMock(spec=EcsProvisioner)
-        provisioner.request.return_value = ["arn:fake:task/abc"]
         executor = WoolExecutor(
             mock_db,
             tmp_cache,
@@ -1223,51 +1273,41 @@ async def test_ensure_workflow_should_request_worker_when_provisioner_set(
             workdir_root=tmp_workdir,
             provisioner=provisioner,
         )
-        meta = _file_meta()
-        dcc, local_id, md5 = extract_identity(meta)
-        expected_key = key_utils.workflow_key(
-            dcc=dcc, local_id=local_id, md5=md5, pipeline_version=PIPELINE_VERSION
-        )
 
         # Act
-        record, _ = await executor.ensure_workflow(meta)
+        record, _ = await executor.ensure_workflow(_file_meta())
         await _wait_for_terminal(mock_db, record.job_id)
 
         # Assert
-        provisioner.request.assert_awaited_once_with(dedup_key=expected_key)
+        final = await get_job(mock_db, record.job_id)
+        assert final is not None
+        assert final.status == JobStatus.COMPLETED
+        provisioner.request.assert_not_awaited()
 
     @pytest.mark.asyncio
-    async def test_ensure_workflow_should_skip_provisioner_when_attaching_to_existing_job(
+    async def test__attempt_dispatch_should_request_provisioner_on_overflow(
         self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch, mocker
     ):
-        """Test that the provisioner is only invoked on fresh claims.
+        """Test that overflow requests one bounded worker spawn and reschedules.
 
         Given:
-            An executor with a stub provisioner and a blocking processor
-            so the first claim stays active when the second call arrives.
+            A WoolExecutor wired with a stub provisioner and a queued job
+            whose dispatch attempt finds no worker capacity
+            (``_open_stream_once`` raises ``NoWorkersAvailable``).
         When:
-            ``ensure_workflow`` is awaited twice for the same file_meta.
+            ``_attempt_dispatch`` runs.
         Then:
-            ``provisioner.request`` should be awaited exactly once — the
-            attach path must not double-spend a Fargate worker against
-            an already-claimed workflow.
+            ``provisioner.request`` should be awaited once with the job's
+            workflow key, and the job should stay PENDING with
+            ``next_dispatch_at`` set and ``dispatch_attempts`` bumped —
+            queued durably for a later retry, not failed.
         """
         # Arrange
         _install_jobs_index(mock_db)
-        release = asyncio.Event()
-
-        class _BlockingProcessor(_StubProcessor):
-            async def run(self, file_meta, workdir, cache_root):
-                self.run_calls += 1
-                await release.wait()
-                for kind, key in self.artifacts.items():
-                    yield StageComplete(kind=ArtifactKind(kind), key=key)
-                yield Complete(artifacts=dict(self.artifacts))
-
         registry = ProcessorRegistry()
-        registry.register(_BlockingProcessor())
+        processor = _StubProcessor()
+        registry.register(processor)
         provisioner = mocker.AsyncMock(spec=EcsProvisioner)
-        provisioner.request.return_value = ["arn:fake:task/abc"]
         executor = WoolExecutor(
             mock_db,
             tmp_cache,
@@ -1275,50 +1315,47 @@ async def run(self, file_meta, workdir, cache_root):
             workdir_root=tmp_workdir,
             provisioner=provisioner,
         )
+        record = await _seed_pending_job(mock_db)
+        mocker.patch.object(
+            executor,
+            "_open_stream_once",
+            new=mocker.AsyncMock(side_effect=wool.NoWorkersAvailable("empty")),
+        )
 
         # Act
-        record_a, fresh_a = await executor.ensure_workflow(_file_meta())
-        record_b, fresh_b = await executor.ensure_workflow(_file_meta())
-        release.set()
-        await _wait_for_terminal(mock_db, record_a.job_id)
+        await executor._attempt_dispatch(record, processor, _file_meta())
 
         # Assert
-        assert fresh_a is True
-        assert fresh_b is False
-        assert record_a.job_id == record_b.job_id
-        provisioner.request.assert_awaited_once()
+        provisioner.request.assert_awaited_once_with(dedup_key=record.workflow_key)
+        final = await get_job(mock_db, record.job_id)
+        assert final is not None
+        assert final.status == JobStatus.PENDING
+        assert final.next_dispatch_at is not None
+        assert final.dispatch_attempts == 1
 
     @pytest.mark.asyncio
-    async def test_ensure_workflow_should_fall_through_when_retryable_provisioner_raises(
+    async def test__handle_overflow_should_swallow_retryable_provisioner_error(
         self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch, mocker
     ):
-        """Test that a retryable provisioner failure falls through to dispatch.
+        """Test that a retryable provisioner error is swallowed, job reschedules.
 
         Given:
             A provisioner whose ``request`` raises
-            ``RetryableProvisionerError`` (capacity / ENI exhaustion /
-            throttling) AND a wool pool that can still dispatch to an
-            existing worker (the no_wool_dispatch fixture runs the
-            routine in-process).
+            ``RetryableProvisionerError`` on an overflowed dispatch.
         When:
-            ``ensure_workflow`` is awaited.
+            ``_attempt_dispatch`` runs.
         Then:
-            The retryable error should be logged but not fail the
-            workflow — the Wool pool routes to any available worker
-            via discovery, so an existing idle worker can service the
-            job even when ECS quota blocks a fresh launch. The
-            ``capacity:`` prefix is reserved for the case where
-            dispatch also exhausts its budget (see
-            ``test_ensure_workflow_should_record_capacity_failure_when_dispatch_also_fails``).
+            The error should be swallowed (best-effort scale-up) and the
+            job should stay PENDING, rescheduled for a later attempt — not
+            failed.
         """
         # Arrange
         _install_jobs_index(mock_db)
         registry = ProcessorRegistry()
-        registry.register(_StubProcessor())
+        processor = _StubProcessor()
+        registry.register(processor)
         provisioner = mocker.AsyncMock(spec=EcsProvisioner)
-        provisioner.request.side_effect = RetryableProvisionerError(
-            "ClientError: capacity-unavailable"
-        )
+        provisioner.request.side_effect = RetryableProvisionerError("capacity")
         executor = WoolExecutor(
             mock_db,
             tmp_cache,
@@ -1326,49 +1363,47 @@ async def test_ensure_workflow_should_fall_through_when_retryable_provisioner_ra
             workdir_root=tmp_workdir,
             provisioner=provisioner,
         )
+        record = await _seed_pending_job(mock_db)
+        mocker.patch.object(
+            executor,
+            "_open_stream_once",
+            new=mocker.AsyncMock(side_effect=wool.NoWorkersAvailable("empty")),
+        )
 
         # Act
-        record, _ = await executor.ensure_workflow(_file_meta())
-        await _wait_for_terminal(mock_db, record.job_id)
+        await executor._attempt_dispatch(record, processor, _file_meta())
 
         # Assert
         final = await get_job(mock_db, record.job_id)
         assert final is not None
-        provisioner.request.assert_awaited_once()
-        # Provisioner failure was swallowed; in-process dispatch
-        # succeeded, so the workflow completes despite the
-        # retryable provisioner error.
-        assert final.status == JobStatus.COMPLETED
-        assert final.error is None
+        assert final.status == JobStatus.PENDING
+        assert final.next_dispatch_at is not None
 
     @pytest.mark.asyncio
-    async def test_ensure_workflow_should_record_capacity_failure_when_dispatch_also_fails(
+    async def test__handle_overflow_should_swallow_generic_provisioner_error(
         self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch, mocker
     ):
-        """Test that the capacity: prefix fires when dispatch also exhausts.
+        """Test that a generic provisioner error doesn't fail the queued job.
 
         Given:
-            A provisioner whose ``request`` raises
-            ``RetryableProvisionerError`` AND a dispatch path that
-            exhausts its budget with ``wool.NoWorkersAvailable``.
+            A provisioner whose ``request`` raises a generic
+            ``RuntimeError`` (misconfiguration / unexpected boto failure)
+            on an overflowed dispatch.
         When:
-            ``ensure_workflow`` is awaited.
+            ``_attempt_dispatch`` runs.
         Then:
-            The workflow should reach ``FAILED`` with the
-            ``capacity:`` prefix — the wire contract clients parse to
-            decide whether to resubmit. Per C26 the origin of the
-            prefix shifted from the provisioner-failure branch to the
-            dispatch-budget-exhausted branch, but the on-wire shape
-            is unchanged.
+            The error should be logged and swallowed and the job should
+            stay PENDING (rescheduled) rather than failing — the
+            provisioner is a best-effort scale-up hint on overflow, not a
+            dispatch gate; the dispatch deadline bounds the retry loop.
         """
         # Arrange
         _install_jobs_index(mock_db)
         registry = ProcessorRegistry()
-        registry.register(_StubProcessor())
+        processor = _StubProcessor()
+        registry.register(processor)
         provisioner = mocker.AsyncMock(spec=EcsProvisioner)
-        provisioner.request.side_effect = RetryableProvisionerError(
-            "ClientError: capacity-unavailable"
-        )
+        provisioner.request.side_effect = RuntimeError("misconfigured cluster")
         executor = WoolExecutor(
             mock_db,
             tmp_cache,
@@ -1376,70 +1411,682 @@ async def test_ensure_workflow_should_record_capacity_failure_when_dispatch_also
             workdir_root=tmp_workdir,
             provisioner=provisioner,
         )
-        # Force the dispatch path to also fail with
-        # ``wool.NoWorkersAvailable``; the C26 stream-open branch
-        # translates that to the ``capacity:`` prefix.
+        record = await _seed_pending_job(mock_db)
         mocker.patch.object(
             executor,
-            "_open_stream_with_retry",
-            side_effect=wool.NoWorkersAvailable("pool empty after budget"),
+            "_open_stream_once",
+            new=mocker.AsyncMock(side_effect=wool.NoWorkersAvailable("empty")),
         )
 
         # Act
-        record, _ = await executor.ensure_workflow(_file_meta())
-        await _wait_for_terminal(mock_db, record.job_id)
+        await executor._attempt_dispatch(record, processor, _file_meta())
+
+        # Assert
+        final = await get_job(mock_db, record.job_id)
+        assert final is not None
+        assert final.status == JobStatus.PENDING
+        assert final.next_dispatch_at is not None
+
+
+class TestConsumeAndFinalizeHeartbeatLoss:
+    def test_heartbeat_loss_abort_threshold_sits_between_interval_and_stale(self):
+        """Test that the self-cancel threshold is between heartbeat and stale.
+
+        Given:
+            The module's derived heartbeat-loss abort threshold.
+        When:
+            It is compared against the heartbeat interval and the orphan
+            sweep's stale threshold.
+        Then:
+            It should be strictly more than one heartbeat interval (a single
+            transient write failure must not abort — the strict
+            ``STALE > 2*HEARTBEAT`` startup invariant guarantees the margin,
+            issue #45 review A7) and strictly below the stale threshold (a
+            Mongo-blind consumer must give up before the sweep would revive
+            its row).
+        """
+        # Assert
+        assert (
+            executor_module._HEARTBEAT_LOSS_ABORT_S
+            > executor_module._HEARTBEAT_INTERVAL_S
+        )
+        assert (
+            executor_module._HEARTBEAT_LOSS_ABORT_S
+            < executor_module.STALE_WORKFLOW_THRESHOLD.total_seconds()
+        )
+
+    @pytest.mark.asyncio
+    async def test__consume_and_finalize_should_abort_when_heartbeat_lost(
+        self, mock_db, tmp_cache, tmp_workdir, mocker
+    ):
+        """Test that sustained heartbeat-write failure aborts the workflow.
+
+        Given:
+            A RUNNING job whose ``heartbeat_workflow`` write always fails and
+            a zeroed abort threshold (so the first failure is already past
+            it).
+        When:
+            ``_consume_and_finalize`` consumes a stream that emits a
+            heartbeat.
+        Then:
+            It should abort with a ``heartbeat lost`` FAILED status and close
+            the stream (cancelling the remote worker), rather than consuming
+            on invisibly while the orphan sweep reclaims the row.
+        """
+        # Arrange
+        _install_jobs_index(mock_db)
+        registry = ProcessorRegistry()
+        registry.register(_StubProcessor())
+        executor = WoolExecutor(
+            mock_db, tmp_cache, registry, workdir_root=tmp_workdir
+        )
+        record = await _seed_pending_job(mock_db, status=JobStatus.RUNNING)
+        mocker.patch.object(executor_module, "_HEARTBEAT_LOSS_ABORT_S", -1.0)
+
+        async def boom(*_a, **_kw):
+            raise RuntimeError("mongo down")
+
+        mocker.patch.object(executor_module, "heartbeat_workflow", boom)
+        closed = {"v": False}
+
+        async def stream():
+            try:
+                yield Heartbeat()
+                yield Complete(artifacts={})  # pragma: no cover (aborted first)
+            finally:
+                closed["v"] = True
+
+        # Act
+        await executor._consume_and_finalize(record, stream(), tmp_workdir / "wd")
 
         # Assert
         final = await get_job(mock_db, record.job_id)
         assert final is not None
         assert final.status == JobStatus.FAILED
-        assert final.error is not None
-        assert final.error.startswith("capacity: "), (
-            f"expected 'capacity: ' prefix, got {final.error!r}"
+        assert "heartbeat lost" in (final.error or "")
+        assert closed["v"] is True
+
+    @pytest.mark.asyncio
+    async def test__consume_and_finalize_should_tolerate_a_single_heartbeat_failure(
+        self, mock_db, tmp_cache, tmp_workdir, mocker
+    ):
+        """Test that one transient heartbeat-write failure does not abort.
+
+        Given:
+            A RUNNING job whose first ``heartbeat_workflow`` write fails but
+            the abort threshold is far in the future.
+        When:
+            ``_consume_and_finalize`` consumes a heartbeat then a complete.
+        Then:
+            The single failure is logged and swallowed and the job still
+            reaches COMPLETED.
+        """
+        # Arrange
+        _install_jobs_index(mock_db)
+        registry = ProcessorRegistry()
+        registry.register(_StubProcessor())
+        executor = WoolExecutor(
+            mock_db, tmp_cache, registry, workdir_root=tmp_workdir
+        )
+        record = await _seed_pending_job(mock_db, status=JobStatus.RUNNING)
+        mocker.patch.object(executor_module, "_HEARTBEAT_LOSS_ABORT_S", 9999.0)
+        calls = {"n": 0}
+
+        async def flaky(*_a, **_kw):
+            calls["n"] += 1
+            if calls["n"] == 1:
+                raise RuntimeError("transient")
+
+        mocker.patch.object(executor_module, "heartbeat_workflow", flaky)
+
+        async def stream():
+            yield Heartbeat()
+            yield Complete(artifacts={})
+
+        # Act
+        await executor._consume_and_finalize(record, stream(), tmp_workdir / "wd")
+
+        # Assert
+        final = await get_job(mock_db, record.job_id)
+        assert final is not None
+        assert final.status == JobStatus.COMPLETED
+
+
+class TestAttemptDispatchWorkdir:
+    @pytest.mark.asyncio
+    async def test__attempt_dispatch_should_use_a_distinct_workdir_per_attempt(
+        self, mock_db, tmp_cache, tmp_workdir, mocker
+    ):
+        """Test that each dispatch attempt for a job gets its own workdir.
+
+        Given:
+            A queued job whose dispatch attempts always overflow.
+        When:
+            ``_attempt_dispatch`` runs twice for the same job.
+        Then:
+            The two attempts should compute distinct workdirs (both prefixed
+            with the job_id), so a re-dispatch can never share scratch with a
+            still-live prior attempt.
+        """
+        # Arrange
+        _install_jobs_index(mock_db)
+        registry = ProcessorRegistry()
+        processor = _StubProcessor()
+        registry.register(processor)
+        executor = WoolExecutor(
+            mock_db, tmp_cache, registry, workdir_root=tmp_workdir
         )
-        assert "pool empty" in final.error
+        record = await _seed_pending_job(mock_db)
+        seen: list[Path] = []
+
+        async def capture(_processor, _file_meta, workdir):
+            seen.append(workdir)
+            raise wool.NoWorkersAvailable("overflow")
+
+        mocker.patch.object(executor, "_open_stream_once", new=capture)
+
+        # Act
+        await executor._attempt_dispatch(record, processor, _file_meta())
+        await executor._attempt_dispatch(record, processor, _file_meta())
+
+        # Assert
+        assert len(seen) == 2
+        assert seen[0] != seen[1]
+        assert all(w.name.startswith(record.job_id) for w in seen)
 
     @pytest.mark.asyncio
-    async def test_ensure_workflow_should_record_provisioner_prefix_for_generic_exception(
+    async def test__attempt_dispatch_should_clean_workdir_when_mark_running_rejected(
+        self, mock_db, tmp_cache, tmp_workdir, mocker
+    ):
+        """Test that the hand-off branch cleans up its workdir.
+
+        Given:
+            A worker accepts (the routine created the per-attempt workdir),
+            but ``mark_running`` rejects because a successor now owns the row.
+        When:
+            ``_attempt_dispatch`` runs.
+        Then:
+            It should remove this attempt's workdir (no scratch leak) and
+            leave the row untouched for the successor — without releasing the
+            mutex.
+        """
+        # Arrange
+        _install_jobs_index(mock_db)
+        registry = ProcessorRegistry()
+        processor = _StubProcessor()
+        registry.register(processor)
+        executor = WoolExecutor(
+            mock_db, tmp_cache, registry, workdir_root=tmp_workdir
+        )
+        record = await _seed_pending_job(mock_db)
+        captured: dict[str, Path] = {}
+
+        async def fake_open(_processor, _file_meta, workdir):
+            workdir.mkdir(parents=True, exist_ok=True)
+            (workdir / "scratch").write_bytes(b"x")
+            captured["workdir"] = workdir
+
+            async def s():
+                yield Heartbeat()
+
+            return s()
+
+        mocker.patch.object(executor, "_open_stream_once", new=fake_open)
+
+        async def reject(*_a, **_kw):
+            raise RuntimeError("row no longer PENDING")
+
+        mocker.patch.object(executor_module, "mark_running", reject)
+
+        # Act
+        await executor._attempt_dispatch(record, processor, _file_meta())
+
+        # Assert
+        assert "workdir" in captured
+        assert not captured["workdir"].exists()
+        final = await get_job(mock_db, record.job_id)
+        assert final is not None
+        assert final.status == JobStatus.PENDING
+
+
+class TestWoolExecutorMarkRunningOnAcceptance:
+    @pytest.mark.asyncio
+    async def test__attempt_dispatch_should_mark_running_before_first_processor_event(
+        self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch
+    ):
+        """Test that a job is marked RUNNING the instant a worker accepts it.
+
+        Given:
+            A processor that blocks before yielding any real event (standing
+            in for a slow upstream download, which is the processor's first
+            action before its first stage event).
+        When:
+            ``ensure_workflow`` dispatches the inline attempt and a worker
+            accepts it.
+        Then:
+            The job should reach RUNNING while the processor is still blocked
+            — driven by the routine's leading "worker accepted" heartbeat,
+            not the first processor event. This is the invariant that keeps a
+            job from lingering PENDING (and being re-leased / double-
+            dispatched onto the same workdir) for the whole download.
+        """
+        # Arrange
+        _install_jobs_index(mock_db)
+        release = asyncio.Event()
+
+        class _BlockBeforeFirstEvent(_StubProcessor):
+            async def run(self, file_meta, workdir, cache_root):
+                self.run_calls += 1
+                await release.wait()  # block before the first real event
+                yield Complete(artifacts={})
+
+        registry = ProcessorRegistry()
+        registry.register(_BlockBeforeFirstEvent())
+        executor = WoolExecutor(
+            mock_db, tmp_cache, registry, workdir_root=tmp_workdir
+        )
+
+        # Act
+        record, _ = await executor.ensure_workflow(_file_meta())
+        try:
+            # Would time out (job stuck PENDING) if mark_running waited for
+            # the first processor event rather than the acceptance heartbeat.
+            await _wait_for_status(mock_db, record.job_id, JobStatus.RUNNING)
+            running = await get_job(mock_db, record.job_id)
+            release.set()
+            await _wait_for_terminal(mock_db, record.job_id)
+        finally:
+            release.set()
+            await executor.drain(timeout=2.0)
+
+        # Assert
+        assert running is not None
+        assert running.status == JobStatus.RUNNING
+        final = await get_job(mock_db, record.job_id)
+        assert final is not None
+        assert final.status == JobStatus.COMPLETED
+
+
+class TestWoolExecutorAdmission:
+    @pytest.mark.asyncio
+    async def test_ensure_workflow_should_reject_at_ceiling(
+        self, mock_db, tmp_cache, tmp_workdir, mocker
+    ):
+        """Test that ensure_workflow sheds load once the active ceiling is hit.
+
+        Given:
+            One active workflow already in the jobs collection and the
+            admission ceiling pinned to 1.
+        When:
+            ``ensure_workflow`` is awaited for another file.
+        Then:
+            It should raise ``AdmissionRejected`` carrying the observed
+            active count, the ceiling, and a positive ``retry_after`` hint
+            — before claiming the mutex, so a flood is shed at the door.
+        """
+        # Arrange
+        _install_jobs_index(mock_db)
+        registry = ProcessorRegistry()
+        registry.register(_StubProcessor())
+        executor = WoolExecutor(
+            mock_db, tmp_cache, registry, workdir_root=tmp_workdir
+        )
+        mocker.patch.object(executor_module, "_MAX_ACTIVE_WORKFLOWS", 1)
+        await _seed_pending_job(mock_db)  # one active workflow
+
+        # Act & assert
+        with pytest.raises(AdmissionRejected) as excinfo:
+            await executor.ensure_workflow(_file_meta())
+        assert excinfo.value.active == 1
+        assert excinfo.value.ceiling == 1
+        assert excinfo.value.retry_after_seconds >= 1
+
+    @pytest.mark.asyncio
+    async def test_ensure_workflow_should_admit_below_ceiling(
         self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch, mocker
     ):
-        """Test that non-retryable provisioner failures use the 'provisioner: ' prefix.
+        """Test that ensure_workflow claims normally below the ceiling.
 
         Given:
-            A provisioner whose ``request`` raises a generic
-            ``RuntimeError`` (a misconfiguration or unexpected boto
-            failure, not a retryable transient).
+            An admission ceiling of 5 and no active workflows.
         When:
             ``ensure_workflow`` is awaited.
         Then:
-            The job should reach ``FAILED`` with an error string that
-            starts with ``"provisioner: "`` so clients can distinguish
-            "retry me later" from "the provisioner crashed".
+            It should claim and dispatch the job to completion — the
+            ceiling only sheds load once it is reached.
         """
         # Arrange
         _install_jobs_index(mock_db)
         registry = ProcessorRegistry()
         registry.register(_StubProcessor())
-        provisioner = mocker.AsyncMock(spec=EcsProvisioner)
-        provisioner.request.side_effect = RuntimeError("misconfigured cluster")
         executor = WoolExecutor(
-            mock_db,
-            tmp_cache,
-            registry,
-            workdir_root=tmp_workdir,
-            provisioner=provisioner,
+            mock_db, tmp_cache, registry, workdir_root=tmp_workdir
         )
+        mocker.patch.object(executor_module, "_MAX_ACTIVE_WORKFLOWS", 5)
 
         # Act
-        record, _ = await executor.ensure_workflow(_file_meta())
+        record, fresh = await executor.ensure_workflow(_file_meta())
         await _wait_for_terminal(mock_db, record.job_id)
 
         # Assert
+        assert fresh is True
+        final = await get_job(mock_db, record.job_id)
+        assert final is not None
+        assert final.status == JobStatus.COMPLETED
+
+
+class TestWoolExecutorDispatchDeadline:
+    @pytest.mark.asyncio
+    async def test__attempt_dispatch_should_fail_capacity_when_deadline_exceeded(
+        self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch, mocker
+    ):
+        """Test that a job past its dispatch deadline is failed ``capacity:``.
+
+        Given:
+            A queued job submitted long ago (older than a tiny patched
+            dispatch deadline).
+        When:
+            ``_attempt_dispatch`` runs.
+        Then:
+            The job should be failed with the ``capacity:`` prefix without
+            attempting dispatch — the deadline bounds the durable retry
+            loop so a job can't queue forever.
+        """
+        # Arrange
+        _install_jobs_index(mock_db)
+        registry = ProcessorRegistry()
+        processor = _StubProcessor()
+        registry.register(processor)
+        executor = WoolExecutor(
+            mock_db, tmp_cache, registry, workdir_root=tmp_workdir
+        )
+        mocker.patch.object(executor_module, "_DISPATCH_DEADLINE_SECONDS", 1.0)
+        old = datetime.now(timezone.utc) - timedelta(seconds=10000)
+        record = await _seed_pending_job(mock_db, submitted_at=old)
+        open_spy = mocker.patch.object(
+            executor, "_open_stream_once", new=mocker.AsyncMock()
+        )
+
+        # Act
+        await executor._attempt_dispatch(record, processor, _file_meta())
+
+        # Assert
+        open_spy.assert_not_awaited()  # never tried to dispatch
         final = await get_job(mock_db, record.job_id)
         assert final is not None
         assert final.status == JobStatus.FAILED
-        assert final.error is not None
-        assert final.error.startswith("provisioner: "), (
-            f"expected 'provisioner: ' prefix, got {final.error!r}"
+        assert (final.error or "").startswith("capacity:")
+
+
+class TestWoolExecutorScheduler:
+    @pytest.mark.asyncio
+    async def test__drain_due_jobs_should_dispatch_a_due_queued_job(
+        self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch
+    ):
+        """Test that the scheduler tick dispatches a due rescheduled job.
+
+        Given:
+            A queued job whose ``next_dispatch_at`` is in the past (an
+            earlier attempt overflowed) and a registered processor.
+        When:
+            One scheduler tick (``_drain_due_jobs``) runs.
+        Then:
+            The job should be leased, dispatched to the in-process worker,
+            and reach COMPLETED — the durable retry path picking up a
+            previously-overflowed job.
+        """
+        # Arrange
+        _install_jobs_index(mock_db)
+        registry = ProcessorRegistry()
+        registry.register(_StubProcessor())
+        executor = WoolExecutor(
+            mock_db, tmp_cache, registry, workdir_root=tmp_workdir
         )
-        assert "misconfigured cluster" in final.error
+        due = datetime.now(timezone.utc) - timedelta(seconds=1)
+        record = await _seed_pending_job(mock_db, next_dispatch_at=due)
+
+        # Act
+        await executor._drain_due_jobs()
+        await _wait_for_terminal(mock_db, record.job_id)
+
+        # Assert
+        final = await get_job(mock_db, record.job_id)
+        assert final is not None
+        assert final.status == JobStatus.COMPLETED
+
+    @pytest.mark.asyncio
+    async def test__drain_due_jobs_should_ignore_not_yet_due_jobs(
+        self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch
+    ):
+        """Test that a queued job whose retry time is in the future is skipped.
+
+        Given:
+            A queued job whose ``next_dispatch_at`` is in the future.
+        When:
+            One scheduler tick runs.
+        Then:
+            The job should be left PENDING and undispatched — only due
+            jobs are leased, so a not-yet-due retry isn't run early.
+        """
+        # Arrange
+        _install_jobs_index(mock_db)
+        registry = ProcessorRegistry()
+        processor = _StubProcessor()
+        registry.register(processor)
+        executor = WoolExecutor(
+            mock_db, tmp_cache, registry, workdir_root=tmp_workdir
+        )
+        future = datetime.now(timezone.utc) + timedelta(seconds=3600)
+        record = await _seed_pending_job(mock_db, next_dispatch_at=future)
+
+        # Act
+        await executor._drain_due_jobs()
+
+        # Assert
+        final = await get_job(mock_db, record.job_id)
+        assert final is not None
+        assert final.status == JobStatus.PENDING
+        assert processor.run_calls == 0
+
+    @pytest.mark.asyncio
+    async def test__recover_orphans_should_requeue_and_dispatch_a_stale_running_job(
+        self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch
+    ):
+        """Test that a RUNNING job orphaned by a crash is recovered autonomously.
+
+        Given:
+            A stale RUNNING job (no heartbeat past the stale threshold,
+            `next_dispatch_at` unset) — what an API crash mid-stream leaves
+            behind — and a registered processor.
+        When:
+            A scheduler tick runs (`_recover_orphans` then `_drain_due_jobs`).
+        Then:
+            The orphan is re-queued to PENDING, leased, dispatched, and
+            reaches COMPLETED — without any new request for the file, closing
+            the restart-recovery gap.
+        """
+        # Arrange
+        _install_jobs_index(mock_db)
+        registry = ProcessorRegistry()
+        registry.register(_StubProcessor())
+        executor = WoolExecutor(
+            mock_db, tmp_cache, registry, workdir_root=tmp_workdir
+        )
+        stale = datetime.now(timezone.utc) - timedelta(seconds=2000)
+        record = await _seed_pending_job(
+            mock_db,
+            status=JobStatus.RUNNING,
+            updated_at=stale,
+            next_dispatch_at=None,
+        )
+
+        # Act — one scheduler tick: recover orphans, then dispatch due jobs.
+        await executor._recover_orphans()
+        await executor._drain_due_jobs()
+        await _wait_for_terminal(mock_db, record.job_id)
+
+        # Assert
+        final = await get_job(mock_db, record.job_id)
+        assert final is not None
+        assert final.status == JobStatus.COMPLETED
+
+    @pytest.mark.asyncio
+    async def test__recover_orphans_should_leave_a_fresh_running_job_alone(
+        self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch
+    ):
+        """Test that a healthy in-flight RUNNING job is not reclaimed.
+
+        Given:
+            A RUNNING job with a recent `updated_at` (a live, heartbeating
+            job).
+        When:
+            `_recover_orphans` runs.
+        Then:
+            The job is left RUNNING — the staleness gate keeps the sweep from
+            yanking a live job away from its worker.
+        """
+        # Arrange
+        _install_jobs_index(mock_db)
+        registry = ProcessorRegistry()
+        registry.register(_StubProcessor())
+        executor = WoolExecutor(
+            mock_db, tmp_cache, registry, workdir_root=tmp_workdir
+        )
+        record = await _seed_pending_job(
+            mock_db, status=JobStatus.RUNNING, next_dispatch_at=None
+        )
+
+        # Act
+        await executor._recover_orphans()
+
+        # Assert
+        job = await get_job(mock_db, record.job_id)
+        assert job is not None
+        assert job.status == JobStatus.RUNNING
+
+    @pytest.mark.asyncio
+    async def test__scheduler_loop_should_survive_a_tick_exception(
+        self, mock_db, tmp_cache, tmp_workdir, mocker
+    ):
+        """Test that an exception in one tick doesn't kill the scheduler.
+
+        Given:
+            A scheduler whose lease query raises on its first call then
+            returns nothing, with a tiny retry interval.
+        When:
+            ``start_scheduler`` runs the loop across several ticks.
+        Then:
+            The scheduler task should still be alive and have ticked again
+            after the exception — a transient Mongo blip must not silently
+            stop the dispatch driver.
+        """
+        # Arrange
+        _install_jobs_index(mock_db)
+        registry = ProcessorRegistry()
+        registry.register(_StubProcessor())
+        executor = WoolExecutor(
+            mock_db, tmp_cache, registry, workdir_root=tmp_workdir
+        )
+        calls = {"n": 0}
+
+        async def flaky_lease(*_a, **_k):
+            calls["n"] += 1
+            if calls["n"] == 1:
+                raise RuntimeError("transient mongo blip")
+            return None
+
+        mocker.patch.object(executor_module, "lease_due_dispatch", flaky_lease)
+        mocker.patch.object(executor_module, "_RETRY_INTERVAL_SECONDS", 0.01)
+
+        # Act
+        executor.start_scheduler()
+        try:
+            await asyncio.sleep(0.1)
+            # Assert — survived the first-tick exception and ticked again.
+            assert executor._scheduler_task is not None
+            assert not executor._scheduler_task.done()
+            assert calls["n"] >= 2
+        finally:
+            await executor.drain(timeout=1.0)
+
+    @pytest.mark.asyncio
+    async def test__drain_due_jobs_should_lease_at_most_the_per_tick_cap(
+        self, mock_db, tmp_cache, tmp_workdir, mocker
+    ):
+        """Test that one tick leases no more than the per-tick cap.
+
+        Given:
+            More due queued jobs than ``_MAX_DISPATCHES_PER_TICK`` (the cap
+            monkeypatched low) and a stubbed ``_spawn_attempt`` so the tick
+            only counts dispatches.
+        When:
+            One scheduler tick (``_drain_due_jobs``) runs.
+        Then:
+            Exactly the cap's worth of attempts are spawned and the
+            remainder is left leasable for a subsequent tick — the
+            thundering-herd guard bounds the per-tick fan-out so a large
+            backlog can't spawn thousands of concurrent attempts at once.
+        """
+        # Arrange
+        _install_jobs_index(mock_db)
+        registry = ProcessorRegistry()
+        registry.register(_StubProcessor())
+        executor = WoolExecutor(
+            mock_db, tmp_cache, registry, workdir_root=tmp_workdir
+        )
+        mocker.patch.object(executor_module, "_MAX_DISPATCHES_PER_TICK", 2)
+        due = datetime.now(timezone.utc) - timedelta(seconds=1)
+        for i in range(5):  # cap (2) + 3 remainder, each a distinct workflow
+            await _seed_pending_job(
+                mock_db,
+                meta={**_file_meta(), "local_id": f"ENCFF{i:03d}"},
+                next_dispatch_at=due,
+            )
+        spawn = mocker.patch.object(executor, "_spawn_attempt")
+
+        # Act — a single tick.
+        await executor._drain_due_jobs()
+
+        # Assert — capped at the per-tick limit...
+        assert spawn.call_count == 2
+        # ...with the remainder still due and leasable on the next tick.
+        leftover = await executor_module.lease_due_dispatch(
+            mock_db,
+            now=datetime.now(timezone.utc),
+            next_at=datetime.now(timezone.utc) + timedelta(seconds=60),
+        )
+        assert leftover is not None
+
+
+class TestWoolExecutorStartScheduler:
+    @pytest.mark.asyncio
+    async def test_drain_should_cancel_the_scheduler(
+        self, mock_db, tmp_cache, tmp_workdir
+    ):
+        """Test that drain stops the durable scheduler task.
+
+        Given:
+            An executor whose scheduler has been started.
+        When:
+            ``drain`` runs.
+        Then:
+            The scheduler task should be cancelled and cleared, so it stops
+            dispatching before the wool pool closes on shutdown.
+        """
+        # Arrange
+        _install_jobs_index(mock_db)
+        registry = ProcessorRegistry()
+        registry.register(_StubProcessor())
+        executor = WoolExecutor(
+            mock_db, tmp_cache, registry, workdir_root=tmp_workdir
+        )
+        executor.start_scheduler()
+        task = executor._scheduler_task
+        assert task is not None
+
+        # Act
+        await executor.drain(timeout=1.0)
+
+        # Assert
+        assert task.cancelled() or task.done()
+        assert executor._scheduler_task is None
diff --git a/tests/test_workflows/test_loadbalancer.py b/tests/test_workflows/test_loadbalancer.py
new file mode 100644
index 0000000..01b61be
--- /dev/null
+++ b/tests/test_workflows/test_loadbalancer.py
@@ -0,0 +1,326 @@
+"""Tests for the priority (leaky-bucket) load balancer."""
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+from types import MappingProxyType
+from typing import Any
+
+import cloudpickle
+import pytest
+import wool
+
+from cfdb.workflows.loadbalancer import PriorityLoadBalancer
+
+
+@dataclass(frozen=True)
+class _FakeMeta:
+    """Stand-in for WorkerMetadata; frozen so it is hashable as a dict key."""
+
+    uid: str
+
+
+class _FakeConn:
+    """Worker connection stub whose dispatch returns a stream or raises."""
+
+    def __init__(self, behavior: Any) -> None:
+        # behavior: a sentinel "stream" to return, or an Exception to raise.
+        self.behavior = behavior
+        self.calls = 0
+
+    async def dispatch(self, task: Any, *, timeout: float | None = None) -> Any:
+        self.calls += 1
+        if isinstance(self.behavior, Exception):
+            raise self.behavior
+        return self.behavior
+
+
+class _FakeContext:
+    """Minimal LoadBalancerContextLike backed by an ordered dict."""
+
+    def __init__(self, workers: dict[_FakeMeta, _FakeConn]) -> None:
+        self._workers = dict(workers)
+        self.removed: list[_FakeMeta] = []
+
+    @property
+    def workers(self) -> MappingProxyType:
+        return MappingProxyType(self._workers)
+
+    def add_worker(self, metadata: _FakeMeta, connection: _FakeConn) -> None:
+        self._workers[metadata] = connection
+
+    def update_worker(self, metadata, connection, *, upsert: bool = False) -> None:
+        if upsert or metadata in self._workers:
+            self._workers[metadata] = connection
+
+    def remove_worker(self, metadata: _FakeMeta) -> None:
+        self.removed.append(metadata)
+        self._workers.pop(metadata, None)
+
+
+class TestPriorityLoadBalancer:
+    @pytest.mark.asyncio
+    async def test_dispatch_should_return_first_worker_in_priority_order(self):
+        """Test that the lowest-uid worker is offered the task first.
+
+        Given:
+            Three healthy workers inserted out of uid order.
+        When:
+            A task is dispatched.
+        Then:
+            It should return the lowest-uid worker's stream and not even
+            contact the higher-uid workers (priority short-circuit).
+        """
+        # Arrange — insert c, a, b so insertion order != priority order.
+        conns = {
+            _FakeMeta("c"): _FakeConn("stream-c"),
+            _FakeMeta("a"): _FakeConn("stream-a"),
+            _FakeMeta("b"): _FakeConn("stream-b"),
+        }
+        context = _FakeContext(conns)
+
+        # Act
+        stream = await PriorityLoadBalancer().dispatch(object(), context=context)
+
+        # Assert
+        assert stream == "stream-a"
+        assert conns[_FakeMeta("a")].calls == 1
+        assert conns[_FakeMeta("b")].calls == 0
+        assert conns[_FakeMeta("c")].calls == 0
+
+    @pytest.mark.asyncio
+    async def test_dispatch_should_rotate_past_busy_worker_without_evicting(self):
+        """Test that a transiently-rejecting worker is skipped, not evicted.
+
+        Given:
+            The priority worker rejects with TransientRpcError (a
+            backpressure RESOURCE_EXHAUSTED) and the next accepts.
+        When:
+            A task is dispatched.
+        Then:
+            It should return the next worker's stream and leave the busy
+            worker in the pool (no eviction).
+        """
+        # Arrange
+        busy = _FakeConn(wool.TransientRpcError(details="busy"))
+        free = _FakeConn("stream-b")
+        context = _FakeContext({_FakeMeta("a"): busy, _FakeMeta("b"): free})
+
+        # Act
+        stream = await PriorityLoadBalancer().dispatch(object(), context=context)
+
+        # Assert
+        assert stream == "stream-b"
+        assert busy.calls == 1
+        assert context.removed == []
+
+    @pytest.mark.asyncio
+    async def test_dispatch_should_evict_worker_on_non_transient_error(self):
+        """Test that a non-transient RpcError evicts the failing worker.
+
+        Given:
+            The priority worker fails with a non-transient RpcError and the
+            next accepts.
+        When:
+            A task is dispatched.
+        Then:
+            It should return the next worker's stream and evict the failing
+            worker via remove_worker.
+        """
+        # Arrange
+        broken_meta = _FakeMeta("a")
+        broken = _FakeConn(wool.RpcError(details="broken"))
+        free = _FakeConn("stream-b")
+        context = _FakeContext({broken_meta: broken, _FakeMeta("b"): free})
+
+        # Act
+        stream = await PriorityLoadBalancer().dispatch(object(), context=context)
+
+        # Assert
+        assert stream == "stream-b"
+        assert context.removed == [broken_meta]
+
+    @pytest.mark.asyncio
+    async def test_dispatch_should_raise_when_pool_is_empty(self):
+        """Test that an empty pool surfaces NoWorkersAvailable.
+
+        Given:
+            A context with no workers.
+        When:
+            A task is dispatched.
+        Then:
+            It should raise NoWorkersAvailable (the executor's signal to add
+            a worker and re-queue).
+        """
+        # Arrange
+        context = _FakeContext({})
+
+        # Act & assert
+        with pytest.raises(wool.NoWorkersAvailable):
+            await PriorityLoadBalancer().dispatch(object(), context=context)
+
+    @pytest.mark.asyncio
+    async def test_dispatch_should_raise_when_all_workers_reject(self):
+        """Test that an all-busy pool surfaces NoWorkersAvailable.
+
+        Given:
+            Every worker rejects with TransientRpcError (all at capacity).
+        When:
+            A task is dispatched.
+        Then:
+            It should raise NoWorkersAvailable after one full pass, without
+            evicting the still-healthy busy workers.
+        """
+        # Arrange
+        a = _FakeConn(wool.TransientRpcError(details="busy"))
+        b = _FakeConn(wool.TransientRpcError(details="busy"))
+        context = _FakeContext({_FakeMeta("a"): a, _FakeMeta("b"): b})
+
+        # Act & assert
+        with pytest.raises(wool.NoWorkersAvailable):
+            await PriorityLoadBalancer().dispatch(object(), context=context)
+
+        # ...and busy workers stay in the pool for the next attempt.
+        assert context.removed == []
+        assert a.calls == 1 and b.calls == 1
+
+    @pytest.mark.asyncio
+    async def test_dispatch_should_let_unexpected_errors_propagate(self):
+        """Test that a non-RpcError propagates without touching the pool.
+
+        Given:
+            The priority worker raises a plain ValueError (a caller-fault,
+            not a worker-health signal).
+        When:
+            A task is dispatched.
+        Then:
+            It should propagate the ValueError unwrapped and not evict the
+            worker, per wool's worker-health contract.
+        """
+        # Arrange
+        meta = _FakeMeta("a")
+        context = _FakeContext({meta: _FakeConn(ValueError("boom"))})
+
+        # Act & assert
+        with pytest.raises(ValueError, match="boom"):
+            await PriorityLoadBalancer().dispatch(object(), context=context)
+        assert context.removed == []
+
+    def test_cloudpickle_roundtrip_should_succeed(self):
+        """Test that the stateless balancer cloudpickle-serializes.
+
+        Given:
+            A PriorityLoadBalancer instance.
+        When:
+            It is cloudpickle dumped and loaded back.
+        Then:
+            The round-trip yields another PriorityLoadBalancer (no lock or
+            index state to poison serialization).
+        """
+        # Act
+        restored = cloudpickle.loads(cloudpickle.dumps(PriorityLoadBalancer()))
+
+        # Assert
+        assert isinstance(restored, PriorityLoadBalancer)
+
+    def test_conforms_to_loadbalancer_like_protocol(self):
+        """Test that the balancer satisfies wool's LoadBalancerLike protocol.
+
+        Given:
+            A constructed balancer.
+        When:
+            It is checked against the runtime-checkable LoadBalancerLike
+            protocol.
+        Then:
+            It should be recognized as a LoadBalancerLike so WorkerPool
+            accepts it.
+        """
+        # Act & assert
+        assert isinstance(PriorityLoadBalancer(), wool.LoadBalancerLike)
+
+    @pytest.mark.asyncio
+    async def test_dispatch_should_pick_same_priority_worker_across_dispatches(self):
+        """Test that the priority order is stable across repeated dispatches.
+
+        Given:
+            Three healthy workers and a single balancer + context reused
+            across two consecutive dispatches.
+        When:
+            Two tasks are dispatched in succession.
+        Then:
+            Both should go to the lowest-uid worker — the leaky-bucket
+            invariant, distinct from round-robin which advances each call.
+        """
+        # Arrange
+        conns = {
+            _FakeMeta("a"): _FakeConn("s-a"),
+            _FakeMeta("b"): _FakeConn("s-b"),
+            _FakeMeta("c"): _FakeConn("s-c"),
+        }
+        context = _FakeContext(conns)
+        balancer = PriorityLoadBalancer()
+
+        # Act
+        first = await balancer.dispatch(object(), context=context)
+        second = await balancer.dispatch(object(), context=context)
+
+        # Assert
+        assert first == "s-a"
+        assert second == "s-a"
+        assert conns[_FakeMeta("a")].calls == 2
+        assert conns[_FakeMeta("b")].calls == 0
+
+    @pytest.mark.asyncio
+    async def test_dispatch_should_continue_in_order_after_evicting_first_worker(self):
+        """Test that evicting the first worker still tries the rest in order.
+
+        Given:
+            Three workers where the lowest-uid one fails with a non-transient
+            RpcError and the next accepts.
+        When:
+            A task is dispatched.
+        Then:
+            It should evict the first, dispatch to the second, and finish the
+            pass without error — pinning that iteration runs over a snapshot,
+            not the live worker map mutated by remove_worker.
+        """
+        # Arrange
+        broken = _FakeMeta("a")
+        conns = {
+            broken: _FakeConn(wool.RpcError(details="broken")),
+            _FakeMeta("b"): _FakeConn("s-b"),
+            _FakeMeta("c"): _FakeConn("s-c"),
+        }
+        context = _FakeContext(conns)
+
+        # Act
+        stream = await PriorityLoadBalancer().dispatch(object(), context=context)
+
+        # Assert
+        assert stream == "s-b"
+        assert context.removed == [broken]
+        assert conns[_FakeMeta("c")].calls == 0
+
+    @pytest.mark.asyncio
+    async def test_dispatch_should_order_workers_by_string_form_of_uid(self):
+        """Test that workers are ordered by the string form of their uid.
+
+        Given:
+            Two workers with non-string uids (10 and 2) whose lexicographic
+            string order ("10" < "2") differs from numeric order.
+        When:
+            A task is dispatched.
+        Then:
+            It should pick the worker whose str(uid) sorts first ("10"),
+            pinning the str(uid) coercion that gives a total, orderable key
+            regardless of the uid's concrete type.
+        """
+        # Arrange
+        conns = {_FakeMeta(10): _FakeConn("s-10"), _FakeMeta(2): _FakeConn("s-2")}
+        context = _FakeContext(conns)
+
+        # Act
+        stream = await PriorityLoadBalancer().dispatch(object(), context=context)
+
+        # Assert
+        assert stream == "s-10"
diff --git a/tests/test_workflows/test_lock.py b/tests/test_workflows/test_lock.py
index 7061e7f..af7a9ae 100644
--- a/tests/test_workflows/test_lock.py
+++ b/tests/test_workflows/test_lock.py
@@ -5,22 +5,54 @@
 from datetime import datetime, timedelta, timezone
 
 import pytest
+from hypothesis import given
+from hypothesis import strategies as st
 
 from cfdb.workflows import lock
-from cfdb.workflows.models import ACTIVE_STATUSES, ArtifactKind, JobStatus
+from cfdb.workflows.models import ArtifactKind, JobRecord, JobStatus
 from tests.test_workflows import FIXTURE_MD5
 
 PIPELINE_VERSION = 1
 
 
+def _insert_job(
+    mock_db,
+    *,
+    workflow_key: str,
+    status: JobStatus,
+    next_dispatch_at: datetime | None = None,
+    dispatch_attempts: int = 0,
+    updated_at: datetime | None = None,
+) -> JobRecord:
+    """Insert a crafted job doc into the fake collection and return it."""
+    now = datetime.now(timezone.utc)
+    record = JobRecord(
+        job_id=f"job-{workflow_key}",
+        workflow_key=workflow_key,
+        status=status,
+        dcc="encode",
+        local_id="x",
+        md5=FIXTURE_MD5,
+        pipeline_version=PIPELINE_VERSION,
+        submitted_at=now,
+        updated_at=updated_at or now,
+        next_dispatch_at=next_dispatch_at,
+        dispatch_attempts=dispatch_attempts,
+    )
+    mock_db.jobs.docs.append(record.to_mongo())
+    return record
+
+
 def _install_jobs_index(mock_db) -> None:
     """Register the partial unique index on the FakeDB jobs collection."""
     mock_db.jobs.create_index(
         {"workflow_key": 1},
         unique=True,
-        partialFilterExpression={
-            "status": {"$in": [s.value for s in ACTIVE_STATUSES]}
-        },
+        # Match production: the mutex partial-unique index filters on the
+        # ``active`` boolean discriminator (DocumentDB rejects ``$in`` inside
+        # a partialFilterExpression). Keying the fake on ``active`` lets the
+        # unit layer catch an ``active``/``status`` lockstep drift.
+        partialFilterExpression={"active": True},
     )
 
 
@@ -777,6 +809,65 @@ def test_record_from_mongo_should_attach_utc_to_naive_timestamps(self):
         assert record.submitted_at.tzinfo is timezone.utc
         assert record.updated_at.tzinfo is timezone.utc
 
+    @given(
+        submitted=st.datetimes(timezones=st.just(timezone.utc)),
+        updated=st.datetimes(timezones=st.just(timezone.utc)),
+        next_dispatch=st.one_of(
+            st.none(), st.datetimes(timezones=st.just(timezone.utc))
+        ),
+    )
+    def test_record_from_mongo_should_roundtrip_aware_and_naive_datetimes(
+        self, submitted, updated, next_dispatch
+    ):
+        """Test that the to_mongo/_record_from_mongo datetime round-trip is total.
+
+        Given:
+            A JobRecord with arbitrary aware-UTC ``submitted_at`` /
+            ``updated_at`` and an aware-or-``None`` ``next_dispatch_at``,
+            serialized via ``to_mongo`` and then rendered naive — the BSON
+            Date round-trip some Motor / mongomock-motor variants produce.
+        When:
+            ``_record_from_mongo`` hydrates the doc back into a JobRecord.
+        Then:
+            Every datetime instant is preserved and re-attached to UTC, and
+            a ``None`` ``next_dispatch_at`` survives as ``None`` — the third
+            nullable datetime threads the round-trip without loss.
+        """
+        # Arrange
+        record = JobRecord(
+            job_id="job-rt",
+            workflow_key=f"encode/x/{FIXTURE_MD5}/v1",
+            status=JobStatus.PENDING,
+            dcc="encode",
+            local_id="x",
+            md5=FIXTURE_MD5,
+            pipeline_version=1,
+            submitted_at=submitted,
+            updated_at=updated,
+            next_dispatch_at=next_dispatch,
+        )
+        doc = record.to_mongo()
+        # Simulate the BSON Date round-trip: strip tzinfo on stored dates,
+        # which is the naive rendering ``_record_from_mongo`` must repair.
+        for field in ("submitted_at", "updated_at", "next_dispatch_at"):
+            value = doc.get(field)
+            if isinstance(value, datetime):
+                doc[field] = value.astimezone(timezone.utc).replace(tzinfo=None)
+
+        # Act
+        hydrated = lock._record_from_mongo(doc)
+
+        # Assert
+        assert hydrated.submitted_at == submitted
+        assert hydrated.updated_at == updated
+        assert hydrated.submitted_at.tzinfo is timezone.utc
+        assert hydrated.updated_at.tzinfo is timezone.utc
+        if next_dispatch is None:
+            assert hydrated.next_dispatch_at is None
+        else:
+            assert hydrated.next_dispatch_at == next_dispatch
+            assert hydrated.next_dispatch_at.tzinfo is timezone.utc
+
 
 class TestClaimWorkflowSupersededByChain:
     @pytest.mark.asyncio
@@ -875,3 +966,465 @@ async def test_get_job_should_hydrate_persisted_record(self, mock_db):
         assert hydrated is not None
         assert hydrated.job_id == record.job_id
         assert hydrated.workflow_key == record.workflow_key
+
+
+class TestCountActiveWorkflows:
+    @pytest.mark.asyncio
+    async def test_count_active_workflows_should_count_pending_and_running(
+        self, mock_db
+    ):
+        """Test that the admission count includes only active jobs.
+
+        Given:
+            Two pending, one running, and two terminal job rows.
+        When:
+            count_active_workflows is awaited.
+        Then:
+            It should return 3 — pending + running only, excluding the
+            terminal rows.
+        """
+        # Arrange
+        _insert_job(mock_db, workflow_key="a", status=JobStatus.PENDING)
+        _insert_job(mock_db, workflow_key="b", status=JobStatus.PENDING)
+        _insert_job(mock_db, workflow_key="c", status=JobStatus.RUNNING)
+        _insert_job(mock_db, workflow_key="d", status=JobStatus.COMPLETED)
+        _insert_job(mock_db, workflow_key="e", status=JobStatus.FAILED)
+
+        # Act
+        count = await lock.count_active_workflows(mock_db)
+
+        # Assert
+        assert count == 3
+
+    @pytest.mark.asyncio
+    async def test_count_active_workflows_should_return_zero_when_empty(self, mock_db):
+        """Test that an empty collection counts as zero active workflows.
+
+        Given:
+            A jobs collection with no rows.
+        When:
+            count_active_workflows is awaited.
+        Then:
+            It should return 0.
+        """
+        # Act
+        count = await lock.count_active_workflows(mock_db)
+
+        # Assert
+        assert count == 0
+
+
+class TestRescheduleDispatch:
+    @pytest.mark.asyncio
+    async def test_reschedule_dispatch_should_defer_pending_and_bump_attempts(
+        self, mock_db
+    ):
+        """Test that rescheduling a pending job advances its retry time.
+
+        Given:
+            A pending job with no next_dispatch_at and zero attempts.
+        When:
+            reschedule_dispatch is awaited with a future next_at.
+        Then:
+            It should set next_dispatch_at to that time and increment
+            dispatch_attempts.
+        """
+        # Arrange
+        _insert_job(mock_db, workflow_key="a", status=JobStatus.PENDING)
+        next_at = datetime.now(timezone.utc) + timedelta(seconds=120)
+
+        # Act
+        await lock.reschedule_dispatch(mock_db, "job-a", next_at=next_at)
+
+        # Assert
+        doc = mock_db.jobs.docs[0]
+        assert doc["next_dispatch_at"] == next_at
+        assert doc["dispatch_attempts"] == 1
+
+    @pytest.mark.asyncio
+    async def test_reschedule_dispatch_should_be_noop_when_not_pending(self, mock_db):
+        """Test that rescheduling a non-pending job changes nothing.
+
+        Given:
+            A running job (it already won a worker).
+        When:
+            reschedule_dispatch is awaited for it.
+        Then:
+            It should leave next_dispatch_at and dispatch_attempts untouched,
+            because the update is fenced on PENDING.
+        """
+        # Arrange
+        _insert_job(mock_db, workflow_key="a", status=JobStatus.RUNNING)
+        next_at = datetime.now(timezone.utc) + timedelta(seconds=120)
+
+        # Act
+        await lock.reschedule_dispatch(mock_db, "job-a", next_at=next_at)
+
+        # Assert
+        doc = mock_db.jobs.docs[0]
+        assert doc["next_dispatch_at"] is None
+        assert doc["dispatch_attempts"] == 0
+
+
+class TestLeaseDueDispatch:
+    @pytest.mark.asyncio
+    async def test_lease_due_dispatch_should_claim_due_job_and_push_forward(
+        self, mock_db
+    ):
+        """Test that a due pending job is leased and its next attempt deferred.
+
+        Given:
+            A pending job whose next_dispatch_at is in the past.
+        When:
+            lease_due_dispatch is awaited with now past that time and a
+            future next_at.
+        Then:
+            It should return the job and push its next_dispatch_at forward to
+            next_at so a later tick won't re-lease it.
+        """
+        # Arrange
+        now = datetime.now(timezone.utc)
+        _insert_job(
+            mock_db,
+            workflow_key="a",
+            status=JobStatus.PENDING,
+            next_dispatch_at=now - timedelta(seconds=1),
+        )
+        next_at = now + timedelta(seconds=120)
+
+        # Act
+        leased = await lock.lease_due_dispatch(mock_db, now=now, next_at=next_at)
+
+        # Assert
+        assert leased is not None
+        assert leased.workflow_key == "a"
+        assert mock_db.jobs.docs[0]["next_dispatch_at"] == next_at
+
+    @pytest.mark.asyncio
+    async def test_lease_due_dispatch_should_not_re_lease_within_lease_window(
+        self, mock_db
+    ):
+        """Test that a freshly-leased job is not immediately re-leasable.
+
+        Given:
+            One due pending job that has just been leased (its
+            next_dispatch_at pushed into the future).
+        When:
+            lease_due_dispatch is awaited again at the same now.
+        Then:
+            It should return None — the single-claim guard, so two ticks or
+            replicas cannot double-dispatch one job.
+        """
+        # Arrange
+        now = datetime.now(timezone.utc)
+        _insert_job(
+            mock_db,
+            workflow_key="a",
+            status=JobStatus.PENDING,
+            next_dispatch_at=now - timedelta(seconds=1),
+        )
+        next_at = now + timedelta(seconds=120)
+        first = await lock.lease_due_dispatch(mock_db, now=now, next_at=next_at)
+
+        # Act
+        second = await lock.lease_due_dispatch(mock_db, now=now, next_at=next_at)
+
+        # Assert
+        assert first is not None
+        assert second is None
+
+    @pytest.mark.asyncio
+    async def test_lease_due_dispatch_should_skip_unscheduled_job(self, mock_db):
+        """Test that a job with next_dispatch_at unset is never leased.
+
+        Given:
+            A pending job whose next_dispatch_at is None (freshly claimed,
+            not yet deferred by the scheduler).
+        When:
+            lease_due_dispatch is awaited.
+        Then:
+            It should return None — an unscheduled job is not due.
+        """
+        # Arrange
+        now = datetime.now(timezone.utc)
+        _insert_job(
+            mock_db, workflow_key="a", status=JobStatus.PENDING, next_dispatch_at=None
+        )
+
+        # Act
+        leased = await lock.lease_due_dispatch(
+            mock_db, now=now, next_at=now + timedelta(seconds=120)
+        )
+
+        # Assert
+        assert leased is None
+
+    @pytest.mark.asyncio
+    async def test_lease_due_dispatch_should_skip_future_and_running_jobs(
+        self, mock_db
+    ):
+        """Test that not-yet-due and non-pending jobs are not leased.
+
+        Given:
+            A pending job due in the future and a running job already past
+            its time.
+        When:
+            lease_due_dispatch is awaited.
+        Then:
+            It should return None — the future job is not due and the running
+            job is not pending.
+        """
+        # Arrange
+        now = datetime.now(timezone.utc)
+        _insert_job(
+            mock_db,
+            workflow_key="future",
+            status=JobStatus.PENDING,
+            next_dispatch_at=now + timedelta(seconds=60),
+        )
+        _insert_job(
+            mock_db,
+            workflow_key="running",
+            status=JobStatus.RUNNING,
+            next_dispatch_at=now - timedelta(seconds=60),
+        )
+
+        # Act
+        leased = await lock.lease_due_dispatch(
+            mock_db, now=now, next_at=now + timedelta(seconds=120)
+        )
+
+        # Assert
+        assert leased is None
+
+    @pytest.mark.asyncio
+    async def test_lease_due_dispatch_should_lease_oldest_due_job_first(self, mock_db):
+        """Test that the longest-waiting due job is leased first.
+
+        Given:
+            Two due pending jobs with different next_dispatch_at times,
+            inserted newest-first.
+        When:
+            lease_due_dispatch is awaited.
+        Then:
+            It should lease the one with the earliest next_dispatch_at, so
+            sustained overflow cannot starve the oldest job.
+        """
+        # Arrange
+        now = datetime.now(timezone.utc)
+        _insert_job(
+            mock_db,
+            workflow_key="newer",
+            status=JobStatus.PENDING,
+            next_dispatch_at=now - timedelta(seconds=10),
+        )
+        _insert_job(
+            mock_db,
+            workflow_key="older",
+            status=JobStatus.PENDING,
+            next_dispatch_at=now - timedelta(seconds=300),
+        )
+
+        # Act
+        leased = await lock.lease_due_dispatch(
+            mock_db, now=now, next_at=now + timedelta(seconds=120)
+        )
+
+        # Assert
+        assert leased is not None
+        assert leased.workflow_key == "older"
+
+    @pytest.mark.asyncio
+    async def test_lease_due_dispatch_should_reject_next_at_not_in_future(
+        self, mock_db
+    ):
+        """Test that a non-future next_at is rejected.
+
+        Given:
+            A now timestamp and a next_at equal to it.
+        When:
+            lease_due_dispatch is awaited.
+        Then:
+            It should raise ValueError — an equal/past next_at would let a
+            concurrent tick immediately re-lease the same job.
+        """
+        # Arrange
+        now = datetime.now(timezone.utc)
+
+        # Act & assert
+        with pytest.raises(ValueError, match="strictly after"):
+            await lock.lease_due_dispatch(mock_db, now=now, next_at=now)
+
+
+class TestRequeueOrphanedDispatch:
+    @pytest.mark.asyncio
+    async def test_requeue_orphaned_dispatch_should_requeue_a_stale_running_orphan(self, mock_db):
+        """Test that a RUNNING job with no recent heartbeat is re-queued.
+
+        Given:
+            A RUNNING job whose `updated_at` is older than the stale
+            threshold — an in-flight job orphaned when its API consumer
+            died mid-stream.
+        When:
+            requeue_orphaned_dispatch runs with that staleness cutoff.
+        Then:
+            The job is reset to PENDING with `next_dispatch_at` set so the
+            scheduler re-dispatches it, and the call reports one requeued
+            row.
+        """
+        # Arrange
+        now = datetime.now(timezone.utc)
+        stale = now - timedelta(seconds=2000)
+        _insert_job(
+            mock_db, workflow_key="a", status=JobStatus.RUNNING, updated_at=stale
+        )
+
+        # Act
+        count = await lock.requeue_orphaned_dispatch(
+            mock_db, now=now, stale_before=now - timedelta(seconds=900)
+        )
+
+        # Assert
+        assert count == 1
+        job = await lock.get_job(mock_db, "job-a")
+        assert job is not None
+        assert job.status == JobStatus.PENDING
+        assert job.next_dispatch_at == now
+
+    @pytest.mark.asyncio
+    async def test_requeue_orphaned_dispatch_should_requeue_a_stale_pending_job_with_no_next_dispatch(
+        self, mock_db
+    ):
+        """Test that a stale PENDING job with unset next_dispatch_at is re-queued.
+
+        Given:
+            A PENDING job with `next_dispatch_at` unset (a fresh claim whose
+            inline attempt never rescheduled) and a stale `updated_at`.
+        When:
+            requeue_orphaned_dispatch runs.
+        Then:
+            Its `next_dispatch_at` is set to now so the scheduler — which
+            never leases a null-`next_dispatch_at` row — can pick it up.
+        """
+        # Arrange
+        now = datetime.now(timezone.utc)
+        _insert_job(
+            mock_db,
+            workflow_key="a",
+            status=JobStatus.PENDING,
+            next_dispatch_at=None,
+            updated_at=now - timedelta(seconds=2000),
+        )
+
+        # Act
+        count = await lock.requeue_orphaned_dispatch(
+            mock_db, now=now, stale_before=now - timedelta(seconds=900)
+        )
+
+        # Assert
+        assert count == 1
+        job = await lock.get_job(mock_db, "job-a")
+        assert job is not None
+        assert job.status == JobStatus.PENDING
+        assert job.next_dispatch_at == now
+
+    @pytest.mark.asyncio
+    async def test_requeue_orphaned_dispatch_should_not_touch_a_fresh_running_job(self, mock_db):
+        """Test that a healthy, recently-heartbeating RUNNING job is left alone.
+
+        Given:
+            A RUNNING job whose `updated_at` is recent (within the stale
+            threshold), i.e. a healthy in-flight job.
+        When:
+            requeue_orphaned_dispatch runs.
+        Then:
+            It is not touched — staleness gating prevents the sweep from
+            yanking a live job away from its worker.
+        """
+        # Arrange
+        now = datetime.now(timezone.utc)
+        _insert_job(
+            mock_db,
+            workflow_key="a",
+            status=JobStatus.RUNNING,
+            updated_at=now - timedelta(seconds=10),
+        )
+
+        # Act
+        count = await lock.requeue_orphaned_dispatch(
+            mock_db, now=now, stale_before=now - timedelta(seconds=900)
+        )
+
+        # Assert
+        assert count == 0
+        job = await lock.get_job(mock_db, "job-a")
+        assert job is not None
+        assert job.status == JobStatus.RUNNING
+
+    @pytest.mark.asyncio
+    async def test_requeue_orphaned_dispatch_should_not_touch_an_already_queued_pending_job(self, mock_db):
+        """Test that a stale PENDING job already carrying next_dispatch_at is left.
+
+        Given:
+            A PENDING job that already has `next_dispatch_at` set (it is
+            queued and leasable by the scheduler), even with a stale
+            `updated_at`.
+        When:
+            requeue_orphaned_dispatch runs.
+        Then:
+            It is not touched — the scheduler already handles it; the sweep
+            targets only the non-leasable orphan states.
+        """
+        # Arrange
+        now = datetime.now(timezone.utc)
+        original = now - timedelta(seconds=50)
+        _insert_job(
+            mock_db,
+            workflow_key="a",
+            status=JobStatus.PENDING,
+            next_dispatch_at=original,
+            updated_at=now - timedelta(seconds=2000),
+        )
+
+        # Act
+        count = await lock.requeue_orphaned_dispatch(
+            mock_db, now=now, stale_before=now - timedelta(seconds=900)
+        )
+
+        # Assert
+        assert count == 0
+        job = await lock.get_job(mock_db, "job-a")
+        assert job is not None
+        assert job.next_dispatch_at == original
+
+    @pytest.mark.asyncio
+    async def test_requeue_orphaned_dispatch_should_not_touch_terminal_jobs(self, mock_db):
+        """Test that terminal (completed/failed) jobs are never re-queued.
+
+        Given:
+            Stale COMPLETED and FAILED jobs.
+        When:
+            requeue_orphaned_dispatch runs.
+        Then:
+            Neither is touched — only active rows (the mutex holders) are
+            candidates for recovery.
+        """
+        # Arrange
+        now = datetime.now(timezone.utc)
+        stale = now - timedelta(seconds=2000)
+        _insert_job(
+            mock_db, workflow_key="done", status=JobStatus.COMPLETED, updated_at=stale
+        )
+        _insert_job(
+            mock_db, workflow_key="fail", status=JobStatus.FAILED, updated_at=stale
+        )
+
+        # Act
+        count = await lock.requeue_orphaned_dispatch(
+            mock_db, now=now, stale_before=now - timedelta(seconds=900)
+        )
+
+        # Assert
+        assert count == 0
+        assert (await lock.get_job(mock_db, "job-done")).status == JobStatus.COMPLETED
+        assert (await lock.get_job(mock_db, "job-fail")).status == JobStatus.FAILED
diff --git a/tests/test_workflows/test_models.py b/tests/test_workflows/test_models.py
index f5eb23d..09b458b 100644
--- a/tests/test_workflows/test_models.py
+++ b/tests/test_workflows/test_models.py
@@ -309,6 +309,85 @@ def test_to_mongo_should_serialize_stages_progress_and_superseded_by(self):
         assert payload["superseded_by"] == "job-xyz"
         assert payload["progress"] == "indexing"
 
+    def test___init___should_default_dispatch_scheduling_fields(self):
+        """Test that the dispatch-scheduling fields default sanely.
+
+        Given:
+            A JobRecord built without dispatch-scheduling overrides.
+        When:
+            It is constructed.
+        Then:
+            next_dispatch_at should default to None and dispatch_attempts
+            to 0.
+        """
+        # Act
+        job = _make_job()
+
+        # Assert
+        assert job.next_dispatch_at is None
+        assert job.dispatch_attempts == 0
+
+    def test___init___should_reject_naive_next_dispatch_at(self):
+        """Test that a naive next_dispatch_at is rejected.
+
+        Given:
+            A naive next_dispatch_at (no tzinfo).
+        When:
+            A JobRecord is constructed with it.
+        Then:
+            It should raise ValidationError — the scheduler's
+            next_dispatch_at comparison requires aware UTC.
+        """
+        # Arrange
+        naive = datetime(2026, 4, 21, 13, 0, 0)
+
+        # Act & assert
+        with pytest.raises(ValidationError, match="timezone-aware"):
+            _make_job(next_dispatch_at=naive)
+
+    def test___init___should_accept_aware_next_dispatch_at(self):
+        """Test that an aware next_dispatch_at validates and round-trips.
+
+        Given:
+            An aware UTC next_dispatch_at.
+        When:
+            A JobRecord is constructed with it.
+        Then:
+            It should validate and preserve the value (None remains the
+            nullable default elsewhere).
+        """
+        # Arrange
+        when = datetime(2026, 4, 21, 13, 0, 0, tzinfo=timezone.utc)
+
+        # Act
+        job = _make_job(next_dispatch_at=when)
+
+        # Assert
+        assert job.next_dispatch_at == when
+
+    def test_to_mongo_should_serialize_dispatch_scheduling_fields(self):
+        """Test that to_mongo persists next_dispatch_at and dispatch_attempts.
+
+        Given:
+            A JobRecord with an aware next_dispatch_at and a non-zero
+            dispatch_attempts.
+        When:
+            to_mongo is invoked.
+        Then:
+            Both fields should appear in the dump with their values
+            preserved verbatim.
+        """
+        # Arrange
+        when = datetime(2026, 4, 21, 13, 0, 0, tzinfo=timezone.utc)
+        job = _make_job(next_dispatch_at=when, dispatch_attempts=3)
+
+        # Act
+        payload = job.to_mongo()
+
+        # Assert
+        assert payload["next_dispatch_at"] == when
+        assert payload["dispatch_attempts"] == 3
+
     @pytest.mark.parametrize(
         "status",
         list(JobStatus),
diff --git a/tests/test_workflows/test_s3_contract.py b/tests/test_workflows/test_s3_contract.py
index 16c599f..3f3a5f4 100644
--- a/tests/test_workflows/test_s3_contract.py
+++ b/tests/test_workflows/test_s3_contract.py
@@ -27,7 +27,7 @@
 from cfdb.workflows.events import Complete, StageComplete, WorkflowEvent
 from cfdb.workflows.executor import WoolExecutor
 from cfdb.workflows.lock import get_job
-from cfdb.workflows.models import ACTIVE_STATUSES, ArtifactKind, JobStatus
+from cfdb.workflows.models import ArtifactKind, JobStatus
 from cfdb.workflows.processors.base import Processor
 from cfdb.workflows.processors.registry import ProcessorRegistry
 from tests.test_workflows import FIXTURE_MD5
@@ -79,9 +79,7 @@ def _install_jobs_index(mock_db) -> None:
     mock_db.jobs.create_index(
         {"workflow_key": 1},
         unique=True,
-        partialFilterExpression={
-            "status": {"$in": [s.value for s in ACTIVE_STATUSES]}
-        },
+        partialFilterExpression={"active": True},
     )
 
 
diff --git a/tests/test_workflows/test_worker_lan.py b/tests/test_workflows/test_worker_lan.py
index f99f784..0c545b4 100644
--- a/tests/test_workflows/test_worker_lan.py
+++ b/tests/test_workflows/test_worker_lan.py
@@ -174,3 +174,65 @@ def test_main_tls_cli_flags_override_env_vars(self, monkeypatch):
         # Assert
         assert exit_code == 0
         assert captured["tls_ca"] == "/cli/ca.pem"
+
+
+def test_backpressure_bound_factory_should_still_declare_host():
+    """Test that the backpressure-bound spawn factory still declares a host.
+
+    Given:
+        The ``functools.partial(LocalWorker, backpressure=hook)`` factory
+        that ``worker_lan.serve`` hands to ``WorkerPool(spawn=…)`` to apply
+        per-worker backpressure.
+    When:
+        wool's ``declares_host`` classifier inspects it.
+    Then:
+        It should return True, so the pool still prescribes the bind host. A
+        False here would make spawned workers bind 127.0.0.1 and advertise
+        unreachable addresses — a runtime failure with no other unit signal,
+        so this pins the load-bearing invariant the worker_lan wiring relies
+        on against a wool signature change.
+    """
+    # Arrange
+    import functools
+
+    import wool
+    from wool.runtime.worker.base import declares_host
+
+    from cfdb.workflows.backpressure import TaskCountBackpressure
+
+    factory = functools.partial(wool.LocalWorker, backpressure=TaskCountBackpressure(1))
+
+    # Act & assert
+    assert declares_host(factory) is True
+
+
+def test_backpressure_bound_factory_should_survive_cloudpickle():
+    """Test that the backpressure-bound spawn factory round-trips cloudpickle.
+
+    Given:
+        The ``functools.partial(LocalWorker, backpressure=TaskCountBackpressure(1))``
+        factory ``worker_lan.serve`` hands to ``WorkerPool(spawn=…)`` — the
+        exact composite object wool serializes into each spawned worker
+        subprocess.
+    When:
+        It is cloudpickled and reloaded.
+    Then:
+        The reconstructed partial still carries a backpressure hook with the
+        same threshold, so per-worker backpressure survives the spawn
+        boundary rather than silently reverting to unbounded admission.
+    """
+    # Arrange
+    import functools
+
+    import cloudpickle
+    import wool
+
+    from cfdb.workflows.backpressure import TaskCountBackpressure
+
+    factory = functools.partial(wool.LocalWorker, backpressure=TaskCountBackpressure(1))
+
+    # Act
+    restored = cloudpickle.loads(cloudpickle.dumps(factory))
+
+    # Assert
+    assert restored.keywords["backpressure"].threshold == 1
diff --git a/tests/test_workflows/test_worker_main.py b/tests/test_workflows/test_worker_main.py
index 8f4ea9e..757d58d 100644
--- a/tests/test_workflows/test_worker_main.py
+++ b/tests/test_workflows/test_worker_main.py
@@ -4,6 +4,7 @@
 
 from unittest.mock import patch
 
+import pytest
 from click.testing import CliRunner
 
 from cfdb.workflows import worker_main
@@ -183,3 +184,84 @@ def test_main_tls_cli_flags_override_env_vars(self, monkeypatch):
         # Assert
         assert exit_code == 0
         assert captured["tls_ca"] == "/cli/ca.pem"
+
+
+class _StopServe(Exception):
+    """Sentinel raised from the mocked ``worker.start`` to halt ``serve``.
+
+    ``serve`` constructs the worker (with its backpressure kwarg) immediately
+    before ``await worker.start()``, so raising here exits the coroutine right
+    after the construction under test without entering the run loop, the
+    signal-wait, or the health/drain teardown.
+    """
+
+
+def _arrange_serve(mocker) -> object:
+    """Patch ``serve``'s collaborators and return the ``LocalWorker`` spy."""
+    mocker.patch.object(worker_main, "build_worker_credentials", return_value=None)
+    mocker.patch.object(
+        worker_main, "_start_health_server", mocker.AsyncMock(return_value=mocker.Mock())
+    )
+    worker_instance = mocker.Mock()
+    worker_instance.start = mocker.AsyncMock(side_effect=_StopServe)
+    worker_instance.stop = mocker.AsyncMock()
+    return mocker.patch.object(
+        worker_main.wool, "LocalWorker", return_value=worker_instance
+    )
+
+
+class TestServeBackpressureWiring:
+    @pytest.mark.asyncio
+    async def test_serve_should_wire_taskcount_backpressure_when_threshold_positive(
+        self, mocker, monkeypatch
+    ):
+        """Test that a positive task ceiling becomes a TaskCountBackpressure.
+
+        Given:
+            ``WORKER_MAX_CONCURRENT_TASKS`` set to 1 and ``wool.LocalWorker``
+            spied so ``serve`` halts right after constructing the worker.
+        When:
+            ``serve`` is run.
+        Then:
+            The worker is constructed with a ``backpressure`` hook whose
+            threshold is 1, so the ECS worker entrypoint actually serializes
+            its subprocess pipelines.
+        """
+        # Arrange
+        monkeypatch.setattr(worker_main, "WORKER_MAX_CONCURRENT_TASKS", 1)
+        local_worker = _arrange_serve(mocker)
+
+        # Act
+        with pytest.raises(_StopServe):
+            await worker_main.serve(worker_port=0, health_port=0)
+
+        # Assert
+        backpressure = local_worker.call_args.kwargs["backpressure"]
+        assert backpressure is not None
+        assert backpressure.threshold == 1
+
+    @pytest.mark.asyncio
+    async def test_serve_should_disable_backpressure_when_threshold_zero(
+        self, mocker, monkeypatch
+    ):
+        """Test that a zero task ceiling wires ``backpressure=None``.
+
+        Given:
+            ``WORKER_MAX_CONCURRENT_TASKS`` set to 0 (the disable sentinel)
+            and ``wool.LocalWorker`` spied.
+        When:
+            ``serve`` is run.
+        Then:
+            The worker is constructed with ``backpressure=None``, restoring
+            the unbounded admission behavior.
+        """
+        # Arrange
+        monkeypatch.setattr(worker_main, "WORKER_MAX_CONCURRENT_TASKS", 0)
+        local_worker = _arrange_serve(mocker)
+
+        # Act
+        with pytest.raises(_StopServe):
+            await worker_main.serve(worker_port=0, health_port=0)
+
+        # Assert
+        assert local_worker.call_args.kwargs["backpressure"] is None