diff --git a/README.md b/README.md index 01cd22a..f7e4ce8 100644 --- a/README.md +++ b/README.md @@ -548,12 +548,13 @@ Stream file contents from DCCs via HTTPS. | Code | Description | |------|-------------| | 200 | Full file content (GET) or file metadata (HEAD) | -| 202 | Preprocessed artifact not yet cached — workflow dispatched. `Location: /jobs/{id}` and `Retry-After` headers point to the polling endpoint. | +| 202 | Preprocessed artifact not yet cached — workflow accepted. `Location: /jobs/{id}` and `Retry-After` headers point to the polling endpoint. Under load the job may sit `pending` in the durable queue before a worker picks it up; poll `/jobs/{id}` for progress. | | 206 | Partial content (Range request) | | 400 | Invalid DCC, path-param shape, or Range header | | 403 | File requires authentication (consortium/protected access) or denied by upstream repository | | 404 | File not found, or HEAD probe of a not-yet-cached processed artifact (GET would dispatch) | | 416 | Range not satisfiable (out of bounds, or file size unknown so no range can be satisfied) | +| 429 | Too many active preprocessing jobs — the active-workflow ceiling (`CFDB_WORKFLOW_MAX_ACTIVE`) is reached; `Retry-After` header set. Retry shortly. | | 501 | No supported access method (e.g., Globus-only files) | | 502 | Upstream service error | | 503 | Workflow subsystem shutting down (`Retry-After`) | @@ -586,12 +587,13 @@ Stream index files (e.g., `.px2`, `.bai`) associated with DCC data files. | Code | Description | |------|-------------| | 200 | Full index file content (GET) or file metadata (HEAD) | -| 202 | Index not yet cached — workflow dispatched. `Location: /jobs/{id}` and `Retry-After` headers point to the polling endpoint. | +| 202 | Index not yet cached — workflow accepted. `Location: /jobs/{id}` and `Retry-After` headers point to the polling endpoint. Under load the job may sit `pending` in the durable queue before a worker picks it up; poll `/jobs/{id}` for progress. | | 206 | Partial content (Range request) | | 400 | Invalid DCC, path-param shape, or Range header | | 403 | File requires consortium/protected access (HuBMAP) | | 404 | File not found, format has no index (CSV/TSV/bigWig), or HEAD probe of a not-yet-cached index | | 416 | Range not satisfiable | +| 429 | Too many active preprocessing jobs — the active-workflow ceiling (`CFDB_WORKFLOW_MAX_ACTIVE`) is reached; `Retry-After` header set. Retry shortly. | | 502 | Upstream service error or malformed sidecar | | 503 | Workflow subsystem disabled (set `SYNC_DATA_DIR`) for a processable format, or shutting down (`Retry-After`) | @@ -651,6 +653,13 @@ The preprocessed artifact is the default response. Clients that want the raw ups Cache keys are content-addressed using each file's upstream `md5`, so a byte change upstream (with the sync pipeline refreshing `md5`) invalidates the cache automatically. +**Bounded concurrency, durable queuing, and admission control.** Dispatch is bounded on three cooperating layers so an unauthenticated burst on `/data` and `/index` can't oversubscribe the worker fleet or queue unbounded work: + +- **Per-worker backpressure** — each worker accepts at most `CFDB_WORKER_MAX_CONCURRENT_TASKS` tasks at once (default `1`), serializing the subprocess pipelines on a 1-vCPU worker. A worker at capacity rejects the dispatch and the API's priority load balancer rotates to the next worker. +- **Priority (leaky-bucket) load balancing** — the API offers each task to discovered workers in a stable order, so load concentrates on the lowest-ordered workers and over-provisioned workers drain to idle and self-reap (via `CFDB_WORKER_MAX_LIFETIME_SECONDS`) instead of every worker carrying a thin perpetual slice. +- **Durable queue + retry-to-deadline** — when no worker has capacity, the job is **not** failed and does **not** block the request: it stays `pending` and a durable, Mongo-backed scheduler re-attempts dispatch every `CFDB_WORKFLOW_RETRY_INTERVAL_S` (plus jitter) until a worker frees up or the `CFDB_WORKFLOW_DISPATCH_DEADLINE_S` deadline elapses (then it is failed with a `capacity:`-prefixed error). Because the queue lives in Mongo, an API restart resumes it. On every scheduler tick (including the first, on boot) an orphan-recovery sweep re-queues jobs a crash left mid-flight — a `running` job whose API consumer died, or a fresh `pending` claim that never rescheduled — once they pass the stale threshold (`CFDB_WORKFLOW_STALE_THRESHOLD_S`), so recovery is autonomous and does not wait for a client to re-request the file. Recovery shares the same deadline clock as a fresh job: the re-queue preserves the original submission time, so an orphan older than `CFDB_WORKFLOW_DISPATCH_DEADLINE_S` is failed `capacity:` on its first recovery attempt rather than resumed (its committed cache artifacts survive for a later fresh `GET` to reuse) — recovery is best-effort, not unbounded. On the ECS profile, an overflow also requests one bounded worker spawn (the leaky bucket overflowing), inverting the old unconditional per-request spawn. +- **Admission ceiling** — once `CFDB_WORKFLOW_MAX_ACTIVE` workflows are active (`pending` + `running`), further preprocessing requests are shed with `429 Retry-After` rather than queued, so the backlog itself is bounded. The check runs before the per-file mutex, so at the ceiling even a re-`GET` for a file whose workflow is already in flight is shed with `429` (rather than attaching to the in-flight job) and the client retries — the deliberate trade for shedding before an unbounded admission race. The readiness `/status` probes never dispatch and so never `429`. + `/index` continues to serve upstream sidecars first when present (the 218 BED→beddb and 4 BED→tbi 4DN cases that publish under `extra.extra_files` or `extra.fourdn.extra_files`); the workflow path is dispatched only when no sidecar exists. Set `?raw=true` to bypass the workflow path entirely and return only the upstream sidecar (404 when none exists). Required environment variables: @@ -659,10 +668,16 @@ Required environment variables: - `WORKFLOW_WORKER_COUNT` — local-dev only: how many workers the LAN pool (`python -m cfdb.workflows.worker_lan`) spawns and publishes (default `2`). The API itself no longer leases a fixed count — its `WorkerPool` admits every worker discovery surfaces. In the ECS profile one ephemeral worker is launched per workflow (via `EcsProvisioner` `RunTask`), so the **maximum concurrent worker count is bounded by the AWS Fargate vCPU service quota** for the account/region — raise it in Service Quotas; at 1 vCPU per worker it maps roughly 1:1 to concurrent workers. A burst of N distinct uncached files can launch up to ~N workers concurrently, subject to that quota. - `WORKFLOW_POOL_NAMESPACE` — wool discovery namespace shared by the API and the external worker pool (default `cfdb-workers`). Both processes must agree on this value or dispatch will hang waiting for workers. +Bounded-concurrency control (issue #45): + +- `CFDB_WORKER_MAX_CONCURRENT_TASKS` — per-worker backpressure threshold: a worker rejects a dispatch (gRPC `RESOURCE_EXHAUSTED`, which the priority load balancer treats as transient and rotates past) once it already has this many tasks in flight (default `1`, to serialize the subprocess pipelines on a 1-vCPU worker; `0` disables backpressure). Set on the **worker** process. +- `CFDB_WORKFLOW_MAX_ACTIVE` — admission ceiling on concurrently active workflows (`pending` + `running` jobs). Once this many are active, `/data` and `/index` shed new preprocessing requests with `429 Retry-After` before claiming the per-file mutex (default `1024`). Soft cap — a count-then-claim race may briefly overshoot. Set on the **API**. +- `CFDB_WORKFLOW_RETRY_INTERVAL_S` — base cadence (plus a small random jitter) at which the durable retry scheduler re-attempts dispatch for a job awaiting worker capacity (default `120`, i.e. 2 min). A dispatch attempt that finds no free worker leaves the job `pending` and rescheduled rather than blocking the request. Set on the **API**. +- `CFDB_WORKFLOW_DISPATCH_DEADLINE_S` — how long a job may wait for worker capacity (re-attempted on the retry cadence above) before the scheduler fails it with a `capacity:`-prefixed error (default `14400`, i.e. 4 h). Replaces the former single in-request `CFDB_WORKFLOW_DISPATCH_WAIT_S` wait — instead of blocking one request on a cold start, the job queues durably and is retried to this deadline. Set on the **API**. + Optional tunables (with defaults): - `CFDB_WORKFLOW_DURATION_CAP_S` — per-workflow wall-clock cap (default `14400`, i.e. 4 h — sized for multi-hour preprocessing runs; lower it for fixture-bound dev). -- `CFDB_WORKFLOW_DISPATCH_WAIT_S` — how long `ensure_workflow` waits for a free worker before giving up (default `240`, i.e. 4 min — sized for an ECS Fargate cold start: image pull + health check typically take 1-3 min, so a smaller budget can expire before a freshly-provisioned worker reports HEALTHY and hard-fail the first request. Lower it for fixture-bound dev where workers are already running). - `CFDB_WORKFLOW_HEARTBEAT_INTERVAL_S` — cadence at which the wool routine emits heartbeat events during quiet stages so the API can refresh `JobRecord.updated_at` (default `300`). The stale-reclaim threshold below is sized as `2 × heartbeat + safety_margin`; lowering this knob without also lowering the threshold widens the false-reclaim window. - `CFDB_WORKFLOW_STALE_THRESHOLD_S` — `updated_at` age beyond which an active row is reclaimable (default `900`; sized as `2 × heartbeat_interval + safety_margin` so a single missed heartbeat does not falsely reclaim a healthy worker). - `CFDB_SAMTOOLS_THREADS` (default `1`), `CFDB_SORT_PARALLEL` (default `2`) — CPU/thread caps for `samtools sort/index` and GNU `sort` respectively. diff --git a/scripts/create-indexes.js b/scripts/create-indexes.js index dad8f29..c7ebff3 100644 --- a/scripts/create-indexes.js +++ b/scripts/create-indexes.js @@ -221,6 +221,17 @@ ensureIndex( { name: "status_updated_at" } ); +// Serves the durable retry scheduler's due-dispatch lease +// (workflows.lock.lease_due_dispatch): { status: "pending", +// next_dispatch_at: { $lte: now } } sorted by next_dispatch_at asc. The +// status equality prefix plus the next_dispatch_at range/sort are both +// index-served, so the per-tick poll is not a scan of all PENDING rows. +ensureIndex( + db.jobs, + { status: 1, next_dispatch_at: 1 }, + { name: "status_next_dispatch_at" } +); + // TTL index on terminal job rows. Without this, every stale-reclaim // transition leaves a permanent ``failed`` document and the collection // grows unbounded for any frequently-touched workflow_key. The partial diff --git a/src/cfdb/api/main.py b/src/cfdb/api/main.py index eb1fdb8..53db36d 100644 --- a/src/cfdb/api/main.py +++ b/src/cfdb/api/main.py @@ -33,6 +33,7 @@ from cfdb.workflows.credentials import build_worker_credentials from cfdb.workflows.discovery import EcsDiscovery from cfdb.workflows.executor import WoolExecutor +from cfdb.workflows.loadbalancer import PriorityLoadBalancer from cfdb.workflows.processors.bam import BamIndexProcessor from cfdb.workflows.processors.registry import default_registry from cfdb.workflows.processors.tabix import TabixIntervalProcessor @@ -329,19 +330,27 @@ async def lifespan(_: FastAPI): # dispatch — fired while the Fargate worker is still booting # (image pull + health check, ~1-3 min) — blow the quorum # wait and raise an un-retried ``TimeoutError`` out of - # ``WorkerPool.start``. cfdb's ``_open_stream_with_retry`` - # only retries ``wool.NoWorkersAvailable``, so that + # ``WorkerPool.start``. cfdb's dispatch attempt only treats + # ``wool.NoWorkersAvailable`` as "no capacity", so that # ``TimeoutError`` fails the first request hard (and the # provisioner's cancelled ``RunTask`` leaks an idle worker). # With the gate off, a cold-start dispatch surfaces - # ``NoWorkersAvailable`` instead, which the executor already - # absorbs within its own ``_DISPATCH_WAIT_SECONDS`` budget — - # keeping the cold-start wait in one layer cfdb owns rather - # than duplicating it across wool's quorum wait too. + # ``NoWorkersAvailable`` instead, which the executor absorbs + # by leaving the job PENDING and re-attempting it on the + # durable retry scheduler's cadence — keeping the cold-start + # wait in one layer cfdb owns rather than duplicating it + # across wool's quorum wait too. async with wool.WorkerPool( discovery=discovery, credentials=worker_credentials, quorum=0, + # Priority/leaky-bucket balancing: always offer a task to + # discovered workers in the same stable order. With + # per-worker backpressure (one task each), load fills the + # lowest-ordered workers first and higher-ordered ones + # drain to idle and self-reap, instead of every worker + # carrying a thin perpetual slice (issue #45). + loadbalancer=PriorityLoadBalancer(), ): # Snapshot the lifespan task's contextvars after the # pool's ``__aenter__`` has populated wool's internals. @@ -354,6 +363,13 @@ async def lifespan(_: FastAPI): provisioner=provisioner, ) executor_handle = api.executor + # Start the durable retry scheduler — the single dispatch + # driver — here, inside the pool context, so the created + # task inherits wool's dispatch contextvars (same as + # request-spawned tasks via ``attach_wool_context``). + # ``_drain_executor`` cancels it on teardown before the + # pool closes. + api.executor.start_scheduler() log.info( "Workflow subsystem enabled: profile=%s cache=%s " "workdir=%s discovery=%s provisioner=%s " diff --git a/src/cfdb/api/routers/cache_stream.py b/src/cfdb/api/routers/cache_stream.py index 5b89651..830f5cc 100644 --- a/src/cfdb/api/routers/cache_stream.py +++ b/src/cfdb/api/routers/cache_stream.py @@ -19,7 +19,11 @@ from cfdb import api from cfdb.services import drs from cfdb.workflows.cache import CacheBackend -from cfdb.workflows.executor import ExecutorDraining, WorkflowNotApplicable +from cfdb.workflows.executor import ( + AdmissionRejected, + ExecutorDraining, + WorkflowNotApplicable, +) from cfdb.workflows.models import ArtifactKind logger = logging.getLogger(__name__) @@ -275,6 +279,18 @@ async def serve_workflow_artifact_or_dispatch( detail="Service is shutting down; please retry", headers={"Retry-After": "30"}, ) + except AdmissionRejected as exc: + # The active-workflow ceiling is hit — shed this request with a + # 429 so the client backs off rather than queuing unbounded work. + # NOT a WorkflowNotApplicable subclass, so it must be caught + # explicitly (before that handler) to avoid falling through to + # direct upstream streaming, which would bypass the bounded + # pipeline entirely. + raise HTTPException( + status_code=status.HTTP_429_TOO_MANY_REQUESTS, + detail="Too many active preprocessing jobs; please retry shortly", + headers={"Retry-After": str(exc.retry_after_seconds)}, + ) except WorkflowNotApplicable: # Race: processor accepted this file, but executor disagreed. # Fall through to the caller's own path rather than 500. diff --git a/src/cfdb/indexes.py b/src/cfdb/indexes.py index a3f9ccd..7bffd5a 100644 --- a/src/cfdb/indexes.py +++ b/src/cfdb/indexes.py @@ -134,6 +134,16 @@ def operational_index_specs() -> list[IndexSpec]: ), IndexSpec("jobs", [("job_id", 1)], name="job_id_unique", unique=True), IndexSpec("jobs", [("status", 1), ("updated_at", 1)], name="status_updated_at"), + # Serves the durable retry scheduler's due-dispatch lease + # (workflows.lock.lease_due_dispatch): {status: pending, + # next_dispatch_at: {$lte: now}} sorted by next_dispatch_at asc. + # The status equality prefix + next_dispatch_at range/sort are + # both index-served, so the per-tick poll is not a PENDING scan. + IndexSpec( + "jobs", + [("status", 1), ("next_dispatch_at", 1)], + name="status_next_dispatch_at", + ), # TTL on terminal rows so the collection doesn't grow unbounded. # The partial filter excludes active rows (active=True) so an # in-flight job is never reaped. 7 days gives operators a window diff --git a/src/cfdb/workflows/__init__.py b/src/cfdb/workflows/__init__.py index e6509ca..12ae181 100644 --- a/src/cfdb/workflows/__init__.py +++ b/src/cfdb/workflows/__init__.py @@ -31,12 +31,6 @@ preprocessing runs (e.g., a ``samtools sort`` on a multi-GB BAM followed by ``samtools index``). Operators running on bounded fixtures in dev should lower this via env. -- ``CFDB_WORKFLOW_DISPATCH_WAIT_S`` — how long ``ensure_workflow`` waits - for a wool worker to become available before giving up. Default ``240`` - (4 min) — sized for an ECS Fargate cold start (image pull + health - check, typically 1-3 min). A smaller value risks exhausting the retry - budget before a freshly-provisioned worker reports HEALTHY; lower it - for fixture-bound dev where workers are already running. - ``CFDB_WORKFLOW_HEARTBEAT_INTERVAL_S`` — how often the routine emits a heartbeat event during quiet periods so the API can refresh ``updated_at`` on the JobRecord. Default ``300`` (5 min). @@ -44,6 +38,28 @@ RUNNING/PENDING row is considered stale and reclaimable. Default ``900`` (15 min) — sized as ``2 × heartbeat_interval + safety_margin`` so a single missed heartbeat does not falsely reclaim a healthy worker. + +Concurrency / admission (bounded-concurrency control, issue #45): + +- ``CFDB_WORKER_MAX_CONCURRENT_TASKS`` — per-worker backpressure + threshold: a worker rejects a dispatch (gRPC ``RESOURCE_EXHAUSTED``, + which the load balancer treats as transient and rotates past) once it + already has this many tasks in flight. Default ``1`` (serialize the + subprocess pipelines on a 1-vCPU worker); ``0`` disables backpressure + (unbounded — the prior behavior). Enforced worker-side via a + ``wool.BackpressureLike`` hook. +- ``CFDB_WORKFLOW_MAX_ACTIVE`` — admission ceiling on concurrently active + workflows (``pending`` + ``running`` jobs). ``ensure_workflow`` returns + ``429 Retry-After`` once this many are active, shedding load before + claiming the per-file mutex. Default ``1024``. Soft cap + (count-then-insert may briefly overshoot). +- ``CFDB_WORKFLOW_DISPATCH_DEADLINE_S`` — how long a job may wait for + worker capacity (re-dispatched on the retry cadence below) before it is + marked ``failed``. Default ``14400`` (4 h). +- ``CFDB_WORKFLOW_RETRY_INTERVAL_S`` — base cadence at which the durable + retry scheduler re-attempts dispatch for a queued job awaiting + capacity; a small random jitter is added per attempt. Default ``120`` + (2 min). """ from __future__ import annotations @@ -112,24 +128,17 @@ def _positive_int(name: str, value: str, *, minimum: int = 0) -> int: # Runtime caps require ``minimum=1`` — zero values silently break the # corresponding loop or timeout (``asyncio.timeout(0)`` fires immediately, # ``HEARTBEAT_INTERVAL_S=0`` spins, ``STALE_THRESHOLD_S=0`` reclaims every -# active job on every check, ``DISPATCH_WAIT_S=0`` makes every cold start -# look like a hard failure). +# active job on every check). WORKFLOW_DURATION_CAP_S: Final = _positive_int( "CFDB_WORKFLOW_DURATION_CAP_S", os.getenv("CFDB_WORKFLOW_DURATION_CAP_S", "14400"), minimum=1, ) -# Default sized for a Fargate cold start (image pull + health check, -# ~1-3 min). With ``quorum=0`` a cold-start dispatch surfaces -# ``NoWorkersAvailable``, which the executor retries inside this budget; -# the old 60s default could expire before the just-launched worker -# reports HEALTHY, hard-failing the first request. 240s covers the -# cold-start window with headroom while staying env-overridable. -WORKFLOW_DISPATCH_WAIT_S: Final = _positive_int( - "CFDB_WORKFLOW_DISPATCH_WAIT_S", - os.getenv("CFDB_WORKFLOW_DISPATCH_WAIT_S", "240"), - minimum=1, -) +# NOTE: the former ``CFDB_WORKFLOW_DISPATCH_WAIT_S`` (a single in-request +# cold-start wait) was removed in the #45 restructure. A dispatch that +# finds no capacity no longer blocks the request — the job queues durably +# and the retry scheduler re-attempts it on ``CFDB_WORKFLOW_RETRY_INTERVAL_S`` +# until ``CFDB_WORKFLOW_DISPATCH_DEADLINE_S`` (both below). WORKFLOW_HEARTBEAT_INTERVAL_S: Final = _positive_int( "CFDB_WORKFLOW_HEARTBEAT_INTERVAL_S", os.getenv("CFDB_WORKFLOW_HEARTBEAT_INTERVAL_S", "300"), @@ -141,15 +150,53 @@ def _positive_int(name: str, value: str, *, minimum: int = 0) -> int: minimum=1, ) -# The stale-reclaim threshold MUST exceed two heartbeat intervals so a -# single missed heartbeat (e.g., a brief Mongo write delay) does not -# falsely reclaim a healthy worker. The default values (300 / 900) -# satisfy this with a 300s safety margin; an operator-tuned pair that -# violates the invariant is a configuration error. -if WORKFLOW_STALE_THRESHOLD_S < 2 * WORKFLOW_HEARTBEAT_INTERVAL_S: +# The stale-reclaim threshold MUST strictly exceed two heartbeat intervals. +# A single missed heartbeat (e.g., a brief Mongo write delay) must not +# falsely reclaim a healthy worker, AND the executor's +# ``_HEARTBEAT_LOSS_ABORT_S = max(HEARTBEAT, STALE - HEARTBEAT)`` needs a +# non-zero margin above one interval to keep its single-transient-failure +# tolerance: at the boundary ``STALE == 2 * HEARTBEAT`` that window +# collapses to exactly one interval (see issue #45 review A7). The default +# values (300 / 900) satisfy the strict form with a 300s margin; an +# operator-tuned pair that violates it is a configuration error. +if WORKFLOW_STALE_THRESHOLD_S <= 2 * WORKFLOW_HEARTBEAT_INTERVAL_S: raise ValueError( f"CFDB_WORKFLOW_STALE_THRESHOLD_S ({WORKFLOW_STALE_THRESHOLD_S}s) " - f"must be >= 2 * CFDB_WORKFLOW_HEARTBEAT_INTERVAL_S " + f"must be > 2 * CFDB_WORKFLOW_HEARTBEAT_INTERVAL_S " f"({WORKFLOW_HEARTBEAT_INTERVAL_S}s) to avoid false stale-reclaim " - "of healthy workers between heartbeats." + "of healthy workers between heartbeats and to keep the " + "heartbeat-loss abort window strictly positive." ) + +# --- Bounded-concurrency control (issue #45) -------------------------------- + +# Per-worker backpressure threshold. ``minimum=0`` because 0 is a valid +# "disable backpressure" sentinel — the worker wiring passes +# ``backpressure=None`` when 0, restoring the unbounded prior behavior. +WORKER_MAX_CONCURRENT_TASKS: Final = _positive_int( + "CFDB_WORKER_MAX_CONCURRENT_TASKS", + os.getenv("CFDB_WORKER_MAX_CONCURRENT_TASKS", "1"), + minimum=0, +) +# Admission ceiling on concurrently active (pending + running) workflows. +# ``ensure_workflow`` returns 429 once this many are active. ``minimum=1`` +# because 0 would reject every request. +WORKFLOW_MAX_ACTIVE: Final = _positive_int( + "CFDB_WORKFLOW_MAX_ACTIVE", + os.getenv("CFDB_WORKFLOW_MAX_ACTIVE", "1024"), + minimum=1, +) +# How long a job may wait for worker capacity before being failed. +# ``minimum=1`` — 0 would fail every queued job on its first attempt. +WORKFLOW_DISPATCH_DEADLINE_S: Final = _positive_int( + "CFDB_WORKFLOW_DISPATCH_DEADLINE_S", + os.getenv("CFDB_WORKFLOW_DISPATCH_DEADLINE_S", "14400"), + minimum=1, +) +# Base cadence for the durable retry scheduler's re-dispatch attempts; +# per-attempt jitter is layered on top. ``minimum=1`` — 0 would busy-spin. +WORKFLOW_RETRY_INTERVAL_S: Final = _positive_int( + "CFDB_WORKFLOW_RETRY_INTERVAL_S", + os.getenv("CFDB_WORKFLOW_RETRY_INTERVAL_S", "120"), + minimum=1, +) diff --git a/src/cfdb/workflows/backpressure.py b/src/cfdb/workflows/backpressure.py new file mode 100644 index 0000000..8f89966 --- /dev/null +++ b/src/cfdb/workflows/backpressure.py @@ -0,0 +1,63 @@ +"""Per-worker backpressure hook for the wool worker pool (issue #45). + +A worker rejects an incoming dispatch once it already has a configured +number of tasks in flight, serializing the subprocess pipelines +(samtools/bgzip/tabix/sort/…) the routine shells out to so a single +1-vCPU/2-GiB worker stops oversubscribing. wool surfaces the rejection to +the dispatcher as gRPC ``RESOURCE_EXHAUSTED`` — a transient error the load +balancer rotates past to the next worker. + +The hook is a plain, picklable object holding only an ``int`` because +``wool.LocalWorker`` serializes ``backpressure`` into its worker +subprocess. Keep it free of async/runtime state (no locks, queues, or +event-loop references) so it survives that serialization boundary cleanly. +""" + +from __future__ import annotations + +import wool + + +class TaskCountBackpressure: + """Reject a dispatch when the worker already has ``threshold`` tasks. + + Implements :class:`wool.BackpressureLike`: returns ``True`` (reject) + when the worker's in-flight task count has reached ``threshold`` and + ``False`` (accept) otherwise. With ``threshold == 1`` each worker runs + at most one routine at a time, serializing its CPU/IO-heavy subprocess + pipeline. The "disable backpressure" case (threshold 0) is handled by + the wiring, which passes ``backpressure=None`` rather than this hook. + """ + + def __init__(self, threshold: int) -> None: + if threshold < 1: + raise ValueError( + f"TaskCountBackpressure threshold must be >= 1; got {threshold} " + "(pass backpressure=None to disable backpressure entirely)" + ) + self.threshold = threshold + + def __call__(self, ctx: wool.BackpressureContext) -> bool: + """Return True to reject the incoming task, False to accept it.""" + return ctx.active_task_count >= self.threshold + + def __repr__(self) -> str: + return f"{type(self).__name__}(threshold={self.threshold})" + + +def backpressure_for(threshold: int) -> TaskCountBackpressure | None: + """Build the per-worker backpressure hook for ``threshold``, or None. + + ``threshold == 0`` disables backpressure: the worker wiring passes + ``backpressure=None`` to wool, restoring unbounded admission (the prior + behavior). Any positive value yields a :class:`TaskCountBackpressure` + that rejects once the worker has that many tasks in flight. A negative + threshold is rejected rather than silently disabling backpressure, so + the helper's contract is total. + """ + if threshold < 0: + raise ValueError( + f"backpressure threshold must be >= 0; got {threshold} " + "(0 disables backpressure, a positive value sets the limit)" + ) + return TaskCountBackpressure(threshold) if threshold > 0 else None diff --git a/src/cfdb/workflows/executor.py b/src/cfdb/workflows/executor.py index 41510a2..54c8619 100644 --- a/src/cfdb/workflows/executor.py +++ b/src/cfdb/workflows/executor.py @@ -38,18 +38,23 @@ import asyncio import contextlib import logging +import random import shutil +import uuid from abc import ABC, abstractmethod from collections.abc import AsyncIterator +from datetime import datetime, timedelta, timezone from pathlib import Path from typing import Any, Optional import wool from cfdb.workflows import ( - WORKFLOW_DISPATCH_WAIT_S, + WORKFLOW_DISPATCH_DEADLINE_S, WORKFLOW_DURATION_CAP_S, WORKFLOW_HEARTBEAT_INTERVAL_S, + WORKFLOW_MAX_ACTIVE, + WORKFLOW_RETRY_INTERVAL_S, keys as key_utils, ) from cfdb.workflows.cache import CacheBackend @@ -62,11 +67,16 @@ WorkflowEvent, ) from cfdb.workflows.lock import ( + STALE_WORKFLOW_THRESHOLD, claim_workflow, + count_active_workflows, heartbeat_workflow, + lease_due_dispatch, mark_running, record_stage_complete, release_workflow, + requeue_orphaned_dispatch, + reschedule_dispatch, update_progress, ) from cfdb.workflows.models import JobRecord, JobStatus @@ -87,37 +97,44 @@ #: Sourced from ``CFDB_WORKFLOW_DURATION_CAP_S`` (default 14400 s = 4 h) #: so deployments can override without a code change. Module-level #: rather than a ctor kwarg because tests cap it via ``monkeypatch.setattr`` -#: alongside the other dispatch knobs (``_DISPATCH_WAIT_SECONDS``, +#: alongside the other dispatch knobs (``_RETRY_INTERVAL_SECONDS``, #: ``_HEARTBEAT_INTERVAL_S``) and uniform module-state keeps fixture #: shape consistent. _WORKFLOW_DURATION_CAP_SECONDS = WORKFLOW_DURATION_CAP_S -#: Persisted ``error`` prefix surfaced to clients when the dispatch -#: budget is exhausted with no workers available. The prefix is part -#: of the on-wire contract — clients parse it to decide whether to -#: resubmit. Kept as a module constant so tests can import the same -#: symbol the executor writes. +#: Persisted ``error`` prefix surfaced to clients when a job is failed for +#: lack of worker capacity (its dispatch deadline elapsed while queued). +#: Part of the on-wire contract — clients parse it to decide whether to +#: resubmit. Kept as a module constant so tests can import the same symbol +#: the executor writes. _ERR_PREFIX_CAPACITY = "capacity:" -#: Persisted ``error`` prefix surfaced to clients when the provisioner -#: itself raises a non-retryable exception. Distinct from -#: :data:`_ERR_PREFIX_CAPACITY` so clients can tell "retry me later, -#: capacity issue" apart from "the provisioner crashed". -_ERR_PREFIX_PROVISIONER = "provisioner:" - -#: Total wall-clock budget waiting for a leased worker to surface during -#: dispatch. Sized for an ECS Fargate cold start (image pull + health -#: check, typically 1-3 min; see ``CFDB_WORKFLOW_DISPATCH_WAIT_S``, -#: default 240s) so a ``NoWorkersAvailable`` at the moment of dispatch -#: does not immediately fail the job — instead we poll the pool until a -#: worker shows up or the budget expires. -_DISPATCH_WAIT_SECONDS = float(WORKFLOW_DISPATCH_WAIT_S) - -#: Cadence at which we re-attempt the dispatch while waiting on capacity. -#: Sub-second polling buys us nothing because ECS scale-up dominates the -#: wait; a 1-second cadence keeps the noise floor low while still -#: surfacing a freshly-available worker quickly. -_DISPATCH_RETRY_INTERVAL_SECONDS = 1.0 +#: Admission ceiling on concurrently-active workflows (pending + running). +#: ``ensure_workflow`` sheds requests beyond this with ``AdmissionRejected`` +#: → HTTP 429. Module-level so tests ``monkeypatch.setattr`` it alongside +#: the other dispatch knobs. Soft cap: a count-then-claim race can briefly +#: overshoot, which is acceptable for a flood guard. +_MAX_ACTIVE_WORKFLOWS = WORKFLOW_MAX_ACTIVE + +#: Base cadence for the durable retry scheduler. A dispatch attempt that +#: finds no worker capacity leaves the job PENDING and reschedules its next +#: attempt this far out (plus jitter). Also the scheduler loop's idle wait +#: between ticks, so a freshly-rescheduled job is re-attempted promptly. +_RETRY_INTERVAL_SECONDS = float(WORKFLOW_RETRY_INTERVAL_S) + +#: Upper bound on the random jitter added to each reschedule, to spread a +#: thundering herd of queued jobs across retry ticks instead of stampeding +#: the pool in lockstep. Clamped to the retry interval, so tests that +#: monkeypatch ``_RETRY_INTERVAL_SECONDS`` low get correspondingly small, +#: bounded jitter (determinism in tests comes from that clamp, not from +#: zeroing this constant). +_RETRY_JITTER_MAX_SECONDS = min(30.0, _RETRY_INTERVAL_SECONDS) + +#: Max wall-clock a job may wait for capacity (measured from +#: ``submitted_at``) before the scheduler fails it ``capacity:``. Replaces +#: the old in-request 240s wait: instead of blocking one request, the job +#: queues durably and is retried until it runs or this deadline elapses. +_DISPATCH_DEADLINE_SECONDS = float(WORKFLOW_DISPATCH_DEADLINE_S) #: Cadence at which the wool routine emits ``heartbeat`` events into its #: stream during quiet periods. The API process consumes the stream and @@ -125,6 +142,30 @@ #: long-running stage doesn't get reclaimed as stale. _HEARTBEAT_INTERVAL_S = float(WORKFLOW_HEARTBEAT_INTERVAL_S) +#: How long a consumer may fail to refresh its heartbeat before it aborts +#: its own job. This makes a stale ``RUNNING`` row reliably mean "the +#: consumer is gone": a consumer that cannot reach Mongo stops computing +#: (and lets ``stream.aclose`` cancel the remote worker) *before* the +#: orphan sweep's ``STALE_WORKFLOW_THRESHOLD`` would revive the row, so +#: recovery is spared a redundant second run of a still-live job. Sized +#: strictly between one heartbeat interval (a single transient write +#: failure must not abort) and the sweep threshold (the consumer must give +#: up with margin to spare); the strict ``STALE > 2*HEARTBEAT`` startup +#: invariant keeps that window strictly positive (at ``STALE == 2*HEARTBEAT`` +#: it would collapse to exactly one interval, leaving no transient-failure +#: tolerance — see issue #45 review A7). +_HEARTBEAT_LOSS_ABORT_S = max( + _HEARTBEAT_INTERVAL_S, + STALE_WORKFLOW_THRESHOLD.total_seconds() - _HEARTBEAT_INTERVAL_S, +) + +#: Max jobs leased per scheduler tick. Bounds the per-tick dispatch +#: fan-out so a large queued backlog (up to ``WORKFLOW_MAX_ACTIVE``) can't +#: spawn thousands of concurrent attempts in one tick; the remainder is +#: leased on subsequent ticks. Generous enough to keep a reasonable fleet +#: saturated, small enough to avoid a thundering herd against the pool. +_MAX_DISPATCHES_PER_TICK = 64 + #: Grace given to ``_finalize`` tasks during ``drain``'s second phase #: after the primary ``_pending_tasks`` gather is done. A workflow that #: failed mid-write — Mongo connection blip, S3 cache cleanup blocked @@ -153,6 +194,85 @@ class ExecutorDraining(WorkflowNotApplicable): """ +class AdmissionRejected(RuntimeError): + """Raised by ``ensure_workflow`` when the active-workflow ceiling is hit. + + Deliberately NOT a :class:`WorkflowNotApplicable` subclass: a caller + that catches ``WorkflowNotApplicable`` falls through to direct upstream + streaming, but an admission rejection must surface as a distinct 429 + (with ``Retry-After``) so the client backs off rather than silently + bypassing the bounded pipeline. Carries the observed ``active`` count, + the ``ceiling`` it hit, and a ``retry_after_seconds`` hint for the + response header. + """ + + def __init__( + self, + *, + active: int, + ceiling: int, + retry_after_seconds: Optional[int] = None, + ) -> None: + self.active = active + self.ceiling = ceiling + self.retry_after_seconds = ( + retry_after_seconds + if retry_after_seconds is not None + else max(1, int(_RETRY_INTERVAL_SECONDS)) + ) + super().__init__( + f"Active workflow ceiling reached ({active} >= {ceiling}); " + "shedding request" + ) + + +class HeartbeatLost(RuntimeError): + """Raised by the stream consumer when it can no longer refresh the job. + + Signals that ``heartbeat_workflow`` has been failing for longer than + :data:`_HEARTBEAT_LOSS_ABORT_S` while a worker is still streaming + ``heartbeat`` events — i.e. this consumer has lost its connection to + Mongo and can no longer keep the row fresh. Raising this aborts the + attempt and ``aclose``s the stream, cancelling the remote worker so it + stops computing. + + This is a *best-effort optimization*, NOT the load-bearing + double-dispatch guard. A ``heartbeat`` event is only injected after a + quiet ``heartbeat_interval`` of stage silence, so a stage that streams + ``stage_complete`` / ``progress`` faster than that interval never trips + the check — meaning a Mongo-blind consumer is not guaranteed to abort + before the orphan sweep revives its row. What actually keeps a recovered + re-dispatch from corrupting a still-live attempt is the per-attempt + workdir nonce (``_attempt_dispatch``) plus content-addressed, atomically + committed cache artifacts. The abort just spares the wasted second run + when it can; do not remove the per-attempt workdir on the assumption + this covers the race. + """ + + +def _utcnow() -> datetime: + return datetime.now(timezone.utc) + + +def _next_dispatch_time(now: datetime) -> datetime: + """Compute the next dispatch attempt time for an overflowed job. + + ``now + retry_interval + jitter``. The jitter spreads a queued backlog + across ticks; it is bounded by ``_RETRY_JITTER_MAX_SECONDS`` (itself + clamped to the retry interval), so a test that monkeypatches + ``_RETRY_INTERVAL_SECONDS`` low gets correspondingly small, bounded + jitter — no test zeroes the constant. Always strictly after ``now`` + (retry interval is >= 1s), which ``lease_due_dispatch`` requires so a + leased row moves out of the due-window before its attempt resolves. + """ + jitter = ( + random.uniform(0, _RETRY_JITTER_MAX_SECONDS) + if _RETRY_JITTER_MAX_SECONDS > 0 + else 0.0 + ) + return now + timedelta(seconds=_RETRY_INTERVAL_SECONDS + jitter) + + class JobExecutor(ABC): """Abstract interface for workflow dispatch + job tracking.""" @@ -185,6 +305,7 @@ async def _run_processor_routine( file_meta: dict[str, Any], workdir_str: str, cache: CacheBackend, + heartbeat_interval: float, ) -> AsyncIterator[WorkflowEvent]: """Wool routine body: stream processor events back to the dispatcher. @@ -192,12 +313,30 @@ async def _run_processor_routine( :class:`~cfdb.workflows.events.StageComplete` and :class:`~cfdb.workflows.events.Complete` events. We wrap its stream with a heartbeat-aware iterator: while waiting on the next event from - the processor, every ``_HEARTBEAT_INTERVAL_S`` we inject a + the processor, every ``heartbeat_interval`` seconds we inject a :class:`~cfdb.workflows.events.Heartbeat` so the API process can refresh ``JobRecord.updated_at`` and signal liveness without requiring - the worker to touch Mongo. An exception from the processor is - converted to an :class:`~cfdb.workflows.events.Error` event so the API - consumer can record a clean terminal state. + the worker to touch Mongo. ``heartbeat_interval`` is passed explicitly + (resolved API-side at dispatch) rather than read from the worker's + module globals, because cloudpickle ships this routine by value and the + worker's own ``_HEARTBEAT_INTERVAL_S`` would otherwise be ignored. An + exception from the processor is converted to an + :class:`~cfdb.workflows.events.Error` event so the API consumer can + record a clean terminal state. + + The routine yields an immediate :class:`~cfdb.workflows.events.Heartbeat` + as its very first event — the instant a worker *accepts* the dispatch, + before the (potentially slow) upstream download. This is the dispatcher's + "worker accepted" signal: ``_attempt_dispatch`` marks the job RUNNING on + the first event, so a job leaves the leasable PENDING set the moment a + worker takes it rather than only after the first processor event lands. + Without this, a long download (which happens before the processor's + first event) keeps the job PENDING, and the durable scheduler would + re-lease it and double-dispatch onto the same workdir once the retry + interval elapsed. A rejected dispatch (worker backpressure) raises + ``NoWorkersAvailable`` before the routine body runs, so this leading + heartbeat fires only on acceptance — cleanly distinguishing accepted + from overflowed. Arguments are cloudpickle-serializable: ``processor`` is a stateless class instance, ``file_meta`` is a plain dict (B1 strips Mongo's @@ -213,20 +352,25 @@ class instance, ``file_meta`` is a plain dict (B1 strips Mongo's """ workdir = Path(workdir_str) workdir.mkdir(parents=True, exist_ok=True) + # "Worker accepted" signal — see the docstring. Emitted before the + # processor runs so the dispatcher marks the job RUNNING the instant a + # worker takes it (before the download), closing the re-lease / + # double-dispatch window. A rejected dispatch never reaches here. + yield Heartbeat() inner = processor.run(file_meta, workdir, cache).__aiter__() next_task: Optional[asyncio.Task] = None try: while True: next_task = asyncio.ensure_future(inner.__anext__()) # Loop on a per-event wait_for so a heartbeat injects every - # _HEARTBEAT_INTERVAL_S of stage silence; shield the task so + # heartbeat_interval of stage silence; shield the task so # a timeout doesn't propagate cancellation into the inner # generator mid-stage. while True: try: await asyncio.wait_for( asyncio.shield(next_task), - timeout=_HEARTBEAT_INTERVAL_S, + timeout=heartbeat_interval, ) break except asyncio.TimeoutError: @@ -271,18 +415,19 @@ class instance, ``file_meta`` is a plain dict (B1 strips Mongo's class WoolExecutor(JobExecutor): """Executor backed by a ``wool.WorkerPool`` and a Mongo jobs collection. - When configured with an :class:`EcsProvisioner`, the executor also - issues a ``RunTask`` on each fresh claim before opening the routine - stream, so a Fargate worker boots before the dispatch retry loop - begins polling for it. Without a provisioner (PoC dev profile) the - executor relies on workers already published into the discovery - namespace. + Dispatch tries existing workers first via the priority load balancer. + Only when an attempt overflows (no worker accepts) does the executor — + when configured with an :class:`EcsProvisioner` — issue a best-effort + ``RunTask`` to scale the fleet up, then leave the job queued for a later + retry; it never spawns a worker pre-dispatch. Without a provisioner + (PoC dev profile) the executor relies on workers already published into + the discovery namespace. - A :class:`RetryableProvisionerError` from the provisioner surfaces - as a terminal ``FAILED`` job with a ``capacity:``-prefixed error - string; any other provisioner failure surfaces with a - ``provisioner:`` prefix. Clients parse the prefix to decide - whether to resubmit. + A job that cannot find capacity before its dispatch deadline is failed + with a ``capacity:``-prefixed error string, which clients parse to + decide whether to resubmit. Provisioner failures on the overflow path + are best-effort and swallowed (the job simply retries on the next tick), + so they never surface as a terminal job state. Args: db: Motor database handle holding the ``jobs`` collection. @@ -291,7 +436,7 @@ class WoolExecutor(JobExecutor): processors persist artifacts to the deployment's real store (local FS or S3), not a worker-local filesystem. registry: Processor registry mapping artifact kinds to runners. - workdir_root: Parent directory under which per-job workdirs land. + workdir_root: Parent directory under which per-attempt workdirs land. pipeline_version: Embedded in workflow keys; bump to invalidate every in-flight workflow. provisioner: Optional :class:`EcsProvisioner`. When ``None`` @@ -322,11 +467,20 @@ def __init__( self._provisioner = provisioner self._pending_tasks: set[asyncio.Task] = set() #: Finalize tasks (release_workflow + workdir cleanup) created by - #: _run_workflow's `finally`. Tracked separately so drain() can - #: await them after the main _pending_tasks gather, ensuring they - #: complete before the lifespan teardown closes the Motor client. + #: _consume_and_finalize's `finally`. Tracked separately so drain() + #: can await them after the main _pending_tasks gather, ensuring + #: they complete before the lifespan teardown closes the Motor + #: client. self._finalize_tasks: set[asyncio.Task] = set() self._draining = False + #: The durable retry scheduler — re-attempts dispatch for jobs that + #: overflowed (no worker capacity) and are awaiting a free worker. + #: Fresh claims dispatch inline via ``ensure_workflow``; the + #: scheduler only handles the retry path. Started by + #: ``start_scheduler`` inside the lifespan's wool-pool context (so + #: it inherits wool's dispatch contextvars) and cancelled by + #: ``drain`` before the pool closes. + self._scheduler_task: Optional[asyncio.Task] = None async def ensure_workflow( self, file_meta: dict[str, Any] @@ -345,6 +499,22 @@ async def ensure_workflow( "No processor registered for this file, or no work required" ) + # Admission ceiling: shed new work once the active backlog + # (pending + running) hits the cap, so an unauthenticated flood on + # /data and /index can't queue unbounded jobs in Mongo. Checked + # AFTER applicability (don't 429 a file that needs no workflow) and + # BEFORE claiming the mutex. Soft cap — a count-then-claim race may + # transiently overshoot, acceptable for a flood guard; the + # per-source mutex still dedups same-file requests below the cap. + # NOTE: because this precedes the claim, at the ceiling a re-GET for + # a file whose workflow is already active is also shed with 429 + # (rather than attaching to the in-flight job); the client retries. + # This is the deliberate trade for shedding before an unbounded + # count-then-insert race window. + active = await count_active_workflows(self._db) + if active >= _MAX_ACTIVE_WORKFLOWS: + raise AdmissionRejected(active=active, ceiling=_MAX_ACTIVE_WORKFLOWS) + dcc, local_id, md5 = extract_identity(file_meta) wf_key = key_utils.workflow_key( dcc=dcc, @@ -363,15 +533,163 @@ async def ensure_workflow( ) if fresh: - task = asyncio.create_task( - self._run_workflow(record, processor, file_meta) - ) - self._pending_tasks.add(task) - task.add_done_callback(self._pending_tasks.discard) - task.add_done_callback(_log_unexpected_exception) + # Dispatch the first attempt inline (fire-and-forget), so a + # fresh claim runs immediately rather than waiting out a + # scheduler tick. The job's ``next_dispatch_at`` stays None + # until this attempt overflows, so the durable scheduler — which + # only leases jobs with a due ``next_dispatch_at`` — cannot + # race this inline attempt. Retries (after an overflow sets + # ``next_dispatch_at``) are driven by the scheduler. + self._spawn_attempt(record, processor, file_meta) return record, fresh + def _spawn_attempt( + self, + record: JobRecord, + processor: Processor, + file_meta: dict[str, Any], + ) -> None: + """Spawn a tracked background task running one dispatch attempt.""" + task = asyncio.create_task( + self._attempt_dispatch(record, processor, file_meta) + ) + self._pending_tasks.add(task) + task.add_done_callback(self._pending_tasks.discard) + task.add_done_callback(_log_unexpected_exception) + + def start_scheduler(self) -> None: + """Start the durable retry scheduler as a background task. + + MUST be called from inside the lifespan's ``wool.WorkerPool`` + context so the created task inherits wool's dispatch contextvars + (the same mechanism request-spawned tasks rely on via + ``attach_wool_context``); otherwise ``@wool.routine`` dispatch from + the scheduler would fail with no pool in context. Idempotent. + """ + if self._scheduler_task is not None: + return + self._scheduler_task = asyncio.create_task(self._scheduler_loop()) + self._scheduler_task.add_done_callback(_log_unexpected_exception) + + async def _scheduler_loop(self) -> None: + """Re-attempt dispatch for queued jobs until drain. + + Fresh claims dispatch inline (``ensure_workflow``); this loop drives + the *retry* path. Each tick leases every currently-due job — one + whose inline (or prior) attempt overflowed and set a + ``next_dispatch_at`` now in the past — and spawns a fresh dispatch + attempt for it, then sleeps one retry interval. A per-tick exception + is logged and swallowed so the scheduler never dies; cancellation + (from ``drain``) breaks the loop. DB-backed, so a freshly-started + replica resumes whatever queue the previous process left behind. + """ + while not self._draining: + try: + await self._recover_orphans() + await self._drain_due_jobs() + except asyncio.CancelledError: + raise + except Exception: + logger.exception("Scheduler dispatch tick failed; continuing") + await asyncio.sleep(_RETRY_INTERVAL_SECONDS) + + async def _recover_orphans(self) -> None: + """Re-queue jobs orphaned by an API/worker crash for re-dispatch. + + Runs at the top of every scheduler tick — including the first, on + boot — so a process that died mid-flight does not strand jobs the + normal lease query cannot pick up (``RUNNING`` rows whose consumer + died, and fresh ``PENDING`` claims that never rescheduled). Recovery + is then autonomous: it no longer depends on a client re-requesting + the same file to trigger ``claim_workflow``'s stale-reclaim. Gated + on the stale threshold so healthy heartbeating jobs are untouched. + + Note the recovery promise is bounded by the same dispatch deadline + as a fresh job: ``requeue_orphaned_dispatch`` preserves the original + ``submitted_at``, and ``_attempt_dispatch`` measures the deadline + from it, so an orphan older than ``CFDB_WORKFLOW_DISPATCH_DEADLINE_S`` + is failed ``capacity:`` on its first recovery attempt rather than + resumed (its committed cache artifacts survive for a later fresh + ``GET`` to reuse). Recovery is best-effort, not unbounded. + """ + now = _utcnow() + requeued = await requeue_orphaned_dispatch( + self._db, now=now, stale_before=now - STALE_WORKFLOW_THRESHOLD + ) + if requeued: + logger.info( + "Re-queued %d orphaned workflow(s) for re-dispatch after a " + "restart or worker crash", + requeued, + ) + + async def _drain_due_jobs(self) -> None: + """Lease currently-due queued jobs and spawn their retry attempts. + + ``lease_due_dispatch`` atomically claims one due job and pushes its + ``next_dispatch_at`` forward, so concurrent ticks (or replicas) + can't double-lease and a crashed attempt is still retried. We loop + until nothing is due (or the per-tick cap is hit), dispatching each + leased job as a tracked background task so a winning attempt's long + stream-consume doesn't block the scheduler from filling the pool. + + The per-tick cap (``_MAX_DISPATCHES_PER_TICK``) bounds the dispatch + fan-out: after a large backlog (up to ``WORKFLOW_MAX_ACTIVE``) a + single tick would otherwise spawn thousands of concurrent attempts, + each opening a wool stream that immediately overflows. The remainder + is leased on subsequent ticks; the deadline/retry machinery tolerates + the delay. + """ + dispatched = 0 + while not self._draining and dispatched < _MAX_DISPATCHES_PER_TICK: + now = _utcnow() + leased = await lease_due_dispatch( + self._db, now=now, next_at=_next_dispatch_time(now) + ) + if leased is None: + break + file_meta = leased.file_meta_snapshot + if file_meta is None: + logger.error( + "Leased job %s has no file_meta_snapshot — failing", + leased.job_id, + ) + await release_workflow( + self._db, + leased.job_id, + JobStatus.FAILED, + error="internal: leased job missing file_meta_snapshot", + ) + continue + processor = self._registry.lookup_for(file_meta) + if processor is None: + logger.error( + "No processor for leased job %s — failing", leased.job_id + ) + await release_workflow( + self._db, + leased.job_id, + JobStatus.FAILED, + error="internal: no processor for leased job", + ) + continue + self._spawn_attempt(leased, processor, file_meta) + dispatched += 1 + + # Leading operability signal: a saturated queue is otherwise silent + # until jobs start failing ``capacity:`` at the dispatch deadline. + # Emitted at most once per tick (cadence ``_RETRY_INTERVAL_SECONDS``), + # so it does not flood the log even under a sustained backlog. + if dispatched: + logger.info( + "Scheduler tick dispatched %d queued workflow(s)%s", + dispatched, + " (per-tick cap hit; remainder deferred to the next tick)" + if dispatched >= _MAX_DISPATCHES_PER_TICK + else "", + ) + async def drain(self, *, timeout: float) -> int: """Wait for in-flight workflow tasks to complete. @@ -391,6 +709,14 @@ async def drain(self, *, timeout: float) -> int: immediately. """ self._draining = True + # Stop the dispatch driver first so no new attempt tasks are spawned + # into the set we're about to snapshot, and so it stops dispatching + # into a pool that's about to close. + if self._scheduler_task is not None: + self._scheduler_task.cancel() + with contextlib.suppress(asyncio.CancelledError, Exception): + await self._scheduler_task + self._scheduler_task = None pending = list(self._pending_tasks) if not pending: return 0 @@ -405,7 +731,7 @@ async def drain(self, *, timeout: float) -> int: # against a closed Motor client surfaces as a noisy # exception in the lifespan exit, and the wool stream they # hold blocks pool aclose. Cancel them explicitly and - # await the cancellation so _run_workflow's shielded + # await the cancellation so _consume_and_finalize's shielded # _finalize gets its 2s best-effort grace before we let # the lifespan continue to teardown. for task in pending: @@ -429,14 +755,172 @@ async def drain(self, *, timeout: float) -> int: ) return len(pending) - async def _run_workflow( + async def _attempt_dispatch( self, record: JobRecord, processor: Processor, file_meta: dict[str, Any], ) -> None: - """Background coroutine: mark running, consume the routine's event - stream, and release a terminal status. + """Run one dispatch attempt for a leased PENDING job. + + Offers the task to the existing worker pool **once** via the + priority load balancer. On success the job is marked RUNNING (only + now that a worker has accepted — a queued job stays PENDING until + then) and its event stream is consumed to completion. On overflow + (every worker rejected, or none exist) the job stays PENDING, a + best-effort bounded worker spawn is requested, and it is rescheduled + for a later tick — unless it has blown its dispatch deadline, in + which case it is failed ``capacity:``. This inverts the old + unconditional pre-dispatch spawn: existing capacity is tried first + and scale-up happens only on overflow. + """ + # Per-attempt workdir (unique nonce), not per-job. If the orphan + # sweep ever re-dispatches a job whose previous attempt is somehow + # still alive (e.g. event-loop starvation that defeated the + # heartbeat-loss abort), the two attempts use disjoint scratch dirs + # and cannot corrupt each other's downloads / sort temp / cleanup. + # Partial-commit recovery is unaffected — it keys off the cache, not + # the workdir. + workdir = self._workdir_root / f"{record.job_id}-{uuid.uuid4().hex}" + + # Deadline: a job that has waited for capacity longer than the + # dispatch deadline (measured from ``submitted_at``) is failed rather + # than retried forever. ``capacity:`` is the stable on-wire prefix + # clients parse to decide whether to resubmit. Note the clock runs + # from original submission and is NOT reset on orphan recovery, so a + # job recovered after its deadline is failed here rather than resumed + # — recovery is best-effort, bounded by the same deadline. + waited = (_utcnow() - record.submitted_at).total_seconds() + if waited > _DISPATCH_DEADLINE_SECONDS: + logger.warning( + "Dispatch deadline exceeded for %s after %.0fs; failing", + record.job_id, + waited, + ) + await self._finalize( + workdir, + record.job_id, + JobStatus.FAILED, + f"{_ERR_PREFIX_CAPACITY} dispatch deadline exceeded", + ) + return + + try: + stream = await self._open_stream_once(processor, file_meta, workdir) + except asyncio.CancelledError: + # Cancelled (e.g. lifespan drain) before a worker accepted. + # Finalize FAILED so the row doesn't dangle PENDING until + # stale-reclaim; the client resubmits. Best-effort, then + # propagate the cancellation. + with contextlib.suppress(Exception): + await self._finalize( + workdir, + record.job_id, + JobStatus.FAILED, + "Workflow cancelled (worker shutdown)", + ) + raise + except wool.NoWorkersAvailable: + # No capacity this pass — keep the job PENDING, request a + # bounded scale-up (ECS only), and reschedule for a later tick. + await self._handle_overflow(record) + return + except Exception as exc: + logger.exception("Failed to open routine stream for %s", record.job_id) + await self._finalize( + workdir, record.job_id, JobStatus.FAILED, str(exc) + ) + return + + # A worker accepted the task. Claim RUNNING now — after the stream + # opened — so a job only leaves PENDING once it is genuinely + # running. ``mark_running`` is fenced on PENDING; a RuntimeError + # means a racing claimant (or stale-reclaim) already owns the row, + # so hand off: close the stream (cancelling the remote routine) and + # bail without duplicate work. + try: + await mark_running(self._db, record.job_id) + except asyncio.CancelledError: + with contextlib.suppress(BaseException): + await stream.aclose() + raise + except RuntimeError: + logger.info( + "mark_running rejected for %s (row no longer PENDING); " + "closing stream and handing off", + record.job_id, + ) + with contextlib.suppress(BaseException): + await stream.aclose() + # The successor owns the row, so do NOT release the mutex here — + # but this attempt's workdir was already created by the routine, + # so clean it up to avoid leaking scratch on the hand-off (A1). + with contextlib.suppress(Exception): + await asyncio.to_thread( + shutil.rmtree, str(workdir), ignore_errors=True + ) + return + except Exception as exc: + logger.exception("Failed to mark job %s as running", record.job_id) + with contextlib.suppress(BaseException): + await stream.aclose() + await self._finalize( + workdir, + record.job_id, + JobStatus.FAILED, + f"mark_running failed: {exc}", + ) + return + + await self._consume_and_finalize(record, stream, workdir) + + async def _handle_overflow(self, record: JobRecord) -> None: + """Handle a dispatch attempt that found no worker capacity. + + Requests one best-effort worker spawn through the provisioner (ECS + profile only; a no-op in the local LAN profile, where the pool is + fixed) and reschedules the job's next dispatch attempt. The spawn is + strictly best-effort — existing workers may free up before the next + tick — so any provisioner error is logged and swallowed rather than + failing the job; the dispatch deadline bounds the retry loop. + """ + if self._provisioner is not None: + try: + await self._provisioner.request(dedup_key=record.workflow_key) + except asyncio.CancelledError: + raise + except RetryableProvisionerError as exc: + logger.warning( + "Provisioner retryable capacity error for %s; will retry " + "on the next tick: %s", + record.job_id, + exc, + ) + except Exception: + logger.exception( + "Provisioner request failed for %s; will retry on the " + "next tick", + record.job_id, + ) + # Leading operability signal that the pool is over capacity: the job + # found no worker and is being deferred to a later tick. Logged at + # info so a steadily-overflowing fleet is visible before jobs start + # failing ``capacity:`` at the dispatch deadline. + logger.info( + "No worker capacity for %s; rescheduling (pool overflow)", + record.job_id, + ) + await reschedule_dispatch( + self._db, record.job_id, next_at=_next_dispatch_time(_utcnow()) + ) + + async def _consume_and_finalize( + self, + record: JobRecord, + stream: AsyncIterator[Any], + workdir: Path, + ) -> None: + """Consume a running workflow's event stream and release it terminal. The routine emits ``heartbeat``, ``stage_complete``, ``complete``, and ``error`` events as an async generator over wool's gRPC stream. @@ -446,102 +930,25 @@ async def _run_workflow( ``stage_complete`` arrives, so /jobs/{id} reflects partial progress in real time rather than all-at-once at the end. - A ``try/finally`` ensures the job always reaches a terminal - status and the per-job workdir is cleaned up, even when - ``mark_running`` raises or the task is cancelled during - shutdown. ``CancelledError`` propagates after the terminal write. + A ``try/finally`` ensures the job always reaches a terminal status + and the per-job workdir is cleaned up, even when the task is + cancelled during shutdown. ``CancelledError`` propagates after the + terminal write. The caller has already marked the job RUNNING and + opened ``stream``. """ - workdir = self._workdir_root / record.job_id - final_status: JobStatus = JobStatus.FAILED final_error: Optional[str] = None + # Time of the last successful heartbeat write. Seeded now because + # the caller just marked the job RUNNING (the row is fresh). If + # heartbeat writes then fail for longer than _HEARTBEAT_LOSS_ABORT_S + # we raise HeartbeatLost and abort, so a Mongo-blind consumer stops + # computing before the orphan sweep revives its row. Best-effort: it + # only re-evaluates on a Heartbeat event (i.e. during quiet stages), + # so the per-attempt workdir — not this abort — is the actual + # corruption guard (see HeartbeatLost). + last_heartbeat_ok = _utcnow() try: - # mark_running lives inside the try so a Mongo write failure - # here still routes through the finally — the row gets - # released to FAILED rather than dangling at PENDING for - # the full stale-reclaim window. - try: - await mark_running(self._db, record.job_id) - except Exception as exc: - logger.exception("Failed to mark job %s as running", record.job_id) - final_error = f"mark_running failed: {exc}" - return - - # Request a worker via the external provisioner (e.g. ECS - # ``RunTask``) before opening the routine stream. The - # provisioner dedup-keys on the workflow mutex so two - # concurrent fresh claims for the same source file - # share one ``RunTask`` and one worker. - # - # ``RetryableProvisionerError`` is treated as a best-effort - # scale-up failure, not a workflow-level failure: the Wool - # pool routes ``@wool.routine`` calls to *any* available - # worker via the discovery namespace, so an existing idle - # worker can still service this job. Log a warning and fall - # through to ``_open_stream_with_retry``; the - # ``capacity:``-prefixed terminal status is reserved for the - # case where dispatch *also* exhausts its budget with no - # workers available. - if self._provisioner is not None: - try: - await self._provisioner.request(dedup_key=record.workflow_key) - except asyncio.CancelledError: - final_error = "Workflow cancelled (worker shutdown)" - raise - except RetryableProvisionerError as exc: - logger.warning( - "Provisioner reported retryable capacity error for %s; " - "falling through to dispatch against existing workers: %s", - record.job_id, - exc, - ) - except Exception as exc: - logger.exception( - "Provisioner request failed for %s", record.job_id - ) - final_error = f"{_ERR_PREFIX_PROVISIONER} {exc}" - return - # Bound the stale-reclaim window across the provisioner - # round-trip: ``mark_running`` ran before - # ``provisioner.request`` and a Fargate cold start can - # plausibly exceed STALE_WORKFLOW_THRESHOLD. Heartbeat - # so the row stays fresh through the upcoming dispatch - # wait. A failed heartbeat is logged but does not - # abort the workflow — the next ``heartbeat`` event in - # the stream loop will retry. - try: - await heartbeat_workflow(self._db, record.job_id) - except Exception: - logger.exception( - "Post-provisioner heartbeat failed for %s; continuing", - record.job_id, - ) - - try: - stream = await self._open_stream_with_retry( - processor, file_meta, workdir - ) - except asyncio.CancelledError: - final_error = "Workflow cancelled (worker shutdown)" - raise - except wool.NoWorkersAvailable as exc: - # Dispatch budget exhausted with no workers reachable. - # ``capacity:`` is the stable on-wire prefix clients - # parse to decide whether to resubmit the job — its - # origin shifted from the provisioner-failure branch - # (which now falls through) to here, but the wire - # contract is unchanged. - logger.warning( - "Dispatch budget exhausted for %s with no workers available", - record.job_id, - ) - final_error = f"{_ERR_PREFIX_CAPACITY} {exc}" - return - except Exception as exc: - logger.exception("Failed to open routine stream for %s", record.job_id) - final_error = str(exc) - return try: async with asyncio.timeout(_WORKFLOW_DURATION_CAP_SECONDS): @@ -549,7 +956,21 @@ async def _run_workflow( if isinstance(event, Heartbeat): try: await heartbeat_workflow(self._db, record.job_id) + last_heartbeat_ok = _utcnow() except Exception: + stalled = ( + _utcnow() - last_heartbeat_ok + ).total_seconds() + if stalled > _HEARTBEAT_LOSS_ABORT_S: + # Lost Mongo long enough that the orphan + # sweep will treat this row as dead; abort + # so recovery is spared a redundant second + # run of this still-live attempt. + raise HeartbeatLost( + f"no successful heartbeat for " + f"{stalled:.0f}s " + f"(> {_HEARTBEAT_LOSS_ABORT_S:.0f}s)" + ) logger.exception( "Heartbeat failed for %s; continuing", record.job_id, @@ -622,6 +1043,18 @@ async def _run_workflow( if final_status == JobStatus.FAILED: final_error = "Workflow cancelled (worker shutdown)" raise + except HeartbeatLost as exc: + # Expected abort, not a crash: this consumer lost Mongo and + # is giving up the job so the orphan sweep can recover it + # without a concurrent live attempt. The terminal write + # below will likely no-op (Mongo is unreachable for us); the + # inner finally still aclose()s the stream, cancelling the + # remote worker so it stops computing. + logger.warning( + "Aborting workflow %s: %s", record.job_id, exc + ) + final_status = JobStatus.FAILED + final_error = f"heartbeat lost: {exc}" except Exception as exc: logger.exception( "Workflow %s failed during stream consumption", @@ -686,74 +1119,44 @@ async def _finalize( except Exception: logger.exception("Unable to clean workdir %s", workdir) - async def _open_stream_with_retry( + async def _open_stream_once( self, processor: Processor, file_meta: dict[str, Any], workdir: Path, ) -> AsyncIterator[dict[str, Any]]: - """Open the routine's event stream, retrying on ``NoWorkersAvailable``. - - The API leases workers from a pool that's provisioned externally - (ECS in production); cold starts can take ~30-90s. Until wool - exposes a quorum-style readiness gate we poll the pool with - ``NoWorkersAvailable`` retries on each attempt to open the - stream. ``_DISPATCH_WAIT_SECONDS`` caps the total wait so a - stuck scale-up surfaces as a workflow failure rather than - hanging until the much-larger ``_WORKFLOW_DURATION_CAP_SECONDS`` - cap fires. - - Returns an async iterator that yields the first event followed - by the rest of the routine's stream. The retry-on-capacity logic - only applies to the first ``__anext__`` call (which is where - wool surfaces ``NoWorkersAvailable``); once events start - flowing, subsequent ``__anext__`` calls just wait on the worker. + """Open the routine's event stream in a single dispatch attempt. + + Constructs the ``@wool.routine`` stream and pulls its first event, + which is where wool surfaces ``wool.NoWorkersAvailable`` when the + priority load balancer found no worker willing to accept the task. + That exception propagates to ``_attempt_dispatch``, which reschedules + the job — the durable retry scheduler is now the retry mechanism, so + there is no in-attempt polling loop (a single reschedule tick + covers an ECS cold start). On success returns an async iterator + yielding the first event followed by the rest of the stream. """ - loop = asyncio.get_running_loop() - deadline = loop.time() + _DISPATCH_WAIT_SECONDS - attempt = 0 - last_exc: wool.NoWorkersAvailable | None = None - while True: - stream = _run_processor_routine( - processor, - file_meta, - str(workdir), - self._cache, - ) - # ``returning`` flips True only after we've handed the - # stream off to ``_prepend_first_event``. Any other exit - # from this iteration (NoWorkersAvailable, CancelledError - # from __anext__ or the inter-iteration sleep, unexpected - # exception) leaves the freshly-constructed stream - # unawaited; the finally aclose()s it so wool's pool can - # release the lease cleanly on drain. - returning = False - try: - try: - first_event = await stream.__anext__() - except wool.NoWorkersAvailable as exc: - last_exc = exc - attempt += 1 - if loop.time() >= deadline: - break - if attempt >= 2: - logger.warning( - "No workers available — retrying dispatch " - "(attempt %d, %.1fs of %.1fs budget remaining)", - attempt, - deadline - loop.time(), - _DISPATCH_WAIT_SECONDS, - ) - await asyncio.sleep(_DISPATCH_RETRY_INTERVAL_SECONDS) - continue - returning = True - return _prepend_first_event(first_event, stream) - finally: - if not returning: - with contextlib.suppress(BaseException): - await stream.aclose() - assert last_exc is not None - raise last_exc + stream = _run_processor_routine( + processor, + file_meta, + str(workdir), + self._cache, + _HEARTBEAT_INTERVAL_S, + ) + # ``returning`` flips True only after we've handed the stream off to + # ``_prepend_first_event``. Any other exit (NoWorkersAvailable, a + # CancelledError from __anext__, an unexpected exception) leaves the + # freshly-constructed stream unawaited; the finally aclose()s it so + # wool's pool can release the lease cleanly. + returning = False + try: + first_event = await stream.__anext__() + returning = True + return _prepend_first_event(first_event, stream) + finally: + if not returning: + with contextlib.suppress(BaseException): + await stream.aclose() async def _prepend_first_event( @@ -761,10 +1164,10 @@ async def _prepend_first_event( ) -> AsyncIterator[dict[str, Any]]: """Yield ``first`` then the rest of ``stream``. - Used by ``_open_stream_with_retry`` to consume the initial - ``__anext__`` (where wool surfaces ``NoWorkersAvailable``) inside - the retry loop, then expose the remaining events to the workflow - consumer through a single uniform async-iterator interface. + Used by ``_open_stream_once`` to consume the initial ``__anext__`` + (where wool surfaces ``NoWorkersAvailable``), then expose the remaining + events to the workflow consumer through a single uniform async-iterator + interface. """ try: yield first @@ -778,12 +1181,12 @@ async def _prepend_first_event( def _log_unexpected_exception(task: asyncio.Task) -> None: """Done-callback that surfaces an unexpected exception via the logger. - ``_run_workflow`` is supposed to catch every failure and route it to - a terminal release; if anything escapes that contract (e.g. a future - refactor leaves a bare ``raise`` outside the try), asyncio's - default behavior is to log the unretrieved exception only at - interpreter shutdown. Surfacing it through the project logger - immediately makes the regression visible. + The scheduler loop and ``_attempt_dispatch`` are supposed to catch + every failure and route it to a terminal release or a reschedule; if + anything escapes that contract (e.g. a future refactor leaves a bare + ``raise`` outside the try), asyncio's default behavior is to log the + unretrieved exception only at interpreter shutdown. Surfacing it + through the project logger immediately makes the regression visible. """ if task.cancelled(): return diff --git a/src/cfdb/workflows/loadbalancer.py b/src/cfdb/workflows/loadbalancer.py new file mode 100644 index 0000000..bacbd8c --- /dev/null +++ b/src/cfdb/workflows/loadbalancer.py @@ -0,0 +1,113 @@ +"""Priority (leaky-bucket) load balancer for the wool worker pool (issue #45). + +Unlike wool's round-robin balancer, this one always offers a task to the +discovered workers in the same stable order (sorted by +``WorkerMetadata.uid``) on every dispatch. The order is arbitrary but +reproducible — ``uid`` is a per-worker UUID, not a seniority or cost +ranking; what matters is only that the same workers are offered work first +on each dispatch. Combined with per-worker backpressure (one task per +worker), load concentrates on the lowest-ordered workers and +higher-ordered workers drain to idle — so an over-provisioned fleet sheds +idle capacity (workers self-terminate on their max-lifetime) instead of +every worker carrying a thin, perpetual slice of traffic. This is the +"leaky bucket": fill the priority workers first and let the overflow spill +to the next. + +It honors wool's load-balancer worker-health contract (see +:class:`wool.LoadBalancerLike`): rotate to the next worker on +``TransientRpcError`` — which includes a backpressure ``RESOURCE_EXHAUSTED`` +rejection — evict the worker on a non-transient ``RpcError``, and let any +other exception propagate untouched. When no worker accepts the task (the +pool is empty, or every worker rejected or was evicted in one pass) it +raises ``NoWorkersAvailable``; the executor treats that as the signal to +add a worker and re-queue the job. + +Stateless by design: the stable order is recomputed per call, so there is +no per-context index, no lock, and nothing that resists pickling. +""" + +from __future__ import annotations + +import logging +from typing import TYPE_CHECKING +from typing import Any +from typing import AsyncIterator + +from wool import LoadBalancerContextLike +from wool import LoadBalancerLike +from wool import NoWorkersAvailable +from wool import RpcError +from wool import TransientRpcError + +if TYPE_CHECKING: + from wool.runtime.routine.task import Task + +logger = logging.getLogger(__name__) + + +class PriorityLoadBalancer(LoadBalancerLike): + """Offer each task to workers in a stable priority order. + + See the module docstring for the leaky-bucket rationale and the + worker-health contract this honors. + """ + + def __reduce__(self): + # Stateless — reconstruct via the bare constructor so the balancer + # is trivially picklable (mirrors RoundRobinLoadBalancer, which + # uses __reduce__ to shed its lock/index state). + return (self.__class__, ()) + + async def dispatch( + self, + task: Task, + *, + context: LoadBalancerContextLike, + timeout: float | None = None, + ) -> AsyncIterator[Any]: + """Dispatch *task* to the first worker, in priority order, that accepts. + + :param task: + The :class:`wool.Task` to dispatch. + :param context: + The :class:`wool.LoadBalancerContextLike` whose ``workers`` map + names the candidates and through which ``remove_worker`` evicts + unhealthy peers. + :param timeout: + Per-attempt timeout in seconds against a single worker. + :returns: + The worker's response stream (an async iterator). + :raises NoWorkersAvailable: + When the pool is empty or no worker accepts the task across one + stable-order pass (all rejected transiently or were evicted). + """ + # Snapshot a stable, deterministic order up front. Sorting by uid + # makes "priority" reproducible across dispatches, so the same + # low-order workers stay saturated. ``str(uid)`` guarantees an + # orderable key regardless of the uid's concrete type. Iterating a + # snapshot keeps the pass well-defined even when ``remove_worker`` + # mutates the live context mid-loop on eviction. + candidates = sorted( + context.workers.items(), key=lambda item: str(item[0].uid) + ) + + for metadata, connection in candidates: + try: + stream = await connection.dispatch(task, timeout=timeout) + except TransientRpcError as exc: + logger.debug( + "Skipping worker %s on transient error: %s", metadata.uid, exc + ) + continue + except RpcError as exc: + logger.warning( + "Evicting worker %s after non-transient RPC error: %s", + metadata.uid, + exc, + ) + context.remove_worker(metadata) + continue + else: + return stream + + raise NoWorkersAvailable("No healthy workers available for dispatch") diff --git a/src/cfdb/workflows/lock.py b/src/cfdb/workflows/lock.py index e5f1fef..05503f3 100644 --- a/src/cfdb/workflows/lock.py +++ b/src/cfdb/workflows/lock.py @@ -20,6 +20,7 @@ from datetime import datetime, timedelta, timezone from typing import Any, Optional +from pymongo import ReturnDocument from pymongo.errors import DuplicateKeyError from pymongo.write_concern import WriteConcern @@ -133,7 +134,7 @@ def _record_from_mongo(doc: dict[str, Any]) -> JobRecord: here so the round-trip survives. """ sanitized = {k: v for k, v in doc.items() if k != "_id"} - for field in ("submitted_at", "updated_at"): + for field in ("submitted_at", "updated_at", "next_dispatch_at"): value = sanitized.get(field) if isinstance(value, datetime) and value.tzinfo is None: sanitized[field] = value.replace(tzinfo=timezone.utc) @@ -190,6 +191,13 @@ async def claim_workflow( submitted_at=now, updated_at=now, file_meta_snapshot=file_meta_snapshot, + # ``next_dispatch_at`` is intentionally left unset (None) on a + # fresh claim. ``ensure_workflow`` dispatches the first attempt + # inline; only if that attempt overflows (no worker capacity) + # does it set ``next_dispatch_at``, handing the job to the + # durable retry scheduler. A None value is never "due", so the + # scheduler can't double-dispatch a job whose inline attempt is + # still in flight. ) # Attempt the insert first. If the partial-unique index admits it, @@ -471,3 +479,142 @@ async def heartbeat_workflow(db, job_id: str) -> None: }, {"$set": {"updated_at": _utcnow()}}, ) + + +# --- Bounded-concurrency admission + durable dispatch retry (issue #45) ------ + + +async def count_active_workflows(db) -> int: + """Count workflows currently holding the mutex (pending + running). + + Backs the admission ceiling in ``ensure_workflow``: once this reaches + ``CFDB_WORKFLOW_MAX_ACTIVE`` new requests are shed with 429. Soft by + nature — a count-then-insert race can briefly overshoot the cap, which + is acceptable for a flood guard. + + Counts on the canonical ``active`` boolean discriminator (the single + source of truth every other jobs read/write/index keys off), not the + derived ``status $in ACTIVE_STATUSES`` view, so the ceiling and the + mutex stay in lockstep even if the two ever drift. + """ + jobs = _jobs(db) + return await jobs.count_documents({"active": True}) + + +async def reschedule_dispatch(db, job_id: str, *, next_at: datetime) -> None: + """Defer a still-PENDING job's next dispatch attempt to ``next_at``. + + Called when a dispatch attempt finds no worker capacity: the job stays + PENDING and the durable scheduler re-attempts it at ``next_at``. + Fenced on PENDING so a job that has since gone RUNNING/terminal (a + racing attempt won the worker, or it was stale-reclaimed) is not + dragged back into the dispatch queue. Bumps ``dispatch_attempts`` for + observability. + """ + jobs = _jobs(db) + result = await jobs.update_one( + {"job_id": job_id, "status": JobStatus.PENDING.value}, + { + "$set": {"next_dispatch_at": next_at, "updated_at": _utcnow()}, + "$inc": {"dispatch_attempts": 1}, + }, + ) + if getattr(result, "matched_count", 0) == 0: + logger.debug( + "reschedule_dispatch no-op for job %s — row no longer PENDING " + "(won a worker, terminated, or was stale-reclaimed)", + job_id, + ) + + +async def lease_due_dispatch( + db, *, now: datetime, next_at: datetime +) -> Optional[JobRecord]: + """Atomically claim one PENDING job whose dispatch is due. + + Selects a PENDING job with ``next_dispatch_at <= now`` and, in the same + operation, pushes its ``next_dispatch_at`` forward to ``next_at`` — so a + concurrent scheduler tick (or a second API replica) cannot lease the + same job, and a crash mid-attempt still leaves the job scheduled for a + later retry. Returns the leased job, or ``None`` when nothing is due. + + The caller runs a dispatch attempt for the returned job; that attempt + either wins a worker (``mark_running``) or calls ``reschedule_dispatch`` + again on overflow. Jobs with ``next_dispatch_at`` unset (``None``) are + never due, so they are not leased here. + + Due jobs are leased oldest-first (``next_dispatch_at`` ascending) so + sustained overflow cannot starve the longest-waiting job toward its + deadline. ``next_at`` MUST be strictly in the future relative to + ``now``: the single-claim guarantee depends on the leased row's + ``next_dispatch_at`` moving out of the ``<= now`` window, so an + equal/past value would let a concurrent tick re-lease the same job + before the attempt resolves. + """ + if next_at <= now: + raise ValueError( + f"lease_due_dispatch requires next_at ({next_at}) strictly after " + f"now ({now}); an equal/past value would let a concurrent tick " + "immediately re-lease the same job" + ) + jobs = _jobs(db) + doc = await jobs.find_one_and_update( + { + "status": JobStatus.PENDING.value, + "next_dispatch_at": {"$lte": now}, + }, + {"$set": {"next_dispatch_at": next_at, "updated_at": now}}, + sort=[("next_dispatch_at", 1)], + return_document=ReturnDocument.AFTER, + ) + return _record_from_mongo(doc) if doc is not None else None + + +async def requeue_orphaned_dispatch( + db, *, now: datetime, stale_before: datetime +) -> int: + """Re-queue jobs orphaned by an API/worker crash so they re-dispatch. + + The durable scheduler only leases ``PENDING`` jobs whose + ``next_dispatch_at`` is set and due, so two states survive a restart + that it would never pick up on its own: + + - ``RUNNING`` rows whose API consumer died mid-stream (no heartbeat + since), and + - ``PENDING`` rows with ``next_dispatch_at`` unset — a fresh claim whose + inline first attempt never got to reschedule before the crash. + + Both are reset to ``PENDING`` with ``next_dispatch_at = now`` so the next + scheduler tick leases them. Gated on ``updated_at < stale_before`` (the + same staleness threshold :func:`claim_workflow` uses) so a healthy + in-flight job — which keeps ``updated_at`` fresh via heartbeats — is + never falsely reclaimed. Returns the number of rows requeued. + + Unlike ``claim_workflow``'s stale-reclaim, which fails the stale row so a + *new* claimant can supersede it, this revives the same row: there is no + new claimant, so the goal is to recover the in-flight work itself. + Partial-commit recovery means the re-dispatched job reuses any stages + already committed to cache. + """ + jobs = _jobs(db) + result = await jobs.update_many( + { + "active": True, + "updated_at": {"$lt": stale_before}, + "$or": [ + {"status": JobStatus.RUNNING.value}, + {"status": JobStatus.PENDING.value, "next_dispatch_at": None}, + ], + }, + { + "$set": { + "status": JobStatus.PENDING.value, + # PENDING is active; re-assert defensively so the + # discriminator can never drift from the status. + "active": True, + "next_dispatch_at": now, + "updated_at": now, + } + }, + ) + return getattr(result, "modified_count", 0) diff --git a/src/cfdb/workflows/models.py b/src/cfdb/workflows/models.py index b1fff7d..6156b25 100644 --- a/src/cfdb/workflows/models.py +++ b/src/cfdb/workflows/models.py @@ -73,7 +73,14 @@ class JobRecord(BaseModel): error: Populated on ``FAILED`` status with a human-readable reason. Capped at 1024 chars after path-scrubbing in ``lock.release_workflow`` so absolute filesystem paths from - tool stderr don't leak via ``/jobs/{id}``. + tool stderr don't leak via ``/jobs/{id}``. Some failures carry a + stable, client-parseable prefix: ``capacity:`` (the dispatch + deadline elapsed while the job waited for a worker — safe to + resubmit later), ``heartbeat lost:`` (the consumer lost its + Mongo connection and aborted so the orphan sweep can recover the + job), and ``internal:`` (a should-never-happen executor + invariant failure, e.g. a leased job missing its file_meta + snapshot or processor). superseded_by: When this row is stale-reclaimed by a fresh claim, the winner's ``job_id`` is recorded here so clients polling this (now-FAILED) job can follow the chain to the live one. @@ -81,6 +88,16 @@ class JobRecord(BaseModel): metadata as it was at claim time. Persisted on insert so a re-run / observer can reconstruct the dispatch context without re-reading the (possibly mutated) ``files`` document. + next_dispatch_at: When the durable retry scheduler should next + attempt to dispatch this (still-PENDING) job to a worker. + Unset (``None``) on claim and on a running/terminal job; set + and pushed forward by the retry scheduler each time a dispatch + attempt finds no capacity. Drives the scheduler's "due for + dispatch" query, which matches only rows whose + ``next_dispatch_at`` is set and in the past. + dispatch_attempts: How many times dispatch has been deferred for + lack of worker capacity. Observability only — the failure + deadline is measured from ``submitted_at``, not this count. """ model_config = ConfigDict( @@ -107,18 +124,24 @@ class JobRecord(BaseModel): error: Annotated[str, StringConstraints(max_length=1024)] | None = None superseded_by: Annotated[str, StringConstraints(max_length=64)] | None = None file_meta_snapshot: dict[str, Any] | None = None + next_dispatch_at: datetime | None = None + dispatch_attempts: int = Field(default=0, ge=0) - @field_validator("submitted_at", "updated_at") + @field_validator("submitted_at", "updated_at", "next_dispatch_at") @classmethod - def _require_aware_datetime(cls, value: datetime) -> datetime: + def _require_aware_datetime(cls, value: datetime | None) -> datetime | None: """Reject naive datetimes. All internal producers go through ``lock._utcnow()`` which returns an aware UTC datetime; a naive value would either be a refactor regression or a malformed document. Caught here so the stale - cutoff comparison in ``claim_workflow`` can never raise - ``TypeError`` on a mismatched-naivety operand. + cutoff comparison in ``claim_workflow`` (and the scheduler's + ``next_dispatch_at`` comparison) can never raise ``TypeError`` on a + mismatched-naivety operand. ``next_dispatch_at`` is nullable, so a + ``None`` passes through untouched. """ + if value is None: + return value if value.tzinfo is None: raise ValueError( "JobRecord datetimes must be timezone-aware " @@ -163,4 +186,6 @@ def to_mongo(self) -> dict[str, Any]: if self.file_meta_snapshot is not None else None ), + "next_dispatch_at": self.next_dispatch_at, + "dispatch_attempts": self.dispatch_attempts, } diff --git a/src/cfdb/workflows/worker_lan.py b/src/cfdb/workflows/worker_lan.py index 832ccc3..a9d2512 100644 --- a/src/cfdb/workflows/worker_lan.py +++ b/src/cfdb/workflows/worker_lan.py @@ -17,21 +17,26 @@ python -m cfdb.workflows.worker_lan --namespace cfdb-workers --workers 2 SIGINT (Ctrl-C) or SIGTERM drains the pool and exits. With no pool -running, ``/data`` and ``/index`` requests for processable formats hang -on the dispatch retry budget before failing with ``NoWorkersAvailable``. +running, ``/data`` and ``/index`` requests for processable formats do not +hang or surface ``NoWorkersAvailable`` to the client: the job is claimed +and queued PENDING (the request returns ``202``), and the API's durable +retry scheduler re-attempts dispatch every ``CFDB_WORKFLOW_RETRY_INTERVAL_S`` +until a worker appears or the ``CFDB_WORKFLOW_DISPATCH_DEADLINE_S`` deadline +elapses (then the job is failed ``capacity:``). Environment variables (CLI flags mirror them): * ``WORKFLOW_POOL_NAMESPACE`` — LAN discovery namespace the pool publishes under (default ``cfdb-workers``). MUST match the API's value. * ``WORKFLOW_WORKER_COUNT`` — number of workers to spawn and publish - (default ``2``). Size it at least as high as the API's lease count or - the API blocks waiting for workers. + (default ``2``). The API admits every worker discovery surfaces (there is + no fixed lease count), so size this to the local concurrency you want. """ from __future__ import annotations import asyncio +import functools import logging import signal from typing import Optional @@ -40,6 +45,8 @@ import wool from wool.runtime.discovery.lan import LanDiscovery +from cfdb.workflows import WORKER_MAX_CONCURRENT_TASKS +from cfdb.workflows.backpressure import backpressure_for from cfdb.workflows.credentials import build_worker_credentials logger = logging.getLogger(__name__) @@ -97,18 +104,31 @@ def _signal_handler_threaded(*_: object) -> None: except NotImplementedError: signal.signal(sig, _signal_handler_threaded) + # Bind per-worker backpressure onto the spawn factory so each spawned + # LocalWorker serializes its routines (mirrors the ECS worker_main + # wiring). ``functools.partial(..., backpressure=hook)`` keeps wool's + # ``declares_host`` True, so the pool still prescribes the bind host. + backpressure = backpressure_for(WORKER_MAX_CONCURRENT_TASKS) + worker_factory = ( + functools.partial(wool.LocalWorker, backpressure=backpressure) + if backpressure is not None + else wool.LocalWorker + ) + pool = wool.WorkerPool( spawn=workers, + worker=worker_factory, discovery=LanDiscovery(namespace), credentials=credentials, ) async with pool: logger.info( - "Published %d wool worker(s) under LAN namespace %r (mTLS %s) " - "— Ctrl-C to drain and exit", + "Published %d wool worker(s) under LAN namespace %r (mTLS %s, " + "max concurrent tasks %s) — Ctrl-C to drain and exit", workers, namespace, "enabled" if credentials is not None else "disabled", + WORKER_MAX_CONCURRENT_TASKS if backpressure is not None else "unbounded", ) await stop_event.wait() return 0 diff --git a/src/cfdb/workflows/worker_main.py b/src/cfdb/workflows/worker_main.py index c4cc603..5a92264 100644 --- a/src/cfdb/workflows/worker_main.py +++ b/src/cfdb/workflows/worker_main.py @@ -35,6 +35,8 @@ import click import wool +from cfdb.workflows import WORKER_MAX_CONCURRENT_TASKS +from cfdb.workflows.backpressure import backpressure_for from cfdb.workflows.constants import DEFAULT_WORKER_PORT from cfdb.workflows.credentials import build_worker_credentials @@ -136,16 +138,27 @@ def _signal_handler_threaded(*_: object) -> None: health_port, lambda: stop_event.is_set() ) + # Per-worker backpressure: reject a dispatch once this worker already + # has CFDB_WORKER_MAX_CONCURRENT_TASKS routines in flight (default 1), + # so its subprocess pipelines serialize instead of oversubscribing the + # 1-vCPU task. ``None`` (threshold 0) keeps the unbounded prior behavior. + backpressure = backpressure_for(WORKER_MAX_CONCURRENT_TASKS) + worker = wool.LocalWorker( - host="0.0.0.0", port=worker_port, credentials=credentials + host="0.0.0.0", + port=worker_port, + credentials=credentials, + backpressure=backpressure, ) await worker.start() try: logger.info( - "Wool worker listening on port %d (health on %d, mTLS %s)", + "Wool worker listening on port %d (health on %d, mTLS %s, " + "max concurrent tasks %s)", worker_port, health_port, "enabled" if credentials is not None else "disabled", + WORKER_MAX_CONCURRENT_TASKS if backpressure is not None else "unbounded", ) while True: # Check the self-termination path first. Max-lifetime is a diff --git a/tests/conftest.py b/tests/conftest.py index 21e5b6d..8d8ab76 100644 --- a/tests/conftest.py +++ b/tests/conftest.py @@ -23,10 +23,18 @@ def _resolve(doc: dict, key: str): def _match(doc: dict, query: dict) -> bool: """Minimal MongoDB query matcher supporting a small operator subset.""" for key, cond in query.items(): + # AND $and/$or with the rest of the top-level keys rather than + # short-circuiting the whole match, so a query that mixes them with + # sibling field conditions (e.g. {active, updated_at, $or}) is + # evaluated faithfully regardless of key order. if key == "$and": - return all(_match(doc, sub) for sub in cond) + if not all(_match(doc, sub) for sub in cond): + return False + continue if key == "$or": - return any(_match(doc, sub) for sub in cond) + if not any(_match(doc, sub) for sub in cond): + return False + continue value = _resolve(doc, key) @@ -173,6 +181,10 @@ def _apply_update(doc: dict, update: dict, *, is_insert: bool = False) -> bool: if v not in existing: existing.append(v) changed = True + elif op == "$inc": + for k, v in fields.items(): + _set_nested(doc, k, (_resolve(doc, k) or 0) + v) + changed = True else: raise NotImplementedError(f"FakeCollection update op: {op}") return changed @@ -267,12 +279,19 @@ async def find_one_and_update( *, upsert: bool = False, return_document=None, + sort: list | None = None, **_kwargs, ) -> dict | None: - for d in self.docs: - if _match(d, query): - _apply_update(d, update, is_insert=False) - return dict(d) + candidates = [d for d in self.docs if _match(d, query)] + if sort: + # Single-key sort is all the production queries use; honoring it + # lets tests exercise ordered claims (e.g. oldest-due-first). + field, direction = sort[0] + candidates.sort(key=lambda d: d.get(field), reverse=direction < 0) + if candidates: + d = candidates[0] + _apply_update(d, update, is_insert=False) + return dict(d) if upsert: seed: dict = {} for k, v in query.items(): @@ -311,10 +330,10 @@ async def update_many(self, query: dict, update: dict, **kwargs) -> _UpdateResul for d in self.docs: if _match(d, query): matched += 1 - for k, v in update.get("$set", {}).items(): - if d.get(k) != v: - d[k] = v - modified += 1 + # Row-level modified count, matching real Mongo semantics + # (number of documents changed, not number of field writes). + if _apply_update(d, update, is_insert=False): + modified += 1 return _UpdateResult(matched, modified) async def update_one(self, query: dict, update: dict, **kwargs) -> _UpdateResult: diff --git a/tests/integration/conftest.py b/tests/integration/conftest.py index 41269d5..2b1418f 100644 --- a/tests/integration/conftest.py +++ b/tests/integration/conftest.py @@ -530,9 +530,10 @@ async def wool_pool(): Function-scoped so every test gets a clean pool — prevents any cross-test state bleeding through worker process memory. The - executor's ``_dispatch_with_retry`` covers the brief window - between pool startup and LocalDiscovery surfacing the worker, - so no explicit sleep is needed here. + executor opens a single dispatch attempt and, on overflow, + reschedules the job onto its durable retry scheduler — so the brief + window between pool startup and the worker being surfaced needs no + explicit sleep here. """ async with wool.WorkerPool(spawn=1): yield diff --git a/tests/integration/routines.py b/tests/integration/routines.py index 11f7d0a..5c4a291 100644 --- a/tests/integration/routines.py +++ b/tests/integration/routines.py @@ -53,10 +53,10 @@ class StubProcessor(Processor): workflow to race against. - ``sleep_between_yields``: a delay inserted AFTER each ``stage_complete`` yield. Used by the runtime-cap test: a quick - first event lets ``_open_stream_with_retry`` return so the - executor's ``asyncio.timeout(cap)`` block is entered, then the - between-yields delay outlasts the cap so the wait for the next - event triggers the cap. + first event lets ``_open_stream_once`` return (and the executor mark + the job RUNNING) so the ``asyncio.timeout(cap)`` block is entered, + then the between-yields delay outlasts the cap so the wait for the + next event triggers the cap. - ``unpicklable_field``: any value assigned to ``self.unpicklable_field`` to deliberately poison cloudpickle. Used by the boundary tests that need to verify a pickling failure surfaces as a clean diff --git a/tests/integration/test_executor_boundary.py b/tests/integration/test_executor_boundary.py index 2a6d1a8..24acad5 100644 --- a/tests/integration/test_executor_boundary.py +++ b/tests/integration/test_executor_boundary.py @@ -10,11 +10,16 @@ from __future__ import annotations import asyncio +import functools import pytest +import wool +from cfdb.workflows import executor as executor_module +from cfdb.workflows.backpressure import TaskCountBackpressure from cfdb.workflows.cache import LocalFsCache from cfdb.workflows.executor import WoolExecutor +from cfdb.workflows.loadbalancer import PriorityLoadBalancer from cfdb.workflows.lock import get_job from cfdb.workflows.models import ACTIVE_STATUSES, JobStatus from cfdb.workflows.processors.registry import ProcessorRegistry @@ -30,9 +35,7 @@ def _install_jobs_index(mock_db) -> None: mock_db.jobs.create_index( {"workflow_key": 1}, unique=True, - partialFilterExpression={ - "status": {"$in": [s.value for s in ACTIVE_STATUSES]} - }, + partialFilterExpression={"active": True}, ) @@ -127,8 +130,8 @@ async def test_ensure_workflow_should_record_failed_when_routine_raises_across_p assert final.status == JobStatus.FAILED assert final.error is not None # When the routine raises before yielding, wool propagates the - # exception across the boundary; the executor records - # ``str(exc)`` from the ``_open_stream_with_retry`` failure path. + # exception across the boundary; the executor records ``str(exc)`` + # from ``_attempt_dispatch``'s stream-open failure path. # The exception class name is not part of the persisted error. assert "boom" in final.error # The scrub regex strips multi-segment absolute paths from the @@ -183,8 +186,11 @@ async def test_ensure_workflow_should_record_failed_when_workflow_duration_cap_f assert final.status == JobStatus.FAILED assert final.error is not None assert "runtime cap" in final.error.lower() - # Workdir is cleaned up regardless of failure mode. - assert not (executor._workdir_root / record.job_id).exists() + # Workdir is cleaned up regardless of failure mode. Assert against + # the workdir ROOT, not ``root / job_id``: the per-attempt workdir is + # ``root / f"{job_id}-{uuid}"`` (B1), so a bare-``job_id`` path is + # never created and asserting its absence would be vacuously true. + assert list(executor._workdir_root.iterdir()) == [] # The internal bookkeeping sets are drained after the await. assert len(executor._pending_tasks) == 0 assert len(executor._finalize_tasks) == 0 @@ -275,3 +281,91 @@ async def test_ensure_workflow_should_attach_second_caller_to_active_job_across_ d for d in mock_db.jobs.docs if d["workflow_key"] == rec_a.workflow_key ] assert len(records_for_key) == 1 + + +class TestProductionPoolConfiguration: + @pytest.mark.asyncio + async def test_ensure_workflow_should_distribute_and_reschedule_across_priority_pool( + self, mock_db, tmp_path, monkeypatch + ): + """Test the production pool config over the real pickle boundary. + + Given: + A two-worker ``wool.WorkerPool`` wired exactly as production — + ``PriorityLoadBalancer`` plus a ``TaskCountBackpressure(1)``-bound + spawn factory — the durable retry scheduler running, and three + distinct slow workflows that each hold their worker busy for the + whole test. + When: + All three are dispatched via ``ensure_workflow`` and the system + is allowed to settle. + Then: + Exactly two reach RUNNING concurrently (one per worker — proving + the load balancer distributes across both over the cloudpickle / + gRPC boundary), and the third is left PENDING with a + ``next_dispatch_at``: the priority balancer rotates past both + ``RESOURCE_EXHAUSTED`` rejections, surfaces ``NoWorkersAvailable``, + and the durable scheduler keeps it queued rather than dropping or + double-running it. + """ + # Arrange — fast retries so an early overflow (a worker not yet + # surfaced at pool startup) is healed by the scheduler promptly. + monkeypatch.setattr(executor_module, "_RETRY_INTERVAL_SECONDS", 0.2) + monkeypatch.setattr(executor_module, "_RETRY_JITTER_MAX_SECONDS", 0.1) + _install_jobs_index(mock_db) + cache = LocalFsCache(tmp_path / "cache") + registry = ProcessorRegistry() + registry.register(StubProcessor(sleep_seconds=30.0)) + executor = WoolExecutor( + mock_db, cache, registry, workdir_root=tmp_path / "jobs" + ) + metas = [{**stub_file_meta(), "local_id": f"ENCFF-{i}"} for i in range(3)] + pool = wool.WorkerPool( + spawn=2, + worker=functools.partial( + wool.LocalWorker, backpressure=TaskCountBackpressure(1) + ), + loadbalancer=PriorityLoadBalancer(), + ) + + async def _settled() -> tuple[int, int]: + running = pending_resched = 0 + for meta in metas: + # job_id isn't known until ensure_workflow returns; match on + # the workflow_key the records carry instead. + for doc in mock_db.jobs.docs: + if doc["local_id"] != meta["local_id"]: + continue + if doc["status"] == JobStatus.RUNNING.value: + running += 1 + elif ( + doc["status"] == JobStatus.PENDING.value + and doc.get("next_dispatch_at") is not None + ): + pending_resched += 1 + return running, pending_resched + + # Act / Assert + try: + async with pool: + executor.start_scheduler() + for meta in metas: + await executor.ensure_workflow(meta) + + # Backpressure(1) on two workers caps concurrent RUNNING at + # two, so the steady state is (2 running, 1 overflowed). The + # long-running stubs hold that state for the whole poll. + deadline = asyncio.get_event_loop().time() + 25.0 + running = pending_resched = 0 + while asyncio.get_event_loop().time() < deadline: + running, pending_resched = await _settled() + if running == 2 and pending_resched == 1: + break + await asyncio.sleep(0.2) + + assert running == 2, f"expected 2 RUNNING (one per worker), got {running}" + assert pending_resched == 1, ( + f"expected 1 overflowed+rescheduled job, got {pending_resched}" + ) + finally: + await executor.drain(timeout=15.0) diff --git a/tests/integration/test_processor_e2e.py b/tests/integration/test_processor_e2e.py index f592878..780eb23 100644 --- a/tests/integration/test_processor_e2e.py +++ b/tests/integration/test_processor_e2e.py @@ -643,5 +643,8 @@ async def test_ensure_workflow_should_surface_failed_status_when_processor_raise # The path-scrub regex strips multi-segment absolute paths so a # ``//`` shape MUST NOT survive into persisted text. assert re.search(r"/[A-Za-z][A-Za-z0-9_.-]*/[A-Za-z]", final.error) is None - # Workdir is cleaned up regardless of failure mode. - assert not (integration_executor._workdir_root / record.job_id).exists() + # Workdir is cleaned up regardless of failure mode. Assert against + # the workdir ROOT, not ``root / job_id``: the per-attempt workdir is + # ``root / f"{job_id}-{uuid}"`` (B1), so a bare-``job_id`` path is + # never created and asserting its absence would be vacuously true. + assert list(integration_executor._workdir_root.iterdir()) == [] diff --git a/tests/test_api/test_lifespan_workerpool.py b/tests/test_api/test_lifespan_workerpool.py index d0eb79c..b4681fd 100644 --- a/tests/test_api/test_lifespan_workerpool.py +++ b/tests/test_api/test_lifespan_workerpool.py @@ -97,6 +97,16 @@ async def _fake_build_discovery(_profile): assert kwargs["quorum"] == 0 assert kwargs["discovery"] is sentinel_discovery assert kwargs["credentials"] == "CREDS-SENTINEL" + # The priority/leaky-bucket balancer must be wired in (issue #45). This + # file exists to pin pool kwargs against silent regressions, so the + # load balancer belongs here alongside quorum/discovery/credentials. + from cfdb.workflows.loadbalancer import PriorityLoadBalancer + + assert isinstance(kwargs["loadbalancer"], PriorityLoadBalancer) + # The durable retry scheduler must be started inside the pool context + # (issue #45) so it inherits wool's dispatch contextvars; pin it here + # alongside the other load-bearing lifespan wiring. + fake_executor.start_scheduler.assert_called_once() def test_worker_pool_signature_should_accept_quorum(): diff --git a/tests/test_cache_stream.py b/tests/test_cache_stream.py index 5626d8b..e415829 100644 --- a/tests/test_cache_stream.py +++ b/tests/test_cache_stream.py @@ -17,7 +17,11 @@ stream_cache_entry, ) from cfdb.workflows.cache import LocalFsCache -from cfdb.workflows.executor import ExecutorDraining, WorkflowNotApplicable +from cfdb.workflows.executor import ( + AdmissionRejected, + ExecutorDraining, + WorkflowNotApplicable, +) from cfdb.workflows.models import ArtifactKind, JobRecord, JobStatus from cfdb.workflows.processors.base import Processor from cfdb.workflows.processors.registry import ProcessorRegistry @@ -492,6 +496,48 @@ async def test_should_raise_503_when_executor_is_draining( assert exc_info.value.status_code == 503 assert "Retry-After" in exc_info.value.headers + @pytest.mark.asyncio + async def test_should_raise_429_when_admission_rejected( + self, mocker, tmp_path + ): + """Test that ``AdmissionRejected`` is translated to 429 with Retry-After. + + Given: + Cache miss + GET + an executor that raises ``AdmissionRejected`` + (the active-workflow ceiling is hit). + When: + The helper is awaited. + Then: + It should raise ``HTTPException(429)`` carrying the exception's + ``retry_after_seconds`` as the ``Retry-After`` header — so a + flood backs off rather than falling through to direct streaming + (which would bypass the bounded pipeline). + """ + # Arrange + registry = ProcessorRegistry() + registry.register(_StubProcessor()) + mocker.patch.object(api, "processor_registry", registry) + mocker.patch.object(api, "cache", LocalFsCache(tmp_path / "cache")) + mocker.patch.object( + api, + "executor", + _RecordingExecutor( + exc=AdmissionRejected(active=12, ceiling=12, retry_after_seconds=7) + ), + ) + + # Act & assert + with pytest.raises(HTTPException) as exc_info: + await serve_workflow_artifact_or_dispatch( + _file_doc(), + ArtifactKind.DATA, + _Request(), + None, + head_404_detail="missing", + ) + assert exc_info.value.status_code == 429 + assert exc_info.value.headers["Retry-After"] == "7" + @pytest.mark.asyncio async def test_should_return_none_on_workflow_not_applicable_race( self, mocker, tmp_path diff --git a/tests/test_data.py b/tests/test_data.py index a81d769..f012969 100644 --- a/tests/test_data.py +++ b/tests/test_data.py @@ -11,7 +11,6 @@ from cfdb.api.routers.data import stream_file, stream_file_status from cfdb.services import drs, locks from cfdb.workflows.executor import WoolExecutor -from cfdb.workflows.models import ACTIVE_STATUSES from cfdb.workflows.processors.bam import BamIndexProcessor from tests.test_workflows import FIXTURE_MD5 from cfdb.workflows.processors.registry import ProcessorRegistry, default_registry @@ -115,9 +114,7 @@ async def test_stream_file_should_dispatch_workflow_when_processor_applies( mock_db.jobs.create_index( {"workflow_key": 1}, unique=True, - partialFilterExpression={ - "status": {"$in": [s.value for s in ACTIVE_STATUSES]} - }, + partialFilterExpression={"active": True}, ) mock_db.dcc.docs = [ diff --git a/tests/test_index.py b/tests/test_index.py index 8e286fc..3dc00b0 100644 --- a/tests/test_index.py +++ b/tests/test_index.py @@ -11,7 +11,7 @@ from cfdb.api.routers.index import stream_index_file, stream_index_file_status from cfdb.services import drs, locks from cfdb.workflows.executor import WoolExecutor -from cfdb.workflows.models import ACTIVE_STATUSES, ArtifactKind +from cfdb.workflows.models import ArtifactKind from cfdb.workflows.processors.bam import BamIndexProcessor from cfdb.workflows.processors.registry import ProcessorRegistry, default_registry from tests.test_workflows import FIXTURE_MD5 @@ -1048,9 +1048,7 @@ async def test_stream_index_file_should_return_202_when_processor_applies_and_ca mock_db.jobs.create_index( {"workflow_key": 1}, unique=True, - partialFilterExpression={ - "status": {"$in": [s.value for s in ACTIVE_STATUSES]} - }, + partialFilterExpression={"active": True}, ) # Register a real processor + cache + executor on the module. diff --git a/tests/test_indexes.py b/tests/test_indexes.py index 7b1780d..46e04b3 100644 --- a/tests/test_indexes.py +++ b/tests/test_indexes.py @@ -373,16 +373,22 @@ def test_create_indexes_js_should_define_all_operational_index_names(): The JS source is read. Then: It should reference each named operational index and the - locks.active index. + locks.active index. The names are derived from + ``operational_index_specs()`` (not hard-coded) so a newly-added + operational index is guarded automatically. """ + # Arrange — the named jobs-collection operational indexes (the locks + # index relies on Mongo's auto-generated name and is asserted below). + named = [ + spec.name + for spec in operational_index_specs() + if spec.name and spec.collection == "jobs" + ] + # Assert - for name in ( - "workflow_key_active_unique", - "job_id_unique", - "status_updated_at", - "terminal_ttl", - ): - assert name in _JS + assert named, "expected at least one named jobs operational index" + for name in named: + assert name in _JS, f"{name} missing from create-indexes.js" assert "db.locks.createIndex({ active: 1 })" in _JS diff --git a/tests/test_workflows/test_backpressure.py b/tests/test_workflows/test_backpressure.py new file mode 100644 index 0000000..df984be --- /dev/null +++ b/tests/test_workflows/test_backpressure.py @@ -0,0 +1,183 @@ +"""Tests for the per-worker backpressure hook.""" + +from __future__ import annotations + +import cloudpickle +import pytest +import wool + +from cfdb.workflows.backpressure import TaskCountBackpressure +from cfdb.workflows.backpressure import backpressure_for + + +def _ctx(active_task_count: int) -> wool.BackpressureContext: + """Build a BackpressureContext with a sentinel task for hook tests.""" + return wool.BackpressureContext(active_task_count=active_task_count, task=object()) + + +class TestTaskCountBackpressure: + def test___init___should_raise_when_threshold_below_one(self): + """Test that a non-positive threshold is rejected at construction. + + Given: + A threshold of 0, the "disable backpressure" sentinel that the + wiring handles by passing None instead of this hook. + When: + TaskCountBackpressure is constructed with it. + Then: + It should raise ValueError rather than build a hook that never + rejects. + """ + # Arrange, act, & assert + with pytest.raises(ValueError, match="threshold must be >= 1"): + TaskCountBackpressure(0) + + def test___call___should_accept_when_below_threshold(self): + """Test that the hook accepts a task below the threshold. + + Given: + A hook with threshold 1 and a worker with no tasks in flight. + When: + The hook is invoked with that context. + Then: + It should return False (accept). + """ + # Arrange + hook = TaskCountBackpressure(1) + + # Act + decision = hook(_ctx(active_task_count=0)) + + # Assert + assert decision is False + + def test___call___should_reject_when_at_threshold(self): + """Test that the hook rejects a task at the threshold. + + Given: + A hook with threshold 1 and a worker already running one task. + When: + The hook is invoked with that context. + Then: + It should return True (reject), which wool surfaces as + RESOURCE_EXHAUSTED. + """ + # Arrange + hook = TaskCountBackpressure(1) + + # Act + decision = hook(_ctx(active_task_count=1)) + + # Assert + assert decision is True + + def test___call___should_reject_when_above_threshold(self): + """Test that the hook rejects once in-flight exceeds the threshold. + + Given: + A hook with threshold 2 and a worker running three tasks. + When: + The hook is invoked with that context. + Then: + It should return True (reject). + """ + # Arrange + hook = TaskCountBackpressure(2) + + # Act + decision = hook(_ctx(active_task_count=3)) + + # Assert + assert decision is True + + def test_cloudpickle_roundtrip_should_preserve_threshold(self): + """Test that the hook survives the worker-serialization boundary. + + Given: + A hook that wool will serialize into a LocalWorker subprocess. + When: + It is cloudpickle dumped and loaded back. + Then: + The restored hook preserves its threshold and still rejects at + it, proving it is safe to ship across the worker boundary. + """ + # Arrange + hook = TaskCountBackpressure(2) + + # Act + restored = cloudpickle.loads(cloudpickle.dumps(hook)) + + # Assert + assert isinstance(restored, TaskCountBackpressure) + assert restored.threshold == 2 + assert restored(_ctx(active_task_count=2)) is True + assert restored(_ctx(active_task_count=1)) is False + + def test_conforms_to_backpressure_like_protocol(self): + """Test that the hook satisfies wool's BackpressureLike protocol. + + Given: + A constructed hook. + When: + It is checked against the runtime-checkable BackpressureLike + protocol. + Then: + It should be recognized as a BackpressureLike, so wool accepts + it where a backpressure hook is expected. + """ + # Arrange + hook = TaskCountBackpressure(1) + + # Act & assert — structural conformance, plus the bool-return + # contract wool actually depends on (BackpressureLike is satisfied + # by any callable, so the return-type check gives the test teeth). + assert isinstance(hook, wool.BackpressureLike) + assert isinstance(hook(_ctx(active_task_count=0)), bool) + + +class TestBackpressureFor: + def test_backpressure_for_should_return_none_when_threshold_is_zero(self): + """Test that a threshold of 0 disables backpressure. + + Given: + A threshold of 0 (the disable sentinel). + When: + backpressure_for is called. + Then: + It should return None so the wiring passes backpressure=None to + wool (unbounded admission). + """ + # Act & assert + assert backpressure_for(0) is None + + def test_backpressure_for_should_build_hook_when_threshold_positive(self): + """Test that a positive threshold yields a configured hook. + + Given: + A threshold of 2. + When: + backpressure_for is called. + Then: + It should return a TaskCountBackpressure carrying that threshold. + """ + # Act + hook = backpressure_for(2) + + # Assert + assert isinstance(hook, TaskCountBackpressure) + assert hook.threshold == 2 + + def test_backpressure_for_should_raise_on_negative_threshold(self): + """Test that a negative threshold is rejected, not silently disabled. + + Given: + A negative threshold. + When: + backpressure_for is called. + Then: + It should raise ValueError, so a negative value cannot + accidentally disable backpressure (only 0 disables it). + """ + # Act & assert + with pytest.raises(ValueError, match=">= 0"): + backpressure_for(-1) diff --git a/tests/test_workflows/test_executor.py b/tests/test_workflows/test_executor.py index 76d315f..a33a86b 100644 --- a/tests/test_workflows/test_executor.py +++ b/tests/test_workflows/test_executor.py @@ -3,7 +3,9 @@ from __future__ import annotations import asyncio +import uuid from collections.abc import AsyncIterator +from datetime import datetime, timedelta, timezone from pathlib import Path from typing import Any @@ -20,13 +22,19 @@ ) from cfdb.workflows.executor import ( PIPELINE_VERSION, + AdmissionRejected, ExecutorDraining, WoolExecutor, WorkflowNotApplicable, extract_identity, ) from cfdb.workflows.lock import get_job -from cfdb.workflows.models import ACTIVE_STATUSES, ArtifactKind, JobStatus +from cfdb.workflows.models import ( + ACTIVE_STATUSES, + ArtifactKind, + JobRecord, + JobStatus, +) from cfdb.workflows.processors.base import Processor from cfdb.workflows.processors.registry import ProcessorRegistry from cfdb.workflows.provisioner import EcsProvisioner, RetryableProvisionerError @@ -107,9 +115,7 @@ def _install_jobs_index(mock_db) -> None: mock_db.jobs.create_index( {"workflow_key": 1}, unique=True, - partialFilterExpression={ - "status": {"$in": [s.value for s in ACTIVE_STATUSES]} - }, + partialFilterExpression={"active": True}, ) @@ -124,6 +130,59 @@ async def _wait_for_terminal(mock_db, job_id: str, timeout: float = 2.0) -> None raise AssertionError(f"Job {job_id} did not reach terminal status") +async def _wait_for_status( + mock_db, job_id: str, status: JobStatus, timeout: float = 2.0 +) -> None: + """Poll the job record until it reaches ``status`` (or time out).""" + deadline = asyncio.get_event_loop().time() + timeout + while asyncio.get_event_loop().time() < deadline: + record = await get_job(mock_db, job_id) + if record is not None and record.status == status: + return + await asyncio.sleep(0.01) + raise AssertionError(f"Job {job_id} did not reach status {status}") + + +async def _seed_pending_job( + mock_db, + *, + submitted_at: datetime | None = None, + next_dispatch_at: datetime | None = None, + meta: dict[str, Any] | None = None, + status: JobStatus = JobStatus.PENDING, + updated_at: datetime | None = None, +) -> JobRecord: + """Insert a crafted job row directly and return its JobRecord. + + Lets a test drive ``_attempt_dispatch`` / ``_drain_due_jobs`` / + ``_recover_orphans`` against a job without going through + ``ensure_workflow`` (which would spawn its own inline attempt). + ``submitted_at`` controls the deadline clock; ``next_dispatch_at`` + controls scheduler due-ness; ``status`` / ``updated_at`` craft orphan + states (e.g. a stale RUNNING row). + """ + meta = meta or _file_meta() + dcc, local_id, md5 = extract_identity(meta) + now = datetime.now(timezone.utc) + record = JobRecord( + job_id=str(uuid.uuid4()), + workflow_key=key_utils.workflow_key( + dcc=dcc, local_id=local_id, md5=md5, pipeline_version=PIPELINE_VERSION + ), + status=status, + dcc=dcc, + local_id=local_id, + md5=md5, + pipeline_version=PIPELINE_VERSION, + submitted_at=submitted_at or now, + updated_at=updated_at or now, + file_meta_snapshot=meta, + next_dispatch_at=next_dispatch_at, + ) + await mock_db.jobs.insert_one(record.to_mongo()) + return record + + class TestExtractIdentity: def test_extract_identity_should_prefer_nested_dcc_abbreviation(self): """Test that extract_identity reads the canonical C2M2 shape. @@ -393,7 +452,7 @@ async def test_ensure_workflow_should_cleanup_workdir_after_success( tmp_workdir, no_wool_dispatch, ): - """Test that the per-job workdir is removed after a successful run. + """Test that the per-attempt workdir is removed after a successful run. Given: A registry with a stub processor that writes a file into its @@ -401,7 +460,8 @@ async def test_ensure_workflow_should_cleanup_workdir_after_success( When: ensure_workflow is awaited and the background task completes. Then: - The per-job workdir should no longer exist on disk. + No scratch directory should remain under the workdir root (the + per-attempt workdir, named with a nonce, is removed). """ class _WorkdirTouchingProcessor(_StubProcessor): @@ -425,8 +485,9 @@ async def run(self, file_meta, workdir, cache_root): record, _ = await executor.ensure_workflow(_file_meta()) await _wait_for_terminal(mock_db, record.job_id) - # Assert - assert not (tmp_workdir / record.job_id).exists() + # Assert — the per-attempt workdir (job_id + nonce) is removed, so + # no scratch directory leaks under the workdir root. + assert list(tmp_workdir.iterdir()) == [] @pytest.mark.asyncio async def test_ensure_workflow_should_release_even_when_record_stage_fails( @@ -495,7 +556,7 @@ async def test_ensure_workflow_should_record_timeout_when_cap_exceeded( # Arrange — yield one event first so the consumer enters the # asyncio.timeout block (the timeout wraps the event loop, not - # the initial _open_stream_with_retry call); then hang so the + # the initial _open_stream_once call); then hang so the # cap fires on the second iteration. class _HangingProcessor(_StubProcessor): async def run(self, file_meta, workdir, cache_root): @@ -862,22 +923,21 @@ def test_pipeline_version_should_be_positive(): assert isinstance(PIPELINE_VERSION, int) -class TestOpenStreamWithRetry: +class TestOpenStreamOnce: @pytest.mark.asyncio - async def test_open_stream_with_retry_should_recover_after_one_no_workers_error( + async def test_open_stream_once_should_return_stream_on_first_event( self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch, mocker ): - """Test that ``NoWorkersAvailable`` triggers a retry that eventually wins. + """Test that a routine yielding an event returns a usable stream. Given: - An executor whose ``_run_processor_routine`` raises - ``wool.NoWorkersAvailable`` on its first ``__anext__`` then - yields events on the second attempt. + An executor whose ``_run_processor_routine`` yields a regular + event followed by a ``complete``. When: - ``_open_stream_with_retry`` is awaited. + ``_open_stream_once`` is awaited. Then: - It should return a stream that yields the second invocation's - events without re-raising. + It should return a stream that re-yields the first event plus + the rest, in a single dispatch attempt (no retry loop). """ # Arrange _install_jobs_index(mock_db) @@ -888,57 +948,45 @@ async def test_open_stream_with_retry_should_recover_after_one_no_workers_error( mock_db, tmp_cache, registry, workdir_root=tmp_workdir ) - # Build two routines: one that immediately raises NoWorkersAvailable - # and one that yields a regular event followed by a complete. attempts = {"count": 0} - async def failing_stream(): - attempts["count"] += 1 - raise wool.NoWorkersAvailable("simulated cold start") - yield # pragma: no cover - async def working_stream(): attempts["count"] += 1 yield StageComplete(kind=ArtifactKind.INDEX, key=_INDEX_KEY) yield Complete(artifacts={}) - streams = [failing_stream(), working_stream()] - - def fake_routine(*_args, **_kwargs): - return streams.pop(0) - - mocker.patch.object(executor_module, "_run_processor_routine", fake_routine) - # Patch sleep so retries don't slow the test down. - async def fast_sleep(_s): - return None - - mocker.patch.object(executor_module.asyncio, "sleep", fast_sleep) + mocker.patch.object( + executor_module, + "_run_processor_routine", + lambda *_a, **_k: working_stream(), + ) # Act - stream = await executor._open_stream_with_retry( + stream = await executor._open_stream_once( processor, _file_meta(), tmp_workdir / "wd" ) events = [event async for event in stream] - # Assert - assert attempts["count"] == 2 + # Assert — a single attempt, both events delivered. + assert attempts["count"] == 1 + assert any(isinstance(e, StageComplete) for e in events) assert any(isinstance(e, Complete) for e in events) @pytest.mark.asyncio - async def test_open_stream_with_retry_should_raise_when_deadline_expires( + async def test_open_stream_once_should_propagate_no_workers_available( self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch, mocker ): - """Test that the retry loop eventually surfaces ``NoWorkersAvailable``. + """Test that ``NoWorkersAvailable`` propagates without retrying. Given: - An executor whose ``_run_processor_routine`` always raises - ``wool.NoWorkersAvailable``; the dispatch-wait budget is - patched to a tiny value so the deadline expires quickly. + An executor whose ``_run_processor_routine`` raises + ``wool.NoWorkersAvailable`` on its first ``__anext__``. When: - ``_open_stream_with_retry`` is awaited. + ``_open_stream_once`` is awaited. Then: - It should raise the last ``NoWorkersAvailable`` after the - dispatch budget is exhausted. + It should propagate the ``NoWorkersAvailable`` in a single + attempt — the durable retry scheduler, not an in-attempt loop, + owns retries now. The freshly-constructed stream is closed. """ # Arrange _install_jobs_index(mock_db) @@ -949,26 +997,25 @@ async def test_open_stream_with_retry_should_raise_when_deadline_expires( mock_db, tmp_cache, registry, workdir_root=tmp_workdir ) + attempts = {"count": 0} + async def failing_stream(): - raise wool.NoWorkersAvailable("still no workers") + attempts["count"] += 1 + raise wool.NoWorkersAvailable("simulated cold start") yield # pragma: no cover - def fake_routine(*_args, **_kwargs): - return failing_stream() - - mocker.patch.object(executor_module, "_run_processor_routine", fake_routine) - mocker.patch.object(executor_module, "_DISPATCH_WAIT_SECONDS", 0.01) - - async def fast_sleep(_s): - return None - - mocker.patch.object(executor_module.asyncio, "sleep", fast_sleep) + mocker.patch.object( + executor_module, + "_run_processor_routine", + lambda *_a, **_k: failing_stream(), + ) # Act & assert with pytest.raises(wool.NoWorkersAvailable): - await executor._open_stream_with_retry( + await executor._open_stream_once( processor, _file_meta(), tmp_workdir / "wd" ) + assert attempts["count"] == 1 class TestStreamConsumerProgressEvents: @@ -1121,7 +1168,7 @@ def test_executor_draining_should_be_caught_by_workflow_not_applicable(self): class TestOpenStreamMalformedFirstEvent: @pytest.mark.asyncio - async def test_open_stream_with_retry_should_propagate_non_capacity_error( + async def test_open_stream_once_should_propagate_non_capacity_error( self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch, mocker ): """Test that a non-``NoWorkersAvailable`` first error propagates. @@ -1130,11 +1177,11 @@ async def test_open_stream_with_retry_should_propagate_non_capacity_error( A routine whose first ``__anext__`` raises a generic ``RuntimeError`` (not a wool capacity error). When: - ``_open_stream_with_retry`` is awaited. + ``_open_stream_once`` is awaited. Then: - The exception should propagate to the caller — the retry - loop only handles ``NoWorkersAvailable``, every other class - is a real failure. + The exception should propagate to the caller — + ``_attempt_dispatch`` translates it into a FAILED job; only + ``NoWorkersAvailable`` is treated as overflow. """ # Arrange _install_jobs_index(mock_db) @@ -1156,7 +1203,7 @@ def fake_routine(*_args, **_kwargs): # Act & assert with pytest.raises(RuntimeError, match="not a capacity"): - await executor._open_stream_with_retry( + await executor._open_stream_once( processor, _file_meta(), tmp_workdir / "wd" ) @@ -1197,25 +1244,28 @@ async def test_ensure_workflow_should_persist_file_meta_snapshot_on_insert( class TestWoolExecutorWithProvisioner: @pytest.mark.asyncio - async def test_ensure_workflow_should_request_worker_when_provisioner_set( + async def test__attempt_dispatch_should_not_request_provisioner_when_worker_available( self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch, mocker ): - """Test that a fresh claim dispatches a provisioner request. + """Test that the provisioner is left idle when a worker accepts. Given: - A WoolExecutor wired with a stub EcsProvisioner. + A WoolExecutor wired with a stub EcsProvisioner and an + in-process worker (no_wool_dispatch) that accepts the task. When: - ``ensure_workflow`` is awaited on a previously-unseen file. + ``ensure_workflow`` dispatches the inline attempt and it wins + a worker. Then: - The provisioner's ``request`` should be awaited exactly once - with ``dedup_key`` set to the workflow mutex key for the file. + ``provisioner.request`` should NOT be awaited — spawn-on- + overflow means scale-up only happens when no worker accepts, + inverting the old unconditional pre-dispatch spawn (the cost + leak). """ # Arrange _install_jobs_index(mock_db) registry = ProcessorRegistry() registry.register(_StubProcessor()) provisioner = mocker.AsyncMock(spec=EcsProvisioner) - provisioner.request.return_value = ["arn:fake:task/abc"] executor = WoolExecutor( mock_db, tmp_cache, @@ -1223,51 +1273,41 @@ async def test_ensure_workflow_should_request_worker_when_provisioner_set( workdir_root=tmp_workdir, provisioner=provisioner, ) - meta = _file_meta() - dcc, local_id, md5 = extract_identity(meta) - expected_key = key_utils.workflow_key( - dcc=dcc, local_id=local_id, md5=md5, pipeline_version=PIPELINE_VERSION - ) # Act - record, _ = await executor.ensure_workflow(meta) + record, _ = await executor.ensure_workflow(_file_meta()) await _wait_for_terminal(mock_db, record.job_id) # Assert - provisioner.request.assert_awaited_once_with(dedup_key=expected_key) + final = await get_job(mock_db, record.job_id) + assert final is not None + assert final.status == JobStatus.COMPLETED + provisioner.request.assert_not_awaited() @pytest.mark.asyncio - async def test_ensure_workflow_should_skip_provisioner_when_attaching_to_existing_job( + async def test__attempt_dispatch_should_request_provisioner_on_overflow( self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch, mocker ): - """Test that the provisioner is only invoked on fresh claims. + """Test that overflow requests one bounded worker spawn and reschedules. Given: - An executor with a stub provisioner and a blocking processor - so the first claim stays active when the second call arrives. + A WoolExecutor wired with a stub provisioner and a queued job + whose dispatch attempt finds no worker capacity + (``_open_stream_once`` raises ``NoWorkersAvailable``). When: - ``ensure_workflow`` is awaited twice for the same file_meta. + ``_attempt_dispatch`` runs. Then: - ``provisioner.request`` should be awaited exactly once — the - attach path must not double-spend a Fargate worker against - an already-claimed workflow. + ``provisioner.request`` should be awaited once with the job's + workflow key, and the job should stay PENDING with + ``next_dispatch_at`` set and ``dispatch_attempts`` bumped — + queued durably for a later retry, not failed. """ # Arrange _install_jobs_index(mock_db) - release = asyncio.Event() - - class _BlockingProcessor(_StubProcessor): - async def run(self, file_meta, workdir, cache_root): - self.run_calls += 1 - await release.wait() - for kind, key in self.artifacts.items(): - yield StageComplete(kind=ArtifactKind(kind), key=key) - yield Complete(artifacts=dict(self.artifacts)) - registry = ProcessorRegistry() - registry.register(_BlockingProcessor()) + processor = _StubProcessor() + registry.register(processor) provisioner = mocker.AsyncMock(spec=EcsProvisioner) - provisioner.request.return_value = ["arn:fake:task/abc"] executor = WoolExecutor( mock_db, tmp_cache, @@ -1275,50 +1315,47 @@ async def run(self, file_meta, workdir, cache_root): workdir_root=tmp_workdir, provisioner=provisioner, ) + record = await _seed_pending_job(mock_db) + mocker.patch.object( + executor, + "_open_stream_once", + new=mocker.AsyncMock(side_effect=wool.NoWorkersAvailable("empty")), + ) # Act - record_a, fresh_a = await executor.ensure_workflow(_file_meta()) - record_b, fresh_b = await executor.ensure_workflow(_file_meta()) - release.set() - await _wait_for_terminal(mock_db, record_a.job_id) + await executor._attempt_dispatch(record, processor, _file_meta()) # Assert - assert fresh_a is True - assert fresh_b is False - assert record_a.job_id == record_b.job_id - provisioner.request.assert_awaited_once() + provisioner.request.assert_awaited_once_with(dedup_key=record.workflow_key) + final = await get_job(mock_db, record.job_id) + assert final is not None + assert final.status == JobStatus.PENDING + assert final.next_dispatch_at is not None + assert final.dispatch_attempts == 1 @pytest.mark.asyncio - async def test_ensure_workflow_should_fall_through_when_retryable_provisioner_raises( + async def test__handle_overflow_should_swallow_retryable_provisioner_error( self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch, mocker ): - """Test that a retryable provisioner failure falls through to dispatch. + """Test that a retryable provisioner error is swallowed, job reschedules. Given: A provisioner whose ``request`` raises - ``RetryableProvisionerError`` (capacity / ENI exhaustion / - throttling) AND a wool pool that can still dispatch to an - existing worker (the no_wool_dispatch fixture runs the - routine in-process). + ``RetryableProvisionerError`` on an overflowed dispatch. When: - ``ensure_workflow`` is awaited. + ``_attempt_dispatch`` runs. Then: - The retryable error should be logged but not fail the - workflow — the Wool pool routes to any available worker - via discovery, so an existing idle worker can service the - job even when ECS quota blocks a fresh launch. The - ``capacity:`` prefix is reserved for the case where - dispatch also exhausts its budget (see - ``test_ensure_workflow_should_record_capacity_failure_when_dispatch_also_fails``). + The error should be swallowed (best-effort scale-up) and the + job should stay PENDING, rescheduled for a later attempt — not + failed. """ # Arrange _install_jobs_index(mock_db) registry = ProcessorRegistry() - registry.register(_StubProcessor()) + processor = _StubProcessor() + registry.register(processor) provisioner = mocker.AsyncMock(spec=EcsProvisioner) - provisioner.request.side_effect = RetryableProvisionerError( - "ClientError: capacity-unavailable" - ) + provisioner.request.side_effect = RetryableProvisionerError("capacity") executor = WoolExecutor( mock_db, tmp_cache, @@ -1326,49 +1363,47 @@ async def test_ensure_workflow_should_fall_through_when_retryable_provisioner_ra workdir_root=tmp_workdir, provisioner=provisioner, ) + record = await _seed_pending_job(mock_db) + mocker.patch.object( + executor, + "_open_stream_once", + new=mocker.AsyncMock(side_effect=wool.NoWorkersAvailable("empty")), + ) # Act - record, _ = await executor.ensure_workflow(_file_meta()) - await _wait_for_terminal(mock_db, record.job_id) + await executor._attempt_dispatch(record, processor, _file_meta()) # Assert final = await get_job(mock_db, record.job_id) assert final is not None - provisioner.request.assert_awaited_once() - # Provisioner failure was swallowed; in-process dispatch - # succeeded, so the workflow completes despite the - # retryable provisioner error. - assert final.status == JobStatus.COMPLETED - assert final.error is None + assert final.status == JobStatus.PENDING + assert final.next_dispatch_at is not None @pytest.mark.asyncio - async def test_ensure_workflow_should_record_capacity_failure_when_dispatch_also_fails( + async def test__handle_overflow_should_swallow_generic_provisioner_error( self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch, mocker ): - """Test that the capacity: prefix fires when dispatch also exhausts. + """Test that a generic provisioner error doesn't fail the queued job. Given: - A provisioner whose ``request`` raises - ``RetryableProvisionerError`` AND a dispatch path that - exhausts its budget with ``wool.NoWorkersAvailable``. + A provisioner whose ``request`` raises a generic + ``RuntimeError`` (misconfiguration / unexpected boto failure) + on an overflowed dispatch. When: - ``ensure_workflow`` is awaited. + ``_attempt_dispatch`` runs. Then: - The workflow should reach ``FAILED`` with the - ``capacity:`` prefix — the wire contract clients parse to - decide whether to resubmit. Per C26 the origin of the - prefix shifted from the provisioner-failure branch to the - dispatch-budget-exhausted branch, but the on-wire shape - is unchanged. + The error should be logged and swallowed and the job should + stay PENDING (rescheduled) rather than failing — the + provisioner is a best-effort scale-up hint on overflow, not a + dispatch gate; the dispatch deadline bounds the retry loop. """ # Arrange _install_jobs_index(mock_db) registry = ProcessorRegistry() - registry.register(_StubProcessor()) + processor = _StubProcessor() + registry.register(processor) provisioner = mocker.AsyncMock(spec=EcsProvisioner) - provisioner.request.side_effect = RetryableProvisionerError( - "ClientError: capacity-unavailable" - ) + provisioner.request.side_effect = RuntimeError("misconfigured cluster") executor = WoolExecutor( mock_db, tmp_cache, @@ -1376,70 +1411,682 @@ async def test_ensure_workflow_should_record_capacity_failure_when_dispatch_also workdir_root=tmp_workdir, provisioner=provisioner, ) - # Force the dispatch path to also fail with - # ``wool.NoWorkersAvailable``; the C26 stream-open branch - # translates that to the ``capacity:`` prefix. + record = await _seed_pending_job(mock_db) mocker.patch.object( executor, - "_open_stream_with_retry", - side_effect=wool.NoWorkersAvailable("pool empty after budget"), + "_open_stream_once", + new=mocker.AsyncMock(side_effect=wool.NoWorkersAvailable("empty")), ) # Act - record, _ = await executor.ensure_workflow(_file_meta()) - await _wait_for_terminal(mock_db, record.job_id) + await executor._attempt_dispatch(record, processor, _file_meta()) + + # Assert + final = await get_job(mock_db, record.job_id) + assert final is not None + assert final.status == JobStatus.PENDING + assert final.next_dispatch_at is not None + + +class TestConsumeAndFinalizeHeartbeatLoss: + def test_heartbeat_loss_abort_threshold_sits_between_interval_and_stale(self): + """Test that the self-cancel threshold is between heartbeat and stale. + + Given: + The module's derived heartbeat-loss abort threshold. + When: + It is compared against the heartbeat interval and the orphan + sweep's stale threshold. + Then: + It should be strictly more than one heartbeat interval (a single + transient write failure must not abort — the strict + ``STALE > 2*HEARTBEAT`` startup invariant guarantees the margin, + issue #45 review A7) and strictly below the stale threshold (a + Mongo-blind consumer must give up before the sweep would revive + its row). + """ + # Assert + assert ( + executor_module._HEARTBEAT_LOSS_ABORT_S + > executor_module._HEARTBEAT_INTERVAL_S + ) + assert ( + executor_module._HEARTBEAT_LOSS_ABORT_S + < executor_module.STALE_WORKFLOW_THRESHOLD.total_seconds() + ) + + @pytest.mark.asyncio + async def test__consume_and_finalize_should_abort_when_heartbeat_lost( + self, mock_db, tmp_cache, tmp_workdir, mocker + ): + """Test that sustained heartbeat-write failure aborts the workflow. + + Given: + A RUNNING job whose ``heartbeat_workflow`` write always fails and + a zeroed abort threshold (so the first failure is already past + it). + When: + ``_consume_and_finalize`` consumes a stream that emits a + heartbeat. + Then: + It should abort with a ``heartbeat lost`` FAILED status and close + the stream (cancelling the remote worker), rather than consuming + on invisibly while the orphan sweep reclaims the row. + """ + # Arrange + _install_jobs_index(mock_db) + registry = ProcessorRegistry() + registry.register(_StubProcessor()) + executor = WoolExecutor( + mock_db, tmp_cache, registry, workdir_root=tmp_workdir + ) + record = await _seed_pending_job(mock_db, status=JobStatus.RUNNING) + mocker.patch.object(executor_module, "_HEARTBEAT_LOSS_ABORT_S", -1.0) + + async def boom(*_a, **_kw): + raise RuntimeError("mongo down") + + mocker.patch.object(executor_module, "heartbeat_workflow", boom) + closed = {"v": False} + + async def stream(): + try: + yield Heartbeat() + yield Complete(artifacts={}) # pragma: no cover (aborted first) + finally: + closed["v"] = True + + # Act + await executor._consume_and_finalize(record, stream(), tmp_workdir / "wd") # Assert final = await get_job(mock_db, record.job_id) assert final is not None assert final.status == JobStatus.FAILED - assert final.error is not None - assert final.error.startswith("capacity: "), ( - f"expected 'capacity: ' prefix, got {final.error!r}" + assert "heartbeat lost" in (final.error or "") + assert closed["v"] is True + + @pytest.mark.asyncio + async def test__consume_and_finalize_should_tolerate_a_single_heartbeat_failure( + self, mock_db, tmp_cache, tmp_workdir, mocker + ): + """Test that one transient heartbeat-write failure does not abort. + + Given: + A RUNNING job whose first ``heartbeat_workflow`` write fails but + the abort threshold is far in the future. + When: + ``_consume_and_finalize`` consumes a heartbeat then a complete. + Then: + The single failure is logged and swallowed and the job still + reaches COMPLETED. + """ + # Arrange + _install_jobs_index(mock_db) + registry = ProcessorRegistry() + registry.register(_StubProcessor()) + executor = WoolExecutor( + mock_db, tmp_cache, registry, workdir_root=tmp_workdir + ) + record = await _seed_pending_job(mock_db, status=JobStatus.RUNNING) + mocker.patch.object(executor_module, "_HEARTBEAT_LOSS_ABORT_S", 9999.0) + calls = {"n": 0} + + async def flaky(*_a, **_kw): + calls["n"] += 1 + if calls["n"] == 1: + raise RuntimeError("transient") + + mocker.patch.object(executor_module, "heartbeat_workflow", flaky) + + async def stream(): + yield Heartbeat() + yield Complete(artifacts={}) + + # Act + await executor._consume_and_finalize(record, stream(), tmp_workdir / "wd") + + # Assert + final = await get_job(mock_db, record.job_id) + assert final is not None + assert final.status == JobStatus.COMPLETED + + +class TestAttemptDispatchWorkdir: + @pytest.mark.asyncio + async def test__attempt_dispatch_should_use_a_distinct_workdir_per_attempt( + self, mock_db, tmp_cache, tmp_workdir, mocker + ): + """Test that each dispatch attempt for a job gets its own workdir. + + Given: + A queued job whose dispatch attempts always overflow. + When: + ``_attempt_dispatch`` runs twice for the same job. + Then: + The two attempts should compute distinct workdirs (both prefixed + with the job_id), so a re-dispatch can never share scratch with a + still-live prior attempt. + """ + # Arrange + _install_jobs_index(mock_db) + registry = ProcessorRegistry() + processor = _StubProcessor() + registry.register(processor) + executor = WoolExecutor( + mock_db, tmp_cache, registry, workdir_root=tmp_workdir ) - assert "pool empty" in final.error + record = await _seed_pending_job(mock_db) + seen: list[Path] = [] + + async def capture(_processor, _file_meta, workdir): + seen.append(workdir) + raise wool.NoWorkersAvailable("overflow") + + mocker.patch.object(executor, "_open_stream_once", new=capture) + + # Act + await executor._attempt_dispatch(record, processor, _file_meta()) + await executor._attempt_dispatch(record, processor, _file_meta()) + + # Assert + assert len(seen) == 2 + assert seen[0] != seen[1] + assert all(w.name.startswith(record.job_id) for w in seen) @pytest.mark.asyncio - async def test_ensure_workflow_should_record_provisioner_prefix_for_generic_exception( + async def test__attempt_dispatch_should_clean_workdir_when_mark_running_rejected( + self, mock_db, tmp_cache, tmp_workdir, mocker + ): + """Test that the hand-off branch cleans up its workdir. + + Given: + A worker accepts (the routine created the per-attempt workdir), + but ``mark_running`` rejects because a successor now owns the row. + When: + ``_attempt_dispatch`` runs. + Then: + It should remove this attempt's workdir (no scratch leak) and + leave the row untouched for the successor — without releasing the + mutex. + """ + # Arrange + _install_jobs_index(mock_db) + registry = ProcessorRegistry() + processor = _StubProcessor() + registry.register(processor) + executor = WoolExecutor( + mock_db, tmp_cache, registry, workdir_root=tmp_workdir + ) + record = await _seed_pending_job(mock_db) + captured: dict[str, Path] = {} + + async def fake_open(_processor, _file_meta, workdir): + workdir.mkdir(parents=True, exist_ok=True) + (workdir / "scratch").write_bytes(b"x") + captured["workdir"] = workdir + + async def s(): + yield Heartbeat() + + return s() + + mocker.patch.object(executor, "_open_stream_once", new=fake_open) + + async def reject(*_a, **_kw): + raise RuntimeError("row no longer PENDING") + + mocker.patch.object(executor_module, "mark_running", reject) + + # Act + await executor._attempt_dispatch(record, processor, _file_meta()) + + # Assert + assert "workdir" in captured + assert not captured["workdir"].exists() + final = await get_job(mock_db, record.job_id) + assert final is not None + assert final.status == JobStatus.PENDING + + +class TestWoolExecutorMarkRunningOnAcceptance: + @pytest.mark.asyncio + async def test__attempt_dispatch_should_mark_running_before_first_processor_event( + self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch + ): + """Test that a job is marked RUNNING the instant a worker accepts it. + + Given: + A processor that blocks before yielding any real event (standing + in for a slow upstream download, which is the processor's first + action before its first stage event). + When: + ``ensure_workflow`` dispatches the inline attempt and a worker + accepts it. + Then: + The job should reach RUNNING while the processor is still blocked + — driven by the routine's leading "worker accepted" heartbeat, + not the first processor event. This is the invariant that keeps a + job from lingering PENDING (and being re-leased / double- + dispatched onto the same workdir) for the whole download. + """ + # Arrange + _install_jobs_index(mock_db) + release = asyncio.Event() + + class _BlockBeforeFirstEvent(_StubProcessor): + async def run(self, file_meta, workdir, cache_root): + self.run_calls += 1 + await release.wait() # block before the first real event + yield Complete(artifacts={}) + + registry = ProcessorRegistry() + registry.register(_BlockBeforeFirstEvent()) + executor = WoolExecutor( + mock_db, tmp_cache, registry, workdir_root=tmp_workdir + ) + + # Act + record, _ = await executor.ensure_workflow(_file_meta()) + try: + # Would time out (job stuck PENDING) if mark_running waited for + # the first processor event rather than the acceptance heartbeat. + await _wait_for_status(mock_db, record.job_id, JobStatus.RUNNING) + running = await get_job(mock_db, record.job_id) + release.set() + await _wait_for_terminal(mock_db, record.job_id) + finally: + release.set() + await executor.drain(timeout=2.0) + + # Assert + assert running is not None + assert running.status == JobStatus.RUNNING + final = await get_job(mock_db, record.job_id) + assert final is not None + assert final.status == JobStatus.COMPLETED + + +class TestWoolExecutorAdmission: + @pytest.mark.asyncio + async def test_ensure_workflow_should_reject_at_ceiling( + self, mock_db, tmp_cache, tmp_workdir, mocker + ): + """Test that ensure_workflow sheds load once the active ceiling is hit. + + Given: + One active workflow already in the jobs collection and the + admission ceiling pinned to 1. + When: + ``ensure_workflow`` is awaited for another file. + Then: + It should raise ``AdmissionRejected`` carrying the observed + active count, the ceiling, and a positive ``retry_after`` hint + — before claiming the mutex, so a flood is shed at the door. + """ + # Arrange + _install_jobs_index(mock_db) + registry = ProcessorRegistry() + registry.register(_StubProcessor()) + executor = WoolExecutor( + mock_db, tmp_cache, registry, workdir_root=tmp_workdir + ) + mocker.patch.object(executor_module, "_MAX_ACTIVE_WORKFLOWS", 1) + await _seed_pending_job(mock_db) # one active workflow + + # Act & assert + with pytest.raises(AdmissionRejected) as excinfo: + await executor.ensure_workflow(_file_meta()) + assert excinfo.value.active == 1 + assert excinfo.value.ceiling == 1 + assert excinfo.value.retry_after_seconds >= 1 + + @pytest.mark.asyncio + async def test_ensure_workflow_should_admit_below_ceiling( self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch, mocker ): - """Test that non-retryable provisioner failures use the 'provisioner: ' prefix. + """Test that ensure_workflow claims normally below the ceiling. Given: - A provisioner whose ``request`` raises a generic - ``RuntimeError`` (a misconfiguration or unexpected boto - failure, not a retryable transient). + An admission ceiling of 5 and no active workflows. When: ``ensure_workflow`` is awaited. Then: - The job should reach ``FAILED`` with an error string that - starts with ``"provisioner: "`` so clients can distinguish - "retry me later" from "the provisioner crashed". + It should claim and dispatch the job to completion — the + ceiling only sheds load once it is reached. """ # Arrange _install_jobs_index(mock_db) registry = ProcessorRegistry() registry.register(_StubProcessor()) - provisioner = mocker.AsyncMock(spec=EcsProvisioner) - provisioner.request.side_effect = RuntimeError("misconfigured cluster") executor = WoolExecutor( - mock_db, - tmp_cache, - registry, - workdir_root=tmp_workdir, - provisioner=provisioner, + mock_db, tmp_cache, registry, workdir_root=tmp_workdir ) + mocker.patch.object(executor_module, "_MAX_ACTIVE_WORKFLOWS", 5) # Act - record, _ = await executor.ensure_workflow(_file_meta()) + record, fresh = await executor.ensure_workflow(_file_meta()) await _wait_for_terminal(mock_db, record.job_id) # Assert + assert fresh is True + final = await get_job(mock_db, record.job_id) + assert final is not None + assert final.status == JobStatus.COMPLETED + + +class TestWoolExecutorDispatchDeadline: + @pytest.mark.asyncio + async def test__attempt_dispatch_should_fail_capacity_when_deadline_exceeded( + self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch, mocker + ): + """Test that a job past its dispatch deadline is failed ``capacity:``. + + Given: + A queued job submitted long ago (older than a tiny patched + dispatch deadline). + When: + ``_attempt_dispatch`` runs. + Then: + The job should be failed with the ``capacity:`` prefix without + attempting dispatch — the deadline bounds the durable retry + loop so a job can't queue forever. + """ + # Arrange + _install_jobs_index(mock_db) + registry = ProcessorRegistry() + processor = _StubProcessor() + registry.register(processor) + executor = WoolExecutor( + mock_db, tmp_cache, registry, workdir_root=tmp_workdir + ) + mocker.patch.object(executor_module, "_DISPATCH_DEADLINE_SECONDS", 1.0) + old = datetime.now(timezone.utc) - timedelta(seconds=10000) + record = await _seed_pending_job(mock_db, submitted_at=old) + open_spy = mocker.patch.object( + executor, "_open_stream_once", new=mocker.AsyncMock() + ) + + # Act + await executor._attempt_dispatch(record, processor, _file_meta()) + + # Assert + open_spy.assert_not_awaited() # never tried to dispatch final = await get_job(mock_db, record.job_id) assert final is not None assert final.status == JobStatus.FAILED - assert final.error is not None - assert final.error.startswith("provisioner: "), ( - f"expected 'provisioner: ' prefix, got {final.error!r}" + assert (final.error or "").startswith("capacity:") + + +class TestWoolExecutorScheduler: + @pytest.mark.asyncio + async def test__drain_due_jobs_should_dispatch_a_due_queued_job( + self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch + ): + """Test that the scheduler tick dispatches a due rescheduled job. + + Given: + A queued job whose ``next_dispatch_at`` is in the past (an + earlier attempt overflowed) and a registered processor. + When: + One scheduler tick (``_drain_due_jobs``) runs. + Then: + The job should be leased, dispatched to the in-process worker, + and reach COMPLETED — the durable retry path picking up a + previously-overflowed job. + """ + # Arrange + _install_jobs_index(mock_db) + registry = ProcessorRegistry() + registry.register(_StubProcessor()) + executor = WoolExecutor( + mock_db, tmp_cache, registry, workdir_root=tmp_workdir ) - assert "misconfigured cluster" in final.error + due = datetime.now(timezone.utc) - timedelta(seconds=1) + record = await _seed_pending_job(mock_db, next_dispatch_at=due) + + # Act + await executor._drain_due_jobs() + await _wait_for_terminal(mock_db, record.job_id) + + # Assert + final = await get_job(mock_db, record.job_id) + assert final is not None + assert final.status == JobStatus.COMPLETED + + @pytest.mark.asyncio + async def test__drain_due_jobs_should_ignore_not_yet_due_jobs( + self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch + ): + """Test that a queued job whose retry time is in the future is skipped. + + Given: + A queued job whose ``next_dispatch_at`` is in the future. + When: + One scheduler tick runs. + Then: + The job should be left PENDING and undispatched — only due + jobs are leased, so a not-yet-due retry isn't run early. + """ + # Arrange + _install_jobs_index(mock_db) + registry = ProcessorRegistry() + processor = _StubProcessor() + registry.register(processor) + executor = WoolExecutor( + mock_db, tmp_cache, registry, workdir_root=tmp_workdir + ) + future = datetime.now(timezone.utc) + timedelta(seconds=3600) + record = await _seed_pending_job(mock_db, next_dispatch_at=future) + + # Act + await executor._drain_due_jobs() + + # Assert + final = await get_job(mock_db, record.job_id) + assert final is not None + assert final.status == JobStatus.PENDING + assert processor.run_calls == 0 + + @pytest.mark.asyncio + async def test__recover_orphans_should_requeue_and_dispatch_a_stale_running_job( + self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch + ): + """Test that a RUNNING job orphaned by a crash is recovered autonomously. + + Given: + A stale RUNNING job (no heartbeat past the stale threshold, + `next_dispatch_at` unset) — what an API crash mid-stream leaves + behind — and a registered processor. + When: + A scheduler tick runs (`_recover_orphans` then `_drain_due_jobs`). + Then: + The orphan is re-queued to PENDING, leased, dispatched, and + reaches COMPLETED — without any new request for the file, closing + the restart-recovery gap. + """ + # Arrange + _install_jobs_index(mock_db) + registry = ProcessorRegistry() + registry.register(_StubProcessor()) + executor = WoolExecutor( + mock_db, tmp_cache, registry, workdir_root=tmp_workdir + ) + stale = datetime.now(timezone.utc) - timedelta(seconds=2000) + record = await _seed_pending_job( + mock_db, + status=JobStatus.RUNNING, + updated_at=stale, + next_dispatch_at=None, + ) + + # Act — one scheduler tick: recover orphans, then dispatch due jobs. + await executor._recover_orphans() + await executor._drain_due_jobs() + await _wait_for_terminal(mock_db, record.job_id) + + # Assert + final = await get_job(mock_db, record.job_id) + assert final is not None + assert final.status == JobStatus.COMPLETED + + @pytest.mark.asyncio + async def test__recover_orphans_should_leave_a_fresh_running_job_alone( + self, mock_db, tmp_cache, tmp_workdir, no_wool_dispatch + ): + """Test that a healthy in-flight RUNNING job is not reclaimed. + + Given: + A RUNNING job with a recent `updated_at` (a live, heartbeating + job). + When: + `_recover_orphans` runs. + Then: + The job is left RUNNING — the staleness gate keeps the sweep from + yanking a live job away from its worker. + """ + # Arrange + _install_jobs_index(mock_db) + registry = ProcessorRegistry() + registry.register(_StubProcessor()) + executor = WoolExecutor( + mock_db, tmp_cache, registry, workdir_root=tmp_workdir + ) + record = await _seed_pending_job( + mock_db, status=JobStatus.RUNNING, next_dispatch_at=None + ) + + # Act + await executor._recover_orphans() + + # Assert + job = await get_job(mock_db, record.job_id) + assert job is not None + assert job.status == JobStatus.RUNNING + + @pytest.mark.asyncio + async def test__scheduler_loop_should_survive_a_tick_exception( + self, mock_db, tmp_cache, tmp_workdir, mocker + ): + """Test that an exception in one tick doesn't kill the scheduler. + + Given: + A scheduler whose lease query raises on its first call then + returns nothing, with a tiny retry interval. + When: + ``start_scheduler`` runs the loop across several ticks. + Then: + The scheduler task should still be alive and have ticked again + after the exception — a transient Mongo blip must not silently + stop the dispatch driver. + """ + # Arrange + _install_jobs_index(mock_db) + registry = ProcessorRegistry() + registry.register(_StubProcessor()) + executor = WoolExecutor( + mock_db, tmp_cache, registry, workdir_root=tmp_workdir + ) + calls = {"n": 0} + + async def flaky_lease(*_a, **_k): + calls["n"] += 1 + if calls["n"] == 1: + raise RuntimeError("transient mongo blip") + return None + + mocker.patch.object(executor_module, "lease_due_dispatch", flaky_lease) + mocker.patch.object(executor_module, "_RETRY_INTERVAL_SECONDS", 0.01) + + # Act + executor.start_scheduler() + try: + await asyncio.sleep(0.1) + # Assert — survived the first-tick exception and ticked again. + assert executor._scheduler_task is not None + assert not executor._scheduler_task.done() + assert calls["n"] >= 2 + finally: + await executor.drain(timeout=1.0) + + @pytest.mark.asyncio + async def test__drain_due_jobs_should_lease_at_most_the_per_tick_cap( + self, mock_db, tmp_cache, tmp_workdir, mocker + ): + """Test that one tick leases no more than the per-tick cap. + + Given: + More due queued jobs than ``_MAX_DISPATCHES_PER_TICK`` (the cap + monkeypatched low) and a stubbed ``_spawn_attempt`` so the tick + only counts dispatches. + When: + One scheduler tick (``_drain_due_jobs``) runs. + Then: + Exactly the cap's worth of attempts are spawned and the + remainder is left leasable for a subsequent tick — the + thundering-herd guard bounds the per-tick fan-out so a large + backlog can't spawn thousands of concurrent attempts at once. + """ + # Arrange + _install_jobs_index(mock_db) + registry = ProcessorRegistry() + registry.register(_StubProcessor()) + executor = WoolExecutor( + mock_db, tmp_cache, registry, workdir_root=tmp_workdir + ) + mocker.patch.object(executor_module, "_MAX_DISPATCHES_PER_TICK", 2) + due = datetime.now(timezone.utc) - timedelta(seconds=1) + for i in range(5): # cap (2) + 3 remainder, each a distinct workflow + await _seed_pending_job( + mock_db, + meta={**_file_meta(), "local_id": f"ENCFF{i:03d}"}, + next_dispatch_at=due, + ) + spawn = mocker.patch.object(executor, "_spawn_attempt") + + # Act — a single tick. + await executor._drain_due_jobs() + + # Assert — capped at the per-tick limit... + assert spawn.call_count == 2 + # ...with the remainder still due and leasable on the next tick. + leftover = await executor_module.lease_due_dispatch( + mock_db, + now=datetime.now(timezone.utc), + next_at=datetime.now(timezone.utc) + timedelta(seconds=60), + ) + assert leftover is not None + + +class TestWoolExecutorStartScheduler: + @pytest.mark.asyncio + async def test_drain_should_cancel_the_scheduler( + self, mock_db, tmp_cache, tmp_workdir + ): + """Test that drain stops the durable scheduler task. + + Given: + An executor whose scheduler has been started. + When: + ``drain`` runs. + Then: + The scheduler task should be cancelled and cleared, so it stops + dispatching before the wool pool closes on shutdown. + """ + # Arrange + _install_jobs_index(mock_db) + registry = ProcessorRegistry() + registry.register(_StubProcessor()) + executor = WoolExecutor( + mock_db, tmp_cache, registry, workdir_root=tmp_workdir + ) + executor.start_scheduler() + task = executor._scheduler_task + assert task is not None + + # Act + await executor.drain(timeout=1.0) + + # Assert + assert task.cancelled() or task.done() + assert executor._scheduler_task is None diff --git a/tests/test_workflows/test_loadbalancer.py b/tests/test_workflows/test_loadbalancer.py new file mode 100644 index 0000000..01b61be --- /dev/null +++ b/tests/test_workflows/test_loadbalancer.py @@ -0,0 +1,326 @@ +"""Tests for the priority (leaky-bucket) load balancer.""" + +from __future__ import annotations + +from dataclasses import dataclass +from types import MappingProxyType +from typing import Any + +import cloudpickle +import pytest +import wool + +from cfdb.workflows.loadbalancer import PriorityLoadBalancer + + +@dataclass(frozen=True) +class _FakeMeta: + """Stand-in for WorkerMetadata; frozen so it is hashable as a dict key.""" + + uid: str + + +class _FakeConn: + """Worker connection stub whose dispatch returns a stream or raises.""" + + def __init__(self, behavior: Any) -> None: + # behavior: a sentinel "stream" to return, or an Exception to raise. + self.behavior = behavior + self.calls = 0 + + async def dispatch(self, task: Any, *, timeout: float | None = None) -> Any: + self.calls += 1 + if isinstance(self.behavior, Exception): + raise self.behavior + return self.behavior + + +class _FakeContext: + """Minimal LoadBalancerContextLike backed by an ordered dict.""" + + def __init__(self, workers: dict[_FakeMeta, _FakeConn]) -> None: + self._workers = dict(workers) + self.removed: list[_FakeMeta] = [] + + @property + def workers(self) -> MappingProxyType: + return MappingProxyType(self._workers) + + def add_worker(self, metadata: _FakeMeta, connection: _FakeConn) -> None: + self._workers[metadata] = connection + + def update_worker(self, metadata, connection, *, upsert: bool = False) -> None: + if upsert or metadata in self._workers: + self._workers[metadata] = connection + + def remove_worker(self, metadata: _FakeMeta) -> None: + self.removed.append(metadata) + self._workers.pop(metadata, None) + + +class TestPriorityLoadBalancer: + @pytest.mark.asyncio + async def test_dispatch_should_return_first_worker_in_priority_order(self): + """Test that the lowest-uid worker is offered the task first. + + Given: + Three healthy workers inserted out of uid order. + When: + A task is dispatched. + Then: + It should return the lowest-uid worker's stream and not even + contact the higher-uid workers (priority short-circuit). + """ + # Arrange — insert c, a, b so insertion order != priority order. + conns = { + _FakeMeta("c"): _FakeConn("stream-c"), + _FakeMeta("a"): _FakeConn("stream-a"), + _FakeMeta("b"): _FakeConn("stream-b"), + } + context = _FakeContext(conns) + + # Act + stream = await PriorityLoadBalancer().dispatch(object(), context=context) + + # Assert + assert stream == "stream-a" + assert conns[_FakeMeta("a")].calls == 1 + assert conns[_FakeMeta("b")].calls == 0 + assert conns[_FakeMeta("c")].calls == 0 + + @pytest.mark.asyncio + async def test_dispatch_should_rotate_past_busy_worker_without_evicting(self): + """Test that a transiently-rejecting worker is skipped, not evicted. + + Given: + The priority worker rejects with TransientRpcError (a + backpressure RESOURCE_EXHAUSTED) and the next accepts. + When: + A task is dispatched. + Then: + It should return the next worker's stream and leave the busy + worker in the pool (no eviction). + """ + # Arrange + busy = _FakeConn(wool.TransientRpcError(details="busy")) + free = _FakeConn("stream-b") + context = _FakeContext({_FakeMeta("a"): busy, _FakeMeta("b"): free}) + + # Act + stream = await PriorityLoadBalancer().dispatch(object(), context=context) + + # Assert + assert stream == "stream-b" + assert busy.calls == 1 + assert context.removed == [] + + @pytest.mark.asyncio + async def test_dispatch_should_evict_worker_on_non_transient_error(self): + """Test that a non-transient RpcError evicts the failing worker. + + Given: + The priority worker fails with a non-transient RpcError and the + next accepts. + When: + A task is dispatched. + Then: + It should return the next worker's stream and evict the failing + worker via remove_worker. + """ + # Arrange + broken_meta = _FakeMeta("a") + broken = _FakeConn(wool.RpcError(details="broken")) + free = _FakeConn("stream-b") + context = _FakeContext({broken_meta: broken, _FakeMeta("b"): free}) + + # Act + stream = await PriorityLoadBalancer().dispatch(object(), context=context) + + # Assert + assert stream == "stream-b" + assert context.removed == [broken_meta] + + @pytest.mark.asyncio + async def test_dispatch_should_raise_when_pool_is_empty(self): + """Test that an empty pool surfaces NoWorkersAvailable. + + Given: + A context with no workers. + When: + A task is dispatched. + Then: + It should raise NoWorkersAvailable (the executor's signal to add + a worker and re-queue). + """ + # Arrange + context = _FakeContext({}) + + # Act & assert + with pytest.raises(wool.NoWorkersAvailable): + await PriorityLoadBalancer().dispatch(object(), context=context) + + @pytest.mark.asyncio + async def test_dispatch_should_raise_when_all_workers_reject(self): + """Test that an all-busy pool surfaces NoWorkersAvailable. + + Given: + Every worker rejects with TransientRpcError (all at capacity). + When: + A task is dispatched. + Then: + It should raise NoWorkersAvailable after one full pass, without + evicting the still-healthy busy workers. + """ + # Arrange + a = _FakeConn(wool.TransientRpcError(details="busy")) + b = _FakeConn(wool.TransientRpcError(details="busy")) + context = _FakeContext({_FakeMeta("a"): a, _FakeMeta("b"): b}) + + # Act & assert + with pytest.raises(wool.NoWorkersAvailable): + await PriorityLoadBalancer().dispatch(object(), context=context) + + # ...and busy workers stay in the pool for the next attempt. + assert context.removed == [] + assert a.calls == 1 and b.calls == 1 + + @pytest.mark.asyncio + async def test_dispatch_should_let_unexpected_errors_propagate(self): + """Test that a non-RpcError propagates without touching the pool. + + Given: + The priority worker raises a plain ValueError (a caller-fault, + not a worker-health signal). + When: + A task is dispatched. + Then: + It should propagate the ValueError unwrapped and not evict the + worker, per wool's worker-health contract. + """ + # Arrange + meta = _FakeMeta("a") + context = _FakeContext({meta: _FakeConn(ValueError("boom"))}) + + # Act & assert + with pytest.raises(ValueError, match="boom"): + await PriorityLoadBalancer().dispatch(object(), context=context) + assert context.removed == [] + + def test_cloudpickle_roundtrip_should_succeed(self): + """Test that the stateless balancer cloudpickle-serializes. + + Given: + A PriorityLoadBalancer instance. + When: + It is cloudpickle dumped and loaded back. + Then: + The round-trip yields another PriorityLoadBalancer (no lock or + index state to poison serialization). + """ + # Act + restored = cloudpickle.loads(cloudpickle.dumps(PriorityLoadBalancer())) + + # Assert + assert isinstance(restored, PriorityLoadBalancer) + + def test_conforms_to_loadbalancer_like_protocol(self): + """Test that the balancer satisfies wool's LoadBalancerLike protocol. + + Given: + A constructed balancer. + When: + It is checked against the runtime-checkable LoadBalancerLike + protocol. + Then: + It should be recognized as a LoadBalancerLike so WorkerPool + accepts it. + """ + # Act & assert + assert isinstance(PriorityLoadBalancer(), wool.LoadBalancerLike) + + @pytest.mark.asyncio + async def test_dispatch_should_pick_same_priority_worker_across_dispatches(self): + """Test that the priority order is stable across repeated dispatches. + + Given: + Three healthy workers and a single balancer + context reused + across two consecutive dispatches. + When: + Two tasks are dispatched in succession. + Then: + Both should go to the lowest-uid worker — the leaky-bucket + invariant, distinct from round-robin which advances each call. + """ + # Arrange + conns = { + _FakeMeta("a"): _FakeConn("s-a"), + _FakeMeta("b"): _FakeConn("s-b"), + _FakeMeta("c"): _FakeConn("s-c"), + } + context = _FakeContext(conns) + balancer = PriorityLoadBalancer() + + # Act + first = await balancer.dispatch(object(), context=context) + second = await balancer.dispatch(object(), context=context) + + # Assert + assert first == "s-a" + assert second == "s-a" + assert conns[_FakeMeta("a")].calls == 2 + assert conns[_FakeMeta("b")].calls == 0 + + @pytest.mark.asyncio + async def test_dispatch_should_continue_in_order_after_evicting_first_worker(self): + """Test that evicting the first worker still tries the rest in order. + + Given: + Three workers where the lowest-uid one fails with a non-transient + RpcError and the next accepts. + When: + A task is dispatched. + Then: + It should evict the first, dispatch to the second, and finish the + pass without error — pinning that iteration runs over a snapshot, + not the live worker map mutated by remove_worker. + """ + # Arrange + broken = _FakeMeta("a") + conns = { + broken: _FakeConn(wool.RpcError(details="broken")), + _FakeMeta("b"): _FakeConn("s-b"), + _FakeMeta("c"): _FakeConn("s-c"), + } + context = _FakeContext(conns) + + # Act + stream = await PriorityLoadBalancer().dispatch(object(), context=context) + + # Assert + assert stream == "s-b" + assert context.removed == [broken] + assert conns[_FakeMeta("c")].calls == 0 + + @pytest.mark.asyncio + async def test_dispatch_should_order_workers_by_string_form_of_uid(self): + """Test that workers are ordered by the string form of their uid. + + Given: + Two workers with non-string uids (10 and 2) whose lexicographic + string order ("10" < "2") differs from numeric order. + When: + A task is dispatched. + Then: + It should pick the worker whose str(uid) sorts first ("10"), + pinning the str(uid) coercion that gives a total, orderable key + regardless of the uid's concrete type. + """ + # Arrange + conns = {_FakeMeta(10): _FakeConn("s-10"), _FakeMeta(2): _FakeConn("s-2")} + context = _FakeContext(conns) + + # Act + stream = await PriorityLoadBalancer().dispatch(object(), context=context) + + # Assert + assert stream == "s-10" diff --git a/tests/test_workflows/test_lock.py b/tests/test_workflows/test_lock.py index 7061e7f..af7a9ae 100644 --- a/tests/test_workflows/test_lock.py +++ b/tests/test_workflows/test_lock.py @@ -5,22 +5,54 @@ from datetime import datetime, timedelta, timezone import pytest +from hypothesis import given +from hypothesis import strategies as st from cfdb.workflows import lock -from cfdb.workflows.models import ACTIVE_STATUSES, ArtifactKind, JobStatus +from cfdb.workflows.models import ArtifactKind, JobRecord, JobStatus from tests.test_workflows import FIXTURE_MD5 PIPELINE_VERSION = 1 +def _insert_job( + mock_db, + *, + workflow_key: str, + status: JobStatus, + next_dispatch_at: datetime | None = None, + dispatch_attempts: int = 0, + updated_at: datetime | None = None, +) -> JobRecord: + """Insert a crafted job doc into the fake collection and return it.""" + now = datetime.now(timezone.utc) + record = JobRecord( + job_id=f"job-{workflow_key}", + workflow_key=workflow_key, + status=status, + dcc="encode", + local_id="x", + md5=FIXTURE_MD5, + pipeline_version=PIPELINE_VERSION, + submitted_at=now, + updated_at=updated_at or now, + next_dispatch_at=next_dispatch_at, + dispatch_attempts=dispatch_attempts, + ) + mock_db.jobs.docs.append(record.to_mongo()) + return record + + def _install_jobs_index(mock_db) -> None: """Register the partial unique index on the FakeDB jobs collection.""" mock_db.jobs.create_index( {"workflow_key": 1}, unique=True, - partialFilterExpression={ - "status": {"$in": [s.value for s in ACTIVE_STATUSES]} - }, + # Match production: the mutex partial-unique index filters on the + # ``active`` boolean discriminator (DocumentDB rejects ``$in`` inside + # a partialFilterExpression). Keying the fake on ``active`` lets the + # unit layer catch an ``active``/``status`` lockstep drift. + partialFilterExpression={"active": True}, ) @@ -777,6 +809,65 @@ def test_record_from_mongo_should_attach_utc_to_naive_timestamps(self): assert record.submitted_at.tzinfo is timezone.utc assert record.updated_at.tzinfo is timezone.utc + @given( + submitted=st.datetimes(timezones=st.just(timezone.utc)), + updated=st.datetimes(timezones=st.just(timezone.utc)), + next_dispatch=st.one_of( + st.none(), st.datetimes(timezones=st.just(timezone.utc)) + ), + ) + def test_record_from_mongo_should_roundtrip_aware_and_naive_datetimes( + self, submitted, updated, next_dispatch + ): + """Test that the to_mongo/_record_from_mongo datetime round-trip is total. + + Given: + A JobRecord with arbitrary aware-UTC ``submitted_at`` / + ``updated_at`` and an aware-or-``None`` ``next_dispatch_at``, + serialized via ``to_mongo`` and then rendered naive — the BSON + Date round-trip some Motor / mongomock-motor variants produce. + When: + ``_record_from_mongo`` hydrates the doc back into a JobRecord. + Then: + Every datetime instant is preserved and re-attached to UTC, and + a ``None`` ``next_dispatch_at`` survives as ``None`` — the third + nullable datetime threads the round-trip without loss. + """ + # Arrange + record = JobRecord( + job_id="job-rt", + workflow_key=f"encode/x/{FIXTURE_MD5}/v1", + status=JobStatus.PENDING, + dcc="encode", + local_id="x", + md5=FIXTURE_MD5, + pipeline_version=1, + submitted_at=submitted, + updated_at=updated, + next_dispatch_at=next_dispatch, + ) + doc = record.to_mongo() + # Simulate the BSON Date round-trip: strip tzinfo on stored dates, + # which is the naive rendering ``_record_from_mongo`` must repair. + for field in ("submitted_at", "updated_at", "next_dispatch_at"): + value = doc.get(field) + if isinstance(value, datetime): + doc[field] = value.astimezone(timezone.utc).replace(tzinfo=None) + + # Act + hydrated = lock._record_from_mongo(doc) + + # Assert + assert hydrated.submitted_at == submitted + assert hydrated.updated_at == updated + assert hydrated.submitted_at.tzinfo is timezone.utc + assert hydrated.updated_at.tzinfo is timezone.utc + if next_dispatch is None: + assert hydrated.next_dispatch_at is None + else: + assert hydrated.next_dispatch_at == next_dispatch + assert hydrated.next_dispatch_at.tzinfo is timezone.utc + class TestClaimWorkflowSupersededByChain: @pytest.mark.asyncio @@ -875,3 +966,465 @@ async def test_get_job_should_hydrate_persisted_record(self, mock_db): assert hydrated is not None assert hydrated.job_id == record.job_id assert hydrated.workflow_key == record.workflow_key + + +class TestCountActiveWorkflows: + @pytest.mark.asyncio + async def test_count_active_workflows_should_count_pending_and_running( + self, mock_db + ): + """Test that the admission count includes only active jobs. + + Given: + Two pending, one running, and two terminal job rows. + When: + count_active_workflows is awaited. + Then: + It should return 3 — pending + running only, excluding the + terminal rows. + """ + # Arrange + _insert_job(mock_db, workflow_key="a", status=JobStatus.PENDING) + _insert_job(mock_db, workflow_key="b", status=JobStatus.PENDING) + _insert_job(mock_db, workflow_key="c", status=JobStatus.RUNNING) + _insert_job(mock_db, workflow_key="d", status=JobStatus.COMPLETED) + _insert_job(mock_db, workflow_key="e", status=JobStatus.FAILED) + + # Act + count = await lock.count_active_workflows(mock_db) + + # Assert + assert count == 3 + + @pytest.mark.asyncio + async def test_count_active_workflows_should_return_zero_when_empty(self, mock_db): + """Test that an empty collection counts as zero active workflows. + + Given: + A jobs collection with no rows. + When: + count_active_workflows is awaited. + Then: + It should return 0. + """ + # Act + count = await lock.count_active_workflows(mock_db) + + # Assert + assert count == 0 + + +class TestRescheduleDispatch: + @pytest.mark.asyncio + async def test_reschedule_dispatch_should_defer_pending_and_bump_attempts( + self, mock_db + ): + """Test that rescheduling a pending job advances its retry time. + + Given: + A pending job with no next_dispatch_at and zero attempts. + When: + reschedule_dispatch is awaited with a future next_at. + Then: + It should set next_dispatch_at to that time and increment + dispatch_attempts. + """ + # Arrange + _insert_job(mock_db, workflow_key="a", status=JobStatus.PENDING) + next_at = datetime.now(timezone.utc) + timedelta(seconds=120) + + # Act + await lock.reschedule_dispatch(mock_db, "job-a", next_at=next_at) + + # Assert + doc = mock_db.jobs.docs[0] + assert doc["next_dispatch_at"] == next_at + assert doc["dispatch_attempts"] == 1 + + @pytest.mark.asyncio + async def test_reschedule_dispatch_should_be_noop_when_not_pending(self, mock_db): + """Test that rescheduling a non-pending job changes nothing. + + Given: + A running job (it already won a worker). + When: + reschedule_dispatch is awaited for it. + Then: + It should leave next_dispatch_at and dispatch_attempts untouched, + because the update is fenced on PENDING. + """ + # Arrange + _insert_job(mock_db, workflow_key="a", status=JobStatus.RUNNING) + next_at = datetime.now(timezone.utc) + timedelta(seconds=120) + + # Act + await lock.reschedule_dispatch(mock_db, "job-a", next_at=next_at) + + # Assert + doc = mock_db.jobs.docs[0] + assert doc["next_dispatch_at"] is None + assert doc["dispatch_attempts"] == 0 + + +class TestLeaseDueDispatch: + @pytest.mark.asyncio + async def test_lease_due_dispatch_should_claim_due_job_and_push_forward( + self, mock_db + ): + """Test that a due pending job is leased and its next attempt deferred. + + Given: + A pending job whose next_dispatch_at is in the past. + When: + lease_due_dispatch is awaited with now past that time and a + future next_at. + Then: + It should return the job and push its next_dispatch_at forward to + next_at so a later tick won't re-lease it. + """ + # Arrange + now = datetime.now(timezone.utc) + _insert_job( + mock_db, + workflow_key="a", + status=JobStatus.PENDING, + next_dispatch_at=now - timedelta(seconds=1), + ) + next_at = now + timedelta(seconds=120) + + # Act + leased = await lock.lease_due_dispatch(mock_db, now=now, next_at=next_at) + + # Assert + assert leased is not None + assert leased.workflow_key == "a" + assert mock_db.jobs.docs[0]["next_dispatch_at"] == next_at + + @pytest.mark.asyncio + async def test_lease_due_dispatch_should_not_re_lease_within_lease_window( + self, mock_db + ): + """Test that a freshly-leased job is not immediately re-leasable. + + Given: + One due pending job that has just been leased (its + next_dispatch_at pushed into the future). + When: + lease_due_dispatch is awaited again at the same now. + Then: + It should return None — the single-claim guard, so two ticks or + replicas cannot double-dispatch one job. + """ + # Arrange + now = datetime.now(timezone.utc) + _insert_job( + mock_db, + workflow_key="a", + status=JobStatus.PENDING, + next_dispatch_at=now - timedelta(seconds=1), + ) + next_at = now + timedelta(seconds=120) + first = await lock.lease_due_dispatch(mock_db, now=now, next_at=next_at) + + # Act + second = await lock.lease_due_dispatch(mock_db, now=now, next_at=next_at) + + # Assert + assert first is not None + assert second is None + + @pytest.mark.asyncio + async def test_lease_due_dispatch_should_skip_unscheduled_job(self, mock_db): + """Test that a job with next_dispatch_at unset is never leased. + + Given: + A pending job whose next_dispatch_at is None (freshly claimed, + not yet deferred by the scheduler). + When: + lease_due_dispatch is awaited. + Then: + It should return None — an unscheduled job is not due. + """ + # Arrange + now = datetime.now(timezone.utc) + _insert_job( + mock_db, workflow_key="a", status=JobStatus.PENDING, next_dispatch_at=None + ) + + # Act + leased = await lock.lease_due_dispatch( + mock_db, now=now, next_at=now + timedelta(seconds=120) + ) + + # Assert + assert leased is None + + @pytest.mark.asyncio + async def test_lease_due_dispatch_should_skip_future_and_running_jobs( + self, mock_db + ): + """Test that not-yet-due and non-pending jobs are not leased. + + Given: + A pending job due in the future and a running job already past + its time. + When: + lease_due_dispatch is awaited. + Then: + It should return None — the future job is not due and the running + job is not pending. + """ + # Arrange + now = datetime.now(timezone.utc) + _insert_job( + mock_db, + workflow_key="future", + status=JobStatus.PENDING, + next_dispatch_at=now + timedelta(seconds=60), + ) + _insert_job( + mock_db, + workflow_key="running", + status=JobStatus.RUNNING, + next_dispatch_at=now - timedelta(seconds=60), + ) + + # Act + leased = await lock.lease_due_dispatch( + mock_db, now=now, next_at=now + timedelta(seconds=120) + ) + + # Assert + assert leased is None + + @pytest.mark.asyncio + async def test_lease_due_dispatch_should_lease_oldest_due_job_first(self, mock_db): + """Test that the longest-waiting due job is leased first. + + Given: + Two due pending jobs with different next_dispatch_at times, + inserted newest-first. + When: + lease_due_dispatch is awaited. + Then: + It should lease the one with the earliest next_dispatch_at, so + sustained overflow cannot starve the oldest job. + """ + # Arrange + now = datetime.now(timezone.utc) + _insert_job( + mock_db, + workflow_key="newer", + status=JobStatus.PENDING, + next_dispatch_at=now - timedelta(seconds=10), + ) + _insert_job( + mock_db, + workflow_key="older", + status=JobStatus.PENDING, + next_dispatch_at=now - timedelta(seconds=300), + ) + + # Act + leased = await lock.lease_due_dispatch( + mock_db, now=now, next_at=now + timedelta(seconds=120) + ) + + # Assert + assert leased is not None + assert leased.workflow_key == "older" + + @pytest.mark.asyncio + async def test_lease_due_dispatch_should_reject_next_at_not_in_future( + self, mock_db + ): + """Test that a non-future next_at is rejected. + + Given: + A now timestamp and a next_at equal to it. + When: + lease_due_dispatch is awaited. + Then: + It should raise ValueError — an equal/past next_at would let a + concurrent tick immediately re-lease the same job. + """ + # Arrange + now = datetime.now(timezone.utc) + + # Act & assert + with pytest.raises(ValueError, match="strictly after"): + await lock.lease_due_dispatch(mock_db, now=now, next_at=now) + + +class TestRequeueOrphanedDispatch: + @pytest.mark.asyncio + async def test_requeue_orphaned_dispatch_should_requeue_a_stale_running_orphan(self, mock_db): + """Test that a RUNNING job with no recent heartbeat is re-queued. + + Given: + A RUNNING job whose `updated_at` is older than the stale + threshold — an in-flight job orphaned when its API consumer + died mid-stream. + When: + requeue_orphaned_dispatch runs with that staleness cutoff. + Then: + The job is reset to PENDING with `next_dispatch_at` set so the + scheduler re-dispatches it, and the call reports one requeued + row. + """ + # Arrange + now = datetime.now(timezone.utc) + stale = now - timedelta(seconds=2000) + _insert_job( + mock_db, workflow_key="a", status=JobStatus.RUNNING, updated_at=stale + ) + + # Act + count = await lock.requeue_orphaned_dispatch( + mock_db, now=now, stale_before=now - timedelta(seconds=900) + ) + + # Assert + assert count == 1 + job = await lock.get_job(mock_db, "job-a") + assert job is not None + assert job.status == JobStatus.PENDING + assert job.next_dispatch_at == now + + @pytest.mark.asyncio + async def test_requeue_orphaned_dispatch_should_requeue_a_stale_pending_job_with_no_next_dispatch( + self, mock_db + ): + """Test that a stale PENDING job with unset next_dispatch_at is re-queued. + + Given: + A PENDING job with `next_dispatch_at` unset (a fresh claim whose + inline attempt never rescheduled) and a stale `updated_at`. + When: + requeue_orphaned_dispatch runs. + Then: + Its `next_dispatch_at` is set to now so the scheduler — which + never leases a null-`next_dispatch_at` row — can pick it up. + """ + # Arrange + now = datetime.now(timezone.utc) + _insert_job( + mock_db, + workflow_key="a", + status=JobStatus.PENDING, + next_dispatch_at=None, + updated_at=now - timedelta(seconds=2000), + ) + + # Act + count = await lock.requeue_orphaned_dispatch( + mock_db, now=now, stale_before=now - timedelta(seconds=900) + ) + + # Assert + assert count == 1 + job = await lock.get_job(mock_db, "job-a") + assert job is not None + assert job.status == JobStatus.PENDING + assert job.next_dispatch_at == now + + @pytest.mark.asyncio + async def test_requeue_orphaned_dispatch_should_not_touch_a_fresh_running_job(self, mock_db): + """Test that a healthy, recently-heartbeating RUNNING job is left alone. + + Given: + A RUNNING job whose `updated_at` is recent (within the stale + threshold), i.e. a healthy in-flight job. + When: + requeue_orphaned_dispatch runs. + Then: + It is not touched — staleness gating prevents the sweep from + yanking a live job away from its worker. + """ + # Arrange + now = datetime.now(timezone.utc) + _insert_job( + mock_db, + workflow_key="a", + status=JobStatus.RUNNING, + updated_at=now - timedelta(seconds=10), + ) + + # Act + count = await lock.requeue_orphaned_dispatch( + mock_db, now=now, stale_before=now - timedelta(seconds=900) + ) + + # Assert + assert count == 0 + job = await lock.get_job(mock_db, "job-a") + assert job is not None + assert job.status == JobStatus.RUNNING + + @pytest.mark.asyncio + async def test_requeue_orphaned_dispatch_should_not_touch_an_already_queued_pending_job(self, mock_db): + """Test that a stale PENDING job already carrying next_dispatch_at is left. + + Given: + A PENDING job that already has `next_dispatch_at` set (it is + queued and leasable by the scheduler), even with a stale + `updated_at`. + When: + requeue_orphaned_dispatch runs. + Then: + It is not touched — the scheduler already handles it; the sweep + targets only the non-leasable orphan states. + """ + # Arrange + now = datetime.now(timezone.utc) + original = now - timedelta(seconds=50) + _insert_job( + mock_db, + workflow_key="a", + status=JobStatus.PENDING, + next_dispatch_at=original, + updated_at=now - timedelta(seconds=2000), + ) + + # Act + count = await lock.requeue_orphaned_dispatch( + mock_db, now=now, stale_before=now - timedelta(seconds=900) + ) + + # Assert + assert count == 0 + job = await lock.get_job(mock_db, "job-a") + assert job is not None + assert job.next_dispatch_at == original + + @pytest.mark.asyncio + async def test_requeue_orphaned_dispatch_should_not_touch_terminal_jobs(self, mock_db): + """Test that terminal (completed/failed) jobs are never re-queued. + + Given: + Stale COMPLETED and FAILED jobs. + When: + requeue_orphaned_dispatch runs. + Then: + Neither is touched — only active rows (the mutex holders) are + candidates for recovery. + """ + # Arrange + now = datetime.now(timezone.utc) + stale = now - timedelta(seconds=2000) + _insert_job( + mock_db, workflow_key="done", status=JobStatus.COMPLETED, updated_at=stale + ) + _insert_job( + mock_db, workflow_key="fail", status=JobStatus.FAILED, updated_at=stale + ) + + # Act + count = await lock.requeue_orphaned_dispatch( + mock_db, now=now, stale_before=now - timedelta(seconds=900) + ) + + # Assert + assert count == 0 + assert (await lock.get_job(mock_db, "job-done")).status == JobStatus.COMPLETED + assert (await lock.get_job(mock_db, "job-fail")).status == JobStatus.FAILED diff --git a/tests/test_workflows/test_models.py b/tests/test_workflows/test_models.py index f5eb23d..09b458b 100644 --- a/tests/test_workflows/test_models.py +++ b/tests/test_workflows/test_models.py @@ -309,6 +309,85 @@ def test_to_mongo_should_serialize_stages_progress_and_superseded_by(self): assert payload["superseded_by"] == "job-xyz" assert payload["progress"] == "indexing" + def test___init___should_default_dispatch_scheduling_fields(self): + """Test that the dispatch-scheduling fields default sanely. + + Given: + A JobRecord built without dispatch-scheduling overrides. + When: + It is constructed. + Then: + next_dispatch_at should default to None and dispatch_attempts + to 0. + """ + # Act + job = _make_job() + + # Assert + assert job.next_dispatch_at is None + assert job.dispatch_attempts == 0 + + def test___init___should_reject_naive_next_dispatch_at(self): + """Test that a naive next_dispatch_at is rejected. + + Given: + A naive next_dispatch_at (no tzinfo). + When: + A JobRecord is constructed with it. + Then: + It should raise ValidationError — the scheduler's + next_dispatch_at comparison requires aware UTC. + """ + # Arrange + naive = datetime(2026, 4, 21, 13, 0, 0) + + # Act & assert + with pytest.raises(ValidationError, match="timezone-aware"): + _make_job(next_dispatch_at=naive) + + def test___init___should_accept_aware_next_dispatch_at(self): + """Test that an aware next_dispatch_at validates and round-trips. + + Given: + An aware UTC next_dispatch_at. + When: + A JobRecord is constructed with it. + Then: + It should validate and preserve the value (None remains the + nullable default elsewhere). + """ + # Arrange + when = datetime(2026, 4, 21, 13, 0, 0, tzinfo=timezone.utc) + + # Act + job = _make_job(next_dispatch_at=when) + + # Assert + assert job.next_dispatch_at == when + + def test_to_mongo_should_serialize_dispatch_scheduling_fields(self): + """Test that to_mongo persists next_dispatch_at and dispatch_attempts. + + Given: + A JobRecord with an aware next_dispatch_at and a non-zero + dispatch_attempts. + When: + to_mongo is invoked. + Then: + Both fields should appear in the dump with their values + preserved verbatim. + """ + # Arrange + when = datetime(2026, 4, 21, 13, 0, 0, tzinfo=timezone.utc) + job = _make_job(next_dispatch_at=when, dispatch_attempts=3) + + # Act + payload = job.to_mongo() + + # Assert + assert payload["next_dispatch_at"] == when + assert payload["dispatch_attempts"] == 3 + @pytest.mark.parametrize( "status", list(JobStatus), diff --git a/tests/test_workflows/test_s3_contract.py b/tests/test_workflows/test_s3_contract.py index 16c599f..3f3a5f4 100644 --- a/tests/test_workflows/test_s3_contract.py +++ b/tests/test_workflows/test_s3_contract.py @@ -27,7 +27,7 @@ from cfdb.workflows.events import Complete, StageComplete, WorkflowEvent from cfdb.workflows.executor import WoolExecutor from cfdb.workflows.lock import get_job -from cfdb.workflows.models import ACTIVE_STATUSES, ArtifactKind, JobStatus +from cfdb.workflows.models import ArtifactKind, JobStatus from cfdb.workflows.processors.base import Processor from cfdb.workflows.processors.registry import ProcessorRegistry from tests.test_workflows import FIXTURE_MD5 @@ -79,9 +79,7 @@ def _install_jobs_index(mock_db) -> None: mock_db.jobs.create_index( {"workflow_key": 1}, unique=True, - partialFilterExpression={ - "status": {"$in": [s.value for s in ACTIVE_STATUSES]} - }, + partialFilterExpression={"active": True}, ) diff --git a/tests/test_workflows/test_worker_lan.py b/tests/test_workflows/test_worker_lan.py index f99f784..0c545b4 100644 --- a/tests/test_workflows/test_worker_lan.py +++ b/tests/test_workflows/test_worker_lan.py @@ -174,3 +174,65 @@ def test_main_tls_cli_flags_override_env_vars(self, monkeypatch): # Assert assert exit_code == 0 assert captured["tls_ca"] == "/cli/ca.pem" + + +def test_backpressure_bound_factory_should_still_declare_host(): + """Test that the backpressure-bound spawn factory still declares a host. + + Given: + The ``functools.partial(LocalWorker, backpressure=hook)`` factory + that ``worker_lan.serve`` hands to ``WorkerPool(spawn=…)`` to apply + per-worker backpressure. + When: + wool's ``declares_host`` classifier inspects it. + Then: + It should return True, so the pool still prescribes the bind host. A + False here would make spawned workers bind 127.0.0.1 and advertise + unreachable addresses — a runtime failure with no other unit signal, + so this pins the load-bearing invariant the worker_lan wiring relies + on against a wool signature change. + """ + # Arrange + import functools + + import wool + from wool.runtime.worker.base import declares_host + + from cfdb.workflows.backpressure import TaskCountBackpressure + + factory = functools.partial(wool.LocalWorker, backpressure=TaskCountBackpressure(1)) + + # Act & assert + assert declares_host(factory) is True + + +def test_backpressure_bound_factory_should_survive_cloudpickle(): + """Test that the backpressure-bound spawn factory round-trips cloudpickle. + + Given: + The ``functools.partial(LocalWorker, backpressure=TaskCountBackpressure(1))`` + factory ``worker_lan.serve`` hands to ``WorkerPool(spawn=…)`` — the + exact composite object wool serializes into each spawned worker + subprocess. + When: + It is cloudpickled and reloaded. + Then: + The reconstructed partial still carries a backpressure hook with the + same threshold, so per-worker backpressure survives the spawn + boundary rather than silently reverting to unbounded admission. + """ + # Arrange + import functools + + import cloudpickle + import wool + + from cfdb.workflows.backpressure import TaskCountBackpressure + + factory = functools.partial(wool.LocalWorker, backpressure=TaskCountBackpressure(1)) + + # Act + restored = cloudpickle.loads(cloudpickle.dumps(factory)) + + # Assert + assert restored.keywords["backpressure"].threshold == 1 diff --git a/tests/test_workflows/test_worker_main.py b/tests/test_workflows/test_worker_main.py index 8f4ea9e..757d58d 100644 --- a/tests/test_workflows/test_worker_main.py +++ b/tests/test_workflows/test_worker_main.py @@ -4,6 +4,7 @@ from unittest.mock import patch +import pytest from click.testing import CliRunner from cfdb.workflows import worker_main @@ -183,3 +184,84 @@ def test_main_tls_cli_flags_override_env_vars(self, monkeypatch): # Assert assert exit_code == 0 assert captured["tls_ca"] == "/cli/ca.pem" + + +class _StopServe(Exception): + """Sentinel raised from the mocked ``worker.start`` to halt ``serve``. + + ``serve`` constructs the worker (with its backpressure kwarg) immediately + before ``await worker.start()``, so raising here exits the coroutine right + after the construction under test without entering the run loop, the + signal-wait, or the health/drain teardown. + """ + + +def _arrange_serve(mocker) -> object: + """Patch ``serve``'s collaborators and return the ``LocalWorker`` spy.""" + mocker.patch.object(worker_main, "build_worker_credentials", return_value=None) + mocker.patch.object( + worker_main, "_start_health_server", mocker.AsyncMock(return_value=mocker.Mock()) + ) + worker_instance = mocker.Mock() + worker_instance.start = mocker.AsyncMock(side_effect=_StopServe) + worker_instance.stop = mocker.AsyncMock() + return mocker.patch.object( + worker_main.wool, "LocalWorker", return_value=worker_instance + ) + + +class TestServeBackpressureWiring: + @pytest.mark.asyncio + async def test_serve_should_wire_taskcount_backpressure_when_threshold_positive( + self, mocker, monkeypatch + ): + """Test that a positive task ceiling becomes a TaskCountBackpressure. + + Given: + ``WORKER_MAX_CONCURRENT_TASKS`` set to 1 and ``wool.LocalWorker`` + spied so ``serve`` halts right after constructing the worker. + When: + ``serve`` is run. + Then: + The worker is constructed with a ``backpressure`` hook whose + threshold is 1, so the ECS worker entrypoint actually serializes + its subprocess pipelines. + """ + # Arrange + monkeypatch.setattr(worker_main, "WORKER_MAX_CONCURRENT_TASKS", 1) + local_worker = _arrange_serve(mocker) + + # Act + with pytest.raises(_StopServe): + await worker_main.serve(worker_port=0, health_port=0) + + # Assert + backpressure = local_worker.call_args.kwargs["backpressure"] + assert backpressure is not None + assert backpressure.threshold == 1 + + @pytest.mark.asyncio + async def test_serve_should_disable_backpressure_when_threshold_zero( + self, mocker, monkeypatch + ): + """Test that a zero task ceiling wires ``backpressure=None``. + + Given: + ``WORKER_MAX_CONCURRENT_TASKS`` set to 0 (the disable sentinel) + and ``wool.LocalWorker`` spied. + When: + ``serve`` is run. + Then: + The worker is constructed with ``backpressure=None``, restoring + the unbounded admission behavior. + """ + # Arrange + monkeypatch.setattr(worker_main, "WORKER_MAX_CONCURRENT_TASKS", 0) + local_worker = _arrange_serve(mocker) + + # Act + with pytest.raises(_StopServe): + await worker_main.serve(worker_port=0, health_port=0) + + # Assert + assert local_worker.call_args.kwargs["backpressure"] is None