Add bounded concurrency control for the preprocessing/indexing workflow pipeline — Closes #45#63
Merged
Merged
Conversation
Introduce the configuration surface for issue #45's bounded-concurrency control, parsed and validated at import alongside the existing workflow runtime caps: - CFDB_WORKER_MAX_CONCURRENT_TASKS (default 1) — per-worker backpressure threshold; 0 disables backpressure. - CFDB_WORKFLOW_MAX_ACTIVE (default 1024) — admission ceiling on active workflows. - CFDB_WORKFLOW_DISPATCH_DEADLINE_S (default 14400) — how long a job may wait for worker capacity before it is failed. - CFDB_WORKFLOW_RETRY_INTERVAL_S (default 120) — base cadence for the durable dispatch-retry scheduler. The threshold knob is consumed by the per-worker backpressure wiring; the remaining three are consumed by the admission and durable-retry layers that follow.
Workers ran with backpressure=None and accepted many concurrent routines, so a single 1-vCPU/2-GiB Fargate worker oversubscribed the samtools/bgzip/tabix/sort subprocess pipelines its routines shell out to. Add TaskCountBackpressure, a picklable wool BackpressureLike hook that rejects a dispatch once the worker already has CFDB_WORKER_MAX_CONCURRENT_TASKS routines in flight (default 1, so each worker serializes its pipeline). Rejection surfaces to the dispatcher as gRPC RESOURCE_EXHAUSTED, which the load balancer treats as transient and rotates past. Wire it into both worker entrypoints: worker_main builds the hook directly, and worker_lan binds it onto the spawn factory via functools.partial so the pool still prescribes the bind host. A threshold of 0 passes backpressure=None, restoring the unbounded prior behavior. The hook holds only an int and no async state so it survives wool's serialization into the worker subprocess.
wool's default round-robin balancer spreads load evenly, so every worker carries a thin perpetual slice of traffic and none drains to idle — defeating worker self-reaping on an over-provisioned fleet. Add PriorityLoadBalancer, a stateless wool LoadBalancerLike that offers each task to the discovered workers in a stable order (sorted by uid) on every dispatch. Combined with per-worker backpressure (one task each), load fills the lowest-ordered workers first and higher-ordered ones drain to idle and self-terminate on their max-lifetime — the leaky-bucket behavior. It honors wool's worker-health contract: rotate to the next worker on TransientRpcError (a backpressure RESOURCE_EXHAUSTED rejection), evict on a non-transient RpcError, propagate anything else, and raise NoWorkersAvailable when no worker accepts. Wire it into the API WorkerPool. Being stateless (no per-context index, no lock) keeps it trivially picklable.
Extend the workflow job record and mutex layer for issue #45's durable dispatch-retry and admission control, without changing dispatch behavior yet (the executor consumes these next): - JobRecord gains next_dispatch_at (when the retry scheduler should next attempt dispatch) and dispatch_attempts (deferral count). Both are serialized by to_mongo; the aware-datetime validator and _record_from_mongo treat next_dispatch_at as nullable UTC. - lock.py gains count_active_workflows (admission count), reschedule_dispatch (defer a still-PENDING job, fenced on PENDING), and lease_due_dispatch (atomically claim one due PENDING job, pushing next_dispatch_at forward so concurrent scheduler ticks or API replicas cannot double-pick it).
Address review findings on the foundational #45 primitives: - lease_due_dispatch leases due jobs oldest-first (next_dispatch_at ascending) so sustained overflow cannot starve the longest-waiting job toward its deadline, and rejects a next_at that is not strictly in the future (an equal/past value would let a concurrent tick re-lease the same job before the attempt resolves, defeating the single-claim guard). - reschedule_dispatch logs a debug line on a no-op fenced write, matching the observability convention of the module's other fenced updates. - Correct the JobRecord.next_dispatch_at docstring: it is unset on claim, not set on claim. - backpressure_for rejects a negative threshold rather than silently disabling backpressure, making its contract total.
lease_due_dispatch selects {status: pending, next_dispatch_at: {$lte:
now}} sorted by next_dispatch_at ascending, but the existing
status_updated_at index covers only the status prefix, so the
next_dispatch_at range/sort would scan all PENDING rows on every
scheduler tick. Add a (status, next_dispatch_at) index to
operational_index_specs and mirror it in create-indexes.js so the poll is
index-served. Adding it now, alongside the primitive that defines the
query, avoids an index migration on a hot, TTL-populated collection once
the scheduler goes live.
Pin the foundational #45 concurrency primitives flagged by review as shipping untested: - lease_due_dispatch: leases a due PENDING job and pushes next_dispatch_at forward; a second lease in the same window returns None (single-claim guard); skips unscheduled (None), not-yet-due, and non-PENDING jobs; leases oldest-due first; rejects a non-future next_at. - reschedule_dispatch: defers a PENDING job and bumps dispatch_attempts; no-ops (fenced) on a non-PENDING row. - count_active_workflows: counts pending+running, excludes terminal. - JobRecord: next_dispatch_at/dispatch_attempts defaults, naive-datetime rejection, aware round-trip, and to_mongo serialization. Extend the in-memory FakeCollection test double with $inc and a single-key sort on find_one_and_update so these queries are exercisable (both additive; existing tests use neither).
Address the test-coverage findings from review: - PriorityLoadBalancer: pin cross-dispatch order stability (the leaky-bucket invariant, distinct from round-robin), continued in-order dispatch after evicting the first worker (the snapshot-iteration guarantee a 2-worker eviction did not exercise), and the str(uid) ordering key. Clarify the docstring that the uid order is arbitrary-but-stable, not a seniority ranking. - backpressure_for: assert it rejects a negative threshold; give the protocol-conformance test teeth via the bool-return contract. - Lifespan: assert the WorkerPool is constructed with a PriorityLoadBalancer (the headline wiring, previously uncovered). - worker_lan: pin that the backpressure-bound spawn factory keeps wool's declares_host True, so the pool still prescribes the bind host.
…duler Wire the dormant bounded-concurrency primitives into the executor's request path so concurrency and cost become predictable and a dispatch that finds no capacity queues durably instead of blocking a request. ensure_workflow now enforces an admission ceiling: once the active backlog (pending + running) reaches CFDB_WORKFLOW_MAX_ACTIVE it raises AdmissionRejected before claiming the mutex, shedding a flood at the door rather than queuing unbounded work. A fresh claim dispatches its first attempt inline; each attempt offers the task to existing workers once via the priority load balancer and marks the job RUNNING only after a worker accepts, so a queued job stays PENDING until it genuinely runs. On overflow the job is rescheduled (staying PENDING, next_dispatch_at set) and, on the ECS profile, a single best-effort worker spawn is requested -- inverting the old unconditional pre-dispatch spawn. A durable, Mongo-backed scheduler re-attempts due jobs on CFDB_WORKFLOW_RETRY_INTERVAL_S until a worker frees up or the CFDB_WORKFLOW_DISPATCH_DEADLINE_S deadline elapses (then it fails the job with a capacity: prefix). Because the queue lives in Mongo it survives an API restart. A fresh claim leaves next_dispatch_at unset so the scheduler cannot race the inline attempt. This replaces the single in-request CFDB_WORKFLOW_DISPATCH_WAIT_S cold- start wait, which is removed; the retry cadence plus deadline supersede it.
Map the executor's AdmissionRejected to a 429 response with a Retry-After header on the /data and /index dispatch path, caught explicitly before WorkflowNotApplicable so a rejected request backs off rather than falling through to direct upstream streaming and bypassing the bounded pipeline. The side-effect-free /status readiness probes never dispatch, so they never 429. Start the durable retry scheduler in the lifespan, inside the wool pool context so the task inherits wool's dispatch contextvars (the same mechanism request-spawned tasks rely on). The existing drain teardown cancels it before the pool closes.
…ntrol Describe the four cooperating dispatch layers (per-worker backpressure, priority load balancing, durable retry-to-deadline queue, admission ceiling), add the new CFDB_WORKER_MAX_CONCURRENT_TASKS / CFDB_WORKFLOW_MAX_ACTIVE / CFDB_WORKFLOW_RETRY_INTERVAL_S / CFDB_WORKFLOW_DISPATCH_DEADLINE_S knobs, note that the removed CFDB_WORKFLOW_DISPATCH_WAIT_S is superseded, and list 429 in the /data and /index status-code tables.
…atch The dispatch routine now yields an immediate heartbeat as its first event, the instant a worker accepts the task and before the upstream download. _attempt_dispatch marks the job RUNNING on that first event, so the job leaves the leasable PENDING set the moment a worker takes it. Previously the processor's first event arrived only after the download, so a job stayed PENDING for the whole (potentially multi-second) download. When that exceeded the retry interval the durable scheduler re-leased the same job and dispatched a second concurrent attempt onto the same per-job workdir, corrupting the in-flight download (observed live as "invalid compressed data" and a source.raw.part rename race) and inflating apparent worker concurrency. A rejected dispatch raises NoWorkersAvailable before the routine body runs, so the leading heartbeat fires only on acceptance — overflowed jobs are never marked RUNNING, keeping concurrency bounded to the worker count.
Add a scheduler orphan-recovery sweep so restart recovery no longer depends on a client re-requesting the same file. The durable scheduler only leases PENDING jobs whose next_dispatch_at is set and due, so two states a crash leaves behind would otherwise sit until claim_workflow's stale-reclaim fired on a fresh same-file request: a RUNNING row whose API consumer died mid-stream, and a fresh PENDING claim whose inline attempt never rescheduled. requeue_orphaned_dispatch resets both to PENDING with next_dispatch_at=now once they pass the stale threshold, and the scheduler runs it at the top of every tick — including the first, on boot — so the next tick re-dispatches them. The sweep is gated on the same updated_at staleness threshold as stale-reclaim, so a healthy in-flight job (which heartbeats) is never yanked from its worker; partial-commit recovery means a re-dispatched job reuses any stages already cached. Unlike stale-reclaim, which fails the row so a new claimant supersedes it, the sweep revives the same row since there is no new claimant. Also fix the FakeCollection.update_many test double to report a row-level modified_count (matching real Mongo) instead of a per-field-write count.
Address the principal-engineer review of the bounded-concurrency change (PR #63, round 2): - B1: make a stale RUNNING row reliably mean a dead consumer. The stream consumer aborts (HeartbeatLost) once its heartbeat writes have been failing longer than _HEARTBEAT_LOSS_ABORT_S — sized strictly between one heartbeat interval and the orphan sweep's stale threshold — so a Mongo-blind consumer stops computing (aclose cancels the worker) before recovery would re-dispatch it, eliminating the duplicate run rather than merely making it non-corrupting. Each dispatch attempt also uses a per-attempt workdir (job_id + nonce) so the pathological residual can't corrupt a sibling attempt's scratch. - A1: clean up the per-attempt workdir on the mark_running hand-off branch (without releasing the successor's mutex), with a regression test. - A3: cap leases per scheduler tick (_MAX_DISPATCHES_PER_TICK) so a backlog can't spawn ~MAX_ACTIVE concurrent attempts in a single tick. - A6: pass heartbeat_interval explicitly to the routine so the value is resolved API-side at dispatch rather than snapshotted by cloudpickle. - A9/A12: correct the stale WoolExecutor docstring and comments. - A2/A4: document that the ceiling sheds attach re-GETs and that the dispatch deadline runs from submission and is not reset on recovery.
…eview
- A8: register the unit jobs mutex index with the production
{active: true} partialFilterExpression (not the DocumentDB-rejected
status-$in form), so the unit layer guards active/status lockstep drift
the way production does.
- A15: make the FakeCollection matcher AND $and/$or with sibling top-level
keys instead of short-circuiting on them, so a mixed query (e.g. the
orphan-recovery filter) is modeled regardless of key order.
- A7: derive the create-indexes.js parity assertion from
operational_index_specs() so a newly-added operational index is guarded
automatically (now covers status_next_dispatch_at).
- A11: rename executor/lock tests to match the method __name__ convention.
A2: note that at the ceiling a re-GET for an already-active file is also shed with 429 (it does not attach). A18: clarify the /data and /index 202 rows that the accepted job may sit pending in the durable queue under load.
Tighten the stale-reclaim invariant to a strict STALE > 2 * HEARTBEAT so the heartbeat-loss abort window (_HEARTBEAT_LOSS_ABORT_S = max(HEARTBEAT, STALE - HEARTBEAT)) keeps a non-zero margin above one interval; at the former permitted boundary STALE == 2 * HEARTBEAT it collapsed to exactly one interval, defeating the single-transient-failure tolerance the design advertises. Count the admission ceiling on the canonical active boolean discriminator rather than the derived status-in-ACTIVE_STATUSES view, so the ceiling and the mutex stay in lockstep if the two ever drift. Emit leading operability signals: a per-tick dispatch summary and an overflow-reschedule line, so a saturated pool is visible before jobs start failing capacity: at the deadline. Correct stale or overstated docstrings and comments: the HeartbeatLost abort is documented as a best-effort optimization (it only fires on quiet stages), with the per-attempt workdir named as the actual double-dispatch guard; the _next_dispatch_time jitter note no longer claims a test zeroes the constant; the load balancer dispatch return type is an async iterator, not an async generator; the worker_lan module docstring describes the durable-queue model instead of the removed hang-then-NoWorkersAvailable behavior; _recover_orphans and the JobRecord.error field document the from-submission deadline clock and the internal: error prefix.
Replace the vacuous workdir-cleanup assertions in the executor-boundary and processor e2e tests: since the per-attempt workdir is named job_id-uuid, a bare-job_id path is never created, so asserting its absence passed even if scratch leaked. Assert the workdir root is empty instead. Pin the per-tick lease cap: one _drain_due_jobs tick leases at most _MAX_DISPATCHES_PER_TICK jobs and leaves the remainder leasable, so a refactor dropping the thundering-herd guard fails CI. Pin the production pool configuration over the real cloudpickle/gRPC boundary: a two-worker pool wired with PriorityLoadBalancer and a TaskCountBackpressure(1)-bound spawn factory distributes two workflows across both workers and leaves the third PENDING with a next_dispatch_at, proving rotation-on-rejection and durable reschedule end to end. Pin the ECS worker entrypoint's backpressure wiring (worker_main.serve) and add Hypothesis property coverage for the JobRecord datetime aware/naive to_mongo/from_mongo round-trip (now including next_dispatch_at) and the cloudpickle round-trip of the backpressure-bound spawn factory. Strengthen the heartbeat-loss threshold assertion to the strict margin the tightened invariant now guarantees, rename the scheduler-method tests to the leading-underscore convention, and refresh the stale wool_pool fixture docstring.
Autonomous orphan recovery preserves the original submission time, so an orphan older than the dispatch deadline is failed capacity: on its first recovery attempt rather than resumed. State this in the durable-queue bullet so the autonomous-recovery promise is not read as unbounded; committed cache artifacts still survive for a later fresh GET to reuse.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Make preprocessing/indexing concurrency and cost predictable, and shed excess at admission, with four cooperating layers: per-worker backpressure, a priority (leaky-bucket) load balancer, a durable Mongo-backed retry-to-deadline scheduler with spawn-on-overflow, and a central admission ceiling that returns
429. Dispatch shifts from "block one request waiting for a worker (and spawn one unconditionally)" to "queue durably, retry to a deadline, and spawn only on overflow" — fixing the per-request worker-spawn cost leak and bounding both concurrency and the backlog.A job is marked
runningthe instant a worker accepts it (a leading routine heartbeat), which bounds observed concurrency to the worker count and prevents the scheduler from re-leasing a job during its upstream download. Recovery is autonomous: a scheduler sweep (on boot and every tick) re-queues jobs a crash left mid-flight — arunningjob whose API consumer died, or a freshpendingclaim that never rescheduled — so a restarted or surviving replica revives them without a client re-requesting the file.Verified end-to-end on a local 3-worker stack (1 task/worker, admission ceiling 15): a 16th distinct request returns
429 Retry-After; 15 queued jobs sitpendingand are retried while no worker exists, then drain tocompletedwith max concurrent RUNNING = 3 and zero failures; and seeded in-flight orphans are recovered tocompletedon API boot with no requests issued.Closes #45
Proposed changes
Per-worker backpressure
New
workflows/backpressure.py(TaskCountBackpressure, a picklablewool.BackpressureLike) wired intoworker_mainandworker_lan, gated byCFDB_WORKER_MAX_CONCURRENT_TASKS(default1;0disables). A worker at capacity rejects withRESOURCE_EXHAUSTED.Priority (leaky-bucket) load balancer
New
workflows/loadbalancer.py(PriorityLoadBalancer) wired into the APIWorkerPool. Offers each task to workers in a stable uid order so load concentrates and over-provisioned workers drain to idle and self-reap; honors wool's rotate-on-transient / evict-on-RpcError/NoWorkersAvailablecontract.Durable retry-to-deadline scheduler, spawn-on-overflow, mark-running-on-acceptance
executor.pyrestructured:ensure_workflowdispatches a fresh claim inline; a single durable scheduler re-attempts queued jobs. Each attempt opens the stream once via the LB and marks the jobrunningon the worker-acceptance heartbeat (before the download). On overflow the job stayspendingwithnext_dispatch_atset and (ECS) one bounded spawn is requested; the scheduler retries everyCFDB_WORKFLOW_RETRY_INTERVAL_S(+jitter) until success orCFDB_WORKFLOW_DISPATCH_DEADLINE_S(then failedcapacity:). Removes the in-request_open_stream_with_retry/DISPATCH_WAIT_Swait.Autonomous orphan recovery
lock.requeue_orphaned_dispatch+ the scheduler's_recover_orphans(run at the top of every tick, including boot) re-queue jobs a crash strands that the normal lease cannot pick up —runningrows whose consumer died andpendingrows with nonext_dispatch_at— once they passCFDB_WORKFLOW_STALE_THRESHOLD_S. Gated on the same staleness check asclaim_workflowso healthy heartbeating jobs are untouched; revives the row (partial-commit recovery reuses cached stages) rather than failing it, and makes recovery work for a surviving replica without an API restart or a client re-request.Admission ceiling to 429
ensure_workflowcounts active jobs and raisesAdmissionRejectedonceCFDB_WORKFLOW_MAX_ACTIVEis reached;cache_stream.pymaps it to429 Retry-After. Readiness/statusprobes never dispatch, so never429.Supporting primitives, indexes, knobs, docs
lock.py(count_active_workflows,reschedule_dispatch,lease_due_dispatch,requeue_orphaned_dispatch) andmodels.py(next_dispatch_at,dispatch_attempts); a(status, next_dispatch_at)index inindexes.py+create-indexes.js; new env knobs inworkflows/__init__.py(removing the supersededCFDB_WORKFLOW_DISPATCH_WAIT_S); README documents all four layers and adds429to the/dataand/indexstatus tables.main.pystarts/stops the scheduler in the wool-pool context.Test cases
TestTaskCountBackpressureactive_task_countis below / at / above NTestTaskCountBackpressureTestPriorityLoadBalancerTestPriorityLoadBalancerRpcErrorTestPriorityLoadBalancerNoWorkersAvailableTestWoolExecutorAdmissionensure_workflowis awaitedAdmissionRejectedbefore claimingTestWoolExecutorMarkRunningOnAcceptancerunningvia the leading heartbeatTestWoolExecutorWithProvisioner_attempt_dispatchrunsTestWoolExecutorWithProvisioner_attempt_dispatchrunspendingand rescheduledTestWoolExecutorDispatchDeadline_attempt_dispatchrunscapacity:prefix, no dispatchTestWoolExecutorSchedulerpendingTestWoolExecutorSchedulerTestRequeueOrphanedDispatchrunning/pending-no-next_dispatch_atvs fresh / queued / terminal rowsTestWoolExecutorSchedulerrunningorphan a crash left behind_recover_orphansthen drain) runscompletedwith no new requestTestServeWorkflowArtifactOrDispatchAdmissionRejectedHTTPException(429)withRetry-AfterTestLeaseDueDispatch/TestRescheduleDispatch/TestCountActiveWorkflowsjobscollection with due/active rowstest_lifespan_should_construct_worker_pool_with_quorum_zeroPriorityLoadBalancer; scheduler started; drain cancels it