You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Deferred, non-blocking follow-ups from the principal-engineer review of #45 (PR #63, review round 2). None of these gate #63; they are tracked here for a later pass.
Queue observability (review A13). The bounded-concurrency pipeline adds a durable queue but no signal for its depth or pressure. Emit metrics / rate-limited summary logs for queue depth (pending with a due next_dispatch_at), overflow-and-reschedule rate, and the dispatch_attempts distribution, so a sustained-overflow incident is visible before jobs start failing capacity:. executor._scheduler_loop / _handle_overflow are the natural emit points.
Per-job spawn-suppression window (review A14)._handle_overflow calls provisioner.request(dedup_key=…) on every overflow tick, so a job stuck pending for the full CFDB_WORKFLOW_DISPATCH_DEADLINE_S can issue ~DEADLINE/RETRY_INTERVALRunTasks; across CFDB_WORKFLOW_MAX_ACTIVE queued jobs that is a large sustained RunTask rate (bounded by the Fargate quota + worker self-reap, but wasteful). Suppress re-requesting a spawn for a job that already triggered one within the last tick.
Integration test of the production pool config (review A17). The cross-process boundary suite uses a bare wool.WorkerPool. Add a tests/integration test that builds the pool with loadbalancer=PriorityLoadBalancer() and ≥2 workers each TaskCountBackpressure(1), dispatches overlapping workflows over the real cloudpickle boundary, and asserts the LB rotates past a RESOURCE_EXHAUSTED rejection and the executor reschedules — the combined production path is currently fakes/unit-only.
Pre-existing test-isolation bug (found while verifying review-2 fixups). When tests/test_workflows/ runs before tests/test_index.py, TestStreamIndexFileFourdnSidecar::test_stream_index_file_should_502_when_sidecar_href_fails_allowlist fails with 500 instead of 502: cfdb.workflows.urlsafe.UnsafeOutboundURL is raised but the /index router's specific handler does not catch it (it falls to the generic 500). This reproduces on master (independent of Add bounded concurrency control for the preprocessing/indexing workflow pipeline — Closes #45 #63) and looks like an exception-class identity break from a module reimport in the workflows test path. The canonical default-order suite (pytest -m "not integration") passes because test_index.py collects before test_workflows/, which hides it. Root-cause the reimport and make the router's UnsafeOutboundURL handling order-independent (and/or add an autouse cleanup).
Motivation
The #45 bounded-concurrency change is correct and verified, but these items harden operability (observability, ECS cost), close a coverage gap in the production dispatch path, and fix a latent test-isolation flake that masks a real 502→500 regression risk.
Expected Outcome
Queue depth / overflow pressure is observable via metrics or logs.
A queued job does not issue a fresh RunTask on every retry tick.
The PriorityLoadBalancer + TaskCountBackpressure combination is exercised end-to-end over the real worker boundary.
The /index allowlist test passes regardless of test ordering, and the underlying UnsafeOutboundURL handling is order-independent.
Description
Deferred, non-blocking follow-ups from the principal-engineer review of #45 (PR #63, review round 2). None of these gate #63; they are tracked here for a later pass.
pendingwith a duenext_dispatch_at), overflow-and-reschedule rate, and thedispatch_attemptsdistribution, so a sustained-overflow incident is visible before jobs start failingcapacity:.executor._scheduler_loop/_handle_overfloware the natural emit points._handle_overflowcallsprovisioner.request(dedup_key=…)on every overflow tick, so a job stuckpendingfor the fullCFDB_WORKFLOW_DISPATCH_DEADLINE_Scan issue ~DEADLINE/RETRY_INTERVALRunTasks; acrossCFDB_WORKFLOW_MAX_ACTIVEqueued jobs that is a large sustainedRunTaskrate (bounded by the Fargate quota + worker self-reap, but wasteful). Suppress re-requesting a spawn for a job that already triggered one within the last tick.wool.WorkerPool. Add atests/integrationtest that builds the pool withloadbalancer=PriorityLoadBalancer()and ≥2 workers eachTaskCountBackpressure(1), dispatches overlapping workflows over the real cloudpickle boundary, and asserts the LB rotates past aRESOURCE_EXHAUSTEDrejection and the executor reschedules — the combined production path is currently fakes/unit-only.tests/test_workflows/runs beforetests/test_index.py,TestStreamIndexFileFourdnSidecar::test_stream_index_file_should_502_when_sidecar_href_fails_allowlistfails with500instead of502:cfdb.workflows.urlsafe.UnsafeOutboundURLis raised but the/indexrouter's specific handler does not catch it (it falls to the generic 500). This reproduces onmaster(independent of Add bounded concurrency control for the preprocessing/indexing workflow pipeline — Closes #45 #63) and looks like an exception-class identity break from a module reimport in the workflows test path. The canonical default-order suite (pytest -m "not integration") passes becausetest_index.pycollects beforetest_workflows/, which hides it. Root-cause the reimport and make the router'sUnsafeOutboundURLhandling order-independent (and/or add an autouse cleanup).Motivation
The #45 bounded-concurrency change is correct and verified, but these items harden operability (observability, ECS cost), close a coverage gap in the production dispatch path, and fix a latent test-isolation flake that masks a real 502→500 regression risk.
Expected Outcome
RunTaskon every retry tick./indexallowlist test passes regardless of test ordering, and the underlyingUnsafeOutboundURLhandling is order-independent.