Skip to content

Follow-up hardening for the bounded-concurrency pipeline (post-#45 review) #64

Description

@conradbzura

Description

Deferred, non-blocking follow-ups from the principal-engineer review of #45 (PR #63, review round 2). None of these gate #63; they are tracked here for a later pass.

  • Queue observability (review A13). The bounded-concurrency pipeline adds a durable queue but no signal for its depth or pressure. Emit metrics / rate-limited summary logs for queue depth (pending with a due next_dispatch_at), overflow-and-reschedule rate, and the dispatch_attempts distribution, so a sustained-overflow incident is visible before jobs start failing capacity:. executor._scheduler_loop / _handle_overflow are the natural emit points.
  • Per-job spawn-suppression window (review A14). _handle_overflow calls provisioner.request(dedup_key=…) on every overflow tick, so a job stuck pending for the full CFDB_WORKFLOW_DISPATCH_DEADLINE_S can issue ~DEADLINE/RETRY_INTERVAL RunTasks; across CFDB_WORKFLOW_MAX_ACTIVE queued jobs that is a large sustained RunTask rate (bounded by the Fargate quota + worker self-reap, but wasteful). Suppress re-requesting a spawn for a job that already triggered one within the last tick.
  • Integration test of the production pool config (review A17). The cross-process boundary suite uses a bare wool.WorkerPool. Add a tests/integration test that builds the pool with loadbalancer=PriorityLoadBalancer() and ≥2 workers each TaskCountBackpressure(1), dispatches overlapping workflows over the real cloudpickle boundary, and asserts the LB rotates past a RESOURCE_EXHAUSTED rejection and the executor reschedules — the combined production path is currently fakes/unit-only.
  • Pre-existing test-isolation bug (found while verifying review-2 fixups). When tests/test_workflows/ runs before tests/test_index.py, TestStreamIndexFileFourdnSidecar::test_stream_index_file_should_502_when_sidecar_href_fails_allowlist fails with 500 instead of 502: cfdb.workflows.urlsafe.UnsafeOutboundURL is raised but the /index router's specific handler does not catch it (it falls to the generic 500). This reproduces on master (independent of Add bounded concurrency control for the preprocessing/indexing workflow pipeline — Closes #45 #63) and looks like an exception-class identity break from a module reimport in the workflows test path. The canonical default-order suite (pytest -m "not integration") passes because test_index.py collects before test_workflows/, which hides it. Root-cause the reimport and make the router's UnsafeOutboundURL handling order-independent (and/or add an autouse cleanup).

Motivation

The #45 bounded-concurrency change is correct and verified, but these items harden operability (observability, ECS cost), close a coverage gap in the production dispatch path, and fix a latent test-isolation flake that masks a real 502→500 regression risk.

Expected Outcome

  • Queue depth / overflow pressure is observable via metrics or logs.
  • A queued job does not issue a fresh RunTask on every retry tick.
  • The PriorityLoadBalancer + TaskCountBackpressure combination is exercised end-to-end over the real worker boundary.
  • The /index allowlist test passes regardless of test ordering, and the underlying UnsafeOutboundURL handling is order-independent.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions