Follow-up hardening for the bounded-concurrency pipeline (post-#45 review)

## Description

Deferred, non-blocking follow-ups from the principal-engineer review of #45 (PR #63, review round 2). None of these gate #63; they are tracked here for a later pass.

- **Queue observability (review A13).** The bounded-concurrency pipeline adds a durable queue but no signal for its depth or pressure. Emit metrics / rate-limited summary logs for queue depth (`pending` with a due `next_dispatch_at`), overflow-and-reschedule rate, and the `dispatch_attempts` distribution, so a sustained-overflow incident is visible before jobs start failing `capacity:`. `executor._scheduler_loop` / `_handle_overflow` are the natural emit points.
- **Per-job spawn-suppression window (review A14).** `_handle_overflow` calls `provisioner.request(dedup_key=…)` on every overflow tick, so a job stuck `pending` for the full `CFDB_WORKFLOW_DISPATCH_DEADLINE_S` can issue ~`DEADLINE/RETRY_INTERVAL` `RunTask`s; across `CFDB_WORKFLOW_MAX_ACTIVE` queued jobs that is a large sustained `RunTask` rate (bounded by the Fargate quota + worker self-reap, but wasteful). Suppress re-requesting a spawn for a job that already triggered one within the last tick.
- **Integration test of the production pool config (review A17).** The cross-process boundary suite uses a bare `wool.WorkerPool`. Add a `tests/integration` test that builds the pool with `loadbalancer=PriorityLoadBalancer()` and ≥2 workers each `TaskCountBackpressure(1)`, dispatches overlapping workflows over the real cloudpickle boundary, and asserts the LB rotates past a `RESOURCE_EXHAUSTED` rejection and the executor reschedules — the combined production path is currently fakes/unit-only.
- **Pre-existing test-isolation bug (found while verifying review-2 fixups).** When `tests/test_workflows/` runs before `tests/test_index.py`, `TestStreamIndexFileFourdnSidecar::test_stream_index_file_should_502_when_sidecar_href_fails_allowlist` fails with `500` instead of `502`: `cfdb.workflows.urlsafe.UnsafeOutboundURL` is raised but the `/index` router's specific handler does not catch it (it falls to the generic 500). This reproduces on `master` (independent of #63) and looks like an exception-class identity break from a module reimport in the workflows test path. The canonical default-order suite (`pytest -m "not integration"`) passes because `test_index.py` collects before `test_workflows/`, which hides it. Root-cause the reimport and make the router's `UnsafeOutboundURL` handling order-independent (and/or add an autouse cleanup).

## Motivation

The #45 bounded-concurrency change is correct and verified, but these items harden operability (observability, ECS cost), close a coverage gap in the production dispatch path, and fix a latent test-isolation flake that masks a real 502→500 regression risk.

## Expected Outcome

- Queue depth / overflow pressure is observable via metrics or logs.
- A queued job does not issue a fresh `RunTask` on every retry tick.
- The PriorityLoadBalancer + TaskCountBackpressure combination is exercised end-to-end over the real worker boundary.
- The `/index` allowlist test passes regardless of test ordering, and the underlying `UnsafeOutboundURL` handling is order-independent.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Follow-up hardening for the bounded-concurrency pipeline (post-#45 review) #64

Description

Motivation

Expected Outcome

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Follow-up hardening for the bounded-concurrency pipeline (post-#45 review) #64

Description

Description

Motivation

Expected Outcome

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions