# Make oversized BED files processable instead of failing mid-pipeline

## Description

Very large ENCODE BED files fail preprocessing. Observed on `ENCFF087RDN` (~11.0 GB `.bed.gz`, `bed9+`) and `ENCFF874VKO` (~9.1 GB `.bed.gz`, `bedMethyl`) — both WGBS "methylation state at CHH" files, which are inherently huge (one row per cytosine) — during the 2026-06-23 live dev cap-burst test. The worker logs show `download_source failed for CancelledError after ~6.6e9 / ~4.6e9 bytes` (cancelled roughly halfway through the download).

These files are not fundamentally impossible to process; long-running jobs are acceptable. The goal is to make them complete (even if that takes hours) rather than to reject them.

## Expected Behavior

A large but otherwise valid BED file should preprocess to completion and serve from cache, or — only past a genuine hard resource ceiling — fall back to raw upstream passthrough streaming. It should not be cancelled mid-download by an unrelated timer, and a long-running job should not be silently killed by worker recycling.
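The "not silently killed by worker recycling" requirement can be sketched as a worker loop that consults its lifetime deadline only between jobs. This is a minimal illustration, not the code in `worker_main.py`: `run_worker` and `get_job` are hypothetical names, and a real worker would also need signal handling and lease renewal.

```python
import asyncio
import time


async def run_worker(get_job, max_lifetime_s: float) -> list:
    """Run jobs until the lifetime deadline passes, without cancelling a started job.

    `get_job` is a hypothetical async callable returning either an async
    job function or None when the queue is empty.
    """
    deadline = time.monotonic() + max_lifetime_s
    results = []
    # The deadline gates only the pickup of NEW work.
    while time.monotonic() < deadline:
        job = await get_job()
        if job is None:
            break
        # An in-flight job is awaited to completion even if it outlives
        # the deadline; this is the drain behaviour.
        results.append(await job())
    return results
```

With this shape, a job that starts just before the deadline still finishes, and the worker exits afterwards instead of picking up another job.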

## Root Cause

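There is no single bug. Three independent limits are involved: two soft time limits (the workflow duration cap and the worker max-lifetime, both covered in item 1) and one hard disk limit (item 2). Item 3 describes the fix that relieves both disk and time pressure.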

1. **Time (soft).** `src/cfdb/workflows/executor.py` wraps each workflow in `asyncio.timeout(CFDB_WORKFLOW_DURATION_CAP_S)` (default 4 h). Separately, `src/cfdb/workflows/worker_main.py` self-terminates at `CFDB_WORKER_MAX_LIFETIME_SECONDS` (default 5 h), and the max-lifetime path does **not** drain in-flight work (the code comment is explicit: "defense-in-depth flip rather than a real drain window"). A worker that has already run earlier jobs can therefore be reaped mid-download on a long job, losing the work; orphan recovery then re-queues the job and a fresh worker restarts it from scratch. `src/cfdb/workflows/fetcher.py::download_source` streams the entire file with no internal size cap or timeout, so the `CancelledError` is always external: the duration cap, a worker reap, or a consumer/gRPC-stream close. To support multi-hour jobs, raise the duration cap **and** stop the worker from reaping mid-job (raise or disable max-lifetime, or make the reap drain in-flight work).

2. **Disk (the one hard ceiling).** The BED pipeline is `zcat -f source | sort -S256M -T tmp | bgzip > out.bgz`. GNU `sort` spills runs to `-T` at roughly the *uncompressed* input size, so peak scratch ≈ `source(compressed) + sort-temp(uncompressed) + out(compressed)`. For `ENCFF087RDN` (~11 GB `.gz`, ~5–8× ratio → ~55–88 GB uncompressed) that is ~77–110 GB. The Fargate worker's `WorkerEphemeralStorageGiB` defaults to **50 GiB** (`cloudformation/workers.yml`), so the sort would hit ENOSPC; the Fargate **maximum is 200 GiB**, which the 11 GB / 9 GB files fit under but which is a genuine hard wall for the largest files. Memory is not the binding resource (everything spills/streams).

3. **The high-leverage fix.** The disk and time blowup both come from the defensive re-sort, but bgzip+tabix only require coordinate-sorted input — and ENCODE's released BEDs are typically already position-sorted. Adding a streaming "is it already sorted?" check that skips `sort` when the input is already sorted drops peak scratch to ~`source + out` (~22 GB for the 11 GB file, well within the 50 GiB default) and removes the multi-hour sort for the common case.
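The streaming pre-check from item 3 could look roughly like the sketch below. It assumes that "sorted" means what tabix needs: each chromosome appears as one contiguous block, and start positions are non-decreasing within a block. If the existing `sort` invocation enforces a stricter order (for example a fixed chromosome order), the check would need to match it. `is_coordinate_sorted` and `gz_bed_is_sorted` are hypothetical names, not functions from the repository.

```python
import gzip


def is_coordinate_sorted(lines) -> bool:
    """Return True if BED lines are grouped by chromosome with
    non-decreasing start positions inside each group."""
    seen_chroms = set()
    current_chrom = None
    last_start = -1
    for line in lines:
        # Skip blank lines and common BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split(None, 2)
        if len(fields) < 2 or not fields[1].isdigit():
            return False  # malformed row: fall back to the sort path
        chrom, start = fields[0], int(fields[1])
        if chrom != current_chrom:
            if chrom in seen_chroms:
                return False  # chromosome block reappears later
            seen_chroms.add(chrom)
            current_chrom = chrom
        elif start < last_start:
            return False
        last_start = start
    return True


def gz_bed_is_sorted(path: str) -> bool:
    """Stream a gzip-compressed BED file through the check.

    Memory stays constant apart from the set of chromosome names."""
    with gzip.open(path, "rt") as handle:
        return is_coordinate_sorted(handle)
```

The check reads the file once and stops at the first violation, so an unsorted file is usually detected early. GNU `sort -c` with the same key options as the existing pipeline is an alternative that keeps the definition of "sorted" identical to the sort step.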

## Expected Outcome

- A streaming pre-check skips `sort` when the input is already coordinate-sorted (the common case), eliminating the disk and time blowup.
- For genuinely-unsorted large inputs, the worker has enough headroom to finish: raise `WorkerEphemeralStorageGiB` toward the 200 GiB Fargate max and raise the duration cap; ensure worker recycling does not kill in-flight jobs.
- An optional size guard exists only as a fallback to raw passthrough past the ~200 GiB Fargate ephemeral ceiling — not as a blanket rejection of large files.
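The size guard in the last bullet can be expressed with the peak-scratch estimate from the Root Cause section (`source + sort-temp + out`). The sketch below is an illustration under stated assumptions: the output is taken to be about as large as the compressed source, the compression ratio defaults to the pessimistic 8× end of the 5–8× range, and no safety margin is modelled. `Plan`, `peak_scratch_bytes`, and `choose_plan` are hypothetical names.

```python
from enum import Enum

GIB = 1024 ** 3


class Plan(Enum):
    SKIP_SORT = "skip_sort"      # input already sorted: zcat | bgzip only
    SORT = "sort"                # full zcat | sort | bgzip pipeline
    PASSTHROUGH = "passthrough"  # raw upstream streaming fallback


def peak_scratch_bytes(source_bytes: int, already_sorted: bool,
                       ratio: float = 8.0) -> int:
    """Estimate peak scratch as source + sort temp + output."""
    out_bytes = source_bytes  # assumption: bgzip output ~ gzip input size
    sort_temp = 0 if already_sorted else int(source_bytes * ratio)
    return source_bytes + sort_temp + out_bytes


def choose_plan(source_bytes: int, already_sorted: bool,
                storage_gib: int, ratio: float = 8.0) -> Plan:
    """Pick a plan; passthrough only when the estimate exceeds the disk budget."""
    budget = storage_gib * GIB
    if peak_scratch_bytes(source_bytes, already_sorted, ratio) > budget:
        return Plan.PASSTHROUGH
    return Plan.SKIP_SORT if already_sorted else Plan.SORT
```

For an 11 GB source, the estimate is 22 GB when the input is already sorted, which fits the 50 GiB default, and 110 GB when it is not, which fits only after `WorkerEphemeralStorageGiB` is raised toward 200.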

Make oversized BED files processable instead of failing mid-pipeline #70
