Make oversized BED files processable instead of failing mid-pipeline
Description
Very large ENCODE BED files fail preprocessing. Observed on ENCFF087RDN (~11.0 GB .bed.gz, bed9+) and ENCFF874VKO (~9.1 GB .bed.gz, bedMethyl) — both WGBS "methylation state at CHH" files, which are inherently huge (one row per cytosine) — during the 2026-06-23 live dev cap-burst test. The worker logs show download_source failed for CancelledError after ~6.6e9 / ~4.6e9 bytes (cancelled roughly halfway through the download).
These files are not fundamentally impossible to process; long-running jobs are acceptable. The goal is to make them complete (even if that takes hours) rather than to reject them.
Expected Behavior
A large but otherwise valid BED file should preprocess to completion and serve from cache, or — only past a genuine hard resource ceiling — fall back to raw upstream passthrough streaming. It should not be cancelled mid-download by an unrelated timer, and a long-running job should not be silently killed by worker recycling.
Root Cause
There is no single bug; there are three independent limits, two soft and one hard:
-
Time (soft). src/cfdb/workflows/executor.py wraps each workflow in asyncio.timeout(CFDB_WORKFLOW_DURATION_CAP_S) (default 4 h). Separately, src/cfdb/workflows/worker_main.py self-terminates at CFDB_WORKER_MAX_LIFETIME_SECONDS (default 5 h) and the max-lifetime path does not drain in-flight work (the code comment is explicit: "defense-in-depth flip rather than a real drain window"). So a worker that already ran earlier jobs can reap mid-download on a long job, losing the work; orphan recovery then re-queues it and a fresh worker restarts from scratch. src/cfdb/workflows/fetcher.py::download_source streams the entire file with no internal size cap or timeout, so the CancelledError is always external (duration cap, worker reap, or consumer/gRPC-stream close). To support multi-hour jobs, raise the duration cap and stop the worker from reaping mid-job (raise/disable max-lifetime, or make reap drain in-flight work).
-
Disk (the one hard ceiling). The BED pipeline is zcat -f source | sort -S256M -T tmp | bgzip > out.bgz. GNU sort spills runs to -T at roughly the uncompressed input size, so peak scratch ≈ source(compressed) + sort-temp(uncompressed) + out(compressed). For ENCFF087RDN (~11 GB .gz, ~5–8× ratio → ~55–88 GB uncompressed) that is ~77–110 GB. The Fargate worker's WorkerEphemeralStorageGiB defaults to 50 GiB (cloudformation/workers.yml), so the sort would hit ENOSPC; the Fargate maximum is 200 GiB, which the 11 GB / 9 GB files fit under but which is a genuine hard wall for the largest files. Memory is not the binding resource (everything spills/streams).
-
The high-leverage fix. The disk and time blowup both come from the defensive re-sort, but bgzip+tabix only require coordinate-sorted input — and ENCODE's released BEDs are typically already position-sorted. Adding a streaming "is it already sorted?" check that skips sort when the input is already sorted drops peak scratch to ~source + out (~22 GB for the 11 GB file, well within the 50 GiB default) and removes the multi-hour sort for the common case.
Expected Outcome
- A streaming pre-check skips
sort when the input is already coordinate-sorted (the common case), eliminating the disk and time blowup.
- For genuinely-unsorted large inputs, the worker has enough headroom to finish: raise
WorkerEphemeralStorageGiB toward the 200 GiB Fargate max and raise the duration cap; ensure worker recycling does not kill in-flight jobs.
- An optional size guard exists only as a fallback to raw passthrough past the ~200 GiB Fargate ephemeral ceiling — not as a blanket rejection of large files.
Make oversized BED files processable instead of failing mid-pipeline
Description
Very large ENCODE BED files fail preprocessing. Observed on
ENCFF087RDN(~11.0 GB.bed.gz,bed9+) andENCFF874VKO(~9.1 GB.bed.gz,bedMethyl) — both WGBS "methylation state at CHH" files, which are inherently huge (one row per cytosine) — during the 2026-06-23 live dev cap-burst test. The worker logs showdownload_source failed for CancelledError after ~6.6e9 / ~4.6e9 bytes(cancelled roughly halfway through the download).These files are not fundamentally impossible to process; long-running jobs are acceptable. The goal is to make them complete (even if that takes hours) rather than to reject them.
Expected Behavior
A large but otherwise valid BED file should preprocess to completion and serve from cache, or — only past a genuine hard resource ceiling — fall back to raw upstream passthrough streaming. It should not be cancelled mid-download by an unrelated timer, and a long-running job should not be silently killed by worker recycling.
Root Cause
There is no single bug; there are three independent limits, two soft and one hard:
Time (soft).
src/cfdb/workflows/executor.pywraps each workflow inasyncio.timeout(CFDB_WORKFLOW_DURATION_CAP_S)(default 4 h). Separately,src/cfdb/workflows/worker_main.pyself-terminates atCFDB_WORKER_MAX_LIFETIME_SECONDS(default 5 h) and the max-lifetime path does not drain in-flight work (the code comment is explicit: "defense-in-depth flip rather than a real drain window"). So a worker that already ran earlier jobs can reap mid-download on a long job, losing the work; orphan recovery then re-queues it and a fresh worker restarts from scratch.src/cfdb/workflows/fetcher.py::download_sourcestreams the entire file with no internal size cap or timeout, so theCancelledErroris always external (duration cap, worker reap, or consumer/gRPC-stream close). To support multi-hour jobs, raise the duration cap and stop the worker from reaping mid-job (raise/disable max-lifetime, or make reap drain in-flight work).Disk (the one hard ceiling). The BED pipeline is
zcat -f source | sort -S256M -T tmp | bgzip > out.bgz. GNUsortspills runs to-Tat roughly the uncompressed input size, so peak scratch ≈source(compressed) + sort-temp(uncompressed) + out(compressed). ForENCFF087RDN(~11 GB.gz, ~5–8× ratio → ~55–88 GB uncompressed) that is ~77–110 GB. The Fargate worker'sWorkerEphemeralStorageGiBdefaults to 50 GiB (cloudformation/workers.yml), so the sort would hit ENOSPC; the Fargate maximum is 200 GiB, which the 11 GB / 9 GB files fit under but which is a genuine hard wall for the largest files. Memory is not the binding resource (everything spills/streams).The high-leverage fix. The disk and time blowup both come from the defensive re-sort, but bgzip+tabix only require coordinate-sorted input — and ENCODE's released BEDs are typically already position-sorted. Adding a streaming "is it already sorted?" check that skips
sortwhen the input is already sorted drops peak scratch to ~source + out(~22 GB for the 11 GB file, well within the 50 GiB default) and removes the multi-hour sort for the common case.Expected Outcome
sortwhen the input is already coordinate-sorted (the common case), eliminating the disk and time blowup.WorkerEphemeralStorageGiBtoward the 200 GiB Fargate max and raise the duration cap; ensure worker recycling does not kill in-flight jobs.