Make oversized BED files processable instead of failing mid-pipeline #70

Description

@conradbzura

Very large ENCODE BED files fail preprocessing. This was observed on ENCFF087RDN (~11.0 GB .bed.gz, bed9+) and ENCFF874VKO (~9.1 GB .bed.gz, bedMethyl) during the 2026-06-23 live dev cap-burst test. Both are WGBS "methylation state at CHH" files, which are inherently huge (one row per cytosine). The worker logs show download_source failing with CancelledError after ~6.6e9 and ~4.6e9 bytes respectively, that is, cancelled roughly halfway through each download.

These files are not fundamentally impossible to process; long-running jobs are acceptable. The goal is to make them complete (even if that takes hours) rather than to reject them.

Expected Behavior

A large but otherwise valid BED file should preprocess to completion and serve from cache, or — only past a genuine hard resource ceiling — fall back to raw upstream passthrough streaming. It should not be cancelled mid-download by an unrelated timer, and a long-running job should not be silently killed by worker recycling.

Root Cause

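There is no single bug. Three independent limits are involved: two soft ones (the duration cap and the worker max-lifetime, both in item 1) and one hard one (the disk ceiling in item 2). Item 3 is not a limit but the fix that removes most of the cost behind both: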

  1. Time (soft). src/cfdb/workflows/executor.py wraps each workflow in asyncio.timeout(CFDB_WORKFLOW_DURATION_CAP_S) (default 4 h). Separately, src/cfdb/workflows/worker_main.py self-terminates at CFDB_WORKER_MAX_LIFETIME_SECONDS (default 5 h), and the max-lifetime path does not drain in-flight work (the code comment is explicit: "defense-in-depth flip rather than a real drain window"). A worker that has already run earlier jobs can therefore be reaped mid-download on a long job, losing the work; orphan recovery then re-queues the job and a fresh worker restarts it from scratch. src/cfdb/workflows/fetcher.py::download_source streams the entire file with no internal size cap or timeout, so the CancelledError is always external (duration cap, worker reap, or consumer/gRPC-stream close). To support multi-hour jobs, raise the duration cap and stop the worker from reaping mid-job (raise or disable max-lifetime, or make the reap drain in-flight work).

  2. Disk (the one hard ceiling). The BED pipeline is zcat -f source | sort -S256M -T tmp | bgzip > out.bgz. GNU sort spills runs to -T at roughly the uncompressed input size, so peak scratch ≈ source(compressed) + sort-temp(uncompressed) + out(compressed). For ENCFF087RDN (~11 GB .gz, ~5–8× ratio → ~55–88 GB uncompressed) that is ~77–110 GB. The Fargate worker's WorkerEphemeralStorageGiB defaults to 50 GiB (cloudformation/workers.yml), so the sort would hit ENOSPC; the Fargate maximum is 200 GiB, which the 11 GB / 9 GB files fit under but which is a genuine hard wall for the largest files. Memory is not the binding resource (everything spills/streams).

  3. The high-leverage fix. The disk and time blowup both come from the defensive re-sort, but bgzip+tabix only require coordinate-sorted input — and ENCODE's released BEDs are typically already position-sorted. Adding a streaming "is it already sorted?" check that skips sort when the input is already sorted drops peak scratch to ~source + out (~22 GB for the 11 GB file, well within the 50 GiB default) and removes the multi-hour sort for the common case.
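The streaming pre-check described in item 3 could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's implementation: the function names `is_coordinate_sorted` and `gz_is_coordinate_sorted` are hypothetical, "coordinate-sorted" is taken to mean that each chromosome appears in one contiguous block with non-decreasing start positions (which may differ from the exact order the pipeline's `sort` would produce), and lines starting with `#`, `track`, or `browser` are assumed to be headers that can be skipped.

```python
import gzip
from typing import Iterable, Optional, Set


def is_coordinate_sorted(lines: Iterable[bytes]) -> bool:
    """Return True if records are grouped by chromosome with non-decreasing starts.

    Hypothetical helper: reads each line once and keeps only the set of
    chromosome names seen so far, so memory use does not grow with file size.
    """
    seen_chroms: Set[bytes] = set()
    current_chrom: Optional[bytes] = None
    last_start = -1
    for line in lines:
        # Assumption: blank lines and header-style lines carry no coordinates.
        if not line.strip() or line.startswith((b"#", b"track", b"browser")):
            continue
        fields = line.split(None, 2)  # chrom, start, rest-of-line
        if len(fields) < 2:
            return False  # malformed record: fall back to the full sort
        chrom = fields[0]
        try:
            start = int(fields[1])
        except ValueError:
            return False  # non-numeric start: fall back to the full sort
        if chrom != current_chrom:
            if chrom in seen_chroms:
                return False  # chromosome reappears after another one
            seen_chroms.add(chrom)
            current_chrom = chrom
        elif start < last_start:
            return False  # start position went backwards within a chromosome
        last_start = start
    return True


def gz_is_coordinate_sorted(path: str) -> bool:
    """Stream a (b)gzip-compressed BED file through the check without extracting it."""
    with gzip.open(path, "rb") as fh:
        return is_coordinate_sorted(fh)
```

The check exits at the first out-of-order record, so an unsorted file costs little before the pipeline falls back to the sort. A sorted file costs one extra decompression pass, which needs no scratch space; whether that pass is cheaper than the sort it avoids would need to be measured on real inputs.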

Expected Outcome

  • A streaming pre-check skips sort when the input is already coordinate-sorted (the common case), eliminating the disk and time blowup.
  • For genuinely unsorted large inputs, the worker has enough headroom to finish: raise WorkerEphemeralStorageGiB toward the 200 GiB Fargate max and raise the duration cap; ensure worker recycling does not kill in-flight jobs.
  • An optional size guard exists only as a fallback to raw passthrough past the ~200 GiB Fargate ephemeral ceiling — not as a blanket rejection of large files.
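The three outcomes above can be tied together with the peak-scratch arithmetic from the Root Cause section. The sketch below is illustrative only: `Strategy`, `estimate_peak_scratch_bytes`, and `choose_strategy` are hypothetical names, the output is assumed to be about the same size as the compressed source (as in the ~22 GB figure above), and the default compression ratio of 8 is simply the pessimistic end of the ~5–8× range quoted above.

```python
from enum import Enum

GIB = 1024**3


class Strategy(Enum):
    SKIP_SORT = "skip_sort"      # input already sorted: bgzip directly
    SORT = "sort"                # input unsorted but fits in scratch
    PASSTHROUGH = "passthrough"  # does not fit: stream raw upstream bytes


def estimate_peak_scratch_bytes(
    source_bytes: int, already_sorted: bool, ratio: float = 8.0
) -> int:
    """Estimate peak scratch as source + sort-temp + out, in bytes.

    Assumptions: out is about the compressed source size, and the sort
    spills roughly the uncompressed size (source_bytes * ratio).
    """
    out_bytes = source_bytes
    if already_sorted:
        return source_bytes + out_bytes  # no sort temp files at all
    sort_temp_bytes = int(source_bytes * ratio)
    return source_bytes + sort_temp_bytes + out_bytes


def choose_strategy(
    source_bytes: int,
    already_sorted: bool,
    scratch_budget_bytes: int,
    ratio: float = 8.0,
) -> Strategy:
    """Pick a processing path; passthrough only when the estimate exceeds the budget."""
    needed = estimate_peak_scratch_bytes(source_bytes, already_sorted, ratio)
    if needed > scratch_budget_bytes:
        return Strategy.PASSTHROUGH
    return Strategy.SKIP_SORT if already_sorted else Strategy.SORT
```

For an 11 GB source, the already-sorted estimate is 22 GB, which fits a 50 GiB budget. The unsorted estimate at 8× is 110 GB, which exceeds 50 GiB but fits 200 GiB, so with storage raised toward the Fargate maximum the sketch selects the sort path rather than passthrough. A real guard would likely reserve some headroom below the provisioned size rather than compare against it exactly.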

Metadata

Labels: bug
