Prevent ENCODE starch files from being misrouted into the BED→tabix pipeline and poisoning the cache
Description
ENCODE files with file_format=starch (BEDOPS' compressed-BED archive format) are routed into the BED preprocessing workflow, which both fails the job and — more seriously — commits a corrupt artifact to the content-addressed cache.
Observed on ENCODE files ENCFF570WRX and ENCFF774HHU (file_format=starch, output_type=peaks) during the 2026-06-23 live dev cap-burst test. Reproduced locally byte-for-byte by driving the real TabixIntervalProcessor().run() against https://www.encodeproject.org/files/ENCFF570WRX/@@download/ENCFF570WRX.starch (886 KB) with a LocalFsCache: it emits StageComplete(data) and then raises RuntimeError: tabix exited 1 … [E::hts_open_format] … Exec format error — the identical error seen in the live worker logs. The committed source.raw begins with ca5cade5… (starch magic) and the cached out.bgz begins with 1f8b0804… (a valid BGZF container wrapping scrambled bytes).
Expected Behavior
A starch file must never produce a corrupt cached artifact. Either it is handled correctly (decompressed via the BEDOPS toolchain and indexed) or it is rejected cleanly with a clear error and nothing is written to the cache. A subsequent /data GET must never serve corrupt bytes as a "preprocessed BED", and a failed job must not leave a poisoned stage-1 artifact that makes every retry a cache hit that re-fails forever.
Root Cause
src/cfdb/services/ontology_mappings.py:53 maps ENCODE starch → C2M2 BED (format:3003). This is reasonable for metadata, but it also drives workflow routing.
src/cfdb/workflows/processors/tabix.py (TabixIntervalProcessor) selects on file_format.name == "BED" and runs zcat -f source | sort | bgzip > out.bgz then tabix.
- starch is not gzip/plaintext: it is a BEDOPS archive (magic ca5cade5 followed by bzip2 BZh… chunks). zcat -f passes the binary through untouched, sort scrambles it, bgzip wraps the garbage in a valid BGZF container, and tabix then fails to parse it (Exec format error). unstarch/bedops are not installed in the worker image.
- The stage-1 commit gate (_count_data_lines) does not catch this: the bzip2 binary contains newline bytes, so it counts nonzero "data lines" and the garbage data artifact is written to the content-addressed cache before stage 2 (tabix) fails.
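The gate failure can be illustrated with a small sketch. The real _count_data_lines is not shown in this issue, so naive_count_data_lines below is a hypothetical stand-in that assumes the gate only counts non-empty, newline-delimited lines. The payload is synthetic binary prefixed with the starch magic, not a real starch file.

```python
STARCH_MAGIC = bytes.fromhex("ca5cade5")  # starch magic as observed in source.raw


def naive_count_data_lines(payload: bytes) -> int:
    """Hypothetical stand-in for the gate: count non-empty lines, nothing else."""
    return sum(1 for line in payload.split(b"\n") if line.strip())


# Synthetic binary blob: every byte value 0..255, repeated, so it is guaranteed
# to contain 0x0a (newline) bytes, much as compressed data usually does.
fake_starch = STARCH_MAGIC + bytes(range(256)) * 16

count = naive_count_data_lines(fake_starch)
print(f"naive gate sees {count} 'data lines' in binary input")
```

Any gate that only asks "are there lines?" will accept this input, which is consistent with the corrupt data artifact being committed before tabix fails.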
Fix options (decide during implementation):
- (a) Do not route starch into tabix — treat starch as unsupported/passthrough so it never enters the BED workflow.
- (b) Add real starch support: install BEDOPS and run unstarch source | sort | bgzip > out.bgz, then tabix.
- (c) Sniff the magic bytes in the tabix processor and fail fast (clear error, no cache commit) when the input is not gzip/plaintext.
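Option (c) could look roughly like the sketch below. It is not the existing TabixIntervalProcessor code: sniff_format, ensure_tabix_compatible and UnsupportedInputError are hypothetical names, the 512-byte read size is arbitrary, and the "text" heuristic (no control bytes other than tab, LF, CR) is an assumption that may need tuning.

```python
GZIP_MAGIC = b"\x1f\x8b"
BZIP2_MAGIC = b"BZh"
STARCH_MAGIC = bytes.fromhex("ca5cade5")  # as observed at the start of source.raw

ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}  # tab, LF, CR


class UnsupportedInputError(ValueError):
    """Hypothetical error type for inputs the BED pipeline cannot handle."""


def sniff_format(head: bytes) -> str:
    """Classify the first bytes of a file as starch, gzip, bzip2, binary or text."""
    if head.startswith(STARCH_MAGIC):
        return "starch"
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(BZIP2_MAGIC):
        return "bzip2"
    # Any other control byte (including NUL) means this is not plain-text BED.
    if any(b < 0x20 and b not in ALLOWED_CONTROLS for b in head):
        return "binary"
    return "text"


def ensure_tabix_compatible(path: str) -> str:
    """Raise before any pipeline stage runs if the input is not gzip/plaintext."""
    with open(path, "rb") as fh:
        head = fh.read(512)
    kind = sniff_format(head)
    if kind not in {"gzip", "text"}:
        raise UnsupportedInputError(
            f"{path}: {kind} input is not supported by the BED pipeline"
        )
    return kind
```

Calling ensure_tabix_compatible before zcat runs would turn the starch case into an immediate, descriptive failure with no cache write. It does not by itself protect against a gzip file whose decompressed content is not BED.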
Regardless of the routing choice, the commit gate must be hardened so binary/garbage can never land in the content-addressed cache — the cache poisoning is the highest-severity part of this bug.
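One possible shape for a hardened gate is sketched below, under stated assumptions: records are tab-separated with at least three fields, start and end are non-negative integers with start <= end, and lines beginning with "#", "track" or "browser" are headers. count_bed_records, is_bed_record and NotBedError are hypothetical names; where the check hooks into the real stage-1 commit path is left open.

```python
from typing import Iterable

HEADER_PREFIXES = (b"#", b"track", b"browser")


class NotBedError(ValueError):
    """Hypothetical error raised before anything is committed to the cache."""


def is_bed_record(line: bytes) -> bool:
    """True if the line has chrom, integer start, integer end, tab-separated."""
    fields = line.rstrip(b"\r\n").split(b"\t")
    if len(fields) < 3 or not fields[0]:
        return False
    if not (fields[1].isdigit() and fields[2].isdigit()):
        return False
    return int(fields[1]) <= int(fields[2])


def count_bed_records(lines: Iterable[bytes], max_bad: int = 0) -> int:
    """Count valid BED records; raise as soon as more than max_bad lines are invalid."""
    good = bad = 0
    for line in lines:
        if not line.strip() or line.startswith(HEADER_PREFIXES):
            continue  # blank lines and headers are neither good nor bad
        if b"\x00" not in line and is_bed_record(line):
            good += 1
        else:
            bad += 1
            if bad > max_bad:
                raise NotBedError(f"non-BED line after {good} valid records")
    if good == 0:
        raise NotBedError("no BED records found")
    return good
```

The design choice here is to validate structure rather than count newlines, so binary input fails on its first line instead of passing with a nonzero count. The cache commit would happen only after this function returns.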