Prevent ENCODE starch files from being misrouted into the BED→tabix pipeline and poisoning the cache #69

Description

@conradbzura


ENCODE files with file_format=starch (BEDOPS' compressed-BED archive format) are routed into the BED preprocessing workflow, which both fails the job and — more seriously — commits a corrupt artifact to the content-addressed cache.

Observed on ENCODE files ENCFF570WRX and ENCFF774HHU (file_format=starch, output_type=peaks) during the 2026-06-23 live dev cap-burst test. Reproduced locally byte-for-byte by driving the real TabixIntervalProcessor().run() against https://www.encodeproject.org/files/ENCFF570WRX/@@download/ENCFF570WRX.starch (886 KB) with a LocalFsCache: it emits StageComplete(data) and then raises RuntimeError: tabix exited 1 … [E::hts_open_format] … Exec format error — the identical error seen in the live worker logs. The committed source.raw begins with ca5cade5… (starch magic) and the cached out.bgz begins with 1f8b0804… (a valid BGZF container wrapping scrambled bytes).

Expected Behavior

A starch file must never produce a corrupt cached artifact. Either it is handled correctly (decompressed via the BEDOPS toolchain and indexed) or it is rejected cleanly with a clear error and nothing is written to the cache. A subsequent /data GET must never serve corrupt bytes as a "preprocessed BED", and a failed job must not leave a poisoned stage-1 artifact that makes every retry a cache hit that re-fails forever.

Root Cause

  1. src/cfdb/services/ontology_mappings.py:53 maps ENCODE starch → C2M2 BED (format:3003). This is reasonable for metadata, but it also drives workflow routing.
  2. src/cfdb/workflows/processors/tabix.py (TabixIntervalProcessor) selects on file_format.name == "BED" and runs zcat -f source | sort | bgzip > out.bgz then tabix.
  3. starch is not gzip/plaintext — it is a BEDOPS archive (magic ca5cade5 followed by bzip2 BZh… chunks). zcat -f passes the binary through untouched, sort scrambles it, bgzip wraps the garbage in a valid BGZF container, and tabix then fails to parse it (Exec format error). unstarch/bedops are not installed in the worker image.
  4. The stage-1 commit gate (_count_data_lines) does not catch this: the bzip2 binary contains newline bytes, so it counts nonzero "data lines" and the garbage data artifact is written to the content-addressed cache before stage 2 (tabix) fails.
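The failure in steps 3 and 4 can be modelled in a few lines. This is only a rough sketch: `naive_line_count` is a hypothetical stand-in for a newline-counting gate (it is not the real `_count_data_lines`), the payload is synthetic rather than a real starch file, and Python's bytewise `sorted` only approximates what `sort` does under a given locale.

```python
# Synthetic stand-in for a starch file: the magic bytes quoted in this issue,
# a bzip2-style chunk header, then arbitrary binary that happens to contain 0x0a.
STARCH_MAGIC = bytes.fromhex("ca5cade5")
fake_starch = STARCH_MAGIC + b"BZh9" + bytes(range(256)) * 4


def naive_line_count(data: bytes) -> int:
    """Hypothetical gate: counts newline bytes and nothing else."""
    return data.count(b"\n")


def sort_lines(data: bytes) -> bytes:
    """Rough model of `sort`: split on newlines, order bytewise, rejoin."""
    return b"\n".join(sorted(data.split(b"\n")))


# The gate sees "lines" even though the payload is not text.
print(naive_line_count(fake_starch))  # 4

# Sorting reorders the binary chunks, so the magic is no longer at offset 0.
scrambled = sort_lines(fake_starch)
print(scrambled.startswith(STARCH_MAGIC))  # False
```

Under these assumptions a nonzero line count says nothing about whether the bytes are BED, which is why the gate lets the artifact through.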

Fix options (decide during implementation):

  • (a) Do not route starch into tabix: treat starch as unsupported/passthrough so it never enters the BED workflow.
  • (b) Add real starch support: install BEDOPS and run unstarch source | sort | bgzip > out.bgz, then tabix on out.bgz (tabix indexes the finished file; it does not read from the pipe).
  • (c) Sniff the magic bytes in the tabix processor and fail fast (clear error, no cache commit) when the input is not gzip/plaintext.
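As an illustration of option (c), a magic-byte check might look like the sketch below. The function names are hypothetical and not part of the existing codebase. The starch magic is the `ca5cade5` value quoted in this issue; the gzip (`1f8b`) and bzip2 (`BZh`) signatures are the standard ones. The "text" heuristic (no NUL bytes, decodes as UTF-8) is an assumption, not an established rule of the pipeline.

```python
import codecs

STARCH_MAGIC = bytes.fromhex("ca5cade5")  # value reported in this issue
GZIP_MAGIC = b"\x1f\x8b"                  # also matches BGZF, which is gzip-framed
BZIP2_MAGIC = b"BZh"


def sniff_input_kind(head: bytes) -> str:
    """Classify the first bytes of a source file (hypothetical helper)."""
    if not head:
        return "empty"
    if head.startswith(STARCH_MAGIC):
        return "starch"
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(BZIP2_MAGIC):
        return "bzip2"
    if b"\x00" in head:
        return "binary"
    try:
        # Incremental decoding tolerates a multi-byte character cut off at the end.
        codecs.getincrementaldecoder("utf-8")().decode(head, final=False)
    except UnicodeDecodeError:
        return "binary"
    return "text"


def require_tabix_compatible(head: bytes) -> str:
    """Raise before any cache write unless the input is gzip or plain text."""
    kind = sniff_input_kind(head)
    if kind not in ("gzip", "text"):
        raise ValueError(f"unsupported input for BED->tabix pipeline: {kind}")
    return kind
```

Calling `require_tabix_compatible` on the first few hundred bytes of the source, before the `zcat -f | sort | bgzip` stage starts, would turn the starch case into a clean rejection instead of a late tabix failure.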

Regardless of the routing choice, the commit gate must be hardened so binary/garbage can never land in the content-addressed cache — the cache poisoning is the highest-severity part of this bug.
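One way to harden the gate is to validate structure rather than count newlines. The sketch below is a hypothetical replacement check, not the existing `_count_data_lines`. It assumes tab-separated BED with non-negative integer start/end columns, and treats lines beginning with `#`, `track`, or `browser` as headers.

```python
import io

HEADER_PREFIXES = (b"#", b"track", b"browser")


def count_valid_bed_lines(stream, max_check: int = 10_000) -> int:
    """Count BED-like data lines; raise ValueError on the first line that is not.

    `stream` is any iterable of byte lines, such as an open binary file.
    Only the first `max_check` data lines are inspected.
    """
    seen = 0
    for raw in stream:
        line = raw.rstrip(b"\r\n")
        if not line or line.startswith(HEADER_PREFIXES):
            continue
        if b"\x00" in line:
            raise ValueError("NUL byte in input; not a text BED file")
        fields = line.split(b"\t")
        if len(fields) < 3:
            raise ValueError("data line has fewer than 3 tab-separated fields")
        if not (fields[1].isdigit() and fields[2].isdigit()):
            raise ValueError("start/end columns are not non-negative integers")
        if int(fields[2]) < int(fields[1]):
            raise ValueError("end coordinate is smaller than start")
        seen += 1
        if seen >= max_check:
            break
    if seen == 0:
        raise ValueError("no BED data lines found")
    return seen


good = io.BytesIO(b"track name=peaks\nchr1\t0\t10\nchr1\t5\t20\n")
print(count_valid_bed_lines(good))  # 2
```

The cache commit would then happen only after this check returns, so a `ValueError` leaves the content-addressed cache untouched and a retry is not turned into a poisoned cache hit.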

Labels

bug: Something isn't working
