Support ENCODE starch peak files end-to-end via BEDOPS unstarch
Description
Handle ENCODE starch (BEDOPS compressed-BED archive) files as a first-class format so they preprocess to a sorted, bgzipped BED plus a tabix index — the same artifact the BED pipeline already produces — and become viewable in Gosling. Today starch is mapped to the BED EDAM term and routed into the BED tabix pipeline, which cannot decompress it.
Proposed changes:
- Stop conflating starch with BED. In src/cfdb/services/ontology_mappings.py:53, starch maps to {"id": "format:3003", "name": "BED"}, the same id as plain BED. Remap it to its own file_format CV entry (name: "starch", with a distinct id; EDAM has no standard term for starch, so a token is minted) so that it can be distinguished from BED. The materializer joins file_format by id, so the new starch CV entry must be populated in the file_format collection during sync.
- Convert starch like bigBed. starch fits the same pattern as the existing bigBed handling: a binary source with a CLI converter that emits BED to stdout. In src/cfdb/workflows/processors/tabix.py: add "starch": "bed" to _TABIX_PRESET, add "starch" to _SELF_VALIDATING_FORMATS (unstarch validates its own input), and add an elif fmt == "starch": branch in _stage_prepare that runs unstarch {source} | sort | bgzip > out.bgz. Tabix then indexes the result with the existing bed preset.
- Install BEDOPS in Dockerfile.wool to provide unstarch; it is trivy-scanned with the rest of the worker toolchain.
- Add tests: starch routes to TabixIntervalProcessor, the pipeline emits the unstarch | sort | bgzip shape with the bed preset, and a direct-processor/integration test converts a real small starch file (e.g. ENCFF570WRX, ENCFF774HHU) to a valid bgzipped BED + tbi.
Motivation
ENCODE publishes ~2,409 starch files (2,387 released), and every one is output_type=peaks — peak calls that are a common Gosling visualization target. Because starch is cast to BED (same EDAM id) and routed into the zcat -f | sort | bgzip | tabix pipeline, which cannot decompress the BEDOPS archive, these files were the root cause of the cache-poisoning bug fixed in #69 / PR #71. After #71 they fail cleanly, but they remain unviewable. Supporting starch directly makes the full peak-file population usable and removes the BED conflation at its source.
This complements #71 rather than replacing it: the byte-sniff guard from #71 stays as defense-in-depth. Once starch is its own format handled by unstarch and listed in _SELF_VALIDATING_FORMATS, the guard cleanly steps aside (exactly as it does for bigBed) and continues to protect against any future format mislabel.
Expected Outcome
- starch resolves to its own file_format (name: "starch"), no longer conflated with BED, with a corresponding file_format CV entry available after sync.
- GET /data/encode/<starch-id> and GET /index/encode/<starch-id> return the preprocessed bgzipped BED and its tabix index (with Range support), the same as a BED file.
- The worker image carries unstarch (BEDOPS); the conversion runs as a self-validating pipeline that commits no artifact when the source is not valid starch.
- The ~2,387 ENCODE starch peak files become viewable in Gosling.
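One way the "commits no artifact" outcome could be enforced is to write converter output to a temporary file and publish it only when the converter exits successfully. The sketch below is an assumption about the approach, not the worker's actual commit logic; run_and_commit and the .partial suffix are hypothetical names.

```python
import os
import subprocess
import tempfile


def run_and_commit(argv: list[str], final_path: str) -> bool:
    """Run argv with stdout captured to a temp file; publish only on exit code 0."""
    out_dir = os.path.dirname(os.path.abspath(final_path))
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=out_dir, suffix=".partial")
    try:
        with os.fdopen(fd, "wb") as tmp:
            result = subprocess.run(argv, stdout=tmp, stderr=subprocess.DEVNULL)
        if result.returncode != 0:
            return False  # converter rejected the input: nothing is committed
        os.replace(tmp_path, final_path)  # atomic rename into place
        return True
    finally:
        # Remove the partial output if it was never published.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

With this shape, a converter such as unstarch that exits non-zero on invalid input leaves neither a final artifact nor a partial file behind, so nothing can poison the cache.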