Support ENCODE starch peak files end-to-end via BEDOPS unstarch #72

Description

@conradbzura

Handle ENCODE starch (BEDOPS compressed-BED archive) files as a first-class format so they preprocess to a sorted, bgzipped BED plus a tabix index — the same artifact the BED pipeline already produces — and become viewable in Gosling. Today starch is mapped to the BED EDAM term and routed into the BED tabix pipeline, which cannot decompress it.

Proposed changes:

  1. Stop conflating starch with BED. In src/cfdb/services/ontology_mappings.py:53, starch maps to {"id": "format:3003", "name": "BED"} — the same id as plain BED. Remap it to its own file_format CV entry (name: "starch", a distinct id; there is no standard EDAM term for starch, so a token is minted) so it is no longer indistinguishable from BED. The materializer joins file_format by id, so the new starch CV entry must be populated in the file_format collection during sync.
  2. Convert starch like bigBed. The starch case is structurally identical to the existing bigBed handling: a binary source with a CLI converter that emits BED to stdout. In src/cfdb/workflows/processors/tabix.py, add "starch": "bed" to _TABIX_PRESET, add "starch" to _SELF_VALIDATING_FORMATS (unstarch validates its own input), and add an elif fmt == "starch": branch in _stage_prepare that runs unstarch {source} | sort | bgzip > out.bgz. Tabix then indexes the result with the existing bed preset.
  3. Install BEDOPS in Dockerfile.wool to provide unstarch; it is trivy-scanned with the rest of the worker toolchain.
  4. Add tests: starch routes to TabixIntervalProcessor, the pipeline emits the unstarch | sort | bgzip shape with the bed preset, and a direct-processor/integration test converts a real small starch file (e.g. ENCFF570WRX, ENCFF774HHU) to a valid bgzipped BED + tbi.

Motivation

ENCODE publishes ~2,409 starch files (2,387 released), and every one is output_type=peaks — peak calls that are a common Gosling visualization target. Because starch is cast to BED (same EDAM id) and routed into the zcat -f | sort | bgzip | tabix pipeline, which cannot decompress the BEDOPS archive, these files were the root cause of the cache-poisoning bug fixed in #69 / PR #71. After #71 they fail cleanly, but they remain unviewable. Supporting starch directly makes the full peak-file population usable and removes the BED conflation at its source.

This complements #71 rather than replacing it: the byte-sniff guard from #71 stays as defense-in-depth. Once starch is its own format handled by unstarch and listed in _SELF_VALIDATING_FORMATS, the guard cleanly steps aside (exactly as it does for bigBed) and continues to protect against any future format mislabel.

Expected Outcome

  • starch resolves to its own file_format (name: "starch"), no longer conflated with BED, with a corresponding file_format CV entry available after sync.
  • GET /data/encode/<starch-id> and GET /index/encode/<starch-id> return the preprocessed bgzipped BED and its tabix index (with Range support), the same as a BED file.
  • The worker image carries unstarch (BEDOPS); the conversion runs as a self-validating pipeline that commits no artifact when the source is not valid starch.
  • The 2,387 released ENCODE starch peak files become viewable in Gosling.
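
The "commits no artifact" behaviour above can be sketched as follows. This is one possible approach under stated assumptions, not the worker's actual implementation: run_pipeline is a hypothetical helper that chains the stages, writes to a temporary file next to the target, and renames it into place only if every stage exits with status 0.

```python
import os
import subprocess
import tempfile


def run_pipeline(stages, out_path):
    """Run argv lists as a pipe; publish out_path only if every stage exits 0."""
    out_dir = os.path.dirname(os.path.abspath(out_path))
    fd, tmp_path = tempfile.mkstemp(dir=out_dir, suffix=".partial")
    procs = []
    try:
        with os.fdopen(fd, "wb") as tmp:
            prev = None
            for i, argv in enumerate(stages):
                last = i == len(stages) - 1
                proc = subprocess.Popen(
                    argv,
                    stdin=prev,
                    stdout=tmp if last else subprocess.PIPE,
                )
                if prev is not None:
                    # Drop our copy of the pipe so only the child holds it.
                    prev.close()
                prev = proc.stdout
                procs.append(proc)
            # Check every stage, not just the last one (like `set -o pipefail`).
            codes = [p.wait() for p in procs]
        if any(codes):
            raise RuntimeError(f"pipeline failed with exit codes {codes}")
        # Atomic publish: readers never observe a half-written artifact.
        os.replace(tmp_path, out_path)
    finally:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

For starch, the stages would be something like [["unstarch", source], ["sort", "-k1,1", "-k2,2n"], ["bgzip", "-c"]], where the sort keys are again an assumption. If unstarch rejects the input, its non-zero exit status prevents the rename and nothing is committed.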

Labels: enhancement
