# Support ENCODE starch peak files end-to-end via BEDOPS unstarch

## Description

Handle ENCODE `starch` (BEDOPS compressed-BED archive) files as a first-class format, so that each one is preprocessed into a sorted, bgzipped BED plus a tabix index (the same artifact the BED pipeline already produces) and becomes viewable in Gosling. Today `starch` is mapped to the BED EDAM term and routed into the BED tabix pipeline, which cannot decompress it.

Proposed changes:

1. Stop conflating starch with BED. In `src/cfdb/services/ontology_mappings.py:53`, `starch` maps to `{"id": "format:3003", "name": "BED"}`, the same id as plain BED. Remap it to its own `file_format` CV entry (`name: "starch"`, with a distinct id) so it can be told apart from BED. There is no standard EDAM term for starch, so a project-local identifier has to be minted. The materializer joins `file_format` by id, so the new starch CV entry must be populated in the `file_format` collection during sync.
2. Convert starch like bigBed. Starch fits the existing bigBed pattern: a binary source with a CLI converter that emits BED to stdout. In `src/cfdb/workflows/processors/tabix.py`: add `"starch": "bed"` to `_TABIX_PRESET`, add `"starch"` to `_SELF_VALIDATING_FORMATS` (unstarch validates its own input), and add an `elif fmt == "starch":` branch in `_stage_prepare` running `unstarch {source} | sort | bgzip > out.bgz`. Tabix uses the existing `bed` preset.
3. Install BEDOPS in `Dockerfile.wool` to provide `unstarch`; it is trivy-scanned with the rest of the worker toolchain.
4. Add tests: starch routes to TabixIntervalProcessor, the pipeline emits the `unstarch | sort | bgzip` shape with the bed preset, and a direct-processor/integration test converts a real small starch file (e.g. `ENCFF570WRX`, `ENCFF774HHU`) to a valid bgzipped BED + tbi.

## Motivation

ENCODE publishes ~2,409 starch files (2,387 released), and every one is `output_type=peaks` — peak calls that are a common Gosling visualization target. Because `starch` is cast to BED (same EDAM id) and routed into the `zcat -f | sort | bgzip | tabix` pipeline, which cannot decompress the BEDOPS archive, these files were the root cause of the cache-poisoning bug fixed in #69 / PR #71. After #71 they fail cleanly, but they remain unviewable. Supporting starch directly makes the full peak-file population usable and removes the BED conflation at its source.

This complements #71 rather than replacing it: the byte-sniff guard from #71 stays as defense-in-depth. Once starch is its own format handled by `unstarch` and listed in `_SELF_VALIDATING_FORMATS`, the guard cleanly steps aside (exactly as it does for bigBed) and continues to protect against any future format mislabel.
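
The interaction between the guard and `_SELF_VALIDATING_FORMATS` might look roughly like the following. This is only a sketch under assumptions: the real guard from #71 is not shown in this issue, so `should_sniff` and `looks_like_bed_input` are hypothetical names, and the check (gzip magic bytes or plain ASCII) is a guess at what a BED-oriented sniff would accept.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip/bgzip stream

# Hypothetical stand-in for the set named in the issue.
_SELF_VALIDATING_FORMATS = {"bigBed", "starch"}


def should_sniff(fmt: str) -> bool:
    """The guard steps aside for formats whose converter validates its own input."""
    return fmt not in _SELF_VALIDATING_FORMATS


def looks_like_bed_input(head: bytes) -> bool:
    """Accept gzip-compressed data or plain ASCII text; reject other binary data."""
    if head.startswith(GZIP_MAGIC):
        return True
    try:
        head.decode("ascii")
    except UnicodeDecodeError:
        return False
    return b"\x00" not in head


if __name__ == "__main__":
    # A binary archive mislabeled as BED would still be caught by the sniff.
    print(should_sniff("bed"), looks_like_bed_input(b"\xca\x5c\xad\xe5"))
    # A file labeled starch skips the sniff and is validated by unstarch instead.
    print(should_sniff("starch"))
```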

## Expected Outcome

- `starch` resolves to its own `file_format` (`name: "starch"`), no longer conflated with BED, with a corresponding `file_format` CV entry available after sync.
- `GET /data/encode/<starch-id>` and `GET /index/encode/<starch-id>` return the preprocessed bgzipped BED and its tabix index (with Range support), the same as a BED file.
- The worker image carries `unstarch` (BEDOPS); the conversion runs as a self-validating pipeline that commits no artifact when the source is not valid starch.
- The ~2,387 released ENCODE starch peak files become viewable in Gosling.
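
The "commits no artifact" outcome can be sketched as a write-to-temp-then-rename step. This is an illustrative sketch only: `run_and_commit` is a hypothetical helper, not a function from the codebase, and the real worker would run the `unstarch | sort | bgzip` pipeline rather than a single command.

```python
import os
import subprocess
import sys
import tempfile


def run_and_commit(argv: list[str], final_path: str) -> bool:
    """Run `argv`, capturing stdout to a temp file; publish it only on exit code 0.

    Hypothetical helper showing the commit-on-success idea.
    """
    out_dir = os.path.dirname(os.path.abspath(final_path))
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=out_dir, suffix=".partial")
    try:
        with os.fdopen(fd, "wb") as tmp:
            result = subprocess.run(argv, stdout=tmp, stderr=subprocess.PIPE)
        if result.returncode != 0:
            return False  # converter rejected the input: nothing is committed
        os.replace(tmp_path, final_path)  # atomic rename within one filesystem
        return True
    finally:
        # Remove the partial file if it was never promoted to final_path.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    target = os.path.join(workdir, "out.bed")
    # A failing command stands in for unstarch rejecting a non-starch source.
    committed = run_and_commit([sys.executable, "-c", "raise SystemExit(1)"], target)
    print(committed, os.path.exists(target))
```

Because the final path only ever appears through the rename, a reader of the cache sees either a complete artifact or none at all.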
