Populate compression_format on ENCODE file documents in the scraper #31

Description

@conradbzura

ENCODE file documents in Mongo have a null/missing compression_format field, even when the filename clearly indicates gzip compression. 4DN and HuBMAP populate the field (as "" for uncompressed files); ENCODE does not.

Observed across the current materialized corpus (all filenames end in .gz, all have no compression_format):

| Format | ENCODE files |
| --- | ---: |
| FASTA | 98 |
| VCF | 40 |
| GFF | 679 |
| GTF | 1,303 |
| BroadPeak | 2,889 |
| NarrowPeak | 47,462 |
| BED (.gz subset) | 29,280 |

Expected Behavior

When a file is compressed, its ENCODE file document should carry compression_format as an EDAM-term dict with the same shape as file_format: typically { "id": "format:3989", "name": "gzip" } for plain gzip, and the equivalent term for bgzip. When a file is not compressed, the field should be "", mirroring the 4DN/HuBMAP "explicitly uncompressed" convention. Downstream consumers should not have to infer compression from filename suffixes.

This matters immediately for #30 (preprocessing/indexing workflow), which needs to decide whether a file requires gunzip | bgzip recompression before tabix indexing. With the field reliably populated, that decision stays metadata-driven instead of falling back to suffix inspection, and the field becomes useful for queries (e.g., "all bgzipped BED files ready for direct tabix indexing").
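To illustrate what a metadata-driven decision could look like on the consumer side, here is a minimal sketch. The function name `recompression_plan` and its return labels are hypothetical and are not part of cfdb or #30. It assumes the field takes one of the shapes described above: missing/None, "", or a dict with a "name" key.

```python
def recompression_plan(doc):
    """Return a label for the step a file needs before tabix indexing.

    Hypothetical helper: the labels are illustrative, not an existing API.
    tabix requires bgzip-compressed input, so plain gzip must be
    decompressed and recompressed, and uncompressed files must be bgzipped.
    """
    compression = doc.get("compression_format")
    if compression is None:
        # Field missing or null: metadata cannot answer the question.
        return "inspect"
    if compression == "":
        # Explicitly uncompressed (4DN/HuBMAP convention).
        return "bgzip"
    name = compression.get("name") if isinstance(compression, dict) else None
    if name == "bgzip":
        return "none"
    if name == "gzip":
        return "gunzip | bgzip"
    return "inspect"


docs = [
    {"local_id": "A", "compression_format": {"id": "format:3989", "name": "gzip"}},
    {"local_id": "B", "compression_format": ""},
    {"local_id": "C"},
]
plans = {d["local_id"]: recompression_plan(d) for d in docs}
```

The "inspect" label is the fallback that consumers are stuck with today for every ENCODE document, since the field is never set.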

Root Cause

src/cfdb/services/encode.py::build_file_document (around line 269) builds the file doc from ENCODE TSV rows but never assigns compression_format:

doc = {
    "submission": "encode",
    "id_namespace": id_namespace,
    "local_id": accession,
    "filename": filename,
    "size_in_bytes": size_in_bytes,
    "md5": _nonempty(row.get("md5sum")),
    "sha256": None,
    "access_url": access_url,
    "status": _nonempty(row.get("File Status")),
    "data_access_level": "public",
    "creation_time": _nonempty(row.get("Experiment date released")),
    "persistent_id": f"https://www.encodeproject.org/files/{accession}/",
}
# file_format is set below from TSV column; compression_format never is

ENCODE's C2M2 TSV has no compression column, so no value propagates. 4DN and HuBMAP populate the field because their upstream C2M2 datapackages include it.

Proposed Fix

Derive compression_format from the filename suffix. Add a helper _derive_compression_format(filename_or_url) in src/cfdb/services/encode.py that returns the EDAM id+name dict for .gz/.bgz (and optionally .bz2, .zip) or "" otherwise, and call it alongside the existing get_file_format mapping:

doc["compression_format"] = _derive_compression_format(access_url or filename)

The ENCODE REST API (/files/{accession}/?format=json) exposes file_format_specifications but not a dedicated compression field, so the filename suffix remains the canonical indicator. No REST calls needed.

Acceptance Criteria

  • All ENCODE file documents have compression_format populated as an EDAM term dict for compressed files, or "" for uncompressed, after a fresh sync.
  • Derivation handles at minimum .gz and .bgz; unknown/unsuffixed files fall back to "".
  • Unit test for _derive_compression_format covering: .gz, .bgz, plain filename, empty/None input.
  • 4DN and HuBMAP compression_format values are unchanged (sourced from upstream C2M2 datapackage fields, untouched by this change).
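The first criterion could be checked after a sync with something like the sketch below. The function name `find_unpopulated` is hypothetical. It works on any iterable of document dicts (for example, the results of a Mongo query, which is an assumption about how it would be fed) and relies only on the submission, local_id, and compression_format fields shown in this issue.

```python
def find_unpopulated(docs, submission="encode"):
    """Return local_ids of documents whose compression_format is not
    either "" or a dict carrying non-empty "id" and "name" values."""
    missing = []
    for doc in docs:
        if doc.get("submission") != submission:
            continue
        value = doc.get("compression_format")
        if value == "":
            continue  # explicitly uncompressed
        if isinstance(value, dict) and value.get("id") and value.get("name"):
            continue  # populated term dict
        missing.append(doc.get("local_id"))
    return missing


sample = [
    {"submission": "encode", "local_id": "ENCFF000AAA",
     "compression_format": {"id": "format:3989", "name": "gzip"}},
    {"submission": "encode", "local_id": "ENCFF000AAB", "compression_format": ""},
    {"submission": "encode", "local_id": "ENCFF000AAC"},
    {"submission": "4dn", "local_id": "X1"},
]
unpopulated = find_unpopulated(sample)
```

An empty result for the ENCODE submission after a fresh sync would indicate that the criterion holds. The accessions in the sample are made up for illustration.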
