Description
ENCODE file documents in Mongo have a null/missing compression_format field, even when the filename clearly indicates gzip compression. 4DN and HuBMAP populate the field (as "" for uncompressed files); ENCODE does not.
Observed across the current materialized corpus (all filenames end in .gz, all have no compression_format):
| Format | ENCODE files |
| --- | --- |
| FASTA | 98 |
| VCF | 40 |
| GFF | 679 |
| GTF | 1,303 |
| BroadPeak | 2,889 |
| NarrowPeak | 47,462 |
| BED (.gz subset) | 29,280 |
Expected Behavior
When the file is compressed, every ENCODE file document should carry compression_format as an EDAM-term dict with the same shape as file_format: typically { "id": "format:3989", "name": "gzip" } for plain gzip, and the equivalent term for bgzip. When the file is uncompressed, the field should be "", mirroring the 4DN/HuBMAP "explicitly uncompressed" convention. Downstream consumers should not have to infer compression from filename suffixes.
This matters immediately for #30 (preprocessing/indexing workflow), which needs to decide whether a file requires gunzip | bgzip recompression before tabix indexing. With the field reliably populated, that decision stays metadata-driven instead of falling back to suffix inspection, and the field becomes useful for queries (e.g., "all bgzipped BED files ready for direct tabix indexing").
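As a rough illustration of what "metadata-driven" could look like for a consumer such as the #30 workflow, the sketch below branches on the field alone. The function name needs_bgzip_before_tabix and the "bgzip" name value are hypothetical; this issue only spells out the gzip term (format:3989) and leaves the bgzip term as "the equivalent".

```python
from typing import Optional, Union

# compression_format is either an EDAM-term dict, "" (explicitly
# uncompressed), or None/missing (the current ENCODE state).
CompressionFormat = Union[dict, str, None]


def needs_bgzip_before_tabix(compression_format: CompressionFormat) -> Optional[bool]:
    """Decide from metadata alone whether a file needs bgzip before tabix.

    Returns True when the file must be (re)compressed with bgzip, False when
    it is already bgzipped, and None when the metadata cannot answer.
    """
    if compression_format is None:
        # Field missing: the caller would have to fall back to suffix inspection.
        return None
    if compression_format == "":
        # Explicitly uncompressed: tabix still needs bgzip-compressed input.
        return True
    # Assumption: the bgzip term is identified by name == "bgzip".
    return compression_format.get("name") != "bgzip"


if __name__ == "__main__":
    print(needs_bgzip_before_tabix({"id": "format:3989", "name": "gzip"}))
    print(needs_bgzip_before_tabix(None))
```

The None branch is the case this issue aims to eliminate for ENCODE documents.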
Root Cause
src/cfdb/services/encode.py::build_file_document (around line 269) builds the file doc from ENCODE TSV rows but never assigns compression_format:

    doc = {
        "submission": "encode",
        "id_namespace": id_namespace,
        "local_id": accession,
        "filename": filename,
        "size_in_bytes": size_in_bytes,
        "md5": _nonempty(row.get("md5sum")),
        "sha256": None,
        "access_url": access_url,
        "status": _nonempty(row.get("File Status")),
        "data_access_level": "public",
        "creation_time": _nonempty(row.get("Experiment date released")),
        "persistent_id": f"https://www.encodeproject.org/files/{accession}/",
    }
    # file_format is set below from TSV column; compression_format never is

ENCODE's C2M2 TSV has no compression column, so no value propagates. 4DN and HuBMAP populate the field because their upstream C2M2 datapackages include it.
Proposed Fix
Derive compression_format from the filename suffix. Add a helper _derive_compression_format(filename_or_url) in services/encode.py returning the EDAM id+name dict for .gz/.bgz (and optionally .bz2, .zip) or "", and call it alongside the existing get_file_format mapping:

    doc["compression_format"] = _derive_compression_format(access_url or filename)

The ENCODE REST API (/files/{accession}/?format=json) exposes file_format_specifications but not a dedicated compression field, so the filename suffix remains the canonical indicator. No REST calls needed.
Acceptance Criteria
- ENCODE file documents have compression_format populated as an EDAM term dict for compressed files, or "" for uncompressed, after a fresh sync.
- The helper handles .gz and .bgz; unknown/unsuffixed files fall back to "".
- Unit tests for _derive_compression_format covering: .gz, .bgz, plain filename, empty/None input.
- 4DN and HuBMAP compression_format values are unchanged (sourced from upstream C2M2 datapackage fields, untouched by this change).