Populate compression_format on ENCODE file documents in the scraper #31

## Description

ENCODE file documents in Mongo have a null/missing `compression_format` field, even when the filename clearly indicates gzip compression. 4DN and HuBMAP populate the field (as `""` for uncompressed files); ENCODE does not.

Observed across the current materialized corpus (all filenames end in `.gz`, all have no `compression_format`):

| Format | ENCODE files |
|---|---:|
| FASTA | 98 |
| VCF | 40 |
| GFF | 679 |
| GTF | 1,303 |
| BroadPeak | 2,889 |
| NarrowPeak | 47,462 |
| BED (`.gz` subset) | 29,280 |
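
To reproduce counts like these, a filter along the following lines could be used. This is a minimal sketch: the `submission` and `filename` field names come from the document shape shown under Root Cause, while the function name and the collection handle in the usage note are assumptions, not existing code.

```python
def missing_compression_filter(submission="encode"):
    """Build a MongoDB filter for gzip-named files lacking compression_format.

    Hypothetical helper for illustration only.
    """
    return {
        "submission": submission,
        # Filenames that end in ".gz".
        "filename": {"$regex": r"\.gz$"},
        # In MongoDB, an equality match on null matches documents where
        # the field is null as well as documents where it is absent.
        "compression_format": None,
    }


# Usage with pymongo, assuming `files` is the collection holding file docs:
#   files.count_documents(missing_compression_filter())
```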

## Expected Behavior

Every ENCODE file document should carry a `compression_format` value. For compressed files this is an EDAM-term dict with the same shape as `file_format`: typically `{ "id": "format:3989", "name": "gzip" }` for plain gzip, and the equivalent term for bgzip. For uncompressed files it is `""`, mirroring the 4DN/HuBMAP "explicitly uncompressed" convention. Downstream consumers should not have to infer compression from filename suffixes.

This matters immediately for #30 (preprocessing/indexing workflow), which needs to decide whether a file requires `gunzip | bgzip` recompression before tabix indexing. With the field reliably populated, that decision stays metadata-driven instead of falling back to suffix inspection, and the field becomes useful for queries (e.g., "all bgzipped BED files ready for direct tabix indexing").
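
As an illustration of what "metadata-driven" could look like for #30, here is a rough sketch. The function name and the returned action strings are hypothetical, and it assumes the bgzip term carries `"name": "bgzip"`, which this issue does not confirm.

```python
def plan_tabix_preprocessing(file_doc):
    """Choose a preprocessing action from compression_format alone.

    Hypothetical sketch; the action names are illustrative, not an existing API.
    """
    fmt = file_doc.get("compression_format")
    if fmt is None:
        # Field missing or null: the current ENCODE situation.
        return "unknown"
    if fmt == "":
        # Explicitly uncompressed: needs bgzip compression before indexing.
        return "compress"
    if isinstance(fmt, dict):
        name = fmt.get("name")
        if name == "gzip":
            # Plain gzip: needs `gunzip | bgzip` before tabix indexing.
            return "recompress"
        if name == "bgzip":
            # Assumed term name; already suitable for direct tabix indexing.
            return "index"
    return "unknown"
```

With the field unset, as ENCODE documents are today, every file lands in `"unknown"` and the workflow would have to fall back to suffix inspection.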

## Root Cause

`src/cfdb/services/encode.py::build_file_document` (around line 269) builds the file doc from ENCODE TSV rows but never assigns `compression_format`:

```python
doc = {
    "submission": "encode",
    "id_namespace": id_namespace,
    "local_id": accession,
    "filename": filename,
    "size_in_bytes": size_in_bytes,
    "md5": _nonempty(row.get("md5sum")),
    "sha256": None,
    "access_url": access_url,
    "status": _nonempty(row.get("File Status")),
    "data_access_level": "public",
    "creation_time": _nonempty(row.get("Experiment date released")),
    "persistent_id": f"https://www.encodeproject.org/files/{accession}/",
}
# file_format is set below from TSV column; compression_format never is
```

ENCODE's C2M2 TSV has no compression column, so no value propagates. 4DN and HuBMAP populate the field because their upstream C2M2 datapackages include it.

### Proposed fix

Derive `compression_format` from the filename suffix. Add a helper `_derive_compression_format(filename_or_url)` in `services/encode.py` returning the EDAM id+name dict for `.gz`/`.bgz` (and optionally `.bz2`, `.zip`) or `""`, and call it alongside the existing `get_file_format` mapping:

```python
doc["compression_format"] = _derive_compression_format(access_url or filename)
```

The ENCODE REST API (`/files/{accession}/?format=json`) exposes `file_format_specifications` but not a dedicated compression field, so the filename suffix remains the canonical indicator. No REST calls needed.

## Acceptance Criteria

- [ ] All ENCODE file documents have `compression_format` populated as an EDAM term dict for compressed files, or `""` for uncompressed, after a fresh sync.
- [ ] Derivation handles at minimum `.gz` and `.bgz`; unknown/unsuffixed files fall back to `""`.
- [ ] Unit test for `_derive_compression_format` covering: `.gz`, `.bgz`, plain filename, empty/None input.
- [ ] 4DN and HuBMAP `compression_format` values are unchanged (sourced from upstream C2M2 datapackage fields, untouched by this change).
