fix(metadata): align direct-upload keys to canonical dg-* namespace #8

Merged
sscarduzio merged 2 commits into main from fix/direct-upload-metadata-namespace on May 17, 2026

@sscarduzio (Collaborator)

Summary

_upload_direct (the path taken by non-delta-eligible files like .sha1 / .sha512) was writing user-metadata with bare underscored keys (original_name, file_sha256, compression) while delta and reference uploads correctly used the canonical dashed namespace (dg-original-name, dg-file-sha256, dg-compression).

Downstream consumers, most visibly the DeltaGlider Proxy, recognised only the dashed form. As a result, listing a bucket that holds deltaglider-uploaded .sha1 / .sha512 files logged the following warning in the proxy for every such object on every LIST call:

WARN PATHOLOGICAL | Missing/corrupt DG metadata for
bucket/key.sha1 — falling back to passthrough.
This file was likely copied without --metadata flag.
Error: Storage error: Missing dg-original-name

This patch aligns the writer to the canonical scheme. The read path stays backward-compatible with already-stored bare-keyed objects via resolve_metadata, so no re-upload is required for the millions of .sha files already in production buckets.

Changes

Writer

  • _upload_direct emits metadata using f"{METADATA_PREFIX}{key}", matching the pattern delta/reference uploads already use.
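
A minimal sketch of the prefixing pattern described above. The value of `METADATA_PREFIX` (`"dg-"`) is inferred from the key names quoted in this PR, and the helper name `namespaced` and the sample values are hypothetical; the real `_upload_direct` is not shown here.

```python
# Assumed value: the PR only shows that canonical keys look like dg-original-name.
METADATA_PREFIX = "dg-"


def namespaced(fields: dict[str, str]) -> dict[str, str]:
    """Return a copy of `fields` with every key moved into the dg-* namespace.

    Hypothetical helper; keys are expected to be dashed already
    (e.g. "original-name"), so only the prefix is added.
    """
    return {f"{METADATA_PREFIX}{key}": value for key, value in fields.items()}


# Illustrative values only.
meta = namespaced(
    {
        "original-name": "app.zip.sha1",
        "file-sha256": "abc123",
        "compression": "none",
    }
)
```

With this shape, a writer cannot emit a bare key by accident, because the prefix is applied in one place rather than per field.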

Read-path alignment

  • METADATA_KEY_ALIASES now lists compression and source_name so resolve_metadata works for both fields uniformly.
  • Replaced bare metadata.get(...) lookups with resolve_metadata calls in:
    • DeltaService.get (dispatch on compression == "none")
    • DeltaService.delete + _delete_delta (extract original_name)
    • The recursive-delete listing path (identify direct uploads)
    • client.list_objects_v2 (FetchMetadata extras)
    • client_operations.stats.get_object_info
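
The alias-based lookup can be sketched roughly as follows. This is an illustration under assumptions, not the project's code: the contents of the alias table (in particular the legacy hyphenated spellings) are guesses based on the key names mentioned in this PR, and only the call shape `resolve_metadata(meta, "file_sha256")` returning `None` for a missing field is taken from the commit message.

```python
from typing import Mapping, Optional

# Hypothetical alias table: canonical dashed key first, then legacy spellings.
METADATA_KEY_ALIASES: dict[str, tuple[str, ...]] = {
    "original_name": ("dg-original-name", "original_name", "original-name"),
    "file_sha256": ("dg-file-sha256", "file_sha256", "file-sha256"),
    "compression": ("dg-compression", "compression"),
    "source_name": ("dg-source-name", "source_name", "source-name"),
}


def resolve_metadata(metadata: Mapping[str, str], field: str) -> Optional[str]:
    """Return the first non-empty value stored under any alias of `field`.

    Aliases are tried in order, so the dashed key wins when both schemes
    are present. Empty strings are treated the same as missing keys.
    """
    for alias in METADATA_KEY_ALIASES.get(field, (field,)):
        value = metadata.get(alias)
        if value:  # skips both None and ""
            return value
    return None
```

Ordering the tuple with the canonical key first is what makes "dashed wins" a property of the table rather than of each call site.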

Tests

  • tests/unit/test_metadata_aliases.py (new, 11 tests) pins the alias table contract:
    • new dashed keys resolve
    • legacy bare underscored keys resolve
    • legacy hyphenated keys resolve
    • dashed wins when both are present
    • empty strings count as missing
    • every field has both the new and the legacy form in its alias tuple
    • compression and source_name are present
  • test_direct_upload_emits_dashed_namespace in tests/unit/test_core_service.py — pins the writer to emit only dg-* keys; will catch any future regression on this exact bug.
  • Existing tests in test_s3_compat.py and test_recursive_delete_reference_cleanup.py that use the legacy bare compression: "none" form still pass unchanged, which shows the dual-scheme read contract holds.

Full unit suite: 87/87 pass, mypy clean, ruff clean.

Backward compatibility

| Scenario | Behaviour |
|---|---|
| Read pre-v6.1.2 direct upload (bare keys) | ✅ Still recognised via alias table |
| Read post-fix direct upload (dashed keys) | ✅ Recognised via primary alias |
| Both schemes present on same object | ✅ Dashed wins (test pinned) |
| Pre-v6.1.2 → upgrade → pre-v6.1.2 (downgrade) | ✅ New uploads still work; old reader matches compression/original_name directly |

No migration required.

Test plan

  • uv run pytest tests/unit/ — 87/87 pass locally
  • uv run ruff format --check src/ tests/ — clean
  • uv run ruff check src/ tests/ — clean
  • uv run mypy src/deltaglider — Success
  • CI green on the PR
  • Verify against DeltaGlider Proxy logs after a new upload — no more PATHOLOGICAL warning

🤖 Generated with Claude Code

sscarduzio and others added 2 commits May 17, 2026 09:21
`_upload_direct` (the path taken by non-delta-eligible files like
.sha1 / .sha512) wrote user-metadata with bare underscored keys
(`original_name`, `file_sha256`, `compression`) while delta and
reference uploads correctly used the canonical dashed namespace
(`dg-original-name`, `dg-file-sha256`, `dg-compression`).

Downstream consumers — most visibly the DeltaGlider Proxy — only
recognised the dashed form, so every .sha1 / .sha512 listing on
a bucket holding deltaglider-uploaded files produced:

    WARN PATHOLOGICAL | Missing/corrupt DG metadata for
    bucket/key.sha1 -- falling back to passthrough.
    Error: Storage error: Missing dg-original-name

This patch aligns the writer to the canonical scheme and keeps the
read path backward-compatible with already-stored bare-keyed objects
via `resolve_metadata`. No re-upload required.

Changes
-------
* `_upload_direct` emits metadata using `f"{METADATA_PREFIX}{key}"`
  (the same pattern delta/reference uploads already use).
* `METADATA_KEY_ALIASES` now lists `compression` and `source_name`
  so `resolve_metadata` works for both fields uniformly.
* Replaced bare `metadata.get("compression")` /
  `metadata.get("original_name")` / `metadata.get("file_size")` /
  `metadata.get("ref_key")` lookups in `DeltaService.get`,
  `DeltaService.delete`, `_delete_delta`, the recursive-delete
  listing path, `client.list_objects_v2`, and
  `client_operations.stats.get_object_info` with `resolve_metadata`
  calls so legacy bare-keyed objects keep working forever.

Tests
-----
* `tests/unit/test_metadata_aliases.py` (new, 11 tests) — pins the
  alias table contract: new dashed keys, legacy bare underscored
  keys, legacy hyphenated keys, priority rule, empty-string
  handling.
* `test_direct_upload_emits_dashed_namespace` in
  `tests/unit/test_core_service.py` — pins the writer to emit only
  dg-* keys.
* Existing tests using the legacy bare `compression: "none"` form
  in `test_s3_compat.py` and `test_recursive_delete_reference_*.py`
  still pass — proving the dual-scheme read contract holds.

Full unit suite: 87/87 pass, mypy clean, ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adversarial review of the original patch caught a second
asymmetry: DeltaService.get's "is this a regular S3 object or
DeltaGlider-managed?" dispatch was a literal-string check
`"dg-file-sha256" not in obj_head.metadata`. After the writer
fix, NEW direct uploads have `dg-file-sha256` so they route
correctly. But ~4400 pre-fix `.sha1` / `.sha512` files in
production have the bare `file_sha256` key, and they were
silently being routed through the "regular S3 object" branch
instead of the "direct upload" branch.

Both branches call `_get_direct` so file content was still
served correctly — but the wrong log message fired
("Downloading regular S3 object (no DeltaGlider metadata)") and
the recorded file-size for telemetry came from obj_head.size
instead of the metadata's `file_size` (same value for direct
uploads, but still semantically wrong).

Swap the literal-string check for `resolve_metadata(meta,
"file_sha256") is None` so both schemes route to the
DeltaGlider-managed branch.
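
The difference between the two checks can be illustrated with a small sketch. The function names `is_regular_s3_old` / `is_regular_s3_new`, the reduced alias table, and the sample metadata are hypothetical; only the two expressions being compared come from this commit message.

```python
from typing import Mapping, Optional

# Reduced, assumed alias table: canonical key first, legacy bare key second.
ALIASES = {"file_sha256": ("dg-file-sha256", "file_sha256")}


def resolve_metadata(metadata: Mapping[str, str], field: str) -> Optional[str]:
    """Stand-in for the real helper: first non-empty alias value, else None."""
    for alias in ALIASES.get(field, (field,)):
        value = metadata.get(alias)
        if value:
            return value
    return None


def is_regular_s3_old(metadata: Mapping[str, str]) -> bool:
    # Pre-fix dispatch: only the dashed key counts as DeltaGlider metadata.
    return "dg-file-sha256" not in metadata


def is_regular_s3_new(metadata: Mapping[str, str]) -> bool:
    # Post-fix dispatch: either scheme marks the object as DeltaGlider-managed.
    return resolve_metadata(metadata, "file_sha256") is None


# Legacy direct upload as written before the writer fix (illustrative values).
legacy = {"file_sha256": "abc123", "compression": "none"}
dashed = {"dg-file-sha256": "abc123", "dg-compression": "none"}
```

Under these assumptions the old check classifies `legacy` as a regular S3 object while the new check does not, and both agree on `dashed` and on an object with no metadata at all.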

Added regression test
`test_get_legacy_direct_upload_not_misclassified_as_regular_s3`
that builds a HEAD response with the legacy bare-keyed metadata
shape (exactly what's stored on Hetzner today for the .sha files),
captures the log messages, and fails if the "regular S3 object"
canary fires.
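
The log-canary idea can be sketched with the standard `logging` module. Everything except the canary string is hypothetical: `describe_download` is a stand-in for the dispatch in `DeltaService.get`, the "direct upload" log text is invented for illustration, and the real test may capture logs differently.

```python
import logging

# Canary text quoted from this commit message.
CANARY = "Downloading regular S3 object (no DeltaGlider metadata)"


class ListHandler(logging.Handler):
    """Collects formatted log messages in memory for later inspection."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


def describe_download(metadata: dict[str, str], logger: logging.Logger) -> str:
    """Hypothetical stand-in for the dispatch: accepts either key scheme."""
    sha = metadata.get("dg-file-sha256") or metadata.get("file_sha256")
    if not sha:
        logger.info(CANARY)
        return "regular"
    logger.info("Downloading direct upload")  # invented message
    return "direct"


def capture(metadata: dict[str, str]) -> tuple[str, list[str]]:
    """Run the dispatch and return (branch, captured log messages)."""
    logger = logging.getLogger("dispatch-sketch")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = ListHandler()
    logger.addHandler(handler)
    try:
        branch = describe_download(metadata, logger)
    finally:
        logger.removeHandler(handler)
    return branch, handler.messages
```

A regression test in this style asserts that the canary is absent from the captured messages for legacy bare-keyed metadata, so reverting the dispatch to a literal-string check makes the assertion fail.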

Demonstrated locally: revert the dispatch back to literal-string
check → new test fails with the canary log line. Restore →
88/88 pass.

CHANGELOG updated to document both fixes (writer + dispatch).
@sscarduzio sscarduzio merged commit d81240b into main May 17, 2026
5 checks passed
@sscarduzio sscarduzio deleted the fix/direct-upload-metadata-namespace branch May 18, 2026 08:04