fix(metadata): align direct-upload keys to canonical dg-* namespace#8
Merged
Merged
Conversation
`_upload_direct` (the path taken by non-delta-eligible files like
.sha1 / .sha512) wrote user-metadata with bare underscored keys
(`original_name`, `file_sha256`, `compression`) while delta and
reference uploads correctly used the canonical dashed namespace
(`dg-original-name`, `dg-file-sha256`, `dg-compression`).
Downstream consumers — most visibly the DeltaGlider Proxy — only
recognised the dashed form, so every .sha1 / .sha512 listing on
a bucket holding deltaglider-uploaded files produced:
WARN PATHOLOGICAL | Missing/corrupt DG metadata for
bucket/key.sha1 -- falling back to passthrough.
Error: Storage error: Missing dg-original-name
This patch aligns the writer to the canonical scheme and keeps the
read path backward-compatible with already-stored bare-keyed objects
via `resolve_metadata`. No re-upload required.
Changes
-------
* `_upload_direct` emits metadata using `f"{METADATA_PREFIX}{key}"`
(the same pattern delta/reference uploads already use).
* `METADATA_KEY_ALIASES` now lists `compression` and `source_name`
so `resolve_metadata` works for both fields uniformly.
* Replaced bare `metadata.get("compression")` /
`metadata.get("original_name")` / `metadata.get("file_size")` /
`metadata.get("ref_key")` lookups in `DeltaService.get`,
`DeltaService.delete`, `_delete_delta`, the recursive-delete
listing path, `client.list_objects_v2`, and
`client_operations.stats.get_object_info` with `resolve_metadata`
calls so legacy bare-keyed objects keep working forever.
Tests
-----
* `tests/unit/test_metadata_aliases.py` (new, 11 tests) — pins the
alias table contract: new dashed keys, legacy bare underscored
keys, legacy hyphenated keys, priority rule, empty-string
handling.
* `test_direct_upload_emits_dashed_namespace` in
`tests/unit/test_core_service.py` — pins the writer to emit only
dg-* keys.
* Existing tests using the legacy bare `compression: "none"` form
in `test_s3_compat.py` and `test_recursive_delete_reference_*.py`
still pass — proving the dual-scheme read contract holds.
Full unit suite: 87/87 pass, mypy clean, ruff clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adversarial review of the original patch caught a second
asymmetry: DeltaService.get's "is this a regular S3 object or
DeltaGlider-managed?" dispatch was a literal-string check
`"dg-file-sha256" not in obj_head.metadata`. After the writer
fix, NEW direct uploads have `dg-file-sha256` so they route
correctly. But ~4400 pre-fix `.sha1` / `.sha512` files in
production have the bare `file_sha256` key, and they were
silently being routed through the "regular S3 object" branch
instead of the "direct upload" branch.
Both branches call `_get_direct` so file content was still
served correctly — but the wrong log message fired
("Downloading regular S3 object (no DeltaGlider metadata)") and
the recorded file-size for telemetry came from obj_head.size
instead of the metadata's `file_size` (same value for direct
uploads, but still semantically wrong).
Swap the literal-string check for `resolve_metadata(meta,
"file_sha256") is None` so both schemes route to the
DeltaGlider-managed branch.
Added regression test `test_get_legacy_direct_upload_not_
misclassified_as_regular_s3` that builds a HEAD response with
the legacy bare-keyed metadata shape (exactly what's stored on
Hetzner today for the .sha files), captures the log messages,
and fails if the "regular S3 object" canary fires.
Demonstrated locally: revert the dispatch back to literal-string
check → new test fails with the canary log line. Restore →
88/88 pass.
CHANGELOG updated to document both fixes (writer + dispatch).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
_upload_direct(the path taken by non-delta-eligible files like.sha1/.sha512) was writing user-metadata with bare underscored keys (original_name,file_sha256,compression) while delta and reference uploads correctly used the canonical dashed namespace (dg-original-name,dg-file-sha256,dg-compression).Downstream consumers — most visibly the DeltaGlider Proxy — only recognised the dashed form, so every
.sha1/.sha512listing on a bucket holding deltaglider-uploaded files produced this in the proxy's logs, for every single object on every LIST call:This patch aligns the writer to the canonical scheme. The read path stays backward-compatible with already-stored bare-keyed objects via
resolve_metadataso no re-upload is required for the millions of.shafiles already in production buckets.Changes
Writer
_upload_directemits metadata usingf"{METADATA_PREFIX}{key}", matching the pattern delta/reference uploads already use.Read-path alignment
METADATA_KEY_ALIASESnow listscompressionandsource_namesoresolve_metadataworks for both fields uniformly.metadata.get(...)lookups withresolve_metadatacalls in:DeltaService.get(dispatch oncompression == "none")DeltaService.delete+_delete_delta(extractoriginal_name)client.list_objects_v2(FetchMetadata extras)client_operations.stats.get_object_infoTests
tests/unit/test_metadata_aliases.py(new, 11 tests) — pins the alias table contract: new dashed keys resolve, legacy bare underscored keys resolve, legacy hyphenated keys resolve, dashed wins when both present, empty strings count as missing, every field has both the new and the legacy form in its alias tuple,compressionandsource_nameare present.test_direct_upload_emits_dashed_namespaceintests/unit/test_core_service.py— pins the writer to emit onlydg-*keys; will catch any future regression on this exact bug.test_s3_compat.pyandtest_recursive_delete_reference_cleanup.pyusing the legacy barecompression: "none"form still pass unchanged — proving the dual-scheme read contract holds.Full unit suite: 87/87 pass, mypy clean, ruff clean.
Backward compatibility
compression/original_namedirectlyNo migration required.
Test plan
uv run pytest tests/unit/— 87/87 pass locallyuv run ruff format --check src/ tests/— cleanuv run ruff check src/ tests/— cleanuv run mypy src/deltaglider— Success🤖 Generated with Claude Code