fix(storage): make GCS storage layer importName-subdirectory aware#114
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #114 +/- ##
==========================================
+ Coverage 91.06% 93.64% +2.58%
==========================================
Files 37 38 +1
Lines 1801 2157 +356
==========================================
+ Hits 1640 2020 +380
+ Misses 161 137 -24 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
Updates the GCS storage utilities to be aware of the ingestion/input/<importName>/ subdirectory layout, fixing prior incorrect uploads, dangerous --sync deletion behavior, and incorrect CSV registration checks under multi-import prefixes.
Changes:
- Remove subdir JSON skipping and standardize remote path construction for uploads/sync.
- Make sync deletion subset-safe by only pruning within locally-present import scopes (plus legacy root-level pruning).
- Add optional per-import scoping to CSV registration checks, strip stray slashes in
KGSettings, and enforce kwargs-onlygcs_folder_namein public storage APIs.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
src/bblocks/datacommons_tools/gcp_utilities/storage.py |
Adds import-subdir-aware sync scoping, shared remote-path helpers, kwargs-only APIs, and per-import registration checks. |
src/bblocks/datacommons_tools/gcp_utilities/settings.py |
Adds validators to normalize GCS folder path settings by stripping leading/trailing slashes. |
tests/test_storage.py |
Expands tests for nested JSON uploads, subset-safe sync behavior, per-import registration, fresh-import behavior, slash normalization, and kwargs-only enforcement. |
tests/test_explicit_path_characterization.py |
Updates callers for the new kwargs-only storage function signatures. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| This function first uploads all local files (via | ||
| :func:`upload_directory_to_gcs`) and then removes any blobs under | ||
| ``gcs_folder_name`` that no longer have a local counterpart. | ||
| :func:`upload_directory_to_gcs`) and then removes any blobs that are | ||
| stale (no longer have a local counterpart) within the scopes of the | ||
| import subdirectories present locally. |
The DCP platform stores each import under a shared input prefix (ingestion/input/<importName>/). The storage layer assumed a flat, single-import layout, causing three defects: nested config.json was skipped on upload, --sync deleted sibling imports (data loss), and the registration checks reported every CSV as both unregistered and missing. Sync deletion is now scoped to the import subdirs present locally, so a sibling import's blobs are never removed. Nested config.json uploads. The registration checks take an optional import_name to scope per import. Adds a _remote_path helper, a slash-stripping settings validator, kwargs-only signatures, and regression tests. Closes #108 Co-authored-by: Claude <noreply@anthropic.com>
Clarify the sync_directory_to_gcs docstring to match behaviour: deletion scopes to import subdirs that contain files locally, so empty-but-present import dirs are not pruned. Drop an unused mock blob and the unused tmp_path fixture from the per-import registration test. Co-authored-by: Claude <noreply@anthropic.com>
e12bb7a to
6c927e2
Compare
What
The DCP platform stores each import under a shared input prefix,
ingestion/input/<importName>/. The storage layer assumed a flat, single-import layout. Three defects followed.config.jsonwas skipped on upload._SKIP_IN_SUBDIRdropped any.jsonthat was not in the upload root, so per-import configs never reached GCS and the load job had nothing to read.--syncdeleted sibling imports.sync_directory_to_gcslisted every blob under the shared prefix and deleted anything without a local counterpart, wiping every other import's data. Permanent data loss.get_unregistered_csv_filesandget_missing_csv_filescompared<importName>/data.csvblob paths against flat config keys, so every CSV read as both unregistered and missing.How
Sync deletion now scopes to the import subdirectories present locally. Two rules decide what is stale. A subtree rule prunes any depth under a present import subdir. A direct-child rule prunes legacy root-level blobs. A blob under an import you did not bring locally is never a deletion candidate, which is the data-loss fix.
_SKIP_IN_SUBDIRis gone, so nestedconfig.jsonuploads. The two registration functions take an optionalimport_namethat scopes the check to one import's subdirectory; omit it and they behave as before. A fresh import with nothing uploaded yet now returns the right answer instead of raising.Supporting changes are a shared
_remote_pathhelper so upload and sync compute paths in one place, aKGSettingsvalidator that strips stray slashes from the folder paths, and kwargs-only signatures on the public storage functions.pipeline.pyis unchanged. Sync became subset-safe by discovering import names from the directory tree, so the caller needs no new argument.Testing
uv run pytestpasses 96 tests.ruff checkandruff format --checkare clean. Eight new or updated tests cover sibling-import preservation, legacy pruning, per-import registration, the fresh-import case, slash handling, and kwargs-only enforcement.Closes #108
Co-authored-by: Claude noreply@anthropic.com