Skip to content

fix(storage): make GCS storage layer importName-subdirectory aware#114

Merged
jm-rivera merged 2 commits into
mainfrom
fix/gcs-importname-subdirs
Jun 25, 2026
Merged

fix(storage): make GCS storage layer importName-subdirectory aware#114
jm-rivera merged 2 commits into
mainfrom
fix/gcs-importname-subdirs

Conversation

@jm-rivera

Copy link
Copy Markdown
Collaborator

What

The DCP platform stores each import under a shared input prefix, ingestion/input/<importName>/. The storage layer assumed a flat, single-import layout. Three defects followed.

  1. Nested config.json was skipped on upload. _SKIP_IN_SUBDIR dropped any .json that was not in the upload root, so per-import configs never reached GCS and the load job had nothing to read.
  2. --sync deleted sibling imports. sync_directory_to_gcs listed every blob under the shared prefix and deleted anything without a local counterpart, wiping every other import's data. Permanent data loss.
  3. Registration checks misreported. get_unregistered_csv_files and get_missing_csv_files compared <importName>/data.csv blob paths against flat config keys, so every CSV read as both unregistered and missing.

How

Sync deletion now scopes to the import subdirectories present locally. Two rules decide what is stale. A subtree rule prunes any depth under a present import subdir. A direct-child rule prunes legacy root-level blobs. A blob under an import you did not bring locally is never a deletion candidate, which is the data-loss fix.

_SKIP_IN_SUBDIR is gone, so nested config.json uploads. The two registration functions take an optional import_name that scopes the check to one import's subdirectory; omit it and they behave as before. A fresh import with nothing uploaded yet now returns the right answer instead of raising.

Supporting changes are a shared _remote_path helper so upload and sync compute paths in one place, a KGSettings validator that strips stray slashes from the folder paths, and kwargs-only signatures on the public storage functions.

pipeline.py is unchanged. Sync became subset-safe by discovering import names from the directory tree, so the caller needs no new argument.

Testing

uv run pytest passes 96 tests. ruff check and ruff format --check are clean. Eight new or updated tests cover sibling-import preservation, legacy pruning, per-import registration, the fresh-import case, slash handling, and kwargs-only enforcement.

Closes #108

Co-authored-by: Claude noreply@anthropic.com

@jm-rivera jm-rivera requested a review from Copilot June 24, 2026 16:08
@codecov-commenter

codecov-commenter commented Jun 24, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 99.38650% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 93.64%. Comparing base (48aecf8) to head (e12bb7a).
⚠️ Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
...bblocks/datacommons_tools/gcp_utilities/storage.py 97.87% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #114      +/-   ##
==========================================
+ Coverage   91.06%   93.64%   +2.58%     
==========================================
  Files          37       38       +1     
  Lines        1801     2157     +356     
==========================================
+ Hits         1640     2020     +380     
+ Misses        161      137      -24     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@jm-rivera jm-rivera marked this pull request as ready for review June 24, 2026 16:09
@jm-rivera jm-rivera self-assigned this Jun 24, 2026
@jm-rivera jm-rivera requested a review from tillywoodfield June 24, 2026 16:09

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates the GCS storage utilities to be aware of the ingestion/input/<importName>/ subdirectory layout, fixing prior incorrect uploads, dangerous --sync deletion behavior, and incorrect CSV registration checks under multi-import prefixes.

Changes:

  • Remove subdir JSON skipping and standardize remote path construction for uploads/sync.
  • Make sync deletion subset-safe by only pruning within locally-present import scopes (plus legacy root-level pruning).
  • Add optional per-import scoping to CSV registration checks, strip stray slashes in KGSettings, and enforce kwargs-only gcs_folder_name in public storage APIs.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
src/bblocks/datacommons_tools/gcp_utilities/storage.py Adds import-subdir-aware sync scoping, shared remote-path helpers, kwargs-only APIs, and per-import registration checks.
src/bblocks/datacommons_tools/gcp_utilities/settings.py Adds validators to normalize GCS folder path settings by stripping leading/trailing slashes.
tests/test_storage.py Expands tests for nested JSON uploads, subset-safe sync behavior, per-import registration, fresh-import behavior, slash normalization, and kwargs-only enforcement.
tests/test_explicit_path_characterization.py Updates callers for the new kwargs-only storage function signatures.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +151 to +154
This function first uploads all local files (via
:func:`upload_directory_to_gcs`) and then removes any blobs under
``gcs_folder_name`` that no longer have a local counterpart.
:func:`upload_directory_to_gcs`) and then removes any blobs that are
stale (no longer have a local counterpart) within the scopes of the
import subdirectories present locally.
Comment thread tests/test_storage.py
jm-rivera and others added 2 commits June 25, 2026 13:26
The DCP platform stores each import under a shared input prefix
(ingestion/input/<importName>/). The storage layer assumed a flat,
single-import layout, causing three defects: nested config.json was
skipped on upload, --sync deleted sibling imports (data loss), and the
registration checks reported every CSV as both unregistered and missing.

Sync deletion is now scoped to the import subdirs present locally, so a
sibling import's blobs are never removed. Nested config.json uploads.
The registration checks take an optional import_name to scope per import.
Adds a _remote_path helper, a slash-stripping settings validator,
kwargs-only signatures, and regression tests.

Closes #108

Co-authored-by: Claude <noreply@anthropic.com>
Clarify the sync_directory_to_gcs docstring to match behaviour: deletion
scopes to import subdirs that contain files locally, so empty-but-present
import dirs are not pruned. Drop an unused mock blob and the unused
tmp_path fixture from the per-import registration test.

Co-authored-by: Claude <noreply@anthropic.com>
@jm-rivera jm-rivera force-pushed the fix/gcs-importname-subdirs branch from e12bb7a to 6c927e2 Compare June 25, 2026 11:28
@jm-rivera jm-rivera merged commit 10a3756 into main Jun 25, 2026
3 checks passed
@jm-rivera jm-rivera deleted the fix/gcs-importname-subdirs branch June 25, 2026 11:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

GCS storage layer is not importName-subdirectory aware (skipped config.json, destructive --sync, wrong registration checks)

4 participants