Summary
Currently, parquet data catalogs (e.g. cmip6_catalog.parquet) are regenerated from scratch by scanning entire directory trees via generate_esgf_catalog.py or generate_catalog() in solve_helpers.py. There is no way to incrementally add new CMIP6 datasets to an existing parquet catalog without re-scanning everything.
As ESGF archives grow, full regeneration becomes increasingly expensive. We need a method to join/append new data into an existing parquet catalog.
Current Behavior
- `generate_catalog()` scans all directories and produces a complete DataFrame
- `write_catalog_parquet()` overwrites the parquet file
- `load_solve_catalog()` reads the parquet file into a DataFrame
- No incremental update path exists; adding one new dataset requires re-scanning the entire archive
Proposed Behavior
Add a method (e.g. on `DatasetAdapter` or as a standalone helper in `solve_helpers.py`) that:
- Loads the existing parquet catalog
- Scans only the new directories/files for metadata (using the existing `find_local_datasets` machinery)
- Merges the new entries into the existing catalog, deduplicating on a suitable key (e.g. `instance_id` or the combination of DRS facets)
- Applies version filtering (`filter_latest_versions`) so superseded versions are dropped
- Writes the updated catalog back to parquet
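The merge-and-filter portion of these steps could look like the following pandas sketch. This is a minimal illustration, not the real implementation: the facet column names are placeholders for the actual DRS facet set, and it assumes `vYYYYMMDD`-style version strings that order correctly under lexicographic sort.

```python
import pandas as pd

# Hypothetical DRS facet columns; the real catalog schema may differ.
FACETS = ["source_id", "experiment_id", "variable_id"]


def merge_and_filter(existing: pd.DataFrame, new: pd.DataFrame) -> pd.DataFrame:
    """Append new entries, then keep only the latest version per facet key,
    mirroring what running filter_latest_versions after the merge would do."""
    combined = pd.concat([existing, new], ignore_index=True)
    # Versions like v20230101 sort lexicographically, so the newest sorts last.
    combined = combined.sort_values("version", kind="stable")
    return combined.drop_duplicates(subset=FACETS, keep="last").reset_index(drop=True)
```

Because the dedup key excludes `version`, a re-scan of an already-catalogued dataset with a newer version replaces the old row rather than duplicating it.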
API sketch
```python
def update_catalog_parquet(
    existing_path: Path,
    new_directories: list[Path],
    source_type: SourceDatasetType,
    strip_path_prefix: Path | None = None,
) -> pd.DataFrame:
    """Join new datasets into an existing parquet catalog."""
    ...
```

This could also be exposed as a CLI option, e.g.:

```shell
python scripts/generate_esgf_catalog.py --update --cmip6-dir /path/to/new/data
```

Considerations
- Deduplication key: need to decide what constitutes a "same dataset" for merge purposes (likely the full set of DRS facets minus version, so newer versions replace older ones)
- Should handle the case where the existing parquet doesn't exist yet (fall back to full generation)
- Version filtering should run after the merge so that new versions of existing datasets correctly supersede old ones
- The `{data_dir}` path prefix stripping should be consistent between old and new entries