Add method to join new CMIP6 data into existing parquet data catalogs #575

@lewisjared

Description

Summary

Currently, parquet data catalogs (e.g. cmip6_catalog.parquet) are regenerated from scratch by scanning entire directory trees via generate_esgf_catalog.py or generate_catalog() in solve_helpers.py. There is no way to incrementally add new CMIP6 datasets to an existing parquet catalog without re-scanning everything.

As ESGF archives grow, full regeneration becomes increasingly expensive. We need a method to join/append new data into an existing parquet catalog.

Current Behavior

  • generate_catalog() scans all directories and produces a complete DataFrame
  • write_catalog_parquet() overwrites the parquet file
  • load_solve_catalog() reads the parquet file into a DataFrame
  • No incremental update path exists — adding one new dataset requires re-scanning the entire archive

Proposed Behavior

Add a method (e.g. on DatasetAdapter or as a standalone helper in solve_helpers.py) that:

  1. Loads the existing parquet catalog
  2. Scans only the new directory/files for metadata (using the existing find_local_datasets machinery)
  3. Merges the new entries into the existing catalog, deduplicating on a suitable key (e.g. instance_id or the combination of DRS facets)
  4. Applies version filtering (filter_latest_versions) so superseded versions are dropped
  5. Writes the updated catalog back to parquet

API sketch

def update_catalog_parquet(
    existing_path: Path,
    new_directories: list[Path],
    source_type: SourceDatasetType,
    strip_path_prefix: Path | None = None,
) -> pd.DataFrame:
    """Join new datasets into an existing parquet catalog."""
    ...
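A minimal pandas-based sketch of how the body of that function could realize steps 1–5. This is illustrative only: `merge_catalog_entries` is a hypothetical helper, the directory-scanning step (`find_local_datasets`) is omitted and replaced by an already-scanned `new_entries` DataFrame, and deduplication is shown on a single `instance_id` column rather than the full facet set.

```python
# Hedged sketch: merge newly scanned entries into an existing parquet catalog.
# `instance_id` as the dedup key and the simplified signature are assumptions,
# not the real solve_helpers.py API.
from pathlib import Path

import pandas as pd


def merge_catalog_entries(existing: pd.DataFrame, new: pd.DataFrame) -> pd.DataFrame:
    """Append new entries, letting them supersede duplicate rows (step 3)."""
    merged = pd.concat([existing, new], ignore_index=True)
    # keep="last": a re-scanned dataset replaces its older catalog row
    return merged.drop_duplicates(subset="instance_id", keep="last").reset_index(drop=True)


def update_catalog_parquet(existing_path: Path, new_entries: pd.DataFrame) -> pd.DataFrame:
    """Join new datasets into an existing parquet catalog (steps 1-5, sketched)."""
    if existing_path.exists():
        existing = pd.read_parquet(existing_path)  # step 1
        merged = merge_catalog_entries(existing, new_entries)  # step 3
    else:
        # No existing catalog yet: degrade to a plain full write (see Considerations)
        merged = new_entries.reset_index(drop=True)
    # Step 4 (filter_latest_versions) would run here, on the merged frame
    merged.to_parquet(existing_path, index=False)  # step 5
    return merged
```

Keeping the merge in a separate pure function makes the supersession behavior easy to unit-test without touching parquet files.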

This could also be exposed as a CLI option, e.g.:

python scripts/generate_esgf_catalog.py --update --cmip6-dir /path/to/new/data
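The CLI wiring for such a flag could look like the following. This is a hypothetical sketch: neither `--update` nor `--cmip6-dir` exists in the current script, and the flag names simply mirror the example above.

```python
# Hypothetical argparse wiring for an --update mode in generate_esgf_catalog.py.
import argparse
from pathlib import Path

parser = argparse.ArgumentParser(description="Generate or update an ESGF catalog")
parser.add_argument(
    "--update",
    action="store_true",
    help="Merge new datasets into the existing parquet instead of regenerating",
)
parser.add_argument(
    "--cmip6-dir",
    type=Path,
    action="append",
    default=[],
    help="Directory to scan for new datasets; may be given multiple times",
)

# Parsing the invocation from the example above:
args = parser.parse_args(["--update", "--cmip6-dir", "/path/to/new/data"])
```

Using `action="append"` lets callers pass several new directories in one invocation, matching the `new_directories: list[Path]` parameter in the API sketch.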

Considerations

  • Deduplication key: need to decide what constitutes a "same dataset" for merge purposes (likely the full set of DRS facets minus version, so newer versions replace older ones)
  • Should handle the case where the existing parquet doesn't exist yet (fall back to full generation)
  • Version filtering should run after the merge so that new versions of existing datasets correctly supersede old ones
  • The {data_dir} path prefix stripping should be consistent between old and new entries
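The version-supersession semantics described above can be sketched as follows. `keep_latest_versions` is a stand-in illustrating what `filter_latest_versions` presumably does after the merge; the facet column names are an illustrative subset, not the real schema.

```python
# Hedged sketch of post-merge version filtering: keep only the newest version
# per dataset identity (all DRS facets minus version). Column names assumed.
import pandas as pd

DEDUP_FACETS = ["source_id", "experiment_id", "variable_id"]  # illustrative subset


def keep_latest_versions(catalog: pd.DataFrame) -> pd.DataFrame:
    """Drop rows superseded by a newer version of the same dataset."""
    # ESGF-style versions ("v20230101") sort correctly as strings; a stable
    # sort keeps the original row order within each version.
    ordered = catalog.sort_values("version", kind="stable")
    latest = ordered.drop_duplicates(subset=DEDUP_FACETS, keep="last")
    # Restore the catalog's original row order
    return latest.sort_index().reset_index(drop=True)
```

Running this after the merge, rather than before, is what lets a newly scanned `v2023...` row displace an existing `v2020...` row for the same facets.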
