Add method to join new CMIP6 data into existing parquet data catalogs #575

@lewisjared

Description

Summary

Currently, parquet data catalogs (e.g. cmip6_catalog.parquet) are regenerated from scratch by scanning entire directory trees via generate_esgf_catalog.py or generate_catalog() in solve_helpers.py. There is no way to incrementally add new CMIP6 datasets to an existing parquet catalog without re-scanning everything.

As ESGF archives grow, full regeneration becomes increasingly expensive. We need a method to join/append new data into an existing parquet catalog.

Current Behavior

  • generate_catalog() scans all directories and produces a complete DataFrame
  • write_catalog_parquet() overwrites the parquet file
  • load_solve_catalog() reads the parquet file into a DataFrame
  • No incremental update path exists — adding one new dataset requires re-scanning the entire archive

Proposed Behavior

Add a method (e.g. on DatasetAdapter or as a standalone helper in solve_helpers.py) that:

  1. Loads the existing parquet catalog
  2. Scans only the new directory/files for metadata (using the existing find_local_datasets machinery)
  3. Merges the new entries into the existing catalog, deduplicating on a suitable key (e.g. instance_id or the combination of DRS facets)
  4. Applies version filtering (filter_latest_versions) so superseded versions are dropped
  5. Writes the updated catalog back to parquet

API sketch

def update_catalog_parquet(
    existing_path: Path,
    new_directories: list[Path],
    source_type: SourceDatasetType,
    strip_path_prefix: Path | None = None,
) -> pd.DataFrame:
    """Join new datasets into an existing parquet catalog."""
    ...
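A minimal pandas-based sketch of how the body of that function could realize steps 1–5. This is illustrative only: `merge_catalog_entries` is a hypothetical helper, the directory-scanning step (`find_local_datasets`) is omitted and replaced by an already-scanned `new_entries` DataFrame, and deduplication is shown on a single `instance_id` column rather than the full facet set.

```python
# Hedged sketch: merge newly scanned entries into an existing parquet catalog.
# `instance_id` as the dedup key and the simplified signature are assumptions,
# not the real solve_helpers.py API.
from pathlib import Path

import pandas as pd


def merge_catalog_entries(existing: pd.DataFrame, new: pd.DataFrame) -> pd.DataFrame:
    """Append new entries, letting them supersede duplicate rows (step 3)."""
    merged = pd.concat([existing, new], ignore_index=True)
    # keep="last": a re-scanned dataset replaces its older catalog row
    return merged.drop_duplicates(subset="instance_id", keep="last").reset_index(drop=True)


def update_catalog_parquet(existing_path: Path, new_entries: pd.DataFrame) -> pd.DataFrame:
    """Join new datasets into an existing parquet catalog (steps 1-5, sketched)."""
    if existing_path.exists():
        existing = pd.read_parquet(existing_path)  # step 1
        merged = merge_catalog_entries(existing, new_entries)  # step 3
    else:
        # No existing catalog yet: degrade to a plain full write (see Considerations)
        merged = new_entries.reset_index(drop=True)
    # Step 4 (filter_latest_versions) would run here, on the merged frame
    merged.to_parquet(existing_path, index=False)  # step 5
    return merged
```

Keeping the merge in a separate pure function makes the supersession behavior easy to unit-test without touching parquet files.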

This could also be exposed as a CLI option, e.g.:

python scripts/generate_esgf_catalog.py --update --cmip6-dir /path/to/new/data
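The CLI wiring for such a flag could look like the following. This is a hypothetical sketch: neither `--update` nor `--cmip6-dir` exists in the current script, and the flag names simply mirror the example above.

```python
# Hypothetical argparse wiring for an --update mode in generate_esgf_catalog.py.
import argparse
from pathlib import Path

parser = argparse.ArgumentParser(description="Generate or update an ESGF catalog")
parser.add_argument(
    "--update",
    action="store_true",
    help="Merge new datasets into the existing parquet instead of regenerating",
)
parser.add_argument(
    "--cmip6-dir",
    type=Path,
    action="append",
    default=[],
    help="Directory to scan for new datasets; may be given multiple times",
)

# Parsing the invocation from the example above:
args = parser.parse_args(["--update", "--cmip6-dir", "/path/to/new/data"])
```

Using `action="append"` lets callers pass several new directories in one invocation, matching the `new_directories: list[Path]` parameter in the API sketch.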

Considerations

  • Deduplication key: need to decide what constitutes a "same dataset" for merge purposes (likely the full set of DRS facets minus version, so newer versions replace older ones)
  • Should handle the case where the existing parquet doesn't exist yet (fall back to full generation)
  • Version filtering should run after the merge so that new versions of existing datasets correctly supersede old ones
  • The {data_dir} path prefix stripping should be consistent between old and new entries
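The version-supersession semantics described above can be sketched as follows. `keep_latest_versions` is a stand-in illustrating what `filter_latest_versions` presumably does after the merge; the facet column names are an illustrative subset, not the real schema.

```python
# Hedged sketch of post-merge version filtering: keep only the newest version
# per dataset identity (all DRS facets minus version). Column names assumed.
import pandas as pd

DEDUP_FACETS = ["source_id", "experiment_id", "variable_id"]  # illustrative subset


def keep_latest_versions(catalog: pd.DataFrame) -> pd.DataFrame:
    """Drop rows superseded by a newer version of the same dataset."""
    # ESGF-style versions ("v20230101") sort correctly as strings; a stable
    # sort keeps the original row order within each version.
    ordered = catalog.sort_values("version", kind="stable")
    latest = ordered.drop_duplicates(subset=DEDUP_FACETS, keep="last")
    # Restore the catalog's original row order
    return latest.sort_index().reset_index(drop=True)
```

Running this after the merge, rather than before, is what lets a newly scanned `v2023...` row displace an existing `v2020...` row for the same facets.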
