Introduce MetadataSet (and a generic download) for multi-variable Metadata workflows #235

@glwagner

Summary

Add a MetadataSet — a typed bag of Metadata sharing a dataset, dates, region, and dir, differing only in variable name. Collapses the "many Metadata constructions with identical kwargs" pattern that recurs across examples and internal constructors, and enables a uniform set!(model, mset) that auto-routes verbose dataset names to short model field names via a global alias map.

Co-introduces a generic download verb to replace download_dataset, so the MetadataSet case can dispatch onto existing batched download backends.

Discussion origin: #233 (comment)

Motivation

The friction is concrete. From PR #233's examples/era5_breeze.jl:

meta_common = (region = era5_region, dir = era5_datadir)

set!(u,  Metadatum(:eastward_velocity;                   dataset=ds_pl, date=start_date, meta_common...))
set!(v,  Metadatum(:northward_velocity;                  dataset=ds_pl, date=start_date, meta_common...))
set!(T,  Metadatum(:temperature;                         dataset=ds_pl, date=start_date, meta_common...))
set!(qᵛ, Metadatum(:specific_humidity;                   dataset=ds_pl, date=start_date, meta_common...))
set!(qᶜ, Metadatum(:specific_cloud_liquid_water_content; dataset=ds_pl, date=start_date, meta_common...))
set!(qⁱ, Metadatum(:specific_cloud_ice_water_content;    dataset=ds_pl, date=start_date, meta_common...))

becomes

mset = MetadataSet(:eastward_velocity, :northward_velocity, :temperature,
                   :specific_humidity, :specific_cloud_liquid_water_content,
                   :specific_cloud_ice_water_content;
                   dataset=ds_pl, date=start_date, region=era5_region, dir=era5_datadir)
set!(atmos.model, mset)

The download path earlier in the same file already speaks this language — download_dataset(pl_vars, ds_pl, dates; meta_common...) batches a vector of variable names into one CDS request (ext/NumericalEarthCDSAPIExt.jl:280-336). MetadataSet makes the rest of the workflow symmetric.

Adoption survey

The same multi-variable, shared-kwargs pattern recurs in:

  • src/DataWrangling/ECCO/ECCO_atmosphere.jl:37-42 — 6 Metadata, varying only name (ECCOPrescribedAtmosphere)
  • src/DataWrangling/JRA55/JRA55_prescribed_atmosphere.jl:36-42 — 7 JRA55FieldTimeSeries calls (JRA55PrescribedAtmosphere)
  • examples/ERA5_hourly_data.jl:286-289, :344-347 — 2 + 4 Metadata constructions
  • examples/one_degree_simulation.jl:86-93, global_climate_simulation.jl:64-72, arctic_simulation.jl:55-86, meridional_heat_transport_ecco.jl:36-43 — coupled ocean+ice splits, 4 Metadatum each
  • examples/inspect_woa_temperature_salinity.jl:20-24, single_column_os_papa_simulation.jl:58-59, near_global_ocean_simulation.jl:86-87, mediterranean_simulation_with_ecco_restoring.jl:95-96, generate_surface_fluxes.jl:56-62 — ocean T/S pairs
  • test/runtests.jl:108-122, test_ocean_sea_ice_model.jl:41-50 — test setup loops

Totals: 15 example/test bundle sites, 2 flagship internal PrescribedAtmosphere constructors, ~30 download_dataset call sites touched by the rename, 3 docs files to update (docs/src/Metadata/metadata_overview.md, docs/src/Metadata/supported_variables.md, docs/src/index.md).

Type

struct MetadataSet{V, D, R, N, F}
    names     :: N        # NTuple{K, Symbol} — verbose dataset variable names
    dataset   :: V        # shared
    dates     :: D        # shared; scalar or AbstractVector
    region    :: R        # shared
    dir       :: String   # shared
    filenames :: F        # per-variable, auto-built; overridable
end

const MetadatumSet{V} = MetadataSet{V, <:Union{AnyDateTime, Nothing}} where V

Mirrors Metadata field-for-field, with name → names and filename → filenames.

Constructor

MetadataSet(:var1, :var2, ...; dataset, date [or dates], region, dir)

Positional varargs of verbose dataset variable names. Keyword arguments match Metadata/Metadatum: date for a scalar (yields a MetadatumSet), dates for a vector. region, dir, and filenames are optional, with the same defaults as Metadata.
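
As a rough illustration of how the date / dates keywords and a MetadatumSet-style alias could interact, here is a minimal sketch on a toy struct. ToyDatedSet and ToyDatumSet are hypothetical stand-ins, DateTime replaces AnyDateTime, and region, dir, and filenames are left out; it is not the proposed implementation.

```julia
using Dates

# Hypothetical stand-in for MetadataSet, reduced to three fields.
struct ToyDatedSet{V, D, N}
    names   :: N   # tuple of variable names
    dataset :: V   # shared
    dates   :: D   # shared; a single DateTime or a vector of them
end

# Matches only sets that hold a single date (or none), analogous to MetadatumSet.
const ToyDatumSet{V} = ToyDatedSet{V, <:Union{DateTime, Nothing}}

# Varargs constructor: `date` yields a single-date set, `dates` a time series.
function ToyDatedSet(names::Symbol...; dataset, date=nothing, dates=nothing)
    if !isnothing(date) && !isnothing(dates)
        throw(ArgumentError("pass either `date` or `dates`, not both"))
    end
    shared = isnothing(dates) ? date : dates
    return ToyDatedSet(names, dataset, shared)
end

single = ToyDatedSet(:temperature, :salinity; dataset=:toy, date=DateTime(2020, 1, 1))
series = ToyDatedSet(:temperature; dataset=:toy,
                     dates=[DateTime(2020, 1, 1), DateTime(2020, 2, 1)])
```

In this sketch single isa ToyDatumSet holds and series isa ToyDatumSet does not, so methods can dispatch on the single-date case without a separate struct.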

Access

The variable axis is exposed via property and indexed access:

mset.eastward_velocity   # → Metadata/Metadatum for this variable
mset[:eastward_velocity] # equivalent indexed form
mset[1]                  # indexed by position in `names`

mset.dataset             # struct field still accessible (getproperty fallthrough)
mset.region
keys(mset)               # → (:eastward_velocity, :northward_velocity, ...)
length(mset)             # number of variables
for m in mset ... end    # iterates variable axis, yielding Metadata per variable
NamedTuple(mset)         # (; eastward_velocity = mset.eastward_velocity, ...)
metadata_path(mset)      # NamedTuple of paths keyed by variable name

Property access via getproperty(mset, name): dispatches to struct field if name ∈ fieldnames(MetadataSet), otherwise looks up the variable. Variables named after struct fields (e.g. a hypothetical :dataset variable) remain accessible via mset[:dataset].
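
The fallthrough rule above can be sketched on a toy type. ToyBag is a hypothetical stand-in with only two fields, and its getindex returns a small NamedTuple where a real implementation would build a Metadata; treat it as one possible shape, not the proposed code.

```julia
# Hypothetical stand-in for MetadataSet, reduced to two fields.
struct ToyBag{N, V}
    names   :: N   # tuple of variable names
    dataset :: V   # shared across all variables
end

# Per-variable lookup by name; a real implementation would return a Metadata here.
function Base.getindex(bag::ToyBag, name::Symbol)
    name in getfield(bag, :names) || throw(KeyError(name))
    return (; name, dataset = getfield(bag, :dataset))
end

# Per-variable lookup by position in `names`.
Base.getindex(bag::ToyBag, i::Integer) = bag[getfield(bag, :names)[i]]

# Struct fields win; any other symbol is treated as a variable name.
function Base.getproperty(bag::ToyBag, name::Symbol)
    name in fieldnames(ToyBag) && return getfield(bag, name)
    return bag[name]
end

Base.keys(bag::ToyBag) = getfield(bag, :names)
Base.length(bag::ToyBag) = length(getfield(bag, :names))
Base.iterate(bag::ToyBag, i=1) = i > length(bag) ? nothing : (bag[i], i + 1)

# The second variable deliberately shares its name with a struct field.
bag = ToyBag((:temperature, :dataset), :toy)
```

Here bag.dataset returns the struct field :toy, while bag[:dataset] returns the per-variable entry, which is the escape hatch described above for variables that collide with field names.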

Field / FieldTimeSeries

Field(mset, arch=CPU(); kw...)              # → NamedTuple{names}(Field, ...)
FieldTimeSeries(mset, arch=CPU(); kw...)    # → NamedTuple{names}(FTS, ...)

NamedTuple is keyed by verbose dataset names — fts.eastward_velocity, fts.specific_humidity, etc. PrescribedAtmosphere constructors do the short rename in one explicit block at their existing call site (their public API stays unchanged).
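
A minimal sketch of the return shape and the explicit rename block, assuming a hypothetical helper toy_fields and a load function that stands in for building a Field or FieldTimeSeries:

```julia
# Build a NamedTuple keyed by the given names, calling `load` once per name.
# `load` is a hypothetical stand-in for constructing a Field or FieldTimeSeries.
toy_fields(load, names::Tuple{Vararg{Symbol}}) = NamedTuple{names}(map(load, names))

loaded = toy_fields(name -> "data for $name", (:eastward_velocity, :specific_humidity))

# Explicit short-name rename, as a PrescribedAtmosphere-style constructor might do it.
short = (u = loaded.eastward_velocity, qᵛ = loaded.specific_humidity)
```

Keeping the rename in one visible block means the verbose-to-short mapping used by a constructor can be read in one place, independent of the global alias map.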

set!(model, mset): auto-routing via the global alias map

A single method, leaning on the existing set!(model; kw...) kwarg interface:

function Oceananigans.set!(model, mset::MetadataSet)
    kwargs = (variable_aliases[n] => mset[n]
              for n in mset.names if haskey(variable_aliases, n))
    set!(model; kwargs...)
end

Variables not in variable_aliases are silently skipped, mirroring current kwarg-set! behavior. This enables the ocean+ice split idiom on a single 4-variable set:

mset = MetadataSet(:temperature, :salinity,
                   :sea_ice_thickness, :sea_ice_concentration;
                   date, dataset)
set!(ocean.model,   mset)   # picks up :temperature, :salinity
set!(sea_ice.model, mset)   # picks up :sea_ice_thickness, :sea_ice_concentration

Also:

set!(fields::NamedTuple, mset::MetadataSet)  # element-wise; NT keys = verbose names
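
The routing step can be tried in isolation. The sketch below assumes a hypothetical three-entry alias map and a stand-in toy_set! that only reports which keywords it received; it does not model how set!(model; kw...) treats keywords for fields a given model lacks.

```julia
# Hypothetical three-entry alias map; the proposed map is larger.
const toy_aliases = Dict(:temperature => :T, :salinity => :S, :sea_ice_thickness => :h)

# Stand-in for set!(model; kw...): returns the keyword names it was called with.
toy_set!(; kwargs...) = Tuple(keys(kwargs))

# Route verbose names to short names, skipping names that have no alias.
function toy_route(names)
    kwargs = (toy_aliases[n] => n for n in names if haskey(toy_aliases, n))
    return toy_set!(; kwargs...)   # a generator of Symbol => value pairs splats as keywords
end

routed = toy_route((:temperature, :vorticity, :salinity))   # (:T, :S)
```

:vorticity has no alias in the toy map, so it never reaches toy_set!. A name that does have an alias is always forwarded, whether or not the receiving model has a matching field.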

Global alias map (src/DataWrangling/DataWrangling.jl)

Top-level variable_aliases :: Dict{Symbol,Symbol}, every value traceable to a row in docs/src/appendix/notation.md (or restoring.jl:33-47 for biogeochemistry). Synonyms (e.g. :u_velocity / :eastward_velocity / :eastward_wind all → :u) are retained — they serve as domain disambiguators across dataset modules.

const variable_aliases = Dict{Symbol, Symbol}(
    # Ocean & atmosphere state (notation.md existing rows)
    :temperature              => :T,
    :air_temperature          => :T,
    :salinity                 => :S,
    :u_velocity               => :u,
    :v_velocity               => :v,
    :eastward_velocity        => :u,
    :northward_velocity       => :v,
    :eastward_wind            => :u,
    :northward_wind           => :v,
    :sea_level_pressure       => :p,
    # Atmosphere moisture / microphysics (Breeze notation.md rows)
    :specific_humidity                    => :qᵛ,
    :air_specific_humidity                => :qᵛ,
    :specific_cloud_liquid_water_content  => :qᶜˡ,
    :specific_cloud_ice_water_content     => :qᶜⁱ,
    :specific_rain_water_content          => :qʳ,
    # Sea ice (notation.md `ℵ` row; `:h` matches ClimaSeaIce field name)
    :sea_ice_thickness        => :h,
    :sea_ice_concentration    => :ℵ,
    # Freshwater fluxes (NEW notation.md rows)
    :rain_freshwater_flux     => :Jʳᵃ,
    :snow_freshwater_flux     => :Jˢⁿ,
    # Biogeochemistry (already in restoring.jl:33-47)
    :dissolved_inorganic_carbon     => :DIC,
    :alkalinity                     => :ALK,
    :nitrate                        => :NO₃,
    :phosphate                      => :PO₄,
    :dissolved_organic_phosphorus   => :DOP,
    :particulate_organic_phosphorus => :POP,
    :dissolved_iron                 => :Fe,
    :dissolved_silicate             => :SiO₂,
    :dissolved_oxygen               => :O₂,
)

28 entries. Variables with no entry (e.g. :vorticity, :geopotential, :significant_wave_height, :mesh_mask) are still fully fetchable via download(mset) and accessible via mset.<name>; they simply don't take part in the auto-set! path until a real adoption site needs them.

Notation.md additions

Two new rows in a new "Net surface freshwater fluxes" subsection between "Net ocean fluxes" and "Thermodynamic properties":

| ``J^{\mathrm{ra}}`` | `Jʳᵃ` | rain freshwater flux | Rain mass flux at the surface (kg m⁻² s⁻¹) |
| ``J^{\mathrm{sn}}`` | `Jˢⁿ` | snow freshwater flux | Snow mass flux at the surface (kg m⁻² s⁻¹) |

No other notation additions in this PR — speculative rows for vorticity, geopotential, waves, PV, trace gases, and net radiation aliases are deferred until an adoption site needs them.

Generic download (supersedes download_dataset)

Bundled into Stage 1. The argument is metadata, not a dataset, so the verb-on-object form reads better — and a single generic gives MetadataSet a hook for backend-specific aggregation:

download(::Metadatum)                      # current per-file behavior
download(::Metadata)                       # current Metadata behavior (date axis)
download(::MetadataSet)                    # default: per-element loop
download(::MetadataSet{<:ERA5PressureLevelsDataset})  # batched CDS path
download(::AbstractVector{<:Metadata})     # generic many-metadata
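
To make the dispatch idea concrete, here is a small sketch with toy types. ToyRequestSet, PerFileDataset, BatchedDataset, and toy_download are all hypothetical names, and each "request" is just a vector of variable names rather than a real download.

```julia
abstract type ToyDataset end
struct PerFileDataset <: ToyDataset end   # hypothetical backend without batching
struct BatchedDataset <: ToyDataset end   # hypothetical backend that batches variables

# Hypothetical stand-in for MetadataSet, parameterized by dataset type.
struct ToyRequestSet{V<:ToyDataset}
    names   :: Vector{Symbol}
    dataset :: V
end

# Default: one request per variable (the per-element loop).
toy_download(set::ToyRequestSet) = [[name] for name in set.names]

# Specialization on the dataset type parameter: one request covering all variables.
toy_download(set::ToyRequestSet{<:BatchedDataset}) = [copy(set.names)]

per_file = toy_download(ToyRequestSet([:u, :v], PerFileDataset()))   # two requests
batched  = toy_download(ToyRequestSet([:u, :v], BatchedDataset()))   # one request
```

The caller writes the same toy_download(set) in both cases; the backend decides whether to batch purely through the type parameter.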

Migration

  • New download generic introduced in src/DataWrangling/DataWrangling.jl.
  • All per-backend methods renamed download_dataset → download, with import lines updated. Files touched:
    • src/DataWrangling/DataWrangling.jl:276 (fallback)
    • src/DataWrangling/ECCO/ECCO.jl:308
    • src/DataWrangling/JRA55/JRA55_metadata.jl:192
    • src/DataWrangling/EN4/EN4.jl:207
    • src/DataWrangling/IBCAO/IBCAO.jl:80, ETOPO/ETOPO.jl:48, IBCSO/IBCSO.jl:75, GEBCO/GEBCO.jl:69, ORCA/ORCA.jl:110
    • src/DataWrangling/OSPapa/OSPapa_flux_observations.jl:97, OSPapa_ocean_observations.jl:69
    • ext/NumericalEarthCDSAPIExt.jl:155, 174, 280, 294, 344, 362, 382
    • ext/NumericalEarthCopernicusMarineExt.jl:16, 24
    • ext/NumericalEarthWOAExt.jl:38
  • download_dataset kept as Base.@deprecate alias for one minor release.
  • download(::MetadataSet{<:ERA5PressureLevelsDataset}) routes onto the existing batched path at ext/NumericalEarthCDSAPIExt.jl:280-336, preserving the per-day multi-variable CDS bundling.
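
The deprecation alias could look roughly like the sketch below. toy_fetch and toy_fetch_dataset are hypothetical names standing in for the new and old verbs, and the argument list is simplified to a single Symbol.

```julia
# Hypothetical new verb; stands in for the generic download.
toy_fetch(name::Symbol) = "fetched $name"

# Forward the old name to the new one. A deprecation warning is shown only
# when Julia runs with depwarn enabled. The trailing `false` skips exporting
# the old name in this sketch.
Base.@deprecate toy_fetch_dataset(name::Symbol) toy_fetch(name) false

old_result = toy_fetch_dataset(:temperature)
new_result = toy_fetch(:temperature)
```

Both calls return the same value, so existing call sites keep working during the deprecation window.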

Stages

| Stage | Scope | Confidence |
| --- | --- | --- |
| 1 | MetadataSet struct + indexing/iteration/getproperty; MetadatumSet alias; variable_aliases dict; set!(model, ::MetadataSet); set!(::NamedTuple, ::MetadataSet); Field(::MetadataSet), FieldTimeSeries(::MetadataSet); download rename + @deprecate; ERA5 batched specialization; notation.md additions; unit tests | high |
| 2 | Refactor ECCOPrescribedAtmosphere and JRA55PrescribedAtmosphere to construct one MetadataSet internally and consume fts.<verbose_name> for the short-name rename block. External API preserved. | high |
| 3 | Adopt MetadataSet across the 15 surveyed example/test bundle sites. Update docs/src/Metadata/metadata_overview.md (new section), supported_variables.md (bundle reference), docs/src/index.md (Quick Start snippet). | high |

Deferred (explicitly out of scope)

  • Variable-name unification across modules. Synonyms like :u_velocity / :eastward_velocity / :eastward_wind stay; they serve as domain disambiguators (notably for ECCO4Monthly, which exposes both ocean and atmosphere fields under one dataset struct). Revisit only if/when those datasets are split.
  • Speculative variable_aliases entries for :vorticity, :geopotential, :geopotential_height, :potential_vorticity, :ozone_mass_mixing_ratio, :total_cloud_cover, :fraction_of_cloud_cover, :significant_wave_height, :mean_wave_period, :mean_wave_direction, :eastward_stokes_drift, :northward_stokes_drift, :free_surface, :depth, :bottom_height, :mesh_mask, :river_freshwater_flux, :iceberg_freshwater_flux, :evaporation_minus_precipitation, :net_* radiation aliases, :net_heat_flux, :sea_ice_u_velocity, :sea_ice_v_velocity. These remain fetchable via download(mset) and mset.<name> — they simply don't auto-rename in set!(model, mset) until an adoption site requires it.
  • :sea_ice_area_fraction as a key — no dataset module uses this name; CF standard name only.

Tasks

  • Stage 1: MetadataSet core + tests
  • Stage 1: variable_aliases dict at top of DataWrangling.jl
  • Stage 1: set!(model, ::MetadataSet) + set!(::NamedTuple, ::MetadataSet) + tests
  • Stage 1: Field(::MetadataSet), FieldTimeSeries(::MetadataSet) returning NamedTuples
  • Stage 1: Introduce generic download; rename backends; deprecate download_dataset
  • Stage 1: download(::MetadataSet) default + ERA5 specialization
  • Stage 1: Apply notation.md additions
  • Stage 2: Refactor ECCOPrescribedAtmosphere
  • Stage 2: Refactor JRA55PrescribedAtmosphere
  • Stage 3: Adopt in 15 surveyed example/test sites
  • Stage 3: Docs (metadata_overview.md new section, supported_variables.md bundle reference, docs/src/index.md Quick Start)
