aerosol table by larsbuntemeyer · Pull Request #74 · WCRP-CORDEX/data-request-table

larsbuntemeyer · 2025-07-29T11:41:09Z

The initial aerosol table was created using:

import pandas as pd

def sheet_url(url, sheet_name):
    """create google spreadsheet url based on sheet name"""
    sheet_name = sheet_name.replace(" ", "%20")
    return url.format(sheet_id=sheet_id, sheet_name=sheet_name)


def retrieve_google_sheet(url, sheet_name, skiprows=4):
    """retrieve single sheet of data request"""
    return pd.read_csv(sheet_url(url, sheet_name), skiprows=skiprows, dtype=str)

def handle_inconsistencies(df):
    """handle some random inconsistencies"""
    df.loc[df["priority"] == "TIER 2", "priority"] = "TIER2"
    df.loc[df["priority"] == "TIER 1", "priority"] = "TIER1"
    return df

def freq_list(row):
    """create list of frequencies from boolean entries ('x')"""
    if row["mon"] == "fx":
        return ["fx"]
    return [f for f in freqs if row[f] == "x"]

def update_cell_methods(df):
    # special fx cases
    df.loc[df.frequency == "fx", "cell_methods"] = "area: mean"

    # flux units, see https://github.com/WCRP-CORDEX/cordex-cmip6-data-request/issues/23
    df.loc[df.units == "W m-2", "cell_methods"] = "area: time: mean"

    return df

def handle_special_cell_methods(df):
    for var, v in df.cell_methods.items():
        for f, cm in v.items():
            df.loc[(df.out_name == var) & (df.frequency == f), "cell_methods"] = cm
    return df

def clean_df(df, drop=True):
    """tidy up dataframe"""
    # remove unnamed columns
    # copy so that the column assignments below do not act on a view of the input
    df = df.loc[:, ~df.columns.str.contains("Unnamed")].copy()

    df["standard_name"] = df["standard_name"].fillna("")

    # lower case column names and renaming to cmip6 formats
    df.columns = df.columns.str.lower()
    df.rename(
        columns={"output variable name": "out_name", "comments": "comment"},
        inplace=True,
    )

    # frequency columns to tidy data
    df["frequency"] = df.apply(freq_list, axis=1)
    df = df.explode("frequency", ignore_index=True)

    # normalize priority labels ("TIER 1" -> "TIER1", ...)
    df = handle_inconsistencies(df)

    # subdaily point values: instantaneous ("i") entries at 1hr, 3hr or 6hr
    subdaily_pt = (df["frequency"].isin(["1hr", "3hr", "6hr"])) & (df["ag"] == "i")
    # set frequency, we don't do that anymore,
    # see https://github.com/WCRP-CORDEX/cordex-cmip6-data-request/issues/24
    # df.loc[subdaily_pt, "frequency"] = df[subdaily_pt].frequency + "Pt"

    # set cell methods depending on frequency
    df["cell_methods"] = "area: time: mean"
    df.loc[subdaily_pt, "cell_methods"] = "area: mean time: point"

    # update some more cell_methods
    df = update_cell_methods(df)
    # remove trailing formatters
    df.replace(r"\n", " ", regex=True, inplace=True)
    strip_cols = ["standard_name", "long_name"]
    for col in strip_cols:
        df[col] = df[col].str.strip()
    if drop is True:
        df.drop(columns=freqs, inplace=True)
        #df.drop(columns=["ag"], inplace=True)
        df = df.dropna(subset=["out_name", "frequency"], how="all")

    # handle min max cell_methods (na=False: rows without out_name never match)
    df.loc[df.out_name.str.contains("min", na=False), "cell_methods"] = "area: mean time: minimum"
    df.loc[df.out_name.str.contains("max", na=False), "cell_methods"] = "area: mean time: maximum"

    # handle special cases
    #df = handle_special_cell_methods(df)

    # set these to lowercase
    lowercase = ["CAPE", "LI", "CIN", "CAPEmax", "LImax", "CINmax"]
    lc = df.out_name.isin(lowercase)

    df.loc[lc, "out_name"] = df[lc].out_name.str.lower()

    # set positive values
    up = ["outgoing", "upward", "upwelling"]
    down = ["incoming", "downward", "downwelling", "sinking"]
    ups = df.loc[df.standard_name.str.contains("|".join(up), case=False)]
    downs = df.loc[df.standard_name.str.contains("|".join(down), case=False)]
    df.loc[ups.index, "positive"] = "up"
    df.loc[downs.index, "positive"] = "down"

    return df

freqs = ["mon", "day", "6hr", "3hr", "1hr"]

sheet_names = ["Aersol CORE", "Aerosol Tier 1", "Aerosol Tier 2"]
#url = "https://docs.google.com/spreadsheets/d/1_KLWJuVdxryyq3DsB5NIJwoneuVqSUVN/edit?pli=1&gid=1672965248#gid=1672965248"

sheet_id = "1_KLWJuVdxryyq3DsB5NIJwoneuVqSUVN"
url = (
    "https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/"
    "tq?tqx=out:csv&sheet={sheet_name}"
)

def retrieve_data_request():
    data = []
    for sheet_name in sheet_names:
        df = retrieve_google_sheet(url, sheet_name, skiprows=0).rename(columns={"Output frequency mon": "mon"})
        df.columns.values[1] = "units"
        #df = clean_df(df)
        data.append(df)
    return data

df = pd.concat(retrieve_data_request(), ignore_index=True)
df = clean_df(df)
df.to_csv("aerosol.csv", index=False)

larsbuntemeyer · 2025-07-29T12:14:21Z

@pierrenabat I added a table in this PR containing all requested aerosol variables and some metadata derived from the information provided. I kept the "ag" column for now to check cell methods.

The default for cell methods (all frequencies are averaged values) is

"area: time: mean"

in case of "i" in the aggregation column, for subdaily frequencies it's

"area: mean time: point"

However, I'm unsure how to handle the "c" (cumulative). Should the subdaily cell method be something like "area: mean time: sum"? I couldn't find anything in CMIP6 to go on, e.g., there are no cumulative subdaily frequencies.

larsbuntemeyer · 2025-07-29T12:21:17Z

@jesusff for aerosols, I now see in the comments that many pressure levels are requested for aerosol variables, e.g.,

List of levels: 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800, 850, 900, 925, 950, 975, 1000 hPa (all in 1 file if possible)

The default data request splits pressure levels into individual datasets with scalar coordinates. Should we stick with one approach or have both? I'm unsure...
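
To make the two options concrete, here is a minimal sketch of how one 3D request entry could be expanded into per-level entries with a scalar coordinate. The dictionary keys (`out_name`, `dimensions`, `plev`), the `<name><level>` naming scheme and the variable name `var3d` are assumptions for illustration only, not the layout of the actual tables.

```python
# Pressure levels (hPa) as listed in the request comment above.
LEVELS_HPA = [200, 250, 300, 350, 400, 450, 500, 550, 600, 650,
              700, 750, 800, 850, 900, 925, 950, 975, 1000]


def split_by_level(entry, levels):
    """Expand one 3D entry into one entry per level (scalar coordinate)."""
    return [
        {
            **entry,
            "out_name": f"{entry['out_name']}{level}",  # assumed naming scheme
            "dimensions": "longitude latitude time",    # vertical axis dropped
            "plev": level,                              # scalar coordinate value
        }
        for level in levels
    ]


# Option A: a single dataset that keeps the vertical dimension (hypothetical entry).
entry_3d = {
    "out_name": "var3d",
    "frequency": "day",
    "dimensions": "longitude latitude plev time",
}

# Option B: one dataset per pressure level.
per_level = split_by_level(entry_3d, LEVELS_HPA)
print(len(per_level), per_level[0]["out_name"], per_level[-1]["out_name"])
```

With the 19 levels listed, option B turns every 3D variable into 19 datasets per frequency, which is the trade-off under discussion here.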

pierrenabat · 2025-08-01T13:35:01Z

@larsbuntemeyer thanks for creating the table.
The variables with ag="cumulative" can have a cell_methods equal to "area: time: mean". These variables are equivalent to fluxes, such as existing variables like precipitation, evapotranspiration or radiation fluxes.
For the pressure levels, if it is not possible to have the different levels in the same file, you can split them in individual datasets if you prefer.
I will also complete the missing information for the variables concerned.
Thanks !
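
The rule discussed so far can be sketched as a small lookup. This is only an illustration of the logic in this thread, assuming the "ag" column holds "i" (instantaneous) or "c" (cumulative); the function name `cell_methods_for` is hypothetical, and the fx and min/max special cases handled by the script are left out.

```python
SUBDAILY = {"1hr", "3hr", "6hr"}


def cell_methods_for(frequency, ag):
    """Return cell_methods for one variable/frequency pair.

    Instantaneous subdaily values are point values in time. Everything
    else, including cumulative ("c") variables, which are treated like
    fluxes, gets the default time mean.
    """
    if ag == "i" and frequency in SUBDAILY:
        return "area: mean time: point"
    return "area: time: mean"


for frequency, ag in [("1hr", "i"), ("1hr", "c"), ("day", "i"), ("mon", None)]:
    print(frequency, ag, "->", cell_methods_for(frequency, ag))
```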

larsbuntemeyer · 2025-08-01T14:12:09Z

> variables with ag="cumulative" can have a cell_methods equal to "area: time: mean"

~~Alright, i'll update that.~~
Edit: No update required, see also WCRP-CORDEX/cordex-cmip6-data-request#23

> For the pressure levels, if it is not possible to have the different levels in the same file, you can split them in individual datasets if you prefer.

It should be possible, but it would not be consistent with the default data request. I think we need more opinions on this.

jesusff · 2025-08-08T10:48:30Z

I've added a separate discussion on the model levels issue in #76

For the moment, I'd leave this aerosol request as is now, with 3D variables including the vertical dimension. @pierrenabat, how standard is the set of levels you propose here?

larsbuntemeyer · 2025-08-13T11:09:51Z

Ok, we can keep 3D variables and I will add a coordinate. However, we still have to decide about the invalid standard names, see #34 (comment)

About half of the variables have invalid standard names; should we remove them for now?

larsbuntemeyer · 2026-03-23T16:03:20Z

The new aerosol variable entries have no realm yet; should they go under atmos?

pierrenabat · 2026-03-23T16:14:41Z

Thanks for the merge @larsbuntemeyer
I think all these aerosol variables could be in a realm "aerosol", as for CMIP6 aerosol variables. Sorry I had not seen this last column.

jesusff · 2026-03-23T18:54:17Z

Hi,

yes, I would not mix it with the atmos "realm" (or "component", as we called it in #68). We still need to decide how to organize the data request: not the derived content (datasets.csv and CMOR tables), but the "user files". In #68 there is a proposal to have one data request per domain (or per activity in general, e.g. there could be a data request for CORDEX-CORE or for some FPS) and to have the component/realm as an additional column, indicating that the data is requested only if the model has this component (aerosol chemistry, in this case).

larsbuntemeyer · 2026-03-25T09:07:53Z

Agreed, it doesn't make much sense to have data requested by components etc., but rather by domain and/or activity. I will go on with this PR and merge the aerosol component variables with realm aerosol into the cmor/datasets table. We can then let the communities decide which subset to choose for their activities.
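
As a rough sketch of what such a subset selection could look like once every entry carries a realm: the rows, the `realm` key and the helper `select_request` are hypothetical and only illustrate filtering by the components a model provides, not the actual structure of datasets.csv.

```python
def select_request(rows, components):
    """Keep only the entries whose realm is provided by the model."""
    return [row for row in rows if row["realm"] in components]


# Illustrative rows; variable names are placeholders for real table entries.
rows = [
    {"out_name": "tas", "frequency": "day", "realm": "atmos"},
    {"out_name": "od550aer", "frequency": "day", "realm": "aerosol"},
]

# A model without interactive aerosols only serves the atmos entries.
atmos_only = select_request(rows, {"atmos"})

# A model with an aerosol component serves both.
with_aerosol = select_request(rows, {"atmos", "aerosol"})

print([row["out_name"] for row in atmos_only])
print([row["out_name"] for row in with_aerosol])
```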

larsbuntemeyer added 2 commits July 29, 2025 13:40

initial aerosol table

b161cf3

added ag column

9c74bcf

added positive attribute

f110e21

larsbuntemeyer self-assigned this Aug 1, 2025

jesusff mentioned this pull request Aug 8, 2025

Open the data request to 3D (spatial) fields in a single file? #76

Open

pierrenabat and others added 10 commits December 4, 2025 10:49

Add files via upload

895fcad

Update aerosol standard_names for CF compliance

Add missing fields (units and Tier2 variables)

1b739ab

Merge pull request #84 from pierrenabat/aerosol

fc40e48

Update aerosol data request

Update datasets.csv with aerosol variables

dd636fb

restore lost semicolon

e2563e4

Merge pull request #97 from pierrenabat/aerosol

e2157c3

Update datasets.csv with aerosol variables

merge main

6316e8b

fix time1 dimension

d2329c3

update structure

2880b92

rename

288f18e

larsbuntemeyer mentioned this pull request Mar 25, 2026

When do i have to use/drop the area cell method? #98

Open

larsbuntemeyer wants to merge 13 commits into main from aerosol

3 participants
