aerosol table #74

Open
larsbuntemeyer wants to merge 13 commits into main from aerosol

@larsbuntemeyer
Contributor

@larsbuntemeyer larsbuntemeyer commented Jul 29, 2025

see #34

The initial aerosol table was created using:

import pandas as pd

def sheet_url(url, sheet_name):
    """create google spreadsheet url based on sheet name"""
    sheet_name = sheet_name.replace(" ", "%20")
    return url.format(sheet_id=sheet_id, sheet_name=sheet_name)


def retrieve_google_sheet(url, sheet_name, skiprows=4):
    """retrieve single sheet of data request"""
    return pd.read_csv(sheet_url(url, sheet_name), skiprows=skiprows, dtype=str)

def handle_inconsistencies(df):
    """handle some random inconsistencies"""
    df.loc[df["priority"] == "TIER 2", "priority"] = "TIER2"
    df.loc[df["priority"] == "TIER 1", "priority"] = "TIER1"
    return df

def freq_list(row):
    """create list of frequencies from boolean entries ('x')"""
    if row["mon"] == "fx":
        return ["fx"]
    return [f for f in freqs if row[f] == "x"]

def update_cell_methods(df):
    """override cell_methods for fx fields and flux variables"""
    # special fx cases
    df.loc[df.frequency == "fx", "cell_methods"] = "area: mean"

    # flux units, see https://github.com/WCRP-CORDEX/cordex-cmip6-data-request/issues/23
    df.loc[df.units == "W m-2", "cell_methods"] = "area: time: mean"

    return df

def handle_special_cell_methods(df):
    """apply per-variable, per-frequency cell_methods overrides (currently unused)"""
    for var, v in df.cell_methods.items():
        for f, cm in v.items():
            df.loc[(df.out_name == var) & (df.frequency == f), "cell_methods"] = cm
    return df

def clean_df(df, drop=True):
    """tidy up dataframe"""
    # remove unnamed columns
    # copy so that the assignments below do not act on a view of the input
    df = df.loc[:, ~df.columns.str.contains("Unnamed")].copy()

    df["standard_name"] = df["standard_name"].fillna("")

    # lower case column names and renaming to cmip6 formats
    df.columns = df.columns.str.lower()
    df.rename(
        columns={"output variable name": "out_name", "comments": "comment"},
        inplace=True,
    )

    # frequency columns to tidy data
    df["frequency"] = df.apply(lambda row: freq_list(row), axis=1)
    df = df.explode("frequency", ignore_index=True)

    df = handle_inconsistencies(df)

    # flag instantaneous ("i") subdaily entries, i.e. point values
    subdaily_pt = (df["frequency"].isin(["1hr", "3hr", "6hr"])) & (df["ag"] == "i")
    # we no longer rename the frequency of point values,
    # see https://github.com/WCRP-CORDEX/cordex-cmip6-data-request/issues/24
    # df.loc[subdaily_pt, "frequency"] = df[subdaily_pt].frequency + "Pt"

    # set cell methods depending on frequency
    df["cell_methods"] = "area: time: mean"
    df.loc[subdaily_pt, "cell_methods"] = "area: mean time: point"

    # update some more cell_methods
    df = update_cell_methods(df)
    # replace line breaks with spaces and strip surrounding whitespace
    df.replace(r"\n", " ", regex=True, inplace=True)
    strip_cols = ["standard_name", "long_name"]
    for col in strip_cols:
        df[col] = df[col].str.strip()
    if drop:
        df.drop(columns=freqs, inplace=True)
        #df.drop(columns=["ag"], inplace=True)
        df = df.dropna(subset=["out_name", "frequency"], how="all")

    # handle min max cell_methods (na=False treats missing names as no match)
    df.loc[df.out_name.str.contains("min", na=False), "cell_methods"] = "area: mean time: minimum"
    df.loc[df.out_name.str.contains("max", na=False), "cell_methods"] = "area: mean time: maximum"

    # handle special cases
    #df = handle_special_cell_methods(df)

    # set these to lowercase
    lowercase = ["CAPE", "LI", "CIN", "CAPEmax", "LImax", "CINmax"]
    lc = df.out_name.isin(lowercase)

    df.loc[lc, "out_name"] = df[lc].out_name.str.lower()

    # set positive values
    up = ["outgoing", "upward", "upwelling"]
    down = ["incoming", "downward", "downwelling", "sinking"]
    ups = df.loc[df.standard_name.str.contains("|".join(up), case=False)]
    downs = df.loc[df.standard_name.str.contains("|".join(down), case=False)]
    df.loc[ups.index, "positive"] = "up"
    df.loc[downs.index, "positive"] = "down"

    return df

freqs = ["mon", "day", "6hr", "3hr", "1hr"]

sheet_names = ["Aersol CORE", "Aerosol Tier 1", "Aerosol Tier 2"]
#url = "https://docs.google.com/spreadsheets/d/1_KLWJuVdxryyq3DsB5NIJwoneuVqSUVN/edit?pli=1&gid=1672965248#gid=1672965248"

sheet_id = "1_KLWJuVdxryyq3DsB5NIJwoneuVqSUVN"
url = (
    "https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/"
    "tq?tqx=out:csv&sheet={sheet_name}"
)

def retrieve_data_request():
    data = []
    for sheet_name in sheet_names:
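        df = retrieve_google_sheet(url, sheet_name, skiprows=0).rename(columns={"Output frequency mon": "mon"})
        # the second column holds the units; assign a new column list
        # instead of mutating df.columns.values in place
        columns = df.columns.tolist()
        columns[1] = "units"
        df.columns = columns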
        #df = clean_df(df)
        data.append(df)
    return data

df = pd.concat(retrieve_data_request(), ignore_index=True)
df = clean_df(df)
df.to_csv("aerosol.csv", index=False)
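To illustrate the "frequency columns to tidy data" step in isolation, here is a minimal sketch on a toy frame. The rows and the names `var_a` and `var_b` are made up for illustration and are not taken from the spreadsheet; `freq_list` is copied from the script above.

```python
import pandas as pd

freqs = ["mon", "day", "6hr", "3hr", "1hr"]


def freq_list(row):
    """create list of frequencies from boolean entries ('x')"""
    if row["mon"] == "fx":
        return ["fx"]
    return [f for f in freqs if row[f] == "x"]


# hypothetical rows: var_a is requested at three frequencies, var_b is a fixed field
toy = pd.DataFrame(
    {
        "out_name": ["var_a", "var_b"],
        "mon": ["x", "fx"],
        "day": ["x", None],
        "6hr": [None, None],
        "3hr": [None, None],
        "1hr": ["x", None],
    }
)

# one list of frequencies per row, then one row per (variable, frequency) pair
toy["frequency"] = toy.apply(freq_list, axis=1)
tidy = toy.explode("frequency", ignore_index=True).drop(columns=freqs)
print(tidy)
```

Each variable ends up with one row per requested frequency, which is the layout the later `cell_methods` assignments rely on.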

@larsbuntemeyer
Contributor Author

larsbuntemeyer commented Jul 29, 2025

@pierrenabat I added a table in this PR containing all requested aerosol variables and some metadata derived from the information provided. I kept the "ag" column for now to check cell methods.

The default for cell methods (all frequencies are averaged values) is

  • "area: time: mean"

For an "i" in the aggregation column at subdaily frequencies, it is

  • "area: mean time: point"

However, I'm unsure how to handle the "c" (cumulative) case. Should the subdaily cell method be something like "area: mean time: sum"? I couldn't find anything in CMIP6 to go on, e.g., there are no cumulative subdaily frequencies.
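The default and point-value rules above can be sketched on a toy frame as follows. This mirrors the logic in `clean_df` but is a standalone illustration: `assign_cell_methods`, `SUBDAILY`, and the variable names are hypothetical, and rows with "c" simply keep the default here because their treatment is the open question.

```python
import pandas as pd

SUBDAILY = ["1hr", "3hr", "6hr"]


def assign_cell_methods(df):
    """Default to a time mean; instantaneous subdaily rows become point values."""
    out = df.copy()
    out["cell_methods"] = "area: time: mean"
    point = out["frequency"].isin(SUBDAILY) & (out["ag"] == "i")
    out.loc[point, "cell_methods"] = "area: mean time: point"
    return out


# hypothetical rows covering the three cases discussed above
toy = pd.DataFrame(
    {
        "out_name": ["var_a", "var_a", "var_b"],
        "frequency": ["day", "1hr", "1hr"],
        "ag": ["i", "i", "c"],
    }
)
result = assign_cell_methods(toy)
print(result[["out_name", "frequency", "ag", "cell_methods"]])
```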

@larsbuntemeyer
Contributor Author

@jesusff for aerosols, I now see in the comments that a lot of pressure levels are requested for aerosol variables, e.g.,

  • List of levels: 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800, 850, 900, 925, 950, 975, 1000 hPa (all in 1 file if possible)

The default data request splits pressure levels into individual datasets with scalar coordinates. Should we stick with one approach or have both? I'm unsure...
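To make the two options concrete, here is a rough sketch of how one 3D variable could be expanded either way. Everything in it is an assumption for illustration: the helper names, the `plev_hPa` column, the placeholder variable name `aervar`, and the "name plus level" naming of the split datasets. Only the list of levels is taken from the comment above.

```python
import pandas as pd

# levels as listed in the request comments (hPa)
levels = [200, 250, 300, 350, 400, 450, 500, 550, 600, 650,
          700, 750, 800, 850, 900, 925, 950, 975, 1000]


def split_by_level(out_name, levels):
    """One dataset per level, each with a scalar pressure coordinate."""
    return pd.DataFrame(
        {
            # hypothetical naming scheme: variable name followed by the level
            "out_name": [f"{out_name}{lev}" for lev in levels],
            "plev_hPa": levels,
        }
    )


def single_dataset(out_name, levels):
    """One dataset that keeps the vertical dimension (all levels in one file)."""
    return pd.DataFrame({"out_name": [out_name], "plev_hPa": [levels]})


split = split_by_level("aervar", levels)
single = single_dataset("aervar", levels)
print(len(split), "datasets when split,", len(single), "dataset when kept 3D")
```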

@pierrenabat

@larsbuntemeyer thanks for creating the table.
The variables with ag="cumulative" can have a cell_methods equal to "area: time: mean". These variables are equivalent to fluxes, such as existing variables like precipitation, evapotranspiration or radiation fluxes.
For the pressure levels, if it is not possible to have the different levels in the same file, you can split them into individual datasets if you prefer.
I will also complete the missing information for the variables concerned.
Thanks!

@larsbuntemeyer
Contributor Author

larsbuntemeyer commented Aug 1, 2025

> variables with ag="cumulative" can have a cell_methods equal to "area: time: mean"

Alright, I'll update that.
Edit: No update required, see also WCRP-CORDEX/cordex-cmip6-data-request#23

> For the pressure levels, if it is not possible to have the different levels in the same file, you can split them in individual datasets if you prefer.

It should be possible, but it would not be consistent with the default data request. I think we need more opinions on this.

@jesusff
Member

jesusff commented Aug 8, 2025

I've added a separate discussion on the model levels issue in #76

For the moment, I'd leave this aerosol request as it is now, with 3D variables including the vertical dimension. @pierrenabat, how standard is the set of levels you propose here?

@larsbuntemeyer
Contributor Author

Ok, we can keep 3D variables and I will add a coordinate. However, we still have to decide about the invalid standard names, see #34 (comment)

About half of the variables have invalid standard names. Should we remove them for now?
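As a sketch of how the affected rows could be separated rather than dropped blindly, assuming a set of valid names is available: `valid_names` below is a one-entry stand-in, not the real CF standard name table, and the rows are made up for illustration.

```python
import pandas as pd

# stand-in only: a real check would load the full CF standard name table
valid_names = {"air_temperature"}

# hypothetical rows, one with a valid and one with an invalid standard name
toy = pd.DataFrame(
    {
        "out_name": ["var_a", "var_b"],
        "standard_name": ["air_temperature", "not_a_cf_name"],
    }
)

is_valid = toy["standard_name"].isin(valid_names)
kept = toy[is_valid]      # rows that could stay in the table
flagged = toy[~is_valid]  # rows to review or remove for now
print(flagged)
```

Keeping `flagged` as its own frame would make it easy to write the rows to a separate file until the names are settled.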

@larsbuntemeyer
Contributor Author

The new aerosol variable entries have no realm yet. Should they go under atmos?

@pierrenabat

Thanks for the merge @larsbuntemeyer
I think all these aerosol variables could be in a realm "aerosol", as for CMIP6 aerosol variables. Sorry I had not seen this last column.

@jesusff
Member

jesusff commented Mar 23, 2026

Hi,

yes, I would not mix it with the atmos "realm" (or "component", as we called it in #68). We still need to decide on how to proceed with the organization of the data request. Not in the derived content (datasets.csv and CMOR tables), but in the "user files". In #68 there is a proposal to have a data request per domain (or activity in general, e.g. there could be a data request for CORDEX-CORE or for some FPS) and have the component/realm as an additional column to indicate that the data is requested only in case the model has this component (aerosol chemistry, in this case).

@larsbuntemeyer
Contributor Author

Agreed, it doesn't make much sense to have data requested by components etc., but rather by domain and/or activity. I will go on with this PR and merge the aerosol component variables with realm aerosol into the cmor/datasets table. We can then let the communities decide which subset to choose for their activities.
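A minimal sketch of that merge, assuming both tables are pandas frames with compatible columns. The helper `add_aerosol`, the column layout, and the aerosol variable names are placeholders and do not reflect the actual datasets table.

```python
import pandas as pd


def add_aerosol(datasets, aerosol):
    """Tag the aerosol rows with realm 'aerosol' and append them to the table."""
    tagged = aerosol.assign(realm="aerosol")
    return pd.concat([datasets, tagged], ignore_index=True)


# hypothetical existing table and new aerosol entries
datasets = pd.DataFrame(
    {"out_name": ["tas"], "frequency": ["day"], "realm": ["atmos"]}
)
aerosol = pd.DataFrame(
    {"out_name": ["aer_var_1", "aer_var_2"], "frequency": ["day", "mon"]}
)

merged = add_aerosol(datasets, aerosol)

# a community could then select only the realms its models provide
subset = merged[merged["realm"].isin(["atmos"])]
print(merged)
```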
