Document saved statistics variable names by pragnyanramtha · Pull Request #107 · mllam/mllam-data-prep

pragnyanramtha · 2026-05-13T20:24:46Z

Describe your changes

This updates the README section for `output.splitting.compute_statistics` to describe how computed statistics are stored in the output dataset.

The added documentation explains the `{output_variable}__{split_name}__{operation}` naming pattern, gives concrete examples such as `state__train__mean`, and notes which dimensions remain after reduction. It also clarifies that `diff_` statistics are calculated only for variables that span the splitting dimension.
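
As an illustration of the naming pattern described above, here is a minimal sketch in plain Python. The helpers `statistic_variable_name` and `parse_statistic_variable_name` are hypothetical and not part of mllam-data-prep; they only assume that the three parts are joined with a double underscore, as the pattern states.

```python
# Hypothetical helpers illustrating the documented naming pattern
# {output_variable}__{split_name}__{operation}. Not part of mllam-data-prep.
SEPARATOR = "__"


def statistic_variable_name(output_variable: str, split_name: str, operation: str) -> str:
    """Build the name under which a statistic would be stored."""
    return SEPARATOR.join([output_variable, split_name, operation])


def parse_statistic_variable_name(name: str) -> tuple:
    """Split a statistic name back into its three parts.

    Assumes that neither the output variable nor the split name
    contains a double underscore.
    """
    output_variable, split_name, operation = name.split(SEPARATOR)
    return output_variable, split_name, operation


names = [
    statistic_variable_name("state", "train", op)
    for op in ("mean", "std", "diff_mean")
]
print(names)
# ['state__train__mean', 'state__train__std', 'state__train__diff_mean']
```

Note that `diff_mean` contains only a single underscore, so it survives the split on the double-underscore separator.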

Issue Link

Closes #98

Type of change

Documentation (Addition or improvements to documentation)

Checklist before requesting a review

My branch is up-to-date with the target branch.
I have performed a self-review of my code.
I have updated the documentation to cover introduced code changes.
I have given the PR a name that clearly describes the change, written in imperative form.
For any new/modified functions/classes I have added docstrings that clearly describe its purpose, expected inputs and returned values. Not applicable, no functions or classes changed.
I have placed in-line comments to clarify the intent of any hard-to-understand passages of my code. Not applicable, no code changed.
I have added tests that prove my fix is effective or that my feature works. Not applicable, documentation-only change.
I have requested a reviewer and an assignee. I will leave reviewer and assignee selection to maintainers.

Validation:

git diff --check
python -m pre_commit run trailing-whitespace --files README.md
python -m pre_commit run end-of-file-fixer --files README.md

Copilot

Pull request overview

Documentation-only update clarifying how computed split statistics are stored in generated output datasets.

Changes:

Adds naming pattern documentation for saved statistics variables.
Provides concrete statistic variable examples.
Documents retained dimensions and `diff_` statistic behavior.

Comments suppressed due to low confidence (2)

README.md:286

Typo: “normalisating” should be “normalising”.

4. Splitting and calculation of statistics of the output variables, using the `splitting` section. The `output.splitting.splits` attribute defines the individual splits to create (for example `train`, `val` and `test`) and `output.splitting.dim` defines the dimension to split along. The `compute_statistics` can be optionally set for a given split to calculate the statistical properties requested (for example `mean`, `std`) any method available on `xarray.Dataset.{op}` can be used. In addition methods prefixed by `diff_` (so the operational would be listed as `diff_{op}`) compute a statistic based on difference of consecutive time-steps, e.g. `diff_mean` to compute the `mean` of the difference between consecutive timesteps (these are used for normalisating increments). The `dims` attribute defines the dimensions to calculate the statistics over (for example `grid_index` and `time`).

README.md:286

This is a run-on sentence; the compute_statistics explanation and the xarray.Dataset.{op} note should be separated or joined with punctuation so the documentation reads clearly.

4. Splitting and calculation of statistics of the output variables, using the `splitting` section. The `output.splitting.splits` attribute defines the individual splits to create (for example `train`, `val` and `test`) and `output.splitting.dim` defines the dimension to split along. The `compute_statistics` can be optionally set for a given split to calculate the statistical properties requested (for example `mean`, `std`) any method available on `xarray.Dataset.{op}` can be used. In addition methods prefixed by `diff_` (so the operational would be listed as `diff_{op}`) compute a statistic based on difference of consecutive time-steps, e.g. `diff_mean` to compute the `mean` of the difference between consecutive timesteps (these are used for normalisating increments). The `dims` attribute defines the dimensions to calculate the statistics over (for example `grid_index` and `time`).
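
The `diff_{op}` behaviour described in the quoted README passage can be sketched with the standard library alone. This is a plain-Python stand-in for illustration, not the xarray-based code in the repository; `diff_statistic` and the sample series are assumptions made for this example.

```python
from statistics import fmean


def diff_statistic(values, op):
    """Apply `op` to the differences between consecutive values.

    Hypothetical helper: `values` stands in for one variable sampled
    along the splitting dimension (for example time).
    """
    diffs = [later - earlier for earlier, later in zip(values, values[1:])]
    return op(diffs)


# Four consecutive time steps of a single quantity.
series = [1.0, 2.0, 4.0, 7.0]

# Consecutive differences are 1.0, 2.0 and 3.0, so their mean is 2.0.
diff_mean = diff_statistic(series, fmean)
print(diff_mean)  # 2.0
```

A variable with no extent along the splitting dimension has no consecutive steps to difference, which is consistent with the PR description stating that `diff_` statistics are calculated only for variables that span that dimension.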

docs: document saved statistics variable names

28db844

Copilot AI review requested due to automatic review settings May 13, 2026 20:24

Copilot started reviewing on behalf of pragnyanramtha May 13, 2026 20:25

Copilot AI reviewed May 13, 2026

pragnyanramtha and others added 2 commits May 13, 2026 20:31

docs: polish statistics documentation

91a7c91

Fix independent diff statistics

dc0d024

pragnyanramtha wants to merge 3 commits into mllam:main from pragnyanramtha:docs-98-statistics-variable-names
