Commit e2c68d5

Merge pull request #21 from AgentOxygen/docs-update
Updated documentation
2 parents 924dced + 5477937

8 files changed: 142 additions & 61 deletions


.readthedocs.yaml

Lines changed: 3 additions & 1 deletion

```diff
@@ -19,4 +19,6 @@ sphinx:
 # See https://docs.readthedocs.io/en/stable/guides/reproducible-builds.html
 python:
   install:
-    - requirements: docs/requirements.txt
+    - requirements: requirements.txt
+    - method: pip
+      path: .
```

Dockerfile

Lines changed: 1 addition & 0 deletions

```diff
@@ -5,6 +5,7 @@ WORKDIR /project
 COPY . .
 
 RUN pip install --upgrade pip
+RUN pip install -r requirements.txt
 RUN pip install pytest sphinx sphinx-autobuild
 
 RUN pip install -e .
```

README.md

Lines changed: 22 additions & 17 deletions

````diff
@@ -22,38 +22,43 @@ To learn more about the HDP and how to use it, check out the full ReadTheDocs do
 The code block below showcases an example HDP workflow for a 400 GB high performance computer:
 
 ```
-from dask.distributed import Client, LocalCluster
-import numpy as np
-import xarray
 import hdp
+import numpy as np
+
+output_dir = "."
 
-cluster = LocalCluster(n_workers=10, memory_limit="40GB", threads_per_worker=1, processes=True)
-client = Client(cluster)
-
-input_dir = "/local1/climate_model_output/"
-
-baseline_tasmax = xarray.open_zarr(f"{input_dir}CESM2_historical_day_tasmax.zarr")["tasmax"]
-test_tasmax = xarray.open_zarr(f"{input_dir}CESM2_ssp370_day_tasmax.zarr")["tasmax"]
+sample_control_temp = hdp.utils.generate_test_control_dataarray()
+sample_warming_temp = hdp.utils.generate_test_warming_dataarray()
 
-baseline_measures = hdp.measure.format_standard_measures(temp_datasets=[baseline_tasmax])
-test_measures = hdp.measure.format_standard_measures(temp_datasets=[test_tasmax])
+baseline_measures = hdp.measure.format_standard_measures(
+    temp_datasets=[sample_control_temp]
+)
+test_measures = hdp.measure.format_standard_measures(
+    temp_datasets=[sample_warming_temp]
+)
 
 percentiles = np.arange(0.9, 1.0, 0.01)
 
-
 thresholds_dataset = hdp.threshold.compute_thresholds(
     baseline_measures,
     percentiles
 )
 
 definitions = [[3,0,0], [3,1,1], [4,0,0], [4,1,1], [5,0,0], [5,1,1]]
 
-metrics_dataset = hdp.metric.compute_group_metrics(test_measures, thresholds_dataset, definitions)
-metrics_dataset = metrics_dataset.to_zarr("/local1/test_metrics.zarr", mode='w')
+metrics_dataset = hdp.metric.compute_group_metrics(test_measures, thresholds_dataset, definitions, include_threshold=True)
+metrics_dataset.to_netcdf(f"{output_dir}/sample_hw_metrics.nc", mode='w')
+
+figure_notebook = create_notebook(metrics_dataset)
+figure_notebook.save_notebook(f"{output_dir}/sample_hw_summary_figures.ipynb")
+
+sample_control_temp = sample_control_temp.to_dataset()
+sample_control_temp.attrs["description"] = "Mock control temperature dataset generated by HDP for unit testing."
+sample_control_temp.to_netcdf(f"{output_dir}/sample_control_temp.nc", mode='w')
 
-figure_notebook = hdp.hdp.create_notebook(metrics_dataset)
-figure_notebook.save_notebook("/local1/heatwave_summary_figures.ipynb")
+sample_warming_temp = sample_warming_temp.to_dataset()
+sample_warming_temp.attrs["description"] = "Mock temperature dataset with warming trend generated by HDP for unit testing."
+sample_warming_temp.to_netcdf(f"{output_dir}/sample_warming_temp.nc", mode='w')
 ```
 
 # Contributing
````
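The `compute_thresholds` step in the README workflow above takes baseline measures and an array of percentiles. As a generic illustration of what percentile thresholding means conceptually (a plain NumPy sketch, not the HDP implementation, which operates on chunked xarray/Dask structures):

```python
import numpy as np

# Generic sketch of percentile-threshold computation (not the HDP implementation):
# for each grid cell, the q-th quantile of the baseline series defines the
# extreme-heat threshold that a test series is later compared against.
rng = np.random.default_rng(0)
baseline = 300 + rng.normal(0, 5, size=(3650, 4, 4))     # synthetic (time, lat, lon) temperatures
percentiles = np.arange(0.9, 1.0, 0.01)                  # same range as the examples above
thresholds = np.quantile(baseline, percentiles, axis=0)  # (percentile, lat, lon)

# A "hot day" mask: days exceeding the per-cell threshold at each percentile level.
exceed = baseline[None, :, :, :] > thresholds[:, None, :, :]
```

Higher percentiles necessarily give higher (or equal) thresholds per cell, which is why a sweep like `np.arange(0.9, 1.0, 0.01)` yields a monotone family of hot-day masks.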

docs/joss/paper.md

Lines changed: 4 additions & 0 deletions

```diff
@@ -22,6 +22,7 @@ bibliography: paper.bib
 ---
 
 # Summary
+
 The heatwave diagnostics package (`HDP`) is a Python package that provides the climate research community with tools to compute heatwave metrics for the large volumes of data produced by earth system model large ensembles, across multiple measures of heat, extreme heat thresholds, and heatwave definitions. The `HDP` leverages performance-oriented design using xarray, Dask, and Numba to maximize the use of available hardware resources while maintaining accessibility through an intuitive interface and well-documented user guide. This approach empowers the user to generate metrics for a wide and diverse range of heatwave types across the parameter space.
 
 # Statement of Need
@@ -30,9 +31,11 @@ Accurate quantification of the evolution of heatwave trends in climate model out
 Metrics such as heatwave frequency and duration are commonly used in hazard assessments, but there are few centralized tools and no universal heatwave criteria for computing them. This has resulted in parameter heterogeneity across the literature and has prompted some studies to adopt multiple definitions to build robustness (@perkins_review_2015). However, many studies rely on only a handful of metrics and definitions due to the excessive data management and computational burden of sampling a greater number of parameters (@perkins_measurement_2013). The introduction of large ensembles has further complicated the development of software tools, which have remained mostly specific to individual studies. Some generalized tools have been developed to address this problem, but do not contain explicit methods for evaluating the potential sensitivities of heatwave hazard to the choices of heat measure, extreme heat threshold, and heatwave definition.
 
 Development of the `HDP` was started in 2023 primarily to address the computational obstacles around handling terabyte-scale large ensembles, but quickly evolved to investigate new scientific questions around how the selection of characteristic heatwave parameters may impact hazard analysis. The `HDP` can provide insight into how the spatial-temporal response of heatwaves to climate perturbations depends on the choice of heatwave parameters. Although software does exist for calculating heatwave metrics (e.g. [heatwave3](https://robwschlegel.github.io/heatwave3/index.html), [xclim](https://xclim.readthedocs.io/en/stable/indices.html), [ehfheatwaves](https://tammasloughran.github.io/ehfheatwaves/)), these tools are not optimized to analyze more than a few definitions and thresholds at a time nor do they offer diagnostic plots.
+
 # Key Features
 
 ## Extension of XArray with Implementations of Dask and Numba
+
 `xarray` is a popular Python package used for geospatial analysis and for working with the netCDF files produced by climate models. The `HDP` workflow is based around `xarray` and seamlessly integrates with the `xarray.DataArray` data structure. Parallelization of `HDP` functions is achieved through the integration of `dask` with automated chunking and task graph construction features built into the `xarray` library.
 
 ## Heatwave Metrics for Multiple Measures, Thresholds, and Definitions
@@ -59,6 +62,7 @@ The `HDP` allows the user to test a range of parameter values: for example, heat
 : Description of the heatwave metrics produced by the HDP. \label{table:metrics}
 
 ## Diagnostic Notebooks and Figures
+
 The automatic workflow compiles a "figure deck" containing diagnostic plots for multiple heatwave parameters and input variables. To simplify this process, figure decks are serialized and stored in a single Jupyter Notebook separated into descriptive sections. Basic descriptions are included in markdown cells at the top of each figure. The `HDPNotebook` class in `hdp.graphics.notebook` is utilized to facilitate the generation of these Notebooks internally, but can be called through the API as well to build custom notebooks. An example of a Notebook of the standard figure deck is shown in Figure \ref{fig:notebook}.
 
 ![Example of an HDP standard figure deck \label{fig:notebook}](HDP_Notebook_Example.png)
```
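The paper above describes metrics such as heatwave frequency computed across definitions like `[3,0,0]`. As a generic, self-contained illustration of run-length-based event counting (a sketch only: the assumption that a definition's first element is a minimum duration in days is illustrative, and this is not HDP code):

```python
import numpy as np

# Generic run-length sketch (not HDP code): count heatwave "events", defined
# here as runs of at least `min_days` consecutive hot days in a boolean series.
def count_heatwaves(hot, min_days=3):
    # Pad with False so every run has both a start and an end transition.
    padded = np.concatenate(([False], np.asarray(hot, dtype=bool), [False]))
    starts = np.flatnonzero(~padded[:-1] & padded[1:])
    ends = np.flatnonzero(padded[:-1] & ~padded[1:])
    return int(np.sum((ends - starts) >= min_days))

hot = np.array([0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0], dtype=bool)
print(count_heatwaves(hot, min_days=3))  # 2 events (runs of length 3 and 4)
```

Applying such a kernel per grid cell and per (measure, threshold, definition) combination is exactly the kind of parameter sweep the paper argues becomes expensive without a performance-oriented design.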

docs/joss/paper.pdf

419 Bytes
Binary file not shown.

docs/sample_data/sample.py

Lines changed: 63 additions & 0 deletions

```python
#!/usr/bin/env python
"""
hdp.py

Heatwave Diagnostics Package (HDP)

Entry point for package.

Developer: Cameron Cummins
Contact: cameron.cummins@utexas.edu
"""
from hdp.graphics.notebook import create_notebook
from os.path import isdir
import hdp.measure
import hdp.threshold
import hdp.metric
import numpy as np
import sys


def generate_sample_data(output_dir):
    if not isdir(output_dir):
        raise RuntimeError(f"Output directory '{output_dir}' does not exist!")

    sample_control_temp = hdp.utils.generate_test_control_dataarray()
    sample_warming_temp = hdp.utils.generate_test_warming_dataarray()

    baseline_measures = hdp.measure.format_standard_measures(
        temp_datasets=[sample_control_temp]
    )
    test_measures = hdp.measure.format_standard_measures(
        temp_datasets=[sample_warming_temp]
    )

    percentiles = np.arange(0.9, 1.0, 0.01)

    thresholds_dataset = hdp.threshold.compute_thresholds(
        baseline_measures,
        percentiles
    )

    definitions = [[3,0,0], [3,1,1], [4,0,0], [4,1,1], [5,0,0], [5,1,1]]

    metrics_dataset = hdp.metric.compute_group_metrics(test_measures, thresholds_dataset, definitions, include_threshold=True)
    metrics_dataset.to_netcdf(f"{output_dir}/sample_hw_metrics.nc", mode='w')

    figure_notebook = create_notebook(metrics_dataset)
    figure_notebook.save_notebook(f"{output_dir}/sample_hw_summary_figures.ipynb")

    sample_control_temp = sample_control_temp.to_dataset()
    sample_control_temp.attrs["description"] = "Mock control temperature dataset generated by HDP for unit testing."
    sample_control_temp.to_netcdf(f"{output_dir}/sample_control_temp.nc", mode='w')

    sample_warming_temp = sample_warming_temp.to_dataset()
    sample_warming_temp.attrs["description"] = "Mock temperature dataset with warming trend generated by HDP for unit testing."
    sample_warming_temp.to_netcdf(f"{output_dir}/sample_warming_temp.nc", mode='w')


if __name__ == "__main__":
    print("Generating testing data and simulating a full data-to-figure workflow: ")
    if len(sys.argv) != 2:
        assert Exception("Please specifiy the path to output sample data and results to.")
    generate_sample_data(sys.argv[1])
    print("Done!")
```
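The script above relies on `hdp.utils.generate_test_control_dataarray`, which the user guide describes as building temperature from a sine wave with a one-year period plus a gradient that decreases temperature over latitude. A hypothetical sketch of such a generator (the real HDP function's signature, values, and coordinates will differ):

```python
import numpy as np
import xarray as xr

# Hypothetical sketch of a sine-wave test-temperature generator, in the spirit
# of hdp.utils.generate_test_control_dataarray (not the real implementation):
# an annual sine cycle in time plus a gradient that cools toward the poles.
def generate_control_dataarray(n_years=2, n_lat=8, n_lon=8):
    days = np.arange(n_years * 365)
    lat = np.linspace(-90, 90, n_lat)
    lon = np.linspace(0, 360, n_lon, endpoint=False)
    seasonal = 10 * np.sin(2 * np.pi * days / 365)   # annual cycle, amplitude 10 K
    lat_gradient = -0.3 * np.abs(lat)                # uniformly colder toward the poles
    data = 300 + seasonal[:, None, None] + lat_gradient[None, :, None]
    data = np.broadcast_to(data, (days.size, n_lat, n_lon)).copy()
    return xr.DataArray(data, coords={"time": days, "lat": lat, "lon": lon},
                        dims=("time", "lat", "lon"), name="tasmax")

control = generate_control_dataarray()
```

Generating deterministic input like this, rather than shipping netCDF files, is what lets the package include sample data without extra disk space, as the user guide notes.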

docs/user.rst

Lines changed: 47 additions & 43 deletions

```diff
@@ -30,45 +30,60 @@ The HDP can be installed using PyPI. You can view the webpage `here <https://pyp
 Quick Start
 -----------
-Below is example code that computes heatwave metrics for multiple measures, thresholds, and definitions. Heatwave metrics are obtained for the test dataset by comparing against the thresholds generated from the baseline dataset.
+Below is example code that computes heatwave metrics for multiple measures, thresholds, and definitions from sample data generated by the HDP. Heatwave metrics are obtained for the "warming" data by comparing against the thresholds generated from the "control" data.
 
 .. code-block:: python
 
-    from dask.distributed import Client, LocalCluster
-    import numpy as np
-    import xarray
     import hdp
-
-
-    cluster = LocalCluster(n_workers=10, memory_limit="40GB", threads_per_worker=1, processes=True)
-    client = Client(cluster)
-
-    input_dir = "/local1/climate_model_output/"
-
-    baseline_tasmax = xarray.open_zarr(f"{input_dir}CESM2_historical_day_tasmax.zarr")["tasmax"]
-    test_tasmax = xarray.open_zarr(f"{input_dir}CESM2_ssp370_day_tasmax.zarr")["tasmax"]
-
-    baseline_measures = hdp.measure.format_standard_measures(temp_datasets=[baseline_tasmax])
-    test_measures = hdp.measure.format_standard_measures(temp_datasets=[test_tasmax])
-
+    import numpy as np
+
+    output_dir = "."
+
+    sample_control_temp = hdp.utils.generate_test_control_dataarray()
+    sample_warming_temp = hdp.utils.generate_test_warming_dataarray()
+
+    baseline_measures = hdp.measure.format_standard_measures(
+        temp_datasets=[sample_control_temp]
+    )
+    test_measures = hdp.measure.format_standard_measures(
+        temp_datasets=[sample_warming_temp]
+    )
+
     percentiles = np.arange(0.9, 1.0, 0.01)
-
-
+
     thresholds_dataset = hdp.threshold.compute_thresholds(
         baseline_measures,
         percentiles
     )
-
+
     definitions = [[3,0,0], [3,1,1], [4,0,0], [4,1,1], [5,0,0], [5,1,1]]
-
-    metrics_dataset = hdp.metric.compute_group_metrics(test_measures, thresholds_dataset, definitions)
-    metrics_dataset = metrics_dataset.to_zarr("/local1/test_metrics.zarr", mode='w')
-
-    figure_notebook = hdp.hdp.create_notebook(metrics_dataset)
-    figure_notebook.save_notebook("/local1/heatwave_summary_figures.ipynb")
-
-Example 1: Generating Heatwave Diagnostics
+
+    metrics_dataset = hdp.metric.compute_group_metrics(test_measures, thresholds_dataset, definitions, include_threshold=True)
+    metrics_dataset.to_netcdf(f"{output_dir}/sample_hw_metrics.nc", mode='w')
+
+    figure_notebook = create_notebook(metrics_dataset)
+    figure_notebook.save_notebook(f"{output_dir}/sample_hw_summary_figures.ipynb")
+
+    sample_control_temp = sample_control_temp.to_dataset()
+    sample_control_temp.attrs["description"] = "Mock control temperature dataset generated by HDP for unit testing."
+    sample_control_temp.to_netcdf(f"{output_dir}/sample_control_temp.nc", mode='w')
+
+    sample_warming_temp = sample_warming_temp.to_dataset()
+    sample_warming_temp.attrs["description"] = "Mock temperature dataset with warming trend generated by HDP for unit testing."
+    sample_warming_temp.to_netcdf(f"{output_dir}/sample_warming_temp.nc", mode='w')
+
+This code snippet is included in the HDP source code and can be executed via:
+
+.. code-block:: console
+
+    $ git clone https://github.com/AgentOxygen/HDP.git
+    $ cd HDP
+    $ python hdp/docs/sample_data/sample.py hdp/docs/sample_data/
+
+The sample data, metric data, and summary figures are all saved to the specified `hdp/docs/sample_data/` but this path can be changed as needed. The sample input data is the same data used in unit testing, where temperature is generated using a sine wave over time with a period of one year and a gradient is applied to decrease the temperature uniformly over latitude. This processes is encapsulated in the function `hdp.utils.generate_test_control_dataarray`. For the warming dataset, a slight warming trend is applied uniformly over time to simulate global warming. By generating these input datasets instead of supplying them directly, we reduce disk space needed to install/use the package with sample data included.
+
+Example: Generating Heatwave Diagnostics
 ------------------------------------------
 In this first example, we will produce heatwave metrics for one IPCC AR6 emission scenario, SSP3-7.0, run by the CESM2 climate model to produce a large ensemble called the "CESM2 Large Ensemble Community Project" or `LENS2 <https://www.cesm.ucar.edu/community-projects/lens2>`_. We will explore the following set of heatwave parameters:
 
@@ -98,11 +113,11 @@ To fully utilize the performance enhancments offered by the HDP, we must first s
 .. code-block:: python
 
     from dask.distributed import Client, LocalCluster
-    cluster = LocalCluster(n_workers=20, memory_limit="10GB", threads_per_worker=1, processes=True, dashboard_address=":8004")
+    cluster = LocalCluster(n_workers=20, memory_limit="10GB", threads_per_worker=1, processes=True)
     client = Client(cluster)
 
-Once a Dask cluster is initialized, we then need to organize our data into `xarray.DataArray <https://docs.xarray.dev/en/stable/generated/xarray.DataArray.html>`_ objects. The entire HDP is built around xarray data structures to ensure ease of use and remain agnostic to input file types. Since we are working with a large ensemble, we need to make sure to concatenate the ensemble members along a "member" dimension. If we weren't using a large ensemble (a single long-running simulation for example), we would just omit this step. To read data from disk, we can use the `xarray.open_mfdataset <https://docs.xarray.dev/en/stable/generated/xarray.open_mfdataset.html>`_ function. Reading and post-processing data will look different from system to system, but the final format should be the same. Below is a list of xarray.DataArrays with the data structure for baseline_tasmax dataset visualized below:
+Once a Dask cluster is initialized, we then need to organize our data into `xarray.DataArray <https://docs.xarray.dev/en/stable/generated/xarray.DataArray.html>`_ objects. The entire HDP is built around xarray data structures to ensure ease of use and remain agnostic to input file types. Since we are working with a large ensemble, we need to make sure to concatenate the ensemble members along a "member" dimension. If we weren't using a large ensemble (a single long-running simulation for example), we would just omit this step. To read data from disk, we can use the `xarray.open_mfdataset <https://docs.xarray.dev/en/stable/generated/xarray.open_mfdataset.html>`_ function. Reading and post-processing data will look different from system to system, but the final format should be the same. Below is a list of `xarray.DataArrays` with the data structure for `baseline_tasmax` dataset visualized below:
 
 .. code-block:: python
 
@@ -116,7 +131,7 @@ Once a Dask cluster is initialized, we then need to organize our data into `xarr
 .. image:: assets/tasmax_dataarray_example.png
    :width: 600
 
-The spatial coordinates for latitude and longitude should be named "lat" and "lon" respectively. The "time" coordinates should be decoded into CFTime objects and a "member" dimension should be created if an ensemble is being used.
+The spatial coordinates for latitude and longitude should be named "lat" and "lon" respectively. The "time" coordinates should be decoded into `CFTime`` objects and a "member" dimension should be created if an ensemble is being used.
 
 To begin, we first need to format these measures so that they are in the correct units. This process will also compute heat index values using the relative humidity (rh) datasets.
 
@@ -156,18 +171,7 @@ Since we are connected to a Dask cluster, we can write the output to a zarr stor
 .. code-block:: python
 
-    metrics_dataset.to_zarr("/local1/lens2_ssp370_hw_metrics.zarr", mode='w', compute=True)
-
-
-:ref:`example_2`
-
-Example 2: RAMIP Analysis
--------------------------
-The Regional Aerosol Model Intercomparison Project (RAMIP) is a multi-model large ensemble of earth system model experiments conducted to quantify the role of regional aerosol emissions changes in near-term climate change projections (`Wilcox et al., 2023 <https://gmd.copernicus.org/articles/16/4451/2023/>`_). For the sake of simplicity, we will only investigate CESM2 (one of the 8 models available in this MIP) for this example. For CESM2, there are 10 ensemble members for each of the six model experiments. Each experiment is essentially a different emission scenario where regional aerosol emissions are held constant over different parts of the globe. We will use a historical simulation from 1960 to 1970 run produced by CESM2 from the same ensemble as the baseline for calculating the extreme heat threshold.
-
+    metrics_dataset.to_zarr("lens2_ssp370_hw_metrics.zarr", mode='w', compute=True)
 
-:ref:`threshold_calc`
-Threshold Calculation
----------------------
```
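The user guide above notes that ensemble members must be concatenated along a "member" dimension before being handed to the HDP. A minimal sketch of that step with synthetic data (not HDP code; real data would come from `xarray.open_mfdataset` and carry decoded CFTime coordinates):

```python
import numpy as np
import xarray as xr

# Sketch (synthetic data, not HDP code): each ensemble member is a
# (time, lat, lon) DataArray; concatenating along a new "member" dimension
# yields the (member, time, lat, lon) layout described in the user guide.
lat = np.linspace(-90, 90, 4)
lon = np.linspace(0, 270, 4)
time = np.arange(10)  # placeholder; real data should use decoded CFTime objects

def make_member(seed):
    rng = np.random.default_rng(seed)
    data = 290 + rng.normal(0, 5, size=(time.size, lat.size, lon.size))
    return xr.DataArray(data, coords={"time": time, "lat": lat, "lon": lon},
                        dims=("time", "lat", "lon"), name="tasmax")

# Concatenate members along a new "member" dimension, then label it.
ensemble = xr.concat([make_member(i) for i in range(3)], dim="member")
ensemble = ensemble.assign_coords(member=["r1", "r2", "r3"])
print(ensemble.dims)  # ('member', 'time', 'lat', 'lon')
```

For a single long-running simulation rather than an ensemble, this concatenation step is simply omitted, as the guide states.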
Lines changed: 2 additions & 0 deletions

```diff
@@ -10,4 +10,6 @@ netCDF4
 tqdm
 ipywidgets
 nbformat
+sphinx
+sphinx-autobuild
 sphinx-rtd-theme
```
