This project implements a physically motivated, stochastic uncertainty model for gridded weather data (ERA5). It leverages a modern data engineering pipeline to transform raw geospatial data into interpretable predictability metrics using EOF decomposition and Ornstein–Uhlenbeck (OU) processes.
The pipeline quantifies how forecast uncertainty grows over time by modeling the temporal dynamics of reduced-order atmospheric states. By fitting continuous-time stochastic processes to historical anomalies, we derive analytical uncertainty growth curves and reconstruct them into spatial maps on the Earth's surface.
- Data Engineering Pipeline: Structured as composable assets compatible with Dagster, ensuring modularity and reproducibility.
- Geospatial Analysis: Uses `xarray` for labeled multidimensional data handling and `numpy`/`statsmodels` for numerical analysis.
- Dimensionality Reduction: EOF (Empirical Orthogonal Function) decomposition to capture dominant modes of variability.
- Stochastic Modeling: Ornstein–Uhlenbeck process fitting to quantify memory time scales and noise strength.
- Uncertainty Mapping: Closed-form reconstruction of lead-time-dependent spatial uncertainty.
- Ingestion & Validation: Loads raw NetCDF data and performs structural (monotonicity, duplicates) and physical (range checks, units) validation.
- Preprocessing: Removes seasonal cycles (anomalies) and applies latitude-dependent cosine weighting to account for grid cell area differences.
- EOF Decomposition: Reduces spatial complexity by projecting anomalies onto a low-dimensional subspace of principal components.
- OU Parameter Estimation: Fits OU processes to the leading EOF coefficients. The model estimates:
  - Decay Rate ($\lambda$): The inverse of the memory time scale.
  - Noise Strength ($\sigma$): The magnitude of stochastic forcing.
- Uncertainty Quantification: Computes the variance growth $V(t) = \frac{\sigma^2}{2\lambda}(1 - e^{-2\lambda t})$ and projects it back to the original spatial grid.
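
The preprocessing and EOF steps above can be sketched with plain `numpy` as follows. This is a minimal illustration on synthetic data, not the project's actual code: the function names `monthly_anomalies` and `eof_decomposition` are hypothetical, the EOFs are computed with an SVD, and the weights are taken as $\sqrt{\cos(\text{lat})}$ (a common convention so that squared anomalies are area-weighted), which may differ from the weighting used in the repository.

```python
import numpy as np


def monthly_anomalies(field, months):
    """Remove the mean seasonal cycle.

    field:  array of shape (time, lat, lon)
    months: array of shape (time,) with calendar months 1..12
    """
    anomalies = np.empty_like(field, dtype=float)
    for m in np.unique(months):
        sel = months == m
        # Subtract the long-term mean of each calendar month
        anomalies[sel] = field[sel] - field[sel].mean(axis=0)
    return anomalies


def eof_decomposition(anomalies, lat_deg, n_modes):
    """Return (pcs, eofs, explained) for the leading n_modes EOFs."""
    n_time, n_lat, n_lon = anomalies.shape
    # Assumption: sqrt(cos(lat)) weights, so variances scale with grid cell area
    w = np.sqrt(np.clip(np.cos(np.deg2rad(lat_deg)), 0.0, None))
    weighted = anomalies * w[None, :, None]
    X = weighted.reshape(n_time, n_lat * n_lon)
    # Rows of Vt are orthonormal spatial patterns; U * s are their time series
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    pcs = U[:, :n_modes] * s[:n_modes]                   # (time, mode)
    eofs = Vt[:n_modes].reshape(n_modes, n_lat, n_lon)   # (mode, lat, lon)
    explained = s[:n_modes] ** 2 / np.sum(s ** 2)        # variance fractions
    return pcs, eofs, explained


# Synthetic stand-in for a monthly gridded field (not ERA5 data)
rng = np.random.default_rng(0)
n_time = 120
lat = np.linspace(-60.0, 60.0, 5)
months = np.arange(n_time) % 12 + 1
seasonal = np.sin(2 * np.pi * months / 12)[:, None, None]
field = rng.normal(size=(n_time, lat.size, 8)) + seasonal

anom = monthly_anomalies(field, months)
pcs, eofs, explained = eof_decomposition(anom, lat, n_modes=3)
```

The columns of `pcs` are the reduced-order time series to which the OU processes are fitted in the next step.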
The project bridges data engineering and physical climate science:
- Physical Motivation: Weather anomalies often exhibit "memory" that decays over time, a characteristic well-captured by the OU process (the continuous-time analog of AR(1)).
- Mathematical Modeling: Parameters are derived from the integral time scale of the autocorrelation function, ensuring consistent estimation of temporal persistence.
- Interpretability: The model provides metrics on where and how fast atmospheric predictability saturates.
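
As a rough illustration of the estimation idea, the sketch below derives $\lambda$ and $\sigma$ from the integral time scale of the sample autocorrelation function and evaluates $V(t)$. The function names `fit_ou` and `variance_growth` are hypothetical, and truncating the integral at the first non-positive autocorrelation value is an assumption made here to suppress the noisy tail; the repository may handle this differently. The check uses a synthetic OU series with known parameters.

```python
import numpy as np


def fit_ou(series, dt=1.0):
    """Estimate (lam, sigma) of an OU process from one time series."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    n = x.size
    var = x.var()
    # Biased sample autocorrelation at lags 0..n-1 (acf[0] == 1)
    acf = np.correlate(x, x, mode="full")[n - 1:] / (n * var)
    # Assumption: integrate only up to the first non-positive value
    nonpos = np.flatnonzero(acf <= 0)
    cutoff = nonpos[0] if nonpos.size else n
    # Trapezoid rule with acf[0] = 1 and the ACF treated as zero at the cutoff
    tau = dt * (0.5 + acf[1:cutoff].sum())
    lam = 1.0 / tau                      # decay rate = 1 / memory time scale
    sigma = np.sqrt(2.0 * lam * var)     # from stationary variance sigma^2 / (2 lam)
    return lam, sigma


def variance_growth(lam, sigma, t):
    """V(t) = sigma^2 / (2 lam) * (1 - exp(-2 lam t))."""
    t = np.asarray(t, dtype=float)
    return sigma ** 2 / (2.0 * lam) * (1.0 - np.exp(-2.0 * lam * t))


# Synthetic check: exact discretization of an OU process with known parameters
rng = np.random.default_rng(42)
true_lam, true_sigma, dt, n = 0.2, 0.5, 1.0, 20_000
rho = np.exp(-true_lam * dt)
step_std = np.sqrt(true_sigma ** 2 / (2.0 * true_lam) * (1.0 - rho ** 2))
noise = rng.normal(size=n)
x = np.zeros(n)
for k in range(1, n):
    x[k] = rho * x[k - 1] + step_std * noise[k]

lam_hat, sigma_hat = fit_ou(x, dt)
lead_times = np.array([1.0, 6.0, 12.0])
v = variance_growth(lam_hat, sigma_hat, lead_times)
```

With `dt` in months, `lam_hat` is in units of 1/month and `v` approaches the stationary variance `sigma_hat**2 / (2 * lam_hat)` as the lead time grows.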
- Improved OU Validation: Add diagnostics for each EOF mode, including empirical vs. theoretical autocorrelation plots and automatic rejection of modes that violate OU assumptions.
- Predictability Horizon Metric: Calculate time-to-saturation metrics per mode to quantify the intuitive "memory" and usefulness of forecasts.
- Spatial Uncertainty Snapshots: Generate static maps for fixed lead times (1, 6, 12 months) to visualize the spatial progression of uncertainty.
- Kalman Filter on Reduced State: Implement state estimation by fusing noisy EOF coefficients with OU dynamics, mimicking data assimilation.
- Code Refactoring: Organize the codebase into dedicated modules (ingestion, preprocessing, EOF reduction, stochastic modeling, visualization) with improved documentation.
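
Two of the planned items follow directly from the closed-form variance growth and can be sketched briefly. Setting $V(t)/V(\infty) = f$ gives a time to saturation of $t_f = -\ln(1 - f) / (2\lambda)$, and a spatial snapshot at a fixed lead time is the sum of squared EOF patterns weighted by each mode's $V_k(t)$. The function names, the 95% threshold, and the assumption that modes evolve independently are all illustrative choices, not features of the current code; undoing the latitude weighting is also omitted for brevity.

```python
import numpy as np


def time_to_saturation(lam, fraction=0.95):
    """Lead time at which V(t) reaches `fraction` of its stationary value."""
    lam = np.asarray(lam, dtype=float)
    # 1 - exp(-2 lam t) = fraction  ->  t = -ln(1 - fraction) / (2 lam)
    return -np.log(1.0 - fraction) / (2.0 * lam)


def uncertainty_snapshot(eofs, lams, sigmas, lead_time):
    """Variance map at one lead time, assuming independent OU modes.

    eofs: (mode, lat, lon); lams, sigmas: (mode,)
    """
    lams = np.asarray(lams, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    v = sigmas ** 2 / (2.0 * lams) * (1.0 - np.exp(-2.0 * lams * lead_time))
    # Sum over modes k of V_k(t) * EOF_k(lat, lon)^2
    return np.einsum("k,kij->ij", v, np.asarray(eofs, dtype=float) ** 2)


# Hypothetical two-mode example with decay rates in 1/month
example_eofs = np.ones((2, 3, 4))
example_lams = np.array([1.0, 2.0])
example_sigmas = np.array([1.0, 1.0])

horizons = time_to_saturation(example_lams)
snapshots = {
    months: uncertainty_snapshot(example_eofs, example_lams, example_sigmas, months)
    for months in (1, 6, 12)
}
```

Slowly decaying modes (small $\lambda$) have long horizons, so they dominate the late growth of the uncertainty maps.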
The project is organized into a modular package structure:
- `src/stochastic_weather/core/`: Contains the pure mathematical and physical logic (EOF decomposition, OU process fitting, uncertainty calculations, and visualization).
- `src/stochastic_weather/assets/`: Defines the Dagster assets that orchestrate the data pipeline, now including rich metadata visualization in the Dagster UI.
- `data/`: Placeholder for input NetCDF weather data.
- `notebooks/`: Exploratory analysis and prototyping.
- Python 3.9+
- Dependencies: `xarray`, `numpy`, `pandas`, `dagster`, `statsmodels`, `matplotlib`, `netcdf4`
To reproduce the results, you need the ERA5 Monthly Evaporation dataset.
- Download the data: ERA5_LowRes_Monthly_evap.nc
- Place the file: Create a `data/` directory in the project root and place the downloaded file there.
The expected path is:
data/ERA5_LowRes_Monthly_evap.nc
The project is structured with Dagster assets. You can launch the Dagster UI to explore and run the pipeline:
`dagster dev -f definitions.py`

To generate the results, you need to materialize the assets in the Dagster UI:
- Open the Dagster UI at http://localhost:3000.
- Navigate to the Catalog page.
- Select all assets and click Materialize.
- Dagster will execute the pipeline and store the results.
- You can view the plots in spatial_uncertainty_map or eof_modes > plot > Show Markdown.
The eof_modes asset shows the spatial patterns of the leading EOF (Empirical Orthogonal Function) modes. These maps reveal the dominant spatial structures of weather variability.
The spatial_uncertainty_map asset shows the reconstructed forecast uncertainty at a specific lead time (default: 6 months).
This is a student hobby project that explores the intersection of data engineering and stochastic atmospheric dynamics. It is still in the early stages of development.

