Welcome to the Solar Forecasting project! This document will introduce you to the key concepts and knowledge needed to contribute effectively.
- Introduction to Solar Forecasting
- Coding Style
- What is NWP Data?
- Understanding Zarr Format
- Target Data: What is UK PVlive?
- Basics of Machine Learning for Solar Forecasting
- Datasets for Solar Forecasting
- APIs and Data Retrieval
- Data Pipelines for Solar Forecasting
- Benchmarks and Comparisons
- Geographical Adaptability
- Key Tools and Technologies
- Common Terminology
- Helpful Knowledge and Skills
- How This Project Fits into Renewable Energy
- Development and Testing Guide
- Command Line Interface (CLI)
- Running PVNet Model
Solar forecasting is the process of predicting the amount of solar energy that will be generated over a specific period. Understanding this helps optimize renewable energy systems and integrate them with the grid.
To ensure code quality and maintain consistency, we follow the Open Climate Fix (OCF) Coding Style. All contributors are expected to adhere to these standards.
- Formatting: Use `black` with a line length of 100 characters.
- Linting: Run `ruff` to catch and fix style violations.
- Docstrings: Write Google-style docstrings for all functions, classes, and modules.
- Pre-commit Hooks: Install pre-commit hooks to automatically format and lint your code before commits.
Refer to the full OCF Coding Style Guide for detailed instructions.
Numerical Weather Prediction (NWP) data uses mathematical models of the atmosphere and oceans to forecast weather. It predicts various atmospheric conditions such as temperature, pressure, wind speed, humidity, precipitation type and amount, cloud cover, and sometimes even surface conditions and air quality—all of which are crucial for solar forecasting.
https://en.wikipedia.org/wiki/Numerical_weather_prediction
Zarr is a relatively new, cloud-optimized data format designed to improve access to N-dimensional arrays. It provides an effective way to store large N-dimensional data in the cloud, with access facilitated through predefined chunks. Zarr can be viewed as the cloud-based counterpart to HDF5/NetCDF files, as it follows a similar data model. However, unlike NetCDF or HDF5, which store data in a single file, Zarr organizes data as a directory containing compressed binary files for chunks of data, alongside metadata stored in external JSON files.
The semantic mapping from the NetCDF Data Model to the Zarr Data Model is as follows:
| NetCDF Data Model | Zarr V2 Data Model |
|---|---|
| File | Store |
| Group | Group |
| Variable | Array |
| Attribute | User Attribute |
| Dimension | Not supported as a native feature |
A Zarr array can be stored in any storage system that supports a key/value interface. In this system:
- A key is an ASCII string.
- A value is an arbitrary sequence of bytes.

Supported operations include:

- Read: retrieve the sequence of bytes associated with a key.
- Write: set the sequence of bytes associated with a key.
- Delete: remove a key/value pair.

Currently, Zarr V2 is the stable version, while Zarr V3 is considered experimental.

https://wiki.earthdata.nasa.gov/display/ESO/Zarr+Format
UK PVlive provides national solar generation data, accessible via API, and is maintained by the University of Sheffield. This dataset serves as a reliable "ground truth" for training and evaluating solar forecasting models.
- National Solar Generation Data: Offers estimates of real-time solar photovoltaic (PV) generation across the UK.
- Granularity: Provides data at both national and regional levels, allowing for detailed analysis.
- Frequency: Updated every 30 minutes, ensuring near real-time data availability for forecasting and validation.
- Capacity Data: Includes information on installed PV capacity, which is crucial for normalizing generation data.
- Historical Data: Offers access to historical records for long-term analysis and model training.
- API Access: Enables automated data retrieval, simplifying integration with machine learning pipelines.
- Ground Truth for Model Training: Serves as a reliable dataset to validate solar forecasting models.
- Capacity Trend Analysis: Tracks changes in installed PV capacity over time to analyze the growth of solar energy adoption.
- Benchmarking: Provides a baseline for comparing forecasting models against observed solar generation.
UK PVlive is a critical resource for solar forecasting because it combines high-quality, timely data with extensive historical records. Its regional and national-level granularity enables robust analysis and supports model generalization.
For more information, visit the PVlive website or explore their API documentation for data access.
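As an illustration of why the capacity data matters, the sketch below normalizes a toy half-hourly generation series by installed capacity — the kind of normalization typically applied before model training. The numbers are made up; real values would come from the PVlive API.

```python
import pandas as pd

# Hypothetical half-hourly national PV generation and installed capacity (MW).
df = pd.DataFrame(
    {
        "generation_mw": [0.0, 1200.0, 4800.0, 3600.0],
        "installed_capacity_mw": [14000.0] * 4,
    },
    index=pd.date_range("2024-06-01 06:00", periods=4, freq="30min"),
)

# Normalizing by capacity yields a dimensionless target in [0, 1] that stays
# comparable as more PV capacity is installed over the years.
df["normalized"] = df["generation_mw"] / df["installed_capacity_mw"]
print(df["normalized"].round(4).tolist())
```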
This project applies core machine learning (ML) principles to the domain of solar forecasting. Below is an overview of the essential concepts, tailored to this project:
- Data Preparation: Preparing datasets by splitting them into training, validation, and testing sets is fundamental. Refer to the Data Pipelines for Solar Forecasting section for details on batch creation and pipeline design.
- Feature Engineering: Identifying and transforming input variables (e.g., solar irradiance, temperature, and cloud cover) into meaningful features for the ML model. See Common Terminology for definitions of relevant features.
- Model Selection: Choosing models like Convolutional Neural Networks (CNNs) for spatial data or transformers for advanced pattern recognition. Refer to Machine Learning Terms for details on architectures.
- Evaluation: Using metrics such as MAE, RMSE, and R² to assess model performance. See Machine Learning Terms for metric descriptions.
- Optimization: Applying techniques like gradient descent and regularization to improve model performance. Refer to Optimization and Cost Functions for more information.
For contributors unfamiliar with these concepts, the Machine Learning Terms section provides a glossary of key terms to get started.
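As a quick illustration of the evaluation metrics mentioned above, here is a minimal sketch computing MAE and RMSE by hand on toy generation numbers (in practice you would likely use a library implementation such as scikit-learn's):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error: the average magnitude of the errors."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def rmse(y_true, y_pred):
    """Root Mean Square Error: gives higher weight to large errors."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

# Toy half-hourly solar generation in MW: observed vs. forecast.
actual = [0.0, 120.0, 450.0, 300.0]
forecast = [10.0, 100.0, 400.0, 320.0]

print(mae(actual, forecast))   # → 25.0
print(rmse(actual, forecast))  # larger than MAE, dominated by the 50 MW miss
```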
Tutorial: Getting Started with ML Training: OCF Open Data PVNet Notebook
Datasets form the backbone of solar forecasting by providing the historical and real-time data required for model training and evaluation. This project leverages a variety of datasets, including weather, solar generation, and climate data.
An NWP dataset used for UK solar forecasting. See the Met Office Dataset Documentation for detailed information about:
- Variables and their impact on solar forecasting
- Dataset structure and format
- Data quality considerations
- Access instructions via Hugging Face
When working with any NWP dataset for solar forecasting, certain atmospheric variables are particularly important for accurate predictions. The NWP Solar Attributes Reference provides a comprehensive list of the key CF Standard Names and attributes that should be prioritized when available in NWP models. This includes core variables like solar radiation fluxes, cloud cover at different levels, temperature, and wind, as well as additional variables like aerosol optical depth and surface albedo that can enhance forecast accuracy in specific conditions. While not all NWP models will contain every variable listed, this reference helps ensure consistency and guides data collection efforts across different weather data sources.
For the GFS dataset specifically, the GFS NWP Solar Attributes Table shows how these recommended attributes map to the actual GFS channels available in our dataset, indicating which variables are present and which are missing from the GFS model.
For a complete list of available weather datasets and their descriptions, see the Datasets Guide.
APIs play a crucial role in fetching real-time and historical data required for solar forecasting. This section outlines key concepts and resources for retrieving data such as solar generation, capacity, and weather information.
- PVlive API: see Target Data: What is UK PVlive?
- Weather APIs: Accessing weather data is critical for creating feature-rich datasets. Commonly used APIs include:
- Met Office DataPoint: Provides UK-specific weather data, including forecasts, observations, and warnings. Data is available in JSON or XML formats for easy integration. Learn more.
- Copernicus Atmosphere Monitoring Service (CAMS): Offers aerosol, cloud, and solar radiation data, with high-resolution forecasts and reanalysis datasets. Learn more.
- European Centre for Medium-Range Weather Forecasts (ECMWF): Provides global atmospheric data and reanalysis datasets, including ERA5, for solar radiation and historical weather data. Learn more.
- OpenWeatherMap: Offers current and historical weather data, including temperature, cloud cover, and wind speed, with global coverage. Learn more.
- Meteomatics API: Provides high-resolution gridded weather data, including solar radiation, temperature, and wind speed, suitable for UK and European applications. Learn more.
- Satellite Data APIs: Satellite imagery and radiance data are invaluable for analyzing cloud cover and solar irradiance:
- Copernicus Atmosphere Monitoring Service (CAMS): Provides satellite-based aerosol, cloud, and solar radiation data. Learn more.
- NASA's POWER API: Offers meteorological and solar datasets tailored for renewable energy applications, including European regions. Learn more.
- EUMETSAT: Europe's satellite-based service providing weather and climate data, including cloud cover and solar radiation products. Learn more.
- AWS S3 Access: You will need access to the AWS S3 bucket containing the NWP data. Ensure you have the required permissions to list and download objects from the bucket.
- Install the AWS CLI:

  ```bash
  # Using pip
  pip install awscli
  ```

  Alternatively, install the AWS CLI via your system's package manager or the AWS CLI Installer.
- Configure AWS Credentials: Run the following command and follow the prompts:

  ```bash
  aws configure
  # - AWS Access Key ID
  # - AWS Secret Access Key
  # - Default region (e.g., eu-west-1)
  # - Default output format (json recommended)
  ```

- Access the NWP Data:
  - Use the `aws s3 ls` command to list the available objects in the bucket. Example:

    ```bash
    aws s3 ls --no-sign-request s3://ocf-open-data-pvnet/data/
    ```

  - Use the `aws s3 cp` command to download the data to your local machine.
- Accessing GFS Data from S3: The GFS data is stored in an S3 bucket at `s3://ocf-open-data-pvnet/data/gfs.zarr/`. To access and work with this data, you can use Python libraries such as `xarray` and `s3fs`, which allow you to read Zarr-formatted data directly from S3.

  Before accessing the data, ensure you have the following Python packages installed:

  ```bash
  pip install xarray zarr s3fs
  ```

  Below is an example of how to open the GFS dataset from the S3 bucket using Python:
  ```python
  import xarray as xr
  import s3fs

  # Create an S3 filesystem object
  s3 = s3fs.S3FileSystem(anon=True)

  # Open the GFS dataset from the S3 bucket
  dataset_path = 's3://ocf-open-data-pvnet/data/gfs.zarr/'
  ds = xr.open_zarr(s3.get_mapper(dataset_path), consolidated=True)

  # Display the dataset
  print(ds)
  ```

- API Keys: Most APIs require authentication via an API key. Store keys securely using environment variables or secret management tools.
- Rate Limits: Adhere to API usage policies to avoid exceeding request limits. Use caching to minimize repeated calls for the same data.
- Batching Requests: For large datasets, make batched requests to ensure efficient retrieval within rate limits.
- Data Normalization: Standardize data from multiple sources to ensure consistency in units (e.g., W/m² for solar irradiance).
- Set Up API Clients: Use Python libraries like `requests`, `httpx`, or specific SDKs provided by the API services.
- Query Data: Fetch solar and weather data using appropriate endpoints.
- Save Locally: Store retrieved data in formats like CSV, JSON, or cloud-based storage like Zarr.
- Integrate with Pipelines: Use the retrieved data to create features or targets for machine learning models.
- Python Libraries:
  - `requests` or `httpx` for making HTTP requests.
  - `pandas` for data manipulation and storage.
- API Testing Tools:
- Postman or cURL for testing API endpoints before integration.
By effectively leveraging APIs like PVlive and weather services, contributors can build robust datasets for solar forecasting and improve model accuracy.
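The query → normalize → save loop above can be sketched as follows. To keep the example runnable offline, a hypothetical JSON payload stands in for a live API response; the field names are illustrative, not the actual PVlive schema.

```python
import json

import pandas as pd

# Hypothetical payload standing in for an API response (illustrative schema).
raw = (
    '{"meta": ["datetime_gmt", "generation_mw"],'
    ' "data": [["2024-06-01T12:00:00Z", 4800.0],'
    ' ["2024-06-01T12:30:00Z", 5100.0]]}'
)
payload = json.loads(raw)

# Normalize into a tidy DataFrame with consistent column names and parsed times.
df = pd.DataFrame(payload["data"], columns=payload["meta"])
df["datetime_gmt"] = pd.to_datetime(df["datetime_gmt"])

# Save locally for downstream pipeline stages (CSV here; Zarr for large arrays).
df.to_csv("generation.csv", index=False)
print(df)
```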
Data pipelines are crucial for preparing and processing data for machine learning models. In the context of solar forecasting, these pipelines handle multiple data sources and transform them into a format suitable for training and inference.
- OCF Data Sampler
  - A Python library that standardizes data loading and preprocessing
  - Handles both NWP (Numerical Weather Prediction) and PV generation data
  - Provides PyTorch Dataset implementations for efficient batch processing
  - Supports various data formats including Zarr, NetCDF, and CSV
- PVNet Integration
  - Works with OCF's PVNet model for solar forecasting
  - Processes multiple input streams:
    - Weather forecast data (NWP)
    - Historical PV generation data
    - Satellite imagery (when available)
  - Handles data normalization and feature engineering
- Data Loading
  - Efficient loading of large NWP datasets using Zarr format
  - Real-time PV generation data retrieval via the PVLive API
  - Support for multiple weather data sources (Met Office, GFS, etc.)
- Preprocessing Steps
  - Temporal alignment of different data sources
  - Spatial interpolation for matching grid resolutions
  - Feature normalization and standardization
  - Missing data handling and quality checks
- Batch Creation
  - Creation of training/validation/test splits
  - Time-series specific considerations
  - Efficient memory management for large datasets
  - Support for both regional and national-level forecasting
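The time-series-specific considerations in batch creation can be illustrated with a minimal chronological split (a sketch only; the project uses ocf-data-sampler for the real thing). A shuffled split would leak future weather into training, so the split is done in time order:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly target series covering roughly six weeks.
times = pd.date_range("2023-01-01", periods=1000, freq="h")
df = pd.DataFrame({"target": np.random.default_rng(0).random(1000)}, index=times)

# Chronological split: the model must never train on data from its own future.
n = len(df)
train = df.iloc[: int(0.7 * n)]
val = df.iloc[int(0.7 * n) : int(0.85 * n)]
test = df.iloc[int(0.85 * n) :]

print(len(train), len(val), len(test))      # 700 150 150
print(train.index.max() < val.index.min())  # True: no temporal overlap
```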
Benchmarking shows how our models compare to existing solutions. These benchmarks are still in development...
This project will not be limited to the UK forever: we hope to support other regions and data sources in the future. Your help is welcome!
Familiarize yourself with tools like Python and pandas, and with open-source libraries such as the OCF Data Sampler.
Below is a glossary of key terms that might be useful when working on this project:
- Solar Irradiance: The power per unit area received from the Sun in the form of electromagnetic radiation, measured in watts per square meter (W/m²).
- Photovoltaic (PV): A method of generating electricity by converting sunlight directly into electricity using solar panels.
- Solar Panel Efficiency: The ratio of the electrical output of a solar panel to the incident sunlight energy, expressed as a percentage.
- Capacity Factor: The ratio of actual energy produced by a solar PV system to the maximum possible energy it could produce over a given period.
- Global Horizontal Irradiance (GHI): Total solar radiation received on a horizontal surface.
- Global Tilted Irradiance (GTI): The total solar irradiance received on a tilted surface, accounting for direct, diffuse, and reflected components. It is essential for designing and evaluating the performance of tilted photovoltaic systems.
- Direct Normal Irradiance (DNI): Solar radiation received directly from the Sun at a perpendicular angle.
- Diffuse radiation: Refers to solar radiation that reaches the Earth's surface after being scattered by molecules, aerosols, or clouds in the atmosphere.
- Diffuse Horizontal Irradiance (DHI): Solar radiation received indirectly due to scattering in the atmosphere.
- Solar Zenith Angle: The angle between the Sun and a line perpendicular to the Earth's surface; influences solar irradiance.
- Numerical Weather Prediction (NWP): The use of mathematical models to simulate atmospheric processes and predict future weather conditions.
- Gridded Data: Data arranged in a regular, grid-like structure, where each cell or grid point represents a specific geographical area and contains corresponding data values (e.g., temperature, solar irradiance, or wind speed).
- Global Forecast System (GFS): A global NWP model produced by the National Centers for Environmental Prediction (NCEP), a part of the National Oceanic and Atmospheric Administration (NOAA). It provides weather forecasts up to 16 days in advance.
- GFS Documentation
- Tutorial: Understanding GFS Data
- UK Met Office UKV Model: A high-resolution (2km grid spacing) weather forecast model specifically designed for the UK region. The UKV (UK Variable) model uses a variable-resolution grid that allows it to represent UK weather patterns in greater detail than global models. It provides forecasts up to 54 hours ahead with hourly outputs, making it particularly valuable for solar forecasting in the UK.
- Key features include detailed cloud physics, advanced data assimilation, and specific tuning for UK weather patterns
- Variables include cloud cover (at multiple levels), solar radiation, temperature, and wind
- Uses Lambert Azimuthal Equal Area projection centered on the UK
- Met Office Site
- Tutorial: Understanding Met Office data
- European Centre for Medium-Range Weather Forecasts (ECMWF): An independent intergovernmental organization that produces highly accurate medium-range weather forecasts.
- Model Resolution: The spatial and temporal granularity of an NWP model, usually measured in kilometers or degrees.
- Initialization: The process of incorporating current observational data into a model to start a forecast.
- Boundary Conditions: Data input to a weather model defining conditions at the edges of the modeled area.
- ERA5: A global reanalysis dataset produced by ECMWF, providing hourly data on weather, atmospheric conditions, and other variables. It is widely used in climate research and weather model training due to its high spatial and temporal resolution.
- UFS Replay: Historical weather data reanalyzed with the Unified Forecast System (UFS), often used to train or test weather forecasting models.
- ECMWF IFS (Integrated Forecast System): A live numerical weather prediction model from ECMWF, delivering global forecasts for various atmospheric variables.
- ARCO-ERA5: A variant of ERA5 dataset tailored for specific applications, often preprocessed to streamline analysis.
- OCF DWD Archive: Historical weather data from the German Weather Service (DWD), curated by Open Climate Fix for renewable energy and forecasting applications.
- Geostationary: A satellite orbit where the satellite remains fixed relative to a specific point on Earth's surface, providing continuous observation of the same region. Commonly used in weather monitoring and solar radiation measurement.
- Geospatial Data: Information about objects, events, or phenomena on Earth's surface, represented by geographic coordinates and often used in mapping and analysis.
- Lambert Azimuthal Equal Area projection: A map projection that preserves area while minimizing distortion. It's particularly useful for mapping polar regions and is commonly used in meteorological applications, especially for regional weather forecasting. The projection maintains accurate representation of areas while sacrificing some angular relationships.
- Latitude: The angular distance of a location north or south of the equator, measured in degrees. Important for determining solar angles and irradiance.
- Longitude: The angular distance of a location east or west of the prime meridian, measured in degrees. Used in conjunction with latitude to pinpoint geographic locations.
- Spatial Resolution: The level of detail in a geospatial dataset, often defined by the size of the grid cells or pixels representing the data. Higher resolution provides more detail but requires more storage and processing power.
- Temporal Resolution: The frequency at which data is collected or updated over time, important for capturing changes in weather or solar irradiance.
- Digital Elevation Model (DEM): A 3D representation of Earth's surface, showing elevation data. DEMs are used in solar modeling to account for shading and terrain effects.
- Remote Sensing: The acquisition of information about Earth's surface using satellites or aircraft. Remote sensing is critical for gathering data on cloud cover, aerosols, and solar radiation.
- Coordinate Reference System (CRS): A system used to define how geographic data is projected onto a flat surface, ensuring spatial data is accurately mapped and analyzed.
- Raster Data: A type of geospatial data stored in a grid format, where each cell contains a value representing a specific property (e.g., temperature, irradiance, or elevation).
- Vector Data: A type of geospatial data that represents geographic features using points, lines, and polygons, often used for mapping boundaries, roads, and other discrete features.
- Topographic Shading: The effect of terrain features (e.g., mountains, hills) on sunlight exposure, influencing solar irradiance calculations.
- GeoJSON: A format for encoding geographic data structures in JSON, often used for sharing and visualizing geospatial data on the web.
- GIS (Geographic Information System): A system designed to capture, store, manipulate, analyze, and visualize spatial or geographic data, widely used in solar and weather forecasting.
- Great Circle Distance: The shortest distance between two points on a sphere, useful for calculating distances between locations on Earth.
- Solar Declination Angle: The angle between the Sun's rays and the equatorial plane, varying throughout the year and influencing solar irradiance calculations.
- Solar Azimuth Angle: The angle between the Sun's rays and the north-south line on the horizontal plane, used to determine the direction of the Sun's rays.
- Solar Elevation Angle: The angle between the Sun's rays and the horizontal plane, used to determine the height of the Sun above the horizon.
- Solar Zenith Angle: The angle between the Sun's rays and a line perpendicular to the Earth's surface, used to determine the angle of incidence of the Sun's rays on a surface.
- Solar Noon: The time when the Sun reaches its highest point in the sky, occurring when the Sun's zenith angle is minimized.
- Solar Midnight: The time when the Sun reaches its lowest point in the sky, occurring when the Sun's zenith angle is maximized.
- OSGB (Ordnance Survey National Grid Reference System): A system whereby any location in the UK is divided into a geographic grid of squares, with each square given a unique reference of letters and numbers.
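The solar-angle terms above can be tied together numerically. The sketch below uses a common textbook approximation for solar declination (an assumption, not a production solar-position model such as pvlib's) to estimate the Sun's elevation at solar noon:

```python
import math

def solar_declination_deg(day_of_year: int) -> float:
    """Approximate solar declination (degrees); textbook formula, ~1° accurate."""
    return -23.44 * math.cos(math.radians(360.0 / 365.0 * (day_of_year + 10)))

def noon_elevation_deg(latitude_deg: float, day_of_year: int) -> float:
    """Elevation at solar noon = 90° minus the zenith angle |lat - declination|."""
    return 90.0 - abs(latitude_deg - solar_declination_deg(day_of_year))

# London (~51.5°N) near the June solstice (day 172): Sun climbs to roughly 62°.
print(round(noon_elevation_deg(51.5, 172), 1))
```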
- Albedo: The reflectivity of a surface, important for understanding how much sunlight is absorbed or reflected by the Earth. Surfaces with high albedo, such as snow and ice, reflect more sunlight, while darker surfaces absorb more.
- Aerosols: Tiny particles or liquid droplets suspended in the atmosphere that affect solar radiation by scattering or absorbing sunlight. They play a significant role in cloud formation and can influence local and global temperatures.
- Greenhouse Gases: Atmospheric gases, such as carbon dioxide (CO₂), methane (CH₄), and water vapor (H₂O), that trap heat in the Earth's atmosphere and contribute to global warming.
- Atmospheric Pressure: The force exerted by the weight of the atmosphere above a given point, measured in hectopascals (hPa) or millibars (mb). It affects weather patterns and the movement of air masses.
- Relative Humidity: The amount of water vapor in the air compared to the maximum amount the air can hold at a given temperature, expressed as a percentage. It influences cloud formation and precipitation.
- Dew Point: The temperature at which air becomes saturated with moisture and water vapor condenses into dew, clouds, or fog.
- Radiative Forcing: The change in the energy balance of the Earth's atmosphere due to factors like greenhouse gases, aerosols, and changes in solar irradiance. It is a key concept in climate change studies.
- Turbidity: A measure of the atmosphere's clarity, influenced by aerosols, dust, and pollution. High turbidity reduces the amount of solar radiation reaching the Earth's surface.
- Ozone Layer: A layer of ozone (O₃) in the stratosphere that absorbs the majority of the Sun's harmful ultraviolet radiation. Changes in the ozone layer can impact solar irradiance measurements.
- Wind Shear: A change in wind speed or direction over a short distance in the atmosphere. It can influence cloud formation, storm development, and the dispersal of aerosols.
- Thermal Inversion: A phenomenon where a layer of warm air traps cooler air near the Earth's surface, preventing vertical mixing. It can lead to increased pollution and reduced solar irradiance at the surface.
- Cloud Cover: The fraction of the sky obscured by clouds, typically expressed as a percentage.
- Spatial Homogeneity in cloud classification refers to the uniformity or consistency in the structure and appearance of clouds over a given spatial area. When clouds exhibit spatial homogeneity, they appear relatively uniform in terms of their texture, brightness, and structure across the entire observed area.
- The World Meteorological Organization (WMO) classifies clouds into the following types based on their appearance and altitude. These are grouped into three main altitude categories: high, middle, and low clouds.
- Cirrus (Ci): Wispy, hair-like clouds composed of ice crystals, often indicating fair weather.
- Cirrostratus (Cs): Thin, veil-like clouds covering the sky, often producing a halo around the Sun or Moon.
- Cirrocumulus (Cc): Small, white, patchy clouds without shading, often arranged in rows or ripples.
- Altostratus (As): Gray or blue-gray clouds covering the sky, usually associated with continuous rain or snow.
- Altocumulus (Ac): White or gray clouds in patches or layers, often with shading, signaling changing weather.
- Stratus (St): Uniform gray clouds covering the entire sky, often producing drizzle or mist.
- Stratocumulus (Sc): Low, lumpy clouds, typically covering most of the sky with breaks of blue.
- Nimbostratus (Ns): Thick, dark clouds producing steady precipitation and obscuring the Sun.
- Cumulus (Cu): Fluffy, white clouds with a flat base, often indicating fair weather when small.
- Cumulonimbus (Cb): Towering thunderstorm clouds with an anvil-shaped top, capable of producing heavy rain, lightning, and severe weather.
- Contrails (Ct): Man-made clouds formed by aircraft exhaust.
- Pyrocumulus: Clouds formed by intense heat, such as from wildfires or volcanic eruptions.
For more details, refer to the WMO Cloud Identification Guide.
- Ground Truth: Data collected on-site (e.g., solar generation data from PV systems) used to validate predictions.
- Weather Station: A facility with instruments and equipment to measure atmospheric conditions such as temperature, humidity, and wind speed.
- Time Series Data: Data points collected or recorded at time-ordered intervals, often used in forecasting.
- Reanalysis Data: A blend of observational data and model output to create a consistent historical record of atmospheric variables. Examples include ERA5 and UFS Replay datasets.
- Gridded Data: Data represented in a regular grid structure, where each cell corresponds to a specific geographic area. Common in weather and climate datasets for efficient analysis.
- NetCDF (Network Common Data Form): A self-describing, machine-independent data format designed for storing and sharing array-oriented scientific data. Commonly used in meteorology, oceanography, and other geosciences, NetCDF supports large datasets and includes metadata for describing the data's structure and meaning.
- HDF5 (Hierarchical Data Format version 5): A versatile data model that supports the storage of large, complex datasets in a hierarchical structure. HDF5 is widely used for scientific computing, offering high performance, scalability, and the ability to handle large amounts of data efficiently.
- GRIB (GRIdded Binary): A concise data format commonly used in meteorology to store historical and forecast weather data. GRIB files are optimized for large datasets, representing grid-based information such as temperature, wind speed, and precipitation. They are often used by Numerical Weather Prediction (NWP) models and are known for their efficient storage and compression.
- CF (Climate and Forecast) Conventions: The CF (Climate and Forecast) Convention standards are a set of metadata conventions designed to promote the processing and sharing of climate and forecast data, particularly for gridded data (e.g., netCDF, Zarr, or other array-based formats). Developed by the climate and weather modeling communities, these standards ensure that data files are self-describing, interoperable, and usable across different software tools and research groups.
- Standard Names: CF Conventions define a controlled vocabulary of standard names that precisely describe physical quantities. For example:
  - `air_temperature`: The bulk temperature of the air, suitable for air temperature measurements
  - `surface_temperature`: Temperature of the surface of the Earth
  - `surface_downwelling_shortwave_flux_in_air`: Total solar radiation reaching the Earth's surface
- Units: CF Conventions require the use of standard units and provide methods for describing how to derive units.
- Dimensions: The conventions specify how to identify coordinate variables and auxiliary coordinate variables.
- Metadata: Includes attributes for describing the data's content, quality, and provenance.
- For a complete list of CF standard names, visit the CF Standard Names Table
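A minimal sketch of what CF-style metadata looks like in practice, attached to an `xarray` DataArray (the data values are illustrative; the standard_name comes from the CF table and the units string follows UDUNITS syntax):

```python
import numpy as np
import xarray as xr

# Toy hourly GHI series carrying CF-compliant metadata.
ghi = xr.DataArray(
    np.array([0.0, 250.0, 610.0]),
    dims=["time"],
    coords={
        "time": np.array(
            ["2024-06-01T10", "2024-06-01T11", "2024-06-01T12"],
            dtype="datetime64[ns]",
        )
    },
    attrs={
        "standard_name": "surface_downwelling_shortwave_flux_in_air",
        "units": "W m-2",
        "long_name": "Global horizontal irradiance",
    },
)
print(ghi.attrs["standard_name"], ghi.attrs["units"])
```

Because the metadata travels with the array, downstream tools can find "the GHI variable" by its standard_name rather than guessing from dataset-specific column names.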
- Watt (W): The basic unit of power in the International System of Units (SI), representing one joule per second.
- Kilowatt (kW): Equal to 1,000 watts, commonly used to measure the capacity of small solar systems.
- Megawatt (MW): Equal to 1,000 kilowatts or one million watts, used for larger solar farms and power plants.
- Gigawatt (GW): Equal to 1,000 megawatts or one billion watts, used to represent national or regional energy capacities.
- Terawatt (TW): Equal to 1,000 gigawatts or one trillion watts, often used for global energy capacity.
- Petawatt (PW): Equal to 1,000 terawatts or one quadrillion watts, applicable for global-scale discussions.
- Energy vs. Power:
  - Power: The rate at which energy is produced or consumed, typically measured in watts (W).
  - Energy: The total amount of work performed over time, measured in watt-hours (Wh), kilowatt-hours (kWh), etc.
- Capacity Factor: The ratio of actual energy produced by a system to the maximum possible energy it could produce over a given period.
- Irradiance vs. Insolation:
  - Irradiance: The power of solar radiation received per unit area (W/m²).
  - Insolation: The total energy received over a given time, typically measured in kilowatt-hours per square meter (kWh/m²).
- PV Capacity: The maximum amount of power that a photovoltaic system can generate under ideal conditions, often measured in megawatts (MW).
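A toy numeric check of the irradiance/insolation and capacity-factor definitions above (all numbers made up for illustration):

```python
# Irradiance -> insolation: integrate power over time.
# Assume a constant 500 W/m^2 over 6 daylight hours.
irradiance_w_m2 = 500.0
hours = 6.0
insolation_kwh_m2 = irradiance_w_m2 * hours / 1000.0  # Wh -> kWh
print(insolation_kwh_m2)  # → 3.0 kWh/m^2

# Capacity factor: a 1 MW plant producing 4.8 MWh over a 24 h day.
energy_mwh = 4.8
capacity_mw = 1.0
capacity_factor = energy_mwh / (capacity_mw * 24.0)
print(round(capacity_factor, 3))  # → 0.2
```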
- Grid Supply Point (GSP): A location where electricity is transferred from the transmission network to the distribution network, serving as a critical node in the power grid.
- GSP IDs in PVlive:
- Retrieve numerical GSP IDs from PV_Live: PV_Live API
  - Get a list of GSP IDs:

    ```python
    from pvlive_api import PVLive

    pvl = PVLive()
    pvl.gsp_ids
    ```
- Find the official NESO GSP region codes: NESO Data Portal
- Cross-reference these to create your own mapping.
- Peak Load: The maximum power demand in a given period, often used to assess system capacity and ensure reliability during high-demand times.
- Base Load: The minimum level of demand on an electrical grid over a 24-hour period. It is typically supplied by reliable, continuous sources like nuclear or coal-fired power plants.
- Load Factor: The ratio of the average load over a given period to the peak load during that same period, indicating the efficiency of grid usage.
- Frequency Regulation: The process of maintaining the grid's operating frequency (e.g., 50 Hz in Europe or 60 Hz in the US) within acceptable limits to ensure stable power delivery.
- Distributed Energy Resources (DERs): Small-scale power generation or storage units, such as rooftop solar panels and batteries, connected to the distribution network.
- Interconnection: The linkage of two or more electricity systems to enable power exchange and enhance reliability.
- Curtailment: The reduction of power output from renewable energy sources (e.g., solar or wind) due to oversupply or grid limitations.
- Net Metering: A billing mechanism that allows consumers who generate their own electricity (e.g., via rooftop solar panels) to send excess power back to the grid in exchange for credits on their electricity bill.
- Smart Grid: An advanced electricity network that uses digital technology to monitor and manage the flow of electricity, improving efficiency and reliability.
- Reactive Power: Power that oscillates between the source and the load, necessary for maintaining voltage levels in the grid.
- Transformer: A device used in the grid to step up or step down voltage levels for efficient power transmission and distribution.
- Grid Resilience: The ability of the power grid to recover quickly from disruptions, such as natural disasters or cyberattacks.
- Voltage Drop: A reduction in voltage as electricity travels through transmission and distribution lines, influenced by the resistance of the lines and the distance from the source.
- Power Factor: A measure of how effectively electrical power is converted into useful work output. A power factor of 1 indicates maximum efficiency.
- Black Start: The process of restoring the grid after a complete shutdown, using backup power sources to start key components.
- Feature Engineering: The process of selecting, modifying, and transforming raw data into features suitable for machine learning models.
- Train-Test Split: Dividing data into training and testing subsets to evaluate model performance.
- Validation: Using a subset of data to tune model hyperparameters and prevent overfitting.
- Cross-Validation: A technique for assessing model performance by splitting data into multiple training and testing subsets.
- Mean Absolute Error (MAE): A common metric for evaluating forecasting accuracy by measuring the average magnitude of errors. Lower values indicate better model performance.
- Root Mean Square Error (RMSE): Measures the standard deviation of prediction errors, giving higher weight to large errors. Lower RMSE values indicate better performance.
- R² Score (Coefficient of Determination): Indicates how well the model predictions approximate the real data. An R² of 1 means perfect prediction; values closer to 0 indicate weaker predictive performance.
- Mean Bias Error (MBE): Measures the average bias in model predictions, indicating whether predictions are systematically overestimating or underestimating.
- Precision: In classification tasks, the ratio of true positive predictions to the total predicted positives. Indicates the accuracy of positive predictions.
- Recall (Sensitivity): The ratio of true positives to the actual positives in the dataset. Measures the ability to capture all relevant instances.
- F1 Score: The harmonic mean of precision and recall, providing a balanced metric for evaluating model performance in classification.
- Cost Function: A function that measures the difference between the predicted and actual values, guiding the optimization process. Examples include:
- Mean Squared Error (MSE): Commonly used in regression tasks.
- Cross-Entropy Loss: Frequently used in classification tasks.
- Gradient Descent: An optimization algorithm used to minimize the cost function by iteratively adjusting the model parameters in the direction of the steepest descent.
- Learning Rate: A hyperparameter that determines the step size during gradient descent. A smaller rate ensures convergence but may slow down training.
- Stochastic Gradient Descent (SGD): A variation of gradient descent where a single data point or a small batch is used to compute gradients, making the optimization process faster.
- Adam Optimizer: An advanced optimization algorithm combining the benefits of momentum and adaptive learning rates for efficient gradient descent.
- Regularization: Techniques to prevent overfitting by adding a penalty to the cost function. Examples:
- L1 Regularization (Lasso): Encourages sparsity in the model by shrinking less important coefficients to zero.
- L2 Regularization (Ridge): Penalizes large coefficients to make the model more generalizable.
- Masking: A technique used to ignore certain parts of input data during model training or inference. For example, in transformers, masking ensures that certain tokens or parts of the input sequence are not considered.
- Overfitting: A situation where the model performs well on the training data but fails to generalize to unseen data due to excessive complexity.
- Underfitting: Occurs when a model is too simple to capture the underlying patterns in the data, resulting in poor performance on both training and testing datasets.
- Early Stopping: A technique to prevent overfitting by halting training when the model's performance on the validation set stops improving.
- Batch Size: The number of samples processed before the model updates its parameters during training.
- Convolutional Neural Network (CNN): A type of neural network designed to process data with a grid-like structure, such as images. CNNs are particularly useful for image classification, segmentation, and pattern detection.
- Transformer: A deep learning architecture that relies on self-attention mechanisms, widely used in natural language processing and increasingly in computer vision tasks.
- Supervised Learning: A machine learning paradigm where the model learns from labeled data, meaning each input has a corresponding output.
- Self-Supervised Learning: A learning paradigm where the model generates labels or tasks from the data itself, enabling training without explicit human-labeled data.
- Reinforcement Learning (RL): A learning paradigm where an agent learns to make decisions by interacting with an environment and receiving rewards or penalties based on its actions.
- Confusion Matrix: A table used in classification tasks to visualize the performance of a model by showing true positives, true negatives, false positives, and false negatives.
- Loss Curve: A plot showing the value of the cost function during training to monitor convergence and detect overfitting.
- Learning Curve: A plot showing model performance (e.g., accuracy, loss) against training progress, often used to assess whether a model is underfitting or overfitting.
- Grid Search: An exhaustive search over a manually specified set of hyperparameter values to find the optimal configuration.
- Random Search: A more efficient method that randomly samples hyperparameter values to find good configurations.
- Bayesian Optimization: A probabilistic approach to hyperparameter tuning that models the relationship between hyperparameters and model performance to find optimal settings.
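The regression metrics defined above (MAE, RMSE, R²) are only a few lines of `numpy` each. The arrays below are made-up illustrative values, not project data:

```python
import numpy as np

# Hypothetical example: actual vs. predicted solar generation (MW)
y_true = np.array([10.0, 12.0, 8.0, 15.0])
y_pred = np.array([11.0, 11.5, 9.0, 14.0])

mae = np.mean(np.abs(y_true - y_pred))            # average error magnitude
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))   # penalizes large errors more
ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2 = 1 - ss_res / ss_tot                          # 1.0 = perfect prediction

print(f"MAE={mae:.3f}, RMSE={rmse:.3f}, R2={r2:.3f}")
```

In practice you would use `sklearn.metrics` (`mean_absolute_error`, `mean_squared_error`, `r2_score`) rather than hand-rolling these, but the formulas are worth seeing once.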
- Black: A Python code formatter used for maintaining consistent code style.
- Ruff: A fast Python linter that helps enforce coding standards.
- Pre-commit Hooks: Tools that automatically check or modify code before it is committed to a repository.
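These three tools are typically wired together through a `.pre-commit-config.yaml` at the repository root. The sketch below shows the general shape; the hook versions are illustrative, not the project's pinned ones — check the repo's actual config file:

```yaml
repos:
  - repo: https://github.com/psf/black
    rev: 24.1.1          # illustrative version; use the project's pinned rev
    hooks:
      - id: black
        args: [--line-length, "100"]
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.3.0          # illustrative version
    hooks:
      - id: ruff
        args: [--fix]
```

After `pip install pre-commit`, run `pre-commit install` once; the hooks then run automatically on every commit.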
Contributing to this project doesn't require expertise in all areas 😅. We need volunteers with skills or interest in some of the following domains to help us build different parts of the project. Plus, there's plenty of opportunity to learn as you go!
- Python Programming: Familiarity with Python for data analysis, APIs, and machine learning workflows.
- Version Control: Experience with Git and GitHub for collaboration and maintaining code quality.
- Data Formats: Understanding of data formats like Zarr, JSON, and CSV, and how to interact with them programmatically.
- Data Acquisition: Experience in retrieving large-scale datasets from APIs, cloud storage, or public repositories.
- Data Pipelines: Building and maintaining pipelines for data transformation, cleaning, and preparation.
- Cloud Storage: Expertise in managing data in cloud-optimized formats like Zarr, NetCDF, or HDF5.
- Database Management: Familiarity with databases for large-scale data, such as PostgreSQL, and working with vector databases for embeddings.
- Performance Optimization: Skills in improving data retrieval and processing speed, especially with gridded or geospatial data.
- Data Manipulation: Proficiency with libraries like `pandas`, `numpy`, or `xarray` for transforming and analyzing datasets.
- Data Visualization: Ability to use tools like `matplotlib` or `seaborn` to interpret and present data insights.
- Numerical Weather Prediction (NWP) Data: Familiarity with gridded datasets and their use in weather and solar forecasting.
- Model Training: Understanding the basics of training machine learning models, including feature engineering, train-test splitting, and evaluation.
- Evaluation Metrics: Knowledge of metrics like MAE, RMSE, and R² for assessing model performance.
- Neural Network Architectures: Familiarity with models like CNNs and transformers, or a willingness to learn.
- Weather Forecasting: Understanding of Numerical Weather Prediction (NWP) models, weather patterns, and atmospheric science.
- Solar Energy: Knowledge of solar irradiance, photovoltaic systems, and energy metrics like capacity factor or GHI/DNI.
- Climate Science: Familiarity with climate datasets, terms like albedo and aerosols, and their implications for solar energy forecasting.
- Geospatial Analysis: Proficiency in working with geospatial data, coordinate systems, and tools like GIS.
- APIs: Experience working with APIs to retrieve data (e.g., PVlive, OpenWeatherMap, CAMS).
- Cloud Storage: Basic understanding of cloud-optimized formats like Zarr for managing large datasets.
- Geospatial Data: Understanding of concepts like gridded data, coordinate reference systems (CRS), and GIS tools.
- Climate and Atmospheric Science: Knowledge of terms like albedo, aerosols, and solar irradiance, or a willingness to learn their significance.
- Open Source Development: Willingness to collaborate in an open-source environment, including code reviews and documentation updates.
- Documentation: Ability to write clear and concise documentation for code and processes to support other contributors.
This project values both existing expertise and a learner's mindset. Contributors who are eager to learn and apply new skills are highly encouraged to join!
The transition to renewable energy is one of the most significant challenges—and opportunities—of our time. Solar energy is at the forefront of this movement, offering a clean, abundant, and sustainable alternative to fossil fuels. This project aims to accelerate that transition by improving the accuracy and accessibility of solar forecasting, empowering communities, businesses, and governments to make smarter energy decisions.
- Enhancing Grid Reliability: Accurate solar forecasting helps balance energy supply and demand, reducing reliance on fossil fuel backups and preventing grid instability.
- Maximizing Solar Potential: By predicting solar generation more effectively, we can make better use of installed solar capacity and encourage further adoption of solar technology.
- Empowering Decision-Making: Solar forecasts provide critical insights for energy planners, operators, and consumers, enabling smarter choices in energy storage, distribution, and usage.
- Global Impact: As we expand this project to other regions, we contribute to a worldwide effort to reduce carbon emissions and mitigate climate change.
This project isn't just about technology—it's about building a community of innovators, engineers, and visionaries who share a passion for sustainability. Together, we can:
- Support the global transition to clean energy!
- Create open-source tools that benefit everyone, from local communities to international organizations!
- Inspire others to join the renewable energy movement!
By contributing to this project, you're not just writing code or analyzing data—you're making a meaningful impact on the planet and helping pave the way to a sustainable future for many generations to come. 🌍✨
This guide outlines how to maintain code quality and test functionality using ruff, black, and pytest.
To ensure consistent coding style and catch common issues, use the following tools:
- Ruff: A fast Python linter.
  - Check for linting errors:

    ```bash
    ruff check .
    ```

  - Automatically fix errors where possible:

    ```bash
    ruff check . --fix
    ```

- Black: A Python code formatter.
  - Format your code with Black:

    ```bash
    black .
    ```
Use pytest to ensure the project works as expected:
- Run all tests:

  ```bash
  pytest
  ```

- Run specific tests:

  ```bash
  pytest tests/test_main.py
  ```
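If you are new to pytest, a test is simply a function whose name starts with `test_` that makes assertions. The file and function below are hypothetical, purely to show the shape:

```python
# tests/test_example.py (hypothetical illustration, not an actual project test)

def clip_forecast(values, capacity):
    """Clip forecast values to the [0, capacity] range."""
    return [min(max(v, 0.0), capacity) for v in values]


def test_clip_forecast():
    # Negative values are floored at 0; values above capacity are capped.
    assert clip_forecast([-1.0, 5.0, 20.0], capacity=10.0) == [0.0, 5.0, 10.0]
```

Running `pytest` from the project root discovers and runs any such `test_*` functions automatically.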
The open-data-pvnet CLI provides various commands for downloading, processing, and loading weather and solar data.
```bash
open-data-pvnet <provider> <operation> [options]
```

Providers:

- `metoffice`: UK Met Office weather data
- `gfs`: Global Forecast System data
- `dwd`: German Weather Service data (coming soon)
- `archive`: Download and archive data

  ```bash
  # Archive a single hour
  open-data-pvnet metoffice archive --year 2023 --month 12 --day 1 --hour 12 --region uk

  # Archive an entire day with parallel processing
  open-data-pvnet metoffice archive --year 2023 --month 12 --day 1 --region uk --workers 4
  ```

- `load`: Load archived data for analysis

  ```bash
  # Load a single hour
  open-data-pvnet metoffice load --year 2023 --month 1 --day 16 --hour 0 --region uk

  # Load an entire day
  open-data-pvnet metoffice load --year 2023 --month 1 --day 16 --region uk

  # Load with custom chunking
  open-data-pvnet metoffice load --year 2023 --month 1 --day 16 --region uk \
      --chunks "time:24,latitude:100,longitude:100"
  ```
- `--region`: Specify data region (`uk` or `global`) for Met Office data
- `--overwrite`: Force overwrite of existing files
- `--remote`: Load data remotely without downloading
- `--chunks`: Specify chunking for data loading
- `--workers`: Number of parallel workers for archiving (default: 1)
- `--archive-type`: Type of archive to create (`zarr.zip` or `tar`)
```bash
# Download Met Office UK weather data
open-data-pvnet metoffice archive --year 2023 --month 12 --day 1 --region uk --workers 2

# Load and analyze the data
open-data-pvnet metoffice load --year 2023 --month 12 --day 1 --region uk

# Load data directly from HuggingFace without downloading
open-data-pvnet metoffice load --year 2023 --month 1 --day 16 --region uk --remote
```

Common error messages and their solutions:
- "No datasets found": Check if the specified date has available data
- "Error loading dataset": Verify your internet connection and credentials
- "Invalid chunks specification": Ensure chunk string follows the format "dim1:size1,dim2:size2"
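A chunk string such as `"time:24,latitude:100,longitude:100"` maps to the dimension-to-size dictionary that `xarray`/Zarr expect. A minimal sketch of that parsing — the helper name is ours for illustration, not the CLI's internal implementation:

```python
def parse_chunks(chunk_string):
    """Turn "dim1:size1,dim2:size2" into {"dim1": size1, "dim2": size2}."""
    chunks = {}
    for pair in chunk_string.split(","):
        dim, _, size = pair.partition(":")
        if not dim.strip() or not size.strip().isdigit():
            raise ValueError(f"Invalid chunks specification: {pair!r}")
        chunks[dim.strip()] = int(size)
    return chunks


print(parse_chunks("time:24,latitude:100,longitude:100"))
# {'time': 24, 'latitude': 100, 'longitude': 100}
```

A dictionary of this shape can then be passed as the `chunks=` argument to `xarray.open_zarr` or `xarray.open_dataset`.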
- Update the configuration file

  Go to `src/open_data_pvnet/configs/PVNet_configs/datamodule/streamed_batches.yaml` and change the values if desired (increase at your discretion):

  ```yaml
  num_train_samples: 5
  num_val_samples: 5
  ```

- Update `src/open_data_pvnet/configs/PVNet_configs/datamodule/premade_batches.yaml`

  Change this line to:

  ```yaml
  configuration: <your_directory...open-data-pvnet/src/open_data_pvnet/configs/PVNet_configs/datamodule/configuration/example_configuration.yaml>
  ```

- Update `src/open_data_pvnet/configs/PVNet_configs/config.yaml`

  Change the line to:

  ```yaml
  - datamodule: premade_batches.yaml
  ```

- Open a Weights & Biases account at https://wandb.ai/

  Go to `src/open_data_pvnet/configs/PVNet_configs/logger/wandb.yaml` and change:

  ```yaml
  project: "GFS_TEST_RUN"
  save_dir: "GFS_TEST_RUN"
  ```

- Run the samples

  We recommend saving the samples locally for faster processing. In your main `open-data-pvnet` directory, run the following commands (assumes the AWS CLI is installed locally):

  ```bash
  aws s3 sync s3://ocf-open-data-pvnet/data/gfs/v4/2023.zarr/ ./gfs_2023.zarr --no-sign-request
  aws s3 sync s3://ocf-open-data-pvnet/data/uk/pvlive/v2/combined_2023_gsp.zarr ./gsp_2023.zarr --no-sign-request
  ```

  In `example_configuration.yaml`, change the `zarr_path` attributes to the local paths you created above, and comment out both of the `public: True` lines. (If you are going to use the actual S3 buckets, leave them alone; however, this may be really slow.)

  In `streamed_batches.yaml`, change the line `configuration: null` to the actual path of your `example_configuration.yaml` file. Then run:

  ```bash
  rm -rf GFS_samples PLACEHOLDER  # to remove previous sample runs
  python src/open_data_pvnet/scripts/save_samples.py
  ```

- Run the training

  Go to `config.yaml` and change the line `- datamodule: streamed_batches.yaml` to `- datamodule: premade_batches.yaml`, then run:

  ```bash
  python run.py
  ```
Thank you for joining us on this journey to advance solar forecasting and renewable energy solutions!