98 changes: 98 additions & 0 deletions docs/france_readme.md
@@ -0,0 +1,98 @@
## France Solar Data Pipeline for PVNet
This contribution adds support for French solar generation data from RTE to the project.

## Changes
- Added France data processing script
- Created admin region metadata CSV
- Updated data pipeline to use integer location_ids
- Added inspection script for validation

## Data API
The definitive datasets follow this URL format:
https://eco2mix.rte-france.com/download/eco2mix/eCO2mix_RTE_{Region}_Annuel-Definitif_{Year}.zip

The consolidated datasets follow this URL format:
https://eco2mix.rte-france.com/download/eco2mix/eCO2mix_RTE_{Region}_En-cours-Consolide.zip
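
A minimal sketch of how the two URL patterns above might be filled in. The function names are illustrative and not part of this PR, and the region string is only an example value; the exact spellings RTE expects for `{Region}` are not stated here.

```python
BASE_URL = "https://eco2mix.rte-france.com/download/eco2mix"


def definitive_url(region: str, year: int) -> str:
    """Build the URL of the definitive (annual) archive for one region and year."""
    return f"{BASE_URL}/eCO2mix_RTE_{region}_Annuel-Definitif_{year}.zip"


def consolidated_url(region: str) -> str:
    """Build the URL of the in-progress consolidated archive for one region."""
    return f"{BASE_URL}/eCO2mix_RTE_{region}_En-cours-Consolide.zip"


# "Bretagne" is used purely as an example value for {Region}.
print(definitive_url("Bretagne", 2020))
print(consolidated_url("Bretagne"))
```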

Note that TCH (taux de charge, the load factor), i.e. actual production relative to installed solar capacity, is only available from 2020 onwards. We therefore start with five years of data, 2020 to 2024.
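
If TCH is read as generation divided by installed capacity, the capacity can be recovered from the other two values. The sketch below works under that assumption and also assumes TCH is expressed as a percentage; `capacity_from_tch` is a hypothetical helper, not a function from this PR.

```python
import math


def capacity_from_tch(generation_mw: float, tch_percent: float) -> float:
    """Estimate installed capacity (MWp) from generation (MW) and load factor (%).

    Returns NaN when the load factor is zero, negative or NaN, because capacity
    cannot be inferred from a timestep with no production (e.g. at night).
    """
    if math.isnan(tch_percent) or tch_percent <= 0:
        return math.nan
    return generation_mw / (tch_percent / 100.0)


# 250 MW produced at a 25 % load factor implies 1000 MWp installed.
print(capacity_from_tch(250.0, 25.0))
```

Night-time steps return NaN here, so a real pipeline would likely need to fill capacity from neighbouring daytime values.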

## Summer Time Behavior
When transitioning to summer time (e.g. 26 Mar 2023, 02:00 to 03:00), entries between 02:00 and 03:00 are duplicated.
When transitioning back to winter time (e.g. 29 Oct 2023, 03:00 to 02:00), the repeated hour makes the entries ambiguous, and 2 timesteps will be missing.
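
One way to surface both effects is to drop repeated timestamps and then compare against a regular half-hourly grid. This is a sketch only: keeping a single copy of each duplicated entry is an assumption about how the pipeline resolves them, and `dedupe_and_find_gaps` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def dedupe_and_find_gaps(timestamps, step=timedelta(minutes=30)):
    """Return (unique sorted timestamps, timestamps missing from the regular grid)."""
    unique = sorted(set(timestamps))
    if not unique:
        return [], []
    present = set(unique)
    missing = []
    current = unique[0]
    # Walk the regular grid between the first and last timestamp.
    while current <= unique[-1]:
        if current not in present:
            missing.append(current)
        current += step
    return unique, missing


# Toy series: 02:00 appears twice, and 03:00 and 03:30 are absent.
raw = [
    datetime(2023, 1, 1, 1, 30),
    datetime(2023, 1, 1, 2, 0),
    datetime(2023, 1, 1, 2, 0),
    datetime(2023, 1, 1, 2, 30),
    datetime(2023, 1, 1, 4, 0),
]
unique, missing = dedupe_and_find_gaps(raw)
print(len(unique), missing)
```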

## Zarr File
The converted Zarr file is available on Hugging Face:
https://huggingface.co/datasets/hhhn2/France_PV_data

## Testing
- Ran process_france_data.py successfully
- Validated output with inspect_france_training_pipeline.py

## Data Processing Results
### Data Quality

Generation (MW):
- Shape: (12, 87696)
- Range: [0.00, 4002.00] MW
- Mean: 174.98 MW
- NaN count: 120 (0.01%)

Capacity (MWp):
- Shape: (12, 87696)
- Range: [122.70, 6000.00] MWp
- Mean: 1170.15 MWp
- NaN count: 0 (0.00%)
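
The reported shape and NaN percentage can be sanity-checked with a little arithmetic. The sketch assumes the series covers 1 Jan 2020 through 31 Dec 2024 at half-hourly resolution, which is consistent with the figures reported here but not stated explicitly.

```python
from datetime import date

STEPS_PER_DAY = 48  # half-hourly resolution
N_REGIONS = 12

# 2020 and 2024 are leap years, so the span is 1827 days.
days = (date(2025, 1, 1) - date(2020, 1, 1)).days
n_steps = days * STEPS_PER_DAY

# 120 NaN values spread over every (region, timestep) cell.
nan_fraction = 120 / (N_REGIONS * n_steps)

print(days, n_steps, f"{nan_fraction:.2%}")  # 1827 87696 0.01%
```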

### Per-Region Statistics

| Region | Generation range (MW) | Generation mean (MW) | Generation NaN | Capacity (MWp) | Capacity NaN |
| --- | --- | --- | --- | --- | --- |
| 0 | [0.0, 2194.0] | 238.2 | 0.0% | 1655.3 | 0.0% |
| 1 | [0.0, 883.0] | 78.7 | 0.0% | 537.8 | 0.0% |
| 2 | [0.0, 568.0] | 49.3 | 0.0% | 364.3 | 0.0% |
| 3 | [0.0, 975.0] | 97.5 | 0.0% | 665.7 | 0.0% |
| 4 | [0.0, 1337.0] | 134.2 | 0.0% | 998.7 | 0.0% |
| 5 | [0.0, 629.0] | 49.0 | 0.0% | 361.4 | 0.0% |
| 6 | [0.0, 306.0] | 26.8 | 0.0% | 218.7 | 0.0% |
| 7 | [0.0, 464.0] | 32.1 | 0.0% | 247.4 | 0.0% |
| 8 | [0.0, 4002.0] | 534.9 | 0.0% | 3524.9 | 0.0% |
| 9 | [0.0, 3287.0] | 438.6 | 0.0% | 2799.3 | 0.0% |
| 10 | [0.0, 1213.0] | 118.8 | 0.0% | 860.4 | 0.0% |
| 11 | [0.0, 1942.0] | 301.8 | 0.0% | 1807.9 | 0.0% |

## Next Steps
- Build a NOAA GFS pipeline for France
- Compare against a baseline model
1 change: 0 additions & 1 deletion scripts/generation/fetch_pvlive_data.py
@@ -1,7 +1,6 @@
from pvlive_api import PVLive
import logging


logger = logging.getLogger(__name__)


15 changes: 9 additions & 6 deletions scripts/generation/generate_combined_gsp.py
@@ -33,12 +33,13 @@

from scripts.generation.fetch_pvlive_data import PVLiveData

-logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
+logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")


def main(
start_year: int = typer.Option(2020, help="Start year for data collection"),
end_year: int = typer.Option(2025, help="End year for data collection"),
-output_folder: str = typer.Option("data", help="Output folder for the zarr dataset")
+output_folder: str = typer.Option("data", help="Output folder for the zarr dataset"),
):
"""
Generate combined GSP data for all GSPs and save as a zarr dataset.
@@ -51,15 +52,15 @@ def main(
all_dataframes = []

# Changed range to start from 0 to include gsp_id=0
-for gsp_id in range(0, 319):
+for gsp_id in range(0, 319):
logging.info(f"Processing GSP ID {gsp_id}")
df = data_source.get_data_between(
start=range_start,
end=range_end,
entity_id=gsp_id,
-extra_fields="capacity_mwp,installedcapacity_mwp"
+extra_fields="capacity_mwp,installedcapacity_mwp",
)

if df is not None and not df.empty:
# Add gsp_id column to the dataframe
df["gsp_id"] = gsp_id
@@ -87,7 +88,9 @@ def main(
xr_pv.to_zarr(output_path, mode="w", consolidated=True)

logging.info(f"Successfully saved combined GSP dataset to {output_path}")
-logging.info(f"Dataset contains GSPs 0-318 for period {range_start.date()} to {range_end.date()}")
+logging.info(
+f"Dataset contains GSPs 0-318 for period {range_start.date()} to {range_end.date()}"
+)


if __name__ == "__main__":
@@ -0,0 +1,117 @@
general:
description: Configuration for training PVNet on France solar PV data with GFS data
name: france_pvnet_config

input_data:
# # Either use Site OR GSP configuration
# site:
# # Path to Site data in NetCDF format
# file_path: PLACEHOLDER.nc
# # Path to metadata in CSV format
# metadata_file_path: PLACEHOLDER.csv
# time_resolution_minutes: 15
# interval_start_minutes: -60
# # Specified for intraday currently
# interval_end_minutes: 480
# dropout_timedeltas_minutes: null
# dropout_fraction: 0 # Fraction of samples with dropout

# France solar PV generation data
# Uses 'generation:' (generic PV data) not 'gsp:' (UK-specific Grid Supply Point)
generation:
# Path to generation data in zarr format -- update to S3 path when uploaded
# hugging face: https://huggingface.co/datasets/hhhn2/France_PV_data/tree/main
zarr_path: "./data/fra/france_pv_generation.zarr"
interval_start_minutes: -60
interval_end_minutes: 480
time_resolution_minutes: 30 # half hourly resolution
# Random value from the list below will be chosen as the delay when dropout is used
# If set to null no dropout is applied. Only values before t0 are dropped out for GSP.
# Values after t0 are assumed as targets and cannot be dropped.
dropout_timedeltas_minutes: []
dropout_fraction: 0.0 # Fraction of samples with dropout
public: False # set to true when uploaded

nwp:
gfs:
time_resolution_minutes: 180 # Match the dataset's resolution (3 hours)
interval_start_minutes: -180
interval_end_minutes: 540
dropout_fraction: 0.0
dropout_timedeltas_minutes: []
zarr_path: "./data/fra/gfs/france_gfs_2024_02.zarr"
provider: "gfs"
image_size_pixels_height: 2
image_size_pixels_width: 2
public: False # set to true when uploaded
channels:
- dlwrf # downwards long-wave radiation flux
- dswrf # downwards short-wave radiation flux
- hcc # high cloud cover
- mcc # medium cloud cover
- lcc # low cloud cover
- prate # precipitation rate
- r # relative humidity
- t # 2-metre temperature
- tcc # total cloud cover
- u10 # 10-metre wind U component
- u100 # 100-metre wind U component
- v10 # 10-metre wind V component
- v100 # 100-metre wind V component
- vis # visibility
normalisation_constants:
dlwrf:
mean: 298.342
std: 96.305916
dswrf:
mean: 168.12321
std: 246.18533
hcc:
mean: 35.272
std: 42.525383
lcc:
mean: 43.578342
std: 44.3732
mcc:
mean: 33.738823
std: 43.150745
prate:
mean: 2.8190969e-05
std: 0.00010159573
r:
mean: 18.359747
std: 25.440672
sde:
mean: 0.36937004
std: 0.43345627
t:
mean: 278.5223
std: 22.825893
tcc:
mean: 66.841606
std: 41.030598
u10:
mean: -0.0022310058
std: 5.470838
u100:
mean: 0.0823025
std: 6.8899174
v10:
mean: 0.06219831
std: 4.7401133
v100:
mean: 0.0797807
std: 6.076132
vis:
mean: 19628.32
std: 8294.022
u:
mean: 11.645444
std: 10.614556
v:
mean: 0.12330122
std: 7.176398
solar_position:
interval_start_minutes: -60
interval_end_minutes: 480
time_resolution_minutes: 30
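
The interval settings in this config determine how many timesteps each sample contains. The sketch below assumes both interval endpoints are included, which is how such windows are commonly counted but is not confirmed here; `window_steps` is a hypothetical helper.

```python
def window_steps(start_min: int, end_min: int, resolution_min: int) -> int:
    """Number of timesteps in [start_min, end_min] at the given resolution, both ends included."""
    span = end_min - start_min
    if span < 0 or span % resolution_min != 0:
        raise ValueError("interval must be a non-negative multiple of the resolution")
    return span // resolution_min + 1


print(window_steps(-60, 480, 30))    # generation and solar_position: 19
print(window_steps(-180, 540, 180))  # GFS: 5
```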
13 changes: 13 additions & 0 deletions src/open_data_pvnet/configs/admin_region_lat_lon.csv
@@ -0,0 +1,13 @@
location_id,region,principal_municipality,latitude,longitude
0,"Auvergne-Rhône-Alpes","Lyon",45.7640,4.8357
1,"Bourgogne-Franche-Comté","Dijon",47.3220,5.0415
2,"Bretagne","Rennes",48.1173,-1.6778
3,"Centre-Val-de-Loire","Orléans",47.9030,1.9093
4,"Grand-Est","Strasbourg",48.5734,7.7521
5,"Hauts-de-France","Lille",50.6292,3.0573
6,"Ile-de-France","Paris",48.8566,2.3522
7,"Normandie","Rouen",49.4432,1.0993
8,"Nouvelle-Aquitaine","Bordeaux",44.8378,-0.5792
9,"Occitanie","Toulouse",43.6047,1.4442
10,"Pays-de-la-Loire","Nantes",47.2184,-1.5536
11,"PACA","Marseille",43.2965,5.3698
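
A short sketch of reading this metadata into a lookup keyed by integer `location_id`. Two rows are inlined so the example is self-contained; `load_region_coords` is a hypothetical helper, not the loader used by the pipeline.

```python
import csv
import io

# Two rows copied from the CSV above.
SAMPLE = """location_id,region,principal_municipality,latitude,longitude
0,"Auvergne-Rhône-Alpes","Lyon",45.7640,4.8357
11,"PACA","Marseille",43.2965,5.3698
"""


def load_region_coords(text: str) -> dict[int, tuple[float, float]]:
    """Map integer location_id -> (latitude, longitude)."""
    reader = csv.DictReader(io.StringIO(text))
    return {
        int(row["location_id"]): (float(row["latitude"]), float(row["longitude"]))
        for row in reader
    }


coords = load_region_coords(SAMPLE)
print(coords[11])  # (43.2965, 5.3698)
```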
92 changes: 92 additions & 0 deletions src/open_data_pvnet/configs/france_gfs_config.yaml
@@ -0,0 +1,92 @@
general:
name: "france_gfs_config"
description: "Configuration for France GFS data"

input_data:
nwp:
gfs:
# GFS provides 3-hourly forecasts globally
time_resolution_minutes: 180
interval_start_minutes: -180
interval_end_minutes: 540
dropout_timedeltas_minutes: null
dropout_fraction: 0.0
accum_channels: []
max_staleness_minutes: 540

# Use existing OCF GFS data - filter to France bounds at runtime
zarr_path: "s3://ocf-open-data-pvnet/data/gfs/v4/2023.zarr"
provider: "gfs"
public: true

geographic_bounds:
latitude_min: 42.0
latitude_max: 51.5
longitude_min: -5.5
longitude_max: 9.0

# Spatial sampling
image_size_pixels_height: 2
image_size_pixels_width: 2

# Weather channels for solar prediction
channels:
- dlwrf # downwards long-wave radiation flux
- dswrf # downwards short-wave radiation flux
- hcc # high cloud cover
- lcc # low cloud cover
- mcc # medium cloud cover
- prate # precipitation rate
- r # relative humidity
- t # 2-metre temperature
- tcc # total cloud cover
- u10 # 10-metre wind U component
- u100 # 100-metre wind U component
- v10 # 10-metre wind V component
- v100 # 100-metre wind V component
- vis # visibility

# Normalisation constants (using global GFS stats from UK config)
normalisation_constants:
dlwrf:
mean: 298.342
std: 96.305916
dswrf:
mean: 168.12321
std: 246.18533
hcc:
mean: 35.272
std: 42.525383
lcc:
mean: 43.578342
std: 44.3732
mcc:
mean: 33.738823
std: 43.150745
prate:
mean: 2.8190969e-05
std: 0.00010159573
r:
mean: 18.359747
std: 25.440672
t:
mean: 278.5223
std: 22.825893
tcc:
mean: 66.841606
std: 41.030598
u10:
mean: -0.0022310058
std: 5.470838
u100:
mean: 0.0823025
std: 6.8899174
v10:
mean: 0.06219831
std: 4.7401133
v100:
mean: 0.0797807
std: 6.076132
vis:
mean: 19628.32
std: 8294.022
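
As a quick consistency check between the two new config files, the sketch below tests whether the outermost points in `admin_region_lat_lon.csv` fall inside the `geographic_bounds` of this config. The values are copied by hand from both files; `in_bounds` is a hypothetical helper.

```python
# Bounds copied from geographic_bounds in the config above.
BOUNDS = {"lat_min": 42.0, "lat_max": 51.5, "lon_min": -5.5, "lon_max": 9.0}

# (latitude, longitude) of the extreme points in admin_region_lat_lon.csv.
POINTS = {
    "Lille": (50.6292, 3.0573),       # northernmost
    "Marseille": (43.2965, 5.3698),   # southernmost
    "Rennes": (48.1173, -1.6778),     # westernmost
    "Strasbourg": (48.5734, 7.7521),  # easternmost
}


def in_bounds(lat: float, lon: float, bounds: dict = BOUNDS) -> bool:
    """True if the point lies inside the latitude/longitude box (edges included)."""
    return (
        bounds["lat_min"] <= lat <= bounds["lat_max"]
        and bounds["lon_min"] <= lon <= bounds["lon_max"]
    )


outside = [name for name, (lat, lon) in POINTS.items() if not in_bounds(lat, lon)]
print(outside)  # []
```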