18 changes: 18 additions & 0 deletions docs/getting_started.md
@@ -211,6 +211,24 @@ ds = xr.open_zarr(s3.get_mapper(dataset_path), consolidated=True)
print(ds)
```

6. **Accessing US EIA Data from S3**
The US EIA solar generation data is stored in the S3 bucket `s3://ocf-open-data-pvnet/data/us/eia/`. Similar to GFS data, it can be accessed directly using `xarray` and `s3fs`.

```python
import xarray as xr
import s3fs

# Create an S3 filesystem object (Public Access)
s3 = s3fs.S3FileSystem(anon=True)

# Open the US EIA dataset (Latest Version)
dataset_path = 's3://ocf-open-data-pvnet/data/us/eia/latest/target_eia_data_processed.zarr'
ds = xr.open_zarr(s3.get_mapper(dataset_path), consolidated=True)

# Display the dataset
print(ds)
```

### Best Practices for Using APIs

- **API Keys**: Most APIs require authentication via an API key. Store keys securely using environment variables or secret management tools.
199 changes: 199 additions & 0 deletions docs/us_data_preprocessing.md
@@ -0,0 +1,199 @@
# US Data Preprocessing for ocf-data-sampler

This document describes how to preprocess EIA solar generation data for use with ocf-data-sampler and PVNet training.

## Overview

The EIA data collected by `collect_eia_data.py` needs to be preprocessed to match the format expected by ocf-data-sampler, which follows the UK GSP data structure.

## Data Format Requirements

### Input Format (from `collect_eia_data.py`)
- **Dimensions**: `(timestamp, ba_code)`
- **Variables**: `generation_mw`, `ba_name`, `latitude`, `longitude`, `value-units`
- **Index**: MultiIndex on `(timestamp, ba_code)`

### Output Format (for ocf-data-sampler)
- **Dimensions**: `(ba_id, datetime_gmt)` where `ba_id` is int64
- **Variables**: `generation_mw`, `capacity_mwp`
- **Coordinates**: `ba_code`, `ba_name`, `latitude`, `longitude` (optional)
- **Chunking**: `{"ba_id": 1, "datetime_gmt": 1000}`
- **Format**: Zarr with consolidated metadata

## Preprocessing Steps

The preprocessing script (`preprocess_eia_for_sampler.py`) performs the following transformations:

1. **Rename timestamp to datetime_gmt**
- Converts to UTC timezone
- Removes timezone info (matches UK format)

2. **Map BA codes to numeric IDs**
- Creates numeric `ba_id` (0, 1, 2, ...) for each unique `ba_code`
- Generates metadata CSV with mapping

3. **Add capacity data**
- Estimates `capacity_mwp` from the maximum historical generation
- Applies a 1.15x safety factor, since installed capacity typically exceeds peak observed generation
- Ensures `capacity_mwp >= generation_mw`

4. **Restructure dataset**
- Sets index to `(ba_id, datetime_gmt)`
- Applies proper chunking for efficient access
- Saves with consolidated metadata
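The four steps above can be sketched with `xarray` and `pandas`. This is a simplified illustration rather than the actual script; the toy data and variable names are assumptions, and the final write step is shown commented out because it additionally requires `dask` and `zarr`:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy stand-in for the collect_eia_data.py output (values are made up)
times = pd.date_range("2023-01-01", periods=4, freq="h", tz="America/New_York")
gen = np.array([[10.0, 20.0], [12.0, 18.0], [0.0, 25.0], [8.0, 22.0]])  # (time, ba)
ba_codes = ["CISO", "ERCO"]

# Step 1: timestamp -> datetime_gmt, converted to UTC with tz info dropped
datetime_gmt = times.tz_convert("UTC").tz_localize(None)

# Step 2: numeric ba_id (0, 1, 2, ...) for each unique ba_code
ba_ids = np.arange(len(ba_codes))

ds = xr.Dataset(
    {"generation_mw": (("ba_id", "datetime_gmt"), gen.T)},
    coords={
        "ba_id": ba_ids,
        "datetime_gmt": datetime_gmt,
        "ba_code": ("ba_id", ba_codes),
    },
)

# Step 3: capacity as 1.15x the historical maximum, broadcast over time
ds["capacity_mwp"] = (
    ds["generation_mw"].max("datetime_gmt") * 1.15
).broadcast_like(ds["generation_mw"])

# Step 4: chunk for per-BA access and save with consolidated metadata
# ds = ds.chunk({"ba_id": 1, "datetime_gmt": 1000})
# ds.to_zarr("target_eia_data_processed.zarr", mode="w", consolidated=True)
```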

## Usage

### Step 1: Collect Raw EIA Data

```bash
python src/open_data_pvnet/scripts/collect_eia_data.py \
--start 2020-01-01 \
--end 2023-12-31 \
--output src/open_data_pvnet/data/target_eia_data.zarr
```

### Step 2: Preprocess for ocf-data-sampler

```bash
python src/open_data_pvnet/scripts/preprocess_eia_for_sampler.py \
--input src/open_data_pvnet/data/target_eia_data.zarr \
--output src/open_data_pvnet/data/target_eia_data_processed.zarr \
--metadata-output src/open_data_pvnet/data/us_ba_metadata.csv
```

### Step 3: Verify Compatibility

```bash
python src/open_data_pvnet/scripts/test_eia_sampler_compatibility.py \
--data-path src/open_data_pvnet/data/target_eia_data_processed.zarr \
--config-path src/open_data_pvnet/configs/PVNet_configs/datamodule/configuration/us_configuration.yaml
```

## Capacity Data Options

The preprocessing script supports three methods for obtaining capacity data:

### 1. Estimate from Generation (Default)
```bash
--capacity-method estimate
```
Estimates capacity as `max(generation) * 1.15`. This is a simple heuristic that works reasonably well for initial testing.
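A minimal version of this heuristic might look as follows (the function name is illustrative; the 1.15 factor comes from this document, and the 100 MW floor matches the dataset format docs):

```python
import numpy as np

def estimate_capacity_mwp(generation_mw, safety_factor=1.15, floor_mw=100.0):
    """Estimate installed capacity (MWp) from historical generation (MW)."""
    peak = np.nanmax(generation_mw)  # ignore missing (NaN) intervals
    return max(peak * safety_factor, floor_mw)

estimate_capacity_mwp([500.0, 1200.0, float("nan"), 900.0])  # 1200 * 1.15
estimate_capacity_mwp([10.0])  # peak * 1.15 is below the floor: returns 100.0
```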

### 2. Load from File
```bash
--capacity-method file --capacity-file path/to/capacity.csv
```
Loads capacity data from a CSV file with columns `ba_code` and `capacity_mwp`.

### 3. Static Value (Not Recommended)
```bash
--capacity-method static
```
Uses a static capacity value for all BAs. Only for testing.

## Output Files

### Processed Zarr Dataset
- **Location**: `src/open_data_pvnet/data/target_eia_data_processed.zarr`
- **Format**: Zarr v3 with consolidated metadata
- **Structure**: Matches UK GSP format for ocf-data-sampler compatibility

### BA Metadata CSV
- **Location**: `src/open_data_pvnet/data/us_ba_metadata.csv`
- **Columns**: `ba_id`, `ba_code`, `ba_name`, `latitude`, `longitude`
- **Purpose**: Mapping between numeric IDs and BA codes, plus spatial coordinates
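The mapping can then be loaded with `pandas` to translate between `ba_id` and `ba_code`. The sketch below substitutes two inline sample rows for the real file:

```python
import io
import pandas as pd

# In practice: meta = pd.read_csv("src/open_data_pvnet/data/us_ba_metadata.csv")
sample = io.StringIO(
    "ba_id,ba_code,ba_name,latitude,longitude\n"
    "0,CISO,California ISO,37.0,-120.0\n"
    "1,ERCO,Electric Reliability Council of Texas,31.0,-99.0\n"
)
meta = pd.read_csv(sample)

# Two-way lookup between numeric ids and BA codes
id_to_code = dict(zip(meta["ba_id"], meta["ba_code"]))
code_to_id = {code: ba_id for ba_id, code in id_to_code.items()}
```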

## Configuration

Update the US configuration file to point to the processed data:

```yaml
# src/open_data_pvnet/configs/PVNet_configs/datamodule/configuration/us_configuration.yaml
input_data:
gsp:
zarr_path: "src/open_data_pvnet/data/target_eia_data_processed.zarr"
time_resolution_minutes: 60 # EIA data is hourly
# ... other settings
```

## Troubleshooting

### Issue: "ba_code not found in dataset"
**Solution**: Ensure the input file was created by `collect_eia_data.py` and has the correct structure.

### Issue: "Missing capacity data"
**Solution**: The script will estimate capacity automatically. If you have external capacity data, use `--capacity-method file`.

### Issue: "ocf-data-sampler compatibility test fails"
**Solution**:
1. Check that dimensions are `(ba_id, datetime_gmt)`
2. Verify `datetime_gmt` is datetime64 without timezone
3. Ensure `capacity_mwp` variable exists
4. Check that Zarr has consolidated metadata
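Checks 1-3 can be scripted directly against the dataset. This is a sketch, not the actual compatibility test; it builds a minimal conforming dataset for illustration, and check 4 is noted as a comment:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Minimal dataset in the expected processed layout
ds = xr.Dataset(
    {
        "generation_mw": (("ba_id", "datetime_gmt"), np.zeros((2, 3), dtype=np.float32)),
        "capacity_mwp": (("ba_id", "datetime_gmt"), np.full((2, 3), 100.0, dtype=np.float32)),
    },
    coords={
        "ba_id": np.array([0, 1], dtype=np.int64),
        "datetime_gmt": pd.date_range("2023-01-01", periods=3, freq="h"),
    },
)

assert set(ds["generation_mw"].dims) == {"ba_id", "datetime_gmt"}  # check 1
assert np.issubdtype(ds["datetime_gmt"].dtype, np.datetime64)      # check 2 (tz-naive)
assert "capacity_mwp" in ds.data_vars                              # check 3
# Check 4 on a real store: opening with consolidated metadata should succeed,
# e.g. xr.open_zarr(path, consolidated=True)
```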

### Issue: "Generation exceeds capacity"
**Solution**: The script automatically ensures `capacity_mwp >= generation_mw * 1.01`. If this fails, check for data quality issues.

## Next Steps

After preprocessing:
1. Verify data with `test_eia_sampler_compatibility.py`
2. Update configuration files
3. Test data loading with ocf-data-sampler
4. Proceed with PVNet training setup

## References

- UK GSP Data Format: `src/open_data_pvnet/scripts/generate_combined_gsp.py`
- ocf-data-sampler Documentation: https://github.com/openclimatefix/ocf-data-sampler
- EIA Data Collection: `src/open_data_pvnet/scripts/collect_eia_data.py`



## S3 Data Storage & Retrieval

We support storing processed US EIA data in S3 for easier access across environments.

### 1. Uploading to S3

You can upload processed data directly to S3 using the `collect_and_preprocess_eia.py` script. This requires AWS credentials with write access to the target bucket.

```bash
python src/open_data_pvnet/scripts/collect_and_preprocess_eia.py \
--start 2023-01-01 --end 2023-01-07 \
--upload-to-s3 \
--s3-bucket ocf-open-data-pvnet \
--s3-version v1 \
--public # make objects public-read
```

Or upload existing data using the utility script:

```bash
python src/open_data_pvnet/scripts/upload_eia_to_s3.py \
--input src/open_data_pvnet/data/target_eia_data_processed.zarr \
--version v1 \
--public
```

### 2. Accessing from S3

Update your `us_configuration.yaml` to point to the S3 path:

```yaml
input_data:
gsp:
zarr_path: "s3://ocf-open-data-pvnet/data/us/eia/v1/target_eia_data_processed.zarr"
public: True
```

Or access directly in Python:

```python
import s3fs
import xarray as xr

s3 = s3fs.S3FileSystem(anon=True)
ds = xr.open_zarr(s3.get_mapper("s3://ocf-open-data-pvnet/data/us/eia/v1/target_eia_data_processed.zarr"), consolidated=True)
```
102 changes: 102 additions & 0 deletions docs/us_eia_dataset_format.md
@@ -0,0 +1,102 @@
# US EIA Dataset Technical Documentation

## Overview

We collect hourly solar generation data for the United States from the **US Energy Information Administration (EIA) Open Data API**. This dataset serves as the primary ground truth for training solar forecasting models for the US region.

Key characteristics:
- **Source**: [EIA Hourly Electricity Grid Monitor](https://www.eia.gov/electricity/gridmonitor/dashboard/electric_overview/US48/US48)
- **Granularity**: Hourly resolution
- **Coverage**: Major US Balancing Authorities (ISOs/RTOs)
- **License**: Public Domain (US Government Data)

---

## Data Formats

### 1. Raw Data (Intermediate)

The data collected by `collect_eia_data.py` is stored in Zarr format with the following structure:

- **Dimensions**: `(timestamp, ba_code)`
- **Variables**:
- `generation_mw`: Electricity generation in Megawatts (MW)
- `ba_name`: Full name of the Balancing Authority
- `latitude`: Approximate centroid latitude
- `longitude`: Approximate centroid longitude
- `value-units`: Unit string (e.g., "megawatthours")

### 2. Processed Data (Ready for Training)

The raw data is preprocessed by `preprocess_eia_for_sampler.py` to match the format required by `ocf-data-sampler`. This format aligns with the UK GSP dataset structure.

- **Dimensions**: `(ba_id, datetime_gmt)`
- **Chunking**: `{"ba_id": 1, "datetime_gmt": 1000}`
- **Variables**:

| Variable | Type | Description |
|----------|------|-------------|
| `generation_mw` | `float32` | Solar generation in MW |
| `capacity_mwp` | `float32` | Estimated installed capacity in MWp |

- **Coordinates**:

| Coordinate | Type | Description |
|------------|------|-------------|
| `ba_id` | `int64` | Numeric ID mapped to each BA code |
| `datetime_gmt` | `datetime64[ns]` | Timestamp in UTC |
| `ba_code` | `string` | ISO/RTO code (e.g., "CISO") |
| `ba_name` | `string` | Full name of the BA |
| `latitude` | `float32` | Centroid latitude |
| `longitude` | `float32` | Centroid longitude |

---

## Metadata & Mapping

A metadata CSV file (`us_ba_metadata.csv`) is generated alongside the processed data. It maps numeric `ba_id`s to their corresponding codes and locations.

| ba_id | ba_code | ba_name | latitude | longitude |
|-------|---------|---------|----------|-----------|
| 0 | CISO | California ISO | 37.0 | -120.0 |
| 1 | ERCO | Electric Reliability Council of Texas | 31.0 | -99.0 |
| ... | ... | ... | ... | ... |

---

## Capacity Estimation

Unlike UK PVLive, the EIA dataset does not provide historical installed capacity. We estimate capacity using a heuristic based on maximum historical generation:

```python
capacity_mwp = max(generation_mw) * 1.15  # 1.15x safety factor
capacity_mwp = max(capacity_mwp, 100.0)   # 100 MW minimum floor
```

- **Method**: `estimate` (default)
- **Safety Factor**: 1.15 (assumes peak observed generation is roughly 85% of installed capacity due to efficiency losses and weather)
- **Minimum**: 100 MW floor to prevent zero capacity for BAs with missing data intervals

---

## Data Quality & Validation

- **Missing Data**: Intervals with missing data are typically represented as NaNs. The `ocf-data-sampler` handles this by finding valid contiguous time periods.
- **Timezone**: All timestamps are converted to **UTC**.
- **Negative Generation**: Clipped to 0.
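These rules amount to a couple of array operations. A sketch (the exact logic lives in the preprocessing script):

```python
import numpy as np
import pandas as pd

# Negative generation clipped to 0; NaNs kept for the sampler to handle
gen = pd.Series([-5.0, 120.0, np.nan, 300.0]).clip(lower=0)

# Timestamps converted to UTC, then made timezone-naive
ts = pd.DatetimeIndex(["2023-06-01 12:00"], tz="America/New_York")
ts_utc = ts.tz_convert("UTC").tz_localize(None)

print(gen.tolist(), ts_utc[0])  # [0.0, 120.0, nan, 300.0] 2023-06-01 16:00:00
```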

## Usage

### Loading Data with Xarray

```python
import xarray as xr
import s3fs

# Local Load
ds = xr.open_zarr("src/open_data_pvnet/data/target_eia_data_processed.zarr", consolidated=True)

# S3 Load (Public)
s3 = s3fs.S3FileSystem(anon=True)
ds = xr.open_zarr(s3.get_mapper("s3://ocf-open-data-pvnet/data/us/eia/latest/target_eia_data_processed.zarr"), consolidated=True)
```