DeepElbe: A Multi-Horizon Dissolved Oxygen Forecasting and Hypoxia Early Warning System with Statistical Machine Learning and Deep Learning Methods

Authors: Danu Caus [1,2], Ovidio García-Oliva [1], Carsten Lemmen [1], and Tobias Weigel [1,2]

[1] Helmholtz-Zentrum Hereon, Geesthacht, Germany
[2] Helmholtz AI, Helmholtz Association

DeepElbe project cover

Study area

Elbe measurement stations

The code can be adapted to other measurement stations and hydrological systems. For our current experiments, however, we considered the three Elbe stations: Blankenese, Seemannshöft, and Bunthaus.

Release status

This repository is a public research-code release for reproducing the DeepElbe experiments. It is script-oriented and optimized for experiment workflows rather than packaged as a production library.

Motivation

Dissolved oxygen dynamics in rivers are shaped by interacting meteorological and hydrological conditions and can vary across daily and hourly time scales. Low-oxygen and hypoxic episodes can threaten aquatic ecosystems, motivating forecasting tools that support monitoring and early warning. This repository studies multi-horizon forecasting of oxygen concentration and hypoxia using both statistical and deep learning methods. While the experiments presented here focus on three Elbe monitoring stations, the workflow is designed to be adaptable to other stations and hydrological systems.

Problem formulation

This repository uses direct multi-horizon forecasting. For each reference timestamp t, the model predicts separate targets at exact future leads t+k (not "any event within the next window").

  • oxygen_level task: predict oxygen_lead1 ... oxygen_leadH, where oxygen_leadk is oxygen exactly at t+k.
  • hypoxia task: predict hypoxia_lead1 ... hypoxia_leadH, where hypoxia_leadk is hypoxia status exactly at t+k.

Direct Multi-Horizon Task Formulations for Dissolved Oxygen Forecasting and Hypoxia Early Warning

DeepElbe task formulations

So, for example, at lead 30 in the daily setting, evaluation is against the observation exactly 30 days ahead, not against any day in days 1-30.
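
As a concrete illustration, the sketch below shows one plausible way to build lag features and exact-lead targets from a single oxygen series with pandas. It is a simplified stand-in for the dataset builders in deepelbe_ann/, and the lag column names are assumptions.

import pandas as pd

def make_lags_and_leads(series: pd.Series, n_lags: int, horizon: int) -> pd.DataFrame:
    # Lag columns hold values observed before t; lead columns hold the
    # direct targets exactly k steps after t (oxygen_lead1 ... oxygen_leadH).
    cols = {}
    for k in range(1, n_lags + 1):
        cols[f"oxygen_lag{k}"] = series.shift(k)
    for k in range(1, horizon + 1):
        cols[f"oxygen_lead{k}"] = series.shift(-k)
    return pd.DataFrame(cols, index=series.index).dropna()

# Each remaining row is one sample at reference time t: lag columns are inputs,
# lead columns are the H separate direct-forecast targets.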

The current daily artificial neural network (ANN) is a multi-layer perceptron (MLP) that uses a 7-day history to predict up to 30 days into the future.

The current hourly ANN is also an MLP, using a 7-hour history to predict up to 24 hours into the future.

We deliberately use MLPs as the ANN models because they keep the neural-network comparison close to the original statistical baseline: both approaches operate on the same lagged input features, predict the same direct multi-horizon targets, and map predictors to lead-specific outputs in a feed-forward supervised learning setup. This keeps the ANN-vs-baseline comparison closer to apples-to-apples than a more complex sequence-model architecture would.
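
A minimal PyTorch sketch of this feed-forward setup is shown below; hidden sizes, dropout, and class naming are illustrative, not the exact architecture defined in deepelbe_ann/.

import torch.nn as nn

class DirectMultiHorizonMLP(nn.Module):
    def __init__(self, n_features: int, horizon: int, hidden=(256, 128), dropout: float = 0.1):
        super().__init__()
        layers, width = [], n_features
        for size in hidden:
            layers += [nn.Linear(width, size), nn.ReLU(), nn.Dropout(dropout)]
            width = size
        layers.append(nn.Linear(width, horizon))  # one output unit per lead t+1 ... t+H
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # x: (batch, n_features) of lagged predictors for one reference time t
        return self.net(x)  # (batch, horizon): lead-specific predictions

# Daily setting: a 7-day history of predictors in, 30 lead outputs out;
# hourly setting: a 7-hour history in, 24 lead outputs out.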

DeepElbe Model Setup and ANN-vs-Baseline Comparison Workflow

DeepElbe model-related diagrams

The ANN models use Z-normalized inputs by default, optional dropout, task-specific losses, and early stopping. Architecture and regularization settings can be fixed manually or explored through the provided W&B sweep configurations.
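
For reference, Z-normalization here means standardizing each input feature with statistics estimated on the training split only, along the lines of the sketch below (a conceptual illustration, not the repository's exact code).

import numpy as np

def fit_standardizer(X_train: np.ndarray):
    # Per-feature mean and standard deviation from the training split only.
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # guard against constant features
    return mean, std

def standardize(X: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    # Applied identically to train, validation, and test inputs.
    return (X - mean) / std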

The baseline implementation in Python mirrors the original R models by Ovidio García-Oliva: ordinary least squares (OLS) linear regression for oxygen concentration, and an optional logistic regression for hypoxia classification. Both are evaluated on train/val/test splits for comparability.

The original R baseline algorithm by Ovidio García-Oliva is available at https://github.com/ovgarol/elbe-oxygen-prediction. The Python reimplementation used here is in baselines/python/.
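
The per-lead structure of the baseline can be sketched as follows; this is a conceptual scikit-learn illustration, not the exact code in baselines/python/.

import numpy as np
from sklearn.linear_model import LinearRegression

def fit_per_lead_ols(X_train: np.ndarray, Y_train: np.ndarray):
    # X_train: (n_samples, n_features); Y_train: (n_samples, horizon).
    # One independent linear model (with intercept) per forecast lead.
    return [LinearRegression().fit(X_train, Y_train[:, h]) for h in range(Y_train.shape[1])]

def predict_per_lead(models, X: np.ndarray) -> np.ndarray:
    # Stack the per-lead predictions into a (n_samples, horizon) array.
    return np.column_stack([m.predict(X) for m in models])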

The training and comparison scripts expect raw input files under data/. You can recreate that folder with bash download_data.sh, which downloads the original public DWD air-temperature files and WSV water-temperature / dissolved-oxygen files. Preprocessing (aggregation, merge, lag/lead feature creation, and split standardization) is done on the fly during training/evaluation rather than stored as preprocessed files. Both daily and hourly variants are generated at runtime from the same raw files by aggregating timestamps to daily means or hourly means, respectively.
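
The runtime aggregation step can be pictured as a simple resampling of the raw timestamped records, as in the sketch below; the function and column names are placeholders rather than the repository's preprocessing code.

import pandas as pd

def aggregate(raw: pd.DataFrame, resolution: str) -> pd.DataFrame:
    # raw: indexed by timestamp, with numeric columns such as oxygen,
    # water_temperature, and air_temperature (placeholder names).
    rule = {"daily": "1D", "hourly": "1h"}[resolution]
    return raw.resample(rule).mean()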

For data splits, Blankenese and Seemannshoeft currently use val_end=2019-12-31, while Bunthaus uses val_end=2022-12-31 because the 2016-2019 Bunthaus period has too few hypoxia positives (daily has none), making hypoxia validation less informative.

Repository structure

  • deepelbe_ann/: ANN training code, daily/hourly dataset builders, and ANN-vs-baseline comparison scripts.
  • baselines/python/: Python reimplementation of the original statistical baselines.
  • sweeps/: W&B sweep configurations and the sweep launcher.
  • vega_frontend_graphs/: W&B custom chart templates for training and ANN-vs-baseline comparison tables.
  • Top-level train_*.sh, compare_*.sh, run_compare_*.sh, select_best_checkpoints.sh, download_data.sh, and reset_experiment_workspace.sh: shell entry points for the main workflows.
  • Generated or local folders such as data/, runs/, wandb/, and sweeps/launch_outputs/ are recreated locally and ignored by git.

Setup

Tested with Python 3.10; Python 3.10 or newer is recommended.

Create and activate the virtual environment with the repo-standard name from .gitignore:

python -m venv .deepelbe_venv
source .deepelbe_venv/bin/activate

Install dependencies:

pip install -r requirements.txt

Recreate the raw data/ folder if it is not present locally:

bash download_data.sh

Optional local configuration:

cp config.example.env config.local.env

Edit config.local.env for local paths, W&B defaults, and Slurm account/partition/log settings. If you are not using the Hereon W&B team, set WANDB_ENTITY in config.local.env to your own W&B entity. For online sweeps, also replace entity: hereon-deepelbe in the sweep YAMLs before running wandb sweep. load_env_config.sh is sourced by the top-level Bash entry points; it loads config.local.env automatically when present. Do not copy config.local.env into a public repository. Set DEEPELBE_CONFIG=/path/to/config.env to use a different config file.

Quick Start: Fresh Sweep Run

Run this from the repository root for a clean W&B sweep workflow with 30 runs per sweep.

# 1. Reset generated experiment artifacts
bash reset_experiment_workspace.sh --yes

# 2. Recreate raw data
bash download_data.sh

# 3. Make sure W&B is authenticated
source .deepelbe_venv/bin/activate
wandb login

# 4. Create all W&B sweeps with 30 runs per sweep
AGENT_COUNT=30 bash sweeps/launch_all_sweeps.sh

# 5. Run the newest generated agent script
latest_sweep_dir="$(ls -td sweeps/launch_outputs/* | head -n 1)"
bash "${latest_sweep_dir}/agent_commands.sh"

# 6. After all sweep runs finish, select best checkpoints
bash select_best_checkpoints.sh

# 7. Run best-checkpoint ANN-vs-baseline comparisons
bash run_compare_all_best.sh

# 8. Sync offline comparison runs from wandb/ (comparison wrappers default to offline mode)
wandb sync --include-offline --mark-synced wandb/offline-run-*

Training commands

Run the batch scripts from the repo root. Use bash for local runs, or submit_with_config.sh on Slurm/HPC. Logging is Weights & Biases only.

  • By default, scripts run in offline mode and store run files in wandb/ (not runs/wandb), while checkpoints stay in runs/checkpoints/. The scripts set WANDB_DIR to the repo root so W&B writes directly to ./wandb.
  • Canonical task IDs are oxygen_level (regression) and hypoxia (classification).
  • Input feature standardization is enabled by default (FEATURE_STANDARDIZATION=1). Set FEATURE_STANDARDIZATION=0 to train or compare on raw input features.

deepelbe_ann/train.py is training-only (fit/test/checkpoint). ANN-vs-baseline comparison tables are generated separately by:

  • deepelbe_ann/compare_hypoxia_models.py
  • deepelbe_ann/compare_oxygen_level_models.py

Daily oxygen_level (all stations):

bash train_daily_oxygen_level_all_stations.sh

Daily hypoxia (all stations):

bash train_daily_hypoxia_all_stations.sh

Hourly oxygen_level (all stations):

bash train_hourly_oxygen_level_all_stations.sh

Hourly hypoxia (all stations):

bash train_hourly_hypoxia_all_stations.sh

HPC examples:

bash submit_with_config.sh train_daily_hypoxia_all_stations.sh
bash submit_with_config.sh train_hourly_hypoxia_all_stations.sh

For GPU training on Slurm, request a GPU-capable partition/node according to your cluster setup, for example by setting SLURM_PARTITION in config.local.env or by passing explicit sbatch options. deepelbe_ann/train.py automatically uses CUDA when available; pass --no-cuda only when you want to force CPU execution.

submit_with_config.sh passes Slurm account, partition, working directory, and log-output settings from config.local.env to sbatch. You can still call sbatch directly with explicit Slurm options if preferred.

To force online Weights & Biases sync for the daily run:

WANDB_MODE=online bash train_daily_hypoxia_all_stations.sh

To sync offline runs from wandb/ (minimal command, best for quick local use from repo root):

wandb sync --include-offline --mark-synced wandb/offline-run-*

Use the robust variant below if you want safer behavior when there are no matches, many runs, or unusual path names:

find wandb -maxdepth 1 -type d -name 'offline-run-*' -print0 | xargs -0 -r wandb sync --include-offline --mark-synced

Compatibility wrappers (run both tasks for a resolution):

bash train_daily_all.sh
bash train_hourly_all.sh

Default W&B groups used by the task-specific scripts:

  • daily_oxygen_level_all_stations_${SLURM_JOB_ID:-local}
  • daily_hypoxia_all_stations_${SLURM_JOB_ID:-local}
  • hourly_oxygen_level_all_stations_${SLURM_JOB_ID:-local}
  • hourly_hypoxia_all_stations_${SLURM_JOB_ID:-local}

ANN vs baseline comparison (separate step)

After training an MLP, run the task-specific comparison wrapper with the selected checkpoint.

Hypoxia comparison:

  • Logs analysis/thresholds_table and analysis/roc_points_table.
  • Default group: compare_baseline_vs_ann_hypoxia_${RESOLUTION}_${STATION}_${SLURM_JOB_ID:-local}.

Recommended wrapper script:

CHECKPOINT_PATH=runs/checkpoints/<run_dir>/<best_checkpoint>.ckpt bash compare_hypoxia.sh

Example overrides (daily Blankenese):

CHECKPOINT_PATH=runs/checkpoints/<run_dir>/<best_checkpoint>.ckpt \
RESOLUTION=daily \
STATION=Blankenese \
HORIZON=30 \
WANDB_MODE=offline \
bash compare_hypoxia.sh

Oxygen-level comparison:

  • Logs analysis/oxygen_error_by_lead_table and analysis/oxygen_overall_table.
  • Also logs oxygen time-series tables for selected leads under analysis/oxygen_timeseries_<resolution>_lead<lead>_<model>_table (for example, analysis/oxygen_timeseries_daily_lead30_ann_table and analysis/oxygen_timeseries_hourly_lead24_baseline_table).
  • Default group: compare_baseline_vs_ann_oxygen_level_${RESOLUTION}_${STATION}_${SLURM_JOB_ID:-local}.

Recommended wrapper script:

CHECKPOINT_PATH=runs/checkpoints/<run_dir>/<best_checkpoint>.ckpt bash compare_oxygen_level.sh

Example overrides (hourly Blankenese):

CHECKPOINT_PATH=runs/checkpoints/<run_dir>/<best_checkpoint>.ckpt \
RESOLUTION=hourly \
STATION=Blankenese \
HORIZON=24 \
WANDB_MODE=offline \
bash compare_oxygen_level.sh

Batch helpers for best-checkpoint comparisons

  • select_best_checkpoints.sh scans runs/checkpoints/*, parses checkpoint metric values from filenames, selects the best checkpoint per run folder (max for F1-like metrics, min for MSE/loss-like metrics), and writes runs/checkpoints/<run_dir>/best_checkpoint_path.txt (see the sketch after this list).
  • run_compare_hypoxia_best.sh scans runs/checkpoints/*-hypoxia-*, reads each best_checkpoint_path.txt, infers run settings from the folder name, reads feature_standardization from checkpoint metadata, and runs compare_hypoxia.sh sequentially with resume support.
  • run_compare_oxygen_level_best.sh scans runs/checkpoints/*-oxygen_level-*, reads each best_checkpoint_path.txt, infers run settings from the folder name, reads feature_standardization from checkpoint metadata, and runs compare_oxygen_level.sh sequentially with resume support.
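
The selection rule used by select_best_checkpoints.sh can be pictured as in the sketch below. The filename pattern (a metric embedded as, for example, val_f1=0.8123 in the checkpoint name) is an assumption for illustration; the real script defines its own parsing and metric handling.

import re
from pathlib import Path

def best_checkpoint(run_dir: Path, metric: str = "val_f1", maximize: bool = True) -> Path:
    # Assumes checkpoint names embed the metric as "<metric>=<value>".
    pattern = re.compile(rf"{metric}=([0-9]+(?:\.[0-9]+)?)")
    scored = []
    for ckpt in run_dir.glob("*.ckpt"):
        match = pattern.search(ckpt.name)
        if match:
            scored.append((float(match.group(1)), ckpt))
    pick = max if maximize else min           # max for F1-like, min for MSE/loss-like metrics
    _, best = pick(scored, key=lambda item: item[0])
    return best                               # path that would be recorded in best_checkpoint_path.txt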

Checkpoint filename abbreviation examples

Training runs use a compact model_id in W&B run names and checkpoint filenames so the main model settings can be identified without opening metadata files. Checkpoint directory names intentionally keep the stable <station>-<task>-<resolution>-lags... format used by the comparison helpers.

Examples of abbreviations used in model_id strings:

  • mlp-h256x128: MLP with hidden layers 256 and 128 (mlp-hlinear means no hidden layer).
  • do0p10: dropout 0.10.
  • lr1e-03: learning rate 1e-3.
  • wd1e-04: weight decay 1e-4.
  • std: feature standardization enabled.
  • nostd: feature standardization disabled.
  • loss-mse: MSE loss for oxygen-level regression.
  • loss-huber-delta1p0: Huber loss with delta 1.0.
  • loss-bcew: weighted binary cross-entropy for hypoxia.
  • loss-bceu: unweighted binary cross-entropy for hypoxia.
  • loss-focal-g2p0-a0p25: focal loss with gamma 2.0 and alpha 0.25.
  • thr0p50: hypoxia probability threshold 0.50.

Examples:

# Create/update best checkpoint path files for all run folders
bash select_best_checkpoints.sh

# Inspect selections without writing files
bash select_best_checkpoints.sh --dry-run

# Run hypoxia comparisons sequentially for all best checkpoints
bash run_compare_hypoxia_best.sh

# Run oxygen_level comparisons sequentially for all best checkpoints
bash run_compare_oxygen_level_best.sh

# Dry-run comparison commands (no execution, hypoxia)
bash run_compare_hypoxia_best.sh --dry-run

# Dry-run comparison commands (no execution, oxygen_level)
bash run_compare_oxygen_level_best.sh --dry-run

Workspace reset helper

Use reset_experiment_workspace.sh to prepare a fresh experiment workspace by cleaning generated artifacts:

  • runs/
  • sweeps/launch_outputs/
  • wandb/

Safety behavior:

  • Default is dry-run (shows targets and sizes, deletes nothing).
  • Actual deletion requires explicit confirmation with --yes.

Examples:

# Preview what would be deleted
bash reset_experiment_workspace.sh --dry-run

# Delete the configured targets
bash reset_experiment_workspace.sh --yes

W&B hyperparameter sweeps (on a server with internet access)

Use this mode when your machine has internet access and you want W&B-managed online sweeps (wandb sweep + wandb agent). Twelve station-specific sweep configs are provided:

  • Blankenese: sweeps/oxygen_level_daily_blankenese_sweep.yaml, sweeps/hypoxia_daily_blankenese_sweep.yaml, sweeps/oxygen_level_hourly_blankenese_sweep.yaml, sweeps/hypoxia_hourly_blankenese_sweep.yaml
  • Bunthaus: sweeps/oxygen_level_daily_bunthaus_sweep.yaml, sweeps/hypoxia_daily_bunthaus_sweep.yaml, sweeps/oxygen_level_hourly_bunthaus_sweep.yaml, sweeps/hypoxia_hourly_bunthaus_sweep.yaml
  • Seemannshoeft: sweeps/oxygen_level_daily_seemannshoeft_sweep.yaml, sweeps/hypoxia_daily_seemannshoeft_sweep.yaml, sweeps/oxygen_level_hourly_seemannshoeft_sweep.yaml, sweeps/hypoxia_hourly_seemannshoeft_sweep.yaml

Loss options now exposed for sweeps:

  • Oxygen-level sweeps include oxygen-loss (mse, huber) and huber-delta.
  • Hypoxia sweeps include hypoxia-loss (bce_weighted, bce_unweighted, focal) plus focal-gamma and focal-alpha (see the focal-loss sketch below).
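
For reference, the focal option corresponds to the standard binary focal loss with the swept focal-gamma and focal-alpha parameters; the sketch below is an illustrative implementation, not the repository's exact code.

import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, gamma: float = 2.0, alpha: float = 0.25):
    # logits, targets: float tensors of shape (batch, horizon); targets in {0, 1}.
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = targets * p + (1 - targets) * (1 - p)              # probability of the true class
    alpha_t = targets * alpha + (1 - targets) * (1 - alpha)  # class-balancing weight
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()       # down-weights easy examples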

Launch sweeps (requires internet; these configs use --wandb-mode=online). Example:

wandb sweep sweeps/oxygen_level_daily_blankenese_sweep.yaml
wandb sweep sweeps/hypoxia_daily_bunthaus_sweep.yaml
wandb sweep sweeps/hypoxia_hourly_seemannshoeft_sweep.yaml

Then run agents (replace <sweep_id> with the ID returned by wandb sweep):

wandb agent hereon-deepelbe/DeepElbe/<sweep_id>

If you are not using the Hereon W&B team, replace hereon-deepelbe with your own W&B entity in the agent path.

Automate launching all sweep YAMLs and generating copy/paste-ready agent commands:

bash sweeps/launch_all_sweeps.sh

By default this generates agent commands with 30 runs per sweep:

  • oxygen_level sweeps: wandb agent --count 30 ...
  • hypoxia sweeps: wandb agent --count 30 ...

Override all sweeps uniformly with AGENT_COUNT=<N> if needed. Example for 30 runs per sweep across every config:

AGENT_COUNT=30 bash sweeps/launch_all_sweeps.sh

Or override only one task family:

AGENT_COUNT_OXYGEN_LEVEL=30 AGENT_COUNT_HYPOXIA=30 bash sweeps/launch_all_sweeps.sh

The effective run limit is the lower of the generated agent --count value and the sweep YAML run_cap.

This creates timestamped outputs under sweeps/launch_outputs/<UTC timestamp>/:

  • sweep_results.tsv with config path, exit code, sweep ID, parsed agent command, and log path.
  • agent_commands.txt with config | sweep_id | wandb agent ....
  • agent_commands.sh executable script containing all parsed wandb agent ... lines. It logs START/DONE/FAIL timestamps to agent_run.log and tracks completed entries in agent_progress.done. Re-running agent_commands.sh in the same output directory skips completed entries and resumes from the first unfinished one.

Run the generated script (resume-safe by default):

bash sweeps/launch_outputs/<UTC timestamp>/agent_commands.sh

Concrete fresh-start example:

# 1. Remove generated experiment artifacts from previous runs
bash reset_experiment_workspace.sh --yes

# 2. Create fresh sweep IDs and generate a new launch_outputs/<UTC timestamp>/ directory
AGENT_COUNT_OXYGEN_LEVEL=30 AGENT_COUNT_HYPOXIA=30 bash sweeps/launch_all_sweeps.sh

# 3. Run the generated agent script from the new timestamped output directory
bash sweeps/launch_outputs/<UTC timestamp>/agent_commands.sh

Notes:

  • Bunthaus sweep configs use val_end=2022-12-31.
  • Blankenese and Seemannshoeft sweep configs use val_end=2019-12-31.

Saved W&B custom-chart templates

Saved W&B custom-chart templates are provided under vega_frontend_graphs/. These files are intended for W&B tables logged by the ANN-vs-baseline comparison scripts, not for plain training runs. They are reusable frontend templates, not universally correct drop-in charts for every run. When reusing them, adapt them carefully to the specific case:

  • switch train filters to val or test where needed
  • make sure the queried table keys match the intended resolution and lead
  • verify chart titles and labels after instantiation so they still describe the selected split, lead, station, and task correctly
  • remember that some files are generic Vega specs while the paired GraphQL files may be concrete lead-specific instances

Result examples

These examples illustrate oxygen-level forecasting outputs from the experiment workflow. They are included as qualitative/diagnostic examples; full quantitative comparisons are produced through the W&B tables and custom chart templates.

Oxygen-Level Forecasting Metrics Across Forecast Leads

Oxygen-level forecasting metrics

Observed Dissolved Oxygen Fluctuations Versus Model Predictions

Oxygen ground truth fluctuations versus predictions

Short-Horizon Versus Long-Horizon Oxygen-Level Forecasting

Long-horizon versus short-horizon oxygen-level forecasting

Theory and interpretation

This section collects explanatory notes about how key outputs should be interpreted. It is intended as a reference appendix at the end of the README, and can be extended later if other recurring theoretical questions come up.

Oxygen-level bias metric

The oxygen-level comparison reports a per-lead bias and an aggregated split-level bias_mean.

For one split (train, val, or test) and one forecast lead h:

$$ \mathrm{bias}_h = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_{i,h} - y_{i,h}\right) $$

Plain fallback:

bias_h = (1 / N) * sum(i = 1..N) [ y_hat(i,h) - y(i,h) ]

Where:

  • i: sample index
  • h: lead index (forecast step)
  • N: number of samples in the split
  • y_hat[i,h]: model prediction for sample i, lead h
  • y[i,h]: target for sample i, lead h

The reported split-level bias_mean is the average of the per-lead biases:

$$ \mathrm{bias\_mean} = \frac{1}{H}\sum_{h=1}^{H}\mathrm{bias}_h $$

Plain fallback:

bias_mean = (1 / H) * sum(h = 1..H) [ bias_h ]

Where:

  • H: horizon length (number of forecast leads)

Interpretation:

  • Positive bias means average over-prediction.
  • Negative bias means average under-prediction.
  • Bias near zero means signed errors cancel on average; it does not necessarily mean low RMSE or MAE.
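
A small numeric sketch of the two formulas above, with hypothetical values:

import numpy as np

y_hat = np.array([[8.0, 6.0], [4.0, 7.0]])   # predictions, shape (N=2 samples, H=2 leads)
y     = np.array([[5.0, 6.0], [7.0, 8.0]])   # targets

bias_per_lead = (y_hat - y).mean(axis=0)     # bias_h for each lead h -> [ 0.  -0.5]
bias_mean = bias_per_lead.mean()             # split-level bias_mean  -> -0.25
# Lead 1 errors (+3 and -3) cancel exactly; lead 2 under-predicts on average.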

Why baseline train bias is usually near zero

In this project, the oxygen baseline is an ordinary least squares (OLS) linear regression model fit separately for each forecast lead. The implementation includes an intercept term.

For in-sample OLS with an intercept, the residuals on the training data are constrained to average exactly zero. Since the bias is simply the mean signed residual, the baseline train bias is expected to be zero for each lead.

This is why baseline train bias often appears as 0 in the reported outputs. In practice, it is usually a very small floating-point value that gets rounded for display. Validation and test bias are not constrained this way, so they are typically non-zero.
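
A tiny synthetic demonstration of this property (illustrative data, not project data):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X, y)        # fit_intercept=True by default
train_bias = (model.predict(X) - y).mean()
print(train_bias)                           # ~1e-16: zero up to floating-point precision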

Zero bias does not mean perfect predictions

Zero bias is easy to misread. It does not mean that the model is predicting the targets accurately at every sample. It only means that the signed errors average out.

For example:

sample   prediction   target   error
1        8            5        +3
2        4            7        -3

In this example, the bias is zero, but the model is wrong on both samples. RMSE would still be 3.

What zero bias means:

  • no systematic direction of error
  • over-predictions and under-predictions cancel on average

What zero bias does not mean:

  • predictions are close to the targets
  • RMSE or MAE are low
  • the model is not overfitting

Overfitting would instead show up as training RMSE/MAE being much better than validation/test RMSE/MAE. A non-zero validation or test bias usually points more toward a systematic shift between training and evaluation periods, for example a temporal trend or broader distribution shift.

License

This repository is licensed under the Apache License, Version 2.0. See LICENSE.

Copyright 2026 Danu Caus and DeepElbe contributors.

The original R baseline implementation was authored by Ovidio García-Oliva / Helmholtz-Zentrum Hereon GmbH and is available at https://github.com/ovgarol/elbe-oxygen-prediction. The Python baseline in baselines/python/ is a reimplementation/port of that baseline.

This repository extends that baseline work with the ANN/MLP forecasting code, daily and hourly training workflows, ANN-vs-baseline comparison scripts, W&B sweep configurations, and W&B custom-chart templates. These additions are also distributed under the Apache License, Version 2.0.

Downloaded DWD and WSV data are not redistributed in this repository and remain subject to the terms of their original providers.

Affiliation

This work was carried out at Helmholtz-Zentrum Hereon, Geesthacht, Germany.

Acknowledgements

This work was supported by the Helmholtz Association's Initiative and Networking Fund through Helmholtz AI [grant number: ZT-I-PF-5-01].

This work used resources of the Deutsches Klimarechenzentrum (DKRZ) granted by its Scientific Steering Committee (WLA) under project ID AIM.
