DeepElbe: A Multi-Horizon Dissolved Oxygen Forecasting and Hypoxia Early Warning System with Statistical Machine Learning and Deep Learning Methods
Authors: Danu Caus1,2, Ovidio García-Oliva1, Carsten Lemmen1, and Tobias Weigel1,2
1 Helmholtz-Zentrum Hereon, Geesthacht, Germany
2 Helmholtz AI, Helmholtz Association
The code can be adapted to other measurement stations and hydrological systems. For our current experiments, however, we considered the three Elbe stations: Blankenese, Seemannshöft, and Bunthaus.
This repository is a public research-code release for reproducing the DeepElbe experiments. It is script-oriented and optimized for experiment workflows rather than packaged as a production library.
Dissolved oxygen dynamics in rivers are shaped by interacting meteorological and hydrological conditions and can vary across daily and hourly time scales. Low-oxygen and hypoxic episodes can threaten aquatic ecosystems and motivate forecasting tools that support monitoring and early warning. This repository studies multi-horizon forecasting of oxygen concentration and hypoxia using both statistical and deep learning methods. While the experiments presented here focus on three Elbe monitoring stations, the workflow is designed to be adaptable to other stations and hydrological systems.
This repository uses direct multi-horizon forecasting.
For each reference timestamp t, the model predicts separate targets at exact future leads t+k (not "any event within the next window").
- `oxygen_level` task: predict `oxygen_lead1 ... oxygen_leadH`, where `oxygen_leadk` is oxygen exactly at `t+k`.
- `hypoxia` task: predict `hypoxia_lead1 ... hypoxia_leadH`, where `hypoxia_leadk` is hypoxia status exactly at `t+k`.
Direct Multi-Horizon Task Formulations for Dissolved Oxygen Forecasting and Hypoxia Early Warning
So, for example, at lead 30 in the daily setting, evaluation is against the observation exactly 30 days ahead, not against any day in days 1-30.
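To make the target construction concrete, here is a minimal pandas sketch of direct multi-horizon lead targets. It is an illustration only: the column name `oxygen` and the toy values are assumptions, not the exact dataset builder in `deepelbe_ann/`.

```python
import pandas as pd

# Toy daily oxygen series indexed by timestamp (illustrative values only).
df = pd.DataFrame(
    {"oxygen": [8.1, 7.9, 7.5, 6.8, 6.2, 5.9, 6.4, 7.0]},
    index=pd.date_range("2019-06-01", periods=8, freq="D"),
)

H = 3  # forecast horizon (number of leads)

# Direct multi-horizon targets: oxygen_leadk holds the observation exactly k steps ahead of t.
for k in range(1, H + 1):
    df[f"oxygen_lead{k}"] = df["oxygen"].shift(-k)

# Reference timestamps whose leads run past the end of the series are dropped.
print(df.dropna())
```

Each row then pairs one reference timestamp t with H separate targets, which is exactly the "exact lead, not any-day-in-window" formulation described above.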
The current daily artificial neural network (ANN) is a multi-layer perceptron (MLP) that uses a 7-day history to predict up to 30 days into the future.
The current hourly ANN is also an MLP, using a 7-hour history to predict up to 24 hours into the future.
We deliberately use MLPs as the ANN models because they keep the neural-network comparison close to the original statistical baseline: both approaches operate on the same lagged input features, predict the same direct multi-horizon targets, and map predictors to lead-specific outputs in a feed-forward supervised learning setup. This makes the ANN-vs-baseline comparison closer to an apples-to-apples comparison than using a more complex sequence-model architecture.
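As a rough sketch of this feed-forward setup (layer widths, dropout, and feature counts below are placeholders, not the repository's defaults), a multi-horizon MLP in PyTorch could look like:

```python
import torch
import torch.nn as nn

class MultiHorizonMLP(nn.Module):
    """Maps a flattened window of lagged features to one output per forecast lead."""

    def __init__(self, n_features: int, n_lags: int, horizon: int,
                 hidden=(256, 128), dropout: float = 0.1):
        super().__init__()
        layers = []
        in_dim = n_features * n_lags
        for width in hidden:
            layers += [nn.Linear(in_dim, width), nn.ReLU(), nn.Dropout(dropout)]
            in_dim = width
        layers.append(nn.Linear(in_dim, horizon))  # one output per lead t+1 ... t+H
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_features * n_lags) of Z-normalized lagged predictors.
        # For the hypoxia task these outputs would be treated as logits.
        return self.net(x)

model = MultiHorizonMLP(n_features=3, n_lags=7, horizon=30)
print(model(torch.randn(4, 3 * 7)).shape)  # torch.Size([4, 30])
```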
DeepElbe Model Setup and ANN-vs-Baseline Comparison Workflow
The ANN models use Z-normalized inputs by default, optional dropout, task-specific losses, and early stopping. Architecture and regularization settings can be fixed manually or explored through the provided W&B sweep configurations.
The baseline implementation in Python mirrors the original R models by Ovidio García-Oliva: ordinary least squares (OLS) linear regression for oxygen concentration, and an optional logistic regression for hypoxia classification. Both are evaluated on train/val/test splits for comparability.
The original R baseline algorithm by Ovidio García-Oliva is available at
https://github.com/ovgarol/elbe-oxygen-prediction. The Python reimplementation used here is in
baselines/python/.
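A minimal per-lead OLS sketch in the spirit of that baseline (synthetic data, not the `baselines/python/` code) fits one linear regression with an intercept per forecast lead on the same lagged predictors:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 7))                  # lagged predictors (placeholder)
Y_train = (X_train @ rng.normal(size=(7, 30))
           + rng.normal(scale=0.3, size=(200, 30)))  # one target column per lead t+1 ... t+30

# One OLS model (with intercept) per forecast lead, mirroring the per-lead formulation.
models = [LinearRegression().fit(X_train, Y_train[:, k]) for k in range(Y_train.shape[1])]

# Stack per-lead predictions back into an (n_samples, horizon) matrix.
Y_pred = np.column_stack([m.predict(X_train) for m in models])
print(Y_pred.shape)  # (200, 30)
```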
The training and comparison scripts expect raw input files under data/. You can recreate that folder with bash download_data.sh, which downloads the original public DWD air-temperature files and WSV water-temperature / dissolved-oxygen files. Preprocessing (aggregation, merge, lag/lead feature creation, and split standardization) is done on the fly during training/evaluation rather than stored as preprocessed files. Both daily and hourly variants are generated at runtime from the same raw files by aggregating timestamps to daily means or hourly means, respectively.
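The aggregation step can be pictured with a small pandas sketch (the column names and toy timestamps are assumptions; the real inputs are the downloaded DWD/WSV files under data/):

```python
import pandas as pd

# Toy irregular measurements standing in for the raw station files.
raw = pd.DataFrame(
    {"oxygen": [7.2, 7.0, 6.8, 6.9], "water_temp": [18.1, 18.3, 18.6, 18.4]},
    index=pd.to_datetime(
        ["2019-06-01 00:10", "2019-06-01 06:40", "2019-06-01 12:20", "2019-06-02 01:05"]
    ),
)

hourly = raw.resample("1h").mean().dropna(how="all")  # hourly means
daily = raw.resample("1D").mean()                     # daily means
print(daily)
```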
For data splits, Blankenese and Seemannshoeft currently use val_end=2019-12-31, while Bunthaus uses val_end=2022-12-31 because the 2016-2019 Bunthaus period has too few hypoxia positives (daily has none), making hypoxia validation less informative.
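A plain time-based split by cutoff dates, in the spirit of the val_end setting, could look like the sketch below; the `train_end` cutoff and the toy index are placeholders, and the actual split logic lives in the dataset builders.

```python
import pandas as pd

def time_split(df: pd.DataFrame, train_end: str, val_end: str):
    """Split a time-indexed frame into train/val/test using inclusive cutoff dates."""
    train = df.loc[:train_end]
    val = df.loc[pd.Timestamp(train_end) + pd.Timedelta(days=1): val_end]
    test = df.loc[pd.Timestamp(val_end) + pd.Timedelta(days=1):]
    return train, val, test

df = pd.DataFrame({"oxygen": range(10)},
                  index=pd.date_range("2022-12-27", periods=10, freq="D"))
train, val, test = time_split(df, train_end="2022-12-29", val_end="2022-12-31")
print(len(train), len(val), len(test))  # 3 2 5
```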
- `deepelbe_ann/`: ANN training code, daily/hourly dataset builders, and ANN-vs-baseline comparison scripts.
- `baselines/python/`: Python reimplementation of the original statistical baselines.
- `sweeps/`: W&B sweep configurations and the sweep launcher.
- `vega_frontend_graphs/`: W&B custom chart templates for training and ANN-vs-baseline comparison tables.
- Top-level `train_*.sh`, `compare_*.sh`, `run_compare_*.sh`, `select_best_checkpoints.sh`, `download_data.sh`, and `reset_experiment_workspace.sh`: shell entry points for the main workflows.
- Generated or local folders such as `data/`, `runs/`, `wandb/`, and `sweeps/launch_outputs/` are recreated locally and ignored by git.
Tested with Python 3.10; Python 3.10 or newer is recommended.
Create and activate the virtual environment with the repo-standard name from .gitignore:
python -m venv .deepelbe_venv
source .deepelbe_venv/bin/activate

Install dependencies:

pip install -r requirements.txt

Recreate the raw data/ folder if it is not present locally:

bash download_data.sh

Optional local configuration:

cp config.example.env config.local.env

Edit config.local.env for local paths, W&B defaults, and Slurm account/partition/log settings.
If you are not using the Hereon W&B team, set WANDB_ENTITY in config.local.env to your own W&B entity. For online sweeps, also replace entity: hereon-deepelbe in the sweep YAMLs before running wandb sweep.
load_env_config.sh is sourced by the top-level Bash entry points; it loads config.local.env automatically when present.
Do not copy config.local.env into a public repository.
Set DEEPELBE_CONFIG=/path/to/config.env to use a different config file.
Run this from the repository root for a clean W&B sweep workflow with 30 runs per sweep.
# 1. Reset generated experiment artifacts
bash reset_experiment_workspace.sh --yes
# 2. Recreate raw data
bash download_data.sh
# 3. Make sure W&B is authenticated
source .deepelbe_venv/bin/activate
wandb login
# 4. Create all W&B sweeps with 30 runs per sweep
AGENT_COUNT=30 bash sweeps/launch_all_sweeps.sh
# 5. Run the newest generated agent script
latest_sweep_dir="$(ls -td sweeps/launch_outputs/* | head -n 1)"
bash "${latest_sweep_dir}/agent_commands.sh"
# 6. After all sweep runs finish, select best checkpoints
bash select_best_checkpoints.sh
# 7. Run best-checkpoint ANN-vs-baseline comparisons
bash run_compare_all_best.sh
# 8. Sync offline comparison runs from wandb/ (comparison wrappers default to offline mode)
wandb sync --include-offline --mark-synced wandb/offline-run-*

Run the batch scripts from the repo root. Use bash for local runs, or submit_with_config.sh on Slurm/HPC.
Logging is Weights & Biases only. By default, scripts run in offline mode and store run files in wandb/ (not runs/wandb), while checkpoints stay in runs/checkpoints/. The scripts set WANDB_DIR to the repo root so W&B writes directly to ./wandb.
Canonical task IDs are oxygen_level (regression) and hypoxia (classification).
Input feature standardization is enabled by default (FEATURE_STANDARDIZATION=1). Set FEATURE_STANDARDIZATION=0 to train or compare on raw input features.
deepelbe_ann/train.py is training-only (fit/test/checkpoint). ANN-vs-baseline comparison tables are generated separately by:
- `deepelbe_ann/compare_hypoxia_models.py`
- `deepelbe_ann/compare_oxygen_level_models.py`
Daily oxygen_level (all stations):
bash train_daily_oxygen_level_all_stations.sh

Daily hypoxia (all stations):

bash train_daily_hypoxia_all_stations.sh

Hourly oxygen_level (all stations):

bash train_hourly_oxygen_level_all_stations.sh

Hourly hypoxia (all stations):

bash train_hourly_hypoxia_all_stations.sh

HPC examples:

bash submit_with_config.sh train_daily_hypoxia_all_stations.sh
bash submit_with_config.sh train_hourly_hypoxia_all_stations.sh

For GPU training on Slurm, request a GPU-capable partition/node according to your cluster setup, for example by setting SLURM_PARTITION in config.local.env or by passing explicit sbatch options. deepelbe_ann/train.py automatically uses CUDA when available; pass --no-cuda only when you want to force CPU execution.
submit_with_config.sh passes Slurm account, partition, working directory, and log-output settings from config.local.env to sbatch. You can still call sbatch directly with explicit Slurm options if preferred.
To force online Weights & Biases sync for the daily run:
WANDB_MODE=online bash train_daily_hypoxia_all_stations.sh

To sync offline runs from wandb/ (minimal command, best for quick local use from repo root):

wandb sync --include-offline --mark-synced wandb/offline-run-*

Use the robust variant below if you want safer behavior when there are no matches, many runs, or unusual path names:

find wandb -maxdepth 1 -type d -name 'offline-run-*' -print0 | xargs -0 -r wandb sync --include-offline --mark-synced

Compatibility wrappers (run both tasks for a resolution):
bash train_daily_all.sh
bash train_hourly_all.sh

Default W&B groups used by the task-specific scripts:

- `daily_oxygen_level_all_stations_${SLURM_JOB_ID:-local}`
- `daily_hypoxia_all_stations_${SLURM_JOB_ID:-local}`
- `hourly_oxygen_level_all_stations_${SLURM_JOB_ID:-local}`
- `hourly_hypoxia_all_stations_${SLURM_JOB_ID:-local}`
After training an MLP, run the task-specific comparison wrapper with the selected checkpoint.
Hypoxia comparison:
- Logs `analysis/thresholds_table` and `analysis/roc_points_table`.
- Default group: `compare_baseline_vs_ann_hypoxia_${RESOLUTION}_${STATION}_${SLURM_JOB_ID:-local}`.
Recommended wrapper script:
CHECKPOINT_PATH=runs/checkpoints/<run_dir>/<best_checkpoint>.ckpt bash compare_hypoxia.sh

Example overrides (daily Blankenese):
CHECKPOINT_PATH=runs/checkpoints/<run_dir>/<best_checkpoint>.ckpt \
RESOLUTION=daily \
STATION=Blankenese \
HORIZON=30 \
WANDB_MODE=offline \
bash compare_hypoxia.sh

Oxygen-level comparison:

- Logs `analysis/oxygen_error_by_lead_table` and `analysis/oxygen_overall_table`.
- Also logs oxygen time-series tables for selected leads under `analysis/oxygen_timeseries_<resolution>_lead<lead>_<model>_table` (for example, `analysis/oxygen_timeseries_daily_lead30_ann_table` and `analysis/oxygen_timeseries_hourly_lead24_baseline_table`).
- Default group: `compare_baseline_vs_ann_oxygen_level_${RESOLUTION}_${STATION}_${SLURM_JOB_ID:-local}`.
Recommended wrapper script:
CHECKPOINT_PATH=runs/checkpoints/<run_dir>/<best_checkpoint>.ckpt bash compare_oxygen_level.sh

Example overrides (hourly Blankenese):
CHECKPOINT_PATH=runs/checkpoints/<run_dir>/<best_checkpoint>.ckpt \
RESOLUTION=hourly \
STATION=Blankenese \
HORIZON=24 \
WANDB_MODE=offline \
bash compare_oxygen_level.sh

- `select_best_checkpoints.sh` scans `runs/checkpoints/*`, parses checkpoint metric values from filenames, selects the best checkpoint per run folder (`max` for F1-like metrics, `min` for MSE/loss-like metrics), and writes `runs/checkpoints/<run_dir>/best_checkpoint_path.txt` (the selection rule is sketched after this list).
- `run_compare_hypoxia_best.sh` scans `runs/checkpoints/*-hypoxia-*`, reads each `best_checkpoint_path.txt`, infers run settings from the folder name, reads `feature_standardization` from checkpoint metadata, and runs `compare_hypoxia.sh` sequentially with resume support.
- `run_compare_oxygen_level_best.sh` scans `runs/checkpoints/*-oxygen_level-*`, reads each `best_checkpoint_path.txt`, infers run settings from the folder name, reads `feature_standardization` from checkpoint metadata, and runs `compare_oxygen_level.sh` sequentially with resume support.
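The selection rule itself is easy to illustrate. The filename pattern below (a metric such as `val_f1=0.84` or `val_loss=0.123` embedded in the checkpoint name) is a hypothetical example rather than the script's exact format; only the max-for-F1 / min-for-loss behavior mirrors the description above.

```python
import re

# Hypothetical pattern: the checkpoint filename embeds one validation metric.
METRIC_RE = re.compile(r"(val_f1|val_loss|val_mse)=([0-9]+(?:\.[0-9]+)?)")

def pick_best(filenames):
    """Return the best checkpoint name: max for F1-like metrics, min for MSE/loss-like metrics."""
    parsed = []
    for name in filenames:
        match = METRIC_RE.search(name)
        if match:
            parsed.append((match.group(1), float(match.group(2)), name))
    metric = parsed[0][0]
    chooser = max if metric == "val_f1" else min
    return chooser(parsed, key=lambda item: item[1])[2]

print(pick_best(["epoch=03-val_f1=0.71.ckpt", "epoch=09-val_f1=0.84.ckpt"]))        # higher F1 wins
print(pick_best(["epoch=03-val_loss=0.210.ckpt", "epoch=09-val_loss=0.123.ckpt"]))  # lower loss wins
```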
Training runs use a compact model_id in W&B run names and checkpoint filenames so the main model settings can be identified without opening metadata files. Checkpoint directory names intentionally keep the stable <station>-<task>-<resolution>-lags... format used by the comparison helpers.
Examples of abbreviations used in model_id strings:
- `mlp-h256x128`: MLP with hidden layers 256 and 128 (`mlp-hlinear` means no hidden layer).
- `do0p10`: dropout 0.10.
- `lr1e-03`: learning rate 1e-3.
- `wd1e-04`: weight decay 1e-4.
- `std`: feature standardization enabled.
- `nostd`: feature standardization disabled.
- `loss-mse`: MSE loss for oxygen-level regression.
- `loss-huber-delta1p0`: Huber loss with delta 1.0.
- `loss-bcew`: weighted binary cross-entropy for hypoxia.
- `loss-bceu`: unweighted binary cross-entropy for hypoxia.
- `loss-focal-g2p0-a0p25`: focal loss with gamma 2.0 and alpha 0.25.
- `thr0p50`: hypoxia probability threshold 0.50.
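One recurring convention in these abbreviations is that `p` stands in for the decimal point (for example `do0p10` for dropout 0.10). The tiny helper below is a hypothetical reading aid, not part of the repository:

```python
def decode_p_number(token: str, prefix: str) -> float:
    """Decode tokens like 'do0p10' or 'thr0p50', where 'p' replaces the decimal point."""
    return float(token[len(prefix):].replace("p", "."))

print(decode_p_number("do0p10", "do"))                             # 0.1  -> dropout 0.10
print(decode_p_number("thr0p50", "thr"))                           # 0.5  -> hypoxia threshold 0.50
print(decode_p_number("loss-huber-delta1p0", "loss-huber-delta"))  # 1.0  -> Huber delta 1.0
```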
Examples:
# Create/update best checkpoint path files for all run folders
bash select_best_checkpoints.sh
# Inspect selections without writing files
bash select_best_checkpoints.sh --dry-run
# Run hypoxia comparisons sequentially for all best checkpoints
bash run_compare_hypoxia_best.sh
# Run oxygen_level comparisons sequentially for all best checkpoints
bash run_compare_oxygen_level_best.sh
# Dry-run comparison commands (no execution, hypoxia)
bash run_compare_hypoxia_best.sh --dry-run
# Dry-run comparison commands (no execution, oxygen_level)
bash run_compare_oxygen_level_best.sh --dry-run

Use reset_experiment_workspace.sh to prepare a fresh experiment workspace by cleaning generated artifacts:

- runs/
- sweeps/launch_outputs/
- wandb/
Safety behavior:
- Default is dry-run (shows targets and sizes, deletes nothing).
- Actual deletion requires explicit confirmation with `--yes`.
Examples:
# Preview what would be deleted
bash reset_experiment_workspace.sh --dry-run
# Delete the configured targets
bash reset_experiment_workspace.sh --yes

Use this mode when your machine has internet access and you want W&B-managed online sweeps (wandb sweep + wandb agent).
Twelve station-specific sweep configs are provided:
- Blankenese: `sweeps/oxygen_level_daily_blankenese_sweep.yaml`, `sweeps/hypoxia_daily_blankenese_sweep.yaml`, `sweeps/oxygen_level_hourly_blankenese_sweep.yaml`, `sweeps/hypoxia_hourly_blankenese_sweep.yaml`
- Bunthaus: `sweeps/oxygen_level_daily_bunthaus_sweep.yaml`, `sweeps/hypoxia_daily_bunthaus_sweep.yaml`, `sweeps/oxygen_level_hourly_bunthaus_sweep.yaml`, `sweeps/hypoxia_hourly_bunthaus_sweep.yaml`
- Seemannshoeft: `sweeps/oxygen_level_daily_seemannshoeft_sweep.yaml`, `sweeps/hypoxia_daily_seemannshoeft_sweep.yaml`, `sweeps/oxygen_level_hourly_seemannshoeft_sweep.yaml`, `sweeps/hypoxia_hourly_seemannshoeft_sweep.yaml`
Loss options now exposed for sweeps:
- Oxygen-level sweeps include `oxygen-loss` (`mse`, `huber`) and `huber-delta`.
- Hypoxia sweeps include `hypoxia-loss` (`bce_weighted`, `bce_unweighted`, `focal`) plus `focal-gamma` and `focal-alpha`.
Launch sweeps (requires internet; these configs use --wandb-mode=online). Example:
wandb sweep sweeps/oxygen_level_daily_blankenese_sweep.yaml
wandb sweep sweeps/hypoxia_daily_bunthaus_sweep.yaml
wandb sweep sweeps/hypoxia_hourly_seemannshoeft_sweep.yaml

Then run agents (replace <sweep_id> with the ID returned by wandb sweep):

wandb agent hereon-deepelbe/DeepElbe/<sweep_id>

If you are not using the Hereon W&B team, replace hereon-deepelbe with your own W&B entity in the agent path.
Automate launching all sweep YAMLs and generating copy/paste-ready agent commands:
bash sweeps/launch_all_sweeps.sh

By default this generates agent commands with 30 runs per sweep:

- `oxygen_level` sweeps: `wandb agent --count 30 ...`
- `hypoxia` sweeps: `wandb agent --count 30 ...`
Override all sweeps uniformly with AGENT_COUNT=<N> if needed.
Example for 30 runs per sweep across every config:
AGENT_COUNT=30 bash sweeps/launch_all_sweeps.sh

Or override the task families individually:

AGENT_COUNT_OXYGEN_LEVEL=30 AGENT_COUNT_HYPOXIA=30 bash sweeps/launch_all_sweeps.sh

The effective run limit is the lower of the generated agent --count value and the sweep YAML run_cap.
This creates timestamped outputs under sweeps/launch_outputs/<UTC timestamp>/:
- `sweep_results.tsv` with config path, exit code, sweep ID, parsed agent command, and log path.
- `agent_commands.txt` with `config | sweep_id | wandb agent ...`.
- `agent_commands.sh`: an executable script containing all parsed `wandb agent ...` lines. It logs START/DONE/FAIL timestamps to `agent_run.log` and tracks completed entries in `agent_progress.done`. Re-running `agent_commands.sh` in the same output directory skips completed entries and resumes from the first unfinished one.
Run the generated script (resume-safe by default):
bash sweeps/launch_outputs/<UTC timestamp>/agent_commands.sh

Concrete fresh-start example:
# 1. Remove generated experiment artifacts from previous runs
bash reset_experiment_workspace.sh --yes
# 2. Create fresh sweep IDs and generate a new launch_outputs/<UTC timestamp>/ directory
AGENT_COUNT_OXYGEN_LEVEL=30 AGENT_COUNT_HYPOXIA=30 bash sweeps/launch_all_sweeps.sh
# 3. Run the generated agent script from the new timestamped output directory
bash sweeps/launch_outputs/<UTC timestamp>/agent_commands.sh

Notes:

- Bunthaus sweep configs use `val_end=2022-12-31`.
- Blankenese and Seemannshoeft sweep configs use `val_end=2019-12-31`.
Saved W&B custom-chart templates are provided under vega_frontend_graphs/.
These files are intended for W&B tables logged by the ANN-vs-baseline comparison scripts, not for plain training runs.
They are reusable frontend templates, not universally correct drop-in charts for every run.
When reusing them, adapt them carefully to the specific case:
- switch `train` filters to `val` or `test` where needed
- make sure the queried table keys match the intended resolution and lead
- verify chart titles and labels after instantiation so they still describe the selected split, lead, station, and task correctly
- remember that some files are generic Vega specs while the paired GraphQL files may be concrete lead-specific instances
These examples illustrate oxygen-level forecasting outputs from the experiment workflow. They are included as qualitative/diagnostic examples; full quantitative comparisons are produced through the W&B tables and custom chart templates.
This section collects explanatory notes about how key outputs should be interpreted. It is intended as a reference appendix at the end of the README, and can be extended later if other recurring theoretical questions come up.
The oxygen-level comparison reports a per-lead bias and an aggregated split-level bias_mean.
For one split (train, val, or test) and one forecast lead h:
$$\mathrm{bias}_h = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_{i,h} - y_{i,h}\right)$$

Plain fallback:

bias_h = (1 / N) * sum(i = 1..N) [ y_hat(i,h) - y(i,h) ]
Where:
- `i`: sample index
- `h`: lead index (forecast step)
- `N`: number of samples in the split
- `y_hat[i,h]`: model prediction for sample `i`, lead `h`
- `y[i,h]`: target for sample `i`, lead `h`
The reported split-level bias_mean is the average of the per-lead biases:
$$\mathrm{bias\_mean} = \frac{1}{H}\sum_{h=1}^{H}\mathrm{bias}_h$$

Plain fallback:

bias_mean = (1 / H) * sum(h = 1..H) [ bias_h ]
Where:
- `H`: horizon length (number of forecast leads)
Interpretation:
- Positive bias means average over-prediction.
- Negative bias means average under-prediction.
- Bias near zero means signed errors cancel on average; it does not necessarily mean low RMSE or MAE.
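As a standalone numerical sketch of these definitions (not the comparison script itself), the per-lead bias and the split-level bias_mean reduce to two means over the signed errors:

```python
import numpy as np

# y_hat, y: arrays of shape (N samples, H leads) for one split (toy values).
y_hat = np.array([[8.0, 7.5], [6.0, 6.5], [7.0, 7.0]])
y     = np.array([[7.5, 7.5], [6.5, 6.0], [7.0, 7.5]])

bias_per_lead = (y_hat - y).mean(axis=0)  # bias_h, one value per lead
bias_mean = bias_per_lead.mean()          # split-level bias_mean

print(bias_per_lead)  # [0. 0.] even though individual errors are non-zero
print(bias_mean)      # 0.0
```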
In this project, the oxygen baseline is an ordinary least squares (OLS) linear regression model fit separately for each forecast lead, and the implementation includes an intercept term.
For in-sample OLS with an intercept, the residuals on the training data have mean exactly zero (up to floating-point precision). Since the bias is simply the mean signed residual, the baseline train bias is therefore expected to be essentially zero for each lead.
This is why baseline train bias often appears as 0 in the reported outputs. In practice, it is usually a very small floating-point value that gets rounded for display. Validation and test bias are not constrained this way, so they are typically non-zero.
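This property is easy to verify numerically. The sketch below fits an OLS model with an intercept on synthetic data and checks that the mean training residual is zero up to floating-point error; it is an illustration, not the project's baseline code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 7))
y = X @ rng.normal(size=7) + rng.normal(scale=0.5, size=500)

model = LinearRegression().fit(X, y)        # fit_intercept=True by default
train_bias = (model.predict(X) - y).mean()  # mean signed residual on the training data

print(train_bias)  # ~1e-16: effectively zero, as expected for in-sample OLS with an intercept
```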
Zero bias is easy to misread. It does not mean that the model is predicting the targets accurately at every sample. It only means that the signed errors average out.
For example:
| sample | prediction | target | error |
|---|---|---|---|
| 1 | 8 | 5 | +3 |
| 2 | 4 | 7 | -3 |
In this example, the bias is zero, but the model is wrong on both samples. RMSE would still be 3.
What zero bias means:
- no systematic direction of error
- over-predictions and under-predictions cancel on average
What zero bias does not mean:
- predictions are close to the targets
- RMSE or MAE are low
- the model is not overfitting
Overfitting would instead show up as training RMSE/MAE being much better than validation/test RMSE/MAE. A non-zero validation or test bias usually points more toward a systematic shift between training and evaluation periods, for example a temporal trend or broader distribution shift.
This repository is licensed under the Apache License, Version 2.0. See LICENSE.
Copyright 2026 Danu Caus and DeepElbe contributors.
The original R baseline implementation was authored by Ovidio García-Oliva /
Helmholtz-Zentrum Hereon GmbH and is available at
https://github.com/ovgarol/elbe-oxygen-prediction. The Python baseline in
baselines/python/ is a reimplementation/port of that baseline.
This repository extends that baseline work with the ANN/MLP forecasting code, daily and hourly training workflows, ANN-vs-baseline comparison scripts, W&B sweep configurations, and W&B custom-chart templates. These additions are also distributed under the Apache License, Version 2.0.
Downloaded DWD and WSV data are not redistributed in this repository and remain subject to the terms of their original providers.
This work was carried out at Helmholtz-Zentrum Hereon, Geesthacht, Germany.
This work was supported by Helmholtz Association's Initiative and Networking Fund through Helmholtz AI [grant number: ZT-I-PF-5-01].
This work used resources of the Deutsches Klimarechenzentrum (DKRZ) granted by its Scientific Steering Committee (WLA) under project ID AIM.






