Fix continuous ROC for binary observations with out-of-range forecasts (#442)#450
Fix continuous ROC for binary observations with out-of-range forecasts (#442)#450aaronspring wants to merge 1 commit into
Conversation
In `roc(..., bin_edges="continuous")` the binary observations were categorised with the forecast-derived threshold `i`. When forecast values fall outside the [0, 1] range of the observations, no observation ever lands in the event category, so the hit rate is always 0 and the area collapses (e.g. 0.0 instead of 1.0). Categorise observations with a fixed split at 0.5 in continuous mode, varying only the forecast threshold. Results now match `sklearn.metrics.roc_auc_score` regardless of forecast magnitude or input dimension ordering. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #450 +/- ##
==========================================
+ Coverage 94.49% 94.64% +0.14%
==========================================
Files 27 27
Lines 2836 2855 +19
==========================================
+ Hits 2680 2702 +22
+ Misses 156 153 -3 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
@paololucchino do you mind to review this PR fixing your issue #442 ? |
Verification: is this a
|
| Case | sklearn | xs explicit bin_edges |
xs "continuous" |
|---|---|---|---|
| A) Issue #442 data, forecast out of [0,1] (point 9, fc∈[1.23,1.81]) | 1.0000 | 0.0000 ❌ | 0.0000 ❌ |
| B) Probability forecast within [0,1] | 0.9843 | 0.9843 ✅ | 0.9843 ✅ |
So on main, explicit bin_edges matches sklearn only when the forecast (and the edges) lie within the [0,1] range of the binary observations (case B). For the out-of-range forecasts in #442 the explicit path is broken in the exact same way as continuous (point 9 → 0.0 instead of 1.0).
Why: both paths feed the same bin_edges to the observation categorisation. The binary observations (0/1) only fall into the "event" category if an edge sits between 0 and 1. Once forecasts — and thus the derived/passed edges — exceed 1, no observation is ever an "event", the hit rate is 0 at every threshold, and the area collapses. The bug is therefore not unique to continuous; it stems from categorising binary observations against forecast-scale edges.
Implication for this PR (confirms the scope is right): with the fix, "continuous" matches sklearn for all 10 points (case A), and the explicit bin_edges path is intentionally left unchanged. That's correct: the explicit path is a documented generalised multi-category ROC where observations and forecasts share categories — forcing a fixed 0.5 observation split there would break its documented behaviour (e.g. the o = f.copy(), category_edges = np.linspace(-2, 2, 5) doctest). The sklearn-equivalent binary-observation ROC is precisely what "continuous" mode is for, which is where this PR applies the fix.
Bottom line: explicit bin_edges on main would have matched sklearn only for in-range forecasts; for #442's out-of-range case it would not, so the fix belongs in the continuous path.
Summary
Fixes #442.
xss.roc(..., bin_edges="continuous")returned ROC AUC values that diverged sharply fromsklearn.metrics.roc_auc_score— down to0.0where sklearn returns1.0.Root cause
In continuous mode the observations are binary (
0/1), but the code categorised both observations and forecasts with the same forecast-derived thresholdi:The thresholds
iare drawn from forecast magnitudes. When forecasts fall outside the[0, 1]range of the binary observations, the binary observations never land in the "event" category, so the hit rate is0at every threshold and the area collapses.This was previously masked because typical probability forecasts (and the existing sklearn test, which
clip(0, 1)s the forecast) stay within[0, 1]. The reproducer in the issue shifts forecasts above1, exposing the bug. It is not a dimension-ordering problem — it reproduces in pure 1-D.Fix
In continuous mode, categorise observations with a fixed split at
0.5and vary only the forecast threshold. Output now matches sklearn for every point in the reproducer (error0.0across all 10 points), independent of forecast magnitude or input dim order.Tests
test_roc_continuous_forecast_out_of_unit_range_against_sklearn(parametrized overdrop_intermediate), using the multi-dimensional, transposed, out-of-range data from the issue and asserting per-point agreement withroc_auc_score. It fails onmainand passes with this change.test_probabilistic.py,test_contingency.py, and accessor suites pass (500 passed, 1 skipped); doctests pass.🤖 Generated with Claude Code