
[MNT] Update scikit-learn upper bound#3250

Draft
MatthewMiddlehurst wants to merge 17 commits into main from mm/sklearn

Conversation

@MatthewMiddlehurst
Member

@MatthewMiddlehurst MatthewMiddlehurst commented Jan 19, 2026

Problem estimators:

  1. rotation_forest_regressor: I've regenerated the results, as the changes are genuine (see below)
  2. weasel: seems to be a dtype issue with the word counts

@aeon-actions-bot aeon-actions-bot bot added the maintenance Continuous integration, unit testing & package distribution label Jan 19, 2026
@aeon-actions-bot
Contributor

aeon-actions-bot bot commented Jan 19, 2026

Thank you for contributing to aeon

I have added the following labels to this PR based on the title: [ maintenance ].

The Checks tab will show the status of our automated tests. You can click on individual test runs in the tab or "Details" in the panel below to see more information if there is a failure.

If our pre-commit code quality check fails, any trivial fixes will automatically be pushed to your PR unless it is a draft.

Don't hesitate to ask questions on the aeon Discord channel if you have any.

PR CI actions

These checkboxes will add labels to enable/disable CI functionality for this PR. This may not take effect immediately, and a new commit may be required to run the new configuration.

  • Run pre-commit checks for all files
  • Run mypy typecheck tests
  • Run all pytest tests and configurations
  • Run all notebook example tests
  • Run numba-disabled codecov tests
  • Stop automatic pre-commit fixes (always disabled for drafts)
  • Disable numba cache loading
  • Regenerate expected results for testing
  • Push an empty commit to re-run CI checks

@MatthewMiddlehurst MatthewMiddlehurst added the full pytest actions Run the full pytest suite on a PR label Mar 24, 2026
@TonyBagnall
Contributor

Just looking at this.

Catch22Regressor: this seems to be a matter of precision.
Expected:
array([0.638218, 1.090666, 0.583235, 1.575507, 0.484134,
0.709761, 1.332061, 1.099275, 1.516734, 0.316833])
Got:
array([0.63821896, 1.0906666 , 0.58323551, 1.57550709, 0.48413489,
0.70976176, 1.33206165, 1.09927538, 1.51673405, 0.31683308])
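A quick check with the printed numbers (which are rounded for display) confirms the two arrays agree to within 1e-6, i.e. this is a display-precision difference, not an algorithmic change:

```python
import numpy as np

# The printed "Expected" and "Got" arrays from the failing comparison above.
expected = np.array([0.638218, 1.090666, 0.583235, 1.575507, 0.484134,
                     0.709761, 1.332061, 1.099275, 1.516734, 0.316833])
got = np.array([0.63821896, 1.0906666, 0.58323551, 1.57550709, 0.48413489,
                0.70976176, 1.33206165, 1.09927538, 1.51673405, 0.31683308])

# Every element matches to within 1e-6.
np.testing.assert_allclose(got, expected, atol=1e-6)
```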

@TonyBagnall
Contributor

TonyBagnall commented Apr 9, 2026

FAILED aeon/classification/dictionary_based/tests/test_weasel.py::test_weasel_v2_score - AssertionError:
Arrays are not almost equal to 4 decimals
ACTUAL: 0.5454545454545454
DESIRED: 0.90909
FAILED aeon/classification/dictionary_based/tests/test_weasel.py::test_weasel_score - AssertionError:
Arrays are not almost equal to 4 decimals
ACTUAL: 0.5454545454545454
DESIRED: 0.727272

@TonyBagnall
Contributor

FAILED aeon/regression/sklearn/tests/test_rotation_forest_regressor.py::test_rotf_output - AssertionError:
Arrays are not almost equal to 4 decimals

Mismatched elements: 15 / 15 (100%)
Max absolute difference among violations: 0.0045145
Max relative difference among violations: 0.1914099
ACTUAL: array([0.026 , 0.0245, 0.0224, 0.0453, 0.0892, 0.0314, 0.026 , 0.0451,
0.0287, 0.04 , 0.026 , 0.0378, 0.0265, 0.0356, 0.0281])
DESIRED: array([0.0269, 0.0269, 0.02 , 0.0428, 0.0903, 0.0271, 0.0255, 0.0408,
0.029 , 0.0425, 0.0269, 0.0367, 0.0236, 0.0344, 0.0236])
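For what it's worth, recomputing the reported mismatch statistics from the printed arrays (which are rounded, so the results are approximate) gives roughly the same numbers as the log:

```python
import numpy as np

actual = np.array([0.026, 0.0245, 0.0224, 0.0453, 0.0892, 0.0314, 0.026, 0.0451,
                   0.0287, 0.04, 0.026, 0.0378, 0.0265, 0.0356, 0.0281])
desired = np.array([0.0269, 0.0269, 0.02, 0.0428, 0.0903, 0.0271, 0.0255, 0.0408,
                    0.029, 0.0425, 0.0269, 0.0367, 0.0236, 0.0344, 0.0236])

# numpy reports max absolute and max relative (|a - d| / |d|) differences
# over the mismatched elements.
abs_diff = np.abs(actual - desired)
rel_diff = abs_diff / np.abs(desired)
print(abs_diff.max(), rel_diff.max())  # ~0.0045 and ~0.19, matching the log
```

All 15 elements shift by a small amount, consistent with an ensemble-wide perturbation rather than a single broken prediction.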

@TonyBagnall
Contributor

I looked at the rotation forest regressor, and it looks like our test case hits a fix in scikit-learn.

Claude this time:

_"From the 1.8 changelog:

"Fixed a regression in decision trees where almost constant features were not handled properly" scikit-learn (#32259)
"Fixed splitting logic during training in tree.DecisionTree* (and consequently in ensemble.RandomForest*) for nodes containing near-constant feature values and missing values" scikit-learn
"Fix decision tree splitting with missing values present in some features. In some cases the last non-missing sample would not be partitioned correctly" scikit-learn (#32351)

Rotation Forest is a near-perfect trigger for #1 and #2. Each tree is trained on PCA-rotated features built from a bootstrap sample of only 3 columns at a time. It's extremely common for one of the three PCA components on a bootstrap subsample to capture almost no variance — i.e. a "near-constant feature." sklearn 1.7 split those columns one way; 1.8 splits them another (correctly). That cascades through every tree in the ensemble, so every prediction shifts a little — exactly the pattern you're seeing (all 15 elements differ, magnitude ~1–20% relative, no element wildly off)."_

and ChatGPT:

_The most likely upstream cause of the 1.7 -> 1.8 drift is scikit-learn PR #32259, which landed in 1.8. The 1.8 release notes describe it as fixing a regression where almost constant features were not handled properly in decision trees. The PR discussion is more explicit: after the tree refactor in PR #29458, FEATURE_THRESHOLD was accidentally initialised to 0.0 instead of 1e-7, and 1.8 fixes that. The constant is commented as being there to mitigate precision differences between 32-bit and 64-bit.

That fits RotationForestRegressor unusually well, because aeon casts the rotated PCA features to float32 before fitting the tree. So a tree-side fix specifically about how near-constant features are treated under 32-bit vs 64-bit precision is exactly the sort of change that can move RotF predictions while leaving the high-level algorithm unchanged_
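A small sketch of the mechanism both analyses point at (the data and variable names here are illustrative, not aeon's actual code): PCA over a small group of correlated columns routinely yields a trailing component with almost no variance, and rotation forest then hands those features to the tree as float32:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=100)
# Three highly correlated columns, as a rotation-forest feature subgroup
# drawn from a bootstrap sample might look.
X = np.column_stack([base + 1e-6 * rng.normal(size=100) for _ in range(3)])

pca = PCA(n_components=3).fit(X)
rotated = pca.transform(X).astype(np.float32)  # per the comment, aeon fits trees on float32

# The trailing components carry essentially zero variance: the
# "near-constant features" whose splitting behaviour changed in 1.8.
print(pca.explained_variance_ratio_)
```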

@TonyBagnall
Contributor

ChatGPT thinks this:

Yes. These two failures are not the logistic-regression issue.

WEASEL uses RidgeClassifierCV on its default path because support_probabilities=False by default, and WEASEL_V2 also fits a RidgeClassifierCV. So test_weasel_score and test_weasel_v2_score are both going through the same sklearn ridge-CV path, not the LogisticRegression(liblinear) path.

The common upstream input is also important: both classifiers feed bag-of-words count features into that ridge classifier. In aeon’s SFA code, sparse bags are explicitly built as csr_matrix(..., dtype=np.uint32), and the dense bag constructors also allocate np.uint32 count arrays. WEASEL stacks those bags straight into RidgeClassifierCV, and WEASEL_V2 does the same after building dense SFA outputs with return_sparse=False.

The likely source of the regression is the sklearn 1.8 ridge refactor. The release notes say RidgeCV, RidgeClassifier, and RidgeClassifierCV gained array-API support in 1.8. In the 1.8 ridge.py source, RidgeGCV.fit now records original_dtype = X.dtype, fits in a high-precision float dtype, and then casts intercept, dual_coef, and coef_ back to original_dtype at the end. That original_dtype capture and end-of-fit cast-back are not present in the 1.7.2 source.

For WEASEL and WEASEL_V2, that means the ridge model is very likely being fit from uint32 count matrices and then having its learned parameters cast back to uint32. For a linear classifier, that is disastrous because the coefficients and intercept are supposed to be signed floating-point values. That neatly explains why both tests drop together under 1.8, despite using different transforms: the shared failure point is RidgeClassifierCV receiving integer-count bags and then storing integer-cast parameters.
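If that reading of the 1.8 ridge code is right, the damage is easy to picture: casting a linear model's learned parameters to an unsigned integer dtype truncates every fractional coefficient, and mangles negative ones outright (the values below are illustrative, not real WEASEL coefficients):

```python
import numpy as np

coef = np.array([1.3, 0.05, -0.8])  # plausible signed float coefficients
cast = coef[:2].astype(np.uint32)   # what a cast back to uint32 does

# Fractional values truncate toward zero...
print(cast)  # [1 0]
# ...and casting the negative -0.8 to an unsigned dtype is
# platform-dependent/undefined, so it would come back as garbage either way.
```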

The fix in aeon should be simple: cast the bag matrix to float before calling RidgeClassifierCV.fit in both places. In practice, just before fitting:

all_words = all_words.astype(np.float64, copy=False)

for WEASEL, and

words = words.astype(np.float64, copy=False)

for WEASEL_V2.
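A self-contained sketch of that fix, with synthetic uint32 counts standing in for the real SFA bags (csr_matrix and RidgeClassifierCV are the actual classes involved; the data is made up):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import RidgeClassifierCV

rng = np.random.default_rng(42)
# Stand-in for a WEASEL bag-of-words matrix: uint32 counts, as aeon's SFA builds.
counts = rng.integers(0, 5, size=(20, 8), dtype=np.uint32)
y = np.array([0, 1] * 10)

all_words = csr_matrix(counts)
# The proposed fix: make the input float before fitting, so any dtype
# round-trip inside ridge stays in floating point.
all_words = all_words.astype(np.float64, copy=False)

clf = RidgeClassifierCV().fit(all_words, y)
print(clf.coef_.dtype)  # float64 coefficients, as a linear model needs
```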


Labels

full pytest actions Run the full pytest suite on a PR maintenance Continuous integration, unit testing & package distribution


3 participants