Description
We explicitly control the seed for data splitting and, in some cases, for model state (where the underlying model has a random_state field). Finer-grained, consistent control would be beneficial, particularly for seeding the initialization of deep learning models and for handling bootstrap-resampled data when use_bagging=True. What's there now works and isn't necessarily causing issues; it could just be better implemented.
Key Issues
1. Inconsistent Seed Handling
- Overloaded Seed Source: Both workflow types derive the model/ensemble seed from self.split.random_state. This conflates data splitting reproducibility with model initialization reproducibility. Changing the data split seed shouldn't necessarily force a change in model initialization seeds, and vice-versa.
- Single-Model vs. Ensemble:
  - Ensemble: Logic exists to seed each member (global_seed + i).
  - Single-Model: There is no explicit seeding in the _train methods for single models. Reproducibility relies entirely on external configuration or defaults.
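One way to decouple the two concerns described above is to carry the split seed and the model seed as separate settings and derive per-member seeds only from the latter. The sketch below is a minimal illustration, not the existing implementation: SeedConfig, split_seed, model_seed, and member_seeds are hypothetical names, and it uses numpy.random.SeedSequence in place of the current global_seed + i scheme. A single model is treated as an ensemble of one, so both cases share the same seeding path.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass(frozen=True)
class SeedConfig:
    """Hypothetical container that keeps split and model seeds separate."""

    split_seed: int = 42  # would feed the data splitter only
    model_seed: int = 0   # would feed model/ensemble initialization only

    def member_seeds(self, n_members: int = 1) -> List[int]:
        # Spawn one child sequence per member from the model seed alone,
        # so changing split_seed never changes these values.
        children = np.random.SeedSequence(self.model_seed).spawn(n_members)
        return [int(child.generate_state(1)[0]) for child in children]
```

With this layout, changing split_seed leaves every member seed untouched, and the single-model seed is simply the first ensemble member's seed.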
2. Model Type Differences and Gaps
- Classic ML (AnvilWorkflow):
  - Relies on checking hasattr(model, "random_state"). If a model doesn't have this attribute, the seed is ignored with a warning, leading to uncontrolled behavior.
  - There is no fallback mechanism for models that might accept a seed via a different argument or global state.
- Deep Learning (AnvilDeepLearningWorkflow):
  - Scratch Training: Correctly calls pl.seed_everything(global_seed + i).
  - Finetuning/Deserialization: Completely omits pl.seed_everything. This means finetuning runs are not guaranteed to be reproducible even if the split seed is fixed.
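For the classic ML gap, a fallback could try several seed attribute names before giving up. The following is a rough sketch under assumptions: apply_seed is a hypothetical helper, and the alternative names "seed" and "random_seed" are guesses at what other model classes might expose, not a list taken from the codebase.

```python
import warnings


def apply_seed(model, seed, param_names=("random_state", "seed", "random_seed")):
    """Set the first matching seed attribute on `model`.

    Returns the attribute name that was set, or None if none matched.
    The names in `param_names` other than "random_state" are assumptions.
    """
    for name in param_names:
        if hasattr(model, name):
            setattr(model, name, seed)
            return name
    # Nothing matched: warn loudly so the unseeded model is visible in logs.
    warnings.warn(
        f"{type(model).__name__} exposes no known seed attribute; "
        f"seed {seed} was not applied."
    )
    return None
```

Returning the matched name lets the caller record which mechanism was used, or decide to fail instead of continuing with an unseeded model.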
3. Bootstrapping Issues
- Global State Mutation: The bootstrapping logic calls np.random.seed(...) inside the loop. This mutates the global NumPy random state, which can have side effects on other parts of the program or libraries running in the same process.
- Recommendation: Use a local numpy.random.Generator instance (e.g., rng = np.random.default_rng(seed)) for sampling indices.
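The recommendation above can be sketched as follows. This is a minimal example, assuming a hypothetical helper named bootstrap_indices; how the indices are then applied to the training data depends on the workflow and is not shown.

```python
import numpy as np


def bootstrap_indices(n_samples: int, seed: int) -> np.ndarray:
    """Sample n_samples row indices with replacement using a local generator."""
    # A local Generator keeps the draw reproducible without touching
    # the global state behind np.random.seed / np.random.*.
    rng = np.random.default_rng(seed)
    return rng.integers(0, n_samples, size=n_samples)
```

For ensemble member i, the helper could be called with that member's seed (for example global_seed + i, as in the existing ensemble logic), so each member gets a distinct but reproducible resample.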
4. Unaccounted Paths
- DL Finetuning: As noted above, the deserialize path in DL workflows lacks seeding.
- Featurization: In the DL ensemble loop, featurizers are re-created for each member. If a featurizer has any stochastic components, not every path seeds it explicitly before creation.
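One way to close both gaps is to route every path (scratch, finetuning, deserialization) through a single step that seeds first and only then creates the featurizer and the model. The sketch below is illustrative only: build_member and its parameters are hypothetical, and the seeding function is injected so the example runs without a deep learning framework. In the DL workflow, seed_fn would presumably be pl.seed_everything.

```python
from typing import Any, Callable, Tuple


def build_member(
    member_index: int,
    model_seed: int,
    seed_fn: Callable[[int], Any],
    make_featurizer: Callable[[], Any],
    make_model: Callable[[], Any],
) -> Tuple[Any, Any]:
    """Seed, then create the featurizer and model for one ensemble member.

    `make_model` may build a fresh model or deserialize one for finetuning;
    either way it runs after seeding.
    """
    # Seed before anything stochastic is constructed.
    seed_fn(model_seed + member_index)
    featurizer = make_featurizer()
    model = make_model()
    return featurizer, model
```

Because the ordering lives in one function, a new path cannot skip seeding by accident.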