[ENH] Inconsistent random seed handling #499

## Description
We explicitly control the seed for data splitting and, in some cases, for model state (where the underlying model has a `random_state` field). Finer-grained, consistent control would be beneficial, particularly for seeding the initialization of deep learning models and for handling bootstrap-resampled data when `use_bagging=True`. What's there now *works* and isn't necessarily causing issues; it just could be better implemented.

## Key Issues
### 1. Inconsistent Seed Handling
- Overloaded Seed Source: Both workflow types derive the model/ensemble seed from `self.split.random_state`. This conflates data-splitting reproducibility with model-initialization reproducibility. Changing the data split seed shouldn't necessarily force a change in model initialization seeds, and vice versa.
- Single-Model vs. Ensemble:
    - Ensemble: Logic exists to seed each member (`global_seed + i`).
    - Single-Model: There is no explicit seeding in the `_train` methods for single models. Reproducibility relies entirely on external configuration or defaults.
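One possible way to decouple the two concerns is to derive a split seed and per-member model seeds from a single master seed. The sketch below is only an illustration: `derive_seeds` and `master_seed` are hypothetical names that do not exist in the codebase, and it assumes NumPy's `SeedSequence` is an acceptable source of child seeds. A single model is treated as an ensemble of one, so it gets an explicit seed too.

```python
import numpy as np


def derive_seeds(master_seed: int, n_members: int = 1) -> tuple[int, list[int]]:
    """Derive one split seed and `n_members` model seeds from a master seed.

    Hypothetical helper: the split and model seeds come from separate
    child sequences of the same root SeedSequence.
    """
    root = np.random.SeedSequence(master_seed)
    # One child sequence for data splitting, one for model initialization.
    split_ss, model_ss = root.spawn(2)
    split_seed = int(split_ss.generate_state(1)[0])
    # Each ensemble member gets its own child of the model sequence.
    member_seeds = [
        int(child.generate_state(1)[0]) for child in model_ss.spawn(n_members)
    ]
    return split_seed, member_seeds


# A single model is just an ensemble of size one.
split_seed, member_seeds = derive_seeds(master_seed=42, n_members=3)
```

With this shape, the split seed and the model seeds could also be exposed as two independent configuration fields, so that either can be overridden without touching the other.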

### 2. Model Type Differences and Gaps
- Classic ML (`AnvilWorkflow`):
    - Relies on checking `hasattr(model, "random_state")`. If a model doesn't have this attribute, the seed is ignored with a warning, leading to uncontrolled behavior.
    - There is no fallback mechanism for models that might accept a seed via a different argument or global state.
- Deep Learning (`AnvilDeepLearningWorkflow`):
    - Scratch Training: Correctly calls `pl.seed_everything(global_seed + i)`.
    - Finetuning/Deserialization: Completely omits `pl.seed_everything`. This means finetuning runs are not guaranteed to be reproducible even if the split seed is fixed.
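For the classic ML gap, a fallback could try a short list of seed-like attribute names before giving up. This is a rough sketch, not the current implementation: `apply_seed` and `SEED_ATTR_NAMES` are hypothetical, and the candidate names other than `random_state` are guesses that would need checking against the models actually supported.

```python
import warnings

# Hypothetical candidate attribute names; only `random_state` is taken
# from the existing check, the others are assumptions.
SEED_ATTR_NAMES = ("random_state", "seed", "random_seed")


def apply_seed(model, seed: int) -> bool:
    """Set the first seed-like attribute found on `model`.

    Returns True if a seed was applied, False (with a warning) otherwise.
    """
    for name in SEED_ATTR_NAMES:
        if hasattr(model, name):
            setattr(model, name, seed)
            return True
    warnings.warn(
        f"{type(model).__name__} exposes no known seed attribute; "
        f"seed {seed} was not applied",
        stacklevel=2,
    )
    return False
```

Returning a boolean lets the caller decide whether an unseeded model is acceptable or should fail the run. The finetuning path would still need its own `pl.seed_everything` call, which this helper does not cover.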

### 3. Bootstrapping Issues
- Global State Mutation: The bootstrapping logic calls `np.random.seed(...)` inside the loop. This mutates the global NumPy random state, which can have side effects on other parts of the program or libraries running in the same process.
- Recommendation: Use a local `numpy.random.Generator` instance (e.g., `rng = np.random.default_rng(seed)`) for sampling indices.
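The recommendation could look roughly like the following. `bootstrap_indices` is a hypothetical helper name, and the sketch assumes the bootstrap draws `n_samples` indices with replacement, which may not match the existing sampling logic exactly.

```python
import numpy as np


def bootstrap_indices(n_samples: int, seed: int) -> np.ndarray:
    """Sample `n_samples` row indices with replacement.

    Uses a local Generator, so the global NumPy random state is untouched.
    """
    rng = np.random.default_rng(seed)
    return rng.integers(0, n_samples, size=n_samples)


# Per-member usage inside an ensemble loop (global_seed is illustrative).
global_seed = 42
member_indices = [bootstrap_indices(100, global_seed + i) for i in range(3)]
```

Because each call builds its own generator, the draw for member `i` depends only on its seed and not on how many random numbers other code consumed beforehand.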

### 4. Unaccounted Paths
- DL Finetuning: As noted above, the deserialize path in DL workflows lacks seeding.
- Featurization: In the DL ensemble loop, featurizers are re-created, but if they have any stochastic components, they are not explicitly seeded before creation in all paths.
