
Implement deferred imports #552

Draft
smcolby wants to merge 4 commits into main from enh/550/deferred-imports

@smcolby smcolby commented May 18, 2026

Description

Importing openadmet.models.registries paid the full import cost of every third-party library in the package (~6.7s cold), regardless of which components were actually needed. This PR replaces the "import everything at once" approach with deferred (lazy) imports across all four component groups (models, featurizers, splitters, trainers/evaluators). Resolves #550.

Results

| Benchmark | Before | After | Speedup |
|---|---|---|---|
| import openadmet.models.registries | 6.702s | 0.111s | 60× |
| import openadmet.models | 0.044s | 0.017s | 2.6× |
| registries + load_all() | 6.702s | 3.727s | 1.8× (+ fully deferrable) |
| architecture/model_base.py | 1.652s | 0.070s | 24× |
| architecture/chemprop.py | 3.083s | 0.101s | 30× |
| split/cluster.py | 3.524s | 0.331s | 11× |
| trainer/lightning.py | 1.653s | 0.069s | 24× |
| eval/regression.py | 1.582s | 0.326s | 4.9× |
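
The description does not say how these timings were collected. For anyone who wants to reproduce numbers of this kind, one possible approach is to time the import in a fresh interpreter so nothing is cached in sys.modules. The helper name `cold_import_seconds` is hypothetical, and the measurement includes interpreter startup, so it is only a rough sketch:

```python
import subprocess
import sys
import time


def cold_import_seconds(module: str, repeats: int = 3) -> float:
    """Return the best wall-clock time for `import <module>` in a fresh interpreter.

    Hypothetical helper, not part of this PR. The figure includes interpreter
    startup, so compare it against a baseline such as `import sys`.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        # A new process guarantees an empty sys.modules (a "cold" import).
        subprocess.run([sys.executable, "-c", f"import {module}"], check=True)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    # A stdlib module stands in for openadmet.models.registries here.
    print(f"import json: {cold_import_seconds('json'):.3f}s")
```

For a per-module breakdown, running `python -X importtime -c "import <module>"` is another option.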

Changes

Split model_base.py

  • Extracted LightningModuleBase and LightningModelBase into a new architecture/lightning_model_base.py. All torch / lightning imports are isolated there.
  • model_base.py uses PEP 562 module __getattr__ to lazily re-export the Lightning classes, preserving backward-compatible from model_base import LightningModelBase without paying the torch cost at import time.
  • Deferred joblib inside save() / load().
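
As a rough illustration of the PEP 562 re-export described above, the sketch below builds a small facade module whose `__getattr__` resolves a name only on first access. Everything here is a stand-in: `facade_sketch`, `_LAZY_EXPORTS`, and the use of `json.decoder` in place of architecture/lightning_model_base.py are assumptions for the example, not the code in this PR.

```python
import importlib
import sys
import types

# Hypothetical mapping: public name -> module that really defines it.
# A stdlib module stands in for the heavy torch/lightning module.
_LAZY_EXPORTS = {"JSONDecoder": "json.decoder"}

resolved = []  # records which names went through the lazy hook


def _module_getattr(name):
    """PEP 562 hook: called only when normal module attribute lookup fails."""
    target = _LAZY_EXPORTS.get(name)
    if target is None:
        raise AttributeError(f"module 'facade_sketch' has no attribute {name!r}")
    value = getattr(importlib.import_module(target), name)
    resolved.append(name)
    setattr(facade, name, value)  # cache so the hook is not hit again
    return value


# In a real package this function would simply be defined as `__getattr__`
# at the top level of model_base.py; a synthetic module keeps the sketch
# self-contained.
facade = types.ModuleType("facade_sketch")
facade.__getattr__ = _module_getattr
sys.modules["facade_sketch"] = facade

# The backward-compatible import style keeps working, and the target module
# is resolved only at this point.
from facade_sketch import JSONDecoder
```

Caching the resolved attribute on the module means later lookups take the normal path and never reach the hook again.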

Deferred estimator class imports

  • Replaced mod_class: ClassVar[type] = SomeThirdPartyClass with a _get_estimator_class() classmethod in every concrete pickleable model (xgboost, catboost, lgbm, rf, svm, tabpfn, dummy). Each classmethod contains a local import that fires only at first build() call.
  • Moved _METRIC_TO_LOSS dict from module level into chemprop.build().
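
A minimal sketch of the `_get_estimator_class()` pattern, assuming a simplified base class. `PickleableModelSketch` and `NormalModelSketch` are hypothetical names, and `statistics.NormalDist` stands in for a third-party estimator such as an XGBoost or CatBoost class; the real base class in the package will differ.

```python
class PickleableModelSketch:
    """Hypothetical base class; not the actual openadmet implementation."""

    def __init__(self, **params):
        self.params = params
        self.estimator = None

    @classmethod
    def _get_estimator_class(cls) -> type:
        raise NotImplementedError

    def build(self):
        # The third-party import happens here, on the first build() call,
        # rather than when the defining module is imported.
        if self.estimator is None:
            self.estimator = self._get_estimator_class()(**self.params)
        return self.estimator


class NormalModelSketch(PickleableModelSketch):
    @classmethod
    def _get_estimator_class(cls) -> type:
        # Local import: the stand-in for a heavy estimator library.
        from statistics import NormalDist

        return NormalDist


model = NormalModelSketch(mu=1.0, sigma=2.0)  # nothing heavy imported yet
estimator = model.build()  # the import fires here
```

Compared with a class-level `mod_class` attribute, the classmethod keeps the class definition free of any reference to the third-party type until it is needed.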

Deferred imports in features / split / trainer / eval

  • features/feature_base.py: heavy molfeat / torch imports moved to TYPE_CHECKING block; from __future__ import annotations added.
  • features/chemprop.py: removed a self-import bug (the module was importing names from itself); all chemprop / torch / sklearn imports deferred inside featurize() and _vendor_build_dataloader().
  • split/scaffold.py: splito and sklearn.model_selection.train_test_split deferred inside each split() method.
  • split/cluster.py: useful_rdkit_utils, datamol, molfeat, KMeans deferred inside split(); removed unused GroupShuffleSplit import.
  • trainer/lightning.py: torch, lightning, and all callbacks/loggers deferred inside build() / train().
  • eval/regression.py: wandb deferred inside if self.use_wandb: blocks; scipy.stats, sklearn.metrics, and seaborn deferred by converting the class-level _metrics dict into a _base_metrics() classmethod and moving plot imports inside plot methods.
  • eval/eval_base.py: scipy.stats.bootstrap deferred inside stat_and_bootstrap().
  • eval/cross_validation.py: stopped importing removed module-level names from regression.py; wrap_ktau / wrap_spearmanr now do local imports.
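
Two of the patterns above, sketched with stdlib stand-ins: a TYPE_CHECKING-only import for annotations, and a class-level metrics dict turned into a classmethod with local imports. `FeaturizerSketch` and `EvaluatorSketch` are hypothetical; `fractions` and `statistics` take the place of molfeat, scipy, and sklearn. The annotation is quoted here so the block runs anywhere; with `from __future__ import annotations` at the top of a real module, the quotes are unnecessary.

```python
from typing import TYPE_CHECKING, Callable

if TYPE_CHECKING:
    # Seen by static type checkers only; never executed at runtime.
    from fractions import Fraction


class FeaturizerSketch:
    def featurize(self, text: str) -> "Fraction":
        # Deferred import: paid on the first call, not at module import.
        from fractions import Fraction

        return Fraction(text)


class EvaluatorSketch:
    @classmethod
    def _base_metrics(cls) -> dict[str, Callable]:
        # A class-level dict would force these imports at class creation;
        # building the dict inside a classmethod defers them.
        from statistics import fmean, pstdev

        return {"mean": fmean, "std": pstdev}

    def evaluate(self, values: list[float]) -> dict[str, float]:
        return {name: fn(values) for name, fn in self._base_metrics().items()}


ratio = FeaturizerSketch().featurize("3/4")
report = EvaluatorSketch().evaluate([1.0, 2.0, 3.0])
```

Python caches modules in sys.modules, so repeating a local import on every call costs only a dictionary lookup after the first one.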

Lazy registry loading

  • New openadmet/models/_registry_loader.py exposes load_group(name) and load_all() (both idempotent), using importlib.import_module. Zero heavy imports at module level.
  • registries.py rewritten to only import the six registry objects (models, featurizers, splitters, trainers, evaluators, ensemblers) plus re-export load_all.
  • Every get_*_class() function now calls load_group() before the registry lookup, so any single-component usage auto-loads only what it needs.
  • anvil/specification.py and anvil/workflow_base.py updated to import load_all instead of from registries import *.
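
The loader described above could look roughly like the following. This is a sketch under stated assumptions, not the contents of _registry_loader.py: the group-to-module mapping, the `_REGISTRIES` dicts, and the auto-registration loop are all hypothetical, and stdlib modules stand in for the real component modules (which presumably register their classes as an import side effect).

```python
import importlib

# Hypothetical mapping of group name -> modules that define its components.
_GROUP_MODULES = {"models": ("fractions",), "splitters": ("textwrap",)}
_REGISTRIES: dict[str, dict[str, type]] = {"models": {}, "splitters": {}}
_loaded_groups: set[str] = set()
import_log: list[str] = []  # for demonstration only


def load_group(name: str) -> None:
    """Import every module in a group once; repeated calls are no-ops."""
    if name in _loaded_groups:
        return
    if name not in _GROUP_MODULES:
        raise ValueError(f"unknown group {name!r}; valid: {sorted(_GROUP_MODULES)}")
    for module_name in _GROUP_MODULES[name]:
        module = importlib.import_module(module_name)
        import_log.append(module_name)
        # Stand-in for registration: record classes defined in the module.
        for attr, obj in vars(module).items():
            if isinstance(obj, type) and obj.__module__ == module.__name__:
                _REGISTRIES[name][attr] = obj
    _loaded_groups.add(name)


def load_all() -> None:
    """Load every group; safe to call more than once."""
    for name in _GROUP_MODULES:
        load_group(name)


def get_mod_class(class_name: str) -> type:
    """Load only the models group, then look the class up."""
    load_group("models")
    try:
        return _REGISTRIES["models"][class_name]
    except KeyError:
        raise ValueError(f"{class_name} not found in models catalogue") from None
```

With this shape, a caller that only needs one model class triggers the models group alone, while workflows that need everything call `load_all()` explicitly.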

Quality Assurance & AI Policy

To maintain project quality and respect maintainer bandwidth, please confirm the following:

  • Manual Verification: I have manually reviewed and tested the code in this PR.
  • AI-Assisted Content: If AI tools were used (e.g., Copilot, ChatGPT), I have personally verified the logic, edge cases, and compliance with the existing codebase. I confirm the code is not a "blind" AI generation.
  • Minimal Review: I believe this PR is in a state that requires minimal intervention or correction from maintainers.
  • Scoped Change: This PR addresses a single, well-scoped issue rather than multiple unrelated changes.

Status

  • Ready to go (Checking this signals to maintainers that the PR is ready for final review)

Developers Certificate of Origin


Note to Contributors: We reserve the right to close PRs without review if they appear to lack human validation or do not meet the quality standards described in our CONTRIBUTING.md.

smcolby and others added 4 commits May 17, 2026 16:30
… load

Baseline: import openadmet.models.registries = 6.702s
After: import openadmet.models.registries = 0.111s (60x faster)

Phase 1 - Split model_base.py:
- Create architecture/lightning_model_base.py isolating all torch/lightning imports
- Strip model_base.py of torch/lightning/joblib top-level imports
- Add PEP 562 module __getattr__ for lazy LightningModelBase re-export
- Defer joblib inside save()/load() method bodies
- Result: architecture/model_base.py 1.652s -> 0.070s

Phase 2 - Deferred estimator class imports:
- Replace mod_class: ClassVar[type] = SomeClass with _get_estimator_class()
  classmethod in all concrete architecture modules (xgboost, catboost, lgbm,
  rf, svm, tabpfn, dummy)
- Remove all top-level 3rd-party imports from these modules
- Move _METRIC_TO_LOSS dict initialization inside chemprop build()
- Result: each arch module 2-3s -> ~0.1s

Phase 3 - Deferred imports in features/split/trainer/eval:
- feature_base.py: move molfeat/torch imports to TYPE_CHECKING block
- features/chemprop.py: remove self-import bug; defer all chemprop/torch/sklearn
  imports inside featurize() and _vendor_build_dataloader()
- split/scaffold.py: defer splito and sklearn.model_selection inside split()
- split/cluster.py: defer useful_rdkit_utils, datamol, molfeat, KMeans inside
  split(); remove unused GroupShuffleSplit import
- trainer/lightning.py: defer torch and lightning imports inside build()/train()
- eval/regression.py: defer wandb, scipy.stats, sklearn.metrics, seaborn inside
  their respective usage methods; convert _metrics class var to _base_metrics()
  classmethod; fix cross_validation.py to not import removed module-level names
- eval/eval_base.py: defer scipy.stats.bootstrap inside stat_and_bootstrap()
- Result: registries 6.702s -> 3.548s

Phase 4 - Lazy registry loading:
- Create _registry_loader.py with idempotent load_group()/load_all() functions
  using importlib.import_module; zero heavy imports at module level
- Rewrite registries.py to only import base registry objects and expose load_all()
- Add load_group() call to each get_*_class() function for on-demand loading
- Update anvil/specification.py and anvil/workflow_base.py to import load_all
  instead of wildcard-importing registries
- Result: import openadmet.models.registries 6.702s -> 0.111s (60x faster)

Before/after summary:
  import openadmet.models.registries: 6.702s -> 0.111s
  architecture/model_base.py:         1.652s -> 0.070s
  architecture/xgboost.py:            2.123s -> 0.099s
  architecture/chemprop.py:           3.083s -> 0.101s
  split/cluster.py:                   3.524s -> 0.331s
  split/scaffold.py:                  1.476s -> 0.330s
  trainer/lightning.py:               1.653s -> 0.069s
  eval/regression.py:                 1.582s -> 0.326s
  registries + load_all():            N/A    -> 3.727s (same real cost, deferred)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- load_group(): add Parameters section with valid group keys
- load_all(): expand summary line
- get_mod_class(): add full Parameters/Returns/Raises sections
- get_featurizer_class(): add full Parameters/Returns/Raises sections
- get_ensemble_class(): add full Parameters/Returns/Raises sections
- RegressionEvaluator._base_metrics(): add Returns section

All 5 D413 blank-line-after-section violations auto-fixed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

get_transform_class() was the only get_*_class() function not calling
load_group() before the registry lookup. This caused 'ImputeTransform not
found in transform catalogue' in integration tests because the transforms
group was never eagerly loaded under the new lazy registry.

Also adds the missing Raises section to the docstring.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@smcolby smcolby self-assigned this May 18, 2026

Development

Successfully merging this pull request may close these issues.

[ENH] Speed up import openadmet.models through deferred imports
