feat!: introduce scoped dependencies#1233
Conversation
Captures the design tree from the planning conversation: hail and l2g optional extras, an aggregate all extra, phased rollout (Phase 0 config cleanup → Phase A hail → Phase B l2g), per-module guard strategy, and test marker partitioning. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The SessionConfig.hail_home field defaulted to os.path.dirname(hail.__file__), which forced every CLI invocation to import hail at module load time — even for steps that never touch hail. Session._setup_hail_config already lazy-resolves hail_home when None is passed, so the eager default was redundant pre-computation that leaked the hail import into the CLI boot path. This is a pure deletion: hail_home becomes str | None = None. No public behavior changes for users who have hail installed; non-hail installs are unblocked downstream. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The hail package is now an opt-in extra. `pip install gentropy` no longer installs hail; users running hail-backed steps (gnomAD LD, FinnGen finemapping, PanUKBB LD, susie finemapper) must install `gentropy[hail]`. Per-module guard strategy: - Module-guard at top of file for intrinsically hail-bound datasources (gnomAD LD/variants, FinnGen finemapping, PanUKBB LD). - Lazy in-function guards for hail imports already deferred to method bodies (common.session, common.genomic_region). - `utils.spark` gains a `with_hail` flag; the testing Spark conf is now hail-free by default. Test infrastructure partitions into two pytest invocations (hail-marked vs not), with the session-scoped Spark fixture introspecting the marker on collected items to build the appropriate Spark session. Docker, Dataproc, and CI continue to install hail by default (fat image). The `[l2g]` and `[all]` extras land in a follow-up. BREAKING CHANGE: hail is no longer installed by default. Install `gentropy[hail]` for hail-backed steps (gnomAD LD, FinnGen finemapping, PanUKBB LD, susie finemapper). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…xtra The L2G machine learning stack is now an opt-in extra. `pip install gentropy` no longer installs xgboost, scikit-learn, skops, SHAP, matplotlib, wandb, or huggingface-hub. Users running L2G training, prediction, model load/save, or HF Hub publishing must install `gentropy[l2g]`. A new `[all]` aggregate extra (= `[hail,l2g]`) is introduced for the historical fat install. Per-module guard strategy: - Module-guard at top of file for intrinsically L2G-bound modules (method.l2g.model, method.l2g.trainer, dataset.l2g_prediction). - Lazy refactor of the top-level `gentropy.l2g` step module: only `LocusToGeneStep.run_train` and `run_predict` pull L2G heavy deps. The four non-training step classes (FeatureMatrix, TrainTestSplit, Evidence, Associations) remain runnable on a core-only install. Test infrastructure adds an `l2g` pytest marker; L2G-using tests use the same in-body `pytest.importorskip` pattern Phase A established. The `[dependency-groups].test` self-reference widens from `gentropy[hail]` to `gentropy[all]`. The Dataproc init script switches from `gentropy[hail]` to `gentropy[all]`. Docker and CI need no further change. BREAKING CHANGE: xgboost, scikit-learn, skops, shap, matplotlib, wandb, and huggingface-hub are no longer installed by default. Install `gentropy[l2g]` (or `gentropy[all]`) for L2G model training, prediction, and HuggingFace Hub publishing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…all aggregate extra
The PRD and per-issue specs were checked in to make the planning artefacts visible during the rollout. They're personal-workflow artefacts, not project documentation, so they don't belong in tree. Add them to .gitignore so contributors can keep similar local notes without polluting the index. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ireneisdoomed
left a comment
There was a problem hiding this comment.
Thank you for the initiative!!! My summary:
- The PR creates different submodule for Gentropy:
l2g,hail,all.pip install gentropywill only install the bare spark modules - Every module that imports hail or l2g dependencies adds some guardrails to check whether the dependency is in the environment. It raises a readable message otherwise
- All tests that use the optional dependencies are skipped if the dependency is not found
- Dataproc/Docker/CI install
gentropy[all]so the behaviour in our pipelines shouldn't change
I have one comment that is not a deal-breaker: IMO, L2G prediction should be available in the bare Gentropy. This means that xgboost is installed as part of the main package. This is not currently straightforward to accomplish because L2GPrediction is bound to L2GModel, which imports wandb and huggingface. What do you think?
|
|
||
|
|
||
| @contextmanager | ||
| def optional_imports(extra: str) -> Iterator[None]: |
There was a problem hiding this comment.
Interesting pattern 😵💫
| # Initialize Hail if requested | ||
| if start_hail: | ||
| import hail as hl | ||
| from gentropy.common.imports import optional_imports |
There was a problem hiding this comment.
Could we import thin function in gentropy/common/__init__.py?
Looks a bit too repetitive to me. We'd go from from gentropy.common.imports import optional_imports to from gentropy.common import optional_imports
There was a problem hiding this comment.
totally agree!
|
|
||
| if with_hail: | ||
| try: | ||
| import hail as hl |
There was a problem hiding this comment.
Could we use the context manager here?
| from gentropy.common.imports import optional_imports | ||
|
|
||
| with optional_imports("l2g"): | ||
| import shap |
There was a problem hiding this comment.
This could be moved inside explain, it's the only function that uses it.
| with optional_imports("l2g"): | ||
| import shap | ||
|
|
||
| from gentropy.method.l2g.model import LocusToGeneModel |
There was a problem hiding this comment.
LocusToGeneModel is only used for type checking. I think we should be able to move it under a TYPE_CHECKING block
If that's the case, we could import L2GPrediction without the optional dependencies
Co-authored-by: Irene López Santiago <45119610+ireneisdoomed@users.noreply.github.com>
|
@ireneisdoomed, I can see usecases that people actually do not need to run L2G training, but they actually still want to run the feature matrix, predictions and evidence with our model. So maybe the l2g scope is too wide. The only thing that I do not like with this approach is that we still need to install the xgboost with default gentropy, as the package is quiet large, but we might not be able to avoid it. |
Yes. It makes sense to me, but I agree it is a product decision |
Rationale
Full gentropy stack of dependencies is not needed to run most of the steps except L2G. The same idea is behind fine-mapping & LD access that requires hail. Current gentropy build on the cluster takes ~5minutes due to excessive amount of dependencies.
This PR attempts to isolate steps that require L2G dependencies and Hail dependencies into 2 separate installation scopes.
The benefit of this addition should be in simplified dependency resolution (hence startup time) for variant, coloc, credible_set cluster, while retaining L2G dependencies for L2G steps on a separate cluster.
Summary
Carves gentropy's heavy ML and hail stacks behind opt-in extras. After this PR
pip install gentropyships a Spark-only core; users opt intogentropy[hail],gentropy[l2g],or
gentropy[all]based on what steps they run. Docker, Dataproc, and CI continue to install the fat image ([all]), so production behaviour is unchanged.BREAKING CHANGE for end users: plain
pip install gentropyno longer pullshail,xgboost,scikit-learn,skops,shap,matplotlib,wandb, orhuggingface-hub.A step that needs one of those raises
ImportErrorwith the exact pip command to fix it.Install matrix (now in README + docs/installation.md)
pip install gentropypip install gentropy[hail]pip install gentropy[l2g]pip install gentropy[all]What changed
Packaging (
pyproject.toml):hail,xgboost{,-cpu},scikit-learn,skops,shap,matplotlib,wandb,huggingface-hubfrom[dependencies]to[project.optional-dependencies]under[dependency-groups].testself-referencesgentropy[all]so the test env stays fat.hailandl2gpytest markers; partition by them.Source-side guards:
src/gentropy/common/imports.pywithinstall_hint(extra)andoptional_imports(extra)context manager — every guard funnels missing-import errors throughone voice (
"pip install gentropy[<extra>]").datasource/{gnomad/ld,gnomad/variants,finngen/finemapping,pan_ukbb_ld/ld}.py(hail) andmethod/l2g/{model,trainer}.py,dataset/l2g_prediction.py(l2g).common/session.py,common/genomic_region.py, plusgentropy.l2g(top-level step module). The fournon-training L2G step classes —
LocusToGeneFeatureMatrixStep,LocusToGeneTrainTestSplitStep,LocusToGeneEvidenceStep,LocusToGeneAssociationsStep— remain importable andrunnable on a core-only install; only
LocusToGeneStep.run_train/run_predictpull the heavy stack.SessionConfig.hail_homeflips from eagerly-resolvedos.path.dirname(hail.__file__)tostr | None = None.Session._setup_hail_configalready lazy-resolved it, so theeager default was just leaking
import hailinto the CLI boot path.utils/spark.py::get_spark_testing_confgains awith_hail: bool = Falseflag; the testing Spark conf is hail-free by default and only mixes in the Kryo registrator / hailjar paths when explicitly opted in.
Test infrastructure:
conftest.pysparkfixture introspects collected items: builds a hail-enabled session iff at least one collected test carries thehailmarker. Keepspytest -m "not hail"honestly hail-free.pytestmark = pytest.mark.hail/.l2gandpytest.importorskipso they're safe to collect on a core-only install.tests/gentropy/no_spark/test_hail_optional.pyandtest_l2g_optional.pypin the public contract — install hint wording, the specific pip command substring, that thelazy step module imports cleanly with the L2G stack stubbed out via a
sys.meta_pathfinder. Newtest_config.pycases pinhail_homedefault + import-without-hail.Build & CI:
Dockerfile: bothuv syncinvocations gain--all-extrasso the production image still installs everything.utils/install_dependencies_on_cluster.sh: Dataproc init switches fromgentropy[hail]togentropy[all]..github/workflows/pr.yaml: install step nowuv sync --all-groups --all-extras.Makefile:make testpartitions runs intohailvs non-hailandno_shared_sparkvs shared — required because thesparkfixture needs to know the partition'shail-marker state at fixture-build time. Also (small fix) the
uv syncstep uses--all-groupsso the docs test can collect (mkdocslives in[dependency-groups].docs).tool.deptry:xgboost-cpukeeps theDEP002ignore (no module of its own — platform-conditional drop-in); thexgboostignore is dropped since it's now correctly declaredvia the extra.
Test plan
make testpasses locally with--all-extras --all-groups.tests/gentropy/no_spark/test_hail_optional.pysimulates a hail-free env via a customsys.meta_pathfinder and assertspip install gentropy[hail]appears in theraised
ImportError.tests/gentropy/no_spark/test_l2g_optional.pydoes the same for the L2G stack, and additionally assertsgentropy.l2gimports cleanly without the heavy deps(lazy-imports work).
tests/gentropy/test_config.py::TestSessionConfigassertshail_homedefaults toNone, is typedstr | None, and is importable with hail blocked.