Skip to content

feat!: introduce scoped dependencies#1233

Open
project-defiant wants to merge 12 commits into
devfrom
feat/extra-dependencies
Open

feat!: introduce scoped dependencies#1233
project-defiant wants to merge 12 commits into
devfrom
feat/extra-dependencies

Conversation

@project-defiant

@project-defiant project-defiant commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

Rationale

Full gentropy stack of dependencies is not needed to run most of the steps except L2G. The same idea is behind fine-mapping & LD access that requires hail. Current gentropy build on the cluster takes ~5minutes due to excessive amount of dependencies.

This PR attempts to isolate steps that require L2G dependencies and Hail dependencies into 2 separate installation scopes.

The benefit of this addition should be in simplified dependency resolution (hence startup time) for variant, coloc, credible_set cluster, while retaining L2G dependencies for L2G steps on a separate cluster.

Summary

Carves gentropy's heavy ML and hail stacks behind opt-in extras. After this PR pip install gentropy ships a Spark-only core; users opt into gentropy[hail], gentropy[l2g],
or gentropy[all] based on what steps they run. Docker, Dataproc, and CI continue to install the fat image ([all]), so production behaviour is unchanged.

BREAKING CHANGE for end users: plain pip install gentropy no longer pulls hail, xgboost, scikit-learn, skops, shap, matplotlib, wandb, or huggingface-hub.
A step that needs one of those raises ImportError with the exact pip command to fix it.

Install matrix (now in README + docs/installation.md)

Use case Command
Spark-only steps pip install gentropy
+ hail-backed datasources (gnomAD LD, FinnGen finemapping, PanUKBB LD, susie) pip install gentropy[hail]
+ L2G training, prediction, HF Hub publishing pip install gentropy[l2g]
Full pipeline pip install gentropy[all]

What changed

Packaging (pyproject.toml):

  • Move hail, xgboost{,-cpu}, scikit-learn, skops, shap, matplotlib, wandb, huggingface-hub from [dependencies] to [project.optional-dependencies] under
  • [dependency-groups].test self-references gentropy[all] so the test env stays fat.
  • Register hail and l2g pytest markers; partition by them.

Source-side guards:

  • New shared helper src/gentropy/common/imports.py with install_hint(extra) and optional_imports(extra) context manager — every guard funnels missing-import errors through
    one voice ("pip install gentropy[<extra>]").
  • Module-level guards at the top of intrinsically extra-bound files: datasource/{gnomad/ld,gnomad/variants,finngen/finemapping,pan_ukbb_ld/ld}.py (hail) and
    method/l2g/{model,trainer}.py, dataset/l2g_prediction.py (l2g).
  • Lazy in-function guards where heavy imports were already deferred: common/session.py, common/genomic_region.py, plus gentropy.l2g (top-level step module). The four
    non-training L2G step classes — LocusToGeneFeatureMatrixStep, LocusToGeneTrainTestSplitStep, LocusToGeneEvidenceStep, LocusToGeneAssociationsStep — remain importable and
    runnable on a core-only install; only LocusToGeneStep.run_train/run_predict pull the heavy stack.
  • SessionConfig.hail_home flips from eagerly-resolved os.path.dirname(hail.__file__) to str | None = None. Session._setup_hail_config already lazy-resolved it, so the
    eager default was just leaking import hail into the CLI boot path.
  • utils/spark.py::get_spark_testing_conf gains a with_hail: bool = False flag; the testing Spark conf is hail-free by default and only mixes in the Kryo registrator / hail
    jar paths when explicitly opted in.

Test infrastructure:

  • conftest.py spark fixture introspects collected items: builds a hail-enabled session iff at least one collected test carries the hail marker. Keeps pytest -m "not hail" honestly hail-free.
  • Existing hail- and L2G-using tests gain file-scope pytestmark = pytest.mark.hail / .l2g and pytest.importorskip so they're safe to collect on a core-only install.
  • New tests/gentropy/no_spark/test_hail_optional.py and test_l2g_optional.py pin the public contract — install hint wording, the specific pip command substring, that the
    lazy step module imports cleanly with the L2G stack stubbed out via a sys.meta_path finder. New test_config.py cases pin hail_home default + import-without-hail.

Build & CI:

  • Dockerfile: both uv sync invocations gain --all-extras so the production image still installs everything.
  • utils/install_dependencies_on_cluster.sh: Dataproc init switches from gentropy[hail] to gentropy[all].
  • .github/workflows/pr.yaml: install step now uv sync --all-groups --all-extras.
  • Makefile: make test partitions runs into hail vs non-hail and no_shared_spark vs shared — required because the spark fixture needs to know the partition's
    hail-marker state at fixture-build time. Also (small fix) the uv sync step uses --all-groups so the docs test can collect (mkdocs lives in [dependency-groups].docs).
  • tool.deptry: xgboost-cpu keeps the DEP002 ignore (no module of its own — platform-conditional drop-in); the xgboost ignore is dropped since it's now correctly declared
    via the extra.

Test plan

  • make test passes locally with --all-extras --all-groups.
  • tests/gentropy/no_spark/test_hail_optional.py simulates a hail-free env via a custom sys.meta_path finder and asserts pip install gentropy[hail] appears in the
    raised ImportError.
  • tests/gentropy/no_spark/test_l2g_optional.py does the same for the L2G stack, and additionally asserts gentropy.l2g imports cleanly without the heavy deps
    (lazy-imports work).
  • tests/gentropy/test_config.py::TestSessionConfig asserts hail_home defaults to None, is typed str | None, and is importable with hail blocked.

project-defiant and others added 7 commits June 14, 2026 09:42
Captures the design tree from the planning conversation: hail and l2g
optional extras, an aggregate all extra, phased rollout (Phase 0
config cleanup → Phase A hail → Phase B l2g), per-module guard
strategy, and test marker partitioning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The SessionConfig.hail_home field defaulted to os.path.dirname(hail.__file__),
which forced every CLI invocation to import hail at module load time —
even for steps that never touch hail.

Session._setup_hail_config already lazy-resolves hail_home when None is
passed, so the eager default was redundant pre-computation that leaked
the hail import into the CLI boot path.

This is a pure deletion: hail_home becomes str | None = None. No public
behavior changes for users who have hail installed; non-hail installs
are unblocked downstream.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The hail package is now an opt-in extra. `pip install gentropy` no longer
installs hail; users running hail-backed steps (gnomAD LD, FinnGen
finemapping, PanUKBB LD, susie finemapper) must install
`gentropy[hail]`.

Per-module guard strategy:
- Module-guard at top of file for intrinsically hail-bound datasources
  (gnomAD LD/variants, FinnGen finemapping, PanUKBB LD).
- Lazy in-function guards for hail imports already deferred to method
  bodies (common.session, common.genomic_region).
- `utils.spark` gains a `with_hail` flag; the testing Spark conf is now
  hail-free by default.

Test infrastructure partitions into two pytest invocations (hail-marked
vs not), with the session-scoped Spark fixture introspecting the marker
on collected items to build the appropriate Spark session.

Docker, Dataproc, and CI continue to install hail by default (fat image).
The `[l2g]` and `[all]` extras land in a follow-up.

BREAKING CHANGE: hail is no longer installed by default. Install
`gentropy[hail]` for hail-backed steps (gnomAD LD, FinnGen finemapping,
PanUKBB LD, susie finemapper).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…xtra

The L2G machine learning stack is now an opt-in extra. `pip install
gentropy` no longer installs xgboost, scikit-learn, skops, SHAP,
matplotlib, wandb, or huggingface-hub. Users running L2G training,
prediction, model load/save, or HF Hub publishing must install
`gentropy[l2g]`. A new `[all]` aggregate extra (= `[hail,l2g]`) is
introduced for the historical fat install.

Per-module guard strategy:
- Module-guard at top of file for intrinsically L2G-bound modules
  (method.l2g.model, method.l2g.trainer, dataset.l2g_prediction).
- Lazy refactor of the top-level `gentropy.l2g` step module: only
  `LocusToGeneStep.run_train` and `run_predict` pull L2G heavy deps.
  The four non-training step classes (FeatureMatrix, TrainTestSplit,
  Evidence, Associations) remain runnable on a core-only install.

Test infrastructure adds an `l2g` pytest marker; L2G-using tests use
the same in-body `pytest.importorskip` pattern Phase A established.
The `[dependency-groups].test` self-reference widens from
`gentropy[hail]` to `gentropy[all]`. The Dataproc init script switches
from `gentropy[hail]` to `gentropy[all]`. Docker and CI need no
further change.

BREAKING CHANGE: xgboost, scikit-learn, skops, shap, matplotlib,
wandb, and huggingface-hub are no longer installed by default.
Install `gentropy[l2g]` (or `gentropy[all]`) for L2G model training,
prediction, and HuggingFace Hub publishing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added documentation Improvements or additions to documentation Method Dataset Step Datasource size-XL labels Jun 14, 2026
@project-defiant project-defiant changed the title Feat/extra dependencies feat!(scoped dependencies) Jun 14, 2026
The PRD and per-issue specs were checked in to make the planning
artefacts visible during the rollout. They're personal-workflow
artefacts, not project documentation, so they don't belong in tree.

Add them to .gitignore so contributors can keep similar local notes
without polluting the index.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@project-defiant project-defiant linked an issue Jun 14, 2026 that may be closed by this pull request
3 tasks
@project-defiant project-defiant changed the title feat!(scoped dependencies) dep!: introduce scoped dependencies Jun 14, 2026
@project-defiant project-defiant changed the title dep!: introduce scoped dependencies dependencies: introduce scoped dependencies Jun 14, 2026
@project-defiant project-defiant changed the title dependencies: introduce scoped dependencies deps!: introduce scoped dependencies Jun 14, 2026
@project-defiant project-defiant changed the title deps!: introduce scoped dependencies deps: introduce scoped dependencies Jun 14, 2026
@project-defiant project-defiant changed the title deps: introduce scoped dependencies feat: introduce scoped dependencies Jun 14, 2026
@project-defiant project-defiant changed the title feat: introduce scoped dependencies feat!: introduce scoped dependencies Jun 14, 2026
@project-defiant project-defiant marked this pull request as ready for review June 14, 2026 21:31
@project-defiant project-defiant requested review from ireneisdoomed and removed request for ireneisdoomed June 15, 2026 12:47

@ireneisdoomed ireneisdoomed left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the initiative!!! My summary:

  1. The PR creates different submodule for Gentropy: l2g, hail, all. pip install gentropy will only install the bare spark modules
  2. Every module that imports hail or l2g dependencies adds some guardrails to check whether the dependency is in the environment. It raises a readable message otherwise
  3. All tests that use the optional dependencies are skipped if the dependency is not found
  4. Dataproc/Docker/CI install gentropy[all] so the behaviour in our pipelines shouldn't change

I have one comment that is not a deal-breaker: IMO, L2G prediction should be available in the bare Gentropy. This means that xgboost is installed as part of the main package. This is not currently straightforward to accomplish because L2GPrediction is bound to L2GModel, which imports wandb and huggingface. What do you think?



@contextmanager
def optional_imports(extra: str) -> Iterator[None]:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting pattern 😵‍💫

# Initialize Hail if requested
if start_hail:
import hail as hl
from gentropy.common.imports import optional_imports

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we import thin function in gentropy/common/__init__.py?
Looks a bit too repetitive to me. We'd go from from gentropy.common.imports import optional_imports to from gentropy.common import optional_imports

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

totally agree!

Comment thread src/utils/spark.py

if with_hail:
try:
import hail as hl

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use the context manager here?

Comment thread tests/gentropy/conftest.py Outdated
from gentropy.common.imports import optional_imports

with optional_imports("l2g"):
import shap

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be moved inside explain, it's the only function that uses it.

with optional_imports("l2g"):
import shap

from gentropy.method.l2g.model import LocusToGeneModel

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LocusToGeneModel is only used for type checking. I think we should be able to move it under a TYPE_CHECKING block

If that's the case, we could import L2GPrediction without the optional dependencies

Co-authored-by: Irene López Santiago <45119610+ireneisdoomed@users.noreply.github.com>
@project-defiant

Copy link
Copy Markdown
Contributor Author

@ireneisdoomed, I can see usecases that people actually do not need to run L2G training, but they actually still want to run the feature matrix, predictions and evidence with our model. So maybe the l2g scope is too wide. The only thing that I do not like with this approach is that we still need to install the xgboost with default gentropy, as the package is quiet large, but we might not be able to avoid it.

@ireneisdoomed

Copy link
Copy Markdown
Contributor

@ireneisdoomed, I can see usecases that people actually do not need to run L2G training, but they actually still want to run the feature matrix, predictions and evidence with our model. So maybe the l2g scope is too wide. The only thing that I do not like with this approach is that we still need to install the xgboost with default gentropy, as the package is quiet large, but we might not be able to avoid it.

Yes. It makes sense to me, but I agree it is a product decision

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Gentropy scoped dependencies

2 participants