Here I describe how I implemented flexible combinations in my own effort, so that we can reach a consensus on how to implement this in drevalpy. I use the terms 'featurizer' (the drevalpy name) and 'encoder' (my name) interchangeably, and likewise 'drug' and 'perturbation'.
Registries
There are three registries: one for omics featurizers, one for perturbation/drug featurizers, and one for predictors. Each featurizer/predictor registers itself with the corresponding registry using a decorator:
@register_omics_encoder("pca", description="...")
class PCA(OmicsEncoder): ...
@register_predictor("randomForest", description="...")
class RandomForestPredictor(SklearnTabularPredictor): ...
For components that are based on literature models, the registry can store additional metadata, such as a publication DOI or a GitHub link. This makes it possible to get a comprehensive overview of the available components via a CLI command or the Python API.
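To make the discussion concrete, here is a minimal sketch of how such a registry decorator could work internally. It is not the actual implementation: `RegistryEntry`, `OMICS_ENCODERS`, `list_components` and the `doi`/`url` keyword arguments are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class RegistryEntry:
    """One registered component plus its (optional) metadata."""

    name: str
    cls: type
    description: str = ""
    doi: Optional[str] = None  # hypothetical metadata field
    url: Optional[str] = None  # hypothetical metadata field


# Hypothetical in-memory registry: component name -> entry.
OMICS_ENCODERS: Dict[str, RegistryEntry] = {}


def register_omics_encoder(
    name: str,
    *,
    description: str = "",
    doi: Optional[str] = None,
    url: Optional[str] = None,
) -> Callable[[type], type]:
    """Return a class decorator that stores the class under `name`."""

    def decorator(cls: type) -> type:
        if name in OMICS_ENCODERS:
            raise ValueError(f"omics encoder {name!r} is already registered")
        OMICS_ENCODERS[name] = RegistryEntry(name, cls, description, doi, url)
        return cls  # the class itself is returned unchanged

    return decorator


def list_components(registry: Dict[str, RegistryEntry]) -> List[str]:
    """Build a human-readable overview, e.g. for a CLI listing command."""
    entries = sorted(registry.values(), key=lambda entry: entry.name)
    return [f"{entry.name}: {entry.description}" for entry in entries]


@register_omics_encoder("pca", description="PCA on one omics matrix.")
class PCA:
    """Stand-in class; a real encoder would subclass OmicsEncoder."""
```

Rejecting duplicate names at registration time is one possible design choice; it keeps a user extension from silently shadowing a built-in component.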
Extending the registry
During development, having to modify the drevalpy source code to add new models is annoying (and users might modify parts of the package that are not relevant for adding models). This is even more problematic when evaluating new components in the Nextflow pipeline.
Registries can help here. By importing the decorator function from the package, users can extend the registry without touching the drevalpy source code. This can look as follows:
import anndata as ad
import numpy as np
from sklearn.linear_model import Ridge

from drevalpy.models.encoders.omics.base import OmicsEncoder
from drevalpy.models.predictors.sklearn_tabular import SklearnTabularPredictor
from drevalpy.models.registry import register_omics_encoder, register_predictor


@register_omics_encoder(
    "logGeneExpression",
    description="Log1p of one obsm matrix.",
    category="native",
)
class LogGeneExpression(OmicsEncoder):
    def __init__(self, *, obsm_key: str = "gene_expression") -> None:
        self._obsm_key = obsm_key
        self._n_features = 0

    def _fit_impl(self, adata: ad.AnnData, *, row_indices=None) -> None:
        self._n_features = adata.obsm[self._obsm_key].shape[1]

    def transform(self, adata: ad.AnnData) -> np.ndarray:
        x = np.asarray(adata.obsm[self._obsm_key], dtype=np.float32)
        return np.log1p(np.clip(x, 0.0, None))

    @property
    def output_dim(self) -> int:
        return self._n_features


@register_predictor(
    "myRidge",
    description="Ridge on concat(omics, perturbation).",
    category="general_purpose",
)
class MyRidgePredictor(SklearnTabularPredictor):
    def _make_estimator(self) -> Ridge:
        return Ridge(alpha=float(self._h.get("alpha", 1.0)))
And then it can be used as follows:
import my_extensions  # registers logGeneExpression + myRidge
from drevalpy.models import EncoderConfig, ModelConfig, PredictorConfig

config = ModelConfig(
    omics_encoder=EncoderConfig("logGeneExpression", registry="omics"),
    perturbation_encoder=EncoderConfig("oneHot", registry="perturbation"),
    predictor=PredictorConfig("myRidge", hyperparameters={"alpha": 0.5}),
)
model = config.create_model()
model.fit(adata, strategy, fold_index=0)
When used via the CLI (or via the Nextflow pipeline), users can provide a directory containing such files, which are then automatically registered and usable.
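A minimal sketch of how such directory-based registration could work, using only the standard library: importing each file runs its decorators, which is what adds the components to the registries. The function name `load_extensions` and the module-name prefix are hypothetical, and the actual loader may differ.

```python
import importlib.util
import sys
from pathlib import Path
from typing import List


def load_extensions(directory: str) -> List[str]:
    """Import every *.py file in `directory`; return the module names used.

    Executing a module runs its @register_... decorators as a side effect.
    """
    loaded = []
    for path in sorted(Path(directory).glob("*.py")):
        # Hypothetical prefix to avoid clashing with installed packages.
        module_name = f"drevalpy_user_ext_{path.stem}"
        spec = importlib.util.spec_from_file_location(module_name, path)
        if spec is None or spec.loader is None:
            continue  # not importable as a Python source file
        module = importlib.util.module_from_spec(spec)
        sys.modules[module_name] = module  # make the module importable by name
        spec.loader.exec_module(module)  # runs the file's top-level code
        loaded.append(module_name)
    return loaded
```

Sorting the files makes the registration order deterministic, which matters if two files try to register the same name.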
Flexible combinations
Since every component can be looked up by a string in its registry, a model can be specified with a structured string. I used the pattern <omics_encoder>:<drug_encoder>:<predictor>. From this string we can dynamically construct the equivalent of a DRPmodel. Each component comes with a defined hyperparameter space, and the hyperparameter space of the triplet is the union of the per-component hyperparameter spaces, so all components can be tuned jointly.
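As an illustration, the string pattern and the combined hyperparameter space could be handled roughly like this. All names here (`ModelSpec`, `parse_model_spec`, `combined_search_space`, `iter_configurations`) are hypothetical, and prefixing each hyperparameter with its component role is an assumption I make so that the union cannot produce name collisions.

```python
import itertools
from typing import Any, Dict, Iterator, List, NamedTuple


class ModelSpec(NamedTuple):
    omics_encoder: str
    drug_encoder: str
    predictor: str


def parse_model_spec(spec: str) -> ModelSpec:
    """Split '<omics_encoder>:<drug_encoder>:<predictor>' into its parts."""
    parts = [part.strip() for part in spec.split(":")]
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            f"expected '<omics_encoder>:<drug_encoder>:<predictor>', got {spec!r}"
        )
    return ModelSpec(*parts)


def combined_search_space(
    per_component: Dict[str, Dict[str, List[Any]]]
) -> Dict[str, List[Any]]:
    """Union of the per-component spaces, keyed as '<role>.<hyperparameter>'."""
    combined: Dict[str, List[Any]] = {}
    for role, space in per_component.items():
        for name, candidates in space.items():
            combined[f"{role}.{name}"] = list(candidates)
    return combined


def iter_configurations(space: Dict[str, List[Any]]) -> Iterator[Dict[str, Any]]:
    """Enumerate the full grid over the combined space."""
    keys = sorted(space)
    for values in itertools.product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


spec = parse_model_spec("pca:oneHot:randomForest")
# Illustrative candidate values only; real components define their own spaces.
space = combined_search_space(
    {
        "omics": {"n_components": [10, 50]},
        "perturbation": {},
        "predictor": {"n_estimators": [100, 500]},
    }
)
grid = list(iter_configurations(space))
```

A grid is only the simplest way to tune the combined space; any search strategy that accepts a flat dictionary of candidates would work the same way.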
Model zoo
Some predictors (e.g. PharmaFormer) might require a specific pair of featurizers before the combination can really be called PharmaFormer. To give such combinations a name, I created a so-called 'model zoo', which stores them under a friendly name. A zoo entry can also configure hyperparameter spaces and default hyperparameters.
Additional model zoo configurations can be provided to the CLI or the Nextflow pipeline via a zoo directory, and these self-made configurations can reference self-registered components.
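A zoo entry can be pictured as a named triplet plus optional defaults and search space. The sketch below is an assumption about the shape of such an entry, not the actual format: `ZooEntry`, `MODEL_ZOO`, `resolve_model` and the entry name `ridgeBaseline` are hypothetical, while the component names are the ones from the extension example above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(frozen=True)
class ZooEntry:
    """A named combination with optional defaults and search space."""

    combination: str  # '<omics_encoder>:<drug_encoder>:<predictor>'
    defaults: Dict[str, Any] = field(default_factory=dict)
    search_space: Dict[str, List[Any]] = field(default_factory=dict)


# Hypothetical zoo content; real entries would come from configuration files.
MODEL_ZOO: Dict[str, ZooEntry] = {
    "ridgeBaseline": ZooEntry(
        combination="logGeneExpression:oneHot:myRidge",
        defaults={"alpha": 0.5},
        search_space={"alpha": [0.1, 0.5, 1.0]},
    ),
}


def resolve_model(name_or_combination: str, zoo: Dict[str, ZooEntry]) -> ZooEntry:
    """Accept either a friendly zoo name or a raw triplet string."""
    if name_or_combination in zoo:
        return zoo[name_or_combination]
    if name_or_combination.count(":") == 2:
        # Ad-hoc combination: no zoo-level defaults or search space.
        return ZooEntry(combination=name_or_combination)
    raise KeyError(f"unknown model or malformed combination: {name_or_combination!r}")
```

Resolving friendly names and raw triplets through the same function means the rest of the pipeline only ever has to deal with one kind of object.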
Featurizer type checking
Most predictors get dense representations from their featurizers, but some also require graphs or sequence representations. The registry decorator requires a featurizer type for each featurizer, and an omics/perturbation featurizer type for each predictor. During construction of the combined model, we can thus ensure that a valid combination was selected before running into an actual type mismatch.
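The compatibility check described above could look roughly like the following sketch. The enum `FeatureType`, the two lookup tables and `check_combination` are hypothetical, as are the component names `molGraph` and `graphNet`; in the real registries the type information would come from the decorator arguments.

```python
from enum import Enum
from typing import Dict, Tuple


class FeatureType(Enum):
    DENSE = "dense"
    GRAPH = "graph"
    SEQUENCE = "sequence"


# Hypothetical registry metadata: what each featurizer produces ...
ENCODER_OUTPUT: Dict[str, FeatureType] = {
    "pca": FeatureType.DENSE,
    "oneHot": FeatureType.DENSE,
    "molGraph": FeatureType.GRAPH,
}

# ... and which (omics, perturbation) input types each predictor expects.
PREDICTOR_INPUT: Dict[str, Tuple[FeatureType, FeatureType]] = {
    "randomForest": (FeatureType.DENSE, FeatureType.DENSE),
    "graphNet": (FeatureType.DENSE, FeatureType.GRAPH),
}


def check_combination(omics: str, drug: str, predictor: str) -> None:
    """Raise TypeError at construction time if the triplet is incompatible."""
    want_omics, want_drug = PREDICTOR_INPUT[predictor]
    got_omics, got_drug = ENCODER_OUTPUT[omics], ENCODER_OUTPUT[drug]
    problems = []
    if got_omics is not want_omics:
        problems.append(
            f"{predictor!r} expects {want_omics.value} omics features, "
            f"but {omics!r} produces {got_omics.value}"
        )
    if got_drug is not want_drug:
        problems.append(
            f"{predictor!r} expects {want_drug.value} perturbation features, "
            f"but {drug!r} produces {got_drug.value}"
        )
    if problems:
        raise TypeError("; ".join(problems))
```

Collecting all mismatches before raising gives the user one complete error message instead of one failure per attempt.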
Let me know what you think about this.