Flexible featurizer-predictor combinations and registries

I will describe here how I implemented flexible combinations in my own work, so that we can build consensus on how to implement them in drevalpy. I use the terms 'featurizer' (the drevalpy term) and 'encoder' (my term) interchangeably, and likewise 'drug' and 'perturbation'.

# Registries

There are three registries: one for omics featurizers, one for perturbation/drug featurizers, and one for predictors. Each featurizer or predictor registers itself with the corresponding registry using a decorator:

```python
@register_omics_encoder("pca", description="...")
class PCA(OmicsEncoder): ...

@register_predictor("randomForest", description="...")
class RandomForestPredictor(SklearnTabularPredictor): ...
```

For components based on models from the literature, the registry can store additional metadata, such as a publication DOI or a GitHub link. This makes it possible to get a comprehensive overview of the available components via a CLI command or the Python API.
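To make the mechanism concrete, here is a minimal, self-contained sketch of how a decorator-based registry with optional metadata could work. The names `Registry`, `RegistryEntry`, `overview`, and the `doi`/`url` fields are illustrative assumptions, not the actual API of my implementation or of drevalpy.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class RegistryEntry:
    """One registered component plus optional literature metadata (hypothetical)."""

    name: str
    cls: type
    description: str
    doi: Optional[str] = None
    url: Optional[str] = None


class Registry:
    """Toy registry that maps a string name to a component class."""

    def __init__(self, kind: str) -> None:
        self.kind = kind
        self._entries: Dict[str, RegistryEntry] = {}

    def register(
        self,
        name: str,
        *,
        description: str,
        doi: Optional[str] = None,
        url: Optional[str] = None,
    ) -> Callable[[type], type]:
        def decorator(cls: type) -> type:
            # Refuse silent overwrites so name clashes surface at import time.
            if name in self._entries:
                raise ValueError(f"{self.kind} '{name}' is already registered")
            self._entries[name] = RegistryEntry(name, cls, description, doi, url)
            return cls  # the decorated class is returned unchanged

        return decorator

    def get(self, name: str) -> type:
        return self._entries[name].cls

    def overview(self) -> List[str]:
        # One line per component, sorted by name, e.g. for a CLI listing.
        entries = sorted(self._entries.values(), key=lambda e: e.name)
        return [f"{e.name}: {e.description}" for e in entries]


omics_encoders = Registry("omics encoder")


@omics_encoders.register("pca", description="PCA on one obsm matrix.")
class PCA:
    pass
```

Because registration happens as a side effect of importing the module that defines the class, simply importing a file is enough to make its components available by name.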

## Extending the registry

During development, having to modify the drevalpy source code to add new models is annoying (and users might end up modifying parts of the package that have nothing to do with adding models). This is even more of a problem when evaluating new components in the Nextflow pipeline.

Registries can help here. By importing the decorator functions from the package, users can extend the registry without touching the drevalpy source code. This could look as follows:

```python
import anndata as ad
import numpy as np
from sklearn.linear_model import Ridge

from drevalpy.models.encoders.omics.base import OmicsEncoder
from drevalpy.models.predictors.sklearn_tabular import SklearnTabularPredictor
from drevalpy.models.registry import register_omics_encoder, register_predictor


@register_omics_encoder(
    "logGeneExpression",
    description="Log1p of one obsm matrix.",
    category="native",
)
class LogGeneExpression(OmicsEncoder):
    def __init__(self, *, obsm_key: str = "gene_expression") -> None:
        self._obsm_key = obsm_key
        self._n_features = 0

    def _fit_impl(self, adata: ad.AnnData, *, row_indices=None) -> None:
        self._n_features = adata.obsm[self._obsm_key].shape[1]

    def transform(self, adata: ad.AnnData) -> np.ndarray:
        x = np.asarray(adata.obsm[self._obsm_key], dtype=np.float32)
        return np.log1p(np.clip(x, 0.0, None))

    @property
    def output_dim(self) -> int:
        return self._n_features


@register_predictor(
    "myRidge",
    description="Ridge on concat(omics, perturbation).",
    category="general_purpose",
)
class MyRidgePredictor(SklearnTabularPredictor):
    def _make_estimator(self) -> Ridge:
        return Ridge(alpha=float(self._h.get("alpha", 1.0)))
```

It can then be used as follows:

```python
import my_extensions  # registers logGeneExpression + myRidge

from drevalpy.models import EncoderConfig, ModelConfig, PredictorConfig

config = ModelConfig(
    omics_encoder=EncoderConfig("logGeneExpression", registry="omics"),
    perturbation_encoder=EncoderConfig("oneHot", registry="perturbation"),
    predictor=PredictorConfig("myRidge", hyperparameters={"alpha": 0.5}),
)

model = config.create_model()
# `adata` and `strategy` are assumed to be defined by the caller.
model.fit(adata, strategy, fold_index=0)
```

When running via the CLI (or the Nextflow pipeline), users can provide a directory containing such files, which are then automatically registered and usable.
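A rough sketch of how such a directory could be loaded, using only the standard library. The function name `load_extensions` and the `drevalpy_ext_` module prefix are assumptions made for illustration. Importing each file runs its registration decorators as a side effect.

```python
import importlib.util
import sys
from pathlib import Path
from typing import List


def load_extensions(directory: str) -> List[str]:
    """Import every *.py file in `directory` and return the module names used."""
    loaded = []
    for path in sorted(Path(directory).glob("*.py")):
        # Hypothetical prefix to avoid clashing with installed packages.
        module_name = f"drevalpy_ext_{path.stem}"
        spec = importlib.util.spec_from_file_location(module_name, path)
        if spec is None or spec.loader is None:
            continue  # not importable as a Python module, skip it
        module = importlib.util.module_from_spec(spec)
        sys.modules[module_name] = module
        # Executing the module triggers any @register_* decorators inside it.
        spec.loader.exec_module(module)
        loaded.append(module_name)
    return loaded
```

Sorting the files keeps the registration order deterministic, which helps when two extension files accidentally register the same name.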

# Flexible combinations

Since every component can now be looked up by a string in its registry, a model can be constructed from a structured string. I used the pattern `<omics_encoder>:<drug_encoder>:<predictor>`. From it, we can dynamically construct the equivalent of a `DRPmodel`. Each component comes with a defined hyperparameter space, and the hyperparameter space of the triplet is the union of the per-component spaces, so the hyperparameters can be tuned jointly.
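As a sketch of the idea, the snippet below parses such a triplet string and merges per-component hyperparameter spaces. `ModelSpec`, `parse_model_spec`, and `combined_search_space` are hypothetical names, and prefixing each hyperparameter with its component role is my assumption for avoiding name clashes in the union, not necessarily how it has to be done.

```python
from typing import Any, Dict, List, NamedTuple


class ModelSpec(NamedTuple):
    omics_encoder: str
    drug_encoder: str
    predictor: str


def parse_model_spec(spec: str) -> ModelSpec:
    """Split '<omics_encoder>:<drug_encoder>:<predictor>' into its parts."""
    parts = [p.strip() for p in spec.split(":")]
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            f"expected '<omics_encoder>:<drug_encoder>:<predictor>', got '{spec}'"
        )
    return ModelSpec(*parts)


def combined_search_space(
    spaces: Dict[str, Dict[str, List[Any]]]
) -> Dict[str, List[Any]]:
    """Union of per-component spaces, keyed as '<role>.<hyperparameter>'."""
    combined: Dict[str, List[Any]] = {}
    for role, space in spaces.items():
        for name, values in space.items():
            combined[f"{role}.{name}"] = list(values)
    return combined


spec = parse_model_spec("pca:oneHot:randomForest")
space = combined_search_space(
    {
        "omics_encoder": {"n_components": [50, 100]},
        "drug_encoder": {},
        "predictor": {"n_estimators": [100, 500]},
    }
)
```

The resulting flat dictionary can be handed to any tuner that expects a single search space, while the prefix makes it easy to route each sampled value back to the right component.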

## Model zoo

Certain predictors (e.g. PharmaFormer) may require a specific pair of featurizers before the combination can really be called `PharmaFormer`. To give such combinations a name, I created a so-called 'model zoo', which stores them under a friendly name. Hyperparameter spaces and default hyperparameters can also be configured there.

Additional model zoo configurations can be supplied via a zoo directory passed to the CLI or the Nextflow pipeline. These custom zoo configurations can also reference self-registered components.
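One possible shape for a zoo entry, sketched as plain Python. `ZooEntry`, `MODEL_ZOO`, `resolve`, and the `myBaseline` entry are illustrative assumptions. The on-disk format of zoo configurations is not specified here, so this only shows the lookup logic.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


@dataclass
class ZooEntry:
    """A named combination of registered components (hypothetical structure)."""

    omics_encoder: str
    drug_encoder: str
    predictor: str
    default_hyperparameters: Dict[str, Any] = field(default_factory=dict)

    def as_spec(self) -> str:
        return f"{self.omics_encoder}:{self.drug_encoder}:{self.predictor}"


# Hypothetical zoo with a single friendly name.
MODEL_ZOO: Dict[str, ZooEntry] = {
    "myBaseline": ZooEntry("pca", "oneHot", "randomForest", {"n_estimators": 200}),
}


def resolve(name_or_spec: str) -> Tuple[str, Dict[str, Any]]:
    """Return (triplet string, default hyperparameters) for a zoo name or raw triplet."""
    entry = MODEL_ZOO.get(name_or_spec)
    if entry is None:
        # Not a zoo name: treat the input as a raw triplet without defaults.
        return name_or_spec, {}
    return entry.as_spec(), dict(entry.default_hyperparameters)
```

With this, a friendly name and a raw triplet string can be accepted by the same CLI argument.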

## Featurizer type checking

Most predictors get dense representations from their featurizers, but some require graph or sequence representations instead. The registry decorator requires a featurizer type for each featurizer, and an expected omics and perturbation featurizer type for each predictor. When the combined model is constructed, we can therefore verify that the combination is valid before running into an actual type mismatch.
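The check itself can be quite small. Below is a sketch under the assumption that featurizer types are stored as an enum alongside each registry entry. `FeatureType`, the two lookup tables, and the `molGraph` and `myGNN` components are made up for illustration.

```python
from enum import Enum
from typing import Dict, Tuple


class FeatureType(Enum):
    DENSE = "dense"
    GRAPH = "graph"
    SEQUENCE = "sequence"


# Hypothetical metadata that the registry decorators would collect.
ENCODER_OUTPUT: Dict[str, FeatureType] = {
    "pca": FeatureType.DENSE,
    "oneHot": FeatureType.DENSE,
    "molGraph": FeatureType.GRAPH,
}
# Per predictor: (expected omics type, expected perturbation type).
PREDICTOR_INPUT: Dict[str, Tuple[FeatureType, FeatureType]] = {
    "randomForest": (FeatureType.DENSE, FeatureType.DENSE),
    "myGNN": (FeatureType.DENSE, FeatureType.GRAPH),
}


def check_combination(omics: str, drug: str, predictor: str) -> None:
    """Raise TypeError if the encoders do not produce what the predictor expects."""
    expected = PREDICTOR_INPUT[predictor]
    actual = (ENCODER_OUTPUT[omics], ENCODER_OUTPUT[drug])
    if actual != expected:
        raise TypeError(
            f"'{predictor}' expects {[t.value for t in expected]} "
            f"but '{omics}' and '{drug}' provide {[t.value for t in actual]}"
        )


check_combination("pca", "oneHot", "randomForest")  # valid, returns None
```

Running this check at construction time means an invalid triplet fails immediately with a readable message rather than deep inside training.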

---

Let me know what you think about this.
