
Allow custom split creation procedures #407

Description

@nictru

During my work, I became very interested in scaling laws. This can be random scaling (selecting a random subset of cell lines for LCO), but more complicated setups are also conceivable, e.g. investigating how the presence of more tissues in the training data affects performance in a single tissue. For this purpose, one could for example split the lung data into train/test and then include different amounts of other tissues only in the training set.

Creating a setup like this might be possible when using the package via the Python API, but it is quite difficult via the CLI or the nextflow pipeline. And the nextflow pipeline is the most useful option here, as one ends up not only with `n_models x n_folds` runs, but with `n_models x n_folds x n_fractions`.

I approached this by adding a pipeline parameter that points to a file containing a Python function which follows a defined interface for creating splits. A model training task is then launched for each train/test split that this function creates.
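To make the idea concrete, here is a rough sketch of what such a user-supplied function could look like for the lung example above. The function name `create_splits`, its arguments, and the returned dict layout (`name`, `train`, `test`) are all hypothetical and only meant for illustration; they are not the interface defined in the pipeline.

```python
import random
from typing import Dict, List, Sequence


def create_splits(
    cell_lines_by_tissue: Dict[str, List[str]],
    target_tissue: str = "lung",
    fractions: Sequence[float] = (0.0, 0.5, 1.0),
    test_fraction: float = 0.2,
    seed: int = 0,
) -> List[dict]:
    """Hypothetical custom split function (illustrative interface only).

    The target tissue is split once into train/test. For every fraction,
    that share of the cell lines from all other tissues is added to the
    training set only; the test set is identical across all splits.
    """
    rng = random.Random(seed)

    # Split the target tissue once so every run is evaluated on the same test set.
    target = sorted(cell_lines_by_tissue[target_tissue])
    rng.shuffle(target)
    n_test = max(1, round(len(target) * test_fraction))
    test, train_target = target[:n_test], target[n_test:]

    # Shuffle the other tissues once so that smaller fractions are
    # subsets of larger ones (nested training sets).
    others = sorted(
        cell_line
        for tissue, cell_lines in cell_lines_by_tissue.items()
        if tissue != target_tissue
        for cell_line in cell_lines
    )
    rng.shuffle(others)

    splits = []
    for fraction in fractions:
        n_extra = round(len(others) * fraction)
        splits.append(
            {
                "name": f"{target_tissue}_plus_{fraction:.2f}_other",
                "train": train_target + others[:n_extra],
                "test": list(test),
            }
        )
    return splits


# Toy data: made-up cell line identifiers, not taken from any real dataset.
example = {
    "lung": [f"L{i}" for i in range(10)],
    "breast": [f"B{i}" for i in range(4)],
    "skin": [f"S{i}" for i in range(2)],
}
splits = create_splits(example)
```

Each returned entry would correspond to one set of training tasks, so the number of entries plays the role of `n_fractions` in the run count.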

This is not a replacement for the LCO/LTO/LDO/LPO splits, but an extension for advanced users.

Possible problems

  1. Users might create leaking splits
  2. Report construction becomes more complicated
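The first problem could be partly mitigated by validating every user-created split before any training task is launched. The following is only a sketch under the assumption that a split is a dict with `train` and `test` lists of identifiers (e.g. cell lines for an LCO-like setup); `find_leaks` and `validate_splits` are hypothetical helpers, not existing functions of the package.

```python
from typing import Iterable, List


def find_leaks(split: dict) -> List[str]:
    """Return identifiers that occur in both the train and the test set."""
    return sorted(set(split["train"]) & set(split["test"]))


def validate_splits(splits: Iterable[dict]) -> None:
    """Raise a ValueError for the first split whose train and test sets overlap."""
    for index, split in enumerate(splits):
        leaks = find_leaks(split)
        if leaks:
            raise ValueError(
                f"Split {index} leaks {len(leaks)} identifier(s) "
                f"from train into test, e.g. {leaks[:5]}"
            )


# Toy example with made-up identifiers.
clean_split = {"train": ["A", "B"], "test": ["C"]}
leaky_split = {"train": ["A", "B"], "test": ["B", "C"]}

validate_splits([clean_split])  # passes silently
```

A check like this only catches direct overlap of identifiers; it cannot detect subtler leakage, such as closely related cell lines appearing on both sides.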


Labels: enhancement (New feature or request)
