During my efforts, I became very interested in scaling laws. The simplest case is random scaling (selecting a random subset of cell lines for LCO), but more complicated setups are conceivable, e.g. investigating how the presence of more tissues in the training data affects performance in a single tissue. For this purpose, one could, for example, split the lung data into train/test and then include different amounts of data from other tissues only in the training set.
Creating a setup like this might be possible when using the package directly from Python, but it is quite difficult via the CLI or the Nextflow pipeline. And the Nextflow pipeline is where it would be most useful, as one ends up not only with `n_models x n_folds` runs, but with `n_models x n_folds x n_fractions`.
I approached this by adding a pipeline parameter that points to a file containing a Python function that follows a defined interface for creating splits. Model training tasks are then launched for each train/test split that this function creates.
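To make the idea more concrete, here is a minimal sketch of what such a user-supplied split function could look like for the lung example above. The function name `create_splits`, its arguments, and the returned dictionary layout (`name`, `train`, `test`) are all hypothetical and chosen for illustration only; they are not the interface actually defined in the pipeline.

```python
import random
from typing import Dict, List, Sequence


def create_splits(
    cell_line_tissue: Dict[str, str],
    target_tissue: str = "lung",
    fractions: Sequence[float] = (0.0, 0.5, 1.0),
    test_ratio: float = 0.2,
    seed: int = 0,
) -> List[dict]:
    """Hypothetical split function (names and return format are assumptions).

    Holds out a fixed test set of cell lines from `target_tissue` and builds
    one training set per fraction, each adding more cell lines from the
    other tissues.
    """
    rng = random.Random(seed)

    # Sort before shuffling so the result only depends on the seed.
    target = sorted(c for c, t in cell_line_tissue.items() if t == target_tissue)
    others = sorted(c for c, t in cell_line_tissue.items() if t != target_tissue)
    rng.shuffle(target)
    rng.shuffle(others)

    # The test set is drawn once and reused for every fraction.
    n_test = max(1, int(len(target) * test_ratio))
    test, train_target = target[:n_test], target[n_test:]

    splits = []
    for fraction in fractions:
        n_other = int(round(len(others) * fraction))
        splits.append(
            {
                "name": f"{target_tissue}_plus_{fraction:.2f}_other",
                # Other tissues only ever enter the training set.
                "train": train_target + others[:n_other],
                "test": list(test),
            }
        )
    return splits
```

Calling `create_splits({"A549": "lung", "HT-29": "colon", ...})` would return one entry per fraction, and the pipeline could then launch one training task per entry and model. Keeping the test set identical across fractions is a deliberate choice in this sketch, so that differences in performance can be attributed to the training data alone.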
This is not a replacement for the LCO/LTO/LDO/LPO splits, but an extension for advanced users.
Possible problems
- Users might create leaking splits
- Report construction becomes more complicated
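The leakage risk could be reduced by validating user-defined splits before any training task is launched. The following is a minimal sketch under the assumption that each split is a dictionary with `name`, `train`, and `test` keys holding identifiers (for example cell line names); the helper names `find_leaks` and `validate_splits` are hypothetical and not part of the package. It only detects identical identifiers on both sides, so which identifier level to check (cell line, drug, tissue, pair) would still depend on the intended setting.

```python
from typing import Iterable, List, Set


def find_leaks(train: Iterable[str], test: Iterable[str]) -> Set[str]:
    """Return the identifiers that appear in both the training and the test set."""
    return set(train) & set(test)


def validate_splits(splits: List[dict]) -> None:
    """Raise a ValueError for the first split whose train and test sets overlap.

    Assumes (hypothetically) that each split is a dict with the keys
    "name", "train", and "test".
    """
    for split in splits:
        leaked = find_leaks(split["train"], split["test"])
        if leaked:
            examples = ", ".join(sorted(leaked)[:5])
            raise ValueError(
                f"Split '{split['name']}' leaks {len(leaked)} identifier(s) "
                f"from train into test, e.g.: {examples}"
            )
```

Running such a check once, right after the user function returns, would fail early instead of after `n_models x n_folds x n_fractions` runs have already been scheduled.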