remove_outlier_replicates counts NaNs as an actual measured value #419

@paulsonak

Description

Premise

I am dealing with a multitask dataset that is uncurated but has already been merged to wide form, so each response has its own column. Some response columns have more data than others, so for a given compound the sparser response columns contain NaNs instead of numbers. I proceed with curation like this:

dfs = []
for resp in <response_col_list>:

    curated_df1 = remove_outlier_replicates(full_df, ...)
    # curated_df1 is filtered only on <resp>, but keeps all columns of full_df.
    # It now has fewer rows, because outlier replicates were discarded based on
    # <resp> ONLY, so it is no longer a reliable data source for any of the
    # other response columns in the dataset.

    curated_df = aggregate_assay_data(curated_df1, ...)
    # curated_df has only the columns [<compound_id_col>, <smiles_col>,
    # <relation_col>, <output_value_col>]

    dfs.append(curated_df)

I then merge all the curated_dfs together to reconstruct the multitask dataset.
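Assuming pandas DataFrames and an outer join on the compound ID column, the re-merge step can be sketched as follows (the column names `cmpd_id`, `avg_pKi`, and `avg_pIC50` are illustrative, not AMPL's actual output names):

```python
from functools import reduce

import pandas as pd

# Two per-response curated frames, as produced by the loop above.
df_a = pd.DataFrame({"cmpd_id": ["c1", "c2"], "avg_pKi": [7.1, 6.4]})
df_b = pd.DataFrame({"cmpd_id": ["c2", "c3"], "avg_pIC50": [5.9, 8.0]})

# Outer join keeps compounds measured in only one task; their missing
# response columns become NaN, which is expected in a multitask set.
multitask_df = reduce(
    lambda left, right: left.merge(right, on="cmpd_id", how="outer"),
    [df_a, df_b],
)
```

With an outer join, NaNs in the reconstructed wide table are legitimate only where a compound was never measured for that task, which is what makes the unexpected NaNs described below stand out.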

Problem

I noticed a bunch of NaNs after re-merging the datasets to construct the curated multitask dataset, in places where there should have been values. I've traced this to remove_outlier_replicates: if a given compound is listed twice, with a value in one row and a NaN in the other for the given response column, the NaN is treated as a real measurement instead of being ignored. The outlier analysis then flags the compound as too variable, and the data is removed.
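The root of the problem can be illustrated with plain pandas (toy column names, independent of AMPL's internals): a NaN row looks like a second replicate if rows are merely counted, even though only one real measurement exists.

```python
import numpy as np
import pandas as pd

# Compound c1 has one real measurement and one NaN left over from the
# wide-form merge; c2 has a single real measurement.
df = pd.DataFrame({
    "compound_id": ["c1", "c1", "c2"],
    "pKi": [7.2, np.nan, 6.5],
})

# size() counts every row, so c1 appears to have two replicates.
n_rows = df.groupby("compound_id")["pKi"].size()

# count() ignores NaNs and reflects the real number of measurements.
n_vals = df.groupby("compound_id")["pKi"].count()
```

If replicate detection is based on row counts rather than non-NaN value counts, c1 is incorrectly treated as a replicated compound and becomes a candidate for outlier removal.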

Proposed solution

I believe it is fairly common to have NaNs in the response column when entering the curation pipeline. We should have a check for this instead of relying on the user to properly prepare the data.

Since remove_outlier_replicates discards rows from the dataset anyway, I think it should first remove any rows with NaNs in the response column before proceeding with outlier removal. We can log an info or warning message noting that it removed XXX NaN compounds.
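A minimal sketch of the proposed pre-filter, assuming pandas; the helper name `drop_nan_responses` is hypothetical and not part of AMPL:

```python
import logging

import pandas as pd

def drop_nan_responses(df: pd.DataFrame, response_col: str) -> pd.DataFrame:
    """Drop rows with a NaN response before outlier analysis.

    Hypothetical helper sketching the proposed check: remove NaN rows
    first and report how many were removed, so they can never be
    mistaken for real replicate measurements.
    """
    n_nan = int(df[response_col].isna().sum())
    if n_nan > 0:
        logging.warning(
            "Removing %d rows with NaN in response column %r",
            n_nan, response_col,
        )
    return df.dropna(subset=[response_col])
```

Running this at the top of remove_outlier_replicates would make the downstream replicate statistics see only real measurements.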

Also, since remove_outlier_replicates invalidates any other response data in the dataframe, it is worth considering returning only the valid columns, as we do in aggregate_assay_data. I realize remove_outlier_replicates is sometimes used on its own, where people may want to keep the other columns, but the current behavior doesn't feel like the right default for new users.
