remove_outlier_replicates counts NaNs as an actual measured value #419

@paulsonak

Description

Premise

I am dealing with a multitask dataset that is uncurated but has already been merged to wide form, so each response has its own column. Some response columns have more data than others, so for a given compound the sparser response columns contain NaNs instead of numbers. I proceed with curation like this:

dfs = []
for resp in <response_col_list>:

    curated_df1 = remove_outlier_replicates(full_df, ...)
    # curated_df1 is filtered only on <resp>, but keeps all columns of full_df.
    # It now has fewer rows, because outlier replicates were discarded based on
    # <resp> ONLY, so it is no longer a reliable data source for any of the
    # other response columns in the dataset.

    curated_df = aggregate_assay_data(curated_df1, ...)
    # curated_df has only the columns [<compound_id_col>, <smiles_col>,
    # <relation_col>, <output_value_col>]

    dfs.append(curated_df)

I then merge all the curated_dfs together to reconstruct the multitask dataset.
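Assuming pandas DataFrames and an outer join on the compound ID column, the re-merge step can be sketched as follows (the column names `cmpd_id`, `avg_pKi`, and `avg_pIC50` are illustrative, not AMPL's actual output names):

```python
from functools import reduce

import pandas as pd

# Two per-response curated frames, as produced by the loop above.
df_a = pd.DataFrame({"cmpd_id": ["c1", "c2"], "avg_pKi": [7.1, 6.4]})
df_b = pd.DataFrame({"cmpd_id": ["c2", "c3"], "avg_pIC50": [5.9, 8.0]})

# Outer join keeps compounds measured in only one task; their missing
# response columns become NaN, which is expected in a multitask set.
multitask_df = reduce(
    lambda left, right: left.merge(right, on="cmpd_id", how="outer"),
    [df_a, df_b],
)
```

With an outer join, NaNs in the reconstructed wide table are legitimate only where a compound was never measured for that task, which is what makes the unexpected NaNs described below stand out.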

Problem

I noticed a bunch of NaNs after re-merging the datasets to construct the curated multitask dataset, in places where there should have been values. I've traced this to remove_outlier_replicates: if a given compound is listed twice, with a value in one row and a NaN in the other for the given response column, the NaN is treated as a real measurement instead of being ignored. The outlier analysis then flags the compound as too variable, and the data is removed.
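The root of the problem can be illustrated with plain pandas (toy column names, independent of AMPL's internals): a NaN row looks like a second replicate if rows are merely counted, even though only one real measurement exists.

```python
import numpy as np
import pandas as pd

# Compound c1 has one real measurement and one NaN left over from the
# wide-form merge; c2 has a single real measurement.
df = pd.DataFrame({
    "compound_id": ["c1", "c1", "c2"],
    "pKi": [7.2, np.nan, 6.5],
})

# size() counts every row, so c1 appears to have two replicates.
n_rows = df.groupby("compound_id")["pKi"].size()

# count() ignores NaNs and reflects the real number of measurements.
n_vals = df.groupby("compound_id")["pKi"].count()
```

If replicate detection is based on row counts rather than non-NaN value counts, c1 is incorrectly treated as a replicated compound and becomes a candidate for outlier removal.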

Proposed solution

I believe it is fairly common to have NaNs in the response column when entering the curation pipeline. We should have a check for this instead of relying on the user to properly prepare the data.

Since remove_outlier_replicates discards rows from the dataset anyway, I think it should first remove any rows with NaNs in the response column before proceeding with outlier removal. We can log an info or warning message noting that it removed XXX NaN compounds.
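A minimal sketch of the proposed pre-filter, assuming pandas; the helper name `drop_nan_responses` is hypothetical and not part of AMPL:

```python
import logging

import pandas as pd

def drop_nan_responses(df: pd.DataFrame, response_col: str) -> pd.DataFrame:
    """Drop rows with a NaN response before outlier analysis.

    Hypothetical helper sketching the proposed check: remove NaN rows
    first and report how many were removed, so they can never be
    mistaken for real replicate measurements.
    """
    n_nan = int(df[response_col].isna().sum())
    if n_nan > 0:
        logging.warning(
            "Removing %d rows with NaN in response column %r",
            n_nan, response_col,
        )
    return df.dropna(subset=[response_col])
```

Running this at the top of remove_outlier_replicates would make the downstream replicate statistics see only real measurements.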

Also, since remove_outlier_replicates invalidates any other response data in the dataframe, it is worth considering returning only the valid columns, as we do in aggregate_assay_data. I realize remove_outlier_replicates is sometimes used on its own, where people may want to keep the other columns, but the current behavior doesn't feel like the right default for new users.
