Options for cleaner datasets by PascalIversen · Pull Request #428 · daisybio/drevalpy

PascalIversen · 2026-06-11T10:29:34Z

Adds dataset versions in which we drop drugs whose curves show no real signal (based on the CurveCurator p-value). We should warn when someone runs LDO on them, because I use the test labels to determine what counts as a "bad" drug. The labels of these bad drugs 1) are just noise during training and 2) make the per-drug Pearson mean lower than it is on drugs that actually work.

I don't want to just drop non-significant curves per experiment, because that would be leakage in LCO and LPO. With the current drug-level filtering, leakage only occurs in LDO.

Time for a new version - v1.5.0

Add CTRPv2_clean, CTRPv2_cleaner and CTRPv2_cleanest as selectable datasets, derived from the original CTRPv2 download so nothing new needs hosting. On first load each one builds a local folder that keeps only drugs with at least N reproducible CurveCurator-significant dose-response curves (N = 15, 30, 50) and symlinks CTRPv2's feature files. Filtering is done at the drug level only, never per experiment, so the sample is not conditioned on the response. Many CTRPv2 compounds (prodrugs, non-cytotoxics) have flat curves and meaningless IC50 labels; selective drugs that have a real cluster of responders are kept because the criterion counts responders rather than their fraction. Register the three loaders in AVAILABLE_DATASETS and update the factory test.
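The drug-level criterion described above can be sketched roughly as follows. This is a minimal illustration, not the PR's loader code: the column names `drug_id` and `significant` and the function `keep_active_drugs` are hypothetical, and the toy frame stands in for a real response table.

```python
import pandas as pd


def keep_active_drugs(response: pd.DataFrame, min_significant: int) -> pd.DataFrame:
    """Keep every row of each drug that has at least `min_significant` significant curves."""
    # Count significant curves per drug (summing booleans counts the True values).
    n_significant = response.groupby("drug_id")["significant"].sum()
    kept_drugs = n_significant[n_significant >= min_significant].index
    # Drug-level decision: non-significant curves of a kept drug stay in the data,
    # so individual experiments are never selected on their own response.
    return response[response["drug_id"].isin(kept_drugs)].reset_index(drop=True)


# Hypothetical toy data: drug A has two significant curves, drug B has one.
toy = pd.DataFrame(
    {
        "drug_id": ["A", "A", "A", "B", "B", "B"],
        "cell_line": ["c1", "c2", "c3", "c1", "c2", "c3"],
        "significant": [True, True, False, False, True, False],
    }
)
cleaned = keep_active_drugs(toy, min_significant=2)
print(cleaned)
```

With a threshold of 2, all three rows of drug A survive, including its non-significant curve, while drug B is dropped entirely.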

The clean/cleaner/cleanest variants drop inactive drugs, so leave-drug-out (LDO) metrics on them are optimistic: real screens contain inactive compounds a model would still face. Emit a warning when one of these datasets is used with LDO.
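A warning of this kind could look roughly like the sketch below. The function name `warn_if_ldo_on_derived`, the `test_mode` string, and the hard-coded name set are assumptions for illustration; how drevalpy actually wires the warning is not shown in this thread.

```python
import warnings

# Dataset names taken from the PR description; the set itself is a hypothetical stand-in.
DERIVED_NAMES = {"CTRPv2_clean", "CTRPv2_cleaner", "CTRPv2_cleanest"}


def warn_if_ldo_on_derived(dataset_name: str, test_mode: str) -> bool:
    """Warn when a filtered dataset is evaluated leave-drug-out; return True if warned."""
    if test_mode == "LDO" and dataset_name in DERIVED_NAMES:
        warnings.warn(
            f"{dataset_name} drops inactive drugs, so LDO metrics on it are optimistic.",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False


# The filtered dataset under LDO triggers the warning; other combinations do not.
print(warn_if_ldo_on_derived("CTRPv2", "LDO"))
print(warn_if_ldo_on_derived("CTRPv2_clean", "LCO"))
```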

JudithBernett · 2026-06-11T11:24:00Z

How is this specific to CTRPv2?

JudithBernett · 2026-06-11T11:29:06Z

+    meta_path = os.path.join(path_data, "meta", "tissue_mapping.csv")
+    if not os.path.exists(meta_path):
+        download_dataset("meta", path_data, redownload=True)
+    path = os.path.join(path_data, dataset_name, f"{dataset_name}.csv")
+    response_data = pd.read_csv(path, dtype={"pubchem_id": str, "cell_line_name": str})
+    response_data[DRUG_IDENTIFIER] = response_data[DRUG_IDENTIFIER].str.replace(",", "")
+    check_measure(measure, list(response_data.columns), dataset_name)
+    return DrugResponseDataset(
+        response=response_data[measure].values,
+        cell_line_ids=response_data[CELL_LINE_IDENTIFIER].values,
+        drug_ids=response_data[DRUG_IDENTIFIER].values,
+        tissues=response_data[TISSUE_IDENTIFIER].values,
+        dataset_name=dataset_name,
+    )


This is all duplicated from _load_zenodo_dataset.

PascalIversen · 2026-06-11T11:36:46Z

not specific, but CTRPv2 is the only one I use these days :D

Should I
a) make a general solution with the next CLI flag,
b) just make it for every dataset where it's possible, or
c) just do GDSC1 and 2, CTRPv1 and 2, and CCLE?

JudithBernett · 2026-06-11T11:43:07Z

I'd say this can be a general solution; maybe with percentages per drug instead of the hardcoded 5, 30, 50? I'd check whether the dataset was curve-curated (i.e., whether it has the Regulation column); then this could be done for all datasets.

PascalIversen · 2026-06-11T12:38:41Z

Hm, I am not sure. Maybe it is good to establish some benchmarking datasets, so we can have a leaderboard on CTRPv2 cleanest, etc. Both are also options. Let's see after the weekend.

But I want to keep it absolute, not percentage-based. I see no reason why a percentage-based threshold would make sense. Maybe the drug only has a signal that's highly selective for 5 breast cancer cell lines or something. It should be included regardless of how many other cell lines have been measured.

JudithBernett · 2026-06-11T12:40:36Z

Ok, let's discuss next week :D But I think 5, 30, 50 are somewhat arbitrarily chosen and would mean different things depending on the overall screening size.
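To make the two positions concrete, here is a small worked comparison of an absolute and a percentage criterion. The helper names, the two example drugs, and the 10% cutoff are made up for illustration; only the absolute value 15 comes from the commit message.

```python
def passes_absolute(n_significant: int, min_count: int) -> bool:
    """Keep a drug if it has at least `min_count` significant curves."""
    return n_significant >= min_count


def passes_percentage(n_significant: int, n_measured: int, min_fraction: float) -> bool:
    """Keep a drug if at least `min_fraction` of its measured curves are significant."""
    return n_measured > 0 and n_significant / n_measured >= min_fraction


# Hypothetical drugs as (significant curves, cell lines measured).
selective_drug = (20, 800)    # real cluster of responders in a large screen
small_panel_drug = (20, 40)   # same number of responders in a small screen

for name, (n_sig, n_meas) in {"selective": selective_drug, "small panel": small_panel_drug}.items():
    print(
        name,
        "absolute>=15:", passes_absolute(n_sig, 15),
        "fraction>=0.10:", passes_percentage(n_sig, n_meas, 0.10),
    )
```

Under the absolute rule both drugs are kept, which is the argument for not penalizing selective drugs. Under the percentage rule only the small-panel drug is kept, which shows how the same count of 20 responders means something different at different screening sizes.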

…; tests

Addresses review on #428:
- Split _load_zenodo_dataset into _read_response_frame + _frame_to_dataset so the clean loader reuses them instead of duplicating the load logic
- Replace CTRPv2-specific cleaners with a generic DrugCurveFilter + DERIVED_DATASETS registry, gated on the curve-curated Regulation column (works for any base via register_clean_tiers)
- DrugCurveFilter supports absolute (default, shipped 15/30/50) and percentage thresholds
- Generalize the LDO warning to any derived dataset
- Add data-free tests/test_dataset_cleaning.py

PascalIversen · 2026-06-18T12:17:44Z

Refactored to address:

- No more duplication
- Made it general: gated on the Regulation column so it works for any curve-curated dataset, not just CTRPv2
- Absolute vs %: now a DrugCurveFilter option that supports both. Kept absolute as the default for now (because imo selective drugs shouldn't be penalized for screen size)

still need to review this
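For readers following along without the diff, the refactor summarized above might be shaped roughly like this. The names DrugCurveFilter, DERIVED_DATASETS, and register_clean_tiers appear in the commit message, but every field, signature, and the `DEFAULT_TIERS` mapping below are guesses made for illustration, not the PR's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrugCurveFilter:
    """Per-drug keep/drop rule (hypothetical fields, not the PR's actual class)."""

    threshold: float
    mode: str = "absolute"  # "absolute" = curve count, "percentage" = share of measured curves

    def __post_init__(self) -> None:
        if self.mode not in ("absolute", "percentage"):
            raise ValueError(f"unknown mode: {self.mode!r}")

    def keeps(self, n_significant: int, n_measured: int) -> bool:
        if self.mode == "absolute":
            return n_significant >= self.threshold
        # Percentage mode: threshold is given in percent of measured curves.
        return n_measured > 0 and 100 * n_significant / n_measured >= self.threshold


# Maps a derived dataset name to (base dataset name, filter).
DERIVED_DATASETS: dict[str, tuple[str, DrugCurveFilter]] = {}

# Tier suffixes and absolute thresholds as stated in the PR discussion.
DEFAULT_TIERS = {"clean": 15, "cleaner": 30, "cleanest": 50}


def register_clean_tiers(base: str, columns: list[str]) -> list[str]:
    """Register derived tiers for `base`, but only if it looks curve-curated."""
    if "Regulation" not in columns:
        return []  # no Regulation column: nothing to filter on
    names = []
    for suffix, min_count in DEFAULT_TIERS.items():
        name = f"{base}_{suffix}"
        DERIVED_DATASETS[name] = (base, DrugCurveFilter(threshold=min_count))
        names.append(name)
    return names


print(register_clean_tiers("CTRPv2", ["pubchem_id", "Regulation"]))
print(register_clean_tiers("ToyScreen", ["pubchem_id"]))  # hypothetical uncurated dataset
```

Gating on the presence of the Regulation column keeps the registry generic: any base dataset with that column gets the three tiers, and any without it gets none.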

- Register Sphinx cross-reference roles (class, func, meth, attr, data, mod, obj) in the flake8 config so loader.py docstrings stop tripping RST304 ("Unknown interpreted text role").
- Document the return of the _toy_frame test helper (DAR201).

These were the only real CI failures; mypy/tests/xdoctest/typeguard were fail-fast cancellations triggered by the flake8 error.

JudithBernett and others added 4 commits June 8, 2026 16:23

Merge pull request #414 from daisybio/development

88bac77

Merge branch 'development' into ctrpv2_clean

15958e4

JudithBernett reviewed Jun 11, 2026

PascalIversen changed the title ~~Clean versions of the CTRPv2 datasets~~ Options for cleaner datasets Jun 19, 2026

PascalIversen wants to merge 6 commits into development from ctrpv2_clean

2 participants
