Some bugs found by the bot by PascalIversen · Pull Request #438 · daisybio/drevalpy

PascalIversen · 2026-06-19T09:01:43Z

Draft, to review.

load_and_select_gene_features kept feature values in CSV order while relabeling meta_info to gene-list order. values misaligned with gene names;
Single-drug final-prediction prep re-fetched unmasked validation/early-stopping splits, injecting all drugs into single-drug train sets. Removed the override.
CurveCurator raw-data prep referenced "replicate" unconditionally.. KeyError on replicate-free input. Dup-key now built from the pivot columns.
Single-drug consolidation rebound out_path inside the model loop: a second model nested under the first. Use a fresh local path.
Cross-study prediction mutated the caller's train_dataset via add_rows. Copy it first.
NN checkpoint guard used is not None, but the path defaults to "" in torch.load(""). Now checks
XGBoost reg_lambda was tuned but never passed to the model. Now forwarded.
CLI list-option expansion treated any token matching a subcommand name as the subcommand (e.g. --run_id report), breaking expansion. Only the first token is inspected now.
Metric evaluation masked NaN predictions but not NaN responses. Now masks rows where either is NaN.
add_meta_info had a dead is None branch that would have raised on the read-only property. Removed.
NN forward used x.squeeze(), collapsing a batch-of-1 prediction to a scalar → squeeze(-1); drop_last=True could empty the loader for a tiny training set → guarded by training-set size.
Removed the KNN variance hyperparameter grid that was never read (caused 8× redundant tuning).

…ures load_and_select_gene_features built indices_to_keep by iterating the source CSV columns, so feature values kept the source (CSV) column order while meta_info was relabeled to the gene-list order. When the two orders differ, every value is misaligned with its gene name. This is invisible within a single dataset (train and predict share the CSV column order) but silently corrupts cross-study prediction across datasets whose CSVs list genes in different orders (e.g. CTRPv2 -> GDSC for methylation and for DIPK's expression input). Build indices_to_keep by mapping each gene in ordered_genes to its source column index, so values are selected AND reordered to match ordered_genes / meta_info. Add a regression test asserting values stay aligned to their gene names when the gene-list order differs from the CSV order; the existing tests sorted colnames or only compared shape, so they missed this.

#2 cli_model_testing: _prep_data_for_final_prediction re-fetched the raw, unmasked split["validation_es"]/split["early_stopping"] after get_datasets_from_cv_split had already returned drug-masked copies. For single-drug models this re-introduced all drugs' rows into the single-drug train/early-stopping sets. Drop the redundant override and rely on the helper's return values (which already mask per drug and set es_dataset=None for non-early-stopping models), matching run_train_and_predict_cv. #3 curvecurator: _prepare_raw_data unconditionally used "replicate" in the duplicate check and groupby even though the no-replicate input shape is explicitly supported (pivot_columns=["dose"]), raising KeyError. Build the duplicate-key subset from pivot_columns so both shapes work.

…B reg_lambda, CLI subcommand detection #4 consolidate_single_drug_model_predictions rebound the out_path parameter inside the model loop (out_path = os.path.join(out_path, ...)), so a second single-drug model nested under the first (out_path/model1/model2). Use a local model_out_path for the write paths. #5 cross_study_prediction added early-stopping rows to the caller's train_dataset in place; that object is reused across cross-study datasets and by randomization/robustness tests. Copy train_dataset first (mirroring dataset). #6 SimpleNeuralNetwork checkpoint guard used `is not None`, but Lightning's best_model_path defaults to "" (never None), so the fallback was dead and a missing checkpoint hit torch.load(""). Use a truthiness check. #7 MultiViewXGBoost tuned reg_lambda in hyperparameters.yaml but never passed it to XGBRegressor (stayed at default 1). Forward reg_lambda. #8 _active_list_options treated any token matching a subcommand name as the subcommand, so an option value like `--run_id report` suppressed root list expansion. The pipeline is an invoke_without_command group, so only argv[0] can be the subcommand; inspect just that. Add a regression test.

…s, KNN dead hparam evaluation.evaluate: only prediction NaNs were masked; NaN *responses* with valid predictions corrupted the metric. Mask rows where either is NaN, NaN if <2 valid pairs. FeatureDataset.add_meta_info: drop the dead `if self.meta_info is None` branch (meta_info is a read-only property always initialized to {}; the assignment would have raised). Update in place. SimpleNeuralNetwork: forward used x.squeeze(), collapsing a batch-of-1 prediction to a 0-d scalar; use squeeze(-1). DataLoader drop_last=True could empty the loader for a training set smaller than one batch; guard with len(train_dataset) > batch_size. KNNRegressor: remove the `variance` grid from hyperparameters.yaml -- build_model never read it, so it only multiplied the tuning grid 8x with identical models.

PascalIversen added 4 commits June 19, 2026 10:42

PascalIversen marked this pull request as draft June 19, 2026 09:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Some bugs found by the bot#438

Some bugs found by the bot#438
PascalIversen wants to merge 4 commits into
developmentfrom
fix_bugs

PascalIversen commented Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

PascalIversen commented Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant