Skip to content

Some bugs found by the bot#438

Draft
PascalIversen wants to merge 4 commits into
developmentfrom
fix_bugs
Draft

Some bugs found by the bot#438
PascalIversen wants to merge 4 commits into
developmentfrom
fix_bugs

Conversation

@PascalIversen

Copy link
Copy Markdown
Collaborator

Draft, to review.

  • load_and_select_gene_features kept feature values in CSV order while relabeling meta_info to gene-list order. values misaligned with gene names;
  • Single-drug final-prediction prep re-fetched unmasked validation/early-stopping splits, injecting all drugs into single-drug train sets. Removed the override.
  • CurveCurator raw-data prep referenced "replicate" unconditionally.. KeyError on replicate-free input. Dup-key now built from the pivot columns.
  • Single-drug consolidation rebound out_path inside the model loop: a second model nested under the first. Use a fresh local path.
  • Cross-study prediction mutated the caller's train_dataset via add_rows. Copy it first.
  • NN checkpoint guard used is not None, but the path defaults to "" in torch.load(""). Now checks
  • XGBoost reg_lambda was tuned but never passed to the model. Now forwarded.
  • CLI list-option expansion treated any token matching a subcommand name as the subcommand (e.g. --run_id report), breaking expansion. Only the first token is inspected now.
  • Metric evaluation masked NaN predictions but not NaN responses. Now masks rows where either is NaN.
  • add_meta_info had a dead is None branch that would have raised on the read-only property. Removed.
  • NN forward used x.squeeze(), collapsing a batch-of-1 prediction to a scalar → squeeze(-1); drop_last=True could empty the loader for a tiny training set → guarded by training-set size.
  • Removed the KNN variance hyperparameter grid that was never read (caused 8× redundant tuning).

…ures

load_and_select_gene_features built indices_to_keep by iterating the source
CSV columns, so feature values kept the source (CSV) column order while
meta_info was relabeled to the gene-list order. When the two orders differ,
every value is misaligned with its gene name. This is invisible within a single
dataset (train and predict share the CSV column order) but silently corrupts
cross-study prediction across datasets whose CSVs list genes in different orders
(e.g. CTRPv2 -> GDSC for methylation and for DIPK's expression input).

Build indices_to_keep by mapping each gene in ordered_genes to its source
column index, so values are selected AND reordered to match ordered_genes /
meta_info.

Add a regression test asserting values stay aligned to their gene names when the
gene-list order differs from the CSV order; the existing tests sorted colnames
or only compared shape, so they missed this.
#2 cli_model_testing: _prep_data_for_final_prediction re-fetched the raw,
unmasked split["validation_es"]/split["early_stopping"] after
get_datasets_from_cv_split had already returned drug-masked copies. For
single-drug models this re-introduced all drugs' rows into the single-drug
train/early-stopping sets. Drop the redundant override and rely on the helper's
return values (which already mask per drug and set es_dataset=None for
non-early-stopping models), matching run_train_and_predict_cv.

#3 curvecurator: _prepare_raw_data unconditionally used "replicate" in the
duplicate check and groupby even though the no-replicate input shape is
explicitly supported (pivot_columns=["dose"]), raising KeyError. Build the
duplicate-key subset from pivot_columns so both shapes work.
…B reg_lambda, CLI subcommand detection

#4 consolidate_single_drug_model_predictions rebound the out_path parameter
inside the model loop (out_path = os.path.join(out_path, ...)), so a second
single-drug model nested under the first (out_path/model1/model2). Use a local
model_out_path for the write paths.

#5 cross_study_prediction added early-stopping rows to the caller's
train_dataset in place; that object is reused across cross-study datasets and by
randomization/robustness tests. Copy train_dataset first (mirroring dataset).

#6 SimpleNeuralNetwork checkpoint guard used `is not None`, but Lightning's
best_model_path defaults to "" (never None), so the fallback was dead and a
missing checkpoint hit torch.load(""). Use a truthiness check.

#7 MultiViewXGBoost tuned reg_lambda in hyperparameters.yaml but never passed it
to XGBRegressor (stayed at default 1). Forward reg_lambda.

#8 _active_list_options treated any token matching a subcommand name as the
subcommand, so an option value like `--run_id report` suppressed root list
expansion. The pipeline is an invoke_without_command group, so only argv[0] can
be the subcommand; inspect just that. Add a regression test.
…s, KNN dead hparam

evaluation.evaluate: only prediction NaNs were masked; NaN *responses* with valid
predictions corrupted the metric. Mask rows where either is NaN, NaN if <2 valid pairs.

FeatureDataset.add_meta_info: drop the dead `if self.meta_info is None` branch
(meta_info is a read-only property always initialized to {}; the assignment would
have raised). Update in place.

SimpleNeuralNetwork: forward used x.squeeze(), collapsing a batch-of-1 prediction
to a 0-d scalar; use squeeze(-1). DataLoader drop_last=True could empty the loader
for a training set smaller than one batch; guard with len(train_dataset) > batch_size.

KNNRegressor: remove the `variance` grid from hyperparameters.yaml -- build_model
never read it, so it only multiplied the tuning grid 8x with identical models.
@PascalIversen PascalIversen marked this pull request as draft June 19, 2026 09:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant