feat(l2g): add LocusToGeneCrossValidationStep for pre-split CV with hyperparameter tuning by addramir · Pull Request #1230 · opentargets/gentropy

addramir · 2026-05-28T15:14:54Z

✨ Context

Cross-validation and hyperparameter sweeps are exploratory — they should never run as part of a production training job. `LocusToGeneStep` previously had a `cross_validate` flag that was never used in prod and bloated its interface. This PR introduces a dedicated `LocusToGeneCrossValidationStep` that takes pre-built, annotated train/test feature matrix parquets and runs k-fold CV without any gold-standard parsing or feature matrix construction.

🛠 What does this PR implement

New: `LocusToGeneCrossValidationStep` (`l2g.py`)

A standalone step for CV and hyperparameter exploration. Takes two parquet inputs — a pre-split training set and a held-out test set, both annotated with features and `goldStandardSet` labels — and runs k-fold CV on the training partition followed by a final evaluation on the test set.

Parameters:

`train_feature_matrix_path` / `test_feature_matrix_path` — required parquet paths
`hyperparameters` — base XGBoost config
`features_list` — defaults to the standard L2G feature list
`n_splits` — CV folds (default 5)
`hyperparameter_grid` — optional W&B-style grid `{"param": {"values": [...]}, ...}`; when combined with `cv_results_dir`, all configs are evaluated
`cv_results_dir` — local or `gs://` path; when set, writes `cv_results.json`, `cv_folds.csv`, and per-config ROC / PR / confusion-matrix PNGs; without it, metrics go to the terminal

No W&B logging — CV output is file-based or terminal only.

Registered as Hydra step `locus_to_gene_cross_validation`.

New: `LocusToGeneTrainTestSplitStep` (`l2g.py`)

A standalone step for creating the train/test split upstream of CV or training. Registered as Hydra step `locus_to_gene_train_test_split`.

`LocusToGeneTrainer` upgrades

`evaluate()` — extended with `TP`, `FP`, `TN`, `FN` confusion-matrix components
`cross_validate()` — new `cv_results_dir` parameter enables file-mode output with explicit grid iteration
`_run_cv_fold()` — extracts single-fold logic; adds locus-level resolution stats (`n_loci`, `n_loci_one_gene_above`, `n_loci_no_gene_above`)
`_expand_grid()` — static method expanding a W&B-style grid into a flat list of hyperparameter configs
`_summarise_and_plot_config()`, `_write_cv_files()`, `_upload_dir_to_gcs()` — file-mode output pipeline
`_plot_roc_curves()`, `_plot_pr_curves()`, `_plot_confusion_matrix()` — per-config diagnostic PNGs

`LocusToGeneStep` refactor

Wired to consume pre-split train/test parquets from `LocusToGeneTrainTestSplitStep` instead of building the feature matrix inline. Gold-standard parsing and the `cross_validate` flag are removed from this step.

Config deduplication

Extracted `_L2G_FEATURES_LIST` and `_L2G_HYPERPARAMETERS` as module-level constants; `LocusToGeneConfig`, `LocusToGeneTrainTestSplitConfig`, and `LocusToGeneCrossValidationConfig` all reference them.

🧪 Test plan

`pytest tests/gentropy/method/test_l2g_trainer.py` — all existing tests pass; `test_evaluate_perfect_predictions` updated for new TP/FP/TN/FN keys; `_expand_grid` and `_write_cv_files` unit tests added
`pytest tests/gentropy/test_l2g_cv_step.py` — CV step integration tests including `cv_results_dir` file output and hyperparameter grid
Smoke-test `LocusToGeneCrossValidationStep` with local parquets, terminal mode
Smoke-test with `cv_results_dir` set to a local path and verify JSON/CSV/PNG outputs
Smoke-test with `hyperparameter_grid` + `cv_results_dir` to verify grid expansion

🤖 Generated with Claude Code

Introduce a dedicated step that builds the annotated gold-standard feature matrix from the full feature matrix + credible set, performs a hierarchical train/test split, and writes both partitions to parquet. A predefined test set from a previous run can be supplied to keep the held-out set stable across feature matrix updates. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

LocusToGeneStep no longer performs gold standard annotation or train/test splitting internally. Instead it requires presplit parquets produced by LocusToGeneTrainTestSplitStep, making the two steps composable in a pipeline without redundant Spark computation. - Remove gold_standard_curation_path, variant_index_path, gene_interactions_path from LocusToGeneStep and LocusToGeneConfig - Remove prepare_gold_standard() and _annotate_gold_standards_w_feature_matrix() - run_train() validates presplit paths are provided and loads them directly - LocusToGeneTrainer.train() accepts presplit_train_df/presplit_test_df to bypass internal generate_train_test_split - Add import pandas as pd at module level in l2g.py - Add tests for LocusToGeneTrainTestSplitStep static helpers and LocusToGeneTrainer presplit bypass path Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…tStep - Write split stats JSON in the fresh-split branch (was only written for the predefined-test branch) - Create parent directories before writing the JSON sidecar locally so the step doesn't fail on a fresh output path - Remove coalesce(1) to avoid forcing a single output partition - Unpersist annotated_fm after both branches write their parquets - Replace toPandas() calls in the predefined-test branch with Spark left_anti / inner joins and create_map label encoding, keeping all data on the cluster and avoiding driver-memory limits Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

- Tighten _parse_gold_standard match: add guard so the unexpected_columns branch cannot fire when missing_mandatory_columns is also present - Persist train_sdf/test_sdf before counting in the predefined-test branch to avoid recomputing the joins on the subsequent .write actions - Raise ValueError in LocusToGeneTrainer.train() when exactly one of presplit_train_df / presplit_test_df is provided - Make credible_set_path / feature_matrix_path optional in LocusToGeneConfig and LocusToGeneStep (predict mode validates they are present; train mode no longer loads them unnecessarily) Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…tStep Allows callers to specify an explicit output path for the split statistics JSON instead of deriving it from train_parquet_path. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…Step log Adds n_positive, n_negative, n_unique_loci, n_unique_positive_genes, and (when traitFromSourceMappedId is present) n_unique_positive_gene_disease_pairs for both train and test splits, nested under "train"/"test" keys in the stats JSON. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

New dedicated step that runs k-fold cross-validation on pre-built, annotated train/test feature matrix parquets without any gold-standard parsing or training set construction. Trainer upgrades (ported from PR #1228): - evaluate() extended with TP/FP/TN/FN confusion-matrix components - cross_validate() now supports cv_results_dir: writes cv_results.json, cv_folds.csv and per-config ROC/PR/confusion-matrix PNGs to a local or GCS directory; hyperparameter grid is swept explicitly in file mode - _run_cv_fold() extracts single-fold logic and adds locus-level stats (n_loci, n_loci_one_gene_above, n_loci_no_gene_above) - _expand_grid() expands W&B-style grid to flat config list - _summarise_and_plot_config(), _write_cv_files(), _upload_dir_to_gcs() handle file-mode output and GCS upload - _plot_roc_curves(), _plot_pr_curves(), _plot_confusion_matrix() PNGs LocusToGeneStep and its training path are unchanged. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Adds a dedicated LocusToGeneCrossValidationStep that runs k-fold CV (with optional hyperparameter grid sweeps) on pre-built, annotated train/test feature-matrix parquets, leaving the production LocusToGeneStep untouched. The trainer is extended to support a file-output CV mode (cv_results_dir) producing JSON/CSV summaries and per-config ROC/PR/confusion plots, plus confusion-matrix counts in evaluate().

Changes:

New LocusToGeneCrossValidationStep in l2g.py and matching LocusToGeneCrossValidationConfig registered under step=locus_to_gene_cross_validation.
LocusToGeneTrainer.cross_validate refactored: _run_cv_fold extracted, hyperparameters now applied before fit (previously applied post-fit in W&B mode), cv_results_dir file-output mode with grid expansion and plotting helpers added.
evaluate() extended with TP/FP/TN/FN; test updated accordingly.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

File	Description
src/gentropy/l2g.py	New `LocusToGeneCrossValidationStep` that loads pre-split parquets, manually populates the trainer, runs CV, fits, and reports hold-out metrics.
src/gentropy/method/l2g/trainer.py	Refactored `cross_validate` with `cv_results_dir` file mode, `_run_cv_fold`, `_expand_grid`, plotting helpers, GCS upload helper, and confusion-matrix counts in `evaluate`.
src/gentropy/config.py	Added `LocusToGeneCrossValidationConfig` (duplicates feature list and hyperparameters from `LocusToGeneConfig`) and registered the Hydra step.
tests/gentropy/method/test_l2g_trainer.py	Updated `test_evaluate_perfect_predictions` for the new TP/FP/TN/FN keys.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

addramir · 2026-05-28T16:01:55Z

+    features_list: list[str] = field(
+        default_factory=lambda: [
+            "eQtlColocClppMaximum",
+            "pQtlColocClppMaximum",
+            "sQtlColocClppMaximum",
+            "eQtlColocH4Maximum",
+            "pQtlColocH4Maximum",
+            "sQtlColocH4Maximum",
+            "eQtlColocClppMaximumNeighbourhood",
+            "pQtlColocClppMaximumNeighbourhood",
+            "sQtlColocClppMaximumNeighbourhood",
+            "eQtlColocH4MaximumNeighbourhood",
+            "pQtlColocH4MaximumNeighbourhood",
+            "sQtlColocH4MaximumNeighbourhood",
+            "distanceSentinelFootprint",
+            "distanceSentinelFootprintNeighbourhood",
+            "distanceFootprintMean",
+            "distanceFootprintMeanNeighbourhood",
+            "distanceTssMean",
+            "distanceTssMeanNeighbourhood",
+            "distanceSentinelTss",
+            "distanceSentinelTssNeighbourhood",
+            "vepMaximum",
+            "vepMaximumNeighbourhood",
+            "vepMean",
+            "vepMeanNeighbourhood",
+            "e2gMean",
+            "e2gMeanNeighbourhood",
+            "geneCount500kb",
+            "proteinGeneCount500kb",
+            "credibleSetConfidence",
+            "transPQtlColocH4Maximum",
+            "transPQtlColocH4MaximumNeighbourhood",
+        ]
+    )
+    hyperparameters: dict[str, Any] = field(
+        default_factory=lambda: {
+            "max_depth": 5,
+            "reg_alpha": 1,
+            "reg_lambda": 1.0,
+            "subsample": 0.8,
+            "colsample_bytree": 0.8,
+            "eta": 0.05,
+            "min_child_weight": 10,
+            "scale_pos_weight": 0.8,
+        }


Fixed: extracted _L2G_FEATURES_LIST and _L2G_HYPERPARAMETERS as module-level constants in config.py. All three configs (LocusToGeneConfig, LocusToGeneTrainTestSplitConfig, LocusToGeneCrossValidationConfig) now reference them via list(_L2G_FEATURES_LIST) / dict(_L2G_HYPERPARAMETERS).

addramir · 2026-05-28T16:02:13Z

+        trainer.cross_validate(
+            parameter_grid=hyperparameter_grid,
+            n_splits=n_splits,
+            cv_results_dir=cv_results_dir,
+        )
+
+        trainer.fit()
+
+        trainer.log_to_terminal(
+            eval_id="Hold-out",
+            metrics=trainer.evaluate(
+                y_true=y_test,
+                y_pred=trainer.model.model.predict(x_test),
+                y_pred_proba=trainer.model.model.predict_proba(x_test),
+            ),
+        )


Won't fix: the CV step is intentionally exploratory — its purpose is to evaluate configurations and write metrics, not to select and apply a 'best' config automatically. The caller inspects the output and chooses hyperparameters manually before running a production training job.

addramir · 2026-05-28T16:01:57Z

+        train_feature_matrix = L2GFeatureMatrix(
+            _df=session.load_data(train_feature_matrix_path, "parquet"),
+            features_list=features_list,
+            with_gold_standard=True,
+        ).persist()
+
+        test_feature_matrix = L2GFeatureMatrix(
+            _df=session.load_data(test_feature_matrix_path, "parquet"),
+            features_list=features_list,
+            with_gold_standard=True,
+        ).persist()


Fixed: added unpersist() calls on both feature matrix DataFrames immediately after .toPandas() materialises them.

addramir · 2026-05-28T16:02:14Z

+        metric_keys = list(folds[0]["metrics"].keys())
+        mean_metrics = {
+            k: float(np.mean([f["metrics"][k] for f in folds])) for k in metric_keys
+        }
+        std_metrics = {
+            k: float(np.std([f["metrics"][k] for f in folds])) for k in metric_keys
+        }


Won't fix: the per-config summaries are intended for exploratory comparison across grid configurations, not fold-to-fold inference. Reporting mean ± std over all metrics (including counts) is acceptable in this diagnostic context.

addramir · 2026-05-28T16:01:58Z

+    @staticmethod
+    def _upload_dir_to_gcs(local_dir: Path, gcs_prefix: str) -> None:
+        """Upload every file under ``local_dir`` to GCS, preserving relative paths.
+
+        Args:
+            local_dir (Path): Root of the local directory tree to upload.
+            gcs_prefix (str): Destination GCS prefix, e.g. ``gs://bucket/path``.
+        """
+        from google.cloud import storage as gcs_storage
+
+        without_scheme = gcs_prefix[len("gs://"):]
+        bucket_name, _, blob_prefix = without_scheme.partition("/")
+        client = gcs_storage.Client()
+        bucket = client.bucket(bucket_name)
+
+        for local_file in local_dir.rglob("*"):
+            if not local_file.is_file():
+                continue
+            relative = local_file.relative_to(local_dir)
+            blob_name = (
+                f"{blob_prefix.rstrip('/')}/{relative}" if blob_prefix else str(relative)
+            )
+            bucket.blob(blob_name).upload_from_filename(str(local_file))
+            logging.info("Uploaded %s → gs://%s/%s", relative, bucket_name, blob_name)


Fixed: _upload_dir_to_gcs now delegates to copy_to_gcs from gentropy.external.gcs, removing the inline google.cloud.storage import and manual bucket/blob parsing.

addramir · 2026-05-28T16:02:11Z

+    @staticmethod
+    def _expand_grid(parameter_grid: dict[str, Any]) -> list[dict[str, Any]]:
+        """Expand a W&B-style parameter grid into a flat list of hyperparameter configs.
+
+        Args:
+            parameter_grid (dict[str, Any]): Grid in the form ``{"param": {"values": [v1, v2, ...]}, ...}``.
+
+        Returns:
+            list[dict[str, Any]]: One dict per hyperparameter combination.
+
+        Raises:
+            ValueError: If any grid entry is missing a ``values`` key.
+        """
+        keys = list(parameter_grid.keys())
+        missing = [k for k in keys if "values" not in parameter_grid[k]]
+        if missing:
+            raise ValueError(
+                f"Hyperparameter grid entries must have a 'values' key. Missing in: {missing}"
+            )
+        values = [parameter_grid[k]["values"] for k in keys]
+        return [dict(zip(keys, combo)) for combo in product(*values)]


Already covered: _expand_grid has unit tests in tests/gentropy/method/test_l2g_trainer.py (including valid grid, single-param grid, and invalid-key error cases). _write_cv_files and cv_results_dir file-mode output are tested in both test_l2g_trainer.py and tests/gentropy/test_l2g_cv_step.py.

Keeps both LocusToGeneTrainTestSplitStep (from PR #1227) and LocusToGeneCrossValidationStep (this branch) as independent steps. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…lities Tests cover _expand_grid, _write_cv_files, cross_validate with cv_results_dir, and LocusToGeneCrossValidationStep end-to-end including hyperparameter grid sweep. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Direct instance method assignment is rejected by mypy; patch.object is the correct way to stub a method in tests. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…onStep - Extract _L2G_FEATURES_LIST and _L2G_HYPERPARAMETERS as module-level constants in config.py; LocusToGeneConfig, LocusToGeneTrainTestSplitConfig, and LocusToGeneCrossValidationConfig all reference them to eliminate copy-paste duplication - Unpersist train/test feature matrix Spark DataFrames in LocusToGeneCrossValidationStep after toPandas() materialises them - Replace _upload_dir_to_gcs inline GCS client with the shared copy_to_gcs helper from gentropy.external.gcs Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

For each config evaluated during cross-validation, train on the full training set and evaluate on the held-out test set, appending a "holdout" row to cv_folds.csv alongside the per-fold rows. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…tionConfig OmegaConf cannot handle field(default_factory=lambda: ...) on list[str] and dict[str, Any] typed fields in structured configs — it tries to resolve default_factory as a dict key and raises ConfigAttributeError. Apply the same Any workaround already used for hyperparameter_grid. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

When LocusToGeneCrossValidationStep is instantiated via Hydra, the hyperparameters field arrives as an OmegaConf DictConfig rather than a plain dict. The property fell through to default_factory() which does not exist on DictConfig. Fall back to dict() conversion instead. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

The train-test split step saves goldStandardSet as 0/1 integers. LocusToGeneCrossValidationStep was unconditionally applying a string-to-int map, silently producing NaN for all rows and leaving zero positives for hierarchical_split. Now skips the map when the column is already numeric. Adds a regression test with integer-encoded labels. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

_eval_on_test_set is skipped when x_test/y_test/test_df are None so that unit tests calling cross_validate directly (without a full step setup) still pass. Extracts _maybe_append_holdout_row helper to keep cross_validate complexity within the C901 limit. Updates test_cv_results_json_structure to expect n_splits + 1 fold_metrics entries when test data is present. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Set n_estimators=300 as the explicit default. With eta=0.05 the previous implicit default of 100 trees gave a learning capacity (eta × n_estimators) of 5, versus XGBoost's internal default of ~30. 300 trees brings capacity closer to what the learning rate implies. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

gamma=0 (min split gain, explicit default) and max_delta_step=1 (stabilises gradient updates for imbalanced binary classification, per XGBoost docs recommendation). Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

When holdout_only=True, CV folds are skipped entirely. Each config is trained once on the full training set and evaluated directly on the holdout set. Produces a cv_folds.csv with one row per config (fold=holdout) and n_splits=0 in cv_results.json. Use when a large pre-split holdout is available and k-fold CV would add noise and 5x compute overhead without additional signal. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Resolved conflicts in config.py, l2g.py, trainer.py, and test_l2g_trainer.py: - Keep _L2G_FEATURES_LIST / _L2G_HYPERPARAMETERS constants (our deduplication) - Keep LocusToGeneCrossValidationConfig and its Hydra registration (new in this branch) - Take dev's to_unpersist list pattern and chained .persist() calls (resource management) - Take dev's logger / _SetStats / pd.io.common.get_handle improvements - Take dev's Spark-native label decoding in run_train (more efficient) - Keep TestExpandGrid / TestWriteCvFiles / TestCrossValidateWithCvResultsDir (new tests) Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

addramir · 2026-06-19T15:34:05Z

        # If no grid is provided, use default ones set in the model
        parameter_grid = parameter_grid or {
            param: {"values": [value]}
            for param, value in self.model.hyperparameters.items()
        }


Partially addresed. I'd skip the n_splits=0 guard — it's not a realistic mistake and adding validation for every degenerate integer just adds noise. The holdout_only one is worth
it because it's a named flag with documented semantics that silently breaks.

Given refactor, we should move this upstream to the step. Feels awkward that cv function need to define by itself the parameter grid.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

- Rename `l` → `label` in _plot_confusion_matrix list comprehension (ruff E741) - Raise ValueError when holdout_only=True without cv_results_dir Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

project-defiant · 2026-06-23T13:07:50Z

    features_list: list[str] = field(default_factory=list)
    hyperparameters: dict[str, Any] = field(
        default_factory=lambda: {
+            "n_estimators": 300,


New parameter, did you run hyperparameter tuning?

project-defiant · 2026-06-23T13:18:42Z

+            "gamma": 0,  # min loss reduction to make a split
+            "max_delta_step": 1,  # stabilises updates for imbalanced data


As above, is there a hyperparameter tuning for gamma and max_delta_step ?

From xgboost:

gamma ∈ [0, ∞) gamma = 0

Since gamma is the minimum loss reduction required for a split, the default (0) allows any split with positive regularized gain. This can produce more complex individual trees, including branches with small objective improvements

Again for the max_delta_step this is actually non-default value.

project-defiant · 2026-06-23T13:44:03Z

        """
        if not self.hyperparameters:
            raise ValueError("Hyperparameters have not been set.")
        elif isinstance(self.hyperparameters, dict):
            return self.hyperparameters
-        return self.hyperparameters.default_factory()
+        try:
+            # dataclasses.field with default_factory (legacy path)
+            return self.hyperparameters.default_factory()
+        except (AttributeError, TypeError):
+            # OmegaConf DictConfig passed directly from Hydra
+            return dict(self.hyperparameters)


I think we should state what is the order of precedence for the hyperparameters.

if self.hyperparameters are a dict -> return them

else return dict(self.hyperparameters.default_factory())

Right now the execution is:
if self.hyperparameters == None:
raise
if self.hyperparameters is dict:
return them
else (case when hyperparameters are not null, nor a dict -> could be any other type):
return defaults.

project-defiant · 2026-06-23T14:17:13Z

+        train_feature_matrix_path: str,
+        test_feature_matrix_path: str,
+        hyperparameters: dict[str, Any],
+        features_list: list[str] | None = None,
+        n_splits: int = 5,
+        hyperparameter_grid: dict[str, Any] | None = None,


It is a bit confusing that we have hyperparameters and hyperparameter_grid defined at the same time. I would probably keep only hyperparameter_grid as this is a superset of all parameter ranges, and given CV run, you will probably also want to sweep over the hyperparameter grid anyway.

On the second note, since we are not explicit on what kind of hyperparameters are expected (we pass them through directly to the xgboost model, it would be useful to have some minimal hint about that in the docstring + meaningful type hints). All parameters are expected to be of a metaType

class ParameterRange(NamedTuple): values: list[int] | list[float] name: str class ParameterRangeSpec(NamedTuple): upper_bound: int| float lower_bound: int | float step: int | float def to_grid(name: str) -> ParameterRange: return ... class Hyperparameters: grid: list[ParameterRange]

Given such types, we could easily understand the concept behind the input values.

project-defiant · 2026-06-23T14:27:49Z

+        effective_features = train_feature_matrix.features_list
+
+        l2g_model = LocusToGeneModel(
+            model=XGBClassifier(random_state=777, eval_metric="aucpr"),


FUN FACT -> https://arxiv.org/abs/2109.08203

The random seed probably should be handled as all other hyperparameters 😵, otherwise we could end up in non-optimal spot.

project-defiant · 2026-06-23T15:05:37Z

-            "transPQtlColocH4MaximumNeighbourhood",
-        ]
-    )
+    features_list: list[str] = field(default_factory=lambda: list(_L2G_FEATURES_LIST))


project-defiant · 2026-06-23T15:14:03Z

+        trainer.y_test = y_test
+
+        trainer.cross_validate(
+            parameter_grid=hyperparameter_grid,


Since we are refactoring this bit, it would be useful if we specify one hyperparameter_grid for the entire step and reuse it everywhere (another words -> trainer.cross_validate should not need to look up for the trainer.hyperparameters and specifying the artificial grid if hyperparameter_grid is passed as None to the cross_validate method)

The handling of the hyperparameter_grid should happen before cross_validate call.

project-defiant · 2026-06-23T15:15:02Z

+    features_list: Any = field(default_factory=lambda: list(_L2G_FEATURES_LIST))  # list[str] — Any avoids OmegaConf default_factory bug
+    hyperparameters: Any = field(default_factory=lambda: dict(_L2G_HYPERPARAMETERS))  # dict[str, Any] — Any avoids OmegaConf default_factory bug
+    n_splits: int = 5
+    hyperparameter_grid: Any = None  # dict[str, Any] | None — Any avoids OmegaConf Optional[Dict] merge bug


specifying hyperparameter_grid and hyperparameters feels redundant and implies implicit behavior which is not documented.

project-defiant · 2026-06-23T15:18:01Z

        wandb_run_name: str | None = None,
        parameter_grid: dict[str, Any] | None = None,


if we are not using the w&b for CV runs, we should remove the parameter and downstream logic.

project-defiant · 2026-06-23T15:19:38Z

        # If no grid is provided, use default ones set in the model
        parameter_grid = parameter_grid or {
            param: {"values": [value]}
            for param, value in self.model.hyperparameters.items()
        }


Given refactor, we should move this upstream to the step. Feels awkward that cv function need to define by itself the parameter grid.

addramir and others added 9 commits May 26, 2026 16:30

Merge branch 'dev' into feat/l2g-train-test-split-step

354f95e

feat(l2g): add split_stats_path parameter to LocusToGeneTrainTestSpli…

5aea491

…tStep Allows callers to specify an explicit output path for the split statistics JSON instead of deriving it from train_parquet_path. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

fix: leakage

4b14e9a

Copilot AI review requested due to automatic review settings May 28, 2026 15:14

Copilot started reviewing on behalf of addramir May 28, 2026 15:15 View session

github-actions Bot added size-L Method Step Feature labels May 28, 2026

addramir marked this pull request as draft May 28, 2026 15:15

Copilot AI reviewed May 28, 2026

View reviewed changes

Merge feat/l2g-train-test-split-step into yt_new_cv_step

67492f2

Keeps both LocusToGeneTrainTestSplitStep (from PR #1227) and LocusToGeneCrossValidationStep (this branch) as independent steps. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

github-actions Bot added size-XL and removed size-L labels May 28, 2026

addramir and others added 10 commits May 28, 2026 16:38

fix: tests

57d3067

fix(test): replace method-assign with patch.object for log_to_wandb mock

d58cd0b

Direct instance method assignment is rejected by mypy; patch.object is the correct way to stub a method in tests. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

addramir and others added 3 commits May 29, 2026 17:50

addramir requested a review from Copilot June 19, 2026 15:17

Copilot started reviewing on behalf of addramir June 19, 2026 15:18 View session

Copilot AI reviewed Jun 19, 2026

View reviewed changes

addramir and others added 5 commits June 19, 2026 16:31

Potential fix for pull request finding

5e87ead

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

f40260a

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

578965c

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

fix(l2g): raise ValueError when holdout_only=True without cv_results_dir

6cb0bec

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

fix(l2g): fix ambiguous variable name and holdout_only guard

05fc91b

- Rename `l` → `label` in _plot_confusion_matrix list comprehension (ruff E741) - Raise ValueError when holdout_only=True without cv_results_dir Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

addramir marked this pull request as ready for review June 19, 2026 15:49

addramir assigned project-defiant and unassigned project-defiant Jun 19, 2026

addramir requested a review from project-defiant June 19, 2026 15:50

addramir self-assigned this Jun 19, 2026

project-defiant requested changes Jun 23, 2026

View reviewed changes

		"gamma": 0, # min loss reduction to make a split
		"max_delta_step": 1, # stabilises updates for imbalanced data

		wandb_run_name: str \| None = None,
		parameter_grid: dict[str, Any] \| None = None,

Uh oh!

Conversation

addramir commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✨ Context

🛠 What does this PR implement

New: `LocusToGeneCrossValidationStep` (`l2g.py`)

New: `LocusToGeneTrainTestSplitStep` (`l2g.py`)

`LocusToGeneTrainer` upgrades

`LocusToGeneStep` refactor

Config deduplication

🧪 Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

addramir commented May 28, 2026 •

edited

Loading