Skip to content

feat(l2g): add LocusToGeneCrossValidationStep for pre-split CV with hyperparameter tuning#1230

Open
addramir wants to merge 28 commits into
devfrom
yt_new_cv_step
Open

feat(l2g): add LocusToGeneCrossValidationStep for pre-split CV with hyperparameter tuning#1230
addramir wants to merge 28 commits into
devfrom
yt_new_cv_step

Conversation

@addramir

@addramir addramir commented May 28, 2026

Copy link
Copy Markdown
Contributor

✨ Context

Cross-validation and hyperparameter sweeps are exploratory — they should never run as part of a production training job. `LocusToGeneStep` previously had a `cross_validate` flag that was never used in prod and bloated its interface. This PR introduces a dedicated `LocusToGeneCrossValidationStep` that takes pre-built, annotated train/test feature matrix parquets and runs k-fold CV without any gold-standard parsing or feature matrix construction.

🛠 What does this PR implement

New: `LocusToGeneCrossValidationStep` (`l2g.py`)

A standalone step for CV and hyperparameter exploration. Takes two parquet inputs — a pre-split training set and a held-out test set, both annotated with features and `goldStandardSet` labels — and runs k-fold CV on the training partition followed by a final evaluation on the test set.

Parameters:

  • `train_feature_matrix_path` / `test_feature_matrix_path` — required parquet paths
  • `hyperparameters` — base XGBoost config
  • `features_list` — defaults to the standard L2G feature list
  • `n_splits` — CV folds (default 5)
  • `hyperparameter_grid` — optional W&B-style grid `{"param": {"values": [...]}, ...}`; when combined with `cv_results_dir`, all configs are evaluated
  • `cv_results_dir` — local or `gs://` path; when set, writes `cv_results.json`, `cv_folds.csv`, and per-config ROC / PR / confusion-matrix PNGs; without it, metrics go to the terminal

No W&B logging — CV output is file-based or terminal only.

Registered as Hydra step `locus_to_gene_cross_validation`.

New: `LocusToGeneTrainTestSplitStep` (`l2g.py`)

A standalone step for creating the train/test split upstream of CV or training. Registered as Hydra step `locus_to_gene_train_test_split`.

`LocusToGeneTrainer` upgrades

  • `evaluate()` — extended with `TP`, `FP`, `TN`, `FN` confusion-matrix components
  • `cross_validate()` — new `cv_results_dir` parameter enables file-mode output with explicit grid iteration
  • `_run_cv_fold()` — extracts single-fold logic; adds locus-level resolution stats (`n_loci`, `n_loci_one_gene_above`, `n_loci_no_gene_above`)
  • `_expand_grid()` — static method expanding a W&B-style grid into a flat list of hyperparameter configs
  • `_summarise_and_plot_config()`, `_write_cv_files()`, `_upload_dir_to_gcs()` — file-mode output pipeline
  • `_plot_roc_curves()`, `_plot_pr_curves()`, `_plot_confusion_matrix()` — per-config diagnostic PNGs

`LocusToGeneStep` refactor

Wired to consume pre-split train/test parquets from `LocusToGeneTrainTestSplitStep` instead of building the feature matrix inline. Gold-standard parsing and the `cross_validate` flag are removed from this step.

Config deduplication

Extracted `_L2G_FEATURES_LIST` and `_L2G_HYPERPARAMETERS` as module-level constants; `LocusToGeneConfig`, `LocusToGeneTrainTestSplitConfig`, and `LocusToGeneCrossValidationConfig` all reference them.

🧪 Test plan

  • `pytest tests/gentropy/method/test_l2g_trainer.py` — all existing tests pass; `test_evaluate_perfect_predictions` updated for new TP/FP/TN/FN keys; `_expand_grid` and `_write_cv_files` unit tests added
  • `pytest tests/gentropy/test_l2g_cv_step.py` — CV step integration tests including `cv_results_dir` file output and hyperparameter grid
  • Smoke-test `LocusToGeneCrossValidationStep` with local parquets, terminal mode
  • Smoke-test with `cv_results_dir` set to a local path and verify JSON/CSV/PNG outputs
  • Smoke-test with `hyperparameter_grid` + `cv_results_dir` to verify grid expansion

🤖 Generated with Claude Code

addramir and others added 9 commits May 26, 2026 16:30
Introduce a dedicated step that builds the annotated gold-standard feature
matrix from the full feature matrix + credible set, performs a hierarchical
train/test split, and writes both partitions to parquet. A predefined test
set from a previous run can be supplied to keep the held-out set stable
across feature matrix updates.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
LocusToGeneStep no longer performs gold standard annotation or
train/test splitting internally. Instead it requires presplit parquets
produced by LocusToGeneTrainTestSplitStep, making the two steps
composable in a pipeline without redundant Spark computation.

- Remove gold_standard_curation_path, variant_index_path,
  gene_interactions_path from LocusToGeneStep and LocusToGeneConfig
- Remove prepare_gold_standard() and _annotate_gold_standards_w_feature_matrix()
- run_train() validates presplit paths are provided and loads them directly
- LocusToGeneTrainer.train() accepts presplit_train_df/presplit_test_df
  to bypass internal generate_train_test_split
- Add import pandas as pd at module level in l2g.py
- Add tests for LocusToGeneTrainTestSplitStep static helpers and
  LocusToGeneTrainer presplit bypass path

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…tStep

- Write split stats JSON in the fresh-split branch (was only written
  for the predefined-test branch)
- Create parent directories before writing the JSON sidecar locally so
  the step doesn't fail on a fresh output path
- Remove coalesce(1) to avoid forcing a single output partition
- Unpersist annotated_fm after both branches write their parquets
- Replace toPandas() calls in the predefined-test branch with Spark
  left_anti / inner joins and create_map label encoding, keeping all
  data on the cluster and avoiding driver-memory limits

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Tighten _parse_gold_standard match: add guard so the unexpected_columns
  branch cannot fire when missing_mandatory_columns is also present
- Persist train_sdf/test_sdf before counting in the predefined-test branch
  to avoid recomputing the joins on the subsequent .write actions
- Raise ValueError in LocusToGeneTrainer.train() when exactly one of
  presplit_train_df / presplit_test_df is provided
- Make credible_set_path / feature_matrix_path optional in LocusToGeneConfig
  and LocusToGeneStep (predict mode validates they are present; train mode
  no longer loads them unnecessarily)

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…tStep

Allows callers to specify an explicit output path for the split statistics
JSON instead of deriving it from train_parquet_path.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…Step log

Adds n_positive, n_negative, n_unique_loci, n_unique_positive_genes, and
(when traitFromSourceMappedId is present) n_unique_positive_gene_disease_pairs
for both train and test splits, nested under "train"/"test" keys in the stats JSON.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
New dedicated step that runs k-fold cross-validation on pre-built,
annotated train/test feature matrix parquets without any gold-standard
parsing or training set construction.

Trainer upgrades (ported from PR #1228):
- evaluate() extended with TP/FP/TN/FN confusion-matrix components
- cross_validate() now supports cv_results_dir: writes cv_results.json,
  cv_folds.csv and per-config ROC/PR/confusion-matrix PNGs to a local
  or GCS directory; hyperparameter grid is swept explicitly in file mode
- _run_cv_fold() extracts single-fold logic and adds locus-level stats
  (n_loci, n_loci_one_gene_above, n_loci_no_gene_above)
- _expand_grid() expands W&B-style grid to flat config list
- _summarise_and_plot_config(), _write_cv_files(), _upload_dir_to_gcs()
  handle file-mode output and GCS upload
- _plot_roc_curves(), _plot_pr_curves(), _plot_confusion_matrix() PNGs

LocusToGeneStep and its training path are unchanged.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 28, 2026 15:14
@addramir addramir marked this pull request as draft May 28, 2026 15:15

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a dedicated LocusToGeneCrossValidationStep that runs k-fold CV (with optional hyperparameter grid sweeps) on pre-built, annotated train/test feature-matrix parquets, leaving the production LocusToGeneStep untouched. The trainer is extended to support a file-output CV mode (cv_results_dir) producing JSON/CSV summaries and per-config ROC/PR/confusion plots, plus confusion-matrix counts in evaluate().

Changes:

  • New LocusToGeneCrossValidationStep in l2g.py and matching LocusToGeneCrossValidationConfig registered under step=locus_to_gene_cross_validation.
  • LocusToGeneTrainer.cross_validate refactored: _run_cv_fold extracted, hyperparameters now applied before fit (previously applied post-fit in W&B mode), cv_results_dir file-output mode with grid expansion and plotting helpers added.
  • evaluate() extended with TP/FP/TN/FN; test updated accordingly.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

File Description
src/gentropy/l2g.py New LocusToGeneCrossValidationStep that loads pre-split parquets, manually populates the trainer, runs CV, fits, and reports hold-out metrics.
src/gentropy/method/l2g/trainer.py Refactored cross_validate with cv_results_dir file mode, _run_cv_fold, _expand_grid, plotting helpers, GCS upload helper, and confusion-matrix counts in evaluate.
src/gentropy/config.py Added LocusToGeneCrossValidationConfig (duplicates feature list and hyperparameters from LocusToGeneConfig) and registered the Hydra step.
tests/gentropy/method/test_l2g_trainer.py Updated test_evaluate_perfect_predictions for the new TP/FP/TN/FN keys.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/gentropy/config.py Outdated
Comment on lines +373 to +418
features_list: list[str] = field(
default_factory=lambda: [
"eQtlColocClppMaximum",
"pQtlColocClppMaximum",
"sQtlColocClppMaximum",
"eQtlColocH4Maximum",
"pQtlColocH4Maximum",
"sQtlColocH4Maximum",
"eQtlColocClppMaximumNeighbourhood",
"pQtlColocClppMaximumNeighbourhood",
"sQtlColocClppMaximumNeighbourhood",
"eQtlColocH4MaximumNeighbourhood",
"pQtlColocH4MaximumNeighbourhood",
"sQtlColocH4MaximumNeighbourhood",
"distanceSentinelFootprint",
"distanceSentinelFootprintNeighbourhood",
"distanceFootprintMean",
"distanceFootprintMeanNeighbourhood",
"distanceTssMean",
"distanceTssMeanNeighbourhood",
"distanceSentinelTss",
"distanceSentinelTssNeighbourhood",
"vepMaximum",
"vepMaximumNeighbourhood",
"vepMean",
"vepMeanNeighbourhood",
"e2gMean",
"e2gMeanNeighbourhood",
"geneCount500kb",
"proteinGeneCount500kb",
"credibleSetConfidence",
"transPQtlColocH4Maximum",
"transPQtlColocH4MaximumNeighbourhood",
]
)
hyperparameters: dict[str, Any] = field(
default_factory=lambda: {
"max_depth": 5,
"reg_alpha": 1,
"reg_lambda": 1.0,
"subsample": 0.8,
"colsample_bytree": 0.8,
"eta": 0.05,
"min_child_weight": 10,
"scale_pos_weight": 0.8,
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed: extracted _L2G_FEATURES_LIST and _L2G_HYPERPARAMETERS as module-level constants in config.py. All three configs (LocusToGeneConfig, LocusToGeneTrainTestSplitConfig, LocusToGeneCrossValidationConfig) now reference them via list(_L2G_FEATURES_LIST) / dict(_L2G_HYPERPARAMETERS).

Comment thread src/gentropy/l2g.py
Comment on lines +700 to +715
trainer.cross_validate(
parameter_grid=hyperparameter_grid,
n_splits=n_splits,
cv_results_dir=cv_results_dir,
)

trainer.fit()

trainer.log_to_terminal(
eval_id="Hold-out",
metrics=trainer.evaluate(
y_true=y_test,
y_pred=trainer.model.model.predict(x_test),
y_pred_proba=trainer.model.model.predict_proba(x_test),
),
)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't fix: the CV step is intentionally exploratory — its purpose is to evaluate configurations and write metrics, not to select and apply a 'best' config automatically. The caller inspects the output and chooses hyperparameters manually before running a production training job.

Comment thread src/gentropy/l2g.py
Comment on lines +652 to +662
train_feature_matrix = L2GFeatureMatrix(
_df=session.load_data(train_feature_matrix_path, "parquet"),
features_list=features_list,
with_gold_standard=True,
).persist()

test_feature_matrix = L2GFeatureMatrix(
_df=session.load_data(test_feature_matrix_path, "parquet"),
features_list=features_list,
with_gold_standard=True,
).persist()

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed: added unpersist() calls on both feature matrix DataFrames immediately after .toPandas() materialises them.

Comment on lines +650 to +656
metric_keys = list(folds[0]["metrics"].keys())
mean_metrics = {
k: float(np.mean([f["metrics"][k] for f in folds])) for k in metric_keys
}
std_metrics = {
k: float(np.std([f["metrics"][k] for f in folds])) for k in metric_keys
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't fix: the per-config summaries are intended for exploratory comparison across grid configurations, not fold-to-fold inference. Reporting mean ± std over all metrics (including counts) is acceptable in this diagnostic context.

Comment thread src/gentropy/method/l2g/trainer.py Outdated
Comment on lines +711 to +734
@staticmethod
def _upload_dir_to_gcs(local_dir: Path, gcs_prefix: str) -> None:
"""Upload every file under ``local_dir`` to GCS, preserving relative paths.

Args:
local_dir (Path): Root of the local directory tree to upload.
gcs_prefix (str): Destination GCS prefix, e.g. ``gs://bucket/path``.
"""
from google.cloud import storage as gcs_storage

without_scheme = gcs_prefix[len("gs://"):]
bucket_name, _, blob_prefix = without_scheme.partition("/")
client = gcs_storage.Client()
bucket = client.bucket(bucket_name)

for local_file in local_dir.rglob("*"):
if not local_file.is_file():
continue
relative = local_file.relative_to(local_dir)
blob_name = (
f"{blob_prefix.rstrip('/')}/{relative}" if blob_prefix else str(relative)
)
bucket.blob(blob_name).upload_from_filename(str(local_file))
logging.info("Uploaded %s → gs://%s/%s", relative, bucket_name, blob_name)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed: _upload_dir_to_gcs now delegates to copy_to_gcs from gentropy.external.gcs, removing the inline google.cloud.storage import and manual bucket/blob parsing.

Comment on lines +612 to +632
@staticmethod
def _expand_grid(parameter_grid: dict[str, Any]) -> list[dict[str, Any]]:
"""Expand a W&B-style parameter grid into a flat list of hyperparameter configs.

Args:
parameter_grid (dict[str, Any]): Grid in the form ``{"param": {"values": [v1, v2, ...]}, ...}``.

Returns:
list[dict[str, Any]]: One dict per hyperparameter combination.

Raises:
ValueError: If any grid entry is missing a ``values`` key.
"""
keys = list(parameter_grid.keys())
missing = [k for k in keys if "values" not in parameter_grid[k]]
if missing:
raise ValueError(
f"Hyperparameter grid entries must have a 'values' key. Missing in: {missing}"
)
values = [parameter_grid[k]["values"] for k in keys]
return [dict(zip(keys, combo)) for combo in product(*values)]

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already covered: _expand_grid has unit tests in tests/gentropy/method/test_l2g_trainer.py (including valid grid, single-param grid, and invalid-key error cases). _write_cv_files and cv_results_dir file-mode output are tested in both test_l2g_trainer.py and tests/gentropy/test_l2g_cv_step.py.

Keeps both LocusToGeneTrainTestSplitStep (from PR #1227) and
LocusToGeneCrossValidationStep (this branch) as independent steps.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added size-XL and removed size-L labels May 28, 2026
addramir and others added 10 commits May 28, 2026 16:38
…lities

Tests cover _expand_grid, _write_cv_files, cross_validate with cv_results_dir,
and LocusToGeneCrossValidationStep end-to-end including hyperparameter grid sweep.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Direct instance method assignment is rejected by mypy; patch.object is
the correct way to stub a method in tests.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…onStep

- Extract _L2G_FEATURES_LIST and _L2G_HYPERPARAMETERS as module-level
  constants in config.py; LocusToGeneConfig, LocusToGeneTrainTestSplitConfig,
  and LocusToGeneCrossValidationConfig all reference them to eliminate
  copy-paste duplication
- Unpersist train/test feature matrix Spark DataFrames in
  LocusToGeneCrossValidationStep after toPandas() materialises them
- Replace _upload_dir_to_gcs inline GCS client with the shared
  copy_to_gcs helper from gentropy.external.gcs

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
For each config evaluated during cross-validation, train on the full
training set and evaluate on the held-out test set, appending a
"holdout" row to cv_folds.csv alongside the per-fold rows.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…tionConfig

OmegaConf cannot handle field(default_factory=lambda: ...) on list[str]
and dict[str, Any] typed fields in structured configs — it tries to
resolve default_factory as a dict key and raises ConfigAttributeError.
Apply the same Any workaround already used for hyperparameter_grid.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
When LocusToGeneCrossValidationStep is instantiated via Hydra, the
hyperparameters field arrives as an OmegaConf DictConfig rather than a
plain dict. The property fell through to default_factory() which does
not exist on DictConfig. Fall back to dict() conversion instead.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
The train-test split step saves goldStandardSet as 0/1 integers.
LocusToGeneCrossValidationStep was unconditionally applying a
string-to-int map, silently producing NaN for all rows and leaving zero
positives for hierarchical_split. Now skips the map when the column is
already numeric. Adds a regression test with integer-encoded labels.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
_eval_on_test_set is skipped when x_test/y_test/test_df are None so
that unit tests calling cross_validate directly (without a full step
setup) still pass. Extracts _maybe_append_holdout_row helper to keep
cross_validate complexity within the C901 limit. Updates
test_cv_results_json_structure to expect n_splits + 1 fold_metrics
entries when test data is present.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Set n_estimators=300 as the explicit default. With eta=0.05 the previous
implicit default of 100 trees gave a learning capacity (eta × n_estimators)
of 5, versus XGBoost's internal default of ~30. 300 trees brings capacity
closer to what the learning rate implies.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
addramir and others added 3 commits May 29, 2026 17:50
gamma=0 (min split gain, explicit default) and max_delta_step=1
(stabilises gradient updates for imbalanced binary classification,
per XGBoost docs recommendation).

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
When holdout_only=True, CV folds are skipped entirely. Each config is
trained once on the full training set and evaluated directly on the
holdout set. Produces a cv_folds.csv with one row per config (fold=holdout)
and n_splits=0 in cv_results.json.

Use when a large pre-split holdout is available and k-fold CV would add
noise and 5x compute overhead without additional signal.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Resolved conflicts in config.py, l2g.py, trainer.py, and test_l2g_trainer.py:
- Keep _L2G_FEATURES_LIST / _L2G_HYPERPARAMETERS constants (our deduplication)
- Keep LocusToGeneCrossValidationConfig and its Hydra registration (new in this branch)
- Take dev's to_unpersist list pattern and chained .persist() calls (resource management)
- Take dev's logger / _SetStats / pd.io.common.get_handle improvements
- Take dev's Spark-native label decoding in run_train (more efficient)
- Keep TestExpandGrid / TestWriteCvFiles / TestCrossValidateWithCvResultsDir (new tests)

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Comment on lines 412 to 416
# If no grid is provided, use default ones set in the model
parameter_grid = parameter_grid or {
param: {"values": [value]}
for param, value in self.model.hyperparameters.items()
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partially addresed. I'd skip the n_splits=0 guard — it's not a realistic mistake and adding validation for every degenerate integer just adds noise. The holdout_only one is worth
it because it's a named flag with documented semantics that silently breaks.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given refactor, we should move this upstream to the step. Feels awkward that cv function need to define by itself the parameter grid.

Comment thread src/gentropy/method/l2g/trainer.py Outdated
Comment thread src/gentropy/method/l2g/trainer.py Outdated
Comment thread src/gentropy/method/l2g/model.py
addramir and others added 5 commits June 19, 2026 16:31
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Rename `l` → `label` in _plot_confusion_matrix list comprehension (ruff E741)
- Raise ValueError when holdout_only=True without cv_results_dir

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@addramir addramir marked this pull request as ready for review June 19, 2026 15:49
@addramir addramir requested a review from project-defiant June 19, 2026 15:50
@addramir addramir self-assigned this Jun 19, 2026
features_list: list[str] = field(default_factory=list)
hyperparameters: dict[str, Any] = field(
default_factory=lambda: {
"n_estimators": 300,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New parameter, did you run hyperparameter tuning?

Comment on lines +45 to +46
"gamma": 0, # min loss reduction to make a split
"max_delta_step": 1, # stabilises updates for imbalanced data

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, is there a hyperparameter tuning for gamma and max_delta_step ?

From xgboost:

gamma ∈ [0, ∞)
gamma = 0

Since gamma is the minimum loss reduction required for a split, the default (0) allows any split with positive regularized gain. This can produce more complex individual trees, including branches with small objective improvements

Again for the max_delta_step this is actually non-default value.

Comment on lines 194 to +204
"""
if not self.hyperparameters:
raise ValueError("Hyperparameters have not been set.")
elif isinstance(self.hyperparameters, dict):
return self.hyperparameters
return self.hyperparameters.default_factory()
try:
# dataclasses.field with default_factory (legacy path)
return self.hyperparameters.default_factory()
except (AttributeError, TypeError):
# OmegaConf DictConfig passed directly from Hydra
return dict(self.hyperparameters)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should state what is the order of precedence for the hyperparameters.

  1. if self.hyperparameters are a dict -> return them
  2. else return dict(self.hyperparameters.default_factory())

Right now the execution is:
if self.hyperparameters == None:
raise
if self.hyperparameters is dict:
return them
else (case when hyperparameters are not null, nor a dict -> could be any other type):
return defaults.

Comment thread src/gentropy/l2g.py
Comment on lines +837 to +842
train_feature_matrix_path: str,
test_feature_matrix_path: str,
hyperparameters: dict[str, Any],
features_list: list[str] | None = None,
n_splits: int = 5,
hyperparameter_grid: dict[str, Any] | None = None,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a bit confusing that we have hyperparameters and hyperparameter_grid defined at the same time. I would probably keep only hyperparameter_grid as this is a superset of all parameter ranges, and given CV run, you will probably also want to sweep over the hyperparameter grid anyway.

On the second note, since we are not explicit on what kind of hyperparameters are expected (we pass them through directly to the xgboost model, it would be useful to have some minimal hint about that in the docstring + meaningful type hints). All parameters are expected to be of a metaType

class ParameterRange(NamedTuple):
    values: list[int] | list[float]
    name: str

class ParameterRangeSpec(NamedTuple):
    upper_bound: int| float
    lower_bound: int | float
    step: int | float

    def to_grid(name: str) -> ParameterRange:
        return ...

class Hyperparameters:
    grid: list[ParameterRange]

Given such types, we could easily understand the concept behind the input values.

Comment thread src/gentropy/l2g.py
effective_features = train_feature_matrix.features_list

l2g_model = LocusToGeneModel(
model=XGBClassifier(random_state=777, eval_metric="aucpr"),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FUN FACT -> https://arxiv.org/abs/2109.08203

The random seed probably should be handled as all other hyperparameters 😵, otherwise we could end up in non-optimal spot.

Comment thread src/gentropy/config.py
"transPQtlColocH4MaximumNeighbourhood",
]
)
features_list: list[str] = field(default_factory=lambda: list(_L2G_FEATURES_LIST))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as above.

Comment thread src/gentropy/l2g.py
trainer.y_test = y_test

trainer.cross_validate(
parameter_grid=hyperparameter_grid,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are refactoring this bit, it would be useful if we specify one hyperparameter_grid for the entire step and reuse it everywhere (another words -> trainer.cross_validate should not need to look up for the trainer.hyperparameters and specifying the artificial grid if hyperparameter_grid is passed as None to the cross_validate method)

The handling of the hyperparameter_grid should happen before cross_validate call.

Comment thread src/gentropy/config.py
Comment on lines +399 to +402
features_list: Any = field(default_factory=lambda: list(_L2G_FEATURES_LIST)) # list[str] — Any avoids OmegaConf default_factory bug
hyperparameters: Any = field(default_factory=lambda: dict(_L2G_HYPERPARAMETERS)) # dict[str, Any] — Any avoids OmegaConf default_factory bug
n_splits: int = 5
hyperparameter_grid: Any = None # dict[str, Any] | None — Any avoids OmegaConf Optional[Dict] merge bug

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

specifying hyperparameter_grid and hyperparameters feels redundant and implies implicit behavior which is not documented.

Comment on lines 385 to 386
wandb_run_name: str | None = None,
parameter_grid: dict[str, Any] | None = None,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we are not using the w&b for CV runs, we should remove the parameter and downstream logic.

Comment on lines 412 to 416
# If no grid is provided, use default ones set in the model
parameter_grid = parameter_grid or {
param: {"values": [value]}
for param, value in self.model.hyperparameters.items()
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given refactor, we should move this upstream to the step. Feels awkward that cv function need to define by itself the parameter grid.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants