feat(l2g): add LocusToGeneCrossValidationStep for pre-split CV with hyperparameter tuning#1230
feat(l2g): add LocusToGeneCrossValidationStep for pre-split CV with hyperparameter tuning#1230addramir wants to merge 28 commits into
Conversation
Introduce a dedicated step that builds the annotated gold-standard feature matrix from the full feature matrix + credible set, performs a hierarchical train/test split, and writes both partitions to parquet. A predefined test set from a previous run can be supplied to keep the held-out set stable across feature matrix updates. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
LocusToGeneStep no longer performs gold standard annotation or train/test splitting internally. Instead it requires presplit parquets produced by LocusToGeneTrainTestSplitStep, making the two steps composable in a pipeline without redundant Spark computation. - Remove gold_standard_curation_path, variant_index_path, gene_interactions_path from LocusToGeneStep and LocusToGeneConfig - Remove prepare_gold_standard() and _annotate_gold_standards_w_feature_matrix() - run_train() validates presplit paths are provided and loads them directly - LocusToGeneTrainer.train() accepts presplit_train_df/presplit_test_df to bypass internal generate_train_test_split - Add import pandas as pd at module level in l2g.py - Add tests for LocusToGeneTrainTestSplitStep static helpers and LocusToGeneTrainer presplit bypass path Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…tStep - Write split stats JSON in the fresh-split branch (was only written for the predefined-test branch) - Create parent directories before writing the JSON sidecar locally so the step doesn't fail on a fresh output path - Remove coalesce(1) to avoid forcing a single output partition - Unpersist annotated_fm after both branches write their parquets - Replace toPandas() calls in the predefined-test branch with Spark left_anti / inner joins and create_map label encoding, keeping all data on the cluster and avoiding driver-memory limits Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Tighten _parse_gold_standard match: add guard so the unexpected_columns branch cannot fire when missing_mandatory_columns is also present - Persist train_sdf/test_sdf before counting in the predefined-test branch to avoid recomputing the joins on the subsequent .write actions - Raise ValueError in LocusToGeneTrainer.train() when exactly one of presplit_train_df / presplit_test_df is provided - Make credible_set_path / feature_matrix_path optional in LocusToGeneConfig and LocusToGeneStep (predict mode validates they are present; train mode no longer loads them unnecessarily) Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…tStep Allows callers to specify an explicit output path for the split statistics JSON instead of deriving it from train_parquet_path. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…Step log Adds n_positive, n_negative, n_unique_loci, n_unique_positive_genes, and (when traitFromSourceMappedId is present) n_unique_positive_gene_disease_pairs for both train and test splits, nested under "train"/"test" keys in the stats JSON. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
New dedicated step that runs k-fold cross-validation on pre-built, annotated train/test feature matrix parquets without any gold-standard parsing or training set construction. Trainer upgrades (ported from PR #1228): - evaluate() extended with TP/FP/TN/FN confusion-matrix components - cross_validate() now supports cv_results_dir: writes cv_results.json, cv_folds.csv and per-config ROC/PR/confusion-matrix PNGs to a local or GCS directory; hyperparameter grid is swept explicitly in file mode - _run_cv_fold() extracts single-fold logic and adds locus-level stats (n_loci, n_loci_one_gene_above, n_loci_no_gene_above) - _expand_grid() expands W&B-style grid to flat config list - _summarise_and_plot_config(), _write_cv_files(), _upload_dir_to_gcs() handle file-mode output and GCS upload - _plot_roc_curves(), _plot_pr_curves(), _plot_confusion_matrix() PNGs LocusToGeneStep and its training path are unchanged. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Adds a dedicated LocusToGeneCrossValidationStep that runs k-fold CV (with optional hyperparameter grid sweeps) on pre-built, annotated train/test feature-matrix parquets, leaving the production LocusToGeneStep untouched. The trainer is extended to support a file-output CV mode (cv_results_dir) producing JSON/CSV summaries and per-config ROC/PR/confusion plots, plus confusion-matrix counts in evaluate().
Changes:
- New
LocusToGeneCrossValidationStepinl2g.pyand matchingLocusToGeneCrossValidationConfigregistered understep=locus_to_gene_cross_validation. LocusToGeneTrainer.cross_validaterefactored:_run_cv_foldextracted, hyperparameters now applied beforefit(previously applied post-fit in W&B mode),cv_results_dirfile-output mode with grid expansion and plotting helpers added.evaluate()extended withTP/FP/TN/FN; test updated accordingly.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/gentropy/l2g.py | New LocusToGeneCrossValidationStep that loads pre-split parquets, manually populates the trainer, runs CV, fits, and reports hold-out metrics. |
| src/gentropy/method/l2g/trainer.py | Refactored cross_validate with cv_results_dir file mode, _run_cv_fold, _expand_grid, plotting helpers, GCS upload helper, and confusion-matrix counts in evaluate. |
| src/gentropy/config.py | Added LocusToGeneCrossValidationConfig (duplicates feature list and hyperparameters from LocusToGeneConfig) and registered the Hydra step. |
| tests/gentropy/method/test_l2g_trainer.py | Updated test_evaluate_perfect_predictions for the new TP/FP/TN/FN keys. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| features_list: list[str] = field( | ||
| default_factory=lambda: [ | ||
| "eQtlColocClppMaximum", | ||
| "pQtlColocClppMaximum", | ||
| "sQtlColocClppMaximum", | ||
| "eQtlColocH4Maximum", | ||
| "pQtlColocH4Maximum", | ||
| "sQtlColocH4Maximum", | ||
| "eQtlColocClppMaximumNeighbourhood", | ||
| "pQtlColocClppMaximumNeighbourhood", | ||
| "sQtlColocClppMaximumNeighbourhood", | ||
| "eQtlColocH4MaximumNeighbourhood", | ||
| "pQtlColocH4MaximumNeighbourhood", | ||
| "sQtlColocH4MaximumNeighbourhood", | ||
| "distanceSentinelFootprint", | ||
| "distanceSentinelFootprintNeighbourhood", | ||
| "distanceFootprintMean", | ||
| "distanceFootprintMeanNeighbourhood", | ||
| "distanceTssMean", | ||
| "distanceTssMeanNeighbourhood", | ||
| "distanceSentinelTss", | ||
| "distanceSentinelTssNeighbourhood", | ||
| "vepMaximum", | ||
| "vepMaximumNeighbourhood", | ||
| "vepMean", | ||
| "vepMeanNeighbourhood", | ||
| "e2gMean", | ||
| "e2gMeanNeighbourhood", | ||
| "geneCount500kb", | ||
| "proteinGeneCount500kb", | ||
| "credibleSetConfidence", | ||
| "transPQtlColocH4Maximum", | ||
| "transPQtlColocH4MaximumNeighbourhood", | ||
| ] | ||
| ) | ||
| hyperparameters: dict[str, Any] = field( | ||
| default_factory=lambda: { | ||
| "max_depth": 5, | ||
| "reg_alpha": 1, | ||
| "reg_lambda": 1.0, | ||
| "subsample": 0.8, | ||
| "colsample_bytree": 0.8, | ||
| "eta": 0.05, | ||
| "min_child_weight": 10, | ||
| "scale_pos_weight": 0.8, | ||
| } |
There was a problem hiding this comment.
Fixed: extracted _L2G_FEATURES_LIST and _L2G_HYPERPARAMETERS as module-level constants in config.py. All three configs (LocusToGeneConfig, LocusToGeneTrainTestSplitConfig, LocusToGeneCrossValidationConfig) now reference them via list(_L2G_FEATURES_LIST) / dict(_L2G_HYPERPARAMETERS).
| trainer.cross_validate( | ||
| parameter_grid=hyperparameter_grid, | ||
| n_splits=n_splits, | ||
| cv_results_dir=cv_results_dir, | ||
| ) | ||
|
|
||
| trainer.fit() | ||
|
|
||
| trainer.log_to_terminal( | ||
| eval_id="Hold-out", | ||
| metrics=trainer.evaluate( | ||
| y_true=y_test, | ||
| y_pred=trainer.model.model.predict(x_test), | ||
| y_pred_proba=trainer.model.model.predict_proba(x_test), | ||
| ), | ||
| ) |
There was a problem hiding this comment.
Won't fix: the CV step is intentionally exploratory — its purpose is to evaluate configurations and write metrics, not to select and apply a 'best' config automatically. The caller inspects the output and chooses hyperparameters manually before running a production training job.
| train_feature_matrix = L2GFeatureMatrix( | ||
| _df=session.load_data(train_feature_matrix_path, "parquet"), | ||
| features_list=features_list, | ||
| with_gold_standard=True, | ||
| ).persist() | ||
|
|
||
| test_feature_matrix = L2GFeatureMatrix( | ||
| _df=session.load_data(test_feature_matrix_path, "parquet"), | ||
| features_list=features_list, | ||
| with_gold_standard=True, | ||
| ).persist() |
There was a problem hiding this comment.
Fixed: added unpersist() calls on both feature matrix DataFrames immediately after .toPandas() materialises them.
| metric_keys = list(folds[0]["metrics"].keys()) | ||
| mean_metrics = { | ||
| k: float(np.mean([f["metrics"][k] for f in folds])) for k in metric_keys | ||
| } | ||
| std_metrics = { | ||
| k: float(np.std([f["metrics"][k] for f in folds])) for k in metric_keys | ||
| } |
There was a problem hiding this comment.
Won't fix: the per-config summaries are intended for exploratory comparison across grid configurations, not fold-to-fold inference. Reporting mean ± std over all metrics (including counts) is acceptable in this diagnostic context.
| @staticmethod | ||
| def _upload_dir_to_gcs(local_dir: Path, gcs_prefix: str) -> None: | ||
| """Upload every file under ``local_dir`` to GCS, preserving relative paths. | ||
|
|
||
| Args: | ||
| local_dir (Path): Root of the local directory tree to upload. | ||
| gcs_prefix (str): Destination GCS prefix, e.g. ``gs://bucket/path``. | ||
| """ | ||
| from google.cloud import storage as gcs_storage | ||
|
|
||
| without_scheme = gcs_prefix[len("gs://"):] | ||
| bucket_name, _, blob_prefix = without_scheme.partition("/") | ||
| client = gcs_storage.Client() | ||
| bucket = client.bucket(bucket_name) | ||
|
|
||
| for local_file in local_dir.rglob("*"): | ||
| if not local_file.is_file(): | ||
| continue | ||
| relative = local_file.relative_to(local_dir) | ||
| blob_name = ( | ||
| f"{blob_prefix.rstrip('/')}/{relative}" if blob_prefix else str(relative) | ||
| ) | ||
| bucket.blob(blob_name).upload_from_filename(str(local_file)) | ||
| logging.info("Uploaded %s → gs://%s/%s", relative, bucket_name, blob_name) |
There was a problem hiding this comment.
Fixed: _upload_dir_to_gcs now delegates to copy_to_gcs from gentropy.external.gcs, removing the inline google.cloud.storage import and manual bucket/blob parsing.
| @staticmethod | ||
| def _expand_grid(parameter_grid: dict[str, Any]) -> list[dict[str, Any]]: | ||
| """Expand a W&B-style parameter grid into a flat list of hyperparameter configs. | ||
|
|
||
| Args: | ||
| parameter_grid (dict[str, Any]): Grid in the form ``{"param": {"values": [v1, v2, ...]}, ...}``. | ||
|
|
||
| Returns: | ||
| list[dict[str, Any]]: One dict per hyperparameter combination. | ||
|
|
||
| Raises: | ||
| ValueError: If any grid entry is missing a ``values`` key. | ||
| """ | ||
| keys = list(parameter_grid.keys()) | ||
| missing = [k for k in keys if "values" not in parameter_grid[k]] | ||
| if missing: | ||
| raise ValueError( | ||
| f"Hyperparameter grid entries must have a 'values' key. Missing in: {missing}" | ||
| ) | ||
| values = [parameter_grid[k]["values"] for k in keys] | ||
| return [dict(zip(keys, combo)) for combo in product(*values)] |
There was a problem hiding this comment.
Already covered: _expand_grid has unit tests in tests/gentropy/method/test_l2g_trainer.py (including valid grid, single-param grid, and invalid-key error cases). _write_cv_files and cv_results_dir file-mode output are tested in both test_l2g_trainer.py and tests/gentropy/test_l2g_cv_step.py.
Keeps both LocusToGeneTrainTestSplitStep (from PR #1227) and LocusToGeneCrossValidationStep (this branch) as independent steps. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…lities Tests cover _expand_grid, _write_cv_files, cross_validate with cv_results_dir, and LocusToGeneCrossValidationStep end-to-end including hyperparameter grid sweep. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Direct instance method assignment is rejected by mypy; patch.object is the correct way to stub a method in tests. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…onStep - Extract _L2G_FEATURES_LIST and _L2G_HYPERPARAMETERS as module-level constants in config.py; LocusToGeneConfig, LocusToGeneTrainTestSplitConfig, and LocusToGeneCrossValidationConfig all reference them to eliminate copy-paste duplication - Unpersist train/test feature matrix Spark DataFrames in LocusToGeneCrossValidationStep after toPandas() materialises them - Replace _upload_dir_to_gcs inline GCS client with the shared copy_to_gcs helper from gentropy.external.gcs Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
For each config evaluated during cross-validation, train on the full training set and evaluate on the held-out test set, appending a "holdout" row to cv_folds.csv alongside the per-fold rows. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…tionConfig OmegaConf cannot handle field(default_factory=lambda: ...) on list[str] and dict[str, Any] typed fields in structured configs — it tries to resolve default_factory as a dict key and raises ConfigAttributeError. Apply the same Any workaround already used for hyperparameter_grid. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
When LocusToGeneCrossValidationStep is instantiated via Hydra, the hyperparameters field arrives as an OmegaConf DictConfig rather than a plain dict. The property fell through to default_factory() which does not exist on DictConfig. Fall back to dict() conversion instead. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
The train-test split step saves goldStandardSet as 0/1 integers. LocusToGeneCrossValidationStep was unconditionally applying a string-to-int map, silently producing NaN for all rows and leaving zero positives for hierarchical_split. Now skips the map when the column is already numeric. Adds a regression test with integer-encoded labels. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
_eval_on_test_set is skipped when x_test/y_test/test_df are None so that unit tests calling cross_validate directly (without a full step setup) still pass. Extracts _maybe_append_holdout_row helper to keep cross_validate complexity within the C901 limit. Updates test_cv_results_json_structure to expect n_splits + 1 fold_metrics entries when test data is present. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Set n_estimators=300 as the explicit default. With eta=0.05 the previous implicit default of 100 trees gave a learning capacity (eta × n_estimators) of 5, versus XGBoost's internal default of ~30. 300 trees brings capacity closer to what the learning rate implies. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
gamma=0 (min split gain, explicit default) and max_delta_step=1 (stabilises gradient updates for imbalanced binary classification, per XGBoost docs recommendation). Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
When holdout_only=True, CV folds are skipped entirely. Each config is trained once on the full training set and evaluated directly on the holdout set. Produces a cv_folds.csv with one row per config (fold=holdout) and n_splits=0 in cv_results.json. Use when a large pre-split holdout is available and k-fold CV would add noise and 5x compute overhead without additional signal. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Resolved conflicts in config.py, l2g.py, trainer.py, and test_l2g_trainer.py: - Keep _L2G_FEATURES_LIST / _L2G_HYPERPARAMETERS constants (our deduplication) - Keep LocusToGeneCrossValidationConfig and its Hydra registration (new in this branch) - Take dev's to_unpersist list pattern and chained .persist() calls (resource management) - Take dev's logger / _SetStats / pd.io.common.get_handle improvements - Take dev's Spark-native label decoding in run_train (more efficient) - Keep TestExpandGrid / TestWriteCvFiles / TestCrossValidateWithCvResultsDir (new tests) Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
| # If no grid is provided, use default ones set in the model | ||
| parameter_grid = parameter_grid or { | ||
| param: {"values": [value]} | ||
| for param, value in self.model.hyperparameters.items() | ||
| } |
There was a problem hiding this comment.
Partially addresed. I'd skip the n_splits=0 guard — it's not a realistic mistake and adding validation for every degenerate integer just adds noise. The holdout_only one is worth
it because it's a named flag with documented semantics that silently breaks.
There was a problem hiding this comment.
Given refactor, we should move this upstream to the step. Feels awkward that cv function need to define by itself the parameter grid.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Rename `l` → `label` in _plot_confusion_matrix list comprehension (ruff E741) - Raise ValueError when holdout_only=True without cv_results_dir Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
| features_list: list[str] = field(default_factory=list) | ||
| hyperparameters: dict[str, Any] = field( | ||
| default_factory=lambda: { | ||
| "n_estimators": 300, |
There was a problem hiding this comment.
New parameter, did you run hyperparameter tuning?
| "gamma": 0, # min loss reduction to make a split | ||
| "max_delta_step": 1, # stabilises updates for imbalanced data |
There was a problem hiding this comment.
As above, is there a hyperparameter tuning for gamma and max_delta_step ?
From xgboost:
gamma ∈ [0, ∞)
gamma = 0
Since gamma is the minimum loss reduction required for a split, the default (0) allows any split with positive regularized gain. This can produce more complex individual trees, including branches with small objective improvements
Again for the max_delta_step this is actually non-default value.
| """ | ||
| if not self.hyperparameters: | ||
| raise ValueError("Hyperparameters have not been set.") | ||
| elif isinstance(self.hyperparameters, dict): | ||
| return self.hyperparameters | ||
| return self.hyperparameters.default_factory() | ||
| try: | ||
| # dataclasses.field with default_factory (legacy path) | ||
| return self.hyperparameters.default_factory() | ||
| except (AttributeError, TypeError): | ||
| # OmegaConf DictConfig passed directly from Hydra | ||
| return dict(self.hyperparameters) |
There was a problem hiding this comment.
I think we should state what is the order of precedence for the hyperparameters.
- if self.hyperparameters are a dict -> return them
- else return dict(self.hyperparameters.default_factory())
Right now the execution is:
if self.hyperparameters == None:
raise
if self.hyperparameters is dict:
return them
else (case when hyperparameters are not null, nor a dict -> could be any other type):
return defaults.
| train_feature_matrix_path: str, | ||
| test_feature_matrix_path: str, | ||
| hyperparameters: dict[str, Any], | ||
| features_list: list[str] | None = None, | ||
| n_splits: int = 5, | ||
| hyperparameter_grid: dict[str, Any] | None = None, |
There was a problem hiding this comment.
It is a bit confusing that we have hyperparameters and hyperparameter_grid defined at the same time. I would probably keep only hyperparameter_grid as this is a superset of all parameter ranges, and given CV run, you will probably also want to sweep over the hyperparameter grid anyway.
On the second note, since we are not explicit on what kind of hyperparameters are expected (we pass them through directly to the xgboost model, it would be useful to have some minimal hint about that in the docstring + meaningful type hints). All parameters are expected to be of a metaType
class ParameterRange(NamedTuple):
values: list[int] | list[float]
name: str
class ParameterRangeSpec(NamedTuple):
upper_bound: int| float
lower_bound: int | float
step: int | float
def to_grid(name: str) -> ParameterRange:
return ...
class Hyperparameters:
grid: list[ParameterRange]
Given such types, we could easily understand the concept behind the input values.
| effective_features = train_feature_matrix.features_list | ||
|
|
||
| l2g_model = LocusToGeneModel( | ||
| model=XGBClassifier(random_state=777, eval_metric="aucpr"), |
There was a problem hiding this comment.
FUN FACT -> https://arxiv.org/abs/2109.08203
The random seed probably should be handled as all other hyperparameters 😵, otherwise we could end up in non-optimal spot.
| "transPQtlColocH4MaximumNeighbourhood", | ||
| ] | ||
| ) | ||
| features_list: list[str] = field(default_factory=lambda: list(_L2G_FEATURES_LIST)) |
| trainer.y_test = y_test | ||
|
|
||
| trainer.cross_validate( | ||
| parameter_grid=hyperparameter_grid, |
There was a problem hiding this comment.
Since we are refactoring this bit, it would be useful if we specify one hyperparameter_grid for the entire step and reuse it everywhere (another words -> trainer.cross_validate should not need to look up for the trainer.hyperparameters and specifying the artificial grid if hyperparameter_grid is passed as None to the cross_validate method)
The handling of the hyperparameter_grid should happen before cross_validate call.
| features_list: Any = field(default_factory=lambda: list(_L2G_FEATURES_LIST)) # list[str] — Any avoids OmegaConf default_factory bug | ||
| hyperparameters: Any = field(default_factory=lambda: dict(_L2G_HYPERPARAMETERS)) # dict[str, Any] — Any avoids OmegaConf default_factory bug | ||
| n_splits: int = 5 | ||
| hyperparameter_grid: Any = None # dict[str, Any] | None — Any avoids OmegaConf Optional[Dict] merge bug |
There was a problem hiding this comment.
specifying hyperparameter_grid and hyperparameters feels redundant and implies implicit behavior which is not documented.
| wandb_run_name: str | None = None, | ||
| parameter_grid: dict[str, Any] | None = None, |
There was a problem hiding this comment.
if we are not using the w&b for CV runs, we should remove the parameter and downstream logic.
| # If no grid is provided, use default ones set in the model | ||
| parameter_grid = parameter_grid or { | ||
| param: {"values": [value]} | ||
| for param, value in self.model.hyperparameters.items() | ||
| } |
There was a problem hiding this comment.
Given refactor, we should move this upstream to the step. Feels awkward that cv function need to define by itself the parameter grid.
✨ Context
Cross-validation and hyperparameter sweeps are exploratory — they should never run as part of a production training job. `LocusToGeneStep` previously had a `cross_validate` flag that was never used in prod and bloated its interface. This PR introduces a dedicated `LocusToGeneCrossValidationStep` that takes pre-built, annotated train/test feature matrix parquets and runs k-fold CV without any gold-standard parsing or feature matrix construction.
🛠 What does this PR implement
New: `LocusToGeneCrossValidationStep` (`l2g.py`)
A standalone step for CV and hyperparameter exploration. Takes two parquet inputs — a pre-split training set and a held-out test set, both annotated with features and `goldStandardSet` labels — and runs k-fold CV on the training partition followed by a final evaluation on the test set.
Parameters:
No W&B logging — CV output is file-based or terminal only.
Registered as Hydra step `locus_to_gene_cross_validation`.
New: `LocusToGeneTrainTestSplitStep` (`l2g.py`)
A standalone step for creating the train/test split upstream of CV or training. Registered as Hydra step `locus_to_gene_train_test_split`.
`LocusToGeneTrainer` upgrades
`LocusToGeneStep` refactor
Wired to consume pre-split train/test parquets from `LocusToGeneTrainTestSplitStep` instead of building the feature matrix inline. Gold-standard parsing and the `cross_validate` flag are removed from this step.
Config deduplication
Extracted `_L2G_FEATURES_LIST` and `_L2G_HYPERPARAMETERS` as module-level constants; `LocusToGeneConfig`, `LocusToGeneTrainTestSplitConfig`, and `LocusToGeneCrossValidationConfig` all reference them.
🧪 Test plan
🤖 Generated with Claude Code