Rich evaluations on new data with atomic diffs by SamusRam · Pull Request #35 · pluskal-lab/EnzymeExplorer

SamusRam · 2026-04-02T21:17:20Z

No description provided.

The duplicated lambda at lines 189-193 and 223-227 used an intersection check against {"Unknown", "precursor substr"} to decide whether to add "isTPS". This caused sequences with labels like {"FPP_SMILES", "precursor substr"} to incorrectly miss the isTPS flag. Replace with a shared helper `assign_is_tps_label` that uses a subset check: isTPS is added whenever the label set contains at least one real TPS substrate (anything outside {"Unknown", "precursor substr", "other"}). Note: previously cached fold_*_results.pkl files embed test_df with the old labeling and must be invalidated before evaluation. Made-with: Cursor
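To illustrate the difference between the two checks described in this commit message, here is a minimal sketch. The function `legacy_add_is_tps` is a hypothetical reconstruction of the old lambda, and the signature of `assign_is_tps_label` is an assumption; only the label sets and the intersection-versus-subset distinction come from the message itself.

```python
# Label set named in this commit message as "not a real TPS substrate".
_NON_TPS_LABELS = frozenset({"Unknown", "precursor substr", "other"})


def legacy_add_is_tps(labels):
    """Hypothetical reconstruction of the old lambda (intersection check).

    isTPS is withheld as soon as ANY non-TPS marker is present.
    """
    labels = set(labels)
    if not labels & {"Unknown", "precursor substr"}:
        labels.add("isTPS")
    return labels


def assign_is_tps_label(labels, non_tps_labels=_NON_TPS_LABELS):
    """Assumed shape of the shared helper (subset check).

    isTPS is added whenever at least one label lies outside non_tps_labels.
    """
    labels = set(labels)
    if not labels <= non_tps_labels:
        labels.add("isTPS")
    return labels


mixed = {"FPP_SMILES", "precursor substr"}
old_result = legacy_add_is_tps(mixed)    # isTPS is missing
new_result = assign_is_tps_label(mixed)  # isTPS is present
```

With the subset check, a sequence only stays unflagged when every one of its labels is a non-TPS marker.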

Regenerate data/TPS-Nov19_2023_verified_all_reactions_with_neg_with_folds.csv from the authoritative data/tps_folds_nov2023.h5 via store_folds_into_csv.py with split_col_name=stratified_phylogeny_based_split_with_minor_products. Create parallel *_phylo_folds config versions for all benchmark models (Blastp, CLEAN, Foldseek, HMM, PfamSUPFAM, DomainsRandomForest, PlmDomainsRandomForest) pointing to the phylo-fold CSV and split column. Existing mmseqs-based configs are unchanged. Made-with: Cursor

scripts/compute_fold_similarities.py computes test-vs-train MMseqs2 similarity for each fold and produces a pickle artifact consumed by evaluation. Pinned settings: --alignment-mode 3, top-N retrieval with post-hoc best-hit selection (highest pident with qcov >= threshold, tiebreak by lowest evalue). Stores pident, qcov, evalue, has_hit per test sequence. Metadata saved to companion .meta.json. Includes unit tests for best-hit selection and fold-name discovery. Made-with: Cursor
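The best-hit rule stated above (highest pident among hits with qcov at or above the threshold, ties broken by lowest evalue) could look roughly like the sketch below. The function names, the dict-based hit representation, the use of `None` for missing values, and qcov expressed as a fraction are all assumptions made for illustration; they are not taken from the script.

```python
from typing import Dict, List, Optional


def select_best_hit(hits: List[Dict], min_qcov: float) -> Optional[Dict]:
    """Pick the best hit among the top-N hits retrieved for one test sequence."""
    eligible = [h for h in hits if h["qcov"] >= min_qcov]
    if not eligible:
        return None
    # Sort key: highest pident first, then lowest evalue as the tiebreak.
    return min(eligible, key=lambda h: (-h["pident"], h["evalue"]))


def to_record(best: Optional[Dict]) -> Dict:
    """Build the per-sequence record with the four stored fields."""
    if best is None:
        # Placeholder values for sequences without an eligible hit (assumed).
        return {"pident": None, "qcov": None, "evalue": None, "has_hit": False}
    return {
        "pident": best["pident"],
        "qcov": best["qcov"],
        "evalue": best["evalue"],
        "has_hit": True,
    }


hits = [
    {"target": "a", "pident": 90.0, "qcov": 0.50, "evalue": 1e-50},  # fails qcov
    {"target": "b", "pident": 80.0, "qcov": 0.90, "evalue": 1e-30},
    {"target": "c", "pident": 80.0, "qcov": 0.95, "evalue": 1e-40},  # wins tiebreak
]
best = select_best_hit(hits, min_qcov=0.8)
```

Selecting post hoc from the top-N list, rather than taking the first reported hit, keeps the choice independent of the order in which the search tool emits alignments.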

…support

- Add load_similarity_artifact() to normalise both legacy BLAST and rich MMseqs pickles into a common schema (pident, qcov, evalue, has_hit). Legacy entries tagged is_synthetic=True.
- Replace hardcoded 10-step BLAST identity bucketing with configurable similarity_bins parameter; default falls back to legacy behaviour.
- Add "all" (no filter) and "no_hit" pseudo-bins automatically.
- Add min_negatives_for_eval check alongside existing min positives.
- Log support counts (n_pos, n_neg) per bin for transparency.
- Add --similarities-path CLI alias for --blast-identities-path.
- Unit tests for artifact loading, bin label generation, and record keys.

Made-with: Cursor
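As a rough illustration of configurable similarity bins with the "all" and "no_hit" pseudo-bins and a support check, consider the sketch below. The label format ("lo-hi"), the function names, and the representation of bins as a sorted list of edges are assumptions; the real `similarity_bins` parameter may be shaped differently.

```python
from typing import List, Optional, Sequence


def make_bin_labels(edges: Sequence[int]) -> List[str]:
    """Labels for the identity bins plus the two automatically added pseudo-bins."""
    labels = [f"{lo}-{hi}" for lo, hi in zip(edges[:-1], edges[1:])]
    return ["all", "no_hit"] + labels


def assign_bin(pident: float, has_hit: bool, edges: Sequence[int]) -> Optional[str]:
    """Map one test sequence to its bin; every sequence also belongs to "all"."""
    if not has_hit:
        return "no_hit"
    for lo, hi in zip(edges[:-1], edges[1:]):
        is_last = hi == edges[-1]
        # Bins are half-open [lo, hi); the last bin also includes its upper edge.
        if lo <= pident < hi or (is_last and pident == hi):
            return f"{lo}-{hi}"
    return None


def has_enough_support(y_true: Sequence[int], min_pos: int, min_neg: int) -> bool:
    """Skip a bin when it has too few positives or too few negatives."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return n_pos >= min_pos and n_neg >= min_neg


edges = [0, 50, 100]
labels = make_bin_labels(edges)
```

Checking negatives as well as positives matters because ranking metrics computed on a bin with almost no negatives are not informative.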

…branch Check out utils.py, constants.py, hmmer_wrapper.py, and mmseqs2_wrapper.py verbatim from origin/revision_data-preparation (94013a2) to minimise future merge conflicts. Add negative_filters.py as a thin re-export layer exposing filter_out_putative_tpss, filter_by_ec, filter_by_go, and filter_by_pfam_supfam plus the constants they depend on. Unit tests cover filter_out_putative_tpss and filter_by_ec with synthetic data. Made-with: Cursor

…leaning-filters

…cleaning-filters

- get_folds_from_csv: handle bare-integer folds (new dataset)
- experiment_runner: normalize fold column on load, configurable type_col_name
- BaseConfig: add type_col_name with backward-compatible default
- compute_fold_similarities: support bare-integer fold format
- build_synced_fold_dataset.py: generate synced-fold CSV (old data, new tps folds)
- compare_cross_dataset.py: cross-dataset comparison on shared TPS
- New configs for Blastp, PlmRandomForest, PlmDomainsRandomForest (new_dataset + synced_folds variants)
- Add EnzymeExplorer_Dataset.csv from revision branch
- 9 unit tests for fold compat, normalization, synced dataset builder
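One way fold-column normalization might work is sketched below. The commit message does not spell out the old fold format or the canonical form, so the "fold_3"-style prefix, the choice of a plain digit string as the canonical value, and the name `normalize_fold` are all assumptions for illustration only.

```python
import re
from typing import Union

# Accepts "3", "fold_3", "fold-3", "Fold3" (prefixed forms are an assumed legacy format).
_FOLD_RE = re.compile(r"^(?:fold[_-]?)?(\d+)$", re.IGNORECASE)


def normalize_fold(value: Union[int, float, str]) -> str:
    """Map a fold value to a canonical digit string such as "3"."""
    # CSV readers often turn integer columns with gaps into floats (3 -> 3.0).
    if isinstance(value, float) and value.is_integer():
        value = int(value)
    match = _FOLD_RE.match(str(value).strip())
    if match is None:
        raise ValueError(f"Unrecognised fold value: {value!r}")
    return str(int(match.group(1)))


normalized = [normalize_fold(v) for v in (3, "3", "fold_3", 3.0, " 03 ")]
```

Normalizing once on load means downstream code can compare fold values with plain equality regardless of which dataset the CSV came from.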

Add cross-dataset evaluation support to the experiment runner: models trained on old synced data (Track C) can now be evaluated on new-dataset folds (Track D), isolating the dataset-shift effect.

- BaseConfig: add eval_csv_path, eval_split_col_name, eval_id_col_name, eval_seq_col_name, eval_type_col_name, eval_group_col_name fields
- EmbSklearnBaseConfig: add eval_representations_path (kw_only)
- experiment_runner: add _load_eval_dataset helper; swap features_df at prediction time for cross-dataset embeddings
- CLEAN: propagate eval column support
- New Track D configs: Blastp, CLEAN, HMM, PlmRandomForest
- New track configs: CLEAN, Foldseek, HMM (new_dataset + synced_folds)
- Scripts: run_cross_negative_stress_test.py, run_track_c_cross_eval.py
- Tests: add TestLoadEvalDataset (column rename, fold normalization, fold selection, no-cross-eval guard)
- Data CSV fixes for fold column consistency

Comprehensive document motivating the A/C/D/B evaluation protocol for knowledgeable ML reviewers. Covers atomic variable isolation, per-similarity-bin analysis, cross-negative stress test, and atomic negative cleaning analysis. Results presented in two layers: aggregate overview (grouped bars, heatmap) then per-bin degradation curves (faceted lines).

Generates three figure types from evaluation pickles:

1. Grouped bar charts (aggregate mAP/AP per model per track)
2. Faceted line plots (AP and MCC-F1 by sequence-identity bin)
3. Heatmaps (models x tracks x metrics in a single glanceable view)

Supports any combination of tracks via CLI; auto-discovers models and similarity bins from the data.

Prenyltransferases (pt, ggpps, fpps, gpps, gfpps, hsqs) catalyse prenyl-chain elongation, not terpene cyclisation; they are not TPS. They are now excluded from the per-type macro-average AP and treated as negatives alongside "Unknown" proteins.

Changes:

- Add "pt" to _PRECURSOR_TYPES in experiment_runner.py
- Add _NON_TPS_TYPES constant in evaluation/plotting scripts
- Filter precursor types from per-type AP computation in: plot_camera_ready.py, evaluate_per_type_tps_detection.py, analyze_per_type_tps_detection.py, analyze_macro_type_ap_drop.py
- Add TestPrenyltransferaseExclusion tests (4 tests)
- Add dump_pertype_fold_ap.py diagnostic script
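The effect of excluding prenyltransferase types from the per-type macro-average can be sketched as follows. The contents of `_NON_TPS_TYPES` here are inferred from the type names listed in the commit message and may not match the actual constant; the function name and the example type keys ("mono", "di") are hypothetical.

```python
import math
from typing import Dict, FrozenSet

# Assumed contents, based on the types named in the commit message.
_NON_TPS_TYPES: FrozenSet[str] = frozenset(
    {"pt", "ggpps", "fpps", "gpps", "gfpps", "hsqs", "Unknown"}
)


def macro_average_ap(
    per_type_ap: Dict[str, float],
    excluded: FrozenSet[str] = _NON_TPS_TYPES,
) -> float:
    """Unweighted mean of per-type AP over TPS types only."""
    kept = [ap for tps_type, ap in per_type_ap.items() if tps_type not in excluded]
    if not kept:
        return math.nan
    return sum(kept) / len(kept)


# Hypothetical per-type AP values; "pt" is dropped before averaging.
example = {"mono": 0.8, "di": 0.6, "pt": 0.1}
macro_ap = macro_average_ap(example)
```

Because a macro-average weights every type equally, a single non-TPS type with low AP would otherwise pull the aggregate down disproportionately.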

Major rewrite of local_validation_protocol.md:

- Add Track E (new TPS + old negatives) throughout
- Fix TPS counts: 339 additional enzymes, not ~2,100
- Remove Section 3.1 (Negative Distribution Correction)
- Rewrite waterfall analysis with D→E and E→B decomposition
- Add full metric tables (substrate + TPS detection)
- Add model variant sections (CLEANBetterDetection, CLEANEcDetection, Hierarchical PlmRF/PlmDomainsRF)
- Document PT exclusion from per-type macro-average
- Update Figure 6 caption for 2×2 heatmap layout

Also includes PlmDomainsRF config and code updates, and cross-track visualization improvements.

New scripts:

- build_track_e_dataset.py: construct Track E dataset
- compute_cross_dataset_similarities.py: MMseqs2 similarities
- evaluate_new_models.py: evaluate PlmDomainsRF and model variants
- extend_ec_mapping.py: extend EC-to-substrate mapping for new dataset
- postprocess_clean_ec_detection.py: EC-based CLEAN TPS detection
- postprocess_clean_tps_detection.py: proportion-based CLEAN detection
- build_hierarchical_models.py: two-stage hierarchical PlmRF/PlmDomainsRF
- investigate_high_scoring_negatives.py: substrate-bearing negatives analysis
- plot_substrate_neg_analysis.py: substrate-bearing negatives visualization
- build_combined_domain_features.py: combined domain feature extraction
- build_combined_domain_matrix.py: domain comparison matrix
- build_domain_features_foldseek.py: Foldseek-based domain features
- test_hierarchical_models.py: tests for hierarchical model logic

New experiment configurations for:

- Track E (cross_new_tps_old_neg) for Blastp, CLEAN, Foldseek, HMM, PlmRF, PlmDomainsRF
- Cross-dataset (cross_synced_to_new) for Foldseek, PlmDomainsRF
- CLEANBetterDetection and CLEANEcDetection variants
- PlmRandomForestHierarchical and PlmDomainsRandomForestHierarchical

- track_e_new_tps_old_neg.csv: Track E dataset (new TPS + old negatives)
- ec_to_substrate_mapping_extended.json: 356 ECs covering old + new TPS
- ec_to_substrate_mapping_2026_03_14.json: original 292-EC mapping
- martsDB_reactions_2026_02_22.csv: MartsDB reaction data for EC mapping
- EnzymeExplorer_Dataset.csv: updated main dataset

The proportion-based approach (CLEANBetterDetection) consistently underperformed the original CLEAN — drop it from the protocol doc. Keep only CLEANEcDetection (EC-based detection, +1.6 to +4.7 pp AP). Also fix: missing blank line before Section 13 heading, trailing newline.

8 figures: combined heatmap, grouped bars (mAP + AP), detailed heatmap, per-similarity-bin plots, and waterfall decomposition charts.

…mbers. The final prediction scores are calculated by weighted majority voting.

segef · 2026-04-07T09:21:46Z

enzymeexplorer/src/experiments_orchestration/experiment_runner.py

 logger = logging.getLogger(__file__)
 logger.setLevel(logging.INFO)

+_NON_TPS_LABELS = frozenset({"Unknown", "precursor substr", "other"})


I don't think "other" should be included in _NON_TPS_LABELS. Otherwise, proteins whose substrate is "other" won't be used as true positives for the isTPS label.

Good catch! Thank you 🙏

Merge PR #34 (bugfix/fix_clean_ec_to_substrate_mapping):

- Replace Rhea-based EC lookup with pre-computed JSON mapping
- Add fold-specific pretrained model support (fold_idx parameter)
- Add weighted majority voting for isTPS/substrate prediction
- Add ec_utils.py, clean_dataset_prep.py, get_ec_to_substrate_mapping.py
- Add fold_idx parameter to all predict_proba signatures
- Update all CLEAN/CLEANEcDetection/CLEANBetterDetection configs

Fix _NON_TPS_LABELS:

- Remove 'other' from the set so proteins with 'other' substrate are correctly labeled as TPS-positive (isTPS=True)
- Update tests accordingly

Made-with: Cursor
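The commit message mentions weighted majority voting over multiple predicted EC numbers but does not say how the weights are defined. The sketch below shows one plausible reading, in which each predicted EC carries a confidence weight and votes for the substrates it maps to. The function name, the weight normalisation, and the placeholder EC identifiers and substrate names are all assumptions, not the merged implementation.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def weighted_majority_vote(
    ec_predictions: Sequence[Tuple[str, float]],
    ec_to_substrates: Dict[str, List[str]],
) -> Dict[str, float]:
    """Turn (EC number, weight) predictions into per-substrate scores in [0, 1]."""
    total = sum(weight for _, weight in ec_predictions)
    if total <= 0:
        return {}
    scores: Dict[str, float] = defaultdict(float)
    for ec, weight in ec_predictions:
        # ECs absent from the mapping cast no vote for any substrate.
        for substrate in ec_to_substrates.get(ec, ()):
            scores[substrate] += weight / total
    return dict(scores)


# Placeholder identifiers; a real mapping would come from the pre-computed JSON.
mapping = {"EC_A": ["GPP"], "EC_B": ["FPP"], "EC_C": ["GPP"]}
predictions = [("EC_A", 0.6), ("EC_B", 0.3), ("EC_C", 0.1)]
scores = weighted_majority_vote(predictions, mapping)
```

Under this reading, several low-confidence ECs that agree on a substrate can outweigh a single higher-confidence EC that disagrees.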

Proteins with Type=Unknown that happen to carry real terpenoid substrate annotations (e.g. prenyltransferases misannotated as Unknown) were incorrectly labelled isTPS=True by assign_is_tps_label, artificially deflating TPS detection AP by ~40% on tracks B/D/E. The fix adds _remap_substrates_by_type() which overrides the substrate column to "Unknown" for any protein with Type=Unknown *before* the groupby+assign_is_tps_label step, mirroring the existing precursor-type remap. Both call sites (data_df and eval_data_df) now use the shared helper.

Also includes:

- patch_istps_labels.py: one-off script that corrected existing fold result pickles (95 files across tracks B/D/E)
- Regression tests for the remap and end-to-end isTPS assignment
- Regenerated all figures referenced by local_validation_protocol.md

Made-with: Cursor
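The ordering issue described above (remap first, then group and label) can be shown with a small sketch. It uses plain dictionaries instead of a DataFrame, and the column names ("id", "Type", "substrate"), the function names, and the example type "sesq" are hypothetical; only the rule itself, overriding the substrate to "Unknown" when Type is "Unknown" before labels are assigned, comes from the commit message.

```python
from collections import defaultdict
from typing import Dict, List, Set

NON_TPS_LABELS = frozenset({"Unknown", "precursor substr"})


def remap_substrates_by_type(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return copies of rows with substrate forced to "Unknown" where Type is "Unknown"."""
    remapped = []
    for row in rows:
        row = dict(row)
        if row["Type"] == "Unknown":
            row["substrate"] = "Unknown"
        remapped.append(row)
    return remapped


def labels_per_protein(rows: List[Dict[str, str]]) -> Dict[str, Set[str]]:
    """Group substrates by protein id, then add isTPS via the subset check."""
    grouped: Dict[str, Set[str]] = defaultdict(set)
    for row in rows:
        grouped[row["id"]].add(row["substrate"])
    for labels in grouped.values():
        if not labels <= NON_TPS_LABELS:
            labels.add("isTPS")
    return dict(grouped)


rows = [
    {"id": "P1", "Type": "Unknown", "substrate": "FPP"},  # misannotated negative
    {"id": "P2", "Type": "sesq", "substrate": "FPP"},     # genuine TPS
]
before_fix = labels_per_protein(rows)
after_fix = labels_per_protein(remap_substrates_by_type(rows))
```

Without the remap, P1 is counted as a TPS positive; with it, only P2 carries the isTPS label.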

Remove per-TPS-type macro-averaged AP computation from plot_camera_ready.py (functions, fig4c, fig6 bottom panels). Figure 6 is now a clean 1×2 side-by-side heatmap (mAP + AP). Filter CLEAN (retrained) to Track B only. Regenerate all figures. Update local_validation_protocol.md caption accordingly. Made-with: Cursor

Diagnostic script to measure the impact of including the 'precursor substr' class in mAP computation. Shows per-model, per-track delta and per-class AP breakdown for Blastp Track A. Made-with: Cursor

The TPS detection AP values need re-evaluation after the isTPS label fix (stale evaluation CSVs were overriding corrected fold results). Remove the AP panel from fig6 for now; keep only substrate prediction mAP. Also fix merge logic so fold pickle values take priority over stale CSVs. Made-with: Cursor

SamusRam and others added 23 commits March 14, 2026 15:16

Merge branch 'feature/mmseqs-similarity-artifact' into feature/port-c…

b73699d

…leaning-filters

Merge branch 'feature/extend-eval-similarity-bins' into feature/port-…

af8ce1f

…cleaning-filters

Add script for producing a json with the EC to TPS substrate mapping

07ddb91

Use the propagated EC to substrates mapping in CLEAN model

16c3afa

Add a new script to prepare datasets for CLEAN retraining and evaluation

d0af5a0

Change file extension of prepped clean data file to csv

34fc0f2

Support running CLEAN from retrained models for each fold

0bb8210

Adjust the CLEAN config file for retrained models

f59e233

SamusRam requested a review from segef April 2, 2026 21:17

SamusRam and others added 4 commits April 2, 2026 23:20

docs: add figure PNGs referenced by local_validation_protocol.md

0aa8ce9

Fix CLEAN prediction logic by supporting prediction of multiple EC nu…

213080a

…mbers. The final prediction scores are calculated by weighted majority voting.

remove local files from repo

6993f63

Change clean_installation_root config default

2bcee83

segef reviewed Apr 7, 2026

SamusRam added 4 commits April 7, 2026 17:07

test: add mAP comparison script (with/without precursor substr)

0db5155

SamusRam wants to merge 32 commits into main from feature/port-cleaning-filters
