Rich evaluations on new data with atomic diffs #35
Open
The duplicated lambda at lines 189-193 and 223-227 used an
intersection check against {"Unknown", "precursor substr"} to decide
whether to add "isTPS". This caused sequences with labels like
{"FPP_SMILES", "precursor substr"} to incorrectly miss the isTPS flag.
Replace with a shared helper `assign_is_tps_label` that uses a subset
check: isTPS is added whenever the label set contains at least one
real TPS substrate (anything outside {"Unknown", "precursor substr",
"other"}).
Note: previously cached fold_*_results.pkl files embed test_df with
the old labeling and must be invalidated before evaluation.
Made-with: Cursor
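The difference between the two checks can be sketched as follows. This is a minimal illustration, not the repository's code: the function signatures and the `old_rule` name are assumptions, and the label set is the one listed in this commit message (a later commit in this PR removes "other" from it).

```python
from typing import Iterable, Set

# Labels that do not count as a real TPS substrate, as listed in this commit.
NON_TPS_LABELS = frozenset({"Unknown", "precursor substr", "other"})


def old_rule(labels: Iterable[str]) -> Set[str]:
    """Old behaviour (hypothetical reconstruction): any overlap with the
    blocked labels suppresses isTPS, even if a real substrate is present."""
    result = set(labels)
    if not result & {"Unknown", "precursor substr"}:
        result.add("isTPS")
    return result


def assign_is_tps_label(labels: Iterable[str]) -> Set[str]:
    """New behaviour: add isTPS unless every label is a non-TPS label,
    i.e. unless the label set is a subset of NON_TPS_LABELS."""
    result = set(labels)
    if not result <= NON_TPS_LABELS:
        result.add("isTPS")
    return result


mixed = {"FPP_SMILES", "precursor substr"}
print(sorted(old_rule(mixed)))             # intersection check
print(sorted(assign_is_tps_label(mixed)))  # subset check
```

With the mixed label set, only the subset check adds the flag, which is the case the commit message describes.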
Regenerate data/TPS-Nov19_2023_verified_all_reactions_with_neg_with_folds.csv
from the authoritative data/tps_folds_nov2023.h5 via store_folds_into_csv.py
with split_col_name=stratified_phylogeny_based_split_with_minor_products.
Create parallel *_phylo_folds config versions for all benchmark models
(Blastp, CLEAN, Foldseek, HMM, PfamSUPFAM, DomainsRandomForest,
PlmDomainsRandomForest) pointing to the phylo-fold CSV and split column.
Existing mmseqs-based configs are unchanged.
Made-with: Cursor
scripts/compute_fold_similarities.py computes test-vs-train MMseqs2
similarity for each fold and produces a pickle artifact consumed by
evaluation.
Pinned settings: --alignment-mode 3, top-N retrieval with post-hoc
best-hit selection (highest pident with qcov >= threshold, tiebreak by
lowest evalue). Stores pident, qcov, evalue, has_hit per test sequence.
Metadata saved to companion .meta.json.
Includes unit tests for best-hit selection and fold-name discovery.
Made-with: Cursor
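The post-hoc best-hit rule described above can be sketched like this. The function name, the dictionary-based hit format, the 0.8 default threshold, and the `None` values stored for sequences without a hit are all assumptions made for illustration; only the selection rule itself comes from the commit message.

```python
from typing import Dict, List, Optional

Hit = Dict[str, float]


def select_best_hit(hits: List[Hit], min_qcov: float = 0.8) -> Dict[str, Optional[float]]:
    """Pick the hit with the highest pident among hits with qcov >= min_qcov,
    breaking ties by the lowest evalue."""
    eligible = [h for h in hits if h["qcov"] >= min_qcov]
    if not eligible:
        # No hit passes the coverage threshold: record an explicit no-hit entry.
        return {"pident": None, "qcov": None, "evalue": None, "has_hit": False}
    # Sort key: descending pident first, then ascending evalue.
    best = min(eligible, key=lambda h: (-h["pident"], h["evalue"]))
    return {
        "pident": best["pident"],
        "qcov": best["qcov"],
        "evalue": best["evalue"],
        "has_hit": True,
    }


hits = [
    {"pident": 99.0, "qcov": 0.40, "evalue": 1e-50},  # filtered out: low coverage
    {"pident": 72.5, "qcov": 0.95, "evalue": 1e-30},
    {"pident": 72.5, "qcov": 0.90, "evalue": 1e-40},  # same pident, lower evalue
]
print(select_best_hit(hits))
```

Filtering on coverage before ranking by identity keeps a short, high-identity local match from masking a longer alignment.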
…support
- Add load_similarity_artifact() to normalise both legacy BLAST and rich
  MMseqs pickles into a common schema (pident, qcov, evalue, has_hit).
  Legacy entries tagged is_synthetic=True.
- Replace hardcoded 10-step BLAST identity bucketing with configurable
  similarity_bins parameter; default falls back to legacy behaviour.
- Add "all" (no filter) and "no_hit" pseudo-bins automatically.
- Add min_negatives_for_eval check alongside existing min positives.
- Log support counts (n_pos, n_neg) per bin for transparency.
- Add --similarities-path CLI alias for --blast-identities-path.
- Unit tests for artifact loading, bin label generation, and record keys.
Made-with: Cursor
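A rough sketch of how the pseudo-bins and the support check might fit together is given below. Everything about the data layout is assumed: the `(low, high)` tuple format for `similarity_bins`, the half-open intervals, the record keys other than `pident` and `has_hit`, and the function names. The legacy default bucketing is not modelled.

```python
from typing import Dict, List, Tuple

Bins = List[Tuple[float, float]]


def record_bins(record: dict, similarity_bins: Bins) -> List[str]:
    """Return every bin label a test record belongs to.
    Each record is always in "all"; records without a hit go to "no_hit"."""
    labels = ["all"]
    if not record["has_hit"]:
        labels.append("no_hit")
        return labels
    for low, high in similarity_bins:
        # Half-open interval [low, high); an assumption of this sketch.
        if low <= record["pident"] < high:
            labels.append(f"{low}-{high}")
    return labels


def support_per_bin(records: List[dict], similarity_bins: Bins,
                    min_pos: int = 1, min_neg: int = 1) -> Dict[str, dict]:
    """Count positives and negatives per bin and flag bins with enough of both."""
    counts: Dict[str, dict] = {}
    for rec in records:
        for label in record_bins(rec, similarity_bins):
            entry = counts.setdefault(label, {"n_pos": 0, "n_neg": 0})
            entry["n_pos" if rec["is_positive"] else "n_neg"] += 1
    for entry in counts.values():
        entry["evaluable"] = entry["n_pos"] >= min_pos and entry["n_neg"] >= min_neg
    return counts


records = [
    {"pident": 95.0, "has_hit": True, "is_positive": True},
    {"pident": 35.0, "has_hit": True, "is_positive": False},
    {"pident": None, "has_hit": False, "is_positive": False},
]
for label, entry in support_per_bin(records, [(0, 50), (50, 101)]).items():
    print(label, entry)
```

Requiring a minimum number of negatives as well as positives avoids reporting a ranking metric on a bin that contains only one class.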
…branch
Check out utils.py, constants.py, hmmer_wrapper.py, and mmseqs2_wrapper.py
verbatim from origin/revision_data-preparation (94013a2) to minimise
future merge conflicts.
Add negative_filters.py as a thin re-export layer exposing
filter_out_putative_tpss, filter_by_ec, filter_by_go, and
filter_by_pfam_supfam plus the constants they depend on.
Unit tests cover filter_out_putative_tpss and filter_by_ec with synthetic
data.
Made-with: Cursor
- get_folds_from_csv: handle bare-integer folds (new dataset)
- experiment_runner: normalize fold column on load, configurable
  type_col_name
- BaseConfig: add type_col_name with backward-compatible default
- compute_fold_similarities: support bare-integer fold format
- build_synced_fold_dataset.py: generate synced-fold CSV (old data, new
  tps folds)
- compare_cross_dataset.py: cross-dataset comparison on shared TPS
- New configs for Blastp, PlmRandomForest, PlmDomainsRandomForest
  (new_dataset + synced_folds variants)
- Add EnzymeExplorer_Dataset.csv from revision branch
- 9 unit tests for fold compat, normalization, synced dataset builder
Add cross-dataset evaluation support to the experiment runner: models
trained on old synced data (Track C) can now be evaluated on new-dataset
folds (Track D), isolating the dataset-shift effect.
- BaseConfig: add eval_csv_path, eval_split_col_name, eval_id_col_name,
  eval_seq_col_name, eval_type_col_name, eval_group_col_name fields
- EmbSklearnBaseConfig: add eval_representations_path (kw_only)
- experiment_runner: add _load_eval_dataset helper; swap features_df at
  prediction time for cross-dataset embeddings
- CLEAN: propagate eval column support
- New Track D configs: Blastp, CLEAN, HMM, PlmRandomForest
- New track configs: CLEAN, Foldseek, HMM (new_dataset + synced_folds)
- Scripts: run_cross_negative_stress_test.py, run_track_c_cross_eval.py
- Tests: add TestLoadEvalDataset (column rename, fold normalization, fold
  selection, no-cross-eval guard)
- Data CSV fixes for fold column consistency
Comprehensive document motivating the A/C/D/B evaluation protocol for
knowledgeable ML reviewers. Covers atomic variable isolation,
per-similarity-bin analysis, cross-negative stress test, and atomic
negative cleaning analysis. Results presented in two layers: aggregate
overview (grouped bars, heatmap) then per-bin degradation curves (faceted
lines).
Generates three figure types from evaluation pickles:
1. Grouped bar charts (aggregate mAP/AP per model per track)
2. Faceted line plots (AP and MCC-F1 by sequence-identity bin)
3. Heatmaps (models x tracks x metrics in a single glanceable view)
Supports any combination of tracks via CLI; auto-discovers models and
similarity bins from the data.
Prenyltransferases (pt, ggpps, fpps, gpps, gfpps, hsqs) catalyse
prenyl-chain elongation, not terpene cyclisation; they are not TPS. They
are now excluded from the per-type macro-average AP and treated as
negatives alongside "Unknown" proteins.
Changes:
- Add "pt" to _PRECURSOR_TYPES in experiment_runner.py
- Add _NON_TPS_TYPES constant in evaluation/plotting scripts
- Filter precursor types from per-type AP computation in:
  plot_camera_ready.py, evaluate_per_type_tps_detection.py,
  analyze_per_type_tps_detection.py, analyze_macro_type_ap_drop.py
- Add TestPrenyltransferaseExclusion tests (4 tests)
- Add dump_pertype_fold_ap.py diagnostic script
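The exclusion from the per-type macro-average can be illustrated with a short sketch. The contents of `NON_TPS_TYPES` below are an assumption built from the types named in this commit message; the actual `_NON_TPS_TYPES` constant may differ, and the function name and the example type names "di" and "sesq" are hypothetical.

```python
from typing import Dict, Optional

# Assumed contents: the prenyltransferase types listed above plus "Unknown".
NON_TPS_TYPES = frozenset({"pt", "ggpps", "fpps", "gpps", "gfpps", "hsqs", "Unknown"})


def macro_average_ap(per_type_ap: Dict[str, float]) -> Optional[float]:
    """Unweighted mean of per-type AP over TPS types only.
    Non-TPS types are dropped before averaging rather than scored as zero."""
    kept = [ap for type_name, ap in per_type_ap.items() if type_name not in NON_TPS_TYPES]
    if not kept:
        return None  # nothing left to average
    return sum(kept) / len(kept)


# Illustrative numbers only, not results from this PR.
per_type_ap = {"di": 0.5, "sesq": 0.75, "pt": 0.1}
print(macro_average_ap(per_type_ap))
```

Dropping the types before averaging changes the denominator as well as the numerator, so the macro-average is not pulled down by classes that are no longer considered positives.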
Major rewrite of local_validation_protocol.md:
- Add Track E (new TPS + old negatives) throughout
- Fix TPS counts: 339 additional enzymes, not ~2,100
- Remove Section 3.1 (Negative Distribution Correction)
- Rewrite waterfall analysis with D→E and E→B decomposition
- Add full metric tables (substrate + TPS detection)
- Add model variant sections (CLEANBetterDetection, CLEANEcDetection,
  Hierarchical PlmRF/PlmDomainsRF)
- Document PT exclusion from per-type macro-average
- Update Figure 6 caption for 2×2 heatmap layout
Also includes PlmDomainsRF config and code updates, and cross-track
visualization improvements.
New scripts:
- build_track_e_dataset.py: construct Track E dataset
- compute_cross_dataset_similarities.py: MMseqs2 similarities
- evaluate_new_models.py: evaluate PlmDomainsRF and model variants
- extend_ec_mapping.py: extend EC-to-substrate mapping for new dataset
- postprocess_clean_ec_detection.py: EC-based CLEAN TPS detection
- postprocess_clean_tps_detection.py: proportion-based CLEAN detection
- build_hierarchical_models.py: two-stage hierarchical PlmRF/PlmDomainsRF
- investigate_high_scoring_negatives.py: substrate-bearing negatives
  analysis
- plot_substrate_neg_analysis.py: substrate-bearing negatives
  visualization
- build_combined_domain_features.py: combined domain feature extraction
- build_combined_domain_matrix.py: domain comparison matrix
- build_domain_features_foldseek.py: Foldseek-based domain features
- test_hierarchical_models.py: tests for hierarchical model logic
New experiment configurations for:
- Track E (cross_new_tps_old_neg) for Blastp, CLEAN, Foldseek, HMM, PlmRF,
  PlmDomainsRF
- Cross-dataset (cross_synced_to_new) for Foldseek, PlmDomainsRF
- CLEANBetterDetection and CLEANEcDetection variants
- PlmRandomForestHierarchical and PlmDomainsRandomForestHierarchical
- track_e_new_tps_old_neg.csv: Track E dataset (new TPS + old negatives)
- ec_to_substrate_mapping_extended.json: 356 ECs covering old + new TPS
- ec_to_substrate_mapping_2026_03_14.json: original 292-EC mapping
- martsDB_reactions_2026_02_22.csv: MartsDB reaction data for EC mapping
- EnzymeExplorer_Dataset.csv: updated main dataset
The proportion-based approach (CLEANBetterDetection) consistently underperformed the original CLEAN — drop it from the protocol doc. Keep only CLEANEcDetection (EC-based detection, +1.6 to +4.7 pp AP). Also fix: missing blank line before Section 13 heading, trailing newline.
8 figures: combined heatmap, grouped bars (mAP + AP), detailed heatmap, per-similarity-bin plots, and waterfall decomposition charts.
…mbers. The final prediction scores are calculated by weighted majority voting.
segef
reviewed
Apr 7, 2026
logger = logging.getLogger(__file__)
logger.setLevel(logging.INFO)

_NON_TPS_LABELS = frozenset({"Unknown", "precursor substr", "other"})
Contributor
I don't think "other" should be included in _NON_TPS_LABELS. Otherwise, proteins with that label won't be used as true positives for the isTPS label.
Collaborator
Author
Good catch! Thank you 🙏
Merge PR #34 (bugfix/fix_clean_ec_to_substrate_mapping):
- Replace Rhea-based EC lookup with pre-computed JSON mapping
- Add fold-specific pretrained model support (fold_idx parameter)
- Add weighted majority voting for isTPS/substrate prediction
- Add ec_utils.py, clean_dataset_prep.py, get_ec_to_substrate_mapping.py
- Add fold_idx parameter to all predict_proba signatures
- Update all CLEAN/CLEANEcDetection/CLEANBetterDetection configs
Fix _NON_TPS_LABELS:
- Remove 'other' from the set so proteins with 'other' substrate are
  correctly labeled as TPS-positive (isTPS=True)
- Update tests accordingly
Made-with: Cursor
Proteins with Type=Unknown that happen to carry real terpenoid substrate
annotations (e.g. prenyltransferases misannotated as Unknown) were
incorrectly labelled isTPS=True by assign_is_tps_label, artificially
deflating TPS detection AP by ~40% on tracks B/D/E.
The fix adds _remap_substrates_by_type() which overrides the substrate
column to "Unknown" for any protein with Type=Unknown *before* the
groupby+assign_is_tps_label step, mirroring the existing precursor-type
remap. Both call sites (data_df and eval_data_df) now use the shared
helper.
Also includes:
- patch_istps_labels.py: one-off script that corrected existing fold
  result pickles (95 files across tracks B/D/E)
- Regression tests for the remap and end-to-end isTPS assignment
- Regenerated all figures referenced by local_validation_protocol.md
Made-with: Cursor
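The ordering of the remap relative to the grouping step is the core of this fix, and can be sketched without the real data frames. The row format, the column names, and both function names below are assumptions; the actual helper operates on the experiment runner's data frames. The label set reflects the state after "other" was removed.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Label set after the fix that removed "other".
NON_TPS_LABELS = frozenset({"Unknown", "precursor substr"})


def remap_substrates_by_type(rows: List[dict], type_col: str = "Type",
                             substrate_col: str = "substrate") -> List[dict]:
    """Return copies of the rows with the substrate forced to "Unknown"
    wherever the protein type is "Unknown"."""
    return [
        {**row, substrate_col: "Unknown"} if row[type_col] == "Unknown" else dict(row)
        for row in rows
    ]


def label_sets_by_protein(rows: List[dict], id_col: str = "id",
                          substrate_col: str = "substrate") -> Dict[str, Set[str]]:
    """Group substrate labels per protein, then add isTPS via the subset check."""
    grouped: Dict[str, Set[str]] = defaultdict(set)
    for row in rows:
        grouped[row[id_col]].add(row[substrate_col])
    for labels in grouped.values():
        if not labels <= NON_TPS_LABELS:
            labels.add("isTPS")
    return dict(grouped)


rows = [
    {"id": "P1", "Type": "Unknown", "substrate": "FPP_SMILES"},  # misannotated negative
    {"id": "P2", "Type": "sesq", "substrate": "FPP_SMILES"},     # hypothetical TPS type
]
print(label_sets_by_protein(rows))                            # without the remap
print(label_sets_by_protein(remap_substrates_by_type(rows)))  # with the remap
```

Because the remap runs before grouping, a Type=Unknown protein can never contribute a real substrate label to its group, so the subset check cannot flag it.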
Remove per-TPS-type macro-averaged AP computation from
plot_camera_ready.py (functions, fig4c, fig6 bottom panels). Figure 6 is
now a clean 1×2 side-by-side heatmap (mAP + AP). Filter CLEAN (retrained)
to Track B only. Regenerate all figures. Update
local_validation_protocol.md caption accordingly.
Made-with: Cursor
Diagnostic script to measure the impact of including the 'precursor
substr' class in mAP computation. Shows per-model, per-track delta and
per-class AP breakdown for Blastp Track A.
Made-with: Cursor
The TPS detection AP values need re-evaluation after the isTPS label fix
(stale evaluation CSVs were overriding corrected fold results). Remove the
AP panel from fig6 for now; keep only substrate prediction mAP. Also fix
merge logic so fold pickle values take priority over stale CSVs.
Made-with: Cursor