Rich evaluations on new data with atomic diffs#35

Open
SamusRam wants to merge 32 commits into main from feature/port-cleaning-filters

@SamusRam SamusRam commented Apr 2, 2026

No description provided.

SamusRam and others added 23 commits March 14, 2026 15:16
The duplicated lambda at lines 189-193 and 223-227 used an
intersection check against {"Unknown", "precursor substr"} to decide
whether to add "isTPS".  This caused sequences with labels like
{"FPP_SMILES", "precursor substr"} to incorrectly miss the isTPS flag.

Replace with a shared helper `assign_is_tps_label` that uses a subset
check: isTPS is added whenever the label set contains at least one
real TPS substrate (anything outside {"Unknown", "precursor substr",
"other"}).

Note: previously cached fold_*_results.pkl files embed test_df with
the old labeling and must be invalidated before evaluation.
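
For reviewers, the subset check described above can be sketched roughly as follows. This is a minimal illustration rather than the code in the PR: the signature and the `non_tps_labels` parameter are assumptions, and the label set is the one named in this commit message (a later commit in this PR removes "other" from it).

```python
# Label set as described in this commit message (assumption: the real
# constant lives next to the helper in the repository).
_NON_TPS_LABELS = frozenset({"Unknown", "precursor substr", "other"})


def assign_is_tps_label(labels, non_tps_labels=_NON_TPS_LABELS):
    """Return a copy of `labels`, adding "isTPS" if any label is a real TPS substrate."""
    labels = set(labels)
    # Subset check: only sets made up entirely of non-TPS labels stay unflagged.
    # An empty label set is left untouched as well.
    if labels and not labels <= non_tps_labels:
        labels.add("isTPS")
    return labels


# The case from the commit message that the old intersection check missed.
print(assign_is_tps_label({"FPP_SMILES", "precursor substr"}))
# A set with no real TPS substrate keeps its labels unchanged.
print(assign_is_tps_label({"Unknown", "precursor substr"}))
```

An intersection test asks "does the set contain any non-TPS label?", which wrongly excludes mixed sets; the subset test asks "does the set contain only non-TPS labels?", which is the condition the message describes.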

Made-with: Cursor
Regenerate data/TPS-Nov19_2023_verified_all_reactions_with_neg_with_folds.csv
from the authoritative data/tps_folds_nov2023.h5 via store_folds_into_csv.py
with split_col_name=stratified_phylogeny_based_split_with_minor_products.

Create parallel *_phylo_folds config versions for all benchmark models
(Blastp, CLEAN, Foldseek, HMM, PfamSUPFAM, DomainsRandomForest,
PlmDomainsRandomForest) pointing to the phylo-fold CSV and split column.
Existing mmseqs-based configs are unchanged.

Made-with: Cursor
scripts/compute_fold_similarities.py computes test-vs-train MMseqs2
similarity for each fold and produces a pickle artifact consumed by
evaluation. Pinned settings: --alignment-mode 3, top-N retrieval with
post-hoc best-hit selection (highest pident with qcov >= threshold,
tiebreak by lowest evalue). Stores pident, qcov, evalue, has_hit per
test sequence. Metadata saved to companion .meta.json.
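
The post-hoc best-hit rule above could look roughly like the sketch below. The `Hit` type, the function names, the 0.8 default for the coverage threshold, and the `None` values stored for sequences without a hit are all assumptions made for illustration; only the selection rule and the four stored fields come from the commit message.

```python
from typing import NamedTuple, Optional, Sequence


class Hit(NamedTuple):
    """One retrieved alignment for a test sequence (hypothetical container)."""
    target: str
    pident: float
    qcov: float
    evalue: float


def select_best_hit(hits: Sequence[Hit], min_qcov: float = 0.8) -> Optional[Hit]:
    """Pick the highest-pident hit with qcov >= min_qcov; ties go to the lowest evalue."""
    eligible = [h for h in hits if h.qcov >= min_qcov]
    if not eligible:
        return None
    # Sort key: larger pident first (negated), then smaller evalue.
    return min(eligible, key=lambda h: (-h.pident, h.evalue))


def to_record(hits: Sequence[Hit], min_qcov: float = 0.8) -> dict:
    """Build the per-sequence record with pident, qcov, evalue, has_hit."""
    best = select_best_hit(hits, min_qcov)
    if best is None:
        return {"pident": None, "qcov": None, "evalue": None, "has_hit": False}
    return {"pident": best.pident, "qcov": best.qcov,
            "evalue": best.evalue, "has_hit": True}
```

Filtering on coverage before ranking by identity keeps a short, high-identity local match from being reported as the nearest training neighbour.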

Includes unit tests for best-hit selection and fold-name discovery.

Made-with: Cursor
…support

- Add load_similarity_artifact() to normalise both legacy BLAST and rich
  MMseqs pickles into a common schema (pident, qcov, evalue, has_hit).
  Legacy entries tagged is_synthetic=True.
- Replace hardcoded 10-step BLAST identity bucketing with configurable
  similarity_bins parameter; default falls back to legacy behaviour.
- Add "all" (no filter) and "no_hit" pseudo-bins automatically.
- Add min_negatives_for_eval check alongside existing min positives.
- Log support counts (n_pos, n_neg) per bin for transparency.
- Add --similarities-path CLI alias for --blast-identities-path.
- Unit tests for artifact loading, bin label generation, and record keys.
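
A rough sketch of how configurable bins with the "all" and "no_hit" pseudo-bins might be derived is given below. The label format, the half-open interval convention, the function names, and the `min_positives_for_eval` parameter name are assumptions; `similarity_bins`, `min_negatives_for_eval`, and the record keys are the names used in the list above.

```python
def bin_labels(similarity_bins):
    """Labels for bin edges such as [0, 40, 60, 100], plus the two pseudo-bins."""
    ranges = [f"{lo}-{hi}" for lo, hi in zip(similarity_bins, similarity_bins[1:])]
    return ["all", "no_hit"] + ranges


def bins_for_record(record, similarity_bins):
    """Return every bin a test sequence belongs to, given its similarity record."""
    labels = ["all"]  # "all" applies no filter
    if not record["has_hit"]:
        labels.append("no_hit")
        return labels
    pident = record["pident"]
    last = len(similarity_bins) - 2
    for i, (lo, hi) in enumerate(zip(similarity_bins, similarity_bins[1:])):
        # Bins are half-open [lo, hi), except the last one, which includes hi.
        below_upper = pident <= hi if i == last else pident < hi
        if lo <= pident and below_upper:
            labels.append(f"{lo}-{hi}")
    return labels


def has_enough_support(n_pos, n_neg, min_positives_for_eval, min_negatives_for_eval):
    """A bin is evaluated only if it has enough positives and enough negatives."""
    return n_pos >= min_positives_for_eval and n_neg >= min_negatives_for_eval
```

Checking negatives as well as positives matters because AP on a bin with almost no negatives is trivially high and not comparable across bins.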

Made-with: Cursor
…branch

Check out utils.py, constants.py, hmmer_wrapper.py, and
mmseqs2_wrapper.py verbatim from origin/revision_data-preparation
(94013a2) to minimise future merge conflicts.

Add negative_filters.py as a thin re-export layer exposing
filter_out_putative_tpss, filter_by_ec, filter_by_go, and
filter_by_pfam_supfam plus the constants they depend on.

Unit tests cover filter_out_putative_tpss and filter_by_ec with
synthetic data.

Made-with: Cursor
- get_folds_from_csv: handle bare-integer folds (new dataset)
- experiment_runner: normalize fold column on load, configurable type_col_name
- BaseConfig: add type_col_name with backward-compatible default
- compute_fold_similarities: support bare-integer fold format
- build_synced_fold_dataset.py: generate synced-fold CSV (old data, new tps folds)
- compare_cross_dataset.py: cross-dataset comparison on shared TPS
- New configs for Blastp, PlmRandomForest, PlmDomainsRandomForest
  (new_dataset + synced_folds variants)
- Add EnzymeExplorer_Dataset.csv from revision branch
- 9 unit tests for fold compat, normalization, synced dataset builder
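
As an illustration of the fold normalization mentioned in this list, a minimal sketch follows. It assumes the older dataset stores folds as strings of the form `fold_<k>` and that this is the canonical form; neither detail is stated here, and `normalize_fold` is a hypothetical name.

```python
import re

# Matches "3" or "fold_3"; the "fold_" prefix is an assumed legacy format.
_FOLD_RE = re.compile(r"^(?:fold_)?(\d+)$")


def normalize_fold(value):
    """Map 3, 3.0, "3", or "fold_3" to the string "fold_3"."""
    # CSV readers may load an integer column as floats, so fold 3.0 back to 3.
    if isinstance(value, float) and value.is_integer():
        value = int(value)
    text = str(value).strip()
    match = _FOLD_RE.match(text)
    if match is None:
        return text  # leave unrecognised labels untouched
    return f"fold_{int(match.group(1))}"
```

Normalizing once on load means downstream code can compare fold labels from the old and new datasets without per-call special cases.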
Add cross-dataset evaluation support to the experiment runner: models
trained on old synced data (Track C) can now be evaluated on new-dataset
folds (Track D), isolating the dataset-shift effect.

- BaseConfig: add eval_csv_path, eval_split_col_name, eval_id_col_name,
  eval_seq_col_name, eval_type_col_name, eval_group_col_name fields
- EmbSklearnBaseConfig: add eval_representations_path (kw_only)
- experiment_runner: add _load_eval_dataset helper; swap features_df
  at prediction time for cross-dataset embeddings
- CLEAN: propagate eval column support
- New Track D configs: Blastp, CLEAN, HMM, PlmRandomForest
- New track configs: CLEAN, Foldseek, HMM (new_dataset + synced_folds)
- Scripts: run_cross_negative_stress_test.py, run_track_c_cross_eval.py
- Tests: add TestLoadEvalDataset (column rename, fold normalization,
  fold selection, no-cross-eval guard)
- Data CSV fixes for fold column consistency
Comprehensive document motivating the A/C/D/B evaluation protocol for
knowledgeable ML reviewers. Covers atomic variable isolation, per-
similarity-bin analysis, cross-negative stress test, and atomic negative
cleaning analysis. Results presented in two layers: aggregate overview
(grouped bars, heatmap) then per-bin degradation curves (faceted lines).
Generates three figure types from evaluation pickles:
1. Grouped bar charts (aggregate mAP/AP per model per track)
2. Faceted line plots (AP and MCC-F1 by sequence-identity bin)
3. Heatmaps (models x tracks x metrics in a single glanceable view)

Supports any combination of tracks via CLI; auto-discovers models and
similarity bins from the data.
Prenyltransferases (pt, ggpps, fpps, gpps, gfpps, hsqs) catalyse
prenyl-chain elongation, not terpene cyclisation — they are not TPS.
They are now excluded from the per-type macro-average AP and treated
as negatives alongside "Unknown" proteins.

Changes:
- Add "pt" to _PRECURSOR_TYPES in experiment_runner.py
- Add _NON_TPS_TYPES constant in evaluation/plotting scripts
- Filter precursor types from per-type AP computation in:
  plot_camera_ready.py, evaluate_per_type_tps_detection.py,
  analyze_per_type_tps_detection.py, analyze_macro_type_ap_drop.py
- Add TestPrenyltransferaseExclusion tests (4 tests)
- Add dump_pertype_fold_ap.py diagnostic script
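
The exclusion can be pictured with the short sketch below. The contents of `_NON_TPS_TYPES` are assumed to be the prenyltransferase types listed above, and `macro_average_ap` together with the example type names and values is hypothetical.

```python
# Assumed contents: the prenyltransferase types named in the commit message.
_NON_TPS_TYPES = frozenset({"pt", "ggpps", "fpps", "gpps", "gfpps", "hsqs"})


def macro_average_ap(per_type_ap):
    """Unweighted mean of per-type AP, skipping types that are not TPS."""
    kept = [ap for tps_type, ap in per_type_ap.items()
            if tps_type not in _NON_TPS_TYPES]
    if not kept:
        raise ValueError("no TPS types left after filtering")
    return sum(kept) / len(kept)


# Placeholder values for illustration only; "pt" is dropped from the average.
print(macro_average_ap({"mono": 0.5, "di": 0.75, "pt": 0.1}))
```

Because a macro-average weights every type equally, a single non-TPS type with low AP can move the headline number substantially, which is why the filter is applied before averaging.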
Major rewrite of local_validation_protocol.md:
- Add Track E (new TPS + old negatives) throughout
- Fix TPS counts: 339 additional enzymes, not ~2,100
- Remove Section 3.1 (Negative Distribution Correction)
- Rewrite waterfall analysis with D→E and E→B decomposition
- Add full metric tables (substrate + TPS detection)
- Add model variant sections (CLEANBetterDetection, CLEANEcDetection,
  Hierarchical PlmRF/PlmDomainsRF)
- Document PT exclusion from per-type macro-average
- Update Figure 6 caption for 2×2 heatmap layout

Also includes PlmDomainsRF config and code updates, and
cross-track visualization improvements.
New scripts:
- build_track_e_dataset.py: construct Track E dataset
- compute_cross_dataset_similarities.py: MMseqs2 similarities
- evaluate_new_models.py: evaluate PlmDomainsRF and model variants
- extend_ec_mapping.py: extend EC-to-substrate mapping for new dataset
- postprocess_clean_ec_detection.py: EC-based CLEAN TPS detection
- postprocess_clean_tps_detection.py: proportion-based CLEAN detection
- build_hierarchical_models.py: two-stage hierarchical PlmRF/PlmDomainsRF
- investigate_high_scoring_negatives.py: substrate-bearing negatives analysis
- plot_substrate_neg_analysis.py: substrate-bearing negatives visualization
- build_combined_domain_features.py: combined domain feature extraction
- build_combined_domain_matrix.py: domain comparison matrix
- build_domain_features_foldseek.py: Foldseek-based domain features
- test_hierarchical_models.py: tests for hierarchical model logic
New experiment configurations for:
- Track E (cross_new_tps_old_neg) for Blastp, CLEAN, Foldseek, HMM,
  PlmRF, PlmDomainsRF
- Cross-dataset (cross_synced_to_new) for Foldseek, PlmDomainsRF
- CLEANBetterDetection and CLEANEcDetection variants
- PlmRandomForestHierarchical and PlmDomainsRandomForestHierarchical
- track_e_new_tps_old_neg.csv: Track E dataset (new TPS + old negatives)
- ec_to_substrate_mapping_extended.json: 356 ECs covering old + new TPS
- ec_to_substrate_mapping_2026_03_14.json: original 292-EC mapping
- martsDB_reactions_2026_02_22.csv: MartsDB reaction data for EC mapping
- EnzymeExplorer_Dataset.csv: updated main dataset
The proportion-based approach (CLEANBetterDetection) consistently
underperformed the original CLEAN — drop it from the protocol doc.
Keep only CLEANEcDetection (EC-based detection, +1.6 to +4.7 pp AP).

Also fix: missing blank line before Section 13 heading, trailing newline.
@SamusRam SamusRam requested a review from segef April 2, 2026 21:17
SamusRam and others added 4 commits April 2, 2026 23:20
8 figures: combined heatmap, grouped bars (mAP + AP), detailed heatmap,
per-similarity-bin plots, and waterfall decomposition charts.
…mbers. The final prediction scores are calculated by weighted majority voting.
logger = logging.getLogger(__file__)
logger.setLevel(logging.INFO)

_NON_TPS_LABELS = frozenset({"Unknown", "precursor substr", "other"})
Contributor

I don't think "other" should be included in _NON_TPS_LABELS. Otherwise, proteins labeled "other" won't count as true positives for the isTPS label.

Collaborator Author

Good catch! Thank you 🙏

Merge PR #34 (bugfix/fix_clean_ec_to_substrate_mapping):
- Replace Rhea-based EC lookup with pre-computed JSON mapping
- Add fold-specific pretrained model support (fold_idx parameter)
- Add weighted majority voting for isTPS/substrate prediction
- Add ec_utils.py, clean_dataset_prep.py, get_ec_to_substrate_mapping.py
- Add fold_idx parameter to all predict_proba signatures
- Update all CLEAN/CLEANEcDetection/CLEANBetterDetection configs

Fix _NON_TPS_LABELS:
- Remove 'other' from the set so proteins with 'other' substrate are
  correctly labeled as TPS-positive (isTPS=True)
- Update tests accordingly

Made-with: Cursor
SamusRam added 4 commits April 7, 2026 17:07
Proteins with Type=Unknown that happen to carry real terpenoid substrate
annotations (e.g. prenyltransferases misannotated as Unknown) were
incorrectly labelled isTPS=True by assign_is_tps_label, artificially
deflating TPS detection AP by ~40% on tracks B/D/E.

The fix adds _remap_substrates_by_type() which overrides the substrate
column to "Unknown" for any protein with Type=Unknown *before* the
groupby+assign_is_tps_label step, mirroring the existing precursor-type
remap. Both call sites (data_df and eval_data_df) now use the shared
helper.
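
The ordering issue is easiest to see in a small plain-Python analogue. The real helper presumably operates on a DataFrame; here rows are dicts, the `id` and `substrate` column names are assumptions, and the function names other than the idea of the remap are hypothetical. The non-TPS label set reflects the state after "other" was removed.

```python
from collections import defaultdict

_NON_TPS_LABELS = frozenset({"Unknown", "precursor substr"})


def remap_substrates_by_type(rows, type_col="Type", substrate_col="substrate"):
    """Override the substrate with "Unknown" for every row whose Type is Unknown."""
    remapped = []
    for row in rows:
        row = dict(row)  # copy so the caller's rows are not mutated
        if row[type_col] == "Unknown":
            row[substrate_col] = "Unknown"
        remapped.append(row)
    return remapped


def label_sets_by_id(rows, id_col="id", substrate_col="substrate"):
    """Group substrates per protein, then add "isTPS" via the subset check."""
    grouped = defaultdict(set)
    for row in rows:
        grouped[row[id_col]].add(row[substrate_col])
    for labels in grouped.values():
        if not labels <= _NON_TPS_LABELS:
            labels.add("isTPS")
    return dict(grouped)


rows = [
    {"id": "P1", "Type": "Unknown", "substrate": "GGPP"},  # misannotated negative
    {"id": "P2", "Type": "di", "substrate": "GGPP"},       # genuine TPS
]
# The remap must run before grouping, otherwise P1 would be flagged isTPS.
print(label_sets_by_id(remap_substrates_by_type(rows)))
```

Applying the same helper to both the training and evaluation frames keeps the positive definition identical on each side of the comparison.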

Also includes:
- patch_istps_labels.py: one-off script that corrected existing fold
  result pickles (95 files across tracks B/D/E)
- Regression tests for the remap and end-to-end isTPS assignment
- Regenerated all figures referenced by local_validation_protocol.md

Made-with: Cursor
Remove per-TPS-type macro-averaged AP computation from
plot_camera_ready.py (functions, fig4c, fig6 bottom panels).
Figure 6 is now a clean 1×2 side-by-side heatmap (mAP + AP).
Filter CLEAN (retrained) to Track B only. Regenerate all figures.
Update local_validation_protocol.md caption accordingly.

Made-with: Cursor
Diagnostic script to measure the impact of including the
'precursor substr' class in mAP computation. Shows per-model,
per-track delta and per-class AP breakdown for Blastp Track A.

Made-with: Cursor
The TPS detection AP values need re-evaluation after the isTPS
label fix (stale evaluation CSVs were overriding corrected fold
results). Remove the AP panel from fig6 for now; keep only
substrate prediction mAP. Also fix merge logic so fold pickle
values take priority over stale CSVs.
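
The intended precedence can be sketched as a simple keyed merge. The key layout `(model, track, metric)`, the function name, and the numbers are placeholders for illustration, not values or structures taken from the repository.

```python
def merge_metrics(csv_metrics, pickle_metrics):
    """Combine both sources; fold-pickle values win when a key is in both."""
    merged = dict(csv_metrics)     # start from the (possibly stale) CSV values
    merged.update(pickle_metrics)  # pickle values overwrite matching keys
    return merged


# Placeholder numbers, keyed by a hypothetical (model, track, metric) tuple.
csv_metrics = {("Blastp", "B", "AP"): 0.10, ("Blastp", "A", "mAP"): 0.20}
pickle_metrics = {("Blastp", "B", "AP"): 0.30}
print(merge_metrics(csv_metrics, pickle_metrics))
```

Keys that exist only in the CSV still pass through, so the merge falls back to older evaluations only where no corrected fold result is available.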

Made-with: Cursor