Feature/mmseqs clustering for folds#29

Merged
segef merged 2 commits into main from
feature/mmseqs-clustering-for-folds
Mar 11, 2026

@segef segef commented Mar 11, 2026

No description provided.

This commit introduces mmseqs-based sequence clustering as an alternative
to MSA-based phylogenetic clustering. Grouping highly similar sequences
into the same cluster keeps them in the same fold, which better prevents
train-test data leakage.

Key changes:
- Add get_mmseqs_clusters.py: generates sequence clusters using mmseqs
  easy-cluster with configurable identity (default 30%) and coverage
  (default 50%) thresholds
- Add generate_mmseqs_folds_to_csv.py: generates stratified group k-folds
  using mmseqs clusters and saves directly to CSV
- Add analyze_fold_similarity.py: analyzes train-test sequence similarity
  to detect potential data leakage
- Add analyze_negative_similarity.py: checks negative samples for
  similarity to TPS sequences (found 29 problematic negatives)
- Add filter_problematic_negatives.py: removes high-similarity negatives
- Update get_balanced_stratified_group_kfolds.py: add --cluster-source
  argument to choose between phylogenetic and mmseqs clustering
- Filter 29 problematic negatives (>40% identity to TPS) from
  sampled_id_2_seq.pkl
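
As a rough illustration of the clustering-to-folds flow listed above, here is a
minimal sketch. It is not the code from get_mmseqs_clusters.py or
generate_mmseqs_folds_to_csv.py. The function names are hypothetical, the
command is only built (not executed), and the fold assignment is a simplified
greedy group split that ignores class stratification, unlike the stratified
group k-folds described in this PR. It assumes the two-column
representative/member TSV that `mmseqs easy-cluster` writes as
`<out_prefix>_cluster.tsv`.

```python
import csv
from collections import defaultdict


def build_easy_cluster_cmd(fasta, out_prefix, tmp_dir, min_seq_id=0.3, coverage=0.5):
    """Return the argv list for `mmseqs easy-cluster` (hypothetical helper, not executed here)."""
    return [
        "mmseqs", "easy-cluster", str(fasta), str(out_prefix), str(tmp_dir),
        "--min-seq-id", str(min_seq_id),  # 0.3 = 30% identity
        "-c", str(coverage),              # 0.5 = 50% coverage
    ]


def parse_cluster_tsv(lines):
    """Map each member sequence ID to its cluster representative ID.

    `lines` is any iterable of tab-separated "representative<TAB>member" rows,
    for example an open file handle on <out_prefix>_cluster.tsv.
    """
    seq_to_cluster = {}
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) >= 2:
            seq_to_cluster[row[1]] = row[0]
    return seq_to_cluster


def assign_folds_by_cluster(seq_to_cluster, n_folds=5):
    """Greedy group split: largest clusters first, each into the currently smallest fold.

    Every member of a cluster lands in the same fold, so no cluster spans
    train and test. Class balance is NOT considered in this sketch.
    """
    members = defaultdict(list)
    for seq_id, rep in seq_to_cluster.items():
        members[rep].append(seq_id)

    fold_sizes = [0] * n_folds
    seq_to_fold = {}
    # Sort by descending cluster size, then by ID for a deterministic result.
    for rep in sorted(members, key=lambda r: (-len(members[r]), r)):
        fold = fold_sizes.index(min(fold_sizes))
        for seq_id in members[rep]:
            seq_to_fold[seq_id] = fold
        fold_sizes[fold] += len(members[rep])
    return seq_to_fold
```

The list from `build_easy_cluster_cmd` could be passed to `subprocess.run` on a
machine with mmseqs installed. The actual scripts may pass additional flags.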

Results comparison (train-test similarity):
- Corrupted phylogenetic folds: mean 65.32%, 35 pairs >95% identity
- Correct phylogenetic folds: mean 46.85%, 2 pairs >95% identity
- mmseqs 30%/50% folds: mean 17.85%, 0 pairs >80% identity
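
For readers who want to reproduce a comparison of this kind, the sketch below
shows one way to summarize train-test similarity from precomputed pairwise
identities. How analyze_fold_similarity.py defines its mean is not stated here,
so this is an assumption: the mean is taken over each test sequence's best hit
against the training set, and test sequences with no recorded hit are left out.
The function name and input layout are hypothetical.

```python
from statistics import mean


def summarize_train_test_similarity(pair_identity, train_ids, test_ids, threshold=80.0):
    """Summarize similarity across a train/test split.

    pair_identity: {(id_a, id_b): percent identity}, in either order.
    Returns (mean of per-test-sequence best identity, number of
    train-test pairs strictly above `threshold`).
    """
    train, test = set(train_ids), set(test_ids)
    best = {}
    pairs_above = 0
    for (a, b), identity in pair_identity.items():
        # Keep only pairs that cross the train/test boundary.
        if a in train and b in test:
            test_id = b
        elif b in train and a in test:
            test_id = a
        else:
            continue
        best[test_id] = max(best.get(test_id, 0.0), identity)
        if identity > threshold:
            pairs_above += 1
    mean_best = mean(best.values()) if best else 0.0
    return mean_best, pairs_above
```

Running this once per fold with different thresholds (for example 80 and 95)
would give counts comparable in spirit to the ones listed above, though not
necessarily the same numbers.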

New data files:
- mmseqs_clusters.pkl (30% identity, 50% coverage, 82 clusters)
- TPS-Nov19_2023_verified_all_reactions_with_*_folds_mmseqs_30_50.csv
@segef segef merged commit 3b7ba12 into main Mar 11, 2026
1 check failed

2 participants