Merged
This commit introduces mmseqs-based sequence clustering as an alternative to MSA-based phylogenetic clustering, which better prevents train-test data leakage by grouping highly similar sequences together.

Key changes:
- Add get_mmseqs_clusters.py: generates sequence clusters using mmseqs easy-cluster with configurable identity (default 30%) and coverage (default 50%) thresholds
- Add generate_mmseqs_folds_to_csv.py: generates stratified group k-folds using mmseqs clusters and saves directly to CSV
- Add analyze_fold_similarity.py: analyzes train-test sequence similarity to detect potential data leakage
- Add analyze_negative_similarity.py: checks negative samples for similarity to TPS sequences (found 29 problematic negatives)
- Add filter_problematic_negatives.py: removes high-similarity negatives
- Update get_balanced_stratified_group_kfolds.py: add --cluster-source argument to choose between phylogenetic and mmseqs clustering
- Filter 29 problematic negatives (>40% identity to TPS) from sampled_id_2_seq.pkl

Results comparison (train-test similarity):
- Corrupted phylogenetic folds: mean 65.32%, 35 pairs >95% identity
- Correct phylogenetic folds: mean 46.85%, 2 pairs >95% identity
- mmseqs 30%/50% folds: mean 17.85%, 0 pairs >80% identity

New data files:
- mmseqs_clusters.pkl (30% identity, 50% coverage, 82 clusters)
- TPS-Nov19_2023_verified_all_reactions_with_*_folds_mmseqs_30_50.csv
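
For reviewers who want to see the idea in miniature, here is a rough sketch of cluster-aware fold assignment. It is not the code in get_mmseqs_clusters.py or generate_mmseqs_folds_to_csv.py. All function names are hypothetical. It assumes the two-column, tab-separated `<prefix>_cluster.tsv` file (representative, member) that mmseqs easy-cluster writes. It also replaces the stratified group k-fold step with a simpler greedy size balancing that ignores class labels. The mmseqs command is only built as an argument list and is never executed.

```python
import io
from collections import defaultdict


def build_easy_cluster_cmd(fasta, prefix, tmp_dir, min_seq_id=0.3, coverage=0.5):
    """Build an argument list for `mmseqs easy-cluster` (hypothetical helper).

    Defaults mirror the 30% identity / 50% coverage thresholds from the
    description above. The command is returned, not run.
    """
    return ["mmseqs", "easy-cluster", fasta, prefix, tmp_dir,
            "--min-seq-id", str(min_seq_id), "-c", str(coverage)]


def read_cluster_tsv(handle):
    """Map every member ID to its cluster representative.

    Assumes each non-empty line is `representative<TAB>member`.
    """
    member_to_rep = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue
        rep, member = line.split("\t")
        member_to_rep[member] = rep
    return member_to_rep


def assign_folds(member_to_rep, n_folds):
    """Place whole clusters, largest first, into the currently smallest fold.

    Simplification: this balances fold sizes only and does no stratification.
    """
    clusters = defaultdict(list)
    for member, rep in member_to_rep.items():
        clusters[rep].append(member)
    fold_sizes = [0] * n_folds
    fold_of = {}
    ordered = sorted(clusters.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for rep, members in ordered:
        target = fold_sizes.index(min(fold_sizes))
        for member in members:
            fold_of[member] = target
        fold_sizes[target] += len(members)
    return fold_of


def clusters_split_across_folds(member_to_rep, fold_of):
    """Return representatives whose members landed in more than one fold."""
    folds_per_cluster = defaultdict(set)
    for member, rep in member_to_rep.items():
        folds_per_cluster[rep].add(fold_of[member])
    return sorted(rep for rep, folds in folds_per_cluster.items() if len(folds) > 1)


# Toy cluster file: three clusters with 3, 2 and 1 members (made-up IDs).
toy_tsv = "A\tA\nA\tB\nA\tC\nD\tD\nD\tE\nF\tF\n"
member_to_rep = read_cluster_tsv(io.StringIO(toy_tsv))
fold_of = assign_folds(member_to_rep, n_folds=2)
leaky_clusters = clusters_split_across_folds(member_to_rep, fold_of)
cmd = build_easy_cluster_cmd("seqs.fasta", "clusters", "tmp")
```

Because every cluster is placed as a unit, `leaky_clusters` is empty by construction. Running the same check on folds produced some other way (for example from a fold CSV) would be a cheap sanity test before the more expensive pairwise similarity analysis.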