Dear IQ-Tree:
I am writing to report a potential bug related to PMSF, along with an easy workaround that appears to fix it.
The data are available in the folders iqtree/ and iqtree_no_keepident/. Briefly, for my dataset, which contains identical sequences, PMSF seems to work incorrectly if -keep-ident is NOT specified, but specifying -keep-ident solves the problem.
Preparation
cd bs_inBV_bestfit_issue/split/0/iqtree/
then
cd ../iqtree_no_keepident/
Purpose
This directory compares several IQ-TREE runs on the same alignment (../combined.phy) and the same fixed tree (ref.tre), under different model / site-frequency settings.
The main point is to show that -keep-ident might affect PMSF-related runs.
Shared setup
All runs use:
- ../combined.phy
- ref.tre
- 30 protein sequences
- 100 sites
- fixed-tree evaluation (-te ref.tre), not tree search
So differences in BEST SCORE FOUND come from model/options, not from different searched topologies.
Run types
| Prefix | Main command idea | Meaning |
|---|---|---|
| LG+G | -m LG+G | plain LG + Gamma |
| LG+C10+G | -m LG+C10+G | LG + C10 mixture + Gamma |
| LG+C10+G_ft | -m LG+C10+G -ft guide.treefile | infer PMSF/sitefreq from guide.treefile, then evaluate |
| LG+C10+G_fs | -m LG+C10+G -fs LG+C10+G_ft.sitefreq | reuse inferred sitefreq file |
| LG+G_fs | -m LG+G -fs LG+C10+G_ft.sitefreq | LG+G with the same imported sitefreq file |
Notes:
- *_ft generates LG+C10+G_ft.sitefreq
- *_fs consumes that .sitefreq file
- guide.treefile is only used as input for -ft
- If I understand correctly, LG+C10+G_ft, LG+C10+G_fs, and LG+G_fs should in theory give the same lnL (this is what was observed when -keep-ident is specified; see below).
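For anyone who wants to re-run the comparison, the five runs in the table can be assembled as command lines roughly as follows. This is only a sketch: the -m, -ft, -fs and -te values are taken from the table, but the binary name iqtree2, the use of -s for the alignment and -pre for the output prefix are my assumptions about how the runs were launched, and the _noki suffix seen in the iqtree_no_keepident/ file names is not reproduced here.

```python
import shlex

# Assumptions (not stated in the report): binary name and the -s / -pre usage.
IQTREE = "iqtree2"
ALIGNMENT = "../combined.phy"
TREE = "ref.tre"

# Model / site-frequency options per run, copied from the run table.
RUNS = {
    "LG+G": ["-m", "LG+G"],
    "LG+C10+G": ["-m", "LG+C10+G"],
    "LG+C10+G_ft": ["-m", "LG+C10+G", "-ft", "guide.treefile"],
    "LG+C10+G_fs": ["-m", "LG+C10+G", "-fs", "LG+C10+G_ft.sitefreq"],
    "LG+G_fs": ["-m", "LG+G", "-fs", "LG+C10+G_ft.sitefreq"],
}


def build_command(prefix, keep_ident):
    """Return the argument list for one fixed-tree run."""
    args = [IQTREE, "-s", ALIGNMENT, "-te", TREE, *RUNS[prefix], "-pre", prefix]
    if keep_ident:
        args.append("-keep-ident")
    return args


if __name__ == "__main__":
    # Print the commands instead of executing them, so they can be inspected first.
    for keep_ident in (False, True):
        for prefix in RUNS:
            print(shlex.join(build_command(prefix, keep_ident)))
```

The *_ft run has to finish before the two *_fs runs, since they read the .sitefreq file it writes.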
Results
Without -keep-ident
In iqtree_no_keepident/, the BEST SCORE FOUND lines of the logs are:
LG+C10+G_fs_noki.log:BEST SCORE FOUND : -1634.495
LG+C10+G_ft_noki.log:BEST SCORE FOUND : -1634.495
LG+C10+G_noki.log:BEST SCORE FOUND : -1634.495
LG+G_fs_noki.log:BEST SCORE FOUND : -1666.411
LG+G_noki.log:BEST SCORE FOUND : -1666.411
This means:
- LG+C10+G, LG+C10+G_ft, and LG+C10+G_fs all collapse to the same lnL.
- LG+G and LG+G_fs also collapse to the same lnL.
- i.e. PMSF / imported site frequencies appear to have no effect without -keep-ident.
With -keep-ident
In iqtree/, the BEST SCORE FOUND lines of the logs are:
LG+C10+G_fs.log:BEST SCORE FOUND : -1475.677
LG+C10+G_ft.log:BEST SCORE FOUND : -1475.677
LG+C10+G.log:BEST SCORE FOUND : -1634.495
LG+G_fs.log:BEST SCORE FOUND : -1475.677
LG+G.log:BEST SCORE FOUND : -1666.412
This means:
- LG+C10+G_ft, LG+C10+G_fs, and LG+G_fs all give the improved PMSF-like lnL: -1475.677
- while the ordinary non-PMSF runs remain:
  - LG+C10+G: -1634.495
  - LG+G: -1666.412
So with -keep-ident, the PMSF-related runs behave as expected.
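To make the contrast between the two result sets explicit, here is a small sketch that parses the score lines quoted above and reports how much each PMSF-related run gains over the same model without site frequencies. The score lines are copied verbatim from this report; the function names and the pairing of runs are mine, purely for illustration.

```python
import re

NO_KEEP_IDENT = """\
LG+C10+G_fs_noki.log:BEST SCORE FOUND : -1634.495
LG+C10+G_ft_noki.log:BEST SCORE FOUND : -1634.495
LG+C10+G_noki.log:BEST SCORE FOUND : -1634.495
LG+G_fs_noki.log:BEST SCORE FOUND : -1666.411
LG+G_noki.log:BEST SCORE FOUND : -1666.411
"""

KEEP_IDENT = """\
LG+C10+G_fs.log:BEST SCORE FOUND : -1475.677
LG+C10+G_ft.log:BEST SCORE FOUND : -1475.677
LG+C10+G.log:BEST SCORE FOUND : -1634.495
LG+G_fs.log:BEST SCORE FOUND : -1475.677
LG+G.log:BEST SCORE FOUND : -1666.412
"""

# Run name (with an optional _noki suffix stripped) and the reported lnL.
LINE = re.compile(r"^(?P<run>.+?)(?:_noki)?\.log:BEST SCORE FOUND : (?P<lnl>-?\d+\.\d+)$")

# (PMSF-related run, plain run it is compared against)
PAIRS = [("LG+C10+G_ft", "LG+C10+G"), ("LG+C10+G_fs", "LG+C10+G"), ("LG+G_fs", "LG+G")]


def parse_scores(text):
    """Map run name -> lnL for every score line in the text."""
    scores = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            scores[match.group("run")] = float(match.group("lnl"))
    return scores


def pmsf_gain(scores):
    """lnL of each PMSF-related run minus lnL of its plain counterpart."""
    return {pmsf: round(scores[pmsf] - scores[plain], 3) for pmsf, plain in PAIRS}


if __name__ == "__main__":
    print("without -keep-ident:", pmsf_gain(parse_scores(NO_KEEP_IDENT)))
    print("with -keep-ident:   ", pmsf_gain(parse_scores(KEEP_IDENT)))
```

A gain of zero for every pair is the "no effect" pattern seen without -keep-ident, while with -keep-ident the gains work out to 158.818 and 190.735 lnL units.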
Interpretation
The issue appears to be tied to identical-sequence handling:
- without -keep-ident, the PMSF/sitefreq-based runs do not produce their distinct improved likelihoods
- with -keep-ident, they do
In other words, the problematic behavior seems to disappear once -keep-ident is enabled.
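In case it helps with reproducing this on other datasets, identical sequences in an alignment can be listed with something like the sketch below. It assumes a relaxed sequential PHYLIP file (header line, then one "name sequence" line per taxon) and uses a made-up toy alignment rather than combined.phy; IQ-TREE's own definition of "identical" may differ in details such as gap or ambiguity handling.

```python
from collections import defaultdict


def read_relaxed_phylip(text):
    """Parse a sequential relaxed PHYLIP alignment into {name: sequence}."""
    lines = [line for line in text.splitlines() if line.strip()]
    n_taxa, _n_sites = map(int, lines[0].split())
    seqs = {}
    for line in lines[1:1 + n_taxa]:
        name, seq = line.split(None, 1)
        seqs[name] = "".join(seq.split())  # drop any spaces inside the sequence
    return seqs


def identical_groups(seqs):
    """Return lists of taxon names that share exactly the same sequence."""
    groups = defaultdict(list)
    for name, seq in seqs.items():
        groups[seq.upper()].append(name)
    return [names for names in groups.values() if len(names) > 1]


# Hypothetical toy alignment, for illustration only.
TOY = """\
4 6
taxA MKVLAA
taxB MKVLAA
taxC MKVLGA
taxD MKVLAA
"""

if __name__ == "__main__":
    for names in identical_groups(read_relaxed_phylip(TOY)):
        print("identical:", ", ".join(names))
```

Passing the contents of ../combined.phy to read_relaxed_phylip instead of TOY should list the identical sequences in my dataset, provided the file follows the layout assumed above.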
bs_inBV_bestfit_issue.tar.gz
cheers,
sishuo