PMSF might not work correctly in the presence of identical seqs and a potential solution to it #139

@evolbeginner

Dear IQ-Tree:

I am writing to report a potential bug related to PMSF, together with a simple workaround that appears to resolve it.

The data are available in the folders iqtree/ and iqtree_no_keepident/. Briefly, for my dataset, which contains identical sequences, PMSF seems to work incorrectly if -keep-ident is NOT specified; specifying -keep-ident solves the problem.

Preparation

For the runs with -keep-ident:

cd bs_inBV_bestfit_issue/split/0/iqtree/

then, for the runs without -keep-ident:

cd ../iqtree_no_keepident/

Purpose

Each of the two directories compares several IQ-TREE runs on the same alignment (../combined.phy) and the same fixed tree (ref.tre), under different model / site-frequency settings.

The main point is to show that -keep-ident might affect PMSF-related runs.


Shared setup

All runs use:

  • ../combined.phy
  • ref.tre
  • 30 protein sequences
  • 100 sites
  • fixed-tree evaluation (-te ref.tre), not tree search

So differences in BEST SCORE FOUND come from model/options, not from different searched topologies.


Run types

Prefix        Main command idea                       Meaning
LG+G          -m LG+G                                 plain LG + Gamma
LG+C10+G      -m LG+C10+G                             LG + C10 mixture + Gamma
LG+C10+G_ft   -m LG+C10+G -ft guide.treefile          infer PMSF/sitefreq from guide.treefile, then evaluate
LG+C10+G_fs   -m LG+C10+G -fs LG+C10+G_ft.sitefreq    reuse inferred sitefreq file
LG+G_fs       -m LG+G -fs LG+C10+G_ft.sitefreq        LG+G with the same imported sitefreq file

Notes:

  • *_ft generates LG+C10+G_ft.sitefreq
  • *_fs consumes that .sitefreq file
  • guide.treefile is only used as input for -ft
  • If I understand correctly, LG+C10+G_ft, LG+C10+G_fs, and LG+G_fs should in theory all give the same lnL (this is what is observed when -keep-ident is specified; see below).
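The five run types can be written out as full command lines roughly as follows. This is a dry-run sketch that only prints the commands. The -m, -te, -ft, -fs and -keep-ident options are the ones named in this report; the binary name (iqtree2), the use of -s for the alignment and --prefix for the output prefix are assumptions about how the runs were launched, and print_runs is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Dry run: print the five IQ-TREE commands instead of executing them.
# Assumptions (not stated in the report): binary name, -s, and --prefix.
IQTREE="${IQTREE:-iqtree2}"
ALN="../combined.phy"
TREE="ref.tre"

# print_runs [EXTRA_OPTION]
# EXTRA_OPTION is either empty or -keep-ident.
print_runs() {
  local extra="${1:-}"
  local base="$IQTREE -s $ALN -te $TREE${extra:+ $extra}"
  # Order matters: the _ft run writes LG+C10+G_ft.sitefreq,
  # which the two _fs runs then read.
  printf '%s\n' \
    "$base -m LG+G --prefix LG+G" \
    "$base -m LG+C10+G --prefix LG+C10+G" \
    "$base -m LG+C10+G -ft guide.treefile --prefix LG+C10+G_ft" \
    "$base -m LG+C10+G -fs LG+C10+G_ft.sitefreq --prefix LG+C10+G_fs" \
    "$base -m LG+G -fs LG+C10+G_ft.sitefreq --prefix LG+G_fs"
}

echo "# without -keep-ident"
print_runs
echo "# with -keep-ident"
print_runs -keep-ident
```

The prefixes here follow the table; the logs in iqtree_no_keepident/ carry an extra _noki suffix, which this sketch does not reproduce.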

Results

Without -keep-ident

In iqtree_no_keepident/:

grep 'BEST SCORE' *log

gives:

LG+C10+G_fs_noki.log:BEST SCORE FOUND : -1634.495
LG+C10+G_ft_noki.log:BEST SCORE FOUND : -1634.495
LG+C10+G_noki.log:BEST SCORE FOUND : -1634.495
LG+G_fs_noki.log:BEST SCORE FOUND : -1666.411
LG+G_noki.log:BEST SCORE FOUND : -1666.411

This means:

  • LG+C10+G, LG+C10+G_ft, and LG+C10+G_fs all collapse to the same lnL (-1634.495)
  • LG+G and LG+G_fs also collapse to the same lnL (-1666.411)
  • i.e. PMSF / imported site frequencies appear to have no effect without -keep-ident

With -keep-ident

In iqtree/:

grep 'BEST SCORE' *log

gives:

LG+C10+G_fs.log:BEST SCORE FOUND : -1475.677
LG+C10+G_ft.log:BEST SCORE FOUND : -1475.677
LG+C10+G.log:BEST SCORE FOUND : -1634.495
LG+G_fs.log:BEST SCORE FOUND : -1475.677
LG+G.log:BEST SCORE FOUND : -1666.412

This means:

  • LG+C10+G_ft, LG+C10+G_fs, and LG+G_fs all give the improved PMSF-like lnL: -1475.677
  • while the ordinary non-PMSF runs remain:
    • LG+C10+G: -1634.495
    • LG+G: -1666.412

So with -keep-ident, the PMSF-related runs behave as expected.


Interpretation

The issue appears to be tied to identical-sequence handling:

  • without -keep-ident, the PMSF/sitefreq-based runs do not produce their distinct improved likelihoods
  • with -keep-ident, they do

In other words, the problematic behavior seems to disappear once -keep-ident is enabled.
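The comparison behind this interpretation can be automated along the following lines. This is a minimal sketch: best_score and pmsf_effect are hypothetical helper names, and the only thing assumed about the log format is the "BEST SCORE FOUND : <lnL>" line shown in the grep output above. The lnL values are compared as strings at the precision printed in the log.

```shell
#!/usr/bin/env bash
# best_score LOGFILE
# Print the lnL from the first "BEST SCORE FOUND" line of an IQ-TREE log.
best_score() {
  awk '/BEST SCORE FOUND/ { print $NF; exit }' "$1"
}

# pmsf_effect BASELINE_LOG PMSF_LOG
# Report whether the PMSF/sitefreq run differs from its non-PMSF baseline.
pmsf_effect() {
  local base pmsf
  base="$(best_score "$1")"
  pmsf="$(best_score "$2")"
  if [ "$base" = "$pmsf" ]; then
    echo "NO EFFECT: $2 ($pmsf) equals $1 ($base)"
  else
    echo "differs: $2 ($pmsf) vs $1 ($base)"
  fi
}

# Run only when both log files are given on the command line.
if [ "$#" -eq 2 ]; then
  pmsf_effect "$1" "$2"
fi
```

Called with LG+C10+G_noki.log and LG+C10+G_ft_noki.log in iqtree_no_keepident/, this should report no effect given the scores listed above, while LG+C10+G.log and LG+C10+G_ft.log in iqtree/ should be reported as differing.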


bs_inBV_bestfit_issue.tar.gz

cheers,
sishuo
