Possvm (Phylogenetic Ortholog Sorting with Species oVerlap and MCL) is a python tool to analyse pre-computed gene trees and identify pairs and clusters of orthologous genes. It takes advantage of the species overlap algorithm implemented in the ETE toolkit to parse the phylogeny and identify orthologous gene pairs, and MCL clustering for orthogroup identification.
Warning
If you downloaded Possvm between the 8th of April 2024 and July 2024, please redownload it as an important bug has been found in the MCL clustering module.
It only requires a gene tree in newick format, where the sequence name contains a prefix that indicates their species of origin, e.g. human_gene1. It does not require a species tree to infer orthologs.
An overview of Possvm can be found here:
Orthology clustering from gene trees with Possvm (Grau-Bové and Sebé-Pedrós, MBE 2021)
Possvm works in four basic steps:
- Identification of orthology pairs using Species Overlap, which are used to build an orthology graph.
- Obtain orthology clusters from pairwise orthology relationships, using MCL clustering.
- Produce parseable tables with orthogroups and ortholog pairs.
- Orthogroups can be annotated using names from reference member genes, in a phylogeny-aware manner.
All functions are available in the possvm.py script.
Input:
- Gene trees in NEWICK format, which may contain node supports (node supports are only necessary for some extra functionalities).
- Optionally, a list of reference genes with names can be supplied (
-rflag). Possvm will use it to label the orthgroups with user-supplied names. This is a simple tab-separated table where the first column is the gene id (same as used in the gene tree file) and the second column is the gene name:
human_gene1 gene_name_A
human_gene2 gene_name_A
human_gene3 gene_name_B
# Note that this file can be redundant, i.e. you can use
# the same name for more than one gene id (useful to reduce
# paralog/isoform redundancy).
# Example in `test/drosophila_gene_names.csv`Output:
- Per-gene orthogroup classification (filename:
ortholog_groups.csv). A CSV table with all genes and their annotations, in the following format:
gene orthogroup orthogroup_support reference_ortholog reference_support
spA_gene1 OG0 100.0 spA_gene1 100.0
spB_gene2 OG0 100.0 spA_gene1 90.0
spC_gene3 OG0 100.0 spA_gene1 100.0
spB_gene4 OG1 75.0 NA NA
spC_gene5 OG1 75.0 NA NA
geneis the gene idorthogroupis the orthogroup id produced by Possvmorthogroup_supportis the statistical support at the deepest node of each orthogroup (if available,NAotherwise)reference_ortholog: direct orthologs of each gene in the list of reference gene names (NAif no list supplied)reference_support: statistical support at the deepest node separating each gene and their orthologs in the list of reference gene names (NAif no list supplied/no supports available).
-
Annotated phylogeny in Newick and PDF format (filenames:
ortholog_groups.newickandortholog_groups.newick.pdf; the-skipprintflag omits PDF production) -
Pairs of orthologous genes and statistical support for their relationship, if available (filename:
ortholog_pairs.csv). It is also possible to report all pairwise relationships between genes (orthologs and paralogs) using the-printallpairsflag (filename:pairs_all.csv)
Available options:
usage: possvm.py [-h] -i IN [-o OUT] [-p PHY] [-r REF] [-refsps REFSPS]
[-s SOS] [-method METHOD] [-inflation INFLATION]
[-outgroup OUTGROUP] [-split SPLIT]
[-itermidroot ITERMIDROOT] [-skiproot] [-skipprint]
[-printallpairs] [-min_support_node MIN_SUPPORT_NODE]
[-min_support_transfer MIN_SUPPORT_TRANSFER]
[-clean_gene_names] [-cut_gene_names CUT_GENE_NAMES]
[-ogprefix OGPREFIX] [-spstree SPSTREE] [-v]
optional arguments:
-h, --help show this help message and exit
-i IN, --in IN Path to a phylogenetic tree in newick format. Each
sequence in the tree must have a prefix indicating the
species, separated from gene name with a split
character. Default split character is "_", see --split
for options.
-o OUT, --out OUT OPTIONAL: String. Path to output folder. Defaults to
same directory as input file.
-p PHY, --phy PHY OPTIONAL: String. Prefix for output files. Defaults to
`basename` of input phylogeny. Default behaviour will
never overwrite original files, because it adds
suffixes.
-r REF, --ref REF OPTIONAL: String. Path to a table indicating reference
gene names that can be used for orthogroup labeling.
Format: geneid <tab> name.
-refsps REFSPS, --refsps REFSPS
OPTIONAL: String. Comma-separated list of reference
species that will be used for orthogroup labeling. If
absent, all gene names present in the -r table will be
considered.
-s SOS, --sos SOS OPTIONAL: Float. Species overlap threshold used for
orthology inference in ETE. Default is 0. Higher
values (up to 1) result in more inclusive groupings.
-method METHOD, --method METHOD
OPTIONAL: String. Clustering method. Options are `mcl`
(MCL, default), `mclw` (MCL weighted by node
supports), `louvain` (Louvain), or `lpa` (label
propagation algorithm).
-inflation INFLATION, --inflation INFLATION
OPTIONAL: Float. Inflation hyperparameter for MCL
clustering. Only applicable if `method` is `mcl` or
`mclw`. In practice, the most inflation-responsive
method is `mclw`.
-outgroup OUTGROUP, --outgroup OUTGROUP
OPTIONAL: String. Define a set of species that are
treated as outgroups in the phylogeny, and excluded
from orthology clustering. Can be a comma-separated
list of species, or a file with one species per line.
This option DOES NOT affect tree rooting, just
orthology clustering. Disabled by default.
-split SPLIT, --split SPLIT
OPTIONAL: String to use as species prefix delimiter in
gene ids, e.g. "_" for gene names formatted as
speciesA_geneX. Defaults to "_".
-itermidroot ITERMIDROOT, --itermidroot ITERMIDROOT
OPTIONAL: Integer. Turns on iterative midpoint rooting
with INT iterations, which is used instead of the
default midpoint rooting. Low numbers are recommended
(e.g. 10 is often more than enough).
-skiproot, --skiproot
OPTIONAL: Bool. Turns off tree rooting, in case your
trees are already rooted.
-skipprint, --skipprint
OPTIONAL: Bool. Turns off printing of annotated
phylogeny in PDF format (annotated newick is still
produced).
-printallpairs, --printallpairs
OPTIONAL: Bool. Turns on the production of a table
with pairwise orthology/paralogy relationships between
all pairs of genes in the phylogeny (default behaviour
is to only report pairs of orthologs).
-min_support_node MIN_SUPPORT_NODE, --min_support_node MIN_SUPPORT_NODE
OPTIONAL: Float. Min node support to consider
orthology relationships. If not set, all relationships
are considered.
-min_support_transfer MIN_SUPPORT_TRANSFER, --min_support_transfer MIN_SUPPORT_TRANSFER
OPTIONAL: Float. Min node support to allow transfer of
labels from labelled to non-labelled groups in the
same clade. If not set, this step is skipped.
-clean_gene_names, --clean_gene_names
OPTIONAL: Bool. Will attempt to "clean" gene names
from the reference table (see -r) used to create
cluster names, to avoid very long strings in groups
with many paralogs. Currently, it collapses number
suffixes in gene names, and converts strings such as
Hox2/Hox4 to Hox2-4. More complex substitutions are
not supported.
-cut_gene_names CUT_GENE_NAMES, --cut_gene_names CUT_GENE_NAMES
OPTIONAL: Integer. If set, will shorten cluster name
strings to the given length in the PDF file, to avoid
long strings in groups with many paralogs. Default is
no shortening.
-ogprefix OGPREFIX, --ogprefix OGPREFIX
OPTIONAL: String. Prefix for ortholog clusters.
Defaults to "OG".
-spstree SPSTREE, --spstree SPSTREE
OPTIONAL: Path to a species tree. If this is provided,
Possvm will use a species tree reconciliation
algorithm instead of species overlap.
-v, --version show program's version number and exit
If you happen to have a species tree for your dataset, you can also use the script provided in scripts/possvm_reconstruction.py to reconstruct ancestral gains and losses using Dollo parsimony. This species tree is not required for possvm's main functionality, it's just a convenience tool. See below for examples.
Possvm has been tested in Python 3.10, and it depends on the ETE3 toolkit library. I recommend that you use conda to install ETE and all other dependencies (I used version 23.1.0 in Ubuntu 22.04).
Once you have a working installation of conda (see here for instructions), you can use the environment.yaml file bundled in this repository to reproduce the environment:
# create env and install packages
conda env create -n possvm --file environment.yaml
conda activate possvmAlternatively, you can install them one by one, like this:
# create environment for possvm
conda create -n possvm
conda activate possvm
# install dependencies
conda install -c etetoolkit ete3==3.1.2
conda install -c bioconda pandas markov_clustering matplotlib numpy
conda install -c conda-forge networkx==3.0Both options should download and install all basic dependencies, including the following essential packages:
ete3 3.1.2
markov_clustering 0.0.6
matplotlib 3.7.1
networkx 3.0
scipy 1.10.0
numpy 1.23.5
pandas 1.5.3
python 3.10.9Once these dependencies are up and running, you can run Possvm like any Python script:
python possvm.py -h
# maybe add it as an alias, because you're going to use it everyday?
echo "alias possvm=\"python $(pwd)/possvm.py\"" >> ~/.bashrc
source ~/.bashrc
possvm -hYou can test Possvm using a small phylogeny of fatty acid synthases from three insects (Drosophila melanogaster and two mosquitoes, Anopheles gambiae and Aedes aegypti), which can be found in the test/ folder.
- Identify ortholog clusters in the gene tree (
OG0toOG4):
possvm -i test/fa_synthases.newick -p fa_synthases_unnamed
# -i path to gene tree
# -p output filename- Same, but annotating the orthogroups with Drosophila gene names if possible, and reporting relationships between all gene pairs:
possvm -i test/fa_synthases.newick -r test/drosophila_gene_names.csv -p fa_synthases_named -printallpairs- If you have a fully labelled species tree, you can reconstruct ancestral characters with Dollo parsimony:
python scripts/possvm_reconstruction.py -ort test/fa_synthases_unnamed.ortholog_groups.csv -tree test/species_tree.newick -out test/species_tree.ancestral_reconstructionIn this case, your species tree should include all species present in your gene trees (it can include species absent in your gene tree too, e.g. where losses occur), and all the ancestral nodes should be labelled, as follows (newick format):
(Dromel,(Aedaeg,Anogam)mosquitoes)insects;Please keep in mind that Dollo parsimony is not always the most appropriate evolutionary model for ancestral reconstruction as it does not account for the possibility of parallel gains, and it tends to inflate the number of presences in the ancestral node.
The PDF printing step in Possvm relies on the ete3 library and is therefore affected by this known issue regarding graphic library dependencies. If you are working in a system without a graphical user interface (GUI), you may lack some of these libraries and thus encounter the following error when trying to create the PDF:
ImportError: cannot import name 'TreeStyle'If so, try installing this library with pip:
pip3 install PyQt5 # tested with 5.11.3If this fails:
- You can always disable PDF printing with the
--skipprintflag, and the program should finish without problems. - Drop me a line in the issues section.
Please let me know here if you find anything!
As part of our manuscript (Grau-Bové and Sebé-Pedrós, MBE 2021), we assessed Possvm's accuracy against curated gene family classifications from the Orthobench and HomeoDB databases. You can find data and code to reproduce these analyses in our companion repository.
If you use Possvm, please cite the following papers:
- Possvm paper: Grau-Bové and Sebé-Pedrós, MBE 2021.
- ETE toolkit: Huerta-Cepas et al. MBE 2016.
- Species overlap algorithm: Huerta-Cepas et al. Genome Biology 2007.
- MCL clustering: Enright et al. NAR 2002.



