A lite MATLAB toolbox for evaluating protein function predictors (according to the CAFA protocol)
There are two types of basic data structures used widely in this toolbox:
The ontology structure consists of the following fields:
- `term`, a struct array with fields `id` and `name`, holding the ontology term ID and the description of each term.
- `rel_code`, a cell array storing the types of relation used in this structure, for example `{'is_a', 'part_of'}`.
- `DAG`, a sparse integer matrix encoding the relations between terms, i.e., `DAG(i, j) = 1` indicates that terms `i` and `j` have the type `1` relation, where the relation types are those encoded in `rel_code`.
- `alt_list`, a mapping table between alternative term IDs and approved term IDs.
- `date`, the date when this ontology structure was created.
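As an illustration of the fields listed above, the following sketch assembles a toy ontology structure by hand. The term IDs, names, and date format are made up, the empty `alt_list` is only a placeholder (its real layout is not described here), and reading `DAG(i, j)` as "term `i` relates to term `j`" is an assumption. In practice the structure would come from `pfp_ontbuild` rather than be typed in manually.

```matlab
% Toy ontology with three terms; IDs and names are hypothetical.
ont = struct();
ont.term = struct('id',   {'T:0001', 'T:0002', 'T:0003'}, ...
                  'name', {'root', 'child A', 'child B'});
ont.rel_code = {'is_a', 'part_of'};

% DAG(i, j) = k links terms i and j by relation rel_code{k}.
% Assumed direction: term 2 is_a term 1, term 3 part_of term 1.
n = numel(ont.term);
ont.DAG = sparse([2 3], [1 1], [1 2], n, n);

% Placeholder only: no alternative IDs in this toy example.
ont.alt_list = {};
ont.date = datestr(now, 'yyyy-mm-dd');   % date format is an assumption

% Look up the relation stored between term 2 and term 1.
k = full(ont.DAG(2, 1));
fprintf('%s %s %s\n', ont.term(2).id, ont.rel_code{k}, ont.term(1).id);
```

Storing the relation type as the matrix value keeps every relation in a single sparse matrix, and `rel_code` acts as the legend for those integer codes.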
An ontology structure is built from an OBO file by calling `pfp_ontbuild('/path/to/obofile')`.
The annotation structure represents a set of annotations (usually experimental) using a specific ontology. It consists of the following fields:
- `object`, the set of objects having annotations (usually proteins or genes).
- `ontology`, an ontology structure associated with this annotation set.
- `annotation`, a sparse logical matrix indicating whether object `i` is annotated with term `j`. Note that `i` refers to the i-th entry in `object`, while `j` refers to the j-th term of `{ontology.term.id}`.
- `date`, the date when this structure was created.
Note that a similar structure represents a predictor's output. The only difference is that the `annotation` field is replaced by `score`, a sparse real-valued matrix holding the prediction scores, each within [0, 1].
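To make the two layouts concrete, here is a minimal hand-built sketch of an annotation structure and its prediction counterpart. All object IDs, term IDs, and scores are hypothetical, and the `ontology` field is reduced to the term IDs only. It is meant to show the indexing convention, not how the toolbox creates these structures.

```matlab
% Hypothetical objects and terms, for illustration only.
term_ids = {'T:0001', 'T:0002', 'T:0003'};
oa = struct();
oa.object   = {'P1'; 'P2'; 'P3'};
oa.ontology = struct('term', struct('id', term_ids));   % stripped-down ontology

% annotation(i, j) is true when object i is annotated with term j.
oa.annotation = sparse([1 1 2 3], [1 2 1 3], true(1, 4), 3, 3);
oa.date = datestr(now, 'yyyy-mm-dd');   % date format is an assumption

% A prediction structure swaps 'annotation' for 'score' (values in [0, 1]).
pred = rmfield(oa, 'annotation');
pred.score = sparse([1 1 2 3 3], [1 2 1 2 3], [0.9 0.4 0.7 0.2 0.8], 3, 3);

% Terms annotated to the first object, and the scores predicted for them.
j   = find(oa.annotation(1, :));
ids = term_ids(j);
s   = full(pred.score(1, j));
```

Because both matrices share the same row and column order, a predictor's scores line up with the annotations entry by entry.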
An annotation structure is built by calling `pfp_oabuild(ont, '/path/to/plain-text-annotation', '/more/files', ...)`, where `ont` is an ontology structure built using `pfp_ontbuild`. Each plain-text annotation file should have exactly two columns: 1) sequence (protein/gene) ID and 2) annotated term ID.
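The two-column file layout can be illustrated with a small round trip: write a toy file, read it back, and turn it into a sparse logical matrix. This is only a sketch of the idea, not the logic of `pfp_oabuild`. The IDs are hypothetical, the tab delimiter is an assumption, and terms not found in the term list are simply skipped.

```matlab
% Write a tiny two-column annotation file (hypothetical IDs, tab-separated).
fname = [tempname, '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'P1\tT:0001\nP1\tT:0002\nP2\tT:0001\n');
fclose(fid);

% Read both columns back as strings, then remove the temporary file.
fid = fopen(fname, 'r');
cols = textscan(fid, '%s%s');
fclose(fid);
delete(fname);

% Map IDs to row/column indices and assemble a sparse logical matrix.
term_ids = {'T:0001', 'T:0002', 'T:0003'};
[objects, ~, row] = unique(cols{1});          % one row per distinct object
[found, col] = ismember(cols{2}, term_ids);   % column index of each term
A = sparse(row(found), col(found), true(nnz(found), 1), ...
           numel(objects), numel(term_ids));
```

Here `A(i, j)` is true when `objects{i}` is listed with `term_ids{j}` in the file.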
Considered as a multi-label learning (MLL) problem, protein function prediction requires a method to predict a score for an instance (protein or gene) for every possible label. Therefore, evaluating such a prediction amounts to comparing a prediction matrix with a "ground-truth" annotation matrix, both of size n-by-m, where n is the number of instances and m is the number of labels (terms in an ontology).
Generally, there are two types of evaluation schemes: 1) sequence-centric and 2) term-centric. The former calculates a performance measure for each row first and then combines those results into an overall performance measure, while the latter does the same column by column before the combination step.
One needs to specify which metric to use for each row (or column) and which method to use when combining those measures. In CAFA2, we used the (weighted) F-measure and the (normalized) semantic distance for sequence-centric evaluation, and AUC for term-centric evaluation. In both schemes, combination was done simply by averaging.
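The row-wise and column-wise schemes can be sketched on a toy 3-by-4 problem as follows. This is a simplified illustration, not the toolbox's evaluation code. The data are made up, the F-measure is unweighted and computed at one fixed threshold `tau` (a hypothetical parameter), and AUC is computed by direct pairwise comparison of positive and negative scores.

```matlab
% Toy data: 3 proteins (rows) by 4 terms (columns); values are made up.
truth = logical([1 1 0 0; 1 0 1 0; 0 1 1 1]);
score = [0.9 0.6 0.2 0.1; 0.8 0.3 0.1 0.1; 0.2 0.7 0.6 0.3];

% Sequence-centric: F-measure per row at a fixed threshold, then average.
tau  = 0.5;
pred = score >= tau;
tp   = sum(pred & truth, 2);
pr   = tp ./ max(sum(pred, 2), 1);     % precision per protein
rc   = tp ./ max(sum(truth, 2), 1);    % recall per protein
f    = 2 .* pr .* rc ./ max(pr + rc, eps);
seq_f = mean(f);

% Term-centric: AUC per column, then average over evaluable columns.
m   = size(score, 2);
auc = nan(m, 1);
for j = 1:m
    pos = score(truth(:, j), j);
    neg = score(~truth(:, j), j);
    if isempty(pos) || isempty(neg), continue; end   % AUC undefined here
    d = bsxfun(@minus, pos, neg');     % every positive-vs-negative difference
    auc(j) = mean(d(:) > 0) + 0.5 * mean(d(:) == 0);
end
term_auc = mean(auc(~isnan(auc)));
```

The two schemes answer different questions. The row-wise average reflects how well a typical protein is annotated, while the column-wise average reflects how well a typical term is ranked across proteins.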
The lite toolbox also contains functions to build the two baseline methods used in CAFA evaluation, i.e., Naive and BLAST.
- Naive predictions can be created with `pfp_naive.m`, which loads an annotation structure and predicts a query protein according to the annotation frequency.
- BLAST predictions can be created with `pfp_blast.m`. The function depends on an extra structure created by `pfp_importblastp.m` which, as the name indicates, imports output results from the `blastp` program (tested on v2.2.28+). Note that we usually BLAST the test set proteins against the annotated training set proteins to obtain those BLAST hits.
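The idea behind the Naive baseline, scoring every query by how often each term is annotated in the training data, can be sketched in a few lines. This does not reproduce `pfp_naive.m`. The training matrix and query IDs are hypothetical, and using the relative frequency as the score is an assumption about how "annotation frequency" is turned into a number.

```matlab
% Toy training annotations: 4 proteins by 3 terms (made-up data).
train = sparse(logical([1 1 0; 1 0 0; 1 0 1; 1 1 0]));

% Relative annotation frequency of each term in the training set.
freq = full(sum(train, 1)) / size(train, 1);

% Every query protein receives the same score vector.
queries = {'Q1'; 'Q2'};
score = repmat(freq, numel(queries), 1);
```

Since the scores ignore the query sequence entirely, this baseline gives a floor that any sequence-aware predictor would be expected to beat.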
The source code of this project is licensed under the MIT license.