7th European Conference on Computational Biology, Cagliari, Italy, 22-26 September 2008
The completion of several genome projects in the past decade has produced the full genome sequences of many organisms. Identifying genes in these sequences and assigning biological functions to them has become a key challenge in modern biology. The latter step, function assignment, is the focus of this work; it is often guided by automatic discovery processes that interact with laboratory experiments.
Two characteristics distinguish the function prediction task from common machine learning problems: (1) a single gene may have multiple functions; and (2) the functions are organized in a hierarchy: a gene that is related to some function is automatically related to all its parent functions (this is called the hierarchy constraint). This problem setting is known in machine learning as hierarchical multi-label classification (HMC), and a number of approaches have recently been proposed to deal with it. These approaches differ in several respects: the learning algorithm they are based on, whether the hierarchy constraint is always satisfied, and whether they can handle hierarchies structured as a directed acyclic graph (DAG), e.g. the Gene Ontology, or are restricted to hierarchies structured as a rooted tree, e.g. MIPS's FunCat.
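To make the hierarchy constraint concrete, the following is a minimal Python sketch, not the learner described in this work. The toy hierarchy `PARENTS` and the helper names `ancestors`, `close_under_hierarchy` and `satisfies_constraint` are hypothetical and chosen only for illustration. The term `atp_synthesis` has two parents, so the hierarchy is a DAG rather than a rooted tree.

```python
from typing import Dict, List, Set

# Hypothetical toy hierarchy: each function maps to its direct parents.
# "atp_synthesis" has two parents, which makes this a DAG, not a tree.
PARENTS: Dict[str, List[str]] = {
    "root": [],
    "metabolism": ["root"],
    "transport": ["root"],
    "ion_transport": ["transport"],
    "atp_synthesis": ["metabolism", "ion_transport"],
}


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return every function reachable from `term` by following parent links."""
    seen: Set[str] = set()
    stack = list(parents[term])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return seen


def close_under_hierarchy(labels: Set[str], parents: Dict[str, List[str]]) -> Set[str]:
    """Add all ancestors of each label so the hierarchy constraint holds."""
    closed = set(labels)
    for term in labels:
        closed |= ancestors(term, parents)
    return closed


def satisfies_constraint(labels: Set[str], parents: Dict[str, List[str]]) -> bool:
    """Check that every direct parent of every label is also in the label set."""
    return all(p in labels for term in labels for p in parents[term])


if __name__ == "__main__":
    gene_labels = close_under_hierarchy({"atp_synthesis"}, PARENTS)
    print(sorted(gene_labels))
    print(satisfies_constraint(gene_labels, PARENTS))
```

A label set such as `{"ion_transport"}` on its own violates the constraint, because its parent `transport` is missing; closing the set under the hierarchy repairs this.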
We present an HMC decision tree learner that takes the hierarchy constraint into account and is able to process DAGs. We show that our method outperforms previously published results for S. cerevisiae and A. thaliana. As the evaluation measure we use precision-recall curves, which are well suited to HMC learners. Moreover, if the user is willing to (partly) give up interpretability, we are able to further increase the predictive performance by upgrading our method to an ensemble technique. Ensemble techniques are learning methods that construct a set of classifiers and classify new data instances by taking a vote over their predictions.
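The voting step of an ensemble can be sketched as follows. This is a generic per-label majority vote under assumed inputs, not necessarily the combination scheme used in this work; the function name `majority_vote` and the example predictions are hypothetical.

```python
from typing import Dict, List


def majority_vote(predictions: List[Dict[str, bool]]) -> Dict[str, bool]:
    """Combine per-label predictions of several classifiers by strict majority.

    Each element of `predictions` maps every function label to True/False.
    A label is predicted by the ensemble if more than half of the
    classifiers predict it.
    """
    n = len(predictions)
    labels = predictions[0].keys()
    return {label: 2 * sum(p[label] for p in predictions) > n for label in labels}


if __name__ == "__main__":
    # Hypothetical outputs of three classifiers for one gene.
    votes = [
        {"transport": True, "ion_transport": True, "metabolism": False},
        {"transport": True, "ion_transport": False, "metabolism": False},
        {"transport": True, "ion_transport": True, "metabolism": True},
    ]
    print(majority_vote(votes))
```

If every individual classifier respects the hierarchy constraint, a parent function receives at least as many votes as any of its children, so the majority vote respects the constraint as well.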