Title: Predicting gene function in S. cerevisiae and A. thaliana using hierarchical multi-label decision tree ensembles
Authors: Schietgat, Leander
Vens, Celine
Struyf, Jan
Blockeel, Hendrik
Kocev, Dragi
Dzeroski, Saso
Issue Date: Oct-2008
Publisher: Department of Computer Science, K.U.Leuven
Series Title: CW Reports vol:CW528
Abstract: Motivation: S. cerevisiae and A. thaliana are two well-studied organisms in biology. Although their genomes were completed in 1996 and 2000, respectively, the functions of 30% to 40% of their open reading frames (ORFs) remain unclassified. Various machine learning methods have been proposed to annotate ORFs automatically. However, it is unclear which method is to be preferred in terms of predictive performance, efficiency, interpretability, and usability. Moreover, different evaluation measures for predictive performance have been used in the literature, each capturing only a limited aspect of a method's performance. Results: We study the usefulness of decision-tree-based models for predicting the multiple functions of ORFs. First, we describe an algorithm for learning decision trees that predict ORF functions automatically. We present new results obtained with this algorithm, showing that the trees it finds exhibit clearly better predictive performance than the trees found by previously described methods, while yielding equally interpretable results. The predictive accuracy of our trees, however, is still below that of some recently proposed statistical learning methods. Ensembles of such trees, on the other hand, give even better predictive results, comparable with those of state-of-the-art methods (sometimes better, sometimes worse), while the ensemble method scales much better and is easier to use. We conclude that decision-tree-based methods are currently the most efficient, easy-to-use, and flexible approach to ORF function prediction, flexible in the sense that they cover the spectrum from maximally interpretable to maximally accurate models. Our evaluation makes use of precision-recall curves. We argue that this is a better evaluation criterion than previously used criteria, and that the evaluation method can be seen as an additional contribution to the field. Availability: The software is freely available on
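Illustrative note on the evaluation criterion: the abstract does not say how the precision-recall curves are built for multi-label predictions, so the following is only a minimal sketch under one assumed scheme, in which every (ORF, function) pair is pooled and a single decision threshold is swept over the predicted scores. The function name `pooled_pr_curve` and the layout of `scores` and `labels` are hypothetical and are not taken from the report or its software.

```python
def pooled_pr_curve(scores, labels):
    """Return (recall, precision) points over pooled (ORF, function) pairs.

    scores[i][j]: predicted score that ORF i has function j (assumed layout).
    labels[i][j]: 1 if ORF i is annotated with function j, else 0.
    """
    # Flatten the two matrices into (score, label) pairs, highest score first.
    pairs = sorted(
        ((s, y) for s_row, y_row in zip(scores, labels)
         for s, y in zip(s_row, y_row)),
        key=lambda pair: -pair[0],
    )
    total_pos = sum(y for _, y in pairs)
    if total_pos == 0:
        raise ValueError("recall is undefined without any positive label")

    points = []
    tp = fp = 0
    for k, (score, y) in enumerate(pairs):
        tp += y
        fp += 1 - y
        # Emit a point only at the end of a run of tied scores, so each
        # point corresponds to one distinct threshold.
        if k + 1 == len(pairs) or pairs[k + 1][0] != score:
            points.append((tp / total_pos, tp / (tp + fp)))
    return points
```

Each returned point corresponds to predicting every pair whose score is at or above one threshold, so a model with many functions per ORF is summarized across all thresholds rather than at a single cut-off. Averaging per-class curves instead of pooling pairs is an equally plausible alternative that this sketch does not cover.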
Publication status: published
KU Leuven publication type: IR
Appears in Collections: Informatics Section

Files in This Item:
File        Description  Status     Size   Format
CW528.pdf   Document     Published  797Kb  Adobe PDF

