Benelux Bioinformatics Conference, 5th edition, Liège, Belgium, 14-15 December 2009
We consider the problem of discovering biclusters in gene expression data by means of machine learning. The data contains the measured expression levels of the genes of a particular organism under a number of varying conditions. The learning task given such a dataset is to find subsets of genes that are co-expressed under subsets of conditions (such a subset of genes together with the corresponding subset of conditions is called a bicluster).
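To make the notion of a bicluster concrete, the following minimal sketch represents a toy expression matrix and one bicluster as a pair of subsets. All gene names, condition names, and values are invented for illustration, and the range-based coherence check is only a crude stand-in assumed here, not the probabilistic model discussed in this abstract.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List

# Toy expression matrix: expression[gene][condition] = measured level.
# Names and values are hypothetical, chosen only for illustration.
expression: Dict[str, Dict[str, float]] = {
    "geneA": {"c1": 2.1, "c2": 2.0, "c3": 0.1, "c4": 0.3},
    "geneB": {"c1": 1.9, "c2": 2.2, "c3": 1.5, "c4": 0.2},
    "geneC": {"c1": 0.2, "c2": 0.1, "c3": 0.4, "c4": 1.8},
}


@dataclass(frozen=True)
class Bicluster:
    """A subset of genes together with a subset of conditions."""
    genes: FrozenSet[str]
    conditions: FrozenSet[str]


def submatrix(data: Dict[str, Dict[str, float]], bc: Bicluster) -> List[List[float]]:
    """Return the expression values restricted to the bicluster,
    with genes and conditions in sorted order."""
    return [[data[g][c] for c in sorted(bc.conditions)] for g in sorted(bc.genes)]


def value_range(data: Dict[str, Dict[str, float]], bc: Bicluster) -> float:
    """Crude coherence measure (an assumption of this sketch):
    the spread between the largest and smallest value in the bicluster."""
    values = [v for row in submatrix(data, bc) for v in row]
    return max(values) - min(values)


# geneA and geneB show similar levels under conditions c1 and c2.
bc = Bicluster(frozenset({"geneA", "geneB"}), frozenset({"c1", "c2"}))
# The full matrix, expressed as a (trivial) bicluster for comparison.
everything = Bicluster(frozenset(expression), frozenset({"c1", "c2", "c3", "c4"}))

print(submatrix(expression, bc))
print(value_range(expression, bc), value_range(expression, everything))
```

In this toy data the selected bicluster has a much smaller spread of values than the full matrix, which is the intuition behind co-expression under a subset of conditions. Because a bicluster is just a pair of sets, the same gene may appear in several biclusters, so overlapping biclusters need no special treatment in this representation.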
The problem of biclustering gene expression data has already been tackled using probabilistic model-based biclustering. So far, this approach has been implemented in a special-purpose system, although a number of general-purpose probabilistic modelling systems also appear suitable for solving this problem. A solution in a general-purpose system would have the advantage of being easily adaptable and extensible, for instance with respect to additional data sources about the genes under consideration. The goal of this work is to investigate how well the problem of biclustering gene expression data can be solved with a number of general-purpose probabilistic modelling systems. Concretely, we consider so-called probabilistic logic learning (PLL) systems, which use elements of first-order logic for the sake of expressivity. PLL is currently a very popular approach in the artificial intelligence and machine learning community.
In our work, we first analysed the modelling and learning features required to solve the biclustering problem (such as the ability to deal with numerical data, with overlapping clusters, etc.). Next, we surveyed which of these features are supported by which PLL systems. In this analysis, the PLL system Alchemy (which is based on so-called Markov Logic) appeared to be the most promising. Hence, we continued by implementing probabilistic model-based biclustering in this system. This implementation effort revealed several practical problems that make it impossible to represent the desired model in Alchemy. We report the problems encountered (limitations of Alchemy) as well as the aspects of the biclustering task that can easily be modelled in Alchemy (strong points of Alchemy). In the light of these limitations and strong points, we also compare Alchemy to the other PLL systems considered in our initial analysis.
From the perspective of biological applications, our discussion gives some insight into which kinds of problems can and cannot easily be tackled using popular general-purpose systems. From the perspective of informatics, in particular machine learning, it identifies a number of shortcomings of existing systems and corresponding directions for future work.
Tim Van den Bulcke et al. "Efficient Query-Driven Biclustering of Gene Expression Data Using Probabilistic Relational Models", ESAT-SISTA Internal Report 08-134, K.U.Leuven, 2008.