ASMS, Date: 2014/06/15 - 2014/06/19, Location: Baltimore, MD, USA

Publication date: 2014-06-19

62nd Annual ASMS Conference on Mass Spectrometry and Allied Topics

Author:

De Grave, Kurt
Ramon, Jan

Keywords:

Bayesian network, probabilistic logical model, proteomics, mass spectrometry, logical Bayesian network, quantitative proteomics

Abstract:

*Introduction* Bayesian networks [Heckerman, 1995] are graphical models that represent the joint probability distribution of a group of random variables by describing the dependencies between them. Recently, the machine learning community has developed substantial interest in statistical relational learning (SRL), which combines the advantages of probabilistic models with those of relational data mining [Getoor and Taskar, 2007]. An important advantage of these approaches is that while a standard Bayesian network specifies dependencies between the columns of a single table, SRL models can handle complex dependencies in relational databases with many tables. In mass spectrometry, this allows probabilistic inference across several interacting domains, such as the proteome, PTMs, experimental protocols, instruments, and fragmentation models.

*Methods* We adopt the framework of logical Bayesian networks (LBN) [Fierens et al., 2005], which combines the properties of logic programs and Bayesian networks. In particular, a logic program is used to specify the random variables of interest and the Bayesian network defining their joint probability distribution. We construct an LBN model describing the various domains relevant to proteomics, taking a modular approach as mandated by the need to accommodate different labs and experimental protocols: labs can reconfigure the network to fit their experimental setup. Using the information in the LBN, three important inference tasks can be performed more precisely and flexibly: interpretation of observed spectra, prediction of spectra, and optimization of experimental parameters. The framework also enables ad hoc probabilistic querying.

*Preliminary Data* Inference is performed by integrating probabilistic inference algorithms with specialized submodels induced a priori. Important submodels include the prediction of fragmentation (based on a C++ port of MS2PIP [Degroeve and Martens, 2013]) and cleavage [Fannes et al., 2013].
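As a minimal, hypothetical illustration of the kind of probabilistic query a Bayesian network supports (a toy example, not the authors' actual model: the variables, structure, and probabilities here are invented for exposition), consider a three-node chain relating a peptide's presence to an observed peak, queried by exact enumeration:

```python
# Toy Bayesian network: Present -> Detected -> PeakObserved.
# All probabilities are hypothetical, chosen only for illustration.
P_present = {True: 0.3, False: 0.7}                 # prior P(Present)
P_detected = {True: {True: 0.8, False: 0.2},        # P(Detected | Present)
              False: {True: 0.05, False: 0.95}}
P_peak = {True: {True: 0.9, False: 0.1},            # P(PeakObserved | Detected)
          False: {True: 0.1, False: 0.9}}

def joint(present, detected, peak):
    """Joint probability, factored along the network structure."""
    return (P_present[present]
            * P_detected[present][detected]
            * P_peak[detected][peak])

def posterior_present_given_peak(peak=True):
    """P(Present | PeakObserved = peak), summing out Detected."""
    num = sum(joint(True, d, peak) for d in (True, False))
    den = sum(joint(p, d, peak) for p in (True, False)
                                for d in (True, False))
    return num / den
```

Observing a peak raises the posterior belief in the peptide's presence above its 0.3 prior; an LBN generalizes this idea by letting a logic program generate the random variables and dependencies from relational data rather than fixing them by hand.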
Background models describe the prior distribution of variables, such as the abundance of proteins and PTMs in the tissues of a species. Estimating accurate priors and submodels is challenging, as even very large amounts of training data may not suffice in a high-dimensional space without sufficient independence assumptions or other background knowledge. The experimentalist can express background knowledge that shifts or marginalizes the background models, or can wholly replace arbitrary submodels, most conveniently by supplying retraining data. Some queries can be computed easily, while others pose very high computational demands. We are working to increase the scope of queries that can be answered quickly, through both algorithmic innovation and implementation quality. Our inference acceleration strategies include C++ code generation for the trained submodels and an algorithm for upstream probability reconstruction in random forests.

*Novel Aspect* This is the first probabilistic logical model in mass spectrometry.
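One hedged sketch of how supplying retraining data could shift a background prior: assuming, purely for illustration, that the frequency of some PTM is modeled with a Beta prior (the abstract does not specify the parameterization), lab-specific observations move the generic background model toward the lab's own distribution via a conjugate update:

```python
# Illustrative only: shifting a background prior with retraining data.
# Assumed model: PTM frequency ~ Beta(a, b); observing k modified sites
# out of n in lab-specific data gives the conjugate Beta posterior.

def update_beta_prior(a, b, k, n):
    """Beta-Binomial conjugate update: Beta(a, b) -> Beta(a + k, b + n - k)."""
    return a + k, b + (n - k)

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Generic background prior with mean 0.1.
a0, b0 = 1.0, 9.0
# Hypothetical lab retraining data: 30 modified sites out of 100 observed.
a1, b1 = update_beta_prior(a0, b0, k=30, n=100)
```

Here the prior mean shifts from 0.1 toward the lab's observed rate (31/110 ≈ 0.28); the same principle applies when whole submodels are retrained on lab data.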