International Mass Spectrometry Conference, Date: 2014/08/24 - 2014/08/29, Location: Geneva, Switzerland

Publication date: 2014-08-01
Volume: 20 Pages: 507 - 507

IMSC 2014: 20th International Mass Spectrometry Conference

Author:

De Grave, Kurt
Van den Bulck, Alexander ; Touzé, Sébastien ; Fannes, Thomas ; Ramon, Jan

Abstract:

1. Introduction
The recent algorithm MS2PIP [1] is the most accurate predictor of observed intensities of peptide fragment ions in collision-induced dissociation (CID). The basic assumption and design choice of the algorithm is that models based on coherent training data are more accurate. MS2PIP therefore segments the training (and test) data into partitions, one for each combination of peptide length, fragment ion length, ion type, and charge. A separate random forest (RF) model is trained for each partition. The machine learning literature, however, contains a large volume of experiments demonstrating that, with the right algorithm, additional data that is correlated with the main target can often help a learner. This is found in transfer learning, in multi-label classification, and in relational learning.

2. Methods
We revisit the basic assumption behind MS2PIP and devise an encoding for all its features, and some additional features, in a way that permits the training of a monolithic RF on all available data irrespective of lengths. Features are computed on demand, which avoids the need to store the training data matrix in memory and allows a larger number of examples and features. Our system is implemented in C++ and we name it I2P (InSPECtor Intensity Predictor). Heavy use of templates allows a maximum amount of work to be performed at compile time: the binary code is fully specialized in the type of the data it has to handle and the tasks it has to perform. The predictive model itself can be exported to C++ and compiled, generating a very fast predictor that does not need to load model data files.

3. Results
Experiments indicate that the predictions of I2P are much better for rare ions and those of MS2PIP are better for the most common ions (short, singly charged b and y ions). An ensemble that predicts the mean (in log-space) of the output of both systems is superior to either system on its own.

4. Conclusions
The intense partitioning of MS2PIP, while advantageous for abundant ions, is harmful for less abundant ions, which collectively make up a large proportion of all cases. I2P mitigates this problem because the predictions of rare classes can benefit from tree structures based on many more cases. The same property also makes I2P more robust against scarcity of training data. A simple ensemble provides the best overall predictions.

5. Novel aspect
I2P is a novel system for predicting fragmentation ion intensities with state-of-the-art performance that is more suitable for modeling new fragmentation mechanisms for which fewer historical experiments are available.

[1] Sven Degroeve and Lennart Martens. MS2PIP: a tool for MS/MS peak intensity prediction. Bioinformatics 29 (24), 2013.
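
Illustrative note: the ensemble mentioned in the Results, which predicts the mean in log-space of the two systems' outputs, can be sketched as below. This is a minimal sketch and not the authors' implementation. It assumes that both predictors return strictly positive intensities on a linear scale, so that averaging in log-space amounts to a geometric mean; the function name `log_mean_ensemble` and the example values are hypothetical.

```python
import math


def log_mean_ensemble(pred_a, pred_b):
    """Combine two lists of positive intensity predictions.

    Each pair is averaged in log-space and mapped back to the linear
    scale, which equals the geometric mean of the two values.
    """
    if len(pred_a) != len(pred_b):
        raise ValueError("prediction lists must have the same length")
    combined = []
    for a, b in zip(pred_a, pred_b):
        if a <= 0 or b <= 0:
            # Assumption of this sketch: intensities are strictly positive.
            raise ValueError("intensities must be strictly positive")
        combined.append(math.exp(0.5 * (math.log(a) + math.log(b))))
    return combined


if __name__ == "__main__":
    # Hypothetical predictions for three fragment ions from two systems.
    system_one = [0.04, 0.50, 0.10]
    system_two = [0.01, 0.50, 0.40]
    print(log_mean_ensemble(system_one, system_two))
```

If the two systems already report log-intensities, the same ensemble reduces to an ordinary arithmetic mean of their outputs.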