Proceedings of the fifth Benelux Bioinformatics Conference pages:1-1
Benelux Bioinformatics Conference edition:5 location:Liege, Belgium date:14-15 December 2009
In several experimental procedures in bio-informatics, mass spectrography is an essential step. It is a challenging problem to extract as much information as possible from the resulting mass spectrogram. Ideally, one starts from a pure substance and desires to determine from the fragment peaks in the spectrogram the structure of the original molecule.
One possible strategy is to search in the space of all molecules for molecules that could have generated the observed spectrum. This is a non-trivial computational problem, as the space of 1all molecules is huge. In this work, we represent the molecules with graphs and we investigate the possibility to use recent and highly efficient graph pattern listing techniques to address this problem.
In order to decide whether a molecule could have generated the observed spectrum, one has to take into account two factors. First, there is the a priori likelihood of the molecule. E.g. when analyzing proteins, sequences that are equal or very similar to parts of the genome are more likely than unrelated sequences. Second, it should be taken into account how likely the observed spectrum is, given the proposed molecule. Usually, one hopes that the molecule produces a sufficiently diverse set of fragments so that the massess of many subgraphs of the graph to find are observed.
We will focus here on the listing part, i.e. the task to list all molecular graphs in a particular constrained space efficiently and without isomorphic duplicates. In the fields of data mining and graph theory, this problem has been studied in some depth. A rather generic result is given by Ramon and Nijssen, who show that one can list all elements of a monotone graph class with polynomial delay; i.e. such that the time needed between any two solutions is bounded by a polynomial in the input size. A graph class is monotone if it holds for all elements of the class that all subgraphs are also elements. This is typically a good property in the mass spectrography application where all subgraphs of a molecular graph are potentially observed fragments. It allows for focusing on the part of the search space with the highest likelihood according to a prior model of the fragmentation probabilities.
We investigate further possibilities to constrain or direct the search. One can show that it is possible to efficiently list molecules satisfying constraints such as a given total mass, a given total number of atoms of the different types, and a given set of possible degrees of nodes (valencies of atom types). However, listing problems that combine such constraints with the requirement that solutions should contain (or not contain) certain subgraphs (or subgraphs of certain weights) may not be solvable in output-polynomial time.
In ongoing work, we also consider the easier case of proteins, which have a linear structure (even though post-translational modifications can generate significant side chains). We'd also like to get a deeper understanding of the fragmentation probabilities as this would allow a better prioritizing in our search methods.