Algorithms for Analyzing Biological Sequences (Algoritmen voor de analyse van biologische sequenties)

De Paula Costa, Eduardo; Blockeel, Hendrik; Ramon, Jan

Author:

De Paula Costa, Eduardo

Blockeel, Hendrik ; Ramon, Jan

Keywords:

machine learning, bioinformatics, decision tree, biological sequences, clustering, protein subfamilies, peptides, mass spectrometry, prediction uncertainty, phylogeny, top-down clustering

Abstract:

In 1995, scientists achieved the first complete genetic map of a free living organism, the bacterium Haemophilus influenzae. Since then, a huge amount of genomic and proteomic data has been generated due to the developments in biological and computer sciences. This availability of data has led to a new challenge, namely, the analysis and interpretation of this data. These involve many tasks such as: the identification of genes, regulatory regions, and repetitive elements; phylogenetic analysis; identification of proteins and peptides expressedin an organism, cell or tissue; and protein function prediction.In this thesis, we investigate computational methods for analyzing nucleotide and amino acid sequences. We do it in the context of three biological problems: phylogenetic tree reconstruction, protein subfamily identification, and peptide identification using mass spectrometry data.In phylogenetic tree reconstruction, one is interested in inferring the most likely tree topology that explains the evolutionary history of genes and organisms. We propose a method that, given a set of nucleotide or amino acid sequences, builds a phylogenetic tree in a top-down way. Our method is based on a conceptual clustering method that extends the well-known decision tree learning approach. We start from a single cluster containing all sequences and repeatedly divide it into subclusters until all sequences form a different cluster. We assume that a split can be described by referring to particular polymorphic positions of the multiple sequence alignment and propose a heuristic to choose the best split at each iteration. This allows us to identify important mutations that might have given rise to different evolutionary lineages. The trees generated by our method are therefore more informative than the ones generated by standard methods. Moreover, we show that our method is comparable to standard ones in terms of accuracy of the produced phylogenetic tree.In protein subfamily identification, given a set of sequences belonging to a protein family, the goal is to identify subgroups of functionally closely related proteins. This can be done by using evolutionary information. We propose a method that first uses the phylogenetic tree reconstruction method described in the previous paragraph, and then cuts the tree to extract clusters (subfamilies) of sequences. We show that the proposed method yields good results with advantages over those produced by a state-of-the-art method. More specifically, it identifies subfamily-specific positions that might be important for the function(s) associated to the subfamily, and it allows easy classification of new sequences into one of the identified subfamilies.Finally, we consider the problem of inferring the peptide identification of mass spectrometry data. In mass spectrometry, an unknown peptide undergoes fragmentation, and its fragment masses are registered in a fragmentation spectrum. Then, computational methods infer the peptide sequence from its spectrum. We proposea method that does this by searching for the peptide identification in the six-frame translation of the genome. Differently from other methods that allow a similar search, it does not use a filtering procedure to limit the computation of the scoring function to a small subset of sequences that are likely to obtain a high score. Instead, it performs an exhaustive scan of the genome translation. We show that our strategy can identify more peptides than a non-exhaustive search.As an additional contribution of this thesis, we consider a problem purely computational, namely, the problem of estimating the certainty of a prediction given by a decision tree. We propose a method that uses a transductive procedure to tackle this task. We show that our method improves the certainty estimates given by standard decision trees, and is especially suitable for ranking estimation. The ideas we use in our method can be easily generalized to other machine learning techniques, as well as combined with existing methods for prediction certainty estimation.