Deepening the Methodology behind Data Integration and Dimensionality Reduction: Applications in Life Sciences

Thomas, Minta; De Moor, Bart

Author:

Thomas, Minta

De Moor, Bart

Keywords:

SISTA

Abstract:

High dimensionality and heterogeneity of data pose persistent challenges in computational biology and chemistry. As data sets grow in size and complexity, dimensionality reduction and advanced analytics will gain in importance. Over the past 10 years or so, data integration has become an active area of research in machine learning, bioinformatics and chemoinformatics. Several dimensionality reduction and data integration methods are currently available for analyzing and classifying biological data. In the first part of this thesis, we concentrate on dimensionality reduction techniques such as the generalized eigenvalue decomposition (GEVD) and robust principal component analysis (RPCA). We investigate the GEVD in a maximum likelihood setting, employing a technique that relies on a generalization of the singular value decomposition (SVD). We elaborate on the similarity between maximum likelihood estimation via a generalized eigenvalue decomposition (MLGEVD) and generalized ridge regression. This relationship reveals an important mathematical property of the GEVD: one of the matrices acts as prior information in the model development. We then present GEVD for the integration of microarray and literature information. Next, RPCA is applied to a weighted matrix to identify differentially expressed genes in colon cancer. In the second part of the thesis, we propose a data-driven bandwidth selection criterion for kernel PCA (KPCA), a non-linear dimensionality reduction technique. We center our discussion on feature selection/transformation techniques in medical diagnostics and show how to build stable, robust and interpretable classifiers on non-linear data. In the third part of the thesis, we investigate a machine learning approach, a weighted least squares support vector machine (LS-SVM) classifier, to integrate two data sources.
This algorithm offers a single mathematical framework for data integration and classification problems, and hence provides solutions for many real bioinformatics applications. Finally, based on PCA, we define new chemical descriptors from the connection table of chemical compounds. In addition, we develop a new machine learning approach for the identification of biofilm inhibitors of Salmonella Typhimurium and Pseudomonas aeruginosa. Here, PCA converts the connection table of each compound into a structural descriptor of two vectors: one corresponding to atoms and the other to bonds. As a supervised classification algorithm, a weighted least squares support vector machine is used, in which a table enumerating the atoms is weighted against a table enumerating the bonds. We apply this framework to an experimental data set on the activity of a collection of compounds against Salmonella and Pseudomonas biofilms. The trained model predicts the activity of new compounds on these biofilms.
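
The idea of weighting two data sources against each other inside one LS-SVM classifier can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the formulation used in the thesis: it assumes the weighting takes the form of a convex combination of two kernel matrices, mu * K1 + (1 - mu) * K2, and it uses the standard LS-SVM classifier linear system. The function names, the RBF kernel, the parameters mu, gamma and sigma, and the synthetic two-view data are all hypothetical choices made for the example.

```python
import numpy as np

def rbf_kernel(X, Z, sigma=1.0):
    # Gaussian (RBF) kernel matrix between the rows of X and the rows of Z.
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_weighted_lssvm(K1, K2, y, mu=0.5, gamma=10.0):
    # Assumption: the two sources are combined as a convex kernel combination.
    K = mu * K1 + (1.0 - mu) * K2
    n = len(y)
    # Standard LS-SVM classifier system:
    # [ 0   y^T           ] [ b     ]   [ 0 ]
    # [ y   Omega + I / g ] [ alpha ] = [ 1 ]
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = np.outer(y, y) * K + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], np.ones(n)))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]  # bias b, dual variables alpha

def predict_weighted_lssvm(K1_new, K2_new, y, b, alpha, mu=0.5):
    # Kernel matrices have shape (n_new, n_train); y holds the training labels.
    K_new = mu * K1_new + (1.0 - mu) * K2_new
    return np.sign(K_new @ (alpha * y) + b)

# Synthetic stand-in for two data sources describing the same 40 samples.
rng = np.random.default_rng(0)
y = np.array([1.0] * 20 + [-1.0] * 20)
X_a = rng.normal(size=(40, 3)) + 2.0 * y[:, None]  # hypothetical source 1
X_b = rng.normal(size=(40, 2)) + 2.0 * y[:, None]  # hypothetical source 2

K_a = rbf_kernel(X_a, X_a)
K_b = rbf_kernel(X_b, X_b)
b, alpha = fit_weighted_lssvm(K_a, K_b, y, mu=0.5)
pred = predict_weighted_lssvm(K_a, K_b, y, b, alpha, mu=0.5)
train_accuracy = float((pred == y).mean())
```

In such a sketch, the weight mu would typically be tuned by cross-validation together with the regularization and kernel parameters, so that the more informative data source receives the larger weight.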