Learning from Multi-View Data: Clustering Algorithm and Text Mining Application (Leren van multi-view gegevens: clustering algoritme en text mining toepassing)

Publication date: 2011-09-15


Liu, Xinhai
De Moor, Bart




The dissertation is organized into three parts.

In the first part, we analyze multi-view clustering from a multilinear perspective and develop several novel multi-view clustering algorithms. First, modeling multi-view data as a tensor, we present a novel tensor-based multi-view partitioning framework for integrating multi-view data in the context of spectral clustering. Within this framework, tensor methods reveal a joint optimal subspace shared by the multi-view data as well as the multilinear relationships among the views. Second, treating multi-view data as multiple graphs, we put forward a multi-view clustering strategy based on simultaneous trace maximization (STM), which also analyzes multi-view data from a multilinear perspective. Third, we present a joint dimension reduction scheme based on tensor decomposition, designed specifically for multi-view data. The dimension reduction scheme is embedded into the STM-based multi-view clustering strategy, which enables us to handle large-scale multi-view data.

In the second part, we investigate text mining to extract heterogeneous multi-view data from a large-scale publication database, the Web of Science (WoS). To facilitate scientific mapping, which is useful for monitoring and detecting new trends in different scientific fields, hybrid clustering, either in vector spaces or in graph spaces, is carried out to integrate these multi-view data. Regarding hybrid clustering in vector spaces, various methodologies are included in a unified framework, which consists of two general approaches: clustering ensemble and kernel fusion. A mutual-information-based weighting scheme is proposed to leverage the effect of multiple data sources in hybrid clustering. Concerning hybrid clustering in graph spaces, various graphs are generated from multi-view data. Utilizing the complementary properties of the text graph and the citation graph, we present a hybrid strategy named graph coupling. Based on modularity optimization, our graph coupling strategy detects the number of clusters automatically and provides a top-down hierarchical analysis, which suits practical applications. In addition, this modularity-based hybrid clustering method is computationally efficient enough to partition large-scale data well.

In the third part, we propose a novel strategy to derive knowledge from textual information from a multi-view perspective. The multiple views can be different controlled vocabularies, term weighting schemes, publishing time periods and biomedical subjects. Our strategy has been applied to the MEDLINE corpus and analyzed using a disease-based data set. In particular, we investigate the effect of combining multiple views for clustering and assess whether vertical searches can be more accurate for specific biological questions. Moreover, a Web application of our multi-view text mining strategy is developed for gene retrieval.
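
As a rough illustration of the modularity criterion that the graph coupling strategy in the second part builds on, the following minimal sketch scores a given partition of a weighted graph with the standard Newman modularity. The `couple` helper and its `alpha` weight are hypothetical: they assume the text graph and the citation graph are merged by a simple convex combination of their adjacency matrices, which is an illustrative assumption and not necessarily the coupling used in the dissertation.

```python
def couple(text_adj, cite_adj, alpha=0.5):
    """Hypothetical coupling (assumption, not the dissertation's method):
    convex combination of two symmetric weighted adjacency matrices."""
    n = len(text_adj)
    return [
        [alpha * text_adj[i][j] + (1.0 - alpha) * cite_adj[i][j] for j in range(n)]
        for i in range(n)
    ]


def modularity(adj, labels):
    """Standard Newman modularity of a partition of an undirected weighted graph.

    adj    : symmetric adjacency matrix given as a list of lists
    labels : labels[i] is the cluster assigned to node i
    """
    n = len(adj)
    degree = [sum(row) for row in adj]  # weighted degree of each node
    two_m = sum(degree)                 # twice the total edge weight
    if two_m == 0:
        return 0.0
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                # observed weight minus the weight expected under a random null model
                q += adj[i][j] - degree[i] * degree[j] / two_m
    return q / two_m
```

A modularity-optimizing method searches over partitions for a high score, so the number of clusters follows from the best partition found rather than being fixed in advance. The sketch above only evaluates a partition; it does not perform that search.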