Title:  Algorithms for additive overlapping clustering. 
Other Titles:  Algoritmes voor additieve overlappende clustering. 
Authors:  Depril, Dirk; M0123629 
Issue Date:  8May2009 
Abstract:  In statistical practice, one is often interested in finding structures w ithin a given group of I objects. In several cases, such structures may be based on a number of subgroups of the objects under study. To reveal subgroups in a group of objects, one may appeal to some form of cluster analysis. Several clustering models exist depending on the type of data and the type of clustering. In this dissertation, all clustering models imply an overlapping clustering of the objects. Furthermore, all models imply an approximation of the given data and, in the associated data ana lysis, are fitted to a given data set by means of minimizing a least squ ares loss function. In Chapter 1, threeway twomode I x I x J object by object by source da ta blocks are considered consisting of J object by object similarity mat rices of size I x I stemming from J different sources. A well known clus tering model for this type of data is the INDCLUS model which implies an overlapping clustering of the objects, with each cluster having a slice specific positive weight. In the data analysis, the minimization of the least squares loss function has appeared to be a difficult problem for which several algorithmic strategies have been proposed. At present, the best available option seems to be the SYMPRES algorithm. Yet, SYMPRES i s conjectured to suffer from a severe local optima problem. As a way out , we will propose four new ALS algorithms. In a simulation study, one of the new algorithms is shown to outperform SYMPRES in terms of optimizat ion performance, cluster recovery and computational effort. In Chapter 2, twoway twomode I x J object by variable data will be con sidered. The onemode additive clustering model implies an overlapping c lustering of the objects and a set of profiles (characteristic variable values for each cluster). The model values of the variables of an object are the sum of the profiles of its corresponding clusters. In the data analysis, a sequential fitting strategy, also called the method of princ ipal clusters (PCL), has been proposed for the minimization of the least squares loss function. Theoretical and empirical evidence that the PCL algorithm may have problems in revealing the true structure underlying a data set will be presented. As a way out, three new algorithms to fit t he model to empirical data will be presented: two of an alternating leas t squares (ALS) type, orthogonally combined with two different starting strategies, and one based on simulated annealing (SA). In a simulation s tudy it is demonstrated that all three new algorithms outperform the exi sting PCL algorithm. The amount of objects that belong to more than one cluster (the overlap) is further found to have a considerable influence on the algorithmic performance of the ALS algorithms, with low amounts o f overlap requiring a different starting strategy than high ones. As a c onsequence, for the analysis of real data sets in practice, a hybrid app roach will be presented consisting of one of the ALS algorithms initiali zed by means of the two starting strategies under study. The profiles of the onemode additive clustering model are the key to in terpret the clusters. The interpretation may, however, be seriously hamp ered in case the data include a very large number of variables. To deal with this problem, we propose in Chapter 3 a new model that simultaneous ly clusters the objects in overlapping clusters and reduces the variable space; as such, the model implies that the cluster profiles and, hence, the reconstructed data are constrained to lie in a low dimensional spac e. An ALS algorithm to fit the new model to a given data set will be pre sented, along with a simulation study and an illustrative example that m akes use of empirical data. In Chapter 4, we will consider the additive biclustering for twoway two mode object by variable data. This model implies an overlapping cluster ing of both the objects and the variables, together with a core value fo r each bicluster (i.e., a pair of an object and a variable cluster). The value of a model entry then equals the sum of the core values of the bi clusters the entry belongs to. To minimize the least squares loss functi on, two algorithms have been proposed in the literature: PENCLUS and Bai ers ALS approach. However, up to now, these algorithms have not yet bee n systematically evaluated in a simulation study. Moreover, both algorit hms have some builtin limitations, the impact of which on their perform ance is unclear. As a way out, first, a new ALS algorithm will be presen ted. Second, we will report on a simulation study in which the two exist ing and the new algorithm are evaluated. From this study, the new ALS al gorithm appears to outperform the two existing strategies. Little is yet known about the identifiability of the additive clustering models in question, which may hamper their interpretability. Therefore, in Chapter 5, we will first present a necessary condition on the cluste ring structure for the onemode additive clustering model to be identifi ed. This condition is further shown to imply that hierarchies are never identified. Second, we will show that the inclusion of an offset term in onemode additive clustering models implies an unidentifiability that c an only be resolved by adding constraints to the model. In the discussio n, we show that the necessary condition for identifiability can be exten ded to the additive biclustering model, whereas it is not necessary for the identifiability of the ADCLUS/INDCLUS model for twoway onemode sim ilarities and for Mirkins additive boxclustering model. 
Table of Contents:  Introduction 1 New ALS approaches for fitting the INDCLUS model 1.1 Introduction 1.2 The SYMPRES algorithm 1.3 Four new ALS approaches 1.4 Simulation study 1.4.1 Introduction 1.4.2 Design 1.4.3 Results 1.4.4 Discussion 1.5 Concluding remarks References 2 Algorithms for additive clustering of rectangular data tables 2.1 Introduction 2.2 The Onemode Additive Clustering Model 2.2.1 The model 2.2.2 Illustrative Application 2.2.3 Data Analysis: Loss functions 2.3 Principal Cluster Analysis 2.3.1 PCL Algorithm 2.3.2 Some problems with PCL 2.4 Three new algorithms 2.4.1 ALSlf2 2.4.2 ALSlf1 2.4.3 SAlf1 2.5 Simulation Study 2.5.1 Introduction 2.5.2 Data generation 2.5.3 Algorithms 2.5.4 Evaluation Criteria 2.5.5 Results 2.5.6 Discussion 2.6 Concluding remarks References 2.A Pseudocodes 2.A.1 PCL pseudocode 2.A.2 ALSlf2 Pseudocode 2.A.3 ALSlf1 Pseudocode 2.A.4 Simulated Annealing Pseudocode 3 Lowdimensional overlapping clustering 3.1 Introduction 3.2 The Low Dimensional Overlapping Clustering Model 3.3 Estimation 3.3.1 Loss Function 3.3.2 Algorithm 3.4 Simulation Study 3.4.1 Introduction 3.4.2 Design 3.4.3 Evaluation Criteria 3.4.4 Results and Discussion 3.5 Application 3.6 Discussion References 3.A Descriptions of the situations 4 Additive biclustering: A comparison of one new and two existing algorithms 4.1 Introduction 4.2 Additive biclustering model 4.3 Data Analysis 4.3.1 PENCLUS 4.3.2 Clusterwise ALS 4.3.3 A novel algorithm: Full clustering ALS 4.4 Simulation Study 4.4.1 Introduction 4.4.2 Design 4.4.3 Evaluation Criteria 4.4.4 Results 4.4.5 Discussion 4.5 Conclusion References 5 Uniqueness of Additive Clustering Models 5.1 Introduction 5.2 Additive Clustering Models 5.3 Identifiability 5.3.1 General Issues 5.3.2 Offset terms 5.4 Discussion 5.4.1 Mirkin’s Binary Hierarchy model 5.4.2 The ADCLUS and INDCLUS models for similarity data 5.4.3 The box clustering model 5.4.4 The additive biclustering model References 5.A Counterexample ADCLUS 5.B Counterexample additive boxclustering model Discussion Nederlandstalige samenvatting 
Publication status:  published 
KU Leuven publication type:  TH 
Appears in Collections:  Statistics Section Leuven Statistics Research Centre (LStat) Quantitative Psychology and Individual Differences

