Title:  Clusterwise regression with reduction of predictors 
Authors:  Vande Gaer, Eva 
Issue Date:  19-Sep-2012 
Abstract:  In the behavioral sciences, many research questions pertain to the relationship between one or more predictors and a criterion variable. To answer such questions, linear least squares regression (LR) is often applied. However, LR does not always suffice for answering the complex questions some studies raise. More specifically, three kinds of complications might arise. The first complication pertains to data that are hierarchically structured in that the observations are nested in higher-level units. The traditional LR method is not capable of handling such nested data, because it assumes independence among the observations. A second complication pertains to the presence of a large number of moderately to strongly correlated predictor variables. As a consequence, multicollinearity problems might arise, resulting in unstable regression weights. Moreover, because of the large number of predictors, one will often want not only to model the regression relationships but also to grasp the structure underlying the predictor data block. A third complication pertains to heterogeneity in the relationship between the predictors and the criterion, implying that the underlying regression model is not the same for all higher-level units; rather, subgroups are present in the population that differ with regard to the underlying regression weights. Some methods have been developed that address one or several of the complications mentioned above. Dimensional reduction methods reduce the predictors to a few summarizers and regress the criterion on these summarizers by combining a PCA-related model with a regression model. As such, the underlying structure of the predictors is explicitly modeled and multicollinearity problems can be avoided. 
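The reduction-then-regression idea described above can be illustrated with a minimal sketch: extract a few principal component scores from the predictor block and regress the criterion on those scores. This is a generic principal-components-regression illustration, not the dissertation's exact model; the function name and interface are hypothetical.

```python
import numpy as np

def pca_regression(X, y, n_components):
    """Toy sketch: summarize the predictors with a few PCA component
    scores, then regress the criterion on those scores. Hypothetical
    helper for illustration only."""
    Xc = X - X.mean(axis=0)                     # center the predictor block
    # principal axes from the SVD of the centered predictors
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = Xc @ Vt[:n_components].T                # component scores (summarizers)
    # least-squares regression of the centered criterion on the scores;
    # the scores are uncorrelated, so multicollinearity is avoided
    b, *_ = np.linalg.lstsq(T, y - y.mean(), rcond=None)
    return Vt[:n_components], b
```

Because the component scores are mutually orthogonal, the regression weights on the summarizers are stable even when the original predictors are strongly correlated.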
Furthermore, a method called Clusterwise Regression (CR) was proposed that searches in an exploratory way for groups of observations that differ with regard to the underlying regression weights and thus allows the user to model heterogeneity in the regression relationships. Note that a multi-observation extension of CR was proposed for the modeling of two-level data. Although the methods discussed above are useful, at least three challenges remain. First, reducing the predictors by means of a method that is dimensional in nature might be adequate to grasp the underlying structure of many, but not all, data sets. More specifically, in case the data contain predictors that can be considered repeated measurements of each other, dimension reduction methods might not be the most suitable. For such data, the underlying predictor structure is expected to be a partition. Therefore, a method that imposes such a partition structure directly, rather than approximating it, might be more appropriate. Second, with regard to CR, some indications were obtained that this method focuses on differences in means between clusters rather than differences in regression slopes. However, this claim was not thoroughly investigated, nor was it investigated how this tendency affects the estimates of the model parameters and what happens in case of two-level data. Finally, a third remaining challenge lies in the observation that the three complications (hierarchically structured data, many predictors, and heterogeneity in the regression relationships) often occur simultaneously, implying the need for a method that can address all three at the same time. With this dissertation, we aimed at addressing these three challenges. More specifically, three methods were developed (Clustered Covariates Regression, CCovR; Principal Covariates Clusterwise Regression, PCCR; and CLASSIN) and the performance of a fourth one (Clusterwise Regression) was investigated. 
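The exploratory search that CR performs can be sketched as an alternating procedure: refit a separate least-squares regression within each cluster, then reassign each observation to the cluster whose regression predicts it best. The sketch below is a toy illustration of this general idea under assumed defaults (random initialization, fixed iteration cap), not the estimation algorithm studied in the dissertation.

```python
import numpy as np

def clusterwise_regression(X, y, k, n_iter=50, seed=0):
    """Toy sketch of the clusterwise-regression idea: alternate between
    (1) fitting one regression per cluster and (2) reassigning each
    observation to its best-fitting cluster model. Illustration only."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    Xi = np.column_stack([np.ones(n), X])       # add an intercept column
    labels = rng.integers(k, size=n)            # random initial partition
    B = np.zeros((k, Xi.shape[1]))
    for _ in range(n_iter):
        for c in range(k):
            m = labels == c
            if m.sum() > Xi.shape[1]:           # enough points to fit
                B[c], *_ = np.linalg.lstsq(Xi[m], y[m], rcond=None)
        # squared residual of every observation under every cluster model
        resid = (y[:, None] - Xi @ B.T) ** 2
        new = resid.argmin(axis=1)              # reassign to best model
        if np.array_equal(new, labels):         # converged partition
            break
        labels = new
    return labels, B
```

In practice such alternating schemes only reach a local optimum, so multiple random starts are commonly used; this sensitivity to initialization is one reason the behavior of CR merits the systematic investigation described above.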
The CCovR method (Chapter 1) combines a partitioning of the predictors with regression modeling. With regard to the CR method (Chapter 2), its performance and that of its multilevel extension were investigated in an extensive simulation study. Finally, two methods were developed that simultaneously address the need for handling hierarchical data, reducing the predictors, and clustering the higher-level units. PCCR (Chapter 3) is a method for real-valued data that combines a dimensional reduction with multi-observation CR. CLASSIN (Chapter 4) is a Boolean method that is built on the principles behind CR and CCovR. 
Description:  For the simulations we used the infrastructure of the VSC – Flemish Supercomputer Center, funded by the Hercules Foundation and the Flemish Government – department EWI 
Publication status:  published 
KU Leuven publication type:  TH 
Appears in Collections:  Research Group Welfare State and Housing; Methodology of Educational Sciences; Quantitative Psychology and Individual Differences

