Profit driven data mining in massive customer networks: new insights and algorithms.

Publication date: 2012-01-20

Author:

Verbeke, Wouter

Abstract:

Scientific summary

Wouter Verbeke
Promotor: Prof. dr. Bart Baesens
Department of Decision Sciences and Information Management
Katholieke Universiteit Leuven, Belgium

Data mining and knowledge discovery

Data mining, an integral part of knowledge discovery in databases, is the process of automatically discovering useful information in large structured data repositories. Data mining techniques comprise a range of heuristics and modeling techniques from fields such as statistics, artificial intelligence, machine learning, and algorithmic computing.

Classification is one of the main data mining tasks and concerns the estimation of an unknown value of a discrete target variable, or class attribute, of an instance, based on the known data attributes of that instance. Three main requirements applicable to classification models have been identified (Martens, 2008), and together these requirements constitute a holistic view on data mining: (1) predictive power, (2) comprehensibility, and (3) justifiability. Classification models that meet these requirements are called acceptable for implementation.

Predictive power, also referred to as discrimination power or classification performance, concerns the ability of a model to make accurate predictions about the future. Since the predictions made by a classification model are effectively implemented in business settings for decision making, the correctness of the predicted labels usually has a direct impact on the efficiency and efficacy of the decisions a company makes. As such, the predictive power of a classification model is of major importance and has therefore received much attention in the scientific literature.

However, a successful implementation of a data mining model as a decision making tool in a business context usually depends heavily on the comprehensibility and justifiability of the model. Comprehensibility, or interpretability, is required when the model needs to be human-interpretable, so that the user can understand the grounds on which the model classifies an instance. A comprehensible model is also called a white-box model, in contrast to black-box models, which are hardly interpretable, if at all.

A justifiable model is a model that is intuitively correct and in line with domain knowledge. The justifiability of a comprehensible model can be checked by a human user. However, this may not be straightforward for very large models, or for very complex models that include a large number of predictive attributes. It may therefore be preferable to automate this check, both to avoid human error and to improve efficiency. A black-box model, on the other hand, cannot be interpreted by a human user and requires additional processing to verify whether it is in line with domain knowledge. Whenever justifiability is demanded, comprehensibility will therefore typically, though not necessarily, be required as well.

From a theoretical perspective, this dissertation aims to develop data mining techniques that induce acceptable classification models. The holistic view on data mining serves as a general framework that interlinks the chapters of this thesis. Each chapter involves a stand-alone research project and as such can be read separately. The contributions of each project relate straightforwardly to the three requirements formulated above.
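To make the predictive power and comprehensibility requirements above concrete, the following minimal sketch (Python with scikit-learn; the synthetic data set and all parameter values are illustrative assumptions, not material from the dissertation) induces a small white-box decision tree, measures its predictive power on held-out data, and prints its decision logic as human-readable rules.

```python
# Minimal sketch: a white-box classifier whose rules a human can inspect.
# Data set and hyperparameters are illustrative, not from the dissertation.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=1000, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# A shallow tree stays small enough to be read and checked by a human.
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

# Predictive power: measured on held-out data.
print("AUC:", roc_auc_score(y_test, tree.predict_proba(X_test)[:, 1]))

# Comprehensibility: the model prints as human-readable rules, which also
# lets a domain expert judge its justifiability.
print(export_text(tree, feature_names=[f"x{i}" for i in range(5)]))
```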
From an application perspective, this dissertation examines the use of data mining techniques in a business context, more specifically in the field of customer churn prediction in the telecommunication (telco) sector. As such, new insights are generated and new challenges brought to light.

Customer churn prediction in the telecommunication sector

Customer retention receives a growing amount of attention from telco operators as a means to reduce customer churn and control the related costs, and thus to safeguard profitability. Most wireless telco providers already operate a customer churn prediction (CCP) model that indicates the customers with the highest propensity to attrite. This allows efficient customer management and a better allocation of the limited marketing resources for customer retention campaigns. Customer churn prediction models are typically, but not exclusively, applied in contractual settings, such as the postpaid segment of the wireless telco industry. In a contractual setting more information is usually at hand than in a non-contractual setting, such as the prepaid segment, which consists mostly of anonymous customers. Various types of information can be used to predict customer attrition, such as socio-demographic data (e.g., sex, age, or zip code) and call behavior statistics (e.g., the number of international calls, billing information, or the number of calls to the customer helpdesk). Alternatively, social network information extracted from call detail record (CDR) data can be explored to predict churn (Dasgupta, 2008), which is especially interesting if no other information is available.

Applying data mining techniques to predict customer churn presents a number of challenges:

(1) A first challenge concerns the massive size of the customer base of telco operators, which typically contains millions of customers. To process the resulting data cascades, algorithms that are computationally efficient in both design and implementation are required. Although computing power is available in large quantities and at low cost, the size of the data may prohibit the application of computationally inefficient algorithms, as illustrated in Chapter 5 of this dissertation.

(2) A second challenge is posed by the uneven class distribution of churners and non-churners. Retention campaigns are typically executed at least monthly, and the fraction of customers that churns during such a period is much smaller than the fraction that does not. This results in a heavily skewed class distribution, which may cause data mining techniques to experience difficulties in learning powerful classification models (see the oversampling sketch after this list).

(3) Finally, a third challenge relates to the time dimension, which is essential to the process of customer churn. Data mining techniques therefore have to explicitly incorporate the temporal aspect when learning patterns from the data.

These three challenges require specific adjustments to data mining techniques before they can be applied in a customer churn prediction setting, and they are handled in this dissertation.
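As a concrete illustration of the second challenge, the following minimal sketch shows plain random oversampling of the minority (churner) class until both classes are equally frequent. It is a generic illustration, not the sampling procedure used in the dissertation; the class ratio and all names are assumptions.

```python
# Minimal sketch: random oversampling of the minority (churner) class.
# Generic illustration; not the dissertation's exact procedure.
import numpy as np

rng = np.random.default_rng(0)

def random_oversample(X, y, minority_label=1):
    """Duplicate minority-class rows until both classes are equally frequent."""
    minority = np.flatnonzero(y == minority_label)
    majority = np.flatnonzero(y != minority_label)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.concatenate([majority, minority, extra])
    rng.shuffle(idx)
    return X[idx], y[idx]

# Example: roughly 2% churners, as in a typical monthly churn setting.
X = rng.normal(size=(10_000, 4))
y = (rng.random(10_000) < 0.02).astype(int)
X_bal, y_bal = random_oversample(X, y)
print(y.mean(), y_bal.mean())  # ~0.02 before, 0.5 after
```

Whether such rebalancing actually helps is an empirical question; as Chapter 4 of the dissertation concludes, its effect depends on the data set and the classification technique.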
Building comprehensible customer churn prediction models with advanced rule induction techniques

The second chapter of this dissertation provides an overview of the state-of-the-art in customer churn prediction. An extensive review of the most important contributions in the scientific literature indicates that a wide range of data mining techniques have been tested for their ability to predict customer churn. However, the comprehensibility and justifiability aspects of CCP models have mainly been ignored. CCP models should be both accurate and comprehensible to improve the efficiency of customer retention campaigns: accurate in order to select the customers that are most likely to churn, and comprehensible in order to understand why the model assigns these customers a high probability to attrite, and thus to uncover the main drivers of churn. This, in turn, makes it possible to develop effective retention offers. Moreover, as a general requirement, CCP models need to be justifiable in order to be acceptable for implementation.

Therefore, two advanced rule induction techniques, AntMiner+ and ALBA, have been applied to predict churn. Both techniques explicitly seek to induce accurate as well as comprehensible rule sets. The results are benchmarked against C4.5, Ripper, SVM, and logistic regression. It is shown that ALBA, combined with Ripper or C4.5, results in the highest accuracy, while sensitivity is highest for C4.5 and Ripper applied to an oversampled data set. AntMiner+ yields a lower sensitivity, but allows domain knowledge to be included and results in comprehensible rule sets that are much smaller than those induced by C4.5. Ripper also results in small and comprehensible rule sets, but may lead to an unintuitive model that violates domain knowledge. The comprehensibility of a churn prediction model is important since it facilitates the interpretation and practical use of the model for marketing purposes. Comprehensibility also makes it possible to check the concordance of a model with domain knowledge, which is of great importance since the intuitiveness of a model determines whether or not it will be accepted by the end users, and hence whether it will effectively serve its purpose. AntMiner+ allows domain knowledge to be included by imposing monotonicity constraints, leading to intuitively correct models that are still comprehensible and accurate, as indicated by the experimental results.

RULEM: rule learning with monotonicity constraints for ordinal classification

The results of the second chapter illustrate that, in order to be acceptable for implementation, many real-world applications require classification models to be in line with domain knowledge and to satisfy monotone relations between predictor variables and the target class. However, existing techniques to induce monotone classification models either yield poor classification performance or can only handle a binary class variable (e.g., AntMiner+). Therefore, the third chapter develops the novel RULEM algorithm to induce monotone ordinal rule-based classification models. A main asset of the proposed approach is its complementarity with existing rule-based classification techniques. Since monotonicity is guaranteed in a postprocessing step, the RULEM approach can be combined with any rule- or tree-based classification technique. In a first step, the RULEM algorithm checks whether a rule set or decision tree violates the imposed monotonicity constraints. Existing violations are resolved in a second step by inducing a set of additional rules that enforce monotone classification.
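The constraint checked in RULEM's first step can be sketched with the standard notion of a monotonicity violation: if one instance weakly dominates another on all monotone attributes, it may not receive a strictly lower ordinal label. The sketch below (Python; a brute-force check over instance pairs, not the dissertation's algorithm, and omitting the second, rule-repair step) flags such violations in a model's predictions.

```python
# Minimal sketch: detect monotonicity violations in a model's predictions.
# Illustrates the constraint behind RULEM's first step only; the actual
# RULEM algorithm and its rule-repair step are not reproduced here.
import numpy as np

def monotonicity_violations(X, y_pred, increasing_cols):
    """Pairs (i, j) where instance i weakly dominates instance j on the
    monotone attributes but receives a strictly lower ordinal label."""
    Xm = X[:, increasing_cols]
    violations = []
    for i in range(len(X)):
        for j in range(len(X)):
            if i != j and np.all(Xm[i] >= Xm[j]) and y_pred[i] < y_pred[j]:
                violations.append((i, j))
    return violations

X = np.array([[1.0, 5.0], [2.0, 6.0], [3.0, 7.0]])
y_pred = np.array([0, 2, 1])  # ordinal labels predicted by some rule set
print(monotonicity_violations(X, y_pred, increasing_cols=[0, 1]))
# [(2, 1)]: instance 2 dominates instance 1 but gets a lower label.
```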
The algorithm is able to handle non-monotonic noise and can be applied to both partially and totally monotone problems with an ordinal target variable.

Based on the RULEM algorithm, two novel justifiability measures are introduced. The RULEMS and RULEMF measures quantify the extent to which a classification model is in line with domain knowledge expressed in the form of monotonicity constraints. Both measures provide an intuitive indication of the justifiability of a rule set, and can be calculated in a fully automated manner.

An extensive benchmarking experiment has been set up to test the impact of the RULEM approach on the predictive power and comprehensibility of the resulting rule set. The results indicate that the proposed approach preserves the predictive power of the original rule induction techniques while guaranteeing monotone classification, at the cost of a small increase in the size of the rule set. Hence, the RULEM algorithm is shown to yield accurate, comprehensible, and justifiable rule-based classification models. The predictive power of the final rule set depends on the rule induction technique with which RULEM is combined.

New insights into churn prediction in the telecommunication sector: a profit driven data mining approach

CCP models are typically evaluated using statistically based performance measures, such as top decile lift or AUC. However, as shown in Chapter 4 of this thesis, this may lead to suboptimal model selection and, consequently, to a loss in profits. Therefore, the first part of Chapter 4 develops a novel, profit-centric performance measure. Optimizing the fraction of customers with the highest predicted probabilities to attrite that is included in a campaign yields the maximum profit a retention campaign can generate. Since reducing the costs or losses associated with customer churn is the main objective of a CCP model, this chapter advocates the use of the maximum profit (i.e., the maximum reduction in costs or losses) that can be generated by using the output of the model as the measure to evaluate CCP models.
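The idea behind the maximum profit criterion can be sketched as follows: rank customers by predicted churn probability, compute the expected campaign profit for every possible targeted fraction, and report the maximum. The profit function below is a deliberately simplified stand-in, and the retention success rate, retained customer value, and contact cost are illustrative assumptions; the chapter's actual profit measure is richer.

```python
# Minimal sketch of the maximum profit criterion: sweep the targeted
# fraction and keep the profit-maximizing one. The profit function is a
# simplified stand-in; all parameter values are illustrative assumptions.
import numpy as np

def max_profit(scores, is_churner, clv=200.0, gamma=0.3, cost=10.0):
    """gamma: retention success rate; clv: value of a retained customer;
    cost: cost of contacting one customer with a retention offer."""
    order = np.argsort(-scores)              # highest churn scores first
    churners = np.cumsum(is_churner[order])  # true churners reached so far
    n_targeted = np.arange(1, len(scores) + 1)
    profit = gamma * clv * churners - cost * n_targeted
    best = int(np.argmax(profit))
    return profit[best], n_targeted[best] / len(scores)

rng = np.random.default_rng(1)
is_churner = (rng.random(5000) < 0.05).astype(int)
scores = 0.6 * is_churner + rng.random(5000)  # an imperfect model's scores
best_profit, best_fraction = max_profit(scores, is_churner)
print(f"max profit: {best_profit:.0f} at fraction {best_fraction:.2%}")
```

Note that the profit-maximizing fraction generally differs from conventional cutoffs such as the top decile, which is why selecting models on statistical measures alone can be suboptimal.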
In the second part of the fourth chapter a large benchmarking experiment is conducted, in which twenty-one state-of-the-art predictive algorithms are applied to eleven data sets from telco operators worldwide. The benchmarking experiment makes it possible to rigorously analyze and assess the impact of classification technique, oversampling, and input selection on the performance of a CCP model. The results of the experiments are tested using appropriate test statistics and evaluated using both the novel profit-centric measure and statistical performance measures, leading to the following conclusions.

First, applying the maximum profit criterion and including the optimal fraction of customers in a retention campaign leads to substantially different outcomes. Furthermore, the results of the experiments provide strong indications that the use of the maximum profit criterion can have a profound impact on the profits generated by a retention campaign.

Second, the effect of oversampling on the performance of a CCP model strongly depends on the data set and the classification technique that is applied, and can be positive or negative. We therefore recommend adopting an empirical approach, and consistently testing whether oversampling is beneficial.

Third, the choice of classification technique significantly impacts the predictive power of the resulting model. Alternating Decision Trees yielded the best overall performance in the experiments, although a large number of other techniques were not significantly outperformed. Hence, properties of modeling techniques other than predictive power have to be taken into account when selecting a classification technique, such as comprehensibility, justifiability, and operational efficiency. Rule induction techniques, decision tree approaches, and classical statistical techniques such as logistic regression and Naive Bayes or Bayesian networks score well on all three aspects, and result in a powerful yet comprehensible model that is easy to implement and operate. These techniques are therefore recommended for CCP modeling. Comprehensibility or interpretability is an important aspect of a classifier, as it allows the marketing department to extract valuable information from a model in order to design effective retention campaigns and strategies. The comprehensibility of a model, however, also depends on the number of variables it includes: clearly, a model with ten variables is easier to interpret than a model with fifty or more.

This leads to a fourth conclusion: input selection is crucial to achieve good predictive power, and six to eight variables generally suffice to predict churn with high accuracy. Consequently, from an economic point of view it may be more efficient to invest in data quality than in gathering an extensive range of attributes capturing all the available information on a customer. Furthermore, the input selection procedure has shown that usage attributes are the most predictive kind of data. However, socio-demographic data, financial information, and marketing-related attributes are also indispensable sources of information for predicting customer churn. Moreover, marketing-related attributes, such as the handset that the operator provides to a customer, are important sources of actionable information for designing effective retention campaigns.

Finally, this chapter also provides benchmarks against which the industry can compare the performance of its CCP models.

Social network analysis for customer churn prediction

The fifth chapter develops a range of new and adapted relational learning algorithms for customer churn prediction using social network effects, designed to handle the massive size of the call graph, the time dimension, and the skewed class distribution typically present in a customer churn prediction setting. Furthermore, an innovative approach to incorporate non-Markovian network effects within relational classifiers is presented: the weight product, which makes it possible to upgrade the network neighborhood represented by the weight matrix. The weight product can be applied as a complement to existing relational classifiers, and is shown to be, for weighted networks, the operation equivalent to the distance product for distance networks (see the sketch below). Finally, a novel parallel model setup to combine a relational and a non-relational classification model is introduced, which selects from both models a top fraction of the customers with the highest predicted probabilities to churn.
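The min-plus distance product computes, for a distance matrix D, the shortest two-hop distances: min over k of (D[i, k] + D[k, j]). As a loose illustration only, the sketch below assumes a max-times analogue over tie strengths in [0, 1] as the weight product; the precise definition used in the thesis may differ.

```python
# Loose illustration of upgrading the network neighborhood of a weight
# matrix with a "weight product", here assumed to be the max-times
# analogue of the min-plus distance product. The thesis's exact
# definition may differ.
import numpy as np

def weight_product(A, B):
    """Two-hop tie strengths: out[i, j] = max over k of A[i, k] * B[k, j]."""
    return np.max(A[:, :, None] * B[None, :, :], axis=1)

# Tie strengths in [0, 1] between four subscribers (illustrative values).
W = np.array([[0.0, 0.8, 0.0, 0.0],
              [0.8, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.9],
              [0.0, 0.0, 0.9, 0.0]])

W2 = weight_product(W, W)        # second-order (friends-of-friends) ties
W_up = np.maximum(W, W2)         # upgraded network neighborhood
np.fill_diagonal(W_up, 0.0)      # ignore two-hop ties back to oneself
print(np.round(W_up, 2))
```

A relational classifier run on W_up instead of W then also sees friends-of-friends, which is how second-order, non-Markovian network effects enter the model.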
The performance of the newly proposed techniques is tested experimentally, and the profit-driven evaluation methodology presented in Chapter 4 is applied to assess the results of two real-life case studies on large-scale telco data sets, containing both networked (call detail record) and non-networked (customer-related) information about millions of subscribers.

The experiments indicate the existence of a significant impact of social network effects on the churn behavior of telco subscribers. Interestingly, non-Markovian social network effects are also observed: the churn behavior of not only a subscriber's friends but also the friends of those friends has an impact on the subscriber's churn behavior.

Relational and non-relational classifiers, which are built using networked and non-networked information respectively, are found to detect different groups or types of churners, hinting at a great potential of social network analysis for customer churn prediction.

A propositionalization approach, which incorporates social network information within a non-relational model by including network variables, yields the best performing stand-alone CCP model. However, the novel parallel modeling setup, which results in a non-integrated model, outperforms all other approaches to combining a relational and a non-relational model, and generates significant gains in profit compared to integrated models, including the propositionalization approach.

Hence, the main finding of this chapter, with strong consequences for the practice of customer relationship management, is that applying relational classifiers in combination with the current generation of CCP models can generate significant profit gains. This results from the fact that relational and non-relational CCP models detect different types of churners.

Finally, including second-order network effects by upgrading the network neighborhood with the weight product is shown to improve classification power and to further increase the generated profits compared to a model that restricts the impact of the network to first-order effects.

Related publications

Verbeke, W., Martens, D., Mues, C., Baesens, B., 2011. Building comprehensible customer churn prediction models with advanced rule induction techniques. Expert Systems with Applications 38 (3), 2354–2364.
Verbeke, W., Martens, D., Baesens, B., 2011. RULEM: Rule learning with monotonicity constraints for ordinal classification. IEEE Transactions on Knowledge and Data Engineering, under review.
Verbeke, W., Dejaeger, K., Martens, D., Hur, J., Baesens, B., 2011. New insights into churn prediction in the telecommunication sector: a profit driven data mining approach. European Journal of Operational Research, doi:10.1016/j.ejor.2011.09.031.
Verbeke, W., Martens, D., Baesens, B., 2011. Social network analysis for customer churn prediction. Management Science, under review.

References

Martens, D., 2008. Building acceptable classification models for financial engineering applications. Ph.D. thesis, K.U.Leuven, Leuven, Belgium.