Research on Telecom Customer Churn Prediction Based on GA-XGBoost and SHAP
1. Introduction
With the rapid development of information technology and mobile networks, competition in the telecommunication industry is becoming increasingly severe. The cost of acquiring a new subscriber is known to be 5 to 6 times higher than the cost of retaining an existing customer [1], and subscribers' choices determine the development of the company. The ability to predict churn and effectively reduce it has therefore become an important concern for the telecommunication industry. Telecommunication companies need a model that can accurately predict which customers are likely to churn within a given future period, so that appropriate retention measures can be taken.
In recent years, scholars in China and abroad have applied data mining techniques and classification algorithms to build telecom customer churn prediction models, which is of great practical significance for helping telecommunication companies identify and retain valuable customers. In China, Qian et al. [2] improved the support vector machine by introducing a cost-sensitive function and using different penalty coefficients in the objective function, so that different misclassification costs are incorporated into the modeling process for telecom customer churn prediction. The work in [3] introduced kernel principal component analysis (KPCA) to customer churn prediction and proposed a corresponding feature extraction algorithm. The work in [4] presented a pruned random forest approach and proposed an effective variance estimate based on the random forest similarity matrix, since the degree of variance is an important factor affecting the stability of the combined classifier. Wang [5] took a real dataset from a provincial telecom operator as the research object, balanced the data with the AdaCost algorithm, selected features with the Relief filtering method, and used a PSO-BP neural network for customer churn prediction, demonstrating the feasibility of the algorithm for operators' churn management work.
Abroad, Veronikha Effendy et al. [6] used combined sampling and a weighted random forest to deal with imbalanced data in customer churn prediction. Ammar A.Q. Ahmed [7] used a hybrid firefly algorithm for churn prediction on large telecommunication datasets, using the firefly algorithm to identify optimal solutions and combining it with simulated annealing for optimization. Awang et al. [8] proposed a regression-based churn prediction model that identifies customer churn through multiple regression analysis of customer feature data and provides good performance. Sara Tavassoli et al. [9] proposed three hybrid ensemble classifiers based on bagging and boosting; the proposed methods can be applied not only to customer churn prediction but also to other binary classification problems.
Although the above studies have contributed to telecommunication customer churn prediction, the complexity of the machine learning models means that they offer no reasoned explanation of their predictions. Conventional feature importance can only indicate how important a feature is for customer churn; it cannot explain how the feature affects the prediction results. To address these problems, this paper establishes a telecom customer churn prediction model that integrates SHAP with an improved XGBoost algorithm, in which the hyperparameters of the XGBoost model are tuned by a genetic algorithm. SHAP is used to explain and analyze the factors affecting telecom customer churn and to provide information that telecom companies can use when formulating retention policies.
2. Construction of Telecom Customer Churn Prediction Model
2.1. Overall Process of Customer Churn Prediction Model Construction
The overall flowchart of telecom customer churn prediction model construction is shown in Figure 1. This paper uses a public customer churn dataset of a telecom company from the Kaggle website. First, abnormal values and missing values in the dataset are analyzed and processed, features are extracted, and the ADASYN algorithm is used to handle the data imbalance problem. Next, telecom customer churn prediction models are constructed with XGBoost, decision tree, K-nearest neighbor, GBDT, LightGBM, and extremely randomized trees (ExtraTrees), and the optimal model is selected by comparing the corresponding evaluation indexes. A genetic algorithm is then used to optimize the hyperparameters of the optimal model to obtain the final model. Finally, the SHAP framework is used to interpret and analyze this customer churn prediction model.
2.2. XGBoost Algorithm
XGBoost (Extreme Gradient Boosting) is an algorithm based on GBDT, which was proposed by Chen and Guestrin in 2016 [10]. It improves on GBDT in several respects: it expands the loss function with a second-order Taylor expansion to improve computational accuracy, uses regularization terms to simplify the model and avoid overfitting, and uses a block storage structure for parallel computation [11]. The XGBoost model is shown in Equation (1).
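$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F \tag{1}$$
where $\hat{y}_i$ denotes the predicted value of the XGBoost output, $x_i$ denotes the i-th input sample, $f_k$ denotes the k-th classification and regression tree, with k taking values in the range [1, K], and F denotes the space of classification and regression trees.
Figure 1. General flow chart of telecom customer churn prediction model.
The XGBoost objective function is composed of a loss function and a regularization term. The specific objective function is shown in Equation (2), where $l(y_i, \hat{y}_i)$ is the loss function of the predicted value and the true value; $\Omega(f_k)$ is the penalty on the model complexity, i.e., the regularization term of the objective function, and the specific regularization term is shown in Equation (3); $\gamma$ and $\lambda$ denote the coefficients of the regularization term, T denotes the number of leaf nodes of the k-th tree, and $w_j$ denotes the weight of the j-th leaf node.
$$Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k) \tag{2}$$
$$\Omega(f_k) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 \tag{3}$$
When the t-th tree $f_t$ is added, the objective to be minimized, consisting of the loss function and the regularization term, is shown in Equation (4), where $\hat{y}_i^{(t-1)}$ is the prediction of the first t − 1 trees.
$$Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) \tag{4}$$
The objective function is expanded to second order with Taylor's formula and simplified as shown in Equation (5), where $g_i$ is the first-order derivative of $l(y_i, \hat{y}_i^{(t-1)})$ with respect to $\hat{y}_i^{(t-1)}$, $h_i$ is the corresponding second-order derivative, $I_j$ is the set of samples assigned to leaf j, $G_j = \sum_{i \in I_j} g_i$, and $H_j = \sum_{i \in I_j} h_i$. Terms that do not depend on $f_t$ are dropped.
$$Obj^{(t)} \approx \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2} (H_j + \lambda) w_j^2 \right] + \gamma T \tag{5}$$
Taking the derivative with respect to $w_j$ and setting it to zero, the $w_j^*$ that minimizes the objective function is obtained as shown in Equation (6).
$$w_j^* = -\frac{G_j}{H_j + \lambda} \tag{6}$$
$Obj^*$ is the minimum value of the objective function; the smaller the value, the better the tree model. The corresponding minimum value of the objective function is shown in Equation (7).
$$Obj^* = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T \tag{7}$$
The score of a split node of the tree model is calculated as shown in Equation (8), where the subscripts L and R denote the left and right child nodes produced by the split.
$$Gain = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma \tag{8}$$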
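The leaf weight, leaf objective, and split score of Equations (6) to (8) can be illustrated with a minimal Python sketch. It assumes the logistic loss, for which the first- and second-order derivatives are p − y and p(1 − p); G and H are the sums of these derivatives over the samples in a node, and lam and gamma stand for the regularization coefficients λ and γ. The function names and the four-sample toy data are illustrative assumptions, not part of the paper's implementation.

```python
import math

def logistic_grad_hess(y_true, raw_score):
    """First- and second-order derivatives of the logistic loss
    with respect to the raw (pre-sigmoid) score."""
    p = 1.0 / (1.0 + math.exp(-raw_score))
    return p - y_true, p * (1.0 - p)

def leaf_weight(G, H, lam):
    """Optimal leaf weight, Equation (6)."""
    return -G / (H + lam)

def leaf_objective(G, H, lam):
    """Contribution of one leaf to Equation (7), without the gamma term."""
    return -0.5 * G * G / (H + lam)

def split_gain(GL, HL, GR, HR, lam, gamma):
    """Score of a candidate split, Equation (8)."""
    return 0.5 * (GL ** 2 / (HL + lam)
                  + GR ** 2 / (HR + lam)
                  - (GL + GR) ** 2 / (HL + HR + lam)) - gamma

# Toy data (assumed): two churners and two non-churners, all starting at raw score 0.
labels = [1, 1, 0, 0]
stats = [logistic_grad_hess(y, 0.0) for y in labels]

# Candidate split: first two samples go left, last two go right.
GL = sum(g for g, _ in stats[:2]); HL = sum(h for _, h in stats[:2])
GR = sum(g for g, _ in stats[2:]); HR = sum(h for _, h in stats[2:])

lam, gamma = 1.0, 0.0
print("left weight :", leaf_weight(GL, HL, lam))
print("right weight:", leaf_weight(GR, HR, lam))
print("split gain  :", split_gain(GL, HL, GR, HR, lam, gamma))
```

A positive gain means the split lowers the regularized objective; the gain equals the objective of the unsplit node (one leaf) minus the objective of the two child leaves, which is why γ appears once with a negative sign.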
2.3. Genetic Algorithm
Genetic Algorithm (GA) is a stochastic global optimization and search method that takes an objective function and constraints as input and returns an optimal solution, drawing on Darwin's theory of biological evolution and Mendel's genetic mechanism [12]. The genetic algorithm simulates the replication, crossover, and mutation that occur in natural selection and heredity. Starting from an arbitrary initial population, it applies random selection, crossover, and mutation operations to produce a group of individuals better suited to the environment, moving the population toward better regions of the search space. This process is repeated from one generation to the next until the population converges to a group of individuals best adapted to the environment, which yields the optimal solution of the problem [13]. As a relatively mature algorithm, the genetic algorithm can be applied in fields such as function optimization, combinatorial optimization, and shop floor scheduling.
2.4. GA-XGBoost Algorithm
The XGBoost model has many parameters, converges slowly, and its prediction results are strongly influenced by parameter settings, while the traditional grid search for hyperparameter tuning suffers from low accuracy and long running time. Therefore, this paper proposes a GA-XGBoost model that combines the genetic algorithm with the XGBoost model. It uses the global search capability of the genetic algorithm to select the hyperparameters of XGBoost, with AUC as the fitness function.
The optimal hyperparameter combination of GA-XGBoost is the best chromosome output by the genetic algorithm when the number of iterations meets the termination requirement. For the telecom customer churn prediction model, three parameters, n_estimators, learning_rate, and max_depth, are encoded, initialized, and optimized by the genetic algorithm. In each iteration, the parent population is replicated, crossed, and mutated to obtain a new generation; the fitness values of the offspring population are then calculated, and the worst individuals are replaced by the best individuals. The procedure repeats until the termination requirement is met, at which point the chromosome with the highest fitness gives the optimal parameters.
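The search loop described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the paper's implementation: the search ranges, population size, tournament selection, uniform crossover, and mutation rate are all assumed, and the fitness function is a toy surrogate. In the paper's setting, the fitness of a chromosome would be the validation AUC of an XGBoost model trained with the encoded n_estimators, learning_rate, and max_depth.

```python
import random

# Search ranges are illustrative assumptions, not the paper's settings.
BOUNDS = {
    "n_estimators": (10, 200),      # integer gene
    "learning_rate": (0.01, 0.30),  # float gene
    "max_depth": (2, 10),           # integer gene
}
INT_PARAMS = {"n_estimators", "max_depth"}

def random_gene(name, rng):
    lo, hi = BOUNDS[name]
    return rng.randint(lo, hi) if name in INT_PARAMS else rng.uniform(lo, hi)

def random_individual(rng):
    return {name: random_gene(name, rng) for name in BOUNDS}

def tournament(population, scores, rng, k=3):
    """Selection: return the fittest of k randomly drawn individuals."""
    picks = rng.sample(range(len(population)), k)
    return population[max(picks, key=lambda i: scores[i])]

def crossover(a, b, rng):
    """Uniform crossover: each gene is copied from one of the two parents."""
    return {name: (a[name] if rng.random() < 0.5 else b[name]) for name in BOUNDS}

def mutate(ind, rng, rate=0.2):
    """Mutation: resample each gene with probability `rate`."""
    return {name: (random_gene(name, rng) if rng.random() < rate else value)
            for name, value in ind.items()}

def genetic_search(fitness, pop_size=20, generations=30, n_elite=2, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]
    history = []  # best fitness seen in each generation
    for _ in range(generations):
        scores = [fitness(ind) for ind in population]
        ranked = sorted(zip(scores, range(pop_size)), reverse=True)
        history.append(ranked[0][0])
        # Elitism: the best individuals take the places of the worst offspring.
        elite = [population[i] for _, i in ranked[:n_elite]]
        children = [
            mutate(crossover(tournament(population, scores, rng),
                             tournament(population, scores, rng), rng), rng)
            for _ in range(pop_size - n_elite)
        ]
        population = elite + children
    scores = [fitness(ind) for ind in population]
    best = max(range(pop_size), key=lambda i: scores[i])
    history.append(scores[best])
    return population[best], scores[best], history

def toy_fitness(ind):
    """Hypothetical stand-in for validation AUC: peaks at an arbitrary point
    so the demo has a known answer. Replace with a real model evaluation."""
    return -(abs(ind["n_estimators"] - 55) / 100
             + abs(ind["learning_rate"] - 0.18)
             + abs(ind["max_depth"] - 4) / 10)

best, best_score, history = genetic_search(toy_fitness)
print(best, round(best_score, 4))
```

Because the elite individuals are carried over unchanged, the best fitness per generation never decreases for a deterministic fitness function, which matches the shape one would expect from an iteration curve of the tuning process.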
2.5. Interpretability of the Integrated SHAP Model
SHAP (SHapley Additive exPlanations) is an explanatory framework proposed by Lundberg that is built on the Shapley value from game theory [14]. For each prediction sample, the model produces a prediction value, and SHAP calculates the Shapley value of each feature to reflect the contribution of that feature to the model prediction. SHAP interprets the prediction value of the model as the sum of the Shapley values of the input features, which capture the magnitude and direction of the effect of each individual feature over the full sample, thus explaining the results of the model [15].
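Suppose $x_i$ is the i-th sample, the j-th feature of $x_i$ is $x_{ij}$, $y_i$ is the predicted value of the model for this sample, $y_{base}$ is the baseline value of SHAP, i.e., the mean value of the target variable of all samples, and $f(x_{ij})$ is the SHAP value corresponding to $x_{ij}$. Then the SHAP values satisfy Equation (9), where k is the number of features:
$$y_i = y_{base} + f(x_{i1}) + f(x_{i2}) + \cdots + f(x_{ik}) \tag{9}$$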
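The additive relation in Equation (9), in which the prediction equals the baseline plus the sum of the per-feature SHAP values, can be checked numerically with a small brute-force sketch. The code below computes exact Shapley values by enumerating all feature coalitions for a hypothetical three-feature scoring function. The model, the sample, and the background rows are assumptions made for illustration, and the baseline is taken here as the mean model output over the background rows; practical SHAP implementations for tree models use much faster algorithms than this enumeration.

```python
from itertools import combinations
from math import factorial

def coalition_value(model, x, background, subset):
    """Mean model output when the features in `subset` take their values
    from x and all other features take their values from background rows."""
    total = 0.0
    for row in background:
        mixed = [x[j] if j in subset else row[j] for j in range(len(x))]
        total += model(mixed)
    return total / len(background)

def shapley_values(model, x, background):
    """Exact Shapley value of every feature of x (exponential cost)."""
    n = len(x)
    phi = [0.0] * n
    for j in range(n):
        others = [k for k in range(n) if k != j]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                with_j = coalition_value(model, x, background, s | {j})
                without_j = coalition_value(model, x, background, s)
                phi[j] += weight * (with_j - without_j)
    return phi

def toy_model(v):
    """Hypothetical churn score with one interaction term (assumed)."""
    return 0.8 * v[0] - 0.5 * v[1] + 0.3 * v[0] * v[2]

background = [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0], [2.0, 0.0, 1.0], [1.0, 1.0, 2.0]]
sample = [2.0, 1.0, 3.0]

phi = shapley_values(toy_model, sample, background)
base = sum(toy_model(row) for row in background) / len(background)
print("baseline + sum of SHAP values:", base + sum(phi))
print("model prediction             :", toy_model(sample))
```

The sign of each value indicates whether the feature pushes the prediction above or below the baseline, which is the property used later when reading the SHAP plots.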
The interpretability of machine learning algorithms is currently a hot topic in artificial intelligence research. Because of the high complexity of ensemble learning models, the GA-XGBoost model is a black-box model with poor interpretability. To address this problem, the SHAP framework is introduced to interpret the results reliably. The SHAP framework has powerful visualization functions for displaying the interpreted results of model predictions and is widely applied to interpret complex classification and regression models [16]. In addition, traditional feature importance ranking can only determine how important a feature is and does not explain how the feature affects the prediction results, whereas the greatest advantage of the SHAP value is that it quantifies the degree of influence of each feature and also shows whether that influence is positive or negative.
3. Experimental Analysis
3.1. Experimental Data and Its Preprocessing
The experimental data in this paper come from the dataset of a telecom company on the Kaggle platform. The training set contains a total of 4025 customer samples, each with 20 attributes covering several dimensions of information related to customer churn, together with a label indicating whether the customer finally churned. The basic information of the dataset is shown in Table 1. The analysis reveals that the dataset is seriously imbalanced: there are 589 samples of customers who have churned and 3652 samples of customers who have not, as shown in Figure 2.
Continuous features in the dataset are standardized by removing the mean and scaling to unit variance. For discrete features, one-hot encoding is used when the categories have no ordinal relationship, and numerical mapping is used when the categories are ordered.
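The three preprocessing operations described above (standardization, one-hot encoding, and numerical mapping) can be sketched in plain Python. The column contents below, including the `region` and `plan_tier` examples, are hypothetical and serve only to show the transformations; they are not taken from the dataset.

```python
from statistics import mean, pstdev

def standardize(values):
    """Remove the mean and scale to unit variance (population std)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column carries no information
    return [(v - mu) / sigma for v in values]

def one_hot(values):
    """One-hot encode an unordered categorical column.
    Returns the category order and one indicator row per sample."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

def ordinal_map(values, order):
    """Map ordered categories to integers that follow the given order."""
    mapping = {level: rank for rank, level in enumerate(order)}
    return [mapping[v] for v in values]

# Hypothetical columns for illustration only.
day_minutes = [120.5, 265.1, 180.0, 210.4]
region = ["north", "south", "north", "east"]          # no ordinal relationship
plan_tier = ["basic", "premium", "standard", "basic"]  # ordered levels

print(standardize(day_minutes))
print(one_hot(region))
print(ordinal_map(plan_tier, ["basic", "standard", "premium"]))
```

In practice the mean and standard deviation should be estimated on the training split only and then reused for the test split, so that no information leaks from test data into the model.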
Table 1. Basic information of the dataset.
Figure 2. Distribution diagram of customer churn labels.
Because the data are imbalanced, this experiment uses the ADASYN method for oversampling. ADASYN is a modified version of SMOTE that assigns different weights to different minority-class samples, so that a different number of synthetic samples is generated for each of them. The weight of a minority sample depends on the share of majority-class samples among its nearest neighbors, and an adjustable hyperparameter d acts as a threshold on the tolerated degree of imbalance. By generating synthetic samples, the method improves the distribution of the data and reduces the bias caused by class imbalance, and it adaptively shifts the classification boundary toward instances that are difficult to learn.
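A simplified sketch of the ADASYN idea is given below in plain Python, assuming Euclidean distance and small in-memory lists. The default values of k, beta, and d_th and the toy points are illustrative assumptions; the code is not the implementation used in the experiments, for which a library version would normally be preferred.

```python
import math
import random

def nearest(point, candidates, k):
    """Indices of the k candidates closest to `point` (Euclidean distance)."""
    order = sorted(range(len(candidates)),
                   key=lambda i: math.dist(point, candidates[i]))
    return order[:k]

def adasyn(minority, majority, k=3, beta=1.0, d_th=0.75, seed=0):
    """Return synthetic minority samples. Assumes at least two minority points."""
    rng = random.Random(seed)
    # Step 1: only oversample when the imbalance ratio is below the threshold.
    if len(minority) / len(majority) >= d_th:
        return []
    # Step 2: total number of synthetic samples to generate.
    n_new = int((len(majority) - len(minority)) * beta)
    everything = minority + majority  # indices < len(minority) are minority
    # Step 3: share of majority samples among the k nearest neighbours.
    ratios = []
    for i, x in enumerate(minority):
        neigh = [j for j in nearest(x, everything, k + 1) if j != i][:k]
        ratios.append(sum(j >= len(minority) for j in neigh) / k)
    total = sum(ratios)
    if total == 0:
        return []  # no minority sample lies near the majority class
    synthetic = []
    for i, x in enumerate(minority):
        # Step 4: harder samples (more majority neighbours) get more copies.
        g_i = round(ratios[i] / total * n_new)
        min_neigh = [j for j in nearest(x, minority, k + 1) if j != i][:k]
        for _ in range(g_i):
            # Step 5: interpolate between x and a random minority neighbour.
            z = minority[rng.choice(min_neigh)]
            lam = rng.random()
            synthetic.append([a + lam * (b - a) for a, b in zip(x, z)])
    return synthetic

# Toy data (assumed): one minority point sits inside the majority region.
minority = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [1.0, 1.0]]
majority = [[1.1, 1.0], [0.9, 1.2], [1.2, 0.8], [2.0, 2.0],
            [2.1, 1.9], [1.9, 2.2], [2.2, 2.1], [3.0, 3.0]]
print(adasyn(minority, majority))
```

Unlike SMOTE, which generates the same number of synthetic points for every minority sample, this weighting concentrates new points around minority samples that are surrounded by the majority class.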
3.2. Evaluation Index
In telecom customer churn prediction, model assessment should not be limited to Accuracy; the selected metrics should also focus on the positive (churn) class. The Accuracy, Recall, Precision, and F1-score metrics are derived from the confusion matrix, as shown in Table 2.
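The evaluation metrics used in the telecom customer churn prediction problem are accuracy, precision, recall, F1-score, and AUC. The formulas for the first four metrics are given in Equations (10) to (13), where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives in the confusion matrix, with churned customers as the positive class.
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \tag{10}$$
$$Precision = \frac{TP}{TP + FP} \tag{11}$$
$$Recall = \frac{TP}{TP + FN} \tag{12}$$
$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \tag{13}$$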
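The confusion-matrix metrics and AUC can be computed with a few lines of Python. The sketch below assumes churn is the positive class and uses made-up counts and scores purely for illustration; the AUC function uses the pairwise ranking definition (the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half).

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

def auc(labels, scores):
    """Pairwise-ranking AUC; O(n^2), adequate for small illustrative inputs."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up counts and scores for illustration only.
print(classification_metrics(tp=80, fp=20, fn=20, tn=880))
print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

On imbalanced data such as churn labels, accuracy can stay high even when few churners are found, which is why recall, precision, F1, and AUC are reported alongside it.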
3.3. Experimental Results and Analysis
This paper selects six classification models, XGBoost, LightGBM, DecisionTree, KNN, GBDT, and ExtraTrees, for comparison, with Accuracy, AUC, Recall, Precision, and F1-score as the evaluation indexes. The ROC comparison results of the classification models are shown in Figure 3.
After the experimental comparison, it was found that XGBoost reached 0.9477 in terms of AUC, which is significantly higher than other models, and therefore is the optimal model. The iterative results of the genetic algorithm for hyperparameter tuning are shown in Figure 4. The final approximate optimal solution is n_estimators = 55, learning_rate = 0.18, max_depth = 4.
The XGBoost model with hyperparameters tuned by the genetic algorithm is defined as GA-XGBoost. The confusion matrix of the GA-XGBoost model is shown in Figure 5, and the ROC comparison between GA-XGBoost and XGBoost is shown in Figure 6, from which it can be seen that GA-XGBoost improves AUC by about 1 percentage point compared with the XGBoost model, with the AUC value remaining above 0.90.
Figure 3. Comparison results of each classification model.
Figure 4. Optimization iteration results of genetic algorithm.
As can be seen from Table 3, the AUC values of KNN and ExtraTrees are relatively low and still need to be improved. The improved GA-XGBoost has an F1 value of 78%, an accuracy of 93.1%, and an AUC value of 95.61%, all of which are higher than those of the six baseline models, and the AUC values are mostly above 0.90. The GA-XGBoost model proposed in this paper therefore performs best among the compared models.
Figure 6. ROC comparison of GA-XGBoost and XGBoost.
Table 3. Comparison results of different models.
4. Explanatory Analysis of the Model Based on SHAP
SHAP feature importance, i.e., the degree to which each feature contributes to the overall predictive power of the model, provides a direct representation of the influence of the feature on the model. This section presents an explanatory analysis of the model's prediction results for telecommunication customer churn based on the SHAP framework [17].
Figure 7 shows the SHAP summary plot of the telecom customer churn features. The horizontal axis is the SHAP value, each row on the vertical axis represents a feature variable, and each point represents a sample; red indicates a large feature value and blue indicates a small feature value. A SHAP value of zero is the middle dividing line. If the points for a feature are blue to the left of the dividing line, purple in the middle, and red to the right, the feature contributes positively to customer churn; if they are red to the left, purple in the middle, and blue to the right, the feature contributes negatively to customer churn.
Figure 7. SHAP feature analysis summary diagram.
Analysis of the SHAP values of the features in Figure 7 shows that voice_mail_plan and total_intl_calls are clearly negatively correlated with churn, that is, the higher the feature value, the lower the likelihood of churn output by the model. The voicemail plan is the most important churn feature identified by SHAP, and the large number of customers subscribed to voicemail plans in the figure suggests that the telecom company could retain more valuable customers by increasing the discount on voicemail plans. A higher number of international calls also reflects the customer's dependence on and trust in the telecom company. In contrast, number_customer_service_calls, total_eve_minutes, and total_intl_minutes show a clear positive correlation, i.e., the higher the value, the higher the likelihood of churn output by the model. The number of calls to customer service reflects the customer's current level of dissatisfaction with the telecom network, and an increase in total_intl_minutes raises the related charges and thus the likelihood of churn. These findings can help telecom companies adjust and improve their sales policies and services.
In the SHAP feature analysis effect plot, base_value is the mean output of the telecom customer churn prediction model. The red range indicates that the feature contributes positively to telecom customer churn, and the blue range indicates that the feature contributes negatively to customer churn. Figure 8 shows the effect plot of SHAP feature analysis for a customer predicted to be a non-churning customer. The plot shows values of 38.0 and 207.7 for the number of voicemail messages and the total night call minutes, respectively, and values of −1.0, −61.0, and −213.7 for the voicemail plan, the number of calls per day, and the total minutes of calls per day, respectively. The model prediction for the first sample in this test set is −3.43, which means that the customer is more likely to be non-churning; the customer is predicted to be a non-churning customer for reasons such as longer total call minutes in a day, more calls in a day, and continued use of the voicemail plan.
Figure 8. Predicted SHAP characteristic analysis effect diagram of non-churn customers.
Figure 9. Total_day_minutes SHAP scatter plot.
In the SHAP scatter plot in Figure 9, total_day_minutes reflects the process of customer churn, showing a non-monotonic "up-down-up" trend: when the total number of call minutes per day is in the range of 150 to 250 minutes, the model output indicates a low churn rate. We believe that these customers are more willing to continue using the telecom company's package because their usage time is relatively stable, while those who use more than 250 minutes a day may switch to a more cost-effective telecom package because of their higher daily usage, i.e., they are more likely to churn.
5. Conclusions
To address the problem of telecom customer churn prediction, this paper proposes a model combining the genetic algorithm and the XGBoost algorithm, and uses the SHAP framework to supplement the model's interpretability. The factors affecting telecom customer churn are analyzed with the proposed method, and the main factors are identified as the length of calls per day, the number of calls per day, and whether the customer subscribes to a voicemail plan. The analysis finds that customers with longer and more frequent daytime calls are easier to retain; this group may depend on telecom services because of their work demands, and telecom companies can offer daytime call packages with appropriate discounts to retain more of these valuable users.
Although the GA-XGBoost algorithm proposed in this paper achieves better results on the evaluation indexes than traditional machine learning algorithms, it still has shortcomings: because the model uses a genetic algorithm for parameter optimization, finding the optimal parameters takes a long time. In future work, a feature selection method will be combined with the model to filter its features and remove redundant ones, in order to reduce the prediction time of the telecom customer churn model.
Acknowledgements
The research was supported by the Key Laboratory of Enterprise Informationization and Internet of Things Measurement and Control Technology in Sichuan Province Universities (Grant No. 2021WYJ04).