Quantitative Analysis in Securities and Futures Company Customer Value Assessment

Abstract

This paper focuses on the process of evaluating the customers of an investment company in China. We identify different types of users through clustering, and evaluate users by using regression to assign each a score. These two aspects provide a standard for the company to form different strategies for different customers, benefiting both the company and its users. The aim of the project is to provide better service to old customers who trade less frequently, and to lower the risk of losing valuable new customers. We perform data cleansing to remove inactive accounts and outliers, and apply a logarithmic transformation to reduce the influence of extreme monetary values. Because of the strong correlations between variables, it is hard to apply algorithms to the original data. Thus, to reduce its dimensionality, we perform factor analysis to create three dimensions that represent users’ information: one relating to monetary value, one to the number of transactions, and one to profit. For clustering, we apply the widely used K-means method. Using the elbow method, customers are clustered into four groups: one with high trading frequency; one with large assets and profits; one with large assets and losses; and a majority group with little money and few trades. We use a regression tree to regress the total contribution on the reduced dimensions. The model reaches 97% accuracy, showing that the monetary aspect of a user matters most to the company. Further discussion uses classification methods to check our clustering result and performs regression on some of the variables composing contribution to reveal more details of each dimension.

Share and Cite:

Huang, J. , Ma, N. , Wang, Y. , Li, J. and Liu, S. (2022) Quantitative Analysis in Securities and Futures Company Customer Value Assessment. Modern Economy, 13, 1471-1487. doi: 10.4236/me.2022.1311079.

1. Introduction

Customers of securities firms vary in their assets, trading preferences, and profit. Firms intend to provide suitable services for different customers to enhance customer satisfaction, as customer satisfaction is positively related to customer loyalty. Additionally, a higher level of customer satisfaction leads to a willingness to pay more and to stay with the business (Xu et al., 2007). Customer value assessment is applied by securities firms to categorize customers and further improve their service.

Customer value is a fundamental concept in the marketplace (Dlouhy et al., 2018). Based on assessments of costs and benefits, customer value is calculated depending on circumstances. Firms build customer value models with typical standards, but the situation varies due to industry characteristics and company strategies (Yamamoto, 2007). After gathering firsthand customer data, the data processing directly influences the valuation result, as the weighting of different factors heavily impacts the analysis. The traditional way of assessing customer value chooses the weights of factors from practice and experience, leaving room for improvement.

This paper is structured as follows: First we describe all the methods being used in detail. Then we apply these methods to our data to generate the results, followed by discussions that evaluate the accuracy of the results. Lastly, we sum up with a conclusion.

2. Literature Review

The value of a company’s customers typically depends on a multitude of factors, with the significance of each varying on a case-by-case basis. Industry professionals typically gauge customer value using experience, as the factors are not closely related to one another. However, this begs the question of how companies with few comparable customers carry out the assessment, or how companies formulate a standardized system for customer valuation. Investigating this has significant implications for industry professionals such as investment banks, who need to know not only the types of customers, but also how to categorize them.

Although securities firms commonly apply industry experience when assessing customer value, machine learning algorithms can be an applicable method for customer categorization. K-means clustering and the SPSS software tool have been used to forecast customer purchasing performance for a supermarket (Kashwan & Velu, 2013), segmenting customers by their behavior. This research applies machine learning algorithms to construct a customer value assessment model for securities firms.

Machine learning algorithms are classified into two main types, supervised and unsupervised, based on their functions. Supervised machine learning algorithms require learning from a training set in order to perform on a test set, and their primary function is to make predictions about output values based on input values; the input data has been classified and labeled. In contrast, unsupervised machine learning algorithms process unclassified and unlabeled data, aiming to discover and define hidden structures or patterns in the data. This research applies both supervised and unsupervised machine learning algorithms, K-means clustering and the regression tree, to construct the customer value assessment model.

3. Methods

Our process of constructing a company customer value assessment model is shown in Figure 1.

3.1. Data Cleansing

Before any step of data processing, data cleansing is often needed. Due to the characteristics of securities and futures company customer data, the following steps are recommended:

First, the customer data of a securities and futures company is likely to contain many samples whose values are all 0 (i.e., customers who held no securities and made no transactions). These samples should be removed at the very beginning; otherwise they will affect our results to a great extent. Excluding outliers is also necessary for the same reason.

Figure 1. Our model process.

Furthermore, a logarithmic transformation is needed on the variables concerning money (for variables that contain both positive and negative values, we take the log of the absolute value plus one and multiply by the sign function $\mathrm{sgn}(x)$) for two reasons. First, securities and futures company customer data is usually heavily skewed, and the samples with extremely large values affect the result greatly. Second, the actual difference between customers whose equity is 10 and 110 significantly outweighs that between customers whose equity is 10,000 and 10,100.
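As an illustration, a minimal sketch of such a signed log transform in Python using NumPy (the function name is ours):

```python
import numpy as np

def signed_log(x):
    """Signed log transform sgn(x) * ln(|x| + 1).

    Compresses extreme monetary values while preserving the sign of
    losses versus gains; log1p keeps zero mapped to zero.
    """
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(np.abs(x))
```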

3.2. Factor Analysis

Because the customer data of most securities and futures companies suffers from high dimensionality and high correlation, a method to reduce the dimension is indispensable before we do the work of clustering and regression.

1) The Orthogonal Factor Model: Factor analysis is a popular modern method of data reduction. Its beginnings lie in the early 20th-century attempts of Karl Pearson, Charles Spearman, and others to define and measure intelligence (Johnson & Wichern, 2002). The main purpose of factor analysis is to explain the covariance relationships among many observable variables in terms of a few underlying, but unobservable, random quantities called factors.

As a model, we consider an observable random vector $X$ with $p$ components, mean $\mu$, and covariance matrix $\Sigma$. The factor model assumes that $X$ is linearly dependent upon $F_1, F_2, \ldots, F_m$, called common factors ($F$ in matrix notation), and $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_p$, called errors or, sometimes, specific factors ($\varepsilon$ in matrix notation).

Written as a matrix equation, the factor model is:

$$X - \mu = LF + \varepsilon \quad (1)$$

where $L$ is the matrix of factor loadings,

with additional assumptions:

$$E(F) = 0, \quad \mathrm{Cov}(F) = E[FF'] = I$$

$$E(\varepsilon) = 0, \quad \mathrm{Cov}(\varepsilon) = E[\varepsilon\varepsilon'] = \Psi = \mathrm{diag}(\psi_1, \psi_2, \ldots, \psi_p)$$

$$\mathrm{Cov}(\varepsilon, F) = E(\varepsilon F') = 0 \quad (2)$$

We define the communalities $h_i^2 = l_{i1}^2 + l_{i2}^2 + \cdots + l_{im}^2$, which indicate the proportion of the variance of $X_i$ that is explained by the common factors.

Two equations are worth mentioning. The first is important in calibrating the model; the second is essential in model interpretation:

$$\mathrm{Cov}(X) = LL' + \Psi \quad \left(\text{or } \mathrm{Var}(X_i) = l_{i1}^2 + \cdots + l_{im}^2 + \psi_i, \; \mathrm{Cov}(X_i, X_k) = l_{i1}l_{k1} + \cdots + l_{im}l_{km}\right) \quad (3)$$

$$\mathrm{Cov}(X, F) = L \quad \left(\text{or } \mathrm{Cov}(X_i, F_j) = l_{ij}\right) \quad (4)$$

2) Model Calibration: We estimate the model in two steps: first, we estimate $L$ and $\Psi$; then, we estimate $F$, the factor scores.

We estimate $L$ and $\Psi$ through the covariance structure $\Sigma = LL' + \Psi$ given in Equation (3). Using the principal component method, let $\Sigma$ have eigenvalue-eigenvector pairs $(\lambda_i, e_i)$ with $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p \geq 0$; then $\Sigma = \lambda_1 e_1 e_1' + \cdots + \lambda_p e_p e_p'$. We estimate the factor loadings, specific variances, and communalities by the following three equations:

$$\tilde{L} = \left[\sqrt{\hat{\lambda}_1}\,\hat{e}_1, \ldots, \sqrt{\hat{\lambda}_m}\,\hat{e}_m\right], \quad \tilde{\psi}_i = s_{ii} - \sum_{j=1}^{m} \tilde{l}_{ij}^{2}, \quad \tilde{h}_i^2 = \tilde{l}_{i1}^2 + \cdots + \tilde{l}_{im}^2 \quad (5)$$

After estimating $L$ and $\Psi$, we treat them as if they were the real values in order to estimate $F$. We use the weighted least squares method, a popular method for linear regression models with unequal variances (Maxwell, 1892), to estimate $F$. The solution is:

$$\hat{f} = (L'\Psi^{-1}L)^{-1}L'\Psi^{-1}(x - \mu) \quad (6)$$
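A short sketch of Equation (6) in NumPy, assuming a $p \times m$ loading matrix `L`, a length-$p$ vector of specific variances `psi`, and an observation `x` with mean `mu` are already available (the names are ours):

```python
import numpy as np

def factor_scores_wls(L, psi, x, mu):
    """Weighted least squares factor scores from Equation (6):
    f_hat = (L' Psi^{-1} L)^{-1} L' Psi^{-1} (x - mu)."""
    W = np.diag(1.0 / psi)            # Psi^{-1}, since Psi is diagonal
    A = L.T @ W @ L                   # L' Psi^{-1} L
    b = L.T @ W @ (x - mu)            # L' Psi^{-1} (x - mu)
    return np.linalg.solve(A, b)
```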

3) Factor Rotation: We use the varimax criterion for factor rotation, which was introduced by Kaiser, to improve the model’s interpretability.

Factor loadings $L$ are determined only up to an orthogonal matrix $T$. Thus the loadings $L^* = LT$ and $L$ both give the same representation. The communalities, given by the diagonal elements of $LL' = L^*L^{*\prime}$, are also unaffected by the choice of $T$.

Kaiser proposed the varimax criterion: define $\tilde{l}_{ij}^* = \hat{l}_{ij}^* / \hat{h}_i$ to be the rotated coefficients scaled by the square root of the communalities, and select the orthogonal transformation $T$ that makes

$$V = \frac{1}{p}\sum_{j=1}^{m}\left[\sum_{i=1}^{p}\tilde{l}_{ij}^{*4} - \left(\sum_{i=1}^{p}\tilde{l}_{ij}^{*2}\right)^{2}\Big/\,p\right]$$

as large as possible (Kaiser, 1958).

This criterion concentrates the loadings, i.e., it maximizes the loadings of $F_i$ on some $X_j$ and minimizes the others, which enables us to explain a certain common factor by a few original variables.

After conducting the factor rotation, the factor scores estimated above should also be adjusted by:

$$f_j^{*} = T' f_j, \quad j = 1, 2, \ldots, n \quad (7)$$

4) Model Explanation: Equation (4) reveals the essential meaning of the loading matrix: the $(i, j)$ element of $L$ is the covariance between the $i$th variable and the $j$th factor. Since the factors are already standardized by the assumptions, if we standardize the variables first, $l_{ij}$ will equal the correlation between $X_i$ and $F_j$. As a result, standardization is usually preferred before factor analysis. As discussed above, Kaiser’s varimax criterion concentrates the loadings, so that a certain common factor will have a large correlation with some of the variables and a small correlation with the others, which indicates that this factor can be explained by the variables with large correlations.

Usually, the common factors of securities and futures company customer data will indicate customers’ monetary value, trading frequency, and profit and loss.
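As an illustration, a minimal sketch of this step with scikit-learn, assuming `X` is the cleansed customer matrix (the variable name is ours); note that `FactorAnalysis` fits the model by maximum likelihood rather than the principal component method described above, so it only approximates our procedure:

```python
import pandas as pd
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

# Standardize first, so that loadings can be read as correlations.
X_std = StandardScaler().fit_transform(X)

# Three common factors with Kaiser's varimax rotation.
fa = FactorAnalysis(n_components=3, rotation="varimax").fit(X_std)

loadings = pd.DataFrame(fa.components_.T, columns=["F1", "F2", "F3"])  # p x 3 loading matrix
scores = fa.transform(X_std)                                           # n x 3 factor scores
```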

3.3. K-Means Clustering Method

The K-means clustering method is a nonhierarchical clustering technique designed to group items, rather than variables, into a collection of K clusters. It does not have to store a matrix of distances (similarities), so it can be applied to much larger data sets than hierarchical techniques. Nonhierarchical methods start from either an initial partition of items into groups or an initial set of seed points, which form the nuclei of the clusters. The K-means process is:

1) Partition the items into K initial clusters.

2) Proceed through the list of items, assigning an item to the cluster whose centroid (mean) is nearest. Recalculate the centroid for the cluster receiving the new item and for the cluster losing the item.

3) Repeat step 2 until no more reassignments take place.

The elbow method is used to determine the number of clusters K: we draw a plot whose x-axis indicates the number of clusters K and whose y-axis indicates the within-group sum of squares. The elbow point, where adding another cluster no longer significantly increases the explained variance and the slope changes sharply, is an appropriate estimate of K (Hartigan & Wong, 1979).
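A minimal sketch of the elbow method and the final clustering with scikit-learn, assuming `scores` is the matrix of factor scores from the previous step (the variable name is ours):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Within-group sum of squares (inertia) for K = 1..10.
ks = range(1, 11)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(scores)
    inertias.append(km.inertia_)

plt.plot(ks, inertias, marker="o")
plt.xlabel("number of clusters K")
plt.ylabel("within-group sum of squares")
plt.show()

# After choosing K at the elbow (K = 4 for our data), refit and label the samples.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(scores)
```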

3.4. Regression Tree Method

A regression tree applies a decision tree algorithm to a regression task. Random forest is a supervised machine learning technique that can be used for classification, regression, and other tasks. It is a type of ensemble machine learning that combines the predictions from multiple decision trees and takes their average as its own result. The performance of this algorithm generally beats that of a single decision tree.

The key idea used is bagging (bootstrap aggregation). The algorithm draws bootstrap samples from the dataset, then chooses a subset of features to build a model on each sample. This process repeats many times, and the results are aggregated to form the final prediction (Breiman, 1996).

Here is the whole process (a brief code sketch follows the list):

1) Pick at random k data points from the training set.

2) Build a decision tree associated with these k data points.

3) Choose the number N of trees you want to build and repeat steps 1 and 2.

4) For a new data point, make each of your N trees predict the value of y for the data point in question, and assign to the new data point the average across all of the predicted y values.
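A rough from-scratch sketch of steps 1-4, using scikit-learn's DecisionTreeRegressor as the base learner and assuming NumPy arrays as inputs (the function and variable names are ours; in practice we use the library implementation described in Section 4.5):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_prediction(X_train, y_train, X_new, n_trees=100, seed=0):
    """Average the predictions of n_trees trees, each grown on a bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    predictions = []
    for _ in range(n_trees):                      # step 3: repeat N times
        idx = rng.integers(0, n, size=n)          # step 1: random sample with replacement
        tree = DecisionTreeRegressor(random_state=0)
        tree.fit(X_train[idx], y_train[idx])      # step 2: build a tree on the sample
        predictions.append(tree.predict(X_new))   # step 4: predict for the new points
    return np.mean(predictions, axis=0)           # average across all trees
```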

4. Model Construction

4.1. Data Source

Our data comes from a well-known securities and futures company in China. The data contains 91,592 customers with 19 variables. The data is preprocessed by multiplying by a certain constant to protect user privacy.

4.2. Data Overview and Cleansing

The dataset consists of 19 variables, whose meaning is demonstrated in Appendix A.

We treat the last 8 as dependent variables, as they show the profit contributed by each customer, and we treat the first 11 as independent variables. The box-plot of the original data is shown in Figure 2. As shown, almost all variables are concentrated near 0, and the highlighted samples are possible outliers.

We perform 4 steps of data cleansing (a code sketch follows the list):

1) Remove the samples whose values are all zero.

2) Apply the transformation $\mathrm{sgn}(x) \times \ln(|x| + 1)$ to all variables whose unit is the dollar.

3) For customers whose number of commissions is 0, set their cancellation rate to the mean of the rest, 26.37% (since their own cancellation rate is N/A).

4) Remove the two outliers highlighted in Figure 2.
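A compact pandas sketch of the four steps, where the column names (`money_cols`, `n_commission`, `cancellation_rate`) and the outlier indices are placeholders rather than the dataset's actual field names:

```python
import numpy as np
import pandas as pd

def cleanse(df, money_cols, outlier_idx):
    # 1) Remove samples whose values are all zero.
    df = df.loc[~(df == 0).all(axis=1)].copy()
    # 2) Signed log transform on the monetary variables.
    df[money_cols] = np.sign(df[money_cols]) * np.log1p(df[money_cols].abs())
    # 3) Replace the undefined cancellation rate (no commissions) by the mean of the rest.
    no_comm = df["n_commission"] == 0
    df.loc[no_comm, "cancellation_rate"] = df.loc[~no_comm, "cancellation_rate"].mean()
    # 4) Drop the outliers flagged in the box-plot.
    return df.drop(index=outlier_idx, errors="ignore")
```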

The box-plot after data cleansing is shown in Figure 3. We can see that the distribution of the variables concerning money is improved; however, the variables concerning trade numbers are still concentrated near 0. We cannot remove these samples, since customers who have equity but no trading records can still contribute to the company’s profit.

4.3. Factor Analysis Result

We use the principal component method and Kaiser’s varimax criterion to perform factor analysis. The loading matrix is shown in Table 1.

As shown, the first factor has a high correlation with Equity, Guarantee Deposit, number of transactions, and turnover, so it can be explained as the factor concerning monetary value; the second factor has a high correlation with number of commissions and number of cancellations, so it can be explained as the factor concerning trading frequency; the third factor has a high correlation with profit/loss and its ratio, so it can be explained as the factor concerning profit and loss.

Figure 2. Box-plot before data cleansing.

Figure 3. Box-plot after data cleansing.

Table 1. Loading matrix.

4.4. K-Means Result

Figure 4 shows the within-group sum of squares for different numbers of clusters K. As shown in Figure 4, K = 4 is an apt estimate of the number of clusters.

We cluster the samples into 4 categories. Figure 5 shows the clustering result, and Table 2 shows the features of the clusters.

In terms of the features, cluster 1 consists of customers who trade at very high frequency. Clusters 2 and 4 are both made up of customers with large monetary value, but cluster-2 customers have a positive profit, contrary to cluster-4 customers, who have a negative one. Cluster-3 customers may have a shorter lifetime due to their lower monetary value and trading frequency, but they make up the bulk of the company’s customers and thus require careful management.

4.5. Regression Tree Result

For regression to evaluate each customer’s value, we split the dataset into 75 percent training data and 25 percent test data. We perform random forest regression of the total contribution on the three reduced dimensions, using the RandomForestRegressor class from sklearn.ensemble. After fitting the model to the data, we use its feature_importances_ attribute to plot the relative importance of the 3 variables. The result is shown in Figure 6.
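A minimal sketch of this workflow, assuming `scores` holds the three factor-score columns and `contribution` holds the total contribution (the variable names are ours):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# 75/25 train/test split of the reduced dimensions and the target.
X_train, X_test, y_train, y_test = train_test_split(
    scores, contribution, test_size=0.25, random_state=0)

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

print(rf.feature_importances_)   # relative importance of the three dimensions
print(rf.score(X_test, y_test))  # R^2 on the held-out 25 percent
```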

We find that dimension 1, which relates to the monetary aspect of a customer, is the most important for their contribution, at around 90 percent, which also makes sense from the company’s perspective. The influence of the other two dimensions is comparatively small. Testing the model on the test data gives an accuracy of 97 percent, which suggests the model is sound. Given one customer’s data, we can use the reduced dimensions to predict their contribution.

Figure 4. Elbow method.

Figure 5. K-means result.

Figure 6. Contribution regression result.

Table 2. Cluster features.

5. Discussion

5.1. Evaluation on Clustering Results

To evaluate the accuracy of the clustering from the K-means algorithm, a classification model can be used to measure how accurately customers are classified into the 4 classes. To make the output a straightforward classification and to show how customers are classified, a decision tree is used here, since a decision tree exposes its rules and shows how it distinguishes different data.

The C5.0 decision tree algorithm works by splitting the sample on the field that provides the maximum information gain (Sharma & Kumar, 2016). The sample is split into branches until each leaf, the end of a branch, is no longer divisible or leads to a conclusion, while unrelated features are discarded. To identify different features, the C5.0 decision tree uses the concept of entropy, $\text{Entropy}(S) = \sum_{i=1}^{c} -p_i \log_2(p_i)$ (Yobero, 2018). The entropy of each feature tells its purity, which determines how intertwined the different subspaces of data are with respect to their classes (Li & Claramunt, 2006). The purity is further used in the calculation of information gain, $\text{InfoGain}(F) = \text{Entropy}(S_1) - \text{Entropy}(S_2)$. With these calculations the C5.0 decision tree knows how to create branches, since the higher the information gain, the better a feature is at creating independent groups. Besides a clear split on features, the C5.0 decision tree uses a different strategy from other decision tree algorithms: it post-prunes the tree, i.e., it first grows a large tree that overfits and afterwards cuts off leaves and branches that have little effect. This strategy brings high accuracy while limiting the risk of overfitting.
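A small sketch of these two quantities in Python, assuming a binary split; the function names are ours, and $\text{Entropy}(S_2)$ is taken as the size-weighted entropy after the split:

```python
import numpy as np

def entropy(labels):
    """Entropy(S) = sum over classes of -p_i * log2(p_i)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def info_gain(parent, left, right):
    """InfoGain(F) = Entropy(S1) - Entropy(S2), with S2 weighted by branch size."""
    n = len(parent)
    after = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - after
```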

In this research, 40,000 samples are randomly selected from the dataset: 20,000 are treated as training data and the other 20,000 as testing data in order to test the accuracy of the decision tree. The confusion matrix is shown in Table 3, where we see a 0.1% error rate over the 20,000 testing cases. The confusion matrix also shows that the model is accurate in evaluating most of the classes, which indicates that clustering the customers into 4 groups is reasonable.

5.2. Regression on Five Variables Composing Contribution

We know that Exchange Return, Refund of Exchange Return, Zero Interest Return, Interest Income, and Net Retention Money are linearly related to the total contribution, so we also perform random forest regressions of each of these five variables on the three reduced dimensions, and then use the feature_importances_ attribute of each model to show the percent importance of each dimension. We first look at the results for Interest Income, Exchange Return, and Net Retention Money, shown in Figure 7.

The feature importance results on the three dimensions for these three variables are similar to those obtained before for the total contribution. Figure 8 shows the plots for the remaining two variables, which differ from the previous ones.

From these two plots, the second and third factors make greater impacts on the feature importance of Refund of Exchange Return and Zero Interest Return, which contradicts our original results to some extent, judging purely from the figures. To find out whether this difference affects our conclusion, we perform a regression to determine the linear relation between them. Table 4 shows the regression results.

Table 3. Decision tree confusion matrix.

Figure 7. Results for first three variables. (a) Interest income; (b) Exchange return; (c) Net retention.

Figure 8. Results for the remaining two variables. (a) Refund of exchange return; (b) Zero interest return.

From the regression table, the fitted linear function is:

$$\text{Total Contribution} = -0.0551 \times \text{Refund of Exchange Return} + 0.1445 \times \text{Exchange Return} - 0.0622 \times \text{Zero Interest Return} + 0.8153 \times \text{Interest Income} + 0.2630 \times \text{Net Retention Money} + 0.2882 \quad (8)$$

The coefficients of Refund of Exchange Return and Zero Interest Return, −0.0551 and −0.0622, are the smallest in magnitude, and the absolute values of their t statistics, 22.406 and 17.906, are also the smallest. These results reflect that the statistically significant relationships between the predictors Refund of Exchange Return and Zero Interest Return and the response variable Total Contribution are the weakest, and their contribution to Total Contribution is minimal. Therefore, the difference in feature importance for these two variables does not affect our results to a great extent, and our conclusions remain reliable under the current examination.

Table 4. Regression result.

6. Conclusion

Customer value assessment is a significant concept for securities companies, and industry experience has been the basis of assessment in past practice. In this research, supervised and unsupervised machine learning algorithms were applied to construct a customer value assessment model using data on 91,592 customers from a top Chinese securities firm. K-means clustering handled customer categorization: customers were clustered into four groups, namely a group with high trading frequency; a group with large assets and profits; a group with large assets and losses; and a group with a limited amount of money and trading frequency, which is the majority. Trading frequency, assets, and profit compose the crucial factors of customer valuation for securities firms. High trading frequency implies a group of professional customers and the possibility of quantitative trading. The group with limited assets and trading frequency is the majority, and their preferences should be the focus of securities firms. Future work will involve categorization within this group to assist securities firms in developing better services. The regression tree performed the customer value evaluation. The main factor of the value a company treasures is the amount of assets the customer holds, while the customer’s trading details account for only a small share of the total contribution. Although the trading data distinguishes customer categories, asset value is the primary factor in determining the value of a customer for securities firms.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this paper.

References

[1] Breiman, L. (1996). Bagging Predictors. Machine Learning, 26, 123-140.
https://doi.org/10.1007/BF00058655
[2] Dlouhy, J., Wans, S., & Haghsheno, S. (2018). Evaluation of Customer Value by Building Owners in the Construction Process. 26th Annual Conference of the International Group for Lean Construction (pp. 199-208). International Group for Lean Construction.
https://doi.org/10.24928/2018/0393
[3] Hartigan, J. A., & Wong, M. A. (1979). Algorithm AS 136: A K-Means Clustering Algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics), 28, 100-108.
https://doi.org/10.2307/2346830
[4] Johnson, R. A., & Wichern, D. W. (2002). Applied Multivariate Statistical Analysis. Prentice Hall.
[5] Kaiser, H. F. (1958). The Varimax Criterion for Analytic Rotation in Factor Analysis. Psychometrika, 23, 187-200.
https://doi.org/10.1007/BF02289233
[6] Kashwan, K. R., & Velu, C. M. (2013). Customer Segmentation Using Clustering and Data Mining Techniques. International Journal of Computer Theory and Engineering, 5, 856-861.
https://doi.org/10.7763/IJCTE.2013.V5.811
[7] Li, X., & Claramunt, C. (2006). A Spatial Entropy-Based Decision Tree for Classification of Geographical Information. Transactions in GIS, 10, 451-467.
https://doi.org/10.1111/j.1467-9671.2006.01006.x
[8] Maxwell, J. C. (1892). A Treatise on Electricity and Magnetism (Vol. 2, 3rd Ed., pp. 68-73). Clarendon.
[9] Sharma, H., & Kumar, S. (2016). A Survey on Decision Tree Algorithms of Classification in Data Mining. International Journal of Science and Research (IJSR), 5, 2094-2097.
https://doi.org/10.21275/v5i4.NOV162954
[10] Xu, Y., Goedegebuure, R., & Heijden, B. (2007). Customer Perception, Customer Satisfaction, and Customer Loyalty within Chinese Securities Business: Towards a Mediation Model for Predicting Customer Behavior. Journal of Relationship Marketing, 5, 79-104.
https://doi.org/10.1300/J366v05n04_06
[11] Yamamoto, G. T. (2007). Understanding Customer Value Concept: Key to Success.
http://www.opf.slu.cz/vvr/akce/turecko/pdf/Yamamoto.pdf
[12] Yobero, C. (2018). Determining Creditworthiness for Loan Applications Using C5.0 Decision Trees. RPubs by RStudio.
https://rstudio-pubs-static.s3.amazonaws.com/404024_4e62fe44761a4bc690918f93ac2a2aed.html

Copyright © 2024 by authors and Scientific Research Publishing Inc.


This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.