1. Introduction
The clustering analysis is one of the statistical methods that deal with the division and classification of variables data elements into several homogeneous groups that are homogeneous within one group (cluster) and are different from other groups (other clusters). Cluster analysis is defined as, a set of methods for constructing a (hopefully) sensible and informative classification of an initially unclassified set of data, using the variable values observed on each individual. All such methods essentially try to imitate what the eye-brain system does so well in two dimensions (Everitt and Skrondal [1]). Because of this characteristic of cluster analysis, it has been used in many applied fields. It is used to divide and classify data into aggregates, which help to properly select appropriate statistical analysis of these data as a decision-making tool. The objective of this statistical method is to divide the data matrix containing the number of (n) of the samples and (p) of the variables into a homogeneous number of partial groups (k) by assembling homogeneous and convergent sample items in clusters. Thereafter, criteria and measures must be used to distinguish between the different cluster results to reach two main points: the similarity of the data elements within the different clusters and the optimal number of clusters. This is done through the use of the legal functions, known as the standards of validity and legal performance of the cluster. In this paper, one of the most important hybrid models based on clusters, the Gaussian mixed model-based clustering is used. The hybrid models based on clusters are able to predict accurately if the appropriate variance model is chosen. It is applied through the use of four heterogeneity models. The covariance matrix of the Gaussian mixed model is unknown. So to estimate these parameters we need to maximize the log-likelihood function of. Direct maximization of the log-likelihood function is complicated, so the maximum likelihood estimator (MLE) of a finite mixture model is usually obtained via the EM algorithm (Dempster et al. [2]).
Banfield and Raftery [3] proposed a model-based clustering method based on constraining these geometric features of components using the eigenvalue decomposition of the covariance matrix.
Different constraints on the covariance matrix provides different models that are applicable to different data structures, which is another advantage of model-based clustering. In 1995, Celeux and Govaert [4] classified these models in three main families of models: spherical, diagonal and general families. They have given the definitions and derivations of all 14 available models, along with the covariance matrix update equations based on these models to be used in the EM algorithm. However, only nine of those have a closed form solution to the covariance update equation, which is evaluated in the M-step of the EM algorithm.
Later in 2016, Chi et al., [5] showed that the population likelihood function has bad local maxima even in the special case of equally-weighted mixtures of well-separated and spherical Gaussians. They proved that the log-likelihood value of these bad local maxima can be arbitrarily worse than that of any global optimum. Also, they showed that the EM algorithm with random initialization will converge to bad critical points with probability at least. They further establish that the first-order variant of EM will not converge to strict saddle points almost surely, indicating that the poor performance of the first-order method can be attributed to the existence of bad local maxima rather than bad saddle points.
Cluster analysis is used in various fields of science. Tóth et al., [6] described gamma-ray bursts (GRBs) using clustering. They analyzed the Final BATSE Catalog using Gaussian-mixture-models-based clustering methods for six variables (durations, peak flux, total fluency and spectral hardness ratios) that contain information on clustering.
In 2000, Bozdogan [7] studied the basic idea of Akaike’s [8] information criterion (AIC). Then, he presented some recent developments on a new entropic or information complexity (ICOMP) criterion of Bozdogan [9] for model selection.
The main contribution of the present paper is to propose the mixture-model cluster analysis technique under different covariance structures of the component densities. To determine the optimal number of clusters by selecting the number of clusters corresponding to the lowest values for the different criteria. Four models for covariance structures that have not been applied in previous studies are studied using three criteria of the complexity of information.
This paper is organized as follows: Section one is the introduction and section two the Gaussian Mixture Model-based Clustering (GMMC) is discussed. In section three, the Expectation-Maximization (EM) algorithm is introduced. The Model Selection Criteria are introduced in section four. Finally, sections five and six contain the Numerical Results, and the Conclusion, respectively (Table 1).
2. The Gaussian Mixture Model-Based Clustering (GMMC)
The Gaussian mixture model is a powerful clustering algorithm used in cluster analysis. It is the most widely used clustering method of this kind, is the one based on learning a mixture of Gaussians. It assumes that there are a certain number of Gaussian distributions, and each of these distributions represents a cluster. Hence, a Gaussian Mixture Model tends to group the data points belonging to a single distribution together. Gaussian Mixture Models are probabilistic models and use the soft clustering approach for distributing the points in different clusters. It’s difficult to determine the right model parameters, Expectation-Maximization method is used to determine the model parameters.
In a case where
are given (p dimensional data of size n), would be interested in estimating the number of clusters K. Assuming the observations
(
,
) are assumed to be drawn from the following mixture K distribution, each corresponding to a different cluster:
Table 1. Nomenclatures of used parameters.
Here
are the mixing proportions that satisfy
and
.
is the vector of unknown parameters of the kth component, and
represents the probability that an observation belongs to the kth component. The Gaussian mixture model assumes that the components of the mixture are the multivariate normal distribution, thus the density function becomes:
The mixture components (i.e. clusters) are ellipsoids centered at
with other geometric features, such as volume, shape, and orientation, determined by the covariance matrix
. (Titterington et al. [10]).
In this case, the component densities
are given by:
Parsimonious parameterizations of the covariance matrices can be obtained by using the eigenvalue decomposition of the covariance matrix. The eigenvalue decomposition of the kth covariance matrix is given as:
where:
is a scalar controlling the volume of the ellipsoid.
is a diagonal matrix specifying the shape of the density contours with
.
is an orthogonal matrix which determines the orientation of the corresponding ellipsoid (Banfield and Raftery [3] and Celeux and Govaert [4]).
In one dimension, there are just two models: E for equal variance and V for varying variance. In the multivariate setting, the volume, shape, and orientation of the covariance can be constrained to be equal or variable across groups. Thus, 14 possible models with different geometric characteristics can be specified. Table 2 reports all such models with the corresponding distribution structure type, volume, shape, orientation, and associated model names. See (Erar [11], Gupta and Bhatia [12], Chi et al., [5], Scrucca et al., [13], Malsiner-Walli et al., [14] and Tóth, et al., [6]).
Approaching the clustering problem from this probabilistic standpoint reduces the whole problem to the parameter estimation of a mixture density. The unknown parameters of the Gaussian mixture density, are the mixing proportions,
, the mean vectors,
, and the covariance matrices,
. Therefore, to estimate these parameters, we need to maximize the log-likelihood given by:
The estimates of the mixing proportion,
, the mean vector
, and the covariance matrix
for the kth population are given as:
Table 2. Parameterizations of the covariance matrix and the corresponding geometric features.
where:
.
This estimation requires the non-linear optimization of the mixture likelihood for high-dimensional data sets. However, there are no closed-form solutions to
for any mixture density; so the likelihood has to be numerically maximized. For this numerical optimization, the Expectation-Maximization (EM) algorithm of Dempster et al. [2] is used, which treats the data as incomplete and the group labels yi as missing.
3. The Expectation-Maximization (EM) Algorithm
The expectation-maximization (EM) algorithm is an iterative procedure used to find maximum likelihood estimates when data are incomplete or are treated as being incomplete. The consummate citation for the EM algorithm is the famous paper by Dempster et al. [2]. In EM algorithm, E and M steps are iterated until convergence is reached. The EM algorithm is based on the “complete-data”; i.e., the observed data plus the missing data. In E-step, the expected value of the complete-data log-likelihood, say Q, is computed; in the M-step, Q is maximized with respect to the model parameters. The EM algorithm is easy to implement and a numerically stable algorithm that has reliable global convergence under fairly general conditions. However, the likelihood surface in mixture models tends to have multiple modes. So initialization of EM is crucial because it usually produces sensible results when started from reasonable starting values (Wu [15]). In this approach, hierarchical clusters are obtained by recursively merging the two clusters that provide the smallest decrease in the classification likelihood for the Gaussian mixture model (Banfield and Raftery [3], Xu et al. [16]).
The EM algorithm is an iterative procedure consisting of two alternating steps, given some starting values for all parameters (
,
and
). The algorithm can be summarized as follows at iteration (t + 1):
1) In the E-step, the posterior probability,
of the ith observation belonging to the kth component is estimated, given the current parameter estimates.
2) In the M-step, the parameter estimates of
,
and
are updated given the estimated posterior probabilities, using the update equations
3) Iterate the first two steps until convergence.
The EM algorithm requires two issues to be addressed; determining the number of components, K, and initialization of the parameters.
4. The Model Selection Criteria
After estimating the parameters for the covariance matrix, the next step of determining the optimal cluster structure is selecting the best model. Despite the vast number of different model selection criteria in the literature, Schwarz’s Bayesian Criteria (SBC) (Schwarz [17]) is no doubt the most widely used in the model-based clustering. Besides these criteria, two other selection criteria are used. Namely AIC (Akaike [8]) and the information complexity (ICOMP) criterion (Bozdogan [18]). When using any information criterion to perform model selection, the model corresponding to the lowest score as providing the best balance between good fit and parsimony is chosen. Using the likelihood function, the AIC and SBC functions of the Gaussian mixture model can be defined as follows:
where:
is the likelihood function.
m is the number of independent parameters to be estimated.
is the maximum likelihood estimate for parameter θ.
ICOMP, originally introduced by Bozdogan [7] [9] [18] [19], is a logical extension of AIC and SBC, based on the structural complexity of an element or set of random vectors via the generalization of the information-based covariance complexity index of Van Emden [20]. ICOMP penalizes the lack-of-fit of a model with twice the negative of the maximized log-likelihood, following the same procedure of AIC and SBC. However, in ICOMP, a combination of lack-of-parsimony and profusion-of-complexity are also simultaneously penalized by a scalar complexity measure, C, of the model covariance matrix; while in AIC and SBC, only the lack of parsimony is penalized in terms of the number of parameters. In general, ICOMP is defined by using the likelihood function, the AIC and SBC functions of the Gaussian mixture model can be defined as follows:
where:
is the likelihood function.
C is a real-valued complexity measure.
is the estimated model covariance matrix.
The covariance matrix is estimated by the estimated inverse Fisher information matrix (IFIM),
is given by:
That is to say, IFIM is the negative expectation of the matrix of the second partial derivatives of the maximized log-likelihood of the fitted model, evaluated at the maximum likelihood estimators
.
For a multivariate normal model, the general form of ICOMP is defined as:
where:
For all the above criteria, the decision rule is to select the model that gives the minimum score for the loss function.
5. The Numerical Results
All results were obtained by using MATLAB.
The Gaussian mixture-model based clustering is applied, which implements the EM algorithm for inference, to four simulated data sets. The maximum number of clusters is taken K max = 6 for all examples. The convergence criteria of the EM algorithm are set to see = 10−6 and a maximum of 1000 iterations is allowed. After confirming the validity of mathematical equations and the program, four models of covariance matrix were applied. These models are:
Model: EVV with the covariance matrix (
).
Model: VII with the covariance matrix (
).
Model: VEE with the covariance matrix (
).
Model: VVE with the covariance matrix (
).
These models have been selected due to their distinguishing features: They represent different cases of the covariance matrix. Where the models [EVV] [VEE] and [VVE] belong to the General Family (Celeux and Govaert [4]). While the model [VII] belongs to the spherical family. In all models, the AIC, SBC and ICOMPPEU parameters were calculated. The optimal number of clusters has been determined by reaching the lowest values. The values of the complexity criteria were as follows:
1) Model: EVV with the covariance matrix (
) (Figure 1 & Figure 2)
From Table 3, the optimal number of clusters was determined. It was determined by achieving the lowest values of the criteria at the same time. It was found to fit the number of clusters of two clusters.
Given below in Table 4 the parameter values estimated for the best simulation.
For the selected model, GMMC identifies the cluster labels with a miss classification rate of 1%. The miss classification rate is calculated as follows:
Figure 1. Scatterplot of the actual dataset labeled by groups (Model EVV).
Figure 2. Scatterplot of the estimated dataset labeled by groups (Model EVV).
Table 3. Values of the criteria for selecting the model to reach the best simulation for the model (EVV) for the number of clusters k = 1, ..., 6.
Table 4. The resulting confusion matrix for model (EVV).
2) Model: VII with the covariance matrix (
) (Figure 3 & Figure 4)
Using Table 5, the optimal number of clusters was two clusters. GMMC achieves a miss classification rate of 2% for the model (VII). The resulting confusion matrix is shown in Table 6.
3) Model: VEE with the covariance matrix (
) (Figure 5 & Figure 6)
From the results in Table 7, it was found that the optimal number of clusters is three, so the number of clusters was increased. To achieve greater clarity, the sample size was 500 instead of 250 and was divided into three groups as follows (Table 8).
For this model, the miss classification rate was 15%.
4) Model: VVE with the covariance matrix (
) (Figure 7 & Figure 8)
Figure 3. Scatterplot of the actual dataset labeled by groups (Model VII).
Figure 4. Scatterplot of the estimated dataset labeled by groups (Model VII).
Figure 5. Scatterplot of the actual dataset labeled by groups (Model VEE).
Table 5. Values of the criteria for selecting the model to reach the best simulation for the model (VII) for the number of clusters k = 1, ..., 6.
Table 6. The resulting confusion matrix for model (VII).
Figure 6. Scatterplot of the estimated dataset labeled by groups (Model VEE).
Figure 7. Scatterplot of the actual dataset labeled by groups (Model VVE).
Figure 8. Scatterplot of the estimated dataset labeled by groups (Model VVE).
Table 7. Values of the criteria for selecting the model to reach the best simulation for the model (VEE) for the number of clusters k = 1, ..., 6.
Table 8. The resulting confusion matrix for model (VEE).
The fit number of clusters for this model was two clusters (Table 9).
It was shown that the miss classification rate was 0% from the data in Table 10.
6. Conclusion
In this paper, the Gaussian mixture model-based clustering is used. The mixture models based on clusters are able to predict accurately if the appropriate covariance matrix, model is selected. It is applied by using four models:
Table 9. Values of the criteria for selecting the model to reach the best simulation for the model (VVE) for the number of clusters k = 1, ..., 6.
Table 10. The resulting confusion matrix for model (VVE).
1) Model [EVV] (
) represents the case of equal volume, variable shape, and orientation. It is showed that the optimal number of clusters equals two. From the values of the complexity criteria in Table 3, it is noted that the ICOMPPEU criterion corresponds to the lowest value compared to the other two criteria and the miss classification rate was 1%.
2) Model [VII] (
) represents the case of variable volume, shape, and orientation. Also, in this model, the optimal number of clusters equals two and the ICOMPPEU criterion corresponds to the lowest value compared to the other two parameters (the values in Table 5). The miss classification rate was 2%.
3) Model [VEE] (
) represents the case of variable volume, equal shape, and direction. From Table 7, it is found that the optimal number of clusters is calculated by the number of clusters corresponding to the lowest values of the complexity of the information and found to be equal to three. The miss classification rate was 15%.
4) Model [VVE] (
) represents the case of variable volume, shape, and equal orientation. As the first and second model the optimal number of clusters equals two the ICOMPPEU criterion corresponds to the lowest value compared to the other two criteria (values are found in Table 9, while the miss classification rate was 0%.
The results showed that the ICOMPPEU criteria were superior to the rest of the criteria. In addition to the success of the Gauss model based on the clusters in the prediction using the covariance matrix. The study also determined the possibility of determining the optimal number of clusters by selecting the number of clusters corresponding to the lowest values of the different criteria.
For the number of clusters k = 1, ..., 6, the three different selection criteria have chosen the VVE model for the number of clusters two to be the optimal model. For the selected model, the Gaussian Mixture Model-based Clustering (GMMC) diagnoses the cluster classification with a 0% miss classification rate.