Variable Selection for Robust Mixture Regression Model with Skew Scale Mixtures of Normal Distributions
1. Introduction
In applied statistics, the arc-sine laws for the Wiener process and the skew Brownian motion [1] are widely used in finance when the market is homogeneous. However, the problem of modeling heterogeneous data has received considerable attention in recent years, and the Finite Mixture of Regression (FMR) model is an important tool for heterogeneous cases. A large number of applications associate a random response variable Y with covariates x through FMR models, under the assumption that the regression coefficients differ across the subgroups from which the observations arise. More details about the FMR model can be found in [2].
The Gaussian FMR model is the most common FMR model; it assumes that the random error of each subgroup follows a normal distribution. It is well known that using the normal distribution to model data with asymmetric and heavy-tailed behavior is unsuitable, and that the resulting parameter estimates are sensitive to outliers. To overcome these potential shortcomings of Gaussian mixture models, McLachlan et al. [3] proposed replacing the mixture of normals with a mixture of t-distributions, which results in a more robust mixture model. Basso et al. [4] studied a finite mixture model based on scale mixtures of skew-normal distributions, and Franczak et al. [5] proposed a mixture model using shifted asymmetric Laplace distributions, which parameterize the skewness as well as the location and the scale.
The problem of variable selection in FMR models has been widely discussed recently. There are generally two types of variable selection methods. The first comprises best-subset selection and discontinuous penalty methods based on information criteria, including stepwise regression, best subset regression, the BIC criterion, the AIC criterion, and so on. The second is the continuous penalty method: by imposing penalties on the parameters of the objective function, one can select significant variables and obtain parameter estimates simultaneously. The Least Absolute Shrinkage and Selection Operator (LASSO), elastic net regularization [6], the MCP penalty [7], and the SCAD penalty [8] are common penalty functions for variable selection. We utilize the SCAD penalty and a new penalty function proposed in this paper, which balances the SCAD and ℓ₂ penalties, to perform variable selection on a robust mixture regression model based on Skew Scale Mixtures of Normal (SSMN) distributions [9]; this robust model accommodates asymmetric and heavy-tailed data better.
The paper is organized as follows. In Section 2, a robust mixture regression model using the skew scale mixtures of normal distributions (RMR-SSMN) is introduced. Variable selection methods with the SCAD penalty function and a newly proposed penalty function are presented in Section 3. Section 4 outlines the adjusted EM algorithm for parameter estimation and a BIC method for selecting tuning parameters and the number of components. In Section 5, we carry out simulation studies to compare the performance of FMR models and RMR-SSMN models, and show the effect of variable selection with penalty functions. An application of the method to a real data set is discussed in Section 6, and conclusions are drawn in Section 7.
2. Robust Mixture Regression Model with SSMN Distributions
It is known that the FMR model can handle heterogeneous data and that the Skew Scale Mixtures of Normal (SSMN) distributions [9] cover both asymmetric and heavy-tailed distributions. Therefore, we propose a robust mixture regression model whose component regression errors follow SSMN distributions. As expected, this model is more robust than general FMR models for heterogeneous cases.
2.1. Skew Scale Mixtures of Normal Distributions
If a random variable Y follows a skew-normal (SN) distribution with location parameter μ, scale parameter σ², and skewness parameter λ, denoted Y ∼ SN(μ, σ², λ), then its density function is given as follows:

f(y) = 2φ(y; μ, σ²)Φ(λ(y − μ)/σ), (1)

where φ(·; μ, σ²) and Φ(·) are the probability density function of N(μ, σ²) and the cumulative distribution function of N(0, 1), calculated at y and λ(y − μ)/σ, respectively.
Note that when λ = 0, the SN(μ, σ², λ) distribution reduces to the N(μ, σ²) distribution, and as given in [9], the SN distribution's marginal stochastic representation is presented by:

Y = μ + σ(δ|T₀| + (1 − δ²)^{1/2} T₁), δ = λ/(1 + λ²)^{1/2}, (2)

where T₀ ∼ N(0, 1), T₁ ∼ N(0, 1), and T₀ and T₁ are independent.
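Representation (2) gives a direct way to simulate SN variates without evaluating the density; a minimal sketch (NumPy, function name `rsn` is our own):

```python
import numpy as np

def rsn(n, mu=0.0, sigma=1.0, lam=0.0, rng=None):
    """Draw n samples from SN(mu, sigma^2, lam) via the marginal
    stochastic representation Y = mu + sigma*(delta*|T0| + sqrt(1-delta^2)*T1),
    where T0, T1 are independent standard normals."""
    rng = np.random.default_rng(rng)
    delta = lam / np.sqrt(1.0 + lam**2)
    t0 = np.abs(rng.standard_normal(n))   # half-normal component
    t1 = rng.standard_normal(n)           # independent normal component
    return mu + sigma * (delta * t0 + np.sqrt(1.0 - delta**2) * t1)
```

With λ = 0 (hence δ = 0) the draws are exactly N(μ, σ²); a positive λ skews the sample to the right, since E(Y) = μ + σδ√(2/π).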
Furthermore, if a random variable Y follows a SSMN distribution [9] with location parameter μ, scale parameter σ², and skewness parameter λ, then its probability density function is given by:

f(y) = 2 ∫₀^∞ φ(y; μ, κ(u)σ²) dH(u; ν) Φ(λ(y − μ)/σ), (3)

where H(u; ν) is the cumulative distribution function of the mixing variable U, determined by the parameter vector ν, U is a positive random variable, and κ(·) is a strictly positive weight function. If the probability density function of the random variable Y is given by Equation (3), we denote Y ∼ SSMN(μ, σ², λ; H).
For Y ∼ SSMN(μ, σ², λ; H), its hierarchical representation has the following form:

Y | U = u ∼ SN(μ, κ(u)σ², κ(u)^{1/2}λ), U ∼ H(u; ν). (4)
This paper considers the following distributions in the SSMN family:
· The skew Student-t-normal distribution (STN) [10], with U ∼ Gamma(ν/2, ν/2), κ(u) = 1/u, and ν > 0, which has probability density:

f(y) = 2 t(y; μ, σ², ν) Φ(λ(y − μ)/σ), (5)

where t(y; μ, σ², ν) = Γ((ν + 1)/2)/(Γ(ν/2)(νπ)^{1/2}σ) [1 + d/ν]^{−(ν+1)/2} is the Student-t density with d = (y − μ)²/σ², and Γ(·) is the gamma function. We can obtain that U | Y = y ∼ Gamma((ν + 1)/2, (ν + d)/2).
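Because the skewing factor in (5) does not involve U, the posterior of U given Y = y is the Gamma law stated above, so the E-step weight E(U | Y = y) has a simple closed form; a sketch (function name is our own):

```python
def stn_weight(y, mu, sigma2, nu):
    """E-step weight E[U | Y = y] under the STN model:
    U | Y = y ~ Gamma((nu+1)/2, (nu+d)/2) with d = (y-mu)^2/sigma2,
    so the posterior mean is (nu+1)/(nu+d)."""
    d = (y - mu) ** 2 / sigma2
    return (nu + 1.0) / (nu + d)
```

The weight decreases as (y − μ)²/σ² grows, which is exactly how the STN model downweights outlying observations relative to a normal fit.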
· The skew contaminated normal distribution (SCN). Here U is a discrete random variable taking one of the two values γ and 1, with κ(u) = 1/u. Given the parameter vector ν = (ν, γ)ᵀ, 0 < ν < 1, 0 < γ < 1, the density function of U is h(u; ν) = ν I(u = γ) + (1 − ν) I(u = 1). It follows that:

f(y) = 2{ν φ(y; μ, σ²/γ) + (1 − ν) φ(y; μ, σ²)} Φ(λ(y − μ)/σ). (6)

Therefore, the conditional distribution of U | Y = y can be obtained as:

P(U = γ | Y = y) = ν φ(y; μ, σ²/γ)/f₀(y), P(U = 1 | Y = y) = (1 − ν) φ(y; μ, σ²)/f₀(y), (7)

where f₀(y) = ν φ(y; μ, σ²/γ) + (1 − ν) φ(y; μ, σ²).
· The skew power-exponential distribution (SPE) has the following probability density:

f(y) = 2 f_PE(y; μ, σ², ν) Φ(λ(y − μ)/σ), (8)

where f_PE(·; μ, σ², ν) denotes the power-exponential density with kernel proportional to exp(−½|(y − μ)/σ|^{2ν}). Ferreira et al. [9] derived the corresponding conditional expectation needed in the E-step in closed form.
2.2. Robust Mixture Regression Model with SSMN Distributions
Suppose we have
independent random variables
, which are taken from a mixture of SSMN distributions. The conditional density function of the robust mixture regression model with SSMN distributions (RMR-SSMN) which has
components is given by:
(9)
with covariate vector
and q-dimensional unknown regression coefficients vector
.
denote the mixing proportions satisfying
,
.
is the parameter vector of the model. For convenience, let
,
,
,
,
, and
. In this paper, RMR-SSMN models contain the robust mixture regression model with STN distribution (RMR-STN), SCN distribution (RMR-SCN), SPE distribution (RMR-SPE) and SN distribution (RMR-SN).
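For the RMR-SN special case (κ(u) ≡ 1), the conditional density (9) is a π-weighted sum of skew-normal regression densities; a minimal sketch using SciPy's normal pdf/cdf (function names are our own):

```python
import numpy as np
from scipy.stats import norm

def sn_pdf(y, mu, sigma2, lam):
    """Skew-normal density 2*phi(y; mu, sigma^2)*Phi(lam*(y - mu)/sigma)."""
    sigma = np.sqrt(sigma2)
    return 2.0 * norm.pdf(y, mu, sigma) * norm.cdf(lam * (y - mu) / sigma)

def rmr_sn_density(y, x, pis, betas, sigma2s, lams):
    """Mixture density sum_j pi_j * SN(y; x^T beta_j, sigma_j^2, lam_j)."""
    mus = [x @ b for b in betas]                    # component regression means
    return sum(p * sn_pdf(y, m, s2, l)
               for p, m, s2, l in zip(pis, mus, sigma2s, lams))
```

With all λ_j = 0 this collapses to the usual Gaussian FMR density, which is a convenient sanity check.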
3. Variable Selection Method
If a component of the q-dimensional explanatory variable x has no significant effect on the response variable Y, the regression coefficient of this component estimated by maximum likelihood will be close to zero rather than exactly zero. Thus, this covariate is not excluded from the model, which makes the model unstable. To avoid this problem, we use a penalized likelihood approach [11] to select variables and estimate parameters simultaneously. Let (x₁, y₁), …, (xₙ, yₙ) be sample observations from an RMR-SSMN model. The log-likelihood function of θ is given by:

ℓ(θ) = Σ_{i=1}^n log{Σ_{j=1}^g π_j f_SSMN(y_i; x_iᵀβ_j, σ_j², λ_j; ν_j)}. (10)
Following the idea in [11], we obtain the estimates of θ by maximizing the penalized log-likelihood function, defined as:

ℓ_p(θ) = ℓ(θ) − p_λ(θ), (11)

with the penalty function given by:

p_λ(θ) = Σ_{j=1}^g π_j Σ_{k=1}^q p_λ(|β_{jk}|), (12)

where p_λ(·) is a nonnegative and non-decreasing function with tuning parameter λ. The tuning parameter controls the intensity of the penalty on the regression coefficients.
The SCAD penalty has a type of oracle property, as discussed in [8]. In this work, we complete the variable selection procedure using the following SCAD penalty function:

p_λ(β) = λ|β| I(|β| ≤ λ) − [(β² − 2aλ|β| + λ²)/(2(a − 1))] I(λ < |β| ≤ aλ) + [(a + 1)λ²/2] I(|β| > aλ), a > 2. (13)
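The piecewise form (13) is linear near the origin, quadratic in the transition zone, and constant beyond aλ; a direct vectorized sketch:

```python
import numpy as np

def scad(beta, lam, a=3.7):
    """SCAD penalty of Fan and Li (default a = 3.7), evaluated elementwise."""
    b = np.abs(np.asarray(beta, dtype=float))
    small = lam * b                                            # |b| <= lam
    mid = -(b**2 - 2.0 * a * lam * b + lam**2) / (2.0 * (a - 1.0))  # lam < |b| <= a*lam
    large = (a + 1.0) * lam**2 / 2.0                           # |b| > a*lam
    return np.where(b <= lam, small, np.where(b <= a * lam, mid, large))
```

The three pieces join continuously at |β| = λ and |β| = aλ, and the flat tail is what leaves large coefficients unshrunk (the near-unbiasedness behind the oracle property).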
Meanwhile, inspired by [12], we propose a combined penalty function that balances the SCAD penalty and the ℓ₂ penalty. By introducing a connection parameter, this penalty function is more effective in variable selection than directly mixing the SCAD and ℓ₂ penalties; its specific form is given by:

(14)

We call this new penalty function MIXL2-SCAD. Some asymptotic properties of the penalty function are shown in [12]. Following the idea of [8], we let a = 3.7. In particular, the constant a, the tuning parameter λ, and the connection parameter in MIXL2-SCAD jointly control the speed of contraction of the regression coefficients, and when the connection parameter takes its boundary value, the MIXL2-SCAD penalty reduces to the SCAD penalty.
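The exact form of Eq. (14) is specific to this paper and is not recoverable here. Purely as an illustration of how a connection parameter α ∈ [0, 1] could blend SCAD with an ℓ₂ term while reducing to SCAD at the boundary α = 1, one hypothetical combination is:

```python
import numpy as np

def scad(beta, lam, a=3.7):
    """Standard SCAD penalty (Fan and Li), elementwise."""
    b = np.abs(np.asarray(beta, dtype=float))
    return np.where(
        b <= lam, lam * b,
        np.where(b <= a * lam,
                 -(b**2 - 2.0 * a * lam * b + lam**2) / (2.0 * (a - 1.0)),
                 (a + 1.0) * lam**2 / 2.0))

def mix_penalty(beta, lam, alpha, a=3.7):
    """Hypothetical SCAD / l2 blend controlled by a connection parameter
    alpha; alpha = 1 recovers plain SCAD. Illustrative only -- this is
    NOT the paper's exact MIXL2-SCAD form in Eq. (14)."""
    b = np.asarray(beta, dtype=float)
    return alpha * scad(b, lam, a) + (1.0 - alpha) * lam * b**2
```

The ridge component keeps the objective strictly curved near zero, which is one motivation for mixing an ℓ₂ term into a folded-concave penalty.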
4. Numeric Solutions
The expectation-maximization (EM) algorithm can be applied to mixture regression models based on SSMN distributions to maximize the penalized log-likelihood function. When the M-step of EM is analytically intractable for SSMN distributions, it can be replaced by a sequence of conditional maximization (CM) steps, as in the ECM algorithm [13]. Furthermore, for simplicity, we also maximize the constrained actual marginal log-likelihood function in so-called CML steps [14].
4.1. Maximization of the Penalized Log-Likelihood Function
Let us introduce the latent vector Z_i = (Z_{i1}, …, Z_{ig})ᵀ with component indicator variables Z_{ij}, which have the following form:

Z_{ij} = 1 if observation i comes from component j, and Z_{ij} = 0 otherwise. (15)
Using Equations (2) and (4), we can obtain the following hierarchical representation for the mixture of SSMN distributions:

(16)

where TN denotes the truncated normal distribution. The scale-mixing variables and the truncated-normal variables are also regarded as latent vectors. Then the complete log-likelihood function for the complete data (observations together with all latent variables) is given by:

(17)
(17)
with:
(18)
(19)
is a constant that does not depend on any unknown parameter, and
is the density function of the latent variable
.
Replacing the observed log-likelihood ℓ(θ) with the complete log-likelihood in the penalized log-likelihood function, the complete penalized log-likelihood function is given by:

(20)

Following the method of Fan and Li [8], given an initial parameter value β⁽⁰⁾, the penalty p_λ(|β|) can be replaced by the following local quadratic function:

p_λ(|β|) ≈ p_λ(|β⁽⁰⁾|) + ½ {p′_λ(|β⁽⁰⁾|)/|β⁽⁰⁾|} (β² − (β⁽⁰⁾)²), for β near β⁽⁰⁾. (21)

This approximation is applied in the CM-step of the algorithm at each iteration. The adjusted EM algorithm proceeds with the following three steps: the E-step calculates the conditional expectation of the complete penalized log-likelihood function, and the CM-step and CML-step obtain closed-form parameter updates.
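The local quadratic approximation (21) matches the penalty's value and first derivative at the current iterate, so each coefficient contributes a ridge-type term with weight p′_λ(|β⁽⁰⁾|)/|β⁽⁰⁾|; a sketch for the SCAD case (function names are our own):

```python
import numpy as np

def scad_deriv(b, lam, a=3.7):
    """Derivative of the SCAD penalty for b >= 0."""
    b = np.asarray(b, dtype=float)
    return np.where(b <= lam, lam,
                    np.where(b <= a * lam, (a * lam - b) / (a - 1.0), 0.0))

def lqa_weight(beta0, lam, a=3.7, eps=1e-8):
    """Ridge weight of the local quadratic approximation (Fan and Li):
    p(|b|) ~ p(|b0|) + 0.5 * (p'(|b0|)/|b0|) * (b^2 - b0^2)."""
    b0 = np.abs(np.asarray(beta0, dtype=float))
    return scad_deriv(b0, lam, a) / np.maximum(b0, eps)
```

Coefficients whose current value sits near zero get an exploding weight and are shrunk to (and then pruned at) zero, which is how the LQA performs selection inside the CM-step; the `eps` floor is a standard numerical safeguard.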
· The E-step. Given the current estimates θ⁽ᵐ⁾, calculate the Q-function Q(θ | θ⁽ᵐ⁾), obtained as:

(22)

with:

(23)

(24)

(25)

The required quantities are the conditional expectations of the component indicators, the mixing variables, and the truncated-normal variables given the observed data.
First, the conditional expectation of the component indicator Z_{ij} (the posterior probability that observation i belongs to component j) is given by:

(26)

Then, referring to [15], the conditional expectations involving the truncated-normal variable can be evaluated by:

(27)

(28)

with the corresponding conditional mean and variance defined accordingly. Further, the conditional expectation of the mixing variable has a different expression for each RMR-SSMN model, depending on the distribution chosen from the SSMN family, obtained as:

(29)
· The CM-step. Maximize the Q-function with respect to θ at the (m + 1)-th iteration. As in [11], the mixing proportions are updated by:

π_j⁽ᵐ⁺¹⁾ = (1/n) Σ_{i=1}^n E[Z_{ij} | y_i, θ⁽ᵐ⁾], (30)

which are approximate iterated values. Maximizing the unpenalized part of the Q-function with respect to the π_j, instead of maximizing the full penalized Q-function, simplifies the computation, and this updating scheme works well in our simulations.
We now treat the mixing proportions as fixed and maximize the Q-function with respect to the remaining parameters in θ. The updates of the regression coefficients, scale parameters, and skewness parameters are given by:

(31)

(32)

(33)

(34)

where the required weight matrices are built from the conditional expectations computed in the E-step, and the identity matrix involved has order q.
· The CML-step. Fix the regression coefficients and scale parameters, and update the remaining parameters (such as ν) by optimizing the constrained log-likelihood function:

(35)

The above iterations are repeated alternately until the maximum number of iterations is reached or a suitable stopping rule is met. In this work, the iterations are terminated when the relative change in the penalized log-likelihood is sufficiently small, such as 10⁻⁵.
4.2. Selection of Tuning Parameters and Components
When using the methods proposed in this paper, we also need to consider how to determine the number of components g and the size of the tuning parameters in the penalty function. Cross-Validation (CV), Generalized Cross-Validation (GCV), AIC, and BIC are commonly used criteria for the selection of tuning parameters. As shown in [12], the final selected model tends to be overfitted when the tuning parameter is selected by GCV, so they use BIC for the choice instead. In this paper, we likewise propose a proper BIC criterion for RMR-SSMN models to select the tuning parameter λ, the constant a, and the number of components g.
We take one candidate set of (λ, a, g) at a time over a suitable range and use the proposed adjusted EM algorithm to obtain the corresponding parameter estimates. The optimal set is selected by minimizing the following BIC criterion:

(36)

where d_e represents the number of non-zero regression coefficients of the fitted model, and the number of additional parameters per component is either 4 (RMR-SN model), 5 (RMR-STN and RMR-SPE models), or 6 (RMR-SCN model).
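Criterion (36) trades off the maximized likelihood against an effective parameter count; a generic sketch, assuming `loglik` is the maximized log-likelihood and that the degrees of freedom count non-zero coefficients, per-component distribution parameters, and the g − 1 free mixing proportions (the exact bookkeeping is our assumption, not the paper's formula):

```python
import numpy as np

def rmr_bic(loglik, betas, n, g, per_comp=5, tol=1e-6):
    """Generic BIC: -2*loglik + df*log(n), where df sums the non-zero
    regression coefficients, g * per_comp component-level parameters
    (e.g. 4 for RMR-SN, 5 for RMR-STN/RMR-SPE, 6 for RMR-SCN), and the
    g - 1 free mixing proportions."""
    d_e = sum(int(np.sum(np.abs(b) > tol)) for b in betas)
    df = d_e + g * per_comp + (g - 1)
    return -2.0 * loglik + df * np.log(n)
```

Candidate triples (λ, a, g) would each be fitted by the adjusted EM algorithm and the one with the smallest value of this criterion retained.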
5. Simulation Studies
We perform Monte Carlo simulations to evaluate the performance of the proposed robust mixture models and the adjusted EM algorithm. To evaluate the effect of variable selection and the accuracy of the parameter estimates, we use the proportion of correctly estimated zero coefficients (S1), the proportion of correctly estimated non-zero coefficients (S2), the mean estimate over all falsely identified non-zero predictors (ME) [16], and the mean squared error (MSE) of the regression coefficients.
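These criteria can be computed directly from the estimated and true coefficient vectors; a sketch (S1 = share of true zeros estimated as zero, S2 = share of true non-zeros kept; the thresholding rule is our assumption):

```python
import numpy as np

def selection_metrics(beta_hat, beta_true, tol=1e-6):
    """S1: share of truly-zero coefficients estimated as zero.
    S2: share of truly-non-zero coefficients estimated as non-zero.
    MSE: mean squared error of the coefficient estimates."""
    bh = np.asarray(beta_hat, dtype=float)
    bt = np.asarray(beta_true, dtype=float)
    zero = np.abs(bt) < tol
    s1 = np.mean(np.abs(bh[zero]) < tol) if zero.any() else 1.0
    s2 = np.mean(np.abs(bh[~zero]) >= tol) if (~zero).any() else 1.0
    mse = np.mean((bh - bt) ** 2)
    return s1, s2, mse
```

Values of S1 and S2 close to 1 and a small MSE indicate accurate selection and estimation, which is how the tables below should be read.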
5.1. Simulation 1
The first simulation uses the SCAD penalty function to select significant variables for the RMR-STN, RMR-SPE, and RMR-SCN models, and compares the simulation results with those of the Gaussian FMR model and the RMR-SN model. We set the number of components to two, so that the sample data set for the mixture regression model is generated from the following two-component model:

(37)

where a latent indicator identifies the subgroup to which each sample belongs, and the true regression coefficient vectors of the two components contain both zero and non-zero entries.
The covariate vector is generated from a multivariate normal distribution with mean 0, variance 1, and two correlation structures. The simulation considers the following three error distribution cases: 1) the random errors of both components follow the t-distribution with 3 degrees of freedom (t(3)); 2) the random errors of both components follow the chi-square distribution with 3 degrees of freedom (χ²(3)); 3) the random errors of both components follow a mixture of normal distributions. There are thus 15 combinations in total, and for each combination we performed 100 repetitions of the simulation.
From Table 1, the values of S1 in Com1 and Com2 from RMR-STN, RMR-SPE, and RMR-SCN are all bigger than those from the FMR model in all three cases. In case (1), S1 in Com2 from RMR-SPE is the biggest (S1 = 0.9533), while S1 in Com2 from FMR is the smallest (S1 = 0.8533). In case (2), the RMR-SCN model has the biggest S1 (S1 = 0.9033) in Com2, while S1 in Com2 from FMR is 0.8167. In case (3), whereas S2 in Com1 and Com2 from FMR are 0.9933 and 0.9950, respectively, the values of S2 in both components from RMR-STN are 1.00.
Furthermore, the values of ME in Com1 and Com2 from RMR-STN, RMR-SPE, and RMR-SCN are much smaller than those from the FMR model. When the errors are skewed, RMR-SN performs well, with the smallest ME and S2 = 1.00 in both Com1 and Com2, indicating that the non-zero coefficients are all identified correctly. Overall, RMR-SSMN models are more robust than FMR for variable selection when the data set is asymmetric (χ²(3)), heavy-tailed (t(3)), or contaminated (normal mixture).
5.2. Simulation 2
Simulation 2 uses the MIXL2-SCAD penalty function to select significant variables for the RMR-STN, RMR-SPE, and RMR-SCN models. By comparing the results of Simulation 1 and Simulation 2, the effects of the SCAD and MIXL2-SCAD penalty functions on variable selection are analyzed. The generation of the sample data set and the distributions of the random errors in this simulation are the same as in Simulation 1, and both a smaller and a larger sample size are considered.
Table 1. Results of FMR, RMR-SN, RMR-STN, RMR-SPE and RMR-SCN using SCAD penalty function on 100 replicates.
From Table 2, we can see that as the sample size n increases, the values of S1 and S2 in Com1 and Com2 get closer and closer to 1 and the value of ME gets smaller and smaller, indicating the asymptotic properties of the parameter estimates. In one combination of sample size and error distribution, the values of S1 and S2 in Com1 from the RMR-SPE model are both equal to 1.00, which indicates that the MIXL2-SCAD penalty allows the non-zero and zero coefficients to be identified completely. In another combination, the same result appears in Com2 from the RMR-SPE and RMR-SCN models. The absolute values of the mean estimate over all falsely identified non-zero predictors (ME) from MIXL2-SCAD are smaller than 0.01 at the larger sample size.

Table 2. Results of RMR-STN, RMR-SPE and RMR-SCN using MIXL2-SCAD penalty function on 100 replicates.
By comparing Table 1 and Table 2, we can see that the values of S1 and S2 in Com1 and Com2 from MIXL2-SCAD are all greater than or equal to those from the SCAD penalty in all cases at the common sample size. It is worth noting that in case (3), when n = 300, the values of S2 in Com1 and Com2 from RMR-STN, RMR-SPE, and RMR-SCN using the MIXL2-SCAD penalty are all 1.00, whereas the values of S2 in Com1 from RMR-SPE and RMR-SCN using the SCAD penalty are 0.9933. From these comparisons, we can conclude that MIXL2-SCAD performs better than the SCAD penalty in variable selection.
6. Real Data Analysis
In this section, we analyze the Seoul bike sharing demand data set from the website http://archive.ics.uci.edu/ml/datasets.php. From this data set, we extract the total number of bikes rented from 10:00 am to 11:00 am on every functional day of the bike rental system in Seoul from December 1, 2017 to November 30, 2018, together with 12 features that may affect the demand for rental bikes. There are 353 observations in total. The 12 features are: temperature (x₁), humidity (x₂), wind speed (x₃), visibility (x₄), dew point temperature (x₅), solar radiation (x₆), rainfall (x₇), snowfall (x₈), holiday (holiday = 1, else = 0; x₉), spring (spring = 1, else = 0; x₁₀), summer (summer = 1, else = 0; x₁₁), and autumn (autumn = 1, else = 0; x₁₂). Here x₉–x₁₂ are dummy variables, and x₁₀–x₁₂ indicate the different seasons. Considering that there may be further differential effects between seasons and holidays, we also introduce 3 interaction terms between the dummy variables, namely x₉x₁₀, x₉x₁₁, and x₉x₁₂. This leads to a set of 15 potential covariates affecting the rented bike count (RBC) from 10:00 am to 11:00 am.
Let the RBC standardized by its standard deviation be the response variable. Figure 1 shows the histogram and density estimate of the response; the data set displays obvious heterogeneity, so the RMR-STN model is applicable. We also applied the RMR-SPE and RMR-SCN models to this real data set, but their outcomes were worse than the RMR-STN results, so we do not report them here.
The parameter estimates under the FMR model and the RMR-STN models with two and three components, obtained with the BIC method and the MIXL2-SCAD penalty function, are given in Table 3. The three-component RMR-STN model has the lowest BIC (542.5), the two-component RMR-STN model ranks second (BIC = 544.7), and the FMR model has the biggest BIC (562.8). Furthermore, the predicted rented bike count from the three-component RMR-STN model has the smallest MSE of 0.09 and the biggest regression R² of 0.90.
From Table 3, under the three-component RMR-STN model, bike rental demand during the period from 10:00 am to 11:00 am can be divided into three categories: "low", "medium" and "high".

Figure 1. Histogram and density estimate for the standardized rented bike count.

Table 3. Summary of the FMR, two-component RMR-STN, and three-component RMR-STN models with the BIC method and MIXL2-SCAD penalty for the Seoul bike sharing demand data set.

Humidity is a negative factor for all three types of demand. When bike rental demand is "medium", warmer temperatures and increased solar radiation help increase demand, while rainfall, snowfall, and holidays reduce it. In contrast, when demand is "high", the positive effect of dew point temperature on bike demand is greatest, while the negative effects of holidays and snowfall disappear. In addition, the rented bike count shows strong seasonality: rental counts are higher in the other seasons than in winter.
7. Conclusion
In this paper, we propose a robust mixture regression model based on the skew scale mixtures of normal distributions (RMR-SSMN), which avoids the potential limitations of normal mixtures. A new penalty function (MIXL2-SCAD), combining the SCAD and ℓ₂ penalties, is presented for variable selection. Through simulations, we find that RMR-SSMN models are more robust than general FMR models for heterogeneous data with asymmetry, heavy tails, and outliers. Furthermore, the capability of MIXL2-SCAD to select the most parsimonious model is clearly better than that of SCAD. The proposed methodology is applied to a real data set and achieves reasonable results. However, this paper only focuses on mixtures of simple linear models; further research could address mixtures of semiparametric or nonparametric models.