1. Introduction
Empirical likelihood is a nonparametric method first proposed by Owen [1] [2] [3]. It is an estimation method inspired by maximum likelihood, but it does not require assumptions about the distribution of the data, so potential problems of model misspecification can be avoided. Because of this robustness, and because empirical likelihood inherits many desirable properties of parametric likelihood, it has been extended to linear models, correlation models, variance models [3], general estimating equations [4], generalized linear models [5], longitudinal data analysis [6] [7], etc.
The Bayesian method based on empirical likelihood enjoys the advantages of Bayesian inference while avoiding the risk of incorrect model assumptions, and it has therefore received extensive attention from scholars and developed rapidly. Bayesian empirical likelihood was first proposed by Lazar [8]. It is a semiparametric method that combines a parametric prior with a nonparametric likelihood: it makes use not only of population and sample information but also of prior information, which, after processing, forms a prior distribution that enters the statistical inference. Lazar [8] replaced the likelihood function in Bayes' theorem with an empirical likelihood function and used Monte Carlo simulation to demonstrate the validity of the resulting posterior distribution. Zhong and Ghosh [9] studied some higher-order properties of Bayesian empirical likelihood. Li, Zhao and Dong [10] applied Bayesian empirical likelihood to linear regression models with censored data. Bedoui and Lazar [11] proposed Bayesian empirical likelihood for lasso and ridge regression. Moon and Bedoui [12] proposed an empirical-likelihood-based Bayesian elastic net model that combines the interpretability and robustness of Bayesian empirical likelihood methods and can be used for variable selection. In addition, Bayesian empirical likelihood has been extended to quantile structural equation modeling [13], quantile regression [14], etc.
Variable selection under the Bayesian framework introduces penalty terms into the model in the form of parameter priors. For example, Park and Casella [15] used a conditional Laplace prior for a complete Bayesian analysis and proposed the Bayesian lasso. Li and Lin [16] proposed the Bayesian elastic net using an informative prior. Mallick and Yi [17] proposed a new Bayesian lasso method based on a uniform scale mixture of the Laplace density. Variable selection based on Bayesian empirical likelihood replaces the parametric likelihood function in Bayes' theorem with a nonparametric likelihood function, so that inference can proceed without assumptions about the distribution of the data, avoiding problems caused by misspecified models.
This paper is divided into six sections. Section 1 reviews the research status of empirical likelihood and Bayesian empirical likelihood, and how variable selection can be carried out based on Bayesian empirical likelihood. Section 2 derives the empirical likelihood function for linear models. Section 3 introduces the basics of Bayesian empirical likelihood. Section 4 is the focus of this paper: L1/2 regularization based on Bayesian empirical likelihood is proposed, and the penalty term is added to the model in the form of a generalized Gaussian prior. Section 5 uses simulation to verify the effectiveness of the proposed method when the error violates the zero-mean normality assumption of the standard parametric model. Section 6 concludes the paper.
2. Empirical Likelihood Inference for Linear Models
Suppose we observe a set of data $(x_i, y_i)$, $i = 1, \dots, n$. If the relationship between $x$ and $y$ is linear, it can be represented by the following mathematical model:

$$y_i = \beta_0 + x_i^{\mathrm{T}} \beta + \varepsilon_i \quad (1)$$

where $x_i = (x_{i1}, \dots, x_{ip})^{\mathrm{T}}$ is the predictor variable, $y_i$ is the response variable, $\beta_0$ is the unknown intercept, $\beta_j$ is the unknown slope of the explanatory variable $x_{ij}$, and $\varepsilon_i$ is the error. In the standard parametric model, we generally assume that the errors are independent and follow a normal distribution with mean zero and constant variance. In empirical likelihood, however, we relax the distributional assumption on the error: the error distribution need not satisfy the zero-mean normality assumption. Next, without loss of generality, we assume that both the predictor and response variables are standardized, so that the intercept term $\beta_0$ is equal to zero.
Let $X = (x_1, \dots, x_n)^{\mathrm{T}}$, $Y = (y_1, \dots, y_n)^{\mathrm{T}}$, $\beta = (\beta_1, \dots, \beta_p)^{\mathrm{T}}$ and $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)^{\mathrm{T}}$, where $X$ is the $n \times p$ design matrix, $Y$ is the $n \times 1$ vector of responses, $\beta$ is the $p \times 1$ vector of regression coefficients, and $\varepsilon$ is the $n \times 1$ vector of errors. Then the above multiple linear regression model can be expressed as:

$$Y = X\beta + \varepsilon \quad (2)$$
Also, in linear models, regression coefficients are generally estimated by minimizing the residual sum of squares $\|Y - X\beta\|^2$. Using the matrix notation defined above and assuming $X^{\mathrm{T}}X$ is invertible, the normal equation is obtained:

$$X^{\mathrm{T}}(Y - X\beta) = 0$$

That is, the regression coefficient satisfies the following estimating equation:

$$\sum_{i=1}^{n} x_i \left( y_i - x_i^{\mathrm{T}} \beta \right) = 0$$
Defining the auxiliary variables $Z_i(\beta) = x_i (y_i - x_i^{\mathrm{T}} \beta)$, the profile empirical likelihood ratio of the regression parameters $\beta$ can be obtained as follows:

$$R(\beta) = \max\left\{ \prod_{i=1}^{n} n p_i \;\middle|\; p_i \ge 0, \; \sum_{i=1}^{n} p_i = 1, \; \sum_{i=1}^{n} p_i Z_i(\beta) = 0 \right\} \quad (3)$$
Then the Lagrange multiplier method is applied to solve for the weights $p_i$ that satisfy formula (3). Finding the $p_i$ that maximize $\prod_{i=1}^{n} n p_i$ is equivalent to finding the $p_i$ that maximize $\sum_{i=1}^{n} \log(n p_i)$. Let

$$G = \sum_{i=1}^{n} \log(n p_i) + \gamma \left( 1 - \sum_{i=1}^{n} p_i \right) - n \lambda^{\mathrm{T}} \sum_{i=1}^{n} p_i Z_i(\beta) \quad (4)$$
where $\gamma$ and $\lambda$ are Lagrange multipliers. Setting the partial derivatives of $G$ with respect to $p_i$, $\gamma$ and $\lambda$ to zero, the following equations can be obtained:

$$\begin{cases} \dfrac{1}{p_i} - \gamma - n \lambda^{\mathrm{T}} Z_i(\beta) = 0, \quad i = 1, \dots, n, & \text{①} \\[6pt] \displaystyle\sum_{i=1}^{n} p_i = 1, \qquad \sum_{i=1}^{n} p_i Z_i(\beta) = 0 & \text{②} \end{cases} \quad (5)$$
By multiplying both sides of Equation ① in formula (5) by $p_i$ and summing over $i$, we can get

$$n - \gamma - n \lambda^{\mathrm{T}} \sum_{i=1}^{n} p_i Z_i(\beta) = n - \gamma = 0$$

That is, $\gamma = n$. Then substituting $\gamma = n$ into Equation ① in formula (5), we get

$$p_i = \frac{1}{n \left( 1 + \lambda^{\mathrm{T}} Z_i(\beta) \right)}$$

Then the profile empirical likelihood function of the regression coefficient $\beta$ can be written, and it is given by $L_E(\beta) = \prod_{i=1}^{n} p_i$, where

$$L_E(\beta) = \prod_{i=1}^{n} \frac{1}{n \left( 1 + \lambda^{\mathrm{T}} Z_i(\beta) \right)} \quad (6)$$
Substituting the expression for $p_i$ into Equation ② in formula (5), the Lagrange multiplier $\lambda$ can be solved from the following equation:

$$\frac{1}{n} \sum_{i=1}^{n} \frac{Z_i(\beta)}{1 + \lambda^{\mathrm{T}} Z_i(\beta)} = 0 \quad (7)$$
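For concreteness, equation (7) can be solved numerically. The sketch below is an illustrative implementation (not code from the paper): it uses a damped Newton iteration for $\lambda$ and then evaluates the profile log empirical likelihood; function names and tolerances are our own choices.

```python
import numpy as np

def el_lambda(Z, n_iter=100, tol=1e-10):
    """Solve equation (7): (1/n) sum_i Z_i / (1 + lam'Z_i) = 0, by damped Newton."""
    n, p = Z.shape
    lam = np.zeros(p)
    for _ in range(n_iter):
        denom = 1.0 + Z @ lam                       # 1 + lam'Z_i, kept positive
        grad = (Z / denom[:, None]).mean(axis=0)    # left-hand side of (7)
        if np.linalg.norm(grad) < tol:
            break
        H = -(Z.T / denom**2) @ Z / n               # Jacobian of (7) w.r.t. lam
        step = np.linalg.solve(H, grad)
        t = 1.0
        while np.any(1.0 + Z @ (lam - t * step) <= 0):  # keep all p_i > 0
            t /= 2.0
        lam = lam - t * step
    return lam

def log_el(beta, X, y):
    """Profile log empirical likelihood: log L_E(beta) = -sum_i log(n(1+lam'Z_i))."""
    Z = X * (y - X @ beta)[:, None]                 # auxiliary variables Z_i(beta)
    lam = el_lambda(Z)
    n = len(y)
    return -np.sum(np.log(n * (1.0 + Z @ lam)))
```

At the least squares solution the estimating equation $\sum_i Z_i(\beta) = 0$ holds, so $\lambda = 0$ solves (7), each $p_i = 1/n$, and the profile log empirical likelihood attains its maximum value $-n \log n$.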
Next, it is shown that under some regularity conditions, if $\hat{\beta}$ maximizes the profile log empirical likelihood function $\log L_E(\beta)$, then $\hat{\beta}$ converges in probability to the true value $\beta_0$.

Theorem (Consistency) Under some regularity conditions, if

$$\hat{\beta} = \arg\max_{\beta} \log L_E(\beta),$$

then $\hat{\beta} \xrightarrow{P} \beta_0$.
Proof: Let $\ell(\beta) = -\log R(\beta) = \sum_{i=1}^{n} \log\left(1 + \lambda^{\mathrm{T}} Z_i(\beta)\right)$, and denote $\beta = \beta_0 + u n^{-1/3}$ for $\beta$ in the sphere $\|\beta - \beta_0\| \le n^{-1/3}$, where $\|u\| \le 1$. Owen [2] proved that when $\|\beta - \beta_0\| \le n^{-1/3}$, there is

$$\lambda(\beta) = O_p\left(n^{-1/3}\right) \quad \text{uniformly in } \beta.$$

Then performing a Taylor expansion on $\ell(\beta)$ on the surface $\|u\| = 1$, we get:

$$\ell(\beta) \ge \frac{c}{2}\, n^{1/3} \left(1 + o_p(1)\right),$$

where $c$ is the smallest eigenvalue of $E\left[Z_1(\beta_0) Z_1(\beta_0)^{\mathrm{T}}\right]$. Similarly, it can also be shown that $\ell(\beta_0) = O_p(1)$. Since $\ell(\beta)$ is continuous with respect to $\beta$ and is of strictly smaller order at $\beta_0$ than on the surface of the sphere, $\ell(\beta)$ has a minimum value in the interior of the sphere, that is, $\|\hat{\beta} - \beta_0\| \le n^{-1/3} \to 0$, so $\hat{\beta} \xrightarrow{P} \beta_0$.
3. Bayesian Empirical Likelihood
Penalized linear regression and Bayesian linear regression are closely related: penalized estimates can be interpreted as Bayesian posterior estimates of the parameters under certain priors. For linear models:

$$Y = X\beta + \varepsilon \quad (8)$$
Under the regularization framework, assuming the noise follows a Gaussian distribution, the regularized least squares method corresponds, from a probabilistic perspective, to maximum a posteriori estimation, namely

$$\hat{\beta}_{\mathrm{MAP}} = \arg\max_{\beta} \; p(Y \mid X, \beta)\, \pi(\beta)$$

Then the maximum a posteriori estimate of the parameter $\beta$ is

$$\hat{\beta}_{\mathrm{MAP}} = \arg\min_{\beta} \left\{ \frac{1}{2\sigma^2} \|Y - X\beta\|^2 - \log \pi(\beta) \right\}$$

where $\pi(\beta)$ is the prior distribution of the parameter $\beta$. When the parameter $\beta$ follows a Laplace distribution, L1 regularization is derived; when the parameter $\beta$ follows a Gaussian distribution, L2 regularization is derived. It can thus be seen that lasso regression and ridge regression are closely related to Bayesian linear models with different priors placed on the parameters.
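The Gaussian-prior case can be checked numerically: the ridge estimate obtained in closed form also makes the gradient of the negative log posterior vanish, i.e. it is the MAP estimate. A minimal sketch, with the noise variance `sigma2` and prior variance `tau2` assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=n)

sigma2 = 1.0           # noise variance (taken as known for this illustration)
tau2 = 0.5             # Gaussian prior variance: beta_j ~ N(0, tau2)
alpha = sigma2 / tau2  # implied L2 penalty weight

# Ridge estimate in closed form: argmin ||y - X b||^2 + alpha ||b||^2
beta_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

# Negative log posterior under the Gaussian prior (up to an additive constant)
def neg_log_post(b):
    return np.sum((y - X @ b) ** 2) / (2 * sigma2) + np.sum(b ** 2) / (2 * tau2)

# Its gradient vanishes at the ridge solution, so ridge = MAP
grad = X.T @ (X @ beta_ridge - y) / sigma2 + beta_ridge / tau2
```

The same argument with a Laplace prior reproduces the lasso objective, though without a closed-form minimizer.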
The Bayesian empirical likelihood is as follows. Let $(x_i, y_i)$, $i = 1, \dots, n$, be independent multivariate random variables from an unknown distribution $F$, where $F$ depends on the parameter $\beta$. Assume that both the predictor and response variables are standardized, so that the intercept term is zero. Let the prior of $\beta$ be $\pi(\beta)$. When the data distribution is unknown, replacing the parametric likelihood function in Bayes' theorem with the empirical likelihood function, the posterior empirical likelihood density is

$$\pi(\beta \mid Y, X) = \frac{L_E(\beta)\, \pi(\beta)}{\int L_E(\beta)\, \pi(\beta)\, d\beta} \quad (9)$$
Combining the empirical likelihood inference for multiple linear regression in Section 2, we can obtain the posterior inference of the Bayesian empirical likelihood for linear regression as

$$\pi(\beta \mid Y, X) \propto \pi(\beta) \prod_{i=1}^{n} \frac{1}{n \left( 1 + \lambda^{\mathrm{T}} Z_i(\beta) \right)} \quad (10)$$
4. L1/2 Regularization Inference Based on Bayesian Empirical Likelihood
4.1. Hierarchical Model
Linear regression with L1/2 regularization penalizes the magnitude of the regression coefficients by imposing an L1/2 penalty, that is, it minimizes the following penalized residual sum of squares:

$$\min_{\beta} \; \|Y - X\beta\|^2 + \eta \sum_{j=1}^{p} |\beta_j|^{1/2} \quad (11)$$
Without loss of generality, we assume that the data are standardized and the intercept term is 0. In formula (11), $Y$ is the $n \times 1$ response vector, $\beta$ is the $p \times 1$ coefficient vector, $X$ is the $n \times p$ design matrix, and the tuning parameter $\eta \ge 0$ controls the degree of penalty: the larger the value of $\eta$, the greater the shrinkage of the regression parameters.
By observing the form of the penalty term in (11), we find that the regression parameter $\beta$ in L1/2 regularization has the form of an independent and identical zero-mean generalized Gaussian prior. The density function of the zero-mean generalized Gaussian distribution (GGD) is:

$$f(x; s, q) = \frac{q}{2 s\, \Gamma(1/q)} \exp\left\{ -\left( \frac{|x|}{s} \right)^{q} \right\}$$

where $\Gamma(\cdot)$ is the gamma function, $s > 0$ is the scale parameter, and $q > 0$ is the shape parameter that controls the decay rate of the tail of the distribution. There are two special cases of the GGD: when $q = 1$, it corresponds to the Laplace distribution, and when $q = 2$, it corresponds to the normal distribution.
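The two special cases are easy to verify directly from the density formula. The short check below (illustrative only) confirms that $q = 1$ recovers the Laplace density with scale $s$, and $q = 2$ recovers a normal density with $\sigma = s/\sqrt{2}$:

```python
import math

def ggd_pdf(x, s, q):
    """Zero-mean generalized Gaussian density: q/(2 s Gamma(1/q)) exp(-(|x|/s)^q)."""
    return q / (2.0 * s * math.gamma(1.0 / q)) * math.exp(-((abs(x) / s) ** q))

def laplace_pdf(x, b):
    """Laplace(0, b) density."""
    return math.exp(-abs(x) / b) / (2.0 * b)

def normal_pdf(x, sigma):
    """N(0, sigma^2) density."""
    return math.exp(-x * x / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))
```

Note the scale convention: because the GGD exponent is $(|x|/s)^2$ rather than $x^2/(2\sigma^2)$, the $q = 2$ case matches a normal distribution with $\sigma = s/\sqrt{2}$.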
Combining the above connections, on the basis of Park and Casella [15], we consider placing on the regression parameters $\beta$ a generalized Gaussian prior with mean 0, shape parameter $q = 1/2$, and scale parameter $s$. The expression is as follows:

$$\pi(\beta) = \prod_{j=1}^{p} \frac{1}{4s} \exp\left\{ -\left( \frac{|\beta_j|}{s} \right)^{1/2} \right\} \quad (12)$$
Although most of the existing literature expresses the generalized Gaussian distribution as a scale mixture of normal distributions, this representation is not suitable for the Bayesian bridge model with the L1/2 penalty. Therefore, other representations need to be explored. In this paper, the generalized Gaussian distribution is expressed as a mixture of a uniform distribution and a gamma distribution, that is:

$$f(x; s, q) = \int_{0}^{\infty} \mathrm{Uniform}\left( x; -s u^{1/q},\, s u^{1/q} \right) \mathrm{Gamma}(u; 1 + 1/q, 1)\, du$$

so that, for $q = 1/2$, drawing $u_j \sim \mathrm{Gamma}(3, 1)$ and $\beta_j \mid u_j \sim \mathrm{Uniform}\left(-s u_j^{2},\, s u_j^{2}\right)$ yields, marginally, the generalized Gaussian prior (12).
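The uniform–gamma mixture is easy to verify by simulation. For the GGD with shape $1/2$ and scale $s$, the mean absolute value is $E|\beta| = s\,\Gamma(4)/\Gamma(2) = 6s$, so drawing $u \sim \mathrm{Gamma}(3, 1)$ and $\beta \mid u \sim \mathrm{Uniform}(-su^2, su^2)$ should reproduce this moment. A Monte Carlo sketch (illustrative; the value of `s` and the sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)
s = 0.5          # scale parameter of the target GGD with shape q = 1/2
m = 400_000      # Monte Carlo sample size

u = rng.gamma(shape=3.0, scale=1.0, size=m)      # mixing variable u ~ Gamma(3, 1)
beta = rng.uniform(-s * u**2, s * u**2)          # beta | u ~ Uniform(-s u^2, s u^2)

# For the GGD with shape 1/2 and scale s: E|beta| = s * Gamma(4)/Gamma(2) = 6 s
mean_abs = np.abs(beta).mean()
```

The empirical mean of $|\beta|$ should be close to $6s$, and the sample mean of $\beta$ close to 0, consistent with the zero-mean GGD marginal.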
Then, without assuming the distributional form of the data, the empirical likelihood function is used in place of the parametric likelihood function, and the Bayesian hierarchical model can be expressed as:

$$\begin{aligned} Y \mid X, \beta &\sim L_E(\beta), \\ \beta_j \mid u_j, s &\sim \mathrm{Uniform}\left(-s u_j^{2},\, s u_j^{2}\right), \quad j = 1, \dots, p, \\ u_j \mid \eta &\sim \mathrm{Gamma}(3, \eta), \quad j = 1, \dots, p, \\ s &\sim \pi(s), \qquad \eta \sim \mathrm{Gamma}(c, d) \end{aligned} \quad (13)$$

In the above hierarchical model, we choose $\pi(s) = \mathrm{InverseGamma}(a, b)$. Assuming that the priors of different parameters are independent, the joint posterior density can be expressed as:

$$\pi(\beta, u, s, \eta \mid Y, X) \propto L_E(\beta) \prod_{j=1}^{p} \left[ \frac{1}{2 s u_j^{2}}\, I\left(|\beta_j| < s u_j^{2}\right) \cdot \frac{\eta^{3}}{\Gamma(3)}\, u_j^{2} e^{-\eta u_j} \right] \pi(s)\, \pi(\eta) \quad (14)$$
Given $u$, $s$, $\eta$, $Y$ and $X$, the full conditional distribution of $\beta$ is:

$$\pi(\beta \mid u, s, \eta, Y, X) \propto L_E(\beta) \prod_{j=1}^{p} I\left(|\beta_j| < s u_j^{2}\right) \quad (15)$$

From the expression of the full conditional distribution of $\beta$, we know that it has no closed form.
Similarly, given $\beta$, $s$, $\eta$, $Y$ and $X$, the full conditional distribution of $u_j$ is:

$$\pi(u_j \mid \beta, s, \eta) \propto e^{-\eta u_j}\, I\left( u_j > \left( |\beta_j| / s \right)^{1/2} \right) \quad (16)$$
Analogously, given $\beta$, $u$, $\eta$, $Y$ and $X$, the full conditional distribution of $s$ is:

$$\pi(s \mid \beta, u, \eta) \propto s^{-(a + p) - 1} e^{-b/s}\, I\left( s > \max_{j} |\beta_j| / u_j^{2} \right) \quad (17)$$
From the expression of the prior distribution, we find that the tuning parameter $\eta$ enters the model as a hyperparameter that controls the precision of the prior distribution. The larger the value of $\eta$, the more concentrated the prior distribution is around mean 0; the smaller the value of $\eta$, the more dispersed the prior distribution is around mean 0. In this paper, we specify a gamma prior $\mathrm{Gamma}(c, d)$ for the penalty parameter $\eta$.
In model (13), when the latent variable $u$ is marginalized out and the generalized Gaussian prior is used directly, the full conditional distribution of $\beta$ given $s$, $\eta$, $Y$ and $X$ is:

$$\pi(\beta \mid s, \eta, Y, X) \propto L_E(\beta) \exp\left\{ -\eta \sum_{j=1}^{p} \left( \frac{|\beta_j|}{s} \right)^{1/2} \right\} \quad (18)$$
4.2. The Framework of the Algorithm
Regarding $u$, $s$ and $\eta$ in the Bayesian hierarchical model, this paper uses the Gibbs algorithm to sample.

1) The full conditional distribution of $u_j$ is the left-truncated exponential distribution in (16), and two-step sampling is considered: first generate $z$ from the exponential distribution $\mathrm{Exp}(\eta)$, and then let $u_j = z + \left( |\beta_j| / s \right)^{1/2}$.

2) The full conditional distribution of $s$ is the left-truncated inverse gamma distribution in (17), and two-step sampling is considered: first generate $v$ from the gamma distribution $\mathrm{Gamma}(a + p, b)$ right-truncated at $\left( \max_j |\beta_j| / u_j^{2} \right)^{-1}$, and then let $s = 1/v$.

3) The full conditional distribution of $\eta$ is the gamma distribution, and $\eta$ is generated directly from the gamma distribution $\mathrm{Gamma}\left( c + 3p,\; d + \sum_{j=1}^{p} u_j \right)$.
Regarding the regression parameter $\beta$, since its full conditional distribution has no closed form, this paper considers sampling with the tailored M-H algorithm adopted by Chib [18] and Bedoui [11]. The candidate-generating density in the M-H algorithm is a multivariate t distribution: its location parameter is the mode of the log empirical likelihood function for the linear model, and its dispersion matrix is the inverse of the negative Hessian matrix of the log empirical likelihood function evaluated at this mode.
5. Simulation
In this section, simulation experiments are performed to verify the effectiveness of L1/2 regularization based on Bayesian empirical likelihood (BEL). We generate data from the following multiple linear regression model:

$$Y = X\beta + \varepsilon$$

where $Y$ is the $n \times 1$ response variable, $X$ is an $n \times p$ design matrix, $\varepsilon$ is the $n \times 1$ error vector, and $n$ is the sample size. The rows of the design matrix $X$ come from a multivariate Gaussian distribution with mean zero and covariance matrix $\Sigma$. The regression coefficients $\beta$ form a $p \times 1$ vector.
In standard parametric models, it is generally assumed that the errors follow a normal distribution with zero mean. In empirical likelihood, however, no assumption about the error distribution is needed, which avoids misspecifying the error distribution and makes the model more robust.
We assume the error violates the zero-mean normality assumption of the standard parametric model: $\varepsilon_i$ is independent and identically distributed from a normal distribution with mean −3 and variance 32. Under this model, we generate training datasets with three different sample sizes ($n$ = 50, 100, 200) and produce test sets of the same sizes. Furthermore, the Bayesian empirical likelihood-based L1/2 regularization method (BEL) proposed in this paper is compared with the Bayesian bridge regression model based on a scale mixture of normals for the generalized Gaussian density (BBR.N) proposed by Polson [19], the Bayesian bridge regression model based on scale mixtures of triangular distributions for the generalized Gaussian density (BBR.T) proposed by Polson [19], and the Bayesian lasso model (BLASSO) proposed by Park and Casella [15]. The exponent of the regularization term in BBR.N and BBR.T is set to 1/2, corresponding to the L1/2 penalty with a parametric likelihood function.
For the hyperparameters in the hierarchical model, we choose a = 10, b = 0.1, c = 2 and d = 2 in the numerical simulations. We generate 50 training datasets, that is, we repeat the experiment 50 times; in each experiment the model is fitted on the training dataset with 15,000 iterations, the first 5000 iterations are discarded as burn-in, and the mean of the regression coefficients over the last 10,000 iterations is taken as the estimate. Performance is then calculated on the test dataset.
The evaluation indicators are the mean squared error (MSE) and the mean absolute deviation (MAE) on the test set, calculated as follows:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - x_i^{\mathrm{T}} \hat{\beta} \right)^2 \quad (19)$$

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - x_i^{\mathrm{T}} \hat{\beta} \right| \quad (20)$$
In order to exclude the influence of possible extreme values, this paper uses the median of these 50 values to evaluate the performance of the four methods, namely the median of mean squared error (MMSE) and the median of mean absolute deviation (MMAE).
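For reference, the evaluation pipeline amounts to a few lines; the helper names below are our own:

```python
import numpy as np

def mse(y, X, beta_hat):
    """Mean squared error on a test set, as in (19)."""
    r = y - X @ beta_hat
    return np.mean(r ** 2)

def mae(y, X, beta_hat):
    """Mean absolute deviation on a test set, as in (20)."""
    return np.mean(np.abs(y - X @ beta_hat))

def median_over_replications(values):
    """Median across the replications (MMSE / MMAE), robust to extreme runs."""
    return float(np.median(values))
```

Taking the median rather than the mean over the 50 replications is what makes the summary insensitive to an occasional badly mixed chain.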
Table 1 shows the values of the median of mean squared error and the median of mean absolute deviation for the four methods at three different sample sizes. As can be seen from Table 1, when the error distribution violates the zero-mean normality assumption of the standard parametric model, especially when the sample size is small ($n$ = 50, 100), the BEL method outperforms the other three methods. Moreover, as the sample size increases, the MMSE and MMAE values of all four methods show a downward trend.
Figure 1 shows boxplots of the MSE values computed on the test set for the four methods at three different sample sizes. It can also be seen from the figure that when the sample size is small ($n$ = 50, 100), the BEL method is significantly better than the other three methods, and when the sample size is 200, the BEL method is slightly better than the other three methods. In general, the BEL method performs better in small samples when the error violates the zero-mean normality assumption.
Figure 1. Boxplots of the values of MSE for the four methods.
Table 1. Values of MMSE and MMAE for the four methods.
Figure 2 shows boxplots of the 50 MAE values calculated on the test set by the four methods over the 50 repetitions at each sample size. When the number of observations is 50 or 100, the MMAE of the BEL method is significantly smaller than that of the other three methods. When the number of observations is 200, the MMAE of the BEL method is slightly smaller than that of the other three methods.
Table 2 shows the number of times each component of the regression coefficients is excluded, using the scaled neighborhood criterion proposed by Li and Lin [16], on the 50 training datasets at the three different sample sizes. It can be seen from Table 2 that all four methods can distinguish important from unimportant variables, that is, they can perform variable selection. When the number of observations is 50 or 100, the BEL method identifies the non-zero variables more accurately.
Figure 2. Boxplots of the values of MAE for the four methods.
Table 2. The number of times the regression component was removed based on 50 repetitions of the simulation.
6. Conclusions
This paper proposes a new method for variable selection: L1/2 regularization based on Bayesian empirical likelihood. This method introduces the L1/2 penalty into the model in the form of a generalized Gaussian prior, replaces the parametric likelihood function in Bayes' theorem with a nonparametric likelihood function, derives the posterior distribution through a Bayesian hierarchical model, and then uses an MCMC method to sample from the posterior distribution. Simulations demonstrate that the proposed BEL method outperforms BBR.N, BBR.T and BLASSO when the errors violate the zero-mean normality assumption of standard parametric models; especially when the sample size is small, the prediction accuracy of the BEL method is better. In addition, the proposed method performs variable selection well.
Subsequent research may consider Bayesian empirical likelihood inference combining the L1/2 penalty and the L2 penalty, which is a flexible penalty method. One may also consider placing a spike-and-slab prior on the parameters, with the following expression:

$$\pi(\beta_j \mid \gamma_j) = \gamma_j\, g(\beta_j) + (1 - \gamma_j)\, \delta_0(\beta_j), \quad j = 1, \dots, p \quad (21)$$

where $\gamma_j \in \{0, 1\}$ is a latent indicator variable, $\delta_0(\cdot)$ is a point mass at zero, and $g(\cdot)$ is a continuous slab density. When $\gamma_j = 1$, it indicates that the jth predictor is more important and should be kept. When $\gamma_j = 0$, it indicates that the jth predictor is not important and should be removed from the model. Compared with placing a single prior distribution on the parameters, this mixed prior can well combine the advantages of variable selection and sparse recovery.