1. Introduction
Empirical likelihood is a nonparametric method first proposed by Owen [1] [2] [3]. It is an estimation method inspired by maximum likelihood, but it does not require assumptions about the distribution of the data, so potential problems of model misspecification can be avoided. Because of this robustness, and because empirical likelihood inherits many desirable properties of parametric likelihood, it has been extended to linear models, correlation models, variance models [3], general estimating equations [4], generalized linear models [5], longitudinal data analysis [6] [7], etc.
The Bayesian method based on empirical likelihood enjoys the advantages of Bayesian inference while avoiding the risk of incorrect model assumptions, and it has therefore received extensive attention from scholars and developed rapidly. Bayesian empirical likelihood was first proposed by Lazar [8]. It is a semiparametric method that combines a parametric prior with a nonparametric likelihood: it makes use not only of population and sample information but also of prior information, which, after processing, forms a prior distribution that enters the statistical inference. Lazar [8] replaced the likelihood function in Bayes' theorem with an empirical likelihood function and used Monte Carlo simulation to demonstrate the validity of the resulting posterior distribution. Zhong and Ghosh [9] studied some higher-order properties of Bayesian empirical likelihood. Li, Zhao and Dong [10] applied Bayesian empirical likelihood to linear regression models with censored data. Bedoui and Lazar [11] proposed Bayesian empirical likelihood for lasso and ridge regression. Moon and Bedoui [12] proposed an empirical-likelihood-based Bayesian elastic net model that combines the interpretability and robustness of Bayesian empirical likelihood methods and can be used for variable selection. In addition, Bayesian empirical likelihood has been extended to quantile structural equation modeling [13], quantile regression [14], etc.
Variable selection under the Bayesian framework introduces penalty terms into the model in the form of parameter priors. For example, Park and Casella [15] used a conditional Laplace prior for a complete Bayesian analysis and proposed the Bayesian lasso. Li and Lin [16] proposed the Bayesian elastic net using an informative prior. Mallick and Yi [17] proposed a new Bayesian lasso method based on a uniform scale mixture of the Laplace density. Variable selection based on Bayesian empirical likelihood replaces the parametric likelihood function in Bayes' theorem with a nonparametric likelihood function, so that inference can proceed without assumptions about the distribution of the data, avoiding problems caused by misspecified models.
This paper is divided into six sections. Section 1 reviews the research status of empirical likelihood and Bayesian empirical likelihood, and how variable selection can be carried out based on Bayesian empirical likelihood. Section 2 derives the empirical likelihood function for linear models. Section 3 introduces the basics of Bayesian empirical likelihood. Section 4 is the focus of this paper: L1/2 regularization based on Bayesian empirical likelihood is proposed, and the penalty term is added to the model in the form of a generalized Gaussian prior. Section 5 uses simulation to verify the effectiveness of the proposed method when the error violates the zero-mean normality assumption of the standard parametric model. Section 6 concludes the paper.
2. Empirical Likelihood Inference for Linear Models
Suppose we observe a set of data $(x_i, y_i)$, $i = 1, \dots, n$. If the relationship between $x$ and $y$ is linear, it can be represented by the following mathematical model:

$$y_i = \beta_0 + x_i^{\mathrm{T}} \beta + \varepsilon_i \quad (1)$$

where $x_i = (x_{i1}, \dots, x_{ip})^{\mathrm{T}}$ is the predictor variable, $y_i$ is the response variable, $\beta_0$ is the unknown intercept, $\beta_j$ is the unknown slope of the explanatory variable $x_{ij}$, and $\varepsilon_i$ is the error. In the standard parametric model, we generally assume that the errors are independent and follow a normal distribution with mean zero and constant variance. In empirical likelihood, however, we relax the distributional assumption on the error: the error distribution need not satisfy the zero-mean normality assumption. Next, without loss of generality, we assume that both the predictor and response variables are standardized, so that the intercept term $\beta_0$ is equal to zero.
Let $X = (x_1, \dots, x_n)^{\mathrm{T}}$, $Y = (y_1, \dots, y_n)^{\mathrm{T}}$, $\beta = (\beta_1, \dots, \beta_p)^{\mathrm{T}}$ and $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)^{\mathrm{T}}$, where $X$ is the $n \times p$ design matrix, $Y$ is the $n \times 1$ vector of responses, $\beta$ is the $p \times 1$ vector of regression coefficients, and $\varepsilon$ is the $n \times 1$ vector of errors. Then the above multiple linear regression model can be expressed as:

$$Y = X\beta + \varepsilon \quad (2)$$
Also, in linear models, regression coefficients are generally estimated by minimizing the residual sum of squares $\|Y - X\beta\|^2$. Using the matrix notation defined above and assuming $X^{\mathrm{T}}X$ is invertible, the normal equation is obtained:

$$X^{\mathrm{T}}(Y - X\beta) = 0$$

That is, the regression coefficient satisfies the following estimating equation:

$$\sum_{i=1}^{n} x_i \left( y_i - x_i^{\mathrm{T}} \beta \right) = 0$$
Defining the auxiliary variables $Z_i(\beta) = x_i (y_i - x_i^{\mathrm{T}} \beta)$, the profile empirical likelihood ratio of the regression parameters $\beta$ can be obtained as follows:

$$R(\beta) = \max\left\{ \prod_{i=1}^{n} n p_i \;\middle|\; p_i \ge 0, \; \sum_{i=1}^{n} p_i = 1, \; \sum_{i=1}^{n} p_i Z_i(\beta) = 0 \right\} \quad (3)$$
Then the Lagrange multiplier method is applied to solve for the weights $p_i$ that satisfy formula (3). Finding the $p_i$ that maximize $\prod_{i=1}^{n} n p_i$ is equivalent to finding the $p_i$ that maximize $\sum_{i=1}^{n} \log(n p_i)$. Let

$$G = \sum_{i=1}^{n} \log(n p_i) + \gamma \left( 1 - \sum_{i=1}^{n} p_i \right) - n \lambda^{\mathrm{T}} \sum_{i=1}^{n} p_i Z_i(\beta) \quad (4)$$
where $\gamma$ and $\lambda$ are Lagrange multipliers. Setting the partial derivatives of $G$ with respect to $p_i$, $\gamma$ and $\lambda$ to zero, the following equations can be obtained:

$$\begin{cases} \dfrac{1}{p_i} - \gamma - n \lambda^{\mathrm{T}} Z_i(\beta) = 0, \quad i = 1, \dots, n, & \text{①} \\[6pt] \displaystyle\sum_{i=1}^{n} p_i = 1, \qquad \sum_{i=1}^{n} p_i Z_i(\beta) = 0 & \text{②} \end{cases} \quad (5)$$
By multiplying both sides of Equation ① in formula (5) by $p_i$ and summing over $i$, we can get

$$n - \gamma - n \lambda^{\mathrm{T}} \sum_{i=1}^{n} p_i Z_i(\beta) = n - \gamma = 0$$

That is, $\gamma = n$. Then substituting $\gamma = n$ into Equation ① in formula (5), we get

$$p_i = \frac{1}{n \left( 1 + \lambda^{\mathrm{T}} Z_i(\beta) \right)}$$

Then the profile empirical likelihood function of the regression coefficient $\beta$ can be written, and it is given by $L_E(\beta) = \prod_{i=1}^{n} p_i$, where

$$L_E(\beta) = \prod_{i=1}^{n} \frac{1}{n \left( 1 + \lambda^{\mathrm{T}} Z_i(\beta) \right)} \quad (6)$$
Substituting the expression for $p_i$ into Equation ② in formula (5), the Lagrange multiplier $\lambda$ can be solved from the following equation:

$$\frac{1}{n} \sum_{i=1}^{n} \frac{Z_i(\beta)}{1 + \lambda^{\mathrm{T}} Z_i(\beta)} = 0 \quad (7)$$
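For concreteness, equation (7) can be solved numerically. The sketch below is an illustrative implementation (not code from the paper): it uses a damped Newton iteration for $\lambda$ and then evaluates the profile log empirical likelihood; function names and tolerances are our own choices.

```python
import numpy as np

def el_lambda(Z, n_iter=100, tol=1e-10):
    """Solve equation (7): (1/n) sum_i Z_i / (1 + lam'Z_i) = 0, by damped Newton."""
    n, p = Z.shape
    lam = np.zeros(p)
    for _ in range(n_iter):
        denom = 1.0 + Z @ lam                       # 1 + lam'Z_i, kept positive
        grad = (Z / denom[:, None]).mean(axis=0)    # left-hand side of (7)
        if np.linalg.norm(grad) < tol:
            break
        H = -(Z.T / denom**2) @ Z / n               # Jacobian of (7) w.r.t. lam
        step = np.linalg.solve(H, grad)
        t = 1.0
        while np.any(1.0 + Z @ (lam - t * step) <= 0):  # keep all p_i > 0
            t /= 2.0
        lam = lam - t * step
    return lam

def log_el(beta, X, y):
    """Profile log empirical likelihood: log L_E(beta) = -sum_i log(n(1+lam'Z_i))."""
    Z = X * (y - X @ beta)[:, None]                 # auxiliary variables Z_i(beta)
    lam = el_lambda(Z)
    n = len(y)
    return -np.sum(np.log(n * (1.0 + Z @ lam)))
```

At the least squares solution the estimating equation $\sum_i Z_i(\beta) = 0$ holds, so $\lambda = 0$ solves (7), each $p_i = 1/n$, and the profile log empirical likelihood attains its maximum value $-n \log n$.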
Next, it is shown that under some regularity conditions, if $\hat{\beta}$ maximizes the profile log empirical likelihood function $\log L_E(\beta)$, then $\hat{\beta}$ converges in probability to the true value $\beta_0$.

Theorem (Consistency) Under some regularity conditions, if

$$\hat{\beta} = \arg\max_{\beta} \log L_E(\beta),$$

then $\hat{\beta} \xrightarrow{P} \beta_0$.
Proof: Let $\ell(\beta) = -\log R(\beta) = \sum_{i=1}^{n} \log\left(1 + \lambda^{\mathrm{T}} Z_i(\beta)\right)$, and denote $\beta = \beta_0 + u n^{-1/3}$ for $\beta$ in the sphere $\|\beta - \beta_0\| \le n^{-1/3}$, where $\|u\| \le 1$. Owen [2] proved that when $\|\beta - \beta_0\| \le n^{-1/3}$, there is

$$\lambda(\beta) = O_p\left(n^{-1/3}\right) \quad \text{uniformly in } \beta.$$

Then performing a Taylor expansion on $\ell(\beta)$ on the surface $\|u\| = 1$, we get:

$$\ell(\beta) \ge \frac{c}{2}\, n^{1/3} \left(1 + o_p(1)\right),$$

where $c$ is the smallest eigenvalue of $E\left[Z_1(\beta_0) Z_1(\beta_0)^{\mathrm{T}}\right]$. Similarly, it can also be shown that $\ell(\beta_0) = O_p(1)$. Since $\ell(\beta)$ is continuous with respect to $\beta$ and is of strictly smaller order at $\beta_0$ than on the surface of the sphere, $\ell(\beta)$ has a minimum value in the interior of the sphere, that is, $\|\hat{\beta} - \beta_0\| \le n^{-1/3} \to 0$, so $\hat{\beta} \xrightarrow{P} \beta_0$.
3. Bayesian Empirical Likelihood
Penalized linear regression and Bayesian linear regression are closely related: penalized estimates can be interpreted as Bayesian posterior estimates of the parameters under certain priors. For linear models:

$$Y = X\beta + \varepsilon \quad (8)$$
Under the regularization framework, assuming the noise follows a Gaussian distribution, the regularized least squares method corresponds, from a probabilistic perspective, to maximum a posteriori estimation, namely

$$\hat{\beta}_{\mathrm{MAP}} = \arg\max_{\beta} \; p(Y \mid X, \beta)\, \pi(\beta)$$

Then the maximum a posteriori estimate of the parameter $\beta$ is

$$\hat{\beta}_{\mathrm{MAP}} = \arg\min_{\beta} \left\{ \frac{1}{2\sigma^2} \|Y - X\beta\|^2 - \log \pi(\beta) \right\}$$

where $\pi(\beta)$ is the prior distribution of the parameter $\beta$. When the parameter $\beta$ follows a Laplace distribution, L1 regularization is derived; when the parameter $\beta$ follows a Gaussian distribution, L2 regularization is derived. It can thus be seen that lasso regression and ridge regression are closely related to Bayesian linear models with different priors placed on the parameters.
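The Gaussian-prior case can be checked numerically: the ridge estimate obtained in closed form also makes the gradient of the negative log posterior vanish, i.e. it is the MAP estimate. A minimal sketch, with the noise variance `sigma2` and prior variance `tau2` assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=n)

sigma2 = 1.0           # noise variance (taken as known for this illustration)
tau2 = 0.5             # Gaussian prior variance: beta_j ~ N(0, tau2)
alpha = sigma2 / tau2  # implied L2 penalty weight

# Ridge estimate in closed form: argmin ||y - X b||^2 + alpha ||b||^2
beta_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

# Negative log posterior under the Gaussian prior (up to an additive constant)
def neg_log_post(b):
    return np.sum((y - X @ b) ** 2) / (2 * sigma2) + np.sum(b ** 2) / (2 * tau2)

# Its gradient vanishes at the ridge solution, so ridge = MAP
grad = X.T @ (X @ beta_ridge - y) / sigma2 + beta_ridge / tau2
```

The same argument with a Laplace prior reproduces the lasso objective, though without a closed-form minimizer.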
The Bayesian empirical likelihood is as follows. Let $(x_i, y_i)$, $i = 1, \dots, n$, be independent multivariate random variables from an unknown distribution $F$, where $F$ depends on the parameter $\beta$. Assume that both the predictor and response variables are standardized, so that the intercept term is zero. Let the prior of $\beta$ be $\pi(\beta)$. When the data distribution is unknown, replacing the parametric likelihood function in Bayes' theorem with the empirical likelihood function, the posterior empirical likelihood density is

$$\pi(\beta \mid Y, X) = \frac{L_E(\beta)\, \pi(\beta)}{\int L_E(\beta)\, \pi(\beta)\, d\beta} \quad (9)$$
Combining the empirical likelihood inference for multiple linear regression in Section 2, we can obtain the posterior inference of the Bayesian empirical likelihood for linear regression as

$$\pi(\beta \mid Y, X) \propto \pi(\beta) \prod_{i=1}^{n} \frac{1}{n \left( 1 + \lambda^{\mathrm{T}} Z_i(\beta) \right)} \quad (10)$$
4. L1/2 Regularization Inference Based on Bayesian Empirical Likelihood
4.1. Hierarchical Model
Linear regression with L1/2 regularization penalizes the magnitude of the regression coefficients by imposing an L1/2 penalty, that is, it minimizes the following penalized residual sum of squares:

$$\min_{\beta} \; \|Y - X\beta\|^2 + \eta \sum_{j=1}^{p} |\beta_j|^{1/2} \quad (11)$$
Without loss of generality, we assume that the data are standardized and the intercept term is 0. In formula (11), $Y$ is the $n \times 1$ response vector, $\beta$ is the $p \times 1$ coefficient vector, $X$ is the $n \times p$ design matrix, and the tuning parameter $\eta \ge 0$ controls the degree of penalty: the larger the value of $\eta$, the greater the shrinkage of the regression parameters.
By observing the form of the penalty term in (11), we find that the regression parameter $\beta$ in L1/2 regularization has the form of an independent and identical zero-mean generalized Gaussian prior. The density function of the zero-mean generalized Gaussian distribution (GGD) is:

$$f(x; s, q) = \frac{q}{2 s\, \Gamma(1/q)} \exp\left\{ -\left( \frac{|x|}{s} \right)^{q} \right\}$$

where $\Gamma(\cdot)$ is the gamma function, $s > 0$ is the scale parameter, and $q > 0$ is the shape parameter that controls the decay rate of the tail of the distribution. There are two special cases of the GGD: when $q = 1$, it corresponds to the Laplace distribution, and when $q = 2$, it corresponds to the normal distribution.
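The two special cases are easy to verify directly from the density formula. The short check below (illustrative only) confirms that $q = 1$ recovers the Laplace density with scale $s$, and $q = 2$ recovers a normal density with $\sigma = s/\sqrt{2}$:

```python
import math

def ggd_pdf(x, s, q):
    """Zero-mean generalized Gaussian density: q/(2 s Gamma(1/q)) exp(-(|x|/s)^q)."""
    return q / (2.0 * s * math.gamma(1.0 / q)) * math.exp(-((abs(x) / s) ** q))

def laplace_pdf(x, b):
    """Laplace(0, b) density."""
    return math.exp(-abs(x) / b) / (2.0 * b)

def normal_pdf(x, sigma):
    """N(0, sigma^2) density."""
    return math.exp(-x * x / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))
```

Note the scale convention: because the GGD exponent is $(|x|/s)^2$ rather than $x^2/(2\sigma^2)$, the $q = 2$ case matches a normal distribution with $\sigma = s/\sqrt{2}$.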
Combining the above connections, on the basis of Park and Casella [15], we consider placing on the regression parameters $\beta$ a generalized Gaussian prior with mean 0, shape parameter $q = 1/2$, and scale parameter $s$. The expression is as follows:

$$\pi(\beta) = \prod_{j=1}^{p} \frac{1}{4s} \exp\left\{ -\left( \frac{|\beta_j|}{s} \right)^{1/2} \right\} \quad (12)$$
Although most of the existing literature expresses the generalized Gaussian distribution as a scale mixture of normal distributions, this representation is not suitable for the Bayesian bridge model with the L1/2 penalty. Therefore, other representations need to be explored. In this paper, the generalized Gaussian distribution is expressed as a mixture of a uniform distribution and a gamma distribution, that is:

$$f(x; s, q) = \int_{0}^{\infty} \mathrm{Uniform}\left( x; -s u^{1/q},\, s u^{1/q} \right) \mathrm{Gamma}(u; 1 + 1/q, 1)\, du$$

so that, for $q = 1/2$, drawing $u_j \sim \mathrm{Gamma}(3, 1)$ and $\beta_j \mid u_j \sim \mathrm{Uniform}\left(-s u_j^{2},\, s u_j^{2}\right)$ yields, marginally, the generalized Gaussian prior (12).
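The uniform–gamma mixture is easy to verify by simulation. For the GGD with shape $1/2$ and scale $s$, the mean absolute value is $E|\beta| = s\,\Gamma(4)/\Gamma(2) = 6s$, so drawing $u \sim \mathrm{Gamma}(3, 1)$ and $\beta \mid u \sim \mathrm{Uniform}(-su^2, su^2)$ should reproduce this moment. A Monte Carlo sketch (illustrative; the value of `s` and the sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)
s = 0.5          # scale parameter of the target GGD with shape q = 1/2
m = 400_000      # Monte Carlo sample size

u = rng.gamma(shape=3.0, scale=1.0, size=m)      # mixing variable u ~ Gamma(3, 1)
beta = rng.uniform(-s * u**2, s * u**2)          # beta | u ~ Uniform(-s u^2, s u^2)

# For the GGD with shape 1/2 and scale s: E|beta| = s * Gamma(4)/Gamma(2) = 6 s
mean_abs = np.abs(beta).mean()
```

The empirical mean of $|\beta|$ should be close to $6s$, and the sample mean of $\beta$ close to 0, consistent with the zero-mean GGD marginal.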
Then, without assuming the distributional form of the data, the empirical likelihood function is used in place of the parametric likelihood function, and the Bayesian hierarchical model can be expressed as:

$$\begin{aligned} Y \mid X, \beta &\sim L_E(\beta), \\ \beta_j \mid u_j, s &\sim \mathrm{Uniform}\left(-s u_j^{2},\, s u_j^{2}\right), \quad j = 1, \dots, p, \\ u_j \mid \eta &\sim \mathrm{Gamma}(3, \eta), \quad j = 1, \dots, p, \\ s &\sim \pi(s), \qquad \eta \sim \mathrm{Gamma}(c, d) \end{aligned} \quad (13)$$

In the above hierarchical model, we choose $\pi(s) = \mathrm{InverseGamma}(a, b)$. Assuming that the priors of different parameters are independent, the joint posterior density can be expressed as:

$$\pi(\beta, u, s, \eta \mid Y, X) \propto L_E(\beta) \prod_{j=1}^{p} \left[ \frac{1}{2 s u_j^{2}}\, I\left(|\beta_j| < s u_j^{2}\right) \cdot \frac{\eta^{3}}{\Gamma(3)}\, u_j^{2} e^{-\eta u_j} \right] \pi(s)\, \pi(\eta) \quad (14)$$
Given $u$, $s$, $\eta$, $Y$ and $X$, the full conditional distribution of $\beta$ is:

$$\pi(\beta \mid u, s, \eta, Y, X) \propto L_E(\beta) \prod_{j=1}^{p} I\left(|\beta_j| < s u_j^{2}\right) \quad (15)$$

From the expression of the full conditional distribution of $\beta$, we know that it has no closed form.
Similarly, given $\beta$, $s$, $\eta$, $Y$ and $X$, the full conditional distribution of $u_j$ is:

$$\pi(u_j \mid \beta, s, \eta) \propto e^{-\eta u_j}\, I\left( u_j > \left( |\beta_j| / s \right)^{1/2} \right) \quad (16)$$
Analogously, given $\beta$, $u$, $\eta$, $Y$ and $X$, the full conditional distribution of $s$ is:

$$\pi(s \mid \beta, u, \eta) \propto s^{-(a + p) - 1} e^{-b/s}\, I\left( s > \max_{j} |\beta_j| / u_j^{2} \right) \quad (17)$$
From the expression of the prior distribution, we find that the tuning parameter $\eta$ enters the model as a hyperparameter that controls the precision of the prior distribution. The larger the value of $\eta$, the more concentrated the prior distribution is around mean 0; the smaller the value of $\eta$, the more dispersed the prior distribution is around mean 0. In this paper, we specify a gamma prior $\mathrm{Gamma}(c, d)$ for the penalty parameter $\eta$.
In model (13), when the latent variable $u$ is marginalized out and the generalized Gaussian prior is used directly, the full conditional distribution of $\beta$ given $s$, $\eta$, $Y$ and $X$ is:

$$\pi(\beta \mid s, \eta, Y, X) \propto L_E(\beta) \exp\left\{ -\eta \sum_{j=1}^{p} \left( \frac{|\beta_j|}{s} \right)^{1/2} \right\} \quad (18)$$
4.2. The Framework of the Algorithm
Regarding $u$, $s$ and $\eta$ in the Bayesian hierarchical model, this paper uses the Gibbs algorithm to sample.

1) The full conditional distribution of $u_j$ is the left-truncated exponential distribution in (16), and two-step sampling is considered: first generate $z$ from the exponential distribution $\mathrm{Exp}(\eta)$, and then let $u_j = z + \left( |\beta_j| / s \right)^{1/2}$.

2) The full conditional distribution of $s$ is the left-truncated inverse gamma distribution in (17), and two-step sampling is considered: first generate $v$ from the gamma distribution $\mathrm{Gamma}(a + p, b)$ right-truncated at $\left( \max_j |\beta_j| / u_j^{2} \right)^{-1}$, and then let $s = 1/v$.

3) The full conditional distribution of $\eta$ is the gamma distribution, and $\eta$ is generated directly from the gamma distribution $\mathrm{Gamma}\left( c + 3p,\; d + \sum_{j=1}^{p} u_j \right)$.
Regarding the regression parameter $\beta$, since its full conditional distribution has no closed form, this paper considers sampling with the tailored M-H algorithm adopted by Chib [18] and Bedoui [11]. The candidate-generating density in the M-H algorithm is a multivariate t distribution: its location parameter is the mode of the log empirical likelihood function for the linear model, and its dispersion matrix is the inverse of the negative Hessian matrix of the log empirical likelihood function evaluated at this mode.
5. Simulation
In this section, simulation experiments are performed to verify the effectiveness of L1/2 regularization based on Bayesian empirical likelihood (BEL). We generate data from the following multiple linear regression model:

$$Y = X\beta + \varepsilon$$

where $Y$ is the $n \times 1$ response variable, $X$ is an $n \times p$ design matrix, $\varepsilon$ is the $n \times 1$ error vector, and $n$ is the sample size. The rows of the design matrix $X$ come from a multivariate Gaussian distribution with mean zero and covariance matrix $\Sigma$. The regression coefficients $\beta$ form a $p \times 1$ vector.
In standard parametric models, it is generally assumed that the errors follow a normal distribution with zero mean. In empirical likelihood, however, no assumption about the error distribution is needed, which avoids misspecifying the error distribution and makes the model more robust.
We assume the error violates the zero-mean normality assumption of the standard parametric model: $\varepsilon_i$ is independent and identically distributed from a normal distribution with mean −3 and variance 32. Under this model, we generate training datasets with three different sample sizes ($n$ = 50, 100, 200) and produce test sets of the same sizes. Furthermore, the Bayesian empirical likelihood-based L1/2 regularization method (BEL) proposed in this paper is compared with the Bayesian bridge regression model based on a scale mixture of normals for the generalized Gaussian density (BBR.N) proposed by Polson [19], the Bayesian bridge regression model based on scale mixtures of triangular distributions for the generalized Gaussian density (BBR.T) proposed by Polson [19], and the Bayesian lasso model (BLASSO) proposed by Park and Casella [15]. The exponent of the regularization term in BBR.N and BBR.T is set to 1/2, corresponding to the L1/2 penalty with a parametric likelihood function.
For the hyperparameters in the hierarchical model, we choose a = 10, b = 0.1, c = 2 and d = 2 in the numerical simulations. We generate 50 training datasets, that is, we repeat the experiment 50 times; in each experiment the model is fitted on the training dataset with 15,000 iterations, the first 5000 iterations are discarded as burn-in, and the mean of the regression coefficients over the last 10,000 iterations is taken as the estimate. Performance is then calculated on the test dataset.
The evaluation indicators are the mean squared error (MSE) and the mean absolute deviation (MAE) on the test set, calculated as follows:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - x_i^{\mathrm{T}} \hat{\beta} \right)^2 \quad (19)$$

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - x_i^{\mathrm{T}} \hat{\beta} \right| \quad (20)$$
In order to exclude the influence of possible extreme values, this paper uses the median of these 50 values to evaluate the performance of the four methods, namely the median of mean squared error (MMSE) and the median of mean absolute deviation (MMAE).
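For reference, the evaluation pipeline amounts to a few lines; the helper names below are our own:

```python
import numpy as np

def mse(y, X, beta_hat):
    """Mean squared error on a test set, as in (19)."""
    r = y - X @ beta_hat
    return np.mean(r ** 2)

def mae(y, X, beta_hat):
    """Mean absolute deviation on a test set, as in (20)."""
    return np.mean(np.abs(y - X @ beta_hat))

def median_over_replications(values):
    """Median across the replications (MMSE / MMAE), robust to extreme runs."""
    return float(np.median(values))
```

Taking the median rather than the mean over the 50 replications is what makes the summary insensitive to an occasional badly mixed chain.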
Table 1 shows the values of the median of mean squared error and the median of mean absolute deviation for the four methods at three different sample sizes. As can be seen from Table 1, when the error distribution violates the zero-mean normality assumption of the standard parametric model, especially when the sample size is small ($n$ = 50, 100), the BEL method outperforms the other three methods. Moreover, as the sample size increases, the MMSE and MMAE values of all four methods show a downward trend.
Figure 1 shows boxplots of the MSE values computed on the test set for the four methods at three different sample sizes. It can also be seen from the figure that when the sample size is small ($n$ = 50, 100), the BEL method is significantly better than the other three methods, and when the sample size is 200, the BEL method is slightly better than the other three methods. In general, the BEL method performs better in small samples when the error violates the zero-mean normality assumption.
Figure 1. Boxplots of the values of MSE for the four methods.
Table 1. Values of MMSE and MMAE for the four methods.
Figure 2 shows boxplots of the 50 MAE values calculated on the test set by the four methods over the 50 repetitions at each sample size. When the number of observations is 50 or 100, the MMAE of the BEL method is significantly smaller than that of the other three methods. When the number of observations is 200, the MMAE of the BEL method is slightly smaller than that of the other three methods.
Table 2 shows the number of times each component of the regression coefficients is excluded, using the scaled neighborhood criterion proposed by Li and Lin [16], on the 50 training datasets at the three different sample sizes. It can be seen from Table 2 that all four methods can distinguish important from unimportant variables, that is, they can perform variable selection. When the number of observations is 50 or 100, the BEL method identifies the non-zero variables more accurately.
Figure 2. Boxplots of the values of MAE for the four methods.
Table 2. The number of times the regression component was removed based on 50 repetitions of the simulation.
6. Conclusions
This paper proposes a new method for variable selection: L1/2 regularization based on Bayesian empirical likelihood. This method introduces the L1/2 penalty into the model in the form of a generalized Gaussian prior, replaces the parametric likelihood function in Bayes' theorem with a nonparametric likelihood function, derives the posterior distribution through a Bayesian hierarchical model, and then uses an MCMC method to sample from the posterior distribution. Simulations demonstrate that the proposed BEL method outperforms BBR.N, BBR.T and BLASSO when the errors violate the zero-mean normality assumption of standard parametric models; especially when the sample size is small, the prediction accuracy of the BEL method is better. In addition, the proposed method performs variable selection well.
Subsequent research may consider Bayesian empirical likelihood inference combining the L1/2 penalty and the L2 penalty, which is a flexible penalty method. One may also consider placing a spike-and-slab prior on the parameters, with the following expression:

$$\pi(\beta_j \mid \gamma_j) = \gamma_j\, g(\beta_j) + (1 - \gamma_j)\, \delta_0(\beta_j), \quad j = 1, \dots, p \quad (21)$$

where $\gamma_j \in \{0, 1\}$ is a latent indicator variable, $\delta_0(\cdot)$ is a point mass at zero, and $g(\cdot)$ is a continuous slab density. When $\gamma_j = 1$, it indicates that the jth predictor is more important and should be kept. When $\gamma_j = 0$, it indicates that the jth predictor is not important and should be removed from the model. Compared with placing a single prior distribution on the parameters, this mixed prior can well combine the advantages of variable selection and sparse recovery.