A New Extended BIC and Sequential Lasso Regression Analysis and Their Application in Classification
1. Introduction
Sparse high-dimensional regression (SHR) models arise in many important contemporary scientific fields. An SHR model is:

$$y_i = \beta_0 + \sum_{j=1}^{p} x_{ij}\beta_j + \epsilon_i, \quad i = 1, \ldots, n, \qquad (1)$$

where the number of features p is much larger than the sample size n, and only a relatively small number of the βj's are nonzero. Feature selection is crucial in the analysis of SHR models. Desboulets [1] pointed out that there are three main categories of variable selection procedures: test-based, penalty-based and screening-based procedures. Regularized regression approaches to the analysis of SHR models have attracted considerable attention from researchers. A regularized regression approach selects the features and estimates the coefficients simultaneously by minimizing a penalized sum of squares of the form:
$$\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2 + \sum_{j=1}^{p} p_\lambda\left(|\beta_j|\right), \qquad (2)$$

where λ is a regulating parameter and $p_\lambda(\cdot)$ is a penalty function such that the number of fitted nonzero coefficients can be regulated by λ; that is, only a certain number of the βj's are estimated as nonzero when λ is set at a certain value. Various penalty functions have been proposed and studied, including the Lasso penalty $p_\lambda(|\beta_j|) = \lambda|\beta_j|$; SCAD [2], which smoothly clips an $L_1$-type penalty (for small $|\beta_j|$) to a constant penalty (for large $|\beta_j|$); the adaptive Lasso [3], with $p_\lambda(|\beta_j|) = \lambda w_j|\beta_j|$, where the $w_j$ are given weights; and MCP [4], which smoothly moves from an $L_1$-type penalty toward a constant penalty (for large $|\beta_j|$'s) that it reaches as an asymptote.
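To make the shapes of these penalties concrete, the following sketch evaluates the Lasso, SCAD and MCP penalties on a grid of coefficient values. The SCAD shape parameter a = 3.7 and the MCP parameter γ = 3 are the conventional defaults from the cited literature, not values prescribed by this paper.

```python
import numpy as np

def lasso_penalty(beta, lam):
    # Lasso: proportional to |beta| everywhere.
    return lam * np.abs(beta)

def scad_penalty(beta, lam, a=3.7):
    # SCAD: L1 near zero, quadratic transition, constant for |beta| > a*lam.
    b = np.abs(beta)
    small = b <= lam
    mid = (b > lam) & (b <= a * lam)
    return np.where(small, lam * b,
           np.where(mid, (2 * a * lam * b - b ** 2 - lam ** 2) / (2 * (a - 1)),
                    (a + 1) * lam ** 2 / 2))

def mcp_penalty(beta, lam, gamma=3.0):
    # MCP: slope lam at zero, flattening to the constant gamma * lam^2 / 2.
    b = np.abs(beta)
    return np.where(b <= gamma * lam, lam * b - b ** 2 / (2 * gamma),
                    gamma * lam ** 2 / 2)

grid = np.linspace(-4, 4, 9)
print(lasso_penalty(grid, 1.0))
print(scad_penalty(grid, 1.0))
print(mcp_penalty(grid, 1.0))
```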
Sequential methods have also received attention in recent decades for feature selection in SHR models. Traditional sequential procedures such as forward stepwise regression (FSR) were criticized for their greedy nature. However, it was discovered recently that greediness is in fact desirable if the goal is to identify relevant features, see [5], especially in the presence of high spurious correlations caused by the extremely high dimensionality of the feature space. A sequential procedure of a different nature, called least angle regression (LAR), was proposed in [6]. LAR continuously updates the estimate of the expected responses along a direction having equal angle with the features already selected, and selects new features having the largest absolute correlation with the updated current residuals. Recently, Luo and Chen [7] identified the nonzero entries of the precision matrix by a sequential method called JR-SLasso proposed in [8]. Chen and Jiang [9] proposed a two-stage sequential conditional selection (TSCS) approach to the identification and estimation of the nonzeros of the coefficient matrix and the precision matrix. Sequential approaches have also been considered for models other than SHR models. Besides their desirable theoretical properties, sequential approaches are computationally appealing: they are more stable and less affected by the dimensionality of the feature space.
Apart from variable selection, model selection is also an important part of regression analysis. In statistical modeling, an investigator often faces the problem of choosing a suitable model from among a collection of viable candidates. Such a determination may be facilitated by a selection criterion, which assigns a score to every model in a candidate set based on some underlying statistical principle. The Bayesian information criterion (BIC) is one of the most widely known and pervasively used tools in statistical model selection. However, as shown by [10], the ordinary Bayesian information criterion is too liberal when the model space is large; that is, the criterion selects far more features than the relevant ones. Chen and Chen [10] proposed a family of extended Bayes information criteria (EBIC) to better meet the needs of variable selection for large model spaces.
Besides regression, the classification problem also receives widespread attention in statistical modeling. Classification problems with high-dimensional data arise in many important contemporary scientific fields such as genetics and medicine. The most popular method for classification is Fisher's linear discriminant analysis (LDA). In a K-class classification problem, the LDA assumes that the predictor $x$ given class $k$ follows the multivariate normal distribution $N(u_k, \Sigma)$, $k = 1, \ldots, K$. Let $\pi_k$ be the prior probability of class $k$ and:

$$\delta_k(x) = u_k^\top \Omega x - \frac{1}{2} u_k^\top \Omega u_k + \log \pi_k, \qquad (3)$$

where $\Omega = \Sigma^{-1}$ is the so-called precision matrix. The Bayes rule, which is theoretically optimal, classifies x into class k if and only if $\delta_k(x) = \max_{1 \le l \le K}\delta_l(x)$.

Since the Bayes rule cannot be realized in practice due to the unknown $u_k$'s and Ω, it is estimated in the LDA by replacing the $u_k$'s and Ω with their estimates. If the dimension p of the predictor (with p < n) is fixed, or diverges under certain conditions, it has been shown that the LDA is asymptotically optimal; but if p > n, it is asymptotically as bad as random guessing, where n is the sample size, see [11]. The failure of the LDA in the case p > n is due to accumulated errors in the estimation of the unknowns, as argued in [12]. Thus, it is necessary to bypass the difficulties in the estimation of the unknowns.
In this paper, we first propose a novel method for choosing the regularization parameter λ in lasso regression. Traditional methods such as multifold cross-validation choose the regularization parameter by a gradual search. For lasso regression, the larger the regularization parameter λ is, the more regression coefficients are estimated as zero, and once λ exceeds a certain maximum value no features remain in the candidate model. It is therefore important to know this maximum value exactly; multifold cross-validation, however, does not tell us the largest useful value of λ. Differing from multifold cross-validation, our method gives the exact maximum value of the regularization parameter such that at least one of the βj's will be estimated nonzero. Thus, given the data, the largest λ for lasso regression can be obtained immediately by our method. We then apply this method to choose the parameter for a sequence of partially penalized least squares problems, in which the features selected in earlier steps are not penalized in the subsequent steps.
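As an illustration of the idea, the sketch below computes this largest λ for simulated data, assuming the common parameterization $\frac{1}{2}\|y - X\beta\|_2^2 + \lambda\sum_j|\beta_j|$ after centering y and the columns of X; under other scalings of the loss the maximum changes only by a constant factor.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 200
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]
y = X @ beta + rng.standard_normal(n)

# Center the response and the columns so the intercept can be dropped.
y = y - y.mean()
X = X - X.mean(axis=0)

# Largest lambda for which the lasso solution is not identically zero:
# the KKT conditions show 0 is a solution iff |x_j' y| <= lambda for all j.
lam_max = np.max(np.abs(X.T @ y))
print("lambda_max =", lam_max)

# Features attaining the maximum are the first ones to enter the model.
first_in = np.flatnonzero(np.isclose(np.abs(X.T @ y), lam_max))
print("first feature(s) to enter:", first_in)
```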
Secondly, the re-examination of BIC and EBIC naturally prompts us to consider other forms of prior over the model space in the Bayes approach. We propose a new extended Bayes information criterion family, and under some mild conditions our new EBIC (NEBIC) is shown to be consistent. The NEBIC is then used as the stopping rule for the sequential lasso algorithm; we dub the proposed procedure sequential lasso with NEBIC. After that, we apply our algorithm to the classification problem, and the numerical studies demonstrate that it performs better than the competing methods under consideration.
The remainder of the paper is arranged as follows. In Section 2, we introduce the basic properties of sequential lasso and our new method for choosing the regularization parameter λ. In Section 3, the selection consistency of our NEBIC is established. In Section 4, the structure of our algorithm is given. In Section 5, we apply our algorithm to classification problems and present the main method and simulation results in turn.
2. Sequential Lasso Regression for Feature Selection
Let $X = (x_1, \ldots, x_p)$ be the $n \times p$ design matrix where, for $j = 1, \ldots, p$, $x_j$ is the observation vector of predictor j on the n individuals, and let $y$ be the response vector. Assume that $y$ and the $x_j$'s have all been standardized, i.e., centered so that $\mathbf{1}_n^\top y = 0$ and $\mathbf{1}_n^\top x_j = 0$, and scaled to a common norm, $j = 1, \ldots, p$. Thus, in model (1) the intercept $\beta_0$ can be omitted. Let $\beta = (\beta_1, \ldots, \beta_p)^\top$ and $\epsilon = (\epsilon_1, \ldots, \epsilon_n)^\top$; in matrix notation, model (1) is expressed as:

$$y = X\beta + \epsilon. \qquad (4)$$

Let $S = \{1, 2, \ldots, p\}$ denote the set of feature indices and let $s$ be any subset of S. Denote by $X(s)$ the matrix consisting of the columns of X with indices in s. Similarly, let $\beta(s)$ denote the vector consisting of the corresponding components of β. Let $R(s)$ be the linear space spanned by the columns of $X(s)$ and $H(s)$ its corresponding projection matrix:

$$H(s) = X(s)\left[X(s)^\top X(s)\right]^{-1}X(s)^\top.$$
At the initial step, sequential lasso minimizes the following penalized sum of squares:

$$l_1(\beta) = \frac{1}{2}\left\|y - X\beta\right\|_2^2 + \lambda\sum_{j \in S}|\beta_j|. \qquad (5)$$

Let $\lambda_1 = \max_{j \in S}|x_j^\top y|$. From Theorem 1 (applied with no features yet selected) it follows that $\lambda_1$ is the largest value of the penalty parameter such that at least one of the βj's, $j \in S$, will be estimated nonzero. The features with nonzero estimated coefficients are selected and the set of their indices is denoted by $s_{*1}$.
For $k \geq 1$, let $s_{*k}$ be the index set of the features selected up to step k. At step $k+1$, sequential lasso minimizes the following penalized sum of squares:

$$l_{k+1}(\beta) = \frac{1}{2}\left\|y - X\beta\right\|_2^2 + \lambda\sum_{j \notin s_{*k}}|\beta_j|, \qquad (6)$$

so that the features selected in earlier steps are not penalized. From Proposition 1, the minimization of $l_{k+1}(\beta)$ is equivalent to the minimization of

$$\tilde{l}_{k+1}\left(\beta(s_{*k}^c)\right) = \frac{1}{2}\left\|\tilde{y}_k - \tilde{X}_k\beta(s_{*k}^c)\right\|_2^2 + \lambda\sum_{j \notin s_{*k}}|\beta_j|,$$

where $s_{*k}^c = S \setminus s_{*k}$, $\tilde{y}_k = (I - H(s_{*k}))y$, $\tilde{X}_k = (I - H(s_{*k}))X(s_{*k}^c)$, and $\tilde{x}_j = (I - H(s_{*k}))x_j$ for $j \notin s_{*k}$. Let $\lambda_{k+1} = \max_{j \notin s_{*k}}|\tilde{x}_j^\top \tilde{y}_k|$. From Theorem 1, $\lambda_{k+1}$ is the largest value of the penalty parameter such that at least one of the βj's, $j \notin s_{*k}$, will be estimated nonzero.
Next, we give the statements and proofs of Proposition 1 and Theorem 1.
Proposition 1. For $k \geq 1$, the minimization of $l_{k+1}(\beta)$ is equivalent to the minimization of $\tilde{l}_{k+1}(\beta(s_{*k}^c))$.

Proof: Differentiating $l_{k+1}(\beta)$ with respect to $\beta(s_{*k})$, whose components are not penalized, we have

$$\frac{\partial l_{k+1}}{\partial \beta(s_{*k})} = -X(s_{*k})^\top\left(y - X(s_{*k})\beta(s_{*k}) - X(s_{*k}^c)\beta(s_{*k}^c)\right).$$

Setting the above derivative to zero, we obtain:

$$\beta(s_{*k}) = \left[X(s_{*k})^\top X(s_{*k})\right]^{-1}X(s_{*k})^\top\left(y - X(s_{*k}^c)\beta(s_{*k}^c)\right).$$

Substituting this expression back into $l_{k+1}(\beta)$, we have:

$$y - X\beta = \left(I - H(s_{*k})\right)\left(y - X(s_{*k}^c)\beta(s_{*k}^c)\right) = \tilde{y}_k - \tilde{X}_k\beta(s_{*k}^c),$$

so that $l_{k+1}(\beta)$ reduces to $\tilde{l}_{k+1}(\beta(s_{*k}^c))$.

Theorem 1. $\lambda_{k+1} = \max_{j \notin s_{*k}}|\tilde{x}_j^\top \tilde{y}_k|$ is the largest value of the regularization parameter such that at least one of the βj's, $j \notin s_{*k}$, will be estimated nonzero (for the initial step, take $s_{*0} = \emptyset$, so that $\lambda_1 = \max_{j \in S}|x_j^\top y|$).

Proof: By Proposition 1, this statement is equivalent to the statement that $\lambda_{k+1}$ is the largest value of the regularization parameter such that the minimization of $\tilde{l}_{k+1}$ yields a nonzero solution.

By the KKT condition, $\hat{\beta}$ minimizes $\tilde{l}_{k+1}$ if and only if:

$$-\tilde{X}_k^\top\left(\tilde{y}_k - \tilde{X}_k\hat{\beta}\right) + \lambda\gamma = 0. \qquad (7)$$

In component form:

$$-\tilde{x}_j^\top\left(\tilde{y}_k - \tilde{X}_k\hat{\beta}\right) + \lambda\gamma_j = 0, \quad j \notin s_{*k}, \qquad (8)$$

where $\gamma_j$ is a subgradient of $|\beta_j|$ at $\hat{\beta}_j$; according to the value of $\hat{\beta}_j$, there are three situations: $\gamma_j = 1$ if $\hat{\beta}_j > 0$, $\gamma_j = -1$ if $\hat{\beta}_j < 0$, and $\gamma_j \in [-1, 1]$ if $\hat{\beta}_j = 0$. Then (8) is equivalent to:

$$\tilde{x}_j^\top\left(\tilde{y}_k - \tilde{X}_k\hat{\beta}\right) = \lambda\,\mathrm{sign}(\hat{\beta}_j) \quad \text{if } \hat{\beta}_j \neq 0, \qquad (9.1)$$

or

$$\left|\tilde{x}_j^\top\left(\tilde{y}_k - \tilde{X}_k\hat{\beta}\right)\right| \leq \lambda \quad \text{if } \hat{\beta}_j = 0. \qquad (9.2)$$

The statement that $\lambda_{k+1}$ is the largest value of the penalty parameter such that the minimization of $\tilde{l}_{k+1}$ yields a nonzero solution is established by showing:

(i) If $\lambda < \lambda_{k+1}$, the minimization yields nonzero solutions.

(ii) If $\lambda \geq \lambda_{k+1}$, the minimization yields the zero solution.

For (i), we argue by contradiction and suppose that there are only zero solutions when $\lambda < \lambda_{k+1}$. Substituting $\hat{\beta} = 0$ into (9.2) we have:

$$\left|\tilde{x}_j^\top \tilde{y}_k\right| \leq \lambda \quad \text{for all } j \notin s_{*k},$$

which means $\lambda_{k+1} = \max_{j \notin s_{*k}}|\tilde{x}_j^\top \tilde{y}_k| \leq \lambda$. This contradicts $\lambda < \lambda_{k+1}$; thus the supposition is wrong and (i) is proved.

Now let us turn to (ii). Let $\gamma_j = \tilde{x}_j^\top \tilde{y}_k/\lambda$ for $j \notin s_{*k}$, and let $\hat{\beta} = 0$; then (7) becomes:

$$\tilde{X}_k^\top \tilde{y}_k = \lambda\gamma. \qquad (10)$$

The proof of (ii) amounts to showing that this choice of γ is a valid subgradient at zero, i.e., $|\gamma_j| \leq 1$ for all $j \notin s_{*k}$. By the definition of $\lambda_{k+1}$ we have $|\tilde{x}_j^\top \tilde{y}_k| \leq \lambda_{k+1}$; since $\lambda \geq \lambda_{k+1}$, for all $j \notin s_{*k}$ we have $|\gamma_j| = |\tilde{x}_j^\top \tilde{y}_k|/\lambda \leq 1$. Thus $\hat{\beta} = 0$, together with this γ, solves Equation (10), and (ii) is proved.
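The following sketch carries out one such partially penalized step under the $\frac{1}{2}\|\cdot\|_2^2$ convention used above: it projects out the features already selected, computes $\lambda_{k+1}$ as the largest absolute inner product between the projected features and the projected response, and takes the maximizing indices as the newly entering features. It is a minimal illustration of the mechanics, not a full implementation of the procedure in this paper.

```python
import numpy as np

def sequential_lasso_step(X, y, selected):
    """One step of sequential lasso: return lambda_{k+1} and the indices
    attaining it, after projecting out the already selected features."""
    n, p = X.shape
    rest = [j for j in range(p) if j not in selected]
    if selected:
        Xs = X[:, list(selected)]
        # Projection onto the span of the selected columns.
        H = Xs @ np.linalg.pinv(Xs.T @ Xs) @ Xs.T
        y_tilde = y - H @ y
        X_tilde = X[:, rest] - H @ X[:, rest]
    else:
        y_tilde, X_tilde = y, X[:, rest]
    corr = np.abs(X_tilde.T @ y_tilde)
    lam = corr.max()
    new = [rest[i] for i in np.flatnonzero(np.isclose(corr, lam))]
    return lam, new

rng = np.random.default_rng(1)
n, p = 60, 100
X = rng.standard_normal((n, p))
y = 2 * X[:, 0] - X[:, 1] + rng.standard_normal(n)

selected = set()
for step in range(3):
    lam, new = sequential_lasso_step(X, y, selected)
    selected.update(new)
    print(f"step {step + 1}: lambda = {lam:.3f}, selected so far = {sorted(selected)}")
```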

3. EBIC and New EBIC
Suppose the dimension of the model space S is P, that is, there are P candidate covariates. Denote by $S_j$ the collection of all models with j covariates, so that the model space can be partitioned as $S = \bigcup_{j=1}^{P} S_j$, where models within each $S_j$ have equal dimension. Let $\tau(S_j)$ denote the size of $S_j$; we know that $\tau(S_j) = \binom{P}{j}$. For example, suppose the number of covariates under consideration is $P = 1000$. The class of models containing a single covariate, denoted by $S_1$, has size $\binom{1000}{1} = 1000$, while the class of models containing two covariates, denoted by $S_2$, has size $\binom{1000}{2} = 499500$. We can see that the size of $S_2$ is much bigger than the size of $S_1$. Now let us consider the prior distribution over S as follows.
For any model $s \in S_j$, we have $\Pr(s) = \Pr(s \mid S_j)\Pr(S_j)$, $j = 1, \ldots, P$. Since models within each $S_j$ have equal dimension, it is reasonable to assign an equal probability $\Pr(s \mid S_j) = 1/\tau(S_j)$ to every $s \in S_j$. The ordinary Bayesian information criterion assigns probabilities $\Pr(S_j)$ proportional to $\tau(S_j)$, that is, the prior probability $\Pr(s)$ is constant over the whole model space. However, this causes an unreasonable situation when the model space is large. For example, as discussed in the paragraph above, $\tau(S_2)$ is 999/2 times bigger than $\tau(S_1)$; under the constant prior implicit in BIC, the probability assigned to $S_2$ is also 999/2 times that assigned to $S_1$. By elementary combinatorics, the size of $S_j$ increases as j increases toward $P/2$, so that the probability assigned to $S_j$ by this prior increases almost exponentially. In other words, models with a larger number of covariates, say 50 or 100, receive much higher prior probabilities than models with fewer covariates. This is obviously unreasonable, being strongly against the principle of parsimony.
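A quick computation makes the imbalance explicit for P = 1000: under the constant prior over individual models that underlies the ordinary BIC, the total prior mass given to $S_j$ is proportional to $\tau(S_j)$, which grows astronomically with j.

```python
from math import comb

P = 1000
for j in (1, 2, 5, 50, 100, 500):
    tau = comb(P, j)   # number of models with exactly j covariates
    # Under a constant prior over models, Pr(S_j) is proportional to tau(S_j).
    print(f"j = {j:3d}: tau(S_j) has {len(str(tau))} digits")

# The ratio highlighted in the text: tau(S_2) / tau(S_1) = (P - 1) / 2.
print(comb(P, 2) / comb(P, 1))   # 499.5 = 999/2
```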
Instead of assigning probabilities $\Pr(S_j)$ proportional to $\tau(S_j)$, as in the ordinary BIC, Chen and Chen [10] assign $\Pr(S_j)$ proportional to $\tau(S_j)^{\varepsilon}$ for some $\varepsilon \in [0, 1]$. This results in the prior probability $\Pr(s) \propto \tau(S_j)^{\varepsilon - 1}$ for $s \in S_j$. This prior distribution gives rise to an extended BIC family (EBIC).
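For Gaussian linear regression, the EBIC of a candidate model s reduces, up to constants, to $n\log(\mathrm{RSS}(s)/n) + |s|\log n + 2\gamma\log\tau(S_{|s|})$, where γ corresponds to $1 - \varepsilon$ in the prior above; the sketch below computes this score and is only a stand-in for the criteria discussed in this paper.

```python
import numpy as np
from math import comb, log

def ebic_linear(y, X, support, gamma):
    """EBIC of the linear model using the covariates in `support`
    (Gaussian-error form from [10], dropping additive constants)."""
    n, P = X.shape
    s = list(support)
    if s:
        Xs = X[:, s]
        beta_hat, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        rss = np.sum((y - Xs @ beta_hat) ** 2)
    else:
        rss = np.sum((y - y.mean()) ** 2)
    k = len(s)
    return n * log(rss / n) + k * log(n) + 2 * gamma * log(comb(P, k))

rng = np.random.default_rng(2)
n, P = 80, 300
X = rng.standard_normal((n, P))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.standard_normal(n)

print(ebic_linear(y, X, [0, 1], gamma=1.0))        # true support
print(ebic_linear(y, X, [0, 1, 2, 3], gamma=1.0))  # overfitted support
```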
Notice that the extended Bayesian information criterion is obtained by introducing the function $f(x) = x^{\varepsilon}$ into the prior over the model space, which favors models with fewer covariates. From a functional point of view, $f$ is a monotonically increasing, upper convex (concave) function, and the parameter ε is confined to [0, 1] precisely to ensure this upper convexity. Inspired by this, we consider other upper convex functions, indexed by a parameter a that can be any positive number. In other words, we assign $\Pr(S_j)$ proportional to such a function of $\tau(S_j)$, so that each model in $S_j$ receives a correspondingly damped prior probability; this type of prior distribution on the model space gives rise to a new EBIC family as follows:
(11)
where $\hat{\theta}(s)$ is the maximum likelihood estimator of the parameters given model s. Now let us investigate the properties of our new EBIC for feature selection in linear regression models. Under some mild conditions our new EBIC is shown to be consistent.
Let $y = (y_1, \ldots, y_n)^\top$ be the vector of n observations on the response variable, let $X$ be the corresponding $n \times P$ design matrix containing all the covariates of concern, and let $\beta$ be the vector of regression coefficients. Assume that:

$$y = X\beta + \epsilon, \qquad (12)$$

where $\epsilon \sim N(0, \sigma^2 I_n)$ and $I_n$ is the identity matrix of size n. Let $s_0$ be the smallest subset of $\{1, \ldots, P\}$ such that $X\beta = X(s_0)\beta(s_0)$, where $X(s_0)$ and $\beta(s_0)$ are respectively the design matrix and the coefficients corresponding to $s_0$. Let $p_0$ be the number of components in $s_0$. We call $s_0$ the true submodel and denote $X\beta$ by $\mu$; K is an upper bound for the size of the models under consideration. Let the projection matrix of $X(s)$ be $H(s)$. Define $\Delta_n(s) = \|\mu - H(s)\mu\|_2^2$.
The family of our new EBIC under model (12) is defined as:
(13)
Under the asymptotic identifiability condition proposed by [10] , the consistency of our new EBIC is established. The asymptotic identifiability condition is as follows:
Condition 1: asymptotic identifiability. Model (12) with true submodel $s_0$ is asymptotically identifiable if:
Two other lemmas, proposed by [13], are also used in our proof.
Lemma 1: if
as
, we have:
Lemma 2: let $\chi^2_k$ denote a chi-square random variable with k degrees of freedom. If
and
, then:
uniformly for all
.
We now state the consistency result as follows.
Theorem 2. Assume that
for some constant k. If
, then under the asymptotic identifiability condition we have:
for
, as
.
Proof:
, say,
where:
Without loss of generality, we assume that
.
Case 1:
,
First, we will show
. We can write:
where the Zj's are i.i.d. standard normal variables. We have:
(14)
By asymptotic identifiability condition, uniformly over s such that
, we have:
(I)
Write
where:
We hence arrive at
(II)
where the last inequality follows from the Bonferroni inequality; in detail, we have:
where
. The last inequality comes from Lemma 2 that
. And according to Lemma 2, we can prove that
.
Thus, we have
(III)
We know that this term is a chi-square-distributed statistic with a fixed number of degrees of freedom. Substituting (I), (II) and (III) into (14), we have:
Under the asymptotic identifiability condition, we know that
, which goes to infinity faster than
, is the dominating term in
. Thus,
for any large constant C in probability, and hence we have:
Next, we will show
, as
.
We know that:
(15)
On the one hand,
Let
, we know that
as
, so that
. Thus, the above equation can be expressed as
which means:
(IV)
On the other hand,
According to Lemma 1 we have
as
. Thus, we obtain:
(V)
Substituting (IV) and (V) into (15), we can see that by choosing
, we obtain:
Case 2:
,
First, we will show:
In this case we have
, thus
.
And
where
,
are some independent standard normal random variables depending on s. Let
, we obtain that:
As
,
. In Case 1 we have shown that:
Thus,
So, we can see that:
Consequently, uniformly in s such that
, we have:
Thus, we have:
as
. The conclusion hence follows.

4. The Structure of Our Algorithm
Initial step:

Standardize $y$ and the $x_j$'s as in Section 2 so that they are centered and share a common norm, and let $\lambda_1 = \max_{j \in S}|x_j^\top y|$. Consider the penalized sum of squares (5) with $\lambda = \lambda_1$. The features with nonzero estimated coefficients are selected and the set of their indices is denoted by $s_{*1}$. Compute $\mathrm{NEBIC}(s_{*1})$ and the projection matrix $H(s_{*1})$.

General step:

For $k \geq 1$, compute $\tilde{y}_k = (I - H(s_{*k}))y$ and $\tilde{x}_j = (I - H(s_{*k}))x_j$ for $j \notin s_{*k}$. Let $\lambda_{k+1} = \max_{j \notin s_{*k}}|\tilde{x}_j^\top \tilde{y}_k|$ and consider the penalized sum of squares (6) with $\lambda = \lambda_{k+1}$. The features with nonzero estimated coefficients are selected and the set of their indices is denoted by $\hat{s}_{k+1}$; let $s_{*(k+1)} = s_{*k} \cup \hat{s}_{k+1}$. Compute $\mathrm{NEBIC}(s_{*(k+1)})$; if $\mathrm{NEBIC}(s_{*(k+1)}) > \mathrm{NEBIC}(s_{*k})$, stop; otherwise, compute $H(s_{*(k+1)})$ and continue.

Suppose that the above procedure stops at step $k+1$; then our algorithm outputs the index set of selected features, denoted by $s_*$. The parameters in the selected model are then estimated by their least-squares estimates:

$$\hat{\beta}(s_*) = \left[X(s_*)^\top X(s_*)\right]^{-1}X(s_*)^\top y.$$

The value $\mathrm{NEBIC}(s_{*k})$, $k \geq 1$, used in the above algorithm is given by (13) with $s = s_{*k}$.
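The loop below sketches how the whole procedure fits together. Because the NEBIC penalty is specified by (11) and (13), which are not restated here, the criterion is passed in as a user-supplied function (the `ebic_linear` sketch from Section 3 could serve as a stand-in); the rule of stopping once the criterion increases follows the description of the general step above and should be read as an illustration rather than a definitive implementation.

```python
import numpy as np

def sequential_lasso(X, y, criterion, max_steps=20):
    """Sequential lasso with an information-criterion stopping rule.
    `criterion(support)` returns the score of a candidate support
    (e.g. a NEBIC or EBIC value); smaller is taken to be better."""
    n, p = X.shape
    selected = []
    crit_old = np.inf
    for _ in range(max_steps):
        rest = [j for j in range(p) if j not in selected]
        if not rest:
            break
        if selected:
            Xs = X[:, selected]
            H = Xs @ np.linalg.pinv(Xs.T @ Xs) @ Xs.T
            y_t = y - H @ y
            X_t = X[:, rest] - H @ X[:, rest]
        else:
            y_t, X_t = y, X[:, rest]
        corr = np.abs(X_t.T @ y_t)
        new = [rest[i] for i in np.flatnonzero(np.isclose(corr, corr.max()))]
        candidate = selected + new
        crit_new = criterion(candidate)
        if crit_new > crit_old:          # stop once the criterion increases
            break
        selected, crit_old = candidate, crit_new
    # Least-squares refit on the selected support.
    Xs = X[:, selected]
    beta_hat = np.linalg.pinv(Xs.T @ Xs) @ Xs.T @ y
    return selected, beta_hat

# Example use, with the EBIC sketch from Section 3 as a stand-in criterion:
# selected, beta = sequential_lasso(X, y, lambda s: ebic_linear(y, X, s, gamma=1.0))
```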
5. Application to High-Dimensional Classification
5.1. Method
As we have discussed before, it is necessary to find better methods for estimating the unknown $u_k$'s and Ω in the small-n-large-p setting. In this paper, we apply our algorithm to identify the non-zero entries of the precision matrix, and the estimate of Ω is then obtained by constrained maximum likelihood estimation. We adopt the method proposed by [7] to estimate the class means. The estimated class means and precision matrix are finally used to construct the discrimination rule.
5.2. Procedure
1) Constrained estimation for the class means
The predictor components are first ordered according to the F-statistic for testing the significance of class effects. Then, instead of being pairwise fused, the class means for each component are clustered by a divisive hierarchical clustering procedure, and the class means are estimated under the structure revealed by the clustering. For details, we refer the reader to [7].
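The ordering step can be illustrated as follows: for each predictor component, the one-way ANOVA F-statistic for the class effect is computed and the components are ranked by it. The divisive clustering of the class means, which is the core of the method in [7], is not reproduced here.

```python
import numpy as np

def f_statistics(X, labels):
    """One-way ANOVA F-statistic of each column of X for the class labels."""
    classes = np.unique(labels)
    n, p = X.shape
    K = len(classes)
    grand = X.mean(axis=0)
    ss_between = np.zeros(p)
    ss_within = np.zeros(p)
    for c in classes:
        Xc = X[labels == c]
        ss_between += len(Xc) * (Xc.mean(axis=0) - grand) ** 2
        ss_within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    return (ss_between / (K - 1)) / (ss_within / (n - K))

rng = np.random.default_rng(3)
labels = np.repeat([0, 1, 2], 30)
X = rng.standard_normal((90, 50))
X[labels == 1, :3] += 1.0        # only the first 3 components carry class signal
order = np.argsort(f_statistics(X, labels))[::-1]
print("components ranked by class-effect F-statistic:", order[:5])
```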
2) Constrained estimation for the precision matrix
The identification of the non-zero entries of a concentration matrix has attracted considerable attention from researchers; the concentration matrix is the inverse of the covariance matrix of a random vector. A concentration matrix is closely related to an undirected graphical model. An undirected graphical model is specified by a vertex set V and an edge set E, and is denoted by $G = (V, E)$. The vertex set V represents a collection of random variables $Y_1, \ldots, Y_p$. The edge set E describes the inter-relationship among the random variables: there is an edge connecting vertices $Y_i$ and $Y_j$ if they are dependent conditional on all the remaining variables. Suppose that $Y = (Y_1, \ldots, Y_p)^\top$ follows a multivariate normal distribution with concentration matrix $\Omega = (\omega_{ij})$. Then there is an edge between $Y_i$ and $Y_j$ if and only if $\omega_{ij} \neq 0$. Thus, the detection of the edges of G is equivalent to the identification of the non-zero entries of Ω.
In this paper, we adopt the method proposed by [14], which is based on the relationship between the entries of Ω and the coefficients of p regression models in which each component of Y is regressed on the remaining p − 1 components. A non-zero entry of Ω corresponds to a non-zero regression coefficient in these regression models. In other words, the detection and estimation of the non-zero entries of Ω boil down to the selection and estimation of the non-zero coefficients in p regression models. Accordingly, we first apply our algorithm to identify the non-zero entries of Ω, and the estimate of Ω is then obtained by constrained maximum likelihood estimation. The details are as follows.
For an undirected graph $G = (V, E)$, let V be modeled as the set of the components of a random vector $Y = (Y_1, \ldots, Y_p)^\top$. We assume that Y follows a multivariate normal distribution $N(\mu, \Sigma)$. Without loss of generality, assume that $\mu = 0$. Let $Y_{-j}$ be the vector obtained from Y by eliminating the component $Y_j$. Denote by $\Sigma_{-j,-j}$ the variance-covariance matrix of $Y_{-j}$, by $\sigma_{jj}$ the variance of $Y_j$, and by $\sigma_{-j,j}$ the covariance vector between $Y_{-j}$ and $Y_j$. We know that the conditional distribution of $Y_j$ given $Y_{-j}$ is still normal, with the following conditional mean and conditional variance:

$$E(Y_j \mid Y_{-j}) = \sigma_{-j,j}^\top\Sigma_{-j,-j}^{-1}Y_{-j}, \qquad \operatorname{Var}(Y_j \mid Y_{-j}) = \sigma_{jj} - \sigma_{-j,j}^\top\Sigma_{-j,-j}^{-1}\sigma_{-j,j}.$$

Let $\beta_{jk}$, $k \neq j$, be the components of $\Sigma_{-j,-j}^{-1}\sigma_{-j,j}$, i.e., the regression coefficients in the conditional mean of $Y_j$ given $Y_{-j}$. We can then express the conditional distributions in the form of the following regression models:

$$Y_j = \sum_{k \neq j}\beta_{jk}Y_k + \epsilon_j, \quad j = 1, \ldots, p, \qquad (16)$$

where $\epsilon_j \sim N\left(0, \sigma_{jj} - \sigma_{-j,j}^\top\Sigma_{-j,-j}^{-1}\sigma_{-j,j}\right)$ is independent of $Y_{-j}$. Without loss of generality, suppose that $Y_j$ is the first component of Y. Let the covariance matrix Σ be partitioned as:

$$\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{-1,1}^\top \\ \sigma_{-1,1} & \Sigma_{-1,-1} \end{pmatrix},$$

and let $\Omega = \Sigma^{-1}$. By the formula for the inverse of a partitioned matrix, we can obtain:

$$\omega_{11} = \left(\sigma_{11} - \sigma_{-1,1}^\top\Sigma_{-1,-1}^{-1}\sigma_{-1,1}\right)^{-1}, \qquad (\omega_{12}, \ldots, \omega_{1p})^\top = -\omega_{11}\,\Sigma_{-1,-1}^{-1}\sigma_{-1,1}.$$

By comparing the upper-left block of Ω with the conditional variance above, we see that $\omega_{11}$ is the reciprocal of $\operatorname{Var}(Y_1 \mid Y_{-1})$, noting that the regression coefficient vector is $\Sigma_{-1,-1}^{-1}\sigma_{-1,1}$. These connections establish the equivalence:

$$\beta_{1k} = -\frac{\omega_{1k}}{\omega_{11}}, \qquad \text{so that } \beta_{1k} \neq 0 \iff \omega_{1k} \neq 0.$$

As we can see, the identification of the non-zero entries of Ω reduces to the identification of the non-zero $\beta_{jk}$'s in the above regression models. Thus, we first apply our algorithm to model (16) to identify all the non-zero entries of Ω.
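The identity underlying this reduction is easy to verify numerically: for any positive definite Σ, the coefficients of the regression of $Y_1$ on the remaining components equal $-\omega_{1k}/\omega_{11}$, so a zero regression coefficient corresponds exactly to a zero off-diagonal entry of Ω. A small check:

```python
import numpy as np

rng = np.random.default_rng(4)
p = 6
A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)      # a positive definite covariance matrix
Omega = np.linalg.inv(Sigma)

# Regression of Y_1 on Y_{-1}: beta_1 = Sigma_{-1,-1}^{-1} sigma_{-1,1}.
beta_from_sigma = np.linalg.solve(Sigma[1:, 1:], Sigma[1:, 0])

# The same coefficients obtained from the precision matrix: -omega_{1k} / omega_{11}.
beta_from_omega = -Omega[0, 1:] / Omega[0, 0]

print(np.allclose(beta_from_sigma, beta_from_omega))   # True
```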
Next, we summarize our proposed approach. First, we introduce some notation. Denote by $y_j$ the standardized n-vector of observations on $Y_j$, and let $Y_{(-j)}$ be the $n \times (p-1)$ matrix consisting of all the $y_k$ with $k \neq j$. Our algorithm based on these data sets is as follows:
Algorithm for identifying the non-zero entries of Ω
Initial step:
Let
, let
be
-vector, for
. Consider minimizing the following p penalized sums of squares separately:
Let
denote the indices of features with nonzero estimated coefficients for the above p regression models respectively. Let
. Then compute
and
for
.
General step k + 1 (k ≥ 1):
Let
for
. Consider minimizing the following p penalized sums of squares separately:
where
,
,
. Let
be the indices of features with nonzero estimated coefficients for the above ith regression model, and
for
. Then let
. Compute
. If
, stop; otherwise, compute
for
and continue.
Suppose the above algorithm stops at step
, then our algorithm outputs the index set
.
The NEBIC for set
is given by the following formula:
.
Estimation of precision matrix:
From the above algorithm we identify all the non-zero entries of
, let
be the set of all the indices. Then
is obtained by the constrained maximum likelihood estimation as follows:
where
is the empirical covariance matrix.
3) Discrimination rule
The discrimination rule is constructed by replacing the unknowns in $\delta_k(x)$ with the estimates obtained above. Here $\delta_k(x)$ is:

$$\delta_k(x) = u_k^\top\Omega x - \frac{1}{2}u_k^\top\Omega u_k + \log\pi_k.$$

Then we have $\hat{\delta}_k(x)$:

$$\hat{\delta}_k(x) = \hat{u}_k^\top\hat{\Omega}x - \frac{1}{2}\hat{u}_k^\top\hat{\Omega}\hat{u}_k + \log\hat{\pi}_k.$$

Classify x into class k if and only if:

$$\hat{\delta}_k(x) = \max_{1 \leq l \leq K}\hat{\delta}_l(x).$$
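Once the class means, class proportions and precision matrix have been estimated, the plug-in rule is a direct computation. A minimal sketch, assuming the linear discriminant form written above (the estimates would come from the procedures in 1) and 2)):

```python
import numpy as np

def classify(x, means, Omega_hat, priors):
    """Plug-in LDA rule: assign x to the class maximizing
    delta_k(x) = u_k' Omega x - 0.5 u_k' Omega u_k + log pi_k."""
    scores = [
        u @ Omega_hat @ x - 0.5 * u @ Omega_hat @ u + np.log(pi)
        for u, pi in zip(means, priors)
    ]
    return int(np.argmax(scores))

# Toy example with two classes and (for simplicity) the true parameters plugged in.
Omega_hat = np.eye(3)
means = [np.zeros(3), np.ones(3)]
priors = [0.5, 0.5]
print(classify(np.array([0.9, 1.1, 0.8]), means, Omega_hat, priors))  # expect class 1
```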
5.3. Simulation Studies
We compare our method with other methods available in the literature through numerical studies in this section. The methods considered for the comparison are: 1) linear discrimination with detected sparsity (LDwDS) in [7] . 2) The adaptive hierarchically penalized nearest shrunken centroid (ahp-NSC) and the adaptive
-norm penalized NSC (alp-NSC) proposed in [15] . 3) Pairwise SIS (PSIS) in [16] .
The performance of the methods is evaluated by the misclassification rate (MCR) and the computation time. We consider three simulation settings and repeat each setting 100 times. At each replicate under a simulation setting, we simulate a training data set and a testing data set. The training data set is used to construct the discrimination rules with each method, and the testing set is used to evaluate the MCR and record the computation time.
We design the simulation settings following [7] and consider
,
. The following single scheme for the class means is taken throughout:
The covariance matrix is generated through the precision matrix. First, the nonzero positions of the precision matrix are decided in the following schemes:
ES1: the non-zero entries of Ω are randomly determined with:
ES2: Ω is a diagonal block matrix with 10 blocks of size 10.
ES3: Ω is a diagonal block matrix with 10 blocks of size 10, each block is a diagonal band matrix with
if
.
For the nonzero values of the precision matrix, we first generate
as follows:
,
are generated as i.i.d. observations from the uniform distribution on
. Then we take
. Eventually, take the common covariance matrix as the correlation matrix corresponding to
. The training sample size
and the testing sample size
. The covariance matrix is generated as a diagonal block matrix with four 100 × 100 identical blocks. The 100 × 100 block is generated by the same generating schemes above. The simulation results under these three settings are reported in Tables 1-3.
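One plausible way to generate an ES3-type precision matrix is sketched below: a block-diagonal matrix with banded blocks is built, its nonzero values are drawn uniformly with random signs, the diagonal is inflated to enforce positive definiteness, and the implied covariance matrix is rescaled to a correlation matrix. The bandwidth, the uniform range and the diagonal adjustment are illustrative choices, not the exact scheme used in the simulations.

```python
import numpy as np

def es3_precision(p=100, block=10, bandwidth=2, low=0.5, high=1.0, seed=0):
    """Block-diagonal banded precision matrix (ES3-style), made positive
    definite by diagonal dominance; all numeric choices are illustrative."""
    rng = np.random.default_rng(seed)
    Omega = np.zeros((p, p))
    for start in range(0, p, block):
        for i in range(start, start + block):
            for j in range(i + 1, min(i + bandwidth + 1, start + block)):
                Omega[i, j] = Omega[j, i] = rng.uniform(low, high) * rng.choice([-1, 1])
    Omega += np.eye(p) * (np.abs(Omega).sum(axis=1) + 1.0)   # diagonal dominance
    return Omega

Omega = es3_precision()
Sigma = np.linalg.inv(Omega)
d = np.sqrt(np.diag(Sigma))
R = Sigma / np.outer(d, d)          # correlation matrix used as the common covariance
print(R.shape, np.all(np.linalg.eigvalsh(R) > 0))
```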
From all three tables we can see that, in terms of MCR, our method performs much better than the two NSC methods and PSIS. Although our method has the same MCR as LDwDS in Table 1 and Table 2, it takes less computation time than LDwDS. From Table 3 we find that our method performs better than LDwDS in both misclassification rate and computation time. From [7] we note that the misclassification rates of LDwDS are universally lower than those of all the other methods under consideration. Nevertheless, it is not without drawbacks: LDwDS takes longer computation time than the other methods.
Table 1. Averaged misclassification rate (MCR) and computation time (in seconds) under simulation ES1 (numbers in the parentheses are standard deviations).
Table 2. Averaged misclassification rate (MCR) and computation time (in seconds) under simulation ES2 (numbers in the parentheses are standard deviations).
Table 3. Averaged misclassification rate (MCR) and computation time (in seconds) under simulation ES3 (numbers in the parentheses are standard deviations).
Therefore, from the perspectives of computation time and misclassification rate, our algorithm performs better than its competing algorithms.
6. Summary and Discussion
In this paper, on the one hand, based on a characteristic of lasso regression we propose a novel method for choosing the regularization parameter λ, which gives the exact maximum value of λ for lasso regression. On the other hand, since the ordinary Bayes information criterion is too liberal for model selection when the model space is large, inspired by [10] we propose a new EBIC (NEBIC). Compared with EBIC, the parameter in our new criterion does not need to be restricted to a small range; moreover, under some mild conditions our new EBIC is also shown to be consistent. Based on these two findings, we obtain a new algorithm for feature selection: we apply our new method to choose the regularization parameter for sequential lasso, and our NEBIC is used as the stopping rule. After that, we apply our algorithm to the classification problem, and the numerical studies demonstrate that it performs better than the competing methods under consideration.
Further research may consider these two questions:
Firstly, our method chooses the largest regularization parameter λ such that at least one of the βj's, $j \in S$, will be estimated nonzero. A natural question is whether, instead of solving the optimization problems, we can obtain the indices of the features with nonzero estimated coefficients by simpler means. Luo and Chen [17] pointed out that these indices are related to the choice of the regularization parameter λ under some conditions. However, those conditions are rather strict. Further research may consider whether milder conditions can connect our new method for choosing the regularization parameter with the indices of the nonzero estimated coefficients.
Secondly, we choose an upper convex function as the prior weighting over the model space to extend EBIC, and our new EBIC is shown to be consistent. Subsequent research may consider whether there exists a special class of functions for model-space priors such that, within this class, the resulting Bayes information criteria are consistent.