A New Extended BIC and Sequential Lasso Regression Analysis and Their Application in Classification

Abstract

In this paper, we first propose a new method for choosing the regularization parameter λ in lasso regression; unlike traditional methods such as multifold cross-validation, it gives the maximum value of λ directly. Second, by considering another prior form over the model space in the Bayes approach, we propose a new extended Bayesian information criterion family, and under mild conditions our new EBIC (NEBIC) is shown to be consistent. We then apply our new method to choose the parameter for sequential lasso regression, which selects features by sequentially solving partially penalized least squares problems in which the features selected in earlier steps are not penalized in subsequent steps; sequential lasso uses NEBIC as its stopping rule. Finally, we apply our algorithm to identify the nonzero entries of the precision matrix for high-dimensional linear discrimination analysis. Simulation results demonstrate that our algorithm achieves a lower misclassification rate and less computation time than the competing methods under consideration.

Chen, J. and Ye, W. (2023) A New Extended BIC and Sequential Lasso Regression Analysis and Their Application in Classification. Advances in Pure Mathematics, 13, 284-302. doi: 10.4236/apm.2023.135020.

1. Introduction

Sparse high-dimensional regression (SHR) models arise in many important contemporary scientific fields. An SHR model takes the form:

$$y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \epsilon_i, \quad i = 1, 2, \ldots, n, \qquad (1)$$

where the number of features p is much larger than the sample size n and only a relatively small number of the $\beta_j$'s are nonzero. Feature selection is crucial in the analysis of SHR models. Desboulets [1] pointed out that variable selection procedures fall into three main categories: test-based, penalty-based, and screening-based procedures. Regularized regression approaches to the analysis of SHR models have attracted considerable attention. A regularized regression approach selects the features and estimates the coefficients simultaneously by minimizing a penalized sum of squares of the form:

$$\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \sum_{j=1}^{p} p_\lambda(|\beta_j|), \qquad (2)$$

where λ is a regularization parameter and $p_\lambda$ is a penalty function such that the number of fitted nonzero coefficients can be regulated by λ; that is, only a certain number of the $\beta_j$'s are estimated as nonzero when λ is set at a certain value. Various penalty functions have been proposed and studied, including the lasso, $p_\lambda(|\beta_j|) = \lambda|\beta_j|$; SCAD [2], which smoothly clips an $L_1$ penalty (for small $|\beta_j|$) and a constant penalty (for large $|\beta_j|$); the adaptive lasso [3], $p_\lambda(|\beta_j|) = \lambda\omega_j|\beta_j|$, where the $\omega_j$ are given weights; and MCP [4], which smoothly approaches the $L_1$ penalty from a constant penalty (for large $|\beta_j|$) by an asymptote.
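For concreteness, the following Python sketch fits the lasso case of (2) on synthetic data with scikit-learn. The data, the sparsity level, and the value alpha = 0.1 are arbitrary illustrative choices, and note that scikit-learn's objective scales the least-squares term by 1/(2n), so its alpha corresponds to λ only up to that scaling.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Toy data: n = 50 observations, p = 200 features, 5 truly nonzero coefficients.
rng = np.random.default_rng(0)
n, p = 50, 200
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = 2.0
y = X @ beta_true + rng.standard_normal(n)

# sklearn's Lasso minimizes (1/(2n))||y - X beta||_2^2 + alpha * ||beta||_1,
# so alpha plays the role of the regularization parameter in (2) up to scaling.
model = Lasso(alpha=0.1, fit_intercept=True)
model.fit(X, y)
print("number of nonzero coefficients:", np.count_nonzero(model.coef_))
```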

Sequential methods have also received attention in recent decades for feature selection in SHR models. Traditional sequential procedures such as forward stepwise regression (FSR) were criticized for their greedy nature. However, it was discovered recently that the greedy nature is in fact an advantage if the goal is to identify relevant features, see [5], especially in the presence of high spurious correlations caused by the extremely high dimensionality of the feature space. A sequential procedure of a different nature, called least angle regression (LAR), was proposed in [6]. LAR continuously updates the estimate of the expected responses along a direction having equal angle with the features already selected and selects new features having the largest absolute correlation with the updated current residuals. Recently, Luo and Chen [7] identified the nonzero entries of the precision matrix by a sequential method called JR-SLasso proposed in [8]. Chen and Jiang [9] proposed a two-stage sequential conditional selection (TSCS) approach to the identification and estimation of the nonzeros of the coefficient matrix and the precision matrix. Sequential approaches have also been considered for models other than SHR models. Besides their desirable theoretical properties, sequential approaches are computationally more appealing: they are more stable and less affected by the dimensionality of the feature space.

Apart from variable selection, model selection is also an important part of regression analysis. In statistical modeling an investigator often faces the problem of choosing a suitable model from among a collection of viable candidates. Such a determination may be facilitated by a selection criterion, which assigns a score to every model in a candidate set based on some underlying statistical principle. The Bayesian information criterion (BIC) is one of the most widely known and pervasively used tools in statistical model selection. However, as shown in [10], the ordinary Bayesian information criterion is too liberal when the model space is large; that is, it selects far more features than the relevant ones. Chen and Chen [10] proposed a family of extended Bayes information criteria (EBIC) to better meet the needs of variable selection for large model spaces.

Besides regression, classification problems also receive widespread attention in statistical modeling. Classification problems with high-dimensional data arise in many important contemporary scientific fields such as genetics and medicine. The most popular method for classification is Fisher's linear discrimination analysis (LDA). In a K-class classification problem, the LDA assumes that the predictor $x = (x_1, \ldots, x_p)^T$ given class $G = k$ follows the multivariate normal distribution $N(u_k, \Sigma)$, $k \in \{1, \ldots, K\}$. Let $\pi_k = \Pr(G = k)$ and:

$$d_{kj}(x) = [x - (u_k + u_j)/2]^T \Omega (u_k - u_j) + \ln(\pi_k/\pi_j), \qquad (3)$$

where $\Omega = \Sigma^{-1}$ is the so-called precision matrix. The Bayes rule, which is theoretically optimal, classifies x into class k if and only if:

$$d_{kj}(x) > 0, \quad \text{for all } j \neq k.$$

Since the Bayes rule cannot be realized in practice because the $u_k$'s and Ω are unknown, the Bayes rule is estimated in the LDA by replacing the $u_k$'s and Ω with their estimates. If the dimension of the predictor p (< n) is fixed, or diverges under certain conditions, the LDA has been shown to be asymptotically optimal; but if p > n, where n is the sample size, it is asymptotically as bad as random guessing, see [11]. The failure of LDA in the case p > n is due to accumulated errors in the estimation of the unknowns, as argued in [12]. Thus, it is necessary to bypass the difficulties in the estimation of these unknowns.
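To make the rule concrete, here is a small Python sketch (not the estimation procedure of this paper) that applies the Bayes rule in (3) once estimates of the class means, the precision matrix, and the class proportions are in hand; all inputs below are synthetic placeholders.

```python
import numpy as np

def lda_classify(x, means, precision, priors):
    """Assign x to the class k with d_kj(x) > 0 for all j != k, as in (3).
    This is equivalent to picking the class maximizing the linear score below."""
    K = len(means)
    scores = np.empty(K)
    for k in range(K):
        # score_k = x^T Omega u_k - 0.5 u_k^T Omega u_k + ln(pi_k);
        # d_kj(x) = score_k - score_j, so argmax of scores satisfies the rule.
        scores[k] = (x @ precision @ means[k]
                     - 0.5 * means[k] @ precision @ means[k]
                     + np.log(priors[k]))
    return int(np.argmax(scores))

# Synthetic two-class example with identity precision matrix.
rng = np.random.default_rng(1)
p = 10
means = [np.zeros(p), np.full(p, 0.5)]
precision = np.eye(p)
priors = [0.5, 0.5]
x_new = rng.standard_normal(p) + means[1]
print("assigned class:", lda_classify(x_new, means, precision, priors))
```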

In this paper, we first propose a novel method to choose the regularization parameter λ for lasso regression. Traditional methods such as multifold cross-validation choose the regularization parameter gradually. In detail, for lasso regression, the larger the regularization parameter λ, the more regression coefficients are estimated as zero; once λ exceeds a maximum value, no features remain in the candidate model. It is therefore important to know this maximum value exactly, but multifold cross-validation does not reveal it. Differing from multifold cross-validation, our method gives the exact maximum value of the regularization parameter such that at least one of the $\beta_j$'s will be estimated as nonzero. Thus, given the data, the largest λ for lasso regression can be obtained immediately by our method. We then apply this method to choose the parameter for a sequence of partially penalized least squares problems, in which the features selected in an earlier step are not penalized in subsequent steps.

Secondly, re-examining the BIC and EBIC naturally prompts us to consider other forms of prior over the model space in the Bayes approach. We propose a new extended Bayesian information criterion family and, under mild conditions, our new EBIC (NEBIC) is shown to be consistent. NEBIC is then used as the stopping rule for the sequential lasso algorithm; we dub the proposed procedure sequential lasso with NEBIC. After that, we apply our algorithm to the classification problem, and the numerical study demonstrates that it performs better than the competing methods under consideration.

The remainder of the paper is arranged as follows. In Section 2, we introduce the basic properties of sequential lasso and our new method for choosing the regularization parameter λ. In Section 3, the selection consistency of our NEBIC is established. In Section 4, the structure of our algorithm is given. In Section 5, we apply our algorithm to classification problems and present the main method and the simulation results.

2. Sequential Lasso Regression for Feature Selection

Let $X = (x_1, x_2, \ldots, x_p)$ be the $n \times p$ design matrix, where for $j = 1, \ldots, p$, $x_j = (x_{1j}, x_{2j}, \ldots, x_{nj})^T$ is the observation vector of predictor j on the n individuals, and let $y = (y_1, y_2, \ldots, y_n)^T$ be the response vector. Assume that the $x_j$ and y have been standardized so that $x_j^T\mathbf{1} = 0$, $y^T\mathbf{1} = 0$, $y^Ty = n$, and $x_j^Tx_j = n$. Thus, in model (1) the intercept $\beta_0$ can be omitted. Let $\beta = (\beta_1, \beta_2, \ldots, \beta_p)^T$ and $\epsilon = (\epsilon_1, \epsilon_2, \ldots, \epsilon_n)^T$; in matrix notation, model (1) is expressed as:

$$y = X\beta + \epsilon. \qquad (4)$$

Let $S$ denote the index set $\{1, 2, \ldots, p\}$ and let $s$ be any subset of $S$. Denote by $X(s)$ the matrix consisting of the columns of $X$ with indices in $s$. Similarly, let $\beta(s)$ denote the vector consisting of the corresponding components of $\beta$. Let $R(s)$ be the linear space spanned by the columns of $X(s)$ and $H(s)$ its corresponding projection matrix:

$$H(s) = X(s)[X^T(s)X(s)]^{-1}X^T(s).$$

At the initial step, sequential lasso minimizes the following penalized sum of squares:

$$l_1 = \|y - X\beta\|_2^2 + \lambda_1\sum_{j=1}^{p}|\beta_j|. \qquad (5)$$

Let $\lambda_1 = 2\max_{j \in \{1,2,\ldots,p\}}|x_j^Ty|$; from Theorem 1, $\lambda_1$ is the largest value of the penalty parameter such that at least one of the $\beta_j$'s, $j \in \{1,2,\ldots,p\}$, is estimated as nonzero. The features with nonzero estimated coefficients are selected, and the set of their indices is denoted by $s_{*1}$.

For $k \geq 1$, let $s_{*k}$ be the index set of the features selected up to step k. At step $k+1$, sequential lasso minimizes the following penalized sum of squares:

$$l_{k+1} = \|y - X\beta\|_2^2 + \lambda_{k+1}\sum_{j \notin s_{*k}}|\beta_j|. \qquad (6)$$

From Proposition 1, the minimization of $l_{k+1}$ is equivalent to the minimization of

$$\tilde{l}_{k+1} = \|\tilde{y} - \tilde{X}\tilde{\beta}\|_2^2 + \lambda_{k+1}\sum_{j \notin s_{*k}}|\beta_j|,$$

where $\tilde{y} = [I - H(s_{*k})]y$, $\tilde{X} = [I - H(s_{*k})]X(s_{*k}^c)$, $\tilde{\beta} = \beta(s_{*k}^c)$, with $s_{*k}^c$ the complement of $s_{*k}$ in $S$, and $\tilde{x}_j = [I - H(s_{*k})]x_j$. Let $\lambda_{k+1} = 2\max_{j \notin s_{*k}}|\tilde{x}_j^T\tilde{y}|$; from Theorem 1, $\lambda_{k+1}$ is the largest value of the penalty parameter such that at least one of the $\beta_j$'s, $j \notin s_{*k}$, is estimated as nonzero.

Next, we give the statements and proofs of Proposition 1 and Theorem 1.

Proposition 1. For $k \geq 1$, the minimization of $l_{k+1}$ is equivalent to the minimization of $\tilde{l}_{k+1}$.

Proof: Differentiating $l_{k+1}$ with respect to $\beta(s_{*k})$, we have

$$\frac{\partial l_{k+1}}{\partial \beta(s_{*k})} = -2X^T(s_{*k})y + 2X^T(s_{*k})X(s_{*k})\beta(s_{*k}) + 2X^T(s_{*k})X(s_{*k}^c)\beta(s_{*k}^c).$$

Setting the above derivative to zero, we obtain:

$$\hat{\beta}(s_{*k}) = [X^T(s_{*k})X(s_{*k})]^{-1}X^T(s_{*k})[y - X(s_{*k}^c)\beta(s_{*k}^c)].$$

Substituting $\hat{\beta}(s_{*k})$ into $\|y - X\beta\|_2^2$, we have:

$$\begin{aligned}
l_{k+1} &= \|y - X\beta\|_2^2 + \lambda_{k+1}\sum_{j \notin s_{*k}}|\beta_j| \\
&= \big\|y - \big(X(s_{*k})\hat{\beta}(s_{*k}) + X(s_{*k}^c)\beta(s_{*k}^c)\big)\big\|_2^2 + \lambda_{k+1}\sum_{j \notin s_{*k}}|\beta_j| \\
&= \big\|[I - H(s_{*k})][y - X(s_{*k}^c)\beta(s_{*k}^c)]\big\|_2^2 + \lambda_{k+1}\sum_{j \notin s_{*k}}|\beta_j| \\
&= \|\tilde{y} - \tilde{X}\tilde{\beta}\|_2^2 + \lambda_{k+1}\sum_{j \notin s_{*k}}|\beta_j| = \tilde{l}_{k+1}.
\end{aligned}$$

Theorem 1. $\lambda_1 = 2\max_{j \in \{1,2,\ldots,p\}}|x_j^Ty|$ is the largest value of the regularization parameter such that at least one of the $\beta_j$'s, $j \in \{1,2,\ldots,p\}$, is estimated as nonzero.

Proof: This statement is equivalent to saying that $\lambda_1$ is the largest value of the regularization parameter such that the minimization of $l_1$ yields a nonzero solution.

By the KKT condition, setting the subgradient of $l_1$ at $\hat{\beta}$ to zero, we have:

$$2X^T(y - X\hat{\beta}) = \lambda_1\,\partial\|\hat{\beta}\|_1. \qquad (7)$$

In component form:

$$2x_j^T(y - X\hat{\beta}) = \lambda_1\,\partial|\hat{\beta}_j|, \quad j \in \{1,2,\ldots,p\}, \qquad (8)$$

where $\partial|\hat{\beta}_j|$ is a subgradient of $|\beta_j|$ at $\hat{\beta}_j$. According to the value of $\hat{\beta}_j$, there are three situations:

$$\partial|\hat{\beta}_j| \in (-1, 1) \text{ when } \hat{\beta}_j = 0; \quad \partial|\hat{\beta}_j| = 1 \text{ when } \hat{\beta}_j > 0; \quad \partial|\hat{\beta}_j| = -1 \text{ when } \hat{\beta}_j < 0.$$

Then (8) is equivalent to:

$$2x_j^T(y - X\hat{\beta}) = \pm\lambda_1, \quad \text{when } \hat{\beta}_j \neq 0, \qquad (9.1)$$

or

$$|2x_j^T(y - X\hat{\beta})| < \lambda_1, \quad \text{when } \hat{\beta}_j = 0. \qquad (9.2)$$

The statement "$\lambda_1$ is the largest value of the penalty parameter such that the minimization of $l_1$ yields a nonzero solution" is established by showing:

(i) With $\lambda_1 = 2\max_{j \in \{1,2,\ldots,p\}}|x_j^Ty|$, the problem $\min_\beta \|y - \sum_{j=1}^{p}\beta_jx_j\|_2^2 + \lambda_1\sum_{j=1}^{p}|\beta_j|$ has a nonzero solution.

(ii) For any $\lambda > \lambda_1$, the problem $\min_\beta \|y - \sum_{j=1}^{p}\beta_jx_j\|_2^2 + \lambda\sum_{j=1}^{p}|\beta_j|$ has only the zero solution.

For (i), we argue by contradiction: suppose that there is only the zero solution when $\lambda_1 = 2\max_{j \in \{1,2,\ldots,p\}}|x_j^Ty|$.

Substituting $\hat{\beta} = 0$ into (9.2), we have $|2X^T(y - X\cdot 0)| < \lambda_1$, which means:

$$|2x_j^Ty| < \lambda_1, \quad j \in \{1,2,\ldots,p\}.$$

This contradicts $\lambda_1 = 2\max_{j \in \{1,2,\ldots,p\}}|x_j^Ty|$; thus the supposition is wrong and (i) is proved.

Now let us turn to (ii):

Let $l = \|y - \sum_{j=1}^{p}\beta_jx_j\|_2^2 + \lambda\sum_{j=1}^{p}|\beta_j|$; setting its subgradient at $\hat{\beta}$ to zero, we have:

$$2X^T(y - X\hat{\beta}) = \lambda\,\partial\|\hat{\beta}\|_1. \qquad (10)$$

Then, proving (ii) is equivalent to showing that:

$$|2x_j^T(y - X\cdot 0)| < \lambda, \quad j \in \{1,2,\ldots,p\}.$$

Substituting $\hat{\beta} = 0$ into the left-hand side of (10), we have:

$$|2x_j^T(y - X\cdot 0)| = |2x_j^Ty|, \quad j \in \{1,2,\ldots,p\}.$$

Since $\lambda_1 = 2\max_{j \in \{1,2,\ldots,p\}}|x_j^Ty|$, we have $|2x_j^Ty| \leq \lambda_1$; and since $\lambda > \lambda_1$, for all $j \in \{1,2,\ldots,p\}$ we have $|2x_j^T(y - X\cdot 0)| \leq \lambda_1 < \lambda$. Thus $\hat{\beta} = 0$ solves Equation (10), and (ii) is proved.
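The λ rule of Theorem 1 and its projected version $\lambda_{k+1} = 2\max_{j \notin s_{*k}}|\tilde{x}_j^T\tilde{y}|$ from the general step can be computed directly from the data. Below is a minimal Python sketch on synthetic data; the function name `lambda_max` and the toy data are our own illustrative choices, not part of the paper.

```python
import numpy as np

def lambda_max(X, y, selected=()):
    """2 * max_j |x_j^T (I - H(s)) y| over j not in s: the largest penalty at
    which some unpenalized candidate coefficient is estimated nonzero."""
    n, p = X.shape
    selected = list(selected)
    if selected:
        Xs = X[:, selected]
        # Projection onto the span of the already-selected columns.
        H = Xs @ np.linalg.solve(Xs.T @ Xs, Xs.T)
        y_tilde = y - H @ y
    else:
        y_tilde = y
    candidates = [j for j in range(p) if j not in selected]
    return 2.0 * np.max(np.abs(X[:, candidates].T @ y_tilde))

# Illustration with column-standardized toy data.
rng = np.random.default_rng(2)
n, p = 100, 20
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X[:, 0] + rng.standard_normal(n)
y = y - y.mean()
print("lambda_1:", lambda_max(X, y))
print("lambda_2 after selecting feature 0:", lambda_max(X, y, selected=[0]))
```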

3. EBIC and New EBIC

Suppose the dimension of the model space $S$ is P and denote by $S_j$ the collection of all models with j covariates, so that the model space can be partitioned as $S = \bigcup_{j=1}^{P} S_j$, where models within each $S_j$ have equal dimension. Let $\tau(S_j)$ be the size of $S_j$; we know that $\tau(S_j) = \binom{P}{j}$. For example, suppose the number of covariates under consideration is $P = 1000$. The class of models containing a single covariate is $S_1 = \{\{1\}, \{2\}, \ldots, \{P\}\}$, with size $\tau(S_1) = \binom{P}{1} = 1000$, while the class of models containing two covariates is $S_2 = \{\{1,2\}, \{1,3\}, \ldots, \{1,P\}, \ldots\}$, with size $\tau(S_2) = \binom{P}{2} = (1000 \times 999)/2$. We can see that the size of $S_2$ is much bigger than the size of $S_1$. Now let us consider the prior distribution over $S$.

For $s \in S_j$, we have $\Pr(s) = \Pr(S_j)\Pr(s \mid S_j)$, $j = 1, 2, \ldots, P$. Since models within each $S_j$ have equal dimension, it is reasonable to assign an equal probability $\Pr(s \mid S_j) = 1/\tau(S_j)$ to any $s \in S_j$. The ordinary Bayesian information criterion assigns probabilities $\Pr(S_j)$ proportional to $\tau(S_j)$, that is, $\Pr(S_j) \propto \tau(S_j)$. However, this leads to an unreasonable situation when the model space is large. For example, as discussed above, $\tau(S_2)$ is $999/2$ times larger than $\tau(S_1)$, so under this constant prior the probability assigned to $S_2$ is also $999/2$ times that assigned to $S_1$. By properties of binomial coefficients, the size of $S_j$ increases as j increases up to $j = P/2 = 500$, so the probability assigned to $S_j$ by this prior increases almost exponentially. In other words, models with a larger number of covariates, say 50 or 100, receive much higher probabilities than models with fewer covariates. This is obviously unreasonable, being strongly against the principle of parsimony.

Instead of assigning probabilities $\Pr(S_j)$ proportional to $\tau(S_j)$, as in the ordinary BIC, Chen and Chen [10] assign $\Pr(S_j)$ proportional to $\tau^{\varepsilon}(S_j)$ for $\varepsilon \in [0,1]$. This results in the prior probability $\Pr(s) = \Pr(S_j)\Pr(s \mid S_j) \propto \tau^{\varepsilon-1}(S_j) = \tau^{-\gamma}(S_j)$, $\gamma = 1 - \varepsilon$. This prior distribution gives rise to an extended BIC family (EBIC).

Notice that the extended Bayesian information criterion is obtained by introducing the function $x^{\varepsilon}$ ($x > 1$, $0 < \varepsilon < 1$), which aims to select models with fewer covariates. From a functional perspective, $x^{\varepsilon}$ ($x > 1$, $0 \leq \varepsilon \leq 1$) is a monotone increasing concave function, and the parameter $\varepsilon$ is confined to [0, 1] to ensure that concavity holds. Inspired by this, we consider other concave functions, such as $\frac{x}{x+a}$, where the parameter $a$ can be any positive number. In other words, we assign $\Pr(S_j)$ proportional to $\frac{\tau(S_j)}{\tau(S_j)+a}$, so that $\Pr(s) = \Pr(S_j)\Pr(s \mid S_j) \propto \frac{\tau(S_j)}{\tau(S_j)+a}\cdot\frac{1}{\tau(S_j)} = \frac{1}{\tau(S_j)+a}$. This type of prior distribution on the model space gives rise to a new EBIC family as follows:

$$\mathrm{BIC}_a(s) = -2\ln L_n\{\hat{\theta}(s)\} + v(s)\ln n + 2\ln(\tau(S_j) + a), \quad a > 0, \qquad (11)$$

where $\hat{\theta}(s)$ is the maximum likelihood estimator of $\theta(s)$ given model s. Now let us investigate the properties of our new EBIC for feature selection in linear regression models. Under some mild conditions, our new EBIC is shown to be consistent.

Let $y_n$ be the vector of n observations on the response variable, let $X_n$ be the corresponding design matrix with all the covariates of concern, and let $\beta$ be the vector of regression coefficients. Assume that:

$$y_n = X_n\beta + e_n, \qquad (12)$$

where $e_n \sim N(0, \sigma^2 I_n)$ and $I_n$ is the identity matrix of size n. Let $s_0$ be the smallest subset of $\{1, \ldots, p_n\}$ such that $u_n \equiv Ey_n = X_n(s_0)\beta(s_0)$, where $X_n(s_0)$ and $\beta(s_0)$ are, respectively, the design matrix and the coefficients corresponding to $s_0$. Let $v(s_0)$ be the number of components in $s_0$. We call $s_0$ the true submodel, denote $v(s_0)$ by $K_0$, and let K be an upper bound for $K_0$. Let the projection matrix of $X_n(s)$ be $H_n(s) = X_n(s)[X_n^T(s)X_n(s)]^{-1}X_n^T(s)$. Define $\Delta_n(s) = \|u_n - H_n(s)u_n\|^2$.

The family of our new EBIC under model (12) is defined as:

$$\mathrm{BIC}_a(s) = n\ln\left(\frac{y_n^T(I_n - H_n(s))y_n}{n}\right) + v(s)\ln n + 2\ln(\tau(S_j) + a), \quad s \in S_j,\ a > 0. \qquad (13)$$
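Formula (13) can be evaluated from nothing more than the residual sum of squares of a candidate submodel, its size, n, p, and a. The following Python sketch does this; the choice a = 1 and the example numbers are purely illustrative.

```python
import math

def nebic(rss, n, p, model_size, a=1.0):
    """NEBIC of (13): n*ln(RSS/n) + v(s)*ln(n) + 2*ln(C(p, v(s)) + a)."""
    # log of the binomial coefficient tau(S_j) = C(p, j), via lgamma for stability.
    log_binom = (math.lgamma(p + 1) - math.lgamma(model_size + 1)
                 - math.lgamma(p - model_size + 1))
    if log_binom > 700:
        # exp(log_binom) would overflow; ln(C + a) equals ln(C) up to negligible error.
        prior_term = 2.0 * log_binom
    else:
        prior_term = 2.0 * math.log(math.exp(log_binom) + a)
    return n * math.log(rss / n) + model_size * math.log(n) + prior_term

# Example: a 3-variable submodel with residual sum of squares 80, n = 100, p = 1000.
print(nebic(rss=80.0, n=100, p=1000, model_size=3, a=1.0))
```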

Under the asymptotic identifiability condition proposed by [10] , the consistency of our new EBIC is established. The asymptotic identifiability condition is as follows:

Condition 1 (asymptotic identifiability). Model (12) with true submodel $s_0$ is asymptotically identifiable if:

$$\lim_{n \to \infty}\min\{(\ln n)^{-1}\Delta_n(s) : s \not\supset s_0,\ v(s) \leq K_0\} = \infty.$$

Two further lemmas from [13] are also useful for our proof.

Lemma 1: If $\ln j/\ln p \to \delta$ as $p \to +\infty$, we have:

$$\ln\left(\frac{p!}{j!(p-j)!}\right) = j\ln p\,(1 - \delta)(1 + o(1)).$$

Lemma 2: Let $\chi_k^2$ denote a $\chi^2$ random variable with k degrees of freedom. If $m \to +\infty$ and $K/m \to 0$, then:

$$P(\chi_k^2 \geq m) = \frac{1}{\Gamma(k/2)}(m/2)^{k/2-1}e^{-m/2}(1 + o(1)),$$

uniformly for all $k \leq K$.

We now state the consistency result as follows.

Theorem 2. Assume that $p_n = O(n^k)$ for some constant k. If $a > 0$, then under the asymptotic identifiability condition we have:

$$\Pr\left[\min\{\mathrm{BIC}_a(s) : v(s) = j,\ s \neq s_0\} > \mathrm{BIC}_a(s_0)\right] \to 1,$$

for $j = 1, \ldots, K$, as $n \to +\infty$.

Proof:

$$\mathrm{BIC}_a(s) - \mathrm{BIC}_a(s_0) = n\ln\left(\frac{y_n^T(I_n - H_n(s))y_n}{y_n^T(I_n - H_n(s_0))y_n}\right) + (v(s) - v(s_0))\ln n + 2[\ln(\tau(S_j) + a) - \ln(\tau(S_{K_0}) + a)] \equiv T_1 + T_2, \text{ say},$$

where:

$$T_1 = n\ln\left(\frac{y_n^T(I_n - H_n(s))y_n}{y_n^T(I_n - H_n(s_0))y_n}\right) = n\ln\left(1 + \frac{y_n^T(I_n - H_n(s))y_n - e_n^T(I_n - H_n(s_0))e_n}{e_n^T(I_n - H_n(s_0))e_n}\right),$$

$$T_2 = (v(s) - v(s_0))\ln n + 2[\ln(\tau(S_j) + a) - \ln(\tau(S_{K_0}) + a)].$$

Without loss of generality, we assume that $\sigma^2 = 1$.

Case 1: $s_0 \not\subset s$.

First, we will show that $T_1 \geq n\ln\left(1 + \frac{C\ln n}{n}\right)$. We can write:

$$y_n^T(I_n - H_n(s_0))y_n = e_n^T(I_n - H_n(s_0))e_n = \sum_{i=1}^{n - v(s_0)} Z_i^2 = n\{1 + o_p(1)\},$$

where the $Z_i$'s are i.i.d. standard normal variables. We have:

$$y_n^T(I_n - H_n(s))y_n - e_n^T(I_n - H_n(s_0))e_n = u_n^T\{I_n - H_n(s)\}u_n + 2u_n^T\{I_n - H_n(s)\}e_n - e_n^TH_n(s)e_n + e_n^TH_n(s_0)e_n. \qquad (14)$$

By the asymptotic identifiability condition, uniformly over s such that $v(s) \leq K$, we have:

$$(\ln n)^{-1}u_n^T\{I_n - H_n(s)\}u_n \to \infty. \qquad \text{(I)}$$

Write

$$u_n^T\{I_n - H_n(s)\}e_n = \sqrt{u_n^T\{I_n - H_n(s)\}u_n}\,Z(s),$$

where:

$$Z(s) = \frac{u_n^T\{I_n - H_n(s)\}e_n}{\sqrt{u_n^T\{I_n - H_n(s)\}u_n}} \sim N(0, 1).$$

We hence arrive at

$$\max\left[|u_n^T\{I_n - H_n(s)\}e_n| : s \in S_j\right] \leq \sqrt{u_n^T\{I_n - H_n(s)\}u_n}\,\max\{|Z(s)| : s \in S_j\} \leq \sqrt{u_n^T\{I_n - H_n(s)\}u_n}\,O_p\{\sqrt{2\ln p_n}\} = o_p\left[u_n^T\{I_n - H_n(s)\}u_n\right], \qquad \text{(II)}$$

where the last inequality follows from the Bonferroni inequality; in detail, we have:

$$P\left(\max\{Z^2(s) : s \in S_j,\ j \leq K\} \geq m\right) \leq \sum_{j=1}^{K}\tau(S_j)P(Z^2(s) \geq m) = \sum_{j=1}^{K}\tau(S_j)P(\chi_1^2 \geq m) \leq \sum_{j=1}^{K}\tau(S_j)P(\chi_j^2 \geq m),$$

where $m = 2K\ln p_n$. The last inequality holds because $P(\chi_1^2 \geq m) \leq P(\chi_j^2 \geq m)$, and according to Lemma 2 we can prove that $\sum_{j=1}^{K}\tau(S_j)P(\chi_j^2 \geq m) \to 0$.

Thus, we have

$$\max\{e_n^TH_n(s)e_n : v(s) \leq K\} \leq m(1 + o_p(1)) = 2K\ln p_n(1 + o_p(1)) = O_p(\ln p_n) = O_p(\ln n). \qquad \text{(III)}$$

We know that the term $e_n^TH_n(s_0)e_n$ is a $\chi^2$-distributed statistic with fixed degrees of freedom $K_0$. Then, substituting (I), (II), (III) into (14), we have:

$$\begin{aligned}
y_n^T(I_n - H_n(s))y_n - e_n^T(I_n - H_n(s_0))e_n &= u_n^T\{I_n - H_n(s)\}u_n\left[1 + \frac{2u_n^T\{I_n - H_n(s)\}e_n}{u_n^T\{I_n - H_n(s)\}u_n} - \frac{e_n^TH_n(s)e_n}{u_n^T\{I_n - H_n(s)\}u_n} + \frac{e_n^TH_n(s_0)e_n}{u_n^T\{I_n - H_n(s)\}u_n}\right] \\
&= u_n^T\{I_n - H_n(s)\}u_n(1 + o_p(1)).
\end{aligned}$$

Under the asymptotic identifiability condition, we know that $u_n^T\{I_n - H_n(s)\}u_n$, which goes to infinity faster than $\ln n$, is the dominating term in $y_n^T(I_n - H_n(s))y_n - e_n^T(I_n - H_n(s_0))e_n$. Thus,

$$\frac{y_n^T(I_n - H_n(s))y_n - e_n^T(I_n - H_n(s_0))e_n}{e_n^T(I_n - H_n(s_0))e_n} \geq \frac{C\ln n}{n},$$

for any large constant C, in probability. Thus we have:

$$T_1 = n\ln\left(\frac{y_n^T(I_n - H_n(s))y_n}{y_n^T(I_n - H_n(s_0))y_n}\right) \geq n\ln\left(1 + \frac{C\ln n}{n}\right).$$

Next, we will show that $\mathrm{BIC}_a(s) - \mathrm{BIC}_a(s_0) \to +\infty$ as $n \to +\infty$.

We know that:

$$\mathrm{BIC}_a(s) - \mathrm{BIC}_a(s_0) = T_1 + T_2 \geq n\ln\left(1 + \frac{C\ln n}{n}\right) + (v(s) - v(s_0))\ln n + 2[\ln(\tau(S_j) + a) - \ln(\tau(S_{K_0}) + a)]. \qquad (15)$$

On the one hand,

$$n\ln\left(1 + \frac{C\ln n}{n}\right) + (v(s) - v(s_0))\ln n \geq n\ln\left(1 + \frac{C\ln n}{n}\right) - v(s_0)\ln n = \ln n\cdot\frac{n}{\ln n}\left[\ln\left(1 + \frac{C\ln n}{n}\right) - v(s_0)\frac{\ln n}{n}\right].$$

Let $a_n = n/\ln n$; we know that $1/a_n \to 0$ as $n \to +\infty$, so that $\ln(1 + 1/a_n) \sim 1/a_n$. Thus, the above expression can be written as

$$= a_n(\ln n)\left[\ln\left(1 + \frac{C}{a_n}\right) - \frac{K_0}{a_n}\right] \approx a_n(\ln n)\left[\frac{C}{a_n} - \frac{K_0}{a_n}\right] = \ln n\,(C - K_0) \geq \ln n\,(C - K),$$

which means:

$$n\ln\left(1 + \frac{C\ln n}{n}\right) + (v(s) - v(s_0))\ln n \geq \ln n\,(C - K). \qquad \text{(IV)}$$

On the other hand, according to Lemma 1 we have $\ln(\tau(S_j) + a) \sim \ln\tau(S_j) \sim j\ln p_n$ as $n \to +\infty$. Thus, we obtain:

$$2[\ln(\tau(S_j) + a) - \ln(\tau(S_{K_0}) + a)] \approx 2(j - K_0)\ln p_n > -2K\ln p_n. \qquad \text{(V)}$$

Substituting (IV) and (V) into (15), we see that by choosing $C > K(2k + 1)$ we obtain:

$$\mathrm{BIC}_a(s) - \mathrm{BIC}_a(s_0) > \ln n\,(C - K(2k + 1)) \to +\infty.$$

Case 2: $s_0 \subset s$.

First, we will show:

$$T_1 = n\ln\left(\frac{y_n^T(I_n - H_n(s))y_n}{y_n^T(I_n - H_n(s_0))y_n}\right) \geq -2j\ln p_n\{1 + o_p(1)\}.$$

In this case we have $\{I_n - H_n(s)\}X_n(s_0) = 0$; thus $y_n^T(I_n - H_n(s))y_n = e_n^T(I_n - H_n(s))e_n$.

And

$$e_n^T(I_n - H_n(s_0))e_n - y_n^T(I_n - H_n(s))y_n = e_n^T(H_n(s) - H_n(s_0))e_n = \sum_{i=1}^{j} Z_i^2(s),$$

where $j = v(s) - v(s_0)$ and the $Z_i(s)$ are independent standard normal random variables depending on s. Let $\hat{e}_n = \{I_n - H_n(s_0)\}e_n$; we obtain that:

$$n\ln\left(\frac{y_n^T(I_n - H_n(s_0))y_n}{y_n^T(I_n - H_n(s))y_n}\right) = n\ln\left(1 + \frac{e_n^T(I_n - H_n(s_0))e_n - e_n^T(I_n - H_n(s))e_n}{e_n^T(I_n - H_n(s))e_n}\right) = n\ln\left\{1 + \frac{\sum_{i=1}^{j}Z_i^2(s)}{\hat{e}_n^T\hat{e}_n - \sum_{i=1}^{j}Z_i^2(s)}\right\} \leq \frac{n\sum_{i=1}^{j}Z_i^2(s)}{\hat{e}_n^T\hat{e}_n - \sum_{i=1}^{j}Z_i^2(s)}.$$

As $n \to +\infty$, $n^{-1}\hat{e}_n^T\hat{e}_n \to \sigma^2 = 1$. And in Case 1 we have shown that:

$$\max\left\{\sum_{i=1}^{j}Z_i^2(s) : s \in S_{v(s)}\right\} \leq 2j\ln p_n\{1 + o_p(1)\}.$$

Thus,

$$\max\left\{\frac{n\sum_{i=1}^{j}Z_i^2(s)}{\hat{e}_n^T\hat{e}_n - \sum_{i=1}^{j}Z_i^2(s)} : s \in S_{v(s)}\right\} \leq \frac{2nj\ln p_n\{1 + o_p(1)\}}{(n - 2j\ln p_n)\{1 + o_p(1)\}} = 2j\ln p_n\{1 + o_p(1)\}.$$

So, we can see that:

$$T_1 = -n\ln\left(\frac{y_n^T(I_n - H_n(s_0))y_n}{y_n^T(I_n - H_n(s))y_n}\right) \geq -2j\ln p_n\{1 + o_p(1)\}.$$

Consequently, uniformly in s such that $v(s) = j + v(s_0)$, we have:

$$\begin{aligned}
\mathrm{BIC}_a(s) - \mathrm{BIC}_a(s_0) &= T_1 + T_2 \\
&\geq -2j\ln p_n\{1 + o_p(1)\} + (v(s) - v(s_0))\ln n + 2[\ln(\tau(S_{v(s)}) + a) - \ln(\tau(S_{K_0}) + a)] \\
&\geq -2j\ln p_n\{1 + o_p(1)\} + (v(s) - v(s_0))\ln n + 2(v(s) - v(s_0))\ln p_n\{1 + o(1)\} \\
&= -2j\ln p_n\{1 + o_p(1)\} + j\ln n + 2j\ln p_n\{1 + o(1)\} = j\ln n\{1 + o_p(1)\}.
\end{aligned}$$

Thus, we have:

$\mathrm{BIC}_a(s) - \mathrm{BIC}_a(s_0) \geq j\ln n\{1 + o_p(1)\} \to +\infty$ as $n \to +\infty$. The conclusion hence follows.

4. The Structure of Our Algorithm

Initial step:

Standardize y and $x_j$, $j = 1, \ldots, p$, so that $y^T\mathbf{1} = 0$, $x_j^T\mathbf{1} = 0$, $y^Ty = n$, $x_j^Tx_j = n$. Let $\lambda_1 = 2\max_{j \in \{1,2,\ldots,p\}}|x_j^Ty|$ and consider the following penalized sum of squares:

$$\min_\beta\left\{\|y - X\beta\|_2^2 + \lambda_1\|\beta\|_1\right\}.$$

The features with nonzero estimated coefficients are selected, and the set of their indices is denoted by $s_{TEMP}$; let $s_{*1} = s_{TEMP}$. Compute $I - H(s_{*1})$ and $\mathrm{NEBIC}(s_{*1})$.

General step:

For $k \geq 1$, compute $\tilde{x}_j^T\tilde{y}$ for $j \notin s_{*k}$, where $\tilde{y} = [I - H(s_{*k})]y$ and $\tilde{x}_j = [I - H(s_{*k})]x_j$. Let $\lambda_{k+1} = 2\max_{j \notin s_{*k}}|\tilde{x}_j^T\tilde{y}|$ and consider the following penalized sum of squares:

$$\min_{\tilde{\beta}}\left\{\|\tilde{y} - \tilde{X}\tilde{\beta}\|_2^2 + \lambda_{k+1}\|\tilde{\beta}\|_1\right\}.$$

The features with nonzero estimated coefficients are selected, and the set of their indices is denoted by $s_{TEMP}$; let $s_{*k+1} = s_{*k} \cup s_{TEMP}$. Compute $\mathrm{NEBIC}(s_{*k+1})$; if $\mathrm{NEBIC}(s_{*k+1}) > \mathrm{NEBIC}(s_{*k})$, stop; otherwise, compute $I - H(s_{*k+1})$ and continue.

Suppose that the above procedure stops at step $k^* + 1$; then our algorithm outputs the index set of selected features, denoted by $s_{*k^*}$. The parameters in the selected model are then estimated by their least-squares estimates:

$$\hat{\beta}(s_{*k^*}) = [X^T(s_{*k^*})X(s_{*k^*})]^{-1}X^T(s_{*k^*})y, \qquad \hat{\beta}(s_{*k^*}^c) = 0.$$

The NEBIC for $s_{*k}$, $k = 1, 2, \ldots, k^*$, in the above algorithm is given by:

$$\mathrm{NEBIC}(s_{*k}) = n\ln\left(\frac{y^T(I_n - H_n(s_{*k}))y}{n}\right) + v(s_{*k})\ln n + 2\ln(\tau(S_{v(s_{*k})}) + a), \quad a > 0.$$
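Putting the pieces together, the following Python sketch is a condensed, simplified rendering of the procedure in this section (not the authors' implementation): the set of features attaining $\lambda_{k+1} = 2\max_{j\notin s_{*k}}|\tilde{x}_j^T\tilde{y}|$ is taken as the set entering at that penalty, and NEBIC is used as the stopping rule. The function names and the default a = 1 are our own choices.

```python
import math
import numpy as np

def _nebic(rss, n, p, k, a=1.0):
    # NEBIC of Section 3: n*ln(RSS/n) + k*ln(n) + 2*ln(C(p, k) + a).
    log_binom = math.lgamma(p + 1) - math.lgamma(k + 1) - math.lgamma(p - k + 1)
    prior = 2.0 * log_binom if log_binom > 700 else 2.0 * math.log(math.exp(log_binom) + a)
    return n * math.log(rss / n) + k * math.log(n) + prior

def sequential_lasso_nebic(X, y, a=1.0, max_steps=20, tol=1e-8):
    """Sequential selection: add the argmax features of |x_tilde_j^T y_tilde|
    at each step, stop as soon as NEBIC increases, and return the index set."""
    n, p = X.shape
    selected, best = [], np.inf
    for _ in range(max_steps):
        if selected:
            Xs = X[:, selected]
            H = Xs @ np.linalg.solve(Xs.T @ Xs, Xs.T)
            resid = y - H @ y            # (I - H(s)) y
        else:
            resid = y.copy()
        corr = np.abs(X.T @ resid)
        corr[selected] = -np.inf         # already-selected features are excluded
        if not np.isfinite(corr.max()):
            break                        # nothing left to select
        new = [j for j in range(p) if corr[j] >= corr.max() - tol]
        candidate = selected + new
        Xc = X[:, candidate]
        Hc = Xc @ np.linalg.solve(Xc.T @ Xc, Xc.T)
        rss = float(y @ y - y @ Hc @ y)  # y^T (I - H(s)) y
        crit = _nebic(rss, n, p, len(candidate), a)
        if crit > best:                  # NEBIC increased: stop
            break
        best, selected = crit, candidate
    return selected
```

With X and y standardized as in the initial step, `sequential_lasso_nebic(X, y)` returns the selected index set corresponding to $s_{*k^*}$.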

5. Application to High-Dimensional Classification

5.1. Method

As discussed above, better methods are needed to estimate the unknown $u_k$'s and Ω in the small-n-large-p setting. In this paper, we apply our algorithm to identify the non-zero entries of the precision matrix, and the estimate of Ω is then obtained by constrained maximum likelihood estimation. We adopt the method proposed in [7] to estimate the class means. The estimated class means and precision matrix are finally used to construct the discrimination rule.

5.2. Procedure

1) Constrained estimation for the class means

The predictor components are first ordered according to the F-statistic for testing the significance of class effects; then, instead of being pairwise fused, the class means for each component are clustered by a divisive hierarchical clustering procedure. The class means are estimated under the structure revealed by the clustering. For details, we refer the reader to [7].

2) Constrained estimation for the precision matrix

The identification of the non-zero entries of a concentration matrix has attracted considerable attention; the concentration matrix is the inverse of the covariance matrix of a random vector. A concentration matrix is closely related to an undirected graphical model. An undirected graphical model is specified by a vertex set V and an edge set E and is denoted by $G = (V, E)$. The vertex set V represents a collection of random variables $\{Y_1, \ldots, Y_p\}$. The edge set E describes the inter-relationships among the random variables: there is an edge connecting vertices $Y_i$ and $Y_j$ if they are dependent conditional on all the remaining variables. Suppose that $Y = (Y_1, \ldots, Y_p)$ follows a multivariate normal distribution with concentration matrix $\Omega = (\Omega_{ij})$. Then there is an edge between $Y_i$ and $Y_j$ if and only if $\Omega_{ij} = \Omega_{ji} \neq 0$. Thus, the detection of the edges of G is equivalent to the identification of the non-zero entries of Ω.

In this paper, we adopt the method proposed in [14], which is based on the relationship between the entries of Ω and the coefficients of p regression models in which each component of Y is regressed on the remaining p − 1 components. A non-zero entry of Ω corresponds to a non-zero regression coefficient in these regression models. In other words, the detection and estimation of the non-zero entries of Ω boil down to the selection and estimation of the non-zero coefficients in p regression models. Accordingly, we first apply our algorithm to identify the non-zero entries of Ω, and the estimate of Ω is then obtained by constrained maximum likelihood estimation. The details are as follows.

For an undirected graph $G = (V, E)$, let V be modeled as the set of the components of a random vector $Y = (Y_1, \ldots, Y_p)$. We assume that Y follows a multivariate normal distribution $N(u, \Sigma)$; without loss of generality, assume that $u = 0$. Let $Y_{-i}$ be the vector obtained from Y by eliminating the component $Y_i$. Denote by $\Sigma_{-i,-i}$ the variance-covariance matrix of $Y_{-i}$, by $\sigma_i^2$ the variance of $Y_i$, and by $\Sigma_{-i,i}$ the covariance vector between $Y_{-i}$ and $Y_i$. The conditional distribution of $Y_i$ given $Y_{-i}$ is still normal, with the following conditional mean and conditional variance:

$$E(Y_i \mid Y_{-i}) = \Sigma_{-i,i}^T\Sigma_{-i,-i}^{-1}Y_{-i}, \qquad \mathrm{Var}(Y_i \mid Y_{-i}) = \sigma_i^2 - \Sigma_{-i,i}^T\Sigma_{-i,-i}^{-1}\Sigma_{-i,i}.$$

Let $\beta_{ij}$ be the jth component of $\Sigma_{-i,i}^T\Sigma_{-i,-i}^{-1}$. We can then express the conditional distributions in the form of the following regression models:

$$Y_i = \sum_{j \neq i}\beta_{ij}Y_j + \epsilon_i, \quad \epsilon_i \sim N(0, D_i), \quad i = 1, \ldots, p, \qquad (16)$$

where $D_i = \sigma_i^2 - \Sigma_{-i,i}^T\Sigma_{-i,-i}^{-1}\Sigma_{-i,i}$. Without loss of generality, suppose that $Y_i$ is the first component of Y. Let the covariance matrix Σ be partitioned as:

$$\Sigma = \begin{pmatrix} \sigma_i^2 & \Sigma_{-i,i}^T \\ \Sigma_{-i,i} & \Sigma_{-i,-i} \end{pmatrix},$$

and let $\Omega = (\omega_{jl})_{j,l \in \{1,\ldots,p\}}$; we can obtain:

$$\Omega = \Sigma^{-1} = \begin{pmatrix} D_i^{-1} & -D_i^{-1}\Sigma_{-i,i}^T\Sigma_{-i,-i}^{-1} \\ -\Sigma_{-i,-i}^{-1}\Sigma_{-i,i}D_i^{-1} & \Sigma_{-i,-i}^{-1} + \Sigma_{-i,-i}^{-1}\Sigma_{-i,i}D_i^{-1}\Sigma_{-i,i}^T\Sigma_{-i,-i}^{-1} \end{pmatrix}.$$

By comparing the upper off-diagonal block of Ω with $\Sigma_{-i,i}^T\Sigma_{-i,-i}^{-1}$, we see that:

$$\beta_{ij} = -\omega_{ij}D_i = -\frac{\omega_{ij}}{\omega_{ii}},$$

noting that $D_i^{-1} = \omega_{ii}$. These connections establish the equivalence:

$$\omega_{ij} = 0 \iff \beta_{ij} = 0.$$

As we can see, the identification of the non-zero entries of Ω reduces to the identification of the non-zero $\beta_{ij}$ in the above regression models. Thus, we first apply our algorithm to model (16) to identify all the non-zero entries of Ω. Let $E = \{(j, l) : \omega_{jl} \neq 0\}$; our algorithm yields an estimate $\hat{E}$.

Next, we summarize our proposed approach. First, we introduce some notation. Denote by $y_i$ the standardized n-vector of observations on $Y_i$, and let $X_i$ be the $n \times (p-1)$ matrix consisting of all $y_j$ with $j \neq i$. Our algorithm based on these data sets is as follows:

Algorithm for identifying E

Initial step:

Let $\lambda_i = 2\max|X_i^Ty_i|$ and let $\beta_i$ be a $(p-1)$-vector, for $i = 1, 2, \ldots, p$. Consider minimizing the following p penalized sums of squares separately:

$$\min_{\beta_i}\|y_i - X_i\beta_i\|_2^2 + \lambda_i\|\beta_i\|_1, \quad i = 1, 2, \ldots, p.$$

Let $s_1^1, \ldots, s_p^1$ denote the index sets of features with nonzero estimated coefficients in the above p regression models, respectively, and let $s^1 = \{s_1^1, \ldots, s_p^1\}$. Then compute $\mathrm{NEBIC}(s^1)$ and $I - H(s_i^1)$ for $i = 1, 2, \ldots, p$.

General step k + 1 (k ≥ 1):

Let $\lambda_i = 2\max|X_i^T(I - H(s_i^k))y_i|$ for $i = 1, 2, \ldots, p$. Consider minimizing the following p penalized sums of squares separately:

$$\min_{\tilde{\beta}_i}\|\tilde{y}_i - \tilde{X}_i\tilde{\beta}_i\|_2^2 + \lambda_i\|\tilde{\beta}_i\|_1, \quad i = 1, 2, \ldots, p,$$

where $\tilde{y}_i = (I - H(s_i^k))y_i$, $\tilde{X}_i = (I - H(s_i^k))X_i((s_i^k)^c)$, and $\tilde{\beta}_i = \beta_i((s_i^k)^c)$. Let $s_i^{TEMP}$ be the index set of features with nonzero estimated coefficients in the ith regression model above, and let $s_i^{k+1} = s_i^{TEMP} \cup s_i^k$ for $i = 1, 2, \ldots, p$. Then let $s^{k+1} = \{s_1^{k+1}, \ldots, s_p^{k+1}\}$. Compute $\mathrm{NEBIC}(s^{k+1})$. If $\mathrm{NEBIC}(s^{k+1}) > \mathrm{NEBIC}(s^k)$, stop; otherwise, compute $I - H(s_i^{k+1})$ for $i = 1, 2, \ldots, p$ and continue.

Suppose the above algorithm stops at step $k + 1$; then it outputs the index set $s^k = \{s_1^k, \ldots, s_p^k\}$.

The NEBIC for the set $s^k$ is given by the following formula:

$$\mathrm{NEBIC}(s^k) = \sum_{i=1}^{p}\left\{n\ln\left(\frac{\|[I - H(s_i^k)]y_i\|_2^2}{n}\right) + |s_i^k|\ln n + 2\ln(\tau(S_{|s_i^k|}) + a)\right\}.$$
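The following Python sketch illustrates the neighborhood-regression idea behind the algorithm above: each $Y_i$ is regressed on the remaining components as in (16), and the detected nonzero coefficients give an estimated edge set $\hat{E}$. For brevity, a BIC-tuned lasso (`LassoLarsIC`) stands in for the sequential-lasso-with-NEBIC selector described above; the "or" symmetrization rule and the toy tridiagonal precision matrix are illustrative choices, not part of the paper.

```python
import numpy as np
from sklearn.linear_model import LassoLarsIC

def estimate_edge_set(Y):
    """Per-node regressions: regress each Y_i on the remaining components and
    record which coefficients are nonzero; collect the detected edges."""
    n, p = Y.shape
    edges = set()
    for i in range(p):
        others = [j for j in range(p) if j != i]
        fit = LassoLarsIC(criterion="bic").fit(Y[:, others], Y[:, i])
        for idx, coef in zip(others, fit.coef_):
            if coef != 0.0:
                # Keep the edge if either of the two regressions flags it ("or" rule).
                edges.add((min(i, idx), max(i, idx)))
    return edges

# Toy example: tridiagonal precision matrix, so the true edges are (i, i+1).
rng = np.random.default_rng(3)
p, n = 10, 300
Omega = np.eye(p) + np.diag(np.full(p - 1, 0.4), 1) + np.diag(np.full(p - 1, 0.4), -1)
Y = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Omega), size=n)
print(sorted(estimate_edge_set(Y)))
```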

Estimation of precision matrix:

From the above algorithm we identify all the non-zero entries of Ω; let $\hat{E}$ be the set of their indices. Then $\hat{\Omega}$ is obtained by constrained maximum likelihood estimation as follows:

$$\hat{\Omega} = \arg\min_{\Omega:\ \omega_{jl} = 0 \text{ for } (j,l) \notin \hat{E}}\left\{\mathrm{trace}(\hat{S}\Omega) - \ln\det(\Omega) + \lambda\|\Omega\|_1\right\},$$

where $\hat{S}$ is the empirical covariance matrix.

3) Discrimination rule

The discrimination rule is constructed by replacing the unknowns in $d_{kj}(x)$ with their estimates obtained above. Here $d_{kj}(x)$ is:

$$d_{kj}(x) = [x - (u_k + u_j)/2]^T\Omega(u_k - u_j) + \ln(\pi_k/\pi_j).$$

Then we have $\hat{d}_{kj}(x)$:

$$\hat{d}_{kj}(x) = [x - (\hat{u}_k + \hat{u}_j)/2]^T\hat{\Omega}(\hat{u}_k - \hat{u}_j) + \ln(n_k/n_j).$$

Classify x into class k if and only if:

$$\hat{d}_{kj}(x) > 0, \quad \text{for all } j \neq k.$$

5.3. Simulation Studies

We compare our method with other methods available in the literature through numerical studies in this section. The methods considered for the comparison are: 1) linear discrimination with detected sparsity (LDwDS) in [7]; 2) the adaptive hierarchically penalized nearest shrunken centroid (ahp-NSC) and the adaptive $L_\infty$-norm penalized NSC (alp-NSC) proposed in [15]; and 3) pairwise SIS (PSIS) in [16].

The performance of the methods is evaluated by the misclassification rate (MCR) and the computation time. We have three simulation settings and repeat each setting 100 times. At each replicate under a simulation setting, we simulate a training data set and a testing data set. The training data set is used to construct the discrimination rules for each method, and the testing set is used to evaluate the MCR and record the computation time.

We design the simulation settings following [7], with $K = 2$ and $p = 400$. The following single scheme for the class means is used throughout:

$$u_1 = 0, \qquad u_2 = (\underbrace{0.5, \ldots, 0.5}_{100}, \underbrace{0, \ldots, 0}_{p-100}).$$

The covariance matrix is generated through the precision matrix. First, the nonzero positions of the precision matrix are determined according to the following schemes:

ES1: the non-zero entries of Ω are randomly determined with:

$$\Pr(\omega_{ij} = 0) = 1 - \Pr(\omega_{ij} \neq 0) = 0.99.$$

ES2: Ω is a block-diagonal matrix with 10 blocks of size 10.

ES3: Ω is a block-diagonal matrix with 10 blocks of size 10, where each block is a banded matrix with $\omega_{ij} \neq 0$ if $|i - j| \leq 2$.

For the nonzero values of the precision matrix, we first generate $\tilde{\Omega} = (\tilde{\omega}_{ij})$ as follows: $\tilde{\omega}_{ii} = 1$, and the $\tilde{\omega}_{ij} = \tilde{\omega}_{ji}$ are generated as i.i.d. observations from the uniform distribution on [0.3, 0.7]. Then we take $\Omega = \tilde{\Omega} + (0.1 - \lambda_{\min}(\tilde{\Omega}))I$. Finally, the common covariance matrix is taken as the correlation matrix corresponding to $\Omega^{-1}$. The training sample size is $n_1 = 200$ and the testing sample size is $n_2 = 1000$. The covariance matrix is generated as a block-diagonal matrix with four identical $100 \times 100$ blocks, each generated by the schemes above. The simulation results under these three settings are reported in Tables 1-3.
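To give a concrete picture of the data-generating mechanism, the Python sketch below builds an ES3-style banded block-diagonal precision matrix and draws the two classes. The dimensions are scaled down and the positive-definiteness adjustment follows our reading of the recipe above, so treat the details as illustrative rather than an exact reproduction of the simulation settings.

```python
import numpy as np

def make_block_banded_precision(p=100, block=10, band=2, rng=None):
    """ES3-style sketch: block-diagonal precision with banded blocks; off-diagonal
    values drawn uniformly, then the diagonal shifted so the smallest eigenvalue
    is 0.1 (one common positive-definiteness fix; the exact recipe may differ)."""
    rng = np.random.default_rng() if rng is None else rng
    Omega = np.zeros((p, p))
    for start in range(0, p, block):
        for i in range(start, start + block):
            for j in range(i + 1, min(start + block, i + band + 1)):
                Omega[i, j] = Omega[j, i] = rng.uniform(0.3, 0.7)
    np.fill_diagonal(Omega, 1.0)
    lam_min = np.linalg.eigvalsh(Omega).min()
    Omega += (0.1 - lam_min) * np.eye(p)
    return Omega

# Common covariance = correlation matrix of Omega^{-1}; scaled-down class means.
rng = np.random.default_rng(4)
p = 100
Omega = make_block_banded_precision(p, rng=rng)
Sigma = np.linalg.inv(Omega)
d = np.sqrt(np.diag(Sigma))
Sigma_corr = Sigma / np.outer(d, d)
u1 = np.zeros(p)
u2 = np.concatenate([np.full(25, 0.5), np.zeros(p - 25)])  # analogue of the p = 400 means
X1 = rng.multivariate_normal(u1, Sigma_corr, size=200)     # class 1 training sample
X2 = rng.multivariate_normal(u2, Sigma_corr, size=200)     # class 2 training sample
```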

From all three tables we can see that, in terms of MCR, our method performs much better than the two NSC methods and PSIS. Although our method has the same MCR as LDwDS in Table 1 and Table 2, it takes less computation time than LDwDS. From Table 3 we find that our method performs better than LDwDS in both misclassification and computation time. From [7] we note that the misclassification rates of LDwDS are universally lower than those of all the other methods under consideration. Nevertheless, it is not without drawbacks: LDwDS takes longer computation time than the other methods.

Table 1. Averaged misclassification rate (MCR) and computation time (in seconds) under simulation ES1 (numbers in the parentheses are standard deviations).

Table 2. Averaged misclassification rate (MCR) and computation time (in seconds) under simulation ES2 (numbers in the parentheses are standard deviations).

Table 3. Averaged misclassification rate (MCR) and computation time (in seconds) under simulation ES3 (numbers in the parentheses are standard deviations).

Therefore, from the perspectives of both computation time and misclassification, our algorithm performs better than its competing algorithms.

6. Summary and Discussion

In this paper, on the one hand, based on the characteristics of lasso regression we propose a novel method to choose the regularization parameter λ, which gives the exact maximum value of λ for lasso regression. On the other hand, since the ordinary Bayesian information criterion is too liberal for model selection when the model space is large, inspired by [10] we propose a new EBIC (NEBIC). Compared with EBIC, the parameter in our new criterion does not need to be restricted to a small range; moreover, under some mild conditions our new EBIC is also shown to be consistent. Based on these two findings, we obtain a new algorithm for feature selection: we apply our new method to choose the regularization parameter for sequential lasso, and our NEBIC is used as the stopping rule. We then apply our algorithm to the classification problem, and the numerical study demonstrates that it performs better than the competing methods under consideration.

Further research may consider these two questions:

Firstly, our method chooses the largest regularization parameter λ such that at least one of the $\beta_j$'s, $j \in \{1, 2, \ldots, p\}$, is estimated as nonzero. A natural question is whether, instead of solving the optimization problems, we can obtain the indices of the features with nonzero estimated coefficients by simpler means. Luo and Chen [17] pointed out that these indices are related to the choice of the regularization parameter λ under some conditions; however, these conditions are rather strict. Further research may consider whether milder conditions exist that connect our method for choosing the regularization parameter with the indices of the nonzero estimated coefficients.

Secondly, we choose the concave function $\frac{x}{x+a}$, $a > 0$, as the prior function over the model space to extend EBIC, and our new EBIC is shown to be consistent. Subsequent research may consider whether there exists a special class of functions for model-space priors such that, within this class, the resulting Bayesian information criteria are consistent.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this paper.

References

[1] Desboulets, L.D.D. (2018) A Review on Variable Selection in Regression Analysis. Econometrics, 6, Article 45.
https://doi.org/10.3390/econometrics6040045
[2] Fan, J. and Li, R. (2001) Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties. Journal of the American Statistical Association, 96, 1348-1360.
https://doi.org/10.1198/016214501753382273
[3] Zou, H. (2006) The Adaptive Lasso and Its Oracle Properties. Journal of the American Statistical Association, 101, 1418-1429.
https://doi.org/10.1198/016214506000000735
[4] Zhang, C. (2010) Nearly Unbiased Variable Selection under Minimax Concave Penalty. The Annals of Statistics, 38, 894-942.
https://doi.org/10.1214/09-AOS729
[5] Tropp, J.A. (2004) Greed Is Good: Algorithmic Results for Sparse Approximation. IEEE Transactions on Information Theory, 50, 2231-2242.
https://doi.org/10.1109/TIT.2004.834793
[6] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004) Least Angle Regression. The Annals of Statistics, 32, 407-499.
https://doi.org/10.1214/009053604000000067
[7] Luo, S. and Chen, Z. (2020) A Procedure of Linear Discrimination Analysis with Detected Sparsity Structure for High-Dimensional Multi-Class Classification. Journal of Multivariate Analysis, 179, Article ID: 104641.
https://doi.org/10.1016/j.jmva.2020.104641
[8] Luo, S. and Chen, Z. (2014) Edge Detection in Sparse Gaussian Graphical Models. Computational Statistics & Data Analysis, 70, 138-152.
https://doi.org/10.1016/j.csda.2013.09.002
[9] Chen, Z. and Jiang, Y. (2020) A Two-Stage Sequential Conditional Selection Approach to Sparse High-Dimensional Multivariate Regression Models. Annals of the Institute of Statistical Mathematics, 72, 65-90.
https://doi.org/10.1007/s10463-018-0686-5
[10] Chen, J. and Chen, Z. (2008) Extended Bayesian Information Criteria for Model Selection with Large Model Spaces. Biometrika, 95, 759-771.
https://doi.org/10.1093/biomet/asn034
[11] Bickel, P.J. and Levina, E. (2004) Some Theory for Fisher’s Linear Discriminant Function, ‘Naive Bayes’, and Some Alternatives When There Are Many More Variables than Observations. Bernoulli, 10, 989-1010.
https://doi.org/10.3150/bj/1106314847
[12] Fan, J. and Fan, Y. (2008) High Dimensional Classification Using Features Annealed Independence Rules. Annals of Statistics, 36, 2605-2637.
https://doi.org/10.1214/07-AOS504
[13] Luo, S. and Chen, Z. (2013) Extended BIC for Linear Regression Models with Diverging Number of Relevant Features and High or Ultra-High Feature Spaces. Journal of Statistical Planning and Inference, 143, 494-504.
https://doi.org/10.1016/j.jspi.2012.08.015
[14] Meinshausen, N. and Bühlmann, P. (2006) High-Dimensional Graphs and Variable Selection with the Lasso. The Annals of Statistics, 34, 1436-1462.
https://doi.org/10.1214/009053606000000281
[15] Wang, S. and Zhu, J. (2007) Improved Centroids Estimation for the Nearest Shrunken Centroid Classifier. Bioinformatics, 23, 972-979.
https://doi.org/10.1093/bioinformatics/btm046
[16] Pan, R., Wang, H. and Li, R. (2016) Ultrahigh-Dimensional Multiclass Linear Discriminant Analysis by Pairwise Sure Independence Screening. Journal of the American Statistical Association, 111, 169-179.
https://doi.org/10.1080/01621459.2014.998760
[17] Luo, S. and Chen, Z. (2014) Sequential Lasso Cum EBIC for Feature Selection with Ultra-High Dimensional Feature Space. Journal of the American Statistical Association, 109, 1229-1240.
https://doi.org/10.1080/01621459.2013.877275

Copyright © 2024 by authors and Scientific Research Publishing Inc.


This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.