Model-Free Feature Screening Based on Gini Impurity for Ultrahigh-Dimensional Multiclass Classification

Abstract

It is quite common for both categorical and continuous covariates to appear in the same data set, yet most feature screening methods for ultrahigh-dimensional classification assume that the covariates are continuous, and the applicable feature screening methods are very limited. To handle this non-trivial situation, we propose a model-free feature screening procedure for ultrahigh-dimensional multi-class classification with both categorical and continuous covariates. The proposed method is based on the Gini impurity, which is used to evaluate the predictive power of covariates. Under certain regularity conditions, the proposed screening procedure is proved to possess the sure screening property and the ranking consistency property. We demonstrate its finite sample performance by simulation studies and illustrate it with a real data analysis.

Share and Cite:

Wang, Z. and Deng, G. (2022) Model-Free Feature Screening Based on Gini Impurity for Ultrahigh-Dimensional Multiclass Classification. Open Journal of Statistics, 12, 711-732. doi: 10.4236/ojs.2022.125042.

1. Introduction

Ultrahigh-dimensional data are commonly available in a wide range of scientific research and applications. Feature screening plays an essential role in the analysis of ultrahigh-dimensional data. Fan and Lv [1] first proposed sure independence screening (SIS) in their seminal paper. For linear regressions, they showed that the approach based on Pearson correlation learning possesses the sure screening property; that is, even if the number of predictors $p$ grows much faster than the number of observations $n$, with $\log p = O(n^{\alpha})$ for some $\alpha \in (0, 1/2)$, all relevant predictors can be selected with probability tending to one [2].

Many model-based and model-free feature screening approaches have been developed in recent years. For example, Wang [3] proposed forward regression for ultrahigh-dimensional data. Fan and Song [4] applied the maximum marginal likelihood estimate or the maximum marginal likelihood to ultrahigh-dimensional screening in generalized linear models. Fan et al. [5] further extended correlation learning to marginal nonparametric learning. Zhu et al. [6] proposed a model-free feature screening approach for ultrahigh-dimensional data. Li et al. [7] proposed a robust rank correlation screening method for ultrahigh-dimensional data based on the Kendall $\tau$ correlation coefficient. Li et al. [8] applied distance correlation to the sure independence screening procedure. He et al. [9] proposed a quantile-adaptive framework for nonlinear variable screening with high-dimensional heterogeneous data. Fan et al. [10] proposed a nonparametric independence screening method that selects variables by ranking a measure of the nonparametric marginal contribution of each covariate given the exposure variable. Liu et al. [11] proposed a feature screening procedure for varying coefficient models based on the conditional correlation coefficient. Nandy et al. [12] proposed covariate information number sure independence screening, which uses a marginal utility connected to the traditional notion of Fisher information. Pouyap et al. [13] proposed merging feature selection methods in order to identify the most relevant features in the texture of vibration signal images.

To address ultrahigh-dimensional feature screening in classification problems, Fan and Fan [14] proposed the two-sample t-test statistic as a marginal utility for feature screening and established its theoretical properties. Mai and Zou [15] applied the Kolmogorov filter to ultrahigh-dimensional binary classification. Cui et al. [16] proposed a screening procedure based on empirical conditional distribution functions. Lai et al. [17] proposed a feature screening procedure based on the expected conditional Kolmogorov filter for binary classification. However, these screening methods assume that the covariates are continuous. For categorical covariates, Huang et al. [18] constructed a model-free discrete feature screening method based on Pearson Chi-square statistics and showed that it satisfies the sure screening property of Fan et al. [2]. For multi-class classification with categorical covariates, Ni and Fang [19] proposed a model-free feature screening procedure based on information entropy. Ni et al. [20] further proposed a feature screening procedure based on the weighted adjusted Pearson Chi-square for multi-class classification. Sheng and Wang [21] proposed a new model-free feature screening method based on the classification accuracy of marginal classifiers for ultrahigh-dimensional classification. Anzarmou et al. [22] proposed a new model-free interaction screening method, termed the Kendall Interaction Filter (KIF), for classification in high-dimensional settings.

Building on the above classification models, in this paper we propose a model-free feature screening procedure for ultrahigh-dimensional multi-class classification with both categorical and continuous covariates. The proposed method is based on the Gini impurity, which is used to evaluate the predictive power of covariates. The Gini impurity is an impurity-based attribute splitting index proposed by Breiman et al. [23] and has been widely used in decision tree algorithms such as CART and SPRINT. For categorical covariate screening, we apply the purity gain index, which plays the same role as the information gain [19]. Similar to Ni and Fang [19], continuous covariates can be sliced via standard normal quantiles. Since the proposed procedure is based on purity gain, we refer to it as Purity Gain sure independence screening (PG-SIS). Theoretically, PG-SIS is rigorously proven to enjoy the sure screening property of Fan and Lv [1], which ensures that all important features are retained. Practically, the simulation results show that PG-SIS performs well compared with existing feature screening methods.

This paper is organized as follows. Section 2 describes the proposed PG-SIS method in detail. Section 3 establishes its sure screening property. In Section 4, numerical simulations and an example for real data analysis are given to assess sure screening property of our method. Some concluding remarks are given in Section 5 and all the proofs are given in the Appendix.

2. Feature Screening Procedure

We first introduce Gini impurity and purity gain, and then propose the screening procedure based on purity gain.

2.1. Gini Index and Purity Gain

Suppose that $Y$ is a categorical response with $R$ classes $\{1, \dots, R\}$, and the covariate vector $X = (X_1, X_2, \dots, X_p)$ is $p$-dimensional, where each component $X_k$ has $J_k$ categories $\{1, \dots, J_k\}$. To introduce the Gini impurity and the purity gain, assume that $Y \in \{1, \dots, R\}$ and $X_k \in \{1, 2, \dots, J_k\}$. Define $p_r = P(Y = r)$ as the probability function of the response, $w_{k,j} = P(X_k = j)$ as the probability function of the covariate, and $p_{k,jr} = P(Y = r \mid X_k = j)$ as the conditional probability function of the response given the covariate, where $r = 1, 2, \dots, R$, $j = 1, 2, \dots, J_k$ and $k = 1, 2, \dots, p$. Let $0 \times \log 0 = 0$. The marginal Gini impurities of $Y$ and $X_k$ are respectively defined as

$Gini(Y) = 1 - \sum_{r=1}^{R} p_r^2$ (1)

$Gini(X_k) = 1 - \sum_{j=1}^{J_k} w_{k,j}^2$ (2)

Conditional Gini impurity is defined as

$Gini(Y \mid X_k) = \sum_{j=1}^{J_k} w_{k,j}\left(1 - \sum_{r=1}^{R} p_{k,jr}^2\right)$ (3)

Similar to the information gain, the purity gain is defined as

$PG(Y \mid X_k) = Gini(Y) - Gini(Y \mid X_k) = 1 - \sum_{r=1}^{R} p_r^2 - \sum_{j=1}^{J_k} w_{k,j}\left(1 - \sum_{r=1}^{R} p_{k,jr}^2\right)$ (4)

In Equation (1), $Gini(Y)$ is non-negative and attains its maximum $1 - \frac{1}{R}$ if and only if $p_1 = \cdots = p_R = \frac{1}{R}$, by Jensen's inequality [24]. The quantity $Gini(Y \mid X_k)$ in Equation (3) is the conditional Gini impurity of $Y$ given $X_k$. Further support is given by the following proposition.

Proposition 2.1. When $X_k$ is a categorical covariate, $PG(Y \mid X_k) \ge 0$, and $X_k$ and $Y$ are independent if and only if $PG(Y \mid X_k) = 0$.
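For concreteness, the following minimal numeric sketch evaluates Equations (1), (3) and (4) for a small categorical example; the joint probability table and all names in the code are ours, introduced only for illustration.

import numpy as np

# Joint distribution P(Y = r, X_k = j) for R = 2 classes and J_k = 3 categories
# (an illustrative table, not from the paper).
joint = np.array([[0.20, 0.15, 0.15],   # r = 1
                  [0.05, 0.15, 0.30]])  # r = 2

p_r = joint.sum(axis=1)      # p_r = P(Y = r)
w_kj = joint.sum(axis=0)     # w_{k,j} = P(X_k = j)
p_kjr = joint / w_kj         # p_{k,jr} = P(Y = r | X_k = j); each column sums to 1

gini_y = 1.0 - np.sum(p_r ** 2)                                      # Equation (1)
gini_y_given_x = np.sum(w_kj * (1.0 - np.sum(p_kjr ** 2, axis=0)))   # Equation (3)
purity_gain = gini_y - gini_y_given_x                                # Equation (4)

print(gini_y, gini_y_given_x, purity_gain)   # 0.5, 0.43, 0.07: PG > 0, so Y depends on X_k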

For continuous $X_k$, the conditional Gini impurity cannot be calculated directly; instead, the purity gain is obtained by slicing $X_k$ into several categories. For a fixed integer $J \ge 2$, let $q_{(j)}$ be the $j/J$-th percentile of $X_k$, $j = 1, \dots, J - 1$, with $q_{(0)} = -\infty$ and $q_{(J)} = +\infty$. Replacing $w_{k,j}$ and $p_{k,jr}$ in Equation (3) by $w_{k,j} = P(X_k \in (q_{(j-1)}, q_{(j)}])$ and $p_{k,jr} = P(Y = r \mid X_k \in (q_{(j-1)}, q_{(j)}])$, respectively, we define the conditional Gini impurity for a continuous covariate:

$Gini_J(Y \mid X_k) = \sum_{j=1}^{J} w_{k,j}\left(1 - \sum_{r=1}^{R} p_{k,jr}^2\right) = \sum_{j=1}^{J} P(X_k \in (q_{(j-1)}, q_{(j)}])\left(1 - \sum_{r=1}^{R} P(Y = r \mid X_k \in (q_{(j-1)}, q_{(j)}])^2\right)$ (5)

$PG_J(Y \mid X_k) = Gini(Y) - Gini_J(Y \mid X_k) = 1 - \sum_{r=1}^{R} p_r^2 - \sum_{j=1}^{J} P(X_k \in (q_{(j-1)}, q_{(j)}])\left(1 - \sum_{r=1}^{R} P(Y = r \mid X_k \in (q_{(j-1)}, q_{(j)}])^2\right)$ (6)

Proposition 2.2. When $X_k$ is a continuous covariate, $PG_J(Y \mid X_k) \ge 0$, and $X_k$ and $Y$ are independent if and only if $PG_J(Y \mid X_k) = 0$.

2.2. Feature Screening Procedure Based on Purity Gain

First, we select a moderately sized simplified model that contains the active set $D = \{k : F(y \mid x) \text{ functionally depends on } X_k \text{ for some } y = r\}$. To this end, we use an adjusted purity gain index for each pair $(Y, X_k)$, defined as

$e_k = \dfrac{1 - \sum_{r=1}^{R} p_r^2 - \sum_{j=1}^{J_k} w_{k,j}\left(1 - \sum_{r=1}^{R} p_{k,jr}^2\right)}{\log J_k}$ (7)

where $p_r = P(Y = r)$, $w_{k,j} = P(X_k = j)$ and $p_{k,jr} = P(Y = r \mid X_k = j)$ when $X_k$ is categorical, with $J_k$ the number of categories of $X_k$. When $X_k$ is continuous, $J_k$ is the number of slices applied to $X_k$, $w_{k,j} = 1/J_k$, $p_{k,jr} = P(Y = r \mid X_k \in (q_{k,(j-1)}, q_{k,(j)}])$, and $q_{k,(j)}$ is the $j/J_k$-th percentile of $X_k$.

In the original definition in Equation (4), covariates with more categories may be associated with a larger purity gain regardless of whether they are important, especially when the numbers of categories differ across covariates. Ni and Fang [19] used $\log J_k$ to construct the information gain ratio to address this problem when the categories of $X_k$ are equally likely. Similarly, when the categories of $X_k$ are equally likely, we apply $\log J_k$ in Equation (7) to build the adjusted purity gain index, which also applies to continuous $X_k$. When the categories of $X_k$ are not equally likely, $1 - \sum_{j=1}^{J_k} w_{k,j}^2$ can serve as the adjustment factor, which is motivated by the way decision tree algorithms split $X_k$ into several categories.

For sample data $\{x_{i1}, \dots, x_{ip}, y_i\}$, $i = 1, \dots, n$, $e_k$ can easily be estimated by

$\hat{e}_k = \dfrac{\left(1 - \sum_{r=1}^{R} \hat{p}_r^2\right) - \sum_{j=1}^{J_k} \hat{w}_{k,j}\left(1 - \sum_{r=1}^{R} \hat{p}_{k,jr}^2\right)}{\log J_k}$ (8)

When $X_k$ is categorical, $\hat{w}_{k,j} = \frac{1}{n}\sum_{i=1}^{n} I\{x_{ik} = j\}$ and $\hat{p}_{k,jr} = \dfrac{\sum_{i=1}^{n} I\{y_i = r, x_{ik} = j\}}{\sum_{i=1}^{n} I\{x_{ik} = j\}}$.

When $X_k$ is continuous,

$\hat{w}_{k,j} = \frac{1}{n}\sum_{i=1}^{n} I\{x_{ik} \in (\hat{q}_{k,(j-1)}, \hat{q}_{k,(j)}]\}$

and

$\hat{p}_{k,jr} = \dfrac{\sum_{i=1}^{n} I\{y_i = r, x_{ik} \in (\hat{q}_{k,(j-1)}, \hat{q}_{k,(j)}]\}}{\sum_{i=1}^{n} I\{x_{ik} \in (\hat{q}_{k,(j-1)}, \hat{q}_{k,(j)}]\}}$

where $\hat{q}_{k,(j)}$ is the $j/J_k$-th sample percentile of $\{x_{1k}, \dots, x_{nk}\}$. In either case, $\hat{p}_r = \frac{1}{n}\sum_{i=1}^{n} I\{y_i = r\}$.
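The estimate in Equation (8) is straightforward to compute from a sample. Below is a minimal sketch, assuming NumPy; the function name, its arguments, and the handling of empty slices are ours, not prescribed by the paper.

import numpy as np

def adjusted_purity_gain(x, y, n_slices=10, categorical=False):
    """Sample estimate of the adjusted purity gain in Equation (8)."""
    x, y = np.asarray(x), np.asarray(y)
    classes = np.unique(y)
    p_r_hat = np.array([np.mean(y == r) for r in classes])      # \hat{p}_r

    if categorical:
        groups = [x == j for j in np.unique(x)]                 # one group per observed category
    else:
        # slice at the j/J_k sample percentiles, j = 1, ..., J_k - 1
        cuts = np.quantile(x, np.arange(1, n_slices) / n_slices)
        bins = np.digitize(x, cuts, right=True)                 # interval (q_{(j-1)}, q_{(j)}]
        groups = [bins == j for j in range(n_slices)]

    J_k = len(groups)
    cond_gini = 0.0
    for g in groups:
        w_hat = np.mean(g)                                      # \hat{w}_{k,j}
        if w_hat == 0:                                          # skip empty slices (e.g. heavy ties)
            continue
        p_hat = np.array([np.mean(y[g] == r) for r in classes]) # \hat{p}_{k,jr}
        cond_gini += w_hat * (1.0 - np.sum(p_hat ** 2))
    gini_y = 1.0 - np.sum(p_r_hat ** 2)
    return (gini_y - cond_gini) / np.log(J_k) if J_k > 1 else 0.0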

We suggest selecting the sub-model $\hat{D} = \{k : \hat{e}_k \ge c n^{-\tau}, 1 \le k \le p\}$, where $c$ and $\tau$ are predetermined thresholds specified in Condition (C2) in Section 3. In practice, we can choose the model

$\hat{D} = \{k : \hat{e}_k \text{ is among the } d \text{ largest of all}\}$

where $d = [n / \log n]$.
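The screening step itself then amounts to ranking the $p$ estimates and keeping the top $d = [n / \log n]$ covariates. A short sketch follows, assuming the adjusted_purity_gain helper above; pg_sis and categorical_cols are our own names, used only for illustration.

import numpy as np

def pg_sis(X, y, categorical_cols=None, n_slices=10):
    """Rank all covariates by the adjusted purity gain and keep the d = [n / log n] largest."""
    n, p = X.shape
    categorical_cols = set(categorical_cols or [])
    e_hat = np.array([
        adjusted_purity_gain(X[:, k], y,
                             n_slices=n_slices,
                             categorical=(k in categorical_cols))
        for k in range(p)
    ])
    d = int(n / np.log(n))
    selected = np.argsort(e_hat)[::-1][:d]   # indices of the d largest estimates
    return selected, e_hat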

3. Feature Screening Property

In this section, we establish the sure screening property of PG-SIS. Following the sure independence screening theory of Ni and Fang [19], we assume the following conditions.

Condition 1 (C1). There exist two positive constants $c_1$ and $c_2$ such that $c_1/R \le p_r \le c_2/R$, $c_1 + c_2 \le R$, $c_1/R \le p_{k,jr} \le c_2/R$ and $c_1/J_k \le w_{k,j} \le c_2/J_k$ for every $1 \le j \le J_k$, $1 \le r \le R$ and $1 \le k \le p$.

Condition 2 (C2). There exist a positive constant $c > 0$ and $0 \le \tau < 1/2$ such that $\min_{k \in D} e_k \ge 2 c n^{-\tau}$.

Condition 3 (C3). $R = O(n^{\varepsilon})$ and $J = \max_{1 \le k \le p} J_k = O(n^{\kappa})$, where $\varepsilon \ge 0$, $\kappa \ge 0$ and $2\tau + 2\varepsilon + 2\kappa < 1$.

Condition 4 (C4). There exists a positive constant $c_3$ such that $0 < f_k(x \mid Y = r) < c_3$ for any $1 \le r \le R$ and any $x$ in the domain of $X_k$, where $f_k(x \mid Y = r)$ is the Lebesgue density function of $X_k$ conditional on $Y = r$.

Condition 5 (C5). There exist a positive constant $c_4$ and $0 \le \rho < 1/2$ such that $f_k(x) \ge c_4 n^{-\rho}$ for any $1 \le k \le p$ and any $x$ in the domain of $X_k$, where $f_k(x)$ is the Lebesgue density function of $X_k$. Furthermore, $f_k(x)$ is continuous in the domain of $X_k$.

Condition 6 (C6). $R = O(n^{\varepsilon})$ and $J = \max_{1 \le k \le p} J_k = O(n^{\kappa})$, where $2\tau + 2\varepsilon + 2\kappa + 2\rho < 1$, $\varepsilon \ge 0$ and $\kappa \ge 0$.

Condition 7 (C7). $\liminf_{p \to \infty}\left\{\min_{k \in D} e_k - \max_{k \in I} e_k\right\} \ge \delta$, where $\delta > 0$ is a constant.

Condition (C1) guarantees that the proportion of each class cannot be either extremely small or extremely large; a similar assumption is made in condition (C1) of Huang et al. [18] and Cui et al. [16]. Following Fan and Lv [1] and Cui et al. [16], Condition (C2) allows the minimum true signal to vanish at the order $n^{-\tau}$ as the sample size goes to infinity. Following [19], Condition (C3) allows the number of covariate categories and the number of response classes to diverge at a certain order, and Condition (C6) modifies Condition (C3) slightly. To ensure that the sample percentiles are close to the true percentiles, Condition (C4) rules out the extreme case that some $X_k$ puts heavy mass in a small range, and Condition (C5) requires $n^{-\rho}$ as a lower bound for the density. Following the ranking consistency property proposed in [16] and Zhu et al. [6], we define the inactive covariate subset $I = \{1, \dots, p\} \setminus D$, under which Condition (C7) is stated.

Theorem 3.1. (Sure screening property) Under conditions (C1) to (C3), if all the covariates are categorical, we get:

$P(D \subseteq \hat{D}) \ge 1 - O\left(p \exp\left\{-b n^{1 - (2\tau + 2\varepsilon + 2\kappa)} + (\varepsilon + \kappa)\log n\right\}\right)$

Theorem 3.2. (Sure screening property) Under conditions (C4) to (C6), when the covariates are composed of continuous and categorical variables, we get:

$P(D \subseteq \hat{D}) \ge 1 - O\left(p \exp\left\{-b n^{1 - (2\tau + 2\varepsilon + 2\kappa + 2\rho)} + (\varepsilon + \kappa)\log n\right\}\right)$

where $b$ is a positive constant. If $\log p = O(n^{\alpha})$ and $\alpha < 1 - (2\tau + 2\varepsilon + 2\kappa + 2\rho)$, PG-SIS has the sure screening property.

Theorem 3.3. (Ranking consistency property) Under conditions (C1), (C4), (C5) and (C7), if $\frac{\log(RJ)}{\log n} = O(1)$ and $\frac{\max\{\log p, \log n\}\, R^4 J^4}{n^{1 - 2\rho}} = o(1)$, then $\liminf_{n \to \infty}\left\{\min_{k \in D} \hat{e}_k - \max_{k \in I} \hat{e}_k\right\} > 0$, a.s.

Theorem 3.3 shows that the proposed screening index can separate active and inactive covariates well at the sample level.

4. Numerical Studies

4.1. Simulation Results

In this subsection, we carry out three simulation studies to demonstrate the finite sample performance of the screening method described in Section 2. We compare PG-SIS with IG-SIS [19] and APC-SIS [20] using the following evaluation criteria: MMS, the minimal model size needed to contain all active covariates, for which we report the 5%, 25%, 50%, 75% and 95% quantiles; CP1, CP2 and CP3, the probabilities that models of size $[n/\log n]$, $2[n/\log n]$ and $3[n/\log n]$, respectively, cover all active covariates; and CPa, which indicates whether the selected model covers all active covariates.
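As a concrete reading of these criteria, the sketch below computes the MMS of one replication and a coverage probability across replications; the function and variable names are ours.

import numpy as np

def minimal_model_size(e_hat, active_set):
    """MMS: the smallest ranked model that contains every active covariate."""
    order = np.argsort(e_hat)[::-1]                  # covariates ranked by decreasing estimate
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(e_hat) + 1)      # rank 1 = largest estimate
    return max(ranks[k] for k in active_set)

def coverage_probability(mms_values, n, multiple=1):
    """CP1/CP2/CP3: share of replications with MMS <= multiple * [n / log n]."""
    cutoff = multiple * int(n / np.log(n))
    return float(np.mean(np.asarray(mms_values) <= cutoff))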

Model 1: categorical covariates and binary response

We first consider response variables with different numbers of categories. Following [19], we assume a model in which the response $y_i$ is binary, i.e., $R = 2$, and all the covariates are categorical. We consider two distributions for $y_i$:

1) Balanced: $P(y_i = r) = 1/2$;

2) Unbalanced: $p_r = 2\left[1 + \frac{R - r}{R - 1}\right]/(3R)$ with $\max_{1 \le r \le R} p_r = 2 \min_{1 \le r \le R} p_r$.

The true model is $D = \{1, 2, \dots, 20\}$ with $d_0 = |D| = 20$. Conditional on $y_i$, a latent vector is generated as $z_i = (z_{i,1}, \dots, z_{i,p})$, where $z_{i,k} \sim N(\mu_{rk}, 1)$, $1 \le k \le p$. Then, we construct the active covariates as follows:

1) If $k > d_0$, then $\mu_{rk} = 0$;

2) If $k \le d_0$ and $r = 1$, then $\mu_{rk} = 0.5$;

3) If $k \le d_0$ and $r = 2$, then $\mu_{rk} = -0.5$.

Next, we apply quantiles of the standard normal distribution to generate the covariates. The specific approach is as follows:

1) When $k$ is odd, $x_{i,k} = I(z_{i,k} > z_{(1/2)}) + 1$;

2) When $k$ is even, $x_{i,k} = \sum_{j=1}^{4} I(z_{i,k} > z_{(j/5)}) + 1$;

where $z_{(\alpha)}$ denotes the $\alpha$-th quantile of the standard normal distribution.

Thus, among all $p$ covariates, two-category and five-category covariates each account for half. Similar to [20], we consider $p = 1000, 5000$ and $n = 200, 400$ in this model.
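The following sketch generates one data set from the balanced case of Model 1 as we read it, assuming $\mu_{rk} = -0.5$ for class $r = 2$ and forming the multi-category covariates by summing quantile indicators; Model 2 follows the same indicator pattern with its own cut points. The function name, the seed, and the NumPy/SciPy usage are ours.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def generate_model1(n=200, p=1000, d0=20):
    y = rng.integers(1, 3, size=n)                       # balanced binary response, R = 2
    mu = np.zeros((n, p))
    mu[:, :d0] = np.where(y[:, None] == 1, 0.5, -0.5)    # +/-0.5 means for active covariates
    z = mu + rng.standard_normal((n, p))                 # latent z_{i,k} ~ N(mu_{rk}, 1)

    x = np.empty((n, p), dtype=int)
    odd = (np.arange(1, p + 1) % 2 == 1)                 # odd k: two-category covariates
    x[:, odd] = (z[:, odd] > norm.ppf(0.5)).astype(int) + 1
    cuts = norm.ppf(np.arange(1, 5) / 5)                 # z_(j/5), j = 1, ..., 4
    x[:, ~odd] = np.sum(z[:, ~odd, None] > cuts, axis=2) + 1   # even k: five categories
    return x, y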

Table 1 reports the evaluation criteria over 100 simulations for Model 1.

Table 1. Simulation results for example 1.

We can see the following: the proposed PG-SIS works quite well. As the sample size $n$ increases, the MMS of PG-SIS approaches the true model size $d_0 = 20$, and the coverage probabilities increase to 1. Comparing the two response structures, the MMS performance is better for the unbalanced response than for the balanced one. The performances of PG-SIS and IG-SIS are quite close, and PG-SIS attains slightly higher coverage probabilities than APC-SIS.

Model 2: categorical covariates and multi-class response

We next consider covariates with more categories and a multi-class response $y_i$ with $R = 10$. We consider two distributions for $y_i$:

1) Balanced: $P(y_i = r) = 1/R$;

2) Unbalanced: $p_r = 2\left[1 + \frac{R - r}{R - 1}\right]/(3R)$ with $\max_{1 \le r \le R} p_r = 2 \min_{1 \le r \le R} p_r$.

Among the $p = 2000$ covariates, the active covariate set is $X_D = \{X_{200}, X_{400}, X_{600}, X_{800}, X_{1000}, \dots, X_{2000}\}$, so the number of active covariates is $d_0 = 10$. Conditional on $y_i$, a latent vector is generated as $z_i = (z_{i,1}, \dots, z_{i,p})$ with $z_{i,k} = \varepsilon_{i,k} + \mu_{i,k}$, where $\varepsilon_{i,k} \sim N(0, 1)$, and the covariates are obtained as $x_{i,k} = f_k(\varepsilon_{i,k} + \mu_{i,k})$, where $f_k(\cdot)$ discretizes its argument using quantiles of the standard normal distribution, as specified below. We construct the active covariates by defining $\mu_{i,k}$:

1) If $X_k \in X_D$ and $y_i = r$, then $\mu_{i,k} = 1.5 \times (-0.9)^r$;

2) If $X_k \notin X_D$, then $\mu_{i,k} = 0$.

Next, we apply $f_k(\cdot)$ to generate the covariates, and take $p = 2000$ and $n = 300, 400, 500$ in this model. The specific approach is as follows:

1) For $1 \le k \le 400$: $f_k(\varepsilon_{i,k} + \mu_{i,k}) = I(z_{i,k} > z_{(1/2)}) + 1$;

2) For $400 < k \le 800$: $f_k(\varepsilon_{i,k} + \mu_{i,k}) = \sum_{j=1}^{3} I(z_{i,k} > z_{(j/4)}) + 1$;

3) For $800 < k \le 1200$: $f_k(\varepsilon_{i,k} + \mu_{i,k}) = \sum_{j=1}^{5} I(z_{i,k} > z_{(j/6)}) + 1$;

4) For $1200 < k \le 1600$: $f_k(\varepsilon_{i,k} + \mu_{i,k}) = \sum_{j=1}^{7} I(z_{i,k} > z_{(j/8)}) + 1$;

5) For $1600 < k \le 2000$: $f_k(\varepsilon_{i,k} + \mu_{i,k}) = \sum_{j=1}^{9} I(z_{i,k} > z_{(j/10)}) + 1$.

Thus, among all $p$ covariates, covariates with two, four, six, eight and ten categories each account for one-fifth.

Table 2 reports the evaluation criteria over 100 simulations for Model 2. We can see the following: PG-SIS and IG-SIS perform worse under Model 1 than under Model 2. When the model is more intricate, the performance of PG-SIS remains close to that of IG-SIS; in particular, PG-SIS and IG-SIS have slightly smaller MMS when the sample size $n$ is small. As the sample size $n$ increases, the MMS of PG-SIS approaches the true model size $d_0 = 10$, and the coverage probabilities increase to 1.

Table 2. Simulation results for example 2.

The four coverage probability indexes of APC-SIS are worse than those of PG-SIS at the smallest sample size. Comparing the two response structures, the MMS performance is better for the unbalanced response than for the balanced one. Furthermore, PG-SIS and IG-SIS are more robust because the fluctuation range of their MMS is small.

Model 3: continuous and categorical covariates

Finally, we consider a more complex example in which the covariates are both continuous and categorical, and the response $y_i$ is multi-class with $R = 4$. We consider two distributions for $y_i$:

1) Balanced: $p_r = P(y_i = r) = 1/R$;

2) Unbalanced: $p_r = 2\left[1 + \frac{R - r}{R - 1}\right]/(3R)$ with $\max_{1 \le r \le R} p_r = 2 \min_{1 \le r \le R} p_r$.

In this model, we take $p = 5000$ and $n = 400, 600, 800$. The true model is $X_D = \{X_k : k = [k' p / 20], k' = 1, \dots, 20\}$ with $d_0 = 20$. Conditional on $y_i$, a latent vector is generated as $z_i = (z_{i,1}, \dots, z_{i,p})$ with $z_{i,k} \sim N(\mu_{i,k}, 1)$, $1 \le k \le p$, where $\mu_{i,k} = (-1)^r \theta_{rk}$ when $y_i = r$ and $k \in D$, and $\mu_{i,k} = 0$ when $k \notin D$. Following Ni and Fang [19], $\theta_{rk}$ is given in Table 3. We generate $X_k$ as follows (a sketch of this construction is given after the list below):

For $k \le [5p/20]$: $x_{i,k} = j$ if $z_{i,k} \in (z_{((j-1)/4)}, z_{(j/4)}]$, $j = 1, 2, 3, 4$;

For $[5p/20] < k \le [10p/20]$: $x_{i,k} = j$ if $z_{i,k} \in (z_{((j-1)/10)}, z_{(j/10)}]$, $j = 1, \dots, 10$;

For $[10p/20] < k \le p$: $x_{i,k} = z_{i,k}$.
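A sketch of this mixed-covariate design as we read it; theta stands for the $R \times 20$ matrix in Table 3, and the function name, seed, and NumPy/SciPy usage are ours.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def generate_model3(theta, n=400, p=5000, R=4):
    d0 = theta.shape[1]                                            # 20 active covariates
    active = np.array([(k + 1) * p // 20 - 1 for k in range(d0)])  # k = [k' p / 20], 0-based
    y = rng.integers(1, R + 1, size=n)                             # balanced case

    mu = np.zeros((n, p))
    mu[:, active] = ((-1.0) ** y)[:, None] * theta[y - 1, :]       # mu_{i,k} = (-1)^r theta_{rk}
    z = mu + rng.standard_normal((n, p))

    x = np.empty((n, p))
    b1, b2 = 5 * p // 20, 10 * p // 20
    cuts4 = norm.ppf(np.arange(1, 4) / 4)                          # four-category covariates
    cuts10 = norm.ppf(np.arange(1, 10) / 10)                       # ten-category covariates
    x[:, :b1] = np.sum(z[:, :b1, None] > cuts4, axis=2) + 1
    x[:, b1:b2] = np.sum(z[:, b1:b2, None] > cuts10, axis=2) + 1
    x[:, b2:] = z[:, b2:]                                          # remaining covariates stay continuous
    return x, y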

Thus, among all $p$ covariates, four-category and ten-category covariates each account for one-quarter, and the remaining covariates are continuous. Correspondingly, the active covariates contain 5 four-category and 5 ten-category covariates, and the remaining half of the active covariates are continuous. For the continuous covariates, we apply different numbers of slices, $J_k = 4, 8, 10$; the corresponding methods are denoted PG-SIS-4, IG-SIS-4, APC-SIS-4, PG-SIS-8, IG-SIS-8, APC-SIS-8, PG-SIS-10, IG-SIS-10 and APC-SIS-10. Table 4 and Table 5 show the simulation results over 100 replications for the balanced and unbalanced cases, respectively. We can see the following: as the sample size $n$ increases, the MMS of PG-SIS approaches $d_0 = 20$, and the coverage probabilities increase to 1. The coverage probabilities of PG-SIS are close to those of IG-SIS across the indexes, which demonstrates that PG-SIS has the desired screening behavior. Comparing the two response structures, the MMS performance is better for the unbalanced response than for the balanced one. Furthermore, PG-SIS and IG-SIS are robust because the fluctuation range of their MMS is small for both types of responses.

Table 3. Parameter specification of Model 3.

Table 4. Simulation results for example 3: balanced Y.

Table 5. Simulation results for example 3: unbalanced Y.

Table 6. Analysis results for real data example.

When different numbers of slices are applied to the continuous covariates, PG-SIS and IG-SIS remain better in terms of coverage probabilities and MMS across both response structures. Therefore, the performance of the three methods is essentially independent of the number of slices.

4.2. Real Data

In this subsection, we analyse a real data set from the feature selection database of Arizona State University (http://featureselection.asu.edu/). The GLIOMA biological data contain 50 samples and 4434 features, and the data are unbalanced in the response variable: the four class sizes are 14, 7, 14 and 15. The covariates are both continuous and multi-class. We randomly divide the data into two parts, with 90 percent used as training data and 10 percent as test data; the sample sizes of the training and test data are $n = 45$ and $n = 5$, respectively, and the dimension of both is $p = 4434$.

We apply ten-fold cross-validation to alleviate the dependence of model accuracy on the particular choice of training data. For PG-SIS, IG-SIS and APC-SIS, we apply three classification approaches, Support Vector Machine (SVM) [25], Random Forest (RF) and Decision Tree (DT) [26], to the selected active covariates.

For both the training data and the test data, we use the G-mean and F-measure [27] as evaluation criteria. The performance of PG-SIS on this unbalanced data set is reported in Table 6. Under all classification methods, PG-SIS performs best, and its G-mean is closer to 1 than those of the other two methods. In summary, the proposed PG-SIS performs better.
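A minimal sketch of this evaluation pipeline, assuming scikit-learn. The screen argument stands for a screening rule such as the pg_sis sketch in Section 2.2; the data loading, the fold setup (plain K-fold, since the smallest class has only 7 samples), and all names are ours.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score, recall_score

def g_mean(y_true, y_pred):
    recalls = recall_score(y_true, y_pred, average=None)     # per-class recall
    return float(np.prod(recalls) ** (1.0 / len(recalls)))

def evaluate(X, y, screen, n_keep=50):
    classifiers = {"SVM": SVC(), "RF": RandomForestClassifier(), "DT": DecisionTreeClassifier()}
    scores = {name: [] for name in classifiers}
    for train, test in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
        selected, _ = screen(X[train], y[train])              # screen on the training fold only
        selected = selected[:n_keep]
        for name, clf in classifiers.items():
            clf.fit(X[train][:, selected], y[train])
            pred = clf.predict(X[test][:, selected])
            scores[name].append((g_mean(y[test], pred),
                                 f1_score(y[test], pred, average="macro")))
    return scores   # per-fold (G-mean, F-measure) pairs for each classifier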

5. Conclusions

Data containing both continuous and categorical covariates together with a categorical response are very common in practice, but applicable screening methods are very limited. We propose the PG-SIS procedure, based on the Gini impurity, to screen covariates effectively. PG-SIS is model free and theoretically possesses the sure screening property and the ranking consistency property. Whether the numbers of categories of the covariates are the same or different, PG-SIS performs quite similarly to IG-SIS, as shown in the simulations.

Feature screening also faces difficulties in the presence of missing data. In the future, based on the classification model, we intend to propose a new feature screening procedure in which covariates can be selected when either the covariates or the response contain missing values.

Acknowledgements

The work was supported by National Natural Science Foundation of China [grant number 71963008].

Appendix

Proof of Proposition 2.1. To prove Proposition 2.1, define $f(x) = x^2$; the argument is close to that of Ni and Fang [19]. By Jensen's inequality [24],

$\sum_{j=1}^{J_k} w_{k,j} \sum_{r=1}^{R} p_{k,jr}^2 = \sum_{r=1}^{R}\left[\sum_{j=1}^{J_k} w_{k,j} f(p_{k,jr})\right] \ge \sum_{r=1}^{R} f\left(\sum_{j=1}^{J_k} w_{k,j} p_{k,jr}\right) = \sum_{r=1}^{R} f\left(\sum_{j=1}^{J_k} P(X_k = j) P(Y = r \mid X_k = j)\right) = \sum_{r=1}^{R} p_r^2$

then

$PG(Y \mid X_k) = 1 - \sum_{r=1}^{R} p_r^2 - \sum_{j=1}^{J_k} w_{k,j}\left(1 - \sum_{r=1}^{R} p_{k,jr}^2\right) = 1 - \sum_{r=1}^{R} p_r^2 - \sum_{j=1}^{J_k} w_{k,j} + \sum_{j=1}^{J_k} w_{k,j}\sum_{r=1}^{R} p_{k,jr}^2 = \sum_{j=1}^{J_k} w_{k,j}\sum_{r=1}^{R} p_{k,jr}^2 - \sum_{r=1}^{R} p_r^2 \ge 0$

Equality holds if and only if $p_{k,jr} = p_{k,j'r}$ for any $1 \le r \le R$ and $1 \le j \ne j' \le J_k$, that is, if and only if $X_k$ and $Y$ are independent.

Proof of Proposition 2.2. By the same argument as in the proof of Proposition 2.1, $PG_J(Y \mid X_k) \ge 0$, with equality if and only if $p_{k,jr} = p_{k,j'r}$ for any $1 \le r \le R$ and $1 \le j \ne j' \le J$, that is, $P(Y = r \mid X_k \in (q_{(j-1)}, q_{(j)}]) = P(Y = r \mid X_k \in (q_{(j'-1)}, q_{(j')}])$. Hence, when $X_k$ and $Y$ are independent, $PG_J(Y \mid X_k) = 0$. Conversely, when $PG_J(Y \mid X_k) = 0$ for any $J$, we need to show that $P(X_k \le x \mid Y = r)$ does not depend on $r$ for any $x$ in the domain of $X_k$; this has been proven in [19], so the proof is omitted here.

Lemma 1 (Bernstein inequality). If $Z_1, \dots, Z_n$ are independent random variables with mean zero and support bounded in $[-M, M]$, then

$P\left(\left|\sum_{i=1}^{n} Z_i\right| > t\right) \le 2\exp\left\{-\dfrac{t^2}{2\left(v + \frac{Mt}{3}\right)}\right\}$

where $v \ge \mathrm{Var}\left(\sum_{i=1}^{n} Z_i\right)$.

Lemma 2. For discrete covariates X k and discrete response Y, we have the following three inequalities:

1) $P(|\hat{p}_r - p_r| > t) \le 2\exp\left\{-\dfrac{6nt^2}{3 + 4t}\right\}$

2) $P(|\hat{w}_{k,j} - w_{k,j}| > t) \le 2\exp\left\{-\dfrac{6nt^2}{3 + 4t}\right\}$

3) $P(|\hat{p}_{k,jr} - p_{k,jr}| > t) \le 2\exp\left\{-\dfrac{6nt^2}{3 + 4t}\right\}$

Proof of Lemma 2. The three inequalities have similar proofs; inequalities (1) and (2) are given in [27]. The following is the proof of inequality (3).

$\hat{p}_{k,jr} = \dfrac{\sum_{i=1}^{n} I\{y_i = r, x_{i,k} = j\}}{\sum_{i=1}^{n} I\{x_{i,k} = j\}}$

The expectation of $\hat{p}_{k,jr}$ is

$E(\hat{p}_{k,jr}) = E\left(\dfrac{\sum_{i=1}^{n} I\{y_i = r, x_{i,k} = j\}}{\sum_{i=1}^{n} I\{x_{i,k} = j\}}\right) = p_{k,jr}$

Let $Z_i = I\{y_i = r \mid x_{i,k} = j\} - p_{k,jr}$. Since $\mathrm{Var}\left(\sum_{i=1}^{n} Z_i\right) = n\,\mathrm{Var}(Z_i) = n p_{k,jr}(1 - p_{k,jr}) \le \frac{n}{4}$, we have

$P(|\hat{p}_{k,jr} - p_{k,jr}| > t) = P\left(\left|n^{-1}\sum_{i=1}^{n} Z_i\right| > t\right) = P\left(\left|\sum_{i=1}^{n} Z_i\right| > nt\right) \le 2\exp\left\{-\dfrac{n^2 t^2}{2\left(\frac{n}{4} + \frac{nt}{3}\right)}\right\} = 2\exp\left\{-\dfrac{6nt^2}{3 + 4t}\right\}$

According to the Bernstein inequality, the stated bound holds.

Lemma 3. For a discrete covariate $X_k$ and a discrete response $Y$, for any $0 < \varepsilon < 1$, under Condition (C1), we have

$P(|\hat{e}_k - e_k| > 2\varepsilon) \le O(RJ^3)\exp\left\{-\dfrac{c_5 n\varepsilon^2}{R^2 J^6}\right\}$,

where c 5 represents a positive constant.

Proof of Lemma 3. By the definitions of $e_k$ and $\hat{e}_k$ in Section 2.2, we have

$\log J_k(\hat{e}_k - e_k) = \left[\left(1 - \sum_{r=1}^{R}\hat{p}_r^2\right) - \sum_{j=1}^{J_k}\hat{w}_{k,j}\left(1 - \sum_{r=1}^{R}\hat{p}_{k,jr}^2\right)\right] - \left[1 - \sum_{r=1}^{R} p_r^2 - \sum_{j=1}^{J_k} w_{k,j}\left(1 - \sum_{r=1}^{R} p_{k,jr}^2\right)\right]$

$= \left(\sum_{r=1}^{R} p_r^2 - \sum_{r=1}^{R}\hat{p}_r^2\right) + \left(\sum_{j=1}^{J_k} w_{k,j} - \sum_{j=1}^{J_k}\hat{w}_{k,j}\right) + \left(\sum_{j=1}^{J_k}\hat{w}_{k,j}\sum_{r=1}^{R}\hat{p}_{k,jr}^2 - \sum_{j=1}^{J_k} w_{k,j}\sum_{r=1}^{R} p_{k,jr}^2\right)$

$= \sum_{r=1}^{R}\left(p_r^2 - \hat{p}_r^2\right) + \sum_{j=1}^{J_k}\left(w_{k,j} - \hat{w}_{k,j}\right) + \sum_{j=1}^{J_k}\sum_{r=1}^{R}\left(\hat{w}_{k,j}\hat{p}_{k,jr}^2 - w_{k,j}p_{k,jr}^2\right)$

$= \sum_{r=1}^{R}\left(p_r - \hat{p}_r\right)\left(p_r + \hat{p}_r\right) + \sum_{j=1}^{J_k}\left(w_{k,j} - \hat{w}_{k,j}\right) + \sum_{j=1}^{J_k}\sum_{r=1}^{R}\left[\left(\hat{w}_{k,j}\hat{p}_{k,jr} + w_{k,j}p_{k,jr}\right)\left(\hat{p}_{k,jr} - p_{k,jr}\right) + \hat{p}_{k,jr}p_{k,jr}\left(\hat{w}_{k,j} - w_{k,j}\right)\right]$

$:= I_1 + I_2 + I_3$

Since $\log J_k \ge \log 2 > 0.5$, we have

$P(|\hat{e}_k - e_k| > 2\varepsilon) \le P\left(|I_1| > \frac{\varepsilon}{3}\right) + P\left(|I_2| > \frac{\varepsilon}{3}\right) + P\left(|I_3| > \frac{\varepsilon}{3}\right)$

For I 1 , we have

$P\left(|I_1| > \frac{\varepsilon}{3}\right) \le \sum_{r=1}^{R} P\left(\left|(p_r - \hat{p}_r)(p_r + \hat{p}_r)\right| > \frac{\varepsilon}{3}\right) \le \sum_{r=1}^{R} P\left(|p_r - \hat{p}_r| > \frac{c_1\varepsilon}{3RJ^3}\right) \le RJ^3 \cdot 2\exp\left\{-\dfrac{6n\left(\frac{c_1\varepsilon}{3RJ^3}\right)^2}{3 + 4\left(\frac{c_1\varepsilon}{3RJ^3}\right)}\right\}$

For I 2 , we have

$P\left(|I_2| > \frac{\varepsilon}{3}\right) \le \sum_{j=1}^{J_k} P\left(|\hat{w}_{k,j} - w_{k,j}| > \frac{c_1\varepsilon}{3J^3}\right) \le J^3 \cdot 2\exp\left\{-\dfrac{6n\left(\frac{c_1\varepsilon}{3J^3}\right)^2}{3 + 4\left(\frac{c_1\varepsilon}{3J^3}\right)}\right\}$

For I 3 , we have

$I_3 = \sum_{j=1}^{J_k}\sum_{r=1}^{R}\left[\left(\hat{w}_{k,j}\hat{p}_{k,jr} + w_{k,j}p_{k,jr}\right)\left(\hat{p}_{k,jr} - p_{k,jr}\right) + \hat{p}_{k,jr}p_{k,jr}\left(\hat{w}_{k,j} - w_{k,j}\right)\right] = \sum_{j=1}^{J_k}\sum_{r=1}^{R}\left(\hat{w}_{k,j}\hat{p}_{k,jr} + w_{k,j}p_{k,jr}\right)\left(\hat{p}_{k,jr} - p_{k,jr}\right) + \sum_{j=1}^{J_k}\sum_{r=1}^{R}\hat{p}_{k,jr}p_{k,jr}\left(\hat{w}_{k,j} - w_{k,j}\right) := I_{31} + I_{32}$

For I 31 and I 32 , we have

$P\left(|I_3| > \frac{\varepsilon}{3}\right) \le P\left(|I_{31}| > \frac{\varepsilon}{6}\right) + P\left(|I_{32}| > \frac{\varepsilon}{6}\right)$

$P\left(|I_{31}| > \frac{\varepsilon}{6}\right) \le \sum_{j=1}^{J_k}\sum_{r=1}^{R} P\left(\left|\left(\hat{w}_{k,j}\hat{p}_{k,jr} + w_{k,j}p_{k,jr}\right)\left(\hat{p}_{k,jr} - p_{k,jr}\right)\right| > \frac{\varepsilon}{6}\right) \le \sum_{j=1}^{J_k}\sum_{r=1}^{R} P\left(|\hat{p}_{k,jr} - p_{k,jr}| > \frac{c_1\varepsilon}{6RJ^3}\right) \le RJ^3 \cdot 2\exp\left\{-\dfrac{6n\left(\frac{c_1\varepsilon}{6RJ^3}\right)^2}{3 + 4\left(\frac{c_1\varepsilon}{6RJ^3}\right)}\right\}$

$P\left(|I_{32}| > \frac{\varepsilon}{6}\right) \le \sum_{j=1}^{J_k}\sum_{r=1}^{R} P\left(\left|\hat{p}_{k,jr}p_{k,jr}\left(\hat{w}_{k,j} - w_{k,j}\right)\right| > \frac{\varepsilon}{6}\right) \le \sum_{j=1}^{J_k}\sum_{r=1}^{R} P\left(|\hat{w}_{k,j} - w_{k,j}| > \frac{c_1\varepsilon}{6RJ^3}\right) \le RJ^3 \cdot 2\exp\left\{-\dfrac{6n\left(\frac{c_1\varepsilon}{6RJ^3}\right)^2}{3 + 4\left(\frac{c_1\varepsilon}{6RJ^3}\right)}\right\}$

Combining the above inequalities, we obtain

$P(|\hat{e}_k - e_k| > 2\varepsilon) \le O(RJ^3)\exp\left\{-\dfrac{c_5 n\varepsilon^2}{R^2 J^6}\right\}$

where c 5 represents a positive constant.

Proof of Theorem 3.1. By Conditions (C1) to (C3) and Lemma 3, we can get

$P(D \subseteq \hat{D}) \ge P\left(|\hat{e}_k - e_k| \le c n^{-\tau},\ \forall k \in D\right) \ge P\left(\max_{1 \le k \le p}|\hat{e}_k - e_k| \le c n^{-\tau}\right) \ge 1 - \sum_{k=1}^{p} P\left(|\hat{e}_k - e_k| > c n^{-\tau}\right) \ge 1 - O(RJ^3)\, p\exp\left\{-\dfrac{c_5 c^2 n^{1 - 2\tau}}{R^2 J^6}\right\} \ge 1 - O\left(p\exp\left\{-b n^{1 - 2\tau - 2\varepsilon - 2\kappa} + (\varepsilon + \kappa)\log n\right\}\right)$

where $b$ is a positive constant.

Lemma 4 (Lemma A.2 of [19]). For any continuous covariate $X_k$ satisfying Conditions (C4) and (C5), let $F_k(y, x)$ be the cumulative distribution function of $(Y, X_k)$ and $\hat{F}_k(y, x)$ be the empirical cumulative distribution function. Then $P\left(|\hat{F}_k(r, \hat{q}_{k,(j)}) - F_k(r, q_{k,(j)})| > \varepsilon\right) \le c_6\exp\{-c_7 n^{1 - 2\rho}\varepsilon^2\}$ for any $\varepsilon > 0$, $1 \le r \le R$ and $1 \le j \le J_k$, where $c_6$ and $c_7$ are two positive constants.

Lemma 5 (Lemma A.3 of [19]). Under Conditions (C1), (C4) and (C5), for any $0 < \varepsilon < 1$ and any continuous $X_k$, we have $P(|\hat{e}_k - e_k| > 2\varepsilon) \le O(RJ)\exp\left\{-\dfrac{c_9 n^{1 - 2\rho}\varepsilon^2}{R^4 J^4}\right\}$, where $c_9$ is a positive constant.

Proof of Theorem 3.2. By Lemma 4 and Lemma 5, the proof of Theorem 3.2 is the same as that of Theorem 3.1 and hence is omitted.

Proof of Theorem 3.3. By Lemmas 3 and 5 and under Conditions (C1), (C4), (C5) and (C7), Theorem 3.3 follows from the argument in [19], and hence the proof is omitted.

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] Fan, J.Q. and Lv, J.C. (2008) Sure Independence Screening for Ultrahigh Dimensional Feature Space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70, 849-911.
https://doi.org/10.1111/j.1467-9868.2008.00674.x
[2] Fan, J.Q., Samworth, R. and Wu, Y.C. (2009) Ultrahigh Dimensional Feature Selection: Beyond the Linear Model. Journal of Machine Learning Research, 10, 2013-2038.
http://arxiv.org/abs/0812.3201
[3] Wang, H.S. (2009) Forward Regression for Ultra-High Dimensional Variable Screening. Journal of the American Statistical Association, 104, 1512-1524.
https://doi.org/10.1198/jasa.2008.tm08516
[4] Fan, J.Q. and Song, R. (2010) Sure Independence Screening in Generalized Linear Models with NP-Dimensionality. Annals of Statistics, 38, 3567-3604.
https://doi.org/10.1214/10-AOS798
[5] Fan, J.Q., Feng, Y. and Song, R. (2011) Nonparametric Independence Screening in Sparse Ultra-High-Dimensional Additive Models. Journal of the American Statistical Association, 106, 544-557.
https://doi.org/10.1198/jasa.2011.tm09779
[6] Zhu, L.P., Li, L.X., Li, R.Z. and Zhu, L.X. (2011) Model-Free Feature Screening for Ultrahigh-Dimensional Data. Journal of the American Statistical Association, 106, 1464-1475.
https://doi.org/10.1198/jasa.2011.tm10563
[7] Li, G.R., Peng, H., Zhang, J. and Zhu, L.X. (2012) Robust Rank Correlation Based Screening. Annals of Statistics, 40, 1846-1877.
https://doi.org/10.1214/12-AOS1024
[8] Li, R.Z., Zhong, W. and Zhu, L.P. (2012) Feature Screening via Distance Correlation Learning. Journal of the American Statistical Association, 107, 1129-1139.
https://doi.org/10.1080/01621459.2012.695654
[9] He, X.M., Wang, L. and Hong, H.G. (2013) Quantile-Adaptive Model-Free Variable Screening for High-Dimensional Heterogeneous Data. Annals of Statistics, 41, 342-369.
https://doi.org/10.1214/13-AOS1087
[10] Fan, J.Q., Ma, Y.B. and Dai, W. (2014) Nonparametric Independence Screening in Sparse Ultra-High-Dimensional Varying Coefficient Models. Journal of the American Statistical Association, 109, 1270-1284.
https://doi.org/10.1080/01621459.2013.879828
[11] Liu, J.Y., Li, R.Z. and Wu, R.L. (2014) Feature Selection for Varying Coefficient Models with Ultrahigh-Dimensional Covariates. Journal of the American Statistical Association, 109, 266-274.
https://doi.org/10.1080/01621459.2013.850086
[12] Nandy, D., Chiaromonte, F. and Li, R.Z. (2021) Covariate Information Number for Feature Screening in Ultrahigh-Dimensional Supervised Problems. Journal of the American Statistical Association, 117, 1516-1529.
https://doi.org/10.1080/01621459.2020.1864380
[13] Pouyap, M., Bitjoka, L., Mfoumou, E. and Toko, D. (2021) Improved Bearing Fault Diagnosis by Feature Extraction Based on GLCM, Fusion of Selection Methods, and Multiclass-Naive Bayes Classification. Journal of Signal and Information Processing, 12, 71-85.
https://doi.org/10.4236/jsip.2021.124004
[14] Fan, J.Q. and Fan, Y.Y. (2008) High-Dimensional Classification Using Features Annealed Independence Rules. Annals of Statistics, 36, 2605-2637.
https://doi.org/10.1214/07-AOS504
[15] Mai, Q. and Zou, H. (2013) The Kolmogorov Filter for Variable Screening in High-Dimensional Binary Classification. Biometrika, 100, 229-234.
https://doi.org/10.1093/biomet/ass062
[16] Cui, H.J., Li, R.Z. and Zhong, W. (2015) Model-Free Feature Screening for Ultrahigh Dimensional Discriminant Analysis. Journal of the American Statistical Association, 110, 630-641.
https://doi.org/10.1080/01621459.2014.920256
[17] Lai, P., Song, F.L., Chen, K.W. and Liu, Z. (2017) Model Free Feature Screening with Dependent Variable in Ultrahigh Dimensional Binary Classification. Statistics & Probability Letters, 125, 141-148.
https://doi.org/10.1016/j.spl.2017.02.011
[18] Huang, D.Y., Li, R.Z. and Wang, H.S. (2014) Feature Screening for Ultrahigh Dimensional Categorical Data with Applications. Journal of Business & Economic Statistics, 32, 237-244.
https://doi.org/10.1080/07350015.2013.863158
[19] Ni, L. and Fang, F. (2016) Entropy-Based Model-Free Feature Screening for Ultrahigh-Dimensional Multiclass Classification. Journal of Nonparametric Statistics, 28, 515-530.
https://doi.org/10.1080/10485252.2016.1167206
[20] Ni, L., Fang, F. and Wan, F.J. (2017) Adjusted Pearson Chi-Square Feature Screening for Multi-Classification with Ultrahigh Dimensional Data. Metrika, 80, 805-828.
https://doi.org/10.1007/s00184-017-0629-9
[21] Sheng, Y. and Wang, Q.H. (2020) Model-Free Feature Screening for Ultrahigh Dimensional Classification. Journal of Multivariate Analysis, 178, 1-12.
https://doi.org/10.1016/j.jmva.2020.104618
[22] Anzarmou, Y., Mkhadri, A. and Oualkacha, K. (2022) The Kendall Interaction Filter for Variable Interaction Screening in High Dimensional Classification Problems. Journal of Applied Statistics, 1-19.
https://doi.org/10.1080/02664763.2022.2031125
[23] Breiman, L., Friedman, J.H., Stone, C.J. and Olshen, R.A. (1984) Classification and Regression Trees. Wadsworth International Group, Belmont.
[24] Taboga, M. (2012) Lectures on Probability Theory and Mathematical Statistics. CreateSpace Independent Publishing Platform, Scotts Valley.
[25] Suykens, J.A.K. and Vandewalle, J. (1999) Least Squares Support Vector Machine Classifiers. Neural Processing Letters, 9, 293-300.
https://doi.org/10.1023/A:1018628609742
[26] Lantz, B. (2015) Machine Learning with R. Mathematical & Statistical Software. Packt Publishing, Birmingham.
[27] He, H.J. and Deng, G.M. (2022) Grouped Feature Screening for Ultra-High Dimensional Data for the Classification Model. Journal of Statistical Computation and Simulation, 1-24.
