Model-Free Feature Screening via Maximal Information Coefficient (MIC) for Ultrahigh-Dimensional Multiclass Classification

Abstract

It is common for datasets to contain both categorical and continuous variables. However, many feature screening methods designed for high-dimensional classification assume that the covariates are continuous, which limits their applicability in this more complex scenario. To address this issue, we propose a model-free feature screening approach for ultrahigh-dimensional multiclass classification that can handle both categorical and continuous covariates. The proposed method uses the Maximal Information Coefficient (MIC) to assess the predictive power of each covariate. Under certain regularity conditions, we prove that the screening procedure possesses the sure screening and ranking consistency properties. To validate the approach, we conduct simulation studies and a real data analysis that demonstrate its finite-sample performance. In summary, the proposed method offers an effective solution for screening features in ultrahigh-dimensional datasets with a mixture of categorical and continuous covariates.

Citation: Chen, T. and Deng, G. (2023) Model-Free Feature Screening via Maximal Information Coefficient (MIC) for Ultrahigh-Dimensional Multiclass Classification. Open Journal of Statistics, 13, 917-940. doi: 10.4236/ojs.2023.136046.

1. Introduction

With the advancement of data acquisition tools and the growth of computer storage capacity, ultrahigh-dimensional data have become common in many scientific fields, especially genomics, tumor classification, and machine learning. Traditional variable selection methods no longer apply, and existing feature screening methods for ultrahigh-dimensional data have their own limitations. In particular, when the covariates include both categorical and continuous variables, few methods are available and their screening performance leaves room for improvement, so new statistical theory and methods are urgently needed. The pioneering work of Fan and Lv (2008) [1] introduced the concept of sure independence screening (SIS). Specifically, for linear regression, they showed that screening based on Pearson correlation learning possesses the sure screening property: even when the number of predictors p grows much faster than the number of observations n, with $\log p = O(n^{\alpha})$ for some $\alpha \in (0, 1/2)$, all relevant predictors are selected with probability approaching one (Fan et al., 2009) [2].

Numerous approaches have been developed in recent years for feature screening in ultrahigh-dimensional data. Wang (2009) [3] introduced forward regression for handling such data. Fan and Song (2010) [4] applied maximum marginal likelihood estimation to ultrahigh-dimensional screening in generalized linear models. Fan et al. (2011) [5] extended correlation learning to marginal nonparametric learning. Li et al. (2012) [6] presented a robust rank correlation screening method based on the Kendall correlation coefficient. He et al. (2013) [7] developed a quantile-adaptive framework for nonlinear variable screening in high-dimensional heterogeneous data. Fan et al. (2014) [8] introduced nonparametric independence screening, which selects variables based on the nonparametric marginal contribution of each covariate given the exposure variable. Nandy et al. (2022) [9] introduced covariate information number sure independence screening, which employs a marginal utility connected to the traditional Fisher information. Tong et al. (2022) [10] proposed a model-free conditional feature screening method for ultrahigh-dimensional data based on false discovery rate (FDR) control, which does not require a specific functional form of the regression function and is robust to heavy-tailed responses and predictors.

To tackle the challenge of ultrahigh-dimensional feature screening in classification problems, Fan and Fan (2008) [11] introduced the t-test statistic for the two-sample mean problem as a marginal utility for feature screening and established its theoretical properties. Mai and Zou (2013) [12] applied the Kolmogorov filter to ultrahigh-dimensional binary classification. Cui et al. (2015) [13] proposed a screening procedure that utilizes empirical conditional distribution functions. Lai et al. (2017) [14] developed a feature screening procedure based on the expected conditional Kolmogorov filter for binary classification problems.

However, the aforementioned screening methods assume that the covariates are continuous. For categorical covariates, Huang et al. (2014) [15] devised a model-free discrete feature screening method based on Pearson Chi-square statistics and demonstrated its sure screening property, as defined in Fan et al. (2009) [2]. When all the covariates are binary, Ni and Fang (2016) [16] proposed a model-free feature screening procedure based on information entropy theory for multi-class classification. Ni et al. (2017) [17] further extended this by introducing a feature screening procedure based on weighted adjusted Pearson Chi-square statistics for multi-class classification. Sheng and Wang (2020) [18] introduced a novel model-free feature screening method based on the classification accuracy of marginal classifiers for ultrahigh-dimensional classification. Anzarmou et al. (2022) [19] presented a new model-free interaction screening method called the Kendall Interaction Filter (KIF) for classification in high-dimensional settings.

Building on this research on classification models, this paper introduces a model-free feature screening approach for ultrahigh-dimensional multi-classification that accommodates both categorical and continuous covariates. The proposed method uses the maximal information coefficient (MIC) to evaluate the predictive power of the covariates; for categorical covariates, the MIC index is equivalent to information gain [16]. We refer to the resulting procedure as Maximal Information Coefficient Sure Independence Screening (MIC-SIS). MIC can be applied directly to the data without any cut-off preprocessing, and it overcomes several disadvantages of information gain: the difficulty of estimating the joint probability, the fact that information gain is not a normalized measure, and the resulting inability to compare values across different datasets. MIC first finds an optimal discretization of the data and then normalizes the mutual information into a measure taking values in [0, 1]. It therefore enjoys a wide range of applicability, low computational complexity, and strong robustness, and yields more stable screening results. The MIC-SIS method is rigorously shown to possess the sure screening property originally proposed by Fan and Lv [1], ensuring that all important features can be identified, and our simulation results confirm this property in comparison with existing feature screening methods.

The paper is organized as follows: Section 2 provides a detailed description of the proposed MIC-SIS method. Section 3 establishes the sure screening property of the method. In Section 4, numerical simulations and a real data analysis example are presented to assess the sure screening property of our approach. Concluding remarks are provided in Section 5, and all proofs are included in the Appendix.

2. Feature Screening Procedure

Firstly, we introduce the concept of the maximal information coefficient (MIC), and subsequently, we propose a screening procedure that is based on the maximal information coefficient.

2.1. Maximal Information Coefficient (MIC)

The fundamental principle underlying the maximal information coefficient (MIC) is based on the concept of mutual information. Mutual Information (MI) [20] is a valuable measure in information theory, quantifying the amount of information contained in one random variable regarding another random variable. It represents the reduction in uncertainty of a random variable due to the knowledge of another random variable. In decision trees, mutual information and information gain (IG) are essentially equivalent.

The mutual information between two random variables X and Y is defined based on their joint probability density $p(x, y)$ as follows:

$$MI(X, Y) = \iint p(x, y) \log_2 \frac{p(x, y)}{p(x)\, p(y)} \, dx \, dy.$$

$MI(X, Y)$ is always nonnegative, and $MI(X, Y) = 0$ if and only if X and Y are independent.

When the covariate X is continuous and Y is a categorical response with R classes $\{1, 2, \ldots, R\}$,

$$MI(X, Y) = \sum_{i=1}^{n} \sum_{r=1}^{R} p(X_i, r) \log\left(\frac{p(X_i, r)}{p(X_i)\, p_r}\right),$$

where $p(X_i, r) = p_r P(X_i \mid Y = r)$ and $p_r = \frac{1}{n} \sum_{i=1}^{n} I(Y_i = r)$. In practice, the density functions are usually estimated with Gaussian kernel functions.

When the covariate X is categorical with J categories $\{1, 2, \ldots, J\}$ and Y is also categorical with R classes $\{1, 2, \ldots, R\}$,

$$MI(X, Y) = \sum_{j=1}^{J} \sum_{r=1}^{R} p(X = j, Y = r) \log\left(\frac{p(X = j, Y = r)}{p(X = j)\, p(Y = r)}\right),$$

where $p(X = j, Y = r) = \frac{1}{n} \sum_{i=1}^{n} I(X_i = j, Y_i = r)$, $r = 1, 2, \ldots, R$, $j = 1, 2, \ldots, J$.
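To make these estimators concrete, here is a minimal numpy sketch (our own illustration, not the authors' code; all function names are hypothetical). For the continuous case it uses the equivalent representation $MI(X, Y) = E[\log\{f(X \mid Y)/f(X)\}]$ with Gaussian kernel density estimates and a rule-of-thumb bandwidth, in line with the kernel estimation suggested above.

```python
import numpy as np

def gauss_kde(points, t, h):
    """Gaussian kernel density estimate built from `points`, evaluated at t."""
    u = (t[:, None] - points[None, :]) / h
    return np.exp(-0.5 * u ** 2).mean(axis=1) / (h * np.sqrt(2 * np.pi))

def mi_continuous(x, y):
    """Plug-in MI between continuous x and categorical y via
    MI = E[log f(x | Y) / f(x)]."""
    n = len(x)
    h = 1.06 * np.std(x) * n ** (-1 / 5)          # rule-of-thumb bandwidth
    f_x = gauss_kde(x, x, h)                      # marginal density at the sample points
    f_xy = np.empty(n)
    for r in np.unique(y):
        idx = (y == r)
        f_xy[idx] = gauss_kde(x[idx], x[idx], h)  # conditional density f(x | Y = r)
    return np.mean(np.log(f_xy / f_x))

def mi_categorical(x, y):
    """Plug-in MI between categorical x and categorical y (formula above)."""
    mi = 0.0
    for j in np.unique(x):
        for r in np.unique(y):
            p_jr = np.mean((x == j) & (y == r))   # p(X = j, Y = r)
            if p_jr > 0:
                mi += p_jr * np.log(p_jr / (np.mean(x == j) * np.mean(y == r)))
    return mi
```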

The idea behind MIC is to discretize the relationship between two variables and view it as a scatterplot in two-dimensional space. An a × b grid is used to partition the data space, and the frequency of data points falling into each cell (x, y) serves as the estimate of p(x, y). This greatly reduces the computational complexity of the joint probability and resolves the difficult problem of estimating the joint probability in mutual information:

$$p(x, y) = \frac{\text{number of data points in cell } (x, y)}{\text{total number of data points}}.$$

Since there are several ways to partition the data points with an a × b grid, the goal is to find the partition that maximizes the mutual information. The mutual information values are then normalized by a factor that maps them onto the interval [0, 1], and the grid resolution that maximizes the normalized mutual information defines the MIC measure. The formula for MIC is:

$$MIC(X, Y) = \max_{ab < B} \frac{MI(X, Y)}{\log_2 \min(a, b)}.$$

In the above equation, a and b are the numbers of bins in the x and y directions (i.e., the grid resolution), and B is a budget on the total grid size. Following Reshef et al. [21], the resolution is restricted to $a \times b < B$, where B is commonly set to approximately the 0.6 power of the sample size, $B \approx n^{0.6}$.

$MIC(X, Y)$ is always nonnegative, and $MIC(X, Y) = 0$ if and only if X and Y are independent.
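As a rough illustration of this grid search, the sketch below (our simplification, not the algorithm of Reshef et al. [21], which optimizes the partition of each grid size by dynamic programming) scans only equal-frequency grids, a common inexpensive approximation, and returns the maximal normalized mutual information.

```python
import numpy as np

def mic(x, y, alpha=0.6):
    """Approximate MIC of two numeric samples using equal-frequency a-by-b
    grids subject to a * b < B = n ** alpha."""
    n = len(x)
    B = max(int(n ** alpha), 5)                   # grid-size budget
    best = 0.0
    for a in range(2, B):
        for b in range(2, B):
            if a * b >= B:
                break
            # a - 1 (resp. b - 1) interior quantile cut points
            xb = np.searchsorted(np.quantile(x, np.arange(1, a) / a), x)
            yb = np.searchsorted(np.quantile(y, np.arange(1, b) / b), y)
            mi = 0.0                              # empirical MI (in bits) on the grid
            for j in range(a):
                for r in range(b):
                    p_jr = np.mean((xb == j) & (yb == r))
                    if p_jr > 0:
                        mi += p_jr * np.log2(p_jr / (np.mean(xb == j) * np.mean(yb == r)))
            best = max(best, mi / np.log2(min(a, b)))
    return best
```

A categorical response coded as integers can be passed directly for y, although a cleaner treatment would fix the Y-partition at the observed classes rather than at quantile cuts.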

2.2. An Independence Ranking and Screening Procedure

We propose a novel model-free sure independence screening method utilizing the maximal information coefficient $MIC(X, Y)$ for analyzing ultrahigh-dimensional data. In this context, Y represents the response variable with support $\Psi_y$, and $X = (X_1, \ldots, X_p)$ denotes the predictor vector, where p is significantly larger than the sample size n. Without specifying a particular regression model, we define the subset of active predictor indices as follows:

$$D = \{k : p(y \mid X) \text{ functionally depends on } X_k \text{ for some } y \in \Psi_y\},$$

and define the subset of inactive predictor indices by

$$I = \{k : p(y \mid X) \text{ does not functionally depend on } X_k \text{ for any } y \in \Psi_y\}.$$

Using the notation above, we define the active predictors as $X_D = \{X_k : k \in D\}$ and the inactive predictors as $X_I = \{X_k : k \in I\}$. Our primary objective is to accurately identify the subset of active predictor indices D.

The marginal MI measure can be estimated by its sample analogue $\widehat{MI}(X_k, Y)$. When the covariate $X_k$ is continuous and Y is a categorical response with R classes $\{1, 2, \ldots, R\}$,

$$\hat{\omega}^{*}_k = \widehat{MI}(X_k, Y) = \sum_{i=1}^{n} \sum_{r=1}^{R} \hat{p}(X_{ik}, r) \log\left(\frac{\hat{p}(X_{ik}, r)}{\hat{p}(X_{ik})\, \hat{p}_r}\right),$$

where $\hat{p}(X_{ik}, r) = \hat{p}_r \hat{P}(X_{ik} \mid Y = r)$ and $\hat{p}_r = \frac{1}{n} \sum_{i=1}^{n} I(Y_i = r)$. Now consider a covariate vector $X = (X_1, X_2, \ldots, X_p)$ of dimension p in which a component $X_k$ takes $J_k$ categories $\{1, 2, \ldots, J_k\}$, while the response Y is again categorical with R classes $\{1, 2, \ldots, R\}$. Then

$$\hat{\omega}^{*}_k = \widehat{MI}(X_k, Y) = \sum_{j=1}^{J_k} \sum_{r=1}^{R} \hat{p}(X_k = j, Y = r) \log\left(\frac{\hat{p}(X_k = j, Y = r)}{\hat{p}(X_k = j)\, \hat{p}(Y = r)}\right),$$

where $\hat{p}(X_k = j, Y = r) = \frac{1}{n} \sum_{i=1}^{n} I(X_{ik} = j, Y_i = r)$, $\hat{p}(X_k = j) = \frac{1}{n} \sum_{i=1}^{n} I(X_{ik} = j)$, and $\hat{p}(Y = r) = \frac{1}{n} \sum_{i=1}^{n} I(Y_i = r)$, for $i = 1, 2, \ldots, n$, $r = 1, 2, \ldots, R$, and $j = 1, 2, \ldots, J_k$.

The marginal MIC measure is then estimated by $\widehat{MIC}(X_k, Y)$:

$$\hat{\omega}_k = \widehat{MIC}(X_k, Y) = \max_{ab < B} \frac{\widehat{MI}(X_k, Y)}{\log_2 \min(a, b)} = \max_{ab < B} \frac{\hat{\omega}^{*}_k}{\log_2 \min(a, b)}.$$

Our goal is to compute the maximal information coefficient between each predictor and the response variable, $\hat{\omega}_k = \widehat{MIC}(X_k, Y)$ for $k = 1, 2, \ldots, p$. Note that, at the population level, $\omega_k = 0$ if and only if $X_k \in X_I$, i.e., if and only if $X_k$ is statistically independent of Y. Therefore, the MIC index can be utilized as a measure of dependence to screen the predictors. The MIC-based approach is model-free because it relies solely on the marginal and joint densities of the random variables, and the index can capture both linear and nonlinear relationships between the response and the predictors.

In ultrahigh-dimensional data analysis, the primary objective of feature screening is to identify a reduced model with a small number of predictors that still contains the true model D with high probability. According to the sure screening property proposed by Fan and Lv [1], as the sample size n tends to infinity, the probability that the selected model contains the true model must converge to 1, which guarantees the validity of the screened covariates. In this paper, we propose using the $\hat{\omega}_k$ index to select a moderate-sized model

$$\hat{D} = \{k : \hat{\omega}_k \ge c n^{-\tau}, \text{ for } 1 \le k \le p\},$$

where c and τ are predetermined positive values. In practice, we often select the reduced model using another formula:

$$\hat{D} = \{k : \hat{\omega}_k \text{ is among the top } d \text{ largest of all}\}.$$

It is evident that the set of predictors { X k : k D ^ } represents the most likely relevant predictors associated with the response variable. Consequently, we can employ the predictors in { X k : k D ^ } to estimate the true model. To simplify the description, we refer to the aforementioned procedure as the MIC-SIS (Maximal Information Coefficient Sure Independence Screening) procedure.
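Under these definitions, the whole procedure reduces to computing $\hat{\omega}_k$ column by column and keeping the top d. The sketch below (ours; it reuses the hypothetical mic() helper from Section 2.1) defaults to $d = [n/\log n]$, the model size used in Section 4.

```python
import numpy as np

def mic_sis(X, y, d=None):
    """MIC-SIS sketch: rank the p columns of X by omega_k = MIC(X_k, y)
    and keep the indices of the top d utilities."""
    n, p = X.shape
    if d is None:
        d = int(n / np.log(n))                    # default reduced model size
    omega = np.array([mic(X[:, k], y) for k in range(p)])
    D_hat = np.argsort(omega)[::-1][:d]           # top-d marginal utilities
    return np.sort(D_hat), omega
```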

3. Feature Screening Property

In the subsequent sections, we will establish the theoretical properties of the proposed independence screening procedure. Previous studies by Fan and Lv [1] , Ji and Jin [22] , Zhou and Wang [20] and Ni and Fang [16] have demonstrated that the sure screening property ensures the effectiveness of the independence screening procedure. Hence, it is crucial to establish the sure screening property for MIC-SIS. The following conditions are assumed to guarantee the sure screening property of the MIC-SIS procedure. Although they may not be the weakest conditions, they are primarily imposed to facilitate the technical proofs.

(C1) Let $X = (x_1, x_2, \ldots, x_p)$, where $x_i$ is drawn from an unknown distribution $F_i$. Each $F_i$ has an unknown Lebesgue probability density function (pdf) $f_i$, $i = 1, 2, \ldots, p$, and these distributions satisfy the conditions of Lemma 3 in the Appendix.

(C2) There exists a positive constant $0 < \kappa < 2$ such that

$$\sup_{1 \le k \le p} \sum_{i=1}^{n} \sum_{r=1}^{R} \log \frac{p(X_{ik}, Y_r)}{p(X_{ik})\, p(Y_r)} = O(n^{\kappa}), \quad a.e.$$

(C3) There exist positive constants c and $\tau$ such that the minimum MIC of the active predictors satisfies $\min_{k \in D} \omega_k \ge 2 c n^{-\tau}$.

(C4) Both X and Y exhibit sub-exponential tail probabilities uniformly in p; that is, there exists a positive constant $\mu_0$ such that, for all $0 < \mu \le \mu_0$,

$$\sup_{p} \max_{1 \le k \le p} E\{\exp(\mu X_k^2)\} < \infty, \qquad E\{\exp(\mu Y^2)\} < \infty.$$

(C5) There exist two positive constants $c_1$ and $c_2$ such that $c_1/R \le p(Y = r) \le c_2/R$ with $c_1 + c_2 \le R$, $c_1/R \le p(X_k = j, Y = r) \le c_2/R$, and $c_1/J_k \le p(X_k = j) \le c_2/J_k$ for every $1 \le j \le J_k$, $1 \le r \le R$, and $1 \le k \le p$.

(C6) There exists a positive constant $c_3$ such that $0 < f_k(x \mid Y = r) < c_3$ for any $1 \le r \le R$ and any x in the domain of $X_k$, where $f_k(x \mid Y = r)$ is the Lebesgue density function of $X_k$ conditional on $Y = r$.

(C7) There exist a positive constant $c_4$ and $0 \le \rho < 1/2$ such that $f_k(x) \ge c_4 n^{-\rho}$ for any $1 \le k \le p$ and any x in the domain of $X_k$, where $f_k(x)$ is the Lebesgue density function of $X_k$. Furthermore, $f_k(x)$ is continuous on the domain of $X_k$.

(C8) $\liminf_{n \to \infty} \{\min_{k \in D} \omega_k - \max_{k \in I} \omega_k\} > \delta$, where $\delta > 0$ is a constant.

According to Ji and Jin [22] and the conditions in the Appendix, conditions (C1) and (C2) ensure that the estimated probabilities converge strongly and uniformly to the true probabilities. Following Fan and Lv [1] and Cui et al. [13], condition (C3) allows the minimum true signal to vanish to zero at the order $n^{-\tau}$ as the sample size goes to infinity. Condition (C4) follows the sure screening framework of Zhou and Wang [20]. Condition (C5) guarantees that the proportion of each class of variables can be neither extremely small nor extremely large; a similar assumption appears in condition (C5) of Huang et al. [15] and in Cui et al. [13]. To ensure that the sample percentiles are close to the true percentiles, condition (C6) excludes the extreme case in which some $X_k$ puts heavy mass in a small range. Condition (C7) imposes $c_4 n^{-\rho}$ as a lower bound on the density. Following Cui et al. [13], it is easy to show that $\omega_k > 0$ for $k \in D$ and $\omega_k = 0$ for $k \in I$ hold naturally; thus condition (C8) is reasonable, and the MIC index separates active and inactive predictors well at the population level.

Theorem 1 (Sure Screening Property).

Under conditions (C1)-(C4), there exists a positive constant $C_1$ such that

$$P\left(\max_{1 \le k \le p} |\hat{\omega}_k - \omega_k| \ge c n^{-\tau}\right) \le O\left(p \exp\{-C_1 n^{1-2\tau}\}\right).$$

Further, we have that

$$P(D \subseteq \hat{D}) \ge 1 - O\left(s_n \exp(-C_1 n^{1-2\tau})\right).$$

In the expression above, $s_n$ denotes the cardinality of D. Theorem 1 implies that we can handle the ultrahigh-dimensional scenario in which $\log p$ is of order $O(n^{1-2\tau})$ with $0 < \tau < 1/2$.

Theorem 2 (Ranking consistency property).

Under conditions (C5)-(C8), if $\log(RJ)/\log n = O(1)$ and $\max\{\log p, \log n\}\, R^4 J^4 / n^{1-2\rho} = o(1)$, then

$$\liminf_{n \to \infty} \left\{\min_{k \in D} \hat{\omega}_k - \max_{k \in I} \hat{\omega}_k\right\} > 0, \quad a.s.$$

Theorem 2 demonstrates that the proposed screening index effectively distinguishes between active and inactive covariates at the sample level.

4. Numerical Studies

4.1. Simulation Results

In this subsection, we conduct three simulation studies to demonstrate the finite-sample performance of the screening method described in Section 2. We compare the performance of MIC-SIS with that of IG-SIS [16] and APC-SIS [17] using the following evaluation criteria:

1) MMS (Minimal Model Size): the smallest model size that includes all active covariates. We report the 5%, 25%, 50%, 75%, and 95% quantiles of MMS across replications.

2) CP1, CP2, and CP3: the probabilities that a model of size $[n/\log n]$, $2[n/\log n]$, and $3[n/\log n]$, respectively, covers all active covariates (for example, with n = 200 the base size is $[200/\log 200] = 37$).

3) CPa: This criterion evaluates whether the indicators of the selected model cover all active covariates.

By comparing these evaluation criteria, we can assess and compare the performance of MIC-SIS, IG-SIS, and APC-SIS in terms of their ability to identify and include active covariates in the model.
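Under our reading of how each replication is summarized, a small helper (a sketch; names are ours) computes MMS and the CP1-CP3 indicators from the vector of screening utilities:

```python
import numpy as np

def mms_and_coverage(omega, active, n):
    """MMS and CP1-CP3 indicators for one replication.
    omega: length-p screening utilities; active: iterable of true active indices."""
    order = np.argsort(omega)[::-1]               # covariates ranked by utility
    rank = {k: i + 1 for i, k in enumerate(order)}
    mms = max(rank[k] for k in active)            # smallest model covering all of D
    d = int(n / np.log(n))
    cp1, cp2, cp3 = (mms <= m * d for m in (1, 2, 3))
    return mms, (cp1, cp2, cp3)
```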

Model 1: categorical covariates and binary response

We begin by examining response variables with different numbers of categories. Following Ni and Fang [16], we first consider a binary response model with R = 2, in which all covariates are categorical. We consider two distributions for the response variable $y_i$:

1) Balanced, P ( y i = r ) = 1 / 2 ;

2) Unbalanced, $p_r = 2\left[1 + \frac{R - r}{R - 1}\right] / (3R)$, so that $\max_{1 \le r \le R} p_r = 2 \min_{1 \le r \le R} p_r$.

The true model is defined as $D = \{1, 2, \ldots, 20\}$ with $d_0 = |D| = 20$. Conditional on $y_i$, a latent vector $z_i = (z_{i,1}, z_{i,2}, \ldots, z_{i,p})$ is generated, where $z_{i,k} \sim N(\mu_{rk}, 1)$ for $1 \le k \le p$. We then construct the active covariates as follows:

1) If $k > d_0$, then $\mu_{rk} = 0$;

2) If $k \le d_0$ and $r = 1$, then $\mu_{rk} = 0.5$;

3) If $k \le d_0$ and $r = 2$, then $\mu_{rk} = -0.5$.

Next, we generate the covariates by thresholding at quantiles of the standard normal distribution. The specific approach is as follows:

1) When k is odd, $x_{i,k} = I(z_{i,k} > z_{1/2}) + 1$;

2) When k is even, $x_{i,k} = \sum_{j=1}^{4} I(z_{i,k} > z_{j/5}) + 1$,

where $z_\alpha$ denotes the $\alpha$-quantile of the standard normal distribution.

Therefore, out of all p covariates, half have two categories and the other half have five categories. Following Ni and Fang [16], we consider $p = 1000$ and $p = 5000$ with sample sizes $n = 200$ and $n = 400$ in this model; a sketch of this data-generating process is given below.
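The following data-generating sketch reflects our reading of the construction above (the seed and helper names are ours, and scipy's norm.ppf supplies the standard normal quantiles $z_\alpha$):

```python
import numpy as np
from scipy.stats import norm

def generate_model1(n=200, p=1000, d0=20, R=2, balanced=True, seed=0):
    """Model 1 sketch: categorical covariates from thresholded latent normals."""
    rng = np.random.default_rng(seed)
    if balanced:
        pr = np.full(R, 1.0 / R)
    else:
        r = np.arange(1, R + 1)
        pr = 2 * (1 + (R - r) / (R - 1)) / (3 * R)   # unbalanced class probabilities
    y = rng.choice(np.arange(1, R + 1), size=n, p=pr)
    mu = np.zeros((R, p))
    mu[0, :d0], mu[1, :d0] = 0.5, -0.5               # mean shifts of the active covariates
    z = rng.normal(mu[y - 1], 1.0)                   # latent normals, shape (n, p)
    X = np.empty((n, p), dtype=int)
    for k in range(p):                               # 0-based k, so k % 2 == 0 is odd 1-based
        cuts = norm.ppf([0.5]) if k % 2 == 0 else norm.ppf(np.arange(1, 5) / 5)
        X[:, k] = np.searchsorted(cuts, z[:, k]) + 1
    return X, y
```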

Table 1 reports the evaluation criteria for Model 1 based on 100 replications. The results demonstrate the effectiveness of the proposed MIC-SIS method: as the sample size n increases, the MMS of MIC-SIS approaches the true model size $d_0 = 20$ and the coverage probabilities increase toward 1. Comparing response structures, MMS is better under the unbalanced response than under the balanced one. MIC-SIS and IG-SIS show similar performance, with MIC-SIS slightly outperforming APC-SIS at the higher coverage probabilities.

Model 2: categorical covariates and multi-class response

We further investigate a setting with more covariate categories, where the response variable $y_i$ has multiple classes with R = 10. We consider two distributions for $y_i$:

1) Balanced, P ( y i = r ) = 1 R ;

2) Unbalanced, $p_r = 2\left[1 + \frac{R - r}{R - 1}\right] / (3R)$, so that $\max_{1 \le r \le R} p_r = 2 \min_{1 \le r \le R} p_r$.

Out of the $p = 2000$ covariates, the minimum set of active covariates is represented by $X_D = \{X_{200}, X_{400}, X_{600}, X_{800}, X_{1000}, \ldots, X_{2000}\}$, with a total of $d_0 = 10$ active covariates. Conditional on $y_i$, the latent vector $z_i = (z_{i,1}, z_{i,2}, \ldots, z_{i,p})$ is generated, where $z_{i,k}$ follows a normal distribution $N(\mu_{i,k}, 1)$ for covariate $X_k$. Each covariate is defined as $x_{i,k} = f_k(\varepsilon_{i,k} + \mu_{i,k})$, where $\varepsilon_{i,k}$ follows a standard normal distribution $N(0, 1)$ and $f_k(\cdot)$ represents a discretization based on quantiles of the standard normal distribution.

Table 1. Simulation results for model 1.

Based on this, we construct the active covariates by defining $\mu_{i,k}$:

1) If $X_k \in X_D$ and $y_i = r$, then $\mu_{i,k} = 1.5 \times (0.9)^r$;

2) If $X_k \notin X_D$, then $\mu_{i,k} = 0$.

Next, we generate the covariates by applying the quantile function $f_k(\cdot)$ to the defined parameters, with $z_{i,k} = \varepsilon_{i,k} + \mu_{i,k}$. We consider $p = 2000$ and sample sizes $n = 300$, 400 and 500 for this model. The specific approach is as follows (see the helper sketch after this list):

1) For $1 \le k \le 400$, $f_k(\varepsilon_{i,k} + \mu_{i,k}) = I(z_{i,k} > z_{1/2}) + 1$;

2) For $400 < k \le 800$, $f_k(\varepsilon_{i,k} + \mu_{i,k}) = \sum_{j=1}^{3} I(z_{i,k} > z_{j/4}) + 1$;

3) For $800 < k \le 1200$, $f_k(\varepsilon_{i,k} + \mu_{i,k}) = \sum_{j=1}^{5} I(z_{i,k} > z_{j/6}) + 1$;

4) For $1200 < k \le 1600$, $f_k(\varepsilon_{i,k} + \mu_{i,k}) = \sum_{j=1}^{7} I(z_{i,k} > z_{j/8}) + 1$;

5) For $1600 < k \le 2000$, $f_k(\varepsilon_{i,k} + \mu_{i,k}) = \sum_{j=1}^{9} I(z_{i,k} > z_{j/10}) + 1$,

where $z_\alpha$ again denotes the $\alpha$-quantile of the standard normal distribution. Among the $p = 2000$ covariates, the category counts 2, 4, 6, 8, and 10 thus each constitute one-fifth of the total.
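The thresholding rule shared by Models 2 and 3 can be expressed as one small helper (a sketch based on the reconstruction above; the function name is ours):

```python
import numpy as np
from scipy.stats import norm

def discretize(z, J):
    """Map latent normal draws z to J categories via
    x = sum_{j=1}^{J-1} I(z > z_{j/J}) + 1."""
    cuts = norm.ppf(np.arange(1, J) / J)          # standard normal quantile cuts
    return np.searchsorted(cuts, z) + 1
```

For instance, discretize(z[:, k], 6) yields the six-category covariates used for $800 < k \le 1200$.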

Table 2 shows the results of the evaluation criteria for 100 simulations of Model 2. The following conclusions can be drawn:

All methods perform worse in the more complex Model 2 than in Model 1, with MIC-SIS and IG-SIS performing similarly. As the sample size n increases, the MMS of MIC-SIS approaches the true model size $d_0 = 10$ and the coverage probability reaches 1. When the sample size is 300, the coverage probability of APC-SIS is lower than that of MIC-SIS. Comparing response structures, the unbalanced response yields better MMS performance than the balanced one, and MIC-SIS and IG-SIS are more stable, with less fluctuation in MMS. In conclusion, these results highlight the effectiveness and robustness of MIC-SIS and IG-SIS in dealing with Model 2.

Model 3: continuous and categorical covariates

Finally, we consider a more complex example where the response variable y i is multi-class with R = 4 . We examine two distributions for y i :

1) Balanced, P ( y i = r ) = 1 R ;

2) Unbalanced, $p_r = 2\left[1 + \frac{R - r}{R - 1}\right] / (3R)$, so that $\max_{1 \le r \le R} p_r = 2 \min_{1 \le r \le R} p_r$.

In this model, we consider $p = 5000$ and sample sizes $n = 400, 600, 800$. The true model is defined as $X_D = \{X_{[kp/20]} : k = 1, 2, \ldots, 20\}$ with $d_0 = 20$. Conditional on $y_i$, the latent vector $z_i = (z_{i,1}, z_{i,2}, \ldots, z_{i,p})$ is generated, where $z_{i,k} \sim N(\mu_{i,k}, 1)$ for $1 \le k \le p$.

Table 2. Simulation results for model 2.

The means are set as $\mu_{i,k} = (-1)^r \theta_{rk}$ if $y_i = r$ and $k \in D$, and $\mu_{i,k} = 0$ when $k \notin D$. The values of $\theta_{rk}$, given in Table 3 following Ni and Fang [16], determine the active covariates.

An active covariate is then obtained by discretizing $z_{i,k}$:

1) For $k \le [5p/20]$, $x_{i,k} = j$ if $z_{i,k} \in (z_{(j-1)/4}, z_{j/4}]$, $j = 1, 2, 3, 4$;

2) For $[5p/20] < k \le [10p/20]$, $x_{i,k} = j$ if $z_{i,k} \in (z_{(j-1)/10}, z_{j/10}]$, $j = 1, 2, \ldots, 10$;

3) For $[10p/20] < k \le p$, $x_{i,k} = z_{i,k}$.

Table 3. Parameter specification of Model 3.

Among all p covariates, the four-category and ten-category covariates each account for one-quarter of the total, while the remaining half are continuous. Of the active covariates, 5 are four-category and 5 are ten-category, with the remaining 10 continuous, accounting for half of the total. For the continuous covariates, we applied different slicings with $J_k = 4$, 8 and 10. Accordingly, we denote the corresponding approaches MIC-SIS-4, IG-SIS-4, APC-SIS-4, MIC-SIS-8, IG-SIS-8, APC-SIS-8, MIC-SIS-10, IG-SIS-10, and APC-SIS-10.

Table 4 and Table 5 present the simulation results based on 100 replications for the balanced and unbalanced cases, respectively. The following observations can be made. As the sample size n increases, the MMS of MIC-SIS approaches the true model size $d_0 = 20$ and the coverage probabilities approach 1. The coverage probability of MIC-SIS is similar to that of IG-SIS on all five indices, demonstrating the sure screening behavior of MIC-SIS. For MMS, the unbalanced response outperforms the balanced response. In addition, both MIC-SIS and IG-SIS exhibit robust performance, as evidenced by the small range of variation in MMS for both response types, and under the different slicings of the continuous covariates they outperform the other methods in both coverage probability and MMS.

4.2. Real Data

In this subsection, we analyze a real dataset obtained from the feature selection database of Arizona State University (http://featureselection.asu.edu/). The dataset, the GLIOMA biological data, consists of 50 samples and 4434 features. The response variable is unbalanced, with class sizes of 14, 7, 14, and 15, and the covariates include both continuous and multi-category variables. We randomly divided the data into two parts, with 90% (45 samples) used as training data and 10% (5 samples) used as test data; the dimensionality of both parts is p = 4434.

To assess the performance of MIC-SIS, PG-SIS, IG-SIS, and APC-SIS, we employ three classification approaches: Support Vector Machine (SVM) [23], Random Forest (RF), and Decision Tree (DT). We use ten-fold cross-validation to address the variability in training data that could otherwise affect the accuracy of the models, and we apply these classifiers to the active covariates selected by the aforementioned screening methods.

Table 4. Simulation results for model 3 (Balanced Y).

Table 5. Simulation results for model 3 (Unbalanced Y).

The evaluation metrics commonly used in such analyses include accuracy, recall, F-measure, and G-mean; in this paper, we use G-mean and F-measure [24] to assess performance on both the training and test data. The performance of MIC-SIS for the unbalanced data is presented in Table 6.
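The two summary metrics have several multi-class variants; the sketch below assumes the common macro-averaged definitions (G-mean as the geometric mean of per-class recalls, F-measure as the average of per-class F1 scores), which may differ in detail from the definitions in [24].

```python
import numpy as np

def gmean_fmeasure(y_true, y_pred):
    """Macro multi-class G-mean and F-measure (one common convention)."""
    classes = np.unique(y_true)
    recalls, f1s = [], []
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        recall = tp / max(np.sum(y_true == c), 1)
        precision = tp / max(np.sum(y_pred == c), 1)
        recalls.append(recall)
        f1s.append(0.0 if tp == 0 else 2 * precision * recall / (precision + recall))
    return float(np.prod(recalls) ** (1 / len(classes))), float(np.mean(f1s))
```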

Table 6. Analysis results for real data example.

According to the results in Table 6, MIC-SIS consistently outperforms the other methods under all classification approaches, exhibiting G-mean and F-measure values closer to 1. In summary, the proposed MIC-SIS method demonstrates superior performance.

5. Conclusions

In practical scenarios, it is common to encounter datasets with a combination of continuous and categorical covariates, along with categorical responses, yet the available screening methods for such cases are limited. To address this problem, we introduce a new method, MIC-SIS (Maximal Information Coefficient Sure Independence Screening), which does not require continuous variables to be sliced, can be applied directly to a wide range of variable types, and thus overcomes shortcomings of existing methods. We show that MIC-SIS possesses the sure screening and ranking consistency properties and requires no model assumptions. Through numerical simulations, we find that MIC-SIS screens covariates effectively, with better screening performance and lower computational complexity than existing methods, and it also performs well in the empirical analysis of the GLIOMA data.

One of the current challenges in covariate screening is missing data. It is common to have missing or incorrect data during the data collection process, which can affect the variable screening results. In future work, we aim to develop a new approach to feature screening that can either deal with missing variables prior to screening, or use classification models to screen features based on response variables.

Acknowledgements

The work was supported by National Natural Science Foundation of China [grant number 71963008].

Availability of Data

The GLIOMA biological data that support the findings of this study are available from the feature selection database of Arizona State University (http://featureselection.asu.edu/).

Appendix

Proof of Theoretical Result.

To establish the validity of Theorem 1, we rely on three accompanying lemmas. The first two lemmas furnish us with exponential inequalities, and their detailed proofs can be found in [25] .

Lemma 1. Let $\mu = E(Y)$. If $P(a \le Y \le b) = 1$, then

$$E[\exp\{s(Y - \mu)\}] \le \exp\left\{\frac{s^2 (b - a)^2}{8}\right\}.$$

Lemma 2. Let $h(Y_1, \ldots, Y_m)$ be the kernel function of the U-statistic $U_n$, and $\theta = E\{h(Y_1, \ldots, Y_m)\}$. If $a \le h(Y_1, \ldots, Y_m) \le b$ holds, then for any $t > 0$ and $n \ge m$ we have the following inequality:

$$P(U_n - \theta \ge t) \le \exp\left(-\frac{2 [n/m]\, t^2}{(b - a)^2}\right),$$

where [ n / m ] denotes the integer part of n / m .

Lemma 2 represents the one-sided tail inequality of U n . As a result of its symmetry, we can readily derive the two-sided tail inequality of U n

$$P(|U_n - \theta| \ge t) \le 2 \exp\left(-\frac{2 [n/m]\, t^2}{(b - a)^2}\right).$$

Lemma 3 (the asymptotic property of nonparametric density estimators). Suppose that $f''(x)$ exists and $h = c n^{-1/5}$; then

$$n^{2/5} \{\hat{p}(x) - p(x)\} \xrightarrow{L} N\left(\frac{c^2}{2} f''(x)\, \mu_2(K),\; \frac{1}{c} f(x)\, \|K\|_2^2\right),$$

where $\mu_2(K) = \int s^2 K(s)\, ds$ and $\|K\|_2^2 = \int K^2(s)\, ds$.

Lemma 3 directly implies that $\hat{p}(x) \xrightarrow{p} p(x)$. Under certain additional stringent conditions, we can achieve strong uniform convergence of $\hat{p}(x)$:

$$\lim_{n \to \infty} \sup_{x} |\hat{p}(x) - p(x)| = 0, \quad a.e.$$

For more detailed information regarding the strong uniform convergence, one can refer to references such as [26] or [27] . These sources provide further insights and explanations on the topic.

Proof of Theorem 1. We begin by demonstrating that the following inequality holds for each k:

$$P\{|\hat{\omega}_k - \omega_k| \ge c n^{-\tau}\} \le O(\exp(-C_1 n^{1-2\tau})).$$

Since $\hat{\omega}_k$ is obtained by normalizing $\hat{\omega}^{*}_k$ without altering its properties, we only need to establish that the following holds for each k, where $\omega^{*}_k = MI(X_k, Y)$ denotes the population counterpart of $\hat{\omega}^{*}_k$:

$$P\left\{\max_{ab < B} \frac{|\hat{\omega}^{*}_k - \omega^{*}_k|}{\log_2 \min(a, b)} \ge c n^{-\tau}\right\} \le O(\exp(-C_1 n^{1-2\tau})).$$

Because $\log_2 \min(a, b) \ge 1$ for $a, b \ge 2$,

$$\max_{ab < B} \frac{|\hat{\omega}^{*}_k - \omega^{*}_k|}{\log_2 \min(a, b)} \le |\hat{\omega}^{*}_k - \omega^{*}_k|,$$

and

$$|\hat{\omega}^{*}_k - \omega^{*}_k| = \left| \sum_{i=1}^{n} \sum_{r=1}^{R} \hat{p}(X_{ik}, Y_r) \log \frac{\hat{p}(X_{ik}, Y_r)}{\hat{p}(X_{ik})\, \hat{p}(Y_r)} - \iint p(x_k, y) \log \frac{p(x_k, y)}{p(x_k)\, p(y)}\, dx_k\, dy \right| = |M_{k,1} + M_{k,2}|,$$

where

$$M_{k,1} = \sum_{i=1}^{n} \sum_{r=1}^{R} \hat{p}(X_{ik}, Y_r) \log \frac{\hat{p}(X_{ik}, Y_r)}{\hat{p}(X_{ik})\, \hat{p}(Y_r)} - \frac{1}{n^2} \sum_{i=1}^{n} \sum_{r=1}^{R} \log \frac{\hat{p}(X_{ik}, Y_r)}{\hat{p}(X_{ik})\, \hat{p}(Y_r)}$$

and

$$M_{k,2} = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{r=1}^{R} \log \frac{\hat{p}(X_{ik}, Y_r)}{\hat{p}(X_{ik})\, \hat{p}(Y_r)} - \iint p(x_k, y) \log \frac{p(x_k, y)}{p(x_k)\, p(y)}\, dx_k\, dy.$$

Using Lemma 3 and the strong law of large numbers, we observe the convergence

$$M_{k,1} \to 0, \quad a.e.$$

Next, our goal is to establish an upper bound for the second term.

Define $h(X_{ik}, Y_r; X_k, Y) = \log \frac{\hat{p}(X_{ik}, Y_r)}{\hat{p}(X_{ik})\, \hat{p}(Y_r)}$ as the kernel of the U-statistic $I^{*}_{k,2}$, and write $M_{k,2} = I_{k,2} - \omega^{*}_k$, where

$$I_{k,2} = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{r=1}^{R} \log \frac{\hat{p}(X_{ik}, Y_r)}{\hat{p}(X_{ik})\, \hat{p}(Y_r)} = I^{*}_{k,2} + \frac{1}{n^2} \sum_{i = r} \log \frac{\hat{p}(X_{ik}, Y_r)}{\hat{p}(X_{ik})\, \hat{p}(Y_r)},$$

so that $I^{*}_{k,2}$ collects the off-diagonal ($i \ne r$) terms.

By applying Markov's inequality, for any $t > 0$ we have

$$P(I^{*}_{k,2} - Eh > \varepsilon) \le \exp(-t\varepsilon) \exp(-t\, Eh)\, E[\exp(t I^{*}_{k,2})],$$

where $Eh = \iint p(x_k, y) \log \frac{p(x_k, y)}{p(x_k)\, p(y)}\, dx_k\, dy$.

Following the approach utilized by Li et al. (2012) [28] to handle the U-statistics, and considering condition (C2), we can immediately deduce that

$$P(I^{*}_{k,2} - Eh > \varepsilon) \le \exp\left(-t\varepsilon + \frac{t^2}{8n}\right).$$

By selecting $t = 4n\varepsilon$, we obtain $P(I^{*}_{k,2} - Eh > \varepsilon) \le \exp(-2n\varepsilon^2)$. Consequently, taking into account the symmetry of U-statistics, we can derive the two-sided tail inequality

$$P(|I^{*}_{k,2} - Eh| > \varepsilon) \le 2 \exp(-2n\varepsilon^2).$$

By utilizing the relationship between $I_{k,2}$ and $I^{*}_{k,2}$, we can demonstrate that

$$P(|M_{k,2}| > 2\varepsilon) = P(|I_{k,2} - \omega^{*}_k| > 2\varepsilon) = P\left(\left|I^{*}_{k,2} + \frac{1}{n^2} \sum_{i = r} \log \frac{\hat{p}(X_{ik}, Y_r)}{\hat{p}(X_{ik})\, \hat{p}(Y_r)} - \omega^{*}_k\right| > 2\varepsilon\right).$$

Under condition (C2), for any $\varepsilon > 0$ we can choose a sufficiently large $N_1$ such that, for $n > N_1$, $\left|\frac{1}{n^2} \sum_{i = r} \log \frac{\hat{p}(X_{ik}, Y_r)}{\hat{p}(X_{ik})\, \hat{p}(Y_r)}\right| < \frac{\varepsilon}{3}$. Furthermore, we can easily establish that

$$P(|I_{k,2} - \omega^{*}_k| > 2\varepsilon) \le P\left(|I^{*}_{k,2} - \omega^{*}_k| > \frac{5}{3}\varepsilon\right).$$

Note that

$$|I^{*}_{k,2} - \omega^{*}_k| = |I^{*}_{k,2} - Eh + Eh - \omega^{*}_k|.$$

Similarly, employing the same technique and selecting a larger $N_2$, we can ensure that $|\omega^{*}_k - Eh| < \frac{\varepsilon}{3}$ when $n > N_2$. This directly implies that

$$P\left(|I^{*}_{k,2} - \omega^{*}_k| > \frac{5}{3}\varepsilon\right) \le P\left(|I^{*}_{k,2} - Eh| > \frac{4}{3}\varepsilon\right).$$

Let $\varepsilon = c n^{-\tau}$ with $0 < \tau < 1/2$. By employing $P(|I^{*}_{k,2} - Eh| > \varepsilon) \le 2 \exp(-2n\varepsilon^2)$, together with Lemma 1 and Bonferroni's inequality, we can deduce that

$$P\{|\hat{\omega}_k - \omega_k| \ge c n^{-\tau}\} \le 2 \exp(-2 c^2 n^{1-2\tau}).$$

We thus have

$$P\left\{\max_{1 \le k \le p} |\hat{\omega}_k - \omega_k| \ge c n^{-\tau}\right\} \le 2p \exp(-2 c^2 n^{1-2\tau}) = O\left(p \exp(-C_1 n^{1-2\tau})\right).$$

Next, we prove the second part of Theorem 1.

If $D \not\subseteq \hat{D}$, there must exist some $k \in D$ such that $\hat{\omega}_k < c n^{-\tau}$. Since condition (C3) gives $\omega_k \ge 2 c n^{-\tau}$ for all $k \in D$, it follows that $|\hat{\omega}_k - \omega_k| > c n^{-\tau}$ for some $k \in D$. Thus the event $\{D \not\subseteq \hat{D}\}$ is contained in the event $\{|\hat{\omega}_k - \omega_k| > c n^{-\tau} \text{ for some } k \in D\}$; taking complements, $\{\max_{k \in D} |\hat{\omega}_k - \omega_k| \le c n^{-\tau}\}$ is contained in $\{D \subseteq \hat{D}\}$. Therefore, we have

$$P(D \subseteq \hat{D}) \ge P\left\{\max_{k \in D} |\hat{\omega}_k - \omega_k| \le c n^{-\tau}\right\} = 1 - P\left\{\max_{k \in D} |\hat{\omega}_k - \omega_k| > c n^{-\tau}\right\} \ge 1 - s_n P\{|\hat{\omega}_k - \omega_k| > c n^{-\tau}\} \ge 1 - O\left(s_n \exp(-C_1 n^{1-2\tau})\right),$$

where $s_n$ is the cardinality of D.

Now we prove Theorem 2. The proofs for Lemma 4 and Lemma 5 can be found in Ni and Fang (2016) [16] .

Lemma 4. For a categorical covariate $X_k$ and response Y, under condition (C5), for any $0 < \varepsilon < 1$ we have $P(|\hat{\omega}_k - \omega_k| > 2\varepsilon) \le O(RJ) \exp\left\{-\frac{c_5 n \varepsilon^2}{R^4 J^4}\right\}$, where $c_5$ represents a positive constant.

Lemma 5. For a continuous covariate $X_k$ and response Y, under conditions (C5), (C6) and (C7), for any $0 < \varepsilon < 1$ we have $P(|\hat{\omega}_k - \omega_k| > 2\varepsilon) \le O(RJ) \exp\left\{-\frac{c_6 n^{1-2\rho} \varepsilon^2}{R^4 J^4}\right\}$, where $c_6$ represents a positive constant.

Under conditions (C5)-(C8) and by Lemmas 4 and 5, if $\log(RJ)/\log n = O(1)$ and $\max\{\log p, \log n\}\, R^4 J^4 / n^{1-2\rho} = o(1)$, we get

$$\begin{aligned} P\left(\min_{k \in D} \hat{\omega}_k - \max_{k \in I} \hat{\omega}_k < \frac{\delta}{2}\right) &\le P\left(\left(\min_{k \in D} \hat{\omega}_k - \max_{k \in I} \hat{\omega}_k\right) - \left(\min_{k \in D} \omega_k - \max_{k \in I} \omega_k\right) < -\frac{\delta}{2}\right) \\ &\le P\left(\left|\left(\min_{k \in D} \hat{\omega}_k - \max_{k \in I} \hat{\omega}_k\right) - \left(\min_{k \in D} \omega_k - \max_{k \in I} \omega_k\right)\right| > \frac{\delta}{2}\right) \\ &\le P\left(\max_{1 \le k \le p} |\hat{\omega}_k - \omega_k| > \frac{\delta}{4}\right) \\ &\le O(RJ)\, p \exp\left\{-\frac{c_7 n^{1-2\rho}}{R^4 J^4}\right\} = O\left(\exp\left\{\log(RJ) + \log p - \frac{c_7 n^{1-2\rho}}{R^4 J^4}\right\}\right), \end{aligned}$$

where $c_7 = \min\{c_5, c_6\} (\delta/4)^2$. Since $\log(RJ)/\log n = O(1)$, there exists a positive constant $c_8$ such that $\log(RJ) \le c_8 \log n$. Moreover, $\max\{\log p, \log n\}\, R^4 J^4 / n^{1-2\rho} = o(1)$ implies that $\log p \le \frac{1}{2} c_7 n^{1-2\rho} / (R^4 J^4)$ and $\frac{1}{2} c_7 n^{1-2\rho} / (R^4 J^4) \ge (c_8 + 2) \log n$ for large n. Then there exists an $n_0$ such that

$$\sum_{n=n_0}^{\infty} \exp\left\{\log(RJ) + \log p - \frac{c_7 n^{1-2\rho}}{R^4 J^4}\right\} \le \sum_{n=n_0}^{\infty} \exp\left\{c_8 \log n - \frac{1}{2}\frac{c_7 n^{1-2\rho}}{R^4 J^4}\right\} \le \sum_{n=n_0}^{\infty} \exp\{c_8 \log n - (c_8 + 2) \log n\} = \sum_{n=n_0}^{\infty} n^{-2} < \infty.$$

Following Ni and Fang (2016) [16] and by the Borel-Cantelli Lemma, we obtain $\liminf_{n \to \infty} \{\min_{k \in D} \hat{\omega}_k - \max_{k \in I} \hat{\omega}_k\} \ge \frac{\delta}{2} > 0$, a.s.

This is the end of the proof.

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] Fan, J.Q. and Lv, J.C. (2008) Sure Independence Screening for Ultrahigh Dimensional Feature Space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70, 849-911.
https://doi.org/10.1111/j.1467-9868.2008.00674.x
[2] Fan, J.Q., Samworth, R. and Wu, Y.C. (2009) Ultrahigh Dimensional Feature Selection: Beyond the Linear Model. Journal of Machine Learning Research, 10, 2013-2038.
[3] Wang, H.S. (2009) Forward Regression for Ultra-High Dimensional Variable Screening. Journal of the American Statistical Association, 104, 1512-1524.
https://doi.org/10.1198/jasa.2008.tm08516
[4] Fan, J.Q. and Song, R. (2010) Sure Independence Screening in Generalized Linear Models with NP-Dimensionality. The Annals of Statistics, 38, 3567-3604.
https://doi.org/10.1214/10-AOS798
[5] Fan, J.Q., Feng, Y. and Song, R. (2011) Nonparametric Independence Screening in Sparse Ultra-High-Dimensional Additive Models. Journal of the American Statistical Association, 106, 544-557.
https://doi.org/10.1198/jasa.2011.tm09779
[6] Li, G.R., Peng, H., Zhang, J. and Zhu, L.X. (2012) Robust Rank Correlation Based Screening. The Annals of Statistics, 40, 1846-1877.
https://doi.org/10.1214/12-AOS1024
[7] He, X.M., Wang, L. and Hong, H.G. (2013) Quantile-Adaptive Model-Free Variable Screening for High-Dimensional Heterogeneous Data. The Annals of Statistics, 41, 342-369.
https://doi.org/10.1214/13-AOS1087
[8] Fan, J.Q., Ma, Y.B. and Dai, W. (2014) Nonparametric Independence Screening in Sparse Ultra-High-Dimensional Varying Coefficient Models. Journal of the American Statistical Association, 109, 1270-1284.
https://doi.org/10.1080/01621459.2013.879828
[9] Nandy, D., Chiaromonte, F. and Li, R.Z. (2022) Covariate Information Number for Feature Screening in Ultrahigh-Dimensional Supervised Problems. Journal of the American Statistical Association, 117, 1516-1529.
https://doi.org/10.1080/01621459.2020.1864380
[10] Tong, Z.X., Cai, Z.R., Yang, S.S. and Li, R.Z. (2022) Model-Free Conditional Feature Screening with FDR Control. Journal of the American Statistical Association.
https://doi.org/10.1080/01621459.2022.2063130
[11] Fan, J.Q. and Fan, Y.Y. (2008) High-Dimensional Classification Using Features Annealed Independence Rules. The Annals of Statistics, 36, 2605-2637.
https://doi.org/10.1214/07-AOS504
[12] Mai, Q. and Zou, H. (2013) The Kolmogorov Filter for Variable Screening in High-Dimensional Binary Classification. Biometrika, 100, 229-234.
https://doi.org/10.1093/biomet/ass062
[13] Cui, H.J., Li, R.Z. and Zhong, W. (2015) Model-Free Feature Screening for Ultrahigh Dimensional Discriminant Analysis. Journal of the American Statistical Association, 110, 630-641.
https://doi.org/10.1080/01621459.2014.920256
[14] Lai, P., Song, F.L. and Chen, K.W. (2017) Model Free Feature Screening with Dependent Variable in Ultrahigh Dimensional Binary Classification. Statistics & Probability Letters, 125, 141-148.
https://doi.org/10.1016/j.spl.2017.02.011
[15] Huang, D.Y., Li, R.Z. and Wang, H.S. (2014) Feature Screening for Ultrahigh Dimensional Categorical Data with Applications. Journal of Business & Economic Statistics, 32, 237-244.
https://doi.org/10.1080/07350015.2013.863158
[16] Ni, L. and Fang, F. (2016) Entropy-Based Model-Free Feature Screening for Ultrahigh-Dimensional Multiclass Classification. Journal of Nonparametric Statistics, 28, 515-530.
https://doi.org/10.1080/10485252.2016.1167206
[17] Ni, L., Fang, F. and Wan, F.J. (2017) Adjusted Pearson Chi-Square Feature Screening for Multi-Classification with Ultrahigh Dimensional Data. Metrika, 80, 805-828.
https://doi.org/10.1007/s00184-017-0629-9
[18] Sheng, Y. and Wang, Q.H. (2020) Model-Free Feature Screening for Ultrahigh Dimensional Classification. Journal of Multivariate Analysis, 178, Article ID: 104618.
https://doi.org/10.1016/j.jmva.2020.104618
[19] Anzarmou, Y., Mkhadri, A. and Oualkacha, K. (2022) The Kendall Interaction Filter for Variable Interaction Screening in High Dimensional Classification Problems. Journal of Applied Statistics, 50, 1496-1514.
[20] Zhou, S.B., Wang, T. and Huang, Y.J. (2022) Feature Screening via Mutual Information Learning Based on Nonparametric Density Estimation. Journal of Mathematics, 2022, Article ID: 7584374.
https://doi.org/10.1155/2022/7584374
[21] Reshef, D.N., Reshef, Y.A. and Finucane, H.K. (2011) Detecting Novel Associations in Large Data Sets. Science, 334, 1518-1524.
https://doi.org/10.1126/science.1205438
[22] Ji, P.S. and Jin, J.S. (2012) UPS Delivers Optimal Phase Diagram in High-Dimensional Variable Selection. The Annals of Statistics, 40, 73-103.
https://doi.org/10.1214/11-AOS947
[23] Suykens, J.A.K. and Vandewalle, J. (1999) Least Squares Support Vector Machine Classifiers. Neural Processing Letters, 9, 293-300.
https://doi.org/10.1023/A:1018628609742
[24] He, H.J. and Deng, G.M. (2022) Grouped Feature Screening for Ultra-High Dimensional Data for the Classification Model. Journal of Statistical Computation and Simulation, 92, 974-997.
https://doi.org/10.1080/00949655.2021.1981901
[25] Serfling, R.J. (1980) Approximation Theorems of Mathematical Statistics. John Wiley & Sons Inc., New York.
https://doi.org/10.1002/9780470316481
[26] Dias, R., Garcia, N.L. and Zambom, A.Z. (2012) Monte Carlo Algorithm for Trajectory Optimization Based on Markovian Readings. Computational Optimization and Applications, 51, 305-321.
https://doi.org/10.1007/s10589-010-9337-3
[27] Davis, R.A., Lii, K.S. and Politis, D.N. (2011) Selected Works of Murray Rosenblatt. Springer, New York.
https://doi.org/10.1007/978-1-4419-8339-8
[28] Li, R.Z., Zhong, W. and Zhu, L.P. (2012) Feature Screening via Distance Correlation Learning. Journal of the American Statistical Association, 107, 1129-1139.
https://doi.org/10.1080/01621459.2012.695654
