Model-Free Ultra-High-Dimensional Feature Screening for Multi-Classified Response Data Based on Weighted Jensen-Shannon Divergence
1. Introduction
In fields like tumor classification, genomics, and machine learning, the problem of processing ultra-high-dimensional data is frequently encountered. According to the definition of ultra-high-dimensional data in [1], let the sample size and the dimensionality of the covariates be n and p, respectively; there exists a constant $\xi > 0$ such that $\log p = O(n^{\xi})$, so that p grows at an exponential rate with the sample size n. Furthermore, such data tend to be sparse: the number of variables is very large, while the number of variables with a significant impact is very small. Therefore, in ultra-high-dimensional data analysis, developing fast and effective variable screening methods that rapidly reduce ultra-high-dimensional data to a reasonable dimension is a very important research problem.
To address this problem, Fan and Lv [2] first proposed the SIS method for variable screening of ultra-high-dimensional data. Subsequently, many statisticians studied the problem and established a series of feature screening methods. Fan and Song [3] proposed a variable screening method (MMLE) that performs screening by ranking the maximum marginal likelihood estimates in a generalized linear model. Fan et al. [4] proposed a nonparametric independence screening (NIS) method for variable screening in additive models, using B-spline basis functions to fit the marginal nonparametric components. Besides the additive model, the varying coefficient model is another widely used nonparametric model; Liu et al. [5] proposed a new variable screening method for varying coefficient models based on conditional correlation coefficients. For semiparametric models, Li et al. [6] developed a robust rank correlation screening (RRCS) method based on the Kendall $\tau$ correlation coefficient. Most of these methods implicitly assume that the response variable is continuous, but ultra-high-dimensional data with categorical response variables appear increasingly often in scientific research, and traditional classification methods such as logistic regression, decision trees, and support vector machines encounter problems on such data, including long running times, high computational cost, and reduced prediction accuracy. Accordingly, many researchers have proposed screening methods for ultra-high-dimensional data with categorical responses. Fan and Fan [7] suggested a feature screening technique for binary responses based on marginal t-tests under normality. Because the robustness of this method is low, Mai and Zou [8] proposed a screening method for ultra-high-dimensional binary classification based on the Kolmogorov-Smirnov statistic. In practice, multi-class responses are also very common. For multi-classified responses, Cui et al. [9] established a robust screening method by constructing the distance between the unconditional distribution function and the conditional distribution function, and Huang et al. [10] proposed an ultra-high-dimensional multi-class variable screening method based on Pearson's chi-square statistic (PC-SIS).
The above methods of variable screening are based on the correlation between explanatory and response variables. With the development of information theory and its integration with statistics, information entropy and the entropy family have been recognized and applied by researchers. Ni and Fang [11] proposed a method for ultra-high-dimensional variable screening based on information gain (IG-SIS) from the perspective of information quantity. Jensen-Shannon divergence is an information-theoretic concept that plays an important role in measuring similarity and comparing differences between probability distributions, and it is characterized by non-negativity and symmetry. For ease of reading, Jensen-Shannon divergence is abbreviated to JS divergence in this paper. When the response variable is binary, there are two conditional probability distributions of $X_k$ given Y. The difference between these two conditional distributions can be measured by JS divergence, and the magnitude of the JS divergence reflects the strength of the correlation between $X_k$ and Y.
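As an illustration of the quantity described above, here is a minimal Python sketch (the function name is ours, and the distributions are assumed to be given as probability vectors) of the base-2 JS divergence:

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions.
    With base-2 logarithms the value always lies in [0, 1]."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # the average (mixture) distribution
    kl = lambda a, b: np.sum(a * np.log2((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

print(js_divergence([0.5, 0.5], [0.5, 0.5]))  # identical distributions -> 0.0
print(js_divergence([1.0, 0.0], [0.0, 1.0]))  # disjoint supports -> close to 1
```

Identical inputs give 0 and completely disjoint supports give (up to the numerical guard `eps`) the maximal value 1, which is why the magnitude can be read as a strength-of-dependence measure.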
Therefore, building on the above research on ultra-high-dimensional feature screening for categorical response variables, this paper proposes, from a new perspective, a model-free ultra-high-dimensional feature screening method for multi-classified response data based on weighted JS divergence, denoted WJS-SIS. The idea of the method is to first compute, for each class $Y=r$, the JS divergence between the conditional probability distribution of $X_k$ given $Y=r$ and the unconditional probability distribution of $X_k$, and then use $P(Y=r)$ as the weight to obtain the weighted JS divergence. In addition, when the number of categories differs across covariates, we propose adjusting the weighted JS divergence by a logarithmic factor of the number of categories of each covariate to measure the relationship between the covariates and the response variable; this method is denoted AWJS-SIS. Theoretically, both WJS-SIS and AWJS-SIS possess the sure screening property and ranking consistency, and the results of Monte Carlo simulations and real-data experiments show that they are effective for screening ultra-high-dimensional multi-classified response data. At the same time, they are model-free screening methods that do not depend on any model assumptions.
The rest of the paper is organized as follows: Section 2 describes the proposed WJS-SIS and AWJS-SIS methods in detail. Section 3 establishes the sure screening and ranking consistency properties of the methods. Sections 4 and 5 give a simulation study and an experiment with real data, respectively. Section 6 draws conclusions. All theorem proofs are given in the Appendix.
2. Method
2.1. Basic Assumption
Suppose $X=(X_1,\dots,X_p)^{\mathrm T}$ is a $p$-dimensional covariate vector whose observations are independent and identically distributed, and let $Y$ be a categorical response variable with $R$ classes. Define $D$ as the set of important covariates, $D^{c}$ as the set of unimportant covariates, and $d_0=|D|$ as the number of variables in the set of important covariates, which is expressed as
$$D=\left\{k:\ F(y\mid X)\ \text{functionally depends on}\ X_k\ \text{for some}\ y\right\},\qquad D^{c}=\{1,\dots,p\}\setminus D.$$
2.2. Information Entropy
Information entropy is a measure of information proposed by [12]. When the covariate $X_k\in\{1,\dots,J_k\}$ and the response variable $Y\in\{1,\dots,R\}$, the information entropies of $X_k$ and $Y$ are
$$H(X_k)=-\sum_{l=1}^{J_k} p_{l}\log_2 p_{l},\qquad H(Y)=-\sum_{r=1}^{R} p_{r}\log_2 p_{r},$$
where the logarithmic base is 2, $0\log_2 0$ is taken to be 0, and the marginal probabilities are $p_{l}=P(X_k=l)$ and $p_{r}=P(Y=r)$. The conditional information entropy of $X_k$ given $Y$ is defined as
$$H(X_k\mid Y)=-\sum_{r=1}^{R} P(Y=r)\sum_{l=1}^{J_k} P(X_k=l\mid Y=r)\log_2 P(X_k=l\mid Y=r).$$
But when the covariate $X_k$ is a continuous variable, we use standard normal distribution quantiles to cut $X_k$ into categorical data:
$$\tilde X_k=\sum_{j=1}^{J}\, j\cdot I\!\left(q_{(j-1)/J}< X_k\le q_{j/J}\right),$$
where $q_{j/J}=\Phi^{-1}(j/J)$ is the $j/J$ quantile of the standard normal distribution, $q_{0}=-\infty$, and $q_{1}=+\infty$.
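This discretization step can be sketched in Python as follows (`slice_covariate` is our own illustrative name; `statistics.NormalDist` supplies the standard normal quantile function):

```python
import numpy as np
from statistics import NormalDist

def slice_covariate(x, J):
    """Cut a continuous covariate into J categories at the standard
    normal j/J quantiles, j = 1, ..., J - 1 (the end bins are open)."""
    cuts = [NormalDist().inv_cdf(j / J) for j in range(1, J)]
    # np.digitize returns the index of the interval each value falls in
    return np.digitize(x, cuts) + 1  # categories 1..J

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
cats = slice_covariate(x, 4)
print(sorted(set(cats.tolist())))  # -> [1, 2, 3, 4]
```

For standard normal data this produces roughly equal-sized categories, which is the point of cutting at normal quantiles rather than at fixed values.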
2.3. IG-SIS
Ni and Fang [11] proposed the IG-SIS feature screening method, whose principle is to use the difference between the information entropy of $Y$ and the conditional information entropy of $Y$ given $X_k$ to measure the importance of $X_k$.
The strength of the correlation between $Y$ and $X_k$ can be represented by the information gain, whose expression is
$$\mathrm{IG}_k = H(Y) - H(Y\mid X_k), \tag{1}$$
with its sample version obtained by plugging empirical frequencies into the entropies. The larger this difference, the more important $X_k$ is.
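The information-gain utility can be estimated by plugging empirical frequencies into the entropies. A small Python sketch (the helper names are ours):

```python
import numpy as np

def entropy(labels):
    """Empirical Shannon entropy (base 2) of a discrete sample."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, x):
    """Empirical H(Y) - H(Y | X), the IG-SIS-style marginal utility."""
    h_cond = sum((x == level).mean() * entropy(y[x == level])
                 for level in np.unique(x))
    return entropy(y) - h_cond

y = np.array([0, 0, 1, 1])
print(information_gain(y, np.array([0, 0, 1, 1])))  # x determines y -> 1.0
print(information_gain(y, np.array([0, 1, 0, 1])))  # x independent of y -> 0.0
```

A covariate that fully determines Y attains the maximal gain H(Y), while one independent of Y attains 0, which is what makes the gain usable as a ranking index.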
2.4. WJS-SIS
Finding an index to measure the relationship between response variables and covariates is the core of ultra-high-dimensional feature screening and the key to big data processing. From the perspective of the distribution of the data, for univariate feature screening, the relationship between variables can be measured by comparing data distributions. Moreover, current screening methods for categorical variables, in addition to using traditional statistical indexes to measure the relationship between variables, also borrow methods from other disciplines; for example, some studies quantify a measure of the amount of information as an index for feature screening. The Jensen-Shannon divergence used in this paper is an information-theoretic concept that is important in measuring similarity and comparing differences between probability distributions and has the properties of non-negativity and symmetry: for two distributions A and B, $\mathrm{JS}(A,B)=\mathrm{JS}(B,A)\ge 0$. Thus, for ultra-high-dimensional feature screening of multi-classified response data, we can utilize the Jensen-Shannon divergence to measure the relationship between response variables and covariates.
When the response variable is binary, there are two conditional probability distributions of $X_k$ given Y. The difference between these two conditional distributions can be measured using JS divergence, and the magnitude of the JS divergence reflects the strength of the correlation between $X_k$ and Y. In practice, however, the response is often not just binary but multi-class. Therefore, in this paper, a model-free ultra-high-dimensional feature screening method for multi-class response data based on weighted JS divergence is investigated, from the perspective of JS divergence, for the case where the response variable is multi-class.
First, for each class $Y=r$, separately calculate the JS divergence between the conditional probability distribution of $X_k$ given $Y=r$ and the unconditional probability distribution of $X_k$; then use $P(Y=r)$ as the weight to obtain the weighted JS divergence.
Assume that $U=\{P(X_k=l\mid Y=r)\}_{l=1}^{J_k}$ and $V=\{P(X_k=l)\}_{l=1}^{J_k}$, and let $M=(U+V)/2$ be the average probability distribution of U and V. If $X_k$ is a continuous variable, U and V are defined through the sliced variable $\tilde X_k$:
$$U=\{P(\tilde X_k=j\mid Y=r)\}_{j=1}^{J},\qquad V=\{P(\tilde X_k=j)\}_{j=1}^{J}.$$
Then, the weighted JS divergence of U and V is defined as
$$\omega_k=\sum_{r=1}^{R}P(Y=r)\,\mathrm{JS}(U,V),\qquad \mathrm{JS}(U,V)=\frac{1}{2}\mathrm{KL}(U\,\|\,M)+\frac{1}{2}\mathrm{KL}(V\,\|\,M),$$
and its sample version $\hat\omega_k$ is obtained by replacing all probabilities with their empirical frequencies. (2)
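The screening utility described above can be sketched with empirical frequencies plugged in (a hedged Python illustration; the function names are ours, not the paper's):

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def weighted_js(y, x):
    """For each class r, compare the empirical distribution of X given
    Y = r with the marginal distribution of X via JS divergence, and
    weight the divergences by the empirical P(Y = r)."""
    levels = np.unique(x)
    marginal = np.array([(x == l).mean() for l in levels])
    score = 0.0
    for r in np.unique(y):
        weight = (y == r).mean()
        cond = np.array([((x == l) & (y == r)).sum() for l in levels], dtype=float)
        cond /= cond.sum()
        score += weight * js_divergence(cond, marginal)
    return score

y = np.array([0, 0, 0, 1, 1, 1])
x_signal = np.array([1, 1, 1, 2, 2, 2])  # distribution of x shifts with y
x_noise = np.array([1, 2, 1, 2, 1, 2])   # same distribution in every class
print(weighted_js(y, x_signal) > weighted_js(y, x_noise))  # -> True
```

A covariate whose conditional distributions differ across classes receives a larger score, so ranking covariates by this utility and keeping the top ones is the screening step.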
2.5. AWJS-SIS
When the number of categories of a covariate is large, the directly computed weighted JS divergence may be large, so unimportant variables with many categories may be incorrectly selected. To address this problem, following Ni and Fang [11], this paper uses $\log J_k$ to construct an adjusted weighted JS divergence for variable selection, where $J_k$ denotes the number of categories $L$ of a categorical $X_k$, or the number of slices when a continuous $X_k$ is discretized by standard normal quantiles.
The adjusted weighted JS divergence of U and V is defined as
$$\omega_k^{A}=\frac{\omega_k}{\log J_k}, \tag{3}$$
and its sample version is
$$\hat\omega_k^{A}=\frac{\hat\omega_k}{\log J_k}. \tag{4}$$
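The effect of the adjustment is simply to penalize covariates with many categories. A minimal sketch (assuming, as in the IG-SIS-style adjustment, that the raw score is divided by the logarithm of the category count; the numbers are made up for illustration):

```python
import numpy as np

def adjusted_score(raw_score, n_categories):
    """Divide a raw weighted-JS score by log2 of the category count so
    covariates with different numbers of categories are comparable."""
    return raw_score / np.log2(n_categories)

# A 10-category covariate no longer outranks a 2-category one purely
# because it has more cells (the raw scores here are hypothetical):
print(adjusted_score(0.30, 10) < adjusted_score(0.12, 2))  # -> True
```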
3. Theoretical Properties
In [2], it is shown that a good feature screening method should satisfy the sure screening and ranking consistency properties. Sure screening is the basis of feature screening: it means that all important variables are retained with probability tending to one when the sample size is sufficiently large, which ensures that the truly important variables are theoretically screened in their entirety. Ranking consistency means that the indexes of all important variables are ranked ahead of those of all unimportant variables, which ensures that when the top
variables are selected, the important variables can be screened out reasonably and robustly. This subsection establishes the theoretical properties of the proposed methods under the following conditions:
Condition 1 (C1).
, which indicates that the dimension p grows at an exponential rate with the sample size n.
Condition 2 (C2). There exist constants
,
, such that
,
,
.
Condition 3 (C3). There exist a constant
and
, such that
.
Condition 4 (C4). There exists a constant
for
such that
, and x is in the domain of definition of $X_k$, where under the condition
,
is the Lebesgue density function of $X_k$.
Condition 5 (C5). There exist a constant
and
such that
, and x is in the domain of definition of $X_k$ for
, where
is continuous in the domain of definition of $X_k$, and
is the Lebesgue density function of $X_k$.
Condition 6 (C6).
,
,
and
with
.
The literature on ultra-high-dimensional feature screening approaches, such as [2] [13] and [14], typically includes the six conditions above. Condition (C1) shows that the method applies to ultra-high-dimensional problems. Condition (C2) requires the marginal probabilities of the response variable and the covariates to be bounded above and below, preventing the worst-case scenario in which the screening method fails. This worst case stems from a limitation of the Jensen-Shannon divergence: when two distributions do not overlap at all, their Jensen-Shannon divergence is constant no matter how close their centers are, so it then fails to measure the extent of the difference between the two distributions and thus the importance of the covariates. Condition (C3) requires the index values of the truly important variables to be bounded away from zero. Condition (C4) rules out the extreme scenario in which some $X_k$ places a large mass in a small range, ensuring that the sample percentiles are close to the true percentiles. Condition (C5) requires the lower bound of the density to be of order
. The presence of Condition (C6) guarantees a certain divergence rate for the number of covariate categories. When the response is multi-classified and all covariates are discrete, we establish the theoretical properties of the WJS-SIS feature screening technique under these six conditions.
Because
,
, and
, it follows that
. Hence, this study provides a rigorous theoretical proof that the weighted JS divergence index possesses the sure screening and ranking consistency properties for feature screening.
The Properties of Sure Screening and Ranking Consistency
Categorical covariates are subscripted with the letter j, while continuous covariates are subscripted with the letter k. If the covariate is categorical, L represents the number of categories, while
represents the number of categories if the covariate is continuous.
When the covariates are categorical variables, we have Theorems 3.1 and 3.2.
Theorem 3.1. Under conditions (C1)-(C2),
, there exist
and
with
when
,
. And under the (C1) - (C3) conditions, when
, such that
Theorem 3.2. Under conditions (C1)-(C3), assume that
, then we have
When the covariates are continuous, we have Theorems 3.3 and 3.4.
Theorem 3.3. Under the conditions (C1), (C2), (C4), (C5), and (C6), there exist constants
,
and there are
(5)
When
, there is
Theorem 3.4. Assume that
, under the conditions (C1), (C3), (C4), (C5), and (C6), we have
When continuous and categorical covariates coexist, we have Theorems 3.5 and 3.6.
Theorem 3.5. Under the conditions (C1), (C2), (C4), (C5), and (C6), there exist constants
,
and
and there are
(6)
where
. When
, there is
where
.
Theorem 3.6. Assume that
and
, under the conditions (C1), (C2), (C4), (C5), and (C6), we have
Detailed proofs of the theoretical results are given in the Appendix.
4. Numerical Simulation
4.1. Evaluation Indexes
The first evaluation indexes are CP1 and CP2, which represent the proportion of true important covariates selected into the significant set when the top
and top
variables are chosen as the set of significant covariates, respectively. The second evaluation indexes are CPa1 and CPa2, which indicate whether the selected set contains all the true important covariates when its size is
and
, respectively. The third evaluation index is the MMS, the minimum model size required to include all important variables. Each simulation was repeated 100 times. All calculations in this paper were performed in R.
4.2. Simulation Experiments and Results
4.2.1. Simulation 1
The response variable and all covariates are four-categorical variables. Where, for the response variable Y, both balanced and unbalanced distributions are considered: balanced,
, with
, and
; unbalanced,
with
. Define
as the true important variable set, where
. X is generated by the conditional probability of X given by Y.
for
and
, where
is given in Table 1. And,
when
,
. The dimensionality of the covariates was set to
, and the sample sizes were set to
,
, and
.
The simulation results are shown in Table 2, where the performance indexes of all methods are the same under all settings, with coverages CP and CPa both being 1 and the MMS values close to
.
4.2.2. Simulation 2
The response variable and all covariates are categorical variables, where the response variable is set up as in Simulation 1, and the covariates have 2, 4, 6, 8, and 10 categories, respectively. Similarly, define
as the set of important variables. The covariate data were generated through the quantile of the standard normal distribution
. Define
by
, where
,
. And when
,
, otherwise
. The following are the precise steps for creating covariates:
Table 1. Parameter specification for the simulations.
The numbers in parentheses are the corresponding standard deviations.
The values of L are 2, 4, 6, 8, and 10, which correspond to
,
,
,
,
. We set
and
.
The simulation results are displayed in Table 3; the performance indexes of all methods are the same in every setting, with coverages CP and CPa both being 1 and MMS values nearly equal to
.
The numbers in parentheses are the corresponding standard deviations.
4.2.3. Simulation 3
The covariates are continuous variables, and the response variables are set up as in Simulation 1. We use the quantile function of the standard normal distribution to slice the covariates into categorical data, where
, and define the methods as WJS-SIS-4, AWJS-SIS-4, IG-SIS-4; WJS-SIS-8, AWJS-SIS-8, IG-SIS-8; and WJS-SIS-10, AWJS-SIS-10, IG-SIS-10, respectively. The essential variables are set up in the same way as in Simulation 1. Generate X using the standard normal distribution
with
and assume
, and
. Where
,
, otherwise,
. We set
and
.
Because each chosen number of slices is applied to all covariates, the performance indexes of WJS-SIS and AWJS-SIS are identical.
Table 4 displays the simulation results for a balanced distribution of Y.
Table 4. Results for simulation 3: balanced Y.
The numbers in parentheses are the corresponding standard deviations.
The coverage index values CP and CPa for the three methods are 1 in all cases. Regarding the MMS values, the 95% quantile values of MMS differ slightly only at
, where IG-SIS is smaller than those of WJS-SIS and AWJS-SIS at
and 10, and WJS-SIS and AWJS-SIS are smaller than those of IG-SIS at
; and the 95% quantile values of MMS of all three methods increase as $J_k$ increases. Table 5 shows the simulation results when Y has an unbalanced distribution: with respect to the CP and CPa values, IG-SIS is slightly higher than WJS-SIS and AWJS-SIS at
, and 1 for all methods in all other cases. With regard to the MMS values, the 95% percentile of MMS for IG-SIS is smaller than that of WJS-SIS and AWJS-SIS at
, while at
, the MMS values are the same for all methods. All methods perform better when the number of slices is small.
4.2.4. Simulation 4
The covariates are categorical and continuous, with the continuous covariates handled in the same way as in Simulation 3. The response variables are set up as in Simulation 1. The important variable set is
. Generate the latent variables
in the same way as the covariates in Simulation 3, and then generate the categorical and continuous covariates: 1) For
, then
, if
,
; 2) For
, then
, if
,
; 3) For
, then
. We set
and
.
Table 6 displays the simulation results for a balanced distribution of Y. The CP and CPa values for AWJS-SIS and IG-SIS are the same and larger than those for WJS-SIS. Regarding the MMS values, the 75% and 95% quantile values of AWJS-SIS and IG-SIS are smaller than those of WJS-SIS at N = 400 and 600; whereas AWJS-SIS is smaller than IG-SIS at
and
, and larger than IG-SIS at
and
and 10, and at
, AWJS-SIS and IG-SIS have the same MMS value; at
, all methods have the same MMS value. Table 7 shows the simulation results when Y is an unbalanced distribution: At
, all methods have the same performance index value. At
, AWJS-SIS and IG-SIS are both larger than WJS-SIS, with AWJS-SIS being somewhat smaller than IG-SIS. At
, all methods are equal in terms of the CP and CPa values. All methods perform better when the number of slices is large.
4.3. Computational Time Cost
We obtained the median running time of each algorithm through a simulation experiment, where the covariates X and the set of significant variables were set
Table 5. Results for simulation 3: unbalanced Y.
The numbers in parentheses are the corresponding standard deviations.
Table 6. Results for simulation 4: balanced Y.
The numbers in parentheses are the corresponding standard deviations.
Table 7. Results for simulation 4: unbalanced Y.
The numbers in parentheses are the corresponding standard deviations.
Table 8. The results of computational time cost.
The numbers in parentheses are the corresponding standard deviations.
up as in simulation experiment 2, and Y was set up as a balanced distribution. The experiment was set up with
,
, and the experiment was repeated 100 times. An Intel Core i7-8700 machine running Windows 10 at 3.20 GHz was used for all calculations. Table 8 shows the median runtime of the three methods, which increases as p increases; WJS-SIS and AWJS-SIS are consistently faster to compute than IG-SIS.
4.4. Comprehensive Analysis of Simulation Results
The main conclusion is that, in terms of performance, the WJS-SIS and AWJS-SIS methods proposed in this study are highly comparable to IG-SIS. The methods differ when the sample size is small, and the performance of WJS-SIS is more affected by the number of slices than that of AWJS-SIS and IG-SIS, which are more robust and adapt better to the number of slices. However, all methods perform well as the number of screened variables or the sample size increases: all are able to screen out all the important variables, and the selected model size is close to the number of important variables. In terms of computing time, IG-SIS takes longer than WJS-SIS and AWJS-SIS.
5. Experimental Study with Real Data
In real life, ultra-high-dimensional data with multi-class response variables are common, and feature screening of such data achieves data dimensionality reduction, feature mining, and variable selection. The methods proposed in this paper can be applied in different fields to improve the efficiency and accuracy of data analysis, reveal the information behind the data, and help in the construction of decision-making and prediction models. For example, in the medical field, they can be applied to analyze gene expression data and help identify disease-associated genes. In the financial field, they can help identify key factors that affect stock or commodity prices. In image processing, they can be used for tasks such as feature extraction and target recognition. In practical implementation, following Fan and Lv [2], the number of important variables is generally selected as the first
$[n/\log n]$. In this paper, we analyze the case when
.
We analyzed the TOX-171 microarray biological dataset from the Arizona State University feature selection database (http://featureselection.asu.edu/), with 171 samples and 5748 features; the response variable has four classes with a roughly unbalanced distribution, and the covariates are continuous. We randomly divide the dataset in a 7:3 ratio, with 70% of the data used for training and the remaining 30% for testing. As randomly dividing the dataset may degrade model prediction accuracy, we used ten-fold cross-validation to train the model and repeated the experiment 100 times, reporting the average and standard deviation of the evaluation indexes; the smaller the standard deviation, the more stable the result, and hence the more trustworthy the average. On the training and test sets, the variables screened by the three methods were used for classification with a support vector machine, and the classification accuracy (CA), specificity (SPE), sensitivity (SEN), and their geometric mean (G-mean) were calculated.
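The G-mean reported below is commonly computed as the geometric mean of the per-class sensitivities (recalls). A self-contained Python sketch of that common definition (the SVM classifier itself is omitted, and the helper name is ours):

```python
import numpy as np

def g_mean(y_true, y_pred):
    """Geometric mean of per-class recalls: a balanced summary of
    multi-class performance not dominated by the majority class."""
    classes = np.unique(y_true)
    recalls = np.array([np.mean(y_pred[y_true == c] == c) for c in classes])
    return float(np.prod(recalls) ** (1.0 / len(recalls)))

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 0, 2, 2])   # one class-1 sample misclassified
print(round(g_mean(y_true, y_pred), 3))  # -> 0.794, i.e. (1 * 0.5 * 1)^(1/3)
```

Because it multiplies the per-class recalls, a single poorly recognized class drags the G-mean down even when overall accuracy is high, which is why it is useful for unbalanced class distributions.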
Table 9 and Table 10 show the corresponding index values when the number of selected variables is
and
, respectively. Combining Table 9 and Table 10, it can be seen that on both the training and test sets, all methods perform better when $J_k$ is relatively large; the CA and G-mean values of AWJS-SIS are always higher than those of WJS-SIS, and higher than those of IG-SIS at $J_k$ = 8. In addition, classification performance improves as the number of screened variables increases.
6. Conclusion
From the perspective of introducing Jensen-Shannon (JS) divergence to measure the importance of covariates, for the case where Y is multi-classified, this paper constructs model-free ultra-high-dimensional feature
Table 9. The result when screening the first
variables.
The numbers in parentheses are the corresponding standard deviations.
Table 10. The result when screening the first
variables.
The numbers in parentheses are the corresponding standard deviations.
screening methods for multi-classified response data based on weighted JS divergence under different scenarios: the WJS-SIS method when the number of categories is the same across covariates, and the AWJS-SIS method with adjusted weighted JS divergence when the number of categories differs across covariates. Theoretically, both WJS-SIS and AWJS-SIS possess the sure screening property and ranking consistency. The Monte Carlo simulation results and real-data experiments show that WJS-SIS and AWJS-SIS are effective for feature screening, with performance very similar to that of IG-SIS; WJS-SIS is slightly weaker in robustness, whereas AWJS-SIS and IG-SIS are slightly more robust, and both WJS-SIS and AWJS-SIS are faster than IG-SIS in computation time. Finally, the approaches proposed in this paper use the Jensen-Shannon divergence to measure the importance of covariates from the perspective of information quantity, which differs from traditional statistical indicators and may provide a reference for methodological research on multi-class variable screening for ultra-high-dimensional data. The methods proposed in this work only take into account the correlation between the response variable and the covariates; they do not account for high correlation among the covariates. Therefore, future studies on ultra-high-dimensional variable screening will incorporate covariate correlation.
Acknowledgements
The study was supported by the National Natural Science Foundation of China [grant number 71963008].
Appendix
To prove Theorem 3.1, we first introduce the following four lemmas.
Lemma 1. Suppose there are mutually independent random variables
with sample size N and
, where
are constants. If we assume that
, then there exists a constant t for which the following inequality holds:
The proof of Lemma 1 is given in [15].
Lemma 2. Suppose there are two bounded random variables a and b, and there exist two positive constants
such that
. The estimates corresponding to
can be computed as
, given a sample size of n. Suppose that for
, there have constants
and
such that:
then, there exist
Where,
,
,
.
Besides, assuming that b is bounded and non-zero, and that there exists
such that
, then there exist
where,
,
and
.
The proof of Lemma 2 is given in [10].
Lemma 3. If the covariates are categorical, we can get that
. And
only when
, Y and
are independent.
The proof of Lemma 3 is given in [16].
Lemma 4. If the covariates are continuous, there is
, when Y and
are independent,
.
The proof of Lemma 4 is omitted here because it is similar to that of Proposition 2.2 in [11].
Proof of Theorem 3.1:
Let
,
,
then
The definitions of
and
state that there are
(7)
and
To prove that
, first:
By estimating the probability with the sample frequency, we have
Thus, there is
furthermore, it follows from Lemma 1 and Lemma 2 that since
,
are estimates of
,
:
So,
:
Additionally, it may be proved that
. Assume
,
:
Then,
.
We can obtain that
in a proof similar to this one.
So, we can get
. Similarly, it can be shown that
,
, and
are all
.
For the
part:
We now prove the
part:
According to Lemma 1 and Lemma 2, it can also be shown that
converges to
in probability; then
For
, thus, there is
(8)
Under conditions (C1)-(C3), there exist
and
with
(9)
then,
(10)
when
and
, we have
Then,
(11)
so,
, with
.
Therefore, under conditions (C1)-(C3), the sure screening property of Theorem 3.1 holds.
Proof of Theorem 3.2:
Because of
, there exists
such that
, and then we have
From Fatou’s Lemma we can get
Thus,
(12)
Therefore, Theorem 3.2 holds.
Proof of Theorem 3.3:
Assume that
is
’s empirical cumulative distribution function and that
is the cumulative distribution function of
. And let
be the cumulative distribution function of
, and
. Then, following Lemma A.2 in [11], we can similarly show that, for
, given the conditions (C4) and (C5), there are
where
and
,
are positive constants.
So,
and
, respectively.
Then, for
, we have
(13)
Equation (13) will not be proven here because it is similar to the proof of Equation (8).
Under condition (C6), there exist constants
and
such that
(14)
then,
(15)
with
, there are
(16)
So,
,
, Theorem 3.3 holds.
Therefore, by a proof similar to that of Equation (12), we have:
(17)
So, Theorem 3.4 holds.
Based on Equations (10), (11), (15), and (16), the proofs of Theorem 3.5 and Theorem 3.6 are similar to those of Theorem 3.1 and Theorem 3.2, so they are not given in detail.