Model-Free Ultra-High-Dimensional Feature Screening for Multi-Classified Response Data Based on Weighted Jensen-Shannon Divergence
1. Introduction
In fields like tumor classification, genomics, and machine learning, the problem of processing ultra-high-dimensional data is frequently encountered. According to the definition of ultra-high-dimensional data in [1], let the sample size and the dimensionality of the covariates be n and p, respectively; there exists a constant $\xi > 0$ such that $\log p = O(n^{\xi})$, so that p grows at an exponential rate with the sample size n. Furthermore, such data tend to be sparse: the number of variables is very large, while the number of variables with a significant impact is very small. Therefore, in ultra-high-dimensional data analysis, developing fast and effective variable screening methods that rapidly reduce ultra-high-dimensional data to a reasonable dimension is a very important research problem.
To address this problem, Fan and Lv [2] first proposed the SIS method for variable screening of ultra-high-dimensional data. Subsequently, many statisticians studied the problem and established a series of feature screening methods. Fan and Song [3] proposed a variable screening method (MMLE) that performs screening by ranking the maximum marginal likelihood estimates in a generalized linear model. Fan et al. [4] proposed a nonparametric independence screening (NIS) method for variable screening in additive models, using B-spline basis functions to fit the marginal nonparametric components. Besides the additive model, the varying coefficient model is another widely used nonparametric model; Liu et al. [5] proposed a new variable screening method for varying coefficient models based on conditional correlation coefficients. For semiparametric models, Li et al. [6] developed a robust rank correlation screening (RRCS) method based on the Kendall $\tau$ correlation coefficient. Most of these methods implicitly assume that the response variable is continuous, but ultra-high-dimensional data with categorical response variables appear increasingly often in scientific research, and traditional classification methods such as logistic regression, decision trees, and support vector machines encounter problems on such data, including long running times, high computational cost, and reduced prediction accuracy. Accordingly, many researchers have proposed screening methods for ultra-high-dimensional data with categorical responses. Fan and Fan [7] suggested a feature screening technique for binary responses based on marginal t-tests under normality. Because the robustness of this method is low, Mai and Zou [8] proposed a screening method for ultra-high-dimensional binary classification based on the Kolmogorov-Smirnov statistic. In practice, multi-class responses are also very common. For multi-classified responses, Cui et al. [9] established a robust screening method by constructing the distance between the unconditional distribution function and the conditional distribution function, and Huang et al. [10] proposed an ultra-high-dimensional multi-class variable screening method based on Pearson's chi-square statistic (PC-SIS).
The above methods of variable screening are based on the correlation between explanatory and response variables. With the development of information theory and its integration with statistics, information entropy and the entropy family have been recognized and applied by researchers. Ni and Fang [11] proposed a method for ultra-high-dimensional variable screening based on information gain (IG-SIS) from the perspective of information quantity. Jensen-Shannon divergence is an information-theoretic concept that plays an important role in measuring similarity and comparing differences between probability distributions, and it is characterized by non-negativity and symmetry. For ease of reading, Jensen-Shannon divergence is abbreviated to JS divergence in this paper. When the response variable is binary, there are two conditional probability distributions of $X_k$ given Y. The difference between these two conditional distributions can be measured by JS divergence, and the magnitude of the JS divergence reflects the strength of the correlation between $X_k$ and Y.
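As an illustration of the quantity described above, here is a minimal Python sketch (the function name is ours, and the distributions are assumed to be given as probability vectors) of the base-2 JS divergence:

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions.
    With base-2 logarithms the value always lies in [0, 1]."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # the average (mixture) distribution
    kl = lambda a, b: np.sum(a * np.log2((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

print(js_divergence([0.5, 0.5], [0.5, 0.5]))  # identical distributions -> 0.0
print(js_divergence([1.0, 0.0], [0.0, 1.0]))  # disjoint supports -> close to 1
```

Identical inputs give 0 and completely disjoint supports give (up to the numerical guard `eps`) the maximal value 1, which is why the magnitude can be read as a strength-of-dependence measure.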
Therefore, building on the above research on ultra-high-dimensional feature screening for categorical response variables, this paper proposes, from a new perspective, a model-free ultra-high-dimensional feature screening method for multi-classified response data based on weighted JS divergence, denoted WJS-SIS. The idea of the method is to first compute, for each class $Y=r$, the JS divergence between the conditional probability distribution of $X_k$ given $Y=r$ and the unconditional probability distribution of $X_k$, and then use $P(Y=r)$ as the weight to obtain the weighted JS divergence. In addition, when the number of categories differs across covariates, we propose adjusting the weighted JS divergence by a logarithmic factor of the number of categories of each covariate to measure the relationship between the covariates and the response variable; this method is denoted AWJS-SIS. Theoretically, both WJS-SIS and AWJS-SIS possess the sure screening property and ranking consistency, and the results of Monte Carlo simulations and real-data experiments show that they are effective for screening ultra-high-dimensional multi-classified response data. At the same time, they are model-free screening methods that do not depend on any model assumptions.
The rest of the paper is organized as follows: Section 2 describes the proposed WJS-SIS and AWJS-SIS methods in detail. Section 3 establishes the sure screening and ranking consistency properties of the methods. Sections 4 and 5 give a simulation study and an experiment with real data, respectively. Section 6 draws conclusions. All theorem proofs are given in the Appendix.
2. Method
2.1. Basic Assumption
Suppose $X=(X_1,\dots,X_p)^{\mathrm T}$ is a $p$-dimensional covariate vector whose observations are independent and identically distributed, and let $Y$ be a categorical response variable with $R$ classes. Define $D$ as the set of important covariates, $D^{c}$ as the set of unimportant covariates, and $d_0=|D|$ as the number of variables in the set of important covariates, which is expressed as
$$D=\left\{k:\ F(y\mid X)\ \text{functionally depends on}\ X_k\ \text{for some}\ y\right\},\qquad D^{c}=\{1,\dots,p\}\setminus D.$$
2.2. Information Entropy
Information entropy is a measure of information proposed by [12]. When the covariate $X_k\in\{1,\dots,J_k\}$ and the response variable $Y\in\{1,\dots,R\}$, the information entropies of $X_k$ and $Y$ are
$$H(X_k)=-\sum_{l=1}^{J_k} p_{l}\log_2 p_{l},\qquad H(Y)=-\sum_{r=1}^{R} p_{r}\log_2 p_{r},$$
where the logarithmic base is 2, $0\log_2 0$ is taken to be 0, and the marginal probabilities are $p_{l}=P(X_k=l)$ and $p_{r}=P(Y=r)$. The conditional information entropy of $X_k$ given $Y$ is defined as
$$H(X_k\mid Y)=-\sum_{r=1}^{R} P(Y=r)\sum_{l=1}^{J_k} P(X_k=l\mid Y=r)\log_2 P(X_k=l\mid Y=r).$$
But when the covariate $X_k$ is a continuous variable, we use standard normal distribution quantiles to cut $X_k$ into categorical data:
$$\tilde X_k=\sum_{j=1}^{J}\, j\cdot I\!\left(q_{(j-1)/J}< X_k\le q_{j/J}\right),$$
where $q_{j/J}=\Phi^{-1}(j/J)$ is the $j/J$ quantile of the standard normal distribution, $q_{0}=-\infty$, and $q_{1}=+\infty$.
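This discretization step can be sketched in Python as follows (`slice_covariate` is our own illustrative name; `statistics.NormalDist` supplies the standard normal quantile function):

```python
import numpy as np
from statistics import NormalDist

def slice_covariate(x, J):
    """Cut a continuous covariate into J categories at the standard
    normal j/J quantiles, j = 1, ..., J - 1 (the end bins are open)."""
    cuts = [NormalDist().inv_cdf(j / J) for j in range(1, J)]
    # np.digitize returns the index of the interval each value falls in
    return np.digitize(x, cuts) + 1  # categories 1..J

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
cats = slice_covariate(x, 4)
print(sorted(set(cats.tolist())))  # -> [1, 2, 3, 4]
```

For standard normal data this produces roughly equal-sized categories, which is the point of cutting at normal quantiles rather than at fixed values.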
2.3. IG-SIS
Ni and Fang [11] proposed the IG-SIS feature screening method, whose principle is to use the difference between the information entropy of $Y$ and the conditional information entropy of $Y$ given $X_k$ to measure the importance of $X_k$.
The strength of the correlation between $Y$ and $X_k$ can be represented by the information gain, whose expression is
$$\mathrm{IG}_k = H(Y) - H(Y\mid X_k), \tag{1}$$
with its sample version obtained by plugging empirical frequencies into the entropies. The larger this difference, the more important $X_k$ is.
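The information-gain utility can be estimated by plugging empirical frequencies into the entropies. A small Python sketch (the helper names are ours):

```python
import numpy as np

def entropy(labels):
    """Empirical Shannon entropy (base 2) of a discrete sample."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, x):
    """Empirical H(Y) - H(Y | X), the IG-SIS-style marginal utility."""
    h_cond = sum((x == level).mean() * entropy(y[x == level])
                 for level in np.unique(x))
    return entropy(y) - h_cond

y = np.array([0, 0, 1, 1])
print(information_gain(y, np.array([0, 0, 1, 1])))  # x determines y -> 1.0
print(information_gain(y, np.array([0, 1, 0, 1])))  # x independent of y -> 0.0
```

A covariate that fully determines Y attains the maximal gain H(Y), while one independent of Y attains 0, which is what makes the gain usable as a ranking index.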
2.4. WJS-SIS
Finding an index to measure the relationship between response variables and covariates is the core of ultra-high-dimensional feature screening and the key to big data processing. From the perspective of the distribution of the data, for univariate feature screening, the relationship between variables can be measured by comparing data distributions. Moreover, current screening methods for categorical variables, in addition to using traditional statistical indexes to measure the relationship between variables, also borrow methods from other disciplines; for example, some studies quantify a measure of the amount of information as an index for feature screening. The Jensen-Shannon divergence used in this paper is an information-theoretic concept that is important in measuring similarity and comparing differences between probability distributions and has the properties of non-negativity and symmetry: for two distributions A and B, $\mathrm{JS}(A,B)=\mathrm{JS}(B,A)\ge 0$. Thus, for ultra-high-dimensional feature screening of multi-classified response data, we can utilize the Jensen-Shannon divergence to measure the relationship between response variables and covariates.
When the response variable is binary, there are two conditional probability distributions of $X_k$ given Y. The difference between these two conditional distributions can be measured using JS divergence, and the magnitude of the JS divergence reflects the strength of the correlation between $X_k$ and Y. In practice, however, the response is often not just binary but multi-class. Therefore, in this paper, a model-free ultra-high-dimensional feature screening method for multi-class response data based on weighted JS divergence is investigated, from the perspective of JS divergence, for the case where the response variable is multi-class.
First, for each class $Y=r$, separately calculate the JS divergence between the conditional probability distribution of $X_k$ given $Y=r$ and the unconditional probability distribution of $X_k$; then use $P(Y=r)$ as the weight to obtain the weighted JS divergence.
Assume that $U=\{P(X_k=l\mid Y=r)\}_{l=1}^{J_k}$ and $V=\{P(X_k=l)\}_{l=1}^{J_k}$, and let $M=(U+V)/2$ be the average probability distribution of U and V. If $X_k$ is a continuous variable, U and V are defined through the sliced variable $\tilde X_k$:
$$U=\{P(\tilde X_k=j\mid Y=r)\}_{j=1}^{J},\qquad V=\{P(\tilde X_k=j)\}_{j=1}^{J}.$$
Then, the weighted JS divergence of U and V is defined as
$$\omega_k=\sum_{r=1}^{R}P(Y=r)\,\mathrm{JS}(U,V),\qquad \mathrm{JS}(U,V)=\frac{1}{2}\mathrm{KL}(U\,\|\,M)+\frac{1}{2}\mathrm{KL}(V\,\|\,M),$$
and its sample version $\hat\omega_k$ is obtained by replacing all probabilities with their empirical frequencies. (2)
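The screening utility described above can be sketched with empirical frequencies plugged in (a hedged Python illustration; the function names are ours, not the paper's):

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def weighted_js(y, x):
    """For each class r, compare the empirical distribution of X given
    Y = r with the marginal distribution of X via JS divergence, and
    weight the divergences by the empirical P(Y = r)."""
    levels = np.unique(x)
    marginal = np.array([(x == l).mean() for l in levels])
    score = 0.0
    for r in np.unique(y):
        weight = (y == r).mean()
        cond = np.array([((x == l) & (y == r)).sum() for l in levels], dtype=float)
        cond /= cond.sum()
        score += weight * js_divergence(cond, marginal)
    return score

y = np.array([0, 0, 0, 1, 1, 1])
x_signal = np.array([1, 1, 1, 2, 2, 2])  # distribution of x shifts with y
x_noise = np.array([1, 2, 1, 2, 1, 2])   # same distribution in every class
print(weighted_js(y, x_signal) > weighted_js(y, x_noise))  # -> True
```

A covariate whose conditional distributions differ across classes receives a larger score, so ranking covariates by this utility and keeping the top ones is the screening step.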
2.5. AWJS-SIS
When the number of categories of a covariate is large, the directly computed weighted JS divergence may be large, so unimportant variables with many categories may be incorrectly selected. To address this problem, following Ni and Fang [11], this paper uses $\log J_k$ to construct an adjusted weighted JS divergence for variable selection, where $J_k$ denotes the number of categories $L$ of a categorical $X_k$, or the number of slices when a continuous $X_k$ is discretized by standard normal quantiles.
The adjusted weighted JS divergence of U and V is defined as
$$\omega_k^{A}=\frac{\omega_k}{\log J_k}, \tag{3}$$
and its sample version is
$$\hat\omega_k^{A}=\frac{\hat\omega_k}{\log J_k}. \tag{4}$$
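The effect of the adjustment is simply to penalize covariates with many categories. A minimal sketch (assuming, as in the IG-SIS-style adjustment, that the raw score is divided by the logarithm of the category count; the numbers are made up for illustration):

```python
import numpy as np

def adjusted_score(raw_score, n_categories):
    """Divide a raw weighted-JS score by log2 of the category count so
    covariates with different numbers of categories are comparable."""
    return raw_score / np.log2(n_categories)

# A 10-category covariate no longer outranks a 2-category one purely
# because it has more cells (the raw scores here are hypothetical):
print(adjusted_score(0.30, 10) < adjusted_score(0.12, 2))  # -> True
```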
3. Theoretical Properties
In [2], it is shown that a good feature screening method should satisfy the sure screening and ranking consistency properties. Sure screening is the basis of feature screening: it means that all important variables are retained with probability tending to one when the sample size is sufficiently large, which ensures that the truly important variables are theoretically screened in their entirety. Ranking consistency means that the indexes of all important variables are ranked ahead of those of all unimportant variables, which ensures that when the top
variables are selected, the important variables can be screened out reasonably and robustly. This subsection establishes the theoretical properties of the proposed methods under the following conditions:
Condition 1 (C1).
, which indicates that the dimension p grows at an exponential rate with the sample size n.
Condition 2 (C2). There exist constants
,
, such that
,
,
.
Condition 3 (C3). There exist a constant
and
, such that
.
Condition 4 (C4). There exists a constant
for
such that
, and x is in the domain of definition of $X_k$, where under the condition
,
is the Lebesgue density function of $X_k$.
Condition 5 (C5). There exist a constant
and
such that
, and x is in the domain of definition of $X_k$ for
, where
is continuous in the domain of definition of $X_k$, and
is the Lebesgue density function of $X_k$.
Condition 6 (C6).
,
,
and
with
.
The literature on ultra-high-dimensional feature screening approaches, such as [2] [13] and [14], typically includes the six conditions above. Condition (C1) shows that the method applies to ultra-high-dimensional problems. Condition (C2) requires the marginal probabilities of the response variable and the covariates to be bounded above and below, preventing the worst-case scenario in which the screening method fails. This worst case stems from a limitation of the Jensen-Shannon divergence: when two distributions do not overlap at all, their Jensen-Shannon divergence is constant no matter how close their centers are, so it then fails to measure the extent of the difference between the two distributions and thus the importance of the covariates. Condition (C3) requires the index values of the truly important variables to be bounded away from zero. Condition (C4) rules out the extreme scenario in which some $X_k$ places a large mass in a small range, ensuring that the sample percentiles are close to the true percentiles. Condition (C5) requires the lower bound of the density to be of order
. The presence of Condition (C6) guarantees a certain divergence rate for the number of covariate categories. When the response is multi-classified and all covariates are discrete, we establish the theoretical properties of the WJS-SIS feature screening technique under these six conditions.
Because
,
, and
, it follows that
. Hence, this study provides a rigorous theoretical proof that the weighted JS divergence index possesses the sure screening and ranking consistency properties for feature screening.
The Properties of Sure Screening and Ranking Consistency
Categorical covariates are subscripted with the letter j, while continuous covariates are subscripted with the letter k. If the covariate is categorical, L represents the number of categories, while
represents the number of categories if the covariate is continuous.
When the covariates are categorical variables, we have Theorems 3.1 and 3.2.
Theorem 3.1. Under conditions (C1)-(C2),
, there exist
and
with
when
,
. And under the (C1) - (C3) conditions, when
, such that
Theorem 3.2. Under conditions (C1)-(C3), assume that
, then we have
When the covariates are continuous, we have Theorems 3.3 and 3.4.
Theorem 3.3. Under the conditions (C1), (C2), (C4), (C5), and (C6), there exist constants
,
and there are
(5)
When
, there is
Theorem 3.4. Assume that
, under the conditions (C1), (C3), (C4), (C5), and (C6), we have
When continuous and categorical covariates coexist, we have Theorems 3.5 and 3.6.
Theorem 3.5. Under the conditions (C1), (C2), (C4), (C5), and (C6), there exist constants
,
and
and there are
(6)
where
. When
, there is
where
.
Theorem 3.6. Assume that
and
, under the conditions (C1), (C2), (C4), (C5), and (C6), we have
Detailed proofs of the theoretical results are given in the Appendix.
4. Numerical Simulation
4.1. Evaluation Indexes
The first evaluation indexes are CP1 and CP2, which represent the proportion of true important covariates selected into the significant set when the top
and top
variables are chosen as the set of significant covariates, respectively. The second evaluation indexes are CPa1 and CPa2, which indicate whether the selected set contains all the true important covariates when its size is
and
, respectively. The third evaluation index is the MMS, the minimum model size required to include all important variables. Each simulation was repeated 100 times. All calculations in this paper were performed in R.
4.2. Simulation Experiments and Results
4.2.1. Simulation 1
The response variable and all covariates are four-categorical variables. Where, for the response variable Y, both balanced and unbalanced distributions are considered: balanced,
, with
, and
; unbalanced,
with
. Define
as the true important variable set, where
. X is generated by the conditional probability of X given by Y.
for
and
, where
is given in Table 1. And,
when
,
. The dimensionality of the covariates was set to
, and the sample sizes were set to
,
, and
.
The simulation results are shown in Table 2, where the performance indexes of all methods are the same under all settings, with coverages CP and CPa both being 1 and the MMS values close to
.
4.2.2. Simulation 2
The response variable and all covariates are categorical variables, where the response variable is set up as in Simulation 1, and the covariates have 2, 4, 6, 8, and 10 categories, respectively. Similarly, define
as the set of important variables. The covariate data were generated through the quantile of the standard normal distribution
. Define
by
, where
,
. And when
,
, otherwise
. The following are the precise steps for creating covariates:
Table 1. Parameter specification for the simulations.
The numbers in parentheses are the corresponding standard deviations.
The values of L are 2, 4, 6, 8, and 10, which correspond to
,
,
,
,
. We set
and
.
The simulation results are displayed in Table 3; the performance indexes of all methods are the same in every setting, with coverages CP and CPa both being 1 and MMS values nearly equal to
.
The numbers in parentheses are the corresponding standard deviations.
4.2.3. Simulation 3
The covariates are continuous variables, and the response variables are set up as in Simulation 1. We use the quantile function of the standard normal distribution to slice the covariates into categorical data, where
, and define the methods as WJS-SIS-4, AWJS-SIS-4, IG-SIS-4; WJS-SIS-8, AWJS-SIS-8, IG-SIS-8; and WJS-SIS-10, AWJS-SIS-10, IG-SIS-10, respectively. The essential variables are set up in the same way as in Simulation 1. Generate X using the standard normal distribution
with
and assume
, and
. Where
,
, otherwise,
. We set
and
.
Because each chosen number of slices is applied to all covariates, the performance indexes of WJS-SIS and AWJS-SIS are identical.
Table 4 displays the simulation results for a balanced distribution of Y.
Table 4. Results for simulation 3: balanced Y.
The numbers in parentheses are the corresponding standard deviations.
The coverage index values CP and CPa for the three methods are 1 in all cases. Regarding the MMS values, the 95% quantile values of MMS differ slightly only at
, where IG-SIS is smaller than those of WJS-SIS and AWJS-SIS at
and 10, and WJS-SIS and AWJS-SIS are smaller than those of IG-SIS at
; and the 95% quantile values of MMS of all three methods increase as $J_k$ increases. Table 5 shows the simulation results when Y has an unbalanced distribution: with respect to the CP and CPa values, IG-SIS is slightly higher than WJS-SIS and AWJS-SIS at
, and 1 for all methods in all other cases. With regard to the MMS values, the 95% percentile of MMS for IG-SIS is smaller than that of WJS-SIS and AWJS-SIS at
, while at
, the MMS values are the same for all methods. All methods perform better when the number of slices is small.
4.2.4. Simulation 4
The covariates are categorical and continuous, with the continuous covariates handled in the same way as in Simulation 3. The response variables are set up as in Simulation 1. The important variable set is
. Generate the latent variables
in the same way as the covariates in Simulation 3, and then generate the categorical and continuous covariates: 1) For
, then
, if
,
; 2) For
, then
, if
,
; 3) For
, then
. We set
and
.
Table 6 displays the simulation results for a balanced distribution of Y. The CP and CPa values for AWJS-SIS and IG-SIS are the same and larger than those for WJS-SIS. Regarding the MMS values, the 75% and 95% quantile values of AWJS-SIS and IG-SIS are smaller than those of WJS-SIS at N = 400 and 600; whereas AWJS-SIS is smaller than IG-SIS at
and
, and larger than IG-SIS at
and
and 10, and at
, AWJS-SIS and IG-SIS have the same MMS value; at
, all methods have the same MMS value. Table 7 shows the simulation results when Y is an unbalanced distribution: At
, all methods have the same performance index value. At
, AWJS-SIS and IG-SIS are both larger than WJS-SIS, with AWJS-SIS being somewhat smaller than IG-SIS. At
, all methods are equal in terms of the CP and CPa values. All methods perform better when the number of slices is large.
4.3. Computational Time Cost
We obtained the median running time of each algorithm through a simulation experiment, where the covariates X and the set of significant variables were set
Table 5. Results for simulation 3: unbalanced Y.
The numbers in parentheses are the corresponding standard deviations.
Table 6. Results for simulation 4: balanced Y.
The numbers in parentheses are the corresponding standard deviations.
Table 7. Results for simulation 4: unbalanced Y.
The numbers in parentheses are the corresponding standard deviations.
Table 8. The results of computational time cost.
The numbers in parentheses are the corresponding standard deviations.
up as in simulation experiment 2, and Y was set up as a balanced distribution. The experiment was set up with
,
, and the experiment was repeated 100 times. An Intel Core i7-8700 machine running Windows 10 at 3.20 GHz was used for all calculations. Table 8 shows the median runtime of the three methods, which increases as p increases; WJS-SIS and AWJS-SIS are consistently faster to compute than IG-SIS.
4.4. Comprehensive Analysis of Simulation Results
The main conclusion is that, in terms of performance, the WJS-SIS and AWJS-SIS methods proposed in this study are highly comparable to IG-SIS. The methods differ when the sample size is small, and the performance of WJS-SIS is more affected by the number of slices than that of AWJS-SIS and IG-SIS, which are more robust and adapt better to the number of slices. However, all methods perform well as the number of screened variables or the sample size increases: all are able to screen out all the important variables, and the selected model size is close to the number of important variables. In terms of computing time, IG-SIS takes longer than WJS-SIS and AWJS-SIS.
5. Experimental Study with Real Data
In real life, ultra-high-dimensional data with multi-class response variables are common, and feature screening of such data achieves data dimensionality reduction, feature mining, and variable selection. The methods proposed in this paper can be applied in different fields to improve the efficiency and accuracy of data analysis, reveal the information behind the data, and help in the construction of decision-making and prediction models. For example, in the medical field, they can be applied to analyze gene expression data and help identify disease-associated genes. In the financial field, they can help identify key factors that affect stock or commodity prices. In image processing, they can be used for tasks such as feature extraction and target recognition. In practical implementation, following Fan and Lv [2], the number of important variables is generally selected as the first
$[n/\log n]$. In this paper, we analyze the case when
.
We analyzed the TOX-171 microarray biological dataset from the Arizona State University feature selection database (http://featureselection.asu.edu/), with 171 samples and 5748 features; the response variable has four classes with a roughly unbalanced distribution, and the covariates are continuous. We randomly divide the dataset in a 7:3 ratio, with 70% of the data used for training and the remaining 30% for testing. As randomly dividing the dataset may degrade model prediction accuracy, we used ten-fold cross-validation to train the model and repeated the experiment 100 times, reporting the average and standard deviation of the evaluation indexes; the smaller the standard deviation, the more stable the result, and hence the more trustworthy the average. On the training and test sets, the variables screened by the three methods were used for classification with a support vector machine, and the classification accuracy (CA), specificity (SPE), sensitivity (SEN), and their geometric mean (G-mean) were calculated.
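The G-mean reported below is commonly computed as the geometric mean of the per-class sensitivities (recalls). A self-contained Python sketch of that common definition (the SVM classifier itself is omitted, and the helper name is ours):

```python
import numpy as np

def g_mean(y_true, y_pred):
    """Geometric mean of per-class recalls: a balanced summary of
    multi-class performance not dominated by the majority class."""
    classes = np.unique(y_true)
    recalls = np.array([np.mean(y_pred[y_true == c] == c) for c in classes])
    return float(np.prod(recalls) ** (1.0 / len(recalls)))

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 0, 2, 2])   # one class-1 sample misclassified
print(round(g_mean(y_true, y_pred), 3))  # -> 0.794, i.e. (1 * 0.5 * 1)^(1/3)
```

Because it multiplies the per-class recalls, a single poorly recognized class drags the G-mean down even when overall accuracy is high, which is why it is useful for unbalanced class distributions.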
Table 9 and Table 10 show the corresponding index values when the number of selected variables is
and
, respectively. Combining Table 9 and Table 10, it can be seen that on both the training and test sets, all methods perform better when $J_k$ is relatively large; the CA and G-mean values of AWJS-SIS are always higher than those of WJS-SIS, and higher than those of IG-SIS at $J_k$ = 8. In addition, classification performance improves as the number of screened variables increases.
6. Conclusion
From the perspective of introducing Jensen-Shannon (JS) divergence to measure the importance of covariates, for the case where Y is multi-classified, this paper constructs model-free ultra-high-dimensional feature
Table 9. The result when screening the first
variables.
The numbers in parentheses are the corresponding standard deviations.
Table 10. The result when screening the first
variables.
The numbers in parentheses are the corresponding standard deviations.
screening methods for multi-classified response data based on weighted JS divergence under different scenarios: the WJS-SIS method when the number of categories is the same across covariates, and the AWJS-SIS method with adjusted weighted JS divergence when the number of categories differs across covariates. Theoretically, both WJS-SIS and AWJS-SIS possess the sure screening property and ranking consistency. The Monte Carlo simulation results and real-data experiments show that WJS-SIS and AWJS-SIS are effective for feature screening, with performance very similar to that of IG-SIS; WJS-SIS is slightly weaker in robustness, whereas AWJS-SIS and IG-SIS are slightly more robust, and both WJS-SIS and AWJS-SIS are faster than IG-SIS in computation time. Finally, the approaches proposed in this paper use the Jensen-Shannon divergence to measure the importance of covariates from the perspective of information quantity, which differs from traditional statistical indicators and may provide a reference for methodological research on multi-class variable screening for ultra-high-dimensional data. The methods proposed in this work only take into account the correlation between the response variable and the covariates; they do not account for high correlation among the covariates. Therefore, future studies on ultra-high-dimensional variable screening will incorporate covariate correlation.
Acknowledgements
The study was supported by the National Natural Science Foundation of China [grant number 71963008].
Appendix
To prove Theorem 3.1, we first introduce the following four lemmas.
Lemma 1. Suppose there are mutually independent random variables
with sample size N and
, where
are constants. If we assume that
, then there exists a constant t for which the following inequality holds:
The proof of Lemma 1 is given in [15].
Lemma 2. Suppose there are two bounded random variables a and b, and there exist two positive constants
such that
. The estimates corresponding to
can be computed as
, given a sample size of n. Suppose that for
, there have constants
and
such that:
then, there exist
Where,
,
,
.
Besides, assuming that b is bounded and non-zero, and that there exists
such that
, then there exist
where,
,
and
.
The proof of Lemma 2 is given in [10].
Lemma 3. If the covariates are categorical, we can get that
. And
only when
, Y and
are independent.
The proof of Lemma 3 is given in [16].
Lemma 4. If the covariates are continuous, there is
, when Y and
are independent,
.
The proof of Lemma 4 is omitted here because it is similar to that of Proposition 2.2 in [11].
Proof of Theorem 3.1:
Let
,
,
then
The definitions of
and
state that there are
(7)
and
To prove that
, first:
By estimating the probability with the sample frequency, we have
Thus, there is
furthermore, it follows from Lemma 1 and Lemma 2 that since
,
are estimates of
,
:
So,
:
Additionally, it may be proved that
. Assume
,
:
Then,
.
We can obtain that
in a proof similar to this one.
So, we can get
. Similarly, it can be shown that
,
, and
are all
.
For the
part:
We now prove the
part:
According to Lemma 1 and Lemma 2, it can also be shown that
converges to
in probability; then
For
, thus, there is
(8)
Under conditions (C1)-(C3), there exist
and
with
(9)
then,
(10)
when
and
, we have
Then,
(11)
so,
, with
.
Therefore, under conditions (C1)-(C3), the sure screening property of Theorem 3.1 holds.
Proof of Theorem 3.2:
Because of
, there exists
such that
, and then we have
From Fatou’s Lemma we can get
Thus,
(12)
Therefore, Theorem 3.2 holds.
Proof of Theorem 3.3:
Assume that
is
’s empirical cumulative distribution function and that
is the cumulative distribution function of
. And let
be the cumulative distribution function of
, and
. Then, following Lemma A.2 in [11], we can similarly show that, for
, given the conditions (C4) and (C5), there are
where
and
,
are positive constants.
So,
and
, respectively.
Then, for
, we have
(13)
Equation (13) will not be proven here because it is similar to the proof of Equation (8).
Under condition (C6), there exist constants
and
such that
(14)
then,
(15)
with
, there are
(16)
So,
,
, Theorem 3.3 holds.
Therefore, by a proof similar to that of Equation (12), we have:
(17)
So, Theorem 3.4 holds.
Based on Equations (10), (11), (15), and (16), the proofs of Theorem 3.5 and Theorem 3.6 are similar to those of Theorem 3.1 and Theorem 3.2, so they are not given in detail.