1. Introduction
Before constructing a regression model, it is important to determine whether the covariate
has an effect on the response
. As pointed out by [1], in most cases we are more concerned with the conditional mean of the response. Thus, conditional mean dependence has received attention; it measures the departure of
from
. When
has no effect on the conditional mean of
, i.e.,
,
should not be included in a conditional mean regression model. In practice, some covariates related to the response are already known from historical analyses or domain knowledge. Our aim is to determine whether
contributes to the conditional mean of
after controlling for the effect of the known variable
.
Work on partial dependence has been growing recently. An intuitive way to measure partial conditional mean dependence is
(1)
Based on a plug-in estimator of the quantity in (1), [2] developed a partial conditional mean independence test. However, as pointed out by [3], under the null hypothesis, i.e., when the quantity (1) equals zero, the test statistic has a degenerate limiting distribution. To deal with the degenerate limit distribution, [3] developed a significance test based on black-box learners, [4] proposed a general framework to evaluate feature importance, and [5] considered measuring partial dependence based on the decomposition formula of the conditional variance. These methods combine machine learning with sample splitting and therefore, to a certain extent, suffer from the loss of power caused by sample splitting. Another issue is that they only consider scalar responses and cannot handle vector or functional responses. In vector or functional data analysis, conditional mean regression is an important analytical tool (see [6] for a regression model with a vector response and [7]-[9] for regression models with functional responses, among others), so it is necessary to consider partial conditional mean dependence for vector or functional responses.
To our knowledge, among these tools for partial conditional mean dependence, the Partial Martingale Difference Divergence (pMDD), introduced in [10], is currently the only one applicable to responses taking values in a Hilbert space. pMDD is a scalar-valued measure of the conditional mean dependence of
given
, adjusting for the nonlinear dependence on
, where
,
and
are random vectors of arbitrary dimensions. It extends the martingale difference divergence (MDD) introduced in [11]. However, as shown in [12] [13], the performance of MDD suffers from the curse of dimensionality. Let
be an independent copy of
, and let
be the martingale difference divergence of
given
. When
and
, [12] shows that
where
, and
is the covariance of
and
. Since the covariance only captures linear dependence, the martingale difference divergence may have low power when it is employed to detect nonlinear relationships, especially in high dimensions. pMDD, as an extension of MDD, suffers from the curse of dimensionality for the same reason. This phenomenon can be seen in the numerical results in Section 4.
In this paper, we introduce a new tool to measure partial conditional mean dependence for vector or functional responses. The numerical experiments in [13] demonstrated the advantages of kernel-based conditional mean dependence over MDD in identifying nonlinear relationships and handling high-dimensional variables. This prompts us to develop a kernel-based tool for measuring partial conditional mean dependence. Our development follows that in [10], so we name our tool the Kernel-based Partial Conditional Mean Dependence. Simulation results show that the Kernel-based Partial Conditional Mean Dependence has an advantage over the Partial Martingale Difference Divergence in identifying the nonlinear dependence of
’s conditional mean on
after controlling for
.
The rest of the paper is organized as follows. In Section 2, we review the kernel-based conditional mean dependence measure. In Section 3, we describe the construction of the Kernel-based Partial Conditional Mean Dependence and give its sample analogue. Finite-sample simulation studies are carried out in Section 4, and a real data example is analyzed in Section 5. Section 6 concludes the paper. All technical proofs are presented in the Appendix.
2. Kernel-Based Conditional Mean Dependence
Before formally introducing the Kernel-based Partial Conditional Mean Dependence, it is necessary to review a tool for measuring conditional mean dependence, namely the kernel-based conditional mean dependence.
As proposed by [13], the kernel-based conditional mean dependence (KCMD) is defined as
where
,
are separable Hilbert spaces,
,
are random elements valued in
,
, respectively.
is an independent copy of
,
is a characteristic kernel (for details on characteristic kernels, see [14] and [15]) defined on
. We specifically point out that in this article, the kernel function
used in KCMD is not fixed. Its subscript
indicates that the form and domain of the kernel function depend on
and the space in which
is located. Throughout the paper,
and
represent inner products and norms, respectively. The KCMD is intended to measure the departure from the relationship
for
and
. Lemma 1 below summarizes the fundamental properties of KCMD.
Lemma 1. Suppose that
is a positive definite and bounded characteristic kernel. Then
is well defined, and
a)
;
b)
if and only if
, almost surely.
Denote
(2)
As shown in [13], one can give another expression of the KCMD as follows:
(3)
Using this expression, we can construct an unbiased estimator of
. It is closely related to the so-called
-centered matrix.
(4)
and
as the norm of
. Theorem 1 in [16] shows that the linear span of all matrices in
is a Hilbert space with inner product defined in (4). Using the
-centered matrices, we can construct an estimator of
. Given independent and identically distributed (i.i.d.) observations
from the joint distribution of
, an unbiased estimator of
provided in [13] is defined as
(5)
Here,
and
are the
-centered versions of the matrices
and
, respectively, and the
-th element of
is
, and the
-th element of
is
. [13] has shown that the estimator (5) is unbiased and admits a U-statistic expression
(6)
with the kernel
where
and the sum is over all 4! permutations of
.
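To make the construction above concrete, the following R sketch computes U-centered matrices and their inner product in the style of [16] and assembles a sample KCMD-type statistic. It is an illustration under stated assumptions rather than a verbatim transcription of (4) and (5): a Gaussian kernel with bandwidth sigma is assumed for the covariate, the Gram entries of a vector-valued response are taken as Euclidean inner products, and the scaling 1/(n(n-3)) is used for the inner product of U-centered matrices.

# U-center a symmetric matrix A (in the style of [16]): subtract row and column
# terms scaled by 1/(n-2), add back the grand term scaled by 1/((n-1)(n-2)),
# and set the diagonal to zero.
u_center <- function(A) {
  n  <- nrow(A)
  rs <- rowSums(A) / (n - 2)
  cs <- colSums(A) / (n - 2)
  g  <- sum(A) / ((n - 1) * (n - 2))
  At <- A - outer(rs, rep(1, n)) - outer(rep(1, n), cs) + g
  diag(At) <- 0
  At
}

# Inner product of two U-centered matrices, scaled by 1/(n(n-3)).
u_inner <- function(At, Bt) {
  n <- nrow(At)
  sum(At * Bt) / (n * (n - 3))
}

# A sample KCMD-type statistic: Kx is the kernel matrix of X, Gy the Gram matrix of Y.
kcmd_n <- function(X, Y, sigma) {
  Kx <- exp(-as.matrix(dist(X))^2 / (2 * sigma^2))  # Gaussian kernel on X (assumed form)
  Gy <- tcrossprod(as.matrix(Y))                    # <Y_i, Y_j> for vector-valued Y
  u_inner(u_center(Kx), u_center(Gy))
}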
3. Kernel-Based Partial Conditional Mean Dependence
In this section, we introduce the kernel-based partial conditional mean dependence (partial KCMD), which measures the conditional mean dependence of a response
given a predictor variable
after controlling for some variable
, where
is a random element valued in the separable Hilbert space
.
3.1. Population Partial KCMD
For any symmetric function
defined on
, define
as a
-centered operator, and
where
is a random element valued in Hilbert space
,
is an independent copy of
. The functional class
is a linear space. Define the inner product on
as
It can be verified that the map satisfies the conditions of an inner product. In addition, define the norm on
as
. Take
. Let
and
for any
. Then
. Similarly,
. Thus
in equation (3) can also be written as
(7)
This implies that the KCMD measures the conditional mean dependence of
on
through the inner product of
and
in a linear space. The cosine value
measures the strength of conditional mean dependence.
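In generic notation, with u and v standing for the two elements whose inner product appears above, this cosine is the usual normalized inner product
\cos(u, v) = \frac{\langle u, v \rangle}{\|u\|\,\|v\|},
which lies in [-1, 1] by the Cauchy-Schwarz inequality; values near zero indicate a weak conditional mean relationship in this geometry.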
Lemma 2. Suppose
and
, then
can be decomposed into two orthogonal parts,
(8)
The first term in (8) represents the part of
’s conditional mean that is affected by
, and the second term represents the part that cannot be explained by
, and in a sense it corresponds to
since
. Next we define
and
. One way to measure the additional contribution of
to the conditional mean of
after controlling for
is to measure
.
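Since the displayed formulas are not reproduced here, the following generic identity, with placeholder elements u and v (v nonzero) of an inner product space, records the orthogonal projection decomposition that Lemma 2 relies on:
u = \frac{\langle u, v\rangle}{\|v\|^{2}}\, v + \Bigl( u - \frac{\langle u, v\rangle}{\|v\|^{2}}\, v \Bigr),
\qquad
\Bigl\langle \frac{\langle u, v\rangle}{\|v\|^{2}}\, v,\;
             u - \frac{\langle u, v\rangle}{\|v\|^{2}}\, v \Bigr\rangle
= \frac{\langle u, v\rangle^{2}}{\|v\|^{2}} - \frac{\langle u, v\rangle^{2}}{\|v\|^{2}} = 0.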
Inspired by [10], we provide definitions for the Kernel-based Partial Conditional Mean Dependence and the Kernel-based Partial Conditional Mean Correlation. Define
similarly to
, replacing
with
in
is
.
Definition 1. The population partial KCMD of
given
, after controlling for the effect of
, i.e.,
is defined as
If
, then we define
.
The population Kernel-based Partial Conditional Mean Correlation (pKCMC) is defined as
If
, then we define
.
After performing some straightforward calculations, we obtain an equivalent expression for pKCMD, which is given by
(9)
The
is the Hilbert-Schmidt Independence Criterion (HSIC) between
and
; it measures the dependence between these two random elements. The kernel functions used in
are
and
, which, as in KCMD, depend on the variables in their subscripts. Details on HSIC can be found in [15] and [17]-[19], among others. We review the specific form of HSIC in the Appendix and derive the last equality in (9) there. When the conditional mean of
does not depend on
or
is a constant, we have
or
. As a result, we have
.
3.2. Sample pKCMD
Given the sample
, we want to define the sample partial KCMD, denoted by
, as the sample analogue of the population partial KCMD. Let
, and define
to be the
matrix with entries
,
where
, and
,
and
are defined similarly to
,
and
.
Definition 2. Given a random sample from the joint distribution
, the sample partial kernel-based conditional mean dependence of
given
, after controlling for the effect of
, is given by
assuming
and
otherwise. The sample partial kernel-based conditional mean correlation is given by
If
, then we define
.
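As a hedged illustration only, the following R sketch (reusing u_center and u_inner from the sketch in Section 2) implements a projection-type sample statistic built from U-centered matrices in the spirit of the partial distance covariance of [16]: the inner product between the X-kernel matrix and the Y-Gram matrix is adjusted by the component explained by the Z-kernel matrix. The function name pkcmd_n, the Gaussian kernels, and this particular adjustment are assumptions for illustration; the exact matrices and formula prescribed in Definition 2 should be used in practice.

# Illustrative projection-type sample statistic (not a verbatim copy of Definition 2):
# remove from the (X, Y) inner product the part explained by Z.
pkcmd_n <- function(X, Y, Z, sigma_x, sigma_z) {
  Kx <- u_center(exp(-as.matrix(dist(X))^2 / (2 * sigma_x^2)))  # kernel matrix of X
  Gy <- u_center(tcrossprod(as.matrix(Y)))                      # Gram matrix of Y
  Kz <- u_center(exp(-as.matrix(dist(Z))^2 / (2 * sigma_z^2)))  # kernel matrix of Z
  zz <- u_inner(Kz, Kz)
  if (zz > 0) {
    u_inner(Kx, Gy) - u_inner(Kx, Kz) * u_inner(Kz, Gy) / zz
  } else {
    u_inner(Kx, Gy)  # fallback when the Z-part vanishes (see Definition 2 for the prescribed convention)
  }
}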
We next outline the theoretical properties of the sample pKCMD. Analogous results hold for the sample pKCMC and are omitted here.
Theorem 1. If one of the following two conditions holds,
a)
and
are bounded kernels,
;
b)
,
, and
.
Then, as
, we have
a.s.
We also show that
is concentrated around its population counterpart. To obtain a bound on the deviation
, we impose the following condition.
(C1) There exists a constant
such that for all
,
.
Condition (C1) follows immediately when
is uniformly bounded or has a Gaussian distribution. Condition (C1) is widely used in statistical research, for example in [11] [20] [21], to analyze the theoretical properties of feature screening.
Theorem 2. If
and
are bounded kernels,
, and Condition (C1) holds, then for any
, there exist constants
,
and
such that
Take
with
and a constant
. According to Theorem 2, there exists
, such that
. This implies that
is concentrated, and the deviation between
and
is less than
with probability at least
.
4. Simulation
In this section, we examine tests of the null hypothesis of zero pKCMD. When calculating pKCMD, we need to choose kernel functions
and
. We use Gaussian kernels for
and
, defined as
and
respectively. For the choice of bandwidths
and
for these kernels, we use the median heuristic ([13]). We compare our proposed method with the pMDD test introduced in [10]. We use a permutation procedure to obtain critical values, with 300 permutations; the permutation method is described in Section 5 of [10].
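The following R sketch shows the median-heuristic bandwidth choice and a generic permutation test built on the pkcmd_n sketch of Section 3.2. The permutation scheme shown here, which simply permutes the rows of X while keeping the (Y, Z) pairs fixed, is an illustrative simplification; the scheme described in Section 5 of [10] should be followed to reproduce the results below.

# Median-heuristic bandwidth: the median of the positive pairwise Euclidean distances.
median_bandwidth <- function(X) {
  d <- as.vector(dist(X))
  median(d[d > 0])
}

# Generic permutation test for H0: zero pKCMD (illustrative scheme only).
pkcmd_perm_test <- function(X, Y, Z, n_perm = 300) {
  X  <- as.matrix(X); Y <- as.matrix(Y); Z <- as.matrix(Z)
  sx <- median_bandwidth(X)
  sz <- median_bandwidth(Z)
  stat <- pkcmd_n(X, Y, Z, sx, sz)
  perm_stats <- replicate(n_perm, {
    idx <- sample(nrow(X))                       # permute the rows of X only
    pkcmd_n(X[idx, , drop = FALSE], Y, Z, sx, sz)
  })
  mean(perm_stats >= stat)                       # permutation p-value
}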
The simulations consider varying sample sizes and varying levels of dependence between the two random variables to evaluate the performance of the tests. For each setting, the empirical sizes or powers of the tests (the proportions of rejections) are recorded over 1000 repetitions at different significance levels.
Example 1 Generate the i.i.d. sample of
from the following model:
,
,
, where
. Consider two scenarios:
1)
, and
.
2)
are independent and identically distributed (i.i.d.) random variables from the Cauchy distribution with location parameter 0 and scale parameter 1.
are i.i.d. random variables from the standard normal distribution.
In this example,
and
are independent of each other, and
depends only on
. Thus, after controlling for the third random vector
, the conditional mean of
is independent of
. From Table 1, we see that both methods control the type-I error rate reasonably well.
Table 1. Empirical size of the two tests for Example 1 with
and
.
Scenario | Method | 0.01 | 0.05 | 0.10
(1) | pKCMD | 0.014 | 0.050 | 0.098
| pMDD | 0.011 | 0.048 | 0.105
(2) | pKCMD | 0.009 | 0.051 | 0.091
| pMDD | 0.010 | 0.059 | 0.111
Example 2 Generate the i.i.d. sample of
from the following model:
,
,
, where
follows a multivariate normal distribution with mean zero and identity covariance matrix
, and
for any
.
We consider the following four relationships for
: a)
, b)
, c)
, and d)
.
Table 2. Empirical powers of the two tests for Example 2 with
.
Relationship | | Method | | | | |
a) | 0.01 | pKCMD | 0.271 | 0.492 | 0.664 | 0.780 | 0.843
| | pMDD | 0.424 | 0.667 | 0.810 | 0.900 | 0.939
| 0.05 | pKCMD | 0.470 | 0.669 | 0.801 | 0.878 | 0.908
| | pMDD | 0.606 | 0.784 | 0.893 | 0.943 | 0.972
| 0.10 | pKCMD | 0.573 | 0.749 | 0.856 | 0.911 | 0.933
| | pMDD | 0.698 | 0.836 | 0.924 | 0.955 | 0.979
b) | 0.01 | pKCMD | 0.050 | 0.102 | 0.158 | 0.288 | 0.421
| | pMDD | 0.028 | 0.051 | 0.055 | 0.085 | 0.111
| 0.05 | pKCMD | 0.165 | 0.287 | 0.398 | 0.538 | 0.688
| | pMDD | 0.101 | 0.141 | 0.166 | 0.223 | 0.273
| 0.10 | pKCMD | 0.270 | 0.420 | 0.551 | 0.696 | 0.810
| | pMDD | 0.180 | 0.238 | 0.284 | 0.346 | 0.420
c) | 0.01 | pKCMD | 0.235 | 0.480 | 0.694 | 0.835 | 0.909
| | pMDD | 0.241 | 0.474 | 0.676 | 0.793 | 0.897
| 0.05 | pKCMD | 0.456 | 0.692 | 0.843 | 0.923 | 0.953
| | pMDD | 0.465 | 0.673 | 0.828 | 0.904 | 0.957
| 0.10 | pKCMD | 0.583 | 0.778 | 0.905 | 0.949 | 0.966
| | pMDD | 0.602 | 0.772 | 0.886 | 0.940 | 0.972
d) | 0.01 | pKCMD | 0.044 | 0.087 | 0.158 | 0.251 | 0.401
| | pMDD | 0.027 | 0.052 | 0.051 | 0.084 | 0.097
| 0.05 | pKCMD | 0.157 | 0.276 | 0.383 | 0.543 | 0.679
| | pMDD | 0.089 | 0.143 | 0.168 | 0.217 | 0.255
| 0.10 | pKCMD | 0.254 | 0.416 | 0.542 | 0.707 | 0.816
| | pMDD | 0.166 | 0.230 | 0.279 | 0.356 | 0.428
This example compares the empirical powers of pKCMD and pMDD across different functional relationships, significance levels, and sample sizes. According to Table 2, for the linear function
, pMDD consistently outperforms pKCMD, showing higher sensitivity. In the quadratic
and cosine
relationships, pKCMD generally demonstrates superior power, especially at larger sample sizes, indicating better detection of nonlinear effects. For the
relationship, both tests perform comparably well, with a slight advantage for pMDD at lower significance levels. Overall, pMDD excels for linear relationships, while pKCMD is preferable for nonlinear ones, particularly at larger sample sizes.
Example 3 Consider the model in Example 2 and set
,
.
Table 3 compares the empirical powers of pKCMD and pMDD at varying significance levels
and varying dimension
, with
. pKCMD consistently outperforms pMDD across all settings, with its power decreasing more gradually as
increases. At lower significance levels (
), pKCMD maintains high power even with
, while pMDD’s power drops sharply. At higher
, both tests improve, but pKCMD remains superior, especially for larger
. Overall, according to the powers reported in Table 3, pKCMD is more robust and more powerful, particularly in high-dimensional settings.
Table 3. Empirical powers of the two tests for Example 3 with
.
| Method | | | | |
0.01 | pKCMD | 1.000 | 0.960 | 0.670 | 0.399 | 0.281
| pMDD | 0.933 | 0.385 | 0.183 | 0.101 | 0.091
0.05 | pKCMD | 1.000 | 0.995 | 0.863 | 0.638 | 0.506
| pMDD | 0.986 | 0.647 | 0.413 | 0.262 | 0.222
0.10 | pKCMD | 1.000 | 0.999 | 0.937 | 0.761 | 0.662
| pMDD | 0.998 | 0.771 | 0.552 | 0.394 | 0.336
Example 4 Consider two models with the formula
, where
is generated by a Wiener process. Two processes for
, the Ornstein-Uhlenbeck process (OU) and a Gaussian process with exponential variogram (VP), are employed; they are generated by the rproc2fdata function in the R package fda.usc with default parameters. The models for generating
are as follows:
1)
,
.
2)
,
.
Table 4 and Table 5 reveal the performance of pKCMD and pMDD in detecting partial conditional mean dependencies under two distinct models of
. For the quadratic model in (1), pKCMD consistently outperforms pMDD across varying sample sizes and significance levels, with its empirical power improving significantly as the sample size increases. For instance, at
, pKCMD achieves a power of 0.943 and 0.952 for
with the OU and VP processes, respectively, while pMDD attains only 0.741 and 0.591. This indicates that pKCMD is more effective in capturing the quadratic relationship, and the choice between the OU and VP processes does not substantially affect the relative performance of the tests. For the cosine model in (2), similar trends are observed, with pKCMD maintaining higher power than pMDD across all conditions. The empirical power of both tests increases with the sample size. Although pKCMD shows slightly higher power for the OU process compared to the VP process at larger sample sizes, the difference is not substantial, suggesting that the tests’ power is generally robust to the underlying stochastic process of
.
Overall, our analysis shows that the tests based on the Kernel-based Partial Conditional Mean Dependence perform effectively in most of the situations we examined. As the sample size grows, the power of these tests increases. Compared to pMDD, our pKCMD test is more effective at capturing nonlinear relationships. Importantly, as the dimension of the variables increases, the decline in our test's power is much slower than that of pMDD.
Table 4. Empirical powers of the two tests for Example 4 (1).
| Method | OU | VP | OU | VP
0.01 | pKCMD | 0.606 | 0.561 | 0.884 | 0.863
| pMDD | 0.216 | 0.237 | 0.449 | 0.389
0.05 | pKCMD | 0.777 | 0.760 | 0.943 | 0.952
| pMDD | 0.409 | 0.402 | 0.741 | 0.591
0.10 | pKCMD | 0.835 | 0.848 | 0.957 | 0.973
| pMDD | 0.543 | 0.498 | 0.845 | 0.689
Table 5. Empirical powers of the two tests for Example 4 (2).
| Method | OU | VP | OU | VP | OU | VP
30 | pKCMD | 0.388 | 0.274 | 0.526 | 0.416 | 0.593 | 0.519
| pMDD | 0.139 | 0.116 | 0.320 | 0.230 | 0.438 | 0.328
50 | pKCMD | 0.472 | 0.378 | 0.583 | 0.542 | 0.643 | 0.628
| pMDD | 0.206 | 0.147 | 0.428 | 0.309 | 0.529 | 0.429
70 | pKCMD | 0.575 | 0.459 | 0.671 | 0.588 | 0.730 | 0.671
| pMDD | 0.350 | 0.219 | 0.541 | 0.415 | 0.638 | 0.548
90 | pKCMD | 0.639 | 0.516 | 0.743 | 0.658 | 0.785 | 0.742
| pMDD | 0.435 | 0.338 | 0.644 | 0.550 | 0.711 | 0.633
5. Real Data
In this section, we explore the Tecator dataset contained in the R package fda.usc. This dataset includes values of a 100-channel spectrum (of wavelength 850 - 1050 nm) of absorbance (
), water content (
), fat content (
), and protein content (
) for 215 meat samples. This dataset has been widely studied in functional data analysis, and this literature mainly focuses on characterizing the influence of functional covariate
on scalar response
. Our goal is to determine whether the other two variables,
and
, have an impact on the conditional mean of the response after controlling for the influence of
. For
, the p-values computed by pKCMD and pMDD are both 0.000; for
, we obtain the same result. These p-values indicate that, when constructing a conditional mean regression model,
and
should also be considered. [22] has already considered the influence of
,
, and
on
through a semi-functional partial linear regression.
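For completeness, a minimal R snippet for assembling the variables used in this analysis is given below; it assumes the standard layout of the tecator object in fda.usc (absorbance curves stored in absorp.fdata and the chemical contents in the data frame y), and the call to pkcmd_perm_test refers to the illustrative sketch in Section 4 rather than to the exact procedure used here.

library(fda.usc)
data(tecator)                          # 215 meat samples (assumed layout of the tecator object)
absorp  <- tecator$absorp.fdata$data   # 215 x 100 matrix of absorbance curves
fat     <- tecator$y$Fat               # response: fat content
water   <- tecator$y$Water             # candidate covariate
protein <- tecator$y$Protein           # candidate covariate
# e.g., does water contribute to the conditional mean of fat after controlling
# for the absorbance curves?
# pkcmd_perm_test(X = water, Y = fat, Z = absorp)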
6. Conclusion
This paper introduces pKCMD, a novel measure and test for partial conditional mean dependence for responses in Hilbert spaces, extending existing measures. We derive equivalent expressions for pKCMD at both the population and sample levels. Numerical experiments show that pKCMD outperforms pMDD in detecting nonlinear relationships across sample sizes and significance levels. The simulation results also show that, compared with pMDD, pKCMD is more robust and performs better in high-dimensional situations without additional computational cost. Overall, pKCMD is a competitive, reliable method for analyzing partial conditional mean dependence, especially in nonlinear settings. How to choose the optimal kernel function in dependence testing is an important issue and is a direction for our future research. In addition, we are also considering extending pKCMD to other tasks, such as conditional independence testing, goodness-of-fit testing, and feature screening, to broaden its applicability.
Appendix
Appendix A1. Hilbert-Schmidt Independence Criterion and Equation (9)
Definition A1. (HSIC) The Hilbert-Schmidt Independence Criterion of random elements
and
is defined as
where
is an independent copy of
,
and
are two kernel functions.
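For reference, the population HSIC can also be written in the standard form familiar from the literature cited in Section 3 ([15] [17]-[19]); with generic kernels k and l and with the primed variables denoting an independent copy,
\mathrm{HSIC}(X, Y) = \mathbb{E}\bigl[k(X, X')\, l(Y, Y')\bigr]
 + \mathbb{E}\bigl[k(X, X')\bigr]\, \mathbb{E}\bigl[l(Y, Y')\bigr]
 - 2\, \mathbb{E}\Bigl[\mathbb{E}_{X'}\bigl[k(X, X')\bigr]\, \mathbb{E}_{Y'}\bigl[l(Y, Y')\bigr]\Bigr].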
Note that in Definition A1, the kernel functions
and
are mutable and depend on
and
, respectively. HSIC was developed to test the independence of
and
, and has many desirable properties: a) it is nonnegative; b) if
and
are characteristic, then
if and only if
and
are independent. Using the symbol
defined in (2), the symbol
is similar to
, but the kernel function used in
is
. We then have
thus (9) holds.
Appendix A2. Proofs of the Main Results
Proof of Lemma 1. Lemma 1 is Proposition 1 in [13]. □
Proof of Lemma 2. Because
and
their inner product is
thus these two parts in (8) are orthogonal. □
Proof of Theorem 1. a) When
and
are bounded kernels. We suppose that
and
. First, we consider
in
. Because
is an unbiased estimate of
and can be written as a U-statistic by equation (6), and
By the strong law of large numbers for U-statistics,
Similarly, we have
According to the continuous mapping theorem, we have
a.s.
b) For general kernels.
Similarly to a), we can also prove that
a.s. This completes the proof. □
Proof of Theorem 2. Denote
and . By employing the Markov inequality, we obtain that for any
and
,
Note that
admits a U-statistic expression
with the kernel
, where
and the sum is over all 4! permutations of
. Following [23] (Section 5.1.6), we write
where
denotes the summation over all possible permutations of
,
is the
-th element under the permutation,
is the integer part of
. Denote
, write
, where
and
. Correspondingly, its population counterpart can also be decomposed as
.
Jensen's inequality yields
so
, and
. Applying Lemma 5.6.1.A of [23],
So,
Furthermore,
by the symmetry of the U-statistic.
Now we turn to
.
for any
. By the assumption that
is bounded, there exists a positive constant
such that
, and
,
this yields
by condition (C1).
Thus, if we choose
for
, then
for sufficiently large
. Hence,
.
for
.
Choosing
,
, because
, we have
where the constants
satisfy
. Immediately, we have
Thus, we obtain
Because these kernels are bounded, according to Theorem 3 in [24], we have
This completes the proof of the theorem. □