Cross Validation Based Model Averaging for Varying-Coefficient Models with Response Missing at Random
1. Introduction
In the past two decades, there has been a growing literature on model averaging and model selection. Model selection chooses a single final model before estimation, but because of model uncertainty, selecting a wrong model introduces bias. Model averaging instead estimates a large number of candidate models and then assigns greater weight to the sub-models with better predictive performance; that is, it combines models after estimation, and the set of candidate models is not required to include the true model. This can effectively overcome the shortcomings of model selection. Therefore, model averaging is increasingly used in various fields.
Most model averaging research has been devoted to frequentist model averaging, mainly for parametric models. For example, Hansen (2007) [1], Zhang et al. (2008) [2] and Wan et al. (2010) [3] proposed the Mallows-type model averaging (MMA) method. To further improve the method, Zhang et al. (2016) [4] proposed a weight selection criterion based on the Kullback-Leibler loss with a penalty term for generalized linear models and generalized linear mixed-effects models. Using nonparametric kernel smoothing to estimate the nonparametric functions, Li et al. (2018) [5] and Zhang and Wang (2019) [6] extended the Mallows-type model averaging method to semiparametric models. Recently, Xia (2021) [7] extended the Mallows-type model averaging method to varying-coefficient models, using B-splines to estimate the nonparametric functions. However, the literature mentioned above focuses on the case where the data are completely observed. In statistical applications such as economics, market research and medical research, missing data are a common phenomenon. According to the missing mechanism, missing data can be divided into three types, namely missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR). Little attention has been paid to model averaging with missing data. One of the reasons is that the complexity of missing data makes it very challenging to extend existing complete-data methods to the missing-data setting. The study of model averaging with missing data is still a relatively new topic and needs further development.
In this paper, we focus on model averaging for varying-coefficient models with responses missing at random (MAR), a mechanism that has been widely studied in the missing-data literature. As far as we know, there is little literature on model averaging with missing responses. For example, Sun et al. (2014) [8] developed a model average estimation approach for linear regression models with responses missing at random under a local misspecification framework, called the smoothed focused vector information criterion (SFVIC). Zeng et al. (2018) [9] extended the SFVIC to varying-coefficient partially linear models with responses missing at random by employing the profile least-squares estimation procedure and the inverse probability weighted method to estimate the regression coefficients under the local misspecification framework. However, these studies are limited to the local misspecification framework, and work on model averaging outside this framework is relatively scarce. For example, based on the weighted delete-one cross-validation (WDCV) criterion, Xie et al. (2021) [10] proposed a two-step model averaging procedure for high-dimensional linear regression models with missing responses without the local misspecification framework and, under certain conditions, proved that the WDCV criterion asymptotically achieves the lowest possible prediction loss. However, their criterion requires that the covariates used in different candidate models do not overlap. Based on the Mallows-type model averaging method, Wei et al. (2021) [11] established the HRCp criterion for linear models with missing responses, which remedies a shortcoming of MMA in estimating the error covariance matrix. Wei and Wang (2021) [12] developed a model averaging procedure for linear models with responses missing at random by establishing a cross-validation-based weight choice criterion, and proved its asymptotic optimality. The three works mentioned above are limited to parametric linear models.
Inspired by Xia (2021) [7], Wei and Wang (2021) [12] and Wei et al. (2021) [11], in this paper we extend the jackknife model averaging (JMA) method (e.g. Hansen and Racine (2012) [13] and Zhang et al. (2013) [14]) to varying-coefficient models with responses missing at random and heteroscedastic errors. The varying-coefficient model is a common type of nonparametric model. It has no fixed regression function and is more flexible in form, so it can fit complex data and is widely used in various fields. Following Xia (2021) [7], we use B-splines to estimate the nonparametric varying-coefficient functions, and we use the inverse probability weighted (IPW) method to deal with the missing data. To avoid the "curse of dimensionality", we assume a parametric model for the propensity score function. Compared with previous work, our method has the following advantages. Firstly, we use B-splines to estimate the varying-coefficient functions, which incurs less computational burden than local kernel methods. Secondly, we choose the weight vector by the JMA method. Compared with the MMA method, our method does not need to estimate the covariance matrix of the errors and is better suited to heteroscedastic models, which makes it easier to handle real data; moreover, in the theoretical analysis we need not consider the relationship between the estimated and the true error covariance matrices, which simplifies the proofs. Finally, our method can deal with three configurations of candidate models (the nested case, the marginal case and the full case), not just the case where the candidate models are nested. Under certain conditions, our proposal is asymptotically optimal in the sense of achieving the lowest possible squared error. The finite-sample performance of our proposal is studied by numerical simulations and compared with some related methods, and the simulation results show that our proposal is feasible.
The rest of this paper is organized as follows. Section 2 and Section 3 describe the model averaging estimation method and the weight selection procedure, respectively. The conditions and the theoretical result for asymptotic optimality are given in Section 4, and the proofs of the main results are given in Section 5. Section 6 concludes.
2. Model Framework and Model Averaging Estimation Method
We consider the following varying-coefficient model:

$Y = X^{\top}\beta(U) + \varepsilon,$  (1)

where $Y$ is the response variable, the index variable $U$ is a scalar, and $\varepsilon$ is a random error with $E(\varepsilon \mid X, U) = 0$ and $\operatorname{Var}(\varepsilon \mid X, U) = \sigma^{2}(X, U)$. We assume $X = (X_{1}, \ldots, X_{p})^{\top}$, where $X$ is a $p$-dimensional covariate vector, and $\beta(\cdot) = (\beta_{1}(\cdot), \ldots, \beta_{p}(\cdot))^{\top}$ corresponds to the vector of the unknown coefficient functions.
Suppose that $X$ and $U$ are completely observable, and $Y$ may be missing. Let $\delta$ denote the missingness indicator: $\delta = 1$ if $Y$ is observed, and $\delta = 0$ otherwise. In this paper, we assume that $Y$ is missing at random (MAR), that is,

$P(\delta = 1 \mid Y, X, U) = P(\delta = 1 \mid X, U) \equiv \pi(X, U).$  (2)
Let $Y^{*} = \delta Y / \pi(X, U)$. Using model (1) and the MAR assumption, we can obtain

$Y^{*} = X^{\top}\beta(U) + \eta,$  (3)

where $\eta = Y^{*} - X^{\top}\beta(U)$, $E(\eta \mid X, U) = 0$, and $\operatorname{Var}(\eta \mid X, U) = \sigma_{*}^{2}(X, U) < \infty$.
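To see why (3) holds, note that iterated expectations together with the MAR assumption (2) give

\[
E(Y^{*} \mid X, U) = \frac{E\{E(\delta \mid Y, X, U)\,Y \mid X, U\}}{\pi(X, U)} = \frac{\pi(X, U)\,E(Y \mid X, U)}{\pi(X, U)} = X^{\top}\beta(U),
\]

so $\eta = Y^{*} - X^{\top}\beta(U)$ indeed has conditional mean zero.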
We consider a series of approximating models. In particular, we use $M_{n}$ candidate models $\{\mathcal{S}_{m} : m = 1, \ldots, M_{n}\}$ to approximate the real data-generating process of $Y$. The $m$th candidate model $\mathcal{S}_{m} \subseteq \{1, \ldots, p\}$ contains $k_{m}$ covariates; that is, $k_{m}$ is the cardinality of $\mathcal{S}_{m}$. The total number of all possible candidate models is $2^{p}$, which includes an empty model that excludes all covariates. From model (3), under the $m$th model, with $X_{(m)}$ collecting the covariates indexed by $\mathcal{S}_{m}$ and $\beta_{(m)}(\cdot)$ the corresponding $k_{m}$-vector of coefficient functions, we have

$Y^{*} = X_{(m)}^{\top}\beta_{(m)}(U) + \eta_{(m)}.$  (4)
The purpose of this paper is to construct an asymptotically optimal model average estimator of the conditional mean $\mu = E(Y \mid X, U) = X^{\top}\beta(U)$ with the response missing under the MAR assumption.
Note that $\delta$ is fully observed. If the selection probability function $\pi(X, U)$ is known, the estimator of $\beta_{(m)}(\cdot)$ can be obtained by using B-splines to approximate the coefficient functions for $m = 1, \ldots, M_{n}$. Specifically, we first approximate each $\beta_{j}(u)$ with a function in a polynomial spline space. Without loss of generality, we assume that $U$ has the compact support $[0, 1]$. Let $B(u) = (B_{1}(u), \ldots, B_{L_{n}}(u))^{\top}$ be a B-spline basis of order $r$, where $L_{n} = K_{n} + r$, and $K_{n}$ is the number of interior knots, which increases with the sample size $n$. According to the B-spline theory in De Boor (2001) [15], there exists a parameter vector $\gamma_{j} \in \mathbb{R}^{L_{n}}$ satisfying

$\beta_{j}(u) \approx B(u)^{\top}\gamma_{j}, \qquad j = 1, \ldots, p,$

where $\beta_{j}(u)$ is the $j$th component of $\beta(u)$. So the $m$th model can be re-expressed as

$Y^{*} \approx \Pi_{(m)}^{\top}\gamma_{(m)} + \eta_{(m)},$

where $\Pi_{(m)} = X_{(m)} \otimes B(U)$, $\Pi_{(m)}$ is a $k_{m}L_{n}$-dimensional column vector, and $\gamma_{(m)} = (\gamma_{j}^{\top}, j \in \mathcal{S}_{m})^{\top}$.
Therefore, given independent observations $\{(Y_{i}^{*}, X_{i}, U_{i})\}_{i=1}^{n}$ with $\Pi_{(m),i} = X_{(m),i} \otimes B(U_{i})$, $\hat{\gamma}_{(m)}$ can be obtained by solving the following least squares optimization problem:

$\hat{\gamma}_{(m)} = \arg\min_{\gamma_{(m)}} \sum_{i=1}^{n} \bigl(Y_{i}^{*} - \Pi_{(m),i}^{\top}\gamma_{(m)}\bigr)^{2}.$  (5)

Let $Y^{*} = (Y_{1}^{*}, \ldots, Y_{n}^{*})^{\top}$; $\gamma_{(m)}$ is a vector of length $k_{m}L_{n}$, and then the solution of (5) is given by

$\hat{\gamma}_{(m)} = \bigl(\Pi_{(m)}^{\top}\Pi_{(m)}\bigr)^{-1}\Pi_{(m)}^{\top}Y^{*},$

where $\Pi_{(m)} = (\Pi_{(m),1}, \ldots, \Pi_{(m),n})^{\top}$ is an $n \times k_{m}L_{n}$-dimensional matrix. Thus, the estimator of $\beta_{(m)}(u)$ is

$\hat{\beta}_{(m)}(u) = \bigl(I_{k_{m}} \otimes B(u)\bigr)^{\top}\hat{\gamma}_{(m)}$

with $I_{k_{m}}$ the $k_{m} \times k_{m}$ identity matrix.
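To make the estimation steps concrete, the following is a minimal numerical sketch of the series least squares fit (5) for one candidate model, assuming the selection probabilities $\pi(X_{i}, U_{i})$ are known and $U$ is supported on $[0, 1]$; the function names and default knot settings are illustrative, not from the paper.

```python
import numpy as np
from scipy.interpolate import BSpline

def spline_basis(u, n_knots=5, degree=3):
    """Evaluate a clamped B-spline basis of order r = degree + 1 on [0, 1]."""
    interior = np.linspace(0.0, 1.0, n_knots + 2)[1:-1]          # K_n interior knots
    knots = np.r_[np.zeros(degree + 1), interior, np.ones(degree + 1)]
    L = len(knots) - degree - 1                                  # L_n = K_n + r
    return np.column_stack([BSpline(knots, np.eye(L)[j], degree)(u)
                            for j in range(L)])                  # n x L_n matrix

def fit_candidate(y, delta, pi, X_m, u):
    """IPW series least squares (5) for one candidate; returns (gamma_hat, mu_hat)."""
    y_star = delta * y / pi                                      # transformed response Y*
    B = spline_basis(u)
    # Pi_(m),i = X_(m),i (Kronecker) B(U_i): every covariate times every basis function
    Z = (X_m[:, :, None] * B[:, None, :]).reshape(len(y), -1)    # n x (k_m * L_n)
    gamma_hat, *_ = np.linalg.lstsq(Z, y_star, rcond=None)
    return gamma_hat, Z @ gamma_hat                              # mu_hat_(m) = P_(m) Y*
```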
For the $m$th candidate model, the estimator of $\mu = (\mu_{1}, \ldots, \mu_{n})^{\top}$, with $\mu_{i} = X_{i}^{\top}\beta(U_{i})$, is then obtained as

$\hat{\mu}_{(m)} = \Pi_{(m)}\bigl(\Pi_{(m)}^{\top}\Pi_{(m)}\bigr)^{-1}\Pi_{(m)}^{\top}Y^{*} = P_{(m)}Y^{*},$  (6)

where $P_{(m)} = \Pi_{(m)}\bigl(\Pi_{(m)}^{\top}\Pi_{(m)}\bigr)^{-1}\Pi_{(m)}^{\top}$. From (6), we know that $\hat{\mu}_{(m)}$ is linear in $Y^{*}$, so we can define the model average estimator of $\mu$ as the weighted average of $\hat{\mu}_{(1)}, \ldots, \hat{\mu}_{(M_{n})}$, that is,

$\hat{\mu}(w) = \sum_{m=1}^{M_{n}} w_{m}\hat{\mu}_{(m)} = P(w)Y^{*},$  (7)

where $P(w) = \sum_{m=1}^{M_{n}} w_{m}P_{(m)}$, and $w = (w_{1}, \ldots, w_{M_{n}})^{\top}$ is the weight vector, which satisfies

$w \in \mathcal{W} = \Bigl\{w \in [0, 1]^{M_{n}} : \sum_{m=1}^{M_{n}} w_{m} = 1\Bigr\}.$
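Given the candidate fits, the averaging step (7) is just a weighted sum. Below is a sketch for the nested case, where the $m$th candidate uses the first $m$ covariates (one of the three configurations mentioned in the Introduction), reusing fit_candidate from the previous sketch.

```python
import numpy as np

def model_average_fit(y, delta, pi, X, u, w):
    """Model average estimator (7) with nested candidates S_m = {1, ..., m}."""
    fits = np.array([fit_candidate(y, delta, pi, X[:, :m], u)[1]
                     for m in range(1, X.shape[1] + 1)])   # M_n x n matrix of mu_hat_(m)
    return np.asarray(w) @ fits                            # sum_m w_m * mu_hat_(m)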
3. Weight Selection
According to Hansen and Racine (2012) [13], we give the details of the jackknife selection (also known as leave-one-out cross-validation) of $w$ as follows.

The jackknife estimator of the $m$th model is denoted by $\tilde{\mu}_{(m)} = (\tilde{\mu}_{(m),1}, \ldots, \tilde{\mu}_{(m),n})^{\top}$,

$\tilde{\mu}_{(m),i} = \Pi_{(m),i}^{\top}\bigl(\Pi_{(m),[-i]}^{\top}\Pi_{(m),[-i]}\bigr)^{-1}\Pi_{(m),[-i]}^{\top}Y^{*}_{[-i]},$

where $\Pi_{(m),[-i]}$ and $Y^{*}_{[-i]}$ are defined as the matrix $\Pi_{(m)}$ and the column vector $Y^{*}$ with the $i$th row deleted. After some calculations, we can derive the following relationship:

$\tilde{\mu}_{(m)} = \tilde{P}_{(m)}Y^{*} \quad\text{with}\quad \tilde{P}_{(m)} = D_{(m)}(P_{(m)} - I_{n}) + I_{n},$

where $D_{(m)}$ is an $n \times n$ diagonal matrix and the $i$th diagonal element is $(1 - h_{ii}^{(m)})^{-1}$, with $h_{ii}^{(m)}$ the $i$-th diagonal element of $P_{(m)}$; then $\tilde{\mu}(w) = \sum_{m=1}^{M_{n}} w_{m}\tilde{\mu}_{(m)} = \tilde{P}(w)Y^{*}$, where $\tilde{P}(w) = \sum_{m=1}^{M_{n}} w_{m}\tilde{P}_{(m)}$. Therefore, we obtain the following JMA criterion for selecting the weight vector $w$:

$CV_{n}(w) = \frac{1}{n}\bigl\|Y^{*} - \tilde{P}(w)Y^{*}\bigr\|^{2}.$  (8)
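The shortcut $\tilde{P}_{(m)} = D_{(m)}(P_{(m)} - I_{n}) + I_{n}$ means the $n$ leave-one-out fits can be computed from a single fit. A minimal sketch of this computation (dense linear algebra for clarity; a QR decomposition would be preferable for large $n$):

```python
import numpy as np

def jackknife_fit(y_star, Z):
    """Leave-one-out fits mu_tilde_(m) in one pass: D_(m)(P_(m) - I)Y* + Y*."""
    P = Z @ np.linalg.solve(Z.T @ Z, Z.T)   # hat matrix P_(m)
    h = np.diag(P)                          # leverages h_ii^(m)
    e = y_star - P @ y_star                 # in-sample residuals
    return y_star - e / (1.0 - h)           # i-th entry: Y*_i - e_i / (1 - h_ii^(m))
```

Stacking these vectors over $m$ and taking the weighted combination gives $\tilde{\mu}(w)$, and the criterion (8) is the average squared deviation of $Y^{*}$ from it.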
In fact, $\pi(X, U)$ in (8) is unknown and needs to be modeled. Following Wei and Wang (2021) [12], we assume that $\pi(X, U; \theta)$ is a parametric model for $\pi(X, U)$, where $\pi(\cdot, \cdot; \theta)$ is a known function that can be correctly specified, while $\theta$ is an unknown parameter vector. Denote by $\hat{\theta}$ the maximum likelihood estimate (MLE) of $\theta$,

$\hat{\theta} = \arg\max_{\theta} \sum_{i=1}^{n} \bigl[\delta_{i}\log\pi(X_{i}, U_{i}; \theta) + (1 - \delta_{i})\log\{1 - \pi(X_{i}, U_{i}; \theta)\}\bigr].$

Replacing $\pi(X_{i}, U_{i})$ in $Y_{i}^{*}$ with $\pi(X_{i}, U_{i}; \hat{\theta})$, that is, setting $\hat{Y}_{i}^{*} = \delta_{i}Y_{i}/\pi(X_{i}, U_{i}; \hat{\theta})$, we can get the following criterion:

$\widehat{CV}_{n}(w) = \frac{1}{n}\bigl\|\hat{Y}^{*} - \tilde{P}(w)\hat{Y}^{*}\bigr\|^{2},$  (9)

where $\hat{Y}^{*} = (\hat{Y}_{1}^{*}, \ldots, \hat{Y}_{n}^{*})^{\top}$.
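For illustration, a logistic form for $\pi(X, U; \theta)$ is one common choice of parametric propensity model; the paper itself only requires a correctly specified parametric family. A sketch using statsmodels, where the design $(1, X, U)$ is an assumption for the example:

```python
import numpy as np
import statsmodels.api as sm

def estimate_propensity(delta, X, u):
    """MLE of theta under an assumed logistic model pi(x, u; theta)."""
    W = sm.add_constant(np.column_stack([X, u]))   # design (1, X, U) -- an assumption
    theta_fit = sm.Logit(delta, W).fit(disp=0)     # maximizes the binomial likelihood
    pi_hat = theta_fit.predict(W)                  # pi(X_i, U_i; theta_hat)
    return np.clip(pi_hat, 1e-3, 1.0)              # keep away from 0, cf. condition (C1)
```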
Then the optimal weight vector $\hat{w}$ is defined as

$\hat{w} = \arg\min_{w \in \mathcal{W}} \widehat{CV}_{n}(w).$  (10)

Thus, the model average estimator of $\mu$ is $\hat{\mu}(\hat{w}) = \sum_{m=1}^{M_{n}} \hat{w}_{m}\hat{\mu}_{(m)}$, named the varying-coefficient jackknife model average (VC-JMA) estimator.
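Minimizing (9) over $\mathcal{W}$ is a quadratic program over the probability simplex. A sketch using a general-purpose solver (a dedicated QP solver would also work):

```python
import numpy as np
from scipy.optimize import minimize

def select_weights(y_star_hat, loo_fits):
    """Minimize criterion (9) over the simplex W; loo_fits is M_n x n."""
    M = loo_fits.shape[0]
    cv = lambda w: np.mean((y_star_hat - w @ loo_fits) ** 2)
    cons = ({'type': 'eq', 'fun': lambda w: np.sum(w) - 1.0},)  # weights sum to one
    res = minimize(cv, np.full(M, 1.0 / M), method='SLSQP',
                   bounds=[(0.0, 1.0)] * M, constraints=cons)
    return res.x                                                # hat w in (10)
```

Here loo_fits stacks the jackknife fits $\tilde{\mu}_{(m)}$ (computed with $\hat{Y}^{*}$ in place of $Y^{*}$) row by row.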
4. Asymptotic Optimality
In this section, we investigate the asymptotic optimality of the proposed estimator $\hat{\mu}(\hat{w})$. We first give some necessary symbols. Define

$L_{n}(w) = \|\hat{\mu}(w) - \mu\|^{2} \quad\text{and}\quad R_{n}(w) = E\{L_{n}(w) \mid X, U\}$  (11)

as the loss function and risk function of $\hat{\mu}(w)$, respectively, where $X$ and $U$ denote the full covariate and index samples. Let $\xi_{n} = \inf_{w \in \mathcal{W}} R_{n}(w)$, let $w_{m}^{0}$ be an $M_{n}$-dimensional unit vector whose $m$th element is one and the other elements are zeros, and let $\Omega = E(\eta\eta^{\top} \mid X, U) = \operatorname{diag}\{\sigma_{*1}^{2}, \ldots, \sigma_{*n}^{2}\}$, where $\sigma_{*i}^{2} = \operatorname{Var}(\eta_{i} \mid X_{i}, U_{i})$ and $\eta = (\eta_{1}, \ldots, \eta_{n})^{\top}$.
Next, we give some conditions which are required for proving the asymptotic optimality of $\hat{\mu}(\hat{w})$. All the limiting processes discussed here and throughout the paper are with respect to $n \to \infty$.
(C1) The MLE $\hat{\theta}$ of $\theta$ is $\sqrt{n}$-consistent and satisfies the regularity conditions of asymptotic normality. $\pi(X, U; \theta)$ is bounded away from 0 and twice continuously differentiable with respect to $\theta$. For all $\theta$'s in a neighborhood of $\theta_{0}$, $E\{\sup_{\theta}\|\partial^{2}\pi(X, U; \theta)/\partial\theta\,\partial\theta^{\top}\|\} < \infty$, where $\theta_{0}$ is the true value of $\theta$.
(C2) $0 < c_{1} \le \min_{1 \le i \le n} \sigma_{*i}^{2}$, $\max_{1 \le i \le n} \sigma_{*i}^{2} \le c_{2} < \infty$, a.s., where $c_{1}$ and $c_{2}$ are constants.
(C3) There exists a constant $c$ such that $\max_{1 \le m \le M_{n}} \max_{1 \le i \le n} h_{ii}^{(m)} \le c\,n^{-1}\rho_{m}$ a.s., where $\rho_{m}$ is the rank of $P_{(m)}$.
(C4) $\xi_{n} \to \infty$, $\xi_{n}^{-2G}\sum_{m=1}^{M_{n}}\{R_{n}(w_{m}^{0})\}^{G} \to 0$, $E\bigl(\eta_{i}^{4G} \mid X_{i}, U_{i}\bigr) \le \kappa < \infty$, a.s., where $G$ is an integer satisfying $1 \le G < \infty$.
(C5) The functions $\beta_{j}(\cdot)$, $j = 1, \ldots, p$, belong to a class of functions $\mathcal{H}$, whose $r$th derivatives $\beta_{j}^{(r)}$ exist and are Lipschitz of order $\alpha \in (0, 1]$.
(C6) The density $f(u)$ of $U$ is bounded away from 0 and infinity on its support.
Remark 1 As Zhang et al. (2013) [14] pointed out, condition (C2) is commonly used in the study of the asymptotic optimality of cross-validation methods. Condition (C3) is based on condition (22) of Zhang et al. (2013) [14]; for a detailed explanation of this condition, see Zhang et al. (2013) [14]. Condition (C4) is commonly used in the model averaging literature; see, for example, Wan et al. (2010) [3], Zhang et al. (2013) [14] and Zhu et al. (2019) [16].
Remark 2 Conditions (C5) and (C6) are two common assumptions in approximating nonparametric coefficient functions with B-spline basis functions, as in Fan et al. (2014) [17].
With these conditions, the following theorem states the asymptotic optimality of $\hat{\mu}(\hat{w})$.
Theorem 1 Suppose that conditions (C1) - (C6) hold; then

$\frac{L_{n}(\hat{w})}{\inf_{w \in \mathcal{W}} L_{n}(w)} \xrightarrow{p} 1,$  (12)

where $\hat{w}$ is the weight vector selected by (10).
Theorem 1 implies that our proposed estimator $\hat{\mu}(\hat{w})$ is asymptotically optimal, because the selected weight vector $\hat{w}$ yields a squared error that is asymptotically identical to that of the infeasible optimal weight vector.
5. Proofs of the Main Results
Before proving Theorem 1, we first introduce some symbols used in the proofs. We use $c$ to represent a universal positive constant, which can take different values in different places. For any matrix $A$, we define $\lambda_{\max}(A)$ as the maximum singular value of $A$. Let $\tilde{M}(w) = I_{n} - \tilde{P}(w)$ and $\Delta(w) = \tilde{P}(w) - P(w)$. According to the definitions of $L_{n}(w)$ and $R_{n}(w)$ in (11), we define the loss function and risk function of $\tilde{\mu}(w)$ in (8) by

$\tilde{L}_{n}(w) = \|\tilde{\mu}(w) - \mu\|^{2} \quad\text{and}\quad \tilde{R}_{n}(w) = E\{\tilde{L}_{n}(w) \mid X, U\},$

respectively, and then obtain

$\tilde{R}_{n}(w) = \bigl\|\tilde{A}(w)\mu\bigr\|^{2} + \operatorname{tr}\bigl\{\tilde{P}(w)\Omega\tilde{P}(w)^{\top}\bigr\},$

where $\tilde{A}(w) = \tilde{P}(w) - I_{n}$. Let $\hat{D}$ be an $n \times n$ diagonal matrix and the $i$th diagonal element is $\pi(X_{i}, U_{i})/\pi(X_{i}, U_{i}; \hat{\theta})$. Then, according to the definition of $Y^{*}$ given in Section 2, we obtain

$\hat{Y}^{*} = \hat{D}Y^{*},$

where $\hat{Y}^{*} = (\hat{Y}_{1}^{*}, \ldots, \hat{Y}_{n}^{*})^{\top}$.
Next, we give the four lemmas needed to prove Theorem 1.
Lemma 1 Let $\hat{w} = \arg\min_{w \in \mathcal{W}} \widehat{CV}_{n}(w)$. If, as $n \to \infty$,

$\sup_{w \in \mathcal{W}} \frac{\bigl|n\widehat{CV}_{n}(w) - \|\eta\|^{2} - L_{n}(w)\bigr|}{R_{n}(w)} \xrightarrow{p} 0 \quad\text{and}\quad \|\eta\|^{2} = O_{p}(n),$

and there exists a constant $c$ such that $\sup_{w \in \mathcal{W}} R_{n}(w)/L_{n}(w) \le c$ almost surely, then

$\frac{L_{n}(\hat{w})}{\inf_{w \in \mathcal{W}} L_{n}(w)} \xrightarrow{p} 1.$

The proof of Lemma 1 refers to Zhang et al. (2013) [14].
Lemma 2 If (C4) is satisfied, we have

$\sup_{w \in \mathcal{W}} \Bigl|\frac{L_{n}(w)}{R_{n}(w)} - 1\Bigr| \xrightarrow{p} 0.$  (13)

Proof of Lemma 2: Note that, with $A(w) = P(w) - I_{n}$,

$L_{n}(w) - R_{n}(w) = 2\eta^{\top}P(w)^{\top}A(w)\mu + \bigl[\eta^{\top}P(w)^{\top}P(w)\eta - \operatorname{tr}\{P(w)^{\top}P(w)\Omega\}\bigr].$

Therefore, to prove (13), it suffices to prove that

$\sup_{w \in \mathcal{W}} \bigl|\eta^{\top}P(w)^{\top}A(w)\mu\bigr|/R_{n}(w) \xrightarrow{p} 0,$  (14)

$\sup_{w \in \mathcal{W}} \bigl|\eta^{\top}P(w)^{\top}P(w)\eta - \operatorname{tr}\{P(w)^{\top}P(w)\Omega\}\bigr|/R_{n}(w) \xrightarrow{p} 0.$  (15)

By Zhang et al. (2013) [14], under the condition (C4), we have

$R_{n}(w) \ge \xi_{n} \ \text{for all } w \in \mathcal{W} \quad\text{and}\quad \lambda_{\max}\{P(w)\} \le \max_{1 \le m \le M_{n}} \lambda_{\max}(P_{(m)}) \le 1.$

We use condition (C4), the Chebyshev inequality and Theorem 2 of Whittle (1960) [18] to prove (14) and (15), respectively, as follows: for any $\epsilon > 0$,

$P\Bigl\{\sup_{w \in \mathcal{W}} \frac{|\eta^{\top}P(w)^{\top}A(w)\mu|}{R_{n}(w)} > \epsilon\Bigr\} \le \epsilon^{-2G}\xi_{n}^{-2G}\sum_{m=1}^{M_{n}} E\bigl\{\bigl|\eta^{\top}P_{(m)}^{\top}A(w_{m}^{0})\mu\bigr|^{2G} \bigm| X, U\bigr\} \le c\,\epsilon^{-2G}\xi_{n}^{-2G}\sum_{m=1}^{M_{n}}\{R_{n}(w_{m}^{0})\}^{G} \to 0.$

Therefore, (14) is proven. The proof process of (15) is similar to that of (14), and hence the proof of Lemma 2 is completed.
Lemma 3 If (C1) - (C4) are satisfied, then

$\sup_{w \in \mathcal{W}} \bigl|n\widehat{CV}_{n}(w) - n\,CV_{n}(w)\bigr|/R_{n}(w) \xrightarrow{p} 0.$  (16)

Proof of Lemma 3: By Lemma 1 of Wei and Wang (2021) [12] and the Cauchy-Schwarz inequality, we have

$n\widehat{CV}_{n}(w) - n\,CV_{n}(w) = \bigl\|\tilde{M}(w)(\hat{D} - I_{n})Y^{*}\bigr\|^{2} + 2\{(\hat{D} - I_{n})Y^{*}\}^{\top}\tilde{M}(w)^{\top}\tilde{M}(w)Y^{*}, \quad \lambda_{\max}\{\tilde{M}(w)\} \le c.$

Therefore, to prove (16), it suffices to show that

$\sup_{w \in \mathcal{W}} \bigl|\{(\hat{D} - I_{n})Y^{*}\}^{\top}\tilde{M}(w)^{\top}\tilde{M}(w)Y^{*}\bigr|/R_{n}(w) \xrightarrow{p} 0,$  (17)

$\bigl\|(\hat{D} - I_{n})Y^{*}\bigr\|^{2}/\xi_{n} \xrightarrow{p} 0.$  (18)

Note that $Y^{*} = \mu + \eta$, so the left-hand side of (17) splits into a term involving $\mu$ and a term involving $\eta$. Therefore, to prove (17), it suffices to show that

$\sup_{w \in \mathcal{W}} \bigl|\{(\hat{D} - I_{n})Y^{*}\}^{\top}\tilde{M}(w)^{\top}\tilde{M}(w)\mu\bigr|/R_{n}(w) \xrightarrow{p} 0,$  (19)

$\sup_{w \in \mathcal{W}} \bigl|\{(\hat{D} - I_{n})Y^{*}\}^{\top}\tilde{M}(w)^{\top}\tilde{M}(w)\eta\bigr|/R_{n}(w) \xrightarrow{p} 0.$  (20)

The proof steps of (19) and (20) are similar to those of (14) and (15), respectively; the detailed process is omitted here, and they are established under condition (C4). That is, (17) is proven, and we next prove (18).

According to (C1) and the Cauchy-Schwarz inequality, we have

$\bigl\|(\hat{D} - I_{n})Y^{*}\bigr\|^{2} \le c \max_{1 \le i \le n}\bigl|\pi(X_{i}, U_{i}; \hat{\theta}) - \pi(X_{i}, U_{i})\bigr|^{2}\,\|Y^{*}\|^{2}.$

According to (C3) and (C4), to prove (18), it suffices to prove that

$n^{-1}\|Y^{*}\|^{2} = O_{p}(1),$  (21)

$\max_{1 \le i \le n}\bigl|\pi(X_{i}, U_{i}; \hat{\theta}) - \pi(X_{i}, U_{i})\bigr|^{2} = O_{p}(n^{-1}),$  (22)

since (21), (22) and $\xi_{n} \to \infty$ together give $\|(\hat{D} - I_{n})Y^{*}\|^{2}/\xi_{n} = O_{p}(\xi_{n}^{-1}) = o_{p}(1)$.

By (C2) - (C4) and the Markov inequality, for any $\epsilon > 0$, we have

$P\bigl(n^{-1}\|Y^{*}\|^{2} > \epsilon\bigr) \le (n\epsilon)^{-1}\sum_{i=1}^{n} E\{(Y_{i}^{*})^{2}\} \le c/\epsilon.$

This together with the dominated convergence theorem indicates (21).

We perform a Taylor expansion of $\pi(X_{i}, U_{i}; \hat{\theta})$ around the true value $\theta_{0}$ and then obtain

$\bigl|\pi(X_{i}, U_{i}; \hat{\theta}) - \pi(X_{i}, U_{i}; \theta_{0})\bigr| \le \Bigl\|\frac{\partial\pi(X_{i}, U_{i}; \bar{\theta})}{\partial\theta}\Bigr\|\,\|\hat{\theta} - \theta_{0}\| \le c\,\|\hat{\theta} - \theta_{0}\|,$

where the last inequality is due to (C1) and the Cauchy-Schwarz inequality, and $\bar{\theta}$ is a linear combination of $\hat{\theta}$ and $\theta_{0}$. Since $\hat{\theta}$ is the MLE, by (C1) we have $\|\hat{\theta} - \theta_{0}\| = O_{p}(n^{-1/2})$. Because of the consistency of $\hat{\theta}$ and (C1), we have $\max_{1 \le i \le n}\|\partial\pi(X_{i}, U_{i}; \bar{\theta})/\partial\theta\| = O_{p}(1)$. These results imply (22); this completes the proof of (18) and hence the proof of Lemma 3.
Lemma 4 If (C2) - (C4) are satisfied, then

$\sup_{w \in \mathcal{W}} \bigl|n\,CV_{n}(w) - \|\eta\|^{2} - L_{n}(w)\bigr|/R_{n}(w) \xrightarrow{p} 0.$  (23)

Proof of Lemma 4: After some careful calculations, we have

$n\,CV_{n}(w) - \|\eta\|^{2} - \tilde{L}_{n}(w) = -2\eta^{\top}\tilde{A}(w)\mu - 2\eta^{\top}\tilde{P}(w)\eta,$

where $\tilde{A}(w) = \tilde{P}(w) - I_{n}$. Because $\eta_{i}$ is independent of $\eta_{j}$ for $i \ne j$ and the diagonal elements of $\tilde{P}(w)$ are all zero, then

$E\bigl\{\eta^{\top}\tilde{P}(w)\eta \bigm| X, U\bigr\} = 0.$

By Lemma 1, we know that, to prove (23), it suffices to prove (13) and

$\sup_{w \in \mathcal{W}} \frac{\bigl|\eta^{\top}\tilde{A}(w)\mu\bigr| + \bigl|\eta^{\top}\tilde{P}(w)\eta\bigr| + \bigl|\tilde{L}_{n}(w) - L_{n}(w)\bigr|}{R_{n}(w)} \xrightarrow{p} 0.$  (24)

Since (13) has been proved, the following only needs to prove (24). By the Cauchy-Schwarz inequality, we have

$\bigl|\tilde{L}_{n}(w) - L_{n}(w)\bigr| \le \|\Delta(w)Y^{*}\|^{2} + 2\|\Delta(w)Y^{*}\|\,L_{n}^{1/2}(w).$

So, to prove (24), it suffices to verify

$\sup_{w \in \mathcal{W}} \bigl|\eta^{\top}A(w)\mu\bigr|/R_{n}(w) \xrightarrow{p} 0,$  (25)

$\sup_{w \in \mathcal{W}} \bigl|\eta^{\top}\Delta(w)\mu\bigr|/R_{n}(w) \xrightarrow{p} 0,$  (26)

$\sup_{w \in \mathcal{W}} \bigl|\eta^{\top}P(w)\eta - \operatorname{tr}\{P(w)\Omega\}\bigr|/R_{n}(w) \xrightarrow{p} 0,$  (27)

$\sup_{w \in \mathcal{W}} \|\Delta(w)Y^{*}\|^{2}/R_{n}(w) \xrightarrow{p} 0,$  (28)

$\sup_{w \in \mathcal{W}} \bigl|\eta^{\top}\Delta(w)\eta - \operatorname{tr}\{\Delta(w)\Omega\}\bigr|/R_{n}(w) \xrightarrow{p} 0.$  (29)

The proofs of (25), (26) and (27) are similar to that of (14), and they hold under the condition (C4). Note that

$\|\Delta(w)Y^{*}\|^{2} \le \lambda_{\max}^{2}\{\Delta(w)\}\,\|Y^{*}\|^{2}.$  (30)

By Zhang et al. (2013) [14], under (C2) - (C4), we have

$\max_{1 \le m \le M_{n}} \lambda_{\max}\bigl(\tilde{P}_{(m)} - P_{(m)}\bigr) \le \max_{1 \le m \le M_{n}} \max_{1 \le i \le n} \frac{h_{ii}^{(m)}}{1 - h_{ii}^{(m)}} \to 0,$

which together with (21) implies (28); we next prove (29).

By Lemma 1 of Wei and Wang (2021) [12], we have

$\bigl|\eta^{\top}\Delta(w)\eta - \operatorname{tr}\{\Delta(w)\Omega\}\bigr| \le \lambda_{\max}\{\Delta(w)\}\bigl(\|\eta\|^{2} + \operatorname{tr}(\Omega)\bigr).$  (31)

By Zhang et al. (2013) [14], we know that, under (C2), $\|\eta\|^{2} + \operatorname{tr}(\Omega) = O_{p}(n)$. This together with (31) as well as (28) implies (29), and (25) - (29) together prove (24). This completes the proof of Lemma 4.
Proof of Theorem 1: Firstly, we note that

$n\widehat{CV}_{n}(w) - \|\eta\|^{2} - L_{n}(w) = \bigl\{n\widehat{CV}_{n}(w) - n\,CV_{n}(w)\bigr\} + \bigl\{n\,CV_{n}(w) - \|\eta\|^{2} - L_{n}(w)\bigr\}.$

Therefore, by Lemma 1, to prove Theorem 1, it suffices to prove that

$\|\eta\|^{2} = O_{p}(n),$  (32)

$\sup_{w \in \mathcal{W}} \bigl|L_{n}(w)/R_{n}(w) - 1\bigr| \xrightarrow{p} 0,$  (33)

$\sup_{w \in \mathcal{W}} \bigl|n\widehat{CV}_{n}(w) - n\,CV_{n}(w)\bigr|/R_{n}(w) \xrightarrow{p} 0,$  (34)

$\sup_{w \in \mathcal{W}} \bigl|n\,CV_{n}(w) - \|\eta\|^{2} - L_{n}(w)\bigr|/R_{n}(w) \xrightarrow{p} 0.$  (35)

According to Zhang et al. (2013) [14], we know that (32) holds under (C2) - (C4). Then, according to Lemma 2, Lemma 3 and Lemma 4, we know that (33) - (35) hold, respectively, so Theorem 1 is proved.
6. Conclusions
In this paper, we extend the JMA method to nonparametric varying-coefficient models with responses missing at random. We first use the inverse probability weighted method to deal with the missing data, then use B-splines to estimate the nonparametric coefficient functions, and finally use the jackknife method to select the weight vector $\hat{w}$. Under certain conditions, the asymptotic optimality of our method is proved.
In this paper, we only consider varying-coefficient models. Undoubtedly, it would be meaningful to extend our ideas to more complex models, such as partially linear models and semiparametric varying-coefficient models; however, such extensions are very challenging. Second, in this article we assume that the parametric model for the selection probability function is correctly specified, and further research is needed to develop a model averaging method that is robust against misspecification of the selection probability function. These are interesting directions for future research.
Acknowledgements
This research was supported by the Natural Science Foundation of Shandong Province of China (ZR2021MA077).