Estimation of Distribution Function Based on Presmoothed Relative-Risk Function
1. Introduction
Censored data occur in survival analysis, biomedical trials and industrial experiments. There are several censoring schemes (from the right, from the left, from both sides, mixed with competing risks, and others). In the statistical literature, however, random right censoring is the most widespread, since it is easy to describe from the methodological point of view. We also consider this kind of censoring here, so that our results can be compared with others.
Let $X_1, X_2, \dots$ and $Y_1, Y_2, \dots$ be two independent sequences of independent and identically distributed (i.i.d.) random variables (r.v.-s) with common unknown continuous distribution functions (d.f.-s) F and G, respectively. Let the $X_j$ be censored on the right by the $Y_j$, so that the observations available at the n-th stage consist of the sample $\{(Z_j, \delta_j),\ 1 \le j \le n\}$, where $Z_j = \min(X_j, Y_j)$ and $\delta_j = I(X_j \le Y_j)$, with $I(A)$ denoting the indicator of the event A. The main problem is the nonparametric estimation of the d.f. F, with nuisance d.f. G, from this censored sample, in which the number of observed $X_j$, $\sum_{j=1}^{n} \delta_j$, is random.
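The construction of the censored sample can be sketched in Python as follows. This is a minimal illustration: the exponential laws chosen for the lifetimes and censoring times, the sample size and the function name `censor_right` are assumptions made for the example and are not taken from the text.

```python
import numpy as np

def censor_right(x, y):
    """Return (z, delta) with z_j = min(x_j, y_j) and delta_j = 1 if x_j <= y_j, else 0."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    z = np.minimum(x, y)              # observed values
    delta = (x <= y).astype(int)      # 1 = lifetime observed, 0 = censored
    return z, delta

# Hypothetical illustration (assumed distributions, not from the paper).
rng = np.random.default_rng(0)
n = 200
x = rng.exponential(scale=1.0, size=n)   # lifetimes with d.f. F
y = rng.exponential(scale=1.5, size=n)   # censoring times with d.f. G
z, delta = censor_right(x, y)
n_observed = int(delta.sum())            # random number of uncensored lifetimes
```

Only the pairs `(z, delta)` are available to the statistician; the arrays `x` and `y` are kept here purely to show how the sample arises.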
Kaplan and Meier [1] were the first to suggest the product-limit (PL) estimator
of F defined as
(1)
where $Z_{(1)} \le \dots \le Z_{(n)}$ are the order statistics of the Z-sample and $\delta_{(1)}, \dots, \delta_{(n)}$ is the sequence of indicators attached to the ordered Z-sample. There are different versions of the PL-estimator; they do not coincide, however, if the largest $Z_j$ is a censoring time. There is an enormous body of work on the properties of PL-estimators and their application to statistical problems, especially in the case of random right censoring. However, the PL-estimator is not the only estimator of the d.f. F. Abdushukurov [2] [3] proposed another estimator of F, of relative-risk power type:
(2)
where
is an empirical estimator of d.f.
and
is an estimator of relative-risk function
. Here cumulative hazard functions (c.h.f.-s)
and
corresponding to the d.f.-s H, G and F are defined as
(3)
with subdistribution functions
,
The corresponding estimators of c.h.f.-s (3) are
where
are empirical counterparts of
with
.
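The two estimators can be compared on a toy sample with the following Python sketch. It rests on assumptions that should be checked against (1)-(3): the PL-estimator is taken in its usual product form with factors 1 - delta_(j)/(n - j + 1), the c.h.f. estimators are taken as the corresponding cumulative sums with the same denominators, and the power estimator is taken as 1 - (1 - H_n(t))^{R_n(t)} with R_n the ratio of the two c.h.f. estimators. Other versions of these formulas (for instance with slightly different denominators) exist; the function names are hypothetical.

```python
import numpy as np

def _ordered(z, delta):
    order = np.argsort(z, kind="stable")
    return np.asarray(z, float)[order], np.asarray(delta, float)[order]

def product_limit(z, delta, t):
    """PL estimate of F(t): 1 - prod over Z_(j) <= t of (1 - delta_(j)/(n - j + 1))."""
    zs, ds = _ordered(z, delta)
    n = len(zs)
    at_risk = n - np.arange(n)                         # n - j + 1 for j = 1..n
    surv = np.concatenate(([1.0], np.cumprod(1.0 - ds / at_risk)))
    k = np.searchsorted(zs, np.atleast_1d(t), side="right")   # number of Z_(j) <= t
    return 1.0 - surv[k]

def relative_risk_power(z, delta, t):
    """Power-type estimate 1 - (1 - H_n(t))**R_n(t) (assumed form, see lead-in)."""
    zs, ds = _ordered(z, delta)
    n = len(zs)
    at_risk = n - np.arange(n)
    lam_f = np.concatenate(([0.0], np.cumsum(ds / at_risk)))   # c.h.f. estimate for F
    lam_h = np.concatenate(([0.0], np.cumsum(1.0 / at_risk)))  # c.h.f. estimate for H
    k = np.searchsorted(zs, np.atleast_1d(t), side="right")
    h_n = k / n                                                # empirical d.f. of Z
    r_n = np.divide(lam_f[k], lam_h[k], out=np.zeros(len(k)), where=lam_h[k] > 0)
    return 1.0 - (1.0 - h_n) ** r_n

z, delta = [1.0, 2.0, 3.0], [1, 1, 0]       # the largest observation is censored
print(product_limit(z, delta, 3.0))          # about 0.667: stays below 1
print(relative_risk_power(z, delta, 3.0))    # 1.0
```

In this toy sample the largest observation is a censoring time, so the product form stops at 2/3, whereas the power form reaches 1 at the last observed point.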
It was shown in [2] [3] [4] that estimators (1) and (2) have similar asymptotic properties and converge to the same limiting Gaussian process. However, the relative-risk power estimator (2) has some small-sample advantages over the PL-estimator (1). For example, it is not sensitive to censoring at the last observed point
, since
and it is identifiable with the model:
,
,
, where
is the corresponding estimator of the d.f. G(t). Several extended versions of estimator (2) were proposed in [4] for generalized models of incomplete observations mixed with competing risks. These estimators have also been studied extensively in a number of statistical problems. It is not difficult to see that estimator (2) is a natural extension of the well-known ACL (Abdushukurov-Cheng-Lin) estimator of F in the simple Proportional Hazards Model (PHM):
where
is an estimator of probability
, which is the value of the constant relative-risk function
(since, in the PHM,
). Note that
was independently proposed and studied by Abdushukurov [5] and by Cheng and Lin [6] (for more information, see also Csörgő [7]). This estimator has been studied, extended and used by many other authors up to the present. The main property of the PHM is its characterization by the independence of the subsamples
and
. This property is equivalent to the relation
for some positive
. In PHM,
and therefore
is a censoring parameter. Estimator
in
PHM is asymptotically efficient with respect to
. This advantage carries over to plug-in estimators of many functionals (see [2] [5] [7]). For this reason, the conditional probability that an observation is not censored, given its observed value,
(4)
is a very important function, which in PHM is constant
. Moreover, probability (4) plays a key role in expressing the c.h.f.
via
as
and, therefore, the relative-risk function as
Probability (4) is the regression of $\delta_j$ on $Z_j$. Hence, it can be estimated by regression methods. We use the following nonparametric regression estimator of Nadaraya [8] and Watson [9]:
(5)
where the kernel
is a given probability density function and
is a bandwidth sequence such that:
. If probability (4) depends on unknown parameters, it may be estimated parametrically (see Dikta [10] in this context). Cao et al. [11] proposed the following presmoothed PL-estimator of the d.f. F, obtained by replacing the censoring indicators
in the expression of PL-estimator (1) by the estimator (5) at the observed data points:
(6)
Some asymptotic properties of estimator (6) were investigated in [11] [12]. In view of the advantages of estimator (2) over (1), we propose a new presmoothed relative-risk power (PRRP) estimator:
(7)
where
is a partially presmoothed analogue of estimator
. Here the smooth estimator (5) of the conditional probability (4) is used in the formula for the c.h.f.
. However, estimator (7) itself is not smooth. Note that estimator (7) is also well defined on the whole line without any conditions on the censoring.
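The presmoothing step can be illustrated with the following Python sketch. It is written under stated assumptions: the Nadaraya-Watson estimator is taken in its standard ratio form, the kernel is the triweight density (chosen only because it is symmetric, compactly supported and twice continuously differentiable), the presmoothed PL-estimator replaces each indicator in the usual product form by the smoothed value at the same point, and the PRRP estimator is taken as 1 - (1 - H_n(t))^{R(t)} with the presmoothed c.h.f. in the numerator of R. The function names and the bandwidth value are hypothetical.

```python
import numpy as np

def triweight(u):
    """Triweight kernel: symmetric density supported on [-1, 1]."""
    return np.where(np.abs(u) <= 1.0, (35.0 / 32.0) * (1.0 - u**2) ** 3, 0.0)

def nadaraya_watson(z, delta, t, b, kernel=triweight):
    """p_n(t) = sum_j delta_j k((t - Z_j)/b) / sum_j k((t - Z_j)/b)."""
    z = np.asarray(z, float)
    delta = np.asarray(delta, float)
    t = np.atleast_1d(np.asarray(t, float))
    w = kernel((t[:, None] - z[None, :]) / b)      # weights, shape (len(t), n)
    num = (w * delta[None, :]).sum(axis=1)
    den = w.sum(axis=1)
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

def presmoothed_estimators(z, delta, t, b):
    """Return (presmoothed PL, PRRP) estimates of F at the points t (assumed forms)."""
    z = np.asarray(z, float)
    zs = np.sort(z)
    n = len(zs)
    p = nadaraya_watson(z, delta, zs, b)           # smoothed indicators at Z_(j)
    at_risk = n - np.arange(n)                     # n - j + 1 for j = 1..n
    surv = np.concatenate(([1.0], np.cumprod(1.0 - p / at_risk)))
    lam_f = np.concatenate(([0.0], np.cumsum(p / at_risk)))     # presmoothed c.h.f.
    lam_h = np.concatenate(([0.0], np.cumsum(1.0 / at_risk)))
    k = np.searchsorted(zs, np.atleast_1d(t), side="right")
    r = np.divide(lam_f[k], lam_h[k], out=np.zeros(len(k)), where=lam_h[k] > 0)
    return 1.0 - surv[k], 1.0 - (1.0 - k / n) ** r

# Toy data; b = 5.0 is an arbitrary illustrative bandwidth.
z, delta = [1.0, 2.0, 3.0, 4.0], [1, 0, 1, 0]
ppl, prrp = presmoothed_estimators(z, delta, [2.5, 4.0], b=5.0)
```

When the bandwidth is so small that each observation receives weight only from itself, the smoothed values coincide with the original indicators and both functions reduce to their unsmoothed counterparts.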
2. Asymptotic Properties of PRRP Estimator
Let us denote
. In order to investigate the properties of estimator (7) we need the following conditions:
(C1)
, where
and
;
(C2) Numbers
and
are such that
,
and
,
;
(C3) For all
it holds that
;
(C4) k is a symmetric, twice continuously differentiable density function of bounded variation with compact support;
(C5) Density
exists, is four times continuously differentiable at
and
;
(C6)
is four times continuously differentiable at
;
(C7)
for some
,
for some
and
.
Consider random functions
In the next theorem we show that the PRRP estimator can be approximated by a sum of i.i.d. random functions of t, with the remainder term tending to zero almost surely as $n \to \infty$.
Theorem 1. If the conditions (C1)-(C7) are fulfilled, then there holds
(8)
with
, where
.
The following Lemmas allow us to prove Theorem 1.
Lemma 1. (Corollary of Lemma 3.2 in [12]). Assume that conditions (C1)-(C7) are fulfilled. Then the following estimate holds
(9)
Lemma 2. (Theorem 3.4 in [12] ). Assume that the conditions (C1)-(C7) are fulfilled. Then for
it is true that
(10)
Lemma 3. (Dvoretzky-Kiefer-Wolfowitz inequality with tight constant
from [13] ). For all
, some
and
the following estimate holds
(11)
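As a numerical illustration of this type of inequality, the following Python sketch checks by simulation the classical form of the Dvoretzky-Kiefer-Wolfowitz bound with Massart's constant, P(sup_t |H_n(t) - H(t)| > eps) <= 2 exp(-2 n eps^2). The exact form of (11) may differ from this classical statement; the uniform distribution, the values of n and eps and the number of replications are assumptions made for the example.

```python
import numpy as np

def sup_deviation_uniform(sample):
    """sup_t |H_n(t) - t| for a sample from the Uniform(0, 1) distribution."""
    u = np.sort(sample)
    n = len(u)
    i = np.arange(1, n + 1)
    # The supremum is attained at a jump of the empirical d.f. (from above or below).
    return max(np.max(i / n - u), np.max(u - (i - 1) / n))

rng = np.random.default_rng(1)
n, eps, reps = 100, 0.15, 2000
exceed = np.mean([sup_deviation_uniform(rng.uniform(size=n)) > eps
                  for _ in range(reps)])      # Monte Carlo frequency of exceedance
bound = 2.0 * np.exp(-2.0 * n * eps**2)       # about 0.022 for these values
print(exceed, bound)
```

Because the bound is distribution-free for continuous d.f.-s, the uniform case used here is representative.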
Lemma 4. (Lemma on page 53 [14] ). For
the following estimate holds
(12)
where
and
are some positive constants.
Lemma 5. 1) If conditions (C1)-(C3) are fulfilled, then the following estimates hold:
a)
;
b)
;
2) If conditions (C4)-(C7) are additionally fulfilled, then the following estimate also holds:
c)
.
Proof of Lemma 5. Observe that
(13)
where the last equality follows from (11) and the Borel-Cantelli lemma. Further, it is clear that b) is a consequence of (13) and
For c) we have
where we have used Lemma 1 and the estimate (13). Lemma 5 is proved.
Proof of Theorem 1. By a two-term Taylor expansion of the difference
we obtain
(14)
where
,
,
lies between
and
. For
we have representation
(15)
We now show that the first summand in (15) is the main term and that the sum of the other two terms tends to zero as $n \to \infty$. Consider the first term, which can be decomposed as
(16)
where
Hence, by (9) and estimate a) of Lemma 5, we have
(17)
Then from (10), (16) and (17) we obtain for all
For the other two terms of (15) we have
(18)
Hence by (9) and (13), we obtain
(19)
Now, by simple algebra and integration by parts for
and taking into account the Taylor expansion of
we get the chain of equalities
(20)
where
,
,
Hence, using (11) we have
(21)
Consider equality
(22)
and its integral form
(23)
Using (22) and (23) in the third and first integrals in (20) and taking into account also (21) we obtain
(24)
Applying estimate (11) to the first and second integrals and estimate (12) to the third integral in (24) gives
(25)
(26)
(27)
Thus, adding (18), (19) and (24)-(27), we derive
Then, by virtue of (9) and (16), from (15) we have
and, consequently,
(28)
Finally, the desired result (8) follows from (15)-(17) and (28). The proof is complete.
As a consequence, the strong uniform consistency of the PRRP estimator can be obtained.
Theorem 2. Let the assumptions of Theorem 1 be fulfilled. Then, as $n \to \infty$, it holds that
(29)
Proof of Theorem 2. Using the inequality
,
, we have the following chain of relations:
where the last equality follows from Lemmas 3 and 5 (estimate c)). This completes the proof of Theorem 2.
The approximating normalized sum of random functions
in Theorem 1 is the same as that for the presmoothed PL-estimator (6). Therefore, in view of representation (8), the asymptotic normality of the PRRP estimator follows from Theorem 3.7 in [12].
Theorem 3. Let the assumptions of Theorem 1 be fulfilled and
(C8)
,
and
as
for any
.
Then the following statements hold:
1) If
, then
,
2) If
, then
,
where
3. Numerical Study of Estimators
In this section we study the above estimators numerically. Using the Python programming language, we generate a simulated sample. We select
and draw a sample of size
. This sample is censored from the right by r.v.-s with d.f.
. The resulting sample has a censoring rate of 47%. We study the above estimators on this sample.
The red line in the figure shows the theoretical d.f.
and the green line shows the Kaplan-Meier estimate (Figure 1). One disadvantage of this estimate is that it may not be defined at the right endpoint of the sample.
Next we plot estimator (2), proposed by Abdushukurov (Figure 2). In the figure, the red line shows the theoretical d.f. and the blue line shows Abdushukurov's estimate. The graphs show that both estimates are very good, but it is difficult to tell from the graphs which estimate is better. Therefore, we study the sum
. The corresponding values are tabulated.
From Table 1 it can be concluded that estimate (2), proposed by Abdushukurov, is closer to the d.f.
.
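A comparison of this kind can be sketched in Python as follows. Everything specific here is an assumption made for illustration: the exponential lifetime and censoring laws, the sample size, the seed, and the discrepancy criterion (sum of absolute deviations from the theoretical d.f. at the observed points) are hypothetical stand-ins, and the estimators are taken in the product and power forms with denominators n - j + 1. The sketch does not reproduce the figures of Table 1.

```python
import numpy as np

def pl_and_power(z, delta):
    """Return ordered Z with PL and relative-risk power estimates at those points."""
    order = np.argsort(z, kind="stable")
    zs = np.asarray(z, float)[order]
    ds = np.asarray(delta, float)[order]
    n = len(zs)
    at_risk = n - np.arange(n)                    # n - j + 1 for j = 1..n
    pl = 1.0 - np.cumprod(1.0 - ds / at_risk)     # product-limit form
    lam_f = np.cumsum(ds / at_risk)               # c.h.f. estimate for F
    lam_h = np.cumsum(1.0 / at_risk)              # c.h.f. estimate for H
    h_n = np.arange(1, n + 1) / n                 # empirical d.f. of Z at Z_(j)
    rr = 1.0 - (1.0 - h_n) ** (lam_f / lam_h)     # power form
    return zs, pl, rr

rng = np.random.default_rng(2)
n = 300
x = rng.exponential(scale=1.0, size=n)            # assumed lifetimes, F(t) = 1 - exp(-t)
# Censoring rate for two exponentials is rate_G / (rate_F + rate_G);
# rate_G = 0.887 gives about 0.47.
y = rng.exponential(scale=1.0 / 0.887, size=n)
z, delta = np.minimum(x, y), (x <= y).astype(int)

zs, pl, rr = pl_and_power(z, delta)
true_f = 1.0 - np.exp(-zs)
dev_pl = np.sum(np.abs(pl - true_f))              # hypothetical discrepancy criterion
dev_rr = np.sum(np.abs(rr - true_f))
print(1.0 - delta.mean(), dev_pl, dev_rr)
```

Which of the two discrepancies is smaller depends on the simulated sample, so a single run of this sketch should not be read as confirming or contradicting the table.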
Next we plot estimates (6) (Figure 3) and (7) (Figure 4).
As can be seen from the graphs, both estimates are very close to the theoretical d.f. despite the high level of censoring. The table below shows that the result in fact depends on the chosen bandwidth sequence.
From Table 2 we can conclude that the
-estimator is better than
-estimator.
Figure 1.
-Estimator (Kaplan-Meier).
Table 1. Comparison of
-estimate with
-estimate.
Figure 2.
-Estimator (Abdushukurov).
Table 2. Comparison of
-estimate with
-estimate.