1. Introduction
Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed (i.i.d.) random variables drawn from a distribution function $F$ which belongs to the max-domain of attraction of an extreme value distribution with extreme value index $\gamma$. It is well known that this means there exist constants $a_n > 0$ and $b_n$ such that
$$\lim_{n \to \infty} F^n(a_n x + b_n) = \exp\left(-(1+\gamma x)^{-1/\gamma}\right) \qquad (1)$$
for all $x$ with $1 + \gamma x > 0$. For a tail index $\gamma > 0$, the limit relation (1) is equivalent to
$$\lim_{t \to \infty} \frac{U(tx)}{U(t)} = x^{\gamma} \qquad (2)$$
for all $x > 0$, where $U = \left(\frac{1}{1-F}\right)^{\leftarrow}$ is the left-continuous inverse function of $1/(1-F)$ and a regularly varying function with index $\gamma$; see [1].
The estimation of the tail index for heavy-tailed distributions may be one of the most studied problems in extreme value theory. Numerous estimators, such as the Hill estimator, the Pickands estimator and the maximum likelihood estimator, have already been explored; see [1] [2] [3] [4] and [5] for detailed discussions and reviews.
Extreme value analysis often relies on high order statistics. However, in many applications one quickly runs out of data: the observations can be corrupted, and this contamination can lead to severe bias in the estimation of the tail index. This problem has attracted considerable attention. Recall the classic Hill estimator of $\gamma$:
$$\hat{\gamma}^{H}_{k,n} = \frac{1}{k} \sum_{i=1}^{k} \log \frac{X_{n-i+1,n}}{X_{n-k,n}},$$
where $X_{1,n} \le \cdots \le X_{n,n}$ are the associated order statistics of the $n$ i.i.d. random variables with unknown distribution function $F$ with $\gamma > 0$. Based on the Hill estimator, [6] trimmed a certain number of the largest order statistics in order to obtain a robust estimator of $\gamma$ and (among other robust estimators) defined a trimmed version of the Hill estimator:
$$\hat{\gamma}^{\mathrm{trim}}_{k_0,k} = \sum_{i=k_0+1}^{k} c_i \log \frac{X_{n-i+1,n}}{X_{n-k,n}}.$$
[7] then chose the weights $c_i$ so that the estimator is asymptotically optimal, and also developed a selection method for the trimming parameter which yields a trimmed Hill estimator that adapts to the unknown level of contamination in the extremes. Removing instead the lower order statistics from the classical Hill estimator, [8] derived an alternative estimator of the tail index, which was shown to have lower variance than the classic Hill estimator. A number of researchers have also considered trimming, but of the models rather than the data; see [9] and [10]. Moreover, random censoring for heavy-tailed distributions is discussed in [11] [12] [13] and [14]. In contrast to the above, we assume a non-truncated heavy-tailed model in which only the top order statistics of the data are contaminated.
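To fix ideas, the trimming idea can be sketched in a few lines. The weights below are one natural equal-influence choice that makes the estimator unbiased under an exact Pareto tail; the function name and this particular weight choice are ours, and need not coincide with the optimal data-driven weights $c_i$ studied in [6] and [7]:

```python
import numpy as np

def trimmed_hill(x, k, k0):
    """Trimmed Hill sketch: the top k0 order statistics are excluded.
    Weights chosen to be unbiased under an exact Pareto tail."""
    xs = np.sort(x)                               # X_{1,n} <= ... <= X_{n,n}
    logs = np.log(xs[-k:] / xs[-k - 1])[::-1]     # log X_{n-i+1,n}/X_{n-k,n}, i = 1..k
    s = (k0 + 1) * logs[k0] + logs[k0 + 1:k].sum()
    return float(s / (k - k0))

rng = np.random.default_rng(0)
sample = rng.pareto(2.0, size=10_000) + 1.0   # exact Pareto tail, gamma = 1/2
print(trimmed_hill(sample, k=200, k0=8))      # fluctuates around gamma = 0.5
```

Under an exact Pareto sample the estimate is an average of $k - k_0$ i.i.d. $\gamma\,\mathrm{Exp}(1)$ variables, so it centers on $\gamma$ regardless of the trimming level.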
The rapid emergence of massive datasets in various fields poses growing challenges to traditional statistical methods. On account of this, distributed inference, which refers to analyzing data stored on distributed machines, has been proposed. It was developed to deal with large-scale statistical optimization problems and relies on a divide-and-conquer algorithm which estimates a desired quantity or parameter on each machine and transmits the results to a central machine, often by simple averaging. Building on the frameworks of [15] [16] and [17], [18] reported a first attempt at distributed inference for the extreme value index, proposing a distributed Hill estimator and establishing its asymptotic theory.
In this paper, considering massive datasets in which the top order statistics are contaminated, we apply the method of distributed inference and derive a new estimator of the extreme value index for heavy-tailed distributions. The new estimator can be used when large datasets are stored in a distributed manner, cannot be combined into one oracle sample, and have corrupted top order statistics.
We assume that the i.i.d. observations $X_1, \ldots, X_n$ are stored in $k$ machines with $m$ observations each, i.e., $n = km$, and that the machines operate independently. Let $X^{(j)}_{1,m} \le \cdots \le X^{(j)}_{m,m}$ denote the order statistics within machine $j$. Suppose we have identified that the top $d_0$ order statistics have been corrupted in each machine. We then use the top $d$ exceedance ratios $X^{(j)}_{m-i+1,m} / X^{(j)}_{m-d,m}$, $i = 1, \ldots, d$, and cut the top $d_0$ of these exceedance ratios at the same level in every machine $j$, $j = 1, \ldots, k$, to build the estimator in each machine. Then we take the average of the estimators from all machines, and the distributed trimmed Hill estimator $\hat{\gamma}_{DT}$ is defined as:
(3)
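The construction described above can be sketched as follows. This is our own illustration: the per-machine weights are the equal-influence choice that is unbiased under an exact Pareto tail, and need not coincide with the exact weighting in (3):

```python
import numpy as np

def machine_trimmed_hill(x, d, d0):
    # Per-machine trimmed Hill: the top d0 of the top d exceedance ratios
    # over X_{m-d,m} are cut; weights unbiased under an exact Pareto tail.
    xs = np.sort(x)
    logs = np.log(xs[-d:] / xs[-d - 1])[::-1]        # i = 1, ..., d
    return ((d0 + 1) * logs[d0] + logs[d0 + 1:d].sum()) / (d - d0)

def distributed_trimmed_hill(machines, d, d0):
    # Divide-and-conquer step: average the k per-machine estimates.
    return float(np.mean([machine_trimmed_hill(x, d, d0) for x in machines]))

rng = np.random.default_rng(1)
gamma, k, m = 1.0, 20, 500
machines = [rng.pareto(1.0 / gamma, size=m) + 1.0 for _ in range(k)]
print(distributed_trimmed_hill(machines, d=50, d0=8))   # close to gamma = 1
```

Only the $k$ per-machine estimates need to be transmitted to the central machine, which is what makes the scheme suitable for distributedly stored data.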
To derive the asymptotic normality of the distributed trimmed Hill estimator $\hat{\gamma}_{DT}$, we need to impose the following condition on the sequences $k$ and $m$:
(4)
We also need the following second-order regular variation condition: there exists a positive or negative function $A$ with $A(t) \to 0$ as $t \to \infty$ and a real number $\rho$, satisfying $\rho \le 0$, such that
$$\lim_{t \to \infty} \frac{\frac{U(tx)}{U(t)} - x^{\gamma}}{A(t)} = x^{\gamma}\,\frac{x^{\rho} - 1}{\rho} \qquad (5)$$
for all $x > 0$ (see e.g. [1], Corollary 2.3.4).
By (5), there exists a function $A_0$ with $A_0(t) \sim A(t)$ as $t \to \infty$ such that for all $\varepsilon > 0$, $\delta > 0$, there is a $t_0 = t_0(\varepsilon, \delta)$ such that for $t \ge t_0$, $tx \ge t_0$,
$$\left| \frac{\frac{U(tx)}{U(t)} - x^{\gamma}}{A_0(t)} - x^{\gamma}\,\frac{x^{\rho} - 1}{\rho} \right| \le \varepsilon\, x^{\gamma+\rho} \max\left(x^{\delta}, x^{-\delta}\right). \qquad (6)$$
Applying (5) to the slowly varying function $L$ in $U(t) = t^{\gamma} L(t)$, we also have
$$\lim_{t \to \infty} \frac{\frac{L(tx)}{L(t)} - 1}{A(t)} = \frac{x^{\rho} - 1}{\rho}; \qquad (7)$$
for more details on $A_0$, see page 48 in [1].
2. Main Results
In the homogeneous case where
is a fixed integer, the following theorem shows the asymptotic normality of the distributed trimmed Hill estimator.
Theorem 2.1. Suppose
with
and (4) and (5) hold. Let
, where
is a fixed integer. If
, as
, then
where
with
and
with
.
In the heterogeneous case where
are uniformly bounded positive integer sequences, i.e.,
, the following theorem shows the asymptotic normality of the distributed trimmed Hill estimator.
Theorem 2.2. Suppose
with
and (4) and (5) hold. Let
be uniformly bounded positive integer sequences, i.e.,
and
. If
, as
, then
where
.
In the homogeneous case where
and
is an intermediate sequence, i.e.,
,
as
, the following theorem shows the asymptotic normality of the distributed trimmed Hill estimator.
Theorem 2.3. Suppose
with
, and (4) and (5) hold. Let
, where
and
as
. If
, as
, then
where
3. Simulation Studies
In this section, we study the finite sample performance of the distributed trimmed Hill estimator $\hat{\gamma}_{DT}$
and compare it with the estimator of [7], i.e., the trimmed Hill estimator, on the following three distributions, all of which belong to the max-domain of attraction of an extreme value distribution, with three sub-cases of varying parameters for each distribution.
We report the mean value and mean squared error (MSE) over r = 2000 Monte Carlo simulations of all considered estimators for heavy-tailed models with sample size n = 10,000. We assume the contamination occurs in the top d0 order statistics in each machine and vary the level of d in the distributed trimmed Hill estimator, both to verify the theoretical results given in Section 2 and to compare the finite sample performance of the distributed trimmed Hill estimator with that of the trimmed Hill estimator for different values of d. The sample
contains n = 10,000 observations stored in k machines with m observations each. We fix k = 20 and m = 500 and vary d from 30 to 100 with d0 = 8. The results are presented in Figures 1-3.
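The simulation design can be reproduced along the following lines. This is a reduced sketch: r is lowered from 2000 for speed, the per-machine weights are our equal-influence choice rather than necessarily those of (3), and the contamination mechanism (multiplying the top d0 observations in each machine by 10) is an assumption made here purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
k, m, d, d0, alpha, reps = 20, 500, 50, 8, 1.0, 200
gamma = 1.0 / alpha                        # Frechet(alpha) tail index

def machine_est(x):
    # Per-machine trimmed Hill with equal-influence weights.
    xs = np.sort(x)
    logs = np.log(xs[-d:] / xs[-d - 1])[::-1]
    return ((d0 + 1) * logs[d0] + logs[d0 + 1:d].sum()) / (d - d0)

errs = []
for _ in range(reps):
    ests = []
    for _ in range(k):
        x = (-np.log(rng.uniform(size=m))) ** (-1.0 / alpha)  # Frechet(alpha) sample
        x[np.argsort(x)[-d0:]] *= 10.0    # corrupt the top d0 observations (assumed)
        ests.append(machine_est(x))
    errs.append(np.mean(ests) - gamma)

errs = np.asarray(errs)
print("mean:", gamma + errs.mean(), "MSE:", (errs ** 2).mean())
```

Because the corrupted points remain the top d0 order statistics in each machine, the trimming removes them entirely, so the reported mean stays near gamma despite the contamination.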
• The Fréchet distribution with distribution function $F(x) = \exp\left(-x^{-\alpha}\right)$, $x > 0$, which implies $\gamma = 1/\alpha$ and $\rho = -1$. We consider the three parameters α = 1, 0.5 and 2.
Figure 1. Fréchet distribution, parameters α = 1, 0.5 and 2. Diagnostics of trimmed Hill estimator (coral) and distributed trimmed Hill estimator (skyblue) as a function of d.
Figure 2. Pareto (σ, ξ) distribution, sets of parameters σ = 1, ξ = 1; σ = 2, ξ = 0.5 and σ = 1, ξ = 2.5. Diagnostics of trimmed Hill estimator (coral) and distributed trimmed Hill estimator (skyblue) as a function of d.
• The Pareto(σ, ξ) distribution with distribution function $F(x) = 1 - \left(x/\sigma\right)^{-1/\xi}$, $x \ge \sigma$, which implies $\gamma = \xi$ and $\rho = -\infty$. We consider the three sets of parameters σ = 1, ξ = 1; σ = 2, ξ = 0.5 and σ = 1, ξ = 2.5.
• The Burr(τ, λ) distribution with distribution function $F(x) = 1 - \left(1 + x^{\tau}\right)^{-\lambda}$, $x > 0$, which implies $\gamma = 1/(\tau\lambda)$ and $\rho = -1/\lambda$. We consider the three sets of parameters τ = 2, λ = 0.5; τ = 3, λ = 0.5 and τ = 3, λ = 1.
For the Fréchet distribution, Figure 1 shows that as d increases, the MSE increases for the estimators with different α. For the Pareto distribution in Figure 2, the bias between the estimators and the true value is virtually zero for all levels of d. For the Burr distribution in Figure 3, we observe a trade-off for the estimators with different sets of parameters: as d increases, the MSE increases when λ is low, while the MSE decreases when λ takes a larger value.
Figure 3. Burr (τ, λ) distribution, sets of parameters τ = 2, λ = 0.5; τ = 3, λ = 0.5 and τ = 3, λ = 1. Diagnostics of trimmed Hill estimator (coral) and distributed trimmed Hill estimator (skyblue) as a function of d.
Figures 1-3 show that the difference in MSE between the distributed trimmed Hill estimator and the trimmed Hill estimator is not sizeable. Consequently, we can infer that the new estimator performs well when estimating the extreme value index from massive and corrupted datasets.
4. Proofs
Recall that $X \stackrel{d}{=} U(Z)$, where $Z$ has the standard Pareto distribution function $1 - 1/z$, $z \ge 1$, and let $Z_1, \ldots, Z_n$ be a random sample of $Z$. For each machine $j$, let $Z^{(j)}_{1,m} \le \cdots \le Z^{(j)}_{m,m}$ denote the order statistics of the m Pareto(1) distributed variables corresponding to the m observations in this machine. Noting that $X \stackrel{d}{=} U(Z)$, we have $X^{(j)}_{i,m} \stackrel{d}{=} U\left(Z^{(j)}_{i,m}\right)$, and then the estimator can be expressed through these variables. Relation (2) implies that $U(t) = t^{\gamma} L(t)$, where $L$ is a slowly varying function. Hence
$$\log \frac{U\left(Z^{(j)}_{m-i+1,m}\right)}{U\left(Z^{(j)}_{m-d,m}\right)} = \gamma \log \frac{Z^{(j)}_{m-i+1,m}}{Z^{(j)}_{m-d,m}} + \log \frac{L\left(Z^{(j)}_{m-i+1,m}\right)}{L\left(Z^{(j)}_{m-d,m}\right)}.$$
Lemma 4.1. Suppose
with
, define
for
, and
. Under the assumption of (4),
.
Proof. Note that $\log Z^{(j)}_1, \ldots, \log Z^{(j)}_m$ forms a random sample from the standard exponential distribution. In machine $j$, for any $1 \le i \le m$, by Rényi's representation we have
$$E^{(j)}_{i,m} \stackrel{d}{=} \sum_{l=1}^{i} \frac{E^{*}_{l}}{m - l + 1}$$
with $E^{*}_{1}, E^{*}_{2}, \ldots$ i.i.d. standard exponential, where $E^{(j)}_{1,m} \le \cdots \le E^{(j)}_{m,m}$ are the order statistics of the Exp(1) variables corresponding to the $m$ observations.
The joint distribution of
,
, can be expressed as follows:
(8)
By (8), we have that
for
, and by the WLLN and (4) it follows that
as
.
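The Rényi representation invoked above can also be checked numerically: the normalized spacings $(m - l + 1)\left(E_{l,m} - E_{l-1,m}\right)$, $l = 1, \ldots, m$, of an Exp(1) sample should behave like i.i.d. standard exponential variables. The following sketch is a quick sanity check of this fact, not part of the proof:

```python
import numpy as np

rng = np.random.default_rng(3)
m, reps = 500, 2000
norm_spacings = []
for _ in range(reps):
    e = np.sort(rng.exponential(size=m))              # Exp(1) order statistics
    gaps = np.diff(np.concatenate(([0.0], e)))        # spacings E_{l,m} - E_{l-1,m}
    norm_spacings.append(gaps * np.arange(m, 0, -1))  # scale by m - l + 1
norm_spacings = np.concatenate(norm_spacings)
print(norm_spacings.mean(), norm_spacings.var())      # both approximately 1
```

Both the sample mean and the sample variance of the one million normalized spacings come out close to 1, matching the Exp(1) law.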
Lemma 4.2. Under the conditions of Theorem 2.3, define
if
, as
, we have
Proof. Note that
(9)
where
In the first term of (9), we have that
(10)
By the assumption of
as
and
as
, we can get that as
,
By (7), choose
and we can get that
(11)
Since
is the
order statistic from the standard Pareto distribution, [19] implies that
. Recall that
and we have
as
. Combining with
as
, we can get that as
,
and then
(12)
Similarly, as
,
(13)
Combining (10), (11), (12) and (13), we can get that as
,
(14)
In the second term of (9), we have that
(15)
It is known that
(16)
by the WLLN for triangular arrays, and since
is independent of
for
and
, we have that
,
(17)
and
(18)
Combining (17) and (18), we obtain that
,
and finally we can get that
,
which yields the lemma.
Proof of Theorem 2.1. When
and by Lemma S.2 in [18] , we have that
, for any
. Then by applying (6) twice with
and
and
we get that as
,
(19)
Here, the
term is uniform for all
,
and all
. We obtain that
where
By Lemma 4.1 and the central limit theorem, we have that
, as
.
By the WLLN for triangular arrays, and since
is independent of
for
and
, we have that
as
, where the second equality follows from a direct calculation. By Stirling's formula, it follows that
as
. Hence, combining with
as
, we can replace
by A and obtain that as
,
(20)
Similarly, as for
, we obtain that
as
. Combining with
and (20) as
the statement in Theorem 2.1 follows.
When
, (19) is equivalent to
as
, where
term is uniform for all
, all
and
. Similarly, we obtain that
where
We can show that
,
and
as
, similar to the proof above, the statement in Theorem 2.1 follows.
Proof of Theorem 2.2. We only show the proof for
and the proof for
is similar.
By Lemma S.2 in [18] , we have
, for any
. Then by applying (6) twice with
and
and
and using the same method as shown in the proof of Theorem 2.1, we obtain that
where
as
.
By Lemma 4.1 and the central limit theorem, we have that
as
.
As for
, by the WLLN for triangular arrays and the fact that, for each j,
is independent of
, where
, similar to the proof of Theorem 2.1, we have that
(21)
as
.
Similarly, as for
, we obtain that
as
. Combining with
and (21) as
the statement in Theorem 2.2 follows.
Proof of Theorem 2.3. We only show the proof for
and the proof for
is similar.
By Lemma S.2 in [18] , we have
, for any
. Then by applying (6) twice with
and
and
and using the same method as in the proof of Theorem 2.1, we obtain that
where
By Lemma 4.1 and the central limit theorem, we have that
as
.
By the WLLN for triangular arrays, and since
is independent of
for
and
, we have that
(22)
By Stirling's formula, it follows that
as
.
Similarly, as for
, we obtain that
as
. Combining with
and (22) as
the statement in Theorem 2.3 follows.