1. Introduction
To investigate the properties and features of data or in anomaly detection, density estimation performs a vital role. For this purpose, the nonparametric kernel density estimation or curve estimation is a famous technique. Nonparametric estimation has certain advantages over parametric estimation, e.g. the problem of priori distribution choice, possibility of using non-homogenous data, no functional form and the most important is allocation of weights, etc. [1]. In nonparametric density estimation, boundary bias is a very serious issue. It affects the performance of the estimator at boundary points due to boundary effects, then from the interior points. Such problem is happened; when smoothing is carried out near the boundary and fixed symmetric kernel allocate weights outside the density support. That’s why in some cases, parametric method for curve estimation performs better than nonparametric estimation [2]. Such problem is happened when variables represent some sort of physical measure such as time or length. These variables thus have a natural lower boundary, e.g. time of birth, etc. So, when smoothing is carried out near the boundary and then fixed, symmetric kernels allocate weights outside the density support and due to this, boundary bias arise [3].
There is a vast literature on removing boundary effects in nonparametric method. As yet there appears to be no single dominating solution that corrects the boundary problem for all shapes of densities. Some common techniques are reflection of data, introduced by Schuster [4]. Similarly, Silverman [5] proposed negative-reflection. Eubank and Speckman [6] suggested semi-parametric model. Chen [7] suggested the solution of this problem by replacing the symmetric kernels by the asymmetric Bata kernel which never assigns weight outside the support. Many others used Chen’s idea and proposed other kernels, i.e. Gamma [8], lognormal [9], Inverse Gaussian [10], Weibull [11], etc.
By following Chen [7], we are going to propose a new class of density estimator named as a Gumbel kernel estimator along with its bias, variance and optimal bandwidth, which will be the keen addition in category of asymmetrical kernel(s) that solve the problem of boundary bias. The Gumbel distribution is a particular case of the Generalized Extreme Value (GEV) distribution, and also known as the log Weibull or Fisher-Tippett distribution. GEV is a family of continuous probability distributions which combines the Gumbel, Frechet and Weibull families and also known as type I, II and III extreme value distributions. The common functional form for all 3 distributions was discovered by McFadden [12].
This paper is organized as follows. In the first Section, we present some information about kernel smoothing and in Section 2 we presented the proposed kernel. In third Section, we investigated the bias, variance and optimal bandwidth of the Gumbel kernel estimator. The performance of the proposed estimator will be tested via real and simulated data sets in Section 4, while Section 5 concludes.
2. Gumbel Kernel Estimator
Let
be a random sample from a distribution with an unknown probability density function f which has bounded support on [0, ∞). Representation of pdf of Gumbel (µ, β) is
,
, (1)
where
and
. The mean and variance of J are equal to
and
, where
is the Euler-Mascheroni constant.
As
and
, the class of Gumbel kernel considered is:
(2)
where h is bandwidth satisfying the condition that
and
as
. If a random variable X has a pdf
, then
and the variance is
.
The corresponding estimator of pdf is
. (3)
This estimator is easy to use and similar to following kernels for comparison:
Gamma 1 and Gamma 2 kernels by Chen [7] are;
, (4)
and
. (5)
where
(6)
Beta kernel by Chen [7] is;
, (7)
where, B is Beta function.
Birnbaum-Saunders and Log-Normal kernels by Jin and Kawczak [9] are;
, (8)
and
(9)
Inverse Gaussian and Reciprocal Inverse Gaussian kernels by Scaillet [10] are;
, (10)
and
(11)
Erlang kernel by Salha, et al. [13] is;
(12)
Weibull kernel by Salha, et al. [11] is;
(13)
3. Bias, Variance and Optimal Bandwidth
Theorem 1 (Bias)
The bias of proposed estimator is given by;
(14)
Proof:
where
follows a Gumbel distribution with scale parameter
and shape parameter x.
The Taylor expansion about
for
is:
.
So,
.
.
Hence,
Theorem 2 (Variance)
The variance of the proposed estimator is given by:
(15)
Proof:
Let
be a Gumbel
distributed random variable. Hence
and
. We have
where,
. By Taylor expansion of
we get:
So,
Therefore,
Optimal Bandwidth
To proceed for optimal bandwidth, initially Mean Squared Error (MSE) and Mean Integrated Squared Error (MISE) are derived as;
As we know Mean square errors for Gumbel kernel estimator is
We can approximate MISE to be:
(16)
where,
and
.
To find optimal bandwidth, now we minimize Equation (16) with respect to h, so we have
(17)
Setting (17) equal zero yields an optimal bandwidth
for the given pdf and kernel:
4. Applications
In this section, the performance of the proposed estimators in estimating the pdf is observed through real life data as well as by simulation study.
4.1. Suicide Data Example
We take suicide data given in Silverman [5] to inspect the performance of new developed kernel. The data gives the lengths of the treatment spells (in days) of control patients in suicide study.
We used the logarithm of the data to draw Figure 1 using data driven bandwidth, named as normal scale rule (NSR) by Silverman [5]. The NSR is given by;
(18)
where R is inter-quartile range, which results in 0.4894. It can be observed that Gumbel kernel performed very well, especially near end points and free of boundary bias.
4.2. Flood Data Example
Further, we used the flood data given in Gumbel [14], to exhibit the practical performance of the Gumbel estimator. The data give the discharge per second of the Rhone River (Europe).
Here fixed bandwidth which is 1,000,000, is used. Figure 2 shows that the performance of new proposed kernel estimator, which is acceptable.
4.3. Simulation Study
In this section we wish to investigate the finite sample properties of the two asymmetric kernel estimators; Gumbel and Weibull, which belong to family of extreme value distributions. The experiments are based on 1000 random samples of length
,
and
. For each simulated sample and each estimator considered, mean squared errors (MSE) are reported in Table 1, for extreme value distributions, namely Frechet, Weibull and Gumbel distributions and various parameter values by using bandwidth given as [8].
Figure 1. The Gumbel kernel estimator for the suicide data.
Figure 2. The Gumbel kernel estimator for the flood data.
Here in Table 1, variety of randomly selected location parameter (small/medium/large) is examined with constant scale parameter. We may observe that the Gumbel kernel estimator performs better than Weibull kernel estimator unanimously almost for all density estimates with all different parameters and different sample sizes. For both Gumbel and Weibull kernel, MSEs decreased as sample size increased. In graphical representation, we present Gumbel kernel with real density. It can be examined in Figure 3 that the performance of
Figure 3. The Gumbel kernel estimator of the density functions with different distributions. (where solid line shows the real density and other line represents the density estimated by Gumbel kernel). (a) Gumbel (3, 1) (b) Frechet (3, 1, 1) (c) Weibull (25.713, 1).
the Gumbel estimator is acceptable at the boundary near the zero with different densities. In the interior, the behavior of the pdf estimator becomes more similar as we get away from zero in any extreme value distribution case.
5. Conclusion
In this paper, we have proposed a new kernel estimator for probability density functions for (iid) data [0, ∞), namely Gumbel kernel. Such densities are encountered in a wide variety of applications to describe extreme wind speeds, sea wave heights, floods, rainfall, age at death, minimum temperature, rainfall during droughts, electrical strength of materials, air pollution problems, geological problems, naval engineering etc. [4]. Gumbel kernel is free of boundary bias, non-negative, with natural varying shape. We showed that the bias depends on the smoothing parameter h and the estimated point x, and it goes to zero as h → 0, also it gets smaller for the values of x closed to zero. The variance of the new proposed kernel estimator was investigated, and we noticed that it depends also on h and x. On the other hand, it goes to zero as h → 0, and gets large at the values of x close to zero.
In addition, the performance of the proposed estimators is tested in three applications. In a simulation study, we used different densities of GEV distribution and compared it with Weibull (Extreme value distribution III) kernel estimator on basis of MSE. We observed that the performance of the proposed estimator is excellent, and gives a smaller MSE. Additionally, by using real data examples, we exhibited the practical performance of the new estimator.
From the above discussion, it can be concluded that one of the reason for adaptation of nonparametric method was to control the allocation of weights at boundary points. But boundary bias is still present if symmetrical kernels are used for curve estimation. In this situation, best alternative is to use asymmetrical kernel and Gumbel kernel is finest selection than Weibull kernel, comparatively.