Bias Correction Technique for Estimating Quantiles of Finite Populations under Simple Random Sampling without Replacement
1. Introduction
In recent years, the estimation of population distribution functions in survey sampling has received considerable attention. A particular focus has been the median, which is often considered a more appropriate measure of location than the mean, especially when the variable of interest follows a skewed distribution. Estimators of the population mean or total can typically be improved substantially when appropriate auxiliary information is available. Accordingly, the use of auxiliary information in sample quantile estimators seems highly desirable. Using known auxiliary information at both the selection stage and the estimation stage leads to better estimation strategies in survey sampling. If such information is not fully known or is missing, and information on the auxiliary variable(s) is relatively cheap to obtain, one may consider taking a large preliminary sample to estimate the population mean(s) of the auxiliary variable(s).
The traditional view of kernel estimation holds that the performance of kernel methods depends largely on the smoothing bandwidth and very little on the type of kernel. Most kernels used are symmetric and remain fixed once chosen. This may be adequate for estimating curves with unbounded support, but not for curves that have compact support and are discontinuous at the boundary points. For curves of this kind, a fixed kernel shape leads to boundary bias, which arises because the fixed symmetric kernel allocates weight outside the support of the distribution when smoothing close to the boundary. In addition, standard kernel methods yield wiggly estimates in the tail of the distribution, since reducing the boundary bias requires a small bandwidth that prevents the pooling of enough data. Moreover, as noted in [1] for probability density estimation, the standard kernel estimator “works well for densities not far from Gaussian in shape”; however, it can perform very poorly when the shape is far from Gaussian, particularly near the boundary.
Boundary bias is a well-known problem, and several authors have proposed ways to reduce it. In the context of nonparametric regression, [2] [3] [4] proposed the use of boundary kernels, while [5] used Richardson’s extrapolation to combine two kernel estimates with different bandwidths. In density estimation, [6] proposed data reflection, [7] considered empirical transformations, and [8] proposed a framework of jackknife methods for correcting boundary bias. More recently, [9] [10] showed that in nonparametric regression the local linear smoother is free of boundary bias and achieves the optimal convergence rate for the mean integrated squared error. It is interesting to note that, although a local linear smoother uses a fixed kernel in its initial form, the local least squares fit implicitly employs different kernels at different places. The transformation method is among the numerous methods suggested to deal with data on
. In order to minimize the boundary bias in the density estimation framework, [1] [11] [12], among others, studied general transformation methods. The transformation may operate under unique conditions and it is important to select the appropriate transformation by analyzing the subject matter and related studies.
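The boundary bias discussed above, and the data-reflection remedy attributed to [6], can be illustrated with a small numerical sketch. Everything specific here is an assumption made for illustration: the Gaussian kernel, the exponential test data (true density 1 at the boundary 0), the bandwidth, and the function names `kde` and `kde_reflected`.

```python
import numpy as np

def kde(x0, data, h):
    """Fixed symmetric (Gaussian) kernel density estimate at the point x0."""
    u = (x0 - data) / h
    return np.mean(np.exp(-0.5 * u**2)) / (h * np.sqrt(2.0 * np.pi))

def kde_reflected(x0, data, h):
    """Reflection estimate for data on [0, inf): add the estimate built from
    the mirror image of the data about the boundary 0."""
    return kde(x0, data, h) + kde(-x0, data, h)

rng = np.random.default_rng(0)        # seed chosen only for reproducibility
data = rng.exponential(1.0, 5000)     # assumed test data; true density at 0 is 1
h = 0.1                               # illustrative bandwidth, not tuned

plain_at_0 = kde(0.0, data, h)                  # roughly half the kernel mass lies below 0
reflected_at_0 = kde_reflected(0.0, data, h)    # the lost mass is folded back onto [0, inf)

print(f"plain estimate at 0: {plain_at_0:.2f}")
print(f"reflected estimate at 0: {reflected_at_0:.2f}")
```

The plain estimate at the boundary sits near half the true value because the symmetric kernel places weight outside the support, while reflection recovers most of that mass. Some bias remains, since the density has a nonzero slope at the boundary.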
The estimation of population quantiles is of great interest when a parametric form for the underlying distribution is not available. Quantile estimation plays an important role in a broad range of statistical applications: the Q-Q plot, goodness-of-fit testing, and the computation of extreme quantiles and value at risk in insurance and financial risk management. Also, a large class of actuarial risk measures can be defined as functionals of quantiles (see [13]). Most contributions estimate the pth quantile using a kernel function under simple random sampling (SRS); the reader is referred to [14] [15] [16].
Quantile estimation has been used intensively in many fields. Most existing quantile estimators suffer from either bias or inefficiency at high probability levels. To correct the bias, [17] suggested several nonparametric quantile estimators based on the beta kernel and applied them to transformed data. A Monte Carlo study showed that those estimators improve on the efficiency of the traditional ones, not only for light-tailed distributions but also for heavy-tailed ones, when the probability level is close to 1. [18] used a transformed kernel estimate; in their study, they overcame this inconsistency by using a new approach based on the modified Champernowne distribution, which behaves like the Pareto distribution.
The aim of this paper is therefore to develop a nonparametric estimator for the quantile function of finite populations, using a bias-corrected approach to address the shortcomings of previously studied estimation methods. This approach has two notable features: it ensures an accurate estimate, and it reduces the estimation bias with a negligible increase in variance.
The Multiplicative Bias Correction (MBC) approach was first considered in [19], and the results showed that the resulting estimator of the regression function had desirable properties compared to existing estimators, including resolving the boundary problem. This form of correction is especially well suited to non-negative regression functions because it does not change the sign of the regression function, and it reduces the estimation bias with a negligible increase in variance. Although there is always a bias-variance trade-off for nonparametric smoothers in finite samples, smoothers can be constructed whose asymptotic bias converges to zero while the asymptotic variance stays the same. For a deeper discussion of the Multiplicative Bias Correction technique, we refer the reader to [20] [21] [22] [23].
Outline of the paper
In Section 2, we propose an estimator for the finite population quantile function using a bias correction technique. Asymptotic properties of the proposed estimator are derived in Section 3. An empirical study is presented in Section 4, and conclusions are given in Section 5.
2. Proposed Estimator
In survey sampling, we are often interested in studying the distribution of a specific variable of interest, Y. An efficient way to describe the distribution function is through the quantiles of the distribution. By the pth quantile of the distribution, we mean the value Q, which would be
. One way of designing quantile estimators is to invert the estimator of the distribution function. Let
denote an estimator of
. Since the estimator
is often a step function, the form of the quantile estimator may not be smooth.
In this section we discuss a quantile estimator derived from a model-based multiplicative bias corrected distribution function estimator that incorporates auxiliary information. This distribution function estimator was introduced by [24]. The quantile estimator is obtained by inverting the distribution function estimator of [24]. We also derive a Bahadur representation for the quantile estimator.
Let
be a probability distribution function. The population quantile of order
is defined as
(1)
for
. If F is continuous and strictly increasing, then
is the unique solution to Equation (1). In general,
satisfies
(2)
or equivalently
(3)
Suppose that
for
are independent and identically distributed (i.i.d) random variables with conforming survey values
. By definition,
are independent, identically distributed random variables, each with common distribution function F. For all real t, the empirical population distribution function for
is defined to be
(4)
where
The sample quantile of order
is defined as
(5)
The sample quantile of order
is a strongly consistent estimator of
, unless
and
for some
(i.e., unless F is flat in a right neighborhood of
). See ( [25] ).
Theorem 1 ( [26] ). Let
be a random sample of size n with common distribution function F, and let
. If
is the unique solution of
, then
as
.
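The empirical distribution function and the sample quantile described above can be sketched in a few lines. This is a minimal illustration under the textbook convention that the sample quantile of order alpha is the smallest observed value whose empirical distribution function is at least alpha; the function names are hypothetical and the convention is assumed rather than taken from the displays above.

```python
import numpy as np

def empirical_cdf(t, y):
    """Proportion of observations less than or equal to t."""
    y = np.asarray(y, dtype=float)
    return float(np.mean(y <= t))

def sample_quantile(alpha, y):
    """Smallest observation t such that empirical_cdf(t, y) >= alpha, for 0 < alpha < 1."""
    y_sorted = np.sort(np.asarray(y, dtype=float))
    n = y_sorted.size
    k = max(int(np.ceil(alpha * n)), 1)   # rank of the order statistic to return
    return float(y_sorted[k - 1])

y = [3.0, 1.0, 4.0, 2.0]
print(empirical_cdf(2.0, y))      # two of the four values are <= 2
print(sample_quantile(0.5, y))    # smallest value whose empirical CDF reaches 0.5
```

Because the empirical distribution function is a step function, the resulting quantile estimate jumps between order statistics, which is the lack of smoothness noted earlier in this section.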
Suppose a sample s of n units is drawn by simple random sampling without replacement from a finite population, and let
denote the non-sampled units of the finite population. Let Y be the survey variable associated with an auxiliary variable X, where both are assumed to follow a superpopulation model under the model-based approach. A commonly used working model for the finite population is
(6)
where
is a known function of
that accounts for heteroscedasticity and
are independent and identically distributed (i.i.d) random variables with mean 0 and variance
,
and
Under model-based approach Equation (4) can be expressed as
(7)
where
represent the sampled part and is known while
is the non-sampled part which is unknown.
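For orientation, the split into a sampled and a non-sampled part is usually written as below in the model-based literature. The notation is an assumption made for illustration (r for the set of non-sampled units, I(·) for the indicator function, F_N for the finite population distribution function) and may differ from the symbols of Equation (7).

```latex
F_N(t) \;=\; \frac{1}{N}\left[\,\sum_{i \in s} I(y_i \le t) \;+\; \sum_{j \in r} I(y_j \le t)\right]
```

The first sum is computed directly from the observed sample, while the second involves unobserved values and has to be predicted from the working model.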
The problem is estimating the second term of Equation (7). To estimate Equation (7), [24] proposed a multiplicative bias corrected estimator for finite population distribution given by
(8)
where
is the model-based nonparametric estimator for
and
is the estimated distribution function of the residuals defined by
.
In this study, we propose a multiplicative bias corrected quantile estimator for finite population based on finite population distribution in Equation (8) given by
(9)
The problem is to estimate
for any
given. Thus, from the sample of n units of a population of size N, we observe
. The general method is formulated as follows: first obtain an estimator of the distribution function,
, and then estimate the quantile by taking the inverse.
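The two-step recipe above (estimate the distribution function, then invert it) can be sketched generically. The helper below is hypothetical: it accepts any distribution function estimator as a callable `cdf_hat`, evaluates it on a user-supplied grid, and returns the smallest grid point at which the estimate reaches the requested level. The grid-based inversion and the monotonicity safeguard are illustrative choices, not the authors' procedure.

```python
import numpy as np

def invert_cdf(alpha, cdf_hat, grid):
    """Return the smallest grid point t with cdf_hat(t) >= alpha."""
    grid = np.sort(np.asarray(grid, dtype=float))
    values = np.array([cdf_hat(t) for t in grid], dtype=float)
    values = np.maximum.accumulate(values)      # guard against small non-monotone wiggles
    idx = int(np.searchsorted(values, alpha, side="left"))
    idx = min(idx, grid.size - 1)               # level never reached: return the last grid point
    return float(grid[idx])

# Toy check with a known distribution function (uniform on [0, 1]).
uniform_cdf = lambda t: min(max(t, 0.0), 1.0)
grid = np.linspace(0.0, 1.0, 101)
q25 = invert_cdf(0.25, uniform_cdf, grid)
print(f"estimated 0.25-quantile: {q25:.2f}")
```

In practice `cdf_hat` would be the estimated finite population distribution function, and the grid could be the sorted sample values or a finer mesh over their range.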
3. Properties of Proposed Estimator
3.1. Asymptotic Unbiasedness
In simple random sampling, as
is a hypergeometrically distributed variable, then
(10)
and
(11)
If the sample size n is sufficiently large then
is approximately normal.
Theorem 2 ( [27] ). Let x be in the interval
containing
as an interior point. Then the sample quantile,
(12)
with
uniformly in
for
in
, where
Proof: See [27].
We now study the properties of the
estimator. For this, a linear approximation is needed because
is not a continuous function. The estimator
can be expressed asymptotically as a linear function of the estimated distribution function evaluated at the quantile
by the Bahadur representation (see [28] ) together with the results from Theorem 2 above.
Let
be the multiplicative bias corrected distribution function of the density
.
Theorem 3 (Taylor’s Theorem). Let
be an integer and let the function
be k times differentiable at the point
. Then there exists a function
such that
(13)
and
. This is called the Peano form of the remainder.
Then using Taylor series expansion of the function
around
we can write
(14)
where
, according to [29] since
contains two derivatives in a
neighborhood, this neighborhood is bound by the second derivative and
is positive.
Then solving for
in Equation (14) we obtain the Bahadur representation
(15)
Moreover, it can be shown that
(16)
Substituting the above results of Equation (16) in Equation (15) yields
(17)
where
denotes the derivative of the limiting value of
as
and
.
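For readers who want the general shape of such a linearization, the classical Bahadur-type representation is commonly written as follows. The symbols are assumptions chosen for illustration, since they are not fixed by the text here: \hat{Q}(\alpha) is the quantile estimator, Q(\alpha) the true quantile, \hat{F} the estimated distribution function, f the limiting density, and R_n a remainder that is negligible relative to the leading term.

```latex
\hat{Q}(\alpha) - Q(\alpha) \;=\; \frac{\alpha - \hat{F}\bigl(Q(\alpha)\bigr)}{f\bigl(Q(\alpha)\bigr)} \;+\; R_n
```

Under this form, the bias and variance of the quantile estimator follow, to first order, from those of the distribution function estimator evaluated at the true quantile, scaled by the density and its square respectively.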
The linear approximation previously used by [30] [31] helps to study the asymptotic properties of the estimator. On the other hand, the estimator
is asymptotically unbiased because
is an unbiased estimator of
(see [24] ). In this way
(18)
but
and by using Equation (18) it can be seen that
(19)
The bias of
is of order
. Thus, it converges to zero at a faster rate. Therefore,
is asymptotically unbiased.
3.2. Asymptotic Variance
The asymptotic variance of
is obtained as follows. Consider the Bahadur representation:
(20)
Taking the variance of both sides of Equation (20), we have
(21)
3.3. Asymptotic Mean Squared Error
The asymptotic mean squared error of the estimator
is given by
(22)
Substituting Equations (19) and (21) we get
(23)
Equation (23) tends to zero as
and thus
. This shows that
is asymptotically consistent.
4. Empirical Study
The main purpose of this section is to compare the performance of the proposed estimator MBCQE with the existing quantile estimators: RKMQE, CDQE, FAQE and NWQE. In this study, two populations are considered, which are generated from the regression model given by
where
with the following mean functions described in Table 1.
Table 1. Mean functions used in the simulation study.
A population of 1000 auxiliary values
is generated as independent and identically distributed uniform random variables,
. The mean functions represent a class of correct and incorrect model specifications for the estimators being considered. The errors are assumed to be independent and identically distributed (i.i.d.) normal random variables with mean 0 and standard deviation
. Each population contains 1000 units. The population values
are generated from the mean functions by adding the errors
in each of the cases. For each case, 1000 samples are drawn using simple random sampling without replacement.
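The simulation design described above can be sketched as follows. Only the population size (1000) and the number of simulated samples (1000) come from the text. The uniform range, the error standard deviation, the sample size and the coefficients of the linear mean function are hypothetical placeholders, and the naive sample median stands in for the estimators under comparison.

```python
import numpy as np

N_POP = 1000       # population size (from the text)
N_REPS = 1000      # number of simulated samples (from the text)
N_SAMPLE = 400     # assumed sample size
SIGMA = 0.1        # assumed error standard deviation

def linear_mean(x):
    """Hypothetical linear mean function; coefficients are not from the paper."""
    return 1.0 + 2.0 * x

def inverse_cdf_quantile(values, alpha):
    """Smallest value whose empirical distribution function is at least alpha."""
    v = np.sort(np.asarray(values, dtype=float))
    k = max(int(np.ceil(alpha * v.size)), 1)
    return float(v[k - 1])

rng = np.random.default_rng(2024)                     # seed for reproducibility only
x = rng.uniform(0.0, 1.0, N_POP)                      # assumed U(0, 1) auxiliary values
y = linear_mean(x) + rng.normal(0.0, SIGMA, N_POP)    # population values
pop_median = inverse_cdf_quantile(y, 0.5)

sample_medians = np.empty(N_REPS)
for rep in range(N_REPS):
    # Simple random sampling without replacement: draw unit labels, no repeats.
    sample = rng.choice(N_POP, size=N_SAMPLE, replace=False)
    sample_medians[rep] = inverse_cdf_quantile(y[sample], 0.5)

print(f"population median: {pop_median:.3f}")
print(f"mean of sample medians: {sample_medians.mean():.3f}")
```

Replacing the naive median with each estimator under study, and repeating for each mean function and quantile level, gives the replicated estimates from which bias and error summaries are computed.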
Nadaraya-Watson kernel weights are used in the smoothing of
to obtain the rough estimator,
, of the mean function
. A
ratio
is evaluated and is smoothed further to obtain the correction
factor
which is then used together with the rough estimator to obtain the multiplicative bias corrected estimator,
, of the mean function.
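The three smoothing steps just described (rough Nadaraya-Watson fit, ratio, smoothed correction factor) can be sketched as below. The Gaussian kernel, the bandwidth, the test data and the function names `nw_smooth` and `mbc_mean` are all assumptions for illustration. The sketch also presumes a strictly positive mean function, so that the ratio is well defined.

```python
import numpy as np

def nw_smooth(x_eval, x, values, h):
    """Nadaraya-Watson smoother of `values` against `x`, evaluated at x_eval."""
    x_eval = np.atleast_1d(np.asarray(x_eval, dtype=float))
    w = np.exp(-0.5 * ((x_eval[:, None] - x[None, :]) / h) ** 2)   # Gaussian kernel weights
    return (w @ values) / w.sum(axis=1)

def mbc_mean(x_eval, x, y, h):
    """Multiplicative bias corrected estimate of the mean function at x_eval."""
    rough_at_data = nw_smooth(x, x, y, h)         # step 1: rough estimator at the sample points
    ratio = y / rough_at_data                     # step 2: ratio of responses to the rough fit
    correction = nw_smooth(x_eval, x, ratio, h)   # step 3: smooth the ratio
    return nw_smooth(x_eval, x, y, h) * correction

rng = np.random.default_rng(1)                    # seed for reproducibility only
x = rng.uniform(0.0, 1.0, 400)
true_mean = lambda t: 1.0 + 2.0 * t               # hypothetical positive mean function
y = true_mean(x) + rng.normal(0.0, 0.1, x.size)
h = 0.1                                           # illustrative bandwidth, not tuned

rough_at_0 = nw_smooth(0.0, x, y, h)[0]
mbc_at_0 = mbc_mean(0.0, x, y, h)[0]
print(f"true value at 0: {true_mean(0.0):.2f}")
print(f"rough NW estimate at 0: {rough_at_0:.2f}")
print(f"MBC estimate at 0: {mbc_at_0:.2f}")
```

At the boundary point the rough Nadaraya-Watson fit is pulled toward the interior values, and in this setup the multiplicative correction moves the estimate back toward the true value.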
The existing estimators for quantile functions for finite populations that were used for comparison with our developed estimator Multiplicative Bias Corrected Quantile Estimator (MBCQE);
are:
1) Chambers and Dunstan Quantile Estimator (CDQE):
2) Nadaraya-Watson Quantile Estimator (NWQE):
3) Rao-Kovar-Mantel Quantile Estimator (RKMQE):
4) Dorfman and Hall Quantile Estimator (FAQE):
The results of this simulation study are summarized in Table 2. Table 2 shows the unconditional biases, Relative Mean Error (RME) and Relative Root Mean Squared Error (RRMSE) for the estimators at various values of the quantile
(i.e. 0.25, 0.5 and 0.75). Linear and cosine mean functions were used to obtain the tabulated results. Similar results and conclusions can be obtained using other mean functions such as quadratic, sine and bump.
Table 2. Unconditional biases, relative mean errors and relative root mean squared errors.
To analyze the performance of the proposed estimator against the other estimators, the unconditional Relative Mean Error and Relative Root Mean Squared Error for the estimator
are computed as
(24)
and
(25)
where
is the quantile corresponding to the sth simulated sample
and N is the number of replications. The RME measures how close the estimator is to the actual value, while the RRMSE measures the accuracy of the estimator. For instance, MBCQE is said to be “better” than, or preferable to, the other estimators if its RRMSE is comparatively smaller.
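A minimal sketch of how such summaries might be computed from replicated estimates is given below. The exact forms of Equations (24) and (25) are not reproduced here, so the formulas in the code are assumed conventional definitions: the average error divided by the true quantile for RME, and the root of the average squared error divided by the true quantile for RRMSE. The function names and the toy numbers are hypothetical.

```python
import numpy as np

def relative_mean_error(estimates, true_value):
    """Assumed form: average of (estimate - truth), divided by the truth."""
    estimates = np.asarray(estimates, dtype=float)
    return float(np.mean(estimates - true_value) / true_value)

def relative_rmse(estimates, true_value):
    """Assumed form: root of the average squared error, divided by the truth."""
    estimates = np.asarray(estimates, dtype=float)
    return float(np.sqrt(np.mean((estimates - true_value) ** 2)) / true_value)

# Toy replicated estimates of a quantile whose true value is 10.
estimates = [9.0, 11.0, 10.0, 12.0]
rme = relative_mean_error(estimates, 10.0)
rrmse = relative_rmse(estimates, 10.0)
print(f"RME = {rme:.3f}, RRMSE = {rrmse:.3f}")
```

Under these definitions an estimator with errors that cancel on average can have an RME near zero and still a large RRMSE, which is why both summaries are reported.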
The bias of a quantile estimator is the deviation of the expected value of the estimator from the true quantile value. All of the quantile estimators considered here are biased, but MBCQE exhibits a comparatively smaller bias. MBCQE is seen to be a very efficient estimator of the empirical quantile function at all levels of α, followed closely by RKMQE and FAQE. CDQE proved to be a very inefficient estimator at all levels of α.
Further comparison of the estimators with respect to the empirical quantile function affirmed the results tabulated above. Table 3 and Table 4 tabulate the quantile estimates for all the estimators.
Table 3. Quantile estimates for linear mean function.
Table 4. Quantile estimates for cosine mean function.
CDQE overestimates the empirical quantile function at all points, while MBCQE gives an almost perfect estimate of the empirical quantile function. On the other hand, NWQE underestimates the true quantile function at some points towards the lower tail, while it overestimates the same function at other points along the upper tail.
The conditional performance of the proposed estimator was also examined and compared with that of the other existing quantile estimators. To do this, 200 random samples, each of size 400, were selected and the mean of the auxiliary values
was computed for each sample to obtain 200 values of
. These sample means were then sorted in ascending order and grouped into clusters of size 20, giving a total of 10 groups. The group means of the sample means of the auxiliary variable were then calculated to get
. Empirical means and biases were then computed for all the estimators RKMQE, CDQE, FAQE, NWQE and MBCQE. The conditional biases were plotted against
to provide a good understanding of the pattern generated. Figures 1-6 show the behavior of the conditional biases, relative absolute biases and mean squared error realized by all the estimators of quantile functions under linear and cosine mean functions at various values of the quantile
(i.e. 0.25, 0.5 and 0.75).
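The grouping procedure for the conditional analysis can be sketched as follows. The counts (200 samples of size 400, groups of 20, population of 1000) follow the text. The population model, the seed and the use of the naive sample median as a stand-in estimator are assumptions for illustration.

```python
import numpy as np

N_POP, N_SAMPLE = 1000, 400        # population and sample sizes (from the text)
N_SAMPLES, GROUP_SIZE = 200, 20    # number of samples and group size (from the text)

rng = np.random.default_rng(7)                         # seed for reproducibility only
x = rng.uniform(0.0, 1.0, N_POP)                       # assumed U(0, 1) auxiliary values
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, N_POP)        # hypothetical linear population
true_median = np.median(y)

xbar = np.empty(N_SAMPLES)         # sample mean of the auxiliary variable
estimate = np.empty(N_SAMPLES)     # quantile estimate from each sample
for k in range(N_SAMPLES):
    s = rng.choice(N_POP, size=N_SAMPLE, replace=False)   # SRSWOR
    xbar[k] = x[s].mean()
    estimate[k] = np.median(y[s])  # stand-in for any of the compared estimators

order = np.argsort(xbar)                                   # sort samples by their x-mean
groups = order.reshape(N_SAMPLES // GROUP_SIZE, GROUP_SIZE)
group_xbar = xbar[groups].mean(axis=1)                     # group means of the sample means
conditional_bias = estimate[groups].mean(axis=1) - true_median

for gx, cb in zip(group_xbar, conditional_bias):
    print(f"group mean of x-bar = {gx:.3f}, conditional bias = {cb:+.4f}")
```

Plotting `conditional_bias` against `group_xbar` for each estimator gives the kind of display summarized in the figures, with the horizontal line at zero marking no conditional bias.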
In most cases, there are significant differences among the bias characteristics of the various estimators. A detailed examination of the plots reveals that MBCQE and RKMQE have lower levels of bias overall, as indicated by the proximity of the plotted curves to the horizontal (no bias) line at 0.0 on the vertical axis. Interestingly, despite the rather entangled nature of some of the plots, MBCQE emerges clearly as the least biased estimator for nearly every group mean of the auxiliary-variable sample means and every quantile level. For the median, several estimators exhibit identical bias, and for most of the estimators, bias is not symmetrical with respect to quantile level.
Figure 1. Conditional biases, RAB and MSE for the estimators using a linear mean function at
.
Figure 2. Conditional biases, RAB and MSE for the estimators using a linear mean function at
.
Figure 3. Conditional biases, RAB and MSE for the estimators using a linear mean function at
.
Figure 4. Conditional biases, RAB and MSE for the estimators using a cosine mean function at
.
Figure 5. Conditional biases, RAB and MSE for the estimators using a cosine mean function at
.
Figure 6. Conditional biases, RAB and MSE for the estimators using a cosine mean function at
.
Plots of Conditional MSE versus group means of the means of auxiliary variables similarly reveal coincident behavior for the quantiles. MBCQE and RKMQE produce generally the lowest MSE values. In particular, MBCQE yields the lowest MSE in most cases among all other estimators. MBCQE is consistently better than all other estimators for both bias and MSE. All of these estimators are asymptotically unbiased and they all exhibit MSE consistency in that the MSE values tend toward zero as sample size increases.
From the plots it can be seen that MBCQE and RKMQE performed comparably and better than all the other estimators of the true quantile function, and that sample balancing does not affect the performance of the estimators.
5. Conclusions and Suggestions
In conclusion, based on the results in Tables 2-4 and Figures 1-6, MBCQE was found to be an efficient estimator of the quantile function for a finite population. NWQE was found to be the least efficient of all the estimators, with large conditional bias, relative absolute bias and mean squared error compared to the other estimators. MBCQE can therefore be used to estimate quantile functions for various units in the population in various sectors of the economy. Finally, further work can be done on constructing confidence intervals for the proposed estimator, and on investigating other bias correction strategies, such as adaptive boosting and bootstrap bias reduction, in quantile function estimation.
Acknowledgements
Sincere thanks to the Pan-African University Institute of Basic Sciences, Technology and Innovation (PAUSTI) for funding this research.