Local Polynomial Regression Estimator of the Finite Population Total under Stratified Random Sampling: A Model-Based Approach
1. Introduction
The main objective of a sample survey is to obtain information about a population and to use that information to make inferences about population quantities. The information most often sought consists of aggregate values of population characteristics, such as totals, the number of units, or the proportion of units having certain attributes. Such information can be collected either by a census or by sampling. One approach to using auxiliary information in the construction of estimators is to assume a working model that describes the relationship between the survey variable and the auxiliary variable. Estimators are then derived from this model and are sought to have good efficiency when the model is true. In most cases, a linear model is assumed. Generalized regression estimators by [1] and [2], including linear regression estimators and ratio estimators by [3], best linear unbiased estimators by [4] and [5], and post-stratification estimators by [6] are all derived under the assumption of a linear model. When the linear model fails, the resulting estimators may not outperform purely design-based estimators. For this reason, [7] proposed a class of estimators in which the working model is a nonlinear parametric model. Improving the efficiency of such estimators, however, requires prior information about the exact parametric structure of the population. Because of these concerns, several researchers have considered nonparametric working models instead. Nonparametric regression may be used to estimate unknown finite population quantities such as totals, means or proportions. The idea of nonparametric regression originates in the work of [8] and [9]. Nonparametric estimation is often more robust and flexible than inference based on parametric regression models or on design probabilities (as in design-based inference) [10]. In sample surveys, auxiliary information is used at the estimation stage to increase the precision of estimators of finite population quantities such as the total or mean [11] [12] [13].
A variety of approaches exist for constructing more efficient estimators of the population total or mean, including model-based and design-based methods. The model-based approach in sample surveys rests on a superpopulation model: the population under study is assumed to be a realization of random variables that follow this model. The model is used to predict the nonsampled values of the population, and hence finite population quantities such as the total or mean [13]. Nonparametric working models were first considered by [14] within a model-assisted approach, yielding a local polynomial regression estimator that generalizes the ordinary generalized regression estimator. Their simulation study shows that the proposed estimator performs relatively better than other parametric estimators. [13] improved on the estimator of [14] and developed a model-based local polynomial regression estimator applicable to direct sampling designs such as simple random sampling and systematic sampling. That estimator outperforms both the model-assisted estimator of [14] and other parametric estimators.
In this paper, auxiliary information is used to construct an estimator of the finite population total using nonparametric regression under stratified random sampling. To achieve this, a model-based approach is adopted in which local polynomial regression is used to predict the nonsampled values of the survey variable y. Stratified estimators of the finite population total or mean have been shown to be more efficient than those based on simple random sampling [15] [16]. Additionally, it has been shown in the literature that the local polynomial approximation method has several attractive features, including satisfactory boundary behaviour, easy interpretability, applicability to a variety of design circumstances and good minimax properties (see [17] [18] and [19]).
2. Proposed Estimator
Consider a population consisting of N units. Suppose this population is divided into H disjoint strata, each of size $N_h$, $h = 1, \dots, H$.
Let $Y_{hj}$ be the survey measurement for the $j$th unit in the $h$th stratum. Further, let $X_{hj}$ be the auxiliary measurement positively correlated with $Y_{hj}$.
From each stratum, a simple random sample of size $n_h$ is selected without replacement, where $n_h$ is sufficiently large with respect to $N_h$ and $\sum_{h=1}^{H} n_h = n$.
Let $s_h$ be the sample in the $h$th stratum and $r_h$ be the nonsampled set in the $h$th stratum.
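The population total is defined as
$$Y = \sum_{h=1}^{H} \sum_{j \in s_h} Y_{hj} + \sum_{h=1}^{H} \sum_{j \in r_h} Y_{hj} \qquad (1)$$
which can be rewritten as
$$Y = Y_s + Y_r \qquad (2)$$
where $Y_s = \sum_{h=1}^{H} \sum_{j \in s_h} Y_{hj}$ and $Y_r = \sum_{h=1}^{H} \sum_{j \in r_h} Y_{hj}$, with $s_h$ and $r_h$ indexing the sampled and nonsampled units of stratum $h$.
Once the sample has been observed, the problem of estimating Y becomes the problem of predicting the sum of the nonsampled $Y_{hj}$. Usually, inference is made using the known sample and the assumed model.
The first component in Equation (1) is known, while the second requires prediction, which is the focus of this paper. Local polynomial regression will be used to predict the unknown $Y_{hj}$, $j \in r_h$.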
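Suppose the distribution generating $Y_{hj}$ is given by the superpopulation model in which
$$Y_{hj} = m(X_{hj}) + \varepsilon_{hj} \qquad (3)$$
where the $\varepsilon_{hj}$ are independently distributed random variables with mean 0 and variance $\sigma^2(X_{hj})$.
Then it follows that
$$E(Y_{hj} \mid X_{hj} = x_{hj}) = m(x_{hj}) \qquad (4)$$
$$\mathrm{Var}(Y_{hj} \mid X_{hj} = x_{hj}) = \sigma^2(x_{hj}) \qquad (5)$$
where $m(\cdot)$ and $\sigma^2(\cdot)$ are assumed to be continuous and twice differentiable functions of x, and $\sigma^2(x) > 0$.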
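In practice, the values of $m(x_{hj})$ are unknown and so require prediction. Following the ideas of [13] [14] and [20], we use local polynomial regression of degree p, which is a generalization of kernel smoothing, to predict the unobserved $Y_{hj}$ in Equation (1). Let $K_b(u) = b^{-1} K(u/b)$, where K denotes a continuous kernel function and b is the bandwidth.
Then a model-based local polynomial regression estimator of the nonsampled $Y_{hj}$ in the $h$th stratum is given by:
$$\hat{m}(x_{hj}) = e_1^{T} \left( X_{hj}^{T} W_{hj} X_{hj} \right)^{-1} X_{hj}^{T} W_{hj} Y_{s_h} \qquad (6)$$
where $e_1 = (1, 0, \dots, 0)^{T}$ is a column vector of length $p + 1$; $Y_{s_h} = [Y_{hi}]_{i \in s_h}$; $X_{hj} = [1, (x_{hi} - x_{hj}), \dots, (x_{hi} - x_{hj})^{p}]_{i \in s_h}$ and $W_{hj} = \mathrm{diag}\{K_b(x_{hi} - x_{hj})\}_{i \in s_h}$. Equation (6) holds as long as $X_{hj}^{T} W_{hj} X_{hj}$ is a nonsingular matrix.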
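To make estimator (6) concrete, the following is a minimal Python sketch of the degree p = 1 (local linear) case, for which the kernel-weighted least squares problem has a closed-form solution. The function names, the use of the Epanechnikov kernel at this point, and the singularity tolerance are illustrative assumptions and are not taken from the original study.

```python
def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def local_linear_predict(x0, xs, ys, b):
    """Local linear (degree p = 1) prediction at x0 from sampled pairs (xs, ys).

    Minimises sum_i K_b(x_i - x0) * (y_i - b0 - b1 * (x_i - x0))^2 and
    returns the fitted intercept b0, which is the prediction at x0.
    """
    w = [epanechnikov((x - x0) / b) / b for x in xs]   # kernel weights K_b
    d = [x - x0 for x in xs]                           # centred design column
    # Entries of X'WX and X'WY for the 2 x 2 normal equations.
    s0 = sum(w)
    s1 = sum(wi * di for wi, di in zip(w, d))
    s2 = sum(wi * di * di for wi, di in zip(w, d))
    t0 = sum(wi * yi for wi, yi in zip(w, ys))
    t1 = sum(wi * di * yi for wi, di, yi in zip(w, d, ys))
    det = s0 * s2 - s1 * s1
    if abs(det) < 1e-12:  # tolerance is an arbitrary illustrative choice
        raise ValueError("X'WX is singular; a larger bandwidth b is needed")
    # First component of (X'WX)^{-1} X'WY, i.e. e_1' applied to the solution.
    return (s2 * t0 - s1 * t1) / det
```

On exactly linear data the sketch reproduces the line at any interior point, which reflects the fact that a local linear fit is unbiased for linear mean functions. The singularity check corresponds to the requirement that the weighted design matrix product be nonsingular.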
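Denote the estimator of the finite population total by $\hat{Y}_{LP}$ and the estimator within the $h$th stratum by $\hat{Y}_{h}$. Then, in stratum h, the estimator of the population total based on local polynomial regression is
$$\hat{Y}_{h} = \sum_{j \in s_h} Y_{hj} + \sum_{j \in r_h} \hat{m}(x_{hj}) \qquad (7)$$
and the estimator for the finite population total is
$$\hat{Y}_{LP} = \sum_{h=1}^{H} \hat{Y}_{h} \qquad (8)$$
with $\hat{m}(x_{hj})$ as given in Equation (6).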
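The structure of the stratified estimator, observed sample values plus predicted nonsampled values summed over strata, can be sketched as follows. This is an illustration only: the dictionary layout (`x_s`, `y_s`, `x_r`) and the `predict` callback are hypothetical names introduced here, and any pointwise predictor (for instance a local polynomial fit with a fixed bandwidth) can be plugged in.

```python
def stratified_total(strata, predict):
    """Model-based estimate of the population total under stratification.

    strata  : list of dicts, one per stratum, with keys
              'x_s' (auxiliary values of sampled units),
              'y_s' (survey values of sampled units),
              'x_r' (auxiliary values of nonsampled units).
    predict : callable (x0, xs, ys) -> predicted survey value at x0,
              fitted from the sampled pairs of the same stratum.
    """
    total = 0.0
    for stratum in strata:
        observed = sum(stratum["y_s"])  # known part of the stratum total
        predicted = sum(
            predict(x0, stratum["x_s"], stratum["y_s"])
            for x0 in stratum["x_r"]
        )  # predicted part: one prediction per nonsampled unit
        total += observed + predicted
    return total
```

Passing the predictor as an argument keeps the summation logic independent of the smoothing method, so the same function can be used to compare different model-based predictors on identical samples.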
3. Properties of Proposed Estimator
In this section, a study is carried out on various properties of estimator (8), which may be important in practice. In doing so, the following assumptions are made:
1) The regression function $m(x)$ has a bounded second derivative.
2) The marginal density $f(x)$ of the auxiliary variable is continuous and $f(x) > 0$.
3) The conditional variance $\sigma^2(x)$ is bounded and continuous.
4) The kernel density function $K(\cdot)$ is bounded and continuous, and satisfies the conditions imposed in [18].
These conditions on $K(\cdot)$ are made for the convenience of the technical arguments and can therefore be relaxed.
3.1. The Proposed Estimator Is Asymptotically Model-Unbiased
Now consider the difference:
(9)
(10)
(11)
and taking expectation yields
(12)
(13)
since ![]()
i.e.
(14)
which is the bias associated with
.
Approximating
by Taylor series expansion about a point
and assuming further that
and
, then observe that
(15)
Letting
, then
(16)
(17)
and applying expectations then
(18)
By Theorem 3 of [21], under conditions (1)-(4), if
and
,
(19)
(20)
So that
(21)
It implies that
provided that
and
, and thus
is asymptotically model-unbiased.
3.2. Mean Square Error (MSE) of the Proposed Estimator
The estimator (8) has the MSE
(22)
which can be decomposed as
(23)
By Theorem 1 of [18], under Condition (1), if
then
(24)
Observe that Equation (24) tends to zero if
and
and thus
.
This shows that
is statistically consistent and thus useful.
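For orientation, the decomposition used in this subsection is presumably the standard split of the mean square error into a variance term and a squared bias term. Writing $\hat{Y}$ for estimator (8), a notation assumed here, and taking expectations under the superpopulation model, the identity reads:

```latex
\mathrm{MSE}(\hat{Y})
  = E\left[ (\hat{Y} - Y)^{2} \right]
  = \mathrm{Var}(\hat{Y} - Y) + \left[ E(\hat{Y} - Y) \right]^{2}
```

Consistency in the mean square sense then requires both terms to vanish asymptotically: the squared bias by the argument of Section 3.1, and the variance under the bandwidth and sample size conditions stated above.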
4. Simulation Study
In this section, a study is carried out on the practical performance of several estimators (see Table 1 and Table 2 for the estimators).
The first estimator is design-based, the second one is parametric and model-based while the last two are nonparametric and model-based.
4.1. Description of the Population
The working model is taken to be
,
. In this study, four populations are considered, which are generated from the regression model given by
(25)
with the following mean functions
(26)
(27)
(28)
(29)
with
. They represent a class of correct and incorrect model specifications for the estimators being considered. For
,
is expected to be the best estimator, since the model assumed is correctly specified. The rest of the mean functions:
,
and
represent various deviations from the linear model,
. These populations are plotted in Figure 1. For more on these populations, see [13] and [14] .
The errors are assumed to be independent and identically distributed (i.i.d.) normal random variables with mean 0 and standard deviation $\sigma$. Each population contains 2000 units, and the auxiliary values $x$ are simulated as i.i.d. uniform random variables. The population values $y$ are generated from the mean functions by adding the errors $\varepsilon$ in each case. Each population is divided into 10 disjoint strata of equal size, made as homogeneous as possible so that units within a stratum vary little from each other. A sample of size $n$ is then taken, with stratum $h$ contributing a sample of size $n_h$, $h = 1, \dots, 10$. For each case, 1000 samples are simulated using simple random sampling without replacement.
Table 1. Estimators being compared in the simulation study.
Figure 1. Plot of linear, sine, bump and jump populations.
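The population and sampling set-up described above can be sketched in Python as follows. Several details are assumptions made purely for illustration: the linear mean function, the error standard deviation of 0.1, the uniform range (0, 1), stratification by sorting on the auxiliary variable, and the per-stratum sample size used in the usage note. None of these values is taken from the original study, whose computations were done in R.

```python
import random


def make_population(n_units=2000, sigma=0.1, seed=1):
    """Generate (x, y) pairs sorted by x; mean function is hypothetical."""
    rng = random.Random(seed)
    xs = sorted(rng.random() for _ in range(n_units))  # uniform on (0, 1)
    # Illustrative linear mean function plus i.i.d. normal errors.
    ys = [1.0 + 2.0 * (x - 0.5) + rng.gauss(0.0, sigma) for x in xs]
    return xs, ys


def stratify(xs, ys, n_strata=10):
    """Split units (already sorted by x) into equal-sized contiguous strata."""
    if len(xs) % n_strata != 0:
        raise ValueError("population size must be divisible by n_strata")
    size = len(xs) // n_strata
    return [
        list(zip(xs[k * size:(k + 1) * size], ys[k * size:(k + 1) * size]))
        for k in range(n_strata)
    ]


def draw_sample(strata, n_h, rng):
    """Simple random sampling without replacement within each stratum."""
    sampled, nonsampled = [], []
    for units in strata:
        chosen = set(rng.sample(range(len(units)), n_h))
        sampled.append([u for i, u in enumerate(units) if i in chosen])
        nonsampled.append([u for i, u in enumerate(units) if i not in chosen])
    return sampled, nonsampled
```

Calling `draw_sample(stratify(*make_population()), 20, random.Random(0))` would give one stratified sample; repeating the call with fresh random states yields the replicated samples over which performance measures are averaged. Sorting on x before cutting into strata is one simple way of making strata homogeneous in the auxiliary variable.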
The Epanechnikov kernel,
$$K(u) = \tfrac{3}{4} (1 - u^{2}) \, I(|u| \le 1) \qquad (30)$$
is used for kernel smoothing on each of the populations. In each case, bandwidth values
(see [20] ) (with
),
,
and
(see [15]) are considered.
The data simulation, the estimation and all computations were carried out in R.
To analyze the performance of the proposed estimator against some specified estimators, relative absolute bias (RAB) is computed as
(31)
and the relative efficiency (RE) with respect to the Horvitz-Thompson (HT) estimator is computed as
(32)
where $\hat{Y}$ denotes the estimator of the finite population total being considered, $Y$ is the true population total and $R$ is the number of replications.
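The two performance measures can be computed from replicated estimates as in the sketch below. Because the displayed formulas are not legible in this copy, the definitions coded here are assumptions: RAB is taken as the absolute difference between the average estimate and the true total, relative to the true total, and RE as the ratio of the empirical MSE of an estimator to that of the Horvitz-Thompson estimator, so that smaller values are better.

```python
def relative_absolute_bias(estimates, true_total):
    """|mean of replicated estimates - Y| / |Y| (assumed definition)."""
    mean_estimate = sum(estimates) / len(estimates)
    return abs(mean_estimate - true_total) / abs(true_total)


def empirical_mse(estimates, true_total):
    """Average squared error of the replicated estimates around Y."""
    return sum((e - true_total) ** 2 for e in estimates) / len(estimates)


def relative_efficiency(estimates, ht_estimates, true_total):
    """MSE of an estimator divided by the MSE of the HT estimator."""
    return empirical_mse(estimates, true_total) / empirical_mse(
        ht_estimates, true_total
    )
```

Under this convention an RE below one indicates that the estimator is more efficient than the Horvitz-Thompson estimator over the simulated replications.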
The relative efficiency (RE) is meant to examine the robustness of the various estimators against the proposed estimator.
The confidence intervals (CI) and the average lengths (AL) of the confidence intervals of various estimators are also computed as follows:
(33)
(34)
where
and
are the upper and lower confidence limits respectively;
and R are as defined earlier.
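A minimal sketch of the interval computations follows. It assumes normal-theory intervals of the form estimate plus or minus a normal quantile times a standard error; the source does not show how the limits were obtained, so the interval form, the 95% default level and the function names are all assumptions.

```python
from statistics import NormalDist


def confidence_limits(estimate, std_error, level=0.95):
    """Return (lower, upper) normal-theory confidence limits."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # two-sided normal quantile
    return estimate - z * std_error, estimate + z * std_error


def average_length(limits):
    """Average of (upper - lower) over a list of (lower, upper) pairs."""
    return sum(upper - lower for lower, upper in limits) / len(limits)
```

Applying `confidence_limits` to each replicate and passing the resulting pairs to `average_length` gives the average interval length over the R replications.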
4.2. Results
The results of this simulation study are summarized in Table 3 and Table 4. For each of the four populations, the performance of each estimator is analyzed using the RAB and RE. The RAB measures how close the estimator is to the true value, while the RE is used to check the robustness of the estimator. For instance, an estimator $\hat{Y}_a$ is said to be “better” than, or preferable to, another estimator $\hat{Y}_b$ if its RE is smaller, that is, if $\mathrm{RE}(\hat{Y}_a) < \mathrm{RE}(\hat{Y}_b)$.
Table 2. Summary of the formulae used in computing the respective population totals of the various estimators.
The confidence intervals and average length of the intervals are also measured for each case. A smaller length is better because it implies that the true population total is captured within a smaller range and therefore results are more precise.
The estimators
and
are tested under the same bandwidth choice i.e.
(with
),
,
and
. Results of this simulation are shown in Table 3 and Table 4 below.
Table 3 shows the RAB’s and RE’s of the various estimators with respect to the Horvitz-Thompson estimator (
). Table 4 shows the confidence intervals and their average lengths.
In most scenarios,
is better than the parametric estimators, but the parametric estimator,
, performs best when the model is correctly specified, as Table 3 shows. This occurs both in the linear and the bump populations, where in the former, a strong linear relationship holds between the variables while in the latter, the function is linear over most of its range despite a “bump” for a small part of the range of
.
When the model is completely misspecified as in the Sine and Jump populations, a greater efficiency can be achieved by the nonparametric regression estimators. This can be seen in Table 3 for the Sine and Jump populations: the nonparametric estimators (
and
) are more efficient than their parametric counterpart,
.
When the underlying superpopulation model is completely unknown, a reasonable choice for finite population total estimation would be the nonparametric estimators such as
and
with small bandwidth choices. This can be seen in Table 3 and Table 4.
In this study,
is sometimes seen to perform much better than, and never worse than,
, and hence the proposed estimator,
emerges as the best performing among the nonparametric estimators being considered here (see Table 3). A good overall performance is observed with the proposed estimator, with smaller values of RAB and RE than the model-based competitor
for every population and fixed bandwidth under consideration.
Despite
being relatively the best estimator, its performance is significantly affected by the bandwidth choices. As the bandwidth size increases, some amount of efficiency is lost (see Table 3).
Table 3. Relative absolute bias (RAB) and relative efficiency (RE) based on 1000 replications of simple random sampling within strata from four fixed populations of size
. Sample size is
.
Table 4. Estimated lower and upper confidence limits and corresponding average lengths based on 1000 replications of simple random sampling within strata from four fixed populations of size
. Sample size is
. (LCL is the Lower Confidence Limit, UCL is the Upper Confidence Limit and AL is the Average Length).
Additionally, a close look at the estimated totals in Table 3 shows that, as the bandwidth increases, the local linear regression estimator,
becomes equivalent to the linear regression estimator,
. This shows that the bandwidth has an effect on the mean square error of
. In particular, for every bandwidth considered in this study,
essentially dominates
for all the populations except Linear and Bump populations, where
is competitive. Further,
essentially dominates
for all populations except in the Jump population, where
dominates all estimators being considered. The overall performance of
is consistently good as long as the bandwidth remains small in this particular study.
5. Conclusion
In this study, the performance of the proposed estimator has been compared with that of some design-based and model-based regression estimators. The RE values of the proposed estimator are in general close to one. It has been shown that, for every bandwidth considered,
essentially dominates
for all the populations except Linear and Bump populations, where
is competitive. Further,
essentially dominates
for all populations except the Jump population, where it dominates all estimators being considered. Generally, good confidence intervals are obtained for the nonparametric regression estimators, and use of the proposed estimator leads to relatively smaller values of RE than the other estimators. We conclude that the nonparametric regression approach under stratified random sampling using the proposed estimator yields good results.
Acknowledgements
Special thanks to the African Union (AU) for the funding that saw the success of this research.