Estimating a Finite Population Mean under Random Non-Response in Two Stage Cluster Sampling with Replacement
1. Introduction
In survey sampling, non-response is one source of error in data analysis. Non-response introduces bias into the estimation of population characteristics. It also causes samples to deviate from the distributions determined by the original sampling design. This paper seeks to reduce the non-response bias in the estimation of a finite population mean in two stage cluster sampling.
The use of regression models is recognized as one of the procedures for reducing non-response bias using auxiliary information. In practice, information on the variables of interest is not available for non-respondents, but information on auxiliary variables may be. It is therefore desirable to model the response behavior and incorporate the auxiliary data into the estimation so that the bias arising from non-response can be reduced. If the auxiliary variables are correlated with the response behavior, then regression estimators are more precise in estimating population parameters, provided the auxiliary information is known.
Many authors have developed estimators of the population mean where non-response exists in both the study and auxiliary variables. There are, however, cases in which the auxiliary variables do not exhibit non-response, such as the number of people in a family or the time taken to complete one's education. Imputation techniques have been used to account for non-response in the study variable. For instance, [1] applied the compromised method of imputation to estimate a finite population mean under two stage cluster sampling; this method, however, produced a large bias. In this study, the Nadaraya-Watson regression technique is applied in deriving the estimator for the finite population mean. Kernel weights are used to compensate for non-response.
Reweighting Method
Non-response causes loss of observations; reweighting therefore increases the weights of all, or almost all, of the responding elements to compensate for the elements that fail to respond in a survey. The population mean,
, is estimated by selecting a sample of size n at random with replacement. If responding units to item y are independent so that the probability of unit j responding in cluster i is
then an imputed estimator,
, for
, is given by
(1.0)
where
gives sample survey weight tied to unit j in cluster i and
is its second order probability of inclusion,
, is the set of r units responding to item y and
is the set of m units that failed to respond to item y so that
and
is the imputed value generated so that the missing value
is compensated for, [2] .
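As a rough illustration of the idea behind Equation (1.0), the sketch below computes a weighted mean in which respondents contribute their observed values and non-respondents contribute imputed ones. The function name `imputed_weighted_mean`, the tuple layout, and the weight-normalized (ratio) form are assumptions made for this example; they are not taken from [2].

```python
def imputed_weighted_mean(units):
    """Weighted mean over responding and imputed units.

    `units` is a list of (weight, observed_y, imputed_y) tuples, where
    observed_y is None for a unit that failed to respond (hypothetical layout).
    """
    num = 0.0
    den = 0.0
    for w, y_obs, y_imp in units:
        # Use the observed value when available, otherwise the imputed one.
        y = y_obs if y_obs is not None else y_imp
        num += w * y
        den += w
    return num / den


sample = [
    (2.0, 10.0, None),  # respondent
    (2.0, 14.0, None),  # respondent
    (4.0, None, 12.0),  # non-respondent with an imputed value
]
print(imputed_weighted_mean(sample))
```

How the imputed values are generated is left open here; the rest of the paper uses kernel smoothing on an auxiliary variable for that step.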
2. The Proposed Estimator of Finite Population Mean
Consider a finite population of size M consisting of N clusters with
elements in the ith cluster. A sample of n clusters is selected so that
units respond and
units fail to respond. Let
denote the value of the survey variable Y for unit j in cluster i, for
,
and let population mean be given by
(2.1)
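A minimal sketch of the population mean for clustered data, assuming the population is stored as a list of clusters, each a list of unit values (the names `clusters` and `population_mean` are illustrative). The mean is taken over all M units, which in general differs from the average of the cluster means when cluster sizes are unequal.

```python
def population_mean(clusters):
    """Mean over every unit in every cluster (total divided by M)."""
    total = sum(sum(cluster) for cluster in clusters)
    M = sum(len(cluster) for cluster in clusters)
    return total / M


def mean_of_cluster_means(clusters):
    """Unweighted average of the per-cluster means, shown for contrast."""
    return sum(sum(c) / len(c) for c in clusters) / len(clusters)


clusters = [[2.0, 4.0], [6.0], [1.0, 3.0, 8.0]]  # N = 3 clusters, M = 6 units
print(population_mean(clusters))
print(mean_of_cluster_means(clusters))
```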
Let an estimator of the finite population mean be defined by
as follows:
(2.2)
where
is an indicator variable defined by
and
and
are the number of units that respond and those that fail to respond respectively.
is the probability of selecting the jth unit in the ith cluster into the sample.
Let
be the inverse of the second order inclusion probabilities and suppose that
is the jth auxiliary random variable from the ith cluster. It follows that Equation (2.2) becomes
(2.3)
Suppose
is known to be a Bernoulli random variable with probability of success
, then,
and
, [3] . Thus, the expected value of the estimator of population mean is given by
(2.4)
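The Bernoulli assumption on the response indicator can be checked numerically. The sketch below, with the hypothetical helpers `bernoulli_moments` and `simulate_response`, compares the theoretical mean p and variance p(1 - p) of a Bernoulli indicator with the moments of simulated draws. The response probability 0.7 is an arbitrary value chosen for illustration.

```python
import random


def bernoulli_moments(p):
    """Theoretical mean and variance of a Bernoulli(p) indicator."""
    return p, p * (1.0 - p)


def simulate_response(p, n, seed=0):
    """Draw n independent response indicators with success probability p."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(n)]


p = 0.7
draws = simulate_response(p, 100_000)
emp_mean = sum(draws) / len(draws)
emp_var = sum((d - emp_mean) ** 2 for d in draws) / len(draws)
print(bernoulli_moments(p), emp_mean, emp_var)
```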
Assuming non-response in the second stage of sampling, the problem is therefore to estimate the values of
. To do this, a linear regression model applied by [4] and [5], given below, is used:
(2.5)
where
is a smooth function of the auxiliary variables and
is the residual term with mean zero and strictly positive variance. Substituting Equation (2.5) into Equation (2.4), the following result is obtained:
(2.6)
Assuming that
, and simplifying Equation (2.6) we obtain the following
(2.7)
Detailed work in [5] proved that
. Therefore Equation (2.7) reduces to
(2.8)
The second term in Equation (2.8) is simplified as follows:
(2.9)
But
, [6] . Thus we get the following:
(2.10)
(2.11)
But
, for details see [5] .
On simplification, Equation (2.11) reduces to
(2.12)
Recall
so that Equation (2.12) may be re-written as follows:
(2.13)
Assume the sample sizes are large, i.e. as
and
, Equation (2.13) simplifies to
(2.14)
Combining Equation (2.14) with the first term in Equation (2.8) gives:
(2.15)
Since the first term involves only the responding units, its values are all known. The problem is to estimate the values of the non-responding units in the second term. Let the indicator variable
; the problem then reduces to that of estimating the function
, which is a function of the auxiliary variables,
. Hence the expected value of the estimator of the finite population mean under non-response is given as:
(2.16)
In order to derive the asymptotic properties of the expected value of the proposed estimator in Equation (2.16), a review of the Nadaraya-Watson estimator is first given below.
3. Review of Nadaraya-Watson Estimator
Given a random sample of bivariate data
having a joint pdf
with the regression model given by
as in Equation (2.5), where
is unknown. Let the error term satisfy the following conditions:
(3.0)
Furthermore, let
denote a symmetric kernel density function which is twice continuously differentiable with:
(3.1)
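Kernel conditions of this kind can be verified numerically for a concrete kernel. The sketch below assumes the conditions are the usual ones (the kernel integrates to one, has zero first moment, and has a finite second moment) and uses the standard Gaussian kernel as an example. The helper `kernel_moment` and the integration limits are illustrative choices.

```python
import math


def gaussian_kernel(w):
    """Standard normal density, a common symmetric kernel."""
    return math.exp(-0.5 * w * w) / math.sqrt(2.0 * math.pi)


def kernel_moment(kernel, order, lo=-10.0, hi=10.0, steps=20000):
    """Midpoint-rule approximation of the integral of w**order * kernel(w)."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        w = lo + (i + 0.5) * h
        total += (w ** order) * kernel(w)
    return total * h


for order in (0, 1, 2):
    print(order, kernel_moment(gaussian_kernel, order))
```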
In addition, let the smoothing weights be defined by
(3.2)
where b is a smoothing parameter, normally referred to as the bandwidth, such that
.
Using Equation (3.2), the Nadaraya-Watson estimator of
is given by:
(3.3)
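For reference, the textbook form of the Nadaraya-Watson weights and estimator is reproduced below. The notation is an assumption made for this note: $K$ is the kernel, $b$ the bandwidth, $(X_i, Y_i)$, $i = 1, \dots, n$, the sample, and $\hat{m}$ the estimate of the regression function. It may differ in indexing from the notation used elsewhere in this paper.

```latex
w_i(x) = \frac{K\!\left(\dfrac{x - X_i}{b}\right)}
              {\sum_{k=1}^{n} K\!\left(\dfrac{x - X_k}{b}\right)},
\qquad
\hat{m}(x) = \sum_{i=1}^{n} w_i(x)\, Y_i ,
\qquad
\sum_{i=1}^{n} w_i(x) = 1 .
```

Because the weights sum to one, the estimate at any point is a local weighted average of the observed responses.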
Given the model
and the conditions on the error term given in Equation (3.0), the expression for the survey variable
relative to the auxiliary variable
can be given as a joint pdf of
as follows:
(3.4)
where
is the marginal density of
. The numerator and the denominator of Equation (3.4) can be estimated separately using kernel functions as follows:
is estimated by:
(3.5)
and
(3.6)
Using change of variables technique; let
(3.7)
So that
(3.8)
(3.9)
From the conditions specified in Equation (3.1), Equation (3.9) simplifies to
(3.10)
which reduces to:
(3.11)
Following the same procedure, the denominator can be obtained as follows:
(3.12)
Using change of variable technique as in Equation (3.7), Equation (3.12) can be re-written as follows:
(3.13)
which yields
(3.14)
since
is a pdf and therefore integrates to 1.
It follows from Equations (3.11) and (3.14) that the estimator
is as given in Equation (3.3). Thus the estimator of
is a linear smoother since it is a linear function of the observations,
. Given a sample and a specified kernel function, then for a given auxiliary value
, the corresponding y-estimate is obtained by the estimator outlined in Equation (3.3), which can be written as:
(3.15)
where
is the Nadaraya-Watson estimator for estimating the unknown function
, for details see [7] [8] .
This provides a way of estimating for instance the non-response values of the survey variable
, given the auxiliary values
, for a specified kernel function.
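The imputation step described above can be sketched in a few lines. This is a minimal illustration, assuming a Gaussian kernel, a single auxiliary variable, and missing responses encoded as `None`. The function names `nadaraya_watson` and `impute_missing` are hypothetical, and the bandwidth is fixed by hand.

```python
import math


def gaussian_kernel(w):
    return math.exp(-0.5 * w * w) / math.sqrt(2.0 * math.pi)


def nadaraya_watson(x0, xs, ys, b, kernel=gaussian_kernel):
    """Kernel-weighted average of ys at the point x0 with bandwidth b."""
    weights = [kernel((x0 - x) / b) for x in xs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all kernel weights are zero; increase the bandwidth")
    return sum(w * y for w, y in zip(weights, ys)) / total


def impute_missing(xs, ys, b):
    """Replace each None in ys by the NW prediction from the responding units."""
    x_resp = [x for x, y in zip(xs, ys) if y is not None]
    y_resp = [y for y in ys if y is not None]
    return [
        y if y is not None else nadaraya_watson(x, x_resp, y_resp, b)
        for x, y in zip(xs, ys)
    ]


xs = [1.0, 2.0, 3.0, 4.0, 5.0]     # auxiliary values, known for every unit
ys = [2.0, 4.0, None, 8.0, 10.0]   # survey values, one non-respondent
completed = impute_missing(xs, ys, b=1.0)
print(completed)
```

In this toy data the responding points are symmetric about the missing point and lie on a straight line, so the imputed value coincides with the value on that line. In general the prediction is a smoothed local average and carries some bias.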
4. Asymptotic Bias of the Mean Estimator
Equation (2.16) may be written as
(4.1)
Replacing
by
and re-writing Equation (3.15) using the symmetry property of the Nadaraya-Watson estimator gives
(4.2)
(4.3)
where
is the estimated marginal density of auxiliary variables
.
But for a finite population mean, the expected value of the estimator is given in Equation (4.1). The bias is given by
(4.4)
(4.5)
which reduces to
(4.6)
(4.7)
Re-writing the regression model given by
as
(4.8)
so that, from Equation (4.3), the first term in Equation (4.7) before taking the expectation is given as:
(4.9)
Simplifying Equation (4.9) the following is thus obtained:
(4.10)
where
Taking conditional expectation of Equation (4.10) we get
(4.11)
To obtain the relationship between the conditional mean and the selected bandwidth, the following theorem due to [6] is applied:
Theorem (Dorfman, 1992)
Let
be a symmetric density function with
and
. Assume n and N increase together such that
with
. In addition, assume the sampled and non-sampled values of x are in the interval
and are generated by densities
and
respectively, both bounded away from zero on
and assumed to have continuous second derivatives. If for any variable
,
and
, then
.
Applying this theorem, we have
(4.12)
The theorem is stated without proof. To establish Equation (4.12), we split it into bias and variance terms and derive each separately as follows:
From Equation (3.0) it follows that
. Therefore,
. Thus
can be obtained as follows:
(4.13)
Using substitution and change of variable technique below
(4.14)
Equation (4.13) simplifies to:
(4.15)
(4.16)
Using a Taylor series expansion about the point
, the kth order kernel can be derived as follows:
(4.17)
Similarly,
(4.18)
Expanding up to the 3rd order kernels, Equation (4.18) becomes
(4.19)
In a similar manner, the expansion of Equation (4.16) up to order
is given by:
(4.20)
Simplifying Equation (4.20) gives:
(4.21)
Using the conditions stated in Equation (3.1), Equation (4.21) can be simplified further to obtain:
(4.22)
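The change of variable and Taylor expansion used in this derivation follow a standard pattern, shown below for a generic kernel density estimate. The notation is an assumption made for this note: $g$ is a twice continuously differentiable density, $K$ a symmetric kernel that integrates to one, and $b$ the bandwidth. The first-order term vanishes because $\int w K(w)\,dw = 0$.

```latex
E\big[\hat{g}(x)\big]
  = \frac{1}{b}\int K\!\left(\frac{x - u}{b}\right) g(u)\,du
  = \int K(w)\, g(x - bw)\,dw
  = g(x) + \frac{b^{2}}{2}\, g''(x) \int w^{2} K(w)\,dw + o(b^{2}).
```

The leading bias is therefore of order $b^{2}$ and is proportional to the second moment of the kernel.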
Hence the expected value of the second term in Equation (4.11) then becomes:
(4.23)
(4.24)
(4.25)
where
(4.26)
and
is as stated in Equation (3.1).
Using the bias expression in Equation (4.4) and the conditional expectation in Equation (4.11), we obtain the following equation for the bias of the estimator:
(4.27)
5. Asymptotic Variance of the Estimator
From Equations (4.9) and (4.11),
(5.0)
Hence
(5.1)
where
Expressing Equation (5.1) in terms of expectation we obtain:
(5.2)
Using the fact that the conditional expectation
, the second term in Equation (5.2) reduces to zero. Therefore,
(5.3)
where
Let
, and
, and making the following substitutions
(5.4)
(5.5)
(5.6)
which can be simplified to get:
(5.7)
(5.8)
(5.9)
Hence
(5.10)
where
so that
.
Changing variables and applying a Taylor series expansion gives
(5.11)
(5.12)
which simplifies to
(5.13)
For large samples, as
,
and for
, then
. Hence the variance in Equation (5.12) asymptotically tends to zero, that is,
(5.14)
On simplification,
(5.15)
Substituting Equation (5.7) into Equation (5.15) yields the following:
(5.16)
(5.17)
where,
It is notable that the variance term still depends on the marginal density function,
of the auxiliary variables
. It can also be observed that the variance is inversely related to the smoothing parameter b. This implies that an increase in b results in a smaller variance. However, increasing the bandwidth would give a larger bias. Therefore there is a trade-off between the bias and variance of the estimated population mean. A bandwidth that provides a compromise between the two measures would therefore be desirable.
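One common data-driven way to find such a compromise bandwidth is leave-one-out cross-validation. The sketch below is an illustration under stated assumptions, not the procedure used in this paper: it assumes a Gaussian kernel, a single auxiliary variable, simulated data, and a hand-picked candidate grid. The names `loo_cv_score` and `select_bandwidth` are hypothetical.

```python
import math
import random


def gaussian_kernel(w):
    return math.exp(-0.5 * w * w) / math.sqrt(2.0 * math.pi)


def loo_cv_score(xs, ys, b):
    """Mean squared leave-one-out prediction error of the NW fit."""
    n = len(xs)
    sse = 0.0
    for i in range(n):
        num = 0.0
        den = 0.0
        for j in range(n):
            if j == i:
                continue  # leave observation i out of its own prediction
            k = gaussian_kernel((xs[i] - xs[j]) / b)
            num += k * ys[j]
            den += k
        if den == 0.0:
            return float("inf")  # bandwidth too small to reach any neighbor
        sse += (ys[i] - num / den) ** 2
    return sse / n


def select_bandwidth(xs, ys, grid):
    """Return the candidate bandwidth with the smallest cross-validation score."""
    return min(grid, key=lambda b: loo_cv_score(xs, ys, b))


rng = random.Random(1)
xs = [rng.uniform(0.0, 1.0) for _ in range(80)]
ys = [math.sin(2.0 * math.pi * x) + rng.gauss(0.0, 0.2) for x in xs]
grid = [0.01, 0.02, 0.05, 0.1, 0.2, 0.5]
best = select_bandwidth(xs, ys, grid)
print(best, loo_cv_score(xs, ys, best))
```

A very small bandwidth tracks the noise (high variance), while a very large one flattens the curve (high bias). The cross-validation score penalizes both.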
6. Mean Squared Error (MSE) of the Finite Population Mean Estimator
The MSE of
combines the bias and the variance terms of this estimator, that is,
(6.0)
(6.1)
Expanding Equation (6.1) gives:
(6.2)
(6.3)
Combining the bias in Equation (4.27) and the variance in Equation (5.17) and conditioning on the auxiliary values
of the auxiliary variables
then
(6.4)
(6.5)
where
,
,
as used in the earlier derivations.
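As a reminder of the general structure, the decomposition of the MSE into variance and squared bias holds for any estimator $\hat{\theta}$ of a target $\theta$. The second expression gives the familiar textbook rates for a kernel regression estimate $\hat{m}(x)$ at an interior point. These generic symbols and rates are assumptions made for this note and are not a restatement of Equation (6.5).

```latex
\operatorname{MSE}(\hat{\theta})
  = E\big[(\hat{\theta} - \theta)^{2}\big]
  = \operatorname{Var}(\hat{\theta}) + \big[\operatorname{Bias}(\hat{\theta})\big]^{2},
\qquad
\operatorname{MSE}\{\hat{m}(x)\} = O(b^{4}) + O\!\left(\frac{1}{nb}\right).
```

Balancing the two terms in the pointwise case gives a bandwidth of order $n^{-1/5}$, under which both terms vanish as $n \to \infty$.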
7. Conclusion
If the sample size is large enough, that is, as
and
, the MSE
of
in Equation (6.5) due to the kernel tends to zero for a sufficiently small bandwidth b. The estimator
is therefore asymptotically consistent, since its MSE converges to zero.
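The consistency claim can be explored with a small Monte Carlo experiment. The sketch below is only an illustration under simplifying assumptions that are not part of the paper: a single stage of units rather than clusters, a linear model with Gaussian errors, responses missing completely at random with probability 0.3, a Gaussian kernel, and a bandwidth shrinking as n to the power -1/5. The function names are hypothetical.

```python
import math
import random


def nw_imputed_mean(xs, ys, b):
    """Mean of ys after replacing each None by a Nadaraya-Watson prediction."""
    x_resp = [x for x, y in zip(xs, ys) if y is not None]
    y_resp = [y for y in ys if y is not None]
    if not y_resp:
        raise ValueError("no responding units to smooth over")
    total = 0.0
    for x, y in zip(xs, ys):
        if y is None:
            # Gaussian kernel; its normalizing constant cancels in the ratio.
            ks = [math.exp(-0.5 * ((x - xr) / b) ** 2) for xr in x_resp]
            y = sum(k * yr for k, yr in zip(ks, y_resp)) / sum(ks)
        total += y
    return total / len(xs)


def monte_carlo_mse(n, reps=60, p_respond=0.7, seed=0):
    """Average squared error of the imputed mean against the full-data mean."""
    rng = random.Random(seed)
    b = n ** (-0.2)  # bandwidth shrinking with n (assumed rate)
    sq_err = 0.0
    for _ in range(reps):
        xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
        full = [2.0 + 3.0 * x + rng.gauss(0.0, 0.5) for x in xs]
        ys = [y if rng.random() < p_respond else None for y in full]
        target = sum(full) / n  # mean that would be observed with no non-response
        sq_err += (nw_imputed_mean(xs, ys, b) - target) ** 2
    return sq_err / reps


for n in (30, 300):
    print(n, monte_carlo_mse(n))
```

Under these assumptions the simulated MSE is expected to shrink as n grows, in line with the asymptotic result, although a simulation of this kind does not replace the derivation.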