A Multiplicative Bias Correction for Nonparametric Approach and the Two Sample Problem in Sample Survey

Abstract

Suppose two separate surveys collect related information on a single population U. We consider the situation in which data from the two surveys are to be combined to yield a single set of estimates of a population quantity (population parameter) of interest. This article applies a multiplicative bias reduction estimator for nonparametric regression to the two sample problem in sample survey. The approach consists of applying a multiplicative bias correction to a pilot estimator. The multiplicative bias correction method, proposed by Linton and Nielsen (1994), guarantees a positive estimate and reduces the bias of the estimate with a negligible increase in variance. Applying this method to the two sample problem in sample survey, we show through the study of its asymptotic properties that the resulting estimator is asymptotically unbiased and statistically consistent. Furthermore, an empirical study is carried out to compare the performance of the proposed estimator with existing ones.


1. Introduction

It sometimes happens that two separate surveys gather related information on a variable of interest for a population U, possibly with distinct designs and modes of sampling. The question of how to combine the data from the two surveys then becomes very important.

Take as an example the students of the Sub-regional Institute of Statistics and Applied Economics (ISSEA) and those of the Polytechnic Institute, both collecting data on unemployment in Cameroon in different ways and with different weightings. Researchers at the National Institute of Statistics (Cameroon) are then faced with the following problem: how can the data from these two distinct surveys be joined to produce a single data set and a better representation of the population?

Several researchers have looked into this problem over the years, and it has been approached in different ways. One approach consists of obtaining estimates from the two surveys separately and combining them using the inverses of the estimated variances as weights, as seen in [1]. [2] went further by using the empirical likelihood method to combine information from multiple surveys. Another option consists of pooling the two data sets into a single data set, taking into account the weights of the individual sample units. Some of these methods are developed in [3]; they include the pseudo-likelihood, the missing information principle and the iterated post-stratified estimator. After simulations on two different populations, it was concluded that in neither population did the design-based ways of combining data yield the best results. The iterated post-stratified estimator appears to be a very promising nonparametric way to combine data from two sources.

More recently, [4] used nonparametric regression, which is the model-based sampler's method of choice when there is serious doubt about the suitability of a linear or other simple parametric model for the survey data at hand. Nonparametric regression supersedes the need for design weights and standard design-based weighting, which is especially helpful in sampling situations where the design weights are missing or questionable.

That study made use of kernel smoothers, in particular the Nadaraya-Watson smoother. However, estimators based on Nadaraya-Watson smoothing weights are typically biased in small samples and at boundary points.

Alternative techniques exist for reducing this bias; for a detailed review see [5] - [11]. These methods improve the performance of nonparametric regression at points of large curvature. In this framework, we consider a multiplicative bias correction approach to nonparametric regression in order to obtain an estimate with a smaller bias than the existing ones.

Outline of the Paper

The remaining part of this paper is organized as follows. In Section 2, a multiplicative bias corrected estimator $\hat{T}_{MBC}$ of the finite population total is proposed. In Section 3, the asymptotic properties of the proposed estimator are derived. In Section 4, an empirical study of the derived properties is presented. In Section 5, we conclude the paper.

2. Proposed Estimator

Consider a finite population $U = \{1, 2, \ldots, N\}$ and let $y_1, y_2, \ldots, y_n$ represent the combined random sample drawn from the population using different sampling techniques. Suppose that to each of these $y_i$'s there corresponds auxiliary information $x_1, x_2, \ldots, x_n$.

Let us consider the following model:

$$E(Y_i \mid X_i = x_i) = h(x_i) \qquad (1)$$

$$\operatorname{cov}(Y_i, Y_j \mid X_i = x_i, X_j = x_j) = \begin{cases} \sigma^2(x_i), & i = j \\ 0, & i \neq j \end{cases} \qquad (2)$$

where $h(x_i)$ and $\sigma^2(x_i)$ are twice continuously differentiable (hence locally Lipschitz continuous) functions. With these assumptions on $h(x_i)$ and $\sigma^2(x_i)$, one can estimate $h(x_i)$ and $\sigma^2(x_i)$ nonparametrically.

Let $\epsilon_i = Y_i - h(X_i)$ be i.i.d. with zero mean and variance $\sigma^2$. We refer to this set-up as the weak model. In this scheme, we can ignore which of the original samples the $Y_i$'s come from.
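As an illustration of this weak model, the short Python sketch below generates a synthetic sample; the regression function, the error standard deviation and the sample size are purely hypothetical choices made for the example and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def h_true(x):
    # Hypothetical regression function, chosen only for illustration.
    return 2.0 + np.sin(2.0 * np.pi * x)

n = 200
sigma = 0.3                        # common error standard deviation (assumed)
x = rng.uniform(0.0, 1.0, size=n)  # auxiliary values
eps = rng.normal(0.0, sigma, n)    # i.i.d. errors, zero mean, variance sigma^2
y = h_true(x) + eps                # responses under the weak model
```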

In the computation of the finite population total, we usually have the formula

$$T = \sum_{i \in U} y_i = \sum_{i \in s} y_i + \sum_{j \in r} y_j \qquad (3)$$

where $s$ refers to the sample and $r$ refers to the nonsampled part of the population. Since the values of the sampled part are known, estimating the finite population total is equivalent to predicting the nonsampled part of the population.

To do this, the multiplicative bias correction technique is employed, in which case the proposed estimator of the population total is defined as

$$\hat{T}_{MBC} = \sum_{i \in s} \frac{y_i - \hat{h}(x_i)}{\pi_i} + \sum_{j \in r} \hat{h}(x_j) \qquad (4)$$

where $\pi_i$ is the inclusion probability and $\hat{h}(x_i)$ is the multiplicative bias corrected estimator of $h(x_i)$.

The principal objective of the multiplicative bias correction technique is to correct the shortcomings of the kernel smoother, namely the bias problem at the boundaries. Given a pilot smoother of the regression function

$$\tilde{h}(x) = \sum_{j=1}^{n} w_{xj} Y_j \qquad (5)$$

the inverse relative estimation error of the smoother at each of the observations is given by $\frac{h(x)}{\tilde{h}(x)}$.

A noisy estimate of the ratio $\frac{h(x)}{\tilde{h}(x)}$ is given by

$$\beta(X_j) = \frac{Y_j}{\tilde{h}(X_j)} \qquad (6)$$

Smoothing the noisy estimates $\beta(X_j)$ leads to

$$\tilde{\beta}(x) = \sum_{j=1}^{n} w_{xj}\, \beta(X_j) \qquad (7)$$

This gives a better estimate of the inverse relative estimation error at each particular observation and can therefore be used as a multiplicative correction of the pilot smoother:

$$\hat{h}(x) = \tilde{\beta}(x)\, \tilde{h}(x) \qquad (8)$$

For both $\tilde{h}(x)$ and $\tilde{\beta}(x)$, we use the same weighting scheme:

$$w_{xj} = \frac{1}{nh} K\!\left(\frac{x - X_j}{h}\right) \qquad (9)$$

where $h$ is the bandwidth, $K$ is a probability density function symmetric about zero, and $n$ is the sample size.
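To make the construction in Equations (4)-(9) concrete, the following Python sketch implements the pilot Nadaraya-Watson smoother, the smoothed ratio correction and the resulting total estimator. It is a minimal sketch, not the authors' implementation: the function names, the Gaussian kernel and the explicit weight normalisation (which enforces $\sum_j w_{xj} = 1$, a property used later in the derivations) are illustrative choices, and the minus sign in `t_mbc` follows the reconstruction of Equation (4) above.

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def nw_weights(x0, x_sample, bandwidth):
    """Nadaraya-Watson weights at x0, normalised so they sum to one."""
    k = gaussian_kernel((x0 - x_sample) / bandwidth)
    return k / k.sum()

def mbc_smoother(x_eval, x_sample, y_sample, bandwidth):
    """Multiplicative bias corrected smoother h_hat(x) = beta_tilde(x) * h_tilde(x)."""
    x_eval = np.atleast_1d(np.asarray(x_eval, dtype=float))
    # Pilot smoother evaluated at the sample points, Equation (5).
    h_tilde_sample = np.array(
        [nw_weights(xj, x_sample, bandwidth) @ y_sample for xj in x_sample]
    )
    # Noisy ratio beta(X_j) = Y_j / h_tilde(X_j), Equation (6).
    beta = y_sample / h_tilde_sample
    h_hat = np.empty(len(x_eval))
    for i, x0 in enumerate(x_eval):
        w = nw_weights(x0, x_sample, bandwidth)
        h_tilde_x0 = w @ y_sample              # pilot smoother at x0
        beta_tilde_x0 = w @ beta               # smoothed ratio, Equation (7)
        h_hat[i] = beta_tilde_x0 * h_tilde_x0  # multiplicative correction, Equation (8)
    return h_hat

def t_mbc(y_s, x_s, x_r, pi_s, bandwidth):
    """Proposed total estimator, Equation (4): weighted sample residuals plus
    predictions for the non-sampled units (sign per the reconstruction above)."""
    h_hat_s = mbc_smoother(x_s, x_s, y_s, bandwidth)
    h_hat_r = mbc_smoother(x_r, x_s, y_s, bandwidth)
    return np.sum((y_s - h_hat_s) / pi_s) + np.sum(h_hat_r)
```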

Bandwidth Selection Techniques

● Biased cross-validation (bcv).

● Unbiased cross-validation (ucv), as sketched below.

● A rule of thumb for choosing the bandwidth of a Gaussian kernel density estimator (nrd0).

● A more common variation of this rule given by Scott (1992) (nrd).
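The selectors listed above correspond to standard bandwidth rules (in R they are available as bw.bcv, bw.ucv, bw.nrd0 and bw.nrd). As a minimal sketch under our own assumptions, the Python code below implements Silverman's rule of thumb (the nrd0 idea) and a simple leave-one-out cross-validation search for a regression bandwidth; the kernel, grid and function names are illustrative and not part of the paper.

```python
import numpy as np

def nrd0(x):
    """Silverman's rule of thumb: a quick default bandwidth for a Gaussian kernel."""
    x = np.asarray(x, dtype=float)
    sd = x.std(ddof=1)
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    spread = min(sd, iqr / 1.34) if iqr > 0 else sd
    return 0.9 * spread * len(x) ** (-0.2)

def loo_cv_bandwidth(x, y, grid):
    """Leave-one-out cross-validation for a Nadaraya-Watson smoother:
    pick the bandwidth minimising squared prediction error on held-out points."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    best_h, best_score = None, np.inf
    for h in grid:
        errs = []
        for i in range(len(x)):
            k = np.exp(-0.5 * ((x[i] - np.delete(x, i)) / h) ** 2)
            pred = k @ np.delete(y, i) / k.sum()
            errs.append((y[i] - pred) ** 2)
        score = np.mean(errs)
        if score < best_score:
            best_h, best_score = h, score
    return best_h
```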

3. Properties of Proposed Estimator

3.1. Assumptions

The following assumptions are made in the estimation of $\hat{h}(x_i)$:

● The regression function is bounded and strictly positive, that is, $b \geq h(x) \geq a > 0$ for all $x$.

● The regression function is twice continuously differentiable everywhere.

● $\epsilon$ has finite fourth moments and a distribution symmetric about zero.

● The bandwidth $h$ is such that $h \to 0$, $nh \to \infty$ and $(nh)^2 \to \infty$ as $n \to \infty$.

3.2. Asymptotic Unbiasedness of the Proposed Estimator

We want to show that $E(\hat{T}_{MBC} - T) \to 0$ as $n \to \infty$. Under the model-based approach, the bias of the estimator $\hat{T}_{MBC}$ is defined as follows:

$$E\big[\hat{T}_{MBC} - T\big] = E\big[\hat{T}_{MBC}\big] - E[T] \qquad (10)$$

The expected value of the proposed estimator of the finite population total is given by

$$E\big[\hat{T}_{MBC}\big] = E\left[\sum_{i \in s} \frac{y_i - \hat{h}(x_i)}{\pi_i} + \sum_{j \in r} \hat{h}(x_j)\right] \qquad (11)$$

$$= E\left[\sum_{i \in s} \frac{y_i - \hat{h}(x_i)}{\pi_i}\right] + E\left[\sum_{j \in r} \hat{h}(x_j)\right] \qquad (12)$$

$$= \sum_{i \in s} \frac{1}{\pi_i} E\big(y_i - \hat{h}(x_i)\big) + \sum_{j \in U \setminus s} E\big(\hat{h}(x_j)\big) \qquad (13)$$

$E(\hat{h}(x_j))$ is obtained by analysing the individual terms of the stochastic approximation of $\hat{h}(x)$. Let us then establish the stochastic approximation of $\hat{h}(x)$ as shown in [11].

From (8),

$$\hat{h}(x) = \tilde{\beta}(x)\, \tilde{h}(x) \qquad (14)$$

$$= \sum_{j=1}^{n} w_{xj} \frac{Y_j}{\tilde{h}(X_j)}\, \tilde{h}(x) = \sum_{j=1}^{n} w_{xj} \frac{\tilde{h}(x)}{\tilde{h}(X_j)}\, Y_j \qquad (15)$$

$$= \sum_{j=1}^{n} w_{xj}\, R_j(x)\, Y_j, \quad \text{where } R_j(x) = \frac{\tilde{h}(x)}{\tilde{h}(X_j)} \qquad (16)$$

Let us define $\bar{h}(x) = E\big(\tilde{h}(x) \mid X_1, X_2, \ldots, X_n\big)$; then we can express $R_j(x)$ as

$$\begin{aligned} R_j(x) = \frac{\tilde{h}(x)}{\tilde{h}(X_j)} &= \left(\frac{\bar{h}(x)}{\bar{h}(X_j)}\right)\left(\frac{\tilde{h}(x)}{\bar{h}(x)}\right)\left(\frac{\tilde{h}(X_j)}{\bar{h}(X_j)}\right)^{-1} \\ &= \left(\frac{\bar{h}(x)}{\bar{h}(X_j)}\right)\left(\frac{\tilde{h}(x)-\bar{h}(x)+\bar{h}(x)}{\bar{h}(x)}\right)\left(\frac{\tilde{h}(X_j)-\bar{h}(X_j)+\bar{h}(X_j)}{\bar{h}(X_j)}\right)^{-1} \\ &= \left(\frac{\bar{h}(x)}{\bar{h}(X_j)}\right)\left(\frac{\tilde{h}(x)-\bar{h}(x)}{\bar{h}(x)}+1\right)\left(\frac{\tilde{h}(X_j)-\bar{h}(X_j)}{\bar{h}(X_j)}+1\right)^{-1} \\ &= \left(\frac{\bar{h}(x)}{\bar{h}(X_j)}\right)\big(R(x)+1\big)\big(R(X_j)+1\big)^{-1} \end{aligned}$$

where $R(x) = \dfrac{\tilde{h}(x)-\bar{h}(x)}{\bar{h}(x)}$.

Through the series expansion,

$$\big(R(X_j) + 1\big)^{-1} = \frac{1}{1 + R(X_j)} = \frac{1}{1 - \big(-R(X_j)\big)} = \sum_{k=0}^{\infty}\big[-R(X_j)\big]^{k} = 1 - R(X_j) + R(X_j)^{2} - \cdots$$

$$R_j(x) = \frac{\bar{h}(x)}{\bar{h}(X_j)}\left[1 + R(x) - R(X_j) + r_j(x, X_j)\right]$$

which is an approximation of the quantity $R_j(x)$, with $r_j(x, X_j)$ denoting the remainder of the expansion.

Substituting both $Y_j = h(X_j) + \epsilon_j$ and $R_j(x)$ into (16), we obtain

$$\begin{aligned} \hat{h}(x) &= \sum_{j=1}^{n} w_{xj}\,\frac{\bar{h}(x)}{\bar{h}(X_j)}\left[1 + R(x) - R(X_j) + r_j(x, X_j)\right]\big(h(X_j) + \epsilon_j\big) \\ &= \sum_{j=1}^{n} w_{xj}\,\frac{\bar{h}(x)}{\bar{h}(X_j)}\,h(X_j) + \sum_{j=1}^{n} w_{xj}\,\frac{\bar{h}(x)}{\bar{h}(X_j)}\big(\epsilon_j + h(X_j)\big)\big(R(x) - R(X_j)\big) \\ &\quad + \sum_{j=1}^{n} w_{xj}\,\frac{\bar{h}(x)}{\bar{h}(X_j)}\big(R(x) - R(X_j)\big)\epsilon_j + \sum_{j=1}^{n} w_{xj}\,\frac{\bar{h}(x)}{\bar{h}(X_j)}\,r_j(x, X_j)\big(h(X_j) + \epsilon_j\big) \end{aligned}$$

Using the assumption $nh \to \infty$, the remainder term tends to zero in probability and the expression reduces to

$$\hat{h}(x) = \sum_{j=1}^{n} w_{xj}\,\frac{\bar{h}(x)}{\bar{h}(X_j)}\,h(X_j) + \sum_{j=1}^{n} w_{xj}\,\frac{\bar{h}(x)}{\bar{h}(X_j)}\big(\epsilon_j + h(X_j)\big)\big(R(x) - R(X_j)\big) + \sum_{j=1}^{n} w_{xj}\,\frac{\bar{h}(x)}{\bar{h}(X_j)}\big(R(x) - R(X_j)\big)\epsilon_j + o_p\!\left(\frac{1}{nh}\right)$$

To evaluate Equation (13), we need $E\big(\hat{h}(x_j)\big)$; hence,

$$\begin{aligned} E\big(\hat{h}(x_j)\big) &= E\Bigg[\sum_{j=1}^{n} w_{xj}\,\frac{\bar{h}(x)}{\bar{h}(X_j)}\,h(X_j) + \sum_{j=1}^{n} w_{xj}\,\frac{\bar{h}(x)}{\bar{h}(X_j)}\big(\epsilon_j + h(X_j)\big)\big(R(x) - R(X_j)\big) \\ &\qquad + \sum_{j=1}^{n} w_{xj}\,\frac{\bar{h}(x)}{\bar{h}(X_j)}\big(R(x) - R(X_j)\big)\epsilon_j + o_p\!\left(\frac{1}{nh}\right)\Bigg] \\ &= \sum_{j=1}^{n} w_{xj}\,\frac{\bar{h}(x)}{\bar{h}(X_j)}\,E\big(h(X_j)\big) + \sum_{j=1}^{n} w_{xj}\,\frac{\bar{h}(x)}{\bar{h}(X_j)}\,E(\epsilon_j) + \sum_{j=1}^{n} w_{xj}\,\frac{\bar{h}(x)}{\bar{h}(X_j)}\,h(X_j) \times E\big(R(x) - R(X_j)\big) \\ &\qquad + \sum_{j=1}^{n} w_{xj}\,\frac{\bar{h}(x)}{\bar{h}(X_j)}\big(R(x) - R(X_j)\big)E(\epsilon_j) + o_p\!\left(\frac{1}{nh}\right) \\ &= \sum_{j=1}^{n} w_{xj}\,\frac{\bar{h}(x)}{\bar{h}(X_j)}\,h(X_j) + \sum_{j=1}^{n} w_{xj}\,\frac{\bar{h}(x)}{\bar{h}(X_j)}\,h(X_j)\,E\!\left(\frac{\tilde{h}(x)}{\bar{h}(x)} - \frac{\tilde{h}(X_j)}{\bar{h}(X_j)}\right) + o_p\!\left(\frac{1}{nh}\right) \end{aligned}$$

since $E(\epsilon_j) = 0$. Thus,

$$E\big(\hat{h}(x_j)\big) = \sum_{j=1}^{n} w_{xj}\,\frac{\bar{h}(x)}{\bar{h}(X_j)}\,h(X_j) + o_p\!\left(\frac{1}{nh}\right), \quad \text{since } \bar{h}(x) = E\big(\tilde{h}(x)\big) \qquad (17)$$

Hence,

$$E\big[\hat{T}_{MBC}\big] = \sum_{i \in s}\frac{1}{\pi_i}\left[E(y_i) - \left(\sum_{j=1}^{n} w_{xj}\,\frac{\bar{h}(x)}{\bar{h}(X_j)}\,h(X_j) + o_p\!\left(\frac{1}{nh}\right)\right)\right] + \sum_{U \setminus s}\left(\sum_{j=1}^{n} w_{xj}\,\frac{\bar{h}(x)}{\bar{h}(X_j)}\,h(X_j)\right) + o_p\!\left(\frac{1}{nh}\right) \qquad (18)$$

The above expression can be simplified by considering a limited Taylor series expansion of $\frac{h(X_j)}{\bar{h}(X_j)}$ about the point $x$. Hence

$$\frac{h(X_j)}{\bar{h}(X_j)} = \frac{h(x)}{\bar{h}(x)} + (X_j - x)\left(\frac{h(x)}{\bar{h}(x)}\right)' + (X_j - x)^{2}\left(\frac{h(x)}{\bar{h}(x)}\right)'' + o_p(1) \qquad (19)$$

Now, substituting the first two terms of (19) into (18) gives

$$E\big[\hat{T}_{MBC}\big] = \sum_{i \in s}\frac{1}{\pi_i}\Big[E(y_i) - E\big(\hat{h}(x_i)\big)\Big] + \sum_{U \setminus s}\left(\sum_{j=1}^{n} w_{xj}\,\bar{h}(x)\left(\frac{h(x)}{\bar{h}(x)} + (X_j - x)\left(\frac{h(x)}{\bar{h}(x)}\right)'\right)\right) + o_p\!\left(\frac{1}{nh}\right) \qquad (20)$$

But $\sum_{j=1}^{n} w_{xj} = 1$ and $\sum_{j=1}^{n}(X_j - x)\, w_{xj} = 0$, therefore

$$E\big[\hat{T}_{MBC}\big] = \sum_{j \in U \setminus s} h(x_j) + o_p\!\left(\frac{1}{nh}\right) \qquad (21)$$

Furthermore,

$$E(T) = \sum_{i \in s} E(y_i) + \sum_{j \in r} E(y_j) = \sum_{i \in s} \bar{y} + \sum_{j \in r} h(x_j)$$

Hence the asymptotic bias of the estimator is given by

$$\mathrm{Bias}\big(\hat{T}_{MBC}\big) = E\left(\frac{\hat{T}_{MBC} - T}{N}\right) = -\frac{1}{N}\sum_{i \in s}\bar{y} + o_p\!\left(\frac{1}{nh}\right)$$

The bias of $\hat{T}_{MBC}$ is thus of order $o_p\!\left(\frac{1}{nh}\right)$, so it converges to zero at a faster rate than that of the existing nonparametric estimators, which generally converge at the rate $O_p(h^2)$.
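To see why this is a faster rate, consider the illustrative bandwidth order $h \propto n^{-1/5}$ commonly used for second-order kernels (a choice made here only for the comparison, not prescribed by the paper):

$$\frac{1}{nh} \asymp n^{-4/5}, \qquad h^{2} \asymp n^{-2/5},$$

so an $o_p\!\left(\frac{1}{nh}\right)$ bias term vanishes considerably faster than an $O_p(h^2)$ bias term as $n \to \infty$.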

3.3. Asymptotic Variance of the Proposed Estimator

The variance of the proposed estimator of the finite population total is given by

$$\begin{aligned} \operatorname{Var}\big[\hat{T}_{MBC}\big] &= \operatorname{Var}\left[\sum_{i \in s}\frac{y_i - \hat{h}(x_i)}{\pi_i} + \sum_{j \in r}\hat{h}(x_j)\right] \\ &= \operatorname{Var}\left[\sum_{i \in s}\frac{y_i - \hat{h}(x_i)}{\pi_i}\right] + \operatorname{Var}\left[\sum_{j \in r}\hat{h}(x_j)\right] \\ &= \sum_{i \in s}\left(\frac{1}{\pi_i}\right)^{2}\operatorname{Var}\big(y_i - \hat{h}(x_i)\big) + \sum_{U \setminus s}\operatorname{Var}\big(\hat{h}(x_j)\big) \end{aligned}$$

Firstly,

$$\operatorname{Var}\big(\hat{h}(x_j)\big) = \operatorname{Var}\left(\sum_{j=1}^{n} w_{xj}\,\frac{\bar{h}(x)}{\bar{h}(X_j)}\left[1 + R(x) - R(X_j) + r_j(x, X_j)\right]\big(h(X_j) + \epsilon_j\big)\right) \qquad (22)$$

Using the assumption $nh \to \infty$, the remainder terms converge to zero in probability. Therefore $r_j(x, X_j)\big(h(X_j) + \epsilon_j\big) = o_p\!\left(\frac{1}{nh}\right)$ and Equation (22) reduces to

$$\operatorname{Var}\big(\hat{h}(x_j)\big) = \operatorname{Var}\left(\sum_{j=1}^{n} w_{xj}\,\frac{\bar{h}(x)}{\bar{h}(X_j)}\left[1 + R(x) - R(X_j)\right]\big(h(X_j) + \epsilon_j\big) + o_p\!\left(\frac{1}{nh}\right)\right) \qquad (23)$$

Truncating the binomial expansion at the first term yields

$$\operatorname{Var}\big(\hat{h}(x_j)\big) = \operatorname{Var}\left(\sum_{j=1}^{n} w_{xj}\,\frac{\bar{h}(x)}{\bar{h}(X_j)}\,y_j\right) + o_p\!\left(\frac{1}{(nh)^2}\right) = \sum_{j=1}^{n}\left(w_{xj}\,\frac{\bar{h}(x)}{\bar{h}(X_j)}\right)^{2}\sigma^{2}(x_j) + o_p\!\left(\frac{1}{(nh)^2}\right)$$

This expression is simplified by considering the first two terms of the Taylor series of $\frac{\sigma^{2}(x_j)}{\bar{h}^{2}(X_j)}$, so that we obtain

$$\operatorname{Var}\big(\hat{h}(x_j)\big) = \sum_{j=1}^{n}\big(w_{xj}\big)^{2}\sigma^{2}(x_j) + o_p\!\left(\frac{1}{(nh)^2}\right) \qquad (24)$$

Therefore,

$$\operatorname{Var}\big[\hat{T}_{MBC}\big] = \sum_{i \in s}\left(\frac{1}{\pi_i}\right)^{2}\sigma^{2}(x_i) + \sum_{U \setminus s}\sum_{j=1}^{n}\big(w_{xj}\big)^{2}\sigma^{2}(x_j) + o_p\!\left(\frac{1}{(nh)^2}\right) \qquad (25)$$

Thus the asymptotic variance is given by

$$\operatorname{Var}\!\left(\frac{\hat{T}_{MBC}}{N}\right) = \frac{1}{N^{2}}\sum_{i \in s}\left(\frac{1}{\pi_i}\right)^{2}\sigma^{2}(x_i) + \frac{1}{N^{2}}\sum_{U \setminus s}\sum_{j=1}^{n}\big(w_{xj}\big)^{2}\sigma^{2}(x_j) + o_p\!\left(\frac{1}{(nh)^2}\right) \qquad (26)$$

This implies that $\hat{T}_{MBC}$ is more efficient than the usual nonparametric regression estimator proposed by Dorfman (1992) [12].

3.4. Asymptotic Mean Square Error

The asymptotic mean square error of the estimator $\hat{T}_{MBC}$ is given by

$$MSE\big[\hat{T}_{MBC}\big] = \operatorname{Var}\big[\hat{T}_{MBC}\big] + \Big[\mathrm{Bias}\big(\hat{T}_{MBC}\big)\Big]^{2} \qquad (27)$$

$$MSE\big[\hat{T}_{MBC}\big] = \frac{1}{N^{2}}\sum_{i \in s}\left(\frac{1}{\pi_i}\right)^{2}\sigma^{2}(x_i) + \frac{1}{N^{2}}\sum_{U \setminus s}\sum_{j=1}^{n}\big(w_{xj}\big)^{2}\sigma^{2}(x_j) + o_p\!\left(\frac{1}{(nh)^2}\right) + \left[-\frac{1}{N}\sum_{i \in s}\bar{y} + o_p\!\left(\frac{1}{nh}\right)\right]^{2} \qquad (28)$$

As $n \to \infty$ and $nh \to \infty$, $MSE\big[\hat{T}_{MBC}\big]$ tends to 0, indicating that the proposed estimator is statistically consistent.

4. Empirical Study

4.1. Population

In this section, the theory developed in the previous sections is tested using a set of simulation studies, with a mix of survey designs and various approaches to selecting the bandwidth. We employ a population U of countries in the world of size N = 188, with auxiliary variable x = gross national income (GNI) and study variable y = human development index (HDI). Of interest is the population total of the HDI, $T = \sum_{l \in U} y_l$.

Figure 1 below shows the scatter diagram of the population, with HDI on the vertical axis and GNI on the horizontal axis; there appears to be a quadratic relationship between the two variables.

We suppose that, for each run of the experiment, two samples are taken:

Sample 1 ($s_1$): simple random sampling without replacement (srswor), with $n_1 = 32$.

Sample 2 ($s_2$): stratified simple random sampling (stratsrs), with four equal strata and 8 units taken at random in each, so that $n_2 = 32$.

Figure 1. Scatter diagram of HDI against GNI for the population.

The total experiment consists of 500 runs of pairs of samples. Table 1 gives the estimators considered.

For an estimator $\hat{T}$, we considered the following measures of relative success across the 500 runs:

i) Unconditional relative bias, measured as the ratio of the mean error (across runs) to the target:

$$\mathrm{Bias} = \frac{1}{500}\sum_{\text{runs}}\frac{\hat{T} - T}{T}$$

ii) Unconditional relative root mean square error, divided by the target:

$$\mathrm{rmse} = \sqrt{\frac{1}{500}\sum_{\text{runs}}\big(\hat{T} - T\big)^{2}}\Big/\, T$$
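A minimal Python sketch of these two measures is given below; it assumes a vector `run_estimates` holding the 500 simulated totals and the known target `T` (the variable names and the usage line are ours, not the paper's).

```python
import numpy as np

def relative_bias(estimates, target):
    """Unconditional relative bias: mean estimation error across runs over the target."""
    estimates = np.asarray(estimates, dtype=float)
    return np.mean(estimates - target) / target

def relative_rmse(estimates, target):
    """Unconditional relative root mean square error divided by the target."""
    estimates = np.asarray(estimates, dtype=float)
    return np.sqrt(np.mean((estimates - target) ** 2)) / target

# Hypothetical usage: run_estimates would be produced by repeatedly drawing
# the two samples and applying the t_mbc sketch from Section 2, e.g.
#   run_estimates = [t_mbc(...) for _ in range(500)]
#   print(relative_bias(run_estimates, T), relative_rmse(run_estimates, T))
```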

4.2. Results

Results obtained are tabulated in Table 2.

From the results obtained, we observe that the unbiased cross-validation approach is a viable means of selecting the bandwidth, as it gives the lowest bias and root mean square error across all the estimators. The proposed estimator for the two sample problem gives better estimates of the population total than those obtained using the estimators proposed by [12] and [4], respectively.

Furthermore, we study the conditional performance of the selected estimators. The 500 samples obtained were sorted by the value of the mean of the auxiliary variable and divided into 25 groups of 20 samples each. We then computed the bias and root mean square error within each group. The conditional performance measures are plotted against the average of the sorted mean auxiliary variable, and the behaviour of the conditional bias is reported for the different bandwidths.

Table 1. Estimators.

Table 2. Empirical results.

Figure 2 and Figure 3 show the conditional bias and conditional root mean square error respectively, with each plot drawn for a different bandwidth selector. The population mean of the auxiliary variable x was found to be 1.701. From the conditional bias plots, it is observed that the proposed estimator outperforms the two currently used estimators in terms of conditional bias, especially with the unbiased cross-validation and biased cross-validation methods of selecting the bandwidth. This trend persists for the conditional root mean square error.


Figure 2. Plots of the conditional biases of the three estimators. (a) Biased cross-validation (bcv); (b) rule of thumb for choosing the bandwidth of a Gaussian kernel density estimator (nrd0); (c) variation of the rule of thumb given by Scott (1992) (nrd); (d) unbiased cross-validation (ucv).


Figure 3. Plots of the conditional root mean square error of the three estimators. (a) Biased cross-validation (bcv); (b) rule of thumb for choosing the bandwidth of a Gaussian kernel density estimator (nrd0); (c) variation of the rule of thumb given by Scott (1992) (nrd); (d) unbiased cross-validation (ucv).

5. Conclusion

The aim of this study was to develop an estimator of the finite population total with the lowest bias, using the multiplicative bias corrected approach to nonparametric regression. The study reveals that the proposed estimator is more efficient than the modified nonparametric estimator (NPT). With a suitable bandwidth selection method (ucv), the proposed estimator has the smallest bias and root mean square error values. It has therefore proven effective in resolving the boundary problem associated with the existing nonparametric smoothers.

Acknowledgements

My first appreciation goes to my supervisors, Professor Odhiambo and Doctor Mageto, for accompanying me through this work. Many thanks also to the African Union for supporting this scientific research and placing such confidence in its youth. Last but not least, thanks to my family for their support.

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] Merkouris, T. (2004) Combining Independent Regression Estimators from Multiple Surveys. Journal of the American Statistical Association, 99, 1131-1139.
https://doi.org/10.1198/016214504000000601
[2] Wu, C.B. (2004) Combining Information from Multiple Surveys through the Empirical Likelihood Method. The Canadian Journal of Statistics, 32, 15-26.
https://doi.org/10.2307/3315996
[3] Dorfman, A.H. (2008) The Two Sample Problem. Proceedings of the Joint Statistical Meetings, Section on Survey Research Methods, Denver, 3-7 August 2008.
[4] Dorfman, A.H. (2009) Nonparametric Regression and the Two Sample Problem. Proceedings of the Joint Statistical Meetings, Section on Survey Research Methods, Washington DC, August 1-6 2009, 277-270.
[5] Marron, J.S. and Hardle, W. (1986) Random Approximations to Some Measures of Accuracy in Nonparametric Curve Estimation. Journal of Multivariate Analysis, 20, 91-113.
https://doi.org/10.1016/0047-259X(86)90021-7
[6] Bierens, H.J. (1987) Kernel Estimators of Regression Functions. Advances in Econometrics: Fifth World Congress, Cambridge University Press, Cambridge, 99-144.
https://doi.org/10.1017/CCOL0521344301.003
[7] Muller, H.-G. and Stadtmuller, U. (1987) Variable Bandwidth Kernel Estimators of Regression Curves. The Annals of Statistics, 15, 182-201.
https://doi.org/10.1214/aos/1176350260
[8] Linton, O. and Nielsen, J.P. (1994) A Multiplicative Bias Reduction Method for Nonparametric Regression. Statistics & Probability Letters, 19, 181-187.
https://doi.org/10.1016/0167-7152(94)90102-3
[9] Fan, J.Q. (1992) Design-Adaptive Nonparametric Regression. Journal of the American Statistical Association, 87, 998-1004.
https://doi.org/10.1080/01621459.1992.10476255
[10] Hirukawa, M. and Sakudo, M. (2014) Nonnegative Bias Reduction Methods for Density Estimation Using Asymmetric Kernels. Computational Statistics and Data Analysis, 92, 112-123.
https://doi.org/10.1016/j.csda.2014.01.012
[11] Hengartner, N., Matzner-Lober, E., Rouviere, L. and Burr, T. (2009) Multiplicative Bias Corrected Nonparametric Smoothers. arXiv Preprint, arXiv:0908.0128.
[12] Dorfman, A.H. (1992) Nonparametric Regression for Estimating Totals in Finite Populations. Proceedings of the Section on Survey Research Methods, American Statistical Association Alexandria, Washington DC, 622-625.
