Estimation of Finite Population Totals in High Dimensional Spaces
1. Introduction
In surveys, extrapolation reduces the accuracy of information because the sample is a subset of the entire population and therefore contains no information on units that are not represented in the selected sample. In such cases, auxiliary information on the characteristic under study is usually effective in predicting the unobserved units, provided the model is correctly specified. In general, when using auxiliary information, it is assumed that there is a finite population of $N$ distinct and identifiable units, $U = \{1, 2, \ldots, N\}$. Let each population unit $i$ carry a value $y_i$ of the survey variable $Y$. It is assumed that there is an auxiliary variable $X$, closely correlated with $Y$, whose values $x_1, x_2, \ldots, x_N$ are known for the entire population.
Researchers are frequently faced with the task of estimating a population function (i.e. a function of the $y_i$'s), such as the population total
$$T = \sum_{i=1}^{N} y_i \qquad (1)$$
or the population distribution function
$$F_N(t) = \frac{1}{N} \sum_{i=1}^{N} I\left( y_i \le t \right) \qquad (2)$$
In estimating the population total $T$, for instance, a sample $s$ is usually chosen such that the pairs $(x_i, y_i)$, $i \in s$, are observed for the auxiliary variable $X$ and the corresponding survey variable $Y$. The auxiliary information can then be employed at the design stage, the estimation stage, or both. In the presence of such auxiliary variables, superpopulation models may be used at the estimation stage of inference, [1] and [2]. However, regarding the underlying relationship between the survey and auxiliary variables, all of these techniques rely on simple statistical models (linear regression models). In an empirical study, [3] showed that misspecification of a parametric superpopulation model can lead to substantial errors. To solve this problem, nonparametric regression involving robust estimators in finite population sampling has been proposed [4] [5] [6].
As a result, the motivation for using a nonparametric approach in this research is that a regression curve estimated this way serves four key functions, as explained by [7]: it provides a versatile method of exploring the general relationship between two variables; it enables prediction of observations without reference to a fixed parametric model; it is a tool for finding spurious observations by studying the influence of isolated points; and it is a flexible method for interpolating between adjacent values of the auxiliary variable.
A major problem encountered when using nonparametric kernel-based regression estimators over a finite interval, as in the estimation of finite population quantities, is the bias at the boundary points [8]. It is also known that kernel and polynomial regression estimators provide good estimates of population totals under appropriate regularity conditions, [5] [9].
Despite the fact that high dimensional auxiliary information can be accommodated by the above estimators, regressor sparseness in the design space renders kernel methods and local polynomials unworkable, because performance degrades quickly as the dimension increases [9] [10] [11]. This problem, known as the "curse of dimensionality", results from the sparsity of data in high-dimensional spaces: the best feasible rates of convergence of regression function estimators towards their target curve drop as the dimension of the regressor vector grows. A review of the curse of dimensionality is provided in [12].
Given this curse of dimensionality, different nonparametric estimators must be used to retain a large degree of flexibility. Attempts to navigate the curse while handling multiple auxiliary variables include recursive covering in the model-based perspective [13] and generalized additive modelling in the model-assisted framework [14]. These estimation methods come at the cost of reduced flexibility, with an associated risk of increased bias [10] [11] [12] [15].
Consequently, in this paper a nonparametric estimator of the finite population total based on a feedforward backpropagation neural network is developed to address the shortcomings of the previously studied estimation methods. Although kernel and local approximators may share the approximation property of artificial neural networks (ANNs), they often require a far larger number of components to attain equivalent approximation accuracy [16], which limits their feasibility in practice. ANNs are thus considered a parsimonious approach to this functional approximation problem.
2. Estimation of Finite Population Totals Using Artificial Neural Networks
Let $Y$ be the survey variable associated with an auxiliary variable $X$, assumed to follow a superpopulation model under the model-based approach. A commonly used working model for the finite population is
$$y_i = m(x_i) + \varepsilon_i, \quad i = 1, \ldots, N \qquad (3)$$
with $m$ a smooth regression function, $\varepsilon_i$ i.i.d. with mean zero and variance $\sigma^2$, and $x_1, \ldots, x_N$ the auxiliary information.
Also, let
$$T = \sum_{i \in s} y_i + \sum_{i \in r} y_i \qquad (4)$$
be the finite population total, where $s$ denotes the sampled units and $r$ the non-sampled units. Assume that $y_i$ is given according to Equation (3) with the $\varepsilon_i$ i.i.d. with mean zero. Consider estimating $T$ based on a feedforward backpropagation neural network. As a basic building block, consider the neuron as a nonlinear transformation of a linear combination of the input $x$.
More complex networks, with multiple layers of hidden units or with information feedback, can also be specified. This study deals only with the structure presented in Equation (5), which is widely used across a range of applications and has the appealing characteristic of being implemented in standard statistical software; the results herein are straightforward to extend.
In the simplest case of one hidden layer with $H$ neurons, the network function can be written as
$$f_H(x; \theta) = \beta_0 + \sum_{h=1}^{H} \beta_h \, \psi\left( \gamma_{0h} + \gamma_h^{\top} x \right) \qquad (5)$$
with $x = (x_1, \ldots, x_d)^{\top}$ and
$$\theta = \left( \beta_0, \beta_1, \ldots, \beta_H, \gamma_{01}, \ldots, \gamma_{0H}, \gamma_1^{\top}, \ldots, \gamma_H^{\top} \right)^{\top} \qquad (6)$$
where $\theta$ represents the vector of all weight parameters of the network. Here $\psi$ is a given activation function. For regression problems, functions of sigmoid shape are commonly used. Depending on the required output, one could therefore choose between two widely used sigmoid functions, the logistic sigmoid and the bipolar sigmoid. The logistic function is preferable when the objective is to approximate functions that map into a probability space. In particular, this activation function is a smooth counterpart of the indicator function, as its output is constrained between zero and one. For instance, the logistic function, described as
$$\psi(u) = \frac{1}{1 + e^{-u}} \qquad (7)$$
is a leading example: it approaches one (zero) as its argument goes to infinity (negative infinity). Thus, the logistic activation function produces graded on/off signals in response to the received input signals. The network function $f_H(\cdot\,; \theta)$ specifies a mapping from the input space $\mathbb{R}^d$ to the output space, which for this study is one-dimensional. The class of all network output functions $f_H(\cdot\,; \theta)$, $H \ge 1$, has several uniform approximation properties [17] [18] [19]. Important for the current study is that for any continuous function $m$, any $\varepsilon > 0$ and any compact set $K \subset \mathbb{R}^d$, there exists a network function $f_H(\cdot\,; \theta)$ with
$$\sup_{x \in K} \left| m(x) - f_H(x; \theta) \right| < \varepsilon.$$
This implies that any regression function $m$ may be approximated arbitrarily well using a large enough number of neurons and appropriate parameters $\theta$.
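To make Equations (5)-(7) concrete, the following sketch evaluates the network function of a one-hidden-layer network with logistic activation. The weights here are illustrative random values, and all variable names are ours, not from the paper:

```python
import numpy as np

def logistic(u):
    # Logistic sigmoid of Equation (7): tends to 1 (0) as u -> +inf (-inf)
    return 1.0 / (1.0 + np.exp(-u))

def network_function(x, beta0, beta, gamma0, Gamma):
    """One-hidden-layer network function of Equation (5).

    x      : (d,) input vector
    beta0  : scalar output bias
    beta   : (H,) hidden-to-output weights
    gamma0 : (H,) hidden-unit biases
    Gamma  : (H, d) input-to-hidden weights
    """
    hidden = logistic(Gamma @ x + gamma0)   # H neuron outputs in (0, 1)
    return beta0 + beta @ hidden            # scalar network output

# Illustrative weights for H = 2 neurons and d = 3 inputs
rng = np.random.default_rng(0)
theta = dict(beta0=0.5, beta=rng.normal(size=2),
             gamma0=rng.normal(size=2), Gamma=rng.normal(size=(2, 3)))
y_hat = network_function(np.array([1.0, -0.5, 2.0]), **theta)
```

The hidden layer maps $\mathbb{R}^d$ into $(0,1)^H$, and the output layer recombines these bounded signals linearly, which is exactly the structure whose uniform approximation property is invoked above.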
Therefore, a nonparametric estimate of $m$ is obtained by first choosing $H$, which serves as a tuning parameter and determines the smoothness of the estimate, and then estimating the parameter $\theta$ from the data by nonlinear least squares to yield
$$\hat{\theta}_n = \arg\min_{\theta} \frac{1}{n} \sum_{i \in s} \left( y_i - f_H(x_i; \theta) \right)^2 \qquad (8)$$
with $\hat{m}_H(x) = f_H(x; \hat{\theta}_n)$.
Under appropriate conditions, $\hat{\theta}_n$ converges in probability, for $n \to \infty$ and constant $H$, to the parameter vector $\theta_H^{*}$ which corresponds to the best approximation of $m$ by a function of type $f_H(\cdot\,; \theta)$, with
$$\theta_H^{*} = \arg\min_{\theta} E\left( m(X) - f_H(X; \theta) \right)^2.$$
Also, under some stronger assumptions, asymptotic normality of $\hat{\theta}_n$, and thus of the estimator $\hat{m}_H$ of the regression function $m$, follows. Therefore, an immediate consequence is that $\hat{m}_H(x) \to f_H(x; \theta_H^{*})$ in probability as $n \to \infty$.
The estimation error $\hat{m}_H(x) - m(x)$ can be divided into two asymptotically independent subcomponents, $\hat{m}_H(x) - f_H(x; \theta_H^{*})$ and $f_H(x; \theta_H^{*}) - m(x)$, where $\hat{\theta}_n$ minimises the sample version of the mean squared approximation error, [20]. By the universal approximation property of neural networks, $f_H(\cdot\,; \theta_H^{*})$ converges to the regression function $m$ as $H \to \infty$. Therefore $\hat{m}_H$ is a consistent estimate of $m$ if $H$ increases with $n$, as is herein imposed, at an appropriate rate. From these results, the corresponding estimate of the finite population total is therefore given as
$$\hat{T} = \sum_{i \in s} y_i + \sum_{i \in r} \hat{m}_H(x_i) \qquad (9)$$
which is the proposed estimator for the finite population total, with $\hat{m}_H(x_i) = f_H(x_i; \hat{\theta}_n)$ the fitted network values at the non-sampled units.
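The whole procedure behind Equation (9) can be sketched end to end on synthetic data: fit the one-hidden-layer network to the sampled pairs by nonlinear least squares (here plain full-batch backpropagation), predict the non-sampled units, and add their predicted sum to the observed total. The population model, sizes and learning settings below are our own illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic finite population following working model (3): y = m(x) + eps
# (illustrative m and sizes; the paper's real data set is not used here)
N, n, d, H = 1000, 200, 2, 8
X = rng.uniform(-1.0, 1.0, size=(N, d))
m_true = lambda X: np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2
y = m_true(X) + rng.normal(scale=0.1, size=N)
T = y.sum()                                    # true population total

s = rng.choice(N, size=n, replace=False)       # sampled units (SRSWoR)
r = np.setdiff1d(np.arange(N), s)              # non-sampled units

def logistic(u):
    return 1.0 / (1.0 + np.exp(-u))

# Initialise the weights of network (5)
Gamma = rng.normal(size=(H, d))                # input-to-hidden weights
gamma0 = rng.normal(size=H)                    # hidden biases
beta = rng.normal(scale=0.1, size=H)           # hidden-to-output weights
beta0 = 0.0

# Nonlinear least squares (8) via gradient descent / backpropagation
Xs, ys = X[s], y[s]
lr = 0.05
for _ in range(5000):
    A = logistic(Xs @ Gamma.T + gamma0)        # (n, H) hidden activations
    f = beta0 + A @ beta                       # network outputs on sample
    err = f - ys                               # residuals
    # Gradients of the (halved) mean squared loss
    g_beta0 = err.mean()
    g_beta = A.T @ err / n
    dA = (err[:, None] * beta) * A * (1.0 - A)  # back-propagated signal
    g_gamma0 = dA.mean(axis=0)
    g_Gamma = dA.T @ Xs / n
    beta0 -= lr * g_beta0
    beta -= lr * g_beta
    gamma0 -= lr * g_gamma0
    Gamma -= lr * g_Gamma

# Proposed estimator (9): observed total plus predicted unobserved part
m_hat = beta0 + logistic(X[r] @ Gamma.T + gamma0) @ beta
T_hat = y[s].sum() + m_hat.sum()
```

In practice one would tune $H$ (and the amount of training) as discussed above, since it controls the bias-variance balance of $\hat{m}_H$.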
Regularity Notes on the Proposed Estimator
1) $\hat{T}$ is a model-based estimator, so that all inference is with respect to the model for the $y_i$, not the survey design.
2) This estimator is identical to that proposed in [4], except that the kernel-based regression is replaced by the neural network.
3) This estimator can be used to estimate the population totals of a finite population so long as the assumption is that each of the unsampled elements has the same distribution as the sampled elements.
4) For fixed $H$, this work simply fits a nonlinear regression model to the data. However, such a model can be misspecified, and one therefore has to select a decent $H$, which determines the form of the nonlinear regression function and the dimension of its parameter, to get a reasonable balance between bias and variance of $\hat{m}_H$ as an estimate of $m$.
5) The parameter vector $\theta$ of Equation (5) is not uniquely determined (identified) by the function $f_H(\cdot\,; \theta)$; that is, different values of $\theta$ may realise the same function. If, for example, the activation function is antisymmetric, $\psi(-u) = -\psi(u)$, then changing the enumeration of the hidden units, or multiplying all weights $\gamma_{0h}, \gamma_h$ going into a hidden unit and simultaneously the weight $\beta_h$ going out of that neuron by $-1$, does not change the function. To avoid this ambiguity and the related estimation problems, this study considered only parameter vectors in a subset $\Theta_H$ chosen such that for each function of form (5) with $H$ neurons there exists exactly one corresponding parameter $\theta \in \Theta_H$. For antisymmetric $\psi$ one can choose, for example, the set of $\theta$ whose last $H$ coordinates are in decreasing order. For more details on the identification of parameters see [21].
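The sign-flip ambiguity described in note 5 is easy to verify numerically. The check below uses the antisymmetric tanh activation (the paper's example requires antisymmetry, which the logistic itself does not satisfy); all weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
H, d = 3, 2
beta0, beta = 0.7, rng.normal(size=H)
gamma0, Gamma = rng.normal(size=H), rng.normal(size=(H, d))

def f(x, beta0, beta, gamma0, Gamma, act=np.tanh):
    # Network function (5) with an antisymmetric activation
    return beta0 + beta @ act(Gamma @ x + gamma0)

x = rng.normal(size=d)

# Flip the sign of all weights into neuron 0 and of its outgoing weight
beta2, gamma02, Gamma2 = beta.copy(), gamma0.copy(), Gamma.copy()
beta2[0], gamma02[0], Gamma2[0] = -beta2[0], -gamma02[0], -Gamma2[0]

# Because tanh(-u) = -tanh(u), the two distinct parameter vectors realise
# the same network function, so theta is not identified without the
# restriction to a suitable subset of the parameter space.
same = np.isclose(f(x, beta0, beta, gamma0, Gamma),
                  f(x, beta0, beta2, gamma02, Gamma2))
```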
Theoretically, a feedforward neural network with one hidden layer suffices by the universal approximation property. For practical purposes, networks with more than one hidden layer may provide a better approximation to $m$ with fewer parameters; see [9] [17] [18] [22] [23].
3. Theoretical Properties of the Proposed Estimator
3.1. Assumptions
To be able to prove the theoretical results, the following assumptions are made:
1) The errors $\varepsilon_i$ are independent and identically distributed (i.i.d.) with mean 0 and finite variance $\sigma^2$, and their tail probabilities decrease at an exponential rate, for some positive constants and some exponent $\lambda > 0$.
2) The auxiliary measurements $x_i$ are i.i.d. with an absolutely continuous distribution $F$ having a finite second moment,
$$dF(x) = f(x)\,dx \qquad (10)$$
where $f$ is a strictly positive density whose support is a compact subset of $\mathbb{R}^d$. Moreover,
(11)
holds for some positive constants.
3) The regression function $m$ is a bounded function.
4) For each sequence of finite populations indexed by $\nu$, conditional on the values $x_1, \ldots, x_N$, the $y_i$ follow the superpopulation model (3), where $\varepsilon_i$ satisfies A1; the $x_i$ are then considered fixed with respect to the superpopulation model $\xi$.
5) The survey variable has a bounded moment with $\xi$-probability 1. Moreover, it is noted that (A1)-(A3) immediately imply, for some constants $K_1, K_2 > 0$ and $\lambda > 0$,
$$P\left( |y_i| > t \right) \le K_1 \exp\left( -K_2 t^{\lambda} \right), \quad t > 0 \qquad (12)$$
6) The sampling rate is bounded, that is, the sampling fraction $n/N$ remains bounded as $n, N \to \infty$.
7) The parameter space $\Theta$ is a compact set, $\theta_H^{*}$ is an interior point of $\Theta$, and $\theta$ is irreducible; that is, for $\theta \in \Theta$ none of the following three cases holds [21]:
a) $\beta_h = 0$, for some $h \in \{1, \ldots, H\}$.
b) $\gamma_h = 0$, for some $h \in \{1, \ldots, H\}$.
c) $(\gamma_{0h}, \gamma_h) = \pm (\gamma_{0k}, \gamma_k)$, for $h \neq k$.
8) The activation function $\psi$ in Equation (7) is an asymmetric sigmoid function that is differentiable to any order. Additionally, it is assumed that the class of functions $\psi(\gamma_0 + \gamma^{\top} x)$ is linearly independent. Such a function approaches an indicator (threshold) function in the limit,
$$\psi(a u) \to I(u > 0) \quad \text{as } a \to \infty, \text{ for all } u \neq 0 \qquad (13)$$
The logistic activation function in Equation (7) fulfils these requirements.
To prove consistency of the proposed estimator, the rate at which the complexity of the networks, and therefore the possible roughness of the function estimate $\hat{m}_H$, increases with the sample size $n$ has to satisfy some conditions. We follow [19] and restrict the number $H$ of neurons and the overall size of the network weights simultaneously. For some sequences $H_n \to \infty$, $\Delta_n \to \infty$, let
$$\Theta_n = \left\{ \theta \in \Theta : \|\theta\|_1 \le \Delta_n \right\} \qquad (14)$$
For given sample size $n$, we consider only network functions in
$$\mathcal{F}_n = \left\{ f_{H_n}(\cdot\,; \theta) : \theta \in \Theta_n \right\} \qquad (15)$$
as estimates for $m$. Therefore, we redefine the parameter estimate as
$$\hat{\theta}_n = \arg\min_{\theta \in \Theta_n} \frac{1}{n} \sum_{i \in s} \left( y_i - f_{H_n}(x_i; \theta) \right)^2 \qquad (16)$$
and the network estimate of $m$ is therefore given by
$$\hat{m}_n(\cdot) = f_{H_n}(\cdot\,; \hat{\theta}_n) \qquad (17)$$
which is a kind of sieve estimate in the sense of [24] or [25].
To prove consistency of $\hat{T}$, it first needs to be shown that the neural network regression estimate $\hat{m}_n$ is consistent.
Theorem 3.1. Let $(x_i, y_i)$, $i \in s$, be i.i.d. variables with $E(y_i^2) < \infty$ and $y_i = m(x_i) + \varepsilon_i$. Let the distributions of the $x_i$ and $\varepsilon_i$ satisfy A2 and Equation (12). Let $\mathcal{F}_n$ be the set of neural network output functions given by Equation (15), with an activation function $\psi$ which is Lipschitz continuous on $\mathbb{R}$, strictly increasing and satisfying Equation (13). Let $m$ be in the closure of $\bigcup_n \mathcal{F}_n$ in $L_2(F)$, that is, in the space of functions square integrable with respect to the distribution of the $x_i$. Then $\hat{m}_n$ is a consistent estimate of $m$ in the $L_2(F)$-sense, that is,
$$\int \left( \hat{m}_n(x) - m(x) \right)^2 dF(x) \to 0 \quad \text{in probability} \qquad (18)$$
provided that $H_n, \Delta_n \to \infty$ slowly enough relative to $n$, at the rates given in [19] for this setting, where $\lambda$ determines the rate of decrease of the tail of the distribution of the $y_i$ by Equation (12).
Proof. Theorem 3.1 can be proven exactly as Theorem 2.1 of [26] for stationary processes satisfying an $\alpha$-mixing condition, and also as Theorem 3.1 of [27] for fixed data. As the data here are independent, the Bernstein inequality for stationary processes may be replaced by a Bernstein inequality for independent data, such as the one in Lemma A of Section 2.5.4 of [28] [29]. The right-hand side of Equation (5.1) of [26] changes accordingly, and the proof then proceeds exactly as in [19], resulting in slightly different conditions on the rates of $H_n$ and $\Delta_n$ in the independence case.
We remark that for bounded random variables $y_i$, the last rate condition involving $\lambda$ can be dropped. In that case, Theorem 3.1 is essentially equivalent to Theorem 3.3 of [19]. We also remark that, by Theorem 3.4 of [19], we may determine the parameters $H_n$ and $\Delta_n$, which determine the network complexity and therefore the smoothness of the function estimate, adaptively from the data by cross validation without affecting the consistency of $\hat{m}_n$. For details of the proofs of these theorems, see [26] [27].
Note that, to prove the consistency of $\hat{T}$, we need Equation (18) with a simple mean over the unobserved $x_i$, $i \in r$, instead of the integral. The following result shows that the difference between the integral and the sample mean is negligible.
Theorem 3.2. Let $(x_i, y_i)$, $i = 1, \ldots, N$, be i.i.d. following model (3) for some bounded $m$. Let $F$ denote the distribution of the $x_i$. Let $s$ be the index set of the observed data and $r$ the index set of the unobserved data, and let $\hat{m}_n$, defined as in Equation (17) with $\hat{\theta}_n$ as in Equation (16), denote the estimate of $m$ based on the sample $\{(x_i, y_i) : i \in s\}$. Let $n, N \to \infty$ such that the sampling rate $n/N$ remains bounded, and let $H_n, \Delta_n$ satisfy the conditions of Theorem 3.1. Then
$$P\left( \left| \frac{1}{N-n} \sum_{i \in r} \left( \hat{m}_n(x_i) - m(x_i) \right)^2 - \int \left( \hat{m}_n(x) - m(x) \right)^2 dF(x) \right| > \varepsilon \right) \le 2 \exp\left( - \frac{(N-n)\,\varepsilon^2}{c_1 + c_2 \varepsilon} \right) \qquad (19)$$
for all $\varepsilon > 0$ and all $N$ large enough, where $c_1, c_2$ are constants independent of $N$ and $\varepsilon$.
Proof. From assumption A3, let $C$ be the upper bound of $|m|$. By the definitions of $\hat{m}_n$ and $m$, the squared differences below are bounded. Setting
$$Z_i = \left( \hat{m}_n(x_i) - m(x_i) \right)^2 - \int \left( \hat{m}_n(x) - m(x) \right)^2 dF(x), \quad i \in r \qquad (20)$$
this therefore results in the left-hand side of Equation (19) being
$$P\left( \left| \frac{1}{N-n} \sum_{i \in r} Z_i \right| > \varepsilon \right) \qquad (21)$$
Note that, conditionally on the sample, each $Z_i$ is independent of the others and is completely determined by $x_i$. Now apply Bernstein's inequality (Lemma A, Section 2.5.4 of [28]) to get
$$2 \exp\left( - \frac{(N-n)\,\varepsilon^2}{c_1 + c_2 \varepsilon} \right) \qquad (22)$$
Now the result follows, as $(N-n)\varepsilon^2$ dominates the denominator of the exponent for $N$ large enough, and as $N-n$ coincides asymptotically with $N$ up to the bounded sampling fraction. Moreover, the right-hand side of the inequality converges to zero (taking limits as $N \to \infty$).
3.2. Asymptotic Consistency
Theorem 3.3. If (A1)-(A8) are satisfied, if the activation function $\psi$ is Lipschitz continuous and strictly increasing, and if Theorem 3.1 holds, then the neural network estimate $\hat{T}$ of the population total $T$ given by Equation (9), with $\hat{m}_n$ and $\hat{\theta}_n$ given by Equations (17) and (16), is consistent in the following sense:
$$\frac{1}{N} E\left| \hat{T} - T \right| \to 0 \qquad (23)$$
provided that the number $H_n$ of neurons and the bound $\Delta_n$ on the network weights satisfy $H_n, \Delta_n \to \infty$ such that
(24)
where $\lambda$ determines (by A1) how fast the tail probabilities of the $\varepsilon_i$, and hence of the $y_i$, decrease. [19] showed that an appropriate choice for $H_n$ is one for which $H_n \to \infty$ as $n \to \infty$ and $H_n / n \to 0$, i.e. $H_n = o(n)$ as $n \to \infty$.
Proof. We have
(25)
by Jensen's inequality. The last term converges to a finite constant by the law of large numbers. The first term of (25) decomposes into
(26)
The right-hand terms of (26) converge to 0 by Theorem 3.1 and as $n, N \to \infty$. The proof is completed by using Theorem 3.2 to cope with the left-hand terms, where the factor $(N-n)/N$, which converges to a constant anyhow, is dropped:
(27)
hence the proof.
3.3. Mean Squared Error
Mean squared error (MSE) is used, among other performance measures, to assess the accuracy of the estimator. The MSE is defined by $\mathrm{MSE}(\hat{T}) = E\left( \hat{T} - T \right)^2$, where $T$ denotes the true population total. To estimate $\mathrm{MSE}(\hat{T})$, we first consider
(28)
where $r$ is the set of unsampled auxiliary units and $T_r = \sum_{i \in r} y_i$ denotes the total of the unsampled elements.
The last approximation of Equation (28) follows from Equation (15) of [30], which holds for some positive constant.
One term of Equation (28) is the predictor bias due to the randomness, or sampling bias, of the data $D$. Now from Equation (28), we have
(29)
As noted in [30], this quantity can be estimated by the batch method. Therefore,
(30)
for details see [30]. Equation (30) can be substituted into Equation (29) in lieu of the unknown quantity.
Now, under the stated model assumptions, the estimate of this quantity is given as
(31)
Under the assumption that the population is made up of exact copies of the sampled (training) data, we have the following relation, where $\hat{T}_s$ denotes the fitted sample total:
(32)
Under the true model, the corresponding identity holds, and hence this quantity can be estimated by
(33)
Thus, $\mathrm{MSE}(\hat{T})$ can be estimated by
(34)
As $N$ grows large, Equation (34) reduces to
(35)
4. Empirical Results
To illustrate the estimation approach, the following data are utilized. A population of size 188 was obtained from the United Nations Development Programme 2020 report. The UN studied development in 189 countries, grouping them as having either very high human development, high human development, medium human development or low human development. Kenya was classified among the countries with medium development and ranked number 143 among the 189 countries studied. The UN study used attributes such as the Human Development Index (HDI), life expectancy at birth, expected years of schooling, mean years of schooling, gross national income (GNI) per capita, and GNI per capita rank minus HDI rank. In this study, the relationship between the Human Development Index (HDI), taken as the survey variable, and the auxiliary variables life expectancy at birth, expected years of schooling, mean years of schooling and gross national income (GNI) per capita is considered.
In order to understand how the proposed estimator compares against existing nonparametric regression estimators, we compared its performance to that of estimators based on multivariate adaptive regression splines (MARS), generalized additive models (GAM) and local polynomials (LP), all of which can handle high dimensional data. We compare the performance of the proposed estimator of the population total with the MARS-, GAM- and LP-based estimators using the bias, mean squared error (MSE), mean absolute error (MAE), and mean absolute percentage error (MAPE).
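The four performance indicators just listed can be computed from repeated-sample estimates of a known total as follows. The estimates below are hypothetical stand-ins, not the paper's Table 1 values:

```python
import numpy as np

def performance(T_hats, T):
    """Empirical bias, MSE, MAE and MAPE of population-total estimates
    T_hats (one per repeated sample) against the true total T."""
    T_hats = np.asarray(T_hats, dtype=float)
    err = T_hats - T
    return {"bias": err.mean(),                      # mean deviation
            "mse": (err ** 2).mean(),                # mean squared error
            "mae": np.abs(err).mean(),               # mean absolute error
            "mape": 100.0 * np.abs(err / T).mean()}  # percentage error

# Hypothetical estimates of a true total T = 100 from five samples
stats = performance([98.0, 101.0, 99.5, 102.0, 100.5], T=100.0)
```

The same function would be applied to each estimator's stream of estimates, giving directly comparable rows of a results table.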
The unconditional performance indicators of the estimators, namely the bias, mean squared error (MSE), mean absolute error (MAE) and mean absolute percentage error (MAPE), were computed and used in the analysis. The bias of a population total estimator refers to the deviation of the expected value of the estimator from the true total. Table 1 provides the performance of the estimators when applied to the data obtained from the United Nations Development Programme 2020 report. All of the population total estimators considered here are biased, but the proposed estimator exhibits a comparatively smaller bias. The proposed estimator is also seen to be a very efficient estimator of the finite population total, having the smallest RMSE, followed closely by two of the competing estimators; the remaining estimator proved to be the least efficient of all.
Table 1. Unconditional bias, mean square error, relative root mean square error, mean absolute error and mean absolute percentage error for real data set.
The conditional performance of the estimator was assessed and compared with that of the other existing population total estimators. To do this, 500 random samples, of sizes 100 and 50 respectively, were selected and the mean of the auxiliary values $x_i$ was computed for each sample to obtain 500 values of $\bar{x}$. These sample means were then sorted in ascending order and grouped into clusters of size 20, so that a total of 25 groups was realized. Further, group means of the means of the auxiliary variables were calculated. Empirical means and biases were then computed for all the estimators, and the conditional biases were plotted against the group means to provide a good understanding of the pattern generated. Figure 1 and Figure 2 show the behavior of the conditional biases, relative absolute biases and mean squared errors realized by all the estimators on the real data set.
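The grouping scheme just described can be sketched as follows. The sample means and per-sample totals below are synthetic stand-ins (only the procedure, not the data, is being illustrated):

```python
import numpy as np

rng = np.random.default_rng(7)

# Stand-ins: 500 sample means of the auxiliary variable and the
# corresponding estimate of the total from each of the 500 samples.
xbar = rng.normal(loc=50.0, scale=5.0, size=500)
T_hat = 1000.0 + 10.0 * (xbar - 50.0) + rng.normal(scale=20.0, size=500)
T_true = 1000.0

# Sort the sample means in ascending order, carrying the estimates along
order = np.argsort(xbar)
xbar_sorted, T_sorted = xbar[order], T_hat[order]

# Cluster into 25 consecutive groups of size 20 (25 * 20 = 500)
groups_x = xbar_sorted.reshape(25, 20)
groups_T = T_sorted.reshape(25, 20)

group_means = groups_x.mean(axis=1)          # group means of the means
cond_bias = groups_T.mean(axis=1) - T_true   # empirical conditional bias
```

Plotting `cond_bias` against `group_means` for each estimator reproduces the kind of conditional-bias curves shown in Figure 1 and Figure 2.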
In most cases, there are significant differences among the bias characteristics of the various estimators. A detailed examination of the plots reveals that the proposed estimator attains the lowest levels of bias, followed by its closest competitor, as indicated by the proximity of the plotted curves to the horizontal (no-bias) line at 0.0 on the vertical axis. Despite the rather entangled nature of some of the plots, the proposed estimator emerges clearly as the least biased for nearly every group mean of the auxiliary variables. Plots of conditional MSE against the group means reveal similar behavior: the proposed estimator and its closest competitor produce generally the lowest MSE values, with the proposed estimator yielding the lowest MSE in most cases and performing consistently better than all the other estimators in both bias and MSE. All of these estimators are asymptotically unbiased, and they all exhibit MSE consistency in that the MSE values tend toward zero as the sample size increases.
Figure 1. Conditional bias, mean square error, relative root mean square error and mean absolute error based on real data with a sample size of 100.
Figure 2. Conditional bias, mean square error, relative root mean square error and mean absolute error based on real data with a sample size of 50.
5. Conclusion and Recommendations
In this paper, an estimator of the finite population total has been developed by employing a feedforward backpropagation neural network technique in nonparametric regression. Asymptotic properties of the developed estimator, namely consistency and the mean squared error, have also been derived. When applied to the data set obtained from the United Nations Development Programme 2020 report, the findings indicate that the proposed estimator has the lowest bias and root mean square error among the estimators considered. The developed estimator is considered effective in addressing the curse of dimensionality that renders local polynomial and kernel estimators ineffective when dealing with high dimensional data. It should be noted that the proposed estimator has been studied under simple random sampling without replacement (SRSWoR). An extension to other sampling designs, such as stratified sampling, may be pursued, since many such designs rely on SRSWoR within strata, and it is hypothesised that efficiency would improve relative to other existing estimators in the literature.