Estimation of Finite Population Totals in High Dimensional Spaces
1. Introduction
In surveys, extrapolation reduces the accuracy of information because the sample is a subset of the entire population and therefore contains no information on units that are not represented in the selected sample. In such cases, auxiliary information on the characteristic under study is usually effective in predicting the unobserved units, provided the model is correctly specified. In general, when using auxiliary information, it is assumed that there is a finite population of $N$ distinct and identifiable units, $U = \{1, 2, \ldots, N\}$. Let each population unit $i$ carry a value $y_i$ of the survey variable $Y$. It is assumed that there is an auxiliary variable $X$, closely correlated with $Y$, whose values $x_1, x_2, \ldots, x_N$ are known for the entire population.
Researchers are frequently faced with the task of estimating a population function (i.e. a function of the $y_i$'s), such as the population total
$$T = \sum_{i=1}^{N} y_i \qquad (1)$$
or the population distribution function
$$F_N(t) = \frac{1}{N} \sum_{i=1}^{N} I\left( y_i \le t \right) \qquad (2)$$
In estimating the population total $T$, for instance, a sample $s$ is usually chosen such that the pairs $(x_i, y_i)$, $i \in s$, are observed for the auxiliary variable $X$ and the corresponding survey variable $Y$. The auxiliary information can then be employed at the design stage, the estimation stage, or both. In the presence of such auxiliary variables, superpopulation models may be used at the estimation stage of inference, [1] and [2]. However, regarding the underlying relationship between the survey and auxiliary variables, all of these techniques rely on simple statistical models (linear regression models). In an empirical study, [3] showed that misspecification of a parametric superpopulation model can lead to substantial errors. To solve this problem, nonparametric regression involving robust estimators in finite population sampling has been proposed [4] [5] [6].
As a result, the motivation for using a nonparametric approach in this research is that a regression curve estimated this way serves four key functions, as explained by [7]: it provides a versatile method of exploring the general relationship between two variables; it enables prediction of observations without reference to a fixed parametric model; it is a tool for finding spurious observations by studying the influence of isolated points; and it is a flexible method for interpolating between adjacent values of the auxiliary variable.
A major problem encountered when using nonparametric kernel-based regression estimators over a finite interval, as in the estimation of finite population quantities, is the bias at the boundary points [8]. It is also known that kernel and polynomial regression estimators provide good estimates of population totals under appropriate regularity conditions, [5] [9].
Despite the fact that high dimensional auxiliary information can be accommodated by the above estimators, regressor sparseness in the design space renders kernel methods and local polynomials unworkable, because performance degrades quickly as the dimension increases [9] [10] [11]. This problem, known as the "curse of dimensionality", results from the sparsity of data in high-dimensional spaces: the best feasible rates of convergence of regression function estimators towards their target curve drop as the dimension of the regressor vector grows. A review of the curse of dimensionality is provided in [12].
Given this curse of dimensionality, different nonparametric estimators must be used to retain a large degree of flexibility. Attempts to navigate the curse while handling multiple auxiliary variables include recursive covering in the model-based perspective [13] and generalized additive modelling in the model-assisted framework [14]. These estimation methods come at the cost of reduced flexibility, with an associated risk of increased bias [10] [11] [12] [15].
Consequently, in this paper a nonparametric estimator of the finite population total based on a feedforward backpropagation neural network is developed to address the shortcomings of the previously studied estimation methods. Although kernel and local approximators may share the approximation property of artificial neural networks (ANNs), they often require a far larger number of components to attain equivalent approximation accuracy [16], which limits their feasibility in practice. ANNs are thus considered a parsimonious approach to this functional approximation problem.
2. Estimation of Finite Population Totals Using Artificial Neural Networks
Let $Y$ be the survey variable associated with an auxiliary variable $X$, assumed to follow a superpopulation model under the model-based approach. A commonly used working model for the finite population is
$$y_i = m(x_i) + \varepsilon_i, \quad i = 1, \ldots, N \qquad (3)$$
with $m$ a smooth regression function, $\varepsilon_i$ i.i.d. with mean zero and variance $\sigma^2$, and $x_1, \ldots, x_N$ the auxiliary information.
Also, let
$$T = \sum_{i \in s} y_i + \sum_{i \in r} y_i \qquad (4)$$
be the finite population total, where $s$ denotes the sampled units and $r$ the non-sampled units. Assume that $y_i$ is given according to Equation (3) with the $\varepsilon_i$ i.i.d. with mean zero. Consider estimating $T$ based on a feedforward backpropagation neural network. As a basic building block, consider the neuron as a nonlinear transformation of a linear combination of the input $x$.
More complex networks, with multiple layers of hidden units or with information feedback, can also be specified. This study deals only with the structure presented in Equation (5), which is widely used across a range of applications and has the appealing characteristic of being implemented in standard statistical software; the results herein are straightforward to extend.
In the simplest case of one hidden layer with $H$ neurons, the network function can be written as
$$f_H(x; \theta) = \beta_0 + \sum_{h=1}^{H} \beta_h \, \psi\left( \gamma_{0h} + \gamma_h^{\top} x \right) \qquad (5)$$
with $x = (x_1, \ldots, x_d)^{\top}$ and
$$\theta = \left( \beta_0, \beta_1, \ldots, \beta_H, \gamma_{01}, \ldots, \gamma_{0H}, \gamma_1^{\top}, \ldots, \gamma_H^{\top} \right)^{\top} \qquad (6)$$
where $\theta$ represents the vector of all weight parameters of the network. Here $\psi$ is a given activation function. For regression problems, functions of sigmoid shape are commonly used. Depending on the required output, one could therefore choose between two widely used sigmoid functions, the logistic sigmoid and the bipolar sigmoid. The logistic function is preferable when the objective is to approximate functions that map into a probability space. In particular, this activation function is a smooth counterpart of the indicator function, as its output is constrained between zero and one. For instance, the logistic function, described as
$$\psi(u) = \frac{1}{1 + e^{-u}} \qquad (7)$$
is a leading example: it approaches one (zero) as its argument goes to infinity (negative infinity). Thus, the logistic activation function produces graded on/off signals in response to the received input signals. The network function $f_H(\cdot\,; \theta)$ specifies a mapping from the input space $\mathbb{R}^d$ to the output space, which for this study is one-dimensional. The class of all network output functions $f_H(\cdot\,; \theta)$, $H \ge 1$, has several uniform approximation properties [17] [18] [19]. Important for the current study is that for any continuous function $m$, any $\varepsilon > 0$ and any compact set $K \subset \mathbb{R}^d$, there exists a network function $f_H(\cdot\,; \theta)$ with
$$\sup_{x \in K} \left| m(x) - f_H(x; \theta) \right| < \varepsilon.$$
This implies that any regression function $m$ may be approximated arbitrarily well using a large enough number of neurons and appropriate parameters $\theta$.
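To make Equations (5)-(7) concrete, the following sketch evaluates the network function of a one-hidden-layer network with logistic activation. The weights here are illustrative random values, and all variable names are ours, not from the paper:

```python
import numpy as np

def logistic(u):
    # Logistic sigmoid of Equation (7): tends to 1 (0) as u -> +inf (-inf)
    return 1.0 / (1.0 + np.exp(-u))

def network_function(x, beta0, beta, gamma0, Gamma):
    """One-hidden-layer network function of Equation (5).

    x      : (d,) input vector
    beta0  : scalar output bias
    beta   : (H,) hidden-to-output weights
    gamma0 : (H,) hidden-unit biases
    Gamma  : (H, d) input-to-hidden weights
    """
    hidden = logistic(Gamma @ x + gamma0)   # H neuron outputs in (0, 1)
    return beta0 + beta @ hidden            # scalar network output

# Illustrative weights for H = 2 neurons and d = 3 inputs
rng = np.random.default_rng(0)
theta = dict(beta0=0.5, beta=rng.normal(size=2),
             gamma0=rng.normal(size=2), Gamma=rng.normal(size=(2, 3)))
y_hat = network_function(np.array([1.0, -0.5, 2.0]), **theta)
```

The hidden layer maps $\mathbb{R}^d$ into $(0,1)^H$, and the output layer recombines these bounded signals linearly, which is exactly the structure whose uniform approximation property is invoked above.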
Therefore, a nonparametric estimate of $m$ is obtained by first choosing $H$, which serves as a tuning parameter and determines the smoothness of the estimate, and then estimating the parameter $\theta$ from the data by nonlinear least squares to yield
$$\hat{\theta}_n = \arg\min_{\theta} \frac{1}{n} \sum_{i \in s} \left( y_i - f_H(x_i; \theta) \right)^2 \qquad (8)$$
with $\hat{m}_H(x) = f_H(x; \hat{\theta}_n)$.
Under appropriate conditions, $\hat{\theta}_n$ converges in probability, for $n \to \infty$ and constant $H$, to the parameter vector $\theta_H^{*}$ which corresponds to the best approximation of $m$ by a function of type $f_H(\cdot\,; \theta)$, with
$$\theta_H^{*} = \arg\min_{\theta} E\left( m(X) - f_H(X; \theta) \right)^2.$$
Also, under some stronger assumptions, asymptotic normality of $\hat{\theta}_n$, and thus of the estimator $\hat{m}_H$ of the regression function $m$, follows. Therefore, an immediate consequence is that $\hat{m}_H(x) \to f_H(x; \theta_H^{*})$ in probability as $n \to \infty$.
The estimation error $\hat{m}_H(x) - m(x)$ can be divided into two asymptotically independent subcomponents, $\hat{m}_H(x) - f_H(x; \theta_H^{*})$ and $f_H(x; \theta_H^{*}) - m(x)$, where $\hat{\theta}_n$ minimises the sample version of the mean squared approximation error, [20]. By the universal approximation property of neural networks, $f_H(\cdot\,; \theta_H^{*})$ converges to the regression function $m$ as $H \to \infty$. Therefore $\hat{m}_H$ is a consistent estimate of $m$ if $H$ increases with $n$, as is herein imposed, at an appropriate rate. From these results, the corresponding estimate of the finite population total is therefore given as
$$\hat{T} = \sum_{i \in s} y_i + \sum_{i \in r} \hat{m}_H(x_i) \qquad (9)$$
which is the proposed estimator for the finite population total, with $\hat{m}_H(x_i) = f_H(x_i; \hat{\theta}_n)$ the fitted network values at the non-sampled units.
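The whole procedure behind Equation (9) can be sketched end to end on synthetic data: fit the one-hidden-layer network to the sampled pairs by nonlinear least squares (here plain full-batch backpropagation), predict the non-sampled units, and add their predicted sum to the observed total. The population model, sizes and learning settings below are our own illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic finite population following working model (3): y = m(x) + eps
# (illustrative m and sizes; the paper's real data set is not used here)
N, n, d, H = 1000, 200, 2, 8
X = rng.uniform(-1.0, 1.0, size=(N, d))
m_true = lambda X: np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2
y = m_true(X) + rng.normal(scale=0.1, size=N)
T = y.sum()                                    # true population total

s = rng.choice(N, size=n, replace=False)       # sampled units (SRSWoR)
r = np.setdiff1d(np.arange(N), s)              # non-sampled units

def logistic(u):
    return 1.0 / (1.0 + np.exp(-u))

# Initialise the weights of network (5)
Gamma = rng.normal(size=(H, d))                # input-to-hidden weights
gamma0 = rng.normal(size=H)                    # hidden biases
beta = rng.normal(scale=0.1, size=H)           # hidden-to-output weights
beta0 = 0.0

# Nonlinear least squares (8) via gradient descent / backpropagation
Xs, ys = X[s], y[s]
lr = 0.05
for _ in range(5000):
    A = logistic(Xs @ Gamma.T + gamma0)        # (n, H) hidden activations
    f = beta0 + A @ beta                       # network outputs on sample
    err = f - ys                               # residuals
    # Gradients of the (halved) mean squared loss
    g_beta0 = err.mean()
    g_beta = A.T @ err / n
    dA = (err[:, None] * beta) * A * (1.0 - A)  # back-propagated signal
    g_gamma0 = dA.mean(axis=0)
    g_Gamma = dA.T @ Xs / n
    beta0 -= lr * g_beta0
    beta -= lr * g_beta
    gamma0 -= lr * g_gamma0
    Gamma -= lr * g_Gamma

# Proposed estimator (9): observed total plus predicted unobserved part
m_hat = beta0 + logistic(X[r] @ Gamma.T + gamma0) @ beta
T_hat = y[s].sum() + m_hat.sum()
```

In practice one would tune $H$ (and the amount of training) as discussed above, since it controls the bias-variance balance of $\hat{m}_H$.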
Regularity Notes on the Proposed Estimator
1) $\hat{T}$ is a model-based estimator, so that all inference is with respect to the model for the $y_i$, not the survey design.
2) This estimator is identical to that proposed in [4], except that the kernel-based regression is replaced by the neural network.
3) This estimator can be used to estimate the population totals of a finite population so long as the assumption is that each of the unsampled elements has the same distribution as the sampled elements.
4) For fixed $H$, this work simply fits a nonlinear regression model to the data. However, such a model can be misspecified, and one therefore has to select a decent $H$, which determines the form of the nonlinear regression function and the dimension of its parameter, to get a reasonable balance between bias and variance of $\hat{m}_H$ as an estimate of $m$.
5) The parameter vector $\theta$ of Equation (5) is not uniquely determined (identified) by the function $f_H(\cdot\,; \theta)$; that is, different values of $\theta$ may realise the same function. If, for example, the activation function is antisymmetric, $\psi(-u) = -\psi(u)$, then changing the enumeration of the hidden units, or multiplying all weights $\gamma_{0h}, \gamma_h$ going into a hidden unit and simultaneously the weight $\beta_h$ going out of that neuron by $-1$, does not change the function. To avoid this ambiguity and the related estimation problems, this study considered only parameter vectors in a subset $\Theta_H$ chosen such that for each function of form (5) with $H$ neurons there exists exactly one corresponding parameter $\theta \in \Theta_H$. For antisymmetric $\psi$ one can choose, for example, the set of $\theta$ whose last $H$ coordinates are in decreasing order. For more details on the identification of parameters see [21].
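The sign-flip ambiguity described in note 5 is easy to verify numerically. The check below uses the antisymmetric tanh activation (the paper's example requires antisymmetry, which the logistic itself does not satisfy); all weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
H, d = 3, 2
beta0, beta = 0.7, rng.normal(size=H)
gamma0, Gamma = rng.normal(size=H), rng.normal(size=(H, d))

def f(x, beta0, beta, gamma0, Gamma, act=np.tanh):
    # Network function (5) with an antisymmetric activation
    return beta0 + beta @ act(Gamma @ x + gamma0)

x = rng.normal(size=d)

# Flip the sign of all weights into neuron 0 and of its outgoing weight
beta2, gamma02, Gamma2 = beta.copy(), gamma0.copy(), Gamma.copy()
beta2[0], gamma02[0], Gamma2[0] = -beta2[0], -gamma02[0], -Gamma2[0]

# Because tanh(-u) = -tanh(u), the two distinct parameter vectors realise
# the same network function, so theta is not identified without the
# restriction to a suitable subset of the parameter space.
same = np.isclose(f(x, beta0, beta, gamma0, Gamma),
                  f(x, beta0, beta2, gamma02, Gamma2))
```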
Theoretically, a feedforward neural network with one hidden layer suffices by the universal approximation property. For practical purposes, networks with more than one hidden layer may provide a better approximation to $m$ with fewer parameters; see [9] [17] [18] [22] [23].
3. Theoretical Properties of the Proposed Estimator
3.1. Assumptions
To be able to prove the theoretical results, the following assumptions are made:
1) The errors $\varepsilon_i$ are independent and identically distributed (i.i.d.) with mean 0 and finite variance $\sigma^2$, and their tail probabilities decrease at an exponential rate, for some positive constants and some exponent $\lambda > 0$.
2) The auxiliary measurements $x_i$ are i.i.d. with an absolutely continuous distribution $F$ having a finite second moment,
$$dF(x) = f(x)\,dx \qquad (10)$$
where $f$ is a strictly positive density whose support is a compact subset of $\mathbb{R}^d$. Moreover,
(11)
holds for some positive constants.
3) The regression function $m$ is a bounded function.
4) For each sequence of finite populations indexed by $\nu$, conditional on the values $x_1, \ldots, x_N$, the $y_i$ follow the superpopulation model (3), where $\varepsilon_i$ satisfies A1; the $x_i$ are then considered fixed with respect to the superpopulation model $\xi$.
5) The survey variable has a bounded moment with $\xi$-probability 1. Moreover, it is noted that (A1)-(A3) immediately imply, for some constants $K_1, K_2 > 0$ and $\lambda > 0$,
$$P\left( |y_i| > t \right) \le K_1 \exp\left( -K_2 t^{\lambda} \right), \quad t > 0 \qquad (12)$$
6) The sampling rate is bounded, that is, the sampling fraction $n/N$ remains bounded as $n, N \to \infty$.
7) The parameter space $\Theta$ is a compact set, $\theta_H^{*}$ is an interior point of $\Theta$, and $\theta$ is irreducible; that is, for $\theta \in \Theta$ none of the following three cases holds [21]:
a) $\beta_h = 0$, for some $h \in \{1, \ldots, H\}$.
b) $\gamma_h = 0$, for some $h \in \{1, \ldots, H\}$.
c) $(\gamma_{0h}, \gamma_h) = \pm (\gamma_{0k}, \gamma_k)$, for $h \neq k$.
8) The activation function $\psi$ in Equation (7) is an asymmetric sigmoid function that is differentiable to any order. Additionally, it is assumed that the class of functions $\psi(\gamma_0 + \gamma^{\top} x)$ is linearly independent. Such a function approaches an indicator (threshold) function in the limit,
$$\psi(a u) \to I(u > 0) \quad \text{as } a \to \infty, \text{ for all } u \neq 0 \qquad (13)$$
The logistic activation function in Equation (7) fulfils these requirements.
To prove consistency of the proposed estimator, the rate at which the complexity of the networks, and therefore the possible roughness of the function estimate $\hat{m}_H$, increases with the sample size $n$ has to satisfy some conditions. We follow [19] and restrict the number $H$ of neurons and the overall size of the network weights simultaneously. For some sequences $H_n \to \infty$, $\Delta_n \to \infty$, let
$$\Theta_n = \left\{ \theta \in \Theta : \|\theta\|_1 \le \Delta_n \right\} \qquad (14)$$
For given sample size $n$, we consider only network functions in
$$\mathcal{F}_n = \left\{ f_{H_n}(\cdot\,; \theta) : \theta \in \Theta_n \right\} \qquad (15)$$
as estimates for $m$. Therefore, we redefine the parameter estimate as
$$\hat{\theta}_n = \arg\min_{\theta \in \Theta_n} \frac{1}{n} \sum_{i \in s} \left( y_i - f_{H_n}(x_i; \theta) \right)^2 \qquad (16)$$
and the network estimate of $m$ is therefore given by
$$\hat{m}_n(\cdot) = f_{H_n}(\cdot\,; \hat{\theta}_n) \qquad (17)$$
which is a kind of sieve estimate in the sense of [24] or [25].
To prove consistency of $\hat{T}$, it first needs to be shown that the neural network regression estimate $\hat{m}_n$ is consistent.
Theorem 3.1. Let $(x_i, y_i)$, $i \in s$, be i.i.d. variables with $E(y_i^2) < \infty$ and $y_i = m(x_i) + \varepsilon_i$. Let the distributions of the $x_i$ and $\varepsilon_i$ satisfy A2 and Equation (12). Let $\mathcal{F}_n$ be the set of neural network output functions given by Equation (15), with an activation function $\psi$ which is Lipschitz continuous on $\mathbb{R}$, strictly increasing and satisfying Equation (13). Let $m$ be in the closure of $\bigcup_n \mathcal{F}_n$ in $L_2(F)$, that is, in the space of functions square integrable with respect to the distribution of the $x_i$. Then $\hat{m}_n$ is a consistent estimate of $m$ in the $L_2(F)$-sense, that is,
$$\int \left( \hat{m}_n(x) - m(x) \right)^2 dF(x) \to 0 \quad \text{in probability} \qquad (18)$$
provided that $H_n, \Delta_n \to \infty$ slowly enough relative to $n$, at the rates given in [19] for this setting, where $\lambda$ determines the rate of decrease of the tail of the distribution of the $y_i$ by Equation (12).
Proof. Theorem 3.1 can be proven exactly as Theorem 2.1 of [26] for stationary processes satisfying an $\alpha$-mixing condition, and also as Theorem 3.1 of [27] for fixed data. As the data here are independent, the Bernstein inequality for stationary processes may be replaced by a Bernstein inequality for independent data, such as the one in Lemma A of Section 2.5.4 of [28] [29]. The right-hand side of Equation (5.1) of [26] changes accordingly, and the proof then proceeds exactly as in [19], resulting in slightly different conditions on the rates of $H_n$ and $\Delta_n$ in the independence case.
We remark that for bounded random variables $y_i$, the last rate condition involving $\lambda$ can be dropped. In that case, Theorem 3.1 is essentially equivalent to Theorem 3.3 of [19]. We also remark that, by Theorem 3.4 of [19], we may determine the parameters $H_n$ and $\Delta_n$, which determine the network complexity and therefore the smoothness of the function estimate, adaptively from the data by cross validation without affecting the consistency of $\hat{m}_n$. For details of the proofs of these theorems, see [26] [27].
Note that, to prove the consistency of $\hat{T}$, we need Equation (18) with a simple mean over the unobserved $x_i$, $i \in r$, instead of the integral. The following result shows that the difference between the integral and the sample mean is negligible.
Theorem 3.2. Let $(x_i, y_i)$, $i = 1, \ldots, N$, be i.i.d. following model (3) for some bounded $m$. Let $F$ denote the distribution of the $x_i$. Let $s$ be the index set of the observed data and $r$ the index set of the unobserved data, and let $\hat{m}_n$, defined as in Equation (17) with $\hat{\theta}_n$ as in Equation (16), denote the estimate of $m$ based on the sample $\{(x_i, y_i) : i \in s\}$. Let $n, N \to \infty$ such that the sampling rate $n/N$ remains bounded, and let $H_n, \Delta_n$ satisfy the conditions of Theorem 3.1. Then
$$P\left( \left| \frac{1}{N-n} \sum_{i \in r} \left( \hat{m}_n(x_i) - m(x_i) \right)^2 - \int \left( \hat{m}_n(x) - m(x) \right)^2 dF(x) \right| > \varepsilon \right) \le 2 \exp\left( - \frac{(N-n)\,\varepsilon^2}{c_1 + c_2 \varepsilon} \right) \qquad (19)$$
for all $\varepsilon > 0$ and all $N$ large enough, where $c_1, c_2$ are constants independent of $N$ and $\varepsilon$.
Proof. From assumption A3, let $C$ be the upper bound of $|m|$. By the definitions of $\hat{m}_n$ and $m$, the squared differences below are bounded. Setting
$$Z_i = \left( \hat{m}_n(x_i) - m(x_i) \right)^2 - \int \left( \hat{m}_n(x) - m(x) \right)^2 dF(x), \quad i \in r \qquad (20)$$
this therefore results in the left-hand side of Equation (19) being
$$P\left( \left| \frac{1}{N-n} \sum_{i \in r} Z_i \right| > \varepsilon \right) \qquad (21)$$
Note that, conditionally on the sample, each $Z_i$ is independent of the others and is completely determined by $x_i$. Now apply Bernstein's inequality (Lemma A, Section 2.5.4 of [28]) to get
$$2 \exp\left( - \frac{(N-n)\,\varepsilon^2}{c_1 + c_2 \varepsilon} \right) \qquad (22)$$
Now the result follows, as $(N-n)\varepsilon^2$ dominates the denominator of the exponent for $N$ large enough, and as $N-n$ coincides asymptotically with $N$ up to the bounded sampling fraction. Moreover, the right-hand side of the inequality converges to zero (taking limits as $N \to \infty$).
3.2. Asymptotic Consistency
Theorem 3.3. If (A1)-(A8) are satisfied, if the activation function $\psi$ is Lipschitz continuous and strictly increasing, and if Theorem 3.1 holds, then the neural network estimate $\hat{T}$ of the population total $T$ given by Equation (9), with $\hat{m}_n$ and $\hat{\theta}_n$ given by Equations (17) and (16), is consistent in the following sense:
$$\frac{1}{N} E\left| \hat{T} - T \right| \to 0 \qquad (23)$$
provided that the number $H_n$ of neurons and the bound $\Delta_n$ on the network weights satisfy $H_n, \Delta_n \to \infty$ such that
(24)
where $\lambda$ determines (by A1) how fast the tail probabilities of the $\varepsilon_i$, and hence of the $y_i$, decrease. [19] showed that an appropriate choice for $H_n$ is one for which $H_n \to \infty$ as $n \to \infty$ and $H_n / n \to 0$, i.e. $H_n = o(n)$ as $n \to \infty$.
Proof. We have
(25)
by Jensen's inequality. The last term converges to a finite constant by the law of large numbers. The first term of (25) decomposes into
(26)
The right-hand terms of (26) converge to 0 by Theorem 3.1 and as $n, N \to \infty$. The proof is completed by using Theorem 3.2 to cope with the left-hand terms, where the factor $(N-n)/N$, which converges to a constant anyhow, is dropped:
(27)
hence the proof.
3.3. Mean Squared Error
Mean squared error (MSE) is used, among other performance measures, to assess the accuracy of the estimator. The MSE is defined by $\mathrm{MSE}(\hat{T}) = E\left( \hat{T} - T \right)^2$, where $T$ denotes the true population total. To estimate $\mathrm{MSE}(\hat{T})$, we first consider
(28)
where $r$ is the set of unsampled auxiliary units and $T_r = \sum_{i \in r} y_i$ denotes the total of the unsampled elements.
The last approximation of Equation (28) follows from Equation (15) of [30], which holds for some positive constant.
One term of Equation (28) is the predictor bias due to the randomness, or sampling bias, of the data $D$. Now from Equation (28), we have
(29)
As noted in [30], this quantity can be estimated by the batch method. Therefore,
(30)
for details see [30]. Equation (30) can be substituted into Equation (29) in lieu of the unknown quantity.
Now, under the stated model assumptions, the estimate of this quantity is given as
(31)
Under the assumption that the population is made up of exact copies of the sampled (training) data, we have the following relation, where $\hat{T}_s$ denotes the fitted sample total:
(32)
Under the true model, the corresponding identity holds, and hence this quantity can be estimated by
(33)
Thus, $\mathrm{MSE}(\hat{T})$ can be estimated by
(34)
As $N$ grows large, Equation (34) reduces to
(35)
4. Empirical Results
To illustrate the estimation approach, the following data are utilized. A population of size 188 was obtained from the United Nations Development Programme 2020 report. The UN studied development in 189 countries, grouping them as having either very high human development, high human development, medium human development or low human development. Kenya was classified among the countries with medium development and ranked number 143 among the 189 countries studied. The UN study used attributes such as the Human Development Index (HDI), life expectancy at birth, expected years of schooling, mean years of schooling, gross national income (GNI) per capita, and GNI per capita rank minus HDI rank. In this study, the relationship between the Human Development Index (HDI), taken as the survey variable, and the auxiliary variables life expectancy at birth, expected years of schooling, mean years of schooling and gross national income (GNI) per capita is considered.
In order to understand how the proposed estimator compares against existing nonparametric regression estimators, we compared its performance to that of estimators based on multivariate adaptive regression splines (MARS), generalized additive models (GAM) and local polynomials (LP), all of which can handle high dimensional data. We compare the performance of the proposed estimator of the population total with the MARS-, GAM- and LP-based estimators using the bias, mean squared error (MSE), mean absolute error (MAE), and mean absolute percentage error (MAPE).
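The four performance indicators just listed can be computed from repeated-sample estimates of a known total as follows. The estimates below are hypothetical stand-ins, not the paper's Table 1 values:

```python
import numpy as np

def performance(T_hats, T):
    """Empirical bias, MSE, MAE and MAPE of population-total estimates
    T_hats (one per repeated sample) against the true total T."""
    T_hats = np.asarray(T_hats, dtype=float)
    err = T_hats - T
    return {"bias": err.mean(),                      # mean deviation
            "mse": (err ** 2).mean(),                # mean squared error
            "mae": np.abs(err).mean(),               # mean absolute error
            "mape": 100.0 * np.abs(err / T).mean()}  # percentage error

# Hypothetical estimates of a true total T = 100 from five samples
stats = performance([98.0, 101.0, 99.5, 102.0, 100.5], T=100.0)
```

The same function would be applied to each estimator's stream of estimates, giving directly comparable rows of a results table.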
The unconditional performance indicators of the estimators, namely the bias, mean squared error (MSE), mean absolute error (MAE) and mean absolute percentage error (MAPE), were computed and used in the analysis. The bias of a population total estimator refers to the deviation of the expected value of the estimator from the true total. Table 1 provides the performance of the estimators when applied to the data obtained from the United Nations Development Programme 2020 report. All of the population total estimators considered here are biased, but the proposed estimator exhibits a comparatively smaller bias. The proposed estimator is also seen to be a very efficient estimator of the finite population total, having the smallest RMSE, followed closely by two of the competing estimators; the remaining estimator proved to be the least efficient of all.
Table 1. Unconditional bias, mean square error, relative root mean square error, mean absolute error and mean absolute percentage error for real data set.
The conditional performance of the estimator was assessed and compared with that of the other existing population total estimators. To do this, 500 random samples, of sizes 100 and 50 respectively, were selected and the mean of the auxiliary values $x_i$ was computed for each sample to obtain 500 values of $\bar{x}$. These sample means were then sorted in ascending order and grouped into clusters of size 20, so that a total of 25 groups was realized. Further, group means of the means of the auxiliary variables were calculated. Empirical means and biases were then computed for all the estimators, and the conditional biases were plotted against the group means to provide a good understanding of the pattern generated. Figure 1 and Figure 2 show the behavior of the conditional biases, relative absolute biases and mean squared errors realized by all the estimators on the real data set.
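The grouping scheme just described can be sketched as follows. The sample means and per-sample totals below are synthetic stand-ins (only the procedure, not the data, is being illustrated):

```python
import numpy as np

rng = np.random.default_rng(7)

# Stand-ins: 500 sample means of the auxiliary variable and the
# corresponding estimate of the total from each of the 500 samples.
xbar = rng.normal(loc=50.0, scale=5.0, size=500)
T_hat = 1000.0 + 10.0 * (xbar - 50.0) + rng.normal(scale=20.0, size=500)
T_true = 1000.0

# Sort the sample means in ascending order, carrying the estimates along
order = np.argsort(xbar)
xbar_sorted, T_sorted = xbar[order], T_hat[order]

# Cluster into 25 consecutive groups of size 20 (25 * 20 = 500)
groups_x = xbar_sorted.reshape(25, 20)
groups_T = T_sorted.reshape(25, 20)

group_means = groups_x.mean(axis=1)          # group means of the means
cond_bias = groups_T.mean(axis=1) - T_true   # empirical conditional bias
```

Plotting `cond_bias` against `group_means` for each estimator reproduces the kind of conditional-bias curves shown in Figure 1 and Figure 2.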
In most cases, there are significant differences among the bias characteristics of the various estimators. A detailed examination of the plots reveals that the proposed estimator attains the lowest levels of bias, followed by its closest competitor, as indicated by the proximity of the plotted curves to the horizontal (no-bias) line at 0.0 on the vertical axis. Despite the rather entangled nature of some of the plots, the proposed estimator emerges clearly as the least biased for nearly every group mean of the auxiliary variables. Plots of conditional MSE against the group means reveal similar behavior: the proposed estimator and its closest competitor produce generally the lowest MSE values, with the proposed estimator yielding the lowest MSE in most cases and performing consistently better than all the other estimators in both bias and MSE. All of these estimators are asymptotically unbiased, and they all exhibit MSE consistency in that the MSE values tend toward zero as the sample size increases.
Figure 1. Conditional bias, mean square error, relative root mean square error and mean absolute error based on real data with a sample size of 100.
Figure 2. Conditional bias, mean square error, relative root mean square error and mean absolute error based on real data with a sample size of 50.
5. Conclusion and Recommendations
In this paper, an estimator of the finite population total has been developed by employing a feedforward backpropagation neural network technique in nonparametric regression. Asymptotic properties of the developed estimator, namely consistency and the mean squared error, have also been derived. When applied to the data set obtained from the United Nations Development Programme 2020 report, the findings indicate that the proposed estimator has the lowest bias and root mean square error among the estimators considered. The developed estimator is considered effective in addressing the curse of dimensionality that renders local polynomial and kernel estimators ineffective when dealing with high dimensional data. It should be noted that the proposed estimator has been studied under simple random sampling without replacement (SRSWoR). An extension to other sampling designs, such as stratified sampling, may be pursued, since many such designs rely on SRSWoR within strata, and it is hypothesised that efficiency would improve relative to other existing estimators in the literature.