A Simulation Study of Hierarchical Bayesian Fusion Spatial Small Area Model for Binary Outcome under Spatial Misalignment ()
1. Introduction
Simulation (stochastic) methods are based on obtaining random samples of the parameters of interest from the desired distribution and estimating the expectation of any function of the parameters. It can be used for high-dimensional distributions, and there are general algorithms which work for a wide variety of models; where necessary, more efficient computation can be obtained by combining these general ideas with tailored simulation methods, deterministic methods, and distributional approximations. Markov chain Monte Carlo (MCMC) methods have been important in making Bayesian inference practical for generic hierarchical models in small area estimation.
Small Area Estimation (SAE) is a strategy for improving estimation accuracy and delivering reliable parameter estimates for small areas, where a small area refers small sub-population in terms of sample size [1] [2] [3] [4]. To improve accuracy and reliability, a variety of estimators have been developed that combine survey data for the target small areas with data from outside the survey, frequently related to recent censuses and current administrative data.
In small area estimation, statistical models can be utilized in a model assisted approach or model based approach. Even when a model is misspecified, [5] that use working models, in which a model is specified but desirable design based attributes are preserved, are called model assisted estimators.
There are now a variety of model based approaches available, depending on the nature of the measurement (e.g., binary, count or continuous variable) [6] [7] and on the auxiliary data available for the individual units in the population versus those available only at the aggregate level (areal level or unit level) [8] [9].
Unit level estimations are expected to be more precise than areal level estimates on average [10]. Access to auxiliary data at the unit level, on the other hand, may be difficult. As a result, a fusion model that contains both unit and areal level data should be considered to accommodate this. The fusion small area model makes use of census auxiliary data at the area level, which is available in both observation and prediction regions, as well as individual survey variables.
The fusion predictive model links individual survey variables with a hierarchical areal level model, allowing prediction to be carried out to other geographically defined small areas. A fusion small area model, which was successfully used by other authors [11], comprising both unit level survey and areal level auxiliary census data was adopted.
The Bayesian approach to modeling has the ability to combine information from several sources using melding or data fusion [12]. The evaluation of prediction accuracy is an important aspect of SAE [13] although its computation may be complicated. The Bayesian approach solves the problem automatically, resulting to realizations of the target quantities’ from the posterior distribution [13].
A generalized linear model with or without spatial correlation structure was proposed by Ghosh et al. [14] using a hierarchical Bayesian framework. In addition, a hierarchical Bayesian model was proposed by Bakar et al. [11] for several categorical data under purely spatial setting. Besides, misalignment issues were addressed following this. Furthermore, difficulties with misalignment were resolved as a result of this.
Under the hierarchical Bayesian framework, misaligned data models have recently been presented to solve the issue of making inference for domains other than those for which survey data is available [11] [15] [16]. Following this, Trevisani and Gelfand [17] proposed a model to handle spatial misalignment for count data. For a categorical data under spatial misalignment, Bakar et al. [11] had proposed a hierarchical Bayesian model to handle this issue.
Furthermore, Bakar and Jin [18] proposed for estimating a binary outcome variable under purely spatial setting for secondary geography where survey data were not collected. Primary and secondary geographies are spatially misaligned. However, the study by Bakar and Jin [18] considered a real level data.
A fusion model developed by Muchie et al. [19] for the binary data has some appealing features. However, it has to be evaluated using simulation studies. As the model has closed forms for all required full conditionals, Gibbs sampler will be used for generating samples.
As a result, the aim of this study was to assess the performance of the hierarchical Bayesian spatial fusion small area model for binary data under spatial misalignment.
2. Materials and Methods
2.1. Model
Here we introduce a spatial hierarchical statistical model for the binary data. Let
be a binary response variable for the
individual in area r,
denoting the
area and also
be a set of p-covariates in the
area. Where
,
and
with
being a realization of Bernoulli distribution, i.e.
.
Considering y be the realization of a random variable Y that can take values one and zero with probabilities p and
respectively. Hence,
is a Bernoulli response at area
and a logit-normal model with probability
is defined as:
(1)
where
which is the unit level error.
Further,
(2)
where
is the
areal level predictor variable for
,
is the corresponding model coefficient, and
is error process; and L is the number of predictors or covariates in the model.
,is spatially correlated process, modeled explicitly to take the statistical dependency due to spatial proximity. As understanding and defining the spatial process
is often important for constructing the full conditional distribution of
.
The Moran’s I basis function [20] is one of the popular ways to model areal spatial process
, and consequently, is useful for addressing problems with large spatial data [21]. Hence, for dimension reduction, m eigenvectors of a Moran’s I operator matrix are used, where
. Similarly, dimension reduction approach is used with a set of multiresolution spatial basis function [22] [23]. In multiresolution models [23] m knots are defined over the spatial domain by using reduced rank approximation [24] [25].
Hence, the spatial basis function for the spatial process
comprises the Moran’s I and multiresolution which is written as:
(3)
where
(4)
is a combination of eigenvectors
of Moran’s I operator matrix
and multiresolution spatial basis
. The term
is an independent and identically distributed error process to capture the remaining random component so as to take the uncertainities arising from
.
The process models generated from the predictors (
) were used to construct the basis so as to alleviate the well-known problem of collinearity [26] between the true predictor variables and the spatial random process. Accordingly, the Moran’s I operator matrix is defined as
(5)
where
,
is
nearest neighbor weight matrix containing 0 s and 1 s, and
is an
identity matrix.
Spectral decomposition of Moran’s I operator matrix
[21] were used to define n eigen vectors,
and multiresolution spatial basis,
. A bi-square function was used to define the spatial basis function
.
For
basis functions of the ith resolution, we define the bi-square basis function in d-dimensional Euclidean space
as,
(6)
where
is the number of basis functions at the
resolution,
is the center of the
basis function
at the
resolution,
is the Euclidean norm,
is the radius of its spatial support (sometimes called the aperture), and
is an indicator function.
The choice of
determines the multiple resolutions, which are used to capture different dependence scales [22] [23] [25]. Some basis functions with centers outside the study domain are included to accommodate boundary effects [27].
When knot locations are equidistant over the domain, the value of
is typically some multiple of the minimum distance between knot locations. [22] use a constant range
for all knots of a given resolution. To accommodate knot locations that may not be equidistant, we let the ranges of the knots vary. Specifically, we define
as a multiple of the minimum distance from knot points to all the other knots. In practice,
is specified to be 1.5 times the minimum distance between basis function centers of the same resolution [22].
Another issue that is related to
is the selection of the number and position of knots. A regular grid, or its modification, is often a popular choice for the position of knots [28] [29]. An orthogonal spatial basis
of order
is constructed on the basis of the distances between the centroid of the areas and knot points.
We model
, which is a vector of spatial random effect, by using a Gaussian distribution with zero mean and a lower dimensional covariance matrix
,
. Here
comprises a smoothing parameter
and
i.e.
(7)
where Q is
matrix with non-diagonal entries
if regions i and j are neighbors and 0 otherwise, and diagonal entries
are equal to the number of region i’s neighbors [26]. It can be computed using the neighborhood weight matrix,
, as
, and
is a vector of 1 s.
Since Q is singular [30], we use a spectral decomposition of
to obtain the full conditional distribution of the spatial process
. Thus, we can write
(8)
where
is an orthogonal matrix of order
and
is a diagonal matrix, see details in [21]. Following [20], in this research we assume that eigen vectors of
are known and its eigen values are known up to a multiplicative constant.
2.2. Predictive Distributions
The geographical areas
are those in which prediction is required. Let the area
have overlapping geographical boundaries with
. We assume that the areal relationships that are learned from
are largely maintained for
as they cover the same region and the same whole population. This leads that we can reuse all the parameters with few adjustment of spatial correlation according to the new geographical classification. The fixed effect mean component and the spatial random component were computed separately and joined finally. Specifically, change of support is used to obtain the probability distribution of the process
at a second geographical boundaries.
The change of support for the mean process
in prediction location
is written as:
(9)
where,
is the area of the areal unit
. The integral in Equation (9) is however difficult to solve, and there are approximation methods available to address this difficulty [31]. However, the approximations are best suited for situations when
. This approximation techniques are derived using the concept of the posterior predictive distribution in the Bayesian inference approach.
The posterior predictive distribution is the distribution of possible unobserved values conditional on the observed values. The posterior predictive distribution for unobserved values (
) can be found conditional on the observed values (
). To find the predictive posterior distribution of
, we considered auxiliary data,
is the
design matrix, based on second geographical boundaries. Thus, the posterior predictive distribution of
is written as
(10)
where
contains all the parameters in the model.
On the other hand, for the spatial random effect, new nearest neighbor weight matrix
based on
neighborhood in Moran’s I-matrix, and a new orthogonal spatial basis
based on the distances between
centroids and knot points in the space
are used. Thus,
is obtained by incorporating
in Equation (3). Where,
is the basis function that contains
neighbourhood weight matrix calculated from at
.
and
respectively are the eigen vectors of Moran’s I operator matrix and multiresolution basis based on area
, i.e. SA2s.
The predictive distribution of
thus is obtained using MCMC samples and by composition of draws from the posterior distributions and
.
2.3. Gibbs Sampler Algorithm
The Gibbs sampler, introduced by [32], has been developed by [33]. The Gibbs sampler simulates from multidimensional posterior distributions by iterative sampling from the lower-dimensional full conditional posterior distributions.
The conditional distributions are required for implementation of the model through Markov Chain Monte Carlo (MCMC) simulation. Full conditional densities are derived by abstracting out from the full joint posterior distribution only those elements including the parameter of interest and treating other components as constants [34]. Accordingly, we obtain the full conditional distribution for the variance parameters and fixed effect coefficients of the model as given below where the detailed derivations are given in a manuscript done by the same authors [19]:
(11)
(12)
and (13)
(14)
(15)
The Gibbs sampler updates the chain one component at a time, instead of updating the entire vector. In our case, starting from an initial values
,
,
,
and
, at iteration t, the Gibbs sampler draws:
The densities on the right hand sides of the above are called the complete conditional distributions or full conditional distributions. We now provide step by step instructions for running this MCMC algorithm in the list below.
1. Choose initial values for
,
,
,
and
and denote them with
,
,
,
and
respectively. Let
.
2. Let
.
3. Update the components of
,
,
,
and
.
- Let
,
,
,
and
equal to a value generated from the full conditionals of
,
,
,
and
respectively.
4. Repeat the steps 2 and 3 until convergence.
Furthermore, the predictive distribution of
is obtained using MCMC samples and by composition of draws from the posterior distributions and
. We use the following algorithm to predict
:
1. Draw a sample
,
, from the posterior distribution.
2. By creating linear predictors using those parameters drawn in step 1, draw
from the predictive posterior distributions for
above. That is to replace the above integral by an empirical average of
computed using each of those samples in step 1,
3. Then draw
from the bernouli logit model given above for the
individual, where
for each area
. Hence the predictive equation can be written in terms of area
as
where
is obtained by incorporating
in Equation (3).
is a combination of
and
, which are the eigen vectors of Moran’s I operator matrix and multiresolution basis based on area
, i.e. SA2s. The term
is obtained from the predictive conditional Gaussian distribution with mean 0 and variance
.
4. Finally for each draw (t) we obtain the predictive probabilities in each
by averaging over the individual predicted probabilities as
Steps 1 - 3 in this algorithm are for predicting individual level probabilities. However, interest is in obtaining predictions at the aggregated areal level
. Hence, in step 4, we aggregate the results by averaging the individual predicted probabilities in each
for each draw (t). Note that, in the modelling hierarchy, probabilities are aggregated only at the prediction stage.
2.4. Measures of Model Evaluation
The R iterations for the direct and SAE model estimates were evaluated for bias, relative bias, mean squared error (MSE), root mean squared error (RMSE), percent coverage for a 95% nominal rate, and 95% confidence interval width for direct estimates and 95% credible interval width for the SAE models. Each of these metrics were calculated in two ways: 1) for an individual small area, and 2) average across all individual small areas.
Accordingly the first, Bias was calculated as the average difference between the posterior mean of the population total,
of small areai and iteration j, and the truth
:
The second, Relative bias,
, is defined as the bias relative to the truth for each small area (
) and summed across all small areas
(
).
The third, Mean squared error (MSE) was calculated as the average squared deviations of the posterior mean of the population total from the true population, and RMSE was the square root of these deviations:
where;
refers true parameter value for the
small area,
is the posterior mean of the population total, and R refers the number of iterations. A trade-off between bias and precision (mean squared error, ...) is present when applying a SAE model.
The last one, Percent coverage was defined as the average number of times the 95% SAE credible interval for
, or the direct estimate 95% confidence interval, included truth for each small area. The proportion of times the 95% CI of the estimated summary mean contains the true value.
It is desirable to have coverage near 95%. Coverage higher than 95% indicates an inefficient estimator whereas coverage less than 95% indicates an inaccurate estimator.
3. Results and Discussion
3.1. Simulation Design
Proper and weakly informative prior distributions for the model parameters were used. For the fixed effect coefficients (
) we used conjugate normal distribution as the prior distribution with mean zero and a variance of 10 to represent weakly informative and proper prior specification. For the variance and smoothing parameters we consider a conjugate inverse-gamma prior distributions, weakly informative hyper-parameters for the inverse-gamma prior.
As noted in [35], inverse-gamma prior with very small hyper-parameters does not have any proper limiting posterior distribution. Hence, we consider a proper and weakly informative invers gamma prior with reasonable values of the shape hyper-parameter
; and the rate hyper-parameters
[36] [37], which is popularly suggested in literature [38] [39] [40]. Therefore, for the variance parameters we let the shape and rate hyper-parameters of the inverse gamma prior distribution be 2 and 1, as the inverse gamma (2, 1) distribution is often treated as proper and weakly informative [36] [37].
To get small area estimation at
, a total of 400 square grid cells were considered where
population points were generated. Two predictor variables,
were generated and correspondingly the binary response,
(“yes” = 1, “no” = 0) was considered to generate the population values. The probabilities were hence calculated accordingly in each grid cell by identifying the “yes” simulated respondents. The aggregated values of
at each grid cell
were treated as census information in the model.
To clarify the insight of the data layout over the domain of the study area we have clearly stated it as follows. Firstly, we took a random sample of 10% [20] [41] of the total population and identify those points inside the square grid cells. Hence, the sample data did not contain all the square grid cells. Figure 1 & Figure 2 showed the squared grid cells and a representation of the samples in a grid cell together with the population points that were generated in that cell. Figure 1 showed that the sampled grid cells (shaded in grey colour) (
) were superimposed,
Figure 1. Representation of the population and sample grid cells as well as two resolutions of knot points.
Figure 2. The samples and population points in one squared grid cell.
which is based on randomly selected sample points; two different resolutions of knot points (
,
) used for the model proposed were also superimposed.
Secondly, the sample values of the response variable, the aggregated values of auxiliary variables and other spatial components were used to find estimates of hyper-parameters under the SA1.
Thirdly, based on data in
, we generate small area estimates for the second geographic areas
(i.e. SA2) having spatial misalignment with the SA1,
. The data are generated in a manner similar to that of
small area estimation. However, we now aim predictions for
, individual level probability at the 100 grid cells, using different knot point resolutions.
3.2. MCMC Output and Diagnosis
Regarding the software implementation, all simulations were performed using the freely available software “R version 4.0.3 (2020-10-10)”. It took 9 hours to generate an MCMC sample of size 50,000 using a machine with Intel(R) Core(TM) i7-4600U CPU @ 2.10 GHz 2.70 GHz and RAM of 8.00 GB (7.90 GB usable).
A gibbs sample of size 50,000 was generated and the first 10,000 samples were discarded as the burn-in period. Further, thinning was applied, every 10th sample selected discarding all others, so as to reduce auto-correlation, to reduce burden on computer memory/storage and to facilitate great deal of post-processing. Hence, 4000 posterior samples were collected after burn-in and thinning. Accordingly, an MCMC sample of size 4000 were used for the entire analysis here under. Accordingly, the trace plot, density plot and auto-correlation plots based on the final 4000 MCMC samples were given below.
Figure 3 showed the trace plots of the MCMC samples for selected model parameters of the model. Each plot illustrated that the chain was sampling from their corresponding stationary distributions. The MCMC chain of variance parameters and the entries of
were mixed well and converged quickly, while the MCMC samples of regression coefficients,
, appear to have relatively slow to converge.
The quick convergence of the predictive sample guarantees that the subsequent analyses (such as the averaged predictive means of
and
) are based on appropriate predictive distributions.
Similarly, the kernel density plot (Figure 4) also showed there is no serious problem of convergence as the plots are uni-modal. Fast decay of auto-correlation (Figure 5) as the lags increase also supplements the above statement regarding convergence.
Henceforth, probabilities for the second geographical areas were computed from posterior predictive distribution. All the following model comparisons were based on these probabilities.
3.3. Model Comparison
The estimates from the model, hierarchical Bayesian fusion spatial small area model, were compared with direct estimate. The RMSE, MAE and 95% probability coverage were considered for comparison purpose. Accordingly, the RMSE
Figure 3. Trace plot of parameters based on MCMC samples.
Figure 4. Kernel density plot of parameters based on MCMC samples.
Figure 5. ACF plot of parameters based on MCMC samples.
and MAE of the estimates from proposed model is less than that of the direct estimate showing the proposed model performs better (Table 1). Additionally, the 95% probability coverage for the estimate from the proposed model was greater than that of the direct estimate. This also strengthens the improvement of the proposed model.
4. Concluding Remarks
In this study, we have assessed the performance of the recently developed spatial hierarchical Bayesian fusion spatial small area model for binary data under spatial misalignment. The process models generated from the predictors were used to construct the basis so as to alleviate the well-known problem of collinearity between the true predictor variables and the spatial random process. The performance of the proposed model was assessed using simulation studies. The model proposed has the ability to address the large dimensional spatial problem by reducing dimension through the use of basis function knots. The proposed model performs better than the direct estimate. The developed model can be applied in many broad areas including but not limited to health sciences, public health, agriculture, and economics.
Acknowledgements
We would like to express our appreciation to Pan African University Institute of Basic Sciences, Technology and Innovation for the support.