Bayesian Interval Estimation of the Prevalence Rate Using Pool Testing Strategy with Retesting
1. Introduction
Estimating disease prevalence is a central activity in epidemiological studies. Traditional individual testing can be very costly and usually requires considerable time and resources, especially when it is carried out on large samples. The pool testing technique, which began with Dorfman [1], provides an economical alternative in which several clinical samples are combined into a pool or group that is then subjected to a single diagnostic test. Pool testing is particularly well suited to populations with a low disease prevalence rate. Pool testing with retesting consists of testing a pool and, if the pool tests positive, testing the pool again. When a pool also tests positive on the retest, the individual members of that pool are tested and the pool is categorized as either defective or non-defective. This helps to reduce the loss of sensitivity that may occur in pool testing. It has been demonstrated that pool testing with retesting decreases the number of misclassifications, increases the efficiency of the testing kit, and increases the efficiency of the estimator of the prevalence rate [2] [3]. Nyongesa [4] considered a single-stage pool design to estimate the prevalence of a disease in the presence of inspection errors. He investigated the effects of specificity, sensitivity, and sample size on the efficiency of the estimator of the prevalence rate and found that the estimator is more efficient for large sample sizes and low specificity and sensitivity.
Bayesian techniques have been applied in many epidemiological settings, such as disease monitoring, outbreak simulation, and prevalence quantification. For instance, McDonald and Hodson [5] applied Bayesian techniques to predict disease prevalence based on tests that are not perfect, and their study showed that this approach has superior efficiency to the other methods in the same context.
The Bayesian approach to prevalence rate estimation combines prior information about the prevalence rate with the data collected. To do this, one assigns a prior distribution to the prevalence rate, forms the likelihood of the prevalence rate from the observed data, and then applies Bayes' theorem to obtain the posterior distribution of the prevalence rate. Estimation of the prevalence rate is then based on the posterior distribution. The Bayesian method has been used to estimate the prevalence rate in group testing, as explained by Liu and Liu [6]. Their study described Bayesian interval estimation using Beta priors and binomial likelihoods and showed the superiority of Bayesian credible intervals in the group testing paradigm. The authors also explained how the addition of prior information can enhance estimation, especially where the prevalence is low.
Spiegelhalter and Best [7] discussed the application of Bayesian methods to complex models in epidemiology, noting that Bayesian credible intervals are suitable for modeling costs and effects in cost-effectiveness analysis. They demonstrated that, even with all the different approaches to evidence and uncertainty, EA is a powerful tool that can be utilized for decision-making in the field of public health. Gelman et al. [8] describe the general Bayesian approach to statistical data analysis and, for single-parameter models such as the binomial distribution, give practical examples of the construction and interpretation of Bayesian credible intervals for the unknown parameter of interest.
Brett, Rohani, and Drake [9] used Bayesian methods to predict epidemic transitions from imperfect observational data. To measure the uncertainty of the estimates of some important epidemiological parameters, they used Bayesian credible intervals and pointed out that intervals of this kind are rather robust to data uncertainty and model misspecification.
Orawo [10] compared four interval estimation methods for the binomial proportion: the Wilson, Clopper-Pearson, likelihood, and Wald confidence intervals. The simulation study established that the Clopper-Pearson interval was conservative for both small and large samples, whereas the Wald confidence interval had coverage probabilities below the nominal level and suffered from overshoot. The Wilson and likelihood intervals had similar coverage probabilities, close to the nominal level, with the likelihood interval being slightly shorter.
Hepworth [11] compared four confidence interval methods for estimating the proportion of positively classified units when the number of patterns in each group is different and the sample is very large. He stated that these methods should be compared to the exact likelihood ratio, which is preferable to the precise method. In addition, Biggerstaff [12] presented a confidence interval method for the difference of two proportions based on the pooled sample.
The applicability of Bayesian statistical methods to interval estimation of the prevalence rate under the pool testing with retesting strategy remains relatively unexplored. The traditional Wald method has been applied in constructing confidence intervals for the prevalence rate based on data generated by different pool testing designs; however, it is not accurate when the normal approximation to the binomial is poor, which is expected when the prevalence rate is low and close to zero. Other intervals for a proportion, such as the Clopper-Pearson and Wilson intervals, are restricted to binomial sampling and thus cannot be computed for the more complex trinomial probability model that results from pool testing with retesting, whose probabilities involve the prevalence rate. This paper considers the numerical construction of Bayesian credible intervals and compares them with Wald confidence intervals on the basis of coverage probabilities and mean interval lengths computed from simulated data.
The paper is organized as follows. Section 2 describes the pool testing with retesting design. In Section 3, the trinomial probability model is derived from the retesting design. The Bayesian credible intervals and Wald confidence intervals are outlined in Section 4. In Section 5, the simulation results on the coverage probability and mean interval length of the interval estimates generated by the two methods are presented. Section 6 is devoted to concluding remarks.
2. The Retesting Design
Suppose that a population of size $N$ is divided into $n$ groups (pools), each of size $k$. Each of the $n$ pools is subjected to an initial test. Pools that test positive are retested, whereas further testing is discontinued for pools that test negative. If a pool tests positive on retesting, its constituent members are tested individually for the presence of the characteristic of interest. Retesting is considered to improve the sensitivity and specificity of the testing scheme. It has also been shown that retesting reduces misclassification significantly compared to the Dorfman procedure and enhances the efficiency of the testing kit.
Figure 1 illustrates the process of pool testing with retesting.
The figure shows the $n$ constructed groups and the test result on the $i$-th group, $i = 1, 2, \ldots, n$.
Figure 1. Pool testing with retesting design.
The following indicator functions were used to classify the
binary observations generated by the pool testing with the retesting procedure
and
The observations of the constituent members of the $i$-th group were denoted by
. By definition,
.
Suppose that the constituent members of a group act independently of each other; then the probability that a group of size $k$ contains at least one defective member is $1 - (1 - p)^k$, where $p$ is the prevalence rate.
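The design described in this section can be sketched as a small simulation. This is a minimal illustration rather than the authors' code: the function name `simulate_retesting_design`, its parameter names, and the assumption that the initial test and the retest err independently with the same sensitivity and specificity (both defaulting to 0.95, the value stated in the concluding section) are choices made here. Only the pool-level outcomes are simulated; the individual testing of pools that test positive twice is not.

```python
import random


def simulate_retesting_design(p, k, n, sens=0.95, spec=0.95, seed=None):
    """Simulate n pools of size k under pool testing with retesting.

    Returns the number of pools that are
      (negative on the initial test,
       positive on the initial test but negative on the retest,
       positive on both tests).
    Assumption (not from the paper): both tests share `sens` and `spec`
    and err independently given the true status of the pool.
    """
    rng = random.Random(seed)
    neg_initial = pos_then_neg = pos_then_pos = 0
    for _ in range(n):
        # A pool is defective if at least one of its k members is defective.
        defective = any(rng.random() < p for _ in range(k))
        # Probability that a single test declares this pool positive.
        p_positive = sens if defective else 1.0 - spec
        if rng.random() >= p_positive:
            neg_initial += 1          # testing stops for this pool
        elif rng.random() >= p_positive:
            pos_then_neg += 1         # retest contradicts the initial test
        else:
            pos_then_pos += 1         # members would now be tested individually
    return neg_initial, pos_then_neg, pos_then_pos


if __name__ == "__main__":
    print(simulate_retesting_design(p=0.05, k=5, n=1000, seed=1))
```

The three counts always add up to the number of pools, which is what makes a trinomial model natural for data from this design.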
3. The Model Formation
Let the random variables
and
denote the number of groups that test positive and negative on the initial test, respectively. Also, denote by
and
the number of groups that test positive and negative on the retest, respectively. Utilizing the indicator functions
and
and Figure 1, the probability of declaring the
pool negative on the initial test
is derived as follows. By the law of total probability
where
is the sensitivity (the probability of correctly classifying a defective pool), and
is the specificity (the probability of correctly classifying a non-defective pool).
Similarly, the probability that a pool initially classified as positive is declared negative on retesting is denoted by
and is obtained as
Finally, the probability of declaring a pool positive on retesting the initially announced positive pool can be derived as
The above three probabilities
and
are used to formulate the model of group testing with retesting as the joint probability distribution of the random variables
and
with joint probability mass function
, (1)
which is a trinomial probability model with parameters
.
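One way to write the three cell probabilities explicitly is sketched below. It assumes that the initial test and the retest err independently given the true status of the pool and share the same sensitivity $\eta$ and specificity $\beta$. The symbols $\theta_1, \theta_2, \theta_3$, $\eta$, $\beta$, $k$ (pool size), $n$ (number of pools) and the counts $x_1, x_2$ are notation assumed here for illustration and need not match the authors' symbols. With $(1-p)^k$ the probability that a pool is non-defective, the law of total probability gives

$$
\begin{aligned}
\theta_1(p) &= (1-\eta)\,[1-(1-p)^k] + \beta\,(1-p)^k && \text{(negative on the initial test)},\\
\theta_2(p) &= \eta(1-\eta)\,[1-(1-p)^k] + \beta(1-\beta)\,(1-p)^k && \text{(positive, then negative on the retest)},\\
\theta_3(p) &= \eta^2\,[1-(1-p)^k] + (1-\beta)^2\,(1-p)^k && \text{(positive on both tests)},
\end{aligned}
$$

with $\theta_1(p) + \theta_2(p) + \theta_3(p) = 1$. If $x_1$ and $x_2$ count the pools in the first two categories, the corresponding trinomial probability mass function is

$$
f(x_1, x_2 \mid p) = \frac{n!}{x_1!\,x_2!\,(n-x_1-x_2)!}\;\theta_1(p)^{x_1}\,\theta_2(p)^{x_2}\,\theta_3(p)^{\,n-x_1-x_2}.
$$

Under this sketch, when $\eta = \beta$ the middle probability reduces to $\theta_2 = \eta(1-\eta)$, which does not depend on $p$, so the information about $p$ comes from the other two categories.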
The likelihood function of the parameter
is proportional to the trinomial probability mass function in (1). Taking
as the constant of proportionality, the likelihood function is given as
. (2)
The log-likelihood function is
(3)
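The log-likelihood can be evaluated numerically as in the following sketch. It assumes the same conditional-independence model for the two tests described above; the function names, the argument order, and the convention that `x1` counts pools negative on the initial test and `x2` counts pools positive and then negative on the retest are assumptions made for illustration. The additive constant from the trinomial coefficient is dropped, as in the likelihood (2).

```python
import math


def cell_probs(p, k, sens=0.95, spec=0.95):
    """Trinomial cell probabilities under the assumed retesting model."""
    q = (1.0 - p) ** k                      # P(pool is non-defective)
    theta1 = (1 - sens) * (1 - q) + spec * q              # negative initially
    theta2 = sens * (1 - sens) * (1 - q) + spec * (1 - spec) * q  # pos, then neg
    theta3 = sens ** 2 * (1 - q) + (1 - spec) ** 2 * q    # positive twice
    return theta1, theta2, theta3


def log_likelihood(p, x1, x2, n, k, sens=0.95, spec=0.95):
    """Log-likelihood of p, up to an additive constant that is free of p."""
    theta1, theta2, theta3 = cell_probs(p, k, sens, spec)
    x3 = n - x1 - x2
    return (x1 * math.log(theta1)
            + x2 * math.log(theta2)
            + x3 * math.log(theta3))


if __name__ == "__main__":
    # Hypothetical counts for n = 1000 pools of size k = 5.
    for p in (0.01, 0.05, 0.2):
        print(p, round(log_likelihood(p, 746, 48, 1000, 5), 3))
```

For these hypothetical counts the log-likelihood is larger at $p = 0.05$ than at the other two values printed.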
4. Interval Estimation Methods
Interval estimation is a statistical technique used to estimate the range within which a population parameter is expected to lie with a certain level of confidence. It provides information regarding the closeness of a point estimate to the true parameter value by giving a range of plausible values for the parameter.
4.1. Wald Confidence Interval
The Wald confidence interval is the most widely used approximate confidence interval and is based on the asymptotic normality of the maximum likelihood estimator (MLE) of the parameter of interest. Let $\hat{p}$ be the MLE of the prevalence rate $p$ obtained by maximizing the log-likelihood function in equation (3). Then, by the asymptotic property of the MLE, the sampling distribution of $(\hat{p} - p)/\mathrm{SE}(\hat{p})$ is approximately standard normal. The standard error is defined as $\mathrm{SE}(\hat{p}) = 1/\sqrt{I(\hat{p})}$, where $I(\hat{p})$ is the observed Fisher information. Given $\alpha \in (0, 1)$, the $100(1-\alpha)\%$ Wald confidence interval of the prevalence rate $p$ is given by $\hat{p} \pm z_{\alpha/2}\,\mathrm{SE}(\hat{p})$, where $z_{\alpha/2}$ is the upper $\alpha/2$ percentile of the standard normal distribution. The most commonly used values of $\alpha$ are 0.1, 0.05, and 0.01, which correspond to confidence levels of 90%, 95%, and 99%, respectively.
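A Wald interval of this kind can be computed numerically along the following lines. This is a sketch under stated assumptions, not the authors' implementation: the cell probabilities assume that both tests err independently with sensitivity and specificity 0.95, the MLE is found by a golden-section search (a technique chosen here because the log-likelihood has a single maximum in this model), and the observed information is approximated by a central finite difference with a hypothetical step `h`.

```python
import math
from statistics import NormalDist


def cell_probs(p, k, sens=0.95, spec=0.95):
    q = (1.0 - p) ** k
    return ((1 - sens) * (1 - q) + spec * q,
            sens * (1 - sens) * (1 - q) + spec * (1 - spec) * q,
            sens ** 2 * (1 - q) + (1 - spec) ** 2 * q)


def log_lik(p, x1, x2, n, k):
    t1, t2, t3 = cell_probs(p, k)
    return x1 * math.log(t1) + x2 * math.log(t2) + (n - x1 - x2) * math.log(t3)


def mle(x1, x2, n, k, iters=100):
    """Maximize the log-likelihood over (0, 1) by golden-section search."""
    lo, hi = 1e-8, 1.0 - 1e-8
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(iters):
        c = hi - ratio * (hi - lo)
        d = lo + ratio * (hi - lo)
        if log_lik(c, x1, x2, n, k) > log_lik(d, x1, x2, n, k):
            hi = d      # the maximum lies in [lo, d]
        else:
            lo = c      # the maximum lies in [c, hi]
    return 0.5 * (lo + hi)


def wald_interval(x1, x2, n, k, alpha=0.05, h=1e-5):
    """Return (p_hat, lower, upper) for the Wald interval."""
    p_hat = mle(x1, x2, n, k)
    if p_hat <= h:
        raise ValueError("MLE is on the boundary; the Wald interval is undefined")
    # Observed information: minus the second derivative of the log-likelihood.
    info = -(log_lik(p_hat + h, x1, x2, n, k)
             - 2.0 * log_lik(p_hat, x1, x2, n, k)
             + log_lik(p_hat - h, x1, x2, n, k)) / h ** 2
    if info <= 0:
        raise ValueError("non-positive observed information")
    se = 1.0 / math.sqrt(info)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return p_hat, p_hat - z * se, p_hat + z * se


if __name__ == "__main__":
    print(wald_interval(746, 48, 1000, 5))  # hypothetical counts
```

Nothing in this construction keeps the lower limit above zero, which is one reason the Wald interval can behave poorly when the prevalence rate is small.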
4.2. Bayesian Credible Intervals
In the Bayesian credible interval approach, the prevalence rate parameter $p$ is a random variable with a probability density function on the parameter space $\Theta = (0, 1)$. The joint probability mass function in equation (1) is regarded as the conditional mass function of the data $\mathbf{x}$ given $p$ and written as $f(\mathbf{x} \mid p)$. The unknown parameter is assigned a probability distribution, denoted by $\pi(p)$, called the prior distribution of $p$. This reflects the experimenter's subjective belief regarding which values of $p$ are more or less likely over the whole parameter space. The evidence about $p$ from the prior distribution $\pi(p)$ and the likelihood function $L(p)$ is combined by means of Bayes' theorem to obtain the posterior distribution. In this study, we use the Beta distribution as a prior for the parameter of interest because it is flexible and appropriate for modeling proportions. The Beta distribution can take many shapes and is widely used in Bayesian statistics for modeling an unknown proportion. Its parameters can be chosen to reflect prior information about a small proportion. Since the prevalence rate is close to zero, using a non-informative uniform distribution over the interval (0, 1) may lead to inaccurate interval estimates.
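Let $f(\mathbf{x}, p) = f(\mathbf{x} \mid p)\,\pi(p)$ be the joint probability distribution of the data $\mathbf{x}$ and $p$. The posterior distribution of $p$ given $\mathbf{x}$, denoted by $\pi(p \mid \mathbf{x})$, is defined as

$$\pi(p \mid \mathbf{x}) = \frac{f(\mathbf{x} \mid p)\,\pi(p)}{\int_0^1 f(\mathbf{x} \mid p)\,\pi(p)\,dp}. \qquad (4)$$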
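As a worked illustration of (4), suppose the prior is a Beta distribution with hyperparameters $a$ and $b$ (hypothetical symbols; no specific values are implied here), and write $\theta_1(p), \theta_2(p), \theta_3(p)$ for the three trinomial cell probabilities of the model in (1), with $x_1, x_2$ and $n - x_1 - x_2$ the corresponding counts (notation assumed for this sketch). Up to a normalizing constant, the posterior density is then

$$
\pi(p \mid x_1, x_2) \;\propto\; p^{\,a-1}(1-p)^{\,b-1}\;\theta_1(p)^{x_1}\,\theta_2(p)^{x_2}\,\theta_3(p)^{\,n-x_1-x_2}, \qquad 0 < p < 1 .
$$

Because each $\theta_j(p)$ depends on $p$ through $(1-p)^k$ mixed with the sensitivity and specificity, this kernel is not of Beta form, so the Beta prior is not conjugate for this model and the normalizing integral has to be handled numerically.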
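The credible interval of the prevalence rate $p$ is constructed from the above posterior distribution and is defined as follows: for fixed $\alpha \in (0, 1)$, a subset $C$ of the parameter space $\Theta$ is called a $100(1-\alpha)\%$ credible interval for $p$ if and only if the posterior probability of the subset $C$ given the data $\mathbf{x}$ is at least $1-\alpha$, that is,

$$P(p \in C \mid \mathbf{x}) = \int_C \pi(p \mid \mathbf{x})\,dp \ge 1-\alpha. \qquad (5)$$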
For a given value of $\alpha$ there is a long list of competing $100(1-\alpha)\%$ credible intervals, and the optimal credible interval should be short and include only those values of $p$ that are very likely according to the posterior distribution. Such a credible interval is called a highest posterior density (HPD) interval and is defined as follows: for a fixed $\alpha$, a subset $C$ of the parameter space $\Theta$ is called a $100(1-\alpha)\%$ HPD credible interval for $p$ if and only if the subset $C$ has the following form in terms of the posterior density $\pi(p \mid \mathbf{x})$:

$$C = \{\, p \in \Theta : \pi(p \mid \mathbf{x}) \ge c \,\}, \qquad (6)$$

where $c$ depends on $\alpha$ and the data, and is the largest constant such that the posterior probability of $C$ is at least $1-\alpha$. If the posterior density is symmetric about its finite mean, then the $100(1-\alpha)\%$ HPD credible interval $C$ will be the shortest, equal-tailed and symmetric about the posterior mean.
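The HPD construction can be approximated on a grid, as in the sketch below. The function name `hpd_from_grid` and its arguments are hypothetical; it takes equally spaced grid points and (possibly unnormalized) posterior density values, keeps the highest-density points until their share of the total mass reaches $1-\alpha$, and reports the smallest and largest kept points. Reporting the two end points as an interval is appropriate only when the posterior has a single mode, which is assumed here.

```python
import math


def hpd_from_grid(grid, density, alpha=0.05):
    """Approximate HPD interval from density values on an equally spaced grid.

    `density` may be unnormalized. Assumes a unimodal posterior, so that the
    set of highest-density grid points forms one contiguous interval.
    """
    total = sum(density)
    # Visit grid points from highest to lowest density.
    order = sorted(range(len(grid)), key=lambda i: density[i], reverse=True)
    mass, kept = 0.0, []
    for i in order:
        kept.append(i)
        mass += density[i] / total
        if mass >= 1.0 - alpha:
            break
    return min(grid[i] for i in kept), max(grid[i] for i in kept)


if __name__ == "__main__":
    # Illustrative bell-shaped density centred at 0.5 (not a posterior from the paper).
    grid = [i / 2000 for i in range(2001)]
    dens = [math.exp(-0.5 * ((g - 0.5) / 0.05) ** 2) for g in grid]
    print(hpd_from_grid(grid, dens))
```

For a symmetric density such as the one in the usage example, the result is approximately symmetric about the centre, in line with the remark above.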
We note that it is not possible to derive the analytical form of the posterior density in (4), and therefore we apply the grid approximation algorithm to simulate from it. Suppose that $G$ values of the prevalence rate $p$ are simulated from the posterior density. Then an equal-tailed credible interval is computed as $(q_{\alpha/2},\, q_{1-\alpha/2})$, where $q_z$ is the $z$-quantile of the posterior distribution.
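The grid approximation and the equal-tailed interval can be sketched as follows. Several ingredients are assumptions made for illustration only: the Beta(1, 10) prior (the hyperparameters `a` and `b` are hypothetical), the grid size `m`, the number of draws, the function names, and the model in which both tests err independently with sensitivity and specificity 0.95.

```python
import math
import random


def cell_probs(p, k, sens=0.95, spec=0.95):
    q = (1.0 - p) ** k
    return ((1 - sens) * (1 - q) + spec * q,
            sens * (1 - sens) * (1 - q) + spec * (1 - spec) * q,
            sens ** 2 * (1 - q) + (1 - spec) ** 2 * q)


def grid_posterior(x1, x2, n, k, a=1.0, b=10.0, m=5000):
    """Posterior weights on a midpoint grid over (0, 1) with a Beta(a, b) prior."""
    grid = [(i + 0.5) / m for i in range(m)]
    log_post = []
    for p in grid:
        t1, t2, t3 = cell_probs(p, k)
        log_lik = x1 * math.log(t1) + x2 * math.log(t2) + (n - x1 - x2) * math.log(t3)
        log_prior = (a - 1) * math.log(p) + (b - 1) * math.log(1 - p)
        log_post.append(log_lik + log_prior)
    top = max(log_post)                       # subtract the max for stability
    weights = [math.exp(v - top) for v in log_post]
    total = sum(weights)
    return grid, [w / total for w in weights]


def equal_tailed_interval(grid, weights, alpha=0.05, draws=5000, seed=0):
    """Draw G = `draws` values from the grid posterior and take sample quantiles."""
    rng = random.Random(seed)
    sample = sorted(rng.choices(grid, weights=weights, k=draws))
    lo = sample[math.floor((alpha / 2) * (draws - 1))]
    hi = sample[math.ceil((1 - alpha / 2) * (draws - 1))]
    return lo, hi


if __name__ == "__main__":
    grid, weights = grid_posterior(746, 48, 1000, 5)   # hypothetical counts
    print(equal_tailed_interval(grid, weights))
```

Working with log-densities and subtracting the maximum before exponentiating avoids numerical underflow when the number of pools is large.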
5. Simulation Study
In this section, simulation studies are conducted to evaluate the performance of the Bayesian interval estimation method relative to the traditional Wald method. Various simulation scenarios with different prevalence rates $p$, pool sizes $k$, and numbers of pools $n$ are considered. For each scenario, the performances of the Bayesian credible interval and the Wald confidence interval for the prevalence rate $p$ were compared on the basis of coverage probability and expected length. Let $x_1$ and $x_2$ denote the observed values of the random variables $X_1$ and $X_2$, and let $f(x_1, x_2 \mid p)$ denote the value of the trinomial probability mass function in (1). It follows that, for any interval estimation method for the prevalence rate $p$, the actual coverage probability at a fixed value of $p$ is given by

$$C(p) = \sum_{x_1} \sum_{x_2} I(x_1, x_2, p)\, f(x_1, x_2 \mid p), \qquad (7)$$

where the indicator function $I(x_1, x_2, p)$ takes the value 1 if the interval covers the value of $p$ and the value 0 if the interval does not cover the value of $p$. Let $L(x_1, x_2)$ and $U(x_1, x_2)$ denote the lower and upper confidence limits, respectively. The expected length of the random interval $(L(X_1, X_2),\, U(X_1, X_2))$ is given by

$$\sum_{x_1} \sum_{x_2} \left[ U(x_1, x_2) - L(x_1, x_2) \right] f(x_1, x_2 \mid p). \qquad (8)$$
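Because (7) and (8) are finite sums over the trinomial outcomes, they can be evaluated exactly by enumeration when the number of pools is small. The sketch below does this for an equal-tailed interval read off a grid posterior. The Beta(1, 10) prior, the grid size, the function names, and the model with independent test errors and sensitivity and specificity 0.95 are assumptions made for illustration, and the small value of `n` in the usage example is chosen only to keep the enumeration fast.

```python
import math


def cell_probs(p, k, sens=0.95, spec=0.95):
    q = (1.0 - p) ** k
    return ((1 - sens) * (1 - q) + spec * q,
            sens * (1 - sens) * (1 - q) + spec * (1 - spec) * q,
            sens ** 2 * (1 - q) + (1 - spec) ** 2 * q)


def trinomial_pmf(x1, x2, n, probs):
    x3 = n - x1 - x2
    coef = math.factorial(n) // (
        math.factorial(x1) * math.factorial(x2) * math.factorial(x3))
    return coef * probs[0] ** x1 * probs[1] ** x2 * probs[2] ** x3


def grid_interval(x1, x2, n, k, a=1.0, b=10.0, alpha=0.05, m=200):
    """Equal-tailed credible limits from cumulative mass on a midpoint grid."""
    grid = [(i + 0.5) / m for i in range(m)]
    log_post = []
    for p in grid:
        t1, t2, t3 = cell_probs(p, k)
        log_post.append(x1 * math.log(t1) + x2 * math.log(t2)
                        + (n - x1 - x2) * math.log(t3)
                        + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))
    top = max(log_post)
    weights = [math.exp(v - top) for v in log_post]
    total = sum(weights)
    cum, lo, hi, lo_found = 0.0, grid[0], grid[-1], False
    for p, w in zip(grid, weights):
        cum += w / total
        if not lo_found and cum >= alpha / 2:
            lo, lo_found = p, True
        if cum >= 1 - alpha / 2:
            hi = p
            break
    return lo, hi


def coverage_and_length(p_true, n, k, interval_fn):
    """Exact coverage probability (7) and expected length (8) by enumeration."""
    probs = cell_probs(p_true, k)
    cover = length = 0.0
    for x1 in range(n + 1):
        for x2 in range(n - x1 + 1):
            f = trinomial_pmf(x1, x2, n, probs)
            lo, hi = interval_fn(x1, x2, n, k)
            cover += f * (lo <= p_true <= hi)   # indicator times pmf
            length += f * (hi - lo)
    return cover, length


if __name__ == "__main__":
    print(coverage_and_length(0.05, 15, 5, grid_interval))
```

For larger numbers of pools the number of outcomes grows quadratically in $n$, and a Monte Carlo average over simulated data sets is a common substitute for the exact sums.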
For the simulation study we arbitrarily set the number of pools at
. The 95% Bayesian credible intervals and 95% Wald confidence intervals were constructed for 1000 simulations of
for fixed values of the prevalence rate
and pool size
. The coverage probability and expected length are then computed for each of the two interval estimation methods. The values of the prevalence rates and pool sizes used in the simulation study are
and
. Table 1 and Table 2 show, respectively, the computed coverage probabilities and mean interval lengths (in brackets) for the Bayesian credible intervals and Wald confidence intervals for various values of
and
. The Bayesian credible intervals have coverage probabilities that are higher than the nominal level, and their mean interval lengths are extremely short. The coverage probabilities generally draw closer to the nominal level as the pool size increases. On the other hand, the coverage probabilities of the Wald confidence intervals are lower but, on average, closer to the nominal level, and their mean interval lengths are very large compared with those of the credible intervals.
To investigate the effects of the group size $k$ and the prevalence rate $p$ on the performance of the two methods of interval estimation, coverage probabilities and mean lengths of the credible and Wald confidence intervals were computed repeatedly for different pairs of values of $k$ and $p$. The following three pairs of values of $k$ and $p$ were arbitrarily chosen and used in the simulation study:
,
and
. Figure 2 and Figure 3 show plots of coverage probabilities and mean interval lengths of the 95% Bayesian credible intervals and 95% Wald confidence intervals, respectively, when
and
. The coverage probabilities (in Figure 2) for both types of interval estimates for the prevalence rate show short upward and downward spikes and are close to the nominal level; however, most coverage probabilities for the Wald confidence intervals are below the nominal level, whereas those of the Bayesian credible intervals are evenly distributed about the nominal level. The graphs in Figure 3 demonstrate that, on average, the mean interval lengths of the Bayesian credible intervals are smaller than those of the Wald confidence intervals.
Table 1. Coverage probabilities and mean interval lengths (in brackets) of the nominal 95% Bayesian credible intervals for the prevalence rate $p$.

| Prevalence rate | Pool size 5 | Pool size 15 | Pool size 20 |
|---|---|---|---|
| 0.05 | 0.973 [0.0000156] | 0.957 [0.0000088] | 0.956 [0.000005806] |
| 0.01 | 0.986 [0.00001702] | 0.963 [0.0000102] | 0.952 [0.00000941] |
| 0.05 | 0.983 [0.00003764] | 0.965 [0.0000258] | 0.959 [0.0000240] |
| 0.1 | 0.988 [0.00004985] | 0.972 [0.00003864] | 0.980 [0.00004966] |
Table 2. Coverage probabilities and mean interval lengths (in brackets) of the nominal 95% Wald confidence intervals for the prevalence rate $p$.

| Prevalence rate | Pool size 5 | Pool size 15 | Pool size 20 |
|---|---|---|---|
| 0.05 | 0.998 [0.030517] | 0.909 [0.008167] | 0.950 [0.00673] |
| 0.01 | 0.931 [0.020388] | 0.938 [0.010986] | 0.941 [0.009811] |
| 0.05 | 0.942 [0.04274] | 0.953 [0.02934] | 0.938 [0.02775] |
| 0.1 | 0.945 [0.06341] | 0.949 [0.05522] | 0.961 [0.06386] |
Figure 2. Plots of coverage probabilities of the 95% credible intervals and 95% Wald confidence intervals when
and
.
Figure 3. Plots of mean interval lengths of the 95% credible intervals and 95% Wald confidence intervals when
and
.
Figure 4 and Figure 5 show the graphs of coverage probabilities and mean interval lengths for the 95% credible intervals and 95% Wald confidence intervals when the group size is increased to 20 and the prevalence rate is kept constant at 0.1. Figure 4 shows that the coverage probability of the Wald confidence interval rises above the nominal level when the group size is increased to 20. On the other hand, the coverage probabilities of the Bayesian credible interval are close to the nominal level and are not affected by the increase in group size. In Figure 5, the mean interval lengths of the 95% Wald confidence intervals have long upward and downward spikes compared with those of the credible intervals, which are short and uniform. This implies that the Wald method produces wider confidence intervals when the group size is increased and hence is less accurate. Increasing the group size reduces the number of groups (less data), and hence the Wald method performs poorly since it is based on large-sample theory.
Figure 4. Plots of coverage probabilities of the 95% credible intervals and 95% Wald confidence intervals when the group size is 20 and the prevalence rate is 0.1.
Figure 5. Plots of mean interval lengths of the 95% credible intervals and 95% Wald confidence intervals when the group size is 20 and the prevalence rate is 0.1.
Lastly, Figure 6 and Figure 7 present the plots of coverage probabilities and mean interval lengths for the 95% credible intervals and 95% Wald confidence intervals when the group size is kept constant at 20 and the prevalence rate is decreased to 0.01. In Figure 6, it can be observed that the coverage probabilities of the interval estimates produced by both methods are, on average, below the nominal level, and that their upward and downward spikes show the same pattern. However, the plots of mean interval lengths in Figure 7 indicate that the performance of the Wald method deteriorates further as the prevalence rate decreases, while that of the Bayesian method remains good.
Figure 6. Plots of coverage probabilities of the 95% credible intervals and 95% Wald confidence intervals when the group size is 20 and the prevalence rate is 0.01.
Figure 7. Plots of mean interval lengths of the 95% credible intervals and 95% Wald confidence intervals when the group size is 20 and the prevalence rate is 0.01.
5.1. Results Interpretation
The simulation results demonstrate that Bayesian credible intervals generally outperform Wald confidence intervals, particularly in scenarios with low prevalence rates and smaller pool sizes. This superior performance is attributed to the Bayesian method's ability to incorporate prior information and its robustness to variability, making it less sensitive to small sample sizes and low prevalence rates. In contrast, the Wald method relies on the normal approximation, which becomes unreliable when the prevalence rate is close to zero or the sample size is limited. Consequently, the Wald intervals tend to be wider and less accurate under such conditions, whereas Bayesian intervals maintain consistent coverage probabilities close to the nominal level and have narrower interval widths.
However, in cases where the prevalence rate and sample size are relatively large, both methods tend to perform similarly, as the Wald method’s reliance on asymptotic normality becomes more reliable. Despite this, the Bayesian approach still provides a slight advantage by yielding more stable and efficient interval estimates.
5.2. Practical Implications
The findings have significant implications for public health decision-making. Accurate prevalence estimation is crucial in managing infectious disease outbreaks and allocating healthcare resources efficiently. The use of Bayesian credible intervals allows for more precise estimation, especially in low-prevalence settings, enabling policymakers to make data-driven decisions with greater confidence. Furthermore, the ability to incorporate prior information can enhance predictive accuracy when prior data or expert knowledge is available. In contrast, reliance on the Wald method in such scenarios might lead to overestimating uncertainty, resulting in either excessive caution or insufficient intervention measures. Adopting Bayesian methods can therefore optimize decision-making processes, particularly in surveillance and screening programs.
6. Conclusions
The results of the above simulation study indicate that, for all the cases considered, the Bayesian interval estimation method produced more accurate interval estimates than the traditional Wald method, with coverage probabilities close to the nominal level of 0.95 and short interval widths. The Wald confidence intervals also had coverage probabilities close to the nominal level for all the cases considered; however, their mean interval widths were large, and hence the intervals were imprecise. In Figure 7, the graph of the mean interval width for the Wald confidence interval has long spikes, implying that as the pool size increases the Wald confidence interval becomes wider and hence less precise. On the other hand, the graph of the Bayesian interval widths in the same figure has short spikes, indicating that the precision of the credible intervals is not significantly affected by increased pool size, and hence the Bayesian method is more efficient in estimating the prevalence rate. Finally, it can be observed that the accuracy of the interval estimates constructed by both methods is good at lower prevalence rates and pool sizes and deteriorates (although only slightly for the Bayesian method) with an increase in the prevalence rate and pool size. This may be due to the fact that the efficiency of pool testing is low at higher prevalence rates. Also, the accuracy of the Wald method depends on the normal approximation to the binomial, which may be poor for all the prevalence rates used in the simulation study. In this paper, we assumed that the specificity and sensitivity of the testing kit are equal and fixed at 0.95. However, they can be varied in a future study to investigate their effects on the efficiency of the two interval estimation methods for the prevalence rate of a rare trait.
This study demonstrates that the Bayesian interval estimation method provides more accurate and efficient interval estimates for prevalence rates than the traditional Wald method. The Bayesian approach consistently achieves coverage probabilities closer to the nominal level, even when prevalence rates are low or sample sizes are small. In contrast, the Wald method often produces wider intervals and lower coverage probabilities under similar conditions due to its reliance on the normal approximation. The practical implications of these findings are significant for public health decision-making. Accurate prevalence estimation enables better resource allocation, outbreak management, and screening program optimization. By incorporating prior information and maintaining robustness to variability, Bayesian methods enhance the reliability of prevalence estimates, particularly in low-prevalence scenarios. Future research should investigate the impact of varying the specificity, the sensitivity, and the number of pools on interval estimation efficiency. Adjusting these parameters could provide further insights into optimizing the Bayesian approach for different diagnostic contexts. Our next research project is to construct likelihood confidence intervals for the prevalence rate under the pool testing with retesting design and compare them with the Bayesian credible intervals using simulation.