Probability Distributions Arising in Connection with the Inspection Paradox for Bernoulli Trials
1. Introduction
The Inspection Paradox, which occurs in many sampling situations, is a pervasive problem for researchers [1] [2]. It is called a paradox because selecting samples in an apparently correct way can lead to biased estimates. The sampling method creates a bias called length bias ([2] [3], pp. 294-296). Two examples are the waiting times for buses [4] and estimates of class sizes [5]. Both employ length-based sampling, which leads to the paradoxical result of estimates that are too large. The scheduler of the buses knows how long the waiting periods between arrivals of buses at a bus stop are. Customers tend to obtain larger estimates of the waiting times because a customer is more likely to arrive at the bus stop during a longer interval between buses. For the class size example, a school administrator might have a list of the number of children in each class. Instead of sampling from that list of classes to obtain an average class size, students can be asked how many classmates they have. Larger classes have more students, so students from those classes are more likely to be asked about their class sizes. Estimates from this procedure tend to be too large.
This paradox has long been known. Bytheway [6] and Jenkins and Tuten [7] describe published articles that unwittingly fell into the paradox and may have come to incorrect conclusions as a consequence. In [8], the author discusses the complication that the fastest-growing tumors are less likely to be detected because the window of detection is smaller for them.
Conversely, sometimes those longer intervals or larger objects are the target of a study. For example, [9] suggests purposely using sampling of individuals, i.e., length-based sampling, in order to find people’s experience with COVID. The goal of the authors of [4] is to find longer-lived examples using length-based sampling.
In the past, continuous-time processes and models were the most commonly studied, but discrete processes, such as those investigated here, have become more prevalent in the digital age. A natural setting, which is consistent with our model, is DNA sequencing, where
( [10] , pp. 17-41; [11] ). The process is composed of consecutive live births, where a success is a birth with a particular genetic characteristic and the number of births between successes is the waiting time. Sometimes, the discrete time analysis is imposed by an artificial procedure such as hourly, daily, or weekly measurements [12] , or an activity such as throwing a die. There is much continuing study and research into discrete time renewal processes ( [3] , pp. 174-299; [12] [13] [14] , pp. 53-65).
In what follows, probability distributions and moments are derived for the lengths of all of the interarrival periods in the inspected process, and they are compared to those for the corresponding lengths in the uninspected process. Distributions are also derived for the waiting times for any fixed number of successes that occur both before and after the inspected trial. The main tools used are conditioning arguments that make heavy use of the memoryless property of the geometric distribution. This distribution is the only distribution of the discrete type which has this property. It is the discrete-time analog of the exponential distribution, which uniquely possesses the memoryless property among distributions of the continuous type. Accordingly, the results derived here are compared to the corresponding results in the continuous-time setting found in [2] .
2. Bernoulli Processes
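Consider a Bernoulli process $X_1, X_2, \ldots$ with success probability p, where p is a fixed number in the interval $(0, 1)$. The interarrival period lengths $T_1, T_2, \ldots$ are independent random variables, each of which has the Geometric(p) distribution with cumulative distribution function (cdf)
$$F(x) = P(T_k \le x) = 1 - (1-p)^x \qquad (1)$$
for $x = 1, 2, \ldots$ and expectation $E(T_k) = 1/p$.
Letting $S_0 = 0$ and $S_n = T_1 + \cdots + T_n$ for $n \ge 1$, we see that $S_n$ is the discrete waiting time or arrival time for the n-th event (i.e., the n-th success) and $T_k$ is the additional number of trials required to obtain the k-th success after the occurrence of the $(k-1)$-st success.
Note that for $t = 1, 2, \ldots$, the number $N_t$ of successes that have occurred up to and including time t is $N_t = \max\{n \ge 0 : S_n \le t\}$, which has the Binomial(t, p) distribution
$$P(N_t = k) = \binom{t}{k} p^k (1-p)^{t-k}$$
for $k = 0, 1, \ldots, t$ with expectation $E(N_t) = tp$.
For $n \ge 1$, $S_n$ has the Negative Binomial(n, p) distribution with cdf
$$P(S_n \le x) = \sum_{k=n}^{x} \binom{k-1}{n-1} p^n (1-p)^{k-n}, \qquad x = n, n+1, \ldots,$$
and expectation $E(S_n) = n/p$. These facts are well known and are discussed in ([15], pp. 409-414) and ([16], pp. 457-471).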
Starting at a fixed inspection time t, the additional number $B_t$ of trials required to obtain the next success has the geometric distribution in (1). It is independent of the number of trials $A_t$ since the last success, where $A_t = 0$ if a success occurs at time t and we define $A_t = t$ if no success occurs in the first t trials. The random variable $A_t$ has the truncated geometric distribution with cdf
$$P(A_t \le x) = \begin{cases} 1 - (1-p)^{x+1}, & x = 0, 1, \ldots, t-1, \\ 1, & x \ge t, \end{cases}$$
and expectation $E(A_t) = (1-p)\left(1 - (1-p)^t\right)/p$.
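As a small sanity check of the truncated geometric law for the number of trials since the last success, the following sketch enumerates every outcome of the first t trials with exact rational arithmetic. The function name `age_pmf` and the values of `p` and `t` are illustrative choices, not part of the original analysis; the convention that the count equals t when no success has occurred is the one stated above.

```python
from fractions import Fraction
from itertools import product

def age_pmf(t, p):
    """Exact pmf of the number of trials since the last success at time t.

    Brute force: enumerate all 2**t outcomes of the first t trials.
    Convention (as in the text): the value is t if no success has occurred.
    """
    q = 1 - p
    pmf = [Fraction(0)] * (t + 1)
    for outcome in product((0, 1), repeat=t):
        k = sum(outcome)                      # number of successes
        prob = p**k * q**(t - k)              # probability of this outcome
        # 1-based position of the last success, or 0 if there is none.
        last = max((i + 1 for i, x in enumerate(outcome) if x), default=0)
        pmf[t - last] += prob
    return pmf

p = Fraction(3, 10)   # illustrative success probability
t = 8                 # illustrative inspection time
q = 1 - p
pmf = age_pmf(t, p)
mean_age = sum(a * pr for a, pr in enumerate(pmf))
closed_form = q * (1 - q**t) / p   # (1-p)(1-(1-p)^t)/p
print(mean_age == closed_form)
```

Because `Fraction` arithmetic is exact, the comparison is an equality check rather than a floating-point approximation; enumeration is only practical for small t.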
3. The Inspected Interarrival Period
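The length of the interarrival period containing the inspection trial t is $L_t = A_t + B_t$, where $A_t$ is the number of trials since the last success and $B_t$ is the additional number of trials required to obtain the next success, and our first theorem gives its probability distribution.
Theorem 1. The cdf of $L_t$ is given by
$$P(L_t \le x) = \begin{cases} 1 - (1 + xp)\,q^x, & x = 1, 2, \ldots, t, \\ 1 - (1 + tp)\,q^x, & x = t+1, t+2, \ldots, \end{cases}$$
where $q = 1 - p$.
Proof. Our proof proceeds by computing the convolution of the distributions of $A_t$ and $B_t$ given above. By conditioning on the value of $A_t$ and using the independence of $A_t$ and $B_t$, it follows that for $x = 1, 2, \ldots, t$,
$$P(L_t \le x) = \sum_{a=0}^{x-1} P(A_t = a)\,P(B_t \le x - a) = \sum_{a=0}^{x-1} p\,q^a \left(1 - q^{x-a}\right) = 1 - (1 + xp)\,q^x.$$
For $x > t$ it similarly follows that
$$P(L_t \le x) = \sum_{a=0}^{t-1} p\,q^a \left(1 - q^{x-a}\right) + q^t \left(1 - q^{x-t}\right) = 1 - (1 + tp)\,q^x.$$
Note that for $x \ge 1$,
$$P(L_t > x) = \left(1 + \min(x, t)\,p\right) q^x > q^x = P(T_1 > x).$$
Hence we see that the length of the inspected interarrival period is stochastically larger than the length of a typical interarrival period in the uninspected process. This fact illustrates the Inspection Paradox. Note that as $t \to \infty$, $L_t \xrightarrow{d} L$ (where $\xrightarrow{d}$ denotes convergence in distribution), with $P(L \le x) = 1 - (1 + xp)\,q^x$ for $x = 1, 2, \ldots$, and that
$$E(L_t) = \frac{2 - p - q^{t+1}}{p} \to \frac{2}{p} - 1.$$
The situation just illustrated for the Bernoulli process should be compared with the Inspection Paradox for the Poisson process (see [2]), in which case the limiting expected length of the inspected interarrival period is twice the expected length of a typical interarrival period in the uninspected process. The minus one in $2/p - 1$ arises from the fact that $A_t$ can be zero and its limiting expectation is $(1-p)/p = 1/p - 1$.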
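The convolution argument can be mirrored numerically. The sketch below conditions on the number of trials since the last success (called the age here) and combines it with the geometric number of trials to the next success (called the residual) to get the survival function of the inspected period length. It then compares that survival function with the geometric one. The names `inspected_survival`, `age`, and `residual` and the parameter values are illustrative assumptions.

```python
from fractions import Fraction

def inspected_survival(x, t, p):
    """P(age + residual > x), conditioning on the age a = 0, 1, ..., t."""
    q = 1 - p
    total = Fraction(0)
    for a in range(t + 1):
        p_age = p * q**a if a < t else q**t   # truncated geometric age
        p_resid_gt = q**max(x - a, 0)         # P(residual > x - a); equals 1 when x <= a
        total += p_age * p_resid_gt
    return total

p = Fraction(1, 4)    # illustrative success probability
q = 1 - p
t = 12                # illustrative inspection time
xs = range(30)
survival = [inspected_survival(x, t, p) for x in xs]
typical = [q**x for x in xs]                  # P(T > x) for a Geometric(p) length

# Stochastic ordering: the inspected period is at least as long as a typical one.
dominates = all(s >= g for s, g in zip(survival, typical))

mean_inspected = q * (1 - q**t) / p + 1 / p   # E[age] + E[residual]
limit_mean = 2 / p - 1                        # limiting value as t grows
print(dominates, mean_inspected, limit_mean)
```

The gap `limit_mean - mean_inspected` shrinks geometrically in t, which is one way to see how quickly the finite-t expectation approaches its limit.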
In [2] the distribution and moments for the pre- and post-inspection time interarrival period lengths are derived for the Poisson process. Our goal here is to accomplish this for the Bernoulli process.
4. Pre- and Post-Inspection Interarrival Periods
Since for
and
the post-inspection interarrival period length
has the geometric distribution in (1), our focus in this section is on the pre-inspection interarrival period lengths
for
.
We begin with a combinatorial lemma.
Lemma 1. If m, n and j are nonnegative integers with
, then
Proof. Notice that
is the number of ways to select n integers from the
set
. Another way to obtain this count is to select the
greatest integer, where
, call this integer k, and from the remaining integers in the set, select j integers which are greater than k and
integers less than k. Since we must have at least j integers greater than k, and at least
integers less than k, k may only take values from the set
. So the number of ways to select n integers from the set
by first selecting the
greatest is given by
, completing the proof.
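The counting argument in this proof is easy to test by brute force. The sketch below assumes the underlying set is {1, ..., m} and that 0 ≤ j < n ≤ m, so that choosing n integers by first fixing the (j+1)-st greatest integer k gives the sum of C(m−k, j)·C(k−1, n−j−1) over k from n−j to m−j. Both the index ranges and the function name `lemma_sum` are assumptions read off from the argument above.

```python
from math import comb

def lemma_sum(m, n, j):
    """Count n-subsets of {1, ..., m} by the value k of the (j+1)-st greatest element.

    j elements are chosen above k (from m - k candidates) and
    n - j - 1 elements below k (from k - 1 candidates).
    """
    return sum(comb(m - k, j) * comb(k - 1, n - j - 1)
               for k in range(n - j, m - j + 1))

# Compare with the direct count C(m, n) over a grid of small cases.
checks = [(m, n, j, lemma_sum(m, n, j) == comb(m, n))
          for m in range(1, 12)
          for n in range(1, m + 1)
          for j in range(n)]
all_hold = all(ok for *_, ok in checks)
print(all_hold)
```

A finite grid of cases is of course not a proof; it only guards against an off-by-one error in the summation limits.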
Theorem 2. For t a positive integer and
Proof. Fix
and
, and define
for all
. In a discrete time renewal process,
for all
, so we need not consider cases where
, because this implies
, and hence
.
Recall that the
success occurs at time
. Since j successes must occur after the
success, but at or before time t, we must have
. Therefore
, so
for
.
Now we assume that
. Conditioning on
, noting that
implies
, and that
implies
, we have
Conditioning on
, and noting that there must be at least j trials after
, up to and including trial t (to accommodate the j successes after
), and at least
trials up to and including trial
(to accommodate the interarrival period length greater than x, and the
successes occurring by trial
), we have
By applying Lemma 1,
Hence for
,
Although there are simple formulas for the moments of the exponentially distributed interarrival times in a Poisson process (see [17] , pp. 498-499), the factorial moments of the geometrically distributed interarrival times in a Bernoulli process have more concise formulas. For this reason, our next theorem gives the factorial moments for the lengths of the pre-inspection interarrival periods. The set of positive integral moments of any random variable can be determined from the factorial moments and vice-versa.
Theorem 3. For
and
, the mth factorial moment of
is
Proof. For notational simplicity, we let
in this proof. From Lemma 1 and Theorem 2,
From Theorem 2
for
and
, and so
is stochastically smaller than the length of a typical interarrival period in the uninspected process. Moreover, the sequence
is stochastically decreasing and
with probability one for
. Theorem 2 also shows that for fixed
,
as
, so that the inspection effect dissipates for large t. An argument similar to the one given in the proof of Theorem 3 shows that the mth factorial moment of
is
Since the double sum on the right hand side in the statement of Theorem 3 is a polynomial in t of degree
, its product with
tends to zero as
and hence the mth factorial moment for
approaches that for
for any fixed
. These results are analogous to those for the Poisson process (see [2]), which is not surprising, since the Poisson process is the continuous-time analog of the Bernoulli process.
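The qualitative behavior of the pre-inspection periods can be illustrated by exact enumeration for a small inspection time. The sketch below computes the expected lengths of the most recent completed interarrival periods before time t and compares them with the typical mean 1/p. It assumes that a period which does not exist (too few successes by time t) is assigned length 0; this convention, the function name `pre_inspection_means`, and the parameter values are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

def pre_inspection_means(t, p, depth):
    """Exact expected lengths of the last `depth` completed interarrival
    periods before inspection time t, most recent first.

    Assumed convention: a period that does not exist has length 0.
    """
    q = 1 - p
    means = [Fraction(0)] * depth
    for outcome in product((0, 1), repeat=t):
        k = sum(outcome)
        prob = p**k * q**(t - k)
        # Arrival times of successes, with time 0 as the starting point.
        arrivals = [0] + [i + 1 for i, x in enumerate(outcome) if x]
        gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
        gaps.reverse()                        # most recent completed period first
        for j in range(min(depth, len(gaps))):
            means[j] += prob * gaps[j]
    return means

p = Fraction(3, 10)                           # illustrative success probability
means = pre_inspection_means(t=10, p=p, depth=3)
typical_mean = 1 / p
print([float(m) for m in means], float(typical_mean))
```

For these values each expected length falls below 1/p and the values decrease further into the past, in line with the stochastic ordering described above. The enumeration visits 2^t outcomes, so it is suitable only for small t.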
5. Waiting Times
The next two theorems give the distributions for waiting times in the inspected process.
Theorem 4. If
and
,
where Y has the Negative Binomial(j + 1, p) distribution.
Proof. Observe that
. Letting Y have the Negative Binomial(j + 1, p) distribution, it follows that if
, then
.
If
, then
Theorem 5. If
and
, then
has the Negative Binomial(j, p) distribution.
Proof. For
,
. So for
An alternative proof of Theorem 5 can be given by observing that
and that, as noted above, the summands on the right hand side of the last equation are independent random variables, each of which has the Geometric(p) distribution in (1).
Once again, the results for waiting times are analogous to those for the Poisson process (see [2] ).
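The alternative proof of Theorem 5 rests on the fact that a sum of j independent Geometric(p) variables has the Negative Binomial(j, p) distribution. A minimal sketch of that fact, using exact rational arithmetic and a truncated support, is given below. The helper names `geometric_pmf` and `convolve` and the values of `p`, `j`, and `x_max` are illustrative.

```python
from fractions import Fraction
from math import comb

def geometric_pmf(p, x_max):
    """pmf of Geometric(p) on 1, 2, ..., x_max (index 0 holds probability 0)."""
    q = 1 - p
    return [Fraction(0)] + [p * q**(x - 1) for x in range(1, x_max + 1)]

def convolve(f, g, x_max):
    """pmf of the sum of two independent nonnegative integer variables, for 0..x_max."""
    return [sum(f[k] * g[x - k] for k in range(x + 1)) for x in range(x_max + 1)]

p, j, x_max = Fraction(2, 5), 3, 25           # illustrative values
q = 1 - p
geo = geometric_pmf(p, x_max)

# Add up j independent geometric waiting times by repeated convolution.
waiting = geo
for _ in range(j - 1):
    waiting = convolve(waiting, geo, x_max)

# Negative Binomial(j, p) pmf: C(x-1, j-1) p^j q^(x-j) for x >= j.
neg_binomial = [comb(x - 1, j - 1) * p**j * q**(x - j) if x >= j else Fraction(0)
                for x in range(x_max + 1)]
matches = waiting == neg_binomial
print(matches)
```

Truncating at `x_max` does not affect the comparison, because the convolution at a point x only uses probabilities at points no larger than x.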
6. Conclusions
The Inspection Paradox impacts experiments in both continuous- and discrete-time processes. The paradox represents a difficulty because inspected intervals are likely to be unusually long, which can result in inflated estimators. Our goal was to quantify this phenomenon for Bernoulli processes.
Determining the distributions and the expected lengths of all intervals when there has been an inspected interval is a major step toward the systematic modeling of length-based sampling. These distributions and their properties add to our intuition about length-based sampling.
A perhaps surprising outcome is that the sequence of times between successes, i.e., the lengths of intervals such as the waiting times for buses, is stochastically decreasing into the past from the inspected interval.
In order to gain a thorough understanding of the Inspection Paradox for Bernoulli trials, it is important to know the probability distributions for random variables which naturally arise in this context. We have computed the distributions and moments for the interarrival and waiting times for the inspected Bernoulli process and compared them with those for the uninspected version of this process. The results presented constitute a nearly complete mathematical analysis of the Inspection Paradox for Bernoulli trials.