Probability Distributions Arising in Connection with the Inspection Paradox for Bernoulli Trials

Abstract

In renewal theory, the Inspection Paradox refers to the fact that an interarrival period in a renewal process which contains a fixed inspection time tends to be longer than one for the corresponding uninspected process. We focus on the paradox for Bernoulli trials. Probability distributions and moments for the lengths of the interarrival periods are derived for the inspected process, and we compare them to those for the uninspected case.

Share and Cite:

Marengo, J., Himes, A., Reinberger, W. and Farnsworth, D. (2023) Probability Distributions Arising in Connection with the Inspection Paradox for Bernoulli Trials. Open Journal of Statistics, 13, 769-777. doi: 10.4236/ojs.2023.136038.

1. Introduction

The Inspection Paradox, which occurs in many sampling situations, is a pervasive problem for researchers [1] [2] . It is called a paradox because selecting samples in an apparently correct way can lead to biased estimates. The sampling method that is used creates a bias called length bias ( [2] [3] , pp. 294-296). Two examples are the waiting times for buses [4] and estimates of class sizes [5] . Both employ length-based sampling, which leads to the paradoxical result of estimates that are too large. The scheduler of the buses knows how long the waiting periods are between arrivals of buses at a bus stop. Customers tend to obtain estimates of the waiting times that are larger, because customers are more likely to arrive at the bus stop during a longer interval between buses. For the class size example, a school administrator might have a list of the number of children in each class. Instead of sampling from that list of classes to obtain an average class size, students can be asked how many classmates they have. Larger classes have more students, so students from those classes are more likely to be asked about their class's size. Estimates from this procedure will probably be too large.
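The class-size bias can be made concrete with a small numerical sketch (the class sizes below are hypothetical, chosen only for illustration): sampling a student uniformly at random weights each class in proportion to its size.

```python
# Hypothetical class sizes for three classes.
sizes = [10, 20, 60]

# Administrator's view: sample a class uniformly at random.
class_mean = sum(sizes) / len(sizes)  # 30.0

# Survey view: sample a student uniformly at random. Each class
# appears once per student, so large classes are overrepresented.
students = [s for s in sizes for _ in range(s)]
student_mean = sum(students) / len(students)  # sum(s^2) / sum(s)

print(class_mean, student_mean)
```

The student-weighted average is $\sum s^2 / \sum s = 4100/90 \approx 45.6$, well above the true average of 30, which is exactly the length bias described above.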

This paradox has been well-known for a long time. Bytheway [6] and Jenkins and Tuten [7] describe published articles that had unwittingly fallen into the paradox and may have come to incorrect conclusions as a consequence. In [8] , the author discusses the complication that the fastest growing tumors are less likely to be detected because the window of detection is smaller for them.

Conversely, sometimes those longer intervals or larger objects are the target of a study. For example, [9] suggests purposely using length-based sampling of individuals in order to learn about people's experience with COVID-19. The goal of the authors of [4] is to find longer-lived examples using length-based sampling.

In the past, continuous-time processes and models were the most commonly studied, but discrete processes, such as those investigated here, have become more prevalent in the digital age. A natural setting, which is consistent with our model, is DNA sequencing, where p = 0.25 ( [10] , pp. 17-41; [11] ). The process is composed of consecutive live births, where a success is a birth with a particular genetic characteristic and the number of births between successes is the waiting time. Sometimes, the discrete time analysis is imposed by an artificial procedure such as hourly, daily, or weekly measurements [12] , or an activity such as throwing a die. There is much continuing study and research into discrete time renewal processes ( [3] , pp. 174-299; [12] [13] [14] , pp. 53-65).

In what follows, probability distributions and moments are derived for the lengths of all of the interarrival periods in the inspected process, and they are compared to those for the corresponding lengths in the uninspected process. Distributions are also derived for the waiting times for any fixed number of successes that occur both before and after the inspected trial. The main tools used are conditioning arguments that make heavy use of the memoryless property of the geometric distribution. This distribution is the only distribution of the discrete type which has this property. It is the discrete-time analog of the exponential distribution, which uniquely possesses the memoryless property among distributions of the continuous type. Accordingly, the results derived here are compared to the corresponding results in the continuous-time setting found in [2] .

2. Bernoulli Processes

Consider a Bernoulli process $\{N(t): t = 0, 1, \ldots\}$ with success probability $p$, where $p$ is a fixed number in the interval $(0,1)$. The interarrival period lengths $X_1, X_2, \ldots$ are independent random variables, each of which has the Geometric($p$) distribution with cumulative distribution function (cdf)

$$\Pr(X \le x) = 1 - (1-p)^x \tag{1}$$

for $x = 1, 2, \ldots$ and expectation $E[X] = 1/p$.

Letting $S_0 = 0$ and $S_n = \sum_{k=1}^{n} X_k$ for $n = 1, 2, \ldots$, we see that $S_n$ is the discrete waiting time or arrival time for the $n$-th event (i.e., the $n$-th success) and $X_k$ is the additional number of trials required to obtain the $k$-th success after the occurrence of the $(k-1)$-st success.

Note that for $t = 0, 1, \ldots$, the number $N(t)$ of successes that have occurred up to and including time $t$ is $N(t) = \max\{n \in \{0, 1, \ldots\}: S_n \le t\}$, which has the Binomial($t, p$) distribution

$$\Pr(N(t) = k) = \binom{t}{k} p^k (1-p)^{t-k}$$

for $k = 0, 1, \ldots, t$ with expectation $E[N(t)] = tp$.

For $n = 1, 2, \ldots$, $S_n$ has the Negative Binomial($n, p$) distribution with cdf

$$F_n(s) = \Pr(S_n \le s) = 1 - \Pr(N(s) \le n-1) = \begin{cases} 0 & s = 0, 1, \ldots, n-1 \\ 1 - \sum_{k=0}^{n-1} \binom{s}{k} p^k (1-p)^{s-k} & s = n, n+1, \ldots \end{cases}$$

and expectation $E[S_n] = n/p$. These facts are well known and are discussed in ( [15] , pp. 409-414) and ( [16] , pp. 457-471).
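The binomial-tail expression for $F_n(s)$ can be checked numerically against the direct sum of the Negative Binomial probability mass function $\Pr(S_n = m) = \binom{m-1}{n-1} p^n (1-p)^{m-n}$; the following sketch uses arbitrary illustrative parameter values.

```python
from math import comb

def nb_cdf_tail(n, s, p):
    """F_n(s) = 1 - Pr(N(s) <= n-1): at most n-1 successes in s trials."""
    q = 1.0 - p
    if s < n:
        return 0.0
    return 1.0 - sum(comb(s, k) * p**k * q**(s - k) for k in range(n))

def nb_cdf_pmf(n, s, p):
    """Same cdf by summing Pr(S_n = m) = C(m-1, n-1) p^n q^(m-n)."""
    q = 1.0 - p
    return sum(comb(m - 1, n - 1) * p**n * q**(m - n) for m in range(n, s + 1))

p = 0.3
max_err = max(abs(nb_cdf_tail(n, s, p) - nb_cdf_pmf(n, s, p))
              for n in range(1, 6) for s in range(0, 25))
```

For $n = 1$ both expressions reduce to the Geometric cdf in (1).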

Starting at a fixed inspection time $t$, the additional number $Y(t)$ of trials required to obtain the next success has the geometric distribution in (1). It is independent of the number of trials $A(t)$ since the last success, where $A(t) = 0$ if a success occurs at time $t$ and we define $A(t) = t$ if no success occurs in the first $t$ trials. The random variable $A(t)$ has the truncated geometric distribution with cdf

$$\Pr(A(t) \le x) = \begin{cases} 1 - (1-p)^{x+1} & x = 0, 1, \ldots, t-1 \\ 1 & x = t, t+1, \ldots \end{cases}$$

and expectation $E[A(t)] = \frac{1-p}{p}\left(1 - (1-p)^t\right)$.
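The stated mean of $A(t)$ follows from its truncated geometric pmf, as the following illustrative check confirms for a small grid of arbitrary parameter values.

```python
def age_mean_closed_form(t, p):
    # E[A(t)] = ((1-p)/p) * (1 - (1-p)^t)
    q = 1 - p
    return (q / p) * (1 - q**t)

def age_mean_from_pmf(t, p):
    # Pr(A(t) = x) = p q^x for x = 0, ..., t-1 (last success x trials back),
    # Pr(A(t) = t) = q^t (no success in the first t trials).
    q = 1 - p
    return sum(x * p * q**x for x in range(t)) + t * q**t

errs = [abs(age_mean_closed_form(t, p) - age_mean_from_pmf(t, p))
        for t in range(1, 21) for p in (0.2, 0.5, 0.9)]
```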

3. The Inspected Interarrival Period

The length of the interarrival period containing the inspection trial $t$ is

$$X_{N(t)+1} = A(t) + Y(t),$$

and our first theorem gives its probability distribution.

Theorem 1. The cdf of $X_{N(t)+1}$ is given by

$$\Pr(X_{N(t)+1} \le x) = \begin{cases} 1 - q^x(1 + px) & x = 0, 1, \ldots, t-1 \\ 1 - q^x(1 + pt) & x = t, t+1, \ldots \end{cases}$$

where $q = 1 - p$.

Proof. Our proof proceeds by computing the convolution of the distributions of $A(t)$ and $Y(t)$ given above. By conditioning on the value of $Y(t)$ and using the independence of $A(t)$ and $Y(t)$, it follows that for $x = 1, 2, \ldots, t-1$,

$$\Pr(X_{N(t)+1} \le x) = \sum_{k=0}^{x-1} \Pr(A(t) \le k)\Pr(Y(t) = x-k) = p\sum_{k=0}^{x-1} (1 - q^{k+1}) q^{x-k-1} = p\sum_{k=0}^{x-1} q^{x-k-1} - p\sum_{k=0}^{x-1} q^x = 1 - q^x(1 + px).$$

For $x = t, t+1, \ldots$ it similarly follows that

$$\begin{aligned} \Pr(X_{N(t)+1} \le x) &= \sum_{k=0}^{t-1} \Pr(A(t) \le k)\Pr(Y(t) = x-k) + \sum_{k=t}^{x-1} \Pr(A(t) \le k)\Pr(Y(t) = x-k) \\ &= p\sum_{k=0}^{t-1} (1 - q^{k+1}) q^{x-k-1} + p\sum_{k=t}^{x-1} q^{x-k-1} = p\sum_{k=0}^{x-1} q^{x-k-1} - p\sum_{k=0}^{t-1} q^x = 1 - q^x(1 + pt). \end{aligned}$$

Note that for $x = 1, 2, \ldots$,

$$\Pr(X_{N(t)+1} > x) > q^x = \Pr(X_1 > x).$$

Hence we see that the length of the inspected interarrival period is stochastically larger than the length of a typical interarrival period in the uninspected process. This fact illustrates the Inspection Paradox. Note that as $t \to \infty$,

$$X_{N(t)+1} \xrightarrow{D} X_1 - 1 + X_2$$

(where $\xrightarrow{D}$ denotes convergence in distribution) and that

$$E[X_{N(t)+1}] \to \frac{2}{p} - 1.$$

The situation just illustrated for the Bernoulli process should be compared with the Inspection Paradox for the Poisson process (see [2] ), in which case the limiting expected length of the inspected interarrival period is twice the expected length of a typical interarrival period in the uninspected process. The minus one arises from the fact that $A(t)$ can be zero and its limiting expectation is $\frac{1}{p} - 1$.
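Theorem 1 and the limiting mean can be checked numerically: the closed-form cdf agrees with a direct convolution of the distributions of $A(t)$ and $Y(t)$, and $E[X_{N(t)+1}] = E[A(t)] + E[Y(t)]$ approaches $2/p - 1$. The parameter values below are arbitrary and purely illustrative.

```python
def inspected_cdf_closed(x, t, p):
    # Theorem 1, with the two cases combined via min(x, t).
    q = 1 - p
    if x <= 0:
        return 0.0
    return 1 - q**x * (1 + p * min(x, t))

def inspected_cdf_conv(x, t, p):
    # Pr(A(t) + Y(t) <= x) by summing over the value of A(t).
    q = 1 - p
    total = 0.0
    for a in range(t + 1):
        pa = p * q**a if a < t else q**a     # pmf of A(t)
        if x - a >= 1:
            total += pa * (1 - q**(x - a))   # times Pr(Y(t) <= x - a)
    return total

p, t = 0.4, 8
max_err = max(abs(inspected_cdf_closed(x, t, p) - inspected_cdf_conv(x, t, p))
              for x in range(0, 40))

# Mean E[A(t)] + E[Y(t)] tends to 2/p - 1 as t grows.
q = 1 - p
mean_50 = (q / p) * (1 - q**50) + 1 / p
```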

In [2] the distributions and moments for the pre- and post-inspection interarrival period lengths are derived for the Poisson process. Our goal here is to accomplish this for the Bernoulli process.

4. Pre- and Post-Inspection Interarrival Periods

Since for $t > 0$ and $j \in \{2, 3, \ldots\}$ the post-inspection interarrival period length $X_{N(t)+j}$ has the geometric distribution in (1), our focus in this section is on the pre-inspection interarrival period lengths $X_{N(t)-j}$ for $j = 0, 1, \ldots$.

We begin with a combinatorial lemma.

Lemma 1. If $m$, $n$ and $j$ are nonnegative integers with $j < n \le m$, then

$$\binom{m}{n} = \sum_{k=n-j}^{m-j} \binom{k-1}{n-j-1}\binom{m-k}{j}.$$

Proof. Notice that $\binom{m}{n}$ is the number of ways to select $n$ integers from the set $\{1, 2, \ldots, m\}$. Another way to obtain this count is to select the $(j+1)$-st greatest integer (note that $j \in \{0, 1, \ldots, n-1\}$ since $j < n$), call this integer $k$, and from the remaining integers in the set, select $j$ integers which are greater than $k$ and $n-j-1$ integers less than $k$. Since we must have at least $j$ integers greater than $k$, and at least $n-j-1$ integers less than $k$, $k$ may only take values from the set $\{n-j, n-j+1, \ldots, m-j\}$. So the number of ways to select $n$ integers from the set $\{1, 2, \ldots, m\}$ by first selecting the $(j+1)$-st greatest is given by

$$\sum_{k=n-j}^{m-j} \binom{k-1}{n-j-1}\binom{m-k}{j},$$

completing the proof.

Theorem 2. For $t$ a positive integer and $j \in \{0, 1, \ldots, t-1\}$,

$$\Pr(X_{N(t)-j} \le x) = \begin{cases} 1 - q^x \sum_{n=j+1}^{t-x} \binom{t-x}{n} p^n q^{t-x-n} & x = 0, 1, \ldots, t-j-1 \\ 1 & x = t-j, t-j+1, \ldots \end{cases}$$

Proof. Fix $t \in \{1, 2, \ldots\}$ and $j \in \{0, 1, \ldots, t-1\}$, and define $X_k = 0$ for all $k \le 0$. In a discrete time renewal process, $N(t) \le t$ for all $t \in \{1, 2, \ldots\}$, so we need not consider cases where $j \ge t$, because this implies $N(t) - j \le t - j \le 0$, and hence $X_{N(t)-j} = 0$.

Recall that the $(N(t)-j)$-th success occurs at time $S_{N(t)-j}$. Since $j$ successes must occur after the $(N(t)-j)$-th success, but at or before time $t$, we must have $S_{N(t)-j} \le t - j$. Therefore $X_{N(t)-j} \le \sum_{i=1}^{N(t)-j} X_i = S_{N(t)-j} \le t - j$, so $\Pr(X_{N(t)-j} \le x) = 1$ for $x \in \{t-j, t-j+1, \ldots\}$.

Now we assume that $x \in \{0, 1, \ldots, t-j-1\}$. Conditioning on $N(t)$, noting that $N(t) \le j$ implies $X_{N(t)-j} = 0$, and that $X_{N(t)-j} > x$ implies $N(t) \le t - x$, we have

$$\Pr(X_{N(t)-j} > x) = \sum_{n=0}^{t} \Pr(X_{N(t)-j} > x, N(t) = n) = \sum_{n=j+1}^{t-x} \Pr(X_{N(t)-j} > x, N(t) = n).$$

Conditioning on $S_{N(t)-j}$, and noting that there must be at least $j$ trials after $S_{N(t)-j}$, up to and including trial $t$ (to accommodate the $j$ successes after $S_{N(t)-j}$), and at least $x + n - j$ trials up to and including trial $S_{N(t)-j}$ (to accommodate the interarrival period length greater than $x$, and the $n-j$ successes occurring by trial $S_{N(t)-j}$), we have

$$\begin{aligned} \sum_{n=j+1}^{t-x} \Pr(X_{N(t)-j} > x, N(t) = n) &= \sum_{n=j+1}^{t-x} \sum_{k=x+n-j}^{t-j} \Pr(X_{N(t)-j} > x, N(t) = n, S_{N(t)-j} = k) \\ &= \sum_{n=j+1}^{t-x} \sum_{k=x+n-j}^{t-j} \binom{k-x-1}{n-j-1} p^{n-j-1} q^{k-x-n+j} \cdot q^x p \cdot \binom{t-k}{j} p^j q^{t-k-j}. \end{aligned}$$

By applying Lemma 1,

$$\sum_{n=j+1}^{t-x} \Pr(X_{N(t)-j} > x, N(t) = n) = q^x \sum_{n=j+1}^{t-x} p^n q^{t-x-n} \sum_{k=x+n-j}^{t-j} \binom{k-x-1}{n-j-1}\binom{t-k}{j} = q^x \sum_{n=j+1}^{t-x} \binom{t-x}{n} p^n q^{t-x-n}.$$

Hence for $x \in \{0, 1, \ldots, t-j-1\}$,

$$\Pr(X_{N(t)-j} \le x) = 1 - q^x \sum_{n=j+1}^{t-x} \binom{t-x}{n} p^n q^{t-x-n}.$$
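Because $X_{N(t)-j}$ depends only on the first $t$ trials, Theorem 2 can be verified exhaustively for small $t$ by enumerating all $2^t$ outcome sequences. The following sketch does so for arbitrary illustrative parameter values.

```python
from math import comb
from itertools import product

def prej_cdf_formula(x, t, j, p):
    # The cdf in Theorem 2.
    q = 1 - p
    if x >= t - j:
        return 1.0
    return 1 - q**x * sum(comb(t - x, n) * p**n * q**(t - x - n)
                          for n in range(j + 1, t - x + 1))

def prej_cdf_enum(x, t, j, p):
    # Exact enumeration over all outcome strings of length t,
    # using the convention X_k = 0 for k <= 0.
    q = 1 - p
    total = 0.0
    for bits in product((0, 1), repeat=t):
        ones = sum(bits)
        prob = p**ones * q**(t - ones)
        times = [i + 1 for i, b in enumerate(bits) if b]  # success times
        i = ones - j                                      # index N(t) - j
        if i <= 0:
            length = 0
        else:
            prev = times[i - 2] if i >= 2 else 0
            length = times[i - 1] - prev                  # X_{N(t)-j}
        if length <= x:
            total += prob
    return total

p = 0.4
max_err = max(abs(prej_cdf_formula(x, 6, j, p) - prej_cdf_enum(x, 6, j, p))
              for j in range(3) for x in range(0, 8))
```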

Although there are simple formulas for the moments of the exponentially distributed interarrival times in a Poisson process (see [17] , pp. 498-499), the factorial moments of the geometrically distributed interarrival times in a Bernoulli process have more concise formulas. For this reason, our next theorem gives the factorial moments for the lengths of the pre-inspection interarrival periods. The set of positive integral moments of any random variable can be determined from the factorial moments and vice-versa.

Theorem 3. For $j = 0, 1, \ldots$ and $m = 1, 2, \ldots$, the $m$th factorial moment of $X_{N(t)-j}$ is

$$E\left[\prod_{a=0}^{m-1}\left(X_{N(t)-j} - a\right)\right] = m!\left\{\sum_{k=m-1}^{t-j-1}\binom{k}{m-1} q^k - q^t \sum_{l=0}^{j}\left(\frac{p}{q}\right)^l \sum_{k=m-1}^{t-j-1}\binom{k}{m-1}\binom{t-k}{l}\right\}.$$

Proof. For notational simplicity, we let $X = X_{N(t)-j}$ in this proof. From Lemma 1 and Theorem 2,

$$\begin{aligned} E\left[\prod_{a=0}^{m-1}(X-a)\right] &= \sum_{n=m}^{t-j} n(n-1)\cdots(n-(m-1)) \Pr(X = n) = m! \sum_{n=m}^{t-j} \binom{n}{m} \Pr(X = n) \\ &= m! \sum_{n=m}^{t-j} \sum_{k=m-1}^{n-1} \binom{k}{m-1} \Pr(X = n) = m! \sum_{k=m-1}^{t-j-1} \binom{k}{m-1} \sum_{n=k+1}^{t-j} \Pr(X = n) \\ &= m! \sum_{k=m-1}^{t-j-1} \binom{k}{m-1} \Pr(X > k) = m! \sum_{k=m-1}^{t-j-1} \binom{k}{m-1} q^k \sum_{l=j+1}^{t-k} \binom{t-k}{l} p^l q^{t-k-l} \\ &= m! \sum_{k=m-1}^{t-j-1} \binom{k}{m-1} q^k \left\{1 - \sum_{l=0}^{j} \binom{t-k}{l} p^l q^{t-k-l}\right\} \\ &= m!\left\{\sum_{k=m-1}^{t-j-1}\binom{k}{m-1} q^k - q^t \sum_{l=0}^{j}\left(\frac{p}{q}\right)^l \sum_{k=m-1}^{t-j-1}\binom{k}{m-1}\binom{t-k}{l}\right\}. \end{aligned}$$
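As an illustrative check on Theorem 3, the closed form can be compared with the factorial moment computed directly from the pmf implied by Theorem 2; the parameter values below are arbitrary.

```python
from math import comb, factorial

def fact_moment_closed(m, t, j, p):
    # The closed form in Theorem 3.
    q = 1 - p
    s1 = sum(comb(k, m - 1) * q**k for k in range(m - 1, t - j))
    s2 = sum((p / q)**l * sum(comb(k, m - 1) * comb(t - k, l)
                              for k in range(m - 1, t - j))
             for l in range(j + 1))
    return factorial(m) * (s1 - q**t * s2)

def fact_moment_from_cdf(m, t, j, p):
    # E[X(X-1)...(X-m+1)] from the pmf of X = X_{N(t)-j} (Theorem 2).
    q = 1 - p
    def cdf(x):
        if x >= t - j:
            return 1.0
        return 1 - q**x * sum(comb(t - x, n) * p**n * q**(t - x - n)
                              for n in range(j + 1, t - x + 1))
    total = 0.0
    for n in range(1, t - j + 1):
        pmf = cdf(n) - cdf(n - 1)
        falling = 1.0
        for a in range(m):
            falling *= (n - a)
        total += falling * pmf
    return total

p = 0.35
max_err = max(abs(fact_moment_closed(m, 9, j, p) - fact_moment_from_cdf(m, 9, j, p))
              for m in (1, 2, 3) for j in range(3))
```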

From Theorem 2,

$$\Pr(X_{N(t)} \le x) > \Pr(X_1 \le x)$$

for $x = 0, 1, \ldots$ and $t > 0$, and so $X_{N(t)}$ is stochastically smaller than the length of a typical interarrival period in the uninspected process. Moreover, the sequence $\{X_{N(t)-j}\}_{j=0}^{\infty}$ is stochastically decreasing and $X_{N(t)-j} = 0$ with probability one for $j \ge t$. Theorem 2 also shows that for fixed $j \in \{0, 1, \ldots\}$, $X_{N(t)-j} \xrightarrow{D} X_1$ as $t \to \infty$, so that the inspection effect dissipates for large $t$. A similar argument to the one given in the proof of Theorem 3 shows that the $m$th factorial moment of $X_1$ is

$$E\left[\prod_{a=0}^{m-1}(X_1 - a)\right] = m! \sum_{k=m-1}^{\infty} \binom{k}{m-1} q^k = \frac{m! \, q^{m-1}}{p^m}.$$

Since the double sum on the right-hand side in the statement of Theorem 3 is a polynomial in $t$ of degree $m + j$, its product with $q^t$ tends to zero as $t \to \infty$, and hence the $m$th factorial moment of $X_{N(t)-j}$ approaches that of $X_1$ for any fixed $j \in \{0, 1, \ldots\}$. These results are analogous to those for the Poisson process (see [2] ), which is not surprising since the Poisson process is the continuous-time analog of the process of Bernoulli trials.

5. Waiting Times

The next two theorems give the distributions for waiting times in the inspected process.

Theorem 4. If $j \in \{0, 1, \ldots\}$ and $t > 0$, then

$$\Pr(t - S_{N(t)-j} \le s) = \begin{cases} \Pr(Y - 1 \le s) & s = 0, 1, \ldots, t-1 \\ 1 & s = t, t+1, \ldots \end{cases}$$

where $Y$ has the Negative Binomial($j+1, p$) distribution.

Proof. Observe that $j \le t - S_{N(t)-j} \le t$. Letting $Y$ have the Negative Binomial($j+1, p$) distribution, it follows that if $s = 0, 1, \ldots, j-1$, then $\Pr(Y - 1 \le s) = \Pr(Y \le s+1) = 0 = \Pr(t - S_{N(t)-j} \le s)$.

If $s = j, j+1, \ldots, t-1$, then

$$\begin{aligned} \Pr(t - S_{N(t)-j} \le s) &= \Pr(N(t) - N(t-s-1) \ge j+1) = 1 - \Pr(N(t) - N(t-s-1) < j+1) \\ &= 1 - \sum_{k=0}^{j} \binom{s+1}{k} p^k q^{s+1-k} = \Pr(Y \le s+1) = \Pr(Y - 1 \le s). \end{aligned}$$
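Theorem 4 can likewise be checked by exhaustive enumeration for small $t$; as before, the parameter values are arbitrary and the convention $S_k = 0$ for $k \le 0$ is used.

```python
from math import comb
from itertools import product

def backward_cdf_formula(s, t, j, p):
    # Theorem 4: Pr(t - S_{N(t)-j} <= s) = Pr(Y - 1 <= s), Y ~ NegBin(j+1, p).
    q = 1 - p
    if s >= t:
        return 1.0
    return 1 - sum(comb(s + 1, k) * p**k * q**(s + 1 - k) for k in range(j + 1))

def backward_cdf_enum(s, t, j, p):
    q = 1 - p
    total = 0.0
    for bits in product((0, 1), repeat=t):
        ones = sum(bits)
        prob = p**ones * q**(t - ones)
        times = [i + 1 for i, b in enumerate(bits) if b]  # success times
        i = ones - j                                      # index N(t) - j
        S = times[i - 1] if i >= 1 else 0                 # S_k = 0 for k <= 0
        if t - S <= s:
            total += prob
    return total

p = 0.45
max_err = max(abs(backward_cdf_formula(s, 6, j, p) - backward_cdf_enum(s, 6, j, p))
              for j in range(3) for s in range(0, 8))
```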

Theorem 5. If $j \in \{1, 2, \ldots\}$ and $t > 0$, then $S_{N(t)+j} - t$ has the Negative Binomial($j, p$) distribution.

Proof. For $j \in \{1, 2, \ldots\}$, $S_{N(t)+j} - t > 0$. So for $x > 0$,

$$\Pr(S_{N(t)+j} - t \le x) = \Pr(N(t+x) - N(t) \ge j) = 1 - \sum_{k=0}^{j-1} \binom{x}{k} p^k q^{x-k}.$$

An alternative proof of Theorem 5 can be given by observing that

$$S_{N(t)+j} - t = Y(t) + \sum_{k=2}^{j} X_{N(t)+k}$$

and that, as noted above, the summands on the right-hand side of the last equation are independent random variables, each of which has the Geometric($p$) distribution in (1).
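Theorem 5 can be illustrated by simulation: since the overshoot $S_{N(t)+j} - t$ depends only on trials after $t$, running the process forward from $t$ and averaging should reproduce the Negative Binomial($j, p$) mean $j/p$. The parameter values and seed below are arbitrary.

```python
import random

random.seed(12345)
p, t, j = 0.5, 10, 2
n_sim = 200_000

total = 0
for _ in range(n_sim):
    # Run trials t+1, t+2, ... until the j-th success after time t.
    time, successes = t, 0
    while successes < j:
        time += 1
        if random.random() < p:
            successes += 1
    total += time - t              # S_{N(t)+j} - t for this replication

mean_overshoot = total / n_sim     # should be close to j / p = 4
```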

Once again, the results for waiting times are analogous to those for the Poisson process (see [2] ).

6. Conclusions

The Inspection Paradox impacts experiments in both continuous- and discrete-time processes. The paradox represents a difficulty because inspected intervals are likely to be unusually long, which can result in inflated estimates. Our goal was to quantify this phenomenon for Bernoulli processes.

Determination of the distributions and the expected lengths of all intervals when there has been an inspected interval is a major step toward systematic modeling and length-based sampling. These distributions and their properties add to our intuition about length-based sampling.

A perhaps surprising outcome is that the sequence of times between successes, i.e., the lengths of intervals such as the waiting times for buses, is stochastically decreasing into the past from the inspected interval.

In order to gain a thorough understanding of the Inspection Paradox for Bernoulli trials, it is important to know the probability distributions for random variables which naturally arise in this context. We have computed the distributions and moments for the interarrival and waiting times for the inspected Bernoulli process and compared them with those for the uninspected version of this process. The results presented constitute a nearly complete mathematical analysis of the Inspection Paradox for Bernoulli trials.

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] Ross, S.M. (1993) An Introduction to Probability Models. 5th Edition, Academic Press, San Diego.
https://doi.org/10.1016/B978-0-12-598455-3.50004-0
[2] Marengo, J.E., Himes, A.M., Reinberger, W.C. and Farnsworth, D.L. (2023) Probability Distributions Arising in Connection with the Inspection Paradox for the Poisson Process. Open Journal of Statistics, 13, 16-24.
https://doi.org/10.4236/ojs.2023.131002
[3] Resnick, S.I. (2002) Adventures in Stochastic Processes. Springer, New York.
https://doi.org/10.1007/978-1-4612-0387-2
[4] Pal, A., Kostinski, S. and Reuveni, S. (2022) The Inspection Paradox in Stochastic Resetting. Journal of Physics A: Mathematical and Theoretical, 55, Article ID: 021001.
https://iopscience.iop.org/article/10.1088/1751-8121/ac3cdf/pdf
https://doi.org/10.1088/1751-8121/ac3cdf
[5] Wagner, C.H. (2009) Average Perceived Class Size and Average Perceived Population Density. College Mathematics Journal, 40, 284-292.
[6] Bytheway, B. (1974) A Statistical Trap Associated with Family Size. Journal of Biosocial Science, 6, 67-72.
https://doi.org/10.1017/S0021932000009512
[7] Jenkins, J.J. and Tuten, J.T. (1992) Why Isn’t the Average Child from the Average Family? And Similar Puzzles. American Journal of Psychology, 105, 517-526.
https://doi.org/10.2307/1422907
[8] Gates, T.J. (2001) Screening for Cancer: Evaluating the Evidence. American Family Physician, 63, 513-522.
https://www.aafp.org/pubs/afp/issues/2001/0201/p513.html
[9] Downey, A. (2021) COVID-19 and the Inspection Paradox.
http://www.allendowney.com/blog/2021/08/19/covid-19-and-the-inspection-paradox/
[10] Barbu, V.S. and Limnios, N. (2008) Semi-Markov Chains and Hidden Semi-Markov Models toward Applications: Their Use in Reliability and DNA Analysis. Springer, New York.
https://doi.org/10.1007/978-0-387-73173-5_3
[11] Chen, R. (1978) A Surveillance System for Congenital Malformations. Journal of the American Statistical Association, 73, 323-327.
https://doi.org/10.2307/2286660
[12] Livsey, J. (2013) Count Time Series and Discrete Renewal Processes.
https://tigerprints.clemson.edu/all_dissertations/1163
[13] Gallager, R. (2011) Discrete Stochastic Processes. MIT Open Courseware.
https://ocw.mit.edu/courses/6-262-discrete-stochastic-processes-spring-2011/931ffa0940899c27f34b71ad64fd2bb0_MIT6_262S11_chap04.pdf
[14] Mitov, K.V. and Omey, E. (2014) Renewal Processes. Springer, Cham.
https://doi.org/10.1007/978-3-319-05855-9
[15] Ross, S.M. (2003) The Inspection Paradox. Probability in the Engineering and Informational Sciences, 17, 47-51.
https://doi.org/10.1017/S0269964803171033
[16] Karlin, S. and Taylor, H.M. (1975) A First Course in Stochastic Processes. 2nd Edition, Academic Press, San Diego.
https://doi.org/10.1016/B978-0-08-057041-9.50005-2
[17] Johnson, N.L., Kotz, S. and Balakrishnan, N. (1995) Continuous Univariate Distributions, Volume 1. 2nd Edition, John Wiley, New York.

Copyright © 2024 by authors and Scientific Research Publishing Inc.


This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.