Exact Distribution of Difference of Two Sample Proportions and Its Inferences

Abstract

Comparing two population proportions using a confidence interval can be misleading in many cases, for example when the sample size is small and the interval is based on a normal approximation. In such cases the only remedy is to collect a larger sample, which may not be possible, for instance when the subjects are patients suffering from a rare disease. The main purpose of this paper is to derive a closed formula for the exact distribution of the difference between two independent sample proportions, to use it for related inference such as confidence intervals and hypothesis tests for two population proportions regardless of the sample sizes, and to compare the results with the existing Wald, Agresti-Caffo and Score intervals. The distribution requires no large-sample conditions. We claim that the exact distribution yields the smallest confidence width among Wald, Agresti-Caffo and Score, so it is suitable for inference on the difference between population proportions regardless of sample size.

Share and Cite:

Dahal, K. and Amezziane, M. (2020) Exact Distribution of Difference of Two Sample Proportions and Its Inferences. Open Journal of Statistics, 10, 363-374. doi: 10.4236/ojs.2020.103024.

1. Introduction

Comparing two population proportions, especially when the sample size is small, is a challenging problem in statistics and has applications in many fields. Several procedures have been suggested. One of the most popular methods, in use for a long time, is the Wald interval; because of its simplicity and convenience, it is the first method that comes to mind for most statisticians. However, the Wald interval has some disadvantages. Firstly, it is based on a normal approximation, and for this approximation to work well we need a large sample. Unfortunately, large samples may be costly in practice. Secondly, its coverage probability is liberal. The coverage probability of a nominal 95% confidence interval can fall below 0.5 when the sample size is small. Even for a large sample size, the coverage probability is always less than the nominal confidence level $(1-\alpha)$.

Agresti and Caffo (2000) [1] introduced the adjusted Wald confidence interval, which modifies the Wald interval slightly by adding one success and one failure to each group. They also showed that the coverage probability of the adjusted Wald interval is noticeably higher than that of the regular Wald interval. However, the Agresti-Caffo interval is also based on a normal approximation.
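The two normal-approximation intervals described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `wald_interval` and `agresti_caffo_interval` are hypothetical, and the inputs are assumed to be success counts `x1`, `x2` out of `m`, `n` trials.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(x1, m, x2, n, alpha=0.05):
    """Wald interval for p1 - p2 from x1 successes in m trials and x2 in n."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # standard normal quantile
    p1, p2 = x1 / m, x2 / n
    half_width = z * sqrt(p1 * (1 - p1) / m + p2 * (1 - p2) / n)
    d = p1 - p2
    return d - half_width, d + half_width


def agresti_caffo_interval(x1, m, x2, n, alpha=0.05):
    """Wald formula applied after adding one success and one failure per group."""
    return wald_interval(x1 + 1, m + 2, x2 + 1, n + 2, alpha)


# Illustrative call with the counts used later in the promotion example.
print(wald_interval(21, 24, 14, 24))
print(agresti_caffo_interval(21, 24, 14, 24))
```

When both samples contain no successes, the Wald interval collapses to a single point, while the adjusted interval keeps a positive width. This is one way to see why the adjustment helps for small samples.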

Robert G. Newcombe (1998) [2] has explained eleven different methods to compare the difference between two population proportions. Some of them are conservative, like Score, while others are liberal, like Wald.

The main purpose of this paper is to derive a closed formula for the exact distribution of the difference between two independent sample proportions, and to use it to perform related inferences such as a hypothesis test. The rest of the paper is organized as follows. In Section 2, we derive the closed formula for the exact distribution of the difference between two independent sample proportions and break it into different cases. We obtain the support of the distribution in Section 3. In Section 4, we perform the hypothesis test. In Section 5, we compute the power of the hypothesis test. In Section 6, we compute the confidence interval and compare it to others. In Section 7, we summarize the main findings and conclude the paper.

2. Exact Distribution of Difference of Two Sample Proportions

Let $X_1, X_2, \ldots, X_m$ and $Y_1, Y_2, \ldots, Y_n$ be iid Bernoulli random samples from two different populations with parameters $p_1$ and $p_2$ respectively, and let $\hat{p}_1 = \frac{1}{m}\sum_{i=1}^{m} X_i$ and $\hat{p}_2 = \frac{1}{n}\sum_{i=1}^{n} Y_i$ be the point estimates of the parameters $p_1$ and $p_2$ respectively. We denote the difference between $\hat{p}_1$ and $\hat{p}_2$ by D, that is, $D = \hat{p}_1 - \hat{p}_2$.

To obtain the exact distribution of D, we first derive the probability generating function (pgf) of $W = mn(D+1)$ in the following lemma.

Lemma

Let $W = mn(D+1)$. Then the pgf of W is given by

$$p_w(z) = \sum_{s=0}^{m}\sum_{u=0}^{s}\sum_{t=0}^{n}\sum_{v=0}^{t} (-1)^{s+t+u+v}\binom{m}{s}\binom{s}{u}\binom{n}{t}\binom{t}{v}\, p_1^{s}(1-p_2)^{t}\, z^{un+vm} \tag{1}$$

Now, let $f\left(\frac{k}{m}-\frac{l}{n}\right)$ denote the probability mass function (pmf) of D at the point $\frac{k}{m}-\frac{l}{n}$, for $k = 0, \ldots, m$ and $l = 0, \ldots, n$.

Theorem

Let the greatest common divisor be $\gcd(m,n) = r$, and let $m'$ and $n'$ be such that $m = r m'$ and $n = r n'$. The pmf of D is given by

$$f\left(\frac{k}{m}-\frac{l}{n}\right) = (-1)^{k+n-l}\sum_{s=0}^{m}(-1)^{s}\binom{m}{s}p_1^{s}\sum_{t=0}^{n}(-1)^{t}\binom{n}{t}(1-p_2)^{t}\sum_{i\in S_{m,n}(s,t)}(-1)^{i(m'-n')}\binom{s}{k+im'}\binom{t}{(n-l)-in'},$$

for $k = 0, \ldots, m$ and $l = 0, \ldots, n$, where

$$S_{m,n}(s,t) = \left[\max\left(-\frac{k}{m'}, \frac{(n-l)-t}{n'}\right), \min\left(\frac{s-k}{m'}, \frac{n-l}{n'}\right)\right].$$

From the Theorem above, we derive the following results for different relations between m and n.

Corollary 1

If $\gcd(m,n) = 1$, then the exact distribution of D is given by:

$$\Pr\left(D = \frac{k}{m}-\frac{l}{n}\right) = \binom{m}{k}\binom{n}{l}\, p_1^{k}\, p_2^{l}\,(1-p_1)^{m-k}(1-p_2)^{n-l}$$

for $\frac{k}{m}-\frac{l}{n} \neq 0$, while $\Pr(D = 0) = (1-p_1)^{m}(1-p_2)^{n} + p_1^{m} p_2^{n}$.
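Corollary 1 can be checked numerically by brute force. The sketch below does not use the closed formula of the Theorem. It builds the pmf of D by summing the product of two binomial probabilities over all pairs $(k, l)$ that give the same difference, using exact rational arithmetic. The helper names `binom_pmf` and `exact_pmf` and the parameter values are illustrative assumptions.

```python
from fractions import Fraction
from math import comb, gcd


def binom_pmf(x, size, p):
    """Binomial probability of x successes in `size` trials."""
    return comb(size, x) * p**x * (1 - p) ** (size - x)


def exact_pmf(m, n, p1, p2):
    """Return {d: Pr(D = d)} by enumerating all (k, l) with k/m - l/n = d."""
    pmf = {}
    for k in range(m + 1):
        for l in range(n + 1):
            d = Fraction(k, m) - Fraction(l, n)
            pmf[d] = pmf.get(d, 0) + binom_pmf(k, m, p1) * binom_pmf(l, n, p2)
    return pmf


# Illustrative values with gcd(m, n) = 1.
m, n = 3, 4
p1, p2 = Fraction(1, 3), Fraction(1, 2)
pmf = exact_pmf(m, n, p1, p2)

assert gcd(m, n) == 1
# Pr(D = 0) collects exactly the two pairs (0, 0) and (m, n).
print(pmf[Fraction(0)] == (1 - p1) ** m * (1 - p2) ** n + p1**m * p2**n)
```

With coprime sample sizes every nonzero difference arises from a single pair $(k, l)$, so its probability is just the product of the two binomial terms, as the corollary states.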

Corollary 2

If $m = n$ and $k = l$, then the exact distribution of D is given by

$$\Pr(D = 0) = (-1)^{n}\sum_{s=0}^{n}(-1)^{s}\binom{n}{s}p_1^{s}\sum_{t=0}^{n}(-1)^{t}\binom{n}{t}(1-p_2)^{t}\sum_{u=n-t}^{s}\binom{s}{u}\binom{t}{n-u}$$

Corollary 3

The exact distribution of D is given by

$$\Pr\left(D = \frac{k}{m}-\frac{l}{n}\right) = \sum_{s=0}^{m}\sum_{t=0}^{n}\sum_{(u,v)\in S_{s,t}} (-1)^{s+t+u+(k-u)\frac{n}{m}+n-l}\binom{m}{s}\binom{s}{u}\binom{n}{t}\binom{t}{(k-u)\frac{n}{m}+n-l}\, p_1^{s}(1-p_2)^{t}$$

for $k = 0, \ldots, m$ and $l = 0, \ldots, n$, where

$$S_{s,t} = \left\{\left(u, (k-u)\frac{n}{m}+n-l\right) \in \mathbb{Z}^2 : \max\left(0, k+(n-l-t)\frac{m}{n}\right) \le u \le \min\left(s, k+(n-l)\frac{m}{n}\right)\right\}$$

Corollary 4

The exact distribution of D is symmetrical about zero if $m = n$ and $p_1 = p_2$.
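The symmetry claimed in Corollary 4 can be checked numerically for a small case. The following Python sketch computes the pmf of D by direct enumeration of the two binomial counts rather than through Corollary 3. The function name `exact_pmf` and the chosen values of `n` and `p` are assumptions made for illustration.

```python
from fractions import Fraction
from math import comb


def exact_pmf(m, n, p1, p2):
    """Return {d: Pr(D = d)} by enumerating all (k, l) with k/m - l/n = d."""
    pmf = {}
    for k in range(m + 1):
        for l in range(n + 1):
            d = Fraction(k, m) - Fraction(l, n)
            pr = (comb(m, k) * p1**k * (1 - p1) ** (m - k)
                  * comb(n, l) * p2**l * (1 - p2) ** (n - l))
            pmf[d] = pmf.get(d, 0) + pr
    return pmf


n, p = 5, Fraction(3, 10)
pmf = exact_pmf(n, n, p, p)

# Exact rational arithmetic, so equality can be tested without tolerance.
symmetric = all(pmf[d] == pmf[-d] for d in pmf)
print(symmetric)  # True
```

If the two proportions differ, the same check fails. For example, with $p_1 = 0.3$ and $p_2 = 0.5$ the probabilities at $D = 1$ and $D = -1$ are no longer equal.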

3. Support of the Distribution

The support of the exact distribution is denoted by $D(m,n)$. For small values of m and n, it can be derived by hand. For larger values of m and n this is tedious and time-consuming, so software such as R is used.

For $m = n = 2$, $D = \frac{k}{2}-\frac{l}{2}$, where $k = 0, 1, 2$ and $l = 0, 1, 2$.

Thus the support for $m = n = 2$ is $\{-1, -0.5, 0, 0.5, 1\}$.
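For larger sample sizes the support can be enumerated programmatically. The text mentions R for this purpose. The sketch below uses Python instead, with a hypothetical helper `support` that collects all distinct values of $k/m - l/n$ as exact fractions.

```python
from fractions import Fraction


def support(m, n):
    """Sorted distinct values of k/m - l/n for k = 0..m and l = 0..n."""
    values = {Fraction(k, m) - Fraction(l, n)
              for k in range(m + 1) for l in range(n + 1)}
    return sorted(values)


print([float(d) for d in support(2, 2)])  # [-1.0, -0.5, 0.0, 0.5, 1.0]
```

Using `Fraction` avoids treating two equal differences as distinct because of floating-point rounding.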

The probability mass functions of the exact distribution of the difference of two sample proportions for $m = n$ and $p_1 = p_2$ are plotted in Figure 1. These graphs support Corollary 4.

Figure 1. Probability mass function of the exact distribution of the difference of two sample proportions for $m = n$ and $p_1 = p_2$.

4. Hypothesis Testing

To test $H_0: p_1 = p_2 = p$ against $H_1: p_1 - p_2 = \delta \neq 0$, we use D as a test statistic. Let $\Pr\left(D = \frac{k}{m}-\frac{l}{n} \,\middle|\, H_0\right) = f_0\left(\frac{k}{m}-\frac{l}{n}\right)$. Then the null distribution of D is given by

$$f_0\left(\frac{k}{m}-\frac{l}{n}\right) = (-1)^{k+n-l}\sum_{s=0}^{m}(-1)^{s}\binom{m}{s}p^{s}\sum_{t=0}^{n}(-1)^{t}\binom{n}{t}(1-p)^{t}\sum_{i\in S_{m,n}(s,t)}(-1)^{i(m'-n')}\binom{s}{k+im'}\binom{t}{(n-l)-in'},$$

for $k = 0, \ldots, m$ and $l = 0, \ldots, n$, where

$$S_{m,n}(s,t) = \left[\max\left(-\frac{k}{m'}, \frac{(n-l)-t}{n'}\right), \min\left(\frac{s-k}{m'}, \frac{n-l}{n'}\right)\right].$$

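The critical region can be obtained by finding $c_{\alpha/2}$ and $c_{1-\alpha/2}$ such that:

$$c_{\alpha/2} = \max\left\{D : \Pr\left(D \le c_{\alpha/2} \mid H_0\right) \le \frac{\alpha}{2}\right\} \quad\text{and}\quad c_{1-\alpha/2} = \min\left\{D : \Pr\left(D \ge c_{1-\alpha/2} \mid H_0\right) \le \frac{\alpha}{2}\right\}.$$

This means that:

$$\sum_{(k,l)\in E_{\alpha/2}} f_0\left(\frac{k}{m}-\frac{l}{n}\right) \le \frac{\alpha}{2} \quad\text{and}\quad \sum_{(k,l)\in E_{1-\alpha/2}} f_0\left(\frac{k}{m}-\frac{l}{n}\right) \le \frac{\alpha}{2},$$

where

$$E_{\alpha/2} = \left\{(k,l)\in\mathbb{Z}^2 : 0 \le k \le m,\ 0 \le l \le n,\ \frac{k}{m}-\frac{l}{n} \le c_{\alpha/2}\right\}$$

and

$$E_{1-\alpha/2} = \left\{(k,l)\in\mathbb{Z}^2 : 0 \le k \le m,\ 0 \le l \le n,\ \frac{k}{m}-\frac{l}{n} \ge c_{1-\alpha/2}\right\}$$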
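The search for the two critical values can be sketched as a scan over the sorted support, accumulating tail probability from each end until the $\alpha/2$ budget would be exceeded. This is a minimal Python sketch under stated assumptions: the common proportion `p` under $H_0$ is supplied by the user, the null pmf is computed by direct enumeration rather than the closed formula, and the names `exact_pmf` and `critical_values` are hypothetical.

```python
from fractions import Fraction
from math import comb


def exact_pmf(m, n, p1, p2):
    """Return {d: Pr(D = d)} by enumerating all (k, l) with k/m - l/n = d."""
    pmf = {}
    for k in range(m + 1):
        for l in range(n + 1):
            d = Fraction(k, m) - Fraction(l, n)
            pr = (comb(m, k) * p1**k * (1 - p1) ** (m - k)
                  * comb(n, l) * p2**l * (1 - p2) ** (n - l))
            pmf[d] = pmf.get(d, 0) + pr
    return pmf


def critical_values(m, n, p, alpha=0.05):
    """Largest d with Pr(D <= d) <= alpha/2 and smallest d with Pr(D >= d) <= alpha/2."""
    pmf = exact_pmf(m, n, p, p)  # null distribution, p1 = p2 = p (assumed known)
    points = sorted(pmf)
    lower = upper = None  # stay None if even the extreme point exceeds alpha/2
    cumulative = 0
    for d in points:  # scan the left tail
        cumulative += pmf[d]
        if cumulative > alpha / 2:
            break
        lower = d
    cumulative = 0
    for d in reversed(points):  # scan the right tail
        cumulative += pmf[d]
        if cumulative > alpha / 2:
            break
        upper = d
    return lower, upper


print(critical_values(10, 10, Fraction(1, 2)))
```

Because D is discrete, the attained tail probabilities are usually strictly below $\alpha/2$, so the resulting test is conservative.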

Example: Gender Discrimination

The table below shows the gender distribution of the promoted files.

Data Source:

https://www2.stat.duke.edu/courses/Spring12/sta101.1/lec/lec14S.pdf.

In this example, we investigate whether gender discrimination is associated with the promotion of employees. In other words, we conduct the following hypothesis test.

$H_0$: There is no gender discrimination in promotion vs $H_1$: There is gender discrimination in promotion.

We run the R program for the exact distribution with $m = 24$, $n = 24$, $\hat{p}_1 = \frac{21}{24}$ and $\hat{p}_2 = \frac{14}{24}$, and obtain a test statistic of 0.291667 and a p-value of 0.03286628. Since the p-value is less than $\alpha = 0.05$, we reject the null hypothesis and conclude that there is gender discrimination in promotion. However, the p-value is only slightly less than $\alpha$, so the evidence of gender discrimination in the promotion of employees is moderate.
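The authors' R program is not reproduced in the text, so the following Python sketch only illustrates how such a calculation might be organized. It rests on two assumptions that are not stated in the source: the common proportion under $H_0$ is replaced by the pooled estimate $35/48$, and the two-sided p-value is taken as twice the smaller tail probability, capped at 1. With a different convention the p-value will differ, so this sketch is not guaranteed to reproduce the value reported above.

```python
from fractions import Fraction
from math import comb


def exact_pmf(m, n, p1, p2):
    """Return {d: Pr(D = d)} by enumerating all (k, l) with k/m - l/n = d."""
    pmf = {}
    for k in range(m + 1):
        for l in range(n + 1):
            d = Fraction(k, m) - Fraction(l, n)
            pr = (comb(m, k) * p1**k * (1 - p1) ** (m - k)
                  * comb(n, l) * p2**l * (1 - p2) ** (n - l))
            pmf[d] = pmf.get(d, 0) + pr
    return pmf


def two_sided_p_value(x1, m, x2, n, p0):
    """Observed difference and 2 * min(left tail, right tail) under p1 = p2 = p0."""
    pmf = exact_pmf(m, n, p0, p0)
    d_obs = Fraction(x1, m) - Fraction(x2, n)
    left = sum(pr for d, pr in pmf.items() if d <= d_obs)
    right = sum(pr for d, pr in pmf.items() if d >= d_obs)
    return d_obs, min(1, 2 * min(left, right))


# Pooled estimate (21 + 14) / 48 used as the null proportion: an assumption.
d_obs, p_val = two_sided_p_value(21, 24, 14, 24, p0=Fraction(35, 48))
print(round(float(d_obs), 6))  # 0.291667, the test statistic reported in the text
print(float(p_val))
```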

5. Power Calculation

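If $c_{\alpha/2}$ and $c_{1-\alpha/2}$ are the left and right critical values and the null hypothesis is rejected for the test statistic $d = \hat{p}_1 - \hat{p}_2$, then the power of the corresponding hypothesis test is given by:

$$1-\beta = 2\min\left\{\Pr(D \le d \mid H_\alpha), \Pr(D \ge d \mid H_\alpha)\right\} = 2\sum_{(k,l)\in E_\alpha} f\left(\frac{k}{m}-\frac{l}{n}\right)$$

where

$$E_\alpha = \left\{(k,l)\in\mathbb{Z}^2 : 0 \le k \le m,\ 0 \le l \le n,\ \frac{k}{m}-\frac{l}{n} \le d \ \text{or}\ \frac{k}{m}-\frac{l}{n} \ge d\right\}$$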

Continuation of the example: Gender Discrimination

In this example, we rejected the null hypothesis at the significance level $\alpha = 0.05$. We now want to find the power of the hypothesis test for $p_1 = \hat{p}_1 = \frac{21}{24}$, $p_2 = \hat{p}_2 = \frac{14}{24}$ and $\alpha = 0.05$. Running the R program for the power calculation of the exact distribution, we obtain a power of 0.5657226.

6. Confidence Interval

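The point estimator of $p_1 - p_2$ is $D = \hat{p}_1 - \hat{p}_2$, which can be computed from the given samples. Let $L_{\alpha/2}$ and $U_{\alpha/2}$ be the lower and upper bounds of the confidence interval for $p_1 - p_2$ with confidence coefficient $1-\alpha$. We obtain $L_{\alpha/2}$ and $U_{\alpha/2}$ as follows:

$$L_{\alpha/2} = \max\left\{D : \Pr\left(D \le L_{\alpha/2}\right) \le \frac{\alpha}{2}\right\}$$

$$U_{\alpha/2} = \min\left\{D : \Pr\left(D \ge U_{\alpha/2}\right) \le \frac{\alpha}{2}\right\}.$$

Thus, the $(1-\alpha)100\%$ confidence interval for $p_1 - p_2$ is $(L_{\alpha/2}, U_{\alpha/2})$.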

A relatively easy way to compare two population proportions is through a confidence interval for their difference $(p_1 - p_2)$. We calculate the sample proportions $\hat{p}_1$ and $\hat{p}_2$ from the respective samples and use them to construct a confidence interval with nominal confidence coefficient $1-\alpha$. If the confidence interval does not include 0, we reject the null hypothesis; otherwise we fail to reject it.

Table 1. 95% confidence interval for Exact, Wald, Agresti-Caffo, and Score.

For the purpose of this comparison, we constructed confidence intervals, together with their widths, for Exact, Wald, Agresti-Caffo and Score with $m = n = 20$ and a 95% confidence coefficient (Table 1).

The last four columns of the table are the confidence widths for Exact, Wald, Agresti-Caffo and Score. It can be seen that the Exact interval has the smallest width.

7. Conclusion

Inference on the difference of two population proportions is a very basic problem in statistics. The standard Wald interval has been used universally, yet it behaves erratically and has unacceptably poor coverage probabilities when either the sample sizes are small or one proportion is very large and the other is very small. Several other intervals have been suggested, but their performance is not satisfactory when the sample size is small. We have shown that the validity of our distribution does not depend on the sample size. We have also shown that the exact distribution has the smallest confidence width among Wald, Agresti-Caffo and Score, so it is suitable for inference on the difference between population proportions regardless of sample size.

Appendix

Proof of lemma

If we define $Z_j = 1 - Y_j$, then W can be written as $W = n\sum_{i=1}^{m} X_i + m\sum_{j=1}^{n} Z_j$. The pgf of W can be written as $p_w(z) = \prod_{i=1}^{m} E\left(z^{nX_i}\right)\prod_{j=1}^{n} E\left(z^{mZ_j}\right)$ since the two samples are independent of each other and the observations in each sample are independent and identically distributed.

Since $X_i \overset{iid}{\sim} Ber(p_1)$ for $i = 1, \ldots, m$, we have $E\left(z^{nX_i}\right) = 1 - p_1(1 - z^{n})$ and

$$E\left(\prod_{i=1}^{m} z^{nX_i}\right) = \left(1 - p_1(1 - z^{n})\right)^{m} = \sum_{s=0}^{m}(-1)^{s}\binom{m}{s}p_1^{s}(1 - z^{n})^{s} = \sum_{s=0}^{m}\sum_{u=0}^{s}(-1)^{s+u}\binom{m}{s}\binom{s}{u}p_1^{s}z^{un}. \tag{2}$$

Similarly, since $Y_j \sim Ber(p_2)$ for $j = 1, \ldots, n$, each $Z_j$ is $Ber(1-p_2)$ and

$$E\left(\prod_{j=1}^{n} z^{mZ_j}\right) = \sum_{t=0}^{n}\sum_{v=0}^{t}(-1)^{t+v}\binom{n}{t}\binom{t}{v}(1-p_2)^{t}z^{vm}. \tag{3}$$

Multiplying the right-hand sides of (2) and (3) gives (1).

Proof of Theorem

Notice that, even though the supports of D and W are different, their pmfs have the same probabilities: $\Pr\left(W = kn + (n-l)m\right) = \Pr\left(D = \frac{k}{m}-\frac{l}{n}\right)$ for $k = 0, \ldots, m$ and $l = 0, \ldots, n$. The pmf of W can be obtained from the pgf as follows:

$$\Pr\left(W = kn + (n-l)m\right) = \frac{1}{\left(kn + (n-l)m\right)!}\left.\frac{d^{\,kn+(n-l)m}}{dz^{\,kn+(n-l)m}}\, p_w(z)\right|_{z=0}.$$

Therefore,

$$\Pr\left(W = kn + (n-l)m\right) = \sum_{s=0}^{m}\sum_{u=0}^{s}\sum_{t=0}^{n}\sum_{v=0}^{t}(-1)^{s+t+u+v}\binom{m}{s}\binom{s}{u}\binom{n}{t}\binom{t}{v}\,p_1^{s}(1-p_2)^{t}\,\delta_{kn+(n-l)m}(un+vm), \tag{4}$$

where $\delta_a(x) = 1$ if $x = a$ and 0 otherwise.
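Formula (4) can be verified numerically for small sample sizes by comparing it with a direct enumeration of $W = n\sum X_i + m\sum (1 - Y_j)$. The Python sketch below uses exact fractions. The function names and the test values ($m = 4$, $n = 6$, chosen so that $\gcd(m,n) > 1$) are illustrative assumptions.

```python
from fractions import Fraction
from math import comb


def pmf_w_formula(w, m, n, p1, p2):
    """Pr(W = w) from the alternating quadruple sum in formula (4)."""
    total = 0
    for s in range(m + 1):
        for u in range(s + 1):
            for t in range(n + 1):
                for v in range(t + 1):
                    if u * n + v * m == w:  # the delta term
                        total += ((-1) ** (s + t + u + v)
                                  * comb(m, s) * comb(s, u)
                                  * comb(n, t) * comb(t, v)
                                  * p1**s * (1 - p2) ** t)
    return total


def pmf_w_direct(w, m, n, p1, p2):
    """Pr(W = w) by enumerating the success counts x and y of the two samples."""
    total = 0
    for x in range(m + 1):
        for y in range(n + 1):
            if n * x + m * (n - y) == w:
                total += (comb(m, x) * p1**x * (1 - p1) ** (m - x)
                          * comb(n, y) * p2**y * (1 - p2) ** (n - y))
    return total


m, n = 4, 6
p1, p2 = Fraction(1, 4), Fraction(2, 3)
agree = all(pmf_w_formula(w, m, n, p1, p2) == pmf_w_direct(w, m, n, p1, p2)
            for w in range(2 * m * n + 1))
print(agree)  # True
```

The direct enumeration is cheaper and avoids the cancellation inherent in alternating sums, which matters if floating-point numbers are used instead of fractions.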

To simplify formula (4), we use the fact that $\delta_{kn+(n-l)m}(un+vm) = 1$ is equivalent to $kn + (n-l)m = un + vm$ which, in turn, is equivalent to $(u-k)n = (n-l-v)m$. Dividing both sides by r gives $(u-k)n' = (n-l-v)m'$. From this last equality, we conclude that $u - k = im'$ and $n - l - v = in'$ for some integer i, because $m'$ and $n'$ are relatively prime to each other. The values of i are hence obtained by solving the following system:

$$\begin{cases} u - k = im' \\ (n-l) - v = in' \\ 0 \le u \le s \\ 0 \le v \le t \\ i \in \mathbb{Z} \end{cases}$$

This leads to the following simplified system:

$$\begin{cases} -\dfrac{k}{m'} \le i \le \dfrac{s-k}{m'} \\[2mm] \dfrac{(n-l)-t}{n'} \le i \le \dfrac{n-l}{n'} \\[2mm] i \in \mathbb{Z} \end{cases}$$

which corresponds to the values of i that form the set $S_{m,n}(s,t) = \left[\max\left(-\frac{k}{m'}, \frac{(n-l)-t}{n'}\right), \min\left(\frac{s-k}{m'}, \frac{n-l}{n'}\right)\right]$.

Proof of Corollary 1

Since m and n are relatively prime to each other (so that $m' = m$ and $n' = n$), the index set becomes:

$$S_{m,n}(s,t) = \left[\max\left(-\frac{k}{m}, \frac{(n-l)-t}{n}\right), \min\left(\frac{s-k}{m}, \frac{n-l}{n}\right)\right].$$

When $\frac{k}{m}-\frac{l}{n} \neq 0$, we have $(k,l) \notin \{(0,0), (m,n)\}$, hence $-1 < \max\left(-\frac{k}{m}, \frac{(n-l)-t}{n}\right) < 1$ and $-1 < \min\left(\frac{s-k}{m}, \frac{n-l}{n}\right) < 1$. Therefore $S_{m,n}(s,t) = \{0\}$. Now from the Theorem above we get

$$\begin{aligned}
\Pr\left(D = \frac{k}{m}-\frac{l}{n}\right) &= \sum_{s=k}^{m}\sum_{t=n-l}^{n}(-1)^{s+t+k+n-l}\binom{m}{s}\binom{s}{k}\binom{n}{t}\binom{t}{n-l}p_1^{s}(1-p_2)^{t} \\
&= (-1)^{k+n-l}\left[\sum_{s=0}^{m-k}(-1)^{s+k}\binom{m}{s+k}\binom{s+k}{k}p_1^{s+k}\right]\left[\sum_{t=0}^{l}(-1)^{t+n-l}\binom{n}{t+n-l}\binom{t+n-l}{n-l}(1-p_2)^{t+n-l}\right] \\
&= p_1^{k}(1-p_2)^{n-l}\left[\sum_{s=0}^{m-k}(-1)^{s}\binom{m}{k}\binom{m-k}{s}p_1^{s}\right]\left[\sum_{t=0}^{l}(-1)^{t}\binom{n}{l}\binom{l}{t}(1-p_2)^{t}\right] \\
&= \binom{m}{k}\binom{n}{l}p_1^{k}p_2^{l}(1-p_1)^{m-k}(1-p_2)^{n-l}.
\end{aligned}$$

When $\frac{k}{m}-\frac{l}{n} = 0$, we have $(k,l) \in \{(0,0), (m,n)\}$ and hence:

$$S_{m,n}(s,t) = \left[\max\left(0, \frac{n-t}{n}\right), \min\left(\frac{s}{m}, 1\right)\right] \cup \left[\max\left(-1, -\frac{t}{n}\right), \min\left(\frac{s-m}{m}, 0\right)\right] = \left[\frac{n-t}{n}, \frac{s}{m}\right] \cup \left[-\frac{t}{n}, \frac{s-m}{m}\right]$$

For this case, $-\frac{k}{m}$ is either 0 or −1 and $\frac{n-l}{n}$ is either 0 or 1, so from the Theorem, summing over the two pairs $(k,l)$, we get

$$\begin{aligned}
\Pr(D = 0) &= \sum_{(k,l)\in\{(0,0),(m,n)\}}\ \sum_{s=k}^{m}\sum_{t=n-l}^{n}(-1)^{s+t+k+n-l}\binom{m}{s}\binom{s}{k}\binom{n}{t}\binom{t}{n-l}p_1^{s}(1-p_2)^{t} \\
&= \sum_{s=0}^{m}\sum_{t=n}^{n}(-1)^{s+t+n}\binom{m}{s}\binom{s}{0}\binom{n}{t}\binom{t}{n}p_1^{s}(1-p_2)^{t} + \sum_{s=m}^{m}\sum_{t=0}^{n}(-1)^{s+t+m+n-n}\binom{m}{s}\binom{s}{m}\binom{n}{t}\binom{t}{n-n}p_1^{s}(1-p_2)^{t} \\
&= \sum_{s=0}^{m}(-1)^{s+n+n}\binom{m}{s}\binom{n}{n}\binom{n}{n}p_1^{s}(1-p_2)^{n} + \sum_{t=0}^{n}(-1)^{m+t+m}\binom{m}{m}\binom{m}{m}\binom{n}{t}p_1^{m}(1-p_2)^{t} \\
&= \sum_{s=0}^{m}(-1)^{s}\binom{m}{s}p_1^{s}(1-p_2)^{n} + \sum_{t=0}^{n}(-1)^{t}\binom{n}{t}p_1^{m}(1-p_2)^{t} \\
&= (1-p_1)^{m}(1-p_2)^{n} + p_1^{m}\left(1-(1-p_2)\right)^{n} \\
&= (1-p_1)^{m}(1-p_2)^{n} + p_1^{m}p_2^{n}.
\end{aligned}$$

Proof of corollary 2

For $m = n$ and $k = l$, the Theorem reduces to

$$\Pr(D = 0) = (-1)^{n}\sum_{s=0}^{n}(-1)^{s}\binom{n}{s}p_1^{s}\sum_{t=0}^{n}(-1)^{t}\binom{n}{t}(1-p_2)^{t}\sum_{i\in S_{n,n}(s,t)}\binom{s}{k+in'}\binom{t}{n-k-in'},$$

where

$$i \in S_{n,n}(s,t) = \left[\max\left(-\frac{k}{n'}, \frac{n-k-t}{n'}\right), \min\left(\frac{s-k}{n'}, \frac{n-k}{n'}\right)\right],$$

so that

$$in' \in \left[\max(-k, n-k-t), \min(s-k, n-k)\right]$$

and hence

$$k + in' \in \left[\max(0, n-t), \min(s, n)\right].$$

Now we replace $k + in'$ by u and obtain the following result:

$$\begin{aligned}
\Pr(D = 0) &= (-1)^{n}\sum_{s=0}^{n}(-1)^{s}\binom{n}{s}p_1^{s}\sum_{t=0}^{n}(-1)^{t}\binom{n}{t}(1-p_2)^{t}\sum_{u=\max(0,n-t)}^{\min(s,n)}\binom{s}{u}\binom{t}{n-u} \\
&= (-1)^{n}\sum_{s=0}^{n}(-1)^{s}\binom{n}{s}p_1^{s}\sum_{t=0}^{n}(-1)^{t}\binom{n}{t}(1-p_2)^{t}\sum_{u=n-t}^{s}\binom{s}{u}\binom{t}{n-u}.
\end{aligned}$$

Proof of corollary 3

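The exact distribution of D, using the Lemma, is given by

$$\Pr\left(D = \frac{k}{m}-\frac{l}{n}\right) = \sum_{s=0}^{m}\sum_{u=0}^{s}\sum_{t=0}^{n}\sum_{v=0}^{t}(-1)^{s+t+u+v}\binom{m}{s}\binom{s}{u}\binom{n}{t}\binom{t}{v}\,p_1^{s}(1-p_2)^{t}\,\delta_{kn+(n-l)m}(un+vm)$$

where $\delta_a(x) = 1$ if $x = a$ and 0 otherwise. Let us define a set $H_{s,t}$ as follows:

$$\begin{aligned}
H_{s,t} &= \left\{(u,v)\in\mathbb{Z}^2 : 0 \le u \le s,\ 0 \le v \le t,\ un + vm = kn + (n-l)m\right\} \\
&= \left\{(u,v)\in\mathbb{Z}^2 : 0 \le u \le s,\ 0 \le v \le t,\ v = (k-u)\tfrac{n}{m} + n - l\right\} \\
&= \left\{\left(u, (k-u)\tfrac{n}{m} + n - l\right)\in\mathbb{Z}^2 : 0 \le u \le s,\ 0 \le (k-u)\tfrac{n}{m} + n - l \le t\right\} \\
&= \left\{\left(u, (k-u)\tfrac{n}{m} + n - l\right)\in\mathbb{Z}^2 : 0 \le u \le s,\ -t \le (u-k)\tfrac{n}{m} + l - n \le 0\right\} \\
&= \left\{\left(u, (k-u)\tfrac{n}{m} + n - l\right)\in\mathbb{Z}^2 : 0 \le u \le s,\ n - l - t \le (u-k)\tfrac{n}{m} \le n - l\right\} \\
&= \left\{\left(u, (k-u)\tfrac{n}{m} + n - l\right)\in\mathbb{Z}^2 : 0 \le u \le s,\ k + (n-l-t)\tfrac{m}{n} \le u \le k + (n-l)\tfrac{m}{n}\right\} \\
&= \left\{\left(u, (k-u)\tfrac{n}{m} + n - l\right)\in\mathbb{Z}^2 : \max\left(0, k + (n-l-t)\tfrac{m}{n}\right) \le u \le \min\left(s, k + (n-l)\tfrac{m}{n}\right)\right\} = S_{s,t}
\end{aligned}$$

Thus,

$$\Pr\left(D = \frac{k}{m}-\frac{l}{n}\right) = \sum_{s=0}^{m}\sum_{t=0}^{n}\sum_{u\in S_{s,t}}(-1)^{s+t+u+(k-u)\frac{n}{m}+n-l}\binom{m}{s}\binom{s}{u}\binom{n}{t}\binom{t}{(k-u)\frac{n}{m}+n-l}\,p_1^{s}(1-p_2)^{t}$$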

Proof of corollary 4

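Using Corollary 3, the exact distribution of D for $m = n$ and $p_1 = p_2$ is given by

$$\begin{aligned}
\Pr\left(D = \frac{k}{n}-\frac{l}{n}\right) &= \sum_{s=0}^{n}\sum_{t=0}^{n}\sum_{u\in S_{s,t}}(-1)^{s+t+u+(k-u)\frac{n}{n}+n-l}\binom{n}{s}\binom{s}{u}\binom{n}{t}\binom{t}{(k-u)\frac{n}{n}+n-l}\,p_1^{s}(1-p_1)^{t} \\
&= \sum_{s=0}^{n}\sum_{t=0}^{n}\sum_{u\in S_{s,t}}(-1)^{s+t+u+k-u+n-l}\binom{n}{s}\binom{s}{u}\binom{n}{t}\binom{t}{k-u+n-l}\,p_1^{s}(1-p_1)^{t} \\
&= \sum_{s=0}^{n}\sum_{t=0}^{n}\sum_{u\in S_{s,t}}(-1)^{s+t+n+k-l}\binom{n}{s}\binom{s}{u}\binom{n}{t}\binom{t}{k-l-u+n}\,p_1^{s}(1-p_1)^{t}
\end{aligned}$$

where

$$S_{s,t} = \left\{(u, k-l+n-u)\in\mathbb{Z}^2 : \max(0, k-l+n-t) \le u \le \min(s, k-l+n)\right\}$$

Since $m = n$ and $p_1 = p_2$, interchanging the two samples leaves their joint distribution unchanged, and both k and l run from 0 to n, so $\Pr\left(D = \frac{k}{n}-\frac{l}{n}\right) = \Pr\left(D = \frac{l}{n}-\frac{k}{n}\right)$.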

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] Agresti, A. and Caffo, B. (2000) Simple and Effective Confidence Intervals for Proportions and Differences of Proportions Result from Adding Two Successes and Two Failures. The American Statistician, 54, 280-288.
https://doi.org/10.2307/2685779
[2] Newcombe, R.G. (1998) Interval Estimation for the Difference between Independent Proportions: Comparison of Eleven Methods. Statistics in Medicine, 17, 873-890.
https://doi.org/10.1002/(SICI)1097-0258(19980430)17:8<873::AID-SIM779>3.0.CO;2-I

Copyright © 2024 by authors and Scientific Research Publishing Inc.

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.