A Geometric Approach to Conditioning and the Search for Minimum Variance Unbiased Estimators

Abstract

Our purpose is twofold: to present a prototypical example of the conditioning technique for obtaining the best estimator of a parameter and to show that this technique resides in the structure of an inner product space. The technique conditions an unbiased estimator on a sufficient statistic. The procedure is founded upon the conditional variance formula, which leads to an inner product space and a geometric interpretation. The example clearly illustrates how the resulting estimator depends on the sampling methodology. These features show the power and centrality of conditioning.

Marengo, J. and Farnsworth, D. (2021) A Geometric Approach to Conditioning and the Search for Minimum Variance Unbiased Estimators. Open Journal of Statistics, 11, 437-442. doi: 10.4236/ojs.2021.113027.

1. Introduction

We are given a coin that has probability θ of coming up heads when tossed once, where θ is an unknown real number in the interval (0, 1). We wish to estimate θ. Typically, we toss this coin a number of times, which is fixed in advance, and use the proportion of heads that appear as an estimator of θ. This approach not only makes sense intuitively, but also is optimal in that this estimator is unbiased and has the smallest possible variance based on the fixed number of tosses. It is the minimum variance unbiased estimator (MVUE) of θ [1]. The sample proportion is also the maximum likelihood estimator (MLE) of θ, because it maximizes the likelihood of obtaining the observed result of these coin tosses [1] [2].
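
To make the fixed-sample-size design concrete, here is a minimal simulation sketch, assuming Python with NumPy; the value of θ and the number of tosses are arbitrary illustrative choices, not part of the original example.

    import numpy as np

    rng = np.random.default_rng(0)

    theta = 0.3   # unknown in practice; fixed here only to drive the simulation
    m = 50        # number of tosses, fixed in advance

    tosses = rng.random(m) < theta   # Boolean array: True means heads
    theta_hat = tosses.mean()        # sample proportion = MVUE and MLE of theta

    print(theta_hat)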

Suppose that we proceed in a different way. We perform a “do until” experiment. Instead of fixing the number of tosses in advance, we toss this coin until we obtain heads for the nth time, where n is fixed. The recorded data are $x_1, x_2, \ldots, x_n$, where $x_1$ is the number of tosses it takes to obtain heads for the first time, and for 2 ≤ k ≤ n, $x_k$ is the additional number of tosses it takes to obtain the kth occurrence of heads after we have obtained heads k − 1 times. These data are realizations of independent random variables $X_1, X_2, \ldots, X_n$, each of which has the geometric distribution with probability mass function (pmf)

$$p(x; \theta) = (1 - \theta)^{x-1}\,\theta \quad \text{for } x = 1, 2, 3, \ldots$$

[1] [2]. Our goal is to find the MVUE of θ based on these alternative data.
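
A minimal sketch of this “do until” design, again assuming Python with NumPy (θ and n are arbitrary illustrative values): each $X_k$ is drawn from the geometric pmf above, and NumPy's geometric generator uses the same number-of-trials convention, so no shift is needed.

    import numpy as np

    rng = np.random.default_rng(0)

    theta = 0.3   # unknown in practice; fixed here only to drive the simulation
    n = 5         # stop after the nth occurrence of heads

    x = rng.geometric(theta, size=n)   # x[k-1] = tosses needed for the kth heads after the (k-1)st
    y = x.sum()                        # total number of tosses

    print(x, y)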

We use this example to illustrate how conditioning an unbiased estimator on a sufficient statistic can be used to find a MVUE, and thus how the very important concept of conditioning leads to the best estimator. The example is not simply of theoretical interest: finding appropriate estimators is a consequential problem in inference, decision-making, and data reduction. In practical terms, an advantage of conditioning in this manner is that it is guaranteed to yield a minimum variance, unbiased estimator. Other techniques, such as the method of moments, may be intuitively appealing but need not produce estimators with those desirable properties [1] [2].

2. The Minimum Variance Unbiased Estimator (MVUE)

One way to find an unbiased estimator of θ for the geometric distribution and the data $x_1, x_2, \ldots, x_n$ is to start with $X_1$. Because $\Pr(X_1 = 1) = \theta$, the estimator

$$u(X_1) = \begin{cases} 1 & \text{if } X_1 = 1 \\ 0 & \text{if } X_1 > 1 \end{cases}$$

is an unbiased estimator of θ. However, this estimator ignores the values of $X_2, X_3, \ldots, X_n$, which makes it suspect. Indeed, it is not the MVUE of θ unless n = 1.

The random variable $Y = \sum_{k=1}^{n} X_k$ is the total number of tosses of the coin that it takes to obtain n heads. It has the negative binomial distribution with pmf

$$\Pr(Y = y) = \binom{y-1}{n-1}(1-\theta)^{y-n}\,\theta^{n} \quad \text{for } y = n, n+1, \ldots$$
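
This pmf can be checked against simulation. The sketch below (assuming Python with NumPy; the helper nbinom_pmf is ours and simply evaluates the formula above) compares empirical frequencies of Y with the negative binomial probabilities.

    import math
    import numpy as np

    rng = np.random.default_rng(0)
    theta, n, reps = 0.3, 5, 200_000

    # Simulate Y = X_1 + ... + X_n many times.
    y_sim = rng.geometric(theta, size=(reps, n)).sum(axis=1)

    def nbinom_pmf(y, n, theta):
        """Pr(Y = y) = C(y-1, n-1) (1-theta)^(y-n) theta^n."""
        return math.comb(y - 1, n - 1) * (1 - theta) ** (y - n) * theta ** n

    for y in range(n, n + 5):
        print(y, (y_sim == y).mean(), nbinom_pmf(y, n, theta))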

[1] [2]. The variable Y is a sufficient statistic for θ in the sense that the conditional joint distribution of $(X_1, X_2, \ldots, X_n)$, given the value of Y, does not depend upon θ, as we now demonstrate. For positive integers $x_1, x_2, \ldots, x_n$ with $y = \sum_{k=1}^{n} x_k$,

$$\Pr(X_k = x_k \ \text{for}\ k = 1, 2, \ldots, n \mid Y = y) = \frac{\prod_{k=1}^{n} \Pr(X_k = x_k)}{\Pr(Y = y)} = \frac{(1-\theta)^{\sum_{k=1}^{n} x_k - n}\,\theta^{n}}{\binom{y-1}{n-1}(1-\theta)^{y-n}\,\theta^{n}} = \frac{1}{\binom{y-1}{n-1}}. \tag{1}$$

When $y \neq \sum_{k=1}^{n} x_k$, the probability is zero.

The denominator of (1) is the number of ordered n-tuples of positive integers that sum to y. We conclude that, given Y = y, the conditional distribution of $(X_1, X_2, \ldots, X_n)$ is uniform on the set of all such n-tuples. This distribution does not depend upon θ. This means that, as long as the experimenters retain the value of Y, they may discard the data $(x_1, x_2, \ldots, x_n)$ without losing any information about θ. The statistic Y is therefore sufficient, or good enough, because it contains all the knowledge about the value of θ that is available in the data. In other words, Y drains the data of all the useful information about the value of θ.
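
To make the counting argument concrete, the following sketch (plain Python; the function compositions and the values of n and y are ours, chosen only for illustration) enumerates every ordered n-tuple of positive integers summing to y and confirms that there are $\binom{y-1}{n-1}$ of them, so each tuple receives conditional probability $1/\binom{y-1}{n-1}$.

    import math
    from itertools import combinations

    def compositions(y, n):
        """All ordered n-tuples of positive integers summing to y.

        Each tuple corresponds to choosing n - 1 cut points among the
        y - 1 gaps between y unit items, hence C(y-1, n-1) tuples."""
        result = []
        for cuts in combinations(range(1, y), n - 1):
            parts, prev = [], 0
            for c in list(cuts) + [y]:
                parts.append(c - prev)
                prev = c
            result.append(tuple(parts))
        return result

    n, y = 3, 7
    tuples = compositions(y, n)
    print(len(tuples), math.comb(y - 1, n - 1))   # both print 15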

The estimator

$$v(Y) = E(u(X_1) \mid Y)$$

is unbiased for θ, because

$$E(v(Y)) = E_Y\!\big(E_{X_1}(u(X_1) \mid Y)\big) = E(u(X_1)) = \theta.$$

Thus, $u(X_1)$ and $v(Y)$ are both unbiased estimators of θ which can be computed from the data $(x_1, x_2, \ldots, x_n)$.

According to the conditional variance formula

$$\operatorname{Var}(u(X_1)) = E\big(\operatorname{Var}(u(X_1) \mid Y)\big) + \operatorname{Var}\big(E(u(X_1) \mid Y)\big) \tag{2}$$

[1] [3] [4], so

$$\operatorname{Var}(v(Y)) = \operatorname{Var}\big(E(u(X_1) \mid Y)\big) \leq \operatorname{Var}(u(X_1)). \tag{3}$$

This reasoning says that the MVUE of θ must be a function of Y: given any unbiased estimator of θ that is not a function of Y, the conditional variance formula allows us to condition it on Y and obtain another unbiased estimator of θ that is a function of Y and has a smaller variance. This is the content of the Rao-Blackwell theorem [3] [4].

There is only one function of Y that is an unbiased estimator of θ, which is demonstrated as follows. Suppose that there are two such estimators, $v_1(Y)$ and $v_2(Y)$. Then

$$E(v_1(Y)) = \theta = E(v_2(Y))$$

and

$$0 = E(v_1(Y) - v_2(Y)) = \theta^{n} \sum_{y=n}^{\infty} \big(v_1(y) - v_2(y)\big) \binom{y-1}{n-1} (1-\theta)^{y-n}.$$

Because the series of powers of 1 − θ is identically zero for 0 < θ < 1, all of its coefficients must be zero. Therefore, $v_1(y) = v_2(y)$ for $y = n, n+1, \ldots$, and the uniqueness is established. Thus, $v(Y) = E(u(X_1) \mid Y)$ must be the unique MVUE of θ.

To compute the estimator, proceed as follows:

$$E(u(X_1) \mid Y = y) = \sum_{x_1 = 1}^{\infty} u(x_1) \Pr(X_1 = x_1 \mid Y = y) = \Pr(X_1 = 1 \mid Y = y) = \frac{\Pr(X_1 = 1 \ \text{and}\ Y = y)}{\Pr(Y = y)} = \frac{\Pr(X_1 = 1)\,\Pr\!\left(\sum_{k=2}^{n} X_k = y - 1\right)}{\Pr(Y = y)},$$

where the independence of $X_1, X_2, \ldots, X_n$ yields the last equality. Further,

$$E(u(X_1) \mid Y = y) = \frac{\theta\,\binom{y-2}{n-2}(1-\theta)^{(y-1)-(n-1)}\,\theta^{n-1}}{\binom{y-1}{n-1}(1-\theta)^{y-n}\,\theta^{n}} = \frac{n-1}{y-1}.$$

Hence, the MVUE of θ is

$$\hat{\theta} = \frac{n-1}{Y-1}.$$
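
These results can be checked by simulation. The sketch below (assuming Python with NumPy; θ, n, and the number of replications are arbitrary) estimates the mean and variance of both u(X_1) and the estimator (n − 1)/(Y − 1): both means are near θ, and the latter has the much smaller variance, as the Rao-Blackwell argument predicts.

    import numpy as np

    rng = np.random.default_rng(0)
    theta, n, reps = 0.3, 5, 200_000

    # Each row is one replication of the "do until" experiment.
    x = rng.geometric(theta, size=(reps, n))
    y = x.sum(axis=1)

    u = (x[:, 0] == 1).astype(float)   # u(X_1): 1 if the first toss is heads
    theta_hat = (n - 1) / (y - 1)      # the MVUE (n - 1)/(Y - 1)

    print("u(X_1):    mean %.4f  var %.4f" % (u.mean(), u.var()))
    print("theta_hat: mean %.4f  var %.4f" % (theta_hat.mean(), theta_hat.var()))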

3. Geometric Interpretation

There is a geometric interpretation of the conditional variance formula (2) that can provide a deeper understanding of these results. Representing variables and statistics as elements of geometric spaces has a long tradition; Herr [5] gives a history of these representations. See [6] [7] [8] for other examples. The concept of projection, which is the essence of the present geometric approach, is widely used elsewhere, for example in machine learning. This approach places the computation of MVUEs in the framework of the universally important concept of an inner product space and thus provides deeper mathematical insight into the process.

The set of random variables on a fixed probability space which have finite second moment is a Hilbert space $L^2$, where its inner product $\langle \cdot , \cdot \rangle$ and metric $d$ are defined for points $U, V \in L^2$ as

$$\langle U, V \rangle = E(UV)$$

and

$$d(U, V) = \sqrt{E\big((U - V)^2\big)}$$

[9]. The conditional variance formula (2) is the “Pythagorean Theorem” in $L^2$ as

$$d^2\big(u(X_1), E(u(X_1))\big) = d^2\big(u(X_1), E(u(X_1) \mid Y)\big) + d^2\big(E(u(X_1) \mid Y), E(u(X_1))\big), \tag{4}$$

because

$$d^2\big(u(X_1), E(u(X_1) \mid Y)\big) = E\Big(\big(u(X_1) - E(u(X_1) \mid Y)\big)^2\Big) = E_Y\Big(E_{X_1}\big(\big(u(X_1) - E(u(X_1) \mid Y)\big)^2 \,\big|\, Y\big)\Big) = E\big(\operatorname{Var}(u(X_1) \mid Y)\big)$$

and

$$d^2\big(E(u(X_1) \mid Y), E(u(X_1))\big) = E\Big(\big(E(u(X_1) \mid Y) - E(u(X_1))\big)^2\Big) = \operatorname{Var}\big(E(u(X_1) \mid Y)\big),$$

by employing $E_Y(E_{X_1}(u(X_1) \mid Y)) = E(u(X_1))$. Equation (4) makes clear that $E(u(X_1) \mid Y)$ is the orthogonal projection of $u(X_1)$ onto the subspace of random variables that are functions of Y, as the representational illustration in Figure 1 shows.

Figure 1. $E(u(X_1) \mid Y)$ as the result of an orthogonal projection in the Hilbert space $L^2$.

If the unbiased estimator $u(X_1)$ is not a function of Y, then $d(u(X_1), E(u(X_1) \mid Y)) > 0$, and from (4)

$$d\big(E(u(X_1) \mid Y), E(u(X_1))\big) < d\big(u(X_1), E(u(X_1))\big).$$

That is,

$$\operatorname{Var}\big(E(u(X_1) \mid Y)\big) < \operatorname{Var}(u(X_1)),$$

as in (3).

Because the hypotenuse of a right triangle is its longest side, Figure 1 makes it clear that the variance of v(Y) is strictly less than the variance of u(X1).
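
The decomposition (4) can also be verified numerically. In the sketch below (assuming Python with NumPy; parameter values are arbitrary), the three squared distances are approximated by Monte Carlo, with v(Y) = (n − 1)/(Y − 1) playing the role of E(u(X_1) | Y); up to simulation error, the squared hypotenuse equals the sum of the squared legs.

    import numpy as np

    rng = np.random.default_rng(0)
    theta, n, reps = 0.3, 5, 500_000

    x = rng.geometric(theta, size=(reps, n))
    y = x.sum(axis=1)

    u = (x[:, 0] == 1).astype(float)   # u(X_1)
    v = (n - 1) / (y - 1)              # E(u(X_1) | Y)

    # Squared distances appearing in the "Pythagorean Theorem" (4);
    # E(u(X_1)) = theta is known exactly in the simulation.
    hyp  = np.mean((u - theta) ** 2)   # d^2(u(X_1), E(u(X_1)))
    leg1 = np.mean((u - v) ** 2)       # d^2(u(X_1), E(u(X_1)|Y))
    leg2 = np.mean((v - theta) ** 2)   # d^2(E(u(X_1)|Y), E(u(X_1)))

    print(hyp, leg1 + leg2)   # approximately equal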

4. Conclusions

Using a prototypical example, we have shown how the powerful technique of conditioning can be used to find the minimum variance unbiased estimator (MVUE) of a parameter. The data collection technique and the observations’ probability distribution determine the MVUE. Oftentimes, these MVUEs are not simply standard statistics, like means and variances, as the example illustrates. The geometry of the conditional variance formula shows how the minimum variance estimator is obtained as a projection.

Future endeavors might involve further development of the geometrical representation of this technique. Also, the example reveals how a feature of a physical population can be viewed as a parameter in different probability distributions, depending upon the sampling methods. In the example, θ appears as the parameter of a Bernoulli distribution and of a geometric distribution. Topics for potential study include such choices, the ease of the sampling methods, the required sample sizes or efficiency, and the formulations of the MVUEs.

Acknowledgements

The authors want to thank an anonymous referee for many insightful comments.

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] Hogg, R.V., McKean, J.W. and Craig, A.T. (2018) Introduction to Mathematical Statistics. 8th Edition, Pearson, Boston.
[2] Devore, J.L. (2016) Probability and Statistics for Engineering and the Sciences. 9th Edition, Cengage, Boston.
[3] Peña, E.A. and Rohatgi, V. (1994) Some Comments about Sufficiency and Unbiased Estimation. American Statistician, 48, 242-243.
https://doi.org/10.1080/00031305.1994.10476067
[4] Ross, S.M. (2019) Introduction to Probability Models. 12th Edition, Academic Press, London.
https://doi.org/10.1016/B978-0-12-814346-9.00006-8
[5] Herr, D.G. (1980) On the History of the Use of Geometry in the General Linear Model. American Statistician, 34, 43-47.
https://doi.org/10.1080/00031305.1980.10482710
[6] Farnsworth, D.L. (2000) The Geometry of Statistics. College Mathematics Journal, 31, 200-204.
https://doi.org/10.1080/07468342.2000.11974143
[7] Wood, G.R. and Saville, D.J. (2002) A New Angle on the t-Test. Journal of the Royal Statistical Society, Series D (The Statistician), 51, 99-104.
https://doi.org/10.1111/1467-9884.00301
[8] Saville, D.J. and Wood, G.R. (2011) Statistical Methods: A Geometric Primer. CreateSpace, Scotts Valley.
[9] Rudin, W. (1986) Real and Complex Analysis. 3rd Edition, McGraw-Hill, New York.
