High-Accuracy Confidence Regions for Distribution Parameters

Abstract

With the help of today’s computers, it is relatively easy to find maximum-likelihood estimators of one or more parameters of any specific statistical distribution, and to use these to construct the corresponding approximate confidence interval/region, facilitated by the well-known asymptotic properties of the likelihood function. The purpose of this article is to make this approximation substantially more accurate by extending the Taylor expansion of the corresponding probability density function to include quadratic and cubic terms in several centralized sample means, and thus finding the corresponding $\frac{1}{n}$-proportional correction to the original algorithm. We then demonstrate the new procedure’s usage, both for constructing confidence regions and for testing hypotheses, emphasizing that incorporating this correction carries minimal computational and programming cost. In the final section, we present two examples to indicate how significantly the new approximation improves the procedure’s accuracy.

Share and Cite:

Vrbik, J. (2022) High-Accuracy Confidence Regions for Distribution Parameters. Applied Mathematics, 13, 488-501. doi: 10.4236/am.2022.136031.

1. Introduction

Using the likelihood function to find an approximate confidence region for several parameters (often only one, in which case we speak of a confidence interval) of a specific distribution goes back to the classic publication of Kendall and Stuart [1]. To understand the rest of our article, it is necessary to briefly review their main result, summarized by

Theorem 1. Assuming a case of regular estimation (meaning the distribution’s support is not a function of any of the distribution’s parameters), the following random variable

$$2\ln L(X;\hat\theta) - 2\ln L(X;\theta_0) \tag{1}$$

has approximately the chi-square distribution with K degrees of freedom, where K is the number of parameters to be estimated, $X$ is the set of n observations (allowing for the possibility of a multivariate distribution), $L(X;\theta)$ denotes the corresponding likelihood function, $\hat\theta$ is the vector of the resulting ML estimates, and $\theta_0$ represents the true (but unknown) values of the parameters.

Proof. Realizing that

$$\ln L(X;\theta) := \sum_{i=1}^{n}\ln f(x_i;\theta) \tag{2}$$

where x i are the individual observations, then differentiating the RHS with respect to θ and setting each component of the answer to 0 yields

$$\sum_{i=1}^{n}\frac{\partial\ln f(x_i;\theta)}{\partial\theta} = \mathbf{0} \tag{3}$$

where $\frac{\partial}{\partial\theta}$ implies differentiating $\ln f(x_i;\theta)$ with respect to each parameter; the result is thus a vector with K components.

To solve the corresponding set of equations, we expand the LHS of (3) at $\theta=\theta_0$, thus getting

$$\sum_{i=1}^{n}\frac{\partial\ln f(x_i;\theta)}{\partial\theta}\bigg|_{\theta=\theta_0} + \sum_{i=1}^{n}\frac{\partial^2\ln f(x_i;\theta)}{\partial\theta^2}\bigg|_{\theta=\theta_0}(\theta-\theta_0) + \cdots = \mathbf{0} \tag{4}$$

where the second derivative of $\ln f(x_i;\theta)$ is a symmetric K by K matrix. Dividing each side of the last equation by n, we solve it for $\theta-\theta_0$, getting, to first-order accuracy,

$$\hat\theta - \theta_0 \simeq -\mu_2^{-1}\,Y \tag{5}$$

where

$$Y := \frac{1}{n}\sum_{i=1}^{n}\frac{\partial\ln f(x_i;\theta)}{\partial\theta}\bigg|_{\theta=\theta_0} \tag{6}$$

$$\mu_2 := \mathbb{E}\left(\frac{\partial^2\ln f(X;\theta)}{\partial\theta^2}\bigg|_{\theta=\theta_0}\right) \tag{7}$$

Note that the expected value of Y is $\mathbf{0}$, since

$$\int_{\text{all }x}\frac{\partial\ln f(x;\theta)}{\partial\theta}\,f(x;\theta)\,\mathrm{d}x = \int_{\text{all }x}\frac{\partial f(x;\theta)}{\partial\theta}\,\mathrm{d}x = \frac{\partial}{\partial\theta}(1) = \mathbf{0} \tag{8}$$

and its variance-covariance matrix equals $-\mu_2/n$, due to

$$\frac{\mathrm{d}}{\mathrm{d}\theta}\int_{\text{all }x}\frac{\partial\ln f(x;\theta)}{\partial\theta}\,f(x;\theta)\,\mathrm{d}x = \mu_2 + \mu_{11} = \mathbb{O} \tag{9}$$

where

$$\mu_{11} := \mathbb{E}\left(\frac{\partial\ln f(X;\theta)}{\partial\theta}\circ\frac{\partial\ln f(X;\theta)}{\partial\theta}\bigg|_{\theta=\theta_0}\right) \tag{10}$$

is the variance-covariance matrix of $\frac{\partial\ln f(X;\theta)}{\partial\theta}\big|_{\theta=\theta_0}$ (the small circle implies the direct product of the two vectors), and $\mathbb{O}$ is the zero matrix.

Similarly expanding (1) and utilizing (5) we get, to the same level of accuracy

$$2nY^{T}(\hat\theta-\theta_0) + n\,(\hat\theta-\theta_0)^{T}\mu_2\,(\hat\theta-\theta_0) = -n\,Y^{T}\mu_2^{-1}Y \tag{11}$$

resulting in the following simple approximation to (1)

$$-n\,Y^{T}\mu_2^{-1}Y \tag{12}$$

where $Y\sqrt{n}$ has (due to the Central Limit Theorem) a K-variate Normal distribution with the mean of $\mathbf{0}$ and the variance-covariance matrix of $-\mu_2$. Introducing $Z=Y\sqrt{n}$, the moment generating function of (12) is then computed by

$$\int\frac{\exp\left(-\dfrac{z^{T}(-\mu_2)^{-1}z}{2} + t\,z^{T}(-\mu_2)^{-1}z\right)}{(2\pi)^{K/2}\sqrt{\det(-\mu_2)}}\,\mathrm{d}z = \left[\det(-\mu_2)\,\det\left((1-2t)(-\mu_2)^{-1}\right)\right]^{-1/2} = (1-2t)^{-K/2} \tag{13}$$

easily identified as the MGF of the $\chi_K^2$ distribution. ∎

The result then leads to a simple (we call it basic) algorithm of constructing the corresponding confidence region based on a random independent sample of n observations.

Examples

From the last theorem it follows that an approximate confidence region is found by following these steps:

● Using sample values drawn from the distribution, maximize its likelihood function, thus getting the ML estimates of all parameters;

● Choose a confidence level (denoted τ ) and make (1) (where θ ^ represents the ML estimates, but θ 0 are now variables to be solved for) equal to the corresponding critical value of the χ K 2 distribution;

● Solve for θ 0 , resulting in two limits of the corresponding confidence interval when K = 1 , a closed curve when K = 2 , a closed surface when K = 3 , etc.; this means that, beyond a one-parameter case, the resulting region can be displayed only graphically (a trivial task for today’s computers).

The following Mathematica program shows how it is done, first using a Geometric distribution with a (computer generated) sample of 40 and confidence level of 90%, then a Gamma distribution with a sample of 60 and confidence level of 95%.
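In outline, the basic algorithm for the Geometric case can be sketched as follows (a Python illustration rather than the Mathematica original; the sample is simulated afresh here, so the resulting interval will differ from the one in the text):

```python
import numpy as np
from scipy.stats import geom, chi2
from scipy.optimize import brentq

# simulated sample of 40 observations from a Geometric distribution
# (number of failures before the first success; p = 0.3 assumed here)
x = geom.rvs(0.3, size=40, random_state=1) - 1  # scipy's support starts at 1

def loglik(p):
    # log-likelihood of f(x; p) = p (1 - p)^x
    return np.sum(np.log(p) + x * np.log1p(-p))

p_hat = 1.0 / (1.0 + x.mean())       # ML estimate of p
crit = chi2.ppf(0.90, df=1)          # 90% critical value, K = 1

# solve 2 ln L(p_hat) - 2 ln L(p) = crit on either side of p_hat
g = lambda p: 2 * (loglik(p_hat) - loglik(p)) - crit
lower = brentq(g, 1e-9, p_hat)
upper = brentq(g, p_hat, 1 - 1e-9)
print(lower, upper)
```

For the two-parameter Gamma example the same pattern applies with K = 2, except that the two roots are replaced by a closed contour (drawn by ContourPlot in the Mathematica original).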

The last line of the program has produced the following 95% confidence region (see Figure 1) for α (horizontal scale) and β (vertical scale).

A three-parameter estimation can be done in a similar manner, only the call to “ContourPlot” needs to be replaced by “ContourPlot3D”. Visualizing a confidence region for more than three parameters would require slicing it into a sequence of cross-sections.

Figure 1. Confidence region for α and β.

The error of this procedure depends on the sample size, decreasing with n in the $O\!\left(\frac{1}{n}\right)$ manner; it can reach several percent for small samples. The purpose of the next section is to show how to substantially reduce this error, making it rather insignificant in all practical situations.

2. High-Accuracy Extension

There have been many attempts at improving the accuracy of this approximation, starting with Bartlett [2] , followed by [3] and many others, all concentrating on higher moments of the approximately Normal distribution of ML estimators. Our approach is different: we aim at finding an appropriate correction to the χ 2 distribution of the Likelihood function. The technique we use is not unlike that of [4] , in spite of diverging objectives.

It has been shown in previous publications [5] and [6] that the distribution of (1) is more accurately described by the following probability density function

$$\chi_K^2(u)\left(1+\frac{A}{n}\left(\frac{u}{K}-1\right)\right) \tag{14}$$

where χ K 2 ( u ) is the chi-square PDF of the original theorem, A is a quantity (often a simple constant) specific to the sampled distribution, and K is the number of parameters to be estimated. In terms of the corresponding CDF, this becomes

$$1-\frac{\Gamma\left(\frac{K}{2},\frac{u}{2}\right)+\frac{2A}{nK}\left(\frac{u}{2}\right)^{K/2}\exp\left(-\frac{u}{2}\right)}{\Gamma\left(\frac{K}{2}\right)} \tag{15}$$

(the two-argument Γ denotes the incomplete Gamma function) or, more explicitly

$$\operatorname{erf}\left(\sqrt{\frac{u}{2}}\right)-\frac{A}{n}\sqrt{\frac{2u}{\pi}}\exp\left(-\frac{u}{2}\right)$$

when K = 1 , and

$$1-\left(1+\frac{Au}{2n}\right)\exp\left(-\frac{u}{2}\right) \tag{16}$$

when K = 2 (the two most important cases).

Note that the expected value of distribution (14) is $K+\frac{2A}{n}$; similarly expanding (1) then yields the formula for computing A, given the PDF of the sampled distribution.
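As a quick numerical sanity check of these formulas (a Python sketch; the values of A and n below are arbitrary illustrations, not taken from any particular distribution), integrating the density (14) for K = 2 reproduces the closed-form CDF (16), and the mean comes out as $K+\frac{2A}{n}$:

```python
import numpy as np
from scipy.stats import chi2
from scipy.integrate import quad

A, n, K = 0.5, 40, 2   # illustrative values only

def pdf_corrected(u):
    # the corrected density (14)
    return chi2.pdf(u, K) * (1 + (A / n) * (u / K - 1))

def cdf_closed(u):
    # the closed-form CDF (16), valid for K = 2
    return 1 - (1 + A * u / (2 * n)) * np.exp(-u / 2)

u0 = 5.0
numeric, _ = quad(pdf_corrected, 0, u0)
mean, _ = quad(lambda u: u * pdf_corrected(u), 0, np.inf)
```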

Leaving out less important details, we now indicate the individual steps of such a derivation.

● Extend the LHS of (4) by two more terms of the corresponding expansion, namely by

$$+\ \frac{1}{2}\sum_{i=1}^{n}\frac{\partial^3\ln f(x_i;\theta)}{\partial\theta^3}\bigg|_{\theta=\theta_0}(\theta-\theta_0)^2 + \frac{1}{6}\sum_{i=1}^{n}\frac{\partial^4\ln f(x_i;\theta)}{\partial\theta^4}\bigg|_{\theta=\theta_0}(\theta-\theta_0)^3 + \cdots \tag{17}$$

Note that the third and fourth derivatives of ln f ( x i ; θ ) constitute fully symmetric tensors of ranks 3 and 4 respectively; each is further multiplied by the corresponding power (one less than the derivative’s order) of θ θ 0 . The multiplication is carried out by taking the dot product along one of the tensor’s indices with vector θ θ 0 , and repeating as many times as the power of θ θ 0 indicates; the result is thus always a vector with one remaining index.

● Solve the resulting equation (iteratively) for $\hat\theta-\theta_0$, to the same order of accuracy. This extends our existing solution (5), which we denote $\Theta_1$, by the following quadratic term

$$\Theta_2 = -\mu_2^{-1}\left(Y_2-\mu_2+\frac{1}{2}\mu_3\Theta_1\right)\Theta_1 \tag{18}$$

and a cubic term, given by

$$\Theta_3 = -\mu_2^{-1}\left(\frac{1}{2}\left(Y_3-\mu_3\right)\Theta_1^2 + \mu_3\Theta_2\Theta_1 + \frac{1}{6}\mu_4\Theta_1^3 + \left(Y_2-\mu_2\right)\Theta_2\right) \tag{19}$$

where

$$Y_l := \frac{1}{n}\sum_{i=1}^{n}\frac{\partial^{\,l}\ln f(x_i;\theta)}{\partial\theta^{\,l}}\bigg|_{\theta=\theta_0} \tag{20}$$

$$\mu_l := \mathbb{E}\left(\frac{\partial^{\,l}\ln f(X;\theta)}{\partial\theta^{\,l}}\bigg|_{\theta=\theta_0}\right) \tag{21}$$

(we have already explained how to multiply a symmetric tensor by a power or a product of vectors).

● Similarly expand (1); further divided by the sample size n, this results in the following scalar expression

$$\begin{aligned} &Y\Theta_1 + Y\Theta_2 - \frac{1}{6}\mu_3\Theta_1^3 + Y\Theta_3 - \frac{1}{2}\mu_3\Theta_2\Theta_1^2 - \frac{1}{6}(Y_3-\mu_3)\Theta_1^3 - \frac{1}{12}\mu_4\Theta_1^4 + \cdots\\ =\ &Y\Theta_1 + (Y_2-\mu_2)\Theta_1^2 + \frac{1}{3}\mu_3\Theta_1^3 + \frac{1}{3}(Y_3-\mu_3)\Theta_1^3 - \frac{1}{4}\left(\mu_3\Theta_1^2\right)\times\left(\mu_3\Theta_1^2\right) + \frac{1}{12}\mu_4\Theta_1^4\\ &- \left(\mu_3\Theta_1^2+(Y_2-\mu_2)\Theta_1\right)\times\left((Y_2-\mu_2)\Theta_1\right) + \cdots \end{aligned} \tag{22}$$

where the × symbol between two vectors (not used beyond this formula) implies an operation we call a contraction of the two vectors, carried out by

$$V_1\times V_2 := V_1^{T}\mu_2^{-1}V_2 \tag{23}$$

This is also how we perform a contraction of two specific indices of a tensor (required shortly); e.g. contracting the first two indices of $\mu_3$ would be done (and denoted) by

$$\left(\dot\mu_3\right)_m := \sum_{i,j=1}^{K}\left(\mu_3\right)_{i,j,m}\left(\mu_2^{-1}\right)_{i,j} \tag{24}$$

resulting in a vector.

● Finally, take the expected value of (22) to get $\frac{K}{n}+\frac{2A}{n^2}+\cdots$.

This yields (term by term)

$$\frac{K}{n} + \frac{\boldsymbol\mu_{211}+K}{n^2} + \frac{\boldsymbol\mu_3\boldsymbol\mu_{111}}{3n^2} + \frac{\boldsymbol\mu_{31}}{n^2} + \frac{2\,\boldsymbol\mu_3\boldsymbol\mu_3+\dot{\boldsymbol\mu}_3\dot{\boldsymbol\mu}_3}{4n^2} + \frac{\boldsymbol\mu_4}{4n^2} + \frac{2\,\boldsymbol\mu_3\boldsymbol\mu_{21}+\dot{\boldsymbol\mu}_3\boldsymbol\mu_{21}+\boldsymbol\mu_{21}\boldsymbol\mu_{21}+\boldsymbol\mu_{21}\boldsymbol\mu_{21}+\boldsymbol\mu_{22}-K}{n^2} \tag{25}$$

where

$$\mu_{21} := \mathbb{E}\left(\frac{\partial^2\ln f(X;\theta)}{\partial\theta^2}\circ\frac{\partial\ln f(X;\theta)}{\partial\theta}\bigg|_{\theta=\theta_0}\right) \tag{26}$$

with $\mu_{211}$, $\mu_{111}$, $\mu_{22}$ and $\mu_{31}$ defined analogously. Bold-face notation indicates that each term of (25) has been contracted in all of its indices, resulting in a scalar. For some of these terms, there are several possibilities to carry out such a contraction (e.g. the six indices of $\boldsymbol\mu_{21}\boldsymbol\mu_{21}$ can be contracted in altogether five non-equivalent ways, only two of which we actually need); it is thus very important to indicate (in the ambiguous cases only; note that all pair-wise contractions of $\boldsymbol\mu_4$, and also of $\boldsymbol\mu_{31}$, yield the same answer) the proper way of doing it.

A dot implies contraction of two of the indicated indices ($\dot{\boldsymbol\mu}_3$ thus making the result into a vector, while $\boldsymbol\mu_{211}$ implies, by the absence of a dot over the 2, that the first two indices have been contracted with the third and fourth index, rather than with each other); similarly, two circles indicate that the corresponding two indices are to be contracted: the first variety of $\boldsymbol\mu_{21}\boldsymbol\mu_{21}$ thus makes the first index of $\mu_{21}$ contract with its last index, returning a vector yet to be contracted with the first index of the other $\mu_{21}$ (due to the lack of a dot over its 2); the second variety indicates contracting the last index of the first $\mu_{21}$ with the first index of the second $\mu_{21}$, while the remaining two indices of the first $\mu_{21}$ are contracted with the remaining two indices of the second $\mu_{21}$.

Since $\boldsymbol\mu_3\boldsymbol\mu_{111} = -\boldsymbol\mu_3\boldsymbol\mu_3 - 3\,\boldsymbol\mu_3\boldsymbol\mu_{21}$, which can be verified in the manner of (9), the final formula for 2A then reads

$$\boldsymbol\mu_{211} + \boldsymbol\mu_{31} + \boldsymbol\mu_{22} + \frac{1}{4}\boldsymbol\mu_4 + \frac{1}{4}\dot{\boldsymbol\mu}_3\dot{\boldsymbol\mu}_3 + \frac{1}{6}\boldsymbol\mu_3\boldsymbol\mu_3 + \dot{\boldsymbol\mu}_3\boldsymbol\mu_{21} + \boldsymbol\mu_3\boldsymbol\mu_{21} + \boldsymbol\mu_{21}\boldsymbol\mu_{21} + \boldsymbol\mu_{21}\boldsymbol\mu_{21} \tag{27}$$

When $K=1$, the formula simplifies to

$$A = \frac{\mu_{211}+\mu_{31}+\mu_{22}+\frac{1}{4}\mu_4}{2\left(\mu_2\right)^2} - \frac{\frac{5}{12}\mu_3^2+2\mu_3\mu_{21}+2\mu_{21}^2}{2\left(\mu_2\right)^3} \tag{28}$$
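As an illustration of (28) (a hand-computed example added here, not part of the original derivation): for the Exponential distribution with mean β, the required derivative moments are, up to powers of β that cancel out of A, $\mu_2=-1$, $\mu_3=4$, $\mu_4=-18$, $\mu_{21}=-2$, $\mu_{211}=-5$, $\mu_{31}=6$ and $\mu_{22}=5$; formula (28) then yields $A=\frac{1}{12}$, consistent with the error term (37) used in Section 4:

```python
from fractions import Fraction as F

# derivative moments of ln f for the Exponential distribution (beta = 1,
# since the powers of beta cancel in A); values computed by hand
mu2, mu3, mu4 = F(-1), F(4), F(-18)
mu21, mu211, mu31, mu22 = F(-2), F(-5), F(6), F(5)

# formula (28)
A = ((mu211 + mu31 + mu22 + F(1, 4) * mu4) / (2 * mu2**2)
     - (F(5, 12) * mu3**2 + 2 * mu3 * mu21 + 2 * mu21**2) / (2 * mu2**3))
print(A)  # 1/12
```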

We should mention that, in principle, the $\frac{1}{n}$-proportional correction in (14) has two more terms of increasing complexity (see [5]), namely

$$\chi_K^2(u)\left(1+\frac{A}{n}\left(\frac{u}{K}-1\right)+\frac{B}{n}\left(\frac{u^2}{K(K+2)}-\frac{2u}{K}+1\right)+\frac{C}{n}\left(\frac{u^3}{K(K+2)(K+4)}-\frac{3u^2}{K(K+2)}+\frac{3u}{K}-1\right)\right) \tag{29}$$

where the B and C coefficients can be found in a manner very similar to how we have just established the value of A; most surprisingly, both B and C then turn out to be equal to zero.

2.1. Location and Scaling Parameters

An important feature of the last two formulas is the fact that A may be a function of shape parameters (if any) of the sampled distribution, but not of a location or a scaling parameter, as these always cancel out from the resulting expression. To understand why, consider

$$f(x;\mu,\sigma) = \frac{f_0\left(\frac{x-\mu}{\sigma}\right)}{\sigma} := \frac{f_0(y)}{\sigma} \tag{30}$$

This enables us to simplify the two derivatives of $\ln f(x;\mu,\sigma)$ thus:

$$\frac{\partial\left(\ln f_0(y)-\ln\sigma\right)}{\partial\mu} = -\frac{f_0'(y)}{\sigma f_0(y)} \tag{31}$$

$$\frac{\partial\left(\ln f_0(y)-\ln\sigma\right)}{\partial\sigma} = -\frac{y\,f_0'(y)}{\sigma f_0(y)} - \frac{1}{\sigma} \tag{32}$$

from which it is clear that each further differentiation (whether with respect to μ or σ) yields a function of y, divided by an extra power of σ. Taking the expected value of any product of such derivatives amounts to further multiplying by $f_0(y)\,\mathrm{d}y$ and integrating over all possible values of y, leaving us with a number divided by a power of σ equal to the total order of the derivatives involved. These powers of σ then always cancel out of (27).

This implies that, for distributions having only location and/or scaling parameters, A will turn out to be a constant.

2.2. Examples

Values of A are summarized in Table 1 for some of the most common distributions.

For the Gamma and Beta distributions, A is a complicated function of the first (Gamma) or both (Beta) parameters; yet, under the indicated constraints (see Table 1), the value of A remains fairly constant.

For the Gamma distribution (to illustrate the complexity of A as a function of a shape parameter), the full formula reads

$$A = \frac{\left[12\psi_1+\alpha\left(16\psi_2-9\psi_1^2\right)+\alpha^2\left(2\psi_1^3-6\psi_1\psi_2+3\psi_3\right)+\alpha^3\left(5\psi_2^2-3\psi_1\psi_3\right)\right]^2}{288\left(\alpha\psi_1-1\right)^6} \tag{33}$$

where $\psi_i$ is the $(i+1)$-st derivative of the natural logarithm of the $\Gamma(\alpha)$ function, evaluated at α. Yet, by plotting the last expression as a function of α, we can confirm that, with the exception of only the smallest (say $<0.5$) values of α, A is nearly a constant, quickly approaching its $\alpha\to\infty$ limit of $\frac{121}{72}$.

Table 1. Value of A for various distributions.

We can make a similar conclusion in the case of the Beta distribution, as mentioned already. Note that, in the case of the bivariate Normal distribution, A is a constant, free not only of the two location and the two scaling parameters, but also of the shape parameter ρ.
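Formula (33) is easy to evaluate numerically; the following Python sketch (using SciPy's polygamma, where polygamma(k, a) is the k-th derivative of the digamma function, i.e. the (k+1)-st derivative of $\ln\Gamma$) illustrates the approach to the $\frac{121}{72}$ limit:

```python
from scipy.special import polygamma

def A_gamma(a):
    # formula (33); psi_i is the (i+1)-st derivative of ln Gamma, at alpha = a
    p1, p2, p3 = polygamma(1, a), polygamma(2, a), polygamma(3, a)
    num = (12 * p1 + a * (16 * p2 - 9 * p1**2)
           + a**2 * (2 * p1**3 - 6 * p1 * p2 + 3 * p3)
           + a**3 * (5 * p2**2 - 3 * p1 * p3))
    return num**2 / (288 * (a * p1 - 1)**6)

print(A_gamma(10000.0), 121 / 72)
```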

Incorporating the A-related correction requires only a trivial modification of the previous program; in the Geometric example, we need to replace the “CL = ...” line by the following two lines:

Note that, to compute A, we had to use the point estimate of p in place of its true (but unknown) value.
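In the same spirit, the corrected procedure changes only the critical value: instead of the chi-square quantile, one solves the corrected CDF (the K = 1 version involving erf, displayed before (16)) for u. A hedged Python sketch (the value of A below is a stand-in for this illustration; the actual value for the sampled distribution has to come from Table 1 or from formula (28)):

```python
import numpy as np
from scipy.stats import chi2
from scipy.optimize import brentq
from scipy.special import erf

def corrected_critical(tau, A, n):
    # solve  erf(sqrt(u/2)) - (A/n) sqrt(2u/pi) exp(-u/2) = tau  for u (K = 1)
    F = lambda u: (erf(np.sqrt(u / 2))
                   - (A / n) * np.sqrt(2 * u / np.pi) * np.exp(-u / 2) - tau)
    return brentq(F, 1e-9, 60.0)

A, n = 0.5, 40           # placeholder value of A, for illustration only
u_basic = chi2.ppf(0.90, df=1)
u_corr = corrected_critical(0.90, A, n)
print(u_basic, u_corr)   # for positive A the corrected value is slightly larger
```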

In the Gamma example (since the α estimate was bigger than 0.5), we can use the large- α limit of A; the “CI = ...” line now has to be replaced by

3. Hypotheses Testing

The same formulas can be utilized for high-accuracy testing of a null hypothesis claiming that the parameters of a distribution have specific values. When, for example, the distribution has two parameters, we first need to find their ML estimates (based on a random independent sample of size n); using these, we then evaluate (1), where $\theta_0$ now stands for the parameter values specified by the null hypothesis, and substitute the result for u in (16); this yields the confidence level at which the null hypothesis can be rejected, and subtracting it from 1 then returns the corresponding P-value.

The following Mathematica program illustrates how it is done, using the Cauchy distribution and testing whether its two parameters (m and s, say) have true values equal to 0 and 1, respectively (our random sample of 40 independent observations has been generated using $m=0.3$ and $s=0.8$).
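In outline (a Python sketch of the same test, with a freshly simulated sample, so the numbers will not match the Mathematica run; the value of A is a placeholder, since the Cauchy entry of Table 1 is what belongs there):

```python
import numpy as np
from scipy.stats import cauchy
from scipy.optimize import minimize

x = cauchy.rvs(loc=0.3, scale=0.8, size=40, random_state=7)
n = len(x)

def negll(t):
    m, ls = t  # location, and the log of the scale (keeps the scale positive)
    return -np.sum(cauchy.logpdf(x, loc=m, scale=np.exp(ls)))

fit = minimize(negll, x0=[np.median(x), 0.0], method="Nelder-Mead")

# statistic (1) under H0: m = 0, s = 1 (clamped to guard against a failed fit)
u = max(0.0, 2 * (negll([0.0, 0.0]) - fit.fun))
A = 0.5                                           # placeholder for Table 1's value
p_value = (1 + A * u / (2 * n)) * np.exp(-u / 2)  # 1 minus the corrected CDF (16)
print(u, p_value)
```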

Based on the resulting P-value, the null hypothesis would be rejected at the 1% level of significance.

In our second example, we test whether the parameters of a bivariate Normal distribution have the following values: $\mu_1=\mu_2=0$, $\sigma_1=\sigma_2=1$ and $\rho=0.5$ (these are also the values we used to generate the sample of 100 bivariate observations).

The resulting P-value would lead to accepting the null hypothesis at any practical level of significance.

4. Accuracy Improvement

To investigate the accuracy of the new approximation, we need to be able to compute the level of confidence of the resulting region exactly. This can be done only in a few situations, such as the Exponential and Normal distributions. Exploring these two cases will give us a good indication of the improvement achieved by incorporating the A correction in general. The details of computing the exact confidence level may be of only passing interest to most readers; it is the resulting error tables which are important.

4.1. Exponential Case

We know that the ML estimator of the β parameter is given by the sample mean $\bar X$, whose exact distribution is Gamma($n$, $\beta/n$). Since

$$\ln L(X;\beta) = -\frac{\sum_{i=1}^{n}X_i}{\beta} - n\ln\beta \tag{34}$$

then

$$2\ln L(X;\bar X) - 2\ln L(X;\beta) = 2n\left(\frac{\bar X}{\beta} - 1 - \ln\frac{\bar X}{\beta}\right) \tag{35}$$

(note that this expression is always non-negative, reaching a minimum value of zero when $\bar X=\beta$). The limits of the confidence interval of the basic technique are found by making (35) equal to a critical value of the $\chi_1^2$ distribution (let us call it $C_\tau$, where τ is the corresponding level of confidence) and solving for $\bar X/\beta$ (and then, trivially, for β).

One can show that the two real solutions are given by

$$-W_0\left(-\exp\left(-1-\frac{C_\tau}{2n}\right)\right)\quad\text{and}\quad -W_{-1}\left(-\exp\left(-1-\frac{C_\tau}{2n}\right)\right) \tag{36}$$

where W is the Lambert (also called Omega, or “ProductLog” by Mathematica) function and the subscripts refer to its two real branches. To find the true confidence level, we need to integrate the PDF of X ¯ / β between these two limits (our code utilizes the corresponding CDF instead); subtracting the result from τ establishes the error of the basic algorithm. Further subtracting

$$\frac{\sqrt{C_\tau}\,\exp\left(-\frac{C_\tau}{2}\right)}{6n\sqrt{2\pi}} \tag{37}$$

then converts it into the error of our new formula (15). The following Mathematica program does that in four simple lines.
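The same computation can be sketched in Python (assuming SciPy, whose lambertw provides the two real branches of the Lambert function via its second argument):

```python
import numpy as np
from scipy.special import lambertw
from scipy.stats import chi2, gamma

tau, n = 0.95, 20
C = chi2.ppf(tau, df=1)

# the two real solutions (36) for Xbar/beta
arg = -np.exp(-1 - C / (2 * n))
lo = -lambertw(arg, 0).real
hi = -lambertw(arg, -1).real

# Xbar/beta has the Gamma(n, 1/n) distribution
exact = gamma.cdf(hi, n, scale=1 / n) - gamma.cdf(lo, n, scale=1 / n)
err_basic = tau - exact
err_new = err_basic - np.sqrt(C) * np.exp(-C / 2) / (6 * n * np.sqrt(2 * np.pi))
print(err_basic, err_new)
```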

Both errors (using various sample sizes) are displayed in Table 2, using $\tau=95\%$.

The table demonstrates that, as n increases, the error of the basic approximation decreases with the first power of $\frac{1}{n}$, while that of our new algorithm decreases with the second power.

4.2. Normal Case

Similarly, it is well known that the ML estimators of μ and $\sigma^2$ of the Normal distribution are $\bar X$ and $s^2 := \overline{X^2}-\bar X^{\,2}$ respectively, that they are independent of each other, and that their respective distributions are $N\!\left(\mu,\frac{\sigma}{\sqrt n}\right)$ and Gamma$\left(\frac{n-1}{2},\frac{2\sigma^2}{n}\right)$. Since

$$\ln L(X;\mu,\sigma^2) = -\frac{\sum_{i=1}^{n}\left(X_i-\mu\right)^2}{2\sigma^2} - \frac{n}{2}\ln\sigma^2 - \frac{n}{2}\ln 2\pi \tag{38}$$

(1) then becomes

$$2\ln L(X;\bar X,s^2) - 2\ln L(X;\mu,\sigma^2) = n\left(\frac{s^2}{\sigma^2} - 1 - \ln\frac{s^2}{\sigma^2} + \frac{\left(\bar X-\mu\right)^2}{\sigma^2}\right) \tag{39}$$

To get the exact confidence level of the basic algorithm, we need to make this expression less than the critical value (denoted $C_\tau$, as in the previous example) of the $\chi_2^2$ distribution, and integrate the joint PDF of $s^2/\sigma^2$ and $\sqrt n\left(\bar X-\mu\right)/\sigma$ over the resulting region. Solving the equation of its boundary for $s^2/\sigma^2$ in the manner of (35) now yields

$$-W_0\left(-\exp\left(-1-\frac{C_\tau-Z^2}{n}\right)\right)\quad\text{and}\quad -W_{-1}\left(-\exp\left(-1-\frac{C_\tau-Z^2}{n}\right)\right) \tag{40}$$

where Z stands for $\sqrt n\left(\bar X-\mu\right)/\sigma$, having the standard Normal distribution.

Table 2. Approximation errors (Exponential case).

We then integrate the PDF of $s^2/\sigma^2$ (recalling that $n\,s^2/\sigma^2$ has the $\chi_{n-1}^2$ distribution) between these two limits (again, the program utilizes the corresponding CDF instead); finally, we multiply the answer by the PDF of Z and integrate over all marginal values of Z (i.e. from 0 to $\sqrt{C_\tau}$, doubling the result by symmetry); the last step is done numerically. This yields the true confidence level of the resulting confidence region; subtracting it from τ then provides the corresponding error of the basic algorithm; further subtracting

$$\frac{11\,C_\tau\exp\left(-\frac{C_\tau}{2}\right)}{24n} \tag{41}$$

converts that into the error of the new algorithm. This is how it is done using Mathematica:
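A Python rendering of the same computation (a sketch assuming SciPy; the factor of 2 accounts for the two signs of Z, and the final subtraction applies the K = 2 correction $\frac{A\,C_\tau}{2n}e^{-C_\tau/2}$ with the Normal-case constant $A=\frac{11}{12}$, an assumption consistent with (41)):

```python
import numpy as np
from scipy.stats import chi2, norm
from scipy.special import lambertw
from scipy.integrate import quad

tau, n = 0.95, 20
C = chi2.ppf(tau, df=2)

def inner(z):
    # limits (40) for s^2/sigma^2 at a given value of Z
    arg = -np.exp(-1 - (C - z * z) / n)
    lo = -lambertw(arg, 0).real
    hi = -lambertw(arg, -1).real
    # n s^2 / sigma^2 has the chi-square distribution with n - 1 df
    return (chi2.cdf(n * hi, n - 1) - chi2.cdf(n * lo, n - 1)) * norm.pdf(z)

exact = 2 * quad(inner, 0, np.sqrt(C))[0]
err_basic = tau - exact
err_new = err_basic - 11 * C * np.exp(-C / 2) / (24 * n)
print(err_basic, err_new)
```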

Table 3 summarizes the results.

Table 3. Approximation errors (Normal case).

Similarly to the previous example, one can clearly discern the $\frac{1}{n}$-proportional decrease of the basic error, and the more rapid $\frac{1}{n^2}$-proportional decrease achieved by our correction.

5. Conclusion

In this article, we have modified the traditional way (based on the corresponding likelihood function and a random independent sample of size n) of constructing a confidence region for distribution parameters, aiming to reduce its error. This has been achieved by adding, to the usual $\chi^2$ distribution of the original technique, a simple $\frac{1}{n}$-proportional correction whose coefficient is computed, for each specific distribution, using a rather complex formula; we have done the latter for many common distributions, summarizing the results in Table 1. We have also provided several examples of Mathematica code, demonstrating the new algorithm and its simplicity (each computer program takes only a few lines); a practically identical approach can then be applied to the task of hypothesis testing. Finally, we have numerically investigated the improvement in accuracy thereby achieved, finding it rather substantial. It is worth emphasizing that, once the value of A has been established, incorporating the corresponding correction is extremely easy and carries a negligible computational cost.

Future research will hopefully explain why the B and C coefficients of (29) are both equal to zero; so far, we have been able to prove this only by a brute-force computation, but we believe that there is some more fundamental reason behind this result, which we have not yet been able to uncover.

Conflicts of Interest

The author declares no conflicts of interest regarding the publication of this paper.

References

[1] Kendall, M.G. and Stuart, A. (1969) The Advanced Theory of Statistics, Vol. 1, Chapter 3. Hafner Publishing Company, New York.
[2] Bartlett, M.S. (1953) Approximate Confidence Intervals. Biometrika, 40, 12-19.
https://doi.org/10.1093/biomet/40.1-2.12
[3] Cysneiros, F.J.deA., dos Santos, S.J.P. and Cordeiro, G.M. (2001) Skewness and Kurtosis for Maximum Likelihood Estimator in One-Parameter Exponential Family Models. Brazilian Journal of Statistics and Probability, 15, 85-108.
[4] Bowman, K.O. and Shenton, L.R. (1998) Asymptotic Skewness and the Distribution of Maximum Likelihood Estimators. Communications in Statistics—Theory and Methods, 27, 2743-2760.
https://doi.org/10.1080/03610929808832252
[5] Vrbik, J. (2013) Accurate Confidence Regions Based on MLEs. Advances and Applications in Statistics, 32, 33-56.
[6] Vrbik, J. (2013) Addendum and Errata to “Accurate Confidence Regions Based on MLEs”. Advances and Applications in Statistics, 35, 161-163.

Copyright © 2024 by authors and Scientific Research Publishing Inc.


This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.