Bayesian Set Estimation with Alternative Loss Functions: Optimality and Regret Analysis

Abstract

Decision-theoretic interval estimation requires the use of loss functions that, typically, take into account the size and the coverage of the sets. We here consider the class of monotone loss functions that, under quite general conditions, guarantee Bayesian optimality of highest posterior probability sets. We focus on three specific families of monotone losses, namely the linear, the exponential and the rational losses whose difference consists in the way the sizes of the sets are penalized. Within the standard yet important set-up of a normal model we propose: 1) an optimality analysis, to compare the solutions yielded by the alternative classes of losses; 2) a regret analysis, to evaluate the additional loss of standard non-optimal intervals of fixed credibility. The article uses an application to a clinical trial as an illustrative example.

Share and Cite:

Santis, F. and Gubbiotti, S. (2023) Bayesian Set Estimation with Alternative Loss Functions: Optimality and Regret Analysis. Open Journal of Statistics, 13, 195-211. doi: 10.4236/ojs.2023.132010.

1. Introduction

Together with point estimation and testing, set estimation is the most popular method for inference on the unknown parameter of a statistical model. In the decision-theoretic framework, set estimation has traditionally received less attention than the other two problems, both from a frequentist and a Bayesian point of view. Set estimators are in fact typically derived using non-decisional approaches. For instance, frequentist intervals are found using pivotal quantities or inverting acceptance regions of tests; Bayesian intervals, such as equal tails or highest posterior density sets, are obtained by exploiting some features of the posterior distribution of the parameter, with no explicit connections with a decision-theoretic context. Probably this is the consequence of the difficulty in choosing a satisfactory loss function for this decision problem.

In the few studies on this topic, loss functions typically take into account two aspects of the candidate set estimators, i.e. their size (aptly penalized by a specific coefficient) and their ability to contain the true value of the parameter. The most widespread loss function is the so-called linear loss, which is a linear function of the size of the set (see Expression (2) of Section 1). Despite its simplicity, this function has been found to produce optimal sets (both frequentist and Bayesian) with paradoxical flaws even in a very standard problem such as set estimation of the normal mean with unknown variance: see [1] and [2] . In order to address these inconveniences, the Authors have then proposed to consider other losses that are increasing but non-linear functions of the sizes of the sets under evaluation. Among these, the exponential and the rational losses (see Expressions (3) of Section 2).

The analysis contained in the present article is closely related to that in [1] and [2] . Our contribution is intended to fill the following three gaps. 1) The approach of [1] and [2] is mainly theoretical. In the present article we explore the features of optimal actions and posterior expected losses induced by three different loss functions (linear, exponential and rational) in several numerical examples. Specifically, we assess the sensitivity of optimal actions with respect to both the choice of the loss function and of the prior distribution (informative vs non-informative priors). 2) As a second contribution, we propose a regret analysis: we evaluate the additional losses of standard highest posterior density sets of fixed posterior probability with respect to optimal sets under the three losses. 3) As far as we know, a formal decision-theoretic approach set estimation under different loss functions has never been applied to any real experimental context. We here provide an application to clinical trial data.

As we said, the literature on the subject is scant. In addition to the already mentioned motivating articles [1] and [2] , general introductions to Bayesian decision-theoretic set estimation can be found, for instance, in [3] [4] [5] [6] [7] . An objective Bayesian decisional approach is proposed in [8] . The use of posterior regret for interval estimation has been lately considered in [9] . Evaluation of additional posterior loss of non-optimal actions for point estimation and testing has been examined by [10] and [11] . Considerations on the conflict between optimal and non-optimal Bayesian actions can be found in [12] .

This article moves from a preliminary study in [13] for the Poisson model and is organized as follows. In Section 2 we introduce the basic elements of the Bayesian decision-theoretic approach to set estimation. Section 3 is about optimality analysis for interval estimation of a normal mean with known and unknown variance. Using both simulations and clinical trial data we discuss the effect of the values of the penalizing constant for the length of intervals and the impact of prior choices on features of optimal sets (length, posterior probability and posterior expected loss) using the three losses. In Section 4 we change perspective: from the search of optimal sets (whose length and credibility is not fixed in advance) we switch to a regret analysis that aims at evaluating the additional cost of using intervals with fixed credibility levels rather than optimal sets. Using the clinical trial example again, we examine the effects of the following quantities on the posterior regret associated with the exponential and rational losses: the value of the penalizing constant, the sample size, the credibility level. Finally, Section 5 summarizes the main points come to light in the article. All elaborations are performed using the software R [14] : code is available upon request.

2. Methodology

Given a parametric model, ${f}_{n}\left(\cdot |\theta \right),\theta \in \Theta$ , let $\pi \left(\theta \right)$ denote the prior distribution of $\theta$ , ${x}_{n}$ an observed sample of size n and $\pi \left(\theta |{x}_{n}\right)$ the corresponding posterior distribution. For simplicity, suppose that $\Theta \subseteq {ℝ}^{1}$ and that $\pi \left(\theta \right)$ is a probability density function. Assuming to be interested in set estimation of $\theta$ from a decision theoretic perspective (see [3] and [7] ), let $\mathcal{C}$ be a class of subsets of $\Theta$ and $\mathbb{L}\left(\theta ,C\right)$ the loss function for a generic set $C\in \mathcal{C}$ . This approach prescribes one to select a set ${C}^{\star }$ that minimizes the posterior expected loss $\rho \left(C,{x}_{n}\right)$ as C varies in $\mathcal{C}$ , i.e.:

${C}^{\star }=\mathrm{arg}\underset{C\in \mathcal{C}}{\mathrm{min}}\rho \left(C,{x}_{n}\right),\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{ }\text{where}\text{\hspace{0.17em}}\text{ }\text{ }\rho \left(C,{x}_{n}\right)={\int }_{\Theta }\text{ }\text{ }\mathbb{L}\left(\theta ,C\right)\pi \left(\theta |{x}_{n}\right)\text{d}\theta .$

The most widely used family of losses for set estimation is defined by setting

$\mathbb{L}\left(\theta ,C\right)=\mathbb{S}\left[L\left(C\right)\right]+{\mathbb{I}}_{\stackrel{¯}{C}}\left(\theta \right),$ (1)

where the size $\mathbb{S}\left(\cdot \right)$ is an increasing function of $L\left(C\right)$ —the Lebesgue measure of C—and ${\mathbb{I}}_{\stackrel{¯}{C}}\left(\cdot \right)$ is the indicator function of the set $\stackrel{¯}{C}=\Theta \C$ . The resulting posterior expected loss of $C\in \mathcal{C}$ is

$\rho \left(C,{x}_{n}\right)=\mathbb{S}\left[L\left(C\right)\right]+1-ℙ\left(C|{x}_{n}\right),$

which embodies a compromise between the size of C and its posterior probability of containing $\theta$ , denoted as $ℙ\left(C|{x}_{n}\right)$ . One important property of the class of monotone functions (see, for instance, [2] ) is that, if $\theta$ is an absolutely continuous random variable (as we assume here), optimal actions are highest posterior density (HPD) sets defined as ${C}^{\star }=\left\{\theta \in \Theta :\pi \left(\theta |{x}_{n}\right)\ge h\right\},\text{\hspace{0.17em}}h\ge 0$ . More specifically, we here assume that HPD sets are intervals ${C}^{\star }=\left[{L}^{\star },{U}^{\star }\right]$ .

Let $\mathcal{L}\left(C\right)=U-L$ be the length of a generic interval C. The simplest form of loss (1) for C is obtained by selecting

${\mathbb{S}}_{\mathcal{l}}\left[\mathcal{L}\left(C\right)\right]=a\mathcal{L}\left(C\right),\text{\hspace{0.17em}}\text{\hspace{0.17em}}a>0,$ (2)

as size function, which yields the class of linear loss functions, $a\mathcal{L}\left(C\right)+{\mathbb{I}}_{\stackrel{¯}{C}}\left(\theta \right)$ , where a is a penalizing constant for interval lengths. Casella, Hwang and Robert in [1] and [2] show that, in the case of unbounded parameter space, optimal sets under the linear loss function may be dominated by unreasonable sets. For instance, in the case of the normal model $\text{N}\left(\theta ,{\sigma }^{2}\right)$ with unknown variance, the standard Student’s t-interval for $\theta$ is dominated by a set that is empty as the sample variance is sufficiently large. They also show that (under mild conditions) these kinds of problems are avoided if both the components of (1) assume values in $\left[0,1\right]$ or, more specifically, if $\mathbb{S}\left(\cdot \right)$ is a nonlinear and increasing function that ranges monotonically in the unit interval such that ${\mathrm{lim}}_{C\to \varnothing }\mathbb{S}\left(\mathcal{L}\left(C\right)\right)=0$ and ${\mathrm{lim}}_{C\to \Theta }\mathbb{S}\left(\mathcal{L}\left(C\right)\right)=1$ . To resolve this paradox the Authors propose the following two nonlinear functions

${\mathbb{S}}_{e}\left[\mathcal{L}\left(C\right)\right]=1-{\text{e}}^{-\frac{a\mathcal{L}{\left(C\right)}^{2}}{2}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{and}\text{\hspace{0.17em}}\text{\hspace{0.17em}}{\mathbb{S}}_{r}\left[\mathcal{L}\left(C\right)\right]=\frac{a\mathcal{L}\left(C\right)}{a\mathcal{L}\left(C\right)+1},\text{\hspace{0.17em}}\text{\hspace{0.17em}}a<0$ (3)

that result in the classes of exponential and rational loss functions. The posterior expected losses corresponding to the three size functions under examination in this article are then given by:

${\rho }_{j}\left(C,{x}_{n}\right)={\mathbb{S}}_{j}\left[\mathcal{L}\left(C\right)\right]+1-ℙ\left(C|{x}_{n}\right),\text{\hspace{0.17em}}\text{\hspace{0.17em}}j=\mathcal{l},e,r.$ (4)

Note that whereas ${\rho }_{\mathcal{l}}\left(C,{x}_{n}\right)\ge 0$ , for all C, i.e. it is unbounded above, the values of ${\rho }_{e}\left(C,{x}_{n}\right)$ and ${\rho }_{r}\left(C,{x}_{n}\right)$ are upper bounded. As an example, Figure 1 shows ${\rho }_{j}\left(C,{x}_{n}\right)$ , $j=\mathcal{l},e,r$ as a function of $\mathcal{L}\left(C\right)$ : shaded areas represent the contributions of ${\mathbb{S}}_{j}\left[\mathcal{L}\left(C\right)\right]$ (dotted lines) and of $1-ℙ\left(C|{x}_{n}\right)$ (solid lines) respectively.

Posterior expected losses ${\rho }_{j}$ allow us to follow two different decisional approaches:

(i) Optimality analysis, i.e. determination and comparison of optimal intervals

${C}_{j}^{\star }=\mathrm{arg}\underset{C\in \mathcal{C}}{\mathrm{min}}{\rho }_{j}\left(C,{x}_{n}\right)\text{ }j=\mathcal{l},e,r;$

(ii) Regret analysis, i.e. evaluation of standard intervals of given credibility level $\gamma$ , denoted by ${C}^{\gamma }$ , in terms of their additional expected loss (or expected regret) with respect to ${C}_{j}^{\star }$ , quantified by

${\Delta }_{j}\left({C}^{\gamma }\right)={\rho }_{j}\left({C}^{\gamma },{x}_{n}\right)-{\rho }_{j}\left({C}_{j}^{\star },{x}_{n}\right)\text{ }j=\mathcal{l},e,r.$ (5)

Note for any arbitrary choice of $\gamma \in \left[0,1\right]$ , ${\Delta }_{j}\left({C}^{\gamma }\right)\ge 0$ .

Approaches (i) and (ii) will be now specialized to the normal model in Sections 3 and 4 respectively.

3. Optimality Analysis

Let us assume that ${X}_{i}|\theta ~\text{N}\left(\theta ,{\sigma }^{2}\right)$ , with ${\sigma }^{2}$ known, and that $\theta ~\text{N}\left({\mu }_{0},{\sigma }^{2}/{n}_{0}\right)$ . These assumptions imply that $\theta |{x}_{n}~\text{N}\left(\stackrel{¯}{\mu },{\sigma }^{2}/\stackrel{¯}{n}\right)$ where $\stackrel{¯}{\mu }=\left({n}_{0}{\mu }_{0}+n{\stackrel{¯}{x}}_{n}\right)/\stackrel{¯}{n}$ and $\stackrel{¯}{n}={n}_{0}+n$ . In this case the class of HPD sets—which are also equal tails (ET) intervals—is ${C}_{k}=\stackrel{¯}{\mu }±k\sigma /\sqrt{\stackrel{¯}{n}},\text{\hspace{0.17em}}k>0$ , and

${\rho }_{j}\left({C}_{k},{x}_{n}\right)={\mathbb{S}}_{j}\left[\mathcal{L}\left({C}_{k}\right)\right]+1-\left[\Phi \left(k\right)-\Phi \left(-k\right)\right],$ (6)

where $\Phi \left(\cdot \right)$ is the standard normal cdf and ${\mathbb{S}}_{j}\left[\mathcal{L}\left({C}_{k}\right)\right]$ , for $j=\mathcal{l},e,r$ , are obtained by setting $\mathcal{L}\left({C}_{k}\right)=2k\sigma /\sqrt{\stackrel{¯}{n}}$ in (2) and (3). The non-informative case is simply obtained by setting ${n}_{0}=0$ , which yields standard confidence intervals ${\stackrel{¯}{x}}_{n}±k\sigma /\sqrt{n},\text{\hspace{0.17em}}k>0$ . Note that the minimizers ${k}_{j}^{\star }$ can be determined as follows:

(i) ${k}_{\mathcal{l}}^{\star }=\left\{\begin{array}{ll}-2\mathrm{ln}\left(a\sqrt{2\pi }\sigma /\sqrt{\stackrel{¯}{n}}\right)\hfill & \text{ }\text{if}\text{\hspace{0.17em}}a<{\left(\sqrt{2\pi }\sigma /\sqrt{\stackrel{¯}{n}}\right)}^{-1}\hfill \\ 0\hfill & \text{ }\text{otherwise}\hfill \end{array}$ ;

Figure 1. Posterior expected losses ${\rho }_{j}\left(C,{x}_{n}\right)$ , ${\mathbb{S}}_{j}\left[\mathcal{L}\left(C\right)\right]$ (dotted lines) and $1-ℙ\left(C|{x}_{n}\right)$ (solid lines), $j=\mathcal{l},e,r$ , as functions of $\mathcal{L}\left(C\right)$ .

(ii) ${k}_{r}^{\star }$ satisfies the equation $\varphi \left[{k}_{r}^{\star }\right]=\frac{\stackrel{¯}{a}}{2{\left[{k}_{r}^{\star }+\stackrel{¯}{a}\right]}^{2}}$ , where $\stackrel{¯}{a}=\frac{a\sqrt{\stackrel{¯}{n}}}{2\sigma }$ and $\varphi \left(\cdot \right)$ denotes the density function of a standard normal random variable;

(iii) ${k}_{e}^{\star }$ is determined numerically.

For the unknown variance case let us assume that $\theta |{\sigma }^{2}~\text{N}\left({\mu }_{0},{\sigma }^{2}/{n}_{0}\right)$ and that ${\sigma }^{2}~\text{Inv-Ga}\left({\nu }_{0}/2,\text{rate}={\nu }_{0}{\sigma }_{0}^{2}/2\right)$ or, equivalently, an inverse chi-square r.v. of parameters $\left({\nu }_{0},\text{rate}={\sigma }_{0}^{2}\right)$ . In this case standard conjugate analysis yields that $\theta |{x}_{n}~{t}_{\stackrel{¯}{\nu }}\left(\stackrel{¯}{\mu },{\stackrel{¯}{\sigma }}^{2}/\stackrel{¯}{n}\right)$ , where $\stackrel{¯}{\nu }={\nu }_{0}+n$ , $\stackrel{¯}{\mu }$ is unchanged w.r.t. the known

variance case, $\stackrel{¯}{\nu }{\stackrel{¯}{\sigma }}^{2}={\nu }_{0}{\sigma }_{0}^{2}+\left(n-1\right){S}_{n}^{2}+\frac{{n}_{0}n}{{n}_{0}+n}{\left({\stackrel{¯}{x}}_{n}-{\mu }_{0}\right)}^{2}$ , ${S}_{n}^{2}$ is the sample vari

ance and $\stackrel{¯}{n}={n}_{0}+n$ (see Lesaffre and Lawson (2012), p. 86-87). HPD sets are then ${C}_{k}=\stackrel{¯}{\mu }±k\stackrel{¯}{\sigma }/\sqrt{\stackrel{¯}{n}},\text{\hspace{0.17em}}k>0$ , and

${\rho }_{j}\left({C}_{k},{x}_{n}\right)={\mathbb{S}}_{j}\left[\mathcal{L}\left({C}_{k}\right)\right]+1-\left[{\mathbb{F}}_{\stackrel{¯}{\nu }}\left(\stackrel{¯}{\mu }+k\stackrel{¯}{\sigma }/\sqrt{\stackrel{¯}{n}}\right)-{\mathbb{F}}_{\stackrel{¯}{\nu }}\left(\stackrel{¯}{\mu }-k\stackrel{¯}{\sigma }/\sqrt{\stackrel{¯}{n}}\right)\right],$ (7)

where ${\mathbb{F}}_{\stackrel{¯}{\nu }}\left(\cdot \right)$ is the cdf of the t random variable with $\stackrel{¯}{\nu }$ degrees of freedom, location $\stackrel{¯}{\mu }$ and scale $\stackrel{¯}{\sigma }/\sqrt{\stackrel{¯}{n}}$ and ${\mathbb{S}}_{j}\left[\mathcal{L}\left({C}_{k}\right)\right]$ , for $j=\mathcal{l},e,r$ , are obtained by setting $L\left({C}_{k}\right)=2k\stackrel{¯}{\sigma }/\sqrt{\stackrel{¯}{n}}$ in (2) and (3). Note that the non-informative case is retrieved for ${n}_{0}={\nu }_{0}=0$ , i.e. using a flat prior for $\theta |{\sigma }^{2}$ and $\pi \left({\sigma }^{2}\right)=1/{\sigma }^{2}$ . In this case HPD sets are ${C}_{k}={\stackrel{¯}{x}}_{n}±k{S}_{n}/\sqrt{n}$ (see Lesaffre and Lawson (2012) p. 88). The minimizers ${k}_{j}^{\star }$ of (7) depend on the data and can be determined numerically, for $j=\mathcal{l},e,r$ , or according to results similar to (i)-(iii). For instance,

it can be checked that ${k}_{r}^{\star }$ satisfies the equation ${\phi }_{\stackrel{¯}{\nu }}\left[{k}_{r}^{\star }\right]=\frac{\stackrel{¯}{a}}{2{\left[{k}_{r}^{\star }+\stackrel{¯}{a}\right]}^{2}}$ , where $\stackrel{¯}{a}=\frac{a\sqrt{n}}{2\stackrel{¯}{\sigma }}$ and ${\phi }_{\stackrel{¯}{\nu }}\left(\cdot \right)$ denotes the density function of a ${t}_{\stackrel{¯}{\nu }}$ random variable

with $\stackrel{¯}{\nu }$ degrees of freedom, location $\stackrel{¯}{\mu }$ and scale $\stackrel{¯}{\sigma }/\sqrt{\stackrel{¯}{n}}$ . For futher details see [2] .

3.1. Numerical Examples

For each loss function and for selected values of a we determine the optimal sets using the following procedure. We first consider a grid of values for k and determine the corresponding ET/HPD sets ${C}_{k}$ and their expected loss ${\rho }_{j}\left({C}_{k},{x}_{n}\right)$ . Then we select ${k}_{j}^{\star }$ as the minimizer of ${\rho }_{j}\left({C}_{k},{x}_{n}\right)$ , for $j=\mathcal{l},e,r$ . Figure 2 shows the plots of ${\rho }_{j}\left({C}_{k},{x}_{n}\right)$ as functions of k for Normal posteriors in the known variance case, under an informative prior ( ${n}_{0}=5$ , left panels) and a non-informative prior ( ${n}_{0}=0$ , right panels). For each value of a the optimal ${k}_{j}^{\star }$ is circled. According to the definition of the different size functions in (2) and (3), the impact of the penalizing coefficient a on interval lengths is not directly comparable. However, in general, as expected, the larger a the smaller ${k}_{j}^{\star }$ , $j=\mathcal{l},e,r$ . Among the three losses, ${\rho }_{\mathcal{l}}\left({C}_{k},{x}_{n}\right)$ is the most sensitive with respect to the values of a, which results in the largest range of values for ${k}_{\mathcal{l}}^{\star }$ . Note also that the optimal values ${k}_{r}^{\star }$ are the largest, yielding optimal intervals with posterior probability ${\gamma }_{r}^{\star }$ larger than 0.94 with smaller variability than those of the linear and of the exponential loss. For all the explored a values, the exponential loss function provides optimal sets with intermediate levels of length and posterior probability, and overall smaller values of the posterior expected loss. As expected, for each value of a, the curves ${\rho }_{j}$ as functions of k are slightly but

Figure 2. Posterior expected losses ${\rho }_{j}\left(C,{x}_{n}\right)$ , $j=\mathcal{l},e,r$ , as functions of k for different values of a for Normal posteriors under an informative prior with ${n}_{0}=5$ (left column) and a non-informative prior with ${n}_{0}=0$ (right column). For each ${\rho }_{j}$ , $j=\mathcal{l},e,r$ , circles denote ${k}_{j}^{\star }$ .

uniformly higher for the non-informative case (right panels). Interestingly, even though a selected ${k}_{j}^{\star }$ can be surprisingly smaller in the non-informative case, the corresponding value of ${\rho }_{j}^{\star }$ is always greater than in the informative case, due to the larger posterior variance of $\theta$ . For instance, under the exponential loss function with $a=1$ , ${k}_{e}^{\star }$ and ${\rho }_{e}^{\star }$ are equal to 1.54 and 0.335 in the non-informative case and 1.70 and 0.264 in the informative case (see Table 1). The above comments are consistent with the numerical values of ${k}_{j}^{\star }$ , ${\mathcal{L}}_{j}^{\star }$ , ${\gamma }_{j}^{\star }$ and ${\rho }_{j}^{\star }$ reported in Table 1 for the Normal model with known variance, both for the informative and the non-informative case.

Table 2 reports ${k}_{j}^{\star }$ , ${\mathcal{L}}_{j}^{\star }$ , ${\gamma }_{j}^{\star }$ and ${\rho }_{j}^{\star }$ for the unknown variance case assuming ${\sigma }_{0}^{2}=1$ , (i) ${n}_{0}=0,{\nu }_{0}=100$ and (ii) ${n}_{0}=0,{\nu }_{0}=0$ . Notice that the values of Table 2 side (i) are very close to those of Table 1 side (ii), since ${\nu }_{0}=100$ yields a prior for ${\sigma }^{2}$ highly concentrated on the variance parameter that is assumed to be known in the former case. As regards the comparison between the informative and the non-informative case, considerations similar to those made for the known variance case also apply to Table 2 [compare (i) and (ii)].

3.2. Application to Clinical Data

In this section we consider an application from [15] , in which an experimental drug (t), indicated for the treatment of iron deficiency anemia in adult patients with cronic kidney disease, is compared to a control treatment (c). The primary endpoint is the mean change in hemoglobin from baseline to day 35. Let $\theta ={\theta }_{t}-{\theta }_{c}$ be the difference between the expected changes in hemoglobin under the two treatments. Let ${\stackrel{¯}{X}}_{n}~\text{N}\left(\theta ,{\sigma }^{2}/n\right)$ be the difference between the sample mean changes in hemoglobin ${\stackrel{¯}{X}}_{n}^{\left(1\right)}$ and ${\stackrel{¯}{X}}_{n}^{\left(0\right)}$ under the two treatments, where

Table 1. Length, posterior probability, posterior expected loss for ${C}_{{k}_{j}^{\star }}$ and values of ${k}_{j}^{\star }$ under the three loss functions for selected values of a, for the Normal model with known variance ${\sigma }^{2}=1$ assuming (i) an informative prior for $\theta$ (e.g. ${n}_{0}=5$ ) and (ii) a non-informative prior for $\theta$ (e.g. ${n}_{0}=0$ ).

Table 2. Length, posterior probability, posterior expected loss for ${C}_{{k}_{j}^{\star }}$ and values of ${k}_{j}^{\star }$ under the three loss functions for selected values of a, for the Normal model with unknown variance, assuming ${\sigma }_{0}^{2}=1$ , (i) ${n}_{0}=0,{\nu }_{0}=100$ and (ii) ${n}_{0}=0,{\nu }_{0}=0$ .

${\sigma }^{2}/n={\sigma }_{t}^{2}/{n}_{t}+{\sigma }_{c}^{2}/{n}_{c}$ and $n={n}_{t}+{n}_{c}$ , with ${n}_{i}$ the size of the arm $i=t,c$ . In the example ${n}_{t}=62$ , ${n}_{c}=64$ , ${\stackrel{¯}{x}}_{{n}_{t}}=0.87$ and ${\stackrel{¯}{x}}_{{n}_{c}}=0.71$ . Here, we assume that ${\sigma }_{t}$ and ${\sigma }_{c}$ are known and equal to the sample standard deviations reported in the original example, i.e. 1.14 and 1 respectively. Using the non-informative prior $\pi \left(\theta \right)=\text{const}$ for $\theta$ , the 95% credible interval is $\left(-0.215,0.535\right)$ .

Table 3 shows the optimal intervals under the three loss functions for $a=1$ with different choices of the prior parameters: (A) ${\mu }_{0}=0,{n}_{0}=100$ (no difference between treatment effects); (B) ${\mu }_{0}=0.16,{n}_{0}=100$ (prior information perfectly matching sample data); (C) ${\mu }_{0}=0.32,{n}_{0}=100$ (optimistic prior mean); (D) ${n}_{0}=0$ (non-informative).

As ${\mu }_{0}$ increases [from (A) to (C)], intervals bounds are shifted towards larger values, but the selected ${k}_{j}^{\star }$ is the same for each given loss function, thus yielding the same values of ${\mathcal{L}}_{j}^{\star }$ , ${\gamma }_{j}^{\star }$ and ${\rho }_{j}^{\star }$ , $j=\mathcal{l},e,r$ . In the non-informative case (D), the values of ${\rho }_{j}^{\star }$ and ${\mathcal{L}}_{j}^{\star }$ are uniformly greater than in the previous cases, due to the larger posterior variance. The rational loss always yields the widest optimal intervals, with posterior probability close to the conventional level 95%.

Figure 3 shows the behavior of ${\mathcal{L}}_{j}^{\star }$ , ${\gamma }_{j}^{\star }$ , ${\rho }_{j}^{\star }$ and ${k}_{j}^{\star }$ as functions of the coefficient a. As a increases, ${\mathcal{L}}_{j}^{\star }$ and ${\gamma }_{j}^{\star }$ tend to decrease and the corresponding values of ${\rho }_{j}^{\star }$ tend to increase. As expected the linear loss (solid line) is the most sensitive to changes of the values of a. Due to remark (i) of Section 3, ${C}_{{k}_{\mathcal{l}}^{\star }}$ is non trivial ( ${k}_{\mathcal{l}}^{\star }>0$ ) for values of $a<{\left(\sqrt{2\pi }\sigma /\sqrt{\stackrel{¯}{n}}\right)}^{-1}$ , that is equal to 2.08 in this example. This motivates the presence of the cusp in the three plots. Note that in the second panel the curve representing ${\gamma }_{r}^{\star }$ shows values always very

Figure 3. Length, posterior probability and posterior expected loss of the optimal set ${C}^{\star }$ as functions of a, under the linear (solid line), rational (dashed line) and exponential loss (dotted line), for the Normal model with known variance assuming ${n}_{0}=0$ (non-informative).

close to 0.95; whereas ${\gamma }_{e}^{\star }$ progressively reduces as a increases. Finally, inspection of the values of ${\rho }_{j}^{\star }$ allows an overall look at the sensitivity to a of the three losses (see third panel). Although a direct comparison of the absolute values of ${\rho }_{j}^{\star }$ is not meaningful, due to the different role of a in the three loss functions,

Table 3. Bounds, length, posterior probability, posterior expected loss for optimal sets ${C}_{{k}_{j}^{\star }}$ and values of ${k}_{j}^{\star }$ under the three loss functions with $a=1$ , for the Normal model with known variance, assuming four alternative priors: (A) ${\mu }_{0}=0,{n}_{0}=10$ (no difference between treatment effects); (B) ${\mu }_{0}=0.16,{n}_{0}=10$ (prior information perfectly matching sample data); (C) ${\mu }_{0}=0.32,{n}_{0}=10$ (optimistic prior mean); (D) ${n}_{0}=0$ (non-informative).

the rate of increase of the three curves shows a greater degree of robustness of the exponential loss.

4. Regret Analysis

As discussed at the end of Section 2 one typically uses the suboptimal sets ${C}^{\gamma }$ that guarantee minimal length in the class of fixed $\gamma$ -credibility level intervals.

For the normal model with known variance ${C}^{\gamma }=\stackrel{¯}{\mu }±{z}_{\frac{1+\gamma }{2}}\sigma /\sqrt{\stackrel{¯}{n}}$ . In this section

we are interested in evaluating ${\Delta }_{j}\left({C}^{\gamma }\right)$ , the additional expected loss of sets ${C}^{\gamma }$ given by Equation (5). Our goal is to quantify the cost of using the more pragmatic interval ${C}^{\gamma }$ instead of the optimal set ${C}_{{k}_{j}^{\star }}$ under a given loss function. Here the comparison is restricted to the exponential and rational loss functions, due to the drawbacks of the linear loss previously discussed. The values of ${\Delta }_{j}\left( C \gamma \right)$

are computed noting that ${\rho }_{j}\left({C}^{\gamma },{x}_{n}\right)={\mathbb{S}}_{j}\left[2{z}_{\frac{1+\gamma }{2}}\sigma /\sqrt{\stackrel{¯}{n}}\right]+1-\gamma$ and determining the values of ${\rho }_{j}\left({C}_{{k}_{j}^{\star }},{x}_{n}\right)$ numerically as discussed in the previous section.

We explore the behaviour of ${\Delta }_{j}\left({C}^{\gamma }\right)$ with respect to a, n and $\gamma$ for the application of Section 3.2. Moreover, we compare two alternative prior assumptions, i.e. the non-informative case (with ${n}_{0}=0$ ) and an informative case (with ${n}_{0}=100$ ). Note under the non-informative prior the $\gamma$ -credible interval coincides

with the frequentist $\gamma$ -confidence interval ${\stackrel{¯}{x}}_{n}±{z}_{\frac{1+\gamma }{2}}\sigma /\sqrt{n}$ .

Application to Clinical Data (Continued)

Figures 4-6 show the values of ${\Delta }_{j}\left({C}^{\gamma }\right)$ , $j=e,r$ , as functions of a, n and $\gamma$ respectively. Solid lines correspond to a prior sample size ${n}_{0}=100$ , dashed lines to the non-informative case ( ${n}_{0}=0$ ).

1) Effect of a (Figure 4)

a) As a varies the curves ${\Delta }_{j}$ have a minimum in ${\stackrel{˜}{a}}_{j}$ , which represents the value of the penalizing coefficient such that ${C}^{\gamma }$ are the closest to the optimal sets ${C}_{{k}_{j}^{\star }}$ . Note that ${\Delta }_{j}=0$ when the value ${\stackrel{˜}{a}}_{j}$ is such that ${C}^{\gamma }$ is optimal under the loss function j.

b) In all panels but the last one (see next item), ${\stackrel{˜}{a}}_{j}$ is smaller for ${n}_{0}=0$ than

Figure 4. Additional expected loss ${\Delta }_{j}\left({C}^{\gamma }\right)$ with $\gamma =0.80$ (top panels) and $\gamma =0.95$ (bottom panels) as function of a, under the exponential (left column) and the rational loss (right column), for the Normal model with known variance assuming ${n}_{0}=100$ (solid line) and ${n}_{0}=0$ (dashed line) with $n=126$ .

Figure 5. Additional expected loss ${\Delta }_{j}\left({C}^{\gamma }\right)$ with $\gamma =0.80$ (top panels) and $\gamma =0.95$ (bottom panels) as functions of n, under the exponential (left column) and the rational loss (right column) with $a=1$ , for the Normal model with known variance assuming ${n}_{0}=100$ (solid line) and ${n}_{0}=0$ (dashed line).

for ${n}_{0}=100$ . This means that, since the non-informative prior yields larger sets, the degree of penalization of interval length has to be smaller than it can be for larger values of ${n}_{0}$ .

c) Under the rational loss function, for $\gamma =0.95$ (last panel) ${\Delta }_{r}$ is substantially negligible for a large range of values of a. This is consistent with what we observed in the previous section (see for instance Figure 3 second panel) since the optimal sets under the rational loss have credibility approximately equal to 0.95 for several values of a.

d) According to the choice of a either the standard confidence interval or the Bayesian credible interval can be preferred in terms of additional expected loss.

Figure 6. Additional expected loss ${\Delta }_{j}\left({C}^{\gamma }\right)$ as function of $\gamma$ , under the exponential (left column) and the rational loss (right column) with $a=1$ (top panels) $a=2$ (center panels) and $a=5$ (bottom panels) for the normal model with known variance assuming ${n}_{0}=100$ (solid line) and ${n}_{0}=0$ (dashed line).

2) Effect of n (Figure 5)

a) The range of ${\Delta }_{r}$ is much smaller than that of ${\Delta }_{e}$ . Values of ${\Delta }_{j}$ are particularly small when $\gamma =0.95$ (bottom panels).

b) In all the plots the solid and dashed curves ${\Delta }_{j}$ as functions of n have a minimum point ${\stackrel{˜}{n}}_{j}$ . This means that there exist a value of the sample size such that ${C}^{\gamma }$ is optimal. Such a value ${\stackrel{˜}{n}}_{j}$ is smaller for ${n}_{0}=100$ than for ${n}_{0}=0$ since the informative prior implies shorter $\gamma$ -credible sets than the non-informative one.

c) For values larger than ${\stackrel{˜}{n}}_{j}$ the curves ${\Delta }_{j}$ increase. In particular the higher steepness of solid curves is due to the shorter length of intervals under the informative prior for each given sample size. Moreover, solid and dashed curves eventually tend to coincide because the effect of ${n}_{0}$ becomes more and more negligible with respect to larger and larger values of n.

d) For $\gamma =0.8$ (top panels) there are not values of n such that ${\Delta }_{j}=0$ , i.e. sets ${C}^{\gamma }$ are never optimal.

3) Effect of $\gamma$ (Figure 6)

a) In all panels the maximum value of ${\Delta }_{j}$ is obtained for $\gamma =0$ (when ${C}^{\gamma }$ reduces to $\stackrel{¯}{\mu }$ ).

b) For increasing values of $\gamma$ the curves ${\Delta }_{j}$ decrease and reach their minimum at ${\gamma }_{j}^{\star }$ , i.e. ${\Delta }_{j}$ increases with $|\gamma -{\gamma }_{j}^{\star }|$ .

c) The smaller the value of $\gamma$ the larger the discrepancy between the lengths of ${C}_{{k}_{j}^{\star }}$ and ${C}^{\gamma }$ and this discrepancy is stronger for informative priors that yield shorter intervals with respect to non-informative priors. The opposite is observed for large values of $\gamma$ when the exponential loss is used; whereas the rational loss seems to be more insensitive to the prior at least for values of $\gamma >0.5$ .

d) Depending on the chosen value for $\gamma$ , either the standard confidence interval or the Bayesian credible interval can have the smallest regret. However, for large values of $\gamma$ —typically chosen in the practice—regret is smaller for the informative Bayesian interval.

5. Conclusions

We can now attempt to summarize some indications drawn from the numerical example and the clinical data application of Sections 3 and 4.

As regards the optimality analysis the main points are the following.

1) As shown by Table 1 and Table 2, the value of the penalizing coefficient a is critical: there are no general guidelines for its choice that is, however, highly influential in determining optimal intervals. This is also true in general in decisional analysis, for instance in testing problems when generalized 0 - 1 losses are used.

2) Non-informative distributions imply larger values of ${\mathcal{L}}_{j}$ and, as a consequence, of ${\rho }_{j}$ than those obtained with informative priors, as one can argue by looking at Table 3.

3) Figure 2 and Figure 3 allow us to sketch a comparison among the behaviour with respect to a of the posterior expected losses induced by the three loss functions: the linear loss is the most sensitive; the rational loss yields larger sets and values of ${\gamma }_{r}^{\star }$ close to 0.95; the values of ${\rho }_{e}$ are the most robust with respect to a.

In the regret analysis of Section 4 we have explored the impact of a, n and $\gamma$ . Here are the main comments.

1) The value ${\stackrel{˜}{a}}_{j}$ , that minimizes the additional loss of ${C}^{\gamma }$ , quantifies the degree of penalization such that ${C}^{\gamma }$ is as close as possible to be optimal. In this sense it can be useful to support the interpretation and the choice of a.

2) As shown in Figure 4, according to the choice of a either the standard confidence interval or the Bayesian credible interval can be preferred in terms of additional expected loss.

3) Figure 5 suggests that, as n varies, the range of ${\Delta }_{r}$ is smaller than that of ${\Delta }_{e}$ . Furthermore it hints that it is possible to find values of n such that ${C}^{\gamma }$ is optimal and sample size beyond which the weight of the prior on ${\Delta }_{j}$ becomes more and more negligible.

4) In terms of additional expected loss, it is not granted that the Bayesian credible interval has to be preferred with respect to the standard confidence interval: it depends on the chosen value for $\gamma$ . Nevertheless, in the practice large values of $\gamma$ are typically fixed. According to Figure 6 for large values of $\gamma$ the informative Bayesian interval have (slightly) smaller regret than confidence intervals.

As pointed out in [2] “it is difficult to treat the set estimation problem in a decision-theoretic way” because of some limitations related to the available loss functions and the difficulty of selecting an appropriate loss function. Nevertheless the possibility of balancing size and credibility is still appealing. In this paper we have explored the features of some loss functions that are alternative to the linear loss function and that produce sensible sets for normal models. Bearing in mind that the distribution of the data plays an important role, we hope to extend the present analysis to other models, in the spirit of [13] .

Conflicts of Interest

The authors declare no conflicts of interest.

 [1] Casella, G., Hwang, J.T.G. and Robert, C. (1993) A Paradox in Decision-Theoretic Interval Estimation. Statistica Sinica, 3, 141-155. [2] Casella, G., Hwang, J.T.G. and Robert, C. (1994) Loss Function for Set Estimation, In: Berger, J.O. and Gupta, S.S., Eds., Statistical Decision Theory and Related Topics V, Springer-Verlag, New York, 237-252. https://doi.org/10.1007/978-1-4612-2618-5_18 [3] Berger, J.O. (1985) Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, Chicago. https://doi.org/10.1007/978-1-4757-4286-2 [4] French, S. and Rios Insua, D. (2000) Statistical Decision Theory, Kendall’s Library of Statistics. Wiley, Hoboken. [5] Pratt, J.W., Raiffa, H. and Schlaifer, R. (2008) Introduction to Statistical Decision Theory. MIT Press, Cambridge. [6] Parmigiani, G. and Inoue, L. (2009) Decision Theory: Principles and Approaches. Wiley Series in Probability and Statistics. Wiley, Hoboken. https://doi.org/10.1002/9780470746684 [7] Robert, C. (2007) The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation. Springer Science & Business Media, Berlin. [8] Bernardo, J.M. (2005) Intrinsic Credible Regions: An Objective Bayesian Approach to Interval Estimation. Test, 14, 317-384. https://doi.org/10.1007/BF02595408 [9] Rice, K. and Ye, L. (2022) Expressing Regret: A Unified View of Credible Intervals. The American Statistician, 76, 248-256. https://doi.org/10.1080/00031305.2022.2039764 [10] De Santis, F. and Gubbiotti, S. (2016) A Decision-Theoretic Approach to Sample Size Determination under Several Priors. Applied Stochastic Models in Business and Industry, 33, 282-295. https://doi.org/10.1002/asmb.2211 [11] De Santis, F. and Gubbiotti, S. (2021) On the Predictive Performance of a Non-Optimal Action in Hypothesis Testing. Statistical Methods & Applications, 30, 689-709. https://doi.org/10.1007/s10260-020-00539-1 [12] Brutti, P., De Santis, F. and Gubbiotti, S. (2014) Predictive Measures of the Conflict between Frequentist and Bayesian Estimators. Journal of Statistical Planning and Inference, 148, 111-122. https://doi.org/10.1016/j.jspi.2013.12.009 [13] De Santis, F. and Gubbiotti, S. (2021) Optimal Credible Intervals under Alternative Loss Functions. In: Book of Short Papers SIS 2021, Statistical Methods & Applications, Pearson, Milano, 1618-1623. [14] R Core Team (2022) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna. https://www.R-project.org [15] Gamalo, M.A., Wu, R. and Tiwari, R.C. (2016) Bayesian Approach to Non-Inferiority Trials for Normal Means. Statistical Methods in Medical Research, 25, 221-240. https://doi.org/10.1177/0962280212448723