Received 22 January 2016; accepted 18 April 2016; published 21 April 2016
1. Introduction
A Bayesian statistical analysis requires three elements:
1) a distribution family for the observations (the sampling distribution);
2) a prior distribution for the parameter;
3) a loss function associated with each decision, together with its expected loss.
The posterior distribution of the parameter θ is given by

\pi(\theta \mid x) = \frac{f(x \mid \theta)\, \pi(\theta)}{\int f(x \mid \theta)\, \pi(\theta)\, d\theta},   (1)

where f(x | θ) denotes the sampling distribution and π(θ) the prior.
A posterior distribution and a loss function lead to an optimal decision rule (Bayes rule), together with its risk function and its frequentist properties.
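For concreteness, the following Python sketch (not from the paper; the normal-normal model and all numbers are illustrative assumptions) shows how a sampling distribution, a prior and a loss function combine to give a Bayes rule in the simplest conjugate setting.

# Minimal sketch (assumed conjugate normal-normal example): sampling distribution,
# prior and loss function lead to a Bayes rule.
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0                                    # known sampling standard deviation
mu0, tau0 = 0.0, 2.0                           # prior mean and prior standard deviation
x = rng.normal(loc=0.5, scale=sigma, size=50)  # observed sample

# Posterior of the mean under the normal-normal model (Equation (1) in closed form).
n = x.size
post_var = 1.0 / (1.0 / tau0**2 + n / sigma**2)
post_mean = post_var * (mu0 / tau0**2 + x.sum() / sigma**2)

# Under squared-error loss the Bayes rule is the posterior mean;
# under absolute-error loss it would be the posterior median.
bayes_estimate = post_mean
print(post_mean, post_var, bayes_estimate)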
1.1. Bayesian Model Selection
Consider a situation in which some quantity of interest, μ, is to be estimated from a sample of observations that can be regarded as realizations from some unknown probability distribution, and in order to do so it is necessary to specify a model for that distribution. There are usually many alternative plausible models available and, in general, they each lead to different estimates of μ. Consider a sample of data, x, and a set of K models M_1, ..., M_K containing the true model M_t. Each M_k consists of a family of distributions f(x | θ_k, M_k), where θ_k represents a parameter (or vector of parameters). The prior probability that M_k is the true model is denoted by P(M_k), and the prior distribution of the parameters of M_k (given that M_k is true) by π(θ_k | M_k). Conditioning on the data x and integrating out the parameter, one obtains the following posterior model probabilities:
P(M_k \mid x) = \frac{P(M_k)\, f(x \mid M_k)}{\sum_{j=1}^{K} P(M_j)\, f(x \mid M_j)},   (2)

where

f(x \mid M_k) = \int f(x \mid \theta_k, M_k)\, \pi(\theta_k \mid M_k)\, d\theta_k   (3)

is the integrated likelihood under M_k. If π(θ_k | M_k) is a discrete distribution, the integral in (3) is replaced by a sum.
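As an illustration of Equations (2) and (3), the following Python sketch (an assumed toy setup, not the paper's example) computes integrated likelihoods by numerical quadrature and then the posterior model probabilities.

# Illustrative sketch: posterior model probabilities from Equations (2)-(3), with the
# integrated likelihoods computed by numerical quadrature over each model's parameter.
import numpy as np
from scipy import stats
from scipy.integrate import quad

x = np.array([0.2, -0.4, 0.9, 0.1, 0.5])     # toy data
prior_model = {"M1": 0.5, "M2": 0.5}          # P(M_k)

def integrated_likelihood(x, loglik, prior_pdf, lo, hi):
    """f(x | M_k) = integral of f(x | theta, M_k) * pi(theta | M_k) d theta."""
    integrand = lambda th: np.exp(loglik(x, th)) * prior_pdf(th)
    value, _ = quad(integrand, lo, hi)
    return value

# M1: N(theta, 1) with theta ~ N(0, 1);  M2: N(theta, 2) with theta ~ N(0, 1).
ml = {
    "M1": integrated_likelihood(
        x, lambda x, th: stats.norm.logpdf(x, th, 1.0).sum(),
        lambda th: stats.norm.pdf(th, 0.0, 1.0), -10, 10),
    "M2": integrated_likelihood(
        x, lambda x, th: stats.norm.logpdf(x, th, 2.0).sum(),
        lambda th: stats.norm.pdf(th, 0.0, 1.0), -10, 10),
}

norm_const = sum(prior_model[k] * ml[k] for k in ml)             # denominator of (2)
post_model = {k: prior_model[k] * ml[k] / norm_const for k in ml}
print(post_model)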
Bayesian model selection involves selecting the "best" model according to some selection criterion; most often the Bayesian information criterion (BIC), also known as the Schwarz criterion [10], is used. It is an asymptotic approximation of the log posterior odds when the prior odds are all equal. More information on Bayesian model selection and its applications can be found in Nguefack-Tsague and Ingo [11], Guan and Stephens [12], Nguefack-Tsague [13], Carvalho and Scott [14], Fridley [15], Robert [16], Liang et al. [17], and Bernardo and Smith [18]. Other variants of model selection include Nguefack-Tsague and Ingo [11], who used BMA machinery to derive a focused Bayesian information criterion (FoBMA) which selects different models for different purposes, i.e. their method depends on the parameter singled out for inference.
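Since the BIC is mentioned above as the usual selection criterion, the following sketch (with invented numbers) shows the standard BIC approximation to posterior model probabilities under equal prior odds.

# Sketch of the BIC approximation to posterior model probabilities (assumed equal prior
# odds): P(M_k | x) is approximately exp(-BIC_k / 2) / sum_j exp(-BIC_j / 2).
import numpy as np

def bic(loglik_max, n_params, n_obs):
    # Schwarz criterion: BIC = -2 * maximized log-likelihood + (number of parameters) * log(n).
    return -2.0 * loglik_max + n_params * np.log(n_obs)

def bic_weights(bics):
    b = np.asarray(bics, dtype=float)
    w = np.exp(-0.5 * (b - b.min()))   # subtract the minimum for numerical stability
    return w / w.sum()

# Example: three fitted models with maximized log-likelihoods and parameter counts.
bics = [bic(-120.3, 2, 100), bic(-119.8, 3, 100), bic(-118.9, 5, 100)]
print(bic_weights(bics))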
1.2. Bayesian Model Averaging
Let μ be a quantity of interest depending on x, for example a future observation from the same process that generated x. The idea is to use a weighted average of the estimates of μ obtained using each of the alternative models, rather than the estimate obtained using any single model. More precisely, the posterior distribution of μ is given by
\pi(\mu \mid x) = \sum_{k=1}^{K} P(M_k \mid x)\, \pi(\mu \mid x, M_k).   (4)
Note that π(μ | x) is a weighted average of the posterior distributions π(μ | x, M_k), where the k-th weight, P(M_k | x), is the posterior probability that M_k is the true model.
The posterior distribution of μ, conditioned on Mk being true, is given by
\pi(\mu \mid x, M_k) = \frac{f(x \mid \mu, M_k)\, \pi(\mu \mid M_k)}{\int f(x \mid \mu, M_k)\, \pi(\mu \mid M_k)\, d\mu}.   (5)
The posterior mean and posterior variance are given by
E(\mu \mid x) = \sum_{k=1}^{K} P(M_k \mid x)\, E(\mu \mid x, M_k),   (6)

\mathrm{Var}(\mu \mid x) = \sum_{k=1}^{K} P(M_k \mid x)\left[ \mathrm{Var}(\mu \mid x, M_k) + \left( E(\mu \mid x, M_k) - E(\mu \mid x) \right)^2 \right].   (7)
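A minimal sketch of how Equations (6) and (7) combine per-model posterior summaries (all numbers below are invented for illustration):

# Sketch of Equations (6) and (7): combining per-model posterior summaries into the
# BMA posterior mean and variance.
import numpy as np

post_model = np.array([0.6, 0.3, 0.1])       # P(M_k | x)
post_means = np.array([1.10, 0.95, 1.40])    # E(mu | x, M_k)
post_vars  = np.array([0.04, 0.06, 0.09])    # Var(mu | x, M_k)

bma_mean = np.sum(post_model * post_means)                                   # Equation (6)
bma_var  = np.sum(post_model * (post_vars + (post_means - bma_mean) ** 2))   # Equation (7)
print(bma_mean, bma_var)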
Clyde and Iversen [33] developed a variant of BMA in which it is not assumed that the true model belongs to the set of competing models (the M-open framework). They developed an optimal weighting scheme and showed that their method provides more accurate predictions than any of the proxy models.
An R [34] package for BMA is now available for computational purposes; this package provides ways of carrying out BMA for linear regression, generalized linear models, and survival analysis using Cox proportional hazards models. For computations, Monte Carlo methods or approximating methods are used; thus many BMA applications are based on the BIC. As one realizes in deriving BMA, there is no unique statistical model and no unique prior distribution associated with BMA, though these are available for each competing model. This makes the frequentist properties of BMA hard to obtain from pure Bayesian decision theory. This was the main motivation of this paper: to propose an alternative Bayesian model in which the long-run properties of the resulting estimators can be obtained automatically from Bayesian decision theory. The present paper is organized as follows. Section 2 introduces the new BMA method; Section 3 provides practical examples, while Section 4 provides a discussion. The paper ends with concluding remarks.
2. BMA Based on Mixture
2.1. The Model
The purpose of this section is to define a new BMA method. The prior of the quantity of interest can be defined as
\pi(\mu) = \sum_{k=1}^{K} P(M_k)\, \pi(\mu \mid M_k),   (8)

where P(M_k) is the prior probability of model M_k and π(μ | M_k) is the prior distribution of μ under M_k.
The parametric statistical model can also be defined as
f(x \mid \mu) = \sum_{k=1}^{K} P(M_k)\, f(x \mid \mu, M_k),   (9)

with f(x | μ, M_k) being the parametric statistical model for model M_k (i.e. the sampling distribution of M_k). The use of Bayes' rule leads to the posterior of the quantity of interest,
\pi(\mu \mid x) = \frac{f(x \mid \mu)\, \pi(\mu)}{\int f(x \mid \mu)\, \pi(\mu)\, d\mu}.   (10)
Defining a loss function, Bayesian estimates are then obtained, with their long-run and short-run properties known. All the frequentist properties of Bayes rules now apply; in particular, one can find conditions under which they are consistent and admissible. This approach is referred to as mixture-based Bayesian model averaging (MBMA).
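The following Python sketch illustrates the MBMA construction of Equations (8)-(10) on a grid for a scalar quantity of interest; the two normal models, their priors and the data are assumptions made only for the illustration.

# Sketch of the MBMA construction in Equations (8)-(10): mix the priors and the sampling
# distributions, then apply Bayes' rule once, on a grid for a scalar mu.
import numpy as np
from scipy import stats

x = np.array([0.3, -0.1, 0.6, 0.2])        # toy data
p_model = np.array([0.5, 0.5])             # P(M_1), P(M_2)
grid = np.linspace(-5, 5, 2001)            # grid over mu

# Model-specific priors pi(mu | M_k) and sampling densities f(x | mu, M_k).
priors = [stats.norm(0, 1).pdf(grid), stats.norm(0, 2).pdf(grid)]
def lik(model, mu):
    scale = 1.0 if model == 0 else 2.0
    return np.prod(stats.norm.pdf(x[:, None], loc=mu[None, :], scale=scale), axis=0)

mix_prior = sum(p * pr for p, pr in zip(p_model, priors))          # Equation (8)
mix_lik   = sum(p * lik(k, grid) for k, p in enumerate(p_model))   # Equation (9)

unnorm = mix_lik * mix_prior                    # numerator of (10)
posterior = unnorm / np.trapz(unnorm, grid)     # Equation (10), normalized on the grid
mbma_mean = np.trapz(grid * posterior, grid)
mbma_var  = np.trapz((grid - mbma_mean) ** 2 * posterior, grid)
print(mbma_mean, mbma_var)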
2.2. Theoretical Properties of MBMA
Proposition 1. Under (8) and (9), assuming that \int f(x \mid \mu, M_k)\, \pi(\mu \mid M_j)\, d\mu < \infty for all k and j, the posterior of the quantity of interest in (10) is

\pi(\mu \mid x) = \frac{\sum_{k=1}^{K} \sum_{j=1}^{K} P(M_k)\, P(M_j)\, f(x \mid \mu, M_k)\, \pi(\mu \mid M_j)}{\sum_{k=1}^{K} \sum_{j=1}^{K} P(M_k)\, P(M_j) \int f(x \mid \mu, M_k)\, \pi(\mu \mid M_j)\, d\mu}.   (11)
Proof. By Bayes' rule, \pi(\mu \mid x) = f(x \mid \mu)\, \pi(\mu) \big/ \int f(x \mid \mu)\, \pi(\mu)\, d\mu. Substituting (8) and (9),

f(x \mid \mu)\, \pi(\mu) = \sum_{k=1}^{K} \sum_{j=1}^{K} P(M_k)\, P(M_j)\, f(x \mid \mu, M_k)\, \pi(\mu \mid M_j),   (a)

\int f(x \mid \mu)\, \pi(\mu)\, d\mu = \sum_{k=1}^{K} \sum_{j=1}^{K} P(M_k)\, P(M_j) \int f(x \mid \mu, M_k)\, \pi(\mu \mid M_j)\, d\mu.   (b)

Dividing (a) by (b) yields the result.
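As a numerical sanity check of Proposition 1 (under the same kind of toy setup as above, with assumed models and data), the double-sum form (11) can be compared with the posterior obtained by applying Bayes' rule directly to the mixed prior and likelihood:

# Numerical check: the double-sum expression (11) coincides with the posterior obtained by
# applying Bayes' rule directly to the mixed prior and mixed likelihood (Equation (10)).
import numpy as np
from scipy import stats

x = np.array([0.3, -0.1, 0.6, 0.2])
p_model = np.array([0.5, 0.5])
grid = np.linspace(-5, 5, 2001)
priors = [stats.norm(0, 1).pdf(grid), stats.norm(0, 2).pdf(grid)]
scales = [1.0, 2.0]
liks = [np.prod(stats.norm.pdf(x[:, None], loc=grid[None, :], scale=s), axis=0) for s in scales]

# Direct Bayes' rule on the mixtures (Equation (10)).
mix = (p_model[0] * liks[0] + p_model[1] * liks[1]) * (p_model[0] * priors[0] + p_model[1] * priors[1])
direct = mix / np.trapz(mix, grid)

# Double-sum form of Equation (11).
num = sum(p_model[k] * p_model[j] * liks[k] * priors[j] for k in range(2) for j in range(2))
den = sum(p_model[k] * p_model[j] * np.trapz(liks[k] * priors[j], grid) for k in range(2) for j in range(2))
prop = num / den

print(np.allclose(direct, prop))   # expected: True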
Corollary 2. Suppose that all the models have identical sampling distributions, that is f(x \mid \mu, M_k) = f(x \mid \mu) for all k; then MBMA reduces to BMA.
Proof. In the numerator of (11), f(x \mid \mu, M_k) = f(x \mid \mu) for all k, and \sum_{k=1}^{K} P(M_k) = 1. The numerator of (11) is therefore

\sum_{j=1}^{K} P(M_j)\, f(x \mid \mu)\, \pi(\mu \mid M_j).

Likewise, the denominator of (11) becomes

\sum_{j=1}^{K} P(M_j) \int f(x \mid \mu)\, \pi(\mu \mid M_j)\, d\mu = \sum_{j=1}^{K} P(M_j)\, f(x \mid M_j),

a mixture of marginal distributions. Therefore

\pi(\mu \mid x) = \sum_{j=1}^{K} \frac{P(M_j)\, f(x \mid M_j)}{\sum_{l=1}^{K} P(M_l)\, f(x \mid M_l)} \cdot \frac{f(x \mid \mu)\, \pi(\mu \mid M_j)}{f(x \mid M_j)} = \sum_{j=1}^{K} P(M_j \mid x)\, \pi(\mu \mid x, M_j),

which is the BMA posterior in (4).
Thus, in this special case, the posterior mean and variance under MBMA are those of BMA given in Equations (6) and (7).
2.3. Frequentist (Long Run) Evaluation of MBMA
Evaluating the long-run properties of MBMA involves studying frequentist issues, including asymptotic behaviour, consistency, efficiency, unbiasedness, and admissibility. Details of the derivations for general Bayes estimates can be found, e.g., in Gelman [35] (p. 83). The following results are proven in Gelman [35] for any Bayes estimate, and in particular hold for MBMA. Let I(μ) be the Fisher information, J(μ) the observed information, Mod the posterior mode, and μ0 the value of the parameter that makes the model distribution closest (e.g. in the sense of Kullback-Leibler information) to the true distribution.
1) If the sample size is large and the posterior distribution is unimodal and roughly symmetric, one can approximate it by a normal distribution centred at Mod with variance J(Mod)^{-1}, the inverse of the observed information at the mode (see the numerical sketch after this list).
2) If the likelihood f(x | μ) is a continuous function of μ and the true parameter value μ0 is not on the boundary of the parameter space, then, as the sample size n tends to ∞, the posterior distribution of μ approaches normality with mean μ0 and variance (n I(μ0))^{-1}, and Mod is consistent for μ0.
3) Suppose the normal approximation to the posterior distribution holds, Mod → μ0, and the true data distribution is included in the class of models; then the observed information is asymptotically equivalent to the expected information, i.e. (n I(μ0))^{-1} J(Mod) → I, where I is the identity matrix.
4) When the truth is included in the family of models being fitted, the posterior mode, mean, and median are consistent and asymptotically unbiased and efficient under mild regularity conditions.
5) If the prior distribution π(μ) is strictly positive, the Bayes risk is finite, and the risk function is continuous, then the MBMA estimator is admissible.
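A minimal numerical sketch of the normal approximation in point 1) (the model, prior and data are illustrative assumptions): the posterior mode is found numerically and the observed information is obtained from the curvature of the log posterior at the mode.

# Sketch of point 1): approximate a posterior by a normal distribution centred at the
# posterior mode with variance equal to the inverse observed information.
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(1)
x = rng.normal(loc=1.2, scale=1.0, size=200)

def neg_log_post(mu):
    # negative log posterior up to a constant: N(mu, 1) likelihood with a diffuse N(0, 10) prior
    return -(stats.norm.logpdf(x, loc=mu, scale=1.0).sum() + stats.norm.logpdf(mu, 0.0, 10.0))

res = optimize.minimize_scalar(neg_log_post)
mode = res.x

# Observed information J(Mod): numerical second derivative of the negative log posterior.
h = 1e-4
obs_info = (neg_log_post(mode + h) - 2 * neg_log_post(mode) + neg_log_post(mode - h)) / h**2
approx_sd = np.sqrt(1.0 / obs_info)
print(mode, approx_sd)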
2.4. Predictive Performance of MBMA
One measure of predictive performance is Good's logarithmic scoring rule [36]. From the nonnegativity of the Kullback-Leibler divergence, it follows that if f and g are two probability density functions,

E_f\left[\log f(x)\right] \geq E_f\left[\log g(x)\right].

Applying this to MBMA leads to

E\left[\log \pi(\mu \mid x)\right] \geq E\left[\log \pi(\mu \mid x, M_k)\right], \quad k = 1, \dots, K,   (12)

where the expectation is taken with respect to the MBMA posterior π(μ | x).
MBMA thus provides better predictive performance, in terms of expected logarithmic score, than any single model.
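The following toy sketch (assumed mixture weights and components) illustrates inequality (12) on a grid: the expected log score of the mixture, taken under the mixture itself, is at least that of each single component.

# Sketch of the logarithmic-score argument (12) on a grid: under the mixture, the expected
# log score of the mixture is at least that of any single component.
import numpy as np
from scipy import stats

grid = np.linspace(-5, 5, 2001)
w = np.array([0.7, 0.3])                                           # mixture weights
components = [stats.norm(0.0, 1.0).pdf(grid), stats.norm(1.5, 0.7).pdf(grid)]
mixture = w[0] * components[0] + w[1] * components[1]
mixture /= np.trapz(mixture, grid)

def expected_log_score(density):
    # E_mixture[log density], computed on the grid
    return np.trapz(mixture * np.log(density + 1e-300), grid)

scores = [expected_log_score(c) for c in components]
print(expected_log_score(mixture), scores)   # the mixture's score is the largest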
3. Applications
Two competing models are considered: M1, a Laplace distribution, and M2, a normal distribution; both are often used in finance. The Laplace distribution (the double exponential) is symmetric with fat tails (much fatter than those of the normal). It is not bell-shaped: it has a peak at its location parameter.
Suppose that the mean is known and the quantity of interest is the scale, μ = σ². The data are the daily foreign exchange rates of Euros versus US Dollars from 3 January 2000 to 15 June 2006 (the aim being their return values). The same prior for σ² is used for both models (the idea remains the same for different priors, e.g., uniform priors). Prior probabilities are assigned to each model.
Table 1 shows the properties of the competing models, BMA, and MBMA. Starting from equal prior probabilities for M1 and M2, i.e. 0.5 each, after observing the data M1 is more likely to be true (0.83) than M2 (0.17). While M1, M2, and MBMA each have a prior (over the parameter of interest) and a statistical model, BMA does not. This implies that the frequentist properties of MBMA can be derived automatically from Bayesian decision theory (see Subsection 2.3); this is not possible for BMA. The Bayesian estimates (conditional on the observations) of these models are very similar, with MBMA having the smallest conditional variance (0.03).
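Since the exchange-rate data and the exact prior used in the paper are not reproduced here, the following Python sketch only mimics the analysis with simulated returns and an assumed inverse-gamma prior on σ²; it computes the posterior model probabilities for the Laplace and normal models and the MBMA posterior mean of σ² on a grid.

# Rough sketch of the application with simulated returns; the inverse-gamma prior on
# sigma^2 and all numbers below are assumptions made only for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.laplace(loc=0.0, scale=0.005, size=1000)      # simulated daily returns, known mean 0

grid = np.linspace(1e-6, 2e-4, 2000)                   # grid over mu = sigma^2
prior = stats.invgamma(a=2.0, scale=5e-5).pdf(grid)    # common prior on sigma^2 (assumed)
p_model = np.array([0.5, 0.5])                         # equal prior model probabilities

# Log-likelihoods over the grid for M1 (Laplace with variance sigma^2) and M2 (normal).
ll1 = stats.laplace.logpdf(x[:, None], loc=0.0, scale=np.sqrt(grid / 2.0)[None, :]).sum(axis=0)
ll2 = stats.norm.logpdf(x[:, None], loc=0.0, scale=np.sqrt(grid)[None, :]).sum(axis=0)

shift = max(ll1.max(), ll2.max())                      # common shift to avoid underflow
lik1, lik2 = np.exp(ll1 - shift), np.exp(ll2 - shift)

# Marginal likelihoods and posterior model probabilities (Equations (2)-(3), up to the shift).
m1, m2 = np.trapz(lik1 * prior, grid), np.trapz(lik2 * prior, grid)
post_model = p_model * np.array([m1, m2]) / (p_model[0] * m1 + p_model[1] * m2)

# MBMA posterior of sigma^2 from the mixed prior and mixed likelihood (Equation (10));
# here both models share the same prior, so the mixed prior equals that prior.
mix_lik = p_model[0] * lik1 + p_model[1] * lik2
unnorm = mix_lik * prior
posterior = unnorm / np.trapz(unnorm, grid)
mbma_mean = np.trapz(grid * posterior, grid)
print(post_model, mbma_mean)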
4. Discussion
In general, as with any Bayes estimate, the forms of the posterior mean and variance of MBMA are not known in advance; in the special case above, the properties of MBMA are those of BMA and are given in Equations (6) and (7). Posterior distributions under MBMA are very complex, so a major challenge is computational. The MBMA estimate is thus computationally demanding (but feasible), since the posterior involves many sums, especially if the number K of models is large. This is not new, as BMA faces the same drawback, though nowadays programs exist for complex computations (e.g. R [34]).

Another problem is the selection of priors, both for models and for parameters (common to any Bayesian model). In most cases, uniform priors are used for the models, i.e. P(M_k) = 1/K. When the number of models is large, model search strategies are sometimes used to reduce the set of models (e.g. Occam's window method, Hoeting et al. [19]), by eliminating those that seem comparatively less compatible with the data.

Most current Bayesian mixtures are based either on the priors or on the statistical model, not on both as in the new MBMA described in this paper. For example, Abd and Al-Zaydi [37] [38] used statistical mixture models for order statistics; Al-Hussaini and Hussein [39] for exponential components; Ley and Steel [40] used a prior of mixtures with economic applications. Other Bayesian mixtures include Schäfer et al. [41] (spatial clustering), Yao [42] (Bayesian labeling), Sabourin and Naveau [43] (extremes), and Rodríguez and Walker [44] (kernel estimation). Programming code for performing model averaging using MBMA with real data and simulations is under development and will be available as an add-on package for R [34].
Table 1. Comparison of inference for the scale, μ = σ².
5. Concluding Remarks
This paper proposes a new method (with an application) for model averaging in a Bayesian context (MBMA) when the main focus of the data analyst is on the long-run (frequentist) performance of the Bayesian estimator. The method is based on using a mixture of priors and a mixture of sampling distributions for model averaging. When conditioning only on the data at hand, the popular Bayesian model averaging (BMA) may be preferable, given the computational complexity of MBMA. MBMA is especially useful when one wishes to exploit the well-known frequentist properties of Bayes rules within the framework of Bayesian decision theory.
Acknowledgements
We thank the editor and the referee for their comments on earlier versions of this paper.