Minimum Description Length Methods in Bayesian Model Selection: Some Applications
1. Introduction
Bayesian computations can be difficult, in particular those in model selection problems. For instance, it may be noted that learning the structure of Bayesian networks is in general NP-complete ([1,2]). It is therefore meaningful to consider alternative, computationally simpler solutions which are approximations to Bayesian solutions. Sometimes direct computational simplifications are possible, as shown, for example, in [3], but often approaches arising out of different methodologies are needed. We discuss some aspects of Minimum Description Length (MDL) methods from this point of view. Another important reason for exploring these methods is that there is a substantial literature on this topic available in engineering and computer science with potential applications in statistics. We will not, however, explore certain other aspects of MDL, such as the "Normalized Maximum Likelihood (NML)" introduced by [4], which do not seem to be in the spirit of the Bayesian approach that we have taken here.
The discussion below is organized as follows. In Section 2 we briefly describe the MDL principle and then indicate in Sections 3 and 4 how it applies to model fitting and model checking. It is shown that a particular version of MDL is equivalent to the Bayes factor criterion of model selection. Since this is most often computationally difficult, some approximations are desirable, and it is next shown how a different version of MDL can provide such an approximation. Following this discussion, new applications are presented in Section 5. Specifically, the MDL approach to step-wise regression is described in Section 5.1, wavelet thresholding in Section 5.2, and a change-point problem in Section 5.3.
2. Minimum Description Length Principle
The MDL approach to model fitting can be described as follows (see [5,6]). Suppose we have some data. Consider a collection of probability models for this set of data. A model provides a better fit if it yields a more compact description of the data. In terms of coding, this means that according to MDL, the best model is the one which provides the shortest description length for the given data. The MDL approach as discussed here is also related to the Minimum Message Length (MML) approach of [7]. See [8,9] for connections to information theory and other related details.
If data $x$ is known to arise from a probability density $p(x)$, then (see [10] or [11]) the optimal code length (in an average sense) is given by $-\log p(x)$. (Here $\log$ is logarithm to the base 2.) This is the link between description length and model fitting.
The optimal code length $-\log p(x)$ is valid only in the discrete case. To handle the continuous case later, discretize $x$ and denote it by $x_\delta$, where $\delta$ denotes the precision. In effect we will then be considering $p(x)\,\delta$ instead of $p(x)$ itself as far as coding of $x$ is concerned, when $x$ is one-dimensional. In the $r$-dimensional case, we will replace the density $p(x)$ by the probability of the $r$-dimensional cube of side $\delta$ containing $x$, namely $p(x)\,\delta^r$, so that the optimal code length changes to $-\log p(x) - r\log\delta$.
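To make the link between probabilities and code lengths concrete, the following short sketch (our own illustration in Python, not part of the original paper; the models, precision and data value are made up) computes the idealized code length $-\log_2[p(x)\,\delta]$ for a continuous observation discretized to precision $\delta$, and shows that a model assigning higher density to the observed data yields a shorter description.

```python
import numpy as np
from scipy.stats import norm

def code_length_bits(x, density, delta=1e-3):
    """Idealized code length -log2[p(x) * delta] for data discretized to precision delta."""
    return -np.log2(density(x) * delta)

x = 1.2                                   # a single (hypothetical) observed value
p_good = lambda t: norm.pdf(t, loc=1.0)   # model close to the data
p_bad = lambda t: norm.pdf(t, loc=4.0)    # model far from the data

print(code_length_bits(x, p_good))        # shorter description length
print(code_length_bits(x, p_bad))         # longer description length
```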
3. MDL for Estimation or Model Fitting
Consider data $x^n = (x_1, \ldots, x_n)$, and suppose $\{f(x^n \mid \theta),\ \theta \in \Theta\}$ is the collection of models of interest. Further, let $\pi(\theta)$ be a prior density for $\theta$. Given a value of $\theta$ (or a model), the optimal code length for describing $x^n$ is $-\log f(x^n \mid \theta)$, but since $\theta$ is unknown, its description requires a further $-\log\pi(\theta)$ bits on average. Therefore the optimal code length is obtained upon minimizing, over $\theta$,

$$-\log f(x^n \mid \theta) - \log\pi(\theta), \qquad (1)$$
so that MDL amounts to seeking that model which minimizes the sum of
• the length, in bits, of the description of the model, and
• the length, in bits, of data when encoded with the help of the model.
Now note that the posterior density of $\theta$ given the data $x^n$ is

$$\pi(\theta \mid x^n) = \frac{f(x^n \mid \theta)\,\pi(\theta)}{m(x^n)}, \qquad (2)$$

where $m(x^n) = \int f(x^n \mid \theta)\,\pi(\theta)\,d\theta$ is the marginal or predictive density. Therefore, minimizing $-\log f(x^n \mid \theta) - \log\pi(\theta)$ over $\theta$ is equivalent to maximizing $\pi(\theta \mid x^n)$. Thus MDL for estimation or model fitting is equivalent to finding the highest posterior density (HPD) estimate of $\theta$. Note, however, that a prior $\pi(\theta)$ is needed for these calculations. The approach that a Bayesian adopts in specifying the prior is not, in general, what is accepted by practitioners of the MDL approach. Therefore, the equivalence of MDL and HPD approaches is either subject to accepting the same prior, or holds as an asymptotic or similar approximation. MDL mostly prefers an approximately uniform prior when $\theta \in \Theta \subset \mathbb{R}^k$ for some fixed $k$ (the same $k$ across all models), leading to the maximum likelihood estimate (MLE). The case of models having parameters of different dimensions is different and is interesting. This can be easily seen in the continuous case upon discretization. Now denote the discretized $x^n$ by $x^n_\delta$ and the discretized $\theta_k$ (of dimension $k$) by $\theta_{k,\delta_k}$. Then

$$-\log f(x^n_\delta \mid \theta_{k,\delta_k}) - \log\pi(\theta_{k,\delta_k}) \approx -\log f(x^n \mid \theta_k) - n\log\delta - \log\pi(\theta_k) - k\log\delta_k.$$

Here $\delta$ and $\delta_k$ are the precisions required to discretize $x$ and $\theta_k$, respectively. Note that the term $-n\log\delta$ is common across all models, so it can be ignored. However, the term $-k\log\delta_k$, which involves the dimension of $\theta_k$ in the model, varies and is influential. According to [6,12], $\delta_k = 1/\sqrt{n}$ is optimal (see [13] for details), in which case the description length becomes

$$-\log f(x^n \mid \theta_k) - \log\pi(\theta_k) + \frac{k}{2}\log n. \qquad (3)$$

Minimizing this will not lead to the MLE even when $\pi(\theta_k)$ is assumed to be approximately constant. In fact, [12] proceeds further and argues that the correct precision $\delta_k$ should depend on the Fisher information matrix. This amounts to using a prior which is similar in nature to the Jeffreys' prior on $\theta_k$. Jeffreys' prior is an objective choice, and thus this approach to MDL can then be considered a default Bayesian approach.
In spite of these desirable properties, however, MDL leads to the HPD estimate of $\theta$, which is not the usual Bayes estimate. The posterior mean is what is generally preferred, since the error in estimation then has an immediate and simple answer in the posterior standard deviation. In summary, therefore, the Bayesian approach does not find attractive solutions in the MDL approach as far as estimation or model fitting is concerned, unless the models under consideration are hierarchical, having parameters of varying dimension. On the other hand, when such hierarchical models are of interest, the inference problem usually involves model selection in addition to model fitting. Thus the possible gains from studying the MDL approach are in the context of model selection, as described below.
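As a sketch of the equivalence noted above (our own illustration, not from the paper; the data, prior and grid are made up), the following fragment evaluates the two-part description length of equation (1) on a grid of values of $\theta$ for a normal-mean problem with a normal prior and confirms that its minimizer coincides with the posterior mode.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.0, size=50)          # data, sigma known (= 1)

theta_grid = np.linspace(0.0, 4.0, 4001)
prior = norm(loc=0.0, scale=3.0)                      # N(0, 3^2) prior on theta

# Two-part description length (1): -log f(x^n | theta) - log pi(theta) (natural logs)
loglik = np.array([norm.logpdf(x, loc=t, scale=1.0).sum() for t in theta_grid])
desc_len = -loglik - prior.logpdf(theta_grid)

theta_mdl = theta_grid[np.argmin(desc_len)]

# Closed-form posterior mode for the conjugate normal model
n, xbar = len(x), x.mean()
post_mode = (n * xbar / 1.0**2) / (n / 1.0**2 + 1 / 3.0**2)
print(theta_mdl, post_mode)                           # the two agree up to grid resolution
```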
4. Model Selection Using MDL
Let us recall the Bayesian approach to model selection and express it in the following form. Let $x^n \sim f(x^n \mid \theta)$, $\theta \in \Theta$. Suppose two models are to be compared. Consider testing

$$M_0:\ \theta \in \Theta_0 \quad \text{versus} \quad M_1:\ \theta \in \Theta_1, \qquad (4)$$

where $\Theta = \Theta_0 \cup \Theta_1$, for some disjoint $\Theta_0$ and $\Theta_1$. Let $\pi$ be a prior on $\Theta$. Then $\pi$ can be expressed as

$$\pi(\theta) = p_0\, g_0(\theta)\, I(\theta \in \Theta_0) + p_1\, g_1(\theta)\, I(\theta \in \Theta_1), \qquad (5)$$

where $p_i = \pi(\Theta_i)$, $i = 0, 1$, and $g_0$ and $g_1$ are the conditional densities (with respect to some dominating $\sigma$-finite measure) of $\theta$ under $M_0$ and $M_1$, respectively. Then
$$P(M_0 \mid x^n) = \frac{p_0\, m_0(x^n)}{m(x^n)} \qquad (6)$$

and

$$P(M_1 \mid x^n) = \frac{p_1\, m_1(x^n)}{m(x^n)}, \qquad (7)$$

where

$$m_i(x^n) = \int_{\Theta_i} f(x^n \mid \theta)\, g_i(\theta)\, d\theta, \quad i = 0, 1, \qquad (8)$$

and

$$m(x^n) = p_0\, m_0(x^n) + p_1\, m_1(x^n). \qquad (9)$$

Note that $m_i(x^n)$ is simply the marginal or predictive density of $x^n$ under $M_i$, and $m(x^n)$ is the unconditional predictive density obtained upon averaging $m_0(x^n)$ and $m_1(x^n)$. Consequently, the posterior odds ratio of $M_0$ relative to $M_1$ is

$$\frac{P(M_0 \mid x^n)}{P(M_1 \mid x^n)} = \frac{p_0}{p_1}\, BF_{01}, \qquad (10)$$

with $BF_{01} = m_0(x^n)/m_1(x^n)$ denoting the Bayes factor of $M_0$ relative to $M_1$. When we compare two competing models $M_0$ and $M_1$, we usually take $p_0 = p_1 = 1/2$, and hence settle upon the Bayes factor $BF_{01}$ as the model selection tool. This agrees well with the intuitive notion that the model yielding better predictive ability must be a better model for the given data.
4.1. Mixture MDL and Stochastic Complexity
Let us consider the MDL principle now for model selection between $M_0$ and $M_1$. Once the conditional prior densities $g_0$ and $g_1$ are agreed upon, MDL will select that model $M_i$ which obtains the smaller value of the code length $-\log m_i(x^n)$ between the two. This is clearly equivalent to using the Bayes factor as the model selection tool, and hence this version of MDL is equivalent to the Bayes factor criterion. In the MDL literature, this version of MDL is known as "mixture MDL", and is distinguished from the "two-stage MDL", which codes the parameter (using the prior) and the data (using the model) separately. The two-stage MDL can be derived as an approximation to the mixture MDL, as discussed later. See [13] for further details and other interesting comparisons and discussion. Let us consider a few examples before examining the need for other versions of MDL.
Example 1. Suppose $x^n = (x_1, \ldots, x_n)$ is a random sample from $N(\theta, \sigma^2)$ with known $\sigma^2$. We want to test

$$M_0:\ \theta = \mu_0 \quad \text{versus} \quad M_1:\ \theta \neq \mu_0.$$

Consider the $N(\mu_0, \tau^2)$ prior on $\theta$ with known $\tau^2$ under $M_1$. Then the marginal distribution of $\bar{X}$ is $N(\mu_0, \sigma^2/n)$ under $M_0$, and under $M_1$ it is $N(\mu_0, \sigma^2/n + \tau^2)$. A continuous model and a continuous prior are considered here. Since the precision of the prior parameter is the same across all models upon discretization, we will ignore the distinction and proceed with densities. Then, both the Bayes factor criterion and the MDL principle will select $M_0$ over $M_1$ if and only if

$$-\log m_0(\bar{x}) < -\log m_1(\bar{x}),$$

where $m_0$ and $m_1$ are the corresponding densities. Since we are comparing two logarithms, let us switch to natural logarithms. Then

$$-\log m_0(\bar{x}) = \frac{1}{2}\log\left(2\pi\frac{\sigma^2}{n}\right) + \frac{(\bar{x} - \mu_0)^2}{2\sigma^2/n}$$

and

$$-\log m_1(\bar{x}) = \frac{1}{2}\log\left(2\pi\left(\frac{\sigma^2}{n} + \tau^2\right)\right) + \frac{(\bar{x} - \mu_0)^2}{2(\sigma^2/n + \tau^2)}.$$

Noting that

$$\frac{1}{\sigma^2/n} - \frac{1}{\sigma^2/n + \tau^2} = \frac{\tau^2}{(\sigma^2/n)(\sigma^2/n + \tau^2)}$$

and

$$\frac{1}{2}\log\frac{\sigma^2/n + \tau^2}{\sigma^2/n} = \frac{1}{2}\log\left(1 + \frac{n\tau^2}{\sigma^2}\right),$$

we obtain

$$-\log m_0(\bar{x}) + \log m_1(\bar{x}) = \frac{\tau^2(\bar{x} - \mu_0)^2}{2(\sigma^2/n)(\sigma^2/n + \tau^2)} - \frac{1}{2}\log\left(1 + \frac{n\tau^2}{\sigma^2}\right).$$

Therefore $M_0$ is preferred over $M_1$, either by the Bayes factor or by the mixture MDL, if and only if

$$(\bar{x} - \mu_0)^2 < \frac{(\sigma^2/n)(\sigma^2/n + \tau^2)}{\tau^2}\log\left(1 + \frac{n\tau^2}{\sigma^2}\right).$$
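A small numerical sketch of Example 1 (our own illustration; the values of $\mu_0$, $\sigma$, $\tau$ and the simulated data are made up) compares $-\log m_0(\bar{x})$ and $-\log m_1(\bar{x})$ directly, which is exactly the mixture-MDL / Bayes factor comparison above.

```python
import numpy as np
from scipy.stats import norm

def prefers_M0(xbar, n, mu0, sigma, tau):
    """Mixture MDL / Bayes factor choice for Example 1 (known sigma, N(mu0, tau^2) prior)."""
    m0 = norm.logpdf(xbar, loc=mu0, scale=np.sqrt(sigma**2 / n))
    m1 = norm.logpdf(xbar, loc=mu0, scale=np.sqrt(sigma**2 / n + tau**2))
    return m0 > m1                     # equivalently, -log m0 < -log m1

# Hypothetical data: true mean slightly away from mu0 = 0
rng = np.random.default_rng(2)
x = rng.normal(loc=0.4, scale=1.0, size=25)
print(prefers_M0(x.mean(), n=25, mu0=0.0, sigma=1.0, tau=1.0))
```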
Example 2. Let us consider the previous example with $\sigma^2$ now unknown. Suppose the prior on $\sigma^2$ is the default $\pi(\sigma^2) \propto 1/\sigma^2$ under both models. The prior on $\theta$ under $M_1$ is now assumed to depend on $\sigma^2$, i.e., $\theta \mid \sigma^2 \sim N(\mu_0, c\,\sigma^2)$, where $c$ is assumed to be a known constant for now. Then, provided $n \geq 2$, $\sigma^2$ can be integrated out in closed form, and letting $S = \sum_{i=1}^n (x_i - \bar{x})^2$, and $m_0(x^n)$ and $m_1(x^n)$ denote the marginal densities of $x^n$ given $M_0$ and $M_1$, respectively,

$$m_0(x^n) = \frac{\Gamma(n/2)}{\pi^{n/2}}\left[S + n(\bar{x} - \mu_0)^2\right]^{-n/2}$$

and

$$m_1(x^n) = \frac{\Gamma(n/2)}{\pi^{n/2}}\,(1 + nc)^{-1/2}\left[S + \frac{n(\bar{x} - \mu_0)^2}{1 + nc}\right]^{-n/2}.$$

Thus

$$\frac{m_0(x^n)}{m_1(x^n)} = (1 + nc)^{1/2}\left[\frac{S + n(\bar{x} - \mu_0)^2/(1 + nc)}{S + n(\bar{x} - \mu_0)^2}\right]^{n/2}.$$

Therefore, the Bayes factor criterion or the mixture MDL reduces to a criterion which is very similar to that given in the previous example, except that $\sigma^2$ is now replaced by an estimator.
Example 3. (Jeffreys' Test) This is similar to the problem discussed above, except that the conditional prior on $\theta$ under $M_1$ is Cauchy$(\mu_0, \sigma)$, with density

$$g_1(\theta \mid \sigma^2) = \frac{1}{\pi\sigma\left[1 + (\theta - \mu_0)^2/\sigma^2\right]},$$

the Cauchy prior. The prior on $\sigma^2$ is the same as before under both models: $\pi(\sigma^2) \propto 1/\sigma^2$. This approach suggested by Jeffreys ([14]) is important. It explains how one should proceed when the hypotheses which describe the model selection problem involve only some of the parameters and the remaining parameters are considered to be nuisance parameters. Jeffreys' suggestion is to employ the same noninformative prior on the nuisance parameters under both models, and a proper prior with a low level of information on the parameters of interest. Details on this problem along with this choice of prior can be found in Section 2.7 of [15].

Note that $m_0(x^n)$ is the same as in the previous example, namely

$$m_0(x^n) = \frac{\Gamma(n/2)}{\pi^{n/2}}\left[\sum_{i=1}^n (x_i - \mu_0)^2\right]^{-n/2},$$

whereas

$$m_1(x^n) = \int_0^\infty\!\!\int_{-\infty}^{\infty} f(x^n \mid \theta, \sigma^2)\, g_1(\theta \mid \sigma^2)\,\frac{1}{\sigma^2}\, d\theta\, d\sigma^2.$$

No closed form is available for $m_1(x^n)$ in this case. To calculate it one can proceed as follows, as indicated in Section 2.7 of [15]. The Cauchy density $g_1$ can be expressed as a Gamma scale mixture of normals,

$$\theta \mid \sigma^2, \lambda \sim N(\mu_0, \sigma^2/\lambda), \qquad \lambda \sim \text{Gamma}(1/2, 1/2),$$

where $\lambda$ is the mixing Gamma variable. Now one can integrate over $\theta$ and $\sigma^2$ in closed form to simplify $m_1(x^n)$. Finally, one has a one-dimensional integral over $\lambda$ left, which can be numerically computed whenever needed.
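Since $m_1(x^n)$ in Jeffreys' test has no closed form, a hedged sketch of the computation just described is given below (our own illustration, using our own derivation of the closed form given the mixing variable, under the priors stated above; the data and $\mu_0$ are made up): $\theta$ and $\sigma^2$ are integrated analytically for each value of the Gamma(1/2, 1/2) mixing variable $\lambda$ (the same algebra as Example 2 with $c = 1/\lambda$), and the remaining one-dimensional integral over $\lambda$ is computed numerically.

```python
import numpy as np
from scipy import integrate, special, stats

def log_m1_given_c(x, mu0, c):
    """Closed-form log marginal with theta | sigma^2 ~ N(mu0, c*sigma^2) and pi(sigma^2) ~ 1/sigma^2."""
    n = len(x)
    S = np.sum((x - x.mean())**2)
    Q = S + n * (x.mean() - mu0)**2 / (1 + n * c)
    return (special.gammaln(n / 2) - (n / 2) * np.log(np.pi)
            - 0.5 * np.log(1 + n * c) - (n / 2) * np.log(Q))

def m1_jeffreys(x, mu0):
    """m1(x^n) for the Cauchy(mu0, sigma) prior via its Gamma(1/2, 1/2) mixture representation."""
    integrand = lambda lam: (np.exp(log_m1_given_c(x, mu0, 1.0 / lam))
                             * stats.gamma.pdf(lam, a=0.5, scale=2.0))
    val, _ = integrate.quad(integrand, 0, np.inf)
    return val

rng = np.random.default_rng(3)
x = rng.normal(loc=0.5, scale=2.0, size=20)   # hypothetical data
print(m1_jeffreys(x, mu0=0.0))
```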
Now, let us note from the examples discussed above that an efficient computation of $m_i(x^n)$ relies on having an explicit functional form for it. This is generally possible only when a conjugate prior is used, as in Examples 1 and 2. For other priors, such as in Example 3, some numerical approximation will have to be employed. Thus we are led to considering possible approximations to the mixture MDL technique, or equivalently to the Bayes factor, $BF_{01} = m_0(x^n)/m_1(x^n)$.
From Sections 4.3.1 and 7.1 of [15], assuming that $f(x^n \mid \theta)$ and $g_i(\theta)$ are smooth functions, we obtain, for large $n$, the following asymptotic approximation for $m_i(x^n)$ of equation (8). Let $k_i$ be the dimension of $\theta$ under $M_i$, and let $\hat{H}_n(\theta)$ denote the Hessian of $-\log f(x^n \mid \theta)$, i.e.,

$$\hat{H}_n(\theta) = -\frac{\partial^2}{\partial\theta\,\partial\theta'}\log f(x^n \mid \theta).$$

Also, let $\hat{\theta}_i$ denote either the MLE or the posterior mode. Then

$$m_i(x^n) \approx f(x^n \mid \hat{\theta}_i)\, g_i(\hat{\theta}_i)\,(2\pi)^{k_i/2}\left|\hat{H}_n(\hat{\theta}_i)\right|^{-1/2}, \qquad (11)$$

so that

$$-\log m_i(x^n) \approx -\log f(x^n \mid \hat{\theta}_i) - \log g_i(\hat{\theta}_i) - \frac{k_i}{2}\log(2\pi) + \frac{1}{2}\log\left|\hat{H}_n(\hat{\theta}_i)\right|. \qquad (12)$$

Ignoring terms that stay bounded as $n \to \infty$, [12] suggests using the (approximate) criterion which Rissanen calls stochastic information complexity (or "stochastic complexity" for short),

$$SIC_i = -\log f(x^n \mid \hat{\theta}_i) + \frac{1}{2}\log\left|\hat{H}_n(\hat{\theta}_i)\right|, \qquad (13)$$

where

$$\hat{H}_n(\hat{\theta}_i) = \left.-\frac{\partial^2}{\partial\theta\,\partial\theta'}\log f(x^n \mid \theta)\right|_{\theta = \hat{\theta}_i}, \qquad (14)$$

for implementing MDL. See [12,16-18] for further details.
If $x_1, \ldots, x_n$ are i.i.d. observations, then we have

$$\frac{1}{n}\hat{H}_n(\hat{\theta}_i) \to I(\theta),$$

where $I(\theta)$ is the Fisher information matrix (per observation), and hence

$$SIC_i \approx -\log f(x^n \mid \hat{\theta}_i) + \frac{k_i}{2}\log n + \frac{1}{2}\log\left|I(\hat{\theta}_i)\right|. \qquad (15)$$

Now ignoring terms that stay bounded as $n \to \infty$, we obtain the Schwarz criterion ([19]), BIC, for $M_i$ given by

$$BIC_i = -\log f(x^n \mid \hat{\theta}_i) + \frac{k_i}{2}\log n, \qquad (16)$$

which can be seen to be asymptotically equivalent to SIC.
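The following sketch (our own illustration, not code from the paper; the simulated data, the two normal models and the function names are ours) computes BIC of (16) and the Fisher-information form of SIC in (15) for the normal model with $M_0$: mean fixed versus $M_1$: mean free, both with unknown variance.

```python
import numpy as np
from scipy.stats import norm

def neg_loglik(x, mu, sigma):
    return -norm.logpdf(x, loc=mu, scale=sigma).sum()

def bic_sic(x, mu0=0.0):
    """BIC (16) and the Fisher-information form of SIC (15) for M0: mean = mu0 vs M1: mean free."""
    n = len(x)
    # M0: only sigma^2 unknown (k = 1); per-observation Fisher info 1/(2 sigma^4)
    s0 = np.sqrt(np.mean((x - mu0)**2))
    bic0 = neg_loglik(x, mu0, s0) + 0.5 * np.log(n)
    sic0 = bic0 + 0.5 * np.log(1.0 / (2 * s0**4))
    # M1: mean and sigma^2 unknown (k = 2); |I(theta)| = 1/(2 sigma^6)
    m1, s1 = x.mean(), x.std()
    bic1 = neg_loglik(x, m1, s1) + 1.0 * np.log(n)
    sic1 = bic1 + 0.5 * np.log(1.0 / (2 * s1**6))
    return (bic0, sic0), (bic1, sic1)

rng = np.random.default_rng(4)
x = rng.normal(loc=0.3, scale=1.0, size=100)
print(bic_sic(x))            # smaller value = preferred model under each criterion
```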
Example 4. [20] discusses a model selection problem for failure time data. The two models considered are exponential and Weibull:

$$M_0:\ f(x \mid \theta) = \frac{1}{\theta}\, e^{-x/\theta}, \quad x > 0,$$

versus

$$M_1:\ f(x \mid \theta, \beta) = \frac{\beta}{\theta}\left(\frac{x}{\theta}\right)^{\beta - 1} e^{-(x/\theta)^\beta}, \quad x > 0,$$

where $\theta > 0$ and $\beta > 0$. The model selection criterion of Rissanen is SIC, as described in Equations (13) and (14). However, in this problem a better approximation is employed for the mixture MDL, the mixture being the Jeffreys mixture, i.e., the conditional prior densities under $M_i$ for $\theta_i$ are given by

$$g_i(\theta_i) = \frac{|I(\theta_i)|^{1/2}}{\int_{C_i} |I(t)|^{1/2}\, dt},$$

where $C_i$ is a compact subset of the relevant parameter space $\Theta_i$ and $\theta_i \in C_i$. Consequently, it follows that

$$-\log m_i(x^n) \approx -\log f(x^n \mid \hat{\theta}_i) + \frac{k_i}{2}\log\frac{n}{2\pi} + \log\int_{C_i} |I(t)|^{1/2}\, dt,$$

where $\hat{\theta}_i$ is the MLE of $\theta_i$ under $M_i$. This yields explicit criteria for the exponential and Weibull models under consideration. Compare the expression above with (15) and note that the term involving $|I(\hat{\theta}_i)|$ vanishes.
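As a rough illustration of this kind of comparison (not the exact Jeffreys-mixture computation of [20]; we substitute plain BIC from (16), and the simulated failure times and function names are ours), one might proceed as follows.

```python
import numpy as np
from scipy import stats

def compare_exponential_weibull(t):
    """BIC comparison of exponential (M0) vs Weibull (M1), a rough stand-in for Example 4."""
    n = len(t)
    # M0: exponential, MLE of the mean; k0 = 1
    scale0 = t.mean()
    nll0 = -stats.expon.logpdf(t, scale=scale0).sum()
    bic0 = nll0 + 0.5 * np.log(n)
    # M1: Weibull with location fixed at 0; MLEs of shape and scale; k1 = 2
    c, _, scale1 = stats.weibull_min.fit(t, floc=0)
    nll1 = -stats.weibull_min.logpdf(t, c, loc=0, scale=scale1).sum()
    bic1 = nll1 + 1.0 * np.log(n)
    return bic0, bic1                   # smaller value = preferred model

rng = np.random.default_rng(5)
t = stats.weibull_min.rvs(1.8, scale=2.0, size=80, random_state=rng)   # hypothetical data
print(compare_exponential_weibull(t))
```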
We would like to note here that many authors [21,22] define the MDL estimate to be the same as the HPD estimate with respect to the Jeffreys' prior restricted to some compact set $C$ where its integral is finite:

$$\hat{\theta}_{MDL} = \arg\max_{\theta \in C}\ f(x^n \mid \theta)\,\frac{|I(\theta)|^{1/2}}{\int_C |I(t)|^{1/2}\, dt},$$

which is the stochastic complexity approach advocated above.
It must be emphasized that proper priors are being employed to derive the SIC criterion, and hence indeterminacy and inconsistency problems faced by techniques employing improper priors are not a difficulty in this approach. Moreover, this approach can be viewed as an implementable approximation to an objective Bayesian solution.
4.2. Two-Stage MDL
Now consider the two-stage MDL, which codes the parameter (using the prior) and the data (using the likelihood) separately and adds the two description lengths. This approach is therefore similar to estimating the parameter $\theta$ with the HPD estimate when there is an informative prior, or with the MLE, but the resulting minimum description length does have interesting features. To see when and how this approach approximates the above mentioned model selection criterion, let us look at some of the specific details in the two stages of coding. See [12,13] for further details. Again, recall the setup in (4) and (5).

Stage 1. Let $\hat{\theta}_i$ be an estimate of $\theta$, such as the posterior mean, HPD estimate or MLE under $M_i$. This needs to be coded. Consider the prior density $g_i(\theta)$ conditional on $M_i$ being true. Usually MDL would choose a uniform density. Restrict $\theta$ to a large compact subset of the parameter space and discretize it as discussed in Section 3 with a precision of $1/\sqrt{n}$. Then the code length required for coding $\hat{\theta}_i$ is

$$-\log g_i(\hat{\theta}_i) + \frac{k_i}{2}\log n. \qquad (17)$$

Stage 2. Now the data $x^n$ is coded using the model density $f(x^n \mid \hat{\theta}_i)$. Discretization may again be needed, say with precision $\delta$. Thus the description length for coding $x^n$ will be

$$-\log f(x^n \mid \hat{\theta}_i) - n\log\delta. \qquad (18)$$
Summing these two code lengths, therefore, we obtain a total description length of

$$-\log f(x^n \mid \hat{\theta}_i) - n\log\delta - \log g_i(\hat{\theta}_i) + \frac{k_i}{2}\log n. \qquad (19)$$

Since the second term above, $-n\log\delta$, is constant over both $M_0$ and $M_1$, and the third term stays bounded as $n$ increases, these two terms are dropped from the MDL two-stage coding criterion. Thus, for regular parametric models, the two-stage MDL simplifies to the same criterion (for $M_i$) as BIC, namely,

$$-\log f(x^n \mid \hat{\theta}_i) + \frac{k_i}{2}\log n. \qquad (20)$$
In more complicated model selection problems, the two-stage MDL will involve further steps and may differ from BIC.
It may also be seen upon comparing (19) with (15) that the performance of SIC based MDL should be superior to the simplified two-stage MDL for moderate
since SIC uses a better precision for coding the parameter, namely, one based on the Fisher information.
5. Regression and Function Estimation
Model selection is an important part of parametric and nonparametric regression and smoothing. Variable selection in multiple linear regression, the order of the spline to be fitted, and wavelet thresholding are some such areas. We will briefly consider these problems to see how MDL methods can provide computationally attractive approximations to the respective Bayesian solutions.
5.1. Variable Selection in Linear Regression
Variable selection is an important and well studied problem in the context of normal linear models. Literature includes [23-32]. We will only touch upon this area with the specific intention of examining useful and computationally attractive approximations to some of the Bayesian methods.
Suppose we have an observation vector $y^n$ on a response variable $Y$ and also measurements $x_1, \ldots, x_K$ on a set of $K$ potential explanatory variables (or regressors). Following [13], we associate with each regressor $x_j$ a binary variable $\gamma_j$. Then the set of available linear models is

$$y^n = \sum_{j=1}^{K} \gamma_j\beta_j x_j + \varepsilon, \qquad \varepsilon \sim N_n(0, \sigma^2 I), \qquad (21)$$

where $\gamma = (\gamma_1, \ldots, \gamma_K) \in \{0, 1\}^K$. Note that $\gamma$ is, then, a Bernoulli sequence associated with the set of regression coefficients, $\beta = (\beta_1, \ldots, \beta_K)$, also. Let $\beta_\gamma$ denote the vector of non-zero regression coefficients corresponding to $\gamma$, and $X_\gamma$ the corresponding design matrix, which results in the model

$$y^n = X_\gamma\beta_\gamma + \varepsilon, \qquad \varepsilon \sim N_n(0, \sigma^2 I).$$

Selecting the best model, then, is actually an estimation problem, i.e., find the HPD estimate of $\gamma$, starting with a prior $\pi(\gamma)$ on $\gamma$ and a prior $\pi(\beta_\gamma, \sigma^2 \mid \gamma)$ on $(\beta_\gamma, \sigma^2)$ given $\gamma$. The two-stage MDL, which is the simplest, uses the criterion of minimizing

$$-\log f(y^n \mid \hat{\beta}_\gamma, \hat{\sigma}^2_\gamma, \gamma) - \log\pi(\gamma) + \frac{k_\gamma}{2}\log n. \qquad (22)$$

The MLEs for $\beta_\gamma$ and $\sigma^2$ given $\gamma$ are easily available:

$$\hat{\beta}_\gamma = (X_\gamma'X_\gamma)^{-1}X_\gamma' y^n, \qquad \hat{\sigma}^2_\gamma = \frac{RSS_\gamma}{n}, \qquad RSS_\gamma = \left\|y^n - X_\gamma\hat{\beta}_\gamma\right\|^2. \qquad (23)$$

Consider the uniform prior on $\gamma$, all $2^K$ values receiving the same weight $2^{-K}$. Using these, we can re-write the MDL criterion as the one which minimizes (as in Example 2)

$$\frac{n}{2}\log RSS_\gamma + \frac{k_\gamma}{2}\log n, \qquad (24)$$

where $k_\gamma$ is the number of non-zero $\gamma_j$'s, i.e., the number of regressors included in the model.
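The criterion (24) is simple to compute for any candidate subset of regressors. A small sketch follows (our own illustration; the design, coefficients and function names are hypothetical).

```python
import numpy as np

def mdl_two_stage(y, X, subset):
    """Two-stage MDL criterion (24): (n/2) log RSS_gamma + (k_gamma/2) log n."""
    n = len(y)
    Xg = X[:, subset]                                  # design matrix of the selected regressors
    beta_hat, *_ = np.linalg.lstsq(Xg, y, rcond=None)
    rss = np.sum((y - Xg @ beta_hat)**2)
    return 0.5 * n * np.log(rss) + 0.5 * len(subset) * np.log(n)

# Hypothetical data: intercept + 3 candidate regressors, only the first two matter
rng = np.random.default_rng(6)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = 1.0 + 2.0 * X[:, 1] - 1.5 * X[:, 2] + rng.normal(scale=0.8, size=n)

for subset in ([0], [0, 1], [0, 1, 2], [0, 1, 2, 3]):
    print(subset, round(mdl_two_stage(y, X, subset), 2))   # typically minimized at [0, 1, 2]
```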
We can also derive the mixture MDL or stochastic complexity of a given model $\gamma$. If $\pi(\beta_\gamma, \sigma^2 \mid \gamma)$ is the prior density under $\gamma$,

$$m_\gamma(y^n) = \int f(y^n \mid \beta_\gamma, \sigma^2, \gamma)\,\pi(\beta_\gamma, \sigma^2 \mid \gamma)\, d\beta_\gamma\, d\sigma^2. \qquad (25)$$

Applying (13), (14) after evaluating the information matrix of the parameters $(\beta_\gamma, \sigma^2)$, and ignoring terms that are irrelevant for model selection, one obtains (see [13]) the stochastic complexity

$$SIC_\gamma = -\log f(y^n \mid \hat{\beta}_\gamma, \hat{\sigma}^2_\gamma, \gamma) + \frac{1}{2}\log\left|\hat{H}_n(\hat{\beta}_\gamma, \hat{\sigma}^2_\gamma)\right|, \qquad (26)$$

which can be expressed in terms of $RSS_\gamma$ and $|X_\gamma'X_\gamma|$. If $\pi(\beta_\gamma, \sigma^2 \mid \gamma)$ is chosen to be the conjugate prior density, then the marginal density $m_\gamma(y^n)$ in (25) can be explicitly derived. Details on this and further simplifications obtained upon using Zellner's g-prior can be found in [13]. (See also [33,34].)
This method is only useful if one is interested in comparing a few of these models, arising out of some pre-specified subsets. Comparing all $2^K$ models is not a computationally viable option for even moderate values of $K$, since for each model $\gamma$ one has to compute the corresponding $\hat{\beta}_\gamma$ and $RSS_\gamma$.
We are more interested in a different problem, namely, whether an extra regressor should be added to an already determined model. This is the idea behind the forward-selection method of step-wise regression. In this set-up, the model comparison problem can be stated as comparing

$$M_0:\ y^n = \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon$$

versus

$$M_1:\ y^n = \beta_1 x_1 + \cdots + \beta_p x_p + \beta_{p+1} x_{p+1} + \varepsilon.$$

This is actually a model building method, so we assume that $\gamma_1 = 1$, and hence $x_1$ is the intercept which gives the starting model. Then we decide whether this model needs to be expanded by adding additional regressors. Thus, at step $p$, we have an existing model with regressors $x_1, \ldots, x_p$, and we fix $x_{p+1}$ to be one of the remaining $K - p$ regressors as the candidate for possible selection. Now the two-stage MDL approach is straightforward. From (22) and (24), we note that $x_{p+1}$ is to be selected if and only if

$$DL(p + 1) < DL(p), \qquad \text{where } DL(p) = \frac{n}{2}\log RSS_p + \frac{p}{2}\log n, \qquad (27)$$

$DL(p)$ being the description length of the model with regressors $x_1, \ldots, x_p$ and $RSS_p$ its residual sum of squares as given in (23). A closer look at (27) reveals certain interesting facts. We need the following additional notation involving design matrices and the corresponding projection matrices. We assume that the required matrix inverses exist. Let

$$X_p = [x_1, \ldots, x_p], \qquad X_{p+1} = [X_p, x_{p+1}], \qquad P_p = X_p(X_p'X_p)^{-1}X_p'.$$

Then we note the following result, which may be found, for example, in [35]:

$$RSS_{p+1} = RSS_p - \frac{d^2}{c}, \qquad (28)$$

where

$$d = x_{p+1}'(I - P_p)\, y^n, \qquad c = x_{p+1}'(I - P_p)\, x_{p+1}.$$

It then follows that

$$\frac{RSS_{p+1}}{RSS_p} = 1 - \frac{d^2}{c\, RSS_p} = 1 - r^2,$$

and hence

$$r^2 = \frac{\left[x_{p+1}'(I - P_p)\, y^n\right]^2}{\left[x_{p+1}'(I - P_p)\, x_{p+1}\right]\left[(y^n)'(I - P_p)\, y^n\right]},$$

where $r$ is simply the partial correlation coefficient between $Y$ and $x_{p+1}$ conditional on $x_1, \ldots, x_p$. Substituting these in (27), we see that $x_{p+1}$ is to be selected if and only if

$$\frac{n}{2}\log(1 - r^2) + \frac{1}{2}\log n < 0. \qquad (29)$$

Therefore, $x_{p+1}$ is selected if and only if $1 - r^2 < n^{-1/n}$, if and only if

$$r^2 > 1 - n^{-1/n}. \qquad (30)$$
This method does have some appeal, in that at each step it tries to select that variable which has the largest partial correlation with the response (conditional on the variables which are already in the model), just like the step-wise regression method. However, unlike the step-wise regression method, it does not require any stopping rule to decide whether the candidate should be added; it relies on the magnitude of the partial correlation instead.
One can also apply the stochastic complexity criterion given in (26) above. Then $x_{p+1}$ is to be selected if and only if

$$SIC_p - SIC_{p+1} > 0, \qquad (31)$$

where $SIC_p$ denotes the stochastic complexity (26) of the model with regressors $x_1, \ldots, x_p$. This is related to the step-wise regression approach, but uses more information than just the partial correlation.
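A sketch of the forward-selection rule (30) is given below (our own illustration, reusing the same hypothetical design as in the earlier sketch; function names are ours): at each step the candidate with the largest squared partial correlation with the response, given the variables already included, is added provided $r^2 > 1 - n^{-1/n}$.

```python
import numpy as np

def forward_mdl(y, X, intercept_col=0):
    """Forward selection by the two-stage MDL rule (30): add the candidate whose squared
    partial correlation with y (given the current model) exceeds 1 - n**(-1/n)."""
    n = len(y)
    threshold = 1.0 - n ** (-1.0 / n)
    selected = [intercept_col]
    while True:
        Xs = X[:, selected]
        resid_y = y - Xs @ np.linalg.lstsq(Xs, y, rcond=None)[0]
        best_j, best_r2 = None, threshold
        for j in range(X.shape[1]):
            if j in selected:
                continue
            resid_x = X[:, j] - Xs @ np.linalg.lstsq(Xs, X[:, j], rcond=None)[0]
            r2 = (resid_x @ resid_y) ** 2 / ((resid_x @ resid_x) * (resid_y @ resid_y))
            if r2 > best_r2:
                best_j, best_r2 = j, r2
        if best_j is None:
            return selected
        selected.append(best_j)

# Same hypothetical design as before
rng = np.random.default_rng(6)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = 1.0 + 2.0 * X[:, 1] - 1.5 * X[:, 2] + rng.normal(scale=0.8, size=n)
print(forward_mdl(y, X))        # e.g. [0, 1, 2]
```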
A full-fledged Bayesian approach using the g-prior can also be implemented as shown below. Note that

$$m_0(y^n) = \int f(y^n \mid \beta_{(p)}, \sigma^2)\,\pi_0(\beta_{(p)}, \sigma^2)\, d\beta_{(p)}\, d\sigma^2$$

and

$$m_1(y^n) = \int f(y^n \mid \beta_{(p+1)}, \sigma^2)\,\pi_1(\beta_{(p+1)}, \sigma^2)\, d\beta_{(p+1)}\, d\sigma^2, \qquad (32)$$

where $\pi_0$ and $\pi_1$, respectively, are the prior densities under $M_0$ and $M_1$. Taking these priors to be g-priors, namely,

$$\beta_{(p)} \mid \sigma^2, g \sim N_p\!\left(0,\ g\sigma^2(X_p'X_p)^{-1}\right) \quad \text{and} \quad \beta_{(p+1)} \mid \sigma^2, g \sim N_{p+1}\!\left(0,\ g\sigma^2(X_{p+1}'X_{p+1})^{-1}\right),$$

along with the density $\pi(\sigma^2) \propto 1/\sigma^2$ for $\sigma^2$ and a (proper prior) density $\pi(g)$ for the hyperparameter $g$, we obtain,

$$m_i(y^n) = \int_0^\infty m_i(y^n \mid g)\,\pi(g)\, dg, \quad i = 0, 1, \qquad (33)$$

where

$$m_i(y^n \mid g) = \frac{\Gamma(n/2)}{\pi^{n/2}}\,(1 + g)^{-k_i/2}\left[RSS_i + \frac{(y^n)'P_i\, y^n}{1 + g}\right]^{-n/2}, \qquad (34)$$

with $k_0 = p$, $k_1 = p + 1$, $RSS_0 = RSS_p$, $RSS_1 = RSS_{p+1}$, and $P_0$, $P_1$ the corresponding projection matrices.
The one-dimensional integrals in (33), however, cannot be obtained in closed form. One could also approximate $m_i(y^n)$ with $m_i(y^n \mid \hat{g}_i)$, where $\hat{g}_i$ are the ML-II (cf. [36]) estimates of $g$. See [13] for details.
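The closed form of $m_i(y^n \mid g)$ and the subsequent one-dimensional integration (or ML-II maximization) over $g$ can be sketched as follows. This is our own illustration under the zero-centred g-prior and the $1/\sigma^2$ prior stated above; the function names and the grid for $g$ are ours, and multiplicative constants common to both models are immaterial for the Bayes factor.

```python
import numpy as np
from scipy import integrate, special

def log_marginal_given_g(y, X, g):
    """log m(y | g) for y = X beta + eps, beta | sigma^2, g ~ N(0, g sigma^2 (X'X)^{-1}),
    pi(sigma^2) ~ 1/sigma^2 (constants common to both models cancel in the Bayes factor)."""
    n, k = X.shape
    P = X @ np.linalg.solve(X.T @ X, X.T)        # projection onto the column space of X
    fit = y @ P @ y
    rss = y @ y - fit
    return (special.gammaln(n / 2) - (n / 2) * np.log(np.pi)
            - (k / 2) * np.log(1 + g) - (n / 2) * np.log(rss + fit / (1 + g)))

def marginal(y, X, prior_g):
    """m(y) = integral of m(y | g) prior_g(g) dg, by one-dimensional quadrature as in (33)."""
    f = lambda g: np.exp(log_marginal_given_g(y, X, g)) * prior_g(g)
    val, _ = integrate.quad(f, 0, np.inf)
    return val

def ml2_g(y, X, grid=np.linspace(0.1, 500, 5000)):
    """ML-II estimate of g: the maximizer of m(y | g) over a grid."""
    vals = [log_marginal_given_g(y, X, g) for g in grid]
    return grid[int(np.argmax(vals))]
```

Comparing `marginal` (or `log_marginal_given_g` at the ML-II value of $g$) for the designs $X_p$ and $X_{p+1}$ gives the Bayes factor for adding the candidate regressor.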
Example 5. We illustrate the MDL approach to step-wise regression by applying it to the Iowa corn yield data (see [37,38]). We have not included "year" (which is a proxy for technological advance) as a regressor and instead have considered only the weather-related regressors.
In this data set the variables are: X1 = Year, 1 denoting 1930, X2 = Pre-season precipitation, X3 = May temperature, X4 = June rain, X5 = June temperature, X6 = July rain, X7 = July temperature, X8 = August rain, X9 = August temperature, and Y = X10 = Corn Yield.
As mentioned earlier, we always keep the intercept and check whether this regression should be enlarged by adding more regressors. We first apply the two-stage MDL criterion. From (30), at step $p$ we consider only those regressors $x_{p+1}$ (which are not already in the model and) which satisfy $r^2 > 1 - n^{-1/n}$ (= 0.1005 in this example). From this set we pick the one with the largest $r^2$. The values of $r^2$ for the relevant steps are listed below.

According to our procedure we select
first, followed by
and the selection ends there.
We consider the SIC criterion next. From (31), at step $p$ we pick the regressor $x_{p+1}$ with the largest value of $SIC_p - SIC_{p+1}$, provided it is positive. The values of this reduction for the relevant steps are given below.

According to SIC our order of selection is
.
5.2. Wavelet Thresholding
Consider the nonparametric regression problem where we have the following model for the noisy observations $y_1, \ldots, y_n$:

$$y_i = s(t_i) + \varepsilon_i, \quad i = 1, \ldots, n, \qquad (35)$$

where the $\varepsilon_i$ are i.i.d. $N(0, \sigma^2)$ errors with unknown error variance $\sigma^2$, and $s$ is a function (or signal) defined on some interval $[a, b]$. Assuming $s$ is a smooth function satisfying certain regularity conditions (see [15,39,40]), we have the wavelet decomposition of $s$:

$$s(t) = \sum_{j,k} \beta_{jk}\,\psi_{jk}(t), \qquad (36)$$

where the $\psi_{jk}$ are the wavelet functions and $\beta$ is the corresponding vector of wavelet coefficients. We assume that the normally infinite sum in (36) can be taken to be a finite sum (or at least a very good approximation), as indicated in [39].
Upon applying the discrete wavelet transform (DWT) to $y^n = (y_1, \ldots, y_n)$, we get the estimated wavelet coefficients, $\hat{\beta} = (\hat{\beta}_1, \ldots, \hat{\beta}_n)$ (relabelled with a single index for convenience). Consider now the equivalent model:

$$\hat{\beta}_i = \beta_i + e_i, \quad i = 1, \ldots, n,$$

where the $e_i$ are i.i.d. $N(0, \sigma^2)$. The model selection problem here involves determining the number of non-zero wavelet coefficients:

$$M_k:\ \text{exactly } k \text{ of the } \beta_i \text{ are non-zero}$$

versus

$$M_l:\ \text{exactly } l \text{ of the } \beta_i \text{ are non-zero},$$

where $k$ and $l$ are the numbers of wavelet coefficients of interest. The prior distribution on the non-zero $\beta_i$ is assumed to be i.i.d. $N(0, \tau^2)$ under $M_k$.
Since we have not identified the locations (indices) of the non-zero wavelet coefficients, we proceed as follows to describe the prior structure. With each $\beta_i$ we associate a binary variable $\gamma_i$, as in [41] for wavelet regression or as in [13] for variable selection in regression. Then $\gamma = (\gamma_1, \ldots, \gamma_n)$ is a Bernoulli sequence associated with the set of regression coefficients, $\beta$. Let

$$\Gamma_k = \left\{\gamma:\ \sum_{i=1}^n \gamma_i = k\right\}.$$

Finally, we let the $\gamma_i$ be i.i.d. Bernoulli (with $\pi(\gamma)$ the corresponding joint density), and define the following structure under $M_k$:

$$\pi(\gamma \mid M_k) = \pi(\gamma \mid \gamma \in \Gamma_k) = \binom{n}{k}^{-1}, \quad \gamma \in \Gamma_k, \qquad (37)$$

$$\beta_i \mid \gamma_i = 1 \ \stackrel{\text{i.i.d.}}{\sim}\ N(0, \tau^2), \qquad (38)$$

$$\beta_i = 0 \quad \text{if } \gamma_i = 0. \qquad (39)$$

The nuisance parameter $\sigma^2$, which is common under both models, is given the prior density $\pi(\sigma^2) \propto 1/\sigma^2$. Then it follows that

$$m_k(\hat{\beta}) = \sum_{\gamma \in \Gamma_k} \binom{n}{k}^{-1} \int\!\!\int \prod_{i=1}^n f(\hat{\beta}_i \mid \beta_i, \sigma^2)\,\pi(\beta \mid \gamma)\,\pi(\sigma^2)\, d\beta\, d\sigma^2. \qquad (40)$$

Note that only those $\beta_i$ for which $\gamma_i = 1$ appear in the integral above.
The two-stage MDL approach is clearly the easiest to take in this problem. As described earlier, it approximates $-\log m_k(\hat{\beta})$ by coding the prior and the likelihood (both evaluated at an estimate) separately and sums the code lengths to obtain the description length. In this case, discretizing the non-zero $\beta_i$ and $\sigma^2$ to a precision of $1/\sqrt{n}$ and ignoring terms that stay bounded as $n$ increases, this amounts to

$$k\log\frac{n}{k} + \frac{k}{2}\log n + \frac{n}{2}\log\hat{\sigma}^2_k, \qquad (41)$$

where the first term is obtained from Stirling's approximation to $\log\binom{n}{k}$ (the cost of describing which $k$ coefficients are non-zero), the second term $\frac{k}{2}\log n$ is for coding the $k$ non-zero $\hat{\beta}_i$'s, and $\hat{\sigma}^2_k$ is an estimate such as the MLE. On the other hand, computing SIC or $-\log m_k(\hat{\beta})$ is not an impossible task either. In fact, to integrate out the $\beta_i$ in $m_k(\hat{\beta})$ of Equation (40), we argue as follows. Given $\gamma$ and $\sigma^2$, the non-zero $\beta_i$ can be integrated out in closed form as in Example 1, since

$$\hat{\beta}_i \mid \gamma_i = 1, \sigma^2 \sim N(0, \sigma^2 + \tau^2) \quad \text{and} \quad \hat{\beta}_i \mid \gamma_i = 0, \sigma^2 \sim N(0, \sigma^2),$$

independently. Now, as we argued in Jeffreys' test, we take $\tau^2 = c\,\sigma^2$ (with $c$ a fixed constant), and integrate out $\sigma^2$ also. This leaves us with the following expression, where we have a sum over $\Gamma_k$:

$$m_k(\hat{\beta}) \propto \binom{n}{k}^{-1}(1 + c)^{-k/2}\sum_{\gamma \in \Gamma_k}\left[\sum_{i:\gamma_i = 0}\hat{\beta}_i^2 + \frac{1}{1 + c}\sum_{i:\gamma_i = 1}\hat{\beta}_i^2\right]^{-n/2}. \qquad (42)$$
The term $\sum_{\gamma \in \Gamma_k}[\cdot]^{-n/2}$ is interesting. Most of the contribution to this sum is expected from the $\gamma$ with $\gamma_i = 1$ corresponding to the largest $k$ of the $|\hat{\beta}_i|$, which yields the MLE of $\gamma$ upon normalization. The Bayes estimate, on the other hand, will arise from a weighted average of all the sums, with weights depending on the posterior probabilities of the corresponding $\gamma$. As is clear, a weighted average over the space $\Gamma_k$ is computationally very intensive when $n$ and $k$ are large. An appropriate approximation is indeed necessary, and MDL is important in that sense.
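To illustrate the two-stage criterion (41), the following sketch (our own illustration; it assumes the DWT has already been applied, so it works directly on a hypothetical vector of empirical coefficients, and the function names are ours) keeps the $k$ largest coefficients in absolute value and chooses $k$ to minimize (41).

```python
import numpy as np

def mdl_threshold(beta_hat):
    """Choose how many wavelet coefficients to keep by minimizing the two-stage
    MDL criterion (41): k*log(n/k) + (k/2)*log(n) + (n/2)*log(sigma2_hat_k)."""
    n = len(beta_hat)
    order = np.argsort(np.abs(beta_hat))[::-1]        # indices, largest |coefficient| first
    b2 = beta_hat[order] ** 2
    crit = [0.5 * n * np.log(b2.sum() / n)]           # k = 0: all coefficients treated as noise
    for k in range(1, n):
        sigma2 = b2[k:].sum() / n                     # energy of the discarded coefficients
        crit.append(k * np.log(n / k) + 0.5 * k * np.log(n) + 0.5 * n * np.log(sigma2))
    k_best = int(np.argmin(crit))
    thresholded = np.zeros_like(beta_hat)
    thresholded[order[:k_best]] = beta_hat[order[:k_best]]
    return k_best, thresholded

# Hypothetical sparse coefficient vector observed in noise
rng = np.random.default_rng(7)
beta = np.zeros(128); beta[:6] = [10, -9, 8, 7, -6, 5]
beta_hat = rng.permutation(beta) + rng.normal(scale=0.5, size=128)
print(mdl_threshold(beta_hat)[0])        # typically keeps about 6 coefficients here
```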
Even though we have justified the two-stage MDL for wavelet thresholding by showing that it is an approximation to a mixture MDL corresponding to a certain prior, a few questions related to this prior remain. First of all, the prior assumption that the non-zero $\beta_i$ are i.i.d. $N(0, \tau^2)$ is unreasonable; wavelet coefficients corresponding to wavelets at different levels of resolution must be modeled with different variances. Specifically, they should decrease as the resolution level increases, to indicate their decreasing importance (see [40,42,43]). Secondly, wavelet coefficients tend to cluster according to resolution levels (see [44]), so instead of independent normal priors, a multivariate normal prior with an appropriate dependence structure must be employed. These modifications can be easily implemented in the Bayesian approach, except that the resulting computational requirements may be substantial.
5.3. Change Point Problem
We shall now consider MDL methods for a problem which attempts to decide whether there is a change-point in a given time series data. We use the data on British road casualties available in [45], which examines the effects on casualty rates of the seat belt law introduced on 31 January 1983 in Great Britain.
We follow the approach of [46]. Let $y_1, \ldots, y_n$ be independent Poisson counts with $E(y_i) = \lambda_i$. The $\lambda_i$ are a priori considered related, and a joint multivariate normal prior distribution on their logarithms is assumed. Specifically, let $\theta_i = \log\lambda_i$ be the $i$th element of $\theta$ and suppose

$$\theta \sim N_n(\mu, \Sigma).$$

We model the change-point as the model selection problem:

$$M_0:\ \mu_1 = \mu_2 = \cdots = \mu_n$$

versus

$$M_1:\ \mu_1 = \cdots = \mu_m \neq \mu_{m+1} = \cdots = \mu_n,$$

where $m$ is the possible change-point. We further let the remaining prior components be i.i.d. normal. Note that the variances involved in this prior specification are hyperparameters.
First, we approximate the likelihood function, assuming all $y_i \geq 1$, as follows:

$$L(\theta) = \prod_{i=1}^n \frac{e^{-\lambda_i}\lambda_i^{y_i}}{y_i!} \propto \exp\left\{\sum_{i=1}^n\left(y_i\theta_i - e^{\theta_i}\right)\right\}, \qquad (43)$$

where $\theta_i = \log\lambda_i$. Expanding (43) about its maximum $\hat{\theta}$, with $\hat{\theta}_i = \log y_i$, in a Taylor series and ignoring higher order terms in $(\theta - \hat{\theta})$, we obtain

$$L(\theta) \approx \text{const} \times \exp\left\{-\frac{1}{2}(\theta - x)'D^{-1}(\theta - x)\right\}. \qquad (44)$$

What is appealing and useful about (44) is that it is proportional to the multivariate normal likelihood function (for $\theta$) with mean vector $x$ and covariance matrix $D$, where $x = (\log y_1, \ldots, \log y_n)'$ and $D = \text{diag}(1/y_1, \ldots, 1/y_n)$.
Thus hierarchical Bayesian analysis of multivariate normal linear models is applicable (see [15,36,47]). We note that the hyper-parameters do not have substantial influence and hence treat them as fixed constants (to be chosen based on some sensitivity analysis) in the following discussion. Consequently, denoting by $\pi_0$ and $\pi_1$ the respective prior densities of $\theta$ under $M_0$ and $M_1$,

$$m_i(y^n) \approx \int L(\theta)\,\pi_i(\theta)\, d\theta, \quad i = 0, 1, \qquad (45)$$

with $L(\theta)$ replaced by its normal approximation (44).
Now, from multivariate normal theory, observe that the resulting approximate distribution of $x$, with $\theta$ integrated out, is again multivariate normal (46), and subsequently that the remaining variance-type parameter can be integrated out as in Example 2. Thus, we obtain $m_0(y^n)$ as given in (47).
Expression (47) is not available, in general, in closed form. Approaching it from the MDL technique, we look for a subsequent approximation employing an ML-II type estimator (cf. [36]) for the remaining hyperparameter: from (47), the ML-II estimate is the value that maximizes (47). For fixed values of the other hyperparameters, this involves only examining a smooth function of a single variable, which is a simple computational task.
We proceed exactly as above to derive $m_1(y^n)$ also, after partitioning the required vectors and matrices at the change-point (the components with index at most $m$ and those with index greater than $m$); this yields the corresponding expression (48).
Figure 1. Plots of $\log m_0$ and $\log m_1$ for the British road casualties data. (a) LGV data; (b) HGV data.
As before, the MDL technique involves deriving the ML-II estimator from $m_1(y^n)$ in (48), for fixed values of the hyperparameters. Obtaining the ML-II estimate which maximizes $m_1(y^n)$ is very similar to the computation for $m_0(y^n)$.
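To give a flavour of the computation, here is a simplified sketch (our own construction, not the exact prior structure of [46]): it uses the normal approximation of (44), with $x_i = \log y_i$ and variance $1/y_i$, suppresses the extra prior variance of $\theta$ around its mean structure for simplicity, places independent $N(0, \tau^2)$ priors on the segment levels (a simplification of the level-plus-jump parameterization), and scans the candidate change-points; the maximizer plays the role of the ML-II estimate. The data shown are simulated.

```python
import numpy as np

def log_marginal_segment(x, w, tau2):
    """Log marginal of x_i ~ N(mu, 1/w_i) independently, mu ~ N(0, tau2), mu integrated out."""
    prec = 1.0 / tau2 + w.sum()
    mu_hat = (w * x).sum() / prec
    ll = 0.5 * np.sum(np.log(w / (2 * np.pi))) - 0.5 * np.sum(w * x**2)
    ll += -0.5 * np.log(tau2 * prec) + 0.5 * prec * mu_hat**2
    return ll

def change_point_evidence(y, tau2=10.0):
    """Compare no-change (M0) vs change-at-m (M1) via approximate log marginals,
    using the normal approximation of (44): x_i = log y_i, precision w_i = y_i."""
    y = np.asarray(y, dtype=float)
    x, w = np.log(y), y
    log_m0 = log_marginal_segment(x, w, tau2)
    n = len(y)
    log_m1 = np.full(n, -np.inf)
    for m in range(2, n - 1):                  # candidate change-points, non-trivial segments
        log_m1[m] = (log_marginal_segment(x[:m], w[:m], tau2)
                     + log_marginal_segment(x[m:], w[m:], tau2))
    m_hat = int(np.argmax(log_m1))             # ML-II type estimate of the change-point
    return log_m0, log_m1, m_hat

# Simulated counts with a drop in intensity after t = 60
rng = np.random.default_rng(0)
y = np.concatenate([rng.poisson(30, 60), rng.poisson(18, 40)])
log_m0, log_m1, m_hat = change_point_evidence(y)
print(m_hat, log_m1[m_hat] - log_m0)           # a large positive difference favours a change
```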
We have applied this technique to analyze the British road casualties data. Figures 1(a) and (b) show $\log m_0(y^n)$ and $\log m_1(y^n)$, as functions of the candidate change-point and for fixed values of the hyperparameters, for the LGV and HGV data, respectively. As mentioned previously, the hyperparameters do not seem to play any influential role; any reasonably small value of the one yields similar results, and any value of the other which is not too close to 0 behaves similarly. There seems to be strong evidence for a change-point in the intensity rate of casualties (induced by the 'seat belt law') in the case of the LGV data, whereas this is absent in the case of the HGV data. This is evident from the very high value of $\log m_1(y^n)$ near the ML-II estimate of the change-point for the LGV data.
There is a vast literature related to MDL, mostly in engineering and computer science. See [48] and the references listed there for the latest developments. See also [49]. [50] provides a review of MDL and SIC, and claims that SIC is the solution to "optimal universal coding problems". MDL techniques have not become very popular in statistics, but they seem to be quite useful in many applications.