1. Introduction
Generalized Least Squares (GLS, also called least squares with prior information) is a tool for statistical inference [1]-[6] that is widely used in geotomography [7]-[12] and geophysical inversion [13] [14], as well as in other areas of the physical sciences and engineering. One of the attractive features of GLS that makes it especially useful in the imaging of multidimensional fields (for example, density, velocity, viscosity) is its ability to implement, in a natural and versatile way, prior information about the behavior of the field. Widely-used types of prior information include the field being smooth, as quantified by its low-order derivatives [15], having a specified power spectral density or autocovariance [7] [15], and satisfying a specified partial differential equation (such as the geostrophic flow equation [16] or the diffusion equation [4]). The word "regularization" is sometimes used to describe the effect of prior information on the solution process [17].
We review the Generalized Least Squares (GLS) method here, following the notation in [6], in order to provide context and to establish nomenclature. In GLS, observations (or data) and prior information (or inferences) are combined to arrive at a best estimate of initially-unknown model parameters (which might, for example, represent a field sampled on a regular grid). The data are assumed to satisfy the linear equation $\mathbf{d} = \mathbf{G}\mathbf{m}$, where $\mathbf{d}$ is a length-$N$ vector of data, $\mathbf{m}$ is a length-$M$ vector of model parameters, and $\mathbf{G}$ is a known "kernel" matrix associated with the data. Prior information is assumed to satisfy a linear equation $\mathbf{h} = \mathbf{H}\mathbf{m}$, where $\mathbf{h}$ is a length-$K$ vector of prior values and $\mathbf{H}$ is a kernel matrix associated with the prior information. GLS problems are assumed to be over-determined, with $N + K > M$. For observed data $\mathbf{d}^{obs}$, known prior information $\mathbf{h}^{pri}$ and a specified model $\mathbf{m}$, the prediction error is $\mathbf{e} = \mathbf{d}^{obs} - \mathbf{G}\mathbf{m}$ and the prior information error is $\mathbf{l} = \mathbf{h}^{pri} - \mathbf{H}\mathbf{m}$. These errors are assumed to be Normally-distributed with zero mean and prior covariances $\mathbf{C}_d$ and $\mathbf{C}_h$, respectively. Then, the normalized errors $\mathbf{C}_d^{-1/2}\mathbf{e}$ and $\mathbf{C}_h^{-1/2}\mathbf{l}$ are independent and identically-distributed Normal random variables with zero mean and unit variance. Bayes' theorem can be used to show that the best estimate $\mathbf{m}^{est}$ of the solution is the one that minimizes the generalized error $\Phi = E + L$, with $E = \mathbf{e}^{\mathrm{T}}\mathbf{C}_d^{-1}\mathbf{e}$ and $L = \mathbf{l}^{\mathrm{T}}\mathbf{C}_h^{-1}\mathbf{l}$ [1] [2] [5]. The solution can be expressed in a variety of equivalent forms, among which is the widely-used version [6]:

$\mathbf{m}^{est} = \left[\mathbf{G}^{\mathrm{T}}\mathbf{C}_d^{-1}\mathbf{G} + \mathbf{H}^{\mathrm{T}}\mathbf{C}_h^{-1}\mathbf{H}\right]^{-1}\left[\mathbf{G}^{\mathrm{T}}\mathbf{C}_d^{-1}\mathbf{d}^{obs} + \mathbf{H}^{\mathrm{T}}\mathbf{C}_h^{-1}\mathbf{h}^{pri}\right]$ (1)
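For concreteness, the following minimal sketch (not the code used in this study) evaluates (1) numerically with NumPy; the toy straight-line problem and all numerical values in it are assumptions chosen purely for illustration.

```python
import numpy as np

def gls_solve(G, d_obs, Cd, H, h_pri, Ch):
    """Generalized Least Squares estimate, Equation (1):
    m_est = [G' Cd^-1 G + H' Ch^-1 H]^-1 [G' Cd^-1 d + H' Ch^-1 h]."""
    Cd_inv = np.linalg.inv(Cd)
    Ch_inv = np.linalg.inv(Ch)
    A = G.T @ Cd_inv @ G + H.T @ Ch_inv @ H
    b = G.T @ Cd_inv @ d_obs + H.T @ Ch_inv @ h_pri
    return np.linalg.solve(A, b)

# Toy example (assumed): straight-line fit with weak prior that the model is small.
x = np.linspace(0.0, 1.0, 20)
G = np.column_stack([np.ones_like(x), x])          # d = m1 + m2*x
d_obs = 1.0 + 2.0 * x + 0.05 * np.random.randn(20)
Cd = 0.05**2 * np.eye(20)                          # uncorrelated data errors
H, h_pri = np.eye(2), np.zeros(2)                  # prior: m is near zero...
Ch = 100.0 * np.eye(2)                             # ...but only weakly enforced
m_est = gls_solve(G, d_obs, Cd, H, h_pri, Ch)
print(m_est)
```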
The assumption of linear kernels $\mathbf{G}$ and $\mathbf{H}$ is a very restrictive one. In the well-studied nonlinear generalization [1] [6], the products $\mathbf{G}\mathbf{m}$ and $\mathbf{H}\mathbf{m}$ are replaced with vector functions $\mathbf{g}(\mathbf{m})$ and $\mathbf{h}(\mathbf{m})$. Then, a common solution method is to linearize the data and prior information equations around a trial solution $\mathbf{m}^{(i)}$:

$\mathbf{G}^{(i)}\Delta\mathbf{m} = \mathbf{d}^{obs} - \mathbf{g}\left(\mathbf{m}^{(i)}\right) \quad\text{with}\quad \left[\mathbf{G}^{(i)}\right]_{jk} = \left.\frac{\partial g_j}{\partial m_k}\right|_{\mathbf{m}^{(i)}}$ (2)

and $\mathbf{H}^{(i)}\Delta\mathbf{m} = \mathbf{h}^{pri} - \mathbf{h}\left(\mathbf{m}^{(i)}\right)$, with $\mathbf{H}^{(i)}$ defined analogously. The solution is then found by iterative application of (1) to (2); that is, by the Gauss-Newton method [3]. Alternatively, a gradient-descent method [18] can be used that employs the gradient of the generalized error:

$\frac{\partial \Phi}{\partial \mathbf{m}} = -2\,\mathbf{G}^{(i)\mathrm{T}}\mathbf{C}_d^{-1}\left(\mathbf{d}^{obs} - \mathbf{g}(\mathbf{m})\right) - 2\,\mathbf{H}^{(i)\mathrm{T}}\mathbf{C}_h^{-1}\left(\mathbf{h}^{pri} - \mathbf{h}(\mathbf{m})\right)$ (3)

The latter approach is preferred for very large $M$, since the convergence rate of gradient descent is independent of the problem dimension [18], whereas the effort required to solve the $M \times M$ system (1) by a direct method scales as $M^3$ [19].
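The iterative scheme just described can be summarized in a few lines. The sketch below is a schematic Gauss-Newton loop for nonlinear GLS; the toy kernels (an exponential data function and a weak smallness prior), function names, and numerical values are illustrative assumptions, not taken from this paper.

```python
import numpy as np

def gauss_newton_gls(g, G_of_m, d_obs, Cd, h, H_of_m, h_pri, Ch, m0, n_iter=10):
    """Iterative GLS for nonlinear kernels (schematic): at each step the data and
    prior equations are linearized about the trial solution m, and the linear
    GLS formula (1) is applied to the perturbation delta_m."""
    Cdi, Chi = np.linalg.inv(Cd), np.linalg.inv(Ch)
    m = m0.copy()
    for _ in range(n_iter):
        G, H = G_of_m(m), H_of_m(m)           # linearized kernels at the trial solution
        dd, dh = d_obs - g(m), h_pri - h(m)   # data and prior misfits
        A = G.T @ Cdi @ G + H.T @ Chi @ H
        b = G.T @ Cdi @ dd + H.T @ Chi @ dh
        m = m + np.linalg.solve(A, b)         # update the trial solution
    return m

# Toy nonlinear problem (assumed): d_i = exp(m1 * x_i), weak prior that m is small.
x = np.linspace(0.0, 1.0, 15)
g = lambda m: np.exp(m[0] * x)
G_of_m = lambda m: (x * np.exp(m[0] * x)).reshape(-1, 1)
h = lambda m: m
H_of_m = lambda m: np.eye(1)
d_obs = np.exp(0.8 * x) + 0.01 * np.random.randn(15)
m_est = gauss_newton_gls(g, G_of_m, d_obs, 0.01**2 * np.eye(15),
                         h, H_of_m, np.zeros(1), 100.0 * np.eye(1),
                         m0=np.array([0.1]))
print(m_est)
```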
We now discuss issues related to the covariance matrices that appear in GLS. The data covariance $\mathbf{C}_d$ quantifies the uncertainty of the observations and the information covariance $\mathbf{C}_h$ quantifies the uncertainty of the prior information. Prior knowledge of the inherent accuracy of the measurement technique is needed to assign $\mathbf{C}_d$, and prior knowledge of the physically-plausible solutions, perhaps stemming from an understanding of the underlying physics, is needed to assign $\mathbf{C}_h$. These assignments are often very subjective, especially when correlations are believed to occur (that is, when $\mathbf{C}_d$ and $\mathbf{C}_h$ have non-zero off-diagonal elements). For example, one geotomographic study [7] reconstructs a two-dimensional field using a $\mathbf{C}_h$ that represents the autocovariance of the field and that depends upon a scale length q. The value of q is chosen on the basis of broad physical arguments that, while plausible, leave considerable room for subjectivity.
The matrices $\mathbf{C}_d$ and $\mathbf{C}_h$ together contain on the order of $N^2 + K^2$ elements, many more than the $N + K$ constraints imposed by the data $\mathbf{d}^{obs}$ and prior information $\mathbf{h}^{pri}$. Consequently, insufficient information is available to uniquely solve for all the elements of $\mathbf{C}_d$ and $\mathbf{C}_h$. However, it sometimes may be possible to parameterize $\mathbf{C}_d$ and/or $\mathbf{C}_h$ in terms of a small number J of covariance parameters $\mathbf{q}$, and to ask whether an initial estimate of $\mathbf{q}$ can be improved. As long as $J \ll N + K$, adequate information may be available to determine a best estimate $\mathbf{q}^{est}$. We refer to the process of determining $\mathbf{q}^{est}$ as "tuning", since in typical practice it requires that the covariances already be close to their true values.
As an example of a parametrized covariance, we consider the case where the model parameters represent a sampled version of a continuous function $m(x)$, where $x$ is an independent variable; that is, $m_i = m(x_i)$, with $x_i = (i-1)\Delta x$ and $\Delta x$ the sampling interval. The prior information that $m(x)$ is approximately oscillatory with wavenumber q can be modeled by:

(4)

In this case, $\mathbf{C}_h$ approximates the autocovariance of $m(x)$, which is assumed to be stationary. The goal of tuning is to provide a best estimate $q^{est}$ of the wavenumber, as well as a best estimate $\mathbf{m}^{est}$ of the model parameters. This problem is further developed in Example 4, below.
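Since the precise functional form of (4) is problem-specific, the sketch below shows one plausible way to encode such oscillatory prior information: a cosine autocovariance controlled by the wavenumber q, with a small diagonal term added so that $\mathbf{C}_h$ remains invertible. The amplitude and stabilizing constant are illustrative assumptions, not values from this paper.

```python
import numpy as np

def oscillatory_Ch(x, q, sigma_h=1.0, eps=1.0e-3):
    """Prior covariance encoding 'm(x) is approximately oscillatory with
    wavenumber q': a cosine autocovariance (illustrative form, not Eq. (4)
    verbatim), plus a small diagonal term eps so Ch stays positive-definite."""
    dx = x[:, None] - x[None, :]               # matrix of offsets x_i - x_j
    Ch = sigma_h**2 * np.cos(q * dx)
    return Ch + eps * np.eye(len(x))

def dCh_dq(x, q, sigma_h=1.0):
    """Analytic derivative of the cosine autocovariance with respect to q."""
    dx = x[:, None] - x[None, :]
    return -sigma_h**2 * dx * np.sin(q * dx)

x = np.arange(0.0, 10.0, 0.25)                 # sampling points x_i = (i-1)*dx
Ch = oscillatory_Ch(x, q=2.0)
print(np.linalg.cond(Ch))                      # finite, thanks to the eps term
```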
Although the GLS formulation is widely used in geotomography and geophysical imaging, the tuning of variance is typically implemented in a very limited fashion, through the use of trade-off curves [7]-[12]. In this procedure, a scalar parameter q controls the relative size of $\mathbf{C}_d$ and $\mathbf{C}_h$; that is, $\mathbf{C}_h = q\,\mathbf{C}_h^0$, where $\mathbf{C}_h^0$ is specified [20]. The GLS problem is then solved for a suite of qs, the functions $E(q)$ and $L(q)$ are tabulated, and the resulting trade-off curve $L(E)$ is used to identify a solution $\mathbf{m}^{est}(q)$ that has acceptably low E and L (for example, Figure 1 of [20]). As we will show below, this ad hoc procedure is not a consistent extension of GLS, because it results in a different q than the one implied by Bayes' principle. A more consistent approach is to apply Bayes' theorem directly to estimate both the model parameters $\mathbf{m}$ and the covariance parameters $\mathbf{q}$. Such an approach has been implemented in the context of ordinary least squares [21] and the Markov chain Monte Carlo (MCMC) inversion method [22] (which is a computationally-intensive alternative to GLS). An important and novel result of this paper is a computationally-efficient procedure for tuning GLS in a Bayes-consistent manner.
2. Bayesian Extension of GLS
The general process of using Bayes' theorem to construct a posterior probability density function (p.d.f.) that depends on unknown parameters, and of estimating those parameters through the maximization of probability, is very well understood [23]. In the current case, the p.d.f. has M model parameters and J covariance parameters, so the maximization process (implemented, say, with a gradient ascent method) must search an $(M+J)$-dimensional space. Our main purpose here is to show that the process can be organized in a way that makes use of the GLS solution (1) and thus reduces the dimensionality of the searched space to J.
The GLS solution (1) yields the $\mathbf{m}^{est}$ that minimizes the generalized error $\Phi = E + L$, or equivalently, the $\mathbf{m}^{est}$ that maximizes the Normal posterior probability density function (p.d.f.) $p(\mathbf{m}|\mathbf{d}^{obs})$:

$p(\mathbf{m}|\mathbf{d}^{obs}) \propto p(\mathbf{d}^{obs}|\mathbf{m})\, p_h(\mathbf{m})$ (5)

Here, Bayes' theorem [23] is used to relate the Normal posterior p.d.f. $p(\mathbf{m}|\mathbf{d}^{obs})$ to the Normal likelihood $p(\mathbf{d}^{obs}|\mathbf{m})$ and the Normal prior $p_h(\mathbf{m})$. When poorly known covariance parameters $\mathbf{q}$ are added to the problem, they must be treated as additional random variables [22]. Writing $\mathbf{q} = (\mathbf{q}_d, \mathbf{q}_h)$, with $\mathbf{q}_d$ appearing in the likelihood and $\mathbf{q}_h$ appearing in the prior, we have:

$p(\mathbf{m}, \mathbf{q}|\mathbf{d}^{obs}) \propto p(\mathbf{d}^{obs}|\mathbf{m}, \mathbf{q}_d)\, p_h(\mathbf{m}|\mathbf{q}_h)\, p(\mathbf{q})$ (6)

Here, we have assumed that $\mathbf{m}$ and $\mathbf{q}$ are not correlated with one another. The maximization with respect to the two variables can be performed as a sequence of two single-variable maximizations:

$\mathbf{m}^{est}(\mathbf{q}) = \underset{\mathbf{m}}{\arg\max}\; p(\mathbf{m}, \mathbf{q}|\mathbf{d}^{obs})$ (7a)

$\mathbf{q}^{est} = \underset{\mathbf{q}}{\arg\max}\; p\left(\mathbf{m}^{est}(\mathbf{q}), \mathbf{q}|\mathbf{d}^{obs}\right)$ (7b)

$\mathbf{m}^{est} = \mathbf{m}^{est}\left(\mathbf{q}^{est}\right)$ (7c)

In the special case of the uniform prior $p(\mathbf{q}) \propto \text{const}$, the maximization in (7a) is the GLS solution at fixed $\mathbf{q}$. For the Normal p.d.f.:

$p(\mathbf{m}, \mathbf{q}|\mathbf{d}^{obs}) \propto \left(\det\mathbf{C}_d\,\det\mathbf{C}_h\right)^{-1/2}\exp\left\{-\tfrac{1}{2}\left(E + L\right)\right\}$ (8)

the maximization (7b) is equivalent to the minimization of an objective function $\Psi(\mathbf{q})$, defined as:

$\Psi(\mathbf{q}) = E + L + \ln\det\mathbf{C}_d + \ln\det\mathbf{C}_h$ (9)

where E and L are evaluated at the GLS solution $\mathbf{m}^{est}(\mathbf{q})$. The quantity $\ln\det\mathbf{C}_d$ is best computed by finding the Cholesky decomposition $\mathbf{C}_d = \mathbf{R}^{\mathrm{T}}\mathbf{R}$, the algorithm [24] for which is implemented in many software environments, including MATLAB® and PYTHON/linalg. Then, $\ln\det\mathbf{C}_d = 2\sum_i \ln R_{ii}$ (and similarly for $\ln\det\mathbf{C}_h$). The nonlinear optimization problem of minimizing $\Psi(\mathbf{q})$ can be implemented using a gradient descent method, provided that the derivative $\partial\Psi/\partial q_i$ can be calculated [18]. In the next section, we derive analytic formulas for this and related derivatives.
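As a concrete illustration of the log-determinant computation (an assumed NumPy example, not the paper's own code), the Cholesky factor gives ln det C without ever forming the determinant itself:

```python
import numpy as np

def logdet_via_cholesky(C):
    """ln(det C) for a symmetric positive-definite matrix C.
    With C = L L^T (Cholesky), det C = (prod_i L_ii)^2, so
    ln det C = 2 * sum_i ln L_ii -- no overflow-prone determinant needed."""
    L = np.linalg.cholesky(C)          # lower-triangular factor
    return 2.0 * np.sum(np.log(np.diag(L)))

# quick check against the direct formula on a small, well-conditioned matrix
C = np.array([[4.0, 1.0], [1.0, 3.0]])
print(logdet_via_cholesky(C), np.log(np.linalg.det(C)))
```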
3. Solution Method and Formulas for Derivatives
The process of simultaneously estimating the covariance parameters $\mathbf{q}$ and model parameters $\mathbf{m}$ consists of six steps. First, the analytic forms of the covariance matrices $\mathbf{C}_d(\mathbf{q})$ and $\mathbf{C}_h(\mathbf{q})$ are specified, and their derivatives $\partial\mathbf{C}_d/\partial q_i$ and $\partial\mathbf{C}_h/\partial q_i$ are computed analytically. Second, an initial estimate $\mathbf{q}^{(0)}$ is identified. Third, the covariance matrices $\mathbf{C}_d(\mathbf{q})$ and $\mathbf{C}_h(\mathbf{q})$ are inserted into (1), yielding model parameters $\mathbf{m}^{est}(\mathbf{q})$. Fourth, using formulas developed below, the value of the derivative $\partial\Psi/\partial q_i$ is calculated at the current $\mathbf{q}$. Fifth, a gradient descent method employing $\partial\Psi/\partial q_i$ is used to iteratively perturb $\mathbf{q}$ towards the minimum of $\Psi$ at $\mathbf{q}^{est}$ (in the process repeating steps three through five many times). Sixth, the estimated model parameters are computed as $\mathbf{m}^{est} = \mathbf{m}^{est}(\mathbf{q}^{est})$. This process is depicted in Figure 1.
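The six steps can be compressed into a short driver. The sketch below is schematic and makes several simplifying assumptions: the problem is a toy straight-line fit, only the data covariance is parameterized (by a single scale factor), and a quasi-Newton minimizer with a finite-difference gradient stands in for the analytic gradient descent described in this section.

```python
import numpy as np
from scipy.optimize import minimize

# --- fixed toy problem (assumed, for illustration only) ---------------------
rng = np.random.default_rng(0)
N, M = 30, 2
x = np.linspace(0.0, 1.0, N)
G = np.column_stack([np.ones(N), x])
d_obs = G @ np.array([1.0, 2.0]) + 0.1 * rng.standard_normal(N)
H, h_pri = np.eye(M), np.zeros(M)
Ch = 1.0e3 * np.eye(M)                     # weak, fixed prior covariance

def Cd_of_q(q):
    """Step 1 (assumed parameterization): data variance scaled by q[0]."""
    return q[0] * np.eye(N)

def gls(Cd):
    """Step 3: GLS solution (1) for the current covariances."""
    Cdi, Chi = np.linalg.inv(Cd), np.linalg.inv(Ch)
    A = G.T @ Cdi @ G + H.T @ Chi @ H
    b = G.T @ Cdi @ d_obs + H.T @ Chi @ h_pri
    return np.linalg.solve(A, b)

def logdet(C):
    L = np.linalg.cholesky(C)
    return 2.0 * np.sum(np.log(np.diag(L)))

def Psi(q):
    """Objective (9): generalized error plus log-determinant terms."""
    Cd = Cd_of_q(q)
    m = gls(Cd)
    e, l = d_obs - G @ m, h_pri - H @ m
    E = e @ np.linalg.solve(Cd, e)
    L = l @ np.linalg.solve(Ch, l)
    return E + L + logdet(Cd) + logdet(Ch)

# Steps 2, 4, 5: start from an initial guess and descend on Psi; a
# finite-difference gradient stands in for the analytic derivatives here.
res = minimize(Psi, x0=[1.0], method="L-BFGS-B", bounds=[(1e-6, None)])
q_est = res.x
m_est = gls(Cd_of_q(q_est))                # Step 6
print(q_est, m_est)
```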
Our derivation of $\partial\Psi/\partial q_i$ uses three matrix derivatives, $\partial\mathbf{A}^{-1}/\partial q_i$, $\partial\mathbf{A}^{-1/2}/\partial q_i$ and $\partial\left(\ln\det\mathbf{A}\right)/\partial q_i$, that may be unfamiliar to some readers, so we derive them here for completeness. Let $\mathbf{A}(\mathbf{q})$ be a square, invertible, differentiable matrix. Differentiating $\mathbf{A}\mathbf{A}^{-1} = \mathbf{I}$ yields $\frac{\partial\mathbf{A}}{\partial q_i}\mathbf{A}^{-1} + \mathbf{A}\frac{\partial\mathbf{A}^{-1}}{\partial q_i} = 0$, which can be rearranged into ( [25], their (36)):

$\frac{\partial\mathbf{A}^{-1}}{\partial q_i} = -\mathbf{A}^{-1}\frac{\partial\mathbf{A}}{\partial q_i}\mathbf{A}^{-1}$ (10)
Figure 1. Schematic depiction of the solution process. (a) The GLS solution $\mathbf{m}^{est}(\mathbf{q})$ (red curve) is considered a function of the covariance parameters $\mathbf{q}$, and its derivative $\partial\mathbf{m}^{est}/\partial q_i$ (blue line) at a point $\mathbf{q}^{(0)}$ is computed by analytic differentiation of GLS equation (1); (b) The objective function Ψ (colors) is considered a function of $\mathbf{q}$. The results of (a) are used to compute its gradient $\partial\Psi/\partial q_i$ at the point $\mathbf{q}^{(0)}$. The gradient descent method is used to iteratively perturb this point anti-parallel to the gradient until it reaches the minimum of the objective function, resulting in the best-estimate $\mathbf{q}^{est}$. This value is then used to determine a best-estimate of the model parameters $\mathbf{m}^{est}$, as depicted in (a).
Similarly, differentiating $\mathbf{A}^{-1/2}\mathbf{A}^{-1/2} = \mathbf{A}^{-1}$ and applying (10) yields the Sylvester equation:

$\frac{\partial\mathbf{A}^{-1/2}}{\partial q_i}\mathbf{A}^{-1/2} + \mathbf{A}^{-1/2}\frac{\partial\mathbf{A}^{-1/2}}{\partial q_i} = -\mathbf{A}^{-1}\frac{\partial\mathbf{A}}{\partial q_i}\mathbf{A}^{-1}$ (11)

We have not been able to determine a source for this equation, but in all likelihood it has been derived previously. In practice, (11) is not significantly harder to compute than (10), because efficient algorithms for solving Sylvester equations [26] and for computing a symmetric (principal) square root [27] are widely available and implemented in many software environments, including MATLAB® and PYTHON/linalg. The derivative of $\ln\det\mathbf{A}$ is derived starting with Jacobi's formula [12]:

$\frac{\partial}{\partial q_i}\det\mathbf{A} = \mathrm{tr}\left(\mathrm{adj}\left(\mathbf{A}\right)\frac{\partial\mathbf{A}}{\partial q_i}\right)$ (12)

where $\mathrm{adj}(\mathbf{A})$ is the adjugate and $\mathrm{tr}$ is the trace, applying Laplace's identity [28] $\mathrm{adj}(\mathbf{A}) = \det(\mathbf{A})\,\mathbf{A}^{-1}$ and the rule $\mathrm{tr}(c\mathbf{B}) = c\,\mathrm{tr}(\mathbf{B})$ (where c is a scalar and $\mathbf{B}$ is a matrix) [29]. Finally, the determinant is moved to the left-hand side and the well-known relationship $\frac{\partial}{\partial q}\ln f = \frac{1}{f}\frac{\partial f}{\partial q}$, for a differentiable function $f(q)$, is applied, yielding ( [25], their (38)):

$\frac{\partial}{\partial q_i}\ln\det\mathbf{A} = \mathrm{tr}\left(\mathbf{A}^{-1}\frac{\partial\mathbf{A}}{\partial q_i}\right)$ (13)
We begin the main derivation by considering the case in which the data covariance $\mathbf{C}_d$ depends on a parameter vector $\mathbf{q}$, and the information covariance $\mathbf{C}_h$ is constant. The derivative of the GLS solution can be found by applying the chain rule to (1):

$\frac{\partial\mathbf{m}^{est}}{\partial q_i} = \mathbf{A}^{-1}\mathbf{G}^{\mathrm{T}}\frac{\partial\mathbf{C}_d^{-1}}{\partial q_i}\left(\mathbf{d}^{obs} - \mathbf{G}\mathbf{m}^{est}\right) \quad\text{with}\quad \mathbf{A} \equiv \mathbf{G}^{\mathrm{T}}\mathbf{C}_d^{-1}\mathbf{G} + \mathbf{H}^{\mathrm{T}}\mathbf{C}_h^{-1}\mathbf{H} \quad\text{and}\quad \frac{\partial\mathbf{C}_d^{-1}}{\partial q_i} = -\mathbf{C}_d^{-1}\frac{\partial\mathbf{C}_d}{\partial q_i}\mathbf{C}_d^{-1}$ (14)

Note that we have used (10). The derivatives of the normalized prediction error $\tilde{\mathbf{e}} = \mathbf{C}_d^{-1/2}\mathbf{e}$ (with $\mathbf{e} = \mathbf{d}^{obs} - \mathbf{G}\mathbf{m}^{est}$) and of the total error $E = \tilde{\mathbf{e}}^{\mathrm{T}}\tilde{\mathbf{e}}$ are:

$\frac{\partial\tilde{\mathbf{e}}}{\partial q_i} = \frac{\partial\mathbf{C}_d^{-1/2}}{\partial q_i}\mathbf{e} - \mathbf{C}_d^{-1/2}\mathbf{G}\frac{\partial\mathbf{m}^{est}}{\partial q_i} \quad\text{and}\quad \frac{\partial E}{\partial q_i} = 2\,\tilde{\mathbf{e}}^{\mathrm{T}}\frac{\partial\tilde{\mathbf{e}}}{\partial q_i}$ (15)

Here, the derivative $\partial\mathbf{C}_d^{-1/2}/\partial q_i$ is obtained by solving the Sylvester Equation (11). An alternate way of differentiating E that does not require solving a Sylvester equation is:

$\frac{\partial E}{\partial q_i} = \mathbf{e}^{\mathrm{T}}\frac{\partial\mathbf{C}_d^{-1}}{\partial q_i}\mathbf{e} - 2\,\mathbf{e}^{\mathrm{T}}\mathbf{C}_d^{-1}\mathbf{G}\frac{\partial\mathbf{m}^{est}}{\partial q_i}$ (16)

The derivatives of the normalized error in prior information $\tilde{\mathbf{l}} = \mathbf{C}_h^{-1/2}\mathbf{l}$ (with $\mathbf{l} = \mathbf{h}^{pri} - \mathbf{H}\mathbf{m}^{est}$) and of the total error $L = \tilde{\mathbf{l}}^{\mathrm{T}}\tilde{\mathbf{l}}$ are:

$\frac{\partial\tilde{\mathbf{l}}}{\partial q_i} = -\mathbf{C}_h^{-1/2}\mathbf{H}\frac{\partial\mathbf{m}^{est}}{\partial q_i} \quad\text{and}\quad \frac{\partial L}{\partial q_i} = 2\,\tilde{\mathbf{l}}^{\mathrm{T}}\frac{\partial\tilde{\mathbf{l}}}{\partial q_i}$ (17)

Finally, since $\Psi = E + L + \ln\det\mathbf{C}_d + \ln\det\mathbf{C}_h$ and $\mathbf{C}_h$ is constant, we have:

$\frac{\partial\Psi}{\partial q_i} = \frac{\partial E}{\partial q_i} + \frac{\partial L}{\partial q_i} + \mathrm{tr}\left(\mathbf{C}_d^{-1}\frac{\partial\mathbf{C}_d}{\partial q_i}\right)$ (18)

Note that we have applied (13).
Finally, we consider the case in which the information covariance $\mathbf{C}_h$ depends on parameters $\mathbf{q}$, and $\mathbf{C}_d$ is constant. Since the data and prior information play completely symmetric roles in (1), the derivatives can be obtained by interchanging the roles of $\mathbf{C}_d$ and $\mathbf{C}_h$, $\mathbf{G}$ and $\mathbf{H}$, $\mathbf{d}^{obs}$ and $\mathbf{h}^{pri}$, $\mathbf{e}$ and $\mathbf{l}$, and E and L in the equations above, yielding:

(19)

These formulas have been checked numerically.
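A numerical check of this kind is easy to reproduce. The sketch below (an assumed stand-in for the test code referred to here) compares the analytic derivatives (10) and (13) with centered finite differences for a small parameterized matrix:

```python
import numpy as np

def A_of_q(q):
    """Small symmetric positive-definite test matrix depending on a scalar q."""
    return np.array([[2.0 + q, 0.5 * q],
                     [0.5 * q, 3.0]])

def dA_dq(q):
    """Analytic derivative of A with respect to q."""
    return np.array([[1.0, 0.5],
                     [0.5, 0.0]])

q, dq = 0.7, 1.0e-6
A, dA = A_of_q(q), dA_dq(q)

# Equation (10): d(A^-1)/dq = -A^-1 (dA/dq) A^-1
lhs_10 = (np.linalg.inv(A_of_q(q + dq)) - np.linalg.inv(A_of_q(q - dq))) / (2 * dq)
rhs_10 = -np.linalg.inv(A) @ dA @ np.linalg.inv(A)

# Equation (13): d(ln det A)/dq = tr(A^-1 dA/dq)
lhs_13 = (np.log(np.linalg.det(A_of_q(q + dq)))
          - np.log(np.linalg.det(A_of_q(q - dq)))) / (2 * dq)
rhs_13 = np.trace(np.linalg.inv(A) @ dA)

print(np.max(np.abs(lhs_10 - rhs_10)), abs(lhs_13 - rhs_13))
```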
4. Examples with Discussion
In the first example, we examine the simplistic case in which the parameter q represents an overall scaling of variance; that is, $\mathbf{C}_d = q\,\mathbf{C}_d^0$ and $\mathbf{C}_h = q\,\mathbf{C}_h^0$, with specified $\mathbf{C}_d^0$ and $\mathbf{C}_h^0$. The solution $\mathbf{m}^{est}$ is independent of q, as can be verified by substitution into (1). The parameter q can then be found by direct minimization of (9), which simplifies to:

$\Psi(q) = q^{-1}\left(E_0 + L_0\right) + \left(N + K\right)\ln q + \ln\det\mathbf{C}_d^0 + \ln\det\mathbf{C}_h^0$ (20)

Here, we have used the rule $\ln\det\left(q\mathbf{B}\right) = P\ln q + \ln\det\mathbf{B}$ [25], valid for any $P \times P$ matrix $\mathbf{B}$, and have defined $E_0 = \mathbf{e}^{\mathrm{T}}\left(\mathbf{C}_d^0\right)^{-1}\mathbf{e}$ and $L_0 = \mathbf{l}^{\mathrm{T}}\left(\mathbf{C}_h^0\right)^{-1}\mathbf{l}$. The minimum occurs when:

$\frac{\partial\Psi}{\partial q} = 0 \quad\Rightarrow\quad q^{est} = \frac{E_0 + L_0}{N + K}$ (21)

This is a generalization of the well-known maximum likelihood estimate of the sample variance [30]. As long as the GLS solution $\mathbf{m}^{est}$ exists, the minimization in (21) is well-behaved and the overall scaling q is uniquely determined.
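As a quick check of this correspondence (an illustrative special case, not taken from the paper), suppose there is no prior information ($K = 0$) and the reference covariance is the identity, $\mathbf{C}_d^0 = \mathbf{I}$; then (21) reduces to

$q^{est} = \frac{E_0}{N} = \frac{1}{N}\sum_{i=1}^{N} e_i^2$

which is the familiar maximum-likelihood estimate of the variance of the N residuals.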
In the second example, we examine another simplistic case, in which a parameter q represents the relative weighting of the data and prior-information variances. We consider the problem of estimating the mean of the data, given observations $\mathbf{d}^{obs}$ and prior information $\mathbf{h}^{pri}$, when both covariances are uncorrelated and their relative size is controlled by q. Applying (1), we find a closed-form expression for the estimated mean as a function of q. The objective function $\Psi(q)$ and its derivative then follow from (9), and the condition $\partial\Psi/\partial q = 0$ can be solved for $q^{est}$, as can be verified by direct substitution. Thus, the solution splits the difference between the observations and the prior values, and yields prior variances $\mathbf{C}_d$ and $\mathbf{C}_h$ that are equal. While simplistic, this problem illustrates that, at least in some cases, GLS is capable of uniquely determining the relative sizes of $\mathbf{C}_d$ and $\mathbf{C}_h$. Because trade-off curves, as defined in the Introduction, are based on the behavior of E and L, and not on the complete objective function Ψ, the weighting parameter q estimated from them will in general be different from $q^{est}$. Consequently, the trade-off curve procedure is not consistent with the Bayesian framework upon which GLS rests.
Our third example demonstrates the tuning of the data covariance $\mathbf{C}_d$. In many cases, observational error increases during the course of an experiment, due to degradation of equipment or to worsening environmental conditions. The example demonstrates that the method is capable of accurately quantifying the fractional rate of increase p of the variance, which is assumed to vary with position x. In our simulation, we consider N synthetic data, evenly-spaced on an interval in x, which scatter around a curve $d(x)$ defined by two model parameters (Figure 2). The covariance of the data is modeled as $\left[\mathbf{C}_d\right]_{ij} = \sigma_d^2\left(1 + p\,x_i\right)\delta_{ij}$, where $\sigma_d^2$ is the variance at $x = 0$ and $\delta_{ij}$ is the Kronecker delta; that is, the data are uncorrelated and their variance increases linearly with x. The derivative of the covariance is $\partial\left[\mathbf{C}_d\right]_{ij}/\partial p = \sigma_d^2\,x_i\,\delta_{ij}$. We have included prior information with $\mathbf{H} = \mathbf{I}$ and $\mathbf{h}^{pri} = \mathbf{0}$, which implements the notion that the model parameters are small. The corresponding covariance $\mathbf{C}_h$ is chosen to be large, indicating that this information is weak. The goal is to tune the rate of increase of variance and to arrive at a best estimate of the two model parameters. The starting value is taken to be $p = 0$, which corresponds to uniform variance. It is successively improved by a gradient descent method that minimizes Ψ, yielding an estimated value $p^{est}$. This estimate differs from the true value $p^{true}$ by about 1%. The estimated solution $\mathbf{m}^{est}$ differs from the true solution by a few tenths of a percent, which may be significant in some applications.
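A sketch of a covariance parameterization of this kind is given below; the linear-in-x variance follows the description above, but the constants and grid are illustrative assumptions rather than the settings used in the simulation.

```python
import numpy as np

def Cd_linear_in_x(x, p, sigma2=1.0):
    """Uncorrelated data whose variance grows linearly with position x:
    [Cd]_ii = sigma2 * (1 + p * x_i)  (illustrative form following the text)."""
    return np.diag(sigma2 * (1.0 + p * x))

def dCd_dp(x, p, sigma2=1.0):
    """Analytic derivative of Cd with respect to the rate parameter p."""
    return np.diag(sigma2 * x)

x = np.linspace(0.0, 10.0, 50)
Cd0 = Cd_linear_in_x(x, p=0.0)     # starting value p = 0: uniform variance
Cd1 = Cd_linear_in_x(x, p=0.2)     # increasing variance
print(np.diag(Cd0)[:3], np.diag(Cd1)[-3:])
```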
Figure 2. Example of tuning $\mathbf{C}_d$. (a) Plot of synthetic data (red dots) and predicted data (green curve); (b) The starting value $p^{(0)} = 0$ corresponds to uniform variance (black curve). The estimate $p^{est}$ corresponds to increasing variance (green curve); (c) Generalized error as a function of p (black curve). The starting value (black circle) is successively improved (red circles) by a gradient descent method, yielding an estimate $p^{est}$ (green circle); (d) The gradient $\partial\Psi/\partial p$, computed using the formulas developed in the text; (e) The first model parameter $m_1$, highlighting the initial value (black circle) and estimated value (green circle); (f) Same as (e), except for the second model parameter $m_2$.
The fourth example demonstrates tuning of the information covariance $\mathbf{C}_h$. In many instances, one may need to "reconstruct" or "interpolate" a function on the basis of unevenly and sparsely sampled data. In this case, prior information on the autocovariance of the function can enable a smooth interpolation. Furthermore, it can enforce a covariance structure that may be required, say, by the underlying physics of the problem. In our example, we suppose that the function is known to be oscillatory on physical grounds, but that the wavenumber of those oscillations is known only imprecisely. The goal is to tune prior knowledge of the wavenumber to arrive at a best estimate of the reconstructed function. In our simulation, a total of M model parameters $m_i = m(x_i)$ are uniformly spaced on an interval in x, representing a sampled version of a continuous, sinusoidal function $m(x)$ with wavenumber $q^{true}$ (Figure 3). Synthetic data $d_i$, with uncorrelated error of variance $\sigma_d^2$, are available at $N < M$ randomly-chosen points $x_{k(i)}$, where the index function $k(i)$ aligns the observations with the model parameters in x. The data kernel is $G_{ij} = \delta_{j\,k(i)}$. The prior information is given in (4), with a specified autocovariance amplitude. The derivative $\partial\mathbf{C}_h/\partial q$ is computed analytically from (4). An initial guess $q^{(0)}$ is improved using a gradient descent method, yielding an estimated value $q^{est}$ that differs from $q^{true}$ by less than 0.01%. The reconstructed function is smooth and sinusoidal, and the fit to the data is much improved.
Examples three and four were implemented in MATLAB® and executed in <5s on a notebook computer. They confirm the flexibility, speed and effectiveness of the method. An ability to tune prior information on autocovariance may be of special utility in seismic exploration applications, where three-dimensional waveform datasets are routinely interpolated.
A limitation of this overall "parametric" approach is that the solution depends on the choice of parameterization, which must be guided by prior knowledge of the general properties of the covariance matrices in the particular problem being solved. In Example 3, we were able to recognize (say, by visually examining the data plotted in Figure 2(a)) that observational error increases with x, and chose a parameterization of $\mathbf{C}_d$ that matched this scenario. If, instead, the degree of correlation between successive data increased with x, this pattern might be less expected, more difficult to detect, and require a different parameterization of $\mathbf{C}_d$.

Figure 3. Example of tuning $\mathbf{C}_h$. Sparsely-sampled synthetic data (red dots) are oscillatory. (a) A regularly-sampled version $\mathbf{m}^{est}$ is created by imposing the oscillatory covariance of (4). With the starting value $q^{(0)}$, the reconstruction fits the data poorly (black curve). Tuning leads to a better fit (green curve with dots), as well as a precise estimate of the wavenumber $q^{est}$; (b) Decrease in Ψ with iteration number during the gradient descent process.
Not every parameterization of $\mathbf{C}_h$ (or of $\mathbf{C}_d$) is necessarily well-behaved. To avoid poor behavior, the parameterization must be chosen so that its determinant does not have zeros at values of $\mathbf{q}$ that would prevent the steepest descent process from converging to the global minimum. That this choice can be problematical is illustrated by a simple Toeplitz version of $\mathbf{C}_h$, in which the elements along each diagonal are constant and are treated as the covariance parameters $\mathbf{q}$:

(22)

This form is useful for quantifying correlations within a stationary sequence of data [31].

Figure 4. The function $\det\mathbf{C}_h(\mathbf{q})$ for the case given by (22). (a) The $\det\mathbf{C}_h = 0$ surface for one fixed value of a selected covariance parameter, with the other qs randomly assigned; (b) Same as (a), but for a second fixed value; (c) Same as (a), but for a third fixed value; (d) Perspective view of the surfaces in the $\mathbf{q}$ volume. The positions of the three slices in (a), (b) and (c) are noted on the axis (green arrows). A question posed in the text is whether, given an arbitrary point $\mathbf{q}^{(0)}$ and the global minimum of the objective function, say at $\mathbf{q}^{est}$ (and with both points satisfying $\det\mathbf{C}_h > 0$), a steepest-descent path necessarily exists between them.

Yet, as is illustrated in Figure 4, the volume of covariance parameters $\mathbf{q}$ is crossed by many $\det\mathbf{C}_h = 0$ surfaces that correspond to surfaces of singular objective function Ψ. Their presence suggests that the steepest descent path between a starting value $\mathbf{q}^{(0)}$ and the global minimum at $\mathbf{q}^{est}$ may be very convoluted (if, indeed, such a path exists) unless $\mathbf{q}^{(0)}$ is very close to $\mathbf{q}^{est}$.
5. Conclusion
Generalized Least Squares requires the assignment of two prior covariance matrices, the prior covariance of the data and the prior covariance of the prior information. Making these assignments is often a very subjective process. However, in cases in which the forms of these matrices can be anticipated up to a set of poorly-known parameters, information contained within the data and prior information can be used to improve knowledge of them, a process we call "tuning". Tuning can be achieved by minimizing an objective function that depends on both the generalized error and the determinants of the covariance matrices, to arrive at a best estimate of the parameters. Analytic and computationally-tractable formulas are derived for the derivatives needed to implement the minimization via a gradient descent method. Furthermore, the problem is organized so that the minimization need be performed only over the space of covariance parameters, and not over the typically much larger joint space of model and covariance parameters. Although some care needs to be exercised when the covariance matrices are parametrized, the minimization is tractable and can lead to better estimates of the model parameters. An important outcome of this study is the recognition that the use of trade-off curves to determine the relative weighting of covariance, a practice ubiquitous in geophysical imaging, is not consistent with the underlying Bayesian framework of Generalized Least Squares. The strategy outlined here provides a consistent solution.
Acknowledgements
The author thanks Roger Creel for helpful discussion.