A Bayesian Regression Model and Applications

Abstract

A sparse vector regression model is developed. The model is established by employing a Bayesian formulation and is trained using a set of data. The parameters that need to be determined in the algorithm are reduced by a special setting of the prior hyperparameters, and therefore the algorithm is simpler than similar types of Bayesian vector-regression models. Examples of applications to function approximation and to an inverse scattering problem are presented.


Yu, Y. (2020) A Bayesian Regression Model and Applications. Journal of Applied Mathematics and Physics, 8, 1877-1887. doi: 10.4236/jamp.2020.89141.

1. Introduction

There has been a lot of interest in studying Bayesian vector regression and its application to various classification and regression problems [1] [2] [3] [4]. The Bayesian approach considers probability distributions conditioned on the observed data; prior distributions are converted to posterior distributions through the use of Bayes' theorem. Let $x$ be an input vector and $t$ be a vector of target parameters. In a regression formulation our goal is to define a model $y(x; w)$ that yields an approximation to the true target $t$, with the model defined by the parameters $w$. The model is typically designed using a set of "training" data $D = \{x_n, t_n\}_{n=1}^{N}$. Although we initially consider a finite set $D$, the goal is for the subsequent model $y(x; w)$ to be applicable to arbitrary $(x, t) \notin D$, over the anticipated range of $t$. When developing a regression model one must address the bias-variance tradeoff. A bias is introduced by restricting the form that $y(x; w)$ may take, while the variance represents the error between the model $y(x; w)$ and the true target parameters $t$. Models with minimal bias typically have significant flexibility, and therefore the model parameters may vary significantly as a function of the specific training set $D$ employed. To obtain good model generalization, which may be connected to the variation in the model parameters as a function of $D$, one must introduce a bias. The utilization of a small number of non-zero parameters $w$ often yields a good balance between bias and variance; such models are termed "sparse". This has led to the development of the relevance vector machine [5].

The rest of this paper is organized as follows. The theory of the vector-regression formulation is presented in Section 2, with application examples provided in Section 3. The work is summarized in Section 4.

2. Sparse Bayesian Vector Regression

2.1. Model Specification

Assume we have available a set of training data $D = \{x_n, t_n\}_{n=1}^{N}$, where $x_n = [x_n^{(1)}, x_n^{(2)}, \ldots, x_n^{(L)}]$ and $t_n = [t_n^{(1)}, t_n^{(2)}, \ldots, t_n^{(M)}]$. Our objective is to develop a function $y(x; w)$ that is dependent on the parameters $w$. After $y(x; w)$ is so designed, it may be used to map an arbitrary $x$ to an approximation of the target parameters $t$.

The specific vector-regression function $y(x; w) = [y^{(1)}(x; w), y^{(2)}(x; w), \ldots, y^{(M)}(x; w)]$ employed here is defined as

$y(x; w) = \sum_{i=1}^{N} w_i\, t_i\, K(x, x_i) + w_0$ (1)

where $w_0 = [w_0^{(1)}, w_0^{(2)}, \ldots, w_0^{(M)}]$, and $K(x, x_i)$ is a kernel function designed such that $K(x, x_i)$ is large if $x_i$ is close to $x$ and otherwise $K(x, x_i)$ is small. Hence in (1) only those $x_i$ close to $x$ are important in defining $y(x; w)$.

Let

$w = [w_1, w_2, \ldots, w_N, w_0^{(1)}, w_0^{(2)}, \ldots, w_0^{(M)}]$,

$\psi_i(x) = [\phi_i^{(1)}, \phi_i^{(2)}, \ldots, \phi_i^{(M)}]$, $i = 1, 2, \ldots, N$

with

$\phi_i^{(k)} = t_i^{(k)}\, K(x, x_i)$, $i = 1, 2, \ldots, N$; $k = 1, 2, \ldots, M$ (2)

and the $M \times (N+M)$ matrix

$\Psi(x) = [\psi_1(x)\ \ \psi_2(x)\ \ \cdots\ \ \psi_N(x)\ \ I_M]$, (3)

where $I_M$ is the $M \times M$ identity matrix. Then (1) can be expressed in the matrix form

$y(x; w) = \Psi(x)\, w$ (4)
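To make the construction concrete, the following is a minimal NumPy sketch of (2)-(4); the routine names (build_Psi, model_output) and the array conventions are illustrative choices rather than part of the formulation, and the kernel is passed in as a generic function.

```python
import numpy as np

def build_Psi(x, X_train, T_train, kernel):
    """Assemble the M x (N+M) matrix Psi(x) of Equation (3).

    X_train: (N, L) array of training inputs, T_train: (N, M) array of
    training targets, kernel: a function K(x, x_i) returning a scalar.
    """
    N, M = T_train.shape
    # psi_i(x) has components phi_i^(k) = t_i^(k) K(x, x_i), Equation (2)
    cols = [T_train[i] * kernel(x, X_train[i]) for i in range(N)]
    # append the M x M identity block that multiplies w_0
    return np.column_stack(cols + [np.eye(M)])

def model_output(x, w, X_train, T_train, kernel):
    """Evaluate y(x; w) = Psi(x) w as in Equation (4); w has length N + M."""
    return build_Psi(x, X_train, T_train, kernel) @ w
```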

Assume that the target is generated by the model with additive noise

$t = y(x; w) + \varepsilon = \Psi(x)\, w + \varepsilon$, (5)

where the model error is $\varepsilon = [\varepsilon^{(1)}, \varepsilon^{(2)}, \ldots, \varepsilon^{(M)}]$ and the $\varepsilon^{(k)}$, $k = 1, 2, \ldots, M$, are independent samples from a zero-mean Gaussian process with variance $\alpha_0^{-1}$

$p(\varepsilon^{(k)}) = \mathcal{N}(\varepsilon^{(k)} \mid 0, \alpha_0^{-1})$, $k = 1, 2, \ldots, M$ (6)

We therefore have

$p(t \mid x, w, \alpha_0) = \left(\frac{2\pi}{\alpha_0}\right)^{-M/2} \exp\!\left(-\frac{\alpha_0}{2}\,\|t - \Psi(x) w\|_2^2\right) = \mathcal{N}(t \mid \Psi(x) w,\ \alpha_0^{-1} I_M)$ (7)

We wish to constrain the weights $w$ such that a simple model is favored; this is accomplished by invoking a prior distribution on $w$ that favors most of the weights being zero. In this context, only the most relevant members of the training set $D = \{x_n, t_n\}_{n=1}^{N}$, those with nonzero weights $w_n$, are ultimately used in the final regression model. This simplicity allows improved regression performance for $(x, t) \notin D$ [5] [6].

We employ a zero-mean Gaussian prior distribution for w

$p(w \mid \alpha_0, \alpha) = \mathcal{N}(w \mid 0_{N+M},\ \alpha_0^{-1}\alpha^{-1} I_{N+M})$, (8)

where $0_{N+M}$ is an $(N+M)$-dimensional zero vector, $I_{N+M}$ is the $(N+M) \times (N+M)$ identity matrix, and suitable priors over the hyperparameters $\alpha_0$ and $\alpha$ are Gamma distributions [7]

$p(\alpha_0 \mid a, b) = \mathrm{Gamma}(\alpha_0 \mid a, b)$ (9)

$p(\alpha \mid c, d) = \mathrm{Gamma}(\alpha \mid c, d)$ (10)

where $\mathrm{Gamma}(\alpha_0 \mid a, b) = \Gamma(a)^{-1} b^{a} \alpha_0^{a-1} e^{-b\alpha_0}$ with $\Gamma(a) = \int_0^{\infty} t^{a-1} e^{-t}\, dt$.

The hierarchical prior over w favors a sparse model and the prior over α 0 will be used to favor small model error on the training data D.

2.2. Inference

For the training data $D = \{x_n, t_n\}_{n=1}^{N}$ we introduce the $LN$-dimensional vector

$X = [x_1, x_2, \ldots, x_N]$

and the $MN$-dimensional vector

$T = [t_1, t_2, \ldots, t_N]$

and let the $(MN) \times (M+N)$ matrix

$\Phi = \begin{bmatrix} \Phi_1 \\ \Phi_2 \\ \vdots \\ \Phi_N \end{bmatrix}$ with $\Phi_i = \Psi(x_i)$, $i = 1, 2, \ldots, N$,

then by (7), we have

$p(T \mid w, \alpha_0, X) = \left(\frac{2\pi}{\alpha_0}\right)^{-MN/2} \exp\!\left(-\frac{\alpha_0}{2}\,\|T - \Phi w\|_2^2\right) = \mathcal{N}(T \mid \Phi w,\ \alpha_0^{-1} I_{MN})$ (11)

Noting that $p(T \mid \alpha_0, \alpha, X) = \int p(T \mid w, \alpha_0, X)\, p(w \mid \alpha_0, \alpha)\, dw$ is a convolution of Gaussians, the posterior distribution over the weights $w$ can be derived as

$p(w \mid \alpha_0, \alpha, X, T) = \dfrac{p(T \mid w, \alpha_0, X)\, p(w \mid \alpha_0, \alpha)}{p(T \mid \alpha_0, \alpha, X)} = \mathcal{N}(w \mid \mu,\ \alpha_0^{-1}\Sigma)$ (12)

where

$\Sigma = \left(\Phi^{\top}\Phi + \alpha I_{M+N}\right)^{-1} = \left(\sum_{i=1}^{N} \Phi_i^{\top}\Phi_i + \alpha I_{M+N}\right)^{-1}$ (13)

$\mu = \Sigma\, \Phi^{\top} T = \Sigma \sum_{i=1}^{N} \Phi_i^{\top} t_i$ (14)
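Assuming the blocks $\Phi_i = \Psi(x_i)$ and the stacked target vector $T$ have been formed (for instance with the illustrative build_Psi routine above), (13) and (14) reduce to a few lines of linear algebra. The sketch below uses a direct matrix inverse for clarity; the helper names are again illustrative.

```python
import numpy as np

def stack_training(X_train, T_train, kernel):
    """Stack Phi_i = Psi(x_i) into the (MN) x (N+M) matrix Phi and the
    training targets into the MN-dimensional vector T (illustrative helper)."""
    Phi = np.vstack([build_Psi(x, X_train, T_train, kernel) for x in X_train])
    T = T_train.reshape(-1)
    return Phi, T

def posterior(Phi, T, alpha):
    """Posterior statistics of Equations (13) and (14): returns (Sigma, mu)."""
    dim = Phi.shape[1]                                         # N + M
    Sigma = np.linalg.inv(Phi.T @ Phi + alpha * np.eye(dim))   # Eq. (13)
    mu = Sigma @ (Phi.T @ T)                                   # Eq. (14)
    return Sigma, mu
```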

2.3. Hyperparameter Optimization

We determine $\alpha$ in (13) by maximizing $p(\alpha \mid T, X) \propto p(T \mid \alpha, X)\, p(\alpha)$ with respect to $\alpha$. It is equivalent to maximize the logarithm of this quantity. In addition, we can choose to maximize with respect to $\ln\alpha$, since we can assume the hyperpriors are defined over a logarithmic scale.

Since

$\ln p(T \mid \alpha, X) = \ln \iint p(T \mid w, \alpha_0, X)\, p(w \mid \alpha_0, \alpha)\, p(\alpha_0 \mid a, b)\, dw\, d\alpha_0 = -\frac{1}{2}\left[\ln|B| + (MN + 2a)\ln\!\left(T^{\top} B^{-1} T + 2b\right)\right] + \mathrm{const}$

where $B = I_{MN} + \alpha^{-1}\Phi\Phi^{\top}$, and $p(\ln\alpha) = \alpha\, p(\alpha)$, we obtain the objective function

$L(\alpha) = -\frac{1}{2}\left[\ln|B| + (MN + 2a)\ln\!\left(T^{\top} B^{-1} T + 2b\right)\right] + c\ln\alpha - d\alpha$ (15)
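For monitoring the maximization, (15) can be evaluated directly. The sketch below forms the $MN \times MN$ matrix $B$ explicitly, which is practical only for moderate $MN$; the default hyperparameter values follow Section 3.

```python
import numpy as np

def objective(Phi, T, alpha, a=0.05, b=0.05, c=0.05, d=0.05):
    """Evaluate L(alpha) of Equation (15), with B = I + alpha^{-1} Phi Phi^T."""
    MN = Phi.shape[0]
    B = np.eye(MN) + (Phi @ Phi.T) / alpha
    _, logdetB = np.linalg.slogdet(B)
    quad = T @ np.linalg.solve(B, T)          # T^T B^{-1} T
    return (-0.5 * (logdetB + (MN + 2 * a) * np.log(quad + 2 * b))
            + c * np.log(alpha) - d * alpha)
```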

By the determinant identity [8], we have

$|B| = \left|I_{MN} + \alpha^{-1}\Phi\Phi^{\top}\right| = \alpha^{-(M+N)}\left|\alpha I_{M+N} + \Phi^{\top}\Phi\right| = \alpha^{-(M+N)}\left|\Sigma^{-1}\right|$,

and so

$\ln|B| = -(M+N)\ln\alpha + \ln\left|\Sigma^{-1}\right|$ (16)

Using the Woodbury formula, we obtain

$B^{-1} = \left(I_{MN} + \alpha^{-1}\Phi\Phi^{\top}\right)^{-1} = I_{MN} - \Phi\left(\alpha I_{M+N} + \Phi^{\top}\Phi\right)^{-1}\Phi^{\top} = I_{MN} - \Phi\Sigma\Phi^{\top}$,

thus

$T^{\top} B^{-1} T = T^{\top}\!\left(T - \Phi\Sigma\Phi^{\top} T\right)$

$= T^{\top}\!\left(T - \Phi\mu\right)$ (17)

$= \|T\|^2 - T^{\top}\Phi\Sigma\Phi^{\top} T$ (18)

Then by (16) and Jacobi’s formula, we have

$\dfrac{d\ln|B|}{d\ln\alpha} = -(M+N) + \dfrac{1}{\left|\Sigma^{-1}\right|}\dfrac{d\left|\Sigma^{-1}\right|}{d\ln\alpha} = -(M+N) + \mathrm{tr}\!\left(\Sigma\,\dfrac{d\Sigma^{-1}}{d\ln\alpha}\right) = -(M+N) + \alpha\sum_{j=1}^{M+N}\Sigma_{jj}$ (19)

where $\Sigma_{jj}$ is the $j$-th diagonal element of the matrix $\Sigma$.

By (18)

$\dfrac{d\,T^{\top} B^{-1} T}{d\ln\alpha} = -\dfrac{d\,T^{\top}\Phi\Sigma\Phi^{\top} T}{d\ln\alpha} = -T^{\top}\Phi\,\dfrac{d\Sigma}{d\ln\alpha}\,\Phi^{\top} T = T^{\top}\Phi\,\Sigma\,\dfrac{d\Sigma^{-1}}{d\ln\alpha}\,\Sigma\,\Phi^{\top} T = \alpha\|\mu\|^2$ (20)

Using (17), (19) and (20), we have

$\dfrac{dL(\alpha)}{d\ln\alpha} = -\dfrac{1}{2}\left(-(M+N) + \alpha\sum_{j=1}^{M+N}\Sigma_{jj}\right) - \dfrac{(MN + 2a)}{2\left(T^{\top} B^{-1} T + 2b\right)}\dfrac{d\,T^{\top} B^{-1} T}{d\ln\alpha} + c - d\alpha = \dfrac{1}{2}\left(M+N - \alpha\sum_{j=1}^{M+N}\Sigma_{jj}\right) - \dfrac{(MN + 2a)\,\alpha\|\mu\|^2}{2\left[T^{\top}(T - \Phi\mu) + 2b\right]} + c - d\alpha$ (21)

Setting (21) to zero, followed by algebraic operations, yields

$\alpha = \dfrac{M + N + 2c}{\displaystyle\sum_{j=1}^{M+N}\Sigma_{jj} + 2d + (MN + 2a)\,\|\mu\|^2 \big/ \left[T^{\top}(T - \Phi\mu) + 2b\right]}$ (22)

The algorithm consists of iterating (13), (14) and (22) to update $\alpha$, $\Sigma$ and $\mu$.
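A minimal sketch of this iteration, using the posterior routine above, is given below; the initial value of $\alpha$, the stopping tolerance, and the iteration cap are illustrative choices not specified by the algorithm, while the defaults $a = b = c = d = 0.05$ follow Section 3.

```python
import numpy as np

def fit(Phi, T, a=0.05, b=0.05, c=0.05, d=0.05,
        alpha_init=1.0, max_iter=200, tol=1e-6):
    """Iterate Equations (13), (14) and (22) until alpha stabilizes."""
    MN, NplusM = Phi.shape                              # MN rows, N + M columns
    alpha = alpha_init
    for _ in range(max_iter):
        Sigma, mu = posterior(Phi, T, alpha)            # Eqs. (13)-(14)
        resid = T @ (T - Phi @ mu)                      # T^T (T - Phi mu)
        alpha_new = (NplusM + 2 * c) / (np.trace(Sigma) + 2 * d
                     + (MN + 2 * a) * (mu @ mu) / (resid + 2 * b))   # Eq. (22)
        if abs(alpha_new - alpha) <= tol * alpha:
            alpha = alpha_new
            break
        alpha = alpha_new
    Sigma, mu = posterior(Phi, T, alpha)
    return mu, Sigma, alpha
```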

2.4. Making Predictions

Assume $\alpha^{MP}$ and $\alpha_0^{MP}$ are the maximizing values obtained by maximizing $p(\alpha \mid T, X)$ (Section 2.3) and $p(\alpha_0 \mid T, X)$, respectively. Assume

$p(\alpha_0, \alpha \mid X, T) \approx \delta(\alpha_0 - \alpha_0^{MP})\,\delta(\alpha - \alpha^{MP})$

then

$p(t \mid x, X, T) = \int p(t \mid x, w, \alpha_0, \alpha)\, p(w, \alpha_0, \alpha \mid X, T)\, dw\, d\alpha_0\, d\alpha = \int p(t \mid x, w, \alpha_0)\, p(w \mid \alpha_0, \alpha, X, T)\, p(\alpha_0, \alpha \mid X, T)\, dw\, d\alpha_0\, d\alpha \approx \int p(t \mid x, w, \alpha_0)\, p(w \mid \alpha_0, \alpha, X, T)\, \delta(\alpha_0 - \alpha_0^{MP})\,\delta(\alpha - \alpha^{MP})\, dw\, d\alpha_0\, d\alpha = \int p(t \mid x, w, \alpha_0^{MP})\, p(w \mid \alpha_0^{MP}, \alpha^{MP}, X, T)\, dw = \mathcal{N}\!\left(t \mid y(x; \mu),\ (\alpha_0^{MP})^{-1}\Omega\right)$ (23)

with

$y(x; \mu) = \Psi(x)\,\mu$ (24)

$\Omega = I_M + \Psi(x)\,\Sigma\,\Psi(x)^{\top}$ (25)
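Given $\mu$, $\Sigma$ and $\alpha_0^{MP}$, the predictive mean and covariance of (23)-(25) follow directly, as in the sketch below; build_Psi is the illustrative routine from Section 2.1, and $\alpha_0^{MP}$ is assumed to have been estimated separately.

```python
import numpy as np

def predict(x, mu, Sigma, alpha0_mp, X_train, T_train, kernel):
    """Predictive distribution of Eq. (23): mean y(x; mu), covariance Omega / alpha0_mp."""
    Psi_x = build_Psi(x, X_train, T_train, kernel)
    mean = Psi_x @ mu                                            # Eq. (24)
    Omega = np.eye(Psi_x.shape[0]) + Psi_x @ Sigma @ Psi_x.T     # Eq. (25)
    return mean, Omega / alpha0_mp
```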

3. Applications

In the examples we employ a radial-basis-function kernel $K(x, x_i) = \exp\!\left(-\|x - x_i\|^2 / r^2\right)$, and adjust the parameters $a$, $b$, $c$ and $d$ by training and testing on the given training data; finally we take $a = b = c = d = 0.05$ for all examples in this section. In all figures the horizontal axis is the index of samples and the vertical axis is the output.
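For reference, this kernel can be written as a small helper; the width $r$ is problem-dependent and the default value below is purely illustrative.

```python
import numpy as np

def rbf_kernel(x, xi, r=1.0):
    """Radial-basis-function kernel K(x, x_i) = exp(-||x - x_i||^2 / r^2)."""
    diff = np.asarray(x, dtype=float) - np.asarray(xi, dtype=float)
    return np.exp(-np.dot(diff, diff) / r ** 2)
```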

3.1. Regression: Function Approximation

The model can be used to establish the relation between independent variables and dependent variables of a function.

Example 1. A 2-dimensional vector function of two variables:

$t_1 = \mathrm{sinc}\!\left(\dfrac{x_1 + x_2}{4}\right)$

$t_2 = 0.5\,\mathrm{sinc}\!\left(\dfrac{x_1 + x_2}{4}\right)\sin\!\left(\dfrac{x_1 x_2}{20}\right) - 0.4$

in the domain $\{(x_1, x_2) \mid -10 \le x_1 \le 10,\ 0 \le x_2 \le 20\}$, where $\mathrm{sinc}(x) = \sin(x)/x$.

Figure 1 and Figure 2 illustrate the results. Figure 1 shows learning from 100 noise-free training samples, while Figure 2 is based on 100 noisy training samples; the noise is generated from a zero-mean Gaussian with standard deviation equal to 5% of the average training target $t$. Both cases are tested on 100 examples that are not in the training data.

Example 2. A 3-dimensional vector function of 200 variables, $(x_1, x_2, \ldots, x_{200}) \mapsto (t_1, t_2, t_3)$:

$t_1 = \sum_{k=1}^{200}\sin\!\left((x_k)^{5/7}\right) + \dfrac{x_{50}}{100}$

$t_2 = \dfrac{x_{200}}{800}\,t_1 + \dfrac{x_{50}}{200} + \dfrac{\cos(x_{100}/5)}{10}$

$t_3 = \arctan\!\left(\dfrac{t_1 + t_2}{6}\right) + \dfrac{t_2\, t_1^2}{10}$

We choose samples at the points $x^n = (x_1^n, x_2^n, \ldots, x_{200}^n)$ with $x_k^n = k + (n-1)\pi/4$. The 100 samples at points $x^n$ with $n = 1, 3, 5, \ldots, 199$ are used as training data, and the 100 samples at points $x^n$ with $n = 2, 4, 6, \ldots, 200$ are used as testing data.

Figure 3 is from noise-free training samples, and Figure 4 is based on noisy training samples; the noise is generated from a zero-mean Gaussian with standard deviation equal to 5% of the average training target $t$.

Figure 1. Results for the 2-dim vector function with noise-free data: (a) predict on training points; (b) predict on testing points.

Figure 2. Results for the 2-dim vector function with noisy data: (a) predict on training points; (b) predict on testing points.

Figure 3. Results for the 3-dim vector function with noise-free data: (a) predict on training points; (b) predict on testing points.

3.2. Regression: Inverse Scattering

The model can be used to characterize the connection between measured vector scattered-field data $x$ and the underlying target responsible for these fields, characterized by the parameter vector $t$. The scattering data $x$ may be measured at multiple positions. In the examples the measured data are simulated by a forward model.

We consider a homogeneous lossless dielectric target buried in a lossy dielectric half space. The objective is to invert for the parameters of the target. In the examples, the parameter vector $t$ is composed of three real numbers: the depth of the target, the size of the target, and the dielectric constant of the target. For each target there are 100 simulated measurement data. The training data $D = \{x_n, t_n\}_{n=1}^{N}$ are composed of $N = 180$ examples, and the testing data are composed of 125 examples that are not in $D$.

Example 1. We consider a cube target in this example. Figure 5 and Figure 6 illustrate the results. Figure 5 is from noise-free data, and Figure 6 is based on noisy data; the noise is generated from a zero-mean Gaussian with standard deviation equal to 10% of the average training data $x$. The "size" is the width of the cube.

Figure 4. Results for 3-dim vector function with noisy data: (a) predict on training points; (b) predict on testing points.

Figure 5. Results for cube target with noise-free data: (a) predict on training points; (b) predict on testing points.

Figure 6. Results for cube target with noisy data: (a) predict on training points; (b) predict on testing points.

Figure 7. Results for sphere target with noise-free data: (a) predict on training points; (b) predict on testing points.

Figure 8. Results for sphere target with noisy data: (a) predict on training points; (b) predict on testing points.

Example 2. We consider a sphere target in this example. Figure 7 and Figure 8 illustrate the results. Figure 7 is from noise-free data, and Figure 8 is based on noisy data; the noise is generated from a zero-mean Gaussian with standard deviation equal to 10% of the average training data $x$. The "size" is the diameter of the sphere.

We applied the model to two completely different types of problems, and it works well for both applications. The results show that this regression model can be applied to various types of regression problems.

4. Conclusion

A Bayesian vector-regression algorithm has been developed. The model employs a statistical prior that favors a sparse model, for which most of the weights are zero [5]. This model improves on the algorithm in [9], reducing the number of hyperparameters that need to be calculated in the algorithm from two to one. The model is not tailored to one specific problem, and so can be applied to different regression problems. We have discussed the theoretical development of the model and have presented several example results for two different applications: one is function approximation, and the other is inverse scattering from dielectric targets buried in a lossy half space. It has been demonstrated that the algorithm works well for different applications.

Conflicts of Interest

The author declares no conflicts of interest regarding the publication of this paper.

References

[1] Law, T. and Shawe-Taylor, J. (2017) Practical Bayesian Support Vector Regression for Financial Time Series Prediction and Market Condition Change Detection. Quantitative Finance, 17, 1403-1416.
https://doi.org/10.1080/14697688.2016.1267868
[2] Yu, J. (2012) A Bayesian Inference Based Two-Stage Support Vector Regression Framework for Soft Sensor Development in Batch Bioprocesses. Computers & Chemical Engineering, 41, 134-144.
https://doi.org/10.1016/j.compchemeng.2012.03.004
[3] Jacobs, J.P. (2012) Bayesian Support Vector Regression with Automatic Relevance Determination Kernel for Modeling of Antenna Input Characteristics. IEEE Transactions on Antennas and Propagation, 60, 2114-2118.
https://doi.org/10.1109/TAP.2012.2186252
[4] Hans, C. (2009) Bayesian Lasso Regression. Biometrika, 96, 835-845.
https://doi.org/10.1093/biomet/asp047
[5] Tipping, M.E. (2001) Sparse Bayesian Learning and the Relevance Vector Machine. Journal of Machine Learning Research, 1, 211-244.
[6] Scholkopf, B. and Smola, A.J. (2001) Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge.
[7] Berger, J.O. (1985) Statistical Decision Theory and Bayesian Analysis. 2nd Edition, Springer, Berlin.
https://doi.org/10.1007/978-1-4757-4286-2
[8] Mardia, K.V., Kent, J.T. and Bibby, J.B. (1979) Multivariate Analysis. Academic Press, New York.
[9] Yu, Y., Krishnapuram, B. and Carin, L. (2004) Inverse Scattering with Sparse Bayesian Vector Regression. Inverse Problems, 20, 217-231.
https://doi.org/10.1088/0266-5611/20/6/S13
