Least Squares Method from the View Point of Deep Learning II: Generalization

Abstract

The least squares method is one of the most fundamental methods in Statistics to estimate correlations among various data. On the other hand, Deep Learning is the heart of Artificial Intelligence and it is a learning method based on the least squares method, in which a parameter called learning rate plays an important role. It is in general very hard to determine its value. In this paper we generalize the preceding paper [K. Fujii: Least squares method from the view point of Deep Learning: Advances in Pure Mathematics, 8, 485-493, 2018] and give an admissible value of the learning rate, which is easily obtained.

Fujii, K. (2018) Least Squares Method from the View Point of Deep Learning II: Generalization. Advances in Pure Mathematics, 8, 782-791. doi: 10.4236/apm.2018.89048.

1. Introduction

This paper is a sequel to the preceding paper [1] .

The least squares method in Statistics plays an important role in almost all disciplines, from Natural Science to Social Science. When we want to find properties, tendencies or correlations hidden in huge and complicated data we usually employ the method. See for example [2] .

On the other hand, Deep Learning is the heart of Artificial Intelligence and will become a most important field in Data Science in the near future. As to Deep Learning see for example [3] - [10] .

Deep Learning may be described as a successive learning method based on the least squares method. It is therefore natural and instructive to reconsider the least squares method from the view point of Deep Learning. We carry out a thorough calculation of the successive approximation called the gradient descent sequence, in which a parameter called the learning rate plays an important role.

One of the main points is to determine the range of the learning rate, which is a very hard problem [8]. We showed in [1] that a difference in methods between Statistics and Deep Learning leads to different results when the learning rate changes.

We generalize the preceding results to the case of the least squares method by polynomial approximation. Our results may give a new insight to both Statistics and Data Science.

2. Least Squares Method

Let us explain the least squares method by polynomial approximation [9]. The model function $f(x)$ is a polynomial in $x$ of degree $M$ given by

$$f(x) = w_0 + w_1 x + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j. \tag{1}$$

For N pieces of two dimensional real data

$$\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$$

we assume that their scatter plot is given like Figure 1.

The coefficients of (1)

$$w = (w_0, w_1, \ldots, w_M)^T \tag{2}$$

must be determined by the data set (T denotes the transposition of a vector or a matrix).

For this set of data the error function is given by

$$E(w) = \frac{1}{2} \sum_{i=1}^{N} \{y_i - f(x_i)\}^2 = \frac{1}{2} \sum_{i=1}^{N} \Big( y_i - \sum_{j=0}^{M} w_j x_i^j \Big)^2. \tag{3}$$

Figure 1. Scatter plot.

The aim of the least squares method is to minimize the error function (3) with respect to $w$ in (2). Usually the minimum is obtained by solving the simultaneous equations

$$\frac{\partial E(w)}{\partial w_0} = 0, \quad \frac{\partial E(w)}{\partial w_1} = 0, \quad \ldots, \quad \frac{\partial E(w)}{\partial w_M} = 0.$$

However, in this paper another approach based on quadratic form is given, which is instructive.

Let us calculate the error function (3). By using the definition of inner product

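$$\langle a | b \rangle = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n = a^T b$$

it is not difficult to see

$$2E(w) = \langle y - \Phi w \,|\, y - \Phi w \rangle \tag{4}$$

where

$$y = (y_1, y_2, \ldots, y_N)^T, \quad w = (w_0, w_1, \ldots, w_M)^T$$

and

$$\Phi = \begin{pmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^M \\ 1 & x_2 & x_2^2 & \cdots & x_2^M \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_N & x_N^2 & \cdots & x_N^M \end{pmatrix}.$$

Here we make an important

Assumption $N > M + 1$ and $\operatorname{rank}(\Phi) = M + 1$ (full rank).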
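The equivalence between the sum form (3) and the inner product form (4) can be checked numerically. The following is a minimal sketch in plain Python; the function names and the toy data are hypothetical and chosen only for illustration.

```python
def design_matrix(xs, M):
    """Return Phi as a list of rows [1, x, x**2, ..., x**M], one row per data point."""
    return [[x ** j for j in range(M + 1)] for x in xs]


def error_sum(w, xs, ys):
    """E(w) as in (3): half the sum of squared residuals y_i - f(x_i)."""
    total = 0.0
    for x, y in zip(xs, ys):
        f_x = sum(w_j * x ** j for j, w_j in enumerate(w))
        total += (y - f_x) ** 2
    return 0.5 * total


def error_inner(w, xs, ys):
    """E(w) as in (4): half the inner product <y - Phi w | y - Phi w>."""
    phi = design_matrix(xs, len(w) - 1)
    residual = [y - sum(p * w_j for p, w_j in zip(row, w))
                for row, y in zip(phi, ys)]
    return 0.5 * sum(r * r for r in residual)


# Hypothetical toy data (not taken from the paper), with M = 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 2.0, 2.0, 5.0]
w = [0.5, 1.0]
e_sum = error_sum(w, xs, ys)
e_inner = error_inner(w, xs, ys)
```

Both functions return the same value for any choice of `w`, which is the content of (4).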

Let us deform (4). From

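$$\begin{aligned} \langle y - \Phi w \,|\, y - \Phi w \rangle &= (y - \Phi w)^T (y - \Phi w) = (y^T - w^T \Phi^T)(y - \Phi w) \\ &= (w^T \Phi^T - y^T)(\Phi w - y) \\ &= w^T \Phi^T \Phi w - w^T \Phi^T y - y^T \Phi w + y^T y \end{aligned}$$

we set for simplicity

$$x = w, \quad A = \Phi^T \Phi, \quad b = \Phi^T y, \quad c = y^T y.$$

Namely, we have a general quadratic form

$$2E(w) = \langle y - \Phi w \,|\, y - \Phi w \rangle = x^T A x - x^T b - b^T x + c. \tag{5}$$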

On the other hand, the deformation of (5) is well-known.

Formula For a symmetric and invertible matrix $A$ (i.e. $A^{-1}$ exists) we have

$$x^T A x - x^T b - b^T x + c = (x - A^{-1} b)^T A (x - A^{-1} b) - b^T A^{-1} b + c. \tag{6}$$

The proof is easy. Since $A^T = A \Longrightarrow (A^{-1})^T = A^{-1}$ we obtain

$$(x - A^{-1} b)^T A (x - A^{-1} b) = (x^T - b^T A^{-1}) A (x - A^{-1} b) = x^T A x - x^T b - b^T x + b^T A^{-1} b$$

and this gives (6).
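As a quick sanity check of (6), consider the simplest case $n = 1$ (an illustrative specialization, not part of the original argument). For scalars $A = a \neq 0$, $b$ and $c$ the formula reduces to the familiar completion of the square

$$a x^2 - 2 b x + c = a \left( x - \frac{b}{a} \right)^2 - \frac{b^2}{a} + c,$$

so that for $a > 0$ the minimum $c - b^2/a$ is attained at $x = b/a$, the scalar analogue of $x = A^{-1} b$.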

Therefore, our case becomes

$$2E(w) = \{w - (\Phi^T \Phi)^{-1} \Phi^T y\}^T \Phi^T \Phi \{w - (\Phi^T \Phi)^{-1} \Phi^T y\} - y^T \Phi (\Phi^T \Phi)^{-1} \Phi^T y + y^T y \tag{7}$$

because $\Phi^T \Phi$ is symmetric and invertible by the assumption.

If we choose

$$w = (\Phi^T \Phi)^{-1} \Phi^T y \tag{8}$$

then the minimum is given by

$$2E(w)_{\min} = y^T y - y^T \Phi (\Phi^T \Phi)^{-1} \Phi^T y = y^T \{E_N - \Phi (\Phi^T \Phi)^{-1} \Phi^T\} y \tag{9}$$

where $E_N$ is the N-dimensional identity matrix.
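The closed form (8) can be evaluated without forming an explicit inverse by solving the linear system $\Phi^T \Phi w = \Phi^T y$. The following is a minimal sketch in plain Python; the helper `solve_linear` (Gaussian elimination with partial pivoting) and the toy data are assumptions made for illustration and are not taken from the paper.

```python
def solve_linear(a, b):
    """Solve a z = b for a square matrix a by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        # Bring the row with the largest pivot candidate into position.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper triangular system.
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - tail) / aug[r][r]
    return z


def least_squares(xs, ys, M):
    """Return w solving (Phi^T Phi) w = Phi^T y, which is equivalent to (8)."""
    phi = [[x ** j for j in range(M + 1)] for x in xs]
    gram = [[sum(row[i] * row[j] for row in phi) for j in range(M + 1)]
            for i in range(M + 1)]
    rhs = [sum(row[i] * y for row, y in zip(phi, ys)) for i in range(M + 1)]
    return solve_linear(gram, rhs)


# Hypothetical data lying exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w = least_squares(xs, ys, 1)
```

For this data the solver recovers the coefficients of the line, $w_0 = 1$ and $w_1 = 2$, up to rounding.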

Our method is simple and clear (“smart” in our terminology).

3. Least Squares Method from Deep Learning

In this section we reconsider the least squares method in Section 2 from the view point of Deep Learning.

First we arrange the data in Section 2 like

Input data: $\{(1, x_j, x_j^2, \ldots, x_j^M) \mid 1 \le j \le N\}$

Teacher signal: $\{y_1, y_2, \ldots, y_N\}$

and consider a simple neuron model in [11] (see Figure 2).

Here we use the polynomial (1) instead of the sigmoid function $z = \sigma(x)$.

In this case the square error function becomes

$$L(w) = \frac{1}{2} \langle y - \Phi w \,|\, y - \Phi w \rangle = \frac{1}{2} \left( w^T \Phi^T \Phi w - w^T \Phi^T y - y^T \Phi w + y^T y \right). \tag{10}$$

Figure 2. Simple neuron model.

In Deep Learning we generally write $L(w)$ instead of $E(w)$ in (3).

Our aim is also to determine the parameters w in order to minimize L ( w ) . However, the procedure is different from the least squares method in Section 2. This is an important and interesting point.

The parameters $w$ are determined successively by the gradient descent method (see for example [12]): for $t = 0, 1, 2, \ldots$

$$w(0) \to w(1) \to \cdots \to w(t) \to w(t+1) \to \cdots$$

and

$$w(t+1) = w(t) - \epsilon \frac{\partial L}{\partial w(t)} \tag{11}$$

where

$$L = L(w(t)) = \frac{1}{2} \left\{ w(t)^T \Phi^T \Phi w(t) - w(t)^T \Phi^T y - y^T \Phi w(t) + y^T y \right\} \tag{12}$$

and $\epsilon$ ($0 < \epsilon < 1$) is a small parameter called the learning rate.

The initial value $w(0)$ is given appropriately. Note that $t$ denotes discrete time while $T$ denotes transposition.

Let us calculate (11) explicitly. Since

$$\frac{\partial L}{\partial w(t)} = \Phi^T \Phi w(t) - \Phi^T y$$

from (12) we have

$$w(t+1) = w(t) - \epsilon \{\Phi^T \Phi w(t) - \Phi^T y\} = (E_{M+1} - \epsilon \Phi^T \Phi) w(t) + \epsilon \Phi^T y. \tag{13}$$

This recurrence is easily solved to give

$$w(t) = (E_{M+1} - \epsilon \Phi^T \Phi)^t w(0) + \{E_{M+1} - (E_{M+1} - \epsilon \Phi^T \Phi)^t\} (\Phi^T \Phi)^{-1} \Phi^T y \tag{14}$$

for $t = 0, 1, 2, \ldots$.

The proof is left to readers.
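One possible argument is sketched here; the shorthand $B$ and $w_*$ is introduced only for this sketch and does not appear in the original. Put $B = E_{M+1} - \epsilon \Phi^T \Phi$ and $w_* = (\Phi^T \Phi)^{-1} \Phi^T y$. Because $\epsilon \Phi^T y = \epsilon \Phi^T \Phi w_* = (E_{M+1} - B) w_*$, the recurrence (13) can be rewritten as

$$w(t+1) - w_* = B \{w(t) - w_*\},$$

and iterating this $t$ times gives

$$w(t) - w_* = B^t \{w(0) - w_*\} \Longleftrightarrow w(t) = B^t w(0) + (E_{M+1} - B^t) w_*,$$

which is (14).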

Since this is not a final form let us continue the calculation. From (14) we have

$$\lim_{t \to \infty} w(t) = (\Phi^T \Phi)^{-1} \Phi^T y \tag{15}$$

if

$$\lim_{t \to \infty} (E_{M+1} - \epsilon \Phi^T \Phi)^t = O_{M+1} \tag{16}$$

where $O_{M+1}$ is the (M+1)-dimensional zero matrix. The limit (15) is just the equation (8) and it is independent of $\epsilon$.
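The iteration (13) and its limit (15) can be illustrated with a short loop. The following is a minimal sketch in plain Python, assuming the start value $w(0) = 0$, a hypothetical data set on the line $y = 1 + 2x$, and a learning rate chosen small enough for this particular data; none of these choices come from the paper.

```python
def gradient_descent(xs, ys, M, eps, steps):
    """Iterate w(t+1) = w(t) - eps * (Phi^T Phi w(t) - Phi^T y) from w(0) = 0."""
    phi = [[x ** j for j in range(M + 1)] for x in xs]
    gram = [[sum(r[i] * r[j] for r in phi) for j in range(M + 1)]
            for i in range(M + 1)]
    rhs = [sum(r[i] * y for r, y in zip(phi, ys)) for i in range(M + 1)]
    w = [0.0] * (M + 1)
    for _ in range(steps):
        # Gradient of L at the current w, as in the formula preceding (13).
        grad = [sum(gram[i][j] * w[j] for j in range(M + 1)) - rhs[i]
                for i in range(M + 1)]
        w = [w_i - eps * g_i for w_i, g_i in zip(w, grad)]
    return w


# Hypothetical data lying exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 5.0]
w = gradient_descent(xs, ys, M=1, eps=0.1, steps=500)
```

For this data the largest eigenvalue of $\Phi^T \Phi$ is roughly 7.16, so `eps=0.1` keeps every factor $|1 - \epsilon \lambda_j|$ below 1 and the iterates approach the least squares solution $(1, 2)$.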

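Let us evaluate (14) further. The matrix $\Phi^T \Phi$ is positive definite, so all eigenvalues are positive. This can be shown as follows. Let us consider the eigenvalue equation

$$\Phi^T \Phi v = \lambda v \quad (v \neq 0).$$

Since $\Phi$ has full rank, $\Phi v \neq 0$, and we have

$$\lambda \langle v | v \rangle = \langle \lambda v | v \rangle = \langle \Phi^T \Phi v | v \rangle = \langle \Phi v | \Phi v \rangle > 0 \Longrightarrow \lambda > 0.$$

Therefore we can arrange all eigenvalues like

$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_{M+1} > 0.$$

Since $\Phi^T \Phi$ is symmetric, it is diagonalized as

$$\Phi^T \Phi = Q D Q^T \tag{17}$$

where $Q$ is an element in $O(M+1)$ ($Q^T = Q^{-1}$) and $D$ is a diagonal matrix

$$D = \begin{pmatrix} \lambda_1 & & & \\ & \lambda_2 & & \\ & & \ddots & \\ & & & \lambda_{M+1} \end{pmatrix}.$$

See for example [13].

By substituting (17) into (14) and using the equation

$$E_{M+1} - \epsilon \Phi^T \Phi = Q (E_{M+1} - \epsilon D) Q^T = Q \begin{pmatrix} 1 - \epsilon \lambda_1 & & & \\ & 1 - \epsilon \lambda_2 & & \\ & & \ddots & \\ & & & 1 - \epsilon \lambda_{M+1} \end{pmatrix} Q^T$$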

we finally obtain

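Theorem I The general solution (14) of the recurrence (13) is written explicitly as

$$w(t) = Q \begin{pmatrix} (1 - \epsilon \lambda_1)^t & & & \\ & (1 - \epsilon \lambda_2)^t & & \\ & & \ddots & \\ & & & (1 - \epsilon \lambda_{M+1})^t \end{pmatrix} Q^T w(0) + Q \begin{pmatrix} \frac{1 - (1 - \epsilon \lambda_1)^t}{\lambda_1} & & & \\ & \frac{1 - (1 - \epsilon \lambda_2)^t}{\lambda_2} & & \\ & & \ddots & \\ & & & \frac{1 - (1 - \epsilon \lambda_{M+1})^t}{\lambda_{M+1}} \end{pmatrix} Q^T \Phi^T y. \tag{18}$$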

This is our main result.
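Formula (18) can be compared against a direct iteration of (13) in the case $M = 1$, where the eigenvalues and eigenvectors of the 2 × 2 matrix $\Phi^T \Phi$ are available in closed form. The following is a minimal sketch in plain Python; the function names, the numerical values and the restriction to a nonzero off-diagonal entry are assumptions made for illustration.

```python
import math


def closed_form(gram, rhs, w0, eps, t):
    """Evaluate (18) for a symmetric 2x2 matrix [[a, x], [x, X]] with x != 0."""
    a, x, big_x = gram[0][0], gram[0][1], gram[1][1]
    root = math.sqrt((a - big_x) ** 2 + 4 * x * x)
    lams = [(a + big_x + root) / 2, (a + big_x - root) / 2]
    w = [0.0, 0.0]
    for lam in lams:
        # Unit eigenvector for eigenvalue lam is (x, lam - a) / norm.
        norm = math.hypot(x, lam - a)
        v = (x / norm, (lam - a) / norm)
        decay = (1 - eps * lam) ** t
        coeff = (decay * (v[0] * w0[0] + v[1] * w0[1])
                 + (1 - decay) / lam * (v[0] * rhs[0] + v[1] * rhs[1]))
        w[0] += coeff * v[0]
        w[1] += coeff * v[1]
    return w


def iterate(gram, rhs, w0, eps, t):
    """Apply the recurrence (13) t times starting from w0."""
    w = list(w0)
    for _ in range(t):
        grad = [gram[i][0] * w[0] + gram[i][1] * w[1] - rhs[i] for i in range(2)]
        w = [w[i] - eps * grad[i] for i in range(2)]
    return w


gram = [[3.0, 3.0], [3.0, 5.0]]  # Phi^T Phi for xs = 0, 1, 2 and M = 1
rhs = [9.0, 13.0]                # Phi^T y for ys = 1, 3, 5
w0 = [0.5, -1.0]                 # arbitrary hypothetical start value
w_formula = closed_form(gram, rhs, w0, eps=0.1, t=20)
w_loop = iterate(gram, rhs, w0, eps=0.1, t=20)
```

The two results agree up to rounding, since the columns of $Q$ are the unit eigenvectors used in `closed_form`.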

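Next, let us show how to choose the learning rate $\epsilon$ ($0 < \epsilon < 1$), which is a very important problem in Deep Learning [7] [8].

Let us remember

$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_{M+1} > 0.$$

From (16) and (18) the equations

$$\lim_{t \to \infty} (E_{M+1} - \epsilon \Phi^T \Phi)^t = O_{M+1} \Longleftrightarrow \lim_{t \to \infty} (1 - \epsilon \lambda_j)^t = 0 \quad \text{for } 1 \le j \le M + 1 \tag{19}$$

determine the range of $\epsilon$. Noting

$$|1 - x| < 1 \ (\Longleftrightarrow 0 < x < 2) \Longrightarrow \lim_{n \to \infty} (1 - x)^n = 0$$

and

$$\epsilon \lambda_1 \ge \epsilon \lambda_2 \ge \cdots \ge \epsilon \lambda_{M+1} > 0$$

we obtain

Theorem II The learning rate $\epsilon$ must satisfy an inequality

$$0 < \epsilon \lambda_1 < 2 \Longleftrightarrow 0 < \epsilon < \frac{2}{\lambda_1}. \tag{20}$$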

Roughly speaking, the greater the value of $\epsilon$, the faster the gradient descent (11) proceeds, so long as the convergence (19) is guaranteed. Let us note that the limit (15) does not depend on the choice of the initial value $w(0)$ when the convergence condition (20) is satisfied.

Comment For example, if we choose $\epsilon$ like

$$\frac{2}{\lambda_1} < \epsilon < 1$$

then we cannot recover (15), which shows a difference in methods between Statistics and Deep Learning.
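The effect described in the Comment can be observed numerically. The following is a minimal sketch in plain Python for $M = 1$, using hypothetical data on the line $y = 1 + 2x$; the function names and the two learning rates are assumptions chosen so that one lies inside the range (20) and the other lies between $2/\lambda_1$ and 1.

```python
import math


def largest_eigenvalue_2x2(a, x, big_x):
    """Largest eigenvalue of the symmetric matrix [[a, x], [x, X]]."""
    return (a + big_x + math.sqrt((a - big_x) ** 2 + 4 * x * x)) / 2


def distance_after(gram, rhs, target, eps, steps):
    """Run (13) from w(0) = 0 and return the distance of w(steps) to target."""
    w = [0.0, 0.0]
    for _ in range(steps):
        grad = [gram[i][0] * w[0] + gram[i][1] * w[1] - rhs[i] for i in range(2)]
        w = [w[i] - eps * grad[i] for i in range(2)]
    return math.hypot(w[0] - target[0], w[1] - target[1])


gram = [[4.0, 6.0], [6.0, 14.0]]  # Phi^T Phi for xs = 0, 1, 2, 3 and M = 1
rhs = [16.0, 34.0]                # Phi^T y for ys = 1, 3, 5, 7
target = [1.0, 2.0]               # least squares solution (8) for this data

lam1 = largest_eigenvalue_2x2(4.0, 6.0, 14.0)
bound = 2 / lam1                  # upper end of the range (20)
inside = distance_after(gram, rhs, target, eps=0.1, steps=2000)
outside = distance_after(gram, rhs, target, eps=0.5, steps=50)
```

Here `bound` is roughly 0.119, so `eps=0.1` satisfies (20) and the iterates approach the least squares solution, whereas `eps=0.5` is smaller than 1 but violates (20) and the distance to the solution grows without bound.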

4. How to Estimate the Learning Rate

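How do we calculate $\lambda_1$? Since $\{\lambda_j\}$ are the eigenvalues of the matrix $\Phi^T \Phi$, they satisfy the equation

$$F(\lambda_j) = 0, \quad \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_{M+1} > 0$$

where $F(\lambda)$ is the characteristic polynomial of $\Phi^T \Phi$ given by

$$F(\lambda) = |\lambda E_{M+1} - \Phi^T \Phi|. \tag{21}$$

This is abstract, so let us rewrite (21). For simplicity we write $\Phi$ as

$$\Phi = \begin{pmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^M \\ 1 & x_2 & x_2^2 & \cdots & x_2^M \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_N & x_N^2 & \cdots & x_N^M \end{pmatrix} \equiv (x^{(0)}, x^{(1)}, x^{(2)}, \ldots, x^{(M)}). \tag{22}$$

Then it is easy to see

$$\Phi^T \Phi = \left( \langle x^{(i)} | x^{(j)} \rangle \right)_{0 \le i, j \le M}, \quad \langle x^{(i)} | x^{(j)} \rangle = \sum_{k=1}^{N} x_k^{i+j}$$

where the notation $\langle a | b \rangle$ is the (real) inner product of vectors.

For clarity let us write down (21) explicitly.

$$F(\lambda) = \begin{vmatrix} \lambda - \langle x^{(0)} | x^{(0)} \rangle & -\langle x^{(0)} | x^{(1)} \rangle & \cdots & -\langle x^{(0)} | x^{(M)} \rangle \\ -\langle x^{(1)} | x^{(0)} \rangle & \lambda - \langle x^{(1)} | x^{(1)} \rangle & \cdots & -\langle x^{(1)} | x^{(M)} \rangle \\ \vdots & \vdots & \ddots & \vdots \\ -\langle x^{(M)} | x^{(0)} \rangle & -\langle x^{(M)} | x^{(1)} \rangle & \cdots & \lambda - \langle x^{(M)} | x^{(M)} \rangle \end{vmatrix}.$$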

As far as we know there is no viable method to determine the greatest root of $F(\lambda) = 0$ if $M$ is very large (see Note 1). Therefore, let us be content with an approximate value which is not smaller than $\lambda_1$ and is easy to calculate.

For this purpose Gerschgorin's theorem is very useful (see Note 2). Let $A = (a_{ij})$ be an $n \times n$ complex (real in our case) matrix, and we set

$$R_i = \sum_{j=1,\, j \neq i}^{n} |a_{ij}| \tag{23}$$

and

$$D(a_{ii}; R_i) = \{ z \in \mathbb{C} \mid |z - a_{ii}| \le R_i \} \tag{24}$$

for each $i$. This is a closed disc centered at $a_{ii}$ with radius $R_i$, called the Gerschgorin disc.

Theorem (Gerschgorin [14]) For any eigenvalue $\lambda$ of $A$ we have

$$\lambda \in \bigcup_{i=1}^{n} D(a_{ii}; R_i). \tag{25}$$

The proof is simple. See for example [7] .

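Our case is real with $n = M + 1$ and

$$A = \Phi^T \Phi = \left( \langle x^{(i)} | x^{(j)} \rangle \right)_{0 \le i, j \le M}.$$

Therefore, all eigenvalues $\{\lambda\}$ satisfy

$$\lambda \in \bigcup_{i=1}^{n} D(a_{ii}; R_i) \cap \mathbb{R} = \bigcup_{i=1}^{M+1} [a_{ii} - R_i, a_{ii} + R_i] \tag{26}$$

where $[A, B]$ is a closed interval and

$$a_{ii} = \langle x^{(i-1)} | x^{(i-1)} \rangle \quad \text{and} \quad R_i = \sum_{k=1,\, k \neq i}^{M+1} \left| \langle x^{(i-1)} | x^{(k-1)} \rangle \right|.$$

If we define

$$F_{M+1} = \max_{1 \le i \le M+1} \{ a_{ii} + R_i \} \tag{27}$$

then it is easy to see

$$\lambda_1 \le F_{M+1} \Longleftrightarrow \frac{2}{F_{M+1}} \le \frac{2}{\lambda_1}$$

from (26).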

Thus we arrive at an admissible value of the learning rate $\epsilon$ which is easily obtained.

Theorem III An admissible value of $\epsilon$ is

$$\epsilon = \frac{2}{F_{M+1}}. \tag{28}$$
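The bound $F_{M+1}$ in (27) only needs the power sums $\sum_k x_k^{p}$ for $0 \le p \le 2M$, because these are the entries of $\Phi^T \Phi$. The following is a minimal sketch in plain Python; the function name and the sample data are hypothetical.

```python
def gershgorin_learning_rate(xs, M):
    """Return (F, eps) with F = max_i (a_ii + R_i) for A = Phi^T Phi and eps = 2 / F."""
    # Entry (i, j) of Phi^T Phi is the power sum  sum_k x_k ** (i + j).
    power_sums = [sum(x ** p for x in xs) for p in range(2 * M + 1)]
    bound = 0.0
    for i in range(M + 1):
        diagonal = power_sums[2 * i]
        radius = sum(abs(power_sums[i + j]) for j in range(M + 1) if j != i)
        bound = max(bound, diagonal + radius)
    return bound, 2.0 / bound


# Hypothetical data with M = 1: Phi^T Phi = [[3, 3], [3, 5]].
F, eps = gershgorin_learning_rate([0.0, 1.0, 2.0], M=1)
```

For this data the row sums are 6 and 8, so `F` is 8 and `eps` is 0.25, while the largest eigenvalue is roughly 7.16. In the special case where $F_{M+1}$ happens to coincide with $\lambda_1$, the value (28) sits on the boundary of the strict inequality (20), so a slightly smaller $\epsilon$ may be the safer choice there.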

Let us show an example in the case of $M = 1$ ([1]), which is very instructive for non-experts.

Example In this case it is easy to see that

$$\Phi^T \Phi = \begin{pmatrix} N & \sum_{i=1}^{N} x_i \\ \sum_{i=1}^{N} x_i & \sum_{i=1}^{N} x_i^2 \end{pmatrix} \equiv \begin{pmatrix} a & x \\ x & X \end{pmatrix}$$

where the second matrix introduces shorthand notation for simplicity. Moreover, we may assume $x > 0$. Then from (21) we have

$$f(\lambda) = |\lambda E_2 - \Phi^T \Phi| = \lambda^2 - (a + X) \lambda + (a X - x^2)$$

and

$$\lambda_1 = \frac{a + X + \sqrt{(a + X)^2 - 4 (a X - x^2)}}{2} = \frac{a + X + \sqrt{(a - X)^2 + 4 x^2}}{2}.$$

On the other hand, from (27) we have

$$F_2 = \max \{ a + x, X + x \}$$

because $x > 0$.

Then it is easy to show

$$\max \{ a + x, X + x \} \ge \frac{a + X + \sqrt{(a - X)^2 + 4 x^2}}{2}.$$

To check this inequality is left to readers. Therefore, from (28) the admissible value becomes

$$\epsilon = \frac{2}{\max \{ a + x, X + x \}}.$$

We emphasize once more that $F_{M+1}$ is easy to evaluate, while to calculate $\lambda_1$ is very hard if $M$ is large.
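For the inequality of the Example whose verification is left to readers, one possible check runs as follows (a sketch, not taken from the original). The inequality is symmetric in $a$ and $X$, so we may assume $a \ge X$, in which case $\max \{ a + x, X + x \} = a + x$. The claim then reads

$$2 (a + x) \ge a + X + \sqrt{(a - X)^2 + 4 x^2} \Longleftrightarrow (a - X) + 2 x \ge \sqrt{(a - X)^2 + 4 x^2}.$$

Both sides are non-negative, so squaring gives the equivalent condition

$$(a - X)^2 + 4 x (a - X) + 4 x^2 \ge (a - X)^2 + 4 x^2 \Longleftrightarrow 4 x (a - X) \ge 0,$$

which holds because $x > 0$ and $a \ge X$.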

5. Concluding Remarks

In this paper we have discussed the least squares method by polynomial approximation from the view point of Deep Learning and carried out calculation of the gradient descent thoroughly. A difference in methods between Statistics and Deep Learning delivers different results when the learning rate ϵ is changed. Theorem III is the first result to provide an admissible value of ϵ as far as we know.

Deep Learning plays an essential role in Data Science and maybe in almost all fields of Science. Therefore it is desirable for undergraduates to master it in the early stages. To master it they must study Calculus, Linear Algebra and Statistics from Mathematics. My textbook [7] is recommended.

Acknowledgements

We wish to thank Ryu Sasaki for useful suggestions and comments.

NOTES

1 $\Phi^T \Phi$ is not a sparse matrix.

2 In my opinion this theorem is not so popular. Why?

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] Fujii, K. (2018) Least Squares Method from the View Point of Deep Learning. Advances in Pure Mathematics, 8, 485-493.
[2] Wikipedia: Least Squares.
https://en.m.wikipedia.org/wiki/Least_Squares
[3] Wikipedia: Deep Learning.
https://en.m.wikipedia.org/wiki/Deep_Learning
[4] Goodfellow, I., Bengio, Y. and Courville, A. (2016) Deep Learning. The MIT Press, Cambridge.
[5] Patterson, J. and Gibson, A. (2017) Deep Learning: A Practitioner’s Approach, O’Reilly Media, Inc., Sebastopol.
[6] Alpaydin, E. (2014) Introduction to Machine Learning. 3rd Edition, The MIT Press, Cambridge.
[7] Fujii, K. (2018) Introduction to Mathematics for Understanding Deep Learning. Scientific Research Publishing Inc., Wuhan.
[8] Okaya, T. (2015) Deep Learning (In Japanese). Kodansha Ltd., Tokyo.
[9] Nakai, E. (2015) Introduction to Theory of Machine Learning (In Japanese). Gijutsu-Hyouronn Co., Ltd., Tokyo.
[10] Amari, S. (2016) Brain Heart Artificial Intelligence (In Japanese). Kodansha Ltd., Tokyo.
[11] Fujii, K. (2018) Mathematical Reinforcement to the Minibatch of Deep Learning. Advances in Pure Mathematics, 8, 307-320.
https://doi.org/10.4236/apm.2018.83016
[12] Wikipedia: Gradient Descent.
https://en.m.wikipedia.org/wiki/Gradient_descent
[13] Kasahara, K. (2000) Linear Algebra (In Japanese). Saiensu Ltd., Tokyo.
[14] Gerschgorin, S. (1931) Über die Abgrenzung der Eigenwerte einer Matrix. Izv. Akad. Nauk. USSR Otd. Fiz.-Mat. Nauk, 6, 749-754.

Copyright © 2024 by authors and Scientific Research Publishing Inc.

Creative Commons License

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.