Convergence Analysis of a Kind of Deterministic Discrete-Time PCA Algorithm
1. Introduction
At present, principal component analysis (PCA) is widely used in data processing, image recognition, and even video recognition [1]. In general, QR decomposition and singular value decomposition (SVD) are most commonly used to compute principal components. However, when the data dimension is large, the computational cost of these traditional matrix-decomposition methods is very high, and they are not suitable for applications in which data arrive incrementally or online. Adaptive PCA learning algorithms can overcome these problems.
Regarding PCA learning algorithms, the Hebbian learning rule was first proposed by Hebb, but it is unstable and prone to divergence. Oja [2] added a decay term to the Hebbian rule, yielding Oja's learning rule, which always converges to the principal component of the covariance matrix.
Many optimization methods can also extract principal components. Xu [3] proposed the LMSER PCA algorithm, which extracts principal components based on least-mean-square-error reconstruction. Chatterjee [4] proposed an unconstrained objective function and derived several new adaptive PCA algorithms using the gradient descent, steepest descent, conjugate gradient, and Newton methods; the results showed that the latter methods converge faster than plain gradient descent. For data containing outliers, Xu [5] used a derivation based on a Gibbs distribution, adding a coefficient to the learning rate to detect outliers and down-weight them, giving the robust PCA algorithm. Song [6] then used a Cauchy distribution to redefine the weights, and Yang [7] proposed the FRPCA algorithm, which, unlike the algorithm of Xu [5], selects the relevant parameters automatically.
All of these PCA learning algorithms are described by stochastic discrete-time systems, and convergence is crucial for them to be useful in applications. It is not easy to study the behavior of such algorithms directly. To indirectly prove the convergence of Oja's learning algorithm, the algorithms are traditionally transformed into deterministic continuous-time (DCT) systems [8] [9] [10]. This transformation is based on a fundamental theorem in the study of stochastic approximation algorithms [10], which relates the discrete-time stochastic model to a DCT formulation. Stochastic approximation theory requires strict assumptions, including that the learning rate converge to zero. These conditions are difficult to achieve in practice and hurt the convergence speed [11] [12], which makes it challenging to apply the DCT method to analyze the convergence of PCA algorithms in practice. Subsequently, Zufiria [12] proposed studying Oja's algorithm through a deterministic discrete-time (DDT) system, obtained by applying a conditional expectation to Oja's learning algorithm. The DDT system preserves the discrete-time nature of the original learning algorithm and can shed light on the behavior under a constant learning rate.
Zhang [13] further identified the learning rate for which the convergence speed is fastest. Based on this, Lv [14] proposed an adaptive non-zero learning rate algorithm and proved its convergence; [15] gives a summary of this line of work. However, an inappropriately chosen initial value can cause the algorithm of [14] to fail to iterate. In this paper we propose a generalized adaptive learning rate (GALR) PCA algorithm that guarantees convergence regardless of the initial value.
2. Oja’s Algorithm, the DCT Method and the DDT Method
2.1. Oja’s Algorithm
Consider a linear single-neuron network with the input–output relation

$$y(k) = w^{T}(k)x(k), \qquad k = 0, 1, 2, \ldots \qquad (1)$$

where $y(k)$ is the network output, the input sequence $\{x(k)\,|\,x(k)\in R^{n},\ k \ge 0\}$ is a zero-mean stationary stochastic process, and $w(k)\in R^{n}$ is the weight vector. The basic Oja PCA learning algorithm updates the weights as

$$w(k+1) = w(k) + \eta(k)\,y(k)\left[x(k) - y(k)w(k)\right] \qquad (2)$$

for $k \ge 0$, where $\eta(k) > 0$ is the learning rate. It is very difficult to study the convergence of (2) directly.
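For concreteness, here is a minimal NumPy sketch of the stochastic update (2). The data stream, the dimension, and the constant learning rate are illustrative assumptions rather than part of the original specification.

```python
import numpy as np

def oja_step(w, x, eta):
    """One stochastic Oja update (2): w <- w + eta * y * (x - y * w)."""
    y = w @ x                        # network output y(k) = w(k)^T x(k), as in (1)
    return w + eta * y * (x - y * w)

# Illustrative driver: a zero-mean stationary stream with a hypothetical covariance.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
cov = A @ A.T / 6                    # assumed data covariance, not from the paper
w = rng.standard_normal(6)
for _ in range(5000):
    x = rng.multivariate_normal(np.zeros(6), cov)
    w = oja_step(w, x, eta=0.01)     # small constant learning rate (assumption)
print(w / np.linalg.norm(w))         # approximates a unit principal eigenvector of cov
```

With a constant learning rate the iterate only hovers near the principal eigenvector rather than converging exactly, which is precisely the gap between practice and the DCT conditions discussed next.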
2.2. DCT Method
To indirectly interpret the convergence of (2), the DCT method is traditionally used. Applying a fundamental stochastic approximation theorem [8] to (2) yields the DCT system

$$\frac{dw(t)}{dt} = Cw(t) - \left[w^{T}(t)Cw(t)\right]w(t), \qquad (3)$$

where $C = E\left[x(k)x^{T}(k)\right]$ denotes the correlation matrix of $\{x(k)\}$. The convergence of this DCT system is then used to interpret the convergence of (2). To transform (2) into this DCT system, the following restrictive conditions are required:
1) $\sum_{k=0}^{\infty}\eta(k) = +\infty$ and $\sum_{k=0}^{\infty}\eta^{2}(k) < +\infty$;

2) the trajectory $\{w(k)\}$ visits a compact subset of the domain of attraction of (3) infinitely often, with probability one.

Condition 1) clearly implies that $\eta(k)\to 0$ as $k\to\infty$; the schedule $\eta(k) = 1/(k+1)$, for example, satisfies it. However, in many practical applications, $\eta(k)$ is often set to a small constant because of round-off limitations and speed requirements [11] [12]. Condition 2) is also difficult to satisfy [12]. Thus, these conditions are unrealistic in practical applications, and the DCT analysis is not directly applicable [11] [12].
2.3. DDT Method
To overcome the requirement that the learning rate converge to zero, a DDT method was proposed in [12]. Unlike the DCT method, which transforms (2) into a continuous-time deterministic system, the DDT method transforms (2) into a discrete-time deterministic system, as follows. Applying the conditional expectation operator $E\{\cdot\,|\,w(0), x(i), i < k\}$ to (2), and identifying the conditional expected value as the next iterate of the system [11], gives the DDT system

$$w(k+1) = w(k) + \eta\left[Cw(k) - \left(w^{T}(k)Cw(k)\right)w(k)\right] \qquad (4)$$

for $k \ge 0$, where $C = E\left[x(k)x^{T}(k)\right]$ is the correlation matrix. The DDT method has some obvious advantages. First, it preserves the discrete-time nature of the original learning algorithm. Second, it does not require the unrealistic conditions of the DCT approach; in particular, it allows the learning rate to be a constant. The main purpose of this paper is to study the convergence of (4) subject to the learning rate $\eta$ being some constant.
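Because the DDT system (4) replaces the random samples by the correlation matrix C, it can be iterated deterministically. A minimal sketch, assuming C is known and the learning rate is a constant (both the test matrix and the value of eta are assumptions):

```python
import numpy as np

def ddt_step(w, C, eta):
    """One iterate of the DDT system (4): w <- w + eta * (C w - (w^T C w) w)."""
    Cw = C @ w
    return w + eta * (Cw - (w @ Cw) * w)

# Illustrative run with a constant learning rate.
rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
C = A @ A.T / 6                      # symmetric nonnegative-definite correlation matrix
w = rng.standard_normal(6)
for _ in range(300):
    w = ddt_step(w, C, eta=0.1)      # constant learning rate, permitted by the DDT view
# w now approximates a unit eigenvector of the largest eigenvalue of C.
```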
Zhang [13] proved the convergence of (4) when $\eta$ satisfies certain conditions, and identified the learning rate for which convergence is fastest. Based on the proof in [13], Lv [14] proposed an adaptive learning rate PCA algorithm and gave a theoretical proof of its convergence:

$$w(k+1) = w(k) + \eta(k)\left[Cw(k) - \left(w^{T}(k)Cw(k)\right)w(k)\right], \qquad (5)$$

where $\eta(k)$ is the non-zero-approaching adaptive learning rate of [14], whose denominator is built from $w^{T}(k)Cw(k)$.
3. The Generalized Adaptive Learning Rate (GALR) PCA Algorithm
In this section, details associated with a convergence analysis of the proposed learning algorithm will be provided systematically.
3.1. GALR PCA Algorithm Formulation
Although [14] gives a convergence proof of (5) under certain conditions, when the data arrive online we cannot obtain the covariance matrix C in advance, which may cause the denominator of the learning rate, $w^{T}(k)Cw(k)$, to vanish and break the iteration. We propose a generalized adaptive learning rate (GALR) PCA algorithm,

$$w(k+1) = w(k) + \eta(k)\left[Cw(k) - \left(w^{T}(k)Cw(k)\right)w(k)\right], \qquad \eta(k) = \frac{1}{w^{T}(k)(aC + bI)w(k)}, \qquad (6)$$

for $k \ge 0$, where $C = E\left[x(k)x^{T}(k)\right]$ is the correlation matrix, $a, b > 0$ are coefficients, and $I$ is the identity matrix. Since $w^{T}(k)(aC + bI)w(k) \ge b\|w(k)\|^{2} > 0$ whenever $w(k) \ne 0$, the learning rate is always well defined.
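A minimal sketch of one GALR iteration follows, using the learning-rate form written in (6); the default values a = b = 1 are illustrative assumptions.

```python
import numpy as np

def galr_step(w, C, a=1.0, b=1.0):
    """One GALR iterate of (6) with eta(k) = 1 / (w^T (aC + bI) w).

    The b*I term keeps the denominator positive for any nonzero w,
    so the iteration cannot break down the way (5) can."""
    Cw = C @ w
    eta = 1.0 / (a * (w @ Cw) + b * (w @ w))
    return w + eta * (Cw - (w @ Cw) * w), eta
```

Unlike (5), no special care is needed when $w^{T}(0)Cw(0) = 0$: the update is still defined, and the boundedness results of Section 3.2 then apply.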
For convenience of analysis, some preliminaries are given next. The correlation matrix C is symmetric and nonnegative definite, so there exists an orthonormal basis of $R^{n}$ composed of eigenvectors of C. Let $\lambda_{1}, \ldots, \lambda_{n}$ be the eigenvalues of C, ordered as $\lambda_{1} \ge \lambda_{2} \ge \cdots \ge \lambda_{n} \ge 0$. Suppose that $\lambda_{p}$ is the smallest nonzero eigenvalue, i.e., $\lambda_{p} > 0$ but $\lambda_{p+1} = 0$. Denote by $\sigma = \lambda_{1}$ the largest eigenvalue of C, and suppose that the multiplicity of $\sigma$ is $m$, so that $\sigma = \lambda_{1} = \cdots = \lambda_{m} > \lambda_{m+1}$. Suppose that $\{v_{1}, v_{2}, \ldots, v_{n}\}$ is an orthonormal basis of $R^{n}$ such that each $v_{i}$ is a unit eigenvector of C associated with the eigenvalue $\lambda_{i}$. Denote by

$$V_{\sigma} = \mathrm{span}\{v_{1}, \ldots, v_{m}\}$$

the eigensubspace of the largest eigenvalue $\sigma$. Denoting by $V_{\sigma}^{\perp}$ the subspace perpendicular to $V_{\sigma}$, clearly $R^{n} = V_{\sigma} \oplus V_{\sigma}^{\perp}$.
Since the vector set $\{v_{1}, \ldots, v_{n}\}$ is an orthonormal basis of $R^{n}$, for each $k \ge 0$ the iterate $w(k)$ can be represented as

$$w(k) = \sum_{i=1}^{n} z_{i}(k)\,v_{i}, \qquad (7)$$

where the $z_{i}(k)$ are the coordinates of $w(k)$, and

$$\|w(k)\|^{2} = \sum_{i=1}^{n} z_{i}^{2}(k), \qquad (8)$$

$$w^{T}(k)Cw(k) = \sum_{i=1}^{n} \lambda_{i}\,z_{i}^{2}(k). \qquad (9)$$

Substituting (7) into (6), we have

$$z_{i}(k+1) = \left[1 + \eta(k)\left(\lambda_{i} - \rho(k)\right)\right]z_{i}(k) \qquad (10)$$

for $k \ge 0$ and $i = 1, \ldots, n$, where $\rho(k) = w^{T}(k)Cw(k)$.
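The change of basis in (7)–(10) is easy to verify numerically: projecting one iterate of (6) onto the eigenvectors of C must reproduce the scalar recursion (10). A small self-contained check, in which the matrix, the starting point, and the choice a = b = 1 are arbitrary test values:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 6))
C = A @ A.T / 6
lam, V = np.linalg.eigh(C)           # columns of V: orthonormal eigenvectors v_i

w = rng.standard_normal(6)
Cw = C @ w
rho = w @ Cw                         # rho(k) = w(k)^T C w(k), as in (9)
eta = 1.0 / (1.0 * rho + 1.0 * (w @ w))      # GALR rate of (6) with a = b = 1
w_next = w + eta * (Cw - rho * w)            # one iterate of (6)

z, z_next = V.T @ w, V.T @ w_next    # coordinates z_i(k), z_i(k+1) of (7)
print(np.allclose(z_next, (1 + eta * (lam - rho)) * z))   # True: recursion (10)
```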
Next, a lemma is presented that will be used for the subsequent algorithm analysis.
Lemma 1. It holds that
where
and
is a constant.
Its proof can be found in [14].
3.2. Boundedness
We will analyze the boundedness of (6). The lower and upper bounds will be given in the following theorems.
Lemma 2. Suppose that
, if
then
is increasing.
Proof. If
it can be checked that
(11)
for
. Then, from (6), (8), (10), we have
(12)
Theorem 1. If
, it holds that
(13)
for all
, where
.
Proof. It can be checked that
(14)
Similar to the proof of Lemma 2, we have
(15)
for
. By Lemma 1, it holds that
The above theorem shows that if
, the trajectory starting from
is lower bounded, and hence will never converge to zero.
Theorem 2. For any
, there is a constant
. The specific expression is
(16)
for all
. Then
is always true.
Proof. We prove this by mathematical induction. For all
,
holds. Next, we prove that
also holds. We discuss two cases:
Case 1:
. Then
(17)
It follows that
(18)
Case 2:
. Then
(19)
It follows that
(20)
By Lemma 1 and Lemma 2, denote
When
(21)
When
(22)
So
is always true.
3.3. Global Convergence
Lemma 3. If
there exist constants
and
such that
(23)
for all
, where
(24)
Proof. Since
there must exist some
such that
. Without loss of generality, assume that
.
From (10), it follows that
(25)
(26)
By Theorem 2,
has an upper bound for all
. Given any
, it holds that
(27)
for
. Then from (25) and (26), for each
, it follows that
(28)
for all
.
Since
is bounded, there exists a constant
such that
for all
. Then
(29)
for
, where
Next, for convenience, denote
(30)
(31)
for all
. Clearly,
and
for all
.
Lemma 4. It holds that
(32)
for
.
Proof. From (9) and (10), it follows that
(33)
Lemma 5. If
, it holds that
(34)
Proof. From (30), it follows that
(35)
for
. By Lemma 1, it holds that
Lemma 6. There exists a positive constant
such that
(36)
for
.
Proof. From (31)
(37)
for
. By Theorem 1 and Lemma 3,
(38)
for
, where
Lemma 7. If
, then there exist constants
and
such that
(39)
for all
, where
(40)
and
.
Proof. From (30), (31) and Lemma 4, it follows that
(41)
for
.
Denote
(42)
for
. It follows that
(43)
for
. By Lemma 5, it holds that
(44)
for
. Denote
(45)
Clearly, if
. Then
(46)
Denote
(47)
By Theorem 1 and Lemma 6,
(48)
where
(49)
Denote
(50)
Then,
(51)
for all
.
Lemma 8. Suppose that there exist constants
and
such that
(52)
for
. Then,
(53)
where
are constants.
Proof. When
(54)
when
(55)
Next, we can prove that each sequence
is a Cauchy sequence; see [14] for the proof. By the Cauchy convergence principle, there exist constants
such that
Theorem 3. Suppose that
, if
then the trajectory of (6) starting from
will converge to a linear combination of
and coefficients
. This linear combination is an eigenvector corresponding to the largest eigenvalue of the covariance matrix.
Proof. By Lemma 3, there exist constants
and
such that
(56)
for all
. By Lemma 7, there exist constants
and
such that
(57)
for all
.
Obviously, there exist constants
and
such that
(58)
for
and
Using Lemmas 3 and 8, it follows that
(59)
Then,
(60)
When the algorithm converges, we have
(61)
(62)
(63)
Consider the most common case, in which the largest eigenvalue has multiplicity one. After the algorithm converges,
. We have
. Finally, we can get the largest eigenvector
4. Numerical Experiment
Some examples will be provided in this section to illustrate the proposed theory.
4.1. Offline Data Experiment
The numerical experiment mainly observes the convergence of the algorithm. We randomly generate a covariance matrix C. Using standard eigendecomposition methods, we can directly find that the maximum eigenvalue of C is 1.778753, with corresponding eigenvector [0.341311, 0.579619, 0.411713, 0.336439, 0.203388, 0.472741]. Next, we select three different initial vectors, fix the coefficients a and b, and use a small elementwise threshold as the stopping criterion (that is, the algorithm is considered to have converged, and the iteration is stopped, when the absolute value of the difference between each element of the current iterate and the corresponding element of the previous iterate is less than the threshold), and we observe the convergence of the algorithm.
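The experiments can be reproduced with a loop of the following shape. The paper's randomly generated covariance matrix is not fully reproduced above, so the sketch draws its own test matrix, and the coefficient values a = b = 1 are assumptions; the stopping rule is the elementwise criterion just described.

```python
import numpy as np

def run_galr(C, w0, a, b, tol=1e-5, max_iter=10_000):
    """Iterate (6) until max_i |w_i(k) - w_i(k-1)| < tol, the stopping criterion."""
    w = w0.astype(float)
    eta = 0.0
    for k in range(1, max_iter + 1):
        Cw = C @ w
        eta = 1.0 / (a * (w @ Cw) + b * (w @ w))   # GALR rate of (6)
        w_new = w + eta * (Cw - (w @ Cw) * w)
        if np.max(np.abs(w_new - w)) < tol:
            return w_new, eta, k
        w = w_new
    return w, eta, max_iter

# Test matrix and initial vector are stand-ins for the paper's random ones.
rng = np.random.default_rng(3)
A = rng.standard_normal((6, 6))
C = A @ A.T / 6
w, eta, iters = run_galr(C, rng.random(6), a=1.0, b=1.0)
lam_max = (w @ C @ w) / (w @ w)      # Rayleigh quotient: the converged eigenvalue
print(iters, lam_max, w / np.linalg.norm(w), eta)
```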
The first randomly generated initial vector is [0.5488, 0.7152, 0.6028, 0.5449, 0.4237, 0.6459]. The trajectories of the eigenvector iterates and of the learning rate are shown in Figure 1. The algorithm converges after 23 iterations: the eigenvalue estimate converges to the maximum eigenvalue 1.778753, the final iterate gives the maximum eigenvector [0.341436, 0.579398, 0.411745, 0.336571, 0.203655, 0.472685], and the learning rate converges to 0.281096. The error from the true maximum eigenvector is less than 0.0001.
The second randomly generated initial vector is [0.0055, 0.0072, 0.006, 0.0054, 0.0042, 0.0065]. The trajectories of the eigenvector iterates and of the learning rate are shown in Figure 2. The algorithm converges after 27 iterations: the eigenvalue estimate converges to the maximum eigenvalue 1.778753, the final iterate gives the maximum eigenvector [0.341423, 0.579428, 0.411735, 0.336547, 0.203620, 0.472697], and the learning rate converges to 0.281096. The error from the true maximum eigenvector is again less than 0.0001.
Figure 1. Convergence process of (6) with median initial vector.
Figure 2. Convergence process of (6) with smaller initial vector.
The third randomly generated initial vector is [1142.75, 1458.86, 1245.25, 1135.28, 904.94, 1327.2]. The trajectories of the eigenvector iterates and of the learning rate are shown in Figure 3. The algorithm converges after 35 iterations: the eigenvalue estimate converges to the maximum eigenvalue 1.778753, the final iterate gives the maximum eigenvector [0.341414, 0.579436, 0.411738, 0.336548, 0.203616, 0.472693], and the learning rate converges to 0.281096. The error from the true maximum eigenvector is still less than 0.0001.
The three experiments show that the algorithm converges quickly to the maximum eigenvector from very different initial values; naturally, the closer the initial vector is to the maximum eigenvector, the faster the convergence. When the initial value is very large, the initial learning rate is close to 0 and the iterate gradually comes down to convergence. When the initial value is very small, the initial learning rate is very large, so the iterate takes a large jump and then gradually converges. Once the coefficients a and b are fixed, the learning rate after convergence is the same fixed value no matter how the initial vector is chosen, which is consistent with our theoretical proof.
Although the theoretical proof requires the initial vector not to be orthogonal to $V_{\sigma}$, after many numerical experiments we found that an arbitrarily chosen initial vector converges; this is because the probability that $w(0) \perp V_{\sigma}$ is almost 0. We then selected a low-dimensional covariance matrix with integer eigenvectors for experiments and found that when $w(0) \perp V_{\sigma}$, the iteration does not diverge but converges to another eigenvector. Suppose the multiplicity of the largest eigenvalue of C (denoted by $\sigma_{1}$) is $m_{1}$, i.e., $\sigma_{1} = \lambda_{1} = \cdots = \lambda_{m_{1}}$; suppose the multiplicity of the second largest eigenvalue of C (denoted by $\sigma_{2}$) is $m_{2}$, i.e., $\sigma_{2} = \lambda_{m_{1}+1} = \cdots = \lambda_{m_{1}+m_{2}}$. Denote $V_{\sigma_{1}} = \mathrm{span}\{v_{1}, \ldots, v_{m_{1}}\}$ and $V_{\sigma_{2}} = \mathrm{span}\{v_{m_{1}+1}, \ldots, v_{m_{1}+m_{2}}\}$. When $w(0)$ is not perpendicular to $V_{\sigma_{1}}$, $w(k)$ will converge to the largest eigenvector of C; when $w(0) \perp V_{\sigma_{1}}$ and $w(0)$ is not perpendicular to $V_{\sigma_{2}}$, $w(k)$ will converge to the second largest eigenvector of C.
Figure 3. Convergence process of (6) with larger initial vector.
4.2. Comparison of Algorithms (5) and (6)
To compare (5) and (6), we performed some experiments. We randomly generated the initial vector [571.37, 729.43, 622.63, 567.64, 452.47, 663.6]. The two algorithms use the same covariance matrix and the same stopping criterion; for (6) we use a = 0.001 and b = 0.001. The results are shown in Figure 4 and Figure 5.
From the results, we can see that algorithm (5) converges after 83 iterations, while algorithm (6) converges after 69 iterations. Algorithm (6) reaches a neighborhood of the true value within the first ten iterations, while algorithm (5) requires more. We found that when the initial value is large, choosing smaller coefficients a, b helps the algorithm converge more quickly.
Figure 4. Convergence of algorithm (5) with larger initial vector.
Figure 5. Convergence of algorithm (6) with larger initial vector and smaller a, b.
4.3. Selection of Parameters a, b
The most significant difference between algorithm (6) and algorithm (5) is the two newly introduced parameters a and b. Next, we use numerical experiments to study the influence of a and b on the convergence process of the algorithm.
We take [0.5488, 0.7152, 0.6028, 0.5449, 0.4237, 0.6459] as the initial vector, keep the same stopping criterion, and vary a and b to observe the convergence of algorithm (6). The number of iterations required by algorithm (6) under different parameters is shown in Table 1.
We can see that when one parameter is much larger than the other, the smaller parameter loses its effect; the two parameters must be kept within the same order of magnitude. This is because the covariance matrix C and the identity matrix I are of the same order of magnitude in this experiment, so if one coefficient is much larger, the effect of the other is diminished.
From Table 1, we can see that the convergence speed is fastest when a and b lie in a particular interval; taking other values in this interval hardly changes the convergence rate. This shows that the algorithm is not sensitive to changes of a and b within the same order of magnitude.
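A Table-1-style sensitivity sweep can be scripted as follows, reusing `run_galr` and `C` from the sketch in Section 4.1; the grid of values is illustrative, and under the learning-rate form of (6) very small a and b may overshoot, which the `max_iter` cap reports as non-convergence.

```python
# Reuses run_galr and C from the sketch in Section 4.1; the grid is illustrative.
import numpy as np

w0 = np.array([0.5488, 0.7152, 0.6028, 0.5449, 0.4237, 0.6459])
for a in (1.0, 5.0, 10.0, 50.0):
    for b in (1.0, 5.0, 10.0, 50.0):
        _, _, iters = run_galr(C, w0, a=a, b=b)
        print(f"a={a:<5} b={b:<5} iterations={iters}")
```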
Using [0.0055, 0.0072, 0.006, 0.0054, 0.0042, 0.0065] as the initial vector, we found another interval of a and b for which the convergence speed is fastest, and using [59.3932, 74.367, 64.2487, 59.0395, 48.1289, 68.1305] as the initial vector we found a third. This shows that when we choose a larger initial vector, we should choose smaller a and b, and vice versa. All three situations satisfy the same relation between the initial vector and the coefficients; when a and b are chosen to satisfy it, the convergence speed of algorithm (6) is fastest.
4.4. Online Data Experiment
Next, we consider the case of online data. Suppose the input sequence $\{x(k)\}$ is a zero-mean stationary stochastic process. Because the data arrive one by one online, we cannot obtain the covariance matrix C in advance. Following [4], C can be estimated recursively as

$$\hat{C}(k) = \hat{C}(k-1) + \alpha(k)\left[x(k)x^{T}(k) - \hat{C}(k-1)\right], \qquad (64)$$

where $\alpha(k)$ is a coefficient: when $x(k)$ comes from a stationary process, $\alpha(k)$ can be taken as $1/k$; otherwise $\alpha$ is a constant. By the law of large numbers, $\hat{C}(k)$ converges to C with probability 1.
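A sketch of the online variant under this recursive estimate follows: each iteration consumes one new sample, updates $\hat{C}(k)$ by (64), and then performs one GALR step with the current estimate. The data-generating covariance, the $\alpha$ schedule, and the choice a = b = 1 are assumptions.

```python
import numpy as np

def cov_update(C_hat, x, alpha):
    """Recursive estimate (64): C_hat <- C_hat + alpha * (x x^T - C_hat)."""
    return C_hat + alpha * (np.outer(x, x) - C_hat)

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 6))
true_cov = A @ A.T / 6               # hypothetical data-generating covariance
C_hat = np.zeros((6, 6))
w = rng.random(6)
for k in range(1, 2001):
    x = rng.multivariate_normal(np.zeros(6), true_cov)
    C_hat = cov_update(C_hat, x, alpha=1.0 / k)   # stationary case: alpha = 1/k
    Cw = C_hat @ w
    eta = 1.0 / (1.0 * (w @ Cw) + 1.0 * (w @ w))  # GALR rate of (6), a = b = 1
    w = w + eta * (Cw - (w @ Cw) * w)
print(w / np.linalg.norm(w))         # tracks the principal eigenvector of true_cov
```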
We randomly generate data and add one new sample at every iteration, still using the same stopping criterion. We find that different initial vectors and different values of $\alpha$ affect the convergence speed of the algorithm; experiments with different initial values behave just as in the offline case.
We studied the influence of the parameter $\alpha$ on the convergence of the algorithm, using the same initial value in each run. The result is shown in Figure 6: the smaller $\alpha$ is, the faster the algorithm converges.
Figure 6. Convergence of the algorithm with different values of $\alpha$.
Table 1. Number of iterations with different a, b.
5. Conclusion
In this paper, we proposed a generalized adaptive learning rate (GALR) PCA algorithm whose convergence process is guaranteed not to be affected by the selection of the initial value. Using the DDT method, we gave upper and lower bounds for the iterates and proved global convergence, without requiring the learning rate to tend to zero. Numerical experiments verified the theory, and we discussed the relationship between the initial vector and the parameters a and b. The algorithm is effective for both online and offline data.