Greedy Randomized Gauss-Seidel Method with Oblique Direction

Abstract

For linear least squares problems whose coefficient matrix has highly correlated columns, we develop a greedy randomized Gauss-Seidel method with oblique direction and deduce the corresponding convergence result. Numerical examples demonstrate that the proposed method is superior to both the greedy randomized Gauss-Seidel method and the randomized Gauss-Seidel method with oblique direction.

Li, W. and Zhang, P. (2023) Greedy Randomized Gauss-Seidel Method with Oblique Direction. Journal of Applied Mathematics and Physics, 11, 1036-1048. doi: 10.4236/jamp.2023.114068.

1. Introduction

We focus on the linear least squares problem

$\underset{\beta \in {ℝ}^{n}}{\mathrm{min}}{‖y-X\beta ‖}_{2}^{2},$ (1)

where $X\in {ℝ}^{m×n}$ $\left(m\ge n\right)$ is of full column rank, $y\in {ℝ}^{m}$ and ${‖\text{ }\cdot \text{ }‖}_{2}$ denotes the Euclidean norm. The least squares problem arises in a wide range of fields, such as signal processing and image restoration. The coordinate descent method is an effective iterative method for (1): it applies the Gauss-Seidel method to the equivalent normal equation

${X}^{\text{T}}X\beta ={X}^{\text{T}}y,$ (2)

whose iteration takes the form

${\beta }_{k+1}={\beta }_{k}+\frac{{X}_{{j}_{k}}^{\text{T}}\left(y-X{\beta }_{k}\right)}{{‖{X}_{{j}_{k}}‖}_{2}^{2}}{e}_{{j}_{k}},\text{ }{j}_{k}=\left(k\mathrm{mod}n\right)+1,$

where ${X}_{{j}_{k}}$ denotes the ${j}_{k}$th column of $X$ , ${e}_{{j}_{k}}$ is the ${j}_{k}$th unit coordinate vector, and the superscript T denotes the transpose. The convergence of the Gauss-Seidel method depends strongly on the order in which the columns are selected.
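For illustration, the cyclic update above can be sketched in a few lines of Python/NumPy (the paper's own experiments use MATLAB; the function name and parameters below are purely illustrative):

```python
import numpy as np

# Minimal sketch of the cyclic coordinate descent (Gauss-Seidel) iteration
# for min ||y - X beta||_2^2; each step minimizes exactly along one coordinate.
def coordinate_descent(X, y, n_sweeps=500):
    m, n = X.shape
    beta = np.zeros(n)
    col2 = np.sum(X**2, axis=0)               # ||X_j||_2^2, precomputed
    for k in range(n_sweeps * n):
        j = k % n                             # 0-based version of j_k = (k mod n) + 1
        r = y - X @ beta                      # residual r_k = y - X beta_k
        beta[j] += X[:, j] @ r / col2[j]      # exact minimization along e_j
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
beta_star = rng.standard_normal(5)
beta = coordinate_descent(X, X @ beta_star)   # consistent system: recovers beta_star
```

For a consistent system with full column rank, the iterates converge to the unique least squares solution, though the speed depends on the column ordering as noted above.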

Inspired by the seminal work of Strohmer and Vershynin [1] , Leventhal and Lewis [2] proposed the randomized Gauss-Seidel (RGS) method and proved that it has an expected linear convergence rate. Bai and Wu [3] pointed out an obvious flaw of the RGS method: its column selection probability reduces to uniform column sampling if the coefficient matrix is scaled with a suitable diagonal matrix. To tackle this problem, they proposed the greedy randomized coordinate descent (GRCD) method, also called the greedy randomized Gauss-Seidel (GRGS) method, which adopts an effective probability criterion for selecting the working column so as to capture larger entries of the residual vector with respect to (2). They showed that the GRGS method is significantly superior to the RGS method in terms of both theoretical analysis and numerical experiments. Gauss-Seidel-type methods have since received considerable attention [4] [5] [6] [7] . However, the convergence rate of the RGS method deteriorates significantly when the coefficient matrix columns are highly correlated. To improve the convergence rate, Wang et al. [8] proposed the randomized Gauss-Seidel method with oblique direction (RGSO), which combines two successively selected unit coordinate directions into the search direction

${d}_{k}={e}_{{j}_{k+1}}-\frac{{X}_{{j}_{k}}^{\text{T}}{X}_{{j}_{k+1}}}{{‖{X}_{{j}_{k}}‖}_{2}^{2}}{e}_{{j}_{k}}.$ (3)

They showed that, in terms of both theory and experiments, the RGSO method outperforms the RGS method. For more discussions about oblique projection, we refer the reader to [9] [10] and the references therein.

However, the RGSO method still suffers from the same flaw as the RGS method: its column selection probability reduces to uniform column sampling if the coefficient matrix is scaled with a diagonal matrix. In addition, it is worth noting that the convergence rate of the GRGS method decreases significantly when the coefficient matrix columns are close to linearly dependent. To address these limitations, we present a greedy randomized Gauss-Seidel method with oblique direction (GRGSO) for solving (1), which combines the oblique direction with the GRGS method. In theory, we prove that the iterates generated by the GRGSO method converge to the least squares solution ${\beta }_{*}$ when the coefficient matrix is of full column rank. Numerical results show that, compared with the RGSO and GRGS methods, the GRGSO method has a significant advantage in terms of iteration steps and computing time, especially when the coefficient matrix columns are highly correlated.

The organization of this paper is as follows. In Section 2, some notation and lemmas are introduced. In Section 3, we propose the GRGSO method for solving (1) and give its convergence analysis. Some examples are used to demonstrate the competitiveness of our proposed method in Section 4. Finally, we draw some brief conclusions in Section 5.

2. Notation and Preliminaries

At the beginning of this section, we give some notation. For a Hermitian positive definite matrix $B$ and a column vector $\beta$ with appropriate dimension, we denote ${‖\beta ‖}_{B}^{2}=〈B\beta ,\beta 〉={‖{B}^{\frac{1}{2}}\beta ‖}_{2}^{2}$ and write ${\beta }^{\left(i\right)}$ for the ith entry of $\beta$ . For a given matrix $S\in {ℝ}^{m×n}$ , ${\sigma }_{\mathrm{min}}\left(S\right)$ and ${‖S‖}_{F}$ denote the smallest nonzero singular value and the Frobenius norm of $S$ , respectively. Let ${r}_{k}=y-X{\beta }_{k}$ and ${s}_{k}={X}^{\text{T}}{r}_{k}$ ; then ${s}_{k}^{\left(j\right)}={X}_{j}^{\text{T}}{r}_{k}$ represents the jth entry of ${s}_{k}$ . ${\beta }_{*}$ is the optimal solution of the corresponding problem. We indicate by ${\mathbb{E}}_{k}$ the expected value conditional on the first k iterations, that is,

${\mathbb{E}}_{k}\left[\cdot \right]=\mathbb{E}\left[\cdot |{j}_{0},{j}_{1},\cdots ,{j}_{k-1}\right],$

where ${j}_{t}$ , $t=0,1,\cdots ,k-1$ , is the column index selected at the tth iteration.

In the following, we give a basic lemma.

Lemma 1 (See Bai and Wu [11] ) If the vector $u$ is in the column space of ${A}^{\text{T}}$ , then it holds that

${‖Au‖}_{2}^{2}\ge {\sigma }_{\mathrm{min}}^{2}\left(A\right){‖u‖}_{2}^{2}.$
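This inequality is easy to check numerically. The following snippet (an illustrative sanity check, not part of the original analysis) verifies it for a random full-column-rank matrix, for which the smallest nonzero singular value is simply the smallest singular value:

```python
import numpy as np

# Numerical check of Lemma 1: for u in the column space of A^T,
# ||A u||_2^2 >= sigma_min^2(A) ||u||_2^2.
rng = np.random.default_rng(4)
A = rng.standard_normal((30, 12))          # full column rank (generically)
u = A.T @ rng.standard_normal(30)          # u lies in range(A^T) by construction
sigma_min = np.linalg.svd(A, compute_uv=False).min()
lhs = np.sum((A @ u)**2)
rhs = sigma_min**2 * np.sum(u**2)
assert lhs >= rhs * (1 - 1e-9)             # small slack for floating point
```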

3. GRGSO Method

In this section, we design the GRGSO method for solving (1) by combining the oblique direction with the GRGS method. The pseudo-code of the GRGSO method is listed in Table 1. The difference between the RGSO method and the GRGSO method lies in the selection strategy. The RGSO method uses a random selection strategy: specifically, in the numerical experiments it selects the ${j}_{k+1}$ th column with probability $\frac{{‖{X}_{{j}_{k+1}}‖}_{2}^{2}}{{‖X‖}_{F}^{2}}$ , which reduces to uniform sampling if all columns of the matrix $X$ have the same Euclidean norm. The GRGSO method, in contrast, aims to grasp the larger entries of the residual vector at each iteration. Compared with the GRGS method, our proposed method incorporates the oblique projection, which is expected to yield better convergence when the coefficient matrix columns are highly correlated.
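Since Table 1's pseudo-code is not reproduced in the text, the following Python/NumPy sketch reconstructs one possible implementation from the GRGS selection rule [3] and the updates (3)-(5); the function name grgso, the stopping rule, and all implementation details are illustrative and may differ from the authors' pseudo-code:

```python
import numpy as np

def grgso(X, y, max_iter=5000, tol=1e-10, rng=None):
    """Illustrative sketch of a GRGSO iteration (not the authors' code)."""
    rng = np.random.default_rng() if rng is None else rng
    m, n = X.shape
    beta = np.zeros(n)
    col2 = np.sum(X**2, axis=0)                       # ||X_j||_2^2
    F = col2.sum()                                    # ||X||_F^2
    s = X.T @ (y - X @ beta)                          # s_0 = X^T r_0
    j_prev = None
    for _ in range(max_iter):
        ss = s @ s
        if ss <= tol * (y @ y):                       # stop once ||s_k||^2 is tiny
            break
        ratios = s**2 / col2
        delta = 0.5 * (ratios.max() / ss + 1.0 / F)   # greedy threshold (GRGS rule)
        V = np.flatnonzero(s**2 >= delta * ss * col2) # index set V_k (never empty)
        p = s[V]**2 / np.sum(s[V]**2)                 # probabilities prop. to |s_k^(j)|^2
        j = int(rng.choice(V, p=p))
        if j_prev is None:                            # first step: plain GRGS update
            eta = s[j] / col2[j]
            beta[j] += eta
            s -= eta * (X.T @ X[:, j])                # cf. update (5)
        else:                                         # oblique step along w in (3)
            c = X[:, j_prev] @ X[:, j] / col2[j_prev]
            w = np.zeros(n); w[j] = 1.0; w[j_prev] = -c
            Xw = X @ w
            eta = s[j] / (Xw @ Xw)                    # eta = s_k^(j_{k+1}) / h_{j_k}
            beta += eta * w
            s -= eta * (X.T @ Xw)                     # cf. update (4)
        j_prev = j
    return beta

rng = np.random.default_rng(5)
X = 0.5 + 0.5 * rng.random((100, 10))     # correlated columns (entries in [0.5, 1])
beta_star = rng.standard_normal(10)
y = X @ beta_star
beta = grgso(X, y, rng=rng)
```

Note that the recursive residual update `s -= eta * (X.T @ Xw)` matches the observation below that the method benefits from computing ${X}^{\text{T}}X$ -products economically rather than re-forming the full residual each step.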

Remark Let ${t}_{k}=\mathrm{arg}\underset{1\le j\le n}{\mathrm{max}}\left\{\frac{{|{s}_{k}^{\left(j\right)}|}^{2}}{{‖{X}_{j}‖}_{2}^{2}}\right\}$ , which implies ${t}_{k}\in {V}_{k}$ . Therefore, for every iteration step k, the index set ${V}_{k}$ generated by the GRGSO method is nonempty.

Remark In the GRGSO method, it holds that

$\begin{array}{c}{s}_{k+1}={X}^{\text{T}}{r}_{k+1}={X}^{\text{T}}\left(y-X{\beta }_{k+1}\right)\\ ={X}^{\text{T}}\left(y-X\left({\beta }_{k}+{\eta }_{k}^{\left({j}_{k}\right)}{w}_{{j}_{k}}\right)\right)\\ ={X}^{\text{T}}\left(y-X{\beta }_{k}\right)-{\eta }_{k}^{\left({j}_{k}\right)}{X}^{\text{T}}X{w}_{{j}_{k}}\\ ={s}_{k}-{\eta }_{k}^{\left({j}_{k}\right)}{X}^{\text{T}}X{w}_{{j}_{k}},\text{ }\left(k\ge 1\right)\end{array}$ (4)

Table 1. GRGSO method.

and

$\begin{array}{c}{s}_{1}={X}^{\text{T}}{r}_{1}={X}^{\text{T}}\left(y-X{\beta }_{1}\right)\\ ={X}^{\text{T}}\left(y-X\left({\beta }_{0}+\frac{{s}_{0}^{\left({j}_{1}\right)}}{{‖{X}_{{j}_{1}}‖}_{2}^{2}}{e}_{{j}_{1}}\right)\right)\\ ={X}^{\text{T}}\left(y-X{\beta }_{0}\right)-\frac{{s}_{0}^{\left({j}_{1}\right)}}{{‖{X}_{{j}_{1}}‖}_{2}^{2}}{X}^{\text{T}}{X}_{{j}_{1}}\\ ={s}_{0}-\frac{{s}_{0}^{\left({j}_{1}\right)}}{{‖{X}_{{j}_{1}}‖}_{2}^{2}}{X}^{\text{T}}{X}_{{j}_{1}}.\end{array}$ (5)

Therefore, the GRGSO method can be executed more effectively if ${X}^{\text{T}}X$ can be computed in an economical manner at the beginning.

Next, we give some lemmas which are useful to analyze the convergence of the GRGSO method.

Lemma 2 For the GRGSO method, we have

${s}_{k}^{\left({j}_{k}\right)}=0,\text{ }\left(\forall k>0\right),$ (6)

${s}_{k}^{\left({j}_{k-1}\right)}=0,\text{ }\left(\forall k>1\right).$ (7)

Proof. For $k=1$ , one has

${s}_{1}^{\left({j}_{1}\right)}={X}_{{j}_{1}}^{\text{T}}\left(y-X{\beta }_{1}\right)={X}_{{j}_{1}}^{\text{T}}y-{X}_{{j}_{1}}^{\text{T}}X\left({\beta }_{0}+\frac{{s}_{0}^{\left({j}_{1}\right)}}{{‖{X}_{{j}_{1}}‖}_{2}^{2}}{e}_{{j}_{1}}\right)=0.$ (8)

For $k>1$ , we have

$\begin{array}{c}{s}_{k}^{\left({j}_{k}\right)}={X}_{{j}_{k}}^{\text{T}}\left(y-X{\beta }_{k}\right)={X}_{{j}_{k}}^{\text{T}}y-{X}_{{j}_{k}}^{\text{T}}X\left({\beta }_{k-1}+{\eta }_{k-1}^{\left({j}_{k-1}\right)}{w}_{{j}_{k-1}}\right)\\ ={X}_{{j}_{k}}^{\text{T}}y-{X}_{{j}_{k}}^{\text{T}}X{\beta }_{k-1}-{\eta }_{k-1}^{\left({j}_{k-1}\right)}\left({X}_{{j}_{k}}^{T}X{w}_{{j}_{k-1}}\right)={\stackrel{˜}{s}}_{k-1}^{\left({j}_{k}\right)}-\frac{{\stackrel{˜}{s}}_{k-1}^{\left({j}_{k}\right)}}{{h}_{{j}_{k-1}}}{h}_{{j}_{k-1}}\\ =0,\end{array}$

which together with (8) proves (6).

By (6), it follows for $k>0$ that

$\begin{array}{c}{s}_{k+1}^{\left({j}_{k}\right)}={X}_{{j}_{k}}^{\text{T}}\left(y-X{\beta }_{k+1}\right)={X}_{{j}_{k}}^{\text{T}}y-{X}_{{j}_{k}}^{\text{T}}X\left({\beta }_{k}+{\eta }_{k}^{\left({j}_{k}\right)}{w}_{{j}_{k}}\right)\\ ={s}_{k}^{\left({j}_{k}\right)}-{\eta }_{k}^{\left({j}_{k}\right)}\left({X}_{{j}_{k}}^{\text{T}}X{w}_{{j}_{k}}\right)\\ =-{\eta }_{k}^{\left({j}_{k}\right)}\left({X}_{{j}_{k}}^{\text{T}}X\left({e}_{{j}_{k+1}}-\frac{{X}_{{j}_{k}}^{\text{T}}{X}_{{j}_{k+1}}}{{‖{X}_{{j}_{k}}‖}_{2}^{2}}{e}_{{j}_{k}}\right)\right)\\ =0.\end{array}$

This proves (7).

Remark From (6) and (7), it is obvious that at the kth iteration the GRGSO method does not select ${j}_{k}$ or ${j}_{k-1}$ , i.e., ${j}_{k+1}\ne {j}_{k},{j}_{k-1}$ . Thus, the direction ${w}_{{j}_{k}}$ can be a combination of two distinct unit coordinate directions. This is also an advantage of the GRGSO method over the RGSO method: the RGSO method selects ${j}_{k+1}$ randomly at the kth iteration and hence cannot avoid selecting ${j}_{k}$ or ${j}_{k-1}$ , while the GRGSO method can.

Lemma 3 For ${h}_{{j}_{k}}$ in the GRGSO method, it satisfies

${h}_{{j}_{k}}={‖X{w}_{{j}_{k}}‖}_{2}^{2}={‖{X}_{{j}_{k+1}}‖}_{2}^{2}-\frac{{|{X}_{{j}_{k}}^{\text{T}}{X}_{{j}_{k+1}}|}^{2}}{{‖{X}_{{j}_{k}}‖}_{2}^{2}}\le \Delta {‖{X}_{{j}_{k+1}}‖}_{2}^{2},$

where $\Delta =\underset{i\ne j}{\mathrm{max}}{\mathrm{sin}}^{2}〈{X}_{i},{X}_{j}〉$ .

Proof. Since ${j}_{k}\ne {j}_{k+1}$ , it holds that

$\begin{array}{c}{h}_{{j}_{k}}={X}_{{j}_{k+1}}^{\text{T}}X\left({e}_{{j}_{k+1}}-\frac{{X}_{{j}_{k}}^{\text{T}}{X}_{{j}_{k+1}}}{{‖{X}_{{j}_{k}}‖}_{2}^{2}}{e}_{{j}_{k}}\right)={‖{X}_{{j}_{k+1}}‖}_{2}^{2}-\frac{{|{X}_{{j}_{k}}^{\text{T}}{X}_{{j}_{k+1}}|}^{2}}{{‖{X}_{{j}_{k}}‖}_{2}^{2}}\\ ={‖{X}_{{j}_{k+1}}‖}_{2}^{2}-\frac{{‖{X}_{{j}_{k}}‖}_{2}^{2}{‖{X}_{{j}_{k+1}}‖}_{2}^{2}{\mathrm{cos}}^{2}〈{X}_{{j}_{k}},{X}_{{j}_{k+1}}〉}{{‖{X}_{{j}_{k}}‖}_{2}^{2}}\\ ={\mathrm{sin}}^{2}〈{X}_{{j}_{k}},{X}_{{j}_{k+1}}〉{‖{X}_{{j}_{k+1}}‖}_{2}^{2}\le \Delta {‖{X}_{{j}_{k+1}}‖}_{2}^{2}\end{array}$

and

$\begin{array}{c}{‖X{w}_{{j}_{k}}‖}_{2}^{2}={\left({e}_{{j}_{k+1}}-\frac{{X}_{{j}_{k}}^{\text{T}}{X}_{{j}_{k+1}}}{{‖{X}_{{j}_{k}}‖}_{2}^{2}}{e}_{{j}_{k}}\right)}^{\text{T}}{X}^{\text{T}}X\left({e}_{{j}_{k+1}}-\frac{{X}_{{j}_{k}}^{\text{T}}{X}_{{j}_{k+1}}}{{‖{X}_{{j}_{k}}‖}_{2}^{2}}{e}_{{j}_{k}}\right)\\ ={‖{X}_{{j}_{k+1}}‖}_{2}^{2}-\frac{{|{X}_{{j}_{k}}^{\text{T}}{X}_{{j}_{k+1}}|}^{2}}{{‖{X}_{{j}_{k}}‖}_{2}^{2}}.\end{array}$

Hence, we complete this proof.
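The identities in this proof are easy to verify numerically. The snippet below (illustrative; the column indices i and j stand in for ${j}_{k}$ and ${j}_{k+1}$ ) computes ${h}_{{j}_{k}}$ in the three equivalent ways used above:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 8))
i, j = 2, 5                                   # two distinct columns: j_k and j_{k+1}
col2 = np.sum(X**2, axis=0)                   # ||X_j||_2^2

# h from the defining formula in Lemma 3
h = col2[j] - (X[:, i] @ X[:, j])**2 / col2[i]

# the same quantity via the angle between X_i and X_j: h = sin^2 <X_i,X_j> ||X_j||^2
cos2 = (X[:, i] @ X[:, j])**2 / (col2[i] * col2[j])
assert np.isclose(h, (1 - cos2) * col2[j])

# and via ||X w||^2 with the oblique direction w from (3)
w = np.zeros(8); w[j] = 1.0; w[i] = -(X[:, i] @ X[:, j]) / col2[i]
assert np.isclose(h, np.sum((X @ w)**2))
```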

Lemma 4 The iteration sequence ${\left\{{\beta }_{k}\right\}}_{k=0}^{\infty }$ generated by the GRGSO method satisfies

${‖X\left({\beta }_{k+1}-{\beta }_{*}\right)‖}_{2}^{2}={‖X\left({\beta }_{k}-{\beta }_{*}\right)‖}_{2}^{2}-{‖X\left({\beta }_{k+1}-{\beta }_{k}\right)‖}_{2}^{2},\text{ }k=0,1,2,\cdots .$

Proof. For $k=0$ , we have

$\begin{array}{c}{e}_{{j}_{1}}^{\text{T}}{X}^{\text{T}}X\left({\beta }_{1}-{\beta }_{*}\right)={X}_{{j}_{1}}^{\text{T}}X\left({\beta }_{0}-{\beta }_{*}+\frac{{s}_{0}^{\left({j}_{1}\right)}}{{‖{X}_{{j}_{1}}‖}_{2}^{2}}{e}_{{j}_{1}}\right)\\ ={X}_{{j}_{1}}^{\text{T}}X{\beta }_{0}-{X}_{{j}_{1}}^{\text{T}}X{\beta }_{*}+{s}_{0}^{\left({j}_{1}\right)}\\ =0.\end{array}$

This means that the vector ${X}^{\text{T}}X\left({\beta }_{1}-{\beta }_{*}\right)$ is perpendicular to the vector ${e}_{{j}_{1}}$ . Since ${\beta }_{1}-{\beta }_{0}$ is parallel to ${e}_{{j}_{1}}$ , the vector ${X}^{\text{T}}X\left({\beta }_{1}-{\beta }_{*}\right)$ is perpendicular to ${\beta }_{1}-{\beta }_{0}$ .

For $k>0$ , it follows from Lemma 3 and Lemma 2 that

$\begin{array}{c}{w}_{{j}_{k}}^{\text{T}}{X}^{\text{T}}X\left({\beta }_{k+1}-{\beta }_{*}\right)={w}_{{j}_{k}}^{\text{T}}{X}^{\text{T}}X\left({\beta }_{k}-{\beta }_{*}+{\eta }_{k}^{\left({j}_{k}\right)}{w}_{{j}_{k}}\right)\\ ={w}_{{j}_{k}}^{\text{T}}\left(-{s}_{k}+{\eta }_{k}^{\left({j}_{k}\right)}{X}^{\text{T}}X{w}_{{j}_{k}}\right)\\ =-{w}_{{j}_{k}}^{\text{T}}{s}_{k}+{\eta }_{k}^{\left({j}_{k}\right)}{‖X{w}_{{j}_{k}}‖}_{2}^{2}\\ =-{\left({e}_{{j}_{k+1}}-\frac{{X}_{{j}_{k}}^{\text{T}}{X}_{{j}_{k+1}}}{{‖{X}_{{j}_{k}}‖}_{2}^{2}}{e}_{{j}_{k}}\right)}^{\text{T}}{s}_{k}+{\stackrel{˜}{s}}_{k}^{\left({j}_{k+1}\right)}\\ =-{s}_{k}^{\left({j}_{k+1}\right)}+\frac{{X}_{{j}_{k}}^{\text{T}}{X}_{{j}_{k+1}}}{{‖{X}_{{j}_{k}}‖}_{2}^{2}}{s}_{k}^{\left({j}_{k}\right)}+{\stackrel{˜}{s}}_{k}^{\left({j}_{k+1}\right)}=0,\end{array}$

which means that the vector ${X}^{\text{T}}X\left({\beta }_{k+1}-{\beta }_{*}\right)$ is perpendicular to the vector ${w}_{{j}_{k}}$ . Since ${\beta }_{k+1}-{\beta }_{k}$ is parallel to ${w}_{{j}_{k}}$ , the vector ${X}^{\text{T}}X\left({\beta }_{k+1}-{\beta }_{*}\right)$ is perpendicular to ${\beta }_{k+1}-{\beta }_{k}$ . For all $k=0,1,\cdots$ , it follows that

$〈X\left({\beta }_{k+1}-{\beta }_{*}\right),X\left({\beta }_{k+1}-{\beta }_{k}\right)〉=〈{X}^{\text{T}}X\left({\beta }_{k+1}-{\beta }_{*}\right),{\beta }_{k+1}-{\beta }_{k}〉=0,$

which together with Pythagoras theorem leads to the desired result.

Next, the convergence theory of the GRGSO method is deduced.

Theorem 5 For the least squares problem (1), the iteration sequence ${\left\{{\beta }_{k}\right\}}_{k=0}^{\infty }$ generated by the GRGSO method from any initial guess ${\beta }_{0}$ satisfies

$\mathbb{E}{‖X\left({\beta }_{k}-{\beta }_{*}\right)‖}_{2}^{2}\le \underset{t=0}{\overset{k-1}{\prod }}\text{ }\text{ }{\zeta }_{t}{‖X\left({\beta }_{0}-{\beta }_{*}\right)‖}_{2}^{2},\text{ }\text{for}\text{\hspace{0.17em}}\text{ }k=1,2,\cdots ,$ (9)

where ${\zeta }_{0}=1-\frac{{\sigma }_{\mathrm{min}}^{2}\left(X\right)}{{‖X‖}_{F}^{2}}$ , ${\zeta }_{1}=1-\frac{1}{2}\left(\frac{1}{{\gamma }_{1}}+\frac{1}{{‖X‖}_{F}^{2}}\right)\frac{{\sigma }_{\mathrm{min}}^{2}\left(X\right)}{\Delta }$ and ${\zeta }_{t}=\zeta :=1-\frac{1}{2}\left(\frac{1}{{\gamma }_{2}}+\frac{1}{{‖X‖}_{F}^{2}}\right)\frac{{\sigma }_{\mathrm{min}}^{2}\left(X\right)}{\Delta }$ for $t>1$ . Here ${\gamma }_{1}=\underset{1\le i\le n}{\mathrm{max}}\underset{\begin{array}{l}j=1\\ j\ne i\end{array}}{\overset{n}{\sum }}{‖{X}_{j}‖}_{2}^{2}$ , ${\gamma }_{2}=\underset{\begin{array}{c}1\le i,j\le n\\ i\ne j\end{array}}{\mathrm{max}}\underset{\begin{array}{c}t=1\\ t\ne i,j\end{array}}{\overset{n}{\sum }}{‖{X}_{t}‖}_{2}^{2}$ and $\Delta$ is given as in Lemma 3.

Proof. By Lemma 2, it follows for $k>1$ that

$\begin{array}{c}{\delta }_{k}{‖X‖}_{F}^{2}=\frac{1}{2}\left(\frac{{‖X‖}_{F}^{2}}{{‖{s}_{k}‖}_{2}^{2}}\underset{1\le j\le n}{\mathrm{max}}\left\{\frac{{|{s}_{k}^{\left(j\right)}|}^{2}}{{‖{X}_{j}‖}_{2}^{2}}\right\}+1\right)=\frac{1}{2}\left(\frac{\underset{1\le j\le n}{\mathrm{max}}\left\{\frac{{|{s}_{k}^{\left(j\right)}|}^{2}}{{‖{X}_{j}‖}_{2}^{2}}\right\}}{\underset{j=1}{\overset{n}{\sum }}\frac{{‖{X}_{j}‖}_{2}^{2}}{{‖X‖}_{F}^{2}}\frac{{|{s}_{k}^{\left(j\right)}|}^{2}}{{‖{X}_{j}‖}_{2}^{2}}}+1\right)\\ =\frac{1}{2}\left(\frac{\underset{1\le j\le n}{\mathrm{max}}\left\{\frac{{|{s}_{k}^{\left(j\right)}|}^{2}}{{‖{X}_{j}‖}_{2}^{2}}\right\}}{\underset{\begin{array}{c}j=1\\ j\ne {j}_{k},{j}_{k-1}\end{array}}{\overset{n}{\sum }}\frac{{‖{X}_{j}‖}_{2}^{2}}{{‖X‖}_{F}^{2}}\frac{{|{s}_{k}^{\left(j\right)}|}^{2}}{{‖{X}_{j}‖}_{2}^{2}}}+1\right)\ge \frac{1}{2}\left(\frac{{‖X‖}_{F}^{2}}{\underset{\begin{array}{c}j=1\\ j\ne {j}_{k},{j}_{k-1}\end{array}}{\overset{n}{\sum }}{‖{X}_{j}‖}_{2}^{2}}+1\right)\\ \ge \frac{1}{2}\left(\frac{{‖X‖}_{F}^{2}}{{\gamma }_{2}}+1\right).\end{array}$ (10)

For $k=1$ , it follows from (6) that

$\begin{array}{c}{\delta }_{1}{‖X‖}_{F}^{2}=\frac{1}{2}\left(\frac{{‖X‖}_{F}^{2}}{{‖{s}_{1}‖}_{2}^{2}}\underset{1\le j\le n}{\mathrm{max}}\left\{\frac{{|{s}_{1}^{\left(j\right)}|}^{2}}{{‖{X}_{j}‖}_{2}^{2}}\right\}+1\right)=\frac{1}{2}\left(\frac{\underset{1\le j\le n}{\mathrm{max}}\left\{\frac{{|{s}_{1}^{\left(j\right)}|}^{2}}{{‖{X}_{j}‖}_{2}^{2}}\right\}}{\underset{j=1}{\overset{n}{\sum }}\frac{{‖{X}_{j}‖}_{2}^{2}}{{‖X‖}_{F}^{2}}\frac{{|{s}_{1}^{\left(j\right)}|}^{2}}{{‖{X}_{j}‖}_{2}^{2}}}+1\right)\\ =\frac{1}{2}\left(\frac{\underset{1\le j\le n}{\mathrm{max}}\left\{\frac{{|{s}_{1}^{\left(j\right)}|}^{2}}{{‖{X}_{j}‖}_{2}^{2}}\right\}}{\underset{\begin{array}{l}j=1\\ j\ne {j}_{1}\end{array}}{\overset{n}{\sum }}\frac{{‖{X}_{j}‖}_{2}^{2}}{{‖X‖}_{F}^{2}}\frac{{|{s}_{1}^{\left(j\right)}|}^{2}}{{‖{X}_{j}‖}_{2}^{2}}}+1\right)\ge \frac{1}{2}\left(\frac{{‖X‖}_{F}^{2}}{\underset{\begin{array}{l}j=1\\ j\ne {j}_{1}\end{array}}{\overset{n}{\sum }}{‖{X}_{j}‖}_{2}^{2}}+1\right)\\ \ge \frac{1}{2}\left(\frac{{‖X‖}_{F}^{2}}{{\gamma }_{1}}+1\right).\end{array}$ (11)

By Lemma 4, Lemma 3 and Lemma 1, for $k\ge 1$ , we have

$\begin{array}{l}{\mathbb{E}}_{k}{‖X\left({\beta }_{k+1}-{\beta }_{*}\right)‖}_{2}^{2}\\ ={‖X\left({\beta }_{k}-{\beta }_{*}\right)‖}_{2}^{2}-{\mathbb{E}}_{k}{‖X\left({\beta }_{k+1}-{\beta }_{k}\right)‖}_{2}^{2}\\ ={‖X\left({\beta }_{k}-{\beta }_{*}\right)‖}_{2}^{2}-\underset{{j}_{k+1}\in {V}_{k}}{\sum }\frac{{|{\stackrel{˜}{s}}_{k}^{\left({j}_{k+1}\right)}|}^{2}}{\underset{{j}_{k+1}\in {V}_{k}}{\sum }{|{\stackrel{˜}{s}}_{k}^{\left({j}_{k+1}\right)}|}^{2}}{‖{\eta }_{k}^{\left({j}_{k}\right)}\left(X{w}_{{j}_{k}}\right)‖}^{2}\\ ={‖X\left({\beta }_{k}-{\beta }_{*}\right)‖}_{2}^{2}-\underset{{j}_{k+1}\in {V}_{k}}{\sum }\frac{{|{\stackrel{˜}{s}}_{k}^{\left({j}_{k+1}\right)}|}^{2}}{\underset{{j}_{k+1}\in {V}_{k}}{\sum }{|{\stackrel{˜}{s}}_{k}^{\left({j}_{k+1}\right)}|}^{2}}\frac{{|{\stackrel{˜}{s}}_{k}^{\left({j}_{k+1}\right)}|}^{2}{‖X{w}_{{j}_{k}}‖}_{2}^{2}}{{|{h}_{{j}_{k}}|}^{2}}\\ ={‖X\left({\beta }_{k}-{\beta }_{*}\right)‖}_{2}^{2}-\underset{{j}_{k+1}\in {V}_{k}}{\sum }\frac{{|{\stackrel{˜}{s}}_{k}^{\left({j}_{k+1}\right)}|}^{2}}{\underset{{j}_{k+1}\in {V}_{k}}{\sum }{|{\stackrel{˜}{s}}_{k}^{\left({j}_{k+1}\right)}|}^{2}}\frac{{|{\stackrel{˜}{s}}_{k}^{\left({j}_{k+1}\right)}|}^{2}}{{h}_{{j}_{k}}}\end{array}$

$\begin{array}{l}\le {‖X\left({\beta }_{k}-{\beta }_{*}\right)‖}_{2}^{2}-\underset{{j}_{k+1}\in {V}_{k}}{\sum }\frac{{|{\stackrel{˜}{s}}_{k}^{\left({j}_{k+1}\right)}|}^{2}}{\underset{{j}_{k+1}\in {V}_{k}}{\sum }{|{\stackrel{˜}{s}}_{k}^{\left({j}_{k+1}\right)}|}^{2}}\frac{{|{\stackrel{˜}{s}}_{k}^{\left({j}_{k+1}\right)}|}^{2}}{\Delta {‖{X}_{{j}_{k+1}}‖}_{2}^{2}}\\ \le {‖X\left({\beta }_{k}-{\beta }_{*}\right)‖}_{2}^{2}-\frac{{\delta }_{k}}{\Delta }{‖{s}_{k}‖}_{2}^{2}\\ ={‖X\left({\beta }_{k}-{\beta }_{*}\right)‖}_{2}^{2}-\frac{{\delta }_{k}}{\Delta }{‖{X}^{\text{T}}X\left({\beta }_{k}-{\beta }_{*}\right)‖}_{2}^{2}\\ \le \left(1-{\delta }_{k}\frac{{\sigma }_{\mathrm{min}}^{2}\left(X\right)}{\Delta }\right){‖X\left({\beta }_{k}-{\beta }_{*}\right)‖}_{2}^{2},\end{array}$

which together with (11) and (10) can lead to

${\mathbb{E}}_{1}{‖X\left({\beta }_{2}-{\beta }_{*}\right)‖}_{2}^{2}\le {\zeta }_{1}{‖X\left({\beta }_{1}-{\beta }_{*}\right)‖}_{2}^{2}$ (12)

and

${\mathbb{E}}_{k}{‖X\left({\beta }_{k+1}-{\beta }_{*}\right)‖}_{2}^{2}\le \zeta {‖X\left({\beta }_{k}-{\beta }_{*}\right)‖}_{2}^{2},\text{ }k>1,$ (13)

respectively. For $k=0$ , we can similarly get

$\mathbb{E}{‖X\left({\beta }_{1}-{\beta }_{*}\right)‖}_{2}^{2}\le {\zeta }_{0}{‖X\left({\beta }_{0}-{\beta }_{*}\right)‖}_{2}^{2}.$

Then taking the full expectation on both sides of (12) and (13) and by induction on the iteration index k, we can easily obtain (9). Thus, we complete the proof.

Remark Recall that the upper bound on the convergence rate of the GRGS (GRCD) method [3] is

${\rho }_{GRCD}=1-\frac{1}{2}\left(\frac{1}{{\gamma }_{1}}+\frac{1}{{‖X‖}_{F}^{2}}\right){\sigma }_{\mathrm{min}}^{2}\left(X\right).$

Since $\Delta =\underset{i\ne j}{\mathrm{max}}{\mathrm{sin}}^{2}〈{X}_{i},{X}_{j}〉\in \left(0,1\right]$ , ${\gamma }_{2}<{\gamma }_{1}<{‖X‖}_{F}^{2}$ and $0<{\sigma }_{\mathrm{min}}^{2}\left(X\right)\le {‖X‖}_{F}^{2}$ , we have

$\zeta <{\zeta }_{1}\le {\rho }_{GRCD}<{\zeta }_{0}<1.$

This implies that the upper bound on the convergence factor of the GRGSO method is smaller, uniformly with respect to the iteration step k, than that of the GRCD method.
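This ordering of the bounds can be illustrated numerically. The snippet below (illustrative helper computations, assuming a generic full-column-rank Gaussian matrix) evaluates ${\zeta }_{0}$ , ${\zeta }_{1}$ , $\zeta$ and ${\rho }_{GRCD}$ directly from their definitions:

```python
import numpy as np

# Numerical illustration of zeta < zeta_1 <= rho_GRCD < zeta_0 < 1.
rng = np.random.default_rng(6)
X = rng.standard_normal((100, 6))

col2 = np.sum(X**2, axis=0)                       # ||X_j||_2^2
F = col2.sum()                                    # ||X||_F^2
sig2 = np.linalg.svd(X, compute_uv=False)[-1]**2  # sigma_min^2(X), full column rank

gamma1 = F - col2.min()                           # max_i  sum_{j != i}   ||X_j||^2
gamma2 = F - np.sort(col2)[:2].sum()              # max_{i!=j} sum_{t!=i,j} ||X_t||^2

G = X / np.sqrt(col2)                             # columns normalized
cos2 = (G.T @ G)**2                               # squared cosines between columns
np.fill_diagonal(cos2, np.inf)                    # exclude i = j from the minimum
Delta = 1.0 - cos2.min()                          # max_{i != j} sin^2 <X_i, X_j>

zeta0 = 1 - sig2 / F
zeta1 = 1 - 0.5 * (1/gamma1 + 1/F) * sig2 / Delta
zeta  = 1 - 0.5 * (1/gamma2 + 1/F) * sig2 / Delta
rho   = 1 - 0.5 * (1/gamma1 + 1/F) * sig2         # GRCD bound

assert zeta < zeta1 <= rho < zeta0 < 1
```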

4. Numerical Experiments

Some examples are presented in this section to verify the effectiveness of the GRGSO method. Specifically, the GRGSO method is compared with the GRGS method [3] , the RGS method [2] and the RGSO method [8] . We report the results of these methods in terms of the number of iteration steps (denoted by “IT”) and the running time in seconds (denoted by “CPU”).

The coefficient matrix $X\in {ℝ}^{m×n}\left(m\ge n\right)$ in the numerical experiments comes from two sources. One is a random matrix whose entries are drawn uniformly from the interval $\left[c,1\right]$ $\left(c\ge 0\right)$ using the MATLAB function rand; a value of c close to 1 implies that the matrix columns are highly correlated. The other consists of sparse matrices from [12] , listed in Table 2. We use $\text{cond}\left(X\right)$ to denote the condition number of $X$ , and define the density as

$\text{density}=\frac{\text{number}\text{\hspace{0.17em}}\text{of}\text{\hspace{0.17em}}\text{nonzero}\text{\hspace{0.17em}}\text{entries}\text{\hspace{0.17em}}\text{of}\text{\hspace{0.17em}}\text{an}\text{\hspace{0.17em}}\text{ }m×n\text{ }\text{\hspace{0.17em}}\text{matrix}\text{ }}{m×n}.$
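The random test-matrix construction can be sketched as follows (illustrative Python/NumPy with hypothetical helper names; the paper's experiments use MATLAB's rand). The check at the end confirms that c close to 1 indeed makes every pair of columns nearly parallel:

```python
import numpy as np

# Entries drawn uniformly from [c, 1]; larger c means more correlated columns.
def correlated_matrix(m, n, c, rng):
    return c + (1.0 - c) * rng.random((m, n))

def density(X):
    return np.count_nonzero(X) / (X.shape[0] * X.shape[1])

rng = np.random.default_rng(7)
X = correlated_matrix(1000, 100, 0.9, rng)

# with c = 0.9 every pair of columns has cosine close to 1
G = X / np.linalg.norm(X, axis=0)
cos = G.T @ G
np.fill_diagonal(cos, 1.0)
assert cos.min() > 0.99
assert density(X) == 1.0          # a dense random matrix has no zero entries
```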

We randomly generate the true vector ${\beta }_{*}$ using the MATLAB function randn and construct the vector y by $y=X{\beta }_{*}$ when the linear system is consistent; for inconsistent linear systems the right-hand side is set to $y=X{\beta }_{*}+noise$ , where $noise\in null\left({X}^{\text{T}}\right)$ . We take the zero vector as the initial approximation in each iteration process. Since

${‖X\left({\beta }_{k}-{\beta }_{*}\right)‖}_{2}^{2}={‖X{\beta }_{*}+noise-{r}_{k}-X{\beta }_{*}‖}_{2}^{2}={‖noise-{r}_{k}‖}_{2}^{2},$

which was also used in the work of Wang et al. [8] , we terminate the iteration process once

$RSE=\frac{{‖noise-{r}_{k}‖}_{2}}{{‖y‖}_{2}}<{10}^{-6}.$
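The construction of the inconsistent right-hand side, with the noise placed in $null\left({X}^{\text{T}}\right)$ so that ${\beta }_{*}$ remains the exact least squares solution, can be sketched as follows (illustrative; the QR-based projector is one standard choice, not necessarily the authors'):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 60, 8
X = rng.standard_normal((m, n))
beta_star = rng.standard_normal(n)

z = rng.standard_normal(m)
Q, _ = np.linalg.qr(X)                 # orthonormal basis of range(X)
noise = z - Q @ (Q.T @ z)              # project z onto null(X^T) = range(X)^perp
y = X @ beta_star + noise

assert np.allclose(X.T @ noise, 0)     # noise really lies in null(X^T)
# hence beta_* solves min ||y - X beta||_2
assert np.allclose(np.linalg.lstsq(X, y, rcond=None)[0], beta_star)
```

With this construction, $noise={r}_{*}=y-X{\beta }_{*}$ , so the RSE stopping rule above measures exactly how far ${r}_{k}$ is from the optimal residual.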

We set “-” in the numerical tables if the number of iteration steps exceeds 300,000. All results are averages over 20 repetitions. All experiments were implemented in MATLAB (R2021b) on a computer with a 2.30 GHz central processing unit (Intel(R) Core(TM) i7-10875H CPU) and 16 GB memory.

Table 2. Sparse matrix properties of realistic problems [12] .

For the randomly generated matrices, with $c=0$ , the numerical results for consistent and inconsistent linear systems are listed in Table 3 and Table 4, respectively. It is easy to observe from Table 3 and Table 4 that the GRGSO method significantly outperforms the RGS, GRGS and RGSO methods in terms of both IT and CPU.

In the following, we compare these methods for solving (1) when the randomly generated matrix has different values of c. We list the numerical results for the consistent and inconsistent systems in Table 5 and Table 6, respectively. From Table 5 and Table 6, it is easy to observe that the IT and CPU of the RGS and GRGS methods increase significantly as c approaches 1. When c increases to 0.8, the IT of the RGS method exceeds the maximal number of iteration steps, and when c increases to 0.9, so does the IT of the GRGS method. For all values of c, the methods with the oblique direction outperform those without it. In addition, compared with the other methods, the GRGSO method performs best in terms of both IT and CPU.

For the full-rank sparse matrices from [12] , the numerical results for the consistent and inconsistent linear systems are listed in Table 7 and Table 8, respectively. It can be seen that for the matrix abtaha1 the performance of the GRGSO method is similar to that of the GRGS method in terms of both IT and CPU, whereas for the other matrices the GRGSO method performs significantly better than the other methods in terms of both IT and CPU.

Table 3. The consistent system with $c=0$ : different m impacts on IT and CPU.

Table 4. The inconsistent system with $c=0$ : different m impacts on IT and CPU.

Table 5. The consistent system with $X\in {ℝ}^{1000×100}$ : different c impacts on IT and CPU.

Table 6. The inconsistent system with $X\in {ℝ}^{1000×100}$ : different c impacts on IT and CPU.

Table 7. The consistent system: IT and CPU time of test methods for different sparse matrices.

Table 8. The inconsistent system: IT and CPU time of test methods for different sparse matrices.

5. Conclusion

In this manuscript, we construct the GRGSO method for the linear least squares problem and establish its convergence analysis. Numerical experiments show that the GRGSO method is superior to the RGS, GRGS and RGSO methods in terms of both IT and CPU, especially when the coefficient matrix columns are highly correlated. It is natural to generalize the GRGSO method by introducing a relaxation parameter in its probability criterion; however, the theoretically optimal choice of the relaxation parameter remains an open question and is worthy of further study.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this paper.

References

[1] Strohmer, T. and Vershynin, R. (2009) A Randomized Kaczmarz Algorithm with Exponential Convergence. Journal of Fourier Analysis and Applications, 15, 262-278. https://doi.org/10.1007/s00041-008-9030-4

[2] Leventhal, D. and Lewis, A.S. (2010) Randomized Methods for Linear Constraints: Convergence Rates and Conditioning. Mathematics of Operations Research, 35, 641-654. https://doi.org/10.1287/moor.1100.0456

[3] Bai, Z.Z. and Wu, W.T. (2019) On Greedy Randomized Coordinate Descent Methods for Solving Large Linear Least-Squares Problems. Numerical Linear Algebra with Applications, 26, e2237. https://doi.org/10.1002/nla.2237

[4] Du, K. (2019) Tight Upper Bounds for the Convergence of the Randomized Extended Kaczmarz and Gauss-Seidel Algorithms. Numerical Linear Algebra with Applications, 26, e2233. https://doi.org/10.1002/nla.2233

[5] Liu, Y., Jiang, X.L. and Gu, C.Q. (2021) On Maximum Residual Block and Two-Step Gauss-Seidel Algorithms for Linear Least-Squares Problems. Calcolo, 58, Article No. 13. https://doi.org/10.1007/s10092-021-00404-x

[6] Zhang, J.H. and Guo, J.H. (2020) On Relaxed Greedy Randomized Coordinate Descent Methods for Solving Large Linear Least-Squares Problems. Applied Numerical Mathematics, 157, 372-384. https://doi.org/10.1016/j.apnum.2020.06.014

[7] Niu, Y.Q. and Zheng, B. (2021) A New Randomized Gauss-Seidel Method for Solving Linear Least-Squares Problems. Applied Mathematics Letters, 116, Article ID: 107057. https://doi.org/10.1016/j.aml.2021.107057

[8] Wang, F., Li, W.G., Bao, W.D. and Lv, Z.L. (2021) Gauss-Seidel Method with Oblique Direction. Results in Applied Mathematics, 12, Article ID: 100180. https://doi.org/10.1016/j.rinam.2021.100180

[9] Li, W.G., Wang, Q., Bao, W.D. and Xing, L.L. (2022) Kaczmarz Method with Oblique Projection. Results in Applied Mathematics, 16, Article ID: 100342. https://doi.org/10.1016/j.rinam.2022.100342

[10] Wang, F., Li, W.G., Bao, W.D. and Liu, L. (2022) Greedy Randomized and Maximal Weighted Residual Kaczmarz Methods with Oblique Projection. Electronic Research Archive, 30, 1158-1186. https://doi.org/10.3934/era.2022062

[11] Bai, Z.Z. and Wu, W.T. (2018) On Greedy Randomized Kaczmarz Method for Solving Large Sparse Linear Systems. SIAM Journal on Scientific Computing, 40, A592-A606. https://doi.org/10.1137/17M1137747

[12] Davis, T.A. and Hu, Y. (2011) The University of Florida Sparse Matrix Collection. ACM Transactions on Mathematical Software, 38, 1-25. https://doi.org/10.1145/2049662.2049663