Fully Distributed Learning for Deep Random Vector Functional-Link Networks
1. Introduction
In recent years, with the rapid development of digital and network technology, data can be collected at an unprecedented scale. Such data has three characteristics: large volume, high dimensionality, and distributed storage. These characteristics pose many challenges for data processing [1]. Traditional centralized machine learning performs all computation on a single machine, which limits the size of the training data and leads to long training times, so it cannot meet the requirements of today's big data. It is therefore necessary to deploy the data across multiple machines and build the model jointly in a distributed manner, which also matches the distributed storage of the data [2]. Applying fast and efficient distributed learning algorithms to neural networks is thus of great significance.
On the other hand, deep neural networks have become a very popular research direction in machine learning in recent years and have achieved major breakthroughs in many fields. Although deep neural networks are favored for their excellent performance, the three data characteristics described above mean that a single machine can no longer meet their training requirements. Applying distributed optimization algorithms to deep neural networks has therefore become a new research trend. As early as 2012, Dean et al. [3] at Google developed two distributed training algorithms, Downpour SGD and Sandblaster L-BFGS, for training large-scale deep neural networks, which was of great significance. Research on distributed deep neural networks has since increased steadily, and many frameworks that support distributed training have emerged, such as the TensorFlow framework proposed by Abadi et al. [4] and the Horovod framework proposed by Sergeev et al. [5].
To realize distributed training of models, two distributed frameworks are generally adopted [6]: the master-slave mode and the point-to-point mode. In master-slave mode, a central node is responsible for collecting and aggregating the data or model parameters sent by the other child nodes, performing the computation, and then sending the results back to them [7] [8]. Such a communication architecture may create a communication bottleneck on the one hand [9] and risks of data leakage and misuse on the other [10]. In a point-to-point distributed architecture, there is no central node in the network and all nodes have equal status. Depending on the network topology, a node communicates with one or more other nodes, and after several rounds of communication the entire network reaches consensus. This decentralized, fully distributed architecture not only saves some communication overhead, but also exchanges data or model parameters only between adjacent nodes, thus preserving data privacy [11]. Owing to these advantages, this framework has been widely studied and applied in recent years; examples in deep learning include [12] [13].
In addition, the choice of training algorithm also has an important impact on the efficiency of a deep neural network. Gradient-based algorithms are the most widely used learning algorithms for deep neural networks. However, traditional gradient algorithms have some disadvantages, such as a tendency to fall into local minima, slow convergence, and strong dependence on the initial parameters [14]. For deep networks, gradient algorithms also suffer from vanishing or exploding gradients, which reduces training efficiency and makes it difficult to exploit the strong learning ability of deep neural networks [15]. To address these problems, this paper proposes a distributed learning method based on deep random weight neural networks. Compared with traditional neural networks, random weight neural networks train very quickly, reduce the probability of falling into local minima, and ensure good approximation and generalization ability. Representative deep random neural networks include multi-hidden-layer feedforward neural networks (MLFN) [16], hierarchical extreme learning machines (H-ELM) [17], and deep random vector functional-link neural networks based on stacked autoencoders (sdRVFL) [18] [19]; among these, sdRVFL trains faster and generalizes better than the other deep random networks.
Building on the above models and theories, and combining the advantages of deep neural networks and distributed learning frameworks, this paper develops a point-to-point, fully distributed algorithm called D-sdRVFL for the deep random vector functional-link neural network (sdRVFL). The proposed algorithm is based on decentralized average consensus (DAC) [20] and the alternating direction method of multipliers (ADMM) [21]. During distributed training, we first use ADMM to transform the global consensus optimization problem of the model into equivalent sub-problems. Solving these sub-problems involves quantities that require global information. We compute them with the DAC algorithm, which reaches global consensus through communication between neighboring nodes only, avoiding the need for a central node and realizing decentralized, fully distributed training of deep learning models. The main contributions of this paper are as follows:
· A peer-to-peer distributed learning algorithm based on deep RVFL is proposed, in which multiple nodes can jointly train a model without a central server while also protecting data privacy.
· For the two different connection variants of the deep RVFL network, we propose corresponding distributed deep neural network algorithms.
· The proposed D-sdRVFL algorithm is comparable in performance to the centralized deep RVFL algorithm. Experimental results on multiple classification datasets show that the proposed algorithm differs little from centralized deep RVFL in model accuracy, and the training speed of the model is improved. Compared with the centralized algorithm, the point-to-point distributed algorithm has great advantages in handling large-scale, high-dimensional data, and it also protects data privacy to a certain extent.
The rest of this paper is structured as follows. Section 2 briefly introduces the basic concepts and training optimization process of two kinds of deep RVFL networks. In Section 3, decentralized fully distributed optimization algorithms are proposed for two kinds of deep RVFL networks. In Section 4, we compare the performance of the proposed distributed algorithm with other centralized deep random weight algorithms. Section 5 summarizes the paper.
2. Preliminary
In this section, we will introduce the basic structure of deep RVFL and its optimization problems, and introduce the concept of the decentralized average consensus (DAC) as the theoretical basis for our extension of the network to decentralized distributed deep networks.
2.1. Deep RVFL with Direct Links
In the Deep RVFL with direct links network, the original data first goes through L hidden layers for feature extraction to obtain complex high-level features, and then enters the RVFL classifier. The learning and optimization of the whole network are also divided into two parts, one is the optimization of the reconstruction matrix of the hidden layer encoder, and the other is the optimization of the weight matrix of the RVFL classifier.
The hidden layers in deep RVFL are composed of L stacked self-encoding layers, and the output of the lth hidden layer is denoted by
${H}_{l}$ . In the hidden layers, the output of the previous layer is used as the input of the next layer. The optimization problem for each coding layer is as follows:
${\stackrel{^}{U}}_{l}=\underset{{U}_{l}}{\mathrm{arg}\mathrm{min}}\frac{1}{2}{\Vert {Z}_{l}{U}_{l}-{H}_{l-1}\Vert}^{2}+{\lambda}_{l}{\Vert {U}_{l}\Vert}_{1}$ (1)
where
${H}_{l-1}$ is the output of the coding layer of the
$l-1$ th layer, and is also the input of the encoder of the coding layer of the lth layer,
${Z}_{l}$ is the output of the encoder of the coding layer of the lth layer obtained by the activation function, and ${H}_{0}=X$ . Our goal is to optimize the weight matrix ${U}_{l}$ of the decoder of the coding layer, and
${\lambda}_{l}$ is the regularization parameter of the lth layer.
After L self-encoding layers, the final feature representation
${H}_{L}$ is obtained. We need to connect
${H}_{L}$ with the original data
$X$ and then enter the classifier of RVFL. We use
${X}_{c}$ to represent the input value of the classifier, and
${X}_{c}$ can be defined as:
${X}_{c}=\left[{H}_{L}\mathrm{,}X\right]\mathrm{.}$ (2)
In the RVFL classifier, the learning objective is to optimize the weight matrix
$\beta $ , and the optimization objective function is as follows:
$\stackrel{^}{\beta}=\underset{\beta}{\mathrm{arg}\mathrm{min}}\frac{1}{2}{\Vert {X}_{c}\beta -T\Vert}^{2}+\frac{\lambda}{2}{\Vert \beta \Vert}^{2}$ (3)
where
$T$ is the target matrix,
$\lambda $ is the regularization parameter.
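Problem (3) is a ridge regression and therefore has the well-known closed-form solution $\stackrel{^}{\beta}={\left({X}_{c}^{{\rm T}}{X}_{c}+\lambda I\right)}^{-1}{X}_{c}^{{\rm T}}T$ . The following is a minimal NumPy sketch of that solution on random stand-in data. The function name `ridge_output_weights`, the array shapes, and the value of `lam` are illustrative assumptions and are not taken from the original experiments.

```python
import numpy as np

def ridge_output_weights(X_c, T, lam):
    """Minimizer of 0.5*||X_c @ beta - T||^2 + 0.5*lam*||beta||^2 (Equation (3))."""
    d = X_c.shape[1]
    # Solve (X_c^T X_c + lam*I) beta = X_c^T T instead of forming an explicit inverse.
    return np.linalg.solve(X_c.T @ X_c + lam * np.eye(d), X_c.T @ T)

rng = np.random.default_rng(0)
H_L = rng.standard_normal((50, 8))   # stand-in for the last hidden-layer output
X = rng.standard_normal((50, 4))     # stand-in for the original data
T = rng.standard_normal((50, 3))     # stand-in for the target matrix
X_c = np.hstack([H_L, X])            # Equation (2): X_c = [H_L, X]
beta = ridge_output_weights(X_c, T, lam=0.1)
```

At the returned `beta`, the gradient `X_c.T @ (X_c @ beta - T) + lam * beta` vanishes up to rounding, which is a quick way to check the solution.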
2.2. Deep RVFL with Dense Direct Links
In the Deep RVFL with dense direct links network, the original data is first subjected to feature extraction through hidden layers, and in the self-encoding layers of the L layers in the hidden layers, each self-encoding layer is connected with the subsequent self-encoding layer, so that the input value of each subsequent hidden layer includes the output values of all the previous hidden layers, and each hidden layer input
${X}_{l}$ can be represented as follows:
${X}_{l}=\left[X\mathrm{,}{H}_{1}\mathrm{,}\cdots \mathrm{,}{H}_{l-1}\right]$ (4)
where
$X$ is the original data,
${H}_{l}$ is the output value of each hidden layer, and the form of the output value may refer to formula:
${H}_{l}={Z}_{l}{\stackrel{^}{U}}_{l}\mathrm{,}$ (5)
except that the input value of each layer is changed.
After passing through L hidden layers, we get the output value
${H}_{L}$ of the hidden layer. We connect the output value
${H}_{l}$ of each hidden layer with the original data and enter the classifier of RVFL as a whole. We use
${X}_{c}$ to represent the input value of the classifier, which can be expressed as:
${X}_{c}=\left[X\mathrm{,}{H}_{1}\mathrm{,}\cdots \mathrm{,}{H}_{L}\right]\mathrm{.}$ (6)
The optimization problem in the RVFL classifier is the same as in Equation (3); we need to solve for the optimal weight matrix
$\beta $ .
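The concatenations in Equations (4) and (6) can be illustrated with a short NumPy sketch. The helper `dense_layer_input` and the random matrices standing in for the hidden-layer outputs are assumptions made for illustration only; in the actual network each ${H}_{l}$ would come from the trained coding layer.

```python
import numpy as np

def dense_layer_input(X, hidden_outputs):
    """Equation (4): X_l = [X, H_1, ..., H_{l-1}]; for the first layer this is just X."""
    return np.hstack([X, *hidden_outputs])

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))                        # 6 samples, 4 original features
H = [rng.standard_normal((6, 3)) for _ in range(3)]    # stand-ins for H_1, H_2, H_3

X_3 = dense_layer_input(X, H[:2])   # input of layer 3 uses X, H_1 and H_2
X_c = dense_layer_input(X, H)       # Equation (6): classifier input uses X and all H_l
```

With these shapes, `X_3` has 4 + 2·3 = 10 columns and `X_c` has 4 + 3·3 = 13 columns, so the input width grows with the layer index under dense links.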
2.3. Decentralized Average Consensus (DAC)
DAC [20] is an algorithm that iterates continuously over the parameters of each node to reach a global average, requiring only communication between neighboring nodes. Here, we assume that there are N nodes in the network. In the kth iteration, the parameter of a node i is
${\psi}_{i}$ , and the update of the DAC of each local node is as follows:
${\psi}_{i}\left(k\right)={\displaystyle \underset{j=1}{\overset{N}{\sum}}}\text{\hspace{0.17em}}{b}_{ij}{\psi}_{j}\left(k-1\right)$ (7)
where
$B=\left[{b}_{ij}\right]$ is a weighted adjacency matrix of size
$N\times N$ . Through continuous iteration, the parameters gradually converge to the global average value, as follows:
$\underset{k\to +\infty}{\mathrm{lim}}{\psi}_{i}\left(k\right)=\frac{1}{N}{\displaystyle \underset{j=1}{\overset{N}{\sum}}}\text{\hspace{0.17em}}{\psi}_{j}\left(0\right)\mathrm{,}\text{\hspace{1em}}\forall i\in \mathcal{N}\mathrm{.}$ (8)
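As an illustration of the consensus iteration, the sketch below runs DAC on a ring of five nodes. Several choices here are assumptions rather than details from the paper: the ring topology, the max-degree weighting rule used to build a doubly stochastic $B$ (a common choice for consensus averaging), and the stopping rule based on the largest change between two iterations. Each node's value is updated as a weighted sum of the values held by the node itself and its neighbors.

```python
import numpy as np

def max_degree_weights(adj):
    """Weighted matrix B from a symmetric 0/1 adjacency matrix (assumed max-degree rule)."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    B = adj / (deg.max() + 1.0)                                   # b_ij for neighbors i != j
    B[np.diag_indices_from(B)] = 1.0 - deg / (deg.max() + 1.0)    # b_ii makes each row sum to 1
    return B

def dac(psi0, B, max_iter=500, tol=1e-3):
    """Consensus iteration; row i of psi is the value currently held by node i."""
    psi = np.asarray(psi0, dtype=float)
    for _ in range(max_iter):
        psi_next = B @ psi                            # psi_i(k) = sum_j b_ij * psi_j(k-1)
        done = np.max(np.abs(psi_next - psi)) < tol   # assumed stopping rule
        psi = psi_next
        if done:
            break
    return psi

# Assumed topology: a ring in which node i is linked to nodes i-1 and i+1.
N = 5
adj = np.zeros((N, N))
for i in range(N):
    adj[i, (i - 1) % N] = adj[i, (i + 1) % N] = 1.0
B = max_degree_weights(adj)

rng = np.random.default_rng(0)
psi0 = rng.standard_normal((N, 3))      # one 3-dimensional parameter vector per node
psi = dac(psi0, B, tol=1e-10)           # every row approaches psi0.mean(axis=0)
```

Because this `B` is symmetric with rows summing to one, the average of the node values is preserved at every step, which is why all rows converge to the initial mean as stated in Equation (8).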
3. Fully Distributed Deep RVFL Network
In this section, we extend the previous two forms of deep RVFL to the peer-to-peer distributed learning framework, using the DAC and ADMM methods to optimize the weights of each hidden layer and of the RVFL classifier at each node. The following describes the two distributed deep RVFL networks and their solution processes.
3.1. Problem Description
In a distributed learning network based on a point-to-point architecture, we assume that the network has N nodes that are connected to their neighbors and can communicate with each other. The whole dataset is randomly distributed among nodes. Here, we assume that the dataset local to the ith node is
${X}^{i}$ and
${Y}^{i}$ , and each node trains the deep RVFL network locally. In the distributed scenario, the global optimization problem then becomes minimizing the sum of the loss functions of all nodes. Assuming that the loss function at the ith node is
${f}_{i}\left(z\right)$ , then the global objective function is:
${z}^{\mathrm{*}}=\mathrm{arg}\mathrm{min}\left\{F\left(z\right)\mathrm{:}={\displaystyle \underset{i=1}{\overset{N}{\sum}}}\text{\hspace{0.17em}}{f}_{i}\left(z\right)\right\}\mathrm{.}$ (9)
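For a squared-error loss, the decomposition in (9) can be checked numerically: when the samples are partitioned among the nodes, the sum of the local losses equals the loss on the pooled data. The sketch below is only an illustration; the squared-error form of `local_loss`, the number of nodes, and the random data are assumptions.

```python
import numpy as np

def local_loss(X_i, Y_i, z):
    """Assumed local loss f_i(z) = 0.5 * ||X_i @ z - Y_i||^2."""
    return 0.5 * np.sum((X_i @ z - Y_i) ** 2)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
Y = rng.standard_normal((100, 2))
z = rng.standard_normal((5, 2))

N = 4
perm = rng.permutation(len(X))        # random assignment of samples to nodes
parts = np.array_split(perm, N)       # index sets of the N local datasets
F = sum(local_loss(X[idx], Y[idx], z) for idx in parts)   # F(z) = sum_i f_i(z)
```

Here `F` agrees with `local_loss(X, Y, z)` up to rounding, so minimizing the sum over nodes targets the same objective as the centralized problem.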
3.2. Fully Distributed Deep RVFL Network with Direct Links
From Section 2, we know that in the deep RVFL network with direct links, the optimization of the model is divided into two parts: the optimization of the decoder reconstruction weight matrix of the self-encoder in each hidden layer, and the optimization of the RVFL classifier weight matrix. We extend these optimization problems to the distributed scenario. Suppose we have a network of N nodes in which each node communicates only with its neighbors. Following the problem description in the previous subsection, optimization problem (1) is decomposed into N subproblems whose objectives are summed:
${\stackrel{^}{U}}_{l}=\underset{{U}_{l}^{i}}{\mathrm{arg}\mathrm{min}}\frac{1}{2}{\displaystyle \underset{i=1}{\overset{N}{\sum}}}{\Vert {Z}_{l}^{i}{U}_{l}^{i}-{H}_{l-1}^{i}\Vert}^{2}+{\lambda}_{l}{\Vert {U}_{l}^{i}\Vert}_{1}$ (10)
where
${\lambda}_{l}$ is the regularization parameter of the hidden layer of the lth layer. By solving the above objective function, we obtain the optimal reconstruction matrix
${U}_{l}$ of each hidden layer, and each node can use
${U}_{l}$ to extract the features of the hidden layer. For the optimization of the RVFL classifier weight matrix, we assume the same distributed topology, and problem (3) naturally takes the following form:
$\stackrel{^}{\beta}=\underset{\beta}{\mathrm{arg}\mathrm{min}}\frac{1}{2}{\displaystyle \underset{i\mathrm{=1}}{\overset{N}{\sum}}}{\Vert {X}_{c}^{i}{\beta}^{i}-{T}^{i}\Vert}^{2}+\frac{\lambda}{2}{\Vert {\beta}^{i}\Vert}^{2}$ (11)
where
${X}_{c}^{i}=\left[{H}_{L}^{i}\mathrm{,}{X}^{i}\right]$ ,
${H}_{L}^{i}$ is the output value of the hidden layer for each node,
${X}^{i}$ is the original data for each node and
${T}^{i}$ is the target matrix for each node.
3.3. Fully Distributed Deep RVFL Network with Dense Direct Links
Here, the deep RVFL with dense direct links is also extended to the distributed scenario. The difference from the fully distributed deep RVFL network with direct links lies in the connections between the hidden layers. Each hidden layer is connected to all subsequent hidden layers, so that lower-complexity features can be reused multiple times and the features extracted by the hidden layers are more representative and meaningful. Suppose that on a certain node, the input value
${X}_{l}^{i}$ of a certain hidden layer can be represented as follows:
${X}_{l}^{i}=\left[{X}^{i}\mathrm{,}{H}_{1}^{i}\mathrm{,}\cdots \mathrm{,}{H}_{l-1}^{i}\right]\mathrm{.}$ (12)
As in the case of direct links, the distributed optimization problems for the entire network are again given by problems (10) and (11).
3.4. Fully Distributed Solutions
Solving the above objective functions for the globally optimal weight matrices involves two requirements: minimizing the sum of the loss functions and reaching global consensus, which makes this a constrained optimization problem. Such problems can be solved with the ADMM method. Below we outline the principles of ADMM.
ADMM combines the Lagrange multiplier method with dual decomposition and solves the problem by alternately updating the primal and dual variables. ADMM is typically applied to constrained optimization problems of the form:
$\begin{array}{l}\mathrm{min}\text{\hspace{1em}}{f}_{1}\left({\theta}_{1}\right)+{g}_{2}\left({\theta}_{2}\right)\\ \text{s}\text{.t}\text{.}\text{\hspace{1em}}{P}_{1}{\theta}_{1}+{P}_{2}{\theta}_{2}-R=0.\end{array}$ (13)
The core idea of ADMM is to fold the constraints into the objective by introducing Lagrange multiplier terms and a quadratic penalty, which gives the augmented Lagrangian function of the above problem. Minimizing it with respect to each variable in turn, by setting the corresponding partial derivatives to zero, yields the specific iterative update formulas.
In addition, the convergence and convergence rate of the distributed ADMM algorithm have been analyzed extensively in the literature, and it has been proved in [22] that this algorithm converges at the rate
$O\left(\frac{1}{k}\right)$ .
According to the principle of ADMM above, we set the auxiliary variable
${V}_{l}$ so that the parameters of each node converge to the same value. Then, problem (10) is rewritten as follows:
$\begin{array}{l}\mathrm{min}\text{\hspace{1em}}\frac{1}{2}{\displaystyle \underset{i=1}{\overset{N}{\sum}}}{\Vert {Z}_{l}^{i}{U}_{l}^{i}-{H}_{l-1}^{i}\Vert}^{2}+{\lambda}_{l}{\Vert {V}_{l}\Vert}_{1}\\ \text{s}\text{.t}\text{.}\text{\hspace{1em}}{U}_{l}^{i}-{V}_{l}=\mathrm{0,}\text{\hspace{0.17em}}i=\mathrm{1,2,}\cdots \mathrm{,}N\end{array}$ (14)
where
${\lambda}_{l}$ denotes the regularization parameter for each hidden layer, then we obtain the augmented Lagrangian for the above problem as follows:
$\begin{array}{c}{L}_{{\rho}_{l}}\left(\left\{{U}_{l}^{i}\right\}\mathrm{,}{V}_{l}\mathrm{,}\left\{{\mu}_{l}^{i}\right\}\right)=\frac{1}{2}{\displaystyle \underset{i=1}{\overset{N}{\sum}}}{\Vert {Z}_{l}^{i}{U}_{l}^{i}-{H}_{l-1}^{i}\Vert}^{2}+{\lambda}_{l}{\Vert {V}_{l}\Vert}_{1}+{\displaystyle \underset{i=1}{\overset{N}{\sum}}}{\left({\mu}_{l}^{i}\right)}^{{\rm T}}\left({U}_{l}^{i}-{V}_{l}\right)\\ \text{\hspace{0.17em}}+\frac{{\rho}_{l}}{2}{\displaystyle \underset{i=1}{\overset{N}{\sum}}}{\Vert {U}_{l}^{i}-{V}_{l}\Vert}^{2}\mathrm{.}\end{array}$ (15)
For each of the hidden layers,
${\mu}_{l}^{i}$ is the dual variable of the ith node and
${\rho}_{l}$ is the penalty parameter. In each iteration, the local objective functions of
${U}_{l}^{i}$ and
${V}_{l}$ are first optimized alternately, and then the dual variable
${\mu}_{l}^{i}$ is updated, and the iteration formula is as follows:
${U}_{l}^{i}\left(t+1\right)=\underset{{U}_{l}^{i}}{\mathrm{arg}\mathrm{min}}{L}_{{\rho}_{l}}\left({U}_{l}^{i}\mathrm{,}{V}_{l}\left(t\right)\mathrm{,}{\mu}_{l}^{i}\left(t\right)\right)$ (16)
${V}_{l}\left(t+1\right)=\underset{{V}_{l}}{\mathrm{arg}\mathrm{min}}{L}_{{\rho}_{l}}\left({U}_{l}^{i}\left(t+1\right)\mathrm{,}{V}_{l}\mathrm{,}{\mu}_{l}^{i}\left(t\right)\right)$ (17)
${\mu}_{l}^{i}\left(t+1\right)={\mu}_{l}^{i}\left(t\right)+{\rho}_{l}\left({U}_{l}^{i}\left(t+1\right)-{V}_{l}\left(t+1\right)\right)$ (18)
where t denotes the iteration index. Equations (16) and (17) admit closed-form solutions, which give the following iterative steps:
${U}_{l}^{i}\left(t+1\right)={\left({\left({Z}_{l}^{i}\right)}^{{\rm T}}{Z}_{l}^{i}+{\rho}_{l}I\right)}^{-1}\left({\left({Z}_{l}^{i}\right)}^{{\rm T}}{H}_{l-1}^{i}+{\rho}_{l}{V}_{l}\left(t\right)-{\mu}_{l}^{i}\left(t\right)\right)$ (19)
${V}_{l}\left(t+1\right)={\mathcal{S}}_{{\lambda}_{l}\mathrm{/}\left(N{\rho}_{l}\right)}\left({\stackrel{^}{U}}_{l}+\frac{1}{{\rho}_{l}}{\stackrel{^}{\mu}}_{l}\right)$ (20)
${\mu}_{l}^{i}\left(t+1\right)={\mu}_{l}^{i}\left(t\right)+{\rho}_{l}\left({U}_{l}^{i}\left(t+1\right)-{V}_{l}\left(t+1\right)\right)$ (21)
where
${\stackrel{^}{U}}_{l}=\frac{1}{N}{\displaystyle {\sum}_{i=1}^{N}}\text{\hspace{0.17em}}{U}_{l}^{i}\left(t+1\right)$ ,
${\stackrel{^}{\mu}}_{l}=\frac{1}{N}{\displaystyle {\sum}_{i=1}^{N}}\text{\hspace{0.17em}}{\mu}_{l}^{i}\left(t\right)$ are the averages over all nodes. In master-slave mode, computing them requires a central node to aggregate information from all nodes. Here, we instead use the decentralized average consensus (DAC) algorithm, which reaches the global average through communication between neighboring nodes only. This removes the need for a central node and realizes decentralized distributed optimization. We obtain an estimate of each average by (7) and (8).
In addition,
${\mathcal{S}}_{\kappa}(\cdot )$ stands for the element-wise soft threshold operator [23] , which is defined as follows:
${\mathcal{S}}_{\kappa}\left(a\right)=\{\begin{array}{ll}a-\kappa ,\hfill & a>\kappa \hfill \\ 0,\hfill & \left|a\right|\le \kappa \hfill \\ a+\kappa ,\hfill & a<-\kappa .\hfill \end{array}$ (22)
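The three cases of Equation (22) can be written as a single vectorized expression. The following NumPy sketch is one common way to do this; the function name `soft_threshold` and the sample values are illustrative.

```python
import numpy as np

def soft_threshold(a, kappa):
    """Element-wise soft threshold: shrink each entry toward zero by kappa."""
    # |a| <= kappa gives 0; otherwise the magnitude is reduced by kappa and the sign is kept.
    return np.sign(a) * np.maximum(np.abs(a) - kappa, 0.0)

a = np.array([-2.0, -0.3, 0.0, 0.5, 1.5])
shrunk = soft_threshold(a, 0.5)   # entries with |a| <= 0.5 become zero
```

Entries whose magnitude is at most `kappa` are set to zero, which is how the ${\ell}_{1}$ term produces sparse reconstruction matrices.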
Through the above calculation, we find the optimal reconstruction matrix
${\stackrel{^}{U}}_{l}^{i}$ of each hidden layer. The data passes through the hidden layers one at a time: the optimal reconstruction matrix of a layer is found before the data enters the next layer, and the optimization of all hidden layers is completed before the optimization of the RVFL classifier.
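The hidden-layer updates can be sketched in NumPy for a single layer as follows. This is a simplified illustration under several assumptions: the node averages are computed exactly instead of through DAC, the matrices `Z` and `H_prev` are random stand-ins for ${Z}_{l}^{i}$ and ${H}_{l-1}^{i}$ , and the values of `lam` and `rho` are arbitrary. The consensus update below divides the averaged dual variable by ${\rho}_{l}$ , which is what minimizing the augmented Lagrangian (15) over ${V}_{l}$ gives when the dual variables are updated in the unscaled form (18).

```python
import numpy as np

def soft_threshold(a, kappa):
    return np.sign(a) * np.maximum(np.abs(a) - kappa, 0.0)

def admm_hidden_layer(Z, H_prev, lam, rho, n_iter=500):
    """Consensus ADMM for one coding layer; Z and H_prev hold one local matrix per node."""
    N = len(Z)
    m, d = Z[0].shape[1], H_prev[0].shape[1]
    mu = [np.zeros((m, d)) for _ in range(N)]
    V = np.zeros((m, d))
    A = [Zi.T @ Zi + rho * np.eye(m) for Zi in Z]         # local system matrices
    ZtH = [Zi.T @ Hi for Zi, Hi in zip(Z, H_prev)]
    for _ in range(n_iter):
        # Local update (19): each node solves its own regularized least-squares problem.
        U = [np.linalg.solve(A[i], ZtH[i] + rho * V - mu[i]) for i in range(N)]
        # Exact averages for brevity; the paper estimates these with DAC.
        U_hat = sum(U) / N
        mu_hat = sum(mu) / N
        # Consensus update: soft threshold of the averaged quantities.
        V = soft_threshold(U_hat + mu_hat / rho, lam / (N * rho))
        # Dual update (21).
        mu = [mu[i] + rho * (U[i] - V) for i in range(N)]
    return U, V

rng = np.random.default_rng(0)
N, n, m, d = 4, 40, 6, 5
U_true = rng.standard_normal((m, d)) * (rng.random((m, d)) < 0.4)   # sparse ground truth
Z = [rng.standard_normal((n, m)) for _ in range(N)]                  # stand-ins for Z_l^i
H_prev = [Zi @ U_true + 0.1 * rng.standard_normal((n, d)) for Zi in Z]
U, V = admm_hidden_layer(Z, H_prev, lam=5.0, rho=10.0)
```

After enough iterations the local matrices in `U` agree with `V`, and `V` approximately satisfies the optimality conditions of the pooled ${\ell}_{1}$ -regularized problem.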
For problem (11), we also use ADMM combined with DAC. We introduce the auxiliary variable
$V$ , so that (11) is rewritten as follows:
$\begin{array}{l}\mathrm{min}\text{\hspace{1em}}\frac{1}{2}{\displaystyle \underset{i=1}{\overset{N}{\sum}}}{\Vert {X}_{c}^{i}{\beta}^{i}-{T}^{i}\Vert}^{2}+\frac{\lambda}{2}{\Vert V\Vert}^{2}\\ \text{s}\text{.t}\text{.}\text{\hspace{1em}}{\beta}^{i}-V=\mathrm{0,}\text{\hspace{0.17em}}i=\mathrm{1,2,}\cdots \mathrm{,}N\mathrm{.}\end{array}$ (23)
We get the augmented Lagrange function as follows:
$\begin{array}{c}{L}_{\rho}\left(\left\{{\beta}^{i}\right\}\mathrm{,}V\mathrm{,}\left\{{\mu}^{i}\right\}\right)=\frac{1}{2}{\displaystyle \underset{i=1}{\overset{N}{\sum}}}{\Vert {X}_{c}^{i}{\beta}^{i}-{T}^{i}\Vert}^{2}+\frac{\lambda}{2}{\Vert V\Vert}^{2}+{\displaystyle \underset{i=1}{\overset{N}{\sum}}}{\left({\mu}^{i}\right)}^{{\rm T}}\left({\beta}^{i}-V\right)\\ \text{\hspace{0.17em}}+\frac{\rho}{2}{\displaystyle \underset{i=1}{\overset{N}{\sum}}}{\Vert {\beta}^{i}-V\Vert}^{2}\mathrm{.}\end{array}$ (24)
Then, the ADMM iterations are as follows:
${\beta}^{i}\left(t+1\right)={\left({\left({X}_{c}^{i}\right)}^{{\rm T}}{X}_{c}^{i}+\rho I\right)}^{-1}\left({\left({X}_{c}^{i}\right)}^{{\rm T}}{T}^{i}+\rho V\left(t\right)-{\mu}^{i}\left(t\right)\right)$ (25)
$V\left(t+1\right)=\frac{\rho \stackrel{^}{\beta}+\stackrel{^}{\mu}}{\rho +\frac{\lambda}{N}}$ (26)
${\mu}^{i}\left(t+1\right)={\mu}^{i}\left(t\right)+\rho \left({\beta}^{i}\left(t+1\right)-V\left(t+1\right)\right)$ (27)
where
$\stackrel{^}{\beta}=\frac{1}{N}{\displaystyle {\sum}_{i=1}^{N}}\text{\hspace{0.17em}}{\beta}^{i}\left(t+1\right)$ and
$\stackrel{^}{\mu}=\frac{1}{N}{\displaystyle {\sum}_{i=1}^{N}}\text{\hspace{0.17em}}{\mu}^{i}\left(t\right)$ in (26) are the averages over all nodes. They are again obtained with the DAC algorithm according to Formulas (7) and (8). Through the above iterations, the globally optimal RVFL classifier weight matrix is finally obtained.
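The classifier iterations (25) to (27) can be combined with DAC as in the sketch below, where every node keeps its own copy of $V$ and all averages are obtained by repeated neighbor averaging. The ring topology with equal weights 1/3, the fixed number of DAC steps, the values of `lam` and `rho`, and the random data are assumptions made for illustration; the helper names are hypothetical.

```python
import numpy as np

def ring_weights(N):
    """Assumed topology: each node averages itself and its two ring neighbors."""
    B = np.zeros((N, N))
    for i in range(N):
        B[i, i] = B[i, (i - 1) % N] = B[i, (i + 1) % N] = 1.0 / 3.0
    return B

def dac_average(values, B, n_iter=60):
    """Apply Equation (7) n_iter times; slice i of the result is node i's estimate."""
    psi = np.asarray(values, dtype=float)
    for _ in range(n_iter):
        psi = np.tensordot(B, psi, axes=1)    # psi_i <- sum_j b_ij * psi_j
    return psi

def admm_classifier(Xc, T, B, lam, rho, n_iter=300):
    N = len(Xc)
    d, c = Xc[0].shape[1], T[0].shape[1]
    V = np.zeros((N, d, c))                   # local copies of the consensus variable
    mu = np.zeros((N, d, c))
    A = [X.T @ X + rho * np.eye(d) for X in Xc]
    XtT = [X.T @ Ti for X, Ti in zip(Xc, T)]
    for _ in range(n_iter):
        beta = np.stack([np.linalg.solve(A[i], XtT[i] + rho * V[i] - mu[i])
                         for i in range(N)])              # Equation (25)
        beta_hat = dac_average(beta, B)                   # averages via DAC
        mu_hat = dac_average(mu, B)
        V = (rho * beta_hat + mu_hat) / (rho + lam / N)   # Equation (26)
        mu = mu + rho * (beta - V)                        # Equation (27)
    return V

rng = np.random.default_rng(0)
N, n, d, c = 5, 40, 6, 3
Xc = [rng.standard_normal((n, d)) for _ in range(N)]     # stand-ins for X_c^i
T = [rng.standard_normal((n, c)) for _ in range(N)]      # stand-ins for T^i
lam, rho = 0.5, 10.0
V = admm_classifier(Xc, T, ring_weights(N), lam, rho)

# Centralized ridge solution on the pooled data, used only as a reference.
beta_star = np.linalg.solve(sum(X.T @ X for X in Xc) + lam * np.eye(d),
                            sum(X.T @ Ti for X, Ti in zip(Xc, T)))
```

In this setup every local copy `V[i]` ends up close to `beta_star`, which illustrates that the nodes can recover the centralized solution of problem (3) without sharing their local data.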
In order to understand the training process of the distributed algorithm more clearly, the pseudocode of Algorithm 1 shows the iterative steps of the decentralized distributed algorithm in the directly connected deep RVFL network. The algorithm for dense connections is similar to Algorithm 1 and will not be repeated here.
4. Experiments and Analysis
To verify the effectiveness and feasibility of the proposed algorithm, as well as its robustness to changes in the number of network layers, we designed two experiments. The first compares the proposed algorithm with other algorithms in terms of model accuracy and training time on the same datasets. The second changes the number of hidden layers of the deep model and observes the accuracy and training time of the proposed distributed algorithm to verify its robustness.
We introduce the experimental setup below, including a brief description of the datasets, the metric used to measure model accuracy, how training time is measured, and the selection and parameter settings of the comparison models, so that the advantages of the proposed algorithm can be assessed convincingly.
4.1. Experimental Setup
4.1.1. Training Datasets
We use classification datasets from the classical UCI repository, selected to cover a range of sizes: the large datasets have a total data volume of more than one million, and the small datasets have a total data volume of less than ten thousand. Min-max normalization is applied to the data so that model performance on datasets of different orders of magnitude can be observed more reliably. Details of the datasets are presented in Table 1, and further descriptions can be found on the UCI dataset website.
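Min-max normalization rescales each feature to the interval [0, 1]. The sketch below shows one way to do this in NumPy. Fitting the minimum and range on one array and reusing them elsewhere, as well as mapping constant columns to zero, are assumptions of this illustration and are not details given in the paper.

```python
import numpy as np

def minmax_fit(X):
    """Per-feature minimum and range; constant features get range 1 to avoid division by zero."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    return lo, span

def minmax_apply(X, lo, span):
    """Rescale features with previously fitted statistics."""
    return (X - lo) / span

rng = np.random.default_rng(0)
X = np.hstack([rng.uniform(-5.0, 5.0, size=(20, 3)), np.full((20, 1), 7.0)])
lo, span = minmax_fit(X)
X_scaled = minmax_apply(X, lo, span)   # non-constant columns now span [0, 1]
```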
4.1.2. Evaluation Index
In the accuracy evaluation of the model, we select the classification accuracy as the evaluation index. The closer the classification prediction of the model is to the actual situation, the higher the accuracy of the model. The calculation formula for the classification accuracy is as follows:
$\text{CAR}=\frac{\text{the number of correctly classified samples}}{\text{the total number of samples}}\times \mathrm{100\%.}$ (28)
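Equation (28) can be computed as in the following sketch. It assumes one-hot encoded targets and takes the predicted class as the column with the largest score; both are illustrative assumptions, as are the toy values.

```python
import numpy as np

def car(scores, T_onehot):
    """Classification accuracy rate in percent, as in Equation (28)."""
    pred = scores.argmax(axis=1)       # predicted class: column with the largest score
    true = T_onehot.argmax(axis=1)     # true class from the one-hot target
    return 100.0 * np.mean(pred == true)

scores = np.array([[0.9, 0.1, 0.0],
                   [0.2, 0.7, 0.1],
                   [0.3, 0.3, 0.4],
                   [0.6, 0.3, 0.1]])
T_onehot = np.array([[1, 0, 0],
                     [0, 1, 0],
                     [0, 0, 1],
                     [0, 1, 0]])
accuracy = car(scores, T_onehot)       # 3 of the 4 samples are classified correctly
```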
In terms of training time, we measure the training time of each node. In a centralized model there is only a single node, so the training time of that node is the training time of the model. In a distributed model, because multiple nodes participate in each optimization, the training time of each node needs to be divided by the corresponding number of nodes and then compared with the centralized model.
Table 1. Overview of the UCI datasets.
4.1.3. Testing Models and Parameter Setting
For comparison, we evaluate the proposed distributed models not only against the corresponding centralized deep RVFL models, sdRVFL (d) and sdRVFL (dense), but also against two representative deep random weight neural networks, H-ELM and ML-KELM.
All compared models use the same number of hidden layers and neurons to ensure a fair comparison. In this paper, the number of hidden layers is set to 3 and the number of neurons is fixed at 32; other parameters are set to the optimal values reported in the paper that introduced each model. For the centralized and distributed deep RVFL models, we tune the regularization parameter λ and the Lagrangian parameter ρ jointly: λ is chosen from 0.01, 0.1, 1, 10, 100, and ρ is chosen from 0.01, 0.1, 1, 10, 100. The maximum number of DAC iterations is 500, and the DAC termination threshold is 0.001.
4.2. Performance
4.2.1. Classification Accuracy
Across the 6 classification datasets, as shown in Table 2, the proposed distributed deep models D-sdRVFL(d) and D-sdRVFL(dense) perform well on classification tasks. Their average classification accuracy differs from that of the centralized deep models sdRVFL(d − l_{1}/l_{2}) and sdRVFL(dense − l_{1}/l_{2}) and of the H-ELM model by only 3% to 4%, and from that of the ML-KELM model by less than 1%, indicating that the proposed distributed deep models can match the performance of centralized models. In addition, the classification accuracy of the D-sdRVFL(dense) model is higher than that of the D-sdRVFL(d) model.
Table 2. CAR (%) for different algorithms on the test datasets.
4.2.2. Training Time
As shown in Table 3, for the D-sdRVFL(d) and D-sdRVFL(dense) models with 5 agents and 3 hidden layers, the actual training time per agent is slightly higher than that of the centralized models but much lower than that of the ML-KELM model. In the following experiments, we examine how the training time of each agent in the distributed model changes with the number of hidden layers. We found that as the number of network layers and the number of agents increase, the training time of a single agent decreases continuously, whereas the training time of the centralized model increases continuously.
4.3. Correlation Analysis of Model Robustness
In this experiment, we change the number of hidden layers in the network to observe the changes in classification accuracy and training time. Three representative data sets were selected as the data sets of this experiment, namely musk-2, waveform and credit-approval. These three data sets also represent large, medium and small data sets.
As shown in Figure 1, in this experiment we compared the classification accuracy and training time of the two centralized deep models and the two proposed distributed models as the number of hidden layers changed from 3 to 7. On the Waveform dataset, the classification accuracy of the centralized and distributed deep RVFL models does not change significantly as the number of layers increases: the difference between the highest and lowest accuracy is less than 2%, there is no clear upward or downward trend, and the highest accuracy does not occur for the model with the most layers. On Musk-2, the classification accuracy of the centralized deep RVFL models and the D-sdRVFL(d) model does not change significantly with the number of layers, while that of the D-sdRVFL(dense) model increases with the number of layers, peaks at 6 hidden layers, and decreases at 7 layers. On the credit-approval dataset, the classification accuracy of the centralized deep RVFL models and the D-sdRVFL(d) model first increases and then decreases as the number of layers grows, while that of the D-sdRVFL(dense) model decreases gradually. Each model behaves differently on different datasets, but when the other parameters are fixed and only the number of layers is changed, the classification accuracy does not change greatly: the maximum change does not exceed 7%, most changes are around 2%, and the changes of the distributed models are slightly larger than those of the centralized models. This supports the robustness of the model.
Table 3. Average training time (s) per node for different algorithms on training datasets.
Figure 1. Comparison of CAR and training time between distributed deep network model and centralized deep network model with the number of layers of the network changing, other parameters unchanged.
As for training time, the results on the three datasets show that the training time of the single node in the centralized model increases with the number of hidden layers, while the training time of each node in the distributed model gradually decreases as the number of layers increases. When the network has more than 4 layers, the average training time of each node in the distributed network is lower than that of the centralized network. As the number of layers increases, the distributed networks therefore show an increasingly clear advantage in training time while maintaining robust training performance.
5. Conclusions
Based on the deep RVFL model, this paper proposes a fully distributed deep RVFL algorithm. In the fully distributed framework, agents in the network communicate only with their neighbors and do not need to exchange raw data. DAC and ADMM are used to achieve collaborative optimization between agents in the hidden layers and the output layer, avoiding the need for a central server and effectively protecting data privacy. Experiments on several representative classification datasets show that the proposed algorithm achieves good classification accuracy and can greatly reduce the training time of each agent. In addition, the robustness of the model is verified by changing the number of hidden layers.
Future work is mainly divided into two aspects. First, regarding the algorithm, DAC and ADMM are used together for collaborative optimization, which requires two levels of iteration and consumes more training time. In later research, other collaborative optimization methods will be explored to reduce the number of iterations and thus further reduce the training time. Second, regarding applications, experiments have so far been carried out only on classification tasks to verify the effectiveness of the model; experiments on other machine learning tasks remain to be conducted.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (No. 62166013), the Natural Science Foundation of Guangxi (No. 2022GXNSFAA035499) and the Foundation of Guilin University of Technology (No. GLUTQD2007029).