CN115086174B - Communication optimization method based on random event triggered stochastic gradient descent algorithm - Google Patents
Communication optimization method based on random event triggered stochastic gradient descent algorithm
- Publication number
- CN115086174B (application number CN202210778235.8A)
- Authority
- CN
- China
- Prior art keywords
- node
- gradient
- random
- probability
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/08—Configuration management of networks or network elements
- H04L41/0803—Configuration setting
- H04L41/0823—Configuration setting characterised by the purposes of a change of settings, e.g. optimising configuration for enhancing reliability
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/16—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Signal Processing (AREA)
- Theoretical Computer Science (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computer Networks & Wireless Communication (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Databases & Information Systems (AREA)
- Computer And Data Communications (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
Description
Technical field
The invention belongs to the field of machine learning technology, and specifically relates to a communication optimization method based on a random event-triggered stochastic gradient descent algorithm.
Background art
Gradient descent is the process of minimizing a function by following the gradient of its cost function. It requires knowledge of the form of the cost function and of its derivative, so that from a given point one can move in a chosen direction, for example downhill toward the minimum. In machine learning, stochastic gradient descent can be used to minimize the error of a training model, performing one evaluation and one update per iteration. The optimization works by making a prediction for each training instance the model sees and repeating this process for a certain number of iterations; it can be used to find the model coefficients that yield the smallest error on the training data. Stochastic gradient descent (SGD), also known as incremental gradient descent, is an iterative method for optimizing a differentiable objective function. It iteratively updates the weights and bias terms by computing the gradient of the loss function on mini-batches of data. SGD far surpasses naive gradient descent on highly non-convex loss surfaces, and this simple hill-climbing-style technique has come to dominate modern non-convex optimization.
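As a concrete illustration of the plain SGD update described above, the following minimal Python sketch fits a one-variable linear model by sampling one training example per iteration; the toy data, step size and variable names are illustrative assumptions and not part of the invention.

```python
import random

# Toy training data following y = 2*x with small noise (illustrative only).
data = [(x, 2.0 * x + random.uniform(-0.1, 0.1)) for x in range(1, 6)]

w = 0.0        # model coefficient to be learned
alpha = 0.01   # step size

for k in range(1000):
    x, y = random.choice(data)       # pick one random sample per iteration
    grad = 2.0 * (w * x - y) * x     # gradient of the squared error (w*x - y)^2
    w -= alpha * grad                # stochastic gradient descent update

print(f"learned coefficient: {w:.3f}")   # expected to end up close to 2.0
```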
In the era of big data, the wealth of information carried by massive data inevitably drives communication and computation costs up sharply. Under these conditions even the SGD method entails extremely heavy communication; moreover, data often contain interference or noise that disturbs the convergence of the algorithm. How to design a suitable SGD variant that, in the presence of these two problems, guarantees convergence while effectively reducing the required communication is a technical challenge that urgently needs to be solved in this field.
Summary of the invention
In view of the shortcomings of the prior art, the problem to be solved by the invention is how to reduce the heavy communication load caused by distributed machine learning training in the presence of noise. Through an event-triggering mechanism, the number of communication rounds is reduced, relieving the high burden that communication imposes.
To this end, the invention proposes a communication optimization method based on a random event-triggered stochastic gradient descent algorithm, comprising:
Step 1: collect local data as sample data sets and build a cloud-edge distributed machine learning framework;
Step 2: each sub-node constructs a stochastic gradient descent algorithm model based on random event triggering and trains it;
Step 3: the cloud center server aggregates the gradients and updates the model parameters;
Step 4: use the trained stochastic gradient descent algorithm model for data prediction.
Step 1 is specifically expressed as:
Step 1-1: determine a cloud center server shared by the M data providers (hereinafter referred to as sub-nodes); each sub-node uses its own local data as its sample data set;
Step 1-2: select the machine learning model structure, determine the loss function, and build the cloud-edge distributed machine learning framework, where the objective function f_m of each sub-node is to minimize the loss on its local data set:
f_m(ω) = (1/N_m) Σ_{n=1}^{N_m} ℓ_n(ω)

where N_m is the number of samples in the local data set and ℓ_n(ω) is the loss function of each sample;
The global objective is then to minimize the average loss over all M sub-nodes:

min_ω F(ω) = (1/M) Σ_{m=1}^{M} f_m(ω)
The cloud center server sends the initialized model to each sub-node and starts the training.
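A minimal sketch of the local objective f_m and the global average objective described in step 1-2 is given below; the synthetic data, the squared-error loss and all function names are assumptions introduced only for illustration.

```python
import numpy as np

def local_loss(w, X, y):
    """f_m(w): average per-sample loss on one sub-node's local data set
    (squared error is used here purely as an example loss function)."""
    return float(np.mean((X @ w - y) ** 2))

def global_loss(w, node_datasets):
    """F(w): average of the M local objectives f_1, ..., f_M."""
    return float(np.mean([local_loss(w, X, y) for X, y in node_datasets]))

# Two hypothetical sub-nodes (e.g. bank A and bank B) holding private data.
rng = np.random.default_rng(0)
node_datasets = [(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(2)]
print(global_loss(np.zeros(3), node_datasets))
```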
Step 2 is specifically expressed as:
Step 2-1: at each iteration time k, each sub-node m randomly selects one data sample and computes a gradient according to the current model parameters ωk;
Step 2-2: construct the trigger condition based on random events;
in this condition, g_m^k denotes the stochastic gradient of sub-node m at time k, g_m^{k−τm} denotes the gradient last transmitted by sub-node m relative to time k, and τm is the delay factor of node m; the condition maps the squared L2 distance ‖g_m^k − g_m^{k−τm}‖² between these two gradients into a transmission probability in the interval [0, 1];
Step 2-3: judge, according to the constructed trigger condition, whether node m transmits the stochastic gradient g_m^k at time k; this is specifically expressed as:
Step 2-3-1: calculate, according to formula (4), the probability p_m^k that node m transmits the current gradient g_m^k at time k;
here 1 − p_m^k is the probability of not transmitting at time k and p_m^k is the probability of transmitting;
Step 2-3-2: generate a random number uniformly in the interval [0, 1]; if the probability of transmitting the current gradient g_m^k is greater than this random number, send g_m^k to the cloud center node; if the probability is smaller than the random number, abandon the transmission;
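The random trigger of steps 2-2 and 2-3 can be sketched as follows. The exact form of formula (4) is not reproduced here; following the description (squared L2 distance between the fresh stochastic gradient and the last transmitted one, mapped into [0, 1] by an exponential function), the mapping p = 1 − exp(−λ‖·‖²) is assumed, so the scale λ and all names are illustrative rather than the patented formula.

```python
import numpy as np

def transmit_probability(g_new, g_last, lam=1.0):
    """Map the squared L2 distance between the fresh stochastic gradient and the
    last transmitted gradient into a probability in [0, 1]; the exponential form
    and the scale lam are assumptions made for this sketch."""
    diff = np.linalg.norm(g_new - g_last) ** 2
    return 1.0 - np.exp(-lam * diff)     # close gradients -> small probability

def should_transmit(g_new, g_last, rng, lam=1.0):
    """Step 2-3-2: draw a uniform random number in [0, 1] and transmit only when
    the transmission probability exceeds it."""
    return rng.uniform() < transmit_probability(g_new, g_last, lam)

rng = np.random.default_rng(0)
g_last = np.array([0.50, -0.20])
g_new = np.array([0.52, -0.21])              # very close to the last transmitted gradient
print(should_transmit(g_new, g_last, rng))   # usually False, i.e. no transmission
```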
Step 3 is specifically expressed as:
Step 3-1: the cloud center server updates the model parameters according to the information transmitted by the nodes; this is specifically expressed as:
Step 3-1-1: the cloud center server resets the delay factor τm of each node that transmitted to 1, and sets the delay factor of each node that did not transmit to τm+1;
Step 3-1-2: the cloud center server performs gradient descent using the new gradients of the transmitting nodes and the old gradients of the non-transmitting nodes to obtain the model parameters ω_{k+1} at time k+1:
ω_{k+1} = ω_k − (α_k/M) Σ_{m=1}^{M} ĝ_m^k

where ĝ_m^k is the gradient most recently received from node m (the new gradient g_m^k for nodes that transmitted, the old gradient g_m^{k−τm} for nodes that did not) and α_k denotes the learning step-size factor;
Step 3-2: the cloud center server sends the updated model parameters ω_{k+1} to all sub-nodes; steps 1-1 to 3-2 are repeated until the training is completed.
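A minimal sketch of the server-side bookkeeping in step 3 follows; the container layout, the function name and the averaging of delayed gradients are assumptions for illustration, using the newest gradient for transmitting nodes and the stored stale gradient for silent ones, as described above.

```python
import numpy as np

def server_update(w, last_grads, delays, received, alpha):
    """One aggregation step at the cloud center server.

    w          : current model parameters (omega_k)
    last_grads : most recently transmitted gradient of each node
    delays     : delay factors tau_m of each node
    received   : {node index: fresh gradient} for nodes that transmitted at time k
    alpha      : step-size factor alpha_k
    """
    M = len(last_grads)
    for m in range(M):
        if m in received:
            last_grads[m] = received[m]   # keep the newly transmitted gradient
            delays[m] = 1                 # reset tau_m to 1
        else:
            delays[m] += 1                # silent node: tau_m becomes tau_m + 1
    avg_grad = sum(last_grads) / M        # mix of fresh and stale gradients
    return w - alpha * avg_grad, last_grads, delays

# Hypothetical round with M = 2 nodes in which only node 0 transmits.
w = np.zeros(3)
last_grads = [np.ones(3), np.full(3, 0.5)]
delays = [1, 1]
w, last_grads, delays = server_update(w, last_grads, delays, {0: np.full(3, 2.0)}, 0.1)
print(w, delays)
```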
The beneficial effects of the invention are:
1) In an actual distributed machine learning training process, and especially when several enterprises cooperate to train a model, the method defines a criterion for the difference between two gradients. The L2 norm represents the distance between two vectors in space: the smaller it is, the closer the two vectors are. During gradient descent, two similar gradients can be regarded as producing roughly the same decrease of the global function value. Exploiting this property, the trigger condition designed by the method reduces the number of communications, so that the communication overhead during training is reduced.
2) The choice of whether to transmit is made randomly. When the norm of the difference between the two gradients is small, the algorithm transmits with a smaller probability, but not with probability zero; conversely, when the norm of the difference is large, the algorithm transmits with a larger probability, but transmission is still not certain. Under the uncertainty caused by additive noise, a sub-node therefore still has the opportunity to adaptively choose whether to upload the latest gradient in order to correct the influence of the noise. This suits the case where an enterprise's data are contaminated and enhances the generalization performance of the method in practical applications.
Description of the drawings
Figure 1 is a schematic diagram of the cloud-edge distributed machine learning framework of the invention;
Figure 2 is a schematic diagram of the event-trigger function of the invention;
Figure 3 is a flow chart of the communication optimization method based on the random event-triggered stochastic gradient descent algorithm of the invention;
Figure 4 shows the results on the Iris data set, where (a) is the result for acr = 0.05, σ² = 0, (b) is the result for acr = 0.178, σ² = 0.05, and (c) is the result for acr = 0.452, σ² = 0.1; the horizontal axis Epoch is the number of iteration rounds, loss is the training loss function value, Acc is the accuracy on the test set, communication rate is the communication rate at each moment, acr is the average communication rate over all communication rounds, and σ² is the variance of the gradient noise;
Figure 5 shows the results on the Covtype data set, where (a) is the result for acr = 0.256, σ² = 0, (b) is the result for acr = 0.308, σ² = 0.1, and (c) is the result for acr = 0.615, σ² = 0.4;
Figure 6 shows the results on the MNIST data set, where (a) is the training loss function value and (b) is the accuracy on the test set; cr is the average communication rate over all communication rounds;
Figure 7 shows the results on the Cifar10 data set, where (a) is the training loss function value and (b) is the accuracy on the test set;
Figure 8 shows the comparison with deterministic event triggering on real data sets; the abscissa Average Communication Rate is the average communication rate, Average Loss is the average loss function value, SET-SGD denotes random event triggering, and DET-SGD denotes deterministic event triggering.
Detailed description of embodiments
The invention is further described below with reference to the drawings and specific embodiments. The method of the invention is suitable for machine learning training and is particularly effective on distributed machine learning systems with noise, the aim being to reduce unnecessary communication. The method comprises the design of a random triggering mechanism and the probability distribution of a Bernoulli variable, and uses event triggering to reduce the number of transmissions and thereby greatly reduce communication overhead. First, the L2 norm is used to measure the distance between two gradients; an exponential function then converts this distance into a probability in the interval [0, 1], and a Bernoulli variable is defined to indicate the probability of each triggering result. In practice, a random number generated uniformly in [0, 1] can be used as the threshold that realizes the probabilistic comparison. The invention introduces randomness into the triggering mechanism, which has a smoother character and can reduce the influence of noise; in numerical simulations, under the premise of guaranteed convergence, the number of communications during training is greatly reduced and bandwidth resources are saved.
As shown in Figure 3, a random event-triggering mechanism is first designed: the fixed threshold of a standard event trigger is abandoned and the result of each trigger is instead described by a probability. In distributed machine learning there is a cloud central parameter server and M training nodes. Taking node m as an example, at time k the node receives the latest model parameters ωk sent by the cloud parameter server and then generates a noisy stochastic gradient g_m^k. The stochastic gradient g_m^k and the most recently transmitted gradient g_m^{k−τm} (relative to time k) are substituted into the random trigger condition to obtain the probability of transmitting g_m^k at time k. After all M nodes have completed this operation, the central server performs a unified update using the received gradients; this cycle is repeated to complete the training of the stochastic gradient descent algorithm.
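Putting the pieces together, the sketch below runs the full loop of Figure 3 on a synthetic least-squares problem; the data, the exponential probability mapping and every hyper-parameter are assumptions made for illustration, not the exact configuration used in the embodiments.

```python
import numpy as np

rng = np.random.default_rng(0)
M, d, n = 4, 5, 200                      # nodes, feature dimension, samples per node
w_true = rng.normal(size=d)
nodes = []
for _ in range(M):
    X = rng.normal(size=(n, d))
    nodes.append((X, X @ w_true + 0.01 * rng.normal(size=n)))

w = np.zeros(d)
last_grads = [np.zeros(d) for _ in range(M)]   # last transmitted gradient per node
alpha, lam = 0.05, 5.0                         # step size and assumed probability scale
sent = scheduled = 0

for k in range(500):
    received = {}
    for m, (X, y) in enumerate(nodes):
        i = rng.integers(n)                                 # one random local sample
        g = 2.0 * (X[i] @ w - y[i]) * X[i]                  # stochastic gradient
        p = 1.0 - np.exp(-lam * np.linalg.norm(g - last_grads[m]) ** 2)
        if rng.uniform() < p:                               # random event trigger
            received[m] = g
            sent += 1
        scheduled += 1
    for m, g in received.items():                           # server-side bookkeeping
        last_grads[m] = g
    w = w - alpha * sum(last_grads) / M                     # update with mixed gradients

print("final loss:", float(np.mean([np.mean((X @ w - y) ** 2) for X, y in nodes])))
print("communication rate:", sent / scheduled)
```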
The communication optimization method based on the random event-triggered stochastic gradient descent algorithm proposed by the invention specifically comprises the following steps:
The technical solution of the invention is introduced by taking a scenario with two data owners (bank A and bank B, hereinafter referred to as sub-nodes) as an example; the architecture can be extended to scenarios with multiple data owners.
Step 1: collect local data as sample data sets and build the cloud-edge distributed machine learning framework; specifically:
Step 1-1: A and B want to jointly train a machine learning model to predict the loan limit available to a new customer. Their business systems each hold the information of their own depositors (salary, loan status, etc.) with which the data sets are constructed. Each sub-node uses its own local data as its training data set.
Step 1-2: for data privacy protection and security reasons the data cannot be exchanged directly, so the two parties jointly select a public cloud center server, choose the machine learning model structure (such as a neural network or logistic regression), and determine the loss function (cross-entropy loss, mean squared error loss, etc.), together forming a cloud-edge distributed machine learning framework, as shown in Figure 1. In Figure 1, Cloud Server is the cloud center server, Worker is a sub-node, Dataset is the data set of each sub-node, ωk is the model parameter at time k, g_m^k indicates transmitting the gradient of the latest moment, and none indicates not transmitting.
Here, the objective function f_m of each sub-node is to minimize the loss on its local data set:
f_m(ω) = (1/N_m) Σ_{n=1}^{N_m} ℓ_n(ω)

where N_m is the number of samples in the local data set and ℓ_n(ω) is the loss function of each sample;
The global objective is then to minimize the average loss over all M sub-nodes:

min_ω F(ω) = (1/M) Σ_{m=1}^{M} f_m(ω)
The cloud center server sends the initialized model to each sub-node and starts the training.
Step 2: each sub-node constructs the stochastic gradient descent algorithm model triggered by random events and trains it; specifically:
Step 2-1: at each iteration time k, each sub-node m randomly selects one data sample and computes a gradient according to the current model parameters ωk.
Step 2-2: taking node m as an example, the stochastic gradient at time k is g_m^k, g_m^{k−τm} is the gradient last transmitted by node m relative to time k, and τm is the delay factor of node m. Substituting g_m^k and g_m^{k−τm} into the designed random trigger condition, as shown in Figure 2, yields the probability of transmitting g_m^k at time k. The trigger condition based on random events maps the L2 norm of the difference between the two gradients, through an exponential function, into a transmission probability.
The closer the two gradients are to each other, the smaller their L2 norm, and hence the smaller the probability of transmission;
Step 2-3: judge, according to the constructed trigger condition, whether node m transmits the stochastic gradient g_m^k at time k; specifically:
Step 2-3-1: define 1 − p_m^k as the probability of not transmitting at time k and p_m^k as the probability of transmitting; whether node m transmits the current latest gradient at time k can then be expressed by the distribution of the following Bernoulli random variable δ_m^k:

P(δ_m^k = 1) = p_m^k (transmit),  P(δ_m^k = 0) = 1 − p_m^k (do not transmit)
Step 2-3-2: generate a random number uniformly in the interval [0, 1]; if the probability of transmitting the current gradient g_m^k is greater than this random number, send g_m^k to the cloud center node; if it is smaller, abandon the transmission;
Step 3: the cloud center server aggregates the gradients and updates the model parameters; specifically:
Step 3-1: update the model parameters according to the information transmitted by the nodes; specifically:
Step 3-1-1: the cloud center node resets the delay factor τm of each node that transmitted to 1, and sets the delay factor of each node that did not transmit to τm+1;
Step 3-1-2: the cloud center server performs gradient descent using the new gradients of the transmitting nodes and the old gradients of the non-transmitting nodes to obtain the model parameters ω_{k+1} at time k+1:
ω_{k+1} = ω_k − (α_k/M) Σ_{m=1}^{M} ĝ_m^k

where ĝ_m^k is the gradient most recently received from node m (the new gradient g_m^k for nodes that transmitted, the old gradient g_m^{k−τm} for nodes that did not) and α_k denotes the step size used in model training;
Step 3-2: the cloud center server sends the updated model parameters ω_{k+1} to all sub-nodes; steps 1-1 to 3-2 are repeated until the training is completed;
Step 4: use the trained stochastic gradient descent algorithm model for data prediction; specifically:
Step 4-1: after the model training is completed, examine the communication cost of the trained model and deploy it in practice;
Step 4-2: when the data of a new customer arrive, they are used as the input of the model, which finally outputs the customer's available loan limit.
To verify the effectiveness of the method of the invention, experiments were first carried out with a linear model on relatively small data sets. Iris is a small data set containing 150 samples with four features, divided into three classes. 99 samples (with labels setosa and versicolour) were used for binary classification; the results are shown in Figure 4.
Covtype has 54 features whose values differ greatly, so all features were rescaled to [0, 1]; 30,000 samples (with labels Douglas-fir and Krummholz) were used for training and the remaining 7,877 samples were used as the test set. The results are shown in Figure 5.
It can be seen that all curves converge to zero regardless of the magnitude of the noise. In practice the noise needs to be kept within a certain range, within which the algorithm can reduce its influence to some extent. Note that the communication rate also tends to decay. At the beginning of training, the model is far from the optimum and the gradients are large, so the difference between two successive gradients is also large, which causes a high trigger frequency. As the number of iterations increases, the gradient gradually decreases toward zero; the difference between the two gradients then becomes small, so the trigger frequency can also decrease. In other words, the method guarantees convergence through a larger number of triggers in the early stage and saves communication in the later stage. This phenomenon also explains why a larger variance leads to a larger communication rate, especially in the later stage.
The LeNet5 model was built on the MNIST data set and the CifarNet model on the Cifar10 data set to show the generality of the method of the invention. The communication rate (Cr for short) is the ratio of the actual number of transmissions to the number of transmissions that would otherwise have been made.
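For reference, the communication rate Cr defined above can be computed directly from a per-round transmission log; the log below is hypothetical.

```python
# 1 = the node transmitted in that round, 0 = it stayed silent (hypothetical log).
log = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
cr = sum(log) / len(log)   # actual transmissions / transmissions that would have been made
print(cr)                  # 0.5
```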
As shown in Figure 6, a communication rate of 10% achieves performance similar to standard SGD on MNIST. Grayscale handwritten digits are simple data samples that do not require many parameters to characterize, so with such redundant parameters a small number of transmissions does not affect the overall result.
Figure 7 shows the performance of the method of the invention on Cifar10. In this simulation the trade-off between convergence and communication rate can be seen: only 50% of the communication rate is needed to reach performance close to that of standard SGD.
In summary, the random event-triggered stochastic gradient descent algorithm proposed by the invention can greatly reduce the communication burden and, with appropriate hyper-parameters, converges to approximately the same accuracy as the SGD baseline.
Finally, on real data sets the method of the invention was compared with deterministic event-triggered SGD (DET-SGD) at different average communication rates to verify its superiority. All averages were computed from the results of 30 iteration rounds. The results show that the method of the invention clearly outperforms DET-SGD, especially at low communication rates, demonstrating its better robustness in noisy environments. The results are shown in Figure 8.
Claims (1)
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210778235.8A (CN115086174B) | 2022-06-30 | 2022-06-30 | Communication optimization method based on random event triggered stochastic gradient descent algorithm |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN115086174A | 2022-09-20 |
| CN115086174B | 2023-11-07 |
Family
ID=83257874
Citations (2)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110289992A * | 2019-06-04 | 2019-09-27 | New H3C Security Technologies Co., Ltd. | Message processing method and device |
| CN111882060A * | 2020-07-20 | 2020-11-03 | National University of Defense Technology | Single-step delay stochastic gradient descent training method for machine learning |
Non-Patent Citations (3)

- T. Adachi, "Distributed gradient descent method with edge-based event-driven communication for non-convex optimization," The Institution of Engineering and Technology (cited in full) *
- Kang Liangyi, Wang Jianfei, Liu Jie, Ye Dan, "Survey of parallel and distributed optimization algorithms for scalable machine learning," Journal of Software, No. 1 (cited in full) *
- Yang Tao, "Event-triggered distributed optimization algorithms," Acta Automatica Sinica (cited in full) *
Legal Events

- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant