CN112069876A

CN112069876A - Handwriting recognition method based on adaptive differential gradient optimization

Info

Publication number: CN112069876A
Application number: CN202010698221.6A
Authority: CN
Inventors: 姜淏予; 张建朝; 徐今强; 葛泉波
Original assignee: Guangdong Ocean University
Current assignee: Guangdong Ocean University
Priority date: 2020-07-20
Filing date: 2020-07-20
Publication date: 2020-12-11

Abstract

The invention discloses a handwriting recognition method based on adaptive differential gradient optimization. In the BP neural network parameter optimization algorithm for handwriting recognition, the invention integrates and deforms the commonly used gradient descent algorithm again by combining the traditional control theory idea; then, a differential link is added in a conventional gradient descent algorithm for advanced correction, and the future change trend of an error signal is forecasted through the change rate of the error, so that the precision is improved; and finally, the learning rate is adaptively adjusted by utilizing the stored average value of past square gradients of exponential decay, so that the training rate is accelerated. The method provided by the invention introduces a differential link, can effectively improve the training speed, and forecasts the future change trend of the error signal through the change rate of the error. And the learning rate can be adjusted in a self-adaptive mode, namely when the training is close to the optimal value, the learning rate is reduced due to the fact that the accumulated past square gradient is increased, and the learning rate is prevented from being too large to skip the optimal point.

Description

A Handwriting Recognition Method Based on Adaptive Band Differential Gradient Optimization

技术领域technical field

本发明属于人工智能领域，涉及深度学习与智能计算中网络训练及优化方面，特别是涉及一种基于自适应带微分梯度优化的手写体识别方法。The invention belongs to the field of artificial intelligence, and relates to network training and optimization in deep learning and intelligent computing, in particular to a handwriting recognition method based on adaptive band differential gradient optimization.

背景技术Background technique

人工智能是一门研究开发模拟、延伸和扩展人类智能的技术，其主要研究内容可归纳为四个方面：机器感知、机器思维、机器行为和机器学习。而机器学习是利用计算机、概率论、统计学等知识，通过给计算机程序输入数据，让计算机能够学习新知识和新技能，其最初的研究动机是为了让计算机系统具有人的学习能力以便实现人工智能。深度学习是基于学习特征的更广泛的机器学习方法，它试图在多个层次中进行学习，其中较高层次的概念是从较低层次的概念中定义的，而较低层次的概念可以帮助定义许多更高层的概念。Artificial intelligence is a technology that studies and develops the simulation, extension and expansion of human intelligence. Its main research contents can be summarized into four aspects: machine perception, machine thinking, machine behavior and machine learning. Machine learning is the use of computer, probability theory, statistics and other knowledge to allow computers to learn new knowledge and new skills by inputting data into computer programs. intelligent. Deep learning is a broader machine learning approach based on learning features, which attempts to learn at multiple levels, where higher level concepts are defined from lower level concepts that help define Many higher level concepts.

随着研究的不断深入，深度学习技术已经被应用到数以百计的实际问题中，且超出了传统的多层神经网络的内涵。而光学字符识别(Optical Character Recognition,OCR)是其广泛应用的一个领域，它的主要任务是对文本资料的图像文件进行分析识别处理，获取文字及版面信息。With the deepening of research, deep learning technology has been applied to hundreds of practical problems, beyond the connotation of traditional multi-layer neural networks. Optical Character Recognition (OCR) is a widely used field. Its main task is to analyze and recognize image files of text data to obtain text and layout information.

卷积神经网络是一类特殊的用于数据处理的神经网络，它受视觉系统结构的启发，由生物学家Hubel和Wiesel于1962年提出。他们通过对猫的实验发现：人的视觉系统的信息处理是分级的，初级视觉皮层提取边缘特征，中级视觉皮层提取形状或者目标，更高层的视觉皮层得到特征组合。受此启发，Lecun等人于1989年提出了卷积神经网络。此后，卷积神经网络被广泛地应用于图像处理、手写体识别等领域中，并衍生出许多改进模型。Convolutional neural networks are a special class of neural networks for data processing inspired by the structure of the visual system and proposed by biologists Hubel and Wiesel in 1962. Through experiments on cats, they found that the information processing of the human visual system is hierarchical. The primary visual cortex extracts edge features, the intermediate visual cortex extracts shapes or objects, and the higher visual cortex obtains feature combinations. Inspired by this, Lecun et al. proposed Convolutional Neural Networks in 1989. Since then, convolutional neural networks have been widely used in image processing, handwriting recognition and other fields, and many improved models have been derived.

神经网络的优化问题是深度学习与智能计算领域一个非常重要的问题，尤其是神经网络权值的学习和修正问题，大多数学习算法是基于迭代的更新方法，而常用的基于梯度的优化算法最大的困难是如何选择合适的学习率及梯度来加快训练速率及收敛率。The optimization of neural networks is a very important problem in the field of deep learning and intelligent computing, especially the learning and correction of neural network weights. Most learning algorithms are based on iterative update methods, and the commonly used gradient-based optimization algorithms are the largest The difficulty is how to choose the appropriate learning rate and gradient to speed up the training rate and convergence rate.

目前，手写体识别方法常用的优化方法为带动量的梯度下降算法，此种方法主要存在1)收敛率较差，且收敛速率比较慢。2)权重更新滞后于实际梯度改变。3)学习率对于不同参数单一性问题。鉴于此，本发明在对常规梯度下降算法重新整合后，提出一种自适应带微分梯度优化方法，能够有效地克服上述问题，并提高训练的收敛率及收敛速率，从而提高手写体识别方法的识别率。At present, the commonly used optimization method for handwriting recognition methods is the gradient descent algorithm with momentum, which mainly has 1) poor convergence rate and slow convergence rate. 2) The weight update lags the actual gradient change. 3) Learning rate for the problem of singleness of different parameters. In view of this, the present invention proposes an adaptive gradient optimization method with differential gradient after reintegrating the conventional gradient descent algorithm, which can effectively overcome the above problems and improve the convergence rate and convergence rate of training, thereby improving the recognition method of handwriting recognition. Rate.

发明内容SUMMARY OF THE INVENTION

本发明针对现有技术中存在的不足，提供一种基于自适应带微分梯度优化的手写体识别方法。Aiming at the deficiencies in the prior art, the present invention provides a handwriting recognition method based on adaptive band differential gradient optimization.

1)模型参数的预处理：离线训练多层卷积神经网络模型。1) Preprocessing of model parameters: offline training of multi-layer convolutional neural network models.

2)对完成扫描的样本进行预处理：对手写体字样本的处理及图像中字符的定位，切割或切分，归一化，二值化，平滑，去燥，细化以及生成样本。基于数据集构建训练样本集及测试样本集。2) Preprocessing the scanned samples: processing the handwritten character samples and locating the characters in the image, cutting or segmenting, normalizing, binarizing, smoothing, removing dryness, thinning and generating samples. The training sample set and the test sample set are constructed based on the data set.

3)输入样本集，提取样本集特征：找出样本中能够有效区分不同类型的本质特征，通过特征提取，找到最有用的信息，大大减少数据量。3) Input the sample set and extract the sample set features: find out the essential features in the samples that can effectively distinguish different types, and through feature extraction, find the most useful information and greatly reduce the amount of data.

4)网络建立与优化：将提取到的特征输入到单隐层BP神经网络系统训练，并且系统采用自适应带微分梯度优化方法优化。4) Network establishment and optimization: The extracted features are input into the single-hidden layer BP neural network system for training, and the system is optimized by the adaptive band differential gradient optimization method.

5)使用所得到的神经网络对未知手写体样本进行识别。5) Use the resulting neural network to identify unknown handwriting samples.

其中，传统的梯度下降算法收敛率比较慢，收敛性差，经常导致训练不准确，为了进一步加快训练的速率，避免训练陷入局部极值点，采用自适应带微分梯度优化方法。Among them, the traditional gradient descent algorithm has a relatively slow convergence rate and poor convergence, which often leads to inaccurate training. In order to further speed up the training rate and avoid training falling into local extreme points, the adaptive gradient optimization method with differential is adopted.

自适应带微分梯度优化方法大体上可分为三部分：The adaptive band differential gradient optimization method can be roughly divided into three parts:

第一部分：根据传统控制理论，变形几种常用梯度下降法；The first part: According to the traditional control theory, deform several common gradient descent methods;

第二部分：在传统的梯度优化中引入微分环节，加速BP神经网络训练；The second part: Introduce the differential link in the traditional gradient optimization to speed up the BP neural network training;

第三部分：优化梯度下降法学习率，从而使学习率自适应调节，提高优化速率及收敛率；The third part: optimize the learning rate of the gradient descent method, so that the learning rate can be adjusted adaptively, and the optimization rate and convergence rate can be improved;

具体的技术方案如下：The specific technical solutions are as follows:

一种自适应带微分梯度优化方法，包括以下步骤：An adaptive band differential gradient optimization method, comprising the following steps:

步骤1：根据传统控制理论，变形梯度下降法：Step 1: According to the traditional control theory, the deformation gradient descent method:

PID控制器由比例单元P、积分单元I和微分单元D组成，主要是利用当前、历史及超前信息来反馈系统，其输入与输出关系为：The PID controller is composed of proportional unit P, integral unit I and differential unit D. It mainly uses current, historical and advanced information to feedback the system. The relationship between its input and output is:

其中：u(t)表示输出，e(t)表示误差，K_p、T_i及T_d分别表示P、I和D的增益。Where: u(t) represents the output, e(t) represents the error, and K _p , T _i and T _d represent the gains of P, I and D, respectively.

在神经网络中，随机梯度下降法的更新规则如下：In neural networks, the update rules for stochastic gradient descent are as follows:

其中：w_t+1为t+1时刻网络权重，w_t为t时刻网络权重，α表示学习率，

为梯度，将其等同为PID控制中的误差e(t)，可以将随机梯度下降法看作比例环节。Where: w _t+1 is the network weight at time t+1, w _t is the network weight at time t, α represents the learning rate,

is the gradient, which is equivalent to the error e(t) in PID control, and the stochastic gradient descent method can be regarded as a proportional link.

为了加快目标收敛，提出带有动量项的随机梯度下降法，其表达式如下：In order to speed up the target convergence, a stochastic gradient descent method with a momentum term is proposed, and its expression is as follows:

其中，r为动量系数。where r is the momentum coefficient.

将上式合并后利用累计求和方法，得到下面的梯度更新规则：After merging the above equations, the cumulative summation method is used to obtain the following gradient update rules:

另一种梯度下降法为Nesterov加速梯度法，更新规则如下：Another gradient descent method is the Nesterov accelerated gradient method. The update rules are as follows:

如带有动量项的随机梯度下降法一样，经过变换得到如下更新规则：Like the stochastic gradient descent method with a momentum term, the following update rules are obtained after transformation:

其中，

表示网络权重。in,

represents the network weight.

将上面两个重写的算法与PID控制输入输出公式对比，可以看出两种梯度都是利用了历史信息与当前信息更新，相当于比例积分环节。Comparing the above two rewritten algorithms with the PID control input and output formulas, it can be seen that both gradients use historical information and current information to update, which is equivalent to the proportional integral link.

步骤2：在传统的梯度优化中引入微分环节，加速深度神经网络训练：Step 2: Introduce the differential link in the traditional gradient optimization to speed up the training of deep neural network:

从上面带有动量项的随机梯度下降法及Nesterov加速梯度法两种算法的改写公式中可以看出，动量项为历史梯度的累计，但是，如果要更新的权重来改变下降的方向，由于历史梯度累积的存在，权重将不会及时改变，即：It can be seen from the rewritten formulas of the two algorithms of the stochastic gradient descent method with the momentum term and the Nesterov accelerated gradient method above that the momentum term is the accumulation of historical gradients. With the existence of gradient accumulation, the weights will not change in time, namely:

其中，σ％表示超调量，w_max表示权重最大值，w^*表示权重最优值。Among them, σ% represents the overshoot, w _max represents the maximum value of the weight, and w ^* represents the optimal value of the weight.

在自动控制原理中，超调量主要反映了系统动态性能中的平稳性；超调量越大，反映平稳性越差，也就是最大输出偏离稳态值的幅度比越大；稳态值通常是系统的正常工作状态或称平衡态；显然对系统来说偏离平衡态越远，对系统正常工作越不利，所以通常希望超调量小一些好；In the principle of automatic control, the overshoot mainly reflects the stability in the dynamic performance of the system; the larger the overshoot, the worse the reflection stability, that is, the larger the amplitude ratio of the maximum output deviation from the steady state value; the steady state value usually It is the normal working state or equilibrium state of the system; obviously, the farther the system deviates from the equilibrium state, the more unfavorable it is for the normal operation of the system, so it is usually better to have a smaller overshoot;

微分控制是通过误差的变化率预报误差信号的未来变化趋势，通过提供超前控制作用，微分控制能使被控过程趋于稳定；Differential control is to predict the future change trend of the error signal through the rate of change of the error. By providing a leading control effect, the differential control can make the controlled process tend to be stable;

因此，可以抵消积分控制产生的不稳定趋势；Therefore, the unstable trend produced by integral control can be counteracted;

鉴于此，为了消除历史梯度累积带来的不稳定趋势，加入梯度的变化，即微分项：In view of this, in order to eliminate the unstable trend brought about by the accumulation of historical gradients, the change of gradient is added, that is, the differential term:

其中，D_t表示t时刻的微分项，D_t-1表示t-1时刻的微分项，T_d表示微分项的增益，J_t表示t时刻的损失函数，w_t为t时刻网络权重；Among them, D _t represents the differential term at time t, D _t-1 represents the differential term at time t-1, T _d represents the gain of the differential term, J _t represents the loss function at time t, and w _t is the network weight at time t;

通过加入微分项，为梯度下降法引入超前信息，能够有效减少超调，即对瞬时变化做出响应；改进后的梯度下降法更新如下：By adding a differential term, leading information is introduced into the gradient descent method, which can effectively reduce overshoot, that is, respond to instantaneous changes; the improved gradient descent method is updated as follows:

其中，r为动量系数，Δw_t为t时刻的动量项，Δw_t+1为t+1时刻的动量项，D_t+1为t+1时刻累积微分项，D_t为t时刻累积微分项，w_t+1为t+1时刻网络权重，α为学习率。where r is the momentum coefficient, Δw _t is the momentum term at time t, Δw _t+1 is the momentum term at time t+1, D _t+1 is the cumulative differential term at time t+1, and D _t is the cumulative differential term at time t , w _t+1 is the network weight at time t+1, and α is the learning rate.

步骤3：优化梯度下降法学习率，从而使学习率自适应调节，提高优化速率及收敛率：Step 3: Optimize the learning rate of the gradient descent method, so that the learning rate can be adjusted adaptively, and the optimization rate and convergence rate can be improved:

存储一个指数衰减的过去平方梯度的平均值v_t+1，并保持了过去梯度的指数衰减平均值m_t+1，类似于动量：Stores an exponentially decaying mean v _t+1 of past squared gradients, and keeps an exponentially decaying mean m _t+1 of past gradients, similar to momentum:

其中，m_t+1和v_t+1分别表示梯度的一阶矩(均值)和二阶矩(无中心方差)；当Δw_t+1和v_t+1被初始化为值为0的向量，它们是偏向于零的，特别是在初始时间步骤中，当衰变率很小时，可以通过计算偏差修正的一阶和二阶矩估计来抵消这些偏差，如下式所示：Among them, m _t+1 and v _t+1 represent the first-order moment (mean) and the second-order moment (no central variance) of the gradient, respectively; when Δw _t+1 and v _t+1 are initialized as vectors with a value of 0, They are biased towards zero, especially at the initial time step, when the decay rate is small, and these biases can be counteracted by computing bias-corrected first- and second-order moment estimates as follows:

则学习率为：Then the learning rate is:

然后，通过使用这些方法来更新参数，得到最后的更新规则：Then, by using these methods to update the parameters, get the final update rule:

其中，η为超参数，ε防止分母为零的极小值，

为t+1时刻的梯度二阶矩，

为t+1时刻的梯度一阶矩。where η is a hyperparameter, ε prevents minima where the denominator is zero,

is the second-order moment of the gradient at time t+1,

is the first moment of the gradient at time t+1.

本发明在用于手写体识别的BP神经网络参数优化算法中，为了更好地加快训练速率及提高收敛率，通过结合传统控制理论思想，将常用的梯度下降算法进行重新整合、变形；然后，在常规梯度下降算法中加入微分环节进行超前的校正，通过误差的变化率预报误差信号的未来变化趋势，从而提高精度；最后利用存储的指数衰减的过去平方梯度的平均值自适应地调整学习率，从而加快训练速率。本发明中的自适应带微分梯度优化方法引入微分环节可以有效地提高训练速率，通过误差的变化率预报误差信号的未来变化趋势，并且学习率可以自适应调整，即当训练接近最优值是由于累积过去平方梯度增大而学习率减小，避免学习率过大而跳过最优点。In the present invention, in the BP neural network parameter optimization algorithm for handwriting recognition, in order to better speed up the training rate and improve the convergence rate, the commonly used gradient descent algorithm is reintegrated and deformed by combining the traditional control theory; A differential link is added to the conventional gradient descent algorithm to perform advanced correction, and the future change trend of the error signal is predicted through the rate of change of the error, thereby improving the accuracy; finally, the average value of the past square gradients of the stored exponential decay is used to adaptively adjust the learning rate. thereby speeding up the training rate. The adaptive gradient optimization method with differential gradient in the present invention can effectively improve the training rate by introducing the differential link, predict the future change trend of the error signal through the rate of change of the error, and the learning rate can be adjusted adaptively, that is, when the training is close to the optimal value, it is Since the accumulated past squared gradient increases and the learning rate decreases, avoid skipping the optimal point if the learning rate is too large.

附图说明Description of drawings

图1为本发明所提供的自适应带微分梯度优化方法的整体方案框图；Fig. 1 is the overall scheme block diagram of the adaptive band differential gradient optimization method provided by the present invention;

图2为在手写体识别仿真训练的损失曲线；Fig. 2 is the loss curve of simulation training in handwriting recognition;

图3为在手写体识别仿真训练的验证损失曲线；Fig. 3 is the verification loss curve in the handwriting recognition simulation training;

图4为在手写体识别仿真中训练精度曲线；Fig. 4 is the training accuracy curve in the handwriting recognition simulation;

图5为在手写体识别仿真中验证精度曲线。Figure 5 shows the verification accuracy curve in the handwriting recognition simulation.

具体实施方式Detailed ways

本发明具体包括以下步骤：The present invention specifically includes the following steps:

步骤1.模型参数的预处理：离线训练单隐层BP神经网络结构，其中输入节点数目为13个，即每个样本有13个特征表达。Step 1. Preprocessing of model parameters: offline training of single hidden layer BP neural network structure, in which the number of input nodes is 13, that is, each sample has 13 feature expressions.

步骤2.对完成扫描的样本预处理：Step 2. Preprocess the scanned sample:

预处理是识别前的必要阶段，对扫描好的手写体字样本的预处理及扫描好的图像中字符的定位，切割或切分，归一化，二值化，平滑，去燥，细化以及生成样本；定位及切割即通过算法对这些图像样本进行处理，搜索到纸张图像上的定位标记，然后根据方格的大小读取出指定位置的图像。最终生成训练样本集与测试样本集。Preprocessing is a necessary stage before recognition. It is the preprocessing of scanned handwritten character samples and the positioning, cutting or segmentation, normalization, binarization, smoothing, de-drying, thinning and character positioning of scanned images. Generating samples; positioning and cutting is to process these image samples through algorithms, search for the positioning marks on the paper image, and then read the image at the specified position according to the size of the square. Finally, a training sample set and a test sample set are generated.

步骤3.输入样本集，提取样本集特征，进行分类训练并优化网络：Step 3. Input the sample set, extract the features of the sample set, perform classification training and optimize the network:

将样本集分别输入到预先训练的网络，特征提取的目的主要是找出原始数据中能够有效区分不同类型的本质特征，通过特征提取，找到最有用的信息，大大减少数据量，为后续BP神经网络训练减轻负担，减少学习样本的时间成本及计算量。本发明主要提取特征包括数字长度，各部分宽度，各个分支笔画长度、形状、角度及曲率，粗网格特征，笔画密度特征等等。具体做法就是将图像样本分成13个区域，即每个样本有13个特征表达。The sample sets are input into the pre-trained network respectively. The purpose of feature extraction is to find out the essential features in the original data that can effectively distinguish different types. Through feature extraction, the most useful information can be found, and the amount of data can be greatly reduced. Network training reduces the burden and reduces the time cost and calculation amount of learning samples. The main extraction features of the present invention include digit length, width of each part, stroke length, shape, angle and curvature of each branch, coarse grid feature, stroke density feature and the like. The specific method is to divide the image samples into 13 regions, that is, each sample has 13 feature expressions.

步骤4.网络建立与优化：Step 4. Network establishment and optimization:

在步骤3提到特征提取13维特征表达，即在建立的BP神经网络训练中，输入节点数目为13个。选择单隐层BP神经网络结构训练，但是传统的BP算法收敛率比较慢，收敛性差，经常导致训练不准确。为了进一步加快训练的速率，避免训练陷入局部极值点，采用自适应带微分梯度优化方法，所提自适应带微分梯度优化方法描述如下：In step 3, it is mentioned that feature extraction 13-dimensional feature expression, that is, in the established BP neural network training, the number of input nodes is 13. The single hidden layer BP neural network structure is selected for training, but the traditional BP algorithm has a slow convergence rate and poor convergence, which often leads to inaccurate training. In order to further speed up the training rate and avoid the training falling into local extreme points, the adaptive gradient optimization method with differential gradient is adopted. The proposed adaptive gradient gradient optimization method is described as follows:

其中，m_t+1和v_t+1分别表示梯度的一阶矩和二阶矩的估计；当m_t+1和v_t+1被初始化为值为0的向量，它们是偏向于零的，特别是在初始时间步骤中，当衰变率很小时，可以通过计算偏差修正的一阶和二阶矩估计来抵消这些偏差，如下式所示：where m _t+1 and v _t+1 represent estimates of the first and second moment of the gradient, respectively; when m _t+1 and v _t+1 are initialized to vectors with a value of 0, they are biased towards zero , especially in the initial time step, when the decay rate is small, these biases can be offset by computing bias-corrected first- and second-order moment estimates as follows:

引入微分项校正，即：A differential term correction is introduced, namely:

步骤5.使用所得到的神经网络对未知手写体样本进行识别。Step 5. Use the resulting neural network to identify unknown handwriting samples.

图2、图3分别为训练损失曲线及验证损失曲线，通过两图可以看出本文所提自适应带微分梯度优化方法相较于传统的带动量的梯度下降算法具有更好的收敛性及泛化能力强。Figure 2 and Figure 3 are the training loss curve and the verification loss curve respectively. From the two figures, it can be seen that the adaptive gradient optimization method with differential gradient proposed in this paper has better convergence and generalization than the traditional gradient descent algorithm with momentum. Strong chemical ability.

图4、图5分别为训练精度曲线与验证精度曲线，通过两图可以清晰地看出基于自适应带微分梯度优化的手写体识别方法精度高且收敛速率快。Figure 4 and Figure 5 are the training accuracy curve and the verification accuracy curve respectively. From the two figures, it can be clearly seen that the handwriting recognition method based on adaptive band differential gradient optimization has high accuracy and fast convergence rate.

上述具体实施方式用来解释说明本发明，仅为本发明的优选实施例，而不是对本发明进行限制，在本发明的精神和权利要求的保护范围内，对本发明作出的任何修改、等同替换、改进等，都落入本发明的保护范围。The above-mentioned specific embodiments are used to explain the present invention, are only preferred embodiments of the present invention, rather than limit the present invention, within the spirit of the present invention and the protection scope of the claims, any modification, equivalent replacement, Improvements and the like all fall within the protection scope of the present invention.

Claims

1. Based on the adaptive handwriting recognition method with differential gradient optimization, the steps of the method are as follows:

1) Preprocessing of model parameters: offline training of multi-layer convolutional neural network models;

2) Preprocessing the scanned samples: processing the handwritten character samples and positioning, cutting or segmenting, normalizing, binarizing, smoothing, de-drying, refining and generating samples in images; based on data Set up a training sample set and a test sample set;

3) Input the sample set and extract the sample set features: find out the essential features in the samples that can effectively distinguish different types, and find the most useful information through feature extraction;

4) Network establishment and optimization: The extracted features are input into the single hidden layer BP neural network system for training, and the adaptive belt differential gradient optimization method is used to optimize;

5) using the obtained neural network to identify the unknown handwriting sample;

The adaptive band differential gradient optimization method is to introduce a differential term and an adaptive mechanism into the traditional momentum stochastic gradient descent method, specifically:

Step 1: Introduce the differential link in the gradient descent method with momentum to speed up the training of deep neural network:

In order to eliminate the unstable trend caused by the accumulation of historical gradients, the change of gradient is added, and the differential term is introduced:

Among them, D _t represents the differential term at time t, D _t-1 represents the differential term at time t-1, T _d represents the gain of the differential term, J _t represents the loss function at time t, and w _t is the network weight at time t;

By adding a differential term, advanced information is introduced into the gradient descent method, and the improved gradient descent method is updated as follows:

where r is the momentum coefficient, Δw _t is the momentum term at time t, Δw _t+1 is the momentum term at time t+1, D _t+1 is the cumulative differential term at time t+1, and D _t is the cumulative differential term at time t , w _t+1 is the network weight at time t+1, α is the learning rate;

Step 2: Optimize the learning rate of the gradient descent method, so that the learning rate can be adjusted adaptively:

Optimize the gradient descent learning rate so that it can be adjusted adaptively, namely:

Finally, get the update rule:

where η is a hyperparameter, ε prevents minima where the denominator is zero,

is the second-order moment of the gradient at time t+1,

is the first moment of the gradient at time t+1.