CN108898213A - Adaptive activation function parameter adjustment method for deep neural networks - Google Patents
Adaptive activation function parameter adjustment method for deep neural networks
- Publication number
- CN108898213A (application CN201810631395.3A)
- Authority
- CN
- China
- Prior art keywords
- activation function
- network
- adaptive
- neural network
- function
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
An adaptive activation function parameter adjustment method for deep neural networks, the method comprising the following steps: step 1, first give a mathematical definition of the adaptive activation function parameter adjustment method; step 2, compare and analyze the experimental results of the adaptive activation function and other classical activation functions on the MNIST data set, using a network with three hidden layers of 50 neurons each, trained with stochastic gradient descent for 100 epochs, with the learning rate set to 0.01 and a mini-batch size of 100; step 3, after the optimal activation function version is obtained in step 2, apply it to the detection of bladder cancer cells. As the network is trained, the present invention continuously adjusts the shape of the activation function to find the optimal activation function for that network, improving the performance of the network, reducing the overall number of learnable parameters of the adaptive activation function in the network, accelerating network learning, and improving the network's generalization.
Description
Technical Field
The invention belongs to the field of adaptive activation functions and provides an adaptive activation function parameter adjustment method for deep neural networks. Specifically, the adaptive activation function controls its own shape through added learnable parameters; these parameters are updated by the backpropagation algorithm as network training proceeds, which reduces the overall number of learnable activation-function parameters in the network.
Background
Machine learning is now widely applied in everyday life, but traditional machine learning mostly relies on shallow architectures such as the Gaussian mixture model (GMM), conditional random field (CRF) and support vector machine (SVM). These shallow architectures have limited capacity to represent complex functions, extract only relatively elementary features from the raw input signal, and their generalization ability on complex classification problems is therefore constrained, making it difficult to solve complex natural signal processing problems such as human speech and natural image recognition. Deep learning, which learns by loosely imitating the brain, has greatly advanced machine learning. Its defining characteristic is that it transforms raw data into higher-level, more abstract feature representations through a stack of simple but nonlinear models, learning a deep nonlinear network structure that can approximate complex functions and capture the essential characteristics of a data set from a small number of samples. Practice has shown that deep learning excels at discovering complex structure in high-dimensional data and is widely used in research fields such as computer vision, speech recognition and natural language processing.
With the application of deep learning in many fields, an increasing amount of research focuses on innovating and optimizing deep learning algorithms, including the optimization of classifiers and loss functions, gradient descent based on backpropagation, the initialization of network weight parameters, and the artificial neural network itself; optimizing the artificial neural network is an important part of this effort. Artificial neural networks adopt different structures and numbers of neurons depending on the task, but within these networks the same activation function, such as Sigmoid, Tanh or ReLU, is usually used everywhere. Adaptive activation functions proposed in recent years allow neurons to take on different shapes, but as the network grows and the number of neurons increases, the number of learnable parameters used to adjust those shapes grows linearly and the learning efficiency of the network drops sharply. The basic structure of an artificial neural network can thus be seen as a set of interconnected neurons, in which the activation function plays a pivotal role.
The main role of the activation function in an artificial neural network is to provide the network's nonlinear expressive power. If the neurons of a network perform only linear operations, the network can express nothing more than a simple linear mapping; even increasing its depth and width still yields a linear mapping, which cannot effectively model the nonlinearly distributed data found in real environments. Only after a nonlinear activation function is added does a deep neural network acquire the ability to learn hierarchical nonlinear mappings. The present invention improves the activation function and thereby optimizes the connections between neurons in the network, further improving network performance.
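To make the point about linearity concrete, the following sketch (our own illustration, not part of the patent) checks numerically that two stacked linear layers are equivalent to a single affine map, and that inserting a ReLU breaks this equivalence:

```python
# Stacking purely linear layers collapses to one affine map, so depth adds no
# expressive power until a nonlinear activation is inserted. Names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))                  # 5 samples, 4 features

W1, b1 = rng.normal(size=(4, 8)), rng.normal(size=8)
W2, b2 = rng.normal(size=(8, 3)), rng.normal(size=3)

two_linear_layers = (x @ W1 + b1) @ W2 + b2  # "deep" but purely linear
single_affine_map = x @ (W1 @ W2) + (b1 @ W2 + b2)
print(np.allclose(two_linear_layers, single_affine_map))   # True

with_nonlinearity = np.maximum(0.0, x @ W1 + b1) @ W2 + b2  # ReLU in between
print(np.allclose(with_nonlinearity, single_affine_map))    # False in general
```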
Summary of the Invention
In order to reduce the overall number of learnable parameters of the adaptive activation function in the network, speed up network learning, and improve the generalization ability of the network, the present invention proposes an adaptive activation function parameter adjustment method for deep neural networks: as the network is trained, the activation function continuously adjusts its own shape to find the optimal activation function for that network and thus improve its performance.
The technical solution adopted by the present invention to solve this technical problem is as follows:
An adaptive activation function parameter adjustment method for deep neural networks, the method comprising the following steps:
Step 1: first give a mathematical definition of the adaptive activation function parameter adjustment method, as follows:
Let the number of adjustable parameters of the adaptive activation function be N; the adaptive activation function is then defined as:
f(x) = f(a*x + c)
where a and c are learnable parameters that control the shape of the activation function. A neural network can be regarded as a combination of many individual neurons, and its output is defined as a composite function of the weights, the biases, and the learnable neuron parameters, as follows:
h(w, b, a, c) = h(f(a*x + c))
where h denotes the output of the neural network, and w and b denote the network's weights and biases. In this form all neurons of the network share the same set of learnable parameters; a more general definition lets every neuron use its own adjustable parameters:
h(w, b, a1,…,an, c1,…,cn) = h(fn(an*x + cn))
where fn denotes each neuron of one layer of the network. When every neuron within a layer shares the same adjustable parameters, the output is defined in the same way with a single parameter pair (al, cl) per layer l.
The backpropagation algorithm is used to train the adaptive activation function in the neural network, in which the learnable parameters are optimized together with the weights and biases as training proceeds. The parameters {a1,…,an, b1,…,bn} are updated according to the chain rule of differentiation.
Here ai ∈ {a1,…,an, b1,…,bn} and L denotes the cost function. The term ∂L/∂f(xi) is obtained from the following layer by backpropagation, and the weighted term Σi xi can be applied at every position of a feature map or network layer. For a variable shared within one layer, the gradient with respect to ai is obtained by summing over all channels, or over all neurons of the layer:
∂L/∂ai = Σi (∂L/∂f(xi)) * (∂f(xi)/∂ai)
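A minimal numerical sketch of this update rule, under the assumption that the base activation is a sigmoid and that a single (a, c) pair is shared by the whole layer (the toy loss and all names are ours, not the patent's):

```python
# Every unit applies f(a*x + c) with one shared (a, c) pair; the gradient of
# the loss w.r.t. a sums the per-unit contributions, dL/da = sum_i dL/df(x_i) * df(x_i)/da.
import numpy as np

def f(x, a, c):                      # adaptive sigmoid-shaped activation
    return 1.0 / (1.0 + np.exp(-(a * x + c)))

rng = np.random.default_rng(1)
x = rng.normal(size=(100, 10))       # pre-activations of one layer (batch x units)
a, c = 1.0, 0.0                      # shared, learnable shape parameters

y = f(x, a, c)
loss = 0.5 * np.sum(y ** 2)          # stand-in loss; dL/dy = y

dL_dy = y
dy_da = y * (1.0 - y) * x            # df/da for the logistic form above
dy_dc = y * (1.0 - y)                # df/dc
grad_a = np.sum(dL_dy * dy_da)       # sum over every unit/position sharing a
grad_c = np.sum(dL_dy * dy_dc)

lr = 0.01
a, c = a - lr * grad_a, c - lr * grad_c   # updated alongside weights and biases
```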
Step 2: compare and analyze the experimental results of the adaptive activation function and other classical activation functions on the MNIST data set, and obtain the final activation function version. The process is as follows:
The network used has three hidden layers with 50 neurons in each hidden layer; it is trained with stochastic gradient descent for 100 epochs, the learning rate is set to 0.01, and the mini-batch size is 100.
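The following PyTorch sketch reconstructs this setup under our own naming; the data loading and the adaptive activation module are only indicated, since the patent does not publish code:

```python
# Three hidden layers of 50 units, SGD, lr=0.01, mini-batches of 100, 100 epochs.
# The activation passed to build_mlp is the point of variation: nn.Sigmoid() here,
# nn.ReLU or an adaptive activation module for the other runs.
import torch
import torch.nn as nn

def build_mlp(act_factory):
    layers, width = [], 28 * 28
    for _ in range(3):                         # three hidden layers
        layers += [nn.Linear(width, 50), act_factory()]
        width = 50
    layers.append(nn.Linear(width, 10))        # ten digit classes
    return nn.Sequential(nn.Flatten(), *layers)

model = build_mlp(nn.Sigmoid)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

# Training skeleton: 100 epochs over MNIST mini-batches of 100 images.
# for epoch in range(100):
#     for images, labels in train_loader:     # DataLoader with batch_size=100
#         optimizer.zero_grad()
#         loss = criterion(model(images), labels)
#         loss.backward()
#         optimizer.step()
```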
Further, in step 2, the activation functions used for comparison are the traditional Sigmoid function, the traditional ReLU activation function, the unified version of the adaptive activation function, the individual (per-neuron) version of the adaptive activation function, and the layer version of the adaptive activation function.
Step 3: after the optimal activation function version is obtained in step 2, apply it to the detection of bladder cancer cells. The process is as follows:
3.1. Construct the bladder cancer data set;
3.2. Select a suitable algorithm and model and initialize the parameters;
3.3. Compare and analyze the experimental results of the optimal activation function and the traditional activation function.
Still further, in 3.1, the bladder cancer cell data set is converted to the pascal_voc2007 format, mainly using generated xml files to store the label information of the cells.
In 3.2, the Faster R-CNN algorithm is selected and the network parameters are initialized with a pre-trained vgg16 model.
In 3.3, the optimal activation function version obtained in step 2 is used to replace the traditional activation function in the Faster R-CNN algorithm, and finally the experimental results are analyzed and compared.
The beneficial effects of the present invention are mainly as follows: the effectiveness of the adaptive activation function parameter adjustment method is demonstrated both theoretically and experimentally; the method provides the optimal activation function for the network, avoids problems such as the vanishing-gradient (gradient dispersion) problem of traditional activation functions, and improves the fitting ability of the network.
Brief Description of the Drawings
Fig. 1 shows the convergence curves of the activation function AS of the present invention;
Fig. 2 shows the adjustment of the learnable parameters of the AS activation function of the present invention;
Fig. 3 shows the original sigmoid activation function and the final AS activation function of the present invention;
Fig. 4 shows the experimental comparison between the final AS activation function of the present invention and other activation functions;
Fig. 5 shows the Sigmoid activation function;
Fig. 6 shows the Tanh activation function;
Fig. 7 shows the ReLU activation function.
Detailed Description
The present invention is further described below with reference to the accompanying drawings.
Referring to Figs. 1 to 7, an adaptive activation function parameter adjustment method for deep neural networks comprises the following steps:
Step 1: first give a mathematical definition of the adaptive activation function parameter adjustment method, as follows:
Let the number of adjustable parameters of the adaptive activation function be N; the adaptive activation function is then defined as:
f(x) = f(a*x + c)
where a and c are learnable parameters that control the shape of the activation function. A neural network can be regarded as a combination of many individual neurons, and its output is defined as a composite function of the weights, the biases, and the learnable neuron parameters, as follows:
h(w, b, a, c) = h(f(a*x + c))
where h denotes the output of the neural network, and w and b denote the network's weights and biases. In this form all neurons of the network share the same set of learnable parameters; a more general definition lets every neuron use its own adjustable parameters:
h(w, b, a1,…,an, c1,…,cn) = h(fn(an*x + cn))
where fn denotes each neuron of one layer of the network. When every neuron within a layer shares the same adjustable parameters, the output is defined in the same way with a single parameter pair (al, cl) per layer l.
The backpropagation algorithm is used to train the adaptive activation function in the neural network, in which the learnable parameters are optimized together with the weights and biases as training proceeds. The parameters {a1,…,an, b1,…,bn} are updated according to the chain rule of differentiation. Here ai ∈ {a1,…,an, b1,…,bn} and L denotes the cost function. The term ∂L/∂f(xi) is obtained from the following layer by backpropagation, and the weighted term Σi xi can be applied at every position of a feature map or network layer. For a variable shared within one layer, the gradient with respect to ai is obtained by summing over all channels, or over all neurons of the layer:
∂L/∂ai = Σi (∂L/∂f(xi)) * (∂f(xi)/∂ai)
Step 2: compare and analyze the experimental results of the adaptive activation function and other classical activation functions on the MNIST data set. The process is as follows:
The network used has three hidden layers with 50 neurons in each hidden layer; it is trained with stochastic gradient descent for 100 epochs, the learning rate is set to 0.01, and the mini-batch size is 100.
Further, in step 2, the activation functions used for comparison are the traditional Sigmoid function, the traditional ReLU activation function, the unified version of the adaptive activation function, the individual version of the adaptive activation function, and the layer version of the adaptive activation function.
Based on the MNIST data set, the present invention adds learnable parameters to the classical sigmoid activation function to obtain an adaptive activation function, Adaptive Sigmoid (AS), and then compares the experimental results of each version of the adaptive activation function with the two classical activation functions Sigmoid and ReLU.
MNIST is a handwritten digit recognition data set, often called the fruit fly of deep learning experiments; it contains 60,000 images as training data and 10,000 images as a test set. Each grayscale image in MNIST represents one of the digits 0 to 9. Every image is 28*28 pixels, with the handwritten digit appearing in the middle of the image. The activation function AS is defined as:
f = b0*sigmoid(a0*x + a1) + b1
a0, a1, b0 and b1 are four learnable parameters; they control the shape of the function and are trained together with the network weights and biases.
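A possible PyTorch realization of the AS activation, with the four scalars registered as learnable parameters so that backpropagation updates them together with the weights (the class and argument names are ours, not the patent's):

```python
# AS activation: f = b0 * sigmoid(a0 * x + a1) + b1, with all four scalars learnable.
import torch
import torch.nn as nn

class AS(nn.Module):
    def __init__(self, a0=1.0, a1=0.0, b0=1.0, b1=0.0):
        super().__init__()
        self.a0 = nn.Parameter(torch.tensor(a0))
        self.a1 = nn.Parameter(torch.tensor(a1))
        self.b0 = nn.Parameter(torch.tensor(b0))
        self.b1 = nn.Parameter(torch.tensor(b1))

    def forward(self, x):
        return self.b0 * torch.sigmoid(self.a0 * x + self.a1) + self.b1
```

In the unified version a single instance of such a module would be shared by every neuron, so only these four parameters are added to the whole network.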
The present invention mainly adds learnable parameters to the classical sigmoid activation function to obtain the adaptive activation function AS. The mathematical definition of the function is as follows:
Let the number of adjustable parameters of the adaptive activation function be N; here N = 2 is assumed. The adaptive activation function can then be defined as:
f(x) = f(a*x + c)
where a and c are learnable parameters that control the shape of the activation function. A neural network can be regarded as a combination of many individual neurons, and its output is defined as a composite function of the weights, the biases, and the learnable neuron parameters, as follows:
h(w, b, a, c) = h(f(a*x + c))
where h denotes the output of the neural network, and w and b denote the network's weights and biases. This function can also be seen as all neurons of the network sharing the same set of learnable parameters; a more general definition lets every neuron use its own adjustable parameters:
h(w, b, a1,…,an, c1,…,cn) = h(fn(an*x + cn))
where fn denotes each neuron of one layer of the network. When every neuron within a layer shares the same adjustable parameters, the output is defined in the same way with a single parameter pair (al, cl) per layer l.
The present invention uses the backpropagation algorithm to train the adaptive activation function in the neural network, in which the learnable parameters are optimized together with the weights and biases as training proceeds. The parameters {a1,…,an, b1,…,bn} are updated according to the chain rule of differentiation.
Here ai ∈ {a1,…,an, b1,…,bn} and L denotes the cost function. The term ∂L/∂f(xi) is obtained from the following layer by backpropagation, and the weighted term Σi xi can be applied at every position of a feature map or network layer. For a variable shared within one layer, the gradient with respect to ai is obtained by summing over all channels, or over all neurons of the layer:
∂L/∂ai = Σi (∂L/∂f(xi)) * (∂f(xi)/∂ai)
In step 3, the adaptive activation function method is applied to deep learning: the optimal activation function obtained in step 2 is applied to the detection of bladder cancer cells. The process is as follows:
3.1. Construct the data set. The bladder cancer cell data set is converted to the pascal_voc2007 format, mainly using generated xml files to store the label information of the cells.
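As an illustration of this format, the sketch below writes one VOC2007-style xml annotation per image; the file names, the class label "cancer_cell" and the box coordinates are invented for the example:

```python
# Write a minimal Pascal VOC style annotation file with one bounding box per labelled cell.
import xml.etree.ElementTree as ET

def write_voc_annotation(path, image_name, width, height, boxes):
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = image_name
    size = ET.SubElement(root, "size")
    for tag, value in (("width", width), ("height", height), ("depth", 3)):
        ET.SubElement(size, tag).text = str(value)
    for name, (xmin, ymin, xmax, ymax) in boxes:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = name
        bnd = ET.SubElement(obj, "bndbox")
        for tag, value in (("xmin", xmin), ("ymin", ymin), ("xmax", xmax), ("ymax", ymax)):
            ET.SubElement(bnd, tag).text = str(value)
    ET.ElementTree(root).write(path)

write_voc_annotation("000001.xml", "000001.jpg", 512, 512,
                     [("cancer_cell", (120, 88, 190, 160))])
```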
3.2. Select a suitable algorithm and model and initialize the parameters. The present invention selects the Faster R-CNN algorithm and initializes the network parameters with the vgg16 model; a pre-trained vgg16 model is used for the initialization in order to shorten training time and reduce the risk of under- or over-fitting.
3.3. Compare and analyze the experimental results of the optimal activation function and the traditional activation function. The optimal activation function version generated in step 2 is used to replace the traditional activation function in the Faster R-CNN algorithm, and finally the experimental results are analyzed and compared.
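One way to perform the replacement is to walk the module tree of the detector's backbone and swap every traditional activation for the trained one. The sketch below shows the pattern on a stand-in backbone rather than any particular Faster R-CNN implementation; the frozen parameter values are the trained ones quoted later in connection with Fig. 2, and all names are ours:

```python
# Recursively replace nn.ReLU modules with the trained activation.
import torch
import torch.nn as nn

class TrainedAS(nn.Module):
    # Trained AS activation with the reported parameter values kept fixed.
    def forward(self, x):
        return 5.89 * torch.sigmoid(3.87 * x + 0.07) - 0.51

def replace_relu(module, make_act):
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, make_act())
        else:
            replace_relu(child, make_act)

backbone = nn.Sequential(                      # stand-in for a VGG16-style feature extractor
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
)
replace_relu(backbone, TrainedAS)
```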
Finally, in the method proposed by the present invention the same adjustable activation function is used throughout the network, so no matter how many neurons the network contains, the total number of added parameters equals the number of learnable parameters of the adaptive activation function (the parameters that control the shape of the function). Using the same adjustable activation function across the whole network, like repeatedly composing a composite function, strengthens the nonlinearity of the network, improves its fitting ability, and speeds up its learning.
The present invention is described in detail below with reference to the accompanying drawings.
As shown in Fig. 1, the network used has three hidden layers with 50 neurons each and is trained with stochastic gradient descent for 100 epochs; the learning rate is set to 0.01 and the mini-batch size is 100. The activation functions used for comparison are the traditional Sigmoid function, the traditional ReLU activation function, the unified version of the adaptive activation function, the individual version of the adaptive activation function, and the layer version of the adaptive activation function. Fig. 1 shows the convergence curves of ReLU, Sigmoid, and the three versions based on the adaptive activation function AS. "relu_train" denotes the classification error rate on the training set with the ReLU activation function, and "relu_test" the classification error rate on the test set with the ReLU activation function. "AUsigmoid" denotes the unified version (Unified Version, UV) of the adaptive activation function AS, in which every neuron uses the same activation function. "ALsigmoid" denotes the individual version (Individual Version, IV), in which each neuron uses the activation function it individually learns to be best. "AIsigmoid" denotes the layer version (Layer Version, LV), in which all neurons within a layer use the same activation function, but the activation functions of different layers need not be the same.
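For the 3x50 network used here, the three versions differ only in how many shape parameters they add to the network, which the following back-of-the-envelope sketch makes explicit, using the four AS parameters per activation:

```python
# Count the shape parameters added by each parameter-sharing scheme.
hidden_layers, units_per_layer, params_per_activation = 3, 50, 4

unified  = params_per_activation                                    # UV: one shared set
layered  = hidden_layers * params_per_activation                    # LV: one set per layer
per_unit = hidden_layers * units_per_layer * params_per_activation  # IV: one set per neuron

print(unified, layered, per_unit)   # 4, 12, 600
```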
The expression of the traditional Sigmoid activation function is:
sigmoid(x) = 1 / (1 + e^(-x))
See Fig. 5 for the Sigmoid activation function.
The Sigmoid activation function was long a common choice because it has a good interpretation in terms of a neuron's firing rate: from not firing at all (0) to fully saturated firing at the maximum (1). The Sigmoid function is now rarely used, however, and one important reason is that its saturation makes the gradient vanish. Sigmoid neurons have the undesirable property of saturating when their activation approaches 0 or 1; in these regions the gradient of the function is almost 0. During backpropagation this (local) gradient is multiplied by the gradient of the overall loss with respect to the unit's output, so the product is also close to zero. This effectively kills the gradient, and almost no signal flows through the neuron to its weights and on to the data, which ultimately causes the vanishing-gradient (gradient dispersion) problem.
Another classical activation function is the Tanh nonlinearity, whose expression is:
tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
See Fig. 6 for the Tanh activation function.
As the figure shows, compared with the Sigmoid function, Tanh squashes a real value into the interval [-1, 1] and, like Sigmoid, suffers from saturation; unlike a Sigmoid neuron, however, its output is zero-centered. In practice the Tanh nonlinearity is preferred over the Sigmoid nonlinearity; a Tanh neuron can be regarded as a simply rescaled Sigmoid neuron.
Compared with the two classical activation functions above, ReLU is the activation function most widely used today. Its mathematical formula is:
f(x) = max(0, x)
See Fig. 7 for the ReLU activation function.
Compared with the Sigmoid and Tanh functions, ReLU greatly accelerates the convergence of gradient descent, which results from its linear, non-saturating form. The ReLU activation function does not suffer from gradient saturation when the input is positive; when the input is negative, ReLU is not activated at all, which means that once the input becomes negative the unit is effectively paralysed. For example, when a large gradient flows backward through a ReLU neuron, the update may push the neuron into a state in which it can no longer be activated by any other input. If this happens, the gradient flowing backward through this neuron will be 0 from then on; in other words, the ReLU unit is irreversibly paralysed during training, which causes a loss of data diversity.
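The two failure modes, saturation of the sigmoid and the zero gradient of ReLU for negative inputs, can be read directly off the analytic derivatives; a small numpy check (our own illustration):

```python
# Gradients of sigmoid and ReLU at a few representative inputs.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])

sigmoid_grad = sigmoid(x) * (1.0 - sigmoid(x))
print(sigmoid_grad)        # ~4.5e-05 at |x| = 10: the saturated, vanishing gradient

relu_grad = (x > 0).astype(float)
print(relu_grad)           # exactly 0 for every negative input: a "dead" unit
```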
As Fig. 1 shows, on the MNIST training set the unified version of the activation function AS achieves a lower classification error rate than the ReLU activation function, and compared with the original Sigmoid activation function the network has a stronger fitting ability.
Fig. 2 shows the parameter adjustment process of the unified version of the adaptive activation function. The learnable parameters of the adaptive activation function are initialized as a0 = 1.0, a1 = 0.0, b0 = 1.0, b1 = 0.0; after the training iterations they converge to a0 = 3.87, a1 = 0.07, b0 = 5.89, b1 = -0.51 and change little thereafter.
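Note that with the quoted initial values the AS function coincides exactly with the ordinary sigmoid, so training starts from the classical activation and gradually deforms it; a quick check (the function name is ours):

```python
# At a0=1, a1=0, b0=1, b1=0 the AS function reduces to the plain sigmoid.
import numpy as np

def as_fn(x, a0, a1, b0, b1):
    return b0 / (1.0 + np.exp(-(a0 * x + a1))) + b1

x = np.linspace(-6, 6, 101)
print(np.allclose(as_fn(x, 1.0, 0.0, 1.0, 0.0), 1.0 / (1.0 + np.exp(-x))))  # True
print(as_fn(0.0, 3.87, 0.07, 5.89, -0.51))   # the trained shape evaluated at x = 0
```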
As shown in Fig. 3, the final unified version of the adaptive activation function has a larger value range than the traditional Sigmoid activation function, which largely resolves the vanishing-gradient problem of the traditional Sigmoid activation function, so the classification accuracy increases.
As shown in Fig. 4, the present invention compares the experimental results of the final adaptive activation function version (RAS) with several other activation functions. The final adaptive activation function is:
f = 5.89*sigmoid(3.87*x + 0.07) - 0.51
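Evaluating this function over a wide input range (a small sketch of ours) shows how the learned parameters stretch the output interval from sigmoid's (0, 1) to roughly (-0.51, 5.38), which is the enlarged value range referred to above in connection with Fig. 3:

```python
# Compare the output ranges of the plain sigmoid and the trained RAS function.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ras(x):
    return 5.89 * sigmoid(3.87 * x + 0.07) - 0.51

x = np.linspace(-10, 10, 2001)
print(sigmoid(x).min(), sigmoid(x).max())   # ~0.000045, ~0.999955
print(ras(x).min(), ras(x).max())           # ~-0.51, ~5.38
```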
The comparison of the experimental results shows that the unified adaptive activation function achieves the best results. In the bladder cancer cell detection experiment, both the detection results and the speed obtained with the unified adaptive activation function are better than with the traditional activation function, which further demonstrates that every network can train its own most suitable activation function.
Claims (5)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810631395.3A CN108898213B (en) | 2018-06-19 | 2018-06-19 | Adaptive activation function parameter adjusting method for deep neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810631395.3A CN108898213B (en) | 2018-06-19 | 2018-06-19 | Adaptive activation function parameter adjusting method for deep neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108898213A true CN108898213A (en) | 2018-11-27 |
CN108898213B CN108898213B (en) | 2021-12-17 |
Family
ID=64345490
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810631395.3A Active CN108898213B (en) | 2018-06-19 | 2018-06-19 | Adaptive activation function parameter adjusting method for deep neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108898213B (en) |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109934222A (en) * | 2019-03-01 | 2019-06-25 | 长沙理工大学 | A method for identifying the self-explosion of insulator strings based on transfer learning |
CN110084380A (en) * | 2019-05-10 | 2019-08-02 | 深圳市网心科技有限公司 | A kind of repetitive exercise method, equipment, system and medium |
CN110222173A (en) * | 2019-05-16 | 2019-09-10 | 吉林大学 | Short text sensibility classification method and device neural network based |
CN110443296A (en) * | 2019-07-30 | 2019-11-12 | 西北工业大学 | Data adaptive activation primitive learning method towards classification hyperspectral imagery |
CN110570048A (en) * | 2019-09-19 | 2019-12-13 | 深圳市物语智联科技有限公司 | user demand prediction method based on improved online deep learning |
CN111753760A (en) * | 2020-06-28 | 2020-10-09 | 北京百度网讯科技有限公司 | Model generation method, device, electronic device and storage medium |
CN111860460A (en) * | 2020-08-05 | 2020-10-30 | 江苏新安电器股份有限公司 | Application method of improved LSTM model in human behavior recognition |
US20210209473A1 (en) * | 2021-03-25 | 2021-07-08 | Intel Corporation | Generalized Activations Function for Machine Learning |
CN114708460A (en) * | 2022-04-12 | 2022-07-05 | 济南博观智能科技有限公司 | Image classification method, system, electronic equipment and storage medium |
CN115204352A (en) * | 2021-04-12 | 2022-10-18 | 洼田望 | Information processing apparatus, information processing method, and storage medium |
WO2023092938A1 (en) * | 2021-11-24 | 2023-06-01 | 苏州浪潮智能科技有限公司 | Image recognition method and apparatus, and device and medium |
US12334056B2 (en) | 2021-05-14 | 2025-06-17 | Samsung Electronics Co., Ltd. | Method and apparatus for conditioning neural networks |
- 2018-06-19: CN CN201810631395.3A patent/CN108898213B/en, status: Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5113483A (en) * | 1990-06-15 | 1992-05-12 | Microelectronics And Computer Technology Corporation | Neural network with semi-localized non-linear mapping of the input space |
CN104951836A (en) * | 2014-03-25 | 2015-09-30 | 上海市玻森数据科技有限公司 | Posting predication system based on nerual network technique |
CN105654136A (en) * | 2015-12-31 | 2016-06-08 | 中国科学院电子学研究所 | Deep learning based automatic target identification method for large-scale remote sensing images |
CN105891215A (en) * | 2016-03-31 | 2016-08-24 | 浙江工业大学 | Welding visual detection method and device based on convolutional neural network |
CN107122825A (en) * | 2017-03-09 | 2017-09-01 | 华南理工大学 | A kind of activation primitive generation method of neural network model |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109934222A (en) * | 2019-03-01 | 2019-06-25 | 长沙理工大学 | A method for identifying the self-explosion of insulator strings based on transfer learning |
CN110084380A (en) * | 2019-05-10 | 2019-08-02 | 深圳市网心科技有限公司 | A kind of repetitive exercise method, equipment, system and medium |
CN110222173A (en) * | 2019-05-16 | 2019-09-10 | 吉林大学 | Short text sensibility classification method and device neural network based |
CN110222173B (en) * | 2019-05-16 | 2022-11-04 | 吉林大学 | Short text sentiment classification method and device based on neural network |
CN110443296B (en) * | 2019-07-30 | 2022-05-06 | 西北工业大学 | Hyperspectral image classification-oriented data adaptive activation function learning method |
CN110443296A (en) * | 2019-07-30 | 2019-11-12 | 西北工业大学 | Data adaptive activation primitive learning method towards classification hyperspectral imagery |
CN110570048A (en) * | 2019-09-19 | 2019-12-13 | 深圳市物语智联科技有限公司 | user demand prediction method based on improved online deep learning |
CN111753760A (en) * | 2020-06-28 | 2020-10-09 | 北京百度网讯科技有限公司 | Model generation method, device, electronic device and storage medium |
CN111860460A (en) * | 2020-08-05 | 2020-10-30 | 江苏新安电器股份有限公司 | Application method of improved LSTM model in human behavior recognition |
US20210209473A1 (en) * | 2021-03-25 | 2021-07-08 | Intel Corporation | Generalized Activations Function for Machine Learning |
CN115204352A (en) * | 2021-04-12 | 2022-10-18 | 洼田望 | Information processing apparatus, information processing method, and storage medium |
CN115204352B (en) * | 2021-04-12 | 2024-03-12 | 洼田望 | Information processing apparatus, information processing method, and storage medium |
US12334056B2 (en) | 2021-05-14 | 2025-06-17 | Samsung Electronics Co., Ltd. | Method and apparatus for conditioning neural networks |
WO2023092938A1 (en) * | 2021-11-24 | 2023-06-01 | 苏州浪潮智能科技有限公司 | Image recognition method and apparatus, and device and medium |
US12112531B2 (en) | 2021-11-24 | 2024-10-08 | Suzhou Metabrain Intelligent Technology Co., Ltd. | Image recognition method and apparatus, and device and medium |
CN114708460A (en) * | 2022-04-12 | 2022-07-05 | 济南博观智能科技有限公司 | Image classification method, system, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN108898213B (en) | 2021-12-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108898213A (en) | A kind of adaptive activation primitive parameter adjusting method towards deep neural network | |
CN111858989B (en) | An image classification method based on attention mechanism and spiking convolutional neural network | |
Song et al. | Deep learning for real-time robust facial expression recognition on a smartphone | |
Klabjan et al. | Activation ensembles for deep neural networks | |
CN103620624B (en) | Method and apparatus for locally competitive learning rules leading to sparse connectivity | |
CN112528830B (en) | A lightweight CNN mask face pose classification method combined with transfer learning | |
CN108304826A (en) | Facial expression recognizing method based on convolutional neural networks | |
CN113392210A (en) | Text classification method and device, electronic equipment and storage medium | |
CN110580450A (en) | A traffic sign recognition method based on convolutional neural network | |
CN106407986A (en) | Synthetic aperture radar image target identification method based on depth model | |
CN104778448A (en) | Structure adaptive CNN (Convolutional Neural Network)-based face recognition method | |
US20220036150A1 (en) | System and method for synthesis of compact and accurate neural networks (scann) | |
CN113408610B (en) | Image identification method based on adaptive matrix iteration extreme learning machine | |
Hu et al. | A dynamic rectified linear activation units | |
CN110766138A (en) | Method and system for constructing self-adaptive neural network model based on brain development mechanism | |
CN114741507B (en) | Establishment and classification of citation network classification model based on Transformer graph convolutional network | |
Chen et al. | CNN-based broad learning with efficient incremental reconstruction model for facial emotion recognition | |
CN108009635A (en) | A kind of depth convolutional calculation model for supporting incremental update | |
Roudi et al. | Learning with hidden variables | |
CN105913117A (en) | Intelligent related neural network computer identification method | |
Hong et al. | In-memory computing circuit implementation of complex-valued hopfield neural network for efficient portrait restoration | |
Scabini et al. | Improving deep neural network random initialization through neuronal rewiring | |
CN110751091A (en) | Convolutional neural network model for static image behavior recognition | |
Hu et al. | The principle and application of deep learning algorithm | |
Bai et al. | Research and implementation of handwritten numbers recognition system based on neural network and Tensorflow framework |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | ||
TR01 | Transfer of patent right |
Effective date of registration: 2024-09-24
Address after: Room 5-5-17, Building 3, No. 677, 685, 687, Changxing Road, Jiangbei District, Ningbo, Zhejiang 315000
Patentee after: Gaodean (Zhejiang) Information Technology Co., Ltd., China
Address before: Research Institute of Zhejiang University of Technology, No. 18 Chaowang Road, Xiacheng District, Hangzhou City, Zhejiang Province, 310014
Patentee before: JIANG University OF TECHNOLOGY, China