Disclosure of Invention
(I) Technical problem to be solved
The present invention is directed to an apparatus and method for artificial neural network reverse training supporting adaptive learning rate, which solves at least one of the above technical problems of the prior art.
(II) Technical solution
According to an aspect of the present invention, there is provided an artificial neural network reverse training apparatus including a controller unit, a storage unit, a learning rate adjustment unit, and an arithmetic unit, wherein,
the storage unit is used for storing neural network data, including instructions, weights, derivatives of activation functions, learning rates, gradient vectors and learning rate adjustment data;
the controller unit is used for reading the instruction from the storage unit and decoding the instruction into a microinstruction for controlling the behaviors of the storage unit, the learning rate adjusting unit and the arithmetic unit;
the learning rate adjusting unit is used for calculating, before each generation of training, the learning rate used for the current generation of training from the learning rate of the previous generation and the learning rate adjustment data;
and the operation unit is used for calculating the weight of the current generation according to the gradient vector, the learning rate of the current generation, the derivative of the activation function and the weight of the previous generation.
Further, the operation unit includes a master operation unit, an interconnection unit, and a plurality of slave operation units, and the gradient vector includes an input gradient vector and an output gradient vector, wherein: the master operation unit is used for completing subsequent calculation with the output gradient vector of each layer during the calculation of that layer; the interconnection unit is used for transmitting the input gradient vector of the layer to all the slave operation units at the stage when reverse training of each layer of the neural network starts, and, after the slave operation units finish their calculation, adding the partial sums of the output gradient vector from all the slave operation units pairwise, stage by stage, to obtain the output gradient vector of the layer; and the plurality of slave operation units calculate the corresponding partial sums of the output gradient vector in parallel using the same input gradient vector and their respective weight data.
Further, the storage unit is an on-chip cache.
Further, the instruction is a SIMD instruction.
Further, the learning rate adjustment data includes a weight variation and an error function.
According to another aspect of the present invention, there is provided an artificial neural network reverse training method, including the steps of:
s1: before each generation of training begins, the learning rate used for the training of the current generation is calculated according to the learning rate of the previous generation and the learning rate adjustment data;
s2: the training is started, and the weight values are updated layer by layer according to the learning rate of the training of the present generation;
s3: after all the weight values are updated, calculating learning rate adjustment data of the present generation network, and storing the learning rate adjustment data;
s4: and judging whether the neural network converges, if so, finishing the operation, and otherwise, turning to the step S1.
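The steps S1-S4 above can be sketched as a toy software loop. The one-weight linear model, the learning rate clamp, and the use of adjustment rule (3) below are illustrative assumptions, not the patented apparatus:

```python
def train_adaptive(xs, ys, eta=0.05, generations=200):
    """Toy sketch of steps S1-S4 on a one-weight linear model y = w * x.

    Illustrative only: the model, the eta clamp [1e-6, 0.1], and the use of
    the multiplicative rule eta *= (1 - dE) (formula (3)) are assumptions."""
    w = 0.0
    prev_err = None
    for _ in range(generations):
        err = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
        # S1: before this generation's training, adjust the learning rate
        # from the previous generation's rate and the stored error change.
        if prev_err is not None:
            eta *= (1.0 - (err - prev_err))
            eta = min(max(eta, 1e-6), 0.1)  # keep eta in a sane range
        # S2: update the weight with this generation's learning rate.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= eta * grad
        # S3: store this generation's learning rate adjustment data.
        prev_err = err
        # S4: stop once the network has converged.
        if err < 1e-8:
            break
    return w

w = train_adaptive([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(round(w, 3))  # converges close to the true weight 2.0
```

Because the error falls on every generation, the rule keeps pushing the learning rate up against the clamp, which is exactly the behavior the adjustment data is meant to produce.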
Further, step S2 includes:
s21: for each layer of the network, carrying out weighted summation on input gradient vectors to calculate output gradient vectors of the layer, wherein the weight of the weighted summation is the weight to be updated of the layer;
s22: multiplying the output gradient vector of the current layer by the derivative value of the activation function of the next layer during forward operation to obtain the input gradient vector of the next layer;
s23: multiplying the input gradient vector element-wise by the input neurons of the forward operation to obtain the gradient of the weights of the layer;
s24: updating the weight of the layer according to the gradient and the learning rate of the obtained weight of the layer;
s25: judging whether all layers are updated, if so, entering step S3; otherwise, go to step S21.
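As a rough software sketch of sub-steps S21-S24 for one fully connected layer (the names, data layout, and the placement of the activation-derivative multiply are illustrative assumptions, not the patented hardware):

```python
def backprop_layer(w, in_grad, act_in, act_deriv, eta):
    """Sketch of sub-steps S21-S24 for one fully connected layer.

    w[i][j] is the weight from input neuron j to output neuron i; in_grad is
    the gradient vector arriving at this layer; act_in holds the layer's
    forward-pass input neurons; act_deriv holds activation-function
    derivative values recorded during the forward pass. All illustrative."""
    n_out, n_in = len(w), len(w[0])
    # S21: weighted sum of the input gradient vector (the weights of the
    # sum are the layer's weights) gives the layer's output gradient vector.
    out_grad = [sum(w[i][j] * in_grad[i] for i in range(n_out))
                for j in range(n_in)]
    # S22: multiply by the activation-function derivative to obtain the
    # input gradient vector handed to the next layer of the backward pass.
    next_in_grad = [g * d for g, d in zip(out_grad, act_deriv)]
    # S23: element-wise product with the forward-pass input neurons gives
    # the gradient of this layer's weights.
    w_grad = [[in_grad[i] * act_in[j] for j in range(n_in)]
              for i in range(n_out)]
    # S24: update the layer's weights with this generation's learning rate.
    new_w = [[w[i][j] - eta * w_grad[i][j] for j in range(n_in)]
             for i in range(n_out)]
    return new_w, next_in_grad
```

A single call updates one layer; sub-step S25 would simply repeat this over all layers.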
Further, in the training of the current generation, the weights may adopt non-uniform (individual) learning rates.
Further, in the training of the current generation, the weights may adopt a uniform learning rate.
(III) Advantageous effects
(1) By providing the learning rate adjusting unit and training the network with an adaptive learning rate, the weight change produced in each generation of training is determined more appropriately, so that the training iteration process is more stable, the time required to train the neural network to stability is reduced, and training efficiency is improved;
(2) by adopting a dedicated on-chip cache for the multilayer artificial neural network operation algorithm, the reusability of input neurons and weight data is fully exploited, repeated reads of the data from memory are avoided, memory access bandwidth is reduced, and memory bandwidth is prevented from becoming a performance bottleneck of multilayer artificial neural network operations and their training algorithms.
(3) By adopting dedicated SIMD instructions and a customized operation unit for multilayer artificial neural network operations, the problems of insufficient operation performance and high front-end decoding overhead of CPUs and GPUs are addressed, and support for the multilayer artificial neural network operation algorithm is effectively improved.
Detailed Description
The traditional training method adopted for artificial neural networks is the back propagation algorithm, in which the weight change between two generations is the gradient of an error function with respect to the weight multiplied by a constant, called the learning rate. The learning rate determines the amount of weight change produced in each round of training. If this value is too small, the effective weight update in each iteration is too small, training takes longer, and convergence is quite slow; if it is too large, the iterative process may oscillate or even diverge. The artificial neural network reverse training apparatus is provided with a learning rate adjusting unit which, before each generation of training, calculates the learning rate for the current generation from the learning rate of the previous generation and the learning rate adjustment data. The weight change produced in each generation of training is thus determined more appropriately, so that the training iteration process is more stable, the time required to train the neural network to stability is shortened, and training efficiency is improved.
In order that the objects, technical solutions and advantages of the present invention will become more apparent, the present invention will be further described in detail with reference to the accompanying drawings in conjunction with the following specific embodiments.
Fig. 1 is a block diagram illustrating an example of an overall structure of an artificial neural network reverse training apparatus according to an embodiment of the present invention. The embodiment of the invention provides a device for artificial neural network reverse training supporting adaptive learning rate, which comprises:
the storage unit A is used for storing neural network data, including instructions, weights, derivatives of activation functions, learning rates, gradient vectors (which may include input gradient vectors and output gradient vectors) and learning rate adjustment data (which may include network error values, weight variations and the like); the storage unit can be an on-chip cache, which avoids repeatedly reading the data from memory and prevents memory bandwidth from becoming a performance bottleneck of multilayer artificial neural network operations and their training algorithms.
The controller unit B is used for reading the instruction from the storage unit A and decoding the instruction into a microinstruction for controlling the behaviors of the storage unit, the learning rate adjusting unit and the arithmetic unit;
the instructions accessed and read by the storage unit A and the controller unit B can be SIMD instructions, and the problems of insufficient operation performance and high front-end decoding overhead of the existing CPU and GPU are solved by adopting the special SIMD instruction aiming at the operation of the multilayer artificial neural network.
The learning rate adjusting unit E calculates, before each generation of training, the learning rate used for the current generation from the learning rate of the previous generation and the learning rate adjustment data;
and the operation units (D, C, F) calculate the current generation weight according to the gradient vector, the current generation learning rate, the derivative of the activation function and the previous generation weight.
The storage unit A is used for storing neural network data, including instructions, neuron inputs, weights, neuron outputs, learning rates, weight variations, activation function derivatives, gradient vectors of each layer, and the like;
for the controller unit B, reading an instruction from the storage unit A and decoding the instruction into a microinstruction for controlling the behavior of each unit;
as for the operation unit, it may include a master operation unit C, an interconnection unit D, and a plurality of slave operation units F.
The interconnection unit D is used for connecting the master operation module and the slave operation module, and may be implemented in different interconnection topologies (e.g., a tree structure, a ring structure, a grid structure, a hierarchical interconnection, a bus structure, etc.).
The interconnection unit D is used for transmitting the input gradient vector of the layer to all the slave operation units F at the stage when reverse training of each layer of the neural network starts, and, after the slave operation units F finish their calculation, adding the partial sums of the output gradient vector from the slave operation units F pairwise, stage by stage, to obtain the output gradient vector of the layer.
The main operation unit C is used for completing subsequent calculation by utilizing the output gradient vector of the layer in the calculation process of each layer;
a plurality of slave operation units F, which utilize the same input gradient vector and respective weight data to calculate the corresponding output gradient vector partial sum in parallel;
The learning rate adjusting unit E is configured to calculate the learning rate used for the current generation of training from information such as the learning rate, weights, network error value, and weight variation of the previous generation (this information is stored in advance in the storage unit and can be retrieved).
Fig. 2 schematically shows an embodiment of the interconnection unit D: a tree-type interconnection structure. The interconnection unit D constitutes the data path between the master operation unit C and the plurality of slave operation units F and has a binary-tree structure. The interconnect comprises a plurality of nodes forming a binary tree path, i.e., each node has one parent node and two child nodes. Each node sends the data received from upstream identically to its two downstream child nodes, and merges the data returned by its two downstream child nodes and returns the result to its upstream parent node.
For example, during the reverse operation of the neural network, the vectors returned by the two downstream nodes are summed into one vector at the current node and returned to the upstream node. At the stage when each layer of the artificial neural network starts its calculation, the input gradient in the master operation unit C is sent to each slave operation unit F through the interconnection unit D; after the slave operation units F finish their calculation, the partial sums of the output gradient vector produced by the slave operation units F are added pairwise, stage by stage, in the interconnection unit D; that is, all the partial sums are summed to give the final output gradient vector.
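A minimal software sketch of this pairwise tree summation (the list-of-lists representation of the slave units' partial sums is an assumption made for illustration):

```python
def tree_reduce_partial_sums(partials):
    """Pairwise (binary-tree) summation of the slave units' partial
    output-gradient vectors, merged level by level as the interconnection
    unit D would do. Purely illustrative of the dataflow."""
    level = [list(p) for p in partials]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            # each internal node adds the vectors from its two children
            nxt.append([a + b for a, b in zip(level[i], level[i + 1])])
        if len(level) % 2:
            nxt.append(level[-1])  # an unpaired node passes through as-is
        level = nxt
    return level[0]

# four slave units, each holding a 2-element partial sum
print(tree_reduce_partial_sums([[1, 2], [3, 4], [5, 6], [7, 8]]))  # [16, 20]
```

Each while-loop iteration corresponds to one level of the binary tree, so the merge takes a number of stages logarithmic in the number of slave units.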
The learning rate adjusting unit E performs different calculations on the data depending on the adaptive learning rate adjustment method adopted.
First, in the standard back propagation algorithm:
w(k+1)=w(k)-ηg(w(k)) (1)
In formula (1), w(k) is the weight of the current training generation, i.e., the weight of the current generation, w(k+1) is the weight of the next generation, η is the fixed learning rate, a predetermined constant, and g(w) is the gradient vector.
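Formula (1) can be written as a one-line update; the quadratic error used in the example below is an illustrative assumption:

```python
def sgd_update(w, grad, eta=0.01):
    # Formula (1): w(k+1) = w(k) - eta * g(w(k)), with eta a fixed constant.
    return [wi - eta * gi for wi, gi in zip(w, grad)]

# For the error E(w) = 0.5 * sum(w_i^2), the gradient g(w) is w itself.
w = [1.0, -2.0]
w_next = sgd_update(w, w, eta=0.1)
print(w_next)  # each weight moves a fixed fraction toward the minimum
```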
Here, the learning rate is allowed to be updated from generation to generation, just like the other network parameters. The rule for adjusting the learning rate is: when the training error increases, reduce the learning rate; when the training error decreases, increase the learning rate. Several specific examples of adaptive learning rate adjustment rules are given below, though the rules are not limited to these examples.
Method one:
In formula (2), η(k) is the learning rate of the current generation, η(k+1) is the learning rate of the next generation, ΔE = E(k) - E(k-1) is the change in the error function E, a > 0, b > 0, and a, b are appropriate constants.
Method two:
η(k+1)=η(k)(1-ΔE) (3)
In formula (3), η(k) is the learning rate of the current generation, η(k+1) is the learning rate of the next generation, and ΔE = E(k) - E(k-1) is the change in the error function E.
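Rule (3) is simple enough to state directly in code (a sketch; the sample error values are made up for illustration):

```python
def adjust_lr_method2(eta_k, e_k, e_prev):
    # Formula (3): eta(k+1) = eta(k) * (1 - dE), with dE = E(k) - E(k-1).
    # If the error fell (dE < 0) the factor exceeds 1 and eta grows;
    # if the error rose (dE > 0) the factor is below 1 and eta shrinks.
    return eta_k * (1.0 - (e_k - e_prev))

print(adjust_lr_method2(0.1, e_k=0.40, e_prev=0.50))  # error fell, eta grows
print(adjust_lr_method2(0.1, e_k=0.60, e_prev=0.50))  # error rose, eta shrinks
```

In practice the factor would usually be clamped, since a large drop in error would otherwise inflate η abruptly.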
Method three:
In formula (4), η(k) is the learning rate of the current generation, η(k+1) is the learning rate of the next generation, ΔE = E(k) - E(k-1) is the change in the error function E, a > 1, 0 < b < 1, c > 0, and a, b, c are appropriate constants.
Method four:
In formula (5), η(k) is the learning rate of the current generation, η(k+1) is the learning rate of the next generation, ΔE = E(k) - E(k-1) is the change in the error function E, 0 < a < 1, b > 1, 0 < α < 1, and a, b, α are appropriate constants.
the learning rate η in the above four methods can be common to all weights, that is, the same learning rate is used for each layer of weights during each generation of training, and we remember this method as a uniform adaptive learning rate training method; or not universal, that is, different learning rates are adopted for each weight, and we remember this method as a respective adaptive learning rate training method. The training precision can be further improved and the training time can be reduced by the respective adaptive learning rate training method.
For a clearer comparison, schematic diagrams of the two methods are given: the uniform adaptive learning rate training method and the respective adaptive learning rate training method correspond to fig. 3 and fig. 4, respectively.
In FIG. 3, the connection weights w_jp1, w_jp2, …, w_jpn between the output layer P and the hidden layer J are all adjusted with the same learning rate η during reverse adjustment; in FIG. 4, the connection weights w_jp1, w_jp2, …, w_jpn between the output layer P and the hidden layer J are adjusted with the learning rates η_1, η_2, …, η_n during reverse adjustment. Adjusting different nodes differently in the reverse pass maximizes the adaptive capacity of the learning rate and best accommodates the changing needs of the weights during learning.
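The difference between the two schemes reduces to whether one η or a vector of η values is applied per weight (an illustrative sketch with made-up numbers):

```python
def update_uniform(ws, grads, eta):
    # uniform adaptive learning rate: one eta shared by all weights
    return [w - eta * g for w, g in zip(ws, grads)]

def update_individual(ws, grads, etas):
    # respective adaptive learning rate: weight w_jpi has its own eta_i
    return [w - e * g for w, e, g in zip(ws, etas, grads)]

print(update_uniform([1.0, 1.0], [0.5, 2.0], 0.1))
print(update_individual([1.0, 1.0], [0.5, 2.0], [0.2, 0.05]))
```

With individual rates, a weight seeing a small gradient can be given a larger step than one seeing a large gradient, which is the flexibility the respective method provides.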
For the respective adaptive learning rate adjustment method, after the initial values of the individual learning rates are obtained, each learning rate can still be updated iteratively according to methods one to four, though the updating is not limited to these four methods. The learning rate η in those formulas is then the individual learning rate corresponding to each weight.
Based on the same inventive concept, the invention also provides an artificial neural network reverse training method, wherein an operation flow chart is shown in fig. 5, and the method comprises the following steps:
s1: before each generation of training begins, the learning rate used for the training of the current generation is calculated according to the learning rate of the previous generation and the learning rate adjustment data;
s2: the training is started, and the weight values are updated layer by layer according to the learning rate of the training of the present generation;
s3: after all the weight values are updated, calculating learning rate adjustment data of the present generation network, and storing the learning rate adjustment data;
s4: judging whether the neural network converges, if so, finishing the operation, and otherwise, turning to step S1.
For step S1, before each generation of training starts, the learning rate adjustment unit E calls the learning rate adjustment data in the storage unit a to adjust the learning rate, resulting in the learning rate used for the training of the present generation.
For step S2: training of the current generation starts, and the weight values are updated layer by layer according to the learning rate of this generation's training. Step S2 may include the following sub-steps (see fig. 6):
step S21, for each layer, first, performing weighted summation on the input gradient vector to calculate the output gradient vector of the layer, where the weight of the weighted summation is the weight to be updated of the layer, and the process is completed by the master operation unit C, the interconnection unit D, and each slave operation unit F together;
step S22, in the main operation unit C, the output gradient vector is multiplied by the derivative value of the activation function of the next layer during forward operation to obtain the input gradient vector of the next layer;
step S23, in the main operation unit C, multiplying the input gradient vector element-wise by the input neurons of the forward operation to obtain the gradient of the weights of the layer;
step S24, finally, in the main operation unit C, updating the weight of the layer according to the obtained gradient and learning rate of the weight of the layer;
step S25: and judging whether the weight values of all the layers are updated, if so, performing the step S3, otherwise, turning to the step S21.
In step S3, after all the weights are updated, the main operation unit C calculates other data for adjusting the learning rate, such as the network error of this generation, and puts the calculated data into the storage unit a, and this generation of training is finished.
Step S4: and judging whether the network converges, if so, finishing the operation, and otherwise, turning to the step S1.
The weights may adopt either non-uniform (individual) learning rates or a uniform learning rate; for details, refer to the description above, which is not repeated here.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention and are not intended to limit the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.