Disclosure of Invention
(I) Technical problem to be solved
The present invention is directed to an apparatus and method for artificial neural network reverse training supporting adaptive learning rate, which solves at least one of the above technical problems of the prior art.
(II) Technical solution
According to an aspect of the present invention, there is provided an artificial neural network reverse training apparatus including a controller unit, a storage unit, a learning rate adjustment unit, and an arithmetic unit, wherein,
the storage unit is used for storing neural network data, including instructions, weights, derivatives of activation functions, learning rates, gradient vectors and learning rate adjustment data;
the controller unit is used for reading the instruction from the storage unit and decoding the instruction into a microinstruction for controlling the behaviors of the storage unit, the learning rate adjusting unit and the arithmetic unit;
the learning rate adjusting unit is used for calculating, before each generation of training, the learning rate used for the current generation of training from the learning rate of the previous generation and the learning rate adjustment data;
and the operation unit is used for calculating the weight of the current generation according to the gradient vector, the learning rate of the current generation, the derivative of the activation function and the weight of the previous generation.
Further, the operation unit includes a master operation unit, an interconnection unit, and a plurality of slave operation units, and the gradient vector includes an input gradient vector and an output gradient vector, wherein: the master operation unit is used for completing subsequent calculation with the output gradient vector of each layer during the calculation of that layer; the interconnection unit is used for transmitting the input gradient vector of the layer to all the slave operation units at the stage when reverse training of each layer of the neural network starts, and, after the slave operation units finish their calculation, adding the partial sums of the output gradient vector from all the slave operation units pairwise, stage by stage, to obtain the output gradient vector of the layer; and the plurality of slave operation units calculate the corresponding partial sums of the output gradient vector in parallel using the same input gradient vector and their respective weight data.
Further, the storage unit is an on-chip cache.
Further, the instruction is a SIMD instruction.
Further, the learning rate adjustment data includes a weight variation and an error function.
According to another aspect of the present invention, there is provided an artificial neural network reverse training method, including the steps of:
s1: before each generation of training begins, the learning rate used for the training of the current generation is calculated according to the learning rate of the previous generation and the learning rate adjustment data;
s2: the training is started, and the weight values are updated layer by layer according to the learning rate of the training of the present generation;
s3: after all the weight values are updated, calculating learning rate adjustment data of the present generation network, and storing the learning rate adjustment data;
s4: and judging whether the neural network converges, if so, finishing the operation, and otherwise, turning to the step S1.
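The steps S1-S4 above can be sketched as a toy software loop. The one-weight linear model, the learning rate clamp, and the use of adjustment rule (3) below are illustrative assumptions, not the patented apparatus:

```python
def train_adaptive(xs, ys, eta=0.05, generations=200):
    """Toy sketch of steps S1-S4 on a one-weight linear model y = w * x.

    Illustrative only: the model, the eta clamp [1e-6, 0.1], and the use of
    the multiplicative rule eta *= (1 - dE) (formula (3)) are assumptions."""
    w = 0.0
    prev_err = None
    for _ in range(generations):
        err = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
        # S1: before this generation's training, adjust the learning rate
        # from the previous generation's rate and the stored error change.
        if prev_err is not None:
            eta *= (1.0 - (err - prev_err))
            eta = min(max(eta, 1e-6), 0.1)  # keep eta in a sane range
        # S2: update the weight with this generation's learning rate.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= eta * grad
        # S3: store this generation's learning rate adjustment data.
        prev_err = err
        # S4: stop once the network has converged.
        if err < 1e-8:
            break
    return w

w = train_adaptive([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(round(w, 3))  # converges close to the true weight 2.0
```

Because the error falls on every generation, the rule keeps pushing the learning rate up against the clamp, which is exactly the behavior the adjustment data is meant to produce.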
Further, step S2 includes:
s21: for each layer of the network, carrying out weighted summation on input gradient vectors to calculate output gradient vectors of the layer, wherein the weight of the weighted summation is the weight to be updated of the layer;
s22: multiplying the output gradient vector of the current layer by the derivative value of the activation function of the next layer during forward operation to obtain the input gradient vector of the next layer;
s23: multiplying the input gradient vector element-wise by the input neurons of the forward operation to obtain the gradient of the weights of the layer;
s24: updating the weight of the layer according to the gradient and the learning rate of the obtained weight of the layer;
s25: judging whether all layers are updated, if so, entering step S3; otherwise, go to step S21.
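As a rough software sketch of sub-steps S21-S24 for one fully connected layer (the names, data layout, and the placement of the activation-derivative multiply are illustrative assumptions, not the patented hardware):

```python
def backprop_layer(w, in_grad, act_in, act_deriv, eta):
    """Sketch of sub-steps S21-S24 for one fully connected layer.

    w[i][j] is the weight from input neuron j to output neuron i; in_grad is
    the gradient vector arriving at this layer; act_in holds the layer's
    forward-pass input neurons; act_deriv holds activation-function
    derivative values recorded during the forward pass. All illustrative."""
    n_out, n_in = len(w), len(w[0])
    # S21: weighted sum of the input gradient vector (the weights of the
    # sum are the layer's weights) gives the layer's output gradient vector.
    out_grad = [sum(w[i][j] * in_grad[i] for i in range(n_out))
                for j in range(n_in)]
    # S22: multiply by the activation-function derivative to obtain the
    # input gradient vector handed to the next layer of the backward pass.
    next_in_grad = [g * d for g, d in zip(out_grad, act_deriv)]
    # S23: element-wise product with the forward-pass input neurons gives
    # the gradient of this layer's weights.
    w_grad = [[in_grad[i] * act_in[j] for j in range(n_in)]
              for i in range(n_out)]
    # S24: update the layer's weights with this generation's learning rate.
    new_w = [[w[i][j] - eta * w_grad[i][j] for j in range(n_in)]
             for i in range(n_out)]
    return new_w, next_in_grad
```

A single call updates one layer; sub-step S25 would simply repeat this over all layers.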
Further, in the training of the current generation, the weights may adopt non-uniform (individual) learning rates.
Further, in the training of the current generation, the weights may adopt a uniform learning rate.
(III) Advantageous effects
(1) By providing the learning rate adjusting unit and training the network with an adaptive learning rate, the weight change produced in each generation of training is determined more appropriately, so that the training iteration process is more stable, the time required to train the neural network to stability is reduced, and training efficiency is improved;
(2) by adopting a dedicated on-chip cache for the multilayer artificial neural network operation algorithm, the reusability of input neurons and weight data is fully exploited, repeated reads of the data from memory are avoided, memory access bandwidth is reduced, and memory bandwidth is prevented from becoming a performance bottleneck of multilayer artificial neural network operations and their training algorithms.
(3) By adopting dedicated SIMD instructions and a customized operation unit for multilayer artificial neural network operations, the problems of insufficient operation performance and high front-end decoding overhead of CPUs and GPUs are addressed, and support for the multilayer artificial neural network operation algorithm is effectively improved.
Detailed Description
The traditional training method adopted for artificial neural networks is the back propagation algorithm, in which the weight change between two generations is the gradient of an error function with respect to the weight multiplied by a constant, called the learning rate. The learning rate determines the amount of weight change produced in each round of training. If this value is too small, the effective weight update in each iteration is too small, training takes longer, and convergence is quite slow; if it is too large, the iterative process may oscillate or even diverge. The artificial neural network reverse training apparatus is provided with a learning rate adjusting unit which, before each generation of training, calculates the learning rate for the current generation from the learning rate of the previous generation and the learning rate adjustment data. The weight change produced in each generation of training is thus determined more appropriately, so that the training iteration process is more stable, the time required to train the neural network to stability is shortened, and training efficiency is improved.
In order that the objects, technical solutions and advantages of the present invention will become more apparent, the present invention will be further described in detail with reference to the accompanying drawings in conjunction with the following specific embodiments.
Fig. 1 is a block diagram illustrating an example of an overall structure of an artificial neural network reverse training apparatus according to an embodiment of the present invention. The embodiment of the invention provides a device for artificial neural network reverse training supporting adaptive learning rate, which comprises:
the storage unit A is used for storing neural network data, including instructions, weights, derivatives of activation functions, learning rates, gradient vectors (which may include input gradient vectors and output gradient vectors) and learning rate adjustment data (which may include network error values, weight variations and the like); the storage unit can be an on-chip cache, which avoids repeatedly reading the data from memory and prevents memory bandwidth from becoming a performance bottleneck of multilayer artificial neural network operations and their training algorithms.
The controller unit B is used for reading the instruction from the storage unit A and decoding the instruction into a microinstruction for controlling the behaviors of the storage unit, the learning rate adjusting unit and the arithmetic unit;
the instructions accessed and read by the storage unit A and the controller unit B can be SIMD instructions, and the problems of insufficient operation performance and high front-end decoding overhead of the existing CPU and GPU are solved by adopting the special SIMD instruction aiming at the operation of the multilayer artificial neural network.
The learning rate adjusting unit E calculates, before each generation of training, the learning rate used for the current generation from the learning rate of the previous generation and the learning rate adjustment data;
and the operation units (D, C, F) calculate the current generation weight according to the gradient vector, the current generation learning rate, the derivative of the activation function and the previous generation weight.
The storage unit A is used for storing neural network data, including instructions, neuron inputs, weights, neuron outputs, learning rates, weight variations, activation function derivatives, gradient vectors of each layer, and the like;
for the controller unit B, reading an instruction from the storage unit A and decoding the instruction into a microinstruction for controlling the behavior of each unit;
as for the operation unit, it may include a master operation unit C, an interconnection unit D, and a plurality of slave operation units F.
The interconnection unit D is used for connecting the master operation module and the slave operation module, and may be implemented in different interconnection topologies (e.g., a tree structure, a ring structure, a grid structure, a hierarchical interconnection, a bus structure, etc.).
The interconnection unit D is used for transmitting the input gradient vector of the layer to all the slave operation units F at the stage when reverse training of each layer of the neural network starts, and, after the slave operation units F finish their calculation, adding the partial sums of the output gradient vector from the slave operation units F pairwise, stage by stage, to obtain the output gradient vector of the layer.
The main operation unit C is used for completing subsequent calculation by utilizing the output gradient vector of the layer in the calculation process of each layer;
a plurality of slave operation units F, which utilize the same input gradient vector and respective weight data to calculate the corresponding output gradient vector partial sum in parallel;
The learning rate adjusting unit E is configured to calculate the learning rate used for the current generation of training from information such as the learning rate, weights, network error value, and weight variation of the previous generation (this information is stored in advance in the storage unit and can be retrieved).
Fig. 2 schematically shows an embodiment of the interconnection unit D: a tree-type interconnection structure. The interconnection unit D constitutes the data path between the master operation unit C and the plurality of slave operation units F and has a binary-tree structure. The interconnect comprises a plurality of nodes forming a binary tree path, i.e., each node has one parent node and two child nodes. Each node sends the data received from upstream identically to its two downstream child nodes, and merges the data returned by its two downstream child nodes and returns the result to its upstream parent node.
For example, during the reverse operation of the neural network, the vectors returned by the two downstream nodes are summed into one vector at the current node and returned to the upstream node. At the stage when each layer of the artificial neural network starts its calculation, the input gradient in the master operation unit C is sent to each slave operation unit F through the interconnection unit D; after the slave operation units F finish their calculation, the partial sums of the output gradient vector produced by the slave operation units F are added pairwise, stage by stage, in the interconnection unit D; that is, all the partial sums are summed to give the final output gradient vector.
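A minimal software sketch of this pairwise tree summation (the list-of-lists representation of the slave units' partial sums is an assumption made for illustration):

```python
def tree_reduce_partial_sums(partials):
    """Pairwise (binary-tree) summation of the slave units' partial
    output-gradient vectors, merged level by level as the interconnection
    unit D would do. Purely illustrative of the dataflow."""
    level = [list(p) for p in partials]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            # each internal node adds the vectors from its two children
            nxt.append([a + b for a, b in zip(level[i], level[i + 1])])
        if len(level) % 2:
            nxt.append(level[-1])  # an unpaired node passes through as-is
        level = nxt
    return level[0]

# four slave units, each holding a 2-element partial sum
print(tree_reduce_partial_sums([[1, 2], [3, 4], [5, 6], [7, 8]]))  # [16, 20]
```

Each while-loop iteration corresponds to one level of the binary tree, so the merge takes a number of stages logarithmic in the number of slave units.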
The learning rate adjusting unit E performs different calculations on the data depending on the adaptive learning rate adjustment method adopted.
First, in the standard back propagation algorithm:
w(k+1)=w(k)-ηg(w(k)) (1)
In formula (1), w(k) is the weight of the current training generation, i.e., the weight of the current generation, w(k+1) is the weight of the next generation, η is the fixed learning rate, a predetermined constant, and g(w) is the gradient vector.
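Formula (1) can be written as a one-line update; the quadratic error used in the example below is an illustrative assumption:

```python
def sgd_update(w, grad, eta=0.01):
    # Formula (1): w(k+1) = w(k) - eta * g(w(k)), with eta a fixed constant.
    return [wi - eta * gi for wi, gi in zip(w, grad)]

# For the error E(w) = 0.5 * sum(w_i^2), the gradient g(w) is w itself.
w = [1.0, -2.0]
w_next = sgd_update(w, w, eta=0.1)
print(w_next)  # each weight moves a fixed fraction toward the minimum
```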
Here, the learning rate is allowed to be updated from generation to generation, just like the other network parameters. The rule for adjusting the learning rate is: when the training error increases, reduce the learning rate; when the training error decreases, increase the learning rate. Several specific examples of adaptive learning rate adjustment rules are given below, though the rules are not limited to these examples.
Method one:
In formula (2), η(k) is the learning rate of the current generation, η(k+1) is the learning rate of the next generation, ΔE = E(k) - E(k-1) is the change in the error function E, a > 0, b > 0, and a, b are appropriate constants.
Method two:
η(k+1)=η(k)(1-ΔE) (3)
In formula (3), η(k) is the learning rate of the current generation, η(k+1) is the learning rate of the next generation, and ΔE = E(k) - E(k-1) is the change in the error function E.
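Rule (3) is simple enough to state directly in code (a sketch; the sample error values are made up for illustration):

```python
def adjust_lr_method2(eta_k, e_k, e_prev):
    # Formula (3): eta(k+1) = eta(k) * (1 - dE), with dE = E(k) - E(k-1).
    # If the error fell (dE < 0) the factor exceeds 1 and eta grows;
    # if the error rose (dE > 0) the factor is below 1 and eta shrinks.
    return eta_k * (1.0 - (e_k - e_prev))

print(adjust_lr_method2(0.1, e_k=0.40, e_prev=0.50))  # error fell, eta grows
print(adjust_lr_method2(0.1, e_k=0.60, e_prev=0.50))  # error rose, eta shrinks
```

In practice the factor would usually be clamped, since a large drop in error would otherwise inflate η abruptly.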
Method three:
In formula (4), η(k) is the learning rate of the current generation, η(k+1) is the learning rate of the next generation, ΔE = E(k) - E(k-1) is the change in the error function E, a > 1, 0 < b < 1, c > 0, and a, b, c are appropriate constants.
Method four:
In formula (5), η(k) is the learning rate of the current generation, η(k+1) is the learning rate of the next generation, ΔE = E(k) - E(k-1) is the change in the error function E, 0 < a < 1, b > 1, 0 < α < 1, and a, b, α are appropriate constants.
the learning rate η in the above four methods can be common to all weights, that is, the same learning rate is used for each layer of weights during each generation of training, and we remember this method as a uniform adaptive learning rate training method; or not universal, that is, different learning rates are adopted for each weight, and we remember this method as a respective adaptive learning rate training method. The training precision can be further improved and the training time can be reduced by the respective adaptive learning rate training method.
For a clearer comparison, schematic diagrams of the two methods are given: the uniform adaptive learning rate training method and the respective adaptive learning rate training method correspond to fig. 3 and fig. 4, respectively.
In FIG. 3, the connection weights w_jp1, w_jp2, …, w_jpn between the output layer P and the hidden layer J are all adjusted with the same learning rate η during reverse adjustment; in FIG. 4, the connection weights w_jp1, w_jp2, …, w_jpn between the output layer P and the hidden layer J are adjusted with the learning rates η_1, η_2, …, η_n during reverse adjustment. Adjusting different nodes differently in the reverse pass maximizes the adaptive capacity of the learning rate and best accommodates the changing needs of the weights during learning.
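The difference between the two schemes reduces to whether one η or a vector of η values is applied per weight (an illustrative sketch with made-up numbers):

```python
def update_uniform(ws, grads, eta):
    # uniform adaptive learning rate: one eta shared by all weights
    return [w - eta * g for w, g in zip(ws, grads)]

def update_individual(ws, grads, etas):
    # respective adaptive learning rate: weight w_jpi has its own eta_i
    return [w - e * g for w, e, g in zip(ws, etas, grads)]

print(update_uniform([1.0, 1.0], [0.5, 2.0], 0.1))
print(update_individual([1.0, 1.0], [0.5, 2.0], [0.2, 0.05]))
```

With individual rates, a weight seeing a small gradient can be given a larger step than one seeing a large gradient, which is the flexibility the respective method provides.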
For the respective adaptive learning rate adjustment method, after the initial values of the individual learning rates are obtained, each learning rate can still be updated iteratively according to methods one to four, though the updating is not limited to these four methods. The learning rate η in those formulas is then the individual learning rate corresponding to each weight.
Based on the same inventive concept, the invention also provides an artificial neural network reverse training method, wherein an operation flow chart is shown in fig. 5, and the method comprises the following steps:
s1: before each generation of training begins, the learning rate used for the training of the current generation is calculated according to the learning rate of the previous generation and the learning rate adjustment data;
s2: the training is started, and the weight values are updated layer by layer according to the learning rate of the training of the present generation;
s3: after all the weight values are updated, calculating learning rate adjustment data of the present generation network, and storing the learning rate adjustment data;
s4: judging whether the neural network converges, if so, finishing the operation, and otherwise, turning to step S1.
For step S1, before each generation of training starts, the learning rate adjustment unit E calls the learning rate adjustment data in the storage unit a to adjust the learning rate, resulting in the learning rate used for the training of the present generation.
For step S2: training of the current generation starts, and the weight values are updated layer by layer according to the learning rate of this generation's training. Step S2 may include the following sub-steps (see fig. 6):
step S21, for each layer, first, performing weighted summation on the input gradient vector to calculate the output gradient vector of the layer, where the weight of the weighted summation is the weight to be updated of the layer, and the process is completed by the master operation unit C, the interconnection unit D, and each slave operation unit F together;
step S22, in the main operation unit C, the output gradient vector is multiplied by the derivative value of the activation function of the next layer during forward operation to obtain the input gradient vector of the next layer;
step S23, in the main operation unit C, multiplying the input gradient vector element-wise by the input neurons of the forward operation to obtain the gradient of the weights of the layer;
step S24, finally, in the main operation unit C, updating the weight of the layer according to the obtained gradient and learning rate of the weight of the layer;
step S25: and judging whether the weight values of all the layers are updated, if so, performing the step S3, otherwise, turning to the step S21.
In step S3, after all the weights are updated, the main operation unit C calculates other data for adjusting the learning rate, such as the network error of this generation, and puts the calculated data into the storage unit a, and this generation of training is finished.
Step S4: and judging whether the network converges, if so, finishing the operation, and otherwise, turning to the step S1.
The weights may adopt either non-uniform (individual) learning rates or a uniform learning rate; for details, refer to the description above, which is not repeated here.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention and are not intended to limit the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.