
CN109272112B - Data reuse instruction mapping method, system and device for neural network - Google Patents

Data reuse instruction mapping method, system and device for neural network

Info

Publication number
CN109272112B
Authority
CN
China
Prior art keywords
data
processing unit
layer
input data
neural network
Prior art date
Legal status
Active
Application number
CN201810939096.6A
Other languages
Chinese (zh)
Other versions
CN109272112A (en)
Inventor
欧焱
李易
范东睿
叶笑春
李文明
Current Assignee
Beijing Zhongke Ruixin Technology Group Co ltd
Original Assignee
Beijing Zhongke Ruixin Technology Group Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Zhongke Ruixin Technology Group Co ltd
Publication of CN109272112A
Application granted
Publication of CN109272112B
Legal status: Active (Current)
Anticipated expiration legal status

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30003 Arrangements for executing specific machine instructions
    • G06F 9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F 9/3001 Arithmetic instructions
    • G06F 9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F 9/3885 Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • G06F 9/3893 Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled in tandem, e.g. multiplier-accumulator

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Neurology (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a data reuse instruction mapping method and system for a neural network, suitable for effective mapping of multiple applications on an interconnected chip. In the method, each PE of a processing unit (PE) array computes the partial results of the neuron nodes of a neural layer. The input data are distributed evenly across the PEs of the array; in each time beat, after its calculation, each PE passes the input data it currently holds to the next PE, so that the PEs form a ring-shaped structure within the array, and after N data flows across N PEs the sharing of the input data is complete. After the computation of one neural layer is finished, the result is not written back to memory; the calculation instructions of the next neural layer are mapped directly. The invention optimizes the data sharing mechanism between layers of a neural algorithm, so that data can be passed between neural layers more flexibly and efficiently, the memory access load of each PE is balanced, and the data reuse rate and flexibility are improved.

Description

Data reuse instruction mapping method, system and device for neural network
Technical Field
The invention relates to the field of computer architecture, and in particular to a method and a system for instruction mapping in a dataflow system within a computer architecture.
Background
With the advent of the exascale data era, the data sets and data scales that computing chips must process keep growing, and multi-core processors have become the mainstream architecture for handling big data. In a multi-core processor architecture, data sets are distributed over different processor cores, which cooperate by dividing a large processing task into many small tasks; each processing element (PE) handles data at fine granularity and small scale, and together they complete the large data task. However, as Moore's law reaches its limits, the constraint imposed by the "memory wall" keeps growing. The "memory wall" refers to the fact that processor clock frequency and performance have increased at an astonishing rate while the access speed of main memory (chiefly DRAM) has improved much more slowly, so that data transfer is far slower than data processing: the memory bandwidth cannot keep the processor "fed" to saturation, which limits the data-processing performance of the computing chip.
In order to alleviate the problem caused by the memory wall and improve the processing performance of computing chips, a commonly adopted approach is to increase the reusability of data, so that data resides on the chip's network-on-chip for as long as possible and the source data or result data of each processing element PE (process element) can be transmitted directly to other PEs.
With the development of artificial intelligence, neural networks have attracted more and more attention. As shown in fig. 1, a neural network algorithm has an input layer, hidden layers and an output layer (hereinafter collectively called neural layers). Each neural layer contains a number of neuron nodes, and the neuron nodes of adjacent layers are connected by line segments, where a line segment represents the connection weight between the two nodes. The value of each node is obtained by multiplying the output values of the nodes of the upper layer connected to it (all of the nodes or only some of them) by the corresponding weights and summing the products, then applying an activation operation (such as a sigmoid or relu function) and a bias calculation (subtracting a constant), which yields the output value of the current neuron node. For the computing hardware, the computation of each node's input values in the neural network amounts to a vector dot product, and the computation of a whole layer of neuron nodes to a vector-matrix multiplication.
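For concreteness, the per-node and per-layer computation just described can be written as a dot product and a vector-matrix product respectively. The following minimal NumPy sketch is ours (the variable names are not taken from the patent); it uses a sigmoid activation and a subtracted bias constant, as in the description above:

    import numpy as np

    x = np.array([0.5, -1.0, 2.0])        # outputs of the connected nodes of the previous layer
    W = np.array([[0.1, 0.2, 0.3],
                  [0.4, 0.5, 0.6]])       # one weight row per neuron node of the current layer
    b = np.array([0.05, 0.10])            # bias constant subtracted per node

    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

    y0 = sigmoid(np.dot(W[0], x) - b[0])  # dot product: value of a single neuron node
    y = sigmoid(W @ x - b)                # vector-matrix form: all nodes of the layer at once
    assert np.isclose(y[0], y0)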
Each PE of a multi-core processor stores the instruction set it is to execute; when an application/algorithm runs on the multi-core processor, its instruction set must be mapped onto the processor array in blocks. The quality of the instruction-mapping algorithm and the data-transfer latency between instructions directly determine the execution performance of the multi-core processor. In the prior art, a commonly used instruction mapping algorithm is the X-Y instruction mapping algorithm, shown in fig. 2 (Google's TPU). In that processor, the weights (the first multiplier of the multiply-add operation) flow longitudinally from top to bottom and the input data (the second multiplier of the multiply-add operation) flow laterally from left to right without interfering with each other, while the partial sums and results (the addend psum of the multiply-add operation) are held in the PEs. Assuming that each neuron node requires N multiply-add calculations (that is, each neuron node of the layer is connected to N neuron nodes of the previous layer), each PE, after completing its N multiply-add calculations, passes the result to an ACTIVATE unit for the activation and bias operations. After execution the results are stored to memory, and for the computation of the next layer the data are read from memory into the compute array again. The mapping algorithm of this processor maps instructions that can share input data horizontally and instructions that can share weights vertically, thereby fully exploiting the reusability of the data.
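For orientation only, the following is our own, heavily simplified sketch of the X-Y mapping as the background describes it: weights stream down the columns, input data stream across the rows, and each PE keeps its partial sum in place; the skewed schedule models the beat at which the two streams meet in a PE (details of actual TPU designs may differ from this illustration):

    import numpy as np

    N = 4
    X = np.random.randn(N, N)        # input data, streamed left to right (one row per PE row)
    W = np.random.randn(N, N)        # weights, streamed top to bottom (one column per PE column)
    psum = np.zeros((N, N))          # partial sums held inside the PEs

    # At beat t, PE (r, c) receives X[r, k] from the left and W[k, c] from above, with k = t - r - c.
    for t in range(3 * N - 2):
        for r in range(N):
            for c in range(N):
                k = t - r - c
                if 0 <= k < N:
                    psum[r, c] += X[r, k] * W[k, c]

    assert np.allclose(psum, X @ W)  # each PE ends up holding one element of the product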
The problem with the above approach is that, although it extracts a good data reuse rate within the computation of a single neural layer, it is limited to the neuron nodes of the current layer: after the neuron nodes of one layer have been computed, all results must be written back to memory, and for the computation of the next hidden layer the instructions must be remapped and the data brought into the PE array again. As noted in the background above, in modern computer architectures the transfer latency of data is much higher than its computation latency; and as artificial intelligence develops, the number of hidden layers and the number of neurons per hidden layer keep growing, so frequent memory accesses degrade performance while leaving the compute units idle during the accesses.
Because of these problems, the conventional instruction mapping method, although it achieves a good data reuse rate within a single neural layer of a neural network, makes it difficult to share and reuse data between neural layers; the repeated reading of data wastes memory bandwidth and leaves computing components idle. An instruction mapping method is needed that enables data to be reused and shared between the neural layers of a neural network algorithm.
The English abbreviations used in the present invention have the following meanings:
PE: processing element (processing unit);
MEM: memory.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a method for effective mapping of multiple applications on an interconnected chip, which can support the shared transmission of data on the chip. The computation of each layer of neurons is distributed evenly over the PE array; only the PE array onto which the instructions of the first hidden layer are mapped reads the input-layer data from memory, and within each neural layer the input data are shared among the PEs through a ring-shaped structure. Whenever a PE receives a portion of new input data, it reads the corresponding weight data from memory and performs the multiply-add calculation. After the PE array completes one periodic flow, the computation of the neuron-node input data and weights of the current hidden layer is finished, and the output of the current hidden layer is obtained through the activation operation and bias calculation. The output of the previous neural layer is the input of the next neural layer, and between different neural layers only the instructions need to be remapped, so repeated reading and writing of intermediate data are avoided.
Specifically, the invention provides the following technical scheme:
in one aspect, the present invention provides a data reuse instruction mapping method for a neural network, including:
step 1, only mapping the calculation of one layer of neuron nodes on the processing unit array each time; for the first layer of neuron nodes, each processing unit needs to complete the computation of K neuron nodes, where K = ⌈M / N²⌉,
M is the number of the first layer of neuron nodes, and N is the scale of the N × N processing unit array in the processor;
step 2, after the processing unit array reads the data to be input into the processing unit array, the input data of the neural network is not read any more, and the data flow among the processing units is in an annular structure;
step 3, when the input data flow to a processing unit, the processing unit reads the corresponding weight values from the memory and performs the multiply-add calculation; after all multiply-add calculations are completed, the activation function is applied, so as to obtain the result of the first hidden layer;
step 4, taking the result of the first hidden layer as the input of the second hidden layer, and so on until the results of all the hidden layers are calculated.
Preferably, the step 1 further includes, for each neuron node, reading in the input data and the weight values corresponding to the input data from a memory, and performing the multiply-add calculation;
the scale of the input data and of the weights of each neuron node is D, where D is the number of input-layer nodes; the input data are distributed evenly over the processing units, so that each processing unit holds ⌈D / N²⌉ of them and performs the partial-sum calculation of its K neuron nodes.
Preferably, the step 2 further includes that the data flow between the processing units forms a ring structure: each processing unit transmits the input data X it currently holds to the processing unit downstream of it, and accepts the data Y transmitted by the previous processing unit;
meanwhile, each processing unit reads in the weight values corresponding to the data Y from the memory and performs the multiply-add calculation to obtain a partial-sum result. After the data has flowed N × N times in an N × N PE array, the data that each PE initially shared returns to the original PE (the N × N flows completed by the N × N PE array are called one periodic flow, likewise below), thereby completing the sharing of the input data over the PE array.
Preferably, the step 3 further includes that, for the calculation of the neuron nodes of each layer, the data only needs to complete one periodic flow, so that all the input data "flow" through every PE; each time input data arrive at a PE, the PE reads the corresponding weights from MEM and performs the multiply-add calculation. After all the multiply-add calculations are completed, the activation function (e.g. sigmoid, relu) is computed, which yields the result of the currently computed first hidden layer.
Preferably, the step 4 further includes, after the computation of one hidden layer is completed, storing the computation result in the processing unit array in a distributed manner, and when performing the computation of the next hidden layer, only remapping the instruction of the computation of the next hidden layer onto the processing unit array according to the ring structure, without reading in the input data required by the next hidden layer from the memory.
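The following is a minimal end-to-end sketch of steps 1 to 4, written by us in NumPy under simplifying assumptions (dense fully connected layers, layer sizes divisible by the number of processing units, ReLU activation, zero bias, and the instruction remapping between layers treated as immediate); it models the ring flow of the resident input blocks and checks the distributed result against an ordinary layer-by-layer evaluation:

    import numpy as np

    N = 4                                   # the PE array is N x N
    P = N * N                               # number of processing units on the ring
    relu = lambda v: np.maximum(v, 0.0)

    def layer_on_ring(blocks, W):
        # blocks[p]: input values currently resident in PE p; W: weight matrix of the layer (one row per neuron node)
        M, D = W.shape
        K = M // P                          # step 1: K neuron nodes mapped onto each PE
        S = D // P                          # input values initially held by each PE
        held, origin = list(blocks), list(range(P))
        psum = np.zeros((P, K))
        for _ in range(P):                  # step 2: one periodic flow = N*N beats around the ring
            for p in range(P):
                rows = slice(p * K, (p + 1) * K)                  # the K nodes computed by PE p
                cols = slice(origin[p] * S, (origin[p] + 1) * S)  # step 3: weights fetched from MEM this beat
                psum[p] += W[rows, cols] @ held[p]                # multiply-add on the resident block
            held = [held[(p - 1) % P] for p in range(P)]          # every PE passes its block to the next PE
            origin = [origin[(p - 1) % P] for p in range(P)]
        out = relu(psum)                    # activation inside each PE; nothing is written back to memory
        return [out[p] for p in range(P)]   # step 4: result stays distributed as the next layer's input

    rng = np.random.default_rng(0)
    x = rng.standard_normal(64)             # input-layer data, read from memory once
    W1 = rng.standard_normal((160, 64))     # weights of the first hidden layer
    W2 = rng.standard_normal((320, 160))    # weights of the second hidden layer

    blocks = [x[p * 4:(p + 1) * 4] for p in range(P)]   # only the first layer reads input data from memory
    blocks = layer_on_ring(blocks, W1)
    blocks = layer_on_ring(blocks, W2)
    assert np.allclose(np.concatenate(blocks), relu(W2 @ relu(W1 @ x)))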
In addition, the invention also provides a data reuse instruction mapping system facing the neural network, which comprises:
the processing unit array, composed of a plurality of processing units, onto which only the calculation of one layer of neuron nodes is mapped at a time; once the processing unit array has read in the data to be input to it, the input data of the neural network are not read again; and
the memory unit is used for storing input data of the first layer of neuron nodes and a weight corresponding to the input data of each neuron node;
the data flow direction control unit is used for controlling the data flow direction of the data in the processing unit array to be in a ring structure;
and the neural network computing unit is used for controlling the processing unit to carry out multiplication and addition computation of the neural network and computation of the activation function.
Preferably, for each said neuron node of the neural network, the scale of the input data and of the weights is D, and the input data are distributed evenly over the processing units, so that each processing unit holds ⌈D / N²⌉ of them, wherein D is the number of input-layer nodes and N is the size of the N × N processing unit array in the processor.
Preferably, each processing unit transmits the input data X currently held by the current processing unit to the processing units downstream of the processing unit, and accepts the data Y transmitted by the last processing unit;
and simultaneously, reading the corresponding weight value of the data Y from the memory by each processing unit, and performing multiply-add calculation.
Preferably, for hidden layer calculation in the neural network, after a hidden layer is calculated, the calculation result is distributively stored in the processing unit array, and when the next hidden layer is calculated, the instruction of the next hidden layer calculation only needs to be remapped to the processing unit array according to the ring structure.
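Purely as an illustration, the four units named above could be organised as follows (a structural sketch of ours; the class and method names are assumptions, not taken from the patent):

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class MemoryUnit:                        # stores the first-layer input data and the per-node weights
        inputs: np.ndarray
        weights: dict                        # weights[layer] is the weight matrix of that neural layer
        def read_weights(self, layer, rows, cols):
            return self.weights[layer][rows][:, cols]

    @dataclass
    class DataFlowController:                # constrains the data flow in the PE array to a ring
        num_pe: int
        def next_pe(self, p):
            return (p + 1) % self.num_pe

    class NeuralComputeUnit:                 # drives the multiply-add and activation inside a PE
        def mac(self, acc, w, x):
            return acc + w @ x
        def activate(self, v):
            return np.maximum(v, 0.0)        # e.g. relu; a sigmoid would serve equally

    @dataclass
    class ProcessingUnitArray:               # the PEs; only one layer of neuron nodes mapped at a time
        mem: MemoryUnit
        flow: DataFlowController
        compute: NeuralComputeUnit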
In addition, the invention also provides a data reuse instruction mapping device facing the neural network, which comprises a processing unit array consisting of a plurality of processing units;
the memory unit stores input data and weight used by the neural network calculation, and computer instructions which can be called and operated by the processor unit;
the computer instructions perform the neural network-oriented data reuse instruction mapping method of any one of claims 1-5.
Compared with the prior art, the invention has the beneficial effects that:
the invention optimizes the instruction mapping of the neural network algorithm, and forms an annular data flow graph on the PE array, so that the intermediate data of different neural layers can be multiplexed, and the memory access times of shared storage are greatly reduced. Compared with the traditional sharing mode, the instruction mapping method has the advantages of more flexible data multiplexing and higher efficiency.
Drawings
FIG. 1 is a neural network algorithm model;
FIG. 2 illustrates a TPU structure and instruction mapping method;
FIG. 3 illustrates a data flow pattern of a PE array according to an embodiment of the present invention;
FIG. 4 illustrates a PE array and data flow method according to an embodiment of the invention;
FIG. 5 is a parameter set according to an embodiment of the present invention;
FIG. 6 is a data illustration of an embodiment of the present invention;
FIG. 7 is a diagram illustrating an example of time 0 according to an embodiment of the present invention;
FIG. 8 is a diagram of an example at time 1 in accordance with an embodiment of the present invention;
FIG. 9 is a diagram illustrating an example of time 16 in an embodiment of the present invention;
fig. 10 is a diagram of an example at time 17 in the embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.
Example 1
In a specific embodiment, the present invention provides a data reuse instruction mapping method for a neural network, which can be implemented as follows:
step 1, only mapping the calculation of one layer of neuron nodes on the processing unit array each time; for the first layer of neuron nodes, each processing unit needs to complete the computation of K neuron nodes, where K = ⌈M / N²⌉,
M is the number of the first layer of neuron nodes, and N is the scale of the N × N processing unit array in the processor;
step 2, after the processing unit array reads the data to be input into the processing unit array, the input data of the neural network is not read any more, and the data flow among the processing units is in an annular structure;
step 3, when the input data flow to a processing unit, the processing unit reads the corresponding weight values from the memory and performs the multiply-add calculation; after all multiply-add calculations are completed, the activation function is applied, so as to obtain the result of the first hidden layer;
step 4, taking the result of the first hidden layer as the input of the second hidden layer, and so on until the results of all the hidden layers are calculated.
Preferably, the step 1 further includes, for each neuron node, reading in the input data and the weight values corresponding to the input data from a memory, and performing the multiply-add calculation;
the scale of the input data and of the weights of each neuron node is D, where D is the number of input-layer nodes; the input data are distributed evenly over the processing units, so that each processing unit holds ⌈D / N²⌉ of them and performs the partial-sum calculation of its K neuron nodes.
Preferably, the step 2 further includes that the data flow between the processing units forms a ring structure: each processing unit transmits the input data X it currently holds to the processing unit downstream of it, and accepts the data Y transmitted by the previous processing unit;
meanwhile, each processing unit reads in the weight values corresponding to the data Y from the memory and performs the multiply-add calculation to obtain a partial-sum result. After the data has flowed N × N times in an N × N PE array, the data that each PE initially shared returns to the original PE (the N × N flows completed by the N × N PE array are called one periodic flow, likewise below), thereby completing the sharing of the input data over the PE array.
Preferably, the step 3 further includes that, for the calculation of the neuron nodes of each layer, the data only needs to complete one periodic flow, so that all the input data "flow" through every PE; each time input data arrive at a PE, the PE reads the corresponding weights from MEM and performs the multiply-add calculation. After all the multiply-add calculations are completed, the activation function (e.g. sigmoid, relu) is computed, which yields the result of the currently computed first hidden layer.
Preferably, the step 4 further includes, after the computation of one hidden layer is completed, storing the computation result in the processing unit array in a distributed manner, and when performing the computation of the next hidden layer, only remapping the instruction of the computation of the next hidden layer onto the processing unit array according to the ring structure, without reading in the input data required by the next hidden layer from the memory.
Example 2
In yet another embodiment, the present invention further provides a data reuse instruction mapping system oriented to a neural network, which can perform the specific methods in embodiments 1 and 3, for example. The system comprises:
the processing unit array, composed of a plurality of processing units, onto which only the calculation of one layer of neuron nodes is mapped at a time; once the processing unit array has read in the data to be input to it, the input data of the neural network are not read again; and
the memory unit is used for storing input data of the first layer of neuron nodes and a weight corresponding to the input data of each neuron node;
the data flow direction control unit is used for controlling the data flow direction of the data in the processing unit array to be in a ring structure;
and the neural network computing unit is used for controlling the processing unit to carry out multiplication and addition computation of the neural network and computation of the activation function.
Preferably, for each said neuron node of the neural network, the scale of the input data and of the weights is D, and the input data are distributed evenly over the processing units, so that each processing unit holds ⌈D / N²⌉ of them, wherein D is the number of input-layer nodes and N is the size of the N × N processing unit array in the processor.
Preferably, each processing unit transmits the input data X currently held by the current processing unit to the processing units downstream of the processing unit, and accepts the data Y transmitted by the last processing unit;
and simultaneously, reading the corresponding weight value of the data Y from the memory by each processing unit, and performing multiply-add calculation.
Preferably, for hidden layer calculation in the neural network, after a hidden layer is calculated, the calculation result is distributively stored in the processing unit array, and when the next hidden layer is calculated, the instruction of the next hidden layer calculation only needs to be remapped to the processing unit array according to the ring structure.
Example 3
One specific implementation of the method of the present invention is illustrated below by way of a concrete example. As shown in fig. 4, it is assumed that the scale of the PE array is 4 × 4 and the PEs are numbered as shown in fig. 4; the number of input nodes is 64, the number of nodes of the first hidden layer is 160, and the number of neuron nodes of the second layer is 320. The data calculation between the other adjacent layers (from the second layer to the third, from the third to the fourth, and so on) proceeds in the same way as between the first and second layers.
According to the above assumptions, as shown in fig. 5, there are 4 × 4 = 16 PEs in total. In the computation of the first hidden layer (L1_layer), each PE performs the calculation of ⌈160 / (4 × 4)⌉ = 10 neuron nodes. The data read in from the input layer (Input_layer) are stored on the PE array in a distributed manner, each PE reading in ⌈64 / (4 × 4)⌉ = 4 input-layer node values. Similarly, in the computation of the second hidden layer (L2_layer), each PE performs the calculation of ⌈320 / (4 × 4)⌉ = 20 neuron nodes; the data read in from its input layer are again stored on the PE array in a distributed manner, each PE reading in ⌈160 / (4 × 4)⌉ = 10 input data. As shown in fig. 6, the input array stores the input data, and weight stores the weight data of the neuron nodes corresponding to the input data. To simplify the design, it is assumed that the multiply-add and activation operations of a PE complete immediately, and that the remapping of instructions also completes immediately.
Step 601: at time 0, as shown in fig. 7, since each PE has to perform the calculation of 10 neuron nodes, each PE reads in 4 input-layer data and 10 × 4 = 40 corresponding weight data (10 refers to the 10 neuron nodes, and 4 to the weights of the 4 input data corresponding to each neuron node). Assume that PE1 reads the input data input[1][1-4] (the first dimension of input indicates which neural layer's input data it is, and the second dimension indexes the outputs of the previous layer's nodes, i.e. the four data input[1][1], input[1][2], input[1][3] and input[1][4]; the same notation is used below). PE1 then calculates the 1st to 10th neuron nodes of the first hidden layer, PE2 reads the input data input[1][5-8] and calculates the 11th to 20th neuron nodes of the first hidden layer, and so on.
Step 602: at time 1, as shown in fig. 8, after each PE has performed the corresponding calculation, the data flows according to the serial number order of the PEs, that is, the data of input [1] [1-4] in PE1 is transmitted to PE2, the data of input [1] [5-8] in PE2 is transmitted to PE3, and so on, the data of input [1] [61-64] in PE16 is transmitted to PE 1. In addition to the sharing of the input data, each PE needs to read in new weight data, for example, PE1 reads in data of weight [1] [1-10] [61-64] (i.e. reads in weights corresponding to the 61-64 input data corresponding to the 1 st to 10 th neuron nodes of the first hidden layer, the same applies below), each PE reads in 40 weight data in total, and similarly, PE2 reads in data of weight [1] [11-20] [1-4], and so on.
Step 603: at times 2-16, the PE array continues to flow as in step 602, each flow in the PE array moving only one hop (i.e. in each flow the data of PE1 move only to PE2, the data of PE2 only to PE3, and so on). It follows that at time 16, as shown in fig. 9, the data input[1][1-4] have returned to PE1. Each PE has therefore completed the multiply-add calculations of input data and weights for all of its neuron nodes (all multiply-add calculations are finished at time 16). The instructions are then remapped: because the computation scales of the first and second hidden layers differ, the instructions to be executed differ and must be remapped. The mapping rule is the same as before and again forms a ring-shaped structure; to simplify matters, the remapping of the instructions is assumed to complete immediately.
Step 604: at the 17 th time, as shown in fig. 10, each node does not perform the sharing operation of the input data between the nodes, but performs the activation operation inside each node, and the output result of the first hidden layer can be obtained by activating the function. As known from the neural network algorithm, the output result of one hidden layer is the input data of the second hidden layer, that is, after the first hidden layer is calculated, the obtained result is input [2] [1-160] (it can be understood that there are 160 input nodes for the data of the second hidden layer), the data is evenly distributed on each PE, the data stored in PE1 is input [2] [1-10], the data stored in PE2 is input [2] [11-20], and so on. At this time, each PE reads weight data corresponding to input data in a node of the layer 2 hidden layer, for example, PE1 reads weight [2] [1-20] [1-10] (i.e., each PE calculates the calculation of 20 neuron nodes, reads corresponding weight data of 10 input data, and reads 200 weight data in total), and performs the 1 st multiply-add calculation of the layer 2 hidden layer.
Step 605: at the 18 th moment, the sharing calculation is continued according to the mode of the step 602, and the calculation of the neural network algorithm is completed layer by layer according to the calculation mode of the step 602 and the step 604 until the output result is finally obtained.
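As a quick cross-check of the counts used in steps 601 to 605 (a small sketch added by us; all numbers follow directly from the example's parameters):

    N, D, M1, M2 = 4, 64, 160, 320       # 4 x 4 PE array, 64 input nodes, 160 / 320 hidden-layer nodes
    P = N * N                            # 16 PEs in the ring

    k1, s1 = M1 // P, D // P             # 10 neuron nodes and 4 resident input values per PE in layer 1
    print(k1, s1, k1 * s1)               # 10 4 40  -> 40 weight values read per PE per beat (steps 601/602)
    print(P)                             # 16       -> beats until each input block returns to its PE (step 603)

    k2, s2 = M2 // P, M1 // P            # 20 neuron nodes and 10 resident inputs per PE in layer 2
    print(k2, s2, k2 * s2)               # 20 10 200 -> 200 weight values read per PE at time 17 (step 604)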
Those of ordinary skill in the art will understand that in this embodiment the calculation of a neural layer may refer to calculation from the input layer to a hidden layer, between hidden layers, or from a hidden layer to the output layer. For the calculation of a single neuron node, not all of the input data are necessarily used (for example, in a sparse MLP algorithm), but this does not affect the way the input data are shared: if some input data require no calculation, only the corresponding calculation instructions are omitted, while the data are still shared as the data of the whole input layer. That is, if there are 10 input data, a PE may use only 6 of them in its calculation, but on the next beat that PE still has to pass all 10 input data on to the next PE, thereby achieving full sharing of the input data.
Example 4
In another embodiment, the present invention further provides an apparatus for data reuse instruction mapping for a neural network, the apparatus including a processing unit array composed of a plurality of processing units;
the memory unit stores input data and weight used by the neural network calculation, and computer instructions which can be called and operated by the processor unit;
the computer instructions perform the neural network-oriented data reuse instruction mapping method of any one of claims 1-5.
In yet another specific embodiment, the apparatus for neural network-oriented data reuse instruction mapping may comprise a system as described in embodiment 2, thereby performing a specific method as described in embodiment 1 or embodiment 3, for example.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A data reuse instruction mapping method for a neural network, the method comprising:
step 1, only mapping the calculation of a layer of neuron nodes on a processing unit array each time; for the first layer of neuron nodes, each processing unit needs to complete the computation of K neuron nodes, where,
K = ⌈M / N²⌉,
m is the number of the first layer of neuron nodes, and N is the scale of a processing unit array in the processor;
step 2, after the processing unit array reads the data to be input into the processing unit array, the input data of the neural network is not read any more, and the data flow among the processing units is in an annular structure;
step 3, when the input data flows to the processing unit, the processing unit reads the corresponding weight value from the memory to carry out multiply-add calculation; activating a function after all the multiplication and addition calculation is completed, so as to obtain a result of a first layer of hidden layer;
step 4, taking the result of the first hidden layer as the input of the second hidden layer, and so on until the results of all the hidden layers are calculated.
2. The method according to claim 1, wherein the step 1 further comprises, for each neuron node, reading in input data and a weight value corresponding to the input data from a memory, and performing a multiply-add calculation;
the scale of the input data and of the weights of each neuron node is D, where D is the number of input layer nodes; and the input data are distributed evenly over the processing units, each processing unit holding ⌈D / N²⌉ of them.
3. The method according to claim 1, wherein the step 2 further comprises that the data flow between the processing units is in a ring structure, each processing unit transmits the input data X currently held by the current processing unit to the processing unit downstream of the processing unit, and accepts the data Y transmitted by the last processing unit;
and simultaneously, reading the corresponding weight value of the data Y from the memory by each processing unit, and performing multiply-add calculation.
4. The method of claim 1, wherein step 3 further comprises, for each layer of neuron node calculations, only one periodic flow of data is required.
5. The method of claim 1, wherein step 4 further comprises, after a hidden layer is computed, distributively storing the computation result on the processing unit array, and when performing the next hidden layer computation, only remapping the instruction of the next hidden layer computation onto the processing unit array according to the ring structure.
6. A neural network-oriented data reuse instruction mapping system, the system comprising:
the processing unit array is composed of a plurality of processing units, only one layer of calculation of neuron nodes is mapped on the processing unit array every time, and when the processing unit array reads data needing to be input into the processing unit array, the input data of a neural network is not read; and
the memory unit is used for storing input data of the first layer of neuron nodes and a weight corresponding to the input data of each neuron node;
the data flow direction control unit is used for controlling the data flow direction of the data in the processing unit array to be in a ring structure;
and the neural network computing unit is used for controlling the processing unit to carry out multiplication and addition computation of the neural network and computation of the activation function.
7. The system of claim 6, wherein for each of the neuron nodes of the neural network the scale of the input data and of the weights is D, and the input data are distributed evenly over the processing units, each processing unit holding ⌈D / N²⌉ of them, wherein D is the number of input layer nodes and N is the size of the N × N processing unit array in the processor.
8. The system according to claim 6, wherein each processing unit transmits input data X currently held by the current processing unit to the processing units downstream thereof, and accepts data Y incoming from the previous processing unit;
and simultaneously, reading the corresponding weight value of the data Y from the memory by each processing unit, and performing multiply-add calculation.
9. The system of claim 6, wherein for hidden layer computation in the neural network, after a hidden layer computation is completed, the computation result is distributively stored in the processing unit array, and when performing the next hidden layer computation, the instruction of the next hidden layer computation is remapped to the processing unit array according to the ring structure.
10. An apparatus for neural network-oriented data reuse instruction mapping, the apparatus comprising a processing unit array comprising a plurality of processing units;
the memory unit stores input data and weight used by the neural network calculation, and computer instructions which can be called and operated by the processor unit;
the computer instructions perform the neural network-oriented data reuse instruction mapping method of any one of claims 1-5.
CN201810939096.6A 2018-07-03 2018-08-17 Data reuse instruction mapping method, system and device for neural network Active CN109272112B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810718373 2018-07-03
CN2018107183730 2018-07-03

Publications (2)

Publication Number Publication Date
CN109272112A CN109272112A (en) 2019-01-25
CN109272112B true CN109272112B (en) 2021-08-27

Family

ID=65153921

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810939096.6A Active CN109272112B (en) 2018-07-03 2018-08-17 Data reuse instruction mapping method, system and device for neural network

Country Status (1)

Country Link
CN (1) CN109272112B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111857989B (en) * 2020-06-22 2024-02-27 深圳鲲云信息科技有限公司 Artificial intelligence chip and data processing method based on same
CN112015473B (en) * 2020-07-23 2023-06-27 中国科学院计算技术研究所 Sparse Convolutional Neural Network Acceleration Method and System Based on Dataflow Architecture

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5600843A (en) * 1989-09-20 1997-02-04 Fujitsu Limited Ring systolic array system for synchronously performing matrix/neuron computation using data transferred through cyclic shift register connected in cascade of trays
JPH05197702A (en) * 1992-01-21 1993-08-06 Fujitsu Ltd Simulator device for neural network
CN106250103A (en) * 2016-08-04 2016-12-21 东南大学 A kind of convolutional neural networks cyclic convolution calculates the system of data reusing
CN107085562A (en) * 2017-03-23 2017-08-22 中国科学院计算技术研究所 A Neural Network Processor and Design Method Based on Efficient Multiplexing Data Stream
CN107480782A (en) * 2017-08-14 2017-12-15 电子科技大学 Learn neural network processor on a kind of piece

Also Published As

Publication number Publication date
CN109272112A (en) 2019-01-25


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100095 room 135, 1st floor, building 15, Chuangke Town, Wenquan Town, Haidian District, Beijing

Applicant after: Beijing Zhongke Ruixin Technology Group Co.,Ltd.

Address before: 1 wensong Road, Zhongguancun environmental protection park, Beiqing Road, Haidian District, Beijing 100095

Applicant before: SMARTCORE (BEIJING) Co.,Ltd.

CB02 Change of applicant information
GR01 Patent grant
GR01 Patent grant