Disclosure of Invention
In order to solve the above problems, the invention provides a multi-core neural network tensor processor with extensible computing power. The tensor processor adopts the design idea of modularized multiplexing: a minimum calculation unit is designed once and multiplexed to form structured calculation modules, thereby realizing a multi-core neural network tensor processor architecture with extensible computing power. Because the multi-core neural network tensor processor adopts this modularized multiplexing method, the complexity of design and verification can be greatly reduced. Through simple module combination, configuration schemes for neural network tensor processors of various computing powers can be realized, so that the design complexity of high-computing-power neural network tensor processors is greatly reduced, the design and verification cycle is shortened, and ultimately the design requirements of neural network tensor processors of various computing powers can be met at minimum design and verification cost.
The invention provides a multi-core neural network tensor processor with extensible computing power, which comprises a PCIE controller, M MTC cores, and M SDRAM controllers, wherein each MTC core comprises S STC cores, each STC core comprises L LTC cores, the LTC cores are the minimum calculation modules, all LTC cores in the same multi-core neural network tensor processor are configured as identical functional modules, the PCIE controller is used for realizing access control between the multi-core neural network tensor processor and external equipment, and each SDRAM controller is used for accessing the off-chip SDRAM memory of the corresponding MTC core.
Further preferably, S is a multiple of 2 and L is a multiple of 2.
Further, the LTC core comprises a 4D tensor core, a 1D tensor core, an instruction control unit, a local cache unit, a memory reading unit, a memory writing unit, and a special unit, wherein the 4D tensor core is used for realizing basic operations on 4D tensor data, the 1D tensor core is used for realizing basic operations on 1D data, the instruction control unit is used for acquiring configuration parameters and instructions for the 4D tensor core and the 1D tensor core from an external memory, the local cache unit is used for storing the input data required by the 4D tensor core, the memory reading unit and the memory writing unit are used for providing each module in the minimum calculation module with direct read and write access to an external memory, and the special unit is used for realizing calculation functions related to coordinate transformation.
Further preferably, the LTC core comprises one 4D tensor core and two 1D tensor cores.
Further, the 4D tensor core contains P1 FP16 MACs and 2×P1 INT8 MACs.
Further preferably, P1 is 2 to the power of M1, where M1 is not less than 8.
Further, the 1D tensor core internally contains P2 FP16 MACs.
Further preferably, P2 is 2 to the power of M2, where M2 is 4, 5, or 6.
Further, the basic operations on 4D tensor data comprise multiplication, addition, multiply-accumulate, and maximum, and the basic operations on 1D data comprise multiplication, addition, linear activation operations, and nonlinear activation operations.
Further, a cascade path is arranged between the 4D tensor core and the 1D tensor core, and is used for directly inputting output data of the 4D tensor core into the 1D tensor core.
Further, the memory reading and writing units are configured as follows: the 4D tensor core is provided with two independent memory reading units that respectively read the data and the parameters required by 4D calculation; the 1D tensor core is provided with two independent memory reading units that respectively read the data and the parameters required by 1D calculation; the instruction control unit and the special unit share one memory reading unit; the 1D tensor core is provided with an independent memory writing unit that writes out the calculation results of the 4D/1D tensor cores; and the special unit is provided with an independent memory writing unit that writes out its own calculation results.
Further, the configuration parameters are used for configuring the 4D tensor core and the 1D tensor core each to a specific computing function, and the instructions are used for controlling the computing flow.
Further preferably, the capacity of the local cache unit is P3×32 KB, where P3 is an integer not less than 8.
Further, the STC core of the multi-core neural network tensor processor further comprises a shared memory and a data control unit, wherein the shared memory is used for caching input data, intermediate data, or parameters commonly required by the LTC cores within the STC core, and the data control unit is used for prefetching shared data or shared parameters residing in the off-chip SDRAM into the shared memory in advance.
Further, the interconnection relationship among the MTC cores, STC cores, and LTC cores is as follows: an LTC core can only access its own local cache unit and cannot access the local cache units of other LTC cores; an LTC core can access the shared memory of its own STC core or the shared memories of other STC cores; and an LTC core can access the off-chip SDRAM memory of its own MTC core or the off-chip SDRAM memories of other MTC cores.
Further, the capacity of the shared memory is greater than or equal to the capacity of the local cache unit.
The invention realizes the following technical effects:
(1) The multi-core neural network tensor processor adopts a modularized multiplexing design scheme, in which the minimum calculation modules are repeatedly instantiated and combined to form a neural network tensor processor of a given computing-power specification.
(2) In the multi-core neural network tensor processor, by flexibly setting the numbers of MTC cores, STC cores, and LTC cores and combining different LTC cores, STC cores, and MTC cores, tensor processors of various computing powers can be formed, thereby realizing a tensor processor architecture with extensible computing power.
Detailed Description
To further illustrate the various embodiments, accompanying drawings are provided. The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate the embodiments and, together with the description, serve to explain their principles. With reference to these materials, one of ordinary skill in the art will understand other possible embodiments and advantages of the present invention.
The invention will now be further described with reference to the drawings and detailed description.
The invention provides a multi-core neural network tensor processor with extensible computing power. The tensor processor adopts a modularized multiplexing design scheme, in which the minimum calculation modules are repeatedly instantiated and combined to form a neural network tensor processor of a given computing-power specification.
Let the computing power of the minimum calculation module be A. Computing-power extensibility means that the computing-power specification of the multi-core neural network tensor processor depends on the number N of minimum calculation modules actually instantiated, the total computing power being equal to A×N, where N can be any positive integer.
Although the computing power is extensible, the neural network computing functionality is unchanged across multi-core neural network tensor processors of different computing powers. The functionality of the multi-core neural network tensor processor is determined by the functionality of the minimum calculation module, and a neural network tensor processor with N equal to 1 is functionally identical to one with N equal to any other positive integer.
The minimum calculation module (LTC core) is composed of a 4D tensor core (4D Tensor Core), a 1D tensor core (1D Tensor Core), an instruction control unit (Instruction Unit), a local cache unit (Local Memory), a memory reading unit (LD), a memory writing unit (ST), and a special unit (Recut); its internal structure is shown in Fig. 1.
In this embodiment, the minimum calculation module includes one 4D tensor core (4D Tensor Core) and two 1D tensor cores (1D Tensor Core) (in different implementations, the number of 1D tensor cores may be another number, for example, 1). The 4D tensor core contains 1024 FP16 MACs and 2048 INT8 MACs (in different implementations, the numbers of FP16 MACs and INT8 MACs may be other numbers; the 4D tensor core is typically set to contain P1 FP16 MACs and 2×P1 INT8 MACs, where P1 is 2 to the power of M1 and M1 is not less than 8). The 1D tensor core internally contains 16 FP16 MACs (in different implementations, the number of FP16 MACs may be another number; the 1D tensor core is typically set to contain P2 FP16 MACs, where P2 is 2 to the power of M2 and M2 is 4, 5, or 6).
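As a minimal software sketch only (the class name LtcConfig and its fields are hypothetical and not part of the claimed hardware), the MAC-count constraints above can be expressed as follows:

```python
# Illustrative model of the MAC-count rules; all names are hypothetical.
from dataclasses import dataclass

@dataclass
class LtcConfig:
    m1: int = 10  # P1 = 2**M1 FP16 MACs in the 4D tensor core; M1 >= 8
    m2: int = 4   # P2 = 2**M2 FP16 MACs per 1D tensor core; M2 in {4, 5, 6}
    num_1d_cores: int = 2

    def __post_init__(self):
        assert self.m1 >= 8, "M1 must be not less than 8"
        assert self.m2 in (4, 5, 6), "M2 must be 4, 5, or 6"

    @property
    def fp16_macs_4d(self) -> int:
        return 2 ** self.m1           # 1024 for M1 = 10, as in this embodiment

    @property
    def int8_macs_4d(self) -> int:
        return 2 * self.fp16_macs_4d  # 2048, i.e. 2 x P1

print(LtcConfig().fp16_macs_4d, LtcConfig().int8_macs_4d)  # 1024 2048
```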
Most of the computing functions of the minimum calculation module are implemented by the 4D tensor core and the 1D tensor core, including convolution, full connection, pooling, etc. The main function of the 4D tensor core is to implement basic operations on 4D tensor data (data with dimensions (n, c, h, w)), including multiplication, addition, multiply-accumulate, maximum, and the like. The main function of the 1D tensor core is to implement basic operations on 1D data (e.g., data of dimension (w)), including multiplication, addition, various linear activation operations (e.g., ReLU), various nonlinear activation operations (e.g., sigmoid), and the like.
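For illustration, the following sketch models these primitives in NumPy under the stated (n, c, h, w) dimension convention; it is a functional reference, not the hardware datapath:

```python
import numpy as np

x = np.random.rand(1, 8, 16, 16).astype(np.float16)  # 4D tensor (n, c, h, w)
w = np.random.rand(1, 8, 16, 16).astype(np.float16)

# 4D tensor core primitives: multiplication, addition, multiply-accumulate, maximum
mul = x * w
add = x + w
acc = (x * w).sum(axis=1)    # multiply-accumulate along the channel axis
mx  = np.maximum(x, w)

# 1D tensor core primitives on data of dimension (w,)
v = x.reshape(-1)
relu    = np.maximum(v, 0)                              # linear activation (ReLU)
sigmoid = 1.0 / (1.0 + np.exp(-v.astype(np.float32)))   # nonlinear activation
```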
A cascade path is arranged between the 4D tensor core and the 1D tensor core, i.e., the output data of the 4D tensor core can be directly fed into the 1D tensor core, so that a 1D operation task can be executed immediately after a 4D operation task. Therefore, one pass of the minimum calculation module can carry several operators at once, such as convolution + ReLU, achieving higher calculation efficiency.
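A minimal sketch of the fusion enabled by the cascade path (a naive NumPy reference assuming a single-channel convolution; not the hardware implementation):

```python
import numpy as np

def conv2d_single(x, k):
    """Naive 2D convolution on one (h, w) plane; stands in for the 4D tensor core."""
    kh, kw = k.shape
    h, w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((h, w), dtype=x.dtype)
    for i in range(h):
        for j in range(w):
            out[i, j] = (x[i:i + kh, j:j + kw] * k).sum()  # multiply-accumulate
    return out

def fused_conv_relu(x, k):
    # Cascade path: the 4D result feeds the 1D core directly, with no
    # intermediate write to external memory.
    return np.maximum(conv2d_single(x, k), 0)

y = fused_conv_relu(np.random.rand(8, 8), np.random.rand(3, 3))  # one fused pass
```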
The instruction control unit is mainly used for acquiring configuration parameters and instructions for the 4D tensor core and the 1D tensor core from an external memory (the external memory may be an external SDRAM (synchronous dynamic random access memory) or another shared memory within the tensor processor chip). The configuration parameters are used to configure the 4D tensor core to a particular computing function (e.g., a convolution computing function) and the 1D tensor core to a particular computing function (e.g., a ReLU computing function). The instructions are used to control the computing flow, such as start, pause, end, etc.
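A toy control loop (the opcodes and program format are hypothetical; the actual instruction set is not specified here) illustrating how such instructions might drive the computing flow:

```python
from enum import Enum, auto

class Opcode(Enum):
    START = auto()   # configure the cores and launch computation
    PAUSE = auto()   # hold the computing flow
    END   = auto()   # terminate the flow

def run(program):
    """Each program entry is (opcode, configuration parameters)."""
    state = "idle"
    for op, params in program:
        if op is Opcode.START:
            state = f"running {params}"   # e.g. 4D core -> conv, 1D core -> ReLU
        elif op is Opcode.PAUSE:
            state = "paused"
        elif op is Opcode.END:
            return "done"
    return state

print(run([(Opcode.START, {"4d": "conv", "1d": "relu"}), (Opcode.END, None)]))
```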
The capacity of the local cache unit is set to a multiple of 32 KB, and the multiple is typically not less than 8. A typical capacity of the local cache unit is 320 KB. Its main function is to store the input data required by the 4D tensor core and, at the same time, rearrange the input data into the order required by the 4D tensor core, reducing the complexity of the subsequent calculation circuit. The 1D tensor core does not need a separate cache unit: its input may come from the output of the 4D tensor core (the 1D tensor core may also be considered to indirectly use the local cache of the 4D tensor core), or directly from an external memory, in which case the input data is computed and output directly without caching.
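The capacity rule (P3 × 32 KB with P3 not less than 8) can be checked with a small helper (illustrative only):

```python
def local_cache_bytes(p3: int) -> int:
    """Local cache capacity rule: P3 x 32 KB, with P3 an integer not less than 8."""
    assert isinstance(p3, int) and p3 >= 8, "P3 must be an integer >= 8"
    return p3 * 32 * 1024

print(local_cache_bytes(10) // 1024, "KB")  # 320 KB, the typical configuration
```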
The memory reading unit LD and the memory writing unit ST are mainly used for providing each module in the minimum calculation module with direct read and write access to the external memory. The 4D tensor core is provided with two independent memory reading units LD that respectively read the data and the parameters required by 4D calculation; the 1D tensor core is provided with two independent memory reading units LD that respectively read the data and the parameters required by 1D calculation; the instruction control unit and the special unit share one memory reading unit LD; the 1D tensor core is provided with one independent memory writing unit ST that writes out the calculation results of the 4D/1D tensor cores (the calculation results of the 4D tensor core are written out by the 1D tensor core); and the special unit is provided with one independent memory writing unit ST that writes out its own calculation results.
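The port assignment described above can be summarized as follows (a descriptive table in code form; the identifiers LD0..LD4, ST0, and ST1 are hypothetical labels, not names from the design):

```python
# Hypothetical labels for the five LD and two ST units of one LTC core.
LD_PORTS = {
    "LD0": "data required by 4D calculation",
    "LD1": "parameters required by 4D calculation",
    "LD2": "data required by 1D calculation",
    "LD3": "parameters required by 1D calculation",
    "LD4": "instruction control unit + special unit (shared)",
}
ST_PORTS = {
    "ST0": "4D/1D calculation results (written out through the 1D tensor core)",
    "ST1": "special-unit results",
}
assert len(LD_PORTS) == 5 and len(ST_PORTS) == 2
```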
The special unit is mainly used for realizing calculation functions related to coordinate transformation. It contains no multiplication, addition, or other arithmetic resources; its main function is to read input data and output it according to a new data arrangement. Typical calculation functions are Reshape, Concat, and the like.
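A NumPy sketch of this behavior, showing that the special unit only rearranges data and performs no arithmetic on the values:

```python
import numpy as np

x = np.arange(24).reshape(2, 3, 4)

reshaped = x.reshape(6, 4)                 # Reshape: same elements, new layout
concat   = np.concatenate([x, x], axis=1)  # Concat: join inputs along one axis

assert reshaped.sum() == x.sum()           # the values themselves are untouched
assert concat.shape == (2, 6, 4)
```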
Effectively organizing multiple LTC cores together forms a neural network multi-core tensor processor with larger computing power. The STC core is the first hierarchical level of organization of the neural network multi-core tensor processor, as shown in Fig. 2.
Each STC core contains L LTC cores; preferably, L is a multiple of 2, with typical values of 4 or 8. These LTC cores are identical in design, and a multi-core design is achieved by repeatedly instantiating the same LTC core. In addition, the STC core includes a shared memory (Shared Memory) and a data control unit (Streaming Unit).
The capacity of the shared memory is typically the same as, or slightly greater than, that of the local cache unit of the LTC core (e.g., a typical capacity of the local cache unit of the LTC core is 320 KB, and a typical capacity of the shared memory is 352 KB). The shared memory serves to cache input data, intermediate data, or parameters commonly required by the LTC cores within the STC core.
Depending on the calculation mode of the multi-core neural network tensor processor, each LTC core may compute on the same data with different parameters, or on different data with the same parameters. If shared data or parameters reside in the off-chip SDRAM, the data control unit can prefetch them into the shared memory in advance, and the LTC cores can then acquire the required data or parameters directly from the shared memory during calculation, thereby saving off-chip memory bandwidth. If the capacity of the shared memory is insufficient to hold the shared data or parameters, the LTC cores read the required data or parameters directly from the off-chip SDRAM during calculation.
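A minimal sketch of this prefetch policy (the 352 KB default mirrors the typical shared-memory capacity above; the function name is illustrative):

```python
def place_shared(data_bytes: int, shared_mem_bytes: int = 352 * 1024) -> str:
    """Decide where LTC cores will read shared data or parameters from."""
    if data_bytes <= shared_mem_bytes:
        return "prefetch into STC shared memory; LTC cores read it from there"
    return "leave in off-chip SDRAM; LTC cores read it directly"

print(place_shared(256 * 1024))    # fits -> shared memory, saves SDRAM bandwidth
print(place_shared(1024 * 1024))   # too large -> direct off-chip SDRAM reads
```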
The MTC core is the second hierarchical level of organization of the neural network multi-core tensor processor, as shown in Fig. 3.
Each MTC core contains S STC cores; preferably, S is a multiple of 2, with typical values of 4 or 8. These STC cores are identical in design, and a more complex design with more cores can be achieved by repeatedly instantiating the same STC core.
All STC cores in an MTC core access the same off-chip SDRAM memory through the same SDRAM controller. Thus, the calculation units within the MTC core use a three-level storage structure: LTC local cache unit -> STC shared memory -> off-chip SDRAM memory. The off-chip SDRAM memory stores all data required by all calculation units, the STC shared memory stores data required by several LTC cores, and the LTC local cache unit stores data required by the calculation of the units inside one LTC core.
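The three-level storage structure can be sketched as a simple lookup, closest level first (a software model under assumed names, not the hardware protocol):

```python
# Levels ordered from closest to the calculation unit to farthest.
HIERARCHY = ["LTC local cache", "STC shared memory", "off-chip SDRAM"]

def serve_read(addr, resident):
    """resident maps a level name to the set of addresses it currently holds."""
    for level in HIERARCHY:
        if addr in resident.get(level, set()):
            return level
    return "off-chip SDRAM"   # the backing store holds all required data

resident = {"STC shared memory": {0x1000}}
print(serve_read(0x1000, resident))  # served from STC shared memory
print(serve_read(0x2000, resident))  # falls through to off-chip SDRAM
```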
An MTC core includes a plurality of STC cores, and an STC core includes a plurality of LTC cores; the interconnection relationship among them is as follows:
An LTC core can only access its own local cache unit and cannot access the local cache units of other LTC cores;
An LTC core can access the shared memory of its own STC core, and can also access the shared memories of other STC cores;
An LTC core can access the off-chip SDRAM memory of its own MTC core, and can also access the off-chip SDRAM memories of other MTC cores;
In the MTC core, only the LTC local cache unit is exclusive to its LTC core and cannot be accessed by other LTC cores; the STC shared memory and the off-chip SDRAM memory are shared and can be accessed by any STC core in the MTC core. In the initial stage of operation of the multi-core neural network tensor processor, all data required for calculation reside in the off-chip SDRAM memory. During operation, if some data is accessed by multiple STC cores of the MTC core, that data is prefetched into the STC shared memory as needed. Likewise, during operation, some data may be prefetched into the LTC local cache unit as needed.
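These access rules can be captured in a small predicate (the coordinate scheme and names are illustrative, not part of the design):

```python
def can_access(requester, target_kind, owner):
    """requester/owner are (mtc_id, stc_id, ltc_id) coordinates.
    Local caches are private to their LTC core; shared memories and
    off-chip SDRAM memories are accessible across STC/MTC boundaries."""
    if target_kind == "local_cache":
        return requester == owner          # exclusive to the owning LTC core
    if target_kind in ("shared_memory", "sdram"):
        return True                        # own or other STC/MTC, both allowed
    raise ValueError(f"unknown target: {target_kind}")

assert can_access((0, 0, 0), "local_cache", (0, 0, 0))
assert not can_access((0, 0, 1), "local_cache", (0, 0, 0))
assert can_access((0, 0, 0), "shared_memory", (1, 3, 0))  # other MTC/STC is fine
```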
The highest-level organization of the neural network multi-core tensor processor is shown in Fig. 4.
The neural network multi-core tensor processor comprises a PCIE controller (PCI Express Controller), a plurality of MTC cores, and a plurality of SDRAM controllers (SDRAM Controller), wherein the number of MTC cores is the same as the number of SDRAM controllers.
The neural network multi-core tensor processor comprises M MTC cores, each MTC core comprises S STC cores, and each STC core comprises L LTC cores. By flexibly setting the numbers of MTC cores, STC cores, and LTC cores and combining different LTC cores, STC cores, and MTC cores, tensor processors of various computing powers can be formed, thereby realizing a tensor processor architecture with extensible computing power.
Assuming that the computing power of the LTC core is A, the computing power of the neural network multi-core tensor processor is M×S×L×A, where each of M, S, and L can be any positive integer. The neural network multi-core tensor processor therefore has a flexible computing-power extensibility characteristic, and neural network multi-core tensor processors of different computing powers can be realized by configuring M, S, and L with different values. For example, if the computing power A of the LTC core is 4 TOPS and the neural network multi-core tensor processor is configured with M=8, S=8, and L=16, then the total computing power of the tensor processor is 8×8×16×4=4096 TOPS.
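The computing-power arithmetic is direct (sketch only):

```python
def total_tops(m: int, s: int, l: int, a_tops: float) -> float:
    """Total computing power = M x S x L x A for M MTC, S STC, and L LTC cores."""
    return m * s * l * a_tops

print(total_tops(8, 8, 16, 4.0))  # 4096.0 TOPS, matching the example above
```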
While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.