Disclosure of Invention
In order to solve the above problems, the invention provides a multi-core neural network tensor processor with extensible computing power. The tensor processor adopts the design idea of modularized multiplexing: a minimum calculation unit is designed once and multiplexed to form structured calculation modules, thereby realizing a multi-core neural network tensor processor architecture with extensible computing power. Because the multi-core neural network tensor processor adopts this modularized multiplexing method, the complexity of design and verification can be greatly reduced. Through simple module combination, configuration schemes for neural network tensor processors of various computing powers can be realized, so that the design complexity of high-computing-power neural network tensor processors is greatly reduced, the design and verification cycle is shortened, and ultimately the design requirements of neural network tensor processors of various computing powers can be met at minimum design and verification cost.
The invention provides a multi-core neural network tensor processor with extensible computing power, which comprises a PCIE controller, M MTC cores, and M SDRAM controllers, wherein each MTC core comprises S STC cores, each STC core comprises L LTC cores, the LTC cores are the minimum calculation modules, all LTC cores in the same multi-core neural network tensor processor are configured as identical functional modules, the PCIE controller is used for realizing access control between the multi-core neural network tensor processor and external equipment, and each SDRAM controller is used for accessing the off-chip SDRAM memory of the corresponding MTC core.
Further preferably, S is a multiple of 2 and L is a multiple of 2.
Further, the LTC core comprises a 4D tensor core, a 1D tensor core, an instruction control unit, a local cache unit, a memory reading unit, a memory writing unit, and a special unit, wherein the 4D tensor core is used for realizing basic operations on 4D tensor data, the 1D tensor core is used for realizing basic operations on 1D data, the instruction control unit is used for acquiring configuration parameters and instructions for the 4D tensor core and the 1D tensor core from an external memory, the local cache unit is used for storing the input data required by the 4D tensor core, the memory reading unit and the memory writing unit are used for providing each module in the minimum calculation module with direct read and write access to an external memory, and the special unit is used for realizing calculation functions related to coordinate transformation.
Further preferably, the LTC core comprises one 4D tensor core and two 1D tensor cores.
Further, the 4D tensor core contains P1 FP16 MACs and 2×P1 INT8 MACs.
Further preferably, P1 is 2 to the power of M1, where M1 is not less than 8.
Further, the 1D tensor core internally contains P2 FP16 MACs.
Further preferably, P2 is 2 to the power of M2, where M2 is 4, 5, or 6.
Further, the basic operations on 4D tensor data comprise multiplication, addition, multiply-accumulate, and maximum, and the basic operations on 1D data comprise multiplication, addition, linear activation operations, and nonlinear activation operations.
Further, a cascade path is arranged between the 4D tensor core and the 1D tensor core, and is used for directly inputting output data of the 4D tensor core into the 1D tensor core.
Further, the memory reading and writing units are configured as follows: the 4D tensor core is provided with two independent memory reading units that respectively read the data and the parameters required by 4D calculation; the 1D tensor core is provided with two independent memory reading units that respectively read the data and the parameters required by 1D calculation; the instruction control unit and the special unit share one memory reading unit; the 1D tensor core is provided with an independent memory writing unit that writes out the calculation results of the 4D/1D tensor cores; and the special unit is provided with an independent memory writing unit that writes out its own calculation results.
Further, the configuration parameters are used for configuring the 4D tensor core and the 1D tensor core each to a specific computing function, and the instructions are used for controlling the computing flow.
Further preferably, the capacity of the local cache unit is P3×32 KB, where P3 is an integer not less than 8.
Further, the STC core of the multi-core neural network tensor processor further comprises a shared memory and a data control unit, wherein the shared memory is used for caching input data, intermediate data, or parameters commonly required by the LTC cores within the STC core, and the data control unit is used for prefetching shared data or shared parameters residing in the off-chip SDRAM into the shared memory in advance.
Further, the interconnection relationship among the MTC cores, STC cores, and LTC cores is as follows: an LTC core can only access its own local cache unit and cannot access the local cache units of other LTC cores; an LTC core can access the shared memory of its own STC core or the shared memories of other STC cores; and an LTC core can access the off-chip SDRAM memory of its own MTC core or the off-chip SDRAM memories of other MTC cores.
Further, the capacity of the shared memory is greater than or equal to the capacity of the local cache unit.
The invention realizes the following technical effects:
(1) The multi-core neural network tensor processor adopts a modularized multiplexing design scheme, in which the minimum calculation modules are repeatedly instantiated and combined to form a neural network tensor processor of a given computing-power specification.
(2) In the multi-core neural network tensor processor, by flexibly setting the numbers of MTC cores, STC cores, and LTC cores and combining different LTC cores, STC cores, and MTC cores, tensor processors of various computing powers can be formed, thereby realizing a tensor processor architecture with extensible computing power.
Detailed Description
To further illustrate the various embodiments, accompanying drawings are provided. The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate the embodiments and, together with the description, serve to explain their principles. With reference to these materials, one of ordinary skill in the art will understand other possible embodiments and advantages of the present invention.
The invention will now be further described with reference to the drawings and detailed description.
The invention provides a multi-core neural network tensor processor with extensible computing power. The tensor processor adopts a modularized multiplexing design scheme, in which the minimum calculation modules are repeatedly instantiated and combined to form a neural network tensor processor of a given computing-power specification.
Let the computing power of the minimum calculation module be A. Computing-power extensibility means that the computing-power specification of the multi-core neural network tensor processor depends on the number N of minimum calculation modules actually instantiated, the total computing power being equal to A×N, where N can be any positive integer.
Although the computing power is extensible, the neural network computing functionality is unchanged across multi-core neural network tensor processors of different computing powers. The functionality of the multi-core neural network tensor processor is determined by the functionality of the minimum calculation module, and a neural network tensor processor with N equal to 1 is functionally identical to one with N equal to any other positive integer.
The minimum calculation module (LTC core) is composed of a 4D tensor core (4D Tensor Core), a 1D tensor core (1D Tensor Core), an instruction control unit (Instruction Unit), a local cache unit (Local Memory), a memory reading unit (LD), a memory writing unit (ST), and a special unit (Recut); its internal structure is shown in Fig. 1.
In this embodiment, the minimum calculation module includes one 4D tensor core (4D Tensor Core) and two 1D tensor cores (1D Tensor Core) (in different implementations, the number of 1D tensor cores may be another number, for example, 1). The 4D tensor core contains 1024 FP16 MACs and 2048 INT8 MACs (in different implementations, the numbers of FP16 MACs and INT8 MACs may be other numbers; the 4D tensor core is typically set to contain P1 FP16 MACs and 2×P1 INT8 MACs, where P1 is 2 to the power of M1 and M1 is not less than 8). The 1D tensor core internally contains 16 FP16 MACs (in different implementations, the number of FP16 MACs may be another number; the 1D tensor core is typically set to contain P2 FP16 MACs, where P2 is 2 to the power of M2 and M2 is 4, 5, or 6).
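As a minimal software sketch only (the class name LtcConfig and its fields are hypothetical and not part of the claimed hardware), the MAC-count constraints above can be expressed as follows:

```python
# Illustrative model of the MAC-count rules; all names are hypothetical.
from dataclasses import dataclass

@dataclass
class LtcConfig:
    m1: int = 10  # P1 = 2**M1 FP16 MACs in the 4D tensor core; M1 >= 8
    m2: int = 4   # P2 = 2**M2 FP16 MACs per 1D tensor core; M2 in {4, 5, 6}
    num_1d_cores: int = 2

    def __post_init__(self):
        assert self.m1 >= 8, "M1 must be not less than 8"
        assert self.m2 in (4, 5, 6), "M2 must be 4, 5, or 6"

    @property
    def fp16_macs_4d(self) -> int:
        return 2 ** self.m1           # 1024 for M1 = 10, as in this embodiment

    @property
    def int8_macs_4d(self) -> int:
        return 2 * self.fp16_macs_4d  # 2048, i.e. 2 x P1

print(LtcConfig().fp16_macs_4d, LtcConfig().int8_macs_4d)  # 1024 2048
```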
Most of the computing functions of the minimum calculation module are implemented by the 4D tensor core and the 1D tensor core, including convolution, full connection, pooling, etc. The main function of the 4D tensor core is to implement basic operations on 4D tensor data (data with dimensions (n, c, h, w)), including multiplication, addition, multiply-accumulate, maximum, and the like. The main function of the 1D tensor core is to implement basic operations on 1D data (e.g., data of dimension (w)), including multiplication, addition, various linear activation operations (e.g., ReLU), various nonlinear activation operations (e.g., sigmoid), and the like.
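For illustration, the following sketch models these primitives in NumPy under the stated (n, c, h, w) dimension convention; it is a functional reference, not the hardware datapath:

```python
import numpy as np

x = np.random.rand(1, 8, 16, 16).astype(np.float16)  # 4D tensor (n, c, h, w)
w = np.random.rand(1, 8, 16, 16).astype(np.float16)

# 4D tensor core primitives: multiplication, addition, multiply-accumulate, maximum
mul = x * w
add = x + w
acc = (x * w).sum(axis=1)    # multiply-accumulate along the channel axis
mx  = np.maximum(x, w)

# 1D tensor core primitives on data of dimension (w,)
v = x.reshape(-1)
relu    = np.maximum(v, 0)                              # linear activation (ReLU)
sigmoid = 1.0 / (1.0 + np.exp(-v.astype(np.float32)))   # nonlinear activation
```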
A cascade path is arranged between the 4D tensor core and the 1D tensor core, i.e., the output data of the 4D tensor core can be directly fed into the 1D tensor core, so that a 1D operation task can be executed immediately after a 4D operation task. Therefore, one pass of the minimum calculation module can carry several operators at once, such as convolution + ReLU, achieving higher calculation efficiency.
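A minimal sketch of the fusion enabled by the cascade path (a naive NumPy reference assuming a single-channel convolution; not the hardware implementation):

```python
import numpy as np

def conv2d_single(x, k):
    """Naive 2D convolution on one (h, w) plane; stands in for the 4D tensor core."""
    kh, kw = k.shape
    h, w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((h, w), dtype=x.dtype)
    for i in range(h):
        for j in range(w):
            out[i, j] = (x[i:i + kh, j:j + kw] * k).sum()  # multiply-accumulate
    return out

def fused_conv_relu(x, k):
    # Cascade path: the 4D result feeds the 1D core directly, with no
    # intermediate write to external memory.
    return np.maximum(conv2d_single(x, k), 0)

y = fused_conv_relu(np.random.rand(8, 8), np.random.rand(3, 3))  # one fused pass
```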
The instruction control unit is mainly used for acquiring configuration parameters and instructions for the 4D tensor core and the 1D tensor core from an external memory (the external memory may be an external SDRAM (synchronous dynamic random access memory) or another shared memory within the tensor processor chip). The configuration parameters are used to configure the 4D tensor core to a particular computing function (e.g., a convolution computing function) and the 1D tensor core to a particular computing function (e.g., a ReLU computing function). The instructions are used to control the computing flow, such as start, pause, end, etc.
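A toy control loop (the opcodes and program format are hypothetical; the actual instruction set is not specified here) illustrating how such instructions might drive the computing flow:

```python
from enum import Enum, auto

class Opcode(Enum):
    START = auto()   # configure the cores and launch computation
    PAUSE = auto()   # hold the computing flow
    END   = auto()   # terminate the flow

def run(program):
    """Each program entry is (opcode, configuration parameters)."""
    state = "idle"
    for op, params in program:
        if op is Opcode.START:
            state = f"running {params}"   # e.g. 4D core -> conv, 1D core -> ReLU
        elif op is Opcode.PAUSE:
            state = "paused"
        elif op is Opcode.END:
            return "done"
    return state

print(run([(Opcode.START, {"4d": "conv", "1d": "relu"}), (Opcode.END, None)]))
```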
The capacity of the local cache unit is set to a multiple of 32 KB, and the multiple is typically not less than 8. A typical capacity of the local cache unit is 320 KB. Its main function is to store the input data required by the 4D tensor core and, at the same time, rearrange the input data into the order required by the 4D tensor core, reducing the complexity of the subsequent calculation circuit. The 1D tensor core does not need a separate cache unit: its input may come from the output of the 4D tensor core (the 1D tensor core may also be considered to indirectly use the local cache of the 4D tensor core), or directly from an external memory, in which case the input data is computed and output directly without caching.
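The capacity rule (P3 × 32 KB with P3 not less than 8) can be checked with a small helper (illustrative only):

```python
def local_cache_bytes(p3: int) -> int:
    """Local cache capacity rule: P3 x 32 KB, with P3 an integer not less than 8."""
    assert isinstance(p3, int) and p3 >= 8, "P3 must be an integer >= 8"
    return p3 * 32 * 1024

print(local_cache_bytes(10) // 1024, "KB")  # 320 KB, the typical configuration
```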
The memory reading unit LD and the memory writing unit ST are mainly used for providing each module in the minimum calculation module with direct read and write access to the external memory. The 4D tensor core is provided with two independent memory reading units LD that respectively read the data and the parameters required by 4D calculation; the 1D tensor core is provided with two independent memory reading units LD that respectively read the data and the parameters required by 1D calculation; the instruction control unit and the special unit share one memory reading unit LD; the 1D tensor core is provided with one independent memory writing unit ST that writes out the calculation results of the 4D/1D tensor cores (the calculation results of the 4D tensor core are written out by the 1D tensor core); and the special unit is provided with one independent memory writing unit ST that writes out its own calculation results.
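The port assignment described above can be summarized as follows (a descriptive table in code form; the identifiers LD0..LD4, ST0, and ST1 are hypothetical labels, not names from the design):

```python
# Hypothetical labels for the five LD and two ST units of one LTC core.
LD_PORTS = {
    "LD0": "data required by 4D calculation",
    "LD1": "parameters required by 4D calculation",
    "LD2": "data required by 1D calculation",
    "LD3": "parameters required by 1D calculation",
    "LD4": "instruction control unit + special unit (shared)",
}
ST_PORTS = {
    "ST0": "4D/1D calculation results (written out through the 1D tensor core)",
    "ST1": "special-unit results",
}
assert len(LD_PORTS) == 5 and len(ST_PORTS) == 2
```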
The special unit is mainly used for realizing calculation functions related to coordinate transformation. It contains no multiplication, addition, or other arithmetic resources; its main function is to read input data and output it according to a new data arrangement. Typical calculation functions are Reshape, Concat, and the like.
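A NumPy sketch of this behavior, showing that the special unit only rearranges data and performs no arithmetic on the values:

```python
import numpy as np

x = np.arange(24).reshape(2, 3, 4)

reshaped = x.reshape(6, 4)                 # Reshape: same elements, new layout
concat   = np.concatenate([x, x], axis=1)  # Concat: join inputs along one axis

assert reshaped.sum() == x.sum()           # the values themselves are untouched
assert concat.shape == (2, 6, 4)
```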
Effectively organizing multiple LTC cores together forms a neural network multi-core tensor processor with larger computing power. The STC core is the first hierarchical level of organization of the neural network multi-core tensor processor, as shown in Fig. 2.
Each STC core contains L LTC cores; preferably, L is a multiple of 2, with typical values of 4 or 8. These LTC cores are identical in design, and a multi-core design is achieved by repeatedly instantiating the same LTC core. In addition, the STC core includes a shared memory (Shared Memory) and a data control unit (Streaming Unit).
The capacity of the shared memory is typically the same as, or slightly greater than, that of the local cache unit of the LTC core (e.g., a typical capacity of the local cache unit of the LTC core is 320 KB, and a typical capacity of the shared memory is 352 KB). The shared memory serves to cache input data, intermediate data, or parameters commonly required by the LTC cores within the STC core.
Depending on the calculation mode of the multi-core neural network tensor processor, each LTC core may compute on the same data with different parameters, or on different data with the same parameters. If shared data or parameters reside in the off-chip SDRAM, the data control unit can prefetch them into the shared memory in advance, and the LTC cores can then acquire the required data or parameters directly from the shared memory during calculation, thereby saving off-chip memory bandwidth. If the capacity of the shared memory is insufficient to hold the shared data or parameters, the LTC cores read the required data or parameters directly from the off-chip SDRAM during calculation.
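A minimal sketch of this prefetch policy (the 352 KB default mirrors the typical shared-memory capacity above; the function name is illustrative):

```python
def place_shared(data_bytes: int, shared_mem_bytes: int = 352 * 1024) -> str:
    """Decide where LTC cores will read shared data or parameters from."""
    if data_bytes <= shared_mem_bytes:
        return "prefetch into STC shared memory; LTC cores read it from there"
    return "leave in off-chip SDRAM; LTC cores read it directly"

print(place_shared(256 * 1024))    # fits -> shared memory, saves SDRAM bandwidth
print(place_shared(1024 * 1024))   # too large -> direct off-chip SDRAM reads
```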
The MTC core is the second hierarchical level of organization of the neural network multi-core tensor processor, as shown in Fig. 3.
Each MTC core contains S STC cores; preferably, S is a multiple of 2, with typical values of 4 or 8. These STC cores are identical in design, and a more complex design with more cores can be achieved by repeatedly instantiating the same STC core.
All STC cores in an MTC core access the same off-chip SDRAM memory through the same SDRAM controller. Thus, the calculation units within the MTC core use a three-level storage structure: LTC local cache unit -> STC shared memory -> off-chip SDRAM memory. The off-chip SDRAM memory stores all data required by all calculation units, the STC shared memory stores data required by several LTC cores, and the LTC local cache unit stores data required by the calculation of the units inside one LTC core.
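The three-level storage structure can be sketched as a simple lookup, closest level first (a software model under assumed names, not the hardware protocol):

```python
# Levels ordered from closest to the calculation unit to farthest.
HIERARCHY = ["LTC local cache", "STC shared memory", "off-chip SDRAM"]

def serve_read(addr, resident):
    """resident maps a level name to the set of addresses it currently holds."""
    for level in HIERARCHY:
        if addr in resident.get(level, set()):
            return level
    return "off-chip SDRAM"   # the backing store holds all required data

resident = {"STC shared memory": {0x1000}}
print(serve_read(0x1000, resident))  # served from STC shared memory
print(serve_read(0x2000, resident))  # falls through to off-chip SDRAM
```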
An MTC core includes a plurality of STC cores, and an STC core includes a plurality of LTC cores; the interconnection relationship among them is as follows:
An LTC core can only access its own local cache unit and cannot access the local cache units of other LTC cores;
An LTC core can access the shared memory of its own STC core, and can also access the shared memories of other STC cores;
An LTC core can access the off-chip SDRAM memory of its own MTC core, and can also access the off-chip SDRAM memories of other MTC cores;
In the MTC core, only the LTC local cache unit is exclusive to its LTC core and cannot be accessed by other LTC cores; the STC shared memory and the off-chip SDRAM memory are shared and can be accessed by any STC core in the MTC core. In the initial stage of operation of the multi-core neural network tensor processor, all data required for calculation reside in the off-chip SDRAM memory. During operation, if some data is accessed by multiple STC cores of the MTC core, that data is prefetched into the STC shared memory as needed. Likewise, during operation, some data may be prefetched into the LTC local cache unit as needed.
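These access rules can be captured in a small predicate (the coordinate scheme and names are illustrative, not part of the design):

```python
def can_access(requester, target_kind, owner):
    """requester/owner are (mtc_id, stc_id, ltc_id) coordinates.
    Local caches are private to their LTC core; shared memories and
    off-chip SDRAM memories are accessible across STC/MTC boundaries."""
    if target_kind == "local_cache":
        return requester == owner          # exclusive to the owning LTC core
    if target_kind in ("shared_memory", "sdram"):
        return True                        # own or other STC/MTC, both allowed
    raise ValueError(f"unknown target: {target_kind}")

assert can_access((0, 0, 0), "local_cache", (0, 0, 0))
assert not can_access((0, 0, 1), "local_cache", (0, 0, 0))
assert can_access((0, 0, 0), "shared_memory", (1, 3, 0))  # other MTC/STC is fine
```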
The highest-level organization of the neural network multi-core tensor processor is shown in Fig. 4.
The neural network multi-core tensor processor comprises a PCIE controller (PCI Express Controller), a plurality of MTC cores, and a plurality of SDRAM controllers (SDRAM Controller), wherein the number of MTC cores is the same as the number of SDRAM controllers.
The neural network multi-core tensor processor comprises M MTC cores, each MTC core comprises S STC cores, and each STC core comprises L LTC cores. By flexibly setting the numbers of MTC cores, STC cores, and LTC cores and combining different LTC cores, STC cores, and MTC cores, tensor processors of various computing powers can be formed, thereby realizing a tensor processor architecture with extensible computing power.
Assuming that the computing power of the LTC core is A, the computing power of the neural network multi-core tensor processor is M×S×L×A, where each of M, S, and L can be any positive integer. The neural network multi-core tensor processor therefore has a flexible computing-power extensibility characteristic, and neural network multi-core tensor processors of different computing powers can be realized by configuring M, S, and L with different values. For example, if the computing power A of the LTC core is 4 TOPS and the neural network multi-core tensor processor is configured with M=8, S=8, and L=16, then the total computing power of the tensor processor is 8×8×16×4=4096 TOPS.
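The computing-power arithmetic is direct (sketch only):

```python
def total_tops(m: int, s: int, l: int, a_tops: float) -> float:
    """Total computing power = M x S x L x A for M MTC, S STC, and L LTC cores."""
    return m * s * l * a_tops

print(total_tops(8, 8, 16, 4.0))  # 4096.0 TOPS, matching the example above
```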
While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.