Disclosure of Invention
To address the technical problems of high programming complexity and relatively low energy efficiency when a traditional FPGA carries out highly complex computing tasks, the invention aims to provide a highly programmable processor array network.
To solve the foregoing technical problem, the present invention provides a highly programmable processor array network, including:
a processor array having a plurality of processor units distributed in an array, each processor unit having a processor node and a register group, wherein adjacent processor nodes exchange data through the register groups; and
a dynamic random access memory layer having one or more layers of dynamic random access memory, the dynamic random access memory layer being stacked on top of the processor array, and each dynamic random access memory being connected to each processor node.
Optionally, in the highly programmable processor array network as described above, the processor array has 2^n of the processor units distributed in an array, where n is a natural number;
when n is even, the processor array has 2^(n/2) rows and 2^(n/2) columns;
when n is odd, the processor array has 2^((n-1)/2) rows and 2^((n+1)/2) columns.
Optionally, in the highly programmable processor array network as described above, in the single processor unit, the register group is located around the processor node, and the register group performs data interaction with other adjacent processor units in four directions respectively.
Optionally, in a highly programmable processor array network as described above, in a single said processor unit, said processor node and said register group are connected by a metal layer copper interconnect;
in the adjacent processor units, the two adjacent register groups are connected through metal layer copper interconnection.
Optionally, in the highly programmable processor array network as described above, in a single processor unit, memory controller nodes are disposed around the processor node, and the processor node exchanges data with the register groups in the four directions through the memory controller nodes in the four directions, respectively.
Optionally, in the highly programmable processor array network as described above, in a single processor unit, the processor node and the memory controller node are connected by a metal layer copper interconnect, and the memory controller node and the register group are connected by a metal layer copper interconnect;
in the adjacent processor units, the two adjacent register groups are connected through metal layer copper interconnection.
Optionally, in a highly programmable processor array network as described previously, the processor nodes comprise a central processor (Central Processing Unit, CPU), a graphics processor (Graphics Processing Unit, GPU), a tensor processor (Tensor Processing Unit, TPU), a data processor (Data Processing Unit, DPU), an image processor (Image Processing Unit, IPU) or a neural network processor (Neural-network Processing Unit, NPU).
Optionally, in a highly programmable processor array network as described above, the projection of the processor array fully coincides with the region of the dynamic random access memory layer.
Optionally, in a highly programmable processor array network as described above, the processor node is connected to the dynamic random access memory layer through the register group.
Optionally, in the highly programmable processor array network as described above, the register group and the dynamic random access memory layer are connected by a metal layer copper interconnect.
Optionally, in the highly programmable processor array network as described above, the dynamic random access memory layer has a separate memory space belonging to each of the processor nodes.
Optionally, in a highly programmable processor array network as described above, each of the processor nodes shares all address space within the dynamic random access memory layer.
The invention has the positive progress effects that:
1. The invention is designed in the form of a processor array, the size of which can be freely customized during packaging, so that a single large wafer-scale chip can be cut to size as required.
2. All processor nodes in the invention are equal and decentralized, so that every processor node in the network can act as a host, with no master-slave division. Each processor node has its own memory management, control logic and external communication capabilities. Any node can initiate computing tasks, manage data transfers, and control the collaboration of other nodes, without relying on a single central controller.
3. The invention provides a higher level of abstraction by using the processor node as a basic unit, reduces programming complexity, and facilitates development of complex algorithms and applications. The fine control and dynamic resource management of the processor nodes can effectively reduce idle power consumption and improve the energy efficiency ratio when processing specific tasks.
Detailed Description
Other advantages and effects of the present invention will become readily apparent to those skilled in the art from the disclosure of this specification, which describes the invention by way of specific embodiments. The invention may also be practiced or applied in other, different embodiments, and the details in this specification may be modified or varied in various ways without departing from the spirit and scope of the present invention.
It should be noted that the following embodiments and features in the embodiments may be combined with each other without conflict.
In the description of the present invention, it should be noted that orientation terms such as "outer," "middle," "inner," and the like indicate orientations and positional relationships based on those shown in the drawings; they are used only for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the apparatus or element referred to must have a specific orientation or be constructed and operated in a specific orientation, and therefore should not be construed as limiting the scope of protection of the present invention.
Furthermore, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of such features, and in the description of the present invention, "a plurality" means two or more unless otherwise specifically defined.
Referring to figs. 1-3, embodiments of the present invention provide a highly programmable processor array network (CPPA: Configurable Processor Array Network), in which the basic cells of an FPGA are changed from a logic gate array to small processor nodes (e.g., CPU cores or GPU cores), forming a highly programmable processor array network that achieves a higher level of computational abstraction and improved energy efficiency.
The highly programmable processor array network of the present invention comprises a processor array 1 and a dynamic random access memory layer 2.
The processor array 1 has a plurality of processor units 11 distributed in an array; each processor unit 11 has a processor node 111 and a register group 112, and adjacent processor nodes 111 exchange data with one another through the register groups 112.
The dynamic random access memory layer 2 has one or several layers of dynamic random access memories 21 (Dynamic Random Access Memory, DRAM), the dynamic random access memory layer 2 is stacked on the upper layer of the processor array 1, and each dynamic random access memory 21 is connected to each processor node 111.
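For illustration only, this overall structure can be sketched as a small data model; the class and attribute names below are assumptions introduced for this sketch and are not part of the invention itself.

```python
from dataclasses import dataclass, field

@dataclass
class RegisterGroup:
    """Register group 112: buffers the data exchanged with a neighbouring unit."""
    words: list = field(default_factory=lambda: [0] * 64)  # width chosen arbitrarily

@dataclass
class ProcessorUnit:
    """Processor unit 11: one processor node 111 plus its register group 112."""
    node_id: int
    register_group: RegisterGroup = field(default_factory=RegisterGroup)

@dataclass
class DramLayer:
    """Dynamic random access memory layer 2, stacked above the array and reachable from every node."""
    size_bytes: int

class ProcessorArrayNetwork:
    """Processor array 1: processor units in rows x cols, sharing one stacked DRAM layer."""
    def __init__(self, rows: int, cols: int, dram_bytes: int):
        self.units = [[ProcessorUnit(node_id=r * cols + c) for c in range(cols)]
                      for r in range(rows)]
        self.dram = DramLayer(size_bytes=dram_bytes)
```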
The invention is designed in the form of a processor array, the size of which can be freely customized during packaging, so that a single large wafer-scale chip can be cut to size as required.
The CPPA is constructed from a multi-core processor node core layer and a plurality of DRAM layers through three-dimensional stacking technology. All the processor nodes achieve efficient data exchange and optimized use of computing resources by sharing the DRAM space stacked above them, which is particularly suitable for data-intensive tasks such as AI and big data processing.
Each processor node is designed as an autonomous unit with its own memory management, control logic and external communication capabilities. Each processor node is equipped with complete computing resources and control logic, and is capable of executing program code independently, managing local memory, and interacting with peripherals. This means that any processor node can initiate computing tasks, manage data transfers, and control the collaboration of other nodes, without relying on a single central controller.
Task scheduling and allocation in the invention are no longer controlled by a single host, but can be implemented by a distributed task scheduling algorithm running on the network, which balances task allocation based on the current load of the processor nodes, resource availability, and task characteristics. The processor nodes communicate directly with one another through a standardized protocol, without routing through a central controller. Each processor node can dynamically decide to accept new tasks, delegate tasks to other nodes, or request resources according to its current load, resource state and task queue, thereby realizing a true peer-to-peer computing environment.
Specifically, memory resources can be dynamically allocated through a resource scheduling system with a distributed task scheduling algorithm, data distribution is optimized according to task requirements and running conditions, data copying and migration are reduced, and data access efficiency is improved.
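As an illustration of this peer-to-peer decision, the following sketch shows how a node might score itself and its neighbours before accepting or delegating a task; the scoring function, field names and weights are assumptions made for this example and do not represent the invention's exact scheduling algorithm.

```python
from dataclasses import dataclass

@dataclass
class NodeState:
    node_id: int
    load: float          # current utilisation, 0.0 (idle) to 1.0 (fully busy)
    free_memory: int     # bytes of local DRAM space still available

@dataclass
class Task:
    memory_needed: int   # working-set size in bytes

def choose_executor(task: Task, local: NodeState, neighbours: list[NodeState]) -> NodeState:
    """Pick the node that should run the task: the local node or a neighbour."""
    def score(node: NodeState) -> float:
        # Lower is better: penalise load, heavily penalise nodes that cannot
        # hold the task's working set.
        penalty = 0.0 if node.free_memory >= task.memory_needed else 10.0
        return node.load + penalty
    return min([local] + neighbours, key=score)
```

A node receiving a task would run such a decision locally and either execute the task itself or forward it to the chosen neighbour, so no central host is involved.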
In some embodiments, the processor array 1 has 2^n processor units 11 distributed in an array, where n is a natural number. When n is an even number, the processor array 1 has 2^(n/2) rows and 2^(n/2) columns. When n is an odd number, the processor array 1 has 2^((n-1)/2) rows and 2^((n+1)/2) columns.
This design makes it convenient to cut the processor array 1 into the required network size as needed.
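For illustration, the row/column rule above can be expressed as a small helper (the function name is hypothetical):

```python
def array_shape(n: int) -> tuple[int, int]:
    """Rows and columns of a processor array with 2**n units."""
    if n % 2 == 0:
        rows = cols = 2 ** (n // 2)
    else:
        rows = 2 ** ((n - 1) // 2)
        cols = 2 ** ((n + 1) // 2)
    assert rows * cols == 2 ** n
    return rows, cols

# Examples: array_shape(4) -> (4, 4); array_shape(5) -> (4, 8).
```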
In some embodiments, referring to fig. 1, in a single processor unit 11, a register group 112 is located around a processor node 111, and the register group 112 performs data interaction with other adjacent processor units 11 in four directions, respectively.
In some embodiments, in a single processor unit 11, the processor node 111 and the register group 112 are connected by a metal layer copper interconnect.
In adjacent processor units 11, the two adjacent register groups 112 are connected through metal layer copper interconnects.
The metal layer copper interconnects ensure low-latency data transmission.
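One way to picture the four-direction exchange over these interconnects is the following sketch, in which two adjacent units share one register group object; the direction names and the port dictionary are modelling assumptions, not the physical interface.

```python
DIRECTIONS = ("north", "east", "south", "west")

class SharedRegisterGroup:
    """Register group sitting between two adjacent processor units."""
    def __init__(self):
        self.value = None

    def write(self, data):      # producing unit side
        self.value = data

    def read(self):             # consuming unit side
        return self.value

class ArrayUnit:
    def __init__(self, node_id: int):
        self.node_id = node_id
        self.ports = {}         # direction -> SharedRegisterGroup

def link_neighbours(unit_a: ArrayUnit, unit_b: ArrayUnit,
                    dir_a_to_b: str, dir_b_to_a: str) -> None:
    """Attach one shared register group between two adjacent units."""
    shared = SharedRegisterGroup()
    unit_a.ports[dir_a_to_b] = shared
    unit_b.ports[dir_b_to_a] = shared

# Usage: link_neighbours(left, right, "east", "west") lets `left` write data
# that `right` then reads from its western register group.
```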
In some embodiments, referring to fig. 2, in a single processor unit 11, memory controller (Memory Controller, MC) nodes 113 are disposed around the processor node 111, and the processor node 111 exchanges data with the register groups 112 in the four directions through the memory controller nodes 113 in the four directions, respectively.
In some embodiments, referring to FIG. 3, in a single processor unit 11, processor node 111 is connected to memory controller node 113 through a metal layer copper interconnect, and memory controller node 113 is connected to register group 112 through a metal layer copper interconnect.
In adjacent processor units 11, the two adjacent register groups 112 are connected through metal layer copper interconnects.
In some embodiments, the processor node 111 includes a central processor (Central Processing Unit, CPU), a graphics processor (Graphics Processing Unit, GPU), a tensor processor (Tensor Processing Unit, TPU), a data processor (Data Processing Unit, DPU), an image processor (Image Processing Unit, IPU), or a neural network processor (Neural-network Processing Unit, NPU).
Therefore, the CPPA mainly comprising a many-core CPU/GPU/TPU/DPU/IPU/NPU core layer and a plurality of DRAM layers can be constructed by adopting the three-dimensional stacking technology.
In some embodiments, the projection of the processor array 1 coincides completely with the area of the dynamic random access memory layer 2 above it.
In some embodiments, referring to fig. 3, the processor node 111 is connected to the dynamic random access memory layer 2 through the register group 112.
In some embodiments, the register group 112 is connected to the dynamic random access memory layer 2 through a metal layer copper interconnect.
Specifically, when the dynamic random access memory layer 2 consists of a plurality of layers of dynamic random access memory 21, the register group 112 is connected to the dynamic random access memory 21 of each layer through metal layer copper interconnects.
In some embodiments, in order to achieve decentralization between the processor nodes, each processor node 111 has an independent memory space of its own within the dynamic random access memory layer 2. This independent memory space is controlled and allocated by the local memory management circuit of that processor node.
In some embodiments, in order to achieve high-speed data communication and strong data coupling between the processor nodes, each processor node 111 shares the entire address space within the dynamic random access memory layer 2; that is, all processor nodes 111 can access all addresses within the dynamic random access memory layer 2.
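A minimal sketch contrasting these two memory organisations, assuming equal-sized partitions in the private case (the class and method names are illustrative only):

```python
class DramAddressMap:
    """Address visibility of the stacked DRAM layer, in private or shared mode."""
    def __init__(self, total_bytes: int, num_nodes: int, shared: bool):
        self.total_bytes = total_bytes
        self.shared = shared
        self.slice_size = total_bytes // num_nodes   # equal partitions assumed

    def accessible_range(self, node_id: int) -> range:
        if self.shared:
            # Shared mode: every node may address the entire DRAM layer.
            return range(0, self.total_bytes)
        # Private mode: each node owns a non-overlapping partition, managed
        # by its local memory management circuit.
        start = node_id * self.slice_size
        return range(start, start + self.slice_size)
```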
In some embodiments, in the CPPA of the present invention, a high-efficiency NoC (network on chip) may be integrated to optimize the transfer of data between the processor node and the DRAM, supporting high-bandwidth, low-latency communications. The NoC adopts a dynamic routing algorithm to intelligently adjust the data flow direction according to the network condition and the task priority, so that the data processing parallelism is improved. The NoC architecture adopts a multi-dimensional ring or two-dimensional grid topology, so that data can be transmitted in multiple directions, and transmission bottlenecks are reduced. The dynamic routing algorithm is based on a distributed self-adaptive mechanism, collects network traffic information and task priority indexes, and dynamically plans an optimal path for data transmission by combining a first-in first-out (FIFO) queue with a priority queue management strategy. In addition, the flow control mechanism embedded in the NoC can prevent data packet collision and congestion, and further improves the fluency and efficiency of data transmission.
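The routing and queuing behaviour described above can be illustrated, in simplified form, by the sketch below: XY-style adaptive routing on a two-dimensional mesh, plus a per-output priority queue that stays FIFO within a priority level. This is a generic adaptive scheme used only as an example, not the invention's exact routing algorithm.

```python
import heapq

def next_hop(cur, dst, congested_links):
    """Pick the next (x, y) hop on a 2D mesh, avoiding congested links."""
    (cx, cy), (dx, dy) = cur, dst
    options = []
    if dx != cx:
        options.append((cx + (1 if dx > cx else -1), cy))
    if dy != cy:
        options.append((cx, cy + (1 if dy > cy else -1)))
    # Prefer any productive hop whose outgoing link is not congested.
    for hop in options:
        if (cur, hop) not in congested_links:
            return hop
    return options[0] if options else cur  # fall back if all links are congested

class OutputPort:
    """Per-output queue: (priority, seq) keeps FIFO order within one priority level."""
    def __init__(self):
        self._q, self._seq = [], 0

    def push(self, packet, priority: int):
        heapq.heappush(self._q, (priority, self._seq, packet))
        self._seq += 1

    def pop(self):
        return heapq.heappop(self._q)[2]
```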
In some embodiments, in the CPPA of the present invention, the microarchitecture of the processor nodes may be customized at a fine-grained level using a hardware description language (HDL), for example the instruction set, cache sizes, and interconnect topology.
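As an illustration only, such HDL-level parameters could be captured in a per-node configuration record like the following; the field names and default values are assumptions made for this sketch, not parameters defined by the invention.

```python
from dataclasses import dataclass

@dataclass
class NodeMicroarchConfig:
    isa: str = "rv64imac"         # e.g. a RISC-V instruction set variant
    l1_cache_kib: int = 32
    l2_cache_kib: int = 256
    interconnect: str = "mesh2d"  # e.g. "mesh2d" or "ring"
```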
The present invention has been described in detail above with reference to the embodiments shown in the drawings, and those skilled in the art can make various modifications to the invention based on the above description. Accordingly, certain details of the embodiments should not be construed as limiting the invention, which is defined by the scope of the appended claims.