
CN111090611B - Small heterogeneous distributed computing system based on FPGA - Google Patents


Info

Publication number
CN111090611B
CN111090611B
Authority
CN
China
Prior art keywords
data
module
fpga
cpu
calculation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811247613.XA
Other languages
Chinese (zh)
Other versions
CN111090611A (en)
Inventor
陈钰文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Xuehu Information Technology Co ltd
Original Assignee
Shanghai Xuehu Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Xuehu Information Technology Co ltd filed Critical Shanghai Xuehu Information Technology Co ltd
Priority to CN201811247613.XA priority Critical patent/CN111090611B/en
Publication of CN111090611A publication Critical patent/CN111090611A/en
Application granted
Publication of CN111090611B publication Critical patent/CN111090611B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7839Architectures of general purpose stored program computers comprising a single central processing unit with memory
    • G06F15/7842Architectures of general purpose stored program computers comprising a single central processing unit with memory on one IC chip (single chip microcontrollers)
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7867Architectures of general purpose stored program computers comprising a single central processing unit with reconfigurable architecture
    • G06F15/7871Reconfiguration support, e.g. configuration loading, configuration switching, or hardware OS
    • G06F15/7878Reconfiguration support, e.g. configuration loading, configuration switching, or hardware OS for pipeline reconfiguration

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Multi Processors (AREA)

Abstract

The invention discloses a small FPGA-based heterogeneous distributed computing system, belonging to the technical field of computation-intensive hardware design, which comprises a data input module, a data calculation module and a data return module. The data input module scatters and reorganizes data and sends it to the data calculation module serially, in pipelined form; the data calculation module receives data from the data input module and transmits it to the data return module; the data return module groups the out-of-order returned data according to the order in which the calculation output results of the preceding stage arrive. The invention exploits to the greatest extent the advantages of FPGA (field-programmable gate array) pipelined computation and high throughput, and is well suited to computation over large volumes of data; and the distributed core computing units adopt an FPGA cascade configurable strategy and are configured according to specific computing requirements.

Description

Small heterogeneous distributed computing system based on FPGA
Technical Field
The invention relates to the technical field of computationally intensive hardware design, in particular to a small heterogeneous distributed computing system based on an FPGA.
Background
Most existing open-source software frameworks run on an operating system, which in turn runs on a hardware platform whose core computational unit is the CPU. CPUs may be divided into different architectures such as x86, MIPS, PowerPC and ARM according to manufacturer or instruction set, but all are essentially von Neumann architectures: every operation is reduced to the execution of individual instructions, and each instruction passes through the basic stages of fetch, decode, execute, memory access and write-back to complete its life cycle. Viewed microscopically, therefore, every computation on a CPU involves a relatively complex and time-consuming instruction translation and execution process. Moreover, instructions must largely execute sequentially: the next instruction must wait for the previous one to complete, so the delays accumulated at the micro level cannot satisfy real-time, high-density computation at the macro level. Although various optimizations such as branch prediction, superscalar execution and hyper-threading have been proposed to address the CPU's computational shortcomings, they are merely optimizations; the fundamental architectural problem remains.
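The latency argument above can be made concrete with a small back-of-the-envelope sketch (an illustration, not part of the patent): with a classic five-stage instruction cycle, strictly sequential execution costs five cycles per instruction, while an ideal pipeline retires one instruction per cycle once the pipeline is full.

```python
# Illustrative sketch (not from the patent): cycle counts for a classic
# 5-stage instruction cycle (fetch, decode, execute, memory access,
# write-back), showing why sequential execution accumulates latency.

STAGES = 5  # fetch, decode, execute, memory access, write-back

def sequential_cycles(n_instructions: int) -> int:
    """Each instruction must fully retire before the next one starts."""
    return n_instructions * STAGES

def pipelined_cycles(n_instructions: int) -> int:
    """An ideal pipeline retires one instruction per cycle once full."""
    return STAGES + (n_instructions - 1)

if __name__ == "__main__":
    n = 1000
    print(sequential_cycles(n))  # 5000 cycles
    print(pipelined_cycles(n))   # 1004 cycles: roughly 5x the throughput
```

The same arithmetic is what makes the FPGA's deep pipelines attractive: the deeper the pipeline, the larger the steady-state gap over sequential execution.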
GPUs are also increasingly widely used to meet growing market demands for computation and complexity. Compared with a CPU, the GPU has data-parallel capability the CPU lacks: it can operate on data in parallel blocks, giving it a much higher data throughput and better support for high-volume streaming computation such as multimedia, image, audio and video processing. However, in most applications the GPU still runs on an operating system and must interact with the CPU, so the computing process remains wound around a CPU-centric framework, with obvious drawbacks. More critically, the GPU provides only data parallelism and cannot implement a deeply pipelined computing module: when the data entering the GPU carry dependencies across computing passes, the GPU must wait for the data of the previous pass to be fully prepared before entering the next computing pass. Thus, although data parallelism is achieved, it cannot be fully exploited; genuine computation on dependent data must still wait for the data of the previous operation to complete.
Existing distributed computing systems employ CPUs or GPUs of the von Neumann architecture in their computing units. The CPU is unsuitable for intensive data computation and better suited to task scheduling; the GPU is more efficient but still offers only data parallelism, and its instruction pipeline depth remains limited, so neither is suitable for intensive computation. Existing acceleration-oriented FPGA computing modules all use high-performance FPGA chips cascaded via the PCIe protocol to form an FPGA computing block, which imposes heavy requirements on PCB design, cost and the like; moreover, this approach limits the number of FPGAs that can be integrated, and the failure of a single FPGA in the integrated module paralyzes the whole system. Finally, at the computing nodes of existing distributed computing systems, node data is received in a CPU+NIC mode.
Based on the above, the invention designs a small heterogeneous distributed computing system based on an FPGA to solve the problems.
Disclosure of Invention
The invention aims to provide a small FPGA-based heterogeneous distributed computing system, so as to solve the problems of the prior art noted in the background: the computing units of existing distributed computing systems adopt CPUs or GPUs of the von Neumann architecture, which are unsuitable for intensive data computation, the CPU being better suited to task scheduling and the GPU, though more efficient, offering only data parallelism with a still-limited instruction pipeline depth; existing acceleration-oriented FPGA computing modules all use high-performance FPGA chips cascaded via the PCIe protocol to form an FPGA computing block, which imposes heavy requirements on PCB design, cost and the like, limits the number of FPGAs that can be integrated, and paralyzes the whole system when a single FPGA in the integrated module fails; and at the computing nodes of the distributed computing system, node data is received in a CPU+NIC mode.
In order to achieve the above purpose, the present invention provides the following technical solutions: a small heterogeneous distributed computing system based on an FPGA comprises a data input module, a data computing module and a data return module;
The data input module is used for scattering and reorganizing data and sending the data to the data calculation module in a serial form in a pipeline form;
the data calculation module is used for receiving data from the data input module and transmitting it to the data return module;
the data return module is used for grouping the out-of-order returned data according to the order in which the calculation output results of the preceding stage arrive.
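The return module's regrouping step can be pictured with a small sketch (an illustration under assumptions, not the patent's implementation): if the front stage tags each work item with a sequence number, results arriving out of order can be buffered and released in order as soon as they become contiguous.

```python
# Illustrative sketch: reorder calculation results that arrive out of
# order, using a sequence tag assumed to be attached by the front stage.
# The tagging scheme and names are assumptions for illustration.
import heapq

def reorder(results):
    """results: iterable of (seq_no, payload) in arrival order.
    Yields payloads in sequence order as soon as they are contiguous."""
    heap = []
    next_seq = 0
    for seq, payload in results:
        heapq.heappush(heap, (seq, payload))
        # release every result whose predecessors have all arrived
        while heap and heap[0][0] == next_seq:
            yield heapq.heappop(heap)[1]
            next_seq += 1

if __name__ == "__main__":
    arrived = [(2, "c"), (0, "a"), (1, "b"), (3, "d")]
    print(list(reorder(arrived)))  # ['a', 'b', 'c', 'd']
```

In hardware this buffering would live in the return module's FPGA and DDR rather than a software heap, but the grouping logic is the same.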
Preferably, the data input module includes, but is not limited to, a CPU, an FPGA and a DDR hardware module;
The FPGA module is used for receiving data and scattering and reorganizing the data;
The CPU module is directly connected to the FPGA module at high speed through the QPI protocol and is used to rapidly and dynamically configure how the FPGA module receives and transmits data.
Preferably, the data input module further comprises at least two groups of ethernet physical interfaces, and one group of ethernet physical interfaces is used for receiving data;
and the other group of the Ethernet physical interfaces is used for data forwarding.
Preferably, the data input module further comprises a reassembly pipeline module, and the ethernet physical interface for receiving data can expand serial input data into parallel data and pass it to the reassembly pipeline module.
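The serial-to-parallel expansion on the receive path can be sketched as follows (a behavioral illustration; the word width is an assumption, since the patent does not fix one):

```python
# Illustrative sketch: "expanding" a serial symbol stream into parallel
# words before handing them to a reassembly pipeline, as the input
# module's receive path is described as doing. The width of 8 is an
# assumed example value.

def serial_to_parallel(stream, width=8):
    """Group a serial stream of symbols into parallel words of `width`."""
    word = []
    for symbol in stream:
        word.append(symbol)
        if len(word) == width:
            yield tuple(word)
            word = []
    if word:  # trailing partial word, zero-padded
        yield tuple(word + [0] * (width - len(word)))

if __name__ == "__main__":
    bits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
    print(list(serial_to_parallel(bits, width=8)))
    # [(1, 0, 1, 1, 0, 0, 1, 0), (1, 1, 0, 0, 0, 0, 0, 0)]
```

In the FPGA this is a deserializer register; the point of the model is only that each parallel word carries `width` serial symbols per word-clock.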
Preferably, the data calculation module comprises at least one group of data calculation units, and the data calculation units comprise a single group of FPGA, DDR and at least two groups of ethernet physical interfaces.
Preferably, the data return module comprises a post-stage processing module, and the post-stage processing module is used to improve data throughput by deep-pipelining the reorganized data.
Compared with the prior art, the invention has the following beneficial effects: it exploits to the greatest extent the advantages of FPGA pipelined computation and high throughput, and is well suited to computation over large volumes of data; the distributed core computing units adopt an FPGA cascade configurable strategy and are configured according to specific computing requirements; in the data distribution module and the data return module, the FPGA communicates with the CPU over the QPI bus, so the CPU can directly access the FPGA's memory controller and directly instruct the FPGA to read and write data, saving a great deal of time compared with the traditional mode in which the CPU and the FPGA share memory; and the network protocol stack is implemented in the FPGA, which transmits and receives network packets directly, saving the large amount of decoding the CPU would otherwise perform during transmission and checking and improving overall transmit/receive time by an order of magnitude.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a diagram of the overall framework of a distributed heterogeneous computing system of the present invention;
FIG. 2 is a diagram of a distributed heterogeneous computing system hardware framework in accordance with the present invention;
FIG. 3 is a block diagram of the embodiment of FIG. 2 of the present invention;
FIG. 4 is an enlarged view of the left end of FIG. 3 in accordance with the present invention;
FIG. 5 is an enlarged view of the right end connection of FIG. 4 in accordance with the present invention;
FIG. 6 is an enlarged view of the right end connection of FIG. 5 in accordance with the present invention;
FIG. 7 is an enlarged view of the right end connection of FIG. 6 in accordance with the present invention;
fig. 8 is a block diagram of a data computing unit according to the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1-8, the present invention provides a technical solution: a small heterogeneous distributed computing system based on an FPGA comprises a data input module, a data computing module and a data return module;
The data input module is used for scattering and reorganizing data and sending the data to the data calculation module in a serial form in a pipeline form;
the data calculation module is used for receiving data from the data input module and transmitting it to the data return module;
the data return module is used for grouping the out-of-order returned data according to the order in which the calculation output results of the preceding stage arrive.
It should be noted that the system consists of three parts: the data input module, the data calculation module and the data return module. The input module is composed of hardware modules such as a CPU, an FPGA and DDR. After input data reaches the input module over the network, it is received directly by the FPGA, then scattered, reorganized and forwarded in pipelined fashion. The CPU in the input module is directly connected to the FPGA at high speed through the QPI protocol; the CPU rapidly and dynamically configures the FPGA's policies for receiving and transmitting data, but does not directly participate in receiving, transmitting, verifying or reorganizing the data. The FPGA in the input module implements a complete TCP/IP protocol stack internally and is externally configured with one group (two in total) of Ethernet physical interfaces, one dedicated to receiving data and the other dedicated to forwarding data. At the data receiving end, serial input data is expanded into parallel data and passed to the reassembly pipeline module; at the data output end, before forwarding, the parallel output of the reassembly pipeline module is converted back to serial data by local frequency multiplication. The reorganized data is distributed serially to the subsequent computing modules at a rate several times higher than the input rate. The computing module is composed of a group of computing units, each of which is a single FPGA with DDR and two Ethernet physical interfaces.
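The "local frequency multiplication" step has simple rate bookkeeping behind it, sketched below (an illustrative model; the clock figures are assumed example values, not from the patent): if the reassembly pipeline emits words of `width` symbols per input clock, the serializer clock must run at `width` times the input clock to re-serialize without back-pressure.

```python
# Illustrative sketch: rate relation for the parallel-to-serial step
# behind "local frequency multiplication". Clock values are assumed
# examples, not figures from the patent.

def serializer_clock_mhz(input_clock_mhz: float, width: int) -> float:
    """Minimum output clock for lossless parallel-to-serial conversion."""
    return input_clock_mhz * width

def parallel_to_serial(words):
    """Flatten parallel words back into a serial symbol stream."""
    for word in words:
        yield from word

if __name__ == "__main__":
    # e.g. a 125 MHz word clock with 8-symbol words needs a 1 GHz bit clock
    print(serializer_clock_mhz(125.0, 8))            # 1000.0
    print(list(parallel_to_serial([(1, 0), (1, 1)])))  # [1, 0, 1, 1]
```

This is also why the text can claim the reorganized data leaves the module "several times higher than the input rate": the multiplied output clock makes that headroom available.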
The data distributed from the input module reaches each computing unit after passing through a switch. The computing unit receives the data through an IP core that implements a TCP/IP protocol stack internally, passes it to the dedicated computation IP core, and after computation forwards the result to the post-stage return module through an Ethernet interface. The hardware composition of the data return module is the same as that of the data input module, except that its FPGA returns data out of order; specifically, it groups the data according to the order in which the calculation output results of the preceding computing module arrive.
In still further embodiments, the data input module includes, but is not limited to, a CPU, an FPGA, and a DDR hardware module;
The FPGA module is used for receiving data and scattering and reorganizing the data;
The CPU module is directly connected to the FPGA module at high speed through the QPI protocol and is used to rapidly and dynamically configure how the FPGA module receives and transmits data.
In a further embodiment, the data input module further includes at least two groups of ethernet physical interfaces, and one group of ethernet physical interfaces is configured to receive data;
and the other group of the Ethernet physical interfaces is used for data forwarding.
In a further embodiment, the data input module further includes a reassembly pipeline module, and the ethernet physical interface for receiving data can expand serial input data into parallel data and pass it to the reassembly pipeline module.
In a further embodiment, the data computing module includes at least one group of data computing units, and the data computing units include a single group of FPGA, DDR, and at least two groups of ethernet physical interfaces.
In a further embodiment, the data return module includes a post-stage processing module, and the post-stage processing module is configured to improve data throughput by deep-pipelining the reorganized data;
as shown in fig. 2, the hardware framework of the distributed heterogeneous computing system designed by the present invention includes a front-end data distribution module, data computing units and a data return unit. Fig. 3 is a specific design of fig. 2. The data distribution module adopts a CPU+FPGA architecture, the CPU and the FPGA being connected through a PCIe or QPI bus. Front-end network data is input to the data distribution module through a router or switch and is cached by the FPGA in the data distribution module together with its cascaded DDR. If the subsequent computing module does not need the data reorganized, the FPGA distributes the cached data directly in parallel through its internally integrated data-distribution IP unit. If the subsequent FPGA computing unit needs the data reorganized before computation, the data is passed serially from the FPGA cache module to the data reorganization module and then forwarded to the subsequent computing unit. If the reorganization is complex and the reorganization strategy must be changed dynamically, the operations required by the reorganization can be converted into instructions for the MIG module in the FPGA and sent directly to the FPGA over the PCIe or QPI bus connecting the CPU and the FPGA, so that the FPGA can change the data-reorganization strategy quickly while still caching data efficiently. The data computing section is composed entirely of multiple single-FPGA units, their total number allocated dynamically according to the actual computation or communication tasks. Computation within each single FPGA is completed by a dedicated internal IP core.
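The idea of converting a reorganization strategy into instructions pushed over the direct CPU-FPGA bus can be sketched as a fixed command encoding (entirely hypothetical: the patent does not disclose its MIG instruction format, so the field layout, opcode and helper names below are illustrative assumptions):

```python
# Hypothetical sketch: packing a dynamically chosen reorganization
# strategy into a fixed-size command word, in the spirit of the CPU
# converting reorganization operations into MIG-style instructions and
# sending them to the FPGA over PCIe/QPI. The 12-byte layout is an
# assumption for illustration, not the patent's actual format.
import struct

# opcode (1 B) | source offset (4 B) | length (4 B) | stride (2 B) | flags (1 B)
CMD_FMT = ">BIIHB"
OP_REORDER = 0x01

def encode_reorder_cmd(src_offset, length, stride, interleave=False):
    flags = 0x01 if interleave else 0x00
    return struct.pack(CMD_FMT, OP_REORDER, src_offset, length, stride, flags)

def decode_cmd(cmd_bytes):
    op, off, length, stride, flags = struct.unpack(CMD_FMT, cmd_bytes)
    return {"op": op, "offset": off, "length": length,
            "stride": stride, "interleave": bool(flags & 0x01)}

if __name__ == "__main__":
    cmd = encode_reorder_cmd(0x1000, 4096, 64, interleave=True)
    print(len(cmd))        # 12
    print(decode_cmd(cmd))
```

A fixed-width encoding like this is what makes the strategy cheap to update on the fly: the CPU writes one small word over the bus instead of reconfiguring the fabric.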
The hardware composition of the data return unit is consistent with that of the data distribution module; the differences lie in the MIG instructions the CPU transmits to the FPGA and in the specific design and implementation of the FPGA's internal result-reorganization module and result-return module.
As shown in fig. 3, the data distribution module is cascaded with the data computing module through one switch or other network device, and the data computing units are cascaded with the post-stage data return module through another. Two groups of network devices are used in order to fully match the deep pipeline structure inside the computing unit module and to guarantee the system's high data throughput.
As shown in figs. 4-7, the hardware architecture of the data distribution and data return modules is the same: two network physical interfaces, which may be RJ45, ST or SC, are provided at the periphery of the FPGA. The data distribution module receives network computing data through one port, reorganizes the data in a deep pipeline through a dedicated internal IP core, and then forwards it to the post-stage processing module through the other port; the dual ports and deep pipelining greatly increase the data throughput rate. For the data return module, the design of the dedicated internal IP core differs from that of the data receiving module: its function is to repack the calculation results that arrive out of order according to rules, attach labels, and return them to the subsequent modules. Both the data distribution and data return modules implement the network protocol stack inside the FPGA. As shown in fig. 8, the data computing unit consists of a single FPGA plus dual network interfaces. According to actual requirements, a computing unit may be deployed as a single node, or units may be partially interconnected into a star or ring network depending on the complexity of the computing task; the local network so formed, together with the other nodes, constitutes the computing-unit portion of the computing system. The computing-unit portion is thus dynamically configured so that its structure matches the needs of the task. Inside each computing-unit node, a dedicated IP core performs parallel pipelined computation.
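The star-or-ring interconnect choice for the computing-unit nodes can be sketched as adjacency lists (an illustrative model; the node names and the decision to represent links this way are assumptions, since the patent leaves the concrete wiring to the deployment):

```python
# Illustrative sketch: building star or ring interconnects for a set of
# computing-unit nodes, mirroring the configurable topology the patent
# describes. Node names are illustrative assumptions.

def star_topology(nodes):
    """Hub-and-spoke: the first node is the hub, linked to every other."""
    hub, *spokes = nodes
    return {hub: list(spokes), **{s: [hub] for s in spokes}}

def ring_topology(nodes):
    """Each node links to its two neighbours on a closed ring."""
    n = len(nodes)
    return {nodes[i]: [nodes[(i - 1) % n], nodes[(i + 1) % n]]
            for i in range(n)}

if __name__ == "__main__":
    fpgas = ["fpga0", "fpga1", "fpga2", "fpga3"]
    print(star_topology(fpgas))  # hub fpga0 linked to fpga1..fpga3
    print(ring_topology(fpgas))  # each node linked to its two neighbours
```

A star suits tasks with a natural aggregation point; a ring suits tasks whose stages pass intermediate results to a neighbour, which is consistent with the deep-pipeline emphasis of the design.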
In the description of the present specification, the descriptions of the terms "one embodiment," "example," "specific example," and the like, mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The preferred embodiments of the invention disclosed above are intended only to assist in the explanation of the invention. The preferred embodiments are not exhaustive or to limit the invention to the precise form disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best understand and utilize the invention. The invention is limited only by the claims and the full scope and equivalents thereof.

Claims (6)

1. A small heterogeneous distributed computing system based on FPGA, characterized in that: the system comprises a data input module, a data calculation module and a data return module;
The data input module is used for scattering and reorganizing data and sending the data to the data calculation module in a serial form in a pipeline form;
The data calculation module is connected with the data input module and is used for transmitting data to the data return module;
the data return module groups the out-of-order returned data according to the order in which the calculation output results of the data calculation module arrive;
The system consists of three parts: the data input module, the data calculation module and the data return module. The input module consists of a CPU, an FPGA and a DDR hardware module. Input data, after being transmitted to the input module over the network, is received directly by the FPGA, then scattered, reorganized and forwarded in pipelined fashion. The CPU in the input module is directly connected to the FPGA at high speed through the QPI protocol; the CPU rapidly and dynamically configures the FPGA's policies for receiving and transmitting data, but does not directly participate in receiving, verifying or reorganizing the data. The FPGA in the input module implements a complete TCP/IP protocol stack internally and is externally configured with one group of Ethernet physical interfaces, one dedicated to receiving data and the other dedicated to forwarding data. At the data receiving end, serial input data is expanded into parallel data and passed to the reassembly pipeline module; at the data output end, before forwarding, the parallel output of the reassembly pipeline module is converted to serial data by local frequency multiplication, and the reorganized data is distributed serially to the subsequent computing module at a rate several times higher than the input rate;
The hardware framework of the system comprises a front-end data distribution module, data computing units and a data return unit. The data distribution module adopts a CPU+FPGA architecture, the CPU and the FPGA being connected through a PCIe or QPI bus; front-end network data is input to the data distribution module through a router or switch and is cached jointly by the FPGA in the data distribution module and its cascaded DDR. If the subsequent computing module does not need the data reorganized, the FPGA distributes the cached data directly in parallel through its internally integrated data-distribution IP unit; if the subsequent FPGA computing unit needs the data reorganized before computation, the data is passed serially from the FPGA cache module to the data reorganization module and then forwarded to the subsequent computing unit; if the reorganization is complex and the reorganization strategy must be changed dynamically, the operations required by the reorganization are converted into instructions for the MIG module in the FPGA and sent directly to the FPGA over the PCIe or QPI bus connecting the CPU and the FPGA, so that the FPGA can change the data-reorganization strategy quickly while still caching data efficiently.
2. The FPGA-based small heterogeneous distributed computing system of claim 1, wherein: the data input module comprises a CPU, an FPGA and a DDR hardware module;
The FPGA module is used for receiving data and scattering and reorganizing the data;
The CPU module is directly connected to the FPGA module at high speed through the QPI protocol and is used to rapidly and dynamically configure how the FPGA module receives and transmits data.
3. The FPGA-based small heterogeneous distributed computing system of claim 2, wherein: the data input module also comprises at least two groups of Ethernet physical interfaces, and one group of Ethernet physical interfaces is used for receiving data; and the other group of the Ethernet physical interfaces is used for data forwarding.
4. A FPGA-based small heterogeneous distributed computing system as claimed in claim 3, wherein: the data input module further comprises a reorganization pipeline module, and the Ethernet physical interface for receiving data is used for expanding serial input data into parallel data and transmitting the parallel data to the reorganization pipeline module.
5. The FPGA-based small heterogeneous distributed computing system of claim 1, wherein: the data computing module comprises at least one group of data computing units, and the data computing units comprise a single group of FPGA, DDR and at least two groups of Ethernet physical interfaces.
6. The FPGA-based small heterogeneous distributed computing system of claim 1, wherein: the data return module comprises a post-stage processing module, and the post-stage processing module is used to improve data throughput by deep-pipelining the reorganized data.
CN201811247613.XA 2018-10-24 2018-10-24 Small heterogeneous distributed computing system based on FPGA Active CN111090611B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811247613.XA CN111090611B (en) 2018-10-24 2018-10-24 Small heterogeneous distributed computing system based on FPGA

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811247613.XA CN111090611B (en) 2018-10-24 2018-10-24 Small heterogeneous distributed computing system based on FPGA

Publications (2)

Publication Number Publication Date
CN111090611A CN111090611A (en) 2020-05-01
CN111090611B true CN111090611B (en) 2024-08-27

Family

ID=70392706

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811247613.XA Active CN111090611B (en) 2018-10-24 2018-10-24 Small heterogeneous distributed computing system based on FPGA

Country Status (1)

Country Link
CN (1) CN111090611B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114531459B (en) * 2020-11-03 2024-05-07 深圳市明微电子股份有限公司 Cascade device parameter self-adaptive acquisition method, device, system and storage medium
CN114138481A (en) * 2021-11-26 2022-03-04 浪潮电子信息产业股份有限公司 Data processing method, device and medium
CN114595185A (en) * 2022-02-25 2022-06-07 山东云海国创云计算装备产业创新中心有限公司 A multi-CPU system and its communication method

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107145467A (en) * 2017-05-13 2017-09-08 贾宏博 A kind of distributed computing hardware system in real time

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8059650B2 (en) * 2007-10-31 2011-11-15 Aruba Networks, Inc. Hardware based parallel processing cores with multiple threads and multiple pipeline stages
CH705650B1 (en) * 2007-11-12 2013-04-30 Supercomputing Systems Ag Parallel computer system, method for parallel processing of data.
CN104657330A (en) * 2015-03-05 2015-05-27 浪潮电子信息产业股份有限公司 High-performance heterogeneous computing platform based on x86 architecture processor and FPGA
CN106339351B (en) * 2016-08-30 2019-05-10 浪潮(北京)电子信息产业有限公司 An SGD algorithm optimization system and method
US10558575B2 (en) * 2016-12-30 2020-02-11 Intel Corporation Processors, methods, and systems with a configurable spatial accelerator
CN107066802B (en) * 2017-01-25 2018-05-15 人和未来生物科技(长沙)有限公司 A kind of heterogeneous platform calculated towards gene data
US20180239725A1 (en) * 2017-02-17 2018-08-23 Intel Corporation Persistent Remote Direct Memory Access
CN107273331A (en) * 2017-06-30 2017-10-20 山东超越数控电子有限公司 A kind of heterogeneous computing system and method based on CPU+GPU+FPGA frameworks
CN108563808B (en) * 2018-01-05 2020-12-04 中国科学技术大学 Design Method of Heterogeneous Reconfigurable Graph Computation Accelerator System Based on FPGA
CN108052839A (en) * 2018-01-25 2018-05-18 知新思明科技(北京)有限公司 Mimicry task processor


Also Published As

Publication number Publication date
CN111090611A (en) 2020-05-01

Similar Documents

Publication Title
CN103345461B (en) Based on the polycaryon processor network-on-a-chip with accelerator of FPGA
CN111090611B (en) Small heterogeneous distributed computing system based on FPGA
CN106250103A (en) A kind of convolutional neural networks cyclic convolution calculates the system of data reusing
US10140124B2 (en) Reconfigurable microprocessor hardware architecture
US7069372B1 (en) Processor having systolic array pipeline for processing data packets
CN101488922B (en) Network-on-chip router with adaptive routing capability and its implementation method
CN108537331A (en) A kind of restructural convolutional neural networks accelerating circuit based on asynchronous logic
JP7389231B2 (en) synchronous network
CN102306371B (en) Hierarchical parallel modular sequence image real-time processing device
Haghi et al. A reconfigurable compute-in-the-network fpga assistant for high-level collective support with distributed matrix multiply case study
CN114138707B (en) A Data Transmission System Based on FPGA
CN113114593B (en) Dual-channel router in network on chip and routing method thereof
CN104035896B (en) Off-chip accelerator applicable to fusion memory of 2.5D (2.5 dimensional) multi-core system
CN106941488B (en) Multi-layer protocol packet encapsulation device and method based on FPGA
Jonna et al. Minimally buffered single-cycle deflection router
Wang et al. A flexible high speed star network based on peer to peer links on FPGA
Su et al. Technology trends in large-scale high-efficiency network computing
CN113704169B (en) Embedded configurable many-core processor
Ueno et al. VCSN: Virtual circuit-switching network for flexible and simple-to-operate communication in HPC FPGA cluster
Zhu et al. BiLink: A high performance NoC router architecture using bi-directional link with double data rate
Mahafzah et al. Performance evaluation of broadcast and global combine operations in all-port wormhole-routed OTIS-Mesh interconnection networks
RU2830044C1 (en) Vector computing device
Pande et al. Performance optimization for system-on-chip using network-on-chip and data compression
Bertozzi et al. An asynchronous soft macro for ultra-low power communication in neuromorphic computing
CN110516800A (en) Deep learning network application distributed self-assembly instruction processor core, processor, circuit and processing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant