Disclosure of Invention
Aiming at the above three problems, the invention provides a ROCE communication and transmission method based on non-HOST main memory. The method adopts an FPGA as the ROCE network card and expands 4 groups of DDR with a size of 4GB each, which overcomes the defect that the traditional ROCE communication and transmission method is poorly suited to high-performance scenes with concurrent transmission and reading/writing; it further uses 100GbE MAC/PHY resources inside the FPGA network card to instantiate 4 paths of 100GbE ports, so that multipath 100GbE data distribution scenes can be supported and the problem of an insufficient maximum bandwidth upper limit is solved.
The method is realized by the following technical scheme:
A ROCE communication and transmission method based on non-HOST main memory adopts an FPGA as the ROCE network card, the FPGA network card and a CPU are interconnected by a PCIe bus, and 4 groups of DDR are expanded outwards through BANK resources in the FPGA network card. The method comprises the following steps: S1, request end software is triggered according to an application program and adds a WQE task to an SQ queue; the request end of the FPGA network card parses the WQE task to obtain the virtual address of the source data storage, the virtual address of the destination data storage, the data length, the key and the operation code; the request end of the FPGA network card queries an MTT table according to the virtual address of the source data storage, converts the storage domain address into the FPGA DDR address to obtain the physical address of the data storage in the FPGA network card, copies the data to be transmitted into the FPGA network card by a DMA operation, and packages the data according to the protocol to obtain the corresponding data packet. S2, after the request end of the FPGA network card sends the data packet to the 100GbE MAC/PHY sending interface, the data packet is sent through the sending interface to the response end of the FPGA network card and parsed to obtain the application data, the virtual address of the destination data storage, the data length and the key; the response end queries an MR address conversion table according to the virtual address of the destination data storage to obtain the corresponding physical address, writes the application data into the physical address for storage, and decrypts the application data with the key. S3, after receiving the data packet sent by the request end of the FPGA network card, the response end of the FPGA network card replies an ACK message to the request end of the FPGA network card; after receiving the ACK message, the request end of the FPGA network card sends the ACK message to the RDMA message analysis module for analysis and obtains the analysis result; the request end of the FPGA network card identifies the ACK based on the analysis result and then adds a completion marking element corresponding to the WQE task to the CQ queue, and after the request end software polls the completion marking element, task completion information is obtained and the application program is informed of the current sending result.
Aiming at the limitations of the traditional ROCE communication and transmission method, the invention provides a ROCE communication and transmission method based on non-HOST main memory, which can support multipath 100GbE data distribution scenes, solve the problem of an insufficient maximum bandwidth upper limit, expand the DDR space and the number of memory channels of the CPU, construct a multi-channel DDR scene of simultaneous reading and writing, and improve the efficiency of concurrent execution.
Preferably, in step S1, the MTT table is a data structure used to manage memory pages in the RDMA transfer and provides a mapping mechanism from virtual addresses to physical addresses. By querying the MTT table, the conversion between the storage domain address and the FPGA DDR address can be carried out rapidly and efficiently, and the physical address of the data storage in the FPGA network card expansion DDR is obtained.
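For illustration only, the following C sketch shows one way such an MTT lookup could be organized; the structure fields, the 4 KiB page size and the linear search are assumptions made for the sketch and do not describe the concrete table layout used in the FPGA network card.

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative MTT entry: one virtual page of a registered buffer mapped
 * to one physical page in the FPGA-attached expansion DDR. The field
 * names and the 4 KiB page size are assumptions for this sketch. */
#define MTT_PAGE_SHIFT 12u
#define MTT_PAGE_SIZE  (1u << MTT_PAGE_SHIFT)

struct mtt_entry {
    uint64_t virt_page;   /* virtual page number of the registered buffer */
    uint64_t phys_page;   /* page number inside the FPGA expansion DDR    */
};

/* Translate a virtual address to an FPGA DDR physical address by scanning
 * the MTT; a real design would index or hash the table instead. */
int mtt_translate(const struct mtt_entry *mtt, size_t entries,
                  uint64_t vaddr, uint64_t *paddr)
{
    uint64_t vpage  = vaddr >> MTT_PAGE_SHIFT;
    uint64_t offset = vaddr & (MTT_PAGE_SIZE - 1);

    for (size_t i = 0; i < entries; i++) {
        if (mtt[i].virt_page == vpage) {
            *paddr = (mtt[i].phys_page << MTT_PAGE_SHIFT) | offset;
            return 0;     /* hit: physical address in the expansion DDR */
        }
    }
    return -1;            /* miss: the address was never registered */
}
```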
Preferably, in step S1, when the storage domain address and the FPGA DDR address are converted, address mapping between the FPGA network card expansion DDR space and the CPU bus domain is completed by the CPU allocating the BAR space in the PCIe scanning stage, so that the CPU directly accesses the mapping address of the FPGA network card in the storage space and access control of the FPGA network card expansion DDR in the PCIe bus space is completed. 4 groups of DDR with a size of 4GB each are expanded through the FPGA network card, which expands the DDR space and the number of memory channels of the CPU, constructs a multi-channel DDR scene of simultaneous reading and writing, and improves the concurrent execution efficiency.
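As a hedged illustration of how host software can reach a BAR that the CPU assigned during the PCIe scanning stage, the following Linux user-space sketch maps the BAR through its sysfs resource file; the PCI device address, the BAR index and the 256MB window size are assumptions for the sketch, not the driver of the invention.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical PCI address of the FPGA network card; BAR0 is assumed
     * to expose a 256MB window of the card's expansion DDR. */
    const char *bar0 = "/sys/bus/pci/devices/0000:3b:00.0/resource0";
    const size_t bar_size = 256UL * 1024 * 1024;

    int fd = open(bar0, O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    volatile uint32_t *ddr = mmap(NULL, bar_size, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
    if (ddr == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    ddr[0] = 0xA5A5A5A5u;                  /* CPU write lands in FPGA DDR */
    printf("readback: 0x%08x\n", ddr[0]);  /* CPU read of the same word   */

    munmap((void *)ddr, bar_size);
    close(fd);
    return 0;
}
```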
Preferably, when the storage domain address and the FPGA DDR address are converted, a Host bridge needs to be arranged between the storage domain address and the PCIe bus domain. The Host bridge maps the address space of the CPU bus domain to the address space of the PCIe bus, thereby realizing communication between the CPU and the PCIe bus, and is also responsible for data transmission between the CPU and the PCIe bus. Communication between the CPU and the PCIe bus is completed through the Host bridge, and the CPU thereby accesses the PCIe bus address space.
Preferably, when the BAR space is allocated, access to the expansion DDR space of the FPGA network card is realized by adopting a window switching mechanism based on the PCIe BAR size of the CPU. When the DDR totals 16GB, the window switching mechanism in the FPGA network card divides the 16GB DDR into 64 BAR subspaces of 256MB each, and the lower 34 bits of the 64-bit PCIe address mapped by the Host bridge are taken, wherein bits 0-27 determine the lower address within the DDR and bits 28-33 indicate the subspace currently accessed. Through the window switching mechanism, the CPU can access the whole DDR space of the FPGA network card.
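The bit split described above can be illustrated with a minimal C sketch; the macro names and the example address are assumptions, but the arithmetic follows directly from the stated layout (bits 0-27 as the 256MB in-window offset, bits 28-33 as the window index).

```c
#include <stdint.h>
#include <stdio.h>

/* Decompose the lower 34 bits of the Host-bridge-mapped PCIe address:
 * bits 0-27 give the offset inside one 256MB window (2^28 = 256MB) and
 * bits 28-33 select one of the 64 windows (2^6 = 64), covering 16GB. */
#define WIN_SHIFT 28u
#define WIN_COUNT 64u
#define WIN_SIZE  (1ULL << WIN_SHIFT)      /* 256MB per window */
#define DDR_SIZE  (WIN_COUNT * WIN_SIZE)   /* 16GB in total    */

void split_ddr_address(uint64_t pcie_addr, uint32_t *window, uint64_t *offset)
{
    uint64_t low34 = pcie_addr & (DDR_SIZE - 1);  /* keep the lower 34 bits */
    *window = (uint32_t)(low34 >> WIN_SHIFT);     /* bits 28-33: window id  */
    *offset = low34 & (WIN_SIZE - 1);             /* bits 0-27: low address */
}

int main(void)
{
    uint32_t win;
    uint64_t off;
    split_ddr_address(0x3C0001000ULL, &win, &off);  /* example DDR address */
    printf("window=%u offset=0x%llx\n", win, (unsigned long long)off);
    return 0;                                       /* window=60 offset=0x1000 */
}
```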
Preferably, in step S2, the MR address conversion table is used for memory management in RDMA and records the address conversion relation of the memory; after the response end of the FPGA network card obtains the virtual address of the destination data storage, it obtains the corresponding physical address by using the MR address conversion table. The MR address conversion table accurately and efficiently ensures that the virtual address is correctly mapped to the physical address, thereby realizing process isolation and memory protection.
Compared with the prior art, the invention has the following beneficial effects:
According to the technical scheme, a high-performance FPGA is adopted as the ROCE network card, 4 groups of DDR with a size of 4GB each are expanded, and 100GbE MAC/PHY resources are used inside the FPGA network card to instantiate 4 paths of 100GbE ports. The transmission bandwidth resources of the bottom layer can therefore be utilized to the maximum extent, the throughput bandwidth of the system is improved, the number of PCIe data-movement operations is reduced, and the transmission delay is effectively reduced, so that the method is suitable for real-time-sensitive transmission scenes and meets the communication bandwidth requirement of multipath parallel transmission scenes; meanwhile, the DDR space and the number of memory channels of the CPU are expanded, a multi-channel DDR scene of simultaneous reading and writing is constructed, and the concurrent execution efficiency is improved.
Detailed Description
The following describes the technical solution in the embodiments of the present invention in detail with reference to the accompanying drawings.
As shown in fig. 1, which is a schematic diagram of the architecture design of the ROCE communication and transmission method based on non-HOST main memory, a high-performance FPGA is adopted as the ROCE network card, the FPGA network card and a CPU are interconnected by PCIe 3.0 ×16, and 4 groups of DDR are expanded through BANK resources in the FPGA network card, each group of DDR having a size of 4GB and a rate of 2666MT/s; meanwhile, multipath 100GbE ROCE communication ports are instantiated by using 100GbE MAC/PHY resources in the FPGA network card, the ports comprising Port0, Port1, Port2 and Port3, which are used for RDMA communication. RDMA is a network communication technology that allows the memory of a computer system to directly access the memory of other computer systems without CPU participation; it can reduce the communication delay and the CPU load and improve the network communication efficiency. The BANK resources in the FPGA network card are obtained by dividing the input and output pins of the FPGA into a plurality of groups; the interface standard of each group is determined by its interface voltage VCCO, each group has only one interface voltage VCCO, and the VCCO of different groups can differ. This division enables the FPGA to conveniently manage and adapt to various electrical standards and improves the flexibility and applicability of the FPGA. The 100GbE MAC/PHY refers to the media access control (MAC) controller and the physical layer chip used for 100G Ethernet communication and mainly comprises the MAC controller and the physical layer chip.
As shown in FIG. 4, which is a flow chart of RDMA data communication in the ROCE communication and transmission method based on non-HOST main memory, during RDMA data communication a WQE task is added to the SQ queue through the software driver; the request end of the FPGA network card parses the WQE task, obtains the data information, and packages it according to the protocol to obtain the corresponding data packet; the request end of the FPGA network card sends the data packet to the response end of the FPGA network card, which parses the data packet and writes the data into the physical address for storage; after receiving the data packet sent by the request end of the FPGA network card, the response end returns an ACK message to the request end of the FPGA network card, the ACK message is parsed, and the request end of the FPGA network card informs the application program of the analysis result.
The data communication flow of the method specifically comprises the following steps:
S1, the request end of the FPGA network card is triggered according to the application program, and a WQE task is added to the SQ queue; the WQE task identifies the virtual address of the source data stored in the DDR, the virtual address of the destination data storage, the data length, the key and the operation code. The request end of the FPGA network card then takes the WQE task out of the SQ queue and parses it to obtain the virtual address of the source data storage, the virtual address of the destination data storage, the data length, the key and the operation code. Using the operation code, the request end of the FPGA network card queries the MTT table according to the virtual address of the source data storage and converts the storage domain address into the FPGA network card DDR address; the CPU directly accesses the mapping address of the FPGA network card in the storage space, access control of the FPGA network card expansion DDR is completed, and the physical address of the data storage in the FPGA network card expansion DDR is obtained. The data to be transmitted is copied into the FPGA network card through a DMA operation and packaged according to the protocol to obtain the corresponding data packet. The DDR space is the storage space provided by the double-data-rate synchronous dynamic random access memory and is used for storing data; it improves the data transmission rate by carrying out data transmission twice in one clock period. The SQ queue is a send queue used for storing data packets to be sent; in RDMA data communication, the two ends of each data channel each have a pair of queues including an SQ queue. The WQE task is a work queue element, mainly used for describing a work request in RDMA data communication. The DMA operation is a direct memory access operation, a data transmission mode in a computer system that can improve the data transmission efficiency.
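For illustration only, a minimal C sketch of a WQE and of posting it to an SQ ring is given below; the field widths, the queue depth and the ring-buffer layout are assumptions for the sketch rather than the concrete format used by the FPGA network card.

```c
#include <stdint.h>

/* Illustrative WQE carrying the fields named in step S1: the virtual
 * address of the source data, the virtual address of the destination data
 * storage, the data length, the key and the operation code. Field widths
 * and the ring-buffer SQ below are assumptions for the sketch. */
enum wqe_opcode { WQE_RDMA_WRITE = 0, WQE_RDMA_READ = 1, WQE_SEND = 2 };

struct wqe {
    uint64_t src_vaddr;   /* virtual address of the source data           */
    uint64_t dst_vaddr;   /* virtual address of the destination storage   */
    uint32_t length;      /* data length in bytes                         */
    uint32_t key;         /* key protecting the destination memory region */
    uint8_t  opcode;      /* operation code, e.g. WQE_RDMA_WRITE          */
};

#define SQ_DEPTH 256u     /* assumed send-queue depth (power of two) */

struct send_queue {
    struct wqe ring[SQ_DEPTH];
    uint32_t   head;      /* advanced by the FPGA request end */
    uint32_t   tail;      /* advanced by the host software    */
};

/* Software side of step S1: append a WQE to the SQ; the FPGA request end
 * later takes it out, parses the fields and starts the MTT lookup and the
 * DMA copy into the expansion DDR. */
int sq_post(struct send_queue *sq, const struct wqe *task)
{
    if (sq->tail - sq->head == SQ_DEPTH)
        return -1;                           /* queue full */
    sq->ring[sq->tail % SQ_DEPTH] = *task;
    sq->tail++;                              /* a doorbell write would follow */
    return 0;
}
```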
As shown in FIG. 2, which is a schematic diagram of address conversion between the storage domain and the PCIe bus domain in the ROCE communication and transmission method based on non-HOST main memory, when the storage domain address and the FPGA DDR address are converted, the CPU completes access control of the FPGA network card expansion DDR in the PCIe bus domain by directly accessing the mapping address of the FPGA network card in the storage domain, namely CPU bus domain address - storage domain address - PCIe bus domain address.
The CPU directly accesses the mapping address of the FPGA network card in the storage domain, a Host bridge arranged between the storage domain address and the PCIe bus domain then yields the space address of the FPGA network card expansion DDR in the PCIe bus domain, and the CPU performs access control on the FPGA network card expansion DDR. The storage domain address comprises the mapping address of the BAR space in the PCIe bus, the mapping address of the FPGA network card and the Host address space; the PCIe bus domain is divided into the 16GB FPGA DDR space and the BAR spaces of other devices. The BAR space refers to the address space mapped by a base address register in the PCIe bus domain, and each PCIe device is provided with a plurality of BAR spaces for communication between the PCIe bus and system software.
As shown in fig. 3, which is a data interaction diagram of the ROCE communication and transmission method based on non-HOST main memory, when the storage domain address and the FPGA DDR address are converted, data interaction is performed at the same time: external data input and analysis are completed, and corresponding events are generated, through the input channel, the management channel, the message channel and the data channel.
External data enters the RDMA subsystem through the input channel and undergoes protocol processing in the RDMA subsystem to obtain the corresponding bare data; the bare data is stored in the expansion DDR on the FPGA network card through the AXI-MM data channel, and at the same time information such as the address and length of the data currently stored in the DDR is notified to the CPU through the AXI-Lite management channel. Meanwhile, considering that part of the externally input data carries slightly larger payloads such as instructions and control tables which need to be transmitted to the CPU for further processing, an AXI-Stream message channel is added: the RDMA subsystem extracts the instruction and control-table data and transmits it to the CPU through the AXI-Stream message channel, and the CPU obtains the instruction and control-table data through the driving interface, parses it, and guides the application program to generate the corresponding event based on the analysis result. The AXI-MM data channel, the AXI-Stream message channel and the AXI-Lite management channel are all data transmission modes of the AXI bus: the AXI-MM data channel is mainly used for high-performance, large-batch data transmission; the AXI-Stream message channel has no address stage during data transmission and is suitable for streaming data transmission; and the AXI-Lite management channel is mainly used for small-batch data transmission.
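The division of labour among the three AXI channels can be pictured with the following hedged C sketch of the records the CPU side might see; the structure names, field layouts and message size limit are assumptions for the sketch, not the actual register or packet formats.

```c
#include <stdint.h>

/* Hypothetical CPU-side views of the three AXI paths described above;
 * structure names, field layouts and the size limit are assumptions. */

/* AXI-Lite management channel: small register-style notification telling
 * the CPU where the RDMA subsystem placed the latest bare data in DDR. */
struct ddr_notify {
    uint64_t ddr_addr;   /* address of the data written to the FPGA DDR */
    uint32_t length;     /* length of that data in bytes                 */
    uint32_t valid;      /* set by the RDMA subsystem, cleared by CPU    */
};

/* AXI-Stream message channel: instruction / control-table message that the
 * RDMA subsystem extracts and forwards to the CPU for further processing. */
struct ctrl_msg {
    uint16_t type;          /* instruction or control-table identifier */
    uint16_t length;        /* payload length in bytes                  */
    uint8_t  payload[512];  /* assumed maximum message size             */
};

/* CPU side: the driver interface delivers the control message, the parse
 * result then guides the application to generate the corresponding event. */
typedef void (*event_cb)(const struct ctrl_msg *msg);

void handle_ctrl_msg(const struct ctrl_msg *msg, event_cb raise_event)
{
    if (msg->length <= sizeof(msg->payload))
        raise_event(msg);
}
```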
In this embodiment, address mapping between the FPGA network card expansion DDR space and the CPU bus domain is completed by the CPU allocating the BAR space in the PCIe scanning stage. When the BAR space is allocated, access to the FPGA network card expansion DDR space is realized by adopting a window switching mechanism based on the PCIe BAR size of the CPU: the 16GB DDR is divided into 64 BAR subspaces of 256MB each, and the lower 34 bits of the 64-bit PCIe address mapped by the Host bridge are taken, wherein bits 0-27 determine the lower address within the DDR and bits 28-33 indicate the subspace currently accessed. The DDR space and the number of memory channels of the CPU are thereby expanded, a multi-channel DDR scene of simultaneous reading and writing is constructed, the concurrent execution efficiency is improved, and the address mapping of the CPU bus domain is completed, so that the CPU can access and control the DDR expanded by the FPGA network card in the PCIe bus space; meanwhile, through the window switching mechanism, the CPU can access the whole DDR space of the FPGA network card.
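A hedged C sketch of the CPU-side window switching is given below; the window-select register and the access helpers are hypothetical, but they show how a fixed 256MB aperture combined with a 6-bit window index lets the CPU reach the whole 16GB DDR.

```c
#include <stdint.h>

/* Sketch of CPU-side window switching: a hypothetical window-select
 * register in the FPGA chooses which 256MB slice of the 16GB DDR the
 * fixed-size BAR aperture currently exposes; the CPU then accesses the
 * remaining 28-bit offset inside that window. */
#define WIN_SHIFT 28u
#define WIN_MASK  ((1ULL << WIN_SHIFT) - 1)

struct fpga_bar {
    volatile uint32_t *window_sel;   /* assumed window-select register */
    volatile uint8_t  *aperture;     /* mapped 256MB BAR aperture      */
};

uint32_t ddr_read32(struct fpga_bar *bar, uint64_t ddr_addr)
{
    *bar->window_sel = (uint32_t)(ddr_addr >> WIN_SHIFT);   /* bits 28-33 */
    return *(volatile uint32_t *)(bar->aperture + (ddr_addr & WIN_MASK));
}

void ddr_write32(struct fpga_bar *bar, uint64_t ddr_addr, uint32_t val)
{
    *bar->window_sel = (uint32_t)(ddr_addr >> WIN_SHIFT);
    *(volatile uint32_t *)(bar->aperture + (ddr_addr & WIN_MASK)) = val;
}
```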
In this embodiment, the MTT table is a data structure used to manage memory pages in RDMA transmission and provides a mapping mechanism from virtual addresses to physical addresses. By querying the MTT table, the conversion between the storage domain address and the FPGA DDR address can be carried out rapidly and efficiently, and the physical address of the data storage in the FPGA network card expansion DDR is obtained.
In this embodiment, when the storage domain address and the FPGA DDR address are converted, a Host bridge needs to be set between the storage domain address and the PCIe bus domain; the Host bridge is configured to map the address space of the CPU bus domain to the address space of the PCIe bus, so as to implement communication between the CPU and the PCIe bus, and is meanwhile responsible for data transmission between the CPU and the PCIe bus. The non-HOST main memory in the invention means that the HOST is not used as the main memory and the CPU cannot directly access this main memory; this is to be distinguished from the Host bridge, which has a different function and role in the system.
S2, after the request end of the FPGA network card sends the data packet to the 100GbE MAC/PHY sending interface, the data packet is sent through the sending interface to the response end of the FPGA network card and parsed to obtain the application data, the virtual address of the destination data storage, the data length and the key; the MR address conversion table is then queried according to the virtual address of the destination data storage and the data length, the virtual address of the destination data storage is accurately mapped to the physical address through the MR address conversion table to obtain the corresponding physical address, the application data is written into the physical address for storage, and the application data is decrypted with the key. The 100GbE MAC/PHY refers to the media access control (MAC) controller and the physical layer chip used for 100G Ethernet communication and mainly comprises the MAC controller and the physical layer chip; the MAC controller is responsible for sending data to the physical layer chip, and the physical layer chip is responsible for rate negotiation and for converting digital signals into analog signals and outputting them to the network cable.
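For illustration only, the following C sketch shows how a responder-side MR lookup with a bounds check could map the parsed (key, virtual address, length) triple to a physical address in the expansion DDR; the entry fields and the linear search are assumptions for the sketch.

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative MR address-conversion entry at the response end: a
 * registered region is identified by its key, and a matching virtual
 * address range maps onto a physical base in the expansion DDR. The
 * field names are assumptions for this sketch. */
struct mr_entry {
    uint32_t key;         /* key carried in the incoming packet       */
    uint64_t virt_base;   /* start of the registered virtual range    */
    uint64_t phys_base;   /* corresponding physical base in FPGA DDR  */
    uint64_t length;      /* size of the registered region in bytes   */
};

/* Map the parsed (key, destination virtual address, length) to a physical
 * address, rejecting accesses outside the registered region; this bounds
 * check is what yields process isolation and memory protection. */
int mr_translate(const struct mr_entry *table, size_t entries,
                 uint32_t key, uint64_t dst_vaddr, uint32_t len,
                 uint64_t *paddr)
{
    for (size_t i = 0; i < entries; i++) {
        const struct mr_entry *mr = &table[i];
        if (mr->key != key)
            continue;
        if (dst_vaddr < mr->virt_base ||
            dst_vaddr + len > mr->virt_base + mr->length)
            return -1;                 /* outside the registered range */
        *paddr = mr->phys_base + (dst_vaddr - mr->virt_base);
        return 0;
    }
    return -1;                         /* unknown key */
}
```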
The MR address conversion table is used for memory management in RDMA and records the address conversion relation of the memory; after the response end of the FPGA network card obtains the virtual address of the destination data storage, it obtains the corresponding physical address by using the MR address conversion table. The MR address conversion table accurately and efficiently ensures that the virtual address is correctly mapped to the physical address, thereby realizing process isolation and memory protection.
S3, after receiving the data packet sent by the request end of the FPGA network card, the response end of the FPGA network card replies an ACK message to the request end of the FPGA network card. After receiving the ACK message, the request end of the FPGA network card sends the ACK message to the RDMA message analysis module; the header of the ACK message comprises a plurality of fields, including a source port number, a destination port number, a sequence number and an acknowledgment number, and the RDMA message analysis module parses these fields one by one, obtains the analysis result and returns it to the request end of the FPGA network card. The request end of the FPGA network card identifies the ACK based on the analysis result and then adds a completion marking element corresponding to the WQE task to the CQ queue; after the request end software polls the completion marking element, task completion information is obtained and the application program is informed of the current sending result. The CQ queue refers to a completion queue used for storing notifications of the completion of sending and receiving operations, and the two ends of each data channel each require one CQ queue.
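A minimal C sketch of the request-end completion path is given below; the CQE fields, the valid flag and the polling loop are assumptions for the sketch, standing in for however the FPGA network card actually marks completion.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative completion queue at the request end: once the ACK has been
 * identified, a completion marking element is pushed; the request end
 * software polls it and reports the sending result. Field names and the
 * valid-flag scheme are assumptions for the sketch. */
#define CQ_DEPTH 256u

struct cqe {
    uint64_t wqe_id;   /* which WQE task this completion refers to */
    uint8_t  status;   /* 0 = success, otherwise an error code     */
    uint8_t  valid;    /* set by hardware, cleared by software     */
};

struct completion_queue {
    struct cqe ring[CQ_DEPTH];
    uint32_t   ci;     /* software consumer index */
};

/* Poll the CQ; on a valid element, inform the application of the current
 * sending result (a printf stands in for the real notification). */
int cq_poll(struct completion_queue *cq)
{
    struct cqe *e = &cq->ring[cq->ci % CQ_DEPTH];
    if (!e->valid)
        return 0;                       /* nothing completed yet */
    printf("WQE %llu finished, status %u\n",
           (unsigned long long)e->wqe_id, (unsigned)e->status);
    e->valid = 0;
    cq->ci++;
    return 1;
}
```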
In this embodiment, the ACK message is a confirmation message sent by the response end of the FPGA network card to the request end of the FPGA network card, and is used to confirm that the data packet sent by the request end of the FPGA network card has been successfully received, so as to avoid data loss or unnecessary retransmission and ensure reliable transmission of the data.
In this embodiment, the RDMA message parsing module is configured to receive an ACK message sent by a request end of the FPGA network card, parse fields in the ACK message one by one, extract useful data and information, and perform further analysis and processing, so as to ensure the sequence of data transmission and the integrity and correctness of data.
In summary, by adopting a high-performance FPGA as the ROCE network card, expanding 4 groups of DDR with a size of 4GB each, and using 100GbE MAC/PHY resources to instantiate 4 paths of 100GbE ports inside the FPGA network card, the invention can utilize the transmission bandwidth resources of the bottom layer to the maximum extent, improve the throughput bandwidth of the system, reduce the number of PCIe data-movement operations, and effectively reduce the transmission delay, making it applicable to real-time-sensitive transmission scenes while satisfying the communication bandwidth requirement of multipath parallel transmission scenes; it also expands the DDR space and the number of memory channels of the CPU, constructs a multi-channel DDR scene of simultaneous reading and writing, and improves the concurrent execution efficiency.
The above embodiments are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereto, and any modification made on the basis of the technical scheme according to the technical idea of the present invention falls within the protection scope of the present invention.