
CN119759839A - GPU network interconnection method and device - Google Patents


Info

Publication number
CN119759839A
Authority
CN
China
Prior art keywords
gpu
connection
destination
sending
data packet
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411903948.8A
Other languages
Chinese (zh)
Inventor
郝俊瑞
余少华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Research Institute of Posts and Telecommunications Co Ltd
Original Assignee
Wuhan Research Institute of Posts and Telecommunications Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Research Institute of Posts and Telecommunications Co Ltd filed Critical Wuhan Research Institute of Posts and Telecommunications Co Ltd
Priority to CN202411903948.8A
Publication of CN119759839A


Landscapes

  • Data Exchanges In Wide-Area Networks (AREA)

Abstract


The present application relates to a method and device for GPU network interconnection. A connection management task is generated according to the forwarding table of the source GPU and the address of the destination GPU, where the forwarding table includes the address of each destination GPU and the number of the corresponding network output interface, and the connection management task includes information about the connected destination GPU, namely the number of the network output interface, its IP, and a data cache; the forwarding table and the data cache are preset in static random access memory (SRAM). According to the connection management task, a connection request message is sent to the destination GPU, and a timer is started to record the timestamp at which the connection request message is sent. If a connection response message sent by the destination GPU based on the connection request message is received within a predetermined time, the connection is established successfully; otherwise, connection establishment fails and the connection request message is resent. The present application can improve data transmission efficiency and reduce transmission delay.

Description

GPU network interconnection method and device
Technical Field
The application relates to the technical field of communication, in particular to a method and a device for interconnecting GPU networks.
Background
Conventional data center networks are mainly used for interconnection between servers and storage in a data center. With the wide application of AI and large models, data centers need large numbers of GPUs (Graphics Processing Units) or TPUs (Tensor Processing Units) for distributed training and deep learning, and the required scale of GPUs and TPUs keeps growing. In some intelligent computing centers, the number of GPUs even exceeds the number of CPUs (Central Processing Units), and on some deep learning servers the number of GPUs far exceeds the number of CPUs. In this context, the interconnection of GPUs becomes a critical issue.
Traditional GPU interconnection generally relies on PCIe when the number of GPUs is small and their computing power is modest. However, as the volume of data processed by a single GPU and the number of GPUs in a cluster grow, PCIe can no longer meet GPU interconnection requirements.
At present, most mainstream GPU manufacturers adopt self-developed technologies for GPU interconnection: Nvidia uses its proprietary NVLink, whose GPU interconnection bandwidth can reach hundreds of Gb, and AMD uses its self-developed Infinity Fabric. Standard technologies, such as Ethernet-based RoCEv2, are also used in the industry, but for scenarios interconnecting tens of thousands of GPUs/TPUs, these technologies struggle to meet the latency and bandwidth requirements of large-model or deep learning workloads.
Communication between GPUs involves much larger data volumes than conventional CPU communication, because many samples and training parameters must be transferred; for a model with hundreds of billions of parameters, the parameters to be transferred at a given instant can reach several billion, i.e., hundreds of GB. This traffic also has the character of collective communication, transferring data in broadcast, scatter, many-to-one, or one-to-many patterns, so the demand on network bandwidth is high. Meanwhile, to maintain efficiency, large-model training requires all GPUs to synchronize parameters and data, otherwise learning efficiency drops sharply. GPU communication is therefore also very demanding on transmission latency: the lower, the better.
The conventional GPU data transmission method of the data center server has the following problems:
1. Traditional server data transmission is based on a complex TCP/IP software protocol stack that encapsulates TCP, IP, Ethernet, and other headers. Even when RDMA is adopted, although the TCP/IP software stack is bypassed, protocol headers must still be encapsulated and decapsulated, and these multi-layer headers increase send/receive latency and reduce transmission efficiency. For data transmission between GPUs, where latency and efficiency requirements are extremely strict, the traditional protocol-stack approach cannot meet the requirements.
2. Connections between the sending and receiving ends in a traditional data center are mostly TCP connections. TCP requires a three-way handshake to establish a connection, which takes a long time and is inefficient. In addition, TCP's sliding window is implemented in software, with low efficiency and high latency. For GPU interconnection, establishing and maintaining TCP connections greatly affects the efficiency of GPU data transfer.
Therefore, the network of a traditional data center can meet neither the bandwidth nor the latency requirements of GPU interconnection; moreover, as the GPU cards mounted on one server become ever denser, the traditional data center network cannot meet the needs of large-scale GPU interconnection in either networking architecture or transmission protocol.
Disclosure of Invention
The embodiment of the application provides a method and a device for interconnecting GPU networks, which can improve data transmission efficiency and reduce transmission delay.
In a first aspect, a method for interconnecting GPU networks is provided, which includes:
Generating a connection management task according to a forwarding table of a source GPU and an address of a destination GPU, wherein the forwarding table comprises the address of each destination GPU and the number of the corresponding network output interface, the connection management task comprises information about the destination GPU to which it connects, the destination GPU information comprises the number of the network output interface, its IP, and a data cache, and the forwarding table and the data cache are preset in a static random access memory SRAM;
according to the connection management task, sending a connection request message to the destination GPU, starting a timer, and recording a time stamp when the connection request message is sent;
if the connection response message sent by the destination GPU based on the connection request message is received within the preset time, the connection is established successfully;
Otherwise, the connection establishment fails and the connection request message is resent.
In some embodiments, the connection request message and the connection response message each include an address of a destination GPU, an address of a source GPU, a type field, a task number, a data packet type, a data packet sequence number, a data packet content, and a check code;
the address of the destination GPU of the connection request message is the same as the address of the source GPU of the connection response message, and the address of the source GPU of the connection request message is the same as the address of the destination GPU of the connection response message.
In some embodiments, the destination GPU verifies the destination GPU address, the type field, and the data packet type in the connection request message to obtain a verification result;
and a connection response message whose data packet type corresponds to the verification result is sent according to the obtained verification result.
In some embodiments, when the connection is established successfully, the method further comprises calculating Round Trip Time (RTT) from sending the connection request message to receiving the connection response message by using the timer;
The method further comprises the step of obtaining the size of a sending window for sending the service data packet based on the round trip time RTT and the sending rate of the network output interface of the destination GPU information.
In some embodiments, the method further comprises transmitting a service data packet;
Sending the service data packet specifically includes:
Based on the size of the send window, copying service data from the HBM memory of the source GPU in batches, via a DMA engine, to the data cache corresponding to the network output interface in the destination GPU information, and encapsulating the service data into a plurality of service data packets;
and sending a service data packet to the destination GPU according to the connection management task, and receiving a data response message sent by the destination GPU based on the service data packet.
In some embodiments, sending the data response message specifically includes sending one data response message every 1/2 round trip time RTT, where the data packet sequence number of the data response message is the largest sequence number among the service data packets that have passed verification so far.
In some embodiments, if the data response message is not received within 1/2 round trip time RTT, the sending rate is reduced.
In some embodiments, the service data packet and the data response packet each include an address of a destination GPU, an address of a source GPU, a type field, a task number, a data packet type, a data packet sequence number, a data packet content, and a check code;
the address of the destination GPU of the service data packet is the same as the address of the source GPU of the data response message, and the address of the source GPU of the service data packet is the same as the address of the destination GPU of the data response message.
In some embodiments, the destination GPU verifies the destination GPU address, the task number, and the check code in the service data packet to obtain a verification result;
and whether to send a data response message is determined according to the obtained verification result.
In a second aspect, there is provided an apparatus for interconnecting GPUs in a network, for interconnecting at least two GPUs, at least one of the GPUs serving as a source GPU and at least another of the GPUs serving as a destination GPU, comprising:
The GPU network management module is used for generating a connection management task according to the forwarding table of the source GPU and the address of the destination GPU, sending a connection request message to the destination GPU according to the connection management task, starting a timer, and recording the timestamp when the connection request message is sent; if a connection response message sent by the destination GPU based on the connection request message is received within the predetermined time, connection establishment succeeds, otherwise it fails and the connection request message is resent; when the connection is established successfully, calculating the round trip time RTT from sending the connection request message to receiving the connection response message, and obtaining the size of the send window for sending service data packets based on the RTT and the sending rate of the network output interface in the destination GPU information. The forwarding table includes the address of each destination GPU and the number of the corresponding network output interface; the connection management task includes information about the destination GPU to which it connects, namely the number of the network output interface, its IP, and a data cache; the forwarding table and the data cache are preset in static random access memory SRAM.
The DMA engine is used for copying service data from the HBM memory of the source GPU in batches to the data cache corresponding to the network output interface in the destination GPU information, based on the size of the send window, and encapsulating the service data into a plurality of service data packets;
and the GPU network management module is also used for sending a service data packet to the destination GPU according to the connection management task and receiving a data response message sent by the destination GPU based on the service data packet.
The technical scheme provided by the application has the beneficial effects that:
The application presets the forwarding table and the data cache in SRAM, generates the connection management task from the forwarding table of the source GPU and the address of the destination GPU, and uses the connection management task to send the connection request message for connection establishment and subsequent data transmission.
The application simplifies the connection protocol and the data packet format, avoiding complex TCP connections and TCP/IP headers between GPUs. Messages and data packets adopt the Ethernet format, so they are compatible with existing Ethernet networks, while the protocol is greatly simplified, making processing at both GPU ends simple and efficient and suitable for large-scale GPU interconnection.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a method for GPU network interconnection provided by an embodiment of the present application;
FIG. 2 is a block diagram of a device for GPU network interconnection provided by an embodiment of the present application;
FIG. 3 is a data packet format according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The embodiment of the application provides a method for GPU network interconnection, applied to the interconnection of at least two GPUs, at least one of which serves as a source GPU and at least another as a destination GPU. Each GPU includes a forwarding table and network output interfaces. In order to transmit data to multiple GPUs simultaneously, each GPU is provided with several network output interfaces for interconnection with surrounding GPUs; each network output interface has a unique number and an IP address to distinguish it, and each is equipped with a data cache of a certain capacity. As shown in FIG. 2, each GPU has 6 network output interfaces, a number that can be adjusted according to specific conditions.
Each network output interface is configured with an independent data cache, so the GPU can have all network output interfaces send data simultaneously. The GPU transmits data to other GPUs in a multitasking manner: each data transmission task is responsible for the whole life cycle of one connection session, the connection exclusively occupies one network output interface and its data cache, the task starts when the connection starts, and the task ends when the connection ends. Each task independently maintains one connection and independently sends and receives data, unrelated to other connection tasks. This increases security and avoids the extra system complexity introduced by network sharing and congestion control.
The forwarding table of the GPU is a two-dimensional table stored in the GPU's static random access memory (SRAM), consisting of destination GPU addresses and the corresponding network output interface numbers. A GPU address could be a chip number; in the present application, for compatibility with Ethernet, each GPU is configured with a unique MAC address. The forwarding table is set in advance by software programming. While a GPU cluster executes a task, the GPU topology does not change, so the forwarding table remains essentially unchanged.
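As a minimal sketch of the structure just described, the forwarding table can be modeled in C as a fixed array of (destination MAC, output interface) rows with a linear lookup; the capacity, field names, and lookup strategy below are illustrative assumptions, not details from the patent:

```c
#include <stdint.h>
#include <string.h>

#define FWD_TABLE_MAX 64   /* assumed capacity; the patent does not specify one */

/* One forwarding table row: destination GPU address -> network output interface. */
struct fwd_entry {
    uint8_t dst_mac[6];    /* unique MAC address assigned to the destination GPU */
    uint8_t out_if;        /* number of the corresponding network output interface */
};

static struct fwd_entry fwd_table[FWD_TABLE_MAX];  /* preset in SRAM by software */
static int fwd_table_len;

/* Return the output interface number for a destination GPU, or -1 if unknown. */
int fwd_lookup(const uint8_t dst_mac[6])
{
    for (int i = 0; i < fwd_table_len; i++)
        if (memcmp(fwd_table[i].dst_mac, dst_mac, 6) == 0)
            return fwd_table[i].out_if;
    return -1;
}
```

Because the topology, and therefore the table, is essentially static during a job, a simple preset array with linear search is plausible for the small per-GPU fan-out shown in FIG. 2.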
Specifically, referring to FIG. 1, the method includes the following steps:
Step 101: generate a connection management task according to the forwarding table of the source GPU and the address of the destination GPU, where the forwarding table includes the address of each destination GPU and the number of the corresponding network output interface, the connection management task includes information about the connected destination GPU (the number of the network output interface, its IP, and a data cache), and the forwarding table and the data cache are preset in static random access memory SRAM.
For two devices in a network to transfer data, they must first establish a connection. Similarly, a connection must be established between two GPUs; it is established by the GPU network management module, which generates a connection management task, and the connection is then established and maintained by that task.
Before establishing a connection, the GPU network management module generates a connection management task to maintain and manage the connection. The application manages connections and network output interfaces with multiple connection management tasks: one connection management task is responsible for the establishment and management of one connection, and that connection exclusively occupies one network output interface for transmitting data. Thus, if there are multiple connections, with multiple network output interfaces transferring data, the GPU network management module generates multiple connection management tasks.
According to the destination GPU address provided by the source GPU, the GPU network management module looks up the corresponding network output interface in the source GPU's forwarding table to use as the interface for transmitting data, and clears the data cache corresponding to that network output interface. The generated connection management task maintains the binding among the connection (which has a task number), the network output interface, and the data cache until the connection terminates. The connection management task is also responsible for generating sequence numbers for transmitted data packets, recording received response sequence numbers, and judging whether packet loss or similar phenomena occur. It also maintains a fixed-size data cache window in hardware (i.e., SRAM) to determine whether congestion occurs during transmission, and maintains a hardware timer for recording packet transmission delay.
Each GPU maintains a global task number; the number is allocated when a connection management task is generated and reclaimed when the task ends.
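Gathering the state a connection management task is said to maintain (task number, bound interface, sequence numbers, window, timer), a hypothetical C layout might look like the following; the struct and field names are assumptions for illustration and are reused by the later sketches:

```c
#include <stdint.h>
#include <stdbool.h>

/* State maintained by one connection management task (assumed layout). */
struct conn_task {
    uint16_t task_no;        /* globally unique within the GPU; doubles as the connection number */
    uint8_t  out_if;         /* network output interface bound to this connection */
    uint8_t  peer_mac[6];    /* destination GPU address taken from the forwarding table */
    uint32_t next_seq;       /* sequence number for the next transmitted data packet */
    uint32_t acked_seq;      /* highest response sequence number received so far */
    uint32_t window_bytes;   /* fixed-size send window held in SRAM */
    uint64_t sent_timestamp; /* hardware timer value when the last request was sent */
    bool     established;    /* set once the connection response message arrives */
};
```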
When the task is ready, the task starts to establish a transport connection with the destination GPU.
Step 102: according to the connection management task, send a connection request message to the destination GPU, start a timer, and record the timestamp when the connection request message is sent.
First, the connection management task generates a connection request message. It should be noted that, as shown in FIG. 3, the message and data packet format adopted in the present application uses the same header format as Ethernet for compatibility, so that data packets can be forwarded by existing Ethernet switching devices; unlike packets sent by a conventional CPU, however, there are no traditional IP layer, TCP layer, or similar parts, and the header portion of the Ethernet format is simplified. The message/data packet format consists of the following parts:
The destination address is 6 bytes: the address of the destination GPU, compatible with a MAC address; each GPU is assigned a unique MAC address.
The source address is 6 bytes: the address of the source GPU, compatible with a MAC address.
The type field is 2 bytes; the application selects 0x0877 as the type field to distinguish these packets from existing Ethernet protocols.
The task number is 2 bytes, representing the number of the connection.
The data packet type is 1 byte, giving the type of the packet or data message carried in the message, mainly: connection establishment message, connection response message, service data message, connection end message, and end response message.
0x1 represents a connection establishment message;
0x2 represents a connection response message;
0x4 represents a service data message;
0x8 represents a connection end message;
0x10 represents an end response message, ending the connection;
Only when the data packet type is 0x4 (a service data message) do the data packet sequence number and data packet content fields carry data; otherwise they are empty.
The data packet sequence number is 4 bytes, indicating the number of the service data packet within the connection.
The check code FCS is 4 bytes, a check over the data packet, computed with the CRC-32 algorithm.
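As a rough illustration, the field list above maps onto a packed C struct like the sketch below; the struct and field names are assumptions (the patent specifies only the byte layout), and a real implementation would also handle network byte order and the variable-length content:

```c
#include <stdint.h>

/* Data packet types carried in the 1-byte type field (values from the description). */
enum pkt_type {
    PKT_CONN_REQ  = 0x1,   /* connection establishment message */
    PKT_CONN_RESP = 0x2,   /* connection response message */
    PKT_DATA      = 0x4,   /* service data message */
    PKT_CONN_END  = 0x8,   /* connection end message */
    PKT_END_RESP  = 0x10,  /* end response message */
};

#define GPU_ETHERTYPE 0x0877  /* type field chosen to avoid existing Ethernet protocols */

/* Header layout per the description: Ethernet-compatible, no IP/TCP layers. */
#pragma pack(push, 1)
struct gpu_pkt_hdr {
    uint8_t  dst_mac[6];   /* destination GPU address (MAC-compatible) */
    uint8_t  src_mac[6];   /* source GPU address (MAC-compatible) */
    uint16_t ethertype;    /* 0x0877 */
    uint16_t task_no;      /* connection number */
    uint8_t  pkt_type;     /* enum pkt_type */
    uint32_t seq_no;       /* meaningful only when pkt_type == PKT_DATA */
    /* variable-length data packet content follows, then a 4-byte CRC-32 FCS */
};
#pragma pack(pop)
```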
It can be understood that the address of the destination GPU of the connection request packet is the same as the address of the source GPU of the connection response packet, and the address of the source GPU of the connection request packet is the same as the address of the destination GPU of the connection response packet.
The destination address of the connection request message is the address of the peer GPU for the data transfer, the source address is this GPU's address, and the type field is 0x0877. The task number comes from the connection management task responsible for the connection and is globally unique within the GPU, so it can serve as the connection number. The data packet type of the connection request message is 0x1, representing connection establishment. When the connection management task sends the connection request message, it also starts a timer and records the timestamp of the connection request message.
Step 103: judge whether a connection response message sent by the destination GPU based on the connection request message is received within a predetermined time.
Step 104: if the connection response message sent by the destination GPU based on the connection request message is received within the predetermined time, the connection is established successfully.
Step 105: if the connection response message is not received within the predetermined time, the attempt times out and connection establishment fails; return to step 102 and resend the connection request message.
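A compact sketch of this establish-with-timeout loop (steps 102-105), reusing the struct conn_task sketch above; the timer and send/receive helpers are hypothetical placeholders, and the retry cap is an added assumption (the description simply resends):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers, not from the patent; struct conn_task is the sketch above. */
uint64_t hw_timer_now(void);
bool send_conn_request(struct conn_task *t);
bool poll_conn_response(struct conn_task *t);  /* true once a matching 0x2 message arrives */

/* Steps 102-105: send the request, time the wait, resend on timeout. */
bool establish_connection(struct conn_task *t, uint64_t timeout_ticks, int max_retries)
{
    for (int attempt = 0; attempt < max_retries; attempt++) {
        t->sent_timestamp = hw_timer_now();            /* step 102: send + start timer */
        if (!send_conn_request(t))
            continue;
        while (hw_timer_now() - t->sent_timestamp < timeout_ticks) {
            if (poll_conn_response(t)) {               /* step 104: response in time */
                t->established = true;
                return true;
            }
        }
        /* step 105: timed out, establishment failed; loop resends the request */
    }
    return false;
}
```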
The application presets the forwarding table and the data cache in SRAM, generates the connection management task from the forwarding table of the source GPU and the address of the destination GPU, and uses the connection management task to send the connection request message for connection establishment and subsequent data transmission.
The application simplifies the connection protocol and the data packet format, avoiding complex TCP connections and TCP/IP headers between GPUs. Messages and data packets adopt the Ethernet format, so they are compatible with existing Ethernet networks, while the protocol is greatly simplified, making processing at both GPU ends simple and efficient and suitable for large-scale GPU interconnection.
In step 102, the destination GPU sends a connection response message based on the connection request message; this specifically includes the following steps:
Step 201: the destination GPU verifies the destination GPU address, the type field, and the data packet type in the connection request message to obtain a verification result.
Step 202: according to the obtained verification result, send a connection response message whose data packet type corresponds to the verification result.
It can be understood that, after the destination GPU receives the connection request message:
First, judge the destination address. If the destination address of the connection request message is not this GPU's address, discard the message; the destination GPU sends no connection response message, so the source GPU receives no response within the predetermined time, times out, and thereby knows that connection establishment failed. If the address matches, proceed to the next step.
Second, since the destination address is this GPU's address, learn and verify the output interface IP.
Third, check whether the type field is 0x0877. If so, proceed to the next step; if not, discard the connection request message, with the same timeout behavior as above.
Fourth, check the data packet type. If it is not 0x1, discard the connection request message, again letting the source GPU time out. If the data packet type is 0x1, the message is a connection establishment request and verification passes: the verification result is that the connection is established successfully, and the destination GPU generates a connection response message with data packet type 0x2 corresponding to this result. The connection response message is sent, and the destination GPU prepares to receive service data packets.
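Putting the four checks together, a receiver-side sketch reusing the gpu_pkt_hdr sketch above; the helper names and the my_mac lookup are assumptions for illustration:

```c
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

extern const uint8_t my_mac[6];                          /* this GPU's MAC-compatible address */
bool send_conn_response(const struct gpu_pkt_hdr *req);  /* hypothetical reply helper */

/* Handle an incoming connection request per the four verification steps. */
void on_packet(const struct gpu_pkt_hdr *h)
{
    if (memcmp(h->dst_mac, my_mac, 6) != 0)  /* 1: not addressed to this GPU */
        return;                              /* drop; the sender will time out */
    /* 2: learn and verify the output interface IP (omitted in this sketch) */
    if (h->ethertype != GPU_ETHERTYPE)       /* 3: type field must be 0x0877 */
        return;
    if (h->pkt_type != PKT_CONN_REQ)         /* 4: must be a connection request (0x1) */
        return;
    send_conn_response(h);                   /* reply with pkt_type PKT_CONN_RESP (0x2) */
}
```

Note that every failed check is a silent drop: the protocol relies on the sender's timeout rather than negative acknowledgments, which keeps the receiver logic trivial.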
After the source GPU receives the connection response message, the connection is established successfully and service data can be sent normally. The connection remains unchanged during transmission of the service data packets, as does the task number.
Meanwhile, when the connection is established successfully, the timer is used to calculate the round trip time (RTT) from sending the connection request message to receiving the connection response message. This time estimates the delay of the entire link from sender to receiver.
Based on the round trip time RTT and the sending rate of the network output interface in the destination GPU information, the size of the send window for transmitting service data packets can be calculated.
It is to be understood that the predetermined time may be set as reasonably required; for example, it may be set to twice the round trip time RTT.
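For illustration, the RTT measured by the timer and the send window (window size = RTT x interface rate, per the formula given below) could be computed as follows; the unit choices are assumptions:

```c
#include <stdint.h>

/* Compute RTT from the timestamps recorded by the hardware timer. */
uint64_t rtt_ticks(uint64_t sent_ts, uint64_t resp_ts)
{
    return resp_ts - sent_ts;
}

/* Send window size = RTT x interface rate (formula from the description).
 * rtt_ns in nanoseconds, rate_gbps in Gb/s; result in bytes. */
uint64_t send_window_bytes(uint64_t rtt_ns, uint32_t rate_gbps)
{
    /* rate_gbps Gb/s equals rate_gbps / 8 bytes per nanosecond */
    return rtt_ns * rate_gbps / 8;
}
```

At 100 Gb/s with a 10 µs RTT, for example, this gives a window of 125,000 bytes, i.e., one RTT's worth of data in flight.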
After the connection is successfully established in step 104, service data packets may be sent; sending a service data packet specifically includes the following steps:
Step 301: based on the size of the send window, the DMA engine copies service data from the source GPU's HBM memory in batches to the data cache corresponding to the network output interface in the destination GPU information, and the data is encapsulated into multiple service data packets.
In step 301, the source GPU sends an instruction to the DMA engine to start sending service data; this specifically includes the following steps:
The DMA engine copies the service data from HBM memory into SRAM, and after the data is encapsulated into multiple service data packets, the packets are sent. Once the SRAM starts sending service data packets, a send window is opened to record the amount of data sent, and each sent service data packet is numbered, carrying its data packet sequence number. The method adopts a fixed window, which is easy to implement in hardware; the window size is determined by the sending rate of the network output interface and the round trip time (RTT).
Send window size = round trip time RTT x interface rate.
When sending service data packets, each packet's sequence number field is filled with its sequence number; sequence numbers start from 0, occupy 4 bytes, and can reach a maximum of 2^32 - 1.
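A sender-side sketch of step 301, reusing struct conn_task from above; the DMA and send helpers and the per-packet payload size are hypothetical:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-ins for the DMA engine and the output interface driver. */
size_t dma_copy_hbm_to_sram(void *sram_buf, size_t max_bytes);
void   send_data_packet(uint16_t task_no, uint32_t seq_no, const void *payload, size_t len);

#define PKT_PAYLOAD 1024  /* assumed payload size per service data packet */

/* Step 301: batch-copy HBM data into SRAM, packetize, and number packets from 0. */
void fill_and_send(struct conn_task *t, uint8_t *sram_buf)
{
    /* Keep filling until one window's worth of data is unacknowledged. */
    while ((uint64_t)(t->next_seq - t->acked_seq) * PKT_PAYLOAD < t->window_bytes) {
        size_t n = dma_copy_hbm_to_sram(sram_buf, PKT_PAYLOAD);
        if (n == 0)
            break;                        /* all service data has been copied */
        send_data_packet(t->task_no, t->next_seq, sram_buf, n);
        t->next_seq++;                    /* 4-byte sequence number, starting at 0 */
    }
}
```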
Step 302: send service data packets to the destination GPU according to the connection management task, and receive the data response messages the destination GPU sends based on them; the destination GPU address of a service data packet is the same as the source GPU address of the corresponding data response message, and the source GPU address of a service data packet is the same as the destination GPU address of the data response message.
The DMA engine continually copies data into the send window in SRAM until the window is full or sending completes. Meanwhile, the window keeps moving forward according to the sequence numbers in the received response messages, and data whose response sequence number has been received is deleted.
Since the send window size equals the round trip time RTT x the interface rate, in the absence of congestion the data in the send window is sent within one RTT period, and the window normally keeps moving.
After receiving a service data packet, the destination GPU checks whether the destination GPU address, the task number, and the check code are all correct, yielding a verification result; if correct, it sends a data response message to the source GPU, otherwise it does not. To avoid sending data response messages too frequently, the application sends one data response message every 1/2 round trip time RTT; the data packet sequence number in the response is the largest sequence number among the service data packets verified so far, and all service data packets before that sequence number have been received.
When the data response message carrying that sequence number reaches the source GPU, the source GPU deletes the data before that sequence number and moves the send window forward.
If no data response message is received within 1/2 RTT, the source GPU considers congestion to have occurred and begins to reduce the sending rate, slowing the movement of the send window. If data response messages are still not received, congestion is severe or packets have been lost; the sending rate is reduced further and part of the messages are retransmitted, starting from the beginning of the send window.
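A sketch of the source-side response handling just described, again over struct conn_task; the halving of the rate and the two-tick threshold are illustrative assumptions (the description says only "reduce the sending rate" and "continue reducing"):

```c
#include <stdint.h>
#include <stdbool.h>

void retransmit_from(struct conn_task *t, uint32_t seq);  /* hypothetical helper */

/* Cumulative ACK: slide the window and free data up to acked_seq. */
void on_data_response(struct conn_task *t, uint32_t acked_seq)
{
    if (acked_seq > t->acked_seq)
        t->acked_seq = acked_seq;   /* data before acked_seq is deleted, window moves */
}

/* Run every 1/2 RTT: a first miss lowers the rate, a further miss also retransmits. */
void on_half_rtt_tick(struct conn_task *t, bool response_seen,
                      uint32_t *rate_mbps, int *missed_ticks)
{
    if (response_seen) {
        *missed_ticks = 0;          /* window is moving normally */
        return;
    }
    (*missed_ticks)++;
    *rate_mbps /= 2;                /* assumed multiplicative decrease on congestion */
    if (*missed_ticks >= 2)
        retransmit_from(t, t->acked_seq);  /* resend from the start of the send window */
}
```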
Referring to FIG. 2, an embodiment of the present application further provides a device for GPU network interconnection, applied to the interconnection of at least two GPUs, at least one of which serves as a source GPU and at least another as a destination GPU; the device includes a GPU network management module and a DMA engine.
The GPU network management module is used for generating a connection management task according to the forwarding table of the source GPU and the address of the destination GPU; sending a connection request message to the destination GPU according to the connection management task, starting a timer, and recording the timestamp when the connection request message is sent; if a connection response message sent by the destination GPU based on the connection request message is received within the predetermined time, connection establishment succeeds, otherwise it fails and the connection request message is resent; and, when the connection is established successfully, calculating the round trip time RTT from sending the connection request message to receiving the connection response message and obtaining the size of the send window for sending service data packets based on the RTT and the sending rate of the network output interface in the destination GPU information. The forwarding table includes the address of each destination GPU and the number of the corresponding network output interface; the connection management task includes information about the destination GPU to which it connects, namely the number of the network output interface, its IP, and a data cache.
The DMA engine is used for copying service data from the HBM memory of the source GPU in batches to the data cache corresponding to the network output interface in the destination GPU information, based on the size of the send window, and encapsulating the service data into multiple service data packets.
The GPU network management module is further used for sending a service data packet to the destination GPU according to the connection management task and receiving a data response message sent by the destination GPU based on the service data packet.
The foregoing is only a specific embodiment of the application to enable those skilled in the art to understand or practice the application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for interconnecting GPU networks, comprising:
Generating a connection management task according to a forwarding table of a source GPU and an address of a destination GPU, wherein the forwarding table comprises the address of each destination GPU and the number of the corresponding network output interface, the connection management task comprises information about the destination GPU to which it connects, the destination GPU information comprises the number of the network output interface, its IP, and a data cache, and the forwarding table and the data cache are preset in a static random access memory SRAM;
according to the connection management task, sending a connection request message to the destination GPU, starting a timer, and recording a time stamp when the connection request message is sent;
if the connection response message sent by the destination GPU based on the connection request message is received within the preset time, the connection is established successfully;
Otherwise, the connection establishment fails and the connection request message is resent.
2. The method of GPU network interconnection of claim 1, wherein:
the connection request message and the connection response message comprise a destination GPU address, a source GPU address, a type field, a task number, a data packet type, a data packet sequence number, data packet content and a check code;
the address of the destination GPU of the connection request message is the same as the address of the source GPU of the connection response message, and the address of the source GPU of the connection request message is the same as the address of the destination GPU of the connection response message.
3. The method of GPU network interconnection of claim 2, wherein:
acquiring a verification result after verifying the address, the type field and the data packet type of the destination GPU in the connection request message through the destination GPU;
and sending a connection response message with the data packet type corresponding to the verification result according to the obtained verification result.
4. The method of GPU network interconnection of claim 1, wherein:
when the connection is established successfully, calculating Round Trip Time (RTT) from sending the connection request message to receiving the connection response message by using the timer;
The method further comprises the step of obtaining the size of a sending window for sending the service data packet based on the round trip time RTT and the sending rate of the network output interface of the destination GPU information.
5. The method of GPU network interconnection according to claim 4, wherein:
the method further comprises the steps of sending a service data packet;
The sending service data packet specifically includes:
Based on the size of the sending window, copying business data of the HBM memory of the source GPU to a data cache corresponding to a network output interface of the destination GPU information in batches through a DMA engine, and packaging the business data into a plurality of business data packets;
and sending a service data packet to the destination GPU according to the connection management task, and receiving a data response message sent by the destination GPU based on the service data packet.
6. The method of GPU network interconnection according to claim 5, wherein:
Sending the data response message specifically comprises sending one data response message every 1/2 round trip time RTT, wherein the data packet sequence number of the data response message is the largest sequence number among the service data packets that have passed verification so far.
7. The method of GPU network interconnection according to claim 6, wherein:
If the data response message is not received within the round trip time RTT of 1/2, the sending rate is reduced.
8. The method of GPU network interconnection according to claim 5, wherein:
The service data packet and the data response message comprise a destination GPU address, a source GPU address, a type field, a task number, a data packet type, a data packet sequence number, data packet content and a check code;
the address of the destination GPU of the service data packet is the same as the address of the source GPU of the data response message, and the address of the source GPU of the service data packet is the same as the address of the destination GPU of the data response message.
9. The method of GPU network interconnection according to claim 8, wherein:
acquiring a verification result after verifying the address, the task number and the check code of the destination GPU in the service data packet through the destination GPU;
and determining whether to send a data response message according to the obtained verification result.
10. A device for interconnecting GPUs in a network, applied to the interconnection of at least two GPUs, at least one of said GPUs being a source GPU and at least another of said GPUs being a destination GPU, comprising:
The GPU network management module is used for generating a connection management task according to a forwarding table of a source GPU and an address of a destination GPU, sending a connection request message to the destination GPU according to the connection management task, starting a timer, and recording the timestamp when the connection request message is sent; if a connection response message sent by the destination GPU based on the connection request message is received within the predetermined time, connection establishment succeeds, otherwise it fails and the connection request message is resent; when the connection is established successfully, calculating the round trip time RTT from sending the connection request message to receiving the connection response message, and obtaining the size of the send window for sending service data packets based on the RTT and the sending rate of the network output interface in the destination GPU information, wherein the forwarding table comprises the address of each destination GPU and the number of the corresponding network output interface, the connection management task comprises information about the destination GPU to which it connects, the destination GPU information comprises the number of the network output interface, its IP, and a data cache, and the forwarding table and the data cache are preset in a static random access memory SRAM;
The DMA engine is used for copying business data of the HBM memory of the source GPU to a data cache corresponding to a network output interface of the destination GPU information in batches based on the size of the sending window, and packaging the business data into a plurality of business data packets;
and the GPU network management module is also used for sending a service data packet to the destination GPU according to the connection management task and receiving a data response message sent by the destination GPU based on the service data packet.
CN202411903948.8A (priority 2024-12-23, filed 2024-12-23): GPU network interconnection method and device. Status: Pending. Published as CN119759839A.

Priority Applications (1)

Application number: CN202411903948.8A (published as CN119759839A) · Priority date: 2024-12-23 · Filing date: 2024-12-23 · Title: GPU network interconnection method and device

Applications Claiming Priority (1)

Application number: CN202411903948.8A (published as CN119759839A) · Priority date: 2024-12-23 · Filing date: 2024-12-23 · Title: GPU network interconnection method and device

Publications (1)

Publication number: CN119759839A · Publication date: 2025-04-04

Family

Family ID: 95176606

Family Applications (1)

Application number: CN202411903948.8A · Title: GPU network interconnection method and device · Status: Pending · Publication: CN119759839A

Country Status (1)

CN: CN119759839A (en)


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination