
CN118093499A - Data transmission method, device, equipment and storage medium for remote memory access - Google Patents


Info

Publication number: CN118093499A
Authority: CN (China)
Prior art keywords: message, host, data, transmitted, cache
Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN202410166037.5A
Other languages: Chinese (zh)
Other versions: CN118093499B (en)
Inventors: 张世明, 杜剑峰, 余鑫才
Current assignee: Beigemeis Shenzhen Technology Co ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Original assignee: Beigemeis Shenzhen Technology Co ltd
Application filed by: Beigemeis Shenzhen Technology Co ltd
Priority application: CN202410166037.5A; granted and published as CN118093499B
Current legal status: Active

Classifications

    • H04L49/9047: Buffering arrangements including multiple buffers, e.g. buffer pools
    • G06F12/0811: Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
    • G06F12/084: Multiuser, multiprocessor or multiprocessing cache systems with a shared cache
    • G06F12/0842: Multiuser, multiprocessor or multiprocessing cache systems for multiprocessing or multitasking
    • G06F12/1081: Address translation for peripheral access to main memory, e.g. direct memory access [DMA]
    • G06F15/17331: Distributed shared memory [DSM], e.g. remote direct memory access [RDMA]
    • H04L49/3036: Shared queuing
    • H04L49/9063: Intermediate storage in different physical parts of a node or terminal
    • G06F2212/1012: Design facilitation
    • G06F2212/1024: Latency reduction
    • G06F2212/1044: Space efficiency improvement
    • G06F2212/154: Networked environment


Abstract

The present invention discloses a data transmission method, device, equipment and storage medium for remote memory access. The method includes: after determining that the message type of a message to be transmitted is a small message, communicating through the RDMA Send&Recv primitives; generating a corresponding work queue element and placing it in the send queue; upon determining that the small-message buffer area has remaining cache blocks or free cache, fetching the message to be transmitted from the main-memory address pointed to by the work queue element and sending it; after determining that the message type is a large message, communicating through the RDMA Read/Write primitives, determining that the required capacity of the message is less than or equal to the remaining storage space, allocating capacity in the large-message buffer area according to the required capacity, and splitting the message into multiple data packets for transmission; when a buffer area in the last-level cache module receives data, the buffer area notifies the application to process the data through a cache mapping shared with the application.

Description

Data transmission method, device, equipment and storage medium for remote memory access

Technical Field

The present invention relates to the technical field of data transmission, and in particular to a data transmission method, device, equipment and storage medium for remote memory access.

Background Art

Data center applications are increasingly dependent on high-speed networks and require network communication with high throughput, low latency, and low CPU overhead to support large-scale communication. In particular, bandwidth-intensive applications such as distributed machine learning and cloud storage need more than 100Gbps of network bandwidth between servers, while online services such as database online analytical processing need low latency to minimize query response time. At the same time, most applications want the network stack to have low CPU overhead so as to reserve as many CPU cores as possible for computation.

Remote Direct Memory Access (RDMA) has become a popular high-speed network technology, mainly owing to the high throughput, low latency and low CPU overhead provided by architectural innovations such as kernel bypass and transport offload. However, despite the high performance and low CPU overhead of RDMA, commercial RDMA network interface cards (RNICs) still face main memory bandwidth contention, which manifests as follows: when the RNIC performs direct memory access (DMA) operations to store received messages in main memory, it competes for memory bandwidth with the CPU's other computing processes (such as in-memory data analysis, data copying and garbage collection); under this contention, it is difficult for the RNIC to obtain enough main memory bandwidth to store received data packets in main memory in time, so the RNIC's on-chip cache fills with unprocessed packets, eventually the data in the RNIC overflows, and received but unprocessed packets can only be discarded; moreover, the overflow triggers congestion control, which reduces network throughput and significantly increases latency.

As high-speed networks develop and deployment scales grow, main memory bandwidth contention will further intensify. Therefore, how to avoid the impact of main memory bandwidth contention on RDMA network communication is an urgent problem in RDMA network applications.

Summary of the Invention

In view of this, it is necessary to propose a data transmission method, device, equipment and storage medium for remote memory access to address the above problems.

A data transmission method for remote memory access, the method comprising:

S1: The second host has a message to be transmitted to the first host.

S2: The second host determines the message type of the message to be transmitted; the message type is either a large message or a small message.

S3: After determining that the message type is a small message, the second host and the first host communicate through the RDMA Send&Recv primitives. The second host generates a corresponding work queue element and places it in the send queue; the first host takes a cache block from the small-message buffer area in the last-level cache module, generates a corresponding work queue element, and places it in the shared receive queue.

S4: The first host checks whether the small-message buffer area still has remaining cache blocks or free cache.

S5: If the small-message buffer area has neither remaining cache blocks nor free cache, the first host rejects the second host's communication request.

S6: If the small-message buffer area has remaining cache blocks or free cache, the second host fetches the message from the main-memory address pointed to by the work queue element and sends it to the first host; the first host writes the message into the prepared cache space at the address pointed to by the work queue element in the shared receive queue, then proceeds to S11.

S7: After determining that the message type is a large message, the second host and the first host communicate through the RDMA Read/Write primitives, and the first host determines the required capacity of the message to be transmitted.

S8: The first host checks whether the remaining storage space satisfies the required capacity of the message to be transmitted.

S9: If the required capacity of the message exceeds the remaining storage space, the first host rejects the second host's communication request.

S10: If the required capacity of the message is less than or equal to the remaining storage space, the first host allocates capacity in the large-message buffer area according to the required capacity, and the second host splits the message into multiple data packets for transmission; S11 is executed as soon as each data packet arrives in the first host's large-message buffer area.

S11: When a buffer area in the last-level cache module receives data, the buffer area notifies the application to process the data through a cache mapping shared with the application.

S12: If the data's residence time in the last-level cache module exceeds the data residence time threshold, the data of the message is offloaded to the receive buffer in main memory and the application is informed of the relevant information; the application later fetches the data from main memory and processes it.

S13: If the data's residence time in the last-level cache module is less than or equal to the data residence time threshold, processing is determined to be complete.
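The admission logic of the steps above (small messages gated by free 4 KB cache blocks, large messages gated by remaining buffer capacity) can be sketched as follows. This is an illustrative model, not the patent's implementation; all names are invented, and the 4 KB threshold comes from a later embodiment.

```python
# Hypothetical sketch of the receiver-side admission decisions in S3-S10.
SMALL_MSG_THRESHOLD = 4 * 1024  # 4 KB boundary between small and large messages

class Receiver:
    def __init__(self, small_blocks, large_capacity):
        self.small_blocks = small_blocks      # free 4 KB blocks in the small-message area
        self.large_capacity = large_capacity  # remaining bytes in the large-message area

    def admit(self, msg_size, required_capacity=None):
        """Return the RDMA primitive chosen for the message, or None on rejection."""
        if msg_size <= SMALL_MSG_THRESHOLD:
            # S3-S6: Send&Recv path, needs one free cache block
            if self.small_blocks == 0:
                return None                   # S5: reject the communication request
            self.small_blocks -= 1
            return "SEND_RECV"
        # S7-S10: Read/Write path, needs enough remaining capacity
        need = required_capacity if required_capacity is not None else msg_size
        if need > self.large_capacity:
            return None                       # S9: reject the communication request
        self.large_capacity -= need           # S10: reserve capacity for the packets
        return "READ_WRITE"
```

With one free block and 10 000 bytes of large-message capacity, a second small message or an oversized large message is rejected, mirroring S5 and S9.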

In a specific embodiment, before S1, the method further includes:

The first host establishes RDMA communication with the second host and initializes the communication resources, including the queue pair numbers and data buffer addresses of both sides.

The initialization of communication resources is completed through socket communication or the RDMA connection manager.
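As a minimal sketch of the socket-based variant of this initialization, the snippet below exchanges a queue pair number and a data buffer address over an ordinary socket. The wire format and function names are assumptions for illustration; a real deployment would exchange whatever bootstrap fields its RDMA library requires.

```python
# Illustrative out-of-band exchange of RDMA bootstrap parameters over a socket.
import socket
import struct

FMT = "!IQ"  # queue pair number (u32) + buffer address (u64), network byte order

def send_conn_info(sock, qp_num, buf_addr):
    """Send this side's queue pair number and data buffer address to the peer."""
    sock.sendall(struct.pack(FMT, qp_num, buf_addr))

def recv_conn_info(sock):
    """Receive the peer's queue pair number and data buffer address."""
    data = sock.recv(struct.calcsize(FMT))
    return struct.unpack(FMT, data)

# Demonstrate with a local socket pair standing in for the two hosts.
a, b = socket.socketpair()
send_conn_info(a, 42, 0x7F0000001000)
qp, addr = recv_conn_info(b)
```

After this exchange, each side knows where the other's registered buffer lives, which is what the later RDMA Read/Write operations need.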

In the method, determining the message type of the message to be transmitted specifically includes: comparing the size of the message to be transmitted with a data threshold and determining the message type from the comparison result.

If the size of the message to be transmitted is less than or equal to the data threshold, the message type is determined to be a small message.

If the size of the message to be transmitted is greater than the data threshold, the message type is determined to be a large message.

In a specific embodiment, the data threshold is set to 4KB, and each work queue element in the shared receive queue points to a 4KB cache block in the last-level cache module.

In a specific embodiment, cache space dedicated to data reception under RDMA communication is allocated in the last-level cache module, and the small-message buffer area and the large-message buffer area share this cache space. The cache space size of the last-level cache module = network bandwidth × data residence time threshold × proportion of resident data to be processed in the cache space.
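A worked example of this sizing rule is shown below. The 200 microsecond threshold is taken from a later embodiment; the link speed and resident-data proportion are illustrative assumptions, not values from the patent.

```python
# Worked example of: cache space = bandwidth * residence threshold * resident fraction.
NETWORK_BANDWIDTH = 100e9 / 8   # assumed 100 Gbps link, in bytes per second
RESIDENCE_THRESHOLD = 200e-6    # data residence time threshold, seconds
RESIDENT_FRACTION = 0.8         # assumed share of data processed in-cache

cache_space = NETWORK_BANDWIDTH * RESIDENCE_THRESHOLD * RESIDENT_FRACTION
# ~2.0e6 bytes, i.e. about 2 MB of last-level cache reserved for RDMA reception
```

The formula is a bandwidth-delay product scaled by how much of the arriving data is expected to be consumed directly from the cache, which is why a few megabytes suffice even at 100 Gbps.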

In a specific embodiment, the first host determines the required capacity of the message to be transmitted as follows: required capacity of the message to be transmitted = amount of data to be sent × data residence time threshold, where the data residence time threshold is set to 200 microseconds.
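A numeric illustration of this capacity rule follows. For the product to come out in bytes, the "amount of data to be sent" is read here as a send rate in bytes per second; that reading, and the 50 Gbps figure, are interpretive assumptions, not statements from the patent.

```python
# Worked example of: required capacity = data-to-send rate * residence threshold.
send_rate = 50e9 / 8            # assumed sender rate of 50 Gbps, in bytes/second
RESIDENCE_THRESHOLD = 200e-6    # 200 microseconds, per the embodiment

required_capacity = send_rate * RESIDENCE_THRESHOLD
# ~1.25e6 bytes that must be free in the large-message buffer area before S10 proceeds
```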

In a specific embodiment, the second host splits the message to be sent into multiple data packets for transmission, specifically: splitting the message to be sent into several data packets of no more than 256KB each.
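The splitting step can be sketched in a few lines; the function name is illustrative and the 256 KB cap is the value from the embodiment above.

```python
# Minimal sketch of splitting a large message into packets of at most 256 KB.
MAX_PACKET = 256 * 1024

def split_message(payload: bytes):
    """Slice the payload into consecutive chunks no larger than MAX_PACKET."""
    return [payload[i:i + MAX_PACKET] for i in range(0, len(payload), MAX_PACKET)]
```

A 600 KB message, for example, yields two full 256 KB packets plus one 88 KB remainder.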

A remote communication device, the device comprising:

a request module, configured to generate a communication request when the second host has a message to be transmitted to the first host;

a message type determination module, configured to determine the message type of the message to be transmitted, the message type being either a large message or a small message;

a small message transmission module, configured so that, after the message type is determined to be a small message, the second host and the first host communicate through the RDMA Send&Recv primitives; the second host generates a corresponding work queue element and places it in the send queue; the first host takes a cache block from the small-message buffer area in the last-level cache module, generates a corresponding work queue element, places it in the shared receive queue, and checks whether the small-message buffer area still has remaining cache blocks or free cache; if the small-message buffer area has neither remaining cache blocks nor free cache, the communication request from the second host to the first host is rejected; if it has remaining cache blocks or free cache, the second host fetches the message from the main-memory address pointed to by the work queue element and sends it to the first host, and the first host writes the message into the prepared cache space at the address pointed to by the work queue element in the shared receive queue;

a large message transmission module, configured so that, after the message type is determined to be a large message, the second host and the first host communicate through the RDMA Read/Write primitives; the first host determines the required capacity of the message and checks whether the remaining storage space satisfies it; if the required capacity exceeds the remaining storage space, the communication request from the second host to the first host is rejected; if the required capacity is less than or equal to the remaining storage space, the first host allocates capacity in the large-message buffer area according to the required capacity, and the second host splits the message into multiple data packets for transmission;

a transmission processing module, configured so that, when a buffer area in the last-level cache module receives data, the buffer area notifies the application to process the data through a cache mapping shared with the application; if the data's residence time in the last-level cache module exceeds the data residence time threshold, the data of the message is offloaded to the receive buffer in main memory and the application is informed of the relevant information, after which the application fetches the data from main memory and processes it; if the residence time is less than or equal to the data residence time threshold, processing is determined to be complete.
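The residence-time branch in the transmission processing module reduces to a single threshold comparison; the sketch below uses invented names and the 200 microsecond threshold from the embodiment above.

```python
# Hedged sketch of the residence-time decision in the transmission processing module.
RESIDENCE_THRESHOLD = 200e-6  # data residence time threshold in seconds

def placement_after(residence_time, threshold=RESIDENCE_THRESHOLD):
    """Decide where data ends up once its time in the last-level cache is known."""
    if residence_time > threshold:
        return "main_memory"  # offloaded to the main-memory receive buffer (S12 path)
    return "processed"        # consumed directly from the cache (S13 path)
```

This is what keeps the shared cache from filling with stale data: anything the application does not consume within the threshold is pushed out to main memory.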

An electronic device, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the following steps:

Steps S1 to S13 of the data transmission method for remote memory access described above, which are not repeated here.

A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the following steps:

Steps S1 to S13 of the data transmission method for remote memory access described above, which are not repeated here.

采用本发明实施例,具有如下有益效果:The embodiments of the present invention have the following beneficial effects:

本发明分别对小消息和大消息采用不同的RDMA操作和控制方法。对于小消息,选择RDMA SEND/RECV,并用共享接收队列机制来聚合接收队列;对于大消息,使用RDMA READ/WRITE,根据需求容量大小,两者共享使用数据接收缓存池的缓存空间,通过设置两类消息的使用数据阈值来避免缓存资源独占。The present invention applies different RDMA operations and control methods to small and large messages. For small messages, RDMA SEND/RECV is selected, and a shared receive queue mechanism aggregates the receive queues; for large messages, RDMA READ/WRITE is used and buffer space is allocated according to the required capacity. Both message types share the cache space of the data-receiving cache pool, and exclusive occupation of cache resources is avoided by setting a data-usage threshold for each message type.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

为了更清楚地说明本发明实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本发明的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。To more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art may derive other drawings from them without creative effort.

其中:In the drawings:

图1为一个实施例中远程内存访问的数据传输方法的流程图;FIG1 is a flow chart of a data transmission method for remote memory access in one embodiment;

图2为一个实施例中远程内存访问的数据传输方法中小消息接收过程的流程图;FIG2 is a flow chart of a small message receiving process in a data transmission method for remote memory access in one embodiment;

图3为一个实施例中远程内存访问的数据传输方法中大消息接收过程的流程图;FIG3 is a flow chart of a large message receiving process in a data transmission method for remote memory access in one embodiment;

图4为一个实施例中电子设备的结构框图。FIG. 4 is a structural block diagram of an electronic device in an embodiment.

具体实施方式DETAILED DESCRIPTION

下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

本发明解决主存带宽竞争问题的基本思路是避免过多使用慢速的主存,而直接使用CPU缓存,即在末级缓存(LLC,Last Level Cache)中保留一小块隔离的缓存池。从带宽上看,CPU缓存足够满足RNIC的需求,但只能利用非常有限的缓存容量。如果RNIC获得的数据足够小,同时生命周期足够短,那么CPU缓存是足够的。但对于数据比较大或者生命期较长的数据,则需要更好的解决方案。The basic idea of the present invention to solve the main memory bandwidth competition problem is to avoid excessive use of slow main memory and directly use the CPU cache, that is, to reserve a small isolated cache pool in the last level cache (LLC). From the bandwidth point of view, the CPU cache is sufficient to meet the needs of RNIC, but only a very limited cache capacity can be used. If the data obtained by RNIC is small enough and the life cycle is short enough, then the CPU cache is sufficient. However, for data that is relatively large or has a long life cycle, a better solution is needed.

具体而言,RNIC数据接收模块将主存移出接收方的数据路径,并在末级缓存(LLC)中保留一小块隔离的缓存池,用于接收来自网卡的消息数据;然后,为了提高预留LLC的复用水平,对小消息和大消息分别采用不同的RDMA操作和控制方法;小消息采用RDMA SEND/RECV操作,并利用共享接收队列机制来聚合接收队列;大消息采用RDMA READ/WRITE操作,并根据消息大小和数据停留时间阈值来分配缓存。小消息和大消息都共享缓存池,并通过用户设定的数据停留时间阈值和需要在缓存空间中处理的停留数据比例来避免缓存资源的独占。Specifically, the RNIC data receiving module moves the main memory out of the receiver's data path and reserves a small isolated cache pool in the last-level cache (LLC) to receive message data from the network card; then, in order to improve the reuse level of the reserved LLC, different RDMA operations and control methods are used for small messages and large messages respectively; small messages use RDMA SEND/RECV operations and use the shared receive queue mechanism to aggregate the receive queue; large messages use RDMA READ/WRITE operations and allocate cache according to the message size and data residence time threshold. Small and large messages share the cache pool and avoid exclusive cache resources through the user-set data residence time threshold and the proportion of residence data that needs to be processed in the cache space.

此外,为了提高数据处理效率和缓存重用,本发明使用SLAB(Sequential Locality Allocation Buffer)算法管理缓存以避免内部碎片的产生,使用多线程并行数据处理操作以提高处理效率,并利用管道技术实现更细粒度的缓存回收。为了维持高性能网络,当应用程序占用缓存的时间超过阈值时,缓存池上的数据都会被复制到主存中,从而快速释放缓存。In addition, to improve data-processing efficiency and cache reuse, the present invention uses the SLAB (Sequential Locality Allocation Buffer) algorithm to manage the cache and avoid internal fragmentation, uses multi-threaded parallel data processing to improve processing efficiency, and uses pipelining to achieve finer-grained cache reclamation. To maintain a high-performance network, when an application occupies the cache for longer than the threshold, the data in the cache pool is copied to main memory, thereby quickly freeing the cache.

本发明能显著提升通信吞吐量,降低网络延迟和主存带宽的占用。The present invention can significantly improve communication throughput and reduce network delay and main memory bandwidth occupancy.

已有实验结果表明,本发明使通信吞吐量提升到2.11倍、平均网络延迟降低了64.2%、主存带宽的占用下降了96.4%。Existing experimental results show that the present invention increases communication throughput to 2.11 times, reduces average network latency by 64.2%, and reduces main-memory bandwidth occupancy by 96.4%.

在一个实施例中,提供了一种远程内存访问的数据传输方法。In one embodiment, a data transmission method for remote memory access is provided.

如图1所示,该远程内存访问的数据传输方法,具体包括如下步骤:As shown in FIG1 , the data transmission method for remote memory access specifically includes the following steps:

S1、第二主机需要向第一主机发送的待传输消息;S1: The second host has a message to be transmitted to the first host;

S2、所述第二主机确定所述待传输消息的消息类型,所述消息类型包括大消息、小消息;S2. The second host determines the message type of the message to be transmitted, where the message type includes a large message and a small message;

具体地,根据所述待传输消息的大小与数据阈值进行比较,根据比较结果确定消息类型;Specifically, comparing the size of the message to be transmitted with a data threshold, and determining the message type according to the comparison result;

如果所述待传输消息的大小小于等于数据阈值,确定消息类型为小消息;If the size of the message to be transmitted is less than or equal to the data threshold, determining that the message type is a small message;

如果所述待传输消息的大小大于数据阈值,确定消息类型为大消息。If the size of the message to be transmitted is greater than the data threshold, it is determined that the message type is a large message.

示例性地,所述数据阈值设置为4KB;共享接收队列中每个工作队列元素指向末级缓存模块中的大小为4KB的缓存块。Exemplarily, the data threshold is set to 4 KB; each work queue element in the shared receiving queue points to a cache block of 4 KB in the last-level cache module.
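
As a minimal illustration of the S2 decision above, the 4 KB threshold test can be sketched in Python (the function name and return values are illustrative, not part of the invention):

```python
DATA_THRESHOLD = 4 * 1024  # 4 KB data threshold, as set in this embodiment

def classify_message(size_bytes: int) -> str:
    """Return 'small' for the Send/Recv path, 'large' for the Read/Write path."""
    return "small" if size_bytes <= DATA_THRESHOLD else "large"
```

A message of exactly 4 KB is still treated as a small message, since the description uses "less than or equal to" for the small-message case.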

S3、确定所述待传输消息的消息类型为小消息后,所述第二主机与第一主机通过RDMA Send&Recv原语进行通信;所述第二主机生成对应的工作队列元素放入发送队列;所述第一主机从末级缓存模块中的小消息缓存区取出一个缓存块并生成相应的工作队列元素放入共享接收队列;S3. After determining that the message type of the message to be transmitted is a small message, the second host communicates with the first host through the RDMA Send&Recv primitive; the second host generates a corresponding work queue element and puts it into the sending queue; the first host takes a cache block from the small message cache area in the last-level cache module and generates a corresponding work queue element and puts it into the shared receiving queue;

具体地,所述末级缓存模块中分配属于RDMA通信下的数据接收使用的缓存空间,小消息缓存区和大消息缓存区共用该缓存空间;所述末级缓存模块的缓存空间大小=网络带宽*数据停留时间阈值*需要在缓存空间中处理的停留数据比例。Specifically, the cache space used for data reception under RDMA communication is allocated in the last-level cache module, and the small message cache area and the large message cache area share the cache space; the cache space size of the last-level cache module = network bandwidth * data residence time threshold * the proportion of residence data that needs to be processed in the cache space.
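The sizing formula above can be checked with a short sketch. The function name is an illustrative assumption, and bandwidth is taken in bits per second; the numbers follow the 200 Gbps / 200 microsecond example given later in the description.

```python
def reserved_llc_bytes(bandwidth_bps: float, residence_s: float,
                       proportion: float) -> float:
    """Cache pool size = network bandwidth * data residence time threshold
    * proportion of resident data processed in the cache space."""
    return bandwidth_bps / 8 * residence_s * proportion

# Example from the description: a 200 Gbps link with a 200 microsecond
# residence threshold, all resident data processed in cache, needs about
# 25e9 bytes/s * 200e-6 s = 5e6 bytes (5 MB) reserved in the LLC.
size = reserved_llc_bytes(200e9, 200e-6, 1.0)
```

This is an instance of Little's law, which the description invokes later to justify the small reserved-buffer size.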

第一主机从末级缓存模块中的小消息缓存区取出一个缓存块并生成相应的工作队列元素放入共享接收队列。第二主机生成对应的工作队列元素请求放入发送队列,其中区分消息类型的数据阈值设置为4KB。共享接收队列中每个工作队列元素指向末级缓存模块中的大小为4KB的缓存块。The first host takes a cache block from the small message cache area in the last-level cache module, generates a corresponding work queue element, and puts it into the shared receive queue. The second host generates a corresponding work queue element request and puts it into the send queue, where the data threshold distinguishing message types is set to 4 KB. Each work queue element in the shared receive queue points to a 4 KB cache block in the last-level cache module.

所述末级缓存模块中的缓存根据实际CPU结构决定。如果CPU的最后一级缓存为L3缓存,则缓存在L3缓存中分配;如果CPU的最后一级缓存为L4缓存,则缓存在L4缓存中分配。The cache in the last level cache module is determined according to the actual CPU structure. If the last level cache of the CPU is L3 cache, the cache is allocated in the L3 cache; if the last level cache of the CPU is L4 cache, the cache is allocated in the L4 cache.

末级缓存模块中分配的缓存空间专属于RDMA通信下的数据接收使用,不能用于其他用途,即缓存空间是隔离的,并且小消息缓存区和大消息缓存区共用一块缓存空间,但两者存在最低占有空间阈值,当一方低于自身设置的阈值时,拒绝对方继续占据缓存空间。最低占有空间阈值的大小根据实际业务中小消息和大消息的请求频率决定。The cache space allocated in the last-level cache module is exclusively used for data reception under RDMA communication and cannot be used for other purposes. That is, the cache space is isolated, and the small message cache area and the large message cache area share a cache space, but there is a minimum occupied space threshold between the two. When one side is lower than the threshold set by itself, the other side is refused to continue to occupy the cache space. The size of the minimum occupied space threshold is determined according to the request frequency of small and large messages in the actual business.
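
A toy model of this shared pool with per-class minimum-occupancy guarantees might look as follows. The class and method names are hypothetical, and the rule is simplified to "an allocation is refused if it would leave the other class unable to reach its minimum share"; the patent does not prescribe an implementation.

```python
class SharedCachePool:
    """Illustrative model of the cache space shared by the small-message
    and large-message areas, each with a minimum-occupancy threshold."""

    def __init__(self, total: int, small_min: int, large_min: int):
        self.total = total
        self.min = {"small": small_min, "large": large_min}
        self.used = {"small": 0, "large": 0}

    def free(self) -> int:
        return self.total - self.used["small"] - self.used["large"]

    def allocate(self, cls: str, size: int) -> bool:
        other = "large" if cls == "small" else "small"
        # Refuse if the allocation would leave the other class with less
        # free space than it needs to reach its guaranteed minimum.
        if self.free() - size < max(0, self.min[other] - self.used[other]):
            return False
        self.used[cls] += size
        return True

    def release(self, cls: str, size: int) -> None:
        self.used[cls] = max(0, self.used[cls] - size)
```

In practice the two minimums would be tuned to the observed request frequencies of small and large messages, as the description states.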

末级缓存模块中小消息缓存区和大消息缓存区的缓存空间管理使用基于对象管理的SLAB算法,管理的对象粒度为4KB。所述小消息缓存区预先分配固定数量的缓存块,预分配的缓存块数量与设置的空间阈值和实际的小消息请求频率有关。SLAB(Sequential Locality Allocation Buffer)算法是一种用于内存管理的高效算法,用于提高内存分配和释放的效率,减少内存碎片化。该算法通过将内存分为不同的缓存块,根据对象的大小选择合适的块进行分配,从而避免了频繁的内存分配和释放操作,提高了内存的利用率。The cache space management of the small message cache area and the large message cache area in the last-level cache module uses the SLAB algorithm based on object management, and the object granularity of management is 4KB. The small message cache area pre-allocates a fixed number of cache blocks, and the number of pre-allocated cache blocks is related to the set space threshold and the actual small message request frequency. The SLAB (Sequential Locality Allocation Buffer) algorithm is an efficient algorithm for memory management, which is used to improve the efficiency of memory allocation and release and reduce memory fragmentation. The algorithm divides the memory into different cache blocks and selects the appropriate block for allocation according to the size of the object, thereby avoiding frequent memory allocation and release operations and improving memory utilization.

对于RDMA Send/Recv原语的通信,在数据传输前,接收方会检查小消息缓存区中是否存在空闲缓存块,存在则生成相对应的工作队列元素并通知对端发送数据。For the communication of RDMA Send/Recv primitives, before data transmission, the receiver will check whether there is a free cache block in the small message buffer. If so, it will generate a corresponding work queue element and notify the other end to send data.

上述原语的通信中,每个数据包抵达缓存区的相应区域后,缓存监控器立即通过与应用程序的共享缓存映射通知应用程序处理数据,应用程序处理完成后通过共享缓存映射通知缓存监控器释放回收已完成使用的缓存以供进一步使用。如果数据处理时间超过最大停留时间,则缓存监控器将该数据复制到主存中并通知应用程序,并释放回收该数据占有的缓存以供进一步使用。In the communication of the above primitives, after each data packet arrives at the corresponding area of the cache, the cache monitor immediately notifies the application to process the data through the shared cache mapping with the application. After the application completes the processing, it notifies the cache monitor through the shared cache mapping to release and recycle the cache that has been used for further use. If the data processing time exceeds the maximum residence time, the cache monitor copies the data to the main memory and notifies the application, and releases and recycles the cache occupied by the data for further use.

应用程序处理数据时,会采用多线程与管道技术处理抵达缓存的数据。LLC中数据接收缓存池所接收的数据均使用SLAB算法以4KB的粒度大小进行管理,也就是说,将到来的每个数据包切分为不大于4KB大小的小包后输入到一个数据处理的流水线管道进行处理。所述的流水线管道包括数据存放、数据处理、数据释放三个阶段,采用多线程并行处理,一个粒度数据经历三个阶段后立即释放相应的缓存空间,而不需要等待整个消息被应用程序处理后释放。When the application processes data, it uses multi-threading and pipeline technology to process the data arriving at the cache. The data received by the data receiving cache pool in LLC is managed with a granularity of 4KB using the SLAB algorithm, that is, each incoming data packet is cut into small packets no larger than 4KB and then input into a data processing pipeline for processing. The pipeline includes three stages: data storage, data processing, and data release. It uses multi-threaded parallel processing. After a granular data goes through three stages, the corresponding cache space is released immediately without waiting for the entire message to be processed by the application.
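
The store/process/release pipeline described above can be sketched as follows. The queue-based staging, thread layout, and function names are illustrative assumptions, not the patent's actual implementation; the point is that each 4 KB granule flows through the stages and is released independently, without waiting for the whole message.

```python
import queue
import threading

GRANULE = 4 * 1024  # data is managed and processed in 4 KB granules

def run_pipeline(message: bytes, process):
    """Split `message` into 4 KB granules and push each through a
    store -> process -> release pipeline running on parallel threads."""
    to_process, to_release = queue.Queue(), queue.Queue()
    released = []  # stands in for per-granule cache-slot reclamation

    def processor():
        while True:
            granule = to_process.get()
            if granule is None:          # end-of-message sentinel
                to_release.put(None)
                break
            to_release.put(process(granule))

    def releaser():
        while True:
            item = to_release.get()
            if item is None:
                break
            released.append(item)        # granule's cache slot freed here

    threads = [threading.Thread(target=processor),
               threading.Thread(target=releaser)]
    for t in threads:
        t.start()
    # Stage 1, data storage: enqueue each arriving 4 KB granule.
    for off in range(0, len(message), GRANULE):
        to_process.put(message[off:off + GRANULE])
    to_process.put(None)
    for t in threads:
        t.join()
    return released
```

For a 10,000-byte message this produces three granules (4096, 4096, and 1808 bytes), each reclaimed as soon as it leaves the pipeline.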

S4:所述第一主机判断所述小消息缓存区中是否存在剩余缓存块或存在空闲缓存;S4: The first host determines whether there are any remaining cache blocks or any free cache in the small message cache area;

S5:如果所述小消息缓存区不存在剩余缓存块和空闲缓存,则所述第一主机拒绝所述第二主机的通信请求;S5: If there are no remaining cache blocks and free cache in the small message cache area, the first host rejects the communication request of the second host;

S6:如果所述小消息缓存区存在剩余缓存块或者空闲缓存,所述第二主机根据工作队列元素指向的主存消息存放地址取出待传输消息发送到所述第一主机,所述第一主机根据共享接收队列中的工作队列元素指向的地址将待传输消息传输到准备好的缓存空间中,执行S11;S6: If there are remaining cache blocks or free caches in the small message cache area, the second host takes out the message to be transmitted according to the main memory message storage address pointed to by the work queue element and sends it to the first host, and the first host transmits the message to be transmitted to the prepared cache space according to the address pointed to by the work queue element in the shared receiving queue, and executes S11;

具体地,数据地址的转换和权限保护功能分别由主存转换表(MTT,Memory Translation Table)和主存保护表(MPT,Memory Protection Table)提供,同时采用巨页技术、物理段技术、流量局部性技术缩小表结构,将这两种表存放在网卡存储介质中。Specifically, the data address translation and permission protection functions are provided by the main memory translation table (MTT) and the main memory protection table (MPT), respectively. At the same time, huge page technology, physical segment technology, and traffic locality technology are used to shrink the table structures, and these two tables are stored in the network card storage medium.

S7、确定所述待传输消息的消息类型为大消息后,所述第二主机与第一主机通过RDMA Read/Write原语进行通信,所述第一主机确定待传输消息的需求容量大小;S7. After determining that the message type of the message to be transmitted is a large message, the second host communicates with the first host through an RDMA Read/Write primitive, and the first host determines the required capacity size of the message to be transmitted;

具体地,待传输消息的需求容量=待发送数据量*数据停留时间阈值,数据停留时间阈值设置为200微秒。Specifically, the required capacity of the message to be transmitted = the amount of data to be sent * the data residence time threshold, and the data residence time threshold is set to 200 microseconds.

对于RDMA Read/Write原语的通信,在数据传输前,接收方会计算接收数据所需要的缓存容量。如果大消息缓存区的剩余缓存空间大于等于需求的容量,则分配缓存空间并通知对端发送数据。待发送消息发送前,会按照预先设置的最大传输单元大小将消息切分为一个个数据包传输。For RDMA Read/Write primitive communication, before data transmission, the receiver will calculate the buffer capacity required to receive the data. If the remaining buffer space of the large message buffer is greater than or equal to the required capacity, the buffer space is allocated and the peer is notified to send the data. Before the message is sent, it will be divided into data packets according to the preset maximum transmission unit size.

上述原语的通信中,每个数据包抵达缓存区的相应区域后,缓存监控器立即通过与应用程序的共享缓存映射通知应用程序处理数据,应用程序处理完成后通过共享缓存映射通知缓存监控器释放回收已完成使用的缓存以供进一步使用。如果数据处理时间超过最大停留时间,则缓存监控器将该数据复制到主存中并通知应用程序,并释放回收该数据占有的缓存以供进一步使用。In the communication of the above primitives, after each data packet arrives at the corresponding area of the cache, the cache monitor immediately notifies the application to process the data through the shared cache mapping with the application. After the application completes the processing, it notifies the cache monitor through the shared cache mapping to release and recycle the cache that has been used for further use. If the data processing time exceeds the maximum residence time, the cache monitor copies the data to the main memory and notifies the application, and releases and recycles the cache occupied by the data for further use.

应用程序处理数据时,会采用多线程与管道技术处理抵达缓存的数据。LLC中数据接收缓存池所接收的数据均使用SLAB算法以4KB的粒度大小进行管理,也就是说,将到来的每个数据包切分为不大于4KB大小的小包后输入到一个数据处理的流水线管道进行处理。所述的流水线管道包括数据存放、数据处理、数据释放三个阶段,采用多线程并行处理,一个粒度数据经历三个阶段后立即释放相应的缓存空间,而不需要等待整个消息被应用程序处理后释放。When the application processes data, it uses multi-threading and pipeline technology to process the data arriving at the cache. The data received by the data receiving cache pool in LLC is managed with a granularity of 4KB using the SLAB algorithm, that is, each incoming data packet is cut into small packets no larger than 4KB and then input into a data processing pipeline for processing. The pipeline includes three stages: data storage, data processing, and data release. It uses multi-threaded parallel processing. After a granular data goes through three stages, the corresponding cache space is released immediately without waiting for the entire message to be processed by the application.

S8、所述第一主机判断所述待传输消息的需求容量大小是否满足剩余存储空间;S8. The first host determines whether the required capacity of the message to be transmitted satisfies the remaining storage space;

S9、如果所述待传输消息的需求容量大于剩余存储空间,所述第一主机拒绝所述第二主机的通信请求;S9. If the required capacity of the message to be transmitted is greater than the remaining storage space, the first host rejects the communication request of the second host;

S10、如果所述待传输消息的需求容量小于等于剩余存储空间,所述第一主机根据需求容量在大消息缓存区分配容量,所述第二主机将待发送消息切分为多个数据包发送,每个数据包一抵达第一主机的大消息缓存区后立刻执行S11;S10, if the required capacity of the message to be transmitted is less than or equal to the remaining storage space, the first host allocates capacity in the large message buffer according to the required capacity, and the second host divides the message to be sent into multiple data packets for transmission, and executes S11 immediately after each data packet arrives at the large message buffer of the first host;

具体地,将待发送消息切分为若干个不大于256KB的数据包发送。Specifically, the message to be sent is divided into several data packets no larger than 256KB and sent.

第一主机从大消息缓存区分配需求的容量,第二主机将待发送消息切分为多个数据包发送,每个数据包不大于256KB。The first host allocates the required capacity from the large message buffer area, and the second host divides the message to be sent into multiple data packets for transmission, each of which is no larger than 256KB.

每个数据包一抵达第一主机的大消息缓存区立刻进行下一步,不等待消息完全传输。As soon as each data packet arrives at the large message buffer of the first host, it proceeds to the next step without waiting for the message to be completely transmitted.
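
The packet-splitting step in S10 amounts to slicing the message into chunks of at most 256 KB; a minimal sketch (the function name is illustrative):

```python
MAX_PACKET = 256 * 1024  # each transmitted packet is no larger than 256 KB

def split_large_message(message: bytes) -> list[bytes]:
    """Split a large message into packets of at most MAX_PACKET bytes."""
    return [message[i:i + MAX_PACKET]
            for i in range(0, len(message), MAX_PACKET)]
```

A 600 KB message, for example, yields two full 256 KB packets plus one 88 KB tail packet, and each packet can enter the receive pipeline as soon as it lands in the large message buffer.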

S11、当末级缓存模块中的缓存区接收数据后,缓存区通过与应用程序的共享缓存映射通知应用程序处理数据;S11, when the cache area in the last-level cache module receives the data, the cache area notifies the application to process the data through a shared cache mapping with the application;

具体地,数据通过多线程与管道技术进行流水线处理。Specifically, data is pipelined through multi-threading and pipeline technology.

所述多线程和管道技术进行数据的流水线处理的具体细节为,末级缓存模块中对于到来的数据,以4KB粒度大小切分数据,切分的数据经过数据存放、数据处理、数据释放三个阶段,采用多线程并行处理,形成一个数据处理的流水线管道。一个粒度的数据经历三个阶段后立即由缓存监控模块释放并回收相应的缓存空间,而不需要等待整个消息被应用程序处理后才释放回收缓存空间。The specific details of the multi-thread and pipeline technology for data pipeline processing are as follows: the last-level cache module divides the incoming data into 4KB granularity, and the divided data undergoes three stages: data storage, data processing, and data release, and multi-thread parallel processing is used to form a data processing pipeline. After a granularity of data has gone through the three stages, the cache monitoring module immediately releases and reclaims the corresponding cache space, without waiting for the entire message to be processed by the application before releasing and reclaiming the cache space.

S12、如果数据在末级缓存模块中的停留时间大于数据停留时间阈值,将该消息的数据卸载到主存的接收缓冲区中,并告知所述应用程序相关信息,所述应用程序后续从主存中取出数据并处理;S12, if the residence time of the data in the last-level cache module is greater than the data residence time threshold, unloading the data of the message to the receiving buffer of the main memory, and informing the application of the relevant information, and the application subsequently takes out the data from the main memory and processes it;

具体地,对于卸载的数据,缓存监控模块回收释放相应的缓存空间。Specifically, for the unloaded data, the cache monitoring module reclaims and releases the corresponding cache space.

S13、如果数据在末级缓存模块中的停留时间小于等于数据停留时间阈值,确定处理完成。S13: If the residence time of the data in the last-level cache module is less than or equal to the data residence time threshold, it is determined that the processing is completed.

具体地,如果数据在末级缓存模块中的停留时间小于数据停留时间阈值,数据离开管道则认为处理完成,不等待处理结果,缓存监控模块回收释放相应的缓存空间。Specifically, if the residence time of the data in the last-level cache module is less than the data residence time threshold, the data leaves the pipeline and the processing is considered completed. Without waiting for the processing result, the cache monitoring module reclaims and releases the corresponding cache space.
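
The S12/S13 residence-time decision can be modeled as below. The function name and the `offload_to_main_memory` callback are hypothetical, and the 200 microsecond threshold follows the value given earlier in this embodiment.

```python
RESIDENCE_THRESHOLD_S = 200e-6  # 200 microsecond data residence threshold

def handle_residency(arrival_s: float, now_s: float, offload_to_main_memory):
    """If data has stayed in the LLC pool longer than the threshold (S12),
    offload it to the main-memory receive buffer so its cache space can be
    reclaimed; otherwise (S13) it completed processing in the cache."""
    if now_s - arrival_s > RESIDENCE_THRESHOLD_S:
        offload_to_main_memory()  # application later fetches it from main memory
        return "offloaded"
    return "done"
```

Either branch ends with the cache monitoring module reclaiming the granule's cache space, so the reserved pool is never held hostage by a slow consumer.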

进一步地,所述S1之前,所述方法还包括:Furthermore, before S1, the method further includes:

所述第一主机与第二主机建立RDMA通信,初始化通信资源,包括双方的队列对编号、数据缓存地址;The first host establishes RDMA communication with the second host and initializes communication resources, including queue pair numbers and data cache addresses of both parties;

所述初始化通信资源通过套接字通信或RDMA连接管理器完成。The initialization of communication resources is completed through socket communication or RDMA connection manager.

具体地,初始化通信资源包括双方的队列对编号、数据缓存地址等信息。双方的数据从主存中发出,末级缓存进行数据的接收。Specifically, the initialized communication resources include information such as the queue pair numbers and data cache addresses of both parties. Each party's data is sent from its main memory, and the last-level cache receives the data.

示例性地,假设第二主机需要向第一主机传输信息,通信资源的初始化可以通过套接字通信或RDMA连接管理器完成。Exemplarily, assuming that the second host needs to transmit information to the first host, the initialization of communication resources can be completed through socket communication or RDMA connection manager.

队列对上下文存储在网卡存储介质中,队列对调度器负责调度就绪的队列对,工作队列元素移动器负责每次从主存中批量取出就绪队列对的工作队列元素。The queue pair context is stored in the network card storage medium. The queue pair scheduler is responsible for scheduling the ready queue pairs, and the work queue element mover is responsible for taking out the work queue elements of the ready queue pairs in batches from the main memory each time.

在LLC中根据网络带宽与业务复杂度分配一块隔离的缓存用作数据接收的缓存池,为了提高LLC的统计复用水平,本发明分别对小消息和大消息采用不同的RDMA操作和控制方法。对于小消息,选择RDMA SEND/RECV,并用共享接收队列机制来聚合接收队列;对于大消息,使用RDMA READ/WRITE,根据消息大小与数据停留时间阈值的乘积分配需求的缓存,两者共享使用数据接收缓存池的缓存空间,通过设置两类消息的使用阈值来避免缓存资源独占。In LLC, an isolated cache is allocated according to network bandwidth and business complexity as a cache pool for data reception. In order to improve the statistical multiplexing level of LLC, the present invention adopts different RDMA operations and control methods for small messages and large messages respectively. For small messages, RDMA SEND/RECV is selected, and the receiving queue is aggregated using a shared receiving queue mechanism; for large messages, RDMA READ/WRITE is used, and the required cache is allocated according to the product of the message size and the data residence time threshold. The two share the cache space of the data receiving cache pool, and the cache resource monopoly is avoided by setting the usage threshold of the two types of messages.

与缓存不隔离的数据直接IO存取(Data Direct Input Output)技术相比,本发明将命中吞吐量平均提升到1.6倍,网络吞吐量平均提升5%,主存带宽占用量平均减少5%。同时,根据利特尔法则限制数据接收使用的缓冲区的大小来解决预留的LLC的小尺寸和RDMA队列对的低效预留之间的不匹配,只需要很小的一块缓存就可以支撑起很高的带宽需求。Compared with Data Direct I/O technology, in which the cache is not isolated, the present invention improves hit throughput to 1.6 times on average, improves network throughput by 5% on average, and reduces main-memory bandwidth occupancy by 5% on average. Meanwhile, limiting the size of the data-receiving buffer according to Little's law resolves the mismatch between the small size of the reserved LLC and the inefficient reservation of RDMA queue pairs: only a small cache is needed to support a high bandwidth demand.

例如:对于200Gbps的带宽,数据停留时间阈值为200微秒的情况下,本发明只需要5MB的缓存空间就可以支撑需求。For example, for a bandwidth of 200 Gbps, when the data residence time threshold is 200 microseconds, the present invention only needs 5 MB of cache space to support the demand.

为了更好理解本发明的一种远程内存访问的数据传输方法中在接收小消息时的具体细节和特征,结合图2进行详细说明,包括步骤1至步骤6:In order to better understand the specific details and features of receiving a small message in a remote memory access data transmission method of the present invention, a detailed description is given in conjunction with FIG. 2, including steps 1 to 6:

步骤1:准备双方通信的资源,包括构造队列对,获取对端的队列对编号、数据缓存地址等信息,双方使用可靠连接器建立连接;完成建立后,第二主机对第一主机申请数据发送请求,告知即将发送一个小消息,第二主机将采用RDMA Send原语发送数据。Step 1: Prepare resources for communication between the two parties, including constructing a queue pair, obtaining the queue pair number, data cache address and other information of the other end, and the two parties use a reliable connector to establish a connection; after the establishment is completed, the second host applies for a data sending request to the first host, informing it that a small message is about to be sent, and the second host will use the RDMA Send primitive to send data.

步骤2:第一主机收到第二主机的数据发送请求,判断请求发送的是小消息,采用RDMA Recv原语接收数据,第一主机中的应用程序检查LLC中的小消息缓存区是否还有空闲缓存块或空闲缓存,如果没有则拒绝产生相应的工作队列元素接收请求,若存在空闲缓存块或空闲缓存,分配相应的接收缓存块,应用程序生成相应的工作队列元素指向该缓存块,并将工作队列元素存放在主存的接收队列中,等待网卡程序的提取。Step 2: The first host receives the data sending request from the second host, determines that the request is for a small message, and uses the RDMA Recv primitive to receive data. The application in the first host checks whether there are any free cache blocks or free caches in the small message buffer area in the LLC. If not, the corresponding work queue element receiving request is rejected. If there are free cache blocks or free caches, the corresponding receiving cache block is allocated, and the application generates a corresponding work queue element pointing to the cache block, and stores the work queue element in the receiving queue of the main memory, waiting for the network card program to extract it.

步骤3:第一主机准备好接收数据的缓存块和工作队列元素后,通知第二主机可以发送数据。第二主机收到后回复一个确认(ACK),第一主机轮询完成队列,确认对端收到信息。Step 3: After the first host is ready to receive the data buffer block and work queue element, it notifies the second host that it can send data. After receiving the data, the second host replies with an acknowledgment (ACK), and the first host polls the completion queue to confirm that the other end has received the information.

步骤4:第二主机将相应的发送请求以工作队列元素的形式放入发送队列,工作队列元素中指向主存中待发送的数据信息,包括地址、长度等。当网卡上的程序从主存中取得该工作队列元素并执行时,根据工作队列元素中存放的信息,将待发送数据传输至第一主机的网卡接收队列中,第一主机从共享接收队列中取出相应的工作队列元素,将数据从接收队列传输至提前在小消息缓存区中分配好的缓存块中。Step 4: The second host puts the corresponding sending request into the sending queue in the form of a work queue element, and the work queue element points to the data information to be sent in the main memory, including the address, length, etc. When the program on the network card obtains the work queue element from the main memory and executes it, the data to be sent is transferred to the network card receiving queue of the first host according to the information stored in the work queue element. The first host takes out the corresponding work queue element from the shared receiving queue and transfers the data from the receiving queue to the cache block allocated in advance in the small message buffer area.

步骤5:传输的数据一抵达末级缓存中的相应缓存块时,应用程序立即通过多线程和管道技术处理数据。Step 5: As soon as the transmitted data reaches the corresponding cache block in the last-level cache, the application processes the data through multi-threading and pipelining techniques.

步骤6:数据离开管道后,应用程序立即通过共享缓存映射通知缓存监控回收缓存块以供进一步使用。Step 6: As soon as the data leaves the pipeline, the application notifies the cache monitor through the shared cache mapping to reclaim the cache block for further use.

为了更好理解本发明的一种远程内存访问的数据传输方法中在接收大消息时的具体细节和特征,结合图3进行详细说明,包括RDMA Read和RDMA Write:In order to better understand the specific details and features of a remote memory access data transmission method of the present invention when receiving a large message, a detailed description is given in conjunction with FIG. 3, including RDMA Read and RDMA Write:

所述的RDMA Read包含步骤1至步骤5:The RDMA Read process includes steps 1 to 5:

步骤1:准备双方通信的资源,包括构造队列对,获取对端的队列对编号、数据地址等信息,第一主机告知第二主机需要执行RDMA Read操作,准备好读取的数据。第二主机将要读取的数据放置在主存已注册的区域中,通知第一主机读取的数据已就绪,并告知存放的主存信息和权限。Step 1: Prepare the resources for communication between the two parties, including constructing a queue pair, obtaining the queue pair number, data address and other information of the other end, and the first host notifies the second host that it needs to perform an RDMA Read operation and prepares the data to be read. The second host places the data to be read in the registered area of the main memory, notifies the first host that the data to be read is ready, and informs the main memory information and permissions for storage.

步骤2:第一主机检查末级缓存中的大消息缓存区是否有足够的容量接收待读取的数据,有则产生相应的Read工作队列元素请求放入发送队列。网卡上的程序从主存取得该工作队列元素并执行时,第一主机向第二主机发送Read请求,请求包含要读取的数据的目标主存地址、数据大小、访问权限等信息。Step 2: The first host checks whether the large message buffer area in the last-level cache has enough capacity to receive the data to be read. If yes, it generates a corresponding Read work queue element request and puts it into the send queue. When the program on the network card obtains the work queue element from the main memory and executes it, the first host sends a Read request to the second host, which contains the target main memory address of the data to be read, the data size, access rights and other information.

步骤3:第二主机接收到第一主机发出的RDMA Read请求后,根据请求中的信息,第二主机的网卡从主存中相应的区域取得数据并按照设置好的包大小切分成数据包发到第一主机的网卡接收队列中,每个数据包抵达第一主机的网卡时,第一主机回复确认ACK代表数据已接收。Step 3: After the second host receives the RDMA Read request sent by the first host, the network card of the second host obtains data from the corresponding area in the main memory according to the information in the request and divides it into data packets according to the set packet size and sends it to the network card receiving queue of the first host. When each data packet arrives at the network card of the first host, the first host replies with an ACK confirmation to indicate that the data has been received.

步骤4:第一主机的网卡对于传输的数据包以4KB的粒度从RX发送到末级缓存的相应缓存区域时,数据一抵达,应用程序立即通过多线程和管道技术处理数据。Step 4: When the network card of the first host sends the transmitted data packet from RX to the corresponding cache area of the last-level cache with a granularity of 4KB, the application immediately processes the data through multi-threading and pipeline technology as soon as the data arrives.

步骤5:数据离开管道后,应用程序立即通过共享缓存映射通知缓存监控回收缓存以供进一步使用。Step 5: As soon as the data leaves the pipeline, the application notifies the cache monitor through the shared cache map to reclaim the cache for further use.

The RDMA Write path comprises Steps 6 to 11:

Step 6: Prepare the communication resources of both parties, including constructing the queue pairs and obtaining the peer's queue pair number, data addresses, and other information.

Step 7: The second host issues a Write request containing the relevant information about the data to be written.

Step 8: On receiving the request, the first host uses the provided information to check whether the large-message buffer in its last-level cache has enough capacity for the data to be written. Once sufficient cache has been found, the first host replies with an ACK notifying the second host to write the data. The reply contains the cache address of the data to be written, the access permissions, and other information.

Step 9: The second host prepares the data to be written in main memory, placing it in the registered main-memory region. The second host's network card splits the data into packets of the configured packet size and sends them to the receive queue of the first host's network card.

Step 10: As the first host's network card moves the transmitted packets from the receive queue into the corresponding region of the last-level cache at 4KB granularity, the application begins processing the data immediately on arrival, using multi-threading and pipelining.

Step 11: As soon as the data leaves the pipeline, the application notifies the cache monitor through the shared cache mapping to reclaim the cache for further use.
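The capacity check and ACK reply of Step 8 can be modeled as a simple admission function. This is an illustrative sketch with hypothetical names (`admit_write`, `allocate`); in the real system the reply travels over the RDMA connection rather than being returned as a dict.

```python
def admit_write(request_size, free_capacity, allocate):
    """Step 8 admission check on the first host: accept the RDMA Write only if
    the large-message buffer in the last-level cache can hold the payload.
    Returns an ACK-like dict (cache address, permissions) or None on rejection."""
    if request_size > free_capacity:
        return None                    # reject the communication request
    addr = allocate(request_size)      # reserve space in the large-message buffer
    return {"cache_addr": addr, "size": request_size, "access": "write"}
```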

In one embodiment, a remote communication apparatus is provided, the apparatus comprising:

a request module, configured to generate a communication request when the second host has a message to be transmitted to the first host;

a message type determination module, configured to determine the message type of the message to be transmitted, the message type being either a large message or a small message;

a small message transmission module, configured so that, after the message type of the message to be transmitted is determined to be a small message, the second host and the first host communicate through the RDMA Send & Recv primitives; the second host generates a corresponding work queue element and places it in the send queue; the first host takes a cache block from the small-message buffer in the last-level cache module, generates a corresponding work queue element, places it in the shared receive queue, and determines whether the small-message buffer has remaining cache blocks or free cache; if the small-message buffer has neither remaining cache blocks nor free cache, the communication request from the second host to the first host is rejected; if the small-message buffer has remaining cache blocks or free cache, the second host fetches the message to be transmitted from the main-memory storage address pointed to by the work queue element and sends it to the first host, and the first host transfers the message into the prepared cache space according to the address pointed to by the work queue element in the shared receive queue;

a large message transmission module, configured so that, after the message type of the message to be transmitted is determined to be a large message, the second host and the first host communicate through the RDMA Read/Write primitives; the first host determines the required capacity of the message to be transmitted and determines whether the remaining storage space satisfies that capacity; if the required capacity exceeds the remaining storage space, the communication request from the second host to the first host is rejected; if the required capacity is less than or equal to the remaining storage space, the first host allocates capacity in the large-message buffer according to the required capacity, and the second host splits the message into multiple data packets for transmission;

a transmission processing module, configured so that, after a buffer in the last-level cache module receives data, the buffer notifies the application through the shared cache mapping with the application to process the data; if the residence time of the data in the last-level cache module is greater than the data residence time threshold, the data of the message is offloaded into the receive buffer in main memory and the application is informed of the relevant information, after which the application fetches the data from main memory and processes it; if the residence time of the data in the last-level cache module is less than the data residence time threshold, processing is determined to be complete.
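The two dispatch decisions made by these modules — the small/large classification against the 4KB data threshold used elsewhere in this application, and the small-message admission rule — can be sketched as follows; `classify` and `admit_small` are hypothetical helper names, not part of the patented apparatus.

```python
DATA_THRESHOLD = 4 * 1024  # 4KB boundary between the Send&Recv and Read/Write paths


def classify(msg_size):
    """Route a message to the small path (RDMA Send&Recv) or the
    large path (RDMA Read/Write) by comparing against the data threshold."""
    return "small" if msg_size <= DATA_THRESHOLD else "large"


def admit_small(free_blocks, free_cache):
    """Small-message admission: reject only when the small-message buffer has
    neither a remaining cache block nor any free cache."""
    return free_blocks > 0 or free_cache > 0
```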

In one embodiment, an electronic device is provided, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the following steps:

S1: The second host has a message to be transmitted to the first host;

S2: The second host determines the message type of the message to be transmitted, the message type being either a large message or a small message;

S3: After the message type of the message to be transmitted is determined to be a small message, the second host and the first host communicate through the RDMA Send & Recv primitives; the second host generates a corresponding work queue element and places it in the send queue; the first host takes a cache block from the small-message buffer in the last-level cache module, generates a corresponding work queue element, and places it in the shared receive queue;

S4: The first host determines whether the small-message buffer has remaining cache blocks or free cache;

S5: If the small-message buffer has neither remaining cache blocks nor free cache, the first host rejects the communication request of the second host;

S6: If the small-message buffer has remaining cache blocks or free cache, the second host fetches the message to be transmitted from the main-memory storage address pointed to by the work queue element and sends it to the first host, the first host transfers the message into the prepared cache space according to the address pointed to by the work queue element in the shared receive queue, and S11 is executed;

S7: After the message type of the message to be transmitted is determined to be a large message, the second host and the first host communicate through the RDMA Read/Write primitives, and the first host determines the required capacity of the message to be transmitted;

S8: The first host determines whether the remaining storage space satisfies the required capacity of the message to be transmitted;

S9: If the required capacity of the message to be transmitted is greater than the remaining storage space, the first host rejects the communication request of the second host;

S10: If the required capacity of the message to be transmitted is less than or equal to the remaining storage space, the first host allocates capacity in the large-message buffer according to the required capacity, the second host splits the message into multiple data packets for transmission, and S11 is executed as soon as each data packet arrives at the first host's large-message buffer;

S11: After a buffer in the last-level cache module receives data, the buffer notifies the application through the shared cache mapping with the application to process the data;

S12: If the residence time of the data in the last-level cache module is greater than the data residence time threshold, the data of the message is offloaded into the receive buffer in main memory and the application is informed of the relevant information, after which the application fetches the data from main memory and processes it;

S13: If the residence time of the data in the last-level cache module is less than or equal to the data residence time threshold, processing is determined to be complete.
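The residence-time decision of S12 and S13 can be sketched as a single check. This is an illustrative model with hypothetical names (`on_dwell_check`, `offload_to_main_memory`, `notify_app`); the 200-microsecond threshold mentioned in the claims of this application would be passed in as the `threshold_us` parameter.

```python
def on_dwell_check(dwell_us, threshold_us, offload_to_main_memory, notify_app):
    """S12-S13: data lingering in the last-level cache past the residence
    threshold is offloaded to the main-memory receive buffer and the
    application is told where to find it; otherwise in-cache processing
    is considered complete."""
    if dwell_us > threshold_us:
        addr = offload_to_main_memory()  # move the message data out of the LLC
        notify_app(addr)                 # app later fetches it from main memory
        return "offloaded"
    return "done"
```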

As shown in FIG. 4, the electronic device 500 includes a processor 501, which can run a computer program stored in the readable storage medium 503, or a computer program loaded from the storage unit 508 into the readable storage medium 503, to implement the functions described in the present invention.

In the device 500, the PCIe bus 505 is connected to the internal bus 504, and the components of the device 500 interact through these two buses. The components connected to the PCIe bus include: an input unit 506, such as a keyboard or mouse; an output unit 507, such as a display or speaker; a storage unit 508, such as a magnetic disk or optical disc; and a communication unit 509, comprising a network card, a network card storage area, and a processing unit. The network card storage medium stores a computer program, and the processing unit on the network card executes the embodiments of the present invention by running that program. The communication unit 509 allows the device 500 to exchange information/data with other devices over a computer network such as the Internet. The communication unit 509 can be implemented in computer hardware, firmware, software, and/or combinations thereof, such as digital electronic circuitry, integrated circuitry, field-programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), application-specific standard products (ASSP), systems on a chip (SOC), and complex programmable logic devices (CPLD).

In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, causes the processor to perform the following steps:

S1: The second host has a message to be transmitted to the first host;

S2: The second host determines the message type of the message to be transmitted, the message type being either a large message or a small message;

S3: After the message type of the message to be transmitted is determined to be a small message, the second host and the first host communicate through the RDMA Send & Recv primitives; the second host generates a corresponding work queue element and places it in the send queue; the first host takes a cache block from the small-message buffer in the last-level cache module, generates a corresponding work queue element, and places it in the shared receive queue;

S4: The first host determines whether the small-message buffer has remaining cache blocks or free cache;

S5: If the small-message buffer has neither remaining cache blocks nor free cache, the first host rejects the communication request of the second host;

S6: If the small-message buffer has remaining cache blocks or free cache, the second host fetches the message to be transmitted from the main-memory storage address pointed to by the work queue element and sends it to the first host, the first host transfers the message into the prepared cache space according to the address pointed to by the work queue element in the shared receive queue, and S11 is executed;

S7: After the message type of the message to be transmitted is determined to be a large message, the second host and the first host communicate through the RDMA Read/Write primitives, and the first host determines the required capacity of the message to be transmitted;

S8: The first host determines whether the remaining storage space satisfies the required capacity of the message to be transmitted;

S9: If the required capacity of the message to be transmitted is greater than the remaining storage space, the first host rejects the communication request of the second host;

S10: If the required capacity of the message to be transmitted is less than or equal to the remaining storage space, the first host allocates capacity in the large-message buffer according to the required capacity, the second host splits the message into multiple data packets for transmission, and S11 is executed as soon as each data packet arrives at the first host's large-message buffer;

S11: After a buffer in the last-level cache module receives data, the buffer notifies the application through the shared cache mapping with the application to process the data;

S12: If the residence time of the data in the last-level cache module is greater than the data residence time threshold, the data of the message is offloaded into the receive buffer in main memory and the application is informed of the relevant information, after which the application fetches the data from main memory and processes it;

S13: If the residence time of the data in the last-level cache module is less than or equal to the data residence time threshold, processing is determined to be complete.

The computer-readable storage medium includes a receiving storage medium, a readable storage medium, or a network card storage medium. The receiving storage medium stores the data sent by the network card. The readable storage medium and the network card storage medium store computer programs that implement the remote-direct-memory-access-based data communication method described in the present invention.

Specifically, the receiving storage medium may be any entity or recording medium capable of receiving data sent by the network card, including static random-access memory (SRAM), embedded dynamic random-access memory (eDRAM), and the like. The readable storage medium may be any entity or recording medium capable of carrying the computer program instructions, including a USB flash drive, a removable hard disk, an optical disc, computer memory, read-only memory (ROM), random-access memory (RAM), and the like. The network card storage medium may be any entity or recording medium capable of carrying the computer program instructions and being integrated with the network card, including a field-programmable gate array (FPGA), static random-access memory (SRAM), flash memory, electrically erasable programmable read-only memory (EEPROM), and the like.

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, any combination of these technical features that involves no contradiction should be considered within the scope of this specification.

The above embodiments express only several implementations of the present application, and their descriptions are relatively specific and detailed, but they should not be construed as limiting the patent scope of the present application. It should be noted that a person of ordinary skill in the art may make several variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this application shall be subject to the appended claims.

Claims (10)

1. A method for transmitting data for remote memory access, the method comprising:
S1, a second host has a message to be transmitted to a first host;
S2, the second host determines the message type of the message to be transmitted, wherein the message type comprises a large message and a small message;
S3, after the message type of the message to be transmitted is determined to be a small message, the second host and the first host communicate through RDMA Send & Recv primitives; the second host generates a corresponding work queue element and puts it into a send queue; the first host takes a cache block from a small message buffer in the last-level cache module, generates a corresponding work queue element, and puts it into a shared receive queue;
S4: the first host determines whether remaining cache blocks or free cache exist in the small message buffer;
S5: if the small message buffer has neither remaining cache blocks nor free cache, the first host rejects the communication request of the second host;
S6: if the small message buffer has remaining cache blocks or free cache, the second host takes out the message to be transmitted according to the main-memory storage address pointed to by the work queue element and sends it to the first host, the first host transfers the message to be transmitted into the prepared cache space according to the address pointed to by the work queue element in the shared receive queue, and S11 is executed;
S7, after the message type of the message to be transmitted is determined to be a large message, the second host and the first host communicate through RDMA Read/Write primitives, and the first host determines the required capacity of the message to be transmitted;
S8, the first host determines whether the remaining storage space satisfies the required capacity of the message to be transmitted;
S9, if the required capacity of the message to be transmitted is greater than the remaining storage space, the first host rejects the communication request of the second host;
S10, if the required capacity of the message to be transmitted is less than or equal to the remaining storage space, the first host allocates capacity in a large message buffer according to the required capacity, the second host splits the message to be transmitted into a plurality of data packets for transmission, and S11 is executed immediately after each data packet arrives at the large message buffer of the first host;
S11, after a buffer in the last-level cache module receives data, the buffer notifies the application through the shared cache mapping with the application to process the data;
S12, if the residence time of the data in the last-level cache module is greater than the data residence time threshold, the data of the message is offloaded into a receive buffer in main memory and the application is informed of the relevant information, after which the application takes the data out of main memory and processes it;
S13, if the residence time of the data in the last-level cache module is less than or equal to the data residence time threshold, it is determined that processing is complete.
2. The method for data transmission of remote memory access according to claim 1, wherein, prior to S1, the method further comprises:
the first host establishes RDMA communication with the second host and initializes communication resources, including the queue pair numbers and data cache addresses of both parties;
the initializing of communication resources is accomplished through socket communication or an RDMA connection manager.
3. The method for transmitting data for remote memory access according to claim 2, wherein determining the message type of the message to be transmitted specifically comprises: comparing the size of the message to be transmitted with a data threshold and determining the message type according to the comparison result;
if the size of the message to be transmitted is less than or equal to the data threshold, determining that the message type is a small message;
if the size of the message to be transmitted is greater than the data threshold, determining that the message type is a large message.
4. The method of data transmission for remote memory access according to claim 3, wherein the data threshold is set to 4KB; each work queue element in the shared receive queue points to a 4KB cache block in the last-level cache module.
5. The method for transferring data in remote memory access according to claim 3, wherein the last-level cache module allocates a cache space for receiving data in RDMA communication, and the small message buffer and the large message buffer share the cache space; the allocated cache space size = network bandwidth × data residence time threshold × the proportion of resident data that needs to be processed in the cache space.
6. The method for remote memory access data transmission according to claim 1, wherein the first host determines the required capacity of the message to be transmitted, specifically comprising: the required capacity of the message to be transmitted = the amount of data to be sent; the data residence time threshold is set to 200 microseconds.
7. The method for transmitting remote memory access data according to claim 1, wherein the second host splits the message to be transmitted into a plurality of data packets for transmission, specifically comprising: splitting the message to be transmitted into a plurality of data packets each no larger than 256KB for transmission.
8. A remote communication apparatus, the apparatus comprising:
a request module, configured to generate a communication request when the second host has a message to be transmitted to the first host;
a message type determination module, configured to determine the message type of the message to be transmitted, wherein the message type comprises a large message and a small message;
a small message transmission module, configured so that, after the message type of the message to be transmitted is determined to be a small message, the second host and the first host communicate through RDMA Send & Recv primitives; the second host generates a corresponding work queue element and puts it into a send queue; the first host takes a cache block from a small message buffer in the last-level cache module, generates a corresponding work queue element, puts it into a shared receive queue, and determines whether remaining cache blocks or free cache exist in the small message buffer; if the small message buffer has neither remaining cache blocks nor free cache, the communication request from the second host to the first host is rejected; if the small message buffer has remaining cache blocks or free cache, the second host takes out the message to be transmitted according to the main-memory storage address pointed to by the work queue element and sends it to the first host, and the first host transfers the message to be transmitted into the prepared cache space according to the address pointed to by the work queue element in the shared receive queue;
a large message transmission module, configured so that, after the message type of the message to be transmitted is determined to be a large message, the second host and the first host communicate through RDMA Read/Write primitives; the first host determines the required capacity of the message to be transmitted and determines whether the remaining storage space satisfies the required capacity; if the required capacity of the message to be transmitted is greater than the remaining storage space, the communication request from the second host to the first host is rejected; if the required capacity of the message to be transmitted is less than or equal to the remaining storage space, the first host allocates capacity in a large message buffer according to the required capacity, and the second host splits the message to be transmitted into a plurality of data packets for transmission;
a transmission processing module, configured so that, after a buffer in the last-level cache module receives data, the buffer notifies the application through the shared cache mapping with the application to process the data; if the residence time of the data in the last-level cache module is greater than the data residence time threshold, the data of the message is offloaded into a receive buffer in main memory and the application is informed of the relevant information, after which the application takes the data out of main memory and processes it; if the residence time of the data in the last-level cache module is less than or equal to the data residence time threshold, it is determined that processing is complete.
9. An electronic device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method of any of claims 1 to 7.
10. A computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method of any one of claims 1 to 7.
CN202410166037.5A 2024-02-06 2024-02-06 Data transmission method, device, equipment and storage medium for remote memory access Active CN118093499B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410166037.5A CN118093499B (en) 2024-02-06 2024-02-06 Data transmission method, device, equipment and storage medium for remote memory access

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410166037.5A CN118093499B (en) 2024-02-06 2024-02-06 Data transmission method, device, equipment and storage medium for remote memory access

Publications (2)

Publication Number Publication Date
CN118093499A true CN118093499A (en) 2024-05-28
CN118093499B CN118093499B (en) 2024-11-19

Family

ID=91162757

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410166037.5A Active CN118093499B (en) 2024-02-06 2024-02-06 Data transmission method, device, equipment and storage medium for remote memory access

Country Status (1)

Country Link
CN (1) CN118093499B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102546612A (en) * 2011-12-23 2012-07-04 华中科技大学 Remote procedure call implementation method based on remote direct memory access (RDMA) protocol in user mode
CN104679688A (en) * 2013-12-02 2015-06-03 华为技术有限公司 Data access method, device and system
US20160342567A1 (en) * 2015-05-18 2016-11-24 Red Hat Israel, Ltd. Using completion queues for rdma event detection
CN109491809A (en) * 2018-11-12 2019-03-19 西安微电子技术研究所 A kind of communication means reducing high-speed bus delay
CN115858160A (en) * 2022-12-07 2023-03-28 江苏为是科技有限公司 Remote direct memory access virtualization resource allocation method and device and storage medium
CN116471242A (en) * 2023-05-23 2023-07-21 江苏华创微系统有限公司 RDMA-based transmitting end, RDMA-based receiving end, data transmission system and data transmission method
CN116501549A (en) * 2023-05-06 2023-07-28 上海英方软件股份有限公司 Data caching method and device, electronic equipment and storage medium


Also Published As

Publication number Publication date
CN118093499B (en) 2024-11-19


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant