CN108268208B - A Distributed Memory File System Based on RDMA

Info

Publication number: CN108268208B
Application number: CN201611261722.8A
Authority: CN
Inventors: 陆游游; 舒继武; 陈游旻
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2016-12-30
Filing date: 2016-12-30
Publication date: 2020-01-17
Anticipated expiration: 2036-12-30
Also published as: CN108268208A

Abstract

The invention discloses an RDMA-based distributed memory file system. In the initialization stage of the distributed memory file system, the memory the cluster uses for file storage is uniformly partitioned and registered with the network card so that remote nodes can access it directly, thereby constructing a distributed shared memory pool. On top of the distributed shared memory pool, file indexing and file data block indexing are performed through two-level hash indexes to provide query services for the file system. Client requests are processed through a self-identifying remote procedure call method, and the processing results are returned. The invention has the following advantages: it reduces data copying during file reads and writes, lowers response latency, and improves the overall efficiency of file access in software.

Description

A Distributed Memory File System Based on RDMA

Technical Field

The invention relates to the field of distributed storage systems, and in particular to an RDMA-based distributed memory file system.

Background

Remote Direct Memory Access (RDMA) refers to accessing remote memory directly, without the direct involvement of either host's operating system, thereby providing high bandwidth and low latency.

Data transmission in a distributed environment determines a system's overall I/O performance, and such technology is widely used in distributed file systems and database systems. Most traditional distributed systems use disks as the storage medium and transmit data through TCP/IP-based remote procedure call modules. Because disks have low bandwidth and high latency, the network transmission module itself does not become the bottleneck. In recent years, memory has become increasingly cheap, and in-memory computing, which moves storage and computation into memory, has become a trend. As storage media become faster, network transmission faces great challenges.

At present, no general-purpose data transmission module exists that efficiently handles different network I/O characteristics. Moreover, concurrency control still relies on a centralized synchronization model, which severely limits system scalability.

Summary of the Invention

The present invention aims to solve at least one of the above technical problems.

To this end, the purpose of the present invention is to propose an RDMA-based distributed memory file system that reduces data copying during file reads and writes, lowers response latency, and improves the overall efficiency of file access in software.

To achieve the above purpose, an embodiment of the present invention discloses an RDMA-based distributed memory file system. The distributed memory file system interconnects the memory of each node over an RDMA network and includes a client and a server. The client provides a file access interface for upper-layer applications to call, and the server provides metadata services and data services. The distributed memory file system performs the following actions: S1: in the initialization stage of the distributed memory file system, the memory the cluster uses for file storage is uniformly partitioned and registered with the network card so that remote nodes can access the memory directly, thereby constructing a distributed shared memory pool; S2: on top of the distributed shared memory pool, file indexing and file data block indexing are performed through two-level hash indexes to provide query services for the file system; S3: client requests are processed through a self-identifying remote procedure call method, and the processing results are returned.

The RDMA-based distributed memory file system according to the embodiment of the present invention reduces data copying during file reads and writes, lowers response latency, and improves the overall efficiency of file access in software.

In addition, the RDMA-based distributed memory file system according to the above embodiment of the present invention may also have the following additional technical features:

Further, the distributed shared memory pool stores, in order, a superblock, a message pool, a chained hash index table, metadata storage blocks, and data storage blocks.

Further, the data layout area accepts direct access from remote nodes; in the chained hash index table and the metadata storage area, the service node responds to concurrent client requests by querying and updating metadata; metadata is distributed across the whole cluster by hashing the file path name, and each node independently maintains the metadata and data of its files.

Further, the superblock stores the number of metadata blocks, the metadata block size, the number of data blocks, and the data block size, and the superblock is read remotely by each node when the file system starts.

Further, the message pool includes a plurality of message areas, which are allocated to the different clients connected to the system. When a client has a new request, it remotely writes the request into its own message area on the service node; after the server's receiving thread detects the new request, it uses the self-identifying method to quickly locate the message, process it, and return the result.

Further, the chained hash index table is used as follows when file metadata is queried: the hash value of the file's full path name is computed and used as the index number into the index table; the entry under that index number is read and its file name is compared; on a match, the metadata address is obtained and the metadata is accessed at that address; if the file name does not match, the search continues with the next entry until a match is found.

Further, the metadata storage blocks and the data storage blocks store metadata and data, respectively. The metadata area and the data area are divided into fixed-size metadata blocks and data blocks, and the head of each area stores free-block metadata for that area to describe memory usage.

Further, the two-level hash indexing comprises the following steps: when the client initiates a file access request, it computes, from the file's full path name, the ID of the metadata server that stores the file's metadata, where metadata server IDs are determined by the system configuration file; the client sends the request to the metadata server with that ID; after detecting the new message, the metadata server parses the request content, computes a second hash value from the file path name in the request, accesses the chained hash index table using the second hash value, obtains the metadata, performs the corresponding logic, and returns the result of the request.
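The two hash levels described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the CRC32 hash, the seed for the second level, the server list `METADATA_SERVERS`, and the table size `INDEX_TABLE_SLOTS` are all assumptions, since the patent only states that server IDs come from the system configuration file.

```python
import zlib

# Hypothetical cluster configuration; the patent only says that metadata
# server IDs are determined by the system configuration file.
METADATA_SERVERS = [0, 1, 2, 3]
INDEX_TABLE_SLOTS = 1024  # assumed number of slots in each server's index table


def server_for(path: str) -> int:
    """First-level hash (client side): choose the metadata server ID."""
    h = zlib.crc32(path.encode("utf-8"))
    return METADATA_SERVERS[h % len(METADATA_SERVERS)]


def index_slot_for(path: str) -> int:
    """Second-level hash (server side): choose the slot in the index table."""
    # A different starting value keeps the two levels from being identical.
    h = zlib.crc32(path.encode("utf-8"), 0x9E3779B9)
    return h % INDEX_TABLE_SLOTS
```

Because both levels depend only on the full path name, any client can locate the responsible server without consulting a central directory.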

Further, the self-identifying remote procedure call method includes: when the client sends a message to the server, it uses the RDMA_WRITE_WITH_IMM primitive to carry the message content and stores client metadata in the message header; when the server returns the result of a request, it writes the result directly back to the memory area designated by the client through an RDMA primitive, and the client polls the memory area used to store the result until the data has been returned.

Further, the client stores its own ID and a timestamp in the message header; the client ID is assigned by the server master node when the connection is established and is globally unique.

Further, after the RDMA_WRITE_WITH_IMM message has been delivered, the server obtains the client metadata in the message header from the completion information, parses out the client ID, and uses the client ID to read the corresponding fixed offset in the local message pool directly, obtaining the new request.

Further, when the number of clients exceeds the number of message areas the server allocated in advance, the server looks for clients that have disconnected and transfers the message area such a client occupied to the current client; if all clients are still connected, a new message area must be allocated, registered with the network card, and announced to the current client.

Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from that description or be learned through practice of the invention.

Brief Description of the Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of embodiments taken in conjunction with the accompanying drawings, in which:

FIG. 1 is a flowchart of the actions performed by an RDMA-based distributed memory file system according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of RDMA data transmission according to an embodiment of the present invention;

FIG. 3 is a layout diagram of the shared memory of a service node according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of the number of copies made during data transmission and reception according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a self-identifying remote procedure call according to an embodiment of the present invention.

Detailed Description

Embodiments of the present invention are described in detail below, and examples of the embodiments are illustrated in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, serve only to explain the present invention, and should not be construed as limiting it.

These and other aspects of embodiments of the present invention will become clear with reference to the following description and drawings. The description and drawings specifically disclose some particular implementations of the embodiments to show some of the ways in which the principles of the embodiments may be practiced, but it should be understood that the scope of the embodiments is not limited thereby. On the contrary, the embodiments of the present invention include all changes, modifications, and equivalents falling within the spirit and scope of the appended claims.

The present invention is described below with reference to the accompanying drawings. Before the embodiments are introduced, the terms used in the present invention are explained.

Direct Memory Access (DMA) allows certain hardware devices to read and write memory directly and independently, without heavy CPU involvement. The technique relieves the CPU of the burden of servicing peripherals: for the whole data transfer, the CPU only needs to initialize the transfer at the very beginning and then hands the entire transfer over to the DMA controller to carry out.

Remote Direct Memory Access (RDMA) is a newer network communication technology that accesses remote memory directly without the direct involvement of either side's operating system, achieving high throughput and low latency. RDMA achieves zero-copy data transfer by having the network adapter transfer data directly into the peer's memory, which eliminates the direct involvement of the CPU and cache and reduces redundant context switches. Network protocol stacks that currently support RDMA include InfiniBand, RoCE (RDMA over Converged Ethernet), and iWARP. The first two have hardware support from Mellanox; the latter two use the ordinary Ethernet data link layer and are therefore fully compatible with Ethernet.

FIG. 2 shows the RDMA communication flow. The local CPU first issues a communication command to the network card via MMIO. After detecting the new command, the local network card reads the data to be transmitted from memory via DMA, packetizes it, and transmits it over the RDMA network. After receiving the data, the peer's network card writes it via DMA directly into the corresponding memory address region and writes the corresponding completion information into the completion queue. The whole process does not involve the peer's CPU and bypasses the kernels on both sides, achieving zero-copy data transfer.

Before establishing communication, both sides must go through the following steps: open the network card device; create a protection domain, which is bound to the objects created in later stages to keep data transmission secure, so that any cross-domain operation causes a communication error; register memory, in which the memory used for communication is registered by establishing a mapping between the user-space addresses and the memory addresses of the segment, storing the mapping table in the network card's cache, and generating a key pair (lkey and rkey) for the segment, which the network card must present for identity confirmation when accessing the memory locally or remotely; create a CQ (Completion Queue), into which the corresponding completion information is placed after the sender has sent a message or the receiver has received one, so that the user can poll the completion queue repeatedly to verify that a send has completed; create a QP (Queue Pair), which is analogous to a TCP/IP socket and consists of a Send Queue and a Receive Queue, where the sender places the message to be sent into the send queue and the receiver places a receive request into the receive queue, and the two sides communicate over the network in this way; and initialize the QP state, in which, after the two sides have created one-to-one corresponding QPs, a series of handshake state transitions is performed until the communication link is established.

A QP can be set up with different connection types, including RC (Reliable Connection), UC (Unreliable Connection), and UD (Unreliable Datagram). In RC mode, a QP can only perform one-to-one reliable transmission, and an acknowledgment is returned after a packet has been sent successfully. In UC mode, a QP performs one-to-one transmission without acknowledgments. UD mode has no one-to-one restriction and no acknowledgments. The three transport modes have different characteristics and differ in which communication primitives they support.
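As an illustration of how primitive support differs between the three modes, the sketch below encodes the support matrix commonly documented for InfiniBand verbs. The table is not taken from the patent; the operation names are informal labels chosen for this sketch, and the authoritative source is the InfiniBand specification and the documentation of the hardware in use.

```python
# Operations each transport mode is commonly documented to support.
# Informal operation names; not part of the patent text.
SUPPORTED_OPS = {
    "RC": {"SEND", "RDMA_WRITE", "RDMA_WRITE_WITH_IMM", "RDMA_READ", "ATOMIC"},
    "UC": {"SEND", "RDMA_WRITE", "RDMA_WRITE_WITH_IMM"},
    "UD": {"SEND"},
}


def supports(mode: str, op: str) -> bool:
    """Return True if the given transport mode supports the operation."""
    return op in SUPPORTED_OPS[mode]


def modes_supporting(op: str) -> list:
    """List the transport modes that can carry the given operation."""
    return sorted(m for m, ops in SUPPORTED_OPS.items() if op in ops)
```

Under this matrix, a design that relies on RDMA_WRITE_WITH_IMM needs a connected mode (RC or UC) rather than UD.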

In-memory computing is a newer processing model in which the storage system is moved into memory for real-time processing, because traditional disk-based storage systems, with their slow access speeds, struggle to meet the challenges of massive data volumes and demanding real-time requirements. In-memory storage systems fall into two main categories: in-memory database systems and in-memory file systems. The present invention redesigns the in-memory file system around RDMA network communication. Mainstream in-memory file systems currently include Alluxio and IGFS. Alluxio is mainly used to address existing problems of the Spark computing framework and to accelerate data processing, and it uses lineage to achieve single-copy storage and reliable recovery of data. IGFS is a caching file system that sits between the computing framework and HDFS and offers the upper layer an HDFS-compatible interface; unlike HDFS, however, IGFS has no separate metadata server and instead distributes data by hashing.

Remote Procedure Call (RPC) is a remote communication protocol that lets a program running on one computer call a function on another computer without the user having to care about the underlying communication strategy. RPC is widely used in distributed systems and follows the client-server model. A call is always initiated by the client: it packs the serial number of the function to call, the function arguments, and other information, and sends them to the server; the server receives the request, executes it, and returns the result to the client when execution completes.
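The call flow just described can be illustrated with a small in-process sketch. Everything here is hypothetical: the wire format (a 2-byte function number and 4-byte body length followed by JSON-encoded arguments), the `HANDLERS` table, and the `add` function are illustrative choices, not the format used by the patent.

```python
import json
import struct

HEADER = "<HI"  # assumed header: function number (2 bytes), body length (4 bytes)
HEADER_SIZE = struct.calcsize(HEADER)
HANDLERS = {}   # function number -> server-side function


def rpc(func_id):
    """Register a server-side function under a function number."""
    def register(fn):
        HANDLERS[func_id] = fn
        return fn
    return register


@rpc(1)
def add(a, b):
    return a + b


def encode_request(func_id, *args) -> bytes:
    """Client side: pack the function number and its arguments."""
    body = json.dumps(args).encode("utf-8")
    return struct.pack(HEADER, func_id, len(body)) + body


def serve(request: bytes) -> bytes:
    """Server side: unpack the request, run the function, return the result."""
    func_id, length = struct.unpack_from(HEADER, request)
    args = json.loads(request[HEADER_SIZE:HEADER_SIZE + length])
    return json.dumps(HANDLERS[func_id](*args)).encode("utf-8")
```

For example, `serve(encode_request(1, 2, 3))` yields the JSON encoding of the sum computed on the "server" side.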

FIG. 1 is a flowchart of the actions performed by an RDMA-based distributed memory file system according to an embodiment of the present invention. As shown in FIG. 1, the RDMA-based distributed memory file system according to the embodiment first interconnects the memory of each node over an RDMA network. The distributed memory file system includes a client and a server; the client provides a file access interface for upper-layer applications to call, and the server provides metadata services and data services. The distributed memory file system performs the following actions:

S1: In the initialization stage of the distributed memory file system, the memory the cluster uses for file storage is uniformly partitioned and registered with the network card so that remote nodes can access the memory directly, thereby constructing a distributed shared memory pool.

S2: On top of the distributed shared memory pool, file indexing and file data block indexing are performed through two-level hash indexes to provide query services for the file system.

S3: Client requests are processed through the self-identifying remote procedure call method, and the processing results are returned.

It should be noted that the distributed shared memory pool is made up of the shared memory of each node, and the shared memory of every node has the same data layout. Specifically, the shared memory stores, in order, the superblock, the message pool, the chained hash index table, the metadata storage blocks, and the data storage blocks (see FIG. 3). The shared memory pool is used for both file storage and message passing, so both the storage medium and the communication method of this file system change; this unified management makes the overall software stack thinner and processing faster.
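Because every node uses the same ordered layout, the offset of each region follows from the region sizes alone. The sketch below computes those offsets; the region sizes in `REGIONS` are placeholder values chosen for illustration and do not come from the patent.

```python
# Regions in the order given in the text; sizes in bytes are assumed values.
REGIONS = [
    ("superblock", 4096),
    ("message_pool", 64 * 4096),
    ("hash_index_table", 1024 * 64),
    ("metadata_blocks", 1024 * 256),
    ("data_blocks", 4096 * 4096),
]


def layout(regions):
    """Return ({name: (offset, size)}, total_size) for back-to-back regions."""
    offsets = {}
    cursor = 0
    for name, size in regions:
        offsets[name] = (cursor, size)
        cursor += size
    return offsets, cursor
```

A node that knows the sizes (for instance after reading the superblock) can derive every region's base address on a peer without further communication.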

In one embodiment of the present invention, within the data layout, the superblock, the message pool, and the data area are registered with the network card and can be accessed directly by remote nodes, which reduces memory copies and improves efficiency. The chained hash index table and the metadata storage area are maintained independently by local service threads: the service node responds to all metadata requests and performs the corresponding index and metadata queries and updates. Metadata is distributed across the whole cluster by hashing the file path name, and each node independently maintains the metadata and data of its files, improving the overall performance of the file system.

In one embodiment of the present invention, the superblock stores the core data structures of the file system, including the number of metadata blocks, the metadata block size, the number of data blocks, and the data block size. This area is read remotely by each node when the file system starts and is used for initial identification and positioning.
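A minimal sketch of such a superblock follows, assuming the four listed fields are stored as little-endian 64-bit integers in the order given. The field widths, byte order, and function names are assumptions; the patent names the fields but not their encoding.

```python
import struct

# Assumed encoding: four unsigned 64-bit little-endian integers.
SUPERBLOCK_FMT = "<QQQQ"
FIELDS = ("meta_block_count", "meta_block_size",
          "data_block_count", "data_block_size")


def pack_superblock(meta_block_count, meta_block_size,
                    data_block_count, data_block_size) -> bytes:
    """Serialize the superblock fields into a fixed-size byte string."""
    return struct.pack(SUPERBLOCK_FMT, meta_block_count, meta_block_size,
                       data_block_count, data_block_size)


def unpack_superblock(raw: bytes) -> dict:
    """Parse a superblock, e.g. after a node has read it remotely at startup."""
    return dict(zip(FIELDS, struct.unpack_from(SUPERBLOCK_FMT, raw)))
```

A fixed-size encoding lets a starting node fetch the superblock with a single remote read of known length.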

In one embodiment of the present invention, the message pool is used for communication between clients and the server. Specifically, the message pool is divided into message areas of equal size, and each message area is owned exclusively by one client; that is, the client is bound to a fixed offset in the service node's message pool. When a client has a new request, it remotely writes the request into its own message area on the service node. After the server's receiving thread detects the new request, it uses the client's unique ID to look up the corresponding message area, identifies the message type, processes it, and returns the result.
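The fixed-offset binding can be sketched with a plain byte buffer standing in for the registered memory. This is a simulation under stated assumptions: `remote_write` models the client's RDMA write as a local copy, the area size is an arbitrary value, and the class and method names are hypothetical.

```python
MESSAGE_AREA_SIZE = 4096  # assumed size of one message area in bytes


class MessagePool:
    """A pool of equal-sized message areas, one per client ID."""

    def __init__(self, num_areas, area_size=MESSAGE_AREA_SIZE):
        self.num_areas = num_areas
        self.area_size = area_size
        self.buf = bytearray(num_areas * area_size)

    def offset_of(self, client_id):
        """Fixed offset of the area bound to this client."""
        if not 0 <= client_id < self.num_areas:
            raise IndexError("no message area for this client ID")
        return client_id * self.area_size

    def remote_write(self, client_id, payload):
        """Stand-in for the client's remote write into its own area."""
        if len(payload) > self.area_size:
            raise ValueError("request larger than a message area")
        off = self.offset_of(client_id)
        self.buf[off:off + len(payload)] = payload

    def read_request(self, client_id, length):
        """Server side: read the request directly at the client's offset."""
        off = self.offset_of(client_id)
        return bytes(self.buf[off:off + length])
```

Since the offset is a pure function of the client ID, the server never has to scan other clients' areas to find a new request.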

In one embodiment of the present invention, the chained hash index table is used for local metadata indexing. The index table provides a single global entry point arranged as a linear table, which indexes the individual chained entries. Each entry contains three fields: the file name, the metadata address, and the address of the next entry. To query file metadata, the hash value of the file's full path name is first computed and used as the index number into the linear table; the file name of the corresponding entry is read and compared; on a match, the metadata address is obtained and the metadata is accessed at that address; if the file name does not match, the search continues at the next-entry address until a match is found.
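The lookup procedure can be sketched as a hash table with separate chaining. The entry fields mirror the three fields named in the text; the CRC32 hash, the table size, insertion at the head of the chain, and returning `None` for a missing file are assumptions made for this sketch.

```python
import zlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    file_name: str                  # full path name of the file
    metadata_addr: int              # address of the file's metadata block
    next: Optional["Entry"] = None  # next entry in the same chain


class ChainedHashIndex:
    def __init__(self, slots=1024):
        # The linear table: one chain head per index number.
        self.table: List[Optional[Entry]] = [None] * slots

    def _slot(self, path):
        return zlib.crc32(path.encode("utf-8")) % len(self.table)

    def insert(self, path, metadata_addr):
        """Link a new entry at the head of its chain."""
        slot = self._slot(path)
        self.table[slot] = Entry(path, metadata_addr, self.table[slot])

    def lookup(self, path):
        """Walk the chain, comparing file names, until one matches."""
        entry = self.table[self._slot(path)]
        while entry is not None:
            if entry.file_name == path:
                return entry.metadata_addr
            entry = entry.next
        return None  # assumed behaviour when the file does not exist
```

With a single slot every path collides, which makes the chain traversal easy to observe: two inserted paths are both still found by walking the chain.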

In one embodiment of the present invention, the metadata storage blocks and the data storage blocks store metadata and data, respectively. The head of each of these two areas stores a bitmap for that area, which indicates which of its blocks are occupied.
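A bitmap of this kind can be sketched as one bit per fixed-size block. The first-fit allocation policy and the class name are assumptions; the patent states only that a bitmap at the head of each area records occupancy.

```python
class BitmapAllocator:
    """Tracks which fixed-size blocks of an area are occupied."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.bitmap = bytearray((num_blocks + 7) // 8)  # one bit per block

    def is_used(self, i):
        return bool(self.bitmap[i // 8] & (1 << (i % 8)))

    def allocate(self):
        """Return the index of the first free block, or None if the area is full."""
        for i in range(self.num_blocks):
            if not self.is_used(i):
                self.bitmap[i // 8] |= 1 << (i % 8)
                return i
        return None

    def free(self, i):
        """Mark block i as free again."""
        self.bitmap[i // 8] &= ~(1 << (i % 8)) & 0xFF
```

The block's address is then the area's base plus the index times the block size recorded in the superblock.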

In one embodiment of the present invention, the chained hash index table is used for local metadata indexing, and each entry contains three fields: the file name, the metadata address, and the address of the next entry. To query file metadata, the hash value of the file's full path name is first computed and used as the index number into the index table; the file name of the corresponding entry is read and compared; on a match, the metadata address is obtained and the metadata is accessed at that address; if the file name does not match, the search continues at the next-entry address until a match is found.

The method of the present invention for redesigning a memory file system over an RDMA network greatly improves overall system performance. A traditional distributed file system uses slow disks as the storage medium and Gigabit Ethernet for network communication. Because disks and Gigabit Ethernet themselves have high latency (on the order of milliseconds), the performance loss introduced by the file system itself is comparatively small. When such a file system is run with memory as the storage medium, however, the file system itself accounts for a large share of the latency along the data path, since it introduces many data copies (see FIG. 4) and redundant context switches, so overall performance does not improve linearly. For this reason, the present invention combines RDMA technology with an optimization scheme designed specifically for memory media: it proposes a thinner data management layer, builds a distributed shared memory pool, and locates files by hashing for fast retrieval, so that the latency of the whole data path becomes very low while high system throughput is maintained.

FIG. 5 illustrates the self-identifying remote procedure call technique of an embodiment of the present invention. The method is based on a large-memory cluster interconnected by a network that supports RDMA hardware. RDMA means that a node can read and write remote memory directly without the direct involvement of the remote CPU. A large-memory cluster is one in which each node is equipped with large-capacity memory and has spare memory for building a distributed memory file system. The method includes:

When the client sends a message to the server, the RDMA_WRITE_WITH_IMM primitive is used for data transmission and self-identification; when the server returns the result of a request, an RDMA primitive is used to write the returned data back. Specifically:

When the client sends a message to the server, the RDMA_WRITE_WITH_IMM primitive allows the client to carry client metadata along with the request. In particular, the client stores its own ID and a timestamp in this field so that the server can identify and locate the request quickly;

When the server returns the result of a request, it writes the result directly back to the memory area designated by the client through an RDMA primitive. Meanwhile, the client polls the memory area used to store the result until the data has been returned.

In one embodiment of the present invention, the client ID is assigned by the server master node when the connection is established and is globally unique, which allows the client to automatically occupy one message area in the server's message pool by means of its ID.

In an embodiment of the present invention, self-identification does not require the server to scan the entire message pool. After an RDMA_WRITE_WITH_IMM message is delivered successfully, a receive request in the server's receive queue completes and a completion entry is placed in the completion queue. The server polls the completion queue in a loop from a dedicated thread to detect new requests. On finding a new message, the server first obtains the auxiliary information carried by the message from the completion entry and parses out the client ID. It then uses the client ID to look up the fixed offset in the local message pool directly, reads the new request, parses its content, executes the corresponding function on the server, and returns the result.
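The server-side lookup described above can be sketched as a small simulation. Everything here is illustrative: `MessagePool`, `MSG_AREA_SIZE`, the 2-byte length prefix, and the `"op arg"` request format are assumptions, and a `deque` stands in for the completion queue. The point shown is that the client ID recovered from the completion entry gives the message offset directly, so no scan of the pool is needed.

```python
from collections import deque

MSG_AREA_SIZE = 4096   # assumed size of one message area, in bytes
TS_BITS = 16           # assumed immediate layout: client ID in the high 16 bits

class MessagePool:
    def __init__(self, num_areas: int):
        self.mem = bytearray(num_areas * MSG_AREA_SIZE)
        self.completions = deque()       # stand-in for the completion queue

    def remote_write_with_imm(self, client_id: int, timestamp: int, req: bytes):
        """Simulate a client's write-with-immediate into its own message area."""
        off = client_id * MSG_AREA_SIZE
        self.mem[off:off + 2] = len(req).to_bytes(2, "little")
        self.mem[off + 2:off + 2 + len(req)] = req
        self.completions.append((client_id << TS_BITS) | (timestamp & 0xFFFF))

    def poll_once(self, handlers):
        """Handle one completion: immediate -> client ID -> fixed offset -> request."""
        if not self.completions:
            return None
        imm = self.completions.popleft()
        client_id = imm >> TS_BITS
        off = client_id * MSG_AREA_SIZE  # fixed offset, no scan of the pool
        n = int.from_bytes(self.mem[off:off + 2], "little")
        op, _, arg = bytes(self.mem[off + 2:off + 2 + n]).partition(b" ")
        return client_id, handlers[op.decode()](arg.decode())

pool = MessagePool(4)
pool.remote_write_with_imm(2, 99, b"stat /a/b")
reply = pool.poll_once({"stat": lambda path: "meta:" + path})
```

In a real deployment the loop around `poll_once` would run in the dedicated server thread, and the return value would be written back to the client rather than returned to the caller.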

In an embodiment of the present invention, when the number of clients exceeds the number of message areas that the server pre-allocated in its message pool, the server looks for clients that have already disconnected and transfers the message areas they occupied to the current client. If all clients are still active, a new message area must be allocated, registered with the network card, and made known to the client.
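The reuse-then-grow policy in this paragraph can be sketched as follows. `MessageAreaAllocator` and its methods are hypothetical names, and the boolean returned by `connect` merely marks the case in which a real system would have to register new memory with the network card and notify the client; neither step is simulated here.

```python
class MessageAreaAllocator:
    def __init__(self, preallocated: int):
        self.owner = [None] * preallocated   # area index -> client ID, or None
        self.connected = set()

    def connect(self, client_id):
        """Return (area_index, is_new_area) for a newly connected client."""
        self.connected.add(client_id)
        # 1. Prefer a free area or one still held by a disconnected client.
        for idx, holder in enumerate(self.owner):
            if holder is None or holder not in self.connected:
                self.owner[idx] = client_id
                return idx, False
        # 2. Every area belongs to a live client: grow the pool by one area.
        self.owner.append(client_id)
        return len(self.owner) - 1, True     # True: needs NIC registration

    def disconnect(self, client_id):
        self.connected.discard(client_id)

alloc = MessageAreaAllocator(2)
first = alloc.connect("c1")
second = alloc.connect("c2")
alloc.disconnect("c1")
reused = alloc.connect("c3")   # takes over the area left by c1
grown = alloc.connect("c4")    # all areas busy, so a new one is added
```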

With the self-identifying remote procedure call technique of the present invention, remote requests are answered promptly. The technique has the following advantages. The RDMA_WRITE_WITH_IMM primitive is used to send messages, so that, while keeping latency low, the auxiliary information carried with each message lets the server detect and identify requests quickly and process them promptly. The server writes the result back with an RDMA primitive, which has very low latency and thus lowers the overall round-trip latency. In addition, before issuing a remote request the client allocates the memory area that will hold the result and attaches its address to the request, so the server can perform the remote write directly to the given address. Concurrency control over these memory areas is handled entirely within the client, so the technique is well suited to high-concurrency scenarios.
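One way to picture the client-local concurrency control mentioned above is a lock-protected pool of response slots whose offset is attached to each request. This is a sketch under assumptions: `ResponseSlots`, `build_request`, the 256-byte slot size, and the 8-byte little-endian address prefix are all hypothetical, since the source does not specify how the address is encoded or how slots are managed.

```python
import threading

class ResponseSlots:
    """Client-local pool of fixed-size response slots (sizes are assumed)."""

    def __init__(self, num_slots: int, slot_size: int = 256):
        self.slot_size = slot_size
        self.mem = bytearray(num_slots * slot_size)
        self._free = list(range(num_slots))
        self._lock = threading.Lock()

    def acquire(self) -> int:
        """Reserve a slot and return its offset, to be sent with the request."""
        with self._lock:
            if not self._free:
                raise RuntimeError("no free response slot")
            return self._free.pop() * self.slot_size

    def release(self, offset: int) -> None:
        """Give the slot back once the response has been consumed."""
        with self._lock:
            self._free.append(offset // self.slot_size)

def build_request(op: bytes, resp_offset: int) -> bytes:
    """Prefix the request body with the 8-byte response address."""
    return resp_offset.to_bytes(8, "little") + op

slots = ResponseSlots(2)
a = slots.acquire()
b = slots.acquire()
req = build_request(b"read", a)
```

Because each in-flight request owns a distinct slot, concurrent requests never share a result area, and all coordination stays on the client.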

In addition, the other components and functions of the RDMA-based distributed memory file system according to the embodiments of the present invention are known to those skilled in the art and, to avoid redundancy, are not described here.

In the description of this specification, a reference to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, such references do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principles and spirit of the invention. The scope of the invention is defined by the claims and their equivalents.

Claims

1. An RDMA-based distributed memory file system, wherein the distributed memory file system interconnects memory of each node through RDMA, the file system includes a client and a server, the client provides a file access interface for upper layer applications to call, the server provides metadata services and data services, and the distributed memory file system performs the following actions:

s1: in the initialization stage of the distributed memory file system, uniformly dividing the memory of the cluster for file storage, and registering the memory to a network card to support a remote node to directly access the memory, thereby constructing a distributed shared memory pool;

s2: on the distributed shared memory pool, file indexing and file data block indexing are respectively carried out through a two-level hash index, and query service is provided for the file system;

s3: processing the request of the client by a self-identification remote procedure call method, and returning a processing result;

wherein the two-level hash index comprises: when a client side initiates a file access request, calculating a metadata server ID for storing metadata of a file according to the full path name of the file, wherein the metadata server ID is determined by a system configuration file; the client sends the request information to a metadata server corresponding to the ID, the metadata server analyzes the request content after detecting a new message, performs second hash value calculation according to the file path name in the request content, accesses the chained hash index table according to the second hash value, acquires the metadata information, performs corresponding logic processing and returns a request result;

the self-identifying remote procedure call method comprises the following steps: when the client sends a message to the server, using the RDMA_WRITE_WITH_IMM primitive to carry the message content, and storing client metadata at the head of the message; when the server returns the request result, the returned result is directly written back to the memory area designated by the client through an RDMA primitive, and the client monitors the memory area for storing the returned result in a polling mode until the data is successfully returned.

2. The RDMA-based distributed memory file system of claim 1, wherein the distributed shared memory pool stores, in order, a superblock, a message pool, a chained hash index table, a metadata storage area, and a data storage area.

3. The RDMA-based distributed memory file system of claim 2, wherein direct access by a remote node is received in a data layout area; in the chained hash index table and the metadata storage area, a service node responds to concurrent client requests and executes queries and updates on metadata; and the metadata is hashed and dispersed across the whole cluster according to the file path name, and each node independently maintains the metadata and the data of the file.

4. The RDMA-based distributed memory file system of claim 2, wherein the superblock is used to store the number of metadata blocks, the metadata block size, the number of data blocks, and the data block size, the superblock being read remotely by nodes at startup of the file system.

5. The RDMA-based distributed memory file system of claim 2, wherein the message pool comprises a plurality of message areas, and the plurality of message areas are allocated to different clients connected to the system, so that when a client has a new request, the client remotely writes the new request into its message area on the service node, and after the server receiving thread detects the new request, the message is quickly located by using the self-identification method, processed, and the result is returned.

6. The RDMA-based distributed memory file system of claim 2, wherein the chained hash index table is used for metadata indexing, and when querying file metadata, the hash value of the full pathname of the file is first calculated;

taking the hash value as an index number of an index table, inquiring the table items under the index number, matching file names, and if the file names are successfully matched, acquiring a metadata address and accessing the metadata according to the address;

if the file names do not match, the next table entry is continuously searched until the matching is successful.

7. The RDMA-based distributed memory file system according to claim 2, wherein the metadata storage area and the data storage area are used for storing metadata and data, respectively, the metadata storage area and the data storage area are divided into fixed-size metadata blocks and data blocks, and free block bitmaps of corresponding areas are stored at the headers of the metadata area and the data area to describe the memory usage.

8. The RDMA-based distributed memory file system of claim 1, wherein the client stores its ID and a timestamp in the message header, the client ID being assigned by a server master node at connection setup and being globally unique.

9. The RDMA-based distributed memory file system of claim 8, wherein after the RDMA_WRITE_WITH_IMM message is successfully delivered, the server obtains the client metadata of the header of the message according to the completion information, parses out the client ID, and directly queries the fixed offset location of the local message pool according to the client ID to obtain new request information.

10. The RDMA-based distributed memory file system of claim 8, wherein when the number of clients exceeds the number of message pools allocated in advance by the server, the server queries the disconnected clients and transfers the message area occupied by the disconnected clients to the current client; and if all the clients remain connected, a new message area is allocated, registered with the network card, and the current client is notified.