
CN111221773A - A Data Storage Architecture Method Based on RMDA High-speed Network and Jump Table - Google Patents


Info

Publication number
CN111221773A
CN111221773A
Authority
CN
China
Prior art keywords
access
skiplist
data
local
remote
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010041020.9A
Other languages
Chinese (zh)
Other versions
CN111221773B (en)
Inventor
胡卉芪
林志刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University filed Critical East China Normal University
Priority to CN202010041020.9A priority Critical patent/CN111221773B/en
Publication of CN111221773A publication Critical patent/CN111221773A/en
Application granted granted Critical
Publication of CN111221773B publication Critical patent/CN111221773B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/16 Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163 Interprocessor communication
    • G06F15/173 Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17306 Intercommunication techniques
    • G06F15/17331 Distributed shared memory [DSM], e.g. remote direct memory access [RDMA]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/06 Addressing a physical block of locations, e.g. base addressing, module addressing, memory dedication
    • G06F12/0646 Configuration or reconfiguration
    • G06F12/0653 Configuration or reconfiguration with centralised address assignment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G06F9/547 Remote procedure calls [RPC]; Web services
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/54 Indexing scheme relating to G06F9/54
    • G06F2209/541 Client-server
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/54 Indexing scheme relating to G06F9/54
    • G06F2209/548 Queue
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Physics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract



The invention discloses a data storage architecture method based on an RDMA high-speed network and a skip list. A task scheduler allocates local and remote tasks over a skip-list data storage architecture, with an access concurrency-control strategy and a range-scan process designed around the R-skiplist index. Data stored on the server in R-skiplist form is dispatched by the task scheduler to local and remote tasks, realizing high-bandwidth, low-latency data access over RDMA network communication. Compared with the prior art, the invention offers lower latency, higher bandwidth and data-access throughput, shortens communication time in remote access, saves network resources, avoids occupying the server CPU for data transfer, and incurs no conflicts with the writes of other partitions.


Description

Data storage architecture method based on RDMA high-speed network and skip list
Technical Field
The invention relates to the technical field of database systems, and in particular to a data storage architecture method based on a high-bandwidth, low-latency RDMA network and a skip list.
Background
Remote Direct Memory Access (RDMA) enables low-latency network access by reading and writing the memory of a remote computer directly. It bypasses the operating system kernel and supports zero-copy networking, thereby achieving high bandwidth and low latency. RDMA offers two classes of remote-memory commands. The first is the two-sided operation, Send/Recv: much like socket programming, the server must post an RDMA Recv, attaching an address indicating where the incoming message is to be stored, before the client issues an RDMA Send. The second is the one-sided operation, Read/Write: with memory semantics, the memory address where the message is stored on the remote computer is specified by the client itself, which eliminates CPU consumption on the remote computer (the server). One-sided operations provide higher bandwidth and lower latency than two-sided operations. However, existing concurrency-control methods for traditional skiplists used over RDMA, whether lock-based or lock-free, consume significant CPU resources because they must add steps to handle the various conflicts among operations. Moreover, because skiplist nodes are stored at random locations, only one node can be fetched per RDMA round trip, so remotely traversing a traditional skiplist over RDMA consumes a large amount of network resources. In the prior-art client-server design, the server is primarily responsible for processing operations, while the client merely sends RPC requests and receives the server's replies; when the concurrency of local access operations is high, the CPU of the local node becomes a performance bottleneck for the application and many tasks are blocked.
Disclosure of Invention
In view of the deficiencies of the prior art, the invention aims to design a data storage architecture method based on an RDMA high-speed network and a skip list. The technical scheme is that data is stored on the server in R-skiplist form, local and remote tasks are distributed by a task scheduler, and an access concurrency-control strategy and a range-scan process are designed around the R-skiplist index. This achieves the high bandwidth, low latency and high data-access throughput of RDMA network communication, greatly reduces communication time in remote access, saves network resources, and effectively avoids both server CPU occupation for data transfer and conflicts with writes.
The purpose of the invention is realized as follows: a data storage architecture method based on an RDMA high-speed network and a skip list, in which a task scheduler allocates local and remote tasks over a skip-list data storage architecture; an access concurrency-control strategy and a range-scan process are designed around the R-skiplist index; and the task scheduler dispatches data stored on the server in R-skiplist form to local and remote tasks, realizing high-bandwidth, low-latency data access over RDMA network communication.
The data storage architecture comprises:
a1: in an RDMA network, when the server identifies, via the completion queue (CQ), a request sent by a client through RPC, it first determines the access mode by checking factors such as the current CPU resource utilization; this judgment is the responsibility of the task-scheduling component;
a2: if local access is decided, the request is executed by a thread in the local thread pool and the result is returned to the client;
a3: if remote access is decided, the scheduling thread returns the request to the client; on seeing the remote-access flag in the reply, the client performs the remote access.
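The routing decision in steps a1-a3 can be sketched as follows. This is an illustrative single-process model, not the patent's implementation: the `Scheduler` name, the blocked-task queue as a CPU-load proxy, and the `max_blocked` threshold are all assumptions.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Scheduler:
    """Toy model of the server-side dispatch step (a1-a3)."""
    max_blocked: int = 8                      # assumed load threshold
    blocked: deque = field(default_factory=deque)

    def route(self, request):
        """Return ('local', result) or ('remote', request)."""
        if len(self.blocked) < self.max_blocked:
            # a2: low load - serve from the local thread pool
            return ("local", f"executed {request} in local thread pool")
        # a3: high load - bounce the request back with a remote-access flag
        return ("remote", request)
```

A client receiving the `"remote"` tag would then perform the access itself over RDMA, bypassing the server CPU.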
The access concurrency-control strategy designed around the R-skiplist index comprises:
b1: the R-skiplist divides the skiplist into multiple blocks, each containing several nodes; each block is stored at contiguous addresses, so one RDMA communication fetches a whole block rather than a single node;
b2: the R-skiplist supports two access modes, Local Access (Local-Access) and Remote Access (Remote-Access); in the cluster, the computer holding the R-skiplist is called the local node and the other computers are called remote nodes; Local-Access means accessing the R-skiplist on the local node, while Remote-Access means a remote node accessing the R-skiplist directly through RDMA, bypassing the CPU of the local node;
b3: concurrency control on the R-skiplist covers resolution of top-level split conflicts, top-level read-write conflicts and boundary conflicts.
The range-scan process designed around the R-skiplist index comprises the following steps:
c1: the client sends a scan request to the server;
c2: the server traverses the R-skiplist to find the first target data node;
c3: the server sends the address of the data block containing the first target data back to the client;
c4: the client accesses the data block directly via RDMA using the returned address. This process both saves the data copy and avoids server CPU occupation for data transfer; moreover, since only the first target datum is located on the server side, there is no conflict with the writes of other partitions.
Compared with the prior art, the invention offers lower latency, higher bandwidth and data-access throughput, reduces communication time in remote access, saves network resources, and avoids both server CPU occupation for data transfer and conflicts with the writes of other partitions. The R-skiplist design divides the skiplist into blocks containing multiple nodes and stores each block at contiguous addresses, so one RDMA communication can fetch a whole block rather than a single node. In the cluster, the computer holding the R-skiplist is called the local node and the other computers are called remote nodes; Local-Access accesses the R-skiplist on the local node, while Remote-Access lets a remote node access the R-skiplist directly through RDMA, bypassing the CPU of the local node. Since network resources are limited, the R-skiplist adjusts the traditional skiplist structure to shorten communication time in remote access, thereby saving network resources. Moreover, read operations dominate in practice; when a local search result is large, copying and sending data in the traditional skiplist manner consumes a large amount of CPU resources, and if the required data set spans several partitions, the range query conflicts with the writes of other partitions. With the R-skiplist, the client can access the data block of the first target datum directly through RDMA using its address. This saves the data copy and avoids server CPU occupation for data transfer; and because only the first target datum is located on the server side, there is no conflict with the writes of other partitions.
Drawings
FIGS. 1-2 are schematic diagrams of data storage architectures according to the present invention;
FIG. 3 is a schematic design diagram of the R-skiplist;
FIGS. 4 to 5 are schematic diagrams of an R-skiplist construction process;
FIG. 6 is a schematic diagram of a prior art range search scan process;
FIG. 7 is a diagram illustrating the R-skiplist range search scanning process.
Detailed Description
The present invention will be described in further detail with reference to specific examples.
Example 1
Referring to fig. 1, when local-access concurrency is high, the CPU of the local node becomes a performance bottleneck and many tasks are blocked; to improve performance, blocked tasks are allocated to remote nodes and remote access is performed using the remaining bandwidth.
On the server, a background thread checks CPU utilization, which is reflected in the number of blocked tasks. If high CPU load is detected, some read operations are distributed to clients until utilization drops. The task scheduling is not blind: a priority-based scheduling strategy is designed to maximize performance. Because the number of partitions exceeds the number of threads, some partitions have no local thread serving them, so read operations fall into two categories: reads of partitions without a local serving thread and reads of partitions with one. For the former, remote reads have a higher success rate, since it is guaranteed that no conflict between client and server can arise; such reads are therefore assigned to clients first. Reads of partitions served by a local thread have lower priority, because they may conflict with the server and force a remote-read retry.
For get, insert and delete operations, the key of the required datum determines which partition range it belongs to; for a scan operation, the start key is used to divide the required data set into tasks. Arriving task information is obtained by polling the CQ on the server, following the request-identification RPC method described in a1. After a thread acquires tasks, it partitions them by range; the partition ranges are determined by the number of partitions and the top-level nodes of the R-skiplist. Partitioned tasks are pushed into task queues, one queue per partition, and each thread pops tasks from the different queues. Because tasks are divided by range, load skew can occur; to minimize it, the number of partitions is set much larger than the number of threads, and threads are not bound to queues but loop over all queues popping tasks, so that several high-load partitions are never served by a single thread. Since looping over queues causes frequent context switches that reduce performance, each thread executes a batch of tasks per queue visit.
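The queue discipline just described can be sketched as below. This is an assumption-laden illustration: the function names `dispatch` and `pop_batch`, the `(lo, hi)` range representation, and the batch size are all invented for the example, and the real system would run `pop_batch` concurrently in worker threads.

```python
from collections import deque

def dispatch(tasks, ranges):
    """Route each task key to the queue of the partition whose (lo, hi)
    range contains it - one queue per partition."""
    queues = [deque() for _ in ranges]
    for key in tasks:
        for i, (lo, hi) in enumerate(ranges):
            if lo <= key <= hi:
                queues[i].append(key)
                break
    return queues

def pop_batch(queues, batch=2):
    """One round of a worker loop: threads are not bound to a queue but
    visit every queue, taking up to `batch` tasks per visit to limit
    context switching."""
    done = []
    for q in queues:
        for _ in range(min(batch, len(q))):
            done.append(q.popleft())
    return done
```

With many more partitions than threads, this round-robin batching spreads hot partitions across workers instead of pinning them to one thread.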
Referring to FIG. 2, to balance the load, the load is periodically checked using a background thread and if load skew occurs, the partition scope is readjusted as appropriate.
Referring to FIG. 3, each level is an ordered linked list of blocks, each containing an ordered linked list of nodes.
Referring to fig. 4, which shows the definition of the data structures in the R-skiplist: compared with the conventional skiplist, the representation of key, value and next in index nodes and data nodes is unchanged, but down in an index node records the address of the next-layer block containing the node with the same key, rather than the address of the next node with the same key. In the structure of data blocks and index blocks, next records the address of the next block in addition to the node linked list, and max_key records the largest key value. The data structures also carry flag fields such as is_delete, is_update and is_split, which are used in the concurrency-control design.
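The record layout of Fig. 4 might be rendered as follows. Field names (down, next, max_key, is_delete, is_update, is_split) come from the text; the class shapes and defaults are assumptions, and in the real system a block occupies one contiguous, RDMA-readable memory region rather than a Python object.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    key: int
    value: object = None     # data nodes carry a value
    down: "Block" = None     # index nodes: address of the next-layer *block*
                             # holding the node with the same key

@dataclass
class Block:
    nodes: list = field(default_factory=list)  # ordered list of Nodes
    next: "Block" = None     # next block on the same level
    max_key: int = None      # largest key stored in this block
    is_delete: bool = False  # flags used by the concurrency control
    is_update: bool = False
    is_split: bool = False
```

Because `down` points at a block rather than a single node, one RDMA read of that block retrieves every node between two adjacent upper-level nodes.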
Referring to FIG. 5, which shows the local-insert code, the construction process of the R-skiplist is introduced via the insert operation as follows:
step 1: each layer (line 1) finds a block, such as the code of a Find operation, into which new data is to be inserted. According to the key of the required data, firstly acquiring a corresponding partition header from a cache, then using a function tlayer _ get _ block () to obtain a block containing the top layer of the target node, and if the obtained block contains the partition header, traversing the link list of the block to find the target node from the partition header and acquiring the target node.
Step 2: move to the next level using down, the rest of the levels have the same access as the top level. The R-skip assigns a random height h to the data (line 3) and then inserts the data into the target blocks from layer 1 to layer h (lines 4-31). In the process, when the data needs to be inserted continuously, the block for inserting the new data in the ith layer needs to be split from the key.
And step 3: entering layer i + 1 (lines 8-9, 18-19), this splitting rule may ensure that each node of layer i + 1 corresponds to a unique block of layer i, which means that a block of layer i contains only data between two adjacent nodes of layer i + 1; the target block of the ith layer only contains data which can be accessed by the user, and the transmission of a lot of useless data in remote access is avoided. Theoretically, if the data in the R-skip is in distributed balance, there are about 1/p nodes in a block, but the data in a certain range may occupy the height h randomly, which may cause blocking and further affect the communication performance. For this case, the maximum chunk size max _ cap is set before constructing the R-skip, and when the chunk contains a number of nodes equal to max _ cap, the chunk will be split (lines 10-11, 20-29).
The structure of the R-skiplist of the invention avoids conflicts between most local operations and saves substantial CPU resources by reducing the conflict-handling steps within operations.
Referring to fig. 3, the R-skiplist is partitioned according to its top-level nodes, and each partition is then treated as a small R-skiplist: reading or writing data within one partition has no effect on the others. Thus, as long as each thread is responsible for tasks of different partitions, many conflicts are avoided, but some conflicts remain:
(1) for a read or write operation, regardless of which partition its required data belongs to, the R-skiplist needs to be traversed from the first node on top to find the target data or location, which would likely conflict with a write operation to a location before the data block.
(2) There is a shared pointer between two adjacent partitions, and a conflict occurs when two adjacent threads modify the shared pointer at the same time.
(3) For a scan operation, a conflict may occur if its required data set spans multiple partitions.
(4) A top-level block may contain data belonging to several partitions; if one partition splits the block, it may conflict with the others.
For top-level read-write conflict elimination: the first top-level node of each partition, called the partition header, is cached, and lookup operations can begin from the corresponding partition header. If a partition header were deleted, it could no longer be accessed directly but only found through the previous partition, and the conflict would reappear. The invention therefore does not remove the partition header from the R-skiplist; it merely sets the flag is_delete to true, ensuring that the boundary of each partition always exists.
For eliminating boundary conflicts: when data is inserted or deleted at the top-level boundary of a partition, or blocks are split or merged at a partition boundary, the shared pointer between adjacent partitions is modified. If two adjacent partitions perform such operations simultaneously, the shared pointer may conflict; moreover, for merge operations, when the two merged blocks belong to different partitions, further conflicts can arise beyond the boundary conflict. Without merge operations, boundary conflicts disappear and other conflicts are reduced, which also benefits the concurrency-control design between remote and local operations; therefore blocks in the R-skiplist perform no merge operation.
For top-level split conflict resolution: a top-level block split conflicts with multiple partitions and is hard to eliminate, but since top-level splits are infrequent it can be resolved within the operations themselves. So as not to affect the concurrency of non-split operations, a flag is_split and a reference count ref_count are added to each block. Before a thread splits a block, it first sets is_split to true and invalidates the associated partition-header cache. If another thread finds the partition header invalid or is_split set to true, it stops its access and retries; otherwise it merely increments ref_count. After is_split is set to true there may still be outstanding operations on the block, so the block cannot be split until they all complete, which is determined from ref_count. When the split finishes, is_split is reset to false and the partition-header cache is updated.
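The is_split / ref_count protocol above can be modeled as follows. This is a single-threaded sketch for illustration only: the real protocol is concurrent, the method names are invented, and cache invalidation is reduced to the flag itself.

```python
class TopBlock:
    """Toy model of the top-level split protocol (is_split + ref_count)."""

    def __init__(self):
        self.is_split = False
        self.ref_count = 0

    def try_access(self):
        """Reader entry: fail (caller retries) if a split is pending,
        otherwise register the in-flight operation."""
        if self.is_split:
            return False
        self.ref_count += 1
        return True

    def release(self):
        self.ref_count -= 1

    def try_split(self):
        """Splitter: set the flag (invalidating header caches), then split
        only once all outstanding operations have drained."""
        self.is_split = True
        if self.ref_count > 0:
            return False          # outstanding operations: wait and retry
        # ... perform the actual split here ...
        self.is_split = False     # done: refresh the partition-header cache
        return True
```

Readers that raced ahead of the flag are accounted for by ref_count, so the split never proceeds under an in-flight access.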
For concurrency control of remote read operations, linearizability between remote reads and local writes can be ensured by guaranteeing the atomicity of a single-block read, at the cost of only a small amount of server-side CPU, implemented as follows:
A flag is_update is added to each block: before updating a block, the writing thread sets it to true, performs the update, and then resets it to false. When a remote read fetches a block from the server, it first checks is_update to see whether the block is being updated: if false, the block can be accessed directly; if true, the read must retry. Block splits caused by writes proceed bottom-up, so after the target node is found on the client, the next-level block corresponding to that node may already have been split on the server. This does not affect read correctness, because all of the split data can still be read. Furthermore, since blocks are never merged in the R-skiplist, there is no concern that the next-level block corresponding to the target node might not exist.
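A minimal sketch of the is_update bracket, under the assumption that blocks are dicts and a failed read simply signals a retry; the real system checks the flag inside the RDMA-fetched block image, and the write/read here run in one thread purely for illustration.

```python
def remote_read(block):
    """Return a snapshot of the block, or None to signal a retry."""
    if block["is_update"]:
        return None               # block mid-update: caller retries
    return list(block["nodes"])   # atomic single-block snapshot

def local_write(block, key):
    """Server-side write bracketed by the is_update flag."""
    block["is_update"] = True     # set before mutating
    block["nodes"].append(key)
    block["nodes"].sort()
    block["is_update"] = False    # reset once the update completes
```

A reader that lands inside the bracket sees is_update true and retries, so it never observes a half-applied update.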
Referring to FIG. 6, the prior-art range-scan process is: the client first sends a scan request to the server; the server then traverses the R-skiplist to find the first target data node, copies all the target data into a buffer, and returns it to the client.
The problems with the above scanning process are: (1) when the search result is large, copying and sending the data consumes a large amount of CPU resources; (2) if the desired data set spans multiple partitions, the range search conflicts with the writes of other partitions. In the R-skiplist, because the data nodes are stored in contiguous memory blocks, they can be read directly by the client via RDMA. The invention therefore designs a new scanning process to solve both problems.
Referring to fig. 7, the R-skiplist range-scan process proceeds as follows:
Step 1: the client sends a scan request to the server;
Step 2: the server traverses the R-skiplist to find the first target data node;
Step 3: the server sends the address of the data block containing the first target data back to the client;
Step 4: the client accesses the data block directly through RDMA using the returned address.
This scanning process both saves the data copy and avoids occupying the server CPU for data transfer; and because only the first target datum is located on the server side, there is no conflict with the writes of other partitions. For a remote range search, the execution flow resembles a remote get, except that more data blocks may be accessed. After the first target data node is found via the start key, the block containing it is read to collect the remaining target nodes; if the last node is reached and its key is still less than end_key, the next data block is read as well. In fact, once the data block containing the first target node is obtained, max_key and end_key can be compared directly to decide whether the next block is needed; if so, the next block can be prefetched while the client processes the current one, greatly reducing the latency of remote range queries.
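The client-side walk just described can be sketched as below. Assumptions: blocks are dicts linked by `next` standing in for RDMA-readable addresses, keys stand in for whole data nodes, and the prefetch overlap is omitted; only the max_key-versus-end_key stopping rule from the text is modeled.

```python
def remote_scan(first_block, start_key, end_key):
    """Follow blocks from the server-returned first block, collecting keys
    in [start_key, end_key]; stop once a block's max_key reaches end_key,
    since no later block can hold further targets."""
    out, blk = [], first_block
    while blk is not None:
        out.extend(k for k in blk["nodes"] if start_key <= k <= end_key)
        if blk["max_key"] >= end_key:
            break                 # the range cannot extend past this block
        blk = blk["next"]         # stands in for an RDMA read of next block
    return out
```

In the real system each `blk["next"]` hop is a one-sided RDMA read issued by the client, so the server CPU is untouched after returning the first address.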
It is intended that all such modifications and variations be included herein within the scope of the present invention and protected by the following claims.

Claims (4)

1. A data storage architecture method based on an RDMA high-speed network and a skip list, characterized in that a task scheduler is used to allocate local and remote tasks over a skip-list data storage architecture; an access concurrency-control strategy and a range-scan process are designed around the R-skiplist index; and the task scheduler dispatches data stored on the server in R-skiplist form to local and remote tasks, realizing high-bandwidth, low-latency data access over RDMA network communication.
2. The data storage architecture method based on an RDMA high-speed network and a skip list according to claim 1, characterized in that the skip-list data storage architecture comprises: A1: in the RDMA network, when the server identifies, via the completion queue (CQ), a request sent by a client through RPC, it first determines the access mode by checking the current CPU resource utilization; A2: if local access is decided, the request is executed by a thread in the local thread pool and the result is returned to the client; A3: if remote access is decided, the scheduling thread returns the request to the client, and on seeing the remote-access flag in the reply, the client performs the remote access.
3. The data storage architecture method based on an RDMA high-speed network and a skip list according to claim 1, characterized in that the access concurrency-control strategy of the R-skiplist index design comprises: B1: the R-skiplist divides the skiplist into multiple blocks, each containing several nodes, and stores each block at contiguous addresses; B2: the R-skiplist supports both local and remote access modes; in the cluster, the computer holding the R-skiplist is the local node and the other computers are called remote nodes; local access means accessing the R-skiplist on the local node, while remote access means a remote node accessing the R-skiplist directly through RDMA, bypassing the CPU of the local node; B3: concurrency control on the R-skiplist covers top-level split conflict resolution, top-level read-write conflicts and boundary conflicts.
4. The data storage architecture method based on an RDMA high-speed network and a skip list according to claim 1, characterized in that the scan process of the R-skiplist index design comprises the steps of: C1: the client sends a scan request to the server; C2: the server traverses the R-skiplist to find the first target data node; C3: the server sends the address of the data block containing the first target data back to the client; C4: the client accesses the data block directly through RDMA using the returned address.
CN202010041020.9A 2020-01-15 2020-01-15 A data storage architecture method based on RDMA high-speed network and jump table Active CN111221773B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010041020.9A CN111221773B (en) 2020-01-15 2020-01-15 A data storage architecture method based on RDMA high-speed network and jump table

Publications (2)

Publication Number Publication Date
CN111221773A true CN111221773A (en) 2020-06-02
CN111221773B CN111221773B (en) 2023-05-16

Family

ID=70827007

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010041020.9A Active CN111221773B (en) 2020-01-15 2020-01-15 A data storage architecture method based on RDMA high-speed network and jump table

Country Status (1)

Country Link
CN (1) CN111221773B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150006478A1 (en) * 2013-06-28 2015-01-01 Silicon Graphics International Corp. Replicated database using one sided rdma
CN105471745A (en) * 2014-09-25 2016-04-06 英特尔公司 Technologies for bridging between coarse-grained and fine-grained load balancing
CN110177118A (en) * 2019-06-13 2019-08-27 上海海事大学 A kind of RPC communication method based on RDMA

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Chenchen Huang et al.: "Partition pruning for range query on distributed log-structured merge-tree", Frontiers of Computer Science *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20220106622A (en) * 2021-01-22 2022-07-29 성균관대학교산학협력단 Zipper compaction method and apparatus for compacting the plural of skiplists
KR102568662B1 (en) 2021-01-22 2023-08-22 성균관대학교산학협력단 Zipper compaction method and apparatus for compacting the plural of skiplists
CN112817887A (en) * 2021-02-24 2021-05-18 上海交通大学 Far memory access optimization method and system under separated combined architecture
CN112817887B (en) * 2021-02-24 2021-09-17 上海交通大学 Far memory access optimization method and system under separated combined architecture

Also Published As

Publication number Publication date
CN111221773B (en) 2023-05-16

Similar Documents

Publication Publication Date Title
US11269772B2 (en) Persistent memory storage engine device based on log structure and control method thereof
US20210011652A1 (en) Data storage access method, device and apparatus for persistent memory
JP6314355B2 (en) Memory management method and device
US7249152B2 (en) Dynamic disk space management by multiple database server instances in a cluster configuration
CN110555001B (en) Data processing method, device, terminal and medium
CN105224255B (en) A kind of storage file management method and device
US20200218662A1 (en) Data caching device and control method therefor, data processing chip, and data processing system
CN110109868B (en) Method, apparatus and computer program product for indexing files
US9477413B2 (en) System and method for managing access requests to a memory storage subsystem
US8806168B2 (en) Producer-consumer data transfer using piecewise circular queue
CN114490141B (en) High-concurrency IPC data interaction method based on shared memory
WO2024078342A1 (en) Memory swap method and apparatus, and computer device and storage medium
CN111124270A (en) Method, apparatus and computer program product for cache management
CN110147345A (en) A kind of key assignments storage system and its working method based on RDMA
WO2024188050A1 (en) File lock management method for distributed file system, and device and medium
CN111221773B (en) A data storage architecture method based on RDMA high-speed network and jump table
US9223799B1 (en) Lightweight metadata sharing protocol for location transparent file access
CN116996449A (en) Data message processing system, method, computer device and storage medium
US9176792B2 (en) Class-based mutex
US9116814B1 (en) Use of cache to reduce memory bandwidth pressure with processing pipeline
EP3910484B1 (en) Systems and methods for managing cache replacement
CN112463306B (en) Method for sharing disk data consistency in virtual machine
CN113590507B (en) Distributed storage system, cache layer thereof, data access method and device
CN111858418B (en) Memory communication method and device based on remote direct memory access RDMA
CN119046225B (en) A multi-processor system, switch and data processing method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant