
CN100565460C - Method for managing data - Google Patents

Method for managing data

Info

Publication number
CN100565460C
CN100565460C · CNB200480021585XA · CN200480021585A
Authority
CN
China
Prior art keywords
node
data item
nodes
data
ownership
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
CNB200480021585XA
Other languages
Chinese (zh)
Other versions
CN1829961A
Inventor
Roger J. Bamford
Sashikanth Chandrasekaran
Angelo Pruscino
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Oracle International Corp
Original Assignee
Oracle International Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Oracle International Corp
Publication of CN1829961A
Application granted
Publication of CN100565460C
Anticipated expiration
Status: Expired - Lifetime

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Storage Device Security (AREA)

Abstract

Various techniques are described for improving the performance of a shared-nothing database system in which at least two of the nodes that are running the shared-nothing database system have shared access to a disk. Specifically, techniques are provided for changing the ownership of data in a shared-nothing database without changing the location of the data on persistent storage. Because the persistent storage location of the data is not changed during the transfer of ownership, ownership can be transferred more freely, and with a smaller performance penalty than would result from physically rearranging the data. Various techniques for providing fast run-time reassignment of ownership are also described. Because the reassignment can be performed during run time, the shared-nothing system does not have to be taken offline to perform the reassignment. In addition, the techniques describe how the reassignment can be performed at a relatively fine granularity, so that reassigning ownership of a few data items on one node does not require a massive reassignment of large amounts of data across all nodes.

Description

Method for Managing Data

Technical Field

The present invention relates to techniques for managing data in a shared-nothing database system that runs on shared-disk hardware.

Background

Multiprocessing computer systems generally fall into three categories: shared-everything systems, shared-disk systems, and shared-nothing systems. In a shared-everything system, processes on all processors can directly access all volatile storage devices (hereinafter generally referred to as "memory") and all non-volatile storage devices (hereinafter generally referred to as "disks") in the system. Consequently, a high degree of wiring between the various computer components is required to provide shared-everything functionality. In addition, the shared-everything architecture has scalability limits.

In a shared-disk system, processors and memories are grouped into nodes. Each node in a shared-disk system may itself constitute a shared-everything system that includes multiple processors and multiple memories. Processes on all processors can access all disks in the system, but only the processes on processors that belong to a particular node can directly access the memory within that particular node. Shared-disk systems generally require less wiring than shared-everything systems. Because all nodes have access to all data, shared-disk systems can also easily accommodate unbalanced workload conditions. However, shared-disk systems are susceptible to coherence overhead. For example, if a first node has modified data and a second node wants to read or modify that same data, a number of steps must be taken to ensure that the correct version of the data is provided to the second node.

In a shared-nothing system, all processors, memories, and disks are grouped into nodes. As in a shared-disk system, each node in a shared-nothing system may itself constitute a shared-everything system or a shared-disk system. Only the processes running on a particular node can directly access the memory and disks within that node. Of the three general types of multiprocessing systems, shared-nothing systems typically require the least amount of wiring between the various system components. However, shared-nothing systems are the most susceptible to unbalanced workload conditions. For example, all of the data to be accessed during a particular task may reside on the disks of a particular node. Consequently, only the processes running within that node can be used to perform the work granules, even though processes on other nodes remain idle.

Databases that run on multi-node systems generally fall into two categories: shared-disk databases and shared-nothing databases.

Shared-Disk Databases

A shared-disk database coordinates work based on the assumption that all data managed by the database system is visible to all processing nodes that are available to the database system. Consequently, in a shared-disk database, the server may assign any work to a process on any node, regardless of the location of the disk that contains the data that will be accessed during the work.

Because all nodes have access to the same data, and each node has its own private cache, multiple versions of the same data item may reside in the caches of any number of nodes. Unfortunately, this means that when one node requires a particular version of a particular data item, the node must coordinate with the other nodes to have that version of the data item shipped to the requesting node. Thus, shared-disk databases are said to operate on the principle of "data shipping", where the data must be shipped to the node that has been assigned to process it.

Such data-shipping requirements may result in "pings". Specifically, a ping occurs when a copy of a data item that is needed by one node resides in the cache of another node. A ping may require the data item to be written to disk and then read from disk. The disk operations necessitated by pings can significantly degrade the performance of the database system.

Shared-disk databases may run on both shared-nothing and shared-disk computer systems. To run a shared-disk database on a shared-nothing computer system, software support may be added to the operating system, or additional hardware may be provided, to allow processes to access remote disks.

Shared-Nothing Databases

A shared-nothing database assumes that a process can only access data if the data is contained on a disk that belongs to the same node as the process. Consequently, if a particular node wants an operation to be performed on a data item that is owned by another node, the particular node must send a request to the other node for the other node to perform the operation. Thus, instead of shipping data between nodes, shared-nothing databases are said to perform "function shipping".

Because any given piece of data is owned by only one node, only that one node (the "owner" of the data) will ever have a copy of the data in its cache. Consequently, there is no need for the type of cache coherency mechanism that is required in shared-disk database systems. Further, shared-nothing systems do not suffer the performance penalty associated with pings, since the node that owns a data item is not required to save its cached version of the data item to disk so that another node can then load the data item into its cache.

Shared-nothing databases may run on both shared-disk and shared-nothing multiprocessing systems. To run a shared-nothing database on a shared-disk machine, a mechanism may be provided for partitioning the database and assigning ownership of each partition to a particular node.

The fact that only the owning node may operate on a piece of data means that the workload in a shared-nothing database may become severely unbalanced. For example, in a system of ten nodes, 90% of all work requests may involve data that is owned by just one of the nodes. Consequently, that one node is overworked while the computational resources of the other nodes are underutilized. To "rebalance" the workload, the shared-nothing database may be taken offline, and the data (and ownership thereof) may be redistributed among the nodes. However, this process involves moving potentially huge amounts of data, and may only temporarily solve the workload imbalance.

Summary of the Invention

To solve one of the technical problems described above, the present invention provides a method for managing data, the method comprising the steps of: maintaining a plurality of persistent data items on persistent storage that is accessible to a plurality of nodes of a multi-node system, the persistent data items including a particular data item stored at a particular location on the persistent storage; assigning exclusive ownership of each of the persistent data items to one of the nodes, wherein a particular node of the plurality of nodes is assigned exclusive ownership of the particular data item; while the particular data item resides at the particular location, whenever any node wants an operation performed that involves the particular data item, the node that desires the operation to be performed ships the operation to the particular node, for the particular node to perform the operation on the particular data item; while the particular node continues to operate, reassigning ownership of the particular data item from the particular node to another node without moving the particular data item from the particular location on the persistent storage; after the reassignment, while the particular data item still resides at the particular location, whenever any node wants an operation performed that involves the particular data item, the node that desires the operation to be performed ships the operation to the other node, for the other node to perform the operation on the particular data item; after a first node has been removed from the multi-node system, continuing to have a set of data items owned by the first node; and, in response to detecting a request for an operation that involves one of those data items, reassigning ownership of the data item from the first node to one or more other nodes. With this approach, a shared-nothing database system can be rebalanced and recovered more efficiently: ownership can be transferred more freely, with a smaller performance penalty than a physical rearrangement of the data would incur, and the reassignment can be performed at a relatively fine granularity, avoiding the need to perform a massive reassignment of large amounts of data across all nodes merely to reassign ownership of a few data items on one of the nodes.

Brief Description of the Drawings

The present invention is described by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals refer to similar elements, and in which:

Figure 1 is a block diagram illustrating a cluster that includes two shared-disk subsystems, according to an embodiment of the invention; and

Figure 2 is a block diagram of a computer system on which embodiments of the invention may be implemented.

Detailed Description

Various techniques are described hereafter for improving the performance of a shared-nothing database system that includes a shared-disk storage system. In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring the present invention.

Functional Overview

Various techniques are described hereafter for improving the performance of a shared-nothing database system in which at least two of the nodes that are running the shared-nothing database system have shared access to a disk. As dictated by the shared-nothing architecture of the database system, each piece of data is still owned by only one node at any given time. However, the fact that at least some of the nodes that are running the shared-nothing database system have shared access to a disk is exploited to rebalance and recover the shared-nothing database system more efficiently.

Specifically, techniques are provided for changing the ownership of data in a shared-nothing database without changing the location of the data on persistent storage. Because the persistent storage location of the data is not changed during the transfer of ownership, ownership can be transferred more freely, and with a smaller performance penalty than would result from physically rearranging the data.

Various techniques are also described for providing fast run-time reassignment of ownership. Because the reassignment can be performed during run time, the shared-nothing system does not have to be taken offline to perform the reassignment. In addition, the techniques describe how the reassignment can be performed at a relatively fine granularity, so that reassigning ownership of a few data items on one node does not require a massive reassignment of large amounts of data across all nodes.

Exemplary Cluster That Includes Shared-Disk Systems

Figure 1 is a block diagram illustrating a cluster 100 upon which embodiments of the invention may be implemented. Cluster 100 includes five nodes 102, 104, 106, 108, and 110 that are connected by an interconnect 130 that allows the nodes to communicate with each other. Cluster 100 includes two disks 150 and 152. Nodes 102, 104, and 106 have access to disk 150, and nodes 108 and 110 have access to disk 152. Thus, the subsystem that includes nodes 102, 104, and 106 and disk 150 constitutes a first shared-disk system, while the subsystem that includes nodes 108 and 110 and disk 152 constitutes a second shared-disk system.

Cluster 100 is an example of a relatively simple system that includes two shared-disk subsystems with no overlapping membership between the shared-disk subsystems. Actual systems may be much more complex than cluster 100, with hundreds of nodes, hundreds of shared disks, and many-to-many relationships between the nodes and the shared disks. In such a system, a single node that has access to many disks may, for example, be a member of several distinct shared-disk subsystems, where each shared-disk subsystem includes one of the shared disks and all of the nodes that have access to that shared disk.

Shared-Nothing Database on a Shared-Disk System

For the purpose of illustration, it shall be assumed that a shared-nothing database system is running on cluster 100, where the database managed by the shared-nothing database system is stored on disks 150 and 152. Based on the shared-nothing nature of the database system, the data may be segregated into five groups or partitions 112, 114, 116, 118, and 120. Each partition is assigned to a corresponding node. The node assigned to a partition is considered to be the exclusive owner of all data that resides in that partition. In this example, nodes 102, 104, 106, 108, and 110 respectively own partitions 112, 114, 116, 118, and 120. The partitions 112, 114, and 116 owned by the nodes that have access to disk 150 (nodes 102, 104, and 106) are stored on disk 150. Similarly, the partitions 118 and 120 owned by the nodes that have access to disk 152 (nodes 108 and 110) are stored on disk 152.

As dictated by the shared-nothing nature of the database system running on cluster 100, any piece of data is owned by at most one node at any given time. In addition, access to the shared data is coordinated by function shipping. For example, in the context of a database system that supports the SQL language, a node that does not own a particular piece of data may cause an operation to be performed on that data by forwarding fragments of SQL statements to the node that does own the data.

Ownership Map

To perform function shipping efficiently, all nodes need to know which node owns which data. Accordingly, an ownership map is established, where the ownership map indicates the data-to-node ownership assignments. During run time, the various nodes consult the ownership map to route SQL fragments to the correct node at run time.
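As a rough illustration of this run-time lookup, the sketch below shows how a node might consult an ownership map to decide whether to execute a SQL fragment locally or ship it to the owning node. All names are hypothetical and chosen only for illustration; they are not taken from the patent or from any particular database product.

```python
# Hypothetical sketch of run-time function shipping: a node consults a shared
# ownership map and ships work on a data item to that item's owner.

class OwnershipMap:
    def __init__(self, item_to_owner):
        self.item_to_owner = dict(item_to_owner)   # data item name -> node id

    def owner_of(self, item):
        return self.item_to_owner[item]

def route_fragment(ownership_map, local_node, item, sql_fragment):
    """Decide at run time where a SQL fragment that touches `item` must run."""
    owner = ownership_map.owner_of(item)
    if owner == local_node:
        return ("run locally", sql_fragment)
    return ("ship to node", owner, sql_fragment)   # stand-in for sending the fragment

# Example: node 102 needs an update on a row owned by node 104.
omap = OwnershipMap({"emp.row17": 104, "emp.row42": 102})
print(route_fragment(omap, 102, "emp.row17", "UPDATE emp SET ... WHERE ..."))
```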

According to one embodiment, the data-to-node mapping need not be determined at compile time of a SQL (or any other database access language) statement. Rather, as described in greater detail hereafter, the data-to-node mapping may be established and revised during run time. Using the techniques described hereafter, when ownership changes from one node that has access to the disk on which the data resides to another node that also has access to that disk, the ownership change can be performed without moving the data from its persistent location on the disk.

Locking

Locks are structures used to coordinate access to a resource among several entities that have access to the resource. In the case of a shared-nothing database system, there is no need for global locking to coordinate access to the user data in the shared-nothing database, because any given piece of data is owned by only a single node. However, because all of the nodes of the shared-nothing database require access to the ownership map, some locking may be required to prevent inconsistent updates to the ownership map.

According to one embodiment, a two-node locking scheme is used when ownership of a piece of data is being reassigned from one node (the "old owner") to another node (the "new owner"). Further, a global locking mechanism may be used to control access to the metadata associated with the shared-nothing database. Such metadata may include, for example, the ownership map.

Transferring Ownership Without Moving Data

According to one aspect of the invention, ownership of data can be changed from one node (the old owner) to another node (the new owner) without moving the data. For example, assume that a particular data item currently resides in partition 112. Because the data item resides in partition 112, the data item is owned by node 102. To change the ownership of the data to node 104, the data must cease to belong to partition 112 and instead belong to partition 114. In a conventional implementation of a shared-nothing database system, such an ownership change would typically cause the data item to actually be moved from one physical location on disk 150 that corresponds to partition 112 to another physical location on disk 150 that corresponds to partition 114.

In contrast, according to an embodiment of the invention, partitions 112 and 114 are not necessarily physical partitions that are bound to particular locations of disk 150. Rather, partitions 112 and 114 are location-independent partitions that merely represent the sets of data items currently owned by nodes 102 and 104, respectively, regardless of where those data items reside on disk 150. Thus, because partitions 112 and 114 are location independent, data items can be moved from one partition to the other (that is, reassigned from one owner to another) without any actual movement of the data on disk 150.

While changing the ownership of a data item does not require movement of the data item, it does require a change to the ownership map. Unlike the user data in a shared-nothing database, the ownership map is shared among the various nodes. Consequently, portions of the ownership map may be cached in the private caches of the various nodes. Therefore, in response to a change of ownership of a data item, the ownership map is changed, and the cached copies of the affected portion of the ownership map are invalidated.

According to an alternative embodiment, the ownership change is performed in a manner similar to a schema change of a base object. Specifically, after a change is made to the ownership map, compiled statements that reference the ownership map are invalidated and recompiled to use the new ownership map.

Adding and Removing Nodes

During the operation of cluster 100, it may be necessary to add nodes to, or remove nodes from, cluster 100. In conventional shared-nothing systems, such operations frequently involve moving large amounts of data from one file or physical partition of a disk to another file or physical partition of another disk. By using location-independent partitions, the only data that must be physically relocated is data whose ownership is transferred to a node that does not have access to the disk on which the data currently resides.

For example, assume that a new node X is added to cluster 100, and that node X has access to disk 152 but not to disk 150. To rebalance the workload among the nodes, some of the data currently owned by nodes 102 through 110 may be reassigned to node X. Because the data whose old owners are nodes 102 through 106 resides on disk 150, to which node X does not have access, that data must be physically moved to disk 152, to which node X does have access. However, because the data whose old owners are nodes 108 and 110 already resides on disk 152, to which node X has access, ownership of that data may be transferred to node X by updating the ownership map, without moving the actual data.

Similarly, when a node is removed from cluster 100, the only data items that need to be physically relocated are those whose ownership is transferred to a node that does not currently have access to the disk on which the data items reside. Data items whose ownership is transferred to a node that has access to the disk on which the data resides need not be moved. For example, if node 102 is removed from cluster 100 and ownership of all data items previously owned by node 102 is transferred to node 104, then no data items need to be physically relocated in response to the ownership change.

Gradual Transfer of Ownership

According to one embodiment, the performance penalty associated with a large reassignment of data ownership in response to an added or removed node may be lessened by performing the ownership transfer gradually rather than all at once. For example, when a new node is added to the cluster, rather than transferring ownership of enough data to make the new node as busy as the existing nodes, the system may begin by transferring ownership of few or no data items to the new node. According to one embodiment, ownership is transferred gradually based on the needs of the workload. For example, a transfer of data ownership may be triggered when the system detects that the workload of one of the nodes has become excessive. In response to detecting the overworked node, some of the data items that belong to the overworked node may be assigned to the previously added node. Gradually, more and more data items may be assigned from the overworked node to the new node, until it is detected that the formerly overworked node is no longer overworked.
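The workload-driven trigger described above can be pictured as a simple control loop. In the sketch below, the thresholds, the per-node load metric, and the callback that updates the ownership map are assumptions made purely for illustration; the patent does not prescribe particular values or interfaces.

```python
# Hypothetical control loop for gradual, workload-driven ownership transfer.

OVERLOAD_THRESHOLD = 0.80   # load above which a node counts as overworked
IDLE_THRESHOLD = 0.30       # load below which a node can absorb more buckets

def rebalance_step(load_by_node, buckets_by_node, reassign_bucket):
    """Move at most one bucket per invocation, so the transfer stays gradual."""
    overworked = [n for n, load in load_by_node.items() if load > OVERLOAD_THRESHOLD]
    underused = [n for n, load in load_by_node.items() if load < IDLE_THRESHOLD]
    if not overworked or not underused:
        return None
    source = max(overworked, key=lambda n: load_by_node[n])
    target = min(underused, key=lambda n: load_by_node[n])
    if not buckets_by_node[source]:
        return None
    bucket = buckets_by_node[source].pop(0)
    buckets_by_node[target].append(bucket)
    reassign_bucket(bucket, source, target)   # updates the ownership map only
    return (bucket, source, target)

# Example: node 102 is overworked, the newly added node 104 is nearly idle.
print(rebalance_step({102: 0.95, 104: 0.20, 106: 0.50},
                     {102: [7, 8, 9], 104: [], 106: [1, 2]},
                     lambda b, s, t: None))
```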

Alternatively, the reassignment of ownership may be triggered when the workload of the existing nodes falls below a certain threshold. In particular, it may be desirable to shift some ownership responsibilities from the other busy nodes to the new node at a time when the workload of those busy nodes has lessened, so that the reassignment operation does not further degrade the performance of nodes that are already overworked.

With respect to gradually transferring ownership away from a removed node, the ownership transfer may, for example, be triggered on an as-needed basis. For example, if data item X is owned by the removed node, then data item X may be reassigned to another node when it is detected that some node has requested an operation that involves data item X. Similarly, the transfer of ownership from the removed node to an existing node may be triggered when the workload of that existing node falls below a certain threshold.

Bucket-Based Partitioning

As mentioned above, the data managed by the shared-nothing database is partitioned, and the data in each partition is owned exclusively by one node. According to one embodiment, the partitions are established by assigning the data to logical buckets, and then assigning each of the buckets to a partition. Thus, the data-to-node mapping in the ownership map includes a data-to-bucket mapping and a bucket-to-node mapping.

According to one embodiment, the data-to-bucket mapping is established by applying a hash function to the name of each data item. Similarly, the bucket-to-node mapping may be established by applying another hash function to an identifier associated with each bucket. Alternatively, one or both of the mappings may be established using range-based partitioning, or simply by enumerating each individual relationship. For example, one million data items may be mapped to fifty buckets by dividing the namespace of the data items into fifty ranges. The fifty buckets may then be mapped to five nodes by storing, for each bucket, a record that (1) identifies the bucket and (2) identifies the node to which the bucket is currently assigned.
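As a concrete but hypothetical illustration of the two-level mapping, the sketch below hashes a data item's name into one of fifty buckets and keeps one record per bucket naming its current owner. The hash function, the bucket count, and the node identifiers are assumptions for illustration, not values taken from the patent.

```python
# Hypothetical two-level ownership mapping: data item -> bucket -> node.
import hashlib

NUM_BUCKETS = 50
NODES = [102, 104, 106, 108, 110]

def data_to_bucket(item_name: str) -> int:
    # Data-to-bucket mapping: hash the data item's name into one of the buckets.
    digest = hashlib.sha1(item_name.encode()).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

# Bucket-to-node mapping: one record per bucket naming its current owner.
bucket_to_node = {b: NODES[b % len(NODES)] for b in range(NUM_BUCKETS)}

def owner_of(item_name: str) -> int:
    return bucket_to_node[data_to_bucket(item_name)]

# Reassigning one bucket to a newly added node touches a single record; the
# data-to-bucket mapping and the data itself are left unchanged.
new_node_id = 200
bucket_to_node[7] = new_node_id
```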

The use of buckets significantly reduces the size of the ownership map relative to a map in which a separate mapping record is stored for every data item. Further, in embodiments where the number of buckets exceeds the number of nodes, the use of buckets makes it relatively easy to reassign ownership of a subset of the data owned by a given node. For example, a new node may be assigned a single bucket from a node that is currently assigned ten buckets. Such a reassignment merely involves modifying, for that bucket, the record that indicates the bucket-to-node mapping. The data-to-bucket mapping of the reassigned data need not be changed.

As mentioned above, the data-to-bucket mapping may be established using any one of a variety of techniques, including but not limited to hash partitioning, range partitioning, or column values. If range-based partitioning is used and the number of ranges is not significantly greater than the number of nodes, then the database server may employ finer (narrower) ranges to achieve the desired number of buckets, as long as the range key used to partition the data items is a value that does not change (for example, a date). If the range key is a value that can change, then, in response to a change in the range-key value of a particular data item, the data item is removed from its old bucket and added to the bucket that corresponds to the new value of the data item's range key.

Tree-Based Partitioning

Another way to partition the data items managed by the database system into subsets is to use a hierarchical storage structure (for example, a B-tree), such that the upper levels of the tree (for example, the root) are owned by all of the nodes, and the lower levels (for example, the leaf nodes) are partitioned among the nodes. According to one embodiment, the tree structure includes a plurality of subtrees, where each subtree is assigned to a particular node. In addition, each lower-level tree node corresponds to a set of data. The set of data associated with a lower-level tree node is owned by the node associated with the subtree that includes that tree node.
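A minimal sketch of this idea, under assumed structures (a shared top level that maps key ranges to owner nodes, with subtree ownership changed by rewriting that top level only), might look like the following; it is not the patent's actual tree layout.

```python
# Hypothetical sketch of tree-based partitioning: the top level of the tree is
# visible to all nodes, and each lower-level subtree is assigned to one owner.

class SubTree:
    def __init__(self, key_range, owner):
        self.key_range = key_range   # (low, high): low inclusive, high exclusive
        self.owner = owner           # node that owns all data under this subtree

class SharedTopLevel:
    def __init__(self, subtrees):
        self.subtrees = subtrees     # conceptually replicated on every node

    def owner_of(self, key):
        for st in self.subtrees:
            low, high = st.key_range
            if low <= key < high:
                return st.owner
        raise KeyError(key)

    def reassign_subtree(self, index, new_owner):
        # Only the shared top level is rewritten; data under the subtree stays put.
        self.subtrees[index].owner = new_owner

top = SharedTopLevel([SubTree((0, 100), owner=102), SubTree((100, 200), owner=104)])
assert top.owner_of(150) == 104
top.reassign_subtree(1, 106)     # invalidation/broadcast of the top level omitted
assert top.owner_of(150) == 106
```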

In such an embodiment, when the ownership of a subtree changes, the upper levels are invalidated through a locking/broadcast scheme. The pointers in the lower levels are modified to move ownership of the subtrees under different nodes.

Handling Dirty Versions During Reassignment

As explained above, when the new owner of data has access to the disk on which the data resides, ownership of the data is changed by reassigning buckets to nodes, without physically moving the data on disk. However, the old owner may have "dirty" versions of one or more of the reassigned data items in its volatile memory. A dirty version of a data item is a version that includes changes that are not reflected in the version that currently resides on disk.

According to one embodiment, dirty versions of data items are written to the shared disk as part of the ownership transfer operation. Consequently, when the new owner reads from disk the data items over which it has newly acquired ownership, the versions of the items read by the new owner reflect the most recent changes made by the previous owner.

Alternatively, to avoid the overhead associated with writing dirty versions of data items to disk, the dirty versions may be purged from the old owner's volatile memory without first being written to the shared disk, as long as the redo is forced and not over-written. Specifically, when the owning node makes a change to a data item, a "redo" record that reflects the change is generated. As long as the redo records for the changes are forced to disk at or before the time the ownership of the data item changes, the old owner can purge the dirty versions of the data item without first saving them to disk. In that case, the new owner can reconstruct the most recent version of the data item by (1) reading the redo records to determine what changes must be made to the disk version of the data item, and (2) making the indicated changes to the disk version of the data item.
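A hypothetical sketch of that reconstruction is shown below, assuming redo records carry a monotonically increasing change number; the record layout and field names are assumptions for illustration, not the patent's format. The same idea would extend to the merged-log recovery described further below, where redo records from multiple former owners would first be ordered by their change number before being applied.

```python
# Hypothetical sketch: the new owner rebuilds the current version of a data
# item by applying, in order, the forced redo records that are newer than the
# on-disk version.

def reconstruct(disk_version, redo_records):
    """disk_version: {'change_no': int, 'value': ...}
    redo_records: iterable of {'change_no': int, 'new_value': ...}"""
    current = dict(disk_version)
    for rec in sorted(redo_records, key=lambda r: r["change_no"]):
        if rec["change_no"] > current["change_no"]:
            current["value"] = rec["new_value"]
            current["change_no"] = rec["change_no"]
    return current

disk = {"change_no": 10, "value": "balance=100"}
redo = [{"change_no": 11, "new_value": "balance=90"},
        {"change_no": 12, "new_value": "balance=75"}]
print(reconstruct(disk, redo))   # -> {'change_no': 12, 'value': 'balance=75'}
```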

As yet another alternative, the dirty data items may be transferred to the cache of the new owner, either proactively (at the initiative of the old owner) or on demand (in response to requests from the new owner), as transactions request the data at the new owner node.

If the dirty versions of a data item were not flushed to disk before the ownership change, then changes to the data item may be reflected in multiple recovery logs. For example, assume that a first node makes a change to a data item, and that ownership of the data item is then transferred to a second node. The first node may flush its redo log to disk, but transfer the dirty version of the data item directly to the second node without first storing it to disk. The second node may then make a second change to the data item. Assume that the second node flushes the redo record for the second change to disk, and that the second node then fails before storing the dirty version of the data item to disk. Under these circumstances, the changes that must be reapplied to the disk version of the data item are reflected both in the redo log of the first node and in the redo log of the second node. According to one embodiment, online recovery is performed by merging the redo logs to recover the data item.

According to one embodiment, ownership of a data item can be transferred without waiting for the transactions that are modifying the data item to commit. Consequently, the changes made to a data item by a single transaction may be spread across multiple redo logs. Under these circumstances, the transaction rollback mechanism of the database server is configured to undo changes across multiple logs, where the undo operations are performed on a data block in the reverse of the order in which the changes were made to the block. In addition, a media recovery mechanism is provided that merges the redo logs of all the owners, where the merged stream includes the redo records for all changes made since the data was backed up.

Reassignment Without Blocking Updates

According to one aspect of the invention, the reassignment of ownership of data items is performed without blocking updates to the data that is being reassigned. According to one embodiment, the assignment of ownership may be performed without blocking updates by having the database server wait for all transactions that have accessed any data item belonging to the reassigned bucket to commit, and for all dirty data items belonging to that bucket to be flushed to disk. Under these circumstances, if the old owner did not have access to the data in exclusive or shared mode, then the data belonging to the reassigned bucket can be updated immediately (without waiting for dirty versions to be flushed to disk). If the old owner did have access to the data in exclusive mode, then the old owner may have dirty versions of the data in its cache, so updates are delayed until the old owner writes the dirty pages (or the redo for the relevant changes) to the shared disk.

Before allowing new updates to a data item whose ownership has just been transferred, the database server may be configured to wait for in-progress updates that have already been requested of the old owner to complete. Alternatively, the database server may be configured to abort the in-progress operation and reissue the transaction to the new owner. According to one embodiment, the decision of whether to wait for a given in-progress operation to complete is made based on a variety of factors. Such factors may include, for example, how much work has already been done for the operation.

In some cases, waiting for the already-requested updates of the old owner to complete may cause false deadlocks. For example, assume that row A is in bucket 1, and that rows B and C are in bucket 2. Assume that transaction T1 updates row A, and that another transaction T2 has updated row B. Assume that, at this point, the ownership of bucket 2 is remapped to a new node. If T1 now wants to update row C, T1 will wait for the remapping to complete. Consequently, T1 will wait for T2. If T2 then wants to update row A, there is a deadlock between T1 and T2.

According to one embodiment, even when ownership of several buckets is to be transferred, the transfer is performed only one bucket at a time, to minimize the amount of time that transactions must wait to access data in the reassigned buckets.

Techniques for Transferring Ownership

According to various embodiments of the invention, the following examples illustrate techniques for transferring ownership of data within a shared-nothing database that is implemented on a shared-disk system. In the following examples, it is assumed that the ownership of the data is changed while transactions that have modified the data are still active. That is, the database system does not wait for in-progress transactions that have accessed the to-be-reassigned data to quiesce.

One technique for transferring ownership is described hereafter with reference to an example in which ownership of a subset of an object ("bucket B") is changed from node X to node Y. According to one embodiment, the database system begins by marking bucket B as being "in transition" from node X to node Y. The change to the ownership map is then broadcast to all nodes, or invalidated through a global lock.

According to one embodiment, in response to the change to the ownership map, query execution plans that involve the data in bucket B are regenerated. Alternatively, in response to the change to the ownership map, the cached copies of the map are invalidated or reloaded.

After the reassignment, any new subquery/DML (data manipulation language) fragments that access data in bucket B are shipped to node Y. Optionally, the SQL fragments currently running on X may be rolled back before the bucket is marked as being in transition from X to Y. After the reassignment, those fragments may then be reissued to node Y. It should be noted that the transactions to which those fragments belong are not themselves rolled back; only the current calls are rolled back and resent to the new owner. In particular, changes made by node X to data in bucket B during previous calls are not affected.

According to one embodiment, node X can detect that there are no in-progress calls accessing data in bucket B through a simple local locking scheme. The locking scheme may involve, for example, having every process that accesses data in a bucket obtain a shared lock/latch on that bucket. When the bucket is to be reassigned, the process that performs the reassignment waits until it can obtain an exclusive lock/latch on the bucket. By obtaining the exclusive lock/latch on the bucket, the reassignment process ensures that no other process is currently accessing the bucket.
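An illustrative per-bucket latch of this kind is sketched below using Python threading primitives; this only shows the shape of the scheme (accessors take the latch in shared mode, the reassignment procedure takes it in exclusive mode and therefore waits until no call is in progress) and is not the database's actual latch implementation.

```python
# Illustrative per-bucket shared/exclusive latch.
import threading

class BucketLatch:
    def __init__(self):
        self._lock = threading.Lock()
        self._cond = threading.Condition(self._lock)
        self._readers = 0

    def acquire_shared(self):
        with self._lock:
            self._readers += 1

    def release_shared(self):
        with self._lock:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_exclusive(self):
        self._lock.acquire()
        while self._readers > 0:       # wait until all in-progress calls finish
            self._cond.wait()

    def release_exclusive(self):
        self._lock.release()

latch = BucketLatch()

def access_bucket(work):
    latch.acquire_shared()
    try:
        return work()
    finally:
        latch.release_shared()

def reassign_bucket(update_ownership_map):
    latch.acquire_exclusive()          # blocks while any accessor holds the latch
    try:
        return update_ownership_map()
    finally:
        latch.release_exclusive()
```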

According to one embodiment, node X does not wait for all calls to complete successfully before marking the bucket as in transition, because of potential deadlocks. The following is an example of how such a deadlock could occur. Consider three rows 1, 2, and 3 in the bucket that is to be remapped.

The following sequence of events can lead to a deadlock:

(a) T1 updates row 2.

(b) T2 performs a multi-row update of row 1 followed by row 2.

T2 now waits for T1.

(c) It is decided that the bucket is to be remapped.

(d) T1 wants to update row 3. T1 now waits for T2.

According to one embodiment, forcibly aborting the in-progress calls at node X is avoided by allowing the in-progress calls to continue executing normally as long as the data in bucket B that they access is in the cache. In other words, X cannot read blocks of bucket B from disk without inter-node locking. If there is a cache miss while the bucket is in transition, X must send a message to Y and either retrieve the block from Y or read the block from disk. While the bucket is in transition, a cache coherency protocol is used between X and Y.
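One way to picture node X's read path while bucket B is in transition is sketched below; the helper names and the three-way bucket state are assumptions made for illustration, not the patent's actual control flow.

```python
# Hypothetical sketch of node X's read path while a bucket is in transition:
# cached blocks are used directly, but a cache miss requires coordination with
# the new owner Y before a disk read.

def read_block(block_id, bucket_state, local_cache, ask_new_owner, read_from_disk):
    if block_id in local_cache:
        return local_cache[block_id]        # in-progress calls keep running
    if bucket_state == "owned":
        return read_from_disk(block_id)     # no inter-node coordination needed
    if bucket_state == "in_transition":
        block = ask_new_owner(block_id)     # Y may hold, or grant, the current version
        if block is not None:
            return block
        return read_from_disk(block_id)     # read permitted after checking with Y
    raise PermissionError("bucket owned elsewhere; ship the operation instead")

# Example: a miss while in transition first asks Y, then falls back to disk.
print(read_block("blk9", "in_transition", {}, lambda b: None, lambda b: f"<{b} from disk>"))
```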

Handling Requests at the New Owner

After bucket B has been reassigned to node Y, any new SQL fragments that need to access data in bucket B begin executing on node Y. The technique used by node Y when reading newly transferred data items from disk may vary based on what the previous owner, node X, did before the bucket was marked as in transition. The following cases are examples that illustrate the different situations that may be handled by the new owner, node Y.

Case A: Assume that node X aborts all in-progress calls and writes to the shared disk all dirty blocks that map to the bucket. For efficiency, each node may link its dirty data items to a per-bucket object queue. Under these circumstances, node Y can read directly from disk. No cache coherency is needed. Node Y is immediately marked as the new owner of the bucket.

Case B: Assume that node X aborts all in-progress calls, but the dirty data items are not written out. Under these circumstances, node Y will need to retrieve a block from node X, or verify that node X does not have a dirty copy, before reading that block from disk. If X does have a dirty copy, the past image is retained in node X for recovery purposes and to ensure that the checkpoint does not advance past changes made in node X that have not yet been reflected to disk by a block write in node Y. After all of the dirty data items in X have been written (either by X itself or as past images (PIs) that are purged by writes performed in Y), the state of the bucket changes from "in transition" to "owned by Y". Y can then access the disk blocks in that bucket without checking with X.

If node X fails while the bucket state is in transition, then, if node Y does not have the current copies of the data items, the recovering node (node Y, if it survives) will need to apply the redo generated for that bucket in node X (and possibly the redo generated in node Y, if node Y also fails).

Case C: Assume that node X aborts the in-progress calls and purges the dirty data items. Under these circumstances, node Y can read the blocks directly from disk without cache coherency. However, if there is unapplied redo in X, the blocks may need to be brought up to date. The bucket is considered in transition until all of the redo generated in X has been applied by Y and written to disk. This is needed to prevent X from advancing its checkpoint past redo that has not yet been reflected on disk.

If node X fails while the bucket state is in transition, then, if node Y does not have the current copies of the data items, the recovering node (node Y, if it survives) will need to apply the redo generated for that bucket in node X (and possibly the redo generated in node Y, if node Y also fails).

Case D: Assume that node X continues to execute the in-progress calls. Node X and node Y will need to check with each other before accessing blocks from disk. When all in-progress calls have completed and all of the blocks have been written out or transferred to Y, Y is marked as the new owner. From that point on, no cache coherency is needed.

If node X or node Y fails while the bucket state is in transition, recovery must be performed. The recovery techniques used in this context may be similar to those described in U.S. Patent No. 6,353,836 and U.S. Patent Application No. 10/092,047, filed March 4, 2002, each of which is incorporated herein in its entirety. If both nodes fail, the redo logs from X and Y need to be merged.

As described in Case D, allowing node X to continue to execute the in-progress calls may provide various benefits. In particular, allowing node X to continue to execute the in-progress calls lets ownership be reassigned with minimal disruption to ongoing transactions. However, it requires a cache coherency and recovery scheme to be in effect between node X and node Y for bucket B.

One method for providing cache coherency under these circumstances involves having node X acquire locks on all of the data items that it currently has cached for bucket B. The master/directory node for all of the data items in B can be assigned to be node Y. A notification is then sent to all nodes that bucket B will be moved from X to Y. The notification invalidates/updates the ownership map, so that further accesses to B are sent to Y.

If a failure occurs before this point, the ownership reassignment operation is aborted. Otherwise, the map is updated to indicate that Y is the new owner and that cache coherency is in effect.

X and Y then execute a cache coherency protocol, such as the protocols described in U.S. Patent No. 6,353,836 and U.S. Patent Application No. 10/092,047, for all of the data items in bucket B. When there are no more dirty data items in B, the cache coherency protocol can be stopped. Y can then release any locks it may have acquired for the data items in B.
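The handoff described in the preceding paragraphs can be outlined as an ordered sequence of steps. In the sketch below, the data structures and step bodies are stand-ins chosen for illustration; they are not the patent's actual protocol code.

```python
# Hypothetical outline of the bucket handoff that starts (and later stops) cache
# coherency when ownership of a bucket moves from node X to node Y.

def transfer_bucket(bucket, x_cache, lock_master, ownership_map, node_x, node_y):
    """x_cache: dict block_id -> {'dirty': bool, 'data': ...} for blocks of the
    bucket that node X currently has cached.  lock_master and ownership_map are
    dicts standing in for cluster-wide shared state."""
    # 1. X takes locks on every data item of the bucket that it has cached.
    locked_items = set(x_cache)

    # 2. Y is made the master/directory node for all data items in the bucket.
    lock_master[bucket] = node_y

    # 3. All nodes are notified that the bucket is moving; cached portions of
    #    the ownership map are invalidated/updated so new accesses go to Y.
    ownership_map[bucket] = node_y

    # 4. X and Y run a cache coherency protocol until X holds no dirty items
    #    for the bucket; here, dirty blocks are simply shipped to Y.
    shipped_to_y = {bid: blk["data"] for bid, blk in x_cache.items() if blk["dirty"]}
    for bid in shipped_to_y:
        x_cache[bid]["dirty"] = False

    # 5. With no dirty items left, the coherency protocol stops and the locks
    #    taken for items in the bucket are released.
    locked_items.clear()
    return shipped_to_y

# Example: bucket 7 moves from node 102 (with one dirty block) to node 104.
cache_x = {"blk1": {"dirty": True, "data": "v2"}, "blk2": {"dirty": False, "data": "v1"}}
print(transfer_bucket(7, cache_x, lock_master={}, ownership_map={7: 102},
                      node_x=102, node_y=104))
```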

According to one embodiment, the cache coherency protocol is always in effect, such that the node that owns a bucket is also the master of the corresponding locks. In most cases, each node will allocate only local locks (because it is the master), and cache coherency messages are needed only when ownership changes. When ownership changes, the locks that are open in that bucket can be dynamically remastered to the new owner.

Modifications Prior to the Ownership Change

According to an embodiment, subtransactions that modified data in node X before the ownership change are still considered active, because a transaction rollback will need those changes. If a transaction rolls back and it has made changes to the bucket in node X, node Y will need to apply the undo by reading the undo logs from X's portion of the shared-disk log.

A subtransaction that modified data in node X before the ownership change may update the same data in node Y. This means that requests for transaction locks, such as row locks or page locks, in node Y need to be coordinated with node X. If a subtransaction requests a lock and the lock is already held by its sibling transaction in node X, the lock request is granted. However, if the lock is held by an unrelated transaction, the lock request blocks. The wait-for graph reflects this wait as a wait on the parent transaction, so that global deadlocks can be detected. Once the ownership change is complete, and all transactions that had acquired locks to access data in bucket B in node X have ended (committed or aborted), only local locks are needed for the transaction locks.

By having the database server wait for the transactions to complete before starting the ownership change, requests for transaction locks can always be coordinated locally.

Hardware Overview

Figure 2 is a block diagram that illustrates a computer system 200 upon which an embodiment of the invention may be implemented. Computer system 200 includes a bus 202 or other communication mechanism for communicating information, and a processor 204 coupled with bus 202 for processing information. Computer system 200 also includes a main memory 206, such as a random access memory (RAM) or other dynamic storage device, coupled to bus 202 for storing information and instructions to be executed by processor 204. Main memory 206 may also be used for storing temporary variables or other intermediate information during execution of the instructions to be executed by processor 204. Computer system 200 further includes a read-only memory (ROM) 208 or other static storage device coupled to bus 202 for storing static information and instructions for processor 204. A storage device 210, such as a magnetic disk or optical disk, is provided and coupled to bus 202 for storing information and instructions.

Computer system 200 may be coupled via bus 202 to a display 212, such as a cathode ray tube (CRT), for displaying information to a computer user. An input device 214, including alphanumeric and other keys, is coupled to bus 202 for communicating information and command selections to processor 204. Another type of user input device is cursor control 216, such as a mouse, a trackball, or cursor direction keys, for communicating direction information and command selections to processor 204 and for controlling cursor movement on display 212. This input device typically has two degrees of freedom in two axes, a first axis (e.g., the x-axis) and a second axis (e.g., the y-axis), which allow the device to specify positions in a plane.

The invention is related to the use of computer system 200 for implementing the techniques described herein. According to one embodiment of the invention, those techniques are implemented by computer system 200 in response to processor 204 executing one or more sequences of one or more instructions contained in main memory 206. Such instructions may be read into main memory 206 from another computer-readable medium, such as storage device 210. Execution of the sequences of instructions contained in main memory 206 causes processor 204 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions to implement the invention. Thus, embodiments of the invention are not limited to any specific combination of hardware circuitry and software.

The term "computer-readable medium" as used herein refers to any medium that participates in providing instructions to processor 204 for execution. Such a medium may take many forms, including but not limited to non-volatile media, volatile media, and transmission media. Non-volatile media include, for example, optical or magnetic disks, such as storage device 210. Volatile media include dynamic memory, such as main memory 206. Transmission media include coaxial cables, copper wire, and fiber optics, including the wires that comprise bus 202. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infrared data communications.

Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, a hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave as described hereinafter, or any other medium from which a computer can read.

Various forms of computer-readable media may be involved in carrying one or more sequences of one or more instructions to processor 204 for execution. For example, the instructions may initially be carried on a magnetic disk of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem. A modem local to computer system 200 can receive the data on the telephone line and use an infrared transmitter to convert the data to an infrared signal. An infrared detector can receive the data carried in the infrared signal, and appropriate circuitry can place the data on bus 202. Bus 202 carries the data to main memory 206, from which processor 204 retrieves and executes the instructions. The instructions received by main memory 206 may optionally be stored on storage device 210 either before or after execution by processor 204.

Computer system 200 also includes a communication interface 218 coupled to bus 202. Communication interface 218 provides a two-way data communication coupling to a network link 220 that is connected to a local network 222. For example, communication interface 218 may be an integrated services digital network (ISDN) card or a modem to provide a data communication connection to a corresponding type of telephone line. As another example, communication interface 218 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN. Wireless links may also be implemented. In any such implementation, communication interface 218 sends and receives electrical, electromagnetic, or optical signals that carry digital data streams representing various types of information.

Network link 220 typically provides data communication through one or more networks to other data devices. For example, network link 220 may provide a connection through local network 222 to a host computer 224 or to data equipment operated by an Internet Service Provider (ISP) 226. ISP 226 in turn provides data communication services through the worldwide packet data communication network now commonly referred to as the "Internet" 228. Local network 222 and Internet 228 both use electrical, electromagnetic, or optical signals that carry digital data streams. The signals through the various networks, the signals on network link 220, and the signals through communication interface 218, which carry the digital data to and from computer system 200, are exemplary forms of carrier waves transporting the information.

Computer system 200 can send messages and receive data, including program code, through the network(s), network link 220, and communication interface 218. In the Internet example, a server 230 might transmit requested code for an application program through Internet 228, ISP 226, local network 222, and communication interface 218.

The received code may be executed by processor 204 as it is received, and/or stored in storage device 210 or other non-volatile storage for later execution. In this manner, computer system 200 may obtain application code in the form of a carrier wave.

The foregoing describes only preferred embodiments of the invention and is not intended to limit it; those skilled in the art will appreciate that various modifications and variations are possible. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (24)

1. A method for managing data, the method comprising the steps of:
maintaining a plurality of persistent data items on persistent storage accessible by a plurality of nodes of a multi-node system, the persistent data items including a particular data item stored at a particular location on the persistent storage;
assigning exclusive ownership of each of the persistent data items to one of the nodes, wherein a particular node of the plurality of nodes is assigned exclusive ownership of the particular data item;
whenever any node wants an operation performed that involves the particular data item, because the particular data item resides at the particular location, the node that wants the operation performed ships the operation to the particular node, for the particular node to perform the operation on the particular data item;
while the particular node continues to operate, reassigning ownership of the particular data item from the particular node to another node without moving the particular data item from the particular location on the persistent storage;
after the reassignment, whenever any node wants an operation performed that involves the particular data item, because the particular data item resides at the particular location, the node that wants the operation performed ships the operation to the other node, for the other node to perform the operation on the particular data item;
after a first node has been removed from the multi-node system, continuing to have a set of data items owned by the first node; and
in response to detecting a request for an operation that involves one of those data items, reassigning ownership of the data item from the first node to one or more other nodes.

2. A method for managing data, the method comprising the steps of:
maintaining a plurality of persistent data items on persistent storage accessible by a plurality of nodes, the persistent data items including a particular data item stored at a particular location on the persistent storage;
assigning exclusive ownership of each of the persistent data items to one of the nodes, wherein a particular node of the plurality of nodes is assigned exclusive ownership of the particular data item;
whenever any node wants an operation performed that involves the particular data item, because the particular data item resides at the particular location, the node that wants the operation performed ships the operation to the particular node, for the particular node to perform the operation on the particular data item;
while the particular node continues to operate, reassigning ownership of the particular data item from the particular node to another node without moving the particular data item from the particular location on the persistent storage, wherein the particular node holds a dirty version of the particular data item in volatile memory at the time the particular data item is to be reassigned to the other node, and wherein the step of reassigning ownership includes forcing to persistent storage one or more redo records associated with the dirty version; and
after the reassignment, whenever any node wants an operation performed that involves the particular data item, because the particular data item resides at the particular location, the node that wants the operation performed ships the operation to the other node, for the other node to perform the operation on the particular data item.

3. A method for managing data, the method comprising the steps of:
maintaining a plurality of persistent data items on persistent storage accessible by a plurality of nodes;
assigning ownership of each of the persistent data items to one of the nodes by:
assigning each data item to one of a plurality of buckets; and
assigning each bucket to one of the plurality of nodes;
wherein the node to which a bucket is assigned is established as the owner of all data items assigned to that bucket;
wherein the step of assigning each data item to one of the plurality of buckets is performed by (a) applying a hash function to a name associated with each data item or (b) using range-based partitioning; and
when a first node wants an operation performed that involves a data item owned by a second node, the first node ships the operation to the second node, for the second node to perform the operation.

4. The method of claim 3, wherein the step of assigning each bucket to one of the plurality of nodes is performed by applying a hash function to an identifier associated with each bucket.

5. The method of claim 3, wherein the step of assigning each bucket to one of the plurality of nodes is performed using range-based partitioning.

6. A method for managing data, the method comprising the steps of:
maintaining a plurality of persistent data items on persistent storage accessible by a plurality of nodes, wherein the persistent data items are rows assigned to one or more buckets, and wherein the persistent data items include a first persistent data item and a second persistent data item;
assigning ownership of each of the persistent data items to one of the nodes by:
assigning each data item to one of a plurality of buckets, wherein the first persistent data item is assigned to a first bucket, the second persistent data item is assigned to a second bucket, and the first bucket is different from the second bucket; and
assigning each bucket to one of the plurality of nodes;
wherein the node to which a bucket is assigned is established as the owner of all data items assigned to that bucket;
wherein the step of assigning each data item to one of the plurality of buckets is performed by enumerating individual data-item-to-bucket relationships; and
when a first node wants an operation performed that involves a data item owned by a second node, the first node ships the operation to the second node, for the second node to perform the operation.

7. The method of claim 3, wherein the step of assigning each bucket to one of the plurality of nodes is performed by enumerating individual bucket-to-node relationships.

8. The method of claim 3, wherein the number of buckets is greater than the number of nodes, and the bucket-to-node relationship is a many-to-one relationship.

9. The method of claim 3, further comprising the step of reassigning ownership of all data items that map to a bucket from a first node to a second node by modifying the bucket-to-node mapping without modifying the data-item-to-bucket mapping.

10. The method of claim 3, wherein the data item is assigned to a first bucket and the first bucket is assigned to the first node, the method further comprising the step of assigning the first bucket to a different node.

11. The method of claim 2, further comprising purging the dirty version of the particular data item from the volatile memory without writing the dirty version to the persistent storage.

12. The method of claim 11, wherein the other node reconstructs the dirty version by applying the one or more redo records to the version of the particular data item that resides on the persistent storage.

13. The method of claim 2, wherein the other node reconstructs the dirty version by applying the one or more redo records to the version of the particular data item that resides on the persistent storage.

14. The method of claim 2, further comprising transferring the dirty version of the particular data item from volatile memory associated with the particular node to volatile memory associated with the other node.

15. The method of claim 14, wherein the step of transferring the dirty version is performed proactively by the particular node without the other node requesting the dirty version.

16. The method of claim 14, wherein the step of transferring the dirty version is performed by the particular node in response to a request for the dirty version from the other node.

17. The method of claim 2, wherein the step of reassigning ownership of the particular data item from the particular node to the other node is performed as part of a gradual transfer of ownership of data items from the particular node to one or more nodes of a group of nodes, wherein the group of nodes does not include the particular node.

18. The method of claim 17, wherein the gradual transfer is initiated in response to detecting that the other node is processing fewer work requests relative to one or more nodes of a multi-node database system.

19. The method of claim 18, wherein the gradual transfer is terminated in response to detecting that the other node is no longer processing fewer work requests relative to one or more nodes of the multi-node database system.

20. The method of claim 2, wherein the step of reassigning ownership of the particular data item is performed without waiting for a transaction that is modifying the particular data item to commit.

21. The method of claim 20, wherein the transaction modifies the particular data item at a first time, while the particular data item is owned by the particular node, and modifies the particular data item at a second time, while the particular data item is owned by the other node.

22. The method of claim 21, further comprising rolling back the changes made by the transaction by undoing the changes made by the transaction at the second time based on undo records in an undo log associated with the other node, and undoing the changes made by the transaction at the first time based on undo records in an undo log associated with the particular node.

23. The method of claim 2, further comprising:
the other node receiving a request to update the particular data item;
in response to receiving the request, determining whether the particular node has accessed the particular data item in exclusive mode or shared mode; and
if the particular node has not accessed the particular data item in exclusive mode or shared mode, the other node updating the particular data item without waiting for the particular node to flush to persistent storage any dirty version of the particular data item or any redo for a dirty version.

24. The method of claim 23, further comprising the steps of:
in response to transferring ownership of the particular data item to the other node, aborting in-progress operations that involve the particular data item; and
automatically re-executing the in-progress operations after ownership of the particular data item has been transferred to the other node.
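For illustration only, the following sketch models the two-level ownership mapping recited in claims 3 and 9: data items are assigned to buckets by hashing a name associated with each item, buckets are assigned to nodes, and ownership of every item in a bucket is reassigned by updating only the bucket-to-node map while the items themselves stay at their locations on persistent storage. The class name, the bucket count, and the example row name are assumptions, not part of the claimed method.

```python
import hashlib


class OwnershipMap:
    def __init__(self, nodes, num_buckets=64):
        self.num_buckets = num_buckets
        # Initial bucket-to-node assignment, e.g. round-robin over the nodes.
        self.bucket_to_node = {b: nodes[b % len(nodes)] for b in range(num_buckets)}

    def bucket_of(self, item_name):
        # Data-item-to-bucket mapping: a hash of the item's name (claim 3, option (a)).
        digest = hashlib.sha1(item_name.encode()).hexdigest()
        return int(digest, 16) % self.num_buckets

    def owner_of(self, item_name):
        return self.bucket_to_node[self.bucket_of(item_name)]

    def reassign_bucket(self, bucket, new_node):
        # Claim 9: ownership of every data item in the bucket moves to the new
        # node by updating only the bucket-to-node map; the data-item-to-bucket
        # mapping and the items' locations on persistent storage are unchanged.
        self.bucket_to_node[bucket] = new_node


# Usage: ship an operation to the current owner, then reassign the bucket.
ownership = OwnershipMap(nodes=["node1", "node2", "node3"])
owner = ownership.owner_of("employees/row/4711")          # hypothetical row name
ownership.reassign_bucket(ownership.bucket_of("employees/row/4711"), "node2")
assert ownership.owner_of("employees/row/4711") == "node2"
```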
CNB200480021585XA 2003-08-01 2004-07-28 Be used for method of managing data Expired - Lifetime CN100565460C (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US49201903P 2003-08-01 2003-08-01
US60/492,019 2003-08-01
US10/665,062 2003-09-17

Publications (2)

Publication Number Publication Date
CN1829961A CN1829961A (en) 2006-09-06
CN100565460C true CN100565460C (en) 2009-12-02

Family

ID=36947551

Family Applications (4)

Application Number Title Priority Date Filing Date
CN2004800217520A Expired - Lifetime CN1829974B (en) 2003-08-01 2004-07-28 Parallel recovery by non-failed nodes
CNB2004800219070A Expired - Lifetime CN100449539C (en) 2003-08-01 2004-07-28 Ownership reassignment in a shared-nothing database system
CNB200480021585XA Expired - Lifetime CN100565460C (en) 2003-08-01 2004-07-28 Be used for method of managing data
CNB2004800215879A Expired - Lifetime CN100429622C (en) 2003-08-01 2004-07-28 Dynamic reassignment of data ownership

Family Applications Before (2)

Application Number Title Priority Date Filing Date
CN2004800217520A Expired - Lifetime CN1829974B (en) 2003-08-01 2004-07-28 Parallel recovery by non-failed nodes
CNB2004800219070A Expired - Lifetime CN100449539C (en) 2003-08-01 2004-07-28 Ownership reassignment in a shared-nothing database system

Family Applications After (1)

Application Number Title Priority Date Filing Date
CNB2004800215879A Expired - Lifetime CN100429622C (en) 2003-08-01 2004-07-28 Dynamic reassignment of data ownership

Country Status (1)

Country Link
CN (4) CN1829974B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7979626B2 (en) * 2008-05-13 2011-07-12 Microsoft Corporation Flash recovery employing transaction log
US8375047B2 (en) * 2010-03-31 2013-02-12 Emc Corporation Apparatus and method for query prioritization in a shared nothing distributed database
CN102521307A (en) * 2011-12-01 2012-06-27 北京人大金仓信息技术股份有限公司 Parallel query processing method for share-nothing database cluster in cloud computing environment
US8799569B2 (en) * 2012-04-17 2014-08-05 International Business Machines Corporation Multiple enhanced catalog sharing (ECS) cache structure for sharing catalogs in a multiprocessor system
CN102968503B (en) * 2012-12-10 2015-10-07 曙光信息产业(北京)有限公司 The data processing method of Database Systems and Database Systems
US9367472B2 (en) * 2013-06-10 2016-06-14 Oracle International Corporation Observation of data in persistent memory
CN103399894A (en) * 2013-07-23 2013-11-20 中国科学院信息工程研究所 Distributed transaction processing method on basis of shared storage pool
US20150293708A1 (en) * 2014-04-11 2015-10-15 Netapp, Inc. Connectivity-Aware Storage Controller Load Balancing
CN107766001B (en) * 2017-10-18 2021-05-25 成都索贝数码科技股份有限公司 A storage quota method based on user groups
CN108924184B (en) * 2018-05-31 2022-02-25 创新先进技术有限公司 Data processing method and server
CN110895483A (en) * 2018-09-12 2020-03-20 北京奇虎科技有限公司 Task recovery method and device
US11100086B2 (en) * 2018-09-25 2021-08-24 Wandisco, Inc. Methods, devices and systems for real-time checking of data consistency in a distributed heterogenous storage system
US11874816B2 (en) * 2018-10-23 2024-01-16 Microsoft Technology Licensing, Llc Lock free distributed transaction coordinator for in-memory database participants
CN110134735A (en) * 2019-04-10 2019-08-16 阿里巴巴集团控股有限公司 The storage method and device of distributed transaction log
CN112650561B (en) * 2019-10-11 2023-04-11 金篆信科有限责任公司 Transaction management method, system, network device and readable storage medium

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
IL99923A0 (en) * 1991-10-31 1992-08-18 Ibm Israel Method of operating a computer in a network
US5625811A (en) * 1994-10-31 1997-04-29 International Business Machines Corporation Method and system for database load balancing
US5696898A (en) * 1995-06-06 1997-12-09 Lucent Technologies Inc. System and method for database access control
CA2176775C (en) * 1995-06-06 1999-08-03 Brenda Sue Baker System and method for database access administration
US5903898A (en) * 1996-06-04 1999-05-11 Oracle Corporation Method and apparatus for user selectable logging
US5907849A (en) * 1997-05-29 1999-05-25 International Business Machines Corporation Method and system for recovery in a partitioned shared nothing database system using virtual share disks
US6493726B1 (en) * 1998-12-29 2002-12-10 Oracle Corporation Performing 2-phase commit with delayed forget
CN100348009C (en) * 2000-02-04 2007-11-07 里逊.Com股份有限公司 System for disributed media network and meta data server
EP1399847B1 (en) * 2001-06-28 2013-01-16 Oracle International Corporation Partitioning ownership of a database among different database servers to control access to the database

Also Published As

Publication number Publication date
CN1829974B (en) 2010-06-23
CN100449539C (en) 2009-01-07
CN1829962A (en) 2006-09-06
CN1829961A (en) 2006-09-06
CN1829974A (en) 2006-09-06
CN1829988A (en) 2006-09-06
CN100429622C (en) 2008-10-29

Similar Documents

Publication Publication Date Title
JP4557975B2 (en) Reassign ownership in a non-shared database system
US7120651B2 (en) Maintaining a shared cache that has partitions allocated among multiple nodes and a data-to-partition mapping
CA2532048C (en) Parallel recovery by non-failed nodes
Levandoski et al. High performance transactions in deuteronomy
US7917596B2 (en) Super master
CN107844434B (en) Universal cache management system
US7403945B2 (en) Distributed database system providing data and space management methodology
CN100565460C (en) Be used for method of managing data
CN100517303C (en) Divide ownership of a database among different database servers to control access to the database
WO2005013157A2 (en) Dynamic reassignment of data ownership
JP2007188518A (en) Partitioning of ownership of database between different database servers for controlling access to database
CN101714152B (en) Method for dividing database ownership among different database servers to control access to databases
JPH0820996B2 (en) Data access system
AU2004262380B2 (en) Dynamic reassignment of data ownership

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CI02 Correction of invention patent application

Correction item: Priority

Correct: 2003.09.17 US 10/665,062

Incorrect: missing the second priority claim

Number: 36

Page: The title page

Volume: 22

COR Change of bibliographic data

Free format text: CORRECT: PRIORITY; FROM: MISSING THE SECOND ARTICLE OF PRIORITY TO: 2003.9.17 US 10/665,062

C14 Grant of patent or utility model
GR01 Patent grant
CX01 Expiry of patent term

Granted publication date: 20091202

CX01 Expiry of patent term