CN115114294A - Adaptive method, device and computer equipment for database storage mode - Google Patents
Adaptive method, device and computer equipment for database storage mode Download PDFInfo
- Publication number
- CN115114294A CN115114294A CN202210764392.3A CN202210764392A CN115114294A CN 115114294 A CN115114294 A CN 115114294A CN 202210764392 A CN202210764392 A CN 202210764392A CN 115114294 A CN115114294 A CN 115114294A
- Authority
- CN
- China
- Prior art keywords
- data
- storage
- storage mode
- column
- key
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2282—Tablespace storage structures; Management thereof
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
技术领域technical field
本申请涉及计算机技术领域,特别是涉及一种数据库存储模式的自适应方法、装置、计算机设备、存储介质和计算机程序产品。The present application relates to the field of computer technology, and in particular, to an adaptive method, apparatus, computer device, storage medium and computer program product for a database storage mode.
背景技术Background technique
随着计算机技术以及互联网技术的发展,数据成为企业的核心资产,对数据进行保护是保护企业资产的有效手段之一。在对数据进行存储时,通常采用行存或列存的存储模式,以使得这些存储模式可以在某种工作负载下,提升数据库在处理查询时的读写(I/O)性能。With the development of computer technology and Internet technology, data has become the core asset of an enterprise, and data protection is one of the effective means to protect enterprise assets. When storing data, row-based or column-based storage modes are usually used, so that these storage modes can improve the read-write (I/O) performance of the database when processing queries under certain workloads.
然而,目前的数据库存储模式的自适应方式中,由于存储模式受限于行存、列存,参数的调节大多依赖于硬件的配置以及查询的特征,无法对存储模式这样底层基础的配置进行调节,因而容易使得最终选择出的存储模式往往不是最佳的,可能会造成比较大的存储开销甚至空间浪费,导致数据库系统的整体性能较差。However, in the current adaptive mode of database storage mode, because the storage mode is limited by row storage and column storage, and parameter adjustment mostly depends on hardware configuration and query characteristics, it is impossible to adjust the underlying basic configuration such as storage mode. Therefore, it is easy to make the final selected storage mode often not optimal, which may cause relatively large storage overhead or even waste of space, resulting in poor overall performance of the database system.
发明内容SUMMARY OF THE INVENTION
基于此,有必要针对上述技术问题,提供一种数据库存储模式的自适应方法、装置、计算机设备、计算机可读存储介质和计算机程序产品,能够有效提升数据库系统的整体性能。Based on this, it is necessary to provide an adaptive method, apparatus, computer equipment, computer-readable storage medium and computer program product for the database storage mode, which can effectively improve the overall performance of the database system.
第一方面,本申请提供了一种数据库存储模式的自适应方法。所述方法包括:获取工作负载的运行参数和各键值对的读写操作性能参数;所述运行参数包括读写操作比例、查询语句和数据分布;根据所述读写操作比例和所述读写操作性能参数,确定所述键值对所占用空间的第一空间大小;根据所述查询语句的语义信息,确定存储区的分区方式;根据所述数据分布确定候选分组范围下的数据密度;基于所述分区方式、所述候选分组范围、所述数据密度和所述第一空间大小,确定在运行所述工作负载时数据库中数据表的存储模式。In a first aspect, the present application provides an adaptive method for a database storage mode. The method includes: obtaining operating parameters of the workload and read-write operation performance parameters of each key-value pair; the operating parameters include read-write operation ratio, query statement and data distribution; according to the read-write operation ratio and the read-write operation ratio Write operation performance parameters, determine the first space size of the space occupied by the key-value pair; determine the partition mode of the storage area according to the semantic information of the query statement; determine the data density in the candidate grouping range according to the data distribution; Based on the partition mode, the candidate grouping range, the data density, and the first space size, a storage mode of a data table in the database is determined when the workload is running.
第二方面,本申请还提供了一种数据库存储模式的自适应装置。所述装置包括:获取模块,用于获取工作负载的运行参数和各键值对的读写操作性能参数;所述运行参数包括读写操作比例、查询语句和数据分布;确定模块,用于根据所述读写操作比例和所述读写操作性能参数,确定所述键值对所占用空间的第一空间大小;根据所述查询语句的语义信息,确定存储区的分区方式;根据所述数据分布确定候选分组范围下的数据密度;基于所述分区方式、所述候选分组范围、所述数据密度和所述第一空间大小,确定在运行所述工作负载时数据库中数据表的存储模式。In a second aspect, the present application also provides an adaptive device for a database storage mode. The device includes: an acquisition module for acquiring operating parameters of the workload and read-write operation performance parameters of each key-value pair; the operating parameters include read-write operation ratio, query statement and data distribution; a determination module, used for according to The read-write operation ratio and the read-write operation performance parameter determine the first space size of the space occupied by the key-value pair; determine the partition mode of the storage area according to the semantic information of the query statement; The distribution determines the data density under the candidate grouping range; based on the partition mode, the candidate grouping range, the data density and the first space size, determines the storage mode of the data table in the database when the workload is running.
第三方面,本申请还提供了一种计算机设备。所述计算机设备包括存储器和处理器,所述存储器存储有计算机程序,所述处理器执行所述计算机程序时实现以下步骤:获取工作负载的运行参数和各键值对的读写操作性能参数;所述运行参数包括读写操作比例、查询语句和数据分布;根据所述读写操作比例和所述读写操作性能参数,确定所述键值对所占用空间的第一空间大小;根据所述查询语句的语义信息,确定存储区的分区方式;根据所述数据分布确定候选分组范围下的数据密度;基于所述分区方式、所述候选分组范围、所述数据密度和所述第一空间大小,确定在运行所述工作负载时数据库中数据表的存储模式。In a third aspect, the present application also provides a computer device. The computer device includes a memory and a processor, the memory stores a computer program, and the processor implements the following steps when executing the computer program: acquiring the operating parameters of the workload and the read-write operation performance parameters of each key-value pair; The operating parameters include read-write operation ratio, query statement and data distribution; according to the read-write operation ratio and the read-write operation performance parameter, determine the first space size of the space occupied by the key-value pair; according to the The semantic information of the query statement is used to determine the partition mode of the storage area; the data density in the candidate grouping range is determined according to the data distribution; based on the partition mode, the candidate grouping range, the data density and the first space size , which determines the storage mode of the data tables in the database when running the workload.
第四方面,本申请还提供了一种计算机可读存储介质。所述计算机可读存储介质,其上存储有计算机程序,所述计算机程序被处理器执行时实现以下步骤:获取工作负载的运行参数和各键值对的读写操作性能参数;所述运行参数包括读写操作比例、查询语句和数据分布;根据所述读写操作比例和所述读写操作性能参数,确定所述键值对所占用空间的第一空间大小;根据所述查询语句的语义信息,确定存储区的分区方式;根据所述数据分布确定候选分组范围下的数据密度;基于所述分区方式、所述候选分组范围、所述数据密度和所述第一空间大小,确定在运行所述工作负载时数据库中数据表的存储模式。In a fourth aspect, the present application also provides a computer-readable storage medium. The computer-readable storage medium has a computer program stored thereon, and when the computer program is executed by the processor, the following steps are implemented: obtaining the operating parameters of the workload and the read-write operation performance parameters of each key-value pair; the operating parameters Including read-write operation ratio, query statement and data distribution; according to the read-write operation ratio and the read-write operation performance parameter, determine the first space size of the space occupied by the key-value pair; according to the semantics of the query statement information, determine the partition mode of the storage area; determine the data density under the candidate grouping range according to the data distribution; The workload is the storage mode of the data table in the database.
第五方面,本申请还提供了一种计算机程序产品。所述计算机程序产品,包括计算机程序,该计算机程序被处理器执行时实现以下步骤:获取工作负载的运行参数和各键值对的读写操作性能参数;所述运行参数包括读写操作比例、查询语句和数据分布;根据所述读写操作比例和所述读写操作性能参数,确定所述键值对所占用空间的第一空间大小;根据所述查询语句的语义信息,确定存储区的分区方式;根据所述数据分布确定候选分组范围下的数据密度;基于所述分区方式、所述候选分组范围、所述数据密度和所述第一空间大小,确定在运行所述工作负载时数据库中数据表的存储模式。In a fifth aspect, the present application also provides a computer program product. The computer program product includes a computer program, and when the computer program is executed by the processor, the following steps are implemented: obtaining the operation parameters of the workload and the read-write operation performance parameters of each key-value pair; the operation parameters include the read-write operation ratio, query statement and data distribution; determine the first space size of the space occupied by the key-value pair according to the read-write operation ratio and the read-write operation performance parameter; determine the size of the storage area according to the semantic information of the query statement Partitioning mode; determining the data density under the candidate grouping range according to the data distribution; determining the database when running the workload based on the partitioning mode, the candidate grouping range, the data density and the first space size The storage mode of the data table in .
上述数据库存储模式的自适应方法、装置、计算机设备、存储介质和计算机程序产品,通过获取工作负载的运行参数和各键值对的读写操作性能参数;运行参数包括读写操作比例、查询语句和数据分布;根据读写操作比例和读写操作性能参数,确定键值对所占用空间的第一空间大小;根据查询语句的语义信息,确定存储区的分区方式;根据数据分布确定候选分组范围下的数据密度;基于分区方式、候选分组范围、数据密度和第一空间大小,确定在运行工作负载时数据库中数据表的存储模式。由于通过预先测试得到了各键值对的读写操作性能参数,因此,可以基于不同工作负载运行参数中的读写操作比例和预先测试得到的各键值对的读写操作性能参数,确定键值对所占用空间大小的最优取值范围,以使得服务器可以基于分区方式、候选分组范围、数据密度和键值对所占用空间大小的最优取值范围,确定在运行不同工作负载时数据库中数据表的最优存储模式,解决了传统方式中无法根据不同负载匹配最优存储模式的问题,有效提升了系统的I/O效率,从而有效提升了系统的整体性能。The self-adaptive method, device, computer equipment, storage medium and computer program product of the above-mentioned database storage mode are obtained by obtaining the operation parameters of the workload and the read-write operation performance parameters of each key-value pair; the operation parameters include the read-write operation ratio, the query statement and data distribution; determine the first space size of the space occupied by key-value pairs according to the ratio of read and write operations and performance parameters of read and write operations; determine the partition mode of the storage area according to the semantic information of the query statement; determine the candidate grouping range according to the data distribution The data density under the data density; based on the partition method, the candidate grouping range, the data density and the first space size, determine the storage mode of the data table in the database when the workload is running. Since the read-write operation performance parameters of each key-value pair are obtained through pre-testing, the key-value pair can be determined based on the read-write operation ratio in the operating parameters of different workloads and the read-write operation performance parameters of each key-value pair obtained by the pre-test. The optimal value range of the space occupied by the value pair, so that the server can determine the database when running different workloads based on the partition mode, candidate grouping range, data density, and the optimal value range of the space occupied by the key-value pair. The optimal storage mode of the data table in the data table solves the problem that the traditional method cannot match the optimal storage mode according to different loads, and effectively improves the I/O efficiency of the system, thereby effectively improving the overall performance of the system.
附图说明Description of drawings
图1为一个实施例中数据库存储模式的自适应方法的应用环境图;Fig. 1 is the application environment diagram of the adaptive method of database storage mode in one embodiment;
图2为一个实施例中数据库存储模式的自适应方法的流程示意图;2 is a schematic flowchart of an adaptive method for a database storage mode in one embodiment;
图3为一个实施例中数据表结构变化的示意图;Fig. 3 is the schematic diagram of the structure change of data table in one embodiment;
图4为一个实施例中存储模式的转换示意图;Fig. 4 is the conversion schematic diagram of the storage mode in one embodiment;
图5为一个实施例中为列段存模式的示意图;5 is a schematic diagram of a column segment memory mode in one embodiment;
图6为一个实施例中列族段存模式的示意图;6 is a schematic diagram of a column family segment storage mode in one embodiment;
图7为一个实施例中整表存模式的示意图;FIG. 7 is a schematic diagram of the entire table storage mode in one embodiment;
图8为一个实施例中系统整体架构图;Fig. 8 is the overall architecture diagram of the system in one embodiment;
图9为一个实施例中单元格行存示意图;9 is a schematic diagram of cell row storage in one embodiment;
图10为一个实施例中单元格列存示意图;10 is a schematic diagram of cell column storage in one embodiment;
图11为一个实施例中多行存示意图;11 is a schematic diagram of multi-line storage in one embodiment;
图12为一个实施例中列族存存储格式示意图;12 is a schematic diagram of a column family memory storage format in one embodiment;
图13为一个实施例中多样化存储模式的编码方案示意图;13 is a schematic diagram of an encoding scheme of a diversified storage mode in one embodiment;
图14为一个实施例中存储模式自适应处理的流程图;FIG. 14 is a flow diagram of storage mode adaptation processing in one embodiment;
图15为一个实施例中数据库存储模式的自适应装置的结构框图;15 is a structural block diagram of an apparatus for adapting a database storage mode in one embodiment;
图16为一个实施例中计算机设备的内部结构图。Figure 16 is a diagram of the internal structure of a computer device in one embodiment.
具体实施方式Detailed ways
为了使本申请的目的、技术方案及优点更加清楚明白,以下结合附图及实施例,对本申请进行进一步详细说明。应当理解,此处描述的具体实施例仅仅用以解释本申请,并不用于限定本申请。In order to make the purpose, technical solutions and advantages of the present application more clearly understood, the present application will be described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present application, but not to limit the present application.
本申请实施例提供的数据库存储模式的自适应方法,可以应用于如图1所示的应用环境中。其中,终端102通过网络与服务器104进行通信。数据存储系统可以存储服务器104需要处理的数据。数据存储系统可以集成在服务器104上,也可以放在云上或其他服务器上。服务器104可以从终端102获取工作负载的运行参数和各键值对的读写操作性能参数,运行参数包括读写操作比例、查询语句和数据分布;服务器104根据读写操作比例和读写操作性能参数,确定键值对所占用空间的第一空间大小;服务器104根据查询语句的语义信息,确定存储区的分区方式;进一步的,服务器104根据数据分布确定候选分组范围下的数据密度,并基于分区方式、候选分组范围、数据密度和第一空间大小,确定在运行工作负载时数据库中数据表的存储模式。The adaptive method for the database storage mode provided by the embodiment of the present application can be applied to the application environment shown in FIG. 1 . The terminal 102 communicates with the
其中,终端102可以但不限于是各种台式计算机、笔记本电脑、智能手机、平板电脑、物联网设备和便携式可穿戴设备,物联网设备可为智能音箱、智能电视、智能空调、智能车载设备等。便携式可穿戴设备可为智能手表、智能手环、头戴设备等。服务器104可以用独立的服务器或者是多个服务器组成的服务器集群来实现。The terminal 102 can be, but is not limited to, various desktop computers, notebook computers, smart phones, tablet computers, IoT devices and portable wearable devices, and the IoT devices can be smart speakers, smart TVs, smart air conditioners, smart vehicle-mounted devices, etc. . The portable wearable device may be a smart watch, a smart bracelet, a head-mounted device, or the like. The
可以理解,本申请实施例提供的服务器104也可以是区块链系统中的服务节点,该区块链系统中的各服务节点之间形成组成点对点(P2P,Peer To Peer)网络,P2P协议是一个运行在传输控制协议(TCP,Transmission Control Protocol)协议之上的应用层协议。It can be understood that the
云技术(Cloud technology)基于云计算商业模式应用的网络技术、信息技术、整合技术、管理平台技术、应用技术等的总称,可以组成资源池,按需所用,灵活便利。云计算技术将变成重要支撑。技术网络系统的后台服务需要大量的计算、存储资源,如视频网站、图片类网站和更多的门户网站。伴随着互联网行业的高度发展和应用,将来每个物品都有可能存在自己的识别标志,都需要传输到后台系统进行逻辑处理,不同程度级别的数据将会分开处理,各类行业数据皆需要强大的系统后盾支撑,只能通过云计算来实现。Cloud technology is a general term for network technology, information technology, integration technology, management platform technology, application technology, etc. based on the application of cloud computing business models. It can form a resource pool, which can be used on demand, flexible and convenient. Cloud computing technology will become an important support. Background services of technical network systems require a lot of computing and storage resources, such as video websites, picture websites and more portal websites. With the high development and application of the Internet industry, in the future, each item may have its own identification mark, which needs to be transmitted to the back-end system for logical processing. Data of different levels will be processed separately, and all kinds of industry data need to be strong. The system backing support can only be achieved through cloud computing.
云计算(cloud computing)是一种计算模式,它将计算任务分布在大量计算机构成的资源池上,使各种应用系统能够根据需要获取计算力、存储空间和信息服务。提供资源的网络被称为“云”。“云”中的资源在使用者看来是可以无限扩展的,并且可以随时获取,按需使用,随时扩展,按使用付费。Cloud computing is a computing model that distributes computing tasks on a resource pool composed of a large number of computers, enabling various application systems to obtain computing power, storage space and information services as needed. The network that provides the resources is called the "cloud". The resources in the "cloud" are infinitely expandable in the eyes of users, and can be obtained at any time, used on demand, expanded at any time, and paid for according to usage.
作为云计算的基础能力提供商,会建立云计算资源池(简称云平台,一般称为IaaS(Infrastructure as a Service,基础设施即服务)平台,在资源池中部署多种类型的虚拟资源,供外部客户选择使用。云计算资源池中主要包括:计算设备(为虚拟化机器,包含操作系统)、存储设备、网络设备。As a basic capability provider of cloud computing, it will establish a cloud computing resource pool (referred to as cloud platform, generally called IaaS (Infrastructure as a Service) platform, and deploy various types of virtual resources in the resource pool for External customers choose to use. The cloud computing resource pool mainly includes: computing devices (which are virtualized machines, including operating systems), storage devices, and network devices.
按照逻辑功能划分,在IaaS(Infrastructure as a Service,基础设施即服务)层上可以部署PaaS(Platform as a Service,平台即服务)层,PaaS层之上再部署SaaS(Software as a Service,软件即服务)层,也可以直接将SaaS部署在IaaS上。PaaS为软件运行的平台,如数据库、web容器等。SaaS为各式各样的业务软件,如web门户网站、短信群发器等。一般来说,SaaS和PaaS相对于IaaS是上层。According to the division of logical functions, the PaaS (Platform as a Service) layer can be deployed on the IaaS (Infrastructure as a Service) layer, and the SaaS (Software as a Service) layer can be deployed on the PaaS layer. service) layer, or directly deploy SaaS on IaaS. PaaS is a platform on which software runs, such as databases and web containers. SaaS is a variety of business software, such as web portals, SMS group senders, etc. Generally speaking, SaaS and PaaS are upper layers relative to IaaS.
云存储(cloud storage)是在云计算概念上延伸和发展出来的一个新的概念,分布式云存储系统(以下简称存储系统)是指通过集群应用、网格技术以及分布存储文件系统等功能,将网络中大量各种不同类型的存储设备(存储设备也称之为存储节点)通过应用软件或应用接口集合起来协同工作,共同对外提供数据存储和业务访问功能的一个存储系统。Cloud storage is a new concept extended and developed from the concept of cloud computing. Distributed cloud storage system (hereinafter referred to as storage system) refers to functions such as cluster application, grid technology and distributed storage file system. A storage system that integrates a large number of different types of storage devices (also called storage nodes) in the network through application software or application interfaces to work together to provide external data storage and service access functions.
目前,存储系统的存储方法为:创建逻辑卷,在创建逻辑卷时,就为每个逻辑卷分配物理存储空间,该物理存储空间可能是某个存储设备或者某几个存储设备的磁盘组成。客户端在某一逻辑卷上存储数据,也就是将数据存储在文件系统上,文件系统将数据分成许多部分,每一部分是一个对象,对象不仅包含数据而且还包含数据标识(ID,ID entity)等额外的信息,文件系统将每个对象分别写入该逻辑卷的物理存储空间,且文件系统会记录每个对象的存储位置信息,从而当客户端请求访问数据时,文件系统能够根据每个对象的存储位置信息让客户端对数据进行访问。At present, the storage method of the storage system is as follows: creating a logical volume, and when creating a logical volume, a physical storage space is allocated to each logical volume, and the physical storage space may be composed of a storage device or disks of several storage devices. The client stores data on a logical volume, that is, stores the data on the file system. The file system divides the data into many parts, each part is an object, and the object contains not only data but also data identification (ID, ID entity) and other additional information, the file system writes each object into the physical storage space of the logical volume, and the file system records the storage location information of each object, so that when the client requests to access data, the file system can The storage location information of the object allows the client to access the data.
存储系统为逻辑卷分配物理存储空间的过程,具体为:按照对存储于逻辑卷的对象的容量估量(该估量往往相对于实际要存储的对象的容量有很大余量)和独立冗余磁盘阵列(RAID,Redundant Array of Independent Disk)的组别,预先将物理存储空间划分成分条,一个逻辑卷可以理解为一个分条,从而为逻辑卷分配了物理存储空间。The process of allocating physical storage space by the storage system to the logical volume, specifically: according to the capacity estimation of the objects stored in the logical volume (this estimation often has a large margin relative to the actual capacity of the objects to be stored) and independent redundant disks Array (RAID, Redundant Array of Independent Disk) group, which divides the physical storage space into stripes in advance, and a logical volume can be understood as a stripe, thereby allocating physical storage space for the logical volume.
随着人工智能技术研究和进步,人工智能技术在多个领域展开研究和应用,例如常见的智能家居、智能穿戴设备、虚拟助理、智能音箱、智能营销、无人驾驶、自动驾驶、无人机、机器人、智能医疗、智能客服、车联网、自动驾驶、智慧交通等,相信随着技术的发展,人工智能技术将在更多的领域得到应用,并发挥越来越重要的价值。With the research and progress of artificial intelligence technology, artificial intelligence technology has been researched and applied in many fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, drones , robots, intelligent medical care, intelligent customer service, Internet of Vehicles, autonomous driving, intelligent transportation, etc. It is believed that with the development of technology, artificial intelligence technology will be applied in more fields and play an increasingly important value.
在一个实施例中,如图2所示,提供了一种数据库存储模式的自适应方法,以该方法应用于图1中的服务器为例进行说明,包括以下步骤:In one embodiment, as shown in FIG. 2, an adaptive method for a database storage mode is provided, and the method is applied to the server in FIG. 1 as an example to illustrate, including the following steps:
步骤202,获取工作负载的运行参数和各键值对的读写操作性能参数;运行参数包括读写操作比例、查询语句和数据分布。Step 202: Acquire operating parameters of the workload and performance parameters of read and write operations of each key-value pair; the operating parameters include the ratio of read and write operations, query statements, and data distribution.
其中,工作负载是指在数据库之上运行的各个工作负载,本申请中的数据库可以为关系数据库,关系数据库中的各个数据表的结构是预先配置好的,数据库的结构可以为分布式数据库,本申请中的工作负载可以包括不同类型的工作负载,例如,OLTP(On-LineTransaction Processing)型负载对数据有很多写操作,OLAP(On-Line AnalyticalProcessing)操作一般有很多范围查询操作。OLTP(On-Line Transaction Processing)是快速响应用户操作的方式之一,基本特点是前台收到的用户数据可以立即传送到计算中心进行处理,并在短时间内给出处理结果。OLAP(On-Line Analytical Processing)是一种软件机制,用于对来自数据仓库、数据集市或其他一些统一的集中式数据存储的大量数据进行高速多维分析。The workload refers to each workload running on the database. The database in this application can be a relational database, the structure of each data table in the relational database is pre-configured, and the structure of the database can be a distributed database. The workloads in this application may include different types of workloads. For example, OLTP (On-Line Transaction Processing) type workloads have many write operations to data, and OLAP (On-Line Analytical Processing) operations generally have many range query operations. OLTP (On-Line Transaction Processing) is one of the ways to quickly respond to user operations. The basic feature is that the user data received by the front desk can be immediately sent to the computing center for processing, and the processing results can be given in a short time. OLAP (On-Line Analytical Processing) is a software mechanism for high-speed multi-dimensional analysis of large amounts of data from a data warehouse, data mart, or some other unified centralized data store.
工作负载的运行参数是指工作负载在数据库之上正常运行时的参数,例如,运行参数可以包括工作负载中各个I/O操作的比例、工作负载中涉及的全部查询语句以及工作负载对应的数据分布情况。The running parameters of the workload refer to the parameters when the workload runs normally on the database. For example, the running parameters can include the proportion of each I/O operation in the workload, all the query statements involved in the workload, and the data corresponding to the workload. Distribution.
键值是指(Key-Value,简写为KV),本申请中数据按照键值对的形式进行组织、索引和存储,即在存储引擎中,以键值对作为存储的基本单位。Key-value refers to (Key-Value, abbreviated as KV). In this application, data is organized, indexed and stored in the form of key-value pairs, that is, in the storage engine, key-value pairs are used as the basic unit of storage.
读写操作性能参数是指各个I/O操作所对应的性能参数,例如,针对I/O操作中的点查操作,16B大小的KV对的延迟时间为0.01ms,32B大小的KV对的延迟时间为0.005ms,延迟时间即为点查操作所对应的性能参数。The performance parameters of read and write operations refer to the performance parameters corresponding to each I/O operation. For example, for the check operation in the I/O operation, the delay time of a 16B KV pair is 0.01ms, and the delay of a 32B KV pair is 0.01ms. The time is 0.005ms, and the delay time is the performance parameter corresponding to the check operation.
读写操作比例是指工作负载中各个I/O操作的比例,例如,工作负载A在预设时间段内单点查询操作、更新操作、插入操作、删除操作、以及范围查询操作的相应比例。The ratio of read and write operations refers to the ratio of each I/O operation in the workload. For example, the corresponding ratio of single-point query operations, update operations, insert operations, delete operations, and range query operations for workload A within a preset time period.
查询语句是指工作负载中涉及的全部查询语句,例如,运行参数中可以包括查询语句的比例以及各个查询语句所涉及的元信息。Query statements refer to all query statements involved in the workload. For example, the running parameters may include the proportion of query statements and the meta information involved in each query statement.
数据分布是指工作负载中数据分布的情况,例如,主键的范围和相应的比例。具体地,服务器可以获取预设时间段内的某个工作负载在数据库中正常运行时的运行参数,此时的数据库中的存储模式可以是任何一种存储模式。一般地,可以选择行存模式作为基准模式进行测试。这一步骤中,需要收集的运行参数包含下述三种类型:一是工作负载中各I/O操作的比例,如单点查询操作、更新操作、插入操作、删除操作、以及范围查询操作的相应比例;二是工作负载中涉及的全部查询语句的比例,以及所涉及的列对应的元信息;三是工作负载中数据分布的情况,如主键的范围和相应的比例。进一步的,服务器可以从预先建立好的性能模型中,获取占用不同空间大小的键值对的读写操作性能参数。其中,性能参数可以包括延迟时间、带宽或者IOPS(Input/Output Operations Per Second)中的至少一种。Data distribution refers to the distribution of data in the workload, for example, the range of primary keys and the corresponding proportions. Specifically, the server may acquire the running parameters of a certain workload within a preset period of time when it is running normally in the database, and the storage mode in the database at this time may be any storage mode. In general, the line memory mode can be selected as the benchmark mode for testing. In this step, the running parameters that need to be collected include the following three types: First, the proportion of each I/O operation in the workload, such as single-point query operations, update operations, insert operations, delete operations, and range query operations. The corresponding proportion; the second is the proportion of all query statements involved in the workload, and the meta information corresponding to the columns involved; the third is the data distribution in the workload, such as the range of primary keys and the corresponding proportion. Further, the server may obtain the read/write operation performance parameters of key-value pairs occupying different space sizes from the pre-established performance model. The performance parameter may include at least one of delay time, bandwidth, or IOPS (Input/Output Operations Per Second).
举个例子,假设当前工作负载A包含1000个点查询操作、10个涉及100000行的全表扫描操作、1000个更新操作、1000个插入操作和10个删除操作,服务器获取预设时间段内的工作负载A的运行参数,运行参数中包含1000个点查询操作、10个涉及100000行的全表扫描操作、1000个更新操作、1000个插入操作和10个删除操作。进一步的,服务器可以从预先建立好的性能模型中,获取各大小KV对在不同I/O操作下的性能,比如,服务器获取到在点查操作中,16B大小的KV对的延迟时间为0.01ms,32B大小的KV对的延迟时间为0.005ms,48B大小的KV对的延迟时间为0.009ms。For example, assuming that the current workload A contains 1000 point query operations, 10 full table scan operations involving 100,000 rows, 1000 update operations, 1000 insert operations, and 10 delete operations, the server obtains the The running parameters of workload A include 1,000 point query operations, 10 full table scan operations involving 100,000 rows, 1,000 update operations, 1,000 insert operations, and 10 delete operations. Further, the server can obtain the performance of each size KV pair under different I/O operations from the pre-established performance model. For example, the server obtains a delay time of 0.01 for a 16B KV pair in the check operation. ms, the delay time of 32B size KV pair is 0.005ms, and the delay time of 48B size KV pair is 0.009ms.
步骤204,根据读写操作比例和读写操作性能参数,确定键值对所占用空间的第一空间大小。Step 204: Determine the first space size of the space occupied by the key-value pair according to the read-write operation ratio and the read-write operation performance parameter.
其中,第一空间大小是指选取的工作负载的总体性能最优的键值对所占用空间大小的取值范围,例如,假设估算出当前工作负载在16B大小的KV对下的总体性能最高,因此,可以确定键值对所占用空间的第一空间大小为16B。The first space size refers to the value range of the space occupied by the key-value pair with the best overall performance of the selected workload. Therefore, it can be determined that the first space size of the space occupied by the key-value pair is 16B.
具体的,服务器获取工作负载的运行参数和各键值对的读写操作性能参数之后,服务器可以根据运行参数中包含的各个读写操作比例和各键值对的读写操作性能参数,估算出当前工作负载在每一种KV对大小下的总体性能,进而服务器可以从中选择总体性能最高的KV对大小,作为系统设置的键值对所占用空间的第一空间大小。由于键值对所占用空间大小对存储系统的性能影响很大,因此,需要根据当前工作负载中各个I/O操作的占比,估算出最适合的KV对大小。一般地,大粒度的KV对更适合处理全表扫描操作,小粒度的KV对更适合处理单点查询、更新、插入、删除等操作。Specifically, after the server obtains the operating parameters of the workload and the read-write operation performance parameters of each key-value pair, the server can estimate the read-write operation ratio and the read-write operation performance parameters of each key-value pair contained in the operating parameters. The overall performance of the current workload under each KV pair size, and then the server can select the KV pair size with the highest overall performance from it, as the first space size of the space occupied by the key-value pair set by the system. Since the size of the space occupied by the key-value pair has a great impact on the performance of the storage system, it is necessary to estimate the most suitable KV pair size according to the proportion of each I/O operation in the current workload. Generally, KV pairs with large granularity are more suitable for processing full table scan operations, and KV pairs with small granularity are more suitable for processing single-point query, update, insert, delete and other operations.
举个例子,假设当前工作负载A包含1000个点查询操作、10个涉及100000行的全表扫描操作、1000个更新操作、0个插入操作和0个删除操作,服务器获取预设时间段内的工作负载A的运行参数,运行参数中包含1000个点查询操作、10个涉及100000行的全表扫描操作、1000个更新操作、0个插入操作和0个删除操作。进一步的,服务器可以从预先建立好的性能模型中,获取各大小KV对在不同I/O操作下的性能,比如,服务器获取到在点查操作中,16B大小的KV对的延迟时间为0.01ms,32B大小的KV对的延迟时间为0.005ms,48B大小的KV对的延迟时间为0.009ms,则服务器可以估算出在当前工作负载A中,16B的KV对在点查操作中的耗时ta1为1000*0.01ms,32B的KV对在点查操作中的耗时ta2为1000*0.05ms,48B的KV对在点查操作中的耗时ta3为1000*0.09ms。以此类推,服务器可以估算出16B的KV对在全表扫描操作和更新操作中的耗时分别为tb1和tc1,服务器将16B的KV对在点查操作、全表扫描操作和更新操作中的耗时求和后即为总耗时T1=ta1+tb1+tc1。同样,对于其他大小的KV对,服务器也可以算出一个预估的总耗时,假设服务器将32B的KV对在点查操作、全表扫描操作和更新操作中的耗时求和后即为总耗时T2=ta2+tb2+tc2。服务器将48B的KV对在点查操作、全表扫描操作和更新操作中的耗时求和后即为总耗时T3=ta3+tb3+tc3。由于T3<T1<T2,因此,服务器可以选取预估总耗时最短的T3所对应的KV对大小即48B作为最优的KV对大小,即服务器根据当前负载A中各个读写操作比例和不同KV对的读写操作性能参数,确定键值对所占用空间的最优取值为48B。For example, assuming that the current workload A contains 1000 point query operations, 10 full table scan operations involving 100,000 rows, 1000 update operations, 0 insert operations, and 0 delete operations, the server obtains the The running parameters of workload A include 1,000 point query operations, 10 full table scan operations involving 100,000 rows, 1,000 update operations, 0 insert operations, and 0 delete operations. Further, the server can obtain the performance of each size KV pair under different I/O operations from the pre-established performance model. For example, the server obtains a delay time of 0.01 for a 16B KV pair in the check operation. ms, the delay time of a 32B-sized KV pair is 0.005ms, and the 48B-sized KV pair’s delay time is 0.009ms, then the server can estimate the time-consuming of the 16B KV pair in the current workload A in the check operation. t a1 is 1000*0.01ms, the time consuming t a2 of the 32B KV pair in the enumeration operation is 1000*0.05ms, and the time consuming t a3 of the 48B KV pair in the enumeration operation is 1000*0.09ms. By analogy, the server can estimate that the time-consuming of 16B KV pairs in the full table scan operation and the update operation are t b1 and t c1 respectively, and the server uses the 16B KV pair in the check operation, the full table scan operation and the update operation. The total time consumption T1=t a1 +t b1 +t c1 is obtained by summing the time consumption in . Similarly, for KV pairs of other sizes, the server can also calculate an estimated total time-consuming. Assuming that the server sums the time-consuming of 32B KV pairs in the check operation, full table scan operation and update operation, the total time is calculated as the total time. Time-consuming T2=t a2 +t b2 +t c2 . After the server sums up the time consuming of the 48B KV pair in the check operation, the full table scan operation and the update operation, the total time consuming is T3=t a3 +t b3 +t c3 . Since T3<T1<T2, the server can select the KV pair size corresponding to T3 with the shortest estimated total time consumption, that is, 48B as the optimal KV pair size, that is, the server can select the ratio of each read and write operation in the current load A and the difference For the read and write operation performance parameters of the KV pair, determine the optimal value of the space occupied by the key-value pair of 48B.
步骤206,根据查询语句的语义信息,确定存储区的分区方式。Step 206: Determine the partition mode of the storage area according to the semantic information of the query statement.
其中,查询语句的语义信息是指解析查询语句中的语法,得到的查询语句中的语义信息。例如,对于每个查询,可以解析其语法,从投影子句和WHERE条件子句中,提取出涉及的列信息并保存。The semantic information of the query statement refers to the semantic information in the query statement obtained by parsing the grammar in the query statement. For example, for each query, its grammar can be parsed to extract the column information involved from the projection clause and the WHERE conditional clause and save it.
存储区是指数据表中存放数据的区域,例如,预先定义一张数据表的结构为5行*8列的数据表,则该数据表中5行*8列的区域均为存储区。The storage area refers to the area where data is stored in the data table. For example, if a data table with a structure of 5 rows*8 columns is pre-defined, the area of 5 rows*8 columns in the data table is the storage area.
分区(Partition)是指将数据库中某个数据表的各属性划分为若干组。分区方式是指在垂直方向对数据表进行分区的方式,即在垂直方向上将一张数据表划分为若干子集的方式,例如,对数据表的列进行分区,得到对应的列分区方案。Partition refers to dividing the attributes of a data table in the database into several groups. The partitioning method refers to the method of partitioning the data table in the vertical direction, that is, the method of dividing a data table into several subsets in the vertical direction. For example, the columns of the data table are partitioned to obtain the corresponding column partitioning scheme.
具体的,服务器根据读写操作比例和读写操作性能参数,确定键值对所占用空间的第一空间大小之后,服务器可以根据运行参数中查询语句的语义信息,确定存储区的分区方式,即服务器可以根据查询语句的语意信息,提取出相关列的信息,以推导出列分区的候选方案。在实际存储中,服务器按照分区方式,将数据库中某张数据表的各属性划分为若干组后,服务器可以按照这些分组存放数据。Specifically, after the server determines the first space size of the space occupied by the key-value pair according to the read-write operation ratio and the read-write operation performance parameters, the server can determine the partition mode of the storage area according to the semantic information of the query statement in the running parameters, that is, The server can extract relevant column information according to the semantic information of the query statement, so as to deduce the candidate scheme of column partitioning. In actual storage, after the server divides the attributes of a data table in the database into several groups according to the partition method, the server can store data according to these groups.
例如,如图3所示,为数据表结构变化的示意图。如图3中(1)所示,分区会在垂直方向将一张数据表划分为若干子集。首先,服务器可以获取针对该数据表上所涉及的所有查询语句和相应的比例。对于每个查询语句,服务器可以解析其语法,将其中涉及的列信息取出并保存。进一步的,服务器可以根据各个查询语句的占比,对各查询语句进行排序,选取占比最高的前n位的查询语句作为目标查询语句。服务器可以从占比最高的目标查询语句开始,根据该目标查询语句中涉及的列信息进行分区之后,服务器再取出次高占比的目标查询语句,重复分区操作,直到所有属性都已经被分区完毕。For example, as shown in FIG. 3 , it is a schematic diagram of the structure change of the data table. As shown in (1) in Figure 3, partitioning divides a data table into subsets in the vertical direction. First, the server can obtain all query statements and corresponding ratios involved in the data table. For each query statement, the server can parse its syntax, fetch and save the column information involved. Further, the server may sort each query statement according to the proportion of each query statement, and select the top n query statements with the highest proportion as the target query statement. The server can start from the target query statement with the highest proportion, and after partitioning according to the column information involved in the target query statement, the server takes out the target query statement with the second highest proportion, and repeats the partitioning operation until all attributes have been partitioned. .
步骤208,根据数据分布确定候选分组范围下的数据密度。Step 208: Determine the data density in the candidate grouping range according to the data distribution.
其中,候选分组范围是指预先设置好的一组分组范围,例如,预定义一组分组范围时,为方便分组算法的实现和降低计算量,可以取2的幂次作为分组范围,即预先设置的候选分组范围的候选值为{21、22和23}。分组范围是指主键数值的分组范围。The candidate grouping range refers to a preset grouping range. For example, when a grouping range is predefined, in order to facilitate the implementation of the grouping algorithm and reduce the amount of calculation, a power of 2 can be taken as the grouping range, that is, the preset The candidate values for the candidate grouping range of are {2 1 , 2 2 and 2 3 }. The grouping range refers to the grouping range of primary key values.
数据密度是指每个候选分组范围下的平均数据密度,例如,当候选分组范围设定为8时,可以将数据表分为8个分组,分别确定每个分组内的数据密度,再将8个分组内的数据密度的均值P作为该候选分组范围下的数据密度,即候选分组范围为8时的数据密度为P。其中,在确定每个分组所对应的数据密度时,假设分组范围设定为8,而数据表中主键数值为1~7的数据不存在,只有主键数值为0的数据被分到了这组中,因此,主键数值为1~7的这些位置空了出来,此时该分组内的数据密度为1/8。Data density refers to the average data density under each candidate grouping range. For example, when the candidate grouping range is set to 8, the data table can be divided into 8 groups, and the data density in each group is determined separately, and then 8 The average value P of the data densities in each grouping is taken as the data density in the candidate grouping range, that is, the data density when the candidate grouping range is 8 is P. Among them, when determining the data density corresponding to each group, it is assumed that the grouping range is set to 8, and the data with the primary key value of 1 to 7 in the data table does not exist, and only the data with the primary key value of 0 is assigned to this group. , therefore, the positions where the primary key values are 1 to 7 are vacated, and the data density in this group is 1/8 at this time.
具体的,服务器根据查询语句的语义信息,确定存储区的分区方式之后,服务器可以根据数据分布和预设的一组候选分组范围,确定各个候选分组范围下的数据密度。即服务器可以根据当前工作负载运行时的数据分布情况,求出各个候选分组范围下的数据密度。分组聚集能够将多行数据集中到一个KV对中,提高读写数据的效率。而分组的策略、分组范围则需要根据当前工作负载的数据分布情况决定。如图3中(2)所示,分组会在水平方向将一张数据表划分为若干子集。本申请中的分组策略可以包括哈希分组和顺序分组。哈希分组是指通过某一哈希函数,将同一哈希值的数据保存在同一个数据分组中。顺序分组是指按照数据的到来顺序,将一系列连续到来的数据存放到同一个数据分组中。可以理解,本申请中的分组策略包括但不限于是哈希分组和顺序分组,还可以为其他自定义的分组策略。Specifically, after the server determines the partition mode of the storage area according to the semantic information of the query statement, the server can determine the data density under each candidate grouping range according to the data distribution and a preset group of candidate grouping ranges. That is, the server can calculate the data density in each candidate grouping range according to the data distribution when the current workload is running. Grouping and aggregation can aggregate multiple rows of data into a KV pair, improving the efficiency of reading and writing data. The grouping strategy and grouping range need to be determined according to the data distribution of the current workload. As shown in (2) in Figure 3, grouping divides a data table into several subsets in the horizontal direction. The grouping strategy in this application may include hash grouping and sequential grouping. Hash grouping refers to storing data of the same hash value in the same data group through a certain hash function. Sequential grouping refers to storing a series of consecutively arriving data in the same data group according to the arrival order of the data. It can be understood that the grouping strategy in this application includes, but is not limited to, hash grouping and sequential grouping, and can also be other customized grouping strategies.
举个例子,假设采用的分组策略为哈希分组,即分组函数设置为哈希函数,预定义的一组候选分组范围为{21、22、24}。假设数据表A中的第一行数据的主键为int类型,数值为32,第二行数据的主键数值为31,此时服务器基于哈希函数即整除函数,当候选分组范围为24时,针对key为32的第一行数据,其分组序号=32整除16=2,其应当分在第2组;对于key为31的第二行数据,其分组序号=31整除16=1,其应当分在第1组。假设服务器将数据表A分为2个分组即第1组和第2组,则服务器分别确定每个分组内的数据密度,再将2个分组内的数据密度的均值P作为该候选分组范围为24下的数据密度,即候选分组范围为16时的数据密度为PFor example, it is assumed that the adopted grouping strategy is hash grouping, that is, the grouping function is set to a hash function, and a predefined group of candidate grouping ranges are {2 1 , 2 2 , 2 4 }. Assume that the primary key of the first row of data in data table A is int type, the value is 32, and the primary key value of the second row of data is 31. At this time, the server is based on the hash function, that is, the divisibility function. When the candidate grouping range is 2 4 , For the first row of data with key 32, its grouping sequence number=32 divisible by 16=2, it should be divided into the second group; for the second row of data with key 31, its grouping sequence number=31 divisible by 16=1, it should be in group 1. Assuming that the server divides the data table A into two groups, namely the first group and the second group, the server determines the data density in each group respectively, and then takes the mean value P of the data densities in the two groups as the candidate grouping range: The data density under 2 4 , that is, the data density when the candidate grouping range is 16 is P
步骤210,基于分区方式、候选分组范围、数据密度和第一空间大小,确定在运行工作负载时数据库中数据表的存储模式。Step 210: Determine the storage mode of the data table in the database when the workload is running based on the partition mode, the candidate grouping range, the data density and the first space size.
其中,数据表是指数据库中预先设置好数据表结构的不同数据表。The data table refers to different data tables in the database whose data table structure is pre-set.
存储模式是指数据库中数据表的存放方式,本申请中的存储模式包括行存模式、列存模式、单元格存模式、多行存模式、列段存模式、列族存模式、列族段存模式或整表存模式中的至少一种。Storage mode refers to the storage mode of data tables in the database. The storage modes in this application include row storage mode, column storage mode, cell storage mode, multi-row storage mode, column segment storage mode, column family storage mode, and column family segment. at least one of the storage mode or the whole table storage mode.
具体的,服务器根据数据分布确定候选分组范围下的数据密度之后,服务器可以基于分区方式、候选分组范围、数据密度和第一空间大小,确定在运行工作负载时数据库中数据表的最优存储模式。即服务器可以根据分区方式、各个候选分组范围、以及各个候选分组范围下的数据密度,估算出各个候选分组范围下的键值对所占用空间的第二空间大小,服务器可以将各个候选分组范围下的键值对所占用空间的第二空间大小与步骤204中确定的键值对所占用空间的第一空间大小进行比较,当第二空间大小与第一空间大小之间的差值满足预设的差值条件时,服务器获取该第二空间大小所对应的分区方式和候选分组范围,并基于第二空间大小所对应的分区方式和候选分组范围,确定在运行当前工作负载时数据库中数据表的最优存储模式。Specifically, after the server determines the data density in the candidate grouping range according to the data distribution, the server may determine the optimal storage mode of the data table in the database when running the workload based on the partition mode, the candidate grouping range, the data density and the first space size . That is, the server can estimate the second space size of the space occupied by the key-value pairs in each candidate grouping range according to the partitioning method, each candidate grouping range, and the data density in each candidate grouping range. The second space size of the space occupied by the key-value pair is compared with the first space size of the space occupied by the key-value pair determined in
此外,服务器确定在运行当前工作负载时数据库中数据表的最优存储模式之后,在服务器处于空闲状态时,服务器可以将数据表中的数据,按照最优存储模式所对应的编码方式,重新编码后进行存放。In addition, after the server determines the optimal storage mode of the data table in the database when the current workload is running, when the server is in an idle state, the server can re-encode the data in the data table according to the encoding method corresponding to the optimal storage mode stored later.
上述数据库存储模式的自适应方法中,通过获取工作负载的运行参数和各键值对的读写操作性能参数;运行参数包括读写操作比例、查询语句和数据分布;根据读写操作比例和读写操作性能参数,确定键值对所占用空间的第一空间大小;根据查询语句的语义信息,确定存储区的分区方式;根据数据分布确定候选分组范围下的数据密度;基于分区方式、候选分组范围、数据密度和第一空间大小,确定在运行工作负载时数据库中数据表的存储模式。由于通过预先测试得到了各键值对的读写操作性能参数,因此,可以基于不同工作负载运行参数中的读写操作比例和预先测试得到的各键值对的读写操作性能参数,确定键值对所占用空间大小的最优取值范围,以使得服务器可以基于分区方式、候选分组范围、数据密度和键值对所占用空间大小的最优取值范围,确定在运行不同工作负载时数据库中数据表的最优存储模式,解决了传统方式中无法根据不同负载匹配最优存储模式的问题,有效提升了系统的I/O效率,从而有效提升了系统的整体性能。In the above adaptive method of database storage mode, the operation parameters of the workload and the read and write operation performance parameters of each key-value pair are obtained; the operation parameters include the ratio of read and write operations, query statements and data distribution; Write operation performance parameters to determine the first space size of the space occupied by the key-value pair; determine the partition mode of the storage area according to the semantic information of the query statement; determine the data density in the candidate grouping range according to the data distribution; Scope, data density, and first space size, determine the storage pattern of data tables in the database when running the workload. Since the read-write operation performance parameters of each key-value pair are obtained through pre-testing, the key-value pair can be determined based on the read-write operation ratio in the operating parameters of different workloads and the read-write operation performance parameters of each key-value pair obtained by the pre-test. The optimal value range of the space occupied by the value pair, so that the server can determine the database when running different workloads based on the partition mode, candidate grouping range, data density, and the optimal value range of the space occupied by the key-value pair. The optimal storage mode of the data table in the data table solves the problem that the traditional method cannot match the optimal storage mode according to different loads, and effectively improves the I/O efficiency of the system, thereby effectively improving the overall performance of the system.
在一个实施例中,所述获取工作负载的运行参数和各键值对的读写操作性能参数之前,所述方法还包括:In one embodiment, before acquiring the operating parameters of the workload and the read/write operation performance parameters of each key-value pair, the method further includes:
获取键值对占用的空间大小;Get the size of the space occupied by the key-value pair;
在占用空间大小的情况下,确定键值对的读写操作性能参数;In the case of the size of the occupied space, determine the read and write operation performance parameters of the key-value pair;
基于读写操作性能参数,建立键值对的性能模型;Based on the performance parameters of read and write operations, establish a performance model of key-value pairs;
所述获取工作负载的运行参数和各键值对的读写操作性能参数,包括:The obtaining of the operating parameters of the workload and the read-write operation performance parameters of each key-value pair includes:
获取工作负载的运行参数,以及基于性能模型获取键值对的读写操作性能参数。Get the running parameters of the workload, and the read and write operation performance parameters of key-value pairs based on the performance model.
其中,键值对占用的空间大小是指存储键值对所占用的空间大小,例如,预先设置一组键值对占用的空间大小的集合为:{16B、32B、48B、64B}。The size of the space occupied by the key-value pair refers to the size of the space occupied by the storage of the key-value pair. For example, a set of preset space sizes occupied by a group of key-value pairs is: {16B, 32B, 48B, 64B}.
性能模型是指不同键值对所对应的读写操作性能参数,例如,在点查操作中,16B大小的KV对的延迟时间为0.01ms,32B大小的KV对的延迟时间为0.005ms,即占用空间大小不同的键值对所对应的读写操作性能参数也不同。The performance model refers to the performance parameters of read and write operations corresponding to different key-value pairs. For example, in the check operation, the delay time of a 16B KV pair is 0.01ms, and the delay time of a 32B KV pair is 0.005ms, that is The performance parameters of read and write operations corresponding to key-value pairs with different occupied space are also different.
具体的,服务器获取工作负载的运行参数和各键值对的读写操作性能参数之前,服务器可以预先建立不同KV对大小的性能模型。即服务器可以获取设置好的一组键值对占用的空间大小的集合,并在键值对占用不同空间大小的情况下,确定不同空间大小的键值对的读写操作性能参数,进而使得服务器可以基于读写操作性能参数,建立不同空间大小的键值对的性能模型,即服务器建立不同空间大小的键值对与读写操作性能参数之间的映射关系,以使得后续服务器可以基于性能模型直接获取到不同空间大小的键值对所对应的读写操作性能参数。Specifically, before the server obtains the operating parameters of the workload and the read/write operation performance parameters of each key-value pair, the server may pre-establish performance models of different KV pair sizes. That is, the server can obtain a set of space sizes occupied by a set set of key-value pairs, and determine the read and write operation performance parameters of key-value pairs with different space sizes when the key-value pairs occupy different space sizes, so as to make the server Based on the performance parameters of read and write operations, a performance model of key-value pairs of different space sizes can be established, that is, the server establishes the mapping relationship between key-value pairs of different space sizes and performance parameters of read and write operations, so that subsequent servers can be based on the performance model. Directly obtain the performance parameters of read and write operations corresponding to key-value pairs of different space sizes.
举个例子,假设预先设置好的一组键值对占用的空间大小的集合为:{16B、32B、48B、64B},则可以通过事先的一系列实验,测出各大小KV对在各个典型I/O操作下的性能,典型I/O操作包括更新操作、插入操作、删除操作、点查询操作或者范围查询操作中的至少一种。即分别测出16B大小的KV对在更新操作、插入操作、删除操作、点查询操作以及范围查询操作中的性能参数、32B大小的KV对在更新操作、插入操作、删除操作、点查询操作以及范围查询操作中的性能参数、48B大小的KV对在更新操作、插入操作、删除操作、点查询操作以及范围查询操作中的性能参数、以及64B大小的KV对在更新操作、插入操作、删除操作、点查询操作以及范围查询操作中的性能参数,这些测试结果将作为后续步骤的判定依据。进一步的,服务器可以基于上述各个典型I/O操作对应的性能参数,建立16B、32B、48B、64B空间大小的键值对的性能模型,即服务器建立不同空间大小的键值对与各个读写操作性能参数之间的映射关系,这些测试结果将作为后续步骤的判定依据。For example, assuming that the set of space sizes occupied by a set of preset key-value pairs is: {16B, 32B, 48B, 64B}, then through a series of experiments in advance, it is possible to measure the size of each KV pair in each typical Performance under I/O operations. Typical I/O operations include at least one of update operations, insert operations, delete operations, point query operations, or range query operations. That is, the performance parameters of 16B KV pairs in update, insert, delete, point query and range query operations, and 32B KV pairs in update, insert, delete, point query and Performance parameters in range query operations, 48B KV pairs in update operations, insert operations, delete operations, point query operations, and performance parameters in range query operations, and 64B KV pairs in update operations, insert operations, and delete operations , point query operation and performance parameters in range query operation, these test results will be used as the judgment basis for subsequent steps. Further, the server can establish a performance model of key-value pairs of 16B, 32B, 48B, and 64B space sizes based on the performance parameters corresponding to the above-mentioned typical I/O operations, that is, the server establishes key-value pairs of different space sizes and each read and write. The mapping relationship between operational performance parameters, these test results will be used as the judgment basis for subsequent steps.
本实施例中,通过事先的一系列实验,可以测出各大小KV对在各个典型I/O操作下的性能,使得后续服务器可以基于性能模型直接获取到不同空间大小的键值对所对应的读写操作性能参数,并结合不同工作负载的运行参数,快速准确的估算出键值对所占用空间的最优范围,提高数据处理过程中的计算效率。In this embodiment, through a series of experiments in advance, the performance of each size KV pair under each typical I/O operation can be measured, so that the subsequent server can directly obtain the corresponding key-value pairs of different space sizes based on the performance model. Read and write operation performance parameters, combined with the operating parameters of different workloads, quickly and accurately estimate the optimal range of space occupied by key-value pairs, and improve the computing efficiency in the data processing process.
在一个实施例中,根据读写操作比例和读写操作性能参数,确定键值对所占用空间的第一空间大小的步骤,包括:In one embodiment, the step of determining the first space size of the space occupied by the key-value pair according to the read-write operation ratio and the read-write operation performance parameter includes:
根据读写操作比例和读写操作性能参数,确定工作负载在键值对占用空间大小时的总体性能参数;According to the ratio of read and write operations and the performance parameters of read and write operations, determine the overall performance parameters of the workload when the key-value pair occupies space;
基于总体性能参数,确定键值对所占用空间的第一空间大小。Based on the overall performance parameter, the first space size of the space occupied by the key-value pair is determined.
其中,总体性能参数是指将各个读写操作的性能参数进行总体评估,得到的总体性能参数,例如,服务器可以从性能模型中分别获取16B的KV对在全表扫描操作、更新操作中的耗时为T1和T2,将T1、T2求和后即为总耗时T=T1+T2,总耗时可以作为16B的KV对所对应的总体性能参数。Among them, the overall performance parameter refers to the overall performance parameter obtained by the overall evaluation of the performance parameters of each read and write operation. For example, the server can obtain 16B of KV pairs from the performance model respectively. The consumption of the full table scan operation and the update operation The time is T1 and T2. After summing up T1 and T2, the total time is T=T1+T2, and the total time can be used as the overall performance parameter corresponding to the 16B KV pair.
第一空间大小是指键值对所占空间大小的最优范围,即第一空间大小可以是一个区间范围,也可以是一个最优的目标值。例如,在工作负载A中,确定键值对所占用空间的第一空间大小为16B,在工作负载B中,确定键值对所占用空间的第一空间大小为16B-17B。The first space size refers to the optimal range of the space size occupied by the key-value pair, that is, the first space size may be an interval range or an optimal target value. For example, in workload A, the first space size of the space occupied by key-value pairs is determined to be 16B, and in workload B, the first space size of the space occupied by key-value pairs is determined to be 16B-17B.
具体的,服务器获取工作负载的运行参数和各键值对的读写操作性能参数之后,服务器可以根据运行参数中的读写操作比例和预先建立的性能模型中的各键值对的读写操作性能参数,确定当前工作负载下的键值对所占用空间大小时的总体性能参数,并基于总体性能参数,确定当前工作负载下键值对所占用空间的最优范围。其中,本申请中用于表征总体性能参数的参数可以包括:延迟时间、带宽或者IOPS(Input/Output OperationsPer Second)中的至少一种。本申请中在确定不同空间大小的键值对所对应的总体性能参数时,可以通过加权平均的方法,估算出当前工作负载在每一种KV对大小下的总体性能,进而可以从中选择总体性能最高的KV对大小作为系统设置的最优值。Specifically, after the server obtains the operating parameters of the workload and the read and write operation performance parameters of each key-value pair, the server can read and write operations of each key-value pair in the pre-established performance model according to the ratio of read and write operations in the operating parameters and the pre-established performance model. Performance parameters: determine the overall performance parameters when the space occupied by key-value pairs under the current workload is determined, and based on the overall performance parameters, determine the optimal range of space occupied by key-value pairs under the current workload. Wherein, the parameter used to characterize the overall performance parameter in this application may include at least one of delay time, bandwidth, or IOPS (Input/Output Operations Per Second). In this application, when determining the overall performance parameters corresponding to key-value pairs of different space sizes, the weighted average method can be used to estimate the overall performance of the current workload under each KV pair size, and then the overall performance can be selected. The highest KV pair size is used as the optimal value for the system settings.
举个例子,假设当前工作负载B包含1000个点查询操作、10个涉及100000行的全表扫描操作、1000个更新操作、0个插入操作和0个删除操作,服务器获取预设时间段内的工作负载B的运行参数,运行参数中包含1000个点查询操作、10个涉及100000行的全表扫描操作、1000个更新操作、0个插入操作和0个删除操作。进一步的,服务器可以从预先建立好的性能模型中,获取各大小KV对在不同I/O操作下的性能参数,假设服务器获取到在点查询操作中,16B大小的KV对所对应的IOPS即每秒读写次数为Ia1=850,32B大小的KV对的IOPS为Ia2=750,48B大小的KV对的IOPS为Ia3=950。以此类推,服务器可以估算出16B的KV对在全表扫描操作和更新操作中每秒读写次数分别为Ib1和Ic1,服务器将16B的KV对在点查操作、全表扫描操作和更新操作中的IOPS求和后即可得到总IOPS为I总1=Ia1+Ib1+Ic1。同样,对于其他大小的KV对,服务器也可以算出一个预估的总IOPS,假设服务器将32B的KV对在点查操作、全表扫描操作和更新操作中的IOPS求和后即可得到总IOPS为I总2=Ia2+Ib2+Ic2。服务器将48B的KV对在点查操作、全表扫描操作和更新操作中的IOPS求和后即可得到总IOPS为I总3=Ia3+Ib3+Ic3。由于I总1<I总3<I总2,因此,服务器可以选取预估总IOPS中的最大值I总2所对应的KV对大小即32B作为最优的KV对大小,即服务器根据当前工作负载B中各个读写操作比例和不同KV对的读写操作性能参数,确定当前工作负载B下键值对所占用空间的最优取值范围为32B。由此使得,服务器可以基于性能模型,直接获取到不同空间大小的键值对所对应的读写操作性能参数,并结合不同工作负载的运行参数,快速准确的估算出键值对所占用空间的最优范围,提高数据处理过程中的计算效率。For example, assuming that the current workload B contains 1,000 point query operations, 10 full table scan operations involving 100,000 rows, 1,000 update operations, 0 insert operations, and 0 delete operations, the server obtains the The running parameters of workload B include 1,000 point query operations, 10 full table scan operations involving 100,000 rows, 1,000 update operations, 0 insert operations, and 0 delete operations. Further, the server can obtain the performance parameters of each size KV pair under different I/O operations from the pre-established performance model. Suppose the server obtains the IOPS corresponding to the 16B KV pair in the point query operation. The number of reads and writes per second is I a1 =850, the IOPS of a KV pair with a size of 32B is I a2 =750, and the IOPS of a KV pair with a size of 48B is I a3 =950. By analogy, the server can estimate that the number of reads and writes per second of the 16B KV pair in the full table scan operation and the update operation are I b1 and I c1 respectively. After the IOPS in the update operation are summed, the total IOPS can be obtained as I total 1 =I a1 +I b1 +I c1 . Similarly, for KV pairs of other sizes, the server can also calculate an estimated total IOPS. Assuming that the server sums the IOPS of 32B KV pairs in the check operation, full table scan operation and update operation, the total IOPS can be obtained. is Itotal 2 =I a2 +I b2 +I c2 . After the server sums the IOPS of the 48B KV pair in the check operation, the full table scan operation and the update operation, the total IOPS can be obtained as I total 3 =I a3 +I b3 +I c3 . Since Itotal 1 <Itotal 3 <Itotal 2 , the server can select the KV pair size corresponding to the maximum value Itotal 2 in the estimated total IOPS, that is, 32B as the optimal KV pair size, that is, the server can select the KV pair size according to the current work The ratio of each read and write operation in load B and the read and write operation performance parameters of different KV pairs determine the optimal value range of the space occupied by the key-value pair under the current workload B to be 32B. As a result, the server can directly obtain the performance parameters of read and write operations corresponding to key-value pairs of different space sizes based on the performance model, and combine the operating parameters of different workloads to quickly and accurately estimate the space occupied by key-value pairs. Optimal range to improve computational efficiency during data processing.
在一个实施例中,根据读写操作比例和读写操作性能参数,确定键值对所占用空间的第一空间大小的步骤,包括:In one embodiment, the step of determining the first space size of the space occupied by the key-value pair according to the read-write operation ratio and the read-write operation performance parameter includes:
当总体性能参数为总体耗时时,选取各总体耗时中满足耗时条件的总体耗时作为目标总体耗时;When the overall performance parameter is the overall time-consuming, select the overall time-consuming that satisfies the time-consuming condition among the overall time-consuming as the target overall time-consuming;
将所述目标总体耗时所对应的键值对所占用的空间大小作为所述第一空间大小。The size of the space occupied by the key-value pair corresponding to the total time-consuming of the target is taken as the first space size.
其中,耗时条件是指预设的总体耗时的条件,例如,耗时条件可以预先设置为小于某个阈值,比如,小于阈值15。耗时条件也可以设置为自动选取多个总体耗时中的最小值。The time-consuming condition refers to a preset overall time-consuming condition. For example, the time-consuming condition may be preset to be less than a certain threshold, for example, less than a threshold of 15. The time-consuming condition can also be set to automatically select the minimum value among multiple total time-consuming.
具体的,服务器获取工作负载的运行参数和各键值对的读写操作性能参数之后,服务器可以根据运行参数中的读写操作比例和预先建立的性能模型中的各键值对的读写操作性能参数,确定当前工作负载下的键值对所占用空间大小时的总体性能参数,当总体性能参数为总体耗时时,服务器可以选取各个总体耗时中满足耗时条件的总体耗时作为目标总体耗时,并将目标总体耗时所对应的键值对所占用的空间大小作为当前工作负载下键值对所占用空间的最优范围。Specifically, after the server obtains the operating parameters of the workload and the read and write operation performance parameters of each key-value pair, the server can read and write operations of each key-value pair in the pre-established performance model according to the ratio of read and write operations in the operating parameters and the pre-established performance model. Performance parameters: determine the overall performance parameters when the space occupied by key-value pairs under the current workload is determined. When the overall performance parameter is the overall time-consuming, the server can select the total time-consuming that satisfies the time-consuming conditions among the total time-consuming as the target total. time-consuming, and take the space occupied by the key-value pair corresponding to the target total time-consuming as the optimal range of the space occupied by the key-value pair under the current workload.
举个例子,以总体性能参数为总体耗时为例进行说明。假设当前工作负载A包含1000个点查询操作、10个涉及100000行的全表扫描操作、1000个更新操作、0个插入操作和0个删除操作,服务器获取预设时间段内的工作负载A的运行参数,运行参数中包含1000个点查询操作、10个涉及100000行的全表扫描操作、1000个更新操作、0个插入操作和0个删除操作。进一步的,服务器可以从预先建立好的性能模型中,获取各大小KV对在不同I/O操作下的性能,比如,服务器获取到在点查操作中,16B大小的KV对的延迟时间为0.01ms,32B大小的KV对的延迟时间为0.005ms,48B大小的KV对的延迟时间为0.009ms,则服务器可以估算出在当前工作负载A中,16B的KV对在点查操作中的耗时ta1为1000*0.01ms,32B的KV对在点查操作中的耗时ta2为1000*0.05ms,48B的KV对在点查操作中的耗时ta3为1000*0.09ms。以此类推,服务器可以估算出16B的KV对在全表扫描操作和更新操作中的耗时分别为tb1和tc1,服务器将16B的KV对在点查操作、全表扫描操作和更新操作中的耗时求和后即为总耗时T1=ta1+tb1+tc1。同样,对于其他大小的KV对,服务器也可以算出一个预估的总耗时,假设服务器将32B的KV对在点查操作、全表扫描操作和更新操作中的耗时求和后即为总耗时T2=ta2+tb2+tc2。服务器将48B的KV对在点查操作、全表扫描操作和更新操作中的耗时求和后即为总耗时T3=ta3+tb3+tc3。由于T3<T1<T2,因此,服务器可以选取预估总耗时最短的T3作为目标总体耗时,并将作为目标总体耗时T3所对应的KV对大小即48B作为最优的KV对大小,即服务器根据当前工作负载A中各个读写操作比例和不同KV对的读写操作性能参数,确定当前工作负载A下键值对所占用空间的最优取值范围为48B。由此使得,服务器可以基于性能模型,直接获取到不同空间大小的键值对所对应的读写操作性能参数,并结合不同工作负载的运行参数,快速准确的估算出键值对所占用空间的最优范围,提高数据处理过程中的计算效率。For example, take the overall performance parameter as the overall time consumption as an example for illustration. Assuming that the current workload A contains 1,000 point query operations, 10 full table scan operations involving 100,000 rows, 1,000 update operations, 0 insert operations, and 0 delete operations, the server obtains the data of workload A within a preset time period. The running parameters include 1000 point query operations, 10 full table scan operations involving 100,000 rows, 1000 update operations, 0 insert operations and 0 delete operations. Further, the server can obtain the performance of each size KV pair under different I/O operations from the pre-established performance model. For example, the server obtains a delay time of 0.01 for a 16B KV pair in the check operation. ms, the delay time of a 32B-sized KV pair is 0.005ms, and the 48B-sized KV pair’s delay time is 0.009ms, then the server can estimate the time-consuming of the 16B KV pair in the current workload A in the check operation. t a1 is 1000*0.01ms, the time consuming t a2 of the 32B KV pair in the enumeration operation is 1000*0.05ms, and the time consuming t a3 of the 48B KV pair in the enumeration operation is 1000*0.09ms. By analogy, the server can estimate that the time-consuming of 16B KV pairs in the full table scan operation and the update operation are t b1 and t c1 respectively, and the server uses the 16B KV pair in the check operation, the full table scan operation and the update operation. The total time consumption T1=t a1 +t b1 +t c1 is obtained by summing the time consumption in . Similarly, for KV pairs of other sizes, the server can also calculate an estimated total time-consuming. Assuming that the server sums the time-consuming of 32B KV pairs in the check operation, full table scan operation and update operation, the total time is calculated as the total time. Time-consuming T2=t a2 +t b2 +t c2 . After the server sums up the time consuming of the 48B KV pair in the check operation, the full table scan operation and the update operation, the total time consuming is T3=t a3 +t b3 +t c3 . Since T3<T1<T2, the server can select T3 with the shortest estimated total time-consuming as the target total time-consuming, and take the KV pair size corresponding to T3 as the target total time-consuming, that is, 48B, as the optimal KV pair size. That is, the server determines that the optimal value range of the space occupied by the key-value pair under the current workload A is 48B according to the ratio of each read and write operation in the current workload A and the read and write operation performance parameters of different KV pairs. As a result, the server can directly obtain the performance parameters of read and write operations corresponding to key-value pairs of different space sizes based on the performance model, and combine the operating parameters of different workloads to quickly and accurately estimate the space occupied by the key-value pairs. Optimal range to improve computational efficiency during data processing.
在一个实施例中,根据查询语句的语义信息,确定存储区的分区方式的步骤,包括:In one embodiment, the step of determining the partition mode of the storage area according to the semantic information of the query statement includes:
获取数据表所涉及的查询语句和查询语句总量;Obtain the query statements and the total number of query statements involved in the data table;
确定各查询语句的数量与查询语句总量之间的比值;Determine the ratio between the number of each query statement and the total number of query statements;
在所涉及的查询语句中,基于比值确定目标查询语句;Among the query statements involved, determine the target query statement based on the ratio;
基于目标查询语句中的列信息,确定存储区的分区方式。Based on the column information in the target query statement, determine how the storage area is partitioned.
其中,查询语句总量是指当前工作负载中所涉及的全部查询语句的数量,例如,假设当前工作负载A包含1000个点查询操作、10个涉及100000行的全表扫描操作、1000个更新操作、0个插入操作和0个删除操作,则当前工作负载A中所涉及的查询语句总量为S=1000+10+1000=2010。The total number of query statements refers to the number of all query statements involved in the current workload. For example, suppose the current workload A contains 1000 point query operations, 10 full table scan operations involving 100,000 rows, and 1000 update operations , 0 insert operations and 0 delete operations, the total number of query statements involved in the current workload A is S=1000+10+1000=2010.
具体的,服务器根据读写操作比例和读写操作性能参数,确定键值对所占用空间的第一空间大小之后,服务器可以获取数据表所涉及的查询语句和查询语句总量,确定各查询语句的数量与查询语句总量之间的比值,并在所涉及的查询语句中,基于比值确定目标查询语句;进一步的,服务器可以基于目标查询语句中的列信息,确定存储区的分区方式。即服务器可以根据查询语句的语意信息,提取出相关列的信息,以推导出列分区的候选方案。例如,如图3中(1)所示,列分区是指在垂直方向将一张数据表划分为若干子集。Specifically, after the server determines the first space size of the space occupied by the key-value pair according to the read-write operation ratio and the read-write operation performance parameters, the server can obtain the query statements and the total number of query statements involved in the data table, and determine each query statement The ratio between the number of query statements and the total number of query statements, and among the query statements involved, the target query statement is determined based on the ratio; further, the server can determine the partition mode of the storage area based on the column information in the target query statement. That is, the server can extract relevant column information according to the semantic information of the query statement, so as to deduce the candidate scheme of column partitioning. For example, as shown in (1) in Figure 3, column partitioning refers to dividing a data table into several subsets in the vertical direction.
举个例子,以数据库中数据表A为例进行说明。假设在工作负载A中涉及数据表A的查询语句为100个,其中,全表扫描操作的查询语句为10个,第一类查询语句为50个,第二类查询语句为40个,则服务器可以获取到该数据表A上所涉及的所有查询语句和查询语句总量为100。对于每个查询语句,服务器可以解析其语法,将其中涉及的列信息取出并保存。服务器可以确定各个查询语句的数量与查询语句总量之间的比值,即得到全表扫描操作的查询语句的占比为S1=10/100=0.1、第一类查询语句的占比为S2=50/100=0.5、以及第二类查询语句的占比为S3=40/100=0.4,则服务器可以根据上述各种查询语句的占比对各查询语句进行排序,取出占比最高的前n位查询语句作为目标查询语句。假设占比最高的查询语句A为:查询第10列和第11列中的数据,则服务器可以从占比最高的查询语句A开始,根据该查询语句A中对应的相关列进行分区,即服务器可以将数据表A中的第10列和第11列聚集为一个列组。以此类推,之后服务器再取出次高占比的查询语句B,根据查询语句B中对应的相关列进行分区,重复上述列分区操作,直到数据表A中的所有属性都已经被分区完毕。For example, take the data table A in the database as an example for illustration. Assuming that there are 100 query statements involving data table A in workload A, among which, there are 10 query statements for full table scan operations, 50 query statements of the first type, and 40 query statements of the second type, then the server All the query statements and query statements involved in the data table A can be obtained and the total amount is 100. For each query statement, the server can parse its syntax, fetch and save the column information involved. The server can determine the ratio between the number of each query statement and the total number of query statements, that is, the proportion of the query statement obtained by the full table scan operation is S1=10/100=0.1, and the proportion of the first type of query statement is S2= 50/100=0.5, and the proportion of the second type of query statement is S3=40/100=0.4, then the server can sort each query statement according to the proportion of the above various query statements, and retrieve the top n with the highest proportion Bit query statement as the target query statement. Assuming that the query statement A with the highest proportion is: querying the data in the 10th and 11th columns, the server can start from the query statement A with the highest proportion and partition according to the relevant columns in the query statement A, that is, the server Columns 10 and 11 in Data Table A can be aggregated into a column group. By analogy, the server then retrieves the query statement B with the second highest proportion, performs partitioning according to the relevant columns in query statement B, and repeats the above column partitioning operation until all attributes in data table A have been partitioned.
此外,分区算法在处理同一列处在多个查询中时,可以采用不同的策略。当多个查询语句中涉及同一列时,这个列称为“冲突列”,本申请中针对冲突列的处理策略可以包括:第一种策略是高优先级查询的分组独占冲突列,高优先级查询是指排序位置靠前的查询语句。在这种策略下,更重要的查询即优先级更高或权重更大的查询,将冲突列加入自己的分组中,而其他的分组中则不包含这一列。第二种策略是合并这些分组。在这种策略下,包含冲突列的分组将会合并为更大的一个分组。两种策略各有优势,第一种策略更倾向于符合高优先级查询的特征,比较适合不同查询之间的占比差距较大的情况;第二种策略比较适合不同查询之间占比差距不大的情况。由此,能够基于不同工作负载运行参数中的查询语句的占比,确定在数据表在垂直方向上的分区方式,以解决传统方式中不能根据不同负载匹配最优存储模式的问题,有效提升了系统的I/O效率,从而有效提升了系统的整体性能。In addition, partitioning algorithms can employ different strategies when processing the same column in multiple queries. When the same column is involved in multiple query statements, this column is called a "conflict column", and the processing strategies for conflicting columns in this application may include: Query refers to the query statement with the top sorting position. Under this strategy, more important queries, i.e. those with higher priority or weight, add the conflicting column to its own grouping, while other groups do not include this column. The second strategy is to combine these groupings. Under this strategy, groups containing conflicting columns are merged into a larger group. Both strategies have their own advantages. The first strategy is more likely to conform to the characteristics of high-priority queries, and is more suitable for situations where the proportion of different queries has a large gap; the second strategy is more suitable for the proportion gap between different queries. Not a big case. Therefore, it is possible to determine the vertical partition mode of the data table based on the proportion of query statements in the operating parameters of different workloads, so as to solve the problem that the optimal storage mode cannot be matched according to different loads in the traditional method, which effectively improves the The I/O efficiency of the system can effectively improve the overall performance of the system.
在一个实施例中,数据密度包括第一数据密度;根据数据分布确定候选分组范围下的数据密度的步骤,包括:In one embodiment, the data density includes a first data density; the step of determining the data density under the candidate grouping range according to the data distribution includes:
确定候选分组范围;Determine the candidate grouping range;
基于数据表的主键数值和候选分组范围确定数据分组;Determine the data grouping based on the primary key value of the data table and the candidate grouping range;
根据数据分布确定各数据分组内的第二数据密度;determining the second data density in each data packet according to the data distribution;
将第二数据密度的均值作为候选分组范围下的第一数据密度。The average value of the second data density is taken as the first data density under the candidate grouping range.
其中,第一数据密度是指各个候选分组范围下的平均数据密度,例如,假设分组范围为8时,数据表A中数据被划分为4个数据分组,服务器可以将4个数据分组的数据密度取平均值后,得到的平均数据密度S1作为该分组范围为8时所对应的第一数据密度。The first data density refers to the average data density in each candidate grouping range. For example, if the grouping range is 8, the data in data table A is divided into 4 data groups, and the server can divide the data density of the 4 data groups into After taking the average value, the obtained average data density S1 is taken as the first data density corresponding to the grouping range of 8.
具体的,服务器根据查询语句的语义信息,确定存储区的分区方式之后,服务器可以确定候选分组范围,例如,候选分组范围可以为系统中预先设置好的一组候选分组范围{21、22和23}。进一步的,服务器可以基于数据表的主键数值和候选分组范围{21、22和23}确定各个候选分组范围下的数据分组,并根据当前工作负载下的数据分布确定各个数据分组内的第二数据密度,服务器将第二数据密度的均值作为各个候选分组范围下的第一数据密度。Specifically, after the server determines the partition mode of the storage area according to the semantic information of the query statement, the server can determine the candidate grouping range. For example, the candidate grouping range can be a set of candidate grouping ranges preset in the system {2 1 , 2 2 and 2 3 }. Further, the server may determine the data groups in each candidate grouping range based on the primary key value of the data table and the candidate grouping ranges {2 1 , 2 2 and 2 3 }, and determine the data groups in each data group according to the data distribution under the current workload. For the second data density, the server uses the average value of the second data density as the first data density in each candidate grouping range.
举个例子,假设系统中预先设置的候选分组范围为{21、22和23},数据表A中包含两行数据,则服务器获取到上述候选分组范围后,服务器可以基于数据表A的主键数值和上述候选分组范围确定数据分组,即当分组范围为2时,数据表A中的第一行数据的主键为int类型,数值为32;数据表A中的第二行数据的主键数值为31。假设系统中预设的分组策略为采用哈希函数,以哈希函数为整除函数为例进行说明,即分组函数为y=主键数值÷分组范围,y表示分组序号,即分组序号=主键数值÷分组范围,则服务器可以基于上述分组函数,确定主键数值为32的第一行数据,其分组序号=32÷2=16,其应当分在第16组;对于主键数值为31的第二行数据,其分组序号=31÷2=15,其应当分在第15组。以此类推,当分组范围为4时,服务器可以基于上述分组函数,确定主键数值为32的第一行数据,其分组序号=32÷4=8,其应当分在第8组;对于主键数值为31的第二行数据,其分组序号=31÷4=7,其应当分在第7组。当分组范围为8时,服务器可以基于上述分组函数,确定主键数值为32的第一行数据,其分组序号=32÷8=4,其应当分在第4组;对于主键数值为31的第二行数据,其分组序号=31÷8=3,其应当分在第3组。For example, assuming that the preset candidate grouping range in the system is {2 1 , 2 2 and 2 3 }, and the data table A contains two rows of data, after the server obtains the above candidate grouping range, the server can base on the data table A. The primary key value and the above candidate grouping range determine the data grouping, that is, when the grouping range is 2, the primary key of the first row of data in data table A is int type and the value is 32; the primary key of the second row of data in data table A is The value is 31. Assuming that the preset grouping strategy in the system is to use a hash function, take the hash function as the divisible function as an example, that is, the grouping function is y=primary key value÷grouping range, and y represents the grouping sequence number, that is, the grouping sequence number=primary key value÷ grouping range, the server can determine the first row of data with a primary key value of 32 based on the above grouping function, and its grouping sequence number=32÷2=16, which should be in the 16th group; for the second row of data with a primary key value of 31 , its grouping sequence number=31÷2=15, and it should be in the 15th group. By analogy, when the grouping range is 4, the server can determine the first row of data with a primary key value of 32 based on the above grouping function, and its grouping sequence number=32÷4=8, which should be in the 8th group; for the primary key value The second line of data is 31, and its grouping sequence number=31÷4=7, which should be divided into the 7th group. When the grouping range is 8, the server can determine, based on the above grouping function, that the first row of data whose primary key value is 32, whose grouping sequence number=32÷8=4, should be in the 4th group; for the first row whose primary key value is 31 Two lines of data, whose grouping sequence number=31÷8=3, should be divided into the third group.
进一步的,对于候选分组范围{21、22和23}中的每个候选分组范围,计算每个候选分组范围下的平均数据密度。由于主键数值的分布不一定是严格递增的,因此,可能出现某个数据分组内实际元素达不到分组范围的情况。例如,以分组范围为8进行举例说明,当分组范围为8时,数据表A中数据被划分为4个数据分组,服务器可以分别统计出组1、组2、组3和组4内的数据密度,若数据表A中主键数值为1~7的数据不存在,只有主键数值为0的数据被分到了组1中,因此这个分组中主键数值为1~7的这些位置空了出来,此时,服务器可以统计出该数据分组即组1的数据密度为1/8,服务器逐一统计候选分组范围为8时的各个数据分组内的数据密度,得到组1、组2、组3和组4的数据密度分别为S1=1/8、S2、S3和S4,服务器可以将统计得到的所有数据分组的数据密度求平均值S,求出的平均值S即为该分组范围为8时所对应的第一数据密度S8=(S1+S2+S3+S4)÷4=S。以此类推,服务器可以分别计算出分组范围为4和分组范围为2时所对应的第一数据密度为S4和S2,即服务器最终得到了各个候选分组范围下的平均数据密度为S8、S4和S2。由此,能够基于不同工作负载运行参数中的数据分布情况,确定各个候选分组范围下的数据密度,以解决传统方式中不能根据不同负载匹配最优存储模式的问题,有效提升了系统的I/O效率,从而有效提升了系统的整体性能。Further, for each candidate grouping range in the candidate grouping ranges {2 1 , 2 2 and 2 3 }, the average data density under each candidate grouping range is calculated. Since the distribution of primary key values is not necessarily strictly increasing, it may happen that the actual elements in a certain data group cannot reach the grouping range. For example, taking the grouping range of 8 as an example, when the grouping range is 8, the data in data table A is divided into 4 data groups, and the server can count the data in group 1,
可以理解,本申请中的分组策略包括但不限于采用哈希函数的分组方式,还可以为其他的分组策略,例如,顺序分组策略,顺序分组是指按照数据的到来顺序,将一系列连续到来的数据存放到同一个数据分组中。对于顺序分组策略,也可采用上述方式计算对应的数据密度。It can be understood that the grouping strategy in this application includes, but is not limited to, a grouping method using a hash function, and can also be other grouping strategies, such as a sequential grouping strategy. The data are stored in the same data group. For the sequential grouping strategy, the corresponding data density can also be calculated in the above manner.
在一个实施例中,基于分区方式、候选分组范围、数据密度和第一空间大小,确定在运行工作负载时数据库中数据表的存储模式的步骤,包括:In one embodiment, the step of determining the storage mode of the data table in the database when the workload is running, based on the partition mode, the candidate grouping range, the data density and the first space size, includes:
基于分区方式、候选分组范围以及候选分组范围下的第一数据密度,确定候选分组范围下的键值对所占用的第二空间大小;Based on the partition mode, the candidate grouping range, and the first data density in the candidate grouping range, determine the size of the second space occupied by the key-value pairs in the candidate grouping range;
当第二空间大小与第一空间大小之间的差值满足预设差值条件时,基于第二空间大小对应的分区方式和候选分组范围确定在运行工作负载时数据库中数据表的存储模式。When the difference between the second space size and the first space size satisfies the preset difference condition, the storage mode of the data table in the database when the workload is running is determined based on the partition mode and the candidate grouping range corresponding to the second space size.
其中,第二空间大小是指根据分区方式、候选分组范围以及候选分组范围下的第一数据密度求出的键值对所占用空间大小,例如,在数值上,分组范围*数据密度*分区大小=KV对所占用空间大小,服务器可以基于上述计算方式求出不同候选分组范围下的键值对所占用的第二空间大小。The second space size refers to the size of the space occupied by the key-value pair calculated according to the partitioning method, the candidate grouping range, and the first data density under the candidate grouping range, for example, numerically, grouping range * data density * partition size = the size of the space occupied by the KV pair, the server may obtain the second size of the space occupied by the key-value pairs in different candidate grouping ranges based on the above calculation method.
具体的,服务器根据数据分布确定各个候选分组范围下的数据密度之后,服务器可以基于分区方式、候选分组范围以及候选分组范围下的第一数据密度,确定各个候选分组范围下的键值对所占用的第二空间大小;当第二空间大小与第一空间大小之间的差值满足预设差值条件时,服务器可以基于第二空间大小对应的分区方式和候选分组范围确定在运行工作负载时数据库中数据表的最优存储模式。如图3所示,服务器会根据分区方式和分组范围将数据表同时进行水平、垂直的划分,得到最终的存储模式。即服务器依据步骤206中确定的分区方式、步骤208中确定的各个候选分组范围和数据密度,可以求出KV对的大小。在数值上,当某个候选分组范围A*该候选分组范围下的数据密度S*分区大小B,得到的KV对的第二空间大小与步骤204中确定的最优的第一空间大小之间的差值满足预设差值阈值时,服务器可以基于上述第二空间大小对应的分区方式和候选分组范围,确定在运行当前工作负载时数据库中数据表的最优存储模式。此外,当服务器确定在运行当前工作负载时数据库中数据表的最优存储模式之后,在处于空闲状态时,服务器可以将数据库中数据表的数据按照上述得到的最优存储模式重新编码并存储。Specifically, after the server determines the data density in each candidate grouping range according to the data distribution, the server may determine the occupied key-value pairs in each candidate grouping range based on the partition mode, the candidate grouping range, and the first data density in the candidate grouping range When the difference between the second space size and the first space size satisfies the preset difference condition, the server can determine when running the workload based on the partition mode and the candidate grouping range corresponding to the second space size The optimal storage mode for data tables in the database. As shown in Figure 3, the server will divide the data table horizontally and vertically according to the partition method and grouping range to obtain the final storage mode. That is, the server can obtain the size of the KV pair according to the partition mode determined in
举个例子,假设系统中预先设置的候选分组范围为{21、22和23},数据表A中包含两行数据,服务器分别计算出分组范围为8、分组范围为4和分组范围为2时所对应的第一数据密度为S8=1/8、S4=1/4和S2=3/4之后,服务器可以基于分区方式、各个候选分组范围以及各个候选分组范围下的第一数据密度,确定各个候选分组范围下的键值对所占用的第二空间大小,假设分区方式A为将数据表A中每一列数据中的第一个单元格和第二个单元格聚集为一个列组,每个单元格中存储4字节数据,则每个列组的大小为8字节,则服务器可以基于下述计算函数:分组范围*数据密度*分区大小=KV对所占用空间大小,计算出各个候选分组范围下的键值对所占用的第二空间大小,即当分组范围为8时,服务器计算出的KV对所占用空间的第二空间大小L8=分组范围*数据密度*分区大小=8*1/8*8=1;当分组范围为4时,服务器计算出的KV对所占用空间的第二空间大小L4=分组范围*数据密度*分区大小=4*1/4*8=8;当分组范围为2时,服务器计算出的KV对所占用空间的第二空间大小L2=分组范围*数据密度*分区大小=2*3/4*8=12,假设服务器在步骤204中根据当前工作负载中的读写操作比例和性能模型中的读写操作性能参数,确定键值对所占用空间的第一空间大小L为16B、预设差值条件为小于差值阈值5,则服务器可以分别计算各个候选分组范围下的键值对所占用的第二空间大小即L8、L4、L2与第一空间大小L之间的差值的绝对值,得到|L8-L|=|1-16|=15、|L4-L|=|8-16|=8、|L2-L|=|12-16|=4,由于|L2-L|=|12-16|=4小于差值阈值5,即当第二空间大小L2与第一空间大小L之间的差值满足预设差值条件时,服务器可以基于该第二空间大小L2对应的分区方式A和候选分组范围2确定在运行当前工作负载时数据库中数据表的最优存储模式,即服务器可以确定在运行当前工作负载时数据库中数据表的最优存储模式为:服务器同时将数据表A按照分组范围为2进行水平划分,以及按照分区方式A进行垂直划分,即服务器将数据表A中每列数据中每两个单元格数据聚集为一个列组,同时将数据表A中每行数据按照分组范围为2进行分组,即可得到如图3中(3)所示的最优存储模式的示意图。由此解决了传统方式中不能根据不同负载匹配最优存储模式的问题,有效提升了系统的I/O效率,从而有效提升了系统的整体性能。For example, assuming that the preset candidate grouping ranges in the system are {2 1 , 2 2 and 2 3 }, the data table A contains two rows of data, and the server calculates that the grouping range is 8, the grouping range is 4 and the grouping range respectively. When it is 2, the corresponding first data density is S 8 =1/8, S 4 =1/4, and S 2 =3/4. The first data density is to determine the second space occupied by the key-value pairs under each candidate grouping range, assuming that the partition mode A is to aggregate the first and second cells in each column of data in table A is a column group, each cell stores 4 bytes of data, and the size of each column group is 8 bytes, then the server can calculate the function based on the following: grouping range * data density * partition size = occupied by KV pairs Space size, calculate the second space size occupied by key-value pairs under each candidate grouping range, that is, when the grouping range is 8, the second space size of the space occupied by the KV pair calculated by the server L 8 = grouping range* Data density*partition size=8*1/8*8=1; when the grouping range is 4, the second space size L 4 of the space occupied by the KV pair calculated by the server = grouping range*data density*partition size=4 *1/4*8=8; when the grouping range is 2, the second space size L 2 of the space occupied by the KV pair calculated by the server = grouping range*data density*partition size=2*3/4*8= 12. Suppose the server determines in step 204 that the first space size L of the space occupied by the key-value pair is 16B according to the read-write operation ratio in the current workload and the read-write operation performance parameters in the performance model, and the preset difference condition is is less than the difference threshold value of 5, the server can calculate the absolute difference between the second space size occupied by the key-value pairs in each candidate grouping range, that is, the difference between L 8 , L 4 , L 2 and the first space size L value, we get |L 8 -L|=|1-16|=15, |L 4 -L|=|8-16|=8, |L 2 -L|=|12-16|=4, since | L 2 -L|=|12-16|=4 is less than the difference threshold 5, that is, when the difference between the second space size L 2 and the first space size L satisfies the preset difference condition, the server may The partition mode A corresponding to the second space size L 2 and the candidate grouping range 2 determine the optimal storage mode of the data table in the database when the current workload is running, that is, the server can determine the optimal storage mode of the data table in the database when the current workload is running The storage mode is: the server simultaneously divides the data table A horizontally according to the grouping range of 2, and divides it vertically according to the partition method A, that is, the server aggregates the data of each two cells in each column of data in the data table A into a column group. , at the same time, group each row of data in data table A according to the grouping range of 2, so as to obtain a schematic diagram of the optimal storage mode as shown in (3) in FIG. 3 . This solves the problem that the traditional method cannot match the optimal storage mode according to different loads, effectively improves the I/O efficiency of the system, and thus effectively improves the overall performance of the system.
在一个实施例中,所述方法还包括:In one embodiment, the method further includes:
当工作负载更换为其它工作负载时,确定其它工作负载在数据库中的目标存储模式;When the workload is replaced with other workloads, determine the target storage mode of the other workloads in the database;
在处于空闲状态时,将存储模式转换为目标存储模式。When idle, transition the storage mode to the target storage mode.
其中,其它工作负载用于区分当前的工作负载,例如,当工作负载A更换为工作负载B时,工作负载B即为其他工作负载。The other workload is used to distinguish the current workload. For example, when workload A is replaced with workload B, workload B is the other workload.
目标存储模式是指在运行其它工作负载时数据库中数据表的存储模式,例如,服务器可以根据本申请中提供的数据库存储模式的自适应方法,确定在运行工作负载A时数据库中数据表的最优存储模式为多行存模式,当工作负载A更换为工作负载B时,服务器可以根据本申请中提供的数据库存储模式的自适应方法,重新确定在运行工作负载B时数据库中数据表的最优存储模式。可以理解,目标存储模式可以与当前的存储模式相同,也可以不同。The target storage mode refers to the storage mode of the data table in the database when running other workloads. For example, the server can determine the maximum size of the data table in the database when running workload A according to the adaptive method of the database storage mode provided in this application. The optimal storage mode is a multi-row storage mode. When workload A is replaced with workload B, the server can re-determine the maximum size of the data table in the database when running workload B according to the adaptive method of the database storage mode provided in this application. optimal storage mode. It can be understood that the target storage mode may be the same as or different from the current storage mode.
具体的,在实际运行中,服务器系统可以根据工作负载的不同,自动切换副本节点的存储模式。当服务器中的自适应模块感知到负载发生变化时,就会重新评估出最优存储模式以及相应的转换代价。在系统较空闲时段,服务器可以进行存储模式转换的工作,将当前存储模式转换为目标存储模式,即转换为新的最优存储模式。Specifically, in actual operation, the server system can automatically switch the storage mode of the replica node according to different workloads. When the adaptive module in the server senses the change in load, it will re-evaluate the optimal storage mode and the corresponding conversion cost. During a relatively idle period of the system, the server can perform the work of storage mode conversion, and convert the current storage mode to the target storage mode, that is, to a new optimal storage mode.
例如,假设在分布式服务器集群中共有3个服务节点,分别为A服务节点、B服务节点和C服务节点,A服务节点对应的存储模式为行存模式、B服务节点对应的存储模式为多行存模式,以及C服务节点对应的存储模式为列族段存模式。在某个时刻,工作负载A更换为工作负载B,当服务器中的自适应模型确定更换后的工作负载B需要的存储模型为列存,而当前服务集群中的3个服务器节点中没有列存模式,则需选择出其中一个服务节点进行存储模式的转换,可能是行存到列存的转换。此外,在进行转换时需要考虑服务节点的负载,优先选择负载小的节点。本申请中的存储模式包括行存模式、列存模式、单元格存模式、多行存模式、列段存模式、列族存模式、列族段存模式或整表存模式中的至少一种。如图4所示,为存储模式的转换示意图,图4中考虑了8种不同的存储模式,对应的一共有7*8=56种转换方式。由此解决了传统方式中不能根据不同负载匹配最优存储模式的问题,在存储模式多样性的基础上,提出了多副本异构协同计算的方案,可以实现在同个共识组的不同副本上存放不同模式的数据,并可以为不同的工作负载选择最合适的存储节点。相较于传统的分布式系统,本实施例中的方法可以同时为AP和TP选择具有最优存储模式的存储节点,从而更好的适应HTAP的场景。同时,本实施例中的方法考虑到了不同服务节点间负载均衡的问题,因此,可以有效提高各节点的利用率,从而提升系统整体的性能。For example, suppose there are 3 service nodes in the distributed server cluster, namely A service node, B service node and C service node. The storage mode corresponding to the A service node is row storage mode, and the storage mode corresponding to the B service node is multiple. The row storage mode and the storage mode corresponding to the C service node are the column family segment storage mode. At a certain moment, workload A is replaced with workload B. When the adaptive model in the server determines that the storage model required by the replaced workload B is column storage, there is no column storage in the three server nodes in the current service cluster. mode, you need to select one of the service nodes to convert the storage mode, which may be the conversion from row storage to column storage. In addition, the load of the service node needs to be considered when performing the conversion, and the node with the smaller load is preferentially selected. The storage mode in this application includes at least one of row storage mode, column storage mode, cell storage mode, multi-row storage mode, column segment storage mode, column family storage mode, column family segment storage mode or whole table storage mode . As shown in FIG. 4 , which is a schematic diagram of the conversion of storage modes, 8 different storage modes are considered in FIG. 4 , corresponding to a total of 7*8=56 conversion modes. This solves the problem that the traditional method cannot match the optimal storage mode according to different loads. On the basis of the diversity of storage modes, a multi-copy heterogeneous collaborative computing scheme is proposed, which can be implemented on different copies of the same consensus group. Store data in different modes and select the most suitable storage node for different workloads. Compared with the traditional distributed system, the method in this embodiment can simultaneously select the storage node with the optimal storage mode for the AP and the TP, thereby better adapting to the HTAP scenario. At the same time, the method in this embodiment takes into account the problem of load balancing among different service nodes, therefore, the utilization rate of each node can be effectively improved, thereby improving the overall performance of the system.
在一个实施例中,将存储模式转换为目标存储模式之后,所述方法还包括:In one embodiment, after converting the storage mode to the target storage mode, the method further includes:
按照目标存储模式所对应的编码方式,对数据表中的数据重新编码,得到编码后的数据;According to the encoding mode corresponding to the target storage mode, re-encode the data in the data table to obtain the encoded data;
将编码后的数据进行存储。Store the encoded data.
具体的,当工作负载更换为其它工作负载时,服务器确定其它工作负载在数据库中的目标存储模式,在处于空闲状态时,服务器可以将存储模式转换为目标存储模式。即服务器可以按照目标存储模式所对应的编码方式,对数据表中的数据重新编码,得到编码后的数据,并将编码后的数据进行存放。Specifically, when the workload is replaced with another workload, the server determines the target storage mode of the other workload in the database, and when in an idle state, the server can convert the storage mode to the target storage mode. That is, the server can re-encode the data in the data table according to the encoding mode corresponding to the target storage mode, obtain the encoded data, and store the encoded data.
例如,假设当前存储模式为行存,目标存储模式为列存,行存是指数据以一行作为一个KV对存储在底层存储引擎中,列存是指数据以一列作为一个KV对存储在底层存储引擎中。由于行存模式对应的编码方式为:将数据表A的表标识tableA、数据表A中每行数据的行标识rowID作为key,将每行中的数据作为value进行存储,则当服务器处于空闲状态时,服务器可以按照目标存储模式即列存所对应的编码方式,对数据表A中的数据重新编码,即服务器读取数据表A中的数据,将数据表A的表标识tableA、数据表A中每列数据的列标识columnID作为key,将每列中的数据作为value,得到key-value所组成的数据,并将编码后的key-value所组成的数据进行存储。由此,可以根据工作负载的不同,自动切换副本的存储模式,当自适应模块感知到负载发生变化时,就会评估出最优存储模式以及相应的转换代价,解决了传统方式中不能根据不同负载匹配最优存储模式的问题,有效提升了系统的I/O效率,从而有效提升了系统的整体性能。For example, assuming that the current storage mode is row storage and the target storage mode is column storage, row storage means that data is stored in the underlying storage engine with a row as a KV pair, and column storage means that data is stored in the underlying storage with a column as a KV pair in the engine. Because the encoding method corresponding to the row storage mode is: use the table ID tableA of data table A, the row ID rowID of each row of data in data table A as the key, and store the data in each row as the value, then when the server is in an idle state At this time, the server can re-encode the data in the data table A according to the target storage mode, that is, the encoding method corresponding to the column storage, that is, the server reads the data in the data table A, and identifies the table A and the data table A in the data table A. The column ID of each column of data is used as the key, and the data in each column is used as the value to obtain the data composed of key-value, and store the data composed of the encoded key-value. As a result, the storage mode of the copy can be automatically switched according to different workloads. When the adaptive module senses a change in the load, it will evaluate the optimal storage mode and the corresponding conversion cost. The problem of load matching the optimal storage mode effectively improves the I/O efficiency of the system, thereby effectively improving the overall performance of the system.
在一个实施例中,所述方法还包括:In one embodiment, the method further includes:
当存储模式为行存模式、且目标存储模式为单元格存模式时,读取数据表中的数据;When the storage mode is row storage mode and the target storage mode is cell storage mode, read the data in the data table;
将数据表的表标识、数据表中目标行的行标识以及数据表中目标列的列标识作为键,将目标行和目标列之间相交的单元格中的数据作为值,得到键和值所组成的单元格数据;Use the table ID of the data table, the row ID of the target row in the data table, and the column ID of the target column in the data table as the key, and use the data in the intersecting cell between the target row and the target column as the value, and get the value between the key and the value. composed cell data;
将键和值所组成的单元格数据进行存储。Stores cell data consisting of keys and values.
其中,单元格存是指将一行或一列中的单元格作为KV对存储在底层存储引擎中。单元格存打破了整行或者整列的存储模式,使得KV单元的粒度更小,在一些只想对单个单元格进行操作的工作负载中,会体现出比较好的性能优势。在传统行存、列存的存储模式下,当需要更新某个单元格时,服务器必须先将整行整列数据读出来,进行修改后再写回去,这就造成了严重的读写放大,在这种情况下,采用单元格存则可以很好的消除这种写放大。本申请中的单元格存模式又可以分为单元格行存和单元格列存,单元格行存是指将单元格按照行的格式组织起来,使整体的存储在行上保持连续性;单元格列存是指将单元格按照列的格式组织起来,使整体的存储在列上保持连续性。Among them, cell storage refers to storing cells in a row or column as KV pairs in the underlying storage engine. Cell storage breaks the storage mode of the entire row or column, making the granularity of the KV unit smaller. In some workloads that only want to operate on a single cell, it will show a better performance advantage. In the traditional storage mode of row storage and column storage, when a cell needs to be updated, the server must first read out the entire row and column data, modify it and then write it back, which causes serious read and write amplification. In this case, using cell storage can eliminate this write amplification very well. The cell storage mode in this application can be further divided into cell row storage and cell column storage. Cell row storage refers to organizing cells according to the format of rows, so that the overall storage maintains continuity on the row; Grid storage refers to organizing cells according to the format of columns, so that the overall storage can maintain continuity on the columns.
具体的,由于转换种类较多,总体上,转换存储模式可以抽象为两类转换,分别是拆分和重组。拆分的应用场景主要是将数据从一个较大的KV变成几个较小的KV。即当数据以行存模式写入时,其key的编码形式为:表标识+行标识,Value为每行中的数据。在某些场景下,当需要将行存模式转为单元格存模式时,服务器需要对key-Value重新编码,为每个属性设置一个对应的单元格标识,即当数据以单元格模式写入时,其key的编码形式为:表标识+行标识+列标识,或者表标识+列标识+行标识,Value为每个单元格中的数据,由此将数据从行存模式拆分成为单元格存模式。Specifically, since there are many types of conversions, in general, the conversion storage mode can be abstracted into two types of conversions, namely splitting and reorganization. The application scenario of splitting is mainly to change the data from one large KV into several smaller KVs. That is, when data is written in row storage mode, the encoding form of its key is: table ID + row ID, and Value is the data in each row. In some scenarios, when the row storage mode needs to be converted to the cell storage mode, the server needs to re-encode the key-value and set a corresponding cell ID for each attribute, that is, when the data is written in the cell mode When the key is encoded in the form of: table ID + row ID + column ID, or table ID + column ID + row ID, Value is the data in each cell, thus splitting the data from the row storage mode into cells Grid mode.
本实施例中以单行存转换为单元格存为例进行详细说明。若当前的存储模式为行存模式、且目标存储模式为单元格存模式,服务器可以按照目标存储模式即单元格存模式所对应的编码方式,读取数据表中的数据,并将数据表的表标识、数据表中目标行的行标识以及数据表中目标列的列标识作为键,将目标行和目标列之间相交的单元格中的数据作为值,得到键和值所组成的单元格数据,并将键和值所组成的单元格数据进行存储,即可完成存储模式的转换。In this embodiment, the conversion of single row storage to cell storage is taken as an example for detailed description. If the current storage mode is row storage mode and the target storage mode is cell storage mode, the server can read the data in the data table according to the encoding method corresponding to the target storage mode, that is, the cell storage mode, and store the data in the data table. The table ID, the row ID of the target row in the data table, and the column ID of the target column in the data table are used as the key, and the data in the intersecting cell between the target row and the target column is used as the value, and the cell composed of the key and the value is obtained. The conversion of storage mode can be completed by storing the cell data composed of key and value.
举个例子,假设数据表A的结构为2*2的二维表,则服务器可以读取数据表A中的数据,并将数据表A的表标识tableA、数据表A中目标行即第一行的行标识row1,以及数据表A中目标列即第一列的列标识column1作为key,即key为:tableA+row1+column1,将目标行即第一行和目标列即第二行之间相交的单元格中的数据作为value,得到key-value所组成的第一个单元格数据,并将key-value所组成的第一个单元格数据进行存储,以此类推,服务器可以按照上述编码方式,将数据表A中所有数据进行编码,得到key-value所组成的4个键值对进行存储,4个键值对分别对应4个单元格数据。可以理解,上述编码后的key为:tableA+row1+column1,则是单元格行存,若编码后的key为:tableA+column1+row1,则是单元格列存。由此,可以根据工作负载的不同,自动切换副本的存储模式,当自适应模块感知到负载发生变化时,就会评估出最优存储模式以及相应的转换代价,解决了传统方式中不能根据不同负载匹配最优存储模式的问题,有效提升了系统的I/O效率,从而有效提升了系统的整体性能。For example, assuming that the structure of data table A is a 2*2 two-dimensional table, the server can read the data in data table A, and identify table A in data table A and the target row in data table A as the first row. The row identifier row1 of the row, and the column identifier column1 of the target column in the data table A, that is, the first column, are used as the key, that is, the key is: tableA+row1+column1, and the target row is the first row and the target column is the second row. The data in the intersecting cells is used as value, the first cell data composed of key-value is obtained, and the first cell data composed of key-value is stored, and so on, the server can follow the above coding method, encode all the data in the data table A, and obtain 4 key-value pairs composed of key-value for storage, and the 4 key-value pairs correspond to 4 cell data respectively. It can be understood that the above encoded key is: tableA+row1+column1, which is a cell row storage, and if the encoded key is: tableA+column1+row1, it is a cell column storage. As a result, the storage mode of the copy can be automatically switched according to different workloads. When the adaptive module senses a change in the load, it will evaluate the optimal storage mode and the corresponding conversion cost. The problem of load matching the optimal storage mode effectively improves the I/O efficiency of the system, thereby effectively improving the overall performance of the system.
在一个实施例中,所述方法还包括:In one embodiment, the method further includes:
当存储模式为单元格存模式、且目标存储模式为列存模式时,将单元格存模式下的单元格写入缓冲区中;When the storage mode is the cell storage mode and the target storage mode is the column storage mode, write the cells in the cell storage mode into the buffer;
当缓冲区中的单元格数目满足列存模式所需的数目时,将数据表的表标识和数据表中目标列的列标识作为键,将目标列中的数据作为值,得到键和值所组成的列数据;When the number of cells in the buffer meets the number required by the column storage mode, the table ID of the data table and the column ID of the target column in the data table are used as keys, and the data in the target column is used as the value. composed of column data;
将键和值所组成的列数据进行存储。Stores column data consisting of keys and values.
具体的,转换存储模式时,重组的应用场景主要是数据从几个较小的KV重组成一个较大的KV,例如:单行存储到多行存的转换,单元格存到行存的转换等,本实施例中以单元格存到列存的转换为例进行详细说明。Specifically, when switching the storage mode, the application scenario of reorganization is mainly to reorganize the data from several smaller KVs into a larger KV, for example: conversion from single-line storage to multi-line storage, conversion from cell storage to row storage, etc. , in this embodiment, the conversion from cell storage to column storage is taken as an example for detailed description.
假设某一副本节点上的数据开始以单元格存进行存储,随着负载的改变,需要将该副本节点的存储模式转换为列存,但是列存所需要的单元格还不够,该副本节点的服务器可以暂时将这些单元格写入缓冲区中,等到缓冲区中积攒够转换为列存所需要的单元格数目时再进行重新编码和转换。即当存储模式为单元格存模式、且目标存储模式为列存模式时,服务器可以将单元格存模式下的单元格写入缓冲区中;当缓冲区中的单元格数目满足列存模式所需的数目时,服务器可以将数据表的表标识和数据表中目标列的列标识作为键,将目标列中的数据作为值,得到键和值所组成的列数据,并将键和值所组成的列数据进行存储,即可完成存储模式的转换。由此,可以根据工作负载的不同,自动切换副本的存储模式,当自适应模块感知到负载发生变化时,就会评估出最优存储模式以及相应的转换代价,解决了传统方式中不能根据不同负载匹配最优存储模式的问题,有效提升了系统的I/O效率,从而有效提升了系统的整体性能。Assuming that the data on a replica node starts to be stored in cell storage, as the load changes, the storage mode of the replica node needs to be converted to column storage, but the cells required for column storage are not enough. The server can temporarily write these cells into the buffer, and then re-encode and convert when the buffer has enough cells to convert to column storage. That is, when the storage mode is the cell storage mode and the target storage mode is the column storage mode, the server can write the cells in the cell storage mode into the buffer; when the number of cells in the buffer meets the requirements of the column storage mode. When the required number is required, the server can use the table ID of the data table and the column ID of the target column in the data table as the key, and the data in the target column as the value to obtain the column data composed of the key and the value, and combine the key and value. The composed column data is stored, and the conversion of the storage mode can be completed. As a result, the storage mode of the copy can be automatically switched according to different workloads. When the adaptive module senses a change in the load, it will evaluate the optimal storage mode and the corresponding conversion cost. The problem of load matching the optimal storage mode effectively improves the I/O efficiency of the system, thereby effectively improving the overall performance of the system.
在一个实施例中,所述方法还包括:In one embodiment, the method further includes:
当存储模式为列段存模式时,将目标列中的至少两个单元格聚集为一个分组;When the storage mode is column segment storage mode, at least two cells in the target column are aggregated into a group;
以数据表的表标识、目标列的列标识和至少两个单元格的分组标识为键,以目标列中至少两个单元格内的数据为值进行存储。The table ID of the data table, the column ID of the target column and the group ID of at least two cells are used as keys, and the data in at least two cells in the target column is used as the value for storage.
其中,列段存是指将一列中的至少两个单元格数据聚集为一个KV对进行存储。这种机制可以理解成MySQL的垂直分区,引入这种机制的原因是:查询可能并不需要将一整列中的所有列数据全部返回。对于某些特定的工作负载,引入列段以后性能会得到一定的提升。Among them, the column segment storage refers to the aggregation of at least two cell data in a column into a KV pair for storage. This mechanism can be understood as the vertical partition of MySQL. The reason for introducing this mechanism is that the query may not need to return all the column data in a whole column. For some specific workloads, the performance will be improved after the introduction of column segments.
具体的,当服务器确定在运行当前工作负载时数据库中数据表的存储模式为列段存模式时,在服务器处于空闲状态时,服务器可以将数据表的每列中的至少两个单元格聚集为一个分组,并以数据表的表标识、每列的列标识和至少两个单元格的分组标识作为键,以每列中至少两个单元格内的数据作为值进行存储。Specifically, when the server determines that the storage mode of the data table in the database is the column segment storage mode when the current workload is running, when the server is in an idle state, the server can aggregate at least two cells in each column of the data table as A group is stored with the table ID of the data table, the column ID of each column and the group ID of at least two cells as keys, and the data in at least two cells in each column as values.
例如,如图5所示,为列段存模式的示意图。图5中第一行第6列中的第一个单元格中的row1co6,表示第一行第6列,key为:tableID+columnID+groupID,表示的含义为:表标识+列标识+分组标识。在服务器处于空闲状态时,服务器可以将数据表A的目标列即第6列中的每三个单元格聚集为一个分组,并以数据表A的表标识tableA、目标列即第6列的列标识column6和每三个单元格的分组标识group1作为键,以第6列中每三个单元格内的数据作为值进行存储,即可得到按照列段存模式存储的两个键值对,即第一个键值对的key为:tableA+column6+group1,对应的value为:第6列中第1、2、3单元格内的数据;第二个键值对的key为:tableA+column6+group2,对应的value为:第6列中第4、5、6单元格内的数据。以此类推,在服务器处于空闲状态时,服务器可以将数据表A的每个列作为目标列进行处理,即可得到将数据表A按照列段存模式存储的多个键值对。由此,根据数据的粒度以及排序方式的不同,可以拓展得到更多的存储模式,使得系统获得对多样化存储模式的支持。For example, as shown in FIG. 5 , it is a schematic diagram of a column segment memory mode. In Figure 5, row1co6 in the first cell in the first row and the sixth column represents the first row and the sixth column. The key is: tableID+columnID+groupID, which means: table ID+column ID+group ID . When the server is in an idle state, the server can aggregate every three cells in the target column of data table A, that is, the sixth column, into a group, and use the table of data table A to identify tableA and the target column, that is, the column of the sixth column. The identifier column6 and the grouping identifier group1 of every three cells are used as keys, and the data in every three cells in the sixth column is stored as the value to obtain two key-value pairs stored in the column segment storage mode, namely The key of the first key-value pair is: tableA+column6+group1, and the corresponding value is: the data in
在一个实施例中,所述方法还包括:In one embodiment, the method further includes:
当存储模式为列族段存模式时,将位于数据表中第一方向上的每列中至少两个单元格聚集为列族;When the storage mode is the column family segment storage mode, at least two cells in each column located in the first direction in the data table are aggregated into a column family;
将位于数据表中第二方向上的每行中至少两个单元格聚集为数据分组;Aggregate at least two cells in each row located in the second direction in the data table into data groupings;
以数据表的表标识、列族的族标识以及数据分组的组标识为键,以每列中至少两个单元格中的数据和每行中至少两个单元格中的数据为值进行存储。Using the table ID of the data table, the family ID of the column family, and the group ID of the data grouping as keys, the data in at least two cells in each column and the data in at least two cells in each row are stored as values.
其中,列族存是指一行中的某几个属性聚集成一个KV对,列族段存是指将多行中的几个属性聚集为一个KV对,从而提升在AP场景下系统的吞吐率,即可以理解为增加列族存的行数。Among them, column family storage means that certain attributes in a row are aggregated into a KV pair, and column family segment storage means that several attributes in multiple rows are aggregated into a KV pair, thereby improving the throughput of the system in the AP scenario. , which can be understood as increasing the number of rows stored in the column family.
第一方向可以为垂直方向,第二方向可以为水平方向。The first direction may be a vertical direction, and the second direction may be a horizontal direction.
具体的,如图6所示,为列族段存模式的示意图。图6中第1列第一行中的第一个单元格中的row1co1,表示第一行第一列,key为:tableID+CFID+groupID,表示的含义为:表标识+列族标识+分组标识。当服务器确定在运行当前工作负载时数据库中数据表的存储模式为列族段存模式时,在服务器处于空闲状态时,服务器可以将位于数据表A中水平方向上的每行中的前两个单元格聚集为一个数据分组,后四个单元格聚集为一个数据分组,即可得到12个数据分组;进一步的,服务器可以将位于数据表A中垂直方向上的左边6个数据分组聚集为一个列族,右边6个数据分组聚集为一个列族,并以数据表A的表标识tableA、列族的族标识column family1以及数据分组的组标识group1-6作为第一个键值对的键,以左边12个单元格中的数据作为第一个键值对的值进行存储;同时,以数据表A的表标识tableA、列族的族标识columnfamily2以及数据分组的组标识group7-12作为第二个键值对的键,以右边24个单元格中的数据作为第二个键值对的值进行存储。由此,根据数据的粒度以及排序方式的不同,可以拓展得到更多的存储模式,使得系统获得对多样化存储模式的支持。Specifically, as shown in FIG. 6 , it is a schematic diagram of a column family segment storage mode. In Figure 6, row1co1 in the first cell in the first row of the first column represents the first row and the first column. The key is: tableID+CFID+groupID, which means: table ID+column family ID+grouping logo. When the server determines that the storage mode of the data table in the database is the column family segment storage mode when the current workload is running, when the server is in an idle state, the server can store the first two in each row located in the horizontal direction in the data table A. The cells are aggregated into a data group, and the last four cells are aggregated into a data group, so that 12 data groups can be obtained; further, the server can aggregate the 6 data groups on the left in the vertical direction in the data table A into one Column family, the 6 data groups on the right are aggregated into a column family, and the table ID tableA of data table A, the family ID column family1 of the column family, and the group ID group1-6 of the data grouping are used as the keys of the first key-value pair. Store the data in the 12 cells on the left as the value of the first key-value pair; at the same time, use the table ID tableA of data table A, the family ID columnfamily2 of the column family, and the group ID group7-12 of the data grouping as the second The key of a key-value pair is stored with the data in the 24 cells on the right as the value of the second key-value pair. Therefore, according to the granularity and sorting method of the data, more storage modes can be expanded, so that the system can obtain support for diversified storage modes.
在一个实施例中,所述方法还包括:In one embodiment, the method further includes:
当存储模式为整表存模式时,以数据表的表标识为键,以数据表中的数据为值进行存储。When the storage mode is the whole table storage mode, the table ID of the data table is used as the key, and the data in the data table is used as the value for storage.
其中,整表存是指将整张表作为一个KV进行存储。整表存在全表扫描的查询条件下对性能提升显著。Among them, the whole table storage refers to the storage of the entire table as a KV. The performance of the entire table is significantly improved under the query condition of full table scan.
具体的,当服务器确定在运行当前工作负载时数据库中数据表的存储模式为整表存模式时,在服务器处于空闲状态时,服务器以数据表的表标识为键,以数据表中的数据为值进行存储。Specifically, when the server determines that the storage mode of the data table in the database is the whole table storage mode when running the current workload, when the server is in an idle state, the server uses the table identifier of the data table as the key, and the data in the data table as the key value is stored.
例如,如图7所示,为整表存模式的示意图。在服务器处于空闲状态时,服务器可以将数据表A的表标识tableA作为键,并将数据表A中的所有数据作为值进行存储。即可得到按照整表存模式存储的一个键值对,即键值对的key为:tableA,对应的value为:数据表A中的所有数据。由此,根据数据的粒度以及排序方式的不同,可以拓展得到更多的存储模式,使得系统获得对多样化存储模式的支持。For example, as shown in FIG. 7 , it is a schematic diagram of the whole table storage mode. When the server is in an idle state, the server can use the table identifier tableA of the data table A as a key, and store all the data in the data table A as a value. You can get a key-value pair stored in the whole table storage mode, that is, the key of the key-value pair is: tableA, and the corresponding value is: all the data in data table A. Therefore, according to the granularity and sorting method of the data, more storage modes can be expanded, so that the system can obtain support for diversified storage modes.
在一个实施例中,所述存储模式为第一存储模式;所述确定在运行所述工作负载时数据库中数据表的存储模式之后,所述方法还包括:In one embodiment, the storage mode is the first storage mode; after the determining the storage mode of the data table in the database when the workload is running, the method further includes:
接收不同类型的查询请求;Receive different types of query requests;
确定查询请求的执行时间、等待时间和传输时间;Determine the execution time, wait time and transmission time of the query request;
基于执行时间、等待时间、传输时间以及预设权重,确定查询请求所对应的第二存储模式;第一存储模式与第二存储模式是不同的存储模式;Determine the second storage mode corresponding to the query request based on the execution time, the waiting time, the transmission time and the preset weight; the first storage mode and the second storage mode are different storage modes;
将查询请求路由至第二存储模式对应的副本节点,以使副本节点基于查询请求进行数据查询。The query request is routed to the replica node corresponding to the second storage mode, so that the replica node performs data query based on the query request.
其中,第一存储模式是指在运行某个工作负载时数据库中数据表的最优存储模式,第二存储模式是指处理某个查询请求时所对应的数据库中数据表的最优存储模式。The first storage mode refers to the optimal storage mode of the data table in the database when running a certain workload, and the second storage mode refers to the optimal storage mode of the data table in the database corresponding to a query request.
具体的,查询代价是制定查询计划的重要指标,其由三部分构成:执行时间Tproc、等待时间Twait与传输时间Ttrans,即处理查询请求的总耗时T=Tproc+Twait+Ttrans,数据库中数据表的存储模式主要会影响查询请求的执行时间Tproc,因此,在本实施例中,考虑多副本节点对查询计划定制所带来的影响。即服务器确定在运行当前工作负载时数据库中数据表的存储模式为多行存模式之后,当服务器接收到不同类型的查询请求时,服务器可以确定各个查询请求的执行时间、等待时间和传输时间,并基于执行时间、等待时间、传输时间以及预设权重,确定各个查询请求所对应的第二存储模式。其中,当第一存储模式与第二存储模式是不同的存储模式时,服务器可以基于查询策略,例如,查询策略包括低负载节点所在副本优先,则服务器可以将查询请求路由至第二存储模式对应的某个负载较小的副本节点,以使该副本节点基于查询请求进行数据查询。Specifically, the query cost is an important indicator for formulating a query plan, which consists of three parts: execution time T proc , waiting time T wait and transmission time T trans , that is, the total time spent processing query requests T=T proc +T wait + T trans , the storage mode of the data table in the database mainly affects the execution time T proc of the query request. Therefore, in this embodiment, the impact of multiple replica nodes on query plan customization is considered. That is, after the server determines that the storage mode of the data table in the database is multi-row storage mode when running the current workload, when the server receives different types of query requests, the server can determine the execution time, waiting time and transmission time of each query request. And based on the execution time, the waiting time, the transmission time and the preset weight, the second storage mode corresponding to each query request is determined. Wherein, when the first storage mode and the second storage mode are different storage modes, the server may route the query request to the corresponding second storage mode based on the query policy. One of the replica nodes with less load, so that the replica node can perform data query based on the query request.
举个例子,假设在分布式服务集群中存在3个服务节点,分别为A服务节点、B服务节点和C服务节点,A服务节点对应的存储模式为行存模式、B服务节点对应的存储模式为多行存模式,以及C服务节点对应的存储模式为列族段存模式。当B服务节点接收到查询请求A时,B服务节点的服务器确定查询请求A的执行时间为Tproc、等待时间为Twait、以及传输时间为Ttrans,服务器可以基于上述执行时间Tproc、等待时间Twait、传输时间Ttrans以及预设权重,确定查询请求A所对应的第二存储模式为行存模式,由于当前B服务节点的存储模式为多行存模式,而确定查询请求A所对应的第二存储模式为行存模式,因此,B服务节点的服务器可以将查询请求A路由至行存模式对应的副本节点即A服务节点,以使A服务节点基于查询请求A进行数据查询。For example, suppose there are 3 service nodes in the distributed service cluster, namely A service node, B service node and C service node. The storage mode corresponding to service node A is row storage mode, and the storage mode corresponding to service node B is the storage mode. It is a multi-row storage mode, and the storage mode corresponding to the C service node is a column family segment storage mode. When the B service node receives the query request A, the server of the B service node determines that the execution time of the query request A is T proc , the waiting time is T wait , and the transmission time is T trans . The time T wait , the transmission time T trans and the preset weight determine that the second storage mode corresponding to the query request A is the row storage mode. Since the storage mode of the current service node B is the multi-row storage mode, it is determined that the query request A corresponds to The second storage mode is row storage mode. Therefore, the server of service node B can route query request A to the replica node corresponding to row storage mode, that is, service node A, so that service node A can perform data query based on query request A.
本实施例中,由于使用了基于多样化存储的多副本异构方案,使得在混合负载下,不同类型的请求可以在采用不同存储模式的副本上,更好的发挥磁盘读取优势,减少网络带宽的占用。In this embodiment, due to the use of a multi-copy heterogeneous solution based on diversified storage, under mixed loads, different types of requests can use the copies of different storage modes to better utilize the advantages of disk reading and reduce network bandwidth usage.
在一个实施例中,所述确定所述查询请求的执行时间、等待时间和传输时间,包括:In one embodiment, the determining the execution time, waiting time and transmission time of the query request includes:
获取查询请求的查找时间、处理时间和元组构建时间,并基于查找时间、读写数据量处理时间和元组构建时间确定查询请求的执行时间;Obtain the lookup time, processing time, and tuple construction time of the query request, and determine the execution time of the query request based on the lookup time, read and write data volume processing time, and tuple construction time;
获取查询请求的请求队列时间、机器负载延迟时间和从节点数据同步时间,并基于请求队列时间、机器负载延迟时间和从节点数据同步时间确定查询请求的等待时间;Obtain the request queue time, machine load delay time and slave node data synchronization time of the query request, and determine the query request waiting time based on the request queue time, machine load delay time and slave node data synchronization time;
确定查询请求的传输时间。Determines the transmission time of the query request.
其中,处理时间是指读取或者写入数据的时间。The processing time refers to the time to read or write data.
具体的,当服务器接收到不同类型的查询请求时,服务器可以获取各个查询请求的查找时间、处理时间和元组构建时间,并基于各个查询请求的查找时间、读写数据量处理时间和元组构建时间确定各个查询请求的执行时间;进一步的,服务器可以获取各个查询请求的请求队列时间、机器负载延迟时间和从节点数据同步时间,并基于各个查询请求的请求队列时间、机器负载延迟时间和从节点数据同步时间,确定各个查询请求的等待时间;服务器可以确定各个查询请求的传输时间。本实施例中,由于下层存储模式的扩展,查询优化可以得到更好的配置方案,数据库可以加快对用户请求的处理速度,因此,用户对于系统的满意度也会上升。Specifically, when the server receives different types of query requests, the server can obtain the search time, processing time and tuple construction time of each query request, and based on the search time, read and write data volume processing time and tuple construction time of each query request The construction time determines the execution time of each query request; further, the server can obtain the request queue time, machine load delay time and slave node data synchronization time of each query request, and based on the request queue time, machine load delay time and The data synchronization time of the slave node determines the waiting time of each query request; the server can determine the transmission time of each query request. In this embodiment, due to the expansion of the lower-layer storage mode, a better configuration scheme can be obtained for query optimization, and the database can speed up the processing speed of user requests. Therefore, the user's satisfaction with the system will also increase.
在一个实施例中,所述存储模式为第一存储模式;所述方法还包括:In one embodiment, the storage mode is the first storage mode; the method further includes:
接收范围查询请求;Receive range query requests;
确定范围查询请求对应的查询表中的列总数,以及确定范围查询请求所需目标列的列数量;目标列为查询表中的数据列;Determine the total number of columns in the query table corresponding to the range query request, and determine the number of target columns required by the range query request; the target column is the data column in the query table;
基于列数量、列总数和预设阈值,确定范围查询请求所对应的第三存储模式;第一存储模式与第三存储模式是不同的存储模式;Determine the third storage mode corresponding to the range query request based on the number of columns, the total number of columns and the preset threshold; the first storage mode and the third storage mode are different storage modes;
将范围查询请求路由至第三存储模式对应的副本节点,以使副本节点基于范围查询请求进行数据查询。The range query request is routed to the replica node corresponding to the third storage mode, so that the replica node performs data query based on the range query request.
具体的,当服务器接收到的读写操作请求为范围查询请求时,服务器可以确定该范围查询请求对应的查询表中的列总数,以及确定范围查询请求所需目标列的列数量,目标列为查询表中的数据列;进一步的,服务器可以基于列数量、列总数和预设阈值,确定该范围查询请求所对应的第三存储模式。当第一存储模式与第三存储模式为不同的存储模式时,服务器可以将该范围查询请求路由至第三存储模式对应的副本节点,以使副本节点基于该范围查询请求进行数据查询。Specifically, when the read/write operation request received by the server is a range query request, the server can determine the total number of columns in the query table corresponding to the range query request, and determine the number of target columns required by the range query request, and the target column is query the data columns in the table; further, the server may determine the third storage mode corresponding to the range query request based on the number of columns, the total number of columns and a preset threshold. When the first storage mode and the third storage mode are different storage modes, the server may route the range query request to the replica node corresponding to the third storage mode, so that the replica node performs data query based on the range query request.
举个例子,假设服务器确定在运行当前工作负载时数据库中数据表的存储模式为多行存模式之后,当服务器接收到范围查询请求A时,服务器可以确定该范围查询请求A对应的查询表A中的列总数为Col_num,以及确定该范围查询请求A所需目标列的列数量为Col,目标列为查询表A中的数据列;进一步的,服务器可以基于列数量Col、列总数Col_num和预设阈值α,确定该范围查询请求A所对应的第三存储模式。例如,当Col/Col_num<α时,则可以确定列存模式是最佳存储模式;当Col/Col_num≥α,说明该范围查询请求A需要读取的列数较多,一般情况下可以采用行存或多行存的存储模式,比如,该范围查询请求A需要读取20列中的19列数据,在列存的存储模式上读取时需要创建19个迭代器,而在行存或多行存上只需要创建一个迭代器,整体开销会大大降低。For example, suppose that after the server determines that the storage mode of the data table in the database is multi-row storage mode when running the current workload, when the server receives the range query request A, the server can determine the query table A corresponding to the range query request A. The total number of columns in the query request A is Col_num, and the number of target columns required to determine the range query request A is Col, and the target column is the data column in the query table A; further, the server can be based on the number of columns Col, the total number of columns The threshold α is set to determine the third storage mode corresponding to the range query request A. For example, when Col/Col_num<α, it can be determined that the column storage mode is the best storage mode; when Col/Col_num≥α, it means that the range query request A needs to read a large number of columns. For example, the range query request A needs to read 19 columns of 20 data, and 19 iterators need to be created when reading in the column-based storage mode. Only one iterator needs to be created on the row memory, and the overall overhead will be greatly reduced.
假设服务器确定范围查询请求A所对应的第三存储模式为列存模式,由于当前服务器数据库中数据表的存储模式为多行存模式,因此,服务器可以将该范围查询请求A路由至多行存模式对应的副本节点,以使其他的副本节点基于该范围查询请求A进行数据查询。本实施例中,使得原有系统获得了多样化存储的支持。同时,由于下层存储模式的扩展,上层查询优化可以获得更好的调参方案,因此开发者可以省下更多的计算和存储的开销。Suppose the server determines that the third storage mode corresponding to the range query request A is the column storage mode. Since the storage mode of the data table in the current server database is the multi-row storage mode, the server can route the range query request A to the multi-row storage mode. Corresponding replica nodes, so that other replica nodes perform data query based on the range query request A. In this embodiment, the original system is supported by diversified storage. At the same time, due to the expansion of the lower-layer storage mode, the upper-layer query optimization can obtain better parameter adjustment schemes, so developers can save more computing and storage costs.
本申请还提供一种应用场景,该应用场景应用上述的数据库存储模式的自适应方法。具体地,该数据库存储模式的自适应方法在该应用场景的应用如下:The present application also provides an application scenario, where the above-mentioned adaptive method for database storage mode is applied to the application scenario. Specifically, the application of the adaptive method of the database storage mode in this application scenario is as follows:
当需要针对不同负载的特点决策出最适合的存储结构时,可以采用上述的数据库存储模式的自适应方法,即在运行某个工作负载时,对于数据库中的每一张数据表,均可以采用上述的数据库存储模式的自适应方法,决策出最优的存储方案。When it is necessary to decide the most suitable storage structure according to the characteristics of different loads, the above-mentioned adaptive method of database storage mode can be used, that is, when running a certain workload, for each data table in the database, the The above-mentioned adaptive method of database storage mode decides the optimal storage scheme.
本申请实施例提供的方法,可以应用于任意的工作负载正常运行的场景中,以下以基于LSM-TreeKV存储引擎的分布式数据库存储模式的优化场景为例,对本申请实施例提供的数据库存储模式的自适应方法进行说明。The methods provided by the embodiments of the present application can be applied to any scenario in which the workload is running normally. The following takes the optimization scenario of the distributed database storage mode based on the LSM-TreeKV storage engine as an example to describe the database storage mode provided by the embodiments of the present application. The adaptive method is described.
传统方式中,通常采用行存或列存的存储模式,以使得这些存储模式可以在某种工作负载下,提升数据库在处理查询时的读写(I/O)性能。例如,OLTP型负载对数据有很多写操作,OLAP操作一般有很多范围查询操作。传统数据库的设计,一般OLTP和OLAP数据库分来为独立产品,其中OLTP数据库基本都采用行存作为存储模式,便于写操作和点查询操作;而OLAP数据库一般采用列存作为存储模式,用于加速连续访问的范围查询,尤其是某一个或某几个列上的扫描操作。显然,OLTP和OLAP在存储模式的选择上是矛盾的,因此,HTAP(Hybrid Transactional/Analytical Processing)数据库中存储模式的选择和优化是一个非常有挑战性的问题。HTAP是一种新兴的应用程序架构,可以很好地支持事务处理和分析。具体到基于LSM-Tree KV的新型分布式数据库系统中,存储模式的问题变得更为复杂,由于存储模式受限于行存、列存,参数的调节大多依赖于硬件的配置以及查询的特征,无法对存储模式这样底层基础的配置进行调节,因而容易使得最终选择出的存储模式往往不是最佳的,可能会造成比较大的存储开销甚至空间浪费,导致数据库系统的整体性能较差。In the traditional way, the storage mode of row storage or column storage is usually adopted, so that these storage modes can improve the read/write (I/O) performance of the database when processing queries under a certain workload. For example, OLTP type workloads have many write operations to data, and OLAP operations generally have many range query operations. In the design of traditional databases, OLTP and OLAP databases are generally divided into independent products. Among them, OLTP databases basically use row storage as the storage mode, which is convenient for write operations and point query operations; while OLAP databases generally use column storage as the storage mode to accelerate Range queries with continuous access, especially scan operations on one or several columns. Obviously, OLTP and OLAP are contradictory in the choice of storage mode, therefore, the selection and optimization of storage mode in HTAP (Hybrid Transactional/Analytical Processing) database is a very challenging problem. HTAP is an emerging application architecture that supports transaction processing and analytics well. Specifically, in the new distributed database system based on LSM-Tree KV, the problem of storage mode becomes more complicated. Since the storage mode is limited by row storage and column storage, parameter adjustment mostly depends on hardware configuration and query characteristics , it is impossible to adjust the underlying basic configuration such as the storage mode, so it is easy to make the final selected storage mode often not optimal, which may cause a relatively large storage overhead or even a waste of space, resulting in poor overall performance of the database system.
因此,为了解决上述问题,本申请提供了一种基于LSM-Tree KV存储引擎的分布式数据库存储模式的优化方法,以解决数据库系统的整体性能较差的问题。本方法中根据LSM-Tree KV存储引擎的特点,拓展了数据库存储模式的设计空间,并通过数据库的统计信息和代价评估函数,可以决策出面向当前负载的最优存储方案,以适应企业不同业务的不同需求。由此使得,可以根据用户访问负载,选择最适合的存储模式。相比传统方法,本实施例中的方法能够在更大的范围内,寻找最适应查询负载的存储结构,并且可以实现在同个共识组的不同副本上存放不同模式的数据,可以为不同的工作负载选择最合适的存储节点。相较于传统的分布式系统,本方法可以同时为AP和TP型负载选择具有最优存储模式的存储节点,从而更好的适应HTAP的场景。同时,本方法中也考虑到了主从节点间负载均衡的问题,可以提高各节点的利用率,从而提升系统整体的性能。Therefore, in order to solve the above problem, the present application provides an optimization method for a distributed database storage mode based on the LSM-Tree KV storage engine, so as to solve the problem of poor overall performance of the database system. In this method, according to the characteristics of the LSM-Tree KV storage engine, the design space of the database storage mode is expanded, and through the statistical information and cost evaluation function of the database, the optimal storage scheme for the current load can be decided to adapt to different businesses of the enterprise. different needs. This makes it possible to select the most suitable storage mode according to the user's access load. Compared with the traditional method, the method in this embodiment can find the storage structure that is most suitable for the query load in a wider range, and can store data of different modes on different copies of the same consensus group, which can be different. Select the most appropriate storage node for the workload. Compared with the traditional distributed system, the method can simultaneously select the storage node with the optimal storage mode for the AP and TP type loads, so as to better adapt to the HTAP scenario. At the same time, the method also considers the problem of load balancing between the master and slave nodes, which can improve the utilization rate of each node, thereby improving the overall performance of the system.
在产品侧,本申请提供的方法可以自适应用户的查询请求,为不同的工作负载选择合适的存储模式。传统基于行存、列存的数据库,可供选择的存储模式太少,无法为不同的查询请求提供最优的存储模式,从而可能导致空间的浪费、性能的损失。本实施例中提出了一种基于多样化存储模式的分布式数据库系统,扩展了存储模式的种类,既提升了磁盘的I/O效率、减少了空间浪费,又可以保证不同副本节点间数据的一致性,增加了系统的并发性能,提高了系统在混合负载下的性能表现。On the product side, the method provided in this application can adapt to the user's query request, and select an appropriate storage mode for different workloads. Traditional databases based on row storage and column storage have too few storage modes to choose from, and cannot provide optimal storage modes for different query requests, which may lead to waste of space and loss of performance. In this embodiment, a distributed database system based on diversified storage modes is proposed, which expands the types of storage modes, which not only improves the I/O efficiency of disks, reduces space waste, but also ensures data integrity between different replica nodes. Consistency increases the concurrent performance of the system and improves the performance of the system under mixed loads.
产品侧具备的特征如下:The features on the product side are as follows:
该产品有存储和计算功能。在存储端会对数据进行分片,每个数据片有若干个副本,每个副本都提供访问功能。数据片的不同副本所对应的存储模式可以相同,也可以不同。在计算层,根据分析统计,判断请求的类型,确定性能更优的副本,同时考虑负载均衡,将请求发送给被选定的副本处理。该副本所在节点执行请求后将数据返回给计算层,再经由计算层返回给客户端。对上层用户来说,并不会感知到底层存储模式的多样化,而是由数据库产品本身完成多样化存储模式的管理。同时,对于开发人员来说只需要修改KV映射的规则和事务处理流程就可以直接兼容本方法的改动,使系统获得对多样化存储模式的支持。The product has storage and computing capabilities. On the storage side, the data will be sharded, and each data shard has several copies, and each copy provides access functions. The storage modes corresponding to different copies of data slices can be the same or different. At the computing layer, according to the analysis and statistics, the type of request is judged, and the replica with better performance is determined. At the same time, load balancing is considered, and the request is sent to the selected replica for processing. After the node where the replica resides executes the request, the data is returned to the computing layer, and then returned to the client through the computing layer. For upper-level users, the diversification of the underlying storage mode is not perceived, but the management of the diversified storage mode is completed by the database product itself. At the same time, developers only need to modify the rules of KV mapping and transaction processing flow to be directly compatible with the changes of this method, so that the system can obtain support for diversified storage modes.
在技术侧,如图8所示,为系统整体架构图。On the technical side, as shown in Figure 8, it is the overall architecture diagram of the system.
第1小节-整体描述。Section 1 - Overall description.
本方法适用于采用了LSM-Tree KV作为底层存储引擎的数据库。系统可以采用分布式服务系统中常见的存算分离架构,将存储层和计算层分离开来。如图8所示的SQLENGINE为计算层中的SQL引擎,SQL引擎是数据库系统的重要组成部分,主要职责是将应用程序输入的SQL语句在当前负载场景下生成高效的执行计划,在SQL语句的高效执行上扮演重要角色。如图8所示,在存储层中,共识组1中包括多个副本节点即存储节点1、存储节点2和存储节点3,服务器通过一致性协议层将多个副本组成一个共识组,以使得共识组中的各个存储节点中存储的数据相同,提高系统的容错率,并且,每个共识组中的副本节点的存储模式可以不同,也可以相同。This method is suitable for databases using LSM-Tree KV as the underlying storage engine. The system can adopt the common storage and computing separation architecture in distributed service systems to separate the storage layer and the computing layer. As shown in Figure 8, SQLENGINE is the SQL engine in the computing layer. The SQL engine is an important part of the database system. Its main responsibility is to generate an efficient execution plan for the SQL statement input by the application under the current load scenario. important role in efficient execution. As shown in Figure 8, in the storage layer, consensus group 1 includes multiple replica nodes, namely storage node 1,
例如,假设当前负载为负载A,在当前负载A运行在数据库的场景下,存储节点1的服务器可以基于本申请实施例中提出的LSM-Tree KV存储引擎的分布式数据库存储模式的优化方法,决策出面向当前负载A的最优存储方案。如图8中所示共识组1中的存储节点1中的状态机的存储模式为行存,即此时的数据库中的存储模式是以行存模式作为基准模式,服务器可以获取当前负载A的运行参数,运行参数包括读写操作比例、查询语句和数据分布;服务器可以根据当前负载A运行参数中的读写操作比例和各键值对的读写操作性能参数,确定键值对所占用空间的第一空间大小,并根据当前负载A运行参数中查询语句的语义信息,确定存储区的分区方式;进一步的,服务器根据当前负载A运行参数中的数据分布确定候选分组范围下的数据密度,并基于分区方式、候选分组范围、数据密度和第一空间大小,确定在运行当前负载A时数据库中数据表的最优存储模式为多行存模式,则在服务器处于空闲状态时,服务器可以将当前负载A所对应的存储节点1中的状态机的存储模式,由行存模式转换为多行存模式,即服务器可以按照多行存模式所对应的编码方式,将数据表中的数据重新编码后进行存放,以使得数据表中的数据按照多行存模式存储。此外,当负载A更换为其它负载B时,服务器可以基于本申请实施例中提出的LSM-Tree KV存储引擎的分布式数据库存储模式的优化方法,重新确定其它负载B在数据库中的目标存储模式,并在服务器处于空闲状态时,将当前存储模式转换为目标存储模式。For example, assuming that the current load is load A, in the scenario that the current load A is running in the database, the server of the storage node 1 can be based on the optimization method of the distributed database storage mode of the LSM-Tree KV storage engine proposed in the embodiment of this application, The optimal storage solution for the current load A is decided. As shown in Figure 8, the storage mode of the state machine in the storage node 1 in the consensus group 1 is row storage, that is, the storage mode in the database at this time is the row storage mode as the reference mode, and the server can obtain the current load A's storage mode. Running parameters, including the ratio of read and write operations, query statements, and data distribution; the server can determine the space occupied by key-value pairs according to the ratio of read and write operations in the running parameters of current load A and the performance parameters of read and write operations of each key-value pair The first space size of the storage area is determined according to the semantic information of the query statement in the running parameters of the current load A, and the partition mode of the storage area is determined; further, the server determines the data density in the candidate grouping range according to the data distribution in the running parameters of the current load A, And based on the partition mode, candidate grouping range, data density and first space size, it is determined that the optimal storage mode of the data table in the database is the multi-row storage mode when the current load A is running, then when the server is in an idle state, the server can store The storage mode of the state machine in the storage node 1 corresponding to the current load A is converted from the row storage mode to the multi row storage mode, that is, the server can re-encode the data in the data table according to the encoding method corresponding to the multi row storage mode Then store it, so that the data in the data table is stored in a multi-line storage mode. In addition, when the load A is replaced with another load B, the server may re-determine the target storage mode of the other load B in the database based on the optimization method for the distributed database storage mode of the LSM-Tree KV storage engine proposed in the embodiment of the present application , and when the server is idle, convert the current storage mode to the target storage mode.
进一步的,在当前负载A运行在数据库的场景下,当服务器确定在运行当前负载A时数据库中数据表的最优存储模式为多行存模式,并在服务器处于空闲状态时,将当前负载A所对应的存储节点1中的状态机的存储模式转换为多行存模式之后,当服务器接收到不同类型的查询请求时,首先,由如图8中所示的计算层中的SQL引擎与用户的输入进行交互,即计算层中的SQL引擎负责处理用户发起的不同查询请求所对应的SQL语句,将SQL语句解析并制定查询计划。即在计算层中,服务器可以确定各个查询请求的执行时间、等待时间和传输时间,并基于执行时间、等待时间、传输时间以及预设权重,确定各个查询请求所对应的最优存储模式。Further, in the scenario where the current load A is running in the database, when the server determines that the optimal storage mode of the data table in the database is the multi-row storage mode when the current load A is running, and when the server is in an idle state, the current load A is stored in the database. After the storage mode of the state machine in the corresponding storage node 1 is converted to the multi-line storage mode, when the server receives different types of query requests, first, the SQL engine in the computing layer as shown in Figure 8 and the user That is, the SQL engine in the computing layer is responsible for processing the SQL statements corresponding to different query requests initiated by the user, parsing the SQL statements and formulating query plans. That is, in the computing layer, the server can determine the execution time, waiting time and transmission time of each query request, and determine the optimal storage mode corresponding to each query request based on the execution time, waiting time, transmission time and preset weight.
例如,如图8所示的共识组1中存在3个存储节点,分别为存储节点1、存储节点2和存储节点3,图8中所示的存储节点1对应的存储模式为行存模式、存储节点2对应的存储模式为列段存模式,以及存储节点3对应的存储模式为行存模式。当存储节点1的服务器接收到查询请求A时,服务器确定查询请求A的执行时间、等待时间和传输时间,并基于执行时间、等待时间、传输时间以及预设权重,确定查询请求A所对应的最优的存储模式为列段存模式,由于当前存储节点1所对应的存储模式为行存模式,因此,服务器可以将查询请求A路由至列段存模式对应的副本节点即存储节点2,以使存储节点2基于查询请求A进行数据查询处理。若服务器确定查询请求A所对应的最优的存储模式为行存模式,由于当前存储节点1所对应的存储模式为行存模式,则无需路由至其他副本节点进行处理。若当前存储节点1发生故障变为不可用时,则服务器可以将查询请求A路由至行存模式对应的其他副本节点即存储节点3,以使存储节点3基于查询请求A进行数据查询处理。For example, as shown in Figure 8, there are three storage nodes in consensus group 1, namely storage node 1,
本申请实施例中,对于计算层和存储层这两大模块,其具体功能如下:In the embodiment of the present application, for the two major modules of the computing layer and the storage layer, their specific functions are as follows:
一、计算层:1. Computing layer:
通过该计算层可以确定出最优的存储模式,可参考上述图2的实施例中的步骤210。具体地,该计算层具体功能如下:The optimal storage mode can be determined through the computing layer, and reference may be made to step 210 in the above-mentioned embodiment of FIG. 2 . Specifically, the specific functions of the computing layer are as follows:
1、与用户的输入进行交互。1. Interact with user input.
2、负责处理业务的SQL语句,将其解析并制定查询计划,即服务器可以根据接收到的不同类型的查询请求,确定各个查询请求的执行时间、等待时间和传输时间,并基于执行时间、等待时间、传输时间以及预设权重,确定各个查询请求所对应的最优存储模式,并将各个查询请求路由至最优存储模式对应的副本节点,以使副本节点基于各个查询请求进行数据查询。2. Responsible for processing the SQL statement of the business, parse it and formulate a query plan, that is, the server can determine the execution time, waiting time and transmission time of each query request according to different types of query requests received, and based on the execution time, waiting time Time, transmission time and preset weight, determine the optimal storage mode corresponding to each query request, and route each query request to the replica node corresponding to the optimal storage mode, so that the replica node can perform data query based on each query request.
3、针对不同的工作负载,基于分区方式、候选分组范围、数据密度和第一空间大小确定数据库中数据表的存储模式,并路由到不同的节点执行。3. For different workloads, determine the storage mode of the data table in the database based on the partition mode, candidate grouping range, data density and size of the first space, and route to different nodes for execution.
4、当元数据发生变更时,计算层重新选择合适的存储模式。4. When the metadata changes, the computing layer re-selects the appropriate storage mode.
本实施例中,对于计算层的改进主要在于查询计划的制定,即计算层会根据当前系统的状态进行分析处理,为不同的查询请求选择合适的存储模式,各种存储模式的介绍将于第2小节中具体阐述;元数据发生变更的请况在第12小节中进行讨论。In this embodiment, the improvement of the computing layer mainly lies in the formulation of query plans, that is, the computing layer will analyze and process according to the current state of the system, and select appropriate storage modes for different query requests.
二、存储层:存储层具体来说又分为事务处理层、分布式一致性协议层和状态机。各模块功能如下:2. Storage layer: The storage layer is further divided into transaction processing layer, distributed consistency protocol layer and state machine. The functions of each module are as follows:
1、事务处理层:1. Transaction processing layer:
事务处理层负责:1)接收计算层发来的事务处理请求2)协调处理2pc事务。The transaction processing layer is responsible for: 1) receiving transaction processing requests sent by the computing layer 2) coordinating and processing 2pc transactions.
本实施例中对于事务处理层的改动主要在于针对不同的存储模式,事务处理时锁的粒度会随之改变,具体事务处理过程将于第5小节中具体阐述。The changes to the transaction processing layer in this embodiment are mainly for different storage modes, and the granularity of the lock will change accordingly during transaction processing. The specific transaction processing process will be described in detail in Section 5.
2、分布式一致性协议层:2. Distributed consistency protocol layer:
分布式一致性协议层负责:1)控制数据同步,保证各个副本之间的数据一致性;2)接收和处理计算层的读写请求,并控制访问存储层的数据;3)主节点选举。The distributed consistency protocol layer is responsible for: 1) controlling data synchronization and ensuring data consistency between replicas; 2) receiving and processing read and write requests from the computing layer, and controlling access to data in the storage layer; 3) master node election.
分布式一致性协议层采用的是Raft协议,分为leader即主节点选举和日志复制两大模块。Raft是一个强leader的一致性协议,写数据时必须通过主节点然后再复制给其它副节点,以保证各副本间数据的一致性,关于Raft协议的工作流程将在第10小节中详细阐述。各副本使用的存储模式可以相同,也可以不相同,本方法中利用Raft实现了多副本间的异构,具体方案在第8小节中详细阐述。同时,各副本在运行过程中可以根据负载的不同转换存储模式,具体内容在第4小节中详细阐述。在分布式系统实际运行中,副本的配置、副本上数据的分裂、聚集等,这些都由多副本管理器进行控制,关于多副本管理器的管理策略将在第9小节中详细阐述。The distributed consistency protocol layer adopts the Raft protocol, which is divided into two modules: leader, that is, master node election and log replication. Raft is a strong leader consistency protocol. When writing data, it must pass through the primary node and then replicate to other secondary nodes to ensure data consistency between replicas. The workflow of the Raft protocol will be described in detail in Section 10. The storage mode used by each copy can be the same or different. In this method, Raft is used to realize the heterogeneity among multiple copies. The specific scheme is described in detail in Section 8. At the same time, each replica can switch the storage mode according to the different load during the running process. The specific content is elaborated in Section 4. In the actual operation of the distributed system, the configuration of replicas, the splitting and aggregation of data on the replicas, etc. are all controlled by the multi-replica manager. The management strategy of the multi-replica manager will be elaborated in Section 9.
3、状态机:3. State machine:
状态机负责:1)接收和处理分布式一致性协议层的访问请求;2)管理数据多副本,保证各个副本之间的数据一致性。The state machine is responsible for: 1) receiving and processing access requests from the distributed consistency protocol layer; 2) managing multiple copies of data to ensure data consistency between copies.
本实施例中状态机采用RocksDB,RocksDB是基于LSM-Tree KV的存储引擎,以键值对作为存储的基本单位,将数据聚集为SSTable并存放在分层的结构中,关于LSM-Tree的介绍将在第11小节中详细阐述。In this embodiment, the state machine adopts RocksDB. RocksDB is a storage engine based on LSM-Tree KV. It uses key-value pairs as the basic unit of storage, and aggregates data into SSTables and stores them in a hierarchical structure. Introduction to LSM-Tree It will be elaborated in Section 11.
本实施例中的关注点在于存储层接口处的编码部分,通过不同的编码方案实现多样化存储结构,并通过自适应算法,对具有不同访问特征的负载制定最优的方案,编码方案的具体实现在第3小节中阐述。自适应算法的处理流程在第7小节中详细阐述。The focus of this embodiment is on the coding part at the interface of the storage layer. Different coding schemes are used to realize a diversified storage structure, and an adaptive algorithm is used to formulate an optimal scheme for loads with different access characteristics. The specific coding scheme is The implementation is elaborated in
第2小节-存储模式拓展Section 2 - Storage Mode Expansion
在基于LSM-Tree KV的存储引擎中,KV外部的顺序无法得到保证,只可以保证KV内部顺序不被破坏,因此什么数据映射成一个KV对系统整体的读写性能影响很大,为了确保数据的有序性,在数据映射为KV之前进行排序,然后再将得到的结果整体映射为一个KV,即可以将更大的数据量放置在一个KV对中。由于KV对是KV存储系统中的基本单元,因此能够确保内部的数据会被顺序访问。本申请提供的方法,根据数据的粒度以及排序方式的不同,可以得到更多的存储模式。具体如下:In the storage engine based on LSM-Tree KV, the external order of KV cannot be guaranteed, only the internal order of KV can be guaranteed not to be destroyed, so what data is mapped into a KV has a great impact on the overall read and write performance of the system, in order to ensure data The ordering of the data is sorted before the data is mapped to KV, and then the overall result is mapped to a KV, that is, a larger amount of data can be placed in a KV pair. Since the KV pair is the basic unit in the KV storage system, it can ensure that the internal data will be accessed sequentially. According to the method provided by the present application, more storage modes can be obtained according to the granularity and sorting method of the data. details as follows:
1、行存:1. Line storage:
数据以一行作为一个KV对存储在底层存储引擎中。行存作为最直观、最常见的存储方式,其读、写性能相对平衡。行存这一存储方法在单机数据库中非常常见,如MySQL、PostgreSQL均采用行存。行存被广泛应用于Spanner、TiDB等分布式数据库中。在处理TP型事务时,行存往往能取得较优的性能,因为TP型事务的操作粒度以行为主。但是,在处理AP型查询时,行存会暴露出有效数据量低、总带宽低等问题。Data is stored in the underlying storage engine in a row as a KV pair. As the most intuitive and common storage method, row storage has relatively balanced read and write performance. The storage method of row storage is very common in stand-alone databases, such as MySQL and PostgreSQL, which use row storage. Row storage is widely used in distributed databases such as Spanner and TiDB. When dealing with TP-type transactions, row storage can often achieve better performance, because the operation granularity of TP-type transactions is row-based. However, when processing AP-type queries, row storage will expose problems such as low amount of effective data and low total bandwidth.
2、列存:2. Column storage:
数据以一列作为一个KV对存储在底层存储引擎中,列存与行存一样,都是比较常见的存储模式,列存对于AP型查询往往具有较好的性能。在考虑物理实现时,KV的大小不可能无限增加,存储模式中的列存在面对数据量很大的情况时,会表现出巨大的性能劣势,或是诱发其他的工程上的难题。因此,本申请方法中只推荐在数据量较小的表上使用这种存储模式,例如TPC-H中的Warehouse表。Data is stored in the underlying storage engine with a column as a KV pair. Like row storage, column storage is a common storage mode. Column storage often has better performance for AP-type queries. When considering the physical implementation, the size of the KV cannot be increased infinitely. When the columns in the storage mode face a large amount of data, they will show a huge performance disadvantage or induce other engineering problems. Therefore, in the method of the present application, it is only recommended to use this storage mode on tables with a small amount of data, such as the Warehouse table in TPC-H.
3、单元格存:3. Cell storage:
按单元格存是指将一行或一列中的单元格作为KV对存储在底层存储引擎中。单元格存打破了整行整列的存储模式,使得KV单元的粒度更小。在一些只对单个单元格进行操作的工作负载中,会体现出比较好的性能优势。在传统行存、列存的存储模式下,当需要更新某个单元格时,服务器必须先将整行整列数据读出来,进行修改后再写回去,这就造成了严重的读写放大,按单元格存则可以很好的消除这种写放大。其中,单元格存具体又可以分为:1)单元格行存和2)单元格列存,具体如下:Store by cell refers to storing cells in a row or column as KV pairs in the underlying storage engine. The cell storage breaks the storage mode of the entire row and column, making the granularity of the KV unit smaller. In some workloads that only operate on a single cell, there will be better performance advantages. In the traditional storage mode of row storage and column storage, when a cell needs to be updated, the server must first read out the entire row and column data, modify it and then write it back, which causes serious read and write amplification. Cell storage is a good way to eliminate this write amplification. Among them, cell storage can be divided into: 1) cell row storage and 2) cell column storage, as follows:
1)单元格行存:单元格行存是指将单元格按照行的格式组织起来,使整体的存储在行上保持连续性,如图9所示,为单元格行存示意图。图9中第1列第一行中的第一个单元格中的row1co1,表示第一行第一列,图9中key为:tableID+rowID+columnID,表示的含义为:表标识+行标识+列标识。1) Cell row storage: Cell row storage refers to organizing cells according to the row format, so that the overall storage is continuous on the row, as shown in Figure 9, which is a schematic diagram of cell row storage. row1co1 in the first cell in the first row of the first column in Figure 9 represents the first row and the first column. The key in Figure 9 is: tableID+rowID+columnID, which means: table ID+row ID +Column ID.
2)单元格列存:单元格列存是指将单元格按照列的格式组织起来,保持列上的连续性,如图10所示,为单元格列存示意图。图10中第1列第一行中的第一个单元格中的row1co1,表示第一行第一列,图10中key为:tableID+columnID+rowID,表示的含义为:表标识+列标识+行标识。2) Cell column storage: Cell column storage refers to organizing cells in a column format to maintain the continuity on the column, as shown in Figure 10, which is a schematic diagram of cell column storage. The row1co1 in the first cell in the first row of the first column in Figure 10 represents the first row and the first column. The key in Figure 10 is: tableID+columnID+rowID, which means: table ID+column ID +Line ID.
4、多行存:4. Multi-line storage:
多行聚集是指将多行作为一个KV对存放在底层存储引擎中。相对于将一行作为一个KV,其KV存储的粒度变得更大。对于扫描操作来说,每次读取数据量的多少对性能会有很大的影响,因此,较于单行KV,在多行聚集的情况下,扫描操作性能会得到大幅提升。同时,多行聚集在更新时也会出现问题,对多行中的某一行进行更新,意味着要做更大粒度的RMW。针对这种问题提出了Merge方案,即将要更新的某行先存放在Memtable中,等待Rocksdb进行Compaction时,再将其合并到行组中。如图11所示,为多行存示意图。图11中第1列第一行中的第一个单元格中的row1co1,表示第一行第一列,图11中key为:tableID+groupID,表示的含义为:表标识+行分组标识。Multi-row aggregation refers to storing multiple rows as a KV pair in the underlying storage engine. Compared to treating a row as a KV, the granularity of its KV storage becomes larger. For scan operations, the amount of data read each time will have a great impact on performance. Therefore, compared with single-row KV, in the case of multi-row aggregation, the performance of scan operations will be greatly improved. At the same time, there will also be problems in the update of multi-row aggregation. Updating a row in multiple rows means that a larger granularity of RMW is required. In response to this problem, a Merge scheme is proposed, which is to store a row to be updated in Memtable first, and then merge it into the row group when Rocksdb performs Compaction. As shown in Figure 11, it is a schematic diagram of multi-line storage. In Figure 11, row1co1 in the first cell in the first row of the first column represents the first row and first column. The key in Figure 11 is: tableID+groupID, which means: table ID+row group ID.
5、列段存:5. Column segment storage:
列段存是指将一列中的至少两个单元格聚集为一个KV进行存储,这种机制可以理解成MySQL的垂直分区,引入这种机制的原因是:查询可能并不需要将一整列的所有列数据全部返回。对于某些特定的工作负载,引入列段以后性能会得到一定的提升。如图5所示,为列段存存储格式示意图。Column segment storage means that at least two cells in a column are aggregated into one KV for storage. This mechanism can be understood as MySQL's vertical partition. The reason for introducing this mechanism is that a query may not need to store all of the entire column. All column data is returned. For some specific workloads, the performance will be improved after the introduction of column segments. As shown in FIG. 5 , it is a schematic diagram of a column segment storage format.
6、列族存:6. Column family storage:
数据以列族的方式聚集。列族,表示数据表中的一组属性,即将一行中的至少两个单元格聚集为一个KV进行存储。在实际使用中,常常将频繁被一起读取/写入的列设置为一个列族,以提高系统的吞吐率。当数据以列族聚集时,能够实现对特定查询的优化。将列族作为一个键值对,可以有效提高这类查询的性能。如图12所示,为列族存存储格式示意图。图12中第1列第一行中的第一个单元格中的row1co1,表示第一行第一列,图12中key为:tableID+CF ID+rowID,表示的含义为:表标识+列族标识+行标识。Data is aggregated in column families. A column family represents a set of attributes in a data table, that is, at least two cells in a row are aggregated into a KV for storage. In practical use, columns that are frequently read/written together are often set as a column family to improve system throughput. When data is aggregated in column families, optimizations for specific queries can be achieved. Treating the column family as a key-value pair can effectively improve the performance of such queries. As shown in Figure 12, it is a schematic diagram of the column family storage format. The row1co1 in the first cell in the first row of the first column in Figure 12 represents the first row and the first column. The key in Figure 12 is: tableID+CF ID+rowID, which means: table ID+column Family ID + Row ID.
7、列族段存:7. Column family segment storage:
列族存中,一行中的某几个属性被聚集成一个KV对,列族段存是指将多行中的几个属性聚集为一个KV对,从而提升在AP场景下系统的吞吐率,即可以理解为增加列族存的行数。如图6所示,为列族段存示意图。In column family storage, certain attributes in a row are aggregated into a KV pair. Column family segment storage means that several attributes in multiple rows are aggregated into a KV pair, thereby improving the throughput of the system in the AP scenario. That is, it can be understood as increasing the number of rows stored in the column family. As shown in FIG. 6 , it is a schematic diagram of column family segment storage.
8、整表存:整表存是指将整张表作为一个KV进行存储。整表存在全表扫描的查询条件下对性能提升显著。如图7所示,为整表存示意图。8. Whole table storage: Whole table storage refers to storing the entire table as a KV. The performance of the entire table is significantly improved under the query condition of full table scan. As shown in Figure 7, it is a schematic diagram of the entire table storage.
第3小节-编码方案Section 3 - Coding Schemes
本小节对各种存储模式的编码方案作出详细介绍。如图13所示,为多样化存储模式的编码方案示意图。This section provides a detailed introduction to the encoding schemes for various storage modes. As shown in FIG. 13 , it is a schematic diagram of a coding scheme of diversified storage modes.
1、行存编码方案:1. Row storage coding scheme:
行存将数据表中的一行的数据作为KV对。在编码上,将表标识即TableID和行标识即RowID作为Key,将这一行中的数据作为Value存放。在实现中,Key中可以包括时间戳、Hash值等其他的信息。The row store takes the data of a row in the data table as a KV pair. In coding, the table ID, namely TableID, and the row ID, namely RowID, are used as Key, and the data in this row is stored as Value. In implementation, the Key may include timestamps, Hash values, and other information.
2、列存编码方案:2. Column storage coding scheme:
列存将数据表中的一列的数据作为KV对。在编码上,将表标识即TableID和列标识即ColumnID作为Key,将这一整列中的数据作为Value存放。在实现中,一整列的数据量可能很大,因此这种编码方式有可能产生很大的KV对。The column store takes the data of a column in the data table as a KV pair. In coding, the table ID, namely TableID, and the column ID, namely ColumnID, are used as Key, and the data in this entire column is stored as Value. In an implementation, the amount of data in an entire column can be large, so this encoding has the potential to produce large KV pairs.
3、单元格存编码方案:3. Cell storage coding scheme:
单元格存将数据表中的一个单元格作为KV对。在编码上有按行编码和按列编码两种方式。例如,将TableID、RowID和ColumnID按顺序编码为Key,将该单元格存储的数据作为Value存放。这样的编码方式会使得各KV对按照相同RowID的情况排列在SST中,属于按行存放。同理,如果按照TableID、ColumnID和RowID的顺序编码为Key,使得KV对按照相同ColumnID的方式排列在SST中,属于按列存放。Cell storage treats a cell in the data table as a KV pair. There are two ways to encode by row and by column. For example, TableID, RowID, and ColumnID are encoded as Key in sequence, and the data stored in this cell is stored as Value. This encoding method will make each KV pair arranged in the SST according to the same RowID, which belongs to the row storage. In the same way, if the Key is encoded in the order of TableID, ColumnID and RowID, the KV pairs are arranged in the SST according to the same ColumnID, which belongs to the column storage.
4、多行存编码方案:4. Multi-line storage coding scheme:
多行存将数据表中的若干行分为一组,存放在同一个KV对中。在编码上,将TableID和该分组标识即GroupID作为Key,该分组中的数据作为Value存放。Multi-row storage groups several rows in the data table into a group and stores them in the same KV pair. In coding, the TableID and the group ID, that is, the GroupID, are used as the Key, and the data in the group is stored as the Value.
其中,GroupID由分组方式确定。常见的分组方式有索引分组和哈希分组。索引分组按照数据插入的顺序分组,连续到来的数据元组会被存放到同一个分组中,直到该分组装满了数据。哈希分组按照一定的哈希函数,例如整除函数,基于哈希函数将数据元组的RowID映射到GroupID,并存放到该分组中。The GroupID is determined by the grouping method. Common grouping methods include index grouping and hash grouping. Index groups are grouped in the order in which data is inserted, and consecutive data tuples are stored in the same group until the group is full of data. The hash grouping maps the RowID of the data tuple to the GroupID based on a certain hash function, such as an integer division function, and stores it in the group.
5、列族存编码方案:5. Column family storage coding scheme:
列族存将数据表中一行中的某几个属性聚集成一个分组。在编码上,将TableID、该分组标识即GroupID以及该行所对应的行标识即RowID作为Key,该行中的某几个属性作为Value存放。Column family storage aggregates certain attributes of a row in a data table into a group. In coding, the TableID, the group ID, that is, the GroupID, and the row ID corresponding to the row, that is, the RowID, are used as the Key, and some attributes in the row are stored as the Value.
6、列族段存编码方案:6. Column family segment storage coding scheme:
列族段存将数据表中一行里的一个属性组聚集为列族,再将连续若干个列族聚集。其中,属性聚集的方式与列族相同,分组方式与多行存相同。即将位于数据表中水平方向上的每行中至少两个单元格聚集为一个数据分组,将位于数据表中垂直方向上的每列中至少两个单元格聚集为一个列分组;在编码上,将TableID、列分组的标识CFID和数据分组的组标识GroupID作为Key,将列分组和数据分组中的数据作为Value存放。Column family segment storage aggregates an attribute group in a row in the data table into a column family, and then aggregates several consecutive column families. Among them, the method of attribute aggregation is the same as that of column family, and the method of grouping is the same as that of multi-row storage. That is, at least two cells in each row in the horizontal direction of the data table are aggregated into a data group, and at least two cells in each column in the vertical direction in the data table are aggregated into a column group; in coding, The TableID, the identifier CFID of the column group, and the group identifier GroupID of the data group are used as the Key, and the data in the column group and the data group are stored as the Value.
7、列段存编码方案:7. Column segment storage coding scheme:
列段存将数据表中同一列中的连续若干单元格聚集成一个KV对。在编码上,将TableID、ColumnID和列分组CFID作为Key,将这一列的列分组中的数据作为Value存放。列段存也需要确定分组的方式。分组方式与多行存一致。与单元格按列存放相比,在处理全表扫描操作时,由于列段的粒度更大,可以减少读取的kv对的数量,从而提高整体的性能。和列存相比,列存会将某一列中的所有数据聚集为一个kv对,可以认为列段存将列存的一个kv对分解为若干个小kv对。列段存的大小介于单格存和列存之间,可以平衡读写性能。Column segment storage aggregates several consecutive cells in the same column in the data table into a KV pair. In coding, use TableID, ColumnID and column grouping CFID as Key, and store the data in the column grouping of this column as Value. The column segment memory also needs to determine the way of grouping. The grouping method is consistent with the multi-line storage. Compared with cells stored in columns, when processing full table scan operations, due to the larger granularity of column segments, the number of kv pairs read can be reduced, thereby improving overall performance. Compared with column storage, column storage aggregates all data in a column into a kv pair. It can be considered that column segment storage decomposes a kv pair in column storage into several small kv pairs. The size of column segment memory is between single-column memory and column memory, which can balance read and write performance.
8、整表存编码方案:8. Entire table storage coding scheme:
整表存将整个数据表的数据聚集成一个KV对。在编码上,将TableID作为Key,将数据表中的所有数据作为Value存放。全表存需要确定KV对内部的数据排列方式,可选的包括按行、按列。全表存将会产生巨大的KV对,因此在实践中只适合较小的表。The whole table storage aggregates the data of the entire data table into a KV pair. In coding, TableID is used as Key, and all data in the data table is stored as Value. For full table storage, it is necessary to determine the internal data arrangement of the KV pair, and the options include row-by-row and column-by-column. Full table storage will generate huge KV pairs, so in practice it is only suitable for smaller tables.
第4小节-存储模式的相互转换Section 4 - Interconversion of storage modes
本节主要讨论不同存储模式之间的转换问题。This section mainly discusses the conversion between different storage modes.
实际运行中,系统可以根据工作负载的不同,自动切换副本的存储模式。当自适应模块感知到负载发生变化时,就会评估出最优存储模式以及相应的转换代价。在系统较空闲时段,可以进行存储模式转换的工作,将当前存储模式转换为新的最优存储模式。例如当自适应模型判断出当前工作负载需要的存储模型为列存,而当前节点中没有列存模式,则需选择出副本节点进行存储模式的转换,可能是行存到列存的转换。转换时需要考虑节点的负载,优先选择负载小的节点。具体的自适应转换方式在第7小节查询自适应中有详细介绍。通过统计当前负载的运行参数,加上当前数据表的结构、数据分布,就能通过自适应模型求解出最优存储模式。考虑到第3小节中一共讨论了8种不同的存储模式,对应的一共有7*8=56种转换方式,具体如图4所示。In actual operation, the system can automatically switch the storage mode of replicas according to different workloads. When the adaptive module senses a change in load, it evaluates the optimal storage mode and the corresponding conversion cost. When the system is relatively idle, the storage mode conversion can be performed to convert the current storage mode into a new optimal storage mode. For example, when the adaptive model determines that the storage model required by the current workload is column storage, but there is no column storage mode in the current node, a replica node needs to be selected to convert the storage mode, which may be the conversion from row storage to column storage. The load of the node needs to be considered during the conversion, and the node with the smallest load is preferred. The specific adaptive transformation method is described in detail in Section 7 Query Adaptation. By calculating the operating parameters of the current load, plus the structure and data distribution of the current data table, the optimal storage mode can be solved through the adaptive model. Considering that a total of 8 different storage modes are discussed in
由于转换种类较多,从总体上看我们将其抽象为两类转换,分别是拆分和重组,具体如下:Since there are many types of transformations, we abstract them into two types of transformations, namely splitting and reorganization, as follows:
1.拆分:拆分的应用场景主要是数据从一个较大的KV变成几个较小的KV,这里以单行存到单元格存储的转换为例进行详细说明:1. Splitting: The application scenario of splitting is mainly that the data is changed from a large KV to several smaller KVs. Here, the conversion from single-line storage to cell storage is taken as an example to describe in detail:
当数据以行存模式写入时,其key的编码形式为表ID+行ID,在某些场景下,服务器需要将行存模式转为单元格存存储模式,所以服务器需要对Value里的各个属性重新编码,为每个属性设置一个对应的CeilID(TableID+RowID+ColumnID),将数据从行存模式拆分成为Ceil(单元格存)存模式。When data is written in row storage mode, its key is encoded in the form of table ID + row ID. In some scenarios, the server needs to convert row storage mode to cell storage mode, so the server needs to change each attribute in Value Recode, set a corresponding CeilID (TableID+RowID+ColumnID) for each attribute, and split the data from row storage mode into Ceil (cell storage) storage mode.
2.重组:重组的应用场景主要是数据从几个较小的KV重组成一个较大的KV,例如:单行存储到多行存的转换,单元格存到行存的转换等,这里以单元格存到列存的转换为例进行详细说明:2. Reorganization: The application scenario of reorganization is mainly to reorganize data from several smaller KVs into a larger KV, for example: conversion from single-line storage to multi-line storage, conversion from cell storage to row storage, etc. Here, the unit The conversion from grid storage to column storage is taken as an example to describe in detail:
假设某一副本上的数据开始以单元格存进行存储,随着负载的改变,我们需要将副本的存储模式转换为列存,但是列存所需要的单元格还不够,我们暂时将这些单元格写入缓冲区中,等到缓冲区中积攒够列存转换所需要的数目时再进行重新编码、转换。Assuming that the data on a copy starts to be stored in cell storage, as the load changes, we need to convert the storage mode of the copy to column storage, but the cells required for column storage are not enough, we temporarily store these cells Write into the buffer, and then re-encode and convert when the buffer has accumulated enough for the number of column-to-store conversions.
第5小节-新型存储模式对事务的影响Section 5 - Impact of novel storage modes on transactions
本节主要介绍新型存储模式对数据库中事务产生的影响:This section mainly introduces the impact of the new storage mode on transactions in the database:
主流的NewSQL数据库(CockRoachDB,TiDB,TDSQL3.0)事务并发控制机制都是与KV绑定的,以锁为例,传统的一行一KV的存储模式对应了传统数据库中的行锁。当采用不同的存储模式时,比如拆成单元格粒度的KV,就对应了更小粒度的锁,而聚集成多行或者列段粒度的KV,则对应了更大粒度的锁。对于前者而言,只要保证每次读写,都去获取所有相关KV的锁,就能保证其对应事务的正确性;如果每次写都按照单元格对应的字段定义的顺序去获取锁,就能破坏死锁条件中的循环等待,也不会造成更多的死锁,即本申请的实施例中,为了消除死锁,可以采用加锁的单元定序的方式,未定序时有可能产生死锁,如一行数据包含id和name两个字段,事务A和事务B同时访问这一行数据。事务A先对id加锁,事务B先对name加锁,此时事务A等待事务B释放name上的锁,事务B等待事务A释放id上的锁,互相等待对方释放资源,即发生了死锁。而采用了定序的策略后,就可避免这一现象:规定必须先对id加锁,之后才能对name加锁。此时,在事务A对id加锁后,事务B尝试对id加锁,发现已被占用,于是进入等待队列。此时不会出现死锁。The transaction concurrency control mechanisms of mainstream NewSQL databases (CockRoachDB, TiDB, TDSQL3.0) are all bound to KV. Taking locks as an example, the traditional one-row-one-KV storage mode corresponds to row locks in traditional databases. When different storage modes are used, such as KV split into cell granularity, it corresponds to smaller granularity locks, while KV aggregated into multiple row or column segment granularity corresponds to larger granularity locks. For the former, as long as each read and write is guaranteed to acquire the locks of all relevant KVs, the correctness of the corresponding transaction can be guaranteed; if each write acquires the locks in the order defined by the fields corresponding to the cells, the It can destroy the cyclic waiting in the deadlock condition, and will not cause more deadlocks. That is, in the embodiment of the present application, in order to eliminate the deadlock, the locking unit sequencing method can be adopted. Deadlock, such as a row of data contains two fields id and name, transaction A and transaction B access this row of data at the same time. Transaction A locks the id first, and transaction B locks the name first. At this time, transaction A waits for transaction B to release the lock on name, transaction B waits for transaction A to release the lock on id, and waits for each other to release resources, that is, death occurs. Lock. After adopting the ordering strategy, this phenomenon can be avoided: it is stipulated that the id must be locked first, and then the name can be locked. At this time, after transaction A locks the id, transaction B tries to lock the id and finds that it is occupied, so it enters the waiting queue. There is no deadlock at this point.
第6小节-查询优化Section 6 - Query Optimization
本节主要介绍计算层如何根据元数据、查询特征、存储模式等信息优化查询计划,基于多样化存储模式,提升数据库系统的查询性能。This section mainly introduces how the computing layer optimizes the query plan based on metadata, query characteristics, storage mode and other information, and improves the query performance of the database system based on diversified storage modes.
查询代价是制定查询计划的重要指标,其由三部分构成:执行时间Tproc、等待时间Twait与传输时间Ttrans,即T=Tproc+Twait+Ttrans。这三部分中,执行时间由查找时间Tsearch、处理时间Tio和元组构建时间Tcons组成,即Tproc=Tsearch+Tio+Tcons;等待时间由请求队列时间Tqueue、机器负载延迟Tload和从节点数据同步时间Tsync组成,即Twait=Tqueue+Tload+Tsync;传输时间即网络传输的时间。在本方案中,主要考虑多副本对查询计划制定产生的影响,因此主要考虑执行时间的影响因素,此处考虑几种策略:The query cost is an important indicator for formulating a query plan, which consists of three parts: execution time T proc , waiting time T wait and transmission time T trans , that is, T=T proc +T wait +T trans . In these three parts, the execution time consists of the search time T search , the processing time T io and the tuple construction time T cons , namely T proc =T search +T io +T cons ; the waiting time is composed of the request queue time T queue , the machine load The delay T load is composed of the slave node data synchronization time T sync , namely T wait =T queue +T load +T sync ; the transmission time is the time of network transmission. In this solution, the influence of multiple copies on query plan formulation is mainly considered, so the influence factors of execution time are mainly considered. Several strategies are considered here:
1.宽表少列,则列存模式优先。存储模式主要影响查询的执行时间,直接选择列存模式相比行存模式可能会减少Tsearch和Tio,但是由于需要对结果进行重组,可能导致Tcons增加,结果所需的列数越多,Tcons越大。对于分析查询型请求,统计该请求所需列的数量Col,以及该表中列的总数Col_num。如果Col/Col_num<α,其中,阈值α可依据实际情况进行调整,则认为列存模式是最佳存储模式。如果Col/Col_num≥α,说明需要读取的列数较多,一般情况下可以采用行存或多行存的存储模式。例如需要读取20列中的19列,在列存的存储模式上读取时需要创建19个迭代器,而在行存或多行存上只需要创建一个迭代器,整体开销会降低。1. If the wide table has fewer columns, the column storage mode takes precedence. The storage mode mainly affects the execution time of the query. Directly selecting the column storage mode may reduce T search and T io compared with the row storage mode. However, due to the need to reorganize the results, T cons may increase, and the number of columns required for the result is more. , the larger T cons is. For analytical query requests, count the number of columns Col required by the request and the total number of columns Col_num in the table. If Col/Col_num<α, where the threshold α can be adjusted according to the actual situation, it is considered that the column storage mode is the best storage mode. If Col/Col_num≥α, it means that there are many columns to be read. Generally, the storage mode of row storage or multi-row storage can be used. For example, 19 out of 20 columns need to be read, 19 iterators need to be created when reading in the column storage mode, and only one iterator needs to be created in the row storage or multi-row storage, the overall overhead will be reduced.
2.Scan扫描则多行存储优先。由于扫描操作要按序扫描多行数据甚至全表数据,因此按多行聚集的存储模式将数据存储在一起,相比其它存储模式可以有效减少Tsearch和Tio,并且上层结果也不需要重组,对Tcons没有影响,因此可以将多行存存储模式作为针对Scan扫描操作的最佳存储模式。2.Scan scan is multi-line storage priority. Since the scan operation needs to scan multiple rows of data or even the entire table data in sequence, storing data together in the storage mode aggregated by multiple rows can effectively reduce T search and T io compared to other storage modes, and the upper-level results do not need to be reorganized , has no effect on T cons , so the multi-line storage mode can be used as the best storage mode for Scan operation.
3.点查询则行存优先。点查询的查询目标是获得符合预期的某条数据,无论列存模式还是多行聚集都需要在上层对数据进行拆分,会导致Tcons增加,行存可以同时保证Tsearch、Tio不受影响的情况下Tcons达到最优,因此可以认为行存是针对点查询的最佳存储模式。3. Click the query to save the line first. The query goal of point query is to obtain a certain piece of data that meets the expectations. Regardless of the column storage mode or multi-row aggregation, the data needs to be split at the upper layer, which will lead to an increase in T cons . The row storage can simultaneously ensure that T search and T io are not affected. In the case of influence, T cons is optimal, so it can be considered that row storage is the best storage mode for point queries.
4.更新操作则单元格存优先。更新操作往往是对某行中的某个属性进行更新,选择行存需要将整行数据读上来再修改,然后整行写入,造成读写放大,选择列存也会有同样的影响,这对Tio和Tcons产生了不利的影响,逻辑上来说单元格存恰好可以与更新操作的粒度相匹配,因此可以认为单元格存是针对更新操作的最佳存储模式。4. For update operations, cell storage takes precedence. The update operation is often to update an attribute in a row. Selecting row storage requires reading the entire row of data and then modifying it, and then writing the entire row, resulting in read and write amplification. Selecting column storage will also have the same impact. It has an adverse effect on T io and T cons . Logically, cell storage can just match the granularity of update operations, so it can be considered that cell storage is the best storage mode for update operations.
同时,我们也对影响查询计划的其他因素做出讨论,如下:At the same time, we also discuss other factors that affect the query plan, as follows:
5.低负载节点所在副本优先。副本所在节点的负载主要影响查询的等待时间中的Tqueue与Tload,节点负载越低,这两项时间越小。从元数据管理器可以获得各副本所在节点负载情况,选择所在节点负载最低的副本作为最佳副本。5. The replica where the node with low load is located has priority. The load of the node where the replica is located mainly affects T queue and T load in the waiting time of the query. The lower the node load, the smaller the time of these two items. The load status of the nodes where each replica is located can be obtained from the metadata manager, and the replica with the lowest load on the node is selected as the best replica.
6近计算节点优先。与副本所在节点的物理距离影响传输时间Ttrans,距离越近,Ttrans越小。数据分片的多个副本可能分散在不同机房甚至不同的数据中心,选择近计算节点的副本所在节点,可以减少网络传输的开销。6 Near computing nodes are preferred. The physical distance from the node where the replica is located affects the transmission time T trans , the closer the distance, the smaller the T trans . Multiple copies of data shards may be scattered in different computer rooms or even different data centers. Selecting the node where the copy of the nearest computing node is located can reduce the overhead of network transmission.
7.数据同步状态最新的副本优先。读请求需要等待节点日志同步到最新,同步的时间Tsync影响等待时间Twait,因此副本的数据同步状态越新,等待时间越短。7. The copy with the latest data synchronization status takes precedence. The read request needs to wait for the node log to be synchronized to the latest. The synchronization time T sync affects the waiting time T wait . Therefore, the newer the data synchronization state of the replica, the shorter the waiting time.
以上几种策略可以单独工作,也可以通过权重等方式组合工作。查询的总代价公式如下所示:The above strategies can work individually or in combination through weights and other methods. The formula for the total cost of the query is as follows:
C=a1(tsearch+tio)+1/a1(tcons)+a2(tqueue+tload)+a3(Ttrans)+a4(tsync)C=a1(t search +t io )+1/a1(t cons )+a2(t queue +t load )+a3(T trans )+a4(t sync )
其中,a1、a2、a3、a4表示每种策略的权重,其中a1具体表示存储模式选择策略的权重,a2和a3则表示其它策略的权重。策略的选取需要考虑实际系统的情况。实际部署中,由于节点性能、网络速度等因素的影响,Tproc、Twait与Ttrans的量级存在差异,权重的设置应保证尽可能加速总时间较长的部分。Among them, a1, a2, a3, and a4 represent the weight of each strategy, where a1 specifically represents the weight of the storage mode selection strategy, and a2 and a3 represent the weight of other strategies. The selection of the strategy needs to consider the actual system situation. In actual deployment, due to the influence of factors such as node performance and network speed, there are differences in the magnitude of T proc , T wait and T trans . The weight setting should ensure that the part with a longer total time is accelerated as much as possible.
由于tsearch、tio、tcons都与存储模型的选择有关,当存储模型的选择更有利于tsearch和tio时,会偏离元组的存储模型,从而使得元组的构建时间(tcons)增加,二者成反比关系。故在查询总代价中,当我们选择了更利于查询请求的存储模型去缩短tsearch与tio时,其权重a1增大,也就也意味着tcons的权重减小,二者权重成反比关系,因此,设置tcons的权重为1/a1。Since t search , t io , and t cons are all related to the choice of storage model, when the choice of storage model is more favorable to t search and t io , it will deviate from the storage model of tuples, so that the construction time of tuples (t cons ) increases, and the two are inversely proportional. Therefore, in the total query cost, when we choose a storage model that is more conducive to query requests to shorten t search and t io , the weight a1 increases, which also means that the weight of t cons decreases, and the weights of the two are inversely proportional. relationship, therefore, set the weight of t cons to 1/a1.
第7小节-存储模式自适应调节方法Section 7 - Storage Mode Adaptive Adjustment Method
本节将承接第2小节中讨论的多种存储结构,继续讨论如何在数据库的实际系统中,针对不同负载的特点决策出最适合的存储结构。对于每一张数据表,我们可以采用自适应存储模式决策出其最优存储方案。This section will undertake the various storage structures discussed in
如图14所示,为存储模式自适应处理的流程图。存储模式自适应的方法分为以下6个步骤:As shown in FIG. 14 , it is a flowchart of the storage mode adaptation process. The method of storage mode adaptation is divided into the following six steps:
1.建立不同KV对大小的性能模型。1. Build performance models for different KV pair sizes.
2.收集负载的运行参数。2. Collect the operating parameters of the load.
3.根据负载I/O操作比例,计算KV对大小的最优范围,该最优范围可以是上述实施例中的键值对所占用空间的第一空间大小。3. Calculate the optimal range of the size of the KV pair according to the load I/O operation ratio, and the optimal range may be the first space size of the space occupied by the key-value pair in the above embodiment.
4.根据负载的查询语句信息,决定列分区的候选方案(即上述实施例中的分区方式)。4. According to the query statement information of the load, a candidate solution for column partitioning (ie, the partitioning method in the above embodiment) is determined.
5.根据负载的数据分布,决定候选的分组范围和数据密度。5. Determine the candidate grouping range and data density according to the data distribution of the load.
6.综合考虑2、3、4中求出的分区方式、候选分组范围、数据密度和第一空间大小,决策出最优的存储模式。6. Consider the partition mode, candidate grouping range, data density and first space size obtained in 2, 3, and 4, and decide the optimal storage mode.
步骤1通过事先的一系列实验,测出各大小KV对在各个典型I/O操作(如更新、插入、点查询、删除、范围查询等)下的性能,这些测试结果将作为后续步骤的判定依据。Step 1 Through a series of experiments in advance, measure the performance of each size KV pair under each typical I/O operation (such as update, insert, point query, delete, range query, etc.), these test results will be used as the judgment of subsequent steps in accordance with.
步骤2需要收集数据库正常运行负载时的参数,以供后续步骤作为决策的依据。此时的数据库可以是任何一种存储模式。一般地,可以选择行存作为基准模式进行测试。这一步中,需要收集的信息包含下述三种。一是负载中各I/O操作的比例,如单点查询、更新、插入、删除、全表扫描的相应比例。二是负载中涉及的全部查询语句的比例,以及所涉及的列对应的元信息。三是负载中数据分布的情况,如主键的范围和相应的比例。
步骤3根据各I/O操作的占比,求出适合的KV对大小。键值对存储系统的KV对对性能影响很大。一般地,大粒度对KV对更适合处理全表扫描操作,小粒度对KV对更适合处理单点查询、更新、插入、删除等操作。根据步骤2中统计的各种I/O操作的比例,以及步骤1中每种可能的KV对大小的各种I/O操作性能,通过加权平均的方法,可以估算出当前工作负载在每一种KV对大小下的总体性能,进而可以从中选择总体性能最高的KV对大小作为系统设置。Step 3: According to the proportion of each I/O operation, find the appropriate KV pair size. The KV pair of the key-value storage system has a great impact on performance. Generally, KV pairs with large granularity are more suitable for processing full table scan operations, and KV pairs with small granularity are more suitable for processing single-point query, update, insert, delete and other operations. According to the proportion of various I/O operations counted in
例如,当前工作负载包含1000个点查,10个涉及100000行的全表扫描,1000个更新操作、插入、删除。结合步骤1中搜集的各大小KV对在不同I/O操作下的性能,如(点查操作中,16B大小的KV对延迟为0.01ms,32B的延迟为0.005ms),可以估算出16B的KV对在点查中的耗时为1000*0.01ms。以此类推,可以求出16B的KV对在全表扫描、更新中的耗时,求和后即为总耗时。本段中提到的加权平均的权重即为本例子中各操作涉及的总行数。同样,对于其他大小的KV对,也能求出一个预估的总耗时。最后,预估总耗时最短的KV对大小为最优的KV对大小。For example, the current workload contains 1000 checkpoints, 10 full table scans involving 100000 rows, 1000 update operations, inserts, deletes. Combined with the performance of each size KV pair collected in step 1 under different I/O operations, such as (in the check operation, the delay of 16B KV pair is 0.01ms, and the delay of 32B is 0.005ms), it can be estimated that the 16B KV pair has a delay of 0.01ms. The time-consuming of the KV pair in the enumeration is 1000*0.01ms. By analogy, the time-consuming of 16B KV pairs in full table scan and update can be calculated, and the summation is the total time-consuming. The weight of the weighted average mentioned in this paragraph is the total number of rows involved in each operation in this example. Similarly, for other sizes of KV pairs, an estimated total time can also be obtained. Finally, the KV pair size with the shortest estimated total time consumption is the optimal KV pair size.
步骤4根据查询语句的语意信息,相关列的信息,以推导出列分区的候选方案。如图3中(1)所示,分区会在垂直方向将一张数据表划分为若干子集。首先,服务器获取该数据表上所涉及的所有查询和相应的比例。对于每个查询,服务器可以解析其语法,从投影子句和WHERE条件子句,将其中涉及的列信息取出并保存。之后,服务器可以根据应用面临的各种查询语句的占比对各查询语句进行排序,取出占比最高的前n位的查询语句。接着,服务器可以从占比最高的查询开始,根据该查询对应的相关列进行分区。之后服务器取出次高占比的查询,重复分区操作,直到所有属性都已经被分区完毕。Step 4: According to the semantic information of the query statement and the information of related columns, a candidate scheme of column partitioning is deduced. As shown in (1) in Figure 3, partitioning divides a data table into subsets in the vertical direction. First, the server obtains all the queries and corresponding scales involved on the data table. For each query, the server can parse its grammar, extract and save the column information involved from the projection clause and the WHERE conditional clause. After that, the server can sort each query statement according to the proportion of various query statements faced by the application, and retrieve the top n query statements with the highest proportion. Then, the server can start with the query with the highest proportion and partition according to the relevant column corresponding to the query. After that, the server takes out the query with the second highest proportion and repeats the partition operation until all attributes have been partitioned.
此外,分区算法在处理同一列处在多个查询中,这些列称为“冲突列”,可以采用不同的策略:第一种策略是高优先级查询即排序靠前的分组独占冲突列。在这种策略下,更重要的查询即优先级更高或权重更大,将冲突列加入自己的分组中,而其他的分组则不包含这一列。第二种策略是合并这些分组。在这种策略下,包含冲突列的分组将会合并为更大的一个分组。两种策略各有优势,第一种策略更倾向于符合高优先级查询的特征,比较适合不同查询之间的占比差距较大的情况;第二种策略在多个查询特征之间这种,比较适合不同查询之间占比差距不大的情况。In addition, the partitioning algorithm is dealing with the same column in multiple queries, these columns are called "conflicting columns", and different strategies can be adopted: the first strategy is that the high-priority query, i.e., the top-ranked grouping exclusives the conflicting columns. Under this strategy, more important queries, i.e. higher priority or more weight, add the conflicting column to their own grouping, while other groupings do not include this column. The second strategy is to combine these groupings. Under this strategy, groups containing conflicting columns are merged into a larger group. The two strategies have their own advantages. The first strategy is more inclined to meet the characteristics of high-priority queries, and is more suitable for situations where the proportion of different queries is far apart; the second strategy , which is more suitable for cases where the proportion of different queries is not very different.
步骤5根据负载的数据分布情况,求出候选的分组范围和数据密度。分组聚集能够将多行数据集中到一个KV对中,提高读写数据的效率。而分组的策略、分组的大小则需要根据当前负载的数据分布情况决定。如图3中(1)所示,分组会在水平方向将一张数据表划分为若干子集。分组策略包括哈希分组和顺序分组。哈希分组通过某一哈希函数,将同一哈希值的数据保存在同一个数据分组中。顺序分组是按照数据的到来顺序,将一系列连续到来的数据存放到同一个数据分组中。一种常规的、可行的确定最佳分组范围和数据密度的方式如下:Step 5: According to the data distribution of the load, obtain the candidate grouping range and data density. Grouping and aggregation can aggregate multiple rows of data into a KV pair, improving the efficiency of reading and writing data. The grouping strategy and the size of the grouping need to be determined according to the data distribution of the current load. As shown in (1) in Figure 3, grouping divides a data table into several subsets in the horizontal direction. Grouping strategies include hash grouping and sequential grouping. Hash grouping stores data of the same hash value in the same data group through a certain hash function. Sequential grouping is to store a series of consecutive data in the same data group according to the arrival order of the data. A common and feasible way to determine the optimal grouping range and data density is as follows:
i)分组函数采用哈希函数。如用主键的值对分组范围进行整除,整除的结果即为分组序号。对于非整型的数据,如果分组范围是2的幂次,可以采用按位右移的方式进行分组。例如,一行数据的主键为int类型,数值为32。另一行数据的主键数值为31。此时采用哈希函数分组序号=key整除16。对于key为32的数据,其分组序号=32整除16=2,其应当分在第2组;对于key为31的数据,其分组序号=31整除16=1,其应当分在第1组。i) The grouping function adopts a hash function. If the value of the primary key is used to divide the grouping range, the result of the division is the grouping sequence number. For non-integer data, if the grouping range is a power of 2, it can be grouped by bitwise right shift. For example, the primary key of a row of data is of type int and the value is 32. Another row of data has a primary key value of 31. At this time, the hash function is used to group the serial number=key divisible by 16. For the data whose key is 32, its grouping sequence number=32 divisible by 16=2, it should be in the second group; for the data whose key is 31, its grouping sequence number=31 divisible by 16=1, it should be in the first group.
ii)预定义一组分组范围。分组范围的取值可以是任意整数,但在实践中,为方便分组算法的实现和降低计算量,可以取2的幂次作为分组范围。ii) A set of grouping ranges are predefined. The value of the grouping range can be any integer, but in practice, to facilitate the implementation of the grouping algorithm and reduce the amount of calculation, a power of 2 can be taken as the grouping range.
iii)对于每个分组范围,计算平均数据密度。由于主键的分布不一定是严格递增的,因此可能出现某个分组内实际元素达不到分组范围的情况。例如,分组范围设定为8,而数据表中主键为1~7的数据不存在,只有主键为0的数据被分到了这组。因此这些位置空了出来,此时该分组的密度为1/8。所有分组的密度求平均值,即为该分组范围下的平均数据密度。iii) For each grouping range, calculate the average data density. Since the distribution of primary keys is not necessarily strictly increasing, it may happen that the actual elements in a group cannot reach the grouping range. For example, the grouping range is set to 8, but the data whose primary key is 1 to 7 does not exist in the data table, and only the data whose primary key is 0 is assigned to this group. Therefore, these positions are vacated, and the density of the packet is 1/8 at this time. The average of the densities of all groups is the average data density under the grouping range.
这一步骤求出的分组范围和平均数据密度会在步骤6中进行综合决策。对于顺序分组算法,也可采用同样的方式计算数据密度,并将相应参数提交至步骤6做最终决策。而分组的大小由范围和数据分布决定。例如,按照主键值进行哈希分组时,如果某些主键值未被使用,那么这些位置将空出来。可以预定义一组分组的范围,扫描数据分布,统计其实际的分组大小。The grouping range and average data density obtained in this step will be combined for decision-making in step 6. For the sequential grouping algorithm, the data density can also be calculated in the same way, and the corresponding parameters are submitted to step 6 for final decision. The size of the grouping is determined by the range and data distribution. For example, when hash grouping by primary key value, if some primary key value is not used, then those positions will be empty. The range of a group of groups can be predefined, the data distribution can be scanned, and the actual group size can be counted.
步骤6综合考虑上述三个步骤,并求出最优的存储模式。如图3中(3)所示,会将数据表同时进行水平、垂直的划分,得到最终的存储模式。依据步骤4中给出的分区方案、步骤5中求出的分组范围和数据密度,可以求出KV对的大小。在数值上,分组的范围乘以数据密度乘以分区大小应当等于KV对的大小。当某一存储模式所对应的KV对的大小等于或最接近于步骤3中给出的KV对大小,该存储模式即为最优存储模式。In step 6, the above three steps are considered comprehensively, and the optimal storage mode is obtained. As shown in (3) in Figure 3, the data table will be divided horizontally and vertically at the same time to obtain the final storage mode. According to the partition scheme given in step 4, the grouping range and data density obtained in step 5, the size of the KV pair can be obtained. Numerically, the extent of the grouping times the data density times the partition size should equal the size of the KV pair. When the size of the KV pair corresponding to a certain storage mode is equal to or closest to the size of the KV pair given in
第8小节-存储模式的异构性Section 8 - Heterogeneity of storage schemas
为了适应不同环境的需要,我们在存储模式多样化的基础上,应用了多副本异构协同计算方案。在应用多副本异构之后会引入不同副本之间如何相互转换存储模式的问题、异构副本的共识问题以及异构副本所带来的事务问题,以下做出详细讨论:In order to meet the needs of different environments, we apply a multi-copy heterogeneous collaborative computing scheme based on the diversification of storage modes. After applying multi-copy heterogeneity, the problem of how to convert storage modes between different copies, the consensus problem of heterogeneous copies, and the transaction problems caused by heterogeneous copies will be introduced. The following is a detailed discussion:
1.异构性存储模式间的转换:假设我们的共识组中有三个成员,分别存储S1模式副本、S2模式副本、S3模式副本,从共识协议层来看,主节点还是会将相同的日志发送给各个节点,而节点中的状态机在Apply日志时,会根据该节点中的日志翻译技术将数据的格式转换成对应的存储模式,随后在逐层写入。1. Conversion between heterogeneous storage modes: Assuming that there are three members in our consensus group, which store S1 mode copies, S2 mode copies, and S3 mode copies respectively, from the perspective of the consensus protocol layer, the master node will still keep the same log It is sent to each node, and the state machine in the node will convert the format of the data into the corresponding storage mode according to the log translation technology in the node when applying the log, and then write it layer by layer.
2.异构性存储模式间的共识问题:在异构副本间的转换方案中,共识组成员变更、数据在节点之间的转移以及快照持久化到状态机都可以通过共识协议的原有过程实现,可以保证系统的正确性。由于存储模式切换涉及状态机的读取和写入,存储模式切换过程中,切换模式的选择可以根据节点负载决定:当S1模式副本所在节点负载较小且节点可用空间较大时,选择直接切换。当节点负载较大或空间不足,应重新选择。实践中,在系统中行存Leader节点离线、还未重连成功的情况下,可以由其他存储模式的副本当选Leader节点作为临时的选项,待恢复后再转换为Leader节点。2. Consensus problem between heterogeneous storage modes: In the conversion scheme between heterogeneous copies, the change of consensus group members, the transfer of data between nodes, and the persistence of snapshots to the state machine can all go through the original process of the consensus protocol. Implementation, can guarantee the correctness of the system. Since the storage mode switching involves the reading and writing of the state machine, during the storage mode switching process, the selection of the switching mode can be determined according to the node load: when the load of the node where the S1 mode copy is located is small and the available space of the node is large, choose direct switching . When the node load is large or the space is insufficient, it should be re-selected. In practice, when the row storage leader node in the system is offline and has not been successfully reconnected, a copy of other storage mode can be selected as a leader node as a temporary option, and then converted to a leader node after recovery.
3.异构性存储模式的事务问题:3. Transaction problems of heterogeneous storage mode:
异构多副本的实现中主要有三个问题:从副本提供读接口(即Follower Read)、从副本读事务和主副本上写事务的并发控制和日志翻译问题。There are three main problems in the implementation of heterogeneous multi-replica: providing a read interface (that is, Follower Read) from the replica, concurrency control of the read transaction from the replica and the write transaction on the primary replica, and log translation issues.
目前大多数NewSQL数据库(CockRoachDB和TiDB)对于副本读的实现都采用了Lease Read机制,而Spanner由于使用Paxos来进行副本同步,其实现方式是主节点维护一个最新的Apply过的最大时间戳,并同步给从节点,即保证该时间戳之前的数据都已经复制完成,保证最新,如果读请求时间戳大于次时间,说明可能还有数据未复制完成,则需要阻塞等待。At present, most NewSQL databases (CockRoachDB and TiDB) use the Lease Read mechanism for replica read implementation, while Spanner uses Paxos for replica synchronization. Synchronize to the slave node, that is, to ensure that the data before the timestamp has been copied and ensured to be up-to-date. If the read request timestamp is greater than the next time, it means that there may be data that has not been copied, and you need to block and wait.
而从副本读事务和主副本上写事务的并发控制则主要有分布式锁和快照读安全时间戳两种方案。The concurrency control of the read transaction from the replica and the write transaction on the primary replica mainly includes distributed locks and snapshot read security timestamps.
分布式锁机制通过在多节点间复制锁等方式,使得主节点和从节点都可以访问到当前存在事务相关的锁,从而进行跨节点并发控制。以TiDB为例,TiDB的Percolator事务模型中,需要持久化三种数据,其中包括事务并发控制所需的Lock(锁),而TiDB作为分布式数据库,本身提供了数据的多副本机制,因此Lock本身也作为多副本复制数据的一部分在多副本之间进行同步,从而天然的提供了分布式锁,但由于需要跨网络复制大量数据,分布式锁本身仍旧是一种开销较大的并发控制手段。The distributed lock mechanism enables both the master node and the slave node to access the locks related to the current transaction by replicating locks among multiple nodes, thereby performing cross-node concurrency control. Taking TiDB as an example, in the Percolator transaction model of TiDB, three kinds of data need to be persisted, including the Lock (lock) required for transaction concurrency control. As a distributed database, TiDB itself provides a multi-copy mechanism of data, so Lock It is also used as a part of multi-copy replication data to synchronize between multiple copies, thus naturally providing distributed locks. However, due to the need to replicate a large amount of data across the network, distributed locks themselves are still an expensive concurrency control method. .
安全时间戳机制,即主从节点同步安全时间戳机制,需满足其之前不可能会有新事务提交。因此,此时小于此时间戳的快照读请求都是可以被安全满足的。同时,从节点上只允许对外提供小于该时间戳的快照读。以Spanner为例,除上述所说的保证数据新鲜度的时间戳之外,还会同时维护一个安全的事务冲突时间戳(实际上两个时间戳会去最小值对外表现为一个时间戳),一种实现是需要记录每个分布式事务的Prepare时间戳和Commit时间戳,此时可以使用主节点上所有写事务的最小Prepare时间戳作为安全时间戳,就可以保证此时间戳之前可以提交的事务已经全部提交了,正在进行的事务即使可能会提交,但提交时间戳一定在安全时间戳之后,此时副节点上小于此时间戳的读事务一定不会存在并发的会造成冲突的写事务,虽然这样免去了维护分布式锁的代价,但上述时间戳本身则代表了某一分片的所有数据的整体安全可读的时间,如果此时副节点上读事务的范围和主节点上并发的写事务的数据范围并无冲突,那副节点其实只需要等待多副本数据同步完成就应该执行该读事务,但此时由于事务的安全时间戳的存在,副节点只知道leader上存在并发的写事务,而不知道其数据范围是否冲突,只能选择阻塞该读事务,从而“误伤”一些读事务,由于难以确认事务的写集合,所以也难以去减小安全时间戳的粒度。CockroachDB虽然由于其事务机制中也存在类似TiDB的机制,锁(Write Intent)作为持久化数据进行多副本复制,因此也天然提供了分布式锁,但仍旧选择了安全时间戳的实现方式。The security timestamp mechanism, that is, the master-slave node synchronization security timestamp mechanism, needs to satisfy that it is impossible for a new transaction to be submitted before. Therefore, snapshot read requests less than this timestamp at this time can be safely satisfied. At the same time, the slave node is only allowed to provide external snapshot reads less than this timestamp. Taking Spanner as an example, in addition to the above-mentioned timestamp to ensure data freshness, a secure transaction conflict timestamp will be maintained at the same time (in fact, the two timestamps will go to the minimum value and appear as one timestamp externally), One implementation is to record the Prepare timestamp and Commit timestamp of each distributed transaction. In this case, the minimum Prepare timestamp of all write transactions on the master node can be used as the security timestamp to ensure that commits before this timestamp can be guaranteed. All transactions have been committed. Even if the ongoing transaction may be committed, the commit timestamp must be after the security timestamp. At this time, the read transaction on the secondary node with a timestamp smaller than this timestamp must not have concurrent write transactions that would cause conflicts. , although this eliminates the cost of maintaining distributed locks, the above timestamp itself represents the overall safe and readable time of all data of a certain shard. If the scope of the read transaction on the secondary node and the primary node are The data range of the concurrent write transaction does not conflict. In fact, the secondary node only needs to wait for the synchronization of the multi-copy data to complete before executing the read transaction. However, at this time, due to the existence of the security timestamp of the transaction, the secondary node only knows that there is concurrency on the leader. It is difficult to confirm the write set of the transaction, so it is difficult to reduce the granularity of the security timestamp. Although CockroachDB also naturally provides distributed locks due to the fact that there is a mechanism similar to TiDB in its transaction mechanism, and the lock (Write Intent) is used as persistent data for multi-copy replication, it still chooses the implementation of secure timestamps.
在采用异构副本的Raft系统中,非行存的副本可以选为Leader节点,但可能降低整体系统的性能。更一般地,任意存储模式的副本都可以作为Leader节点,此时Raft日志中记录的是该存储模式的数据。有的存储模式,如行存、列族存,写入性能较好,更适合作为行存;而有的存储模式,如列段存,写入性能较差,不适合作为行存。例如,列段存作为Leader时,所有的写入操作会被转换为多个写操作,导致整体系统写入性能下降。因此,在通常情况下,并不推荐写入性能差的副本当选Leader。实践中,在系统中写入性能好的Leader节点离线、还未重连成功的情况下,可以由其他存储模式的副本当选Leader节点作为临时的选项,待恢复后再转换为Leader节点。In a Raft system using heterogeneous replicas, non-row-stored replicas can be selected as Leader nodes, but it may reduce the performance of the overall system. More generally, a replica of any storage mode can be used as a leader node, and the data of this storage mode is recorded in the Raft log at this time. Some storage modes, such as row storage and column family storage, have better write performance and are more suitable as row storage; while some storage modes, such as column segment storage, have poor write performance and are not suitable for row storage. For example, when the column segment is used as the leader, all write operations will be converted into multiple write operations, resulting in a decrease in the overall system write performance. Therefore, under normal circumstances, replicas with poor write performance are not recommended to be elected as leaders. In practice, when the leader node with good write performance in the system is offline and has not been successfully reconnected, copies of other storage modes can be selected as the leader node as a temporary option, and then converted to the leader node after recovery.
第9小节-多副本管理策略Section 9 - Multiple Replica Management Strategies
本节主要介绍存储层中多副本的管理策略。This section mainly introduces the management strategy of multiple copies in the storage layer.
1、副本存储模式切换:对于存储层中的某个副本,它的存储模式在系统实际运行过程中是可以动态调整的,方案如下:1. Copy storage mode switching: For a copy in the storage layer, its storage mode can be dynamically adjusted during the actual operation of the system. The scheme is as follows:
在当前节点上建立该数据分片的一个新副本,该副本不计入共识组成员(即不参与共识的所有流程,只同步数据)。读取所有S1模式副本的数据,通过格式转换器生成模式为S2的新数据,通过新的临时副本重新持久化到状态机,并把S1模式副本还未持久化的日志同步到新副本。同步完成后,销毁S1模式的副本,并将S2模式作为新成员加入共识组。A new copy of the data shard is established on the current node, which is not counted as a consensus group member (that is, does not participate in all processes of consensus, only synchronizes data). Read the data of all S1 mode replicas, generate new data in the S2 mode through the format converter, re-persist it to the state machine through the new temporary replica, and synchronize the unpersisted logs of the S1 mode replica to the new replica. After the synchronization is completed, the copy of the S1 schema is destroyed, and the S2 schema is added to the consensus group as a new member.
2、副本配置策略:对于每一个数据分片来说,副本的配置策略并不是一成不变的。在系统运行的过程中,不同存储模式的副本数量根据数据访问模式的变化(负载的变化)做出相应调整,如果系统在某时间段内收集到的路由到列存副本所在节点的请求较多、列存副本负载较大时,判断当前负载在该数据分片上是偏列存的类型,选择这段时间中处理请求数较少(代表该类型负载低)的存储类型中的副本,将其切换为列存副本。2. Replica configuration strategy: For each data shard, the replica configuration strategy is not static. During the operation of the system, the number of replicas in different storage modes is adjusted according to the changes in the data access mode (load changes). , When the load of the column storage copy is large, it is judged that the current load is a partial column storage type on the data shard, and select the copy in the storage type with a small number of processing requests (representing the low load of this type) during this period of time. Switch to columnar copy.
3、数据分裂:当一个数据分片过大或数据过热时,为了均衡负载,系统对数据分片进行再分裂操作,使一个数据分片裂变为两个数据分片。新产生的数据分片的副本配置状况,既能选择与数据分裂前保持一致,也能通过策略管理器重新进行决策配置。3. Data splitting: When a data fragment is too large or the data is overheated, in order to balance the load, the system re-splits the data fragment to split one data fragment into two data fragments. The replica configuration status of the newly generated data shard can be selected to be consistent with that before the data split, or can be re-determined and configured through the policy manager.
例如:当行存模式副本进行分裂时,只需要分别迁移部分数据即可;当列存模式副本进行分裂时,由于列存储格式下重新组织所有行代价较大,直接从行存模式副本进行分裂并转换为列存格式,完成后销毁被分裂的列存副本。For example: when the row-stored copy is split, only part of the data needs to be migrated separately; when the column-stored copy is split, because the cost of reorganizing all rows in the column-store format is high, splitting directly from the row-stored copy and Convert to columnar storage format, and destroy the split columnar storage copy after completion.
4、错误处理:当一个数据分片的一个或若干个副本出现错误无法正常工作时,应该立刻创建新的副本,保证副本总数与预期配置相符。新上线的副本采用何种存储模式,由系统当前策略配置决定,不一定与出错前的状态保持一致。对于新上线副本,如果已有同样存储模式的其他副本,从该副本同步数据(该副本需要首先与主节点进行同步)。如果没有,则使用前文的模式转换中针对重新选择节点的方案从主节点同步数据。副本出现错误、建立新副本过程中及新副本建立后数据分片的可用性和副本数据的正确性共识协议保证。4. Error handling: When one or several copies of a data shard fail to work properly, a new copy should be created immediately to ensure that the total number of copies matches the expected configuration. The storage mode used by the newly online copy is determined by the current policy configuration of the system, and may not be consistent with the state before the error. For newly online replicas, if there are other replicas with the same storage mode, synchronize data from this replica (this replica needs to be synchronized with the primary node first). If not, synchronize data from the master node using the scheme for re-selecting nodes in the schema transition above. Consensus protocol guarantees the availability of data shards and the correctness of replica data in the process of creating a new replica and after the creation of a new replica.
第10小节-Raft工作流程Section 10 - Raft Workflow
Raft是一种容易理解、易于构建实际系统的一致性协议。它将一致性算法分解为了领导者选举、日志复制和安全性三大模块,通过首先选举出一个领导者(Leader),由领导者负责日志管理实现一致性,大大简化了需要考虑的状态,增强了可理解性,降低了工程实践的难度。Raft is a consensus protocol that is easy to understand and easy to build practical systems. It decomposes the consensus algorithm into three modules: leader election, log replication and security. By first electing a leader (Leader), the leader is responsible for log management to achieve consistency, which greatly simplifies the state to be considered and enhances It improves the comprehensibility and reduces the difficulty of engineering practice.
1.节点状态1. Node status
Raft集群中的节点共有三类状态,分别是领导者(Leader)、跟随者(Follower)和候选人(Candidate),其中Leader节点只有一个,其他节点全部是Follower。Raft会首先选举出一个Leader,由Leader完全管理副本日志。Leader负责接收所有客户端的日志条目,并复制到其他的Follower节点,在安全时告知各个Follower把这些日志条目应用到各自的状态机上。如果Leader出现故障,则Followers会重新选举出新的Leader。Follower自身不发送任何请求,只负责响应来自Leader和Candidate的请求,如果Follower接收不到消息,他会转换成Candidate并发起Leader选举,获得集群中大多数选票的Candidate将成为Leader。There are three types of nodes in the Raft cluster, namely Leader, Follower, and Candidate. There is only one Leader node, and all other nodes are Followers. Raft will first elect a leader, and the leader will fully manage the replica log. The Leader is responsible for receiving the log entries of all clients and copying them to other Follower nodes, and telling each Follower to apply these log entries to their own state machines when it is safe. If the leader fails, the Followers will re-elect a new leader. Followers do not send any requests themselves, but are only responsible for responding to requests from Leaders and Candidates. If Followers cannot receive messages, they will convert to Candidates and initiate Leader elections. Candidates that get most votes in the cluster will become Leaders.
Raft把时间分割成任意长度的任期,每个任期都有一个任期号,每进行一次Leader选举,都会开启一个新的任期。如果Leader或Candidate发现自己的任期号过时,会立刻切换为Follower状态:如果一个节点收到包含过时任期号的请求,则会拒绝这个请求。Follower、Candidate和Leader的转换关系如图所示:Raft divides time into terms of arbitrary length, each term has a term number, and a new term starts with each leader election. If the Leader or Candidate finds that its term number is out of date, it will immediately switch to the Follower state: if a node receives a request containing an outdated term number, it will reject the request. The conversion relationship between Follower, Candidate and Leader is shown in the figure:
2.Leader选举2. Leader election
Raft通过心跳机制触发Leader选举。在初始状态下,每个节点都是Follower。Follower节点与Leader节点通过心跳机制保持连续,如果一段时间内未接收到任何消息,Follower认为系统没有可用的Leader,并发起Leader选举。Raft triggers leader election through the heartbeat mechanism. In the initial state, each node is a Follower. The follower node and the leader node maintain continuity through the heartbeat mechanism. If no message is received within a period of time, the follower considers that the system has no available leader and initiates a leader election.
发起选举的Follower节点增加本地的当前任期号,由Follower状态切换到Candidate,然后它会给自己投票并将投票请求发送给其他的Follower节点。每个Follower节点可能会受到多个投票请求,但它只能按照先到先得的原则投出一票,且获得投票的Candidate节点所拥有的日志信息不能比自己的更旧。The Follower node that initiates the election increases the local current term number, and the Follower state switches to Candidate, then it will vote for itself and send the voting request to other Follower nodes. Each Follower node may receive multiple voting requests, but it can only cast one vote on a first-come, first-served basis, and the log information owned by the candidate node that has won the vote cannot be older than its own.
Candidate等待其他Follower节点的投票回复,收到的回复可能会出现以下三种结果:Candidate waits for voting replies from other Follower nodes, and the received replies may have the following three results:
Candiadate节点收到了超过半数的投票,赢得选举,成为Leader,此时他会向其他节点发送心跳消息,维持自身的Leader地位,并阻止新选举的产生。The Candiadate node receives more than half of the votes, wins the election, and becomes the leader. At this time, it will send heartbeat messages to other nodes to maintain its leader status and prevent the generation of new elections.
Candidate节点收到了其他节点发来的任期号更大的消息,这表示其他节点当选为Leader,此时Candidate会切换为Follower状态。如果Candidate收到了更小的任期号,节点会拒绝这次请求,并继续保持Candidate状态。The Candidate node receives a message with a larger term number sent by other nodes, which means that other nodes are elected as Leaders. At this time, Candidate will switch to the Follower state. If Candidate receives a smaller term number, the node will reject the request and keep the Candidate state.
若各个Candidate节点均未获得半数的选票,则选举超时。此时每个Candidate节点通过增加当前任期号开始一轮新的选举。为防止多次选举超时,Raft使用随机选举超时时间的算法,每个Candidate开始一次选举时,会设置一个随机选举超时时间,防止多个Candidate节点同时超时、同时开始下一轮选举,从而减少在新的选举中选票被瓜分的可能性。If each candidate node does not get half of the votes, the election times out. At this point, each Candidate node starts a new round of election by incrementing the current term number. In order to prevent multiple election timeouts, Raft uses a random election timeout algorithm. When each candidate starts an election, a random election timeout will be set to prevent multiple candidate nodes from timing out at the same time and starting the next round of elections at the same time. Likelihood of votes being split in new elections.
在采用了异构副本的Raft系统中,与传统Raft不同,Leader的存储模式对整体系统的性能影响很大。因此,不同存储模式在Leader选举中的权重也应不同。例如,行存存储模式更适合处理写入操作,因此选举中的权重应该较高;列段存不适合处理写入操作,因此选举中的权重应该较低。In a Raft system that uses heterogeneous replicas, unlike traditional Raft, the storage mode of the leader has a great impact on the performance of the overall system. Therefore, the weights of different storage modes in leader election should also be different. For example, the row storage mode is more suitable for processing write operations, so the weight in the election should be higher; the column storage mode is not suitable for processing write operations, so the weight in the election should be lower.
3.日志复制3. Log replication
每个服务器节点都由基于复制日志实现的复制状态机,若状态机的起始状态相同,从日志中获取的执行指令顺序也相同,则状态机的最终状态也一定相同。Each server node is implemented by a replicated state machine based on a replicated log. If the initial state of the state machine is the same and the order of execution instructions obtained from the log is also the same, the final state of the state machine must also be the same.
当Leader选举出来以后,系统对外提供服务。Leader接收客户端发来的请求,每个请求包含一个作用于副本状态机的命令。Leader将每个请求封装成一个日志条目,追加到日志尾部,同时将这些日志条目按顺序并行的发送给Follower。每个日志条目都包含了一个状态机命令和Leader收到该请求的当前任期号,此外还包括该日志条目在日志文件中的位置索引。当日志条目被安全地复制到多数节点,该日志条目被称为Committed,Leader向客户端返回成功,并通知各个节点按照相同的顺序将日志条目中的状态机命令应用到复制状态机,此时该日志被称为Applied。When the leader is elected, the system provides services to the outside world. The leader receives requests from clients, and each request contains a command that acts on the replica state machine. The Leader encapsulates each request into a log entry, appends it to the tail of the log, and sends these log entries to the Followers in parallel in order. Each log entry contains a state machine command and the current term number of the leader that received the request, as well as an index of the log entry's location in the log file. When the log entry is safely replicated to the majority of nodes, the log entry is called Committed, the Leader returns success to the client, and notifies each node to apply the state machine commands in the log entry to the replicated state machine in the same order. The log is called Applied.
在采用了异构副本的Raft系统中,与传统Raft采用相同的日志同步流程。在同步完日志之后,日志内的信息需要经过日志回放才能转变为表结构数据。此时,由于Leader和Follower的结构可能不相同,因此日志回放的过程比传统Raft更复杂。Follower在进行日志回放时,通常可以采取批量处理的方式,将一批数据同时转换为相应的存储模式,以减少日志回放引发的额外开销。In a Raft system using heterogeneous replicas, the same log synchronization process is used as in traditional Raft. After the log is synchronized, the information in the log needs to be replayed before it can be converted into table-structured data. At this time, since the structure of Leader and Follower may be different, the process of log playback is more complicated than traditional Raft. When a Follower performs log playback, it can usually adopt batch processing to convert a batch of data into the corresponding storage mode at the same time, so as to reduce the extra overhead caused by log playback.
可以看出,Raft日志复制是一个Quorum过程,能够容忍n/2-1个副本失败。Leader会在后台为落后的副本补完日志。It can be seen that Raft log replication is a Quorum process that can tolerate n/2-1 replica failures. The leader completes the log for the lagging replicas in the background.
为了保证Follower的日志与Leader的一致,Leader需要找到Follower与其日志一致的索引位置,并让Follower删除该位置后的日志条目,然后再将自己在该索引位置后的条目发送给Follower。此外,Leader为每个Follower维护了一个NextIndex,用来表示Leader将要发送给该Follower的下一条日志条目索引。当一个Leader开始它的任期后,会将NextIndex初始化为它最新日志条目索引+1,如果Follower在一致性检查的锅中发现自己的日志索引和Leader不一致,则拒绝接受该日志条目。Leader收到响应后将NextIndex递减,然后重试,知道nextIndex达到一个Leader和Follower日志一致的位置。此时,来自Leader的日志条目会被追加成功,Leader和Follower的日志达成一致。因此,Raft日志复制机具有以下特性:In order to ensure that the follower's log is consistent with the leader's, the leader needs to find the index position of the follower and its log, and let the follower delete the log entry after this position, and then send the entry after the index position to the follower. In addition, the leader maintains a NextIndex for each follower, which is used to indicate the index of the next log entry that the leader will send to the follower. When a leader starts its term, it will initialize the NextIndex to its latest log entry index + 1. If the follower finds that its log index is inconsistent with the leader in the pot of consistency check, it will refuse to accept the log entry. After the Leader receives the response, it decrements the NextIndex, and then tries again until the nextIndex reaches a position consistent with the Leader and Follower logs. At this point, the log entry from the leader will be appended successfully, and the leader and follower's logs will reach an agreement. Therefore, the Raft log replicator has the following characteristics:
如果在不同日志中的两个日志条目拥有相同的日志索引和任期号,那么这两条日志存储了相同的状态机命令。If two log entries in different logs have the same log index and term number, then the two logs store the same state machine commands.
如果在不同日志中的两个日志条目拥有相同的日志索引和任期号,那么他们之前的所有日志条目也全部相同。If two log entries in different logs have the same log index and term number, then all log entries before them are also the same.
为了防止已提交的日志被覆盖,Raft要求Candidate需要拥有所有已提交的日志条目。若一个节点新当选为Leader,则它只能提交当前Term的已经复制到大多数节点上的日志。旧任期号的日志即使已经复制到大多数节点上,也不能由当前Leader直接提交,而是需要在Leader提交当前任期号的日志时,通过日志匹配间接提交。To prevent the committed log from being overwritten, Raft requires that Candidate needs to have all committed log entries. If a node is newly elected as Leader, it can only submit logs of the current Term that have been replicated to most nodes. Even if the log of the old term number has been copied to most nodes, it cannot be directly submitted by the current leader, but needs to be submitted indirectly through log matching when the leader submits the log of the current term number.
第11小节-LSM-TreeSection 11 - LSM-Tree
LSM-Tree全称为日志结构合并树,是由PatrickO’Neil教授与1996年在The Log-Structured Merge-Tree一文中提出的。日志结构合并树这一名称取自日志结构文件系统。LSM-tree的实现如同日志文件系统,它基于不可变存储方式,采用缓冲和仅追加写的实现顺序写操作,避免了可变存储结构中绝大部分的随机写操作,降低了写操作带来的多次随机I/O对性能的影响,提高了磁盘上数据空间的利用率。它保证了磁盘数据存储的有序性。不可变的磁盘存储结构有利于顺序写入。数据可以一次性的写入磁盘,并且在磁盘中是以仅追加的形式存在的,这也使得不可变存储结构具有更高的数据密度,避免了外部碎片的产生。The full name of LSM-Tree is the log-structured merge tree, which was proposed by Professor Patrick O'Neil in The Log-Structured Merge-Tree in 1996. The name log-structured merge tree is taken from the log-structured file system. The implementation of LSM-tree is similar to the log file system. It is based on the immutable storage method and uses buffering and append-only writing to implement sequential write operations, avoiding most random write operations in the variable storage structure and reducing the impact of write operations. The impact of multiple random I/Os on performance improves the utilization of data space on disk. It guarantees the orderliness of disk data storage. The immutable disk storage structure facilitates sequential writes. Data can be written to the disk once, and it exists in the form of append only, which also makes the immutable storage structure have higher data density and avoids the generation of external fragmentation.
由于文件是不可变的,所以写入操作、插入操作和更新操作都无须提前定位到数据位置,大大减少了由于随机I/O带来的影响,并且显著提高了写入的性能和吞吐量。但对于不可变大文件来说,重复是允许的,随着追加的数据不断增多,磁盘驻留表的数量不断增长,需要解决读取时带来的文件重复问题。可以通过触发合并操作进行对LSM-Tree的维护。Since the file is immutable, write operations, insert operations and update operations do not need to locate the data location in advance, greatly reducing the impact caused by random I/O, and significantly improving write performance and throughput. However, for immutable large files, duplication is allowed. With the continuous increase of appended data, the number of disk resident tables continues to increase, and the problem of file duplication caused by reading needs to be solved. Maintenance of the LSM-Tree can be done by triggering a merge operation.
LSM-Tree中,数据以有序字符串表(Sorted String Table,SSTable)的形式存在,SSTable通常有两个组件组成,分别是索引文件和数据文件。索引文件保存键和其在数据文件中的偏移量,数据文件由连起来的键值对组成,每个SSTable由多个页构成。在查询一条数据时,并非想B+树一样直接定位到数据所在的页,而是先定位到SSTable再根据SSTable中的索引文件找到数据对应的页。In LSM-Tree, data exists in the form of an ordered string table (Sorted String Table, SSTable). SSTable usually consists of two components, namely an index file and a data file. The index file stores keys and their offsets in the data file. The data file consists of concatenated key-value pairs, and each SSTable consists of multiple pages. When querying a piece of data, instead of directly locating the page where the data is located like the B+ tree, it first locates the SSTable and then finds the page corresponding to the data according to the index file in the SSTable.
1.LSM-Tree结构1.LSM-Tree structure
LSM-tree的整体架构如图所示,其包括内存驻留组件和磁盘驻留组件。执行写请求写入数据时,会先在磁盘CommitLog上记录操作,以便进行故障恢复,随后记录被写入可变内存组件(Memtable)中,当Memtable达到某个阈值后,会转变成不可变内存驻留组件(Immemtable),并在后台将数据刷写到磁盘上。对于磁盘驻留组件,写入的数据会分为多个层级,从Immentable输入的数据会优先进入Level0层,并生成相应的SSTable,待Level0层达到某个阈值后,Level0层的SSTable会以一种方式合并到Level1ceng,并依此方式逐层向下合并。The overall architecture of LSM-tree is shown in the figure, which includes memory-resident components and disk-resident components. When executing a write request to write data, the operation is first recorded on the disk CommitLog for fault recovery, and then the record is written to the variable memory component (Memtable). When the Memtable reaches a certain threshold, it will be converted into immutable memory Resident component (Immemtable) and flushes data to disk in the background. For the disk-resident component, the written data will be divided into multiple levels. The data input from Immentable will enter the Level0 layer first, and the corresponding SSTable will be generated. This method is merged into Level1ceng, and merged down layer by layer in this way.
内存驻留组件:内存驻留组件由Memtable与Immemtable组成。数据在Memtable中通常以有序的调表(Skip List)结构进行存储,以此保证磁盘数据的有序性。Memtable负责缓冲数据记录并充当读写操作的首要目标。Immemtable完成对数据的落盘操作。Memory resident component: The memory resident component consists of Memtable and Immemtable. Data is usually stored in an ordered Skip List structure in Memtable to ensure the orderliness of disk data. Memtable is responsible for buffering data records and acts as the primary target for read and write operations. Immemtable completes the drop operation of the data.
磁盘驻留组件:磁盘驻留组件是由WAL与SSTable组成的。由于Memtable存在于内存中,为防止系统故障导致内存中尚未写入磁盘的数据丢失,在想Memtable中写入数据之前,需要先将操作记录写入WAL保证数据记录的持久化。SSTable是由Immetable刷写到磁盘上的数据记录所构建的,SSTable是不可变的,仅可用于读取合并和删除操作。Disk resident component: The disk resident component is composed of WAL and SSTable. Since the Memtable exists in the memory, in order to prevent the loss of data that has not been written to the disk in the memory due to system failure, before writing data to the Memtable, it is necessary to write the operation record to the WAL to ensure the persistence of the data record. SSTables are constructed from data records that Immetable flushes to disk. SSTables are immutable and can only be used for read merge and delete operations.
2.LSM-Tree的更新与删除2. Update and delete of LSM-Tree
由于LSM-Tree是基于不可变存储结构的,更新操作无法直接修改原有数据,只能以时间戳作为标记插入一条新数据,因此LSM-Tree中无法显示地区分插入操作与更新操作。删除操作也同样,可以通过插入一个特殊的删除标记,有时也成为了墓碑来实现。该条目说明此键对应的数据记录已被删除。Since LSM-Tree is based on an immutable storage structure, the update operation cannot directly modify the original data, and only a new piece of data can be inserted with a timestamp as a mark. Therefore, the insertion operation and the update operation cannot be clearly distinguished in the LSM-Tree. The same is true for delete operations, which can be implemented by inserting a special delete marker, sometimes called a tombstone. This entry states that the data record corresponding to this key has been deleted.
3.LSM-Tree的查找3. LSM-Tree search
LSM-Tree查找一条数据的访问顺序如下所示:The access order of LSM-Tree to find a piece of data is as follows:
访问可变内存驻留组件。Access variable memory resident components.
访问不可变内存驻留组件。Access immutable memory-resident components.
访问磁盘驻留组件,从Level0层开始依次访问,由于越底层级的数据越新,因此在找到要查找的数据时立即返回,便可取得该数据的最新值。Access the disk-resident components, starting from the Level0 layer, because the data at the lower level is newer, so when you find the data you want to find, you can return immediately to get the latest value of the data.
除此之外,还有一些针对读操作的优化,布隆过滤器常用于判断一个SSTable是否包含特定的键,布隆过滤器的低层是一个位图结构,用于表示一个集合,并能判断一个元素是否属于这个集合。应用布隆过滤器可以大大减少磁盘的访问次数。但其也存在一定的误判率,由于其位图值基于散列函数进行判定,最终会发生多个值的散列冲突问题。在判断一个元素是否属于某个集合时,可能会把不属于这个集合的元素误认为属于这个集合,即布隆过滤器具有假阳性,同时判断一个元素在集合中需要位图中多位值为1,这也从根本上决定了布隆过滤器不存在假阴性。也就是说他可能存在以下两种情形:In addition, there are some optimizations for read operations. Bloom filters are often used to determine whether an SSTable contains a specific key. The lower layer of the Bloom filter is a bitmap structure, which is used to represent a collection and can determine whether an SSTable contains a specific key. Whether an element belongs to this set. Applying a bloom filter can greatly reduce the number of disk accesses. However, it also has a certain false positive rate. Since its bitmap value is determined based on a hash function, a hash collision problem of multiple values will eventually occur. When judging whether an element belongs to a certain set, elements that do not belong to this set may be mistaken as belonging to this set, that is, the Bloom filter has false positives, and at the same time, judging that an element in the set needs to be multi-bit values in the bitmap 1, which also fundamentally determines that the Bloom filter does not have false negatives. That is to say, he may have the following two situations:
布隆过滤器判定某个元素不在集合中,那么这个元素一定不在。Bloom filter determines that an element is not in the set, then this element must not be.
布隆过滤器判定某个元素在集合中,那么这个元素可能在,也可能不在。Bloom filter determines that an element is in the set, then this element may or may not be.
4.LSM-Tree的合并策略4. Merging strategy of LSM-Tree
在LSM-Tree中,随着磁盘驻留表数据的不断增加,可以通过周期性的合(Compaction)操作减少重复数据。基本的合并策略有两种,分别是Tiered Compaction与LeveledCompaction。In LSM-Tree, with the continuous increase of disk resident table data, repeated data can be reduced through periodic compaction operations. There are two basic merge strategies, Tiered Compaction and LeveledCompaction.
(1)两种合并策略的具体表现形式(1) The specific manifestations of the two merger strategies
TieredCompaction:每层允许的SSTable文件的最大数量都由一个相同的阈值,随着Immemtable不断刷写成SSTable,当某层的SSTable数量达到阈值时,就把盖层的所有SSTable合并成一个大的新SSTable,并放到较高的一层。其优先点是实现简单,缺点是合并时的空间放大问题比较严重,且越高层级的SSTable越大,读放大问题越严重。TieredCompaction: The maximum number of SSTable files allowed for each layer is determined by the same threshold. As Immemtable is continuously written into SSTables, when the number of SSTables in a certain layer reaches the threshold, all SSTables in the cover layer are merged into a large new SSTable , and place it on a higher level. The priority is simple implementation, but the disadvantage is that the space enlargement problem during merging is more serious, and the higher the level of the SSTable, the larger the read enlargement problem is.
Leveled Compaction:每层的SSTable大小固定,默认为2MB,且每层的SSTable最大数量为其第一层的N倍。Leveled Compaction: The SSTable size of each layer is fixed, the default is 2MB, and the maximum number of SSTables per layer is N times that of the first layer.
(2)合并策略的过程(2) The process of merging the strategy
当L0层满时,将L0层的全部SSTable与L1层的全部SSTable合并,并去掉重复的Key值,基于SSTable大小的限制,会合并成多个SSTable文件,并归入L1层。When the L0 layer is full, all the SSTables in the L0 layer are merged with all the SSTables in the L1 layer, and the duplicate Key values are removed. Based on the size limit of the SSTable, multiple SSTable files are merged and classified into the L1 layer.
当L1-LN层满时,选取满一层的一个SSTable与下一层合并。When the L1-LN layer is full, select an SSTable of the full layer and merge it with the next layer.
其优点是减少了空间放大,但缺点是合并时会造成严重的写放大问题。The advantage is that it reduces space amplification, but the disadvantage is that it can cause serious write amplification problems when merging.
5.读放大、写放大和空间放大5. Read Amplification, Write Amplification and Space Amplification
基于不可变存储结构的LSM-Tree会存在读放大问题,就如同基于可变存储结构的B+树存在写放大问题是不可避免的。但LSM-Tree中不同的合并策略又会带来新的问题。在分布式领域,著名的CAP理论证明了一个分布式系统最多只能同时满足一致性(Consistency)、可用性(Availability)和分区容错性(Partition Tolerance)三项中的两项。与此相似的是,Manos Athanassoulis等人在2016年提出了RUM猜想,它指出对于任何数据结构来说,最多只能同时优化读放大(Read Amplification)、写放大(WriteAmplification)和空间放大(Space Amplification)中的两项,并需要以牺牲另一项作为代价。总结来说,基于不可变存储结构的LSM-Tree在存储数据时会面临以下三个问题。The LSM-Tree based on the immutable storage structure will have the problem of read amplification, just as the B+ tree based on the variable storage structure will have the problem of write amplification, which is inevitable. But different merging strategies in LSM-Tree will bring new problems. In the distributed field, the famous CAP theory proves that a distributed system can only satisfy at most two of the three items of consistency (Consistency), availability (Availability) and partition tolerance (Partition Tolerance). Similar to this, Manos Athanassoulis et al. proposed the RUM conjecture in 2016, which pointed out that for any data structure, at most Read Amplification, Write Amplification and Space Amplification can be optimized simultaneously. ) at the expense of the other. To sum up, LSM-Tree based on immutable storage structure will face the following three problems when storing data.
读放大,为了检索数据,需要一层一层的查找,造成额外的磁盘I/O操作,尤其在范围查询时读放大现象很明显。Read amplification, in order to retrieve data, requires layer-by-layer search, resulting in additional disk I/O operations, especially in range queries, the phenomenon of read amplification is obvious.
写放大,在合并过程中不断地重写为新的文件,从而导致写放大。Write amplification, which is continuously rewritten to new files during the merge process, causes write amplification.
空间放大,由于重复是允许的,而且过期的数据不会被马上清理掉,由此会导致空间放大。Space enlargement, because duplication is allowed, and expired data will not be cleaned up immediately, which will lead to space enlargement.
由于这两种合并策略的实现方式不同,会分别导致空间放大和写放大问题。Due to the different implementations of these two merging strategies, space amplification and write amplification will be caused respectively.
Tiered Compaction的合并过程会导致较高层的SSTable文件非常大,当执行合并操作时,为了保证容错在合并操作结束以前不会删除原有的SSTable文件,这会造成短期内数据量变为原来的近两倍,待合并操作完成后,会删除旧的数据,此时数据量会恢复正常。尽管只是暂时的,但这仍然是一个严重的空间放大问题。The merge process of Tiered Compaction will cause the SSTable file of the higher layer to be very large. When the merge operation is performed, in order to ensure fault tolerance, the original SSTable file will not be deleted before the end of the merge operation, which will cause the amount of data to become nearly two in the short term. After the merge operation is completed, the old data will be deleted, and the data volume will return to normal. Although only temporary, this is still a serious space magnification problem.
Leveled Compaction由于固定了SSTable的大小,但其合并策略不是将整层的所有SSTable与下一层进行合并,而是选择一个与下一层具有相同Key值的SSTable进行合并,减少了空间放大问题:但由于合并过程中一个SSTable可能与下一层的10个SSTable都存在重复的Key值,此时就需要重写10个SSTable文件,会导致严重的写放大。Leveled Compaction fixes the size of the SSTable, but its merging strategy is not to merge all SSTables in the entire layer with the next layer, but to select an SSTable with the same Key value as the next layer to merge, reducing the problem of space enlargement: However, since one SSTable may have duplicate Key values with the 10 SSTables in the next layer during the merging process, 10 SSTable files need to be rewritten at this time, which will lead to serious write amplification.
第12小节-元数据变化对存储模式产生的影响Section 12 - Impact of Metadata Changes on Storage Schema
在多样化存储模式下,元数据发生变化可能会影响到存储模式的选择,甚至会将副本上已存储的数据转换为另一种存储模式,对于元数据发生变更的情况,本方案提供两种解决办法,下面进行详细讨论。元数据发生变更时,某些存储模式会受到影响。举例来说,如果表结构中加一列,列存模式只需对应的增加一列即可;In the diversified storage mode, the change of metadata may affect the choice of storage mode, and even convert the data stored on the copy to another storage mode. For the change of metadata, this solution provides two types of storage modes. The solution is discussed in detail below. When metadata changes, some storage modes are affected. For example, if a column is added to the table structure, the column storage mode only needs to add a corresponding column;
行存与多行存模式受影响较大,需要将该列中的数据添加到逐一添加到行存/多行存数据中;Cell存模式中,由于Cell的key是(tableid、rowid、cloumnid)组合而成的,增加一列时,我们只需要按照同样的规则进行拆分,所以cell行存、cell列存不会造成额外的开销。The row storage and multi-row storage modes are greatly affected, and the data in this column needs to be added to the row storage/multi-row storage data one by one; in the Cell storage mode, because the cell's key is (tableid, rowid, cloumnid) Combined, when adding a column, we only need to split according to the same rules, so cell row storage and cell column storage will not cause additional overhead.
对于列族存、列族段存,存储模式的改变可以有下列两种方式。For column family storage and column family segment storage, the storage mode can be changed in the following two ways.
1.选择最优存储模式:表结构发生变化时,当前工作负载下的数据分布会因此发生变化,从而会影响到自适应查询中的分组范围。这种情况下,当前的存储模式可能已经不是当前工作负载的最优解了。因此,我们需要重新进行查询自适应,抉择出最优的存储模式。1. Select the optimal storage mode: When the table structure changes, the data distribution under the current workload will change accordingly, which will affect the grouping range in the adaptive query. In this case, the current storage model may not be the optimal solution for the current workload. Therefore, we need to perform query adaptation again and choose the optimal storage mode.
2.选择改动最小的存储模式:在方案1中,我们重新对存储模式进行了选择,虽然找到了最优的存储模式,但是这种方案的开销太大,我们需要重新转换该表全部数据的存储模式。为了避免这样的大开销,我们可以选择改动最小的存储模式。例如当副本的存储模式为列族存时,表结构增加一列,我们可以为新增的一列选择列族存,在不影响表中其它数据的情况下完成对元数据变更的适应。2. Select the storage mode with the least change: In scheme 1, we re-selected the storage mode. Although we found the optimal storage mode, the overhead of this scheme is too high, and we need to re-transform all the data in the table. storage mode. In order to avoid such a large overhead, we can choose the storage mode with minimal changes. For example, when the storage mode of the copy is column family storage, and a column is added to the table structure, we can select column family storage for the newly added column, and complete the adaptation to metadata changes without affecting other data in the table.
对于列段存来说,可能需要将该列聚集到某个列段当中,会造成一定额外的组合开销。For column segment storage, it may be necessary to aggregate the column into a column segment, which will cause some additional combination overhead.
整表存在插入该列时,会产生一些插入所必须的开销,不会对系统性能造成太大影响。When inserting this column exists in the whole table, it will generate some overhead necessary for inserting, and will not cause much impact on system performance.
本申请实施例中提供的方法所产生的有益效果包括:The beneficial effects produced by the methods provided in the embodiments of the present application include:
1.本方案提出的多样化存储模式层次分明、逻辑清晰,不需要修改任何查询计划、共识协议处理流程,不需要感知底层存储引擎的变化,只需要改动少量代码更改KV映射的规则就可直接兼容本方案的改动,使原有系统获得多样化存储的支持。同时,由于下层存储模式的扩展,上层查询优化可以获得更好的调参方案,因此开发者可以省下更多的计算和存储的开销。1. The diversified storage mode proposed by this solution has clear layers and clear logic. It does not need to modify any query plan or consensus protocol processing process, and does not need to perceive changes in the underlying storage engine. It only needs to change a small amount of code to change the rules of KV mapping. Compatible with the changes of this scheme, the original system can obtain the support of diversified storage. At the same time, due to the expansion of the lower-layer storage mode, the upper-layer query optimization can obtain better parameter adjustment schemes, so developers can save more computing and storage costs.
2.由于下层存储模式的扩展,查询优化可以得到更好的配置方案,数据库可以加快对用户请求的处理速度,因此,用户对于系统的满意度也会上升。2. Due to the expansion of the underlying storage mode, the query optimization can obtain a better configuration scheme, and the database can speed up the processing speed of user requests. Therefore, the user's satisfaction with the system will also increase.
3.本方案中使用了基于多样化存储的多副本异构方案,使得在混合负载下,不同类型的请求可以在采用不同存储模式的副本上,更好的发挥磁盘读取优势,减少网络带宽的占用。3. In this solution, a multi-copy heterogeneous solution based on diversified storage is used, so that under mixed load, different types of requests can better utilize the advantages of disk reading and reduce network bandwidth on replicas with different storage modes. occupancy.
4.采用计算与存储分离架构,方便改变存储节点分布;4. Adopting the separation architecture of computing and storage, it is convenient to change the distribution of storage nodes;
5.整体而言,本方案扩展了存储模式,解决了查询请求不能很好匹配存储模式的问题,节省了磁盘的存储空间,提升了系统的I/O效率,最终达到提升系统整体性能的问题。5. On the whole, this solution expands the storage mode, solves the problem that the query request cannot match the storage mode well, saves the storage space of the disk, improves the I/O efficiency of the system, and finally achieves the problem of improving the overall performance of the system. .
此外,对于多样化存储模式的管理办法,该方案中采用Raft作为叙述对象,Raft是一个强Leader的分布式协议。但是对于Leaderless Replication的分布式协议,也可以工作。比如Dynamo,它是一个LeaderLess Replication的分布式协议,相比于Raft必须拥有一个Leader,日志必须要先经过Leader,然后再进行Follower间的同步,等待半数以上成员认可然后再Commit的工作流程,Dynamo的中心策略是保证写副本数量+读副本数量>副本总数量,这样就可以保证每次读操作都可以得到最新的数据。在这种分布式协议上,我们也可以支持多样化的存储模式,提升系统的性能。In addition, for the management method of diversified storage modes, Raft is used as the narrative object in this scheme, and Raft is a distributed protocol with strong leader. But for the distributed protocol of Leaderless Replication, it also works. For example, Dynamo is a distributed protocol for LeaderLess Replication. Compared with Raft, which must have a leader, the log must first go through the leader, then synchronize between followers, wait for more than half of the members to approve and then commit the workflow, Dynamo The central strategy is to ensure that the number of write replicas + the number of read replicas > the total number of replicas, so that each read operation can get the latest data. On this distributed protocol, we can also support diversified storage modes to improve the performance of the system.
对于多样化存储的异构多副本方案,其他的分布式副本管理协议同样可以使用,例如Paxos、Multi-Paxos、Quorum机制。只需要合理设计KV映射规则,不同副本之间使用合理的日志翻译技术,就能够实现多样化存储的异构多副本。For the heterogeneous multi-copy scheme of diversified storage, other distributed copy management protocols can also be used, such as Paxos, Multi-Paxos, and Quorum mechanism. It is only necessary to reasonably design KV mapping rules and use reasonable log translation technology between different replicas to realize heterogeneous multi-copy of diversified storage.
综上所述,该方案在基于LSMTree的存储引擎中实现了存储模式的多样化,并且基本包含了所有KV存储模式,即使后续再出现新的存储模式,也只是对本申请中提出的存储模式的一些改动,并不会超出本方案所讨论的范畴。To sum up, this solution realizes the diversification of storage modes in the storage engine based on LSMTree, and basically includes all KV storage modes. Even if new storage modes appear in the future, it is only for the storage mode proposed in this application. Some changes will not go beyond the scope discussed in this program.
应该理解的是,虽然如上所述的各实施例所涉及的流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限制,这些步骤可以以其它的顺序执行。而且,如上所述的各实施例所涉及的流程图中的至少一部分步骤可以包括多个步骤或者多个阶段,这些步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,这些步骤或者阶段的执行顺序也不必然是依次进行,而是可以与其它步骤或者其它步骤中的步骤或者阶段的至少一部分轮流或者交替地执行。It should be understood that, although the steps in the flowcharts involved in the above embodiments are sequentially displayed according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, the execution of these steps is not strictly limited to the order, and these steps may be performed in other orders. Moreover, at least a part of the steps in the flowcharts involved in the above embodiments may include multiple steps or multiple stages, and these steps or stages are not necessarily executed and completed at the same time, but may be performed at different times The execution order of these steps or phases is not necessarily sequential, but may be performed alternately or alternately with other steps or at least a part of the steps or phases in the other steps.
基于同样的发明构思,本申请实施例还提供了一种用于实现上述所涉及的数据库存储模式的自适应方法的数据库存储模式的自适应装置。该装置所提供的解决问题的实现方案与上述方法中所记载的实现方案相似,故下面所提供的一个或多个数据库存储模式的自适应装置实施例中的具体限定可以参见上文中对于数据库存储模式的自适应方法的限定,在此不再赘述。Based on the same inventive concept, an embodiment of the present application also provides a database storage mode adaptation apparatus for implementing the above-mentioned database storage mode adaptation method. The implementation solution for solving the problem provided by the device is similar to the implementation solution described in the above method, so the specific limitations in the embodiments of the adaptive device for one or more database storage modes provided below can refer to the above for database storage The limitation of the mode adaptive method is not repeated here.
在一个实施例中,如图15所示,提供了一种数据库存储模式的自适应装置,包括:获取模块1502和确定模块1504,其中:In one embodiment, as shown in FIG. 15, an adaptive apparatus for database storage mode is provided, including: an
获取模块1502,用于获取工作负载的运行参数和各键值对的读写操作性能参数;所述运行参数包括读写操作比例、查询语句和数据分布。The obtaining
确定模块1504,用于根据所述读写操作比例和所述读写操作性能参数,确定所述键值对所占用空间的第一空间大小;根据所述查询语句的语义信息,确定存储区的分区方式;根据所述数据分布确定候选分组范围下的数据密度;基于所述分区方式、所述候选分组范围、所述数据密度和所述第一空间大小,确定在运行所述工作负载时数据库中数据表的存储模式。The determining
在一个实施例中,所述装置还包括:获取模块还用于获取所述键值对占用的空间大小;确定模块还用于在占用所述空间大小的情况下,确定所述键值对的读写操作性能参数;建立模块,用于基于所述读写操作性能参数,建立所述键值对的性能模型。In one embodiment, the apparatus further includes: the obtaining module is further configured to obtain the size of the space occupied by the key-value pair; the determining module is further configured to determine the size of the key-value pair when the space is occupied. a read-write operation performance parameter; a building module is configured to establish a performance model of the key-value pair based on the read-write operation performance parameter.
在一个实施例中,确定模块还用于根据所述读写操作比例和所述读写操作性能参数,确定所述工作负载在所述键值对占用所述空间大小时的总体性能参数;基于所述总体性能参数,确定所述键值对所占用空间的第一空间大小。In one embodiment, the determining module is further configured to determine the overall performance parameter of the workload when the key-value pair occupies the space size according to the read-write operation ratio and the read-write operation performance parameter; based on The overall performance parameter determines the first space size of the space occupied by the key-value pair.
在一个实施例中,所述装置还包括:选取模块,用于当所述总体性能参数为总体耗时时,选取各所述总体耗时中满足耗时条件的总体耗时作为目标总体耗时;将所述目标总体耗时所对应的键值对所占用的空间大小作为所述第一空间大小。In one embodiment, the device further includes: a selection module, configured to select the overall time consumption that satisfies the time-consuming condition in each of the overall time-consuming as the target overall time-consuming when the overall performance parameter is the overall time-consuming; The size of the space occupied by the key-value pair corresponding to the total time-consuming of the target is taken as the first space size.
在一个实施例中,获取模块还用于获取所述数据表所涉及的查询语句和查询语句总量;确定模块还用于确定各所述查询语句的数量与所述查询语句总量之间的比值;在所涉及的查询语句中,基于所述比值确定目标查询语句;基于所述目标查询语句中的列信息,确定存储区的分区方式。In one embodiment, the acquiring module is further configured to acquire the query statements and the total amount of query statements involved in the data table; the determining module is further configured to determine the difference between the number of each of the query statements and the total amount of the query statements Ratio; in the query statement involved, the target query statement is determined based on the ratio; the partition mode of the storage area is determined based on the column information in the target query statement.
在一个实施例中,确定模块还用于确定候选分组范围;基于所述数据表的主键数值和所述候选分组范围确定数据分组;根据所述数据分布确定各所述数据分组内的第二数据密度;将所述第二数据密度的均值作为所述候选分组范围下的第一数据密度。In one embodiment, the determining module is further configured to determine a candidate grouping range; determine a data grouping based on the primary key value of the data table and the candidate grouping range; determine the second data in each data grouping according to the data distribution Density; taking the mean value of the second data density as the first data density under the candidate grouping range.
在一个实施例中,确定模块还用于基于所述分区方式、所述候选分组范围以及所述候选分组范围下的第一数据密度,确定所述候选分组范围下的键值对所占用的第二空间大小;当所述第二空间大小与所述第一空间大小之间的差值满足预设差值条件时,基于所述第二空间大小对应的分区方式和候选分组范围确定在运行所述工作负载时数据库中数据表的存储模式。In one embodiment, the determining module is further configured to determine, based on the partition mode, the candidate grouping range, and the first data density in the candidate grouping range, the first data density occupied by the key-value pair in the candidate grouping range Two space sizes; when the difference between the second space size and the first space size satisfies the preset difference condition, determine the running The storage mode of the data table in the database when the workload is described.
在一个实施例中,所述装置还包括:确定模块还用于当所述工作负载更换为其它工作负载时,确定所述其它工作负载在所述数据库中的目标存储模式;转换模块,用于在处于空闲状态时,将所述存储模式转换为所述目标存储模式。In one embodiment, the apparatus further includes: a determining module is further configured to determine a target storage mode of the other workload in the database when the workload is replaced with another workload; a conversion module, configured to When in the idle state, the storage mode is converted to the target storage mode.
在一个实施例中,所述装置还包括:编码模块,用于按照所述目标存储模式所对应的编码方式,对所述数据表中的数据重新编码,得到编码后的数据;存储模块,用于将编码后的所述数据进行存储。In one embodiment, the apparatus further includes: an encoding module, configured to re-encode the data in the data table according to the encoding mode corresponding to the target storage mode, to obtain encoded data; a storage module, used to store the encoded data.
在一个实施例中,所述装置还包括:读取模块,用于当所述存储模式为行存模式、且所述目标存储模式为单元格存模式时,读取所述数据表中的数据;存储模块还用于将所述数据表的表标识、所述数据表中目标行的行标识以及所述数据表中目标列的列标识作为键,将所述目标行和所述目标列之间相交的单元格中的数据作为值,得到所述键和所述值所组成的单元格数据;将所述键和所述值所组成的单元格数据进行存储。In one embodiment, the apparatus further includes: a reading module, configured to read the data in the data table when the storage mode is a row storage mode and the target storage mode is a cell storage mode The storage module is also used to use the table identification of the data table, the row identification of the target row in the data table and the column identification of the target column in the data table as a key, and the target row and the target column The data in the intersecting cells is taken as the value, and the cell data composed of the key and the value is obtained; the cell data composed of the key and the value is stored.
在一个实施例中,所述装置还包括:写入模块,用于当所述存储模式为单元格存模式、且所述目标存储模式为列存模式时,将所述单元格存模式下的单元格写入缓冲区中;存储模块还用于当所述缓冲区中的单元格数目满足所述列存模式所需的数目时,将所述数据表的表标识和所述数据表中目标列的列标识作为键,将所述目标列中的数据作为值,得到所述键和所述值所组成的列数据;将所述键和所述值所组成的列数据进行存储。In one embodiment, the apparatus further includes: a writing module, configured to write the data in the cell storage mode when the storage mode is the cell storage mode and the target storage mode is the column storage mode The cells are written into the buffer; the storage module is further configured to store the table identifier of the data table and the target in the data table when the number of cells in the buffer meets the number required by the column storage mode The column identifier of the column is used as the key, the data in the target column is used as the value, and the column data composed of the key and the value is obtained; the column data composed of the key and the value is stored.
在一个实施例中,所述装置还包括:聚集模块,用于当所述存储模式为列段存模式时,将目标列中的至少两个单元格聚集为一个分组;存储模块还用于以所述数据表的表标识、所述目标列的列标识和所述至少两个单元格的分组标识为键,以所述目标列中至少两个单元格内的数据为值进行存储。In one embodiment, the apparatus further includes: an aggregation module, configured to aggregate at least two cells in the target column into a group when the storage mode is the column segment storage mode; the storage module is further configured to The table identifier of the data table, the column identifier of the target column and the grouping identifier of the at least two cells are used as keys, and the data in the at least two cells in the target column are stored as values.
在一个实施例中,聚集模块还用于当所述存储模式为列族段存模式时,将位于所述数据表中第一方向上的每列中至少两个单元格聚集为列族;将位于所述数据表中第二方向上的每行中至少两个单元格聚集为数据分组;存储模块还用于以所述数据表的表标识、所述列族的族标识以及所述数据分组的组标识为键,以所述每列中至少两个单元格中的数据和所述每行中至少两个单元格中的数据为值进行存储。In one embodiment, the aggregation module is further configured to aggregate at least two cells in each column located in the first direction in the data table into a column family when the storage mode is a column family segment storage mode; At least two cells in each row located in the second direction in the data table are aggregated into data groups; the storage module is further configured to use the table identifier of the data table, the family identifier of the column family, and the data grouping The group identifier of is a key, and the data in at least two cells in each column and the data in at least two cells in each row are stored as values.
在一个实施例中,存储模块还用于当所述存储模式为整表存模式时,以所述数据表的表标识为键,以所述数据表中的数据为值进行存储。In one embodiment, the storage module is further configured to use the table identifier of the data table as the key and the data in the data table as the value to store when the storage mode is the whole table storage mode.
在一个实施例中,所述装置还包括:接收模块,用于接收不同类型的查询请求;确定模块还用于确定所述查询请求的执行时间、等待时间和传输时间;基于所述执行时间、所述等待时间、所述传输时间以及预设权重,确定所述查询请求所对应的第二存储模式;所述第一存储模式与所述第二存储模式是不同的存储模式;路由模块,用于将所述查询请求路由至所述第二存储模式对应的副本节点,以使所述副本节点基于所述查询请求进行数据查询。In one embodiment, the apparatus further includes: a receiving module configured to receive different types of query requests; the determining module is further configured to determine the execution time, waiting time and transmission time of the query request; based on the execution time, The waiting time, the transmission time and the preset weight determine the second storage mode corresponding to the query request; the first storage mode and the second storage mode are different storage modes; the routing module, using and routing the query request to a replica node corresponding to the second storage mode, so that the replica node performs data query based on the query request.
在一个实施例中,获取模块还用于获取所述查询请求的查找时间、处理时间和元组构建时间,确定模块还用于基于所述查找时间、所述读写数据量处理时间和所述元组构建时间确定所述查询请求的执行时间;获取模块还用于获取所述查询请求的请求队列时间、机器负载延迟时间和从节点数据同步时间,确定模块还用于基于所述请求队列时间、所述机器负载延迟时间和所述从节点数据同步时间确定所述查询请求的等待时间;确定所述查询请求的传输时间。In one embodiment, the obtaining module is further configured to obtain the search time, processing time and tuple construction time of the query request, and the determining module is further configured to obtain the search time, the read and write data volume processing time and the The tuple construction time determines the execution time of the query request; the obtaining module is further configured to obtain the request queue time, machine load delay time and slave node data synchronization time of the query request, and the determining module is further configured to obtain the request queue time based on the query request , the machine load delay time and the slave node data synchronization time determine the waiting time of the query request; determine the transmission time of the query request.
在一个实施例中,接收模块还用于接收范围查询请求;In one embodiment, the receiving module is further configured to receive a range query request;
确定所述范围查询请求对应的查询表中的列总数,以及确定所述范围查询请求所需目标列的列数量;所述目标列为所述查询表中的数据列;确定模块还用于基于所述列数量、所述列总数和预设阈值,确定所述范围查询请求所对应的第三存储模式;所述第一存储模式与所述第三存储模式是不同的存储模式;路由模块还用于将所述范围查询请求路由至所述第三存储模式对应的副本节点,以使所述副本节点基于所述范围查询请求进行数据查询。Determining the total number of columns in the query table corresponding to the range query request, and determining the number of target columns required by the range query request; the target column is a data column in the query table; the determining module is further configured to based on The number of columns, the total number of columns, and a preset threshold determine a third storage mode corresponding to the range query request; the first storage mode and the third storage mode are different storage modes; the routing module further for routing the range query request to the replica node corresponding to the third storage mode, so that the replica node performs data query based on the range query request.
上述数据库存储模式的自适应装置中的各个模块可全部或部分通过软件、硬件及其组合来实现。上述各模块可以硬件形式内嵌于或独立于计算机设备中的处理器中,也可以以软件形式存储于计算机设备中的存储器中,以便于处理器调用执行以上各个模块对应的操作。Each module in the above-mentioned adaptive device for database storage mode can be implemented in whole or in part by software, hardware and combinations thereof. The above modules can be embedded in or independent of the processor in the computer device in the form of hardware, or stored in the memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
在一个实施例中,提供了一种计算机设备,该计算机设备可以是服务器,其内部结构图可以如图16所示。该计算机设备包括处理器、存储器、输入/输出接口(Input/Output,简称I/O)和通信接口。其中,处理器、存储器和输入/输出接口通过系统总线连接,通信接口通过输入/输出接口连接到系统总线。其中,该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质和内存储器。该非易失性存储介质存储有操作系统、计算机程序和数据库。该内存储器为非易失性存储介质中的操作系统和计算机程序的运行提供环境。该计算机设备的数据库用于存储数据库存储模式的自适应数据。该计算机设备的输入/输出接口用于处理器与外部设备之间交换信息。该计算机设备的通信接口用于与外部的终端通过网络连接通信。该计算机程序被处理器执行时以实现一种数据库存储模式的自适应方法。In one embodiment, a computer device is provided, and the computer device may be a server, and its internal structure diagram may be as shown in FIG. 16 . The computer device includes a processor, a memory, an input/output interface (Input/Output, I/O for short) and a communication interface. Wherein, the processor, the memory and the input/output interface are connected through the system bus, and the communication interface is connected to the system bus through the input/output interface. Among them, the processor of the computer device is used to provide computing and control capabilities. The memory of the computer device includes non-volatile storage media and internal memory. The nonvolatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the execution of the operating system and computer programs in the non-volatile storage medium. The database of the computer device is used to store adaptive data of the database storage mode. The input/output interface of the computer device is used to exchange information between the processor and external devices. The communication interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by a processor, implements a method of adapting a database storage schema.
本领域技术人员可以理解,图16中示出的结构,仅仅是与本申请方案相关的部分结构的框图,并不构成对本申请方案所应用于其上的计算机设备的限定,具体的计算机设备可以包括比图中所示更多或更少的部件,或者组合某些部件,或者具有不同的部件布置。Those skilled in the art can understand that the structure shown in FIG. 16 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer equipment to which the solution of the present application is applied. Include more or fewer components than shown in the figures, or combine certain components, or have a different arrangement of components.
在一个实施例中,还提供了一种计算机设备,包括存储器和处理器,存储器中存储有计算机程序,该处理器执行计算机程序时实现上述各方法实施例中的步骤。In one embodiment, a computer device is also provided, including a memory and a processor, where a computer program is stored in the memory, and the processor implements the steps in the foregoing method embodiments when the processor executes the computer program.
在一个实施例中,提供了一种计算机可读存储介质,存储有计算机程序,该计算机程序被处理器执行时实现上述各方法实施例中的步骤。In one embodiment, a computer-readable storage medium is provided, which stores a computer program, and when the computer program is executed by a processor, implements the steps in the foregoing method embodiments.
在一个实施例中,提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行上述各方法实施例中的步骤。In one embodiment, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the steps in the foregoing method embodiments.
需要说明的是,本申请所涉及的用户信息(包括但不限于用户设备信息、用户个人信息等)和数据(包括但不限于用于分析的数据、存储的数据、展示的数据等),均为经用户授权或者经过各方充分授权的信息和数据,且相关数据的收集、使用和处理需要遵守相关国家和地区的相关法律法规和标准。It should be noted that the user information (including but not limited to user equipment information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, displayed data, etc.) involved in this application are all It is the information and data authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data need to comply with the relevant laws, regulations and standards of the relevant countries and regions.
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机程序来指令相关的硬件来完成,所述的计算机程序可存储于一非易失性计算机可读取存储介质中,该计算机程序在执行时,可包括如上述各方法的实施例的流程。其中,本申请所提供的各实施例中所使用的对存储器、数据库或其它介质的任何引用,均可包括非易失性和易失性存储器中的至少一种。非易失性存储器可包括只读存储器(Read-OnlyMemory,ROM)、磁带、软盘、闪存、光存储器、高密度嵌入式非易失性存储器、阻变存储器(ReRAM)、磁变存储器(Magnetoresistive Random Access Memory,MRAM)、铁电存储器(Ferroelectric Random Access Memory,FRAM)、相变存储器(Phase Change Memory,PCM)、石墨烯存储器等。易失性存储器可包括随机存取存储器(Random Access Memory,RAM)或外部高速缓冲存储器等。作为说明而非局限,RAM可以是多种形式,比如静态随机存取存储器(Static Random Access Memory,SRAM)或动态随机存取存储器(Dynamic RandomAccess Memory,DRAM)等。本申请所提供的各实施例中所涉及的数据库可包括关系型数据库和非关系型数据库中至少一种。非关系型数据库可包括基于区块链的分布式数据库等,不限于此。本申请所提供的各实施例中所涉及的处理器可为通用处理器、中央处理器、图形处理器、数字信号处理器、可编程逻辑器、基于量子计算的数据处理逻辑器等,不限于此。Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing relevant hardware through a computer program, and the computer program can be stored in a non-volatile computer-readable storage In the medium, when the computer program is executed, it may include the processes of the above-mentioned method embodiments. Wherein, any reference to a memory, a database or other media used in the various embodiments provided in this application may include at least one of a non-volatile memory and a volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive memory (ReRAM), magnetic variable memory (Magnetoresistive Random Memory) Access Memory, MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (Phase Change Memory, PCM), graphene memory, etc. Volatile memory may include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration and not limitation, the RAM may be in various forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM). The database involved in the various embodiments provided in this application may include at least one of a relational database and a non-relational database. The non-relational database may include a blockchain-based distributed database, etc., but is not limited thereto. The processors involved in the various embodiments provided in this application may be general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, data processing logic devices based on quantum computing, etc., and are not limited to this.
以上实施例的各技术特征可以进行任意的组合,为使描述简洁,未对上述实施例中的各个技术特征所有可能的组合都进行描述,然而,只要这些技术特征的组合不存在矛盾,都应当认为是本说明书记载的范围。The technical features of the above embodiments can be combined arbitrarily. In order to make the description simple, all possible combinations of the technical features in the above embodiments are not described. However, as long as there is no contradiction in the combination of these technical features It is considered to be the range described in this specification.
以上所述实施例仅表达了本申请的几种实施方式,其描述较为具体和详细,但并不能因此而理解为对本申请专利范围的限制。应当指出的是,对于本领域的普通技术人员来说,在不脱离本申请构思的前提下,还可以做出若干变形和改进,这些都属于本申请的保护范围。因此,本申请的保护范围应以所附权利要求为准。The above-mentioned embodiments only represent several embodiments of the present application, and the descriptions thereof are relatively specific and detailed, but should not be construed as a limitation on the scope of the patent of the present application. It should be pointed out that for those skilled in the art, without departing from the concept of the present application, several modifications and improvements can be made, which all belong to the protection scope of the present application. Therefore, the scope of protection of the present application should be determined by the appended claims.
Claims (20)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210764392.3A CN115114294B (en) | 2022-06-30 | 2022-06-30 | Adaptive method, device and computer equipment for database storage mode |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210764392.3A CN115114294B (en) | 2022-06-30 | 2022-06-30 | Adaptive method, device and computer equipment for database storage mode |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN115114294A true CN115114294A (en) | 2022-09-27 |
| CN115114294B CN115114294B (en) | 2025-06-17 |
Family
ID=83330096
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202210764392.3A Active CN115114294B (en) | 2022-06-30 | 2022-06-30 | Adaptive method, device and computer equipment for database storage mode |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN115114294B (en) |
Cited By (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115563116A (en) * | 2022-10-11 | 2023-01-03 | 北京奥星贝斯科技有限公司 | A database table scanning method, device and equipment |
| CN116303275A (en) * | 2022-12-29 | 2023-06-23 | 南京南瑞继保电气有限公司 | Service system and method for distributed files of electric power system industrial internet platform |
| CN116775598A (en) * | 2023-05-23 | 2023-09-19 | 阿里云计算有限公司 | Data table performance detection method, system, computing device and computer readable storage medium |
| CN116991848A (en) * | 2023-07-27 | 2023-11-03 | 上海沄熹科技有限公司 | A distributed database metadata replication method and system |
| CN117151712A (en) * | 2023-10-26 | 2023-12-01 | 腾讯科技(深圳)有限公司 | Blockchain transaction processing method, device, computer equipment and storage medium |
| CN117453707A (en) * | 2023-12-09 | 2024-01-26 | 北京镜舟科技有限公司 | Data updating method, device, electronic equipment and storage medium |
| CN118394762A (en) * | 2024-06-28 | 2024-07-26 | 济南浪潮数据技术有限公司 | Storage management method, device, program product and medium for distributed storage system |
| CN119232306A (en) * | 2024-10-17 | 2024-12-31 | 中电信量子科技有限公司 | Time synchronization method, cloud server, electronic device and storage medium |
| CN119759957A (en) * | 2025-03-07 | 2025-04-04 | 太原科技大学 | A high-storage query processing system based on equipment digital twin data |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2000045286A1 (en) * | 1999-01-28 | 2000-08-03 | Genrad, Inc. | Method and apparatus for distributed database access |
| CN111241184A (en) * | 2020-02-17 | 2020-06-05 | 湖南工学院 | High-throughput data processing method for distribution network multi-source database |
| CN113901069A (en) * | 2021-12-08 | 2022-01-07 | 威讯柏睿数据科技(北京)有限公司 | Data storage method and device of distributed database |
-
2022
- 2022-06-30 CN CN202210764392.3A patent/CN115114294B/en active Active
Patent Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2000045286A1 (en) * | 1999-01-28 | 2000-08-03 | Genrad, Inc. | Method and apparatus for distributed database access |
| CN111241184A (en) * | 2020-02-17 | 2020-06-05 | 湖南工学院 | High-throughput data processing method for distribution network multi-source database |
| CN113901069A (en) * | 2021-12-08 | 2022-01-07 | 威讯柏睿数据科技(北京)有限公司 | Data storage method and device of distributed database |
Non-Patent Citations (2)
| Title |
|---|
| 于利胜;张延松;王珊;张倩;: "基于行存储模型的模拟列存储策略研究", 计算机研究与发展, no. 05, 15 May 2010 (2010-05-15) * |
| 徐涛;顾瑜;汪东升;: "基于关键列分组排序的列存储结构", 计算机工程与科学, no. 08, 15 August 2016 (2016-08-15) * |
Cited By (12)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115563116A (en) * | 2022-10-11 | 2023-01-03 | 北京奥星贝斯科技有限公司 | A database table scanning method, device and equipment |
| CN116303275A (en) * | 2022-12-29 | 2023-06-23 | 南京南瑞继保电气有限公司 | Service system and method for distributed files of electric power system industrial internet platform |
| CN116775598A (en) * | 2023-05-23 | 2023-09-19 | 阿里云计算有限公司 | Data table performance detection method, system, computing device and computer readable storage medium |
| CN116991848A (en) * | 2023-07-27 | 2023-11-03 | 上海沄熹科技有限公司 | A distributed database metadata replication method and system |
| CN117151712A (en) * | 2023-10-26 | 2023-12-01 | 腾讯科技(深圳)有限公司 | Blockchain transaction processing method, device, computer equipment and storage medium |
| CN117151712B (en) * | 2023-10-26 | 2024-03-26 | 腾讯科技(深圳)有限公司 | Blockchain transaction processing method, device, computer equipment and storage medium |
| CN117453707A (en) * | 2023-12-09 | 2024-01-26 | 北京镜舟科技有限公司 | Data updating method, device, electronic equipment and storage medium |
| CN118394762A (en) * | 2024-06-28 | 2024-07-26 | 济南浪潮数据技术有限公司 | Storage management method, device, program product and medium for distributed storage system |
| CN118394762B (en) * | 2024-06-28 | 2024-08-23 | 济南浪潮数据技术有限公司 | Storage management method, device, program product and medium for distributed storage system |
| CN119232306A (en) * | 2024-10-17 | 2024-12-31 | 中电信量子科技有限公司 | Time synchronization method, cloud server, electronic device and storage medium |
| CN119759957A (en) * | 2025-03-07 | 2025-04-04 | 太原科技大学 | A high-storage query processing system based on equipment digital twin data |
| CN119759957B (en) * | 2025-03-07 | 2025-05-02 | 太原科技大学 | High-storage query processing system based on equipment digital twin data |
Also Published As
| Publication number | Publication date |
|---|---|
| CN115114294B (en) | 2025-06-17 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN115114294B (en) | Adaptive method, device and computer equipment for database storage mode | |
| US11816126B2 (en) | Large scale unstructured database systems | |
| CN115114374B (en) | Transaction execution method and device, computing equipment and storage medium | |
| US11093468B1 (en) | Advanced metadata management | |
| Vora | Hadoop-HBase for large-scale data | |
| US10191932B2 (en) | Dependency-aware transaction batching for data replication | |
| US10031935B1 (en) | Customer-requested partitioning of journal-based storage systems | |
| US12314251B2 (en) | Transaction processing method and apparatus, computing device, and storage medium | |
| Makris et al. | A classification of NoSQL data stores based on key design characteristics | |
| CN111917834A (en) | A data synchronization method, device, storage medium and computer equipment | |
| US20130110873A1 (en) | Method and system for data storage and management | |
| US11314717B1 (en) | Scalable architecture for propagating updates to replicated data | |
| US20130262389A1 (en) | Parallel Backup for Distributed Database System Environments | |
| CN109933631A (en) | Distributed parallel database system and data processing method based on Infiniband network | |
| CN117321583A (en) | Storage engine for hybrid data processing | |
| CN102663117A (en) | OLAP (On Line Analytical Processing) inquiry processing method facing database and Hadoop mixing platform | |
| CN117677943A (en) | Data consistency mechanism for mixed data processing | |
| US10289723B1 (en) | Distributed union all queries | |
| US11216416B2 (en) | Managing snapshotting of a dataset using an ordered set of B+ trees | |
| CN102890678A (en) | Gray-code-based distributed data layout method and query method | |
| US11461201B2 (en) | Cloud architecture for replicated data services | |
| Borkar et al. | Have your data and query it too: From key-value caching to big data management | |
| US11940972B2 (en) | Execution of operations on partitioned tables | |
| CN105677761A (en) | A method and system for data segmentation | |
| Waqas et al. | Transaction management techniques and practices in current cloud computing environments: A survey |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |