CN108197275A - A distributed file column storage indexing method - Google Patents
- Publication number: CN108197275A
- Application number: CN201810016197.6A
- Authority: CN (China)
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/14—Details of searching files based on file metadata
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/13—File access structures, e.g. distributed indices
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Library & Information Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a distributed file column storage indexing method comprising the following steps: parsing a query statement to obtain the query conditions; reading an index copy column according to the index field in the query conditions, where the index copy column is formed by sorting the index field within a Stripe of the column storage engine and copying it, and contains each column value together with the row offset of the row holding that value; and obtaining, from the index copy column, the Row Groups that finally need to be read. With the invention, the copied and sorted data can serve as a direct index from which data is read, which reduces disk input/output overhead and improves query performance. It can also serve as an indirect index that reduces the data read for other fields, which further improves query efficiency. The index works at the finest storage granularity, which improves the performance of fine-grained queries.
Description
Technical Field
The invention relates to the technical field of data query, and in particular to a distributed data indexing and query method.
Background
In the big data era, distributed file systems have become the standard file systems for storing large-scale data in online analytical processing (OLAP) systems. Although distributed file systems were originally designed for sequential reads and writes of large files, several indexing schemes for them have appeared in recent years to improve the query performance of the OLAP systems built on top of them; the finest granularity these schemes can index is the row level. Data volumes now reach the PB (petabyte) or even ZB (zettabyte) scale, and column storage engines compress better and need fewer input/output operations than traditional row storage engines, so most OLAP systems recommend column storage as the storage format on a distributed file system. However, when the existing indexing schemes for distributed file systems are applied to a column storage engine, even the finest-grained row-level index degrades into a coarser-grained index, so fine-grained queries are poorly supported. Some existing column storage engines support a clustered index at the finest granularity of column storage, but a clustered index is implemented by sorting all of the data on the index column, so only one field can be indexed.
Summary of the Invention
In view of the above problems, the present invention is proposed to provide a method that overcomes, or at least partially solves, those problems.
The invention provides a distributed file column storage indexing method comprising the following steps:
parsing the query statement to obtain the query conditions;
reading the index copy column according to the index field in the query conditions, where the index copy column is formed by sorting the index field within a Stripe of the column storage engine and copying it, and contains each column value together with the row offset of the row holding that value;
obtaining, from the index copy column, the Row Groups that finally need to be read.
Optionally, the index copy column does not replace the original data column.
Optionally, the index copy column is appended after the original data columns.
Optionally, the Row Groups that finally need to be read are obtained through different read paths depending on the query conditions, the filter conditions, and the query result fields, specifically:
queries are divided into three read paths according to three parameters: whether the query conditions contain the index field, whether the filter conditions contain fields other than the index field, and whether the query result fields contain fields other than those in the filter conditions.
Optionally, if the query conditions contain the index field, the filter conditions contain no field other than the index field, and the query result fields contain no field other than those in the filter conditions, the index copy column Row Groups to read are selected directly from the index metadata of the index field.
Optionally, if the query conditions contain the index field and the filter conditions contain no field other than the index field, but the query result fields contain fields other than those in the filter conditions, the index copy column Row Groups to read are selected from the index metadata and the index copy column data is read; the row offsets in the index copy column data are then parsed and resolved into the Row Groups that finally need to be read.
Optionally, if the query conditions contain the index field and the filter conditions contain fields other than the index field, the index copy column Row Groups to read are selected from the index metadata and the index copy column data is read; the row offsets in the index copy column data are parsed and resolved into the Row Groups selected by the index field condition, and these are merged with the Row Groups selected by the other fields to obtain the Row Groups that finally need to be read.
Optionally, the method further comprises: calculating, from the index metadata, the number of Row Groups that must be read when using the indexing method of any one of claims 1-7, and comparing it with the number of Row Groups that must be read without the indexing method; if the former is smaller than the latter, the indexing method of any one of claims 1-7 is adaptively used, and otherwise it is not used.
The technical solution provided in the embodiments of this application has at least the following technical effects or advantages: at query time, the index copy column data can be read directly as a direct index, which reduces disk input/output overhead and improves query performance, and it can also serve as an indirect index that reduces the data read for other fields, which further improves query efficiency.
The above is only an overview of the technical solution of the invention. So that the technical means of the invention can be understood more clearly and implemented according to the specification, and so that the above and other objects, features, and advantages of the invention are easier to understand, specific embodiments of the invention are set out below.
Brief Description of the Drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art on reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered as limiting the invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 shows a flowchart of the distributed file column storage indexing method according to the invention;
Fig. 2 shows the data structure of the index copy column proposed by the invention;
Fig. 3 shows the query flowchart of the indexing method proposed by the invention.
Detailed Description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope can be fully conveyed to those skilled in the art.
In one aspect of the invention, the indexing scheme is a clustered index based on data replication. Data replication keeps the original column, copies it, and sorts the copy to form the index copy column; the index copy column consists of the sorted values of the index column together with the row offset of the row in which each value originally resided. At query time, the index copy column data can be read directly as a direct index, which reduces disk input/output overhead and improves query performance, and it can also serve as an indirect index that reduces the data read for other fields, which further improves query efficiency. The invention is compatible with existing distributed file systems, column storage engines, and online analytical processing systems.
The distinguishing feature of the sorting in this scheme is that the whole column is not sorted globally. Instead, the index field is sorted and copied within each Stripe of the column storage engine. Compared with global sorting, this design avoids network overhead across the distributed cluster during index construction and querying, and it provides an index at the finest storage granularity, which improves the performance of fine-grained queries.
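The per-Stripe construction described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the use of in-memory Python lists for a column, and the fixed `rows_per_stripe` parameter are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple

# Hypothetical representation: one entry of an index copy column is
# (column value, row offset of that value inside its Stripe).
IndexEntry = Tuple[int, int]


def build_index_copy_column(stripe_values: Sequence[int]) -> List[IndexEntry]:
    """Sort one Stripe's values and keep each value's original row offset."""
    pairs = [(value, row_offset) for row_offset, value in enumerate(stripe_values)]
    # sorted() is stable, so equal values keep their original row order.
    return sorted(pairs, key=lambda pair: pair[0])


def build_per_stripe(column: Sequence[int], rows_per_stripe: int) -> List[List[IndexEntry]]:
    """Build one index copy column per Stripe; no sort crosses a Stripe boundary."""
    return [
        build_index_copy_column(column[start:start + rows_per_stripe])
        for start in range(0, len(column), rows_per_stripe)
    ]
```

Because each Stripe is sorted independently, the row offsets are local to the Stripe and the original column is left untouched, which is what allows more than one field to be indexed this way.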
Implementing the index mainly involves the design of the index metadata and the design of the index query flow. The invention is implemented as follows:
When an index field is queried, the index metadata of the invention is used to locate the index copy column and to filter data at Row Group level within the index copy column. Specifically, the metadata comprises the address offset of the index copy column, the data statistics of each Row Group of the index copy column, and the row offset address of each Row Group.
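One possible shape for this metadata, and the Row Group filtering it enables, is sketched below. The source does not say which statistics are stored, so minimum and maximum values per Row Group are an assumption, as are the class and field names and the reading of "row offset address" as the position at which a Row Group's entries start.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class RowGroupStats:
    min_value: int           # assumed statistic: smallest value in the Row Group
    max_value: int           # assumed statistic: largest value in the Row Group
    row_offset_address: int  # assumed meaning: where this Row Group's entries start


@dataclass
class IndexMetadata:
    column_address_offset: int      # where the index copy column starts in the Stripe
    row_groups: List[RowGroupStats]


def build_metadata(index_copy: Sequence[Tuple[int, int]], rows_per_group: int,
                   column_address_offset: int = 0) -> IndexMetadata:
    """Derive per-Row-Group statistics from a sorted index copy column."""
    groups = []
    for start in range(0, len(index_copy), rows_per_group):
        chunk = index_copy[start:start + rows_per_group]
        # The column is sorted, so the first and last entries give min and max.
        groups.append(RowGroupStats(chunk[0][0], chunk[-1][0], start))
    return IndexMetadata(column_address_offset, groups)


def select_index_row_groups(meta: IndexMetadata, low: int, high: int) -> List[int]:
    """Return the index copy column Row Groups that may hold values with low < v < high."""
    return [i for i, g in enumerate(meta.row_groups)
            if g.max_value > low and g.min_value < high]
```

Since the index copy column is sorted, the selected Row Groups always form one contiguous run, so a range predicate touches only a narrow slice of the column.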
As shown in Fig. 1, the invention proposes a distributed file column storage indexing method comprising the following steps:
S1. Parse the query statement to obtain the query conditions;
S2. Read the index copy column according to the index field in the query conditions, where the index copy column is formed by sorting the index field within a Stripe of the column storage engine and copying it, and contains each column value together with the row offset of the row holding that value;
S3. Obtain, from the index copy column, the Row Groups that finally need to be read.
The distinguishing feature of this method is that the copied, sorted data can be read directly as a direct index, which reduces disk input/output overhead and improves query performance, and can also serve as an indirect index that reduces the data read for other fields, which further improves query efficiency.
Fig. 2 shows the principle of the indexing scheme and the data structure of the index copy column. When the scheme is applied to a column storage engine on a distributed file system, the original data is kept at import time, as shown in Fig. 2, and the data of column 2 (the index field) within the Stripe is sorted and copied to form index copy column 2'. The index copy column consists of the sorted values of the index column together with the row offset of the row in which each value originally resided. While the index copy column is generated, metadata is produced according to the index metadata structure of the invention; it comprises the address offset of the index copy column, the data statistics of each Row Group of the index copy column, and the row offset address of each Row Group.
In column storage on a distributed file system, the data of a data block is split by rows into several Stripes, and a Stripe is the unit within which the data of each column is stored together. Within a Stripe, the data of each column is further split into multiple Row Groups, each holding a fixed number of rows. The Row Group is the finest division granularity of column storage on a distributed file system, and the Stripe is the granularity one level above it. This indexing scheme therefore implements an index at the finest granularity available in distributed file system column storage.
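The two-level split into Stripes and Row Groups can be illustrated with a short sketch. The fixed sizes `rows_per_stripe` and `rows_per_group` and both function names are assumptions for the example; real column storage engines typically size Stripes by bytes rather than by row count.

```python
from typing import List, Tuple


def partition(num_rows: int, rows_per_stripe: int,
              rows_per_group: int) -> List[Tuple[int, int, int, int]]:
    """Return (stripe_id, row_group_id, first_row, end_row_exclusive) for every Row Group."""
    layout = []
    for stripe_id, s_start in enumerate(range(0, num_rows, rows_per_stripe)):
        s_end = min(s_start + rows_per_stripe, num_rows)
        # Row Groups never cross a Stripe boundary; the last one may be short.
        for group_id, g_start in enumerate(range(s_start, s_end, rows_per_group)):
            layout.append((stripe_id, group_id, g_start,
                           min(g_start + rows_per_group, s_end)))
    return layout


def row_group_of(row_offset_in_stripe: int, rows_per_group: int) -> int:
    """Map a Stripe-local row offset to the Row Group that contains it."""
    return row_offset_in_stripe // rows_per_group
```

The second function is the step that lets a row offset stored in the index copy column be turned into a Row Group of any other column in the same Stripe.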
When a query is executed, it is assigned to one of three classes according to three parameters: whether the query conditions contain the index field, whether the filter conditions contain fields other than the index field, and whether the query result fields contain fields other than those in the filter conditions. A first-class query contains the index field in its query conditions, has no field other than the index field in its filter conditions, and has no result field other than those in the filter conditions. A second-class query contains the index field in its query conditions and has no field other than the index field in its filter conditions, but its result fields include fields other than those in the filter conditions. A third-class query contains the index field in its query conditions and has fields other than the index field in its filter conditions.
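The three-way classification can be written as a small decision function. This sketch assumes the parser has already reduced the query to two sets of field names, and it treats the query conditions and the filter conditions as the same set of WHERE-clause fields; the enum and function names are hypothetical.

```python
from enum import Enum
from typing import Optional, Set


class QueryClass(Enum):
    DIRECT = 1    # first class: the index copy column alone answers the query
    INDIRECT = 2  # second class: the index also locates Row Groups of other fields
    MERGED = 3    # third class: index result is merged with other filter fields


def classify(index_field: str, filter_fields: Set[str],
             result_fields: Set[str]) -> Optional[QueryClass]:
    """Return the query class, or None when the index field is not filtered on."""
    if index_field not in filter_fields:
        return None  # the index does not apply to this query
    if filter_fields - {index_field}:
        return QueryClass.MERGED
    if result_fields - filter_fields:
        return QueryClass.INDIRECT
    return QueryClass.DIRECT
```

An aggregate such as count(*) names no result field, so it is passed as an empty result set and falls into the first class.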
Specific embodiments of the invention are described in detail below with reference to Fig. 3:
Fig. 3 describes the query flow of the indexing scheme. Each class of query uses the index as shown in Fig. 3 to select the Row Groups that must be read and to skip data that need not be read, which reduces disk input/output overhead and improves query performance. At query time, the index metadata is also used to calculate the number of Row Groups that would be read with the index and the number that would be read without it. If the former is smaller than the latter, using the index reduces the number of Row Groups to read, so the index is adaptively selected; otherwise the index is not used. This adaptive query strategy on top of the indexing scheme improves query performance.
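The adaptive choice amounts to comparing two Row Group counts. The sketch below estimates both counts from minimum and maximum statistics for a range predicate `low < v < high` on a first-class query; how the patent actually derives the counts from its metadata is not specified, so the statistics used and all names here are assumptions.

```python
from typing import List, Sequence, Tuple


def min_max_per_group(values: Sequence[int], rows_per_group: int) -> List[Tuple[int, int]]:
    """Compute (min, max) for each Row Group of a column."""
    return [
        (min(values[i:i + rows_per_group]), max(values[i:i + rows_per_group]))
        for i in range(0, len(values), rows_per_group)
    ]


def count_overlapping(group_ranges: Sequence[Tuple[int, int]], low: int, high: int) -> int:
    """Count Row Groups whose value range may contain a value with low < v < high."""
    return sum(1 for mn, mx in group_ranges if mx > low and mn < high)


def should_use_index(groups_with_index: int, groups_without_index: int) -> bool:
    """Use the index only when it strictly reduces the number of Row Groups read."""
    return groups_with_index < groups_without_index
```

For an unsorted column such as `[7, 3, 9, 1, 4, 8]` with two rows per Row Group, every Row Group overlaps the range 1 to 5, whereas the sorted copy confines the matching values to the first two Row Groups, so the index would be chosen.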
After the query statement is parsed, the query conditions are obtained, and the query is assigned to one of three classes according to three parameters: whether the query conditions contain the index field, whether the filter conditions contain fields other than the index field, and whether the query result fields contain fields other than those in the filter conditions. Each class follows a different procedure to obtain the Row Groups to read under the indexing scheme of the invention.
A first-class query contains the index field in its query conditions, has no field other than the index field in its filter conditions, and has no result field other than those in the filter conditions. For example, if the index has been created on column 2 of tb1, such a query is: select count(*) from tb1 where column2 > 1 and column2 < 5. A query of this class only needs to select the index copy column Row Groups to read from the index metadata and then read them, which reduces disk input/output overhead and improves query performance.
A second-class query contains the index field in its query conditions and has no field other than the index field in its filter conditions, but its result fields include fields other than those in the filter conditions. An example is: select column1, column2 from tb1 where column2 > 1 and column2 < 5. A query of this class selects the index copy column Row Groups to read from the index metadata and reads the index copy column data, and it also uses that data as index information for reading the other fields: the row offsets in the index copy column data are parsed and resolved into the Row Groups that finally need to be read.
A third-class query contains the index field in its query conditions and has fields other than the index field in its filter conditions. An example is: select column1, column2 from tb1 where column2 > 1 and column2 < 5 and column3 = 5. A query of this class selects the index copy column Row Groups to read from the index metadata and reads the index copy column data, and it also uses that data as index information for reading the other fields: the row offsets in the index copy column data are parsed and resolved into the Row Groups selected by the index field condition, and these are merged with the Row Groups selected by the other conditions to obtain the Row Groups that finally need to be read. This reduces disk input/output overhead and improves query performance.
At query time, the index metadata is also used to calculate the number of Row Groups that would be read with the index and the number that would be read without it. If the former is smaller than the latter, using the index reduces the number of Row Groups to read, so the index is adaptively selected; otherwise the index is not used. This adaptive query strategy on top of the indexing scheme improves query performance.
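The second-class and third-class flows, in which row offsets from the index copy column are resolved into Row Groups of other columns, can be sketched as follows. The index copy column is modelled as a sorted list of (value, row offset) pairs for one Stripe. Interpreting "merge" as a set intersection is an assumption that fits conditions joined by `and`; the function names are hypothetical.

```python
from bisect import bisect_left, bisect_right
from typing import List, Sequence, Set, Tuple


def matching_row_offsets(index_copy: Sequence[Tuple[int, int]],
                         low: int, high: int) -> List[int]:
    """Return row offsets of entries with low < value < high using binary search."""
    values = [value for value, _ in index_copy]
    first = bisect_right(values, low)  # first entry with value > low
    last = bisect_left(values, high)   # first entry with value >= high
    return [row_offset for _, row_offset in index_copy[first:last]]


def row_groups_for_offsets(row_offsets: Sequence[int], rows_per_group: int) -> Set[int]:
    """Resolve Stripe-local row offsets into the Row Groups that contain them."""
    return {offset // rows_per_group for offset in row_offsets}


def merge_row_groups(index_groups: Set[int], other_groups: Set[int]) -> List[int]:
    """Third class: keep only Row Groups selected by both the index and the other fields."""
    return sorted(index_groups & other_groups)
```

With the entries `(1,4), (2,0), (3,2), (4,5), (6,1), (7,7), (8,6), (9,3)`, two rows per Row Group, and the predicate `1 < column2 < 5`, the matching offsets 0, 2, and 5 resolve to Row Groups 0, 1, and 2, so Row Group 3 of every other column can be skipped.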
The technical solution provided in the embodiments of this application has at least the following technical effects or advantages:
The copied, sorted data can be read directly as a direct index, which reduces disk input/output overhead and improves query performance, and it can also serve as an indirect index that reduces the data read for other fields, which further improves query efficiency.
The invention does not sort the whole column globally. Instead, it sorts and copies the index field within each Stripe of the column storage engine. Compared with global sorting, this design avoids network overhead across the distributed cluster during index construction and querying, and it provides an index at the finest storage granularity, which improves the performance of fine-grained queries.
At query time, the invention can take the characteristics of the query and the distribution of the data into account and adaptively decide, for each query, whether to use the indexing scheme. This avoids applying the index to queries it does not suit and improves query efficiency.
The methods and displays provided here are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems may also be used with the teachings given here. From the description above, the structure required to construct such a system is apparent. Furthermore, the invention is not tied to any particular programming language. It should be understood that the content of the invention described here can be implemented in various programming languages, and that any description of a specific language above is given to disclose the best mode of the invention.
The specification provided here sets out numerous specific details. It should be understood, however, that embodiments of the invention can be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this specification.
Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the inventive aspects, features of the invention are sometimes grouped together into a single embodiment, figure, or description thereof in the above description of exemplary embodiments. This manner of disclosure should not, however, be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. The claims following the detailed description are therefore expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Furthermore, those skilled in the art will understand that, although some embodiments described here include certain features that are included in other embodiments and not others, combinations of features from different embodiments fall within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
The component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination of the two. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the gateway, proxy server, and system according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described here. Such a program implementing the invention may be stored on a computer-readable medium or may take the form of one or more signals. Such a signal may be downloaded from an Internet site, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim.
Claims (8)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201810016197.6A CN108197275A (en) | 2018-01-08 | 2018-01-08 | A distributed file column storage indexing method |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN108197275A (en) | 2018-06-22 |
Family
ID=62588688
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201810016197.6A Pending CN108197275A (en) | 2018-01-08 | 2018-01-08 | A distributed file column storage indexing method |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN108197275A (en) |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112347104A (en) * | 2020-11-06 | 2021-02-09 | 中国人民大学 | Column storage layout optimization method based on deep reinforcement learning |
| CN113297244A (en) * | 2020-05-29 | 2021-08-24 | 阿里巴巴集团控股有限公司 | Database operation method, device, equipment and storage medium |
| CN114356230A (en) * | 2021-12-22 | 2022-04-15 | 天津南大通用数据技术股份有限公司 | Method and system for improving reading performance of column storage engine |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN106326387A (en) * | 2016-08-17 | 2017-01-11 | 电子科技大学 | Distributive data storage architecture, data storage method and data inquiry method |
| CN106709851A (en) * | 2016-11-30 | 2017-05-24 | 中体彩科技发展有限公司 | Big data retrieval method and apparatus |
| US20170177677A1 (en) * | 2015-05-14 | 2017-06-22 | Walleye Software, LLC | Query task processing based on memory allocation and performance criteria |
| CN107103032A (en) * | 2017-03-21 | 2017-08-29 | 中国科学院计算机网络信息中心 | The global mass data paging query method sorted is avoided under a kind of distributed environment |
- 2018-01-08: CN application CN201810016197.6A, published as CN108197275A (en), status: Pending
Patent Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20170177677A1 (en) * | 2015-05-14 | 2017-06-22 | Walleye Software, LLC | Query task processing based on memory allocation and performance criteria |
| CN106326387A (en) * | 2016-08-17 | 2017-01-11 | 电子科技大学 | Distributive data storage architecture, data storage method and data inquiry method |
| CN106709851A (en) * | 2016-11-30 | 2017-05-24 | 中体彩科技发展有限公司 | Big data retrieval method and apparatus |
| CN107103032A (en) * | 2017-03-21 | 2017-08-29 | 中国科学院计算机网络信息中心 | The global mass data paging query method sorted is avoided under a kind of distributed environment |
Non-Patent Citations (1)
| Title |
|---|
| JK_RUSH: "索引的一些总结" (A summary of indexes), 《博客园 CNBLOGS》 * |
Cited By (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113297244A (en) * | 2020-05-29 | 2021-08-24 | 阿里巴巴集团控股有限公司 | Database operation method, device, equipment and storage medium |
| CN113297244B (en) * | 2020-05-29 | 2022-05-06 | 阿里巴巴集团控股有限公司 | Database operation method, device, equipment and storage medium |
| CN112347104A (en) * | 2020-11-06 | 2021-02-09 | 中国人民大学 | Column storage layout optimization method based on deep reinforcement learning |
| CN112347104B (en) * | 2020-11-06 | 2023-09-29 | 中国人民大学 | Column storage layout optimization method based on deep reinforcement learning |
| CN114356230A (en) * | 2021-12-22 | 2022-04-15 | 天津南大通用数据技术股份有限公司 | Method and system for improving reading performance of column storage engine |
| CN114356230B (en) * | 2021-12-22 | 2024-04-23 | 天津南大通用数据技术股份有限公司 | Method and system for improving read performance of column storage engine |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| RU2663358C2 (en) | | Clustering storage method and device |
| US8065484B2 (en) | | Enhanced access to data available in a cache |
| CN103678494B (en) | | Client synchronization services the method and device of end data |
| US20060085484A1 (en) | | Database tuning advisor |
| US20140351239A1 (en) | | Hardware acceleration for query operators |
| Wesley et al. | | An analytic data engine for visualization in tableau |
| US9633104B2 (en) | | Methods and systems to operate on group-by sets with high cardinality |
| US9218394B2 (en) | | Reading rows from memory prior to reading rows from secondary storage |
| CN103678520A (en) | | Multi-dimensional interval query method and system based on cloud computing |
| WO2018157680A1 (en) | | Method and device for generating execution plan, and database server |
| US12026168B2 (en) | | Columnar techniques for big metadata management |
| CN115964374B (en) | | Query processing method and device based on pre-calculation scene |
| US8880485B2 (en) | | Systems and methods to facilitate multi-threaded data retrieval |
| US8438153B2 (en) | | Performing database joins |
| US20100161930A1 (en) | | Statistics collection using path-value pairs for relational databases |
| US8229924B2 (en) | | Statistics collection using path-identifiers for relational databases |
| CN108197275A (en) | | A kind of distributed document row storage indexing means |
| US7472108B2 (en) | | Statistics collection using path-value pairs for relational databases |
| CN114428776A (en) | | A method and system for index partition management for time series data |
| CN116126901A (en) | | Data processing method, device, electronic equipment and computer readable storage medium |
| CN108897865A (en) | | The index copy amount appraisal procedure and device of distributed type assemblies |
| CN109241100B (en) | | Query method, device, equipment and storage medium |
| US8396858B2 (en) | | Adding entries to an index based on use of the index |
| US8046394B1 (en) | | Dynamic partitioning for an ordered analytic function |
| CN104361090B (en) | | Data query method and device |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20180622 |