CN117851537A

CN117851537A - Index construction method of time sequence data storage engine

Info

Publication number: CN117851537A
Application number: CN202410070820.1A
Authority: CN
Inventors: 刘晓光; 徐子越; 王刚; 黄苏童; 费迪; 刘欣瑀; 余文清; 魏子敬; 刘少治
Original assignee: Nankai University
Current assignee: Nankai University
Priority date: 2024-01-18
Filing date: 2024-01-18
Publication date: 2024-04-09

Abstract

The invention provides an index construction method of a time sequence data storage engine, and belongs to the technical field of database storage. The method specifically comprises the following steps: pre-screening the data blocks according to the document frequency of the label keys and the occurrence frequency of the label values; performing feature extraction on the pre-screening set through the historical access frequency of the tag keys to obtain data features and performing machine learning to further screen so as to obtain a target set of index tag groups comprising each time line; extracting target labels from the target set according to a plurality of different indexes in the index label group to obtain a plurality of group label sets; corresponding time lines are put into the group label sets with the same index labels, and a plurality of time line sets are obtained; and allocating a unique group ID to each time line set, establishing an inverted index mapped by the tag key value pair and the group ID, and establishing a leading index mapped by the target tag and the inverted index. The invention can improve the writing efficiency and the index construction efficiency of time sequence data.

Description

An index construction method for a time series data storage engine

技术领域Technical Field

本发明涉及数据库存储技术领域，尤其涉及一种时序数据存储引擎的索引构建方法。The present invention relates to the field of database storage technology, and in particular to an index construction method for a time series data storage engine.

背景技术Background technique

随着物联网技术的发展，物联网设备的数量和应用范围急剧增加。为保证物联网设备以及互联网服务的高可用性和鲁棒性，对其实时运行状态进行更加精密而全面的监控需求则应运而生。时序数据库作为上述监控数据的存储引擎，近年来在此背景下得到了学术界和工业界的广泛关注。With the development of IoT technology, the number and application scope of IoT devices have increased dramatically. In order to ensure the high availability and robustness of IoT devices and Internet services, the need for more precise and comprehensive monitoring of their real-time operating status has emerged. As the storage engine for the above monitoring data, time series databases have received widespread attention from academia and industry in recent years.

典型的时序数据一般由两部分组成：时间线数据及时间点数据。时间点数据一般由64bit整型的时间戳，以及双精度浮点型（IEEE754 double）的指标值组成。而时间线数据的表现形式则较为复杂：一般由一个监控指标字符串（metric），及一系列标签键值对字符串（tagkv pairs）组成，通常称为一条时间线（serieskey）。Typical time series data generally consists of two parts: timeline data and time point data. Time point data generally consists of a 64-bit integer timestamp and a double-precision floating point (IEEE754 double) indicator value. The representation of timeline data is more complex: it generally consists of a monitoring indicator string (metric) and a series of tag key-value pairs (tagkv pairs), usually called a timeline (serieskey).

当前的对时序数据进行存储及索引构建的方法是基于标签值、时间线标识以及标识集合，构建用于检索时间线的两索引层索引结构，保存创建标签值与标识集合之间的映射关系，及创建时间线标识和时间线之间的第二映射关系。但是这种方法并不能适应时序数据库的时间线基数膨胀，当时间线基数膨胀时，索引构建量会随之快速增加，从而影响时序数据的写入效率及索引构建效率。The current method for storing and indexing time series data is to build a two-index-layer index structure for retrieving time lines based on tag values, timeline identifiers, and identifier sets, and save the mapping relationship between the creation of tag values and identifier sets, and the creation of a second mapping relationship between timeline identifiers and time lines. However, this method cannot adapt to the expansion of the timeline cardinality of the time series database. When the timeline cardinality expands, the amount of index construction will increase rapidly, thereby affecting the writing efficiency and index construction efficiency of time series data.

发明内容Summary of the invention

本发明旨在至少解决相关技术中存在的技术问题之一。为此，本发明提供一种时序数据存储引擎的索引构建方法。The present invention aims to solve at least one of the technical problems existing in the related art. To this end, the present invention provides an index construction method for a time series data storage engine.

本发明提供一种时序数据存储引擎的索引构建方法，包括：The present invention provides an index construction method for a time series data storage engine, comprising:

S1：根据标签键的文档频率及标签值的出现频率对待存储的数据块进行预筛选，获得预筛选集合；S1: pre-screen the data blocks to be stored according to the document frequency of the tag key and the occurrence frequency of the tag value to obtain a pre-screened set;

S2：通过标签键的历史访问频率，对所述预筛选集合进行特征提取，获得数据特征；S2: extracting features from the pre-screened set based on the historical access frequency of the tag key to obtain data features;

S3：对所述数据特征进行机器学习获得筛选函数，通过所述筛选函数对所述预筛选集合进行筛选，获得目标集合，所述目标集合中至少包括每条时间线的指标标签组；S3: performing machine learning on the data features to obtain a screening function, and screening the pre-screening set by using the screening function to obtain a target set, wherein the target set includes at least an indicator label group for each timeline;

S4：根据所述指标标签组中多个不同的指标对所述目标集合进行目标标签提取，获得多个组标签集合；S4: extracting target labels from the target set according to a plurality of different indicators in the indicator label group to obtain a plurality of group label sets;

S5：对指标标签相同的组标签集合，置入所述指标标签对应的时间线，获得多个时间线集合；S5: For a group tag set with the same indicator tag, insert a timeline corresponding to the indicator tag to obtain multiple timeline sets;

S6：对每个时间线集合分配唯一的组ID，建立标签键值对与组ID映射的倒排索引，并建立目标标签与倒排索引映射的前置索引，以完成时序数据存储引擎的索引构建。S6: Assign a unique group ID to each timeline set, create an inverted index mapping the tag key-value pair to the group ID, and create a pre-index mapping the target tag to the inverted index to complete the index construction of the time series data storage engine.

根据本发明提供的一种时序数据存储引擎的索引构建方法，步骤S2中的所述数据特征包括：文档频率、标签键基数、标签键基数排名、标签键基数排名比率、标签键频率、标签键频率排名及标签键频率排名比率。According to an index construction method for a time series data storage engine provided by the present invention, the data features in step S2 include: document frequency, tag key cardinality, tag key cardinality ranking, tag key cardinality ranking ratio, tag key frequency, tag key frequency ranking and tag key frequency ranking ratio.

根据本发明提供的一种时序数据存储引擎的索引构建方法，步骤S3中用于所述数据特征的机器学习方案为AdaBoost迭代算法。According to an index building method for a time series data storage engine provided by the present invention, the machine learning scheme used for the data features in step S3 is an AdaBoost iterative algorithm.

根据本发明提供的一种时序数据存储引擎的索引构建方法，步骤S3中，所述指标标签组中的每个指标均以指标名称为标签键，指标值为标签值。According to an index construction method for a time series data storage engine provided by the present invention, in step S3, each indicator in the indicator tag group uses the indicator name as the tag key and the indicator value as the tag value.

根据本发明提供的一种时序数据存储引擎的索引构建方法，步骤S6中的所述倒排索引中每个标签键值对有对应的倒排链，所述倒排链中包括升序的组ID组。According to an index construction method for a time series data storage engine provided by the present invention, each label key-value pair in the inverted index in step S6 has a corresponding inverted chain, and the inverted chain includes ascending group ID groups.

根据本发明提供的一种时序数据存储引擎的索引构建方法，步骤S6中的所述前置索引通过代数重建法与字典数据结构实现。According to the index construction method of a time series data storage engine provided by the present invention, the pre-index in step S6 is implemented by an algebraic reconstruction method and a dictionary data structure.

根据本发明提供的一种时序数据存储引擎的索引构建方法，步骤S6中的所述前置索引中，所述代数重建法用于由标签键映射至对应的标签值集合构成的字典数据结构地址。According to an index construction method for a time series data storage engine provided by the present invention, in the pre-index in step S6, the algebraic reconstruction method is used to map the tag key to the dictionary data structure address consisting of the corresponding tag value set.

根据本发明提供的一种时序数据存储引擎的索引构建方法，步骤S6中的所述前置索引中，所述字典数据结构存有标签值与对应的倒排链偏移量的映射。According to an index construction method for a time series data storage engine provided by the present invention, in the pre-index in step S6, the dictionary data structure stores a mapping between a label value and a corresponding postings chain offset.

本发明提供的一种时序数据存储引擎的索引构建方法，基于启发式筛选，以及机器学习等方案筛选目标标签，以期在面对当今时序数据库时间线膨胀的问题时，可以保证索引构建量并不随之快速增加，从而保证时序数据的写入效率与索引构建效率；同时，利用所筛选的目标标签进行预测分组，并基于该分组方案建立了一套完整的索引结构体系，在不降低查询性能的同时，对数据存储及索引构建量做出优化；同时，本发明利用合适的索引结构，针对不同模式的查询条件、查询需求做出优化，以求在不同的数据特征、不同的查询偏好的限定下，本发明仍能拥有良好的表现。The present invention provides an index construction method for a time series data storage engine, which selects target tags based on heuristic screening and machine learning schemes, so as to ensure that the index construction amount does not increase rapidly when facing the problem of timeline expansion of today's time series databases, thereby ensuring the writing efficiency and index construction efficiency of time series data; at the same time, the screened target tags are used for predictive grouping, and a complete index structure system is established based on the grouping scheme, which optimizes data storage and index construction without reducing query performance; at the same time, the present invention uses a suitable index structure to optimize query conditions and query requirements of different modes, so that the present invention can still have good performance under the limitations of different data characteristics and different query preferences.

本发明的附加方面和优点将在下面的描述中部分给出，部分将从下面的描述中变得明显，或通过本发明的实践了解到。Additional aspects and advantages of the present invention will be given in part in the following description and in part will be obvious from the following description, or will be learned through practice of the present invention.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

为了更清楚地说明本发明或现有技术中的技术方案，下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍，显而易见地，下面描述中的附图是本发明的一些实施例，对于本领域普通技术人员来讲，在不付出创造性劳动的前提下，还可以根据这些附图获得其他的附图。In order to more clearly illustrate the technical solutions in the present invention or the prior art, the drawings required for use in the embodiments or the description of the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of the present invention. For ordinary technicians in this field, other drawings can be obtained based on these drawings without paying creative work.

图1是本发明实施例提供的一种时序数据存储引擎的索引构建方法流程图。FIG. 1 is a flow chart of an index construction method of a time series data storage engine provided by an embodiment of the present invention.

图2是本发明实施例中通过本发明的一种时序数据存储引擎的索引构建方法构建的时间线集合的结构示意图。FIG. 2 is a schematic diagram of the structure of a timeline set constructed by an index construction method of a time series data storage engine of the present invention in an embodiment of the present invention.

具体实施方式Detailed ways

为使本发明的目的、技术方案和优点更加清楚，下面将结合本发明中的附图，对本发明中的技术方案进行清楚、完整地描述，显然，所描述的实施例是本发明一部分实施例，而不是全部的实施例。基于本发明中的实施例，本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例，都属于本发明保护的范围。以下实施例用于说明本发明，但不能用来限制本发明的范围。In order to make the purpose, technical scheme and advantages of the present invention clearer, the technical scheme of the present invention will be clearly and completely described below in conjunction with the drawings in the present invention. Obviously, the described embodiments are part of the embodiments of the present invention, rather than all the embodiments. Based on the embodiments in the present invention, all other embodiments obtained by ordinary technicians in the field without creative work are within the scope of protection of the present invention. The following embodiments are used to illustrate the present invention, but cannot be used to limit the scope of the present invention.

在本发明实施例的描述中，需要说明的是，术语“中心”、“纵向”、“横向”、“上”、“下”、“前”、“后”、“左”、“右”、“竖直”、“水平”、“顶”、“底”、“内”、“外”等指示的方位或位置关系为基于附图所示的方位或位置关系，仅是为了便于描述本发明实施例和简化描述，而不是指示或暗示所指的装置或元件必须具有特定的方位、以特定的方位构造和操作，因此不能理解为对本发明实施例的限制。此外，术语“第一”、“第二”、“第三”仅用于描述目的，而不能理解为指示或暗示相对重要性。In the description of the embodiments of the present invention, it should be noted that the terms "center", "longitudinal", "lateral", "up", "down", "front", "back", "left", "right", "vertical", "horizontal", "top", "bottom", "inside", "outside" and the like indicate positions or positional relationships based on the positions or positional relationships shown in the accompanying drawings, and are only for the convenience of describing the embodiments of the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be constructed and operate in a specific orientation, and therefore cannot be understood as limitations on the embodiments of the present invention. In addition, the terms "first", "second", and "third" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance.

在本发明实施例的描述中，需要说明的是，除非另有明确的规定和限定，术语“相连”、“连接”应做广义理解，例如，可以是固定连接，也可以是可拆卸连接，或一体连接；可以是机械连接，也可以是电连接；可以是直接相连，也可以通过中间媒介间接相连。对于本领域的普通技术人员而言，可以具体情况理解上述术语在本发明实施例中的具体含义。In the description of the embodiments of the present invention, it should be noted that, unless otherwise clearly specified and limited, the terms "connected" and "connection" should be understood in a broad sense, for example, it can be a fixed connection, a detachable connection, or an integral connection; it can be a mechanical connection or an electrical connection; it can be a direct connection or an indirect connection through an intermediate medium. For ordinary technicians in this field, the specific meanings of the above terms in the embodiments of the present invention can be understood according to specific circumstances.

在本发明实施例中，除非另有明确的规定和限定，第一特征在第二特征“上”或“下”可以是第一和第二特征直接接触，或第一和第二特征通过中间媒介间接接触。而且，第一特征在第二特征“之上”、“上方”和“上面”可是第一特征在第二特征正上方或斜上方，或仅仅表示第一特征水平高度高于第二特征。第一特征在第二特征“之下”、“下方”和“下面”可以是第一特征在第二特征正下方或斜下方，或仅仅表示第一特征水平高度小于第二特征。In the embodiments of the present invention, unless otherwise clearly specified and limited, a first feature being "above" or "below" a second feature may mean that the first and second features are in direct contact, or the first and second features are in indirect contact through an intermediate medium. Moreover, a first feature being "above", "above" or "above" a second feature may mean that the first feature is directly above or obliquely above the second feature, or simply means that the first feature is higher in level than the second feature. A first feature being "below", "below" or "below" a second feature may mean that the first feature is directly below or obliquely below the second feature, or simply means that the first feature is lower in level than the second feature.

在本说明书的描述中，参考术语“一个实施例”、“一些实施例”、“示例”、“具体示例”、或“一些示例”等的描述意指结合该实施例或示例描述的具体特征、结构、材料或者特点包含于本发明实施例的至少一个实施例或示例中。在本说明书中，对上述术语的示意性表述不必须针对的是相同的实施例或示例。而且，描述的具体特征、结构、材料或者特点可以在任一个或多个实施例或示例中以合适的方式结合。此外，在不相互矛盾的情况下，本领域的技术人员可以将本说明书中描述的不同实施例或示例以及不同实施例或示例的特征进行结合和组合。In the description of this specification, the description with reference to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" etc. means that the specific features, structures, materials or characteristics described in conjunction with the embodiment or example are included in at least one embodiment or example of the embodiments of the present invention. In this specification, the schematic representations of the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in any one or more embodiments or examples in a suitable manner. In addition, those skilled in the art may combine and combine the different embodiments or examples described in this specification and the features of the different embodiments or examples, without contradiction.

为了更好的理解本发明，下面对本发明中出现的名词进行解释。In order to better understand the present invention, the terms appearing in the present invention are explained below.

Adaboost是一种迭代算法，其核心思想是针对同一个训练集训练不同的分类器(弱分类器)，然后把这些弱分类器集合起来，构成一个更强的最终分类器(强分类器)。其算法本身是通过改变数据分布来实现的，它根据每次训练集之中每个样本的分类是否正确，以及上次的总体分类的准确率，来确定每个样本的权值。将修改过权值的新数据集送给下层分类器进行训练，最后将每次训练得到的分类器最后融合起来，作为最后的决策分类器。使用adaboost分类器可以排除一些不必要的训练数据特征，并放在关键的训练数据上面。Adaboost is an iterative algorithm. Its core idea is to train different classifiers (weak classifiers) for the same training set, and then combine these weak classifiers to form a stronger final classifier (strong classifier). The algorithm itself is implemented by changing the data distribution. It determines the weight of each sample based on whether the classification of each sample in each training set is correct and the accuracy of the last overall classification. The new data set with modified weights is sent to the lower-level classifier for training, and finally the classifiers obtained from each training are finally fused together as the final decision classifier. Using the adaboost classifier can exclude some unnecessary training data features and put them on the key training data.

自适应基数树（Adaptive Radix Tree，简称ART）是一种高效的数据结构，用于实现键值对的存储和检索。它是一种特殊的基数树（Radix Tree），旨在解决传统基数树在内存利用和性能方面的一些缺陷。Adaptive Radix Tree (ART) is an efficient data structure for storing and retrieving key-value pairs. It is a special radix tree that aims to solve some of the shortcomings of traditional radix trees in terms of memory utilization and performance.

自适应基数树的特点包括：The features of adaptive radix trees include:

高效的内存利用：相比于传统的基数树，ART根据键的前缀长度来选择不同的节点类型，从而更好地利用内存，减少了节点的空间开销。高性能的查找和插入操作：ART树采用了多种节点类型（4、16、48、256叶子节点），根据键的长度动态选择节点类型，使得在不同场景下能够有更高效的查找和插入操作。适用于高并发环境：ART树采用乐观并发控制，通过无锁或者细粒度锁来实现对节点的并发访问，适合高并发读写场景。适用于高效的范围查询：ART树支持高效的范围查询操作，可以快速地定位到某个前缀匹配的键值对集合。Efficient memory utilization: Compared with traditional radix trees, ART selects different node types based on the prefix length of the key, thereby better utilizing memory and reducing the space overhead of nodes. High-performance search and insertion operations: The ART tree uses a variety of node types (4, 16, 48, and 256 leaf nodes), and dynamically selects the node type based on the length of the key, allowing more efficient search and insertion operations in different scenarios. Suitable for high-concurrency environments: The ART tree uses optimistic concurrency control to achieve concurrent access to nodes through lock-free or fine-grained locks, which is suitable for high-concurrency read and write scenarios. Suitable for efficient range queries: The ART tree supports efficient range query operations and can quickly locate a set of key-value pairs that match a prefix.

自适应基数树在数据库、操作系统、网络路由表等领域有着广泛的应用，它的设计理念主要在于优化内存利用和提高查找效率，使得它成为处理大规模键值对存储和检索的一种有效数据结构。Adaptive radix tree has a wide range of applications in databases, operating systems, network routing tables and other fields. Its design concept is mainly to optimize memory utilization and improve search efficiency, making it an effective data structure for handling large-scale key-value pair storage and retrieval.

FST是一种在自然语言处理，以及语音识别等领域被广泛应用的数据结构，在数据库领域，Lucene将其引入Lucene Apache中，作为重要的索引结构已替代传统的哈希表等数据结构来使用。该数据结构本质上是一种特殊的有限状态自动机（Finite StateAutomaton, FSA），而在应用时其结构也类似于一棵不仅共享前缀，同时也共享后缀的权值Trie。相较于其他数据结构，FST作为map使用时的时间复杂度是的，其中/>为输入序列的长度，这与哈希表的时间复杂度一致，且为具有映射功能的数据结构做单次映射时所能达到的时间复杂度的最小值，因此，它在查询速度上应优于使用红黑树构造的map、或HAMT（Hash Array Mapped Trie）等空间优化的哈希表。同时，相较于Trie，FST由于共享字符串后缀的特性，其压缩率要显著高于前者，但与Trie等字典数据结构一致的是，FST中存储了字符串集合的所有信息，即通过遍历FST可以恢复其包含的字符串集合，而这是哈希表等数据结构所不具备的，这意味着FST可以支持正则匹配、通配匹配以及模糊匹配等输入数据非精确的匹配模式。FST is a data structure widely used in natural language processing and speech recognition. In the database field, Lucene introduced it into Lucene Apache and used it as an important index structure to replace traditional data structures such as hash tables. This data structure is essentially a special finite state automaton (FSA). When applied, its structure is similar to a weighted trie that shares not only prefixes but also suffixes. Compared with other data structures, the time complexity of FST when used as a map is Of which/> is the length of the input sequence, which is consistent with the time complexity of the hash table and is the minimum time complexity that can be achieved when a data structure with mapping function performs a single mapping. Therefore, it should be better than space-optimized hash tables such as maps constructed using red-black trees or HAMT (Hash Array Mapped Trie) in terms of query speed. At the same time, compared with Trie, FST has a significantly higher compression rate due to the characteristic of sharing string suffixes, but consistent with dictionary data structures such as Trie, FST stores all information about the string set, that is, by traversing FST, the string set it contains can be restored, which is not available in data structures such as hash tables. This means that FST can support inexact matching modes of input data such as regular matching, wildcard matching, and fuzzy matching.

下面结合图1描述本发明实施例。The embodiment of the present invention is described below with reference to FIG. 1 .

首先，如下的步骤S1至S3均为目标标签的选取阶段，目标标签的选取，主要分为两个阶段：启发式筛选阶段与机器学习筛选阶段。在启发式筛选过程中通过预定义的规则以及相应的阈值对标签进行一次初步筛选，以生成一个相较于原标签集合，即一个数据段中出现的所有标签的集合，而言基数小得多的预筛选集合。在此过程中，同时提取预筛选集合中所含标签的数据特征，接下来使用上述得到的数据特征，以过往查询历史统计频率作为标签进行机器学习筛选，从而最终生成目标标签。First, the following steps S1 to S3 are all the target label selection stages. The target label selection is mainly divided into two stages: the heuristic screening stage and the machine learning screening stage. In the heuristic screening process, the labels are preliminarily screened by predefined rules and corresponding thresholds to generate a pre-screening set with a much smaller cardinality than the original label set, that is, the set of all labels appearing in a data segment. In this process, the data features of the labels contained in the pre-screening set are extracted at the same time. Next, the data features obtained above are used to perform machine learning screening with the historical statistical frequency of past queries as labels, so as to finally generate the target label.

在预筛选阶段，主要通过标签键的文档频率，以及标签值的出现频率进行筛选。In the pre-screening stage, the filtering is mainly based on the document frequency of the tag key and the frequency of occurrence of the tag value.

标签键的文档频率，即某个标签键存在于多少条具有不同指标的时间线中，对于不同指标的监控，有部分标签键会有所重合，这些标签键构成了监控目标的基础数据。具有这部分标签键的标签应当是被召回的主要指定条件，一个数据段中一般含有102数量级的标签键，而一般将筛选阈值设为30%，即将标签键按照文档频率排序后，选取前30%的标签键。The document frequency of the tag key, that is, the number of timelines with different indicators in which a certain tag key exists. For the monitoring of different indicators, some tag keys will overlap, and these tag keys constitute the basic data of the monitoring target. Tags with these tag keys should be the main designated conditions for recall. A data segment generally contains 102 tag keys, and the screening threshold is generally set to 30%, that is, after sorting the tag keys according to the document frequency, the top 30% of the tag keys are selected.

而标签值的筛选，主要使用其出现频率。在选取上述标签键集合后，对于每个标签键，约有10至3000个不同的标签值与之对应。在将标签值在数据段中的出现频率排序后，规定一个阈值P（一般取80%或90%，称为P80或P90方案），自高概率标签值起向下筛选，直到其出现频率之和达到阈值P为止。选取被筛选出的标签值，与前述标签键组合，组成预筛选标签键值对集合。The selection of label values mainly uses their frequency of occurrence. After selecting the above label key set, for each label key, there are about 10 to 3000 different label values corresponding to it. After sorting the frequency of occurrence of label values in the data segment, a threshold P is specified (usually 80% or 90%, called P80 or P90 scheme), and the label values with high probability are selected downward until the sum of their frequency of occurrence reaches the threshold P. The selected label values are selected and combined with the above label keys to form a pre-screened label key value pair set.

每条数据的标签则由历史查询记录生成，一般的，历史查询记录中有部分标签键的访问频率远高于其他标签键，本发明将这部分标签键的布尔标签置1，其余标签键则置0，具体来说，与步骤S1中标签值的筛选方式类似，对历史查询记录中的不同标签访问频次排序后，使用P80或P90方案筛选，将选出的标签键的值置位。The label of each data is generated by the historical query record. Generally, the access frequency of some label keys in the historical query record is much higher than that of other label keys. In the present invention, the Boolean label of these label keys is set to 1, and the other label keys are set to 0. Specifically, similar to the label value screening method in step S1, after sorting the access frequencies of different labels in the historical query record, the P80 or P90 scheme is used for screening, and the selected label keys are Value is set.

其中，步骤S2中的所述数据特征包括：文档频率、标签键基数、标签键基数排名、标签键基数排名比率、标签键频率、标签键频率排名及标签键频率排名比率。The data features in step S2 include: document frequency, tag key cardinality, tag key cardinality ranking, tag key cardinality ranking ratio, tag key frequency, tag key frequency ranking and tag key frequency ranking ratio.

本阶段使用在上一阶段提取的数据特征进行机器学习模型训练，所提取的特征包括：This stage uses the data features extracted in the previous stage to train the machine learning model. The extracted features include:

标签键基数：一个标签键所具有的标签值集合的基数；Tag key cardinality : The cardinality of the tag value set of a tag key;

标签键基数排名：上述基数在记录同一指标的时间线集合中所有标签键的排名；Tag key cardinality ranking : The ranking of the above cardinality among all tag keys in the timeline set recording the same metric;

标签键基数排名比率：上述排名占同一指标的时间线集合中所有标签键的百分比；Tag key cardinality ranking ratio : The above ranking is the percentage of all tag keys in the timeline set of the same metric;

标签键频率：一个标签键在记录同一指标的时间线集合中出现的次数；Tag key frequency : The number of times a tag key appears in a set of timelines recording the same metric;

标签键频率排名：上述频率在记录同一指标的时间线集合中所有标签键的排名；Tag key frequency ranking : The above frequency is the ranking of all tag keys in the timeline set recording the same metric;

标签键频率排名比率：上述排名占同一指标的时间线集合中所有标签键的百分比；Tag key frequency ranking ratio : The above ranking is the percentage of all tag keys in the timeline set of the same metric;

文档频率：标签键的文档频率。Document frequency : Document frequency for the tag key.

其中，步骤S3中用于所述数据特征的机器学习方案为AdaBoost迭代算法。Among them, the machine learning scheme used for the data features in step S3 is the AdaBoost iterative algorithm.

其中，步骤S3中，所述指标标签组中的每个指标均以指标名称为标签键，指标值为标签值。Wherein, in step S3, each indicator in the indicator tag group uses the indicator name as the tag key and the indicator value as the tag value.

基于步骤S2之后，使用经典的机器学习方案，如AdaBoost等，训练一个相较于启发式筛选而言更为精确的函数，通过该函数，可以得到最终的目标标签。本发明将时间线数据中的指标也视为一组标签，即其标签键为“指标”，值为指标值，如“metric: cpu”。这组特殊的标签永远是目标标签，因此，每个组的组标签中都含有且仅含有一个此类标签，而组中的所有时间线都具有相同的指标。After step S2, a classic machine learning solution, such as AdaBoost, is used to train a function that is more accurate than heuristic screening, through which the final target label can be obtained. The present invention also regards the indicators in the timeline data as a set of labels, that is, its label key is "metric" and the value is the indicator value, such as "metric: cpu". This special set of labels is always the target label, so each group's group label contains and only contains one such label, and all timelines in the group have the same indicator.

进一步的，根据目标标签筛选结果，对每条时间线提取其包含的目标标签从而形成一个集合，并将其置入一个分组中，每个分组具有一些组标签以组成一个组标签集合，该组标签集合在不同组间各不相同，也是不同分组区分彼此的唯一方式。Furthermore, according to the target tag screening results, the target tags contained in each timeline are extracted to form a set, and are placed in a group. Each group has some group tags to form a group tag set. The group tag set is different between different groups and is also the only way to distinguish different groups from each other.

进一步的，获得一条时间线的目标标签集合后，将该时间线置入一个组标签集合恰好与时间线的目标标签集合等价分组中，若不存在符合要求的分组，则创建一个满足上述要求的新组，并将时间线置入其中。Furthermore, after obtaining a target tag set of a timeline, the timeline is placed in a group whose group tag set is exactly equivalent to the target tag set of the timeline. If there is no group that meets the requirements, a new group that meets the above requirements is created and the timeline is placed in it.

本发明将一个数据段中的所有时间线划分至不同组中，组内数据的分布结构如图2所示，时间线组内包括指标1、指标2、指标3及指标g，每个指标下都有对应的属于指标自身的标签1至标签kg，每个组都有一个独一无二的组标签集合，而该组中的所有时间线也都含有上述组标签集合中的所有标签，组内的时间线无需再独立地存储这一部分标签，而只需要存储其具有的剩余标签，以及其指标名。The present invention divides all timelines in a data segment into different groups. The distribution structure of the data in the group is shown in Figure 2. The timeline group includes indicator 1, indicator 2, indicator 3 and indicator g. Each indicator has corresponding labels 1 to labels kg belonging to the indicator itself. Each group has a unique group label set, and all timelines in the group also contain all labels in the above group label set. The timelines in the group no longer need to store this part of the labels independently, but only need to store the remaining labels they have and their indicator names.

进一步的，基于步骤S5及之前的分组方案，对时间线数据建立前置索引及倒排索引，其中前置索引将目标标签索引到倒排链，倒排索引将前置索引结果再次索引到包含待索引时间线处。Furthermore, based on step S5 and the previous grouping scheme, a pre-index and a reverse index are established for the timeline data, wherein the pre-index indexes the target tag to the reverse index chain, and the reverse index indexes the pre-index result again to the location containing the timeline to be indexed.

其中，步骤S6中的所述倒排索引中每个标签键值对有对应的倒排链，所述倒排链中包括升序的组ID组。Wherein, each tag key-value pair in the inverted index in step S6 has a corresponding inverted chain, and the inverted chain includes ascending group ID groups.

其中，步骤S6中的所述前置索引通过代数重建法与字典数据结构实现。The pre-index in step S6 is implemented by an algebraic reconstruction method and a dictionary data structure.

其中，步骤S6中的所述前置索引中，所述代数重建法用于由标签键映射至对应的标签值集合构成的字典数据结构地址。Among them, in the pre-index in step S6, the algebraic reconstruction method is used to map the tag key to the dictionary data structure address consisting of the corresponding tag value set.

其中，步骤S6中的所述前置索引中，所述字典数据结构存有标签值与对应的倒排链偏移量的映射。In the preceding index in step S6, the dictionary data structure stores a mapping between label values and corresponding postings chain offsets.

基于分组方案，本发明设计了一种组级别的倒排索引。对于一个数据段中的每一个分组，为其分配一个独特的组id后，建立一个由标签键值对到组id的映射，该映射即为本发明中的组级倒排，具体来说，每个标签键值对对应一个倒排链，该倒排链中升序地包含一系列组id，这些id所对应的组的组标签集合中包含该标签键值对。Based on the grouping scheme, the present invention designs a group-level inverted index. For each group in a data segment, after assigning it a unique group id, a mapping from label key-value pairs to group ids is established. This mapping is the group-level inverted index in the present invention. Specifically, each label key-value pair corresponds to an inverted chain, which contains a series of group ids in ascending order. The group label set of the groups corresponding to these ids contains the label key-value pair.

基于分组方案，本发明设计了一种前置索引，其由目标标签映射到其对应的倒排链起始位置，这部分索引将使用ART代数重建法+FST字典数据结构的组合实现，Based on the grouping scheme, the present invention designs a pre-index, which maps the target label to the starting position of its corresponding postings chain. This part of the index will be implemented using a combination of ART algebraic reconstruction method + FST dictionary data structure.

其中，ART用于从标签键映射到其对应的标签值集合所构成的FST的地址，对于目标标签集合中的每个不同的标签键，其都对应一个标签值的集合，该标签键与集合中的任一标签值的组合都至少在数据段中出现过至少一次，对于该标签值集合，本发明建立一个FST来存储。Among them, ART is used to map from a label key to the address of an FST composed of its corresponding label value set. For each different label key in the target label set, it corresponds to a set of label values. The combination of the label key and any label value in the set has appeared at least once in the data segment. For the label value set, the present invention establishes an FST to store it.

在本发明中，一个FST存储了由标签值到其对应的倒排链偏移量的映射，偏移量是一个基于倒排列表基地址的正整数，而正整数间可以自然地定义加法运算，因此可以作为FST的输出；而另一方面，以FST的形式存储标签值，可以很好地应对正则表达式条件的查询。In the present invention, an FST stores a mapping from a label value to its corresponding postings list offset. The offset is a positive integer based on the postings list base address. Addition operations can be naturally defined between positive integers, so they can be used as the output of the FST. On the other hand, storing label values in the form of FST can well cope with queries under regular expression conditions.

不同于Lucene提出的剪枝搜索方案，得益于根据时间顺序的分段机制，每个数据段内所包含的时间线数据量并不算庞大，而使用分组方案后，每个数据段所构建的索引量得到进一步的缩减，因而每个FST仅有约数量级的存储空间占用。Different from the pruning search solution proposed by Lucene, thanks to the segmentation mechanism based on time order, the amount of timeline data contained in each data segment is not huge. After using the grouping solution, the amount of index constructed for each data segment is further reduced, so each FST is only about Order of magnitude storage space occupied.

下面介绍在一些实施例中，基于本发明提供的一种时序数据引擎的索引构建方法存储的数据的召回方法。In some embodiments, a method for recalling data stored in an index building method of a time series data engine provided by the present invention is described below.

对于时序数据库的一种典型查询模式单元为：A typical query mode unit for a time series database is:

查询=指标名+{标签键1：标签值，…，标签键：标签值}；query = metric name + {tag key 1: tag value, ..., tag key :label value};

其中，标签值可以为精确字符串、正则表达式或具有通配符的字符串，标签键值对间为与关系，即需召回同时包含查询标签集合中的所有标签的时间线数据，标签个数一般为2至6个不等。The tag value can be an exact string, a regular expression, or a string with wildcards. The tag key-value pairs are in an AND relationship, that is, the timeline data containing all tags in the query tag set needs to be recalled. The number of tags Generally, there are 2 to 6 items.

当一个查询到来时，首先查询ART，根据每一个查询条件，如果该条件命中了一个目标标签键，则可以映射到相应的FST；接下来，在FST中查询对应的标签值，如果标签值为目标标签值，将得到一个（或多个，在正则匹配等情况下）返回值，该返回值即为倒排索引的偏移量。接下来，通过上述得到的偏移量，查询倒排索引得到一系列倒排链，将倒排链求交，即可得到本条查询所需要召回的所有组号；最后，依照组号依次解压相应分组，并返回其内部的所有时间线数据，则本条查询处理完毕。When a query comes, ART is queried first. According to each query condition, if the condition hits a target label key, it can be mapped to the corresponding FST; next, the corresponding label value is queried in FST. If the label value is the target label value, one (or more, in the case of regular matching, etc.) return value will be obtained, and the return value is the offset of the inverted index. Next, through the offset obtained above, the inverted index is queried to obtain a series of inverted chains. The inverted chains are intersected to obtain all the group numbers that need to be recalled for this query; finally, the corresponding groups are decompressed in sequence according to the group number, and all the timeline data inside them are returned, and this query is processed.

一种可能发生的情况为，某个查询条件未命中任何一个目标标签键，则本次查询需要进行过滤处理。对于其他的查询条件，最终获得所有待召回组的所有时间线数据后，需要对这些数据进行一次过滤，过滤条件即包含未命中目标键标签的查询条件的时间线数据，在过滤之后返回。注意到，一个查询中总会提供一个指标名作为查询条件之一，这就保证了不存在一个查询，使得其提供的所有查询条件都未命中任何一个目标标签，因此，该方案可以保证永远不会出现需要在数据段级别进行全局过滤的情况，从而保证了查询延迟的下限。One possible situation is that a query condition does not hit any target tag key, then this query needs to be filtered. For other query conditions, after finally obtaining all the timeline data of all the groups to be recalled, these data need to be filtered once. The filter conditions are the timeline data of the query conditions that do not hit the target key tag, which are returned after filtering. Note that a metric name is always provided as one of the query conditions in a query, which ensures that there is no query in which all the query conditions provided do not hit any target tag. Therefore, this solution can ensure that there will never be a situation where global filtering is required at the data segment level, thereby ensuring the lower limit of query latency.

另一种可能发生的情况为，某个查询条件在第一步中命中了某个目标标签键（记作），并成功的获得了其对应的FST数据结构，但在FST中，并未命中任何一个标签值，这种情况也需要进行过滤，但过滤的数据量和时间开销将远小于前一种过滤，同样的，当获得了所有需召回的组中的所有时间线数据后，对于每条时间线，只需检查其中/>所对应的标签值是否为查询指定的标签值即可，过滤之后返回，即为本次查询的最终结果。Another possible situation is that a query condition hits a target tag key (denoted as ), and successfully obtained its corresponding FST data structure, but in the FST, no label value was hit. In this case, filtering is also required, but the amount of data and time overhead of filtering will be much smaller than the previous filtering. Similarly, after obtaining all the timeline data in all the groups to be recalled, for each timeline, only the timeline needs to be checked. The corresponding tag value is the tag value specified by the query, and the result returned after filtering is the final result of this query.

最后应说明的是：以上实施例仅用以说明本发明的技术方案，而非对其限制；尽管参照前述实施例对本发明进行了详细的说明，本领域的普通技术人员应当理解：其依然可以对前述各实施例所记载的技术方案进行修改，或者对其中部分技术特征进行等同替换；而这些修改或者替换，并不使相应技术方案的本质脱离本发明各实施例技术方案的精神和范围。Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, rather than to limit it. Although the present invention has been described in detail with reference to the aforementioned embodiments, those skilled in the art should understand that they can still modify the technical solutions described in the aforementioned embodiments, or make equivalent replacements for some of the technical features therein. However, these modifications or replacements do not deviate the essence of the corresponding technical solutions from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for constructing an index of a time series data storage engine, comprising:

S1: pre-screen the data blocks to be stored according to the document frequency of the tag key and the occurrence frequency of the tag value to obtain a pre-screened set;

S2: extracting features from the pre-screened set based on the historical access frequency of the tag key to obtain data features;

S3: performing machine learning on the data features to obtain a screening function, and screening the pre-screening set by using the screening function to obtain a target set, wherein the target set includes at least an indicator label group for each timeline;

S4: extracting target labels from the target set according to a plurality of different indicators in the indicator label group to obtain a plurality of group label sets;

S5: For a group tag set with the same indicator tag, insert a timeline corresponding to the indicator tag to obtain multiple timeline sets;

S6: Assign a unique group ID to each timeline set, create an inverted index mapping the tag key-value pair to the group ID, and create a pre-index mapping the target tag to the inverted index to complete the index construction of the time series data storage engine.

2. According to the index construction method of a time series data storage engine described in claim 1, it is characterized in that the data features in step S2 include: document frequency, tag key cardinality, tag key cardinality ranking, tag key cardinality ranking ratio, tag key frequency, tag key frequency ranking and tag key frequency ranking ratio.

3. According to the index construction method of a time series data storage engine according to claim 1, it is characterized in that the machine learning scheme used for the data features in step S3 is an AdaBoost iterative algorithm.

4. The index construction method of a time series data storage engine according to claim 1 is characterized in that, in step S3, each indicator in the indicator label group uses the indicator name as the label key and the indicator value as the label value.

5. According to the index construction method of a time series data storage engine according to claim 1, it is characterized in that each label key-value pair in the inverted index in step S6 has a corresponding inverted chain, and the inverted chain includes ascending group ID groups.

6. The index construction method of a time series data storage engine according to claim 1 is characterized in that the pre-index in step S6 is implemented by an algebraic reconstruction method and a dictionary data structure.

7. According to the index construction method of a time series data storage engine according to claim 6, it is characterized in that in the pre-index in step S6, the algebraic reconstruction method is used to map the label key to the dictionary data structure address consisting of the corresponding label value set.

8. The index construction method for a time series data storage engine according to claim 6, characterized in that in the pre-index in step S6, the dictionary data structure stores a mapping between a label value and a corresponding postings chain offset.