CN108021678B

CN108021678B - Key value pair storage structure with compact structure and quick key value pair searching method

Info

Publication number: CN108021678B
Application number: CN201711287661.7A
Authority: CN
Inventors: 嵩天; 魏煜
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2017-12-07
Filing date: 2017-12-07
Publication date: 2022-05-17
Anticipated expiration: 2037-12-07
Also published as: CN108021678A

Abstract

The invention relates to a compact key-value pair storage structure and a fast key-value pair search method, belonging to the technical field of real-time big data processing and key-value query. Based on the combined structure of hierarchical hash table and Bloom filter, the first layer uses a 2-left hashing hash structure for storage, and each layer of hash table is used as an auxiliary table to save the conflicting data in the upper table, and Combined with Bloom filter to save conflicting data, use "Bloom filter to determine whether the searched key exists in the current set of results" as an index to improve search speed, thereby improving storage space utilization, making the structure compact and fast Find the software platform requirements. The invention can effectively solve the problems of current large-scale key-value pair storage and search time uncertainty, unsteady search speed, slow software implementation and running speed that cannot meet application requirements and the like.

Description

A compact key-value pair storage structure and a fast key-value pair search method

技术领域technical field

本发明涉及一种结构紧凑的键值对存储结构及快速键值(key-value)对查找方法，特别涉及一种基于多层次哈希表(hash table)和布鲁姆过滤器(bloom filter，BF)的键值对存储和查找方法，属于实时大数据处理及键值查询技术领域。The invention relates to a compact key-value pair storage structure and a fast key-value (key-value) pair search method, in particular to a multi-level hash table (hash table) and a bloom filter (bloom filter, BF) ) key-value pair storage and search method, belonging to the technical field of real-time big data processing and key-value query.

背景技术Background technique

随着互联网规模的日益扩大，网络流量日益增长，计算机技术领域对大数据存储和查找速率的要求越来越高。键值对查找问题在计算机领域的各个方向如大数据处理和高速网络中应用颇广。With the increasing scale of the Internet and the growing network traffic, the requirements for large data storage and search rates in the field of computer technology are getting higher and higher. The key-value search problem is widely used in various directions of the computer field, such as big data processing and high-speed networks.

然而，键值对查找方法的性能严重受限于硬件资源，因此，在网络功能虚拟化的趋势下，适应软件平台更为重要且更加适应互联网发展。在一个软件平台中会同时处理多种任务，存储和查找方法不能独自占用一整个高速缓存(cache)，因此，该方法要谨慎地控制放入cache中的结构大小，且避免频繁替换。然而，现有方法大多数更适应硬件平台，且在查找速度和cache使用效率方面不够高。However, the performance of the key-value pair lookup method is severely limited by hardware resources. Therefore, under the trend of network function virtualization, it is more important to adapt to the software platform and more suitable for the development of the Internet. In a software platform, multiple tasks are processed at the same time, and the storage and search methods cannot occupy an entire cache by themselves. Therefore, this method should carefully control the size of the structures placed in the cache and avoid frequent replacement. However, most of the existing methods are more suitable for hardware platforms, and are not high enough in terms of search speed and cache usage efficiency.

现在已有的键值对查找方法主要分为哈希表查找，树形查找，以及基于布鲁姆过滤器的查找方法。哈希表方法是较为传统的方法，可以通过确定的时间复杂度O(1)执行键值对的插入，删除，查找，但是哈希表的重要缺点是它需要处理哈希冲突，以至于无法每次都在O(1)复杂度完成。因此，单纯使用哈希表进行方法的基础数据结构设计无法满足当前对键值对查找速度的要求，而减少哈希冲突又需要更大的存储空间，从而导致空间利用率较低。The existing key-value pair search methods are mainly divided into hash table search, tree search, and search method based on Bloom filter. The hash table method is a more traditional method, which can perform key-value pair insertion, deletion, and search with a certain time complexity of O(1). It is done in O(1) complexity every time. Therefore, the basic data structure design of the method simply using the hash table cannot meet the current requirements for the search speed of key-value pairs, and reducing hash collisions requires larger storage space, resulting in low space utilization.

基于布鲁姆过滤器数据结构的方法主要依赖于布鲁姆过滤器结构简单、紧凑的特性。主要方法有多个布鲁姆过滤器组合的方法，布鲁姆过滤器结合数据编码的方法，基于布鲁姆过滤器变形后的数据结构等。这些结构改变了原有布鲁姆过滤器的结构，提升了一定的空间使用率，但是破坏了布鲁姆过滤器原有的简单、紧凑的特性，因此并没有发挥出Bloom filter应有的效果。Methods based on Bloom filter data structure mainly rely on the simple and compact properties of Bloom filter structure. The main methods include the method of combining multiple Bloom filters, the method of combining Bloom filters with data encoding, and the data structure based on the deformation of Bloom filters. These structures changed the structure of the original Bloom filter and improved a certain space utilization rate, but destroyed the original simple and compact characteristics of the Bloom filter, so they did not play the due effect of the Bloom filter. .

本发明涉及一种具有紧凑且快速特点的键值对查找方法，提供了一种结构紧凑的键值对存储结构，提高了传统哈希表方法的空间利用率，通过确定的时间复杂度完成键值对的查找，所提出的结构紧凑的键值对存储结构和快速键值对查找方法主要涉及以下两个问题：The invention relates to a key-value pair search method with compact and fast characteristics, provides a key-value pair storage structure with a compact structure, improves the space utilization rate of the traditional hash table method, and completes the key-value pair by determining the time complexity. Value pair search, the proposed compact key-value pair storage structure and fast key-value pair search method mainly involve the following two problems:

(1)基于多层哈希表设计，使用2-left hashing哈希结构为多层次哈希表的首层，保证结构的紧密和空间的高利用率，同时减少哈希冲突；(1) Based on the multi-layer hash table design, the 2-left hashing hash structure is used as the first layer of the multi-level hash table to ensure the compactness of the structure and the high utilization of space, while reducing hash collisions;

(2)通过多层的布鲁姆过滤器存储对应层哈希表中冲突数据，及提供被查找键是否存在当前层冲突数据集的结果作为查找索引，提高查找速度，减少内存访问次数。(2) The conflict data in the corresponding layer hash table is stored through the multi-layer Bloom filter, and the result of whether the search key exists in the current layer conflict data set is provided as the search index, which improves the search speed and reduces the number of memory accesses.

发明内容SUMMARY OF THE INVENTION

本发明的目的是为了克服现有方法的键值对存储结构不够紧凑的缺陷以及为了解决当前大规模键值对存储和查找时间不确定性，软件实现运行速度过慢达不到应用要求等问题，满足合理利用高速缓存争取更少地进行数据替换的要求，提出一种结构紧凑的键值对存储结构及快速键值对查找方法。The purpose of the present invention is to overcome the defect that the key-value pair storage structure of the existing method is not compact enough, and to solve the problems of the current large-scale key-value pair storage and search time uncertainty, the software implementation running speed is too slow to meet the application requirements, etc. , to meet the requirement of reasonable use of cache for less data replacement, a compact key-value pair storage structure and a fast key-value pair search method are proposed.

本发明的思想是基于分层哈希表和布鲁姆过滤器相结合的结构，通过首层使用2-left hashing哈希结构存储，每层哈希表都作为辅助表保存在上层表中发生存储冲突的数据，并结合布鲁姆过滤器进行冲突数据的保存，将“布鲁姆过滤器判断所查找的键是否存在当前集合中的结果”作为索引提高查找速度，从而提高存储空间利用率使结构紧凑并实现快速查找的软件平台要求。一种结构紧凑的键值对存储结构及快速键值对查找方法，包括一种结构紧凑的键值对存储结构及一种快速键值对查找方法；The idea of the present invention is based on the combination of hierarchical hash table and Bloom filter. The first layer uses a 2-left hashing hash structure for storage, and each layer of hash table is stored as an auxiliary table in the upper table for storage. Conflicting data, combined with Bloom filter to save conflicting data, using "Bloom filter to determine whether the searched key exists in the current set of results" as an index to improve the search speed, thereby improving storage space utilization. Software platform requirements for compact structure and fast lookup. A compact key-value pair storage structure and a fast key-value pair search method, including a compact key-value pair storage structure and a fast key-value pair search method;

其中，一种结构紧凑的键值对存储结构，简称存储结构包括λ层哈希表，从第二层哈希表开始，每层哈希表中包含的总存储块数量呈等比序列递减，其中，优选的等比因子为10；Among them, a compact key-value pair storage structure, referred to as the storage structure, includes a λ-layer hash table. Starting from the second-layer hash table, the total number of storage blocks contained in each layer of the hash table decreases in a proportional sequence. Wherein, the preferred proportionality factor is 10;

其中，首层哈希表，即第一层哈希表，使用2-left hashing哈希表，该哈希表存在左右两个子表，每个子表中包含相同个数的存储块；对于其余层哈希表，不存在子表的概念；Among them, the first-layer hash table, that is, the first-layer hash table, uses a 2-left hashing hash table, which has two sub-tables on the left and right, and each sub-table contains the same number of storage blocks; for the remaining layers Hash table, there is no concept of sub-table;

其中，每个存储块包含N个存储单元，每个存储单元的大小为M比特；Wherein, each storage block includes N storage units, and the size of each storage unit is M bits;

其中，M*N的范围小于使用该存储结构的计算机缓存行的大小，优选的N为8，优选的M为32或64比特；Wherein, the range of M*N is smaller than the size of the computer cache line using the storage structure, the preferred N is 8, and the preferred M is 32 or 64 bits;

每个存储块中第一个存储单元保存该存储块中已被占用的存储单元个数，其余存储单元中存储的是键值对；The first storage unit in each storage block stores the number of occupied storage units in the storage block, and the remaining storage units store key-value pairs;

其中，键值对包括键和值；具体存储时存储的是键的“值”和键的“校验值”；其中，存储的“值”部分可以存储该值本身或者指向值本身存储位置的指针，后文统称为值；键的“校验值”的计算可使用常用的校验算法，如crc循环冗余校验或md5校验算法；Among them, the key-value pair includes a key and a value; the "value" of the key and the "check value" of the key are stored in the specific storage; the "value" part of the storage can store the value itself or point to the storage location of the value itself. Pointer, hereinafter collectively referred to as value; the calculation of the "check value" of the key can use a commonly used check algorithm, such as crc cyclic redundancy check or md5 check algorithm;

除最后一层哈希表外，每层哈希表对应一个布鲁姆过滤器，从第一层布鲁姆过滤器开始，每层布鲁姆过滤器的所占空间大小呈等比序列递减，其中，优选的等比因子为10；Except for the last layer of hash table, each layer of hash table corresponds to a Bloom filter. Starting from the first layer of Bloom filter, the size of the space occupied by each layer of Bloom filter decreases in a proportional sequence. , wherein, the preferred scaling factor is 10;

一种快速键值对查找方法，包括键值对数据存储过程和键值对查找过程；A fast key-value pair search method, including a key-value pair data storage process and a key-value pair search process;

其中，键值对数据存储过程，即键值对插入过程，简称插入过程；键值对查找过程简称查找过程；Among them, the key-value pair data storage process is the key-value pair insertion process, referred to as the insertion process; the key-value pair search process is referred to as the search process;

其中，插入过程的具体步骤如下：The specific steps of the insertion process are as follows:

步骤1，设置哈希表层数以及每一层哈希表包含存储块的个数，同时设置每一层哈希表对应的布鲁姆过滤器的大小；Step 1, set the number of hash table layers and the number of storage blocks included in each layer of the hash table, and set the size of the Bloom filter corresponding to each layer of the hash table;

步骤2，设置哈希表每个存储块中包含的存储单元的个数N，设置每个存储单元的大小为M比特；Step 2, set the number N of the storage units included in each storage block of the hash table, set the size of each storage unit to be M bits;

步骤2优选的方案是保证存储块的大小M*N比特等于或小于计算机缓存行大小，由此使访问存储块时将整个存储块调入缓存中，避免访问存储块的不同单元时进行多次缓存替换，从而加快查找时间；The preferred solution of step 2 is to ensure that the size of the memory block M*N bits is equal to or smaller than the size of the computer cache line, so that the entire memory block is transferred into the cache when accessing the memory block, and multiple times when accessing different units of the memory block is avoided. Cache replacement for faster lookup times;

步骤3,向第一层哈希表，即2-left hashing，中插入数据，若不冲突，则该键值对插入成功，跳至步骤5；若冲突，则将数据标记在第一层对应的bloom filter中，并执行步骤4；Step 3, insert data into the first-layer hash table, namely 2-left hashing, if there is no conflict, the key-value pair is inserted successfully, and skip to step 5; if there is a conflict, mark the data in the first layer corresponding to in the bloom filter, and execute step 4;

其中，插入的数据就是键值对的“值”部分，同时也存储了键的“校验值”；Among them, the inserted data is the "value" part of the key-value pair, and the "check value" of the key is also stored;

步骤3的具体过程为：对键值对中的键进行两次哈希运算，使用两个哈希计算结果在2-left hashing的左右子表中分别确定一个存储块；The specific process of step 3 is: perform two hash operations on the keys in the key-value pair, and use the two hash calculation results to determine a storage block respectively in the left and right sub-tables of 2-left hashing;

其中，每层哈希表的哈希运算是固定的，此处第一层的两次哈希运算也是固定的；Among them, the hash operation of each layer of hash table is fixed, and here the two hash operations of the first layer are also fixed;

当出现如下两种情况，表明插入数据冲突，需要跳至步骤4：When the following two situations occur, indicating that the inserted data conflicts, you need to skip to step 4:

3.1A两处存储块中存储单元均被占满，即两处存储块中存储单元个数相同，均为N-1，等于该块中能存储的最大键值对个数，则表明发生插入冲突，进入步骤4；3.1A The storage units in the two storage blocks are all occupied, that is, the number of storage units in the two storage blocks is the same, both N-1, which is equal to the maximum number of key-value pairs that can be stored in the block, indicating that an insertion has occurred. Conflict, go to step 4;

3.1B两处存储块中所有存储单元中保存的校验值存在和欲插入数据或欲插入键值对的“校验值”相同的存储单元，则表明发生插入冲突，进入步骤4；3.1B If the check value stored in all the storage units in the two storage blocks exists in the same storage unit as the "check value" to be inserted into the data or the key-value pair to be inserted, it indicates that an insertion conflict has occurred, and the process proceeds to step 4;

当不满足3.1A和3.1B的两种插入数据冲突情况，即插入数据不冲突时，存在如下两种情况：When the two inserted data conflict situations of 3.1A and 3.1B are not satisfied, that is, when the inserted data does not conflict, there are the following two situations:

3.2A两处存储块中单元被占用的个数不同，则将插入数据，即该键的值和该键的校验值保存到被占用单元个数较少的存储块的第一个空闲的存储单元中，跳至步骤6；3.2A The number of occupied cells in the two storage blocks is different, and the inserted data, that is, the value of the key and the check value of the key, are saved to the first free storage block with fewer occupied cells. In the storage unit, skip to step 6;

3.2B两处存储块中单元被占用的个数相同，则将插入数据，即将该键对应的值和该键的校验值保存到左侧存储块的第一个空闲的存储单元中，跳至步骤6；3.2B The number of occupied cells in the two storage blocks is the same, then the data will be inserted, that is, the value corresponding to the key and the check value of the key will be saved to the first free storage cell of the left storage block. to step 6;

步骤4，判断当前层是否为最后一层，并根据判断结果，若为最后一层，则跳入步骤7，若当前层不是哈希表结构中的最后一层，应向当前层对应的布鲁姆过滤器集合中添加当前键，跳至步骤5；Step 4, determine whether the current layer is the last layer, and according to the judgment result, if it is the last layer, then jump to step 7, if the current layer is not the last layer in the hash table structure, it should be distributed to the corresponding layer of the current layer. Add the current key to the Rum filter set, skip to step 5;

步骤5跳入下一层哈希表中，进行一次哈希运算，使用此哈希运算的结果确定一个存储块位置，通过判断该存储块和当前键是否存在冲突来进行如下操作：当出现插入数据冲突时，需要跳至步骤4：当插入数据不冲突时，将插入数据，即将该键的值和该键的校验值保存到该存储块的第一个空闲的存储单元中，进入步骤6；Step 5: Jump into the next layer of the hash table, perform a hash operation, use the result of the hash operation to determine the location of a storage block, and perform the following operations by judging whether the storage block and the current key conflict: when an insertion occurs When the data conflicts, you need to skip to step 4: when the inserted data does not conflict, insert the data, that is, save the value of the key and the check value of the key to the first free storage unit of the storage block, and enter the step 6;

其中，插入数据冲突，需要跳至步骤4的两种情况如下：Among them, there are two situations in which the inserted data conflicts and you need to skip to step 4 as follows:

5.1A存储块中存储单元被占满，即存储块中存储单元个数为N-1，等于该块中能存储的最大键值对个数，则表明发生插入冲突，需跳至步骤4；5.1A If the storage units in the storage block are full, that is, the number of storage units in the storage block is N-1, which is equal to the maximum number of key-value pairs that can be stored in the block, it indicates that an insertion conflict occurs, and you need to skip to step 4;

5.1B存储块中所有存储单元中保存的校验值存在和欲插入数据或欲插入键值对的“校验值”相同的存储单元，则表明发生插入冲突，需跳至步骤4；5.1B If the check value stored in all storage units in the storage block exists in the same storage unit as the "check value" of the data to be inserted or the key-value pair to be inserted, it indicates that an insertion conflict occurs, and it is necessary to skip to step 4;

步骤6，数据插入成功，即键值对插入函数调用成功，函数返回1，结束插入过程；Step 6, the data insertion is successful, that is, the key-value pair insertion function is successfully called, the function returns 1, and the insertion process ends;

步骤7，数据插入失败，即键值对插入函数调用失败，函数返回0，结束插入过程；Step 7, the data insertion fails, that is, the key-value pair insertion function fails to be called, the function returns 0, and the insertion process ends;

至此，从步骤1到步骤6或步骤7，完成了插入过程；So far, from step 1 to step 6 or step 7, the insertion process is completed;

第二部分键值对查找过程，即对于给出键值对中的键，查找对应键值对中的值的过程，简称查找过程，具体步骤如下：The second part of the key-value pair search process, that is, for a given key in the key-value pair, the process of finding the value in the corresponding key-value pair, referred to as the search process, the specific steps are as follows:

步骤8，在第一层布鲁姆过滤器集合中对给出键值对中的键进行查找，通过判断该键是否在该集合中保存，决定跳至步骤10还是步骤9，具体为：Step 8, search for the key in the given key-value pair in the first-layer Bloom filter set, by judging whether the key is saved in the set, decide to skip to step 10 or step 9, specifically:

8.1若该键不存在第一层布鲁姆过滤器集合中，执行步骤9；8.1 If the key does not exist in the first-level Bloom filter set, go to step 9;

8.2若该键存在于第一层布鲁姆过滤器集合中，执行步骤10；8.2 If the key exists in the first-level Bloom filter set, go to step 10;

步骤9，在2-left hashing中进行两次哈希计算，其结果在左右子表中分别确定一个存储块；将两个存储块中每个非空单元内的校验值和“根据当前键计算出的校验值”比较，查看是否存在相等的情况，根据判断结果进行如下操作：Step 9, perform two hash calculations in 2-left hashing, and determine a storage block in the left and right sub-tables as a result; Calculated check value "comparison, check whether there is an equal situation, and perform the following operations according to the judgment result:

9.1若存在相等的情况，则表示该单元保存的值即为所查找的键对应的值，跳至步骤13；9.1 If there is an equal situation, it means that the value stored in the unit is the value corresponding to the searched key, and skip to step 13;

9.2若不存在相等的情况，则键对应的值不存在于本存储结构中，跳至步骤14；9.2 If there is no equality, the value corresponding to the key does not exist in this storage structure, and skip to step 14;

步骤10，进入下一层布鲁姆过滤器，在下一层布鲁姆过滤器集合中对给出的键进行查找，通过判断该键是否在该布鲁姆过滤器集合中来进行如下操作：Step 10, enter the next layer of Bloom filter, search for the given key in the next layer of Bloom filter set, and perform the following operations by judging whether the key is in the Bloom filter set:

10.1若该键存在于该布鲁姆过滤器集合中，通过判断当前布鲁姆过滤器所在层是否为最后一层，决定跳至步骤11还是步骤10，具体为：10.1 If the key exists in the Bloom filter set, by judging whether the layer where the current Bloom filter is located is the last layer, it is decided to skip to step 11 or step 10, specifically:

10.1A若当前布鲁姆过滤器所在层不是最后一层，跳至步骤10；10.1A If the current Bloom filter layer is not the last layer, skip to step 10;

10.1B若当前布鲁姆过滤器所在层是最后一层，跳至步骤11；10.1B If the current Bloom filter layer is the last layer, skip to step 11;

10.2若该键不存在于该层布鲁姆过滤器集合中，进入当前布鲁姆过滤器对应层的哈希表，跳至步骤12；10.2 If the key does not exist in the Bloom filter set of this layer, enter the hash table of the corresponding layer of the current Bloom filter, and skip to step 12;

步骤11，进入最后一层哈希表；Step 11, enter the last layer of hash table;

步骤12，在当前层哈希表中查找键对应的校验值，对键进行一次哈希运算，通过哈希运算的结果在哈希表中确定一个存储块，在存储块中遍历每个非空单元的校验值并比较是否与当前键的校验值相等，根据是否存在相等的情况，判断跳至步骤13还是步骤14，具体为：Step 12: Find the check value corresponding to the key in the hash table of the current layer, perform a hash operation on the key, determine a storage block in the hash table through the result of the hash operation, and traverse each non-volatile storage block in the storage block. Check whether the check value of the empty cell is equal to the check value of the current key. According to whether there is an equality, it is judged whether to jump to step 13 or step 14, specifically:

12.1若存在相等的情况，即找到了相等的校验值，则进入步骤13；12.1 If there is an equal situation, that is, an equal check value is found, then go to step 13;

12.2若不存在相等的情况，即没找到相等的校验值，则进入步骤14；12.2 If there is no equal condition, that is, no equal check value is found, then go to step 14;

步骤13，查找函数调用成功，函数返回当前存储单元中保存的值，结束查找；Step 13, the search function is called successfully, the function returns the value saved in the current storage unit, and ends the search;

步骤14，查找函数调用失败，函数返回0，结束查找；Step 14, the search function fails to call, the function returns 0, and the search ends;

至此，从步骤1到步骤13或步骤14，完成了一种快速键值对查找方法。So far, from step 1 to step 13 or step 14, a quick key-value pair search method is completed.

有益效果beneficial effect

本发明一种结构紧凑的键值对存储结构及快速键值对查找方法，与现有方法相比，具有如下有益效果：Compared with the existing method, the present invention has a compact key-value pair storage structure and a fast key-value pair search method, and has the following beneficial effects:

1.本发明所述方法通过以2-left hashing哈希表为首层的分层哈希表保证数据存储的紧凑性，通过使用布鲁姆过滤器存储冲突数据，有效减少了布鲁姆过滤器占用的空间，即有效地解决存储空间利用率低的问题；1. The method of the present invention ensures the compactness of data storage by taking the 2-left hashing hash table as the first layer of the hierarchical hash table, and by using the Bloom filter to store conflicting data, the Bloom filter is effectively reduced. Occupied space, that is, effectively solve the problem of low utilization of storage space;

2.对比已有方法的查找速度慢且不适应软件平台的问题，本方法可以适应软件平台的高速缓存大小，并提高了查找速度，保证了实时性。2. Compared with the problems of slow search speed and incompatibility with the software platform of the existing method, the method can adapt to the cache size of the software platform, improve the search speed, and ensure real-time performance.

附图说明Description of drawings

图1为本发明实施例一种结构紧凑的键值对快速查找方法(插入过程)流程示意图；1 is a schematic flowchart of a method (insertion process) for fast searching of key-value pairs with a compact structure according to an embodiment of the present invention;

图2为本发明实施例一种结构紧凑的键值对快速查找方法(查找过程)流程示意图；FIG. 2 is a schematic flowchart of a fast search method (search process) for a key-value pair with a compact structure according to an embodiment of the present invention;

图3为本发明中结构紧凑的键值对存储结构示意图；3 is a schematic diagram of a compact key-value pair storage structure in the present invention;

图4为快速键值对查找方法实施例1中插入过程使用2-left hashing的示意图；4 is a schematic diagram of using 2-left hashing in the insertion process in Embodiment 1 of the fast key-value pair search method;

图5为快速键值对查找方法实施例2中插入过程使用2-left hashing的示意图；5 is a schematic diagram of using 2-left hashing in the insertion process in Embodiment 2 of the quick key-value pair search method;

图6为快速键值对查找方法实施例2中使用查找过程的示意图。FIG. 6 is a schematic diagram of a search process used in Embodiment 2 of the quick key-value pair search method.

具体实施方式Detailed ways

为使发明的目的、技术方案及优点更加清楚，下面将结合附图对本发明的具体实施方式进行详细描述。本实施例以本发明技术方案为前提进行实施，给出了详细实施方式和具体操作过程，但本发明的保护范围不限于下述实施例。In order to make the objectives, technical solutions and advantages of the present invention clearer, the specific embodiments of the present invention will be described in detail below with reference to the accompanying drawings. This embodiment is implemented on the premise of the technical solution of the present invention, and provides a detailed implementation manner and a specific operation process, but the protection scope of the present invention is not limited to the following embodiments.

本发明为一种基于分层的哈希表和布鲁姆过滤器的紧凑的键值对存储结构和快速键值对查找方法，其具体步骤分为两部分，图1所示为第一部分，键值对插入过程，图2所示为第二部分，键值对查找过程。The present invention is a compact key-value pair storage structure and a fast key-value pair search method based on a layered hash table and a Bloom filter. The specific steps are divided into two parts. Figure 1 shows the first part, the key The value pair insertion process, Figure 2 shows the second part, the key-value pair search process.

实施例1Example 1

第一部分键值对插入过程的具体步骤包括：The specific steps of the first part of the key-value pair insertion process include:

本实施例中.如图3所示，设置哈希结构层数为4层，首层哈希表2-left hashing包括204800个存储块，第二层哈希表包括500个存储块，第三次哈希表包括50个存储块.第四层哈希表包括5个存储块；In this embodiment, as shown in Figure 3, the number of layers of the hash structure is set to 4, the first layer of hash table 2-left hashing includes 204,800 storage blocks, the second layer of hash table includes 500 storage blocks, and the third layer of hash table includes 500 storage blocks. The secondary hash table includes 50 storage blocks. The fourth-level hash table includes 5 storage blocks;

设置第一层布鲁姆过滤器大小为5000字节，下层依次递减为500字节，50字节；Set the Bloom filter size of the first layer to 5000 bytes, and the lower layer to decrease to 500 bytes and 50 bytes in turn;

实施例中，设置N为8，M为32比特；每个存储块的大小为8*32＝256比特；每个存储单元包括“值”部分和“校验值”部分，值部分占16比特，校验值部分占16比特；In the embodiment, set N to be 8 and M to be 32 bits; the size of each storage block is 8*32=256 bits; each storage unit includes a "value" part and a "check value" part, and the value part occupies 16 bits. , the check value part occupies 16 bits;

首层为2-left hashing数据结构，该数据结构由两个子表组成，如图4所示；每个子表由相同个数的存储块组成，本实施例中，每个子表包含102400个存储块；The first layer is a 2-left hashing data structure, which consists of two sub-tables, as shown in Figure 4; each sub-table consists of the same number of storage blocks, in this embodiment, each sub-table contains 102,400 storage blocks ;

步骤3,向第一层哈希表中插入数据，根据判断是否存在冲突执行操作；Step 3, insert data into the first-layer hash table, and execute the operation according to whether there is a conflict;

本实施例中，如图4所示，对键值对(key1，value1)中的key1进行两次哈希运算Hash1(key1)和Hash2(key1)，其中，Hash1是左子表使用的哈希函数，Hash2是右子表使用的哈希函数，两次哈希运算的结果分别对应到2-left hashing中左右两个子表中的两处存储块；In this embodiment, as shown in FIG. 4 , two hash operations Hash1(key1) and Hash2(key1) are performed on key1 in the key-value pair (key1, value1), where Hash1 is the hash used by the left sub-table function, Hash2 is the hash function used by the right sub-table, the results of the two hash operations correspond to two storage blocks in the left and right sub-tables in 2-left hashing respectively;

查看两处存储块的第一个单元，图中两处存储块分别已被占用了2个和4个单元，且不存在校验值冲突，则将value1和key1的校验值存入左侧子表的存储块第一个空闲单元中，并将第一个单元的计数器累加1，存储步骤完毕，进入步骤4；Check the first unit of the two storage blocks. In the figure, the two storage blocks have been occupied by 2 and 4 units respectively, and there is no check value conflict, then save the check values of value1 and key1 to the left In the first free unit of the storage block of the sub-table, the counter of the first unit is incremented by 1, the storage step is completed, and the step 4 is entered;

步骤4，插入成功，函数返回1，结束添加步骤；Step 4, the insertion is successful, the function returns 1, and the adding step is ended;

第二部分键值对查找过程，具体步骤如下：The second part of the key-value pair search process, the specific steps are as follows:

步骤5，在第一层布鲁姆过滤器集合中对给出的键进行查找，通过判断该键是否在该集合中保存来进行如下操作：Step 5: Search for the given key in the first-level Bloom filter set, and perform the following operations by judging whether the key is saved in the set:

本实施例中，该键不存在第一层布鲁姆过滤器集合中，执行步骤6；In this embodiment, if the key does not exist in the first-layer Bloom filter set, perform step 6;

步骤6，通过键key1，在第一层2-left hashing结构中查找键对应的值，根据查找结果判断执行步骤；Step 6, through the key key1, search the value corresponding to the key in the first layer 2-left hashing structure, and judge the execution steps according to the search result;

本实施例中，首先对key1进行Hash11(key1)和Hash12(key1)运算，在2-lefthashing的两个子表中分别找到一个存储块；In the present embodiment, first perform Hash11 (key1) and Hash12 (key1) operations on key1, and find a storage block respectively in the two sub-tables of 2-lefthashing;

计算key1的校验码，将其与两个存储块中所有的校验值进行比较，此时在左侧存储块中找到相同的校验值，表示查找成功，进入步骤7；Calculate the check code of key1, and compare it with all the check values in the two storage blocks. At this time, the same check value is found in the left storage block, indicating that the search is successful, and then go to step 7;

步骤7，返回找到的value1，结束查找；Step 7, return the found value1, and end the search;

实施例2Example 2

第一部分键值对插入过程，结合附图3和5，具体步骤如下：The first part of the key-value pair insertion process, in conjunction with Figures 3 and 5, the specific steps are as follows:

本实施例中，设置同实施例1；In this embodiment, the settings are the same as those in Embodiment 1;

本实施例中，对键值对(key2，value2)中的key2进行两次哈希运算Hash1(key2)和Hash2(key2)，该值对应的两个位置如图5所示，均被占用了7个单元，均满，因此产生插入冲突，图3中虚线1表示插入冲突，进入步骤4；In this embodiment, two hash operations Hash1 (key2) and Hash2 (key2) are performed on key2 in the key-value pair (key2, value2), and the two positions corresponding to the values are shown in Figure 5, both of which are occupied. 7 units are full, so there is an insertion conflict, the dotted line 1 in Figure 3 represents an insertion conflict, and enter step 4;

步骤4，判断当前层是否为最后一层，并根据判断结果进行操作；Step 4, judge whether the current layer is the last layer, and operate according to the judgment result;

当前层不是哈希表结构中的最后一层，应向当前层对应的布鲁姆过滤器集合中添加当前键，如图3中实线2所示，后跳至步骤5；The current layer is not the last layer in the hash table structure, and the current key should be added to the Bloom filter set corresponding to the current layer, as shown by the solid line 2 in Figure 3, and then skip to step 5;

步骤5，跳入下一层哈希表中，进行一次哈希运算，使用此哈希运算的结果确定一个存储块位置，通过判断该存储块和当前键是否存在冲突来进行如下操作：Step 5, jump into the next layer of hash table, perform a hash operation, use the result of the hash operation to determine the location of a storage block, and perform the following operations by judging whether the storage block and the current key conflict:

本实施例中，进入第二层哈希表；In this embodiment, enter the second-layer hash table;

在第二层哈希表中，进行一次哈希运算，该运算结果确定一个存储块位置，本实施例中，对key2进行一次哈希运算确定一个存储块；In the second-layer hash table, a hash operation is performed, and the operation result determines a storage block position. In this embodiment, a hash operation is performed on key2 to determine a storage block;

当前存储块中，已存储3个数据，未满，再检查key2的校验值是否与当前存储块已存储数据的校验值相同，此处发现相同的校验码，因此key2在第二层哈希表中也产生冲突，图3中虚线3表示在第二层中产生冲突；In the current storage block, 3 data have been stored, but it is not full. Then check whether the check value of key2 is the same as the check value of the data stored in the current storage block. The same check code is found here, so key2 is in the second layer. A conflict also occurs in the hash table, and the dotted line 3 in Figure 3 indicates that a conflict occurs in the second layer;

此时重复步骤4，向第二层布鲁姆过滤器中添加key2，如图3实线4；At this point, repeat step 4, and add key2 to the second layer of Bloom filter, as shown in solid line 4 in Figure 3;

进入第三层哈希表，向第三层哈希表中插入key2；对key2进行一次哈希运算，映射key2到第三层哈希表的某个存储块；Enter the third-layer hash table, insert key2 into the third-layer hash table; perform a hash operation on key2 to map key2 to a certain storage block in the third-layer hash table;

当前存储块中，存储未满，同时验证key2的校验值是该存储块中唯一的，将key2和key2的校验码存入第一个空的单元中，如图3实线5，进入步骤6；In the current storage block, the storage is not full. At the same time, verify that the check value of key2 is the only one in the storage block. Store the check codes of key2 and key2 in the first empty unit, as shown in solid line 5 in Figure 3, enter step 6;

步骤6，插入成功，函数返回1，结束添加步骤；Step 6, the insertion is successful, the function returns 1, and the adding step is ended;

第二部分键值对查找过程，结合附图6，具体步骤如下：The second part of the key-value pair search process, in conjunction with Figure 6, the specific steps are as follows:

步骤7，在第一层布鲁姆过滤器集合中查找key2，根据是否找到执行操作；Step 7, look up key2 in the first-layer Bloom filter set, and execute the operation according to whether it is found;

本实施例中，在第一层布鲁姆过滤器中判断key2是否存在，结果为是，如图6实线1，则进入步骤8；In this embodiment, it is judged whether key2 exists in the first-layer Bloom filter, and the result is yes, as shown by the solid line 1 in Figure 6, then go to step 8;

步骤8，进入下一层布鲁姆过滤器，在下一层布鲁姆过滤器集合中对给出的键进行查找，通过判断该键是否在该布鲁姆过滤器集合中来进行如下操作；Step 8, enter the next layer of Bloom filter, in the next layer of Bloom filter set, the given key is searched, by judging whether the key is in this Bloom filter set, carry out the following operations;

本实施例中，key2存在于第二层布鲁姆过滤器集合中，且不是最后一层，如图6中实线2所示，因此重复步骤8，即进入第三层布鲁姆过滤器，查找key2是否在该集合中；In this embodiment, key2 exists in the second-layer Bloom filter set, and is not the last layer, as shown by the solid line 2 in Figure 6, so repeat step 8, that is, enter the third-layer Bloom filter , find out whether key2 is in the set;

如图6中虚线3所示，key2不存在第三层布鲁姆过滤器集合中，进入步骤9；As shown by the dotted line 3 in Figure 6, key2 does not exist in the third-layer Bloom filter set, enter step 9;

步骤9，进入当前布鲁姆过滤器对应层的哈希表，执行步骤10；Step 9, enter the hash table of the corresponding layer of the current Bloom filter, and execute step 10;

本实施例中，进入第三层哈希表；In this embodiment, enter the third-layer hash table;

步骤10，在当前哈希表中查找键对应的值，根据查找结果判断执行步骤；Step 10, look up the value corresponding to the key in the current hash table, and judge the execution step according to the search result;

本实施例中，在第三层哈希表中，进行一次哈希运算，根据运算结果确定一个存储块，在该存储块中查找，是否存在和key2校验值相同的校验值；In this embodiment, in the third-layer hash table, a hash operation is performed, a storage block is determined according to the operation result, and the storage block is searched to find out whether there is a check value that is the same as the check value of key2;

本实施例中找到了相同的校验值，该存储单元中保存的值即为value2或指向value2的指针；The same check value is found in this embodiment, and the value stored in the storage unit is value2 or a pointer to value2;

步骤11，将value2返回，查找结束。Step 11, return value2, and the search ends.

结论in conclusion

本发明主要由紧凑的键值对存储结构及键值对快速查找方法两部分组成，快速查找方法分为添加过程和查找过程。通过使用多层的哈希表存储数据，特别地使用以2-lefthashing哈希表为首层进行多层哈希表存储键值对，保证键值对存储结构的紧凑性。通过使用布鲁姆过滤器记录在对应哈希表中产生冲突的数据，减少内存访问次数，提高对高速缓存的利用。键值对查找的时间开销是较为恒定的，因为95％以上的数据都会存储在2-lefthashing中，在2-left hashing中的键值对查找平均需要1.5次内存访问，达到了查找速度恒定的目的。通过使用足够小的能保存在高速缓存中的布鲁姆过滤器结构，对其查找不产生内存访问次数，因此查找极快，达到了加快键值对查找在软件实现中速度的目的。The invention is mainly composed of a compact key-value pair storage structure and a key-value pair quick search method. The quick search method is divided into an adding process and a search process. By using a multi-layer hash table to store data, especially using a 2-left hashing hash table as the first layer to store key-value pairs in a multi-layer hash table, the compactness of the key-value pair storage structure is guaranteed. By using the Bloom filter to record the conflicting data in the corresponding hash table, the number of memory accesses is reduced and the utilization of the cache is improved. The time overhead of key-value pair search is relatively constant, because more than 95% of the data will be stored in 2-left hashing, and the key-value pair search in 2-left hashing requires an average of 1.5 memory accesses, which achieves a constant search speed. Purpose. By using a Bloom filter structure that is small enough to be stored in the cache, its lookup does not generate memory access times, so the lookup is extremely fast, and the purpose of speeding up the key-value pair lookup in software implementation is achieved.

自此，就完成了基于首层2-left hashing，并使用布鲁姆过滤器记录哈希冲突项进行键值对的存储和快速查找方法。Since then, the storage and fast search method of key-value pairs based on the first layer 2-left hashing and using Bloom filter to record hash conflict items has been completed.

虽然结合了附图描述了本发明的实施方式，但是对于本领域技术人员来说，在不脱离本发明原理的前提下，还可以做出若干改进，这些也应视为属于本发明的保护范围。Although the embodiments of the present invention are described in conjunction with the accompanying drawings, for those skilled in the art, some improvements can be made without departing from the principles of the present invention, which should also be regarded as belonging to the protection scope of the present invention. .

Claims

1. A fast key value pair searching method comprises a storage structure based on combination of a layered hash table and a bloom filter, wherein the storage structure is stored by using a 2-left hashing structure through a first layer, each layer of hash table is used as an auxiliary table to store data with storage conflict in an upper layer table, each layer of hash table is combined with the bloom filter to store the conflict data, and the bloom filter is used for judging whether a searched key has a result in a current set or not as an index to improve the searching speed, so that the storage space utilization rate is improved, the structure is compact, and the fast searching software platform is realized, and the fast key value pair searching method is characterized in that: the storage structure comprises lambda-layer hash tables, and the total storage block number contained in each layer of hash table is decreased in an equal ratio sequence from the second layer of hash table, wherein the equal ratio factor is 10;

the first-layer hash table comprises a left sub-table and a right sub-table, and each sub-table comprises the same number of storage blocks; for the rest of the layers of hash tables, the concept of sub-tables does not exist;

the first storage unit in each storage block stores the number of occupied storage units in the storage block, and the key value pairs are stored in the other storage units;

wherein a key-value pair comprises a key and a value; stored in the specific storage are the "value" of the key and the "check value" of the key; wherein the stored "value" portion may store the value itself or a pointer to a storage location of the value itself, hereinafter collectively referred to as the value; the "check value" of the key may be calculated using commonly used checking algorithms, including the crc cyclic redundancy check or md5 checking algorithm;

except the last layer of hash table, each layer of hash table corresponds to one bloom filter, and the space occupied by each layer of bloom filter decreases in an equal ratio sequence from the first layer of bloom filter, wherein the equal ratio factor is 10;

the key-value pair data storage process and the key-value pair searching process comprise the following steps:

the first part of key-value pair data storage process, namely key-value pair insertion process, which is called insertion process for short, comprises the following specific steps:

step 1, setting the number of layers of hash tables, the number of storage blocks contained in each layer of hash table, and the size of a bloom filter corresponding to each layer of hash table;

step 2, setting the number N of storage units contained in each storage block of the hash table, and setting the size of each storage unit as M bits;

step 3, inserting data into a first layer hash table, wherein the first layer hash table uses a 2-left hashing hash table, if the data does not conflict, the key value pair is successfully inserted, and the step 5 is skipped; if so, marking the data in the bloom filter corresponding to the first layer, and executing the step 4;

step 4, judging whether the current layer is the last layer, jumping to step 7 if the current layer is the last layer according to the judgment result, and if the current layer is not the last layer in the hash table structure, adding a current key to a bloom filter set corresponding to the current layer and jumping to step 5;

step 5, jumping into the next layer of hash table, performing a hash operation, determining a storage block position by using the hash operation result, and performing the following operations by judging whether the storage block and the current key have conflict or not: when the conflict of the inserted data occurs, jumping to the step 4; when the inserted data is not conflicted, storing the inserted data, namely the value of the key and the check value of the key into the first free storage unit of the storage block, and entering the step 6;

step 6, the data insertion is successful, namely the key value is successfully called for the insertion function, the function returns to 1, and the insertion process is ended;

step 7, the data insertion fails, namely the key value fails to call the insertion function, the function returns to 0, and the insertion process is ended;

to this end, from step 1 to step 6 or step 7, the insertion process is completed;

the second partial key value pair searching process, that is, the process of searching for the value in the corresponding key value pair for the key in the given key value pair, is referred to as the searching process for short, and includes the following specific steps:

step 8, searching the key in the given key value pair in the first layer bloom filter set, and determining whether to jump to step 10 or step 9 by judging whether the key is stored in the set, specifically:

8.1 if the key does not exist in the first layer bloom filter set, execute step 9;

8.2 if the key is present in the first layer bloom filter set, performing step 10;

step 9, performing hash calculation twice in 2-left shifting, and determining a storage block in the left and right sub-tables respectively according to the result; comparing the check value in each non-empty unit in the two storage blocks with the check value calculated according to the current key, checking whether the check values are equal, and deciding to jump to step 13 or step 14, wherein the specific steps are as follows:

9.1 if there is equal condition, it represents that the value stored in the unit is the value corresponding to the searched key, and jumps to step 13;

9.2 if the key is not equal, the value corresponding to the key does not exist in the storage structure, and the step 14 is skipped;

step 10, entering a next layer bloom filter, searching a given key in a next layer bloom filter set, and performing the following operations by judging whether the key is in the bloom filter set:

10.1 if the key exists in the bloom filter set, determining to jump to step 11 or step 10 by judging whether the layer where the current bloom filter is located is the last layer, specifically:

10.1A, if the layer where the current bloom filter is located is not the last layer, jumping to the step 10;

10.1B, if the layer where the current bloom filter is located is the last layer, jumping to the step 11;

10.2 if the key does not exist in the layer bloom filter set, entering a hash table of a layer corresponding to the current bloom filter, and jumping to the step 12;

step 11, entering the last layer of hash table;

step 12, searching a check value corresponding to the key in the hash table of the current layer, performing a hash operation on the key, determining a storage block in the hash table according to a hash operation result, traversing the check value of each non-empty unit in the storage block and comparing whether the check value is equal to the check value of the current key, and judging whether to jump to step 13 or step 14 according to whether the check values are equal, specifically:

12.1, if the condition of equality exists, namely an equal check value is found, entering a step 13;

12.2 if there is no equality condition, i.e. no equal check value is found, go to step 14;

step 13, the function is searched for and called successfully, the function returns to the value stored in the current storage unit, and the search is finished;

step 14, searching function call failure, returning the function to 0 and finishing searching;

to this end, from step 8 to step 13 or step 14, a fast key-value pair lookup method is completed.

2. The fast key-value pair lookup method according to claim 1, wherein: the scheme of the step 2 is to ensure that the size M x N bits of the memory block is equal to or smaller than the size of a computer cache line, so that the whole memory block is transferred into a cache when the memory block is accessed, multiple cache replacement is avoided when different units of the memory block are accessed, and the search time is shortened.

3. The fast key-value pair lookup method according to claim 1, wherein: in step 3, the inserted data is the 'value' part of the key value pair, and the 'check value' of the key is also stored; the specific process of the step 3 is as follows: performing two times of hash operation on the keys in the key value pair, and respectively determining a storage block in a 2-left shifting left sub-table and a 2-left shifting right sub-table by using two hash calculation results;

the hash operation of each layer of hash table is fixed, and the two hash operations of the first layer are also fixed;

when the following two situations occur, which indicate that the inserted data conflicts, it is necessary to jump to step 4:

3.1A, if the storage units in the two storage blocks are all occupied, namely the number of the storage units in the two storage blocks is the same, both the storage units are N-1 and equal to the maximum key value pair number which can be stored in the block, indicating that an insertion conflict occurs, and entering the step 4;

3.1B, if the check values stored in all the storage units in the two storage blocks have the same storage unit as the check value of the data to be inserted or the key value pair to be inserted, indicating that an insertion conflict occurs, and entering the step 4;

when two cases of collision of insertion data that do not satisfy 3.1A and 3.1B, i.e., when the insertion data do not collide, there are two cases as follows:

3.2A, if the occupied numbers of the units in the two storage blocks are different, storing the inserted data, namely the value of the key and the check value of the key, into the first idle storage unit of the storage block with the smaller number of occupied units, and jumping to the step 6;

3.2B, if the occupied numbers of the units in the two storage blocks are the same, storing the inserted data, namely the value corresponding to the key and the check value of the key, into the first free storage unit of the left storage block, and jumping to the step 6.

4. The fast key-value pair lookup method according to claim 1, wherein: in step 5, inserting data collision, and two situations that need to jump to step 4 are as follows:

5.1A, if the storage units in the storage block are full, namely the number of the storage units in the storage block is N-1 and is equal to the maximum key value pair number which can be stored in the block, indicating that an insertion conflict occurs;

and if the check values stored in all the storage units in the 5.1B storage block have the same storage units as the 'check values' of the data to be inserted or the key value pairs to be inserted, indicating that insertion conflict occurs.