CN108932738B

CN108932738B - Bit slice index compression method based on dictionary

Info

Publication number: CN108932738B
Application number: CN201810716805.4A
Authority: CN
Inventors: 刘晓光; 刘欣瑀; 王刚; 张瞾华
Original assignee: Nankai University
Current assignee: Nankai University
Priority date: 2018-07-03
Filing date: 2018-07-03
Publication date: 2022-08-16
Anticipated expiration: 2038-07-03
Also published as: CN108932738A

Abstract

A dictionary-based bit-slice index compression method and optimization strategy, applicable to the 0/1-bit slice index structure represented by BitFunnel. The method of the present invention includes: 1. Document rearrangement: the document is rearranged according to the density of bit 1 in the index column with the block size as an interval, so as to increase the repetition degree between blocks. 2. Partial compression: Select some query low-frequency access rows for compression. 3. Dictionary compression: Divide the index into blocks, and store a full 1-bit block and high frequency blocks in the index into the dictionary. Replace blocks that appear in the dictionary with block numbers with fewer bits; replace blocks that don't appear in the dictionary with the number of the closest block in the dictionary (which will cause misnomer results for query requests but guarantees that they will not be lost) . The invention is applicable to the scene of bit-slice index compression in the field of information retrieval. The present invention can significantly improve the index compression effect without causing a large decompression delay, which is of great significance to the optimization of the search engine system.

Description

A Dictionary-Based Bit Slice Index Compression Method

【技术领域】【Technical field】

本发明属于搜索引擎中的以BitFunnel为代表的位片索引的压缩技术领域，特别涉及对于位片索引的一种基于字典的压缩方法和部分压缩、文档重排等的优化策略。本发明同样适用于信息检索领域其他类似位图的位片索引结构压缩。The invention belongs to the technical field of compression of bit slice indexes represented by BitFunnel in search engines, and particularly relates to a dictionary-based compression method for bit slice indexes and optimization strategies for partial compression, document rearrangement and the like. The invention is also applicable to other bitmap-like index structure compression in the field of information retrieval.

【背景技术】【Background technique】

在如今科技高速发展的时代，互联网已成为人们生活中必不可少的一部分。而搜索引擎也成为最重要的互联网入口。它使用网络爬虫等工具，定期扒取互联网的网页内容，在重新组织、存储后，为用户提供检索服务。现代商用搜索引擎系统主要包括网络服务器、索引服务器、文档服务器三大部分。网络服务器用于接收用户的查询，并将查询提交给索引服务器。索引服务器在接收到查询后，对查询词涉及到的索引进行访问、缓存、求交等操作来得到包含查询词的文档编号，然后对其按查询相关度进行排序并将K个最高分文档(topK)的编号返回网络服务器。接下来网络服务器将文档编号和查询交由文档服务器生成包含查询词的摘要返回给用户。In today's era of rapid technological development, the Internet has become an indispensable part of people's lives. The search engine has also become the most important Internet entrance. It uses tools such as web crawlers to regularly scrape the web page content of the Internet, and after reorganization and storage, provides retrieval services for users. The modern commercial search engine system mainly includes three parts: network server, index server and document server. The web server is used to receive the user's query and submit the query to the index server. After receiving the query, the index server performs operations such as accessing, caching, and intersecting the indexes involved in the query word to obtain the document number containing the query word, and then sorts it according to the query relevancy and sorts the K highest scoring documents ( topK) number returned to the web server. Next, the web server sends the document number and query to the document server to generate a summary containing the query words and return it to the user.

在现有的搜索引擎系统中，索引的压缩解压求交是最为重要的部分之一，直接影响着查询效率和用户体验。对索引进行压缩不仅可以节约存储空间，减少存储设备的开销，在查询过程中由于同等大小文件包含更多的压缩信息，所以索引压缩还可增加磁盘到内存的吞吐率以及增加cache命中率，从而提升查询效率。索引压缩方案有两个主要的评价指标，一方面是压缩率，直接体现压缩效果；另一方面是解压时间，关乎查询效率。搜索引擎会根据实际需求选择合适的压缩方案平衡压缩率和解压时间。In the existing search engine system, index compression, decompression and intersection is one of the most important parts, which directly affects query efficiency and user experience. Compressing indexes can not only save storage space, but also reduce the overhead of storage devices. During the query process, since files of the same size contain more compression information, index compression can also increase the throughput rate from disk to memory and increase the cache hit rate, thereby increasing the cache hit rate. Improve query efficiency. The index compression scheme has two main evaluation indicators, one is the compression rate, which directly reflects the compression effect; the other is the decompression time, which is related to the query efficiency. The search engine will choose an appropriate compression scheme to balance the compression ratio and decompression time according to actual needs.

2017年微软提出了一种新的索引位片结构——BitFunnel。其索引是一种基于Bloom Filter的类似于位图的位片索引结构。其中每一列代表一篇文档，每一行代表一个或多个词在文档上的映射，词出现在文档中置1，否则置0。由于其应用了Bloom Filter，查询处理可能得到真实解之外的解——误称(false positive)。由于其采用按位运算替代传统倒排索引的比较判断，减少了分支预测失败的可能性，所以有着很高的查询效率，在未来应用空间巨大。具体的索引结构如图2所示。然而，类似于位图的结构导致存储开销相比于倒排索引大很多。In 2017, Microsoft proposed a new index bit slice structure - BitFunnel. Its index is a bitmap-like bitmap index structure based on Bloom Filter. Each column represents a document, and each row represents the mapping of one or more words on the document. The word appears in the document and is set to 1, otherwise it is set to 0. Because of its application of Bloom Filter, query processing may get solutions other than true solutions—false positives. Since it uses bitwise operations to replace the traditional inverted index comparison and judgment, it reduces the possibility of branch prediction failure, so it has high query efficiency and has huge application space in the future. The specific index structure is shown in Figure 2. However, the bitmap-like structure incurs a much larger storage overhead than an inverted index.

针对上述应用场景，目前存在两种解决方案——算术编码和位图压缩。现有的针对倒排索引的算术编码压缩方案更偏重于整数压缩，希望被压缩的序列单调或者数值很小，所以不适用于BitFunnel索引结构。而传统的针对位图的压缩方法主要采用run-length编码，这需要数据中有大规模连续0或连续1，而BitFunnel结构也难以保证满足这一条件。For the above application scenarios, there are currently two solutions - arithmetic coding and bitmap compression. The existing arithmetic coding compression scheme for inverted index is more focused on integer compression, hoping that the compressed sequence is monotonous or the value is small, so it is not suitable for the BitFunnel index structure. The traditional compression method for bitmaps mainly uses run-length encoding, which requires large-scale continuous 0 or continuous 1 in the data, and the BitFunnel structure is difficult to guarantee to meet this condition.

【发明内容】[Content of the invention]

为解决上述问题，区别于传统的利用run-length的压缩方法，本发明提出了一种基于字典的位片索引压缩方法，无需较多连续的0/1序列即可达到较好的压缩效果，更适合BitFunnel索引。具体的将BitFunnel索引中重复度较高的块用更少比特位的编号替代从而实现压缩的目的，字典保存重复块到编号的映射。In order to solve the above problem, different from the traditional compression method using run-length, the present invention proposes a dictionary-based bit slice index compression method, which can achieve better compression effect without more continuous 0/1 sequences, Better for BitFunnel indexing. Specifically, the blocks with a higher degree of repetition in the BitFunnel index are replaced by numbers with fewer bits to achieve the purpose of compression, and the dictionary saves the mapping of the repeated blocks to the numbers.

本发明提供的基于字典的位片索引压缩方法(和对压缩的优化策略)，参照图1，其主要方法包括：The dictionary-based bit slice index compression method (and the optimization strategy for compression) provided by the present invention, with reference to FIG. 1 , the main methods include:

步骤1(S1)，以块大小为间隔根据索引中比特1的列密度重排文档；Step 1 (S1), rearrange the document according to the column density of bit 1 in the index with the block size as an interval;

步骤2(S2)，根据查询数据集特征自定义阈值选定一部分查询访问低频行进行压缩；Step 2 (S2), select a part of the query access low-frequency rows to compress according to the custom threshold value of the query data set characteristics;

步骤3(S3)，对S2选定的压缩行进行字典压缩。In step 3 (S3), dictionary compression is performed on the compressed row selected by S2.

S1，一种以块大小为间隔根据索引列中比特1的密度重排文档策略。具体地：S1, a strategy for rearranging documents according to the density of bits 1 in the index column at block size intervals. specifically:

假设索引列数为d，块大小为k个比特；Suppose the number of index columns is d and the block size is k bits;

步骤1.1，统计索引每列中1的密度，将列按密度降序排列；Step 1.1, count the density of 1 in each column of the index, and arrange the columns in descending order of density;

步骤1.2，初始化一个空索引，将比特1的密度最高的

列放入索引第1,k+1,2k+1,…,

列；将1密度次高的

列放入索引第2,k+2,2k+2,…,

列，以此类推，最后将比特1密度最低的

列放入k,k+k,2k+k,…,

列。Step 1.2, initialize an empty index, set the highest density of bits 1

Columns are put into index 1,k+1,2k+1,…,

column; put 1 with the next highest density

Columns are put into index 2,k+2,2k+2,…,

column, and so on, and finally the lowest density of bit 1

Columns put k,k+k,2k+k,…,

List.

S2，一种根据查询数据集特征自定义阈值选定一部分查询访问低频行进行压缩的部分压缩策略。具体地：S2, a partial compression strategy that selects a part of query access low-frequency rows for compression according to a custom threshold value of the characteristics of the query data set. specifically:

步骤2.1，自定义查询过程中高频访问行访问频率的阈值α∈[0,1]。Step 2.1, define the threshold α∈[0,1] of the access frequency of high-frequency access rows in the query process.

步骤2.2，将矩阵中行按查询访问频率从高到低排序，选定最高频访问的最少q行为高频访问行，保证

其中N是所有行的总访问频率之和，f(i)表示第i行的访问频率；根据部分查询数据集和自定义阈值对行进行如下分类：高频访问行，低频访问行；Step 2.2, sort the rows in the matrix according to the query access frequency from high to low, and select the least q row with the highest frequency access to the high frequency access row to ensure that

where N is the sum of the total access frequencies of all rows, and f(i) represents the access frequency of the i-th row; according to some query data sets and custom thresholds, the rows are classified as follows: high-frequency access rows, low-frequency access rows;

步骤2.3，建立行压缩标志位文件，判断每一行是否为低频访问行，也就是需要压缩行。Step 2.3, establish a line compression flag bit file, and determine whether each line is a low-frequency access line, that is, a line that needs to be compressed.

S3，对S2选定的压缩行进行字典压缩，具体地：S3, perform dictionary compression on the compressed rows selected by S2, specifically:

步骤3.1，将重排后BitFunnel位片索引以行为单位划分成块，建立字典保存一个全1块高频出现在S2中选定的压缩行的块，记录高频块和块编号的映射关系；Step 3.1: Divide the rearranged BitFunnel bit slice index into blocks in units of rows, establish a dictionary to save a block of all 1 block of compressed rows that appear frequently in S2, and record the mapping relationship between high-frequency blocks and block numbers;

步骤3.2，遍历索引，对出现在字典中的块，用更少比特的块编号替代；对未出现在字典中的块，用字典中最近似块的编号替代，这样虽然会导致查询请求误称但保证不丢解。Step 3.2, traverse the index, replace the block that appears in the dictionary with the block number with fewer bits; for the block that does not appear in the dictionary, replace it with the number of the closest block in the dictionary, although this will cause the query request to be misnamed. But make sure not to lose it.

为方便描述，我们假设块大小k个比特，块编号大小b个比特且b<k。For convenience of description, we assume that the block size is k bits, the block number size is b bits and b<k.

在进行完步骤S1，S2，后，对选定的需要压缩行，将每行都划分为对齐的k比特定长块，以此为单位挖掘重复块。从第一行开始，逐个扫描k比特块，查找重复块并记录其出现频率，直至处理完最后一行。对发掘出的重复块按频率降序排列。将出现频率最高的2^b-1种块和一个全1块存入字典。再次遍历矩阵。After performing steps S1 and S2, for the selected rows that need to be compressed, each row is divided into aligned blocks that are longer than a specific length of k, and the repeated blocks are mined in units of this. Starting from the first row, k-bit blocks are scanned one by one, looking for duplicate blocks and recording their frequency, until the last row is processed. The discovered duplicate blocks are sorted in descending order of frequency. Store the most frequently occurring 2 ^b -1 blocks and an all-1 block into the dictionary. Traverse the matrix again.

A.如果某一块在字典中存在，用b比特的块编号代替。A. If a certain block exists in the dictionary, replace it with a b-bit block number.

B.如果块x在字典中不存在，在字典中寻找包含它(x AND y＝x)且1的个数最少的最近似块y(由于字典中保存了全1块，所以总能找到这样的y)，用y的编号代替x。B. If the block x does not exist in the dictionary, look for the nearest block y that contains it (x AND y=x) and has the least number of 1s in the dictionary (since the dictionary holds all 1 blocks, it is always possible to find such y), replace x with the number of y.

基于字典的压缩方法利用了矩阵稀疏且1出现不均匀的特性。由于整体1的密度较低，高密度1的块出现较少，导致重复块比例升高，近似块差距不大，这些特征有利于本发明基于字典的压缩不会大幅度增大误称率。Dictionary-based compression methods take advantage of the sparseness of matrices and the non-uniformity of 1's. Since the density of the overall 1 is low, the blocks of high density 1 appear less frequently, resulting in an increase in the proportion of repeated blocks and a small gap between approximate blocks. These features are beneficial to the dictionary-based compression of the present invention without greatly increasing the misnomer rate.

本发明的优点和有益效果：Advantages and beneficial effects of the present invention:

通过文档重排增加了BitFunnel索引的块重复度，通过部分压缩将压缩导致的误称率控制在10％左右，最后利用基于字典的压缩对位片索引进行压缩，提高BitFunnel索引的压缩效果。本发明能够广泛适用于搜索引擎性能优化和索引压缩领域。The block repetition of BitFunnel index is increased by document rearrangement, and the misnomer rate caused by compression is controlled to about 10% by partial compression. Finally, dictionary-based compression is used to compress the bit slice index to improve the compression effect of BitFunnel index. The invention can be widely used in the fields of search engine performance optimization and index compression.

【附图说明】【Description of drawings】

图1是本发明的基于字典的位片索引压缩方法流程图；1 is a flowchart of a dictionary-based bit slice index compression method of the present invention;

图2是BitFunnel索引结构示意图；Figure 2 is a schematic diagram of the BitFunnel index structure;

图3是以块大小为间隔根据索引列中比特1的密度重排文档过程及结果示例图；FIG. 3 is an example diagram of the process and result of rearranging documents according to the density of bit 1 in the index column at intervals of block size;

图4是字典压缩过程及结果示例图；Fig. 4 is a dictionary compression process and an example diagram of the result;

【具体实施方式】【Detailed ways】

为便于理解本发明的上述目的、特征和优点，下面结合附图和具体实施方式对本发明作进一步的详细说明。In order to facilitate the understanding of the above objects, features and advantages of the present invention, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.

实施例1文档重排Example 1 Document rearrangement

由于BitFunnel的映射矩阵由哈希函数(Bloom filter)和文档内容决定，我们可以通过调节词到行的映射和矩阵行数来改变矩阵中1的密度。虽然0/1矩阵重复度很高，然而由于文档分布的集聚性等特征导致矩阵中1的分布很不均匀(BitFunnel只保证全局1的密度)，因此我们提出了一种基于密度的文档重排方案，增大矩阵中块之间的相似性，从而提升压缩效果。显然，块重复度越高，对于字典压缩越有利。具体表现在两方面：每一块在字典中出现的可能性变大；不在字典中块压缩多置1的比例变小，从而大大降低了字典压缩导致的误称率。Since BitFunnel's mapping matrix is determined by the hash function (Bloom filter) and the document content, we can change the density of 1's in the matrix by adjusting the word-to-row mapping and the number of matrix rows. Although the 0/1 matrix has a high degree of repetition, the distribution of 1 in the matrix is very uneven due to the agglomeration of document distribution (BitFunnel only guarantees the density of global 1), so we propose a density-based document rearrangement scheme to increase the similarity between blocks in the matrix, thereby improving the compression effect. Obviously, the higher the block repetition, the better for dictionary compression. It is manifested in two aspects: the possibility of each block appearing in the dictionary becomes larger; the ratio of setting 1 for block compression that is not in the dictionary becomes smaller, thus greatly reducing the misnomer rate caused by dictionary compression.

为方便描述，我们假设映射矩阵共d列，每块含有k个比特。图3以d＝16，k＝4为例描述了文档重排的过程，首先统计每列1的密度并按从大至小的顺序排列，为了方便展示，这里每d/k个比特一组用不同背景标记。首先将密度最高的O、J、M和G列重排至第1，5，9，13列(斜线阴影)，然后依此类推，最后将密度最低的K、L、P和F文档重排至第4，8，12，16列(无阴影)。这样按列密度进行重排后，我们发现每k列密度均从第1列到第k列递减，块之间重复度增大，更利于字典压缩。For the convenience of description, we assume that the mapping matrix has a total of d columns, and each block contains k bits. Figure 3 takes d=16, k=4 as an example to describe the process of document rearrangement. First, the density of 1 in each column is counted and arranged in descending order. For convenience of presentation, here each d/k bits are grouped as a group. Mark with different backgrounds. Rearrange the highest density columns O, J, M, and G first to columns 1, 5, 9, 13 (shaded with diagonal lines), and so on, and finally rearrange the lowest density K, L, P, and F documents Columns 4, 8, 12, 16 (unshaded). After rearranging by column density in this way, we find that the density of each k column decreases from the first column to the kth column, and the repetition degree between blocks increases, which is more conducive to dictionary compression.

实施例2部分压缩Example 2 Partial Compression

通过对实际数据集进行分析我们发现，在查询过程中，只有少部分行被高频访问，大部分行访问频率较低。而对于高频访问行进行压缩会导致较高的误称率，而对于低频访问行的压缩对误称率影响较小。基于这个发现，我们考虑一种部分压缩的方案，即压缩低频访问行，而对于误称率影响较大的高频访问行不进行压缩。By analyzing the actual data set, we found that during the query process, only a few rows are accessed frequently, and most rows are accessed less frequently. Compression of high-frequency access lines will lead to a higher missymmetry rate, while compression of low-frequency access lines has little effect on the missymmetry rate. Based on this finding, we consider a partial compression scheme, that is, compress the low-frequency access lines, and leave no compression for the high-frequency access lines that have a greater impact on the misnomer rate.

将矩阵中行按频率从高到低排序，选定最高频访问的最小q行为高频访问行，保证

其中N是所有行的总访问频率之和，f(i)表示第i行的访问频率；根据部分查询数据集和自定义阈值对行进行如下分类：高频访问行，低频访问行。仅对低频访问行进行压缩。Sort the rows in the matrix by frequency from high to low, and select the row with the smallest q accessed with the highest frequency to access the row with high frequency to ensure that

where N is the sum of the total access frequencies of all rows, and f(i) represents the access frequency of the i-th row; the rows are classified as follows according to the partial query dataset and custom threshold: high-frequency access rows, low-frequency access rows. Only low-frequency access lines are compressed.

为方便描述，我们假设矩阵共100行，每一行的访问频率和行号相等，即f(i)＝i。首先我们将矩阵中行按频率从高到低排序，每行的访问频率依次递减，具体为100,99,98,…,1。其中N＝(100+99+98+…+1)＝5050。如果我们自定义阈值为α＝0.9，那么我们选定最高频访问的最少69行(行号100至32)保证

为高频访问行。对剩下的31行(行号31至1)进行压缩。由于实际数据集的特征，在真实数据的查询过程中，只有少部分行被高频访问，大部分行访问频率较低，所以部分压缩可以在保证误称率的条件下得到较好的压缩效果。For the convenience of description, we assume that the matrix has 100 rows in total, and the access frequency and row number of each row are equal, that is, f(i)=i. First, we sort the rows in the matrix from high to low frequency, and the access frequency of each row decreases in turn, specifically 100, 99, 98, ..., 1. where N=(100+99+98+...+1)=5050. If we customize the threshold to α=0.9, then we select the most frequently accessed minimum 69 rows (row numbers 100 to 32) guaranteed

row for high frequency access. The remaining 31 lines (line numbers 31 to 1) are compressed. Due to the characteristics of the actual data set, in the query process of real data, only a few rows are accessed frequently, and most rows are accessed less frequently, so partial compression can achieve better compression effect under the condition of ensuring the misproportion rate. .

对于行的选取，我们将整个查询数据集随机分成相等的两份，一份作为训练集，由训练集得到的行访问频率f(i)和阈值α决定每一行是否被压缩。另一半查询集作为测试集，测试查询时间、误称率等性能。可见，我们不是简单地选取固定的参数进行压缩，而是利用同一来源的查询请求具有相似性这一特点，从查询历史中学习出适合的参数，以期由此得到的压缩结果在未来的查询请求上获得更好的性能。具体自定义阈值α的取值和索引的原始误称率和块重复度有关。For row selection, we randomly divide the entire query data set into two equal parts, and one is used as the training set. The row access frequency f(i) obtained from the training set and the threshold α determine whether each row is compressed. The other half of the query set is used as a test set to test the performance of query time, misnomer rate and so on. It can be seen that we do not simply select fixed parameters for compression, but take advantage of the similarity of query requests from the same source to learn suitable parameters from the query history, in order to obtain the compression results in future query requests. for better performance. The value of the specific custom threshold α is related to the original misnomer rate of the index and the block repetition.

实施例3字典压缩Example 3 Dictionary compression

图4以k＝4，b＝2为例描述了字典压缩的过程及压缩后的结果。首先将原始字典中每块的出现频率记录下来，选取频率最高的2²-1＝3个块和一个全1块生成字典(图4左下)；图4的第四行第一块1100，因为其在字典中出现，所以利用编号10替代；第一行的第二块0010(虚线框内)，由于其在字典中未出现，且第三个比特位置置1。字典中块1111、1110、1010第三个比特位均为1，均包含块0010满足相似条件，我们选择其中1个数最少的1010。这一策略保证查询请求不会丢解，且可能增加的误称尽量少。图4中阴影部分表示块不在字典中，用近似块代替的情况。压缩后的数据分为两部分：字典、压缩的索引。FIG. 4 takes k=4 and b=2 as an example to describe the process of dictionary compression and the result after compression. First record the frequency of occurrence of each block in the original dictionary, select the most frequent 2 ² -1 = 3 blocks and an all-one block to generate a dictionary (lower left in Figure 4); the first block in the fourth row of Figure 4 is 1100, because It appears in the dictionary, so it is replaced with the number 10; the second block 0010 in the first row (in the dashed box), since it does not appear in the dictionary, and the third bit position is set to 1. The third bits of blocks 1111, 1110, and 1010 in the dictionary are all 1, and all contain blocks 0010 that meet similar conditions. We choose 1010 with the least number of 1. This strategy ensures that the query request will not be lost, and the possible increase of misnomers is as small as possible. The shaded part in Figure 4 represents the case where the block is not in the dictionary and is replaced by an approximate block. The compressed data is divided into two parts: dictionary and compressed index.

假设矩阵m行d列，采用上述字典压缩方式的压缩率为：Assuming that the matrix has m rows and d columns, the compression ratio of the above dictionary compression method is:

这里字典大小为k*2^b，压缩后映射矩阵大小为mdb/k，实际中字典较小可以忽略，所以压缩率近似等于

The size of the dictionary here is k*2 ^b , and the size of the mapping matrix after compression is mdb/k. In practice, the dictionary is small and can be ignored, so the compression rate is approximately equal to

在求交过程中，我们只需简单地将b个比特的块编号作为索引从字典中提取出原始块定义进行与运算即可。由于字典规模一般较小且在求交(解压)过程中频繁访问，大部分字典数据会驻留在CPU的cache中，减少访存开销，提高求交性能。In the intersection process, we simply use the b-bit block number as an index to extract the original block definition from the dictionary and perform an AND operation. Since the dictionary size is generally small and frequently accessed during the intersection (decompression) process, most of the dictionary data will reside in the CPU cache, reducing memory access overhead and improving intersection performance.

实施例4实验结果Example 4 Experimental Results

我们在TREC GOV2数据集上测试本发明基于字典的位片索引压缩方法的效果。其中Pri，Opt为BitFunnel的两种策略PrivateSharedRank0和Optimal下生成的索引。实验中的块大小设置为32比特，块编号大小设置为16比特。We test the effect of the dictionary-based bit-slice index compression method of the present invention on the TREC GOV2 dataset. Among them, Pri and Opt are indexes generated by BitFunnel's two strategies, PrivateSharedRank0 and Optimal. The block size in the experiment is set to 32 bits, and the block number size is set to 16 bits.

对所用倒排索引数据集做如下说明：A description of the inverted index dataset used is as follows:

(1)我们使用TREC GOV2为2004年从.gov域名抓取下来的数据集，共包含2500多万个网页作为生成索引的文档集合，其中所有文档内容均已剔除HTML标签，提取标题和正文部分后建立索引；(1) We use TREC GOV2 as the data set crawled from the .gov domain name in 2004, which contains more than 25 million web pages as the indexed document collection, in which all document content has been stripped of HTML tags, and the title and body parts have been extracted. After indexing;

(2)我们使用MillionSet作为查询集合1，包含2007,2008,and 2009 TRECMillion Query Track共6万条查询；(2) We use MillionSet as query set 1, including 2007, 2008, and 2009 TRECMillion Query Track, a total of 60,000 queries;

(3)我们使用TerabyteSet作为查询集合2，包含2006 TREC Terabyte Track共10万条查询。(3) We use TerabyteSet as query set 2, which contains a total of 100,000 queries of 2006 TREC Terabyte Track.

表1Table 1

表1描述了基于字典的位片索引压缩方法的压缩率。根据表1我们可知，Pri方式下MillionSet、TerabyteSet两种查询集对应的总体压缩率分别为0.73、0.72；Opt方式下MillionSet，TerabyteSet两种查询集对应的总体压缩率分别为0.71、0.70。MillionSet稍好于TerabyteSet，但区别很小只有0.02-0.01，说明查询集对压缩率影响不大，我们的部分压缩方式对不同的查询数据集具有通用性。Table 1 describes the compression ratio of the dictionary-based bit-slice index compression method. According to Table 1, we can see that the overall compression ratios corresponding to the two query sets of MillionSet and TerabyteSet in the Pri mode are 0.73 and 0.72, respectively; the overall compression ratios corresponding to the two query sets of MillionSet and TerabyteSet in the Opt mode are 0.71 and 0.70, respectively. MillionSet is slightly better than TerabyteSet, but the difference is only 0.02-0.01, indicating that the query set has little effect on the compression rate, and our partial compression method is universal to different query data sets.

我们还尝试将每32个比特看作一个无符号整数(unsigned int类型)，利用算术编码和位图压缩方法进行压缩。我们发现，利用Pfor、VByte、EWAH进行压缩，文件大小分别只减少了-1％、11.3％、4.35％。这是由于BitFunnel矩阵中1的分布无规律且不均匀，因此整数序列中的数值都很大，不利于做d-gap等数值转换，从而导致算术编码压缩效果较差。而针对位图的压缩方案由于其利用run-length需要较多的连续0，连续1，BitFunnel索引构建完全依赖于文档特征所以也无法满足条件。综上，传统的压缩算法对BitFunnel索引压缩效果较差，而本发明采用的基于字典的压缩方案可以有效地提升压缩率。We also try to treat every 32 bits as an unsigned integer (unsigned int type), and compress using arithmetic coding and bitmap compression methods. We found that using Pfor, VByte, and EWAH for compression only reduced file size by -1%, 11.3%, and 4.35%, respectively. This is because the distribution of 1 in the BitFunnel matrix is irregular and uneven, so the values in the integer sequence are very large, which is not conducive to numerical conversion such as d-gap, resulting in poor compression of arithmetic coding. The compression scheme for bitmaps cannot meet the conditions because it requires more consecutive 0s and consecutive 1s using run-length, and the BitFunnel index construction completely depends on the document characteristics. To sum up, the traditional compression algorithm has poor compression effect on the BitFunnel index, while the dictionary-based compression scheme adopted in the present invention can effectively improve the compression rate.

表2Table 2

表2给出了两种策略下每条查询平均求交时间和误称率。由表2我们可以看出利用Pri策略生成矩阵在MillionSet和TerabyteSet数据集上解压使得求交时间分别增加了21％、16％。利用Opt策略生成矩阵在MillionSet和TerabyteSet数据集上解压使得求交时间分别增加了48％、37％。这部分求交时间的增加一方面是由于采用部分压缩使得一部分高频访问行不进行压缩，求交时需判断每一行是否被压缩，另一方面是由于压缩导致的访问字典的访存延迟。同时我们也可发现，字典压缩使得误称率增加了约7到10个百分点。Table 2 shows the average intersection time and misnomer rate of each query under the two strategies. From Table 2, we can see that using the Pri strategy to generate the matrix to decompress on the MillionSet and TerabyteSet datasets increases the intersection time by 21% and 16%, respectively. Using the Opt strategy to generate matrices for decompression on the MillionSet and TerabyteSet datasets increases the intersection time by 48% and 37%, respectively. This part of the increase in intersection time is due to the use of partial compression so that some high-frequency access rows are not compressed, and it is necessary to determine whether each row is compressed when intersecting. At the same time, we can also find that dictionary compression increases the miscall rate by about 7 to 10 percentage points.

在求交方面，由于索引需要解压，压缩的索引在查询上需多花费16％-48％的时间，然而由于BitFunnel本身特性，只需简单按位运算即可得到相关文档，所以本身的求交时间很少，占整个查询时间的比例很小(完整查询包括求交、计算topK文档、生成摘要等过程)，所以5毫秒左右的求交时间对用户影响不大。另外压缩将会引入额外的误称率，这部分误称率我们利用阈值α控制在一定的范围7到10个百分点内。由于搜索引擎会对求交得到的文档进行相关性排序，选择分数最高的前K篇返回，压缩导致的误称文档由于分数较低可以很快被筛选掉。由此，我们预测在实际搜索引擎系统中，这部分误称文档不会造成较大影响。In terms of intersection, since the index needs to be decompressed, the compressed index takes 16%-48% more time to query. However, due to the characteristics of BitFunnel itself, only simple bitwise operations can be used to obtain relevant documents, so the intersection of its own The time is very small, accounting for a small proportion of the entire query time (the complete query includes the process of seeking intersection, calculating topK documents, generating summaries, etc.), so the intersection time of about 5 milliseconds has little effect on users. In addition, the compression will introduce an additional misnomer rate, which we use the threshold α to control within a certain range of 7 to 10 percentage points. Since the search engine will sort the documents obtained from the intersection by relevance and select the top K documents with the highest scores to return, the misrepresented documents caused by compression can be quickly filtered out due to their low scores. Therefore, we predict that in the actual search engine system, this part of the misnamed documents will not have a great impact.

综合压缩率、求交时间和误称率三方面考虑，本发明提出的压缩方案对于压缩BitFunnel位片索引有较好效果。Considering the three aspects of compression rate, intersection time and misnomer rate, the compression scheme proposed by the present invention has a better effect on compressing BitFunnel bit slice index.

以上对本发明的基于字典的压缩方法进行了详细介绍，本文中应用了具体个例对本发明的原理及实施方式进行阐述，以上实施例的说明只是用于帮助理解本发明的方法及其核心思想；同时，对于本领域的一般技术人员，依据本发明的思想，在具体实施方式及应用范围上均会有改变之处，综上所述，本说明书内容不应理解为对本发明的限制。The dictionary-based compression method of the present invention has been described in detail above, and specific examples are used in this paper to illustrate the principles and implementations of the present invention. The descriptions of the above embodiments are only used to help understand the method of the present invention and its core idea; Meanwhile, for those of ordinary skill in the art, according to the idea of the present invention, there will be changes in the specific embodiments and application scope. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims

1. A dictionary-based bit slice index compression method, wherein the method comprises:

Step 1. Rearrange the documents according to the column density of bit 1 in the index column with the block size as an interval;

Suppose the number of index columns is d and the block size is k bits;

Step 1.1, count the density of bit 1 in each column of the index, and arrange the columns in descending order of density;

Step 1.2, initialize an empty index, set the highest density of bits 1

column into index

column; put 1 with the next highest density

column into index

column, and so on, and finally the lowest density of bit 1

column put

List;

Step 2, select a part of the query access low-frequency rows for compression according to the custom threshold value of the query data set characteristics;

Step 2.1, customize the threshold for the frequency of high-frequency access row access during the query process;

Step 2.2: According to some query data sets and custom thresholds, the rows are classified as follows: high-frequency access rows, low-frequency access rows;

Step 2.3, establish a line compression flag file, and determine whether each line is a low-frequency access line, that is, the line needs to be compressed;

Step 3, perform dictionary compression on the compressed row selected in step 2;

Step 3.1: Divide the rearranged BitFunnel bit slice index into blocks in units of rows, build a dictionary to save a block of all 1s and a block that frequently appears in the compressed line selected in step 2, and record the mapping of high-frequency blocks and block numbers relation;

Step 3.2, traverse the index, replace the block that appears in the dictionary with the block number with fewer bits; for the block that does not appear in the dictionary, replace it with the number of the closest block in the dictionary, although this will lead to errors in the query request. Claimed but guaranteed not to be lost.

2. The method according to claim 1, wherein the method application field comprises:

Bitmap, Bloom Filter, and BitFunnel-based bit slice index structure based on Bloom Filter.

3. The method according to claim 1, wherein the threshold value of the high-frequency access row access frequency in the query process is defined in step 2.1:

The threshold setting of the row access frequency of high-frequency and low-frequency rows in the query process is customized according to the repetition degree of the BitFunnel index block and the original miscall rate of the BitFunnel index.

4. method according to claim 1, is characterized in that, recovering the corresponding decompression method of original data after compression, specifically:

During decompression, the original block definition is simply extracted from the dictionary using the block number as an index to restore the original data.