
CN107609105B - Construction method of big data acceleration structure


Info

Publication number
CN107609105B
Authority
CN
China
Prior art keywords
data
attribute
records
index
transaction
Prior art date
Legal status
Active
Application number
CN201710817537.0A
Other languages
Chinese (zh)
Other versions
CN107609105A (en)
Inventor
段贵多
罗光春
田玲
秦科
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN201710817537.0A
Publication of CN107609105A
Application granted
Publication of CN107609105B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method for constructing a big data acceleration structure, comprising the following steps: A. preprocessing the data to form a data set suited to the subsequent operations; B. clustering, computing the similarity between records within each category, and, according to the grouping result of the clustering algorithm, reordering each group so that its most similar records are at the smallest spatial distance; C. establishing a three-level mapping among transaction attributes, transaction attribute weights, and transaction records, and repeating the process until all data are mapped; D. initializing the compressed index structure, the transaction attribute weight index, and the transaction attributes, determining the range of attribute weights shared by consecutive records, traversing the inverted index mapping structure, and compressing consecutive records under a shared transaction attribute weight with a run-length compression algorithm. The method can quickly build an acceleration structure for big data association analysis and markedly accelerates model processing and data loading.

Description

Construction Method of a Big Data Acceleration Structure

Technical Field

The invention relates to a method for accelerating data processing, and in particular to a method for constructing a big data acceleration structure.

Background

Big data technology has become the most effective and the most common approach to processing massive data. Police big data, one of the most representative big data processing scenarios, is attracting more and more attention. When analyzing massive data, the processing speed and performance of general-purpose big data platforms are among the most pressing problems. In big data analysis, and in police big data in particular, the most common analysis method is association analysis: jointly analyzing the factors related to the object of analysis can substantially improve the accuracy of police analysis. The commonly used model is the association analysis algorithm model, and association analysis algorithms fall into two categories: traditional association analysis algorithms and association analysis algorithms based on evolutionary theory. Traditional algorithms discover hidden relationships between attributes and interactions between itemsets through the frequent itemsets of the data records, and their difficulty lies in generating the candidate sets; evolutionary association analysis algorithms generate rules with an evolutionary algorithm and have no frequent-itemset generation step. Current big data processing systems use different algorithm models for different association analysis tasks, and the main way to speed such models up is to parallelize the algorithm for a particular business platform. This works only for a single algorithm on a designated platform, so it is quite limited and its effect is unsatisfactory. The most commonly used parallelization methods today target the Hadoop and Spark platforms: the overall data set is partitioned under the platform's parallelization framework so that the partitions are spread evenly across the cluster, each node processes its local partition with the association rule algorithm model, and a collection function gathers the results from the nodes, completing the parallelization of the algorithm under that framework. The defect of this approach is obvious: improvements made for one platform do not perform well on others, and, especially on complex police platforms with their diverse association analysis models, some algorithms are poorly suited to this kind of parallelization, which makes the approach even more limited.

Summary of the Invention

The invention provides a method for constructing a big data acceleration structure, so as to improve the speed and accuracy of data processing in association analysis of big data.

The method for constructing a big data acceleration structure of the present invention includes:

A. Data preprocessing: performing data cleaning, data integration, and data conversion on the raw data to form a data set suited to the subsequent operations;

B. Clustering: clustering the preprocessed data; after clustering, computing the similarity between records within each category with the Hamming distance and reordering the records so that, within each group produced by the clustering algorithm, the most similar records are at the smallest spatial distance;

C. Inverted index mapping structure: initializing the index file, the attribute weight index entries, and the attribute index entries; extracting the attributes and the attribute weight list of the sorted data; then constructing the inverted index structure, establishing a three-level mapping among transaction attributes, transaction attribute weights, and transaction records, and repeating this process until all data are mapped. Here a "transaction" is an atomic operation in data processing, similar in nature to a database transaction;

An index is normally built from the record numbers or row numbers of the data; its access order is index -> record number -> attribute -> weight, and this is called a forward index. An inverted index is instead built from the attributes and weights of the records, with the record numbers stored under the weights; its access order is index -> attribute -> weight -> record number. An index structure whose access order is the reverse of the forward index's is called an inverted index.
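As an illustration of this three-level mapping, a minimal sketch in Python follows, holding attribute -> weight -> record numbers in nested dictionaries (the record layout and field names are assumptions; the patent does not specify a storage format):

```python
from collections import defaultdict

def build_inverted_index(records):
    """Build a three-level inverted index: attribute -> weight -> record numbers.

    `records` is assumed to be a list of dicts mapping an attribute name to its
    attribute weight (a hypothetical representation for illustration only).
    """
    index = defaultdict(lambda: defaultdict(list))
    for record_id, record in enumerate(records):
        for attribute, weight in record.items():
            index[attribute][weight].append(record_id)
    return index

# Access order is index -> attribute -> weight -> record number:
records = [{"age": 0.3, "city": 0.5}, {"age": 0.3}, {"city": 0.2}]
index = build_inverted_index(records)
print(index["age"][0.3])  # -> [0, 1]
```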

D. Run-length compression: initializing the compressed index structure, the transaction attribute weight index, and the transaction attributes; determining the range of attribute weights shared by consecutive records; traversing the inverted index mapping structure; and compressing the consecutive records under each shared transaction attribute weight with a run-length compression algorithm. The run-length compression algorithm (RLE) replaces a run of adjacent equal values in a data record row with two values, the run length and the repeated element; for example, aaabccccccddeee becomes 3a1b6c2d3e. RLE compresses such data very effectively, and many concrete run-length compression schemes derive from the RLE principle.
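A minimal run-length encoder reproducing the example above (a generic RLE sketch, not the patent's exact on-disk encoding):

```python
def rle_encode(s):
    """Run-length encode a string: 'aaabccccccddeee' -> '3a1b6c2d3e'."""
    if not s:
        return ""
    out, run_char, run_len = [], s[0], 1
    for ch in s[1:]:
        if ch == run_char:
            run_len += 1
        else:
            out.append(f"{run_len}{run_char}")
            run_char, run_len = ch, 1
    out.append(f"{run_len}{run_char}")  # flush the final run
    return "".join(out)

assert rle_encode("aaabccccccddeee") == "3a1b6c2d3e"
```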

Further, the data cleaning in step A includes deleting raw data, filling in missing values, and smoothing/filtering noisy data.

Specifically, whether missing data are deleted or their missing values filled in is judged by the attribute weight coverage p of the raw data, where p is:

p = Σ_{i=1..k} ε_i·α_i

where A_i denotes the i-th data attribute, ε_i is its attribute weight, indicating the importance of the attribute, with ε_1 + ε_2 + ... + ε_k = 1, and α_i ∈ {0, 1} indicates whether the attribute value is present. A threshold ω is set with value range (0, 1): if p ≥ ω, the missing values of the record are filled in; otherwise the raw record is deleted;
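A small sketch of this deletion/imputation rule, assuming α_i = 1 when the i-th attribute value is present (consistent with p being a coverage rate); the attribute names and threshold are illustrative:

```python
def keep_and_impute(record, weights, omega=0.5):
    """Decide whether a record with missing values is imputed or dropped.

    `record` maps attribute name to value (None = missing); `weights` maps
    attribute name to its importance weight, with the weights summing to 1.
    Returns True if the record should be kept and imputed (p >= omega).
    """
    p = sum(w for attr, w in weights.items() if record.get(attr) is not None)
    return p >= omega

weights = {"age": 0.5, "city": 0.3, "job": 0.2}
print(keep_and_impute({"age": 30, "city": None, "job": "x"}, weights))   # p=0.7 -> True
print(keep_and_impute({"age": None, "city": None, "job": "x"}, weights)) # p=0.2 -> False
```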

Noisy data are smoothed/filtered with mean filtering: for the current record to be processed, a template consisting of several neighboring records is selected, and the template mean replaces the original record values, expressed as:

g = (1/M) · Σ_{f ∈ s} f

where k is the number of attributes, g(·) denotes the mean, M is the number of records adjacent to the current record, s is the neighbor set of the current record, and f is a record in s.
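Under these definitions the filter averages attribute-wise over the M neighboring records; a minimal sketch with illustrative data:

```python
def mean_filter(neighbors):
    """Replace a record by the attribute-wise mean of its M neighbors.

    `neighbors` is a list of M records, each a list of k numeric attribute
    values; returns the smoothed record g.
    """
    M = len(neighbors)
    k = len(neighbors[0])
    return [sum(f[j] for f in neighbors) / M for j in range(k)]

print(mean_filter([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]))  # [3.0, 4.0]
```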

Specifically, missing values are filled in as follows: the records in the known complete raw data are measured by Euclidean distance, the complete record most similar to the record with missing values is retrieved, and the missing values are supplemented according to the weights of the relevant fields of that complete record.
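A sketch of this nearest-complete-record imputation, assuming numeric attributes; the distance is computed over the non-missing fields only, and the field-weighting step is left out for brevity:

```python
import math

def fill_missing(record, complete_records):
    """Fill missing fields (None) from the most similar complete record,
    using the Euclidean distance over the record's non-missing fields."""
    known = [j for j, v in enumerate(record) if v is not None]
    nearest = min(
        complete_records,
        key=lambda c: math.sqrt(sum((record[j] - c[j]) ** 2 for j in known)),
    )
    return [v if v is not None else nearest[j] for j, v in enumerate(record)]

complete = [[1.0, 2.0, 3.0], [9.0, 8.0, 7.0]]
print(fill_missing([1.1, None, 2.9], complete))  # -> [1.1, 2.0, 2.9]
```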

Further, in the data integration of step A, the data in multiple data sets are combined into a unified database, and redundant attributes are filtered out through cross-matching of the data sets and multi-data-set fusion.

Further, the data conversion in step A includes performing attribute construction on the integrated data set, reducing the attribute dimensionality of the data set, and performing an integrity check on, and supplementing, the reduced data set. Reducing the attribute dimensionality lowers memory and disk usage; if memory is insufficient, or overflows during computation, PCA (Principal Component Analysis) can be used to obtain low-dimensional attribute data. Dimensionality reduction also speeds up machine learning. In many cases the data attributes need to be inspected, but high-dimensional (multiple) attributes cannot be observed directly; the attributes can then be reduced to 2D or 3D, that is, to 2 or 3 attribute dimensions, so that the data can be inspected visually.

Specifically, reducing the attribute dimensionality of the data set includes:

transforming the data set into matrix form and normalizing the attributes, where normalization gives all attributes a similar scale, so that large differences in attribute scale do not degrade the dimensionality reduction, followed by a zero-mean normalization operation;

computing the covariance matrix of the data set attributes, then computing the eigenvalues and eigenvectors of the covariance matrix with a singular value decomposition algorithm;

obtaining the dimensionality reduction matrix through the dimensionality reduction computation, and mapping the data set to a low-dimensional space through this matrix.
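These three steps follow the standard PCA-via-SVD recipe; a compact NumPy sketch (an illustration under that reading, not the patent's exact procedure):

```python
import numpy as np

def pca_reduce(X, n_components=2):
    """Project an (n_records, k_attributes) matrix onto its top principal axes."""
    # Zero-mean normalization (optionally also divide by the std for unit scale).
    Xc = X - X.mean(axis=0)
    # SVD of the centered data yields the eigenvectors of the covariance matrix.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:n_components].T   # dimensionality reduction matrix, shape (k, n_components)
    return Xc @ W             # data mapped to the low-dimensional space

X = np.random.rand(100, 10)
print(pca_reduce(X, 2).shape)  # (100, 2)
```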

Specifically, the clustering in step B includes:

B1. Set the value range [K1, K2] of the number of clusters K, where K1 and K2 bound the number of classes, and assign K the lower limit K1 of the range;

B2. Randomly select K cluster centers;

B3. Assign each record to its nearest cluster center by distance computation;

B4. Recompute the cluster centers;

B5. Split each obtained class into two subclasses by bisection, and compare the BIC scores of each pair of parent class and subclasses. BIC is the Bayesian Information Criterion, used here to determine the number of clusters automatically: assume a data set D and a set of candidate models M_j, where different models M_j correspond to different values of K. The BIC is computed as:

BIC(M_j) = l_j(D) − (p_j / 2) · log R

where R is the total number of data points in the data set D, l_j(D) is the log-likelihood of the data under the j-th model, evaluated at its maximum, and p_j is the number of parameters in the model M_j. The principle is to use the posterior decision probability to judge the quality of the clustering result.

B6. Find the pair of parent class and subclasses with the largest difference in BIC scores;

B7. Keep the subclasses from step B6 and leave the remaining parent classes unchanged, so that the number of clusters becomes K = K + 1;

B8. Return to step B2 and compute the classes for the new value of K;

B9. If K reaches the upper limit K2, or two consecutive iterations give the same result, no cluster center needs the split operation (step B5), and the loop ends;

B10. Compute the Hamming distances between records for all cluster categories simultaneously and reorder the data within each group (a sketch of steps B1-B10 follows below).
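The following sketch illustrates steps B1-B10 in the X-means style described above, using scikit-learn's KMeans for steps B2-B4; the spherical-Gaussian likelihood, the minimum split size, and the greedy B10 reordering are simplifying assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def bic_score(X, labels, centers):
    """BIC(M_j) = l_j(D) - (p_j / 2) * log R, with one spherical Gaussian
    per cluster sharing a pooled variance (a common X-means-style choice)."""
    R, d = X.shape
    K = len(centers)
    sq = sum(float(np.sum((X[labels == i] - centers[i]) ** 2)) for i in range(K))
    var = max(sq / max(R - K, 1), 1e-12)           # pooled ML variance estimate
    ll = 0.0
    for i in range(K):
        pts = X[labels == i]
        Rn = len(pts)
        if Rn == 0:
            continue
        resid = float(np.sum((pts - centers[i]) ** 2))
        ll += (Rn * np.log(Rn / R)                  # mixing-weight term
               - Rn * d / 2.0 * np.log(2 * np.pi * var)
               - resid / (2.0 * var))               # Gaussian residual term
    p = K * (d + 1)                                 # centers plus variance/mixing
    return ll - p / 2.0 * np.log(R)

def xmeans_like(X, k1, k2):
    """Steps B1-B9: grow K from k1, splitting where a 2-way split gains the
    most BIC, until K reaches k2 or no split improves the score."""
    K = k1
    while K < k2:
        km = KMeans(n_clusters=K, n_init=10).fit(X)
        best_gain, split = 0.0, None
        for i in range(K):
            pts = X[km.labels_ == i]
            if len(pts) < 4:                        # too small to split
                continue
            parent = bic_score(pts, np.zeros(len(pts), dtype=int),
                               pts.mean(axis=0, keepdims=True))
            sub = KMeans(n_clusters=2, n_init=10).fit(pts)
            gain = bic_score(pts, sub.labels_, sub.cluster_centers_) - parent
            if gain > best_gain:
                best_gain, split = gain, i
        if split is None:                           # B9: no split needed, stop
            break
        K += 1                                      # B7/B8: accept one more cluster
    return KMeans(n_clusters=K, n_init=10).fit(X)

def hamming_reorder(group):
    """Step B10: greedily chain records so neighbors have small Hamming distance."""
    rest = list(map(tuple, group))
    out = [rest.pop(0)]
    while rest:
        nxt = min(rest, key=lambda r: sum(a != b for a, b in zip(out[-1], r)))
        rest.remove(nxt)
        out.append(nxt)
    return out
```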

The construction method of the big data acceleration structure of the present invention can quickly build an acceleration structure for big data association analysis, and it overcomes the limitations and platform dependence of existing methods: whether for traditional association analysis algorithms or association analysis algorithms based on evolutionary theory, the acceleration structure markedly speeds up model processing and data loading. At the same time, the structure is simple to build and suits large-scale data applications of all kinds, including police data.

The above content of the present invention is further described in detail below with reference to specific embodiments. This should not be understood as limiting the scope of the above subject matter of the present invention to the following examples. Without departing from the above technical idea of the present invention, all substitutions or modifications made according to ordinary technical knowledge and customary means in the art shall fall within the scope of the present invention.

Brief Description of the Drawings

FIG. 1 is a flowchart of the method for constructing a big data acceleration structure according to the present invention.

Detailed Description

As shown in FIG. 1, the method for constructing a big data acceleration structure of the present invention includes:

A. Data preprocessing: data cleaning, data integration, and data conversion are performed on the raw data to form a data set suited to the subsequent operations.

The data cleaning includes deleting raw data, filling in missing values, and smoothing/filtering noisy data.

Whether missing data are deleted or their missing values filled in is judged by the attribute weight coverage p of the raw data, where p is:

p = Σ_{i=1..k} ε_i·α_i

where A_i denotes the i-th data attribute, ε_i is its attribute weight, indicating the importance of the attribute, with ε_1 + ε_2 + ... + ε_k = 1, and α_i ∈ {0, 1} indicates whether the attribute value is present. A threshold ω is set with value range (0, 1): if p ≥ ω, the missing values of the record are filled in; otherwise the raw record is deleted.

Noisy data are smoothed/filtered with mean filtering: for the current record to be processed, a template consisting of several neighboring records is selected, and the template mean replaces the original record values, expressed as:

g = (1/M) · Σ_{f ∈ s} f

where k is the number of attributes, g(·) denotes the mean, M is the number of records adjacent to the current record, s is the neighbor set of the current record, and f is a record in s.

The missing values are filled in as follows: the records in the known complete raw data are measured by Euclidean distance, the complete record most similar to the record with missing values is retrieved, and the missing values are supplemented according to the weights of the relevant fields of that complete record.

In the data integration, the data in multiple data sets are combined into a unified database, and redundant attributes are filtered out through cross-matching of the data sets and multi-data-set fusion.

The data conversion includes performing attribute construction on the integrated data set, reducing the attribute dimensionality of the data set, and performing an integrity check on, and supplementing, the reduced data set. Reducing the attribute dimensionality of the data set includes:

transforming the data set into matrix form and normalizing the attributes, where normalization gives all attributes a similar scale, so that large differences in attribute scale do not degrade the dimensionality reduction, followed by a zero-mean normalization operation;

computing the covariance matrix of the data set attributes, then computing the eigenvalues and eigenvectors of the covariance matrix with a singular value decomposition algorithm;

obtaining the dimensionality reduction matrix through the dimensionality reduction computation, and mapping the data set to a low-dimensional space through this matrix.

B. Clustering: the preprocessed data are clustered; after clustering, the similarity between records within each category is computed with the Hamming distance and the records are reordered so that, within each group produced by the clustering algorithm, the most similar records are at the smallest spatial distance. Specifically:

B1. Set the value range [K1, K2] of the number of clusters K, where K1 and K2 bound the number of classes, and assign K the lower limit K1 of the range. The values of K1 and K2 are determined by the attributes of the specific business data.

B2. Randomly select K cluster centers.

B3. Assign each record to its nearest cluster center by distance computation.

B4. Recompute the cluster centers.

B5. Split each obtained class into two subclasses by bisection, and compare the BIC scores of each pair of parent class and subclasses. BIC is the Bayesian Information Criterion, used here to determine the number of clusters automatically: assume a data set D and a set of candidate models M_j, where different models M_j correspond to different values of K. The BIC is computed as:

BIC(M_j) = l_j(D) − (p_j / 2) · log R

where R is the total number of data points in the data set D, l_j(D) is the log-likelihood of the data under the j-th model, evaluated at its maximum, and p_j is the number of parameters in the model M_j. The principle is to use the posterior decision probability to judge the quality of the clustering result.

B6. Find the pair of parent class and subclasses with the largest difference in BIC scores.

B7. Keep the subclasses from step B6 and leave the remaining parent classes unchanged, so that the number of clusters becomes K = K + 1.

B8. Return to step B2 and compute the classes for the new value of K.

B9. If K reaches the upper limit K2, or two consecutive iterations give the same result, no cluster center needs the split operation (step B5), and the loop ends.

B10. Compute the Hamming distances between records for all cluster categories simultaneously and reorder the data within each group.

C. Inverted index mapping structure: the index file, the attribute weight index entries, and the attribute index entries are initialized, and the attributes and the attribute weight list of the sorted data are extracted. The inverted index structure is then constructed: transaction attributes, transaction attribute weights, and transaction records are mapped according to the three-level index, and the process is repeated until all data are mapped, producing the three-level inverted index structure.

D. Run-length compression: the compressed index structure, the transaction attribute weight index, and the transaction attributes are initialized; the range of attribute weights shared by consecutive records is determined; the inverted index mapping structure is traversed; and the consecutive records under each shared transaction attribute weight are compressed with a run-length compression algorithm. The run-length compression algorithm (RLE) replaces a run of adjacent equal values in a data record row with two values, the run length and the repeated element; for example, aaabccccccddeee becomes 3a1b6c2d3e, so RLE compresses such data very effectively, and many concrete run-length compression schemes derive from the RLE principle. This embodiment uses the PCX run-length compression method, a conversion from a bitmap format to a compressed format. For a byte Ch that occurs once, if Ch ≥ 0xC0 the byte is prefixed with 0xC1 during compression, otherwise Ch is output directly; a byte Ch that occurs N times in a row is compressed to 0xC0+N followed by Ch. Because the count byte must stay in the range 0xC0 to 0xFF, N can be at most 0xFF − 0xC0 = 0x3F (63 in decimal); when N exceeds 63, the run is compressed in several passes.
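A sketch of this PCX-style encoding under the rules just stated (byte-oriented, illustrative only):

```python
def pcx_rle_encode(data: bytes) -> bytes:
    """PCX-style run-length encoding: a run of up to 63 equal bytes becomes
    (0xC0 | run_length, byte); a single byte >= 0xC0 is prefixed with 0xC1."""
    out = bytearray()
    i = 0
    while i < len(data):
        ch = data[i]
        n = 1
        while i + n < len(data) and data[i + n] == ch and n < 63:
            n += 1
        if n > 1 or ch >= 0xC0:
            out.append(0xC0 | n)  # count byte: 0xC0 + N (0xC1 for a lone high byte)
            out.append(ch)
        else:
            out.append(ch)        # literal byte below 0xC0 occurring once
        i += n
    return bytes(out)

print(pcx_rle_encode(b"\x41" * 5 + b"\xC5").hex())  # c541c1c5
```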

Claims (8)

1. A method for constructing a big data acceleration structure, comprising the following steps:
A. data preprocessing: performing data cleaning, data integration, and data conversion on the raw data to form a data set suited to the subsequent operations;
B. clustering: clustering the preprocessed data, computing the similarity among the records within each category using the Hamming distance after clustering, reordering the records, and computing record similarity within each group according to the grouping result of the clustering algorithm, so that the most similar records in a group are at the smallest spatial distance;
C. inverted index mapping structure: initializing an index file, attribute weight index entries, and attribute index entries; extracting the attributes and the attribute weight list of the sorted data; then constructing an inverted index structure, establishing a three-level mapping among the transaction attributes, the transaction attribute weights, and the transaction records, and repeating the process until all data are mapped;
D. run-length compression: initializing a compressed index structure, a transaction attribute weight index, and transaction attributes; determining the range of attribute weights shared by consecutive records; traversing the inverted index mapping structure; and compressing the consecutive records under each shared transaction attribute weight with a run-length compression algorithm.
2. The method for constructing a big data acceleration structure according to claim 1, wherein the data cleaning in step A includes deleting raw data, filling in missing values, and smoothing/filtering noisy data.
3. The method for constructing a big data acceleration structure according to claim 2, wherein whether missing data are deleted or their missing values filled in is judged according to the attribute weight coverage p of the raw data, the attribute weight coverage p being:
p = Σ_{i=1..k} ε_i·α_i
wherein A_i is a data attribute, ε_i is its attribute weight representing the importance of the attribute, ε_1 + ε_2 + ... + ε_k = 1, and α_i ∈ {0, 1} indicates whether the attribute value is present; a threshold ω with value range (0, 1) is set, and if p ≥ ω the missing values of the missing data are filled in, otherwise the raw data are deleted;
smoothing/filtering the noisy data employs mean filtering, expressed as:
g = (1/M) · Σ_{f ∈ s} f
wherein k is the number of attributes, g(·) denotes the mean, M is the number of records neighboring the current record, s is the neighbor set of the current record, and f is a record in s.
4. The method for constructing a big data acceleration structure according to claim 2, wherein the missing values are filled in as follows: the records in the known complete raw data are measured by Euclidean distance, the complete record most similar to the missing values is retrieved, and the missing values are supplemented according to the weights of the relevant fields of that complete record.
5. The method for constructing a big data acceleration structure according to claim 1, wherein in the data integration of step A the data in multiple data sets are combined into a unified database, and redundant attributes are filtered out through cross-matching of the data sets and multi-data-set fusion.
6. The method for constructing a big data acceleration structure according to claim 1, wherein the data conversion in step A includes performing attribute construction on the integrated data set, reducing the attribute dimensionality of the data set, and performing an integrity check on, and supplementing, the reduced data set.
7. The method for constructing a big data acceleration structure according to claim 6, wherein reducing the data set attribute dimensionality comprises:
transforming the data set into matrix form and normalizing the attributes;
computing a covariance matrix of the data set attributes, and computing the eigenvalues and eigenvectors of the covariance matrix via a singular value decomposition algorithm;
obtaining a dimensionality reduction matrix through the dimensionality reduction computation, and mapping the data set to a low-dimensional space through the dimensionality reduction matrix.
8. The method for constructing a big data acceleration structure according to claim 1, wherein the clustering in step B comprises the following steps:
B1. setting the value range [K1, K2] of the number of clusters K, K1 and K2 bounding the number of classes, and assigning the number of clusters K the lower limit K1 of the range;
B2. randomly selecting K cluster centers;
B3. assigning each record to its nearest cluster center by distance computation;
B4. recomputing the cluster centers;
B5. splitting each obtained class into two subclasses by bisection, and comparing the BIC scores of each pair of parent class and subclasses;
B6. finding the pair of parent class and subclasses with the largest difference in BIC scores;
B7. keeping the subclasses of step B6 and leaving the other remaining parent classes unchanged, so that the number of clusters becomes K = K + 1;
B8. returning to step B2 and computing the classes for the new value of K;
B9. if the value of K reaches the upper limit K2 or two consecutive iterations give the same result, indicating that no cluster center needs to be split, ending the loop;
B10. computing the Hamming distances among records for all cluster categories simultaneously, and reordering the data within each group.
CN201710817537.0A 2017-09-12 2017-09-12 Construction method of big data acceleration structure Active CN107609105B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710817537.0A CN107609105B (en) 2017-09-12 2017-09-12 Construction method of big data acceleration structure


Publications (2)

Publication Number Publication Date
CN107609105A CN107609105A (en) 2018-01-19
CN107609105B (en) 2020-07-28

Family

ID=61062868

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710817537.0A Active CN107609105B (en) 2017-09-12 2017-09-12 Construction method of big data acceleration structure

Country Status (1)

Country Link
CN (1) CN107609105B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108509531B (en) * 2018-03-15 2021-08-20 昆明理工大学 A method for mining frequent items in uncertain datasets based on Spark platform
CN110580490A (en) * 2018-06-11 2019-12-17 杭州海康威视数字技术股份有限公司 Method, device and equipment for determining personnel behavior probability
CN110378569A (en) * 2019-06-19 2019-10-25 平安国际智慧城市科技股份有限公司 Industrial relations chain building method, apparatus, equipment and storage medium
CN111062751A (en) * 2019-12-12 2020-04-24 镇江市第一人民医院 Charging system and method based on automatic drug correlation consumable
CN111898673A (en) * 2020-07-29 2020-11-06 武汉大学 A Dissolved Oxygen Content Prediction Method Based on EMD and LSTM


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2530012A (en) * 2014-08-05 2016-03-16 Illumina Cambridge Ltd Methods and systems for data analysis and compression

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104102680A (en) * 2013-04-15 2014-10-15 肖瑞 Coding indexing mode for time sequences
CN103500208A (en) * 2013-09-30 2014-01-08 中国科学院自动化研究所 Deep layer data processing method and system combined with knowledge base
CN104809242A (en) * 2015-05-15 2015-07-29 成都睿峰科技有限公司 Distributed-structure-based big data clustering method and device
CN106484813A (en) * 2016-09-23 2017-03-08 广东港鑫科技有限公司 A kind of big data analysis system and method

Also Published As

Publication number Publication date
CN107609105A (en) 2018-01-19


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant