CN115480756A - Template data compression method, device, equipment and storage medium - Google Patents
Template data compression method, device, equipment and storage medium Download PDFInfo
- Publication number
- CN115480756A CN115480756A CN202110598530.0A CN202110598530A CN115480756A CN 115480756 A CN115480756 A CN 115480756A CN 202110598530 A CN202110598530 A CN 202110598530A CN 115480756 A CN115480756 A CN 115480756A
- Authority
- CN
- China
- Prior art keywords
- data
- template data
- algorithm
- preset
- compression
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/30—Creation or generation of source code
- G06F8/35—Creation or generation of source code model driven
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
- H03M7/40—Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Abstract
本申请实施例提供了一种模板数据的压缩方法、装置、设备及存储介质,该模板数据的压缩方法,包括:获取模板数据;通过预设数据转换算法对模板数据进行重复字符聚集转换,得到字符串;通过预设数据编码算法对字符串进行编码,得到数据序列串;根据预设索引压缩算法压缩存储数据序列串,得到压缩后的模板数据;本申请实施例能够解决现有技术中压缩后的模板数据所占存储空间较大的问题。
Embodiments of the present application provide a template data compression method, device, device, and storage medium. The template data compression method includes: obtaining template data; performing repetitive character aggregation conversion on the template data through a preset data conversion algorithm to obtain Character string; encode the character string through a preset data encoding algorithm to obtain a data sequence string; compress and store the data sequence string according to a preset index compression algorithm to obtain compressed template data; the embodiment of the present application can solve the problem of compression in the prior art After the template data occupies a large storage space problem.
Description
技术领域technical field
本申请属于软件开发领域,尤其涉及一种模板数据的压缩方法、装置、设备及存储介质。The present application belongs to the field of software development, and in particular relates to a template data compression method, device, equipment and storage medium.
背景技术Background technique
目前的软件开发多使用组件化、模板化的方式,对于一些输出可配置化模板的软件,比如海报编辑器、网页可视化编辑器等,往往需要存储大量的配置模板数据,现有的模板数据存储流程如图1所示。这类信息包含大量相同或相似的属性信息,同时对于冒号、逗号、分号等字符的使用率较高,具有大体量、高相似、高重复的特点。The current software development mostly uses componentization and template-based methods. For some software that outputs configurable templates, such as poster editors and web page visual editors, it is often necessary to store a large amount of configuration template data. Existing template data storage The process is shown in Figure 1. This type of information contains a large amount of identical or similar attribute information, and at the same time, characters such as colons, commas, and semicolons are used more frequently, and has the characteristics of large volume, high similarity, and high repetition.
现阶段,随着用户不断增加,自定义模板数据也会不断膨胀,软件必将面对海量数据所带来的较高存储空间需求。而目前现有技术并未针对模板数据特征设计出具备针对性的压缩方法,仍旧采用常规压缩软件对其进行压缩,如WinRar,WinZip等,导致压缩后的模板数据所占存储空间较大。At this stage, as users continue to increase, custom template data will also continue to expand, and the software will inevitably face higher storage space requirements brought about by massive data. However, the current prior art has not designed a targeted compression method for template data characteristics, and still uses conventional compression software to compress it, such as WinRar, WinZip, etc., resulting in a large storage space for the compressed template data.
发明内容Contents of the invention
本申请实施例提供一种模板数据的压缩方法、装置、设备及存储介质,能够解决现有技术中压缩后的模板数据所占存储空间较大的问题。Embodiments of the present application provide a template data compression method, device, device, and storage medium, which can solve the problem in the prior art that compressed template data occupies a large storage space.
第一方面,本申请实施例提供一种模板数据的压缩方法,包括:In the first aspect, the embodiment of the present application provides a method for compressing template data, including:
获取模板数据;get template data;
通过预设数据转换算法对模板数据进行重复字符聚集转换,得到字符串;Perform repeated character aggregation conversion on the template data through a preset data conversion algorithm to obtain a character string;
通过预设数据编码算法对字符串进行编码,得到数据序列串;Encode the character string through the preset data encoding algorithm to obtain the data sequence string;
根据预设索引压缩算法压缩存储数据序列串,得到压缩后的模板数据。The stored data sequence string is compressed according to a preset index compression algorithm to obtain compressed template data.
进一步地,在一种实施例中,方法还包括:Further, in an embodiment, the method also includes:
根据预设后缀数组构造算法对数据序列串进行后缀数组构造,得到后缀数组;According to the preset suffix array construction algorithm, the data sequence string is constructed with a suffix array to obtain a suffix array;
每隔预设个数据对后缀数组进行采样,得到采样数组;Sampling the suffix array every preset data to obtain a sampling array;
将采样数据确定为索引数据,用于在被查询时与查询输入数据比对。The sampled data is determined as index data for comparison with query input data when being queried.
进一步地,在一种实施例中,预设数据转换算法,包括:Further, in one embodiment, the preset data conversion algorithm includes:
Burrows Wheeler transform算法。Burrows Wheeler transform algorithm.
进一步地,在一种实施例中,预设数据编码算法,包括:Further, in one embodiment, the preset data encoding algorithm includes:
哈夫曼编码算法。Huffman coding algorithm.
进一步地,在一种实施例中,预设索引压缩算法,包括:Further, in one embodiment, the preset index compression algorithm includes:
RRR压缩索引结构。RRR compressed index structure.
第二方面,本申请实施例提供一种模板数据的压缩装置,包括:In the second aspect, the embodiment of the present application provides a template data compression device, including:
获取模块,用于获取模板数据;Obtain module, used to obtain template data;
转换模块,用于通过预设数据转换算法对模板数据进行重复字符聚集转换,得到字符串;The conversion module is used to perform repeated character aggregation conversion on the template data through a preset data conversion algorithm to obtain a character string;
编码模块,用于通过预设数据编码算法对字符串进行编码,得到数据序列串;An encoding module, configured to encode a character string through a preset data encoding algorithm to obtain a data sequence string;
压缩存储模块,用于根据预设索引压缩算法压缩存储数据序列串,得到压缩后的模板数据。The compression storage module is configured to compress and store the data sequence string according to a preset index compression algorithm to obtain compressed template data.
进一步地,在一种实施例中,装置还包括:Further, in one embodiment, the device also includes:
构造模块,用于根据预设后缀数组构造算法对数据序列串进行后缀数组构造,得到后缀数组;A construction module, configured to construct a suffix array for the data sequence string according to a preset suffix array construction algorithm to obtain a suffix array;
采样模块,用于每隔预设个数据对后缀数组进行采样,得到采样数组;The sampling module is used to sample the suffix array every preset data to obtain the sampling array;
确定模块,用于将采样数据确定为索引数据,用于在被查询时与查询输入数据比对。A determination module, configured to determine the sampling data as index data, and compare it with the query input data when being queried.
进一步地,在一种实施例中,预设数据转换算法,包括:Further, in one embodiment, the preset data conversion algorithm includes:
Burrows Wheeler transform算法。Burrows Wheeler transform algorithm.
第三方面,本申请实施例提供一种模板数据的压缩设备,包括:存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,计算机程序被处理器执行时实现如权利要求至中任一项的模板数据的压缩方法。In the third aspect, an embodiment of the present application provides a compression device for template data, including: a memory, a processor, and a computer program stored on the memory and operable on the processor. When the computer program is executed by the processor, the following claims are realized Compression method for template data in any of to.
第四方面,本申请实施例提供一种计算机可读存储介质,计算机可读存储介质上存储有信息传递的实现程序,程序被处理器执行时实现如权利要求至中任一项的模板数据的压缩方法。In the fourth aspect, the embodiment of the present application provides a computer-readable storage medium, on which is stored a program for implementing information transfer, and when the program is executed by a processor, the implementation of the template data according to any one of the claims to compression method.
本申请实施例的模板数据的压缩方法、装置、设备及存储介质,针对模板数据高相似、高重复的特点,采用预设数据转换算法对模板数据进行重复字符聚集转换,降低重复字节存储比例,有效提升原模板数据的可压缩性;结合预设数据编码算法、构建小波树、预设索引压缩算法对重复字符聚集转换后的模板数据进行压缩,进一步提升模板数据的可压缩性,压缩后的模板数据所占存储空间较小;并且由于预设数据转换算法和小波树的数据处理时间复杂度均较低,使得压缩耗时较少。The template data compression method, device, equipment, and storage medium of the embodiment of the present application aim at the characteristics of high similarity and high repetition of the template data, and adopt a preset data conversion algorithm to perform aggregation and conversion of repeated characters on the template data to reduce the storage ratio of repeated bytes , to effectively improve the compressibility of the original template data; combine the preset data encoding algorithm, construct the wavelet tree, and the preset index compression algorithm to compress the template data after repeated character aggregation and conversion, further improve the compressibility of the template data, after compression The template data occupies less storage space; and because the data processing time complexity of the preset data conversion algorithm and wavelet tree is low, the compression time is less.
附图说明Description of drawings
为了更清楚地说明本申请实施例的技术方案,下面将对本申请实施例中所需要使用的附图作简单的介绍,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。In order to more clearly illustrate the technical solutions of the embodiments of the present application, the following will briefly introduce the accompanying drawings that need to be used in the embodiments of the present application. Additional figures can be derived from these figures.
图1是本申请实施例提供的一种现有的模板数据存储流程示意图;FIG. 1 is a schematic diagram of an existing template data storage process provided by an embodiment of the present application;
图2是本申请实施例提供的一种模板数据的压缩方法的流程示意图;FIG. 2 is a schematic flowchart of a method for compressing template data provided in an embodiment of the present application;
图3是本申请实施例提供的一种小波树的结构示意图;Fig. 3 is a schematic structural diagram of a wavelet tree provided by an embodiment of the present application;
图4是本申请实施例提供的一种模板数据的压缩装置的结构示意图;FIG. 4 is a schematic structural diagram of a template data compression device provided in an embodiment of the present application;
图5是本申请实施例提供的一种模板数据的压缩设备的结构示意图。Fig. 5 is a schematic structural diagram of a device for compressing template data provided by an embodiment of the present application.
具体实施方式detailed description
下面将详细描述本申请的各个方面的特征和示例性实施例,为了使本申请的目的、技术方案及优点更加清楚明白,以下结合附图及具体实施例,对本申请进行进一步详细描述。应理解,此处所描述的具体实施例仅被配置为解释本申请,并不被配置为限定本申请。对于本领域技术人员来说,本申请可以在不需要这些具体细节中的一些细节的情况下实施。下面对实施例的描述仅仅是为了通过示出本申请的示例来提供对本申请更好的理解。The characteristics and exemplary embodiments of various aspects of the application will be described in detail below. In order to make the purpose, technical solution and advantages of the application clearer, the application will be further described in detail below in conjunction with the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described here are only configured to explain the application, not to limit the application. It will be apparent to one skilled in the art that the present application may be practiced without some of these specific details. The following description of the embodiments is only to provide a better understanding of the present application by showing examples of the present application.
需要说明的是,在本文中,诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。It should be noted that in this article, relational terms such as first and second are only used to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply that there is a relationship between these entities or operations. There is no such actual relationship or order between them. Furthermore, the term "comprises", "comprises" or any other variation thereof is intended to cover a non-exclusive inclusion such that a process, method, article, or apparatus comprising a set of elements includes not only those elements, but also includes elements not expressly listed. other elements of or also include elements inherent in such a process, method, article, or device. Without further limitations, an element defined by the statement "comprising..." does not exclude the presence of additional same elements in the process, method, article or device comprising said element.
现有模板数据压缩方法存在如下弊端:The existing template data compression method has the following disadvantages:
(1)通用数据压缩软件未能充分考虑模板数据高相似、高重复的特点,导致重复数据占用大量存储空间,压缩效率较低;(1) General data compression software fails to fully consider the characteristics of high similarity and high repetition of template data, resulting in repeated data occupying a large amount of storage space and low compression efficiency;
(2)纯压缩软件无法在未解压情况下进行属性查询,简单的查询操作需要进行全文本解压,严重影响软件运行效率;(2) Pure compression software cannot perform attribute query without decompression, and simple query operations require full-text decompression, which seriously affects the operating efficiency of the software;
(3)模板数据往往体量较大,且随软件发展会越增越多,如果不进行压缩必定会遇到存储有限的问题,进行压缩又会导致无法查询,无法保证软件的长期可用性。(3) The template data is often large in size and will increase with the development of the software. If it is not compressed, it will inevitably encounter the problem of limited storage. Compression will make it impossible to query and cannot guarantee the long-term availability of the software.
为了解决现有技术问题,本申请实施例提供了一种模板数据的压缩方法、装置、设备及存储介质。本申请实施针对模板数据高相似、高重复的特点,采用预设数据转换算法对模板数据进行重复字符聚集转换,降低重复字节存储比例,有效提升原模板数据的可压缩性;结合预设数据编码算法、构建小波树、预设索引压缩算法对重复字符聚集转换后的模板数据进行压缩,进一步提升模板数据的可压缩性,压缩后的模板数据所占存储空间较小;并且由于预设数据转换算法和小波树的数据处理时间复杂度均较低,使得压缩耗时较少。下面首先对本申请实施例所提供的模板数据的压缩方法进行介绍。In order to solve the problems in the prior art, the embodiments of the present application provide a template data compression method, device, device, and storage medium. Aiming at the characteristics of high similarity and repetition of template data, this application adopts the preset data conversion algorithm to carry out repeated character aggregation conversion on the template data, reduces the storage ratio of repeated bytes, and effectively improves the compressibility of the original template data; combined with the preset data Encoding algorithm, construction of wavelet tree, and preset index compression algorithm compress the template data after repeated character aggregation and conversion, further improving the compressibility of template data, and the compressed template data occupies less storage space; and because the preset data The data processing time complexity of the conversion algorithm and the wavelet tree is low, which makes the compression less time-consuming. The method for compressing template data provided by the embodiment of the present application will be firstly introduced below.
图2示出了本申请一个实施例提供的模板数据的压缩方法的流程示意图。如图2所示,该方法可以包括以下步骤:Fig. 2 shows a schematic flowchart of a method for compressing template data provided by an embodiment of the present application. As shown in Figure 2, the method may include the following steps:
S110,获取模板数据。S110, acquiring template data.
模板数据根据不同用户的配置修改会生成新的模板数据,所以这类数据在软件的推广使用过程当中会随着用户增长而不断高速膨胀,同时此类数据因为配置项通用、固定字符出现频率高等情况,有着高相似、高重复的特点,可在软件后台获取。The template data will generate new template data according to the configuration modification of different users, so this kind of data will continue to expand at a high speed with the growth of users during the promotion and use of the software. Situation, which has the characteristics of high similarity and high repetition, can be obtained in the background of the software.
S220,通过预设数据转换算法对模板数据进行重复字符聚集转换,得到字符串。S220. Perform repeated character aggregation conversion on the template data by using a preset data conversion algorithm to obtain a character string.
预设数据转换算法可以对数据进行重复字符聚类。The preset data conversion algorithm can perform repeated character clustering on the data.
在一种实施例中,预设数据转换算法可以选用为Burrows Wheeler transform(BWT)算法。BWT的基本思想是对原字符串进行左向逐位轮转并对获得的字符矩阵的每一行字符串进行字典序排序,最终取排序后矩阵的最后一列作为变换结果。在整个变换过程中并不会产生数据丢失,但是变换后的字符串会出现连续多个相同字符的聚集现象,使原数据的可压缩性进一步提高。In one embodiment, the preset data conversion algorithm may be selected as Burrows Wheeler transform (BWT) algorithm. The basic idea of BWT is to rotate the original string bit by bit to the left and sort each row of the obtained character matrix in lexicographical order, and finally take the last column of the sorted matrix as the transformation result. There will be no data loss in the whole transformation process, but the converted string will have multiple consecutive aggregations of the same characters, which further improves the compressibility of the original data.
以模板数据T=″{id:1;Fd:F;}″为例,表1列出了模板数据T的所有字符的ASCII码,通过从左向右依次比较字符的ASCII码,越小的越为先序,该排序过程称为字符的字典序排序。Taking the template data T="{id:1; Fd:F;}" as an example, Table 1 lists the ASCII codes of all characters in the template data T, by comparing the ASCII codes of the characters from left to right, the smaller The higher the order, the sorting process is called lexicographical sorting of characters.
表1 T的所有字符的ASCII码Table 1 ASCII codes of all characters of T
表2则示出了对模板数据T进行BWT变换并得到L串的整个过程,其中矩阵M由字符串循环左移所得,F为每行对应的末尾字符;而矩阵M′是将矩阵M中所有行对应字符串进行字典序排序的结果,矩阵M′的最后一列即为原字符串经BWT的变换结果—字符串L串,L=Table 2 shows the whole process of performing BWT transformation on the template data T and obtaining the L string, in which the matrix M is obtained by shifting the string to the left, F is the end character corresponding to each row; and the matrix M' is the matrix M The result of lexicographical sorting of all rows corresponding to strings, the last column of matrix M′ is the transformation result of the original string through BWT—string L string, L=
″:dd1F;:iF{};″。":dd1F;:iF{};".
由此可见,BWT变换可将原字符串的重复字符在一定程度上进行聚集,原串重复字符越多聚集效果就越明显,因此经过BWT变换可放大模板数据的高重复的特点,提升整体的可压缩性。It can be seen that BWT transformation can aggregate the repeated characters of the original string to a certain extent. The more repeated characters in the original string, the more obvious the aggregation effect is. Therefore, the high repetition characteristics of template data can be amplified after BWT transformation, and the overall compressibility.
表2 BWT变换过程Table 2 BWT transformation process
本申请实施例通过对重复字符进行聚集,使得原模板数据高重复的特点被放大,提示了模板数据的可压缩性。In the embodiment of the present application, by aggregating repeated characters, the feature of high repetition in the original template data is amplified, which indicates the compressibility of the template data.
S230,通过预设数据编码算法对字符串进行编码,得到数据序列串。S230. Encode the character string by using a preset data encoding algorithm to obtain a data sequence string.
预设数据编码算法能够通过对字符串编码进而实现对字符串的压缩。The preset data encoding algorithm can realize the compression of the string by encoding the string.
在一种实施例中,预设数据编码算法可以选用为哈夫曼编码算法。并选用哈夫曼编码算法中的平衡编码。表3示出了字符串L=″:dd1F;:iF{};″使用哈夫曼平衡编码后的编码结果。In an embodiment, the preset data encoding algorithm may be selected as a Huffman encoding algorithm. And choose the balanced coding in the Huffman coding algorithm. Table 3 shows the encoding result of the character string L=":dd1F;:iF{};" after Huffman balance encoding.
表3编码结果Table 3 Encoding Results
在使用哈夫曼编码算法对字符串进行编码时,首先需要确定空间压缩结构,本申请实施例选用小波树结构。小波树是一种空间压缩结构,它一般情况下为二叉树,且其内部结点的0/1序列信息由其父结点根据之前的字符顺序分配的,每一个叶子结点都代表原字符串的字符集中一个可区分的字符。When using the Huffman coding algorithm to encode a character string, it is first necessary to determine the space compression structure, and the embodiment of the present application chooses the wavelet tree structure. The wavelet tree is a space compression structure, which is generally a binary tree, and the 0/1 sequence information of its internal nodes is allocated by its parent node according to the previous character order, and each leaf node represents the original character string A distinguishable character in the character set of .
图3示出了小波树的结构示意图,如图3所示,小波树的根节点为长为12的二进制序列{0,1,1,0,0,0,0,1,0,1,1,0},根据左右子树划分规则以及编码方式,可以得到左节点为{0,0,1,1,0,1,1},右节点为{0,0,0,1,1},再划分得到四个节点{1,0,1}、{1,0,1,0}、{0,0,1}和{0,1},可见,小波树的每个节点的0或1都代表字符串当中的一个特定字符,比如第一个节点中所有0代表的都是字符“1”,所有的1即代表字符“:”。Figure 3 shows a schematic diagram of the structure of the wavelet tree. As shown in Figure 3, the root node of the wavelet tree is a binary sequence {0,1,1,0,0,0,0,1,0,1, 1,0}, according to the left and right subtree division rules and encoding methods, the left node is {0,0,1,1,0,1,1}, and the right node is {0,0,0,1,1} , and then divided into four nodes {1,0,1}, {1,0,1,0}, {0,0,1} and {0,1}, it can be seen that the 0 or 1 represents a specific character in the string, for example, all 0s in the first node represent the character "1", and all 1s represent the character ":".
本申请实施例经过小波树变换后,将字符串转化为0、1分布更为集中0/1序列串,进一步提升了数据的可压缩性。In the embodiment of the present application, after the wavelet tree transformation, the character string is transformed into a 0/1 sequence string with a more concentrated distribution of 0 and 1, which further improves the compressibility of data.
S240,根据预设索引压缩算法压缩存储数据序列串,得到压缩后的模板数据。S240. Compress and store the data sequence string according to a preset index compression algorithm to obtain compressed template data.
在经过小波树变换后,本申请实施例选用0/1序列压缩索引结构对得到的数据序列串进行压缩存储。After the wavelet tree transformation, the embodiment of the present application selects a 0/1 sequence compression index structure to compress and store the obtained data sequence string.
本申请实施针对模板数据高相似、高重复的特点,采用预设数据转换算法对模板数据进行重复字符聚集转换,降低重复字节存储比例,有效提升原模板数据的可压缩性;结合预设数据编码算法、构建小波树、预设索引压缩算法对重复字符聚集转换后的模板数据进行压缩,进一步提升模板数据的可压缩性,压缩后的模板数据所占存储空间较小;并且由于预设数据转换算法和小波树的数据处理时间复杂度均较低,使得压缩耗时较少。Aiming at the characteristics of high similarity and repetition of template data, this application adopts the preset data conversion algorithm to carry out repeated character aggregation conversion on the template data, reduces the storage ratio of repeated bytes, and effectively improves the compressibility of the original template data; combined with the preset data Encoding algorithm, construction of wavelet tree, and preset index compression algorithm compress the template data after repeated character aggregation and conversion, further improving the compressibility of template data, and the compressed template data occupies less storage space; and because the preset data The data processing time complexity of the conversion algorithm and the wavelet tree is low, which makes the compression less time-consuming.
在一种实施例中,本申请采用的是RRR压缩索引结构,它记录了每个块内1的个数,可以实现在该串某位置上0/1的计数操作(rank)和定位操作(locate),为数据查询操作提供数据基础。In one embodiment, what the present application adopts is the RRR compression index structure, which records the number of 1s in each block, and can realize the counting operation (rank) and positioning operation (rank) of 0/1 at a certain position of the string locate), providing a data basis for data query operations.
在一种实施例中,该方法还可以包括:In one embodiment, the method may also include:
S250,根据预设后缀数组构造算法对数据序列串进行后缀数组构造,得到后缀数组。S250. Construct a suffix array on the data sequence string according to a preset suffix array construction algorithm to obtain a suffix array.
后缀数组构造算法能够对数据进行后缀数据构造;常用的有倍增法和DC3法。The suffix array construction algorithm can carry out suffix data construction on the data; the commonly used methods are multiplication method and DC3 method.
在一种实施例中,本申请选用倍增法作为后缀数组构造算法。倍增法首先为从每个位置开始的长度为2的子串进行字典序排序,在利用这个结果为长度为4的子串进行顺序,接下来为长度为8的子串排序,不断倍增,直到长度大于等于n就得到了后缀数组。In one embodiment, the present application selects the multiplication method as the suffix array construction algorithm. The doubling method first sorts the substrings of length 2 starting from each position in lexicographical order, then uses this result to sort the substrings of length 4, and then sorts the substrings of length 8, and keeps multiplying until If the length is greater than or equal to n, the suffix array is obtained.
以字符串T为例,设T[1..n]=t0t1t2···tn-1,T[i]表示字符串T在位置i上的字符,Ti表示T的第i个后缀,即子串Ti,n-1,则字符串T的后缀数组SA即为按照T的各个后缀的字典序排列的下标索引。例如:字符串T={id:1;Fd:F;},则其各后缀按字典序列排列为:T4,T3,T8,T5,T10,T6,T9,T2,T7,T1,T0,T11,表4示出了T的后缀数组,如表4所示,T所对应的后缀数组SA=[4,3,8,5,10,6,9,2,7,1,0,11]。Take the string T as an example, let T[1..n]=t 0 t 1 t 2 ···t n-1 , T[ i ] represents the character at position i of the string T, and T i represents the character of T The i-th suffix, that is, the substring T i,n-1 , the suffix array SA of the string T is the subscript index arranged according to the lexicographical order of each suffix of T. For example: string T={id:1; Fd:F;}, then its suffixes are arranged according to the dictionary sequence: T4, T3, T8, T5, T10, T6, T9, T2, T7, T1, T0, T11 , Table 4 shows the suffix array of T, as shown in Table 4, the corresponding suffix array SA=[4,3,8,5,10,6,9,2,7,1,0,11] .
表4 T的SA后缀数组Table 4 SA suffix array of T
S260,每隔预设个数据对后缀数组进行采样,得到采样数组。S260. Sampling the suffix array every preset data to obtain a sampling array.
为了保证较高的存储效率,本申请实施例加入了采样机制,可以间隔预设个数据对后缀数组进行采样。In order to ensure higher storage efficiency, the embodiment of the present application adds a sampling mechanism, which can sample the suffix array at intervals of preset data.
S270,将采样数据确定为索引数据,用于在被查询时与查询输入数据比对。S270. Determine the sampling data as index data, which is used for comparison with the query input data when being queried.
后缀数组SA是把字符串T的所有后缀以字典序进行从小到大的排序。故后缀数组SA具有以下性质:The suffix array SA is to sort all the suffixes of the string T in alphabetical order from small to large. Therefore, the suffix array SA has the following properties:
TSA[0]<TSA[1]<···<TSA[n-1]TSA[0]<TSA[1]<···<TSA[n-1]
所以,当用户要查询某一个属性时,如id时,只需进在原文本中查询模式串P=“id:1;”是否匹配即可。通过后缀数组的排序,模式串P在原文本的最长的匹配串只可能出现在至少前一个字符相同的后缀数组指向的字符串中。Therefore, when the user wants to query a certain attribute, such as id, he only needs to enter whether the pattern string P="id:1;" matches in the original text. Through the sorting of the suffix array, the longest matching string of the pattern string P in the original text can only appear in the string pointed to by the suffix array with at least the same character as the previous one.
可以采用二分法来对模式串P从左到右进行模式匹配并获得对应的后缀数组位置i,如表5示出的算法1所示,然后在采样的SA数组上返回SA[i],并由当前排名开始,迭代向前一个位置开始的后缀跳跃,直至某个采样点,通过多次匹配将后缀区间转换成原文本的对应位置。若算法结果输出对应位置,则说明原模板数据具有“id:1”这一属性,实现压缩状态下的属性查询操作。The dichotomy method can be used to perform pattern matching on the pattern string P from left to right and obtain the corresponding suffix array position i, as shown in
表5table 5
如表5所示,在对后缀数组进行采样时,SAl就是SA的采样结果,而d表示SA的采样距离,即对整个SA数组每隔d个数据存储一个采样数据。算法1的整个过程平均需要对原文本的压缩信息进行d/2次访问,而每次访问都需要从小波树的根开始,并逐层调用0/1序列上的计数操作,故其时间复杂度为:O(d*log|∑|),其中∑为小波树的高度,执行效率较高。As shown in Table 5, when sampling the suffix array, SA 1 is the sampling result of SA, and d represents the sampling distance of SA, that is, every d data of the entire SA array stores a sampling data. The whole process of
本申请实施例提供的模板数据的压缩方法首先通过BWT将原模板数据可压缩性提高,再用小波树结构进行压缩存储,然后配合采样存储的SA数组实现压缩状态下的属性查询,有效提升了大体量模板数据存储效率,且能支持快速的属性查询操作。The template data compression method provided by the embodiment of the present application first improves the compressibility of the original template data through BWT, then uses the wavelet tree structure to compress and store, and then cooperates with the sampled and stored SA array to realize the attribute query under the compressed state, which effectively improves the Large-volume template data storage efficiency, and can support fast attribute query operations.
本申请实施例提供的方法只需安装Java运行环境(Java Runtime Environment,JRE)即可直接在多平台进行使用,且在实际应用过程当中可用有效缩减模板数据存储空间,提供模板分类查询等基本操作,在用户自定义数据不断增长的过程中依旧保持良好的性能,能够有效保证模板输出类软件长期的可用性和高效性。The method provided by the embodiment of the present application only needs to install the Java Runtime Environment (JRE) to be used directly on multiple platforms, and in the actual application process, it can effectively reduce the template data storage space and provide basic operations such as template classification and query. , it still maintains good performance in the process of increasing user-defined data, and can effectively guarantee the long-term availability and efficiency of template output software.
图2-3描述了模板数据的压缩方法,下面结合附图4和附图5描述本申请实施例提供的装置。Figures 2-3 describe the template data compression method, and the device provided by the embodiment of the present application will be described below in conjunction with Figure 4 and Figure 5 .
图4示出了本申请一个实施例提供的模板数据的压缩装置的结构示意图,图4所示装置中各模块具有实现图1中各个步骤的功能,并能达到其相应技术效果。如图4所示,该装置可以包括:Fig. 4 shows a schematic structural diagram of a template data compression device provided by an embodiment of the present application. Each module in the device shown in Fig. 4 has the function of implementing each step in Fig. 1 and can achieve its corresponding technical effect. As shown in Figure 4, the device may include:
获取模块410,用于获取模板数据。An acquisition module 410, configured to acquire template data.
转换模块420,用于通过预设数据转换算法对模板数据进行重复字符聚集转换,得到字符串。The conversion module 420 is configured to perform repeated character aggregation conversion on the template data through a preset data conversion algorithm to obtain a character string.
编码模块430,用于通过预设数据编码算法对字符串进行编码,得到数据序列串。The encoding module 430 is configured to encode a character string by using a preset data encoding algorithm to obtain a data sequence string.
压缩存储模块440,用于根据预设索引压缩算法压缩存储数据序列串,得到压缩后的模板数据。The compression storage module 440 is configured to compress and store the data sequence string according to a preset index compression algorithm to obtain compressed template data.
本申请实施例提供的模板数据的压缩装置首先通过BWT将原模板数据可压缩性提高,再用小波树结构进行压缩存储,然后配合采样存储的SA数组实现压缩状态下的属性查询,有效提升了大体量模板数据存储效率,且能支持快速的属性查询操作。The template data compression device provided by the embodiment of the present application first improves the compressibility of the original template data through BWT, then uses the wavelet tree structure to compress and store, and then cooperates with the sampled and stored SA array to realize the attribute query in the compressed state, which effectively improves the Large-volume template data storage efficiency, and can support fast attribute query operations.
在一种实施例中,该装置还可以包括:In one embodiment, the device may also include:
构造模块450,用于根据预设后缀数组构造算法对数据序列串进行后缀数组构造,得到后缀数组。The construction module 450 is configured to construct a suffix array for the data sequence string according to a preset suffix array construction algorithm to obtain a suffix array.
采样模块460,用于每隔预设个数据对后缀数组进行采样,得到采样数组。The sampling module 460 is configured to sample the suffix array every preset data to obtain a sample array.
确定模块470,用于将采样数据确定为索引数据,用于在被查询时与查询输入数据比对。The determining module 470 is configured to determine the sampling data as index data for comparison with the query input data when being queried.
在一种实施例中,预设数据转换算法,可以包括:In one embodiment, the preset data conversion algorithm may include:
Burrows Wheeler transform算法。Burrows Wheeler transform algorithm.
在一种实施例中,预设数据编码算法,可以包括:In one embodiment, the preset data encoding algorithm may include:
哈夫曼编码算法。Huffman coding algorithm.
在一种实施例中,预设索引压缩算法,可以包括:In one embodiment, the preset index compression algorithm may include:
RRR压缩索引结构。RRR compressed index structure.
本申请实施例的模板数据的压缩装置,针对模板数据高相似、高重复的特点,采用预设数据转换算法对模板数据进行重复字符聚集转换,降低重复字节存储比例,有效提升原模板数据的可压缩性;结合预设数据编码算法、构建小波树、预设索引压缩算法对重复字符聚集转换后的模板数据进行压缩,进一步提升模板数据的可压缩性,压缩后的模板数据所占存储空间较小;并且由于预设数据转换算法和小波树的数据处理时间复杂度均较低,使得压缩耗时较少。The template data compression device of the embodiment of the present application, aiming at the characteristics of high similarity and high repetition of the template data, adopts the preset data conversion algorithm to carry out repeated character aggregation conversion on the template data, reduces the storage ratio of repeated bytes, and effectively improves the efficiency of the original template data. Compressibility: Combining the preset data encoding algorithm, constructing wavelet tree, and preset index compression algorithm to compress the template data after repeated character aggregation and conversion, further improving the compressibility of the template data, the storage space occupied by the compressed template data Smaller; and because the data processing time complexity of the preset data conversion algorithm and wavelet tree is low, the compression time is less.
图5示出了本申请一个实施例提供的模板数据的压缩设备的结构示意图。如图5所示,该设备可以包括处理器501以及存储有计算机程序指令的存储器502。FIG. 5 shows a schematic structural diagram of a device for compressing template data provided by an embodiment of the present application. As shown in Fig. 5, the device may include a
具体地,上述处理器501可以包括中央处理器(Central Processing Unit,CPU),或者特定集成电路(Application Specific Integrated Circuit,ASIC),或者可以被配置成实施本申请实施例的一个或多个集成电路。Specifically, the above-mentioned
存储器502可以包括用于数据或指令的大容量存储器。举例来说而非限制,存储器502可包括硬盘驱动器(Hard Disk Drive,HDD)、软盘驱动器、闪存、光盘、磁光盘、磁带或通用串行总线(Universal Serial Bus,USB)驱动器或者两个或更多个以上这些的组合。在一个实例中,存储器502可以包括可移除或不可移除(或固定)的介质,或者存储器502是非易失性固态存储器。存储器502可在综合网关容灾设备的内部或外部。
在一个实例中,存储器502可以是只读存储器(Read Only Memory,ROM)。在一个实例中,该ROM可以是掩模编程的ROM、可编程ROM(PROM)、可擦除PROM(EPROM)、电可擦除PROM(EEPROM)、电可改写ROM(EAROM)或闪存或者两个或更多个以上这些的组合。In one example, the
处理器501通过读取并执行存储器502中存储的计算机程序指令,以实现图2所示实施例中的方法,并达到图2所示实例执行其方法达到的相应技术效果,为简洁描述在此不再赘述。The
在一个示例中,该模板数据的压缩设备还可包括通信接口503和总线510。其中,如图5所示,处理器501、存储器502、通信接口503通过总线510连接并完成相互间的通信。In an example, the template data compression device may further include a
通信接口503,主要用于实现本申请实施例中各模块、装置、单元和/或设备之间的通信。The
总线510包括硬件、软件或两者,将在线数据流量计费设备的部件彼此耦接在一起。举例来说而非限制,总线可包括加速图形端口(Accelerated Graphics Port,AGP)或其他图形总线、增强工业标准架构(Extended Industry Standard Architecture,EISA)总线、前端总线(Front Side Bus,FSB)、超传输(Hyper Transport,HT)互连、工业标准架构(Industry Standard Architecture,ISA)总线、无限带宽互连、低引脚数(LPC)总线、存储器总线、微信道架构(MCA)总线、外围组件互连(PCI)总线、PCI-Express(PCI-X)总线、串行高级技术附件(SATA)总线、视频电子标准协会局部(VLB)总线或其他合适的总线或者两个或更多个以上这些的组合。在合适的情况下,总线510可包括一个或多个总线。尽管本申请实施例描述和示出了特定的总线,但本申请考虑任何合适的总线或互连。The
该模板数据的压缩设备可以执行本申请实施例中的模板数据的压缩方法,从而实现图2描述的模板数据的压缩方法的相应技术效果。The device for compressing template data can execute the method for compressing template data in the embodiment of the present application, so as to achieve the corresponding technical effect of the method for compressing template data described in FIG. 2 .
另外,结合上述实施例中的模板数据的压缩方法,本申请实施例可提供一种计算机存储介质来实现。该计算机存储介质上存储有计算机程序指令;该计算机程序指令被处理器执行时实现上述实施例中的任意一种模板数据的压缩方法。In addition, in combination with the template data compression method in the foregoing embodiments, the embodiment of the present application may provide a computer storage medium for implementation. Computer program instructions are stored on the computer storage medium; when the computer program instructions are executed by a processor, any template data compression method in the above-mentioned embodiments is implemented.
需要明确的是,本申请并不局限于上文所描述并在图中示出的特定配置和处理。为了简明起见,这里省略了对已知方法的详细描述。在上述实施例中,描述和示出了若干具体的步骤作为示例。但是,本申请的方法过程并不限于所描述和示出的具体步骤,本领域的技术人员可以在领会本申请的精神后,作出各种改变、修改和添加,或者改变步骤之间的顺序。It is to be understood that the application is not limited to the specific configurations and processes described above and shown in the figures. For conciseness, detailed descriptions of known methods are omitted here. In the above embodiments, several specific steps are described and shown as examples. However, the method process of the present application is not limited to the specific steps described and shown, and those skilled in the art may make various changes, modifications and additions, or change the order of the steps after understanding the spirit of the present application.
以上所述的结构框图中所示的功能块可以实现为硬件、软件、固件或者它们的组合。当以硬件方式实现时,其可以例如是电子电路、专用集成电路(Application SpecificIntegrated Circuit,ASIC)、适当的固件、插件、功能卡等等。当以软件方式实现时,本申请的元素是被用于执行所需任务的程序或者代码段。程序或者代码段可以存储在机器可读介质中,或者通过载波中携带的数据信号在传输介质或者通信链路上传送。“机器可读介质”可以包括能够存储或传输信息的任何介质。机器可读介质的例子包括电子电路、半导体存储器设备、ROM、闪存、可擦除ROM(EROM)、软盘、CD-ROM、光盘、硬盘、光纤介质、射频(RadioFrequency,RF)链路,等等。代码段可以经由诸如因特网、内联网等的计算机网络被下载。The functional blocks shown in the structural block diagrams described above may be implemented as hardware, software, firmware, or a combination thereof. When implemented in hardware, it may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), appropriate firmware, a plug-in, a function card, and the like. When implemented in software, the elements of the present application are the programs or code segments employed to perform the required tasks. Programs or code segments can be stored in machine-readable media, or transmitted over transmission media or communication links by data signals carried in carrier waves. "Machine-readable medium" may include any medium that can store or transmit information. Examples of machine-readable media include electronic circuits, semiconductor memory devices, ROM, flash memory, erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, Radio Frequency (RF) links, etc. . Code segments may be downloaded via a computer network such as the Internet, an Intranet, or the like.
还需要说明的是,本申请中提及的示例性实施例,基于一系列的步骤或者装置描述一些方法或系统。但是,本申请不局限于上述步骤的顺序,也就是说,可以按照实施例中提及的顺序执行步骤,也可以不同于实施例中的顺序,或者若干步骤同时执行。It should also be noted that the exemplary embodiments mentioned in this application describe some methods or systems based on a series of steps or devices. However, the present application is not limited to the order of the above steps, that is, the steps may be performed in the order mentioned in the embodiment, or may be different from the order in the embodiment, or several steps may be performed simultaneously.
上面参考根据本申请的实施例的方法、装置(系统)和计算机程序产品的流程图和/或框图描述了本申请的各方面。应当理解,流程图和/或框图中的每个方框以及流程图和/或框图中各方框的组合可以由计算机程序指令实现。这些计算机程序指令可被提供给通用计算机、专用计算机、或其它可编程数据处理装置的处理器,以产生一种机器,使得经由计算机或其它可编程数据处理装置的处理器执行的这些指令使能对流程图和/或框图的一个或多个方框中指定的功能/动作的实现。这种处理器可以是但不限于是通用处理器、专用处理器、特殊应用处理器或者现场可编程逻辑电路。还可理解,框图和/或流程图中的每个方框以及框图和/或流程图中的方框的组合,也可以由执行指定的功能或动作的专用硬件来实现,或可由专用硬件和计算机指令的组合来实现。Aspects of the present application are described above with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the present application. It will be understood that each block of the flowchart and/or block diagrams, and combinations of blocks in the flowchart and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine such that execution of these instructions via the processor of the computer or other programmable data processing apparatus enables Implementation of the functions/actions specified in one or more blocks of the flowchart and/or block diagrams. Such processors may be, but are not limited to, general purpose processors, special purpose processors, application specific processors, or field programmable logic circuits. It can also be understood that each block in the block diagrams and/or flowcharts and combinations of blocks in the block diagrams and/or flowcharts can also be realized by dedicated hardware for performing specified functions or actions, or can be implemented by dedicated hardware and Combination of computer instructions to achieve.
以上所述,仅为本申请的具体实施方式,所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,上述描述的系统、模块和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。应理解,本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到各种等效的修改或替换,这些修改或替换都应涵盖在本申请的保护范围之内。The above is only a specific implementation of the present application, and those skilled in the art can clearly understand that for the convenience and brevity of description, the specific working process of the above-described systems, modules and units can refer to the foregoing method embodiments The corresponding process in , will not be repeated here. It should be understood that the protection scope of the present application is not limited thereto, and any person skilled in the art can easily think of various equivalent modifications or replacements within the technical scope disclosed in the application, and these modifications or replacements should cover all Within the protection scope of this application.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110598530.0A CN115480756A (en) | 2021-05-31 | 2021-05-31 | Template data compression method, device, equipment and storage medium |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110598530.0A CN115480756A (en) | 2021-05-31 | 2021-05-31 | Template data compression method, device, equipment and storage medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN115480756A true CN115480756A (en) | 2022-12-16 |
Family
ID=84420191
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202110598530.0A Pending CN115480756A (en) | 2021-05-31 | 2021-05-31 | Template data compression method, device, equipment and storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN115480756A (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118394725A (en) * | 2024-06-28 | 2024-07-26 | 中国兵器装备集团兵器装备研究所 | VMF-based data message compression method and system |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN103810228A (en) * | 2012-11-01 | 2014-05-21 | 辉达公司 | System, method, and computer program product for parallel reconstruction of a sampled suffix array |
| CN111211787A (en) * | 2019-10-09 | 2020-05-29 | 华中科技大学 | An industrial data compression method, system, storage medium and terminal |
-
2021
- 2021-05-31 CN CN202110598530.0A patent/CN115480756A/en active Pending
Patent Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN103810228A (en) * | 2012-11-01 | 2014-05-21 | 辉达公司 | System, method, and computer program product for parallel reconstruction of a sampled suffix array |
| CN111211787A (en) * | 2019-10-09 | 2020-05-29 | 华中科技大学 | An industrial data compression method, system, storage medium and terminal |
Non-Patent Citations (1)
| Title |
|---|
| 郭文钰: "全文自索引压缩算法的研究", 中国优秀硕士学位论文全文数据库 信息科技辑, no. 6, 15 June 2018 (2018-06-15), pages 138 - 2155 * |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118394725A (en) * | 2024-06-28 | 2024-07-26 | 中国兵器装备集团兵器装备研究所 | VMF-based data message compression method and system |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US11693839B2 (en) | Parser for schema-free data exchange format | |
| CN104011723B (en) | Boolean logic in a state machine lattice | |
| CN108733689B (en) | A JSON text comparison method and device | |
| Bille et al. | Random access to grammar-compressed strings | |
| CN104572685B (en) | data sorting method | |
| Gawrychowski | Optimal pattern matching in LZW compressed strings | |
| CN110795526B (en) | A method and system for creating mathematical formula index for retrieval system | |
| CN107210753A (en) | The lossless simplification of the data of data is exported by the primitive from relevance screen is resided in | |
| WO2021072874A1 (en) | Dual array-based location query method and apparatus, computer device, and storage medium | |
| CN111950263A (en) | A log parsing method, system and electronic device | |
| CN114529741A (en) | Picture duplicate removal method and device and electronic equipment | |
| CN109815238A (en) | A method and device for dynamically adding database by using strictly balanced binary tree | |
| Nong et al. | Induced sorting suffixes in external memory | |
| CN117493622B (en) | Method and device for inquiring character strings based on field programmable array device | |
| CN113986950A (en) | An SQL statement processing method, device, device and storage medium | |
| CN112115313A (en) | Regular expression generation, data extraction method, apparatus, equipment and medium | |
| CN115982310B (en) | Chain table generation method with verification function and electronic equipment | |
| CN115480756A (en) | Template data compression method, device, equipment and storage medium | |
| CN110472385A (en) | A kind of password cracking method and device | |
| CN114297046B (en) | Log-based event acquisition method, device, equipment and medium | |
| CN115221191A (en) | Virtual column construction method based on data lake and data query method | |
| CN111723122A (en) | Method, apparatus, device and readable storage medium for determining association rules between data | |
| WO2025156482A1 (en) | Semantic-based huffman encoding method, semantic-based huffman decoding method and related device | |
| US11031092B2 (en) | Taxonomic annotation of variable length metagenomic patterns | |
| CN115510051A (en) | Data processing method, query method, device and electronic equipment |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination |







