
CN119135188A - Data compression method and device based on Huffman coding and data decompression method and device - Google Patents

Data compression method and device based on Huffman coding and data decompression method and device

Info

Publication number
CN119135188A
Authority
CN
China
Prior art keywords
data
node
data table
leaf
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411292217.4A
Other languages
Chinese (zh)
Inventor
郭东辉
邵佳丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen University
Original Assignee
Xiamen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen University filed Critical Xiamen University
Priority to CN202411292217.4A priority Critical patent/CN119135188A/en
Publication of CN119135188A publication Critical patent/CN119135188A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M 7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M 7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M 7/40 Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a data compression method and device based on Huffman coding, and a data decompression method and device. In the compression method, the data to be compressed is first processed according to a Huffman coding mode to obtain a first data table; the first data table is standardized to obtain a second data table; a hierarchical table is obtained from the second data table; based on the hierarchical table, the numbers of all leaf nodes are obtained according to a formula; the code corresponding to each character is output according to the number and depth of its leaf node; and the data is then compressed according to the codes.

Description

Data compression method and device based on Huffman coding and data decompression method and device
Technical Field
The invention relates to the technical field of data compression, in particular to a data compression method and device based on Huffman coding and a data decompression method and device.
Background
As an entropy coding technique, the Huffman coding method is widely used in the field of lossless data compression. Based on counting how often the various characters occur in the data to be coded, a binary tree is constructed bottom-up, and each character is then assigned a prefix code such that the average code length is the shortest; Huffman coding is therefore a coding method based on character statistics and a binary-tree data structure. However, after the Huffman tree has been constructed, it still has to be traversed to obtain the code corresponding to each character, which reduces the compression efficiency of the data.
Disclosure of Invention
The invention aims to overcome the defects or problems in the background art and provide a data compression method and device based on Huffman coding and a data decompression method and device, wherein the compression method can improve the compression efficiency of data.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
The data compression method based on Huffman coding is used for compressing data comprising a plurality of characters and is characterized in that: the data to be compressed is processed according to the Huffman coding mode, and a first data table with a three-fork tree structure is output, the information recorded by the first data table comprising the weight of each node, the parent node of each child node and the depth of each node; the first data table is standardized and a second data table is output, the standardization meaning that the number of child nodes of parent nodes in layers other than the penultimate layer of the first data table is adjusted to 3 and the depths of nodes whose tree relations have changed are adjusted; a hierarchical table is output based on the second data table, the hierarchical table recording the number of leaf nodes in each layer, each leaf node corresponding to one character, and the hierarchical table being expressed as M = {m_i, i = 1, 2, 3, ..., k}, where m_i is the number of leaves of layer i in the second data table, i represents the corresponding layer in the second data table and k represents the total number of layers in the second data table; the numbers of all leaf nodes are determined according to a recurrence formula, the recurrence formula being N_i = R_i - m_i, R_{i+1} = 3N_i + 2, P_i = N_i + 1, with R_2 = 2, where P_i is the number of the first leaf node in the i-th layer when all the leaf nodes in the second data table are sorted in ascending order of weight, and the number of each subsequent leaf node in the i-th layer is the number of the previous leaf node plus one; codes corresponding to the characters are output according to the numbers of all leaf nodes and their corresponding depths in the second data table; and the data to be compressed is compressed according to the codes.
Technical scheme II, based on technical scheme I: the code corresponding to each character is the ternary representation of the number of the leaf node corresponding to that character.
Technical scheme III, based on technical scheme II: the length of the code corresponding to each character is consistent with the depth, in the second data table, of the leaf node corresponding to that character.
Technical scheme IV, based on technical scheme III: the hierarchical table corresponds to a third data table, the third data table being obtained by normalization processing of the second data table; the normalization processing means that the non-leaf nodes of each layer in the second data table are adjusted to lie before the leaf nodes, and the leaf nodes of the layer are sorted in ascending order with the weight of each node as the reference; in the recurrence formula, R_i represents the number of the node in the last position of the i-th layer of the third data table, N_i represents the number of the last non-leaf node in that layer, and when the node N_i comprises only 2 child nodes, R_{i+1} = 3N_i + 2 in the recurrence formula is adjusted to R_{i+1} = 3N_i + 1.
Technical scheme V, based on technical scheme IV: processing the data to be compressed according to the Huffman coding mode means that weights are calculated for the characters in the data to be compressed according to their frequencies, each character initially corresponding to one node, and the 3 lowest-weight nodes are added cyclically; at each addition, the depths of the nodes taking part in the addition and of their child nodes are increased by one.
Technical scheme VI, based on technical scheme V: standardizing the first data table means that, when a parent node has only two or fewer child nodes and the depth of its child nodes is not the maximum of all depths, the node with the largest weight is selected from the layer below those child nodes as a new child node and the weight and depth of the child node are modified, until the number of child nodes of the parent nodes in layers other than the penultimate layer is 3.
Technical scheme VII: a data compression device based on Huffman coding, used for compressing data comprising a plurality of characters, characterized by comprising: a first processing module, used for processing the data to be compressed according to the Huffman coding mode and outputting a first data table with a three-fork tree structure, the information recorded by the first data table comprising the weight of each node, the parent node of each child node and the depth of each node; a second processing module, used for standardizing the first data table and outputting a second data table, the standardization meaning that the number of child nodes of parent nodes in layers other than the penultimate layer of the first data table is adjusted to 3 and the depths of nodes whose tree relations have changed are adjusted; a third processing module, used for outputting a hierarchical table based on the second data table, the hierarchical table recording the number of leaf nodes in each layer, each leaf node corresponding to one character, the hierarchical table being expressed as M = {m_i, i = 1, 2, 3, ..., k}, where m_i is the number of leaves of layer i in the second data table, i represents the corresponding layer in the second data table and k represents the total number of layers in the second data table, and for determining the numbers of all leaf nodes according to the recurrence formula N_i = R_i - m_i, R_{i+1} = 3N_i + 2, P_i = N_i + 1, with R_2 = 2, where P_i is the number of the first leaf node in the i-th layer when all the leaf nodes in the second data table are sorted in ascending order of weight, and the number of each subsequent leaf node in the i-th layer is the number of the previous leaf node plus one; an encoding module, used for outputting the codes corresponding to the characters according to the numbers of all leaf nodes and the corresponding depths in the second data table; and a compression module, used for compressing the data to be compressed according to the codes.
Technical scheme VIII: a data decompression method based on Huffman coding, used for decompressing data compressed by the data compression method of technical scheme VI, characterized in that: the depth corresponding to each character is obtained according to a weight table corresponding to the data to be decompressed and the hierarchical table is output, the weight table comprising the correspondence between each character and its weight; according to the hierarchical table, the number of each leaf node after all leaf nodes are sorted in ascending order of their weights is determined through the recurrence formula and a fourth data table is output, or the fourth data table is output according to a coding table corresponding to the data to be decompressed, the coding table comprising the correspondence between each character and its code; the data to be decompressed is read with the depth of the leaf node of smallest depth as the smallest unit and compared with the number and depth of each leaf node in the fourth data table; if the matching succeeds, the read digits are converted into the original character, and if the matching fails, one more digit is read and matching is performed again in the fourth data table, until the data to be decompressed is completely decompressed.
Technical scheme IX, based on technical scheme VIII: when the hierarchical table is output according to the weight table corresponding to the data to be decompressed, the weight table is processed in turn according to the Huffman coding mode to output a first data table with a three-fork tree structure, the first data table is standardized to output a second data table, and the hierarchical table is output based on the second data table.
Technical scheme X: a data decompression device based on Huffman coding, used for decompressing data compressed by the data compression method of technical scheme VI, characterized by comprising: a first decoding module, used for obtaining the depth corresponding to each character according to a weight table corresponding to the data to be decompressed and outputting the hierarchical table, the weight table comprising the correspondence between each character and its weight; a second decoding module, used for determining, according to the hierarchical table and through the recurrence formula, the number of each leaf node after all leaf nodes are sorted in ascending order of their weights and outputting a fourth data table, or outputting the fourth data table according to a coding table corresponding to the data to be decompressed, the coding table comprising the correspondence between each character and its code; and a decompression module, used for reading the compressed data with the depth of the leaf node of smallest depth as the smallest unit and comparing it with the number and depth of each leaf node in the fourth data table, converting the read digits into the original character if the matching succeeds, and reading one more digit and matching again in the fourth data table if the matching fails, until the data to be decompressed is completely decompressed.
As can be seen from the above description of the present invention, the present invention has the following advantages over the prior art:
The data compression method and the data decompression method based on Huffman coding can compress data comprising a plurality of characters and decompress the data compressed by the method so as to recover the initial data in which the plurality of characters are arranged in their original manner; compared with the conventional Huffman coding method, the method has higher compression efficiency and decompression efficiency. The data to be compressed may be in various formats, such as text, picture, audio and video, where a character refers to the smallest unit of stored information of the corresponding format: for a piece of text, each text symbol is one character, and for a picture, the value of each pixel is one character.
In the data compression method, the data to be compressed is first processed according to the Huffman coding mode into a first data table with a three-fork tree structure; the first data table records the weight of each node, the parent node of each child node and the depth of each node, and corresponds to an initial Huffman tree. The depth indicates the layer in which each node is located; for example, a depth of 1 indicates that the node is located in the first layer, namely the layer of the root node, and a depth of 3 indicates that the node is located in the third layer.
The first data table is then standardized to obtain a second data table, which corresponds to the standardized Huffman tree obtained by standardizing the initial Huffman tree. The standardization adjusts the parent nodes of the initial Huffman tree that have fewer than 3 child nodes, except parent nodes in the penultimate layer, into parent nodes with 3 child nodes, so that the weighted path length of all nodes of the standard Huffman tree is minimized, and the depths of the nodes whose tree relations have changed are adjusted. Once the depth of every leaf node is determined, the number of leaf nodes contained in each layer is known; since each leaf node corresponds to one character, a hierarchical table can be formed from the number of leaf nodes in each layer, and the numbers of all leaf nodes are then obtained from the hierarchical table and the recurrence formula.
It should be noted that the number obtained from the recurrence formula does not correspond to the position of the leaf node in the second data table but to its position in the reordered third data table; this correspondence facilitates the subsequent decompression operation. Once the numbers of all leaf nodes are obtained, a unique code corresponding to each leaf node can be formed by combining the depth of that leaf node. The code can be expressed in ternary to suit the three-fork Huffman tree structure, and its length is consistent with the depth of the corresponding leaf node in the second data table, so the layer of that leaf node in the standard Huffman tree can be determined from the length of the code.
After the codes corresponding to the characters are obtained, each character in the data to be compressed can be compressed, and the new data obtained after compression can be decompressed by the data decompression method provided by the invention. When the data is decompressed, the decompression method needs to acquire a coding table or a weight table corresponding to the data to be decompressed, from which the depth corresponding to each character and the hierarchical table used in the compression process are obtained; the depth is used to determine the minimum reading unit when the data to be decompressed is read, the hierarchical table is used to determine the number of each leaf node, and the data before compression is obtained after the data to be decompressed has been read and converted piece by piece.
In the data compression and decompression methods provided by the invention, no traversal is needed during compression: the number of each leaf node is obtained directly through a formula, and the numbers are arranged according to a specific rule so that each number is unique; a unique code can then be obtained from the number and the corresponding depth, and data compressed according to these codes can be restored to the original data by reversing the encoding. No traversal is needed during decompression either, since the codes that are read are directly compared and matched against the fourth data table; the compression and decompression efficiency of the data is thus remarkably improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the description of the embodiments below are briefly introduced, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a data compression method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the ternary Huffman tree corresponding to the first data table according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the ternary Huffman tree corresponding to the second data table according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the ternary Huffman tree corresponding to the third data table according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. It is to be understood that the described embodiments are preferred embodiments of the invention and should not be taken as excluding other embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without creative efforts, are within the protection scope of the present invention.
In the claims, specification and drawings hereof, unless explicitly defined otherwise, the terms "first," "second," or "third," etc. are used for distinguishing between different objects and not for describing a particular sequential order.
In the claims, specification and drawings of the present invention, the terms "comprising," "having," and variations thereof as used herein are intended to mean "including but not limited to."
Example 1
The embodiment 1 of the invention provides a data compression method based on Huffman coding, which is used for compressing data comprising a plurality of characters.
First, it is necessary to briefly introduce Huffman coding to facilitate the understanding of the data compression method provided by the present application. Huffman coding is an efficient lossless data compression method based on the principle of variable-length coding; the coding of data is realized by constructing a special binary tree, namely the Huffman tree. Its working principle is as follows. Frequency statistics: the frequency of occurrence of each symbol (e.g. letter, number, character) in the data to be coded is first counted. Construction of the Huffman tree: the symbols are sorted by frequency from small to large, and the two symbols with the lowest frequencies are combined into a new node whose frequency is the sum of the two; this process is repeated until all symbols are combined into a single root node. Each time a merge occurs, the newly generated node is placed back into the queue, so that the two nodes with the smallest current frequencies are always selected at the next merge. Generation of the code table: the code of the symbol represented by each leaf node is defined by the path from the root of the Huffman tree to that leaf; typically the left branch is marked 0 and the right branch 1, so that each symbol obtains a unique binary code, higher-frequency symbols receiving shorter codes and lower-frequency symbols longer codes. Encoding and decoding: each symbol of the original data is encoded according to the generated code table to obtain a compressed data stream; in decoding, the original data can be restored from the encoded stream according to the coding rules and the Huffman tree, because Huffman coding is a prefix code, i.e. no code is a prefix of another, which guarantees unique decodability. Huffman coding aims to achieve data compression by reducing the average number of bits of the coded data, and is particularly suitable for data sets in which the symbols occur with uneven frequencies.
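For illustration only, the classical binary procedure described above can be sketched in a few lines of Python; the function name and the sample string are arbitrary choices of the sketch, which is meant to make the merge-and-prefix idea concrete rather than to reproduce any particular implementation.

    import heapq
    from collections import Counter

    def binary_huffman_codes(text):
        # Classical binary Huffman coding: repeatedly merge the two lowest-weight
        # subtrees; prefixing 0/1 at each merge yields a prefix-free code table.
        freq = Counter(text)
        heap = [(w, i, {ch: ""}) for i, (ch, w) in enumerate(freq.items())]
        heapq.heapify(heap)
        tiebreak = len(heap)
        while len(heap) > 1:
            w1, _, t1 = heapq.heappop(heap)
            w2, _, t2 = heapq.heappop(heap)
            merged = {ch: "0" + code for ch, code in t1.items()}
            merged.update({ch: "1" + code for ch, code in t2.items()})
            heapq.heappush(heap, (w1 + w2, tiebreak, merged))
            tiebreak += 1
        return heap[0][2] if heap else {}

    codes = binary_huffman_codes("ABRACADABRA")
    encoded = "".join(codes[ch] for ch in "ABRACADABRA")
    print(codes, encoded)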
Secondly, it is necessary to define explicitly several technical terms used in the description and the claims of the present invention.
The data referred to in the specification and the claims of the present invention is divided, according to its state, into data to be compressed, compressed data and decompressed data; after a piece of data to be compressed has been compressed into compressed data and then decompressed, the decompressed data is the same piece of data as the original data to be compressed.
The term "data including a plurality of characters" used in the present specification and claims means that, for different types of data, "characters" serve as the smallest unit of stored information, and a piece of data may include a plurality of characters. For example, for text-type data each text symbol is one character, and for picture-type data the value of each pixel is one character. Other types of data may be understood by analogy.
The term "node" in the present specification and claims refers to a node in the ternary Huffman tree; nodes are divided into leaf nodes and non-leaf nodes according to whether they have child nodes, and each leaf node corresponds to one character in the data to be compressed.
The term "weight" in the present specification and claims refers, for a leaf node, to the frequency or number of occurrences of the corresponding character in the data to be compressed (for example, in a string of text the character A may occur 50 times), and, for a non-leaf node, to the sum of the weights of its child nodes in the first data table.
The "depth" referred to in the present specification and claims corresponds to the level at which a node is located in the corresponding ternary Huffman tree, the root node being at level 1 with a depth of 0; for example, in a 4-level ternary Huffman tree, a node located at level 4 has a depth of 3.
With the above basic concept known, the data compression method provided by the present invention is described in detail below.
Referring to fig. 1, the huffman coding-based data compression method provided by the invention mainly comprises the following steps:
Step one: processing the data to be compressed according to a Huffman coding mode and outputting a first data table with a three-fork tree structure, wherein the information recorded by the first data table comprises the weight of each node, the parent node of each child node and the depth of each node;
Step two: standardizing the first data table and outputting a second data table, wherein the standardization means that the number of child nodes of parent nodes in layers other than the penultimate layer of the first data table is adjusted to 3, and the depths of nodes whose tree relations have changed are adjusted;
Step three: outputting a hierarchical table based on the second data table, wherein the hierarchical table records the number of leaf nodes in each layer and each leaf node corresponds to one character; the hierarchical table is expressed as M = {m_i, i = 1, 2, 3, ..., k}, where m_i is the number of leaves of layer i in the second data table, i represents the corresponding layer in the second data table, and k represents the total number of layers in the second data table;
Step four: determining the numbers of all leaf nodes according to a recurrence formula, the recurrence formula being N_i = R_i - m_i, R_{i+1} = 3N_i + 2, P_i = N_i + 1, with R_2 = 2, where P_i is the number of the first leaf node in the i-th layer when all the leaf nodes in the second data table are sorted in ascending order of weight, and the number of each subsequent leaf node in the i-th layer is the number of the previous leaf node plus one;
Step five: outputting the codes corresponding to the characters according to the numbers of all leaf nodes and their corresponding depths in the second data table;
Step six: compressing the data to be compressed according to the codes.
In the first step, processing the data to be compressed according to the Huffman coding mode means that weights are calculated for the characters in the data to be compressed according to their frequencies, with each character initially corresponding to one node; the 3 lowest-weight nodes are then added cyclically, and at each addition the depths of the nodes taking part in the addition and of their child nodes are increased by one.
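Purely as an illustrative sketch of this first step, the construction can be written in Python roughly as follows; the record fields, the helper name sink() and the use of a heap are assumptions made for the illustration and are not prescribed by the method itself.

    import heapq
    from collections import Counter

    def build_first_data_table(data):
        # Build a ternary Huffman tree bottom-up: repeatedly merge the 3
        # lowest-weight subtrees (2 when only 2 remain) and record, for every
        # node, its weight, its parent and its depth.
        freq = Counter(data)
        nodes = {ch: {"weight": w, "parent": None, "depth": 0, "children": []}
                 for ch, w in freq.items()}

        def sink(label):                       # the merged subtree sinks one level
            nodes[label]["depth"] += 1
            for child in nodes[label]["children"]:
                sink(child)

        heap = [(rec["weight"], ch) for ch, rec in nodes.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            picked = [heapq.heappop(heap) for _ in range(min(3, len(heap)))]
            label = "".join(lbl for _, lbl in picked)          # e.g. "ABC"
            weight = sum(w for w, _ in picked)
            nodes[label] = {"weight": weight, "parent": None, "depth": 0,
                            "children": [lbl for _, lbl in picked]}
            for _, lbl in picked:
                nodes[lbl]["parent"] = label
                sink(lbl)
            heapq.heappush(heap, (weight, label))
        return nodes

With the weight table of the example given later in this embodiment (characters A to J with weights 22 to 30), such a sketch performs the additions in the order ABC, DEF, GHI, then J together with ABC and DEF, and finally merges the two remaining subtrees into the root, which is the sequence described below for the first data table.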
In the second step, the normalization processing is performed on the first data table, that is, when there are only two or fewer child nodes in a certain parent node, if the depth of the child node is not the maximum value in all depths, the node with the largest weight is selected from the next layer of the child nodes as the new child node, and the weight and the depth of the child node are modified until the number of child nodes of the parent nodes of other layers above the penultimate layer is 3.
In the third step, the hierarchical table corresponds to a third data table, the third data table being obtained by normalization processing of the second data table; the normalization processing means that the non-leaf nodes of each layer in the second data table are adjusted to lie before the leaf nodes, and the leaf nodes of each layer are sorted in ascending order with the weight of each node as the reference.
In the fourth step, in the recurrence formula, R_i represents the number of the node in the last position of the i-th layer of the third data table, N_i represents the number of the last non-leaf node in that layer, and when the node N_i includes only 2 child nodes, R_{i+1} = 3N_i + 2 in the recurrence formula is adjusted to R_{i+1} = 3N_i + 1.
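A possible rendering of this numbering step in Python is sketched below; the function name, the argument layout (a list m of per-layer leaf counts and a flag for the single two-child parent) and the returned layer-to-numbers mapping are assumptions of the sketch.

    def leaf_numbers(m, last_parent_has_two_children=False):
        # Recurrence N_i = R_i - m_i, R_{i+1} = 3*N_i + 2, P_i = N_i + 1, R_2 = 2.
        # m[i-1] is the number of leaf nodes in layer i; the leaves of a layer
        # are assumed to be listed in ascending order of weight.  A leaf in
        # layer 1 (single-character data) is not handled by this sketch.
        numbers = {}                 # layer -> leaf numbers in that layer
        k = len(m)
        R = 2                        # R_2: number of the last node of layer 2
        for layer in range(2, k + 1):
            m_i = m[layer - 1]
            N = R - m_i              # number of the last non-leaf node of layer i
            P = N + 1                # number of the first leaf node of layer i
            numbers[layer] = list(range(P, P + m_i))
            if layer == k:
                break
            step = 1 if (layer == k - 1 and last_parent_has_two_children) else 2
            R = 3 * N + step         # "+1" instead of "+2" when node N_i has 2 children
        return numbers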
In the fifth step, the code corresponding to a character is a ternary representation of the number of the leaf node corresponding to the character, and the length of the code corresponding to a character is consistent with the depth of the leaf node corresponding to the character in the second data table.
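The fifth step can likewise be sketched as a small helper that writes a leaf number in base 3, padded to the leaf's depth; the function name is an assumption of the sketch.

    def ternary_code(number, depth):
        # Base-3 representation of the leaf number, left-padded with zeros so
        # that the code length equals the depth of the leaf in the second data table.
        digits = []
        for _ in range(depth):
            digits.append(str(number % 3))
            number //= 3
        return "".join(reversed(digits))

Combined with leaf_numbers() above and the per-leaf depths of the second data table, this yields one code per character; for instance ternary_code(8, 2) gives "22" and ternary_code(0, 3) gives "000".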
The above-described data compression method will be described below with a specific example.
Taking compression of a text type of data as an example, the specific text of the text is:
FGHFHFEAJBHFEJBIEJDIEHEBIFAGJEFCFCDHJDJJGJHJBBGAABCGIIHHEDIDEEIJJCGDFEEIEFCEDJFFBHHEHCAADHHBHDJJDHGHIGFDAHCDJGGDGIAGGAJIEDECCJHHGBACFGFJIJEBHFDEFJJIFHIDGFGICGBFADJAIABFJCAADCACFFGHHIGICFHAIBCJIIBCBJJFJCBDGDEACCHBACIBDGIHBEFCIJIEHDABAIDEEBEGEIJIIJHCBGGDGFAG.
And carrying out preliminary processing on the text data according to a conventional Huffman coding mode, namely calculating corresponding weight according to the occurrence frequency of each character. The following is a weight table of the data, where the weight table includes a correspondence between each character in the data and its weight.
Weight table:
Character   A    B    C    D    E    F    G    H    I    J
Weight      22   22   23   24   25   26   27   28   29   30
According to the weight table, the following first data table with a trigeminal tree structure is output:
a first data table:
The calculation process can be described with reference to the following steps. Initially the data are the entries of the weight table, each character corresponding to one node; no addition has been performed yet, so no node has a parent. In the first addition, the weights of the three lowest-weight nodes A, B and C are added to obtain the parent node ABC, and the depths of these three nodes are increased by one. In the second addition, the weights of the three lowest-weight nodes D, E and F are added to obtain the parent node DEF, and the depths of these three nodes are increased by one. The third addition proceeds in the same way and is not repeated here. In the fourth addition, the three lowest-weight nodes that have not yet been given a parent are ABC, DEF and J; they are added to obtain the parent node ABCDEFJ, and the depths of these nodes and of their child nodes are increased by one. The fifth addition merges the two remaining subtrees into the root node and completes the output of the first data table. Referring to fig. 2, the structure of the ternary Huffman tree corresponding to the first data table is shown. In this tree, node A is at layer 4 with a depth of 3, and its parent node is ABC. For each node, the stored information includes its own depth, its parent node and its weight.
After the first data table is obtained, the first data table is standardized, and the following second data table is output:
For the processing result of the second data table, refer to fig. 3. The purpose of the standardization is to make each parent node include three child nodes as far as possible, so as to form a sufficiently regular ternary Huffman tree; a parent node with fewer child nodes may still exist in the penultimate layer, which is determined by the number of characters in the data to be compressed. When the number of characters in the data to be compressed is even, there will necessarily be one parent node that includes only two child nodes. Correspondingly, R_{i+1} = 3N_i + 1 occurs only for the last layer of the hierarchical table.
In the adjustment process leading to the second data table, the number of child nodes of each parent node is first checked, and a parent node with only two child nodes is located in its layer, which can be done through the stored depth information; the node with the largest weight is then selected from the two layers below that parent node's layer, and its layer is adjusted to the layer directly below the parent node, i.e. it becomes a new child node of that parent node. This adjustment is performed cyclically until the parent nodes of all layers other than the penultimate layer include three child nodes.
Then, from the second data table, the following hierarchical table M can be obtained:
M = {0, 0, 8, 2}
According to the definition of the hierarchical table, the number of leaf nodes in the first layer is 0, the number of leaf nodes in the second layer is 0, the number of leaf nodes in the third layer is 8, and the number of leaf nodes in the fourth layer is 2.
In practice, since the hierarchical table is derived from the second data table and the third data table is constructed according to fixed rules, the hierarchical table can be associated with the third data table. The third data table is obtained by normalizing the second data table and takes the form of the following table:
The Huffman tree corresponding to the third data table is shown in fig. 4. Comparing fig. 3 with fig. 4, the normalization sorts the positions of the nodes twice: the first sort moves all non-leaf nodes before the leaf nodes, where "before" means to the left when a layer is read from left to right; the second sort arranges all leaf nodes of a layer in ascending order of their weights, so that the position of each leaf node in the third data table corresponds to its weight. Each node is then given a number in order from front to back, starting from 0. For example, in the fourth layer node A is numbered 0 and node B is numbered 1; in the third layer node ABC is numbered 0, node C is numbered 1, and so on. According to this rule of the normalization processing, a recurrence formula can be formed so as to conveniently calculate the number of each leaf node.
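As a small sketch of this ordering, assuming a layer is represented as a list of (label, weight, is_leaf) tuples and that non-leaf nodes keep their existing relative order (an assumption, since the text only fixes the order of the leaves):

    def order_layer_for_third_table(layer_nodes):
        # Non-leaf nodes first (order preserved), then leaf nodes in ascending
        # order of weight; positions within the layer are numbered from 0.
        non_leaves = [n for n in layer_nodes if not n[2]]
        leaves = sorted((n for n in layer_nodes if n[2]), key=lambda n: n[1])
        return {label: pos for pos, (label, _, _) in enumerate(non_leaves + leaves)}

    # Layer 3 of this example: one non-leaf node (the parent of A and B,
    # labelled ABC in fig. 4; its weight is irrelevant here) and the leaves C..J.
    layer3 = [("ABC", None, False)] + [(ch, w, True) for ch, w in
              zip("CDEFGHIJ", [23, 24, 25, 26, 27, 28, 29, 30])]
    print(order_layer_for_third_table(layer3))   # {'ABC': 0, 'C': 1, ..., 'J': 8}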
In the recurrence formula N_i = R_i - m_i, R_{i+1} = 3N_i + 2, P_i = N_i + 1, with R_2 = 2, once the values of m_i are available from the hierarchical table, P_i can be obtained by iterating the formula, and the numbers of all leaf nodes are then obtained from P_i.
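Working the recurrence through for the hierarchy table M = {0, 0, 8, 2} of this example gives the numbers quoted above; the short computation below, with variable names following the formula, is only a restatement of that arithmetic.

    m2, m3, m4 = 0, 8, 2
    R2 = 2
    N2 = R2 - m2          # 2: last non-leaf number of layer 2 (layer 2 has no leaves)
    R3 = 3 * N2 + 2       # 8: last node number of layer 3
    N3 = R3 - m3          # 0: the single non-leaf node of layer 3 is numbered 0
    P3 = N3 + 1           # 1: leaves C..J of layer 3 are numbered 1..8
    R4 = 3 * N3 + 1       # 1: "+1" because node N3 has only two children, A and B
    N4 = R4 - m4          # -1
    P4 = N4 + 1           # 0: leaves A and B of layer 4 are numbered 0 and 1
    print(P3, P4)         # 1 0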
And outputting codes corresponding to the characters according to the numbers of all the leaf nodes and the depth corresponding to each leaf node in the second data table. It should be noted that since the numbers correspond to the ordering relationship of the leaf nodes, the corresponding encoding tables also need to be populated according to the ordering relationship.
Encoding table:
In order to correspond to the ternary Huffman tree structure, the numbers are converted into ternary to form ternary codes. It can be seen that for a leaf node of depth 3 the code length is 3 digits, and for a leaf node of depth 2 the code length is 2 digits. By the same reasoning as in this example, if the deepest leaf node of the Huffman tree has a depth of 6, the longest code length is 6 digits.
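The code table of this example can be reproduced from the numbers and depths derived above; the dictionary layout and the compact base-3 helper in the sketch below are choices made for the illustration.

    leaves = {"A": (0, 3), "B": (1, 3), "C": (1, 2), "D": (2, 2), "E": (3, 2),
              "F": (4, 2), "G": (5, 2), "H": (6, 2), "I": (7, 2), "J": (8, 2)}

    def ternary_code(number, depth):
        return "".join(str(number // 3 ** p % 3) for p in range(depth - 1, -1, -1))

    codes = {ch: ternary_code(n, d) for ch, (n, d) in leaves.items()}
    print(codes)   # {'A': '000', 'B': '001', 'C': '01', ..., 'I': '21', 'J': '22'}

In particular this gives F -> "11", G -> "12" and H -> "20", which matches the beginning "111220..." of the coded stream below for the text "FGH...".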
Compressing the original character string according to the coding table, wherein the compressed coding stream is as follows:
111220201120111010000221001111100022102110102102020102011100012000011001000010010012212012021221022122012012002222000000010012212102220110211012101112201012011120220200021000212021101012012120210221001012020201120100000001010122220102201200210221012120122120221022101021022010002001120121001000010100101001011010122110002110001021101011022112101220000212021012010001120100000001100002122001001121020201221012120100022101001000211021221021120000101011101110122012011001010002112010010021101220110011021012000020101012201012112210012100020110220221022120000000001111122120210110022120110210021012112011121110002111100100001000010001011001111021200001210001011210201022221002011120121101010110000021001122000012000111010010002002110021200021210110002012110212020100010111002010211011200010002001110.
In the data compression method provided in this embodiment, the data to be compressed is first processed according to the Huffman coding mode into a first data table with a three-fork tree structure; the first data table records the weight of each node, the parent node of each child node and the depth of each node, and corresponds to an initial Huffman tree. The depth indicates the layer in which each node is located; for example, a depth of 1 indicates that the node is located in the first layer, namely the layer of the root node, and a depth of 3 indicates that the node is located in the third layer. Processing according to the Huffman coding mode follows the conventional Huffman coding method, with the difference that the lowest-weight nodes are added to form a new parent node 3 at a time instead of 2; compared with a conventional binary Huffman tree, this gives nodes with a higher information density for the same number of layers.
The first data table is then standardized to obtain a second data table, which corresponds to the standardized Huffman tree obtained by standardizing the initial Huffman tree. The standardization adjusts the parent nodes of the initial Huffman tree that have fewer than 3 child nodes, except parent nodes in the penultimate layer, into parent nodes with 3 child nodes, so that the weighted path length of all nodes of the standard Huffman tree is the shortest; the depth of each leaf node is thereby determined, i.e. it is known how many leaf nodes each layer contains. Since one leaf node corresponds to one character of the data to be compressed, a hierarchical table can be formed from the number of leaf nodes in each layer.
Having obtained the hierarchical table, the number of every leaf node is obtained from the hierarchical table and the recurrence formula. The number obtained from the recurrence formula does not correspond to the position of the leaf node in the second data table but to its position in the reordered third data table, and this correspondence facilitates the subsequent decompression operation. Once the numbers of all leaf nodes are obtained, a unique code corresponding to each leaf node is formed by combining the depth of that leaf node; the code is expressed in ternary to suit the three-fork Huffman tree structure, and its length is consistent with the depth of the corresponding leaf node in the second data table, so that the layer of the leaf node in the standard Huffman tree can be determined from the length of the code. After the code corresponding to each character is obtained, each character in the data to be compressed can be compressed.
The number of each leaf node is directly obtained through a formula without traversing operation in the compression process, and the numbers are arranged according to a specific rule, so that the leaf node has uniqueness, and the code with uniqueness can be obtained through the numbers and the corresponding depth, so that the compression efficiency is remarkably improved.
Example 2
An embodiment 2 of the present invention provides a huffman coding-based data compression apparatus for compressing data including a plurality of characters, including:
The first processing module is used for processing the data to be compressed according to a Huffman coding mode and outputting a first data table with a three-fork tree structure, wherein the information recorded by the first data table comprises the weight of each node, the father node of the child node and the depth of each node;
The second processing module is used for carrying out standardization processing on the first data table and outputting a second data table, wherein the standardization processing refers to that the number of child nodes of father nodes of other layers above the father node of the penultimate layer in the first data table is adjusted to 3, and the depth of the node with changed tree relation is adjusted;
A third processing module, configured to output a hierarchy table based on the second data table, where the hierarchy table records the number of leaf nodes in each layer, and each leaf node corresponds to one of the characters; the hierarchy table is expressed as M = {m_i, i = 1, 2, 3, ..., k}, where m_i is the number of leaves of layer i in the second data table, i represents a corresponding layer in the second data table, and k represents the total number of layers in the second data table; the numbers of all leaf nodes are determined according to a recursion formula, the recursion formula being N_i = R_i - m_i, R_{i+1} = 3N_i + 2, P_i = N_i + 1, with R_2 = 2, where P_i is the number of the first leaf node in the i-th layer when all the leaf nodes in the second data table are sorted in ascending order of weight, and the number of each subsequent leaf node in the i-th layer is the number of the previous leaf node plus one;
The coding module is used for outputting codes corresponding to the characters according to the numbers of all the leaf nodes and the corresponding depths in the second data table;
and the compression module is used for compressing the data to be compressed according to the coding.
The description of the above modules may refer to embodiment 1, and will not be repeated here.
Example 3
The embodiment 3 of the invention provides a data decompression method based on Huffman coding, which is used for decompressing data obtained by compressing the data compression method based on Huffman coding in the embodiment 1, and mainly comprises the following steps:
Obtaining the depth corresponding to each character according to a weight table corresponding to the data to be decompressed, and outputting the hierarchical table, wherein the weight table comprises the corresponding relation between each character and the weight of each character;
Step two, determining the serial numbers of all leaf nodes after all the leaf nodes are sequenced according to the ascending sequence of the weight of the leaf nodes according to the hierarchical table, and outputting a fourth data table, or outputting the fourth data table according to an encoding table corresponding to the data to be decompressed, wherein the encoding table comprises the corresponding relation between each character and the encoding;
And thirdly, reading the data to be decompressed by taking the depth of the leaf node with the smallest depth as the smallest unit, comparing the data with the serial number and the depth of each leaf node in the fourth data table, converting into an original character if matching is successful, increasing one reading bit number if matching is failed, and re-matching in the fourth data table until the data to be decompressed is completely decompressed.
In the first step, when the hierarchical table is output according to the weight table corresponding to the data to be decompressed, the weight table is sequentially processed according to the huffman coding mode to output a first data table with a three-tree structure, the first data table is standardized to output a second data table, and the hierarchical table is output based on the second data table.
Furthermore, the coding table referred to in step two may take the form of the coding table given in embodiment 1. It should be noted that, since the depth and the number corresponding to a character can be determined from its code, the coding table obtained at decompression may include only the two items character and code, without the depth and number being supplemented; the complete coding table described above is simply more convenient for the matching operation.
The fourth data table in step two is actually similar to the third data table in content, except that the fourth data table is obtained directly through the hierarchical table and the recurrence formula, and therefore includes only leaf nodes and corresponding numbers.
In step three, a specific method of reading the data to be decompressed is given: the smallest depth among the leaf nodes is taken as the smallest reading unit. It should be noted that only the depths of leaf nodes are considered here, not the depths of non-leaf nodes.
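A sketch of this reading-and-matching loop is given below in Python; the argument names are assumptions of the sketch, and the table maps code strings to characters, which is equivalent to the number-plus-depth comparison in the text because a code is simply the ternary form of a leaf number written with as many digits as the leaf's depth.

    def decompress(stream, code_to_char, min_depth):
        # Read `min_depth` digits at a time; if the window does not match any
        # code, extend it by one digit and try again (the codes are prefix-free,
        # so a match is eventually found for well-formed input).
        out, i = [], 0
        while i < len(stream):
            width = min_depth
            while stream[i:i + width] not in code_to_char:
                width += 1
                if i + width > len(stream):
                    raise ValueError("stream ends inside a code word")
            out.append(code_to_char[stream[i:i + width]])
            i += width
        return "".join(out)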
In particular, in this embodiment, the encoded stream given in embodiment 1 above is decompressed.
Since decompression by the encoding table is relatively simple, the decompression process is described based on the weight table in this embodiment.
Firstly, a weight table corresponding to data to be decompressed needs to be acquired, and in this embodiment, the data to be decompressed is transmitted, and meanwhile, the corresponding weight table is also transmitted synchronously. After the weight table is obtained, a hierarchical table can be obtained according to the weight table and the related processing steps in embodiment 1, and the specific processing procedure is not described again. After the hierarchical table is obtained, the number of each leaf node can be obtained according to the given recurrence formula, and then a fourth data table comprising the leaf nodes and the numbers can be formed. And then reading the coded stream, converting the read code into a decimal number, comparing the decimal number with the number in the fourth data table and the depth of the corresponding node, and converting according to the matching result.
Specifically, according to the weight table of the character string of embodiment 1, the minimum depth of the leaf nodes is 2, and the coded stream is therefore read in units of 2 digits. The coded stream is:
111220201120111010000221001111100022102110102102020102011100012000011001000010010012212012021221022122012012002222000000010012212102220110211012101112201012011120220200021000212021101012012120210221001012020201120100000001010122220102201200210221012120122120221022101021022010002001120121001000010100101001011010122110002110001021101011022112101220000212021012010001120100000001100002122001001121020201221012120100022101001000211021221021120000101011101110122012011001010002112010010021101220110011021012000020101012201012112210012100020110220221022120000000001111122120210110022120110210021012112011121110002111100100001000010001011001111021200001210001011210201022221002011120121101010110000021001122000012000111010010002002110021200021210110002012110212020100010111002010211011200010002001110.
First the first 2 digits of the coded stream, 11, are read; they match the character F, so they are converted into F. The next 2 digits, 12, match the character G and are converted into G. Proceeding in this way, whenever a mismatch is encountered the next digit is read in addition; for example, if no matching character is found after reading 2 digits, the next digit is read and the resulting 3-digit code is matched until the match succeeds.
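Using the code table derived in embodiment 1, the start of the stream decodes exactly as just described; the following self-contained fragment restates the matching loop for the first three characters (the table literal is the one derived above).

    table = {"11": "F", "12": "G", "20": "H", "01": "C", "02": "D", "10": "E",
             "21": "I", "22": "J", "000": "A", "001": "B"}
    stream, out, i = "111220", [], 0
    while i < len(stream):
        width = 2                           # minimum leaf depth in this example
        while stream[i:i + width] not in table:
            width += 1
        out.append(table[stream[i:i + width]])
        i += width
    print("".join(out))                     # -> FGH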
Through the decompression step, the encoded stream of embodiment 1 may be re-decompressed into the original string:
FGHFHFEAJBHFEJBIEJDIEHEBIFAGJEFCFCDHJDJJGJHJBBGAABCGIIHHEDIDEEIJJCGDFEEIEFCEDJFFBHHEHCAADHHBHDJJDHGHIGFDAHCDJGGDGIAGGAJIEDECCJHHGBACFGFJIJEBHFDEFJJIFHIDGFGICGBFADJAIABFJCAADCACFFGHHIGICFHAIBCJIIBCBJJFJCBDGDEACCHBACIBDGIHBEFCIJIEHDABAIDEEBEGEIJIIJHCBGGDGFAG.
Example 4
The embodiment 4 of the present invention provides a data decompression device based on huffman coding, which is used for decompressing data obtained after compression by adopting the data compression method based on huffman coding described in the embodiment 1, and mainly comprises:
the first decoding module is used for obtaining the depth corresponding to each character according to a weight table corresponding to the data to be decompressed and outputting the hierarchical table, wherein the weight table comprises the corresponding relation between each character and the weight of the character;
The second decoding module is used for determining the serial numbers of all leaf nodes after all the leaf nodes are sequenced according to the ascending sequence of the weight of the leaf nodes through the recursion formula and outputting a fourth data table, or outputting the fourth data table according to an encoding table corresponding to the data to be decompressed, wherein the encoding table comprises the corresponding relation between each character and the code;
and the decompression module is used for reading the compressed data by taking the depth of the leaf node with the minimum depth as the minimum unit, comparing the compressed data with the serial number and the depth of each leaf node in the fourth data table, converting the data into the original character if the matching is successful, increasing the number of one reading bit if the matching is failed, and re-matching the data in the fourth data table until the data to be decompressed is completely decompressed.
The description of the above modules may refer to embodiment 3, and will not be repeated here.
The foregoing embodiments and description are presented to illustrate the invention and are not to be construed as limiting its scope. Modifications, equivalents and other improvements that persons skilled in the art may derive from the disclosed embodiments or features, by logical analysis, reasoning or limited experimentation in combination with common general knowledge and/or the prior art, are intended to fall within the protection scope of the invention.

Claims (10)

1. A data compression method based on Huffman coding is used for compressing data comprising a plurality of characters, and is characterized in that,
Processing data to be compressed according to a Huffman coding mode, and outputting a first data table with a three-fork tree structure, wherein the information recorded by the first data table comprises the weight of each node, the father node of a child node and the depth of each node;
the method comprises the steps of carrying out standardization processing on a first data table and outputting a second data table, wherein the standardization processing refers to the steps of adjusting the number of child nodes of father nodes of other layers above the father node of the penultimate layer in the first data table to 3 and adjusting the depth of the node with changed tree relation;
Outputting a hierarchical table based on the second data table, wherein the hierarchical table records the number of leaf nodes in each layer, and each leaf node corresponds to one character; the hierarchy table is expressed as M = {m_i, i = 1, 2, 3, ..., k}, where m_i is the number of leaves of each layer in the second data table, i represents a corresponding layer in the second data table, and k represents the total number of layers in the second data table;
Determining the numbers of all leaf nodes according to a recursion formula, wherein the recursion formula is N_i = R_i - m_i, R_{i+1} = 3N_i + 2, P_i = N_i + 1, with R_2 = 2, where P_i is the number of the first leaf node in the i-th layer when all the leaf nodes in the second data table are sorted in ascending order of weight, and the number of each subsequent leaf node in the i-th layer is the number of the previous leaf node plus one;
Outputting codes corresponding to the characters according to the numbers of all the leaf nodes and the corresponding depths in the second data table;
and compressing the data to be compressed according to the codes.
2. A data compression method according to claim 1, wherein the code corresponding to each character is a ternary representation of the number of the leaf node corresponding to the character.
3. A data compression method according to claim 2, wherein the length of the code corresponding to each character corresponds to the depth of the leaf node corresponding to the character in the second data table.
4. The data compression method based on huffman coding as claimed in claim 3, wherein the hierarchical table corresponds to a third data table, the third data table is obtained by normalization processing based on the second data table, the normalization processing refers to that the non-leaf nodes of each layer in the second data table are adjusted to lie before the leaf nodes and the leaf nodes of the layer are ordered in ascending order with the weight of each node as the reference; in the recursive formula, R_i represents the node number of the last position in the i-th layer of the third data table, N_i represents the number of the last non-leaf node in that layer, and when the node N_i comprises only 2 child nodes, R_{i+1} = 3N_i + 2 in the recursive formula is adjusted to R_{i+1} = 3N_i + 1.
5. The method for compressing data based on Huffman coding as recited in claim 4, wherein said processing data to be compressed according to Huffman coding means that weight is calculated according to frequency for characters in data to be compressed, 3 nodes with lowest weight are added circularly with one character corresponding to one node as starting, and depth of the node and its child node participating in addition operation is added by one at each addition.
6. The method for compressing data based on Huffman coding as recited in claim 5, wherein said normalizing the first data table means that when there are only two or less child nodes of a certain parent node, if the depth of the child node is not the maximum value among all depths, the node with the greatest weight is selected from the next layer of the child node as the new child node, and the weight and depth of the child node are modified until the number of child nodes of the parent nodes of other layers above the penultimate layer is 3.
7. A huffman coding-based data compression apparatus for compressing data including a plurality of characters, comprising:
The first processing module is used for processing the data to be compressed according to a Huffman coding mode and outputting a first data table with a three-fork tree structure, wherein the information recorded by the first data table comprises the weight of each node, the father node of the child node and the depth of each node;
The second processing module is used for carrying out standardization processing on the first data table and outputting a second data table, wherein the standardization processing refers to that the number of child nodes of father nodes of other layers above the father node of the penultimate layer in the first data table is adjusted to 3, and the depth of the node with changed tree relation is adjusted;
A third processing module, configured to output a hierarchy table based on the second data table, where the hierarchy table records the number of leaf nodes in each layer, and each leaf node corresponds to one of the characters; the hierarchy table is expressed as M = {m_i, i = 1, 2, 3, ..., k}, where m_i is the number of leaves of each layer in the second data table, i represents a corresponding layer in the second data table, and k represents the total number of layers in the second data table; determining the numbers of all leaf nodes according to a recursion formula, wherein the recursion formula is N_i = R_i - m_i, R_{i+1} = 3N_i + 2, P_i = N_i + 1, with R_2 = 2, where P_i is the number of the first leaf node in the i-th layer when all the leaf nodes in the second data table are sorted in ascending order of weight, and the number of each subsequent leaf node in the i-th layer is the number of the previous leaf node plus one;
The coding module is used for outputting codes corresponding to the characters according to the numbers of all the leaf nodes and the corresponding depths in the second data table;
and the compression module is used for compressing the data to be compressed according to the coding.
8. A huffman coding-based data decompression method for decompressing data compressed by the huffman coding-based data compression method according to claim 6, characterized in that,
Obtaining the depth corresponding to each character according to a weight table corresponding to the data to be decompressed, and outputting the hierarchical table, wherein the weight table comprises the corresponding relation between each character and the weight of each character;
according to the hierarchical table, determining the serial number of each leaf node after all the leaf nodes are sequenced according to the ascending order of the weight of the leaf nodes through the recursion formula, and outputting a fourth data table, or outputting the fourth data table according to an encoding table corresponding to the data to be decompressed, wherein the encoding table comprises the corresponding relation between each character and the encoding;
And reading the data to be decompressed by taking the depth of the leaf node with the smallest depth as the smallest unit, comparing the data with the serial number and the depth of each leaf node in the fourth data table, converting into an original character if matching is successful, increasing one reading bit number if matching is failed, and re-matching in the fourth data table until the data to be decompressed is completely decompressed.
9. The method for data decompression based on Huffman coding according to claim 8, wherein when the hierarchical table is output according to the weight table corresponding to the data to be decompressed, the weight table is sequentially processed according to the Huffman coding mode to output a first data table with a three-tree structure, the first data table is normalized to output a second data table, and the hierarchical table is output based on the second data table.
10. A huffman coding-based data decompression apparatus for decompressing data compressed by the huffman coding-based data compression method according to claim 6, comprising:
the first decoding module is used for obtaining the depth corresponding to each character according to a weight table corresponding to the data to be decompressed and outputting the hierarchical table, wherein the weight table comprises the corresponding relation between each character and the weight of the character;
The second decoding module is used for determining the serial numbers of all leaf nodes after all the leaf nodes are sequenced according to the ascending sequence of the weight of the leaf nodes through the recursion formula and outputting a fourth data table, or outputting the fourth data table according to an encoding table corresponding to the data to be decompressed, wherein the encoding table comprises the corresponding relation between each character and the code;
and the decompression module is used for reading the compressed data by taking the depth of the leaf node with the minimum depth as the minimum unit, comparing the compressed data with the serial number and the depth of each leaf node in the fourth data table, converting the data into the original character if the matching is successful, increasing the number of one reading bit if the matching is failed, and re-matching the data in the fourth data table until the data to be decompressed is completely decompressed.
CN202411292217.4A 2024-09-14 2024-09-14 Data compression method and device based on Huffman coding and data decompression method and device Pending CN119135188A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411292217.4A CN119135188A (en) 2024-09-14 2024-09-14 Data compression method and device based on Huffman coding and data decompression method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411292217.4A CN119135188A (en) 2024-09-14 2024-09-14 Data compression method and device based on Huffman coding and data decompression method and device

Publications (1)

Publication Number Publication Date
CN119135188A true CN119135188A (en) 2024-12-13

Family

ID=93760052

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411292217.4A Pending CN119135188A (en) 2024-09-14 2024-09-14 Data compression method and device based on Huffman coding and data decompression method and device

Country Status (1)

Country Link
CN (1) CN119135188A (en)

Similar Documents

Publication Publication Date Title
US11044495B1 (en) Systems and methods for variable length codeword based data encoding and decoding using dynamic memory allocation
RU2403677C1 (en) Method for lossless data compression and retrieval
EP1147612B1 (en) Code book construction for variable to variable length entropy encoding
CN108768403B (en) Lossless data compression and decompression method based on LZW and LZW encoder and decoder
CN112398484B (en) Coding method and related equipment
US6982661B2 (en) Method of performing huffman decoding
US5594435A (en) Permutation-based data compression
CN104125475B (en) Multi-dimensional quantum data compressing and uncompressing method and apparatus
CN113381768B (en) Huffman correction coding method, system and related components
CN110602498B (en) Self-adaptive finite state entropy coding method
EP0127815B1 (en) Data compression method
US7548175B2 (en) Encoding apparatus, decoding apparatus, encoding method, computer readable medium storing program thereof, and computer data signal
CN113676187B (en) Huffman correction coding method, system and related components
WO2010108373A1 (en) Method and system for compressed encoding and decoding for word stock
CN112956131B (en) Encoding device, decoding device, encoding method, decoding method, and computer-readable recording medium
KR101023536B1 (en) Data Lossless Compression Method
Yang et al. Universal lossless data compression with side information by using a conditional MPM grammar transform
US20170214413A1 (en) Joint source-channel coding with dynamic dictionary for object-based storage
CN119135188A (en) Data compression method and device based on Huffman coding and data decompression method and device
Severo et al. Your dataset is a multiset and you should compress it like one
US7193542B2 (en) Digital data compression robust relative to transmission noise
CN112506876B (en) Lossless compression query method supporting SQL query
Karpinski et al. A fast algorithm for adaptive prefix coding
US6794999B1 (en) Resilient parameterized prefix codes for adaptive coding
JP3100206B2 (en) Data compression method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination