RU2628199C1

RU2628199C1 - Data compression method

Info

Publication number: RU2628199C1
Application number: RU2016122490A
Authority: RU
Inventors: Андрей Владимирович Луценко
Original assignee: Общество с ограниченной ответственностью "ЛАН-ПРОЕКТ"; Андрей Владимирович Луценко
Priority date: 2016-06-07
Filing date: 2016-06-07
Publication date: 2017-08-15

Abstract

FIELD: information technology.

SUBSTANCE: in the original data array, sections with repeated sequences of symbols are selected, the first of such sections is entered into an information array of the compressed data, and the second and subsequent ones are replaced by an index placed in the index array of the compressed data that is formed, when the data are compressed. Information about the size of the section with repeating sequences of symbols in bytes, the number of sections with repeated sequences of symbols, and the type of repeating sequences of symbols on the section are entered into the index. The generated information compressed data array and index compressed array are combined by their sequential connection.

EFFECT: increasing the versatility of the method by providing an on-line data recovery after compression.

2 cl

Description

Предлагаемое изобретение относится к области информационных технологий и может быть использовано для сжатия данных с возможностью их последующего восстановления.The present invention relates to the field of information technology and can be used to compress data with the possibility of their subsequent recovery.

Известен способ сжатия данных [RU 2386210, С2, Н03М 7/40, Н03М 7/46, 10.04.2010], который осуществляется с помощью кодера, причем, в первом блоке памяти кодера хранятся предварительно записанные кодовые комбинации (KK₁) с числом разрядов n, где n=2, 3, 4…, представляющие собой полный набор возможных входных кодовых комбинаций (КК), во втором блоке памяти кодера хранятся предварительно записанные кодовые комбинации КК₂, однозначно соответствующие KK₁, с числом разрядов, меньшим или таким же, как в КК₁, входной поток данных разделяют на КК с одинаковым числом разрядов n, KK последовательно вводят в кодер, идентифицируют путем сравнения с КК₁, отображают соответствующей выходной кодовой комбинацией КК₂, которые представляют собой последовательность групп с одинаковым числом разрядов n в каждой, совокупное число кодовых комбинации КК₂-mⁿ, где m=2, 3, 4…, n=1, 2, 3…, число последовательных групп КК₂ определяют как m^n-1, m^n-2 разрядность КК₂ в группе выравнивают за счет добавления незначащего нуля перед кодовой комбинацией.A known method of data compression [RU 2386210, C2, H03M 7/40, H03M 7/46, 04/10/2010], which is carried out using an encoder, moreover, in the first memory block of the encoder stored pre-recorded code combinations (KK ₁ ) with the number of bits n, where n = 2, 3, 4 ..., which are a complete set of possible input code combinations (CC), in the second memory block of the encoder are stored pre-written code combinations KK ₂ , uniquely corresponding to KK ₁ , with the number of bits less or the same as in KK _1, the input data stream is separated into a QC gap with the same number s n, KK introduced successively into an encoder are identified by comparison with KK _1, KK represent corresponding output code pattern _2, which represent a sequence of groups with the same number of bits in each n, the total number of code combinations QC ₂ -m ^n, where m = 2, 3, 4 ..., n = 1, 2, 3 ..., the number of consecutive KK ₂ groups is defined as m ^n-1 , m ^{n-2 the} capacity of KK ₂ in the group is equalized by adding insignificant zero before the code combination.

Недостатком способа является его относительно высокая сложность.The disadvantage of this method is its relatively high complexity.

Кроме того, известен способ сжатия данных [RU 2450441, C1, Н03М 7/30, 10.05.2012], заключающийся в том, что в память целевого устройства записывают промежуточные сжатые данные, извлекают данные из памяти целевого устройства для последующей распаковки, при этом, данные принимают и отдают 128-битными блоками, используют 16 независимых блоков памяти для хранения кэшированных кодирующих структур размером 15-байтной длины и конфигурируют размер кэш-таблицы посредством задания числа ячеек числами, равными степени 2 в пределах от 16 до 4096, при этом предсказывают кодирующие структуры с использованием двух связных буферов упреждающей выборки для построения словаря, кодируют от двух до пятнадцати байт входного потока в один упакованный символ за один такт, используют количество упакованных байт в качестве обратной связи для логики, отвечающей за сдвиг входного потока, выбирают кодирующую структуру за один такт путем поиска кэшированной строки с наиболее длинной совпадающей с входной строкой последовательностью символов, упаковывают данные в 32-байтные группы, выровненные по два байта, упаковывают совпадающие строки в 2-байтный кодирующий символ, состоящий из длины строки, номера блока памяти и значения хэш-функции, определяющего адрес этой строки в блоке памяти.In addition, a known method of data compression [RU 2450441, C1, H03M 7/30, 05/10/2012], which consists in the fact that intermediate compressed data is written into the memory of the target device, data is extracted from the memory of the target device for subsequent unpacking, while data is received and transmitted in 128-bit blocks, 16 independent memory blocks are used to store cached encoding structures of 15 byte lengths and the cache table size is configured by setting the number of cells with numbers equal to 2 in the range from 16 to 4096, while predicting which using two connected prefetching buffers to build a dictionary, encode two to fifteen bytes of the input stream into one packed symbol per cycle, use the number of packed bytes as feedback for the logic responsible for shifting the input stream, select the encoding structure for one clock cycle by searching for a cached string with the longest sequence of characters matching the input string, pack the data into 32-byte groups aligned by two bytes, pack with falling lines in a 2-byte encoding character, consisting of the length of the line, the number of the memory block and the value of the hash function that defines the address of this line in the memory block.

Недостатком этого технического решения является относительно узкая область применения, поскольку способ требует промежуточного сжатия данных.The disadvantage of this technical solution is the relatively narrow scope, since the method requires intermediate data compression.

Наиболее близким по технической сущности к предложенному является способ сжатия информации [RU 2431918, С2, Н03М 7/30, 20.10.2011], заключающийся в формировании алфавита сообщения и кодового представления его элементов, при этом, при наличии в сообщении группы нескольких последовательно расположенных символов, находящихся на одной строке или на одном столбце матрицы, данная группа символов образует общий код, состоящий из кода общей строки и кодов столбцов или из кода общего столбца и кодов строк, причем, элементы общего кода группы символов располагают последовательно с выделением кода строки от кодов столбцов или кода столбца от кодов строк по тому или иному признаку: изменение полярности, амплитуды, частоты, фазы электрических сигналов.The closest in technical essence to the proposed one is a method of information compression [RU 2431918, C2, Н03М 7/30, 10.20.2011], which consists in forming the message alphabet and code representation of its elements, in this case, if there are several consecutive characters in the message group located on one row or on one column of the matrix, this group of characters forms a common code consisting of a common row code and column codes or a common column code and row codes, and consistently with the selection of the row code from the column codes or the column code from the row codes according to one or another sign: a change in the polarity, amplitude, frequency, phase of the electrical signals.

Для реализации этого способа предварительно формируют кодовое представление символов исходного сообщения, для чего кодируемые символы размещают в узлах матрицы размерностью m×n при максимальном числе кодируемых символов N=m⋅n, после чего порядковые номера строк и столбцов матрицы представляют в двоичном виде: 000, 001, 010… и т.д. Совокупность кода строки и кода столбца, на пересечении которых расположен символ, есть кодовое представление данного символа.To implement this method, the code representation of the symbols of the original message is preliminarily formed, for which the encoded symbols are placed in matrix nodes of dimension m × n with the maximum number of encoded symbols N = m⋅n, after which the sequence numbers of the rows and columns of the matrix are represented in binary form: 000, 001, 010 ... etc. The combination of the row code and the column code at the intersection of which the symbol is located is the code representation of this symbol.

Недостатком наиболее близкого технического решения является относительно высокая сложность, вызванная необходимостью предварительного формирования кодового представления символов исходного сообщения по определенному алгоритму.The disadvantage of the closest technical solution is the relatively high complexity caused by the need for preliminary formation of the code representation of the characters of the original message according to a certain algorithm.

Кроме того, способ предполагает предварительное определение максимального числа кодируемых символов, что ограничивает его применение в общем случае.In addition, the method involves the preliminary determination of the maximum number of encoded characters, which limits its use in the general case.

Отмеченные недостатки препятствуют оперативному проведению восстановления данных после сжатия.The noted drawbacks hinder the operational conduct of data recovery after compression.

Задача, которая решается в изобретении, направлена на упрощение способа, а также на повышение его универсальности с целью обеспечения оперативного восстановления данных после сжатия.The problem that is solved in the invention is aimed at simplifying the method, as well as at increasing its versatility in order to ensure prompt data recovery after compression.

Требуемый технический результат заключается в упрощении способа и повышении универсальности.The required technical result is to simplify the method and increase versatility.

Поставленная задача решается, а требуемый технический результат достигается тем, что в способе, в котором в исходном массиве данных выделяют участки с последовательно расположенными символами, согласно изобретению, в качестве участков с несколькими последовательно расположенными символами последовательно выделяют участки с повторяющимися последовательностями символов, первый из таких участков заносят в информационный массив сжатых данных, а второй и последующие заменяют на индекс, размещаемый в индексном массиве сжатых данных, который формируют при сжатии данных, причем, в индекс заносят информацию о размере участка с повторяющимися последовательностями символов в байтах, число участков с повторяющимися последовательностями символов и типе повторяющихся последовательностей символов на участке.The problem is solved, and the required technical result is achieved by the fact that in the method in which sections with consecutive characters are allocated in the original data array, according to the invention, sections with repeating sequences of characters are sequentially allocated as sections with several consecutive characters, the first of such sections are recorded in the information array of compressed data, and the second and subsequent are replaced by an index placed in the index array of compressed data, cat ing formed by data compression, wherein, in the index are entered size information portion with repeating sequences of symbols in bytes, the number of areas with repetitive sequences of symbols and the symbol type of repetitive sequences at the site.

Кроме того, требуемый технический результат достигается тем, что, в качестве типа повторяющихся последовательностей символов на участке задают, по крайней мере, тип повторений с полным совпадением символов, тип повторений со вставками несовпадающих символов в фиксированных позициях участка и тип повторений с несовпадающими символами в фиксированных позициях.In addition, the required technical result is achieved by the fact that, as the type of repeating sequences of characters in the section, at least the type of repetitions with full coincidence of characters, the type of repetitions with inserts of mismatched characters in fixed positions of the plot, and the type of repetitions with mismatched characters in fixed positions.

Способ сжатия данных осуществляют следующим образом.The data compression method is as follows.

Сущность способа основана на том, что, в исходном массиве данных при сжатии устраняют дублирование совпадающих участков данных, располагающихся непосредственно друг за другом. Имеется как минимум две группы символов, одинаковых по составу, которые расположены непосредственно друг за другом. Первый из совпадающих участков данных заносится в информационный массив сжатых данных, а второй и последующие удаляются и заменяются на индекс, который заносится и размещается в специально формируемом при сжатии данных индексном массиве сжатых данных. Индекс содержит информацию о размере повторяющегося участка - размере участка с повторяющимися последовательностями символов в байтах, числе участков с повторяющимися последовательностями символов и типе повторяющихся последовательностей символов, в качестве которых используют, по крайней мере, тип повторений с полным совпадением символов, тип повторений со вставками несовпадающих символов в фиксированных позициях участка и тип повторений с несовпадающими символами в фиксированных позициях.The essence of the method is based on the fact that, in the original data array during compression, duplication of coincident data sections located directly one after another is eliminated. There are at least two groups of symbols of the same composition, which are located directly one after another. The first of the coincident data sections is entered into the information array of compressed data, and the second and subsequent ones are deleted and replaced by an index, which is entered and placed in the indexed array of compressed data specially formed during data compression. The index contains information about the size of the repeating section - the size of the section with repeating sequences of characters in bytes, the number of sections with repeating sequences of characters and the type of repeating sequences of characters, which are used at least as the type of repetition with a complete match of characters, the type of repetition with inserts of mismatched characters in the fixed positions of the plot and the type of repetition with mismatched characters in the fixed positions.

После сжатия исходного массива данных получают массив сжатых данных, который содержит два отдельных массива, один из которых является информационным массивом сжатых данных и содержит только по одной копии повторяющихся участков, а другой, присоединенный к нему, является индексным массивом сжатых данных и содержит информацию, оформленную в виде индекса, содержащую сведения о размере повторяющегося участка, числе повторов, типе повтора.After compression of the initial data array, an array of compressed data is obtained, which contains two separate arrays, one of which is an information array of compressed data and contains only one copy of repeating sections, and the other attached to it is an indexed array of compressed data and contains information in the form of an index containing information about the size of the repeated section, the number of repetitions, the type of repetition.

Для практической реализации предложенного способа исходный массив данных, предназначенный для последующего сжатия, размещается в физическом носителе с произвольным доступом к информации с точностью до байта. В качестве физического носителя может выступать оперативная память компьютера, устройство хранения информации типа винчестера, флеш-накопитель и т.п.For the practical implementation of the proposed method, the initial data array intended for subsequent compression is placed in a physical medium with random access to information accurate to bytes. The physical memory can be the computer’s random access memory, a storage device such as a hard drive, a flash drive, etc.

Из физического носителя в регистр сдвига устройства, осуществляющего сжатие данных, последовательно подают блоки данных с размерностями, соответствующими размерности регистра. В регистре сдвига эти данные последовательно сравниваются с данными на физическом носителе со сдвигом от одного байта до максимального значения байт. Максимальное число байт сдвига задается при реализации способа и, практически, может быть произвольной величиной. Сдвиг производится относительно начального адреса данных, загруженных в регистр сдвига. При каждом сдвиге производится сравнение на равенство содержимого сдвигового регистра и содержимого в физическом носителе с исходным массивом данных и производится анализ числа совпадений. При этом, если число совпадений меньше размера индекса, требующегося для его хранения в индексном массиве сжатых данных, то совпадения игнорируются.Data blocks with dimensions corresponding to the dimensions of the register are sequentially supplied from the physical medium to the shift register of the device performing data compression. In the shift register, this data is sequentially compared with data on a physical medium with a shift from one byte to the maximum value of bytes. The maximum number of shift bytes is set during the implementation of the method and, in practice, can be an arbitrary value. The shift is relative to the start address of the data loaded into the shift register. At each shift, a comparison is made on the equality of the contents of the shift register and the contents in the physical medium with the original data array and the number of matches is analyzed. Moreover, if the number of matches is less than the size of the index required to store it in the indexed array of compressed data, then the matches are ignored.

После завершения последнего сдвига, в регистр сдвига из физического носителя заносятся данные, размещенные непосредственно за данными, уже размещенными в регистре сдвига. Если число совпадений данных между регистром сдвига и исходным массивом данных больше, чем размер индекса, необходимый для сохранения информации о совпадениях, то первый повторяющийся участок в информационном массив переводится в информационный массиве сжатых данных для сохранения, а все последующие удаляются из исходного массива данных. При этом в индексный массив сжатых данных заносится информация о типе повторяющегося участка данных в объеме, необходимом для ее восстановления.After the last shift is completed, data located immediately after the data already placed in the shift register are entered into the shift register from the physical medium. If the number of data matches between the shift register and the original data array is greater than the size of the index required to store information about the matches, then the first repeating section in the information array is transferred to the information array of compressed data for storage, and all subsequent ones are deleted from the original data array. At the same time, information about the type of the repeating piece of data in the amount necessary for its recovery is entered into the indexed array of compressed data.

После этого в регистр сдвига заносится информация, размещаемая непосредственно за последним совпадением в исходном массиве данных.After that, information placed immediately after the last match in the original data array is entered into the shift register.

Процесс передачи блоков данных с размерностями, соответствующими размерности регистра сдвига, из физического носителя в регистр сдвига повторяется, но, при этом, контролируется достижение конца исходного массива данных. Если он достигается, то сформированные информационный массив сжатых данных и индексный массив сжатых данных объединяются путем их последовательного присоединения, и процесс сжатия исходного массива данных прекращается.The process of transmitting data blocks with dimensions corresponding to the dimensions of the shift register from the physical medium to the shift register is repeated, but at the same time, the end of the original data array is controlled. If it is achieved, then the generated information array of compressed data and the index array of compressed data are combined by sequentially adding them, and the compression process of the original data array is terminated.

При распаковке в сформированный информационный массив сжатых данных добавляются повторяющиеся участки в соответствии с индексами, сохраняемыми в индексном массиве сжатых данных. Это позволяет восстановить исходный массив данных.When unpacking, compressed sections are added to the generated information array of compressed data in accordance with the indices stored in the indexed array of compressed data. This allows you to restore the original data array.

При этом, при реализации предложенного способа исключается необходимость предварительного формирования кодового представления символов исходного сообщения по определенному алгоритму, и не требуется предварительное определение максимального числа кодируемых символов.Moreover, the implementation of the proposed method eliminates the need for preliminary formation of the code representation of the characters of the original message according to a certain algorithm, and preliminary determination of the maximum number of encoded characters is not required.

Это обеспечивает достижение требуемого технического результата, заключающегося в упрощении способа и повышении универсальности.This ensures the achievement of the required technical result, which consists in simplifying the method and increasing versatility.

Claims

1. A method of data compression, according to which sections with several consecutive characters are allocated in the original data array, characterized in that sections with repeating sequences of characters are sequentially allocated as sections with several consecutive characters, the first of these sections is stored in the compressed data file , and the second and subsequent ones are replaced with an index placed in the indexed array of compressed data, which is formed during data compression, and the index is occupied information about the size of the plot with repeating sequences of characters in bytes, the number of plots with repeating sequences of characters and the type of repeating sequences of characters on the plot.

2. The method according to p. 1, characterized in that the type of repeating sequences of characters in the section specify at least the type of repetition with a complete match of characters, the type of repetition with inserts of mismatched characters in the fixed positions of the plot, and the type of repetition with mismatched characters in fixed positions.