KR0144633B1

KR0144633B1 - Data Compression Method of Completed Hangul Using LZW

Info

Publication number: KR0144633B1
Application number: KR1019940039802A
Authority: KR
Inventors: 서두원
Original assignee: 김준성; 고등기술연구원연구조합
Priority date: 1994-12-30
Filing date: 1994-12-30
Publication date: 1998-08-17
Also published as: KR960025207A

Abstract

LZW를 이용한 데이타 압축 방식에서, 일반적으로 사용되는 1바이트 영문전용 압축방식을 변형하여, 2바이트 한글에 대하여도 높은 데이타 압축률을 가질 수 있도록 LZW 압축 알고리즘에 사용되는 딕셔너리 트리(dictionary tree)와 내부 테이블을 재구성하여 압축률을 증가 시킨다.In the data compression method using LZW, a dictionary tree and an internal table used in the LZW compression algorithm are modified to have a high data compression rate for 2-byte Korean characters by modifying a commonly used one-byte English compression method. Recompose to increase the compression rate.

Description

Data compression method of complete Hangul using LZW.

제1도는 본 발명의 LZW를 이용한 데이타 압축방법의 흐름선도.1 is a flow diagram of a data compression method using LZW of the present invention.

제2도의 a는 종래의 LZW를 이용한 데이타 압축방법의 기본 알고리즘의 초기 내부 테이블 구조 도해도.2 is a diagram showing an initial internal table structure of a basic algorithm of a data compression method using a conventional LZW.

제2도의 b는 본 발명의 LZW를 이용한 데이타 압축방법의 한글인식 알고리즘의 초기 내부 테이블 구조 도해도.2 is a diagram illustrating an initial internal table structure of a Hangul recognition algorithm of a data compression method using LZW of the present invention.

본 발명은 LZW(Lempel Zive. Welch)를 이용한 데이타 압축 방법에 관한 것이며, 특히, LZW 압축 기법에 한글인식 과정을 부가시켜, LZW 압축 알고리즘에 딕셔너리 트리(dictionary tree), 즉 내부 테이블을 재구성 시켜 압축률을 향상시킬 수 있는 데이타 압축 방법에 관한 것이다.The present invention relates to a data compression method using LZW (Lempel Zive. Welch), in particular, by adding a Hangul recognition process to the LZW compression scheme, and recomposing a dictionary tree, that is, an internal table, to the LZW compression algorithm. It relates to a data compression method that can improve the.

일반적으로, 데이타 전송속도를 향상시킬 수 있는 데이타 압축 방법에는 허프만 부호화 기법, 산술 부호화 기법과 LZW 압축기법이 있는데 그중 가장 많이 사용되고 있는 LZW 압축기법은 상술한 두가지 기법에 사용되는 입력 데이타의 확률 통계적인 성질에 의존하지 않고, 화일을 한번만 읽어서 처리하도록 하여 실시간 환경에서 좀더 효율적으로 적용하게 할 수 있는 유니버셜 데이타 압축 기법이다.In general, data compression methods that can improve the data transfer rate include Huffman coding, arithmetic coding, and LZW compression. Among them, the LZW compression method, which is the most widely used, is a statistical method of input data used in the above two techniques. It is a universal data compression technique that can be applied more efficiently in real time environment by reading and processing files only once without depending on the property.

상기 LZW 압축방법은 스트링 테이블(string table)이라는 번역 테이블을 만들어 이를 이용하는 방법으로, 논리적으로는 딕셔너리 트리(dictionary tree)라는 구조를 만들어, 각 기호열을 관리하게 되는데, 문자열을 이 트리의 한노드에 할당하고, 이 번지만을 전송하게 함으로써 압축이 이루어지는 방법이다.In the LZW compression method, a translation table called a string table is used to create a translation table. The LZW compression method logically forms a dictionary tree and manages each symbol string. The string is a node of the tree. This method is used for compression by assigning to and sending only this address.

그러나, 상술한 종래의 LZW 압축방법에서는 한글 인식 과정이 없이 모두 8비트 심볼 단위로 처리를 하였기 때문에, 2바이트로 구성되는 한들의 특성을 살리지 못하고, 압축과정을 수행함으로써, 압축률을 높일 수 있는 요소가 배제되는 문제점이 있었다.However, in the above-described conventional LZW compression method, since all processing is performed in units of 8-bit symbols without the Hangul recognition process, the characteristics of the compression rate can be increased by performing the compression process without utilizing the characteristics of the one consisting of two bytes. There was a problem that is excluded.

본 발명은 상술한 문제점을 해결하기 위해, 종래의 LZW 압축방식에 완성형 한글인식 및 처리과정을 부가시킴으로써, 데이타 압축율을 높혀서, 데이타 전송속도를 향상시키는 것을 목적으로 한다.In order to solve the above problems, an object of the present invention is to increase the data compression rate and improve the data transmission speed by adding a complete Hangul recognition and processing process to the conventional LZW compression method.

상기 목적을 달성하기 위해, 본 발명은 LZW 압축기법에 한글인식 과정을 첨가시켜, 영문전용의 1바이트가 아닌 한글용 2바이트를 하나의 심볼로 인식하게 하고, LZW 압축 알고리즘에 사용되는 딕셔너리 트리와 내부 테이블을 재구성 시키는 방법을 특징으로 한다.In order to achieve the above object, the present invention adds a Hangul recognition process to the LZW compressor method, so that it recognizes two bytes for Hangul as one symbol instead of one byte for English, and a dictionary tree used for the LZW compression algorithm. It features a method of reorganizing internal tables.

이하, 첨부된 도면으로 본 발명을 더욱 상세하게 설명하기로 한다.Hereinafter, the present invention will be described in detail with the accompanying drawings.

제1도는 본 발명의 LZW 압축방법을 이용한 데이타 압축방법의 흐름선도로서, 그 첫 번째 단계는, 7비트 ASCⅡ 코드용 128개의 노드와 완성형 한글 코드용 2350자를 딕셔너리 트리에 초기화 시키는 단계(SⅠ-2)로서, 이는 종래의 방법에서, 비트 부호의 모든 가능한 256개의 노드를 딕셔너리 트리에 초기화 시키는 단계(SⅠ-1)를 대체한 것으로 된다. 그 다음 단계부터는 종래의 방법에서와 동일한 단계를 거치게 되는데, 즉, 입력문자를 읽어 스트링 w을 취하는 단계(S2)와, 다음 입력문자가 존재하는가를 판단하는 단계(S3)로 진행되는데, 만약 입력문자가 존재하지 않으면, 전단계(S2)의 스트링에 할당된 부호어를 출력하며(S10), 입력문자가 있으면, 다음 입력 문자를 읽어 캐릭터(k)로 설정하는 단계(S4)로 진행된다. 다음에 스트링과 캐릭터가 딕셔너리에 존재하는가를 판단하며(S5), 존재하면 스트링과 캐릭터를 새로운 스트링으로 하여(S6) 다시 상술한 S4단계로 진행된다. 만약에, 스트링과 캐릭터가 딕셔너리 트리에 존재하지 않으면, 스트링(w)에 할당한 부호어를 출력하며(S7), 계속하여, 딕셔너리 트리에 대한 새로운 노드를 생성하여 부호어를 할당시키며(S8), 다음에 캐릭터(k)를 새로운 스트링으로 취하며(S9), 다시 전단계(S4)로 궤환한다. 즉, 본 발명의 흐름선도에서는 점선으로 표시된 종래의 초기화 단계(SⅠ-1)를 단계(SⅠ-2)로 대체하는데 단계(SⅠ-1)에서는 한글 존재의 여부을 판단하지 않고 모든 문자를 8비트 단위로 처리하였으나, 본 발명은 단계(SⅠ-2)을 이용하여 더높은 압축률을 가진다.1 is a flow diagram of a data compression method using the LZW compression method of the present invention. The first step is to initialize 128 dictionaries for 7-bit ASCII code and 2350 characters for the complete Hangul code in the dictionary tree (SⅠ-2). This replaces the step (S-1) of initializing all possible 256 nodes of the bit code in the dictionary tree in the conventional method. From the next step, the same steps as in the conventional method are performed, i.e., reading the input character takes a string w (S2) and determining whether the next input character exists (S3). If there is no character, the codeword assigned to the string of the previous step S2 is output (S10). If there is an input character, the process proceeds to step S4 of reading the next input character and setting the character k. Next, it is determined whether the string and the character exist in the dictionary (S5), and if present, the string and the character are made a new string (S6), and the process proceeds to the above-described step S4. If the string and the character do not exist in the dictionary tree, the codeword assigned to the string w is output (S7). Subsequently, a new node for the dictionary tree is generated and the codeword is assigned (S8). Next, the character k is taken as a new string (S9) and fed back to the previous step (S4). That is, in the flow diagram of the present invention, the conventional initialization step S-1-1 indicated by the dotted line is replaced by step S-2. In step S-1-1, all characters are determined by 8-bit units without determining whether the Hangul is present. However, the present invention has a higher compression ratio using the step (SI-2).

제2도의 (a)는 종래의 LZW를 이용한 데이타 압축 방법의 기본 알고리즘의 내부 테이블 구조의 도해도이며, 제2도의 (b)는 본 발명의 LZW를 이용한 데이타 압축방식의 한글 인식 알고리즘의 내부 테이블 구조 도해도로서, 테이블의 총크기를 8192로 하는 LZW 알고리즘에서의 내부 테이블 구조를 각각 나타낸다.(A) of FIG. 2 is a diagram of the internal table structure of the basic algorithm of the conventional data compression method using LZW, and (b) of FIG. 2 is an internal table of the Hangul recognition algorithm of the data compression method using LZW of the present invention. As the structure diagram, internal table structures in the LZW algorithm having the total size of the table to 8192 are shown, respectively.

제2도의 (a)에서, 0에서 255까지의 부호어는 ASCⅡ코드로, 256 내지 511까지의 부호어는 9비트 부호화 영역으로, 512 내지 1023까지의 부호어는 10비트 부호화 영역으로, 1024 내지 2047까지의 부호어는 11비트 부호화 영역으로 하며, 2048 내지 4095까지의 부호어는 12비트 부호화 영역으로 하며, 4096 내지 8191까지의 부호어는 13비트 부호화 영역으로 한다. 반면에 제2도의 (b)에 따른 본 발명의 알고리즘의 내부 테이블 구조에서는, 딕셔너리 트리 테이블을 생성하는 단계에서, 제2도의 (a)의 초기에 256개의 ASCⅡ코드노드를 사용하는 대신 128개의 ASCⅡ코드와 2350자의 한글 코드용 노드를 딕셔너리 트리에 초기화시켜 사용한다. 즉, 제2도의 (a)에서의 9비트, 10비트, 11비트 부호화 영역 및 12비트 부호화 영역중 일부를 본 발명에서는 한글 코드가 차지하게 된다. 즉, KSC 완성형 코드에 대한 한글 데이타 처리를 위하여 KSC 5601에 정의된 한글2350자를 초기 테이블 노드에 할당함으로써, 초기 테이블 노드는 ASCⅡ코드를 위한 128자+2350자인 2478자가 되도록 한다.In FIG. 2A, codewords 0 to 255 are ASCII codes, codewords 256 to 511 are 9-bit coded areas, codewords 512 to 1023 are 10-bit coded areas, and numbers 1024 to 2047. A codeword is an 11-bit coded area, codewords 2048 to 4095 are 12-bit coded areas, and codewords 4096 to 8191 are 13-bit coded areas. On the other hand, in the internal table structure of the algorithm of the present invention according to FIG. 2 (b), in the step of generating the dictionary tree table, instead of using 256 ASCII code nodes at the beginning of FIG. Nodes for code and 2350 characters of Korean code are initialized and used in the dictionary tree. In other words, the Hangul code occupies some of the 9-bit, 10-bit, 11-bit encoding regions, and 12-bit encoding regions in FIG. That is, by assigning 2350 Korean characters defined in KSC 5601 to the initial table node for processing Korean data for the KSC complete code, the initial table node is 2478 characters, which is 128 characters + 2350 characters for the ASCII code.

압축진행시 1617개의 부호어는 12비트로 하고, 나머지는 13비트로 출력하게 하여, 완성형 한글 인식 알고리즘의 내부 테이블 구조를 완성시킨다.During compression, 1617 codewords are set to 12 bits and the rest are output to 13 bits, completing the internal table structure of the complete Hangul recognition algorithm.

따라서, 변경된 초기 테이블 노드를 이용하여 LZW 알고리즘에 따라 압축을 행하게 한다.Therefore, compression is performed according to the LZW algorithm using the changed initial table node.

이상, 본 발명에 따른 데이타 압축 방법에서는 영문 1바이트 대신에 2바이트 한 음절을 하나의 심볼로 간주하여 더 긴 비트 크기의 문자열에 대해 부호어를 할당시킴으로, 압축율의 향상을 가져온다.As described above, the data compression method according to the present invention considers two syllables as one symbol instead of one byte in English and assigns a codeword to a string having a longer bit size, thereby improving the compression rate.

Claims

In the LZW compression method for data compression to improve the data transfer rate, initializing the 7-bit ASCII code node and the Hangul code node at the beginning of the dictionary, reading the input character, taking a string (w), and writing the next input character. Reads and takes the character (k), and determines whether the string (w) and the character (k) exist in the dictionary tree, and if present, makes the string and the character a new string, and if not, assigns it to the string W Outputting one codeword, generating a new node for the dictionary tree, assigning the codeword, and taking the character (k) as a new string.

The LZW method according to claim 1, wherein in the initializing of the 7-bit ASCII code node and the Hangul code node, the number of 7-bit ASCII code nodes is 128, and the Hangul code node is 2350. Data compression method used.

The data compression method using LZW according to claim 1 or 2, wherein a compression method for 2-byte Hangul code is used for the Hangul code node.