JPH0511973A

JPH0511973A - Data compression method using universal code

Info

Publication number: JPH0511973A
Application number: JP16554291A
Authority: JP
Inventors: Shigeru Yoshida; 茂吉田; Yoshiyuki Okada; 佳之岡田; Yasuhiko Nakano; 泰彦中野; Hirotaka Chiba; 広隆千葉
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1991-07-05
Filing date: 1991-07-05
Publication date: 1993-01-22
Anticipated expiration: 2015-08-21
Also published as: JP3078601B2

Abstract

(57) [Abstract]

[Purpose] To obtain a high compression ratio in a data compression method using a universal code, which compresses and encodes an input character string such as character information with a universal-type algorithm, by taking the correlation between characters into account and thereby reducing the redundancy between encoded character strings.

[Structure] Encoded character strings are held in a dictionary 12, the encoded character strings in the dictionary 12 are searched for the substring that has the longest match with the input character string of a character input unit 10, and that longest-match substring is encoded as a pair of start position and match length. In this method, the matching substrings S1, S2, S3, S4 of the encoded character strings held in the dictionary 12 that begin with the same leading character as the immediately preceding character 18 of the input character string are searched, and the longest-match substring S4 among them is found. When S4 is encoded, its start position is expressed by the order of appearance of the immediately preceding character at the head of the longest-match substring, and this is encoded as a pair with the match length.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a data compression method using a universal code, which compresses and encodes an input character string such as character information with a universal-type algorithm. In recent years, many kinds of data, such as character codes, vector information, and images, have come to be handled by computers, and the amount of data handled is increasing rapidly. When large amounts of data are handled, removing the redundant parts of the data to compress it reduces the required storage capacity and allows faster transmission.

[0002] Universal coding has been proposed as a way to compress such diverse data with a single method. The field of the present invention is not limited to the compression of character codes and applies to many kinds of data. In the following, however, the terminology of information theory is followed: one word of data is called a character, and any number of words joined together is called a character string.

[0003] A representative universal code is the Ziv-Lempel code (for details see, for example, Munakata, "Ziv-Lempel Data Compression Method", Information Processing, Vol. 26, No. 1, 1985). Two algorithms have been proposed for the Ziv-Lempel code: (1) the universal type and (2) the incremental parsing type.

[0004] Improvements on the universal-type algorithm include the LZSS code (see T.C. Bell, "Better OPM/L Text Compression", IEEE Trans. on Commun., Vol. COM-34, No. 12, Dec. 1986) and the QIC-122 code, the standard compression method for 1/4-inch cartridge magnetic tape. An improvement on the incremental-parsing algorithm is the LZW (Lempel-Ziv-Welch) code (see T.A. Welch, "A Technique for High-Performance Data Compression", Computer, June 1984).

[0005] These improved codes have come to be used for file compression on auxiliary storage devices and for data transmission in personal-computer communications.

[0006]

[Prior Art] First, the conventional universal-type algorithm and the QIC-122 code, one of its improvements, are described.

[Universal-type algorithm] The universal-type algorithm is a data compression method that requires a large amount of computation but achieves a high compression ratio.

[0007] In the universal-type algorithm, the character string to be encoded is divided into sequences (so-called substrings) that have the longest match starting from any position in the already-encoded character string, and the input string is encoded as a copy of the past longest-match substring. FIG. 10 shows the encoding scheme of the universal-type Ziv-Lempel code. In FIG. 10, the P buffer 12, which functions as a dictionary, stores the character string already input, and the Q buffer 10, which serves as the character input unit, holds the character string about to be encoded. The pattern matching unit 14 compares the string in the Q buffer 10 with the sequences in the P buffer 12 and searches the P buffer 12 for the longest matching substring. To designate the longest-match substring found in the P buffer 12, it is encoded as the information pair shown in FIG. 11: the start position (start address) of the longest-match sequence in the P buffer, and the match length. If there is no matching sequence, the raw data is output together with a symbol indicating a mismatch.

[0008] Next, the encoded character string in the Q buffer 10 is moved to the P buffer 12 and registered as a new encoded character string. The same operation is then repeated, so that the input string is parsed into substrings and encoded. In this way the Ziv-Lempel code encodes the current string as a copy of a past, already-encoded string. With the Ziv-Lempel code, character-code document information can be compressed to about 1/2.
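The parsing loop of paragraphs [0007] and [0008] can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the method of FIG. 10 itself: the names `longest_match` and `parse` are hypothetical, the whole already-encoded prefix stands in for the P buffer, the start position is an absolute index from the beginning of the data, and a match is allowed to run on into the text being encoded.

```python
def longest_match(data, i):
    """Return (start, length) of the longest match for data[i:] that begins in data[:i]."""
    best_start, best_len = 0, 0
    for start in range(i):                      # every position in the "P buffer"
        n = 0
        while i + n < len(data) and data[start + n] == data[i + n]:
            n += 1
        if n > best_len:
            best_start, best_len = start, n
    return best_start, best_len


def parse(data):
    """Greedy parse into ('copy', start, length) and ('raw', char) tokens."""
    tokens, i = [], 0
    while i < len(data):
        start, length = longest_match(data, i)
        if length == 0:                         # no match: emit the raw character
            tokens.append(("raw", data[i]))
            i += 1
        else:                                   # copy of an earlier substring
            tokens.append(("copy", start, length))
            i += length
    return tokens
```

For the input `"abcabcab"` this sketch emits three raw characters followed by a single copy token that starts at position 0 and covers the remaining five characters.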

[0009] [QIC-122 code] This is the code adopted as the standard compression method for 1/4-inch cartridge magnetic tape by QIC (Quarter Inch Cartridge Standard Inc.), a group of manufacturers led by 3M. The QIC-122 algorithm keeps a 2048-byte history as the P buffer and has two modes: one that represents the string to be encoded in the Q buffer as a copy of a string in the P buffer, and one that encodes raw data one byte at a time. When the longest matching string in the P buffer has two or more characters, it is encoded in copy mode; otherwise it is encoded in raw-data mode.

[0010] FIG. 12 shows the codeword format of the QIC-122 code expressed in the BNF metalanguage. The metasymbols used in the BNF metalanguage have the meanings shown in FIG. 13. The codeword format of the QIC-122 code in FIG. 12 is explained in detail as follows.
(1) The compressed stream consists of compressed strings and an end marker.

[0011]
(2) A compressed string is either raw data, expressed as identification bit 0 followed by an ASCII raw byte, or compressed data, expressed as identification bit 1 followed by compressed bytes.
(3) An ASCII raw byte is expressed as one byte of 8 bits.
(4) Compressed bytes consist of a pair of offset (start position) and length (match length).
(5) The offset (start position) is expressed in 7 bits when its identification bit is 1, and in 11 bits when its identification bit is 0.
(6) The end marker is 110000000, that is, an offset of 0.
(7) A bit b is 0 or 1.
(8) The length (match length) is expressed with a variable-length code as shown in FIG. 12.
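Items (2) to (8) can be sketched as bit strings. This is a hedged Python illustration rather than a reference implementation: the function names are hypothetical, and the table in `length_code` is an assumption, because FIG. 12 is not reproduced in this text. The table was chosen so that it agrees with the two worked codewords in paragraphs [0012] and [0013] (length 5 written as 1100, length 3 written as 01).

```python
END_MARKER = "110000000"  # item (6): flag 1, 7-bit selector 1, offset 0


def raw_codeword(ch):
    """Items (2), (3): identification bit 0 followed by the 8-bit raw byte."""
    return "0" + format(ord(ch), "08b")   # assumes a single-byte (ASCII) character


def length_code(n):
    """Variable-length code for a match length n >= 2 (assumed table)."""
    if n < 2:
        raise ValueError("match length must be at least 2")
    if n <= 4:
        return format(n - 2, "02b")          # 2, 3, 4 -> 00, 01, 10
    if n <= 7:
        return "11" + format(n - 5, "02b")   # 5, 6, 7 -> 1100, 1101, 1110
    bits, n = "1111", n - 8
    while n >= 15:                           # each extra 1111 nibble adds 15
        bits += "1111"
        n -= 15
    return bits + format(n, "04b")


def copy_codeword(offset, length):
    """Items (2), (4), (5): flag 1, offset-size bit, offset, then length."""
    if not 1 <= offset < 2048:
        raise ValueError("offset out of range")
    if offset < 128:
        return "11" + format(offset, "07b") + length_code(length)
    return "10" + format(offset, "011b") + length_code(length)
```

Under these assumptions, `copy_codeword(1, 5)` and `copy_codeword(9, 3)` give the same bit sequences as the two copy codewords in the example of FIG. 14.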

[0012] FIG. 14 shows a concrete example of encoding with the QIC-122 code, taking the input string "ABAAAAAACABA" as an example. For the first three characters "ABA", the number of matching characters in the P buffer is one or fewer, so the bit sequences of ASCII raw bytes are output. The five "A"s from the 4th to the 8th character match the immediately preceding character "A" in the P buffer, so they are output as the bit sequence "1 1 0000001 1100", which consists of the compressed-byte identification bit, the 7-bit-offset identification bit, offset = 1, and length = 5 bytes.

[0013] Here the offset value, which indicates the start position of the longest-match substring, gives the position counted backward from the most recently registered position (address) in the P buffer. The 9th character "C" is not in the P buffer, so an ASCII raw byte is output. The 10th to 12th characters "ABA" are already registered as the first three characters of the P buffer, so the bit sequence "1 1 0001001 01", which consists of the compressed-byte identification bit, the 7-bit-offset identification bit, offset = 9, and length = 3 bytes, is output.

[0014] All input characters have now been encoded, so "1 1 0000000" is output as the end data and processing ends.
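As a cross-check of the worked example in paragraphs [0012] to [0014], the bit stream can be decoded back into text. The Python sketch below is an illustration only: `decode` is a hypothetical name, and the length decoding assumes the variable-length table 00, 01, 10, 1100, 1101, 1110, then 1111 followed by 4-bit groups, which is consistent with the two copy codewords in the example but is not given explicitly in this text.

```python
def decode(bits):
    """Decode a QIC-122-style bit string (spaces ignored) back to text."""
    bits = bits.replace(" ", "")
    out, i = [], 0

    def take(n):
        """Read the next n bits as an unsigned integer."""
        nonlocal i
        value = int(bits[i:i + n], 2)
        i += n
        return value

    while True:
        if take(1) == 0:                      # raw-data mode: one 8-bit byte
            out.append(chr(take(8)))
            continue
        offset = take(7) if take(1) == 1 else take(11)
        if offset == 0:                       # end marker
            break
        length = take(2) + 2                  # 00, 01, 10 -> 2, 3, 4
        if length == 5:                       # prefix 11: longer code follows
            extra = take(2)
            length += extra                   # 1100, 1101, 1110 -> 5, 6, 7
            if extra == 3:                    # prefix 1111: 4-bit groups follow
                length = 8
                while True:
                    nibble = take(4)
                    length += nibble
                    if nibble != 15:
                        break
        for _ in range(length):               # copying one character at a time
            out.append(out[-offset])          # lets the copy overlap itself
    return "".join(out)


STREAM = ("0 01000001 0 01000010 0 01000001 "   # raw A, B, A
          "1 1 0000001 1100 "                   # offset 1, length 5
          "0 01000011 "                         # raw C
          "1 1 0001001 01 "                     # offset 9, length 3
          "1 1 0000000")                        # end marker
```

Decoding `STREAM` with this sketch reproduces the input string of FIG. 14. The offset-1 copy works only because characters are copied one at a time, so the five "A"s are regenerated from the single "A" that precedes them.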

[0015]

[Problems to Be Solved by the Invention] However, conventional data compression methods using universal codes such as the Ziv-Lempel code and the QIC-122 code encode the longest-match string to be copied as a pair of "match start position" and "match length". Because the match start position is expressed as an address in the memory that forms the P buffer, it must be expressed with more bits as the amount of encoded text held in the P buffer grows. Redundancy therefore remains between the encoded strings, and a high compression ratio cannot be obtained.

[0016] The present invention was made in view of these conventional problems. Its object is to provide a data compression method using a universal code that obtains a high compression ratio by taking the correlation between characters into account and thereby reducing the redundancy between encoded character strings.

[0017]

[Means for Solving the Problems] FIG. 1 illustrates the principle of the present invention. The invention concerns a data compression method using a universal code in which encoded character strings are held in a dictionary (P buffer) 12, the encoded strings in the dictionary 12 are searched for the substring that has the longest match with the input character string of a character input unit (Q buffer) 10, and the result is encoded as a pair of the start position and match length of the longest-match substring.

[0018] As such a data compression method, the present invention is characterized by an encoding unit 14 that searches the encoded character strings held in the dictionary 12 for the matching substrings S1, S2, S3, S4 that begin with the same leading character as the immediately preceding character 18 of the input string of the character input unit 10, finds the longest-match substring S4 among them, and, when encoding S4, uses as the start position the order of appearance (4th) of the immediately preceding character at the head of S4, encoding it as a pair with the match length.

[0019] When the longest-match substring found in the dictionary 12 is a run S2 of repeated leading characters 18, the encoding unit 14 assigns an order of appearance to the start position or the end position of the run S2 and encodes it as a pair with the match length. In this case, two orders of appearance may instead be assigned, one to the start position and one to the end position of the run S2, and both may be encoded together with the match length.

[0020] When the longest-match substring of the encoded strings in the dictionary 12 has fewer than a predetermined number of characters, for example fewer than three, the encoding unit 14 outputs the input string as is without encoding it. When it has the predetermined number of characters or more, for example three or more, the encoding unit 14 encodes it as a pair of order of appearance and match length. Alternatively, the encoding unit 14 may encode with the pair of substring start position and match length (the conventional method) when the longest-match substring has few characters, and with the pair of order of appearance and match length (the method of the invention) when it has many.

[0021] Furthermore, the encoding unit 14 encodes the input string using, for example, the codewords of the QIC-122 code, the standard compression method for 1/4-inch cartridge magnetic tape.

[0022]

[Operation] According to the present invention as described above, all matching substrings in the encoded text whose leading character is the immediately preceding, already-encoded character (the most recently registered character in the P buffer) of the string to be encoded in the Q buffer are searched, so that their order of appearance is known. When the string is encoded as a copy of the longest-match substring among them, the conventional method used the pair of start position and match length of that substring, whereas the present invention uses the pair of the substring's order of appearance in the P buffer and the match length.

[0023] Because the number of orders of appearance of the substrings is far smaller than the number of conventional start positions (start addresses), the order of appearance can be expressed in fewer bits, which greatly improves the compression ratio. For example, the present invention can be applied to QIC-122 encoding: the P buffer is searched for the longest matching string that continues from the immediately preceding character, that is, the one character just before the Q buffer, and the start position of the longest matching string is expressed by the ordinal of that occurrence of the immediately preceding character in the P buffer.

[0024]

[Embodiments] FIG. 2 is a block diagram of an embodiment of the present invention. In FIG. 2, reference numeral 16 denotes a buffer memory, which is allocated to a Q buffer 10, serving as the character string input unit that stores the string to be encoded, and a P buffer 12, functioning as a dictionary in which encoded strings are registered.

[0025] Reference numeral 14 denotes a pattern matching unit, an encoding unit implemented with a CPU. Following the universal coding algorithm, it searches the P buffer 12 for the substring that has the longest match with the string in the Q buffer 10 and outputs a codeword consisting of the pair of start position and match length as a copy of that substring. FIG. 3 illustrates the encoding process of the present invention performed by the pattern matching unit 14 of FIG. 2.

[0026] In FIG. 3, suppose the Q buffer 10 holds the string "acba..." to be encoded next. The P buffer 12 is then searched for strings that contain the immediately preceding character "b" (the one character just encoded) and, following that "b", match the string in the Q buffer 10. Suppose the strings S1, S2, S3, S4 are found to satisfy this condition.

[0027] Because registration into the P buffer 12 proceeds from the right, the strings S1 to S4 appear in the order S1: 1st, S2: 2nd, S3: 3rd, S4: 4th, and are assigned occurrence numbers 1, 2, 3, 4 in that order.

[0028] Next, the string S4, which has the longest match with the input string, is selected from S1 to S4. Its start position is expressed by occurrence number 4, which is output as a codeword paired with the match length. For a string such as S2, in which the immediately preceding character "b" is repeated, an occurrence number is assigned to either the start point or the end point of the run S2 for encoding. Alternatively, two occurrence numbers may be assigned, one to the start point and one to the end point of the run S2, and encoded as a codeword consisting of the two occurrence numbers and the match length.
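The search described in paragraphs [0026] to [0028] can be sketched as follows. This Python illustration rests on several assumptions that are not fixed by the text: `rank_match` is a hypothetical name; every earlier occurrence of the immediately preceding character receives an occurrence number, so that a decoder could rebuild the same numbering without seeing the Q buffer; the last character of the P buffer (the preceding character itself) is not counted; matches stay inside the P buffer; and the special treatment of runs such as S2 is left out.

```python
def rank_match(p, q, oldest_first=True):
    """Return (occurrence number, match length) of the longest context match.

    p is the P buffer (encoded text), q the Q buffer (text to encode).
    The match length counts the immediately preceding character itself.
    Returns None when the preceding character has no earlier occurrence.
    """
    if not p:
        return None
    prev = p[-1]                                  # immediately preceding character
    positions = [i for i in range(len(p) - 1) if p[i] == prev]
    if not oldest_first:
        positions.reverse()                       # count from the newest instead
    best = None
    for rank, pos in enumerate(positions, start=1):
        n = 0                                     # characters of q matched after prev
        while n < len(q) and pos + 1 + n < len(p) and p[pos + 1 + n] == q[n]:
            n += 1
        length = n + 1                            # include the preceding character
        if best is None or length > best[1]:
            best = (rank, length)
    return best
```

With `p = "xbacyybzzbacb"` and `q = "acba"`, the preceding character "b" occurs three times earlier in `p`. The third occurrence is followed by "acb", so the sketch returns occurrence number 3 and a match length of 4.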

[0029] FIG. 4 is a flowchart of an embodiment of the universal coding algorithm of the present invention, taking the QIC-122 code as an example. In FIG. 4, step S1 first empties the P buffer and fills the Q buffer with the input data to be encoded. Step S2 then searches the P buffer for the longest string S that matches the string starting from the immediately preceding character of the Q buffer. Step S3 then determines whether the longest string S found has three or more characters.

[0030] If the longest string S has one or two characters, the process proceeds to step S4 and enters raw-data mode, outputting flag bit 0, which indicates raw-data mode, followed by one byte of raw data in ASCII code. If the longest string S has three or more characters, the process proceeds to step S5 and enters copy mode, encoding the pair of the order of appearance of the longest string and the match length after flag bit 1, which indicates compressed data.

[0031] In step S6 the encoded string or character in the Q buffer is moved to the P buffer, and the same number of new characters is input to the Q buffer. Because the P buffer is fixed at 2048 bytes in the QIC-122 algorithm, the oldest characters, as many as were moved into the P buffer, are discarded from it. The same processing is then repeated.
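The loop of FIG. 4 (steps S1 to S6) might be sketched as below. This is an illustration under assumptions, not the patented implementation: the names are hypothetical; output is a list of tokens rather than bits; occurrence numbers are counted from the oldest occurrence and only the first seven are usable; matches stay inside the P buffer; the copy token stores the string length minus 2; and `decode` is added here only to check that the tokens can be reversed.

```python
WINDOW = 2048   # P buffer size stated for QIC-122
MAX_RANK = 7    # assumed limit: 3-bit occurrence number with 0 reserved
MIN_LEN = 3     # threshold of step S3, counting the preceding character


def occurrences(p):
    """Positions in p (oldest first) holding the same character as p[-1]."""
    return [i for i in range(len(p) - 1) if p[i] == p[-1]][:MAX_RANK]


def encode(text):
    tokens, p, i = [], "", 0                      # step S1: empty P buffer
    while i < len(text):
        best_rank, best_n = 0, 0
        for rank, pos in enumerate(occurrences(p), start=1):   # step S2
            n = 0                                 # new characters matched
            while (i + n < len(text) and pos + 1 + n < len(p)
                   and p[pos + 1 + n] == text[i + n]):
                n += 1
            if n > best_n:
                best_rank, best_n = rank, n
        if best_n + 1 >= MIN_LEN:                 # step S3, then S5: copy mode
            tokens.append(("copy", best_rank, best_n + 1 - 2))
            step = best_n
        else:                                     # step S4: raw-data mode
            tokens.append(("raw", text[i]))
            step = 1
        p = (p + text[i:i + step])[-WINDOW:]      # step S6: slide the window
        i += step
    return tokens


def decode(tokens):
    out, p = [], ""
    for tok in tokens:
        if tok[0] == "raw":
            chunk = tok[1]
        else:
            _, rank, field = tok
            pos = occurrences(p)[rank - 1]        # same numbering as the encoder
            n = field + 2 - 1                     # new characters to copy
            chunk = p[pos + 1:pos + 1 + n]
        out.append(chunk)
        p = (p + chunk)[-WINDOW:]
    return "".join(out)
```

Because the decoder rebuilds the occurrence list from its own copy of the P buffer, the occurrence number alone is enough to locate the start of the copied string.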

[0032] FIG. 5 illustrates the format, based on QIC-122 codewords, that is encoded in the copy mode of FIG. 4. Compared with FIG. 12, the offset is a 3-bit field holding the occurrence number, so up to seven string start positions can be encoded. Occurrence number 0 is used for the END mark. Because the match covers a string of three or more characters including the immediately preceding character, the match length is taken as |S| - 2.

[0033] Whereas the conventional offset indicating the start position took 7 or 11 bits, in the present invention the occurrence-number offset is expressed in as few as 3 bits, which improves the compression ratio. FIG. 6 shows the data format output in the raw-data mode of FIG. 4 and the data format of the codeword output in the copy mode.
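The saving claimed in paragraph [0033] can be made concrete by counting the bits that precede the length field. The Python sketch below is an assumption-laden illustration: the function names are hypothetical, and it assumes that the occurrence-number codeword is flag bit 1 followed directly by the 3-bit occurrence number, with no offset-size bit, which FIG. 5 (not reproduced here) would have to confirm.

```python
def conventional_header_bits(offset):
    """Flag bit + offset-size bit + 7- or 11-bit offset, as in paragraph [0011]."""
    return 1 + 1 + (7 if offset < 128 else 11)


def rank_header(rank):
    """Assumed layout: flag bit 1 followed by a 3-bit occurrence number."""
    if not 1 <= rank <= 7:                # 0 is reserved for the END mark
        raise ValueError("occurrence number must be 1..7")
    return "1" + format(rank, "03b")
```

Under these assumptions a copy codeword spends 9 or 13 bits before the length in the conventional format and 4 bits in the occurrence-number format.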

[0034] FIG. 7 is a flowchart of an encoding algorithm using the QIC-122 code according to another embodiment of the present invention. This embodiment is characterized by combining, in addition to the raw-data mode of FIG. 4 and the copy mode that outputs codewords using the order of appearance, a copy mode that outputs codewords using the conventional start position (start address).

[0035] In FIG. 7, after initialization in step S1, step S2 searches, as in step S2 of FIG. 4, for the longest-match string S1 that follows the immediately preceding character, and at the same time obtains its order of appearance and match length. Step S3 then searches the P buffer for the longest string S2 irrespective of the immediately preceding character and obtains its start position and match length.

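[0036] Next, it is checked whether the longest string S1 found in step S2 has three or more characters. If so, the process proceeds to step S8 and outputs the codeword of copy mode (2) in FIG. 9(c). If it has fewer than three characters, the process proceeds to step S5 and checks whether the longest string S2 found in step S3 has two or more characters. If so, the process proceeds to step S7 and outputs the codeword of copy mode (1) in FIG. 9(b), that is, it encodes the pair of start position and match length.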
It is checked whether 2 is 3 characters or more. If 3 characters or more, the process proceeds to step S8 to output the code word of the duplication mode (2) of FIG. 9C. If it is less than 3 characters, the process proceeds to step S5, and it is checked whether the longest character string S2 searched in step S3 is 2 characters or more. If it is 2 characters or more, the process proceeds to step S7, and the duplication mode (1 ) Codeword, that is, a set of the start position and the matching length is encoded.

[0037] If the longest string S2 is a single character, the process proceeds to step S6 and outputs the raw-data-mode codeword of FIG. 9(a). Once encoding by step S6, S7, or S8 is complete, the process proceeds to step S9, where the encoded string or character in the Q buffer is moved to the P buffer and the same number of new characters is input to the Q buffer. Because the capacity of the P buffer is limited, the oldest characters, as many as were moved into the P buffer, are discarded from it, and the same processing is repeated.
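The mode decision of FIG. 7 reduces to two threshold tests applied in a fixed order. A minimal Python sketch follows, assuming hypothetical names: `context_len` is the length of the string found in step S2 (counted including the immediately preceding character) and `plain_len` is the length of the string found in step S3.

```python
def select_mode(context_len, plain_len):
    """Choose the output mode from the two match lengths of steps S2 and S3."""
    if context_len >= 3:
        return "copy2"   # step S8: occurrence number + match length
    if plain_len >= 2:
        return "copy1"   # step S7: start position + match length
    return "raw"         # step S6: raw-data mode
```

The order of the tests matters: the occurrence-number mode is preferred whenever it qualifies, even if the conventional search found a longer match.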

[0038] FIG. 8 shows the format, based on QIC-122 codewords, that is encoded in copy modes (1) and (2) in the embodiment of FIG. 7. In the above embodiments the occurrence number is limited in size, for example to 3 bits, but it may instead be expressed with a variable-length code without such a limit. The order of appearance was counted from the oldest entries on the right of the P buffer toward the newest on the left, but it may conversely be counted from left to right.

[0039] Further, although the above embodiments take the QIC-122 code as an example, the invention can be applied as is to any suitable universal code such as the Ziv-Lempel code.

[0040]

[Effects of the Invention] As described above, according to the present invention, when the string to be encoded is expressed as a copy of an already-encoded string by a pair of "match start position" and "match length", the match start position, which was conventionally expressed as an address, is expressed as a relative ordinal, namely which of the strings matching the immediately preceding character it is. It can therefore be expressed in fewer bits than before, and a high compression ratio is obtained.

[Brief description of drawings]

FIG. 1 is an explanatory diagram of the principle of the present invention.

FIG. 2 is a block diagram of an embodiment of the present invention.

FIG. 3 is an explanatory diagram of the encoding process of the present invention.

FIG. 4 is a flowchart showing a QIC-122 encoding algorithm using the present invention.

FIG. 5 is an explanatory diagram of the codeword format in the copy mode of FIG. 4.

FIG. 6 is an explanatory diagram of the data format of the codewords of FIG. 4.

FIG. 7 is a flowchart showing another embodiment of the QIC-122 encoding algorithm using the present invention.

FIG. 8 is an explanatory diagram of the codeword format in the copy modes of FIG. 7.

FIG. 9 is an explanatory diagram of the data format of the codewords of FIG. 7.

FIG. 10 is an explanatory diagram of the encoding scheme of the universal-type Ziv-Lempel code.

FIG. 11 is an explanatory diagram of the data format of a universal codeword.

FIG. 12 is an explanatory diagram of the format of the QIC-122 code.

FIG. 13 is an explanatory diagram of the BNF metalanguage used in FIG. 12.

FIG. 14 is an explanatory diagram showing a concrete example of encoding with the QIC-122 code.

[Explanation of symbols]

10: Character input unit (Q buffer)
12: Dictionary (P buffer)
14: Encoding unit (pattern matching unit)
16: Buffer memory
18: Immediately preceding character

Continuation of front page
(72) Inventor: Hirotaka Chiba, c/o Fujitsu Limited, 1015 Kamiodanaka, Nakahara-ku, Kawasaki-shi, Kanagawa Prefecture

Claims


1. A data compression method using a universal code, in which already encoded character strings are held in a dictionary 12, the encoded character strings of the dictionary 12 are searched for the substring having the maximum-length match with the input character string of a character input unit 10, and the input is encoded as a pair of the start position and the match length of the maximum-length matching substring, characterized by comprising an encoding unit 14 that searches for the matching substrings S1, S2, S3, S4 of the encoded character strings held in the dictionary 12 that begin with the same leading character as the character 18 immediately preceding the input character string of the character input unit 10, finds among them the maximum-length matching substring S4, and, when encoding the maximum-length matching substring S4, uses as the start position the order of appearance of the character immediately preceding the maximum-length matching substring S4 and encodes it as a pair with the match length.

2. The data compression method using a universal code according to claim 1, characterized in that, when the encoding unit 14 searches the dictionary 12 for the substring of the encoded character string having the maximum-length match and the leading character 18 forms a continuous character string S2, the order of appearance is assigned to the start position or the end position of the continuous character string S2 and is encoded as a pair with the match length.

3. The data compression method using a universal code according to claim 1, characterized in that, when the encoding unit 14 searches the dictionary 12 for the substring of the encoded character string having the maximum-length match and the leading character 18 forms a continuous character string S2, two orders of appearance are assigned, one to the start position and one to the end position of the continuous character string S2, and the orders of appearance of both the start position and the end position of the continuous character string S2 are encoded as a pair with the match length.

4. The data compression method using a universal code according to claim 1, characterized in that, when the number of characters of the maximum-match-length substring of the encoded character string of the dictionary 12 to be encoded is less than a predetermined value, the encoding unit 14 outputs the input character string as is without encoding, and when the number of characters is the predetermined value or more, the encoding unit 14 encodes it as a pair of the order of appearance and the match length.

5. The data compression method using a universal code according to claim 1 or 4, characterized in that, when the number of characters of the maximum-match-length substring of the encoded character string of the dictionary 12 to be encoded is small, the encoding unit 14 encodes it as a pair of the start position of the substring and the match length, and when the number of characters is large, the encoding unit 14 encodes it as a pair of the order of appearance and the match length.

6. The data compression method using a universal code according to any one of claims 1 to 5, characterized in that the encoding unit 14 encodes the input character string into the QIC-122 code, which is a standard compression method for 1/4-inch cartridge magnetic tape.
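The mode selection of claim 4 can be sketched in Python as follows. This is an illustrative reading under stated assumptions, not the claimed encoder: the token tuples, the names `MIN_MATCH`, `candidates`, `encode`, and `decode`, the threshold value of 2, and the newest-first numbering are all hypothetical, and no QIC-122 bit format is produced. Matches shorter than the threshold are output as raw characters; longer ones are output as a pair of order of appearance and match length. The address-based variant of claim 5 is not shown.

```python
MIN_MATCH = 2  # assumed threshold; the claims only say "a predetermined value"


def candidates(history):
    """Start positions right after each earlier occurrence of the last
    character of history, numbered newest first (an assumed convention)."""
    if not history:
        return []
    prev = history[-1]
    return [i + 1 for i in range(len(history) - 2, -1, -1)
            if history[i] == prev]


def encode(text):
    """Emit ("raw", ch) or ("copy", order, length) tokens."""
    tokens, pos = [], 0
    while pos < len(text):
        best_rank, best_len = -1, 0
        for rank, start in enumerate(candidates(text[:pos])):
            n = 0
            # Matches are confined to already encoded text (no overlap).
            while (pos + n < len(text) and start + n < pos
                   and text[start + n] == text[pos + n]):
                n += 1
            if n > best_len:
                best_rank, best_len = rank, n
        if best_len >= MIN_MATCH:
            tokens.append(("copy", best_rank, best_len))
            pos += best_len
        else:
            tokens.append(("raw", text[pos]))
            pos += 1
    return tokens


def decode(tokens):
    """Rebuild the text; the order is resolved against the decoder's own history."""
    out = ""
    for tok in tokens:
        if tok[0] == "raw":
            out += tok[1]
        else:
            _, rank, length = tok
            start = candidates(out)[rank]
            out += out[start:start + length]
    return out


tokens = encode("abcabcabd")
print(tokens)
print(decode(tokens) == "abcabcabd")  # True
```

Because the decoder derives the same candidate list from the text it has already reconstructed, the order of appearance is sufficient to locate the copy source without transmitting an address.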