JPH08272814A

JPH08272814A - Character string retrieval device

Info

Publication number: JPH08272814A
Application number: JP7076948A
Authority: JP
Inventors: Masako Endo; 雅子遠藤
Original assignee: Hitachi Software Engineering Co Ltd
Current assignee: Hitachi Software Engineering Co Ltd
Priority date: 1995-03-31
Filing date: 1995-03-31
Publication date: 1996-10-18

Abstract

PURPOSE: To provide a character string retrieval device capable of shortening retrieval time and reducing the amount of memory used by the dictionary file. CONSTITUTION: A file for dictionary retrieval (dictionary file) 70 is composed of a directory block 710, a plurality of character string blocks 720, and an EOF block 730. The directory block 710 stores the leading address of each character string block, and the blocks are associated with the kind of leading character and the number of characters of the character strings they hold. The actual character strings are stored in the character string blocks 720. From the number of characters of an input character string and the kind of its leading character, the address of the corresponding character string block is obtained from the directory block 710, the character string group of that character string block 720 is read using the address, and retrieval is performed.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a character string search device and, more specifically, to a character string search device that is effective for searching character strings such as those of books and papers and for searching codes, product numbers, and the like.

[0002]

[Prior Art] Conventional character string search devices generally designate a key and search for entries that match the key value. FIG. 5 shows a configuration example of a dictionary file used in this type of character string search device. For example, a search for the character string "Komatsu" (コマツ) proceeds as follows. First, based on the input character string "Komatsu", the index section is searched for the corresponding key value (number of characters "3", first character in the "ka" row (カ行)). Next, the data section is searched based on that key value to determine whether the character string "Komatsu" is present.

[0003]

[Problems to be Solved by the Invention] The prior art described above requires a two-stage search (index section, then data section), so as the number of searches grows the processing takes a very long time and efficiency deteriorates. For example, even checking whether the same name is already registered in the dictionary file required a two-stage search, and so the search took time. In addition, the search cannot be performed unless the data section also holds the key values, so the amount of memory used grows as the amount of data (character strings) grows.

[0004] An object of the present invention is to provide a character string search device that, compared with the prior art, shortens the search time and also reduces the amount of memory used by the dictionary file.

[0005]

[Means for Solving the Problems] In the character string search device of the present invention, the dictionary file is composed of a plurality of character string blocks, in which character strings are classified by the combination of their number of characters and their first character and stored block by block, and a directory block in which the addresses of the character string blocks are registered. The device has means for obtaining the address of the corresponding character string block from the directory block using the number of characters and the first character of the input character string, reading the character strings of that character string block using the address, and searching for a character string that matches or is similar to the input character string.
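The two-part layout described here (a directory that maps a key to a block address, and blocks that hold only strings) can be illustrated with a minimal Python sketch. The function names `build_dictionary` and `contains`, the use of list indices as addresses, and the use of the literal first character as the key component are assumptions made for illustration; the patent does not specify an on-disk format.

```python
from collections import defaultdict


def build_dictionary(strings):
    """Group strings by (length, first character) and give each group an address.

    Returns (directory, blocks): directory maps a key to a block address,
    and blocks[address] is the list of strings stored in that block.
    Addresses are plain list indices here, which is an assumption.
    """
    groups = defaultdict(list)
    for s in strings:
        if s:  # ignore empty strings, which have no first character
            groups[(len(s), s[0])].append(s)

    directory = {}
    blocks = []
    for key in sorted(groups):
        directory[key] = len(blocks)  # address of the block about to be added
        blocks.append(groups[key])
    return directory, blocks


def contains(directory, blocks, query):
    """Return True if query is stored; only one block is ever read."""
    if not query:
        return False
    address = directory.get((len(query), query[0]))
    if address is None:
        return False  # no block exists for this length and first character
    return query in blocks[address]


directory, blocks = build_dictionary(["コマツ", "コマチ", "カトウ", "アベ"])
found = contains(directory, blocks, "コマツ")
```

Because the key is computed from the query itself, the directory is consulted by direct lookup rather than searched, and the strings in a block do not need to carry the key.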

[0006]

[Operation] The address of the corresponding character string block can be obtained directly from the directory block using the number of characters of the input character string and its first character. The data (character string group) of that character string block is read at this address and searched for a character string that matches or is similar to the input character string. As a result, only one search, over the character string group of the character string block, is needed, and because the data can be read block by block, the number of accesses is small and the search time can be shortened. Furthermore, the index section is managed as a single directory block and the data section stores only character strings, so the memory required is kept to a minimum.

[0007]

[Embodiment] An embodiment of the present invention will be described below with reference to the drawings.

[0008] FIG. 1 is an overall configuration diagram of an embodiment of the character string search device of the present invention. The system comprises a display 10 that displays input character strings to be searched and search results; a keyboard 20 for entering character strings to be searched, commands, and the like; a printer 30 that outputs search results; an input file 40 that stores the character strings to be searched; a dictionary input file 50 that stores the character strings to be registered in the dictionary; an input search file 60 that stores the search character strings created from the input file 40; a dictionary search file 70 that stores the search dictionary created from the dictionary input file 50; an auxiliary storage device 80 that stores intermediate files, processing result files, and the like; and a CPU (central processing unit) 100 that performs processing such as creating expanded character strings, encoding, searching, and editing. The input file 40 and the dictionary input file 50 are sequential (SAM) files held on magnetic tape or magnetic disk. The input search file 60 and the dictionary search file 70 are direct access (DAM) files. The printer 30 is, for example, a kanji printer.

[0009] FIG. 2 is a flow chart showing the overall operation of the CPU 100. The input file 40 and the dictionary input file 50 are SAM files. Expanded character strings are created from these files 40 and 50, and the character strings are encoded (steps 110 and 210). The results are output as the expanded input file 120 and the expanded dictionary file 220, which are also SAM files. Next, where the same character string occurs more than once in the expanded files 120 and 220, the duplicates are merged into one, and the input search file 60 and the dictionary search file 70 are created (steps 130 and 230). These files 60 and 70 are DAM files. The character strings of the input search file 60 are then searched for with reference to the dictionary search file 70 (step 140). This is described in detail later. The input search result file 150 and the dictionary search result file 250 hold the output of the search. The input search result file 150 stores all the information on the character strings extracted from the input file 40, together with information on which character string of the dictionary input file 50 each one was similar to, or that it was not similar to any. The dictionary search result file 250 stores the character strings that were found to be similar. These search result files 150 and 250 are integrated and edited to create a search result list 170 (step 160). In this integrated editing, information other than the character strings is obtained from the original inputs, the input file 40 and the dictionary input file 50. This is because the character string data alone is already enormous at search time, so no extra data is kept in the files used during the search. The created search result list 170 is output to the display 10 or the printer 30.
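The idea of keeping only unique character strings in the search files and rejoining the remaining information afterwards can be sketched as follows. This is an illustrative sketch only: the record shape `(string, extra_info)` and the function names `build_search_file` and `merge_results` are hypothetical and do not come from the patent.

```python
def build_search_file(records):
    """Keep one copy of each string, in first-seen order (cf. steps 130 and 230).

    records is an iterable of (string, extra_info) pairs read sequentially.
    Only the strings are kept, so the search file carries no extra data.
    """
    return list(dict.fromkeys(text for text, _info in records))


def merge_results(records, matched):
    """Rejoin search outcomes with the original records (cf. step 160).

    matched is the set of strings for which the search found a match.
    The extra information is taken from the original records, not from
    the search file.
    """
    return [(text, info, text in matched) for text, info in records]


records = [("コマツ", "record 1"), ("カトウ", "record 2"), ("コマツ", "record 3")]
search_file = build_search_file(records)
report = merge_results(records, {"コマツ"})
```

Each distinct string is searched once, however many input records contain it, and the per-record details only reappear in the final list.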

[0010] If the dictionary search file 70 has already been prepared and the character string to be searched is entered from the keyboard 20, the search in step 140 may be executed directly on the entered character string with reference to the dictionary search file 70.

[0011] FIG. 3 shows a configuration example of the dictionary search file 70. The dictionary search file 70 is composed of a directory block 710, a plurality of character string blocks 720, and an EOF block 730. The character strings (data) are classified by the combination of their number of characters and their first character and divided into a plurality of blocks (character string blocks), and the directory block 710 stores the addresses of those blocks. The directory block 710 thus serves as the index. In FIG. 3, the directory block 710 divides the first characters of the character strings into the "a" row (ア行), the "ka" row (カ行), and so on, and stores the start address of each character string block for each number of characters. The actual character strings are stored in the character string blocks 720, and the EOF block 730 marks the end of the file. In the example of FIG. 3 the character strings are shown in Japanese for ease of understanding, but in practice they are handled as codes.
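The classification by kana row and number of characters can be sketched as a key function. The row table below is an assumption: in particular, placing voiced and semi-voiced kana (ガ, ザ, パ, and so on) in the row of their unvoiced counterparts, and grouping ン with the "wa" row, are choices made for this sketch and are not stated in the patent. The names `KANA_ROWS`, `ROW_OF`, and `directory_key` are likewise hypothetical.

```python
# Katakana grouped by row; each row is named by its first kana.
# The handling of voiced kana and of ン is an assumption of this sketch.
KANA_ROWS = {
    "ア": "アイウエオ",
    "カ": "カキクケコガギグゲゴ",
    "サ": "サシスセソザジズゼゾ",
    "タ": "タチツテトダヂヅデド",
    "ナ": "ナニヌネノ",
    "ハ": "ハヒフヘホバビブベボパピプペポ",
    "マ": "マミムメモ",
    "ヤ": "ヤユヨ",
    "ラ": "ラリルレロ",
    "ワ": "ワヲン",
}

# Reverse lookup: individual kana -> name of its row.
ROW_OF = {ch: row for row, chars in KANA_ROWS.items() for ch in chars}


def directory_key(text):
    """Return (row of first character, number of characters) for a string.

    Characters outside the table fall into a catch-all "other" row.
    """
    return (ROW_OF.get(text[0], "other"), len(text))


key = directory_key("コマツ")  # first character コ belongs to the カ row
```

With such a key, strings such as "コマツ" and "カトウ" share one directory entry, so the directory needs only one entry per row and length rather than one per distinct first character.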

[0012] Next, how a character string search is actually performed using the dictionary search file 70 of FIG. 3 is described, taking the input character string "Komatsu" (コマツ) as an example. FIG. 4 shows the processing flow for this case.

[0013] When the input character string is "Komatsu", first the start address "0103" of the character string block for the "ka" row (カ行) and three characters is obtained from the directory block 710 (step 810). Next, the data (character strings) are read from the character string block at address "0103" (step 820), and the character strings of the block are examined in order (step 830). For each one, it is determined whether it matches or is similar to the input character string "Komatsu" (step 840). If it does, the search result is output (step 850) and processing moves on to the next character string (step 860); if it does not, step 850 is skipped. A flag indicating whether the block continues in the next block is placed at the end of each character string block, and this flag is checked to decide whether the next block should also be searched.
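The flow of steps 810 to 860, including the continuation flag, can be sketched as below. Several details are assumptions for illustration: blocks are held in a Python list and addressed by index, a continued block is taken to be the one at the next address, and the similarity test is a caller-supplied predicate because the patent does not define how similarity is judged. `Block`, `search`, `TOY_ROW`, and `key_of` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Block:
    strings: list
    has_next: bool = False  # flag at the end of the block: continues in next block


def search(directory, blocks, query, key_of, matches=lambda q, c: q == c):
    """Return the stored strings that match (or are similar to) query."""
    address = directory.get(key_of(query))      # step 810: address from directory
    hits = []
    while address is not None:
        block = blocks[address]                  # step 820: read one block
        for candidate in block.strings:          # steps 830/860: examine in order
            if matches(query, candidate):        # step 840: match or similar?
                hits.append(candidate)           # step 850: output the result
        # Follow the continuation flag; "next address" is an assumption here.
        address = address + 1 if block.has_next else None
    return hits


# Toy data: one "ka" row, three-character group spread over two blocks.
TOY_ROW = {"カ": "カ", "コ": "カ"}


def key_of(text):
    return (TOY_ROW[text[0]], len(text))


directory = {("カ", 3): 0}
blocks = [Block(["カトウ", "コマチ"], has_next=True), Block(["コマツ"])]

exact = search(directory, blocks, "コマツ", key_of)
similar = search(directory, blocks, "コマキ", key_of,
                 matches=lambda q, c: q[:2] == c[:2])
```

Here the prefix comparison passed as `matches` merely stands in for whatever similarity criterion an implementation would use.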

[0014]

[Effects of the Invention] As described above, with the character string search device of the present invention, only one search, over the data in a character string block of the dictionary file, is needed, and because the data can be read block by block, the number of accesses is small and the search time can be shortened. Furthermore, the index section is managed as a single directory block and the data section stores only character strings in the character string blocks, so the memory required is kept to a minimum. The character string search device of the present invention is therefore particularly suited to search targets that involve many searches and a large amount of data.

[Brief description of drawings]

FIG. 1 is an overall configuration diagram of an embodiment of the character string search device of the present invention.

FIG. 2 is a flow chart showing the overall operation of the device of FIG. 1.

FIG. 3 is a diagram showing a configuration example of the dictionary search file according to the present invention.

FIG. 4 is a flow chart showing a specific processing example of the character string search according to the present invention.

FIG. 5 is a diagram illustrating conventional character string search processing.

[Explanation of symbols]

10 Display
20 Keyboard
30 Printer
40 Input file
50 Dictionary input file
60 Input search file
70 Dictionary search file
710 Directory block
720 Character string block
100 CPU

Claims

1. A character string search device that searches a dictionary file for a character string that matches or is similar to an input character string, wherein the dictionary file is composed of a plurality of character string blocks, in which character strings are classified by the combination of their number of characters and their first character and stored block by block, and a directory block in which the addresses of the character string blocks are registered, the device comprising means for obtaining the address of the corresponding character string block from the directory block using the number of characters and the first character of the input character string, reading the character strings of that character string block using the address, and searching for a character string that matches or is similar to the input character string.