JPH09128405A

JPH09128405A - Method and device for retrieving document

Info

Publication number: JPH09128405A
Application number: JP7283999A
Authority: JP
Inventors: Toshihiro Ozaki; 敏宏尾崎; Yasuo Tanosaki; 康雄田野崎
Original assignee: Toshiba Corp; Toshiba Computer Engineering Corp
Current assignee: Toshiba Corp; Toshiba Computer Engineering Corp
Priority date: 1995-10-31
Filing date: 1995-10-31
Publication date: 1997-05-16

Abstract

PROBLEM TO BE SOLVED: To improve retrieval speed by reducing the amount of data read from an index placed on an external storage device, and to improve retrieval speed for search keys, which often consist of characters of the same type, by changing the arrangement of character position information within the index. SOLUTION: The device is provided with a maximum/minimum ID extraction part 15, which extracts the minimum ID number and the maximum ID number of the character strings constituting the plural search keywords in a search expression input at retrieval time, and an index reading part 16, which reads from the index the position information for the IDs lying between the minimum and maximum ID numbers. The position information needed for retrieval is thus obtained by reading from the index not the position information of all character strings but only that between the minimum and maximum ID numbers.

Description

Detailed Description of the Invention

[0001]

[Technical field of the invention] The present invention relates to a document retrieval apparatus that retrieves documents registered in a document database or the like on the basis of keywords, in particular to a full-text-search document retrieval apparatus whose search target is the entire text of each document, and to a document retrieval method used in this apparatus.

[0002]

[Prior art] Conventionally, some full-text searches, which take the entire text of a document as the search target, do not search the original document directly but instead search a dedicated search index created in advance. Because this index is large, it is often placed not in memory but on an external storage device such as a hard disk drive, and in a search using an index of this form much of the time was spent reading the index from the external storage device.

[0003] Furthermore, an input search key often consists of a single character type, that is, entirely katakana or entirely kanji, and no effective search exploiting this fact had been carried out.

[0004]

[Problems to be solved by the invention] As described above, in conventional document retrieval the index is large and is therefore often placed not in memory but on an external storage device such as a hard disk drive, and in a search using an index of this form much of the time was spent reading the index from the external storage device. Furthermore, an input search key often consists of a single character type, that is, entirely katakana or entirely kanji, and no effective search exploiting this fact had been carried out.

[0005] The present invention has been made in view of the above circumstances. Its object is to provide a document retrieval apparatus and a document retrieval method that improve retrieval speed by reducing the amount of data read from an index placed on an external storage device, and that further improve retrieval speed for search keys, which often consist of characters of the same type, by changing the arrangement of character position information within the index.

[0006]

[Means for solving the problems] To achieve the above object, the present invention is a document retrieval method that assigns a unique ID number to each character string, such as a character or a word, and performs document retrieval on the basis of an index covering all documents. For the character strings appearing in each document, the index stores their appearance position information in ID-number order and in a form that can be retrieved per ID number. The method is characterized in that the minimum ID number and the maximum ID number of the character strings constituting the plural search keywords are extracted from the search expression input at search time, and the position information needed for the search is obtained by reading from the index not the position information of all character strings but only the position information between the minimum ID number and the maximum ID number.

[0007] Further, to achieve the above object, in the above document retrieval method the character strings to which unique ID numbers are assigned are limited to single characters, and consecutive character IDs are assigned within each character type, the character types being classes such as symbols/alphanumerics, hiragana, katakana and kanji. This makes retrieval possible in cases such as when all keywords input at search time consist of characters of the same type.

[0008] Further, to achieve the above object, the present invention is a document retrieval apparatus that assigns a unique ID number to each character string, such as a character or a word, and retrieves documents at high speed on the basis of an index covering all documents, the index storing the appearance position information of the character strings appearing in each document in ID-number order and in a form that can be retrieved per ID number. The apparatus comprises: input means for inputting search expressions and processing instructions; maximum/minimum ID extraction means for extracting, from the search expression input at search time, the minimum ID number and the maximum ID number of the character strings constituting the plural search keywords; index reading means for reading from the index the position information for the IDs between the minimum ID number and the maximum ID number; search means; and output means for outputting the processing result.

[0009] Further, to achieve the above object, in the above document retrieval apparatus the character strings to which unique ID numbers are assigned are limited to single characters, and consecutive character IDs are assigned within each character type.

[0010] According to the above configuration, a unique ID number is assigned to each character string (character or word). The index for all documents, which records every appearance position of the characters appearing in a document in ID-number order and allows the position information to be obtained per ID number, is laid out document by document, with the character string position information of each document arranged in ID-number order. Exploiting this layout, the minimum and maximum ID numbers of the character strings constituting the plural search keywords written in the search expression are extracted, and only the position information for the IDs between the minimum and maximum ID numbers is read from the index. Compared with reading the position information for all IDs in the document, this reduces the index information read and makes it possible to improve retrieval speed.
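The range read described in this paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the per-document index is modeled as an in-memory list of `(id_number, positions)` pairs sorted by ID number, and the function name `read_id_range` is hypothetical.

```python
from bisect import bisect_left, bisect_right

def read_id_range(doc_index, keyword_ids):
    """Return only the index entries whose ID lies between the smallest
    and largest keyword ID (inclusive).

    doc_index: list of (id_number, [positions]) sorted by id_number
               (an assumed in-memory stand-in for the on-disk index).
    """
    lo, hi = min(keyword_ids), max(keyword_ids)
    ids = [entry[0] for entry in doc_index]
    start = bisect_left(ids, lo)    # first entry with id >= lo
    end = bisect_right(ids, hi)     # one past the last entry with id <= hi
    return doc_index[start:end]

# Six distinct IDs appear in this document; the keywords use IDs 5, 9 and 4.
doc_index = [(1, [0, 7]), (4, [1]), (5, [2]), (9, [3]), (20, [4, 5]), (31, [6])]
selected = read_id_range(doc_index, [5, 9, 4])
```

Only three of the six entries are touched here. The saving grows as the keyword IDs sit closer together relative to the full ID range of the document.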

[0011] Further, in the above configuration, assigning consecutive ID numbers to characters of the same character type gives an index in which character position information is clustered by character type. As a result, the difference between the minimum ID number and the maximum ID number of a search keyword, which often consists of a single character type, becomes small, so less character position information is read from the index and retrieval speed can be improved.
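To make the effect of type-clustered IDs concrete, the sketch below compares the ID span of an all-katakana keyword under two numbering schemes: IDs handed out in order of first appearance, and IDs grouped by character type. The Unicode ranges in `char_type`, the sample corpus and all function names are assumptions made for illustration; the patent does not specify them.

```python
def char_type(ch):
    """Classify a character by Unicode block (a simplification for this sketch)."""
    cp = ord(ch)
    if 0x3041 <= cp <= 0x3096:
        return "hiragana"
    if 0x30A1 <= cp <= 0x30FC:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    return "symbol_alnum"

TYPE_ORDER = ["symbol_alnum", "hiragana", "katakana", "kanji"]

def first_seen_ids(text):
    """Assign IDs in order of first appearance (character types end up mixed)."""
    ids = {}
    for ch in text:
        ids.setdefault(ch, len(ids))
    return ids

def grouped_ids(text):
    """Assign IDs so that each character type occupies one contiguous block."""
    chars = sorted(set(text), key=lambda ch: (TYPE_ORDER.index(char_type(ch)), ch))
    return {ch: i for i, ch in enumerate(chars)}

def id_span(ids, keyword):
    """Number of ID slots between the keyword's minimum and maximum ID, inclusive."""
    nums = [ids[ch] for ch in keyword]
    return max(nums) - min(nums) + 1

corpus = "文書データベースを検索するシステム"
keyword = "システム"
span_first_seen = id_span(first_seen_ids(corpus), keyword)
span_grouped = id_span(grouped_ids(corpus), keyword)
```

With grouped IDs the span can never exceed the number of distinct katakana in the table, whereas first-appearance numbering lets hiragana and kanji IDs fall inside the keyword's range.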

[0012]

[Embodiments of the invention] The outline of the present invention is as follows. It is a document retrieval method in which a unique ID number is assigned to each character string (character or word), an index for all documents is created that records every appearance position of the character strings in each document in ID-number order and allows the position information to be obtained per ID number, and retrieval is performed using this index. Noting that the index on the external storage device is laid out document by document, with the position information of the character strings in each document arranged in ID-number order, the method extracts the minimum and maximum ID numbers of the character strings constituting the plural search keywords written in the search expression, and reads from the index the position information for the IDs between the minimum and maximum ID numbers. The invention thus resides in a method of reading from the index that reduces the amount of position information read compared with reading the position information for all IDs.

[0013] The invention also resides in a method of assigning ID numbers to characters. The character strings to which unique ID numbers are assigned are limited to single characters, an index for all documents is created that records every appearance position of the characters in each document in ID-number order and allows the position information to be obtained per ID number, and retrieval is performed using this index. When unique ID numbers are assigned to characters, consecutive ID numbers are assigned within each character type. This reduces the difference between the minimum and maximum ID numbers of search keywords, which often consist of a single character type, and so reduces the amount of position information read from the index.

[0014] The invention further resides in a document retrieval apparatus in which a unique ID number is assigned to each character string (character or word), an index for all documents is created that records every appearance position of the character strings in each document in ID-number order and allows the position information to be obtained per ID number, and retrieval is performed using this index. The apparatus comprises: input means for inputting search expressions and processing instructions; maximum/minimum ID extraction means for extracting, from the search expression input at search time, the minimum and maximum ID numbers of the character strings constituting the plural search keywords; index reading means for reading from the index the position information for the IDs between the minimum and maximum ID numbers; search means; and output means for outputting the processing result.

[0015] The invention further resides in a document retrieval apparatus in which the character strings to which unique ID numbers are assigned are limited to single characters, consecutive ID numbers are assigned to characters of the same character type, an index for all documents is created that records every appearance position of the characters in each document in ID-number order and allows the position information to be obtained per ID number, and retrieval is performed using this index. The apparatus comprises: input means for inputting search expressions and processing instructions; maximum/minimum ID extraction means for extracting, from the search expression input at search time, the minimum and maximum ID numbers of the characters constituting the plural search keywords; index reading means for reading from the index the position information for the IDs between the minimum and maximum ID numbers; search means; and output means for outputting the processing result.

[0016] In the above configuration, a unique ID number is assigned to each character string (character or word). The index for all documents, which records every appearance position of the characters appearing in a document in ID-number order and allows the position information to be obtained per ID number, is laid out document by document, with the character string position information of each document arranged in ID-number order. Exploiting this layout, the minimum and maximum ID numbers of the character strings constituting the plural search keywords written in the search expression are extracted, and only the position information for the IDs between the minimum and maximum ID numbers is read from the index. Compared with reading the position information for all IDs in the document, this reduces the index information read and makes it possible to improve retrieval speed.

[0017] Further, in the above configuration, assigning consecutive ID numbers to characters of the same character type gives an index in which character position information is clustered by character type. As a result, the difference between the minimum and maximum ID numbers of a search keyword, which often consists of a single character type, becomes small, so less character position information is read from the index and retrieval speed can be improved.

[0018] An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram showing the schematic configuration of the document retrieval apparatus. Reference numeral 1 denotes a control device, consisting of a CPU and memory, that searches documents and controls the other blocks; 2 denotes an input device, such as a keyboard, for entering search keywords and processing instructions; 3 denotes an output device, such as a display, that shows processing results; and 4 denotes an external storage device, such as a hard disk drive (HDD), that stores the search index.

[0019] FIG. 2 is a block diagram showing a detailed example of the control device 1 shown in FIG. 1. The control device 1 consists of three groups. The control group comprises an initialization unit 11, an input unit 12, an output unit 13, a control unit 14, a maximum/minimum ID extraction unit 15, an index reading unit 16, a search unit 17 and an index creation unit 18. The storage group comprises a search expression storage buffer 21, a maximum/minimum ID storage buffer 22 and a character position information storage buffer 23. The index group, stored on the external storage device or in memory, comprises a word type ID table 31 as shown in FIG. 8, an APT index table 32 as shown in FIG. 9, an APT index 33 as shown in FIG. 10 and an APT 34 as shown in FIG. 11. The index creation method, the index formats and the details of document retrieval using these indexes are described in Japanese Patent Application No. Hei 6-322068.

[0020] In FIG. 2, the initialization unit 11 initializes each buffer of the storage group. The input unit 12 receives information such as the search expression from the input device 2. The output unit 13 outputs information such as a list of the titles of matching documents, which is the result of the search, to the display device 3.

[0021] The control unit 14 controls the entire control group and coordinates each process in the search. The maximum/minimum ID extraction unit 15 takes the search keywords extracted from the search expression that the input unit 12 stored in the search expression storage buffer 21, converts them into sequences of ID numbers using the word type ID table 31, extracts the maximum and minimum ID numbers, and stores them in the maximum/minimum ID storage buffer 22.

[0022] The index reading unit 16 works from the number of the search target document supplied by the control unit 14 and the maximum and minimum ID numbers obtained from the maximum/minimum ID storage buffer 22. It obtains from the APT index table 32 the information locating the APT index 33 of the search target document, then obtains from the APT index 33 the addresses of the position information from the minimum ID to the maximum ID, and reads that position information from the APT 34.

[0023] The search unit 17 determines whether the search target document satisfies the search expression. First, when the position information shows that the ID sequence of a search keyword appears consecutively, the search keyword is confirmed to exist in the search target document. Then, following the logic of the search expression, the unit determines whether the document satisfies the expression.

[0024] The index creation unit 18 exists to perform preprocessing for the search. It breaks the input document down into character strings (words or characters) and obtains the ID number of each character string from the word type ID table 31. If a character string is not in the word type ID table 31, it is newly registered and a new ID number is obtained. In order of the ID numbers appearing in the document, the positions at which each ID number's character string appears are extracted and registered in the APT 34, and the start address and end address of that position information in the APT 34 are registered in the APT index 33. After position information has been registered in the APT 34 for every ID number appearing in the input document, the start address and end address of the corresponding data in the APT index 33, that is, of the set of start/end address pairs for those ID numbers, are stored in the APT index table 32. This series of steps is repeated for each search target document to create the search index.
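The relationship between the four structures can be sketched in a few lines. This is an illustrative model only: the real formats are defined in the referenced application and FIGS. 8 to 11, which are not reproduced here. The sketch assumes pre-tokenized input, half-open `(start, end)` address pairs, and in-memory lists standing in for the external storage; `build_index` is a hypothetical name.

```python
def build_index(documents):
    """Build toy versions of the four index structures.

    documents: list of token lists (tokenization is assumed to be done already).
    """
    word_ids = {}          # word type ID table: token -> ID number
    apt = []               # APT: flat list of appearance positions
    apt_index = []         # APT index: (id_number, start, end) into apt
    apt_index_table = []   # APT index table: per document (start, end) into apt_index
    for tokens in documents:
        positions = {}
        for pos, tok in enumerate(tokens):
            # Register unseen tokens with the next free ID number.
            id_no = word_ids.setdefault(tok, len(word_ids))
            positions.setdefault(id_no, []).append(pos)
        first_entry = len(apt_index)
        for id_no in sorted(positions):        # ascending ID-number order
            start = len(apt)
            apt.extend(positions[id_no])
            apt_index.append((id_no, start, len(apt)))
        apt_index_table.append((first_entry, len(apt_index)))
    return word_ids, apt, apt_index, apt_index_table

docs = [list("abca"), list("cab")]
word_ids, apt, apt_index, apt_index_table = build_index(docs)
```

Because entries are appended in ascending ID order within each document, the position data for any ID range of one document occupies a single contiguous slice of `apt`, which is the property the search-time range read relies on.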

[0025] Next, the operation and processing flow of the document retrieval apparatus according to the above embodiment are described with reference to the flowcharts shown in FIGS. 3 and 4. Index creation is described first.

[0026] In index creation, which is the preprocessing for the search, the input unit 12, the control unit 14 and the index creation unit 18 operate to create the various indexes: the word type ID table 31, the APT index table 32, the APT index 33 and the APT 34.

[0027] In the flowchart shown in FIG. 3, the input unit 12 starts in step 301 and inputs a search target document. In steps 302 to 308 of FIG. 3, the control unit 14 operates.

[0028] In step 302, the document input from the input unit 12 is segmented using Japanese-language analysis or the like and converted into character strings (word types) such as words and characters. In step 303, any word type that appears in the document but is not yet registered in the word type ID table 31 is newly registered, and a uniquely corresponding ID number is obtained.

[0029] In step 304, starting from the smallest ID number among the word types appearing in the document, the appearance positions are obtained and registered in the APT 34. In step 305, the start address and end address of the data stored in the APT 34 are stored in the APT index 33.

[0030] In step 306, it is determined whether storage in the APT 34 and the APT index 33 has been carried out for all word types appearing in the input document. If NO, control passes to step 305; if YES, processing proceeds to the next step.

[0031] In step 307, the start address and end address of the whole block of start/end address pairs stored in the APT index, which indicate the position information of the word types appearing in the input document, are stored in the APT index table 32.

[0032] In step 308, it is determined whether index creation has been carried out for all search target documents. If NO, control passes to step 301; if YES, index creation ends.

[0033] Next, the flowchart of FIG. 4 shows the flow of search processing. First, the initialization unit 11 starts in step 401 and initializes each buffer. Next, in step 402, a search expression is input from the input unit 12 and passed to the control unit 14, and the control unit 14 stores it in the search expression storage buffer 21, as shown in FIG. 5.

[0034] In step 403, the maximum/minimum ID extraction unit 15 starts, extracts the search keywords from the search expression stored in the search expression storage buffer 21, and obtains their ID numbers using the word type ID table 31. The maximum and minimum of these ID numbers are stored in the maximum/minimum ID storage buffer 22, as shown in FIG. 6.
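Step 403 can be sketched as below. The search expression syntax is an assumption: FIG. 5 is not reproduced here, so this sketch supposes whitespace-separated keywords joined by the operator words `AND`, `OR` and `NOT`, with one ID per character. The function name and the sample ID values are likewise hypothetical.

```python
import re

OPERATORS = ("AND", "OR", "NOT")   # assumed operator tokens

def extract_min_max(expression, word_ids):
    """Return (min_id, max_id) over every character of every keyword.

    Raises KeyError if a keyword contains a character that is not in the
    table; a real system would have to handle that case explicitly.
    """
    tokens = re.split(r"\s+", expression.strip())
    keywords = [t for t in tokens if t not in OPERATORS]
    ids = [word_ids[ch] for kw in keywords for ch in kw]
    return min(ids), max(ids)

# Hypothetical excerpt of a word type ID table with one ID per character.
word_ids = {"シ": 3, "ス": 4, "テ": 6, "ム": 9, "文": 11, "書": 12, "検": 13, "索": 14}
min_id, max_id = extract_min_max("文書 AND 検索", word_ids)
```

The pair is computed once per search expression and then reused for every document in the loop of steps 404 to 411.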

[0035] In step 404, the control unit 14 starts and repeats steps 405 to 411 until no search target documents remain. In steps 405 to 407, the index reading unit operates.

[0036] Of these, in step 405 the entry locating the corresponding APT index 33 data is read from the APT index table 32, using the document number received from the control unit 14.

[0037] In step 406, the data of the APT index 33 corresponding to the document number is read on the basis of the data read from the APT index table 32. In step 407, from the data of the APT index 33 and the maximum/minimum ID storage buffer 22, the start and end addresses in the APT 34 at which the character position information for the maximum and minimum ID numbers is stored are obtained. The data between the start address of the minimum ID and the end address of the maximum ID is then read from the APT 34 and stored in the character position information storage buffer 23, as shown in FIG. 7.
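Steps 405 to 407 amount to a three-level lookup followed by one contiguous read. The sketch below models this with in-memory lists and half-open address pairs; both are assumptions, as is the choice to use the smallest and largest IDs actually present in the document when the exact minimum or maximum ID does not occur in it. `read_positions` is a hypothetical name.

```python
def read_positions(doc_no, min_id, max_id, apt_index_table, apt_index, apt):
    """Read the position data for IDs in [min_id, max_id] of one document."""
    # Step 405: locate this document's slice of the APT index.
    first, last = apt_index_table[doc_no]
    # Step 406: keep the APT index entries whose ID falls in the range.
    entries = [e for e in apt_index[first:last] if min_id <= e[0] <= max_id]
    if not entries:
        return {}
    # Step 407: a single contiguous read from the start address of the
    # smallest ID to the end address of the largest ID.
    block_start = entries[0][1]
    block_end = entries[-1][2]
    block = apt[block_start:block_end]
    # Split the block back into per-ID position lists (buffer 23).
    return {i: block[s - block_start:e - block_start] for i, s, e in entries}

apt = [0, 3, 1, 2, 1, 2, 0]
apt_index = [(0, 0, 2), (1, 2, 3), (2, 3, 4), (0, 4, 5), (1, 5, 6), (2, 6, 7)]
apt_index_table = [(0, 3), (3, 6)]
buffer_doc0 = read_positions(0, 1, 2, apt_index_table, apt_index, apt)
buffer_doc1 = read_positions(1, 0, 1, apt_index_table, apt_index, apt)
```

On a disk-backed index, `apt[block_start:block_end]` would correspond to one seek plus one sequential read, which is where the reduction in read volume comes from.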

[0038] In steps 408 to 410, the search unit operates. Of these, in step 408 the search keywords are extracted from the search expression storage buffer 21 and converted into sequences of ID numbers using the word type ID table 31.

[0039] In step 409, the position information for the corresponding ID numbers is read from the character position information storage buffer 23. When the positions of the ID numbers' character strings follow the same order as the ID sequence of the search keyword, the search keyword is judged to exist in the document.
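The adjacency test of step 409 can be sketched as follows, assuming positions are integer offsets that increase by one per character string and that buffer 23 is modeled as a dictionary from ID number to position list. The function name is hypothetical.

```python
def keyword_present(keyword_ids, positions_by_id):
    """True if the keyword's IDs occur at consecutive positions, in order.

    keyword_ids: non-empty list of ID numbers, in keyword order.
    positions_by_id: ID number -> list of positions (toy buffer 23).
    """
    if any(i not in positions_by_id for i in keyword_ids):
        return False
    later = [set(positions_by_id[i]) for i in keyword_ids[1:]]
    # Try every occurrence of the first ID as a candidate start and check
    # that each following ID appears exactly one position further on.
    return any(
        all(start + k + 1 in later[k] for k in range(len(later)))
        for start in positions_by_id[keyword_ids[0]]
    )

buffer_23 = {3: [13], 4: [7, 14], 6: [15], 9: [16]}
found = keyword_present([3, 4, 6, 9], buffer_23)
```

Here the IDs 3, 4, 6 and 9 occur at positions 13, 14, 15 and 16, so the keyword is present; the extra occurrence of ID 4 at position 7 does not affect the result.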

[0040] FIG. 12 shows the flow of this determination. In step 410, based on the results of step 409, which determined whether each search keyword of the search expression exists in the document, and on the logical operations between the search keywords, it is determined whether the search target document satisfies the search expression, and the result is passed to the control unit 14.
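Step 410 combines the per-keyword results according to the logic of the search expression. A minimal sketch follows, assuming the expression has already been parsed into nested tuples; the patent does not describe a parse format, so this representation and the operator names are illustrative only.

```python
def evaluate(expr, present):
    """Evaluate a parsed search expression for one document.

    expr: a keyword string, ("NOT", operand), or ("AND" | "OR", left, right).
    present: keyword -> bool, the per-keyword results of step 409.
    """
    if isinstance(expr, str):
        return present[expr]
    op = expr[0]
    if op == "NOT":
        return not evaluate(expr[1], present)
    left = evaluate(expr[1], present)
    right = evaluate(expr[2], present)
    if op == "AND":
        return left and right
    if op == "OR":
        return left or right
    raise ValueError(f"unknown operator: {op}")

present = {"文書": True, "検索": False, "システム": True}
matches = evaluate(("AND", "文書", ("OR", "検索", "システム")), present)
```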

[0041] In step 411, the output unit starts, produces output according to the search result passed from the control unit 14, and passes control to step 404. The present invention is not limited to the above embodiment; for example, the following alternative embodiment is also conceivable.

[0042] The difference from the above embodiment lies in the data of the word type ID table and its ordering. In the above embodiment, character strings (words or characters) newly obtained during index creation were registered in the word type ID table one after another. With this registration method, character types such as letters, digits, symbols, hiragana, katakana and kanji are intermixed.

[0043] In the other embodiment of the present invention, by contrast, only single characters are registered in the word type ID table, as shown in FIG. 13. Furthermore, hiragana, alphabetic letters, digits, symbols, and katakana are each assigned an ID number one-to-one with the character and registered in the word type ID table before the index is created. Kanji are registered in the word type ID table as they newly appear during index creation.
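A table of this kind might be built as in the sketch below. The concrete layout of FIG. 13 is not reproduced in the text, so the choice of Unicode ranges, the order of the character types, and the function names are all assumptions made for illustration.

```python
# Hypothetical fixed ranges, registered before any index is created:
# printable ASCII (symbols, digits, letters), hiragana, katakana.
FIXED_RANGES = [(0x21, 0x7E), (0x3041, 0x3096), (0x30A1, 0x30FA)]


def build_fixed_table():
    """Assign consecutive IDs, one per character, within each character type."""
    table = {}
    for low, high in FIXED_RANGES:
        for code_point in range(low, high + 1):
            table[chr(code_point)] = len(table) + 1
    return table


def register_kanji(table, text):
    """Add kanji to the table in the order they first appear in text."""
    for ch in text:
        # CJK Unified Ideographs block; an assumed definition of "kanji".
        if "\u4e00" <= ch <= "\u9fff" and ch not in table:
            table[ch] = len(table) + 1
    return table


table = build_fixed_table()
register_kanji(table, "文書検索")
print(table["!"], table["ぁ"], table["ァ"], table["文"])  # 1 95 181 271
```

Because every pre-registered type occupies one contiguous block of IDs, a keyword written entirely in katakana, for instance, maps to IDs that all fall within a single block of 90 values.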

[0044]

[Effects of the Invention] As described in detail above, the present invention applies to an index from which the position information of character strings can be obtained from an external storage device per ID number. When determining whether a search keyword is present in a document, obtaining the minimum and maximum ID numbers of the character strings in the search keyword makes it possible to retrieve less position information from the external storage device than would be needed to obtain the position information for the ID numbers of all characters appearing in the document. As a result, search speed is improved.
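The range-limited read can be illustrated with an in-memory stand-in for the index. This is only a sketch under stated assumptions: `doc_index` is a hypothetical list of `(ID, positions)` pairs sorted by ID for one document, whereas the actual index resides on external storage and its on-disk layout is not modeled here.

```python
from bisect import bisect_left, bisect_right


def read_positions_in_range(doc_index, keyword_id_lists):
    """Read only the entries whose ID lies between the keywords' min and max ID.

    doc_index        -- [(id, [positions...]), ...] sorted by id (hypothetical)
    keyword_id_lists -- one list of IDs per search keyword (must be non-empty)
    """
    all_ids = [i for ids in keyword_id_lists for i in ids]
    min_id, max_id = min(all_ids), max(all_ids)
    ids = [entry[0] for entry in doc_index]
    start = bisect_left(ids, min_id)   # first entry with id >= min_id
    stop = bisect_right(ids, max_id)   # one past the last entry with id <= max_id
    return min_id, max_id, dict(doc_index[start:stop])


doc_index = [(2, [3]), (5, [0, 7]), (9, [1, 4]), (12, [2]), (40, [5, 6])]
print(read_positions_in_range(doc_index, [[5, 9], [12]]))
# (5, 12, {5: [0, 7], 9: [1, 4], 12: [2]})
```

Entries for IDs 2 and 40 are never touched, which corresponds to the reduction in data read from external storage.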

[0045] Furthermore, the ID numbers assigned to characters are classified by character type, and characters of the same type are given consecutive ID numbers. Because search keywords often consist of a single character type, this further reduces the amount of position information required when such a keyword is entered, which yields the excellent effect of improved search speed.
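The effect of grouping IDs by character type can be seen in a toy comparison of the ID span a keyword covers under the two registration schemes. The two-way `char_type` classification and all function names are hypothetical simplifications.

```python
def first_seen_table(text):
    """IDs in order of first appearance (character types end up intermixed)."""
    table = {}
    for ch in text:
        table.setdefault(ch, len(table) + 1)
    return table


def grouped_table(text, type_of):
    """IDs grouped by character type, consecutive within each type."""
    distinct = list(dict.fromkeys(text))
    # sorted() is stable, so first-seen order is kept within each type.
    ordered = sorted(distinct, key=type_of)
    return {ch: i for i, ch in enumerate(ordered, start=1)}


def id_span(keyword, table):
    """Number of IDs between the keyword's minimum and maximum ID, inclusive."""
    ids = [table[ch] for ch in keyword]
    return max(ids) - min(ids) + 1


def char_type(ch):
    # Simplified two-way classification: ASCII versus everything else.
    return 0 if ch.isascii() else 1


text = "aあbいcう"
print(id_span("abc", first_seen_table(text)))          # 5
print(id_span("abc", grouped_table(text, char_type)))  # 3
```

With intermixed IDs the keyword "abc" spans five IDs, two of which belong to hiragana that the keyword does not contain; with grouped IDs the span shrinks to the three letters themselves.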

[Brief description of the drawings]

FIG. 1 is a block diagram showing the schematic configuration of an apparatus according to an embodiment of the present invention.

FIG. 2 is a block diagram showing the detailed configuration of the control device 1 in the document search device of the same embodiment.

FIG. 3 is a flowchart outlining the index creation operation in the same embodiment.

FIG. 4 is a flowchart outlining the document search operation in the same embodiment.

FIG. 5 is a diagram showing a storage example of the search expression storage buffer 21 in the same embodiment.

FIG. 6 is a diagram showing a storage example of the maximum/minimum ID storage buffer 22 in the same embodiment.

FIG. 7 is a diagram showing a storage example of the character position information storage buffer 23 in the same embodiment.

FIG. 8 is a diagram showing a storage example of the word type ID table 31 in the same embodiment.

FIG. 9 is a diagram showing a storage example of the APT index table 32 in the same embodiment.

FIG. 10 is a diagram showing a storage example of the APT index 33 in the same embodiment.

FIG. 11 is a diagram showing a storage example of the APT 34 in the same embodiment.

FIG. 12 is a diagram showing the flow of the process that determines, from the character position information, whether a search keyword is present in a document in the same embodiment.

FIG. 13 is a diagram showing a storage example of the word type ID table 31 according to another embodiment of the present invention.

[Explanation of symbols]

1 ... Control device, 2 ... Input device, 3 ... Output device, 4 ... External storage device, 11 ... Initialization unit, 12 ... Input unit, 13 ... Output unit, 14 ... Control unit, 15 ... Maximum/minimum ID extraction unit, 16 ... Index reading unit, 17 ... Search unit, 18 ... Index creation unit.

Claims

[Claims]

1. A document search method for searching documents based on an index covering all documents, in which a unique ID number is assigned to each character string such as a character or a word, and the appearance position information of the character strings appearing in each document is stored in ID-number order or per ID number, the method characterized by: extracting the minimum ID number and the maximum ID number of the character strings that make up the search keywords; and performing the search by reading from the index, as the position information required for the search, not the position information of all character strings but only the position information between the minimum ID number and the maximum ID number.

2. The document search method according to claim 1, wherein, in assigning a unique ID number to each character string, the character strings are limited to single characters and consecutive IDs are assigned within each character type, the character types being classifications such as symbols/alphanumerics, hiragana, katakana, and kanji, thereby enabling the search when the keywords input at search time all consist of characters of the same type.

3. A document search device that searches documents at high speed based on an index covering all documents, in which a unique ID number is assigned to each character string such as a character or a word, and the appearance position information of the character strings appearing in each document is stored in ID-number order or per ID number, the device comprising: input means; maximum/minimum ID extraction means for extracting, from the search expression input at search time, the minimum ID number and the maximum ID number of the character strings constituting a plurality of search keywords; index reading means for reading from the index the position information for the IDs between the minimum ID number and the maximum ID number of the character strings; and output means for outputting the processing result.

4. The document search device according to claim 3, wherein, in assigning a unique ID number to each character string, the character strings are limited to single characters and consecutive IDs are assigned within each character type.