JPH0954777A - Information retrieval device

Info

Publication number: JPH0954777A
Application number: JP7168457A
Authority: JP
Inventors: Tomoko Tanabe; 智子田邊; Chuichi Kikuchi; 忠一菊池
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1995-06-09
Filing date: 1995-07-04
Publication date: 1997-02-25
Anticipated expiration: 2015-07-04
Also published as: JP3099683B2

Abstract

(57) [Abstract]
[Purpose] To provide an information retrieval device, used when searching a large volume of documents with a computer, that searches at high speed by reducing the size of a large volume of data while preserving its information content, and that prevents search omissions by using full-text search.
[Structure] The device is provided with unnecessary word deletion means that uses an unnecessary word dictionary to delete, from the search target data stored in a search target data storage unit, words that are not search targets; duplicate character string deletion means that likewise deletes duplicated portions of character strings in the search target data; and search processing means that performs full-text search on the compressed data created from the search target data by these two means. This reduces search omissions and allows high-speed search.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to an information retrieval device used when searching a large volume of documents with a computer.

[0002]

[Prior Art] In recent years, as documents of many kinds have been digitized, demand for searching large volumes of documents has grown. To meet this demand, many conventional retrieval devices extract keywords from each document, attach them to the document at registration time, and run searches against those keywords.

[0003] Such a conventional retrieval device is described below with reference to FIG. 24. In FIG. 24, 2401 is a search target data storage unit, 2402 is morphological analysis processing means, 2403 is a morphological analysis dictionary, 2404 is word data, 2405 is keyword extraction means, 2406 is a keyword extraction dictionary, 2407 is keyword data, 2408 is search processing means, 2409 is input means, and 2410 is output means.

[0004] The operation of the retrieval device configured as above is as follows. First, when a registration start signal is input from the input means, the morphological analysis processing means 2402 performs morphological analysis on one item of search target data stored in the search target data storage unit 2401, referring to the morphological analysis dictionary 2403, and creates word data 2404.

[0005] When the morphological analysis is complete, the keyword extraction means 2405 creates keyword data 2407 using the keyword extraction dictionary 2406. After this series of operations has been performed on all the search target data stored in the search target data storage unit 2401, and a search condition is input from the input means 2409, the search processing means 2408 performs a keyword search on the keyword data 2407 and, based on the matching result, outputs the corresponding search target data stored in the search target data storage unit 2401 to the output means 2410.

[0006]

[Problems to Be Solved by the Invention] In the conventional configuration described above, however, a search produces no hit unless the input search condition exists in the keyword data. Because the keywords are extracted by morphological analysis, whether a document can be found depends on whether the morphological analysis was performed correctly. If it was not, keywords are missed, which in turn causes search omissions.

[0007] Full-text search is another method that prevents such search omissions, but because it also scans character strings that are irrelevant to the search, search speed degrades as the volume of data grows.

[0008] Some methods use an index file to raise search speed, but creating the index for a large volume of data takes a very long time.

[0009] The present invention solves these problems of the prior art. Its object is to provide an information retrieval device that uses full-text search, which keeps search omissions low, and that searches at high speed by reducing the size of the search target data without losing the information content of a large volume of data.

[0010]

[Means for Solving the Problems] To achieve this object, the information retrieval device of the present invention comprises: a search target data storage unit; an unnecessary word dictionary storing words that are not search targets; unnecessary word deletion means that uses the unnecessary word dictionary to delete words that are not search targets from the search target data stored in the search target data storage unit; duplicate character string deletion means that likewise deletes duplicated portions of character strings in the search target data; a compressed data storage unit that stores the compressed data created from the search target data by the unnecessary word deletion means and the duplicate deletion processing; search processing means that performs full-text search on the compressed data; input means for inputting a registration start signal and search conditions; and output means for outputting search results.

[0011]

[Operation] The present invention performs no morphological analysis. Unnecessary word deletion means, which deletes only the unnecessary words registered in the dictionary, removes data that is not a search target, and duplicate character string deletion means removes duplicate character strings, reducing the data size further. Search processing means that performs full-text search on the data compressed in this way prevents search omissions and achieves high-speed search.

[0012]

[Embodiments]

(Embodiment 1) A first embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram of an information retrieval device according to one embodiment of the present invention. In FIG. 1, 101 is a search target data storage unit, 102 is unnecessary word deletion means, 103 is an unnecessary word dictionary, 104 is duplicate character string deletion means, 105 is a compressed data storage unit, 106 is search processing means, 107 is input means, and 108 is output means.

[0013] First, the search conditions, search target data, unnecessary word dictionary, and compressed data used in this embodiment are described.

[0014] A search condition consists of match strings and logical symbols expressing logical relationships such as sum (OR), product (AND), and negation (NOT), and is input through the input means 107.
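As an illustration only, a search condition of this kind could be modelled as nested tuples and evaluated against a character string. The tuple layout, the function name `evaluate`, and the operator labels "and", "or", and "not" are assumptions made for this sketch; the specification does not define a concrete syntax for search conditions.

```python
def evaluate(condition, text):
    """Return True if `text` satisfies `condition`.

    A condition is either a match string or a tuple
    (operator, operand, ...) with operator "and", "or", or "not".
    """
    if isinstance(condition, str):
        return condition in text          # plain substring match
    operator, *operands = condition
    if operator == "and":                 # logical product
        return all(evaluate(c, text) for c in operands)
    if operator == "or":                  # logical sum
        return any(evaluate(c, text) for c in operands)
    if operator == "not":                 # negation of a single operand
        return not evaluate(operands[0], text)
    raise ValueError(f"unknown operator: {operator!r}")


# Hypothetical condition: "online" AND NOT "keyword"
condition = ("and", "online", ("not", "keyword"))
```

Nesting tuples keeps the sketch short; a real device would more likely parse an expression typed by the user into a comparable tree.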

[0015] The search target data in this embodiment are documents, stored in the search target data storage unit 101. One item of document data consists of text, images, and the like that represent the content of the document. Each item of search target data has a header for identifying the data, such as a search target data number, and multiple items of search target data exist in one file, delimited by these headers.

[0016] The unnecessary word dictionary 103 stores unnecessary words. An unnecessary word here is a word that, in a given document, carries no meaning by itself and is not a search target, for example a symbol, tag, suffix, conjunction, or noun peculiar to that document. Unnecessary words are chosen to be words that rarely need updating.

[0017] Compressed data is what results when words that are not search targets are removed from the search target data as unnecessary words, duplicates are removed from the character strings left between the unnecessary words (that is, the necessary words that are search targets), and the remaining strings are registered in sequence, separated by a delimiter. The delimiter is, for example, a one-byte code that is not itself a search target.

[0018] One item of compressed data is created from each item of search target data. It carries the same header as the corresponding search target data, and multiple items of compressed data exist in one file, delimited by these headers. FIG. 2 shows examples of the search target data, the unnecessary word dictionary 103, and the compressed data described above.

[0019] The operation of the information retrieval device configured as above is as follows. FIG. 3 shows the overall flow, which divides broadly into data registration processing and search processing. Data registration processing in turn divides into unnecessary word deletion processing and duplicate character string deletion processing. Data registration processing is performed one search target file at a time and continues until all the search target data have been processed.

[0020] Data registration processing is described first. In data registration processing, the unnecessary word deletion means 102 and the duplicate character string deletion means 104 create in advance, from the search target data, the compressed data that will actually be searched.

[0021] First, when a data registration start command is issued from the input means 107, the unnecessary word deletion means 102 starts unnecessary word deletion processing on one item of search target data in the file stored in the search target data storage unit 101. FIG. 4 shows the flow of unnecessary word deletion processing.

[0022] As shown in FIG. 5, the unnecessary word deletion means 102 consists of an unnecessary word calculation unit 102a, a necessary word calculation unit 102b, and a necessary word storage table 102c.

[0023] In unnecessary word deletion processing, the unnecessary word position extraction unit 102a refers to the unnecessary word dictionary 103, compares the unnecessary words stored there against the search target data, and extracts the position and length of each unnecessary word in the search target data. Unnecessary words are extracted by the longest-match method.

[0024] FIG. 6 shows the flow of the longest-match method. Starting from the beginning of the search target data, the text is compared against the unnecessary word dictionary 103 one character at a time (step 1), advancing by one character (step 3) until a character that prefix-matches a dictionary entry is found (step 2). When a prefix-matching character is found (step 2), it is concatenated with the next character (step 4) and the result is checked for a prefix match again (step 5). If there is no prefix match, the preceding string is the unnecessary word, and its start position and length are obtained (step 6).

[0025] If there is a prefix match (step 5), the operation is repeated until the prefix match fails (step 7). If a string that prefix-matches also matches a dictionary entry exactly, that string is a word registered in the unnecessary word dictionary 103; but if a prefix match still holds after the next character is appended, a longer string containing that word is also registered in the unnecessary word dictionary 103. FIG. 7 shows a concrete example.

[0026] Next, the necessary word calculation unit 102b uses the positions and lengths of the unnecessary words to compute the character strings other than unnecessary words, that is, the necessary words that are search targets, and stores them in the necessary word storage table 102c. FIG. 8 shows an example. This completes unnecessary word deletion processing.

[0027] Next, the flow of duplicate character string deletion processing is described with reference to FIG. 9. The duplicate character string deletion means 104 refers to the necessary word position storage table 102c, checks each necessary word for duplication, and extracts the necessary words that are not duplicates. In doing so, partial matches between necessary words are also recognized as duplicates.

[0028] FIG. 10 shows an example. When strings A and B exist and string B is contained in string A (written A ⊇ B), string B is treated as a duplicate string. For example, comparing テレビ ("TV") with テレビジョン ("television"), テレビ is contained in テレビジョン and is therefore a duplicate string.
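The containment rule A ⊇ B can be sketched as a single pass over the necessary words that drops any word contained in another. The function name `remove_duplicates` and the choice to preserve first-seen order are assumptions of this sketch.

```python
def remove_duplicates(words):
    """Drop every word that is contained in another word (A ⊇ B),
    including exact repeats, keeping the longer string."""
    kept = []
    for word in words:
        if any(word in other for other in kept):
            continue                     # already covered by a kept word
        # The new word may itself cover shorter words kept earlier.
        kept = [other for other in kept if other not in word]
        kept.append(word)
    return kept


# テレビ ("TV") is contained in テレビジョン ("television").
result = remove_duplicates(["テレビ", "テレビジョン", "検索", "テレビ"])
```

Because a substring query that matches the shorter word also matches the longer one, dropping the shorter word does not remove any string a full-text search could find within these words.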

[0029] The header of the search target data being processed is stored in the compressed data storage unit 105. The non-duplicate necessary words extracted above are then arranged in sequence, separated by the delimiter described earlier, to create the compressed data, which is stored. Duplicate deletion processing ends when the duplication check and storage are complete for all the necessary words of one item of search target data. Data registration processing ends when duplicate character string deletion processing is complete for all the search target data.
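A compressed record, then, is a header followed by the surviving necessary words joined with the delimiter. The sketch below assumes the ASCII unit separator `\x1f` as the one-byte delimiter and a (header, body) tuple as the stored form; neither choice comes from the specification.

```python
WORD_DELIM = "\x1f"  # assumed one-byte code that never occurs in searchable text


def build_compressed_record(header, words):
    """Pair the header of the search target data with its necessary
    words, joined by the word delimiter."""
    return (header, WORD_DELIM.join(words))


# Hypothetical record for document number 2.
record = build_compressed_record("2", ["transmission of image data",
                                       "retrieval device"])
```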

[0030] Search processing is described next. FIG. 11 shows its flow. When a search condition is input from the input means 107, the search processing means 106 searches all the compressed data created by the operations above. The search method is full-text search, which uses the document text itself as the search target. Specifically, full-text search is performed on each character string lying between one delimiter of the compressed data and the next. Either a method that searches the data directly or a method that creates an index file and performs an index search can be used. For index search, an index is created after the compressed data has been created, as one step of the data registration processing described above.

[0031] After the search, the search processing means 106 takes the header information of each matching item of data, uses it to retrieve the corresponding search target data from the search target file stored in the search target data storage unit 101, and sends that data together with the header information to the output means 108 for output.

[0032] As a concrete example using FIG. 2, suppose the search condition is "image data" and the search processing means 106 searches the compressed data. Because the string "transmission of image data" in document number 2 contains "image data", the search processing means 106 uses document number 2 to retrieve the search target data for document number 2 stored in the search target data storage unit 101, and sends the header information (document number 2) and the search target data to the output means 108 for output.
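The direct-search variant can be sketched as a scan over each delimited string in every compressed record, followed by a lookup of the original document by header. The dictionary-based storage, the `\x1f` delimiter, and the sample documents are assumptions of this sketch; only the "image data" query against document number 2 follows the example in the text.

```python
WORD_DELIM = "\x1f"  # assumed one-byte delimiter between necessary words


def search(compressed, originals, query):
    """Return (header, original document) for every compressed record
    in which some delimited string contains `query`."""
    hits = []
    for header, body in compressed.items():
        if any(query in segment for segment in body.split(WORD_DELIM)):
            hits.append((header, originals[header]))
    return hits


# Hypothetical stores keyed by document number (the header).
originals = {
    "1": "<p>A retrieval device using keywords.</p>",
    "2": "<p>Transmission of image data.</p>",
}
compressed = {
    "1": WORD_DELIM.join(["retrieval device", "keywords"]),
    "2": "transmission of image data",
}
hits = search(compressed, originals, "image data")
```

The match is made against the compressed record, but the caller receives the untouched original document, as described for the output means.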

[0033] As described above, in this embodiment the unnecessary word deletion means 102 removes character strings that are not search targets, and the duplicate deletion means 104 then deletes strings that need not appear more than once for search purposes. The data is thus compressed and reduced in size while the information content of the document is preserved, so search is both fast and less prone to omissions. In addition, because the search target data is smaller, a retrieval system that uses an index file can have a smaller index file.

[0034] Because full-text search is performed on the compressed data, the search omissions of conventional keyword search are prevented, and search conditions can be freer than with conventional exact-match keyword search.

[0035] Although multiple items of search target data are stored in one file above, there may be multiple files. It is also possible to use the header of each item of search target data as a file name and store one item of search target data per file. In that case, after searching the compressed data, the search processing means 106 takes the file name of each matching item, locates the corresponding search target data file by that name, retrieves the search target data stored in it, and sends it together with the file name to the output means 108 for output.

[0036] The search target data storage unit 101 can also be placed on an optical disk, which reduces the storage space needed for the search target data.

[0037] (Embodiment 2) A second embodiment of the present invention is described below with reference to the drawings. FIG. 12 is a block diagram of an information retrieval device according to one embodiment of the present invention. In FIG. 12, 101 is a search target data storage unit, 102 is unnecessary word deletion means, 103 is an unnecessary word dictionary, 104 is duplicate character string deletion means, 105 is a compressed data storage unit, 106 is search processing means, 107 is input means, 108 is output means, and 1201 is a compression range storage unit in which the range of data to be compressed by the unnecessary word deletion means 102 and the duplicate character string deletion means 104 is registered in advance.

[0038] The compression range storage unit 1201 of the information retrieval device configured as above is described next. Before data registration processing starts, a compression range input from the input means 107 is registered in the compression range storage unit 1201. The compression range is specified either as a pair of compression start and end positions, expressed as offsets from the start of the search target data, or as tags marking the compression start and end positions. FIG. 13 shows an example.

[0039] The processing flow is as follows. When a command to start data compression processing is input from the input means 107, the unnecessary word deletion means 102 refers to the compression range storage unit 1201 and obtains the compression start and end positions in the search target data. Unnecessary word deletion processing is then performed on the range of search target data corresponding to those positions, after which the duplicate character string deletion means 104 performs duplicate character string deletion processing. The structure of the search target data, the unnecessary word deletion processing, and the duplicate character string deletion processing are the same as in the first embodiment. The search target data that is not compressed, that is, the uncompressed data, is then stored as-is in the compressed data, separated from the compressed portion by the delimiter used in duplicate deletion processing, as shown in FIG. 13. Search processing thereafter proceeds as in the first embodiment.
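Range-limited registration with offset-based positions might look like the sketch below. The function `register_with_range`, the half-open interpretation of the end offset, and the stand-in `toy_compress` (which only removes repeated whitespace-separated words) are all assumptions for illustration; a real device would apply the unnecessary word and duplicate character string deletion of the first embodiment instead.

```python
WORD_DELIM = "\x1f"  # assumed delimiter, shared with duplicate deletion


def toy_compress(segment):
    """Stand-in for the real compression: drop repeated words,
    keeping the first occurrence of each."""
    return list(dict.fromkeys(segment.split()))


def register_with_range(text, start, end, compress):
    """Compress only text[start:end]; store the rest verbatim,
    separated from the compressed part by the delimiter."""
    parts = []
    if text[:start]:
        parts.append(text[:start])        # uncompressed data before the range
    parts.extend(compress(text[start:end]))
    if text[end:]:
        parts.append(text[end:])          # uncompressed data after the range
    return WORD_DELIM.join(parts)


stored = register_with_range("TITLE a b a c END", 6, 13, toy_compress)
```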

[0040] When the compression range of the search target data can be specified as in this embodiment, only the desired portion is compressed. For a document whose text contains both a portion that lists words and a portion written as sentences, for instance, only the sentence portion can be compressed. A wider variety of search target data can therefore be handled.

[0041] (Embodiment 3) A third embodiment of the present invention is described below with reference to the drawings. FIG. 14 is a block diagram of an information retrieval device according to one embodiment of the present invention. In FIG. 14, 101 is a search target data storage unit, 102 is unnecessary word deletion means, 103 is an unnecessary word dictionary, 104 is duplicate character string deletion means, 105 is a compressed data storage unit, 107 is input means, 108 is output means, 1401 is a multiple compression range storage unit in which multiple ranges of data to be compressed by the unnecessary word deletion means 102 and the duplicate character string deletion means 104 can be specified and registered, and 1402 is search processing means that can search each compressed or uncompressed range separately.

[0042] The multiple compression range storage unit 1401 of the information retrieval device configured as above is described next. Before data registration processing starts, the compression ranges input from the input means 107 are registered in the multiple compression range storage unit 1401. The compression range is specified as in the second embodiment, except that multiple ranges can be specified. FIG. 15 shows an example of the multiple compression range storage unit 1401.

[0043] The overall flow is as follows. When a command to start data compression processing is input from the input means 107, the unnecessary word deletion means 102 refers to the multiple compression range storage unit 1401 and obtains one pair of compression start and end positions in the search target data. Unnecessary word deletion processing is then performed on the range of search target data corresponding to those positions, after which the duplicate character string deletion means 104 performs duplicate character string deletion processing. The structure of the search target data, the unnecessary word deletion processing, and the duplicate character string deletion processing are the same as in the first embodiment. When duplicate character string deletion processing is complete for one compression range, a delimiter that separates compression ranges is stored; this delimiter is different from the one used to separate necessary words. Like the necessary word delimiter, the compression range delimiter is, for example, a one-byte code that is not itself a search target.

[0044] This completes data registration processing for one compression range. The search target data that is not compressed, that is, the uncompressed data, is then stored as-is, delimited by the compression range delimiter. Processing repeats until the above data registration processing is complete for every specified compression range and all the uncompressed data has been stored. FIG. 15 shows an example of compressed data created in this way.

[0045] When data registration processing ends, the search processing means 1402 starts search processing, as in the first embodiment. The search method is the same as in the first embodiment, except that the search is performed within each range delimited by the compression range delimiters stored during data registration. In other words, the compression range delimiter serves as the search range delimiter during search processing. As soon as the search condition is satisfied within one such range, string matching for that item of search target data ends. The subsequent operation is the same as in the first embodiment: the header information of the matching data is extracted and the search result is sent to the output means 108. This is repeated for all the search target data.

[0046] By establishing search ranges in this way and searching each piece of data partitioned into, so to speak, tagged blocks, the logical expression used as the search condition can be exploited effectively. For example, a search can be performed with a condition that is effective for character strings occurring near one another.

[0047] An example is described with reference to FIG. 16. Suppose the searcher wants to find "a retrieval device that can handle image data online" and uses the search condition "online & image data & retrieval device", and that the search target is the single item with document number 1 shown in the figure, consisting of "Claim 1", "Claim 2", and "Description of the drawings". If no search range is set, document number 1 is returned as matching the search expression; if ranges are set, document number 1 is not returned, and the intended result is obtained.
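The difference between the two cases can be illustrated with a short sketch. The contents of FIG. 16 are not reproduced in this text, so the three sample blocks below are invented purely for illustration, and the delimiter code and function names are likewise assumptions. The sketch treats the search condition as a plain AND of terms.

```python
RANGE_DELIM = "\x1e"  # hypothetical range delimiter code


def matches_all(text, terms):
    """AND condition: every term must occur in `text`."""
    return all(term in text for term in terms)


def hit_without_ranges(record, terms):
    """Evaluate the condition over the whole record, ignoring block boundaries."""
    return matches_all(record, terms)


def hit_with_ranges(record, terms):
    """Evaluate the condition block by block; stop at the first block that matches."""
    return any(matches_all(block, terms) for block in record.split(RANGE_DELIM))


# Invented sample: the three terms occur, but never all in the same block.
record = RANGE_DELIM.join([
    "online retrieval device",          # stands in for "Claim 1"
    "printer that handles image data",  # stands in for "Claim 2"
    "diagram of the retrieval device",  # stands in for "Description of the drawings"
])
terms = ["online", "image data", "retrieval device"]
whole_hit = hit_without_ranges(record, terms)
block_hit = hit_with_ranges(record, terms)
```

With this sample, `whole_hit` is true because the terms are scattered across the document, while `block_hit` is false because no single block contains all three, which matches the behavior the paragraph describes.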

[0048] Of course, when the whole item is to be searched, the compression range can simply be set to the whole item, so the device can be used with a wide variety of search target data.

[0049] Although unnecessary word deletion and duplicate character string deletion were performed for each compression range above, it is also possible to first delete unnecessary words across all designated compression ranges and then perform duplicate character string deletion for each designated compression range.

[0050] (Embodiment 4) A fourth embodiment of the present invention is described below with reference to the drawings. FIG. 17 is a configuration diagram of an information retrieval device according to this embodiment. In FIG. 17, 101 is a search target data storage unit, 102 is unnecessary word deleting means, 103 is an unnecessary word dictionary, 104 is duplicate character string deleting means, 105 is a compressed data storage unit, 1401 is a compression range multiple storage unit, 1701 is search range designating means for designating the range to be searched within the search target data, 1702 is search processing means having a function of extracting a search range number from the search condition and sending it to the search range designating means 1701, 107 is input means, and 108 is output means.

[0051] First, the search target data, the compressed data, and the search condition in this embodiment are described with reference to FIG. 18. The structure of the search target data is the same as in the third embodiment. Each item of search target data has a block structure common to all items. In FIG. 18, for example, the tags repeat as Purpose, Description of the drawings, then again Purpose, Description of the drawings, and one Purpose together with one Description forms one unit (one item of search target data).

[0052] The compressed data is also the same as in the third embodiment: it is delimited by delimiters that separate the compressed and uncompressed ranges and, like the search target data, each item of compressed data has a common structure.

[0053] The search condition consists of matching strings and logical symbols expressing logical relationships such as OR, AND, and NOT, accompanied by a search range number. The search range number designates the range to be searched as the ordinal position of a data range counted from the start of one item of search target data. In FIG. 18, one item of search target data is divided into two search ranges by one search-range delimiter.

[0054] The flow of an information retrieval device having the search condition, search target data, search range designating means 1701, and search processing means 1702 described above is now explained. The overall process is divided into data registration and search processing; data registration is the same as in the third embodiment.

[0055] Next, the search processing performed by the search processing means 1702 is described. When the search condition is input from the input means 107, the search processing means 1702 obtains from it the search range number to be searched and sends that number to the search range designating means 1701.

[0056] The search range designating means 1701 then scans one item of compressed data from its beginning, counts the delimiters between compressed and uncompressed data ranges, and finds the start position of the data range corresponding to the search range number sent from the search processing means 1702. It then passes that start position to the search processing means 1702. The search processing means 1702 performs search processing from the designated start position to the end of that range. The search method is the same as in the first embodiment. This search processing is performed for all compressed data.
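The delimiter-counting step can be sketched as below. The delimiter code, the 1-based numbering of ranges, and the function names are assumptions for illustration; the text only states that delimiters are counted from the head of the record to locate the start of the requested range.

```python
RANGE_DELIM = "\x1e"  # hypothetical range delimiter code


def find_range(record, range_number):
    """Return (start, end) of the 1-based `range_number`-th range, or None."""
    start = 0
    seen = 0  # delimiters counted so far
    for i, ch in enumerate(record):
        if ch == RANGE_DELIM:
            seen += 1
            if seen == range_number:   # this delimiter closes the wanted range
                return start, i
            start = i + 1              # next range begins after the delimiter
    if seen == range_number - 1:       # the last range has no closing delimiter
        return start, len(record)
    return None


def search_in_range(record, range_number, term):
    """Match `term` only between the start and end of the designated range."""
    span = find_range(record, range_number)
    if span is None:
        return False
    start, end = span
    return record.find(term, start, end) != -1
```

Here `find_range` plays the part of the search range designating means 1701 and `search_in_range` that of the search processing means 1702, reduced to a single matching string.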

[0057] According to this embodiment, search target data having the same structure can be searched by range, or more specifically by item, so that the desired data can be obtained.

[0058] (Embodiment 5) A fifth embodiment of the present invention is described below with reference to the drawings. FIG. 19 is a configuration diagram of an information retrieval device according to this embodiment. In FIG. 19, 101 is a search target data storage unit, 103 is an unnecessary word dictionary, 104 is duplicate character string deleting means, 105 is a compressed data storage unit, 107 is input means, 108 is output means, 1901 is unnecessary word deleting means having a function of storing each extracted unnecessary word, together with the header information of the search target data from which it was extracted, in the unnecessary word storage table described below, 1902 is the unnecessary word storage table, which holds the unnecessary words extracted by the unnecessary word deleting means 1901 and the header information of the search target data from which they were extracted, and 1903 is search processing means having a function of searching the unnecessary word storage table 1902.

[0059] The structure of the unnecessary word storage table 1902 of this embodiment is described with reference to FIG. 20. As shown in the figure, the unnecessary word storage table 1902 stores a list of pairs, each consisting of an unnecessary word registered in the unnecessary word dictionary 103 and the header information of the search target data from which that word was deleted.

[0060] The flow of an information retrieval device having the unnecessary word storage table 1902, unnecessary word deleting means 1901, and search processing means 1903 described above is now explained. The overall process is divided into data registration and search processing. As in the third embodiment, data registration consists of unnecessary word deletion and duplicate character string deletion. However, when performing unnecessary word deletion, the unnecessary word deleting means 1901 stores each extracted unnecessary word and the header information of the search target data from which it was extracted in the unnecessary word storage table 1902.

[0061] For example, in FIG. 20 the character string "[Purpose]" (【目的】) stored in the unnecessary word dictionary is subject to deletion from the search target data as an unnecessary word, so the unnecessary word deleting means 1901 extracts it from document number 1 and document number 2 of the search target data. The unnecessary word deleting means 1901 then stores the character string "[Purpose]" paired with document number 1 and document number 2 in the unnecessary word storage table 1902. Pairs of unnecessary words and header information are stored in the unnecessary word storage table 1902 in the same way thereafter. Data registration ends once this operation has been performed for all search target data.
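A minimal sketch of this bookkeeping follows. Representing the unnecessary word storage table 1902 as a mapping from word to a set of document numbers is an assumption (the text describes a list of pairs), as are the function name and the sample documents. Duplicate character string deletion is omitted for brevity.

```python
from collections import defaultdict


def delete_stop_words(doc_id, text, stop_words, table):
    """Remove unnecessary words from `text` and record each deletion.

    `table` maps a deleted word to the set of document numbers (standing in
    for header information) it was removed from.
    """
    for word in sorted(stop_words, key=len, reverse=True):
        if word in text:
            table[word].add(doc_id)
            text = text.replace(word, "")
    return text


# Invented sample documents keyed by document number.
docs = {
    1: "[Purpose] fast search",
    2: "[Purpose] small size",
    3: "Description of the drawing",
}
table = defaultdict(set)
stripped = {
    doc_id: delete_stop_words(doc_id, text, {"[Purpose]"}, table)
    for doc_id, text in docs.items()
}
```

After registration, `table["[Purpose]"]` holds document numbers 1 and 2, mirroring the pairing described for FIG. 20, and document 3 is untouched.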

[0062] Next, the flow of search processing is described with reference to FIG. 21. When a search condition is input from the input means 107, the search processing means 1903 first extracts the matching string from the condition. It then refers to the unnecessary word storage table 1902 to check whether that matching string is stored there. If it is stored, the string has been deleted from the search target data as an unnecessary word (that is, it is not stored in the compressed data). If it is not stored in the unnecessary word storage table 1902, it has not been deleted from the search target data.

[0063] If reference to the unnecessary word storage table 1902 shows that the matching string is not stored, the search processing means 1903 searches all compressed data as in the first embodiment.

[0064] If it is stored, the search processing means 1903 obtains from the table the header information of the search target data stored with it. It then obtains the search target file from the header information, and sends that file and its header information to the output means 108 as a search result. Next, it searches the remaining data, that is, the compressed data whose header information differs from that of the search target data already sent to the output means 108. The search processing here is the same as in the first embodiment.
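The two branches of this search flow can be sketched as follows, again under assumptions: the table is modeled as a mapping from word to a set of document numbers, compressed data as a mapping from document number to text, and the search condition is reduced to a single matching string rather than a full logical expression.

```python
def search(term, compressed, stop_table):
    """Return the document numbers that match `term`.

    `compressed` maps document number -> compressed text.
    `stop_table` maps deleted word -> set of document numbers it was deleted from.
    """
    # Documents from which the term was deleted are hits without any scanning.
    hits = set(stop_table.get(term, ()))
    # Scan only the remaining compressed data (all of it if the term was never deleted).
    for doc_id, text in compressed.items():
        if doc_id not in hits and term in text:
            hits.add(doc_id)
    return sorted(hits)


# Invented sample data.
stop_table = {"[Purpose]": {1, 2}}
compressed = {1: "fast search", 2: "small size", 3: "search index"}
```

Calling `search("[Purpose]", compressed, stop_table)` takes the table branch and reports documents 1 and 2, while `search("search", compressed, stop_table)` falls through to an ordinary scan of all compressed data.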

[0065] According to the present invention, even when a word has been deleted from the search target data as an unnecessary word, the unnecessary word deleting means 1901 keeps a record of the deletion in the unnecessary word storage table 1902 and the search processing means 1903 refers to that record when searching, so the search target data can be compressed while still achieving a search with few omissions.

[0066] (Embodiment 6) A sixth embodiment of the present invention is described below with reference to the drawings. FIG. 22 is a configuration diagram of an information retrieval device according to this embodiment. In FIG. 22, 101 is a search target data storage unit, 103 is an unnecessary word dictionary, 104 is duplicate character string deleting means, 105 is a compressed data storage unit, 107 is input means, 108 is output means, 1901 is unnecessary word deleting means, 1902 is an unnecessary word storage table, 1903 is search processing means, and 2201 is data reproducing means that deletes a designated unnecessary word from the unnecessary word dictionary 103, refers to the unnecessary word storage table 1902 to obtain the header information of the compressed data from which that word was deleted, and stores the word back into the corresponding compressed data.

[0067] The data reproducing means 2201 of this embodiment is described with reference to FIG. 23. When a word that is no longer to be kept in the unnecessary word dictionary 103 is designated from the input means 107, the data reproducing means 2201 first deletes the designated word from the unnecessary word dictionary 103. In FIG. 23, for example, "こと" (koto) is a word subject to deletion from the search target data (FIG. 23(a)).

[0068] Next, it refers to the unnecessary word storage table 1902 to obtain the header information of the compressed data from which the word has already been deleted (FIG. 23(b)). It then locates the corresponding compressed data from the obtained header information and appends the unnecessary word to the end of that data (FIG. 23(c)). Once the data reproducing means 2201 has appended the designated word to all corresponding compressed data, it deletes from the unnecessary word storage table 1902 the entry for that word and the header information of the search target data from which it was deleted (FIG. 23(b)).
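The restoration steps can be sketched as below. The data structures (a set for the unnecessary word dictionary, a mapping for the storage table, a mapping for the compressed data), the delimiter code, and the function name are all assumptions made for illustration.

```python
WORD_DELIM = "\x1f"  # hypothetical delimiter between necessary words


def restore_word(word, stop_words, stop_table, compressed):
    """Cancel an unnecessary word and put it back into the affected records."""
    stop_words.discard(word)                      # (a) remove it from the dictionary
    for doc_id in stop_table.get(word, ()):       # (b) look up the affected records
        compressed[doc_id] += WORD_DELIM + word   # (c) append the word at the end
    stop_table.pop(word, None)                    # finally drop the deletion record


# Invented sample state after registration.
stop_words = {"koto", "mono"}
stop_table = {"koto": {1, 3}}
compressed = {1: "search", 2: "index", 3: "device"}
restore_word("koto", stop_words, stop_table, compressed)
```

Because the word is appended at the end of the record rather than at its original position, the record can match the word again in a full-text search, but the original word order is not recovered.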

[0069] This operation starts as soon as a command is input from the input means 107 while neither data registration nor search processing is in progress, and it takes effect in subsequent data registration and search processing.

[0070] According to this embodiment, providing the data reproducing means 2201 makes it possible to cancel the registration of a word immediately even after it has been registered in the unnecessary word dictionary 103 as an unnecessary word, and to reproduce the data by appending the designated word to compressed data from which it has already been deleted, realizing an information retrieval device that is easier to use.

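[0071]

[Effects of the Invention] As described above, the information retrieval device of the present invention is provided with unnecessary word deleting means and duplicate deleting means that reduce the size of the search target data without losing its information content, thereby preventing search omissions and enabling high-speed search. Compared with keyword search, full-text search misses fewer keywords and is therefore easier to use; moreover, because full-text search can be performed on character strings that contain the keywords, an information retrieval device with excellent search accuracy can be realized.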
By providing the unnecessary word deleting unit and the duplicate deleting unit that can reduce the capacity of the search target data without losing the information amount of the search target data, the search omission can be prevented and the search can be performed at high speed. Compared to the keyword search, the full-text search reduces the leakage of the keywords, which is easy to use, and the full-text search can be performed from the character string including the keyword. Therefore, the information retrieval device having the excellent search accuracy can be realized. .

[0072] In particular, the larger the data, the more often the same searchable character strings recur, so duplicate deletion reduces the data size further and the effect is greater. In addition, reducing the size of the search target data reduces the index, which makes this an excellent information retrieval device with a substantial economic benefit.

[0073] Furthermore, when a compression range can be designated for the search target data, only the desired parts of a document are compressed, so the searcher can choose between searching compressed data and uncompressed data, and a wider variety of search target data can be handled.

[0074] In addition, by searching each piece of data partitioned into tagged blocks, the logical expression used as the search condition can be exploited effectively, and searches for character strings occurring near one another also become possible.

[0075] Moreover, search target data having the same structure can be searched by range or by block, making it easier to obtain the desired data.

[0076] By keeping a record of the unnecessary words deleted from the search target data, searches can refer to that record, so the search target data can be compressed while still achieving a search with few omissions.

[0077] Furthermore, even after a word has been registered as an unnecessary word, its registration can be cancelled immediately, and the data can be reproduced by appending the designated word to compressed data from which it has already been deleted, realizing an information retrieval device that is easier to use.

[Brief description of drawings]

FIG. 1 is a configuration diagram of the information retrieval device in the first embodiment of the present invention.

FIG. 2 is a conceptual diagram of data in the first embodiment of the present invention.

FIG. 3 is a flowchart showing the overall operation in the first embodiment of the present invention.

FIG. 4 is a flowchart showing unnecessary word deletion processing in the first embodiment of the present invention.

FIG. 5 is a configuration diagram of the unnecessary word deleting means in the first embodiment of the present invention.

FIG. 6 is a flowchart showing the longest-match method used when extracting unnecessary words in the first embodiment of the present invention.

FIG. 7 is a conceptual diagram showing the longest-match method in the first embodiment of the present invention.

FIG. 8 is a conceptual diagram of necessary word calculation processing and the necessary word storage table in the first embodiment of the present invention.

FIG. 9 is a flowchart showing duplicate character string deletion processing in the first embodiment of the present invention.

FIG. 10 is a conceptual diagram showing duplicate character string deletion processing in the first embodiment of the present invention.

FIG. 11 is a flowchart showing search processing in the first embodiment of the present invention.

FIG. 12 is a configuration diagram of the information retrieval device in the second embodiment of the present invention.

FIG. 13 is a conceptual diagram of data in the second embodiment of the present invention.

FIG. 14 is a configuration diagram of the information retrieval device in the third embodiment of the present invention.

FIG. 15 is a conceptual diagram of data in the third embodiment of the present invention.

FIG. 16 is a conceptual diagram showing search conditions and search results in the third embodiment of the present invention.

FIG. 17 is a configuration diagram of the information retrieval device in the fourth embodiment of the present invention.

FIG. 18 is a conceptual diagram of data in the fourth embodiment of the present invention.

FIG. 19 is a configuration diagram of the information retrieval device in the fifth embodiment of the present invention.

FIG. 20 is a conceptual diagram of data in the fifth embodiment of the present invention.

FIG. 21 is a flowchart showing search processing in the fifth embodiment of the present invention.

FIG. 22 is a configuration diagram of the information retrieval device in the sixth embodiment of the present invention.

FIG. 23 is a processing diagram of the data reproducing means in the sixth embodiment of the present invention.

FIG. 24 is a configuration diagram of a conventional information retrieval device.

[Explanation of symbols]

101 Search target data storage unit
102 Unnecessary word deleting means
102a Unnecessary word position extracting means
102b Necessary word calculating means
102c Necessary word storage table
103 Unnecessary word dictionary
104 Duplicate character string deleting means
105 Compressed data storage unit
106 Search processing means
107 Input means
108 Output means
1201 Compression range storage unit
1401 Compression range multiple storage unit
1402 Search processing means
1701 Search range designating means
1702 Search processing means
1901 Unnecessary word deleting means
1902 Unnecessary word storage table
1903 Search processing means
2201 Data reproducing means
2401 Search target data storage unit
2402 Morphological analysis processing means
2403 Morphological analysis dictionary
2404 Word data
2405 Keyword extracting means
2406 Keyword extraction dictionary
2407 Keyword data
2408 Search processing means
2409 Input means
2410 Output means

Claims

[Claims]

1. An information retrieval device comprising: a search target data storage unit that stores search target data; a compressed data storage unit that stores data obtained by deleting duplicated portions of character strings from the search target data stored in the search target data storage unit; input means for inputting a search condition; search means for searching the compressed data storage unit for data matching the search condition input from the input means; and display means for displaying the search target data, stored in the search target data storage unit, that corresponds to the search result of the search means.

2. An information retrieval device comprising: a search target data storage unit that stores search target data; a compressed data storage unit that stores data obtained by deleting character strings designated as unnecessary words from the search target data stored in the search target data storage unit; input means for inputting a search condition; search means for searching the compressed data storage unit for data matching the search condition input from the input means; and display means for displaying the search target data, stored in the search target data storage unit, that corresponds to the search result of the search means.

3. An information retrieval device comprising: a search target data storage unit that stores search target data; an unnecessary word dictionary that stores words not subject to search; unnecessary word deleting means for deleting, using the unnecessary word dictionary, words not subject to search from the search target data stored in the search target data storage unit; duplicate character string deleting means for deleting duplicated portions of character strings in the search target data; a compressed data storage unit that stores compressed data created from the search target data by the unnecessary word deleting means and the duplicate deletion processing; input means for inputting a search condition; search processing means for performing a full-text search on the compressed data in accordance with the search condition; and output means for outputting the search result of the search processing means.

4. The information retrieval device according to claim 3, wherein the unnecessary word deleting means uses, as its unnecessary word extraction method, a longest-match method that detects the longest character string matching under a prefix-match condition.

5. The information retrieval device according to claim 3, wherein the duplicate character string deleting means uses, as its duplicate character string deletion method, a method that detects duplicate character strings by prefix matching, intermediate matching, or exact matching.

6. The information retrieval device according to claim 3, wherein the compressed data stored in the compressed data storage unit is delimited by a plurality of types of delimiters that change the range subject to search.

7. The information retrieval device according to claim 3, further comprising a compression range storage unit that stores the range over which the unnecessary word deleting means and the duplicate character string deleting means compress data.

8. The information retrieval device according to claim 3, wherein the compression range storage unit stores a plurality of compression ranges, and the search processing means performs a search for each of the compression ranges.

9. The information retrieval device according to claim 8, wherein the search processing means has a function of performing a search by designating either a compression range or search target data outside the compression ranges.

10. The information retrieval device according to claim 3, further comprising an unnecessary word storage table that stores the unnecessary words deleted from the search target data by the unnecessary word deleting means using the unnecessary word dictionary, wherein the unnecessary word storage table is included among the search targets of the search processing means.

11. The information retrieval device according to claim 10, further comprising data reproducing means that cancels the registration of an unnecessary word in the unnecessary word dictionary, checks the unnecessary word storage table in which the unnecessary word deleting means has stored the unnecessary words deleted from the search target data using the unnecessary word dictionary, appends the unnecessary word whose registration was cancelled to the compressed data, and removes that unnecessary word from the unnecessary word storage table.