JP2000311173A

JP2000311173A - Device and method for retrieving similar document

Info

Publication number: JP2000311173A
Application number: JP11120023A
Authority: JP
Inventors: Shigemi Nakazato; 茂美中里; Takeshi Matsukuma; 剛松隈; Yukio Nakamoto; 幸夫中本; Takuya Nishina; 卓哉仁科
Original assignee: Toshiba Corp; Toshiba Computer Engineering Corp
Current assignee: Toshiba Corp; Toshiba Computer Engineering Corp
Priority date: 1999-04-27
Filing date: 1999-04-27
Publication date: 2000-11-07

Abstract

PROBLEM TO BE SOLVED: To retrieve, with high precision and without depending on similarity alone, the similar documents that should be extracted, by calculating the similarity between a search key document and each search target document, finding from it the probability that the search target document is a document that should be extracted, and outputting the document having the largest probability. SOLUTION: A similarity calculation unit 206 calculates by a vector space method how similar the search key document in a search key document storage buffer unit 251 and a search target document in a search target document storage buffer unit 252 are to each other. A similarity-likelihood conversion unit 207 takes table data out of a similarity-likelihood conversion table storage buffer unit 256 according to the similarity and the field of the search target document and calculates the corresponding likelihood. A similar document candidate sorting unit 208 sorts the similar document candidates by likelihood. A similar document candidate output unit 209 calculates, in order from the candidate having the largest likelihood, the probability that at least one correct answer is included in the similar document candidates, and displays the similar document candidates on a display device.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a search apparatus for electronic document data, and more particularly to a similar document search apparatus and a similar document search method that take a given piece of document data as a search key and automatically search for document data similar to it.

[0002]

[Description of the Related Art] In recent years, large volumes of digitized document data have come into circulation, and systems that automatically search a document database for documents similar to a specified document (hereinafter, the search key document), for purposes such as automatic classification, have been put to practical use. A commonly known technique calculates the similarity between the search key document and each search target document using the number of common words or the vector space method, and outputs the documents as similar documents in descending order of similarity. In addition, as disclosed for example in Japanese Patent Application Laid-Open No. H11-73415, there is a technique that, in order to obtain more credible similar documents as search results, calculates a statistical distribution of the similarity values (such as their mean) and outputs as similar documents those that satisfy a user-set condition defined relative to that distribution. In short, this technique takes the set of similarity values and shows how to pick out of their distribution the documents that should be regarded as similar.

[0003]

[Problems to Be Solved by the Invention] The conventional similar document search apparatus uses the set of similarity values and shows how to extract from their distribution the documents that should be regarded as similar, and so proposes a more accurate method than using the similarity alone. However, it gives no consideration to the fact that the correct-answer rate is not constant even for the same similarity value. That is, even for the same similarity value, the probability that a document is a correct answer is not necessarily the same. For example, if the search target documents include two documents that happen to receive the same similarity value, one may be a correct answer and the other an incorrect one. When the probability of being correct differs despite identical similarity values, the above technique treats documents with the same similarity value as equivalent, so it cannot extract similar documents with enough accuracy to distinguish the two.

[0004] Regarding the number of documents to output as similar, it is often enough that the output contains one correct answer, or some small specific number of them. The technique described above, by contrast, selects documents by a condition such as "search target documents whose similarity is at least twice the average similarity are taken as the search result". Deciding by similarity alone in this way makes the probability that the output contains a correct answer vary widely: sometimes no correct answer is included at all, and sometimes, even though a correct answer is included, more documents are output than necessary. As a search task, this is inefficient.

[0005] The present invention addresses these problems. Its object is to provide a similar document search apparatus and a similar document search method that can retrieve, with high accuracy, the similar documents that should be extracted, without depending on similarity alone. A further object is to provide a similar document search apparatus and method that use both the classification to which a search target document belongs and its similarity, reflect the differences between classifications in the correct-answer rate for a given similarity, and so retrieve more appropriate similar documents with high accuracy. A still further object is to provide a similar document search apparatus and method that ensure the documents output as a search result contain a document that should be extracted with a desired probability.

[0006]

[Means for Solving the Problems] To achieve the above objects, the invention according to claim 1 is a similar document search apparatus that searches a plurality of search target documents for documents similar to a search key document, characterized by comprising: similarity calculation means for calculating the similarity between the search key document and each of the search target documents; means for obtaining, according to the similarity found by the similarity calculation means, the probability that the search target document is a document that should be extracted; and output means for outputting at least the document for which the probability obtained by this means is highest. With this configuration, documents that have a high probability of being documents that should be extracted are output for each similarity, so the similar documents that should be extracted can be retrieved with high accuracy.

[0007] As set out in claim 2, the similar document search apparatus of the present invention is a similar document search apparatus that searches a plurality of search target documents containing classification information for documents similar to a search key document, characterized by comprising: similarity calculation means for calculating the similarity between the search key document and each of the search target documents; means for obtaining, based on the similarity found by the similarity calculation means and the classification of the search target document, the probability that the search target document is a document that should be extracted; and output means for outputting the search target documents in descending order of the probability obtained by this means. With this configuration, the differences between classifications in the correct-answer rate for a given similarity are reflected, and more appropriate similar documents can be retrieved with high accuracy.

[0008] As set out in claim 5, the similar document search apparatus of the present invention is a similar document search apparatus that searches a plurality of search target documents for documents similar to a search key document, characterized by comprising: similarity calculation means for calculating the similarity between the search key document and each of the search target documents; means for obtaining, according to the similarity found by the similarity calculation means, the probability that the search target document is a document that should be extracted; output means for outputting the search target documents in descending order of the probability obtained by this means; and means for limiting the search target documents output by the output means to those output up to the point at which the probability that the output documents include a document that should be extracted reaches or exceeds a predetermined value. With this configuration, the output documents can be made to include a document that should be extracted with a desired probability.

[0009]

[Embodiments of the Invention] An embodiment of the present invention is described below with reference to the drawings. FIG. 1 shows the hardware configuration of a similar document search apparatus according to an embodiment of the present invention. The apparatus is built as one function on a computer with a general-purpose architecture.

[0010] As shown in FIG. 1, the apparatus comprises a control device 1; an input device 2 having a keyboard, a pointing device, and a scanner; a display device 3 that displays information such as the results of a similar document search; and an external storage device 4. The external storage device 4 holds a document information storage unit 4a, which stores a large amount of document information, and a table storage unit 4b, which stores the table used to derive a likelihood from a similarity. It consists of, for example, a hard disk drive or a DVD (Digital Video Disc) drive.

[0011] FIG. 2 shows the configuration of the control device 1 of the apparatus. The control device 1 has a CPU, a ROM, and a RAM. In FIG. 2, the part realized by the CPU and the ROM is represented functionally as a program unit 200, and the RAM is represented functionally as a buffer unit 250. The program unit 200 has eleven functions: an initialization unit 201, a search key document input unit 204, a search target document reading unit 205, a similarity calculation unit 206, a similarity-likelihood conversion unit 207, a similar document candidate sorting unit 208, a similar document candidate output unit 209, an output candidate count determination condition setting unit 210, a correct/incorrect result input unit 211, a similarity-likelihood conversion table reading unit 212, and a similarity-likelihood conversion table writing unit 213.

[0012] The buffer unit 250 has seven areas: a search key document storage buffer unit 251, a search target document storage buffer unit 252, a similarity calculation result storage buffer unit 253, a similar document candidate storage buffer unit 254, an output candidate count determination condition storage buffer unit 255, a similarity-likelihood conversion table storage buffer unit 256, and a cumulative likelihood storage buffer 257. The initialization unit 201 clears the data in each buffer of the buffer unit 250 and writes the similarity-likelihood conversion table from the table storage unit 4b into the similarity-likelihood conversion table storage buffer unit 256.

[0013] The search key document input unit 204 stores the search key document that the user enters with the input device 2 in the search key document storage buffer unit 251. The search target document reading unit 205 receives search target document information from the document information storage unit 4a and stores it in the search target document storage buffer unit 252. The similarity calculation unit 206 calculates, by the vector space method, the degree of similarity between the search key document stored in the search key document storage buffer unit 251 and the search target document stored in the search target document storage buffer unit 252, and stores the result in the similarity calculation result storage buffer unit 253. The similarity may instead be calculated from the number of common words rather than by the vector space method.

[0014] The similarity-likelihood conversion unit 207 uses the similarity calculated by the similarity calculation unit 206 and the field of the search target document to take data out of the table stored in the similarity-likelihood conversion table storage buffer unit 256, calculates the corresponding likelihood, and stores it in the similar document candidate storage buffer unit 254. The likelihood is the correct-answer count divided by the specified count. The correct-answer count is the number of times a document was judged to be one that should be extracted as a search result, while the specified count is simply the number of times the table entry was identified from a similarity and a classification. In other words, the likelihood is the probability that the document in question is a correct answer that should be hit, and it is given for each combination of similarity and classification.
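The lookup described in this paragraph can be sketched as follows. This is a minimal illustration, not the apparatus's actual implementation: the dictionary layout, the function names, and the 0.01-wide bins with an inclusive upper edge are all assumptions (the text only places similarity 0.39 in a "0.38 to 0.39" row and 0.378 in a "0.37 to 0.38" row). The two cells reuse the counts quoted in the later table example.

```python
def bin_index(similarity):
    """Map a similarity to a 0.01-wide bin whose upper edge is inclusive (assumed)."""
    thousandths = round(similarity * 1000)  # work in integers to avoid float edge cases
    return (thousandths - 1) // 10


def likelihood(table, field, similarity):
    """Return correct-answer count / specified count for the matching cell."""
    correct, specified = table.get((field, bin_index(similarity)), (0, 0))
    if specified == 0:
        return None  # no feedback recorded for this cell yet
    return correct / specified


# Hypothetical table: (field, bin index) -> (correct-answer count, specified count)
table = {
    ("printing", 38): (3, 5),    # similarity 0.38 to 0.39
    ("publishing", 37): (2, 4),  # similarity 0.37 to 0.38
}

print(likelihood(table, "printing", 0.39))     # 0.6
print(likelihood(table, "publishing", 0.378))  # 0.5
```

Returning `None` for an empty cell is a design choice of this sketch; the text does not say how a cell with no recorded judgments is handled.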

[0015] The similar document candidate sorting unit 208 sorts the similar document candidates stored in the similar document candidate storage buffer unit 254 by likelihood and stores them in the similar document candidate storage buffer unit 254 again. The similar document candidate output unit 209 takes the candidates stored in the similar document candidate storage buffer unit 254 in descending order of likelihood and calculates the probability that those candidates include at least one correct answer. It then outputs, for display on the display device 3, the similar document candidates up to the point at which the calculated probability exceeds the output candidate count determination condition stored in the output candidate count determination condition storage buffer unit 255. The output candidate count determination condition is the probability that the output similar document candidates include at least one correct answer.
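A rough sketch of this cutoff rule follows. The passage does not give the formula for the probability that the output contains at least one correct answer, so the sketch assumes the per-document likelihoods are independent and uses 1 minus the product of (1 - likelihood); that formula, the function name, and the document IDs are illustrative assumptions only.

```python
def select_candidates(candidates, threshold):
    """Pick candidates in descending likelihood until the chance of at least
    one correct answer exceeds the threshold.

    candidates: list of (doc_id, likelihood) pairs.
    Returns (selected doc_ids, probability reached).
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    selected = []
    miss_all = 1.0  # probability that none of the selected documents is correct
    for doc_id, p in ranked:
        selected.append(doc_id)
        miss_all *= 1.0 - p  # independence between documents is assumed here
        if 1.0 - miss_all > threshold:
            break
    return selected, 1.0 - miss_all


# Hypothetical candidates and a user-set condition of 0.75
docs, reached = select_candidates([("B", 0.5), ("A", 0.6), ("C", 0.3)], 0.75)
print(docs, round(reached, 2))
```

With these inputs the first document alone gives 0.6, which does not exceed 0.75, so a second document is added. If the threshold is never exceeded, the sketch simply returns every candidate.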

[0016] The output candidate count determination condition setting unit 210 stores the output candidate count determination condition, entered by the user via the input device 2, in the output candidate count determination condition storage buffer unit 255. The correct/incorrect result input unit 211 updates or corrects the data in the similarity-likelihood conversion table storage buffer unit 256, for each similarity calculation result stored in the similarity calculation result storage buffer unit 253 and for each classification or field name assigned to the similar document, according to information on whether the similar document candidate was a correct answer, that is, a document that should have been extracted as a search result. The user enters this information via the input device 2.

[0017] The similarity-likelihood conversion table reading unit 212 receives the similarity-likelihood conversion table stored in the table storage unit 4b of the external storage device 4 and stores it in the similarity-likelihood conversion table storage buffer unit 256. This table is used to obtain a likelihood from the similarity and field of a search target document. The similarity-likelihood conversion table writing unit 213 overwrites the table storage unit 4b with the similarity-likelihood conversion table held in the similarity-likelihood conversion table storage buffer unit 256.

[0018] Next, the operation of the similar document search apparatus of this embodiment is described. The operations described here are executed by the CPU of the control device 1 using the program in the ROM and the storage areas in the RAM. The embodiment consists broadly of a first step and a second step. The first step is a preparation stage that puts the apparatus into a state in which it can perform similar document searches: a similarity-likelihood conversion table is created from sample data and stored in the table storage unit 4b. The second step searches for similar documents using the table so created. The first step, creating the similarity-likelihood conversion table, is described first. FIG. 3 is a flowchart of the creation procedure.

[0019] First, the user uses the input device 2 to store the data of the search target documents in the document information storage unit 4a. When they are stored, each search target document is given a document ID that identifies the document and a field that indicates its classification (step 300). The initialization unit 201 then initializes all buffers (step 301). When the user enters a search key document via the input device 2, the search key document input unit 204 stores it in the search key document storage buffer unit 251 (step 302). The search key document is, for example, a text document with content such as that shown in FIG. 4, and is entered with the keyboard or the scanner.

[0020] Next, the search target document reading unit 205 reads a plurality of search target documents from the document information storage unit 4a of the external storage device 4, into which the search target document data was entered in step 300, and stores them in the search target document storage buffer unit 252 (step 303). A search target document is, for example, as shown in FIG. 6; as described above, it carries the body text, a document ID that identifies the document, and field information that indicates its classification.

[0021] The similarity calculation unit 206 compares the search key document stored in the search key document storage buffer unit 251 with the body text of the search target document stored in the search target document storage buffer unit 252 and calculates the similarity by the vector space method. The calculation is outlined briefly here. First, the search key document and the search target document are each divided into words by morphological analysis. Next, the number of occurrences of each word in the search key document is counted, and likewise the number of occurrences of each word in the search target document. The words that appear in both documents, and their occurrence counts, are then extracted. Suppose, for example, that the word "document" appears three times and the word "print" once in the search key document, and that "document" appears five times and "print" twice in the search target document. A vector is then formed for each document, taking the count of "document" as the x-axis value and the count of "print" as the y-axis value. The vector of the search key document is (3, 1) and that of the search target document is (5, 2). Letting θ be the angle between the two vectors, cos θ is calculated. This value, normalized by a predetermined formula, is the similarity. The smaller θ is, the greater the similarity. The calculation is carried out over all words. The similarity calculation unit 206 stores the similarity obtained in this way in the similarity calculation result storage buffer unit 253, together with the document ID of the search target document and the field information indicating its classification. FIG. 6 shows an example of the data stored in the similarity calculation result storage buffer unit 253. The first stored record indicates a document ID of "5933", a field name of "publishing", and a similarity of "0.378".
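The cos θ step of the vector space method can be illustrated with the two-word example from this paragraph. The sketch below is illustrative only: it assumes the documents have already been split into words (the morphological analysis is out of scope), the function name is invented for the example, and it stops at cos θ because the "predetermined formula" used for the subsequent normalization is not given in the text.

```python
import math
from collections import Counter


def cosine_similarity(key_words, target_words):
    """Cosine of the angle between the word-count vectors of two documents."""
    key_counts = Counter(key_words)
    target_counts = Counter(target_words)
    vocabulary = set(key_counts) | set(target_counts)
    # A Counter returns 0 for a missing word, so words in only one document add nothing
    dot = sum(key_counts[w] * target_counts[w] for w in vocabulary)
    key_norm = math.sqrt(sum(c * c for c in key_counts.values()))
    target_norm = math.sqrt(sum(c * c for c in target_counts.values()))
    if key_norm == 0 or target_norm == 0:
        return 0.0  # an empty document is treated as dissimilar to everything
    return dot / (key_norm * target_norm)


# Word counts from the text: key document (3, 1), target document (5, 2)
key_doc = ["document"] * 3 + ["print"]
target_doc = ["document"] * 5 + ["print"] * 2
print(round(cosine_similarity(key_doc, target_doc), 4))  # 0.9983
```

Here cos θ = 17 / (√10 · √29) ≈ 0.9983, so the two example vectors point in almost the same direction.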

[0022] Once the similarity has been stored, it is determined whether any search target documents remain whose similarity has not been calculated (step 305). If any remain, the process returns to step 303 and the operations of steps 303 and 304 are performed on the remaining search target documents. If no search target documents remain, the process proceeds to step 306. In step 306, the similar document candidate sorting unit 208 sorts by similarity the similarity calculation results that were stored in the similarity calculation result storage buffer unit 253 in step 304, and stores them in the similarity calculation result storage buffer unit 253 again.

[0023] The similar document candidate output unit 209 then outputs the similar document candidates to the display device 3 in descending order of similarity (step 307). The number of candidates output can be specified by the user each time, or set by the user in advance and applied to each output, among other methods. In this case the display device 3 displays the similar document candidates, for example as shown in FIG. 7, by pairing each document ID with its similarity for the top three candidates.
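Steps 306 and 307 amount to a descending sort by similarity followed by a cut at the requested number of candidates. A minimal sketch, assuming each result is a small record: the first two records use the document IDs, fields, and similarities quoted in the text, while the remaining records, the field names `doc_id` / `field` / `similarity`, and the function name are hypothetical.

```python
def rank_by_similarity(results, top_n=3):
    """Sort similarity results in descending order and keep the top N
    as (document ID, similarity) pairs, as in the display described for FIG. 7."""
    ranked = sorted(results, key=lambda r: r["similarity"], reverse=True)
    return [(r["doc_id"], r["similarity"]) for r in ranked[:top_n]]


results = [
    {"doc_id": 5933, "field": "publishing", "similarity": 0.378},
    {"doc_id": 75268, "field": "printing", "similarity": 0.39},
    {"doc_id": 1234, "field": "printing", "similarity": 0.31},    # hypothetical record
    {"doc_id": 4321, "field": "publishing", "similarity": 0.12},  # hypothetical record
]

for doc_id, similarity in rank_by_similarity(results):
    print(doc_id, similarity)
```

The default of three candidates mirrors the top-three display mentioned in the text; a user-specified count would simply be passed as `top_n`.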

[0024] Next, the user operates the input device 2 to enter, via the correct/incorrect result input unit 211, whether each similar document candidate output in step 307 is actually similar to the search key document and should be extracted (step 308). A candidate that should be extracted is a correct answer; otherwise it is an incorrect answer. On this input, the correct/incorrect result input unit 211 updates the relevant part of the similarity-likelihood conversion table in the similarity-likelihood conversion table storage buffer unit 256 as follows (step 309). FIG. 8 shows an example of the conversion table. The rows are divided by similarity and the columns by field, and each field is further divided into a correct-answer count and a specified count. In step 308, for example, if the user judges that the top-ranked search target document (document ID = 75268) is similar and enters that judgment, the correct/incorrect result input unit 211 identifies cell A from the document's field (printing) and similarity (0.39) and adds 1 to both the correct-answer count (3) and the specified count (5) of that cell. If the user judges that the second-ranked candidate (document ID = 5933) is not similar and enters that judgment, the correct/incorrect result input unit 211 identifies cell B from the document's field (publishing) and similarity (0.378), leaves the correct-answer count (2) of cell B unchanged, and adds 1 only to the specified count (4). The third-ranked candidate is processed in the same way. As a result of this processing, the likelihood (= correct-answer count / specified count) of the cell for field = printing and similarity = 0.38 to 0.39 rises from 3/5 = 0.6 to 4/6 = 0.67, whereas that of the cell for field = publishing and similarity = 0.37 to 0.38 falls from 2/4 = 0.5 to 2/5 = 0.4. In other words, this processing sets, for each field, an appropriate correct-answer rate for each similarity, so that when the correct-answer rate for the same similarity differs between fields, the difference can be reflected in the extraction of similar documents. Viewed within a single field, it also means that a likelihood is defined for each similarity value.
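The table update in step 309 can be sketched with the same numbers as the worked example above. As before, the dictionary layout, the function names, and the 0.01-wide bins with an inclusive upper edge are assumptions made for illustration; only the counts and the resulting ratios come from the text.

```python
def bin_index(similarity):
    """Map a similarity to a 0.01-wide bin whose upper edge is inclusive (assumed)."""
    return (round(similarity * 1000) - 1) // 10


def record_feedback(table, field, similarity, is_correct):
    """Apply one user judgment: the specified count always grows by 1,
    the correct-answer count only when the document was judged correct."""
    key = (field, bin_index(similarity))
    correct, specified = table.get(key, (0, 0))
    table[key] = (correct + (1 if is_correct else 0), specified + 1)
    return table[key]


# (field, bin index) -> (correct-answer count, specified count), as in cells A and B
table = {("printing", 38): (3, 5), ("publishing", 37): (2, 4)}

cell_a = record_feedback(table, "printing", 0.39, True)      # document 75268 judged similar
cell_b = record_feedback(table, "publishing", 0.378, False)  # document 5933 judged not similar

print(cell_a, round(cell_a[0] / cell_a[1], 2))  # (4, 6) 0.67
print(cell_b, round(cell_b[0] / cell_b[1], 2))  # (2, 5) 0.4
```

Storing raw counts rather than the ratio itself is what lets each new judgment be folded in with a simple increment.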

[0025] Besides the method of this embodiment, whether documents are similar can also be judged automatically, for example by entering as the search key document a document whose field is already known but which is not registered as a search target document, and comparing that field with the field assigned to each similar document output by the search. Next, it is determined whether any search key documents remain (step 310). If any remain, the process returns to step 302 and the same processing is performed for the next search key document. If no other search key documents remain, the process proceeds to step 311.

[0026] Finally, the similarity accuracy conversion table created in this preparation stage and stored in the similarity accuracy conversion table storage buffer unit 256 is written into the table storage unit 4b by the similarity accuracy conversion table writing unit 213 (step 311), and the process ends. As in the example shown in FIG. 9, the created similarity accuracy conversion table must be one in which the number of correct answers and the specific number have reached a predetermined practical level, so that their ratio can be used as an accuracy.

[0027] Next, the second step, in which similar documents are searched for using the similarity accuracy conversion table created as described above, will be explained. FIG. 10 is a flowchart showing the procedure. First, the initialization unit 201 clears all buffers, and the similarity accuracy conversion table reading unit 212 loads the similarity accuracy conversion table from the table storage unit 4b of the external storage device 4 into the similarity accuracy conversion table storage buffer unit 256 (step 401). The table loaded here is either the one created in the first step or one that has since been updated in the course of similar document searches. Here, the table created in the first step, shown in FIG. 9, is assumed to be loaded.

[0028] Next, the output candidate number determination condition setting unit 210 is activated. It accepts from the user, through the input device 2, a condition on the probability that the similar document candidates to be output include at least one correct document, and stores (sets) it in the output candidate number determination condition storage buffer unit 255 (step 402). FIG. 11 shows a storage example in which 0.97 is specified as the probability that the output similar document candidates include at least one correct answer. Next, the search key document input unit 204 accepts a search key document entered by user operation of the input device 2 and stores it in the search key document storage buffer unit 251 (step 403). As a concrete example, assume that a text document with the contents illustrated in FIG. 12 is stored as one of the search key documents.

[0029] The search target document reading unit 205 then reads a plurality of documents from the document information storage unit 4a and stores them in the search target document storage buffer unit 252 as search target documents (step 404). For example, text documents with contents such as those shown in FIG. 5 are stored as search target documents. Once the search key document and the search target documents are stored in the buffer unit 250, the similarity calculation unit 206 compares the search key document stored in the search key document storage buffer unit 251 with a search target document stored in the search target document storage buffer unit 252, calculates the similarity by the same method as in step 304, and stores the result in the similarity calculation result storage buffer unit 253 as illustrated in FIG. 13 (step 405).
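
The method of step 304 is not restated in this passage; the abstract says only that the similarity is calculated by a vector space method. As a rough sketch under that reading, the example below computes cosine similarity between raw term-frequency vectors. The whitespace tokenizer, the lack of term weighting, and the function names are all assumptions made for illustration (Japanese text would need a proper word segmenter).

```python
import math
from collections import Counter


def term_vector(text: str) -> Counter:
    """Build a term-frequency vector; whitespace tokenization is an assumption."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


key_doc = term_vector("laser printer toner cartridge")
target_doc = term_vector("toner cartridge for a laser printer")
unrelated = term_vector("quarterly sales forecast")

print(cosine_similarity(key_doc, target_doc))
print(cosine_similarity(key_doc, unrelated))
```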

[0030] After the similarity for a search target document has been calculated, it is determined whether any search target documents remain whose similarity has not yet been calculated (step 406). If so, the process returns to step 404 and the same processing is performed for the next search target document. If no other search target document remains, the process proceeds to the next step (step 407).

[0031] When the similarity has been calculated for all search target documents, each similarity calculation result stored in the similarity calculation result storage buffer unit 253 is converted as follows. From the similarity and the field assigned to the search target document, the number of correct answers and the specific number in the corresponding column of the similarity accuracy conversion table stored in the similarity accuracy conversion table storage buffer unit 256 are retrieved, the accuracy (= number of correct answers / specific number, that is, the probability of a correct answer) is calculated, and the result is stored in the similar document candidate storage buffer unit 254 as shown in FIG. 14 (step 407). For example, if the similarity calculation in step 405 gives "document ID = 8652", "field = computer", and "similarity = 0.371", then using the similarity accuracy conversion table of FIG. 9, the specific number "327" and the number of correct answers "291" in column C ("computer") of the row for similarity "0.37 to 0.38" give an accuracy of 291/327 = 0.890. FIG. 14 shows a storage example of the converted similar document candidates. In this conversion, if the calculated similarity has a finer precision than the similarity step of the conversion table (0.01 in the example of FIG. 9), the accuracy is interpolated from the accuracies calculated from the adjacent entries of the table.
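
The conversion in step 407 amounts to a table lookup followed by a division. The sketch below illustrates it with the single cell quoted in the text (field "computer", similarity 0.37 to 0.38). The dictionary layout, the integer row index, and the names `bin_index` and `to_accuracy` are assumptions for the example, and the interpolation between neighbouring rows mentioned above is not shown.

```python
import math

# (field, row index) -> (number of correct answers, specific number).
# Row index 37 stands for the similarity range 0.37 to 0.38.
# Only the cell quoted in the text is filled in.
TABLE = {("computer", 37): (291, 327)}


def bin_index(similarity: float, width: float = 0.01) -> int:
    """Row of the table that a similarity falls into (0.01-wide rows, as in FIG. 9).

    The small epsilon guards against floating-point error at row boundaries.
    """
    return math.floor(similarity / width + 1e-9)


def to_accuracy(table, field: str, similarity: float) -> float:
    """Accuracy = number of correct answers / specific number for the matching cell."""
    correct, shown = table[(field, bin_index(similarity))]
    return correct / shown


print(round(to_accuracy(TABLE, "computer", 0.371), 3))
```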

[0032] The similar document candidates that were converted in step 407 and stored in the similar document candidate storage buffer unit 254 are sorted by accuracy and stored again in the buffer unit 254, for example as shown in FIG. 15 (step 408). The process then moves on to outputting candidates in descending order of accuracy until the candidate number determination condition stored in the output candidate number determination condition storage buffer unit 255, here 0.97, is met. First, the accumulated accuracy value storage buffer unit 257 is initialized to 1 (step 409). FIG. 16 shows an example of the accumulated accuracy value buffer 257 in its initialized state. After initialization, the value in the accumulated accuracy value storage buffer unit 257 is multiplied by 1 minus the accuracy of the similar document candidate and the product is stored back in the buffer unit 257 (step 410), and the candidate is output to the display device 3 through the similar document candidate output unit 209 as a similar document candidate (step 411).
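
Read as probabilities, the value kept in buffer 257 can be written compactly. The following is an interpretation rather than a formula stated in the source: it treats the accuracy $p_i$ of the $i$-th ranked candidate as the probability that the candidate is a correct answer and assumes the candidates are independent. The symbols $A_k$, $p_i$, and $P_k$ are notation introduced here. After the $k$ highest-ranked candidates have been output, the accumulated value $A_k$ and the probability $P_k$ that at least one of them is correct would be:

```latex
A_0 = 1, \qquad
A_k = A_{k-1}\,(1 - p_k) = \prod_{i=1}^{k} (1 - p_i), \qquad
P_k = 1 - A_k .
```

Under this reading, $A_k$ is the probability that none of the first $k$ candidates is correct, and $P_k$ is the quantity compared with the user-specified condition (0.97 in the example).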

[0033] Next, in order to feed back whether the similar document candidate output to the display device 3 was actually similar to the search key document and to reflect this in the similarity accuracy conversion table, input is accepted from the user operating the input device 2 (step 412). The operation performed here according to the user's input on whether the candidate is similar is the same as in steps 308 and 309. That is, in the corresponding column of the similarity accuracy conversion table stored in the similarity accuracy conversion table storage buffer unit 256 (the column determined by the similarity and by the field assigned to the identified similar document), "1" is added to the specific number whenever the document was displayed as a candidate, regardless of whether it is similar, and "1" is also added to the number of correct answers in that column if the document was judged to be similar (step 413).

[0034] Steps 410 to 413 are repeated until the candidate number determination condition stored in the output candidate number determination condition storage buffer unit 255 is satisfied (step 414). A concrete example of the state of the similar document candidates and the changes in the accumulated accuracy value storage buffer unit 257 during steps 410 to 414, together with the candidate number determination, is described with reference to FIGS. 15 and 17. The top-ranked similar document candidate, "document ID = 8652", has an accuracy of 0.890, so the accumulated accuracy value becomes the initial value 1 multiplied by 1 - 0.890 = 0.110, that is, 0.110. Because 1 - 0.110 = 0.890 is smaller than the candidate number determination condition "0.97", processing moves on to the next candidate. The second candidate, "document ID = 2419", has an accuracy of 0.708, so the accumulated accuracy value becomes 0.110 multiplied by 1 - 0.708 = 0.292, that is, 0.032. Because 1 - 0.032 = 0.968 is still smaller than the condition "0.97", processing moves on to the next candidate. The third candidate, "document ID = 39924", has an accuracy of 0.508, so the accumulated accuracy value becomes 0.032 multiplied by 1 - 0.508 = 0.492, that is, 0.016. Now 1 - 0.016 = 0.984 exceeds the condition "0.97", so the condition is judged to be satisfied, and the display on the display device 3 ends with only these three candidates, document IDs 8652, 2419, and 39924. FIG. 18 is an example of the similar document candidate display that is output when the similar document candidates are in the state shown in FIG. 17. What matters here is not the order in which the three candidates are displayed, but that at least the candidate with the highest accuracy is displayed, and that the probability that a correct answer is included among these three candidates is 0.984 (≈ 1 document), close to the desired value of 0.97.
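
The worked example above can be reproduced with a short loop. This is a sketch of steps 408 to 414 with the feedback steps left out; the function name `select_candidates` and the tuple layout are assumptions, and the fourth candidate is hypothetical, added only to show that output stops before reaching it. Whether the comparison with the threshold is strict is not stated in the source; `>=` is assumed here.

```python
def select_candidates(candidates, required_probability):
    """Output candidates in descending accuracy until the chance that at least
    one of them is correct reaches the required probability."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)  # step 408
    accumulated = 1.0  # step 409: accumulated accuracy value starts at 1
    selected = []
    for doc_id, accuracy in ranked:
        accumulated *= 1.0 - accuracy  # step 410
        selected.append(doc_id)        # step 411: output the candidate
        if 1.0 - accumulated >= required_probability:  # step 414
            break
    return selected, 1.0 - accumulated


candidates = [
    (2419, 0.708),
    (39924, 0.508),
    (8652, 0.890),
    (99999, 0.300),  # hypothetical extra candidate, not from the source
]

selected, probability = select_candidates(candidates, 0.97)
print(selected, round(probability, 3))
```

Raising the threshold lengthens the output list and lowering it shortens the list, so the user controls the trade-off between the number of candidates to review and the chance of missing every correct document.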

[0035] When the update of the similarity accuracy conversion table is finished, it is determined whether any search key documents remain (step 415). If so, the process returns to step 403 and steps 403 to 414 are performed for the next search key document. If no other search key document remains, the process proceeds to step 416. Finally, the similarity accuracy conversion table writing unit 213 writes the contents of the similarity accuracy conversion table storage buffer unit 256 to the table storage unit 4b of the external storage device 4 (step 416), and the similar document search process ends.

[0036] The processing from step 413 onward can be omitted without affecting the search itself, but performing it has the effect that the latest results are always reflected in the similar document search. In addition, instead of performing the processing from step 412 onward immediately after each similar document candidate is output, the similar documents may be output in a batch, and whether those output candidates are actually similar may be judged later in order to update the similarity accuracy conversion table. The present invention is not limited to the above embodiment as long as its gist is not departed from, and it is widely applicable to database search devices, document classification devices, and the like.

[0037]

[Effects of the Invention] As described in detail above, according to the present invention, a correct answer rate is obtained from data accumulated for each calculated similarity, and documents with a high correct answer rate are extracted. The similar documents that should be extracted can therefore be retrieved with high precision without depending on the similarity alone. Furthermore, because correct answer rate data is accumulated for each field, differences between classifications in the correct answer rate at a given similarity are reflected, and more appropriate similar documents can be retrieved with high precision. Furthermore, according to the present invention, the search can be arranged so that the documents that should be extracted are included in the output documents with a desired probability.

[Brief description of the drawings]

[FIG. 1] A diagram showing the hardware configuration of a similar document search device according to an embodiment of the present invention.

[FIG. 2] A functional block diagram of the control device in the similar document search device of FIG. 1.

[FIG. 3] A diagram showing the procedure for creating the table used to convert the similarity of a search target document into an accuracy.

[FIG. 4] An example of a search key document during table preparation.

[FIG. 5] An example of a search target document during table preparation.

[FIG. 6] An example of similarity calculation results for search target documents during table preparation.

[FIG. 7] An example of the similar document candidate display during table preparation.

[FIG. 8] An example of the similarity accuracy conversion table at the table preparation stage.

[FIG. 9] An example of the similarity accuracy conversion table used during a search.

[FIG. 10] A diagram showing the similar document search procedure.

[FIG. 11] An example of the output determination condition during a search.

[FIG. 12] An example of a search key document during a search.

[FIG. 13] An example of similarity calculation results for search target documents during a search.

[FIG. 14] An example of the accuracies of similar document candidates.

[FIG. 15] An example of similar document candidates after sorting by accuracy.

[FIG. 16] An example of the initial value of the accumulated accuracy value.

[FIG. 17] An example of changes in the accumulated accuracy value.

[FIG. 18] An example of the similar document candidate display during a search.

[Explanation of symbols]

1: control device
2: input device
3: display device
4: external storage device
4a: document information storage unit
4b: table storage unit
200: program unit
201: initialization unit
204: search key document input unit
205: search target document reading unit
206: similarity calculation unit
207: similarity accuracy conversion unit
208: similar document candidate sorting unit
209: similar document candidate output unit
210: output candidate number determination condition setting unit
211: correct/incorrect answer result input unit
212: similarity accuracy conversion table reading unit
213: similarity accuracy conversion table writing unit
250: buffer unit
251: search key document storage buffer unit
252: search target document storage buffer unit
253: similarity calculation result storage buffer unit
254: similar document candidate storage buffer unit
255: output candidate number determination condition storage buffer unit
256: similarity accuracy conversion table storage buffer unit
257: accumulated accuracy value storage buffer unit

Continued from the front page
(72) Inventor: Takeshi Matsukuma, c/o Toshiba Computer Engineering Corp., 3-3-1 Shinmachi, Ome-shi, Tokyo
(72) Inventor: Yukio Nakamoto, c/o Toshiba Computer Engineering Corp., 3-3-1 Shinmachi, Ome-shi, Tokyo
(72) Inventor: Takuya Nishina, c/o Toshiba Computer Engineering Corp., 3-3-1 Shinmachi, Ome-shi, Tokyo
F-term (reference): 5B075 ND03 NK10 NK54 PP24 PQ02 PQ46 PR06 PR08 QM08 QS20 UU06

Claims

[Claims]

1. A similar document search device for searching a plurality of search target documents for a document similar to a search key document, comprising: similarity calculation means for calculating a similarity between the search key document and each of the search target documents; means for obtaining, according to the similarity calculated by the similarity calculation means, a probability that the search target document is a document to be extracted; and output means for outputting at least the document having the highest probability obtained by this means.

2. A similar document search device for searching a plurality of search target documents, each including classification information, for a document similar to a search key document, comprising: similarity calculation means for calculating a similarity between the search key document and each of the search target documents; means for obtaining a probability that the search target document is a document to be extracted, based on the similarity calculated by the similarity calculation means and the classification of the search target document; and output means for outputting the search target documents in descending order of the probability obtained by this means.

3. A similar document search device for searching a plurality of search target documents, each including field information, for a document similar to a search key document, comprising: similarity calculation means for calculating a similarity between the search key document and each of the search target documents; accuracy conversion means for obtaining a probability that the search target document is a document to be extracted, based on the similarity obtained by the similarity calculation means and the field of the search target document; output means for outputting the search target documents in descending order of the probability obtained by this means; input means for inputting data as to whether a search target document output by the output means is in fact a document to be extracted; and means for correcting the way the probability is obtained in the accuracy conversion means according to the data input by the input means.

4. A similar document search device for searching a plurality of search target documents for a document similar to a search key document, comprising: similarity calculation means for calculating a similarity between the search key document and each of the search target documents; first output means for outputting the search target documents in descending order of the similarity obtained by the similarity calculation means; input means for inputting data as to whether a search target document output by the first output means is in fact a document to be extracted; creation means for creating, based on the data input by the input means, a conversion table for obtaining, according to the similarity, a probability that a search target document is a document to be extracted; search means for searching for a similar document using the conversion table created by the creation means; and second output means for outputting the similar document found by the search.

5. A similar document search device for searching a plurality of search target documents for a document similar to a search key document, comprising: similarity calculation means for calculating a similarity between the search key document and each of the search target documents; means for obtaining, according to the similarity calculated by the similarity calculation means, a probability that the search target document is a document to be extracted; output means for outputting the search target documents in descending order of the probability obtained by this means; and determination means for determining that the probability that a document to be extracted is included among the output search target documents has reached a predetermined value, wherein the output of documents from the output means is terminated when the determination means determines that the predetermined value has been reached.

6. A similar document search device for searching a plurality of search target documents, each including classification information, for a document similar to a search key document, comprising: similarity calculation means for calculating a similarity between the search key document and each of the search target documents; conversion means for obtaining, corresponding to the similarity calculated by the similarity calculation means and the classification of the search target document, a probability that the search target document is a document to be extracted, from the ratio of the number of times results of past searches were judged to be correct to the number of times they were extracted as results of past searches; and output means for outputting at least the document having the highest probability obtained by the conversion means.

7. A similar document search method for searching a plurality of search target documents for a document similar to a search key document, comprising: calculating a similarity between the search key document and each of the search target documents; obtaining, according to the calculated similarity, a probability that the search target document is a document to be extracted; and outputting a similar document.

8. A similar document search method for searching a plurality of search target documents, each including classification information, for a document similar to a search key document, comprising: calculating a similarity between the search key document and each of the search target documents; obtaining a probability that the search target document is a document to be extracted, based on the similarity and the classification of the search target document; and outputting the search target documents in descending order of the obtained probability.

9. A similar document search method for searching a plurality of search target documents, each including field information, for a document similar to a search key document, comprising: calculating a similarity between the search key document and each of the search target documents; obtaining a probability that the search target document is a document to be extracted, based on the similarity and the field of the search target document; outputting the search target documents in descending order of the obtained probability; inputting data as to whether an output search target document is in fact a document to be extracted; and correcting the way the probability is obtained according to the input data.

10. A similar document search method for searching a plurality of search target documents for a document similar to a search key document, in which similar documents are searched for using a conversion table that converts the similarity between the search key document and each of the search target documents into a probability that the search target document is a document to be extracted, the method comprising: a first step of creating the conversion table; and a second step of searching for a similar document using the conversion table.

11. A similar document search method for searching a plurality of search target documents for a document similar to a search key document, comprising: calculating a similarity between the search key document and each of the search target documents; obtaining, according to the calculated similarity, a probability that the search target document is a document to be extracted; outputting the search target documents in descending order of the obtained probability; and terminating the output when the probability that a document to be extracted is included among the output search target documents reaches a predetermined value.

12. A similar document search method for searching a plurality of search target documents, each including field information, for a document similar to a search key document, comprising: calculating a similarity between the search key document and each of the search target documents; obtaining, corresponding to the calculated similarity and the field of the search target document, a probability that the search target document is a document to be extracted, from the ratio of the number of times results of past searches were judged to be correct to the number of times they were extracted as results of past searches; and outputting at least the document having the highest obtained probability.

13. A similar document search method for searching a plurality of search target documents, each including field information, for a document similar to a search key document, in which similar documents are searched for using a conversion table that is set for each field and each similarity of the search target documents and that converts the similarity between the search key document and each of the search target documents into a probability that the search target document is a document to be extracted, the method comprising a first step of creating the conversion table and a second step of searching for a similar document using the conversion table, wherein the first step comprises: calculating a similarity between a search key document and each of the search target documents; inputting data as to whether the search target document for which the similarity was calculated is a document to be extracted; and creating, according to the input data, the conversion table divided by similarity and by field of the search target document; and wherein the second step comprises: calculating a similarity between a search key document and each of the search target documents; obtaining, using the conversion table, a probability that the search target document is a document to be extracted, according to the calculated similarity and the classification of the search target document; and outputting the search target documents in descending order of the probability.