JP4045728B2 - Similar document search method and apparatus, and storage medium storing program for similar document search method

Info

Publication number: JP4045728B2
Application number: JP2000263240A
Authority: JP
Inventors: 松林忠孝; 山本伸也; 多田勝己; 菅谷奈津子
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2000-08-28
Filing date: 2000-08-28
Publication date: 2008-02-13
Anticipated expiration: 2020-08-28
Also published as: JP2002073681A

Description

【０００１】
【発明の属する技術分野】
本発明は、ユーザが指定した文書に記述されている内容と類似する内容を含む文書を、文書データベースの中から検索する方法に関する。
【０００２】
【従来の技術】
近年、パーソナルコンピュータやインターネット等の普及に伴い、電子化文書が爆発的に増加しており、今後も加速度的に増大していくものと予想される。このような状況において、ユーザが所望する情報を含んだ文書を高速かつ効率的に検索したいという要求が高まってきている。
【０００３】
このような要求に応える技術として、ユーザが自分の所望する内容を含んだ文書（以下、種文書と呼ぶ）を例示し、その文書と類似する文書を検索する類似文書検索技術が注目されている。
【０００４】
類似文書検索の方法としては、例えば「特開平１１−６６０８６」が開示されている（以下、従来技術１と呼ぶ）。
【０００５】
本従来技術１では、文書データベースに対して文書を登録する際に、登録対象となる文書を全文検索するために必要な情報（従来技術１では、転置インデックスと呼んでいる。以下、全文検索用インデクスと呼ぶ。）を作成しておき、類似文書の検索時に、本全文検索用インデクスを参照することで登録済みの文書（以下、登録文書と呼ぶ）に含まれる単語の出現頻度情報を要素としてもつベクトル（以下、特徴ベクトルと呼ぶ）を作成し、これと検索条件として指定された文書（以下、種文書と呼ぶ）の特徴ベクトルとが、ベクトル空間内においてなす角度の余弦を文書間の類似度として算出する技術である。
【０００６】
以下、従来技術１の処理手順を図２のＰＡＤ（Problem Analysis Diagram）図を用いて説明する。
【０００７】
従来技術１では、まずステップ２００において、文書の登録処理か類似文書の検索処理かを判定し、文書の登録処理と判定された場合には全文検索用インデクス作成ステップ２１０を実行し、全文検索用インデクスを作成する。
【０００８】
また、ステップ２００において類似文書の検索処理と判定された場合には、種文書特徴ベクトル生成ステップ２２０を実行し、種文書に対して特徴ベクトルを作成する。そして、全文検索用インデクスを用いた類似度算出ステップ２２１を実行し、該種文書の特徴ベクトルと登録文書の特徴ベクトルが、ベクトル空間内においてなす角度の余弦を文書間の類似度として算出する。
【０００９】
以上が、従来技術１の処理手順である。
【００１０】
以下、図３を用いて本従来技術１の概要を説明する。
【００１１】
従来技術１の文書登録処理では、まず全文検索用インデクス作成処理２１０で登録用文書１および文書２中に含まれる単語および出現位置を抽出し、全文検索用インデクス４０３を作成する。この結果、全文検索用インデクス４０３には、"構築：（文書１，５）（文書２，８）"のように記録される。ここで、"構築：（文書１，５）（文書２，８）"は、文字列"構築"が文書１の５文字目に、文書２の８文字目に出現していることを表している。
【００１２】
そして、類似文書の検索処理では、検索条件で指定された種文書を抽出し、種文書特徴ベクトル生成処理２２０を通じて該種文書に対応する種文書特徴ベクトル４０６を生成する。
【００１３】
次に、種文書特徴ベクトル４０６中に含まれる全ての単語に対して、前記文書登録処理で作成した全文検索用インデクス４０３を参照することで、各登録文書中の出現回数を取得する。
【００１４】
ここで図４に示すように、二つのベクトルＸおよびＹの余弦は、ベクトルの対応する成分同士（例えばx(i)とy(i)）の積和値をそれぞれのベクトルの大きさで除算することにより得られることに着目する。すなわち、特定のベクトル間の内積をベクトルの組ごとに算出していくのではなく、ベクトルの要素ごとの内積成分（以下、要素別類似度と呼ぶ）を計算した後に、全ての要素における要素別類似度の総和を算出する。なお図４では、ベクトルＸのi番目の要素を"x(i)"と表し、ベクトルＸの大きさを"|Ｘ|"と表す。
【００１５】
すなわち、図３において種文書特徴ベクトル４０６と登録文書の特徴ベクトルの余弦を算出するためには、種文書特徴ベクトル４０６中の全ての単語に対して、種文書と各登録文書での出現回数の積和値を各登録文書における単語毎の要素別類似度として算出し、全ての登録文書について単語毎の要素別類似度の総和をとることで算出できる。
【００１６】
以下、本類似度算出方法を図５を用いて具体的に説明する。
【００１７】
種文書特徴ベクトルをベクトルＸ、文書１の特徴ベクトル（以下、特徴ベクトル１と呼ぶ）をベクトルＹ、文書２の特徴ベクトル（以下、特徴ベクトル２と呼ぶ）をベクトルＺと表すとき、種文書特徴ベクトルと特徴ベクトル１および特徴ベクトル２の内積の第１成分は、それぞれ"x(1)y(1)"および"x(1)z(1)"として算出することができる。
【００１８】
ここで、"x(1)"は単語１の種文書での出現回数を表しており、"y(1)"および"z(1)"はそれぞれ単語１の文書１および文書２での出現回数を表している。
【００１９】
すなわち、単語１の各文書での出現回数６００は、種文書内での単語１の出現回数を計数すると共に、単語１に対応する全文検索用インデクスを参照することで取得することができる。
【００２０】
以下同様に、種文書中の全ての単語に対応する全文検索用インデクスを参照することで、種文書に対する登録文書の類似度を算出することができる。
【００２１】
以上が、従来技術１における類似度算出方法の具体的な説明である。
【００２２】
最後に、各登録文書全体の類似度４０７を出力する。
【００２３】
以上が、従来技術１の概要である。
【００２４】
以上説明したように従来技術１によれば、登録文書中に含まれる単語用の全文検索用単語インデクスを予め作成しておくことで、文書検索時に登録文書の特徴ベクトルの生成を可能とし、検索条件として指定された種文書に対応する種文書特徴ベクトルとの余弦を類似度として算出することで、文書データベース中から内容の類似する文書を検索することができる。
【００２５】
しかし従来技術１には、種文書から抽出された全ての単語に対して全文検索用インデクスを参照し、類似度算出に使用しているため、種文書に含まれる単語数が多いときには膨大な処理時間が必要になるという問題がある。
【００２６】
例えば、種文書中の1種類の単語に対する全文検索用インデクスを0.5秒で参照可能としても、種文書から100種類の単語が抽出されているとすると、50秒もの処理時間を要してしまうことになる。
【００２７】
一方、処理時間を低減するために単純に種文書特徴ベクトルの単語を間引いてしまうと、単語の種類数を削減してしまうため種文書で重要な意味を持つ単語までもが排除される可能性があり、検索精度が極端に低下してしまう恐れがある。
【００２８】
【発明が解決しようとする課題】
このような問題に対し、本発明では以下の課題を解決することを目的とする。
【００２９】
すなわち本発明の課題は、文書データベースへの文書登録時に登録文書の特徴ベクトルを作成することなく、類似文書の検索時に全登録文書の特徴ベクトルを作成し、最新の単語情報を用いた類似度算出を行なう類似文書検索方法において、検索精度を確保することのできる最低限の単語数を使用することにより、高速な類似文書検索方法を実現することである。
【００３０】
【課題を解決するための手段】
上記課題を解決するための、本発明に示す類似文書検索の処理手順を図７に示すＰＡＤ図に示す。
【００３１】
本発明に示す類似文書検索方法は、登録処理か検索処理かを判定する処理種別判定処理２００と、文書の登録処理として全文検索用インデクス作成処理２１０と、類似文書の検索処理として、種文書特徴ベクトル生成処理２２０と全文検索用インデクスを用いた類似度算出処理２２１を有する類似文書検索方法において、種文書特徴ベクトル生成処理２２０と全文検索用インデクスを用いた類似度算出処理２２１の間に、検索用単語抽出処理７０１を有することを特徴とする。
【００３２】
すなわち、本発明による類似文書検索方法は、文書データベースへの文書登録時の全文検索用インデクス作成処理２１０として、（ステップ１）登録対象文書を読み込む登録文書読込みステップ、（ステップ２）上記登録文書読込みステップで読み込まれた登録対象文書のテキストから、全文検索用情報を抽出し、全文検索用情報ファイルに格納する全文検索用情報ファイル作成登録ステップと、類似文書の検索処理における種文書特徴ベクトル生成処理２２０として、（ステップ３）検索条件で指定された種文書を取得する種文書取得ステップ、（ステップ４）前記種文書取得ステップで取得された種文書を解析し、種文書中に含まれる単語を抽出する種文書解析単語抽出ステップ、（ステップ５）上記種文書解析単語抽出ステップで抽出された単語の出現回数を計数する種文書内出現回数計数ステップと、検索用単語抽出処理７０１として、（ステップ６）上記種文書内出現回数計数ステップで計数された各単語の出現回数に基づき、該単語の重要度を算出する単語重要度算出ステップ、（ステップ７）上記（ステップ６）で算出された各単語の重要度の降順に単語を選択し、種文書自体に対する該単語の要素別類似度を算出し、該要素別類似度が所定の閾値を超える場合に、該単語を検索用単語として抽出する検索用単語判定ステップと、全文検索用インデクスを用いた類似度算出処理２２１として、（ステップ８）上記種文書特徴ベクトル生成処理２２０において、種文書から抽出された検索用単語を用いて、以下の（ステップ９）〜（ステップ１０）を実行する類似度算出ステップ、（ステップ９）前記全文検索用情報ファイル作成登録ステップで作成された全文検索用情報を参照し該検索用単語の各登録文書での出現回数を取得する検索用単語出現回数取得ステップ、（ステップ１０）前記検索用単語判定ステップで抽出された該検索用単語に関する前記種文書内出現回数計数ステップで取得した種文書内出現回数および前記検索用単語出現回数取得ステップで取得した各登録文書における検索用単語出現回数を用いて種文書と登録文書の要素別類似度を算出し、各登録文書の全体の類似度に加算する要素別類似度算出ステップ、（ステップ１１）上記要素別類似度算出ステップで算出された類似度を出力する検索結果出力ステップを有する。
【００３３】
上記類似文書検索方法を用いた本発明の原理について図８〜図１０を用いて説明する。
【００３４】
本発明の類似文書検索方法では、文書データベースへの文書登録時に（ステップ１）および（ステップ２）を実行する。
【００３５】
以下、図８を用いて、文書の登録に際する処理手順の概要を説明する。
【００３６】
まず、（ステップ１）で登録対象となる文書を読み込む。図８に示した例では、登録対象文書として文書１「ＬＡＮの構築と運用・保守に必要な機器を提供する。」および文書２「情報システムの構築や保守を手がけるＳＩベンダと提携する。」が登録対象文書として読み込まれる。
【００３７】
次に、（ステップ２）において、上記（ステップ１）で読み込まれた登録対象文書のテキストから、全文検索用情報を抽出し、全文検索用情報ファイルに格納する。
【００３８】
図８に示した例では、文書１中に含まれる"Ｌ"に対応する全文検索用情報として（文書１，１）が抽出され、全文検索用情報ファイル８０３中に格納される。なお、Ｌ（文書１，１）は、"文書１"の文字位置１に文字"Ｌ"が出現することを表す。
【００３９】
また、ここで用いる全文検索用情報としては、任意の単語あるいは文字列の各登録文書での出現回数を取得することができれば、従来技術１に示したように単語インデクス方式を用いるものとしてもよいし、「特開平０８−１９４７１８」に開示されているn-gramインデクス方式を用いるものとしてもよい。
【００４０】
以上が、本発明の文書登録に際する処理手順の概要である。
【００４１】
次に、本発明に示した類似文書検索方法では、文書の検索時に（ステップ３）〜（ステップ１１）を実行する。
【００４２】
以下、図９を用いて文書の検索に際する処理手順の概要を説明する。
【００４３】
まず（ステップ３）で検索条件として指定された種文書９０１「ＬＡＮシステムの構築ノウハウを武器にソリューションを展開する・・・」を読み込む。
【００４４】
そして、（ステップ４）において、種文書を解析し、種文書中に含まれる単語を抽出する。ここで用いる種文書解析処理としては、従来技術１に示されるように単語辞書を参照し、単語辞書に含まれる単語を抽出される方式でもよいし、「特開平１０−１４８７２１」に開示されているように文書データベース中の統計情報を用いた単語抽出方法を用いてもよいし、種文書中に含まれるn-gramを機械的に抽出する方法であってもよいし、その他の単語抽出技術を使用しても構わない。
【００４５】
図９に示した例では、この種文書解析処理の結果として、単語列９０３（ＬＡＮ，構築，ノウハウ，武器，ソリューション，展開，…）が抽出されている。
【００４６】
次に、（ステップ５）において、上記（ステップ４）で抽出された単語の種文書内での出現回数を計数し、単語と出現回数の組９０４（［ＬＡＮ，４］［構築，３］［ノウハウ，２］［武器，１］［ソリューション，２］［展開，１］…）を出力する。
【００４７】
ここで、［ＬＡＮ，４］は、単語"ＬＡＮ"が４回出現しているということを表している。
【００４８】
次に、（ステップ６）において、上記（ステップ５）で抽出された単語と出現回数の組９０４に対して、重要度を算出し、単語と重要度の組を出力する。この重要度の算出方法としては、例えば、種文書中の出現回数としてもよいし、データベースに登録された文書数に対する該単語の出現文書数の割合（以下、出現割合と呼ぶ）等を用いてもよい。図９に示した例では、種文書９０１中での出現回数を単語の重要度として算出し、単語重要度列９０５「［ＬＡＮ，４］［構築，３］［ソリューション，２］…」を出力している。ここで、［ＬＡＮ，４］は、単語"ＬＡＮ"が重要度"４"として種文書に含まれていることを表す。
【００４９】
そして、（ステップ７）において、上記（ステップ６）において算出された各単語の重要度の降順に種文書自体に対する要素別類似度を算出し、該要素別類似度が所定の閾値を超えている場合、該単語を検索用単語として抽出する。この結果として、検索用単語［ＬＡＮ，４］［構築，３］が抽出される。
【００５０】
次に、（ステップ８）〜（ステップ１０）において、前記（ステップ７）で取得された各単語の種文書内出現回数および前記（ステップ２）で作成された全文検索用情報ファイル８０３を参照することで、種文書に対する各登録文書の類似度を算出する。
【００５１】
そして、（ステップ１１）において、類似度算出結果９０６を出力する。
【００５２】
以上が、本発明の文書検索に際する処理手順の概要である。
【００５３】
以下、上述した（ステップ７）により実行される検索用単語の抽出処理手順について、図１０を用いて説明する。
【００５４】
まず、（ステップ７）において、前記（ステップ６）で出力された単語重要度列９０５を読み込み、重要度の降順に単語を選択する。図１０では、単語重要度列９０５「［ＬＡＮ，４］、［構築，３］、［ソリューション，２］…」から、まず［ＬＡＮ，４］を抽出している。
【００５５】
そして、検索用単語"ＬＡＮ"の種文書内出現回数"４"を用いて、種文書に対する種文書の類似度の該検索用単語の要素別類似度を計算する。すなわち、登録文書として種文書と同一の文書が存在するもの（以下、仮想登録文書と呼ぶ）と仮定し、種文書特徴ベクトルと該仮想登録文書の特徴ベクトル間における該検索用単語の要素別類似度を算出し、総和を算出する。
【００５６】
図１０では、検索用単語"ＬＡＮ"の種文書内出現回数"４"と仮想登録文書内出現回数"４"の積を算出し、要素別類似度"１６"を得る。
【００５７】
この結果、検索用単語"ＬＡＮ"による種文書自体に対する要素別類似度は所定の閾値（本図に示した例では、５とする）を超えているため、検索用単語としてワークエリア１７０へ格納する。
【００５８】
次に、［ＬＡＮ，４］の次に重要度の高い［構築，３］を選択し、種文書に対する種文書の類似度の該検索用単語の要素別類似度を計算する。この結果、要素別類似度は９となり、所定の閾値５を超えているため、検索用単語としてワークエリア１７０へ格納する。
【００５９】
そして、［構築，３］の次に重要度の高い［ソリューション，２］を選択し、種文書に対する種文書の類似度の該検索用単語の要素別類似度を計算する。この結果、要素別類似度は４となり、所定の閾値を超えていないため、検索用単語として抽出せずに、終了する。
【００６０】
以上が、検索用単語抽出処理手順の説明である。
【００６１】
以上説明したように、文書データベースへの文書登録時に、登録文書に対する登録特徴ベクトルを作成する代わりに、全文検索用インデクスを作成しておき、類似文書の検索時には、種文書における特徴ベクトルの要素のうち種文書内での重要度の順に検索用単語を抽出し、種文書自体に対する類似度が収束するまで抽出した単語を検索用単語として使用するため、全ての単語を検索に使用する場合に比べて、検索精度を極端に落とすことなく種文書と登録文書の類似度を高速に算出することが可能となる。
【００６２】
【発明の実施の形態】
以下、本発明の第一の実施例について図１を用いて説明する。
【００６３】
本発明を適用した類似文書検索システムの第一例は、ディスプレイ１００、キーボード１０１、中央演算処理装置（ＣＰＵ）１０２、磁気ディスク装置１０３、フロッピディスクドライブ（ＦＤＤ）１０４、主メモリ１０５およびこれらを結ぶバス１０６から構成される。
【００６４】
磁気ディスク装置１０３は二次記憶装置の一つであり、全文検索用情報ファイル１８０が格納される。
【００６５】
ＦＤＤ１０４を介してフロッピディスク１０７に格納されている情報が、主メモリ１０５あるいは磁気ディスク装置１０３へ読み込まれる。
【００６６】
主メモリ１０５には、システム制御プログラム１１０、登録制御プログラム１１１、検索制御プログラム１１２、登録文書読込プログラム１２０、全文検索用情報ファイル作成登録プログラム１２１、検索条件解析プログラム１３０、検索用単語抽出プログラム１３１、類似度算出プログラム１３２、検索結果出力プログラム１３３が格納されると共にワークエリア１７０が確保される。
【００６７】
検索条件解析プログラム１３０は、種文書取得プログラム１４０、単語抽出プログラム１４２および種文書内出現回数計数プログラム１４３で構成される。
【００６８】
検索用単語抽出プログラム１３１は、単語重要度算出プログラム１５０および検索用単語抽出判定プログラム１５１で構成される。
【００６９】
類似度算出プログラム１３２は、検索用単語出現回数取得プログラム１６１および要素別類似度算出プログラム１６２で構成される。
【００７０】
登録制御プログラム１１１および検索制御プログラム１１２は、ユーザによるキーボード１０１からの指示に応じてシステム制御プログラム１１０によって起動され、それぞれ登録文書読込プログラム１２０および全文検索用情報ファイル作成登録プログラム１２１の制御と、検索条件解析プログラム１３０、検索用単語抽出プログラム１３１、類似度算出プログラム１３２および検索結果出力プログラム１３３の制御を行なう。
【００７１】
なお本実施例では、キーボード１０１から入力されたコマンドにより、登録制御プログラム１１１や検索制御プログラム１１２が起動されるものとしたが、他の入力装置を介して入力されたコマンドあるいはイベントにより起動されるものであってもかまわない。
【００７２】
また、これらのプログラムを磁気ディスク装置１０３、フロッピディスク１０７、ＭＯ、ＣＤ−ＲＯＭ、ＤＶＤ（図１には示していない）等の記憶媒体に格納し、駆動装置を介して主メモリ１０５に読み込み、ＣＰＵ１０２によって実行することも可能である。
【００７３】
以下、本実施例における類似文書検索システムの処理手順について説明する。
【００７４】
まず、システム制御プログラム１１０の処理手順について図１１のＰＡＤ図を用いて説明する。
【００７５】
システム制御プログラム１１０は、まずステップ１１００で、キーボード１０１から入力されたコマンドを解析する。
【００７６】
そしてステップ１１０１で、この結果が登録実行のコマンドであると解析された場合には、ステップ１１０２で登録制御プログラム１１１を起動して、文書の登録を行なう。
【００７７】
またステップ１１０１で、検索実行のコマンドであると解析された場合には、ステップ１１０３で検索制御プログラム１１２を起動して、類似文書の検索を行なう。
【００７８】
以上が、システム制御プログラム１１０の処理手順である。
【００７９】
次に、図１１に示したステップ１１０２でシステム制御プログラム１１０により起動される登録制御プログラム１１１の処理手順について、図１２のＰＡＤ図を用いて説明する。
【００８０】
登録制御プログラム１１１では、まずステップ１２００において登録文書読込プログラム１２０を起動し、登録対象として指定された文書（以下、登録対象文書と呼ぶ）を読み込み、ワークエリア１７０に格納する。
【００８１】
次に、ステップ１２０１において、全文検索用情報ファイル作成登録プログラム１２１を起動し、ワークエリア１７０に格納されている登録文書に対応する全文検索用情報を作成し、全文検索用情報ファイル１８０へ格納する。
【００８２】
以上が、登録制御プログラム１１１の処理手順である。
【００８３】
次に、図１１に示したステップ１１０３でシステム制御プログラム１１０により起動される検索制御プログラム１１２の処理手順について、図１３のＰＡＤ図を用いて説明する。
【００８４】
検索制御プログラム１１２は、まずステップ１３００において、検索条件解析プログラム１３０を起動し、種文書から単語を抽出する。
【００８５】
次にステップ１３０１において、検索用単語抽出プログラム１３１を起動し、上記ステップ１３００において種文書から抽出された単語の重要度を算出し、所定の条件に基づいて重要度の高い単語を検索用単語として抽出する。
【００８６】
そしてステップ１３０２において、類似度算出プログラム１３２を起動し、上記ステップ１３０１において種文書から抽出された検索用単語の出現情報を用いて、種文書に対する各登録文書の類似度を算出する。
【００８７】
そしてステップ１３０３において、検索結果出力プログラム１３３を起動し、上記ステップ１３０２で算出された類似度算出結果を検索結果として出力する。
【００８８】
ここで、検索結果の出力先は、ディスプレイ１００に表示するものとしてもよいし、ワークエリア１７０や磁気ディスク１０３上に格納するものとしてもよい。また、類似度算出結果をディスプレイ１００に出力する場合には、類似度の降順に出力するものとしてもよいし、文書に付与された管理番号の昇順あるいは降順に出力するものとしてもよい。
【００８９】
以上が検索制御プログラム１１２の処理手順である。
【００９０】
次に、図１３に示したステップ１３００で検索制御プログラム１１２により起動される検索条件解析プログラム１３０の処理手順について、図１４のＰＡＤ図を用いて説明する。
【００９１】
検索条件解析プログラム１３０は、まずステップ１４００において、種文書取得プログラム１４０を起動し、検索条件で指定された種文書を抽出し、ワークエリア１７０に格納する。
【００９２】
次にステップ１４０２において、単語抽出プログラム１４２を起動し、ワークエリア１７０に格納された種文書から単語を抽出する。
【００９３】
そしてステップ１４０３において、種文書内出現回数計数プログラム１４３を起動し、ステップ１４０２で抽出された単語について、種文書内での出現回数を計数し、ワークエリア１７０に格納する。
【００９４】
以上が検索条件解析プログラム１３０の処理手順である。
【００９５】
次に、図１３に示したステップ１３０１で検索制御プログラム１１２により起動される検索用単語抽出プログラム１３１の処理手順について、図１５のＰＡＤ図を用いて説明する。
【００９６】
検索用単語抽出プログラム１３１は、まずステップ１５００において、単語重要度算出プログラム１５０を起動し、所定の算出式に基づきワークエリア１７０に格納された単語の重要度を算出し、ワークエリア１７０に格納する。
【００９７】
次に、前記ステップ１５００でワークエリア１７０に格納された全ての単語に対して、ステップ１５０２〜１５０５を繰り返し実行する（ステップ１５０１）。
【００９８】
まず、ステップ１５０２において、ワークエリア１７０に格納されている単語を重要度の降順に取得する。
【００９９】
次に、ステップ１５０３において、検索用単語抽出判定プログラム１５１を起動し、種文書の要素別類似度を算出する。
【０１００】
そして、ステップ１５０４において、種文書の要素別類似度が、所定の閾値を超えているかを判定し、超えている場合にはステップ１５０５を実行し、超えていない場合には繰り返し処理を終了する。
【０１０１】
そして、ステップ１５０５において、該単語を検索用単語としてワークエリア１７０に格納する。
【０１０２】
以上が検索用単語抽出プログラム１３１の処理手順である。
【０１０３】
なお、上述のステップ１５０３における各単語の要素別類似度の算出方法は、従来技術１に示されるように、各単語の種文書での出現回数を用いて算出してもよいし、後述するように、該単語の文書データベースでの出現文書数等の統計情報を用いるものでもよいし、さらには、文書内での出現位置情報を考慮することもできる。
【０１０４】
次に、図１３に示したステップ１３０２で検索制御プログラム１１２により起動される類似度算出プログラム１３２の処理手順について、図１６のＰＡＤ図を用いて説明する。
【０１０５】
類似度算出プログラム１３２は、ワークエリア１７０に格納された全ての検索用単語に対して、ステップ１６０２〜１６０３を繰り返し実行する（ステップ１６０１）。
【０１０６】
ステップ１６０２では、検索用単語出現回数取得プログラム１６１を起動し、検索用単語に対応する全文検索用情報ファイル１８０を参照して、各登録文書内での出現回数を取得し、ワークエリア１７０に格納する。
【０１０７】
次にステップ１６０３において、要素別類似度算出プログラム１６２を起動し、ワークエリア１７０に格納された検索用単語の種文書内出現回数および登録文書内出現回数を用いて、所定の算出式により種文書に対する登録文書の要素別類似度を算出し、登録文書全体の類似度に加算する。
【０１０８】
以上が類似度算出プログラム１３２の処理手順である。
【０１０９】
以上が、本発明の第一の実施形態である。
【０１１０】
なお、本実施例では、検索条件解析プログラム１３０により種文書から単語が抽出されるものとしたが、単語の代わりにn-gramが抽出されるものとしてもよい。この場合、検索用単語抽出プログラム１３１により処理される単位もn-gramとなる。
【０１１１】
また、検索用単語抽出プログラム１３１のステップ１５０４では、ステップ１５０３で算出された種文書の要素別類似度が所定の閾値を超えるか否かを判定するものとしたが、要素別類似度ではなく類似度の総和が所定の閾値を超えているかを判定するものとしてもよいし、さらには、種文書から抽出された全ての単語における要素別類似度の総和に対する類似度の算出割合が所定の閾値を超えているかを判定するものとしてもよい。
【０１１２】
また、本実施例では種文書に対する各登録文書の類似度の算出には、単語の出現回数を直接用いたが、さらにこれを種文書や登録文書の文書の長さ等により正規化してもよいことは明らかであろう。
【０１１３】
以上説明したように、本発明の第一の実施形態によれば、種文書に対する要素別類似度の値を目安にして類似度算出に使用する検索用単語数を削減しているため、種文書に対する類似度算出結果が収束する必要最低限の検索で処理を終了させることができる。
【０１１４】
この結果として、検索精度を極端に低下させることなく検索用単語数を削減することができ、高速な類似文書検索を実現することができるようになる。
【０１１５】
なお、本実施例では、登録対象文書や種文書を文書としたが、文章あるいは文字列であっても構わないことは明らかであろう。
【０１１６】
また、以上説明した本発明の第一の実施例における検索用単語抽出プログラム１３１では、種文書の要素別類似度の値を目安にして検索用単語を削減するものとしたが、予め指定された数の検索用単語を抽出するものとしてもよい。この場合の検索用単語数の設定方法としては、予め用意したテストパターンを用いて所定の時間以内に検索が終了するように検索用単語数を決定するものとしてもよい。
【０１１７】
次に本発明の第二の実施例について図１７を用いて説明する。
【０１１８】
本発明を適用した類似文書検索システムの第二の実施例は、種文書から抽出された単語の重要度を算出する際に、文書データベースに蓄積された登録文書の統計情報を利用するものである。
【０１１９】
本方法によれば、第一の実施例における単語重要度算出プログラム１５０による単語重要度算出の際に、種文書内の出現情報だけでなく文書データベース全体での出現情報を利用することができ、文書データベース内で頻繁に出現する単語の重要度を調整することが可能となり、第一の実施例に比べ高精度に単語重要度を算出できるようになる。
【０１２０】
本実施例は、第一の実施例（図１）とほぼ同様の構成を取るが、単語重要度算出プログラム１５０の構成が異なり、図１７に示すように統計情報参照プログラム１７００が加わる。
【０１２１】
以下、第一の実施例と異なる単語重要度算出プログラム１５０ａの処理手順について図１８を用いて説明する。
【０１２２】
単語重要度算出プログラム１５０ａは、まずステップ１８００において、統計情報参照プログラム１７００を起動し、全文検索用情報ファイル１８０を参照することにより、種文書から抽出された各単語の文書データベースにおける出現文書数を該単語の統計情報として取得する。
【０１２３】
なお、全文検索用情報ファイル１８０から該単語の出現文書数の取得は、図８に示した全文検索用情報ファイル８０３として示したように全文検索用情報ファイル１８０には各単語の文書番号および出現位置が格納されていることを利用し、該単語の異なる文書番号を計数することで実現することができる。
【０１２４】
そして、ステップ１８０１において、種文書から抽出された各単語の重要度を、該単語の種文書内出現回数および文書データベースにおける統計情報を用いて算出し、ワークエリア１７０に格納する。
【０１２５】
以上が、単語重要度算出プログラム１５０ａの処理手順である。
【０１２６】
なお、本実施例における単語重要度算出式としては、例えばＴＦ・ＩＤＦ（Term Frequency, Inverse Document Frequency）法を用いるものとしてもよい。
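ＴＦ・ＩＤＦ法による重要度算出の一例を、最小限のスケッチとして以下に示す。本明細書は具体的な算出式を示していないため、ここでは「種文書内出現回数 × log（登録文書数 ÷ 出現文書数）」という一般的な形を仮定している。関数名 `tf_idf`、引数名および数値例は、いずれも説明用の仮のものである。

```python
import math

def tf_idf(tf, df, n_docs):
    """種文書内出現回数 tf と出現文書数 df から重要度を求める（仮定した式）。"""
    if df <= 0 or n_docs <= 0:
        raise ValueError("df と n_docs は正の値でなければならない")
    # 多くの登録文書に現れる単語ほど log(n_docs / df) が小さくなる
    return tf * math.log(n_docs / df)

# 仮の数値例: 登録文書数を 1000 件とする
rare = tf_idf(tf=3, df=10, n_docs=1000)        # 希少な単語
frequent = tf_idf(tf=4, df=1000, n_docs=1000)  # 全登録文書に現れる単語
```

この仮定のもとでは、種文書内での出現回数が多くても、全登録文書に現れる単語の重要度は 0 となり、頻出単語の重要度を下げるという本実施例の狙いに沿う。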
【０１２７】
以上が本発明の第二の実施例である。
【０１２８】
以上説明したように、本発明の第二の実施例における類似文書検索システムを用いることにより、文書データベース内で頻繁に出現する単語（以下、頻出単語と呼ぶ）を考慮した単語重要度を算出できるようになる。すなわち、頻出単語の単語重要度を低く、希少な単語の単語重要度を高く設定することで、種文書の特徴を表す単語を優先的に選択することが可能となり、高精度な類似文書検索を実現することができるようになる。
【０１２９】
次に、本発明の第三の実施例について図１９を用いて説明する。
【０１３０】
本発明を適用した類似文書検索システムの第三の実施例は、第二の実施例と同様に種文書から抽出された単語の重要度を算出する際に、文書データベースに蓄積された登録文書の統計情報を利用するものであるが、統計情報の取得に統計情報ファイル１９１０を利用する点が異なる。
【０１３１】
本方法によれば、第二の実施例における単語重要度算出の際に参照する統計情報取得を高速に行なうことができるようになる。
【０１３２】
本実施例は、第二の実施例（図１７）とほぼ同様の構成を取るが、登録制御プログラム１１１の構成が異なり、図１９に示すように統計情報ファイル作成登録プログラム１９００が加わる。また、磁気ディスク装置１０３には統計情報ファイル１９１０が格納される。前記単語重要度算出プログラム１５０ａのステップ１８００では、種文書から抽出された各単語の文書データベースにおける統計情報を取得する際に、全文検索用情報ファイル１８０を参照する代わりに、図１９に示す統計情報ファイル１９１０を参照するようになる。
【０１３３】
以下、第二の実施例と異なる登録制御プログラム１１１ａの処理手順について図２０を用いて説明する。
【０１３４】
登録制御プログラム１１１ａでは、まずステップ１２００において登録文書読込プログラム１２０を起動し、登録対象として指定された文書を読み込み、ワークエリア１７０に格納する。
【０１３５】
次に、ステップ１２０１において、全文検索用情報ファイル作成登録プログラム１２１を起動し、ワークエリア１７０に格納されている登録文書に対応する全文検索用情報を作成し、全文検索用情報ファイル１８０へ格納する。
【０１３６】
次に、ステップ２０００において、統計情報ファイル作成登録プログラム１９００を起動し、ワークエリア１７０に格納されている登録文書に対応する統計情報を作成し、統計情報ファイル１９１０へ格納する。
【０１３７】
以上が、登録制御プログラム１１１ａの処理手順である。
【０１３８】
図２１に統計情報ファイル作成登録プログラム１９００により作成される統計情報ファイル１９１０の例を示す。
【０１３９】
本図に示した統計情報ファイル１９１０には、管理番号２１００、単語２１０１および出現文書数２１０２が格納される。
【０１４０】
本図に示した例では、管理番号"０"の領域に、単語"ＬＡ"が格納され、該単語の出現文書数が"１"であるというように格納されることを示している。
【０１４１】
なお、図２１に示した例では、統計情報ファイル１９１０が表形式で格納されるものとしたが、単語と出現文書数が取得できる形式であればどのような形式であってもかまわない。例えば、トライ形式で格納されるものとしてもかまわないし、全文検索用情報ファイル１８０の先頭領域に格納しておくものとしてもかまわない。
【０１４２】
以上が、本発明の第三の実施例である。
【０１４３】
以上説明したように本発明の第三の実施例によれば、種文書から抽出された各単語の統計情報の取得に、文書登録処理時に予め作成された統計情報ファイルを参照することにより、全文検索用情報を参照して異なる出現文書番号の個数を計数する必要がなくなり、高速に統計情報を取得することができるようになる。これにより、第二の実施例に比べ高速な類似文書検索を実現できるようになる。
【０１４４】
次に本発明の第四の実施例について図２２を用いて説明する。
【０１４５】
本発明を適用した類似文書検索システムの第四の実施例は、種文書から抽出された各単語の統計情報を近似して利用するものである。
【０１４６】
本方法によれば、統計情報の精度を極端に低下させることなく、第三の実施例における統計情報ファイル１９１０に格納される統計情報の容量を削減することができるようになる。
【０１４７】
本実施例は、第三の実施例（図１９）とほぼ同様の構成を取るが、統計情報参照プログラム１７００の構成が異なり、近似統計情報算出プログラム２２００が加わる。
【０１４８】
以下、第三の実施例と異なる統計情報参照プログラム１７００ｂの処理手順について図２３を用いて説明する。
【０１４９】
統計情報参照プログラム１７００ｂは、種文書から抽出された全ての単語についてステップ２３０１〜２３０４を繰り返し実行する（ステップ２３００）。
【０１５０】
ステップ２３０１では、統計情報ファイル１９１０を参照し、該単語に対応する統計情報が格納されているかを確認する。
【０１５１】
そして、該単語が統計情報ファイル１９１０中に格納されている場合にはステップ２３０３を実行し、格納されていない場合にはステップ２３０４を実行する（ステップ２３０２）。
【０１５２】
ステップ２３０３では、統計情報ファイル１９１０を参照し、該単語の統計情報を取得する。
【０１５３】
また、ステップ２３０４では、近似統計情報算出プログラム２２００を起動し、該単語の近似統計情報を算出する。
【０１５４】
以上が、統計情報参照プログラム１７００ｂの処理手順である。
【０１５５】
次に、近似統計情報算出プログラム２２００の処理手順について図２４を用いて具体的に説明する。
【０１５６】
本図に示した例では、まずステップ２３０１において、統計情報を取得する対象となる単語２４００"ＬＡＮ"に対して、統計情報ファイル１９１０を参照する。
【０１５７】
ここでは、統計情報ファイル１９１０には"ＬＡＮ"が格納されていないため、ステップ２３０４を実行する。
【０１５８】
ステップ２３０４では、単語２４００"ＬＡＮ"の構成要素である"ＬＡ"と"ＡＮ"の統計情報をそれぞれ取得し、これらの出現文書数のうち少ない値を"ＬＡＮ"の統計情報として設定する。
【０１５９】
本図に示した例では、"ＬＡ"の統計情報２４０１に格納された出現文書数"８０７"と、"ＡＮ"の統計情報２４０２に格納された出現文書数"１５１２"とを比較し、この結果として"ＬＡＮ"の統計情報２４０３として値の小さい"ＬＡ"の出現文書数"８０７"を格納する（２４１０）。
【０１６０】
これは、単語"ＬＡＮ"の構成要素"ＬＡ"と"ＡＮ"の出現文書数が異なる場合、"ＬＡＮ"の出現文書数は各構成要素よりも多くなることはありえないという性質を利用するものである。すなわち、単語"ＬＡＮ"の出現文書数としては、本来"ＬＡＮ"そのものの出現文書数を用いるべきであるが、単語"ＬＡＮ"の構成要素である"ＬＡ"あるいは"ＡＮ"のうち、出現文書数の少ない値を近似した出現文書数として参照するものである。
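上記の近似方法は、次のようにスケッチできる。統計情報ファイルを「単語（またはn-gram）から出現文書数への辞書」`stats` で表すこと、構成要素を２文字のn-gramとすること、統計情報ファイルにない構成要素の出現文書数を 0 とみなすことは、いずれもこのスケッチ上の仮定であり、関数名 `approximate_doc_count` も仮のものである。

```python
def approximate_doc_count(word, stats, n=2):
    """出現文書数を返す。未格納の単語は構成n-gramの最小値で近似する。"""
    if word in stats:
        # 統計情報ファイルに格納されていれば、その値をそのまま使う
        return stats[word]
    # 単語を長さ n の構成要素に分解する（"ＬＡＮ" -> "ＬＡ", "ＡＮ"）
    grams = [word[i:i + n] for i in range(len(word) - n + 1)]
    # 単語の出現文書数は、どの構成要素の出現文書数も超えない
    return min((stats.get(g, 0) for g in grams), default=0)

# 図２４の数値例
stats = {"ＬＡ": 807, "ＡＮ": 1512}
lan_count = approximate_doc_count("ＬＡＮ", stats)
```

この例では `lan_count` は "ＬＡ" の出現文書数である 807 となる。得られる値は真の出現文書数の上限であり、厳密な値ではない点に注意されたい。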
【０１６１】
以上が近似統計情報算出プログラム２２００の具体的な処理手順である。
【０１６２】
以上が本発明の第四の実施例である。
【０１６３】
以上説明したように、本発明の第四の実施例における類似文書検索システムを用いることにより、全ての単語の出現文書数を統計情報ファイルへ格納する必要がなくなるため、第三の実施例に比べ、統計情報ファイルの容量を削減することができるようになる。
【０１６４】
以上説明したように、本発明の第一の実施例から第四の実施例における類似文書検索システムでは、種文書の類似度を算出し、これに基づいて検索用単語数を調整しているため、検索精度を確保しながら高速に類似文書検索を実現することができる。
【０１６５】
次に、本発明の第五の実施例について図２５を用いて説明する。
【０１６６】
本発明を適用した類似文書検索システムの第五の実施例は、所定の検索時間で検索結果を出力するものである。
【０１６７】
本方法によれば、ユーザは所定の検索時間で検索結果を取得できるため、検索条件で指定した種文書が検索目的に合致しているかをストレスなく判断できるようになる。
【０１６８】
本実施例は、第一の実施例（図１）とほぼ同様の構成を取るが、類似度算出プログラム１３２の構成が異なり、検索処理時間計測プログラム２５００が加わる。
【０１６９】
以下、第一の実施例と異なる類似度算出プログラム１３２ｂの処理手順を図２６のＰＡＤ図を用いて説明する。
【０１７０】
類似度算出プログラム１３２ｂは、ステップ２６００において、検索処理時間計測プログラム２５００を起動し、検索処理時間の計測を開始する。
【０１７１】
次に、ワークエリア１７０に格納された全ての検索用単語に対して、検索処理時間が所定の値（以下、検索制限時間と呼ぶ）以下ならば、ステップ１６０２、１６０３および２６０２を繰り返し実行する（ステップ２６０１）。
【０１７２】
ステップ１６０２では、検索用単語出現回数取得プログラム１６１を起動し、検索用単語に対応する全文検索用情報ファイル１８０を参照して、各登録文書内での出現回数を取得し、ワークエリア１７０に格納する。
【０１７３】
次にステップ１６０３において、要素別類似度算出プログラム１６２を起動し、ワークエリア１７０に格納された検索用単語の種文書内出現回数および登録文書内出現回数を用いて、所定の算出式により種文書に対する登録文書の要素別類似度を算出し、登録文書全体の類似度に加算する。
【０１７４】
そして、ステップ２６０２において、検索処理時間計測プログラム２５００を起動し、検索処理時間の経過時間を測定し、検索処理時間を算出する。
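検索制限時間つきの繰り返し処理は、次のようにスケッチできる。全文検索用情報を「単語 →（文書番号，出現位置）のリスト」という辞書で表すこと、要素別類似度を出現回数の積とすること、関数名 `score_with_time_limit` と引数 `clock` は、いずれも説明のための仮定である。

```python
import time
from collections import Counter, defaultdict

def score_with_time_limit(search_words, index, time_limit_sec, clock=time.monotonic):
    """検索制限時間内に処理できた検索用単語だけで類似度を積算する。"""
    start = clock()  # 計測開始（ステップ２６００に相当）
    totals = defaultdict(int)
    used = 0
    for word, seed_count in search_words:
        if clock() - start > time_limit_sec:
            break  # 検索制限時間を超えたら打ち切る
        # 登録文書ごとの出現回数を取得する（ステップ１６０２に相当）
        per_doc = Counter(doc_id for doc_id, _pos in index.get(word, []))
        # 要素別類似度を各登録文書の類似度に加算する（ステップ１６０３に相当）
        for doc_id, doc_count in per_doc.items():
            totals[doc_id] += seed_count * doc_count
        used += 1
    return dict(totals), used

# 仮のデータと、呼び出すたびに進む擬似時計による実行例
index = {"ＬＡＮ": [(1, 1)], "構築": [(1, 5), (2, 8)], "ソリューション": [(3, 1)]}
search_words = [("ＬＡＮ", 4), ("構築", 3), ("ソリューション", 2)]
ticks = iter([0.0, 0.1, 0.6, 1.2])
totals, used = score_with_time_limit(search_words, index, 1.0, clock=lambda: next(ticks))
```

擬似時計では三つ目の単語の処理前に経過時間が制限時間 1.0 秒を超えるため、先頭の二つの検索用単語だけが類似度算出に使われる。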
【０１７５】
以上が類似度算出プログラム１３２ｂの処理手順である。
【０１７６】
以上が本発明の第五の実施形態である。
【０１７７】
なお、本実施例のステップ２６０１における検索制限時間は、検索実行時に検索条件として指定するものとしてもよいし、システム設定値として予め設定しておくものとしてもよい。
【０１７８】
また、本実施例では、検索制限時間を設定するものとしたが、設定値によっては少数の検索用単語しか用いられない場合も考えられるため、検索精度を保つための最小限の検索用単語数を設定できるようにしてもよい。この場合は、検索処理時間が検索制限時間を上回ったとしても、指定された最小限の検索用単語数までは類似検索を繰り返すことになる。
【０１７９】
さらに、本実施例では、検索処理時間計測プログラム２５００を用いて類似度算出処理に要する時間を計測するものとしたが、検索処理自体を計測するものとしてもよい。この場合、図２６に示したステップ２６００で検索時間の計測を開始するのではなく、検索制御プログラム１１２により検索条件解析プログラム１３０が起動される前に、検索処理時間計測プログラム２５００を起動し、検索処理時間の測定を開始すればよい。
【０１８０】
以上説明したように本発明の第五の実施例における類似文書検索システムでは、検索に要する時間に基づいて検索用単語数を調整するため、所定の処理時間で検索結果を取得することができるようになる。
【０１８１】
この結果として、ユーザは検索終了時間を予測することができるようになる。
【０１８２】
なお、第一の実施例から第四の実施例で説明した種文書の類似度を目安に検索を終了する類似文書検索システムと第五の実施例で説明した検索時間を目安に検索を終了する類似文書検索システムを検索実行時あるいはシステム定義で切り替えて使用することも可能である。
【０１８３】
次に、本発明の第六の実施例について図２７を用いて説明する。
【０１８４】
本発明を適用した類似文書検索システムの第六の実施例は、種文書から抽出された単語のうち検索に使用される検索用単語から検索時間を推定し、長大な時間を要する場合にはユーザに確認を求めるものである。
【０１８５】
本方法によれば、第一の実施例から第四の実施例で説明した類似文書検索システムにおける検索用単語抽出条件では検索に長大な時間を要する場合、事前に検索を取りやめることができるため、ユーザは不用意に待たされることがなくなる。
【０１８６】
本実施例は、第一の実施例（図１）とほぼ同様の構成を取るが、検索用単語抽出プログラム１３１の構成が異なり、図２７に示すように検索時間推定確認プログラム２７００が加わる。
【０１８７】
以下、第一の実施例と異なる検索用単語抽出プログラム１３１ｂの処理手順を図２８のＰＡＤ図を用いて説明する。
【０１８８】
検索用単語抽出プログラム１３１ｂでは、まずステップ１５００において、単語重要度算出プログラム１５０を起動し、所定の算出式に基づきワークエリア１７０に格納された単語の重要度を算出し、ワークエリア１７０に格納する。
【０１８９】
次に、前記ステップ１５００でワークエリア１７０に格納された全ての単語に対して、ステップ１５０２〜１５０５を繰り返し実行する（ステップ１５０１）。
【０１９０】
まず、ステップ１５０２において、ワークエリア１７０に格納されている単語を重要度の降順に取得する。
【０１９１】
次に、ステップ１５０３において、検索用単語抽出判定プログラム１５１を起動し、種文書の要素別類似度を算出する。
【０１９２】
そして、ステップ１５０４において、種文書の要素別類似度が、所定の閾値を超えているかを判定し、超えている場合にはステップ１５０５を実行し、超えていない場合には繰り返し処理を終了する。
【０１９３】
そして、ステップ１５０５において、該単語を検索用単語としてワークエリア１７０に格納する。
【０１９４】
次に、ステップ２８００において、ワークエリア１７０に格納された検索用単語から検索時間を推定し、推定された検索時間（以下、推定検索時間と呼ぶ）が所定の値（指定検索時間）を超える場合には、検索の継続を確認するメッセージを表示し、ユーザの確認を受ける。この確認メッセージとしては、例えば図６に示したように、継続ボタン２９０１およびキャンセルボタン２９０１を有するメッセージ２９００を表示するものであってもよい。
【０１９５】
以上が検索用単語抽出プログラム１３１ｂの処理手順である。
【０１９６】
なお、上記ステップ２８００における指定検索時間としては、検索条件として指定するものとしてもよいし、システム定義として予め指定されるものとしてもよいし、あるいはいくつかのテストパターンの結果から自動的に設定されるものとしてもよい。
【０１９７】
また、上記ステップ２８００における検索時間の推定方法としては、該検索用単語の出現文書数から推定するものとしてもよいし、該検索用単語に対応する全文検索用情報ファイル１８０のサイズから推定するものとしてもよい。あるいは、いくつかのテストパターンを用いてひとつの検索用単語に要する平均時間を計測しておき、該平均時間を用いて検索時間を推定するものとしてもよい。
【０１９８】
以上説明したように、本実施例に示した類似文書検索システムでは、抽出された検索用単語から検索時間を推定し、推定検索時間が予め指定された時間を超える場合には検索用単語の抽出条件を調整することが可能となるため、ユーザは不用意に待たされることがなくなる。
【０１９９】
【発明の効果】
以上説明したように、本発明では、種文書の類似度を目安に検索用単語数を設定しているため、類似度算出に使用する検索用単語数を削減することができる。これにより、検索精度を確保することのできる高速な類似文書検索を実現することができる。
【図面の簡単な説明】
【図１】本発明の第一の実施例における類似文書検索システムの全体構成を示す図である。
【図２】従来技術１の処理手順を説明するＰＡＤ図である。
【図３】従来技術１の概要を説明する図である。
【図４】従来技術１の類似度算出方式の考え方を説明する図である。
【図５】従来技術１の類似度算出方式の考え方を説明する図である。
【図６】本発明の第六の実施例における検索時間推定確認プログラム２７００による確認メッセージの例である。
【図７】本発明の処理手順を説明するＰＡＤ図である。
【図８】本発明の登録処理の概要を説明する図である。
【図９】本発明の検索処理の概要を説明する図である。
【図１０】本発明の検索用単語抽出処理の概要を説明する図である。
【図１１】本発明の第一の実施例におけるシステム制御プログラム１１０の処理手順を説明する図である。
【図１２】本発明の第一の実施例における登録制御プログラム１１１の処理手順を説明する図である。
【図１３】本発明の第一の実施例における検索制御プログラム１１２の処理手順を説明するＰＡＤ図である。
【図１４】本発明の第一の実施例における検索条件解析プログラム１３０の処理手順を説明するＰＡＤ図である。
【図１５】本発明の第一の実施例における検索用単語抽出プログラム１３１の処理手順を説明するＰＡＤ図である。
【図１６】本発明の第一の実施例における類似度算出プログラム１３２の処理手順を説明するＰＡＤ図である。
【図１７】本発明の第二の実施例における単語重要度算出プログラム１５０ａの構成を示す図である。
【図１８】本発明の第二の実施例における単語重要度算出プログラム１５０ａの処理手順を説明するＰＡＤ図である。
【図１９】本発明の第三の実施例における登録制御プログラム１１１ａの構成図である。
【図２０】本発明の第三の実施例における登録制御プログラム１１１ａの処理手順を示すＰＡＤ図である。
【図２１】本発明の第三の実施例における統計情報ファイル１９１０の例である。
【図２２】本発明の第四の実施例における統計情報参照プログラム１７００ｂの構成を示す図である。
【図２３】本発明の第四の実施例における統計情報参照プログラム１７００ｂの処理手順を説明するＰＡＤ図である。
【図２４】本発明の第四の実施例における近似統計情報の算出方法を説明する図である。
【図２５】本発明の第五の実施例における類似度算出プログラム１３２ｂの構成を示す図である。
【図２６】本発明の第五の実施例における類似度算出プログラム１３２ｂの処理手順を説明するＰＡＤ図である。
【図２７】本発明の第六の実施例における検索用単語抽出プログラム１３１ｂの構成を示す図である。
【図２８】本発明の第六の実施例における検索用単語抽出プログラム１３１ｂの処理手順を説明するＰＡＤ図である。
【符号の説明】
１００…ディスプレイ、１０１…キーボード、１０２…中央演算処理装置（ＣＰＵ）、１０３…磁気ディスク装置、１０４…フロッピディスクドライブ（ＦＤＤ）、１０５…主メモリ、１０６…バス、１０７…フロッピディスク、１１０…システム制御プログラム、１１１…登録制御プログラム、１１２…検索制御プログラム、１２０…登録文書読込プログラム、１２１…全文検索用情報ファイル作成登録プログラム、１３０…検索条件解析プログラム、１３１…検索用単語抽出プログラム、１３２…類似度算出プログラム、１３３…検索結果出力プログラム、１４０…種文書取得プログラム、１４２…単語抽出プログラム、１４３…種文書内出現回数計数プログラム、１５０…単語重要度算出プログラム、１５１…検索用単語抽出判定プログラム、１６１…検索用単語出現回数取得プログラム、１６２…要素別類似度算出プログラム、１７０…ワークエリア、１８０…全文検索用情報ファイル。
[0001]
[Technical Field of the Invention]
The present invention relates to a method for searching a document database for a document including content similar to the content described in a document designated by a user.
[0002]
[Prior art]
In recent years, with the spread of personal computers and the Internet, the number of electronic documents has grown explosively and is expected to keep growing at an accelerating rate. Under these circumstances, there is increasing demand for fast, efficient retrieval of documents containing the information a user wants.
[0003]
One technique attracting attention as a response to this demand is similar document search, in which the user presents as an example a document containing the desired content (hereinafter referred to as a seed document) and documents similar to it are retrieved.
[0004]
As a similar document search method, Japanese Patent Laid-Open No. 11-66086, for example, has been disclosed (hereinafter referred to as Prior Art 1).
[0005]
In Prior Art 1, when a document is registered in the document database, the information needed for full-text search of that document is created in advance (Prior Art 1 calls this an inverted index; hereinafter it is referred to as the full-text search index). When similar documents are searched for, this full-text search index is consulted to create, for each already registered document (hereinafter referred to as a registered document), a vector whose elements are the occurrence frequencies of the words it contains (hereinafter referred to as a feature vector). The cosine of the angle that this vector forms in the vector space with the feature vector of the document specified as the search condition (hereinafter referred to as the seed document) is then calculated as the similarity between the documents.
[0006]
Hereinafter, the processing procedure of Prior Art 1 will be described with reference to the PAD (Problem Analysis Diagram) of FIG. 2.
[0007]
In Prior Art 1, step 200 first determines whether the request is a document registration process or a similar document search process. If it is determined to be a document registration process, the full-text search index creation step 210 is executed to create the full-text search index.
[0008]
If step 200 determines that the request is a similar document search process, the seed document feature vector generation step 220 is executed to create a feature vector for the seed document. Then the similarity calculation step 221 using the full-text search index is executed, and the cosine of the angle formed in the vector space by the feature vector of the seed document and the feature vector of a registered document is calculated as the similarity between the documents.
[0009]
The above is the processing procedure of Prior Art 1.
[0010]
Hereinafter, an outline of Prior Art 1 will be described with reference to FIG. 3.
[0011]
In the document registration process of Prior Art 1, the full-text search index creation process 210 first extracts the words contained in documents 1 and 2 to be registered, together with their positions of occurrence, and creates the full-text search index 403. As a result, an entry such as “construction: (document 1, 5) (document 2, 8)” is recorded in the full-text search index 403. Here, “construction: (document 1, 5) (document 2, 8)” indicates that the character string “construction” appears at the 5th character of document 1 and at the 8th character of document 2.
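The kind of entry recorded in the full-text search index 403 can be sketched as follows. This is a minimal illustration, not the implementation of Prior Art 1: it assumes a small hypothetical word list in place of real word extraction, 1-based character positions, and the two Japanese sample sentences that the description quotes for FIG. 8 as documents 1 and 2. The name `build_index` is likewise made up for the example.

```python
from collections import defaultdict

def build_index(docs, vocabulary):
    """Map each word to a list of (doc_id, 1-based character position)."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for word in vocabulary:
            start = text.find(word)
            while start != -1:
                index[word].append((doc_id, start + 1))  # 1-based position
                start = text.find(word, start + 1)
    return dict(index)

docs = {
    1: "ＬＡＮの構築と運用・保守に必要な機器を提供する。",
    2: "情報システムの構築や保守を手がけるＳＩベンダと提携する。",
}
vocabulary = ["ＬＡＮ", "構築", "保守"]  # hypothetical word dictionary
index = build_index(docs, vocabulary)
```

With these inputs, the entry for 構築 (“construction”) is `[(1, 5), (2, 8)]`, which matches the entry described above.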
[0012]
In the similar document search process, the seed document specified by the search condition is extracted, and the seed document feature vector 406 corresponding to the seed document is generated through the seed document feature vector generation process 220.
[0013]
Next, the number of appearances in each registered document is acquired by referring to the full-text search index 403 created in the document registration process for all words included in the seed document feature vector 406.
[0014]
Here, as shown in FIG. 4, note that the cosine of two vectors X and Y is obtained by dividing the sum of the products of their corresponding components (for example, x(i) and y(i)) by the magnitudes of the two vectors. That is, rather than calculating the inner product one pair of vectors at a time, the inner-product component for each vector element (hereinafter referred to as the elemental similarity) is calculated first, and the elemental similarities are then summed over all elements. In FIG. 4, the i-th element of vector X is written “x(i)” and the magnitude of vector X is written “|X|”.
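Written out with the notation of FIG. 4, the relation reads as follows; n, the number of vector elements, is a symbol introduced here only for the illustration.

```latex
\cos(X, Y) = \frac{\sum_{i=1}^{n} x(i)\, y(i)}{|X|\,|Y|},
\qquad
|X| = \sqrt{\sum_{i=1}^{n} x(i)^{2}}
```

Each product x(i) y(i) in the numerator is one elemental similarity, so the numerator can be accumulated one element at a time.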
[0015]
That is, to calculate the cosine between the seed document feature vector 406 and the feature vector of each registered document in FIG. 3, the product of the numbers of occurrences in the seed document and in each registered document is calculated, for every word in the seed document feature vector 406, as that word's elemental similarity for the registered document. The cosine is then obtained by summing the per-word elemental similarities for each registered document.
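A minimal sketch of this word-by-word accumulation follows. It assumes that the index maps each word to a list of (document, position) postings, that the magnitude of each registered document's feature vector is supplied by the caller in `doc_norms`, and that the data are toy values; `score_documents` and every value below are hypothetical.

```python
import math
from collections import Counter, defaultdict

def score_documents(seed_counts, index, doc_norms):
    """Accumulate elemental similarities per document, then normalize to a cosine."""
    seed_norm = math.sqrt(sum(c * c for c in seed_counts.values()))
    totals = defaultdict(float)
    for word, x_i in seed_counts.items():
        # Occurrence count of this word in each registered document
        per_doc = Counter(doc_id for doc_id, _pos in index.get(word, []))
        for doc_id, y_i in per_doc.items():
            totals[doc_id] += x_i * y_i  # elemental similarity for this word
    return {d: t / (seed_norm * doc_norms[d]) for d, t in totals.items()}

# Toy data: two words, two registered documents
index = {"LAN": [(1, 1)], "construction": [(1, 5), (2, 8)]}
seed_counts = {"LAN": 4, "construction": 3}
doc_norms = {1: math.sqrt(2), 2: 1.0}  # magnitudes of (1, 1) and (0, 1)
scores = score_documents(seed_counts, index, doc_norms)
```

Only the postings of words that occur in the seed document are read, which is why the cost of the search grows with the number of distinct seed words.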
[0016]
Hereinafter, this similarity calculation method will be specifically described with reference to FIG. 5.
[0017]
When the seed document feature vector is denoted vector X, the feature vector of document 1 (hereinafter referred to as feature vector 1) vector Y, and the feature vector of document 2 (hereinafter referred to as feature vector 2) vector Z, the first components of the inner products of the seed document feature vector with feature vector 1 and with feature vector 2 can be calculated as “x(1)y(1)” and “x(1)z(1)”, respectively.
[0018]
Here, “x(1)” represents the number of occurrences of word 1 in the seed document, and “y(1)” and “z(1)” represent the numbers of occurrences of word 1 in document 1 and document 2, respectively.
[0019]
That is, the number of occurrences 600 of word 1 in each document can be obtained by counting the occurrences of word 1 in the seed document and by referring to the full-text search index entry corresponding to word 1.
[0020]
Similarly, the similarity of the registered document to the seed document can be calculated by referring to the full text search index corresponding to all the words in the seed document.
[0021]
The above is a specific description of the similarity calculation method in Prior Art 1.
[0022]
Finally, the similarity 407 of each registered document is output.
[0023]
The above is the outline of Prior Art 1.
[0024]
As described above, according to Prior Art 1, creating in advance a full-text search index for the words contained in registered documents makes it possible to generate the feature vectors of registered documents at search time. By calculating, as the similarity, the cosine with the seed document feature vector corresponding to the seed document specified as the search condition, documents with similar content can be retrieved from the document database.
[0025]
However, Prior Art 1 has a problem: because the full-text search index is consulted for every word extracted from the seed document and all of those words are used in the similarity calculation, an enormous amount of processing time is required when the seed document contains many words.
[0026]
For example, even if the full-text search index can be consulted for one distinct word of the seed document in 0.5 seconds, a seed document from which 100 distinct words are extracted would require as much as 50 seconds of processing time.
[0027]
On the other hand, if words are simply thinned out of the seed document feature vector to reduce processing time, the number of distinct words is reduced, so even words that carry important meaning in the seed document may be excluded, and search accuracy may drop sharply.
[0028]
[Problems to be solved by the invention]
In view of these problems, the present invention aims to solve the following problem.
[0029]
That is, an object of the present invention is to realize a high-speed similar document search method by using the minimum number of words that can still ensure search accuracy, in a similar document search method that does not create feature vectors of registered documents when documents are registered in the document database, but instead creates the feature vectors of all registered documents when similar documents are searched for and calculates similarity using the latest word information.
[0030]
[Means for Solving the Problems]
The PAD diagram shown in FIG. 7 shows the processing procedure of the similar document search shown in the present invention for solving the above problem.
[0031]
The similar document search method according to the present invention includes a process type determination process 200 for determining whether the request is a registration process or a search process, a full-text search index creation process 210 as the document registration process, and, as the similar document search process, a seed document feature vector generation process 220 and a similarity calculation process 221 using the full-text search index. The method is characterized by having a search word extraction process 701 between the seed document feature vector generation process 220 and the similarity calculation process 221 using the full-text search index.
[0032]
That is, the similar document search method according to the present invention has the following steps.
As the full-text search index creation process 210 at the time of document registration in the document database:
(Step 1) a registered document reading step of reading a document to be registered;
(Step 2) a full-text search information file creation and registration step of extracting full-text search information from the text of the document read in the registered document reading step and storing it in a full-text search information file.
As the seed document feature vector generation process 220 in the similar document search process:
(Step 3) a seed document acquisition step of acquiring the seed document specified by the search condition;
(Step 4) a seed document analysis and word extraction step of analyzing the seed document acquired in the seed document acquisition step and extracting the words contained in it;
(Step 5) an in-seed-document occurrence counting step of counting the number of occurrences of each word extracted in the seed document analysis and word extraction step.
As the search word extraction process 701:
(Step 6) a word importance calculation step of calculating the importance of each word based on the number of occurrences counted in the in-seed-document occurrence counting step;
(Step 7) a search word determination step of selecting words in descending order of the importance calculated in (Step 6), calculating each word's elemental similarity with respect to the seed document itself, and extracting the word as a search word when that elemental similarity exceeds a predetermined threshold.
As the similarity calculation process 221 using the full-text search index:
(Step 8) a similarity calculation step of executing the following (Step 9) and (Step 10) using the search words extracted from the seed document;
(Step 9) a search word occurrence count acquisition step of referring to the full-text search information created in the full-text search information file creation and registration step and acquiring the number of occurrences of the search word in each registered document;
(Step 10) an elemental similarity calculation step of calculating, for the search word, the elemental similarity between the seed document and each registered document from the number of occurrences in the seed document obtained in the in-seed-document occurrence counting step and the number of occurrences in each registered document obtained in the search word occurrence count acquisition step, and adding it to the overall similarity of each registered document;
(Step 11) a search result output step of outputting the similarity calculated in the elemental similarity calculation step.
[0033]
The principle of the present invention using the similar document search method will be described with reference to FIGS.
[0034]
In the similar document search method of the present invention, (Step 1) and (Step 2) are executed when a document is registered in the document database.
[0035]
Hereinafter, an outline of a processing procedure for registering a document will be described with reference to FIG.
[0036]
First, a document to be registered is read in (Step 1). In the example shown in FIG. 8, document 1 “provides equipment necessary for LAN construction and operation / maintenance.” and document 2 “partners with an SI vendor that handles information system construction and maintenance.” are read as the registration target documents.
[0037]
Next, in (Step 2), full-text search information is extracted from the text of the registration target document read in (Step 1) and stored in the full-text search information file.
[0038]
In the example shown in FIG. 8, (document 1, 1) is extracted as the full-text search information corresponding to “L” contained in document 1 and is stored in the full-text search information file 803. Here, “L: (document 1, 1)” indicates that the character “L” appears at character position 1 of document 1.
[0039]
Any full-text search information may be used here as long as the number of occurrences of an arbitrary word or character string in each registered document can be acquired from it. The word index method shown in prior art 1 may be used, or the n-gram index method disclosed in “JP 08-194718” may be used.
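The registration side can be illustrated with a minimal sketch of a positional n-gram index. This is not the file layout of the full-text search information file 803; the function names `build_index` and `count_occurrences`, the in-memory dictionary, and the sample texts are assumptions made only for illustration. The sketch records (document, character position) pairs as in the “L” example above and shows how the number of occurrences of an arbitrary string can be recovered from such postings.

```python
from collections import defaultdict


def build_index(documents, n=1):
    """Map each n-gram to a list of (document id, 1-based character position)."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for pos in range(len(text) - n + 1):
            index[text[pos:pos + n]].append((doc_id, pos + 1))
    return index


def count_occurrences(index, term, n=1):
    """Count occurrences of `term` per document by aligning adjacent n-gram postings."""
    if len(term) < n:
        raise ValueError("term is shorter than the n-gram length")
    grams = [term[k:k + n] for k in range(len(term) - n + 1)]
    # Candidate start positions are the postings of the first n-gram.
    starts = set(index.get(grams[0], []))
    for offset, gram in enumerate(grams[1:], start=1):
        # Keep only starts where the next n-gram appears exactly `offset` characters later.
        shifted = {(doc_id, pos - offset) for doc_id, pos in index.get(gram, [])}
        starts &= shifted
    counts = defaultdict(int)
    for doc_id, _ in starts:
        counts[doc_id] += 1
    return dict(counts)


# Hypothetical sample texts, not the documents of FIG. 8.
documents = {
    1: "LAN construction and LAN operation",
    2: "information system construction",
}
index = build_index(documents)
```

With this layout, `index["L"]` begins with `(1, 1)`, which corresponds to the notation “L: (document 1, 1)”.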
[0040]
The above is the outline of the processing procedure for document registration according to the present invention.
[0041]
Next, in the similar document search method shown in the present invention, (Step 3) to (Step 11) are executed when searching for a document.
[0042]
Hereinafter, an outline of a processing procedure for searching a document will be described with reference to FIG.
[0043]
First, in (Step 3), the seed document 901 “Developing a solution with a LAN system construction know-how as a weapon” designated as the search condition is read.
[0044]
In (Step 4), the seed document is analyzed, and the words contained in the seed document are extracted. The seed document analysis process used here may refer to a word dictionary and extract the words contained in that dictionary, as disclosed in prior art 1; it may use the word extraction method based on statistical information in the document database disclosed in “Japanese Patent Laid-Open No. 10-148721”; it may mechanically extract the n-grams contained in the seed document; or it may use another word extraction technique.
[0045]
In the example shown in FIG. 9, a word string 903 (LAN, construction, know-how, weapon, solution, deployment, ...) is extracted as a result of this seed document analysis processing.
[0046]
Next, in (Step 5), the number of appearances in the seed document of each word extracted in (Step 4) is counted, and pairs 904 of a word and its number of appearances ([LAN, 4] [construction, 3] [know-how, 2] [weapon, 1] [solution, 2] [deployment, 1]) are output.
[0047]
Here, [LAN, 4] indicates that the word “LAN” appears four times.
[0048]
Next, in (Step 6), the importance is calculated for each of the pairs 904 of a word and its number of appearances extracted in (Step 5), and pairs of a word and its importance are output. As the importance, for example, the number of appearances in the seed document may be used, or the ratio of the number of documents in which the word appears to the number of documents registered in the database (hereinafter referred to as the appearance ratio) may be used. In the example shown in FIG. 9, the number of appearances in the seed document 901 is used as the importance of each word, and the word importance column 905 “[LAN, 4] [construction, 3] [solution, 2] ...” is output. Here, [LAN, 4] indicates that the word “LAN” is contained in the seed document with importance “4”.
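(Step 5) and (Step 6) can be sketched as follows for the case where the number of appearances in the seed document is used directly as the importance. The function name `word_importance` and the input word list are hypothetical; the list is simply arranged so that it reproduces the counts of the pairs 904. How ties are ordered is not stated in the text, so keeping the order of first appearance is an assumption.

```python
from collections import Counter


def word_importance(words):
    """Count each word (Step 5) and return (word, importance) pairs, highest first (Step 6).

    The importance here is simply the number of appearances in the seed document.
    Ties keep the order of first appearance (an assumption; sorted() is stable).
    """
    counts = Counter(words)
    return sorted(counts.items(), key=lambda pair: -pair[1])


# Hypothetical word sequence whose counts match the pairs 904.
seed_words = [
    "LAN", "construction", "LAN", "know-how", "LAN", "construction",
    "solution", "weapon", "know-how", "LAN", "construction", "solution",
    "deployment",
]
importance = word_importance(seed_words)
```

The first two entries of `importance` are `("LAN", 4)` and `("construction", 3)`, matching the head of the word importance column 905.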
[0049]
Then, in (Step 7), the elemental similarity to the seed document itself is calculated for each word in descending order of the importance calculated in (Step 6), and when the elemental similarity exceeds a predetermined threshold, the word is extracted as a search word. As a result, the search words [LAN, 4] [construction, 3] are extracted.
[0050]
Next, in (Step 8) to (Step 10), the similarity of each registered document to the seed document is calculated by referring to the number of appearances in the seed document of each search word obtained in (Step 7) and to the full-text search information file 803 created in (Step 2).
[0051]
In (step 11), the similarity calculation result 906 is output.
[0052]
The above is the outline of the processing procedure for the document search of the present invention.
[0053]
Hereinafter, the search word extraction processing procedure executed in the above-described (Step 7) will be described with reference to FIG.
[0054]
First, in (Step 7), the word importance column 905 output in (Step 6) is read, and words are selected in descending order of importance. In FIG. 10, [LAN, 4] is first extracted from the word importance column 905 “[LAN, 4], [construction, 3], [solution, 2]...”.
[0055]
Then, using the number of occurrences “4” of the search word “LAN” in the seed document, the elemental similarity that this search word contributes to the similarity of the seed document with itself is calculated. That is, it is assumed that a document identical to the seed document exists as a registered document (hereinafter referred to as a virtual registered document), and the elemental similarity of the search word between the seed document feature vector and the feature vector of the virtual registered document is calculated, and the sum is calculated.
[0056]
In FIG. 10, the product of the number of appearances “4” in the seed document of the search word “LAN” and the number of appearances “4” in the virtual registration document is calculated, and the similarity by element “16” is obtained.
[0057]
As a result, the elemental similarity to the seed document itself for the search word “LAN” exceeds the predetermined threshold value (5 in the example shown in the figure), so “LAN” is stored in the work area 170 as a search word.
[0058]
Next, [construction, 3], which has the next highest importance after [LAN, 4], is selected, and its elemental similarity to the seed document itself is calculated. The resulting elemental similarity is 9, which exceeds the predetermined threshold value 5, so “construction” is also stored in the work area 170 as a search word.
[0059]
Then, [solution, 2], which has the next highest importance after [construction, 3], is selected, and its elemental similarity to the seed document itself is calculated. The resulting elemental similarity is 4, which does not exceed the predetermined threshold value, so the word is not extracted as a search word and the processing is terminated.
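The walkthrough above can be reproduced with a short sketch. The function name `extract_search_words` and the data structures are assumptions for illustration; the elemental similarity to the virtual registered document is taken as the product of the seed document count with itself, as in the “LAN” example (4 × 4 = 16).

```python
def extract_search_words(importance, seed_counts, threshold):
    """Select search words in descending order of importance (Step 7).

    `importance` is a list of (word, importance) pairs and `seed_counts` maps
    each word to its number of appearances in the seed document. Selection stops
    at the first word whose elemental similarity to the seed document itself
    does not exceed `threshold`.
    """
    search_words = []
    for word, _ in sorted(importance, key=lambda pair: -pair[1]):
        # The virtual registered document is identical to the seed document,
        # so both appearance counts are the same value.
        elemental_similarity = seed_counts[word] * seed_counts[word]
        if elemental_similarity <= threshold:
            break
        search_words.append(word)
    return search_words


importance = [("LAN", 4), ("construction", 3), ("solution", 2)]
seed_counts = {"LAN": 4, "construction": 3, "solution": 2}
search_words = extract_search_words(importance, seed_counts, threshold=5)
```

With the threshold 5 used in the example, the elemental similarities are 16, 9 and 4, so `search_words` contains “LAN” and “construction” and the loop ends at “solution”.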
[0060]
The above is the description of the search word extraction processing procedure.
[0061]
As described above, when a document is registered in the document database, a full-text search index is created instead of a feature vector for the registered document. At search time, words are selected in descending order of their importance in the seed document, and the selected words are used as search words until the similarity to the seed document itself converges. Compared to the case where all the words are used for the search, the similarity between the seed document and the registered documents can therefore be calculated at high speed without drastically reducing the search accuracy.
[0062]
DETAILED DESCRIPTION OF THE INVENTION
A first embodiment of the present invention will be described below with reference to FIG.
[0063]
A first embodiment of a similar document search system to which the present invention is applied comprises a display 100, a keyboard 101, a central processing unit (CPU) 102, a magnetic disk device 103, a floppy disk drive (FDD) 104, a main memory 105, and a bus 106 connecting them.
[0064]
The magnetic disk device 103 is one of secondary storage devices and stores a full text search information file 180.
[0065]
Information stored in the floppy disk 107 is read into the main memory 105 or the magnetic disk device 103 via the FDD 104.
[0066]
The main memory 105 stores a system control program 110, a registration control program 111, a search control program 112, a registered document reading program 120, a full text search information file creation registration program 121, a search condition analysis program 130, a search word extraction program 131, a similarity calculation program 132, and a search result output program 133, and a work area 170 is secured in it.
[0067]
The search condition analysis program 130 includes a seed document acquisition program 140, a word extraction program 142, and a seed document occurrence count program 143.
[0068]
The search word extraction program 131 includes a word importance calculation program 150 and a search word extraction determination program 151.
[0069]
The similarity calculation program 132 includes a search word appearance count acquisition program 161 and an elemental similarity calculation program 162.
[0070]
The registration control program 111 and the search control program 112 are activated by the system control program 110 in response to an instruction entered by the user from the keyboard 101. The registration control program 111 controls the registered document reading program 120 and the full text search information file creation registration program 121, and the search control program 112 controls the search condition analysis program 130, the search word extraction program 131, the similarity calculation program 132, and the search result output program 133.
[0071]
In this embodiment, the registration control program 111 and the search control program 112 are activated by a command input from the keyboard 101, but they may instead be activated by a command or event input via another input device.
[0072]
These programs may also be stored in a storage medium such as the magnetic disk device 103, floppy disk 107, MO, CD-ROM, or DVD (not shown in FIG. 1), read into the main memory 105 via a drive device, and executed by the CPU 102.
[0073]
Hereinafter, a processing procedure of the similar document search system in the present embodiment will be described.
[0074]
First, the processing procedure of the system control program 110 will be described with reference to the PAD diagram of FIG.
[0075]
The system control program 110 first analyzes a command input from the keyboard 101 in step 1100.
[0076]
If the result of the analysis is determined in step 1101 to be a registration execution command, the registration control program 111 is activated in step 1102 to register the document.
[0077]
If it is determined in step 1101 that the command is a search execution command, the search control program 112 is activated in step 1103 to search for similar documents.
[0078]
The processing procedure of the system control program 110 has been described above.
[0079]
Next, the processing procedure of the registration control program 111 activated by the system control program 110 in step 1102 shown in FIG. 11 will be described using the PAD diagram of FIG.
[0080]
In the registration control program 111, first, the registered document reading program 120 is started in step 1200, a document designated as a registration target (hereinafter referred to as a registration target document) is read and stored in the work area 170.
[0081]
Next, in step 1201, the full-text search information file creation / registration program 121 is started, and full-text search information corresponding to the registration target document stored in the work area 170 is created and stored in the full-text search information file 180.
[0082]
The processing procedure of the registration control program 111 has been described above.
[0083]
Next, the processing procedure of the search control program 112 activated by the system control program 110 in step 1103 shown in FIG. 11 will be described using the PAD diagram of FIG.
[0084]
In step 1300, the search control program 112 first starts the search condition analysis program 130 and extracts words from the seed document.
[0085]
Next, in step 1301, the search word extraction program 131 is started, the importance of each word extracted from the seed document in step 1300 is calculated, and words with high importance are extracted as search words based on a predetermined condition.
[0086]
In step 1302, the similarity calculation program 132 is activated, and the similarity of each registered document with respect to the seed document is calculated using the appearance information of the search word extracted from the seed document in step 1301.
[0087]
In step 1303, the search result output program 133 is activated, and the similarity calculation result calculated in step 1302 is output as a search result.
[0088]
Here, the search result may be displayed on the display 100 or may be stored in the work area 170 or on the magnetic disk device 103. Further, when the similarity calculation result is output to the display 100, it may be output in descending order of similarity, or in ascending or descending order of the management numbers assigned to the documents.
[0089]
The processing procedure of the search control program 112 has been described above.
[0090]
Next, the processing procedure of the search condition analysis program 130 activated by the search control program 112 in step 1300 shown in FIG. 13 will be described using the PAD diagram of FIG.
[0091]
In step 1400, the search condition analysis program 130 first activates the seed document acquisition program 140, extracts the seed document specified by the search condition, and stores it in the work area 170.
[0092]
In step 1402, the word extraction program 142 is activated to extract words from the seed document stored in the work area 170.
[0093]
In step 1403, the seed document occurrence count program 143 is activated, and the number of appearances in the seed document is counted for the words extracted in step 1402 and stored in the work area 170.
[0094]
The processing procedure of the search condition analysis program 130 has been described above.
[0095]
Next, the processing procedure of the search word extraction program 131 activated by the search control program 112 in step 1301 shown in FIG. 13 will be described using the PAD diagram of FIG.
[0096]
First, in step 1500, the search word extraction program 131 starts the word importance calculation program 150, calculates the importance of the words stored in the work area 170 based on a predetermined calculation formula, and stores it in the work area 170.
[0097]
Next, steps 1502 to 1505 are repeatedly executed for all the words stored in the work area 170 in step 1500 (step 1501).
[0098]
First, in step 1502, words stored in the work area 170 are acquired in descending order of importance.
[0099]
Next, in step 1503, the search word extraction determination program 151 is activated to calculate the elemental similarity of the seed document.
[0100]
In step 1504, it is determined whether the elemental similarity of the seed document exceeds a predetermined threshold value. If it exceeds, step 1505 is entered. If not, the iterative process is terminated.
[0101]
In step 1505, the word is stored in the work area 170 as a search word.
[0102]
The processing procedure of the search word extraction program 131 has been described above.
[0103]
The importance of each word used in step 1502 described above may be calculated from the number of appearances of the word in the seed document, as shown in prior art 1. As will be described later, statistical information such as the number of documents in the document database in which the word appears may be used, and furthermore, appearance position information in the document may be taken into consideration.
[0104]
Next, the processing procedure of the similarity calculation program 132 activated by the search control program 112 in step 1302 shown in FIG. 13 will be described using the PAD diagram of FIG.
[0105]
The similarity calculation program 132 repeats steps 1602 to 1603 for all search words stored in the work area 170 (step 1601).
[0106]
In step 1602, the search word appearance count acquisition program 161 is started, the full text search information file 180 is referred to for the search word, and the number of appearances of the search word in each registered document is acquired and stored in the work area 170.
[0107]
Next, in step 1603, the element-by-element similarity calculation program 162 is started. Using the number of appearances of the search word in the seed document and in each registered document, both stored in the work area 170, it calculates the elemental similarity of the registered document with respect to the seed document by a predetermined calculation formula and adds it to the overall similarity of that registered document.
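The loop of steps 1601 to 1603 can be sketched as follows. The “predetermined calculation formula” is not specified at this point, so the sketch assumes the plain product of the two appearance counts, in line with the product used for the virtual registered document in the search word extraction example. The function name `calculate_similarity`, the `doc_counts` mapping (standing in for the lookup of the full text search information file 180), and the sample counts are hypothetical.

```python
from collections import defaultdict


def calculate_similarity(search_words, seed_counts, doc_counts):
    """Accumulate the elemental similarity of each registered document.

    `doc_counts[word]` maps a document id to the number of appearances of `word`
    in that document (step 1602). For each search word, the product of the seed
    document count and the registered document count is added to the overall
    similarity of that document (step 1603, assumed formula).
    """
    similarity = defaultdict(int)
    for word in search_words:
        for doc_id, count in doc_counts.get(word, {}).items():
            similarity[doc_id] += seed_counts[word] * count
    return dict(similarity)


# Hypothetical appearance counts for two registered documents.
seed_counts = {"LAN": 4, "construction": 3}
doc_counts = {"LAN": {1: 1}, "construction": {1: 1, 2: 1}}
result = calculate_similarity(["LAN", "construction"], seed_counts, doc_counts)
```

Because the loop runs once per search word rather than once per registered document, reducing the number of search words directly reduces the number of index lookups.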
[0108]
The processing procedure of the similarity calculation program 132 has been described above.
[0109]
The above is the first embodiment of the present invention.
[0110]
In this embodiment, a word is extracted from the seed document by the search condition analysis program 130, but an n-gram may be extracted instead of the word. In this case, the unit processed by the search word extraction program 131 is also an n-gram.
[0111]
In step 1504 of the search word extraction program 131, it is determined whether the elemental similarity of the seed document calculated in step 1503 exceeds a predetermined threshold. Instead of the elemental similarity, it may be determined whether the sum of the similarities exceeds a predetermined threshold, or whether the ratio of the calculated similarity to the sum of the elemental similarities over all words extracted from the seed document exceeds a predetermined threshold.
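The ratio-based variant can be sketched as follows, under one possible reading of the text: words are added in descending order of their number of appearances until the accumulated self-similarity reaches a given fraction of the total over all words of the seed document. The function name `extract_by_ratio`, the use of the squared appearance count as the elemental similarity, and the threshold value 0.7 are all assumptions.

```python
def extract_by_ratio(seed_counts, ratio_threshold):
    """Select search words until the accumulated self-similarity ratio reaches the threshold."""
    total = sum(count * count for count in seed_counts.values())
    selected = []
    accumulated = 0
    for word, count in sorted(seed_counts.items(), key=lambda pair: -pair[1]):
        # Stop once the selected words already cover the requested share
        # of the seed document's similarity to itself.
        if total == 0 or accumulated / total >= ratio_threshold:
            break
        selected.append(word)
        accumulated += count * count
    return selected


seed_counts = {
    "LAN": 4, "construction": 3, "know-how": 2,
    "solution": 2, "weapon": 1, "deployment": 1,
}
selected = extract_by_ratio(seed_counts, ratio_threshold=0.7)
```

For these counts the total is 35; “LAN” contributes 16 and “construction” 9, so the accumulated ratio reaches 25/35 after two words and selection stops there.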
[0112]
In this embodiment, the number of occurrences of a word is used directly to calculate the similarity of each registered document with respect to the seed document. However, it will be clear that the number of occurrences may instead be normalized by the length of the seed document or of the registered document.
[0113]
As described above, according to the first embodiment of the present invention, the number of search words used for similarity calculation is reduced using the value of the elemental similarity with respect to the seed document as a guideline, so that the similarity calculation can be completed with the minimum necessary number of searches.
[0114]
As a result, the number of search words can be reduced without extremely reducing the search accuracy, and a high-speed similar document search can be realized.
[0115]
In this embodiment, the registration target document or the seed document is a document, but it is obvious that it may be a sentence or a character string.
[0116]
Further, in the search word extraction program 131 of the first embodiment described above, the search words are reduced using the elemental similarity of the seed document as a guideline; however, a predetermined number of search words may be extracted instead. In that case, the number of search words may be determined in advance using a test pattern so that the search is completed within a predetermined time.
[0117]
Next, a second embodiment of the present invention will be described with reference to FIG.
[0118]
The second embodiment of the similar document search system to which the present invention is applied uses statistical information of the registered documents stored in the document database when calculating the importance of the words extracted from the seed document.
[0119]
According to this method, in the word importance calculation by the word importance calculation program 150 in the first embodiment, it is possible to use not only the appearance information in the seed document but also the appearance information in the entire document database, It is possible to adjust the importance of words that frequently appear in the document database, and the word importance can be calculated with higher accuracy than in the first embodiment.
[0120]
This embodiment has almost the same configuration as the first embodiment (FIG. 1), but the configuration of the word importance calculation program 150 is different, and a statistical information reference program 1700 is added as shown in FIG.
[0121]
Hereinafter, the processing procedure of the word importance degree calculation program 150a different from the first embodiment will be described with reference to FIG.
[0122]
First, in step 1800, the word importance calculation program 150a activates the statistical information reference program 1700 and refers to the full-text search information file 180, thereby obtaining, as the statistical information of each word extracted from the seed document, the number of documents in the document database in which the word appears.
[0123]
Note that the number of documents in which a word appears can be obtained from the full-text search information file 180 by counting the number of distinct document numbers among the appearance positions stored for that word, as in the full-text search information file 803 shown in the drawing.
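Counting distinct document numbers from the stored positions can be sketched in a few lines. The function name `document_frequency` and the list-of-pairs representation of the postings are assumptions; the first two sample postings follow the “(document 1, 5) (document 2, 8)” notation used for the full-text search index, and the third is added only to show that repeated documents are counted once.

```python
def document_frequency(postings):
    """Return the number of distinct documents among (document id, position) postings."""
    return len({doc_id for doc_id, _ in postings})


# Hypothetical postings of one word: two documents, three occurrences in total.
postings = [(1, 5), (2, 8), (2, 20)]
frequency = document_frequency(postings)
```

Here `frequency` is 2, because the two occurrences in document 2 count as a single appearing document.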
[0124]
In step 1801, the importance level of each word extracted from the seed document is calculated using the number of occurrences of the word in the seed document and statistical information in the document database, and stored in the work area 170.
[0125]
The above is the processing procedure of the word importance calculation program 150a.
[0126]
As the word importance calculation formula in this embodiment, for example, the TF-IDF (Term Frequency, Inverse Document Frequency) method may be used.
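One common form of TF-IDF is sketched below; the text does not fix a particular formula, so the natural logarithm, the absence of smoothing, and the function name `tf_idf_importance` are assumptions. Returning 0.0 for a word that appears in no registered document is also an assumption, chosen because such a word cannot contribute to the similarity of any registered document.

```python
import math


def tf_idf_importance(seed_count, doc_frequency, total_docs):
    """Weight the seed document count by the rarity of the word in the database.

    seed_count    -- number of appearances of the word in the seed document
    doc_frequency -- number of registered documents containing the word
    total_docs    -- number of registered documents in the database
    """
    if doc_frequency == 0 or total_docs == 0:
        return 0.0
    return seed_count * math.log(total_docs / doc_frequency)


# Hypothetical figures: the same seed count, but very different rarity.
rare = tf_idf_importance(seed_count=4, doc_frequency=10, total_docs=1000)
frequent = tf_idf_importance(seed_count=4, doc_frequency=900, total_docs=1000)
```

With equal counts in the seed document, the rare word receives the higher importance, which is the adjustment for frequent words that this embodiment aims at.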
[0127]
The above is the second embodiment of the present invention.
[0128]
As described above, the similar document search system according to the second embodiment of the present invention makes it possible to calculate the word importance while taking into account words that frequently appear in the document database (hereinafter referred to as frequent words). In other words, by setting the word importance of frequent words low and that of rare words high, words that represent the characteristics of the seed document can be selected preferentially, and a high-precision similar document search can be realized.
[0129]
Next, a third embodiment of the present invention will be described with reference to FIG.
[0130]
The third embodiment of the similar document search system to which the present invention is applied is similar to the second embodiment in that the statistical information of the registered documents stored in the document database is used when the importance of the words extracted from the seed document is calculated, but it differs in that a statistical information file 1910 is used to obtain the statistical information.
[0131]
According to this method, it is possible to obtain statistical information to be referred to at the time of calculating word importance in the second embodiment at high speed.
[0132]
This embodiment has almost the same configuration as the second embodiment (FIG. 17), but the configuration of the registration control program 111 is different, and a statistical information file creation / registration program 1900 is added as shown in FIG. In addition, a statistical information file 1910 is stored in the magnetic disk device 103. In step 1800 of the word importance calculation program 150a, when the statistical information in the document database is obtained for each word extracted from the seed document, the statistical information file 1910 shown in the drawing is referred to instead of the full-text search information file 180.
[0133]
Hereinafter, a processing procedure of the registration control program 111a different from the second embodiment will be described with reference to FIG.
[0134]
In the registration control program 111a, first, the registered document reading program 120 is started in step 1200, and the document designated as the registration target is read and stored in the work area 170.
[0135]
Next, in step 1201, the full-text search information file creation / registration program 121 is started, and full-text search information corresponding to the registration target document stored in the work area 170 is created and stored in the full-text search information file 180.
[0136]
Next, in step 2000, the statistical information file creation / registration program 1900 is activated to create statistical information corresponding to the registered document stored in the work area 170 and store it in the statistical information file 1910.
[0137]
The processing procedure of the registration control program 111a has been described above.
[0138]
FIG. 21 shows an example of the statistical information file 1910 created by the statistical information file creation / registration program 1900.
[0139]
In the statistical information file 1910 shown in this figure, a management number 2100, a word 2101 and the number of appearing documents 2102 are stored.
[0140]
In the example shown in the drawing, the word “LA” is stored in the area of the management number “0”, and the number of appearance documents of the word is stored as “1”.
[0141]
In the example shown in FIG. 21, the statistical information file 1910 is stored in a table format. However, any format may be used as long as the word and its number of appearing documents can be acquired. For example, the information may be stored in a trie format, or may be stored in the top area of the full text search information file 180.
[0142]
The above is the third embodiment of the present invention.
[0143]
As described above, according to the third embodiment of the present invention, the statistical information of each word extracted from the seed document is obtained by referring to the statistical information file created in advance during the document registration process. It is therefore no longer necessary to count the number of distinct document numbers by referring to the full-text search information, and the statistical information can be acquired at high speed. This makes it possible to realize a similar document search that is faster than in the second embodiment.
[0144]
Next, a fourth embodiment of the present invention will be described with reference to FIG.
[0145]
In a fourth embodiment of the similar document search system to which the present invention is applied, the statistical information of each word extracted from the seed document is approximated and used.
[0146]
According to this method, the capacity of the statistical information stored in the statistical information file 1910 in the third embodiment can be reduced without extremely reducing the accuracy of the statistical information.
[0147]
This embodiment has almost the same configuration as that of the third embodiment (FIG. 19), but the configuration of the statistical information reference program 1700 is different and an approximate statistical information calculation program 2200 is added.
[0148]
Hereinafter, the processing procedure of the statistical information reference program 1700b different from the third embodiment will be described with reference to FIG.
[0149]
The statistical information reference program 1700b repeatedly executes Steps 2301 to 2304 for all words extracted from the seed document (Step 2300).
[0150]
In step 2301, the statistical information file 1910 is referenced to check whether statistical information corresponding to the word is stored.
[0151]
If the word is stored in the statistical information file 1910, step 2303 is executed, and if it is not stored, step 2304 is executed (step 2302).
[0152]
In step 2303, the statistical information file 1910 is referred to, and the statistical information of the word is acquired.
[0153]
In step 2304, the approximate statistical information calculation program 2200 is activated to calculate approximate statistical information of the word.
[0154]
The above is the processing procedure of the statistical information reference program 1700b.
[0155]
Next, the processing procedure of the approximate statistical information calculation program 2200 will be specifically described with reference to FIG.
[0156]
In the example shown in this figure, first, in step 2301, the statistical information file 1910 is referred to for the word 2400 “LAN” for which statistical information is to be acquired.
[0157]
Here, since “LAN” is not stored in the statistical information file 1910, step 2304 is executed.
[0158]
In step 2304, the statistical information of “LA” and “AN”, which are constituent elements of the word 2400 “LAN”, is acquired, and the smaller of their numbers of appearing documents is used as the statistical information of “LAN”.
[0159]
In the example shown in the figure, the number of appearing documents “807” stored in the statistical information 2401 of “LA” is compared with the number of appearing documents “1512” stored in the statistical information 2402 of “AN”. As a result, the number of appearance documents “807” of “LA” having a small value is stored as statistical information 2403 of “LAN” (2410).
[0160]
This utilizes the property that the number of documents containing the word “LAN” cannot be greater than the number of documents containing either of its constituent elements “LA” and “AN”. That is, the number of appearing documents of “LAN” itself should ideally be used, but the smaller of the numbers of appearing documents of its constituent elements “LA” and “AN” is used as an approximation.
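The lookup with fallback (steps 2301 to 2304) can be sketched as follows. The function name `approximate_doc_frequency` and the dictionary standing in for the statistical information file 1910 are assumptions, as is the choice to return `None` when no constituent element is stored, a case the text does not cover. The values 807 and 1512 are those of the example above.

```python
def approximate_doc_frequency(word, stats, n=2):
    """Return the stored number of appearing documents, or an upper-bound approximation.

    If `word` itself is stored, its value is returned (step 2303). Otherwise the
    smallest value among its stored n-gram constituents is used (step 2304),
    because a word cannot appear in more documents than any of its substrings.
    """
    if word in stats:
        return stats[word]
    grams = [word[i:i + n] for i in range(len(word) - n + 1)]
    known = [stats[gram] for gram in grams if gram in stats]
    if not known:
        return None  # no constituent is stored; handling is left open here
    return min(known)


stats = {"LA": 807, "AN": 1512}
approx = approximate_doc_frequency("LAN", stats)
```

For “LAN” the constituents are “LA” (807) and “AN” (1512), so `approx` is 807.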
[0161]
The specific processing procedure of the approximate statistical information calculation program 2200 has been described above.
[0162]
The above is the fourth embodiment of the present invention.
[0163]
As described above, with the similar document search system according to the fourth embodiment of the present invention, it is not necessary to store the number of appearing documents of every word in the statistical information file, so the capacity of the statistical information file can be reduced.
[0164]
As described above, in the similar document search system in the first to fourth embodiments of the present invention, the similarity of the seed document is calculated, and the number of search words is adjusted based on this. Thus, a similar document search can be realized at high speed while ensuring the search accuracy.
[0165]
Next, a fifth embodiment of the present invention will be described with reference to FIG.
[0166]
The fifth embodiment of the similar document search system to which the present invention is applied outputs a search result within a predetermined search time.
[0167]
According to this method, since the user can acquire the search result within a predetermined search time, it is possible to determine without stress whether the seed document specified by the search condition matches the search purpose.
[0168]
This embodiment has almost the same configuration as that of the first embodiment (FIG. 1), but the configuration of the similarity calculation program 132 is different and a search processing time measurement program 2500 is added.
[0169]
Hereinafter, the processing procedure of the similarity calculation program 132b different from the first embodiment will be described with reference to the PAD diagram of FIG.
[0170]
In step 2600, the similarity calculation program 132b activates the search processing time measurement program 2500 and starts measuring the search processing time.
[0171]
Next, for every search word stored in the work area 170, steps 1602, 1603, and 2602 are repeatedly executed as long as the search processing time is less than or equal to a predetermined value (hereinafter referred to as the search limit time) (step 2601).
[0172]
In step 1602, the search word appearance count acquisition program 161 is started, the full text search information file 180 corresponding to the search word is referred to, and the number of appearances in each registered document is acquired and stored in the work area 170.
[0173]
Next, in step 1603, the element-by-element similarity calculation program 162 is started. Using the number of appearances of the search word in the seed document and the number of appearances in each registered document, both stored in the work area 170, it calculates the element-by-element similarity of each registered document with respect to the seed document by a predetermined calculation formula and adds it to the similarity of the entire registered document.
[0174]
In step 2602, the search processing time measurement program 2500 is activated, the elapsed time is measured, and the search processing time is calculated.
[0175]
The above is the processing procedure of the similarity calculation program 132b.
[0176]
The above is the fifth embodiment of the present invention.
[0177]
Note that the search limit time in step 2601 of the present embodiment may be specified as a search condition when executing a search, or may be set in advance as a system setting value.
[0178]
In this embodiment, the search limit time is set. However, depending on the set value, only a small number of search words may be used. Therefore, a minimum number of search words may be set in order to maintain search accuracy. In this case, even if the search processing time exceeds the search limit time, the similarity calculation is repeated until the specified minimum number of search words has been processed.
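The time-limited loop of steps 2600 to 2602, together with the optional minimum number of search words, might be sketched as follows. All names (`calculate_similarities`, `index`, `seed_counts`, `min_words`, `clock`) are hypothetical, and the product of the two appearance counts is only a placeholder for the "predetermined calculation formula", which this passage does not specify.

```python
import time


def calculate_similarities(search_words, seed_counts, index, limit_seconds,
                           min_words=0, clock=time.monotonic):
    """Accumulate per-document similarity until the time limit is exceeded.

    index is a hypothetical stand-in for the full text search information
    file: {search_word: {document_id: appearance_count}}.
    """
    start = clock()                # step 2600: start measuring
    elapsed = 0.0
    totals = {}
    used = 0
    for word in search_words:      # step 2601: repeat while within the limit
        if elapsed > limit_seconds and used >= min_words:
            break
        # Step 1602: appearance count of the word in each registered document.
        for doc_id, count in index.get(word, {}).items():
            # Step 1603: placeholder element-by-element similarity,
            # added to the document's overall similarity.
            totals[doc_id] = totals.get(doc_id, 0.0) + seed_counts[word] * count
        used += 1
        elapsed = clock() - start  # step 2602: update the search processing time
    return totals, used


index = {"a": {"d1": 2}, "b": {"d1": 1, "d2": 3}, "c": {"d2": 5}}
seed_counts = {"a": 1, "b": 2, "c": 1}
print(calculate_similarities(["a", "b", "c"], seed_counts, index, limit_seconds=10))
```

Passing the clock as a parameter is a design choice for this sketch only; it lets the loop be exercised with a simulated clock instead of real elapsed time.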
[0179]
Furthermore, in this embodiment, the time required for the similarity calculation processing is measured using the search processing time measurement program 2500, but the time of the search processing as a whole may be measured instead. In this case, instead of starting the search time measurement at step 2600 shown in FIG. 26, the search processing time measurement program 2500 may be started by the search control program 112 before the search condition analysis program 130 is started, and measurement of the processing time may begin at that point.
[0180]
As described above, in the similar document search system according to the fifth embodiment of the present invention, the number of search words is adjusted based on the time required for the search, so that the search result can be acquired within a predetermined processing time.
[0181]
As a result, the user can predict the search end time.
[0182]
The similar document search system that ends the search based on the similarity of the seed document, described in the first to fourth embodiments, and the similar document search system that ends the search based on the search time, described in the fifth embodiment, may also be used selectively, with the choice made at search execution or by system definition.
[0183]
Next, a sixth embodiment of the present invention will be described with reference to FIG.
[0184]
In the sixth embodiment of the similar document search system to which the present invention is applied, the search time is estimated from the search words selected, from among the words extracted from the seed document, for use in the search, and the user is asked for confirmation.
[0185]
According to this method, when the search word extraction conditions of the similar document search system described in the first to fourth embodiments would require a long search time, the search can be canceled in advance, so the user is not kept waiting unnecessarily.
[0186]
This embodiment has almost the same configuration as the first embodiment (FIG. 1), but the configuration of the search word extraction program 131 is different, and a search time estimation confirmation program 2700 is added as shown in FIG.
[0187]
Hereinafter, the processing procedure of the search word extraction program 131b different from the first embodiment will be described with reference to the PAD diagram of FIG.
[0188]
In the search word extraction program 131b, first, in step 1500, the word importance calculation program 150 is activated, calculates the importance of the words stored in the work area 170 based on a predetermined calculation formula, and stores it in the work area 170.
[0189]
Next, steps 1502 to 1505 are repeatedly executed for all the words stored in the work area 170 in step 1500 (step 1501).
[0190]
First, in step 1502, words stored in the work area 170 are acquired in descending order of importance.
[0191]
Next, in step 1503, the search word extraction determination program 151 is activated to calculate the elemental similarity of the seed document.
[0192]
In step 1504, it is determined whether the elemental similarity of the seed document exceeds a predetermined threshold value. If it exceeds, step 1505 is entered. If not, the iterative process is terminated.
[0193]
In step 1505, the word is stored in the work area 170 as a search word.
[0194]
Next, in step 2800, the search time is estimated from the search words stored in the work area 170. When the estimated search time (hereinafter referred to as the estimated search time) exceeds a predetermined value (hereinafter referred to as the specified search time), a message confirming whether to continue the search is displayed and confirmation is received from the user. As this confirmation message, for example, as shown in FIG. 6, a message 2900 having a continuation button 2901 and a cancel button 2901 may be displayed.
[0195]
The above is the processing procedure of the search word extraction program 131b.
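The procedure of steps 1500 to 1505 followed by the confirmation of step 2800 can be sketched as below. This is an illustration under stated assumptions: `extract_search_words`, `confirm_if_slow`, `elemental_similarity`, `estimate_seconds`, and `ask_user` are hypothetical names, and both the elemental similarity and the time estimate are passed in as callables because their formulas are not given in this passage.

```python
def extract_search_words(importances, elemental_similarity, threshold):
    """Steps 1501 to 1505: select words in descending order of importance.

    importances: {word: importance}, as computed in step 1500.
    elemental_similarity: hypothetical callable returning the elemental
    similarity of the seed document for one word (step 1503).
    """
    selected = []
    # Step 1502: visit the words in descending order of importance.
    for word in sorted(importances, key=importances.get, reverse=True):
        # Step 1504: stop once the similarity no longer exceeds the threshold.
        if elemental_similarity(word) <= threshold:
            break
        selected.append(word)  # step 1505: keep the word as a search word
    return selected


def confirm_if_slow(search_words, estimate_seconds, specified_seconds, ask_user):
    """Step 2800: ask the user only when the estimate exceeds the limit."""
    estimated = estimate_seconds(search_words)
    if estimated > specified_seconds:
        return ask_user(estimated)  # True to continue, False to cancel
    return True


importances = {"search": 0.9, "document": 0.7, "the": 0.1}
words = extract_search_words(importances, importances.get, threshold=0.5)
print(words)  # ['search', 'document']
print(confirm_if_slow(words, lambda ws: 2.0 * len(ws), 3.0, lambda est: False))  # False
```

In the usage lines the importance value itself stands in for the elemental similarity, purely to keep the example small.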
[0196]
The specified search time in step 2800 may be specified as a search condition, specified in advance as a system definition, or set automatically from the results of several test patterns.
[0197]
In addition, as a method for estimating the search time in step 2800, the search time may be estimated from the number of appearing documents of the search words, or from the size of the full-text search information file 180 corresponding to the search words. Alternatively, the average time required for one search word may be measured using several test patterns, and the search time may be estimated using that average time.
[0198]
As described above, in the similar document search system shown in the present embodiment, the search time is estimated from the extracted search words, and if the estimated search time exceeds the time specified in advance, the search word extraction conditions can be adjusted, so the user is not kept waiting unnecessarily.
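Two of the estimation methods mentioned above can be sketched as follows. The function names and the assumption that search time grows in proportion to the number of appearing documents (with a hypothetical `seconds_per_document` factor) are illustrative only; the text does not state how the estimate is derived from those quantities.

```python
def estimate_by_document_count(search_words, doc_counts, seconds_per_document):
    """Estimate search time from the appearing-document counts of the words.

    Assumes, for illustration, that cost is proportional to the total
    number of appearing documents across all search words.
    """
    total_documents = sum(doc_counts.get(w, 0) for w in search_words)
    return total_documents * seconds_per_document


def average_seconds_per_word(test_runs):
    """Average time per search word from test patterns.

    test_runs: list of (number_of_search_words, measured_seconds) pairs.
    """
    total_words = sum(n for n, _ in test_runs)
    total_seconds = sum(s for _, s in test_runs)
    return total_seconds / total_words


def estimate_by_average(search_words, average_seconds):
    """Estimate search time as word count times the measured average."""
    return len(search_words) * average_seconds


avg = average_seconds_per_word([(10, 2.0), (30, 6.0)])
print(estimate_by_average(["LA", "AN", "LAN"], avg))
print(estimate_by_document_count(["LA", "AN"], {"LA": 807, "AN": 1512}, 0.001))
```

Either estimate could then be compared with the specified search time to decide whether the confirmation message is shown.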
[0199]
[Effect of the invention]
As described above, in the present invention, the number of search words is set based on the similarity of the seed document, so the number of search words used for calculating the similarity can be reduced. Thereby, a high-speed similar document search that can ensure the search accuracy can be realized.
[Brief description of the drawings]
FIG. 1 is a diagram showing an overall configuration of a similar document search system in a first embodiment of the present invention.
FIG. 2 is a PAD for explaining the processing procedure of prior art 1.
FIG. 3 is a diagram for explaining an overview of prior art 1.
FIG. 4 is a diagram for explaining the concept of the similarity calculation method of prior art 1.
FIG. 5 is a diagram for explaining the concept of the similarity calculation method of prior art 1.
FIG. 6 is an example of a confirmation message by a search time estimation confirmation program 2700 in the sixth embodiment of the present invention.
FIG. 7 is a PAD explaining the processing procedure of the present invention.
FIG. 8 is a diagram illustrating an overview of registration processing according to the present invention.
FIG. 9 is a diagram illustrating an outline of search processing according to the present invention.
FIG. 10 is a diagram illustrating an outline of a search word extraction process according to the present invention.
FIG. 11 is a diagram illustrating a processing procedure of the system control program 110 in the first embodiment of the present invention.
FIG. 12 is a diagram illustrating a processing procedure of a registration control program 111 in the first embodiment of the present invention.
FIG. 13 is a PAD for explaining the processing procedure of the search control program 112 in the first embodiment of the present invention.
FIG. 14 is a PAD explaining the processing procedure of the search condition analysis program 130 in the first embodiment of the present invention.
FIG. 15 is a PAD explaining the processing procedure of the search word extraction program 131 in the first embodiment of the present invention.
FIG. 16 is a PAD explaining the processing procedure of the similarity calculation program 132 in the first embodiment of the present invention.
FIG. 17 is a diagram showing a configuration of a word importance calculation program 150a in the second embodiment of the present invention.
FIG. 18 is a PAD diagram for explaining the processing procedure of the word importance calculation program 150a in the third embodiment of the present invention.
FIG. 19 is a block diagram of a registration control program 111a in the third embodiment of the present invention.
FIG. 20 is a PAD showing a processing procedure of a registration control program 111a in the third embodiment of the present invention.
FIG. 21 shows an example of a statistical information file 1910 in the third embodiment of the present invention.
FIG. 22 is a diagram showing a configuration of a statistical information reference program 1700b according to the fourth embodiment of the present invention.
FIG. 23 is a PAD diagram for explaining the processing procedure of a statistical information reference program 1700b according to the fourth embodiment of the present invention.
FIG. 24 is a diagram for explaining a method of calculating approximate statistical information in the fourth embodiment of the present invention.
FIG. 25 is a diagram showing a configuration of a similarity calculation program 132b in the fifth embodiment of the present invention.
FIG. 26 is a PAD explaining the processing procedure of the similarity calculation program 132b in the fifth embodiment of the present invention.
FIG. 27 is a diagram showing a configuration of a search word extraction program 131b in a sixth embodiment of the present invention.
FIG. 28 is a PAD illustrating the processing procedure of the search word extraction program 131b in the sixth embodiment of the present invention.
[Explanation of symbols]
100 ... Display, 101 ... Keyboard, 102 ... Central processing unit (CPU), 103 ... Magnetic disk apparatus, 104 ... Floppy disk drive (FDD), 105 ... Main memory, 106 ... Bus, 107 ... Floppy disk, 110 ... System control program, 111 ... Registration control program, 112 ... Search control program, 120 ... Registered document reading program, 121 ... Full text search information file creation / registration program, 130 ... Search condition analysis program, 131 ... Search word extraction program, 132 ... Similarity calculation program, 133 ... Search result output program, 140 ... Seed document acquisition program, 142 ... Word extraction program, 143 ... Seed document occurrence count counting program, 150 ... Word importance calculation program, 151 ... Search word extraction determination program, 161 ... Search word occurrence count acquisition program, 162 ... Element-by-element similarity calculation program, 170 ... Work area, 180 ... Full-text search information file.

Claims

A similar document search method in a similar document search apparatus that searches, from among registered documents such as sentences or character strings registered in a document database, for documents similar to a specified seed document,
wherein the similar document search apparatus executes:
a full-text search information creation step of creating a full-text search index for a document to be registered and storing it as full-text search information;
a character string extraction step of extracting, from the specified seed document, predetermined character strings and the number of times each character string appears in the seed document;
a similarity calculation repetition step of determining the importance of each extracted character string based on its number of occurrences in the seed document, selecting the character strings in descending order of importance as a processing target character string, and repeating a continuation determination step and a similarity calculation step;
the continuation determination step of ending the repetition processing of the similarity calculation repetition step when the importance of the processing target character string is smaller than a predetermined threshold;
the similarity calculation step of referring to the full-text search information to calculate the number of occurrences of the processing target character string in each registered document, calculating, based on the number of occurrences in the registered document and the number of occurrences in the seed document, the similarity of each registered document to the seed document for the processing target character string, and adding it to the total similarity of each registered document; and
a search result output step of outputting the total similarity of each registered document to the seed document calculated in the similarity calculation step.

The similar document search method according to claim 1,
wherein the continuation determination step ends the repetition processing of the similarity calculation repetition step when the rate of increase of the total similarity of each registered document is smaller than a predetermined threshold.

The similar document search method according to claim 1 or 2,
wherein the continuation determination step ends the repetition processing of the similarity calculation repetition step when the elapsed time from the start of the similar document search exceeds a predetermined time.

A similar document search apparatus that searches, from among registered documents such as sentences or character strings registered in a document database, for documents whose content is similar to a specified seed document, the apparatus comprising:
full-text search information creation means for creating a full-text search index for a document to be registered and storing it as full-text search information;
character string extraction means for extracting, from the specified seed document, predetermined character strings and the number of times each character string appears in the seed document;
similarity calculation repetition means for determining the importance of each extracted character string based on its number of occurrences in the seed document, selecting the character strings in descending order of importance as a processing target character string, and repeating the processing of continuation determination means and similarity calculation means;
the continuation determination means for ending the repetition processing of the similarity calculation repetition means when the importance of the processing target character string is smaller than a predetermined threshold;
the similarity calculation means for referring to the full-text search information to calculate the number of occurrences of the processing target character string in each registered document, calculating, based on the number of occurrences in the registered document and the number of occurrences in the seed document, the similarity of each registered document to the seed document for the processing target character string, and adding it to the total similarity of each registered document; and
search result output means for outputting the total similarity of each registered document to the seed document calculated by the similarity calculation means.

The similar document search apparatus according to claim 4,
wherein the continuation determination means ends the repetition processing of the similarity calculation repetition means when the rate of increase of the total similarity of each registered document is smaller than a predetermined threshold.

The similar document search apparatus according to claim 4 or 5,
wherein the continuation determination means ends the repetition processing of the similarity calculation repetition means when the elapsed time from the start of the similar document search exceeds a predetermined time.

A storage medium storing a program for causing a computer to execute the steps of the similar document search method according to any one of claims 1 to 3.