JP3595184B2 - Document search method and document search device

Info

Publication number: JP3595184B2
Application number: JP5772099A
Authority: JP
Inventors: 啓一郎帆足; 一則松本; 圭子青木; 直己井ノ上; 和夫橋本
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 1998-03-12
Filing date: 1999-03-04
Publication date: 2004-12-02
Anticipated expiration: 2019-03-04
Also published as: JP2000172717A

Description

【０００１】
【発明の属する技術分野】
この発明は、検索対象文書群の中から類似する文書を検索するための文書検索の手法に関するものである。
【０００２】
【従来の技術】
従来、検索対象文書群の中から類似する文書を検索するための様々な類似性の尺度が考案され、これらに基づく検索手法が提案されている。しかし、どの手法を用いても不要な文書が検索される割合は依然として高く、検索精度を向上させるには、さらに絞り込み検索を行うことが必要となっている。
【０００３】
ここで、一般的な文書検索システムにおける絞り込み検索について簡単に説明する。
【０００４】
図１４は、従来の絞り込み検索の処理手順を示すフローチャートである。まず、システムは検索対象文書群の各文書の特徴量を抽出し（ステップ５０１）、次いで、参照文書集合を取り込む（ステップ５０２）。参照文書集合とは、ユーザがシステムに検索してもらいたい文書を示すために指定する参照文書の集合体であり、システムは参照文書集合に類似している文書を検索対象文書群から検索する。
【０００５】
なお、実際の文書検索においては、参照文書を複数指定することが多いため、ここでは参照文書をすべて参照文書集合として説明する。
【０００６】
ステップ５０２で参照文書集合を取り込むと、その参照文書集合の特徴量を抽出する（ステップ５０３）。次いで、抽出した特徴量をもとに、検索対象文書群から参照文書集合に類似していると思われる文書集合を検索し、ユーザに提示する（ステップ５０４）。ここで、ユーザは検索された文書集合を検討し、さらに絞り込み検索を行う必要があると判断した場合は、検索された文書集合の中から、参照文書集合に類似している文書を抽出し、新たな参照文書集合としてシステムに指定する。システムは、ユーザから絞り込み検索の指示があった場合は（ステップ５０５でＹｅｓ）、ステップ５０２へ戻り、新たに指定された参照文書集合を取り込み、再びステップ５０３、ステップ５０４の処理を繰り返す。
【０００７】
この絞り込み検索に用いられる特徴量としては、例えば文書中の単語（句）の出現頻度（ｔｅｒｍ ｆｒｅｑｕｅｎｃｙ）を要素とした特徴ベクトル、あるいは、この特徴ベクトルの各要素に重みを付加したＴＦ＊ＩＤＦ（Ｔｅｒｍ Ｆｒｅｑｕｅｎｃｙ＊Ｉｎｖｅｒｓｅ Ｄｏｃｕｍｅｎｔ Ｆｒｅｑｕｅｎｃｙ）などがあり、一般に広く用いられている。
【０００８】
ＴＦ＊ＩＤＦとは、文書中の各単語の出現頻度に、他の検索対象文書中での出現頻度を考慮した重みを加えた特徴量である。単語ｔに対する重みｗ_ｔは以下のような数式で表される。
【０００９】
ｗ_ｔ＝ｌｏｇ（Ｎ／ｆ_ｔ）
ただし、Ｎは検索対象文書数、ｆ_ｔは単語ｔを含む文書数である。したがって、文書の特徴ベクトルの各要素ｗ_ｄ．ｔは以下のように計算される。
【００１０】
ｗ_ｄ．ｔ＝ｆ_ｄ．ｔ・ｌｏｇ（Ｎ／ｆ_ｔ）
ただし、ｆ_ｄ．ｔは文書ｄ中の単語ｔの出現頻度である。
【００１１】
また、Ｉｗａｙａｍａらは、このような単語頻度の特徴量に基づき、ＳＶＭＶ（Ｓｉｎｇｌｅ Ｒａｎｄｏｍ Ｖａｒｉａｂｌｅ ｗｉｔｈ Ｍｕｌｔｉｐｌｅ Ｖａｌｕｅｓ）という統計的テキスト分類手法（以下、Ｉｗａｙａｍａらの手法）を提案し、他のテキスト分類手法との比較実験の結果、その優位性を示した（Ｉｗａｙａｍａ，Ｔｏｋｕｎａｇａ：“Ａ Ｐｒｏｂａｂｉｌｉｓｔｉｃ Ｍｏｄｅｌ ｆｏｒ Ｔｅｘｔ Ｃａｔｅｇｏｒｉｚａｔｉｏｎ： Ｂａｓｅｄ ｏｎ ａ Ｓｉｎｇｌｅ Ｒａｎｄｏｍ Ｖａｒｉａｂｌｅ ｗｉｔｈ Ｍｕｌｔｉｐｌｅ Ｖａｌｕｅｓ”，Ｐｒｏｃ ｏｆ ４ｔｈ Ｃｏｎｆｅｒｅｎｃｅ ｏｎ Ａｐｐｌｉｅｄ Ｎａｔｕｒａｌ Ｌａｎｇｕａｇｅ Ｐｒｏｃｅｓｓｉｎｇ，ｐｐ１６２−１６７，１９９４）。
【００１２】
一方、検索精度を向上させるための工夫の一つとして、類似検索の際に重要だと思われる語を選択して検索を行う手法（以下、重要語選択の手法）がある。
【００１３】
これまで文書間の類似度を測定する場合、一般的には各文書の中に出現する単語の頻度などを抽出し、これを要素として文書の特徴量を算定していた。しかし、ここで出現する全ての単語が文書間の類似度に影響を与えるとは考えにくく、逆に類似度に無関係な単語までも文書の特徴として考慮することにより、検索精度が低下してしまうおそれがある。そこで、検索入力ならびに検索対象の文書集合に出現する単語の中から、類似検索に必要な単語を抽出するようにしたのが重要語選択の手法である。
【００１４】
この重要語選択の手法として、各単語のχ（カイ）二乗値に基づいて重要語を選択する手法［１］が提案されている（Ｓｃｈｕｔｚｅ，Ｈｕｌｌ，Ｐｅｄｅｒｓｅｎ：“Ａ Ｃｏｍｐａｒｉｓｏｎ ｏｆ Ｃｌａｓｓｉｆｉｅｒｓ ａｎｄ Ｄｏｃｕｍｅｎｔ Ｒｅｐｒｅｓｅｎｔａｔｉｏｎ ｆｏｒ ｔｈｅ Ｒｏｕｔｉｎｇ Ｐｒｏｂｌｅｍ”，Ｐｒｏｃ ｏｆ ＡＣＭ ＳＩＧＩＲ’９５，１９９５．）
この手法［１］では、検索対象文書群Ｄ_ｉを参照文書_ｑに対して類似している文書の集合（以下、正解文書集合）Ｃと、このＣに属しない集合￣Ｃ（以下、Ｃの否定を示す）の２つの集合に分け、各単語ｗのχ二乗値を以下の数式により算出する。
【００１５】
【数１】

ただし、Ｎ_（Ｃ＋）は、Ｃ中の文書で単語ｗを含む文書の数、Ｎ_（Ｃ−）はＣ中の文書で単語ｗを含まない文書の数、Ｎ_{（￣Ｃ＋）}は￣Ｃ中の文書で単語ｗを含む文書の数、Ｎ_{（￣Ｃ−）} は￣Ｃ中の文書で単語ｗを含まない文書の数とする。
【００１６】
このようにして計算されたχ二乗値の上位Ｎ個の単語を重要語として選択し、これらの単語に基づいて特徴量を算出して、例えばＩｗａｙａｍａらの手法を用いて類似検索を行う。
【００１７】
また、上記手法［１］の応用例として、χ二乗値の正の根を元に重要語を抽出する手法［２］も提案されている（Ｎｇ，Ｇｏｈ，Ｌｏｗ；“Ｆｅａｔｕｒｅ Ｓｅｌｅｃｔｉｏｎ， Ｐｅｒｃｅｐｔｒｏｎ Ｌｅａｒｎｉｎｇ， ａｎｄ Ｕｓａｂｉｌｉｔｙ Ｃａｓｅ Ｓｔｕｄｙ ｆｏｒ Ｔｅｘｔ Ｃａｔｅｇｏｒｉｚａｔｉｏｎ”，Ｐｒｏｃ ｏｆ ＡＣＭ ＳＩＧＩＲ’９７，１９９７．）。この手法［２］は、正解文書集合Ｃ中の文書に含まれる語の方が￣Ｃに含まれる語と比較して重要性が高いという仮定にしたがって提案された重要語選択手法である。
【００１８】
さらに、類似文書の検索システムにおいては、検索式の情報を自動的に拡大する、いわゆる検索式拡張の技術を採用したシステムの開発が盛んに行われている。現在、最も一般的に採用されている検索式拡張手法の一つに、Ｒｏｃｃｈｉｏのアルゴリズムに基づくものがある。この手法はベクトル空間モデルに基づく類似検索のために開発された手法である。この手法は、最適な検索式（参照文書から生成されるベクトル）とそれに類似している文書との類似度を最大化させるとともに、非類似文書との類似度を最小化させる、という考え方に基づいている。Ｒｏｃｃｈｉｏによれば、このような検索式は類似文書群のベクトルの中心と、非類似文書群のベクトルの中心との差分ベクトルを求めることによって算出することができる。したがって、最適な検索式は以下の数式によって求められる。
【００１９】
【数２】

ただし、Ｒは類似文書の数、Ｎは非類似文書の数を表す。また、この式の結果、負の値をもつ要素の値は０に設定される。
【００２０】
上記手法では、類似文書に検索式を近づけ、非類似文書から検索式を遠ざけることによって検索式の最適化を図っている。しかし、この手法では元の検索式の特徴が失われてしまうという問題がある。そこで、元の検索式の特徴を保持しつつ、検索式の最適化を行う手法が開発されている。この手法では、元の検索式及び類似文書と非類似文書のベクトルにそれぞれ係数を添えることによって最適化を行う。以下に、その数式を示す。
【００２１】
【数３】

ＡＴ＆Ｔなどで開発されている“ＳＭＡＲＴ”と呼ばれる検索システムでは、上記手法に基づいた検索式拡張方式が採用されている。具体的には、類似文書から上記数式によって各単語の重みを計算し、単語毎の重みの総和が高い単語を抽出し、抽出された単語を元の検索式に加えるという検索式拡張手法であり、高い検索精度が得られている。
【００２２】
【発明が解決しようとする課題】
しかし、上記ＴＦ＊ＩＤＦをはじめとする既存のベクトル空間モデルの特徴量は、検索対象文書群の特徴を表してはいるが、参照文書集合の特徴を考慮していないため、例えば図１４のステップ５０２で参照文書集合が変化しても特徴量自体は変化しないという性質がある。したがって、参照文書集合の変化に対応した特徴量の抽出を行うことができず、また絞り込み検索を繰り返しても検索精度は向上しないという問題点があった。
【００２３】
また、上記手法［１］［２］で求められるχ二乗値は、Ｃと￣Ｃに含まれる文書数の比率に極端な差がある場合は検索精度が低くなるという欠点がある。これに対し、類似検索では一般に大量の検索対象文書群から、わずかな類似文書（上記Ｃに相当）を検索するものであるため、χ二乗値に基づく重要語選択手法は類似検索には有効でないという問題点があった。
【００２４】
また、上記Ｒｏｃｃｈｉｏの手法による検索精度向上の効果は多くの論文などで確認されている。しかし、この手法では検索対象文書内での単語の影響のみが考慮されており、参照文書と各検索対象文書との間の類似度における単語の影響については考慮されていないため、検索式拡張の際には必ずしも有効な単語が抽出されていない可能性がある。
【００２５】
この発明の第１の目的は、単語の特徴量を用いた類似検索において、不要な文書が検索される割合を少なくし、また絞り込み検索を繰り返した場合の検索精度を向上させることができる文書検索方法及び文書検索装置を提供することにある。
【００２６】
また、第２の目的は、重要語を用いた類似検索において、不要な文書が検索される割合を少なくし、かつ検索精度を向上させることができる文書検索方法及び文書検索装置を提供することにある。
【００２７】
さらに、第３の目的は、検索式拡張手法を用いた類似検索において、従来手法に比べて検索精度を向上させることができる文書検索方法及び文書検索装置を提供することにある。
【００２８】
【課題を解決するための手段】
上記第１の目的を達成するため、請求項１の発明は、検索対象文書群を構成する検索対象文書の特徴量を用いて、指定された参照文書に類似する文書を前記検索対象文書群から検索する文書検索方法において、前記参照文書と検索対象文書群に出現する単語が文書間の類似度に与える影響の大きさを単語寄与度として数値化し、該単語寄与度を要素とするベクトルデータを前記検索対象文書の特徴量とすることを特徴とする。
【００２９】
請求項２の発明は、請求項１において、前記参照文書が請求項１の文書検索方法で検索された文書であることを特徴とする。
【００３０】
請求項３の発明は、請求項１又は２において、前記参照文書が参照文書集合であることを特徴とする。
【００３１】
また、上記第１の目的を達成するため、請求項４の発明は、検索対象文書群を構成する検索対象文書の特徴量を用いて、指定された参照文書に類似する文書を前記検索対象文書群から検索する文書検索装置において、前記参照文書と検索対象文書群に出現する単語を抽出する出現単語抽出手段と、前記参照文書と検索対象文書群から前記出現単語抽出手段で抽出された単語を除いた文書を生成する単語除去手段と、前記参照文書と検索対象文書群に含まれる文書間の類似度と、前記単語除去手段で生成された参照文書と検索対象文書群に含まれる文書間の類似度とを算出する類似度算出手段と、前記類似度算出手段で算出された２つの類似度をもとに、前記参照文書と検索対象文書群に出現する単語が文書間の類似度に与える影響の大きさを単語寄与度として数値化し、該単語寄与度を要素とするベクトルデータを生成する寄与度算出手段とを備え、前記単語寄与度を要素とするベクトルデータを前記検索対象文書の特徴量として用いることを特徴とする。
【００３２】
請求項５の発明は、請求項４において、前記参照文書が請求項４の文書検索装置で検索された文書であることを特徴とする。
【００３３】
請求項６の発明は、請求項４又は５において、前記参照文書が参照文書集合であることを特徴とする。
【００３４】
上記第２の目的を達成するため、請求項７の発明は、指定された参照文書と検索対象文書群に出現する単語の中から選択した重要語を用いて、前記参照文書に類似する文書を前記検索対象文書群から検索する文書検索方法において、前記参照文書と前記検索対象文書群から抽出した正解文書集合に出現する単語が文書間の類似度に与える影響の大きさを単語寄与度として数値化し、前記参照文書と前記正解文書集合に出現する全ての単語から、単語寄与度の高い単語を重要語として選択することを特徴とする。
【００３５】
請求項８の発明は、請求項７において、前記参照文書と前記正解文書集合の各文書との間に出現する全ての単語について単語寄与度を算出した後、各単語毎に単語寄与度の総和を求め、該総和の上位にある単語を前記参照文書に対する重要語とすることを特徴とする。
【００３６】
請求項９の発明は、請求項７又は８において、前記参照文書が参照文書集合であることを特徴とする。
【００３７】
請求項１０の発明は、請求項７、８又は９において、前記正解文書集合が前記参照文書に対し類似している文書の集合であることを特徴とする。
【００３８】
また、上記第２の目的を達成するため、請求項１１の発明は、指定された参照文書と検索対象文書群に出現する単語の中から選択した重要語を用いて、前記参照文書に類似する文書を前記検索対象文書群から検索する文書検索装置において、前記参照文書と前記検索対象文書群から抽出した正解文書集合に出現する単語を抽出する出現単語抽出手段と、前記参照文書と正解文書集合から前記出現単語抽出手段で抽出された単語を除いた文書を生成する単語除去手段と、前記参照文書と正解文書集合に含まれる文書間の類似度と、前記単語除去手段で生成された参照文書と正解文書集合に含まれる各文書間の類似度とを算出する類似度算出手段と、前記類似度算出手段で算出された２つの類似度をもとに、前記参照文書と正解文書集合に出現する単語が文書間の類似度に与える影響の大きさを単語寄与度として数値化する寄与度算出手段と、前記参照文書と正解文書集合の各文書との間に出現する全ての単語から、前記寄与度算出手段で数値化された単語寄与度の高い単語を重要語として選択する重要語選択手段とを備え、前記選択された重要語に基づいて、前記参照文書に類似する文書を前記検索対象文書群から検索することを特徴とする。
【００３９】
請求項１２の発明は、請求項１１において、前記寄与度算出手段は、前記参照文書と前記正解文書集合の各文書との間に出現する全ての単語についてそれぞれ単語寄与度を算出し、前記重要語選択手段は、各単語毎に前記寄与度算出手段で算出された単語寄与度の総和を求め、該総和の上位にある単語を参照文書に対する重要語とすることを特徴とする。
【００４０】
請求項１３の発明は、請求項１１又は１２において、前記参照文書が参照文書集合であることを特徴とする。
【００４１】
請求項１４の発明は、請求項１１、１２又は１３において、前記正解文書集合が前記参照文書に対し類似している文書の集合であることを特徴とする。
【００４２】
上記第３の目的を達成するため、請求項１５の発明は、指定された参照文書と検索対象文書群の各文書を表すベクトルを検索式として含む類似度計算式を用いて、前記参照文書に類似する文書を前記検索対象文書群から検索する文書検索方法において、前記参照文書と前記検索対象文書群から抽出した正解文書集合に出現する単語が文書間の類似度に与える影響の大きさを単語寄与度として数値化し、前記正解文書集合に含まれる各文書から単語寄与度が低い単語を抽出し、抽出した全ての単語のうち前記参照文書に含まれていない単語について、各単語毎に単語寄与度の総和を求め、その結果に重み付けをして前記参照文書の検索式に加えることを特徴とする。
【００４３】
請求項１６の発明は、請求項１５において、前記参照文書が参照文書集合であることを特徴とする。
【００４４】
請求項１７の発明は、指定された参照文書と検索対象文書群の各文書を表すベクトルを検索式として含む類似度計算式を用いて、前記参照文書に類似する文書を前記検索対象文書群から検索する文書検索装置において、前記参照文書と前記検索対象文書群から抽出した正解文書集合に出現する単語を抽出する出現単語抽出手段と、前記参照文書と正解文書集合から前記出現単語抽出手段で抽出された単語を除いた文書を生成する単語除去手段と、前記参照文書と正解文書集合に含まれる文書間の類似度と、前記単語除去手段で生成された参照文書と正解文書集合に含まれる各文書間の類似度とを算出する類似度算出手段と、前記類似度算出手段で算出された２つの類似度をもとに、前記参照文書と正解文書集合に出現する単語が文書間の類似度に与える影響の大きさを単語寄与度として数値化する寄与度算出手段と、前記参照文書と正解文書集合に含まれる各文書に出現する全ての単語から、前記寄与度算出手段で数値化された単語寄与度の低い単語を抽出し、抽出した全ての単語のうち前記参照文書に含まれていない単語について、各単語毎に単語寄与度の総和を求め、その結果に重み付けをして前記参照文書の検索式に加える検索式拡張手段とを備えたことを特徴とする。
【００４５】
請求項１８の発明は、請求項１７において、前記参照文書が参照文書集合であることを特徴とする。
【００４６】
【発明の実施の形態】
以下、この発明に係わる文書検索方法及び文書検索装置の具体例を実施形態１、実施形態２及び実施形態３として説明する。
【００４７】
この実施形態１、２及び３では、参照文書をすべて参照文書集合として説明する。ただし、この発明に係わる文書検索方法及び文書検索装置、並びに以下に説明する実施形態１、２及び３の文書検索装置においては、指定する文書として（単体の）参照文書を用いても同様の作用効果を得ることができる。
【００４８】
［実施形態１］
この実施形態１では、文書間の類似度における単語の寄与度を算出し、この寄与度に基づく特徴量を用いて類似する文書を検索する場合の具体例について説明する。
【００４９】
図２は、実施形態１に係わる文書検索装置１０の機能的な構成を示すブロック図である。この文書検索装置１０は、参照文書集合などを入力する入力部１１と、後述する特徴量抽出部１２と、前記特徴量抽出部１２で抽出された特徴量としての寄与度をもとに、検索対象文書群から参照文書集合に類似する文書を検索する検索部１３と、特徴量抽出部１２及び検索部１３で使用される文書などを一時的に格納する記憶部１４と、前記検索部１３で検索された文書を出力する出力部１５とから構成されている。
【００５０】
この実施形態１の特徴量抽出部１２では、参照文書集合に依存する特徴量として、新たに文書間の類似度における単語の寄与度（単語寄与度）という概念を定義し、この単語寄与度を要素とするベクトルデータ（以下、特徴ベクトル）を算出する。ここでいう単語寄与度とは、検索対象文書群のある文書と参照文書集合間での類似度を計算する際に、文書中に出現する単語がその類似度に与える影響を数値化したもので、数値が大きいほどその単語が類似度に与える影響の大きいことを示している。すなわち、単語寄与度は、検索対象文書群のある文書と参照文書集合の両者に出現する単語をそのまま含ませた場合に算出される類似度から該単語を両者から除いた場合に算出される類似度への変化として定義される。なお、この実施形態１では、複数の文書からなる検索対象文書群に対し、同じく複数の文書からなる参照文書集合が与えられたものとする。
【００５１】
図１は、前記特徴量抽出部１２の具体例として構成された寄与度算出装置１２０の機能的な構成を示すブロック図である。この寄与度算出装置１２０は、出現単語抽出手段１２１、単語除去手段１２２、類似度算出手段１２３、寄与度算出手段１２４から構成されている。
【００５２】
出現単語抽出手段１２１は、入力した参照文書集合Ｄと、検索対象文書群Ｄｏｃｓに含まれる文書ｄｉ（ｄ１、ｄ２、・・・ｄｎ）について、その中に出現する単語（出現単語）とその出現頻度を求める。抽出する単語は、例えば名詞のみというように指定しておくことにより選別することができる。抽出された単語ｗｊ（ｗ１、ｗ２、・・・ｗｎ）は、出現単語リストＷ＝（ｗ１、ｗ２、・・・ｗｎ）として出力される。各単語の出現頻度は、類似度算出手段１２３で類似度を計算する際に用いられる。
【００５３】
単語除去手段１２２は、参照文書集合と検索対象文書群から、前記出現単語抽出手段１２１で抽出された単語を除いた文書を生成する。ここでは、ｄｉからｗｊを除いた文書をｄｉ′（ｗｊ）とし、Ｄからｗｊを除いた文書をＤ′（ｗｊ）とする。
【００５４】
類似度算出手段１２３は、参照文書集合と検索対象文書群に含まれる各文書間の類似度として、次のような２つの類似度を計算する。１つは、文書ｄｉ、Ｄ間の類似度Ｓｉｍ（ｄｉ、Ｄ）であり、もう１つは、文書ｄｉ′（ｗｊ）、Ｄ′（ｗｊ）間の類似度Ｓｉｍ（ｄｉ′（ｗｊ）、Ｄ′（ｗｊ））である。これら類似度の計算手法としては、既知の計算手法を用いることができる。この実施形態１では、前述したＩｗａｙａｍａらの手法を用いている。
【００５５】
寄与度算出手段１２４は、ｄｉ、Ｄ間の類似度における単語ｗｊの寄与度Ｃｏｎｔ（ｄｉ、Ｄ、ｗｊ）として、Ｓｉｍ（ｄｉ、Ｄ）−Ｓｉｍ（ｄｉ′（ｗｊ）、Ｄ′（ｗｊ））を計算する。寄与度Ｃｏｎｔ（ｄｉ、Ｄ、ｗｊ）は、ｄｉ、Ｄ中のすべての出現単語について求める。
【００５６】
これによると、文書ｄｉ、Ｄに出現する単語ｗｊの類似度に与える影響が大きい場合には、Ｓｉｍ（ｄｉ′（ｗｊ）、Ｄ′（ｗｊ））の値が小さくなるため、単語ｗｊの寄与度Ｃｏｎｔ（ｄｉ、Ｄ、ｗｊ）は大きくなる。一方、文書ｄｉ、Ｄに出現する単語ｗｊの類似度に与える影響が小さい場合には、Ｓｉｍ（ｄｉ′（ｗｊ）、Ｄ′（ｗｊ））の値が大きくなるため、単語ｗｊの寄与度Ｃｏｎｔ（ｄｉ、Ｄ、ｗｊ）は小さくなる。すなわち、単語ｗｊの寄与度Ｃｏｎｔ（ｄｉ、Ｄ、ｗｊ）は、単語ｗｊの類似度に与える影響の大小関係がそのまま反映された数値となる。
【００５７】
そして、単語ｗｊの寄与度Ｃｏｎｔ（ｄｉ、Ｄ、ｗｊ）は、参照文書集合の変化に応じて変化するため、この寄与度を要素とした特徴ベクトルからなる文書ｄｉの特徴量は、参照文書集合に依存した特徴量となる。したがって、検索の進行につれて参照文書集合が変化する絞り込み検索では、その参照文書集合の変化に適応した特徴量の抽出を行うことができることになり、また絞り込み検索を繰り返すことにより検索精度をさらに向上させることができる。
【００５８】
次に、上記のように構成された寄与度算出装置１２０において、単語ｗｊの寄与度と文書ｄｉの特徴ベクトルを計算する場合の処理手順を図３のフローチャートを用いて説明する。
【００５９】
まず、出現単語抽出手段１２１において、検索対象文書群と参照文書集合を取り込む（ステップ１０１）。ここで、検索対象文書群をＤｏｃｓ、参照文書集合をＤとする。次に出現単語抽出手段１２１において、ＤとＤｏｃｓに含まれるｄｉ（ｄ１、ｄ２、・・・ｄｎ）について、出現単語とその出現頻度を求め、出現単語リストＷ（＝ｗ１、ｗ２、・・・ｗｎ）を抽出する（ステップ１０２）。次に、類似度算出手段１２３において、Ｄｏｃｓに含まれる一文書ｄｉとＤ間の類似度を計算する（ステップ１０３）。ここでは、ｄｉとＤ間の類似度をＳｉｍ（ｄｉ、Ｄ）とする。次に、単語除去手段１２２において、出現単語から単語ｗｊを取り出し、ｄｉから単語ｗｊを除いたものと、Ｄから単語ｗｊを除いたものを求める（ステップ１０４）。ここでは、ｄｉから単語ｗｊを除いた文書をｄｉ′（ｗｊ）、Ｄから単語ｗｊを除いた文書をＤ′（ｗｊ）とする。次に、類似度算出手段１２３において、ｄｉ′（ｗｊ）、Ｄ′（ｗｊ）間の類似度を計算する（ステップ１０５）。ここでは、ｄｉ′（ｗｊ）、Ｄ′（ｗｊ）間の類似度をＳｉｍ（ｄｉ′（ｗｊ）、Ｄ′（ｗｊ））とする。次に、寄与度算出手段１２４において、ｄｉ、Ｄ間の類似度における単語ｗｊの寄与度を計算する（ステップ１０６）。ここでは、単語ｗｊの寄与度をＣｏｎｔ（ｄｉ、Ｄ、ｗｊ）とし、Ｃｏｎｔ（ｄｉ、Ｄ、ｗｊ）＝Ｓｉｍ（ｄｉ、Ｄ）−Ｓｉｍ（ｄｉ′（ｗｊ）、Ｄ′（ｗｊ））として計算する。次に、すべての単語ｗｊについて寄与度を計算したかどうかを判断し（ステップ１０７）、計算が終了していなければ、次の単語ｗｊを取り出して、ステップ１０４〜ステップ１０６の処理を繰り返す。また、計算が終了していれば、それぞれの単語ｗｊ毎に算出されたＣｏｎｔ（ｄｉ、Ｄ、ｗｊ）を、ｄｉの特徴ベクトルＶ（ｄ）として出力する（ステップ１０８）。続いて、Ｄｏｃｓに含まれるすべてのｄｉについて特徴ベクトルを出力したかどうかを判断し（ステップ１０９）、計算が終了していなければ、次のｄｉを取り出して、ステップ１０３〜ステップ１０８の処理を繰り返す。また、計算が終了していれば、そこで処理を終了し、計算したすべてのデータを記憶部１４（図２）へ格納する。
【００６０】
この後は、検索部１３（図２）において、寄与度を要素とした特徴ベクトルからなる文書ｄｉの特徴量をもとにして、検索対象文書群から参照文書集合に類似している文書集合を検索する。検索結果は出力部１５（図２）を通じてユーザに提示される。ユーザは、検索された文書集合について、さらに絞り込み検索を行う必要があると判断した場合は、検索された文書集合の中から、参照文書集合に類似している文書を抽出し、これを新たな参照文書集合としてシステムに指定する。これ以後は、新たな参照文書集合に基づいて絞り込み検索を実行する。
【００６１】
上記実施形態１の文書検索装置においては、寄与度を要素とするベクトルデータを検索対象文書の特徴量としているため、例えばＴＦ＊ＩＤＦのように単語の出現頻度などを要素とするベクトルデータを特徴量として用いる手法に比べ、不要な文書が検索される割合を少なくすることができる。また、前出の寄与度に基づく特徴量は参照文書集合に依存しているため、絞り込み検索においては、参照文書集合の変化に対応した特徴量の抽出を行うことができる。すなわち、絞り込み検索を繰り返した場合には、検索精度をさらに向上させることができる。しかも、絞り込み検索を繰り返した場合は、参照文書集合の一部が重複するので、計算を一部省くことができるため、処理時間を短縮することが可能となる。
【００６２】
［実施形態２］
この実施形態２では、実施形態１で説明した単語の寄与度をもとにして重要語を選択し、この重要語に基づいて類似する文書を検索する場合の具体例について説明する。
【００６３】
図４は、実施形態２に係わる文書検索装置２０の機能的な構成を示すブロック図である。この文書検索装置２０は、参照文書集合などを入力する入力部２１と、後述する重要語選択部２２と、この重要語選択部２２で選択された重要語をもとにして、検索対象文書群から参照文書集合に類似する文書を検索する検索部２３と、重要語選択部２２及び検索部２３で使用される文書などを一時的に格納する記憶部２４と、前記検索部２３で検索された文書を出力する出力部２５とから構成されている。
【００６４】
この実施形態２の重要語選択部２２は、参照文書集合の各文書と検索対象文書群から抽出した正解文書集合の各文書との間に出現する全ての単語についてそれぞれ寄与度を算出する寄与度算出手段２２１と、各単語毎に寄与度算出手段２２１で算出された単語寄与度の総和を求め、この総和の上位にある単語を参照文書に対する重要語として選択する重要語選択手段２２２とから構成されている。
【００６５】
寄与度算出手段２２１は、例えば図１の寄与度算出装置１２０と同じ機能ブロックにより構成することができる。この寄与度算出手段２２１では、入力した参照文書集合Ｄと正解文書集合Ｃ＝｛ｄ１、ｄ２・・・ｄｎ｝の各文書との間に出現する全ての単語ｗｊの寄与度Ｃｏｎｔ（Ｄ、ｄｉ、ｗｊ）を計算する。単語の寄与度を算出する手順は図３のステップ１０１〜ステップ１０７と同じであるため説明を省略する。
【００６６】
重要語選択手段２２２は、参照文書集合Ｄと正解文書集合Ｃの各文書に出現した各単語毎に、前記寄与度算出手段２２１で算出された寄与度の総和ＳｕｍＣｏｎｔ（Ｄ、ｗｊ）を次の式により求める。
【００６７】
【数４】

$$\mathrm{SumCont}(D, w_j) = \sum_{d_i \in C} \mathrm{Cont}(D, d_i, w_j)$$

次に、寄与度の総和ＳｕｍＣｏｎｔ（Ｄ、ｗｊ）の高い単語について、その上位Ｎ個の単語を参照文書集合Ｄに対する重要語として選択する。
【００６８】
次に、上記のように構成された重要語選択部２２において、参照文書集合Ｄに対する重要語を選択する場合の処理手順を図５のフローチャートを用いて説明する。
【００６９】
なお、図５のフローチャートでは、文書間に出現する全ての単語について寄与度を算出した後、各単語毎に寄与度の総和を求めるようにしているが、文書間に出現する各単語の寄与度を順に算出しながら、各単語毎に寄与度の総和を求めるようにしてもよい。
【００７０】
まず、寄与度算出手段２２１において、参照文書集合、正解文書集合及び重要語数を取り込む（ステップ２０１）。ここでは、参照文書集合をＤ、正解文書集合をＣ、重要語数をＮとする。次に、ＤとＣに含まれるｄｉ（ｄ１、ｄ２、・・・ｄｎ）について、出現単語とその出現頻度を求め、出現単語リストＷ（＝ｗ１、ｗ２、・・・ｗｎ）を抽出する（ステップ２０２）。次に、出現単語リストＷから単語（ｗｊ）を取り出し、ｄｉ、Ｄ間の類似度における単語ｗｊの寄与度Ｃｏｎｔ（Ｄ、ｄｉ、ｗｊ）を計算する（ステップ２０３）。このステップ２０３では、図３のステップ１０３〜ステップ１０７と同等の処理を行っている。次に、全ての単語ｗｊについて寄与度を計算したかどうかを判断し（ステップ２０４）、計算が終了していなければ、次の単語（ｗｊ＋１）を取り出してステップ２０３の処理を繰り返す。また、計算が終了していれば、重要語選択手段２２２において、ＤとＣの各文書との間に出現した単語毎に、寄与度算出手段２２１で算出された寄与度の総和ＳｕｍＣｏｎｔ（Ｄ、ｗｊ）を求める（ステップ２０５）。次に、総和ＳｕｍＣｏｎｔ（Ｄ、ｗｊ）の高い単語の上位Ｎ個を参照文書集合Ｄに対する重要語として選択する（ステップ２０６）。そして、計算した全てのデータを記憶部２４へ格納する。
【００７１】
図６は、選択された重要語をもとにして、検索対象文書群から参照文書集合に類似している文書を検索する場合の処理手順を示すフローチャートである。
【００７２】
重要語選択部２２で重要語が選択されると、検索部２３は記憶部２４に格納されている出現単語リストＷを参照し、参照文書集合Ｄと正解文書集合Ｃ中に出現する単語を抽出する（ステップ３０１）。次に、抽出した単語のうち、重要語選択部２２で選択された重要語Ｎ個に含まれる単語をもとにして、参照文書集合Ｄと検索対象文書群Ｄｏｃｓ中の各文書との類似度を算出し、類似度の高い文書集合を検索対象文書群Ｄｏｃｓの中から検索する（ステップ３０２）。その後、検索結果を出力部２５を通じてユーザに提示する（ステップ３０３）。
【００７３】
上記実施形態２においては、単語の寄与度をもとに選択した重要語を利用して類似検索を行うようにしているため、類似文書とそうでない文書との比率に極端な差がある場合でも、例えばχ二乗値に基づく重要語選択手法に比べて、不要な文書が検索される割合を少なくすることができ、かつ検索精度を向上させることができる。
【００７４】
［実施形態３］
この実施形態３では、実施形態１で説明した単語の寄与度を用いて検索式を拡張し、この拡張された検索式に基づいて類似する文書を検索する場合の具体例について説明する。
【００７５】
図７は、実施形態３に係わる文書検索装置３０の機能的な構成を示すブロック図である。この文書検索装置３０は、参照文書集合などを入力する入力部３１と、後述する検索式拡張部３２と、この検索式拡張部３２で拡張された検索式に基づいて、検索対象文書群から参照文書集合に類似する文書を検索する検索部３３と、検索式拡張部３２及び検索部３３で使用される文書などを一時的に格納する記憶部３４と、前記検索部３３で検索された文書を出力する出力部３５とから構成されている。
【００７６】
この実施形態３の検索式拡張部３２は、参照文書集合の各文書と検索対象文書群から抽出した正解文書集合の各文書との間に出現する全ての単語についてそれぞれ単語寄与度を算出する寄与度算出手段３２１と、正解文書集合の各文書毎に寄与度算出手段３２１で算出された単語寄与度が低い単語を抽出し、さらに抽出した全ての単語のうち参照文書集合に含まれていない単語について、各単語毎に単語寄与度の総和を求め、その結果に重み付けをして、元の参照文書集合の検索式に加える検索式拡張手段３２２とから構成されている。
【００７７】
寄与度算出手段３２１は、例えば図１の寄与度算出装置１２０と同じ機能ブロックにより構成することができる。この寄与度算出手段３２１では、入力した参照文書集合Ｄと正解文書集合Ｃ＝｛ｄ１、ｄ２・・・ｄｎ｝の各文書との間に出現する全ての単語ｗｊの寄与度Ｃｏｎｔ（Ｄ、ｄｉ、ｗｊ）を計算する。単語寄与度を算出する手順は図３のステップ１０１〜ステップ１０７と同じであるため説明を省略する。
【００７８】
検索式拡張手段３２２は、参照文書集合Ｄと正解文書集合Ｃに含まれる各文書ｄ１、ｄ２・・・ｄｎに出現する全ての単語から、寄与度算出手段３２１で算出された単語寄与度の低い単語とその単語寄与度を抽出する。次に、抽出された全ての単語のうち参照文書集合Ｄに含まれていない単語について、次の式により、各単語毎に単語寄与度の総和Ｓｃｏｒｅ（ｗ）を求める。
【００７９】
【数５】

そして、抽出された単語とそのスコア（Ｓｃｏｒｅ）を元の参照文書集合Ｄの検索式に加えることによって検索式を拡張する。
【００８０】
次に、上記のように構成された検索式拡張部３２において、単語寄与度を用いて検索式を拡張する場合の処理手順を図８のフローチャートを用いて説明する。
【００８１】
まず、寄与度算出手段３２１において、参照文書集合、正解文書集合、各文書から抽出される単語数及び抽出された単語に対する重みを取り込む（ステップ４０１）。ここでは、参照文書集合をＤ、正解文書集合をＣ（Ｄ）、各文書から抽出される単語寄与度の値が低い単語数をＮ、重みをｗｇｔとする。次に、ＤとＣ（Ｄ）に含まれる文書ｄｉ（ｄ１、ｄ２、・・・ｄｎ）について、出現単語とその出現頻度を求め、出現単語リストＷ（＝ｗ１、ｗ２、・・・ｗｍ）を抽出する（ステップ４０２）。次に、出現単語リストＷから単語（ｗｊ）を取り出し、Ｄ、ｄｉ間の類似度における単語ｗｊの寄与度Ｃｏｎｔ（Ｄ、ｄｉ、ｗｊ）を計算する（ステップ４０３）。このステップ４０３では、図３のステップ１０３〜ステップ１０７と同等の処理を行っている。次に、全ての単語ｗｊについて単語寄与度を計算したかどうかを判断し（ステップ４０４）、計算が終了していなければ、次の単語（ｗｊ＋１）を取り出してステップ４０３以降の処理を繰り返す。また、計算が終了していれば、検索式拡張手段３２２において、出現単語リストＷに含まれる単語のうち、単語寄与度の値が低い単語Ｎ個とその単語寄与度を抽出する（ステップ４０５）。次に、全ての文書ｄｉについて単語寄与度の低い単語を抽出したかどうかを判断し（ステップ４０６）、抽出が終了していなければ、次の文書（ｄｉ＋１）を取り出してステップ４０２以降の処理を繰り返す。また、抽出が終了していれば、抽出された全ての単語のうち参照文書集合Ｄに含まれていない単語についてＳｃｏｒｅ（ｗ）を求める（ステップ４０７）。そして、計算された全ての単語ｗとそのＳｃｏｒｅ（ｗ）を元の参照文書集合Ｄの検索式に加える（ステップ４０８）。拡張された検索式に関するデータは記憶部３４へ格納する。
【００８２】
この後、検索部３３は、記憶部３４に格納されている出現単語リストＷを参照し、参照文書集合Ｄと正解文書集合Ｃ（Ｄ）中に出現する単語を抽出する。そして、拡張された検索式を含む類似度の計算式に値を入れ、参照文書集合Ｄと検索対象文書群Ｄｏｃｓ中の各文書との類似度を算出し、類似度の高い文書集合を検索対象文書群Ｄｏｃｓの中から検索して、検索結果を出力部３５を通じてユーザに提示する。
【００８３】
上記実施形態３においては、単語の寄与度に基づいて検索式を拡張するための単語を抽出するようにしているため、文書間の類似度における単語の影響を考慮した検索式拡張を実現することができる。これによれば、例えばＲｏｃｃｈｉｏの手法に基づいた検索式拡張に比べて、類似度の計算に有効な単語を抽出することができるため、検索精度を向上させることができる。
【００８４】
【実施例】
次に、上述した実施形態１、２及び３の文書検索装置の有効性を示すため、従来技術との比較実験を行った。以下、実施形態１、２及び３に対応する比較実験の結果をそれぞれ実施例１〜３として説明する。
【００８５】
［実施例１］
最初に、実施例１として、実施形態１で説明した単語ｗｊの寄与度による特徴量と、従来のＴＦ＊ＩＤＦによる特徴量について比較実験を行った結果について説明する。
【００８６】
この実施例では、参照文書Ｄと、２５５１１個の文書からなる検索対象文書群Ｄｏｃｓとの間の類似度を計算した。その結果、Ｄｏｃｓ中の文書のうち、参照文書Ｄとの類似度が高い文書５５個（以下、Ｄｔｏｐ）を主観評価により、Ｄと類似性の高いＤｏ、類似性の無いＤｘ、及びＤｏ、Ｄｘのいずれにも属さないＤｚの３つのグループに分類した。各グループ中の文書数と、全体に占める割合を表１に示す。
【００８７】
【表１】

Ｄｔｏｐの各文書とＤ間の類似度において、寄与度の高い単語の上位５個を各文書毎に抽出し、これらのマージした単語リスト（４８単語）を生成した。各文書の特徴量は、このリスト中の各単語の寄与度を要素とした特徴ベクトルで表す。この特徴ベクトルの有効性を示すために、これを元に因子分析を行い、第１因子と第２因子を軸とした２次元平面上にプロットした。比較のため、Ｄｔｏｐの文書をＴＦ＊ＩＤＦによる特徴量を抽出し、同様に因子分析してプロットを行った。図９、図１０に、それぞれ寄与度及びＴＦ＊ＩＤＦによる特徴量の抽出に基づいた因子分析の結果を示す。図の中の「○」「△」「×」は、それぞれＤｏ、Ｄｚ、Ｄｘの文書を表している。
【００８８】
これらの結果を比較すると、図９では同じグループの文書が集中してプロットされているのに対し、図１０ではそれに比べて平面上にプロットが散らばっていることが明らかである。因子分析においては、情報量が多いほど○や△などが均等に散らばり、情報量が少ないほどかたまる傾向にある。したがって、寄与度による分析の方が、ＴＦ＊ＩＤＦによる分析よりも情報量が少ない、すなわち不要な文書が検索される割合が低いと考えられる。このことから、寄与度に基づく特徴量の方がＴＦ＊ＩＤＦに基づく特徴量よりも主観評価による文書分類に適していると考えられる。
【００８９】
次に、この結果を定量的に検証した場合について説明する。まず、図９、図１０の平面の両軸をそれぞれＮ等分し、各平面をＮ^２個の矩形領域に分割した。そして、以下の式に基づき、寄与度及びＴＦ＊ＩＤＦによる分析結果の情報量を計算した。ｐｉｊ（ｇ）は領域（ｉ、ｊ）内にｇ∈Ｇ＝｛○、△、×｝のプロットが現れる確率とする。
【００９０】
【数６】

Ｎ＝１、２、・・・、１０のときの情報量の計算結果を図１１に示す。図１１において、縦軸は情報量（ｅｎｔｒｏｐｙ）、横軸は分割数（ｄｉｖｉｓｉｏｎ）を表している。この結果より、寄与度による分析結果の方が、ＴＦ＊ＩＤＦによる分析結果よりも情報量が少ないことが明らかであり、定量的にも本発明の有効性が示された。
【００９１】
上記実施例１に示す比較結果は、一回の検索を行った場合の実験例である。このように、本発明による文書検索では、絞り込み検索を行わなくても、従来のＴＦ＊ＩＤＦに比べて不要な文書が検索される割合を少なくすることができる。そして、絞り込み検索を繰り返した場合は、検索精度をさらに向上させることができる。すなわち、絞り込み検索を行った場合は、図１１の寄与度による分析結果のグラフが、さらに下方に移行する（情報量が少なくなる）ことが予測される。
【００９２】
［実施例２］
次に、実施例２として、実施形態２で説明した単語の寄与度をもとに選択した重要語と、従来例のχ二乗値により選択した重要語による類似検索の比較実験を行った結果について説明する。
【００９３】
この実施例では、参照文書集合として、同期間にＫＤＤから出願された特許２０件を設定し、これらの特許に類似している特許の公開公報を検索業者に検索させた。この検索において、１つの参照文書に対して類似していると判断された特許の公開公報数の平均は１０．７５個であった。ここでは、前記特許２０件を参照文書集合Ｄｉ_ｎ＝｛ｉ_１、ｉ_２・・・ｉ_２０｝とし、参照文書ｉ_ｎに対する正解文書集合をＣｎとした。また、検索の対象となった全ての公開公報データのうちの１００００件を検索対象文書集合Ｄ_ｔとした。この検索対象文書集合Ｄ_ｔには、正解文書集合Ｃｎに含まれる全ての公開公報データが含まれている。
【００９４】
次に、χ二乗値による重要語選択手法と単語の寄与度を用いた重要語選択手法により、それぞれＮ個の重要語を選択した。そして、これらＮ個の単語をもとにＩｗａｙａｍａらの手法により検索対象文書集合Ｄ_ｔからの類似検索を行った。また比較例として、重要語を選択せずに文書間に出現した全ての単語に基づく類似検索も行った。
【００９５】
χ二乗値（χ^２）と単語の寄与度（Ｃｏｎｔ）について、それぞれ選択する重要語の個数Ｎを上位から１００、２００、３００、４００、５００としたときの検索結果、及び全ての単語に基づく検索結果（Ａｌｌ）の平均精度を表２に示す。
【表２】

表２の結果から明らかなように、寄与度を用いた重要語選択手法により得られた単語を使った場合は、Ｎ＝１００〜５００の全てにおいて、χ二乗値（χ^２）による検索結果及び全ての単語に基づく検索結果（Ａｌｌ）の数値を上回ることが確認された。
【００９６】
また、表２の中で最も結果の良かった重要語の個数として、単語の寄与度（Ｎ＝４００）、χ二乗値（Ｎ＝５００）を設定し、そこで選択された重要語による検索と、全ての単語に基づく検索のＲｅｃａｌｌ／Ｐｒｅｃｉｓｉｏｎ曲線をそれぞれ測定した。結果を図１２に示す。図１２において、横軸のＲｅｃａｌｌは実際に類似している文書のうち検索された文書の割合を、縦軸のＰｒｅｃｉｓｉｏｎは検索された文書のうち実際に類似している文書の割合を示している。
【００９７】
このＲｅｃａｌｌ／Ｐｒｅｃｉｓｉｏｎ曲線では、特性曲線が右上に位置するほど検索精度が高いことを示している。図１２に示すように、単語の寄与度を用いた重要語選択手法により得られた単語を使った検索結果は、χ二乗値（χ^２）による検索結果及び全ての単語に基づく検索結果（Ａｌｌ）のいずれも上回ることが確認された。
【００９８】
以上の結果から、χ二乗値による重要語選択手法では検索精度が下がったのに対し、単語の寄与度を用いた重要語選択手法では検索精度の向上が見られ、本発明の有効性が示された。
【００９９】
［実施例３］
次に、実施例３として、単語の寄与度に基づいて拡張された検索式と、従来例のＲｏｃｃｈｉｏの手法により拡張された検索式とを用いて、類似検索の比較実験を行った結果について説明する。
【０１００】
この実施例では、ＴｅｘｔＲｅｔｒｉｅｖａｌＣｏｎｆｅｒｅｎｃｅ（ＴＲＥＣ）で定められた統一的な実験条件として、ＴＲＥＣに使用されているテキストデータ５２８１５０件を検索対象文書とし、同じくＴＲＥＣで用意された５０個の参照文書とそれぞれの参照文書に対する正解文書集合をもとに評価を行った。
【０１０１】
まず、各参照文書毎に全ての検索対象文書に対して検索（初期検索）を行い、類似度の上位１０００件の文書を抽出した。ここで抽出した文書のうち、実際に各参照文書と類似している文書（正解文書集合）の上位Ｎｕｍ個から、単語寄与度に基づく検索式拡張の手法（実施形態３）により各参照文書の検索式拡張を行った。最終的な検索結果は、ここで拡張された検索式を使用し、全ての検索対象文書に対する検索を行うことで算出した。
【０１０２】
類似度の計算式としては、以下に示すような参照文書と検索対象文書を表すベクトル間の角度のコサイン値を求める式を用いた。
【０１０３】
【数７】

この式において、Ｄ→が参照文書の検索式、すなわち参照文書（集合）を表すベクトルを示している。
【０１０４】
Ｒｏｃｃｈｉｏの手法の入力となる正解文書集合には、単語寄与度に基づく検索式拡張の手法で抽出された正解文書集合をそのまま使用した。また、非類似文書集合としては、前記初期検索により抽出された１０００件の文書のうち、類似度の順位が５００位以下の文書を使用した。ただし、この下位５００件の文書集合の中に類似文書が含まれている場合は、これを非類似文書集合から取り除くものとする。また、Ｒｏｃｃｈｉｏの手法で使用される係数α、β、γは、ＴＲＥＣ−７にて発表されたＳＭＡＲＴシステムで使用した値、すなわちα＝３、β＝２、γ＝２とした。同じく、ＴＲＥＣ−７での発表同様、各参照文書に対し、Ｒｏｃｃｈｉｏの手法における重みの上位２０単語を元の検索式に加えた。
【０１０５】
表３に、Ｎｕｍ＝１０、２０及びｗｇｔ＝−１００、−４００、−８００、−１２００の場合の単語寄与度に基づく検索式拡張の手法（ＷＣ１０、２０）に基づく検索式拡張、並びにＮｕｍ＝１０、２０とした場合のＲｏｃｃｈｉｏの手法（Ｒｏｃｃｈｉｏ１０、２０）の手法による検索式拡張、並びに初期検索（Ｂａｓｅｌｉｎｅ）による平均正解率の値を示す。
【０１０６】
【表３】

表３の結果から明らかなように、単語寄与度に基づく検索式拡張の手法は、初期検索と比較して１５７％〜１９７％以上の検索精度向上が達成されていることが確認された。また、重みｗｇｔの絶対値の増大、すなわち新しく拡張された単語の影響力の増大に伴って検索精度が向上し、さらにＲｏｃｃｈｉｏの手法と比較しても高い検索精度が得られていることから、単語寄与度に基づく検索式拡張の手法による単語抽出及び重み付けが有効であることが示された。
【０１０７】
また、図１３は、Ｎｕｍ＝２０の場合における各手法のＲｅｃａｌｌ／Ｐｒｅｃｉｓｉｏｎ曲線を示す。図１３に示すグラフの縦横軸及び特性曲線の評価は図１２と同様である。
【０１０８】
図１３のＲｅｃａｌｌ／Ｐｒｅｃｉｓｉｏｎ曲線から明らかなように、単語寄与度に基づく検索式拡張の手法は、どのＲｅｃａｌｌ値においても、Ｒｏｃｃｈｉｏの手法の正解率を上回ることが確認された。
【０１０９】
以上の結果から、単語寄与度に基づく検索式拡張の手法では、いずれの条件においてもＲｏｃｃｈｉｏの手法に比べて検索精度の向上が見られ、本発明の有効性が示された。
【０１１０】
【発明の効果】
以上説明したように、請求項１乃至６の発明においては、参照文書（集合）と検索対象文書群に出現する単語が文書間の類似度に与える影響の大きさを単語寄与度として定義し、この単語寄与度を要素とするベクトルデータを検索対象文書の特徴量として用いるようにしたため、単語の出現頻度などを要素とするベクトルデータを特徴量として用いる手法に比べて、不要な文書が検索される割合を少なくすることができる。また、参照文書（集合）に依存した特徴量を用いているので、絞り込み検索を繰り返した場合には、検索精度をさらに向上させることができる。
【０１１１】
また、請求項７乃至１４の発明においては、単語寄与度をもとに重要語を選択し、選択した重要語に基づいて類似検索を行うようにしたため、類似文書とそうでない文書との比率に極端な差がある場合でも、例えばχ二乗値に基づく重要語選択手法に比べて、不要な文書が検索される割合を少なくすることができ、かつ検索精度を向上させることができる。
【０１１２】
さらに、請求項１５乃至１８の発明においては、単語の寄与度を用いて検索式を拡張し、この拡張された検索式に基づいて類似検索を行うようにしたため、類似度の計算に有効な単語を抽出することができるようになり、従来の検索式拡張手法に比べて検索精度を向上させることができる。
【図面の簡単な説明】
【図１】図２に示す寄与度算出装置の機能的な構成を示すブロック図。
【図２】実施形態１に係わる文書検索装置の機能的な構成を示すブロック図。
【図３】単語ｗｊの寄与度と文書ｄｉの特徴ベクトルを計算する場合の処理手順を示すフローチャート。
【図４】実施形態２に係わる文書検索装置の機能的な構成を示すブロック図。
【図５】参照文書集合Ｄに対する重要語を選択する場合の処理手順を示すフローチャート。
【図６】選択された重要語をもとに検索対象文書群から参照文書集合に類似している文書を検索する場合の処理手順を示すフローチャート。
【図７】実施形態３に係わる文書検索装置の機能的な構成を示すブロック図。
【図８】単語寄与度を用いて検索式を拡張する場合の処理手順を示すフローチャート。
【図９】寄与度による因子分析の結果を示す説明図。
【図１０】ＴＦ＊ＩＤＦによる因子分析の結果を示す説明図。
【図１１】寄与度及びＴＦ＊ＩＤＦによる分析結果の情報量を示す説明図。
【図１２】各手法により得た重要語及び全単語により検索した場合のＲｅｃａｌｌ／Ｐｒｅｃｉｓｉｏｎ曲線を示す説明図。
【図１３】Ｎｕｍ＝２０の場合における各手法のＲｅｃａｌｌ／Ｐｒｅｃｉｓｉｏｎ曲線を示す説明図。
【図１４】従来の絞り込み検索の処理手順を示すフローチャート。
【符号の説明】
１０、２０、３０文書検索装置
１１、２１、３１入力部
１２特徴量抽出部
１３、２３、３３検索部
１４、２４、３４記憶部
１５、２５、３５出力部
２２重要語選択部
３２検索式拡張部
１２０寄与度算出装置
１２１出現単語抽出手段
１２２単語除去手段
１２３類似度算出手段
１２４寄与度算出手段
２２１、３２１寄与度算出手段
２２２重要語選択手段
３２２検索式拡張手段

[0001]
[Technical field of the invention]
The present invention relates to a document search method for searching for a similar document from a search target document group.
[0002]
[Prior art]
Conventionally, various similarity measures for retrieving similar documents from a group of documents to be searched have been devised, and retrieval methods based on these measures have been proposed. However, whichever method is used, the proportion of unnecessary documents among the retrieved results remains high, and a further refinement search is needed to improve search accuracy.
[0003]
Here, refinement search in a general document search system is briefly described.
[0004]
FIG. 14 is a flowchart showing the processing procedure of a conventional refinement search. First, the system extracts the feature amount of each document in the search target document group (step 501), and then reads in a reference document set (step 502). The reference document set is the set of reference documents that the user specifies to indicate the kind of documents the system should retrieve; the system searches the search target document group for documents similar to the reference document set.
[0005]
In an actual document search, a plurality of reference documents are often specified. Therefore, here, all the reference documents will be described as a reference document set.
[0006]
When the reference document set is read in at step 502, its feature amount is extracted (step 503). Next, based on the extracted feature amount, a set of documents considered similar to the reference document set is retrieved from the search target document group and presented to the user (step 504). The user examines the retrieved document set and, on deciding that a further refinement search is needed, extracts from it the documents that are similar to the reference document set and designates them to the system as a new reference document set. When the user instructs a refinement search (Yes in step 505), the system returns to step 502, reads in the newly designated reference document set, and repeats the processing of steps 503 and 504.
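The loop of steps 502 to 505 can be sketched as follows. This is only an illustrative outline: `refinement_search`, `overlap_search`, and the `pick_new_references` callback (standing in for the user's choice of a new reference document set) are hypothetical names, and the word-overlap search is a placeholder rather than any similarity measure described in this document.

```python
from typing import Callable, List, Sequence


def overlap_search(corpus: Sequence[str], refs: List[str]) -> List[str]:
    """Placeholder search: return documents sharing a word with any reference."""
    ref_words = {w for r in refs for w in r.split()}
    return [d for d in corpus if ref_words & set(d.split())]


def refinement_search(
    corpus: Sequence[str],
    reference_set: List[str],
    search: Callable[[Sequence[str], List[str]], List[str]],
    pick_new_references: Callable[[List[str]], List[str]],
    max_rounds: int = 5,
) -> List[str]:
    """Repeat the search while the user supplies a new reference set."""
    results: List[str] = []
    for _ in range(max_rounds):
        results = search(corpus, reference_set)   # steps 503 and 504
        new_refs = pick_new_references(results)   # user examines the results
        if not new_refs:                          # step 505: No
            break
        reference_set = new_refs                  # step 505: Yes, back to 502
    return results
```

An empty list returned by `pick_new_references` plays the role of "No" at step 505.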
[0007]
Feature amounts used in this refinement search include, for example, a feature vector whose elements are the appearance frequencies (term frequency) of the words (phrases) in a document, and TF*IDF (Term Frequency * Inverse Document Frequency), in which a weight is applied to each element of that feature vector; these are widely used in general.
[0008]
TF*IDF is a feature amount obtained by weighting the appearance frequency of each word in a document with a weight that takes into account how often the word appears in the other search target documents. The weight $w_t$ for word $t$ is given by the following equation.
[0009]
$w_t = \log(N / f_t)$
where $N$ is the number of documents to be searched and $f_t$ is the number of documents containing the word $t$. Each element $w_{d,t}$ of the document feature vector is therefore calculated as follows:
[0010]
$w_{d,t} = f_{d,t} \cdot \log(N / f_t)$
where $f_{d,t}$ is the frequency of occurrence of word $t$ in document $d$.
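As a minimal sketch of the two equations above, the following computes the TF*IDF vector of each document from pre-tokenized text. The function name `tfidf_vectors` and the choice of the natural logarithm are assumptions made for illustration; the text does not state the base of the logarithm.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {word: w_dt} mapping per document, w_dt = f_dt * log(N / f_t)."""
    n_docs = len(docs)                      # N: number of documents to be searched
    doc_freq: Counter = Counter()           # f_t: number of documents containing t
    for tokens in docs:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in docs:
        term_freq = Counter(tokens)         # f_dt: frequency of t in document d
        vectors.append(
            {t: f * math.log(n_docs / doc_freq[t]) for t, f in term_freq.items()}
        )
    return vectors
```

Note that a word occurring in every document receives weight 0, and that the weights depend only on the search target documents, not on any reference document set.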
[0011]
Further, Iwayama et al. proposed a statistical text classification method called SVMV (Single random Variable with Multiple Values), based on such word-frequency feature amounts (hereinafter, the method of Iwayama et al.), and showed its superiority in comparison experiments with other text classification methods (Iwayama, Tokunaga: "A Probabilistic Model for Text Categorization: Based on a Single Random Variable with Multiple Values", Proc. of 4th Conference on Applied Natural Language Processing, pp. 162-167, 1994).
[0012]
On the other hand, one way of improving search accuracy is to select the words considered important for the similarity search and to search using them (hereinafter, the important word selection method).
[0013]
Until now, the similarity between documents has generally been measured by extracting quantities such as the frequency of the words appearing in each document and calculating the feature amount of the document with these as elements. However, it is unlikely that every word appearing in a document affects the similarity between documents; on the contrary, treating words that are irrelevant to the similarity as document features risks lowering search accuracy. The important word selection method therefore extracts, from the words appearing in the search input and in the set of documents to be searched, only the words needed for similarity search.
[0014]
As such an important word selection method, a method [1] of selecting important words based on the χ-square value of each word has been proposed (Schutze, Hull, Pedersen: "A Comparison of Classifiers and Document Representation for the Routing Problem", Proc. of ACM SIGIR '95, 1995).
In this method [1], the search target document group D_i is divided, with respect to a reference document q, into two sets: the set C of documents similar to q (hereinafter, the correct document set) and the set ￣C of documents that do not belong to C (the overline denotes the negation of C). The χ-square value of each word w is then calculated by the following formula.
[0015]
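(Equation 1)

$$\chi^2(w) = \frac{N \left( N_{(C+)} N_{(\bar{C}-)} - N_{(C-)} N_{(\bar{C}+)} \right)^2}{\left( N_{(C+)} + N_{(C-)} \right) \left( N_{(\bar{C}+)} + N_{(\bar{C}-)} \right) \left( N_{(C+)} + N_{(\bar{C}+)} \right) \left( N_{(C-)} + N_{(\bar{C}-)} \right)}, \qquad N = N_{(C+)} + N_{(C-)} + N_{(\bar{C}+)} + N_{(\bar{C}-)}$$
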
where $N_{(C+)}$ is the number of documents in C that contain the word w, $N_{(C-)}$ is the number of documents in C that do not contain the word w, $N_{(\bar{C}+)}$ is the number of documents in ￣C that contain the word w, and $N_{(\bar{C}-)}$ is the number of documents in ￣C that do not contain the word w.
[0016]
The N words with the highest χ-square values calculated in this way are selected as important words, feature amounts are calculated based on these words, and a similarity search is performed using, for example, the method of Iwayama et al.
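The selection of the top N words by χ-square value might be sketched as below. The formula in `chi_square` is the standard χ-square statistic for a 2×2 contingency table built from the four document counts defined above; that the cited method uses exactly this form is an assumption here, and the names `chi_square` and `top_n_words` are hypothetical.

```python
from typing import Dict, List, Tuple


def chi_square(n_c_pos: int, n_c_neg: int, n_nc_pos: int, n_nc_neg: int) -> float:
    """Standard 2x2 chi-square value from the four document counts.

    n_c_pos / n_c_neg:   documents in C that do / do not contain the word
    n_nc_pos / n_nc_neg: documents outside C that do / do not contain the word
    """
    n = n_c_pos + n_c_neg + n_nc_pos + n_nc_neg
    denom = (
        (n_c_pos + n_c_neg)
        * (n_nc_pos + n_nc_neg)
        * (n_c_pos + n_nc_pos)
        * (n_c_neg + n_nc_neg)
    )
    if denom == 0:          # degenerate table: treat the word as uninformative
        return 0.0
    return n * (n_c_pos * n_nc_neg - n_c_neg * n_nc_pos) ** 2 / denom


def top_n_words(counts: Dict[str, Tuple[int, int, int, int]], n: int) -> List[str]:
    """Return the n words with the highest chi-square values."""
    return sorted(counts, key=lambda w: chi_square(*counts[w]), reverse=True)[:n]
```

When C is tiny compared with its complement, the counts for C are small for every word, which is the imbalance the later discussion of this method's weakness refers to.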
[0017]
As an application of the above method [1], a method [2] of extracting important words based on the positive root of the χ-square value has also been proposed (Ng, Goh, Low: "Feature Selection, Perceptron Learning, and Usability Case Study for Text Categorization", Proc. of ACM SIGIR '97, 1997). This method [2] is an important word selection method proposed on the assumption that the words contained in the documents of the correct document set C are more important than the words contained in ￣C.
[0018]
Further, for similar document search systems, systems that adopt so-called search formula expansion, a technique for automatically enlarging the information in a search formula, are being actively developed. One of the most commonly adopted search formula expansion methods at present is based on Rocchio's algorithm. This method was developed for similarity search based on the vector space model. It rests on the idea of maximizing the similarity between the optimal search formula (a vector generated from the reference document) and the documents similar to it, while minimizing its similarity to dissimilar documents. According to Rocchio, such a search formula can be calculated as the difference vector between the centroid of the vectors of the similar documents and the centroid of the vectors of the dissimilar documents. The optimal search formula is therefore given by the following formula.
[0019]
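(Equation 2)

$$\vec{Q}_{opt} = \frac{1}{R} \sum_{\vec{D}_i \in \text{similar}} \vec{D}_i - \frac{1}{N} \sum_{\vec{D}_i \in \text{dissimilar}} \vec{D}_i$$
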
Here, R represents the number of similar documents, and N represents the number of dissimilar documents. As a result of this expression, the value of the element having a negative value is set to 0.
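A minimal sketch of this centroid-difference computation follows, with document vectors held as word-to-weight dictionaries. The names `centroid` and `rocchio_optimal_query` are hypothetical, and the sparse dictionary representation is a choice made for illustration only.

```python
from typing import Dict, List

Vector = Dict[str, float]


def centroid(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a list of sparse vectors (empty list gives {})."""
    if not vectors:
        return {}
    total: Vector = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def rocchio_optimal_query(similar: List[Vector], dissimilar: List[Vector]) -> Vector:
    """Centroid of similar documents minus centroid of dissimilar documents.

    Negative elements of the difference are set to 0, as the text states.
    """
    pos = centroid(similar)      # (1/R) * sum over the R similar documents
    neg = centroid(dissimilar)   # (1/N) * sum over the N dissimilar documents
    terms = set(pos) | set(neg)
    return {t: max(0.0, pos.get(t, 0.0) - neg.get(t, 0.0)) for t in terms}
```

The result contains nothing from the original search formula, which is the loss of the original's characteristics that the next paragraph addresses.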
[0020]
In the above method, the search formula is optimized by moving the search formula close to the similar document and keeping the search formula away from the dissimilar document. However, this method has a problem that the characteristics of the original search formula are lost. Therefore, a technique for optimizing a search formula while retaining the features of the original search formula has been developed. In this method, optimization is performed by adding coefficients to the original search formula and vectors of similar documents and dissimilar documents. The formula is shown below.
[0021]
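(Equation 3)

$$\vec{Q}_{new} = \alpha \vec{Q}_{org} + \beta \frac{1}{R} \sum_{\vec{D}_i \in \text{similar}} \vec{D}_i - \gamma \frac{1}{N} \sum_{\vec{D}_i \in \text{dissimilar}} \vec{D}_i$$
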
In the search system called "SMART", developed at AT&T and elsewhere, a search formula expansion scheme based on the above method is adopted. Specifically, the weight of each word is calculated from the similar documents by the above formula, the words with the highest total weight are extracted, and the extracted words are added to the original search formula; high search accuracy is obtained with this scheme.
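The coefficient-weighted variant with word extraction might look like the sketch below. It is an illustration under assumptions, not the SMART implementation: the function name `rocchio_expand`, the rule of adding only positively weighted words absent from the original search formula, and the clipping of negative elements to 0 are choices made here. The default coefficients and the limit of 20 added words follow the values reported in the experiment section of this document (α = 3, β = 2, γ = 2, top 20 words).

```python
from typing import Dict, List

Vector = Dict[str, float]


def centroid(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a list of sparse vectors (empty list gives {})."""
    if not vectors:
        return {}
    total: Vector = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def rocchio_expand(
    query: Vector,
    similar: List[Vector],
    dissimilar: List[Vector],
    alpha: float = 3.0,
    beta: float = 2.0,
    gamma: float = 2.0,
    top_k: int = 20,
) -> Vector:
    """Reweight the original query and append the top_k best new words."""
    pos, neg = centroid(similar), centroid(dissimilar)
    # Feedback weight of each word: beta * similar centroid - gamma * dissimilar centroid.
    feedback = {
        t: beta * pos.get(t, 0.0) - gamma * neg.get(t, 0.0)
        for t in set(pos) | set(neg)
    }
    # Words already in the query keep alpha times their weight plus the feedback.
    expanded = {
        t: max(0.0, alpha * w + feedback.get(t, 0.0)) for t, w in query.items()
    }
    # Candidate new words: positive feedback weight, not in the original query.
    candidates = [t for t, w in feedback.items() if t not in query and w > 0]
    for t in sorted(candidates, key=feedback.get, reverse=True)[:top_k]:
        expanded[t] = feedback[t]
    return expanded
```

Because the weights come only from the document vectors, the selection does not look at how each word affects the similarity between the reference document and an individual document.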
[0022]
[Problems to be solved by the invention]
However, the feature amounts of existing vector space models, including the above TF*IDF, represent the features of the search target document group but do not take the features of the reference document set into account. As a result, the feature amounts themselves do not change even when the reference document set changes, for example at step 502 in FIG. 14. Feature amounts therefore cannot be extracted in a way that follows changes in the reference document set, and search accuracy does not improve even if the refinement search is repeated.
[0023]
Further, the χ-square value obtained by the above methods [1] and [2] has the drawback that search accuracy falls when the numbers of documents in C and ￣C differ extremely. Since a similarity search generally retrieves a small number of similar documents (corresponding to C above) from a large search target document group, important word selection based on the χ-square value is not effective for similarity search.
[0024]
Further, the improvement in search accuracy obtained with the Rocchio method has been confirmed in many papers. However, this method considers only the influence of a word within the search target documents and does not consider the influence of the word on the similarity between the reference document and each search target document, so the words extracted for search formula expansion are not necessarily effective ones.
[0025]
A first object of the present invention is to provide a document search method and a document search device that, in a similarity search using word feature amounts, can reduce the proportion of unnecessary documents retrieved and can improve search accuracy when the refinement search is repeated.
[0026]
A second object of the present invention is to provide a document search method and a document search device that can reduce the rate at which unnecessary documents are retrieved and improve search accuracy in a similarity search using important words.
[0027]
A third object of the present invention is to provide a document search method and a document search device that can improve search accuracy in a similarity search using a search formula expansion method as compared with the conventional method.
[0028]
[Means for Solving the Problems]
In order to achieve the first object, the invention according to claim 1 is a document search method for searching a search target document group for documents similar to a designated reference document by using feature amounts of the search target documents constituting the search target document group, characterized in that the magnitude of the influence that a word appearing in the reference document and the search target document group has on the similarity between documents is quantified as a word contribution, and vector data having the word contributions as elements is used as the feature amount of each search target document.
[0029]
According to a second aspect of the present invention, in the first aspect, the reference document is a document retrieved by the document retrieval method of the first aspect.
[0030]
According to a third aspect of the present invention, in the first or second aspect, the reference document is a reference document set.
[0031]
In order to achieve the first object, the invention according to claim 4 is a document search apparatus for searching a search target document group for documents similar to a designated reference document by using feature amounts of the search target documents constituting the search target document group, the apparatus comprising: appearing word extracting means for extracting words appearing in the reference document and the search target document group; word removing means for generating documents in which a word extracted by the appearing word extracting means is removed from the reference document and from the search target document group; similarity calculating means for calculating the similarity between the reference document and each document included in the search target document group, and the similarity between the reference document and each document of the search target document group as generated by the word removing means; and contribution calculating means for quantifying, on the basis of the two similarities calculated by the similarity calculating means, the magnitude of the influence that a word appearing in the reference document and the search target document group has on the similarity between documents as a word contribution, and for generating vector data having the word contributions as elements, wherein the vector data having the word contributions as elements is used as the feature amount of the search target document.
[0032]
According to a fifth aspect of the present invention, in the fourth aspect, the reference document is a document retrieved by the document retrieval device of the fourth aspect.
[0033]
The invention according to claim 6 is characterized in that, in claim 4 or 5, the reference document is a reference document set.
[0034]
In order to achieve the second object, the invention according to claim 7 is a document search method for searching a search target document group for documents similar to a designated reference document by using important words selected from the words appearing in the reference document and the search target document group, characterized in that the magnitude of the influence that a word appearing in the reference document and in a correct answer document set extracted from the search target document group has on the similarity between documents is quantified as a word contribution, and words having a high word contribution are selected as important words from all the words appearing in the reference document and the correct answer document set.
[0035]
The invention according to claim 8 is characterized in that, in claim 7, after the word contributions are calculated for all the words appearing between the reference document and each document of the correct answer document set, the sum of the word contributions is obtained for each word, and the words ranked highest by that sum are set as important words for the reference document.
[0036]
The invention of claim 9 is characterized in that, in claim 7 or 8, the reference document is a reference document set.
[0037]
A tenth aspect of the present invention is characterized in that, in the seventh, eighth or ninth aspect, the correct answer document set is a set of documents similar to the reference document.
[0038]
In order to achieve the second object, the invention according to claim 11 is a document search apparatus for searching a search target document group for documents similar to a designated reference document by using important words selected from the words appearing in the reference document and the search target document group, the apparatus comprising: appearing word extracting means for extracting words appearing in the reference document and in a correct answer document set extracted from the search target document group; word removing means for generating documents in which a word extracted by the appearing word extracting means is removed from the reference document and from the correct answer document set; similarity calculating means for calculating the similarity between the reference document and each document included in the correct answer document set, and the similarity between the reference document and each document of the correct answer document set as generated by the word removing means; contribution calculating means for quantifying, on the basis of the two similarities calculated by the similarity calculating means, the magnitude of the influence that a word appearing in the reference document and the correct answer document set has on the similarity between documents as a word contribution; and important word selecting means for selecting, from all the words appearing between the reference document and each document of the correct answer document set, words having a high word contribution as quantified by the contribution calculating means as important words, wherein documents similar to the reference document are retrieved from the search target document group on the basis of the selected important words.
[0039]
In a twelfth aspect of the present invention, in the eleventh aspect, the contribution calculating means calculates a word contribution for each word appearing between the reference document and each document of the correct answer document set. The word selecting means obtains the sum of the word contributions calculated by the contribution calculating means for each word, and sets a word higher in the sum as an important word for the reference document.
[0040]
According to a thirteenth aspect, in the eleventh or twelfth aspect, the reference document is a reference document set.
[0041]
According to a fourteenth aspect, in the eleventh, twelfth, or thirteenth aspect, the correct document set is a set of documents similar to the reference document.
[0042]
In order to achieve the third object, the invention according to claim 15 is a document search method for searching a search target document group for documents similar to a designated reference document by using a similarity calculation formula that includes, as a search formula, vectors representing the reference document and each document of the search target document group, characterized in that the magnitude of the influence that a word appearing in the reference document and in a correct answer document set extracted from the search target document group has on the similarity between documents is quantified as a word contribution; words having a low word contribution are extracted from each document included in the correct answer document set; for those extracted words that are not included in the reference document, the sum of the word contributions is obtained for each word; and the result is weighted and added to the search formula of the reference document.
[0043]
The invention of claim 16 is characterized in that, in claim 15, the reference document is a reference document set.
[0044]
The invention according to claim 17 is a document search apparatus for searching a search target document group for documents similar to a designated reference document by using a similarity calculation formula that includes, as a search formula, vectors representing the reference document and each document of the search target document group, the apparatus comprising: appearing word extracting means for extracting words appearing in the reference document and in a correct answer document set extracted from the search target document group; word removing means for generating documents in which a word extracted by the appearing word extracting means is removed from the reference document and from the correct answer document set; similarity calculating means for calculating the similarity between the reference document and each document included in the correct answer document set, and the similarity between the reference document and each document of the correct answer document set as generated by the word removing means; contribution calculating means for quantifying, on the basis of the two similarities calculated by the similarity calculating means, the magnitude of the influence that a word appearing in the reference document and the correct answer document set has on the similarity between documents as a word contribution; and search formula expanding means for extracting, from all the words appearing in the reference document and in each document included in the correct answer document set, words having a low word contribution as quantified by the contribution calculating means, obtaining, for those extracted words that are not included in the reference document, the sum of the word contributions for each word, weighting the result, and adding it to the search formula of the reference document.
[0045]
The invention of claim 18 is characterized in that, in claim 17, the reference document is a reference document set.
[0046]
[Best Mode for Carrying Out the Invention]
Hereinafter, specific examples of the document search method and the document search device according to the present invention will be described as Embodiments 1, 2, and 3.
[0047]
In the first, second, and third embodiments, all reference documents are described as a reference document set. However, the document search method and the document search device according to the present invention, including the document search devices of the first, second, and third embodiments described below, provide the same operation and effect when a single reference document is designated.
[0048]
[Embodiment 1]
In the first embodiment, a specific example will be described in which the degree of contribution of a word in the degree of similarity between documents is calculated, and a similar document is searched for using a feature amount based on the degree of contribution.
[0049]
FIG. 2 is a block diagram illustrating a functional configuration of the document search device 10 according to the first embodiment. The document search device 10 includes an input unit 11 for inputting a reference document set and the like, a feature amount extraction unit 12 described later, a search unit 13 that searches the search target document group for documents similar to the reference document set on the basis of the contribution extracted as a feature amount by the feature amount extraction unit 12, a storage unit 14 for temporarily storing documents and the like used by the feature amount extraction unit 12 and the search unit 13, and an output unit 15 for outputting the retrieved documents.
[0050]
The feature amount extraction unit 12 according to the first embodiment newly defines the concept of the contribution of a word to the similarity between documents (word contribution) as a feature amount that depends on the reference document set, and calculates vector data (hereinafter, feature vector) having the word contributions as elements. The word contribution referred to here quantifies the influence that a word appearing in a document has on the similarity when the similarity between a document in the search target document group and the reference document set is calculated; the larger the value, the greater the influence of the word on the similarity. That is, the word contribution is defined as the change from the similarity calculated with a word that appears in a document of the search target document group and in the reference document set left in place, to the similarity calculated with that word removed from both. In the first embodiment, it is assumed that a reference document set consisting of a plurality of documents is given for a search target document group that likewise consists of a plurality of documents.
[0051]
FIG. 1 is a block diagram showing a functional configuration of a contribution calculation device 120 configured as a specific example of the feature amount extraction unit 12. The contribution calculation device 120 includes an appearance word extraction unit 121, a word removal unit 122, a similarity calculation unit 123, and a contribution calculation unit 124.
[0052]
The appearing word extracting unit 121 obtains the words (appearing words) that appear in the input reference document set D and in the documents di (d1, d2, ..., dn) included in the search target document group Docs, together with their appearance frequencies. The words to be extracted can be restricted, for example, to nouns only. The extracted words wj (w1, w2, ..., wn) are output as an appearing word list W = (w1, w2, ..., wn). The appearance frequency of each word is used when the similarity calculating means 123 calculates the similarity.
[0053]
The word removing unit 122 generates a document in which the words extracted by the appearing word extracting unit 121 are removed from the reference document set and the search target document group. Here, a document obtained by removing wj from di is denoted by di '(wj), and a document obtained by removing wj from D is denoted by D' (wj).
[0054]
The similarity calculating unit 123 calculates the following two similarities between the reference document set and each document included in the search target document group. One is the similarity Sim (di, D) between the documents di and D, and the other is the similarity Sim (di '(wj), D' (wj)) between the documents di '(wj) and D' (wj). Any known calculation method can be used for these similarities. In the first embodiment, the method of Iwayama et al. described above is used.
[0055]
The contribution calculating means 124 calculates Sim (di, D) − Sim (di '(wj), D' (wj)) as the contribution Cont (di, D, wj) of the word wj to the similarity between di and D. The contribution Cont (di, D, wj) is obtained for all the appearing words in di and D.
[0056]
According to this definition, when the word wj appearing in the documents di and D has a large influence on the similarity, the value of Sim (di '(wj), D' (wj)) becomes small and the contribution Cont (di, D, wj) of the word wj becomes large. Conversely, when the word wj has a small influence on the similarity, the value of Sim (di '(wj), D' (wj)) becomes large and the contribution Cont (di, D, wj) of the word wj becomes small. That is, the contribution Cont (di, D, wj) of the word wj is a numerical value that directly reflects how strongly the word wj influences the similarity.
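The definition above can be illustrated with a minimal Python sketch. It is not the implementation of this embodiment: cosine similarity over term-frequency vectors is substituted for the method of Iwayama et al., and the names `cosine`, `without`, and `contribution` as well as the sample documents are hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two term-frequency vectors (assumed stand-in for Sim)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def without(tf: Counter, word: str) -> Counter:
    """Return a copy of the term-frequency vector with `word` removed."""
    return Counter({w: f for w, f in tf.items() if w != word})


def contribution(d_i: Counter, D: Counter, w_j: str, sim=cosine) -> float:
    """Cont(di, D, wj) = Sim(di, D) - Sim(di'(wj), D'(wj))."""
    return sim(d_i, D) - sim(without(d_i, w_j), without(D, w_j))


# Hypothetical documents, already tokenised into words.
d_i = Counter("search index query search".split())
D = Counter("search query ranking".split())
for w in sorted(set(d_i) | set(D)):
    print(w, round(contribution(d_i, D, w), 3))
```

Under this stand-in similarity, a word shared by both documents yields a positive contribution, while a word that appears in only one of them yields a negative one, because removing it makes the two documents more alike.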
[0057]
Since the contribution Cont (di, D, wj) of the word wj changes as the reference document set changes, the feature amount of the document di, which consists of a feature vector having the contributions as elements, is a feature amount that depends on the reference document set. Therefore, in a refined search in which the reference document set changes as the search progresses, a feature amount adapted to the change in the reference document set can be extracted, and the search accuracy can be further improved by repeating the refined search.
[0058]
Next, a processing procedure when the contribution degree calculating device 120 configured as described above calculates the contribution degree of the word wj and the feature vector of the document di will be described with reference to the flowchart of FIG.
[0059]
First, the appearing word extracting means 121 takes in a search target document group and a reference document set (step 101). Here, the search target document group is denoted by Docs and the reference document set by D. Next, the appearing word extracting means 121 obtains the appearing words and their appearance frequencies for D and for the documents di (d1, d2, ..., dn) included in Docs, and extracts the appearing word list W (= w1, w2, ..., wn) (step 102). Next, the similarity calculating means 123 calculates the similarity between D and one document di included in Docs (step 103). This similarity is denoted by Sim (di, D). Next, the word removing means 122 takes a word wj from the appearing word list and generates a document in which wj is removed from di and a document in which wj is removed from D (step 104). Here, the document obtained by removing the word wj from di is denoted by di '(wj), and the document obtained by removing the word wj from D is denoted by D' (wj). Next, the similarity calculating means 123 calculates the similarity between di '(wj) and D' (wj) (step 105). This similarity is denoted by Sim (di '(wj), D' (wj)). Next, the contribution calculating means 124 calculates the contribution of the word wj to the similarity between di and D (step 106). The contribution of the word wj is denoted by Cont (di, D, wj) and is calculated as Cont (di, D, wj) = Sim (di, D) − Sim (di '(wj), D' (wj)). Next, it is determined whether the contribution has been calculated for all the words wj (step 107). If the calculation has not been completed, the next word wj is taken and the processing of steps 104 to 106 is repeated. If the calculation has been completed, the values Cont (di, D, wj) calculated for the individual words wj are output as the feature vector V (d) of di (step 108). Subsequently, it is determined whether feature vectors have been output for all the documents di included in Docs (step 109). If not, the next di is taken and the processing of steps 103 to 108 is repeated. If so, the processing ends, and all the calculated data is stored in the storage unit 14 (FIG. 2).
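Steps 101 to 109 can be sketched in Python as follows. This is only an illustration under stated assumptions: cosine similarity replaces the method of Iwayama et al., the reference document set D is treated as one merged term-frequency vector, tokenisation is a plain whitespace split, and the function names (`cosine`, `drop`, `feature_vectors`) are hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two term-frequency vectors (assumed stand-in for Sim)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def drop(tf, word):
    """Document with `word` removed, i.e. di'(wj) or D'(wj)."""
    return Counter({w: f for w, f in tf.items() if w != word})


def feature_vectors(docs, reference):
    """docs: {name: text}; reference: list of texts. Returns (W, {name: V(d)})."""
    # Steps 101-102: take in Docs and D, then build the appearing word list W.
    D = Counter(tok for text in reference for tok in text.split())
    tfs = {name: Counter(text.split()) for name, text in docs.items()}
    W = sorted(set(D).union(*tfs.values()))
    vectors = {}
    for name, d_i in tfs.items():                             # loop closed by step 109
        base = cosine(d_i, D)                                 # step 103
        vec = []
        for w_j in W:                                         # loop closed by step 107
            reduced = cosine(drop(d_i, w_j), drop(D, w_j))    # steps 104-105
            vec.append(base - reduced)                        # step 106
        vectors[name] = vec                                   # step 108
    return W, vectors


W, V = feature_vectors(
    {"a": "search query index", "b": "cooking recipe"},
    ["search query ranking"],
)
```

Each call performs one similarity computation per (document, word) pair, so a practical system would likely cache or update the similarities incrementally; that optimisation is outside this sketch.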
[0060]
Thereafter, the search unit 13 (FIG. 2) searches the search target document group for a document set similar to the reference document set on the basis of the feature amount of each document di, which consists of a feature vector having the contributions as elements. The search result is presented to the user through the output unit 15 (FIG. 2). If the user determines that a further refined search is necessary for the retrieved document set, the user extracts documents similar to the reference document set from the retrieved document set and designates them to the system as a new reference document set. The refined search is then executed on the basis of the new reference document set.
[0061]
In the document search apparatus according to the first embodiment, vector data having the contributions as elements is used as the feature amount of each search target document, so the rate at which unnecessary documents are retrieved can be reduced compared with methods that use, for example, feature amounts whose elements are word appearance frequencies, such as TF * IDF. In addition, since the feature amount based on the contribution depends on the reference document set, a feature amount corresponding to the change in the reference document set can be extracted in a refined search. That is, the search accuracy can be further improved when the refined search is repeated. Moreover, when the refined search is repeated, part of the reference document set overlaps with the previous one, so part of the calculation can be omitted and the processing time can be reduced.
[0062]
[Embodiment 2]
In the second embodiment, a specific example will be described in which an important word is selected based on the degree of contribution of the word described in the first embodiment, and a similar document is searched based on the important word.
[0063]
FIG. 4 is a block diagram illustrating a functional configuration of the document search device 20 according to the second embodiment. The document search device 20 includes an input unit 21 for inputting a reference document set and the like, an important word selecting unit 22 described later, a search unit 23 that searches the search target document group for documents similar to the reference document set on the basis of the important words selected by the important word selecting unit 22, a storage unit 24 for temporarily storing documents and the like used by the important word selecting unit 22 and the search unit 23, and an output unit 25 for outputting the documents retrieved by the search unit 23.
[0064]
The important word selecting unit 22 according to the second embodiment comprises contribution calculating means 221 for calculating the contribution of each word appearing between each document of the reference document set and each document of the correct answer document set extracted from the search target document group, and important word selecting means 222 for obtaining, for each word, the sum of the word contributions calculated by the contribution calculating means 221 and selecting the words ranked highest by that sum as important words for the reference document.
[0065]
The contribution calculating means 221 can be constituted by, for example, the same functional blocks as the contribution calculating device 120 of FIG. 1. The contribution calculating means 221 calculates the contributions Cont (D, di, wj) of all the words wj appearing between the input reference document set D and each document of the correct answer document set C = {d1, d2, ..., dn}. The procedure for calculating the contribution of a word is the same as steps 101 to 107 in FIG. 3.
[0066]
The important word selecting means 222 obtains, for each word appearing in the documents of the reference document set D and the correct answer document set C, the sum SumCont (D, wj) of the contributions calculated by the contribution calculating means 221 by the following formula.
[0067]
(Equation 4)

$$\mathrm{SumCont}(D, w_j) = \sum_{d_i \in C} \mathrm{Cont}(D, d_i, w_j)$$

Next, the top N words in descending order of the sum SumCont (D, wj) are selected as important words for the reference document set D.
[0068]
Next, a processing procedure when the important word selection unit 22 configured as described above selects an important word for the reference document set D will be described with reference to the flowchart of FIG.
[0069]
In the flowchart of FIG. 5, the sum of the contributions is obtained for each word after the contributions have been calculated for all the words appearing between the documents. Alternatively, the contribution may be calculated word by word in order, with the sum of the contributions obtained for each word as it is processed.
[0070]
First, the contribution calculating means 221 takes in a reference document set, a correct answer document set, and the number of important words (step 201). Here, the reference document set is denoted by D, the correct answer document set by C, and the number of important words by N. Next, for D and the documents di (d1, d2, ..., dn) included in C, the appearing words and their appearance frequencies are obtained, and an appearing word list W (= w1, w2, ..., wn) is extracted (step 202). Next, a word wj is taken from the appearing word list W, and the contribution Cont (D, di, wj) of the word wj to the similarity between di and D is calculated (step 203). In step 203, processing equivalent to steps 103 to 107 in FIG. 3 is performed. Next, it is determined whether the contribution has been calculated for all the words wj (step 204). If the calculation has not been completed, the next word (wj + 1) is taken and the processing of step 203 is repeated. If the calculation has been completed, the important word selecting means 222 obtains the sum SumCont (D, wj) for each word appearing between the documents of D and C (step 205). Next, the top N words having a high sum SumCont (D, wj) are selected as important words for the reference document set D (step 206). All the calculated data is then stored in the storage unit 24.
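The selection procedure of steps 201 to 206 can be sketched as follows. This is a hedged illustration, not the embodiment itself: cosine similarity again stands in for the similarity measure, D is given as a single merged term-frequency vector, ties are broken alphabetically, and `select_important_words` and its helpers are hypothetical names.

```python
from collections import Counter, defaultdict
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two term-frequency vectors (assumed stand-in for Sim)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def drop(tf, word):
    """Term-frequency vector with `word` removed."""
    return Counter({w: f for w, f in tf.items() if w != word})


def select_important_words(D, C, n, sim=cosine):
    """Return the N words with the highest SumCont(D, wj) over the documents in C."""
    W = set(D).union(*C)                      # step 202: appearing word list
    sum_cont = defaultdict(float)
    for d_i in C:
        base = sim(D, d_i)
        for w_j in W:                         # steps 203-204: Cont(D, di, wj)
            sum_cont[w_j] += base - sim(drop(D, w_j), drop(d_i, w_j))  # step 205
    ranked = sorted(W, key=lambda w: (-sum_cont[w], w))
    return ranked[:n]                         # step 206


# Hypothetical reference document set and correct answer document set.
D = Counter("search query ranking".split())
C = [Counter("search query index".split()), Counter("search engine ranking".split())]
top = select_important_words(D, C, 3)
```

Here the running sum is accumulated while the contributions are computed, which corresponds to the alternative ordering that the description allows.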
[0071]
FIG. 6 is a flowchart showing a processing procedure when searching for a document similar to the reference document set from the search target document group based on the selected important word.
[0072]
When the important words have been selected by the important word selecting unit 22, the search unit 23 refers to the appearing word list W stored in the storage unit 24 and extracts the words that appear in the reference document set D and the correct answer document set C (step 301). Next, on the basis of those extracted words that are included in the N important words selected by the important word selecting unit 22, the similarity between the reference document set D and each document in the search target document group Docs is calculated, and a document set having a high similarity is retrieved from the search target document group Docs (step 302). Thereafter, the search result is presented to the user through the output unit 25 (step 303).
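One way to read steps 301 to 303 is that the similarity is computed only over the important words. The sketch below follows that reading as an assumption; cosine similarity stands in for the similarity measure, and `search_with_important_words`, `project`, and the sample documents are hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two term-frequency vectors (assumed stand-in for Sim)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search_with_important_words(D, docs, important, top_k):
    """Rank the documents in `docs` against D using only the important words."""
    keep = set(important)

    def project(tf):
        # Keep only the components that correspond to important words.
        return Counter({w: f for w, f in tf.items() if w in keep})

    D_p = project(D)
    scored = [(cosine(D_p, project(tf)), name) for name, tf in docs.items()]  # step 302
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:top_k]]      # handed to the output in step 303


D = Counter("search query ranking".split())
docs = {
    "a": Counter("search engine ranking".split()),
    "b": Counter("cooking recipe query".split()),
    "c": Counter("gardening tips".split()),
}
hits = search_with_important_words(D, docs, ["search", "ranking"], top_k=2)
```

Because "query" is not among the important words in this example, document "b" no longer matches the reference document set at all, which illustrates how the restriction can suppress marginal matches.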
[0073]
In the second embodiment, the similarity search is performed using important words selected on the basis of the contribution of each word. Therefore, even if there is an extreme difference between the number of similar documents and the number of other documents, the rate at which unnecessary documents are retrieved can be reduced and the search accuracy can be improved compared with, for example, an important word selection method based on the chi-square value.
[0074]
[Embodiment 3]
In the third embodiment, a specific example will be described in which a search formula is extended using the degree of contribution of the word described in the first embodiment, and a similar document is searched based on the expanded search formula.
[0075]
FIG. 7 is a block diagram illustrating a functional configuration of the document search device 30 according to the third embodiment. The document search device 30 includes an input unit 31 for inputting a reference document set and the like, a search formula expansion unit 32 described later, a search unit 33 that searches the search target document group for documents similar to the reference document set on the basis of the search formula expanded by the search formula expansion unit 32, a storage unit 34 for temporarily storing documents and the like used by the search formula expansion unit 32 and the search unit 33, and an output unit 35 for outputting the documents retrieved by the search unit 33.
[0076]
The search formula expansion unit 32 according to the third embodiment comprises contribution calculating means 321 for calculating the word contribution of each word appearing between each document of the reference document set and each document of the correct answer document set extracted from the search target document group, and search formula expanding means 322 that extracts, for each document of the correct answer document set, words having a low word contribution as calculated by the contribution calculating means 321, obtains, for those extracted words that are not included in the reference document set, the sum of the word contributions for each word, weights the result, and adds it to the search formula of the original reference document set.
[0077]
The contribution calculating means 321 can be constituted by, for example, the same functional blocks as the contribution calculating device 120 of FIG. 1. The contribution calculating means 321 calculates the contributions Cont (D, di, wj) of all the words wj appearing between the input reference document set D and each document of the correct answer document set C = {d1, d2, ..., dn}. The procedure for calculating the word contribution is the same as steps 101 to 107 in FIG. 3.
[0078]
The search formula expanding means 322 extracts, from all the words appearing in the reference document set D and in each of the documents d1, d2, ..., dn included in the correct answer document set C, the words having a low word contribution as calculated by the contribution calculating means 321, together with their word contributions. Next, for those extracted words that are not included in the reference document set D, the sum Score (w) of the word contributions is calculated for each word by the following formula.
[0079]
(Equation 5)

Then, the search formula is expanded by adding each extracted word and its score (Score) to the search formula of the original reference document set D.
[0080]
Next, a processing procedure in the case where the search formula expansion unit 32 configured as described above expands the search formula using the word contribution will be described with reference to the flowchart of FIG.
[0081]
First, the contribution calculating means 321 takes in a reference document set, a correct answer document set, the number of words to be extracted from each document, and the weight for the extracted words (step 401). Here, the reference document set is denoted by D, the correct answer document set by C (D), the number of low-contribution words extracted from each document by N, and the weight by wgt. Next, for D and the documents di (d1, d2, ..., dn) included in C (D), the appearing words and their appearance frequencies are obtained, and the appearing word list W (= w1, w2, ..., wn) is extracted (step 402). Next, a word wj is taken from the appearing word list W, and the contribution Cont (D, di, wj) of the word wj to the similarity between D and di is calculated (step 403). In step 403, processing equivalent to steps 103 to 107 in FIG. 3 is performed. Next, it is determined whether the word contribution has been calculated for all the words wj (step 404). If the calculation has not been completed, the next word (wj + 1) is taken and the processing from step 403 onward is repeated. If the calculation has been completed, the search formula expanding means 322 extracts, from the words included in the appearing word list W, the N words with the lowest word contributions together with those word contributions (step 405). Next, it is determined whether low-contribution words have been extracted from all the documents di (step 406). If the extraction has not been completed, the next document (di + 1) is taken and the processing from step 402 onward is repeated. If the extraction has been completed, Score (w) is obtained for those extracted words that are not included in the reference document set D (step 407). Then, all such words w and their Score (w) are added to the search formula of the original reference document set D (step 408). Data related to the expanded search formula is stored in the storage unit 34.
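Steps 401 to 408 can be sketched as follows. Several points are assumptions rather than statements of this embodiment: cosine similarity stands in for the similarity measure, the search formula is modelled as a dictionary of word weights initialised from the term frequencies of D, Score(w) is taken to be wgt times the sum of the contributions over the documents from which w was extracted, and the sign and size of wgt are left to the caller. The name `expand_query` is hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two term-frequency vectors (assumed stand-in for Sim)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def drop(tf, word):
    """Term-frequency vector with `word` removed."""
    return Counter({w: f for w, f in tf.items() if w != word})


def expand_query(D, C, n, wgt, sim=cosine):
    """Return the search formula of D extended with weighted low-contribution words."""
    totals = defaultdict(float)
    for d_i in C:                                         # loop closed by step 406
        base = sim(D, d_i)
        conts = {w: base - sim(drop(D, w), drop(d_i, w))  # steps 402-404
                 for w in set(D) | set(d_i)}
        lowest = sorted(conts, key=lambda w: (conts[w], w))[:n]   # step 405
        for w in lowest:
            if w not in D:                                # step 407: only words absent from D
                totals[w] += conts[w]
    query = dict(D)                                       # original search formula (assumed form)
    for w, total in totals.items():
        query[w] = wgt * total                            # step 408: add word and its Score(w)
    return query


D = Counter("search query ranking".split())
C = [Counter("search query index".split()), Counter("search engine ranking".split())]
q = expand_query(D, C, n=2, wgt=-1.0)
```

With cosine similarity, the lowest contributions are negative and belong to words that occur in only one of the two documents, so in this example a negative wgt gives the added words "index" and "engine" positive weights in the expanded formula.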
[0082]
Thereafter, the search unit 33 refers to the appearing word list W stored in the storage unit 34 and extracts the words that appear in the reference document set D and the correct answer document set C (D). Values are then substituted into the similarity calculation formula that includes the expanded search formula, the similarity between the reference document set D and each document in the search target document group Docs is calculated, a document set having a high similarity is retrieved from the search target document group Docs, and the search result is presented to the user through the output unit 35.
[0083]
In the third embodiment, the words used to expand the search formula are extracted based on the word contribution, so the search formula can be expanded in a way that takes into account the influence of each word on the similarity between documents. Compared with a search formula expansion based on, for example, the Rocchio method, words that are effective in calculating the similarity can be extracted, and the search accuracy can therefore be improved.
[0084]
【Example】
Next, in order to show the effectiveness of the document search apparatuses according to the first, second, and third embodiments, comparative experiments with the related art were performed. Hereinafter, the results of the comparative experiments corresponding to Embodiments 1, 2, and 3 will be described as Examples 1 to 3, respectively.
[0085]
[Example 1]
First, as Example 1, a description will be given of the result of a comparative experiment performed on the feature amount based on the contribution of the word wj described in the first embodiment and the feature amount based on the conventional TF * IDF.
[0086]
In this example, the similarity between the reference document D and each document in the search target document group Docs, which contains 25511 documents, was calculated. The 55 documents in Docs with high similarity to the reference document D (hereinafter, Dtop) were then evaluated subjectively and classified into three groups: Do, with high similarity to D; Dx, with no similarity to D; and Dz, belonging to neither Do nor Dx. Table 1 shows the number of documents in each group and its percentage of the total.
[0087]
[Table 1]

For the similarity between each document in Dtop and D, the five words with the highest contribution were extracted for each document, and a merged word list (48 words) was generated. The feature amount of each document is represented by a feature vector whose elements are the contributions of the words in this list. To show the effectiveness of this feature vector, factor analysis was performed on the feature vectors, and the documents were plotted on a two-dimensional plane with the first and second factors as axes. For comparison, the feature amounts of the Dtop documents were also extracted by TF * IDF, factor-analyzed, and plotted. FIGS. 9 and 10 show the results of the factor analysis based on the feature amounts extracted by the contribution and by TF * IDF, respectively. "○", "△", and "×" in the figures represent Do, Dz, and Dx, respectively.
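The construction of the merged word list and the per-document feature vectors can be sketched as follows. This is an illustrative sketch only. It assumes the word contributions have already been computed and are supplied as one dictionary per document, and that a word missing from a document is given the value 0.0. The function name `build_feature_vectors` is hypothetical.

```python
def build_feature_vectors(contribs_by_doc, top_k=5):
    """Merge the top_k highest-contribution words of each document into one list
    and express every document as a vector of contributions over that list."""
    merged = []
    seen = set()
    for conts in contribs_by_doc.values():
        top = sorted(conts, key=conts.get, reverse=True)[:top_k]
        for word in top:
            if word not in seen:
                seen.add(word)
                merged.append(word)
    vectors = {
        doc_id: [conts.get(word, 0.0) for word in merged]  # 0.0 for absent words (assumption)
        for doc_id, conts in contribs_by_doc.items()
    }
    return merged, vectors

contribs = {
    "d1": {"a": 0.3, "b": 0.1, "c": 0.05},
    "d2": {"b": 0.4, "d": 0.2},
}
merged, vectors = build_feature_vectors(contribs, top_k=2)
```

The resulting vectors could then be passed to any factor analysis routine. The choice of routine is outside the scope of this sketch.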
[0088]
Comparing these results, it is clear that documents of the same group are plotted close together in FIG. 9, whereas the plots are more scattered over the plane in FIG. 10. In factor analysis, the plotted marks tend to be scattered more evenly as the amount of information increases and to be clustered more tightly as the amount of information decreases. Therefore, the analysis based on the contribution is considered to have a smaller amount of information than the analysis based on TF * IDF, that is, a lower rate of retrieving unnecessary documents. From this, the feature amount based on the contribution is considered more suitable for document classification based on subjective evaluation than the feature amount based on TF * IDF.
[0089]
Next, a quantitative verification of this result will be described. First, both axes of the planes in FIG. 9 and FIG. 10 were each divided into N parts, so that each plane was divided into N² rectangular areas. Then, the information amounts of the analysis results by the contribution and by TF * IDF were calculated based on the following equation, where pij(g) is the probability that a plot of g ∈ G = {○, △, ×} appears in the area (i, j).
[0090]
(Equation 6)

FIG. 11 shows the calculation result of the information amount when N = 1, 2,..., 10. In FIG. 11, the vertical axis represents the information amount (entropy), and the horizontal axis represents the number of divisions (division). From these results, it is clear that the analysis result based on the degree of contribution has a smaller amount of information than the analysis result based on TF * IDF, and the effectiveness of the present invention is shown quantitatively.
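The grid-based information amount can be illustrated with a short Python sketch. Equation 6 itself is not reproduced in this text, so the exact formula is an assumption: the sketch uses a conditional Shannon entropy, in which each cell's label entropy is weighted by the fraction of plots falling in that cell. The function name `grid_entropy` and the sample points are hypothetical.

```python
import math
from collections import Counter, defaultdict

def grid_entropy(points, labels, n):
    """Split the bounding box of the points into an n x n grid and return the
    entropy (in bits) of the group labels given the cell. Assumed formula."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]

    def cell(value, low, high):
        if high == low:
            return 0
        # Clamp so that the maximum value falls into the last cell.
        return min(int((value - low) / (high - low) * n), n - 1)

    cells = defaultdict(Counter)
    for (x, y), g in zip(points, labels):
        key = (cell(x, min(xs), max(xs)), cell(y, min(ys), max(ys)))
        cells[key][g] += 1

    total = len(points)
    entropy = 0.0
    for counts in cells.values():
        size = sum(counts.values())
        for count in counts.values():
            p = count / size                      # assumed meaning of p_ij(g)
            entropy -= (size / total) * p * math.log2(p)
    return entropy

points = [(0.0, 0.0), (0.1, 0.1), (1.0, 1.0), (0.9, 0.9)]
clustered = grid_entropy(points, ["o", "o", "x", "x"], n=2)
mixed = grid_entropy(points, ["o", "x", "o", "x"], n=2)
```

With this definition, plots whose groups fall into separate cells give a lower value than plots whose groups are mixed within cells, which matches the qualitative reading of FIGS. 9 and 10.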
[0091]
The comparison result shown in Example 1 is an experimental example when one search is performed. As described above, in the document search according to the present invention, the rate of searching for unnecessary documents can be reduced as compared with the conventional TF * IDF even without performing a narrowing search. Then, when the refined search is repeated, the search accuracy can be further improved. That is, when the refined search is performed, it is predicted that the graph of the analysis result based on the degree of contribution in FIG. 11 shifts further downward (the amount of information decreases).
[0092]
[Example 2]
Next, as Example 2, the result of an experiment will be described that compares a similarity search using important words selected based on the word contribution described in the second embodiment with a similarity search using important words selected by the conventional chi-square value.
[0093]
In this example, 20 patents filed by KDD during the same period were used as the set of reference documents, and a search company was commissioned to search for patent publications similar to those patents. In this search, the average number of patent publications determined to be similar to one reference document was 10.75. Here, the 20 patents are referred to as the reference document set D_i = {i_1, i_2, ..., i_20}, and the correct answer document set for the reference document i_n is denoted C_n. In addition, 10000 items of the publication data searched are used as the search target document set D_t. This search target document set D_t contains all the publication data included in the correct answer document sets C_n.
[0094]
Next, N important words were selected by the important word selection method using the chi-square value and by the important word selection method using the word contribution, respectively. Then, based on these N words, a similarity search over the search target document set D_t was performed by the method of Iwayama et al. As a comparative example, a similarity search based on all words appearing in the documents, without selecting important words, was also performed.
[0095]
Table 2 shows the average precision of the search results for the chi-square value (χ²) and the word contribution (Cont) when the number N of important words selected from the top is 100, 200, 300, 400, and 500, together with the average precision of the search result based on all words (All).
[Table 2]

As is clear from the results in Table 2, when the words obtained by the important word selection method using the word contribution are used, the search accuracy is higher than both the result using the chi-square value (χ²) and the search result based on all words (All).
[0096]
In addition, Recall / Precision curves were measured for the settings with the best results in Table 2, namely the word contribution with N = 400 and the chi-square value with N = 500, and for the search based on all words. The results are shown in FIG. 12. In FIG. 12, Recall on the horizontal axis indicates the proportion of the actually similar documents that were retrieved, and Precision on the vertical axis indicates the proportion of the retrieved documents that are actually similar.
[0097]
In this Recall / Precision curve, the closer the characteristic curve is to the upper right, the higher the search accuracy. As shown in FIG. 12, the search result using the words obtained by the important word selection method based on the word contribution is more accurate than both the result using the chi-square value (χ²) and the search result based on all words (All).
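For reference, the points of a Recall / Precision curve and an average precision value can be computed from a ranked result list as sketched below. The sketch uses the standard non-interpolated definitions. Whether the experiment used exactly this variant of average precision is an assumption, and the function names and document identifiers are hypothetical.

```python
def precision_recall_points(ranked, relevant):
    """Return (recall, precision) pairs at each rank where a relevant document appears."""
    relevant = set(relevant)
    hits = 0
    points = []
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return points

def average_precision(ranked, relevant):
    """Non-interpolated average precision: mean of the precision values at the
    ranks of the relevant documents, divided by the number of relevant documents."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    points = precision_recall_points(ranked, relevant)
    return sum(precision for _, precision in points) / len(relevant)

ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
points = precision_recall_points(ranked, relevant)
ap = average_precision(ranked, relevant)
```

A ranking that places the relevant documents nearer the top yields points closer to the upper right of the plane and a higher average precision.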
[0098]
From the above results, the search accuracy decreased with the important word selection method using the chi-square value, whereas it improved with the important word selection method using the word contribution, which shows the effectiveness of the present invention.
[0099]
[Example 3]
Next, as Example 3, the result of an experiment will be described that compares a similarity search using a search formula expanded based on the word contribution with a similarity search using a search formula expanded by the conventional Rocchio method.
[0100]
In this example, the unified experimental conditions defined by the Text Retrieval Conference (TREC) were followed: the 528150 text documents used in TREC served as the search target documents, the 50 reference documents prepared by TREC were used, and the evaluation was based on the correct document set for each reference document.
[0101]
First, a search (initial search) was performed over all search target documents for each reference document, and the 1000 documents with the highest similarity were extracted. Using Num documents, among those extracted, that are actually similar to each reference document (the correct document set), the search expression of each reference document was expanded by the search expression expansion method based on the word contribution (the third embodiment). The final search result was obtained by searching all search target documents with the expanded search formula.
[0102]
As an expression for calculating the similarity, an expression for obtaining a cosine value of an angle between a vector representing a reference document and a vector representing a search target document as shown below was used.
[0103]
(Equation 7)

In this formula, D → indicates a search formula of the reference document, that is, a vector representing the reference document (set).
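Since the similarity is described as the cosine of the angle between the two vectors, it can be written in the standard form below. This is a reconstruction from that description, not a copy of Equation 7. The symbol d→ for the vector of a search target document and the component names D_j and d_j are assumptions.

```latex
\mathrm{Sim}(\vec{D}, \vec{d}) = \cos\theta
  = \frac{\vec{D} \cdot \vec{d}}{\lVert \vec{D} \rVert \, \lVert \vec{d} \rVert}
  = \frac{\sum_{j} D_j \, d_j}{\sqrt{\sum_{j} D_j^{2}} \; \sqrt{\sum_{j} d_j^{2}}}
```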
[0104]
The correct document set used for the search expression expansion based on the word contribution was used as it is as the correct document set input to the Rocchio method. As the set of dissimilar documents, the documents ranked in the lower 500 by similarity among the 1000 documents extracted by the initial search were used. However, if a similar document was included in these lower 500 documents, it was removed from the dissimilar document set. The coefficients α, β, and γ used in the Rocchio method were the values used in the SMART system presented at TREC-7, that is, α = 3, β = 2, and γ = 2. Likewise, as in the TREC-7 presentation, for each reference document the 20 words with the highest weight in the Rocchio method were added to the original search formula.
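The Rocchio baseline can be sketched as follows, using the standard formulation in which the new weights are α times the original query plus β times the mean of the similar documents minus γ times the mean of the dissimilar documents. This is a minimal sketch and not the SMART implementation. It assumes sparse term-weight dictionaries, and it assumes that the 20 added words are the highest-weighted words not already in the query and that only positively weighted words are added. The function name `rocchio_expand` is hypothetical.

```python
from collections import defaultdict

def rocchio_expand(query, similar, dissimilar, alpha=3.0, beta=2.0, gamma=2.0, top_k=20):
    """Standard Rocchio reweighting followed by adding the top_k new words."""
    weights = defaultdict(float)
    for term, w in query.items():
        weights[term] += alpha * w
    for doc in similar:
        for term, w in doc.items():
            weights[term] += beta * w / len(similar)
    for doc in dissimilar:
        for term, w in doc.items():
            weights[term] -= gamma * w / len(dissimilar)

    # Candidate expansion words: not in the original query, ranked by weight (assumption).
    new_terms = sorted(
        (t for t in weights if t not in query),
        key=lambda t: weights[t],
        reverse=True,
    )[:top_k]

    expanded = {t: weights[t] for t in query}
    expanded.update({t: weights[t] for t in new_terms if weights[t] > 0})
    return expanded

expanded = rocchio_expand(
    {"a": 1.0},
    similar=[{"a": 1.0, "b": 2.0}],
    dissimilar=[{"c": 1.0}],
    top_k=1,
)
```

In this toy call, the word "b" from the similar document is added with a positive weight, while "c", which appears only in the dissimilar document, receives a negative weight and is not added.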
[0105]
Table 3 shows the average precision for the search expression expansion based on the word contribution (WC10, WC20) with Num = 10, 20 and wgt = −100, −400, −800, −1200, for the search expression expansion by the Rocchio method (Rocchio10, Rocchio20) with Num = 10, 20, and for the initial search (Baseline).
[0106]
[Table 3]

As is evident from the results in Table 3, the search expression expansion method based on the word contribution improved the search accuracy by 157% to 197% or more over the initial search. The search accuracy also improved as the absolute value of the weight wgt increased, that is, as the influence of the newly added words increased, and it was higher than that of the Rocchio method. This shows that word extraction and weighting by the search expression expansion method based on the word contribution are effective.
[0107]
FIG. 13 shows the Recall / Precision curve of each method when Num = 20. The vertical and horizontal axes of the graph in FIG. 13 and the way the characteristic curves are evaluated are the same as in FIG. 12.
[0108]
As is clear from the Recall / Precision curve of FIG. 13, it has been confirmed that the method of expanding the search formula based on the word contribution exceeds the accuracy rate of the Rocchio method at any Recall value.
[0109]
From the above results, in the search expression expansion method based on the word contribution, the search accuracy was improved as compared with the Rocchio method under any conditions, and the effectiveness of the present invention was shown.
[0110]
【Effect of the Invention】
As described above, according to the inventions of claims 1 to 6, the magnitude of the influence that a word appearing in the reference document (set) and the search target document group has on the similarity between documents is defined as the word contribution, and vector data having the word contributions as elements is used as the feature amount of the search target document. The rate at which unnecessary documents are retrieved can therefore be reduced compared with a method that uses, as the feature amount, vector data whose elements are word appearance frequencies. In addition, since a feature amount that depends on the reference document (set) is used, the search accuracy can be further improved when the refined search is repeated.
[0111]
In the inventions of claims 7 to 14, important words are selected based on the word contribution, and the similarity search is performed based on the selected important words. Therefore, even when there is an extreme difference, the rate at which unnecessary documents are retrieved can be reduced and the search accuracy can be improved compared with, for example, an important word selection method based on the chi-square value.
[0112]
Furthermore, in the inventions of claims 15 to 18, the search formula is expanded using the word contribution, and the similarity search is performed based on the expanded search formula. Therefore, words that are effective in calculating the similarity can be extracted, and the search accuracy can be improved compared with the conventional search expression expansion method.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a functional configuration of a contribution degree calculating device shown in FIG. 2.
FIG. 2 is a block diagram showing a functional configuration of the document search device according to the first embodiment.
FIG. 3 is a flowchart showing a processing procedure when calculating a contribution degree of a word wj and a feature vector of a document di.
FIG. 4 is a block diagram showing a functional configuration of a document search device according to a second embodiment.
FIG. 5 is a flowchart showing a processing procedure when an important word for a reference document set D is selected.
FIG. 6 is a flowchart showing a processing procedure when a document similar to a reference document set is searched from a search target document group based on a selected important word.
FIG. 7 is a block diagram showing a functional configuration of a document search device according to a third embodiment.
FIG. 8 is a flowchart showing a processing procedure when a search expression is extended using a word contribution.
FIG. 9 is an explanatory diagram showing a result of a factor analysis based on a degree of contribution.
FIG. 10 is an explanatory diagram showing a result of factor analysis by TF * IDF.
FIG. 11 is an explanatory diagram showing the information amounts of the analysis results by the contribution degree and by TF * IDF.
FIG. 12 is an explanatory diagram showing a Recall / Precision curve when a search is performed using important words and all words obtained by each method.
FIG. 13 is an explanatory diagram showing a Recall / Precision curve of each method when Num = 20.
FIG. 14 is a flowchart showing a processing procedure of a conventional refined search.
[Explanation of symbols]
10, 20, 30 document search device
11, 21, 31 input unit
12 Feature extraction unit
13, 23, 33 Search unit
14, 24, 34 storage unit
15, 25, 35 output section
22 Key word selection section
32 Search expression extension
120 Contribution calculator
121 Appearance word extraction means
122 word removal means
123 Similarity calculation means
124 Contribution calculating means
221, 321 contribution degree calculating means
222 Key word selection means
322 Search formula expansion means

Claims

1. A document search method for searching a search target document group for documents similar to a specified reference document, using feature amounts of the search target documents constituting the search target document group, the method comprising:
calculating a word contribution, defined as the change from the similarity calculated when a word appearing in both a document in the search target document group and the reference document is included as it is to the similarity calculated when the word is removed from both; and using vector data having the word contributions as elements as the feature amount of the search target document.

2. The document search method according to claim 1, wherein the reference document is a document searched by the document search method according to claim 1.

3. The document search method according to claim 1, wherein the reference document is a set of reference documents.

4. A document search device that searches a search target document group for documents similar to a specified reference document, using feature amounts of the search target documents constituting the search target document group, the device comprising:
appearance word extraction means for extracting words that appear in the reference document and the search target document group;
word removal means for generating documents in which the words extracted by the appearance word extraction means are removed from the reference document and the search target document group;
similarity calculation means for calculating the similarity between the reference document and a document included in the search target document group, and the similarity between the reference document and the document of the search target document group as generated by the word removal means; and
contribution calculation means for calculating, based on the two similarities calculated by the similarity calculation means, a word contribution defined as the change from the similarity calculated when a word appearing in both a document in the search target document group and the reference document is included as it is to the similarity calculated when the word is removed from both, and for generating vector data having the word contributions as elements,
wherein the vector data having the word contributions as elements is used as the feature amount of the search target document.

5. The document search device according to claim 4, wherein the reference document is a document searched by the document search device according to claim 4.

6. The document search device according to claim 4, wherein the reference document is a set of reference documents.

7. A document search method for searching a search target document group for documents similar to a specified reference document, using important words selected from the words appearing in the reference document and the search target document group, the method comprising:
calculating a word contribution, defined as the change from the similarity calculated when a word appearing in both the reference document and a correct answer document set extracted from the search target document group is included as it is to the similarity calculated when the word is removed from both; and selecting, from all words appearing in the reference document and the correct answer document set, words having a high word contribution as important words.

8. The document search method according to claim 7, wherein, after the word contributions are calculated for all words appearing between the reference document and each document of the correct answer document set, the sum of the word contributions is calculated for each word, and the words ranked highest by that sum are taken as the important words for the reference document.

9. The document search method according to claim 7, wherein the reference document is a set of reference documents.

10. The document search method according to claim 7, wherein the correct document set is a set of documents similar to the reference document.

11. A document search device that searches a search target document group for documents similar to a specified reference document, using important words selected from the words appearing in the reference document and the search target document group, the device comprising:
appearance word extraction means for extracting words that appear in the reference document and in a correct answer document set extracted from the search target document group;
word removal means for generating documents in which the words extracted by the appearance word extraction means are removed from the reference document and the correct answer document set;
similarity calculation means for calculating the similarity between the reference document and each document included in the correct answer document set, and the similarity between the reference document and each document of the correct answer document set as generated by the word removal means;
contribution calculation means for calculating, based on the two similarities calculated by the similarity calculation means, a word contribution defined as the change from the similarity calculated when a word appearing in both the reference document and the correct answer document set is included as it is to the similarity calculated when the word is removed from both; and
important word selection means for selecting, from all words appearing between the reference document and each document of the correct answer document set, words having a high word contribution calculated by the contribution calculation means as important words,
wherein documents similar to the reference document are searched for in the search target document group based on the selected important words.

12. The document search device according to claim 11, wherein the contribution calculation means calculates the word contributions for all words appearing between the reference document and each document of the correct answer document set, and the important word selection means obtains, for each word, the sum of the word contributions calculated by the contribution calculation means and takes the words ranked highest by that sum as the important words for the reference document.

13. The document search device according to claim 11, wherein the reference document is a set of reference documents.

14. The document search device according to claim 11, wherein the correct answer document set is a set of documents similar to the reference document.

15. A document search method for searching a search target document group for documents similar to a specified reference document, using a similarity calculation formula that includes, as a search formula, vectors representing the reference document and each document of the search target document group, the method comprising:
calculating a word contribution, defined as the change from the similarity calculated when a word appearing in both the reference document and a correct answer document set extracted from the search target document group is included as it is to the similarity calculated when the word is removed from both; extracting words having a low word contribution from each document included in the correct answer document set; obtaining, for those extracted words that are not included in the reference document, the sum of the word contributions for each word; and weighting the result and adding it to the search formula of the reference document.

16. The document search method according to claim 15, wherein the reference document is a set of reference documents.

17. A document search device that searches a search target document group for documents similar to a specified reference document, using a similarity calculation formula that includes, as a search formula, vectors representing the reference document and each document of the search target document group, the device comprising:
appearance word extraction means for extracting words that appear in the reference document and in a correct answer document set extracted from the search target document group;
word removal means for generating documents in which the words extracted by the appearance word extraction means are removed from the reference document and the correct answer document set;
similarity calculation means for calculating the similarity between the reference document and each document included in the correct answer document set, and the similarity between the reference document and each document of the correct answer document set as generated by the word removal means;
contribution calculation means for calculating, based on the two similarities calculated by the similarity calculation means, a word contribution defined as the change from the similarity calculated when a word appearing in both the reference document and the correct answer document set is included as it is to the similarity calculated when the word is removed from both; and
search formula expansion means for extracting, from all words appearing in the reference document and each document included in the correct answer document set, words having a low word contribution calculated by the contribution calculation means, obtaining, for those extracted words that are not included in the reference document, the sum of the word contributions for each word, and weighting the result and adding it to the search formula of the reference document.

18. The document search device according to claim 17, wherein the reference document is a set of reference documents.